## [ Combining and Merging Datasets ]
Data contained in pandas objects can be combined in a number of ways:

#### Overview of methods

| Method            | Use Case                                     | Notes                         |
|-------------------|----------------------------------------------|-------------------------------|
| `concat()`        | Stack DataFrames (vertically/horizontally)   | Union of data                 |
| `merge()`         | Join on common columns                       | Inner/Outer/Left/Right joins  |
| `join()`          | Join based on index                          | Simpler syntax than `merge()` |
| `combine_first()` | Fill missing values from another DataFrame   | Null-filler                   |
| `update()`        | Overwrite values with another DataFrame      | In-place                      |


#### When to use what?

| Situation                                        | Use this          |
|-------------------------------------------------|-------------------|
| Stacking datasets with same columns             | `concat()`        |
| Combining tables using key columns (like SQL)   | `merge()`         |
| Joining on index (instead of column)            | `join()`          |
| Filling missing values from another dataset     | `combine_first()` |
| Updating values in-place                        | `update()`        |
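
As a rough illustration of the table above, the following sketch runs four of the methods on two tiny frames. The frames `a` and `b` and their values are made up for this example and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical example frames sharing the same index and column name
a = pd.DataFrame({"x": [1.0, np.nan, 3.0]}, index=["r0", "r1", "r2"])
b = pd.DataFrame({"x": [10.0, 20.0, np.nan]}, index=["r0", "r1", "r2"])

# concat: stack the rows of both frames (3 + 3 rows)
stacked = pd.concat([a, b])

# combine_first: keep a's values and fill a's missing entries from b
filled = a.combine_first(b)

# update: overwrite a copy of a in place with b's non-missing values
updated = a.copy()
updated.update(b)

# join: align on the index; suffixes disambiguate the shared column name
joined = a.join(b, lsuffix="_a", rsuffix="_b")

print(stacked)
print(filled)
print(updated)
print(joined)
```

Note that `combine_first()` only fills holes in `a`, while `update()` replaces every value for which `b` has data.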


In [2]:
import numpy as np 
import pandas as pd 

## [ Database-Style DataFrame Joins ]
- `merge` and `join` operations combine datasets by linking rows using one or more `keys`.
- These operations are central to relational databases (e.g., SQL-based ones).
- The `pandas.merge` function is the main entry point for applying these algorithms to our data.

In [3]:
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": pd.Series(range(7), dtype="Int64")})
df2 = pd.DataFrame({"key": ["a", "b", "d"],
                    "data2": pd.Series(range(3), dtype="Int64")})
print(df1)
print(df2)

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
  key  data2
0   a      0
1   b      1
2   d      2


The merge below is an example of a many-to-one join.

In [4]:
# many-to-one: df1 has several rows labeled "a" and "b",
# whereas df2 has exactly one row for each value in the key column

# calling pd.merge() with these objects
pd.merge(df1, df2)

# for each matching key, merge pairs every row from df1 with every matching row from df2
# (a Cartesian product of the matching rows)

# if the join column is not specified, pd.merge uses the overlapping column names as the keys

  key  data1  data2
0   b      0      1
1   b      1      1
2   a      2      0
3   a      4      0
4   a      5      0
5   b      6      1


In [5]:
# specifying the join key explicitly; sort takes a boolean, and sort=True orders the result by key
pd.merge(df1, df2, on="key", sort=True)
# means: join both DataFrames where df1.key == df2.key and keep only the matches

# joins on common values of key
# each matching key does a Cartesian product of the matching rows
    # here: 'a' matches → 3 rows (df1) × 1 row (df2) = 3 rows
    #       'b' matches → 3 rows (df1) × 1 row (df2) = 3 rows
    # total: 6 rows

# non-matching keys are dropped by default

  key  data1  data2
0   a      2      0
1   a      4      0
2   a      5      0
3   b      0      1
4   b      1      1
5   b      6      1
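
The row count worked out in the comments above can be checked programmatically. This is a minimal sketch that multiplies the per-key row counts of both frames; the helper names `left_counts`, `right_counts`, and `expected_rows` are introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": pd.Series(range(7), dtype="Int64")})
df2 = pd.DataFrame({"key": ["a", "b", "d"],
                    "data2": pd.Series(range(3), dtype="Int64")})

# Number of rows per key on each side
left_counts = df1["key"].value_counts()
right_counts = df2["key"].value_counts()

# Multiplication aligns on the key; keys present on only one side
# become NaN and are dropped, mirroring the inner join
expected_rows = int(left_counts.mul(right_counts).dropna().sum())

merged = pd.merge(df1, df2, on="key")
print(expected_rows, len(merged))
```

Both numbers should agree: 3 × 1 rows for "a" plus 3 × 1 rows for "b".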


In [6]:
# merge two DataFrames with different column names for the keys
df3 = pd.DataFrame({
    "lkey": ["b", "b", "a", "c", "a", "a", "b"],
    "data1": pd.Series(range(7), dtype="Int64")})

df4 = pd.DataFrame({
    "rkey": ["a", "b", "d"],
    "data2": pd.Series(range(3), dtype="Int64")})

print(df3)
print(df4)

  lkey  data1
0    b      0
1    b      1
2    a      2
3    c      3
4    a      4
5    a      5
6    b      6
  rkey  data2
0    a      0
1    b      1
2    d      2


In [7]:
# merge operation
pd.merge(df3, df4, left_on="lkey", right_on="rkey", sort=True)  # sort expects a boolean

# means:
#     use df3["lkey"] as the join key from the left DataFrame.
#     use df4["rkey"] as the join key from the right DataFrame.

# it is equivalent to saying: join rows in df3 and df4 where the value in lkey(from df3) equals the value in rkey(from df4)

# Only keys a and b match → rows with lkey = a or b in df3 will be included.
# For each match, pandas does a Cartesian product of matching rows.


# Summary:
    # left_on and right_on are used when column names differ in each DataFrame.
    # Matching is done just like before (inner join by default).
    # Useful for merging datasets with different naming conventions.

  lkey  data1 rkey  data2
0    a      2    a      0
1    a      4    a      0
2    a      5    a      0
3    b      0    b      1
4    b      1    b      1
5    b      6    b      1
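
With `left_on` and `right_on`, the result keeps both `lkey` and `rkey`, even though they hold the same values for matched rows. One possible way to end up with a single key column, sketched here as an alternative rather than something the notebook does, is to rename one key before merging.

```python
import pandas as pd

df3 = pd.DataFrame({"lkey": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": pd.Series(range(7), dtype="Int64")})
df4 = pd.DataFrame({"rkey": ["a", "b", "d"],
                    "data2": pd.Series(range(3), dtype="Int64")})

# Differently named keys: both key columns appear in the result
two_keys = pd.merge(df3, df4, left_on="lkey", right_on="rkey")

# Renaming the right key first lets us use on= and keeps one key column
one_key = pd.merge(df3, df4.rename(columns={"rkey": "lkey"}), on="lkey")

print(list(two_keys.columns))
print(list(one_key.columns))
```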


- `pandas.merge` supports four basic join types, selected with the `how` argument.
- Each join type controls which keys (values in the joining columns) appear in the result.

| Join Type | Keeps Keys From | Fills with NaN |
|-----------|-----------------|----------------|
| inner     | both            | no             |
| left      | left            | right columns  |
| right     | right           | left columns   |
| outer     | all             | both sides     |
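
To make the table concrete, this small sketch merges the same two frames with each value of `how` and collects the keys that survive. The dictionary name `keys_by_how` is an illustrative choice.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": pd.Series(range(7), dtype="Int64")})
df2 = pd.DataFrame({"key": ["a", "b", "d"],
                    "data2": pd.Series(range(3), dtype="Int64")})

# For each join type, record which key values appear in the result
keys_by_how = {
    how: sorted(pd.merge(df1, df2, on="key", how=how)["key"].unique())
    for how in ["inner", "left", "right", "outer"]
}

for how, keys in keys_by_how.items():
    print(how, keys)
```

Key "c" exists only in `df1` and "d" only in `df2`, so they appear only in the join types that keep keys from that side.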


In [8]:
pd.merge(df1, df2, how="outer")

  key  data1  data2
0   a      2      0
1   a      4      0
2   a      5      0
3   b      0      1
4   b      1      1
5   b      6      1
6   c      3   <NA>
7   d   <NA>      2


In [9]:
pd.merge(df3, df4, left_on="lkey", right_on="rkey", how="outer")

  lkey  data1 rkey  data2
0    a      2    a      0
1    a      4    a      0
2    a      5    a      0
3    b      0    b      1
4    b      1    b      1
5    b      6    b      1
6    c      3  NaN   <NA>
7  NaN   <NA>    d      2


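Many-to-many merges form the Cartesian product of the rows that share each matching key: if a key appears m times on the left and n times on the right, the result has m × n rows for that key.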

In [10]:
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                    "data1": pd.Series(range(6), dtype="Int64")})
df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                    "data2": pd.Series(range(5), dtype="Int64")})
print(df1)
print(df2)

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
  key  data2
0   a      0
1   b      1
2   a      2
3   b      3
4   d      4


In [11]:
pd.merge(df1, df2, on="key", how="left", sort=True)  # sort expects a boolean

   key  data1  data2
0    a      2      0
1    a      2      2
2    a      4      0
3    a      4      2
4    b      0      1
5    b      0      3
6    b      1      1
7    b      1      3
8    b      5      1
9    b      5      3
10   c      3   <NA>


In [12]:
pd.merge(df1, df2, how="inner")

  key  data1  data2
0   b      0      1
1   b      0      3
2   b      1      1
3   b      1      3
4   a      2      0
5   a      2      2
6   a      4      0
7   a      4      2
8   b      5      1
9   b      5      3
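
Because many-to-many merges can multiply rows unexpectedly, it may help to state the relationship you expect. The sketch below uses the `validate` argument of `pd.merge` on the frames from this example; the flag name `raised` is introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                    "data1": pd.Series(range(6), dtype="Int64")})
df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                    "data2": pd.Series(range(5), dtype="Int64")})

# Inner merge: "b" contributes 3 x 2 rows and "a" contributes 2 x 2 rows
many_to_many = pd.merge(df1, df2, on="key")
print(len(many_to_many))

# Asking pandas to verify a many-to-one relationship fails here,
# because the keys in df2 are not unique
try:
    pd.merge(df1, df2, on="key", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```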


To merge with multiple keys, pass a list of column names.

In [13]:
left = pd.DataFrame({"key1": ["foo", "foo", "bar"],
                     "key2": ["one", "two", "one"],
                     "lval": pd.Series([1, 2, 3], dtype='Int64')})

right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                      "key2": ["one", "one", "one", "two"],
                      "rval": pd.Series([4, 5, 6, 7], dtype='Int64')})

print(left)
print(right)

  key1 key2  lval
0  foo  one     1
1  foo  two     2
2  bar  one     3
  key1 key2  rval
0  foo  one     4
1  foo  one     5
2  bar  one     6
3  bar  two     7


In [14]:
pd.merge(left, right, on=["key1", "key2"], how="outer")

# An outer join means:
#     Give me all the combinations of key1 + key2 from both tables, even if there is no match
# So it includes:
#     All rows from left
#     All rows from right
#     If a match is found on both key1 and key2, join the data
#     If not found, fill missing part with NaN

# All unique (key1, key2) combinations across both:
    # key1        key2
    # foo         one
    # foo         two
    # bar         one
    # bar         two

# look at the matches and build final table

# example: match for foo + one
#   from left: lval = 1
#   from right: two matches : rval = 4 and 5
#   this gives two rows
    # key1    key2    lval    rval
    # foo     one     1       4
    # foo     one     1       5

  key1 key2  lval  rval
0  bar  one     3     6
1  bar  two  <NA>     7
2  foo  one     1     4
3  foo  one     1     5
4  foo  two     2  <NA>


In [15]:
# treatment of overlapping column names
pd.merge(left, right, on="key1")

# we're merging only on key1
# key2 exists in both DataFrames but is not a join key, so its name overlaps
# pandas therefore keeps both key2 columns, one from each DataFrame,
# and adds suffixes (_x for left, _y for right) to avoid name clashes

  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7


In [16]:
# pandas.merge has a suffixes option for specifying strings to append to overlapping names in the left and right DataFrame objects
pd.merge(left, right, on="key1", suffixes=("_left", "_right"))

  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7



#### Basic Arguments:

| Argument         | Description                                                                 |
|------------------|-----------------------------------------------------------------------------|
| `left`           | First DataFrame to merge                                                    |
| `right`          | Second DataFrame to merge                                                   |
| `how`            | Type of merge: `'inner'` (default), `'outer'`, `'left'`, `'right'`          |
| `on`             | Column name(s) to join on (must exist in both DataFrames)                   |
| `left_on`        | Column(s) from `left` DataFrame to join on                                  |
| `right_on`       | Column(s) from `right` DataFrame to join on                                 |
| `left_index`     | Use index from `left` as join key                                            |
| `right_index`    | Use index from `right` as join key                                           |


#### Output Control:

| Argument         | Description                                                                 |
|------------------|-----------------------------------------------------------------------------|
| `sort`           | Sort the result DataFrame by join keys (`True` or `False`, default `False`) |
| `suffixes`       | Tuple of string suffixes to apply to overlapping columns (default: `('_x', '_y')`) |
| `copy`           | If `False`, avoid copying data unnecessarily (default is `True`)            |
| `indicator`      | If `True`, adds a `_merge` column showing the source of each row (`left_only`, `right_only`, `both`) |
| `validate`       | Check if the merge is of a specific type (e.g., `'one_to_one'`, `'1:m'`)    |
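
Two of the arguments above are not exercised elsewhere in this section, so here is a brief sketch of `indicator` and `right_index`, reusing the first `df1` and `df2` from this section. The variable names `flagged`, `counts`, `right_indexed`, and `by_index` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": pd.Series(range(7), dtype="Int64")})
df2 = pd.DataFrame({"key": ["a", "b", "d"],
                    "data2": pd.Series(range(3), dtype="Int64")})

# indicator=True adds a "_merge" column telling where each row came from
flagged = pd.merge(df1, df2, on="key", how="outer", indicator=True)
counts = flagged["_merge"].value_counts()
print(counts)

# right_index=True joins a column of the left frame against the right index
right_indexed = df2.set_index("key")
by_index = pd.merge(df1, right_indexed, left_on="key", right_index=True)
print(by_index)
```

The `_merge` column is convenient for auditing a join, for example to list the keys that failed to match on either side.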
