# Concatenating, Merging, and Comparing


## Database-style DataFrame joining (`join()` and `merge()`)
pandas provides full-featured, high-performance in-memory join operations that are idiomatically very similar to those of relational databases such as SQL.

## `merge()`

Accepts a left and a right object, each either a `DataFrame` or a named `Series`.

You can call it as a function or as a method:

`pd.merge(df1, df2)` or `df1.merge(df2)`

The related `join()` method uses `merge()` internally for index-on-index joins (the default) and column(s)-on-index joins.
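To make the relationship between `join()` and `merge()` concrete, here is a minimal sketch. The two small frames are made up for illustration. It assumes the default `how="left"` of `join()` and spells out the same operation with `merge()`.

```python
import pandas as pd

# Illustrative frames that share some index labels (K0 and K2).
left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"C": ["C0", "C2", "C3"]}, index=["K0", "K2", "K3"])

# join() aligns on the index and keeps every row of the caller (how="left").
joined = left.join(right)

# The same operation written explicitly with merge().
merged = pd.merge(left, right, left_index=True, right_index=True, how="left")

print(joined)
print(joined.equals(merged))
```

K1 has no match in `right`, so its `C` value is NaN in both results.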

In [None]:
left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
left

Unnamed: 0,key1,key2,A,B
0,K0,K0,A0,B0
1,K0,K1,A1,B1
2,K1,K0,A2,B2
3,K2,K1,A3,B3


In [None]:
right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
right

Unnamed: 0,key1,key2,C,D
0,K0,K0,C0,D0
1,K1,K0,C1,D1
2,K1,K0,C2,D2
3,K2,K0,C3,D3


In [None]:
result = pd.merge(left, right, on=["key1", "key2"])
result

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
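Several cells in these notes call `display_side_by_side`, a helper that is not defined anywhere in them. The sketch below is one plausible definition, not the original one. It assumes a Jupyter environment where `IPython.display` is available and falls back to plain printing otherwise. The name `side_by_side_html` is hypothetical.

```python
import pandas as pd


def side_by_side_html(*frames):
    """Return one HTML string with each frame rendered as an inline block."""
    style = "display:inline-block; vertical-align:top; margin-right:2em"
    return "".join(
        f'<div style="{style}">{frame.to_html()}</div>' for frame in frames
    )


def display_side_by_side(*frames):
    """Show the frames next to each other in a notebook, or print them."""
    try:
        from IPython.display import HTML, display
    except ImportError:
        # Outside IPython: print the frames one after another.
        for frame in frames:
            print(frame.to_string(), end="\n\n")
    else:
        display(HTML(side_by_side_html(*frames)))
```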


In [None]:
result = pd.merge(left, right, on=["key1", "key2"], how="left")
display_side_by_side(left, result)

Unnamed: 0,key1,key2,A,B
0,K0,K0,A0,B0
1,K0,K1,A1,B1
2,K1,K0,A2,B2
3,K2,K1,A3,B3

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,


In [None]:
result = pd.merge(left, right, on=["key1", "key2"], how="right")
display_side_by_side(right, result)

Unnamed: 0,key1,key2,C,D
0,K0,K0,C0,D0
1,K1,K0,C1,D1
2,K1,K0,C2,D2
3,K2,K0,C3,D3

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
3,K2,K0,,,C3,D3


In [None]:
result = pd.merge(left, right, on=["key1", "key2"], how="outer")
display_side_by_side(left, right, result)

Unnamed: 0,key1,key2,A,B
0,K0,K0,A0,B0
1,K0,K1,A1,B1
2,K1,K0,A2,B2
3,K2,K1,A3,B3

Unnamed: 0,key1,key2,C,D
0,K0,K0,C0,D0
1,K1,K0,C1,D1
2,K1,K0,C2,D2
3,K2,K0,C3,D3

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
5,K2,K0,,,C3,D3
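To check which side each row of an outer merge came from, `merge()` accepts `indicator=True`, which adds a `_merge` column. A minimal sketch using the same `left` and `right` frames as above:

```python
import pandas as pd

left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

# _merge is "both", "left_only" or "right_only" for every output row.
result = pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)
counts = result["_merge"].value_counts().to_dict()
print(counts)
```

The rows marked `both` are exactly the inner merge. Adding the `left_only` rows gives the left merge, and adding the `right_only` rows gives the right merge.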


In [None]:
result = pd.merge(left, right, how="cross")
display_side_by_side(left, right, result)

Unnamed: 0,key1,key2,A,B
0,K0,K0,A0,B0
1,K0,K1,A1,B1
2,K1,K0,A2,B2
3,K2,K1,A3,B3

Unnamed: 0,key1,key2,C,D
0,K0,K0,C0,D0
1,K1,K0,C1,D1
2,K1,K0,C2,D2
3,K2,K0,C3,D3

Unnamed: 0,key1_x,key2_x,A,B,key1_y,key2_y,C,D
0,K0,K0,A0,B0,K0,K0,C0,D0
1,K0,K0,A0,B0,K1,K0,C1,D1
2,K0,K0,A0,B0,K1,K0,C2,D2
3,K0,K0,A0,B0,K2,K0,C3,D3
4,K0,K1,A1,B1,K0,K0,C0,D0
5,K0,K1,A1,B1,K1,K0,C1,D1
6,K0,K1,A1,B1,K1,K0,C2,D2
7,K0,K1,A1,B1,K2,K0,C3,D3
8,K1,K0,A2,B2,K0,K0,C0,D0
9,K1,K0,A2,B2,K1,K0,C1,D1


In [None]:
ser = pd.Series(
    ["a", "b", "c", "d", "e", "f"],
    index=pd.MultiIndex.from_arrays(
        [["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
    ),
)
ser

Let  Num
A    1      a
B    2      b
C    3      c
A    4      d
B    5      e
C    6      f
dtype: object

In [None]:
ser.reset_index()

Unnamed: 0,Let,Num,0
0,A,1,a
1,B,2,b
2,C,3,c
3,A,4,d
4,B,5,e
5,C,6,f
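`reset_index()` matters here because it turns the index levels `Let` and `Num` into ordinary columns that `merge()` can use as keys. The sketch below assumes a small frame `df` and a Series name `"val"`, both chosen only for illustration:

```python
import pandas as pd

# Illustrative frame whose columns match the Series index level names.
df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})

ser = pd.Series(
    ["a", "b", "c", "d", "e", "f"],
    index=pd.MultiIndex.from_arrays(
        [["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
    ),
    name="val",  # assumed name; it becomes the column label after reset_index()
)

# Index levels become columns, so the merge can match on both keys.
result = pd.merge(df, ser.reset_index(), on=["Let", "Num"])
print(result)
```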


**Warning:** Joining or merging on duplicate keys can return a frame whose row count is the product of the matching rows on each side, which may exhaust memory. It is the user's responsibility to manage duplicate values in keys before joining large DataFrames.

In [None]:
left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})

right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

result = pd.merge(left, right, on="B", how="outer")

display_side_by_side(left, right, result)

# this exercise makes me doubt everything I understood before

Unnamed: 0,A,B
0,1,2
1,2,2

Unnamed: 0,A,B
0,4,2
1,5,2
2,6,2

Unnamed: 0,A_x,B,A_y
0,1,2,4
1,1,2,5
2,1,2,6
3,2,2,4
4,2,2,5
5,2,2,6


In [None]:

result = pd.merge(left, right, on="B", how="inner")

display_side_by_side(left, right, result)

Unnamed: 0,A,B
0,1,2
1,2,2

Unnamed: 0,A,B
0,4,2
1,5,2
2,6,2

Unnamed: 0,A_x,B,A_y
0,1,2,4
1,1,2,5
2,1,2,6
3,2,2,4
4,2,2,5
5,2,2,6
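Both results above have 2 × 3 = 6 rows because every left row with `B == 2` matches every right row with `B == 2`. One way to catch this before it happens on large data is the `validate` argument of `merge()`. A minimal sketch with the same two frames:

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

# Check for repeated key values before merging.
print(left["B"].duplicated().any(), right["B"].duplicated().any())

# Without validation the merge silently multiplies the rows.
result = pd.merge(left, right, on="B")
print(len(result))

# validate="one_to_one" raises instead of returning the blown-up frame.
try:
    pd.merge(left, right, on="B", validate="one_to_one")
    raised = False
except pd.errors.MergeError as err:
    raised = True
    print("merge rejected:", err)
```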


If a string matches both a column name and an index level name, then a warning is issued and the column takes precedence. This will result in an ambiguity error in a future version.

**Note**: the non-unique keys problem that I didn't understand at first comes
from overlapping cases such as this one:

![image.png](attachment:image.png)

If we do an 'inner' merge using `k` as the key, the values of `v` overlap,
because both frames have a column named `v` for the same key:

1. left has K0 1
2. right has K0 3 and K0 4

The overlap is resolved by adding two columns to the resulting merged df,
`v_x` and `v_y`: `v_x` holds the left values of column `v` and `v_y` holds the
right values. Because K0 appears twice on the right, the left row is repeated
once per match.
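The case described above can be reproduced with a short sketch. Only the K0 rows come from the note. The other rows are made up to fill out the frames, and the `suffixes` values in the second call are arbitrary.

```python
import pandas as pd

# K0 rows follow the note; the K1 and K3 rows are illustrative.
left = pd.DataFrame({"k": ["K0", "K1"], "v": [1, 2]})
right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [3, 4, 5]})

# Default suffixes: the shared column v becomes v_x (left) and v_y (right).
result = pd.merge(left, right, on="k", how="inner")
print(result)

# The suffixes can be chosen explicitly to make the origin clearer.
renamed = pd.merge(left, right, on="k", suffixes=("_left", "_right"))
print(renamed.columns.tolist())
```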

In [None]:
df1 = pd.DataFrame({"v": [1, 2, 3]}, index=["K0", "K1", "K2"])
df2 = pd.DataFrame({"v": [4, 5, 6]}, index=["K0", "K0", "K3"])
df3 = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])

result = df1.join([df2, df3])

display_side_by_side(df1, df2, df3, result)


Unnamed: 0,v
K0,1
K1,2
K2,3

Unnamed: 0,v
K0,4
K0,5
K3,6

Unnamed: 0,v
K1,7
K1,8
K2,9

Unnamed: 0,v_x,v_y,v
K0,1,4.0,
K0,1,5.0,
K1,2,,7.0
K1,2,,8.0
K2,3,,9.0


In [None]:
# The idea of combine_first is that you have two DataFrames that are similar,
# in the sense that they share index labels and columns (not necessarily all
# of them), and you want to "patch" the NaN values in the left one with the
# values found at the same positions in the right one.

df1 = pd.DataFrame(
    [[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]]
)


df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])

result = df1.combine_first(df2)

display_side_by_side(df1, df2, result)

Unnamed: 0,0,1,2
0,,3.0,5.0
1,-4.6,,
2,,7.0,

Unnamed: 0,0,1,2
1,-42.6,,-8.2
2,-5.0,1.6,4.0

Unnamed: 0,0,1,2
0,,3.0,5.0
1,-4.6,,-8.2
2,-5.0,7.0,4.0


In [None]:
# update() is a related method that modifies df1 in place: it overwrites
# every value in df1 (not only the NaNs) with the non-NaN values from df2
df1.update(df2)
df1

Unnamed: 0,0,1,2
0,,3.0,5.0
1,-42.6,,-8.2
2,-5.0,1.6,4.0
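The difference between the two methods is easier to see on a smaller pair of frames. This is a minimal sketch with made-up values: `combine_first()` returns a new frame and only fills gaps, while `update()` returns `None` and overwrites the caller wherever the other frame has a non-NaN value.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"x": [np.nan, 1.0], "y": [2.0, np.nan]})
df2 = pd.DataFrame({"x": [10.0, 20.0], "y": [np.nan, 40.0]})

# combine_first: keep df1's values, fill only its NaNs from df2.
patched = df1.combine_first(df2)
print(patched)

# update: modifies df1 in place and returns None.
ret = df1.update(df2)
print(ret)
print(df1)
```

In `patched` the existing value `x == 1.0` survives, whereas after `update()` it has been replaced by `20.0`.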


In [1]:
import pandas as pd
import numpy as np

In [2]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [3]:
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [5]:
df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)
df3

Unnamed: 0,A,B,C,D
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [6]:
frames = [df1, df2, df3]

result = pd.concat(frames)
result

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9


In [9]:
frames = [df1, df2]

result = pd.concat(frames, keys=["df1", "df2"])
result

Unnamed: 0,Unnamed: 1,A,B,C,D
df1,0,A0,B0,C0,D0
df1,1,A1,B1,C1,D1
df1,2,A2,B2,C2,D2
df1,3,A3,B3,C3,D3
df2,4,A4,B4,C4,D4
df2,5,A5,B5,C5,D5
df2,6,A6,B6,C6,D6
df2,7,A7,B7,C7,D7


In [10]:
result.loc["df1"]

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


**Note**: `concat()` **makes a full copy of the data**, so calling it repeatedly (for example inside a loop) can cause a significant performance hit. If you need to combine several datasets, collect them in a list, for instance with a list comprehension, and call `concat()` once.
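A minimal sketch of that pattern follows. `make_chunk` is a hypothetical stand-in for whatever produces each dataset, such as reading one file.

```python
import pandas as pd


def make_chunk(i):
    """Hypothetical loader: build one small frame per dataset."""
    return pd.DataFrame({"chunk": [i, i], "value": [i * 10, i * 10 + 1]})


# Build all the pieces first, then concatenate a single time.
frames = [make_chunk(i) for i in range(3)]
result = pd.concat(frames, ignore_index=True)
print(result)
```

The pattern to avoid is `result = pd.concat([result, chunk])` inside a loop, which copies the accumulated data on every iteration.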


In [None]:
# difference between join='inner' (intersection) and join='outer' (union, the default)

In [13]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [42]:
df2_updated = df2.copy()
df2_updated = df2_updated.reset_index(drop=True)
df2_updated = df2_updated.iloc[:,2:]
df2_updated

Unnamed: 0,C,D
0,C4,D4
1,C5,D5
2,C6,D6
3,C7,D7


In [47]:
pd.concat([df2, df2_updated], axis=1, join="inner")

Unnamed: 0,A,B,C,D,C.1,D.1


In [46]:
pd.concat([df2, df2_updated], axis=1, join="outer")

Unnamed: 0,A,B,C,D,C.1,D.1
4,A4,B4,C4,D4,,
5,A5,B5,C5,D5,,
6,A6,B6,C6,D6,,
7,A7,B7,C7,D7,,
0,,,,,C4,D4
1,,,,,C5,D5
2,,,,,C6,D6
3,,,,,C7,D7
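The empty `join="inner"` result and the NaN-padded `join="outer"` result above both come from index alignment. With `axis=1`, `concat()` matches rows by index label, and the labels 4 to 7 of `df2` never meet the labels 0 to 3 of `df2_updated`. A reduced sketch, with smaller frames than the ones above, showing that aligning the indexes first gives a non-empty inner result:

```python
import pandas as pd

# Reduced versions of df2 and df2_updated from the cells above.
df2 = pd.DataFrame({"A": ["A4", "A5"], "C": ["C4", "C5"]}, index=[4, 5])
df2_updated = df2[["C"]].reset_index(drop=True)  # index is now 0, 1

# No shared index labels: the intersection of the indexes is empty.
inner = pd.concat([df2, df2_updated], axis=1, join="inner")
print(inner.shape)

# Resetting df2's index as well makes the labels line up.
aligned = pd.concat(
    [df2.reset_index(drop=True), df2_updated], axis=1, join="inner"
)
print(aligned)
```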


In [45]:
pd.concat([df2, df2_updated], join="outer")

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
0,,,C4,D4
1,,,C5,D5
2,,,C6,D6
3,,,C7,D7


In [44]:
pd.concat([df2, df2_updated], join="inner")

Unnamed: 0,C,D
4,C4,D4
5,C5,D5
6,C6,D6
7,C7,D7
0,C4,D4
1,C5,D5
2,C6,D6
3,C7,D7
