## DAY9 Merging, Joining, and Concatenating

There are three main ways to combine DataFrames:

- Merging
- Joining
- Concatenating

Each of the three is discussed below.

____

In [2]:
import pandas as pd

In [3]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

In [4]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=['A', 'B', 'C', 'D']) 

In [5]:
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])

In [6]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [7]:
df2

Unnamed: 0,A,B,C,D
A,A4,B4,C4,D4
B,A5,B5,C5,D5
C,A6,B6,C6,D6
D,A7,B7,C7,D7


In [8]:
df3

Unnamed: 0,A,B,C,D
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


### 1. Concatenation

https://pandas.pydata.org/docs/reference/api/pandas.concat.html

```python
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
```

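Concatenation glues DataFrames together.

Note: `pd.concat` aligns on labels, so the frames should share the same labels along the axis you are not concatenating on. Labels that do not match are filled with NaN under the default `join='outer'`.

Use **pd.concat** and pass in a list of DataFrames to concatenate them.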
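As a minimal sketch of what "matching" means here, the frames `a` and `b` below are hypothetical and not part of this notebook. They share only column `B`: the default `join='outer'` keeps the union of columns and fills the gaps with NaN, while `join='inner'` keeps only the shared column.

```python
import pandas as pd

# Hypothetical frames that share only column 'B'
a = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
b = pd.DataFrame({'B': ['B2', 'B3'], 'C': ['C2', 'C3']})

# Union of columns; cells with no source value become NaN
outer = pd.concat([a, b])

# Intersection of columns only
inner = pd.concat([a, b], join='inner')

print(outer)
print(inner)
```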

In [10]:
pd.concat([df1,df2,df3])
# axis=0 is the default: the frames are stacked vertically (top to bottom)
# axis=1 places them side by side (left to right)

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
A,A4,B4,C4,D4
B,A5,B5,C5,D5
C,A6,B6,C6,D6
D,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [11]:
pd.concat([df1,df2,df3],axis=0,ignore_index = True)
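# ignore_index=True discards the original labels and renumbers the rows 0..n-1,
# much like calling reset_index(drop=True) on the result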

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11
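If the original labels matter and a duplicated label would be a bug, the `verify_integrity` parameter from the signature above is an alternative to `ignore_index`. The sketch below uses two hypothetical frames, `x` and `y`, whose indexes overlap on label 1.

```python
import pandas as pd

# Hypothetical frames; label 1 appears in both
x = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3']}, index=[1, 2])

# By default the overlap is allowed and the result has a duplicated label
stacked = pd.concat([x, y])
print(stacked.index.is_unique)

# With verify_integrity=True the overlap raises a ValueError instead
try:
    pd.concat([x, y], verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print('concat refused:', err)
```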


In [15]:
pd.concat([df1,df2,df3],axis=0, keys=['df1', 'df2','df3'])
# To keep track of which original DataFrame each row came from, pass the keys argument

Unnamed: 0,Unnamed: 1,A,B,C,D
df1,0,A0,B0,C0,D0
df1,1,A1,B1,C1,D1
df1,2,A2,B2,C2,D2
df1,3,A3,B3,C3,D3
df2,A,A4,B4,C4,D4
df2,B,A5,B5,C5,D5
df2,C,A6,B6,C6,D6
df2,D,A7,B7,C7,D7
df3,8,A8,B8,C8,D8
df3,9,A9,B9,C9,D9
df3,10,A10,B10,C10,D10
df3,11,A11,B11,C11,D11


In [40]:
(pd.concat([df1,df2,df3],axis=0, keys=['df1', 'df2','df3'])).loc['df1']
# .loc['df1'] selects the rows under the 'df1' key, recovering the original df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
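The outer level created by `keys` can also be given a name, which may make later selection easier to read. The sketch below uses two small hypothetical frames (`df_a`, `df_b`) and hypothetical level names (`source`, `row`). The `names` parameter comes from the `pd.concat` signature above, and `DataFrame.xs` selects by level.

```python
import pandas as pd

# Hypothetical frames with the same index labels
df_a = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
df_b = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])

# keys labels each block; names labels the two index levels
combined = pd.concat([df_a, df_b],
                     keys=['first', 'second'],
                     names=['source', 'row'])
print(combined.index.names)

# All rows that came from the second frame
second_part = combined.xs('second', level='source')

# Row label 0 from every source frame
row_zero = combined.xs(0, level='row')
print(second_part)
print(row_zero)
```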


In [16]:
pd.concat([df1,df2,df3],axis=1)
# axis=1 places the frames side by side
# The result is full of NaN values. Why?

Unnamed: 0,A,B,C,D,A,B,C,D,A,B,C,D
0,A0,B0,C0,D0,,,,,,,,
1,A1,B1,C1,D1,,,,,,,,
2,A2,B2,C2,D2,,,,,,,,
3,A3,B3,C3,D3,,,,,,,,
A,,,,,A4,B4,C4,D4,,,,
B,,,,,A5,B5,C5,D5,,,,
C,,,,,A6,B6,C6,D6,,,,
D,,,,,A7,B7,C7,D7,,,,
8,,,,,,,,,A8,B8,C8,D8
9,,,,,,,,,A9,B9,C9,D9
10,,,,,,,,,A10,B10,C10,D10
11,,,,,,,,,A11,B11,C11,D11


In [44]:
# The NaNs appear because the three indexes share no labels.
# Resetting each index to 0..3 makes the rows line up.
pd.concat(
    [df1.reset_index(drop=True),
     df2.reset_index(drop=True),
     df3.reset_index(drop=True)],axis=1
)

Unnamed: 0,A,B,C,D,A,B,C,D,A,B,C,D
0,A0,B0,C0,D0,A4,B4,C4,D4,A8,B8,C8,D8
1,A1,B1,C1,D1,A5,B5,C5,D5,A9,B9,C9,D9
2,A2,B2,C2,D2,A6,B6,C6,D6,A10,B10,C10,D10
3,A3,B3,C3,D3,A7,B7,C7,D7,A11,B11,C11,D11
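Resetting the index works when rows correspond by position. When they should correspond by label instead, `join='inner'` keeps only the labels present in every frame. This is a small sketch with hypothetical frames `p` and `q` whose indexes only partly overlap.

```python
import pandas as pd

# Hypothetical frames; only K1 and K2 appear in both indexes
p = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
q = pd.DataFrame({'B': ['B1', 'B2', 'B3']}, index=['K1', 'K2', 'K3'])

# Default outer join: union of labels, NaN where a frame has no such row
wide_outer = pd.concat([p, q], axis=1)

# Inner join: only the labels that both frames contain
wide_inner = pd.concat([p, q], axis=1, join='inner')

print(wide_outer)
print(wide_inner)
```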


### 2. Merging

The **merge** function combines DataFrames using logic similar to joining tables in SQL.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

```python
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
```

In [17]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

In [47]:
left

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3


In [48]:
right

Unnamed: 0,key,C,D
0,K0,C0,D0
1,K1,C1,D1
2,K2,C2,D2
3,K3,C3,D3


In [18]:
left.merge(right,on = 'key')

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3


In [49]:
pd.merge(left,right,how='inner',on='key')

# Equivalent to a SQL inner join
# The two frames are matched on the 'key' column

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3
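When the key columns are named differently in the two frames, the `left_on` and `right_on` parameters from the signature above can be used in place of `on`. The frame and column names below (`left_df`, `right_df`, `emp_key`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names
left_df = pd.DataFrame({'emp_key': ['K0', 'K1', 'K2'],
                        'A': ['A0', 'A1', 'A2']})
right_df = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                         'C': ['C0', 'C1', 'C2']})

# Match left_df.emp_key against right_df.key
merged = left_df.merge(right_df, left_on='emp_key', right_on='key')

# Both key columns are kept; drop one if it is redundant
tidy = merged.drop(columns='key')

print(merged)
print(tidy)
```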


In [19]:
# Merging on multiple keys
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

In [20]:
left

Unnamed: 0,key1,key2,A,B
0,K0,K0,A0,B0
1,K0,K1,A1,B1
2,K1,K0,A2,B2
3,K2,K1,A3,B3


In [21]:
right

Unnamed: 0,key1,key2,C,D
0,K0,K0,C0,D0
1,K1,K0,C1,D1
2,K1,K0,C2,D2
3,K2,K0,C3,D3


In [55]:
pd.merge(left, right, on=['key1', 'key2'])
# key1 and key2 are treated as one composite key: a row pair is kept only when both values match

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
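Repeated keys on one side duplicate the matching rows from the other side, as with `A2, B2` in the result above. If that is unexpected, the `validate` parameter from the signature can check the key relationship before merging. The sketch below uses hypothetical frames `left_keys` and `right_keys`.

```python
import pandas as pd

# Hypothetical frames; 'K1' is repeated on the right side
left_keys = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
right_keys = pd.DataFrame({'key': ['K1', 'K1'], 'C': ['C1', 'C2']})

# one_to_one requires unique keys on both sides, so this merge is rejected
try:
    pd.merge(left_keys, right_keys, on='key', validate='one_to_one')
    raised = False
except pd.errors.MergeError as err:
    raised = True
    print('merge refused:', err)

# one_to_many only requires unique keys on the left, which holds here
checked = pd.merge(left_keys, right_keys, on='key', validate='one_to_many')
print(checked)
```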


In [57]:
pd.merge(left, right, how='outer', on=['key1', 'key2'])
# full outer join

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
5,K2,K0,,,C3,D3


In [20]:
pd.merge(left, right, how='right', on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
3,K2,K0,,,C3,D3
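For completeness, `how='left'` is the mirror image of the right join above: every row of the left frame is kept, and the right-hand columns are NaN where no key matches. The sketch rebuilds the two-key data under the hypothetical names `left_mk` and `right_mk` (with columns `B` and `D` left out for brevity) so that it runs on its own.

```python
import pandas as pd

# Same two-key data as above, rebuilt under hypothetical names
left_mk = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                        'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3']})
right_mk = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                         'key2': ['K0', 'K0', 'K0', 'K0'],
                         'C': ['C0', 'C1', 'C2', 'C3']})

# Keep every left row; unmatched rows get NaN in column C
left_join = pd.merge(left_mk, right_mk, how='left', on=['key1', 'key2'])
print(left_join)
```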


In [22]:
left = pd.DataFrame({'col1': [-1, 1, 2, 4], 'col_left':['a','b','c','d']})
right = pd.DataFrame({'col1': [1, 2, 2, 6],'col_right':[2, 3, 4, 5]})
pd.merge(left, right, on='col1', how='outer', indicator=True)

Unnamed: 0,col1,col_left,col_right,_merge
0,-1,a,,left_only
1,1,b,2.0,both
2,2,c,3.0,both
3,2,c,4.0,both
4,4,d,,left_only
5,6,,5.0,right_only
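One common use of the `_merge` column is an anti-join, that is, finding the rows of one frame that have no match in the other. A minimal sketch, reusing the data above under the hypothetical names `left_ind` and `right_ind`:

```python
import pandas as pd

left_ind = pd.DataFrame({'col1': [-1, 1, 2, 4],
                         'col_left': ['a', 'b', 'c', 'd']})
right_ind = pd.DataFrame({'col1': [1, 2, 2, 6],
                          'col_right': [2, 3, 4, 5]})

# A left join with indicator=True marks each row as 'both' or 'left_only'
flagged = pd.merge(left_ind, right_ind, on='col1', how='left', indicator=True)

# Keep only the left rows that found no partner on the right
left_only = flagged.loc[flagged['_merge'] == 'left_only', ['col1', 'col_left']]
print(left_only)
```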


### 3. Joining

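Joining is a convenient way to combine the columns of two DataFrames, which may have different indexes, into a single DataFrame. By default, `join` matches rows on the index and performs a left join.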

In [23]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [24]:
left

Unnamed: 0,A,B
K0,A0,B0
K1,A1,B1
K2,A2,B2


In [25]:
right

Unnamed: 0,C,D
K0,C0,D0
K2,C2,D2
K3,C3,D3


In [25]:
left.join(right)

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


In [28]:
left = left.reset_index()

In [29]:
right = right.reset_index()

In [35]:
# Both frames now contain an 'index' column, so suffixes are needed to tell them apart
left.join(right,
          how='outer',
          lsuffix='_left',
          rsuffix='_right')

Unnamed: 0,index_left,A,B,index_right,C,D
0,K0,A0,B0,K0,C0,D0
1,K1,A1,B1,K2,C2,D2
2,K2,A2,B2,K3,C3,D3
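Because both frames now carry the default 0..2 index, the call above pairs rows by position, so K1 ends up next to K2. To match on the key values instead, one option is to pass `on=` and make the key the index of the right-hand frame. This is a sketch with hypothetical frames `left_r` and `right_r` that mimic the reset frames, with columns `B` and `D` left out for brevity.

```python
import pandas as pd

# Hypothetical stand-ins for the reset frames
left_r = pd.DataFrame({'index': ['K0', 'K1', 'K2'],
                       'A': ['A0', 'A1', 'A2']})
right_r = pd.DataFrame({'index': ['K0', 'K2', 'K3'],
                        'C': ['C0', 'C2', 'C3']})

# Match left_r['index'] against the index of the right-hand frame
by_key = left_r.join(right_r.set_index('index'), on='index')
print(by_key)
```

With this form K2 is paired with `C2`, and K1, which has no match on the right, gets NaN.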
