#### How to merge DataFrames in pandas
pandas merge is the counterpart of a SQL join: it combines different tables into a single table by matching rows on a key.

#### merge syntax:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
+ left, right: the DataFrames (or named Series) to merge
+ how: the join type, one of 'left', 'right', 'outer', 'inner'
+ on: the join key; it must exist in both left and right
+ left_on: the key in the left DataFrame or Series
+ right_on: the key in the right DataFrame or Series
+ left_index, right_index: join on the index instead of a regular column
+ suffixes: a pair of suffixes appended automatically to columns whose names clash; the default is ('_x', '_y')
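
The key-related parameters above can be sketched as follows. The two small frames and their column names (sno, student_no, name, age) are made up for illustration and are not part of the MovieLens data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the MovieLens files.
left = pd.DataFrame({'sno': [11, 12, 13], 'name': ['A', 'B', 'C']})
right = pd.DataFrame({'student_no': [11, 12, 14], 'age': [21, 22, 24]})

# Key columns have different names: use left_on / right_on.
by_columns = pd.merge(left, right, left_on='sno', right_on='student_no', how='inner')

# Key columns share one name: use on.
right_same = right.rename(columns={'student_no': 'sno'})
by_on = pd.merge(left, right_same, on='sno', how='inner')

# Key lives in the index of the right frame: combine left_on with right_index.
right_indexed = right_same.set_index('sno')
by_index = pd.merge(left, right_indexed, left_on='sno', right_index=True, how='inner')

print(by_columns)
print(by_on)
print(by_index)
```

With left_on/right_on both key columns are kept in the result, while on keeps a single key column.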

#### Data source: <https://grouplens.org/datasets/movielens/>
Downloaded locally to C:\Users\pxpxz_ct9p1p3\DSWorkshop\pandas\files\ml-latest-small

In [35]:
import pandas as pd

In [36]:
# ratings.csv has a header row and is comma-separated, so the defaults work
df_ratings = pd.read_csv('./files/ml-latest-small/ratings.csv')

In [37]:
df_ratings.head()

index,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [38]:
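# df_movies is not loaded anywhere above; assuming it comes from movies.csv in the same folder
df_movies = pd.read_csv('./files/ml-latest-small/movies.csv')
df_movies.head()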

index,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [39]:
# tags.csv has a header row and is comma-separated, so the defaults work
df_tags = pd.read_csv('./files/ml-latest-small/tags.csv')

In [40]:
df_tags.head()

index,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [43]:
# Inner join of movies and ratings on movieId
df_movies_ratings = pd.merge(
    df_movies, df_ratings, left_on='movieId', right_on='movieId', how='inner')

In [44]:
df_movies_ratings.head(10)

index,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483
5,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,18,3.5,1455209816
6,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,19,4.0,965705637
7,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,21,3.5,1407618878
8,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,27,3.0,962685262
9,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,31,5.0,850466616


In [47]:
# Inner join of the movies/ratings result with tags on movieId
df_movies_ratings_tags = pd.merge(
    df_movies_ratings, df_tags, left_on='movieId', right_on='movieId', how='inner')

In [49]:
df_movies_ratings_tags.head()

index,movieId,title,genres,userId_x,rating,timestamp_x,userId_y,tag,timestamp_y
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,336,pixar,1139045764
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,474,pixar,1137206825
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,567,fun,1525286013
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,336,pixar,1139045764
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,474,pixar,1137206825
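
The userId_x / userId_y columns in this output appear because both inputs carry a userId column that is not part of the join key, so every rating of a movie is paired with every tag of that movie. If the goal were instead to pair each rating with tags from the same user, one option is to join on both columns. This is only a sketch on made-up rows, not the notebook's own step:

```python
import pandas as pd

# Hypothetical miniature versions of the ratings and tags tables.
ratings = pd.DataFrame({'userId': [1, 2], 'movieId': [1, 1], 'rating': [4.0, 3.5]})
tags = pd.DataFrame({'userId': [2, 3], 'movieId': [1, 1], 'tag': ['pixar', 'fun']})

# Joining on movieId only: every rating meets every tag of the movie.
by_movie = pd.merge(ratings, tags, on='movieId', how='inner')

# Joining on both columns: only rows where the same user rated and tagged the movie.
by_user_movie = pd.merge(ratings, tags, on=['userId', 'movieId'], how='inner')

print(len(by_movie))       # 4
print(len(by_user_movie))  # 1
```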


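#### 2. Understanding how row counts line up in a merge
Make sure the following relationships are clear:
+ one-to-one: the join key is unique on both sides
    + e.g. (student ID, name) merge (student ID, age)
    + rows in the result per key: 1*1
+ one-to-many: the key is unique on the left but not on the right
    + e.g. (student ID, name) merge (student ID, [Chinese grade, math grade, English grade])
    + rows in the result per key: 1*N
+ many-to-many: the key is not unique on either side
    + e.g. (student ID, [Chinese grade, math grade, English grade]) merge (student ID, [basketball, football, table tennis])
    + rows in the result per key: M*N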

#### 3. Understanding the difference between left join, right join, inner join and outer join
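A minimal sketch of the four join types, using two made-up frames (the key, A and B columns are illustrative only). indicator=True adds a _merge column showing which side each row came from.

```python
import pandas as pd

# Hypothetical frames: K0 and K1 exist on both sides, K2/K3 only on the left,
# K4/K5 only on the right.
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'A': ['A0', 'A1', 'A2', 'A3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'], 'B': ['B0', 'B1', 'B4', 'B5']})

# inner: only keys present on both sides.
inner = pd.merge(left, right, on='key', how='inner')

# left: every left row is kept; B is NaN where the right side has no match.
left_join = pd.merge(left, right, on='key', how='left')

# right: every right row is kept; A is NaN where the left side has no match.
right_join = pd.merge(left, right, on='key', how='right')

# outer: union of the keys from both sides.
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

print(len(inner), len(left_join), len(right_join), len(outer))  # 2 4 4 6
print(outer)
```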

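#### 4. What if non-key columns have the same name?
+ The clashing columns both appear in the result, as name_x and name_y
+ Use suffixes to choose your own suffixes, e.g. suffixes=('_left', '_right')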
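
A short sketch of both cases, assuming two made-up frames that share a non-key column called value:

```python
import pandas as pd

# Hypothetical frames with a clashing non-key column named 'value'.
left = pd.DataFrame({'key': ['K0', 'K1'], 'value': [1, 2]})
right = pd.DataFrame({'key': ['K0', 'K1'], 'value': [10, 20]})

# Default suffixes: the clashing columns become value_x and value_y.
default = pd.merge(left, right, on='key')

# Custom suffixes make the origin of each column explicit.
custom = pd.merge(left, right, on='key', suffixes=('_left', '_right'))

print(default.columns.tolist())  # ['key', 'value_x', 'value_y']
print(custom.columns.tolist())   # ['key', 'value_left', 'value_right']
```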