# Chapter 8 - Data Wrangling: Join, Combine, and Reshape

## 8.2 Combining and Merging Datasets

In [1]:
import pandas as pd

- Merging DataFrames using `df1.merge(df2)` or `pd.merge(df1, df2)`
- Using different flavours of `how` when merging, including `left`, `right`, `inner` and `outer`
- Using parameters to determine the columns (and indices) to merge on, including `on`, `left_on`, `right_on`, `left_index=True` and `right_index=True`
- Handling overlapping column names using `suffixes` in `merge()`, or `lsuffix` and `rsuffix` in `join()`
- Merging on indices using `df1.join(df2)`
- Stacking `df`s below each other using `pd.concat([df1, df2])` (or side by side using the `axis=1` parameter)
- Overlaying 2 `df`s to fill in missing values using `df1.combine_first(df2)`

In [2]:
# Read the dataset and prepare two subsets. Note the years covered by each df
df = pd.read_csv('dataset-C-enrolment.csv')
cols_for_analysis = ['year', 'sex', 'course', 'graduates']
df_f = df[df.sex=='F']
df_f = df_f[cols_for_analysis]
df_f = df_f.head(8)
display(df_f)
df_mf = df[df.sex=='MF']
df_mf = df_mf[cols_for_analysis]
df_mf = df_mf.tail(8)
display(df_mf)

Unnamed: 0,year,sex,course,graduates
1,2005,F,Law,125
3,2006,F,Law,134
5,2007,F,Law,123
7,2008,F,Law,115
9,2009,F,Law,118
11,2010,F,Law,89
13,2011,F,Law,208
15,2012,F,Law,207


Unnamed: 0,year,sex,course,graduates
10,2010,MF,Law,227
12,2011,MF,Law,329
14,2012,MF,Law,347
16,2013,MF,Law,368
18,2014,MF,Law,356
20,2015,MF,Law,355
22,2016,MF,Law,351
24,2017,MF,Law,375


In [3]:
# Keep only the year and graduates columns, renaming graduates for each df
df_f1 = df_f[['year', 'graduates']].rename(columns={'graduates': 'graduates_f'})
df_mf1 = df_mf[['year', 'graduates']].rename(columns={'graduates': 'graduates_mf'})
display(df_f1)
display(df_mf1)

Unnamed: 0,year,graduates_f
1,2005,125
3,2006,134
5,2007,123
7,2008,115
9,2009,118
11,2010,89
13,2011,208
15,2012,207


Unnamed: 0,year,graduates_mf
10,2010,227
12,2011,329
14,2012,347
16,2013,368
18,2014,356
20,2015,355
22,2016,351
24,2017,375


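Database-style merging uses either the `df1.merge(df2)` or the `pd.merge(df1, df2)` syntax. It is good practice to name the common columns to merge on explicitly with the `on` parameter; if `on` is omitted, pandas merges on every column name the two `df`s share. The default is an inner join, so only the years present in both `df`s are kept.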

In [4]:
merged_df1 = df_f1.merge(df_mf1, on='year')
display(merged_df1)
merged_df2 = pd.merge(df_f1, df_mf1, on='year')
display(merged_df2)

Unnamed: 0,year,graduates_f,graduates_mf
0,2010,89,227
1,2011,208,329
2,2012,207,347


Unnamed: 0,year,graduates_f,graduates_mf
0,2010,89,227
1,2011,208,329
2,2012,207,347


If the key columns have different names in the two `df`s, specify them with `left_on` and `right_on` respectively.
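As a minimal sketch of merging on differently named key columns, the toy frames below are hypothetical (the column name `yr` is made up for illustration; the figures are taken from the tables above).

```python
import pandas as pd

# Hypothetical toy frames: the key column is 'year' on the left and 'yr' on the right
left = pd.DataFrame({'year': [2010, 2011, 2012], 'graduates_f': [89, 208, 207]})
right = pd.DataFrame({'yr': [2011, 2012, 2013], 'graduates_mf': [329, 347, 368]})

# Default inner join: only keys present in both frames survive
merged = pd.merge(left, right, left_on='year', right_on='yr')
print(merged)
```

Both key columns (`year` and `yr`) are kept in the result, so one of them can be dropped afterwards if it is redundant.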

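Using `how='left'` keeps every key from the 1st `df`; using `how='right'` keeps every key from the 2nd `df`. Rows without a match on the other side are filled with `NaN`, which is why the affected integer columns are displayed as floats.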

In [5]:
merged_df3 = pd.merge(df_f1, df_mf1, on='year', how='left')
display(merged_df3)
merged_df4 = pd.merge(df_f1, df_mf1, on='year', how='right')
display(merged_df4)

Unnamed: 0,year,graduates_f,graduates_mf
0,2005,125,
1,2006,134,
2,2007,123,
3,2008,115,
4,2009,118,
5,2010,89,227.0
6,2011,208,329.0
7,2012,207,347.0


Unnamed: 0,year,graduates_f,graduates_mf
0,2010,89.0,227
1,2011,208.0,329
2,2012,207.0,347
3,2013,,368
4,2014,,356
5,2015,,355
6,2016,,351
7,2017,,375


Using `how='outer'` will keep all values of the joining column on both `df`s. 

In [6]:
merged_df5 = pd.merge(df_f1, df_mf1, on='year', how='outer')
display(merged_df5)

Unnamed: 0,year,graduates_f,graduates_mf
0,2005,125.0,
1,2006,134.0,
2,2007,123.0,
3,2008,115.0,
4,2009,118.0,
5,2010,89.0,227.0
6,2011,208.0,329.0
7,2012,207.0,347.0
8,2013,,368.0
9,2014,,356.0


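When non-key columns share the same name in both `df`s, pandas appends a suffix to each of them after merging (`_x` for the left `df` and `_y` for the right by default) so that they stay distinguishable. The `suffixes` parameter overrides these defaults.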

In [7]:
pd.merge(df_f, df_mf, on='year')

Unnamed: 0,year,sex_x,course_x,graduates_x,sex_y,course_y,graduates_y
0,2010,F,Law,89,MF,Law,227
1,2011,F,Law,208,MF,Law,329
2,2012,F,Law,207,MF,Law,347
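The default `_x` and `_y` suffixes in the output above can be replaced with more descriptive ones through the `suffixes` parameter of `merge()`. A small sketch on hypothetical two-row frames that mirror the structure of `df_f` and `df_mf`:

```python
import pandas as pd

# Hypothetical two-row stand-ins for df_f and df_mf
f = pd.DataFrame({'year': [2010, 2011], 'sex': ['F', 'F'], 'graduates': [89, 208]})
mf = pd.DataFrame({'year': [2010, 2011], 'sex': ['MF', 'MF'], 'graduates': [227, 329]})

# Overlapping non-key columns get '_f' and '_mf' instead of '_x' and '_y'
merged = pd.merge(f, mf, on='year', suffixes=('_f', '_mf'))
print(merged.columns.tolist())
```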


In [8]:
df_f2 = df_f.copy()
display(df_f2)
df_mf2 = df_mf.copy()
# Setting the index of a df
df_mf2 = df_mf2.set_index('year')
display(df_mf2)

Unnamed: 0,year,sex,course,graduates
1,2005,F,Law,125
3,2006,F,Law,134
5,2007,F,Law,123
7,2008,F,Law,115
9,2009,F,Law,118
11,2010,F,Law,89
13,2011,F,Law,208
15,2012,F,Law,207


year,sex,course,graduates
2010,MF,Law,227
2011,MF,Law,329
2012,MF,Law,347
2013,MF,Law,368
2014,MF,Law,356
2015,MF,Law,355
2016,MF,Law,351
2017,MF,Law,375


To merge using a column on one `df` and the index of the other, combine `left_on` or `right_on` (for the column) with `right_index=True` or `left_index=True` (for the index).

In [9]:
# Merge using column on left df and index on right df. Hence, left_on and right_index are used
merged_4 = df_f2.merge(df_mf2, left_on='year', right_index=True)
display(merged_4)

Unnamed: 0,year,sex_x,course_x,graduates_x,sex_y,course_y,graduates_y
11,2010,F,Law,89,MF,Law,227
13,2011,F,Law,208,MF,Law,329
15,2012,F,Law,207,MF,Law,347


Note that it is possible to merge on 2 or more columns.
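Merging on several columns is done by passing a list to `on`. The sketch below uses hypothetical frames; the `Medicine` rows and their figures are invented purely for illustration.

```python
import pandas as pd

# Hypothetical data: the 'Medicine' rows are made up for this example
left = pd.DataFrame({
    'year': [2010, 2010, 2011],
    'course': ['Law', 'Medicine', 'Law'],
    'graduates_f': [89, 120, 208],
})
right = pd.DataFrame({
    'year': [2010, 2011, 2011],
    'course': ['Law', 'Law', 'Medicine'],
    'graduates_mf': [227, 329, 250],
})

# A row matches only if BOTH year and course agree
merged = pd.merge(left, right, on=['year', 'course'])
print(merged)
```

Only the (2010, Law) and (2011, Law) combinations appear in both frames, so the inner join returns two rows.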

If the keys in both `df`s are their indices, consider using `.join()`, which merges on the index and performs a left join by default.

In [10]:
df_f2.index = df_f2['year']
display(df_f2)
display(df_mf2)

year,year,sex,course,graduates
2005,2005,F,Law,125
2006,2006,F,Law,134
2007,2007,F,Law,123
2008,2008,F,Law,115
2009,2009,F,Law,118
2010,2010,F,Law,89
2011,2011,F,Law,208
2012,2012,F,Law,207


year,sex,course,graduates
2010,MF,Law,227
2011,MF,Law,329
2012,MF,Law,347
2013,MF,Law,368
2014,MF,Law,356
2015,MF,Law,355
2016,MF,Law,351
2017,MF,Law,375


In [11]:
df_f2.join(df_mf2, lsuffix='_f', rsuffix='_mf')

year,year,sex_f,course_f,graduates_f,sex_mf,course_mf,graduates_mf
2005,2005,F,Law,125,,,
2006,2006,F,Law,134,,,
2007,2007,F,Law,123,,,
2008,2008,F,Law,115,,,
2009,2009,F,Law,118,,,
2010,2010,F,Law,89,MF,Law,227.0
2011,2011,F,Law,208,MF,Law,329.0
2012,2012,F,Law,207,MF,Law,347.0
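The output above keeps only the years of the left `df` because `.join()` defaults to `how='left'`. A minimal sketch of switching the join type, using hypothetical two-row frames indexed by year:

```python
import pandas as pd

# Hypothetical frames indexed by year, with an overlapping column name
f = pd.DataFrame({'graduates': [89, 208]}, index=pd.Index([2010, 2011], name='year'))
mf = pd.DataFrame({'graduates': [329, 347]}, index=pd.Index([2011, 2012], name='year'))

# Default left join: all index values of f are kept
left_joined = f.join(mf, lsuffix='_f', rsuffix='_mf')

# Inner join: only index values present in both frames
inner_joined = f.join(mf, lsuffix='_f', rsuffix='_mf', how='inner')

print(left_joined)
print(inner_joined)
```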


Use `pd.concat([df1, df2])` to stack both `df`s on top of each other. Note that the `df`s are passed together as a single list.

In [12]:
pd.concat([df_f, df_mf])

Unnamed: 0,year,sex,course,graduates
1,2005,F,Law,125
3,2006,F,Law,134
5,2007,F,Law,123
7,2008,F,Law,115
9,2009,F,Law,118
11,2010,F,Law,89
13,2011,F,Law,208
15,2012,F,Law,207
10,2010,MF,Law,227
12,2011,MF,Law,329


In [13]:
# Data preparation: index each df by year and keep only the columns needed.
# Chaining the calls avoids renaming in place on a slice of the original df.
df_f3 = df_f.set_index('year')[['sex', 'graduates']].rename(columns={'graduates': 'graduates_f'})
df_mf3 = df_mf.set_index('year')[['sex', 'graduates']].rename(columns={'graduates': 'graduates_mf'})
display(df_f3)
display(df_mf3)

year,sex,graduates_f
2005,F,125
2006,F,134
2007,F,123
2008,F,115
2009,F,118
2010,F,89
2011,F,208
2012,F,207


year,sex,graduates_mf
2010,MF,227
2011,MF,329
2012,MF,347
2013,MF,368
2014,MF,356
2015,MF,355
2016,MF,351
2017,MF,375


Note that you can also perform a `concat()` operation horizontally by passing `axis=1`. Rows are then aligned on their index values and the columns are placed side by side; an index value missing from one `df` yields `NaN` in that `df`'s columns.

In [14]:
pd.concat([df_f3, df_mf3], axis=1)

year,sex,graduates_f,sex,graduates_mf
2005,F,125.0,,
2006,F,134.0,,
2007,F,123.0,,
2008,F,115.0,,
2009,F,118.0,,
2010,F,89.0,MF,227.0
2011,F,208.0,MF,329.0
2012,F,207.0,MF,347.0
2013,,,MF,368.0
2014,,,MF,356.0
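To make the difference between the two axes concrete, here is a small sketch on hypothetical frames with partially overlapping year indices:

```python
import pandas as pd

# Hypothetical frames sharing only the index value 2011
a = pd.DataFrame({'graduates_f': [89, 208]}, index=[2010, 2011])
b = pd.DataFrame({'graduates_mf': [329, 347]}, index=[2011, 2012])

# axis=0 (default): rows are stacked, columns are the union of both frames
stacked = pd.concat([a, b])

# axis=1: columns are placed side by side, rows are aligned on the index
side_by_side = pd.concat([a, b], axis=1)

print(stacked)
print(side_by_side)
```

With the default `axis=0` the index value 2011 appears twice, whereas with `axis=1` it appears once with both columns filled.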


In [15]:
df_wines1, df_wines2 = pd.read_csv('dataset-D3-wines.csv'), pd.read_csv('dataset-D4-wines.csv')
display(df_wines1)
display(df_wines2)

Unnamed: 0,id,variety,points,price
0,146568,Chardonnay,,12.0
1,99586,,92.0,65.0
2,74081,Aglianico,90.0,
3,49142,Marzemino,,75.0
4,86968,Nebbiolo,91.0,


Unnamed: 0,id,variety,points,price
0,146568,Chardonnay,88.0,12.0
1,99586,Cabernet Sauvignon,92.0,65.0
2,74081,,90.0,
3,49142,Marzemino,90.0,75.0
4,86968,Nebbiolo,,


Another way of combining is "overlaying" one `df` onto another: wherever the first `df` has a missing value, the value at the same index and column in the second `df` is used to fill it. Values already present in the first `df` are left untouched.

In [16]:
df_wines1.combine_first(df_wines2)

Unnamed: 0,id,variety,points,price
0,146568,Chardonnay,88.0,12.0
1,99586,Cabernet Sauvignon,92.0,65.0
2,74081,Aglianico,90.0,
3,49142,Marzemino,90.0,75.0
4,86968,Nebbiolo,91.0,
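The filling rule can be checked on a small hypothetical example (the values below are made up and are not from the wines datasets):

```python
import numpy as np
import pandas as pd

# Hypothetical frames with the same index and columns
first = pd.DataFrame({'points': [np.nan, 92.0, 90.0], 'price': [12.0, 65.0, np.nan]})
second = pd.DataFrame({'points': [88.0, 91.0, np.nan], 'price': [10.0, 65.0, np.nan]})

# Values in `first` win; `second` only fills the gaps
result = first.combine_first(second)
print(result)
```

Row 1 keeps 92.0 from `first` even though `second` holds 91.0, and the last `price` stays missing because neither frame has a value for it.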


(Note that there is another method, `combine()`, which takes a function that decides, column by column, how the values from the two `df`s are combined.)
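As a rough sketch of `combine()`, the function passed in receives the matching column from each `df` as a pair of Series and returns the combined column. The frames below are hypothetical, and taking the element-wise maximum is just one possible rule:

```python
import numpy as np
import pandas as pd

# Hypothetical frames with no missing values, to keep the rule easy to follow
first = pd.DataFrame({'points': [88.0, 92.0], 'price': [12.0, 65.0]})
second = pd.DataFrame({'points': [90.0, 91.0], 'price': [10.0, 70.0]})

# np.maximum is applied to each pair of matching columns
result = first.combine(second, np.maximum)
print(result)
```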

**References:**

Python for Data Analysis, 2nd Edition, McKinney (2017)