__Clean, Transform, Merge and Reshape__

Pandas provides a high-level, flexible, and high-performance set of core manipulations and algorithms to enable you to wrangle data into the right form.

# Combining and Merging Data Sets

`pandas.merge` connects rows in DataFrames based on one or more keys.

`pandas.concat` stacks objects together along an axis.

The `combine_first` instance method splices together overlapping data, filling in missing values in one object with values from another.
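As a quick illustration of the last two tools, here is a minimal sketch on two made-up Series (`s1` and `s2` are illustrative names, not data used elsewhere in this notebook). `pd.concat` simply stacks them, while `combine_first` keeps the values of `s1` and patches its gaps from `s2`.

```python
import numpy as np
import pandas as pd

# Two toy Series with overlapping index labels; s1 has a missing value at 'b'.
s1 = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0, 40.0], index=['a', 'b', 'c', 'd'])

# concat stacks both objects along axis 0, so index labels repeat.
stacked = pd.concat([s1, s2])
print(len(stacked))

# combine_first prefers s1 and falls back to s2 where s1 is missing or absent.
patched = s1.combine_first(s2)
print(patched)
```

Note that `combine_first` aligns on the index, so the label `'d'`, which exists only in `s2`, also appears in the result.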

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before Matplotlib 3.6
%matplotlib inline

## Database-style DataFrame Merges

In [24]:
left = pd.DataFrame({'data1' : np.random.randint(100, 200, 8), 'key' : list('bacbaacb')})
right = pd.DataFrame({'data2' : np.random.randint(10, 99, 7), 'key' : list('acaacdd')})
right


Unnamed: 0,data2,key
0,31,a
1,58,c
2,76,a
3,54,a
4,68,c
5,89,d
6,95,d


In [25]:
left.merge(right)

Unnamed: 0,data1,key,data2
0,188,a,31
1,188,a,76
2,188,a,54
3,165,a,31
4,165,a,76
5,165,a,54
6,163,a,31
7,163,a,76
8,163,a,54
9,186,c,58


In [26]:
pd.merge(left, right, on = 'key')

Unnamed: 0,data1,key,data2
0,188,a,31
1,188,a,76
2,188,a,54
3,165,a,31
4,165,a,76
5,165,a,54
6,163,a,31
7,163,a,76
8,163,a,54
9,186,c,58


By default, `merge` does an `inner` join (the intersection of the keys). The other options for `how` are `left`, `right`, and `outer`.
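To compare the four join types side by side, the following sketch uses two tiny made-up frames (`small_left` and `small_right` are illustrative, not the notebook's `left` and `right`). It also uses the `indicator=True` option of `merge`, which adds a `_merge` column telling where each row came from.

```python
import pandas as pd

# Toy frames: key 'a' is shared, 'b' exists only on the left, 'd' only on the right.
small_left = pd.DataFrame({'key': ['a', 'b'], 'data1': [1, 2]})
small_right = pd.DataFrame({'key': ['a', 'd'], 'data2': [10, 20]})

# Number of rows produced by each join type.
row_counts = {how: len(small_left.merge(small_right, on='key', how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)

# indicator=True labels each row as 'both', 'left_only' or 'right_only'.
tagged = small_left.merge(small_right, on='key', how='outer', indicator=True)
origin = dict(zip(tagged['key'], tagged['_merge'].astype(str)))
print(origin)
```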

In [27]:
left.merge(right, how = 'left')
# or pd.merge(left,right, how = 'left')

Unnamed: 0,data1,key,data2
0,110,b,
1,188,a,31.0
2,188,a,76.0
3,188,a,54.0
4,186,c,58.0
5,186,c,68.0
6,143,b,
7,165,a,31.0
8,165,a,76.0
9,165,a,54.0


In [28]:
left.merge(right, how = 'outer')


Unnamed: 0,data1,key,data2
0,110.0,b,
1,143.0,b,
2,135.0,b,
3,188.0,a,31.0
4,188.0,a,76.0
5,188.0,a,54.0
6,165.0,a,31.0
7,165.0,a,76.0
8,165.0,a,54.0
9,163.0,a,31.0


`merge` returns the Cartesian product of the rows that share a key: if a key is duplicated on either side, the result contains every possible combination of the matching rows.

If the key columns don't have the same name, or we want to join on the index of the DataFrames, we need to specify that explicitly.
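The row multiplication is easy to check by counting. This is a minimal sketch with made-up frames (`dup_left` and `dup_right` are illustrative names). It also shows the `validate` argument of `merge`, which raises a `MergeError` when the keys are not as unique as we claim.

```python
import pandas as pd

# Key 'a' appears 3 times on the left and 2 times on the right.
dup_left = pd.DataFrame({'key': ['a', 'a', 'a', 'b'], 'data1': [1, 2, 3, 4]})
dup_right = pd.DataFrame({'key': ['a', 'a', 'c'], 'data2': [10, 20, 30]})

# Inner join: 3 x 2 combinations for 'a'; 'b' and 'c' have no partner.
pairs = dup_left.merge(dup_right, on='key')
print(len(pairs))

# validate='one_to_one' fails here because 'a' is duplicated on both sides.
try:
    dup_left.merge(dup_right, on='key', validate='one_to_one')
    is_one_to_one = True
except pd.errors.MergeError:
    is_one_to_one = False
print(is_one_to_one)
```

Passing `validate` is a cheap guard when an unexpected duplicate key would silently inflate the result.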

In [29]:
right.columns = ['a', 'b']
left.merge(right, left_on = ['key'], right_on = ['b'])

Unnamed: 0,data1,key,a,b
0,188,a,31,a
1,188,a,76,a
2,188,a,54,a
3,165,a,31,a
4,165,a,76,a
5,165,a,54,a
6,163,a,31,a
7,163,a,76,a
8,163,a,54,a
9,186,c,58,c
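The cell above covers key columns with different names. For the other case, joining a column against an index, a minimal sketch could look like this (the frames `by_col` and `by_idx` are made up for illustration). It relies on the `right_index=True` argument of `merge`.

```python
import pandas as pd

# The keys live in a column on one side and in the index on the other.
by_col = pd.DataFrame({'key': ['a', 'b', 'a'], 'data1': [1, 2, 3]})
by_idx = pd.DataFrame({'data2': [10, 20]}, index=['a', 'b'])

# Match the 'key' column of by_col against the index of by_idx.
on_index = by_col.merge(by_idx, left_on='key', right_index=True)
print(on_index.sort_values('data1'))
```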


If both DataFrames have a column with the same name that we do not join on, both columns are carried into the resulting DataFrame, each with a suffix (`_x` and `_y` by default). We can customize these suffixes.

In [30]:
right.columns = ['data1', 'key']
left.merge(right, left_on=['key'], right_on=['key'])

Unnamed: 0,data1_x,key,data1_y
0,188,a,31
1,188,a,76
2,188,a,54
3,165,a,31
4,165,a,76
5,165,a,54
6,163,a,31
7,163,a,76
8,163,a,54
9,186,c,58


In [31]:
left.merge(right, left_on=['key'], right_on=['key'], suffixes=['_chachi', '_piruli'])

Unnamed: 0,data1_chachi,key,data1_piruli
0,188,a,31
1,188,a,76
2,188,a,54
3,165,a,31
4,165,a,76
5,165,a,54
6,163,a,31
7,163,a,76
8,163,a,54
9,186,c,58


## Merging on Index

This is the question