# Merging and Concatenating DataFrames and Series

**Learning Objectives:** Learn how to combine multiple DataFrames using `merge` and `concat` and learn about relationships between DataFrames.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import seaborn as sns

## Introduction to merging

To perform a **merge** or **join**, you need two DataFrames with one or more columns in common, called "key" columns or keys.

* Find common "key" column(s) which will be the merge keys.
* Find the unique values in the merge keys and use `how` to pick what values will be in the new DF:
  - `inner` take values present in both DFs.
  - `outer` take values present in either DF.
  - `left/right` take all values present in the left/right DF, whether or not they appear in the other DF.
* Build a new DataFrame with all columns from both DFs, but the merge keys just once.
* Use `left_on/right_on` to specify which columns to use as the merge keys or `left_index/right_index` to specify that the index should be used as the merge key.
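
The key-selection options in the last bullet can be sketched as follows. The frames `left` and `right` and their column names (`lkey`, `rkey`, `x`, `y`) are hypothetical and chosen only for illustration; this is a minimal sketch, not part of the data used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
left = pd.DataFrame({'lkey': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'rkey': ['b', 'c', 'd'], 'y': [10, 20, 30]})

# Columns as keys: name the key column on each side.
by_column = pd.merge(left, right, left_on='lkey', right_on='rkey', how='inner')

# Index as key on the right side: move 'rkey' into the index first.
by_index = pd.merge(left, right.set_index('rkey'),
                    left_on='lkey', right_index=True, how='inner')

print(by_column)
print(by_index)
```

Note that when the key columns have different names, as in `by_column`, both key columns are kept in the result.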

In [2]:
df1 = DataFrame({'key': list('bbacaab'), 'data1': range(7)})
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [3]:
df1.key.unique()

array(['b', 'a', 'c'], dtype=object)

In [4]:
df2 = DataFrame({'key': list('abbd'), 'data2': range(4)})
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,b
3,3,d


In [5]:
df2.key.unique()

array(['a', 'b', 'd'], dtype=object)

The default merge method is `how="inner"`, which only includes keys that are in both DataFrames (`ab`): 

In [6]:
pd.merge(df1, df2)

Unnamed: 0,data1,key,data2
0,0,b,1
1,0,b,2
2,1,b,1
3,1,b,2
4,6,b,1
5,6,b,2
6,2,a,0
7,4,a,0
8,5,a,0


The `how="outer"` approach includes keys that are in either DataFrame (`abcd`):

In [7]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,data1,key,data2
0,0.0,b,1.0
1,0.0,b,2.0
2,1.0,b,1.0
3,1.0,b,2.0
4,6.0,b,1.0
5,6.0,b,2.0
6,2.0,a,0.0
7,4.0,a,0.0
8,5.0,a,0.0
9,3.0,c,
10,,d,3.0


The `how="left"` approach includes every key in the left DataFrame (`abc`), whether or not it also appears in the right one:

In [8]:
pd.merge(df1, df2, how='left')

Unnamed: 0,data1,key,data2
0,0,b,1.0
1,0,b,2.0
2,1,b,1.0
3,1,b,2.0
4,2,a,0.0
5,3,c,
6,4,a,0.0
7,5,a,0.0
8,6,b,1.0
9,6,b,2.0


The `how="right"` approach includes every key in the right DataFrame (`abd`), whether or not it also appears in the left one:

In [9]:
pd.merge(df1, df2, how='right')

Unnamed: 0,data1,key,data2
0,0.0,b,1
1,1.0,b,1
2,6.0,b,1
3,0.0,b,2
4,1.0,b,2
5,6.0,b,2
6,2.0,a,0
7,4.0,a,0
8,5.0,a,0
9,,d,3
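
When comparing the four `how` options, it can help to see which side each row came from. One way to do this, sketched here with the same `df1` and `df2`, is the `indicator` argument of `pd.merge`. The variable names `merged` and `unmatched` are introduced only for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'key': list('bbacaab'), 'data1': range(7)})
df2 = pd.DataFrame({'key': list('abbd'), 'data2': range(4)})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
merged = pd.merge(df1, df2, how='outer', indicator=True)

# Keys that appear on only one side of the merge.
unmatched = merged.loc[merged['_merge'] != 'both', ['key', '_merge']]
print(unmatched)
```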


## Relationships between DataFrames

When you have multiple DataFrames that have common keys you can have **relationships** between the entities in the different DataFrames. There are three types of entity relationships that are possible:

* 1-to-1
* 1-to-many
* many-to-many

Here is a small data set from the TV show [The Simpsons]() to illustrate these relationships.

First, here is a DataFrame with students' first and last names, along with a unique student id:

In [10]:
students = DataFrame({'fname': ['Bart','Lisa','Milhouse'],
                      'lname': ['Simpson','Simpson','Van Houten']},
                     index=list('abc'))
students

Unnamed: 0,fname,lname
a,Bart,Simpson
b,Lisa,Simpson
c,Milhouse,Van Houten


Here is a DataFrame with the student social security numbers, indexed by their unique student id:

In [11]:
ssns = DataFrame({'ssn':[1234,5678,9101]}, index=list('abc'))
ssns

Unnamed: 0,ssn
a,1234
b,5678
c,9101


Each student can have aliases or nicknames:

In [12]:
aliases = DataFrame({'alias':['Bartman','Bartron','Cosmos','Truth Teller','Lady Penelope Ariel',
                              'Jake Boyman','Lou La Trec','Eagle Eye','Maestro'],
                     'student': list('aaabbbccc')})
aliases

Unnamed: 0,alias,student
0,Bartman,a
1,Bartron,a
2,Cosmos,a
3,Truth Teller,b
4,Lady Penelope Ariel,b
5,Jake Boyman,b
6,Lou La Trec,c
7,Eagle Eye,c
8,Maestro,c


Here are the student home addresses:

In [13]:
addresses = DataFrame({'address':['742 Evergreen Terrace','742 Evergreen Terrace','316 Pikeland Ave.']},
                      index=list('abc'))
addresses

Unnamed: 0,address
a,742 Evergreen Terrace
b,742 Evergreen Terrace
c,316 Pikeland Ave.


A table of courses the students can be enrolled in:

In [14]:
courses = DataFrame({'name':['Biology','Math','PE','Underwater electronics']}, index=range(4))
courses

Unnamed: 0,name
0,Biology
1,Math
2,PE
3,Underwater electronics


This table contains the enrollment for each course. Each row pairs a student id (the `student` column) with a course id (the index).

In [15]:
enroll = DataFrame({'student':['a','b','b','c','c','c']},index=(2,0,1,0,1,2))
enroll

Unnamed: 0,student
2,a
0,b
1,b
0,c
1,c
2,c


## 1-1 relationships

* Each student has exactly one SSN.
* Each SSN belongs to exactly one student.

Here we are merging on the index of both DataFrames, so we use `left_index` and `right_index`:

In [16]:
pd.merge(students, ssns, left_index=True, right_index=True)

Unnamed: 0,fname,lname,ssn
a,Bart,Simpson,1234
b,Lisa,Simpson,5678
c,Milhouse,Van Houten,9101


When the merge is on the index of both DataFrames, we can also use the `.join()` method of the left DataFrame:

In [17]:
students.join(ssns)

Unnamed: 0,fname,lname,ssn
a,Bart,Simpson,1234
b,Lisa,Simpson,5678
c,Milhouse,Van Houten,9101
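
If the 1-to-1 assumption matters, `pd.merge` can be asked to check it with the `validate` argument. The sketch below assumes a reasonably recent pandas version. `bad_ssns` is a hypothetical table with a duplicated student id, added only to show the check failing.

```python
import pandas as pd

students = pd.DataFrame({'fname': ['Bart', 'Lisa', 'Milhouse'],
                         'lname': ['Simpson', 'Simpson', 'Van Houten']},
                        index=list('abc'))
ssns = pd.DataFrame({'ssn': [1234, 5678, 9101]}, index=list('abc'))

# validate='one_to_one' checks that the merge keys are unique on both sides.
checked = pd.merge(students, ssns, left_index=True, right_index=True,
                   validate='one_to_one')

# Hypothetical table in which student 'a' appears twice.
bad_ssns = pd.DataFrame({'ssn': [1234, 5678, 9101, 1111]}, index=list('abca'))
try:
    pd.merge(students, bad_ssns, left_index=True, right_index=True,
             validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    # The duplicate key violates the 1-to-1 relationship.
    raised = True

print(checked)
print('duplicate key detected:', raised)
```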


## 1-many relationships

### Students and addresses

* Each student has exactly one address.
* Each address can have many students.

In [18]:
pd.merge(students, addresses, left_index=True, right_index=True)

Unnamed: 0,fname,lname,address
a,Bart,Simpson,742 Evergreen Terrace
b,Lisa,Simpson,742 Evergreen Terrace
c,Milhouse,Van Houten,316 Pikeland Ave.


### Students and aliases

* Each student can have many aliases.
* Each alias belongs to exactly one student.

Here we are joining on the left DataFrame's index and the right DataFrame's `student` column:

In [19]:
pd.merge(students, aliases, left_index=True, right_on='student').set_index('student')

student,fname,lname,alias
a,Bart,Simpson,Bartman
a,Bart,Simpson,Bartron
a,Bart,Simpson,Cosmos
b,Lisa,Simpson,Truth Teller
b,Lisa,Simpson,Lady Penelope Ariel
b,Lisa,Simpson,Jake Boyman
c,Milhouse,Van Houten,Lou La Trec
c,Milhouse,Van Houten,Eagle Eye
c,Milhouse,Van Houten,Maestro


## Many-many relationships

* A student can take multiple classes.
* A single class can have multiple students.

In [20]:
m1 = pd.merge(students, enroll, left_index=True, right_on='student')
m1

Unnamed: 0,fname,lname,student
2,Bart,Simpson,a
0,Lisa,Simpson,b
1,Lisa,Simpson,b
0,Milhouse,Van Houten,c
1,Milhouse,Van Houten,c
2,Milhouse,Van Houten,c


In [21]:
pd.merge(m1, courses, left_index=True, right_index=True).sort_values('student')

Unnamed: 0,fname,lname,student,name
2,Bart,Simpson,a,PE
0,Lisa,Simpson,b,Biology
1,Lisa,Simpson,b,Math
0,Milhouse,Van Houten,c,Biology
1,Milhouse,Van Houten,c,Math
2,Milhouse,Van Houten,c,PE


In [22]:
pd.merge(m1, courses, left_index=True, right_index=True, how='outer').sort_values('student')

Unnamed: 0,fname,lname,student,name
2,Bart,Simpson,a,PE
0,Lisa,Simpson,b,Biology
1,Lisa,Simpson,b,Math
0,Milhouse,Van Houten,c,Biology
1,Milhouse,Van Houten,c,Math
2,Milhouse,Van Houten,c,PE
3,,,,Underwater electronics
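
The merges above rely on the index of `enroll` holding the course id. An alternative sketch, under the assumption that an explicit column is easier to follow, moves the course id into a column and chains two merges through the enrollment table. The column name `course` and the variables `link` and `roster` are introduced only for this example.

```python
import pandas as pd

students = pd.DataFrame({'fname': ['Bart', 'Lisa', 'Milhouse'],
                         'lname': ['Simpson', 'Simpson', 'Van Houten']},
                        index=list('abc'))
courses = pd.DataFrame({'name': ['Biology', 'Math', 'PE', 'Underwater electronics']},
                       index=range(4))
enroll = pd.DataFrame({'student': ['a', 'b', 'b', 'c', 'c', 'c']},
                      index=(2, 0, 1, 0, 1, 2))

# Move the course id out of the index into a column named 'course'
# (a name chosen here for illustration).
link = enroll.rename_axis('course').reset_index()

# Two 1-to-many merges through the link table give the many-to-many result.
roster = (link
          .merge(students, left_on='student', right_index=True)
          .merge(courses, left_on='course', right_index=True))
print(roster[['fname', 'name']])
```

Because both merges use the default `how='inner'`, a course with no enrolled students does not appear in `roster`.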


## Introduction to concatenation

Concatenation is closely related to merging and can be done on sets of `Series` or `DataFrames`. The basic idea is that `concat` simply stacks the different objects along a particular axis.

Here are three `Series`:

In [23]:
s1 = Series(range(5))
s2 = Series(range(5,10))
s3 = Series(range(10,15))

The default concatenation is along `axis=0`, which stacks the Series on top of each other. Notice how the indices of the different Series are preserved.

In [24]:
pd.concat([s1, s2, s3])

0     0
1     1
2     2
3     3
4     4
0     5
1     6
2     7
3     8
4     9
0    10
1    11
2    12
3    13
4    14
dtype: int64

If we pass `ignore_index=True`, the indices for each component are discarded and a new index is created:

In [25]:
pd.concat([s1, s2, s3], ignore_index=True)

0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
dtype: int64
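
A third option, sketched below, keeps the original indices and also records which input each value came from by passing `keys` to `pd.concat`. The labels `'s1'`, `'s2'`, and `'s3'` are arbitrary names chosen for this example.

```python
import pandas as pd

s1 = pd.Series(range(5))
s2 = pd.Series(range(5, 10))
s3 = pd.Series(range(10, 15))

# keys labels each input; the result gets a two-level (hierarchical) index.
stacked = pd.concat([s1, s2, s3], keys=['s1', 's2', 's3'])

# Select everything that came from s2 using the outer index level.
print(stacked.loc['s2'])
```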

If `axis=1` is set, the different objects are put side by side. In this case, the original Series have the same indices and the final DataFrame inherits that:

In [26]:
pd.concat([s1,s2], axis=1)

Unnamed: 0,0,1
0,0,5
1,1,6
2,2,7
3,3,8
4,4,9


However, if the different objects have different indices, the final DataFrame will have NaNs where the indices don't overlap:

In [27]:
s1.index=list('abcde')

In [28]:
pd.concat([s1,s2], axis=1)

Unnamed: 0,0,1
a,0.0,
b,1.0,
c,2.0,
d,3.0,
e,4.0,
0,,5.0
1,,6.0
2,,7.0
3,,8.0
4,,9.0
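
The NaNs appear because `concat` keeps the union of the index labels by default (`join='outer'`). As a minimal sketch, passing `join='inner'` keeps only the labels shared by every input. The Series `s4` is hypothetical and overlaps `s1` on just two labels.

```python
import pandas as pd

s1 = pd.Series(range(5), index=list('abcde'))

# Hypothetical Series that shares only labels 'a' and 'b' with s1.
s4 = pd.Series([100, 200, 300], index=list('abz'))

# Default join='outer' keeps the union of labels and fills gaps with NaN.
outer = pd.concat([s1, s4], axis=1)

# join='inner' keeps only labels present in every input.
inner = pd.concat([s1, s4], axis=1, join='inner')
print(inner)
```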


The `concat` function also works on DataFrames. Here we are stacking the `students` and `addresses` DataFrames on top of each other. It doesn't make much sense conceptually - the point is that `concat` is not "smart" in any way.

In [29]:
pd.concat([students, addresses])

Unnamed: 0,address,fname,lname
a,,Bart,Simpson
b,,Lisa,Simpson
c,,Milhouse,Van Houten
a,742 Evergreen Terrace,,
b,742 Evergreen Terrace,,
c,316 Pikeland Ave.,,


Using `axis=1` in this case provides a meaningful way of combining the `students` and `ssns` DataFrames:

In [30]:
pd.concat([students, ssns], axis=1)

Unnamed: 0,fname,lname,ssn
a,Bart,Simpson,1234
b,Lisa,Simpson,5678
c,Milhouse,Van Houten,9101


More than two DataFrames can be concatenated in a single call. `merge`, by contrast, combines only two DataFrames at a time.

In [31]:
pd.concat([students, ssns, addresses], axis=1)

Unnamed: 0,fname,lname,ssn,address
a,Bart,Simpson,1234,742 Evergreen Terrace
b,Lisa,Simpson,5678,742 Evergreen Terrace
c,Milhouse,Van Houten,9101,316 Pikeland Ave.
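
If merge semantics are still wanted for more than two DataFrames, one possible approach is to apply `merge` pairwise. The sketch below does this with `functools.reduce`. The helper `merge_on_index` is a hypothetical name introduced here. It also shows `DataFrame.join`, which accepts a list of DataFrames when joining on the index.

```python
from functools import reduce

import pandas as pd

students = pd.DataFrame({'fname': ['Bart', 'Lisa', 'Milhouse'],
                         'lname': ['Simpson', 'Simpson', 'Van Houten']},
                        index=list('abc'))
ssns = pd.DataFrame({'ssn': [1234, 5678, 9101]}, index=list('abc'))
addresses = pd.DataFrame({'address': ['742 Evergreen Terrace',
                                      '742 Evergreen Terrace',
                                      '316 Pikeland Ave.']},
                         index=list('abc'))


def merge_on_index(left, right):
    """Merge two DataFrames on their indices (hypothetical helper)."""
    return pd.merge(left, right, left_index=True, right_index=True)


# Apply the two-way merge repeatedly, left to right.
combined = reduce(merge_on_index, [students, ssns, addresses])

# join() on the index also accepts a list of DataFrames.
joined = students.join([ssns, addresses])

print(combined)
```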
