# Lesson: Merging Dataframes

### Objectives

- Concatenation 
- Merging - inner join 
- Merging - left join
- Merging - right join
- Merging - outer join

> ### Warm-up
> - Open a new Jupyter notebook
> - Do the "Glue tables together" warm up from the course material

In [1]:
import pandas as pd

#### Read the `penguins_mini.csv` (as df1) and `penguins_region.csv` (as df2) and check which columns and rows they have

In [2]:
df1 = pd.read_csv('../data/penguins_mini.csv')
df1

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,181.0,3750.0,male
1,Adelie,Dream,178.0,3900.0,male
2,Gentoo,Biscoe,211.0,4500.0,female
3,Gentoo,Biscoe,230.0,5700.0,male
4,Chinstrap,Dream,192.0,3500.0,female


In [3]:
df2 = pd.read_csv('../data/penguins_region.csv')
df2

Unnamed: 0,species,region
0,Adelie,Anvers
1,King,Tierra del Fuego
2,Emperor,Weddell sea
3,Chinstrap,Anvers
4,Gentoo,Anvers
5,Little Blue,Roaring Forties


The `pd.concat` function does for dataframes roughly what `.append()` and `.extend()` do for lists: it adds the contents of one collection to another.

In [4]:
numbers =[1,5,8,9]
letters =["h", "a", "b"]

In [5]:
# append the list `letters` to the list `numbers` (it is added as one nested element)

numbers.append(letters)

In [6]:
# extend adds each element of `letters` individually
numbers.extend(letters)

In [7]:
numbers

[1, 5, 8, 9, ['h', 'a', 'b'], 'h', 'a', 'b']

In [8]:
# using append on a dataframe...
# DataFrame.append was removed in pandas 2.0, so this raises an error; use pd.concat instead

df1.append(df1) # doesn't work

AttributeError: 'DataFrame' object has no attribute 'append'

## 1. Concatenation

- concatenate pandas dataframes along a particular axis with optional set logic along the other axes
- `axis=0` stacks the dataframes vertically (rows on top of each other); `axis=1` places them side by side (columns next to each other)
- can combine more than two dataframes at once
- several parameters control how the concatenation is done; the most important are **axis**, **ignore_index** and **sort**
- the main use of **axis=0** (or axis='rows') is when df1 and df2 have the **same columns**
- the main use of **axis=1** (or axis='columns') is when df1 and df2 hold **different columns** of the same observations
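
The two main uses listed above can be sketched on tiny hand-built frames. The frames `top`, `bottom` and `extra` are hypothetical stand-ins made up for this illustration; they are not the lesson's CSV files.

```python
import pandas as pd

# Hypothetical toy frames, not the lesson's CSV files
top = pd.DataFrame({"species": ["Adelie", "Gentoo"], "body_mass_g": [3750.0, 4500.0]})
bottom = pd.DataFrame({"species": ["Chinstrap"], "body_mass_g": [3500.0]})
extra = pd.DataFrame({"sex": ["male", "female"]})

# axis=0: the frames share the same columns, so the rows are stacked
stacked = pd.concat([top, bottom], axis=0)

# axis=1: the frames hold different columns of the same observations,
# so they are placed side by side and aligned on the index
side_by_side = pd.concat([top, extra], axis=1)

print(stacked.shape)               # (3, 2)
print(side_by_side.shape)          # (2, 3)
print(list(side_by_side.columns))  # ['species', 'body_mass_g', 'sex']
```

With `axis=1` the rows are matched by index label, not by position in the file, so this only makes sense when both frames describe the same observations in the same index.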

In [9]:
# concatenate df1 and df2


pd.concat([df1, df2])

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,region
0,Adelie,Torgersen,181.0,3750.0,male,
1,Adelie,Dream,178.0,3900.0,male,
2,Gentoo,Biscoe,211.0,4500.0,female,
3,Gentoo,Biscoe,230.0,5700.0,male,
4,Chinstrap,Dream,192.0,3500.0,female,
0,Adelie,,,,,Anvers
1,King,,,,,Tierra del Fuego
2,Emperor,,,,,Weddell sea
3,Chinstrap,,,,,Anvers
4,Gentoo,,,,,Anvers


In [10]:
# use the 'axis' parameter (axis=1 or axis='columns')

pd.concat([df1, df2], axis=1)

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,species.1,region
0,Adelie,Torgersen,181.0,3750.0,male,Adelie,Anvers
1,Adelie,Dream,178.0,3900.0,male,King,Tierra del Fuego
2,Gentoo,Biscoe,211.0,4500.0,female,Emperor,Weddell sea
3,Gentoo,Biscoe,230.0,5700.0,male,Chinstrap,Anvers
4,Chinstrap,Dream,192.0,3500.0,female,Gentoo,Anvers
5,,,,,,Little Blue,Roaring Forties


In [11]:
# try axis=0 (or axis='rows')



#### Can I concatenate 3 dataframes?

In [12]:
penguin_sweet = {
    'species': ['Adelie', 'Gentoo', 'Chinstrap'], 
    'sweetness': ['sweet', 'sweeter', 'sweetest']
}

df3 = pd.DataFrame(penguin_sweet)
df3

Unnamed: 0,species,sweetness
0,Adelie,sweet
1,Gentoo,sweeter
2,Chinstrap,sweetest


In [13]:
# concatenate 3 dataframes

pd.concat([df1, df2, df3], axis=1)

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,species.1,region,species.2,sweetness
0,Adelie,Torgersen,181.0,3750.0,male,Adelie,Anvers,Adelie,sweet
1,Adelie,Dream,178.0,3900.0,male,King,Tierra del Fuego,Gentoo,sweeter
2,Gentoo,Biscoe,211.0,4500.0,female,Emperor,Weddell sea,Chinstrap,sweetest
3,Gentoo,Biscoe,230.0,5700.0,male,Chinstrap,Anvers,,
4,Chinstrap,Dream,192.0,3500.0,female,Gentoo,Anvers,,
5,,,,,,Little Blue,Roaring Forties,,


#### Adding Parameters

In [14]:
# concatenate df1, df2 and df3 with 'ignore_index=True', which renumbers the rows from 0

pd.concat([df1, df2, df3], axis=0, ignore_index=True).fillna('-')

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,region,sweetness
0,Adelie,Torgersen,181.0,3750.0,male,-,-
1,Adelie,Dream,178.0,3900.0,male,-,-
2,Gentoo,Biscoe,211.0,4500.0,female,-,-
3,Gentoo,Biscoe,230.0,5700.0,male,-,-
4,Chinstrap,Dream,192.0,3500.0,female,-,-
5,Adelie,-,-,-,-,Anvers,-
6,King,-,-,-,-,Tierra del Fuego,-
7,Emperor,-,-,-,-,Weddell sea,-
8,Chinstrap,-,-,-,-,Anvers,-
9,Gentoo,-,-,-,-,Anvers,-


In [15]:
# 'sort=True' additionally sorts the columns alphabetically
pd.concat([df1, df2, df3], axis=0, ignore_index=True, sort=True).fillna('-')

Unnamed: 0,body_mass_g,flipper_length_mm,island,region,sex,species,sweetness
0,3750.0,181.0,Torgersen,-,male,Adelie,-
1,3900.0,178.0,Dream,-,male,Adelie,-
2,4500.0,211.0,Biscoe,-,female,Gentoo,-
3,5700.0,230.0,Biscoe,-,male,Gentoo,-
4,3500.0,192.0,Dream,-,female,Chinstrap,-
5,-,-,-,Anvers,-,Adelie,-
6,-,-,-,Tierra del Fuego,-,King,-
7,-,-,-,Weddell sea,-,Emperor,-
8,-,-,-,Anvers,-,Chinstrap,-
9,-,-,-,Anvers,-,Gentoo,-
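
A minimal sketch of what `ignore_index` and `sort` change, using two hypothetical toy frames (`islands` and `regions` are made up for this illustration):

```python
import pandas as pd

# Hypothetical toy frames, not the lesson's CSV files
islands = pd.DataFrame({"species": ["Adelie", "Gentoo"],
                        "island": ["Dream", "Biscoe"]})
regions = pd.DataFrame({"species": ["King", "Emperor"],
                        "region": ["Tierra del Fuego", "Weddell sea"]})

kept = pd.concat([islands, regions])                       # original index labels are kept
fresh = pd.concat([islands, regions], ignore_index=True)   # rows are renumbered
sorted_cols = pd.concat([islands, regions], ignore_index=True, sort=True)

print(list(kept.index))           # [0, 1, 0, 1]
print(list(fresh.index))          # [0, 1, 2, 3]
print(len(kept.loc[[0]]))         # 2, because the label 0 occurs twice
print(list(fresh.columns))        # ['species', 'island', 'region']
print(list(sorted_cols.columns))  # ['island', 'region', 'species']
```

Duplicate index labels are easy to overlook, which is one reason to pass `ignore_index=True` when the original row labels carry no meaning.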


## 2. Merging - Inner Join

- merges dataframes with a database-style **inner** join
- the dataframes need a key column to join on that exists in both of them (here `species`)
- an inner join keeps only the rows whose key value appears in both dataframes
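
As a small sketch of the inner join described above, the following uses two hypothetical toy frames (`left` and `right` are illustrative names, not the lesson's `df1` and `df2`):

```python
import pandas as pd

# Hypothetical toy frames, not the lesson's CSV files
left = pd.DataFrame({"species": ["Adelie", "Adelie", "Baby_Penguin"],
                     "island": ["Torgersen", "Dream", "Dream"]})
right = pd.DataFrame({"species": ["Adelie", "King"],
                      "region": ["Anvers", "Tierra del Fuego"]})

# Only 'Adelie' occurs in both frames, so only those rows survive
inner = pd.merge(left=left, right=right, how="inner", on="species")

print(inner)
print(len(inner))  # 2
```

`Baby_Penguin` (left only) and `King` (right only) are both dropped, while each of the two `Adelie` rows on the left is paired with the single matching row on the right.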

In [16]:
df1

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,181.0,3750.0,male
1,Adelie,Dream,178.0,3900.0,male
2,Gentoo,Biscoe,211.0,4500.0,female
3,Gentoo,Biscoe,230.0,5700.0,male
4,Chinstrap,Dream,192.0,3500.0,female


In [17]:
df2

Unnamed: 0,species,region
0,Adelie,Anvers
1,King,Tierra del Fuego
2,Emperor,Weddell sea
3,Chinstrap,Anvers
4,Gentoo,Anvers
5,Little Blue,Roaring Forties


### Adding a mismatch to df1: `Baby_Penguin` is not in df2

In [18]:
# add a row with index label 5 whose species has no match in df2
df1.loc[5] = pd.Series({
    'species': 'Baby_Penguin',
    'island': 'Dream',
    'flipper_length_mm': 92.0,
    'body_mass_g': 1500.0,
    'sex': 'female',
})

In [19]:
# option 1

pd.merge(left=df1, right=df2, how='inner', on='species')

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,region
0,Adelie,Torgersen,181.0,3750.0,male,Anvers
1,Adelie,Dream,178.0,3900.0,male,Anvers
2,Gentoo,Biscoe,211.0,4500.0,female,Anvers
3,Gentoo,Biscoe,230.0,5700.0,male,Anvers
4,Chinstrap,Dream,192.0,3500.0,female,Anvers


In [20]:
# option 2
# using keyword arguments
df1.merge(right=df2, how='inner', on='species')

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,region
0,Adelie,Torgersen,181.0,3750.0,male,Anvers
1,Adelie,Dream,178.0,3900.0,male,Anvers
2,Gentoo,Biscoe,211.0,4500.0,female,Anvers
3,Gentoo,Biscoe,230.0,5700.0,male,Anvers
4,Chinstrap,Dream,192.0,3500.0,female,Anvers


In [21]:
# or using positional arguments
df1.merge(df2, 'inner', 'species')

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,region
0,Adelie,Torgersen,181.0,3750.0,male,Anvers
1,Adelie,Dream,178.0,3900.0,male,Anvers
2,Gentoo,Biscoe,211.0,4500.0,female,Anvers
3,Gentoo,Biscoe,230.0,5700.0,male,Anvers
4,Chinstrap,Dream,192.0,3500.0,female,Anvers


## 3. Left Join

- merges dataframes with a database-style **left** join
- the dataframes need a key column to join on that exists in both of them (here `species`)
- a left join keeps every row of the left dataframe
- left rows without a match in the right dataframe get null values (NaN) in the columns that come from the right dataframe
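
A minimal sketch of the left join behaviour, again on hypothetical toy frames (`left` and `right` are made-up names for this illustration):

```python
import pandas as pd

# Hypothetical toy frames, not the lesson's CSV files
left = pd.DataFrame({"species": ["Adelie", "Adelie", "Baby_Penguin"],
                     "island": ["Torgersen", "Dream", "Dream"]})
right = pd.DataFrame({"species": ["Adelie", "King"],
                      "region": ["Anvers", "Tierra del Fuego"]})

left_join = pd.merge(left=left, right=right, how="left", on="species")

# Every left row is kept; rows without a partner have NaN in 'region'
unmatched = left_join[left_join["region"].isna()]

print(len(left_join))              # 3
print(list(unmatched["species"]))  # ['Baby_Penguin']
```

Counting the nulls in a column that comes from the right dataframe is a quick way to see how many left rows found no partner.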

In [22]:
# option 1

pd.merge(left=df1, right=df2, how='left', on='species')

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,region
0,Adelie,Torgersen,181.0,3750.0,male,Anvers
1,Adelie,Dream,178.0,3900.0,male,Anvers
2,Gentoo,Biscoe,211.0,4500.0,female,Anvers
3,Gentoo,Biscoe,230.0,5700.0,male,Anvers
4,Chinstrap,Dream,192.0,3500.0,female,Anvers
5,Baby_Penguin,Dream,92.0,1500.0,female,


In [23]:
# option 2

# df1.merge(right=df2... # try yourself

#### Using `indicator=True` to check how the dataframes were merged and if it corresponds to your expectations

In [24]:
pd.merge(left=df1, right=df2, how='left', on='species', indicator=True)

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,region,_merge
0,Adelie,Torgersen,181.0,3750.0,male,Anvers,both
1,Adelie,Dream,178.0,3900.0,male,Anvers,both
2,Gentoo,Biscoe,211.0,4500.0,female,Anvers,both
3,Gentoo,Biscoe,230.0,5700.0,male,Anvers,both
4,Chinstrap,Dream,192.0,3500.0,female,Anvers,both
5,Baby_Penguin,Dream,92.0,1500.0,female,,left_only
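
The `_merge` column can also be used as a filter, for example to list the left rows that found no partner. A hedged sketch on hypothetical toy frames (`left` and `right` are made up for this illustration):

```python
import pandas as pd

# Hypothetical toy frames, not the lesson's CSV files
left = pd.DataFrame({"species": ["Adelie", "Adelie", "Baby_Penguin"],
                     "island": ["Torgersen", "Dream", "Dream"]})
right = pd.DataFrame({"species": ["Adelie", "King"],
                      "region": ["Anvers", "Tierra del Fuego"]})

merged = pd.merge(left=left, right=right, how="left", on="species", indicator=True)

# Keep only the rows that exist in the left frame but not in the right one
left_only = merged[merged["_merge"] == "left_only"]

print(list(left_only["species"]))         # ['Baby_Penguin']
print((merged["_merge"] == "both").sum()) # 2
```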


## 4. Right join

- merges dataframes with a database-style **right** join
- the dataframes need a key column to join on that exists in both of them (here `species`)
- a right join keeps every row of the right dataframe
- right rows without a match in the left dataframe get null values (NaN) in the columns that come from the left dataframe
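
A right join returns the same information as a left join with the two dataframes swapped; only the column order differs. A small sketch, assuming hypothetical toy frames `left` and `right`:

```python
import pandas as pd

# Hypothetical toy frames, not the lesson's CSV files
left = pd.DataFrame({"species": ["Adelie", "Adelie", "Baby_Penguin"],
                     "island": ["Torgersen", "Dream", "Dream"]})
right = pd.DataFrame({"species": ["Adelie", "King"],
                      "region": ["Anvers", "Tierra del Fuego"]})

right_join = pd.merge(left=left, right=right, how="right", on="species")
flipped = pd.merge(left=right, right=left, how="left", on="species")

print(list(right_join.columns))  # ['species', 'island', 'region']
print(list(flipped.columns))     # ['species', 'region', 'island']
print(len(right_join), len(flipped))
```

Both results contain the two `Adelie` rows plus one `King` row with a null `island`; `Baby_Penguin` is dropped because it does not occur in the right frame.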

In [25]:
# option 1
pd.merge(left=df1, right=df2, how='right', on='species', indicator=True)

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,region,_merge
0,Adelie,Torgersen,181.0,3750.0,male,Anvers,both
1,Adelie,Dream,178.0,3900.0,male,Anvers,both
2,King,,,,,Tierra del Fuego,right_only
3,Emperor,,,,,Weddell sea,right_only
4,Chinstrap,Dream,192.0,3500.0,female,Anvers,both
5,Gentoo,Biscoe,211.0,4500.0,female,Anvers,both
6,Gentoo,Biscoe,230.0,5700.0,male,Anvers,both
7,Little Blue,,,,,Roaring Forties,right_only


In [26]:
# option 2
# with the .merge() method the calling dataframe is always the left one,
# so either pass how='right' or flip the dataframes and use how='left'

# df2.merge(right=df1... 

In [27]:
# add indicator=True

pd.merge(left=df1, right=df2, how='right', on='species', indicator=True)

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,region,_merge
0,Adelie,Torgersen,181.0,3750.0,male,Anvers,both
1,Adelie,Dream,178.0,3900.0,male,Anvers,both
2,King,,,,,Tierra del Fuego,right_only
3,Emperor,,,,,Weddell sea,right_only
4,Chinstrap,Dream,192.0,3500.0,female,Anvers,both
5,Gentoo,Biscoe,211.0,4500.0,female,Anvers,both
6,Gentoo,Biscoe,230.0,5700.0,male,Anvers,both
7,Little Blue,,,,,Roaring Forties,right_only


## 5. Outer Join

- merges dataframes with a database-style **outer** join
- the dataframes need a key column to join on that exists in both of them (here `species`)
- an outer join keeps every row from both dataframes
- rows without a match in the other dataframe get null values (NaN) in the columns that come from that other dataframe
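
To check an outer join at a glance, the `_merge` column from `indicator=True` can be counted. A minimal sketch on hypothetical toy frames (`left` and `right` are made up for this illustration):

```python
import pandas as pd

# Hypothetical toy frames, not the lesson's CSV files
left = pd.DataFrame({"species": ["Adelie", "Adelie", "Baby_Penguin"],
                     "island": ["Torgersen", "Dream", "Dream"]})
right = pd.DataFrame({"species": ["Adelie", "King"],
                      "region": ["Anvers", "Tierra del Fuego"]})

outer = pd.merge(left=left, right=right, how="outer", on="species", indicator=True)

# How many rows came from both frames, only the left, or only the right
counts = outer["_merge"].value_counts()

print(len(outer))            # 4
print(counts["both"])        # 2
print(counts["left_only"])   # 1
print(counts["right_only"])  # 1
```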

In [28]:
# option 1
pd.merge(left=df1, right=df2, how='outer', on='species', indicator=True)


Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,region,_merge
0,Adelie,Torgersen,181.0,3750.0,male,Anvers,both
1,Adelie,Dream,178.0,3900.0,male,Anvers,both
2,Gentoo,Biscoe,211.0,4500.0,female,Anvers,both
3,Gentoo,Biscoe,230.0,5700.0,male,Anvers,both
4,Chinstrap,Dream,192.0,3500.0,female,Anvers,both
5,Baby_Penguin,Dream,92.0,1500.0,female,,left_only
6,King,,,,,Tierra del Fuego,right_only
7,Emperor,,,,,Weddell sea,right_only
8,Little Blue,,,,,Roaring Forties,right_only


In [29]:
df1

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,181.0,3750.0,male
1,Adelie,Dream,178.0,3900.0,male
2,Gentoo,Biscoe,211.0,4500.0,female
3,Gentoo,Biscoe,230.0,5700.0,male
4,Chinstrap,Dream,192.0,3500.0,female
5,Baby_Penguin,Dream,92.0,1500.0,female


In [30]:
pd.merge(left=df1, right=df2, how='outer', on='species', indicator=True)

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,region,_merge
0,Adelie,Torgersen,181.0,3750.0,male,Anvers,both
1,Adelie,Dream,178.0,3900.0,male,Anvers,both
2,Gentoo,Biscoe,211.0,4500.0,female,Anvers,both
3,Gentoo,Biscoe,230.0,5700.0,male,Anvers,both
4,Chinstrap,Dream,192.0,3500.0,female,Anvers,both
5,Baby_Penguin,Dream,92.0,1500.0,female,,left_only
6,King,,,,,Tierra del Fuego,right_only
7,Emperor,,,,,Weddell sea,right_only
8,Little Blue,,,,,Roaring Forties,right_only


In [31]:
# sort=True will sort the rows on the 'species' column alphabetically 

pd.merge(left=df1, right=df2, how='outer', on='species', indicator=True, sort=True)

Unnamed: 0,species,island,flipper_length_mm,body_mass_g,sex,region,_merge
0,Adelie,Torgersen,181.0,3750.0,male,Anvers,both
1,Adelie,Dream,178.0,3900.0,male,Anvers,both
2,Baby_Penguin,Dream,92.0,1500.0,female,,left_only
3,Chinstrap,Dream,192.0,3500.0,female,Anvers,both
4,Emperor,,,,,Weddell sea,right_only
5,Gentoo,Biscoe,211.0,4500.0,female,Anvers,both
6,Gentoo,Biscoe,230.0,5700.0,male,Anvers,both
7,King,,,,,Tierra del Fuego,right_only
8,Little Blue,,,,,Roaring Forties,right_only


In [32]:
# option 2

# df1.merge(right=df2... # try yourself