<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Pandas: Combining Data
              
</p>
</div>

Data Science Cohort Live NYC Feb 2022
<p>Phase 1: Topic 5</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

#### Combining data

- Have two datasets related in some way.
- Want to combine them together:
    - concatenate: pd.concat()
    - joins: pd.merge() and DataFrame.join()

We'll go through some of these in this lecture.

#### Concatenating

Stitches multiple datasets together along one axis:

- Two use cases (set by axis):
    - Same columns (axis = 0): stitches by appending rows/observations.
    - Same rows (axis = 1): stitches by appending new columns/attributes.
    

pd.concat([df1, df2, df3], axis = 0)

<figure><center><img src = "Images/concat.png" width = 400></center>
</figure>

#### pd.concat([df1, df4], axis = 1)

<figure><center><img src = "Images/concat_columns.png" width = 700></center>
</figure>
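A minimal sketch of both use cases, using small made-up frames (`rows_a`, `rows_b` and `cols_c` are illustrative names and are not the dataframes pictured in the figures):

```python
import pandas as pd

# Hypothetical toy frames for illustration only
rows_a = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
rows_b = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
cols_c = pd.DataFrame({'C': ['C0', 'C1']})

# axis = 0: same columns, so rows/observations are appended.
# ignore_index=True rebuilds a clean 0..n-1 index instead of repeating 0, 1.
stacked = pd.concat([rows_a, rows_b], axis=0, ignore_index=True)

# axis = 1: same row index, so new columns/attributes are appended.
wide = pd.concat([rows_a, cols_c], axis=1)

print(stacked)
print(wide)
```

Without ignore_index = True the stacked frame keeps the original labels, so the index would contain 0 and 1 twice.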

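Optional logic: pd.concat([df1, df2, ...], axis = __, join = __):

- default: join = 'outer' (keeps the union of labels and fills the gaps with NaN)
- join = 'inner'

'inner': keeps only the labels on the other axis (the row index when axis = 1) that are common to all dataframes.

Note that pd.concat() takes join, not how; how is the keyword used by pd.merge() and DataFrame.join().

#### pd.concat([df1, df4], axis = 1, join = 'inner')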

<figure><center><img src = "Images/merging_concat_inner.png" width = 1200></center>
</figure>
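The difference between the two options can be sketched with two small frames whose indices only partly overlap. The frames `left` and `right` are made up for illustration. In pd.concat() the keyword that controls this behavior is join:

```python
import pandas as pd

# Hypothetical frames with partially overlapping row labels
left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=[0, 1, 2])
right = pd.DataFrame({'B': ['B1', 'B2', 'B3']}, index=[1, 2, 3])

# Default join='outer': union of row labels, missing cells become NaN
outer = pd.concat([left, right], axis=1)

# join='inner': only the row labels present in both frames (1 and 2)
inner = pd.concat([left, right], axis=1, join='inner')

print(outer)
print(inner)
```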

#### Joins

Datasets do not need to share the same rows or columns.
- They just need a common key (or set of keys) used to match records.

pd.merge() is the most flexible workhorse function for this:

In [1]:
# create two datasets
import pandas as pd
df1 = pd.DataFrame({'employee': ['Chadwick', 'Bartholemew', 'Jake', 'Brunnhilde', 'Sue', 'Jimbo Jr.'],
                    'group': ['Building' ,'Accounting', 'Engineering', 'Engineering', 'HR', 'Compliance']})

df2 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR', 'Endowment'],
                    'supervisor': ['Carly', 'Guido', 'Steve', 'Eileen']})
df3 = pd.DataFrame({'name': ['Brunnhilde', 'Bartholemew', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})


In [2]:
df1

Unnamed: 0,employee,group
0,Chadwick,Building
1,Bartholemew,Accounting
2,Jake,Engineering
3,Brunnhilde,Engineering
4,Sue,HR
5,Jimbo Jr.,Compliance


In [3]:
df2

Unnamed: 0,group,supervisor
0,Accounting,Carly
1,Engineering,Guido
2,HR,Steve
3,Endowment,Eileen


In [4]:
pd.merge(df1, df2, how = 'inner', on = 'group')

Unnamed: 0,employee,group,supervisor
0,Bartholemew,Accounting,Carly
1,Jake,Engineering,Guido
2,Brunnhilde,Engineering,Guido
3,Sue,HR,Steve


In [5]:
pd.merge(df1, df2, how = 'left', on = 'group')

Unnamed: 0,employee,group,supervisor
0,Chadwick,Building,
1,Bartholemew,Accounting,Carly
2,Jake,Engineering,Guido
3,Brunnhilde,Engineering,Guido
4,Sue,HR,Steve
5,Jimbo Jr.,Compliance,


In [6]:
pd.merge(df1, df2, how = 'right', on = 'group')

Unnamed: 0,employee,group,supervisor
0,Bartholemew,Accounting,Carly
1,Jake,Engineering,Guido
2,Brunnhilde,Engineering,Guido
3,Sue,HR,Steve
4,,Endowment,Eileen


#### Joins

- Inner Join: only records with matching keys in both tables
- Left Join: all records from the left table + records from the right table with matching keys (NaN where there is no match)
- Right Join: all records from the right table + records from the left table with matching keys (NaN where there is no match)
- Outer Join: all records from both tables, matched where possible (NaN elsewhere)

<div>
    <center><img src="Images/Venn.png" align = "center" width="900"/></center>
</div>
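The outer join is the one type not run in the cells above. A short sketch with the same df1 and df2 follows. The indicator = True argument is an addition not used elsewhere in this lecture: it adds a `_merge` column recording which table each row came from, which is handy for checking a join.

```python
import pandas as pd

# Same frames as defined earlier in the lecture
df1 = pd.DataFrame({'employee': ['Chadwick', 'Bartholemew', 'Jake', 'Brunnhilde', 'Sue', 'Jimbo Jr.'],
                    'group': ['Building', 'Accounting', 'Engineering', 'Engineering', 'HR', 'Compliance']})
df2 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR', 'Endowment'],
                    'supervisor': ['Carly', 'Guido', 'Steve', 'Eileen']})

# Outer join: keep every record from both tables
outer = pd.merge(df1, df2, how='outer', on='group', indicator=True)

# Count rows by origin: 'both', 'left_only' or 'right_only'
counts = outer['_merge'].value_counts()

print(outer)
print(counts)
```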
    


Join on a key that has a different column label in each table:

In [5]:
df1

Unnamed: 0,employee,group
0,Chadwick,Building
1,Bartholemew,Accounting
2,Jake,Engineering
3,Brunnhilde,Engineering
4,Sue,HR
5,Jimbo Jr.,Compliance


In [6]:
df3

Unnamed: 0,name,hire_date
0,Brunnhilde,2004
1,Bartholemew,2008
2,Jake,2012
3,Sue,2014


In [9]:
pd.merge(df1, df3, left_on = 'employee', right_on = 'name', how = 'inner')

Unnamed: 0,employee,group,name,hire_date
0,Bartholemew,Accounting,Bartholemew,2008
1,Jake,Engineering,Jake,2012
2,Brunnhilde,Engineering,Brunnhilde,2004
3,Sue,HR,Sue,2014
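
Because the key columns have different labels, the result carries both employee and name with identical values. One possible cleanup, sketched here as a suggestion rather than a required step, is to drop the redundant column after merging:

```python
import pandas as pd

# Same frames as defined earlier in the lecture
df1 = pd.DataFrame({'employee': ['Chadwick', 'Bartholemew', 'Jake', 'Brunnhilde', 'Sue', 'Jimbo Jr.'],
                    'group': ['Building', 'Accounting', 'Engineering', 'Engineering', 'HR', 'Compliance']})
df3 = pd.DataFrame({'name': ['Brunnhilde', 'Bartholemew', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

merged = pd.merge(df1, df3, left_on='employee', right_on='name', how='inner')

# 'name' duplicates 'employee' after an inner join, so it can be dropped
tidy = merged.drop(columns='name')

print(tidy)
```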


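Can do a bit more when joining:
- pd.merge() can match on multiple columns as opposed to one: pass a list of column names to on.
- df1.join(df2, how = __): similar to merge but less flexible. Joins on the index by default, and how defaults to 'left'. Can be faster than merge when the keys are already set as the index.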

In [10]:
df1.set_index('group').join(df2.set_index('group'), how = 'inner')

group,employee,supervisor
Accounting,Bartholemew,Carly
Engineering,Jake,Guido
Engineering,Brunnhilde,Guido
HR,Sue,Steve
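
Matching on multiple columns, mentioned in the list above, can be sketched as follows. The `sales` and `staff` frames and their column names are hypothetical and serve only to show a list being passed to on:

```python
import pandas as pd

# Hypothetical tables keyed by the (store, year) pair
sales = pd.DataFrame({'store': ['NYC', 'NYC', 'LA'],
                      'year': [2021, 2022, 2022],
                      'revenue': [100, 120, 90]})
staff = pd.DataFrame({'store': ['NYC', 'LA', 'LA'],
                      'year': [2022, 2021, 2022],
                      'headcount': [12, 7, 8]})

# A row matches only when BOTH store and year agree
both = pd.merge(sales, staff, on=['store', 'year'], how='inner')

print(both)
```

Only the (NYC, 2022) and (LA, 2022) pairs appear in both tables, so the inner join keeps just those two rows.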


Data in real life can be messy:

- Often keys have misspellings or don't exactly match up.
- Determine whether two keys are similar enough.
- Then link the records if they are.

Record linkage:

- Check out:
    - recordlinkage package
    - fuzzywuzzy package
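
As a rough illustration of the idea, the sketch below uses difflib from the Python standard library instead of the packages listed above, so it runs without extra installs. The `messy` frame, the `closest_key` helper and the 0.8 cutoff are all assumptions made for this example. Real record linkage usually needs more careful scoring and manual review.

```python
import difflib
import pandas as pd

# Clean lookup table (same groups as df2, minus Endowment)
clean = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                      'supervisor': ['Carly', 'Guido', 'Steve']})

# Hypothetical messy table: a misspelling and a case mismatch in the key
messy = pd.DataFrame({'employee': ['Bartholemew', 'Jake', 'Sue'],
                      'group_raw': ['Acounting', 'engineering', 'HR']})


def closest_key(value, candidates, cutoff=0.8):
    """Return the candidate most similar to value, or None if none reaches the cutoff."""
    lowered = {c.lower(): c for c in candidates}  # compare case-insensitively
    matches = difflib.get_close_matches(value.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


# Step 1: decide whether each messy key is similar enough to a clean key
valid_groups = clean['group'].tolist()
messy['group'] = messy['group_raw'].apply(lambda v: closest_key(v, valid_groups))

# Step 2: link the records on the repaired key
linked = pd.merge(messy, clean, on='group', how='left')

print(linked)
```

Keys that do not clear the cutoff come back as None and end up with a missing supervisor after the left join, which makes unmatched records easy to find.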