# Merging
* There are many different ways to teach merging, and merging in pandas and in SQL are *very* similar
* This Stack Overflow post goes through a bunch of them: https://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join
* Some people don't like the venn diagram approach, but for me it works well, so let's start there


<img src="https://i.stack.imgur.com/hMKKt.jpg" />

* The crux of the question is, how do you take two dataframes and join them together into one?
* Remember that the dataframe is made up of two axes, rows and columns
* Rows and columns are close to identical underneath: both have indices (or names), and we can swap them trivially with `.T`
* So the mental model I give you now is actually going to be a bit wrong, but hopefully it will suffice
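* As a quick sketch of that row/column symmetry, here is a tiny made-up frame (the names `df` and `flipped` are just for illustration) transposed with `.T`:

```python
import pandas as pd

# A tiny illustrative frame: people are the row labels, 'Role' is the only column
df = pd.DataFrame({'Role': ['Grader', 'Director of HR']},
                  index=['James', 'Kelly'])

# .T swaps the two axes: row labels become column labels and vice versa
flipped = df.T

print(df.index.tolist(), df.columns.tolist())
print(flipped.index.tolist(), flipped.columns.tolist())
```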

* Here's our scenario: we have a `DataFrame` of students and one of staff
* Turns out students can be staff! Look at Sally and James...
* So when we join our dataframes together, who are we interested in?
1. Only students who are also staff?
2. Students who are not staff? Staff who are not students?
3. Students, regardless of whether they are staff or not, but if they are staff we want the staff details too?
4. Ugh. What a mess...

In [None]:
import pandas as pd

staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liaison'},
                         {'Name': 'James', 'Role': 'Grader'}])
# And let's index these staff by name
staff_df = staff_df.set_index('Name')

# Now we'll create a student dataframe
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
# And we'll index this by name too
student_df = student_df.set_index('Name')

In [None]:
staff_df

In [None]:
student_df

* Ok, we have two different dataframes (one has a Role the other a School) but they are indexed the same. That's a good start
* Let's just try to get a list of everyone and their details. This is called a union, or outer join: we keep every name that appears in either dataframe (more rows) and every column from both (more columns)

In [None]:
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)

* Notice how we have both more columns and more rows, and how there are some missing values, since Kelly doesn't have a school and Mike doesn't have a role
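* If you want to see where each row of the outer join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. This is a small self-contained sketch; the names `everyone`, `staff_only`, and `students_only` are just for illustration:

```python
import pandas as pd

staff_df = pd.DataFrame({'Name': ['Kelly', 'Sally', 'James'],
                         'Role': ['Director of HR', 'Course liaison', 'Grader']}
                        ).set_index('Name')
student_df = pd.DataFrame({'Name': ['James', 'Mike', 'Sally'],
                           'School': ['Business', 'Law', 'Engineering']}
                          ).set_index('Name')

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
everyone = pd.merge(staff_df, student_df, how='outer',
                    left_index=True, right_index=True, indicator=True)
print(everyone)

# Filtering on '_merge' answers "staff who are not students" and the reverse
staff_only = everyone[everyone['_merge'] == 'left_only']
students_only = everyone[everyone['_merge'] == 'right_only']
```

* Filtering on `_merge` is one way to answer question 2 from the list at the top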

In [None]:
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)

* Now notice how we have only kept the rows where there is overlap (Sally and James), but we still have all of the columns of both DataFrames
* The `how` argument only decides which rows survive. Here pandas checks join membership on the index, not the columns, so you always get all the columns from both sides.

In [None]:
# what will this produce?
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)

* Notice how pandas kept anyone involved in the right dataframe, the students, regardless of whether they were in the left dataframe
* People who were also in the left dataframe had their staff information filled in, everyone else (Mike) just got `NaN`
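* A right join is the mirror image of a left join: swapping the two dataframes and using `how='left'` should keep the same people. A small self-contained sketch (the names `right_join` and `left_join` are just for illustration):

```python
import pandas as pd

staff_df = pd.DataFrame({'Name': ['Kelly', 'Sally', 'James'],
                         'Role': ['Director of HR', 'Course liaison', 'Grader']}
                        ).set_index('Name')
student_df = pd.DataFrame({'Name': ['James', 'Mike', 'Sally'],
                           'School': ['Business', 'Law', 'Engineering']}
                          ).set_index('Name')

# Keep every student, pulling in staff details where they exist
right_join = pd.merge(staff_df, student_df, how='right',
                      left_index=True, right_index=True)

# Same idea with the arguments swapped: students are now the left frame
left_join = pd.merge(student_df, staff_df, how='left',
                     left_index=True, right_index=True)

print(right_join)
print(left_join)  # same people, only the column order differs
```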

In [None]:
# We can also join on columns instead of indices, which is cool!
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()

In [None]:
staff_df

In [None]:
student_df

In [None]:
pd.merge(staff_df, student_df, how='right', on='Name')

* (this is how I do it 90% of the time)

* What if we have conflicts between dataframes?

In [None]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 
                          'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liaison', 
                          'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 
                          'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 
                            'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 
                            'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 
                            'Location': '512 Wilson Crescent'}])
student_df

In [None]:
# quick, what's the meaning of this merge?
pd.merge(staff_df, student_df, how='left', on='Name')
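* Both dataframes have a `Location` column, so pandas keeps both and renames them with the default suffixes `_x` (left) and `_y` (right). The `suffixes` argument lets you pick clearer names. A self-contained sketch; the suffix strings `_staff` and `_student` are just my choice here:

```python
import pandas as pd

staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR',
                          'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liaison',
                          'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader',
                          'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business',
                            'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law',
                            'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering',
                            'Location': '512 Wilson Crescent'}])

# Default behaviour: the clashing column becomes Location_x and Location_y
default = pd.merge(staff_df, student_df, how='left', on='Name')
print(default.columns.tolist())

# Explicit suffixes make it obvious which dataframe each Location came from
labelled = pd.merge(staff_df, student_df, how='left', on='Name',
                    suffixes=('_staff', '_student'))
print(labelled)
```

* Kelly is staff only, so her student location comes back as `NaN` in the left join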

In [None]:
# What do we do if we want to match on multiple columns, like first and last name?
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 
                          'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 
                          'Role': 'Course liaison'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 
                          'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 
                            'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 
                            'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 
                            'School': 'Engineering'}])

In [None]:
pd.merge(staff_df, student_df, how='inner', on=['First Name','Last Name'])
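* To see why both names matter, compare the merge on both columns with a merge on `First Name` alone. This is a self-contained sketch; `both_names` and `first_only` are illustrative names:

```python
import pandas as pd

staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins',
                          'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks',
                          'Role': 'Course liaison'},
                         {'First Name': 'James', 'Last Name': 'Wilde',
                          'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond',
                            'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith',
                            'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks',
                            'School': 'Engineering'}])

# Matching on both columns: only Sally Brooks appears in both dataframes
both_names = pd.merge(staff_df, student_df, how='inner',
                      on=['First Name', 'Last Name'])
print(both_names)

# Matching on first name only: James Wilde and James Hammond get paired up,
# even though they are different people
first_only = pd.merge(staff_df, student_df, how='inner', on='First Name')
print(first_only)
```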

* One last mention: if we just want to stack the rows of several dataframes on top of each other, we use `pd.concat`

In [None]:
staff1_df = pd.DataFrame([{'Name': 'James', 'Role': 'Grader', 
                           'Location': 'Washington Avenue'}])
staff2_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 
                           'Location': 'State Street'},
                          {'Name': 'Sally', 'Role': 'Course liaison', 
                           'Location': 'Washington Avenue'}])

# keys is optional: it adds an outer index level recording which dataframe each row came from
pd.concat([staff1_df, staff2_df], keys=['staff1', 'staff2'])
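* If you don't care where the rows came from, `ignore_index=True` is an alternative to `keys` that just renumbers the rows. A self-contained sketch comparing the two (the names `labelled` and `renumbered` are illustrative):

```python
import pandas as pd

staff1_df = pd.DataFrame([{'Name': 'James', 'Role': 'Grader',
                           'Location': 'Washington Avenue'}])
staff2_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR',
                           'Location': 'State Street'},
                          {'Name': 'Sally', 'Role': 'Course liaison',
                           'Location': 'Washington Avenue'}])

# keys= builds a two-level index: (source label, original row number)
labelled = pd.concat([staff1_df, staff2_df], keys=['staff1', 'staff2'])
print(labelled.index.tolist())

# ignore_index=True throws the old row numbers away and counts from 0
renumbered = pd.concat([staff1_df, staff2_df], ignore_index=True)
print(renumbered.index.tolist())
```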