# Merging
* There are many different ways to teach merging, and merging in pandas and SQL are *very* similar
* This Stack Overflow post goes through a bunch of them: https://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join
* Some people don't like the Venn diagram approach, but it works well for me, so let's start there


<img src="https://i.stack.imgur.com/hMKKt.jpg" />

* The crux of the question is, how do you take two dataframes and join them together into one?
* Remember that the dataframe is made up of two axes, rows and columns
* Rows and columns are identical underneath - they both have indices (or names) and we can swap them trivially with `T`
* So the mental model I give you now is actually going to be a bit wrong, but hopefully it will suffice
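
To make the "two axes" point concrete, here is a small sketch showing that `T` simply swaps the row labels and the column labels. The `df` frame and its `pizza_a`/`pizza_b` labels are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the two axes
df = pd.DataFrame({'sauce': ['red', 'white'], 'cheese': ['mozzarella', 'cheddar']},
                  index=['pizza_a', 'pizza_b'])

flipped = df.T  # transpose: row labels become column labels and vice versa

print(list(df.index), list(df.columns))
print(list(flipped.index), list(flipped.columns))
```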

In [16]:
#pizza practice
import pandas as pd #importing pandas 

pizza_a = {'salami': [2, 2], 'sauce': ['red', 'red'], 'cheese': ['mozzerella', 'mozzerella']} #creating variable called pizza_a with quantity of salami, sauce type, and cheese type
pizza_b = {'pepper': [7, 6], 'sauce': ['white', 'white'], 'cheese': ['mozzerella', 'mozzerella']} #creating variable called pizza_b with quantity of pepper, sauce type, and cheese type

left = pd.DataFrame(data=pizza_a) #creating variable called left which is a dataframe using data from pizza_a
right = pd.DataFrame(data=pizza_b) #creating variable called right which is a dataframe using data from pizza_b

print(left.head(), '\n\n') #prints first couple rows of left with spaces
print(right.head(), '\n\n') #prints first couple rows of right with spaces

df_merge = pd.merge(left, right) #creating variable called df_merge that merges left and right dataframes
print(df_merge, '\n\n') #prints df_merge

   salami sauce      cheese
0       2   red  mozzerella
1       2   red  mozzerella 


   pepper  sauce      cheese
0       7  white  mozzerella
1       6  white  mozzerella 


Empty DataFrame
Columns: [salami, sauce, cheese, pepper]
Index: [] 
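
Why is the result empty? With no `on` argument, `pd.merge` performs an inner join on every column name the two frames share, here `sauce` and `cheese`. No row has the same sauce on both sides, so nothing matches. A minimal sketch that spells out those defaults (the frames are rebuilt so the block runs on its own):

```python
import pandas as pd

left = pd.DataFrame({'salami': [2, 2], 'sauce': ['red', 'red'],
                     'cheese': ['mozzarella', 'mozzarella']})
right = pd.DataFrame({'pepper': [7, 6], 'sauce': ['white', 'white'],
                      'cheese': ['mozzarella', 'mozzarella']})

# Column names present in both frames: these are the default join keys
shared = [col for col in left.columns if col in right.columns]
print(shared)

# The default call, and the same call with its defaults written out
implicit = pd.merge(left, right)
explicit = pd.merge(left, right, how='inner', on=shared)
print(len(implicit), len(explicit))
```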




In [17]:
df_outermerge = pd.merge(left, right, how='outer') #outer merge on the shared columns (sauce, cheese): keeps every row from both dataframes
print(df_outermerge, '\n\n') #printing df_outermerge, which is a DataFrame, followed by blank lines

   salami  sauce      cheese  pepper
0     2.0    red  mozzerella     NaN
1     2.0    red  mozzerella     NaN
2     NaN  white  mozzerella     7.0
3     NaN  white  mozzerella     6.0 




In [18]:
df_outermerge = pd.merge(left, right, how='outer', on=['cheese']) #outer merge keyed on the cheese column only; sauce is no longer a key, so both copies are kept as sauce_x and sauce_y
print(df_outermerge, '\n\n')

   salami sauce_x      cheese  pepper sauce_y
0       2     red  mozzerella       7   white
1       2     red  mozzerella       6   white
2       2     red  mozzerella       7   white
3       2     red  mozzerella       6   white 
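
Every row shares the same `cheese` value, so each of the two left rows pairs with each of the two right rows, giving 2 x 2 = 4 rows. If that kind of fan-out would be a mistake in your data, the `validate` argument of `pd.merge` can check key uniqueness for you. A sketch, with the frames rebuilt so it runs standalone:

```python
import pandas as pd

left = pd.DataFrame({'salami': [2, 2], 'sauce': ['red', 'red'],
                     'cheese': ['mozzarella', 'mozzarella']})
right = pd.DataFrame({'pepper': [7, 6], 'sauce': ['white', 'white'],
                      'cheese': ['mozzarella', 'mozzarella']})

merged = pd.merge(left, right, how='outer', on=['cheese'])
print(len(merged))  # one row per (left row, right row) pair

# Ask pandas to confirm the key is unique on both sides; here it is not
try:
    pd.merge(left, right, how='outer', on=['cheese'], validate='one_to_one')
    fan_out_detected = False
except pd.errors.MergeError:
    fan_out_detected = True
print(fan_out_detected)
```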




* Here's our scenario: we have a `DataFrame` of students and one of staff
* It turns out students can be staff! Look at our IAs...
* So when we join our dataframes together, who are we interested in?
    1. Only students who are also staff?
    2. Students who are not staff? Staff who are not students?
    3. Students, regardless of whether they are staff or not, but if they are staff we want the staff details too?

Ugh, what a mess...
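
One way to map the three questions onto `pd.merge`, as a sketch: question 1 is an inner join, question 3 is a left join with students on the left, and question 2 can be answered with an outer join plus `indicator=True`, which adds a `_merge` column saying which side each row came from. The small `staff` and `students` frames here are stand-ins for the ones built in the next cell.

```python
import pandas as pd

staff = pd.DataFrame({'Name': ['Kelly', 'Sally', 'James'],
                      'Role': ['Director of HR', 'Course Liaison', 'Grader']})
students = pd.DataFrame({'Name': ['James', 'Mike', 'Sally'],
                         'School': ['Business', 'Law', 'Engineering']})

# 1. Only students who are also staff
both = pd.merge(students, staff, how='inner', on='Name')

# 2. People on exactly one side: '_merge' is 'left_only', 'right_only' or 'both'
everyone = pd.merge(students, staff, how='outer', on='Name', indicator=True)
students_only = everyone[everyone['_merge'] == 'left_only']
staff_only = everyone[everyone['_merge'] == 'right_only']

# 3. Every student, with staff details where they exist
all_students = pd.merge(students, staff, how='left', on='Name')

print(sorted(both['Name']))
print(list(students_only['Name']), list(staff_only['Name']))
print(len(all_students))
```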

In [19]:
import pandas as pd #importing pandas

#staff dataframe
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},  #creating dataframe called staff_df which contains names and roles
                         {'Name': 'Sally', 'Role': 'Course Liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
# And lets index these staff by name
staff_df = staff_df.set_index('Name') #setting index of staff_df as staff names

# Now we'll create a student dataframe
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},  #creating dataframe called student_df which contains names and schools
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
# And we'll index this by name too
student_df = student_df.set_index('Name') #setting index student_df as student names

In [20]:
staff_df #printing out staff_df dataframe

                 Role
Name
Kelly  Director of HR
Sally  Course Liasion
James          Grader


In [21]:
student_df #printing out student_df dataframe

            School
Name
James     Business
Mike           Law
Sally  Engineering


* Ok, we have two different dataframes (one has a Role the other a School) but they are indexed the same. That's a good start
* Let's just try and get a list of everyone and their details. This is called a union, or outer join, and we're actually interested in unioning in both directions, along the rows and the columns

In [22]:
pd.merge(staff_df, student_df, how='outer') #trying an outer merge; this won't work because there are no common columns to merge on (staff_df has Role, student_df has School, and Name is the index, not a column)

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

In [23]:
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True) #outer merge that matches on the index of both dataframes (left_index/right_index=True), keeping every name from either side

                 Role       School
Name
James          Grader     Business
Kelly  Director of HR          NaN
Mike              NaN          Law
Sally  Course Liasion  Engineering


* Notice how we have both more columns and more rows, and how there are some missing values, since Kelly doesn't have a school and Mike doesn't have a role
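
When both frames are already indexed by the key, `DataFrame.join` is a shorthand for the same index-on-index merge. A small sketch comparing the two, with the frames rebuilt so it runs standalone:

```python
import pandas as pd

staff_df = pd.DataFrame({'Name': ['Kelly', 'Sally', 'James'],
                         'Role': ['Director of HR', 'Course Liaison', 'Grader']}).set_index('Name')
student_df = pd.DataFrame({'Name': ['James', 'Mike', 'Sally'],
                           'School': ['Business', 'Law', 'Engineering']}).set_index('Name')

# Index-on-index outer merge, written two ways
via_merge = pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)
via_join = staff_df.join(student_df, how='outer')

print(via_merge)
print(via_join)
```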

In [27]:
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True) #inner merge on the index of both dataframes: keeps only the names that appear in both

                 Role       School
Name
Sally  Course Liasion  Engineering
James          Grader     Business


* Now notice how we have only kept the names where there is overlap, but we still have all of the columns of both DataFrames
* Because we passed `left_index=True` and `right_index=True`, pandas decides join membership by the index and not the columns; you always get all the columns.

In [25]:
#what will this produce?
result_df = pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True) #right merge: keeps every index label in the right dataframe and adds data from the left dataframe only where the index labels match
result_df

                 Role       School
Name
James          Grader     Business
Mike              NaN          Law
Sally  Course Liasion  Engineering


* Notice how pandas kept anyone involved in the right dataframe, the students, regardless of whether they were in the left dataframe
* People who were in the left dataframe had their new information populated; everyone else (Mike) just got NaN
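
A right join is just a left join with the two frames swapped; only the column order differs. A quick sketch to check that on the same kind of data (frames rebuilt so the block is standalone):

```python
import pandas as pd

staff_df = pd.DataFrame({'Name': ['Kelly', 'Sally', 'James'],
                         'Role': ['Director of HR', 'Course Liaison', 'Grader']}).set_index('Name')
student_df = pd.DataFrame({'Name': ['James', 'Mike', 'Sally'],
                           'School': ['Business', 'Law', 'Engineering']}).set_index('Name')

# Keep every student, written as a right join and as a left join
right_join = pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)
left_join = pd.merge(student_df, staff_df, how='left', left_index=True, right_index=True)

print(list(right_join.columns), list(left_join.columns))
print(sorted(right_join.index) == sorted(left_join.index))
```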

In [15]:
# We can also join on columns instead of indices!
staff_df = staff_df.reset_index() #resetting the index, which moves Name back to a regular column
student_df = student_df.reset_index() #resetting the index, which moves Name back to a regular column

In [16]:
staff_df #displaying staff_df, now with the default integer index

    Name            Role
0  Kelly  Director of HR
1  Sally  Course Liasion
2  James          Grader


In [17]:
student_df #displaying student_df, now with the default integer index

    Name       School
0  James     Business
1   Mike          Law
2  Sally  Engineering


In [18]:
pd.merge(staff_df, student_df, how='right', on='Name') #right merge on the Name column: keeps every row of the right dataframe and adds data from the left dataframe only where Name matches

    Name            Role       School
0  Sally  Course Liasion  Engineering
1  James          Grader     Business
2   Mike             NaN          Law


* (this is how I do it 90% of the time)
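
If the key column is named differently in each frame, `left_on` and `right_on` name it per side. In this sketch the `Student Name` column is a made-up variation on the data above, not something from the original frames.

```python
import pandas as pd

staff_df = pd.DataFrame({'Name': ['Kelly', 'Sally', 'James'],
                         'Role': ['Director of HR', 'Course Liaison', 'Grader']})
# Hypothetical: the student frame calls its key column 'Student Name'
student_df = pd.DataFrame({'Student Name': ['James', 'Mike', 'Sally'],
                           'School': ['Business', 'Law', 'Engineering']})

merged = pd.merge(staff_df, student_df, how='right',
                  left_on='Name', right_on='Student Name')

# Both key columns are kept; 'Name' is NaN where no staff row matched
print(merged)
```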

* What if we have conflicts between dataframes?

In [19]:
import pandas as pd
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 
                          'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liasion', 
                          'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 
                          'Location': 'Washington Avenue'}])   #dataframe called staff_df with name role and location
staff_df #printing out staff_df

    Name            Role           Location
0  Kelly  Director of HR       State Street
1  Sally  Course liasion  Washington Avenue
2  James          Grader  Washington Avenue


In [20]:
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 
                            'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 
                            'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 
                            'Location': '512 Wilson Crescent'}])   #dataframe called student_df with name school and location
student_df #printing out student_df

    Name       School              Location
0  James     Business  1024 Billiard Avenue
1   Mike          Law  Fraternity House #22
2  Sally  Engineering   512 Wilson Crescent


In [22]:
# Quick, what's the meaning of this merge?
pd.merge(staff_df, student_df, how='left', on='Name') #left merge on Name: the conflicting Location columns come back as Location_x (from staff_df) and Location_y (from student_df) so both addresses are kept

    Name            Role         Location_x       School            Location_y
0  Kelly  Director of HR       State Street          NaN                   NaN
1  Sally  Course liasion  Washington Avenue  Engineering   512 Wilson Crescent
2  James          Grader  Washington Avenue     Business  1024 Billiard Avenue
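
The `_x` and `_y` suffixes are only defaults. The `suffixes` argument of `pd.merge` lets you pick more descriptive names; the labels `_staff` and `_student` below are my own choice, and the frames are rebuilt so the block runs on its own.

```python
import pandas as pd

staff_df = pd.DataFrame({'Name': ['Kelly', 'Sally', 'James'],
                         'Role': ['Director of HR', 'Course Liaison', 'Grader'],
                         'Location': ['State Street', 'Washington Avenue', 'Washington Avenue']})
student_df = pd.DataFrame({'Name': ['James', 'Mike', 'Sally'],
                           'School': ['Business', 'Law', 'Engineering'],
                           'Location': ['1024 Billiard Avenue', 'Fraternity House #22',
                                        '512 Wilson Crescent']})

# Same left merge as above, but with readable names for the clashing column
merged = pd.merge(staff_df, student_df, how='left', on='Name',
                  suffixes=('_staff', '_student'))
print(list(merged.columns))
```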


In [23]:
# What do we do if we want to match on multiple columns, like first and last name?
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 
                          'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 
                          'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 
                          'Role': 'Grader'}]) #dataframe with first name, last name, role
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 
                            'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 
                            'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 
                            'School': 'Engineering'}]) #dataframe with first name, last name, school

display(staff_df) #display staff_df
display(student_df) #display student_df

  First Name   Last Name            Role
0      Kelly  Desjardins  Director of HR
1      Sally      Brooks  Course liasion
2      James       Wilde          Grader


  First Name Last Name       School
0      James   Hammond     Business
1       Mike     Smith          Law
2      Sally    Brooks  Engineering


In [24]:
pd.merge(staff_df, student_df, how='inner', on=['First Name', 'Last Name']) #merging only data where first name and last name match

  First Name Last Name            Role       School
0      Sally    Brooks  Course liasion  Engineering
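
To see why both columns matter, compare this with a merge on `First Name` alone, which would wrongly pair James Wilde with James Hammond. A sketch on the same data, rebuilt so it runs standalone:

```python
import pandas as pd

staff_df = pd.DataFrame({'First Name': ['Kelly', 'Sally', 'James'],
                         'Last Name': ['Desjardins', 'Brooks', 'Wilde'],
                         'Role': ['Director of HR', 'Course Liaison', 'Grader']})
student_df = pd.DataFrame({'First Name': ['James', 'Mike', 'Sally'],
                           'Last Name': ['Hammond', 'Smith', 'Brooks'],
                           'School': ['Business', 'Law', 'Engineering']})

# One key: the two different Jameses are treated as the same person
first_only = pd.merge(staff_df, student_df, how='inner', on='First Name')

# Two keys: a row matches only if both first and last name agree
both_keys = pd.merge(staff_df, student_df, how='inner', on=['First Name', 'Last Name'])

print(len(first_only), len(both_keys))
print(list(first_only.columns))
```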


* One last mention: if we just want to stack the rows of several dataframes on top of each other, we use `pd.concat`

In [26]:
staff1_df = pd.DataFrame([{'Name': 'James', 'Role': 'Grader', 
                          'Location': 'Washington Avenue'}])
staff2_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 
                          'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liasion', 
                          'Location': 'Washington Avenue'}])

#keys is optional; it labels the rows of each source dataframe in an outer index level, so you can tell where each row came from
pd.concat([staff1_df, staff2_df], keys = ['staff1', 'staff2']) #use concat to stack the rows of the dataframes together

           Name            Role           Location
staff1 0  James          Grader  Washington Avenue
staff2 0  Kelly  Director of HR       State Street
       1  Sally  Course liasion  Washington Avenue
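
If you would rather have a fresh 0..n-1 index than a labelled one, `ignore_index=True` does that. A sketch with the same two frames, also showing how the `keys` labels can be used to pull one source frame back out:

```python
import pandas as pd

staff1_df = pd.DataFrame([{'Name': 'James', 'Role': 'Grader',
                           'Location': 'Washington Avenue'}])
staff2_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR',
                           'Location': 'State Street'},
                          {'Name': 'Sally', 'Role': 'Course Liaison',
                           'Location': 'Washington Avenue'}])

# Throw away the original row labels and renumber from zero
stacked = pd.concat([staff1_df, staff2_df], ignore_index=True)
print(list(stacked.index))

# With keys, the outer index level lets us select one source frame's rows
labelled = pd.concat([staff1_df, staff2_df], keys=['staff1', 'staff2'])
print(labelled.loc['staff2'])
```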


In [3]:
import pandas as pd
mascots = pd.DataFrame({'Cereal': ['Tony the Tiger', 'Toucan Sam', 'Trix Rabbit'], 'Football': ['Pat Patriot', 'Billy Buffalo', 'Poe'], 'Fast Food': ['Ronald McDonald', 'Colonel Sanders', 'Wendy'], 'Politics': ['Elephant', 'Donkey', 'Porcupine'] })
mascots

           Cereal       Football        Fast Food   Politics
0  Tony the Tiger    Pat Patriot  Ronald McDonald   Elephant
1      Toucan Sam  Billy Buffalo  Colonel Sanders     Donkey
2     Trix Rabbit            Poe            Wendy  Porcupine
