The College Scorecard datasets are available at https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_09012022.zip. They are not included in the GitHub repository because the files are too large to upload.

We're going to address:

* how we can bring multiple DataFrame objects together, either by merging them horizontally or by concatenating them vertically.
* some relational theory and language conventions.

![Venn Diagram](merging1.png)

A **Venn Diagram** is traditionally used **to show set membership**.

For example, the circle on the left is the population of students at a university. The circle on the right is the population of staff at a university. And the overlapping region in the middle is all of those students who are also staff. Maybe these students run tutorials for a course, or grade assignments, or engage in running research experiments.

So, this diagram shows two populations that we might have data about, with some overlap between them.

# DataFrame joining :

When it comes to **translating this to pandas**, we can think of the case where **we might have these two populations as indices in separate DataFrames**, maybe with the label of Person **Name**.

When we want to join the DataFrames together, we have some choices to make.

First, what if we want **a list of all the people**, regardless of whether they're staff or students? In **database terminology**, this is called a **full outer join**, and in **set theory**, it's called a **union**.

In the Venn diagram, **it represents everyone in any circle**.

![Union](merging2.png)

Second, what if we want only the **overlapping part of the circles**? In **database terminology**, this is called an **inner join**, and in **set theory**, it's called an **intersection**.

![Intersection](merging3.png)
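
Before moving to pandas, the two set operations can be tried out with plain Python sets. This is only a sketch to make the terminology concrete; the names in it are made up for illustration.

```python
# Hypothetical populations; the names are placeholders for illustration.
staff = {"Ana", "Ben", "Cara"}
students = {"Ben", "Cara", "Dan"}

# Union: everyone who is in either circle (a full outer join in database terms).
everyone = staff | students

# Intersection: only the people in both circles (an inner join in database terms).
both = staff & students

print(sorted(everyone))
print(sorted(both))
```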

# joining horizontally :

In [1]:
import pandas as pd 

In [2]:
# left DataFrame object
staff_df = pd.DataFrame([{"name" : "Dr.Mostafa Haghi Kashani", "role" : "Professor"},
                        {"name" : "Baharesatani", "role": "BSc educational attendant in computer engineering"},
                        {"name" : "Javad Nosrati", "role" : "Course liasion"},
                        {"name" : "Amir Hosein Sedaghati", "role" : "Grader Assignments"}])

staff_df.set_index("name", inplace= True)

# right DataFrame object
student_df = pd.DataFrame([{"name" : "Amir Hosein Sedaghati", "school" : "Engineering"},
                           {"name" : "Sara Rostami", "school" : "Law"},
                           {"name" : "Javad Nosrati", "school" : "Art"},
                           {"name" : "Bahar Tabatabei", "school" : "Business"}])

student_df.set_index("name", inplace= True)

There's some overlap in these DataFrames: Amir Hosein Sedaghati and Javad Nosrati are both students and staff, while the others are not. Importantly, both DataFrames are indexed by the value we want to merge on, which is **name**.

In [3]:
print(staff_df.head())
print('-----------------------------------')
print(student_df.head())

                                                                       role
name                                                                       
Dr.Mostafa Haghi Kashani                                          Professor
Baharesatani              BSc educational attendant in computer engineering
Javad Nosrati                                                Course liasion
Amir Hosein Sedaghati                                    Grader Assignments
-----------------------------------
                            school
name                              
Amir Hosein Sedaghati  Engineering
Sara Rostami                   Law
Javad Nosrati                  Art
Bahar Tabatabei           Business


# full outer join(union) :

If we want the **union** of these, we call **merge()**, **passing in the DataFrame on the left and the DataFrame on the right** and telling merge that we want an **outer join**. We also tell it to use **the left and right indices as the join keys**.

In [4]:
outerJ_df = pd.merge(staff_df, student_df, how= "outer", left_index= True, right_index= True)
outerJ_df

| name | role | school |
|---|---|---|
| Amir Hosein Sedaghati | Grader Assignments | Engineering |
| Bahar Tabatabei | NaN | Business |
| Baharesatani | BSc educational attendant in computer engineering | NaN |
| Dr.Mostafa Haghi Kashani | Professor | NaN |
| Javad Nosrati | Course liasion | Art |
| Sara Rostami | NaN | Law |


We see in the resulting DataFrame that everyone is listed. Since Bahar Tabatabei and Sara Rostami do not have a role, and Baharesatani and Dr.Mostafa Haghi Kashani do not have a school, those cells are listed as missing values (NaN).
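
If we also want to know which side each row of an outer join came from, merge() has an **indicator** parameter that adds a **_merge** column. The following is a minimal sketch on two trimmed-down, made-up DataFrames (the variable names `staff`, `students` and `outer` are illustrative, not part of the notebook).

```python
import pandas as pd

# Small illustrative frames, indexed by name like the ones in this notebook.
staff = pd.DataFrame(
    {"name": ["Javad Nosrati", "Baharesatani"],
     "role": ["Course liaison", "Attendant"]}
).set_index("name")
students = pd.DataFrame(
    {"name": ["Javad Nosrati", "Sara Rostami"],
     "school": ["Art", "Law"]}
).set_index("name")

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only" or "both".
outer = pd.merge(staff, students, how="outer",
                 left_index=True, right_index=True, indicator=True)
print(outer)
```

Rows marked `left_only` are staff who are not students, `right_only` are students who are not staff, and `both` is the overlap in the Venn diagram.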


# inner join(intersection) :

If we want the **intersection**, that is, just those who are both students AND staff, we set the **how parameter to inner**. Again, we set both **left_index and right_index to True** so that the indices are used as the join keys.

In [5]:
innerJ_df = pd.merge(staff_df, student_df, how= "inner", left_index= True, right_index= True)
innerJ_df

| name | role | school |
|---|---|---|
| Javad Nosrati | Course liasion | Art |
| Amir Hosein Sedaghati | Grader Assignments | Engineering |


# left join :

Now suppose we want **a list of all staff**, regardless of whether they are students or not, but **if they are students, we want their student details as well**. To do this we use a **left join**. It is important to note **the order of the DataFrames** in this function: **the first DataFrame is the left one and the second is the right**.

In [6]:
leftJ_df = pd.merge(staff_df, student_df, how= "left", left_index= True, right_index= True)
leftJ_df

| name | role | school |
|---|---|---|
| Dr.Mostafa Haghi Kashani | Professor | NaN |
| Baharesatani | BSc educational attendant in computer engineering | NaN |
| Javad Nosrati | Course liasion | Art |
| Amir Hosein Sedaghati | Grader Assignments | Engineering |
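
When both DataFrames are already indexed by the join key, as they are here, the DataFrame **join()** method is a shorter way to write an index-on-index merge. The sketch below uses trimmed-down, made-up copies of the data and compares the two spellings; the variable names are illustrative.

```python
import pandas as pd

staff = pd.DataFrame(
    {"name": ["Baharesatani", "Javad Nosrati"],
     "role": ["Attendant", "Course liaison"]}
).set_index("name")
students = pd.DataFrame(
    {"name": ["Javad Nosrati", "Sara Rostami"],
     "school": ["Art", "Law"]}
).set_index("name")

# Left join written with merge(), joining on both indices.
via_merge = pd.merge(staff, students, how="left",
                     left_index=True, right_index=True)

# The same left join written with DataFrame.join(), which joins on the index by default.
via_join = staff.join(students, how="left")

print(via_join)
print(via_join.equals(via_merge))
```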


# right join :

Next, suppose we want **a list of all of the students, along with their roles if they are also staff**. To do this we use a **right join**.

In [7]:
rightJ_df = pd.merge(staff_df, student_df, how= "right", left_index= True, right_index= True)
rightJ_df

| name | role | school |
|---|---|---|
| Amir Hosein Sedaghati | Grader Assignments | Engineering |
| Sara Rostami | NaN | Law |
| Javad Nosrati | Course liasion | Art |
| Bahar Tabatabei | NaN | Business |


# on parameter :

The merge function has a couple of other interesting parameters.

**We don't need to join on indices; we can use columns as well**. The **on** parameter lets us **name a column that both DataFrames have** as the join key. (**on** also accepts the name of an index level, which is why the example below works even though **name** is still the index of both DataFrames.)

In [8]:
rightJ_df = pd.merge(staff_df, student_df, how= "right", on= "name")
rightJ_df

| name | role | school |
|---|---|---|
| Amir Hosein Sedaghati | Grader Assignments | Engineering |
| Sara Rostami | NaN | Law |
| Javad Nosrati | Course liasion | Art |
| Bahar Tabatabei | NaN | Business |
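
The **on** parameter assumes the key has the same name on both sides. When the key columns are named differently, merge() also offers **left_on** and **right_on**. Here is a small sketch; the column names `staff_name` and `student_name` are hypothetical and do not appear in this notebook's data.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
staff = pd.DataFrame(
    {"staff_name": ["Javad Nosrati", "Baharesatani"],
     "role": ["Course liaison", "Attendant"]}
)
students = pd.DataFrame(
    {"student_name": ["Javad Nosrati", "Sara Rostami"],
     "school": ["Art", "Law"]}
)

# left_on names the key in the left frame, right_on the key in the right frame.
merged = pd.merge(staff, students, how="inner",
                  left_on="staff_name", right_on="student_name")
print(merged)
```

Note that both key columns are kept in the result, so one of them can be dropped afterwards if it is not needed.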


What happens when we have conflicts between the DataFrames?

Suppose both DataFrames have a **location** column. In the staff DataFrame, it is the office location where we can find the staff member. In the student DataFrame, however, the location is actually their home address.

In [9]:
# left DataFrame object
staff_df = pd.DataFrame([{"name" : "Dr.Mostafa Haghi Kashani", "role" : "Professor",
                          "location" : "Shahre Qods University"},
                        {"name" : "Baharesatani", "role": "BSc educational attendant in computer engineering",
                        "location" : "Molasadra Avenue"},
                        {"name" : "Javad Nosrati", "role" : "Course liasion",
                         "location" : "Azadi Avenue"},
                        {"name" : "Amir Hosein Sedaghati", "role" : "Grader Assignments", 
                        "location" : "Azadi Avenue"}])

# right DataFrame object
student_df = pd.DataFrame([{"name" : "Amir Hosein Sedaghati", "school" : "Engineering", 
                           "location" : "Karaj, Rastakhiz Avenue"},
                           {"name" : "Sara Rostami", "school" : "Law",
                           "location" : "Tehran, Bime"},
                           {"name" : "Javad Nosrati", "school" : "Art",
                           "location" : "Anidishe, phase 3"},
                           {"name" : "Bahar Tabatabei", "school" : "Business",
                           "location": "Tehran, Akbatan"}])


The **merge** function **preserves both conflicting columns** and **appends _x or _y** to their names to show which DataFrame each one came from. **The _x is always the left DataFrame information, and the _y is always the right DataFrame information**.

In [10]:
leftJ_df = pd.merge(staff_df, student_df, how= "left", on= "name")
leftJ_df

|   | name | role | location_x | school | location_y |
|---|---|---|---|---|---|
| 0 | Dr.Mostafa Haghi Kashani | Professor | Shahre Qods University | NaN | NaN |
| 1 | Baharesatani | BSc educational attendant in computer engineering | Molasadra Avenue | NaN | NaN |
| 2 | Javad Nosrati | Course liasion | Azadi Avenue | Art | Anidishe, phase 3 |
| 3 | Amir Hosein Sedaghati | Grader Assignments | Azadi Avenue | Engineering | Karaj, Rastakhiz Avenue |


location_x refers to the location column in the left DataFrame (staff), and location_y refers to the location column in the right DataFrame (students).
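
The default _x and _y suffixes can be replaced with more descriptive ones through the **suffixes** parameter of merge(). A minimal sketch with made-up rows follows; the suffix strings `_staff` and `_student` are our own choice.

```python
import pandas as pd

staff = pd.DataFrame(
    {"name": ["Javad Nosrati"], "role": ["Course liaison"],
     "location": ["Azadi Avenue"]}
)
students = pd.DataFrame(
    {"name": ["Javad Nosrati"], "school": ["Art"],
     "location": ["Andishe, phase 3"]}
)

# suffixes takes a (left, right) pair applied to overlapping column names.
merged = pd.merge(staff, students, how="left", on="name",
                  suffixes=("_staff", "_student"))
print(merged.columns.tolist())
```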

# multi-indexing and multiple columns :

It's quite possible that the first names of students and staff overlap while the last names do not.

In this case, we pass **a list of multiple columns** to the **on parameter**, and all of them are used together **as the join keys from both DataFrames**.

Note: **the column name(s)** assigned to the **on** parameter **need to exist in both DataFrames**.

In [11]:
# left DataFrame object
staff_df = pd.DataFrame([{"first name" : "Dr.Mostafa", "last name" : "Haghi Kashani",
                          "role" : "Professor"},
                        {"first name" : "Mina", "last name" : "Baharesatani",
                         "role": "BSc educational attendant in computer engineering"},
                        {"first name" : "Javad", "last name" : "Nosrati",
                         "role" : "Course liasion"},
                        {"first name" : "Amir Hosein", "last name" : "Sedaghati", "role" : "Grader Assignments",}])

# right DataFrame object
student_df = pd.DataFrame([{"first name" : "Amir Hosein", "last name" : "Sedaghati", "school" : "Engineering"},
                           {"first name" : "Sara", "last name" : "Rostami", "school" : "Law"},
                           {"first name" : "Javad", "last name" : "Nosrati", "school" : "Art"},
                           {"first name" : "Bahar", "last name" : "Tabatabei", "school" : "Business"}])


In [12]:
innerJ_df = pd.merge(staff_df, student_df, how= "inner", on=["first name", "last name"])
innerJ_df

|   | first name | last name | role | school |
|---|---|---|---|---|
| 0 | Javad | Nosrati | Course liasion | Art |
| 1 | Amir Hosein | Sedaghati | Grader Assignments | Engineering |
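
To see why both columns are needed as keys, we can compare a join on the first name alone with a join on both names. This is a sketch on made-up rows: the two people named Sara are hypothetical and are different people who merely share a first name.

```python
import pandas as pd

staff = pd.DataFrame(
    {"first name": ["Javad", "Sara"],
     "last name": ["Nosrati", "Ahmadi"],
     "role": ["Course liaison", "Professor"]}
)
students = pd.DataFrame(
    {"first name": ["Javad", "Sara"],
     "last name": ["Nosrati", "Rostami"],
     "school": ["Art", "Law"]}
)

# Joining on the first name only wrongly pairs the two different Saras.
by_first = pd.merge(staff, students, how="inner", on="first name")

# Joining on both columns keeps only the person who matches on both.
by_both = pd.merge(staff, students, how="inner", on=["first name", "last name"])

print(len(by_first), len(by_both))
```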


We'll need to know how to pull data from different sources, clean it, and join it for analysis. This is a staple not only of pandas, but of database technologies as well.

# joining vertically :

We can think of **merging as joining "horizontally"**, meaning **we join on similar values in a column found in two DataFrames**, and of **concatenating as joining "vertically"**, meaning **we stack DataFrames on top of each other**.

Let's take a look at the US Department of Education College Scorecard data. It has each US university's data on student completion, student debt, after-graduation income, etc. The data is stored in separate CSV files, with each CSV containing one year's records. Let's say we want the records from 2004 to 2006. We first create three DataFrames, each containing one year's records. And, **because the CSV files we're working with are messy**, I want to suppress some of **the Jupyter warning messages** and just **tell read_csv to skip bad lines**, so I'm going to start the cell with a cell magic called **%%capture**.

In [13]:
%%capture

# on_bad_lines="skip" (pandas >= 1.3) replaces the deprecated error_bad_lines=False
df2004 = pd.read_csv("datasets/college_scorecard/MERGED2004_05_PP.csv", on_bad_lines="skip")
df2005 = pd.read_csv("datasets/college_scorecard/MERGED2005_06_PP.csv", on_bad_lines="skip")
df2006 = pd.read_csv("datasets/college_scorecard/MERGED2006_07_PP.csv", on_bad_lines="skip")
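
What skipping bad lines does can be seen on a tiny in-memory CSV. This sketch assumes pandas 1.3 or later, where the option is spelled `on_bad_lines="skip"` (older versions used `error_bad_lines=False`); the CSV content is made up for illustration.

```python
import io

import pandas as pd

# The third line has one field too many, so the parser treats it as a bad line.
raw = (
    "UNITID,INSTNM\n"
    "1,Alpha College\n"
    "2,Beta College,unexpected extra field\n"
    "3,Gamma College\n"
)

# With on_bad_lines="skip" the malformed row is dropped instead of raising an error.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df)
```

Keep in mind that skipped rows are lost silently, so it is worth checking the row counts afterwards.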

In [14]:
df2004.head()

Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
0,100654,100200,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
1,100663,105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2,100690,2503400,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,
3,100706,105500,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,,...,,,,,,,,,,
4,100724,100500,1005,Alabama State University,Montgomery,AL,36104-0271,,,,...,,,,,,,,,,


In [15]:
print(len(df2004))
print(len(df2005))
print(len(df2006))

6660
6824
6848


Let's put all three DataFrames in a list called frames and pass the list into the **concat() function**. Let's see what it looks like.


In [16]:
frames = [df2004, df2005, df2006]
concatenated_df = pd.concat(frames)
concatenated_df.head()

Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
0,100654,100200,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
1,100663,105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2,100690,2503400,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,
3,100706,105500,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,,...,,,,,,,,,,
4,100724,100500,1005,Alabama State University,Montgomery,AL,36104-0271,,,,...,,,,,,,,,,


In [17]:
len(concatenated_df)

20332

In [18]:
len(concatenated_df) == (len(df2004) + len(df2005) + len(df2006))

True

Now all the data is concatenated together, but **we don't know what observations are from what year anymore**! The **concat function** **solves this problem with the keys parameter**: **we can set an extra level of indices** by passing in a list of keys, one for each DataFrame, in the same order as the DataFrames.


In [19]:
concatenated_df = pd.concat(frames, keys= ['2004', '2005', '2006'])
concatenated_df

Unnamed: 0,Unnamed: 1,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
2004,0,100654,00100200,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
2004,1,100663,00105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2004,2,100690,02503400,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,
2004,3,100706,00105500,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,,...,,,,,,,,,,
2004,4,100724,00100500,1005,Alabama State University,Montgomery,AL,36104-0271,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2006,6843,44098901,02568108,25681,Texas Barber College - Branch Campus #1,Dallas,TX,75241,,,,...,,,,,,,,,,
2006,6844,44098902,02568101,25681,Texas Barber College - Branch Campus #2,Dallas,TX,75228,,,,...,,,,,,,,,,
2006,6845,44098903,02568106,25681,Texas Barber Colleges and Hairstyling Schools ...,Houston,TX,77063,,,,...,,,,,,,,,,
2006,6846,44098904,02568107,25681,Texas Barber College - Branch Campus #5,Houston,TX,77022,,,,...,,,,,,,,,,
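
Once the keys are in place, the extra index level can be used to pull out one year again or be turned into an ordinary column. The sketch below uses two tiny made-up DataFrames instead of the real files; the level names `year` and `row` passed to the **names** parameter are our own choice.

```python
import pandas as pd

# Tiny stand-ins for two yearly files.
df_a = pd.DataFrame({"UNITID": [1, 2], "INSTNM": ["Alpha College", "Beta College"]})
df_b = pd.DataFrame({"UNITID": [1, 3], "INSTNM": ["Alpha College", "Gamma College"]})

# keys labels each frame; names gives the two index levels readable names.
combined = pd.concat([df_a, df_b], keys=["2004", "2005"], names=["year", "row"])

# Select every row that came from the frame labelled "2005".
only_2005 = combined.loc["2005"]

# Move the year level out of the index into a regular column.
flat = combined.reset_index(level="year")

print(only_2005)
print(flat)
```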


In [20]:
%%capture

# on_bad_lines="skip" (pandas >= 1.3) replaces the deprecated error_bad_lines=False
df2011 = pd.read_csv("datasets/college_scorecard/MERGED2011_12_PP.csv", on_bad_lines="skip")
df2012 = pd.read_csv("datasets/college_scorecard/MERGED2012_13_PP.csv", on_bad_lines="skip")
df2013 = pd.read_csv("datasets/college_scorecard/MERGED2013_14_PP.csv", on_bad_lines="skip")


In [21]:
print(len(df2011))
print(len(df2012))
print(len(df2013))

15235
7793
7804


In [22]:
frames = [df2011, df2012, df2013]
merged_df = pd.concat(frames, keys= ["2011", "2012", "2013"])
merged_df

Unnamed: 0,Unnamed: 1,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
2011,0,100654.0,100200.0,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
2011,1,100663.0,105200.0,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2011,2,100690.0,2503400.0,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,
2011,3,100706.0,105500.0,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,,...,,,,,,,,,,
2011,4,100724.0,100500.0,1005,Alabama State University,Montgomery,AL,36104-0271,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2013,7799,48285703.0,157107.0,1571,Georgia Military College-Columbus Campus,Columbus,GA,31909,,,,...,,,,,,,,,,
2013,7800,48285704.0,157101.0,1571,Georgia Military College-Valdosta Campus,Valdosta,GA,31605,,,,...,,,,,,,,,,
2013,7801,48285705.0,157105.0,1571,Georgia Military College-Warner Robins Campus,Warner Robins,GA,31093,,,,...,,,,,,,,,,
2013,7802,48285706.0,157100.0,1571,Georgia Military College-Online,Milledgeville,GA,31061,,,,...,,,,,,,,,,


Concatenation also has inner and outer modes, selected with the **join** parameter.

If we want to concatenate **two DataFrames whose columns are not all identical**, we can use the **outer join mode** (the default): all columns are kept and **some cells will be NaN**.

If we use **inner join mode, only the columns that appear in every DataFrame are kept**, so the others are dropped.

We can think of this as analogous to **the outer and inner joins** of the **merge()** function, applied to the columns rather than the rows.
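
The difference between the two modes is easiest to see on two small DataFrames whose columns only partly overlap. This is a sketch with made-up rows; the column names are borrowed from the Scorecard data, but the values (including the URL) are placeholders.

```python
import pandas as pd

df_old = pd.DataFrame({"UNITID": [1, 2], "CITY": ["Normal", "Montgomery"]})
df_new = pd.DataFrame(
    {"UNITID": [3], "CITY": ["Dallas"], "NPCURL": ["example.edu/npc"]}
)

# join="outer" (the default) keeps every column; missing cells become NaN.
outer = pd.concat([df_old, df_new], join="outer")

# join="inner" keeps only the columns present in all frames; no rows are dropped.
inner = pd.concat([df_old, df_new], join="inner")

print(outer)
print(inner)
```

In both cases all three rows survive; only the set of columns changes.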