# Merging multiple DataFrames

## Video lecture Transcript


**1. Merging multiple DataFrames**

Welcome back. In our last lesson, we learned how to merge two tables with a one-to-many relationship using the merge method. Merging data like this is a necessary skill to bring together data from different sources to answer some more complex data questions.

**2. Merging multiple tables**

Sometimes we need to merge together more than just two tables to complete our analysis.

**3. Remembering the licenses table**

In the previous lesson, we used two tables from the city of Chicago. One table contained business licenses issued by the city.

**4. Remembering the wards table**

The other table listed info about the local neighborhoods called wards, including the local government official's office.

**5. Review new data**

Now, we also have a table of businesses that have received small business grant money from Chicago. The grants are funded by taxpayer money. Therefore, it would be helpful to analyze how much grant money each business received and in what ward that business...

...ing multiple tables

We can now extend this example to a third table. First, we merge the grants table with the wards table on the ward column again, adding suffixes to the repeated column names. Note that we're using Python's backslash line continuation to put the second merge on the next line, so Python reads the two lines as a single statement. Without the backslash, Python will throw a syntax error because it parses them as two separate lines of code, so don't forget it. Now our output table has information about grants, businesses, and wards, and we can complete our analysis.
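
As a minimal sketch of the chained merge described here, the tiny tables below are hypothetical stand-ins for the Chicago data; their column names and values, and the `_bus` and `_ward` suffixes, are assumptions made for illustration. The second form shows that wrapping the whole expression in parentheses is an alternative to the backslash.

```python
import pandas as pd

# Hypothetical miniature tables; the real Chicago data has more columns and rows.
grants = pd.DataFrame({
    "address": ["1 Main St", "2 Oak Ave"],
    "zip": ["60601", "60602"],
    "grant": [1000.0, 2500.0],
})
licenses = pd.DataFrame({
    "address": ["1 Main St", "2 Oak Ave"],
    "zip": ["60601", "60602"],
    "ward": ["1", "2"],
    "business": ["Cafe A", "Shop B"],
})
wards = pd.DataFrame({
    "ward": ["1", "2"],
    "alderman": ["Alderman X", "Alderman Y"],
    "zip": ["60601", "60602"],
})

# Backslash continuation: Python reads both lines as one statement.
grants_licenses_ward = grants.merge(licenses, on=["address", "zip"]) \
    .merge(wards, on="ward", suffixes=("_bus", "_ward"))

# Equivalent form: parentheses allow line breaks without a backslash.
same_result = (
    grants.merge(licenses, on=["address", "zip"])
    .merge(wards, on="ward", suffixes=("_bus", "_ward"))
)

# "zip" exists on both sides of the second merge, so it gets the suffixes.
print(grants_licenses_ward.columns.tolist())
```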

**10. Results**

We can now sum the grants by ward and plot the results. Some wards have received more grants than others.
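
Summing the grants by ward might look like the sketch below. The merged table is a hypothetical stand-in with made-up values, and the plotting call is left as a comment because it assumes matplotlib is installed.

```python
import pandas as pd

# Hypothetical merged table with one row per grant.
grants_licenses_ward = pd.DataFrame({
    "ward": ["1", "1", "2", "3"],
    "grant": [1000.0, 500.0, 2500.0, 750.0],
})

# Total grant money per ward; as_index=False keeps ward as a regular column.
grant_sum_by_ward = grants_licenses_ward.groupby("ward", as_index=False)["grant"].sum()
print(grant_sum_by_ward)

# With matplotlib installed, a bar chart of the totals could be drawn with:
# grant_sum_by_ward.plot(x="ward", y="grant", kind="bar")
```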

**11. Merging even more...**

We could continue to merge additional tables as needed. We stopped at three, but if needed, we could continue to add more. The code here shows the pattern you would follow as you merge more tables.
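
The slide's code is not reproduced in this transcript, so the following is only a sketch of what such a pattern could look like. The tables `df1` through `df4` and the shared key column `col` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical tables that all share a key column named "col".
df1 = pd.DataFrame({"col": [1, 2, 3], "a": ["a1", "a2", "a3"]})
df2 = pd.DataFrame({"col": [1, 2, 3], "b": ["b1", "b2", "b3"]})
df3 = pd.DataFrame({"col": [1, 2, 3], "c": ["c1", "c2", "c3"]})
df4 = pd.DataFrame({"col": [1, 2], "d": ["d1", "d2"]})

# Three tables: two chained merges.
three = df1.merge(df2, on="col") \
    .merge(df3, on="col")

# Four tables: one more .merge() on the end of the chain.
four = df1.merge(df2, on="col") \
    .merge(df3, on="col") \
    .merge(df4, on="col")

print(four)
```

Each additional table adds one more `.merge()` call, and because the default is an inner join, the result keeps only the keys present in every table.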

**12. Let's practice!**

Now, let's practice merging multiple tables.

## Exercise 1:

### Description:
Your goal is to find the total number of rides provided to passengers passing through the Wilson station (`station_name == 'Wilson'`) when riding Chicago's public transportation system on weekdays (`day_type == 'Weekday'`) in July (`month == 7`). Luckily, Chicago provides this detailed data, but it is in three different tables. You will work on merging these tables together to answer the question. This data is different from the business related data you have seen so far, but all the information you need to answer the question is provided.

The `cal`, `ridership`, and `stations` DataFrames have been loaded for you. The relationship between the tables can be seen in the diagram below.

<img src="..\1_Chapter_1\Resources\3_Ch1_part3_1.png" style="max-height: 1000px; max-width: 1000px;"/>

### Instructions:

___

#### Instruction 1:
Merge the `ridership` and `cal` tables together, starting with the `ridership` table on the left, and save the result to the variable `ridership_cal`. If your code takes too long to run, your merge conditions might be incorrect.
#### Code:
```python
ridership_cal = ridership.merge(cal, on=["year", "month", "day"])
```
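
To see why incorrect merge conditions can make the code slow, here is a small sketch with hypothetical miniature versions of the two tables (the values are made up). Merging on an incomplete key matches each ridership row with every calendar row from the same year, so the result grows quickly. The `validate` argument of `merge` is one way to catch this early.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
ridership = pd.DataFrame({
    "station_id": ["40010", "40010", "40020", "40020"],
    "year": [2019, 2019, 2019, 2019],
    "month": [7, 7, 7, 7],
    "day": [1, 2, 1, 2],
    "rides": [100, 120, 80, 90],
})
cal = pd.DataFrame({
    "year": [2019, 2019],
    "month": [7, 7],
    "day": [1, 2],
    "day_type": ["Weekday", "Weekday"],
})

# Complete key: each ridership row matches exactly one calendar row.
good = ridership.merge(cal, on=["year", "month", "day"])

# Incomplete key: every ridership row matches every calendar row from that year.
bad = ridership.merge(cal, on="year")

print(len(ridership), len(good), len(bad))

# validate="many_to_one" makes pandas raise MergeError if the right-hand keys repeat.
try:
    ridership.merge(cal, on="year", validate="many_to_one")
except pd.errors.MergeError as err:
    print("merge rejected:", err)
```

In this toy case the incomplete key doubles the row count from 4 to 8. With a full year of calendar rows, the same mistake multiplies the table by hundreds.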

___

#### Instruction 2:
Extend the previous merge to three tables by also merging the `stations` table.
#### Code:
```python
ridership_cal_stations = ridership.merge(cal, on=['year','month','day']) \
                                  .merge(stations, on='station_id')
```

___

#### Instruction 3:
Create a variable called `filter_criteria` to select the appropriate rows from the merged table so that you can sum the `rides` column.


#### Code:
```python
ridership_cal_stations = ridership.merge(cal, on=['year','month','day']) \
                                  .merge(stations, on='station_id')

filter_criteria = ((ridership_cal_stations['month'] == 7) 
                   & (ridership_cal_stations['day_type'] == 'Weekday') 
                   & (ridership_cal_stations['station_name'] == 'Wilson'))

# Use .loc and the filter to select for rides
print(ridership_cal_stations.loc[filter_criteria, 'rides'].sum())

# Shell output
# 140005
```
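
The filtering step can be tried in isolation with the sketch below. The `merged` table and its values, including the `'Sunday/Holiday'` label, are hypothetical. The `query` form at the end is an alternative way to write the same filter, not the exercise's solution.

```python
import pandas as pd

# Hypothetical merged table; the values are made up for illustration.
merged = pd.DataFrame({
    "station_name": ["Wilson", "Wilson", "Wilson", "Belmont"],
    "month": [7, 7, 8, 7],
    "day_type": ["Weekday", "Sunday/Holiday", "Weekday", "Weekday"],
    "rides": [100, 40, 90, 70],
})

# Each comparison yields a boolean Series; & combines them row by row.
# The inner parentheses are required because & binds more tightly than ==.
filter_criteria = (
    (merged["month"] == 7)
    & (merged["day_type"] == "Weekday")
    & (merged["station_name"] == "Wilson")
)

# .loc[rows, column] keeps the matching rows and only the rides column.
total = merged.loc[filter_criteria, "rides"].sum()
print(total)

# The same filter written with DataFrame.query.
total_query = merged.query(
    "month == 7 and day_type == 'Weekday' and station_name == 'Wilson'"
)["rides"].sum()
```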

### Exercise recap
Awesome work! You merged three DataFrames together, including merging two tables on multiple columns. Once the tables were merged, you filtered and selected just like any other DataFrame. Finally, you found out that the Wilson station provided 140,005 rides on weekdays in July.



## Exercise 2:

### Description:

Three table merge
To solidify the concept of a three DataFrame merge, practice another exercise. A reasonable extension of our review of Chicago business data would include looking at demographics information about the neighborhoods where the businesses are. A table with the median income by zip code has been provided to you. You will merge the `licenses` and `wards` tables with this new income-by-zip-code table called `zip_demo`.

The `licenses`, `wards`, and `zip_demo` DataFrames have been loaded for you.

### Instructions:
Starting with the `licenses` table, merge to it the `zip_demo` table on the `zip` column. Then merge the resulting table to the `wards` table on the `ward` column. Save the result of the three merged tables to a variable named `licenses_zip_ward`.

Group the results of the three merged tables by the column `alderman` and find the median `income`.

### Code:
```python
# Merge licenses and zip_demo, on zip; and merge the wards on ward
licenses_zip_ward = licenses.merge(zip_demo, on="zip") \
                            .merge(wards, on="ward")

# Print the results by alderman and show median income
print(licenses_zip_ward.groupby("alderman").agg({'income':'median'}))
```

### Shell output
```
                             income
alderman                           
Ameya Pawar                 66246.0
Anthony A. Beale            38206.0
Anthony V. Napolitano       82226.0
Ariel E. Reyboras           41307.0
Brendan Reilly             110215.0
Brian Hopkins               87143.0
Carlos Ramirez-Rosa         66246.0
Carrie M. Austin            38206.0
Chris Taliaferro            55566.0
Daniel "Danny" Solis        41226.0
David H. Moore              33304.0
Deborah Mell                66246.0
Debra L. Silverstein        50554.0
Derrick G. Curtis           65770.0
Edward M. Burke             42335.0
Emma M. Mitts               36283.0
George Cardenas             33959.0
Gilbert Villegas            41307.0
Gregory I. Mitchell         24941.0
Harry Osterman              45442.0
Howard B. Brookins, Jr.     33304.0
James Cappleman             79565.0
Jason C. Ervin              41226.0
Joe Moore                   39163.0
John S. Arena               70122.0
Leslie A. Hairston          28024.0
Margaret Laurino            70122.0
Marty Quinn                 67045.0
Matthew J. O'Shea           59488.0
Michael R. Zalewski         42335.0
Michael Scott, Jr.          31445.0
Michelle A. Harris          32558.0
Michelle Smith             100116.0
Milagros "Milly" Santiago   41307.0
Nicholas Sposato            62223.0
Pat Dowell                  46340.0
Patrick Daley Thompson      41226.0
Patrick J. O'Connor         50554.0
Proco "Joe" Moreno          87143.0
Raymond A. Lopez            33959.0
Ricardo Munoz               31445.0
Roberto Maldonado           68223.0
Roderick T. Sawyer          32558.0
Scott Waguespack            68223.0
Susan Sadlowski Garza       38417.0
Tom Tunney                  88708.0
Toni L. Foulkes             27573.0
Walter Burnett, Jr.         87143.0
William D. Burns           107811.0
Willie B. Cochran           28024.0
```
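
The grouping step can be explored on a small scale with the sketch below. The table and its values are hypothetical. It compares the dictionary form of `.agg()` used in the exercise with the column-selection form, which returns the same medians as a Series.

```python
import pandas as pd

# Hypothetical merged rows: one row per business license.
licenses_zip_ward = pd.DataFrame({
    "alderman": ["A", "A", "A", "B", "B"],
    "income": [30000, 50000, 70000, 20000, 60000],
})

# Dictionary form used in the exercise: returns a DataFrame indexed by alderman.
by_agg = licenses_zip_ward.groupby("alderman").agg({"income": "median"})

# Column-selection form: returns a Series with the same values.
by_series = licenses_zip_ward.groupby("alderman")["income"].median()

print(by_agg)
```

Because there is one row per license, an alderman whose ward has many licenses in a single zip code has a median dominated by that zip code's income.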

### Exercise recap:
Nice work! You successfully merged three tables together. With the merged data, you can complete your income analysis. You see that only a few aldermen represent businesses in areas where the median income is greater than $62,000, which is the median income for the state of Illinois.

## Exercise 3:
### Description:

One-to-many merge with multiple tables
In this exercise, assume that you are looking to start a business in the city of Chicago. Your perfect idea is to start a company that uses goats to mow the lawn for other businesses. However, you have to choose a location in the city to put your goat farm. You need a location with a great deal of space and relatively few businesses and people around to avoid complaints about the smell. You will need to merge three tables to help you choose your location. The `land_use` table has info on the percentage of vacant land by city ward. The `census` table has population by ward, and the `licenses` table lists businesses by ward.

The `land_use`, `census`, and `licenses` tables have been loaded for you.

### Instructions:
#### Instruction 1:
Merge `land_use` and `census` on the `ward` column. Merge the result of this with `licenses` on the `ward` column, using the suffix `_cen` for the left table and `_lic` for the right table. Save this to the variable `land_cen_lic`.
#### Code
```python
land_cen_lic = land_use.merge(census, on='ward') \
                    .merge(licenses, on='ward', suffixes=('_cen','_lic'))
```
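
The effect of the suffixes is easiest to see on small tables. In this sketch the tables are hypothetical, and the `address` column is assumed to be the one that appears in both `census` and `licenses`. Suffixes are applied only to overlapping columns that are not merge keys.

```python
import pandas as pd

# Hypothetical miniature tables; only the column names matter here.
land_use = pd.DataFrame({"ward": ["1", "2"], "vacant": [10, 20]})
census = pd.DataFrame({
    "ward": ["1", "2"],
    "pop_2010": [50000, 60000],
    "address": ["Ward 1 office", "Ward 2 office"],
})
licenses = pd.DataFrame({
    "ward": ["1", "1", "2"],
    "account": ["100", "101", "200"],
    "address": ["1 Main St", "5 Elm St", "2 Oak Ave"],
})

land_cen_lic = land_use.merge(census, on="ward") \
    .merge(licenses, on="ward", suffixes=("_cen", "_lic"))

# Only "address" appears on both sides of the second merge, so only it is renamed.
print(land_cen_lic.columns.tolist())
```

The key column `ward` and the non-overlapping columns keep their original names.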
___

#### Instruction 2:
Group `land_cen_lic` by `ward`, `pop_2010` (the population in 2010), and `vacant`, then count the number of `account` values. Save the results to `pop_vac_lic`.

#### Code
```python
land_cen_lic = land_use.merge(census, on='ward') \
                    .merge(licenses, on='ward', suffixes=('_cen','_lic'))

# Group by ward, pop_2010, and vacant, then count the # of accounts
pop_vac_lic = land_cen_lic.groupby(["ward", "pop_2010", "vacant"], 
                                   as_index=False).agg({'account':'count'})
```

___

#### Instruction 3:
Sort `pop_vac_lic` by `vacant`, `account`, and `pop_2010` in descending, ascending, and ascending order respectively. Save it as `sorted_pop_vac_lic`.

#### Code
```python
land_cen_lic = land_use.merge(census, on='ward') \
                    .merge(licenses, on='ward', suffixes=('_cen','_lic'))

# Group by ward, pop_2010, and vacant, then count the # of accounts
pop_vac_lic = land_cen_lic.groupby(["ward", "pop_2010", "vacant"], 
                                   as_index=False).agg({'account':'count'})

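# Sort by vacant (descending), then account and pop_2010 (both ascending)
sorted_pop_vac_lic = pop_vac_lic.sort_values(["vacant", "account", "pop_2010"],
                                             ascending=[False, True, True])

# Print the top few rows of sorted_pop_vac_lic
print(sorted_pop_vac_lic.head())

# Shell output
#    ward  pop_2010  vacant  account
# 47    7     51581      19       80
# 12   20     52372      15      123
# 1    10     51535      14      130
# 16   24     54909      13       98
# 7    16     51954      13      156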
```
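
The mixed sort order asked for in Instruction 3 can be checked on a small table. The values below are hypothetical and were chosen so that ties in `vacant` and `account` show how the later sort columns break them.

```python
import pandas as pd

# Hypothetical grouped counts with ties in the first two sort columns.
pop_vac_lic = pd.DataFrame({
    "ward": ["1", "2", "3", "4"],
    "pop_2010": [52000, 51000, 53000, 50000],
    "vacant": [13, 19, 13, 13],
    "account": [150, 80, 90, 90],
})

# ascending takes one flag per sort column, in the same order as the columns.
sorted_pop_vac_lic = pop_vac_lic.sort_values(
    ["vacant", "account", "pop_2010"],
    ascending=[False, True, True],
)
print(sorted_pop_vac_lic)
```

Ward 2 comes first because it has the most vacant land. The three wards tied at 13 are then ordered by fewest accounts, and the remaining tie between wards 3 and 4 is broken by the smaller population.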

### Exercise recap:
Great job putting your new skills into action. You merged multiple tables with varying relationships and added suffixes to make your column names clearer. Using your skills, you were able to pull together information from different tables to see that the 7th ward would be a good place to build your goat farm!

# RECAP
This recap covers Data Merging Basics, chapter 1 of the course Joining Data with pandas. Here is what you covered in your last lesson:

You learned about the importance and techniques of merging multiple DataFrames in Python using pandas, a crucial skill for data analysis that allows you to combine information from different sources to uncover more complex insights. Specifically, you covered:

- The concept of merging tables with a one-to-many relationship using the merge method, which is essential for bringing together disparate data sources.
- The importance of selecting the correct keys (columns) for merging tables to avoid duplicating or losing data. For example, merging on both address and zip code to ensure accurate matching of records.
- How to merge more than two tables by chaining the merge method, and the significance of adding suffixes to differentiate columns with the same name from different tables.
- The practical application of these techniques through exercises, including merging business licenses, grants, and ward information to analyze grant distribution across wards.

In one of the exercises, you practiced merging three tables (`licenses`, `zip_demo`, and `wards`) and then grouping the results to find the median income by alderman:

```python
licenses_zip_ward = licenses.merge(zip_demo, on='zip') \
                            .merge(wards, on='ward')
print(licenses_zip_ward.groupby('alderman').agg({'income':'median'}))
```

This exercise illustrated how to extend data merging to include demographic information, providing a more nuanced analysis of business data.

The goal of the next lesson is to understand how to use a left join to merge data tables, focusing on including all entries from the primary table and matching entries from the secondary table, and how this method can be applied to identify missing data or enrich datasets with additional information.