# Inner join

## Video lecture Transcript

**1. Inner join**

Welcome! I am Aaren Stubberfield and I will be your instructor for this course. The pandas package is a powerful tool for manipulating and transforming data in Python. However, when working on an analysis, the data needed could be in multiple tables. This course will focus on the vital skill of merging tables together.


**2. For clarity**

As we start, two quick clarifications. First, through other courses on DataCamp, you may have learned how to import tabular data as DataFrames. In this course, you may hear the words table and DataFrame, but they are equivalent here. Second, we will refer to combining different tables together as merging tables, but note that some refer to this same process as joining.


**3. Chicago data portal dataset**

To help us learn about merging tables, we will use data from the city of Chicago data portal.


**4. Datasets for example**

The city of Chicago is divided into fifty local neighborhoods called wards. We have a table with data about the local government offices in each ward. In this example, we want to merge the local government data with census data about the population of each ward.


**5. The ward data**

If we look at the wards table, we have information about the local government of each ward, such as the government office address. This table has 50 rows, one for each ward, and 4 columns.


**6. Census data**

The census table contains the population of each ward in 2000 and 2010, and the change between those two years as a percentage. Additionally, it includes the address for the center of each ward. This table has 50 rows and 6 columns.


**7. Merging tables**

The two tables are related by their ward column. We can merge them together, matching the ward number from each row of the wards table to the ward numbers from the census table. For example, the second ward in the wards table with Alderman Brian Hopkins would be matched with row 2 of the census table where the population in 2000 was 54,361.


**8. Inner join**

The pandas package has an excellent DataFrame method for performing this type of merge called merge. The merge method takes the first DataFrame, wards, and merges it with the second DataFrame, census. We use the on argument to tell the method that we want to merge the two DataFrames on the ward column. Since we listed the wards table first, its columns will appear first in the output, followed by the columns from the census table. In this example, the merge returns a DataFrame with 50 rows and 9 columns, where the returned rows have matching values for the ward column in both tables. This is called an inner join.
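
As a minimal sketch of the call described above, the following uses two tiny stand-in tables. The column names `alderman` and `pop_2010` and all of the values are illustrative assumptions, not the actual Chicago data.

```python
import pandas as pd

# Toy stand-ins for the wards and census tables (hypothetical values)
wards = pd.DataFrame({
    "ward": ["1", "2", "3"],
    "alderman": ["A", "B", "C"],
})
census = pd.DataFrame({
    "ward": ["1", "2", "3"],
    "pop_2010": [100, 200, 300],
})

# merge() performs an inner join by default; `on` names the shared key column
wards_census = wards.merge(census, on="ward")

# Columns of the first (left) table come first, then those of the second
print(wards_census.columns.tolist())  # ['ward', 'alderman', 'pop_2010']
print(wards_census.shape)             # (3, 3)
```

The key column `ward` appears only once in the result, which is why merging a 4-column table with a 6-column table on one shared key gives 9 columns rather than 10.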


**9. Inner join**

An inner join will only return rows that have matching values in both tables.
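
To make the matching rule concrete, here is a small sketch with two hypothetical tables, `left` and `right`, whose key values only partly overlap. The names and values are assumptions chosen for illustration.

```python
import pandas as pd

# Keys 2 and 3 appear in both tables; key 1 is only in left, key 4 only in right
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# The default merge is an inner join, so unmatched keys are dropped
inner = left.merge(right, on="key")
print(inner["key"].tolist())  # [2, 3]
```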


**10. Suffixes**

You may have noticed that the merged table has columns with suffixes of underscore x or y. This is because both the wards and census tables contained address and zip columns. To avoid multiple columns with the same name, they are automatically given a suffix by the merge method.


**11. Suffixes**

We can use the suffixes argument of the merge method to control this behavior. We provide a tuple where all of the overlapping columns in the left table are given the suffix '_ward', and those of the right table are given the suffix '_cen'. This makes it easier for us to tell the difference between the columns.
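
The sketch below compares the default suffixes with custom ones. The two tables are toy stand-ins with made-up addresses and zip codes; only the column names `ward`, `address`, and `zip` come from the lecture.

```python
import pandas as pd

# Both toy tables contain address and zip columns (hypothetical values)
wards = pd.DataFrame({
    "ward": ["1", "2"],
    "address": ["10 Main St", "20 Oak Ave"],
    "zip": ["60601", "60602"],
})
census = pd.DataFrame({
    "ward": ["1", "2"],
    "address": ["11 Center St", "21 Center Ave"],
    "zip": ["60611", "60612"],
})

# Default behavior: overlapping non-key columns get _x (left) and _y (right)
default = wards.merge(census, on="ward")
print(default.columns.tolist())
# ['ward', 'address_x', 'zip_x', 'address_y', 'zip_y']

# Custom suffixes: first tuple element for the left table, second for the right
custom = wards.merge(census, on="ward", suffixes=("_ward", "_cen"))
print(custom.columns.tolist())
# ['ward', 'address_ward', 'zip_ward', 'address_cen', 'zip_cen']
```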


**12. Let's practice!**

Now let's practice using the merge method.


## Exercises

### Exercise 1

What column to merge on?

Chicago provides a list of taxicab owners and vehicles licensed to operate within the city, for public safety. Your goal is to merge two tables together. One table is called taxi_owners, with info about the taxi cab company owners, and one is called taxi_veh, with info about each taxi cab vehicle. Both the taxi_owners and taxi_veh tables have been loaded for you to explore.

``` python
print(taxi_owners.head())
     rid   vid           owner                 address    zip
0  T6285  6285  AGEAN TAXI LLC     4536 N. ELSTON AVE.  60630
1  T4862  4862    MANGIB CORP.  5717 N. WASHTENAW AVE.  60659
2  T1495  1495   FUNRIDE, INC.     3351 W. ADDISON ST.  60618
3  T4231  4231    ALQUSH CORP.   6611 N. CAMPBELL AVE.  60645
4  T5971  5971  EUNIFFORD INC.     3351 W. ADDISON ST.  60618
```



``` python
print(taxi_veh.head())
    vid    make   model  year fuel_type                owner
0  2767  TOYOTA   CAMRY  2013    HYBRID       SEYED M. BADRI
1  1411  TOYOTA    RAV4  2017    HYBRID          DESZY CORP.
2  6500  NISSAN  SENTRA  2019  GASOLINE       AGAPH CAB CORP
3  2746  TOYOTA   CAMRY  2013    HYBRID  MIDWEST CAB CO, INC
4  5922  TOYOTA   CAMRY  2013    HYBRID       SUMETTI CAB CO
```

My note: Both tables share the columns `vid` and `owner`. However, `owner` could be problematic as a merge key, and it is not even offered as an option in the MCQ.

Choose the column you would use to merge the two tables on using the .merge() method.

on='rid'

on='vid' ✅

on='year'

on='zip'
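
One way to check candidate key columns before merging is to list the columns the two tables share and compare how each behaves as a key. This is only a sketch: the rows below are invented, and the spelling difference in `owner` is a hypothetical illustration of why a free-text column can be a fragile key, not something observed in the real data.

```python
import pandas as pd

# Toy stand-ins for taxi_owners and taxi_veh (hypothetical rows)
taxi_owners = pd.DataFrame({
    "rid": ["T1", "T2"],
    "vid": ["1", "2"],
    "owner": ["ACME CAB CO.", "BETA TAXI LLC"],
})
taxi_veh = pd.DataFrame({
    "vid": ["1", "2"],
    "make": ["TOYOTA", "NISSAN"],
    "owner": ["ACME CAB CO", "BETA TAXI LLC"],  # note the missing period
})

# Columns present in both tables are the candidate merge keys
shared = [col for col in taxi_owners.columns if col in taxi_veh.columns]
print(shared)  # ['vid', 'owner']

# Merging on the identifier keeps every row; merging on the text column loses one
print(len(taxi_owners.merge(taxi_veh, on="vid")))    # 2
print(len(taxi_owners.merge(taxi_veh, on="owner")))  # 1
```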

### Exercise 2
Your first inner join

You have been tasked with figuring out what the most popular types of fuel used in Chicago taxis are. To complete the analysis, you need to merge the taxi_owners and taxi_veh tables together on the vid column. You can then use the merged table along with the .value_counts() method to find the most common fuel_type.

Since you'll be working with pandas throughout the course, the package will be preloaded for you as pd in each exercise in this course. Also the taxi_owners and taxi_veh DataFrames are loaded for you.


#### Instructions (3)
1. Merge taxi_owners with taxi_veh on the column vid, and save the result to taxi_own_veh.
2. Set the left and right table suffixes for overlapping columns of the merge to _own and _veh, respectively.
3. Select the fuel_type column from taxi_own_veh and print the value_counts() to find the most popular fuel_types used.

___
``` python
# Instruction 1. Merge taxi_owners with taxi_veh on the column vid, and save the result to taxi_own_veh.

# Merge the taxi_owners and taxi_veh tables
taxi_own_veh = taxi_owners.merge(taxi_veh, on="vid")

# Print the column names of the taxi_own_veh
print(taxi_own_veh.columns)




# shell response
Index(['rid', 'vid', 'owner_x', 'address', 'zip', 'make', 'model', 'year', 'fuel_type', 'owner_y'], dtype='object')
```
___

``` python
# Instruction 2. Set the left and right table suffixes for overlapping columns of the merge to _own and _veh, respectively.

# Merge the taxi_owners and taxi_veh tables setting a suffix
taxi_own_veh = taxi_owners.merge(taxi_veh, on='vid', suffixes=("_own", "_veh"))

# Print the column names of taxi_own_veh
print(taxi_own_veh.columns)




# shell response
Index(['rid', 'vid', 'owner_own', 'address', 'zip', 'make', 'model', 'year', 'fuel_type', 'owner_veh'], dtype='object')
```
___

```python
# Instruction 3. Select the fuel_type column from taxi_own_veh and print the value_counts() to find the most popular fuel_types used.

# Merge the taxi_owners and taxi_veh tables setting a suffix
taxi_own_veh = taxi_owners.merge(taxi_veh, on='vid', suffixes=('_own','_veh'))

# Print the value_counts to find the most popular fuel_type
print(taxi_own_veh['fuel_type'].value_counts())




# shell response
# (Output not captured in these notes. The call prints the number of taxis per fuel_type, most common first.)
```
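
For reference, here is a small sketch of the merge followed by `.value_counts()` on toy data. The rows are hypothetical, so the counts say nothing about the real Chicago tables.

```python
import pandas as pd

# Toy stand-ins for the exercise tables (hypothetical rows)
taxi_owners = pd.DataFrame({
    "vid": ["1", "2", "3"],
    "owner": ["A", "B", "C"],
})
taxi_veh = pd.DataFrame({
    "vid": ["1", "2", "3"],
    "fuel_type": ["HYBRID", "GASOLINE", "HYBRID"],
    "owner": ["A", "B", "C"],
})

taxi_own_veh = taxi_owners.merge(taxi_veh, on="vid", suffixes=("_own", "_veh"))

# value_counts() returns a Series sorted from most to least frequent
counts = taxi_own_veh["fuel_type"].value_counts()
print(counts.index[0])   # HYBRID
print(counts["HYBRID"])  # 2
```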

### Exercise 3
Inner joins and number of rows returned

All of the merges you have studied to this point are called inner joins. It is important to understand that inner joins only return the rows with matching values in both tables. You will explore this further by reviewing the merge between the `wards` and `census` tables, then comparing it to merges of slightly altered copies of these tables, named `wards_altered` and `census_altered`. The first row of the `ward` column has been changed in the altered tables. You will examine how this affects the merge between them. The tables have been loaded for you.

For this exercise, it is important to know that the `wards` and `census` tables start with 50 rows.

#### Instructions (3)
1. Merge wards and census on the ward column and save the result to wards_census.
2. Merge the wards_altered and census tables on the ward column, and notice the difference in returned rows.
3. Merge the wards and census_altered tables on the ward column, and notice the difference in returned rows.

___
``` python
# Instruction 1. Merge wards and census on the ward column and save the result to wards_census.

# Merge the wards and census tables on the ward column
wards_census = wards.merge(census, on="ward")

# Print the shape of wards_census
print('wards_census table shape:', wards_census.shape)


# shell response
wards_census table shape: (50, 9)
```
___

``` python
# Instruction 2. Merge the wards_altered and census tables on the ward column, and notice the difference in returned rows.

# Print the first few rows of the wards_altered table to view the change 
print(wards_altered[['ward']].head())

# Merge the wards_altered and census tables on the ward column
wards_altered_census = wards_altered.merge(census, on="ward")

# Print the shape of wards_altered_census
print('wards_altered_census table shape:', wards_altered_census.shape)



# shell response
  ward
0   61
1    2
2    3
3    4
4    5
wards_altered_census table shape: (49, 9)
```
___

``` python
# Instruction 3. Merge the wards and census_altered tables on the ward column, and notice the difference in returned rows.

# Print the first few rows of the census_altered table to view the change 
print(census_altered[['ward']].head())

# Merge the wards and census_altered tables on the ward column
wards_census_altered = wards.merge(census_altered, on="ward")

# Print the shape of wards_census_altered
print('wards_census_altered table shape:', wards_census_altered.shape)




# shell response
    ward
0  None
1     2
2     3
3     4
4     5
wards_census_altered table shape: (49, 9)
```
___

#### Final thought
Great job! In step 1, the `.merge()` returned a table with the same number of rows as the original `wards` table. However, in steps 2 and 3, the merge returned fewer rows (49 instead of 50), because the altered value in the first row of the `ward` column had no match in the `ward` column of the other table. _Remember that `.merge()` only returns rows where the values match in both tables._
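
When an inner join returns fewer rows than expected, one possible way to find the cause is to look for key values that are absent from the other table. The sketch below uses toy versions of `wards_altered` and `census`. The `alderman` and `pop_2010` values are hypothetical, and `isin` is a suggestion here, not part of the exercise.

```python
import pandas as pd

# Toy versions of the exercise tables: the first ward was changed from "1" to "61"
wards_altered = pd.DataFrame({
    "ward": ["61", "2", "3"],
    "alderman": ["A", "B", "C"],
})
census = pd.DataFrame({
    "ward": ["1", "2", "3"],
    "pop_2010": [100, 200, 300],
})

# Rows of wards_altered whose ward value never appears in census
unmatched = wards_altered[~wards_altered["ward"].isin(census["ward"])]
print(unmatched["ward"].tolist())  # ['61']

# The inner join drops exactly those rows
merged = wards_altered.merge(census, on="ward")
print(len(wards_altered) - len(merged))  # 1
```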

# Recap

This recap covers Data Merging Basics, chapter 1 of the course Joining Data with pandas. Here is what the lesson covered:

You learned about the fundamentals of merging tables in pandas, focusing on inner joins. Inner joins are a method to combine rows from two tables based on a common column, returning only rows with matching values in both tables. Here are the key points you covered:

- Understanding DataFrames and Tables: You discovered that in pandas, tables are represented as DataFrames, and merging them is a crucial skill for data analysis.
- The `merge()` Method: You learned how to use pandas' `merge()` method to join two DataFrames. For example, to merge two DataFrames, `df1` and `df2`, on a common column named '`common_column`', you would use `df1.merge(df2, on='common_column')`.
- Inner Join Mechanics: An inner join returns a DataFrame containing only the rows that have matching values in both tables. This was illustrated through merging ward and census data, where only wards present in both DataFrames were included in the result.
- Suffixes in Merged DataFrames: You saw how pandas handles overlapping column names by appending suffixes, and you learned to customize these suffixes using the `suffixes` argument in the `merge()` method to make the output clearer.
- Practical Application: Through exercises, you applied these concepts by merging Chicago taxi owners and vehicle tables on the '`vid`' column to analyze taxi fuel types, demonstrating the real-world utility of inner joins.

This lesson equipped you with the knowledge to merge data from multiple sources, a vital skill for uncovering insights in data analysis.

The goal of the next lesson is to teach how to effectively combine and analyze data from multiple sources using one-to-many relationships, enhancing your ability to conduct comprehensive data analysis.