In [1]:
# SETUP CODE - PLEASE RUN THIS ONCE WHEN YOU START UP YOUR CODESPACE

# RUN TEST FILE
%run 'test/week3_test.ipynb'

# Week 3 - Data Cleaning and Data Manipulation



## Introduction

In this notebook, we will explore common techniques for data cleaning and manipulation using the Pandas library in Python. Proper data preparation is crucial for accurate analysis and modeling.

Below is a data cleaning checklist with useful tips on what to keep an eye out for.

<img src = https://images.datacamp.com/image/upload/v1654855433/Data_Cleaning_Checklist_1x_fad5f3e982.png width = "900" height = "2500" >

We are now going to look at how to implement some of these techniques in code. Some other data cleaning techniques will be touched on in next week's notebook.

### Import Libraries

Let's start by importing the necessary libraries.


In [3]:
import pandas as pd

### Handling Missing Values
Handling missing values is a critical step in the data cleaning process. Let's create a sample DataFrame with missing values to demonstrate the techniques.

In [5]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, None, 22, 35],
        'Salary': [50000, 60000, 75000, None, 80000]}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Salary
0,Alice,25.0,50000.0
1,Bob,30.0,60000.0
2,Charlie,,75000.0
3,David,22.0,
4,Eva,35.0,80000.0


Check for Missing Values

We can use the isnull() method to identify missing values in the DataFrame.

In [6]:
# Check for missing values
missing_vals = df.isnull()
print(missing_vals)

    Name    Age  Salary
0  False  False   False
1  False  False   False
2  False   True   False
3  False  False    True
4  False  False   False


In [7]:
# Count the missing values in each column
missing_vals_total = df.isnull().sum()
print(missing_vals_total)

Name      0
Age       1
Salary    1
dtype: int64
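
Counts are easier to compare across columns when they are expressed as a share of the rows. The sketch below rebuilds the same sample data so that it runs on its own; the names `sample`, `missing_share` and `rows_with_gaps` are illustrative and not used elsewhere in this notebook.

```python
import pandas as pd

# Same sample data as above, rebuilt so this cell runs on its own
sample = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
                       'Age': [25, 30, None, 22, 35],
                       'Salary': [50000, 60000, 75000, None, 80000]})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of missing values per column
missing_share = sample.isnull().mean()
print(missing_share)

# Keep only the rows that contain at least one missing value
rows_with_gaps = sample[sample.isnull().any(axis=1)]
print(rows_with_gaps)
```

Looking at the affected rows directly can help you decide whether filling or dropping is the more sensible choice.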


Fill Missing Values

Filling missing values is a common strategy. Use situational logic to choose how best to fill in the missing values; the right choice also depends on the datatype of the column.
- If you're looking at a numerical column, you may like to use the mean value, maximum value or minimum value of that column, or alternatively fill the missing values with a 0.
- If it is a string column, you may like to fill it with the string 'missing', depending on your use case and/or personal preference.

In this example, as we are looking at missing values in numerical columns, we will firstly fill the missing values with a specific value (0), then we will show you how to fill the missing values with the mean of each column.
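
Before moving on to the numeric example, here is a minimal sketch of the string-column case from the list above, which is not otherwise shown in this notebook. The `staff` dataframe and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one gap in a string column and one in a numeric column
staff = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                      'Role': ['Engineer', None, 'Analyst'],
                      'Age': [25, None, 35]})

# A dict maps each column name to its own fill value
staff_filled = staff.fillna({'Role': 'missing', 'Age': staff['Age'].mean()})
print(staff_filled)
```

Passing a dict lets each column use a fill value that suits its datatype, and the original `staff` dataframe is left unchanged.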

In [8]:
# Fill missing values with 0
# Calling fillna on the DataFrame with a column-to-value dict avoids the
# chained-assignment FutureWarning raised by df['Age'].fillna(0, inplace=True)
df.fillna({'Age': 0, 'Salary': 0}, inplace=True)
df


Unnamed: 0,Name,Age,Salary
0,Alice,25.0,50000.0
1,Bob,30.0,60000.0
2,Charlie,0.0,75000.0
3,David,22.0,0.0
4,Eva,35.0,80000.0


In [9]:
# Fill missing values with mean
# Calling fillna on the DataFrame with a column-to-value dict avoids the
# chained-assignment FutureWarning raised by df['Age'].fillna(..., inplace=True)
df.fillna({'Age': df['Age'].mean(), 'Salary': df['Salary'].mean()}, inplace=True)
df


Unnamed: 0,Name,Age,Salary
0,Alice,25.0,50000.0
1,Bob,30.0,60000.0
2,Charlie,0.0,75000.0
3,David,22.0,0.0
4,Eva,35.0,80000.0


Did you notice that no values changed to the mean and the filled cells stayed at 0? When we filled the missing values with 0, we used the "inplace" parameter, which modified the dataframe 'df' itself: the zeros were saved in place of the missing values, so 'df' no longer had any missing values left for the mean to fill.

Be mindful when using the inplace parameter: if you use it in error, you may need to re-run your previous code to get back to the dataframe contents you want. An alternative in some cases is to create a new dataframe. The inplace parameter is often discouraged, even by the pandas developers, and it is better practice to simply reassign the variable.
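
As a minimal sketch of the reassignment approach, the example below shows that fillna returns a new object by default and leaves the original untouched. The names `original` and `filled` are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({'Age': [25, 30, None, 22, 35]})

# Without inplace=True, fillna returns a new DataFrame
filled = original.fillna(0)

# The gap is still present in the original...
print(original['Age'].isnull().sum())
# ...but not in the new DataFrame
print(filled['Age'].isnull().sum())
```

Because `original` still holds the missing value, you can try a different fill strategy later without re-running earlier cells.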

In our case, it would be better to use the column means to fill in the missing values (instead of 0), and we will save the new dataframe as a different dataframe variable.

In [10]:
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Salary
0,Alice,25.0,50000.0
1,Bob,30.0,60000.0
2,Charlie,,75000.0
3,David,22.0,
4,Eva,35.0,80000.0


In [11]:
# Fill missing values with mean
df2 = df.copy()
df2['Age'] = df2['Age'].fillna(df2['Age'].mean())
df2['Salary'] = df2['Salary'].fillna(df2['Salary'].mean())
df2

Unnamed: 0,Name,Age,Salary
0,Alice,25.0,50000.0
1,Bob,30.0,60000.0
2,Charlie,28.0,75000.0
3,David,22.0,66250.0
4,Eva,35.0,80000.0


An alternative to filling missing or null values is to remove the whole row of data. As we didn't overwrite the original dataframe 'df', we can now remove the rows which contain missing values. Note that dropna() returns a new dataframe and leaves 'df' unchanged unless you reassign it.

In [12]:
df.dropna()

Unnamed: 0,Name,Age,Salary
0,Alice,25.0,50000.0
1,Bob,30.0,60000.0
4,Eva,35.0,80000.0


The documentation for this function can be found at the link below, which explains its various parameters. A common parameter is 'subset', which drops a row only if there is a missing value in the specified column(s) and keeps rows which have missing values in other columns.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

This is how you would drop rows with missing values in just the salary column:

In [13]:
df.dropna(subset=['Salary'])

Unnamed: 0,Name,Age,Salary
0,Alice,25.0,50000.0
1,Bob,30.0,60000.0
2,Charlie,,75000.0
4,Eva,35.0,80000.0
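
Two other dropna parameters described in the documentation linked above are 'how' and 'thresh'. The sketch below uses a made-up `people` dataframe to show the difference; treat it as an illustration rather than a recommended cleaning recipe.

```python
import pandas as pd

# Hypothetical data: row 1 is completely empty, row 2 has a single gap
people = pd.DataFrame({'Name': ['Alice', None, 'Charlie'],
                       'Age': [25, None, None],
                       'Salary': [50000, None, 75000]})

# how='all' drops a row only when every column in it is missing
dropped_empty = people.dropna(how='all')
print(dropped_empty)

# thresh=3 keeps only the rows with at least 3 non-missing values
dropped_sparse = people.dropna(thresh=3)
print(dropped_sparse)
```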


Another way to fill missing values is the interpolate method. By default, it fills missing values with linearly spaced values between the existing numbers. This is especially common when you have time series data and are missing one or more values in the series. The example below shows the value we can fill at position x = 1.5 by using the points at x = 1 and x = 2 to draw a straight line.

!["Interpolation"](img/interpolation.PNG)

In [14]:
df_inter = df.copy()

df_inter["Age"] = df_inter["Age"].interpolate()
df_inter["Salary"] = df_inter["Salary"].interpolate()

df_inter

Unnamed: 0,Name,Age,Salary
0,Alice,25.0,50000.0
1,Bob,30.0,60000.0
2,Charlie,26.0,75000.0
3,David,22.0,77500.0
4,Eva,35.0,80000.0
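
To see where the interpolated numbers come from, the sketch below repeats the Age calculation by hand. With the default linear method, a single gap is filled with the midpoint of its two neighbours; the same arithmetic gives (75000 + 80000) / 2 = 77500 for David's salary. The names `ages`, `by_hand` and `by_pandas` are illustrative.

```python
import pandas as pd

ages = pd.Series([25, 30, None, 22, 35])

# Midpoint of the values on either side of the gap (Bob's 30 and David's 22)
by_hand = (30 + 22) / 2

# The default method='linear' treats the rows as equally spaced
by_pandas = ages.interpolate()[2]

print(by_hand, by_pandas)
```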


### Removing Duplicates
Duplicate records can skew analysis results. Let's create a DataFrame with duplicates and demonstrate how to handle them.

In [15]:
data_duplicates = {'ID': [1, 2, 3, 4, 1],
                   'Product': ['A', 'B', 'C', 'D', 'A'],
                   'Quantity': [10, 20, 15, 30, 10]}

df_duplicates = pd.DataFrame(data_duplicates)
df_duplicates

Unnamed: 0,ID,Product,Quantity
0,1,A,10
1,2,B,20
2,3,C,15
3,4,D,30
4,1,A,10


Check and Remove Duplicates.

Checking and removing duplicates is essential for maintaining data integrity.

In [16]:
# Check for duplicates
print("Duplicates before removal: \n",df_duplicates.duplicated())

# Remove duplicates
df_no_duplicates = df_duplicates.drop_duplicates()
print("\nDataFrame after removing duplicates: \n", df_no_duplicates)


Duplicates before removal: 
 0    False
1    False
2    False
3    False
4     True
dtype: bool

DataFrame after removing duplicates: 
    ID Product  Quantity
0   1       A        10
1   2       B        20
2   3       C        15
3   4       D        30
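
By default, a row only counts as a duplicate when every column matches. If you want to treat rows as duplicates based on certain columns only, drop_duplicates accepts 'subset' and 'keep' arguments. The sketch below uses a made-up `orders` dataframe, and keeping the last occurrence is just one possible choice.

```python
import pandas as pd

# Hypothetical orders: ID 1 appears twice with different quantities
orders = pd.DataFrame({'ID': [1, 2, 3, 1],
                       'Product': ['A', 'B', 'C', 'A'],
                       'Quantity': [10, 20, 15, 12]})

# No row is an exact copy of another, so the default check finds nothing
print(orders.duplicated().sum())

# Compare on 'ID' only and keep the last occurrence of each ID
latest = orders.drop_duplicates(subset=['ID'], keep='last')
print(latest)
```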


### Challenge Task 1: Data Cleaning Example

You have been given some fake data below that contains information about employees at a random company. Your task is to perform the following data cleaning steps:

- Identify and handle missing values in the dataset. Think about the best way to do this.
- Remove any duplicate records from the dataset.
- Display the cleaned DataFrame.

The goal is to apply the data cleaning techniques learned in the notebook to ensure the dataset is ready for further analysis.

In [2]:
# First 5 rows of the dataframe for challenge 1. Make sure you have run the first command of this notebook for this to work

challenge_1_df.head()

Unnamed: 0,ID,Name,Age,Salary,Role,OfficeLocation
0,1,Alice,28.0,60000.0,Engineer,New York
1,2,Bob,35.0,75000.0,Manager,San Francisco
2,3,Charlie,,80000.0,Analyst,Los Angeles
3,4,David,32.0,90000.0,Director,Chicago
4,5,Eva,28.0,70000.0,Assistant,Boston


In [3]:
print("Count of missing values: \n", challenge_1_df.isnull().sum())

print("\n Duplicate rows: \n", challenge_1_df.duplicated())

Count of missing values: 
 ID                0
Name              0
Age               4
Salary            3
Role              1
OfficeLocation    2
dtype: int64

 Duplicate rows: 
 0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
dtype: bool


### NOTE
Based on the above analysis, first we will drop the 1 duplicate row, before addressing the missing values.

For the missing values, I am going to drop the rows where there are missing values in the string columns (role & location), and then calculate the column means to fill in the missing values of the numerical columns (age & salary).

In your analysis, you may have chosen to address the missing values in a different way. In the real world, any approach is generally fine as long as you can justify your choice based on the analysis you need to conduct.

In [4]:
challenge_1_df = challenge_1_df.drop_duplicates()
challenge_1_df

Unnamed: 0,ID,Name,Age,Salary,Role,OfficeLocation
0,1,Alice,28.0,60000.0,Engineer,New York
1,2,Bob,35.0,75000.0,Manager,San Francisco
2,3,Charlie,,80000.0,Analyst,Los Angeles
3,4,David,32.0,90000.0,Director,Chicago
4,5,Eva,28.0,70000.0,Assistant,Boston
6,6,Frank,45.0,,,San Francisco
7,7,Grace,38.0,80000.0,Analyst,Los Angeles
8,8,Henry,,75000.0,Director,
9,9,Ivy,29.0,65000.0,Assistant,Boston
10,10,Jack,34.0,,Manager,San Francisco


In [5]:
challenge_1_df = challenge_1_df.dropna(subset=['Role', 'OfficeLocation'])
challenge_1_df

Unnamed: 0,ID,Name,Age,Salary,Role,OfficeLocation
0,1,Alice,28.0,60000.0,Engineer,New York
1,2,Bob,35.0,75000.0,Manager,San Francisco
2,3,Charlie,,80000.0,Analyst,Los Angeles
3,4,David,32.0,90000.0,Director,Chicago
4,5,Eva,28.0,70000.0,Assistant,Boston
7,7,Grace,38.0,80000.0,Analyst,Los Angeles
9,9,Ivy,29.0,65000.0,Assistant,Boston
10,10,Jack,34.0,,Manager,San Francisco
11,2,Bob,,72000.0,Manager,San Francisco
12,11,Alice,28.0,60000.0,Engineer,New York


In [6]:
challenge_1_df.loc[:,'Age'] = challenge_1_df['Age'].fillna(challenge_1_df['Age'].mean())
challenge_1_df.loc[:,'Salary'] = challenge_1_df['Salary'].fillna(challenge_1_df['Salary'].mean())
challenge_1_df

Unnamed: 0,ID,Name,Age,Salary,Role,OfficeLocation
0,1,Alice,28.0,60000.0,Engineer,New York
1,2,Bob,35.0,75000.0,Manager,San Francisco
2,3,Charlie,30.545455,80000.0,Analyst,Los Angeles
3,4,David,32.0,90000.0,Director,Chicago
4,5,Eva,28.0,70000.0,Assistant,Boston
7,7,Grace,38.0,80000.0,Analyst,Los Angeles
9,9,Ivy,29.0,65000.0,Assistant,Boston
10,10,Jack,34.0,76416.666667,Manager,San Francisco
11,2,Bob,30.545455,72000.0,Manager,San Francisco
12,11,Alice,28.0,60000.0,Engineer,New York


In [7]:
# Check you have cleaned the dataset correctly

print("Count of missing values: \n", challenge_1_df.isnull().sum())

print("\n Duplicate rows: \n", challenge_1_df.duplicated())

Count of missing values: 
 ID                0
Name              0
Age               0
Salary            0
Role              0
OfficeLocation    0
dtype: int64

 Duplicate rows: 
 0     False
1     False
2     False
3     False
4     False
7     False
9     False
10    False
11    False
12    False
13    False
15    False
16    False
dtype: bool


In [8]:
# To tidy up the final dataset, change age to integer datatype and round the salary column to 2 decimal places,
# then display the final dataframe.
# Note: in recent pandas versions, assigning through .loc[:, 'Age'] keeps the column's existing float dtype,
# so astype is applied to the dataframe and the result is reassigned instead.

challenge_1_df = challenge_1_df.astype({'Age': int})
challenge_1_df = challenge_1_df.round({'Salary': 2})
challenge_1_df

Unnamed: 0,ID,Name,Age,Salary,Role,OfficeLocation
0,1,Alice,28,60000.0,Engineer,New York
1,2,Bob,35,75000.0,Manager,San Francisco
2,3,Charlie,30,80000.0,Analyst,Los Angeles
3,4,David,32,90000.0,Director,Chicago
4,5,Eva,28,70000.0,Assistant,Boston
7,7,Grace,38,80000.0,Analyst,Los Angeles
9,9,Ivy,29,65000.0,Assistant,Boston
10,10,Jack,34,76416.67,Manager,San Francisco
11,2,Bob,30,72000.0,Manager,San Francisco
12,11,Alice,28,60000.0,Engineer,New York


## Data Manipulation

There are various ways you can manipulate data, including selecting specific rows & columns, filtering, grouping, aggregating, pivoting, sorting, changing data types, & joining dataframes together. All of these techniques were covered in the week 3 material, except for the last 3, which will be covered now.

### Sorting data

Sorting data is crucial for better visualization and analysis. Let's demonstrate how to sort a DataFrame, using df2 (the original dataframe with the missing values filled in by the column mean).

In [18]:
# Sort by single column
# Sort DataFrame by 'Age' in descending order
df2_sorted_age = df2.sort_values(by='Age', ascending=False)
df2_sorted_age


Unnamed: 0,Name,Age,Salary
4,Eva,35.0,80000.0
1,Bob,30.0,60000.0
2,Charlie,28.0,75000.0
0,Alice,25.0,50000.0
3,David,22.0,66250.0


In [19]:
# Sort by multiple columns - but first, add in some extra data!
new_data = {'Name': ['Frank', 'George', 'Hayley'],
            'Age': [25, 30, 22],
            'Salary': [60000, 75000, 65000]}
new_df = pd.DataFrame(new_data)

df2_new = pd.concat([df2, new_df], ignore_index = True) 
df2_new

Unnamed: 0,Name,Age,Salary
0,Alice,25.0,50000.0
1,Bob,30.0,60000.0
2,Charlie,28.0,75000.0
3,David,22.0,66250.0
4,Eva,35.0,80000.0
5,Frank,25.0,60000.0
6,George,30.0,75000.0
7,Hayley,22.0,65000.0


In [20]:
# Sort DataFrame by 'Age' in ascending order and then by 'Salary' in descending order
df2_sorted_multiple = df2_new.sort_values(by=['Age', 'Salary'], ascending=[True, False])
df2_sorted_multiple

Unnamed: 0,Name,Age,Salary
3,David,22.0,66250.0
7,Hayley,22.0,65000.0
5,Frank,25.0,60000.0
0,Alice,25.0,50000.0
2,Charlie,28.0,75000.0
6,George,30.0,75000.0
1,Bob,30.0,60000.0
4,Eva,35.0,80000.0
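
Notice that the index labels in the sorted outputs above are no longer in order, because sort_values keeps each row's original label. If you would rather have a clean 0, 1, 2, ... index after sorting, one option is reset_index. The sketch below uses a small made-up `people` dataframe.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'Age': [25, 30, 22]})

# Sorting keeps the original index labels, so they end up out of order
by_age = people.sort_values(by='Age')
print(by_age.index.tolist())

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
by_age_clean = by_age.reset_index(drop=True)
print(by_age_clean)
```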


### Changing Data Types

Changing data types is important when the default types assigned by Pandas may not be suitable for analysis.

In [21]:
df2_sorted_multiple.dtypes

Name       object
Age       float64
Salary    float64
dtype: object

In [22]:
# Change the data type of the 'Age' column to integer
df2_sorted_multiple['Age'] = df2_sorted_multiple['Age'].astype(int)
print(df2_sorted_multiple.dtypes)

Name       object
Age         int64
Salary    float64
dtype: object


In [23]:
# Convert data types of multiple columns
df2_new = df2_new.astype({'Age': 'int', 'Salary': 'float'}) #note, salary was already float (this is just an example)
print(df2_new.dtypes)

Name       object
Age         int64
Salary    float64
dtype: object
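
The conversions above work because the missing ages were filled in first. As a sketch of what happens otherwise, the example below converts a column that still has a gap: astype(int) raises an error, while the pandas nullable 'Int64' dtype (with a capital I) keeps the gap. The `ages` series is made up for illustration.

```python
import pandas as pd

ages = pd.Series([25.0, None, 35.0])

# A plain integer column cannot hold a missing value, so pandas raises an error
try:
    ages.astype(int)
    conversion_failed = False
except (ValueError, TypeError) as err:
    conversion_failed = True
    print("Conversion failed:", err)

# The nullable 'Int64' dtype stores whole numbers and keeps the gap as <NA>
ages_nullable = ages.astype('Int64')
print(ages_nullable)
```

Whether to fill the gap first or use a nullable dtype depends on whether the missing value is meaningful for your analysis.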


### Joining Datasets

Joining datasets is a common operation when combining information from different sources. It is important to keep in mind which type of join you are doing (for example: inner, outer, left, right). Below is a visualisation of a few different types of joins.

- **Inner**: Returns a Dataframe with only the rows whose key appears in both of the joined Dataframes.
- **Left Join**: Returns all the rows in the first Dataframe, plus the matching columns from the second Dataframe for rows with a common key. If a key is missing from the second Dataframe, the returned Dataframe fills those columns with missing values (NaN).
- **Right Join**: The opposite of a left join: returns all the rows in the second Dataframe, plus the matching columns from the first Dataframe.
- **Outer Join**: Returns all the rows from both Dataframes, matching up rows where possible, with missing values (NaN) elsewhere.

<div style="text-align: center;">
<img src = https://miro.medium.com/v2/resize:fit:900/1*yb76Gk03pZsjVDp79n2yKA.jpeg width = "500">
</div>


In [24]:
# Create two sample DataFrames
df1_join_eg = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df1_join_eg


Unnamed: 0,ID,Name
0,1,Alice
1,2,Bob
2,3,Charlie


In [25]:
df2_join_eg = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [60000, 75000, 90000]})
df2_join_eg

Unnamed: 0,ID,Salary
0,2,60000
1,3,75000
2,4,90000


#### Inner

In [27]:
# Merge DataFrames based on 'ID', using 'inner' method
merged_df_inner = pd.merge(df1_join_eg, df2_join_eg, on='ID', how='inner')
merged_df_inner

Unnamed: 0,ID,Name,Salary
0,2,Bob,60000
1,3,Charlie,75000


#### Outer

In [28]:
# Merge DataFrames based on 'ID', using 'outer' method
merged_df_outer = pd.merge(df1_join_eg, df2_join_eg, on='ID', how='outer')
merged_df_outer

Unnamed: 0,ID,Name,Salary
0,1,Alice,
1,2,Bob,60000.0
2,3,Charlie,75000.0
3,4,,90000.0


#### Left

In [7]:
# Merge DataFrames based on 'ID', using 'left' method
merged_df_left = pd.merge(df1_join_eg, df2_join_eg, on='ID', how='left')
merged_df_left

Unnamed: 0,ID,Name,Salary
0,1,Alice,
1,2,Bob,60000.0
2,3,Charlie,75000.0


#### Right

In [29]:
# Merge DataFrames based on 'ID', using 'right' method
merged_df_right = pd.merge(df1_join_eg, df2_join_eg, on='ID', how='right')
merged_df_right

Unnamed: 0,ID,Name,Salary
0,2,Bob,60000
1,3,Charlie,75000
2,4,,90000
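
When checking a join, it can help to see which side each row came from. The sketch below rebuilds the two sample dataframes under the illustrative names `left` and `right` and uses the 'indicator' argument of pd.merge, which adds a '_merge' column.

```python
import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
right = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [60000, 75000, 90000]})

# indicator=True adds a '_merge' column with 'left_only', 'right_only' or 'both'
checked = pd.merge(left, right, on='ID', how='outer', indicator=True)
print(checked[['ID', '_merge']])
```

Rows marked 'left_only' or 'right_only' are the ones that an inner join would drop.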


You can also join on multiple attributes (columns). In the above example, you would not merge on multiple columns as both df1_join_eg & df2_join_eg only have 1 column in common. For more information about joining on multiple columns, check out this website: https://sparkbyexamples.com/pandas/pandas-merge-two-dataframes-on-multiple-columns/
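
As a minimal sketch of a multi-column join, the example below uses two made-up dataframes, `staff` and `pay`, in which a row is identified by 'ID' and 'Office' together. These names and values are assumptions for illustration and are not part of the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical data where ID alone is not unique, but ID plus Office is
staff = pd.DataFrame({'ID': [1, 1, 2],
                      'Office': ['Boston', 'Chicago', 'Boston'],
                      'Name': ['Alice', 'Aaron', 'Bob']})
pay = pd.DataFrame({'ID': [1, 2],
                    'Office': ['Chicago', 'Boston'],
                    'Salary': [70000, 60000]})

# Passing a list to `on` matches rows only when every listed column agrees
staff_pay = pd.merge(staff, pay, on=['ID', 'Office'], how='inner')
print(staff_pay)
```

Alice is dropped here because there is no 'pay' row with ID 1 in the Boston office, even though ID 1 exists in both dataframes.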

### Challenge Task 2: Data manipulation example

You have two datasets provided below. Your task is to perform the following data manipulation steps:

- Apply the following operations on each DataFrame:
  1. Select only columns which are relevant to an employee database.
  2. Address the missing values in the way you see most logical.
  3. Join the two DataFrames using both the left join and inner join methods based on the 'ID' column (2 new dataframes will be created).
  4. Ensure the datatypes for each column are logically correct.
  5. Sort the DataFrame based on the 'Salary' column in descending order.
- Display the final result of both joined DataFrames.

In [9]:
# First 5 rows of the dataframe 1 for challenge 2. Make sure you have run the first command of this notebook for this to work

challenge_2_df_1.head()

Unnamed: 0,ID,Name,Age,Hobby,Salary
0,1,Alice,28.0,Reading,60000
1,2,Bob,35.0,Gaming,75000
2,3,Charlie,32.0,Painting,80000
3,4,David,45.0,Cooking,90000
4,5,Eva,28.0,Traveling,70000


In [10]:
# First 5 rows of the dataframe 2 for challenge 2. Make sure you have run the first command of this notebook for this to work

challenge_2_df_2.head()

Unnamed: 0,ID,Role,Pet,FavoriteFood,OfficeLocation
0,1,Engineer,,Pizza,New York
1,2,Manager,Dog,Sushi,San Francisco
2,3,Analyst,Cat,Pasta,Los Angeles
3,4,Director,Fish,Burger,Chicago
4,5,Assistant,,Salad,Boston


In [11]:
# STEP 1
challenge_2_df_1 = challenge_2_df_1[['ID', 'Name', 'Age', 'Salary']].copy()
challenge_2_df_2 = challenge_2_df_2[['ID', 'Role', 'OfficeLocation']].copy()

In [12]:
# STEP 2
print("Missing values in df 1: \n", challenge_2_df_1.isnull().sum())
print("\n Missing values in df 2: \n", challenge_2_df_2.isnull().sum())


Missing values in df 1: 
 ID        0
Name      0
Age       1
Salary    0
dtype: int64

 Missing values in df 2: 
 ID                0
Role              0
OfficeLocation    1
dtype: int64


In [13]:
# Decision to address missing values for challenge 2: drop both rows

challenge_2_df_1 = challenge_2_df_1.dropna()
challenge_2_df_2 = challenge_2_df_2.dropna()

In [14]:
# STEP 3

left_join_df = challenge_2_df_1.merge(challenge_2_df_2, on="ID", how="left")
left_join_df

Unnamed: 0,ID,Name,Age,Salary,Role,OfficeLocation
0,1,Alice,28.0,60000,Engineer,New York
1,2,Bob,35.0,75000,Manager,San Francisco
2,3,Charlie,32.0,80000,Analyst,Los Angeles
3,4,David,45.0,90000,Director,Chicago
4,5,Eva,28.0,70000,Assistant,Boston
5,6,Frank,45.0,85000,,
6,7,Grace,38.0,80000,,
7,9,Ivy,29.0,65000,,
8,10,Jack,34.0,72000,,


In [15]:
inner_join_df = challenge_2_df_1.merge(challenge_2_df_2, on="ID", how="inner")
inner_join_df

Unnamed: 0,ID,Name,Age,Salary,Role,OfficeLocation
0,1,Alice,28.0,60000,Engineer,New York
1,2,Bob,35.0,75000,Manager,San Francisco
2,3,Charlie,32.0,80000,Analyst,Los Angeles
3,4,David,45.0,90000,Director,Chicago
4,5,Eva,28.0,70000,Assistant,Boston


In [16]:
# STEP 4

print("data types for left join df: \n", left_join_df.dtypes)
print("\n data types for inner join df: \n", inner_join_df.dtypes)

data types for left join df: 
 ID                  int64
Name               object
Age               float64
Salary              int64
Role               object
OfficeLocation     object
dtype: object

 data types for inner join df: 
 ID                  int64
Name               object
Age               float64
Salary              int64
Role               object
OfficeLocation     object
dtype: object


In [17]:
# Change age column to integer
left_join_df['Age'] = left_join_df['Age'].astype(int)
inner_join_df['Age'] = inner_join_df['Age'].astype(int)

In [18]:
# STEP 5

left_join_df = left_join_df.sort_values(by='Salary', ascending=False)
inner_join_df = inner_join_df.sort_values(by='Salary', ascending=False)

In [19]:
# STEP 6
left_join_df

Unnamed: 0,ID,Name,Age,Salary,Role,OfficeLocation
3,4,David,45,90000,Director,Chicago
5,6,Frank,45,85000,,
6,7,Grace,38,80000,,
2,3,Charlie,32,80000,Analyst,Los Angeles
1,2,Bob,35,75000,Manager,San Francisco
8,10,Jack,34,72000,,
4,5,Eva,28,70000,Assistant,Boston
7,9,Ivy,29,65000,,
0,1,Alice,28,60000,Engineer,New York


In [20]:
inner_join_df

Unnamed: 0,ID,Name,Age,Salary,Role,OfficeLocation
3,4,David,45,90000,Director,Chicago
2,3,Charlie,32,80000,Analyst,Los Angeles
1,2,Bob,35,75000,Manager,San Francisco
4,5,Eva,28,70000,Assistant,Boston
0,1,Alice,28,60000,Engineer,New York
