# Merging Data

We often want to combine data stored in multiple sources into a single representation for analysis or modelling. In this notebook, we look at how we can **merge** data using Pandas DataFrames.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## Data Loading 

We will work with records for university students and programmes, which are split across two separate CSV files:
- *student_details.csv*: A master list of students with their unique student ID, first name, surname, gender, and university email.
- *student_enrolment.csv*: Enrolment records for students, indicating their study level, school, and programme year linked by student ID.

In [None]:
# read the first dataset
df_details = pd.read_csv("student_details.csv")
print(f"DataFrame has {df_details.shape[0]} rows and {df_details.shape[1]} columns")
print(f"Columns: {df_details.columns.tolist()}")
df_details.head(10)

In [None]:
# read the second dataset
df_enrol = pd.read_csv("student_enrolment.csv")
print(f"DataFrame has {df_enrol.shape[0]} rows and {df_enrol.shape[1]} columns")
print(f"Columns: {df_enrol.columns.tolist()}")
df_enrol.head(10)

## Merging Data - Inner Joins

Merging two DataFrames in Pandas involves matching rows based on one or more key columns and combining their associated data into a single unified DataFrame. We do this using the `pd.merge()` function.

The most common type of merge in Pandas is an **inner join**. This operation keeps only rows that have matching keys in both DataFrames.
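
To make this definition concrete, here is a minimal sketch using two tiny hand-made DataFrames. The names `left`, `right`, and `toy_inner`, along with all the values in them, are invented for illustration and are not taken from the student files.

```python
import pandas as pd

# hypothetical toy tables, invented for illustration only
left = pd.DataFrame({"key": [1, 2, 3], "name": ["Ann", "Bob", "Cat"]})
right = pd.DataFrame({"key": [2, 3, 4], "score": [70, 85, 90]})

# inner join: keep only the keys that are present in both tables
toy_inner = pd.merge(left, right, how="inner", on="key")
print(toy_inner)
```

Keys 1 and 4 each appear in only one of the toy tables, so neither makes it into the result.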

For our data, we join the two DataFrames on the `student_id` column, which appears in both, so only students who are present in both files will appear in the result.

Note that when we call the `pd.merge()` function here, the argument `how="inner"` indicates that an inner join is to be performed.

In [None]:
# perform inner join on student_id
df_merged1 = pd.merge(
    df_details, df_enrol,
    how="inner",
    on="student_id"
)

print(f"Merged DataFrame has {df_merged1.shape[0]} rows and {df_merged1.shape[1]} columns")
print(f"Merged Columns: {df_merged1.columns.tolist()}")

Notice that our new merged DataFrame only contains rows whose `student_id` appears in both input DataFrames, so any students without a matching enrolment are excluded from the result.

In [None]:
# check the original student IDs from the details table
details_student_ids = set(df_details["student_id"])
print(f"Count of original students: {len(details_student_ids)}")

# check the student IDs from the merged table
merged1_student_ids = set(df_merged1["student_id"])
print(f"Count of enrolled students: {len(merged1_student_ids)}")

# which students aren't enrolled?
print("Missing students:")
details_student_ids.difference(merged1_student_ids)
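
The set difference above returns only the IDs. If the full detail rows for those students are wanted, one option is a boolean mask built with `isin`. The sketch below uses small made-up stand-ins for the two tables (`toy_details` and `toy_enrol`, with invented values) so that it runs on its own; the same pattern should carry over to `df_details` and `df_enrol`, assuming both contain a `student_id` column as described earlier.

```python
import pandas as pd

# hypothetical stand-ins for the two student tables (all values invented)
toy_details = pd.DataFrame({
    "student_id": [101, 102, 103, 104],
    "surname": ["Byrne", "Kelly", "Murphy", "Walsh"],
})
toy_enrol = pd.DataFrame({
    "student_id": [101, 103],
    "level": ["Undergraduate", "Postgraduate"],
})

# True for each details row whose student_id has an enrolment record
has_enrolment = toy_details["student_id"].isin(toy_enrol["student_id"])

# keep the rows without a match, i.e. students dropped by an inner join
toy_unenrolled = toy_details[~has_enrolment]
print(toy_unenrolled)
```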

We can now characterise the data in our merged DataFrame.

For instance, we could look at gender balance by programme level or by school:

In [None]:
# cross tabulation of level and gender
pd.crosstab(df_merged1["level"], df_merged1["gender"], margins=True)

In [None]:
# cross tabulation of school and gender
pd.crosstab(df_merged1["school"], df_merged1["gender"], margins=True)

## Merging Data - Outer Joins

An alternative merging strategy is an **outer join**. This includes all keys from both tables: matched rows where possible, and unmatched rows filled with missing values. This is useful for a completeness view, but it introduces missing values for non-matches.
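
As a minimal sketch of this behaviour, consider two tiny hand-made DataFrames. The names `left`, `right`, and `toy_outer`, along with all the values in them, are invented for illustration and are not taken from the student files.

```python
import pandas as pd

# hypothetical toy tables, invented for illustration only
left = pd.DataFrame({"key": [1, 2, 3], "name": ["Ann", "Bob", "Cat"]})
right = pd.DataFrame({"key": [2, 3, 4], "score": [70, 85, 90]})

# outer join: keep every key that appears in either table
toy_outer = pd.merge(left, right, how="outer", on="key")
print(toy_outer)

# unmatched rows have missing values in the columns from the other table
print(toy_outer.isna().sum())
```

Key 1 has no `score` and key 4 has no `name`, so each of those columns ends up with one missing value.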

Again, we will join the two DataFrames based on the column `student_id`. However, when we call the `pd.merge()` function now, the argument `how="outer"` indicates that an outer join is to be performed.

In [None]:
# perform outer join on student_id
df_merged2 = pd.merge(
    df_details, df_enrol,
    how="outer",
    on="student_id"
)

print(f"Merged DataFrame has {df_merged2.shape[0]} rows and {df_merged2.shape[1]} columns")
print(f"Merged Columns: {df_merged2.columns.tolist()}")

Notice that our merged DataFrame now contains rows for every student in the original details DataFrame. An outer join would likewise keep any enrolment records whose `student_id` does not appear in the details file.

However, we now have rows with missing values, i.e. students who are not enrolled on any programme.

In [None]:
# check columns for missing values 
df_merged2.isna().sum()

In [None]:
# check which rows have missing values (i.e. students not enrolled)
df_merged2[df_merged2.isna().any(axis=1)]
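
A missing value on its own does not say which file a row was missing from. One way to make this explicit is the `indicator=True` argument of `pd.merge()`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The sketch below uses small made-up stand-ins for the two tables (`toy_details` and `toy_enrol`, with invented values, including an enrolment record for a hypothetical ID 999 that has no details) so that it runs on its own.

```python
import pandas as pd

# hypothetical stand-ins for the two student tables (all values invented)
toy_details = pd.DataFrame({
    "student_id": [101, 102, 103],
    "surname": ["Byrne", "Kelly", "Murphy"],
})
toy_enrol = pd.DataFrame({
    "student_id": [101, 103, 999],
    "level": ["Undergraduate", "Postgraduate", "Undergraduate"],
})

# outer join with an extra _merge column recording where each row came from
toy_merged = pd.merge(
    toy_details, toy_enrol,
    how="outer",
    on="student_id",
    indicator=True
)
print(toy_merged["_merge"].value_counts())

# students with details but no enrolment record
toy_left_only = toy_merged[toy_merged["_merge"] == "left_only"]

# enrolment records with no matching student details
toy_right_only = toy_merged[toy_merged["_merge"] == "right_only"]
```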

We could decide to replace the missing values in some of these columns:

In [None]:
df_merged2["level"] = df_merged2["level"].fillna("Unknown")
df_merged2["school"] = df_merged2["school"].fillna("Not allocated")

We can then use frequency tables to check the counts for these columns after filling the missing values:

In [None]:
for col in ["level", "school"]:
    print(f"- Column: {col}")
    display(df_merged2[col].value_counts())
    print()