---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: E-Commerce

### 📋 **Topic**: Combining DataFrames with Python

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission from the professor.

---


## Overview

In real life, our data comes from multiple sources. To perform an analysis, we need to combine DataFrames using one or more common key variables, that is, columns that appear in both tables.

**Today we'll learn:**
- How to join DataFrames together
- Different types of joins (left, right, inner, outer)
- How to handle missing matches between datasets
- Real-world examples of combining data


## 1. Data


In [None]:
# Let's import the libraries we will use
import pandas as pd
import numpy as np

# Creating the data

# Let's create our first data frame, consisting of directors and their nationalities
df_directors = pd.DataFrame(
    {
        "surname": ["Spielberg", "Scorsese", "Hitchcock", "Tarantino", "Villeneuve"],
        "nationality": ["US", "US", "UK", "US", "Canada"],
    }
)

# Create a data frame that has movies and directors
df_movies = pd.DataFrame(
    {
        "surname": [
            "Spielberg",
            "Scorsese",
            "Hitchcock",
            "Hitchcock",
            "Spielberg",
            "Tarantino",
            "Villeneuve",
        ],
        "title": [
            "Schindler's List",
            "Taxi Driver",
            "Psycho",
            "North by Northwest",
            "Catch Me If You Can",
            "Reservoir Dogs",
            "Dune",
        ],
    }
)

print("Directors DataFrame:")
print(df_directors)
print("\nMovies DataFrame:")
print(df_movies)


## 2. Combining dataframes

### 2.1 Left Join

What if we wanted to create a new DataFrame that adds the directors' nationalities to the movies DataFrame? This is easy in Python using the pandas `merge()` method.


In [None]:
df_all = df_movies.merge(df_directors, on="surname", how="left")
# Left join keeps all the rows from the left data frame (df_movies),
# and adds the columns from the right data frame (df_directors)
# if there is a match on the key variable (surname)

print("Left join result:")
print(df_all.head(20))
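
A left join like the one above keeps one row per movie only if each surname appears once in the directors table. As a minimal sketch, assuming small stand-in tables (the names `movies`, `directors`, and `merged` are illustrative, not from the notebook), the `validate` argument of `merge()` can check that assumption:

```python
import pandas as pd

# Stand-in tables shaped like the notebook's (illustrative values)
movies = pd.DataFrame(
    {
        "surname": ["Hitchcock", "Hitchcock", "Tarantino"],
        "title": ["Psycho", "North by Northwest", "Reservoir Dogs"],
    }
)
directors = pd.DataFrame(
    {"surname": ["Hitchcock", "Tarantino"], "nationality": ["UK", "US"]}
)

# validate="many_to_one" makes pandas raise a MergeError if a surname
# appears more than once in the right-hand table
merged = movies.merge(directors, on="surname", how="left", validate="many_to_one")

# With unique keys on the right, the result has one row per movie
print(len(merged) == len(movies))
print(merged)
```

If the directors table contained a duplicated surname, the merge would stop with an error instead of silently duplicating movie rows.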


### 2.2 Concatenating dataframes


In [None]:

# Let's say we found some new directors, which we want to add
# to the original data frame
df_directors_new = pd.DataFrame(
    {"surname": ["Guadagnino", "Kubrick", "Jackson"], "nationality": ["IT", "US", "NZ"]}
)

# We can add them to the original data frame using pd.concat().
# This "concatenates" the new data frame to the original one, row-wise;
# ignore_index=True renumbers the rows of the result from 0
df_directors = pd.concat([df_directors, df_directors_new], ignore_index=True)

print("Updated directors DataFrame:")
print(df_directors)
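
To see what `ignore_index=True` does in the cell above, here is a small sketch with stand-in tables (the names `old`, `new`, `kept`, and `reset` are illustrative). It compares the row labels with and without the argument:

```python
import pandas as pd

old = pd.DataFrame({"surname": ["Spielberg", "Scorsese"], "nationality": ["US", "US"]})
new = pd.DataFrame({"surname": ["Kubrick"], "nationality": ["US"]})

# Without ignore_index, each piece keeps its own row labels,
# so the label 0 appears twice in the result
kept = pd.concat([old, new])

# With ignore_index=True, the result gets fresh labels 0, 1, 2, ...
reset = pd.concat([old, new], ignore_index=True)

print(kept.index.tolist())
print(reset.index.tolist())
```

Duplicate row labels are legal in pandas, but they make label-based selection such as `.loc[0]` return more than one row, which is usually not what we want.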


In [None]:
# Let's do the same with some new movies
df_movies_new = pd.DataFrame(
    {
        "surname": ["Guadagnino", "Kubrick", "Fellini"],
        "title": ["Call Me By Your Name", "Barry Lyndon", "La Dolce Vita"],
    }
)

df_movies = pd.concat([df_movies, df_movies_new], ignore_index=True)

print("Updated movies DataFrame:")
print(df_movies)


## 3. Different Types of Joins

Now we have some missing matches:
- `df_directors` has `Jackson`, but there is no movie by him in `df_movies`
- `df_movies` has "La Dolce Vita" by `Fellini`, but there is no director by that name in `df_directors`

Let's see how different join types handle this.


In [None]:
# Left join with missing matches
df_all = df_movies.merge(df_directors, on="surname", how="left")
print("Left join with missing matches:")
print(df_all)
print("\n" + "="*50 + "\n")

# Inner join - only keep rows that match in both datasets
df_all = df_movies.merge(df_directors, on="surname", how="inner")
print("Inner join result:")
print(df_all)
print("\n" + "="*50 + "\n")

# Outer join - keep all rows from both datasets
df_all = df_movies.merge(df_directors, on="surname", how="outer")
print("Full/Outer join result:")
print(df_all)
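
Of the four join types listed in the overview, the right join is the one not run above. A minimal sketch, using small stand-in tables with one unmatched row on each side (the names `movies`, `directors`, and `right` are illustrative, not from the notebook):

```python
import pandas as pd

movies = pd.DataFrame(
    {
        "surname": ["Kubrick", "Fellini"],
        "title": ["Barry Lyndon", "La Dolce Vita"],
    }
)
directors = pd.DataFrame(
    {"surname": ["Kubrick", "Jackson"], "nationality": ["US", "NZ"]}
)

# Right join: keep every row of the right table (directors).
# Jackson has no movie, so his title is NaN; Fellini is dropped
# because he does not appear in the right table.
right = movies.merge(directors, on="surname", how="right")
print(right)
```

Swapping the two tables and using `how="left"` returns the same rows, only with the columns in a different order.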


---

## 🎉 Summary

### Key Join Types:
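- **Left join** (`how="left"`): Keep all rows from the left DataFrame; unmatched rows get `NaN` in the columns from the right
- **Right join** (`how="right"`): Keep all rows from the right DataFrame; unmatched rows get `NaN` in the columns from the left
- **Inner join** (`how="inner"`): Keep only rows whose key appears in both DataFrames
- **Outer join** (`how="outer"`): Keep all rows from both DataFrames, filling `NaN` where there is no match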
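
One way to check which rows found a match is the `indicator` argument of `merge()`, which adds a `_merge` column to the result. The sketch below assumes small stand-in tables (the names `movies`, `directors`, `both`, and `unmatched` are illustrative):

```python
import pandas as pd

movies = pd.DataFrame(
    {
        "surname": ["Kubrick", "Fellini"],
        "title": ["Barry Lyndon", "La Dolce Vita"],
    }
)
directors = pd.DataFrame(
    {"surname": ["Kubrick", "Jackson"], "nationality": ["US", "NZ"]}
)

# indicator=True adds a "_merge" column whose value is
# "both", "left_only", or "right_only" for each row
both = movies.merge(directors, on="surname", how="outer", indicator=True)
print(both)

# Rows that exist in only one of the two tables
unmatched = both[both["_merge"] != "both"]
print(unmatched[["surname", "_merge"]])
```

Here `left_only` marks a movie whose director is missing from the directors table, and `right_only` marks a director with no movie.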

### Key Functions:
- `df.merge()`: For joining DataFrames on common columns
- `pd.concat()`: For stacking DataFrames vertically or horizontally
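
The notebook only stacks DataFrames vertically. For the horizontal case, a minimal sketch with stand-in tables (the names `names`, `countries`, and `side_by_side` are illustrative):

```python
import pandas as pd

names = pd.DataFrame({"surname": ["Spielberg", "Scorsese"]})
countries = pd.DataFrame({"nationality": ["US", "US"]})

# axis=1 places the frames side by side. Rows are aligned on the
# index (row labels), not on a key column as merge() does.
side_by_side = pd.concat([names, countries], axis=1)
print(side_by_side)
```

Because alignment is by row label, this is only safe when both tables list the same entities in the same order; otherwise `merge()` on a key is the better tool.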

### Next:
We'll analyze time series data and reputation inflation.

---
