---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: E-Commerce

### 📋 **Topic**: Combining DataFrames with Python

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---


## Overview

In real life, our data comes from multiple sources. To perform an analysis, we need to combine DataFrames using one or more common key variables, that is, columns whose values identify which rows belong together.

**What we'll learn:**
- How to merge DataFrames using common keys
- Different types of joins (left, right, inner, outer)
- How to concatenate DataFrames
- Working with multiple datasets
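
The overview mentions "one or more" key variables. As a minimal sketch of the multi-key case, the example below uses hypothetical DataFrames (`df_people`, `df_films`) that are not part of this notebook's data. Two directors share a surname, so `surname` alone does not identify a row, and passing a list to `on` is one way to handle that.

```python
import pandas as pd

# Hypothetical data: two directors share the surname "Scott",
# so surname alone is not a unique key.
df_people = pd.DataFrame(
    {
        "first_name": ["Ridley", "Tony"],
        "surname": ["Scott", "Scott"],
        "birth_year": [1937, 1944],
    }
)
df_films = pd.DataFrame(
    {
        "first_name": ["Ridley", "Tony"],
        "surname": ["Scott", "Scott"],
        "title": ["Alien", "Top Gun"],
    }
)

# Merging on surname only pairs every film with every "Scott" (2 x 2 = 4 rows)
df_single = df_films.merge(df_people, on="surname", how="left")
print(len(df_single))

# Passing a list to `on` matches rows only when every key column agrees
df_multi = df_films.merge(df_people, on=["first_name", "surname"], how="left")
print(df_multi)
```

With a single key the result has more rows than `df_films`, which is usually a sign that the key is not specific enough.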


In [1]:
# Let's import the libraries we will use
import pandas as pd
import numpy as np

# Creating the data
# Let's create our first data frame, consisting of directors, and their nationalities
df_directors = pd.DataFrame(
    {
        "surname": ["Spielberg", "Scorsese", "Hitchcock", "Tarantino", "Villeneuve"],
        "nationality": ["US", "US", "UK", "US", "Canada"],
    }
)

print("Directors data:")
print(df_directors)


Directors data:
      surname nationality
0   Spielberg          US
1    Scorsese          US
2   Hitchcock          UK
3   Tarantino          US
4  Villeneuve      Canada


In [2]:
# Create a data frame that has movies and directors
df_movies = pd.DataFrame(
    {
        "surname": [
            "Spielberg",
            "Scorsese",
            "Hitchcock",
            "Hitchcock",
            "Spielberg",
            "Tarantino",
            "Villeneuve",
        ],
        "title": [
            "Schindler's List",
            "Goodfellas",
            "Psycho",
            "The Birds",
            "Jaws",
            "Pulp Fiction",
            "Dune",
        ],
        "movie_id": [1, 2, 3, 4, 5, 6, 7],
        "genre": ["Drama", "Crime", "Horror", "Thriller", "Adventure", "Crime", "Sci-Fi"],
    }
)

print("Movies data:")
print(df_movies)


Movies data:
      surname             title  movie_id      genre
0   Spielberg  Schindler's List         1      Drama
1    Scorsese        Goodfellas         2      Crime
2   Hitchcock            Psycho         3     Horror
3   Hitchcock         The Birds         4   Thriller
4   Spielberg              Jaws         5  Adventure
5   Tarantino      Pulp Fiction         6      Crime
6  Villeneuve              Dune         7     Sci-Fi


## Combining DataFrames

### Left Join (Merge)
A left join keeps all rows from the left DataFrame and adds the columns of the right DataFrame wherever the key matches. Left rows without a match get `NaN` in the added columns:


In [3]:
# Left join: movies with director info
# This keeps all movies and adds director nationality where available
df_combined_left = df_movies.merge(df_directors, on="surname", how="left")

print("Left join result:")
print(df_combined_left)


Left join result:
      surname             title  movie_id      genre nationality
0   Spielberg  Schindler's List         1      Drama          US
1    Scorsese        Goodfellas         2      Crime          US
2   Hitchcock            Psycho         3     Horror          UK
3   Hitchcock         The Birds         4   Thriller          UK
4   Spielberg              Jaws         5  Adventure          US
5   Tarantino      Pulp Fiction         6      Crime          US
6  Villeneuve              Dune         7     Sci-Fi      Canada
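
The left join above relies on each surname appearing only once in `df_directors`. The sketch below, which uses a shortened copy of the data and a hypothetical `df_directors_dup` frame, shows one way to have pandas check that assumption with the `validate` argument of `merge()`.

```python
import pandas as pd

df_directors = pd.DataFrame(
    {"surname": ["Spielberg", "Hitchcock"], "nationality": ["US", "UK"]}
)
df_movies = pd.DataFrame(
    {
        "surname": ["Spielberg", "Hitchcock", "Hitchcock"],
        "title": ["Jaws", "Psycho", "The Birds"],
    }
)

# Many movies map to one director, so surname must be unique on the right side
df_checked = df_movies.merge(
    df_directors, on="surname", how="left", validate="many_to_one"
)
print(df_checked)

# Hypothetical mistake: the same director is listed twice
df_directors_dup = pd.concat(
    [df_directors, df_directors.iloc[[0]]], ignore_index=True
)
try:
    df_movies.merge(
        df_directors_dup, on="surname", how="left", validate="many_to_one"
    )
    duplicate_detected = False
except pd.errors.MergeError:
    # pandas refuses the merge instead of silently duplicating movie rows
    duplicate_detected = True
print("Duplicate key detected:", duplicate_detected)
```

Without `validate`, the second merge would succeed and quietly return an extra row for each Spielberg movie.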


### Different Types of Joins

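Let's explore different join types to understand how they handle missing matches. In this dataset every surname appears in both DataFrames, so all four joins return the same 7 rows; the join types only differ when some keys have no match: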


In [4]:
# Inner join: only rows that match in both DataFrames
df_combined_inner = df_movies.merge(df_directors, on="surname", how="inner")
print("Inner join result:")
print(df_combined_inner)
print(f"Shape: {df_combined_inner.shape}")

print("\n" + "="*50)

# Outer join: all rows from both DataFrames
df_combined_outer = df_movies.merge(df_directors, on="surname", how="outer")
print("Outer join result:")
print(df_combined_outer)
print(f"Shape: {df_combined_outer.shape}")

print("\n" + "="*50)

# Right join: all rows from right DataFrame
df_combined_right = df_movies.merge(df_directors, on="surname", how="right")
print("Right join result:")
print(df_combined_right)
print(f"Shape: {df_combined_right.shape}")


Inner join result:
      surname             title  movie_id      genre nationality
0   Spielberg  Schindler's List         1      Drama          US
1    Scorsese        Goodfellas         2      Crime          US
2   Hitchcock            Psycho         3     Horror          UK
3   Hitchcock         The Birds         4   Thriller          UK
4   Spielberg              Jaws         5  Adventure          US
5   Tarantino      Pulp Fiction         6      Crime          US
6  Villeneuve              Dune         7     Sci-Fi      Canada
Shape: (7, 5)

Outer join result:
      surname             title  movie_id      genre nationality
0   Hitchcock            Psycho         3     Horror          UK
1   Hitchcock         The Birds         4   Thriller          UK
2    Scorsese        Goodfellas         2      Crime          US
3   Spielberg  Schindler's List         1      Drama          US
4   Spielberg              Jaws         5  Adventure          US
5   Tarantino      Pulp Fiction         6      Crime          US
6  Villeneuve              Dune         7     Sci-Fi      Canada
Shape: (7, 5)

---



## Overview

In real life, our data comes from multiple sources. To perform an analysis, we need to combine DataFrames using one or more common key variables, that is, columns whose values identify which rows belong together.

**Today we'll learn:**
- How to join DataFrames together
- Different types of joins (left, right, inner, outer)
- How to handle missing matches between datasets
- Real-world examples of combining data


## 1. Data


In [5]:
# Let's import the libraries we will use
import pandas as pd
import numpy as np

# Creating the data

# Let's create our first data frame, consisting of directors, and their nationalities
df_directors = pd.DataFrame(
    {
        "surname": ["Spielberg", "Scorsese", "Hitchcock", "Tarantino", "Villeneuve"],
        "nationality": ["US", "US", "UK", "US", "Canada"],
    }
)

# Create a data frame that has movies and directors
df_movies = pd.DataFrame(
    {
        "surname": [
            "Spielberg",
            "Scorsese",
            "Hitchcock",
            "Hitchcock",
            "Spielberg",
            "Tarantino",
            "Villeneuve",
        ],
        "title": [
            "Schindler's List",
            "Taxi Driver",
            "Psycho",
            "North by Northwest",
            "Catch Me If You Can",
            "Reservoir Dogs",
            "Dune",
        ],
    }
)

print("Directors DataFrame:")
print(df_directors)
print("\nMovies DataFrame:")
print(df_movies)


Directors DataFrame:
      surname nationality
0   Spielberg          US
1    Scorsese          US
2   Hitchcock          UK
3   Tarantino          US
4  Villeneuve      Canada

Movies DataFrame:
      surname                title
0   Spielberg     Schindler's List
1    Scorsese          Taxi Driver
2   Hitchcock               Psycho
3   Hitchcock   North by Northwest
4   Spielberg  Catch Me If You Can
5   Tarantino       Reservoir Dogs
6  Villeneuve                 Dune


## 2. Combining DataFrames

### 2.1 Left Join

What if we wanted to create a new DataFrame that adds the directors' nationalities to the movies DataFrame? This is easy in Python using the pandas `merge()` method.


In [6]:
df_all = df_movies.merge(df_directors, on="surname", how="left")
# Left join keeps all the rows from the left data frame (df_movies),
# and adds the columns from the right data frame (df_directors)
# if there is a match on the key variable (surname)

print("Left join result:")
print(df_all.head(20))


Left join result:
      surname                title nationality
0   Spielberg     Schindler's List          US
1    Scorsese          Taxi Driver          US
2   Hitchcock               Psycho          UK
3   Hitchcock   North by Northwest          UK
4   Spielberg  Catch Me If You Can          US
5   Tarantino       Reservoir Dogs          US
6  Villeneuve                 Dune      Canada


### 2.2 Concatenating DataFrames


In [7]:

# Let's say we found some new directors, which we want to add
# to the original data frame
df_directors_new = pd.DataFrame(
    {"surname": ["Guadagnino", "Kubrick", "Jackson"], "nationality": ["IT", "US", "NZ"]}
)

# we can easily add them to the original data frame using pd.concat()
# this "concatenates" the new data frame to the original one, row-wise
df_directors = pd.concat([df_directors, df_directors_new], ignore_index=True)

print("Updated directors DataFrame:")
print(df_directors)


Updated directors DataFrame:
      surname nationality
0   Spielberg          US
1    Scorsese          US
2   Hitchcock          UK
3   Tarantino          US
4  Villeneuve      Canada
5  Guadagnino          IT
6     Kubrick          US
7     Jackson          NZ
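
The call above passes `ignore_index=True`. As a small sketch of what that argument changes, the example below concatenates two hypothetical frames (`df_a`, `df_b`) with and without it and compares the resulting row labels.

```python
import pandas as pd

df_a = pd.DataFrame(
    {"surname": ["Spielberg", "Scorsese"], "nationality": ["US", "US"]}
)
df_b = pd.DataFrame({"surname": ["Kubrick"], "nationality": ["US"]})

# Without ignore_index, each piece keeps its own labels, so label 0 appears twice
df_kept = pd.concat([df_a, df_b])
print(df_kept.index.tolist())  # [0, 1, 0]

# With ignore_index=True, the result gets a fresh 0..n-1 index
df_reset = pd.concat([df_a, df_b], ignore_index=True)
print(df_reset.index.tolist())  # [0, 1, 2]
```

Repeated labels are legal in pandas, but they make label-based selection such as `df.loc[0]` return more than one row, so resetting the index is usually the safer choice when the old labels carry no meaning.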


In [8]:
# Let's do the same with some new movies
df_movies_new = pd.DataFrame(
    {
        "surname": ["Guadagnino", "Kubrick", "Fellini"],
        "title": ["Call Me By Your Name", "Barry Lyndon", "La Dolce Vita"],
    }
)

df_movies = pd.concat([df_movies, df_movies_new], ignore_index=True)

print("Updated movies DataFrame:")
print(df_movies)


Updated movies DataFrame:
      surname                 title
0   Spielberg      Schindler's List
1    Scorsese           Taxi Driver
2   Hitchcock                Psycho
3   Hitchcock    North by Northwest
4   Spielberg   Catch Me If You Can
5   Tarantino        Reservoir Dogs
6  Villeneuve                  Dune
7  Guadagnino  Call Me By Your Name
8     Kubrick          Barry Lyndon
9     Fellini         La Dolce Vita


## 3. Different Types of Joins

Now we have some missing matches:
- `df_directors` has `Jackson`, but there is no movie by him in `df_movies`
- `df_movies` has "La Dolce Vita" by `Fellini`, but there is no director by that name in `df_directors`

Let's see how different join types handle this.
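
Before running the joins, it can help to see which keys are missing on which side. The sketch below uses a reduced, three-director version of the data and the `indicator=True` argument of `merge()`, which adds a `_merge` column labelling each row as `both`, `left_only`, or `right_only`. The name `df_flagged` is only illustrative.

```python
import pandas as pd

df_directors = pd.DataFrame(
    {"surname": ["Kubrick", "Jackson"], "nationality": ["US", "NZ"]}
)
df_movies = pd.DataFrame(
    {"surname": ["Kubrick", "Fellini"], "title": ["Barry Lyndon", "La Dolce Vita"]}
)

# An outer join keeps every key; the indicator says where each key was found
df_flagged = df_movies.merge(
    df_directors, on="surname", how="outer", indicator=True
)
print(df_flagged[["surname", "_merge"]])

# Keys that exist only in the movies data (no director record)
missing_directors = df_flagged.loc[
    df_flagged["_merge"] == "left_only", "surname"
].tolist()
print(missing_directors)
```

Here `Fellini` is flagged as `left_only` and `Jackson` as `right_only`, which matches the two missing matches listed above.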


In [9]:
# Left join with missing matches
df_all = df_movies.merge(df_directors, on="surname", how="left")
print("Left join with missing matches:")
print(df_all)
print("\n" + "="*50 + "\n")

# Inner join - only keep rows that match in both datasets
df_all = df_movies.merge(df_directors, on="surname", how="inner")
print("Inner join result:")
print(df_all)
print("\n" + "="*50 + "\n")

# Outer join - keep all rows from both datasets
df_all = df_movies.merge(df_directors, on="surname", how="outer")
print("Full/Outer join result:")
print(df_all)


Left join with missing matches:
      surname                 title nationality
0   Spielberg      Schindler's List          US
1    Scorsese           Taxi Driver          US
2   Hitchcock                Psycho          UK
3   Hitchcock    North by Northwest          UK
4   Spielberg   Catch Me If You Can          US
5   Tarantino        Reservoir Dogs          US
6  Villeneuve                  Dune      Canada
7  Guadagnino  Call Me By Your Name          IT
8     Kubrick          Barry Lyndon          US
9     Fellini         La Dolce Vita         NaN


Inner join result:
      surname                 title nationality
0   Spielberg      Schindler's List          US
1    Scorsese           Taxi Driver          US
2   Hitchcock                Psycho          UK
3   Hitchcock    North by Northwest          UK
4   Spielberg   Catch Me If You Can          US
5   Tarantino        Reservoir Dogs          US
6  Villeneuve                  Dune      Canada
7  Guadagnino  Call Me By Your Name          IT
8     Kubrick          Barry Lyndon          US

---

## 🎉 Summary

### Key Join Types:
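- **Left join** (`how="left"`): Keep all rows from the left DataFrame; unmatched rows get `NaN` in the right DataFrame's columns
- **Right join** (`how="right"`): Keep all rows from the right DataFrame; unmatched rows get `NaN` in the left DataFrame's columns
- **Inner join** (`how="inner"`): Keep only the rows whose key appears in both DataFrames
- **Outer join** (`how="outer"`): Keep all rows from both DataFrames, filling `NaN` where there is no match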
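
One quick way to compare the four join types is to count the rows each one returns. The sketch below does this on a reduced version of the lecture data in which `Fellini` has no director record and `Jackson` has no movie; the `row_counts` dictionary is only an illustrative helper.

```python
import pandas as pd

df_directors = pd.DataFrame(
    {
        "surname": ["Hitchcock", "Kubrick", "Jackson"],
        "nationality": ["UK", "US", "NZ"],
    }
)
df_movies = pd.DataFrame(
    {
        "surname": ["Hitchcock", "Hitchcock", "Kubrick", "Fellini"],
        "title": ["Psycho", "North by Northwest", "Barry Lyndon", "La Dolce Vita"],
    }
)

# Run the same merge with each join type and record the number of rows
row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    merged = df_movies.merge(df_directors, on="surname", how=how)
    row_counts[how] = len(merged)

print(row_counts)  # {'left': 4, 'right': 4, 'inner': 3, 'outer': 5}
```

The inner join drops both unmatched keys, the left and right joins each keep one of them, and the outer join keeps both.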

### Key Functions:
- `df.merge()`: For joining DataFrames on common columns
- `pd.concat()`: For stacking DataFrames vertically or horizontally
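
The notebook only concatenates row-wise. As a minimal sketch of the horizontal case, the example below uses the `axis` argument of `pd.concat()` with hypothetical frames (`df_top`, `df_bottom`, `df_extra`) to contrast the two directions.

```python
import pandas as pd

df_top = pd.DataFrame(
    {"surname": ["Spielberg", "Hitchcock"], "nationality": ["US", "UK"]}
)
df_bottom = pd.DataFrame({"surname": ["Kubrick"], "nationality": ["US"]})
df_extra = pd.DataFrame({"birth_year": [1946, 1899]})

# axis=0 (the default) stacks rows: same columns, more rows
df_vertical = pd.concat([df_top, df_bottom], ignore_index=True)
print(df_vertical.shape)  # (3, 2)

# axis=1 places DataFrames side by side, aligning rows by index label
df_horizontal = pd.concat([df_top, df_extra], axis=1)
print(df_horizontal.shape)  # (2, 3)
```

Horizontal concatenation aligns on the index rather than on a key column, so it assumes the rows of both frames are already in the same order. When that is not guaranteed, `merge()` on an explicit key is usually the safer option.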

### Next:
We'll analyze time series data and reputation inflation.

---
