# Milestone 2

This notebook preprocesses our data and assesses the feasibility of the ideas we had during the first milestone, using data exploration and some preliminary analysis.

- [Datasets description](#Datasets-description)
- [Initial dataset preprocessing](#Initial-dataset-preprocessing)
    - [Characters dataset](#Characters-dataset) 
    - [Movies dataset](#Movies-dataset)
    - [Plot summaries dataset](#Plot-summaries-dataset)
- [External datasets](#External-Datasets)
- [Data merging](#Data-Merging)
    - [TMDb](#TMDb)
    - [IMDb](#IMDb)

# Datasets description

This project utilizes a rich collection of datasets encompassing movies, characters, and plot summaries. The primary datasets include:

- **Characters Dataset** (`character.metadata.tsv`): Contains detailed metadata on movie characters and the actors portraying them. Key attributes include Wikipedia and Freebase IDs, movie release dates, character names, actor birthdates, genders, heights, ethnicities, and ages at the time of movie release.

- **Movies Dataset** (`movie.metadata.tsv`): Includes comprehensive information about movies. It features data such as Wikipedia and Freebase IDs, movie names, release dates, box office revenues, runtimes, languages, countries, and genres.

- **Plot Summaries Dataset** (`plot_summaries.txt`): Offers a concise summary of movie plots, linked to movies through Wikipedia movie IDs.

### External datasets

We also use external datasets to enrich our data. More information about these datasets can be found in the [External datasets](#External-Datasets) section.


In [None]:
DATA_FOLDER = "data/"
DATA_FOLDER_CMU = DATA_FOLDER + "raw/MovieSummaries/"
DATA_FOLDER_TMDB = DATA_FOLDER + "raw/TMDb/"
DATA_FOLDER_IMDB = DATA_FOLDER + "raw/IMDb/"

CMU_CHARACTER = DATA_FOLDER_CMU + "character.metadata.tsv"
CMU_MOVIE = DATA_FOLDER_CMU + "movie.metadata.tsv"
PLOT_SUMMARIES = DATA_FOLDER_CMU + "plot_summaries.txt"

TMDB_MOVIE = DATA_FOLDER_TMDB + "movies_metadata.csv"
TMDB_KEYWORDS = DATA_FOLDER_TMDB + "keywords.csv"

# https://datasets.imdbws.com/
IMDB_AKA = DATA_FOLDER_IMDB + "title.akas.tsv"
IMDB_BASIC = DATA_FOLDER_IMDB + "title.basics.tsv"

# Initial dataset preprocessing

In [None]:
import ast  # used below to parse the TMDb list columns
import numpy as np  # used below for np.nan
import pandas as pd
import matplotlib.pyplot as plt
from src.utils.helpers import *

## Characters dataset

In [None]:
# Load the data
df_characters = pd.read_table(
    CMU_CHARACTER,
    names=[
        "wikipedia_id",
        "freebase_id",
        "release_date",
        "character_name",
        "actor_dob",
        "actor_gender",
        "actor_height",
        "actor_ethnicity",
        "actor_name",
        "actor_age_at_movie_release",
        "freebase_actor_map_id",
        "freebase_character_id",
        "freebase_actor_id",
    ],
)
df_characters.head()

Let's check if we can find some missing values in the dataset.

In [None]:
df_characters.info()

We can identify from the output above that:
- There are some *missing values*.
- Some *types are not exploitable* (e.g. `release_date` and `actor_dob` are stored as `object`, not as proper dates).

In [None]:
print(
    "Total amount of rows that contain a NaN value: ",
    df_characters.isna().any(axis=1).sum(),
)
pd.DataFrame(
    {
        "nan_sum": df_characters.isna().sum(),
        "nan_percentage": df_characters.isna().mean() * 100,
    }
)

Concerning the *missing values* problem, we have four options: imputing the values, not using the problematic features, dropping the affected rows, or filling the missing values from external datasets.
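
To make these four options concrete, here is a minimal sketch on a toy DataFrame. The values, the `external` table and its alignment on the same index are invented for illustration; this is not the strategy we commit to for the real data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "actor_name": ["A", "B", "C", "D"],
        "actor_height": [1.80, np.nan, 1.65, np.nan],
    }
)
# Hypothetical external source, assumed to share the same index.
external = pd.DataFrame({"actor_height": [np.nan, 1.75, np.nan, np.nan]})

# 1. Imputation: replace missing heights with the column median.
imputed = toy["actor_height"].fillna(toy["actor_height"].median())

# 2. Do not use the problematic feature at all.
without_feature = toy.drop(columns=["actor_height"])

# 3. Drop the rows where the feature is missing.
dropped_rows = toy.dropna(subset=["actor_height"])

# 4. Fill from an external dataset, keeping our own values when present.
filled = toy["actor_height"].combine_first(external["actor_height"])
```

Each option trades coverage against reliability: imputation keeps every row but invents values, while dropping rows keeps only observed values at the cost of a smaller dataset.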

Now we'll take a look at the values and see if we can find some outliers.

In [None]:
df_characters.describe(include="all")

We find some surprising results:
- How can an actor's height be 510 m?
- How can an actor's age be negative?

### Release date and date of birth

To tackle the type problem, we convert these `object` columns to the `datetime` type.

In [None]:
df_characters["release_date"] = df_characters["release_date"].apply(convert_to_datetime)
df_characters["release_date"] = pd.to_datetime(
    df_characters["release_date"], errors="coerce"
)

df_characters["actor_dob"] = df_characters["actor_dob"].apply(convert_to_datetime)
df_characters["actor_dob"] = pd.to_datetime(df_characters["actor_dob"], errors="coerce")
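
The `convert_to_datetime` helper lives in `src.utils.helpers` and is not reproduced in this notebook. As a rough illustration of the kind of normalisation such a step may need, the sketch below assumes the raw strings mix full dates, year-month values and year-only values. The function `parse_partial_date` is a hypothetical stand-in, not the project's helper.

```python
from datetime import datetime

import pandas as pd


def parse_partial_date(value):
    """Hypothetical stand-in: try full date, then year-month, then year only."""
    if not isinstance(value, str):
        return pd.NaT
    for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
        try:
            return pd.Timestamp(datetime.strptime(value, fmt))
        except ValueError:
            continue
    # Nothing matched: behave like errors="coerce" and return a missing date.
    return pd.NaT


raw = pd.Series(["2001-08-24", "1988", "2000-02", "not a date", None])
parsed = raw.apply(parse_partial_date)
print(parsed)
```

With this approach, partial dates fall back to the first day of the month or year, which is worth keeping in mind when comparing dates later on.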

### Height

In [None]:
df_characters[df_characters["actor_height"] > 2.5]

A quick internet search shows that the 510.0 m value belongs to an actor who is 5'10" (1.78 m) tall, so we fix this value.
Concerning the 180 m value, we can guess that the actor's height is 1.80 m, but we found no source to confirm this (neither online nor in other rows), so we drop these rows.

In [None]:
df_characters.loc[df_characters["actor_height"] == 510, "actor_height"] = 1.78
df_characters.drop(
    df_characters[df_characters["actor_height"] > 2.5].index, inplace=True
)
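
For reference, the 1.78 m figure follows from a simple unit conversion (1 inch = 0.0254 m). The helper name below is ours and is not part of the project code.

```python
def feet_inches_to_metres(feet: int, inches: int) -> float:
    """Convert an imperial height to metres (1 foot = 12 inches, 1 inch = 0.0254 m)."""
    return (feet * 12 + inches) * 0.0254


height_m = feet_inches_to_metres(5, 10)
print(round(height_m, 2))  # 5'10" is 70 inches, i.e. about 1.78 m
```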

### Age

Let's then take care of the age problem by removing the rows with negative age. 

In [None]:
print(
    df_characters.loc[(df_characters["actor_age_at_movie_release"] < 0)].shape[0],
    "such rows will be removed",
)


# Only drop the rows with a negative age; rows with a missing age (NaN) are kept
df_characters = df_characters[~(df_characters["actor_age_at_movie_release"] < 0)]

In [None]:
# Count the number of actors born after the movie release
df_characters[df_characters["actor_dob"] > df_characters["release_date"]].shape[0]
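
A side note on filtering by age: any comparison with `NaN` evaluates to `False`, so the way the filter is written decides whether rows with a missing age survive. The toy series below is invented to show the difference.

```python
import numpy as np
import pandas as pd

ages = pd.Series([35.0, -3.0, np.nan, 0.0, 52.0])

# Keeping "age > 0" drops the negative, the zero AND the missing ages.
kept_strict = ages[ages > 0]

# Negating "age < 0" drops only the negative ages.
kept_lenient = ages[~(ages < 0)]

print(len(kept_strict), len(kept_lenient))
```

Which variant is appropriate depends on whether later analyses can cope with missing ages.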

### Duplicates

In [None]:
# Drop the rows that have duplicates
df_characters.drop_duplicates(
    subset=["freebase_id", "freebase_actor_id", "freebase_character_id"], inplace=True
)

## Movies dataset

In [None]:
# For the movies metadata
df_movies = pd.read_table(
    CMU_MOVIE,
    names=[
        "wikipedia_id",
        "freebase_id",
        "title",
        "release_date",
        "revenue",
        "runtime",
        "languages",
        "countries",
        "genres",
    ],
    header=None,
)
df_movies.info()

In [None]:
print(
    "Total amount of rows that contain a NaN value: ",
    df_movies.isna().any(axis=1).sum(),
)
pd.DataFrame(
    {
        "nan_sum": df_movies.isna().sum(),
        "nan_percentage": df_movies.isna().mean() * 100,
    }
)

In [None]:
# Check for duplicate row
print("Check for duplicate entries:")
print(df_movies.duplicated().sum())

# Check for duplicated row with same name
print("Check for duplicate entries with title only:")
print(df_movies.duplicated(subset=["title"]).sum())

# Check for duplicated row with same name and release date
print("Check for duplicate entries with title and release date:")
print(df_movies.duplicated(subset=["title", "release_date"]).sum())

# Check for duplicate row with same name and runtime
print("Check for duplicate entries with title and movie runtime:")
print(df_movies.duplicated(subset=["title", "runtime"]).sum())

# Check for duplicate row with same wikipedia ID
print("Check for duplicate entries with wikipedia ID:")
print(df_movies.duplicated(subset=["wikipedia_id"]).sum())

Some titles appear more than once, but the Wikipedia ID is always unique.

In [None]:
df_movies.describe(include="all")

Looking at the output, we observe that:
- There is a problem with the `release_date` column.
- The `runtime` column seems to have some outliers, with at least one movie with a runtime of 0, and one of 1 million minutes.

### Release date

We have a similar date problem as the one we had with the characters dataset. We'll fix it the same way.

In [None]:
# Convert the date strings to datetime objects
df_movies["release_date"] = df_movies["release_date"].apply(convert_to_datetime)
df_movies["release_date"] = pd.to_datetime(df_movies["release_date"], errors="coerce")

In [None]:
# Plot the distribution of the movie release date
plt.figure(figsize=(20, 10))
plt.title("Distribution of release date")
plt.xlabel("Release date")
plt.ylabel("Count")
df_movies.dropna(subset=["release_date"])[
    "release_date"
].dt.year.sort_values().value_counts(sort=False).plot(kind="bar")
plt.show()

In [None]:
print(
    f"There are {len(df_movies[df_movies['release_date'] >= pd.to_datetime('2014')])} movies released in 2014 or later"
)

### Runtime

We know that there are some outliers for `runtime`. We'll drop the rows with a `runtime` less than or equal to 0 or greater than 500 minutes.

In [None]:
old_shape = df_movies.shape[0]
df_movies.drop(df_movies[df_movies["runtime"] > 500].index, inplace=True)
df_movies.drop(df_movies[df_movies["runtime"] <= 0].index, inplace=True)
print(f"Removed {old_shape - df_movies.shape[0]} rows")
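
The same rule can be written as a single boolean mask, which makes it explicit that rows with a missing runtime are kept. This is a sketch on invented values, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy runtimes in minutes, invented for illustration.
runtimes = pd.DataFrame({"runtime": [0.0, 95.0, np.nan, 1_000_000.0, 500.0, -5.0]})

# A row is an outlier only when the runtime is known and outside (0, 500].
is_outlier = (runtimes["runtime"] <= 0) | (runtimes["runtime"] > 500)
cleaned = runtimes[~is_outlier]

print(f"Removed {is_outlier.sum()} rows, kept {len(cleaned)}")
```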

In [None]:
plt.figure(figsize=(20, 10))
plt.title("Log distribution of movie runtime")
plt.xlabel("Runtime")
plt.ylabel("Count (log scale)")
plt.yscale("log")
df_movies.dropna(subset=["runtime"])["runtime"].hist(bins=10)
plt.show()

### Genres

The `genres` feature is represented as a dict. We'll explode this column to have one row per genre, which makes plotting easier.

In [None]:
# Extract the list of genres
df_movies["genres"] = df_movies.apply(lambda row: extract_list(row, "genres"), axis=1)

# Replace empty lists by NaN
df_movies["genres"] = df_movies["genres"].apply(lambda x: np.nan if len(x) == 0 else x)

# Explode the genres
df_movies_exploded = df_movies.explode("genres")
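
The `extract_list` helper is defined in `src.utils.helpers` and is not shown here. The sketch below illustrates the idea on toy rows, assuming each cell holds a JSON-style string that maps an identifier to a name. The function `dict_string_to_list` and the identifiers are hypothetical.

```python
import json

import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["Film A", "Film B", "Film C"],
        "genres": [
            '{"/m/01": "Thriller", "/m/02": "Drama"}',
            '{"/m/02": "Drama"}',
            "{}",
        ],
    }
)


def dict_string_to_list(raw):
    """Hypothetical stand-in for extract_list: keep the dict values, NaN if empty."""
    values = list(json.loads(raw).values())
    return values if values else np.nan


toy["genres"] = toy["genres"].apply(dict_string_to_list)

# One row per (movie, genre) pair; movies without genres keep a single NaN row.
exploded = toy.explode("genres")
genre_counts = exploded["genres"].value_counts()
print(genre_counts)
```

Note that after exploding, a movie with several genres appears several times, so counts on the exploded frame are counts of pairs, not of movies.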

In [None]:
# Plot the distribution of movie genres, for the top 20 genres
plt.figure(figsize=(20, 10))
plt.title("Distribution of movie genres")
plt.xlabel("Movie genres")
plt.ylabel("Count")
df_movies_exploded["genres"].value_counts().head(20).plot(kind="bar")
plt.show()

### Countries

Similarly to the `genres` column, the `countries` column is a dict of countries. We'll explode this column to have a row for each country.

In [None]:
# Extract the list of countries
df_movies["countries"] = df_movies.apply(
    lambda row: extract_list(row, "countries"), axis=1
)

# Replace empty lists by NaN
df_movies["countries"] = df_movies["countries"].apply(
    lambda x: np.nan if len(x) == 0 else x
)

# Explode the countries
df_movies_exploded = df_movies.explode("countries")

In [None]:
# Plot the distribution of movie countries, for the top 20 countries
plt.figure(figsize=(20, 10))
plt.title("Distribution of movie countries")
plt.xlabel("Movie countries")
plt.ylabel("Count")
df_movies_exploded["countries"].value_counts().head(20).plot(kind="bar")
plt.show()

We observe that US movies represent the large majority of the dataset.

### Languages

Similarly to the `genres` column, the `languages` column is a dict of languages. We'll explode this column to have a row for each language.

In [None]:
# Extract the list of languages
df_movies["languages"] = df_movies.apply(
    lambda row: extract_list(row, "languages"), axis=1
)

# Replace empty lists by NaN
df_movies["languages"] = df_movies["languages"].apply(
    lambda x: np.nan if len(x) == 0 else x
)

# Explode the languages
df_movies_exploded = df_movies.explode("languages")

In [None]:
# Plot the distribution of movie language, for the top 20 languages
plt.figure(figsize=(20, 10))
plt.title("Distribution of movie language")
plt.xlabel("Movie language")
plt.ylabel("Count")
df_movies_exploded["languages"].value_counts().head(20).plot(kind="bar")
plt.show()

## Plot summaries dataset

In [None]:
# For the plot summaries
df_plots = pd.read_table(
    PLOT_SUMMARIES, names=["wikipedia_id", "plot_summary"], header=None
)

In [None]:
df_plots.info()

In [None]:
print(
    f"There are {df_plots.duplicated(subset=['plot_summary']).sum()} duplicated plot summaries:"
)


df = pd.DataFrame(df_plots[df_plots.duplicated(subset=["plot_summary"])])


df

There seem to be some movies with the same plot summary.

In [None]:
df_movies[df_movies["wikipedia_id"].isin(df["wikipedia_id"])].sort_values(
    by=["wikipedia_id"]
)

We decide to directly merge our movie dataframe with the plot summaries.

In [None]:
# preprocess the data for merging
df_movies = preprocess_movie_data(df_movies)

df_movies = df_movies.join(df_plots.set_index("wikipedia_id"), on="wikipedia_id")

# External Datasets

- [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) contains metadata for 45,000 films listed in the Full MovieLens Dataset, all of which were released on or before July 2017. We will use this dataset, specifically the movies_metadata.csv file, primarily to enrich the movie.metadata.tsv of the CMU movie dataset. Additionally, this dataset includes plot keywords (in the keywords.csv file), which could be helpful for analyzing and clustering movie plots.

- [IMDb Non-Commercial Datasets](https://developer.imdb.com/non-commercial-datasets/) is a giant database containing over 50 million movies from different regions, which allows us to complement our initial dataset, specifically with movies created in the Soviet Union during the Cold War. In addition, using the IMDb API and the IMDbPY package, we can extract plots to address the imbalance issue (the CMU Movie Summaries data being largely focused on the United States).

# Data Merging
## TMDb

In [None]:
df_tmdb_movies = pd.read_csv(
    TMDB_MOVIE,
    usecols=[
        "id",
        "title",
        "release_date",
        "revenue",
        "runtime",
        "genres",
        "production_countries",
        "original_language",
        "spoken_languages",
        "overview",
    ],
)


df_tmdb_keywords = pd.read_csv(TMDB_KEYWORDS)

In [None]:
# preprocess the data for merging
df_tmdb_movies = preprocess_movie_data(df_tmdb_movies)

# Merge the TMDB movies with keywords based on id
df_tmdb_movies["id"] = df_tmdb_movies["id"].astype(int)
df_tmdb_movies = pd.merge(df_tmdb_movies, df_tmdb_keywords, on="id", how="left")

for column_name in ["genres", "production_countries", "spoken_languages", "keywords"]:
    df_tmdb_movies[column_name] = df_tmdb_movies[column_name].apply(
        lambda row: (
            [item["name"] for item in ast.literal_eval(row)]
            if pd.notnull(row) and ast.literal_eval(row)
            else np.nan
        )
    )

# Merge based on the title
df_merged_movies = pd.merge(
    df_movies,
    df_tmdb_movies,
    on="title",
    how="outer",
    suffixes=("_original", "_additional"),
)
df_merged_movies
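
One caveat with merging on `title`: titles are not unique (see the duplicate check in the movies section), so an outer merge can multiply rows when the same title appears several times on both sides. The sketch below uses invented rows and the `indicator` option of `pd.merge` to show the effect.

```python
import pandas as pd

# Toy tables, invented for illustration.
left = pd.DataFrame(
    {"title": ["Hamlet", "Hamlet", "Solaris"], "release_year": [1948, 1990, 1972]}
)
right = pd.DataFrame(
    {"title": ["Hamlet", "Hamlet", "Stalker"], "runtime": [155, 135, 161]}
)

# indicator=True adds a "_merge" column telling where each row comes from.
merged = pd.merge(left, right, on="title", how="outer", indicator=True)

print(len(merged))
print(merged["_merge"].value_counts())
```

Here the two "Hamlet" rows on each side produce four combined rows, so the merged frame has six rows for four distinct titles. Comparing row counts before and after the merge is a cheap way to spot this.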

In [None]:
# Fill as many missing values as possible
for column_name in list(df_merged_movies.columns):

    if column_name in ["release_date_original", "revenue_original", "runtime_original"]:
        # craft the additional column name
        new_column_name = column_name[: -len("_original")]
        column_name_additional = new_column_name + "_additional"
        # fill the missing values
        df_merged_movies[new_column_name] = df_merged_movies[column_name].combine_first(
            df_merged_movies[column_name_additional]
        )
        # drop the original and additional column
        df_merged_movies = df_merged_movies.drop(
            columns=[column_name, column_name_additional]
        )

df_merged_movies["genres_original"] = df_merged_movies.apply(
    lambda row: list(
        set(
            (row["genres_original"] if isinstance(row["genres_original"], list) else [])
            + (
                row["genres_additional"]
                if isinstance(row["genres_additional"], list)
                else []
            )
        )
    ),
    axis=1,
)

# Replace empty lists by NaN
df_merged_movies["genres_original"] = df_merged_movies["genres_original"].apply(
    lambda x: np.nan if len(x) == 0 else x
)

df_merged_movies = df_merged_movies.rename(columns={"genres_original": "genres"})

df_merged_movies["languages"] = df_merged_movies.apply(
    lambda row: list(
        set(
            (row["languages"] if isinstance(row["languages"], list) else [])
            + (
                row["spoken_languages"]
                if isinstance(row["spoken_languages"], list)
                else []
            )
        )
    ),
    axis=1,
)

# Replace empty lists by NaN
df_merged_movies["languages"] = df_merged_movies["languages"].apply(
    lambda x: np.nan if len(x) == 0 else x
)

df_merged_movies["countries"] = df_merged_movies.apply(
    lambda row: list(
        set(
            (row["countries"] if isinstance(row["countries"], list) else [])
            + (
                row["production_countries"]
                if isinstance(row["production_countries"], list)
                else []
            )
        )
    ),
    axis=1,
)

# Replace empty lists by NaN
df_merged_movies["countries"] = df_merged_movies["countries"].apply(
    lambda x: np.nan if len(x) == 0 else x
)

df_merged_movies = df_merged_movies.drop(
    columns=["spoken_languages", "production_countries", "id", "genres_additional"]
)

df_merged_movies
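
The two merging patterns used in the cell above can be summarised on a toy frame: `combine_first` for scalar columns, and a set union for list-valued columns. The helper `union_lists` is hypothetical and only meant to show the logic. It sorts the result, whereas `list(set(...))` gives no ordering guarantee.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "runtime_original": [90.0, np.nan, np.nan],
        "runtime_additional": [95.0, 120.0, np.nan],
        "genres_original": [["Drama"], np.nan, np.nan],
        "genres_additional": [["Drama", "War"], ["Comedy"], np.nan],
    }
)

# Scalar columns: keep the original value, fall back to the additional one.
toy["runtime"] = toy["runtime_original"].combine_first(toy["runtime_additional"])


def union_lists(a, b):
    """Hypothetical helper: union of two list-or-NaN cells, NaN if both are empty."""
    merged = set(a if isinstance(a, list) else []) | set(b if isinstance(b, list) else [])
    return sorted(merged) if merged else np.nan


# List columns: merge both sources cell by cell.
toy["genres"] = pd.Series(
    [union_lists(a, b) for a, b in zip(toy["genres_original"], toy["genres_additional"])],
    index=toy.index,
)
print(toy[["runtime", "genres"]])
```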

In [None]:
df_merged_movies.to_csv(DATA_FOLDER + "preprocessed/merged_movies.csv", index=False)

## IMDb

In [None]:
title_akas = pd.read_csv(IMDB_AKA, sep="\t", usecols=["titleId", "title", "region"])
title_basics = pd.read_csv(
    IMDB_BASIC,
    sep="\t",
    usecols=["tconst", "primaryTitle", "titleType", "startYear", "genres"],
)

In [None]:
# Merge the DataFrames on the title identifier (`titleId` in akas, `tconst` in basics)
imdb_movies = pd.merge(title_akas, title_basics, left_on="titleId", right_on="tconst")

# Select only the columns we need, keeping `tconst` as the identifier (it duplicates `titleId`)
imdb_movies = imdb_movies[
    ["tconst", "title", "primaryTitle", "region", "titleType", "startYear", "genres"]
]

regions = [
    "SU",
    "RU",
    "UA",
    "BY",
    "KZ",
    "UZ",
    "GE",
    "AM",
    "AZ",
    "LT",
    "LV",
    "EE",
    "TM",
    "KG",
    "TJ",
    "MD",
]

soviet_movies = imdb_movies[
    imdb_movies["region"].fillna("").str.contains("|".join(regions), case=False)
]

soviet_movies = (
    soviet_movies[soviet_movies["titleType"] == "movie"]
    .drop_duplicates(subset="primaryTitle", keep="first")
    .drop(columns=["title", "titleType"])
)

soviet_movies
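
The region filter above relies on `str.contains`, which is a substring (regex) match rather than an exact match on the code. The sketch below contrasts it with `isin` on toy rows. The code `"XSU"` is invented purely to show the difference and is not claimed to exist in the IMDb data.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "primaryTitle": ["A", "B", "C", "D"],
        "region": ["SU", "US", None, "XSU"],
    }
)
regions = ["SU", "RU"]

# Substring match: also selects any code that merely contains "SU" or "RU".
by_contains = toy[toy["region"].fillna("").str.contains("|".join(regions), case=False)]

# Exact match on the region code.
by_isin = toy[toy["region"].isin(regions)]

print(len(by_contains), len(by_isin))
```

Whether the looser match is desirable depends on the codes actually present in `title.akas.tsv`, so inspecting `imdb_movies["region"].unique()` is a reasonable sanity check.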

N.B.: IMDbPY must be installed to run the next cells. If it is not, simply run `pip install IMDbPY` in your shell.

In [None]:
from imdb import IMDb
ia = IMDb()

In [None]:
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
import time

time.sleep(1)

with ThreadPoolExecutor(max_workers=4) as executor:

    soviet_movies["plot"] = list(
        tqdm(
            executor.map(get_plot_summary, soviet_movies["tconst"]),
            total=len(soviet_movies),
        )
    )
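
Assigning the result of `executor.map` directly to a column is safe because `map` returns results in the order of its inputs, whatever the order in which the calls complete. The sketch below shows this with `fake_fetch`, a hypothetical stand-in for a network call such as `get_plot_summary`. The identifiers are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(identifier):
    """Hypothetical stand-in for a slow network call."""
    time.sleep(0.01)  # simulate latency
    return f"plot for {identifier}"


ids = ["tt0000001", "tt0000002", "tt0000003"]

with ThreadPoolExecutor(max_workers=4) as executor:
    # Results come back in input order, so they line up with `ids`.
    plots = list(executor.map(fake_fetch, ids))

print(plots)
```

Threads suit this step because the work is dominated by waiting on the network rather than by computation.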

In [None]:
soviet_movies.to_csv(DATA_FOLDER + "preprocessed/samples_soviet_movies.tsv", sep="\t")

In [None]:
soviet_movies = pd.read_csv(
    DATA_FOLDER + "preprocessed/soviet_movies.tsv",
    sep="\t",
    usecols=[
        "tconst",
        "title",
        "primaryTitle",
        "region",
        "titleType",
        "startYear",
        "genres",
    ],
)


soviet_movies = soviet_movies.drop(columns=["title", "titleType"]).rename(
    columns={"primaryTitle": "title", "startYear": "release_date"}
)


soviet_movies["release_date"] = pd.to_datetime(
    soviet_movies["release_date"], format="%Y", errors="coerce"
)


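soviet_movies["genres"] = soviet_movies["genres"].apply(
    # Guard against missing values, which are floats (NaN) and cannot be split
    lambda genres: genres.split(",") if isinstance(genres, str) else np.nan
)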
soviet_movies

In [None]:
df_merged_movies = pd.merge(df_merged_movies, soviet_movies, on="title", how="outer", suffixes=("_original", "_additional"))
df_merged_movies

In [None]:
df_merged_movies["release_date_original"] = df_merged_movies["release_date_original"].combine_first(
    df_merged_movies["release_date_additional"]
)


df_merged_movies["genres_original"] = df_merged_movies.apply(
    lambda row: list(
        set(
            (row["genres_original"] if isinstance(row["genres_original"], list) else [])
            + (
                row["genres_additional"]
                if isinstance(row["genres_additional"], list)
                else []
            )
        )
    ),
    axis=1,
)


# Replace empty lists by NaN


df_merged_movies["genres_original"] = df_merged_movies["genres_original"].apply(
    lambda x: np.nan if len(x) == 0 else x
)


df_merged_movies = df_merged_movies.rename(
    columns={"release_date_original": "release_date", "genres_original": "genres"}
)
df_merged_movies = df_merged_movies.drop(columns=["release_date_additional", "genres_additional"])
df_merged_movies

In [None]:
# Count release dates from 1947-01-01 up to (but excluding) 1992-01-01
len_1 = len(df_merged_movies[df_merged_movies["release_date"] >= pd.to_datetime("1992")])
len_2 = len(df_merged_movies[df_merged_movies["release_date"] >= pd.to_datetime("1947")])
print(f"There are {len_2 - len_1} movies released during the Cold War (1947-91)")

In [None]:
print(f"There are about {len(df_merged_movies[df_merged_movies['region'] == 'RU'])} movies from Russia.")

In [None]:
df_revenue = df_merged_movies[df_merged_movies['revenue'].notna()]
print(f"There are {len(df_revenue)} entries with revenue, {len(df_revenue[df_revenue['region'] == 'RU'])} are from Russia.")