# Preparation and transformation of the data

> In this initial section, before diving into our main analysis, we prepare and examine the datasets that form the foundation of our study. This preliminary phase involves several key steps:
> 1. Data Loading: We begin by opening our data files and adding headers based on the *Readme* of the *CMU Movie Summary Corpus* study (http://www.cs.cmu.edu/~ark/personas/).
> 2. Data Size Evaluation: Understanding the size of our datasets is essential for choosing appropriate processing methods and anticipating computational constraints.
> 3. Data Merging: We combine two relevant dataframes (`characters_df` and `movies_df`) to create a more comprehensive dataset and to facilitate the statistical analysis.
> 4. Separation of data into decades: After merging, we separate the dataset into decades, as we want to study its evolution over time.

In [10]:
import pandas as pd

In [11]:
# Script to load the notebook utils.ipynb
import nbformat
from IPython.core.interactiveshell import InteractiveShell

# Load the notebook utils.ipynb
with open('/home/sara/Dropbox/epfl/master/MA1/ADA/ada-2024-project-analyticaldementialavengers/src/scripts/utils.ipynb') as f:
    nb = nbformat.read(f, as_version=4)

# Create an instance of InteractiveShell
shell = InteractiveShell.instance()

# Execute the notebook utils.ipynb
for cell in nb.cells:
    if cell.cell_type == 'code':
        shell.run_cell(cell.source)

## 1. Adding dataset headers

In [20]:
# Characters file
characters_df = pd.read_csv('/home/sara/Dropbox/epfl/master/MA1/ADA/ada-2024-project-analyticaldementialavengers/data/character.metadata.tsv', sep='\t', header=None)
characters_df.columns = ['Wikipedia_movie_ID', 'Freebase_Movie_ID', 'Movie_release_date', 'Character_name', 'Actor_date_of_birth', 'Actor_gender', 'Actor_height','Actor_ethnicity',
                         'Actor_name', 'Actor_age_at_movie_release','Freebase_character/actor_map_ID','Freebase_character_ID','Freebase_actor_ID']
characters_df.head()
characters_df.to_csv('/home/sara/Dropbox/epfl/master/MA1/ADA/ada-2024-project-analyticaldementialavengers/src/data/characters_df.tsv',  sep='\t', index=False)

# Movies file
movies_df = pd.read_csv('/home/sara/Dropbox/epfl/master/MA1/ADA/ada-2024-project-analyticaldementialavengers/data/movie.metadata.tsv', sep='\t', header=None)
movies_df.columns = ['Wikipedia_movie_ID', 'Freebase_Movie_ID', 'Movie_name', 'Movie_release_date', 'Movie_box_office_revenue', 'Movie_runtime', 'Movie_languages','Movie_countries',
                         'Movie_genres']
movies_df.to_csv('/home/sara/Dropbox/epfl/master/MA1/ADA/ada-2024-project-analyticaldementialavengers/src/data/movies_df.tsv',  sep='\t', index=False)

# Name clusters file
path_name_cluster = '/home/sara/Dropbox/epfl/master/MA1/ADA/ada-2024-project-analyticaldementialavengers/data/name.clusters.txt'
name_cluster_df = pd.read_csv(path_name_cluster, delimiter='\t', header=None)
name_cluster_df.columns = ['unique_character_name', 'freebase_actor_id']
name_cluster_df.to_csv('/home/sara/Dropbox/epfl/master/MA1/ADA/ada-2024-project-analyticaldementialavengers/src/data/name_cluster_df.tsv',  sep='\t', index=False)

# Plot summary file
# Read the txt file by specifying the delimiter (here, a tab)
path_plot = '/home/sara/Dropbox/epfl/master/MA1/ADA/ada-2024-project-analyticaldementialavengers/data/plot_summaries.txt'
plot_summary_df = pd.read_csv(path_plot, delimiter='\t', header=None)
plot_summary_df.columns = ['movie_id', 'plot_summary']
plot_summary_df.to_csv('/home/sara/Dropbox/epfl/master/MA1/ADA/ada-2024-project-analyticaldementialavengers/src/data/plot_summary.tsv',  sep='\t', index=False)

# TV tropes cluster
path_tv_tropes = '/home/sara/Dropbox/epfl/master/MA1/ADA/ada-2024-project-analyticaldementialavengers/data/tvtropes.clusters.txt'
tv_tropes_df = pd.read_csv(path_tv_tropes, delimiter='\t', header=None)
tv_tropes_df.columns = ['character_types', 'ID_field']
tv_tropes_df.to_csv('/home/sara/Dropbox/epfl/master/MA1/ADA/ada-2024-project-analyticaldementialavengers/src/data/tv_tropes_df.tsv',  sep='\t', index=False)
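
The size evaluation announced in the introduction is not shown in this notebook. A minimal sketch of what such a check could look like is given here; the helper name `describe_size` and the toy dataframe are assumptions made for illustration, not part of the project code.

```python
import pandas as pd

def describe_size(df: pd.DataFrame, name: str) -> dict:
    """Return row count, column count, and in-memory size (MB) of a dataframe."""
    n_rows, n_cols = df.shape
    # deep=True also counts the Python strings stored in object columns
    size_mb = df.memory_usage(deep=True).sum() / 1024**2
    print(f"{name}: {n_rows} rows x {n_cols} columns, {size_mb:.2f} MB in memory")
    return {"rows": n_rows, "cols": n_cols, "size_mb": size_mb}

# Toy frame standing in for characters_df or movies_df (made-up values)
toy_df = pd.DataFrame({"Wikipedia_movie_ID": [1, 2, 3], "Movie_name": ["A", "B", "C"]})
toy_size = describe_size(toy_df, "toy_df")
```

The same helper could be called on each loaded dataframe to decide whether everything fits comfortably in memory.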

## 2. Merging of `characters_df` and `movies_df`

In [13]:
merged_df = pd.merge(characters_df, movies_df, on=['Wikipedia_movie_ID', 'Freebase_Movie_ID', 'Movie_release_date'])
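
By default `pd.merge` performs an inner join, so character rows without a matching movie are dropped silently. The sketch below, run on made-up toy data and a single join key, shows one way to count such rows with the `indicator` and `validate` parameters. It is an optional check under these assumptions, not the merge used in this notebook.

```python
import pandas as pd

# Toy stand-ins for characters_df and movies_df (values are made up)
toy_characters = pd.DataFrame({
    "Wikipedia_movie_ID": [1, 1, 2, 3],
    "Character_name": ["Ann", "Bob", "Cid", "Dee"],
})
toy_movies = pd.DataFrame({
    "Wikipedia_movie_ID": [1, 2],
    "Movie_name": ["Alpha", "Beta"],
})

# how="left" keeps every character row; indicator=True adds a "_merge" column
# telling whether a matching movie was found; validate raises an error if a
# movie ID appears more than once on the movies side.
checked = pd.merge(
    toy_characters,
    toy_movies,
    on="Wikipedia_movie_ID",
    how="left",
    indicator=True,
    validate="many_to_one",
)

n_unmatched = int((checked["_merge"] == "left_only").sum())
print(f"{n_unmatched} character row(s) have no matching movie")
```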

## 3. Separation of the `merged_df` into decades

In [14]:
# Conversion of the 'Movie_release_date' column to datetime format
merged_df['Movie_release_date'] = pd.to_datetime(merged_df['Movie_release_date'], errors='coerce')

# Count the number of films without release date
films_without_release_date = merged_df['Movie_release_date'].isna().sum()

# Display the number of films without release date
print(f"Number of movies without release date : {films_without_release_date}, which is {round(films_without_release_date/len(merged_df)*100, 2)}% of the dataset")

# Count the number of films without associated genres
films_without_genre = merged_df['Movie_genres'].isna().sum()

# Display the number of films without associated genres
print(f"Number of movies without genre : {films_without_genre}, which is {round(films_without_genre/len(merged_df)*100, 2)}% of the dataset")

Number of movies without release date : 176797, which is 39.23% of the dataset
Number of movies without genre : 0, which is 0.0% of the dataset


> As 39.23% of the dataset has no release date, we cannot simply discard these entries. Instead, we estimate a missing release year as the average release year of films in the same genre, which is possible because every film in the dataset has a genre or a combination of genres. For a combination of genres, we take the average of the per-genre averages.
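
The helpers `extract_string` and `estimate_release_year` used in this notebook are defined in `utils.ipynb` and are not shown here. The sketch below is a guess at how such an estimate could be computed on made-up toy data; the function name `estimate_release_year_sketch` and its signature are assumptions, not the project's actual implementation.

```python
import numpy as np
import pandas as pd

# Toy data: the last two rows have no release year (values are made up)
toy = pd.DataFrame({
    "Movie_release_year": [1990.0, 2000.0, 2010.0, np.nan, np.nan],
    "genre_list": [["Drama"], ["Drama", "Comedy"], ["Comedy"],
                   ["Drama", "Comedy"], ["Western"]],
})

# Mean release year per genre; missing years are ignored by mean()
mean_year_by_genre = (
    toy.explode("genre_list").groupby("genre_list")["Movie_release_year"].mean()
)

def estimate_release_year_sketch(row, genre_means):
    """Keep a known year; otherwise average the mean years of the row's genres."""
    if pd.notna(row["Movie_release_year"]):
        return row["Movie_release_year"]
    # reindex yields NaN for genres without a usable mean; mean() skips NaN
    # and returns NaN when none of the row's genres has a mean
    return genre_means.reindex(row["genre_list"]).mean()

toy["Estimated_release_year"] = toy.apply(
    estimate_release_year_sketch, axis=1, genre_means=mean_year_by_genre
)
print(toy)
```

In this toy example the undated "Drama, Comedy" row receives the average of the two genre means, while the "Western" row stays empty because no dated film shares its genre.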

In [15]:
# Creation of a new column called "genre_list" in movies_df to have a list of all genres of a movie
movies_df['genre_list'] = movies_df['Movie_genres'].astype(str).apply(extract_string)

# Creation of a new column called "genre_list" in merged_df to have a list of all genres of a movie
merged_df['genre_list'] = merged_df['Movie_genres'].astype(str).apply(extract_string)

# Conversion of dates to years (to standardize the format)
merged_df['Movie_release_year'] = merged_df['Movie_release_date'].dt.year

# Calculate the mean release year by genre
mean_release_year_by_genre = merged_df.explode('genre_list').groupby('genre_list')['Movie_release_year'].mean()

# Estimation of the release year of the films without release date
merged_df['Estimated_release_year'] = merged_df.apply(estimate_release_year, axis=1)

# Verification
films_without_release_date = merged_df['Estimated_release_year'].isna().sum()
print(f"Number of movies without release date : {films_without_release_date} movies, which is {round(films_without_release_date/len(merged_df)*100, 2)}% of the dataset")

Number of movies without release date : 2879 movies, which is 0.64% of the dataset


> After filling the missing years with the mean release year of movies in the same genre (or the mean of the per-genre means for genre combinations), only 0.64% of the dataset, which corresponds to 2879 movies, still has no release date. This is small enough to work with. The remaining gaps are explained by the method itself: the mean cannot be computed for a movie that has no release date and whose genre is unique to it, since no other movie provides a release year to average.
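
The separation into decades named in this section's heading could then be done by flooring each estimated year to the start of its decade. The sketch below uses made-up years; the column name `Decade` and the dictionary `dfs_by_decade` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy estimated years, including a fractional mean and a missing value
toy = pd.DataFrame({"Estimated_release_year": [1987.0, 1994.6, 2003.0, np.nan]})

# Floor each year to the start of its decade; missing years become <NA>
toy["Decade"] = (toy["Estimated_release_year"] // 10 * 10).astype("Int64")

# One dataframe per decade; rows with a missing decade are dropped by groupby
dfs_by_decade = {int(decade): group for decade, group in toy.groupby("Decade")}

for decade, group in dfs_by_decade.items():
    print(f"{decade}s: {len(group)} row(s)")
```

Using the nullable `Int64` dtype keeps the decade as an integer while still allowing rows without an estimated year.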