# CMU Movie data

## Initial data inspection
We start with a general inspection of the CMU movie dataset we chose to work on.

In [None]:
import pandas as pd
import numpy as np
import re
import json
from src.utils.data_utils import *
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
%load_ext autoreload
%autoreload 2

### Load Data
The dataset is divided into three parts: the characters, the movies and the plot summaries of the movies.

In [None]:
character_data, movie_data, plot_data = load_data()

### Characters dataset
Let's first have a look at the characters dataset.

In [None]:
print(f'There are {character_data.shape[0]} characters with {character_data.shape[1]} features for each.')

In [None]:
character_data.head(2)

Note that the actor ethnicity needs to be converted to a readable value (for now, it appears to be a Freebase ID).

Let's now see if we have a lot of missing data. We will also check that we don't have duplicated rows.

In [None]:
print("Percentage of null values per feature in the characters dataset:")
print(character_data.isnull().mean().round(3)*100)

In [None]:
print(f"Duplicated rows: {character_data.duplicated().sum()}")

We are missing a lot of character names/IDs, actor heights, actor ethnicities and actor ages at release.

### Movies dataset
Let's now have a look at the movies dataset.

In [None]:
print(f'There are {movie_data.shape[0]} movies with {movie_data.shape[1]} features for each.')

In [None]:
movie_data.head(2)

Note that the languages, countries and genres need to be preprocessed (for now, they are dictionaries mapping an ID to a readable name).
We could also add a movie_release_year column.
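A minimal sketch of how such a year column could be derived. It assumes the release date is stored as a string in a column called movie_release_date (this column name is an assumption, it is not shown in this notebook) and that some entries only contain a year or a year and a month:

```python
import pandas as pd


def add_release_year(df, date_col="movie_release_date"):
    """Return a copy of df with an integer 'movie_release_year' column.

    date_col is a hypothetical column name; adapt it to the real dataframe.
    """
    out = df.copy()
    # Keep only the leading four digits, so "1988", "2000-02" and
    # "2001-08-24" are all handled the same way.
    year = out[date_col].astype(str).str.extract(r"^(\d{4})", expand=False)
    # Missing or malformed dates end up as <NA> in a nullable integer column.
    out["movie_release_year"] = pd.to_numeric(year, errors="coerce").astype("Int64")
    return out


toy = pd.DataFrame({"movie_release_date": ["2001-08-24", "1988", "2000-02", None]})
toy_with_year = add_release_year(toy)
print(toy_with_year)
```

Extracting the year with a regular expression avoids relying on pd.to_datetime to parse partial dates consistently.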

Let's now see if we have a lot of missing data. We will also verify that we don't have duplicated rows.

In [None]:
print("Percentage of null values per feature in the movies dataset:")
print(movie_data.isnull().mean().round(3)*100)

In [None]:
print(f"Duplicated rows: {movie_data.duplicated().sum()}")

Ouch! We only have box office revenue for 10% of our movies. That's not good news, since it's a key feature for our research question, so we will need to fix this. Apart from this, we are also missing 25% of the runtime information, which we could try to improve, and the same applies to the movie release date. The languages, countries and genres are stored as dictionaries, so we first need to preprocess them (for example by transforming them into lists) before we can measure the percentage of missing data. Let's do that now:

In [None]:
# Extract the readable values for 'languages', 'countries', and 'genres' columns. Also clean the language column.

movie_data['languages'] = movie_data['languages'].apply(lambda x: extract_values(x, clean_func=clean_language))
movie_data['countries'] = movie_data['countries'].apply(lambda x: extract_values(x)) 
movie_data['genres'] = movie_data['genres'].apply(lambda x: extract_values(x))  
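The helpers extract_values and clean_language come from src.utils.data_utils and are not shown in this notebook. As a rough illustration only, here is what such helpers might look like, assuming the raw columns hold JSON strings that map a Freebase ID to a readable name. The names extract_values_sketch and clean_language_sketch, and the choice to return None for empty entries, are assumptions:

```python
import json
import re


def clean_language_sketch(name):
    # Hypothetical cleaning rule: drop a trailing " Language" suffix.
    return re.sub(r"\s+[Ll]anguage$", "", name).strip()


def extract_values_sketch(raw, clean_func=None):
    """Turn '{"/m/02h40lc": "English Language"}' into ['English Language'].

    Returns None when the field is missing, unparsable or empty, so that
    a later .isna() call counts those rows as missing.
    """
    if not isinstance(raw, str):
        return None
    try:
        mapping = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(mapping, dict):
        return None
    values = list(mapping.values())
    if clean_func is not None:
        values = [clean_func(v) for v in values]
    return values or None


example = extract_values_sketch('{"/m/02h40lc": "English Language"}',
                                clean_func=clean_language_sketch)
print(example)
```

Mapping empty dictionaries to None is what would make the missing-data percentages on these columns meaningful; if the real helper returns empty lists instead, .isna() would not flag them.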

We can now have a look at the missing data:

In [None]:
# Calculate the fraction of None (NaN) values for each column
none_languages = movie_data['languages'].isna().mean()
none_countries = movie_data['countries'].isna().mean()
none_genres = movie_data['genres'].isna().mean()

# Print the percentages of None (NaN) values
print(f"Percentage of None values in 'languages': {none_languages:.2%}")
print(f"Percentage of None values in 'countries': {none_countries:.2%}")
print(f"Percentage of None values in 'genres': {none_genres:.2%}")

This looks ok overall.

### Plot summary dataset
Let's now have a look at the plot summaries dataset.

In [None]:
print(f'There are {plot_data.shape[0]} plot summaries with {plot_data.shape[1]} features for each.')

In [None]:
plot_data.head(2)

Let's see if we have some invalid rows (no summary or no Wikipedia ID).

In [None]:
print("Percentage of null values in the plot summaries dataset:")
print(plot_data.isnull().mean().round(3)*100)

Good news, nothing is missing here :)

## Data completion + first preprocessing
Before going deeper into the analysis, we want to fix some of the problems we pointed out.

Movies:
- A lot of box office revenues are missing.
- We could also add a movie_release_year column.

Characters:
- We are missing a lot of character names/IDs, actor heights, actor ethnicities and actor ages at release.
- We need to preprocess the actor ethnicity, which appears to be a Freebase ID.

### Movies problems

Let's first try to get more box office data to reduce the amount of missing values. To do this, we will merge the current dataset with several other datasets that contain box office results (and also runtime, since 25% of it is missing). Let's start with the Wikidata dataset.

In [None]:
# Import dataset from wikidata
with open('data/wikidata.json', 'r') as f:
    wikidata_json = json.load(f)
wikidata = pd.DataFrame(wikidata_json)

# Rename some columns to prepare for the merge
wikidata['box_office_revenue'] = pd.to_numeric(wikidata['box_office'], errors='coerce') 
wikidata['movie_name'] = wikidata['title'].astype(str)
wikidata.drop(columns=['box_office', 'title'], inplace=True)

wikidata.head(2)

Great, we have the Freebase ID and the box office revenue, so we now just need to merge them with the current dataframe.
We will first merge on the Freebase ID and then on the movie title.

In [None]:
movies_wikidata_merged, before_missing, after_missing = merge_for_completion(movie_data, wikidata, "freebase_movie_id", "freebase_id", "box_office_revenue", merge_strategy='mean')
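The function merge_for_completion is defined in src.utils.data_utils and is not shown in this notebook. To make the idea concrete, here is a rough sketch of what such a completion merge could do. The name merge_for_completion_sketch, the averaging of duplicate keys in the second dataset and the exact meaning of the two strategies are assumptions, not the actual implementation:

```python
import numpy as np
import pandas as pd


def merge_for_completion_sketch(left, right, left_key, right_key, column,
                                merge_strategy="prioritize_first"):
    """Fill missing values of `column` in `left` using `right` (illustrative)."""
    before_missing = left[column].isna().mean()

    # Collapse the second dataset to one value per key so that the lookup
    # cannot duplicate rows of the first dataset (assumption: average them).
    lookup = (
        right.dropna(subset=[right_key, column])
        .groupby(right_key)[column]
        .mean()
    )
    candidate = left[left_key].map(lookup)

    merged = left.copy()
    if merge_strategy == "mean":
        # Average both sources where both exist; NaN values are ignored.
        merged[column] = pd.concat([left[column], candidate], axis=1).mean(axis=1)
    elif merge_strategy == "prioritize_first":
        # Keep the existing value and only fill the gaps.
        merged[column] = left[column].fillna(candidate)
    else:
        raise ValueError(f"Unknown merge_strategy: {merge_strategy}")

    after_missing = merged[column].isna().mean()
    return merged, before_missing, after_missing


left = pd.DataFrame({"freebase_movie_id": ["/m/a", "/m/b", "/m/c"],
                     "box_office_revenue": [100.0, np.nan, np.nan]})
right = pd.DataFrame({"freebase_id": ["/m/a", "/m/b"],
                      "box_office_revenue": [300.0, 50.0]})

merged_mean, before, after = merge_for_completion_sketch(
    left, right, "freebase_movie_id", "freebase_id", "box_office_revenue",
    merge_strategy="mean")
merged_first, _, _ = merge_for_completion_sketch(
    left, right, "freebase_movie_id", "freebase_id", "box_office_revenue",
    merge_strategy="prioritize_first")
print(f"{before:.2%} -> {after:.2%}")
```

Using a lookup instead of a plain pd.merge keeps the number of movies unchanged, which matters when merging on titles, since several movies can share the same title.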

In [None]:
# Print the before and after missing percentages
print(f"Box office results missing percentage before merge (on freebase ID) with wikidata: {before_missing:.2%}")
print(f"Box office results missing percentage after merge (on freebase ID) with wikidata: {after_missing:.2%}")

In [None]:
movies_wikidata_merged, before_missing, after_missing = merge_for_completion(movies_wikidata_merged, wikidata, "movie_name", "movie_name", "box_office_revenue", merge_strategy='prioritize_first')

In [None]:
# Print the before and after missing percentages
print(f"Box office results missing percentage before merge (on title) with wikidata: {before_missing:.2%}")
print(f"Box office results missing percentage after merge (on title) with wikidata: {after_missing:.2%}")

It's not a big improvement, but it's a good start. Let's now do the same with another dataset, 'The Movies Dataset', from https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download. Since it doesn't include the Freebase ID, we will merge directly on the movie title. Note that we will also try to fill in the missing runtime data, since this dataset provides it.

In [None]:
# Load the new dataset (and rename some columns)
movies_dataset = pd.read_csv('data/movies_metadata.csv')
movies_dataset['box_office_revenue'] = pd.to_numeric(movies_dataset['revenue'], errors='coerce') 
movies_dataset.head(2)

In [None]:
movies_wikidata_merged, before_missing, after_missing = merge_for_completion(movies_wikidata_merged, movies_dataset, "movie_name", "original_title", "box_office_revenue", merge_strategy='prioritize_first')

In [None]:
# Print the before and after missing percentages
print(f"Box office results missing percentage before merge (on title) with The Movies Dataset: {before_missing:.2%}")
print(f"Box office results missing percentage after merge (on title) with The Movies Dataset: {after_missing:.2%}")

Good improvement! Let's do the same for the runtime:

In [None]:
movies_wikidata_merged, before_missing, after_missing = merge_for_completion(movies_wikidata_merged, movies_dataset, "movie_name", "original_title", "runtime", merge_strategy='prioritize_first')

In [None]:
# Print the before and after missing percentages
print(f"Runtime results missing percentage before merge (on title) with The Movies Dataset: {before_missing:.2%}")
print(f"Runtime results missing percentage after merge (on title) with The Movies Dataset: {after_missing:.2%}")

A small improvement, but we'll take it.

Let's try another dataset that contains more revenue data. It covers about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue: https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csv. As above, we rename the revenue column.

In [None]:
# Load the new dataset (and rename some columns)
tmdb_movies_dataset = pd.read_csv('data/tmdb_movies.csv')
tmdb_movies_dataset['box_office_revenue'] = pd.to_numeric(tmdb_movies_dataset['revenue'], errors='coerce') 
tmdb_movies_dataset.sample(5)

In [None]:
movies_wikidata_merged, before_missing, after_missing = merge_for_completion(movies_wikidata_merged, tmdb_movies_dataset, "movie_name", "original_title", "box_office_revenue", merge_strategy='prioritize_first')

In [None]:
# Print the before and after missing percentages
print(f"Box office results missing percentage before merge (on title) with the TMDb dataset: {before_missing:.2%}")
print(f"Box office results missing percentage after merge (on title) with the TMDb dataset: {after_missing:.2%}")

Now we want to merge with the IMDb datasets (https://developer.imdb.com/non-commercial-datasets/) in order to obtain ratings and lead actors. We consider an actor a lead actor if their ordering is 1 or 2.

In [None]:
wikidata_merge = wikidata[['freebase_id', 'IMDb_ID']]
wikidata_merge = wikidata_merge.rename(columns={'freebase_id': 'freebase_movie_id'})

movies_wikidata_merged_imdbid = pd.merge(wikidata_merge, movies_wikidata_merged, on='freebase_movie_id', how='inner')
movies_wikidata_merged_imdbid.sample(2)

In [None]:
imdb_ratings = pd.read_csv('data/title.ratings.tsv', sep='\t')
imdb_principals = pd.read_csv('data/title.principals.tsv', sep='\t')
imdb_names = pd.read_csv('data/name.basics.tsv', sep='\t')



In [None]:
imdb_ratings = imdb_ratings.rename(columns={'tconst': 'IMDb_ID'})
imdb_ratings = imdb_ratings[['IMDb_ID', 'averageRating', 'numVotes']]  

lead_actors = imdb_principals[(imdb_principals['category'].isin(['actor', 'actress'])) & (imdb_principals['ordering'].isin([1, 2]))]
lead_actors = lead_actors.rename(columns={'tconst': 'IMDb_ID'})
lead_actors = lead_actors[['IMDb_ID', 'nconst', 'ordering']]

lead_actors = lead_actors.merge(imdb_names[['nconst', 'primaryName']], on='nconst', how='left')
lead_actors = lead_actors.pivot(index='IMDb_ID', columns='ordering', values='primaryName').reset_index()
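# Rename by ordering value so the mapping stays correct even if one ordering is absent
lead_actors = lead_actors.rename(columns={1: 'lead_actor_1', 2: 'lead_actor_2'}).rename_axis(columns=None)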

imdb_merged_movie_data = pd.merge(movies_wikidata_merged_imdbid, imdb_ratings, on='IMDb_ID', how='left')
imdb_merged_movie_data.head(2)

merged_movie_data = imdb_merged_movie_data.merge(lead_actors, left_on='IMDb_ID', right_on='IMDb_ID', how='left')
merged_movie_data.head(2)
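To see what the lead actor selection does, here is a small example on made-up data with the same column names as the IMDb files (the titles and names below are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for title.principals.tsv and name.basics.tsv
principals = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt1", "tt2", "tt2"],
    "ordering": [1, 2, 3, 1, 2],
    "nconst": ["nm1", "nm2", "nm3", "nm4", "nm5"],
    "category": ["actor", "actress", "actor", "director", "actress"],
})
names = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3", "nm4", "nm5"],
    "primaryName": ["A", "B", "C", "D", "E"],
})

# Keep actors/actresses whose ordering is 1 or 2
leads = principals[
    principals["category"].isin(["actor", "actress"])
    & principals["ordering"].isin([1, 2])
]
leads = leads.merge(names, on="nconst", how="left")

# One row per movie, one column per ordering
wide = (
    leads.pivot(index="tconst", columns="ordering", values="primaryName")
    .rename(columns={1: "lead_actor_1", 2: "lead_actor_2"})
    .rename_axis(columns=None)
    .reset_index()
)
print(wide)
```

In this toy data the first credited person of tt2 is a director, so tt2 ends up with a missing lead_actor_1. The same can happen on the real data whenever ordering 1 or 2 is not an actor or actress.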




## Cleaning and removing outliers

Before analyzing the data any further, let's remove outliers.

### Character dataset

In [None]:
# Keep only non NaN values
not_na_height = character_data["actor_height"].notna()
not_na_age_at_release = character_data["actor_age_at_release"].notna()
not_na_gender = character_data["actor_gender"].notna()
not_na_ethnicity = character_data["actor_ethnicity"].notna()

not_na_mask = not_na_height & not_na_age_at_release & not_na_gender & not_na_ethnicity
character_data_cleaned = character_data[not_na_mask]

reduction = 1 - character_data_cleaned.shape[0] / character_data.shape[0]
print(f"Removing NaN reduced the dataset by: {reduction:.2%}")

In [None]:
# Keep only valid heights (between 1.5 and 2.8 meters)
character_data_valid_heights = character_data_cleaned.query("actor_height > 1.5 and actor_height < 2.8")
reduction = (len(character_data_cleaned) - len(character_data_valid_heights)) / len(character_data_cleaned)

print(f"Removing invalid actor heights reduced the dataset by {reduction:.2%}.")

In [None]:
# Keep only valid ages (between 0 and 100 years)
character_data_valid_ages = character_data_valid_heights.query("actor_age_at_release > 0 and actor_age_at_release < 100")
reduction = (len(character_data_valid_heights) - len(character_data_valid_ages)) / len(character_data_valid_heights)

print(f"Removing invalid actor ages reduced the dataset by {reduction:.2%}.")

In [None]:
# Keep only ethnicity labels that are common
min_occurrence = 10
ethnicity_label_counts = character_data_valid_ages['actor_ethnicity_label'].value_counts()
ethnicity_labels = ethnicity_label_counts[ethnicity_label_counts > min_occurrence]

mask = character_data_valid_ages['actor_ethnicity_label'].isin(ethnicity_labels.index)
character_data_valid_ethnicity = character_data_valid_ages[mask]

reduction = 1 - len(character_data_valid_ethnicity) / len(character_data_valid_ages)

print(f"Removing uncommon ethnicity labels reduced the dataset by {reduction:.2%}.")
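The rare-label filter above can be checked on a tiny example. The helper name keep_common_labels is hypothetical and the data is made up, but the logic mirrors the cell above:

```python
import pandas as pd


def keep_common_labels(df, column, min_occurrence):
    """Keep rows whose label occurs strictly more than min_occurrence times."""
    counts = df[column].value_counts()
    common = counts[counts > min_occurrence].index
    kept = df[df[column].isin(common)]
    reduction = 1 - len(kept) / len(df)
    return kept, reduction


toy = pd.DataFrame({"label": ["a"] * 5 + ["b"] * 2 + ["c"]})
kept, reduction = keep_common_labels(toy, "label", min_occurrence=2)
print(kept["label"].unique(), f"{reduction:.2%}")
```

Note that the comparison is strict: a label that occurs exactly min_occurrence times (here "b") is dropped as well. Rows with a missing label are also removed, since isin never matches NaN against the index of common labels.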

In [None]:
# Copy to avoid assigning to a slice of the filtered dataframe (SettingWithCopyWarning)
character_data_valid = character_data_valid_ethnicity.copy()

# Convert the date of birth to datetime
character_data_valid["actor_dob"] = pd.to_datetime(character_data_valid["actor_dob"], errors='coerce')

print(f"Final character dataset size: {len(character_data_valid)}")

## Deeper analysis
Now that our data is more complete, we can do a more in-depth analysis.

### Character dataset 

Let's first analyse our character dataset. We will start with a summary of the statistics of the numerical features.

In [None]:
character_data_valid.describe()

Let's plot their distributions:

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 6))

# Histogram for the actor height
sns.histplot(data=character_data_valid, x="actor_height", bins=50, ax=axes[0], kde=False)
axes[0].set_title("Height of the actor")
axes[0].set_xlabel("Height (m)")

# Histogram for the actor age at release
sns.histplot(data=character_data_valid, x="actor_age_at_release", bins=50, ax=axes[1], kde=True)
axes[1].set_title("Age of the actor at the release of the movie")
axes[1].set_xlabel("Age (years)")

# Histogram for the actor date of birth
sns.histplot(data=character_data_valid, x="actor_dob", bins=50, ax=axes[2], kde=True)
axes[2].set_title("Date of birth of the actor")
axes[2].set_xlabel("Date of birth")

plt.tight_layout()
plt.show()


Let's now explore the categorical data:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Countplot for the gender distribution
sns.countplot(data=character_data_valid, x="actor_gender", ax=axes[0], stat='proportion')
axes[0].set_title("Actor gender distribution")
axes[0].set_xlabel("Gender")
axes[0].set_ylabel("Proportion")

ethnicity_cutoff = 30
values = character_data_valid["actor_ethnicity_label"].value_counts()
values = values[:ethnicity_cutoff]
sns.barplot(x=values, y=values.index, ax=axes[1])
axes[1].set_title(f"{ethnicity_cutoff} most common ethnicity label")
axes[1].set_xlabel("Count")
axes[1].set_ylabel("Ethnicity")

plt.tight_layout()
plt.show()

### Movies dataset 

Let's now analyse our movies dataset. We will start with a summary of the statistics of the numerical features.

In [None]:
movie_data_completed = merged_movie_data.copy()
movie_data_completed.describe()

TODO: Comment on these statistics, and note that the maximum runtime is very large.

Let's plot their distributions (except for the Wikipedia ID):

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Histogram for the runtime
movie_data_completed["runtime"].hist(bins=100, ax=axes[0])
axes[0].set_title("Histogram for runtime")
axes[0].set_xlabel("Runtime (min)")

# Histogram for the box office results
movie_data_completed["box_office_revenue"].hist(bins=100, ax=axes[1])
axes[1].set_title("Histogram for box_office_revenue")
axes[1].set_xlabel("Box office revenue (dollars)")

plt.tight_layout()
plt.show()


This is not ideal because of the outliers and the spread of the data, so let's use a log transformation.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Histogram for the runtime
np.log1p(movie_data_completed["runtime"]).hist(bins=30, ax=axes[0])
axes[0].set_title("Log-Transformed Histogram for Runtime")

# Histogram for the box office results (log-transformed)
np.log1p(movie_data_completed["box_office_revenue"]).hist(bins=30, ax=axes[1])
axes[1].set_title("Log-Transformed Histogram for Box Office Revenue")

plt.tight_layout()
plt.show()
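As a quick reminder of what the log transformation does, here is a small numeric example on made-up revenue values. np.log1p computes log(1 + x), so zeros stay at zero instead of becoming minus infinity, and np.expm1 inverts it:

```python
import numpy as np

# Made-up revenues spanning several orders of magnitude
revenues = np.array([0.0, 1e3, 1e6, 1e9])

logged = np.log1p(revenues)   # log(1 + x), defined at x = 0
restored = np.expm1(logged)   # inverse transform, exp(y) - 1

print(logged)
print(restored)
```

On the log scale, a thousandfold increase in revenue corresponds to a step of about 6.9, which is why the histograms become much easier to read.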

Let's now draw some box plots (also with a log transformation).

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Boxplot for the log-transformed box_office_revenue
sns.boxplot(data=np.log1p(movie_data_completed["box_office_revenue"]), ax=axes[0])
axes[0].set_title("Log-Transformed Boxplot for box_office_revenue")

# Boxplot for the log-transformed runtime
sns.boxplot(data=np.log1p(movie_data_completed["runtime"]), ax=axes[1])
axes[1].set_title("Log-Transformed Boxplot for runtime")

plt.tight_layout()
plt.show()


TODO: Maybe try a better plot for the box office revenue.

We can now have a look at the categorical features:

In [None]:
# TODO: add some count plots for the categorical features
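One possible starting point for these count plots, assuming the genres, languages and countries columns hold lists of names after preprocessing (the helper name top_categories is hypothetical and the data below is made up):

```python
import pandas as pd


def top_categories(df, column, n=10):
    """Count how often each entry appears in a list-valued column."""
    # explode() turns each list element into its own row before counting
    return df[column].dropna().explode().value_counts().head(n)


toy = pd.DataFrame({
    "genres": [["Drama", "Comedy"], ["Drama"], None, ["Thriller", "Drama"]],
})
genre_counts = top_categories(toy, "genres", n=2)
print(genre_counts)
```

The resulting counts could then be passed to sns.barplot(x=genre_counts.values, y=genre_counts.index), in the same way as the ethnicity plot above.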