# CMU Movie data

## Initial data inspection
We start with a general inspection of the CMU movie dataset that we decided to work on.

In [1]:
import pandas as pd
import numpy as np
import re
import json
from src.utils.data_utils import *
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
%load_ext autoreload
%autoreload 2

### Load Data
The dataset is divided into three parts: the characters, the movies and the plot summaries of the movies.

In [2]:
character_data, movie_data, plot_data = load_data()

### Characters dataset
Let's first have a look at the characters dataset.

In [3]:
print(f'There are {character_data.shape[0]} characters with {character_data.shape[1]} features for each.')

There are 451432 characters with 14 features for each.


In [4]:
character_data.head(2)

Unnamed: 0,wikipedia_movie_id,freebase_movie_id,movie_release_date,character_name,actor_dob,actor_gender,actor_height,actor_ethnicity,actor_name,actor_age_at_release,character_actor_map_id,character_id,actor_id,actor_ethnicity_label
0,975900,/m/03vyhn,2001-08-24,Akooshay,1958-08-26,F,1.62,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7,
1,975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.78,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4,


We can note that the actor ethnicity needs to be transformed into a readable value (for now, it appears to be a Freebase ID).

Let's now see if we have a lot of missing data. We will also check that we don't have duplicated rows.

In [5]:
print("Percentage of null values in the characters dataset for each feature:")
print(character_data.isnull().mean().round(3)*100)

Percentage of null values in the characters dataset for each feature:
wikipedia_movie_id         0.0
freebase_movie_id          0.0
movie_release_date         2.2
character_name            57.2
actor_dob                 23.5
actor_gender              10.1
actor_height              65.6
actor_ethnicity           76.3
actor_name                 0.3
actor_age_at_release      35.0
character_actor_map_id     0.0
character_id              57.2
actor_id                   0.2
actor_ethnicity_label     77.0
dtype: float64


In [6]:
print(f"Duplicated rows: {character_data.duplicated().sum()}")

Duplicated rows: 0


We see that many character names/IDs, actor heights, actor ethnicities and actor ages at release are missing.

### Movies dataset
Let's now have a look at the movies dataset.

In [7]:
print(f'There are {movie_data.shape[0]} movies with {movie_data.shape[1]} features for each.')

There are 81741 movies with 9 features for each.


In [8]:
movie_data.head(2)

Unnamed: 0,wikipedia_movie_id,freebase_movie_id,movie_name,movie_release_date,box_office_revenue,runtime,languages,countries,genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."


We can note that the languages, countries and genres need to be preprocessed (for now, each is a dictionary mapping a Freebase ID to a readable name).

Let's now see if we have a lot of missing data. We will also verify that we don't have duplicated rows.

In [9]:
print("Percentage of null values in the movies dataset for each feature:")
print(movie_data.isnull().mean().round(3)*100)

Percentage of null values in the movies dataset for each feature:
wikipedia_movie_id     0.0
freebase_movie_id      0.0
movie_name             0.0
movie_release_date     8.4
box_office_revenue    89.7
runtime               25.0
languages              0.0
countries              0.0
genres                 0.0
dtype: float64


In [10]:
print(f"Duplicated rows: {movie_data.duplicated().sum()}")

Duplicated rows: 0


Ouch! We only have box office revenue for about 10% of our movies. That's not good news since it's a key feature for our research question, so we will need to fix this. Apart from this, we are also missing 25% of the runtime information and 8.4% of the release dates, which we could try to improve. The languages, countries and genres are stored as dictionaries, so we first need to preprocess them (for example, transform them into lists) before we can see the percentage of missing data. We will do it now:

In [11]:
# Extract the readable values for 'languages', 'countries', and 'genres' columns. Also clean the language column.

movie_data['languages'] = movie_data['languages'].apply(lambda x: extract_values(x, clean_func=clean_language))
movie_data['countries'] = movie_data['countries'].apply(lambda x: extract_values(x)) 
movie_data['genres'] = movie_data['genres'].apply(lambda x: extract_values(x))  
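
The `extract_values` and `clean_language` helpers come from `src/utils/data_utils.py` and are not shown in this notebook. As a rough sketch of what such helpers might do (the function names `extract_values_sketch` and `strip_language_suffix` are hypothetical, and the assumption that empty dictionaries become `None` is inferred from the missing-value check that follows):

```python
import json

import pandas as pd


def extract_values_sketch(raw, clean_func=None):
    """Hypothetical stand-in: turn a JSON dict string into a list of readable names."""
    if pd.isna(raw):
        return None
    names = list(json.loads(raw).values())
    if clean_func is not None:
        names = [clean_func(name) for name in names]
    # Assumption: an empty dictionary is treated as missing data
    return names or None


def strip_language_suffix(name):
    """Hypothetical cleaner: drop a trailing ' Language' from a language name."""
    return name.removesuffix(" Language").strip()


print(extract_values_sketch('{"/m/02h40lc": "English Language"}', clean_func=strip_language_suffix))
print(extract_values_sketch("{}"))
```

Returning `None` for empty dictionaries is what makes `.isna()` usable on these columns afterwards.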

We can now have a look at the missing data:

In [12]:
# Compute the share of None (NaN) values in each column
none_languages = movie_data['languages'].isna().mean()
none_countries = movie_data['countries'].isna().mean()
none_genres = movie_data['genres'].isna().mean()

# Print the percentages of None (NaN) values
print(f"Percentage of None values in 'languages': {none_languages:.2%}")
print(f"Percentage of None values in 'countries': {none_countries:.2%}")
print(f"Percentage of None values in 'genres': {none_genres:.2%}")

Percentage of None values in 'languages': 16.96%
Percentage of None values in 'countries': 9.98%
Percentage of None values in 'genres': 4.38%


In [13]:
movie_data[movie_data["movie_release_date"]>'1-1-2012'].isna().count()

wikipedia_movie_id    74839
freebase_movie_id     74839
movie_name            74839
movie_release_date    74839
box_office_revenue    74839
runtime               74839
languages             74839
countries             74839
genres                74839
dtype: int64

This looks ok overall.
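
One caveat on the last cell: `.isna()` returns a boolean frame that itself has no missing entries, so `.count()` on it reports the number of selected rows in every column rather than the number of missing values. In addition, if `movie_release_date` is still stored as text at this point, the comparison with `'1-1-2012'` is lexicographic rather than chronological. A minimal sketch of the difference between `count`, `sum` and `mean` on a made-up frame (`toy` and its columns are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"runtime": [98.0, np.nan, 95.0], "revenue": [np.nan, np.nan, 1.0]})

rows_per_column = toy.isna().count()   # number of rows, missing or not
missing_per_column = toy.isna().sum()  # number of missing values
missing_share = toy.isna().mean()      # fraction of missing values

print(rows_per_column.to_dict())
print(missing_per_column.to_dict())
print(missing_share.round(2).to_dict())
```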

### Plot summary dataset
Let's now have a look at the plot summaries dataset.

In [14]:
print(f'There are {plot_data.shape[0]} plot summaries with {plot_data.shape[1]} features for each.')

There are 42303 plot summaries with 2 features for each.


In [15]:
plot_data.head(2)

Unnamed: 0,wikipedia_movie_id,summary
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...


Let's see if we have some rows that are invalid (no summary or wikipedia id).

In [16]:
print("Percentage of null values in the plot summaries dataset:")
print(plot_data.isnull().mean().round(3)*100)

Percentage of null values in the plot summaries dataset:
wikipedia_movie_id    0.0
summary               0.0
dtype: float64


Good news, nothing is missing here :)

## Data completion + first preprocessing
Before going deeper into the analysis, we want to fix some of the problems we pointed out.

Movies:
- A lot of box office revenues are missing.

Characters:
- Many character names/IDs, actor heights, actor ethnicities and actor ages at release are missing.
- The actor ethnicity needs to be preprocessed, since it appears to be a Freebase ID.

### Movies problems

Let's first try to get more data on box office results to reduce the amount of missing data. To do this, we will merge the current dataset with other datasets that contain box office results (and also runtime, since 25% of it is missing).

Let's try a dataset that contains information about 10,000 movies collected from The Movie Database (TMDb), including revenue and runtime: https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csv

In [17]:
# Load the new dataset and copy its revenue into a numeric 'box_office_revenue' column
movies_dataset = pd.read_csv('data/movies_metadata.csv')
movies_dataset['box_office_revenue'] = pd.to_numeric(movies_dataset['revenue'], errors='coerce')
movies_dataset['release_date'] = pd.to_datetime(movies_dataset['release_date'], errors='coerce')

# Convert the CMU release dates too, so that both merge keys have the same dtype
movie_data['movie_release_date'] = pd.to_datetime(movie_data['movie_release_date'], errors='coerce')

movies_dataset.head(2)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,box_office_revenue
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,373554033.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,262797249.0


In [18]:
movie_data_merged, before_missing, after_missing = merge_for_completion(movie_data, movies_dataset, ["movie_name", "movie_release_date"], ["original_title", "release_date"], "box_office_revenue", merge_strategy='prioritize_first')

In [19]:
# Print the before and after missing percentages
print(f"Box office results missing percentage before merge (on title) with The Movies Dataset: {before_missing:.2%}")
print(f"Box office results missing percentage after merge (on title) with The Movies Dataset: {after_missing:.2%}")

Box office results missing percentage before merge (on title) with The Movies Dataset: 89.60%
Box office results missing percentage after merge (on title) with The Movies Dataset: 84.05%
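
The `merge_for_completion` helper is defined in `src/utils/data_utils.py` and is not shown here. The sketch below is only a guess at the general idea behind a "prioritize first" completion: left-merge on the key columns, then fill the gaps in the target column while keeping existing values. The function name `fill_from_other` and the toy frames are hypothetical, and unlike the real helper this sketch drops the right-hand key columns after the merge.

```python
import numpy as np
import pandas as pd


def fill_from_other(left, right, left_on, right_on, target):
    """Hypothetical sketch: fill missing `target` values in `left` from `right`."""
    before = left[target].isna().mean()
    # Keep one candidate value per key so the left merge cannot duplicate rows
    candidates = (
        right[right_on + [target]]
        .dropna(subset=[target])
        .drop_duplicates(subset=right_on)
        .rename(columns={target: f"{target}_other"})
    )
    merged = left.merge(candidates, how="left", left_on=left_on, right_on=right_on)
    # Existing values in `left` win; gaps are filled from the other dataset
    merged[target] = merged[target].combine_first(merged[f"{target}_other"])
    extra = [f"{target}_other"] + [col for col in right_on if col not in left.columns]
    merged = merged.drop(columns=extra)
    return merged, before, merged[target].isna().mean()


cmu = pd.DataFrame({"movie_name": ["A", "B", "C"], "box_office_revenue": [10.0, np.nan, np.nan]})
other = pd.DataFrame({"original_title": ["A", "B"], "box_office_revenue": [99.0, 20.0]})

filled, before, after = fill_from_other(cmu, other, ["movie_name"], ["original_title"], "box_office_revenue")
print(filled)
print(f"{before:.2%} -> {after:.2%}")
```

In the toy example, movie "A" keeps its original value, "B" is filled from the second frame, and "C" stays missing because no match exists.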


It's not a big improvement, but missing values still drop by about 5.5 percentage points. Let's do the same for the runtime column.

In [20]:
movie_data_merged, before_missing, after_missing = merge_for_completion(movie_data_merged, movies_dataset, ["movie_name", "movie_release_date"], ["original_title", "release_date"], "runtime", merge_strategy='prioritize_first')

In [21]:
# Print the before and after missing percentages
print(f"Runtime results missing percentage before merge (on title) with The Movies Dataset: {before_missing:.2%}")
print(f"Runtime results missing percentage after merge (on title) with The Movies Dataset: {after_missing:.2%}")

Runtime results missing percentage before merge (on title) with The Movies Dataset: 24.93%
Runtime results missing percentage after merge (on title) with The Movies Dataset: 24.70%


A minor improvement, but we take it. Let's try another dataset named Movie Industry (https://www.kaggle.com/datasets/danielgrijalvas/movies).

In [22]:
# Load the new dataset and copy its gross revenue into a numeric 'box_office_revenue' column
movie_industry_dataset = pd.read_csv('data/movie_industry.csv')
movie_industry_dataset['box_office_revenue'] = pd.to_numeric(movie_industry_dataset['gross'], errors='coerce') 

# Remove any extra information in parentheses
movie_industry_dataset['released'] = movie_industry_dataset['released'].str.split('(').str[0].str.strip()

# Convert to datetime format
movie_industry_dataset['released'] = pd.to_datetime(movie_industry_dataset['released'], errors='coerce')

movie_industry_dataset.head(2)

Unnamed: 0,name,rating,genre,year,released,score,votes,director,writer,star,country,budget,gross,company,runtime,box_office_revenue
0,The Shining,R,Drama,1980,1980-06-13,8.4,927000.0,Stanley Kubrick,Stephen King,Jack Nicholson,United Kingdom,19000000.0,46998772.0,Warner Bros.,146.0,46998772.0
1,The Blue Lagoon,R,Adventure,1980,1980-07-02,5.8,65000.0,Randal Kleiser,Henry De Vere Stacpoole,Brooke Shields,United States,4500000.0,58853106.0,Columbia Pictures,104.0,58853106.0


In [23]:
movie_data_merged, before_missing, after_missing = merge_for_completion(movie_data_merged, movie_industry_dataset, ["movie_name", "movie_release_date"], ["name", "released"], "box_office_revenue", merge_strategy='prioritize_first')

In [24]:
# Print the before and after missing percentages
print(f"Box office results missing percentage before merge (on title) with the Movie Industry dataset: {before_missing:.2%}")
print(f"Box office results missing percentage after merge (on title) with the Movie Industry dataset: {after_missing:.2%}")

Box office results missing percentage before merge (on title) with the Movie Industry dataset: 84.05%
Box office results missing percentage after merge (on title) with the Movie Industry dataset: 83.85%


### Characters problems

TODO: Improve missing characters (maybe not needed, see the "Do we have data on all lead actors") -> I think it's not really needed because of the results we have

## Getting the rating and lead actors of movies

Now we want to merge with the IMDb datasets (https://developer.imdb.com/non-commercial-datasets/) in order to obtain ratings and lead actors. We consider an actor a lead actor if their ordering is 1 or 2.

In [25]:
# Load the Wikidata export that maps Freebase IDs to IMDb IDs
with open('data/wikidata.json', 'r') as f:
    data = json.load(f)

wikidata = pd.DataFrame(data)

wikidata.head(2)

Unnamed: 0,item,freebase_id,IMDb_ID,title,box_office
0,http://www.wikidata.org/entity/Q251063,/m/03gtkbc,tt0425295,Niagara Motel,
1,http://www.wikidata.org/entity/Q251335,/m/02r3qq,tt0071565,ゴジラ対メカゴジラ,


In [26]:
wikidata_merge = wikidata[['freebase_id', 'IMDb_ID']]
wikidata_merge = wikidata_merge.rename(columns={'freebase_id': 'freebase_movie_id'})

movies_wikidata_merged_imdbid = pd.merge(wikidata_merge, movie_data_merged, on='freebase_movie_id', how='inner')
movies_wikidata_merged_imdbid.sample(2)

Unnamed: 0,freebase_movie_id,IMDb_ID,wikipedia_movie_id,movie_name,movie_release_date,box_office_revenue,runtime,languages,countries,genres,original_title,release_date,name,released
10799,/m/0gjs8p,tt0306892,6712075,George and the Dragon,NaT,,93.0,[English],[United States of America],"[Fantasy Adventure, Sword and sorcery films, C...",,NaT,,NaT
24496,/m/061dj0,tt0113464,1856185,Jeffrey,NaT,3487767.0,93.0,[English],[United States of America],"[LGBT, Romantic comedy, Sex comedy, Indie, Gay...",,NaT,,NaT


In [27]:
# Load IMDb ratings and select relevant columns
imdb_ratings = pd.read_csv('data/title.ratings.tsv', sep='\t')
imdb_ratings = imdb_ratings.rename(columns={'tconst': 'IMDb_ID'})
imdb_ratings = imdb_ratings[['IMDb_ID', 'averageRating', 'numVotes']]

In [28]:
# Load IMDb names data for actors
imdb_names = pd.read_csv('data/name.basics.tsv', sep='\t')

In [29]:
# Initialize an empty list to hold chunks of filtered lead actors
filtered_lead_actors = []

# Process imdb_principals in chunks to reduce memory usage
for chunk in pd.read_csv('data/title.principals.tsv', sep='\t', chunksize=100000):
    # Filter for lead actors (first and second-billed actor or actress)
    chunk_lead_actors = chunk[
        (chunk['category'].isin(['actor', 'actress'])) & 
        (chunk['ordering'].isin([1, 2]))
    ][['tconst', 'nconst', 'ordering']]
    
    # Append the filtered chunk to the list
    filtered_lead_actors.append(chunk_lead_actors)

In [None]:
# Concatenate all filtered chunks into a single DataFrame
lead_actors = pd.concat(filtered_lead_actors)

# Rename columns and merge with imdb_names DataFrame to get actor names
lead_actors = lead_actors.rename(columns={'tconst': 'IMDb_ID'})
lead_actors = lead_actors.merge(imdb_names[['nconst', 'primaryName']], on='nconst', how='left')

# Pivot to get separate columns for the first and second lead actors
lead_actors = lead_actors.pivot(index='IMDb_ID', columns='ordering', values='primaryName').reset_index()
lead_actors.columns = ['IMDb_ID', 'lead_actor_1', 'lead_actor_2']

# Merge movies_wikidata_merged_imdbid with IMDb ratings
imdb_merged_movie_data = pd.merge(movies_wikidata_merged_imdbid, imdb_ratings, on='IMDb_ID', how='left')

# Merge with lead actors data
merged_movie_data = imdb_merged_movie_data.merge(lead_actors, on='IMDb_ID', how='left')

# Display the first 5 rows of the final merged dataset
merged_movie_data.head(5)
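
To make the pivot step easier to follow, here is a minimal sketch on made-up rows (the IDs and actor names are invented). It also adds a `reindex` so that both lead-actor columns exist even if no title in the data has a second-billed actor, which is an extra safeguard and not part of the cell above:

```python
import pandas as pd

principals = pd.DataFrame({
    "IMDb_ID": ["tt1", "tt1", "tt2"],
    "ordering": [1, 2, 1],
    "primaryName": ["Actor A", "Actor B", "Actor C"],
})

wide = (
    principals.pivot(index="IMDb_ID", columns="ordering", values="primaryName")
    .reindex(columns=[1, 2])  # guarantee both columns exist, in this order
    .rename(columns={1: "lead_actor_1", 2: "lead_actor_2"})
    .reset_index()
)
wide.columns.name = None  # drop the leftover 'ordering' axis label

print(wide)
```

Titles with a single lead actor end up with a missing `lead_actor_2`, which is why that column is checked for missing values later on.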

In [None]:
merged_movie_data = merged_movie_data.drop_duplicates(subset=['IMDb_ID'])
merged_movie_data.head(5)

In [None]:
merged_movie_data.shape

We can now extract the movies that have box office revenue and remove the unnecessary columns.

In [None]:
# Remove rows with NaN in the 'box_office_revenue' column
movie_data_extracted = merged_movie_data.dropna(subset=['box_office_revenue']).copy()

# Drop specified columns
movie_data_extracted = movie_data_extracted.drop(columns=['original_title', 'release_date', 'name', 'released'])

# Verify the new structure of the dataset
movie_data_extracted.head(2)

In [None]:
# Set lead_actor_2 to NaN where it is the same as lead_actor_1
movie_data_extracted.loc[movie_data_extracted['lead_actor_1'] == movie_data_extracted['lead_actor_2'], 'lead_actor_2'] = pd.NA

In [None]:
print("Percentage of null values in the extracted movies dataset for each feature:")
print(movie_data_extracted.isnull().mean().round(3)*100)

In [None]:
print(f"We have {movie_data_extracted.shape[0]} rows in our extracted movies dataset.")

Good, we have 2232 movies with minor missing data!

## Cleaning and removing outliers

Before analyzing the data any further, let's remove outliers.

### Character dataset

In [None]:
# Keep only non-NaN values for all columns
not_na_height = character_data["actor_height"].notna()
not_na_age_at_release = character_data["actor_age_at_release"].notna()
not_na_gender = character_data["actor_gender"].notna()
not_na_ethnicity = character_data["actor_ethnicity"].notna()
not_na_name_char = character_data["character_name"].notna()

# Combine all conditions into one mask
not_na_mask = not_na_height & not_na_age_at_release & not_na_gender & not_na_ethnicity & not_na_name_char

# Apply the mask to the DataFrame
character_data_cleaned = character_data[not_na_mask]

# Calculate reduction in size
reduction = 1 - character_data_cleaned.shape[0] / character_data.shape[0]
print(f"Removing NaN reduced the dataset by: {reduction:.2%}")

In [None]:
# Keep only valid heights (between 1.5 and 2.8 meters)
character_data_valid_heights = character_data_cleaned.query("actor_height > 1.5 and actor_height < 2.8")
reduction = (len(character_data_cleaned) - len(character_data_valid_heights)) / len(character_data_cleaned)

print(f"Removing invalid actor heights reduced the dataset by {reduction:.2%}.")

In [None]:
# Keep only valid ages (between 0 and 100 years)
character_data_valid_ages = character_data_valid_heights.query("actor_age_at_release > 0 and actor_age_at_release < 100")
reduction = (len(character_data_valid_heights) - len(character_data_valid_ages)) / len(character_data_valid_heights)

print(f"Removing invalid actor ages reduced the dataset by {reduction:.2%}.")

In [None]:
# Keep only ethnicity labels that are common
min_occurrence = 10
ethnicity_label_counts = character_data_valid_ages['actor_ethnicity_label'].value_counts()
ethnicity_labels = ethnicity_label_counts[ethnicity_label_counts > min_occurrence]

mask = character_data_valid_ages['actor_ethnicity_label'].isin(ethnicity_labels.index)
character_data_valid = character_data_valid_ages[mask]

reduction = 1 - len(character_data_valid) / len(character_data_valid_ages)

print(f"Removing uncommon ethnicity labels reduced the dataset by {reduction:.2%}.")

In [None]:
# Convert the date of birth to datetime on an explicit copy to avoid the SettingWithCopyWarning
character_data_valid = character_data_valid.copy()
character_data_valid["actor_dob"] = pd.to_datetime(character_data_valid["actor_dob"], errors='coerce')

# Print the final dataset size
print(f"Final character dataset size: {len(character_data_valid)}")

### Movies dataset

We will also remove the outliers for the movie dataset.

In [None]:
movie_data_extracted.describe()

In [None]:
print(movie_data_extracted.isnull().mean().round(3)*100)

In [None]:
# Keep only non-NaN values for all columns (the other columns have no missing values)
not_na_release_date = movie_data_extracted["movie_release_date"].notna()
not_na_runtime = movie_data_extracted["runtime"].notna()
not_na_languages = movie_data_extracted["languages"].notna()
not_na_countries = movie_data_extracted["countries"].notna()
not_na_genres = movie_data_extracted["genres"].notna()
not_na_lead_actor_1 = movie_data_extracted["lead_actor_1"].notna()
not_na_lead_actor_2 = movie_data_extracted["lead_actor_2"].notna()

# Combine all conditions into one mask
not_na_mask = not_na_release_date & not_na_runtime & not_na_languages & not_na_countries & not_na_genres & not_na_lead_actor_1 & not_na_lead_actor_2


# Apply the mask to the DataFrame
movie_data_cleaned = movie_data_extracted[not_na_mask]

# Calculate reduction in size
reduction = 1 - movie_data_cleaned.shape[0] / movie_data_extracted.shape[0]
print(f"Removing NaN reduced the dataset by: {reduction:.2%}")
len(movie_data_cleaned)

In [None]:
# Keep only movies released after January 1, 1940
movie_data_valid_release_dates = movie_data_cleaned[movie_data_cleaned["movie_release_date"] > '1940-01-01']

reduction = 1 - movie_data_valid_release_dates.shape[0]/ movie_data_cleaned.shape[0]
print(f"Removing movies released before 1940 reduced the dataset by: {reduction:.2%}")

In [None]:
# Keep only movies that last under 180 min
movie_data_valid_runtime = movie_data_valid_release_dates[movie_data_valid_release_dates["runtime"] < 180]

reduction = 1 - movie_data_valid_runtime.shape[0] / movie_data_valid_release_dates.shape[0]
print(f"Removing movies lasting 3 hours or more reduced the dataset by: {reduction:.2%}")

In [None]:
# Keep only movies that have more than 500 votes
movie_data_valid_votes = movie_data_valid_runtime[movie_data_valid_runtime["numVotes"] > 500]

reduction = 1 - movie_data_valid_votes.shape[0] / movie_data_valid_runtime.shape[0]
print(f"Removing movies with 500 votes or fewer reduced the dataset by: {reduction:.2%}")

In [None]:
movie_data_valid = movie_data_valid_votes.copy()
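
The filtering cells above all follow the same pattern: apply a boolean mask, then report the share of rows removed. A minimal sketch of a helper that factors this out (`apply_filter` and the toy frame are hypothetical and not part of the project code):

```python
import pandas as pd


def apply_filter(df, mask, label):
    """Apply a boolean mask and print the share of rows it removed."""
    kept = df[mask]
    reduction = 1 - len(kept) / len(df)
    print(f"{label} reduced the dataset by: {reduction:.2%}")
    return kept


toy = pd.DataFrame({"runtime": [90, 200, 120, 95], "numVotes": [1000, 800, 100, 5000]})

step1 = apply_filter(toy, toy["runtime"] < 180, "Removing long movies")
step2 = apply_filter(step1, step1["numVotes"] > 500, "Removing rarely rated movies")
```

Each reduction is computed relative to the previous step, as in the cells above, so the percentages do not add up to the overall reduction.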

## Do we have data for the lead actors?

Now that we have the two main actors of each movie, let's see if we have data on them.

In [None]:
# Step 1: Extract unique pairs of (freebase_movie_id, lead_actor) from the first dataset
lead_actor_pairs = pd.concat([
    movie_data_valid[['freebase_movie_id', 'lead_actor_1']].rename(columns={'lead_actor_1': 'actor_name'}),
    movie_data_valid[['freebase_movie_id', 'lead_actor_2']].rename(columns={'lead_actor_2': 'actor_name'})
])

# Convert the DataFrame of pairs to a list of tuples for filtering
lead_actor_pairs = list(lead_actor_pairs.itertuples(index=False, name=None))

# Step 2: Filter character_data to keep only rows where (freebase_movie_id, actor_name) matches the pairs in lead_actor_pairs
lead_actor_data = character_data_valid[
    character_data_valid[['freebase_movie_id', 'actor_name']].apply(tuple, axis=1).isin(lead_actor_pairs)
]

# Step 3: Check for missing values in key columns
print("Missing values in lead actor data:")
print(lead_actor_data[['actor_name', 'actor_dob', 'actor_gender', 'actor_ethnicity', 'actor_height', 'actor_age_at_release']].isna().mean()*100)

# Display the first few rows of the filtered data
lead_actor_data.head(2)


This makes sense since we preprocessed our character data. Let's now see if we have data for all of our lead actors.

In [None]:
n_lead_actors = movie_data_valid[["lead_actor_1", "lead_actor_2"]].notna().sum().sum()
print(f"We have data for {lead_actor_data.shape[0] / n_lead_actors:.2%} of our lead actors.")

We have data for more than 50% of our lead actors.
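
Building a tuple per row with `apply(tuple, axis=1)` works but can be slow on a large character table. As a possible alternative, the same (movie, actor) matching can be sketched with `melt` and an inner merge. The frames below are made up for illustration, including the Freebase-style IDs:

```python
import pandas as pd

movies = pd.DataFrame({
    "freebase_movie_id": ["/m/1", "/m/2"],
    "lead_actor_1": ["Actor A", "Actor C"],
    "lead_actor_2": ["Actor B", "Actor D"],
})
characters = pd.DataFrame({
    "freebase_movie_id": ["/m/1", "/m/1", "/m/2"],
    "actor_name": ["Actor A", "Actor X", "Actor D"],
    "actor_height": [1.80, 1.70, 1.65],
})

# Long format: one row per (movie, lead actor) pair
pairs = (
    movies.melt(
        id_vars="freebase_movie_id",
        value_vars=["lead_actor_1", "lead_actor_2"],
        value_name="actor_name",
    )[["freebase_movie_id", "actor_name"]]
    .dropna()
    .drop_duplicates()
)

# The inner merge keeps only character rows whose (movie, actor) pair is a lead pair
lead_rows = characters.merge(pairs, on=["freebase_movie_id", "actor_name"], how="inner")
coverage = len(lead_rows) / len(pairs)

print(lead_rows)
print(f"{coverage:.0%} of lead pairs found")
```

Note that both approaches match on the actor name as a string, so spelling differences between IMDb and the CMU dataset count as missing lead actors.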

### Extract the characters of our films

Let's now extract all the characters of our movies into a single dataset.

In [None]:
# Filter the character_data_valid dataset to keep only rows with freebase_movie_id present in movie_data_valid
character_data_valid_filtered = character_data_valid[character_data_valid['freebase_movie_id'].isin(movie_data_valid['freebase_movie_id'])]

# Extract the relevant columns for characters and associated movies
character_data_valid_filtered = character_data_valid_filtered[['actor_name', 'actor_dob', 'actor_gender', 'actor_ethnicity', 'actor_height', 'actor_age_at_release', 'freebase_movie_id', 'character_name']]

# Display the first few rows of the filtered and extracted data
character_data_valid_filtered.head(2)


In [None]:
character_data_valid_filtered.shape

In [None]:
character_data_valid_filtered.isnull().mean()

## Saving our newly created dataframes

We will save our three dataframes:

- movie_dataset_preprocessed contains the movies with a box office revenue, preprocessed to keep only the ones with lead actors.
- lead_actors_preprocessed contains information about the lead actors in our movies. Each row corresponds to one actor in one film, so an actor may appear multiple times if they star in multiple films.
- characters_preprocessed holds detailed information about all the characters in our selected films, along with the corresponding actors who portrayed them.

In [None]:
# Save to Pickle
movie_data_valid.to_pickle('movie_dataset_preprocessed.pkl')
character_data_valid_filtered.to_pickle('characters_preprocessed.pkl')
lead_actor_data.to_pickle('lead_actors_preprocessed.pkl')

## Loading our dataframes

We can skip the whole preprocessing and load our dataframes directly here:

In [None]:
# Load the datasets from pickle files
movie_data_valid_loaded = pd.read_pickle('movie_dataset_preprocessed.pkl')
character_data_valid_filtered_loaded = pd.read_pickle('characters_preprocessed.pkl')
lead_actor_data_loaded = pd.read_pickle('lead_actors_preprocessed.pkl')

## Deeper analysis
Now that our data is more complete, we can do a more in-depth analysis.

### Lead actors dataset 

Let's first analyse the lead actors dataframe we just created. We will start with summary statistics of the numerical features.

In [None]:
lead_actor_data.describe()

Let's plot their distributions:

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 6))

# Histogram for the actor height
sns.histplot(data=lead_actor_data, x="actor_height", bins=50, ax=axes[0], kde=False)
axes[0].set_title("Height of the actor")
axes[0].set_xlabel("Height (m)")

# Histogram for the actor age at release
sns.histplot(data=lead_actor_data, x="actor_age_at_release", bins=50, ax=axes[1], kde=True)
axes[1].set_title("Age of the actor at the release of the movie")
axes[1].set_xlabel("Age (years)")

# Histogram for the actor date of birth
sns.histplot(data=lead_actor_data, x="actor_dob", bins=50, ax=axes[2], kde=True)
axes[2].set_title("Date of birth of the actor")
axes[2].set_xlabel("Date of birth")

plt.tight_layout()
plt.show()


Let's now explore the categorical data:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Countplot for the gender distribution
sns.countplot(data=lead_actor_data, x="actor_gender", ax=axes[0], stat='proportion')
axes[0].set_title("Actor gender distribution")
axes[0].set_xlabel("Gender")
axes[0].set_ylabel("Proportion")

# Bar plot for the most common ethnicity labels
ethnicity_cutoff = 30
values = lead_actor_data["actor_ethnicity_label"].value_counts()
values = values[:ethnicity_cutoff]
sns.barplot(x=values, y=values.index, ax=axes[1])
axes[1].set_title(f"{ethnicity_cutoff} most common ethnicity labels")
axes[1].set_xlabel("Count")
axes[1].set_ylabel("Ethnicity")

plt.tight_layout()
plt.show()

### Movies dataset 

Let's now analyse our movies dataset. We will start with summary statistics of the numerical features.

In [None]:
movie_data_completed = movie_data_valid.copy()
movie_data_completed.describe()

Let's plot their distributions (except for the Wikipedia ID):

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Histogram for the runtime
movie_data_completed["runtime"].hist(bins=100, ax=axes[0])
axes[0].set_title("Histogram for runtime")
axes[0].set_xlabel("Runtime (min)")

# Histogram for the box office results
movie_data_completed["box_office_revenue"].hist(bins=100, ax=axes[1])
axes[1].set_title("Histogram for box_office_revenue")
axes[1].set_xlabel("Box office revenue (dollars)")

plt.tight_layout()
plt.show()


Let's now draw some box plots.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Boxplot for the raw box_office_revenue
sns.boxplot(data=movie_data_completed["box_office_revenue"], ax=axes[0])
axes[0].set_title("Boxplot for box_office_revenue")

# Boxplot for the raw runtime
sns.boxplot(data=movie_data_completed["runtime"], ax=axes[1])
axes[1].set_title("Boxplot for runtime")

plt.tight_layout()
plt.show()


We can now have a look at the categorical features:

In [None]:
#TODO: SOME COUNT PLOTS FOR CATEGORICAL
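
A possible starting point for these count plots, assuming that `genres`, `countries` and `languages` hold lists of strings after the preprocessing above: explode the list column into one row per value, then count. The toy frame below is made up for illustration.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    "movie_name": ["M1", "M2", "M3"],
    "genres": [["Thriller", "Horror"], ["Thriller"], ["Comedy", "Thriller"]],
})

# One row per (movie, genre) pair, then count how often each genre appears
genre_counts = toy_movies["genres"].explode().value_counts()

print(genre_counts.head(30))
```

The resulting counts can then be passed to `sns.barplot` in the same way as the ethnicity labels earlier in the notebook.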

### Characters dataset

In [None]:
# TODO