# CMU Movies Summary Corpus

- Authors: Zaynab, Lylia, Ali, Christian, Yassin

---

## Tasks

1. **Select Project & Initial Analyses**:
   1. Agree on a project proposal with team members.
   2. Perform initial analyses to verify feasibility of the proposed project, including any additional data.
   3. Acquaint yourself with the provided data, preprocess it, and perform descriptive statistics.

2. **Pipeline & Data Description**:
   1. Create a pipeline for data handling and preprocessing, documented in the notebook.
   2. Describe the relevant aspects of the data, including:
      1. Handling the size of the data.
      2. Understanding the data (formats, distributions, missing values, correlations, etc.).
      3. Considering data enrichment, filtering, and transformation according to project needs.
   3. Develop a plan for methods to be used, with essential mathematical details.
   4. Outline a plan for analysis and communication, discussing alternative approaches considered.

3. **GitHub Repository & Deliverables**:
   1. Create a public GitHub repository named `ada-2023-project-<team>` under the `epfl-ada` GitHub organization. ✅
   2. Ensure the repository contains:
      1. **README.md** file with:
         1. **Title**: Project title.
         2. **Abstract**: 150-word description of the project idea, goals, and motivation.
         3. **Research Questions**: List of research questions to address.
         4. **Proposed Additional Datasets**: Description of additional datasets, expected management, and feasibility analysis.
         5. **Methods**: Methods to be used in the project.
         6. **Proposed Timeline**: Timeline for the project.
         7. **Organization within the Team**: Internal milestones leading to Milestone P3.
         8. **Questions for TAs (optional)**: Any questions for the teaching assistants.
      2. **Code for Initial Analyses**: Structured code for initial analyses and data handling pipelines.
      3. **Notebook** presenting initial results, including:
         1. Main results and descriptive analysis.
         2. External scripts/modules for implementing core logic, to be called from the notebook.

---


## Table of Contents
- [1. Zaynab's part](#zaynabs-part)
- [2. Lylia's part](#lylias-part)
- [3. Ali's part](#alis-part)
- [4. Christian's part](#christians-part)
- [5. Yassin's part](#yassins-part)

---

### Library importation

In [42]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

### data importation

In [43]:
DATA_PATH='./data/MovieSummaries/'

### movie metadata

In [44]:
movie_columns = [
    'WikipediaMovieID', 'FreebaseMovieID', 'MovieName', 'ReleaseDate', 
    'BoxOfficeRevenue', 'Runtime', 'Languages', 'Countries', 'Genres'
]


movie_metadata = pd.read_csv(DATA_PATH+'movie.metadata.tsv', sep='\t', names=movie_columns)

movie_metadata

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"
...,...,...,...,...,...,...,...,...,...
81736,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}"
81737,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0..."
81738,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}"
81739,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ..."


### character metadata

In [45]:
character_columns = [
    'WikipediaMovieID', 'FreebaseMovieID', 'ReleaseDate', 'CharacterName',
    'ActorDOB', 'ActorGender', 'ActorHeight', 'ActorEthnicity', 
    'ActorName', 'ActorAgeAtRelease', 'FreebaseCharacterActorMapID',
    'FreebaseCharacterID', 'FreebaseActorID'
]

character_metadata = pd.read_csv(DATA_PATH+'character.metadata.tsv', sep='\t', names=character_columns)


character_metadata

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,ReleaseDate,CharacterName,ActorDOB,ActorGender,ActorHeight,ActorEthnicity,ActorName,ActorAgeAtRelease,FreebaseCharacterActorMapID,FreebaseCharacterID,FreebaseActorID
0,975900,/m/03vyhn,2001-08-24,Akooshay,1958-08-26,F,1.620,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
1,975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.780,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,/m/03vyhn,2001-08-24,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
3,975900,/m/03vyhn,2001-08-24,Sgt Jericho Butler,1967-09-12,M,1.750,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
4,975900,/m/03vyhn,2001-08-24,Bashira Kincaid,1977-09-25,F,1.650,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg
...,...,...,...,...,...,...,...,...,...,...,...,...,...
450664,913762,/m/03pcrp,1992-05-21,Elensh,1970-05,F,,,Dorothy Elias-Fahn,,/m/0kr406c,/m/0kr406h,/m/0b_vcv
450665,913762,/m/03pcrp,1992-05-21,Hibiki,1965-04-12,M,,,Jonathan Fahn,27.0,/m/0kr405_,/m/0kr4090,/m/0bx7_j
450666,28308153,/m/0cp05t9,1957,,1941-11-18,M,1.730,/m/02w7gg,David Hemmings,15.0,/m/0g8ngmc,,/m/022g44
450667,28308153,/m/0cp05t9,1957,,,,,,Roberta Paterson,,/m/0g8ngmj,,/m/0g8ngmm


In [46]:
# Count the total number of rows
total_rows = character_metadata.shape[0]

# Count the number of unique ActorName entries
unique_actor_names = character_metadata['ActorName'].nunique()

# Check for uniqueness
if total_rows == unique_actor_names:
    print("ActorName values are unique.")
else:
    print(f"ActorName values are not unique. There are {total_rows - unique_actor_names} duplicate names.")

ActorName values are not unique. There are 316591 duplicate names.


### plot summaries

In [47]:
plot_columns = ['WikipediaMovieID', 'PlotSummary']

plot_summaries = pd.read_csv(DATA_PATH+'plot_summaries.txt', sep='\t', names=plot_columns)

plot_summaries


Unnamed: 0,WikipediaMovieID,PlotSummary
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...
...,...,...
42298,34808485,"The story is about Reema , a young Muslim scho..."
42299,1096473,"In 1928 Hollywood, director Leo Andreyev look..."
42300,35102018,American Luthier focuses on Randy Parsons’ tra...
42301,8628195,"Abdur Rehman Khan , a middle-aged dry fruit se..."


### name clusters

In [48]:
# In name.clusters.txt the character name comes first, followed by the Freebase character/actor map ID
name_clusters_columns = ['CharacterName', 'FreebaseCharacterActorMapID']

name_clusters = pd.read_csv(DATA_PATH+'name.clusters.txt', sep='\t', names=name_clusters_columns)

name_clusters


Unnamed: 0,CharacterName,FreebaseCharacterActorMapID
0,Stuart Little,/m/0k3w9c
1,Stuart Little,/m/0k3wcx
2,Stuart Little,/m/0k3wbn
3,John Doe,/m/0jyg35
4,John Doe,/m/0k2_zn
...,...,...
2661,John Rolfe,/m/0k5_ql
2662,John Rolfe,/m/02vd6vs
2663,Elizabeth Swann,/m/0k1xvz
2664,Elizabeth Swann,/m/0k1x_d


### TV tropes clusters

In [49]:
# Each line holds the character type (trope) first, then a JSON object describing one character
tvtropes_columns = ['CharacterType', 'CharacterInfo']

tvtropes_clusters = pd.read_csv(DATA_PATH+'tvtropes.clusters.txt', sep='\t', names=tvtropes_columns)

tvtropes_clusters


Unnamed: 0,CharacterType,CharacterInfo
0,absent_minded_professor,"{""char"": ""Professor Philip Brainard"", ""movie"":..."
1,absent_minded_professor,"{""char"": ""Professor Keenbean"", ""movie"": ""Richi..."
2,absent_minded_professor,"{""char"": ""Dr. Reinhardt Lane"", ""movie"": ""The S..."
3,absent_minded_professor,"{""char"": ""Dr. Harold Medford"", ""movie"": ""Them!..."
4,absent_minded_professor,"{""char"": ""Daniel Jackson"", ""movie"": ""Stargate""..."
...,...,...
496,young_gun,"{""char"": ""Morgan Earp"", ""movie"": ""Tombstone"", ..."
497,young_gun,"{""char"": ""Colorado Ryan"", ""movie"": ""Rio Bravo""..."
498,young_gun,"{""char"": ""Tom Sawyer"", ""movie"": ""The League of..."
499,young_gun,"{""char"": ""William H. 'Billy the Kid' Bonney"", ..."


---

## Zaynab's part

---
## Lylia's part

---
## Ali's part

### Verifying the matching between **movie_metadata** and **plot_summaries**

In [50]:
# Verify if all WikipediaMovieID in plot_summaries are also in movie_metadata
all_ids_in_metadata = plot_summaries['WikipediaMovieID'].isin(movie_metadata['WikipediaMovieID']).all()

print(f"All WikipediaMovieID in plot_summaries are in movie_metadata: {all_ids_in_metadata}")

All WikipediaMovieID in plot_summaries are in movie_metadata: False


In [52]:
# Calculate the percentage of common WikipediaMovieID between plot_summaries and movie_metadata
common_ids_count = plot_summaries['WikipediaMovieID'].isin(movie_metadata['WikipediaMovieID']).sum()
percentage_common_ids = (common_ids_count / len(plot_summaries)) * 100

print(f"Percentage of common WikipediaMovieID between plot_summaries and movie_metadata: {percentage_common_ids:.2f}%")
print(common_ids_count)
print(len(plot_summaries))
print(len(movie_metadata))

Percentage of common WikipediaMovieID between plot_summaries and movie_metadata: 99.77%
42204
42303
81741


--> Almost all the movies in plot_summaries (42,204 of 42,303, i.e. 99.77%) are in movie_metadata; 99 summaries have no metadata entry.
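
To see which summaries are unmatched, and not only how many, a left merge with `indicator=True` is one option. The sketch below runs on toy stand-ins for `plot_summaries` and `movie_metadata` (the values are made up), so it only illustrates the pattern.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (values are made up for illustration)
plots = pd.DataFrame({"WikipediaMovieID": [1, 2, 3, 4],
                      "PlotSummary": ["a", "b", "c", "d"]})
movies = pd.DataFrame({"WikipediaMovieID": [1, 2, 4, 5],
                       "MovieName": ["A", "B", "D", "E"]})

# A left merge with indicator=True labels each plot row as 'both' or 'left_only'
flagged = plots.merge(movies[["WikipediaMovieID"]].drop_duplicates(),
                      on="WikipediaMovieID", how="left", indicator=True)

# Rows flagged 'left_only' are summaries without a metadata entry
unmatched_plots = flagged.loc[flagged["_merge"] == "left_only", "WikipediaMovieID"].tolist()
print(unmatched_plots)
```

On the real data the same pattern should return the 99 IDs (42,303 minus 42,204) that are missing from movie_metadata.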

### Find the bounds of the movies dates

#### With movie_metadata

In [53]:
# Create a copy of the original DataFrame
movie_metadata_copy0 = movie_metadata.copy()

# Convert ReleaseDate column to datetime, handling errors for missing or malformed dates
movie_metadata_copy0['ReleaseDate'] = pd.to_datetime(movie_metadata_copy0['ReleaseDate'], errors='coerce')

# Drop any rows where ReleaseDate could not be converted
movie_metadata_copy0 = movie_metadata_copy0.dropna(subset=['ReleaseDate'])

# Find minimum and maximum dates after ensuring all dates are converted correctly
min_date = movie_metadata_copy0['ReleaseDate'].min()
max_date = movie_metadata_copy0['ReleaseDate'].max()

print(f"Lower bound (earliest release date): {min_date}")
print(f"Upper bound (latest release date): {max_date}")

Lower bound (earliest release date): 1892-10-28 00:00:00
Upper bound (latest release date): 2016-06-08 00:00:00


#### With character_metadata

In [54]:
# Create a copy of the original DataFrame
character_metadata0 = character_metadata.copy()

# Ensure that the 'ReleaseDate' column is in datetime format, handling errors for missing or malformed dates
character_metadata0['ReleaseDate'] = pd.to_datetime(character_metadata0['ReleaseDate'], errors='coerce')

# Drop rows with NaT in 'ReleaseDate' to avoid issues with min/max
character_metadata0 = character_metadata0.dropna(subset=['ReleaseDate'])

# Get the minimum and maximum release dates
min_date = character_metadata0['ReleaseDate'].min()
max_date = character_metadata0['ReleaseDate'].max()

print(f"Lower bound (earliest release date): {min_date}")
print(f"Upper bound (latest release date): {max_date}")

Lower bound (earliest release date): 1894-11-17 00:00:00
Upper bound (latest release date): 2016-06-08 00:00:00
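
ReleaseDate mixes full dates (`2001-08-24`) with year-only values (`1988`). Depending on the pandas version, `pd.to_datetime(..., errors='coerce')` may turn values that do not match the inferred format into `NaT`, so the `dropna` above can silently discard them. If only the year bounds matter, a possible workaround is to extract the leading four digits. This is a sketch on a toy series, not the pipeline used above.

```python
import pandas as pd

# Toy series mixing the date shapes seen in the data, plus missing and malformed values
dates = pd.Series(["2001-08-24", "1988", "1992-05", None, "not a date"])

# Take the leading four digits as the year; rows without them become missing
years = dates.str.extract(r"^(\d{4})", expand=False)
years = pd.to_numeric(years, errors="coerce").astype("Int64")

min_year, max_year = int(years.min()), int(years.max())
print(min_year, max_year)
```

Comparing the number of missing years with the number of `NaT` values from `pd.to_datetime` would show how many rows the datetime conversion drops.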


## PART A : Cleaning of CelebA and Rotten Tomatoes datasets

### **Cleaning CelebA dataset**

In [55]:
# Load the file, skipping the first two lines as they are headers or metadata
file_path = './data/MovieSummaries/list_identity_celeba.txt'
identity_data = pd.read_csv(file_path, sep=r'\s+', skiprows=2, names=['ImageID', 'IdentityName'])
# Replace underscores with spaces in the IdentityName column
identity_data['IdentityName'] = identity_data['IdentityName'].str.replace('_', ' ')
identity_data.head(10)


Unnamed: 0,ImageID,IdentityName
0,000001.jpg,Elizabeth Gutierrez
1,000002.jpg,Emilia Fox
2,000003.jpg,Shane Harper
3,000004.jpg,Leonor Varela
4,000005.jpg,Tatana Kucharova
5,000006.jpg,Jaimee Grubbs
6,000007.jpg,Stephen Colletti
7,000008.jpg,Mario Cantone
8,000009.jpg,Gabriela Spanic
9,000010.jpg,Anita Briem


In [56]:
attr_file_path = './data/MovieSummaries/list_attr_celeba.txt'
# Read the first line as column names and the remaining lines as data
attributes_data = pd.read_csv(attr_file_path, sep=r'\s+', skiprows=1)

attributes_data.head(10)


Unnamed: 0,ImageID,5_o_Clock_Shadow,Arched_Eyebrows,Attractive,Bags_Under_Eyes,Bald,Bangs,Big_Lips,Big_Nose,Black_Hair,...,Sideburns,Smiling,Straight_Hair,Wavy_Hair,Wearing_Earrings,Wearing_Hat,Wearing_Lipstick,Wearing_Necklace,Wearing_Necktie,Young
0,000001.jpg,-1,1,1,-1,-1,-1,-1,-1,-1,...,-1,1,1,-1,1,-1,1,-1,-1,1
1,000002.jpg,-1,-1,-1,1,-1,-1,-1,1,-1,...,-1,1,-1,-1,-1,-1,-1,-1,-1,1
2,000003.jpg,-1,-1,-1,-1,-1,-1,1,-1,-1,...,-1,-1,-1,1,-1,-1,-1,-1,-1,1
3,000004.jpg,-1,-1,1,-1,-1,-1,-1,-1,-1,...,-1,-1,1,-1,1,-1,1,1,-1,1
4,000005.jpg,-1,1,1,-1,-1,-1,1,-1,-1,...,-1,-1,-1,-1,-1,-1,1,-1,-1,1
5,000006.jpg,-1,1,1,-1,-1,-1,1,-1,-1,...,-1,-1,-1,1,1,-1,1,-1,-1,1
6,000007.jpg,1,-1,1,1,-1,-1,1,1,1,...,-1,-1,1,-1,-1,-1,-1,-1,-1,1
7,000008.jpg,1,1,-1,1,-1,-1,1,-1,1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,1
8,000009.jpg,-1,1,1,-1,-1,1,1,-1,-1,...,-1,1,-1,-1,1,-1,1,-1,-1,1
9,000010.jpg,-1,-1,1,-1,-1,-1,-1,-1,-1,...,-1,-1,-1,1,-1,-1,1,-1,-1,1


In [57]:
# Merge the two dataframes on 'ImageID'
merged_df = pd.merge(identity_data, attributes_data, on='ImageID')

# Display the merged dataframe
merged_df.head()

Unnamed: 0,ImageID,IdentityName,5_o_Clock_Shadow,Arched_Eyebrows,Attractive,Bags_Under_Eyes,Bald,Bangs,Big_Lips,Big_Nose,...,Sideburns,Smiling,Straight_Hair,Wavy_Hair,Wearing_Earrings,Wearing_Hat,Wearing_Lipstick,Wearing_Necklace,Wearing_Necktie,Young
0,000001.jpg,Elizabeth Gutierrez,-1,1,1,-1,-1,-1,-1,-1,...,-1,1,1,-1,1,-1,1,-1,-1,1
1,000002.jpg,Emilia Fox,-1,-1,-1,1,-1,-1,-1,1,...,-1,1,-1,-1,-1,-1,-1,-1,-1,1
2,000003.jpg,Shane Harper,-1,-1,-1,-1,-1,-1,1,-1,...,-1,-1,-1,1,-1,-1,-1,-1,-1,1
3,000004.jpg,Leonor Varela,-1,-1,1,-1,-1,-1,-1,-1,...,-1,-1,1,-1,1,-1,1,1,-1,1
4,000005.jpg,Tatana Kucharova,-1,1,1,-1,-1,-1,1,-1,...,-1,-1,-1,-1,-1,-1,1,-1,-1,1


We now have 202,599 images of 10,177 unique actors, so many actors appear in several images. For our study we want a single attribute description per actor, restricted to a subset of the given attributes: the long-lasting physical traits that stay fixed from one picture of the same actor to the next. We proceed as follows (the order of the steps could also be reversed):

--> drop the ImageID column, as we no longer need it

--> remove duplicates, keeping only 1 picture per actor, which reduces the dataframe from 202,599 rows to 10,177 rows

--> keep only the essential, non-superficial features that are really specific to each actor (with some margin of error, of course)

In [58]:
# Step 1: Drop the 'ImageID' column
merged_df2 = merged_df.drop(columns=['ImageID'])

# Step 2: Remove duplicates, keeping only one entry per actor
# We use 'IdentityName' as the unique identifier and keep the first occurrence
unique_actor_df = merged_df2.drop_duplicates(subset='IdentityName', keep='first')

# Step 3: Keep only essential, non-superficial features
# Define the list of long-lasting physical attributes we want to keep
essential_attributes = [
    'IdentityName', 'Arched_Eyebrows', 'Attractive', 'Bags_Under_Eyes', 'Bald', 'Bangs', 
    'Big_Lips', 'Big_Nose', 'Black_Hair', 'Blond_Hair', 'Brown_Hair', 
    'Bushy_Eyebrows', 'Chubby', 'Double_Chin', 'Goatee', 'Gray_Hair', 
    'High_Cheekbones', 'Male', 'Mustache', 
    'Narrow_Eyes', 'No_Beard', 'Oval_Face', 'Pale_Skin', 'Pointy_Nose', 'Receding_Hairline', 
    'Rosy_Cheeks', 'Sideburns', 'Straight_Hair', 'Wavy_Hair'
]

# Select only the essential columns from the dataframe
final_df = unique_actor_df[essential_attributes]

# Display the final dataframe
final_df.head(10)

Unnamed: 0,IdentityName,Arched_Eyebrows,Attractive,Bags_Under_Eyes,Bald,Bangs,Big_Lips,Big_Nose,Black_Hair,Blond_Hair,...,Narrow_Eyes,No_Beard,Oval_Face,Pale_Skin,Pointy_Nose,Receding_Hairline,Rosy_Cheeks,Sideburns,Straight_Hair,Wavy_Hair
0,Elizabeth Gutierrez,1,1,-1,-1,-1,-1,-1,-1,-1,...,-1,1,-1,-1,1,-1,-1,-1,1,-1
1,Emilia Fox,-1,-1,1,-1,-1,-1,1,-1,-1,...,-1,1,-1,-1,-1,-1,-1,-1,-1,-1
2,Shane Harper,-1,-1,-1,-1,-1,1,-1,-1,-1,...,1,1,-1,-1,1,-1,-1,-1,-1,1
3,Leonor Varela,-1,1,-1,-1,-1,-1,-1,-1,-1,...,-1,1,-1,-1,1,-1,-1,-1,1,-1
4,Tatana Kucharova,1,1,-1,-1,-1,1,-1,-1,-1,...,1,1,-1,-1,1,-1,-1,-1,-1,-1
5,Jaimee Grubbs,1,1,-1,-1,-1,1,-1,-1,-1,...,-1,1,-1,-1,-1,-1,-1,-1,-1,1
6,Stephen Colletti,-1,1,1,-1,-1,1,1,1,-1,...,-1,1,-1,-1,1,-1,-1,-1,1,-1
7,Mario Cantone,1,-1,1,-1,-1,1,-1,1,-1,...,-1,1,-1,-1,1,-1,-1,-1,-1,-1
8,Gabriela Spanic,1,1,-1,-1,1,1,-1,-1,-1,...,-1,1,1,-1,1,-1,1,-1,-1,-1
9,Anita Briem,-1,1,-1,-1,-1,-1,-1,-1,-1,...,-1,1,-1,-1,-1,-1,-1,-1,-1,1
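
Keeping the first image per actor makes every attribute depend on a single photo. A possible alternative, not the approach used above, is a majority vote over all images of the same person. The sketch below assumes the +1/-1 coding shown in the tables and uses toy values.

```python
import pandas as pd

# Toy data: three images of one person, one image of another (values are illustrative)
images = pd.DataFrame({
    "IdentityName": ["A", "A", "A", "B"],
    "Black_Hair":   [1, 1, -1, -1],
    "Bald":         [-1, -1, -1, 1],
})

# Attributes are coded +1/-1, so the sign of the per-person sum gives the majority label
sums = images.groupby("IdentityName").sum(numeric_only=True)
majority = sums.gt(0).astype(int) * 2 - 1   # ties (sum == 0) fall back to -1
print(majority)
```

Sending ties to -1 is an arbitrary choice; another rule may suit the analysis better.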


### Number of actors we have in **character_metadata**

In [59]:
# Assuming 'character_metadata' contains a column named 'ActorName' with each actor's name
num_actors = character_metadata['ActorName'].nunique()
print(f"Number of unique actors: {num_actors}")

Number of unique actors: 134078


### Number of actors we have in **list_identity_celeba**

In [60]:
# Get all unique identity names
unique_names = identity_data['IdentityName'].unique()
# Display the result
print(f"Number of unique actors: {len(unique_names)}")

Number of unique actors: 10177


### Number of actors in common in **character_metadata** & **list_identity_celeba**

In [61]:
# Common actors to the two dataframes

# Convert both columns to sets for efficient intersection
identity_names_set = set(identity_data['IdentityName'])
actor_names_set = set(character_metadata['ActorName'])

# Find the intersection
common_names = identity_names_set.intersection(actor_names_set)

# Output the number of common names and the names themselves
print(f"Number of names in common: {len(common_names)}")
print(common_names)

Number of names in common: 4597
{'Will Arnett', 'Serinda Swan', 'Mariano Martinez', 'Christine Taylor', 'Natalia Vodianova', 'Gangadhar', 'June Lockhart', 'Matthew Lawrence', 'Ashley Jensen', 'Doga Rutkay', 'Amanda Blake', 'Brett Somers', 'Phoebe Tonkin', 'Grant Show', 'Van Johnson', 'Angus Scrimm', 'Stephen Fry', 'Matthew Davis', 'Peter Facinelli', 'Emmanuelle Seigner', 'Natalia Livingston', 'Douglas Smith', 'Brittany Murphy', 'Claire Trevor', 'Micheline Presle', 'David Ortiz', 'Richard Wright', 'George Michael', 'Christopher Meloni', 'Claudette Colbert', 'Roland Kickinger', 'Traci Bingham', 'Roy Orbison', 'Abigail Breslin', 'Stephen Colletti', 'Vanessa Morgan', 'Viva Bianca', 'Gale Harold', 'Archie Panjabi', 'Ian Somerhalder', 'William Baldwin', 'Willa Ford', 'Adoor Bhasi', 'Maria Pitillo', 'Syd Barrett', 'Chris Potter', 'Glenn Frey', 'Michael Steger', 'Gail Kim', 'Matisyahu', 'Mae Murray', 'Bobby Flay', 'Jenna Fischer', 'Tatjana Patitz', 'Sheryl Lee', 'Ray Charles', 'Beata Tyszkiewi
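
The intersection above relies on exact string equality, so differences in accents, case or spacing between the two sources would count as a mismatch. One way to check how much this matters is to normalize the names before intersecting. In the sketch below, `normalize_name` is a hypothetical helper and the names are illustrative, not taken from the datasets.

```python
import unicodedata

def normalize_name(name):
    """Lowercase, strip accents and collapse whitespace (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())

# Illustrative names only; the second set mimics accents, odd spacing and a NaN entry
celeba_names = {"Penelope Cruz", "Emilia Fox"}
cmu_names = {"Penélope Cruz", "emilia  fox", float("nan")}

# Skip non-string entries such as NaN before normalizing
celeba_norm = {normalize_name(n) for n in celeba_names if isinstance(n, str)}
cmu_norm = {normalize_name(n) for n in cmu_names if isinstance(n, str)}

common = celeba_norm & cmu_norm
print(sorted(common))
```

If the normalized intersection is clearly larger than 4597, the exact match is losing actors to formatting differences.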

### Final CelebA dataframe cleaned to use : **final_celebA_df**

In [62]:
# common_names is already a set (built in the previous cell)
# Filter final_df to only include rows where 'IdentityName' is in common_names
final_celebA_df = final_df[final_df['IdentityName'].isin(common_names)].reset_index(drop=True)

# Display the resulting dataframe
print(f"Number of rows in final_celebA_df: {len(final_celebA_df)}")
final_celebA_df.head(5)

Number of rows in final_celebA_df: 4597


Unnamed: 0,IdentityName,Arched_Eyebrows,Attractive,Bags_Under_Eyes,Bald,Bangs,Big_Lips,Big_Nose,Black_Hair,Blond_Hair,...,Narrow_Eyes,No_Beard,Oval_Face,Pale_Skin,Pointy_Nose,Receding_Hairline,Rosy_Cheeks,Sideburns,Straight_Hair,Wavy_Hair
0,Elizabeth Gutierrez,1,1,-1,-1,-1,-1,-1,-1,-1,...,-1,1,-1,-1,1,-1,-1,-1,1,-1
1,Emilia Fox,-1,-1,1,-1,-1,-1,1,-1,-1,...,-1,1,-1,-1,-1,-1,-1,-1,-1,-1
2,Shane Harper,-1,-1,-1,-1,-1,1,-1,-1,-1,...,1,1,-1,-1,1,-1,-1,-1,-1,1
3,Leonor Varela,-1,1,-1,-1,-1,-1,-1,-1,-1,...,-1,1,-1,-1,1,-1,-1,-1,1,-1
4,Stephen Colletti,-1,1,1,-1,-1,1,1,1,-1,...,-1,1,-1,-1,1,-1,-1,-1,1,-1


### **Cleaning Rotten Tomatoes dataset**

--> We only need "rotten_tomatoes_movies.csv"; the critics file is not necessary.

In [63]:
# 1. Load the Rotten Tomatoes movies dataset
rt_movies_path = './data/MovieSummaries/rotten_tomatoes_movies.csv'
rt_movies_df = pd.read_csv(rt_movies_path)

# 2. First filter on the columns
rt_movies_df = rt_movies_df[['movie_title', 'critics_consensus', 'content_rating', 
                             'genres', 'actors', 'original_release_date', 
                             'tomatometer_status', 'tomatometer_rating', 
                             'audience_status', 'audience_rating']]

print(rt_movies_df.shape)
rt_movies_df.head(5)

(17712, 10)


Unnamed: 0,movie_title,critics_consensus,content_rating,genres,actors,original_release_date,tomatometer_status,tomatometer_rating,audience_status,audience_rating
0,Percy Jackson & the Olympians: The Lightning T...,Though it may seem like just another Harry Pot...,PG,"Action & Adventure, Comedy, Drama, Science Fic...","Logan Lerman, Brandon T. Jackson, Alexandra Da...",2010-02-12,Rotten,49.0,Spilled,53.0
1,Please Give,Nicole Holofcener's newest might seem slight i...,R,Comedy,"Catherine Keener, Amanda Peet, Oliver Platt, R...",2010-04-30,Certified-Fresh,87.0,Upright,64.0
2,10,Blake Edwards' bawdy comedy may not score a pe...,R,"Comedy, Romance","Dudley Moore, Bo Derek, Julie Andrews, Robert ...",1979-10-05,Fresh,67.0,Spilled,53.0
3,12 Angry Men (Twelve Angry Men),Sidney Lumet's feature debut is a superbly wri...,NR,"Classics, Drama","Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",1957-04-13,Certified-Fresh,100.0,Upright,97.0
4,"20,000 Leagues Under The Sea","One of Disney's finest live-action adventures,...",G,"Action & Adventure, Drama, Kids & Family","James Mason, Kirk Douglas, Paul Lukas, Peter L...",1954-01-01,Fresh,89.0,Upright,74.0


In [64]:
# 3. Filter for movies released between 1888 and 2016 (inclusive)
# Convert original_release_date to datetime format for filtering
rt_movies_df['original_release_date'] = pd.to_datetime(rt_movies_df['original_release_date'], errors='coerce')
rt_movies_df = rt_movies_df[(rt_movies_df['original_release_date'].dt.year >= 1888) & 
                            (rt_movies_df['original_release_date'].dt.year <= 2016)]

rt_movies_df.shape

(14522, 10)

In [73]:
# 4. Calculate the percentage of common movies with movie_metadata
common_with_metadata = rt_movies_df[rt_movies_df['movie_title'].isin(movie_metadata['MovieName'])]
percentage_common_metadata0 = (len(common_with_metadata) / len(movie_metadata)) * 100
percentage_common_metadata1 = (len(common_with_metadata) / len(rt_movies_df)) * 100
print(f"Percentage of common movies with movie_metadata (wrt to movie_metadata): {percentage_common_metadata0:.2f}%")
print(f"Percentage of common movies with movie_metadata (wrt to rt_movies_df): {percentage_common_metadata1:.2f}%")

Percentage of common movies with movie_metadata (wrt to movie_metadata): 11.99%
Percentage of common movies with movie_metadata (wrt to rt_movies_df): 67.48%
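
Matching on the title alone treats remakes and unrelated films that share a title as the same movie. A stricter variant, shown here only as a sketch on toy data, matches on title and release year together. The `year` column is a helper introduced for this example.

```python
import pandas as pd

# Toy stand-ins (titles and dates are made up for illustration)
rt = pd.DataFrame({
    "movie_title": ["Film A", "Film A", "Film B"],
    "original_release_date": pd.to_datetime(["2002-01-19", "2005-07-08", "2010-04-30"]),
})
cmu = pd.DataFrame({
    "MovieName": ["Film A", "Film B"],
    "ReleaseDate": ["2005-07-08", "2010"],
})

# Derive a release year on both sides; CMU dates may be year-only strings
rt["year"] = rt["original_release_date"].dt.year
cmu["year"] = pd.to_numeric(cmu["ReleaseDate"].str[:4], errors="coerce")

# Align the key names, then require both title and year to agree
cmu_keys = cmu.rename(columns={"MovieName": "movie_title"})[["movie_title", "year"]]
matched = rt.merge(cmu_keys, on=["movie_title", "year"], how="inner")

title_only = int(rt["movie_title"].isin(cmu["MovieName"]).sum())
print(title_only, len(matched))
```

In the toy example the title-only check accepts both "Film A" rows, while the title-and-year match keeps only the one whose year agrees.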


### Final Rotten Tomatoes dataframe cleaned to use : **final_RottenTomatoes_df**

--> Keeping only the movie title, actors, tomatometer rating and audience rating, restricted to the movies also present in **movie_metadata**.

In [76]:
# Second filter on the columns
RottenTomatoes_df_filtered = rt_movies_df[['movie_title', 'actors', 'tomatometer_rating', 'audience_rating']]

# Filter to include only common movies between RottenTomatoes_df and movie_metadata
common_movies = RottenTomatoes_df_filtered[
    RottenTomatoes_df_filtered['movie_title'].isin(movie_metadata['MovieName'])
]

# Create final DataFrame
final_RottenTomatoes_df = common_movies.reset_index(drop=True)

# Display the result
final_RottenTomatoes_df.head(10)

Unnamed: 0,movie_title,actors,tomatometer_rating,audience_rating
0,Percy Jackson & the Olympians: The Lightning T...,"Logan Lerman, Brandon T. Jackson, Alexandra Da...",49.0,53.0
1,Please Give,"Catherine Keener, Amanda Peet, Oliver Platt, R...",87.0,64.0
2,10,"Dudley Moore, Bo Derek, Julie Andrews, Robert ...",67.0,53.0
3,The 39 Steps,"Robert Donat, Madeleine Carroll, Godfrey Tearl...",96.0,86.0
4,3:10 to Yuma,"Glenn Ford, Van Heflin, Felicia Farr, Leora Da...",96.0,79.0
5,Abraham Lincoln,"Walter Huston, Una Merkel, Kay Hammond, Ian Ke...",82.0,40.0
6,Dark Water,"Hitomi Kuroki, Rio Kanno, Shigemitsu Ogi, Mire...",80.0,66.0
7,The Accused,"Jodie Foster, Kelly McGillis, Bernie Coulson, ...",91.0,79.0
8,The Lost City,"Andy Garcia, Dustin Hoffman, Bill Murray, Inés...",25.0,64.0
9,The Breaking Point,"John Garfield, Patricia Neal, Phyllis Thaxter,...",100.0,86.0


## **1. Defining "Success" metrics**

--> We first want to establish clear and quantifiable measures of an actor's success:

- Box office revenue : **total & average revenue of movies** an actor has participated in
- Popularity : **number of films**
- Critical Acclaim : 

Dataframes : plot_summaries, movie_metadata, character_metadata, tvtropes_clusters
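
The critical acclaim entry is still open. One possible way to quantify it, assuming the `actors` column of the Rotten Tomatoes dataframe is a comma-separated string as in the preview above, is to average the two ratings over each actor's movies. The sketch uses toy values, and the output column names are illustrative.

```python
import pandas as pd

# Toy stand-in for final_RottenTomatoes_df (values are made up for illustration)
rt = pd.DataFrame({
    "movie_title": ["Film A", "Film B"],
    "actors": ["Actor One, Actor Two", "Actor Two, Actor Three"],
    "tomatometer_rating": [80.0, 60.0],
    "audience_rating": [70.0, 50.0],
})

# One row per (movie, actor): split the comma-separated string, then explode the lists
per_actor = rt.assign(ActorName=rt["actors"].str.split(",")).explode("ActorName")
per_actor["ActorName"] = per_actor["ActorName"].str.strip()

# Average both ratings over all movies of each actor
acclaim = (per_actor.groupby("ActorName")[["tomatometer_rating", "audience_rating"]]
           .mean()
           .rename(columns={"tomatometer_rating": "MeanTomatometer",
                            "audience_rating": "MeanAudienceRating"}))
print(acclaim)
```

Names that themselves contain a comma would be split incorrectly by this approach and would need special handling.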

### Total revenue of movies an actor has participated in

In [15]:
# Check whether movie_metadata['BoxOfficeRevenue'] contains NaN values that may need cleaning
nan_percentage = movie_metadata['BoxOfficeRevenue'].isna().mean() * 100
print(f"Percentage of NaNs in 'BoxOfficeRevenue': {nan_percentage:.2f}%")

# Replace NaN values in BoxOfficeRevenue with 0
# movie_metadata['BoxOfficeRevenue'] = movie_metadata['BoxOfficeRevenue'].fillna(0)

Percentage of NaNs in 'BoxOfficeRevenue': 89.72%


With 89.72% of the values in the BoxOfficeRevenue column missing, filling with zero is possible, but it would treat unknown revenue as no revenue. The high proportion of missing data also suggests that box office revenue alone may not provide a complete or representative measure of success. **Later we'll try to complete the BoxOfficeRevenue data with external data from Wikipedia.**

--> TO BE CONTINUED AFTER IMPORTING AS MUCH EXTERNAL DATA AS POSSIBLE TO FILL THE MISSING VALUES

In [16]:
# Merge movie and character data on WikipediaMovieID
merged_df = pd.merge(character_metadata, movie_metadata, on='WikipediaMovieID', how='inner')
merged_df

Unnamed: 0,WikipediaMovieID,FreebaseMovieID_x,ReleaseDate_x,CharacterName,ActorDOB,ActorGender,ActorHeight,ActorEthnicity,ActorName,ActorAgeAtRelease,...,FreebaseCharacterID,FreebaseActorID,FreebaseMovieID_y,MovieName,ReleaseDate_y,BoxOfficeRevenue,Runtime,Languages,Countries,Genres
0,975900,/m/03vyhn,2001-08-24,Akooshay,1958-08-26,F,1.620,,Wanda De Jesus,42.0,...,/m/0bgcj3x,/m/03wcfv7,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.780,/m/044038p,Natasha Henstridge,27.0,...,/m/0bgchn4,/m/0346l4,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
2,975900,/m/03vyhn,2001-08-24,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,...,/m/0bgchn_,/m/01vw26l,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
3,975900,/m/03vyhn,2001-08-24,Sgt Jericho Butler,1967-09-12,M,1.750,,Jason Statham,33.0,...,/m/0bgchnq,/m/034hyc,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
4,975900,/m/03vyhn,2001-08-24,Bashira Kincaid,1977-09-25,F,1.650,,Clea DuVall,23.0,...,/m/0bgchp9,/m/01y9xg,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
450664,913762,/m/03pcrp,1992-05-21,Elensh,1970-05,F,,,Dorothy Elias-Fahn,,...,/m/0kr406h,/m/0b_vcv,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ..."
450665,913762,/m/03pcrp,1992-05-21,Hibiki,1965-04-12,M,,,Jonathan Fahn,27.0,...,/m/0kr4090,/m/0bx7_j,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ..."
450666,28308153,/m/0cp05t9,1957,,1941-11-18,M,1.730,/m/02w7gg,David Hemmings,15.0,...,,/m/022g44,/m/0cp05t9,Five Clues to Fortune,1957,,129.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/0lsxr"": ""Crime Fiction""}"
450667,28308153,/m/0cp05t9,1957,,,,,,Roberta Paterson,,...,,/m/0g8ngmm,/m/0cp05t9,Five Clues to Fortune,1957,,129.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/0lsxr"": ""Crime Fiction""}"


In [17]:
# Calculate Total Box Office Revenue per Actor
actor_revenue = merged_df.groupby('ActorName')['BoxOfficeRevenue'].sum().reset_index()
total_revenue = actor_revenue.rename(columns={'BoxOfficeRevenue': 'TotalBoxOfficeRevenue'})
# Sort total_revenue DataFrame in place by 'TotalBoxOfficeRevenue' in descending order
total_revenue.sort_values(by='TotalBoxOfficeRevenue', ascending=False, inplace=True)
print(total_revenue.head(20))

                   ActorName  TotalBoxOfficeRevenue
129501         Warwick Davis           1.293016e+10
111587     Samuel L. Jackson           1.278943e+10
39968           Frank Welker           1.028744e+10
2220            Alan Rickman           1.020871e+10
106644       Robbie Coltrane           1.009465e+10
24311          Conrad Vernon           9.786402e+09
123933             Tom Hanks           9.623361e+09
41629            Gary Oldman           9.614965e+09
79198           Maggie Smith           9.280352e+09
49940           Hugo Weaving           8.896132e+09
33130           Eddie Murphy           8.682898e+09
61841      John Ratzenberger           8.414363e+09
123893            Tom Felton           8.332480e+09
106915     Robert Downey Jr.           8.056541e+09
43035   Geraldine Somerville           7.865889e+09
35612            Emma Watson           7.865083e+09
26490       Daniel Radcliffe           7.863063e+09
123845            Tom Cruise           7.861340e+09
48065   Hele

In [18]:
# Count actors with 0.0 total revenue
zero_revenue_count = actor_revenue[actor_revenue['BoxOfficeRevenue'] == 0.0].shape[0]

# Calculate total number of actors
total_actors = actor_revenue.shape[0]

# Calculate percentage
percentage_zero_revenue = (zero_revenue_count / total_actors) * 100
print(f"Percentage of actors with a total revenue of 0.0: {percentage_zero_revenue:.2f}%")

Percentage of actors with a total revenue of 0.0: 72.12%


So relying on this metric alone may not be ideal: about 72.12% of actors have a total revenue of 0.0, which likely reflects missing box office data rather than films that earned nothing. The analysis may also be biased towards actors in high-grossing films, ignoring those in smaller projects.
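
By default, `sum()` on a group skips NaN, so an actor whose movies all lack revenue ends up with 0.0. To keep "unknown" separate from "zero", one option is the `min_count` argument of the groupby sum. The sketch below uses a toy stand-in for the merged dataframe.

```python
import numpy as np
import pandas as pd

# Toy stand-in for merged_df (values are made up for illustration)
toy = pd.DataFrame({
    "ActorName": ["A", "A", "B", "B", "C"],
    "BoxOfficeRevenue": [100.0, np.nan, np.nan, np.nan, 0.0],
})

# Default sum() skips NaN, so an actor with no known revenue gets 0.0
default_total = toy.groupby("ActorName")["BoxOfficeRevenue"].sum()

# min_count=1 requires at least one non-NaN value, otherwise the result stays NaN
known_total = toy.groupby("ActorName")["BoxOfficeRevenue"].sum(min_count=1)

print(default_total["B"], known_total["B"])
```

With `min_count=1`, the share of actors whose total is NaN measures missing data, and a total of exactly 0.0 only appears when a zero was actually recorded.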

### Average revenue of movies an actor has participated in

In [43]:
# Calculate Average Box Office Revenue per Movie (per actor)
average_revenue = merged_df.groupby('ActorName')['BoxOfficeRevenue'].mean().reset_index().rename(columns={'BoxOfficeRevenue': 'AverageBoxOfficeRevenue'})
average_revenue.sort_values(by='AverageBoxOfficeRevenue', ascending=False, inplace=True)
average_revenue.head(20)

Unnamed: 0,ActorName,AverageBoxOfficeRevenue
55768,Jason Whyte,2782275000.0
112919,Scott Lawrence,2782275000.0
113091,Sean Anthony Moran,2782275000.0
74868,Lewis Abernathy,2185372000.0
30556,Dileep Rao,1803904000.0
70482,Kirill Nikiforov,1511758000.0
3595,Alexis Denisof,1511758000.0
78604,M'laah Kaur Singh,1511758000.0
23846,Cobie Smulders,1511758000.0
103669,Rashmi Rustagi,1511758000.0


In [44]:
# Calculate Number of Movies per Actor
movie_count = merged_df.groupby('ActorName')['MovieName'].nunique().reset_index().rename(columns={'MovieName': 'NumberOfMovies'})
# Sort movie_count DataFrame in place by 'NumberOfMovies' in descending order
movie_count.sort_values(by='NumberOfMovies', ascending=False, inplace=True)
movie_count.head(20)

Unnamed: 0,ActorName,NumberOfMovies
86136,Mel Blanc,573
90201,Mithun Chakraborty,322
95972,Oliver Hardy,299
90474,Mohanlal,234
90355,Moe Howard,225
79708,Mammootty,225
72567,Larry Fine,219
30021,Dharmendra Deol,215
101832,Prakash Raj,205
15231,Brahmanandam,198


In [45]:
# Combine Metrics
success_metrics = total_revenue.merge(average_revenue, on='ActorName').merge(movie_count, on='ActorName')

# Display the top actors by Total Box Office Revenue
top_actors = success_metrics.sort_values(by='TotalBoxOfficeRevenue', ascending=False)
top_actors.head(20)

Unnamed: 0,ActorName,TotalBoxOfficeRevenue,AverageBoxOfficeRevenue,NumberOfMovies
0,Warwick Davis,12930160000.0,680534900.0,28
1,Samuel L. Jackson,12789430000.0,177631000.0,104
2,Frank Welker,10287440000.0,168646600.0,150
3,Alan Rickman,10208710000.0,352024300.0,41
4,Robbie Coltrane,10094650000.0,336488400.0,59
5,Conrad Vernon,9786402000.0,575670700.0,15
6,Tom Hanks,9623361000.0,204752400.0,53
7,Gary Oldman,9614965000.0,267082400.0,53
8,Maggie Smith,9280352000.0,320012100.0,55
9,Hugo Weaving,8896132000.0,370672100.0,36


## **2. Define actors' characteristics that we want to evaluate**

### **Demographics** : Height, Gender, Ethnicity

In [19]:
# Calculate percentage of NaN values for ActorGender, ActorHeight and ActorEthnicity
gender_nan_percentage = character_metadata['ActorGender'].isna().mean() * 100
height_nan_percentage = character_metadata['ActorHeight'].isna().mean() * 100
ethnicity_nan_percentage = character_metadata['ActorEthnicity'].isna().mean() * 100

print(f"Percentage of NaNs in 'ActorGender': {gender_nan_percentage:.2f}%")
print(f"Percentage of NaNs in 'ActorHeight': {height_nan_percentage:.2f}%")
print(f"Percentage of NaNs in 'ActorEthnicity': {ethnicity_nan_percentage:.2f}%")

Percentage of NaNs in 'ActorGender': 10.12%
Percentage of NaNs in 'ActorHeight': 65.65%
Percentage of NaNs in 'ActorEthnicity': 76.47%


--> AGAIN, WE'LL TRY TO FILL THE MISSING VALUES BY IMPORTING DATA FROM WIKIPEDIA
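
The percentages above are computed per row of `character_metadata`, that is per character, so actors with many roles weigh more. To see how many distinct actors lack a value, one can first keep a single row per actor. This sketch assumes `FreebaseActorID` identifies an actor and uses toy values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for character_metadata (values are made up for illustration)
toy = pd.DataFrame({
    "FreebaseActorID": ["/m/a1", "/m/a1", "/m/a1", "/m/a2"],
    "ActorGender": ["F", "F", "F", np.nan],
    "ActorHeight": [1.62, 1.62, 1.62, np.nan],
})

cols = ["ActorGender", "ActorHeight"]

# Share of missing values per character row
row_level = toy[cols].isna().mean() * 100

# Share of missing values per distinct actor (one row kept per FreebaseActorID)
actor_level = toy.drop_duplicates(subset="FreebaseActorID")[cols].isna().mean() * 100

print(row_level.round(2).to_dict(), actor_level.round(2).to_dict())
```

In the toy data 25% of rows but 50% of actors lack a height, which shows how far the two views can diverge.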

---
## Christian's part

---
## Yassin's part