# CMU Movies Summary Corpus

- Authors: Zaynab, Lylia, Ali, Christian, Yassin

---

## Tasks

1. **Select Project & Initial Analyses**:
   1. Agree on a project proposal with team members.
   2. Perform initial analyses to verify feasibility of the proposed project, including any additional data.
   3. Acquaint yourself with the provided data, preprocess it, and perform descriptive statistics.

2. **Pipeline & Data Description**:
   1. Create a pipeline for data handling and preprocessing, documented in the notebook.
   2. Describe the relevant aspects of the data, including:
      1. Handling the size of the data.
      2. Understanding the data (formats, distributions, missing values, correlations, etc.).
      3. Considering data enrichment, filtering, and transformation according to project needs.
   3. Develop a plan for methods to be used, with essential mathematical details.
   4. Outline a plan for analysis and communication, discussing alternative approaches considered.

3. **GitHub Repository & Deliverables**:
   1. Create a public GitHub repository named `ada-2023-project-<team>` under the `epfl-ada` GitHub organization. ✅
   2. Ensure the repository contains:
      1. **README.md** file with:
         1. **Title**: Project title.
         2. **Abstract**: 150-word description of the project idea, goals, and motivation.
         3. **Research Questions**: List of research questions to address.
         4. **Proposed Additional Datasets**: Description of additional datasets, expected management, and feasibility analysis.
         5. **Methods**: Methods to be used in the project.
         6. **Proposed Timeline**: Timeline for the project.
         7. **Organization within the Team**: Internal milestones leading to Milestone P3.
         8. **Questions for TAs (optional)**: Any questions for the teaching assistants.
      2. **Code for Initial Analyses**: Structured code for initial analyses and data handling pipelines.
      3. **Notebook** presenting initial results, including:
         1. Main results and descriptive analysis.
         2. External scripts/modules for implementing core logic, to be called from the notebook.

---


## Table of Contents
- [1. Zaynab's part](#Zaynab's-part)
- [2. Lylia's part](#Lylia's-part)
- [3. Ali's part](#Ali's-part)
- [4. Christian's part](#Christian's-part)
- [5. Yassin's part](#Yassin's-part)

---

### Library imports

In [517]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import ast

### Data import

In [518]:
DATA_PATH='./data/MovieSummaries/'

### movie metadata

In [519]:
movie_columns = [
    'WikipediaMovieID', 'FreebaseMovieID', 'MovieName', 'ReleaseDate', 
    'BoxOfficeRevenue', 'Runtime', 'Languages', 'Countries', 'Genres'
]


movie_metadata = pd.read_csv(DATA_PATH+'movie.metadata.tsv', sep='\t', names=movie_columns)

movie_metadata

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"
...,...,...,...,...,...,...,...,...,...
81736,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}"
81737,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0..."
81738,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}"
81739,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ..."


In [520]:
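def extract_year(date):
    """Return the release year as an int, or pd.NA if it cannot be parsed."""
    if pd.isna(date):
        return pd.NA
    # Year-only entries such as '1988'
    if isinstance(date, str) and len(date) == 4 and date.isdigit():
        return int(date)
    # Full or partial dates; values that cannot be parsed become NaT
    parsed = pd.to_datetime(date, errors='coerce')
    return pd.NA if pd.isna(parsed) else parsed.year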


movie_metadata['YearOfRelease'] = movie_metadata['ReleaseDate'].apply(extract_year)

movie_metadata

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres,YearOfRelease
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",2001
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...",2000
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",1988
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",1987
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",1983
...,...,...,...,...,...,...,...,...,...,...
81736,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}",2011
81737,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0...",2011
81738,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}",1972
81739,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ...",1992


In [521]:
movie_metadata['YearOfRelease'].value_counts()

YearOfRelease
2008    2465
2006    2434
2007    2389
2009    2247
2005    2102
        ... 
1890       2
1904       1
1893       1
1889       1
1888       1
Name: count, Length: 129, dtype: int64

In [None]:
movie_metadata['Genres'] = movie_metadata['Genres'].apply(lambda x : ','.join(list(ast.literal_eval(x).values())))
movie_metadata['Languages'] = movie_metadata['Languages'].apply(lambda x : ','.join(list(ast.literal_eval(x).values())))
movie_metadata['Countries'] = movie_metadata['Countries'].apply(lambda x : ','.join(list(ast.literal_eval(x).values())))
movie_metadata

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres,YearOfRelease
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","Thriller,Science Fiction,Horror,Adventure,Supe...",2001
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","Mystery,Biographical film,Drama,Crime Drama",2000
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","Crime Fiction,Drama",1988
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","Thriller,Erotic thriller,Psychological thriller",1987
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}",Drama,1983
...,...,...,...,...,...,...,...,...,...,...
81736,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}",Drama,2011
81737,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","Biographical film,Drama,Documentary",2011
81738,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","Satire,Comedy",1972
81739,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","Science Fiction,Japanese Movies,Adventure,Anim...",1992


### character metadata

In [523]:
character_columns = [
    'WikipediaMovieID', 'FreebaseMovieID', 'ReleaseDate', 'CharacterName',
    'ActorDOB', 'ActorGender', 'ActorHeight', 'ActorEthnicity', 
    'ActorName', 'ActorAgeAtRelease', 'FreebaseCharacterActorMapID',
    'FreebaseCharacterID', 'FreebaseActorID'
]

character_metadata = pd.read_csv(DATA_PATH+'character.metadata.tsv', sep='\t', names=character_columns)

character_metadata

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,ReleaseDate,CharacterName,ActorDOB,ActorGender,ActorHeight,ActorEthnicity,ActorName,ActorAgeAtRelease,FreebaseCharacterActorMapID,FreebaseCharacterID,FreebaseActorID
0,975900,/m/03vyhn,2001-08-24,Akooshay,1958-08-26,F,1.620,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
1,975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.780,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,/m/03vyhn,2001-08-24,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
3,975900,/m/03vyhn,2001-08-24,Sgt Jericho Butler,1967-09-12,M,1.750,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
4,975900,/m/03vyhn,2001-08-24,Bashira Kincaid,1977-09-25,F,1.650,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg
...,...,...,...,...,...,...,...,...,...,...,...,...,...
450664,913762,/m/03pcrp,1992-05-21,Elensh,1970-05,F,,,Dorothy Elias-Fahn,,/m/0kr406c,/m/0kr406h,/m/0b_vcv
450665,913762,/m/03pcrp,1992-05-21,Hibiki,1965-04-12,M,,,Jonathan Fahn,27.0,/m/0kr405_,/m/0kr4090,/m/0bx7_j
450666,28308153,/m/0cp05t9,1957,,1941-11-18,M,1.730,/m/02w7gg,David Hemmings,15.0,/m/0g8ngmc,,/m/022g44
450667,28308153,/m/0cp05t9,1957,,,,,,Roberta Paterson,,/m/0g8ngmj,,/m/0g8ngmm


### plot summaries

In [524]:
plot_columns = ['WikipediaMovieID', 'PlotSummary']

plot_summaries = pd.read_csv(DATA_PATH+'plot_summaries.txt', sep='\t', names=plot_columns)

plot_summaries


Unnamed: 0,WikipediaMovieID,PlotSummary
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...
...,...,...
42298,34808485,"The story is about Reema , a young Muslim scho..."
42299,1096473,"In 1928 Hollywood, director Leo Andreyev look..."
42300,35102018,American Luthier focuses on Randy Parsons’ tra...
42301,8628195,"Abdur Rehman Khan , a middle-aged dry fruit se..."


### name clusters

In [525]:
name_clusters_columns = ['CharacterName', 'FreebaseCharacterActorMapID']

name_clusters = pd.read_csv(DATA_PATH+'name.clusters.txt', sep='\t', names=name_clusters_columns)

name_clusters


Unnamed: 0,CharacterName,FreebaseCharacterActorMapID
0,Stuart Little,/m/0k3w9c
1,Stuart Little,/m/0k3wcx
2,Stuart Little,/m/0k3wbn
3,John Doe,/m/0jyg35
4,John Doe,/m/0k2_zn
...,...,...
2661,John Rolfe,/m/0k5_ql
2662,John Rolfe,/m/02vd6vs
2663,Elizabeth Swann,/m/0k1xvz
2664,Elizabeth Swann,/m/0k1x_d


### TV tropes clusters

In [526]:
import json

# The first column holds the trope (character type); the second holds a JSON record
tvtropes_columns = ['CharacterType', 'Details']
tvtropes_clusters = pd.read_csv(DATA_PATH+'tvtropes.clusters.txt', sep='\t', names=tvtropes_columns)

# Expand the JSON record of the second column into its own columns (char, movie, id, actor)
tvtropes_clusters['Details'] = tvtropes_clusters['Details'].apply(json.loads)
json_cols = pd.json_normalize(tvtropes_clusters['Details'])
tvtropes_clusters = tvtropes_clusters.drop(columns=['Details']).join(json_cols)

tvtropes_clusters

Unnamed: 0,CharacterType,char,movie,id,actor
0,absent_minded_professor,Professor Philip Brainard,Flubber,/m/0jy9q0,Robin Williams
1,absent_minded_professor,Professor Keenbean,Richie Rich,/m/02vchl3,Michael McShane
2,absent_minded_professor,Dr. Reinhardt Lane,The Shadow,/m/0k6fkc,Ian McKellen
3,absent_minded_professor,Dr. Harold Medford,Them!,/m/0k6_br,Edmund Gwenn
4,absent_minded_professor,Daniel Jackson,Stargate,/m/0k3rhh,James Spader
...,...,...,...,...,...
496,young_gun,Morgan Earp,Tombstone,/m/0k776f,Bill Paxton
497,young_gun,Colorado Ryan,Rio Bravo,/m/0k2kqg,Ricky Nelson
498,young_gun,Tom Sawyer,The League of Extraordinary Gentlemen,/m/0k5nsh,Shane West
499,young_gun,William H. 'Billy the Kid' Bonney,Young Guns II,/m/03lrjk0,Emilio Estevez


---

## Zaynab's part

---
## Lylia's part

---
## Ali's part

---
## Christian's part

Let's clean the additional IMDb datasets:
+ First, import the datasets.

In [527]:
imdb_names_raw = pd.read_csv('./data/IMdB/name.basics.tsv', sep='\t')

In [528]:
imdb_movies_raw = pd.read_csv('./data/IMdB/title.basics.tsv', sep='\t')



In [529]:
imdb_ratings_raw = pd.read_csv('./data/IMdB/title.ratings.tsv', sep='\t')

In [530]:
print("Initial shape of IMdB Names:",imdb_names_raw.shape)
print("Initial shape of IMdB Movies:",imdb_movies_raw.shape)
print("Initial shape of IMdB Ratings:",imdb_ratings_raw.shape)

Initial shape of IMdB Names: (13933215, 6)
Initial shape of IMdB Movies: (11226097, 9)
Initial shape of IMdB Ratings: (1496500, 3)


Let's analyze the percentage of missing values in these datasets, keeping in mind that the IMDb files encode a missing value as the string `\N` rather than as NaN.

In [531]:
nan_value='\\N'
print("Names\n",(imdb_names_raw == nan_value).mean() * 100,"\n")
print("Movie\n",(imdb_movies_raw == nan_value).mean() * 100,"\n")
print("Ratings\n",(imdb_ratings_raw == nan_value).mean() * 100,"\n")

Names
 nconst                0.000000
primaryName           0.000366
birthYear            95.475222
deathYear            98.304275
primaryProfession    19.382921
knownForTitles       11.279845
dtype: float64 

Movie
 tconst             0.000000
titleType          0.000000
primaryTitle       0.000000
originalTitle      0.000000
isAdult            0.000009
startYear         12.624521
endYear           98.824863
runtimeMinutes    68.448634
genres             4.458549
dtype: float64 

Ratings
 tconst           0.0
averageRating    0.0
numVotes         0.0
dtype: float64 
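
Since IMDb encodes missing values as the literal string `\N`, one option is to let pandas convert them at load time through the `na_values` parameter of `pd.read_csv`, so that the usual `isna()` tooling works directly. The sketch below uses a tiny in-memory sample in place of the real files; the sample rows are illustrative and not taken from the dataset.

```python
import io
import pandas as pd

# Illustrative sample mimicking the IMDb TSV layout (not real data)
sample_tsv = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\n"
    "nm0000001\tPerson A\t1899\t1987\n"
    "nm0000002\tPerson B\t1934\t\\N\n"
    "nm0000003\tPerson C\t\\N\t\\N\n"
)

# na_values tells read_csv to treat the literal string \N as missing
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Percentage of missing values per column
nan_percentage = sample.isna().mean() * 100
print(nan_percentage)
```

With this approach the later `replace('\\N', np.nan)` calls would no longer be needed, and numeric columns such as `birthYear` are parsed as numbers instead of strings.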



In [532]:
imdb_movies = imdb_movies_raw[imdb_movies_raw['titleType']=='movie']
imdb_movies = imdb_movies.drop(columns=['titleType','primaryTitle','isAdult','endYear'])

imdb_movies.replace('\\N', np.nan, inplace=True)
imdb_movies = imdb_movies.dropna(subset=['originalTitle','startYear'])

imdb_movies.rename(columns={'originalTitle': 'MovieName',
                            'startYear':'YearOfRelease',
                            'runtimeMinutes':'Runtime',
                            'genres':'Genres'}, inplace=True)

imdb_movies['YearOfRelease'] = imdb_movies['YearOfRelease'].astype(int)
imdb_movies['Runtime'] = imdb_movies['Runtime'].astype(float).round(1)

imdb_movies

Unnamed: 0,tconst,MovieName,YearOfRelease,Runtime,Genres
8,tt0000009,Miss Jerry,1894,45.0,Romance
144,tt0000147,The Corbett-Fitzsimmons Fight,1897,100.0,"Documentary,News,Sport"
498,tt0000502,Bohemios,1905,100.0,
570,tt0000574,The Story of the Kelly Gang,1906,70.0,"Action,Adventure,Biography"
587,tt0000591,L'enfant prodigue,1907,90.0,Drama
...,...,...,...,...,...
11225988,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,57.0,Documentary
11226015,tt9916680,De la ilusión al desconcierto: cine colombiano...,2007,100.0,Documentary
11226027,tt9916706,Dankyavar Danka,2013,,Comedy
11226037,tt9916730,6 Gunn,2017,116.0,Drama


Year boundaries for the CMU and IMDb datasets

In [533]:
cmu_min_year = movie_metadata['YearOfRelease'].min()
cmu_max_year = movie_metadata['YearOfRelease'].max()
print("Movie Year Range in CMU dataset:",cmu_min_year, "-",cmu_max_year)

imdb_min_year = imdb_movies_raw[imdb_movies_raw['startYear'] != '\\N']['startYear'].min()
imdb_max_year = imdb_movies_raw[imdb_movies_raw['startYear'] != '\\N']['startYear'].max()
print("Movie Year Range in IMdB dataset:",imdb_min_year, "-",imdb_max_year)

Movie Year Range in CMU dataset: 1888 - 2016
Movie Year Range in IMdB dataset: 1874 - 2031


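Here we see a potential problem with the dates in the IMDb dataset: its year range is wider than the CMU one and extends to 2031, which lies in the future, so it presumably includes announced but unreleased titles. We therefore restrict the IMDb movies to the CMU year range.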

In [534]:
imdb_movies = imdb_movies[(imdb_movies['YearOfRelease'] >= cmu_min_year) & (imdb_movies['YearOfRelease'] <= cmu_max_year)]

print(f"{len(imdb_movies)} movies of the IMdB dataset are in the range {cmu_min_year}-{cmu_max_year}")


443160 movies of the IMdB dataset are in the range 1888-2016
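
The IMDb year bounds above are computed on `startYear` while it still appears to hold strings (including the `\N` markers), so `min()` and `max()` compare text rather than numbers. This happens to work when every year has four digits, but it is fragile. A minimal sketch, assuming a small made-up frame in place of `imdb_movies_raw`, that converts to numeric first with `pd.to_numeric`:

```python
import pandas as pd

# Illustrative stand-in for imdb_movies_raw (values are made up)
raw = pd.DataFrame({"startYear": ["1894", "\\N", "2031", "987", "2016"]})

# Text comparison: '987' sorts after '2031' because '9' > '2'
valid = raw.loc[raw["startYear"] != "\\N", "startYear"]
text_max = valid.max()

# Numeric comparison: non-numeric markers such as \N become NaN and are skipped
years = pd.to_numeric(raw["startYear"], errors="coerce")
numeric_min, numeric_max = int(years.min()), int(years.max())

print(text_max, numeric_min, numeric_max)
```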


We keep only the IMDb movies whose title and release year both appear in our CMU dataset

In [535]:
imdb_movies = imdb_movies[imdb_movies[['MovieName', 'YearOfRelease']].isin(movie_metadata[['MovieName', 'YearOfRelease']].to_dict(orient='list')).all(axis=1)]
imdb_movies = imdb_movies.reset_index(drop=True)

movie_percentage_matching = (len(imdb_movies) / movie_metadata.shape[0]) * 100
print(f"{len(imdb_movies)} or {movie_percentage_matching:.2f}% of CMU Movies found in the IMDb dataset")


63749 or 77.99% of CMU Movies found in the IMDb dataset
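
One caveat with the filter above: when `DataFrame.isin` receives a dict, it tests each column against its own list of values. A row therefore passes as soon as its title occurs anywhere among the CMU titles and its year occurs anywhere among the CMU years, even if that exact (title, year) pair does not exist in the CMU data. The sketch below, on made-up rows, contrasts this with an inner merge on both keys, which keeps only exact pairs. The frame names `cmu` and `imdb` are placeholders for `movie_metadata` and `imdb_movies`.

```python
import pandas as pd

# Made-up rows standing in for movie_metadata and imdb_movies
cmu = pd.DataFrame({"MovieName": ["Alpha", "Beta"], "YearOfRelease": [1990, 2005]})
imdb = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "MovieName": ["Alpha", "Alpha", "Gamma"],
    "YearOfRelease": [1990, 2005, 2005],
})

keys = ["MovieName", "YearOfRelease"]

# Column-wise membership: 'Alpha' (2005) passes although no such CMU pair exists
columnwise = imdb[imdb[keys].isin(cmu[keys].to_dict(orient="list")).all(axis=1)]

# Pair-wise match: inner merge on both keys keeps only exact (title, year) pairs
pairwise = imdb.merge(cmu[keys].drop_duplicates(), on=keys, how="inner")

# Share of CMU movies that have at least one IMDb match
matched_share = len(pairwise[keys].drop_duplicates()) / len(cmu) * 100

print(len(columnwise), len(pairwise), matched_share)
```

Counting distinct matched pairs, rather than IMDb rows, also keeps the reported percentage from being inflated when several IMDb entries share a title and year.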


In [536]:
imdb_movies

Unnamed: 0,tconst,MovieName,YearOfRelease,Runtime,Genres
0,tt0000009,Miss Jerry,1894,45.0,Romance
1,tt0000147,The Corbett-Fitzsimmons Fight,1897,100.0,"Documentary,News,Sport"
2,tt0000574,The Story of the Kelly Gang,1906,70.0,"Action,Adventure,Biography"
3,tt0000615,Robbery Under Arms,1907,,Drama
4,tt0000679,The Fairylogue and Radio-Plays,1908,120.0,"Adventure,Fantasy"
...,...,...,...,...,...
63744,tt9875120,Frostbite,2010,90.0,Documentary
63745,tt9881364,Gaja,2008,152.0,"Action,Comedy,Romance"
63746,tt9884086,Flashback,2009,80.0,Thriller
63747,tt9890124,Zindagi,2016,97.0,Drama


**CLEANING IMDB NAMES**

In [537]:
imdb_names_raw

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"actor,miscellaneous,producer","tt0050419,tt0072308,tt0053137,tt0027125"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack,archive_footage","tt0037382,tt0075213,tt0117057,tt0038355"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,music_department,producer","tt0057345,tt0049189,tt0056404,tt0054452"
3,nm0000004,John Belushi,1949,1982,"actor,writer,music_department","tt0072562,tt0077975,tt0080455,tt0078723"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0050986,tt0069467,tt0083922,tt0050976"
...,...,...,...,...,...,...
13933210,nm9993714,Romeo del Rosario,\N,\N,"animation_department,art_department","tt11657662,tt14069590,tt2455546"
13933211,nm9993716,Essias Loberg,\N,\N,\N,\N
13933212,nm9993717,Harikrishnan Rajan,\N,\N,cinematographer,tt8736744
13933213,nm9993718,Aayush Nair,\N,\N,cinematographer,tt8736744


In [538]:
character_metadata.head(0)

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,ReleaseDate,CharacterName,ActorDOB,ActorGender,ActorHeight,ActorEthnicity,ActorName,ActorAgeAtRelease,FreebaseCharacterActorMapID,FreebaseCharacterID,FreebaseActorID


In [539]:
character_metadata.shape[0]

450669

In [540]:
imdb_names = imdb_names_raw.replace('\\N', np.nan)

imdb_names.rename(columns={'primaryName': 'ActorName'}, inplace=True)

imdb_names.head(3)

Unnamed: 0,nconst,ActorName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987.0,"actor,miscellaneous,producer","tt0050419,tt0072308,tt0053137,tt0027125"
1,nm0000002,Lauren Bacall,1924,2014.0,"actress,soundtrack,archive_footage","tt0037382,tt0075213,tt0117057,tt0038355"
2,nm0000003,Brigitte Bardot,1934,,"actress,music_department,producer","tt0057345,tt0049189,tt0056404,tt0054452"


In [541]:
cmu_actor_names = character_metadata['ActorName'].unique()
imdb_actor_names = imdb_names['ActorName'].unique()

common_actors = set(imdb_actor_names) & set(cmu_actor_names)
percentage_common_actors = (len(common_actors) / len(cmu_actor_names)) * 100

print(f"{len(cmu_actor_names)} actors found in CMU dataset")
print(f"{len(imdb_actor_names)} actors found in IMdB dataset\n")
print(f"{percentage_common_actors:.2f} % of actors in CMU dataset found in IMdB dataset")


134079 actors found in CMU dataset
10701282 actors found in IMdB dataset

90.49 % of actors in CMU dataset found in IMdB dataset


Removing actors that are not in our character dataset

In [542]:
imdb_names = imdb_names[imdb_names['ActorName'].isin(common_actors)]

Comparing the NaN percentage before and after cleaning

In [543]:
print("Raw Names\n",(imdb_names_raw == nan_value).mean() * 100,"\n")
print("Filtered Names\n",imdb_names.isna().mean() * 100,"\n")

Raw Names
 nconst                0.000000
primaryName           0.000366
birthYear            95.475222
deathYear            98.304275
primaryProfession    19.382921
knownForTitles       11.279845
dtype: float64 

Filtered Names
 nconst                0.000000
ActorName             0.012420
birthYear            83.951111
deathYear            92.970217
primaryProfession    17.694027
knownForTitles       10.435841
dtype: float64 



Drop the mostly empty `birthYear` and `deathYear` columns, and remove rows with a missing `ActorName`

In [544]:
imdb_names = imdb_names.drop(columns=['birthYear','deathYear'])
imdb_names = imdb_names.dropna(subset=['ActorName'])
imdb_names

Unnamed: 0,nconst,ActorName,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,"actor,miscellaneous,producer","tt0050419,tt0072308,tt0053137,tt0027125"
1,nm0000002,Lauren Bacall,"actress,soundtrack,archive_footage","tt0037382,tt0075213,tt0117057,tt0038355"
2,nm0000003,Brigitte Bardot,"actress,music_department,producer","tt0057345,tt0049189,tt0056404,tt0054452"
3,nm0000004,John Belushi,"actor,writer,music_department","tt0072562,tt0077975,tt0080455,tt0078723"
4,nm0000005,Ingmar Bergman,"writer,director,actor","tt0050986,tt0069467,tt0083922,tt0050976"
...,...,...,...,...
13933058,nm9993544,Matthew Davis,miscellaneous,tt5700176
13933060,nm9993546,Matt James,miscellaneous,tt5700176
13933141,nm9993639,Nyla,actress,tt4862524
13933160,nm9993659,David King,,tt20881070


In [545]:
imdb_names['ActorName'].value_counts()

ActorName
Alex              527
Michael Smith     446
Michael           436
Chris             435
Chris Smith       370
                 ... 
Julian Rivero       1
Stella Dassas       1
Dante Rivero        1
Jérôme Dassier      1
Jim McKrell         1
Name: count, Length: 121322, dtype: int64

Here we can see that some actor names are not usable because they consist of a single word (for example `Alex` or `Michael`) and lack a family name. We will thus remove any row whose `ActorName` does not start with at least two words.

In [546]:
imdb_names = imdb_names[imdb_names['ActorName'].str.contains(r'^\w+\s\w+', na=False)]

In [547]:
imdb_names = imdb_names.reset_index(drop=True)

imdb_names['ActorName'].value_counts()

ActorName
Michael Smith          446
Chris Smith            370
David Brown            363
Chris Johnson          362
John Smith             355
                      ... 
Hans Man in 't Veld      1
Ricardo Mamood-Vega      1
Pyotr Mamonov            1
Robert Mammone           1
Jim McKrell              1
Name: count, Length: 117065, dtype: int64
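
Note that the pattern `^\w+\s\w+` requires the first token to consist only of word characters, so names that begin with an initial or a hyphenated first name would also be dropped. A possible alternative, sketched here on illustrative names rather than on the dataset, is to count whitespace-separated tokens instead.

```python
import pandas as pd

# Illustrative names, including a missing value
names = pd.Series(["Alex", "Michael Smith", "J. K. Simmons", "Mary-Kate Olsen", None])

# Original pattern: the first token must be made of word characters only
regex_mask = names.str.contains(r"^\w+\s\w+", na=False)

# Alternative: keep names with at least two whitespace-separated tokens
token_mask = names.str.split().str.len().ge(2).fillna(False)

print(names[regex_mask].tolist())
print(names[token_mask].tolist())
```

Whether names with initials should be kept depends on how they are written in the CMU character data, so this is a design choice rather than a strict fix.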

**CLEANING IMDB RATINGS**

In [548]:
imdb_ratings_raw

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2098
1,tt0000002,5.6,282
2,tt0000003,6.5,2117
3,tt0000004,5.4,182
4,tt0000005,6.2,2848
...,...,...,...
1496495,tt9916730,7.0,12
1496496,tt9916766,7.1,24
1496497,tt9916778,7.2,37
1496498,tt9916840,6.9,11


In [549]:
imdb_ratings = imdb_ratings_raw[imdb_ratings_raw['tconst'].isin(imdb_movies['tconst'])]

print(f"{len(imdb_ratings)} or {100*len(imdb_ratings)/len(movie_metadata):.2f} % of films in the CMU dataset have a ratings in the IMdB dataset")


51891 or 63.48 % of films in the CMU dataset have a ratings in the IMdB dataset


Now let's merge the ratings into our CMU movie_metadata dataset

In [584]:
imdb_merged = imdb_movies.merge(imdb_ratings, on='tconst', how='inner')
imdb_merged = imdb_merged.drop(columns=['Runtime','Genres'])
imdb_merged

Unnamed: 0,tconst,MovieName,YearOfRelease,averageRating,numVotes
0,tt0000009,Miss Jerry,1894,5.4,215
1,tt0000147,The Corbett-Fitzsimmons Fight,1897,5.2,539
2,tt0000574,The Story of the Kelly Gang,1906,6.0,938
3,tt0000615,Robbery Under Arms,1907,4.3,27
4,tt0000679,The Fairylogue and Radio-Plays,1908,5.2,76
...,...,...,...,...,...
51886,tt9805754,Double Trouble,2013,7.0,6
51887,tt9807870,Masquerade,2001,7.0,5
51888,tt9815072,Pontianak Menjerit,2005,6.9,12
51889,tt9855214,Kisan,2006,4.9,11


In [577]:
movie_metadata

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres,YearOfRelease
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","Thriller,Science Fiction,Horror,Adventure,Supe...",2001
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","Mystery,Biographical film,Drama,Crime Drama",2000
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","Crime Fiction,Drama",1988
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","Thriller,Erotic thriller,Psychological thriller",1987
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}",Drama,1983
...,...,...,...,...,...,...,...,...,...,...
81736,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}",Drama,2011
81737,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","Biographical film,Drama,Documentary",2011
81738,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","Satire,Comedy",1972
81739,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","Science Fiction,Japanese Movies,Adventure,Anim...",1992


In [None]:
movie_metadata_merged = movie_metadata.merge(imdb_merged[['MovieName', 'Genres']], on='MovieName', suffixes=('_meta', '_merged'))

movie_metadata_merged['Genres'] = movie_metadata_merged.apply(
    lambda x: ', '.join(set(
        [genre.replace(' ', ',') for genre in x['Genres_meta'].split(', ')] if isinstance(x['Genres_meta'], str) else [] +
        [genre.replace(' ', ',') for genre in x['Genres_merged'].split(', ')] if isinstance(x['Genres_merged'], str) else []
    )), axis=1
)

# Drop the extra genre columns after combining
movie_metadata_merged = movie_metadata_merged.drop(columns=['Genres_meta', 'Genres_merged'])

movie_metadata_merged['Genres'].value_counts().head()


Genres
Drama          6174
Comedy         1577
               1249
Documentary    1042
Comedy,film     921
Name: count, dtype: int64
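
The combined genres above include values such as `Comedy,film` and an empty string, which suggests the merge expression does not behave as intended. Three things seem to contribute. First, a conditional expression binds more loosely than `+`, so `a if c1 else [] + b if c2 else []` is read as `a if c1 else ([] + b if c2 else [])` and the second list is ignored whenever the first value is a string. Second, the genre strings were joined with `,` earlier, so `split(', ')` does not separate them. Third, `replace(' ', ',')` breaks multi-word genres apart. The cell also reads `Genres` from `imdb_merged`, from which that column is dropped in an earlier cell, so it depends on execution order. A minimal sketch of a set union under those observations, using a hypothetical helper `combine_genres` and made-up rows:

```python
import pandas as pd

def combine_genres(meta, merged):
    """Union of two comma-separated genre strings, ignoring missing values."""
    genres = set()
    for value in (meta, merged):
        if isinstance(value, str):
            # Split on commas and trim whitespace; skip empty fragments
            genres.update(g.strip() for g in value.split(",") if g.strip())
    return ",".join(sorted(genres))

# Illustrative rows (not taken from the dataset)
df = pd.DataFrame({
    "Genres_meta": ["Comedy film,Drama", "", float("nan")],
    "Genres_merged": ["Drama,Romance", "Documentary", float("nan")],
})
df["Genres"] = [
    combine_genres(a, b) for a, b in zip(df["Genres_meta"], df["Genres_merged"])
]
print(df["Genres"].tolist())
```

Sorting the set makes the combined string deterministic, so identical genre sets are counted together by `value_counts()`.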

In [590]:
movie_metadata = movie_metadata.merge(
    imdb_merged[['MovieName', 'YearOfRelease', 'averageRating', 'numVotes']],
    on=['MovieName', 'YearOfRelease'],
    how='left'
)

movie_metadata

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres,YearOfRelease,averageRating,numVotes
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","Thriller,Science Fiction,Horror,Adventure,Supe...",2001,4.9,58876.0
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","Mystery,Biographical film,Drama,Crime Drama",2000,,
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","Crime Fiction,Drama",1988,5.6,42.0
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","Thriller,Erotic thriller,Psychological thriller",1987,,
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}",Drama,1983,,
...,...,...,...,...,...,...,...,...,...,...,...,...
81959,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}",Drama,2011,,
81960,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","Biographical film,Drama,Documentary",2011,6.8,3253.0
81961,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","Satire,Comedy",1972,5.9,117.0
81962,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","Science Fiction,Japanese Movies,Adventure,Anim...",1992,,
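
The merged table has more rows than the original `movie_metadata` (the index now ends at 81962 instead of 81739), which is what a left merge produces when several IMDb rows share the same title and year. A minimal sketch, on made-up rows, of collapsing the right-hand side to one row per key before merging. Keeping the entry with the most votes is an assumption, not something the notebook specifies. `validate='many_to_one'` makes pandas raise an error if the right-hand keys are still duplicated.

```python
import pandas as pd

# Made-up rows standing in for movie_metadata and imdb_merged
left = pd.DataFrame({"MovieName": ["Alpha", "Beta"], "YearOfRelease": [1990, 2005]})
right = pd.DataFrame({
    "MovieName": ["Alpha", "Alpha"],
    "YearOfRelease": [1990, 1990],
    "averageRating": [6.1, 7.4],
    "numVotes": [120, 15],
})
keys = ["MovieName", "YearOfRelease"]

# A plain left merge duplicates 'Alpha' because two right rows share its key
naive = left.merge(right, on=keys, how="left")

# Keep one row per key (assumption: prefer the entry with the most votes)
right_unique = (
    right.sort_values("numVotes", ascending=False)
         .drop_duplicates(subset=keys, keep="first")
)

# validate raises MergeError if the right-hand keys are not unique
deduped = left.merge(right_unique, on=keys, how="left", validate="many_to_one")

print(len(naive), len(deduped))
```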


---
## Yassin's part