# CMU Movies Summary Corpus

- Team: ArabsDataScientists2024
- Authors: Zaynab, Lylia, Ali, Christian, Yassin

---

## Tasks

1. **Select Project & Initial Analyses**:
   1. Agree on a project proposal with team members.
   2. Perform initial analyses to verify feasibility of the proposed project, including any additional data.
   3. Acquaint yourself with the provided data, preprocess it, and perform descriptive statistics.

2. **Pipeline & Data Description**:
   1. Create a pipeline for data handling and preprocessing, documented in the notebook.
   2. Describe the relevant aspects of the data, including:
      1. Handling the size of the data.
      2. Understanding the data (formats, distributions, missing values, correlations, etc.).
      3. Considering data enrichment, filtering, and transformation according to project needs.
   3. Develop a plan for methods to be used, with essential mathematical details.
   4. Outline a plan for analysis and communication, discussing alternative approaches considered.

3. **GitHub Repository & Deliverables**:
   1. Create a public GitHub repository named `ada-2023-project-<team>` under the `epfl-ada` GitHub organization. ✅
   2. Ensure the repository contains:
      1. **README.md** file with:
         1. **Title**: Project title.
         2. **Abstract**: 150-word description of the project idea, goals, and motivation.
         3. **Research Questions**: List of research questions to address.
         4. **Proposed Additional Datasets**: Description of additional datasets, expected management, and feasibility analysis.
         5. **Methods**: Methods to be used in the project.
         6. **Proposed Timeline**: Timeline for the project.
         7. **Organization within the Team**: Internal milestones leading to Milestone P3.
         8. **Questions for TAs (optional)**: Any questions for the teaching assistants.
      2. **Code for Initial Analyses**: Structured code for initial analyses and data handling pipelines.
      3. **Notebook** presenting initial results, including:
         1. Main results and descriptive analysis.
         2. External scripts/modules for implementing core logic, to be called from the notebook.

---


## Table of Contents
- [1. Zaynab's part](#zaynabs-part)
- [2. Lylia's part](#lylias-part)
- [3. Ali's part](#alis-part)
- [4. Christian's part](#christians-part)
- [5. Yassin's part](#yassins-part)

---

### Library imports

In [2]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import kagglehub

### Data import

In [3]:
DATA_PATH='./data/MovieSummaries/'

### movie metadata

In [4]:
movie_columns = [
    'WikipediaMovieID', 'FreebaseMovieID', 'MovieName', 'ReleaseDate', 
    'BoxOfficeRevenue', 'Runtime', 'Languages', 'Countries', 'Genres'
]


movie_metadata = pd.read_csv(DATA_PATH+'movie.metadata.tsv', sep='\t', names=movie_columns)

movie_metadata

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"
...,...,...,...,...,...,...,...,...,...
81736,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}"
81737,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0..."
81738,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}"
81739,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ..."


### character metadata

In [5]:
character_columns = [
    'WikipediaMovieID', 'FreebaseMovieID', 'ReleaseDate', 'CharacterName',
    'ActorDOB', 'ActorGender', 'ActorHeight', 'ActorEthnicity', 
    'ActorName', 'ActorAgeAtRelease', 'FreebaseCharacterActorMapID',
    'FreebaseCharacterID', 'FreebaseActorID'
]

character_metadata = pd.read_csv(DATA_PATH+'character.metadata.tsv', sep='\t', names=character_columns)


character_metadata

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,ReleaseDate,CharacterName,ActorDOB,ActorGender,ActorHeight,ActorEthnicity,ActorName,ActorAgeAtRelease,FreebaseCharacterActorMapID,FreebaseCharacterID,FreebaseActorID
0,975900,/m/03vyhn,2001-08-24,Akooshay,1958-08-26,F,1.620,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
1,975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.780,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,/m/03vyhn,2001-08-24,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
3,975900,/m/03vyhn,2001-08-24,Sgt Jericho Butler,1967-09-12,M,1.750,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
4,975900,/m/03vyhn,2001-08-24,Bashira Kincaid,1977-09-25,F,1.650,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg
...,...,...,...,...,...,...,...,...,...,...,...,...,...
450664,913762,/m/03pcrp,1992-05-21,Elensh,1970-05,F,,,Dorothy Elias-Fahn,,/m/0kr406c,/m/0kr406h,/m/0b_vcv
450665,913762,/m/03pcrp,1992-05-21,Hibiki,1965-04-12,M,,,Jonathan Fahn,27.0,/m/0kr405_,/m/0kr4090,/m/0bx7_j
450666,28308153,/m/0cp05t9,1957,,1941-11-18,M,1.730,/m/02w7gg,David Hemmings,15.0,/m/0g8ngmc,,/m/022g44
450667,28308153,/m/0cp05t9,1957,,,,,,Roberta Paterson,,/m/0g8ngmj,,/m/0g8ngmm


### plot summaries

In [6]:
plot_columns = ['WikipediaMovieID', 'PlotSummary']

plot_summaries = pd.read_csv(DATA_PATH+'plot_summaries.txt', sep='\t', names=plot_columns)

plot_summaries


Unnamed: 0,WikipediaMovieID,PlotSummary
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...
...,...,...
42298,34808485,"The story is about Reema , a young Muslim scho..."
42299,1096473,"In 1928 Hollywood, director Leo Andreyev look..."
42300,35102018,American Luthier focuses on Randy Parsons’ tra...
42301,8628195,"Abdur Rehman Khan , a middle-aged dry fruit se..."


### name clusters

In [7]:
# In the file, the character name comes first and the Freebase map ID second
name_clusters_columns = ['CharacterName', 'FreebaseCharacterActorMapID']

name_clusters = pd.read_csv(DATA_PATH+'name.clusters.txt', sep='\t', names=name_clusters_columns)

name_clusters


Unnamed: 0,CharacterName,FreebaseCharacterActorMapID
0,Stuart Little,/m/0k3w9c
1,Stuart Little,/m/0k3wcx
2,Stuart Little,/m/0k3wbn
3,John Doe,/m/0jyg35
4,John Doe,/m/0k2_zn
...,...,...
2661,John Rolfe,/m/0k5_ql
2662,John Rolfe,/m/02vd6vs
2663,Elizabeth Swann,/m/0k1xvz
2664,Elizabeth Swann,/m/0k1x_d


### TV tropes clusters

In [8]:
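# First column is the trope (character type); second is a JSON string with details.
# 'CharacterDetails' is our own label for the JSON column.
tvtropes_columns = ['CharacterType', 'CharacterDetails']

tvtropes_clusters = pd.read_csv(DATA_PATH+'tvtropes.clusters.txt', sep='\t', names=tvtropes_columns)

tvtropes_clusters


Unnamed: 0,CharacterType,CharacterDetails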
0,absent_minded_professor,"{""char"": ""Professor Philip Brainard"", ""movie"":..."
1,absent_minded_professor,"{""char"": ""Professor Keenbean"", ""movie"": ""Richi..."
2,absent_minded_professor,"{""char"": ""Dr. Reinhardt Lane"", ""movie"": ""The S..."
3,absent_minded_professor,"{""char"": ""Dr. Harold Medford"", ""movie"": ""Them!..."
4,absent_minded_professor,"{""char"": ""Daniel Jackson"", ""movie"": ""Stargate""..."
...,...,...
496,young_gun,"{""char"": ""Morgan Earp"", ""movie"": ""Tombstone"", ..."
497,young_gun,"{""char"": ""Colorado Ryan"", ""movie"": ""Rio Bravo""..."
498,young_gun,"{""char"": ""Tom Sawyer"", ""movie"": ""The League of..."
499,young_gun,"{""char"": ""William H. 'Billy the Kid' Bonney"", ..."
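
The second column of this table is a JSON string, so it can be expanded into regular columns. The sketch below uses toy rows copied from the preview above. The column names `CharacterType` and `Details` are illustrative, and only the `char` and `movie` keys are assumed, since the real output is truncated.

```python
import json
import pandas as pd

# Toy rows based on the preview above; only the visible JSON keys are used.
toy_tropes = pd.DataFrame({
    "CharacterType": ["absent_minded_professor", "young_gun"],
    "Details": [
        '{"char": "Daniel Jackson", "movie": "Stargate"}',
        '{"char": "Morgan Earp", "movie": "Tombstone"}',
    ],
})

# Parse each JSON string into a dict, then spread the keys into columns.
parsed = toy_tropes["Details"].apply(json.loads)
expanded = pd.concat(
    [toy_tropes[["CharacterType"]], pd.json_normalize(parsed.tolist())],
    axis=1,
)
print(expanded)
```

The same pattern should carry over to the full file, where any extra JSON keys would simply become extra columns.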


---

## Zaynab's part

---
## Lylia's part

---
## Ali's part

---
## Christian's part

---
## Yassin's part

### Sites with additional data

- Dataset with physique: https://arxiv.org/html/2407.03486v1
- CelebA: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
- Celebrity face image dataset (Kaggle): https://www.kaggle.com/datasets/vishesh1412/celebrity-face-image-dataset
- Celebrity handout (UCLA Stat 10): http://www.stat.ucla.edu/~vlew/stat10/archival/SP01/handouts/celeb.html
- IMDb: https://datasets.imdbws.com/
- Box Office Mojo (Kaggle): https://www.kaggle.com/datasets/eliasdabbas/boxofficemojo-alltime-domestic-data


In [49]:
movie_metadata['year']=movie_metadata['ReleaseDate'].str.extract(r'(\d{4})')
movie_metadata['year'] = movie_metadata['year'].astype(float)
print(len(movie_metadata))
print(movie_metadata['year'].isna().sum())

81741
6902


In [50]:
movie_metadata['year'].describe()

count    74839.000000
mean      1977.476530
std         29.101536
min       1010.000000
25%       1956.000000
50%       1985.000000
75%       2004.000000
max       2016.000000
Name: year, dtype: float64

In [None]:
rows_with_1010 = movie_metadata[movie_metadata['year'] == 1010]

print(rows_with_1010)
print(rows_with_1010.index)

       WikipediaMovieID FreebaseMovieID       MovieName ReleaseDate  \
62836          29666067      /m/0fphzrf  Hunting Season  1010-12-02   

       BoxOfficeRevenue  Runtime  \
62836        12160978.0    140.0   

                                               Languages  \
62836  {"/m/02hwyss": "Turkish Language", "/m/02h40lc...   

                     Countries  \
62836  {"/m/01znc_": "Turkey"}   

                                                  Genres    year  
62836  {"/m/0lsxr": "Crime Fiction", "/m/02n4kr": "My...  1010.0  
Index([62836], dtype='int64')


In [53]:
movie_metadata.at[62836, 'year'] = 2010 

This corrects a data-entry error: the release date of this movie was recorded as 1010 when it should be 2010.
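
The fix above relies on a hard-coded row index. A minimal sketch of an index-free variant follows, on toy data. The cutoff `EARLIEST_PLAUSIBLE_YEAR = 1888` is an assumption taken from the minimum year this notebook reports after the correction.

```python
import pandas as pd

# Toy release dates, including the faulty 1010 entry and a missing value.
toy = pd.DataFrame({"ReleaseDate": ["2001-08-24", "1988", "1010-12-02", None]})

# Extract the four-digit year and convert it to a numeric column.
toy["year"] = pd.to_numeric(
    toy["ReleaseDate"].str.extract(r"(\d{4})", expand=False), errors="coerce"
)

# Assumed cutoff: flag any year earlier than this for manual inspection.
EARLIEST_PLAUSIBLE_YEAR = 1888
suspicious = toy[toy["year"] < EARLIEST_PLAUSIBLE_YEAR]
print(suspicious)

# Apply the correction by value rather than by row index.
toy.loc[toy["year"] == 1010, "year"] = 2010
```

Selecting by value keeps the correction valid even if the row order changes, for example after a re-download or a merge.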

In [54]:
print(movie_metadata['year'].describe())

count    74839.000000
mean      1977.489892
std         28.886090
min       1888.000000
25%       1956.000000
50%       1985.000000
75%       2004.000000
max       2016.000000
Name: year, dtype: float64


The earliest movie in our dataset was released in 1888 and the latest in 2016.

In [55]:
movie_metadata['BoxOfficeRevenue'].isna().sum()/len(movie_metadata['BoxOfficeRevenue'])

0.8972241592346558

In [56]:
character_metadata['ActorEthnicity'].isna().sum()/len(character_metadata['ActorEthnicity']) 

0.7646654196317031

In [57]:
character_metadata['ActorHeight'].isna().sum()/len(character_metadata['ActorHeight'])

0.6564573999986687
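
The three ratios above can also be computed for every column at once, since the mean of a boolean mask is the share of `True` values. The toy frame below is illustrative only and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy character table with deliberately missing values.
toy_characters = pd.DataFrame({
    "ActorName": ["A", "B", "C", "D"],
    "ActorHeight": [1.62, np.nan, np.nan, 1.75],
    "ActorEthnicity": [np.nan, np.nan, np.nan, "/m/0x67"],
})

# isna() gives a boolean frame; mean() per column is the missing share.
missing_share = toy_characters.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `movie_metadata` or `character_metadata`, this would give one sorted overview of missingness instead of one cell per column.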

In [58]:
movie_metadata['Countries'].unique()

array(['{"/m/09c7w0": "United States of America"}',
       '{"/m/05b4w": "Norway"}', '{"/m/07ssc": "United Kingdom"}', ...,
       '{"/m/0f8l9c": "France", "/m/06mzp": "Switzerland", "/m/0h3y": "Algeria", "/m/0345h": "Germany"}',
       '{"/m/014tss": "Kingdom of Great Britain", "/m/03_3d": "Japan", "/m/02jx1": "England", "/m/07ssc": "United Kingdom", "/m/0345h": "Germany"}',
       '{"/m/06mzp": "Switzerland", "/m/03rjj": "Italy", "/m/082fr": "West Germany", "/m/03f2w": "German Democratic Republic"}'],
      dtype=object)

In [59]:
movie_metadata['Languages'].unique()

array(['{"/m/02h40lc": "English Language"}',
       '{"/m/05f_3": "Norwegian Language"}',
       '{"/m/04306rv": "German Language"}', ...,
       '{"/m/03k50": "Hindi Language", "/m/064r7fk": "Standard Tibetan", "/m/02h40lc": "English Language"}',
       '{"/m/06nm1": "Spanish Language", "/m/04306rv": "German Language", "/m/02h40lc": "English Language", "/m/02ztjwg": "Hungarian language"}',
       '{"/m/02bjrlw": "Italian Language", "/m/02h40lc": "English Language", "/m/05f_3": "Norwegian Language"}'],
      dtype=object)
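
`Countries`, `Languages` and `Genres` are stored as JSON dictionaries mapping Freebase IDs to names, so `unique()` counts combinations rather than individual values. The sketch below shows one possible way to count individual countries. It uses three strings copied from the outputs above, and the helper `dict_string_to_names` and the column `CountryList` are hypothetical names.

```python
import json
import pandas as pd

# Toy rows using cell values that appear in the outputs above.
toy_movies = pd.DataFrame({
    "Countries": [
        '{"/m/09c7w0": "United States of America"}',
        '{"/m/0345h": "Germany"}',
        '{"/m/0f8l9c": "France", "/m/06mzp": "Switzerland", '
        '"/m/0h3y": "Algeria", "/m/0345h": "Germany"}',
    ],
})

def dict_string_to_names(cell):
    """Return the names stored in a JSON dict string (empty list if missing)."""
    if pd.isna(cell):
        return []
    return list(json.loads(cell).values())

toy_movies["CountryList"] = toy_movies["Countries"].apply(dict_string_to_names)

# explode() yields one row per (movie, country) pair, which makes counting easy.
country_counts = toy_movies.explode("CountryList")["CountryList"].value_counts()
print(country_counts)
```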

In [None]:
# # Download latest version
# path = kagglehub.dataset_download("stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset")

# print("Path to dataset files:", path)
# path = kagglehub.dataset_download("patkle/metacritic-scores-for-games-movies-tv-and-music")

# print("Path to dataset files:", path)
# path = kagglehub.dataset_download("eliasdabbas/boxofficemojo-alltime-domestic-data")

# print("Path to dataset files:", path)

# path = kagglehub.dataset_download("martinmraz07/oscar-movies")

# print("Path to dataset files:", path)

# path = kagglehub.dataset_download("unanimad/the-oscar-award")

# print("Path to dataset files:", path)

In [20]:
ratings_IMDB=pd.read_csv('data/IMDB/title.ratings.tsv.gz', sep='\t')
ratings_IMDB

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2096
1,tt0000002,5.6,282
2,tt0000003,6.5,2115
3,tt0000004,5.4,182
4,tt0000005,6.2,2845
...,...,...,...
1495756,tt9916730,7.0,12
1495757,tt9916766,7.1,24
1495758,tt9916778,7.2,37
1495759,tt9916840,6.9,11


In [27]:
akas_IMDB=pd.read_csv('data/IMDB/title.akas.tsv.gz', sep='\t')

akas_IMDB

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Carmencita,\N,\N,original,\N,1
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita,US,\N,imdbDisplay,\N,0
3,tt0000001,4,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
4,tt0000001,5,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
...,...,...,...,...,...,...,...,...
50301474,tt9916852,7,エピソード #3.20,JP,ja,\N,\N,0
50301475,tt9916852,8,Episodio #3.20,ES,es,\N,\N,0
50301476,tt9916856,1,The Wind,\N,\N,original,\N,1
50301477,tt9916856,2,The Wind,DE,\N,imdbDisplay,\N,0
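
The IMDb files mark missing values with the literal string `\N`, which pandas does not treat as missing by default. A minimal sketch of reading such a file with `na_values` follows, using a small in-memory sample instead of the real `title.akas.tsv.gz`.

```python
import io
import pandas as pd

# Small in-memory sample in the same layout as the first akas rows above.
sample_tsv = (
    "titleId\tordering\ttitle\tregion\tlanguage\n"
    "tt0000001\t1\tCarmencita\t\\N\t\\N\n"
    "tt0000001\t2\tCarmencita\tDE\t\\N\n"
)

# na_values tells pandas to read the literal \N marker as NaN.
akas_sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")
print(akas_sample.isna().sum())
```

Passing the same `na_values` argument to the `pd.read_csv` calls on the IMDb files should make `isna()` statistics meaningful for them as well.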


In [28]:
characters_IMDB=pd.read_csv('data/IMDB/name.basics.tsv.gz', sep='\t')

characters_IMDB

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"actor,miscellaneous,producer","tt0050419,tt0072308,tt0053137,tt0027125"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack,archive_footage","tt0037382,tt0075213,tt0117057,tt0038355"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,music_department,producer","tt0057345,tt0049189,tt0056404,tt0054452"
3,nm0000004,John Belushi,1949,1982,"actor,writer,music_department","tt0072562,tt0077975,tt0080455,tt0078723"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0050986,tt0069467,tt0083922,tt0050976"
...,...,...,...,...,...,...
13926614,nm9993714,Romeo del Rosario,\N,\N,"animation_department,art_department","tt11657662,tt14069590,tt2455546"
13926615,nm9993716,Essias Loberg,\N,\N,\N,\N
13926616,nm9993717,Harikrishnan Rajan,\N,\N,cinematographer,tt8736744
13926617,nm9993718,Aayush Nair,\N,\N,cinematographer,tt8736744



In [30]:
title_basics_IMDB=pd.read_csv('data/IMDB/title.basics.tsv.gz', sep='\t')

title_basics_IMDB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Poor Pierrot,Pauvre Pierrot,0,1892,\N,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
...,...,...,...,...,...,...,...,...,...
11217428,tt9916848,tvEpisode,Episode #3.17,Episode #3.17,0,2009,\N,\N,"Action,Drama,Family"
11217429,tt9916850,tvEpisode,Episode #3.19,Episode #3.19,0,2010,\N,\N,"Action,Drama,Family"
11217430,tt9916852,tvEpisode,Episode #3.20,Episode #3.20,0,2010,\N,\N,"Action,Drama,Family"
11217431,tt9916856,short,The Wind,The Wind,0,2015,\N,27,Short


In [31]:
crew_IMDB=pd.read_csv('data/IMDB/title.crew.tsv.gz', sep='\t')

In [37]:
crew_IMDB

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N
...,...,...,...
10553131,tt9916848,nm1485677,"nm9187127,nm1485677,nm9826385,nm9299459,nm1628284"
10553132,tt9916850,nm1485677,"nm9187127,nm1485677,nm9826385,nm1628284"
10553133,tt9916852,nm1485677,"nm9187127,nm1485677,nm9826385,nm9299459,nm1628284"
10553134,tt9916856,nm10538645,nm6951431


In [33]:
episode_IMDB=pd.read_csv('data/IMDB/title.episode.tsv.gz', sep='\t')
principals_IMDB=pd.read_csv('data/IMDB/title.principals.tsv.gz', sep='\t')

In [38]:
episode_IMDB

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber
0,tt0031458,tt32857063,\N,\N
1,tt0041951,tt0041038,1,9
2,tt0042816,tt0989125,1,17
3,tt0042889,tt0989125,\N,\N
4,tt0043426,tt0040051,3,42
...,...,...,...,...
8614779,tt9916846,tt1289683,3,18
8614780,tt9916848,tt1289683,3,17
8614781,tt9916850,tt1289683,3,19
8614782,tt9916852,tt1289683,3,20


In [39]:
principals_IMDB

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0005690,producer,producer,\N
3,tt0000001,4,nm0374658,cinematographer,director of photography,\N
4,tt0000002,1,nm0721526,director,\N,\N
...,...,...,...,...,...,...
88937148,tt9916880,17,nm0996406,director,principal director,\N
88937149,tt9916880,18,nm1482639,writer,\N,\N
88937150,tt9916880,19,nm2586970,writer,books,\N
88937151,tt9916880,20,nm1594058,producer,producer,\N


In [44]:
character_metadata['ActorName'].value_counts()

ActorName
Mel Blanc             791
Mithun Chakraborty    328
Oliver Hardy          299
Mohanlal              234
Moe Howard            225
                     ... 
Jan van der Horst       1
Irma Lozin              1
Johan Gildemeijer       1
Jan Feith               1
Roberta Paterson        1
Name: count, Length: 134078, dtype: int64

In [63]:
character_metadata['year']=character_metadata['ReleaseDate'].str.extract(r'(\d{4})')
character_metadata['year'] = character_metadata['year'].astype(float)
print(len(character_metadata))
print(character_metadata['year'].isna().sum())

450669
9995


In [65]:
rows_character_1010 =character_metadata[character_metadata['year']==1010]
print(rows_character_1010)

       WikipediaMovieID FreebaseMovieID ReleaseDate   CharacterName  \
67624          29666067      /m/0fphzrf  1010-12-02         Kamuran   
67625          29666067      /m/0fphzrf  1010-12-02          Ferman   
67626          29666067      /m/0fphzrf  1010-12-02           Idris   
67627          29666067      /m/0fphzrf  1010-12-02           Hasan   
67628          29666067      /m/0fphzrf  1010-12-02          Battal   
67629          29666067      /m/0fphzrf  1010-12-02           Asiye   
67630          29666067      /m/0fphzrf  1010-12-02       Asit Omer   
67631          29666067      /m/0fphzrf  1010-12-02           Hatun   
67632          29666067      /m/0fphzrf  1010-12-02          Müslüm   
67633          29666067      /m/0fphzrf  1010-12-02      Murat Önes   
67634          29666067      /m/0fphzrf  1010-12-02           Hilal   
67635          29666067      /m/0fphzrf  1010-12-02         Cevriye   
67636          29666067      /m/0fphzrf  1010-12-02        Müzeyyen   
67637 

In [64]:
character_metadata['year'].describe()

count    440674.000000
mean       1984.489929
std          25.889522
min        1010.000000
25%        1969.000000
50%        1994.000000
75%        2005.000000
max        2016.000000
Name: year, dtype: float64

In [66]:
character_metadata.loc[character_metadata['year'] == 1010, 'year'] = 2010


In [67]:
character_metadata['year'].describe()

count    440674.000000
mean       1984.523968
std          25.257949
min        1888.000000
25%        1969.000000
50%        1994.000000
75%        2005.000000
max        2016.000000
Name: year, dtype: float64

In [68]:
rows_character_1010 =character_metadata[character_metadata['year']==1010]
print(rows_character_1010)

Empty DataFrame
Columns: [WikipediaMovieID, FreebaseMovieID, ReleaseDate, CharacterName, ActorDOB, ActorGender, ActorHeight, ActorEthnicity, ActorName, ActorAgeAtRelease, FreebaseCharacterActorMapID, FreebaseCharacterID, FreebaseActorID, year]
Index: []


In [69]:
print(movie_metadata['WikipediaMovieID'].isna().sum()/(len(movie_metadata['WikipediaMovieID'])))
print(movie_metadata['FreebaseMovieID'].isna().sum()/(len(movie_metadata['FreebaseMovieID'])))

0.0
0.0


### Understanding merges

In [22]:
import pandas as pd

data1 = {
  "name": ["Sally", "Mary", "John"],
  "age": [50, 40, 30]
}

data2 = {
  "name": ["Sally", "Peter", "Micky"],
  "age": [77, 44, 22]
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

newdf = df1.merge(df2, how='right', on='name')
newdf

Unnamed: 0,name,age_x,age_y
0,Sally,50.0,77
1,Peter,,44
2,Micky,,22


In [23]:
merged_data=pd.merge(movie_metadata,character_metadata,how='inner',on='WikipediaMovieID')
merged_data

Unnamed: 0,WikipediaMovieID,FreebaseMovieID_x,MovieName,ReleaseDate_x,BoxOfficeRevenue,Runtime,Languages,Countries,Genres,FreebaseMovieID_y,...,CharacterName,ActorDOB,ActorGender,ActorHeight,ActorEthnicity,ActorName,ActorAgeAtRelease,FreebaseCharacterActorMapID,FreebaseCharacterID,FreebaseActorID
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",/m/03vyhn,...,Akooshay,1958-08-26,F,1.620,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
1,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",/m/03vyhn,...,Lieutenant Melanie Ballard,1974-08-15,F,1.780,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",/m/03vyhn,...,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
3,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",/m/03vyhn,...,Sgt Jericho Butler,1967-09-12,M,1.750,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
4,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",/m/03vyhn,...,Bashira Kincaid,1977-09-25,F,1.650,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
450664,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",/m/02w7zz8,...,,,,,,Billy Morton,,/m/0gchkcy,,/m/0gc4lfm
450665,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",/m/02w7zz8,...,,1982-01-28,,,,Andrea Runge,19.0,/m/0gckh4f,,/m/0gbx_rk
450666,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",/m/02w7zz8,...,,,F,,,Wendy Anderson,,/m/0gcp8fv,,/m/0gby01h
450667,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",/m/02w7zz8,...,,,,,,Ariel Bastian,,/m/0gdkb51,,/m/0gdkb55


In [24]:
merged_data.columns

Index(['WikipediaMovieID', 'FreebaseMovieID_x', 'MovieName', 'ReleaseDate_x',
       'BoxOfficeRevenue', 'Runtime', 'Languages', 'Countries', 'Genres',
       'FreebaseMovieID_y', 'ReleaseDate_y', 'CharacterName', 'ActorDOB',
       'ActorGender', 'ActorHeight', 'ActorEthnicity', 'ActorName',
       'ActorAgeAtRelease', 'FreebaseCharacterActorMapID',
       'FreebaseCharacterID', 'FreebaseActorID'],
      dtype='object')

In [29]:
merged_data = merged_data.drop(columns=[col for col in merged_data.columns if col.endswith('_y')])
merged_data.columns = [col.replace('_x', '') for col in merged_data.columns]
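
An alternative that avoids the `_x`/`_y` suffixes altogether is to merge on every column the two tables share. The sketch below uses toy frames. It assumes the shared columns hold identical values in both tables, which would need checking on the real data, because rows that disagree on any key are dropped by an inner join.

```python
import pandas as pd

# Toy versions of the two tables with the three shared columns.
toy_movies = pd.DataFrame({
    "WikipediaMovieID": [975900, 261236],
    "FreebaseMovieID": ["/m/03vyhn", "/m/01mrr1"],
    "ReleaseDate": ["2001-08-24", "1983"],
    "MovieName": ["Ghosts of Mars", "A Woman in Flames"],
})
toy_characters = pd.DataFrame({
    "WikipediaMovieID": [975900, 975900],
    "FreebaseMovieID": ["/m/03vyhn", "/m/03vyhn"],
    "ReleaseDate": ["2001-08-24", "2001-08-24"],
    "ActorName": ["Wanda De Jesus", "Ice Cube"],
})

# Joining on all shared columns leaves no duplicated column to suffix.
shared_keys = ["WikipediaMovieID", "FreebaseMovieID", "ReleaseDate"]
merged = toy_movies.merge(toy_characters, how="inner", on=shared_keys)
print(merged.columns.tolist())
```

If the shared columns cannot be trusted to match, dropping them from one table before merging on `WikipediaMovieID` alone is another option.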


In [30]:
merged_data

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres,CharacterName,ActorDOB,ActorGender,ActorHeight,ActorEthnicity,ActorName,ActorAgeAtRelease,FreebaseCharacterActorMapID,FreebaseCharacterID,FreebaseActorID
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",Akooshay,1958-08-26,F,1.620,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
1,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",Lieutenant Melanie Ballard,1974-08-15,F,1.780,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
3,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",Sgt Jericho Butler,1967-09-12,M,1.750,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
4,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",Bashira Kincaid,1977-09-25,F,1.650,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
450664,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",,,,,,Billy Morton,,/m/0gchkcy,,/m/0gc4lfm
450665,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",,1982-01-28,,,,Andrea Runge,19.0,/m/0gckh4f,,/m/0gbx_rk
450666,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",,,F,,,Wendy Anderson,,/m/0gcp8fv,,/m/0gby01h
450667,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",,,,,,Ariel Bastian,,/m/0gdkb51,,/m/0gdkb55


In [83]:
threshold = int(0.80 * merged_data.shape[1])

merged_data = merged_data.dropna(thresh=threshold)

In [84]:
merged_data

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres,year,CharacterName,ActorDOB,ActorGender,ActorHeight,ActorEthnicity,ActorName,ActorAgeAtRelease,FreebaseCharacterActorMapID,FreebaseCharacterID,FreebaseActorID
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",2001.0,Akooshay,1958-08-26,F,1.620,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
1,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",2001.0,Lieutenant Melanie Ballard,1974-08-15,F,1.780,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",2001.0,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
3,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",2001.0,Sgt Jericho Butler,1967-09-12,M,1.750,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
4,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",2001.0,Bashira Kincaid,1977-09-25,F,1.650,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
450659,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ...",1992.0,Elensh,1970-05,F,,,Dorothy Elias-Fahn,,/m/0kr406c,/m/0kr406h,/m/0b_vcv
450660,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ...",1992.0,Hibiki,1965-04-12,M,,,Jonathan Fahn,27.0,/m/0kr405_,/m/0kr4090,/m/0bx7_j
450661,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",2002.0,,1980-06-24,F,1.720,/m/041rx,Liane Balaban,21.0,/m/03jpb_5,,/m/02pn4z4
450662,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",2002.0,,1946-07-02,M,1.740,/m/041rx,Ron Silver,55.0,/m/04hv69s,,/m/03swmf


Filtering on row completeness is costly: we lose more than 100k rows with a 75% threshold and more than 200k rows with an 80% threshold.
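
To make the trade-off between thresholds easier to compare, the loss can be counted for several fractions in one pass. The following is a minimal sketch, not the notebook's own code: the helper name `rows_lost_by_threshold` and the toy DataFrame are assumptions made for illustration, and `merged_data` would take the place of `toy` in practice.

```python
import numpy as np
import pandas as pd


def rows_lost_by_threshold(df, fractions=(0.75, 0.80)):
    """Count how many rows dropna(thresh=...) would remove for each fraction.

    Hypothetical helper: for each fraction, a row is kept only if it has
    at least int(fraction * number_of_columns) non-null values.
    """
    lost = {}
    for frac in fractions:
        thresh = int(frac * df.shape[1])
        kept = df.dropna(thresh=thresh)
        lost[frac] = len(df) - len(kept)
    return lost


# Toy stand-in for merged_data (values are made up for illustration)
toy = pd.DataFrame({
    "a": [1, 2, np.nan, 4],
    "b": [1, np.nan, np.nan, 4],
    "c": [1, np.nan, np.nan, 4],
    "d": [1, 2, 3, np.nan],
    "e": [1, 2, 3, 4],
})

print(rows_lost_by_threshold(toy))
```

Running the same helper on `merged_data` before any filtering would give the exact row counts behind the figures quoted above.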

## Oscar dataset

In [97]:
OscarWithRatings = pd.read_csv('data/oscar-movies-ratings/versions/2/oscars_df.csv')
OscarWithRatings

Unnamed: 0.1,Unnamed: 0,Film,Oscar Year,Film Studio/Producer(s),Award,Year of Release,Movie Time,Movie Genre,IMDB Rating,IMDB Votes,...,Tomatometer Status,Tomatometer Rating,Tomatometer Count,Audience Status,Audience Rating,Audience Count,Tomatometer Top Critics Count,Tomatometer Fresh Critics Count,Tomatometer Rotten Critics Count,Film ID
0,0,Wings,1927/28,Famous Players-Lasky,Winner,1927,144,"Drama,Romance,War",7.5,12221,...,Certified-Fresh,93.0,46.0,Upright,78.0,3530.0,9.0,43.0,3.0,2becf7d5-a3de-46ab-ae45-abdd6b588067
1,1,7th Heaven,1927/28,Fox,Nominee,1927,110,"Drama,Romance",7.7,3439,...,,,,,,,,,,19ed3295-a878-4fd2-8e60-5cd7b5f93dad
2,2,The Racket,1927/28,The Caddo Company,Nominee,1928,84,"Crime,Drama,Film-Noir",6.7,1257,...,,,,,,,,,,3111c2d8-0908-4093-8ff3-99c89f2f2f08
3,3,The Broadway Melody,1928/29,Metro-Goldwyn-Mayer,Winner,1929,100,"Drama,Musical,Romance",5.7,6890,...,Rotten,33.0,24.0,Spilled,21.0,1813.0,7.0,8.0,16.0,de063f3f-2d35-4e1c-8636-6eb4c16bd236
4,4,Alibi,1928/29,Feature Productions,Nominee,1929,91,"Action,Crime,Romance",5.8,765,...,,,,,,,,,,609887c2-877c-43a4-b88c-e40e31096a98
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
566,566,Mank,2020,"Ceán Chaffin, Eric Roth, and Douglas Urbanski",Nominee,2020,131,"Biography,Comedy,Drama",6.9,65380,...,,,,,,,,,,47d4ae4f-e782-4cd9-9508-4a07302b1c1a
567,567,Minari,2020,Christina Oh,Nominee,2020,115,Drama,7.5,57976,...,,,,,,,,,,7262b3a8-214d-4205-985c-70e0860f3236
568,568,Promising Young Woman,2020,"Ben Browning, Ashley Fox, Emerald Fennell, and...",Nominee,2020,113,"Crime,Drama,Thriller",7.5,122269,...,,,,,,,,,,d64c669b-7a73-496a-bddb-19cb09264371
569,569,Sound of Metal,2020,Bert Hamelinck and Sacha Ben Harroche,Nominee,2019,120,"Drama,Music",7.8,102807,...,,,,,,,,,,647357e9-c067-46bd-aaeb-24d4344ec124


In [98]:
OscarWithRatings.columns

Index(['Unnamed: 0', 'Film', 'Oscar Year', 'Film Studio/Producer(s)', 'Award',
       'Year of Release', 'Movie Time', 'Movie Genre', 'IMDB Rating',
       'IMDB Votes', 'Movie Info', 'Genres', 'Critic Consensus',
       'Content Rating', 'Directors', 'Authors', 'Actors',
       'Original Release Date', 'Streaming Release Date', 'Production Company',
       'Tomatometer Status', 'Tomatometer Rating', 'Tomatometer Count',
       'Audience Status', 'Audience Rating', 'Audience Count',
       'Tomatometer Top Critics Count', 'Tomatometer Fresh Critics Count',
       'Tomatometer Rotten Critics Count', 'Film ID'],
      dtype='object')

In [99]:
oscar_awards = pd.read_csv('data/the-oscar-award/versions/11/the_oscar_award.csv')

oscar_awards

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
0,1927,1928,1,ACTOR,Richard Barthelmess,The Noose,False
1,1927,1928,1,ACTOR,Emil Jannings,The Last Command,True
2,1927,1928,1,ACTRESS,Louise Dresser,A Ship Comes In,False
3,1927,1928,1,ACTRESS,Janet Gaynor,7th Heaven,True
4,1927,1928,1,ACTRESS,Gloria Swanson,Sadie Thompson,False
...,...,...,...,...,...,...,...
10884,2023,2024,96,WRITING (Original Screenplay),Written by Celine Song,Past Lives,False
10885,2023,2024,96,JEAN HERSHOLT HUMANITARIAN AWARD,,,True
10886,2023,2024,96,HONORARY AWARD,"To Angela Bassett, who has inspired audiences ...",,True
10887,2023,2024,96,HONORARY AWARD,"To Mel Brooks, for his comedic brilliance, pro...",,True


In [100]:
# Flag the acting categories (ACTOR, ACTRESS, and their leading/supporting variants)
oscar_awards['actor'] = oscar_awards['category'].str.contains(r'actor|actress', case=False, na=False)

df_actors = oscar_awards[oscar_awards['actor']]
df_actors

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner,actor
0,1927,1928,1,ACTOR,Richard Barthelmess,The Noose,False,True
1,1927,1928,1,ACTOR,Emil Jannings,The Last Command,True,True
2,1927,1928,1,ACTRESS,Louise Dresser,A Ship Comes In,False,True
3,1927,1928,1,ACTRESS,Janet Gaynor,7th Heaven,True,True
4,1927,1928,1,ACTRESS,Gloria Swanson,Sadie Thompson,False,True
...,...,...,...,...,...,...,...,...
10780,2023,2024,96,ACTRESS IN A SUPPORTING ROLE,Emily Blunt,Oppenheimer,False,True
10781,2023,2024,96,ACTRESS IN A SUPPORTING ROLE,Danielle Brooks,The Color Purple,False,True
10782,2023,2024,96,ACTRESS IN A SUPPORTING ROLE,America Ferrera,Barbie,False,True
10783,2023,2024,96,ACTRESS IN A SUPPORTING ROLE,Jodie Foster,Nyad,False,True


In [101]:
df_actors['category'].unique()

array(['ACTOR', 'ACTRESS', 'ACTOR IN A SUPPORTING ROLE',
       'ACTRESS IN A SUPPORTING ROLE', 'ACTOR IN A LEADING ROLE',
       'ACTRESS IN A LEADING ROLE'], dtype=object)

In [102]:
df_actors['winner'].sum()/len(df_actors)

0.2024070021881838

In [103]:
df_oscar_winners=df_actors[df_actors['winner']==True].drop(columns='actor')
df_oscar_winners

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
1,1927,1928,1,ACTOR,Emil Jannings,The Last Command,True
3,1927,1928,1,ACTRESS,Janet Gaynor,7th Heaven,True
36,1928,1929,2,ACTOR,Warner Baxter,In Old Arizona,True
45,1928,1929,2,ACTRESS,Mary Pickford,Coquette,True
73,1929,1930,3,ACTOR,George Arliss,Disraeli,True
...,...,...,...,...,...,...,...
10657,2022,2023,95,ACTRESS IN A SUPPORTING ROLE,Jamie Lee Curtis,Everything Everywhere All at Once,True
10768,2023,2024,96,ACTOR IN A LEADING ROLE,Cillian Murphy,Oppenheimer,True
10772,2023,2024,96,ACTOR IN A SUPPORTING ROLE,Robert Downey Jr.,Oppenheimer,True
10779,2023,2024,96,ACTRESS IN A LEADING ROLE,Emma Stone,Poor Things,True


In [104]:
len(df_oscar_winners['name'].unique())

318

In [105]:
# Normalize names for matching: lowercase, trim, and strip punctuation
# (e.g. "Robert Downey Jr." -> "robert downey jr")
df_oscar_winners['actor_clean_name'] = df_oscar_winners['name'].str.lower().str.strip().str.replace(r'[^\w\s]', '', regex=True)
df_oscar_winners['movie_clean_name'] = df_oscar_winners['film'].str.lower().str.strip().str.replace(r'[^\w\s]', '', regex=True)
df_oscar_winners

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner,actor_clean_name,movie_clean_name
1,1927,1928,1,ACTOR,Emil Jannings,The Last Command,True,emil jannings,the last command
3,1927,1928,1,ACTRESS,Janet Gaynor,7th Heaven,True,janet gaynor,7th heaven
36,1928,1929,2,ACTOR,Warner Baxter,In Old Arizona,True,warner baxter,in old arizona
45,1928,1929,2,ACTRESS,Mary Pickford,Coquette,True,mary pickford,coquette
73,1929,1930,3,ACTOR,George Arliss,Disraeli,True,george arliss,disraeli
...,...,...,...,...,...,...,...,...,...
10657,2022,2023,95,ACTRESS IN A SUPPORTING ROLE,Jamie Lee Curtis,Everything Everywhere All at Once,True,jamie lee curtis,everything everywhere all at once
10768,2023,2024,96,ACTOR IN A LEADING ROLE,Cillian Murphy,Oppenheimer,True,cillian murphy,oppenheimer
10772,2023,2024,96,ACTOR IN A SUPPORTING ROLE,Robert Downey Jr.,Oppenheimer,True,robert downey jr,oppenheimer
10779,2023,2024,96,ACTRESS IN A LEADING ROLE,Emma Stone,Poor Things,True,emma stone,poor things


In [106]:
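# Inner join: keep only movies that have at least one character/actor record
merged_data = pd.merge(movie_metadata, character_metadata, how='inner', on='WikipediaMovieID')
# Overlapping columns get _x/_y suffixes; keep the left-hand copies under their original names
merged_data = merged_data.drop(columns=[col for col in merged_data.columns if col.endswith('_y')])
merged_data.columns = [col.replace('_x', '') for col in merged_data.columns]
merged_data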

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres,CharacterName,ActorDOB,ActorGender,ActorHeight,ActorEthnicity,ActorName,ActorAgeAtRelease,FreebaseCharacterActorMapID,FreebaseCharacterID,FreebaseActorID
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",Akooshay,1958-08-26,F,1.620,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
1,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",Lieutenant Melanie Ballard,1974-08-15,F,1.780,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
3,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",Sgt Jericho Butler,1967-09-12,M,1.750,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
4,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",Bashira Kincaid,1977-09-25,F,1.650,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
450664,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",,,,,,Billy Morton,,/m/0gchkcy,,/m/0gc4lfm
450665,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",,1982-01-28,,,,Andrea Runge,19.0,/m/0gckh4f,,/m/0gbx_rk
450666,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",,,F,,,Wendy Anderson,,/m/0gcp8fv,,/m/0gby01h
450667,12476867,/m/02w7zz8,Spliced,2002,,86.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}","{""/m/01jfsb"": ""Thriller"", ""/m/03npn"": ""Horror""...",,,,,,Ariel Bastian,,/m/0gdkb51,,/m/0gdkb55


In [107]:
# Apply the same normalization as for the Oscar winners so that both sides share comparable join keys
merged_data['actor_clean_name'] = merged_data['ActorName'].str.lower().str.strip().str.replace(r'[^\w\s]', '', regex=True)
merged_data['movie_clean_name'] = merged_data['MovieName'].str.lower().str.strip().str.replace(r'[^\w\s]', '', regex=True)
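
The same normalization chain is applied four times (actor and movie names, on both datasets), so it could be factored into one function to guarantee that both sides of the join are cleaned identically. This is a sketch under that assumption; `clean_name` is a hypothetical helper, not something defined in the notebook.

```python
import pandas as pd


def clean_name(series: pd.Series) -> pd.Series:
    """Lowercase, trim whitespace, and remove punctuation from a string column.

    Hypothetical helper mirroring the chain used in the notebook.
    Missing values stay missing.
    """
    return (
        series.str.lower()
        .str.strip()
        .str.replace(r"[^\w\s]", "", regex=True)
    )


# Small illustration with names that appear in the Oscar data
names = pd.Series(["Robert Downey Jr.", " Yuh-Jung Youn ", None])
cleaned = clean_name(names)
print(cleaned.tolist())
```

Note that hyphens are removed rather than replaced by a space, so "Yuh-Jung Youn" becomes "yuhjung youn"; a name spelled with a space in the other dataset would therefore not match.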


In [108]:
# Outer join keeps every row from both sides: Oscar winners without a match in the movie data
# appear with NaN in all movie columns
merged_data = pd.merge(merged_data, df_oscar_winners, how='outer', on=['actor_clean_name', 'movie_clean_name'])

In [109]:
merged_data

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres,CharacterName,...,FreebaseActorID,actor_clean_name,movie_clean_name,year_film,year_ceremony,ceremony,category,name,film,winner
0,11901968.0,/m/02rxcn3,Big Money Rustlas,2010,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0gf28"": ""Parody"", ""/m/0lsxr"": ""Crime Fict...",,...,/m/0421n4b,2 tuff tony,big money rustlas,,,,,,,
1,21029252.0,/m/05b3f51,Miss March,2009-03-13,4591629.0,90.0,"{""/m/05zjd"": ""Portuguese Language"", ""/m/02h40l...","{""/m/09c7w0"": ""United States of America""}","{""/m/04228s"": ""Road movie"", ""/m/0gsy3b"": ""Sex ...",,...,/m/09rsqv,40 glocc,miss march,,,,,,,
2,21798180.0,/m/05mz_dh,13,2010-03-13,,97.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/07s9rl0"": ""Drama""}",Jimmy,...,/m/01vvyc_,50 cent,13,,,,,,,
3,6501095.0,/m/0g7sfc,50 Cent: The New Breed,2003-04-15,,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/04rlf"": ""Music"", ""/m/0d2rhq"": ""Concert fi...",,...,/m/01vvyc_,50 cent,50 cent the new breed,,,,,,,
4,33638321.0,/m/0h28lbt,All Things Fall Apart,2011,,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01z02hx"": ""Sports"", ""/m/07s9rl0"": ""Drama""}",Deon,...,/m/01vvyc_,50 cent,all things fall apart,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
450716,6415208.0,/m/0g4fl5,Zachariah,1971-01-24,,92.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0219x_"": ""Indie"", ""/m/04t36"": ""Musical"", ...",Belle Starr's Band,...,,,zachariah,,,,,,,
450717,34953010.0,/m/0j43swk,Zero Dark Thirty,2012-12-19,,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/02kdv5l"": ""Actio...",,...,,,zero dark thirty,,,,,,,
450718,34953010.0,/m/0j43swk,Zero Dark Thirty,2012-12-19,,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/02kdv5l"": ""Actio...",,...,,,zero dark thirty,,,,,,,
450719,4463641.0,/m/0c3vzx,Zombie Genocide,1993,,64.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland""}","{""/m/0jb4p32"": ""Zombie Film""}",,...,/m/03skpf,,zombie genocide,,,,,,,


In [110]:
merged_data.columns

Index(['WikipediaMovieID', 'FreebaseMovieID', 'MovieName', 'ReleaseDate',
       'BoxOfficeRevenue', 'Runtime', 'Languages', 'Countries', 'Genres',
       'CharacterName', 'ActorDOB', 'ActorGender', 'ActorHeight',
       'ActorEthnicity', 'ActorName', 'ActorAgeAtRelease',
       'FreebaseCharacterActorMapID', 'FreebaseCharacterID', 'FreebaseActorID',
       'actor_clean_name', 'movie_clean_name', 'year_film', 'year_ceremony',
       'ceremony', 'category', 'name', 'film', 'winner'],
      dtype='object')

In [111]:
merged_data[merged_data['winner']==True]

Unnamed: 0,WikipediaMovieID,FreebaseMovieID,MovieName,ReleaseDate,BoxOfficeRevenue,Runtime,Languages,Countries,Genres,CharacterName,...,FreebaseActorID,actor_clean_name,movie_clean_name,year_film,year_ceremony,ceremony,category,name,film,winner
3382,1897341.0,/m/064lsn,The Pianist,2002-05-24,120072577.0,142.0,"{""/m/06b_j"": ""Russian Language"", ""/m/04306rv"":...","{""/m/0f8l9c"": ""France"", ""/m/05qhw"": ""Poland"", ...","{""/m/03g3w"": ""History"", ""/m/03bxz7"": ""Biograph...",Wladyslaw Szpilman,...,/m/01cj6y,adrien brody,the pianist,2002.0,2003.0,75.0,ACTOR IN A LEADING ROLE,Adrien Brody,The Pianist,True
6158,133648.0,/m/0_9wr,Scent of a Woman,1992-12-23,,137.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama"", ""/m/01j1n2"": ""Coming o...",Lt. Col. Frank Slade,...,/m/0bj9k,al pacino,scent of a woman,1992.0,1993.0,65.0,ACTOR IN A LEADING ROLE,Al Pacino,Scent of a Woman,True
6565,7047921.0,/m/0h1x5f,Little Miss Sunshine,2006-01-20,,102.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hn10"": ""LGBT"", ""/m/0hj3n0w"": ""Ensemble F...",Grandpa Edwin Hoover,...,/m/015grj,alan arkin,little miss sunshine,2006.0,2007.0,79.0,ACTOR IN A SUPPORTING ROLE,Alan Arkin,Little Miss Sunshine,True
8728,42856.0,/m/0bs4r,The Bridge on the River Kwai,1957-10-02,33300000.0,155.0,"{""/m/03_9r"": ""Japanese Language"", ""/m/02h40lc""...","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/0hj3n0w"": ""Ensemble Film"", ""/m/07s9rl0"": ...",Colonel Nicholson,...,/m/0gr36,alec guinness,the bridge on the river kwai,1957.0,1958.0,30.0,ACTOR,Alec Guinness,The Bridge on the River Kwai,True
11909,61527.0,/m/0gns4,In Old Chicago,1938,,94.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01drsx"": ""Disaster"", ""/m/03btsm8"": ""Actio...",,...,/m/019wx1,alice brady,in old chicago,1937.0,1938.0,10.0,ACTRESS IN A SUPPORTING ROLE,Alice Brady,In Old Chicago,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
438632,,,,,,,,,,,...,,will smith,king richard,2021.0,2022.0,94.0,ACTOR IN A LEADING ROLE,Will Smith,King Richard,True
440308,156641.0,/m/014knw,Stalag 17,1953,3300000.0,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hj3n0w"": ""Ensemble Film"", ""/m/01t_vv"": ""...",Sgt. J.J. Sefton,...,/m/012v9y,william holden,stalag 17,1953.0,1954.0,26.0,ACTOR,William Holden,Stalag 17,True
440432,113452.0,/m/0sxkh,Kiss of the Spider Woman,1985-05-13,17005229.0,120.0,"{""/m/064_8sq"": ""French Language"", ""/m/02h40lc""...","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/0hn10"": ""LGBT"", ""/m/0lsxr"": ""Crime Fictio...",Luis Molina,...,/m/016khd,william hurt,kiss of the spider woman,1985.0,1986.0,58.0,ACTOR IN A LEADING ROLE,William Hurt,Kiss of the Spider Woman,True
445273,,,,,,,,,,,...,,yuhjung youn,minari,2020.0,2021.0,93.0,ACTRESS IN A SUPPORTING ROLE,Yuh-Jung Youn,Minari,True
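
Some winners above (for example Will Smith for King Richard) have no movie columns filled in, which means the outer join found no matching row in the movie data. One way to quantify this is the `indicator=True` option of `pd.merge`, which adds a `_merge` column telling which side each row came from. The sketch below uses small made-up frames (`cmu` and `winners` are hypothetical stand-ins for `merged_data` and `df_oscar_winners`).

```python
import pandas as pd

# Toy stand-ins for the two sides of the join (illustrative values only)
cmu = pd.DataFrame({
    "actor_clean_name": ["adrien brody", "al pacino", "ice cube"],
    "movie_clean_name": ["the pianist", "scent of a woman", "ghosts of mars"],
})
winners = pd.DataFrame({
    "actor_clean_name": ["adrien brody", "al pacino", "will smith"],
    "movie_clean_name": ["the pianist", "scent of a woman", "king richard"],
    "winner": [True, True, True],
})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
check = pd.merge(
    cmu,
    winners,
    how="outer",
    on=["actor_clean_name", "movie_clean_name"],
    indicator=True,
)

match_counts = check["_merge"].value_counts()
print(match_counts)

# Winners that found no matching (actor, movie) pair in the movie data
unmatched_winners = check.loc[
    check["_merge"] == "right_only",
    ["actor_clean_name", "movie_clean_name"],
]
print(unmatched_winners)
```

Listing the `right_only` rows helps separate winners whose films fall outside the corpus from winners that fail to match only because of spelling differences in the cleaned names.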


In [112]:
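# Rows sharing the same (movie, actor, winner) key; keep=False flags every member of each duplicate group.
# Missing values compare as equal here, so rows with no actor name for the same movie are also flagged.
duplicate_movies = merged_data[merged_data.duplicated(subset=['movie_clean_name', 'actor_clean_name', 'winner'], keep=False)]
print(duplicate_movies)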


        WikipediaMovieID FreebaseMovieID                         MovieName  \
290           18999254.0      /m/02z6zks              Onks' Viljoo näkyny?   
291           18999254.0      /m/02z6zks              Onks' Viljoo näkyny?   
292           18999254.0      /m/02z6zks              Onks' Viljoo näkyny?   
293           18999254.0      /m/02z6zks              Onks' Viljoo näkyny?   
294           18999254.0      /m/02z6zks              Onks' Viljoo näkyny?   
...                  ...             ...                               ...   
450702         8326198.0      /m/026_hnv  Who the Fuck Is Jackson Pollock?   
450704        27947216.0      /m/0ch185_                       Witchhammer   
450705        27947216.0      /m/0ch185_                       Witchhammer   
450717        34953010.0      /m/0j43swk                  Zero Dark Thirty   
450718        34953010.0      /m/0j43swk                  Zero Dark Thirty   

       ReleaseDate  BoxOfficeRevenue  Runtime  \
290     1988-1