In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ast
import os

In [42]:
oscars_df = pd.read_csv('../Data/preprocessed_data/oscars_db.csv')

In [44]:
oscars_df = oscars_df[["name", "nominations", "wins"]]
oscars_df.head()

Unnamed: 0,name,nominations,wins
0,Barkhad Abdi,1,0
1,F. Murray Abraham,1,1
2,Amy Adams,6,0
3,Nick Adams,1,0
4,Isabelle Adjani,2,0


We base our metric of an actor's success on two factors:
 - awards (in particular, Oscars), which show an actor's recognition by the professional community
 - popularity, which shows an actor's recognition by the broad audience

We can aggregate these factors as follows:

$ \mathrm{Success}(P, N, W, t) = P_{scale}(t) \cdot P + N_{scale} \cdot N + W_{scale} \cdot W $

In the formula, $P$ is the actor's popularity (from TMDB or a new database of the most popular actors), $N$ is the number of Oscar nominations the actor has received, and $W$ is the number of Oscars the actor has won. The award-related scaling factors $N_{scale}$ and $W_{scale}$ are constants, whereas $P_{scale}(t)$ is a decreasing function of $t$, the time elapsed since the peak of the actor's career. The motivation is that actors who starred recently are expected to be more popular than their retired colleagues.
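
One possible shape for the time-dependent popularity weight is an exponential decay. The sketch below is only an illustration: the function names `p_scale` and `success`, the half-life of 20 years, and the toy inputs are all assumptions, not values chosen in this notebook.

```python
import numpy as np

# Hypothetical decay constant in years; not taken from the notebook.
HALF_LIFE_YEARS = 20.0


def p_scale(years_since_peak, p0=1.0, half_life=HALF_LIFE_YEARS):
    """Decreasing popularity weight: p0 at t = 0, p0 / 2 after one half-life."""
    # Negative inputs (career peak in the future) are clipped to t = 0.
    t = np.clip(np.asarray(years_since_peak, dtype=float), 0.0, None)
    return p0 * 0.5 ** (t / half_life)


def success(popularity, nominations, wins, years_since_peak,
            n_scale=10.0, w_scale=50.0):
    """Success metric with a time-dependent popularity weight."""
    return (p_scale(years_since_peak) * popularity
            + n_scale * nominations
            + w_scale * wins)


# Toy inputs (made up): same popularity and awards, different career peaks.
years = np.array([0.0, 20.0, 40.0])
print(success(popularity=100.0, nominations=1, wins=1, years_since_peak=years))
```

Any other decreasing function (linear with a floor, a step function per decade) would fit the same formula; the exponential form is used here only because it has a single, easily interpreted parameter.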

In [45]:
actors_df = pd.read_csv('../Data/preprocessed_data/actors_db.csv')
popularity_df = actors_df[["name", "popularity"]]
popularity_df.head()

Unnamed: 0,name,popularity
0,Sangeeth Shobhan,226.892
1,Gary Oldman,220.449
2,Angeli Khang,199.449
3,Florence Pugh,176.589
4,Jason Statham,162.466


In [47]:
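# merge the two tables to get popularity, nominations and wins in a single DataFrame;
# the left join keeps every actor from popularity_df, even those without an Oscar record
success_df = pd.merge(left=popularity_df, right=oscars_df, how="left", on=["name"])
# actors missing from oscars_df have no nominations or wins, so fill those with 0
success_df = success_df.fillna(0)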
success_df.head()

Unnamed: 0,name,popularity,nominations,wins
0,Sangeeth Shobhan,226.892,0.0,0.0
1,Gary Oldman,220.449,3.0,1.0
2,Angeli Khang,199.449,0.0,0.0
3,Florence Pugh,176.589,1.0,0.0
4,Jason Statham,162.466,0.0,0.0
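
Because the join key is the actor's name, it may be worth checking how many rows actually matched and whether any name appears more than once in the Oscar table. The sketch below uses two small made-up frames (`popularity_toy`, `oscars_toy`) rather than the real data, and assumes that names are meant to be unique in the Oscar table, which this notebook does not confirm.

```python
import pandas as pd

# Made-up toy frames that mimic the column layout of the real tables.
popularity_toy = pd.DataFrame(
    {"name": ["A", "B", "C"], "popularity": [3.0, 2.0, 1.0]}
)
oscars_toy = pd.DataFrame(
    {"name": ["B", "D"], "nominations": [2, 1], "wins": [1, 0]}
)

merged = pd.merge(
    left=popularity_toy,
    right=oscars_toy,
    how="left",
    on="name",
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
    validate="many_to_one",  # raises MergeError if a name repeats in oscars_toy
)

# Number of actors that were found in the Oscar table.
matched = int((merged["_merge"] == "both").sum())
print(f"{matched} of {len(merged)} actors have an Oscar record")

# Fill only the award columns and restore integer dtypes.
merged = (
    merged.drop(columns="_merge")
    .fillna({"nominations": 0, "wins": 0})
    .astype({"nominations": int, "wins": int})
)
print(merged)
```

Filling only the award columns, instead of calling `fillna(0)` on the whole frame, avoids silently replacing a missing popularity value with 0.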


In [51]:
# for now we treat P_scale(t) as a constant; a time-dependent version will replace it later
P_SCALE = 1.0
N_SCALE = 10.0
W_SCALE = 50.0

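# use the lambda's argument rather than the outer variable, so the expression
# is evaluated on the frame that assign() is actually called on
success_df = success_df.assign(
    metric=lambda df: P_SCALE * df["popularity"]
    + N_SCALE * df["nominations"]
    + W_SCALE * df["wins"]
)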
success_df.head()

Unnamed: 0,name,popularity,nominations,wins,metric
2022,Meryl Streep,26.71,21.0,3.0,386.71
1996,Jack Nicholson,26.806,12.0,3.0,296.806
2904,Bette Davis,23.522,11.0,2.0,233.522
47,Denzel Washington,77.941,9.0,2.0,267.941
6793,Ingrid Bergman,16.691,7.0,3.0,236.691


According to our metric, the top 10 actors are:

In [53]:
success_df = success_df.sort_values(by="metric", ascending=False)
success_df.head(10)

Unnamed: 0,name,popularity,nominations,wins,metric
2022,Meryl Streep,26.71,21.0,3.0,386.71
1,Gary Oldman,220.449,3.0,1.0,300.449
1996,Jack Nicholson,26.806,12.0,3.0,296.806
47,Denzel Washington,77.941,9.0,2.0,267.941
6793,Ingrid Bergman,16.691,7.0,3.0,236.691
2055,Frances McDormand,26.59,6.0,3.0,236.59
2904,Bette Davis,23.522,11.0,2.0,233.522
87,Tom Hanks,68.632,6.0,2.0,228.632
0,Sangeeth Shobhan,226.892,0.0,0.0,226.892
367,Cate Blanchett,46.456,8.0,2.0,226.456
