## Notebook 2: Supplementing original dataset with awards data

Arjun Lokur <br>
10/04/2023

### Introduction to Notebook: 

After I finished creating the initial notebook, I realized I could make the data stronger by adding in awards information like the Oscars and Emmys. To be clear, this is only pertaining to the actors and directors - I'm not checking if the movie or TV show itself won Best Picture or Best Series, as that is unavailable information from the perspective of my project. Meaning if someone were to use this model to decide whether or not make a movie, they of course would not have information about this hypothetical movie winning a Best Picture Oscar.

On the other hand, if someone were to cast an Oscar winning actor in the movie, logically the chances of its success increase.

One question that I wrestled with is whether I should exclude Oscars won for the movie in question. Meaning if, say, 'Training Day' is part of my train set, should I exclude Denzel Washington's acting oscar in that row because we won it for that performance? And is that a form of data leakage?

In the end I decided to keep it, as Oscar wins are only meant to check whether a particular actor has won an Oscar at any time in their career. And if you added an Oscar-calibre actor to your movie (whether it's before or after they actually won the Oscar), chances are your movie will be more successful because of the acting ability they are bringing to the table, not because of the award itself.


## Table of Contents

[Oscar Acting Winners](#1)

[Oscar Directing Winners](#2)

[Oscar Directing Nominees](#3)

[Emmy Acting Winners](#4)

[Emmy Directing Winners](#5)

[Saving the Data](#6)

[Conclusion](#7)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### 1

### Oscar Acting Winners

In [26]:
df = pd.read_csv('data/the_oscar_award.csv')

In [7]:
df.head()

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
0,1927,1928,1,ACTOR,Richard Barthelmess,The Noose,False
1,1927,1928,1,ACTOR,Emil Jannings,The Last Command,True
2,1927,1928,1,ACTRESS,Louise Dresser,A Ship Comes In,False
3,1927,1928,1,ACTRESS,Janet Gaynor,7th Heaven,True
4,1927,1928,1,ACTRESS,Gloria Swanson,Sadie Thompson,False


It looks like the dataset has information on all the nominees and also who won the award.

Let's see some of the oscar categories this dataset has.

In [5]:
df['category'].value_counts().head(20)

DIRECTING                       464
FILM EDITING                    445
ACTRESS IN A SUPPORTING ROLE    435
ACTOR IN A SUPPORTING ROLE      435
DOCUMENTARY (Short Subject)     378
BEST PICTURE                    361
DOCUMENTARY (Feature)           345
CINEMATOGRAPHY                  333
FOREIGN LANGUAGE FILM           315
ART DIRECTION                   307
COSTUME DESIGN                  290
MUSIC (Original Score)          265
SOUND                           240
ACTRESS                         236
ACTRESS IN A LEADING ROLE       235
ACTOR IN A LEADING ROLE         235
ACTOR                           232
MUSIC (Original Song)           230
SHORT FILM (Live Action)        221
MUSIC (Song)                    215
Name: category, dtype: int64

And the total number of Oscar categories.

In [6]:
len(df['category'].value_counts())

115

I need to extract only the acting winners from this list. There are 4 categories we need to look at - Actor/Actress in a Leading role, and Actor/Actress in a Supporting role.

However prior to the 1977 Oscar ceremony, the acting awards were just for Actor and Actress, so we need to include that too for the older awards.

In [7]:
df[((df['category'] == 'ACTOR') | (df['category'] == 'ACTRESS')\
    | (df['category'] == 'ACTOR IN A SUPPORTING ROLE')\
  | (df['category'] == 'ACTRESS IN A SUPPORTING ROLE')\
   | (df['category'] == 'ACTRESS IN A LEADING ROLE') \
    | (df['category'] == 'ACTOR IN A LEADING ROLE'))\
  & (df['winner'] == True)].sort_values(by='year_film', ascending=True).sample(10)

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
3012,1955,1956,28,ACTOR,Ernest Borgnine,Marty,True
4881,1971,1972,44,ACTOR,Gene Hackman,The French Connection,True
3486,1959,1960,32,ACTRESS,Simone Signoret,Room at the Top,True
7788,1998,1999,71,ACTOR IN A SUPPORTING ROLE,James Coburn,Affliction,True
3726,1961,1962,34,ACTRESS,Sophia Loren,Two Women,True
3719,1961,1962,34,ACTOR IN A SUPPORTING ROLE,George Chakiris,West Side Story,True
3593,1960,1961,33,ACTOR,Burt Lancaster,Elmer Gantry,True
8131,2001,2002,74,ACTRESS IN A SUPPORTING ROLE,Jennifer Connelly,A Beautiful Mind,True
2414,1950,1951,23,ACTRESS IN A SUPPORTING ROLE,Josephine Hull,Harvey,True
9767,2015,2016,88,ACTOR IN A LEADING ROLE,Leonardo DiCaprio,The Revenant,True


In [8]:
oscar_acting_winners = df[((df['category'] == 'ACTOR') | (df['category'] == 'ACTRESS')\
    | (df['category'] == 'ACTOR IN A SUPPORTING ROLE')\
  | (df['category'] == 'ACTRESS IN A SUPPORTING ROLE')\
   | (df['category'] == 'ACTRESS IN A LEADING ROLE') \
    | (df['category'] == 'ACTOR IN A LEADING ROLE'))\
  & (df['winner'] == True)].sort_values(by='year_film', ascending=True)

In [9]:
oscar_acting_winners.reset_index(drop=True,inplace=True)

In [10]:
oscar_acting_winners.sample(10)

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
106,1958,1959,31,ACTOR IN A SUPPORTING ROLE,Burl Ives,The Big Country,True
364,2022,2023,95,ACTOR IN A LEADING ROLE,Brendan Fraser,The Whale,True
56,1945,1946,18,ACTOR IN A SUPPORTING ROLE,James Dunn,A Tree Grows in Brooklyn,True
211,1984,1985,57,ACTOR IN A SUPPORTING ROLE,Haing S. Ngor,The Killing Fields,True
36,1940,1941,13,ACTRESS IN A SUPPORTING ROLE,Jane Darwell,The Grapes of Wrath,True
156,1970,1971,43,ACTRESS,Glenda Jackson,Women in Love,True
278,2001,2002,74,ACTOR IN A LEADING ROLE,Denzel Washington,Training Day,True
71,1949,1950,22,ACTRESS,Olivia de Havilland,The Heiress,True
60,1946,1947,19,ACTRESS IN A SUPPORTING ROLE,Anne Baxter,The Razor's Edge,True
251,1994,1995,67,ACTRESS IN A LEADING ROLE,Jessica Lange,Blue Sky,True


In [68]:
#Checking who the leading actor winners are, 1977 onwards
oscar_acting_winners[oscar_acting_winners['category'] == 'ACTOR IN A LEADING ROLE']

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
180,1976,1977,49,ACTOR IN A LEADING ROLE,Peter Finch,Network,True
184,1977,1978,50,ACTOR IN A LEADING ROLE,Richard Dreyfuss,The Goodbye Girl,True
186,1978,1979,51,ACTOR IN A LEADING ROLE,Jon Voight,Coming Home,True
190,1979,1980,52,ACTOR IN A LEADING ROLE,Dustin Hoffman,Kramer vs. Kramer,True
197,1980,1981,53,ACTOR IN A LEADING ROLE,Robert De Niro,Raging Bull,True
198,1981,1982,54,ACTOR IN A LEADING ROLE,Henry Fonda,On Golden Pond,True
202,1982,1983,55,ACTOR IN A LEADING ROLE,Ben Kingsley,Gandhi,True
208,1983,1984,56,ACTOR IN A LEADING ROLE,Robert Duvall,Tender Mercies,True
210,1984,1985,57,ACTOR IN A LEADING ROLE,F. Murray Abraham,Amadeus,True
214,1985,1986,58,ACTOR IN A LEADING ROLE,William Hurt,Kiss of the Spider Woman,True


In [118]:
#Extracting only the names
oscar_actors_names = pd.DataFrame(oscar_acting_winners['name'].\
                                  unique(), columns = ['name'])

The above process, of extracting just the names of the winners, is something I'll repeat in this notebook for the Oscar Directing award and the Emmys. The goal is to use the winners names to turn it into a feature for our main dataframe in the Capstone notebook.

### 2

### Oscar Directing Winners

In [18]:
#Checking who the most recent winners are
df[(df['category'] == 'DIRECTING')\
  & (df['winner'] == True)].sort_values(by='year_film', ascending=True).tail(10)

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
9550,2013,2014,86,DIRECTING,Alfonso Cuarón,Gravity,True
9675,2014,2015,87,DIRECTING,Alejandro G. Iñárritu,Birdman or (The Unexpected Virtue of Ignorance),True
9802,2015,2016,88,DIRECTING,Alejandro G. Iñárritu,The Revenant,True
9926,2016,2017,89,DIRECTING,Damien Chazelle,La La Land,True
10054,2017,2018,90,DIRECTING,Guillermo del Toro,The Shape of Water,True
10180,2018,2019,91,DIRECTING,Alfonso Cuarón,Roma,True
10306,2019,2020,92,DIRECTING,Bong Joon Ho,Parasite,True
10433,2020,2021,93,DIRECTING,Chloé Zhao,Nomadland,True
10553,2021,2022,94,DIRECTING,Jane Campion,The Power of the Dog,True
10675,2022,2023,95,DIRECTING,Daniel Kwan and Daniel Scheinert,Everything Everywhere All at Once,True


In [11]:
oscar_directing_winners = df[(df['category'] == 'DIRECTING')\
  & (df['winner'] == True)].sort_values(by='year_film', ascending=True)

In [12]:
oscar_directing_winners.reset_index(drop=True,inplace=True)

In [13]:
oscar_directing_winners.sample(10)

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
12,1941,1942,14,DIRECTING,John Ford,How Green Was My Valley,True
67,1996,1997,69,DIRECTING,Anthony Minghella,The English Patient,True
89,2018,2019,91,DIRECTING,Alfonso Cuarón,Roma,True
27,1956,1957,29,DIRECTING,George Stevens,Giant,True
65,1994,1995,67,DIRECTING,Robert Zemeckis,Forrest Gump,True
77,2006,2007,79,DIRECTING,Martin Scorsese,The Departed,True
22,1951,1952,24,DIRECTING,George Stevens,A Place in the Sun,True
24,1953,1954,26,DIRECTING,Fred Zinnemann,From Here to Eternity,True
1,1929,1930,3,DIRECTING,Lewis Milestone,All Quiet on the Western Front,True
84,2013,2014,86,DIRECTING,Alfonso Cuarón,Gravity,True


In [15]:
oscar_director_names = pd.DataFrame(oscar_directing_winners['name'].unique(), columns=['name'])

In [16]:
#Checking the most recent names again
oscar_director_names.tail(10)

Unnamed: 0,name
62,Tom Hooper
63,Michel Hazanavicius
64,Alfonso Cuarón
65,Alejandro G. Iñárritu
66,Damien Chazelle
67,Guillermo del Toro
68,Bong Joon Ho
69,Chloé Zhao
70,Jane Campion
71,Daniel Kwan and Daniel Scheinert


### 3

### Oscar Directing Nominees

The reason I'm also looking as Oscar Directing nominees is to essentially cast a wider net. 4 people win an acting oscar every year but only 1 person wins the Directing Oscar. This is why in the case of Directing I wanted to expand it to the nominees as well.

In [27]:
oscar_directing_nominees = df[(df['category'] == 'DIRECTING')\
  & (df['winner'] == False)].sort_values(by='year_film', ascending=True)

In [28]:
oscar_directing_nominees.reset_index(drop=True,inplace=True)

In [29]:
oscar_directing_nominees.sample(10)

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
34,1939,1940,12,DIRECTING,William Wyler,Wuthering Heights,False
98,1955,1956,28,DIRECTING,Joshua Logan,Picnic,False
180,1975,1976,48,DIRECTING,Federico Fellini,Amarcord,False
198,1980,1981,53,DIRECTING,Roman Polanski,Tess,False
195,1979,1980,52,DIRECTING,Francis Coppola,Apocalypse Now,False
297,2004,2005,77,DIRECTING,Taylor Hackford,Ray,False
89,1952,1953,25,DIRECTING,John Huston,Moulin Rouge,False
138,1965,1966,38,DIRECTING,William Wyler,The Collector,False
202,1981,1982,54,DIRECTING,Mark Rydell,On Golden Pond,False
303,2006,2007,79,DIRECTING,Clint Eastwood,Letters from Iwo Jima,False


In [31]:
oscar_director_nominee_names = pd.DataFrame(oscar_directing_nominees['name'].unique(), columns=['name'])

In [32]:
oscar_director_nominee_names.tail(10)

Unnamed: 0,name
212,Yorgos Lanthimos
213,Sam Mendes
214,Todd Phillips
215,Emerald Fennell
216,Thomas Vinterberg
217,Lee Isaac Chung
218,Ryusuke Hamaguchi
219,Todd Field
220,Martin McDonagh
221,Ruben Östlund


### 4

### Emmy Acting Winners

In [21]:
df2 = pd.read_csv('data/the_emmy_awards.csv')

In [22]:
df2.sample(5)

Unnamed: 0,id,year,category,nominee,staff,company,producer,win
11272,11273,1997,Outstanding Guest Actor In A Drama Series,Murder One,"Pruitt Taylor Vince, Clifford Banks",ABC,,True
2295,2296,2016,Outstanding Music Composition For A Series (Or...,Minority Report,"Sean P. Callery, Music by",FOX,"20th Century Fox Television, Paramount Televis...",False
16302,16303,1982,Outstanding Lead Actress In A Comedy Series,"Love, Sidney","Swoosie Kurtz, as",NBC,,False
21421,21422,1951,BEST SPORTS PROGRAM,Hollywood Baseball,"n/a,",KLAC,,False
169,170,2019,Outstanding Original Music And Lyrics,Flight Of The Conchords: Live In London,"Bret McKenzie, Music & Lyrics by; Jemaine Clem...",HBO,HBO Entertainment in association with Done + D...,False


Looking at the rows with the words Actor or Actress in them

In [23]:
emmy_acting_winners = df2[(df2['category'].str.contains('Actor','Actress'))\
  & (df2['win'] == True)].sort_values(by='year', ascending=True)

In [24]:
emmy_acting_winners.tail(10)

Unnamed: 0,id,year,category,nominee,staff,company,producer,win
1031,1032,2018,Outstanding Guest Actor In A Comedy Series,Atlanta,"Katt Williams, as Willy",FX Networks,FX Productions,True
661,662,2019,Outstanding Lead Actor In A Drama Series,Pose,"Billy Porter, as Pray Tell",FX Networks,Fox 21 Television Studios and FX Productions,True
658,659,2019,Outstanding Supporting Actor In A Drama Series,Game Of Thrones,"Peter Dinklage, as Tyrion Lannister",HBO,"HBO Entertainment in association with Bighead,...",True
650,651,2019,Outstanding Lead Actor In A Limited Series Or ...,When They See Us,"Jharrel Jerome, as Korey Wise",Netflix,"Participant Media, Tribeca Productions, Harpo ...",True
648,649,2019,Outstanding Supporting Actor In A Limited Seri...,A Very English Scandal,"Ben Whishaw, as Norman",Prime Video,Blueprint and Amazon Studios,True
643,644,2019,Outstanding Lead Actor In A Comedy Series,Barry,"Bill Hader, as Barry",HBO,HBO Entertainment in association with Alec Ber...,True
639,640,2019,Outstanding Supporting Actor In A Comedy Series,The Marvelous Mrs. Maisel,"Tony Shalhoub, as Abe Weissman",Prime Video,Amazon Studios,True
614,615,2019,Outstanding Guest Actor In A Comedy Series,The Marvelous Mrs. Maisel,"Luke Kirby, as Lenny Bruce",Prime Video,Amazon Studios,True
602,603,2019,Outstanding Guest Actor In A Drama Series,The Handmaid's Tale,"Bradley Whitford, as Commander Joseph Lawrence",Hulu,"MGM, Daniel Wilson Productions, The Littlefiel...",True
592,593,2019,Outstanding Actor In A Short Form Comedy Or Dr...,State Of The Union,"Chris O'Dowd, as Tom",SundanceTV,See-Saw Films,True


The way the information is written here, it needs to be processed further if I want to extract just the names. Passing it through a lambda function to split the string.

In [104]:
emmy_acting_winners['name'] = emmy_acting_winners['staff'].apply(lambda x: x.split(',')[0])

In [113]:
emmy_acting_unique_names = pd.DataFrame(emmy_acting_winners['name'].unique(), columns=['name'])

In [114]:
emmy_acting_unique_names.reset_index(inplace=True)

### 5

### Emmy Directing Winners

In [33]:
emmy_directing_winners = df2[(df2['category'].str.contains('Directing'))\
  & (df2['win'] == True)].sort_values(by='year', ascending=True)

In [34]:
emmy_directing_winners.tail(10)

Unnamed: 0,id,year,category,nominee,staff,company,producer,win
815,816,2018,Outstanding Directing For A Variety Special,The Oscars,"Glenn Weiss, Directed by",ABC,The Academy of Motion Picture Arts and Sciences,True
819,820,2018,Outstanding Directing For A Documentary/Nonfic...,Jane,"Brett Morgen, Directed by",National Geographic,National Geographic Studios in association wit...,True
812,813,2018,Outstanding Directing For A Variety Series,Saturday Night Live,"Don Roy King, Directed by",NBC,SNL Studios in association with Universal Tele...,True
662,663,2019,Outstanding Directing For A Drama Series,Ozark,"Jason Bateman, Directed by",Netflix,Media Rights Capital,True
656,657,2019,Outstanding Directing For A Variety Series,Saturday Night Live,"Don Roy King, Directed by",NBC,SNL Studios in association with Universal Tele...,True
647,648,2019,"Outstanding Directing For A Limited Series, Mo...",Chernobyl,"Johan Renck, Directed by",HBO,HBO Miniseries and SKY in association with Sis...,True
642,643,2019,Outstanding Directing For A Comedy Series,Fleabag,"Harry Bradbeer, Directed by",Prime Video,All3Media International Limited and Amazon Stu...,True
549,550,2019,Outstanding Directing For A Reality Program,Queer Eye,"Hisham Abed, Directed by",Netflix,"Scout Productions, Inc. and ITV Entertainment,...",True
547,548,2019,Outstanding Directing For A Documentary/Nonfic...,Free Solo,"Elizabeth Chai Vasarhelyi, Directed by; Jimmy ...",National Geographic,"National Geographic Documentary Films, Little ...",True
546,547,2019,Outstanding Directing For A Variety Special,Springsteen On Broadway,"Thom Zimny, Directed by",Netflix,"Thrill Hill Productions, Inc.",True


In [36]:
emmy_directing_winners['name'] = emmy_acting_winners['staff'].apply(lambda x: x.split(',')[0])

In [37]:
emmy_directing_winners.tail(10)

Unnamed: 0,id,year,category,nominee,staff,company,producer,win,name
815,816,2018,Outstanding Directing For A Variety Special,The Oscars,"Glenn Weiss, Directed by",ABC,The Academy of Motion Picture Arts and Sciences,True,
819,820,2018,Outstanding Directing For A Documentary/Nonfic...,Jane,"Brett Morgen, Directed by",National Geographic,National Geographic Studios in association wit...,True,
812,813,2018,Outstanding Directing For A Variety Series,Saturday Night Live,"Don Roy King, Directed by",NBC,SNL Studios in association with Universal Tele...,True,
662,663,2019,Outstanding Directing For A Drama Series,Ozark,"Jason Bateman, Directed by",Netflix,Media Rights Capital,True,
656,657,2019,Outstanding Directing For A Variety Series,Saturday Night Live,"Don Roy King, Directed by",NBC,SNL Studios in association with Universal Tele...,True,
647,648,2019,"Outstanding Directing For A Limited Series, Mo...",Chernobyl,"Johan Renck, Directed by",HBO,HBO Miniseries and SKY in association with Sis...,True,
642,643,2019,Outstanding Directing For A Comedy Series,Fleabag,"Harry Bradbeer, Directed by",Prime Video,All3Media International Limited and Amazon Stu...,True,
549,550,2019,Outstanding Directing For A Reality Program,Queer Eye,"Hisham Abed, Directed by",Netflix,"Scout Productions, Inc. and ITV Entertainment,...",True,
547,548,2019,Outstanding Directing For A Documentary/Nonfic...,Free Solo,"Elizabeth Chai Vasarhelyi, Directed by; Jimmy ...",National Geographic,"National Geographic Documentary Films, Little ...",True,
546,547,2019,Outstanding Directing For A Variety Special,Springsteen On Broadway,"Thom Zimny, Directed by",Netflix,"Thrill Hill Productions, Inc.",True,


The Emmy directing winners doesn't seem to be working (after much exploration to try and resolve this), so going with the rest (Oscar acting winners, Oscar directing winners and nominees, and Emmy acting winners).

### 6

### Saving the data

In [126]:
emmy_acting_unique_names.to_csv('data/emmy_acting_winners.csv')

In [127]:
oscar_actors_names.to_csv('data/oscar_acting_winners.csv')

In [128]:
oscar_director_names.to_csv('data/oscar_directing_winners.csv')

In [130]:
oscar_director_nominee_names.to_csv('data/oscar_directing_nominees.csv')

### 7

### Conclusion

Now that we've extracted the awards data, let's add it into our main dataframe (done in Notebook 1) and proceed to our first modeling Notebook, which is modeling for IMDB Votes.