# Final Project Proposal - Predicting winners of the Oscars
### Jason Chan

I plan to use the following data set:  
https://bigml.com/user/academy_awards/gallery/dataset/5c6886e1eba31d73070017f5

This data set contains a wide range of fields for a large number of movies, ranging up until 2018. The fields include duration, genre, gross, and user and critic reviews, along with results from other awards such as the BAFTAs, the Golden Globes, and the Critics Choice awards. In total there are 119 fields and 1235 movie titles. For each Oscar award, one field records whether the film was nominated for that award and another records whether it won. The "won" field will be my target variable, so I can choose a specific Oscar award to train my model on. At this early point, "Best Picture" is the most interesting choice, though I can later extend the model to cover the other Oscar awards.

This problem is worth pursuing because the Oscars are the premier accolade in the movie business, and people spend significant time discussing and debating the results. Being able to predict the winners would be an impressive feat.

As mentioned above, there are 1235 movie titles and 119 fields. I will use all 1235 movie titles, but I may exclude certain fields as I explore them, especially if I train my model towards one particular Oscar award. The dataset page reports whether each field is numerical, categorical, text, list, or date. Because of the sheer number of fields, I will not describe each one individually, but a quick overview can be found at the link above. I will likely exclude the date fields, since film industry awards are given on a yearly basis anyway. There is also a synopsis field, which may lend itself to interesting NLP analysis. That is beyond the scope of this class, so I will not pursue it here, but a cool extension to this project would be to analyze the synopses and see whether award wins can be predicted from the content of a film.

Unfortunately, beyond the column headings, the website offers no documentation of what each field means, so I will have to guess the meaning of some columns, such as “popularity”, “rate”, and “metascore”. The latter two look like scores out of 10 and out of 100, respectively. The popularity column, however, is numerical with values on the order of several thousand, so I am uncertain what it measures. Finally, although this dataset is publicly available online, the website where I obtained it lists no existing models or scripts that use it.
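
One quick way to sanity-check guesses about undocumented columns such as “rate”, “metascore”, and “popularity” is to look at the range of each column. The sketch below uses a small toy frame with made-up values (the real table would come from the downloaded CSV); only the three column names are taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real table; these values are invented for illustration.
toy = pd.DataFrame({
    "rate": [7.8, 8.1, 6.4],
    "metascore": [81.0, 94.0, 55.0],
    "popularity": [1520.0, 310.0, 4875.0],
})

# The min/max of each column hints at the scale that column uses.
ranges = toy[["rate", "metascore", "popularity"]].agg(["min", "max"])
print(ranges)
```

On the real data, a maximum near 10 or near 100 would support the "score out of 10" and "score out of 100" readings, while the range of popularity may help narrow down what it measures.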

### Preprocessing the data  
I chose to remove 50 of the columns. Some correspond to values that I do not think will have a meaningful effect on the model, such as release date; others are, as of right now, too complicated to process. In particular, I removed every column whose name ends with “_nominated_categories” or “_won_categories”, since a value in these columns can be either a single category like “Best Actress” or a pipe-delimited list in string form, such as “Best Actress|Best Actor|Best Original Score|Best Picture”. Because of this format, I cannot apply one-hot encoding to these columns directly. For now, I will work with a simpler data set and train a simpler model. As the project goes on, however, I should find a way to include this information, as I think it would play an important role in predicting award wins. For example, a film is probably more likely to win the Oscar for “Best Original Score” if it also won “Best Original Score” at the BAFTA awards.
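
One possible way to bring the pipe-delimited columns back in later is a multi-hot encoding, where each category gets its own 0/1 column. The sketch below is not part of the current preprocessing; it uses pandas' `Series.str.get_dummies` on a few hypothetical rows written in the format described above.

```python
import pandas as pd

# Hypothetical rows in the pipe-delimited format; not real values from the dataset.
bafta = pd.Series(
    ["Best Actress|Best Picture", "Best Original Score", None],
    name="BAFTA_won_categories",
)

# Split each string on "|" and emit one 0/1 indicator column per category.
# Missing entries become rows of all zeros.
multi_hot = bafta.fillna("").str.get_dummies(sep="|").add_prefix("BAFTA_won_")
print(multi_hot)
```

Each film then carries one indicator per award category, so a feature like "won Best Original Score at the BAFTAs" becomes directly available to the model.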

Before I apply further preprocessing, the shape of the data at this stage is (1235, 69), meaning there are 1235 films and 69 features.   

The majority of the features are either numerical or binary categorical (“Yes” or “No”). I will apply a standard scaler to all the numerical features, and use pandas' replace() method on the “Yes”/“No” features to change every “No” to 0 and every “Yes” to 1. The only truly categorical feature is “certificate”, which has values such as “PG-13” or “R”. I will use a one-hot encoder for this feature. Finally, for this early-stage model, I will use the “Oscar_Best_Picture_won” feature as the target variable, and use a label encoder to transform that column.  

By the end of the preprocessing, I have **75 total features**. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler, LabelEncoder
df = pd.read_csv('movies.csv')
df.head()
df.shape
df.columns
drop_cols =['movie_id','synopsis','New_York_Film_Critics_Circle_won_categories','Hollywood_Film_won_categories','Hollywood_Film_nominated_categories','Austin_Film_Critics_Association_won_categories','Austin_Film_Critics_Association_nominated_categories','Denver_Film_Critics_Society_won_categories','Denver_Film_Critics_Society_nominated_categories','Boston_Society_of_Film_Critics_won_categories','Boston_Society_of_Film_Critics_nominated_categories','New_York_Film_Critics_Circle_nominated_categories','Los_Angeles_Film_Critics_Association_won_categories','Los_Angeles_Film_Critics_Association_nominated_categories','Online_Film_Critics_Society_won_categories','Online_Film_Critics_Society_nominated_categories','People_Choice_won_categories','People_Choice_nominated_categories','London_Critics_Circle_Film_won_categories','London_Critics_Circle_Film_nominated_categories','American_Cinema_Editors_won_categories','American_Cinema_Editors_nominated_categories','Costume_Designers_Guild_won_categories','Costume_Designers_Guild_nominated_categories','Online_Film_Television_Association_won_categories','Online_Film_Television_Association_nominated_categories','Producers_Guild_won_categories','Producers_Guild_nominated_categories','Art_Directors_Guild_won_categories','Art_Directors_Guild_nominated_categories','Writers_Guild_won_categories','Writers_Guild_nominated_categories','Critics_Choice_won_categories','Critics_Choice_nominated_categories','Directors_Guild_won_categories','Directors_Guild_nominated_categories','Screen_Actors_Guild_won_categories','Screen_Actors_Guild_nominated_categories','BAFTA_won_categories','BAFTA_nominated_categories','Golden_Globes_won_categories','Golden_Globes_nominated_categories','Oscar_nominated_categories','genre','year','release_date','release_date.year', 'release_date.month', 'release_date.day-of-month', 'release_date.day-of-week']
df = df.drop(columns = drop_cols)

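# sparse_output replaces the older sparse argument (scikit-learn 1.2 and later)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# categories takes one list of values per encoded column
oe = OrdinalEncoder(categories = [['No','Yes']])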
ss = StandardScaler()
le = LabelEncoder()

#Replacing columns with "Yes"/"No" values to 1/0
ord_cols = 'Oscar_Best_Picture_nominated,Oscar_Best_Director_won,Oscar_Best_Director_nominated,Oscar_Best_Actor_won,Oscar_Best_Actor_nominated,Oscar_Best_Actress_won,Oscar_Best_Actress_nominated,Oscar_Best_Supporting_Actor_won,Oscar_Best_Supporting_Actor_nominated,Oscar_Best_Supporting_Actress_won,Oscar_Best_Supporting_Actress_nominated,Oscar_Best_AdaScreen_won,Oscar_Best_AdaScreen_nominated,Oscar_Best_OriScreen_won,Oscar_Best_OriScreen_nominated'
ord_cols = ord_cols.split(',')
# Assign the result back: replace(..., inplace=True) on df[col] may only modify a temporary copy in recent pandas
df[ord_cols] = df[ord_cols].replace({'Yes':1,'No':0})
    
num_cols = 'duration,rate,metascore,votes,gross,user_reviews,critic_reviews,popularity,awards_nominations,Oscar_nominated,Golden_Globes_nominated,BAFTA_won,BAFTA_nominated,Screen_Actors_Guild_won,Screen_Actors_Guild_nominated,Critics_Choice_won,Critics_Choice_nominated,Directors_Guild_won,Directors_Guild_nominated,Producers_Guild_won,Producers_Guild_nominated,Art_Directors_Guild_won,Art_Directors_Guild_nominated,Writers_Guild_won,Writers_Guild_nominated,Costume_Designers_Guild_won,Costume_Designers_Guild_nominated,Online_Film_Television_Association_won,Online_Film_Television_Association_nominated,Online_Film_Critics_Society_won,Online_Film_Critics_Society_nominated,People_Choice_won,People_Choice_nominated,London_Critics_Circle_Film_won,London_Critics_Circle_Film_nominated,American_Cinema_Editors_won,American_Cinema_Editors_nominated,Hollywood_Film_won,Hollywood_Film_nominated,Austin_Film_Critics_Association_won,Austin_Film_Critics_Association_nominated,Denver_Film_Critics_Society_won,Denver_Film_Critics_Society_nominated,Boston_Society_of_Film_Critics_won,Boston_Society_of_Film_Critics_nominated,New_York_Film_Critics_Circle_won,New_York_Film_Critics_Circle_nominated,Los_Angeles_Film_Critics_Association_won,Los_Angeles_Film_Critics_Association_nominated'
num_cols = num_cols.split(',')

ohe_data = pd.DataFrame(ohe.fit_transform(np.array(df['certificate'].replace({np.nan:"Missing"})).reshape(-1,1)))
ss_data = pd.DataFrame(ss.fit_transform(df[num_cols]))
le_data = pd.DataFrame(le.fit_transform(df['Oscar_Best_Picture_won']))

ss_data.columns=num_cols
le_data.columns=['Oscar_Best_Picture_won']
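# get_feature_names() was removed in scikit-learn 1.2; passing the column name yields names like 'certificate_R'
ohe_col_names = list(ohe.get_feature_names_out(['certificate']))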
ohe_data.columns=ohe_col_names

final_data = pd.concat([df['movie'],ohe_data,ss_data,df[ord_cols],le_data], axis=1);

### List of all features:

*Categorical*:  

    certificate
    
*Numerical*:  

    duration
    rate
    metascore
    votes
    gross
    user_reviews
    critic_reviews
    popularity
    awards_nominations
    Oscar_nominated
    Golden_Globes_won
    Golden_Globes_nominated
    BAFTA_won
    BAFTA_nominated
    Screen_Actors_Guild_won
    Screen_Actors_Guild_nominated
    Critics_Choice_won
    Critics_Choice_nominated
    Directors_Guild_won
    Directors_Guild_nominated
    Producers_Guild_won
    Producers_Guild_nominated
    Art_Directors_Guild_won
    Art_Directors_Guild_nominated
    Writers_Guild_won
    Writers_Guild_nominated
    Costume_Designers_Guild_won
    Costume_Designers_Guild_nominated
    Online_Film_Television_Association_won
    Online_Film_Television_Association_nominated
    Online_Film_Critics_Society_won
    Online_Film_Critics_Society_nominated
    People_Choice_won
    People_Choice_nominated
    London_Critics_Circle_Film_won
    London_Critics_Circle_Film_nominated
    American_Cinema_Editors_won
    American_Cinema_Editors_nominated
    Hollywood_Film_won
    Hollywood_Film_nominated
    Austin_Film_Critics_Association_won
    Austin_Film_Critics_Association_nominated
    Denver_Film_Critics_Society_won
    Denver_Film_Critics_Society_nominated
    Boston_Society_of_Film_Critics_won
    Boston_Society_of_Film_Critics_nominated
    New_York_Film_Critics_Circle_won
    New_York_Film_Critics_Circle_nominated
    Los_Angeles_Film_Critics_Association_won
    Los_Angeles_Film_Critics_Association_nominated

*Boolean*:  

    Oscar_Best_Picture_nominated
    Oscar_Best_Director_won
    Oscar_Best_Director_nominated
    Oscar_Best_Actor_won
    Oscar_Best_Actor_nominated
    Oscar_Best_Actress_won
    Oscar_Best_Actress_nominated
    Oscar_Best_Supporting_Actor_won
    Oscar_Best_Supporting_Actor_nominated
    Oscar_Best_Supporting_Actress_won
    Oscar_Best_Supporting_Actress_nominated
    Oscar_Best_AdaScreen_won
    Oscar_Best_AdaScreen_nominated
    Oscar_Best_OriScreen_won
    Oscar_Best_OriScreen_nominated
   
***Target Variable***:

    Oscar_Best_Picture_won