# Feature Engineering
In this notebook, we take the data elements curated in the data gathering notebook and perform **feature engineering** on them, creating the new features that will be fed into each of our models.

## Notebook Setup

In [1]:
# Importing the necessary Python libraries
import warnings
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
from category_encoders.one_hot import OneHotEncoder

# Hiding any warnings
warnings.filterwarnings('ignore')

# Adjusting Pandas output
pd.set_option("display.max_columns", None)

In [2]:
# Import our dataset
df = pd.read_csv('../data/raw/all_data.csv')

In [3]:
# Removing any entries without a score
df.dropna(subset = ['biehn_scale_rating'], axis = 'rows', inplace = True)

## Feature Transform #0: Dropping columns that will not be used
Toward the end of my time with feature engineering, I discovered that not all of the data elements I pulled during data gathering were going to be useful for the project, so I'm dropping those columns right at the outset here. The one exception we will make is `movie_name`, which stays in the dataset for now. Here is a list of the dropped data elements and why they are being dropped:
- `tmdb_id`: This is an identification number that provides no meaningful information.
- `imdb_id`: As with the element above, this is an identification number that provides no meaningful information.
- `tmdb_popularity`: On further inspection, this appears to be a highly variable value that reflects the movie's popularity at the moment the API is invoked. For example, "The Matrix" is probably climbing in popularity right now as we get close to the release of the new Matrix movie, "The Matrix Resurrections." Because of that, this feature isn't reliable for our model.

In [4]:
# Dropping columns that are not required
df.drop(columns = ['tmdb_id', 'imdb_id', 'tmdb_popularity'], inplace = True)

## Feature Transform #1: Numerically encoding the `biehn_yes_or_no` feature
As it stands, this feature currently contains `Yes` or `No` string values. Because our model algorithms need to work with numerical data, we need to appropriately transform them into numerical values. `Yes` will become `1`, and `No` will become `0`.

In [5]:
# Performing the encoding of the "biehn_yes_or_no" feature
df['biehn_yes_or_no'] = df['biehn_yes_or_no'].map({'Yes': 1, 'No': 0})

In [6]:
# Changing the datatype of the 'biehn_yes_or_no' to int
df['biehn_yes_or_no'] = df['biehn_yes_or_no'].astype(int)
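
Before casting to `int`, it can help to confirm that the column only ever contains `Yes` or `No`. The following is a minimal sketch on toy data, not part of the original pipeline; the `toy` frame and the invalid `Maybe` label are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real column; "Maybe" is a deliberately invalid label
toy = pd.DataFrame({'biehn_yes_or_no': ['Yes', 'No', 'Yes', 'Maybe']})

# Map the two expected labels; anything not in the dictionary becomes NaN
encoded = toy['biehn_yes_or_no'].map({'Yes': 1, 'No': 0})

# Collect the labels the mapping did not cover so they can be reviewed
unmapped = toy.loc[encoded.isna(), 'biehn_yes_or_no'].unique().tolist()
print(unmapped)
```

If `unmapped` comes back empty on the real data, the integer cast is safe.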

## Feature Transform #2: One hot encoding the genre columns
There are two genre columns in this dataset, one representing the primary genre (`primary_genre`), and the other representing the secondary genre (`secondary_genre`). Because these features contain categorical string values, the simplest thing to do here is to perform proper **one hot encoding**. Of course, it is also important to ensure that the one hot encoder can properly handle nulls as there are a handful of nulls in either feature.

In [7]:
# Defining the OneHotEncoders for the genre columns
primary_genre_encoder = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')
secondary_genre_encoder = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')

In [8]:
# Getting the one-hot encoded dummies for each of the genre columns
primary_genre_dummies = primary_genre_encoder.fit_transform(df['primary_genre'])
secondary_genre_dummies = secondary_genre_encoder.fit_transform(df['secondary_genre'])

In [9]:
# Concatenating the genre dummies to the original dataframe
df = pd.concat([df, primary_genre_dummies, secondary_genre_dummies], axis = 1)

In [10]:
# Dropping the original genre columns
df.drop(columns = ['primary_genre', 'secondary_genre'], inplace = True)
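
To see what one hot encoding with a null category looks like in isolation, here is a small sketch on toy data. Note that it swaps in `pd.get_dummies` with `dummy_na = True` as a stand-in for the `category_encoders` encoder used above, so column order and edge-case behavior may differ from the real output.

```python
import pandas as pd

# Toy genre column with one missing value (values are made up for illustration)
toy = pd.DataFrame({'primary_genre': ['Comedy', 'Crime', None, 'Comedy']})

# dummy_na=True adds an extra indicator column for missing genres;
# dtype=int yields 0/1 values instead of booleans
dummies = pd.get_dummies(
    toy['primary_genre'],
    prefix = 'primary_genre',
    dummy_na = True,
    dtype = int,
)

# Every row should have exactly one indicator set, including the null row
print(dummies.columns.tolist())
print(dummies.sum(axis = 1).tolist())
```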

## Feature Transform #3: Generating a relative `movie_age` column from the source `year` column
As it currently stands, the raw `year` feature gives us little value on its own. The algorithm does not understand the "relative importance" of going from year 2020 to 2021, so presenting it with an unaltered `year` feature could unfairly bias it. To that end, we will engineer a new `movie_age` column that measures how many years have passed between the movie's release and our current year (2021).

In [11]:
# Extracting current year
currentYear = datetime.now().year
currentYear

2021

In [12]:
# Engineering the "year" column to be a relative "movie_age" column based on number of years since original release
df['movie_age'] = currentYear - df['year']
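
One thing to keep in mind is that `datetime.now().year` changes every time the calendar rolls over, so rerunning this notebook later would shift every `movie_age` by the same amount. A minimal sketch of pinning the reference year instead is shown below; `REFERENCE_YEAR` is a hypothetical constant, and the two rows are copied from the sample output at the end of this notebook.

```python
import pandas as pd

# Hypothetical constant: the year the dataset was assembled
REFERENCE_YEAR = 2021

# Two rows taken from this notebook's sample output
toy = pd.DataFrame({
    'movie_name': ['Zoolander 2', 'Dope'],
    'year': [2016.0, 2015.0],
})

# Age is simply the reference year minus the release year
toy['movie_age'] = REFERENCE_YEAR - toy['year']
print(toy['movie_age'].tolist())  # [5.0, 6.0]
```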

## Feature Transform #4: Removing percentage sign from `rt_critic_score`
As it stands, the `rt_critic_score` is a string value that has a percentage sign at the end. We simply need to remove the percentage sign and transform the value from a string into a number.

In [13]:
# Removing percentage sign from RT critic score (nulls stay null until Feature Transform #5)
# Stripping the trailing "%" also handles one- and three-digit scores such as "5%" and "100%"
df['rt_critic_score'] = pd.to_numeric(df['rt_critic_score'].str.rstrip('%'))
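
The scores are not always two digits long, so the sign has to be stripped by character rather than by position. Here is a small sketch on toy data (the values are made up) showing that stripping the trailing `%` handles one-, two-, and three-digit scores as well as nulls.

```python
import pandas as pd

# Toy scores covering the awkward cases: three digits, one digit, and a null
toy = pd.Series(['22%', '100%', '5%', None], name = 'rt_critic_score')

# Strip the trailing "%" and convert; the null entry stays null
cleaned = pd.to_numeric(toy.str.rstrip('%'), errors = 'coerce')
print(cleaned.tolist())

# For comparison, keeping only the first two characters truncates "100%" to "10"
print('100%'[:2])
```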

## Feature Transform #5: Dealing with all the nulls
There are a handful of features that still have nulls remaining. We will need to deal with these appropriately before we can feed the final training dataset into the model algorithms. Here is how we will be handling each of the features that need to be dealt with:
- `rt_critic_score`: Looking [at this news article](https://morningconsult.com/2019/10/29/rotten-tomatoes-scores-continue-to-freshen-what-does-this-mean-for-movies/), it appears that the average critic score hovers around 59%. That said, we'll be filling these nulls with a value of 59.
- `metascore`: This one was tricky. Whereas I was able to find a source above to point to the 59% number for RT critic scores, I could not find the equivalent for the metascore. Unfortunately, we're going to have to go middle of the road here at an even 50.
- `rt_audience_score`: This was also a difficult one to deal with as I could not find a source that would give a definitive answer. From my time analyzing the data, I find that while the critics and audience can vary in their arguments, they both seem to have a bell curve of movies getting ratings right around that 59% mark. So to match the `rt_critic_score`, we're going to fill these nulls also with a value of 59.

In [14]:
# Filling rt_critic_score nulls with critic average of 59%
df['rt_critic_score'] = df['rt_critic_score'].fillna(59)

In [15]:
# Transforming RT critic score into an integer datatype
df['rt_critic_score'] = df['rt_critic_score'].astype(int)

In [16]:
# Filling metascore nulls with 50.0
df['metascore'] = df['metascore'].fillna(50.0)

In [17]:
# Filling rt_audience_score with audience average of 59%
df['rt_audience_score'] = df['rt_audience_score'].fillna(59.0)
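
The three fills above can also be expressed as a single dictionary, which keeps the chosen values in one place and makes it easy to count how many nulls each one replaces. This is a sketch on toy data; the `toy` frame and the `fill_values` name are illustrative, while the fill values themselves are the ones chosen above.

```python
import numpy as np
import pandas as pd

# Toy frame with one null per score column (values are made up)
toy = pd.DataFrame({
    'rt_critic_score': [22.0, np.nan, 89.0],
    'metascore': [34.0, 72.0, np.nan],
    'rt_audience_score': [np.nan, 83.0, 88.0],
})

# Fill values chosen in the list above
fill_values = {'rt_critic_score': 59, 'metascore': 50.0, 'rt_audience_score': 59.0}

# Count the nulls per column before filling, then fill all three in one call
nulls_before = toy[list(fill_values)].isna().sum()
toy = toy.fillna(value = fill_values)

print(nulls_before.to_dict())
print(int(toy.isna().sum().sum()))
```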

## Wrapping Up!
That wraps up the feature transformations! Let's take one last look at our data and then save it off to a CSV file.

In [18]:
df.head()

Unnamed: 0,movie_name,biehn_scale_rating,biehn_yes_or_no,budget,revenue,runtime,tmdb_vote_average,tmdb_vote_count,imdb_rating,imdb_votes,year,rt_critic_score,metascore,rt_audience_score,primary_genre_Comedy,primary_genre_Crime,primary_genre_Action,primary_genre_Drama,primary_genre_Adventure,primary_genre_Documentary,primary_genre_Family,primary_genre_Western,primary_genre_Horror,primary_genre_Mystery,primary_genre_Thriller,primary_genre_War,primary_genre_Science Fiction,primary_genre_Music,primary_genre_Animation,primary_genre_nan,primary_genre_Fantasy,secondary_genre_nan,secondary_genre_Drama,secondary_genre_Adventure,secondary_genre_Science Fiction,secondary_genre_Action,secondary_genre_Family,secondary_genre_History,secondary_genre_Animation,secondary_genre_Crime,secondary_genre_Thriller,secondary_genre_Horror,secondary_genre_Fantasy,secondary_genre_Mystery,secondary_genre_Comedy,secondary_genre_Romance,secondary_genre_Music,movie_age
0,Zoolander 2,7.0,1,50000000,55969000,100,4.8,1788,4.7,67478.0,2016.0,22,34.0,20.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0
1,Dope,8.5,1,700000,17986781,103,7.1,1190,7.2,83142.0,2015.0,88,72.0,83.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6.0
2,The Big Short,8.0,1,28000000,133346506,131,7.3,7026,7.8,395829.0,2015.0,89,81.0,88.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6.0
3,Deadpool,10.0,1,58000000,783100000,108,7.6,25805,8.0,960086.0,2016.0,85,65.0,90.0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0
4,The Martian,8.0,1,108000000,630161890,144,7.7,16305,8.0,803733.0,2015.0,91,80.0,91.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,6.0


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 123 entries, 0 to 123
Data columns (total 48 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   movie_name                       123 non-null    object 
 1   biehn_scale_rating               123 non-null    float64
 2   biehn_yes_or_no                  123 non-null    int64  
 3   budget                           123 non-null    int64  
 4   revenue                          123 non-null    int64  
 5   runtime                          123 non-null    int64  
 6   tmdb_vote_average                123 non-null    float64
 7   tmdb_vote_count                  123 non-null    int64  
 8   imdb_rating                      123 non-null    float64
 9   imdb_votes                       123 non-null    float64
 10  year                             123 non-null    float64
 11  rt_critic_score                  123 non-null    int64  
 12  metascore             

In [20]:
# Saving final dataset to local disk
df.to_csv('../data/clean/train.csv', index = False)
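
Before handing the file to the modeling notebooks, a quick sanity check can confirm that no nulls remain and that `movie_name` is the only non-numeric column left. The sketch below runs on a toy frame with made-up rows; the same two checks could be pointed at `df` instead.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset
toy = pd.DataFrame({
    'movie_name': ['Dope', 'Deadpool'],
    'biehn_yes_or_no': [1, 1],
    'movie_age': [6.0, 5.0],
})

# Total number of null cells across the whole frame
remaining_nulls = int(toy.isna().sum().sum())

# Names of any columns that are not numeric
non_numeric = toy.select_dtypes(exclude = 'number').columns.tolist()

print(remaining_nulls, non_numeric)
```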