# Preprocessing (feature engineering)

After compiling the data, there are still some changes to perform before training models.

The purpose of this notebook is to:

1. Drop rows missing `runtimeMinutes` (just 4 rows)
2. Split `genres` into an array, then spread the array into separate columns
3. Drop the release date (unused)
4. Engineer new features by combining existing columns

## Import necessary packages and data

In [3]:
import pandas as pd

df = pd.read_csv("../../../data_deliverable/compiled_data/compiled.csv")
df.head()

Unnamed: 0,title,runtimeMinutes,genres,averageRating,numVotes,director,writer,releaseDate,productionBudget,domesticGross,worldwideGross,releaseYear,releaseMonth,releaseDay
0,Modern Times,87,"Comedy,Drama,Romance",8.5,257713,Charles Chaplin,Charles Chaplin,"Feb 5, 1936",1500000,163245,229549,1936,2,5
1,Cat People,73,"Fantasy,Horror,Thriller",7.2,25752,Jacques Tourneur,DeWitt Bodeen,"Nov 16, 1942",134000,4000000,8000000,1942,11,16
2,Wilson,154,"Biography,Drama,History",6.4,1686,Henry King,Lamar Trotti,"Aug 1, 1944",5200000,2000000,2000000,1944,8,1
3,12 Angry Men,96,"Crime,Drama",9.0,856479,Sidney Lumet,Reginald Rose,"Apr 13, 1957",340000,0,379,1957,4,13
4,The Alamo,162,"Adventure,Drama,History",6.8,17497,John Wayne,James Edward Grant,"Oct 24, 1960",12000000,7900000,7900000,1960,10,24


## Drop missing `runtimeMinutes`

These rows are removed because:
1. It doesn't make sense for a movie to have no runtime specified
2. There are only 4 such rows

In [4]:
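# Missing runtimes are stored as the literal two-character string \N
mask = df['runtimeMinutes'] == '\\N'
# .copy() makes the result an independent frame, so later column
# assignments do not act on a slice of the original
df = df[~mask].copy()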
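
Because the column held `\N` strings alongside the numbers, pandas has most likely read `runtimeMinutes` as text, and the values stay text after the filter. The sketch below shows one way to cast them to integers. It runs on a toy frame (`toy` and its values are made up for illustration, not taken from the compiled data) and assumes every remaining value parses as a whole number.

```python
import pandas as pd

# Toy stand-in for the compiled data; titles and values are illustrative only.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "runtimeMinutes": ["87", "\\N", "154"],
})

# Same filter as the cell above, taking a copy so that the assignment
# below works on an independent frame.
toy = toy[toy["runtimeMinutes"] != "\\N"].copy()

# Cast the remaining strings to integers.
toy["runtimeMinutes"] = pd.to_numeric(toy["runtimeMinutes"]).astype(int)
```

If some values could still fail to parse, `pd.to_numeric(..., errors="coerce")` would turn them into `NaN` instead of raising, at the cost of a float column.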

## Create array out of genres

In [5]:
df["genres"] = df["genres"].str.split(",")
df["genres"].head()

0       [Comedy, Drama, Romance]
1    [Fantasy, Horror, Thriller]
2    [Biography, Drama, History]
3                 [Crime, Drama]
4    [Adventure, Drama, History]
Name: genres, dtype: object

## Convert genres array to columns

In [6]:
df['genre-1'] = df['genres'].apply(lambda x: x[0] if len(x) > 0 else "NA")
df['genre-2'] = df['genres'].apply(lambda x: x[1] if len(x) > 1 else "NA")
df['genre-3'] = df['genres'].apply(lambda x: x[2] if len(x) > 2 else "NA")
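
The three assignments above follow one pattern: take the n-th genre if it exists, otherwise use the placeholder string `"NA"`. A minimal sketch of that pattern as a loop, on a toy frame, is given here. The helper name `nth_genre` and the toy rows are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy frame with genre lists of different lengths (illustrative values).
toy = pd.DataFrame({
    "genres": [["Comedy", "Drama", "Romance"], ["Crime", "Drama"], ["Horror"]],
})

def nth_genre(genres, n, fill="NA"):
    """Return the n-th genre (0-based), or `fill` if the list is too short."""
    return genres[n] if len(genres) > n else fill

# Build genre-1, genre-2 and genre-3 with the same rule as the cell above.
for i in range(3):
    toy[f"genre-{i + 1}"] = toy["genres"].apply(nth_genre, args=(i,))
```

Movies with fewer than three genres get `"NA"` in the trailing columns, exactly as in the original lambdas.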

## Drop releaseDate and genres array

In [7]:
df.drop(columns=["releaseDate", "genres"], inplace=True)
df.columns

Index(['title', 'runtimeMinutes', 'averageRating', 'numVotes', 'director',
       'writer', 'productionBudget', 'domesticGross', 'worldwideGross',
       'releaseYear', 'releaseMonth', 'releaseDay', 'genre-1', 'genre-2',
       'genre-3'],
      dtype='object')

## Feature engineering

Additional features divide each monetary column by the release year:
 - Budget / year
 - Domestic gross / year
 - Worldwide gross / year

In [8]:
df["budgetYear"] = df["productionBudget"] / df["releaseYear"]
df["domesticGrossYear"] = df["domesticGross"] / df["releaseYear"]
df["worldwideGrossYear"] = df["worldwideGross"] / df["releaseYear"]
df.columns

Index(['title', 'runtimeMinutes', 'averageRating', 'numVotes', 'director',
       'writer', 'productionBudget', 'domesticGross', 'worldwideGross',
       'releaseYear', 'releaseMonth', 'releaseDay', 'genre-1', 'genre-2',
       'genre-3', 'budgetYear', 'domesticGrossYear', 'worldwideGrossYear'],
      dtype='object')
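
As a quick check of the arithmetic, the sketch below applies the same three divisions to the first row shown in `df.head()` (Modern Times). The frame name `row` and the `pairs` mapping are introduced here for illustration only.

```python
import pandas as pd

# First row of the head() output above, monetary columns and year only.
row = pd.DataFrame({
    "productionBudget": [1500000],
    "domesticGross": [163245],
    "worldwideGross": [229549],
    "releaseYear": [1936],
})

# Source column -> engineered column, matching the names used in the notebook.
pairs = {
    "productionBudget": "budgetYear",
    "domesticGross": "domesticGrossYear",
    "worldwideGross": "worldwideGrossYear",
}

# Each engineered feature is the monetary value divided by the release year.
for source, target in pairs.items():
    row[target] = row[source] / row["releaseYear"]
```

For this row, `budgetYear` is 1500000 / 1936, roughly 774.8.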

## Export

In [9]:
df.to_csv("../data/preprocessed.csv", index=False)
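
One thing to watch when loading the exported file: the genre columns use the string `"NA"` as a placeholder, and `pd.read_csv` treats `NA` as a missing-value marker by default. The sketch below demonstrates the difference on a toy frame written to an in-memory buffer, so no real file is touched. The names `toy`, `buf`, `default_read` and `literal_read` are illustrative.

```python
import io

import pandas as pd

# Toy frame using the same "NA" placeholder as the genre columns.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "genre-1": ["Crime", "Horror"],
    "genre-2": ["Drama", "NA"],
})

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
toy.to_csv(buf, index=False)

# Default parsing turns the literal string "NA" into a missing value.
buf.seek(0)
default_read = pd.read_csv(buf)

# With the default NA markers disabled, "NA" stays an ordinary string.
buf.seek(0)
literal_read = pd.read_csv(buf, keep_default_na=False)
```

Whether the placeholder should come back as `NaN` or as the string `"NA"` depends on how the downstream models encode categories; either way, the choice is better made explicitly at read time.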