# Feature engineering - Adjusting revenue 
This notebook prepares the features used to predict whether sequels, or films inspired by novels, are more successful.

## Tasks this notebook achieves
- Problem: budget has a large number of missing values
    - [x] Replaced zero budgets with the median budget. This is better than removing the rows, as there are a large number of them.
- Problem: revenue data is not good enough
    - [x] Removed zero-revenue rows, losing 900 rows. Not great, but I can’t predict revenue without revenue.
    - [x] Removed the year from the release date, to take into account only the seasonality of a movie release.

In [2]:
import os

In [3]:
os.chdir('..')
cwd = os.getcwd()

In [4]:
import pandas as pd
import json 
import numpy as np
from datetime import datetime
data_dir = '/data/' 

In [5]:
movie_details_neat = pd.read_pickle(cwd + data_dir + 'pre-processed/movie_details_neat.pkl')

movie_details_neat.set_index('id').head()

id,budget,popularity,revenue,runtime,vote_average,vote_count,genres,keywords,original_language,original_title,overview,production_companies,production_countries,release_date,spoken_languages,movie_id,cast,crew
64682,105000000,61.196071,351040419,143.0,7.3,3769,"{'id': [18, 10749], 'name': ['Drama', 'Romance']}","{'id': [818, 1326, 1523, 3929, 209714], 'name'...",en,The Great Gatsby,An adaptation of F. Scott Fitzgerald's Long Is...,"{'name': ['Village Roadshow Pictures', 'Bazmar...","{'iso_3166_1': ['US', 'AU'], 'name': ['United ...",2013-05-10,"{'iso_639_1': ['en'], 'name': ['English']}",64682,"{'cast_id': [2, 5, 3, 8, 6, 4, 7, 22, 13, 23, ...","{'credit_id': ['52fe46e3c3a368484e0a982d', '52..."
9543,150000000,62.169881,335154643,116.0,6.2,2317,"{'id': [12, 14, 28, 10749], 'name': ['Adventur...","{'id': [1241, 1965, 12653, 12654, 12655, 41645...",en,Prince of Persia: The Sands of Time,A rogue prince reluctantly joins forces with a...,"{'name': ['Walt Disney Pictures', 'Jerry Bruck...","{'iso_3166_1': ['US'], 'name': ['United States...",2010-05-19,"{'iso_639_1': ['en'], 'name': ['English']}",9543,"{'cast_id': [5, 2, 7, 4, 6, 8, 9, 10, 26, 27, ...","{'credit_id': ['567e74d4c3a36860e9008e46', '52..."
5174,140000000,22.57178,258022233,91.0,6.1,783,"{'id': [28, 35, 80, 53], 'name': ['Action', 'C...","{'id': [1704], 'name': ['ambassador']}",en,Rush Hour 3,After an attempted assassination on Ambassador...,"{'name': ['New Line Cinema'], 'id': [12]}","{'iso_3166_1': ['US'], 'name': ['United States...",2007-08-08,"{'iso_639_1': ['la', 'en', 'fr', 'ja', 'zh'], ...",5174,"{'cast_id': [2, 3, 4, 5, 6, 7, 8, 9, 26, 27, 2...","{'credit_id': ['52fe43fac3a36847f807b5bd', '52..."
1735,145000000,60.034162,401128639,112.0,5.2,1387,"{'id': [12, 28, 14], 'name': ['Adventure', 'Ac...",{},en,The Mummy: Tomb of the Dragon Emperor,"Archaeologist Rick O'Connell travels to China,...","{'name': ['Universal Pictures', 'China Film Co...","{'iso_3166_1': ['DE', 'US'], 'name': ['Germany...",2008-07-01,"{'iso_639_1': ['en', 'zh', 'sa'], 'name': ['En...",1735,"{'cast_id': [1, 2, 8, 12, 13, 14, 15, 16, 17, ...","{'credit_id': ['52fe4312c3a36847f80384c5', '52..."
79698,27000000,2.418535,0,109.0,4.8,34,"{'id': [28, 12, 878, 10749], 'name': ['Action'...",{},en,The Lovers,The Lovers is an epic romance time travel adve...,"{'name': ['Corsan', 'Bliss Media', 'Limelight ...","{'iso_3166_1': ['AU', 'BE', 'IN'], 'name': ['A...",2015-02-13,"{'iso_639_1': ['en'], 'name': ['English']}",79698,"{'cast_id': [11, 13, 22, 17, 14, 15, 16, 18, 1...","{'credit_id': ['52fe49e0c3a368484e145067', '57..."


We want to select the independent variables used to train the machine learning models. I ignored variables with no predictive value, such as film title and homepage, since these can’t be used to predict the success of a movie.

Some variables were discarded for other reasons: production_countries, because I felt that the information in it would already be captured by production_companies; original_language, because I felt that column would mostly be covered by spoken_languages, with a few exceptions; and popularity, because it was measured after the film was released. Note that the cell below drops only original_title, original_language and popularity, so production_countries is still present in the frame.

In [6]:
movie_neat_filter = movie_details_neat.drop(columns=["original_title", "original_language", "popularity"])

In [7]:
# Replace zero budgets with the median budget - there are a lot, so this is better than removing the rows
# (the median is taken over all rows, zeros included)
movie_neat_filter['budget'] = movie_neat_filter['budget'].replace(0, movie_neat_filter['budget'].median())
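
One caveat with the cell above: because the median is computed over the whole column, a large share of zeros pulls the fill value down, and it becomes 0 once more than half of the rows are zero. A minimal sketch of one possible alternative, taking the median over non-zero budgets only, is shown below. The `toy` frame and its numbers are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: three of five budgets are missing (recorded as 0).
toy = pd.DataFrame({"budget": [0, 0, 0, 100, 300]})

# Median over all rows, zeros included (the value replace(0, median()) would use).
median_all = toy["budget"].median()

# Median over rows that have a recorded budget only.
median_nonzero = toy.loc[toy["budget"] != 0, "budget"].median()

# Fill the zero budgets with the non-zero median.
toy["budget_filled"] = toy["budget"].replace(0, median_nonzero)
```

On this toy input the all-rows median is 0, so the replacement would leave the zeros unchanged, while the non-zero median is 200.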

The variables used as prediction targets were:
- User vote (akin to IMDb rating, referred to as ‘rating’ throughout)
- User-reported box office revenue (referred to as ‘revenue’ throughout)

In [8]:
# Removing zero REVENUES from the data - revenue is super important as this will 
# be one of the variables we want to predict

def remove_zero_revenue(y_revenue, y_rating, X):
    """Drop every sample whose revenue is zero from all three arrays."""
    y_revenue = np.asarray(y_revenue)
    y_rating = np.asarray(y_rating)
    X = np.asarray(X)

    # Boolean mask that is True where a revenue was actually reported
    keep = y_revenue != 0
    return y_revenue[keep], y_rating[keep], X[keep]

In [15]:
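# X keeps every remaining column (including the revenue and vote_average targets)
X = movie_neat_filter.values
# Select the targets by name rather than by column position
y_revenue = movie_neat_filter['revenue'].values
y_rating = movie_neat_filter['vote_average'].values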

In [18]:
# Storing the rows that remain after removing zero revenues in numpy arrays
y_revenue_nonull, y_rating_nonull, X_nonull = remove_zero_revenue(y_revenue, y_rating, X)

In [19]:
movie_filter_rev = movie_neat_filter[movie_neat_filter.revenue != 0]
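
The boolean-mask filter above keeps every row whose revenue is non-zero. As a rough sketch of the same filter on a hypothetical toy frame (the `toy` values are made up for illustration), the number of dropped rows can be checked directly. The `.copy()` call is an addition here, assuming the filtered frame will later have columns reassigned, so that those assignments act on an independent frame rather than on a slice.

```python
import pandas as pd

# Hypothetical toy data: two of four films have no reported revenue.
toy = pd.DataFrame({
    "revenue": [351040419, 0, 258022233, 0],
    "vote_average": [7.3, 4.8, 6.1, 5.0],
})

# True where a revenue was reported.
mask = toy["revenue"] != 0

# Keep only those rows, as an independent copy.
toy_rev = toy[mask].copy()

# How many rows the filter removed.
rows_dropped = len(toy) - len(toy_rev)
```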

We remove the year from the release date, since we will not be treating revenue prediction as a time series problem, and keep only the day of the year to capture the seasonality of the release.
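
To illustrate what discarding the year means in practice, here is a minimal sketch that maps ISO-formatted dates to their day of the year with pandas. The `dates` Series is a toy input built from three release dates shown in the preview table earlier in the notebook.

```python
import pandas as pd

# Toy input: release dates of three films from the preview table.
dates = pd.Series(["2013-05-10", "2007-08-08", "2008-07-01"])

# Parse the strings and keep only the ordinal day within the year (1-366).
day_of_year = pd.to_datetime(dates, format="%Y-%m-%d").dt.dayofyear
```

The result is 130, 220 and 183. Because 2008 is a leap year, 1 July maps to day 183 rather than 182, so the same calendar date can differ by one between leap and non-leap years.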

In [20]:
# converting film date to day of year
# I am arguably losing the 'year', which might be slightly correlated with film success,
# but that opens up a whole new can of worms about ratings and revenues by year
def remove_year(df):
    # Work on a copy so that the caller's DataFrame (possibly a slice) is not modified
    df = df.copy()
    # Parse 'YYYY-MM-DD' strings and keep only the day of the year (1-366)
    df['release_date'] = pd.to_datetime(df['release_date'], format='%Y-%m-%d').dt.dayofyear
    return df

In [26]:
movie_no_year = remove_year(movie_filter_rev)

In [27]:
movie_no_year.to_pickle(cwd + data_dir+"pre-processed/movie_with_numeric_processing.pkl")
