# Feature engineering - Adjusting revenue 
The goal is to predict whether sequels or films based on novels are more successful.

## Tasks this notebook achieves
- Problem: budget has a large number of missing values
    - [x] Replaced missing (zero) budgets with the median budget; this is better than removing the rows, as a large number of rows have no budget
- Problem: revenue data is not good enough
    - [x] Removed zero-revenue rows, which lost 900 rows. Not great, but I can’t predict revenue without revenue.
    - [x] Removed the year from the release date so that only the seasonality of a movie release is taken into account

In [21]:
import os
import sys

In [22]:
sys.path.insert(0, os.path.abspath('/Users/admin/Documents/Jobs/Datatonic/datatonic-challenge/utils/'))  # point this to where the utils directory is on your machine
from util import *

data_dir = get_path_to_data_dir()

In [37]:
import pandas as pd
import json 
import numpy as np
from datetime import datetime

In [24]:
movie_details_join = pd.read_pickle(data_dir + 'pre-processed/movie_details_join.pkl')

movie_details_join.set_index('id').head()

id,budget,popularity,revenue,runtime,vote_average,vote_count,genres,keywords,original_language,original_title,overview,production_companies,production_countries,release_date,spoken_languages,movie_id,cast,crew
64682,105000000,61.196071,351040419,143.0,7.3,3769,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 10749, ""n...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,The Great Gatsby,An adaptation of F. Scott Fitzgerald's Long Is...,"[{""name"": ""Village Roadshow Pictures"", ""id"": 7...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2013-05-10,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",64682,"[{""cast_id"": 2, ""character"": ""Jay Gatsby"", ""cr...","[{""credit_id"": ""52fe46e3c3a368484e0a982d"", ""de..."
9543,150000000,62.169881,335154643,116.0,6.2,2317,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 1241, ""name"": ""persia""}, {""id"": 1965, ...",en,Prince of Persia: The Sands of Time,A rogue prince reluctantly joins forces with a...,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2010-05-19,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",9543,"[{""cast_id"": 5, ""character"": ""Prince Dastan"", ...","[{""credit_id"": ""567e74d4c3a36860e9008e46"", ""de..."
5174,140000000,22.57178,258022233,91.0,6.1,783,"[{""id"": 28, ""name"": ""Action""}, {""id"": 35, ""nam...","[{""id"": 1704, ""name"": ""ambassador""}]",en,Rush Hour 3,After an attempted assassination on Ambassador...,"[{""name"": ""New Line Cinema"", ""id"": 12}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-08-08,"[{""iso_639_1"": ""la"", ""name"": ""Latin""}, {""iso_6...",5174,"[{""cast_id"": 2, ""character"": ""Det. James Carte...","[{""credit_id"": ""52fe43fac3a36847f807b5bd"", ""de..."
1735,145000000,60.034162,401128639,112.0,5.2,1387,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 28, ""...",[],en,The Mummy: Tomb of the Dragon Emperor,"Archaeologist Rick O'Connell travels to China,...","[{""name"": ""Universal Pictures"", ""id"": 33}, {""n...","[{""iso_3166_1"": ""DE"", ""name"": ""Germany""}, {""is...",2008-07-01,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",1735,"[{""cast_id"": 1, ""character"": ""Richard O'Connel...","[{""credit_id"": ""52fe4312c3a36847f80384c5"", ""de..."
79698,27000000,2.418535,0,109.0,4.8,34,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",[],en,The Lovers,The Lovers is an epic romance time travel adve...,"[{""name"": ""Corsan"", ""id"": 7299}, {""name"": ""Bli...","[{""iso_3166_1"": ""AU"", ""name"": ""Australia""}, {""...",2015-02-13,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",79698,"[{""cast_id"": 11, ""character"": ""James Stewart /...","[{""credit_id"": ""52fe49e0c3a368484e145067"", ""de..."


In [25]:
movie_neat_filter = movie_details_join.drop(columns=["original_title", "original_language", "popularity"])

In [26]:
# replacing 0 budgets with the median budget - there are a lot, so this is better than removing the rows
movie_neat_filter['budget']=movie_neat_filter['budget'].replace(0,movie_neat_filter['budget'].median())
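
One caveat with the replacement above: `median()` is computed over the whole column, zeros included, so a large share of zero budgets pulls the fill value down (it would be 0 if more than half of the rows were zero). A minimal sketch of an alternative is to take the median over the non-zero budgets only. The toy values and the name `toy` are made up for illustration, and this is not what the cell above does:

```python
import pandas as pd

# Toy budgets (made-up values): 0 marks a missing budget
toy = pd.DataFrame({"budget": [0, 0, 0, 10_000_000, 20_000_000, 60_000_000]})

# Median over every row, zeros included, as in the cell above
median_all = toy["budget"].median()

# Median over the rows that actually report a budget
median_nonzero = toy.loc[toy["budget"] != 0, "budget"].median()

# Fill the zero budgets with the non-zero median
toy["budget_filled"] = toy["budget"].replace(0, median_nonzero)

print(median_all, median_nonzero)
```

On this toy column the all-rows median is a quarter of the non-zero median, which gives a feel for how much the choice can matter when many budgets are missing.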

The target variables for model prediction were:
- User vote (akin to IMDb rating, referred to as ‘rating’ throughout)
- User-reported box office revenue (referred to as ‘revenue’ throughout)

In [27]:
# Removing zero REVENUES from the data - revenue is super important as this will 
# be one of the variables we want to predict

def remove_zero_revenue(y_revenue, y_rating, X):
    # boolean mask that is True for the rows with a non-zero revenue
    keep = np.asarray(y_revenue) != 0
    # apply the same mask to both targets and to the feature matrix
    return np.asarray(y_revenue)[keep], np.asarray(y_rating)[keep], np.asarray(X)[keep]

In [28]:
X = movie_neat_filter.values
y_revenue = movie_neat_filter['revenue'].values  # selected by name rather than by position 2
y_rating = movie_neat_filter['vote_average'].values  # selected by name rather than by position 4

In [29]:
# storing the rows that remain after removing zero revenues in numpy arrays
y_revenue_nonull, y_rating_nonull, X_nonull = remove_zero_revenue(y_revenue, y_rating, X)

In [30]:
movie_filter_rev = movie_neat_filter[movie_neat_filter.revenue != 0]

We remove the year from the release date because we are not treating revenue prediction as a time series problem; only the seasonality of the release (its day of the year) is kept.
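
As a quick sanity check of what this conversion does, the sketch below maps a few release dates from the table above to their day of the year using `pd.to_datetime` and `.dt.dayofyear` (a vectorised alternative, not the notebook's own helper). Note that after February the same calendar date lands one day later in a leap year, so the mapping is seasonal only up to a one-day shift.

```python
import pandas as pd

# Release dates copied from the first rows of the table above
dates = pd.Series(["2013-05-10", "2010-05-19", "2007-08-08", "2008-07-01"])

# Parse the strings and keep only the position of the date within its year
day_of_year = pd.to_datetime(dates, format="%Y-%m-%d").dt.dayofyear

print(day_of_year.tolist())  # [130, 139, 220, 183]

# Leap-year effect: 1 July is day 183 in 2008 (leap year) but day 182 in 2010
print(pd.Timestamp("2010-07-01").dayofyear)  # 182
```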

In [31]:
# converting film date to day of year
# I am arguably losing the 'year', which might be slightly correlated with film success,
# but that opens up a whole new can of worms about ratings and revenues by year
def remove_year(df):
    # work on a copy: df is a filtered slice, and assigning to it triggers SettingWithCopyWarning
    df = df.copy()
    # parse 'YYYY-MM-DD' strings and keep only the day of the year (1-366)
    df['release_date'] = [datetime.strptime(d, '%Y-%m-%d').timetuple().tm_yday
                          for d in df['release_date']]
    return df

In [32]:
movie_no_year = remove_year(movie_filter_rev)


In [38]:
#Inspecting processed dataframe
movie_no_year

Unnamed: 0,budget,id,revenue,runtime,vote_average,vote_count,genres,keywords,overview,production_companies,production_countries,release_date,spoken_languages,movie_id,cast,crew
0,105000000,64682,351040419,143.0,7.3,3769,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 10749, ""n...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",An adaptation of F. Scott Fitzgerald's Long Is...,"[{""name"": ""Village Roadshow Pictures"", ""id"": 7...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",130,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",64682,"[{""cast_id"": 2, ""character"": ""Jay Gatsby"", ""cr...","[{""credit_id"": ""52fe46e3c3a368484e0a982d"", ""de..."
1,150000000,9543,335154643,116.0,6.2,2317,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 1241, ""name"": ""persia""}, {""id"": 1965, ...",A rogue prince reluctantly joins forces with a...,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",139,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",9543,"[{""cast_id"": 5, ""character"": ""Prince Dastan"", ...","[{""credit_id"": ""567e74d4c3a36860e9008e46"", ""de..."
2,140000000,5174,258022233,91.0,6.1,783,"[{""id"": 28, ""name"": ""Action""}, {""id"": 35, ""nam...","[{""id"": 1704, ""name"": ""ambassador""}]",After an attempted assassination on Ambassador...,"[{""name"": ""New Line Cinema"", ""id"": 12}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",220,"[{""iso_639_1"": ""la"", ""name"": ""Latin""}, {""iso_6...",5174,"[{""cast_id"": 2, ""character"": ""Det. James Carte...","[{""credit_id"": ""52fe43fac3a36847f807b5bd"", ""de..."
3,145000000,1735,401128639,112.0,5.2,1387,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 28, ""...",[],"Archaeologist Rick O'Connell travels to China,...","[{""name"": ""Universal Pictures"", ""id"": 33}, {""n...","[{""iso_3166_1"": ""DE"", ""name"": ""Germany""}, {""is...",183,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",1735,"[{""cast_id"": 1, ""character"": ""Richard O'Connel...","[{""credit_id"": ""52fe4312c3a36847f80384c5"", ""de..."
5,15000000,315011,77000000,120.0,6.5,143,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1299, ""name"": ""monster""}, {""id"": 7671,...",From the mind behind Evangelion comes a hit la...,"[{""name"": ""Cine Bazar"", ""id"": 5896}, {""name"": ...","[{""iso_3166_1"": ""JP"", ""name"": ""Japan""}]",211,"[{""iso_639_1"": ""it"", ""name"": ""Italiano""}, {""is...",315011,"[{""cast_id"": 4, ""character"": ""Rando Yaguchi : ...","[{""credit_id"": ""5921d321c3a368799b05933f"", ""de..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4302,14000000,33693,76901,85.0,6.3,8,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 35, ""name...","[{""id"": 171993, ""name"": ""mumblecore""}]","Unsure of what to do next, 23-year-old Marnie ...",[],"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",263,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",33693,"[{""cast_id"": 1, ""character"": ""Marnie"", ""credit...","[{""credit_id"": ""52fe45309251416c9102a535"", ""de..."
4313,12000,692,6000000,93.0,6.2,110,"[{""id"": 27, ""name"": ""Horror""}, {""id"": 35, ""nam...","[{""id"": 237, ""name"": ""gay""}, {""id"": 900, ""name...",Notorious Baltimore criminal and underground f...,"[{""name"": ""Dreamland Productions"", ""id"": 407}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",72,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",692,"[{""cast_id"": 8, ""character"": ""Divine / Babs Jo...","[{""credit_id"": ""52fe426bc3a36847f801d203"", ""de..."
4316,20000,36095,99000,111.0,7.4,63,"[{""id"": 80, ""name"": ""Crime""}, {""id"": 27, ""name...","[{""id"": 233, ""name"": ""japan""}, {""id"": 549, ""na...",A wave of gruesome murders is sweeping Tokyo. ...,"[{""name"": ""Daiei Studios"", ""id"": 881}]","[{""iso_3166_1"": ""JP"", ""name"": ""Japan""}]",310,"[{""iso_639_1"": ""ja"", ""name"": ""\u65e5\u672c\u8a...",36095,"[{""cast_id"": 3, ""character"": ""Kenichi Takabe"",...","[{""credit_id"": ""52fe45cc9251416c9103eb7b"", ""de..."
4319,7000,14337,424760,77.0,6.9,658,"[{""id"": 878, ""name"": ""Science Fiction""}, {""id""...","[{""id"": 1448, ""name"": ""distrust""}, {""id"": 2101...",Friends/fledgling entrepreneurs invent a devic...,"[{""name"": ""Thinkfilm"", ""id"": 446}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",282,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",14337,"[{""cast_id"": 1, ""character"": ""Aaron"", ""credit_...","[{""credit_id"": ""52fe45e79251416c75066791"", ""de..."


### Making ID maps with the new DataFrame

In [34]:
movie_copy = movie_no_year.copy() #copy the dataframe so that the original isn't altered
cols_to_dict = ['genres', 'keywords', 'production_companies', 'production_countries',
            'spoken_languages', 'cast', 'crew'] #list of columns that need to be converted from string to dictionary 

movie_dict = convert_to_dict(movie_copy, cols_to_dict, save=True, save_dir = data_dir + 'pre-processed/')
movie_dict.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column][i] = list_to_dict(list_of_dict)  # Convert list of dictionaries to one dictionary


KeyError: 4
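
The `KeyError: 4` is consistent with the zero-revenue filter: it dropped the row labelled 4 (the output of `movie_no_year` jumps from index 3 to 5), and the line shown in the warning, `df[column][i]`, looks rows up by label. The sketch below reproduces the effect on a toy frame and shows one possible fix, resetting the index before the conversion. The toy data is made up, and because the body of `convert_to_dict` is not shown here, the positional loop is an assumption.

```python
import pandas as pd

# Toy frame (made-up values) with the same kind of gap: label 4 is missing
toy = pd.DataFrame({"genres": ["a", "b", "c", "d", "e", "f"]})
toy = toy[toy.index != 4]

# Looping over positions 0..len-1 and indexing by label fails at the gap
try:
    for i in range(len(toy)):
        toy["genres"][i]
    failed_at = None
except KeyError as err:
    failed_at = err.args[0]

print(failed_at)

# Possible fix: give the rows consecutive labels 0..n-1 before the loop
toy_reset = toy.reset_index(drop=True)
values = [toy_reset["genres"][i] for i in range(len(toy_reset))]
print(values)
```

In this notebook that would presumably amount to calling `movie_no_year.reset_index(drop=True)` before passing the frame to `convert_to_dict`.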

Making the id_maps 

In [None]:
# data_copy is not defined in this notebook; movie_copy is assumed to be the intended frame
id_maps = make_id_maps(movie_copy, cols_to_dict[:-2], True, data_dir + 'pre-processed/') #skipping cast and crew as it's more complicated (there are more than 2 keys)
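
The helper `make_id_maps` lives in `util` and is not shown here. As a rough illustration of the idea only, the sketch below parses JSON strings in the format of the `genres` column and collects an id-to-name map. The function name `build_id_map` and the returned structure are hypothetical and may differ from what `make_id_maps` actually produces; other columns use different key names (for example `iso_3166_1`), which this sketch does not handle.

```python
import json

import pandas as pd


def build_id_map(series):
    """Collect {id: name} over a column of JSON-encoded lists of dicts (hypothetical helper)."""
    id_map = {}
    for raw in series:
        # each cell is a JSON list of {"id": ..., "name": ...} dicts
        for entry in json.loads(raw):
            id_map[entry["id"]] = entry["name"]
    return id_map


# Two cells in the format of the genres column; the ids and names appear in the
# table above, but these particular combinations are made up
genres = pd.Series([
    '[{"id": 18, "name": "Drama"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
])

print(build_id_map(genres))
```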

The next step is to select the independent variables for model training. I ignored some variables that are not useful, such as film title and homepage, as these can’t be used to predict the success of a movie.

Some variables were discarded for other reasons:
- `production_countries`, because I felt that the information therein would already be captured by `production_companies`.
- `original_language`, because I felt that it would mostly be covered by `spoken_languages`, with a few exceptions.
- `popularity`, because it was measured after the film was released and so would not be available at prediction time.