# Movie Analysis

In [1]:
import pandas as pd
import numpy as np
import networkx as nx
from sklearn import svm
from sklearn import linear_model
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from catboost import Pool, CatBoostRegressor
import re, json, requests, seaborn, warnings
warnings.filterwarnings( 'ignore' )
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt, rcParams
%matplotlib inline
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from ipyparallel import Client
import time
from keras import initializers
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.utils import np_utils


Using TensorFlow backend.


In [2]:
def text_message(text,number):
    """
    Accepts a text message string, and a phone number string
    Pulls up webpage, enters text and phone number and sends a text,
    then closes webpage.
    Returns Nothing
    """
    browser = webdriver.Chrome()
    url = "http://www.txtdrop.com/"
    browser.get( url )
    email = browser.find_element_by_id("emailfrom")
    email.send_keys("brookemosby@hotmail.com")
    first_3 = browser.find_element_by_id("npa")
    first_3.send_keys(number[:3])
    second_3 = browser.find_element_by_id("exchange")
    second_3.send_keys(number[3:6])
    last_4 = browser.find_element_by_id("number")
    last_4.send_keys(number[6:])
    message = browser.find_element_by_name("body")
    message.send_keys(text)
    browser.find_element_by_name("submit").click()
    browser.close()

## Abstract
Cinema has become one of the most profitable industries over the past century. Total box office revenue in North America alone amounted to $11.38 billion in 2016. With the possibility of great success comes a large risk of financial failure. This machine learning analysis is motivated by the question of what makes a movie successful. Plenty of quantitative data is available for movies, such as budget, release date, and ratings, but this analysis also attempts to quantify movie information that is less measurable and then use it to predict movie success.

## Introduction
Research has been done to determine which aspects of a movie make it more successful; however, much of this research is contradictory. The research paper "Early Predictions of Movie Success: the Who, What, and When of Profitability" states that movies with a motion picture content rating of 'R' will likely have lower profits, whereas the research paper "What Makes A Great Movie?" states that movies with an 'R' rating will have a higher box office. Both papers analyzed thousands of movies but came to opposite conclusions. Variables used to predict movie success in these studies included budget, motion picture content rating, and actor popularity.

Based on these previous models, the dataset used will include movie title length, run-time, motion picture content rating, director, genre, release date, actors, an actor networking score, budget, opening weekend box-office revenue, and a list of other predictor variables. Movie success will be determined by whether or not the movie turns a profit.

## Data Scraped, Downloaded, Cleaned & Engineered
### Beginning Dataset
A beginning dataset of 10,000 movies is downloaded from IMDb, with each entry containing the movie title, URL on IMDb, rating, run-time, year, genres, number of votes, release date, and directors. For each movie in this dataset, additional information on the budget, gross income, opening weekend box-office revenue, actors, Oscar nominations, Oscars won, other award nominations, other awards won, MetaCritic score, and content rating is scraped and cleaned.
The data points are collected from IMDb, a reputable source of information. According to its website,

>"we [IMDb] actively gather information from and verify items with studios and 
filmmakers".

### Cleaning Data
After each data point is gathered, the dataset is complete, although the information is not yet clean or uniform. The first cleaning step will be to remove all commas from every column in the DataFrame, which will make it easier to convert monetary amounts to integers. Next, each date in the Release column will be converted to a pandas datetime object, which will simplify any calculation that relies on the movie's release date. Each monetary amount will be converted to USD and stored as an integer. Finally, each unique genre will be made into its own column, holding a true or false boolean for each movie entry.
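
The cleaning steps above could look roughly like the following sketch. The column names (`Budget`, `Release Date`, `Genres`) and the sample rows are assumptions for illustration, currency conversion is omitted, and this is not the notebook's actual cleaning script.

```python
import pandas as pd

# Hypothetical raw rows; the column names are assumptions for illustration.
raw = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Budget": ["$1,500,000", "$20,000,000"],
    "Release Date": ["2001-05-18", "1999-12-25"],
    "Genres": ["Action, Comedy", "Drama"],
})

def clean_movies(frame):
    """Return a cleaned copy: integer money, datetime dates, genre flags."""
    out = frame.copy()
    # Strip commas and currency symbols, then convert to integers.
    out["Budget"] = (
        out["Budget"].str.replace(r"[^0-9]", "", regex=True).astype("int64")
    )
    # Parse release dates into pandas datetime objects.
    out["Release Date"] = pd.to_datetime(out["Release Date"])
    # Drop stray spaces so each genre yields exactly one column,
    # then build one boolean column per unique genre.
    genres = out["Genres"].str.replace(" ", "", regex=False)
    genre_flags = genres.str.get_dummies(sep=",").astype(bool)
    genre_flags = genre_flags.add_prefix("Genre: ")
    return pd.concat([out.drop(columns="Genres"), genre_flags], axis=1)

cleaned = clean_movies(raw)
```

Stripping spaces before splitting avoids near-duplicate columns that differ only in whitespace.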
### Feature Engineering
To reconcile monetary amounts from different years, a dataset containing the Consumer Price Index (CPI) for each year from 1914 onward will be used to adjust the amounts for inflation. The CPI tracks the average price level of consumer goods and services, so the ratio of two years' CPI values converts an amount from one year's dollars into another's. The length of the movie title will be added, and a NetworkX graph of all actors will be made. This network will connect two actor nodes if the actors appear in a movie together, and each edge will be weighted by the number of movies the two actors appear in together. An actor popularity score will be calculated for each movie from its cast, based on how many other movies the actors appear in with other actors and on the actors' income.
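
As a sketch of the inflation adjustment, a nominal amount can be scaled by the ratio of the CPI in a reference year to the CPI in the release year. The CPI figures, the reference year, and the function name below are placeholders for illustration, not values from the dataset used here.

```python
# Placeholder CPI values for illustration only; real values would come
# from the CPI dataset mentioned above.
cpi_by_year = {1980: 80.0, 2000: 170.0, 2016: 240.0}

def adjust_for_inflation(amount, year, cpi, reference_year=2016):
    """Express `amount` from `year` in `reference_year` dollars."""
    return amount * cpi[reference_year] / cpi[year]

budget_1980 = 10_000_000
adjusted = adjust_for_inflation(budget_1980, 1980, cpi_by_year)
```

With these placeholder values, prices tripled between 1980 and 2016, so the adjusted budget is three times the nominal one.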

The total variables in the new dataset are movie title, title length, motion picture content rating, run-time, IMDb rating, genres, MetaCritic score, Oscar nominations, Oscar wins, other award nominations, other award wins, director, release date, budget, opening weekend, gross, profit, budget adjusted for inflation, opening weekend adjusted for inflation, gross adjusted for inflation, profit adjusted for inflation, the top ten actors in the movie, and actor popularity score. A separate network will hold the actor nodes and their connections.
### Actor Network
A network of actors is built to help predict the success of a movie. Each node in the network is an actor who has appeared in a movie, and two actor nodes share an edge if the actors appeared in a movie together. The weight on each edge is the number of movies in which the two connected actors have appeared together.
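
A minimal sketch of such a network with NetworkX follows. The movie and actor names are placeholders, and the weighted degree computed at the end is only one possible ingredient of a popularity score, not the score used in this analysis.

```python
from itertools import combinations
import networkx as nx

# Hypothetical casts; the actor and movie names are placeholders.
casts = {
    "Movie A": ["Actor 1", "Actor 2", "Actor 3"],
    "Movie B": ["Actor 1", "Actor 2"],
}

def build_actor_network(casts):
    """Connect actors who share a movie; edge weight counts shared movies."""
    graph = nx.Graph()
    for actors in casts.values():
        graph.add_nodes_from(actors)
        for a, b in combinations(sorted(set(actors)), 2):
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1
            else:
                graph.add_edge(a, b, weight=1)
    return graph

network = build_actor_network(casts)
# Weighted degree: total number of co-appearances for one actor.
actor_1_strength = network.degree("Actor 1", weight="weight")
```

Here "Actor 1" and "Actor 2" share two movies, so their edge carries weight 2, while each edge to "Actor 3" carries weight 1.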

TODO: Insert picture here of what network should look like with example

In [2]:
df = pd.read_csv("Result_Data/total_engineered.csv",encoding = "ISO-8859-1")
del df["Unnamed: 0"]
df = df.fillna(df.mean())

In [3]:
df_x = df[['Actor_0', 'Actor_1', 'Actor_2', 'Actor_3', 'Actor_4', 'Actor_5', 'Actor_6', 'Actor_7', 'Actor_8', 'Actor_9', 
           'Budget', 'Directors', 'Release Date', 'Runtime (mins)', 'Title', 'Year', 'Genre: Short', 'Genre:  Comedy', 
           'Genre: Fantasy', 'Genre: Film-Noir', 'Genre: War', 'Genre: Musical', 'Genre:  Sport', 'Genre: Biography', 
           'Genre: Action', 'Genre:  Fantasy', 'Genre:  Animation', 'Genre:  Biography', 'Genre: Mystery', 
           'Genre:  Musical', 'Genre:  Romance', 'Genre: Thriller', 'Genre:  Film-Noir', 'Genre:  History', 
           'Genre: Western', 'Genre: Drama', 'Genre: Sci-Fi', 'Genre:  Horror', 'Genre: Romance', 'Genre: Adventure', 
           'Genre:  Family', 'Genre:  Sci-Fi', 'Genre: Animation', 'Genre:  Music', 'Genre: Music', 'Genre: History', 
           'Genre:  Mystery', 'Genre:  Thriller', 'Genre: Comedy', 'Genre:  Crime', 'Genre: Horror', 'Genre:  Drama', 
           'Genre:  War', 'Genre:  Western', 'Genre:  Adventure', 'Genre: Family', 'Genre:  Action', 'Genre: Crime', 
           'Content Rating: PASSED', 'Content Rating: TV-MA', 'Content Rating: X', 'Content Rating: NC-17', 
           'Content Rating: TV-14', 'Content Rating: M', 'Content Rating: GP', 'Content Rating: TV-PG', 
           'Content Rating: PG', 'Content Rating: PG-13', 'Content Rating: G', 'Content Rating: NR', 
           'Content Rating: APPROVED', 'Content Rating: UNRATED', 'Content Rating: M/PG', 'Content Rating: TV-13', 
           'Content Rating: NOT RATED', 'Content Rating: TV-G', 'Content Rating: R', 'Decade',  'Budget_Adjusted',  
           'Length of Title', 'Directors Prev Number Movies', 'Directors Prev Mean Profit', 'Directors Prev Mean IMDb',
           'Directors Prev Mean Meta', 'Directors Prev Mean Num Votes', 'Directors Prev Mean Nominations', 
           'Directors Prev Mean Wins', 'Actor Weights']]
df_y = df[['Gross', 'IMDb Rating', 'Meta Score', 'Num Votes', 'Oscar Nominations', 'Oscar Wins', 'Other Nominations', 
           'Other Wins','Profit','Gross_Adjusted',  'Profit_Adjusted', 'Profit_Bool', 'Total Nominations', 'Total Wins']]

## Methods
TODO: Look into SVM
Various supervised machine learning models were used to predict the characteristics that determine a movie's success. The variables that our models attempt to predict are the movie's IMDb rating, Metacritic score, the number of Oscar nominations, the number of Oscar wins, the number of other award nominations, the number of other award wins, and whether or not the movie turned a profit. These variables will hereafter be referred to as the movies' dependent variables. For each model, the data is split into a training set and a testing set to determine how accurate each method proves.
### PCA
Principal Component Analysis (PCA) is used to reduce the dimensionality of the large dataset while retaining important information. For each Machine Learning model, PCA is used to reduce dimensionality to 10, 20, and 50 components, and then compared to find the best dimensionality. Because our dataset contains over 100 columns, PCA is essential for our models to avoid extremely costly computations and produce more accurate results.
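
One quick way to see what each candidate dimensionality keeps is to compare the fraction of variance retained. The sketch below uses a synthetic random matrix as a stand-in for the real features, so the numbers it produces say nothing about the movie dataset; the notebook itself chooses the dimensionality by cross-validated grid search.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the feature matrix: 200 rows, 60 columns.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 60))

# Compare how much variance each candidate dimensionality retains.
retained = {}
for n_components in (10, 20, 50):
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(features)
    retained[n_components] = pca.explained_variance_ratio_.sum()
```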
### Linear Regression
Linear regression, a simple supervised machine learning model, is the first model used to predict the movies' dependent variables. It models each dependent variable as a linear function of the numeric and categorical independent variables. Because it is simple to implement and cheap to compute, it is a good starting model for our data.
### Ridge Regression
The ridge regression model is essentially the same as the linear regression model; however, it adds an L2 regularization term to the cost function to prevent overfitting, which may make it more accurate than linear regression.
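
For reference, the standard ridge objective adds a squared penalty on the coefficients to the least-squares loss. The symbols $X$, $y$, $\beta$, and $\alpha$ are notation introduced here, with $\alpha$ corresponding to the `alpha` parameter of scikit-learn's `Ridge`:

$$\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2$$

Setting $\alpha = 0$ recovers ordinary least squares.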
### Random Forest Regression
The random forest regression model is a type of additive model that makes predictions by combining decisions from a sequence of base models. A random forest is made by growing multiple binary trees. At each node the data is split into two children nodes, a decision made based on the residual sum of squares, or RSS. For random forests using regression, the predicted value at a node is the average response variable for all observations in the node.
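
The RSS-based split at a single node can be sketched as follows. This is a toy illustration on one feature with made-up data, not scikit-learn's implementation; the function names are introduced here.

```python
import numpy as np

def rss(values):
    """Residual sum of squares around the mean (the node's prediction)."""
    if len(values) == 0:
        return 0.0
    return float(((values - values.mean()) ** 2).sum())

def best_split(x, y):
    """Return (threshold, total_rss) for the best binary split on x."""
    best = (None, float("inf"))
    # Try every distinct value except the largest as a threshold.
    for threshold in np.unique(x)[:-1]:
        left, right = y[x <= threshold], y[x > threshold]
        total = rss(left) + rss(right)
        if total < best[1]:
            best = (float(threshold), total)
    return best

# Toy data: the response jumps when the feature exceeds 3.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([10.0, 11.0, 9.0, 30.0, 31.0, 29.0])
threshold, total_rss = best_split(x, y)
```

On this toy data the best threshold is 3, which separates the low responses from the high ones.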
### Catboost
Several implementations of gradient-boosted decision trees exist, among them GBM, XGBoost, LightGBM, and Catboost. Catboost seems to outperform the other implementations, even when using only its default parameters. It also seems to be faster than the other methods, making it an appealing machine learning model to try on our dataset. In boosting, each tree is grown using information from previously learned trees: each new tree is fitted on a modified version of the original data set. Because the random forest regression gave the best results, the next natural step was to examine how Catboost performed on our dataset.
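
The idea of fitting each tree on a modified version of the data can be sketched with plain gradient boosting under squared loss, where each new tree is fitted to the residuals left by the earlier trees. This uses scikit-learn's `DecisionTreeRegressor` on synthetic data and is a generic illustration, not Catboost's own algorithm; the function names are introduced here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=50, learning_rate=0.1, depth=2):
    """Fit shallow trees sequentially, each on the current residuals."""
    base = float(y.mean())
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - prediction  # what the earlier trees got wrong
        tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
        tree.fit(X, residual)
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def boosted_predict(base, trees, X, learning_rate=0.1):
    """Sum the scaled tree outputs on top of the constant base prediction."""
    return base + learning_rate * sum(tree.predict(X) for tree in trees)

# Synthetic regression problem: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

base, trees = boost(X, y)
baseline_mse = float(np.mean((y - base) ** 2))
boosted_mse = float(np.mean((y - boosted_predict(base, trees, X)) ** 2))
```

Comparing `boosted_mse` with `baseline_mse` shows how much the sequence of small trees improves on predicting the mean alone.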
### Neural Network
Neural networks are one of the biggest advancements in machine learning in the 21st century. A neural network is based on a collection of connected nodes, or neurons. In neural network implementations, the output of each node is calculated by a non-linear activation function applied to the weighted sum of its inputs. The edges in the network typically have weights that are adjusted as learning proceeds; a weight increases or decreases the strength of the signal passing along its edge. Neurons may have a threshold that the sum of inputs must exceed before they are activated. Typically, neurons are organized in layers, where different layers may perform different kinds of transformations on their inputs. In our implementation, the neural network was costly to compute, so only the profit boolean was predicted with the neural network.
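
The forward pass described above can be sketched in NumPy. The layer sizes and random weights are arbitrary placeholders; the activations (ReLU in the hidden layer, softmax at the output) match the ones used in the Keras model later in this notebook, but this is an illustration rather than that model.

```python
import numpy as np

def relu(z):
    """Non-linear activation: pass positive sums, zero out negative ones."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Turn each row of scores into probabilities that sum to 1."""
    shifted = z - z.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """One hidden ReLU layer followed by a softmax output layer."""
    hidden = relu(inputs @ w_hidden + b_hidden)
    return softmax(hidden @ w_out + b_out)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 5))  # 4 samples, 5 features
w_hidden, b_hidden = rng.normal(size=(5, 8)), np.zeros(8)
w_out, b_out = rng.normal(size=(8, 2)), np.zeros(2)
probabilities = forward(inputs, w_hidden, b_hidden, w_out, b_out)
```

Training would then adjust the weight matrices to reduce a loss computed from `probabilities` and the true labels.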

In [6]:
pca = PCA()
df_x_temp = df_x.select_dtypes(include=['float64','int','bool']).astype('float')
df_y_temp = df_y.select_dtypes(include=['float64','int','bool']).astype('float')
tr_x, tt_x, tr_y, tt_y = tts(df_x_temp, df_y_temp, test_size = .2)
def results(es_y1):
    """
    Accepts a regression prediction for variables and prints 
    accuracy values of prediction, so it is easier to digest.
    Returns Nothing
    """
    print("\nMovie Gross Average Percent Error:")
    print((abs(es_y1[0:,0]-np.array(tt_y.astype(float))[0:,0])/abs(np.array(tt_y.astype(float))[0:,0])).mean()*100,"%")
    print("\nIMDb Rating Average Percent Error:")
    print((abs(es_y1[0:,1]-np.array(tt_y.astype(float))[0:,1])/abs(np.array(tt_y.astype(float))[0:,1])).mean()*100,"%")
    print("\nMeta Score Average Percent Error:")
    print((abs(es_y1[0:,2]-np.array(tt_y.astype(float))[0:,2])/abs(np.array(tt_y.astype(float))[0:,2])).mean()*100,"%")
    print("\nNumber of Votes Average Percent Error:")
    print((abs(es_y1[0:,3]-np.array(tt_y.astype(float))[0:,3])/abs(np.array(tt_y.astype(float))[0:,3])).mean()*100,"%")
    print("\nOscar Nominations Average Error:")
    print(abs(es_y1[0:,4]-np.array(tt_y.astype(float))[0:,4]).mean())
    print("\nOscar Wins Average Error:")
    print(abs(es_y1[0:,5]-np.array(tt_y.astype(float))[0:,5]).mean())
    print("\nOther Nominations Average Error:")
    print(abs(es_y1[0:,6]-np.array(tt_y.astype(float))[0:,6]).mean())
    print("\nOther Wins Average Error:")
    print(abs(es_y1[0:,7]-np.array(tt_y.astype(float))[0:,7]).mean())
    print("\nProfit Average Percent Error:")
    print((abs(es_y1[0:,8]-np.array(tt_y.astype(float))[0:,8])/abs(np.array(tt_y.astype(float))[0:,8])).mean()*100,"%")
    print("\nGross Adjusted Average Percent Error:")
    print((abs(es_y1[0:,9]-np.array(tt_y.astype(float))[0:,9])/abs(np.array(tt_y.astype(float))[0:,9])).mean()*100,"%")
    print("\nProfit Adjusted Average Percent Error:")
    print((abs(es_y1[0:,10]-np.array(tt_y.astype(float))[0:,10])/abs(np.array(tt_y.astype(float))[0:,10])).mean()*100,"%")
    print("\nProfit Bool Average Error:")
    print(abs(es_y1[0:,11]-np.array(tt_y.astype(float))[0:,11]).mean())
    print("\nProfit Bool Accuracy:")
    es_y1_profit = (es_y1[0:,11]+.5).astype(int)
    tt_y1_profit = np.array(tt_y.astype(float))[0:,11]
    print(sum((es_y1_profit==tt_y1_profit).astype(int))/len(tt_y1_profit))
    print("\nTotal Nominations Average Error:")
    print(abs(es_y1[0:,12]-np.array(tt_y.astype(float))[0:,12]).mean())
    print("\nTotal Wins Average Error:")
    print(abs(es_y1[0:,13]-np.array(tt_y.astype(float))[0:,13]).mean())


In [6]:
#Implementing linear regression
parameters = {'pca__n_components':[10,20,50]}
regr = linear_model.LinearRegression()
pipe = Pipeline(steps=[('pca', pca), ('regr', regr)])
estimator = GridSearchCV(pipe, parameters, n_jobs = -1, verbose = 1)
estimator.fit(tr_x,tr_y)
es_y1 = estimator.best_estimator_.predict(tt_x)
results(es_y1)
text_message("Finished Linear","6787995970")

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed:    0.7s remaining:    0.9s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    1.8s finished



Movie Gross Average Percent Error:
21174.044777587114 %

IMDb Rating Average Percent Error:
10.943972540307133 %

Meta Score Average Percent Error:
22.233386159641668 %

Number of Votes Average Percent Error:
227.25520075838648 %

Oscar Nominations Average Error:
0.36882244830576283

Oscar Wins Average Error:
0.27553806746243514

Other Nominations Average Error:
8.999050858301405

Other Wins Average Error:
5.73574354056335

Profit Average Percent Error:
654.222134871736 %

Gross Adjusted Average Percent Error:
27478.05735135932 %

Profit Adjusted Average Percent Error:
708.8915503027722 %

Profit Bool Average Error:
0.3761595010478481

Profit Bool Accuracy:
0.7212212212212212

Total Nominations Average Error:
9.127342662202654

Total Wins Average Error:
5.8743355617785715


In [7]:
print(estimator.best_params_)

{'pca__n_components': 50}


## Linear Regression Model Analysis
Linear regression predicts very few of the variables accurately. It is important to note that the best estimator used PCA with 50 components, which is the largest number of components tried. Some variables show potential for a large increase in accuracy. The first notable variable is the IMDb rating, which has an average percent error of about 11% that other models may reduce. The Oscar nominations and wins are predicted with small average errors by linear regression alone, and the boolean for whether or not a movie returns a profit is correct more than 70% of the time. This is perhaps one of the most important variables for movie producers, and one that will hopefully be predicted more accurately by other models. The other variables are predicted poorly by linear regression.

In [8]:
#implementing Ridge regression
regr = linear_model.Ridge()
parameters = {'pca__n_components':[10,20,50]}
pca = PCA()
pipe = Pipeline(steps=[('pca', pca), ('regr', regr)])
estimator = GridSearchCV(pipe, parameters, n_jobs = -1, verbose = 1)
estimator.fit(tr_x,tr_y)
es_y2 = estimator.best_estimator_.predict(tt_x)
results(es_y2)
text_message("Finished Ridge","6787995970")

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed:    0.5s remaining:    0.6s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    1.7s finished



Movie Gross Average Percent Error:
21163.699509043057 %

IMDb Rating Average Percent Error:
10.9444872552927 %

Meta Score Average Percent Error:
22.232002873434183 %

Number of Votes Average Percent Error:
227.13037457013425 %

Oscar Nominations Average Error:
0.3687895465107543

Oscar Wins Average Error:
0.2754840332277056

Other Nominations Average Error:
8.99762356785237

Other Wins Average Error:
5.735122254309773

Profit Average Percent Error:
653.8567426068672 %

Gross Adjusted Average Percent Error:
27462.440433014228 %

Profit Adjusted Average Percent Error:
708.3469770591662 %

Profit Bool Average Error:
0.3761846360719753

Profit Bool Accuracy:
0.7207207207207207

Total Nominations Average Error:
9.126066676232615

Total Wins Average Error:
5.873626874233154


In [9]:
print(estimator.best_params_)

{'pca__n_components': 50}


## Ridge Regression Model Analysis
Ridge regression performs as poorly as the linear regression model, with very few notable differences. This makes sense because the only difference between the two models is that ridge regression adds an L2 regularization term to the cost function.

In [10]:
#consider normalizing data to better predict
#implementing random forest
parameters = {'pca__n_components':[10,20,50],'rfr__n_estimators':[10,100], 'rfr__criterion':['mae', 'mse'], 'rfr__max_features':['auto', 'sqrt', 'log2']}
pca = PCA()
rfr = RFR()
pipe = Pipeline(steps=[('pca', pca), ('rfr', rfr)])
estimator = GridSearchCV(pipe, parameters, n_jobs = -1, verbose = 1)
estimator.fit(tr_x,tr_y)
es_y3 = estimator.best_estimator_.predict(tt_x)
results(es_y3)
text_message("Finished Random Forests","6787995970")

Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed: 31.6min
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 263.1min finished



Movie Gross Average Percent Error:
28360.003945479428 %

IMDb Rating Average Percent Error:
11.57944501421292 %

Meta Score Average Percent Error:
21.272463125108494 %

Number of Votes Average Percent Error:
217.0941046687056 %

Oscar Nominations Average Error:
0.3422434934934935

Oscar Wins Average Error:
0.23656406406406405

Other Nominations Average Error:
8.54305055055055

Other Wins Average Error:
5.568672988464655

Profit Average Percent Error:
895.3850933188331 %

Gross Adjusted Average Percent Error:
33359.201330645235 %

Profit Adjusted Average Percent Error:
822.3867180805835 %

Profit Bool Average Error:
0.341981981981982

Profit Bool Accuracy:
0.7657657657657657

Total Nominations Average Error:
8.65530071738405

Total Wins Average Error:
5.734206021497688


In [11]:
print(estimator.best_params_)

{'pca__n_components': 50, 'rfr__criterion': 'mse', 'rfr__max_features': 'log2', 'rfr__n_estimators': 100}


## Random Forest Regression Model Analysis
The Random Forest Regressor outperforms both previous models in determining whether or not a movie turns a profit, with about 77% accuracy, and in predicting Oscar wins and Oscar nominations. However, the IMDb rating error is slightly worse with the random forest regressor (about 11.6% versus 10.9%), while the Meta Score error improves only slightly (about 21.3% versus 22.2%). These are arguably the least important variables when predicting movie success.

In [4]:
# Implementing Catboost
df_x_temp = df_x.select_dtypes(include=['float64','int','bool']).astype('float')
df_y_temp = df_y.select_dtypes(include=['float64','int','bool']).astype('float')
"""df_y_temp.iloc[0]  = (df_y_temp.iloc[0]  - df_y_temp.iloc[0].mean())  / np.std(df_y_temp.iloc[0])
df_y_temp.iloc[3]  = (df_y_temp.iloc[3]  - df_y_temp.iloc[3].mean())  / np.std(df_y_temp.iloc[3])
df_y_temp.iloc[8]  = (df_y_temp.iloc[8]  - df_y_temp.iloc[8].mean())  / np.std(df_y_temp.iloc[8])
df_y_temp.iloc[9]  = (df_y_temp.iloc[9]  - df_y_temp.iloc[9].mean())  / np.std(df_y_temp.iloc[9])
df_y_temp.iloc[10] = (df_y_temp.iloc[10] - df_y_temp.iloc[10].mean()) / np.std(df_y_temp.iloc[10])"""

tr_x, tt_x, tr_y, tt_y = tts(df_x_temp, df_y_temp, test_size = .2)
es_y4 = np.zeros_like(tt_y)

In [74]:
for i in range(len(df_x)):
    if df_x.iat[i,87] == 1114:
        print(i,df_x.iat[i,14],df_x.iat[i,87])

7744 Sherlock Holmes: A Game of Shadows 1114.0


In [112]:
for i in range(len(tt_x)):
    print(i,tt_x.iat[i,14])

0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
...
1996 0.0
1997 0.0


In [113]:
tt_x1 = pd.DataFrame(tt_x.iloc[1500]).transpose()
tt_x1

Unnamed: 0,Budget,Runtime (mins),Genre: Short,Genre: Comedy,Genre: Fantasy,Genre: Film-Noir,Genre: War,Genre: Musical,Genre: Sport,Genre: Biography,...,Budget_Adjusted,Length of Title,Directors Prev Number Movies,Directors Prev Mean Profit,Directors Prev Mean IMDb,Directors Prev Mean Meta,Directors Prev Mean Num Votes,Directors Prev Mean Nominations,Directors Prev Mean Wins,Actor Weights
9165,27557140.0,116.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,38504710.0,12.0,3.0,30070580.0,7.366667,62.34004,7363.0,0.0,0.333333,159.0


In [114]:


for iterations in [500]:
    for learning_rate in [.10]:
        for depth in [10]:
            estimator = CatBoostRegressor(iterations=iterations, 
                                      learning_rate=learning_rate, 
                                      depth=depth,l2_leaf_reg=64) 
            for i in range(14):
                estimator.fit(tr_x,tr_y.iloc[0:,i],verbose=False)
                es_y4[0:,i] = estimator.predict(tt_x1)
            print("\n\n\n",iterations,learning_rate,depth)    
            results(es_y4)


# for iterations in [500]:
#     for learning_rate in [.10]:
#         for depth in [10]:
#             estimator = CatBoostRegressor(iterations=iterations, 
#                                       learning_rate=learning_rate, 
#                                       depth=depth,l2_leaf_reg=64) 
            
#             estimator.fit(tr_x,tr_y.iloc[0:,11],verbose=False)
#             es_y4[0:,11] = estimator.predict(tt_x.iloc[7744])
#             print("\n\n\n",iterations,learning_rate,depth)    
#             results(es_y4)




 500 0.1 10

Movie Gross Average Percent Error:
12893.961487128814 %

IMDb Rating Average Percent Error:
15.90555026306198 %

Meta Score Average Percent Error:
28.26638203975 %

Number of Votes Average Percent Error:
102.4109819077627 %

Oscar Nominations Average Error:
0.5469083196943089

Oscar Wins Average Error:
0.39987936557821724

Other Nominations Average Error:
7.577882994600327

Other Wins Average Error:
5.218771331605518

Profit Average Percent Error:
1278.1668574906416 %

Gross Adjusted Average Percent Error:
15596.125340068833 %

Profit Adjusted Average Percent Error:
1586.2915818269992 %

Profit Bool Average Error:
0.3591516519026242

Profit Bool Accuracy:
0.6481481481481481

Total Nominations Average Error:
8.120106943897481

Total Wins Average Error:
5.643949805179577


In [8]:
#we use a for loop, because Catboost does not play nicely with GridSearch

for iterations in [500,1000]:
    for learning_rate in [.25,.10,.05,.01]:
        for depth in [15,10,6,2]:
            estimator = CatBoostRegressor(iterations=iterations, 
                                      learning_rate=learning_rate, 
                                      depth=depth,l2_leaf_reg=64) 
            for i in range(14):
                estimator.fit(tr_x,tr_y.iloc[0:,i],verbose=False)
                es_y4[0:,i] = estimator.predict(tt_x)
            print("\n\n\n",iterations,learning_rate,depth)    
            results(es_y4)
text_message("Finished Catboost","6787995970")


Iteration with suspicious time 1.63 sec ignored in overall statistics.



 500 0.25 15

Movie Gross Average Percent Error:
20851.47488871568 %

IMDb Rating Average Percent Error:
10.311584356690005 %

Meta Score Average Percent Error:
22.41840499651566 %

Number of Votes Average Percent Error:
148.99903648894932 %

Oscar Nominations Average Error:
0.39958256941917

Oscar Wins Average Error:
0.24459137082913626

Other Nominations Average Error:
9.032404561288212

Other Wins Average Error:
6.180924096406043

Profit Average Percent Error:
817.065869209421 %

Gross Adjusted Average Percent Error:
23431.586195488468 %

Profit Adjusted Average Percent Error:
791.6521968773422 %

Profit Bool Average Error:
0.27077318778857123

Profit Bool Accuracy:
0.7897897897897898

Total Nominations Average Error:
9.171404943570234

Total Wins Average Error:
6.353777198751218



 500 0.25 10

Movie Gross Average Percent Error:
21355.48949260712 %

IMDb Rating Average Percent Error:
10.12666984271767 %

Me




 500 0.05 6

Movie Gross Average Percent Error:
21221.21296005804 %

IMDb Rating Average Percent Error:
10.40319341510991 %

Meta Score Average Percent Error:
22.151198377354223 %

Number of Votes Average Percent Error:
151.2885493192673 %

Oscar Nominations Average Error:
0.3737930441153088

Oscar Wins Average Error:
0.2217275379089286

Other Nominations Average Error:
8.686564884337445

Other Wins Average Error:
5.963339782300156

Profit Average Percent Error:
1076.117083714869 %

Gross Adjusted Average Percent Error:
22942.60754681946 %

Profit Adjusted Average Percent Error:
1135.06787185861 %

Profit Bool Average Error:
0.2804543526641405

Profit Bool Accuracy:
0.7847847847847848

Total Nominations Average Error:
8.794418785005204

Total Wins Average Error:
6.1476826079564955



 500 0.05 2

Movie Gross Average Percent Error:
22884.020477021626 %

IMDb Rating Average Percent Error:
10.728403774964828 %

Meta Score Average Percent Error:
22.61805454775929 %

Number of Votes Aver


Iteration with suspicious time 1.64 sec ignored in overall statistics.



 1000 0.1 15

Movie Gross Average Percent Error:
20786.562651171498 %

IMDb Rating Average Percent Error:
10.170569367322631 %

Meta Score Average Percent Error:
22.029806547004156 %

Number of Votes Average Percent Error:
145.42364577133344 %

Oscar Nominations Average Error:
0.3936928463673576

Oscar Wins Average Error:
0.23327205266171025

Other Nominations Average Error:
8.84597727736049

Other Wins Average Error:
6.075132001019914

Profit Average Percent Error:
840.8922755437965 %

Gross Adjusted Average Percent Error:
22268.095760220494 %

Profit Adjusted Average Percent Error:
873.1720411428834 %

Profit Bool Average Error:
0.2686613325174024

Profit Bool Accuracy:
0.7862862862862863

Total Nominations Average Error:
8.933468458217318

Total Wins Average Error:
6.277297222183628



 1000 0.1 10

Movie Gross Average Percent Error:
20045.83821176339 %

IMDb Rating Average Percent Error:
10.124761612686589 %




 1000 0.01 6

Movie Gross Average Percent Error:
22422.8724745134 %

IMDb Rating Average Percent Error:
10.774585730693294 %

Meta Score Average Percent Error:
22.49192462657989 %

Number of Votes Average Percent Error:
159.3473768486581 %

Oscar Nominations Average Error:
0.37168654448296257

Oscar Wins Average Error:
0.21790916402805843

Other Nominations Average Error:
8.7909084993461

Other Wins Average Error:
5.977563369817306

Profit Average Percent Error:
1164.7751975395693 %

Gross Adjusted Average Percent Error:
22587.846725412037 %

Profit Adjusted Average Percent Error:
1223.422090044283 %

Profit Bool Average Error:
0.2887389335325348

Profit Bool Accuracy:
0.7847847847847848

Total Nominations Average Error:
8.900734334562026

Total Wins Average Error:
6.1342584583160935



 1000 0.01 2

Movie Gross Average Percent Error:
23433.164867341173 %

IMDb Rating Average Percent Error:
11.016051651328933 %

Meta Score Average Percent Error:
22.79485389182434 %

Number of Votes 

## Catboost Regression Model Analysis
Catboost lives up to the hype on the most important target: it predicts whether or not a movie will turn a profit with almost 80% accuracy, the best result so far. In the runs shown it also improves on the earlier models for IMDb rating and number of votes, but it does not beat the random forest on Oscar nominations or on other award wins.

In [18]:
def scale_nn(X_nn, y_nn):
    scaler = StandardScaler()
    X_nn = scaler.fit_transform(X_nn)
    y_nn = np_utils.to_categorical(y_nn)
    return X_nn, y_nn

def build_nn(X_nn, y_nn, input_dim):
    """
    Accepts the scaled features, the one-hot encoded labels, and the
    number of input features.
    Returns a compiled Keras classifier with two hidden layers
    """
    # One output unit per class; y_nn is already one-hot encoded by scale_nn
    n_classes = y_nn.shape[1]
    model = Sequential()
    model.add(Dense(300, input_dim = input_dim, kernel_initializer = 'normal', activation = 'relu'))
    model.add(Dense(300, kernel_initializer = 'normal', activation = 'relu'))
    model.add(Dense(n_classes, kernel_initializer = 'normal', activation = 'softmax'))
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    return model

def run_nn(X_nn, y_nn, num_folds):
    av_roc = 0
    count = 0
    X_nn, y_nn = scale_nn(X_nn, y_nn)
    kf= KFold(n_splits= num_folds, random_state = 41, shuffle = True)
    for train_index, test_index in kf.split(X_nn):
        count +=1
        print('Fold #: ', count)
        X_nn_train, X_nn_test = X_nn[train_index], X_nn[test_index]
        y_nn_train, y_nn_test = y_nn[train_index], y_nn[test_index]
        input_dim = X_nn_train.shape[1]
        print("Building model...")
        model = build_nn(X_nn,y_nn, input_dim)
        print("Training model...")
        model.fit(X_nn_train,y_nn_train, epochs = 15, batch_size =30, verbose = 0)
        print("Evaluating model...")
        test_preds = model.predict_proba(X_nn_test,verbose = 0)
        roc = roc_auc_score(y_nn_test, test_preds)
        scores = model.evaluate(X_nn_test, y_nn_test)
        print(scores)
        print(model.summary())
        print("ROC: ", roc)
        av_roc += roc
        print("Continues average: ", av_roc/count)
    print("Average ROC:", av_roc/num_folds)
    predict_proba_all = pd.DataFrame(model.predict_proba(X_nn, verbose = 0))
    return predict_proba_all
nn = run_nn(df_x.select_dtypes(include=['float64','int','bool']).astype('float'), df_y['Profit_Bool'],4)

Fold #:  1
Building model...
Training model...
Evaluating model...
[0.6538452772353534, 0.7465972778222578]
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_13 (Dense)             (None, 300)               22500     
_________________________________________________________________
dense_14 (Dense)             (None, 300)               90300     
_________________________________________________________________
dense_15 (Dense)             (None, 2)                 602       
Total params: 113,402
Trainable params: 113,402
Non-trainable params: 0
_________________________________________________________________
None
ROC:  0.8263208329229457
Continues average:  0.8263208329229457
Fold #:  2
Building model...
Training model...
Evaluating model...
[0.6611484649477063, 0.744195356285028]
_________________________________________________________________
Layer (type)                 Output Shape         

## Neural Network Classification Model Analysis
The neural network does not classify whether or not a movie turned a profit as well as the random forest regressor and the CatBoost regressor do: its accuracy is about 75%, with a ROC AUC of about 0.83 on the first fold.
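
The fold output reports both an accuracy and a ROC score, and the two measure different things: accuracy counts rows whose most probable class is correct, while ROC AUC measures how well the predicted profit probability ranks profitable movies above unprofitable ones. The sketch below computes both from a two-column probability table like the one the network returns. The helper names and toy values are hypothetical, and the pairwise AUC shown here is a stand-in for illustration, not the `roc_auc_score` call used in the notebook.

```python
import numpy as np

def accuracy_from_proba(proba, labels):
    """Fraction of rows whose most probable class equals the true label."""
    proba = np.asarray(proba, dtype=float)
    labels = np.asarray(labels).astype(int)
    return float(np.mean(np.argmax(proba, axis=1) == labels))

def auc_from_scores(scores, labels):
    """ROC AUC as the chance that a random positive outscores a random negative."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score with every negative score; ties count as half.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

# Hypothetical probabilities (columns: no profit, profit) and true labels.
toy_proba = [[0.2, 0.8], [0.6, 0.4], [0.3, 0.7], [0.45, 0.55], [0.9, 0.1]]
toy_labels = [1, 1, 0, 1, 0]
profit_scores = [row[1] for row in toy_proba]
print("accuracy:", accuracy_from_proba(toy_proba, toy_labels))
print("ROC AUC:", auc_from_scores(profit_scores, toy_labels))
```

Because the AUC depends only on the ordering of the scores, it can be noticeably higher than the accuracy, as it is in the fold output.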

In [19]:
nn.head()

Unnamed: 0,0,1
0,0.136841,0.863159
1,0.015606,0.984394
2,8e-06,0.999992
3,0.159395,0.840605
4,0.003181,0.996819


## Conclusion
After analyzing the results of all the machine learning algorithms implemented, it becomes apparent that predicting the success of a movie is very difficult. Some characteristics of movies, such as word-of-mouth appeal, are very difficult to quantify. Including an actor network score in our models improved the accuracy of our predictions, but there are still many other variables left unaccounted for that detract from the accuracy. It is worth noting that every regression model failed to accurately estimate the profit a movie turned, because this value was extremely unpredictable. Because CatBoost was so fast and more accurate from the start, we were able to fine-tune it more than the other models and produce the most accurate predictions. Overall, CatBoost performed the best.

In [32]:
# Look at possibly including: a support vector classifier with default settings.
clf = svm.SVC()
clf.fit(tr_x,tr_y['Profit_Bool'])
es_y1 = np.array(clf.predict(tt_x))
tt_y1_profit = np.array(tt_y['Profit_Bool'])
print("\nProfit Bool Accuracy:")
# Round the predictions to 0/1 and report the fraction that match the test labels.
es_y1_profit = (es_y1+.5).astype(int)
print(sum((es_y1_profit==tt_y1_profit).astype(int))/len(tt_y1_profit))


Profit Bool Accuracy:
0.6241241241241241


In [34]:
gnb = GaussianNB()
es_y1 = gnb.fit(tr_x,tr_y['Profit_Bool']).predict(tt_x)
print("\nProfit Bool Accuracy:")
es_y1_profit = (es_y1+.5).astype(int)
print(sum((es_y1_profit==tt_y1_profit).astype(int))/len(tt_y1_profit))


Profit Bool Accuracy:
0.6911911911911912
