# Movie Analysis

In [1]:
import pandas as pd
import numpy as np
import networkx as nx
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# importing all libraries to be used throughout data exploration
import re, json, requests, seaborn, warnings
warnings.filterwarnings( 'ignore' )
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt, rcParams
%matplotlib inline
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from ipyparallel import Client
import time
from catboost import Pool, CatBoostRegressor

In [2]:
def text_message(text,number):
    browser = webdriver.Chrome()
    url = "http://www.txtdrop.com/"
    browser.get( url )
    email = browser.find_element_by_id("emailfrom")
    email.send_keys("brookemosby@hotmail.com")
    first_3 = browser.find_element_by_id("npa")
    first_3.send_keys(number[:3])
    second_3 = browser.find_element_by_id("exchange")
    second_3.send_keys(number[3:6])
    last_4 = browser.find_element_by_id("number")
    last_4.send_keys(number[6:])
    message = browser.find_element_by_name("body")
    message.send_keys(text)
    browser.find_element_by_name("submit").click()
    browser.close()

## Abstract
Cinema has become one of the highest profiting industries over the past century. The total box office revenue in North America alone amounted to $11.38 billion in 2016. With the possibility of great success, there is also a large risk of financial failure. This data exploration is motivated by answering the question what makes a movie successful. There is plenty of quantative data available for movies, such as the movies' budget, the release date, ratings etc., but in this analysis an attempt will be made to quantify movie information that is less measurable and then predict movie success.

## Introduction
Research has been done to determine what aspects of a movie make it more successful; however, much of this research is contradictory. The research paper "Early Predictions of Movie Success: the Who, What, and When of Profitability" states movies with a motion picture content rating 'R' will likely have lower profits, whereas the research paper "What Makes A Great Movie?" states a motion picture content rating 'R' will have higher a box-office. Both papers analyzed thousands of movies, but came to opposite conclusions. Some variables used to predict movie success in these studies, included budget, motion picture content rating, and actor popularity.

Based on these previous models, the dataset used will include movie title length, run-time, motion picture content rating, director, genre, release date, actors, an actor popularity score, average salary of the actors in the movie, budget, and opening weekend box-office revenue for predictor variables. The actor popularity score will be calculated from a network of actors connected through the movies they appeared in together and from the average actor income. Movie success will be determined by the awards it is nominated for, the awards it won, the MetaCritic score, the IMDb rating, and the profit of the movie.

## Data Scraped, Downloaded, Cleaned & Engineered
### Beginning Dataset
A beginning dataset is downloaded from IMDb with 10,000 movies, each entry containing the movie title, URL on IMDb rating, run-time, Year, Genres, Num Votes, Release Date, Directors. From this dataset, additional information on the movie budget, gross income, opening weekend box revenue, actors, Oscar nominations, Oscars won, other award nominations, other awards won, MetaCritic score, and content rating is scraped and cleaned. 
The data points will be collected from IMDb, which is a reputable source for information, according to their website, 

>"we [IMDb] actively gather information from and verify items with studios and 
filmmakers".

### Cleaning Data
After gathering each data point, the data set is complete, although the information is not clean or uniform. The first step to clean the data will be to remove all commas across each column in the DataFrame. Removing commas will make it easier to convert monetary amounts to ints. Next each date in the Release column will be changed to a pandas date object, which will simplify any calculations that rely on the release date of the movie. Each monetary amount will be converted into an int and converted into USD. Each unique genre will be made into a column, with a true or false boolean for each movie entry.
### Feature Engineering
To resolve the disagreement in monetary amounts, due to inflation, a dataset containing the CPI for each year from 1914 will used to adjust the monetary amounts. The CPI, Consumer Price Index, describes the amount of purchasing power the average consumer has. The length of the movie title will be added, and a NetworkX graph of all actors will be made. This network will connect nodes of actors to each other, if they appear in a movie together. The edges of the network will be weighted by the amount of movies the actors appear in together. An actor popularity score will be calculated from the actors appearing in the movie, based on how many other movies they appear in with other actors and the actor's income.

The total variables in the new dataset are movie title, title length, motion picture content rating, run-time, IMDb rating, genres, MetaCritic score, Oscar nominations, Oscar wins, other award nominations, other award wins, director, release date, budget, opening weekend, gross, profit, budget adjusted for inflation, opening weekend adjusted for inflation, gross adjusted for inflation, profit adjusted for inflation, the top ten actors in the movie, and actor popularity score. A separate network will hold the actor nodes and their connections.

In [3]:
df = pd.read_csv("Result_Data/total_engineered.csv",encoding = "ISO-8859-1")
del df["Unnamed: 0"]
df = df.fillna(df.mean())
print(list(df.columns))

['Actor_0', 'Actor_1', 'Actor_2', 'Actor_3', 'Actor_4', 'Actor_5', 'Actor_6', 'Actor_7', 'Actor_8', 'Actor_9', 'Budget', 'Directors', 'Gross', 'IMDb Rating', 'Meta Score', 'Num Votes', 'Opening Weekend', 'Oscar Nominations', 'Oscar Wins', 'Other Nominations', 'Other Wins', 'Release Date', 'Runtime (mins)', 'Title', 'Year', 'Genre: Short', 'Genre:  Comedy', 'Genre: Fantasy', 'Genre: Film-Noir', 'Genre: War', 'Genre: Musical', 'Genre:  Sport', 'Genre: Biography', 'Genre: Action', 'Genre:  Fantasy', 'Genre:  Animation', 'Genre:  Biography', 'Genre: Mystery', 'Genre:  Musical', 'Genre:  Romance', 'Genre: Thriller', 'Genre:  Film-Noir', 'Genre:  History', 'Genre: Western', 'Genre: Drama', 'Genre: Sci-Fi', 'Genre:  Horror', 'Genre: Romance', 'Genre: Adventure', 'Genre:  Family', 'Genre:  Sci-Fi', 'Genre: Animation', 'Genre:  Music', 'Genre: Music', 'Genre: History', 'Genre:  Mystery', 'Genre:  Thriller', 'Genre: Comedy', 'Genre:  Crime', 'Genre: Horror', 'Genre:  Drama', 'Genre:  War', 'Genr

In [4]:
df_x = df[['Actor_0', 'Actor_1', 'Actor_2', 'Actor_3', 'Actor_4', 'Actor_5', 'Actor_6', 'Actor_7', 'Actor_8', 'Actor_9', 
           'Budget', 'Directors', 'Release Date', 'Runtime (mins)', 'Title', 'Year', 'Genre: Short', 'Genre:  Comedy', 
           'Genre: Fantasy', 'Genre: Film-Noir', 'Genre: War', 'Genre: Musical', 'Genre:  Sport', 'Genre: Biography', 
           'Genre: Action', 'Genre:  Fantasy', 'Genre:  Animation', 'Genre:  Biography', 'Genre: Mystery', 
           'Genre:  Musical', 'Genre:  Romance', 'Genre: Thriller', 'Genre:  Film-Noir', 'Genre:  History', 
           'Genre: Western', 'Genre: Drama', 'Genre: Sci-Fi', 'Genre:  Horror', 'Genre: Romance', 'Genre: Adventure', 
           'Genre:  Family', 'Genre:  Sci-Fi', 'Genre: Animation', 'Genre:  Music', 'Genre: Music', 'Genre: History', 
           'Genre:  Mystery', 'Genre:  Thriller', 'Genre: Comedy', 'Genre:  Crime', 'Genre: Horror', 'Genre:  Drama', 
           'Genre:  War', 'Genre:  Western', 'Genre:  Adventure', 'Genre: Family', 'Genre:  Action', 'Genre: Crime', 
           'Content Rating: PASSED', 'Content Rating: TV-MA', 'Content Rating: X', 'Content Rating: NC-17', 
           'Content Rating: TV-14', 'Content Rating: M', 'Content Rating: GP', 'Content Rating: TV-PG', 
           'Content Rating: PG', 'Content Rating: PG-13', 'Content Rating: G', 'Content Rating: NR', 
           'Content Rating: APPROVED', 'Content Rating: UNRATED', 'Content Rating: M/PG', 'Content Rating: TV-13', 
           'Content Rating: NOT RATED', 'Content Rating: TV-G', 'Content Rating: R', 'Decade',  'Budget_Adjusted',  
           'Length of Title', 'Directors Prev Number Movies', 'Directors Prev Mean Profit', 'Directors Prev Mean IMDb',
           'Directors Prev Mean Meta', 'Directors Prev Mean Num Votes', 'Directors Prev Mean Nominations', 
           'Directors Prev Mean Wins', 'Actor Weights']]
df_y = df[['Gross', 'IMDb Rating', 'Meta Score', 'Num Votes', 'Oscar Nominations', 'Oscar Wins', 'Other Nominations', 
           'Other Wins','Profit','Gross_Adjusted',  'Profit_Adjusted', 'Profit_Bool', 'Total Nominations', 'Total Wins']]

In [None]:
actors_network = nx.read_yaml('Result_Data/network_of_actors.yaml')

In [None]:
# networkx.Graph.degree   weighted number of edges  https://networkx.github.io/documentation/stable/reference/classes/generated/networkx.Graph.degree.html
weights = []
for edge in actors_network.edges():    
    weights.append(actors_network.get_edge_data(edge[0],edge[1])['weight'])
    
set(sorted(weights)[::-1])
#look into finding out who has 63 weights

In [None]:
#consider normalizing data to better predict
df_x_temp = df_x.select_dtypes(include=['float64','int','bool']).astype('float')
df_y_temp = df_y.select_dtypes(include=['float64','int','bool']).astype('float')
tr_x, tt_x, tr_y, tt_y = tts(df_x_temp, df_y_temp, test_size = .2)
                
parameters = {'pca__n_components':[10,20,50,100],'rfr__n_estimators':[10,100,500,1000], 'rfr__criterion':['mae', 'mse'], 'rfr__max_features':['auto', 'sqrt', 'log2']}
pca = PCA()
rfr = RFR()
pipe = Pipeline(steps=[('pca', pca), ('rfr', rfr)])
estimator = GridSearchCV(pipe, parameters, n_jobs = -1, verbose = 1)
estimator.fit(tr_x,tr_y)
es_y = estimator.best_estimator.predict(tt_x)
print("\nMovie Gross Average Percent Error:")
print((abs(es_y[0:,0]-np.array(tt_y.astype(float))[0:,0])/abs(np.array(tt_y.astype(float))[0:,0])).mean()*100,"%")
print("\nIMDb Rating Average Percent Error:")
print((abs(es_y[0:,1]-np.array(tt_y.astype(float))[0:,1])/abs(np.array(tt_y.astype(float))[0:,1])).mean()*100,"%")
print("\nMeta Score Average Percent Error:")
print((abs(es_y[0:,2]-np.array(tt_y.astype(float))[0:,2])/abs(np.array(tt_y.astype(float))[0:,2])).mean()*100,"%")
print("\nNumber of Votes Average Percent Error:")
print((abs(es_y[0:,3]-np.array(tt_y.astype(float))[0:,3])/abs(np.array(tt_y.astype(float))[0:,3])).mean()*100,"%")
print("\nOscar Nominations Average Error:")
print(abs(es_y[0:,4]-np.array(tt_y.astype(float))[0:,4]).mean())
print("\nOscar Wins Average Error:")
print(abs(es_y[0:,5]-np.array(tt_y.astype(float))[0:,5]).mean())
print("\nOther Nominations Average Error:")
print(abs(es_y[0:,6]-np.array(tt_y.astype(float))[0:,6]).mean())
print("\nOther Wins Average Error:")
print(abs(es_y[0:,7]-np.array(tt_y.astype(float))[0:,7]).mean())
print("\nProfit Average Percent Error:")
print((abs(es_y[0:,8]-np.array(tt_y.astype(float))[0:,8])/abs(np.array(tt_y.astype(float))[0:,8])).mean()*100,"%")
print("\nGross Adjusted Average Percent Error:")
print((abs(es_y[0:,9]-np.array(tt_y.astype(float))[0:,9])/abs(np.array(tt_y.astype(float))[0:,9])).mean()*100,"%")
print("\nProfit Adjusted Average Percent Error:")
print((abs(es_y[0:,10]-np.array(tt_y.astype(float))[0:,10])/abs(np.array(tt_y.astype(float))[0:,10])).mean()*100,"%")
print("\nProfit Bool Average Error:")
print(abs(es_y[0:,11]-np.array(tt_y.astype(float))[0:,11]).mean())
print("\nTotal Nominations Average Error:")
print(abs(es_y[0:,12]-np.array(tt_y.astype(float))[0:,12]).mean())
print("\nTotal Wins Average Error:")
print(abs(es_y[0:,13]-np.array(tt_y.astype(float))[0:,13]).mean())

In [None]:
text_message("Finished","6787995970")

In [None]:
from sklearn import datasets, linear_model
parameters = {'pca__n_components':[10,20,50,100]}
pca = PCA()
regr = linear_model.LinearRegression()
pipe = Pipeline(steps=[('pca', pca), ('regr', regr)])
estimator = GridSearchCV(pipe, parameters, n_jobs = -1, verbose = 1)
estimator.fit(tr_x,tr_y)
es_y = estimator.best_estimator.predict(tt_x)
print("\nMovie Gross Average Percent Error:")
print((abs(es_y[0:,0]-np.array(tt_y.astype(float))[0:,0])/abs(np.array(tt_y.astype(float))[0:,0])).mean()*100,"%")
print("\nIMDb Rating Average Percent Error:")
print((abs(es_y[0:,1]-np.array(tt_y.astype(float))[0:,1])/abs(np.array(tt_y.astype(float))[0:,1])).mean()*100,"%")
print("\nMeta Score Average Percent Error:")
print((abs(es_y[0:,2]-np.array(tt_y.astype(float))[0:,2])/abs(np.array(tt_y.astype(float))[0:,2])).mean()*100,"%")
print("\nNumber of Votes Average Percent Error:")
print((abs(es_y[0:,3]-np.array(tt_y.astype(float))[0:,3])/abs(np.array(tt_y.astype(float))[0:,3])).mean()*100,"%")
print("\nOscar Nominations Average Error:")
print(abs(es_y[0:,4]-np.array(tt_y.astype(float))[0:,4]).mean())
print("\nOscar Wins Average Error:")
print(abs(es_y[0:,5]-np.array(tt_y.astype(float))[0:,5]).mean())
print("\nOther Nominations Average Error:")
print(abs(es_y[0:,6]-np.array(tt_y.astype(float))[0:,6]).mean())
print("\nOther Wins Average Error:")
print(abs(es_y[0:,7]-np.array(tt_y.astype(float))[0:,7]).mean())
print("\nProfit Average Percent Error:")
print((abs(es_y[0:,8]-np.array(tt_y.astype(float))[0:,8])/abs(np.array(tt_y.astype(float))[0:,8])).mean()*100,"%")
print("\nGross Adjusted Average Percent Error:")
print((abs(es_y[0:,9]-np.array(tt_y.astype(float))[0:,9])/abs(np.array(tt_y.astype(float))[0:,9])).mean()*100,"%")
print("\nProfit Adjusted Average Percent Error:")
print((abs(es_y[0:,10]-np.array(tt_y.astype(float))[0:,10])/abs(np.array(tt_y.astype(float))[0:,10])).mean()*100,"%")
print("\nProfit Bool Average Error:")
print(abs(es_y[0:,11]-np.array(tt_y.astype(float))[0:,11]).mean())
print("\nTotal Nominations Average Error:")
print(abs(es_y[0:,12]-np.array(tt_y.astype(float))[0:,12]).mean())
print("\nTotal Wins Average Error:")
print(abs(es_y[0:,13]-np.array(tt_y.astype(float))[0:,13]).mean())

In [None]:
text_message("Finished","6787995970")

In [None]:
# Michael Catboosting fihdaze

df_x_temp = df_x.select_dtypes(include=['float64','int','bool']).astype('float')
df_y_temp = df_y.select_dtypes(include=['float64','int','bool']).astype('float')
df_y_temp.iloc[0]  = (df_y_temp.iloc[0] - df_y_temp.iloc[0].mean())/np.std(df_y_temp.iloc[0])
df_y_temp.iloc[3]  = (df_y_temp.iloc[3] - df_y_temp.iloc[3].mean())/np.std(df_y_temp.iloc[3])
df_y_temp.iloc[8]  = (df_y_temp.iloc[8] - df_y_temp.iloc[8].mean())/np.std(df_y_temp.iloc[8])
df_y_temp.iloc[9]  = (df_y_temp.iloc[9] - df_y_temp.iloc[9].mean())/np.std(df_y_temp.iloc[9])
df_y_temp.iloc[10] = (df_y_temp.iloc[10] - df_y_temp.iloc[10].mean())/np.std(df_y_temp.iloc[10])

tr_x, tt_x, tr_y, tt_y = tts(df_x_temp, df_y_temp, test_size = .2)
es_y = np.zeros_like(tt_y)

for iterations in [50,100]:
    for learning_rate in [.25,.05]:
        for depth in [6,2]:
            estimator = CatBoostRegressor(iterations=iterations, 
                                      learning_rate=learning_rate, 
                                      depth=depth,l2_leaf_reg=64) 
            for i in range(14):
                estimator.fit(tr_x,tr_y.iloc[0:,i],verbose=False)
                es_y[0:,i] = estimator.predict(tt_x)
                
            print("\n\n**** ",iterations,learning_rate,depth,' ****\n')
            print("\nMovie Gross Average Percent Error:")
            print((abs(es_y[0:,0]-np.array(tt_y.astype(float))[0:,0])/abs(np.array(tt_y.astype(float))[0:,0])).mean()*100,"%")
            print("\nIMDb Rating Average Percent Error:")
            print((abs(es_y[0:,1]-np.array(tt_y.astype(float))[0:,1])/abs(np.array(tt_y.astype(float))[0:,1])).mean()*100,"%")
            print("\nMeta Score Average Percent Error:")
            print((abs(es_y[0:,2]-np.array(tt_y.astype(float))[0:,2])/abs(np.array(tt_y.astype(float))[0:,2])).mean()*100,"%")
            print("\nNumber of Votes Average Percent Error:")
            print((abs(es_y[0:,3]-np.array(tt_y.astype(float))[0:,3])/abs(np.array(tt_y.astype(float))[0:,3])).mean()*100,"%")
            print("\nOscar Nominations Average Error:")
            print(abs(es_y[0:,4]-np.array(tt_y.astype(float))[0:,4]).mean())
            print("\nOscar Wins Average Error:")
            print(abs(es_y[0:,5]-np.array(tt_y.astype(float))[0:,5]).mean())
            print("\nOther Nominations Average Error:")
            print(abs(es_y[0:,6]-np.array(tt_y.astype(float))[0:,6]).mean())
            print("\nOther Wins Average Error:")
            print(abs(es_y[0:,7]-np.array(tt_y.astype(float))[0:,7]).mean())
            print("\nProfit Average Percent Error:")
            print((abs(es_y[0:,8]-np.array(tt_y.astype(float))[0:,8])/abs(np.array(tt_y.astype(float))[0:,8])).mean()*100,"%")
            print("\nGross Adjusted Average Percent Error:")
            print((abs(es_y[0:,9]-np.array(tt_y.astype(float))[0:,9])/abs(np.array(tt_y.astype(float))[0:,9])).mean()*100,"%")
            print("\nProfit Adjusted Average Percent Error:")
            print((abs(es_y[0:,10]-np.array(tt_y.astype(float))[0:,10])/abs(np.array(tt_y.astype(float))[0:,10])).mean()*100,"%")
            print("\nProfit Bool Average Error:")
            print(abs(es_y[0:,11]-np.array(tt_y.astype(float))[0:,11]).mean())
            print("\nTotal Nominations Average Error:")
            print(abs(es_y[0:,12]-np.array(tt_y.astype(float))[0:,12]).mean())
            print("\nTotal Wins Average Error:")
            print(abs(es_y[0:,13]-np.array(tt_y.astype(float))[0:,13]).mean())



****  50 0.25 6  ****


Movie Gross Average Percent Error:
27270.119913947496 %

IMDb Rating Average Percent Error:
10.8176486243267 %

Meta Score Average Percent Error:
25.57233087256052 %

Number of Votes Average Percent Error:
156.30949015507315 %

Oscar Nominations Average Error:
0.3546690621578977

Oscar Wins Average Error:
0.20539105918471837

Other Nominations Average Error:
8.06937876326911

Other Wins Average Error:
5.229295298812379

Profit Average Percent Error:
932.3105495811515 %

Gross Adjusted Average Percent Error:
29937.722636778966 %

Profit Adjusted Average Percent Error:
1028.1933521517687 %

Profit Bool Average Error:
0.28212343167759557

Total Nominations Average Error:
8.205311947814538

Total Wins Average Error:
5.313126196450458


****  50 0.25 2  ****


Movie Gross Average Percent Error:
28508.49051499486 %

IMDb Rating Average Percent Error:
10.918756335983801 %

Meta Score Average Percent Error:
25.913140168607956 %

Number of Votes Average Percent Error:


In [None]:
text_message("Catboosted Booooi","7655868338")

In [None]:
# Do some feature engineering with new actor dataset
# Add in column for popularity
# Need to make actor DataFrame?

In [None]:
# Talk about machine learning methods, why there are good and suitable for our data

In [None]:
# And now do some machine learning....

In [None]:
# Analyze ML models

In [None]:
# Case study to see how well we predict on a specific movie

In [None]:
# Insert some amazing conclusion here