# Movie Analysis

In [2]:
import pandas as pd
import numpy as np
import networkx as nx

## Abstract
Cinema has become one of the most profitable industries of the past century. Total box office revenue in North America alone amounted to $11.38 billion in 2016. With the possibility of great success comes a large risk of financial failure. This data exploration is motivated by the question of what makes a movie successful. Plenty of quantitative data is available for movies, such as budget, release date, and ratings, but this analysis attempts to quantify movie information that is less measurable and then use it to predict movie success.

## Introduction
Research has been done to determine what aspects of a movie make it more successful; however, much of this research is contradictory. The research paper "Early Predictions of Movie Success: the Who, What, and When of Profitability" states that movies with a motion picture content rating of 'R' will likely have lower profits, whereas the research paper "What Makes A Great Movie?" states that movies with a content rating of 'R' will have a higher box office. Both papers analyzed thousands of movies but came to opposite conclusions. The variables used to predict movie success in these studies included budget, motion picture content rating, and actor popularity.

Based on these previous models, the dataset will use the following as predictor variables: movie title length, run-time, motion picture content rating, director, genre, release date, actors, an actor popularity score, average salary of the actors in the movie, budget, and opening weekend box-office revenue. The actor popularity score will be calculated from a network of actors connected through the movies they appeared in together and from the average actor income. Movie success will be measured by the awards a movie is nominated for, the awards it won, its MetaCritic score, its IMDb rating, and its profit.

## Data Scraped, Downloaded, Cleaned & Engineered
### Beginning Dataset
A beginning dataset of 10,000 movies is downloaded from IMDb, each entry containing the movie title, URL on IMDb, IMDb rating, run-time, year, genres, number of votes, release date, and directors. Starting from this dataset, additional information on the movie budget, gross income, opening weekend box-office revenue, actors, Oscar nominations, Oscars won, other award nominations, other awards won, MetaCritic score, and content rating is scraped and cleaned.
The data points are collected from IMDb, which is a reputable source of information. According to its website:

>"we [IMDb] actively gather information from and verify items with studios and filmmakers".

### Cleaning Data
After every data point is gathered, the dataset is complete, but the information is not yet clean or uniform. The first cleaning step is to remove all commas from every column in the DataFrame, which makes it easier to convert monetary amounts to ints. Next, each date in the Release Date column is converted to a pandas datetime object, which simplifies any calculation that relies on the release date of the movie. Each monetary amount is converted to USD and stored as an int. Finally, each unique genre becomes its own column, holding a true or false boolean for each movie entry.
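The cleaning steps above can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's actual cleaning code: the `raw` frame, the `Genres` column layout, and the `clean_movies` helper are assumptions, and currency conversion to USD is left out.

```python
import pandas as pd

# Hypothetical raw rows; the column names only mimic the scraped dataset.
raw = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Budget": ["$1,500,000", "$20,000,000"],
    "Release Date": ["1999-03-31", "2010-07-16"],
    "Genres": ["Action, Sci-Fi", "Drama"],
})


def clean_movies(frame):
    """Return a cleaned copy: int budgets, datetime release dates, boolean genre columns."""
    out = frame.copy()
    # Strip the currency symbol and thousands separators, then cast to int.
    out["Budget"] = out["Budget"].str.replace(r"[$,]", "", regex=True).astype("int64")
    # Parse release dates into pandas datetimes.
    out["Release Date"] = pd.to_datetime(out["Release Date"])
    # One boolean column per unique genre; split on comma plus space.
    genres = out["Genres"].str.get_dummies(sep=", ").astype(bool).add_prefix("Genre: ")
    return pd.concat([out.drop(columns="Genres"), genres], axis=1)


cleaned = clean_movies(raw)
print(cleaned.dtypes)
```

Splitting on the comma together with the following space keeps genre names free of leading whitespace. Splitting on the comma alone would likely produce near-duplicate columns that differ only by a space, such as `'Genre: Comedy'` and `'Genre:  Comedy'`.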
### Feature Engineering
To make monetary amounts from different years comparable despite inflation, a dataset containing the CPI for each year from 1914 will be used to adjust them. The Consumer Price Index (CPI) tracks the average level of prices paid by consumers, so the ratio of two years' CPI values converts an amount from one year's dollars into the other's. The length of the movie title will be added, and a NetworkX graph of all actors will be built. This network will connect two actor nodes if the actors appear in a movie together, and each edge will be weighted by the number of movies the two actors appear in together. An actor popularity score will be calculated for the actors appearing in a movie, based on how many other movies they appear in with other actors and on the actors' income.
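A minimal sketch of the inflation adjustment, assuming the usual ratio formula `amount * CPI[target year] / CPI[release year]`. The CPI values below are illustrative placeholders rather than official figures, and `adjust_for_inflation` is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd

# Illustrative CPI values only; substitute the real annual CPI table.
cpi = pd.Series({1980: 80.0, 2000: 160.0, 2016: 240.0}, name="CPI")

movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Year": [1980, 2000],
    "Budget": [10_000_000, 40_000_000],
})


def adjust_for_inflation(amounts, years, cpi, target_year):
    """Express nominal amounts in target-year dollars."""
    # Look up each movie's release-year CPI, then scale by the target-year CPI.
    return amounts * cpi.loc[target_year] / years.map(cpi)


movies["Budget_Adjusted"] = adjust_for_inflation(
    movies["Budget"], movies["Year"], cpi, target_year=2016
)
print(movies)
```

The same helper could be applied to the opening weekend, gross, and profit columns, so that all adjusted amounts share one reference year.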

The variables in the new dataset are movie title, title length, motion picture content rating, run-time, IMDb rating, genres, MetaCritic score, Oscar nominations, Oscar wins, other award nominations, other award wins, director, release date, budget, opening weekend, gross, profit, budget adjusted for inflation, opening weekend adjusted for inflation, gross adjusted for inflation, profit adjusted for inflation, the top ten actors in the movie, and actor popularity score. A separate network will hold the actor nodes and their connections.
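One way to build such an actor network is sketched below on made-up casts. The `popularity` function is only one possible score (the mean weighted degree of a movie's cast) and leaves out the income component described above; both helper names are assumptions.

```python
from itertools import combinations

import networkx as nx

# Hypothetical casts; the real data holds up to ten actors per movie.
casts = {
    "Movie A": ["Actor 1", "Actor 2", "Actor 3"],
    "Movie B": ["Actor 1", "Actor 2"],
    "Movie C": ["Actor 3", "Actor 4"],
}


def build_actor_network(casts):
    """Connect actors who share a movie; edge weight counts shared movies."""
    graph = nx.Graph()
    for cast in casts.values():
        graph.add_nodes_from(cast)
        for a, b in combinations(cast, 2):
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1
            else:
                graph.add_edge(a, b, weight=1)
    return graph


def popularity(graph, cast):
    """Mean weighted degree of the cast: one possible popularity score."""
    degrees = dict(graph.degree(cast, weight="weight"))
    return sum(degrees.values()) / len(cast)


actors = build_actor_network(casts)
print(actors["Actor 1"]["Actor 2"]["weight"])
print(popularity(actors, casts["Movie B"]))
```

Using the weighted degree means an actor who repeatedly works with the same co-stars scores as highly as one who works once with many different co-stars. Whether that is desirable depends on how popularity is meant to be interpreted.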

In [47]:
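df = pd.read_csv("Result_Data/total_engineered.csv", encoding="ISO-8859-1")
# Drop the leftover unnamed column
del df["Unnamed: 0"]
# Fill missing numeric values with the column mean; numeric_only skips text columns,
# which would otherwise raise an error in recent pandas versions
df = df.fillna(df.mean(numeric_only=True))
print(list(df.columns))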

['Actor_0', 'Actor_1', 'Actor_2', 'Actor_3', 'Actor_4', 'Actor_5', 'Actor_6', 'Actor_7', 'Actor_8', 'Actor_9', 'Budget', 'Directors', 'Gross', 'IMDb Rating', 'Meta Score', 'Num Votes', 'Opening Weekend', 'Oscar Nominations', 'Oscar Wins', 'Other Nominations', 'Other Wins', 'Release Date', 'Runtime (mins)', 'Title', 'Year', 'Genre: Short', 'Genre:  Comedy', 'Genre: Fantasy', 'Genre: Film-Noir', 'Genre: War', 'Genre: Musical', 'Genre:  Sport', 'Genre: Biography', 'Genre: Action', 'Genre:  Fantasy', 'Genre:  Animation', 'Genre:  Biography', 'Genre: Mystery', 'Genre:  Musical', 'Genre:  Romance', 'Genre: Thriller', 'Genre:  Film-Noir', 'Genre:  History', 'Genre: Western', 'Genre: Drama', 'Genre: Sci-Fi', 'Genre:  Horror', 'Genre: Romance', 'Genre: Adventure', 'Genre:  Family', 'Genre:  Sci-Fi', 'Genre: Animation', 'Genre:  Music', 'Genre: Music', 'Genre: History', 'Genre:  Mystery', 'Genre:  Thriller', 'Genre: Comedy', 'Genre:  Crime', 'Genre: Horror', 'Genre:  Drama', 'Genre:  War', 'Genr

In [48]:
df_x = df[['Actor_0', 'Actor_1', 'Actor_2', 'Actor_3', 'Actor_4', 'Actor_5', 'Actor_6', 'Actor_7', 'Actor_8', 'Actor_9', 
           'Budget', 'Directors', 'Release Date', 'Runtime (mins)', 'Title', 'Year', 'Genre: Short', 'Genre:  Comedy', 
           'Genre: Fantasy', 'Genre: Film-Noir', 'Genre: War', 'Genre: Musical', 'Genre:  Sport', 'Genre: Biography', 
           'Genre: Action', 'Genre:  Fantasy', 'Genre:  Animation', 'Genre:  Biography', 'Genre: Mystery', 
           'Genre:  Musical', 'Genre:  Romance', 'Genre: Thriller', 'Genre:  Film-Noir', 'Genre:  History', 
           'Genre: Western', 'Genre: Drama', 'Genre: Sci-Fi', 'Genre:  Horror', 'Genre: Romance', 'Genre: Adventure', 
           'Genre:  Family', 'Genre:  Sci-Fi', 'Genre: Animation', 'Genre:  Music', 'Genre: Music', 'Genre: History', 
           'Genre:  Mystery', 'Genre:  Thriller', 'Genre: Comedy', 'Genre:  Crime', 'Genre: Horror', 'Genre:  Drama', 
           'Genre:  War', 'Genre:  Western', 'Genre:  Adventure', 'Genre: Family', 'Genre:  Action', 'Genre: Crime', 
           'Content Rating: PASSED', 'Content Rating: TV-MA', 'Content Rating: X', 'Content Rating: NC-17', 
           'Content Rating: TV-14', 'Content Rating: M', 'Content Rating: GP', 'Content Rating: TV-PG', 
           'Content Rating: PG', 'Content Rating: PG-13', 'Content Rating: G', 'Content Rating: NR', 
           'Content Rating: APPROVED', 'Content Rating: UNRATED', 'Content Rating: M/PG', 'Content Rating: TV-13', 
           'Content Rating: NOT RATED', 'Content Rating: TV-G', 'Content Rating: R', 'Decade',  'Budget_Adjusted',  
           'Length of Title', 'Directors Prev Number Movies', 'Directors Prev Mean Profit', 'Directors Prev Mean IMDb',
           'Directors Prev Mean Meta', 'Directors Prev Mean Num Votes', 'Directors Prev Mean Nominations', 
           'Directors Prev Mean Wins']]
df_y = df[['Gross', 'IMDb Rating', 'Meta Score', 'Num Votes', 'Oscar Nominations', 'Oscar Wins', 'Other Nominations', 
           'Other Wins','Profit','Gross_Adjusted',  'Profit_Adjusted', 'Profit_Bool', 'Total Nominations', 'Total Wins']]

In [72]:
df_y["Oscar Nominations"].describe()

count    9990.000000
mean        0.222723
std         0.804647
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max        11.000000
Name: Oscar Nominations, dtype: float64

In [29]:
# Note: nx.read_yaml is only available in older NetworkX releases
actors_network = nx.read_yaml('Result_Data/network_of_actors.yaml')

In [30]:
# networkx.Graph.degree   weighted number of edges  https://networkx.github.io/documentation/stable/reference/classes/generated/networkx.Graph.degree.html
# Collect the weight of every edge (the number of movies each pair of actors shares)
weights = []
for _, _, weight in actors_network.edges(data='weight'):
    weights.append(weight)

# Distinct edge weights (a set is unordered, so sorting beforehand has no effect)
set(weights)
# TODO: find out which pair of actors shares the edge of weight 63

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 20, 29, 38, 63}
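To follow up on the weight-63 edge, the heaviest edges can be listed together with their endpoints. The sketch below runs on a toy graph standing in for `actors_network`; `heaviest_edges` is a hypothetical helper, and it assumes the edge attribute is named `weight` as in the cell above.

```python
import networkx as nx

# Toy stand-in for actors_network.
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("Actor 1", "Actor 2", 63),
    ("Actor 1", "Actor 3", 4),
    ("Actor 2", "Actor 3", 1),
])


def heaviest_edges(graph, top=1):
    """Return the `top` edges with the largest 'weight', heaviest first."""
    # Each item is a (node, node, weight) tuple; missing weights count as 0.
    edges = graph.edges(data="weight", default=0)
    return sorted(edges, key=lambda edge: edge[2], reverse=True)[:top]


print(heaviest_edges(toy))
```

Calling `heaviest_edges(actors_network, top=5)` on the real network would show which pairs of actors account for the largest weights.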

In [90]:
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA

RF = RFR()
pca = PCA(n_components = 10)
# Keep only the numeric predictors and reduce them to 10 principal components
df_x_temp = df_x.select_dtypes(include=['float64','int','bool']).astype('float')
df_x_temp2 = pca.fit_transform(df_x_temp)
# Default scoring for a regressor is R^2; negative values are worse than predicting the mean
print(cross_val_score(RF,df_x_temp2,df_y.select_dtypes(include=['float64','int','bool']).astype('float')))
tr_x, tt_x, tr_y, tt_y = tts(df_x_temp2, df_y, test_size = .2)
RF.fit(tr_x,tr_y)
es_y = RF.predict(tt_x)
true_y = np.array(tt_y.astype(float))

# (label, report style) for each column of df_y, in column order
reports = [("Gross", "percent"), ("IMDb Rating", "percent"), ("Meta Score", "percent"),
           ("Number of Votes", "percent"), ("Oscar Nominations", "max"), ("Oscar Wins", "max"),
           ("Other Nominations", "max"), ("Other Wins", "max"), ("Profit", "percent"),
           ("Gross Adjusted", "percent"), ("Profit Adjusted", "percent"), ("Profit Bool", "abs"),
           ("Total Nominations", "max"), ("Total Wins", "max")]

for i, (label, style) in enumerate(reports):
    true, est = true_y[:, i], es_y[:, i]
    print(("" if i == 0 else "\n") + "Average Movie " + label + ":")
    print(true.mean())
    if style == "percent":
        # Relative error as a fraction (not multiplied by 100); it is unstable when
        # the true value is near zero and can turn negative for negative values (e.g. Profit)
        print("Average Percent Error:")
        print((abs(est - true) / true).mean())
    else:
        print("Average Error:")
        print(abs(est - true).mean())
    if style == "max":
        # Signed difference: the largest over-prediction, not the largest absolute error
        print("Max Error:")
        print(max(est - true))



[-0.25119536 -0.11216382 -0.31842237]
Average Movie Gross:
75057678.72875479
Average Percent Error:
342.67084307121047

Average Movie IMDb Rating:
6.636486486486486
Average Percent Error:
0.12334665930089224

Average Movie Meta Score:
56.33780782716953
Average Percent Error:
0.23938218438863915

Average Movie Number of Votes:
64562.14914914915
Average Percent Error:
1.7413023493741562

Average Movie Oscar Nominations:
0.21871871871871873
Average Error:
0.3476226226226226
Max Error:
5.4

Average Movie Oscar Wins:
0.14864864864864866
Average Error:
0.22742742742742744
Max Error:
2.6

Average Movie Other Nominations:
9.373373373373374
Average Error:
9.244244244244245
Max Error:
110.8

Average Movie Other Wins:
5.233233233233233
Average Error:
6.238138138138138
Max Error:
75.6

Average Movie Profit:
61825087.56712181
Average Percent Error:
-2.2034023280060673

Average Movie Gross Adjusted:
112355130.98360057
Average Percent Error:
437.39873247999265

Average Movie Profit Adjusted:
94996026

In [None]:
# Do some feature engineering with new actor dataset
# Add in column for popularity
# Need to make actor DataFrame?

In [None]:
# Talk about machine learning methods, why they are good and suitable for our data

In [1]:
# And now do some machine learning....

In [2]:
# Analyze ML models

In [None]:
# Case study to see how well we predict on a specific movie

In [None]:
# Insert some amazing conclusion here