# Regression Model For Predicting Movie Box Office Gross

Group Members: Ying Wu (A20370189), Yingjuan Wu (A20326320), Sahand Zeinali (A20318383)

Project Description: 
In this project, we explore the relationship between a movie’s theatrical revenue and other key features. The target variable is worldwide box-office gross (numerical). The input variables are information available prior to a movie's release: general information such as the number of critic reviews, duration (in minutes), number of faces in the poster, genres, budget, country, content rating, and IMDb score, as well as social media factors such as the number of director Facebook likes and the total number of cast Facebook likes.

The objective of this project is to build a regression model that predicts movie box-office gross. The categorical input variables are genres, country, and content rating; the numerical input variables are the number of critic reviews, duration, number of faces in the poster, budget, IMDb score, number of director Facebook likes, and total number of cast Facebook likes.

In [1]:
import numpy as np 
import pandas as pd
from sklearn import linear_model 
from sklearn import preprocessing
from sklearn import model_selection

movies = pd.read_csv("Processed_Data.csv", header=0)
original_headers = list(movies.columns.values)  # save headers in a list

In [2]:
# Scale continuous features to have zero mean and unit variance

# num_critic_for_reviews
movies['num_critic_for_reviews'] = preprocessing.scale(movies['num_critic_for_reviews'])
# duration
movies['duration'] = preprocessing.scale(movies['duration']) 
# director_facebook_likes
movies['director_facebook_likes'] = preprocessing.scale(movies['director_facebook_likes']) 
# cast_total_facebook_likes
movies['cast_total_facebook_likes'] = preprocessing.scale(movies['cast_total_facebook_likes']) 
# facenumber_in_poster
movies['facenumber_in_poster'] = preprocessing.scale(movies['facenumber_in_poster']) 
# budget
movies['budget'] = preprocessing.scale(movies['budget']) 
# imdb_score
movies['imdb_score'] = preprocessing.scale(movies['imdb_score']) 

# Preview the first five rows of the scaled data
movies.head()

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,gross,cast_total_facebook_likes,facenumber_in_poster,budget,imdb_score,Action,Adventure,...,PG-13,PG,G,R,Not Rated,NC-17,Approved,M,GP,X
0,1.33346,3.055853,-0.107198,309000000.0,1.966068,-0.69516,3.015063,0.678332,1.0,1.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5.844645,2.806854,6.49821,448000000.0,5.016847,-0.69516,2.434369,2.029763,1.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2.745964,1.213263,-0.134314,73100000.0,-0.456334,-0.186635,2.596963,0.195678,1.0,1.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.527679,-0.380328,-0.276054,201000000.0,-0.447483,-0.186635,2.550507,1.354047,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.977915,2.259057,-0.193783,302000000.0,2.507503,0.830414,2.434369,1.064455,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
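
The per-column scaling above can also be written for a list of columns in one step. The following is a minimal sketch, assuming a small made-up frame (`toy`) in place of `Processed_Data.csv`; the column values and the `numeric_cols` list are illustrative only. It checks that `StandardScaler` gives the same result as calling `preprocessing.scale` column by column.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Toy frame standing in for Processed_Data.csv (values are made up for illustration)
toy = pd.DataFrame({
    "duration": [90.0, 120.0, 150.0, 100.0],
    "budget": [1e6, 5e7, 2e8, 3e7],
    "gross": [2e6, 9e7, 4e8, 1e7],  # target column, deliberately left unscaled
})

# Hypothetical list of continuous columns to standardize
numeric_cols = ["duration", "budget"]

scaled = toy.copy()
scaled[numeric_cols] = preprocessing.StandardScaler().fit_transform(toy[numeric_cols])

# Same result as scaling each column separately with preprocessing.scale
for col in numeric_cols:
    assert np.allclose(scaled[col], preprocessing.scale(toy[col].to_numpy()))

print(scaled)
```

Keeping the column names in one list makes it harder to forget a feature, and the target column stays on its original scale.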


In [3]:
# Define the target variable (gross) as target and the independent variables as data
movies_array = movies.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0
target = movies_array[:, 3]
data = movies_array[:, list(range(0,3))+list(range(4,len(movies_array[0])))]
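
Selecting the target by position depends on the column order of the CSV file. As a hedged alternative, the split can be done by column name. The sketch below uses a made-up three-column frame (`toy`); only the column name `gross` is taken from the notebook.

```python
import pandas as pd

# Toy frame with the target in the middle, as in the notebook (values are made up)
toy = pd.DataFrame({
    "num_critic_for_reviews": [0.1, -0.3, 1.2],
    "gross": [3.0e8, 7.3e7, 2.0e8],
    "budget": [1.5, -0.2, 0.4],
})

# Select the target by name and drop it from the feature matrix
target = toy["gross"].to_numpy()
data = toy.drop(columns=["gross"]).to_numpy()

# Keep the feature names so coefficients can be matched to columns later
feature_names = [c for c in toy.columns if c != "gross"]
print(feature_names, data.shape, target.shape)
```

With this form the code keeps working if columns are added to or reordered in the input file.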

In [4]:
# Ordinary Least Squares Regression Model

# Build model with default settings
reg = linear_model.LinearRegression()
reg.fit(data, target)
# Use 10-fold cross-validation to calculate MSE (scikit-learn returns negated MSE scores)
MSE_array = model_selection.cross_val_score(reg, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Use Mean Squared Error (MSE) to measure the performance:")
print("Linear Regression with default settings:    MSE =",MSE)

# Modify parameter fit_intercept
reg1 = linear_model.LinearRegression(fit_intercept=False)
reg1.fit(data, target)
MSE1_array = model_selection.cross_val_score(reg1, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE1 = np.absolute(np.mean(MSE1_array))
print("Linear Regression with fit_intercept=False: MSE =",MSE1)

# Modify parameter n_jobs (controls parallelism only; the fitted model is unchanged)
reg4 = linear_model.LinearRegression(n_jobs=-1)
reg4.fit(data, target)
MSE4_array = model_selection.cross_val_score(reg4, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE4 = np.absolute(np.mean(MSE4_array))
print("Linear Regression with n_jobs=-1:           MSE =",MSE4)

# Modify parameter normalize
# (deprecated in scikit-learn 1.0 and removed in 1.2; this call needs an older version)
reg2 = linear_model.LinearRegression(normalize=True)
reg2.fit(data, target)
MSE2_array = model_selection.cross_val_score(reg2, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE2 = np.absolute(np.mean(MSE2_array))
print("Linear Regression with normalize=True:      MSE =",MSE2)

# Modify parameter copy_X (with copy_X=False, fitting may overwrite data in place)
reg3 = linear_model.LinearRegression(copy_X=False)
reg3.fit(data, target)
MSE3_array = model_selection.cross_val_score(reg3, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE3 = np.absolute(np.mean(MSE3_array))
print("Linear Regression with copy_X=False:        MSE =",MSE3)


Use Mean Squared Error (MSE) to measure the performance:
Linear Regression with default settings:    MSE = 3.82566499904e+39
Linear Regression with fit_intercept=False: MSE = 5.82895162482e+38
Linear Regression with n_jobs=-1:           MSE = 3.82566499904e+39
Linear Regression with normalize=True:      MSE = 4.70697290355e+42
Linear Regression with copy_X=False:        MSE = 3.01513255402e+39
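
The fit-and-score pattern repeated for each setting can be condensed into a small helper. This is a sketch, not the code that produced the numbers above: it runs on synthetic data from `make_regression`, and the helper name `cv_mse` and the `configs` dictionary are hypothetical. MSE values around 1e39 are far larger than the squared scale of gross, which may point to an ill-conditioned feature matrix rather than a meaningful difference between settings; that reading is an interpretation and has not been verified against the data.

```python
import numpy as np
from sklearn import linear_model, model_selection
from sklearn.datasets import make_regression

# Synthetic stand-in for (data, target); not the movie dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

def cv_mse(estimator, X, y, folds=10):
    """Mean cross-validated MSE; scikit-learn returns it negated, so flip the sign."""
    scores = model_selection.cross_val_score(
        estimator, X, y, cv=folds, scoring="neg_mean_squared_error")
    return -scores.mean()

# One entry per configuration to compare
configs = {
    "default settings": linear_model.LinearRegression(),
    "fit_intercept=False": linear_model.LinearRegression(fit_intercept=False),
    "n_jobs=-1": linear_model.LinearRegression(n_jobs=-1),
}

results = {name: cv_mse(est, X, y) for name, est in configs.items()}
for name, mse in results.items():
    print(f"Linear Regression with {name}: MSE = {mse:.4g}")
```

Because `cross_val_score` clones and refits the estimator on every fold, a separate call to `fit` beforehand is not needed for the cross-validated score.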


In [5]:
# Reset target and data to make sure they were not changed by the previous cell
target = movies_array[:, 3]
data = movies_array[:, list(range(0,3))+list(range(4,len(movies_array[0])))]

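# Ridge Regression Model
# Note: MSE is the 10-fold cross-validated mean, while R^2 comes from clf.score on the
# same data the model was fitted on, so it is a training-set R^2.

# Default settings (alpha=1.0)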
clf = linear_model.Ridge()
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Use Mean Squared Error (MSE) and Coefficient of Determination (R^2) to measure the performance:")
print("Ridge Regression with default settings:    MSE = ",MSE," R^2 = ",clf.score(data, target) )

# alpha=0.5
clf = linear_model.Ridge(alpha=0.5)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Ridge Regression with alpha=0.5:           MSE = ",MSE," R^2 = ",clf.score(data, target) )

# fit_intercept=False
clf = linear_model.Ridge(fit_intercept=False)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Ridge Regression with fit_intercept=false: MSE = ",MSE," R^2 = ",clf.score(data, target) )

# solver='lsqr'
clf = linear_model.Ridge(solver='lsqr')
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Ridge Regression with solver='lsqr':       MSE = ",MSE," R^2 = ",clf.score(data, target) )


Use Mean Squared Error (MSE) and Coefficient of Determination (R^2) to measure the performance:
Ridge Regression with default settings:    MSE =  2.46586966196e+15  R^2 =  0.47149245887
Ridge Regression with alpha=0.5:           MSE =  2.49697052704e+15  R^2 =  0.471969734855
Ridge Regression with fit_intercept=false: MSE =  2.46569305504e+15  R^2 =  0.471459488646
Ridge Regression with solver='lsqr':       MSE =  2.44591342153e+15  R^2 =  0.469904664609
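
The R^2 values above are computed on the data used for fitting, while the MSE values are cross-validated, so the two columns are not directly comparable. One way to report both metrics on held-out folds is `model_selection.cross_validate` with two scorers. The sketch below runs on synthetic data from `make_regression`, and the alpha grid is an arbitrary illustration, not a recommendation for the movie dataset.

```python
import numpy as np
from sklearn import linear_model, model_selection
from sklearn.datasets import make_regression

# Synthetic stand-in for (data, target); not the movie dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv_summary = {}
for alpha in (0.5, 1.0, 10.0):  # illustrative values only
    cv = model_selection.cross_validate(
        linear_model.Ridge(alpha=alpha), X, y, cv=10,
        scoring=("neg_mean_squared_error", "r2"))
    cv_summary[alpha] = {
        "mse": -cv["test_neg_mean_squared_error"].mean(),  # undo the negation
        "r2": cv["test_r2"].mean(),
    }
    print(f"Ridge alpha={alpha}: CV MSE = {cv_summary[alpha]['mse']:.4g}, "
          f"CV R^2 = {cv_summary[alpha]['r2']:.4f}")
```

Both numbers then describe performance on folds the model has not seen during fitting.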
