# Regression Model For Predicting Movie Box Office Gross

Group Members: Ying Wu (A20370189), Yingjuan Wu (A20326320), Sahand Zeinali (A20318383)

Project Description: 
In this project, we explore the relationship between a movie's theatrical revenue and other key features. The target variable is worldwide box-office gross (numerical). The input variables are pieces of information available prior to a movie's release: general information such as the number of critic reviews, duration (in minutes), number of faces in the poster, genres, budget, country, content rating, and IMDb score, as well as social media factors such as the number of director Facebook likes and the total number of cast Facebook likes.

The objective of this project is to build a regression model to predict movie box office gross. Categorical input variables include genres, country, and content-rating; numerical input variables are number of critic reviews, duration, face number in poster, budget, imdb score, number of director facebook likes, and number of cast total facebook likes. 

In [1]:
import numpy as np 
import pandas as pd
from sklearn import linear_model 
from sklearn import preprocessing
from sklearn import model_selection
from sklearn import dummy

movies = pd.read_csv("Processed_Data.csv", header = 0)
original_headers = list(movies.columns.values)  # save headers in a list

In [2]:
# Scale continuous features to zero mean and unit variance
continuous_features = [
    'num_critic_for_reviews',
    'duration',
    'director_facebook_likes',
    'cast_total_facebook_likes',
    'facenumber_in_poster',
    'budget',
    'imdb_score',
]
for feature in continuous_features:
    movies[feature] = preprocessing.scale(movies[feature])

# Check the shape of data
movies.shape

(3321, 87)
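
For reference, preprocessing.scale standardizes a column by subtracting its mean and dividing by its population standard deviation. The following is a minimal NumPy sketch of that computation; the `durations` array and the helper name `standardize` are hypothetical and are not taken from Processed_Data.csv.

```python
import numpy as np

def standardize(values):
    """Return values shifted to zero mean and scaled to unit variance."""
    values = np.asarray(values, dtype=float)
    # ddof=0 (population standard deviation) matches preprocessing.scale
    return (values - values.mean()) / values.std(ddof=0)

# Hypothetical movie durations in minutes, for illustration only
durations = np.array([90.0, 105.0, 120.0, 150.0])
scaled = standardize(durations)
print(scaled.mean(), scaled.std())
```

After scaling, the mean is zero and the standard deviation is one, up to floating-point rounding.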

# Performance

We use two measures to evaluate the performance of the regression models:

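1. Mean Squared Error (MSE) - the average of the squared errors between predicted and actual values; lower is better. It is a risk function corresponding to the expected value of the squared error loss. We use 'neg_mean_squared_error' as the scoring parameter in the cross-validation function, which returns negated MSE values, and take the absolute value of their mean to report the MSE.

2. R^2 Score (Coefficient of Determination) - measures how well future instances are likely to be predicted by the model. The best possible score is 1, and a higher R^2 indicates a better regression model; the score can be negative when a model predicts worse than the mean of the evaluated targets. For the baseline we use 'r2' as the scoring parameter in the cross-validation function; for the fitted models, the R^2 values printed below come from score(data, target), so they are computed on the training data.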
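
As a small worked example of the two measures, the sketch below computes MSE and R^2 directly from their definitions with NumPy. The arrays are made-up numbers, and the helper names `mse` and `r2` are illustrative, not scikit-learn functions. It also shows that R^2 is 0 for a model that always predicts the mean of the evaluated targets and negative for a model that does worse than that.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
mean_pred = np.full_like(y_true, y_true.mean())  # always predict 2.5
zero_pred = np.zeros_like(y_true)                # always predict 0

print(mse(y_true, mean_pred), r2(y_true, mean_pred))  # 1.25 0.0
print(mse(y_true, zero_pred), r2(y_true, zero_pred))  # 7.5 -5.0
```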

In [3]:
# Define the target variable (gross, column 3) as target and the remaining columns as data
# .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0
movies_array = movies.values
target = movies_array[:, 3]
data = movies_array[:, list(range(0,3))+list(range(4,len(movies_array[0])))]

In [4]:
#Baseline performance
dum = dummy.DummyRegressor()
dum.fit(data, target)
pred_mean = np.mean(dum.predict(data))
print("Predicted mean of target:",pred_mean)
print("Actual mean of target:",np.mean(target))
MSE_array = model_selection.cross_val_score(dum, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Baseline MSE:",MSE)
R2_array = model_selection.cross_val_score(dum, data, target, cv=10, scoring = 'r2')
print("Baseline R^2:",np.mean(R2_array))

Predicted mean of target: 45608461.1575
Actual mean of target: 45608461.1575
Baseline MSE: 4.12674080594e+15
Baseline R^2: -1.52339286602
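
The negative baseline R^2 may look surprising, since a mean predictor scores 0 on the data it was fitted on. Under cross-validation, however, the mean is computed on the training folds and scored on a held-out fold, so R^2 drops below 0 whenever the folds have different target means. With an integer cv, cross_val_score splits a regression data set into consecutive, unshuffled folds, so any ordering in the rows can produce this effect. The sketch below reproduces it with NumPy on a synthetic, sorted target. Whether the rows of Processed_Data.csv are ordered in a comparable way is an assumption, not something established by the output above, and the helper name `mean_baseline_cv_r2` is hypothetical.

```python
import numpy as np

def mean_baseline_cv_r2(y, n_folds):
    """R^2 per fold for a predictor that always outputs the training-fold mean."""
    indices = np.arange(len(y))
    scores = []
    # Consecutive, unshuffled folds
    for test_idx in np.array_split(indices, n_folds):
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        prediction = y[train_mask].mean()
        y_test = y[test_idx]
        ss_res = np.sum((y_test - prediction) ** 2)
        ss_tot = np.sum((y_test - y_test.mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic target sorted in ascending order (illustration only)
y_sorted = np.arange(20, dtype=float)
fold_scores = mean_baseline_cv_r2(y_sorted, n_folds=4)
print(fold_scores.mean())
```

If ordering turns out to be the cause, passing a shuffled splitter such as model_selection.KFold(n_splits=10, shuffle=True, random_state=0) as cv would be one way to check.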


In [5]:
##Ordinary Least Squares Regression Model

#Build model with default settings
reg = linear_model.LinearRegression()
reg.fit(data, target)
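# Use 10-fold cross-validation to calculate MSE. Note that the R^2 printed below is
# reg.score on the training data; the cross-validated R2_array is computed but not reported.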
MSE_array = model_selection.cross_val_score(reg, data, target, cv=10, scoring = 'neg_mean_squared_error')
R2_array = model_selection.cross_val_score(reg, data, target, cv=10, scoring = 'r2')
MSE = np.absolute(np.mean(MSE_array))
print("Use Mean Squared Error (MSE) to measure the performance:")
print("Linear Regression with default settings:    MSE =",MSE," R^2 = ",reg.score(data, target) )

#Modify parameter fit_intercept
reg1 = linear_model.LinearRegression(fit_intercept=False)
reg1.fit(data, target)
MSE1_array = model_selection.cross_val_score(reg1, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE1 = np.absolute(np.mean(MSE1_array))
print("Linear Regression with fit_intercept=False: MSE =",MSE1," R^2 = ",reg1.score(data, target) )

#Modify parameter n_jobs
reg4 = linear_model.LinearRegression(n_jobs=-1)
reg4.fit(data, target)
MSE4_array = model_selection.cross_val_score(reg4, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE4 = np.absolute(np.mean(MSE4_array))
print("Linear Regression with n_jobs=-1:           MSE =",MSE4," R^2 = ",reg4.score(data, target) )

# Modify parameter normalize (deprecated in scikit-learn 1.0 and removed in 1.2,
# so this block needs an older release)
reg2 = linear_model.LinearRegression(normalize=True)
reg2.fit(data, target)
MSE2_array = model_selection.cross_val_score(reg2, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE2 = np.absolute(np.mean(MSE2_array))
print("Linear Regression with normalize=True:      MSE =",MSE2," R^2 = ",reg2.score(data, target) )

# Modify parameter copy_X (with copy_X=False the input array may be overwritten in place)
reg3 = linear_model.LinearRegression(copy_X=False)
reg3.fit(data, target)
MSE3_array = model_selection.cross_val_score(reg3, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE3 = np.absolute(np.mean(MSE3_array))
print("Linear Regression with false copy_X=False:  MSE =",MSE3," R^2 = ",reg3.score(data, target) )




Use Mean Squared Error (MSE) to measure the performance:
Linear Regression with default settings:    MSE = 2.64367167074e+15  R^2 =  0.472313086101
Linear Regression with fit_intercept=False: MSE = 2.64367407615e+15  R^2 =  0.472313086101
Linear Regression with n_jobs=-1:           MSE = 2.64367167074e+15  R^2 =  0.472313086101
Linear Regression with normalize=True:      MSE = 2.46248008551e+42  R^2 =  0.463866556306
Linear Regression with false copy_X=False:  MSE = 1.71258839024e+35  R^2 =  0.32133367021
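
Each variant above repeats the same fit-and-score steps and reports R^2 on the training data. One possible way to reduce the repetition and obtain a cross-validated R^2 alongside the MSE is a small helper built on model_selection.cross_validate. This is a sketch, not the code that produced the results above: the helper name `evaluate` and the synthetic data from make_regression are illustrative assumptions.

```python
import numpy as np
from sklearn import linear_model, model_selection
from sklearn.datasets import make_regression

def evaluate(model, X, y, cv=10):
    """Return the mean cross-validated MSE and R^2 of a regression model."""
    results = model_selection.cross_validate(
        model, X, y, cv=cv, scoring=("neg_mean_squared_error", "r2")
    )
    # scikit-learn reports negated MSE so that higher is always better
    mean_mse = -np.mean(results["test_neg_mean_squared_error"])
    mean_r2 = np.mean(results["test_r2"])
    return mean_mse, mean_r2

# Synthetic stand-in for (data, target), for illustration only
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
for name, model in [("OLS", linear_model.LinearRegression()),
                    ("Ridge", linear_model.Ridge(alpha=1.0))]:
    cv_mse, cv_r2 = evaluate(model, X_demo, y_demo)
    print(name, "CV MSE =", cv_mse, "CV R^2 =", cv_r2)
```

Reporting both numbers from the same folds keeps the MSE and R^2 columns comparable across models.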


In [6]:
# Reset target and data, make sure data is not changed
target = movies_array[:, 3]
data = movies_array[:, list(range(0,3))+list(range(4,len(movies_array[0])))]

# Ridge Regression Model

# Default settings
clf = linear_model.Ridge()
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Use Mean Squared Error (MSE) and Coefficient of Determination (R^2) to measure the performance:")
print("Ridge Regression with default settings:    MSE = ",MSE," R^2 = ",clf.score(data, target) )

# Modify parameter alpha: alpha=0.5 gives a slightly higher training R^2 than the
# default (alpha=1.0), but a higher cross-validated MSE
clf = linear_model.Ridge(alpha=0.5)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Ridge Regression with alpha=0.5:           MSE = ",MSE," R^2 = ",clf.score(data, target) )

# fit_intercept=False
clf = linear_model.Ridge(fit_intercept=False)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Ridge Regression with fit_intercept=false: MSE = ",MSE," R^2 = ",clf.score(data, target) )

# Modify parameter solver: solver='lsqr' gives the lowest cross-validated MSE of the Ridge settings tried here
clf = linear_model.Ridge(solver='lsqr')
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Ridge Regression with solver='lsqr':       MSE = ",MSE," R^2 = ",clf.score(data, target) )


Use Mean Squared Error (MSE) and Coefficient of Determination (R^2) to measure the performance:
Ridge Regression with default settings:    MSE =  2.46586966196e+15  R^2 =  0.47149245887
Ridge Regression with alpha=0.5:           MSE =  2.49697052704e+15  R^2 =  0.471969734855
Ridge Regression with fit_intercept=false: MSE =  2.46569305504e+15  R^2 =  0.471459488646
Ridge Regression with solver='lsqr':       MSE =  2.44552892085e+15  R^2 =  0.469904512892
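
To make the role of alpha concrete: Ridge regression minimizes the squared error plus alpha times the squared L2 norm of the weights, which has the closed-form solution w = (X^T X + alpha I)^(-1) X^T y when no intercept is fitted. The NumPy sketch below applies that formula to synthetic data and shows the weight norm shrinking as alpha grows. The data and the helper name `ridge_weights` are illustrative assumptions, and the sketch does not reproduce any particular scikit-learn solver.

```python
import numpy as np

def ridge_weights(X, y, alpha):
    """Closed-form ridge solution without an intercept term."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    # Solve the linear system rather than forming an explicit inverse
    return np.linalg.solve(gram, X.T @ y)

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

for alpha in (0.0, 0.5, 1.0, 100.0):
    w = ridge_weights(X_demo, y_demo, alpha)
    print(alpha, np.linalg.norm(w))
```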


In [7]:
# Reset target and data, make sure data is not changed
target = movies_array[:, 3]
data = movies_array[:, list(range(0,3))+list(range(4,len(movies_array[0])))]

# Bayesian Ridge Regression Model

# Default settings
clf = linear_model.BayesianRidge()
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Use Mean Squared Error (MSE) and Coefficient of Determination (R^2) to measure the performance:")
print("Bayesian Ridge Regression with default settings:           MSE = ",MSE," R^2 = ",clf.score(data, target) )

# Different alpha and lambda
clf = linear_model.BayesianRidge(alpha_1=1.e1, alpha_2=1.e2, lambda_1=1.e3, lambda_2=1.e4)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Bayesian Ridge Regression with Different alpha and lambda: MSE = ",MSE," R^2 = ",clf.score(data, target) )

# fit_intercept
clf = linear_model.BayesianRidge(fit_intercept=False)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Bayesian Ridge Regression with fit_intercept=False:        MSE = ",MSE," R^2 = ",clf.score(data, target) )

# compute_score
clf = linear_model.BayesianRidge(compute_score =True)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Bayesian Ridge Regression with compute_score =True:        MSE = ",MSE," R^2 = ",clf.score(data, target) )


Use Mean Squared Error (MSE) and Coefficient of Determination (R^2) to measure the performance:
Bayesian Ridge Regression with default settings:           MSE =  4.12674080594e+15  R^2 =  9.30033827728e-13
Bayesian Ridge Regression with Different alpha and lambda: MSE =  4.12674080591e+15  R^2 =  9.350520358e-12
Bayesian Ridge Regression with fit_intercept=False:        MSE =  5.87417706621e+15  R^2 =  -0.547289862272
Bayesian Ridge Regression with compute_score =True:        MSE =  4.12674080594e+15  R^2 =  9.30033827728e-13


In [12]:
# Reset target and data, make sure data is not changed
target = movies_array[:, 3]
data = movies_array[:, list(range(0,3))+list(range(4,len(movies_array[0])))]

# Lasso Regression Model

# All Lasso settings below use tol=1, because smaller tolerances
# raised ConvergenceWarning

# Default settings
clf = linear_model.Lasso(tol=1)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Lasso with tol=1:      MSE = ",MSE," R^2 = ",clf.score(data, target))

# alpha changed
clf = linear_model.Lasso(alpha=0.5,tol=1)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Lasso with alpha=0.5:      MSE = ",MSE," R^2 = ",clf.score(data, target))


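# Labelled normalize=True in the output, but the call below only passes fit_intercept=True
# (the default), so the result is identical to the default-settings run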
clf = linear_model.Lasso(fit_intercept=True,tol=1)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Lasso with normalize=True:        MSE = ",MSE," R^2 = ",clf.score(data, target) )


# fit_intercept changed
clf = linear_model.Lasso(fit_intercept=False, tol=1)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Lasso with fit_intercept=False:        MSE = ",MSE," R^2 = ",clf.score(data, target) )

# precompute changed
clf = linear_model.Lasso(precompute=True,tol=1)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Lasso with precomputer=True:        MSE = ",MSE," R^2 = ",clf.score(data, target) )

# positive changed
clf = linear_model.Lasso(positive=True,tol=1)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Lasso with positive=True:        MSE = ",MSE," R^2 = ",clf.score(data, target) )

# warm_start changed
clf = linear_model.Lasso(warm_start=True,tol=1)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Lasso with warm_start=True:        MSE = ",MSE," R^2 = ",clf.score(data, target) )

# copy_X changed
clf = linear_model.Lasso(copy_X=False,tol=1)
clf.fit(data, target) 
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Lasso with copy_X=False:        MSE = ",MSE," R^2 = ",clf.score(data, target) )



Lasso with tol=1:      MSE =  2.53337154458e+15  R^2 =  0.458067873992
Lasso with alpha=0.5:      MSE =  2.53337304789e+15  R^2 =  0.458067831135
Lasso with normalize=True:        MSE =  2.53337154458e+15  R^2 =  0.458067873992
Lasso with fit_intercept=False:        MSE =  2.75324351024e+15  R^2 =  0.391314906877
Lasso with precomputer=True:        MSE =  2.53337154458e+15  R^2 =  0.458067873992
Lasso with positive=True:        MSE =  2.49489904854e+15  R^2 =  0.441464092428
Lasso with warm_start=True:        MSE =  2.53337154458e+15  R^2 =  0.458067873992
Lasso with copy_X=False:        MSE =  2.53337154458e+15  R^2 =  0.177954146882
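
The default and alpha=0.5 runs give almost identical scores. One plausible reason, offered as a hedged reading rather than a verified diagnosis, is the scale of the target: scikit-learn's Lasso minimizes (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1, and with gross values in the tens of millions the squared-error term dwarfs the L1 penalty for any alpha near 1. The NumPy sketch below evaluates both terms on synthetic data of a similar magnitude; all numbers and the helper name `lasso_objective_terms` are made up for illustration.

```python
import numpy as np

def lasso_objective_terms(X, y, w, alpha):
    """Return the squared-error term and the L1 penalty of the Lasso objective."""
    n_samples = X.shape[0]
    residual = y - X @ w
    data_term = residual @ residual / (2 * n_samples)
    penalty = alpha * np.abs(w).sum()
    return data_term, penalty

# Synthetic data with a target on the order of 1e7, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
w_demo = 1e7 * np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ w_demo + 1e7 * rng.normal(size=200)

for alpha in (1.0, 0.5):
    data_term, penalty = lasso_objective_terms(X_demo, y_demo, w_demo, alpha)
    print(alpha, data_term, penalty, penalty / data_term)
```

Under this reading, much larger alpha values or a rescaled target would be needed before the penalty visibly changes the fit.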


In [14]:
# After reviewing all the models, we found that the Ridge regression model with
# solver='lsqr' performs best (lowest cross-validated MSE).
# We therefore choose this Ridge model and report its most important features.

target = movies_array[:, 3]
data = movies_array[:, list(range(0,3))+list(range(4,len(movies_array[0])))]
clf = linear_model.Ridge(solver='lsqr')
clf.fit(data, target)
MSE_array = model_selection.cross_val_score(clf, data, target, cv=10, scoring = 'neg_mean_squared_error')
MSE = np.absolute(np.mean(MSE_array))
print("Performance of best model: MSE =",MSE," R^2 = ",clf.score(data, target))


# Report Important Features
# Align feature with its name
data_name = list(movies.columns.values)
del data_name[3]
features = zip(clf.coef_, data_name)

# Sort features according to the absolute value of their weights
features = sorted(features, key = lambda x:-np.abs(x[0]))

# Print out the top ten features
for i in range(0,10):
    print("Feature #",i+1,"   ",features[i][1]," (Weight:",features[i][0],")")


Performance of best model: MSE = 2.4455338374e+15  R^2 =  0.469904512892
Feature # 1     Japan  (Weight: -55841237.3165 )
Feature # 2     USA  (Weight: 30814705.9574 )
Feature # 3     Hungary  (Weight: -27750059.7128 )
Feature # 4     Documentary  (Weight: -20355670.4002 )
Feature # 5     Animation  (Weight: 20166026.9034 )
Feature # 6     num_critic_for_reviews  (Weight: 19990562.8122 )
Feature # 7     Family  (Weight: 19939213.7658 )
Feature # 8     Drama  (Weight: -18989489.516 )
Feature # 9     West Germany  (Weight: -18708252.9169 )
Feature # 10     Canada  (Weight: 18471238.6345 )
