# Film Prediction

The success of a film is usually measured by looking at both its critical performance (reviews/scores) and commercial performance (gross). I want to see if it is possible to predict an upcoming film's success by using machine learning and data about the film.

First, I import the necessary packages.

In [10]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
import omdb

# Importing Data

It is surprisingly difficult to find a good film dataset because IMDb doesn't have an API. Instead, it provides a dump of text files in inconsistent formats, and using those files with MySQL and IMDbPY proved frustrating. So I will use the movie dataset included in the ggplot2 R package ([available here](http://had.co.nz/data/movies/)).

In [11]:
data = np.genfromtxt('movies.tab', delimiter = '\t', skip_header = 1, usecols = (1, 2, 3, 4))

The selected columns are, in order: year, length, budget, rating.

Next I split the data into input variables (X) and output variables (y).

In [12]:
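# remove rows with missing data: np.isnan marks NaN cells, .any(axis=1) flags
# rows containing at least one NaN, and ~ inverts the mask to keep complete rows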
data = data[~np.isnan(data).any(axis=1)]
# create X and Y
data_X = data[:, :3]
data_y = data[:, 3]
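
To see how the missing-data filter works, here is a minimal sketch on a made-up four-row table (the values are hypothetical and only stand in for year, length, budget, and rating):

```python
import numpy as np

# Hypothetical toy table: columns stand in for year, length, budget, rating
toy = np.array([
    [1999.0, 120.0, 5.0e6, 7.1],
    [2001.0,  95.0, np.nan, 6.3],   # missing budget
    [1985.0, np.nan, 1.0e6, 5.0],   # missing length
    [2004.0, 110.0, 2.0e7, 8.2],
])

is_missing = np.isnan(toy)             # True wherever a cell is NaN
row_has_gap = is_missing.any(axis=1)   # True for rows with at least one NaN
clean = toy[~row_has_gap]              # boolean indexing keeps the complete rows

print(row_has_gap)   # [False  True  True False]
print(clean.shape)   # (2, 4)
```

The one-liner in the cell above simply chains these three steps together.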

Then I split the data into training and test sets, holding out the last 200 rows for testing.

In [13]:
data_X_train = data_X[:-200]
data_X_test = data_X[-200:]
data_y_train = data_y[:-200]
data_y_test = data_y[-200:]
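
Slicing off the last rows is simple, but if the rows in the file happen to be ordered in some way (an assumption; I have not checked), the held-out rows may not be representative. A shuffled split avoids that. The following is a minimal sketch on made-up arrays, not the split used in this notebook; `X`, `y`, `n_test`, and the seed are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)                   # fixed seed so the split is repeatable
X = np.arange(20, dtype=float).reshape(10, 2)    # hypothetical inputs: 10 rows, 2 features
y = np.arange(10, dtype=float)                   # hypothetical targets, one per row

n_test = 3
order = rng.permutation(len(X))                  # random ordering of the row indices
test_idx, train_idx = order[:n_test], order[n_test:]

# Index X and y with the same indices so rows and targets stay aligned
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)   # (7, 2) (3, 2)
```

scikit-learn's `train_test_split` in `sklearn.model_selection` does the same job in one call.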

Now I fit a model using scikit-learn's linear regression.

In [14]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(data_X_train, data_y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Finally, I display the results.

In [15]:
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error on the test set
print("Mean squared error: %.2f" % np.mean((regr.predict(data_X_test) - data_y_test) ** 2))
# Coefficient of determination (R^2): 1 is perfect prediction
print('Coefficient of determination: %.2f' % regr.score(data_X_test, data_y_test))

Coefficients: 
 [ -3.88070485e-03   1.45282685e-03  -3.40938046e-10]
Mean squared error: 2.41
Coefficient of determination: -0.02


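As you can see, the coefficient of determination is close to 0 (slightly negative, in fact), which means that on the test set the model does no better than always predicting the mean rating. The coefficients are also close to 0, although their size depends on the units of each feature, so that alone says little. This data does not fit well with linear regression: either we have chosen a bad model, or there is little linear relationship between year, length, budget, and rating.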
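
To make the two numbers concrete, here is a small sketch of how they can be computed by hand. The helper names `mse` and `r_squared` and the toy ratings are my own, for illustration only; `r_squared` follows the standard definition of the coefficient of determination, which is what `regr.score` reports.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average squared difference."""
    return np.mean((y_pred - y_true) ** 2)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)            # error of the model
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([5.0, 6.0, 7.0, 8.0])          # hypothetical ratings
baseline = np.full_like(y_true, y_true.mean())   # always predict the mean rating
worse = np.array([8.0, 7.0, 6.0, 5.0])           # predictions worse than the mean

print(r_squared(y_true, baseline))   # 0.0
print(r_squared(y_true, worse))      # -3.0
print(mse(y_true, worse))            # 5.0
```

A score of 0 therefore corresponds to the "always guess the mean" baseline, and a negative score means the model is doing worse than that baseline on the data it is scored against.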

# Adding features

Since we appear to be underfitting the data, I will try adding more features. I will add a feature vector that represents the **genre** of a film, and then generate polynomial features from all of the inputs. Luckily, the genre data is already in the original data file.

In [16]:
data = np.genfromtxt('movies.tab', comments = '\\', delimiter = '\t', skip_header = 1, usecols = (1,2,3,4,17,18,19,20,21,22,23) )

### Preprocessing

In [17]:
# remove rows with missing data: keep only rows in which no cell is NaN
data = data[~np.isnan(data).any(axis=1)]


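# create X and y: drop the rating (column 3) from the inputs and use it as the target
data_X = np.delete(data, 3, axis=1)
data_y = data[:, 3]

# expand the inputs with all polynomial terms up to degree 2
poly = PolynomialFeatures(2)
data_X = poly.fit_transform(data_X)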

# Split into training/testing
data_X_train = data_X[:-800]
data_X_test = data_X[-800:]
data_y_train = data_y[:-800]
data_y_test = data_y[-800:]
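
As a quick illustration of what `PolynomialFeatures(2)` produces, here is a sketch on a single made-up row with two features; the toy values are hypothetical. The count at the end assumes the 10 input columns left after removing the rating from the 11 columns loaded above.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

toy_X = np.array([[2.0, 3.0]])        # one hypothetical row with features a=2, b=3
poly = PolynomialFeatures(2)
expanded = poly.fit_transform(toy_X)  # columns: 1, a, b, a^2, a*b, b^2

print(expanded)   # [[1. 2. 3. 4. 6. 9.]]

# Number of output columns for n input features at degree d is C(n + d, d)
n_terms = comb(10 + 2, 2)
print(n_terms)    # 66
```

So the regression in the next cell fits one coefficient per expanded column, including the constant column and every pairwise product.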

### Regression

In [18]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(data_X_train, data_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error on the test set
print("Mean squared error: %.2f" % np.mean((regr.predict(data_X_test) - data_y_test) ** 2))
# Coefficient of determination (R^2): 1 is perfect prediction
print('Coefficient of determination: %.2f' % regr.score(data_X_test, data_y_test))

Coefficients: 
 [  0.00000000e+00  -1.66167810e-05   1.70235227e-04  -8.55187946e-07
   1.10974076e-06   4.29489050e-07  -2.13560660e-07  -8.02076978e-08
   1.00209285e-06   1.07521010e-06   1.22246956e-06  -2.66590067e-06
   1.83341463e-05   4.23516377e-10  -4.21537556e-04  -4.68903607e-06
   1.40667679e-04   3.63644127e-04   6.01414749e-04   2.83335458e-04
   2.15524679e-03  -7.14125357e-05   7.55499442e-11   5.40596809e-03
   1.12793898e-02   5.22652029e-04  -3.26526389e-04   3.47002692e-03
  -3.55459208e-03  -2.88354160e-02  -6.15214399e-18   2.09012252e-09
   5.92407923e-09  -4.19711861e-09  -5.32570607e-09  -1.25531470e-07
  -6.85326506e-09   1.66762066e-08   1.09270426e-06  -1.91867127e-05
  -2.90939986e-05  -6.17909402e-06   2.27517684e-21   3.11467824e-06
   6.28976324e-06   4.21717850e-07  -3.78332626e-05   1.56034157e-06
   5.52621487e-06   1.15084267e-05  -1.00183379e-04  -2.33808361e-07
  -4.83403125e-05   1.33901989e-05   5.59954020e-05   8.11178395e-05
  -7.56254315e-08 

Our coefficient of determination has improved slightly, but it is still low. At this point, I will have to bite the bullet and start collecting more data that may help the model better predict film scores. After my negative experience working with the IMDb data, I plan to use OMDb (the Open Movie Database) and its API.

### Quantifying People

I suspect that the previous success of the people involved in a film might be a good predictor of the film's success. To include this in my model, however, I first have to quantify people's past performance. I think I will simply look at the ratings of a person's past films, using both lifetime data (i.e. the average rating of all their past films) and recent data (i.e. the average rating of their three most recent films).
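
A minimal sketch of these two measures, assuming a person's past ratings are available as a list ordered from oldest to newest. The function name `person_scores`, its `recent_n` parameter, and the sample ratings are hypothetical.

```python
from statistics import mean

def person_scores(past_ratings, recent_n=3):
    """Return (lifetime average, average of the last `recent_n` ratings).

    `past_ratings` is ordered oldest to newest. A person with no past
    films gets (None, None), since there is nothing to average.
    """
    if not past_ratings:
        return None, None
    lifetime = mean(past_ratings)
    recent = mean(past_ratings[-recent_n:])   # uses fewer films if fewer exist
    return lifetime, recent

ratings = [6.0, 7.0, 8.0, 5.0, 9.0]           # hypothetical ratings of past films
lifetime, recent = person_scores(ratings)
print(lifetime)           # 7.0
print(round(recent, 2))   # 7.33
```

How to treat people with no past films (the `None` case here) is an open design choice; the model would need some default value for them.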

In [None]:
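# note: column 0 is not numeric (presumably the title), so with the default float dtype it is read as NaN
data = np.genfromtxt('movies.tab', comments = '\\', delimiter = '\t', skip_header = 1, usecols = (0,1,2,3,4,17,18,19,20,21,22,23) )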