# Film Prediction

The success of a film is usually measured by looking at both its critical performance (reviews/scores) and commercial performance (gross). I want to see if it is possible to predict an upcoming film's success by using machine learning and data about the film.

First, I import the necessary packages.

In [3]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Importing Data

It is surprisingly difficult to find a good film dataset because IMDb doesn't have an API. Instead, they provide a dump of text files with inconsistent formats, and working with those files through MySQL and IMDbPY proved frustrating. I will therefore use the movie dataset included in the ggplot2 R package ([available here](http://had.co.nz/data/movies/)).

In [4]:
data = np.genfromtxt('movies.tab', delimiter = '\t', skip_header = 1, usecols = (1, 2, 3, 4))

The four columns read are year, length (in minutes), budget (in US dollars), and rating (the average IMDb user rating), in that order.
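To make the loading step concrete, here is a minimal sketch of how `np.genfromtxt` treats a tab-separated file with a text column and a missing value. The two sample rows and the variable names `sample` and `toy` are made up for illustration; they are not taken from `movies.tab`.

```python
import io
import numpy as np

# Hypothetical two-row sample in the same layout: title, year, length, budget, rating
sample = io.StringIO(
    "title\tyear\tlength\tbudget\trating\n"
    "Film A\t1999\t100\t5000000\t6.5\n"
    "Film B\t2001\t90\tNA\t7.1\n"
)

# skip_header=1 drops the header row; usecols skips the text title column (index 0)
toy = np.genfromtxt(sample, delimiter="\t", skip_header=1, usecols=(1, 2, 3, 4))

print(toy.shape)           # two rows, four numeric columns
print(np.isnan(toy[1, 2]))  # the "NA" budget cannot be parsed as a float, so it becomes NaN
```

Values that cannot be parsed as floats end up as NaN, which is why the rows with missing data have to be filtered out before fitting.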

Next I split the data into input variables (X) and output variables (y).

In [5]:
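# Remove rows with missing data: np.isnan(data) marks every NaN cell,
# .any(axis=1) flags each row that contains at least one NaN,
# and ~ inverts that mask so only complete rows are kept.
data = data[~np.isnan(data).any(axis=1)]
# Create X (year, length, budget) and y (rating)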
data_X = data[:, :3]
data_y = data[:, 3]
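The row filter is easier to follow when broken into steps. The sketch below uses a small made-up array (`toy`) rather than the real dataset, and the intermediate names `nan_cells` and `bad_rows` are only there for illustration.

```python
import numpy as np

# Hypothetical rows: year, length, budget, rating
toy = np.array([
    [1999.0, 100.0, 5.0e6, 6.5],
    [2001.0, 90.0, np.nan, 7.1],
    [1985.0, np.nan, 2.0e6, 5.9],
    [2003.0, 120.0, 1.0e7, 8.0],
])

nan_cells = np.isnan(toy)         # boolean array, True where a cell is NaN
bad_rows = nan_cells.any(axis=1)  # one boolean per row, True if any cell in the row is NaN
clean = toy[~bad_rows]            # boolean indexing keeps only the rows marked False

print(bad_rows)
print(clean.shape)
```

The one-liner in the notebook does exactly these three steps in a single expression.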

Next, I split the data again, this time into training and test sets, holding out the last 200 rows for testing.

In [6]:
data_X_train = data_X[:-200]
data_X_test = data_X[-200:]
data_y_train = data_y[:-200]
data_y_test = data_y[-200:]
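Slicing off the last 200 rows assumes the row order carries no pattern. If that is in doubt, one option is to shuffle before splitting. The following is a minimal NumPy-only sketch on made-up arrays; the names `X`, `y`, `n_test`, and the seed are assumptions for illustration (scikit-learn's `train_test_split` offers similar functionality).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the split is reproducible

# Hypothetical data: 10 samples with 3 features each
X = np.arange(30, dtype=float).reshape(10, 3)
y = np.arange(10, dtype=float)

n_test = 3
order = rng.permutation(len(X))  # random ordering of the row indices
test_idx, train_idx = order[:n_test], order[n_test:]

# Index X and y with the same indices so the rows stay paired
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```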

Next, I fit a model using scikit-learn's linear regression.

In [7]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(data_X_train, data_y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
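`LinearRegression` fits by ordinary least squares. As a rough sketch of the same idea (not scikit-learn's actual implementation), the fit can be reproduced with `np.linalg.lstsq` by appending a column of ones for the intercept. The data here is synthetic and noise-free, and `true_coef` and `true_intercept` are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free data generated from known coefficients
X = rng.normal(size=(50, 3))
true_coef = np.array([1.5, -2.0, 0.5])
true_intercept = 4.0
y = X @ true_coef + true_intercept

# Append a column of ones so the last fitted weight acts as the intercept
A = np.column_stack([X, np.ones(len(X))])

# Solve min ||A w - y||^2 in the least-squares sense
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
coef, intercept = solution[:3], solution[3]

print(coef, intercept)
```

Because the data has no noise, the recovered weights match the ones used to generate it; on real data they are only the best linear fit.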

Finally, I display the results.

In [8]:
# The coefficients, in column order: year, length, budget
print('Coefficients: \n', regr.coef_)
# The mean squared error on the test set
print("Mean squared error: %.2f" % np.mean((regr.predict(data_X_test) - data_y_test) ** 2))
# R^2 score: 1 is perfect prediction, 0 is no better than always predicting the mean
print('Variance score: %.2f' % regr.score(data_X_test, data_y_test))

Coefficients: 
 [ -3.88070485e-03   1.45282685e-03  -3.40938046e-10]
Mean squared error: 2.41
Variance score: -0.02
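The error printed above is a mean of squared residuals, and `regr.score` returns the coefficient of determination, R^2. A small worked example with made-up numbers (`y_true` and `y_pred` are hypothetical) shows how the two relate and why a score near 0 is bad news:

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_pred - y_true
rss = np.sum(residuals ** 2)   # residual sum of squares
mse = np.mean(residuals ** 2)  # mean squared error = rss / number of samples

# Total sum of squares: the error of always predicting the mean of y_true
tss = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - rss / tss

# A model that always predicts the mean scores exactly 0
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((baseline - y_true) ** 2) / tss

print(rss, mse, r2, r2_baseline)
```

A negative score therefore means the model's squared error on the test set is larger than that of the constant mean prediction.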


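As you can see, the variance score is close to 0 (slightly negative, in fact), so on the test set the model predicts ratings no better than always guessing the mean rating. The coefficients are also close to 0, although their size depends on each feature's units: budget is measured in dollars, so its coefficient is bound to be tiny. This data does not fit well with linear regression. Either we have chosen a bad model, or there is little linear relationship between year, length, budget, and rating.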
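One way to probe the second explanation is to compute the Pearson correlation between each input column and the target. The sketch below runs on synthetic data, not on the film dataset, and the helper name `feature_correlations` is hypothetical. In the notebook it could be applied to `data_X` and `data_y`.

```python
import numpy as np

def feature_correlations(X, y):
    """Return the Pearson correlation between each column of X and y."""
    return np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

rng = np.random.default_rng(0)

# Synthetic data: only the first column is linearly related to y
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

corr = feature_correlations(X, y)
print(corr)
```

Correlations near 0 for every column would support the view that no linear model on these inputs can do well, while a clearly nonzero value would point back at the modelling or the split.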