In [None]:
import numpy as np
import pandas as pd
from plotnine import *
from plydata import *
import matplotlib.pyplot as plt

# The idea

This little project looks into whether we can predict a movie's success from its characteristics (like budget or director). For the sake of flexibility, there will be two measures of success: IMDB score and gross revenue.

To begin, I checked the available factors and overall dimensionality of the data.

In [None]:
# Load data
data = pd.read_csv('movies.csv', encoding='latin1')

# List the columns (characteristics) that are available
list(data)

In [None]:
# Check the dimensionality
data.shape

# Cleaning up the data

As it turns out, `year` and `released` contain similar information. Here I'm only interested in the year, so I'll remove the `released` column. I'll also remove `name`, since it's a unique identifier that won't provide any predictive info (unless I were to do a semantic analysis, but no...). It is also worth checking whether `country` will be a useful predictor. If most movies come from the US, then there might not be enough variability in this feature to justify its inclusion.

In [None]:
# Get the number of times each country appears
countrydata = data.groupby('country').size().reset_index(name='count')


# Count the movies per country and sort the countries by that count
c_categories = (data >> count('country', sort=True) >> pull('country'))

df = data.copy()
df['country'] = pd.Categorical(df['country'], categories=c_categories, ordered=True)

# plot
(ggplot(df) + 
 aes(x='country') +
 geom_bar() + 
 coord_flip() +
 theme_classic() +
 theme(axis_text_y=element_text(size=5))
)

The answer seems to be that country will be unhelpful, so I'll remove it too. (**NOTE: REMOVE GROSS. SCORE IS MORE LIKELY TO PREDICT GROSS, NOT THE OTHER WAY AROUND**)
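
One way to put a number on that impression is to compute each country's share of the rows. This is only a minimal sketch on a made-up `toy` frame; the values are invented for illustration and are not taken from `movies.csv`.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented
toy = pd.DataFrame({"country": ["USA"] * 8 + ["UK", "France"]})

# Fraction of rows per country, largest first
country_share = toy["country"].value_counts(normalize=True)
top_share = country_share.iloc[0]
print(country_share)
```

On the real data, the same call on `data['country']` would show how dominant the most common country is.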

In [None]:
# Remove the chosen columns
data.drop(['released','name','country'], inplace=True, axis=1)

Now that the features have been trimmed, here is a short set of descriptive stats for the numeric factors of the data. This is meant to give a broad overview of any interesting/doubtful elements.

In [None]:
data.describe()

I'll pay no attention to `year` for now, since it will be useful to keep it as a continuous numeric column for visualization purposes. At first sight, it's curious that the 25th percentile of budget amounts is 0. Since there are so many datapoints, I'll see how much we lose by removing movies with the impossible budget of 0 (at least impossible as far as I know).
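
A quick way to size that loss before dropping anything is to count the zero-budget rows. The sketch below uses an invented `toy` frame rather than the real data, so the numbers mean nothing beyond the example.

```python
import pandas as pd

# Toy stand-in for the budget column; the values are invented
toy = pd.DataFrame({"budget": [0, 0, 1_000_000, 25_000_000, 0, 8_000_000]})

zero_mask = toy["budget"] == 0
n_zero = int(zero_mask.sum())   # number of rows that would be dropped
frac_zero = zero_mask.mean()    # the same quantity as a fraction of all rows
print(f"{n_zero} of {len(toy)} rows ({frac_zero:.0%}) have a budget of 0")
```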

In [None]:
# Remove the movies that have no budget (copy so that later column
# assignments don't trigger a SettingWithCopyWarning)
data = data[data.budget != 0].copy()

# See what the new dimension is
data.shape

# Saving this for later
# (ggplot(aes(x='budget'), data = data) +
#  geom_histogram() +
#  theme_classic()
# )

That removed ~2k rows, but we're still left with a good number of samples. Now let's look at distributions. This is particularly important for two reasons: many analyses assume normally distributed data, but things like budget cannot take negative values, and monetary distributions tend to be heavily right-skewed (Pareto- or log-normal-like).

I'll kill two birds with one stone by looking at distributions and correlations among all features. That way we can shave off heavily interdependent features.
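
To make "heavily interdependent" concrete, one option is to look at the absolute pairwise correlations and flag pairs above some cut-off. The sketch below runs on synthetic columns, and the `threshold` of 0.9 is a hypothetical choice, not something derived from the movie data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),                 # independent of both
})

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # hypothetical cut-off
redundant_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold
]
print(redundant_pairs)
```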

In [None]:
# Scatter-plot matrix: pairwise scatter plots, with histograms on the diagonal
pd.plotting.scatter_matrix(data, figsize=(12,9), alpha = 0.5);

The diagonal shows that budget, gross earnings, and vote counts are not normally distributed (maybe runtime too, but it looks fair right now). Luckily they all have similar distributions, so I'll go ahead and bring these variables closer to normal by taking their natural log (a common technique when you're dealing with reaction time data).
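
As a rough illustration of why the log helps, the sketch below compares sample skewness before and after the transform. The `budget` series here is synthetic (log-normal by construction), so it only shows the mechanism, not the behaviour of the real columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "budgets"; the parameters are invented
budget = pd.Series(rng.lognormal(mean=16, sigma=1.0, size=5000))

skew_raw = budget.skew()          # strongly positive for a long right tail
skew_log = np.log(budget).skew()  # close to 0 if the log is roughly symmetric
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

One caveat: `np.log(0)` is `-inf`, so any zeros left in a column would need handling first, for example by filtering them out or using `np.log1p` instead.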

In [None]:
# Get the log of these variables
data[['budget','gross','votes']] = np.log(data[['budget','gross','votes']])

# And let's take a second look at the distributions/correlations
pd.plotting.scatter_matrix(data, figsize=(12,9), alpha = 0.3);

That did the trick for the most part (eventually, adding a Q-Q plot against the normal distribution would be useful). The plot shows a number of linear relationships, but since this is a toy example I will keep them. Trimming the features further might become more relevant later.
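
For reference, a normal Q-Q check can be sketched without extra dependencies by comparing sorted values against theoretical normal quantiles. The `sample` below is synthetic, and the plotting positions `(i - 0.5) / n` are one common convention among several.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
sample = rng.normal(loc=3.0, scale=2.0, size=1000)  # synthetic data

# Plotting positions, then the matching standard-normal quantiles
n = len(sample)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
observed = np.sort(sample)

# Points close to a straight line suggest approximate normality;
# the correlation is a one-number summary of that straightness
qq_corr = np.corrcoef(theoretical, observed)[0, 1]
print(f"Q-Q correlation: {qq_corr:.3f}")
```

Plotting `observed` against `theoretical` (for instance with `plt.scatter`) gives the usual picture; marked curvature at the ends points to heavy or light tails.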

Now we know what our features will be, so let's define them and the outcome explicitly. Note that I'm removing `gross`, since it doesn't quite make sense to predict score from revenue (and gross will be the variable to be predicted later).

In [None]:
# Features and outcome variable
features = data[['budget','runtime','votes','year']]
score = data['score']

# Let's predict stuff!

First thing is to set aside training and testing subsamples. 

In [None]:
# Select n random movies for testing
n_test = 100
test_data = features.sample(n=n_test)
test_scores = score[test_data.index]
train_data = features.loc[~features.index.isin(test_data.index)]
train_scores = score[~score.index.isin(test_scores.index)]
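
The split above draws a different test set on every run. If repeatability matters, a variant along these lines fixes the draw with `random_state` and uses `drop` for the complement. The `toy_*` names and the seed of 42 are placeholders chosen for this sketch.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the features and scores; the values are invented
toy_features = pd.DataFrame({
    "budget": np.arange(10.0),
    "runtime": np.arange(90.0, 100.0),
})
toy_score = pd.Series(np.linspace(5.0, 8.0, 10), index=toy_features.index)

# random_state fixes the random draw so the split is repeatable
toy_test = toy_features.sample(n=3, random_state=42)
toy_train = toy_features.drop(toy_test.index)

# Select the scores by index so rows and labels stay aligned
toy_test_scores = toy_score.loc[toy_test.index]
toy_train_scores = toy_score.drop(toy_test.index)
```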

At this point the data are ready for a simple multiple regression, but for the sake of example we will nuke the problem with a neural net.
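
For comparison, a plain multiple-regression baseline could look roughly like the sketch below, which fits ordinary least squares with NumPy. Everything in it (`X`, `y`, `true_coefs`) is synthetic and only stands in for the four features and the score.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                 # stand-in for four features
true_coefs = np.array([0.5, -1.0, 2.0, 0.0])  # invented for the example
y = 6.0 + X @ true_coefs + rng.normal(scale=0.1, size=200)

# Add an intercept column and solve the least-squares problem
X_design = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# In-sample error; a real evaluation would use the held-out test rows
predictions = X_design @ coefs
rmse = np.sqrt(np.mean((y - predictions) ** 2))
print(coefs.round(2), round(float(rmse), 3))
```

A baseline like this gives the neural net something to beat: if it cannot improve on the held-out error of a linear fit, the extra machinery is not buying much.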