# Finding the Hollywood Formula: IMDB Movie Dataset Analysis

<img src="dataset-cover.png"/>

### Motivation
As a group, we are huge movie fanatics and enjoy great films such as The Godfather, Casablanca, and any Tarantino flick. As data scientists, we wanted to dig deeper into the business side of movies and explore the economics behind what makes a movie successful. Specifically, we wanted to examine whether there are any trends among films that lead them to succeed at the box office, and whether a film's box office success correlates with its ratings. A useful analysis would help us predict how well a film will do at the box office before it screens, without having to rely on critics or our own instincts. Essentially, we want to determine whether there is a "Hollywood formula" for making a successful movie.

### Background
We found an interesting dataset of more than 5000 data points consisting of 28 attributes describing IMDB movies here: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset.

You can read more about the dataset here: https://blog.nycdatascience.com/student-works/machine-learning/movie-rating-prediction/. 

### Original Problem
Kaggle user [chuansun76](https://www.kaggle.com/deepmatrix) was trying to solve the following problem:
> 1. Given that thousands of movies were produced each year, is there a better way for us to tell the greatness of movie without relying on critics or our own instincts?
> 2. Will the number of human faces in movie poster correlate with the movie rating?

### Our Problem
We decided to tackle the problem by trying to answer the following questions: 
1. Do the genre, imdb score, and popularity of the cast impact a film's success at the box office?
2. Which movies have a high return on investment (ROI) relative to their budget, and why?

First, let's import the packages and libraries we'll need.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import numpy as np
from ggplot import *
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

## I. Data Collection

We provided two ways to obtain the data scraped by chuansun76: either downloading the data directly online or simply reading in the downloaded file from our data folder.

In [2]:
import csv

DATA_URL = "https://raw.githubusercontent.com/sundeepblue/movie_rating_prediction/master/movie_metadata.csv"
FILE_PATH = "data/movie_metadata.csv"

def load_data_online(data_url):
    """Download the CSV and return it as a DataFrame, or None if the request fails."""
    data = None
    SUCCESS = 200
    
    r = requests.get(data_url)
    if r.status_code == SUCCESS:
        # Decode data and read it into a DataFrame
        content = r.content.decode('utf-8')
        cr = csv.reader(content.splitlines(), delimiter=',')
        my_list = list(cr)
        # First row holds the column names, the rest are data rows
        data = pd.DataFrame(my_list[1:], columns=my_list[0])
    return data
    
# movies_table = load_data_online(DATA_URL)
movies_table = pd.read_csv(FILE_PATH)
movies_table.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
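
The online loader above builds the DataFrame from `csv.reader` rows, which leaves every column as a string. A possible alternative, sketched here under the assumption that the CSV text has already been downloaded, is to hand the text to `pd.read_csv` through `io.StringIO` so that numeric columns are inferred. The helper name `csv_text_to_frame` and the two-row sample are made up for illustration.

```python
import io

import pandas as pd


def csv_text_to_frame(csv_text):
    """Parse CSV text into a DataFrame, letting pandas infer column dtypes."""
    return pd.read_csv(io.StringIO(csv_text))


# Hypothetical two-row sample standing in for the downloaded file
sample_text = "movie_title,gross,imdb_score\nFilm A,1000,7.5\nFilm B,,6.0\n"
sample = csv_text_to_frame(sample_text)
print(sample.dtypes)
```

With this approach the empty `gross` field in the second row becomes a missing value rather than an empty string.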


## II. Data Processing

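For our analysis, we will drop any rows with missing values for `gross`, `budget`, `title_year`, or `country`, as those data points won't contribute to our analysis. We will also focus on movies produced domestically, so we will drop all rows for movies produced outside the United States. In the cell below, missing values are first replaced with placeholders (0 or the string "NaN"); the USA filter removes the rows with no country, and the rows with a 0 are dropped in a later step.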

In [4]:
movies_table = pd.read_csv("data/movie_metadata.csv")

# replace missing values with placeholders so the affected rows can be dropped
movies_table["gross"] = movies_table["gross"].fillna(0)
movies_table["budget"] = movies_table["budget"].fillna(0)
movies_table["title_year"] = movies_table["title_year"].fillna(0)
movies_table["country"] = movies_table["country"].fillna("NaN")

# only consider movies made in the USA. Drop all other rows
# (~ negates the boolean mask; unary minus is not supported on boolean Series)
movies_table.drop(movies_table[~movies_table["country"].str.contains("USA")].index, inplace=True)

movies_table.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
6,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0
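
The same filtering can also be written without placeholder values. The sketch below, run on a small made-up frame rather than the real dataset, uses `dropna(subset=...)` to remove rows missing any of the four columns and then keeps only rows whose `country` is exactly `USA` (an assumption; the notebook matches any country string containing `USA`).

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "gross": [100.0, np.nan, 300.0, 400.0],
    "budget": [50.0, 60.0, 70.0, np.nan],
    "title_year": [2001.0, 2002.0, 2003.0, 2004.0],
    "country": ["USA", "USA", "UK", "USA"],
})

required = ["gross", "budget", "title_year", "country"]

# Drop rows missing any required value, then keep US productions
cleaned = toy.dropna(subset=required)
cleaned = cleaned[cleaned["country"] == "USA"].copy()
print(cleaned)
```

Unlike the placeholder approach, this version would keep a row whose recorded gross or budget is a genuine 0.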


To be able to compare movies across different years we will need to convert gross and budget values into real dollar amounts, in terms of 2016 purchasing power. To accomplish this we will use the [Consumer Price Index (CPI)](https://www.bls.gov/cpi/) to adjust for inflation. 

Let's scrape the CPI values for every year from 1913 to 2016, inclusive. Using [The US Inflation Calculator](http://www.usinflationcalculator.com/) as our source, we'll traverse all the rows in the table to build our DataFrame.

In [5]:
url = "http://www.usinflationcalculator.com/inflation/consumer-price-index-and-annual-percent-changes-from-1913-to-2008/"

r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

table = soup.find('table')
rows = table.tbody.find_all('tr')

years = []
cpis = []

for row in rows:
    cells = row.find_all('td')
    year = cells[0].get_text()
    # keep only data rows (first cell is a year) up to and including 2016
    if year.isdigit() and int(year) < 2017:
        years.append(int(year))
        # column 13 is read as the average annual CPI
        cpis.append(float(cells[13].get_text()))

cpi_table = pd.DataFrame({
    "year": years,
    "avg_annual_cpi": cpis
})

cpi_table.head()

Unnamed: 0,avg_annual_cpi,year
0,9.9,1913
1,10.0,1914
2,10.1,1915
3,10.9,1916
4,12.8,1917


First, let's define a function to translate nominal dollars into real 2016 dollars using the CPI. We'll use this [equation](http://www.nasdaq.com/article/how-to-calculate-the-real-value-of-money-using-the-cpi-formula-cm564346) to calculate the real value:
> Past dollars in terms of recent dollars = Dollar amount × Ending-period CPI ÷ Beginning-period CPI.

In [6]:
def get_real_value(nominal_amt, old_cpi, new_cpi):
    real_value = (nominal_amt * new_cpi) / old_cpi
    return real_value
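
As a quick sanity check of the formula, the sketch below restates the function so the block runs on its own and converts \$100 from 1913 (CPI 9.9, from the table above) into later dollars. The ending CPI of 240.0 is a round illustrative figure, not the scraped 2016 value.

```python
def get_real_value(nominal_amt, old_cpi, new_cpi):
    # Dollar amount x ending-period CPI / beginning-period CPI
    return (nominal_amt * new_cpi) / old_cpi


# 9.9 is the 1913 CPI shown above; 240.0 is an assumed ending CPI
real_100 = get_real_value(100, old_cpi=9.9, new_cpi=240.0)
print(round(real_100, 2))
```

Passing the same CPI for both periods returns the nominal amount unchanged, which is a handy check that the arguments are in the right order.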

Let's drop all rows in `movies_table` with a `budget`, `gross`, or `title_year` of 0, as those rows won't contribute to our analysis:

In [7]:
movies_table.drop(movies_table[(movies_table["budget"] == 0) | (movies_table["gross"] == 0) | 
                                (movies_table["title_year"] == 0)].index, inplace=True)

We're interested in real dollars in *2016*, so let's make things easier for ourselves and store the 2016 CPI as a constant.

In [8]:
CPI_2016 = float(cpi_table.loc[cpi_table['year'] == 2016, 'avg_annual_cpi'].iloc[0])

Now we're ready to transform the `budget` and `gross` for each movie into real 2016 dollar terms:

In [9]:
real_domestic_gross = []
real_budget_values = []

# must transform gross and budget values into real 2016 dollar terms
for index, row in movies_table.iterrows():
    gross = row['gross']
    budget = row['budget']
    year = row['title_year']
    cpi = float(cpi_table.loc[cpi_table['year'] == int(year), 'avg_annual_cpi'].iloc[0])
    
    real_gross = get_real_value(gross, cpi, CPI_2016)
    real_budget = get_real_value(budget, cpi, CPI_2016)
    real_domestic_gross.append(real_gross)
    real_budget_values.append(real_budget)

movies_table["real_domestic_gross"] = real_domestic_gross
movies_table["real_budget"] = real_budget_values
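
The row-by-row loop works, but the same adjustment can be sketched in vectorized form by mapping each `title_year` to its CPI. The frames below are small stand-ins with made-up movie figures, and the 2016 CPI of 240.0 is an assumed round number.

```python
import pandas as pd

# Stand-in CPI table; 240.0 for 2016 is an assumed round number
cpi_toy = pd.DataFrame({"year": [1913, 2016], "avg_annual_cpi": [9.9, 240.0]})
# Made-up movies
movies_toy = pd.DataFrame({
    "title_year": [1913.0, 2016.0],
    "gross": [99.0, 500.0],
    "budget": [9.9, 100.0],
})

# Series indexed by year, used as a lookup table
cpi_by_year = cpi_toy.set_index("year")["avg_annual_cpi"]
cpi_2016 = cpi_by_year.loc[2016]

# CPI of each movie's release year
movie_cpi = movies_toy["title_year"].astype(int).map(cpi_by_year)

movies_toy["real_domestic_gross"] = movies_toy["gross"] * cpi_2016 / movie_cpi
movies_toy["real_budget"] = movies_toy["budget"] * cpi_2016 / movie_cpi
print(movies_toy)
```

A year with no entry in the CPI table maps to a missing value here instead of raising an error, so such rows would surface as missing values in the new columns.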

We'll also drop the nominal value columns, as those won't contribute to our analysis.

In [10]:
# drop the gross and budget cols because we won't use the nominal values
movies_table.drop(labels='gross', axis=1, inplace=True)
movies_table.drop(labels='budget', axis=1, inplace=True)

movies_table.head()   

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,genres,actor_1_name,...,language,country,content_rating,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,real_domestic_gross,real_budget
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,...,English,USA,PG-13,2009.0,936.0,7.9,1.78,33000,850793700.0,265136800.0
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,Action|Adventure|Fantasy,Johnny Depp,...,English,USA,PG-13,2007.0,5000.0,7.1,2.35,0,358220800.0,347332900.0
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,Action|Thriller,Tom Hardy,...,English,USA,PG-13,2012.0,23000.0,8.5,2.35,164000,468455100.0,261338500.0
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,Action|Adventure|Sci-Fi,Daryl Sabara,...,English,USA,PG-13,2012.0,632.0,6.6,2.35,24000,76372180.0,275659800.0
6,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,Action|Adventure|Romance,J.K. Simmons,...,English,USA,PG-13,2007.0,11000.0,6.2,2.35,0,389626800.0,298706300.0


Let's also calculate the return on investment (ROI) and absolute profit for each movie. ROI expresses a movie's profit relative to its budget, so it shows which movies earned the most for every dollar spent and is useful for evaluating the success of a film in an economic sense. **We will be storing the ROI values as percentages.**

In [36]:
profits = []
roi_vals = []

for index, row in movies_table.iterrows():
    budget = row['real_budget']
    profit = row['real_domestic_gross'] - budget
    # ROI is profit relative to budget, expressed as a percentage
    roi = (profit / budget) * 100
    
    profits.append(profit)
    roi_vals.append(roi)
    
movies_table['profit'] = profits
movies_table['roi'] = roi_vals

movies_table.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,genres,actor_1_name,...,content_rating,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,real_domestic_gross,real_budget,profit,roi
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,...,PG-13,2009.0,936.0,7.9,1.78,33000,850793700.0,265136800.0,585656900.0,220.888543
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,Action|Adventure|Fantasy,Johnny Depp,...,PG-13,2007.0,5000.0,7.1,2.35,0,358220800.0,347332900.0,10887900.0,3.134717
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,Action|Thriller,Tom Hardy,...,PG-13,2012.0,23000.0,8.5,2.35,164000,468455100.0,261338500.0,207116700.0,79.252257
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,Action|Adventure|Sci-Fi,Daryl Sabara,...,PG-13,2012.0,632.0,6.6,2.35,24000,76372180.0,275659800.0,-199287700.0,-72.294775
6,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,Action|Adventure|Romance,J.K. Simmons,...,PG-13,2007.0,11000.0,6.2,2.35,0,389626800.0,298706300.0,90920510.0,30.438102
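
For reference, ROI as a percentage is profit divided by budget, times 100. A vectorized sketch of that calculation on made-up numbers (not rows from the dataset) could look like this:

```python
import pandas as pd

# Made-up figures for illustration
toy = pd.DataFrame({
    "real_domestic_gross": [300.0, 50.0],
    "real_budget": [100.0, 100.0],
})

toy["profit"] = toy["real_domestic_gross"] - toy["real_budget"]
# ROI as a percentage of the budget
toy["roi"] = toy["profit"] / toy["real_budget"] * 100
print(toy)
```

A movie that grosses three times its budget has an ROI of 200%, while one that earns back only half its budget has an ROI of -50%.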


Now that we have tidied and processed the data, adding new data that will aid our analysis, we can proceed to the next step of the data science pipeline. 

## III. Exploratory Analysis and Data Visualization

### Analyzing imdb score vs. gross
Let's see if there is a relationship between a film's imdb score and its box office gross.

In [None]:
x = 'imdb_score'
y = 'real_domestic_gross'

# take the first 500 rows as a sample of the dataset
# (copy so the in-place operations below don't act on a view of movies_table)
subset_data = movies_table.head(500)[[x, y]].copy()
subset_data.sort_values(['imdb_score'], ascending=True, inplace=True)
subset_data.dropna(inplace=True)

subset_data.plot(x=x, y=y, kind='scatter')

The scatter plot suggests an almost exponential relationship between a film's imdb score and its box office gross. Let's also create violin plots of this relationship to better visualize the spread and density of the data distribution.

In [None]:
ggplot(aes(x=x, y=y), data=subset_data) +\
    geom_violin() +\
    labs(title="Imdb score vs. real domestic gross",
         x = x,
         y = y)

Mention analysis of violin plots
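
One way to probe a suspected exponential relationship is to take the logarithm of the gross: if gross grows exponentially with score, log-gross should be close to linear in score. The sketch below uses synthetic data generated to be exactly exponential, so it only demonstrates the check and says nothing about the movies themselves.

```python
import numpy as np
import pandas as pd

# Synthetic data: gross is constructed as an exact exponential of the score
scores = pd.Series([5.0, 6.0, 7.0, 8.0, 9.0])
gross = 1e6 * np.exp(scores)

# Pearson correlation on the raw and on the log-transformed gross
r_raw = scores.corr(gross)
r_log = scores.corr(np.log(gross))
print(r_raw, r_log)
```

On real data, a noticeably higher correlation after the log transform would be consistent with an exponential trend, though it would not prove one.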

### Analyzing imdb score vs. gross by genre

Next, we wanted to analyze whether genre plays a role in the correlation between imdb score and gross. In our original dataset, each movie can have multiple genres, so we split each movie's row into one row per genre.

In [None]:
# Subset data #2: one row per (movie, genre) pair
values = []
for idx, row in movies_table.iterrows():
    genres_val = row['genres']
    
    # skip movies with no genre or no gross
    if not pd.isnull(genres_val) and not pd.isnull(row['real_domestic_gross']):
        for genre in genres_val.split('|'):
            values.append([row[x], row[y], genre])    
        
subset_data2 = pd.DataFrame(values, columns=[x, y, 'genre'])
# average gross for each (imdb score, genre) combination
subset_data2 = subset_data2.groupby(['imdb_score', 'genre'], as_index=False).mean()
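
An alternative to the explicit loop, assuming pandas 0.25 or later, is `str.split` followed by `DataFrame.explode`, which produces one row per movie-genre pair. The three movies below are made up.

```python
import pandas as pd

# Made-up movies for illustration
toy = pd.DataFrame({
    "imdb_score": [7.0, 7.0, 8.0],
    "real_domestic_gross": [100.0, 300.0, 500.0],
    "genres": ["Action|Drama", "Action", "Drama"],
})

# One row per (movie, genre) pair
by_genre = (
    toy.assign(genre=toy["genres"].str.split("|"))
       .explode("genre")
       .drop(columns="genres")
)

# Average gross for each (score, genre) combination
genre_means = by_genre.groupby(["imdb_score", "genre"], as_index=False).mean()
print(genre_means)
```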

Now let's graph the imdb score vs. gross relationship for each genre in our dataset. 

In [None]:
# Plot the data
g = sns.lmplot(x=x, y=y, data=subset_data2, col="genre", hue="genre", scatter=True, fit_reg=True, col_wrap=3)
plt.show()

Mention analysis of above graphs in at least a paragraph.

### Analyzing film earnings
Next, let's turn our analysis to ROI and see which films brought studios the most bang for their buck.

In [None]:
movies_by_roi = movies_table.sort_values('roi', ascending=False)

for index, row in movies_by_roi.head().iterrows():
    print(row["movie_title"], row["roi"])

Out of all the movies in our dataset, Paranormal Activity, the 2009 indie horror film, yielded the greatest ROI. With a budget of around \$15,000, it grossed \$107,917,283 at the box office during its run in 2010. We found this statistic to be quite amazing. It is also interesting that three of the five movies with the greatest ROI in our dataset are horror films.

Let's rank the movies by greatest absolute profit.

In [None]:
movies_by_profit = movies_table.sort_values('profit', ascending=False)

for index, row in movies_by_profit.head().iterrows():
    print(row["movie_title"], row["profit"])

Let's rank the movies by greatest real domestic gross at the box office.

In [None]:
movies_by_gross = movies_table.sort_values('real_domestic_gross', ascending=False)

for index, row in movies_by_gross.head().iterrows():
    print(row["movie_title"], row["real_domestic_gross"])

Let's rank the movies by greatest imdb score.

In [None]:
movies_by_score = movies_table.sort_values('imdb_score', ascending=False)

for index, row in movies_by_score.head().iterrows():
    print(row["movie_title"], row["imdb_score"], row["real_domestic_gross"])
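
The four rankings above all follow the same pattern, so they could be folded into one small helper. The function name `top_movies` and the sample rows are hypothetical; `DataFrame.nlargest` is used in place of sorting the whole table.

```python
import pandas as pd


def top_movies(table, column, n=5):
    """Return the n rows with the largest values in `column`."""
    return table.nlargest(n, column)[["movie_title", column]]


# Made-up rows for illustration
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "roi": [10.0, 250.0, -40.0],
    "profit": [5.0, 1.0, 9.0],
})

print(top_movies(toy, "roi", n=2))
print(top_movies(toy, "profit", n=2))
```

On the real table, a call such as `top_movies(movies_table, "roi")` would return the same five rows as the sorted `head()` above, apart from possible differences in how ties are ordered.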

Add a summarizing note to this analysis

## IV. Analysis, Hypothesis Testing, and ML

## V. Insight & Policy Decision