# Movie Analysis: EntertAInment Experts
**Authors:** Alec Plante, Deanna Hedges, Raul Cortes Vaquez, Sanchez Sunny, Zachary Mitchell
***
<img src="images/movies1.png" width = 800 height = 600>

## Overview

Computing Vision is building a new movie studio with the goal of creating original video content to compete with larger companies that are doing the same. To provide a concise and thorough plan for Computing Vision's film studio, we followed the steps below:

### 1. Align on Business Understanding 
- Evaluate business needs
- Utilize data to inform business recommendations
        
### 2. Data Analysis
- Establish patterns in film data
- Analyze film datasets 
    
### 3. Recommendations
- Define three recommendations to move forward in the film industry
- Use data findings to support recommendations
- Identify next steps for Computing Vision

## Business Problem

Computing Vision needs to understand how the major companies producing original video content are succeeding. Before launching a new movie studio, the company needs to learn the current trends in filmmaking. Our task is to explore the available movie datasets and determine which types of films are currently performing best at the box office. The client can then translate those findings into actionable insights that help Computing Vision's new movie studio decide what type of films to create.

## Data Understanding


The data used to complete our analysis was sourced from five different movie websites, which include: 
- https://www.boxofficemojo.com
- https://www.imdb.com/ 
- https://www.rottentomatoes.com/ 
- https://www.themoviedb.org/ 
- https://www.the-numbers.com/ 

#### The characteristics of each data set include:

![Data Understanding Image](images/MicrosoftTeams-image.png)


The raw datasets use different data types, so before combining them into a meaningful dataframe we need to analyze and clean them. We begin by examining the information contained in each dataset and cleaning it. Once the datasets are combined, we review the relevant columns to develop effective recommendations and create visualizations to support our findings.
 
    

### Import Libraries


In [None]:
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats 
import math

%matplotlib inline

### Unzip Data
This section unzips the data from the zippedData folder and places it in a new data folder.
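
The repeated gzip steps in this section could also be wrapped in a small helper. The sketch below is only an illustration: the function name `gunzip_file` and the throwaway demo file are assumptions, not part of the project code.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_file(src, dest_dir):
    """Decompress a single .gz file into dest_dir and return the new path."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    dest = dest_dir / src.stem                   # 'x.csv.gz' -> 'x.csv'
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest


# Demonstration on a throwaway file (made-up contents, not the project data)
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "example.csv.gz"
    with gzip.open(archive, "wt", encoding="utf-8") as f:
        f.write("title,gross\nExample,1\n")
    restored = gunzip_file(archive, Path(tmp) / "data")
    restored_name = restored.name
    restored_text = restored.read_text(encoding="utf-8")

print(restored_name)
```

With a helper like this, each archive in zippedData would need only one call, for example `gunzip_file('zippedData/bom.movie_gross.csv.gz', 'data')`.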

In [None]:
#extract im.db zip file
import zipfile
with zipfile.ZipFile('zippedData/im.db.zip', 'r') as zip_ref:
    zip_ref.extractall('data/')

# unzip the gz files 
import gzip
import shutil

# unzip bom.movie_gross
with gzip.open('zippedData/bom.movie_gross.csv.gz', 'rb') as f_in:
    with open('data/bom.movie_gross.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
        
# unzip rt.movie_info.tsv
with gzip.open('zippedData/rt.movie_info.tsv.gz', 'rb') as f_in:
    with open('data/rt.movie_info.tsv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
        
# unzip rt.reviews.tsv
with gzip.open('zippedData/rt.reviews.tsv.gz', 'rb') as f_in:
    with open('data/rt.reviews.tsv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
        
# unzip tmdb.movies.csv
with gzip.open('zippedData/tmdb.movies.csv.gz', 'rb') as f_in:
    with open('data/tmdb.movies.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
        
# unzip tn.movie_budgets.csv
with gzip.open('zippedData/tn.movie_budgets.csv.gz', 'rb') as f_in:
    with open('data/tn.movie_budgets.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

### Import Data and Connect to Database

In [None]:
# import data as pandas DataFrames
movieGross = pd.read_csv('data/bom.movie_gross.csv')
tmdbMovies = pd.read_csv('data/tmdb.movies.csv')
movieBudgets = pd.read_csv('data/tn.movie_budgets.csv')
movieInfo = pd.read_csv('data/rt.movie_info.tsv', sep = '\t', index_col = 0)
reviews = pd.read_csv('data/rt.reviews.tsv', sep = '\t', encoding= 'latin1')

In [None]:
# Connect to sql database
conn = sqlite3.connect('data/im.db')

***
### movieInfo
The movieInfo table contains basic information about the movie, including ratings, runtimes, box office sales, genres, and dates

In [None]:
# This table consists of 11 columns, listed under 'Column' below.
movieInfo.info()

In [None]:
# See the first 5 rows of the table below
movieInfo.head()

***
### reviews
The reviews table provides written reviews, scores, and dates about movies

In [None]:
reviews.info()

In [None]:
reviews.head()

***
### movieGross
The movieGross table includes information about revenue for movies, as well as the release year and studio that created the movies

In [None]:
movieGross.info()

In [None]:
movieGross.head()

***
### tmdbMovies
The tmdbMovies table gives genres, popularity scores, vote scores, and dates related to movies.

In [None]:
tmdbMovies.info()

In [None]:
tmdbMovies.head()

***
### movieBudgets
This table provides information about budgets and sales of movies

In [None]:
movieBudgets.info()

In [None]:
movieBudgets.head()

***
### im.db
This database provides information regarding ratings, actors, directors, writers, and other basic information for movies.
***
<img src="images/ERD.png" alt="ERD for im.db" width="800" height="600">

In [None]:
# The table names are listed below. The columns for each table can be seen above
pd.read_sql("""
SELECT name 
FROM sqlite_master 
WHERE type = 'table';""", conn)

## Data Cleaning

#### Cleaning movieInfo

In [None]:
# View how many NA values there are per column
movieInfo.isna().sum()

In [None]:
# making a copy to clean without editing main dataframe
movieInfoClean = movieInfo.copy()

In [None]:
#finding duplicate rows
movieInfoClean.duplicated().value_counts()
movieInfoClean[movieInfoClean.duplicated(keep=False)].sort_values(by='id')

In [None]:
movieInfoClean = movieInfoClean.drop_duplicates()

In [None]:
movieInfoClean.duplicated().value_counts()

In [None]:
# changing null values for string columns to '-'
movieInfoClean['synopsis'] = movieInfoClean['synopsis'].fillna('-')
movieInfoClean['rating'] = movieInfoClean['rating'].fillna('-')
movieInfoClean['genre'] = movieInfoClean['genre'].fillna('-')
movieInfoClean['director'] = movieInfoClean['director'].fillna('-')
movieInfoClean['writer'] = movieInfoClean['writer'].fillna('-')
movieInfoClean['theater_date'] = movieInfoClean['theater_date'].fillna('-')
movieInfoClean['dvd_date'] = movieInfoClean['dvd_date'].fillna('-')
movieInfoClean['currency'] = movieInfoClean['currency'].fillna('-')
movieInfoClean['studio'] = movieInfoClean['studio'].fillna('-')


In [None]:
# changing runtime to int representing minutes, replaced null with 0
movieInfoClean['runtime']=movieInfoClean['runtime'].map(lambda x: 0 if pd.isna(x) else int(x.split(' ')[0]))

In [None]:
# removing commas and changing box office to a float, replaced null with 0
movieInfoClean['box_office']=movieInfoClean['box_office'].map(lambda x: 0 if pd.isna(x) else float(x.replace(',','')))

In [None]:
# changing theater date and dvd date to a date time type
movieInfoClean['theater_date']=movieInfoClean['theater_date'].map(lambda x: pd.to_datetime(x,format = "%b %d, %Y") if x != '-' else x)
movieInfoClean['dvd_date']=movieInfoClean['dvd_date'].map(lambda x: pd.to_datetime(x,format = "%b %d, %Y") if x != '-' else x)

In [None]:
# finding all the genres in the dataset
genres = []
for row in movieInfoClean['genre'].map(lambda x: x.split('|')):
    for genre in row:
        if genre not in genres:
            genres.append(genre)
genres
# matching genres to other datasets
genresUpdated = [['Action','Adventure'],
                 ['Classics'],
                 ['Drama'],
                 ['Science Fiction','Fantasy'],
                 ['Music'],
                 ['Mystery'],
                 ['Romance'],
                 ['Family'],
                 ['Comedy'],
                 ['-'],
                 ['Documentary'],
                 ['Special Interest'],
                 ['Art House and International'],
                 ['Horror'],
                 ['Western'],
                 ['TV Movie'],
                 ['Sports and Fitness'],
                 ['Animation'],
                 ['Faith and Spirituality'],
                 ['Cult Movies'],
                 ['Anime and Manga'],
                 ['Gay and Lesbian']
                ]
# making dict matching old genres with new
genreDict = {}
for i in range(len(genres)):
    genreDict[genres[i]]=genresUpdated[i]
# changing column to be final list of genres
finalGenres = []
for row in movieInfoClean['genre'].map(lambda x: x.split('|')):
    thisRow = []
    for genre in row:
        thisRow += genreDict[genre]
    finalGenres.append(thisRow)
movieInfoClean['genre']=finalGenres

In [None]:
movieInfoClean.head()

In [None]:
movieInfoClean.info()

In [None]:
# Create folder to store cleaned data (no error if it already exists)
import os
os.makedirs('cleanedData', exist_ok=True)
# Export movieinfo as csv
movieInfoClean.to_csv('cleanedData/movieInfoClean.csv')

#### Cleaning reviews

In [None]:
# Count the missing values in each column
reviews.isna().sum()

In [None]:
# We create a copy of the review data set which we will modify
reviews2 = reviews.copy()

In [None]:
# Fill missing values with '-', the placeholder string the team chose for consistency
reviews2[['review','rating','critic','publisher']] = reviews2[['review','rating','critic','publisher']].fillna('-')
reviews2.head()

In [None]:
# Confirm the missing values have been filled; every count should now be 0
reviews2.isna().sum()

In [None]:
# Export reviews as csv
reviews2.to_csv('cleanedData/reviewsClean.csv')

#### Cleaning movieGross

In [None]:
#view the amount of NA's for each column
movieGross.isna().sum()

In [None]:
# Converts foreign_gross column to string and removes commas
movieGross['foreign_gross'] = movieGross['foreign_gross'].astype(str).str.replace(",","")
# Converts Null values in foreign_gross column to 0
movieGross['foreign_gross'] = movieGross['foreign_gross'].replace('nan',0)
# Converts foreign_gross column from object type
movieGross['foreign_gross'] = movieGross['foreign_gross'].astype(float).astype(int)



# Converts domestic_gross column values to integers and Null values in domestic_gross column to 0
movieGross['domestic_gross'] = movieGross['domestic_gross'].fillna(0).astype(int)



# Converts year column to datetime data type
movieGross['year'] = pd.to_datetime(movieGross['year'],format = '%Y')

In [None]:
# Export movieGross as csv
movieGross.to_csv('cleanedData/movieGrossClean.csv')

#### Cleaning tmdbMovies

In [None]:
# show the amount of NA's for the tmdbMovies table
tmdbMovies.isna().sum()
# there are no missing values

In [None]:
# start by looking at the first 5 rows of data
tmdbMovies.head()

In [None]:
# View the Column names
tmdbMovies.columns

At first glance, we can see that there is an extra column that matches the index. This should be removed.

In [None]:
# Drop 'Unnamed: 0' as it contains the same information as the index
tmdbMovies.drop('Unnamed: 0', axis = 1, inplace = True)

In [None]:
# View the Column names again to confirm that changes were made
tmdbMovies.columns
# The changes have been made

After removing the unneeded column, the data types should be reviewed to ensure that we are able to work with the table.

In [None]:
# View the information about each column
tmdbMovies.info()

A few columns should be investigated:
- genre_ids should be a list
- release_date should be datetime
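
To make the two conversions concrete, here is a minimal sketch on single made-up values. The example strings and the `%Y-%m-%d` date format are assumptions chosen for illustration; the actual column contents should be checked before relying on them.

```python
import ast
from datetime import datetime

# Made-up raw values in the form such columns often take after reading a CSV
raw_genre_ids = "[12, 14, 10751]"   # a list stored as text
raw_release_date = "2010-11-19"     # assumed ISO-style date string

# literal_eval safely parses the text into a real Python list
genre_ids = ast.literal_eval(raw_genre_ids)

# strptime parses the text into a datetime using an explicit format
release_date = datetime.strptime(raw_release_date, "%Y-%m-%d")

print(type(genre_ids).__name__, genre_ids)
print(release_date.year)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.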

In [None]:
# Check the type of each Column
print(f"genre_ids type: {type(tmdbMovies['genre_ids'].iloc[1])}\nrelease_date type: {type(tmdbMovies['release_date'].iloc[1])}")

In [None]:
# View the values in genre_ids to get an idea of the data we are dealing with:
print(tmdbMovies.genre_ids.value_counts())
print(f"There are {tmdbMovies['genre_ids'].isna().sum()} null values")
# There are no NA values, and they all look like lists stored as strings. We can proceed by changing the type to a list

In [None]:
# Convert genre_ids into list
# ast is a standard library module with a function (literal_eval) that completes this operation
import ast

#converts all strings into a list
tmdbMovies.genre_ids = tmdbMovies.genre_ids.map(lambda x: ast.literal_eval(x))

In [None]:
# make sure that rows are of type list
for i in tmdbMovies['genre_ids']:
    assert isinstance(i, list), "ERROR: element is not a list"
print("all rows in genre_ids column are of type list :^)")

The genre_ids in tmdbMovies are numbers, which doesn't give us a lot of information. A new column reflecting the meaning of these numbers should be created. The dictionary of the meanings is listed below:

In [None]:
genre_ids_dict={28:'Action',
                12:'Adventure',
                16:'Animation',
                35:'Comedy',
                80:'Crime',
                99:'Documentary',
                18:'Drama',
                10751:'Family',
                14:'Fantasy',
                36:'History',
                27:'Horror',
                10402:'Music',
                9648:'Mystery',
                10749:'Romance',
                878:'Science Fiction',
                10770:'TV Movie',
                53:'Thriller',
                10752:'War',
                37:'Western'}

In [None]:
# Create a new column 'genres' that is a list of the genres as strings
tmdbMovies['genres'] = tmdbMovies['genre_ids'].map(lambda x: list(pd.Series(x,dtype='float64').replace(genre_ids_dict)))
tmdbMovies.head()

In [None]:
print(f"The second entry in genres is of type {type(tmdbMovies['genres'].iloc[1])} and looks like: {tmdbMovies['genres'].iloc[1]}.")

When creating models and comparing data, it may be beneficial to have each genre as its own column with a boolean value indicating whether a given movie is of that genre.
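
As an illustration of the idea, the sketch below builds one 0/1 indicator column per genre from a column of genre lists. The helper name `genre_indicators` and the toy data are hypothetical, and this is just one possible way to do it.

```python
import pandas as pd


def genre_indicators(genre_lists):
    """Return a DataFrame with one 0/1 column per genre found in genre_lists."""
    # Collect every distinct genre, sorted so the column order is stable
    all_genres = sorted({g for genres in genre_lists for g in genres})
    # For each genre, mark each row with 1 if the genre is in that row's list
    data = {g: [int(g in genres) for genres in genre_lists] for g in all_genres}
    return pd.DataFrame(data, index=genre_lists.index)


# Toy data (made up): three movies, the last with no genre listed
toy = pd.Series([["Action", "Drama"], ["Drama"], []])
indicators = genre_indicators(toy)
print(indicators)
```

Storing the indicators as integers rather than booleans makes it easy to sum them per genre or pass them directly to a regression model.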

In [None]:
# Creates a column for every genre in the dictionary, set to 1 if that genre's id shows up in genre_ids and 0 otherwise
for genre_id, genre_name in genre_ids_dict.items():
    tmdbMovies[genre_name] = tmdbMovies['genre_ids'].map(lambda x: genre_id in x).astype(int)

In [None]:
# Check that the 0/1 genre columns were added
tmdbMovies.head()

Creating these columns will make it easier to sort by category, as well as give us the option to use various regression models

Now, the release_date column needs to be converted to a datetime.

In [None]:
# Investigate the types of values in the release date column
print(tmdbMovies['release_date'].value_counts())
# make sure there are no NA values
print(f"There are {tmdbMovies['release_date'].isna().sum()} null values")

In [None]:
#convert the column to datetimes
tmdbMovies['release_date'] = pd.to_datetime(tmdbMovies['release_date'])

In [None]:
# make sure that release_date is of type datetime
tmdbMovies.dtypes

In [None]:
# Export tmdbMovies as csv
tmdbMovies.to_csv('cleanedData/tmdbMoviesClean.csv')

#### Cleaning movieBudgets 

In [None]:
# Preview the dataset to get a better idea of the data that we are working with
movieBudgets.head()

In [None]:
movieBudgets.info()

The only problems with the dataset seem to be regarding types.
- release_date should be of type datetime
- production_budget, domestic_gross, and worldwide_gross should be of type int
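
The currency conversion can be sketched on a toy column as follows. The helper name `currency_to_int` and the sample values are made up for illustration; the sketch assumes every entry looks like `$1,234`.

```python
import pandas as pd


def currency_to_int(column):
    """Convert strings such as '$1,234' to 64-bit integers."""
    # regex=False makes '$' a literal character instead of a regex anchor
    cleaned = column.str.replace("$", "", regex=False).str.replace(",", "", regex=False)
    # int64 avoids overflow for values above roughly 2.1 billion
    return cleaned.astype("int64")


# Toy values (made up)
toy_budgets = pd.Series(["$425,000,000", "$1,100", "$0"])
parsed_budgets = currency_to_int(toy_budgets)
print(parsed_budgets.tolist())
```

Passing `regex=False` explicitly keeps the behavior the same across pandas versions, since the default for `str.replace` has changed over time.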

In [None]:
#Convert release_date to datetime
movieBudgets["release_date"] = pd.to_datetime(movieBudgets["release_date"])

In [None]:
#for each column, remove the '$' and ',' from each entry and convert to an int
# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
movieBudgets["production_budget"] = movieBudgets["production_budget"].str.replace('$','',regex=False).str.replace(',','',regex=False).astype(int)
movieBudgets["domestic_gross"]    = movieBudgets["domestic_gross"].str.replace('$','',regex=False).str.replace(',','',regex=False).astype(int)
movieBudgets["worldwide_gross"]   = movieBudgets["worldwide_gross"].str.replace('$','',regex=False).str.replace(',','',regex=False).astype(np.int64)


In [None]:
#make sure types are reflected in dataframe
movieBudgets.info()

In [None]:
#take a look at the new data
movieBudgets.head()

In [None]:
# Export movieBudgets as csv
movieBudgets.to_csv('cleanedData/movieBudgets.csv')

## Data Analysis

***
### Popularity by Genre: 
Which Genres are the most popular?

In [None]:
#start by viewing the tmdbMovies columns
tmdbMovies.columns

In [None]:
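# we are most interested in the genre indicator columns, which are named after the genres in genre_ids_dict
# selecting them by name avoids relying on their position in the table
tmdbMovies[list(genre_ids_dict.values())].columns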

In [None]:
# create a dictionary that gets the mean popularity score for each genre
popularityByGenre = {}
for genre in genre_ids_dict.values():
    popularityByGenre[genre] = tmdbMovies.loc[tmdbMovies[genre]==1, 'popularity'].mean()

# barplot that shows the mean popularity score for every genre
sns.barplot(y = list(popularityByGenre.keys()), x=list(popularityByGenre.values()))

In [None]:
# get popularity and release date for each genre
genre_dicts = {}
for genre in genre_ids_dict.values():
    # .copy() lets us add a column below without modifying a slice of tmdbMovies
    genre_dicts[genre] = tmdbMovies.loc[tmdbMovies[genre]==1, ['popularity','release_date']].copy()
# add a year column to each table to be able to see the change in popularity over years
for i in genre_dicts:
    genre_dicts[i]['year'] = pd.to_datetime(genre_dicts[i]['release_date']).dt.year

In [None]:
#view the new dictionary that contains popularity scores, release dates, and years for each movie
print(genre_dicts.keys())
genre_dicts["Action"]

In [None]:
# use groupby to compute the average popularity for each year for every genre
avgPopByGenre = {}
for i in genre_dicts:
    # select popularity explicitly so the datetime column is not averaged as well
    avgPopByGenre[i] = genre_dicts[i].groupby('year')[['popularity']].mean()

In [None]:
#view new popularity dictionary
avgPopByGenre

In [None]:
# view trend in popularity by genre
sns.lineplot(x = avgPopByGenre['Action'].index, y = avgPopByGenre['Action']["popularity"])

In [None]:
# Make a plot with all genre trends on one chart
for i in avgPopByGenre:
    sns.lineplot(x=avgPopByGenre[i].index, y=avgPopByGenre[i]['popularity'].values)

In [None]:
# The combined plot works, but it is too busy, so we narrow it down to the important genres
# We have the mean popularity for every genre. Let's view the top 5 most popular genres

#get list of top 5 genres
mostPopGenre = sorted(popularityByGenre, key=popularityByGenre.get, reverse=True)[:5]
mostPopGenre

In [None]:
# get the average popularity score over the past 10 years for the action genre
avgPopByGenre['Action'][avgPopByGenre['Action'].index>2009]

In [None]:
#average popularity for the last 10 years for every genre
popLastTenYears = {}
for genre in avgPopByGenre:
    popLastTenYears[genre] = float(avgPopByGenre[genre].loc[avgPopByGenre[genre].index>2009, 'popularity'].mean())

In [None]:
# Get the most popular genres in the past 10 years
mostPop10YR = sorted(popLastTenYears, key=popLastTenYears.get, reverse=True)[:5]
mostPop10YR

These 2 lists share 4 categories: **'Adventure', 'Action', 'Fantasy', 'Crime'**.

It seems that war movies have lost popularity, while science fiction has taken their spot in the top 5:
- **War** fell from #5 to #8
- **Science Fiction** climbed from #6 to #5

The following graph will include: **['Adventure', 'Action', 'Fantasy', 'Crime','War','Science Fiction']**

In [None]:
#plot the most popular genres on one graph.
fig, ax = plt.subplots()
ax.set_xlim(2005,2020)
for i in ['Adventure', 'Action', 'Fantasy', 'Crime','War','Science Fiction']:
    sns.lineplot(x=avgPopByGenre[i].index, y=avgPopByGenre[i]['popularity'].values)
ax.legend(['Adventure', 'Action', 'Fantasy', 'Crime','War','Science Fiction'])

The graph above is busy. Let's separate the lines to get a better idea:

In [None]:
# Graph the most popular genres on 6 graphs on the same figure
fig, ax = plt.subplots(3,2, sharex = True, sharey=True, figsize=(15,5))

sns.set_context('talk', font_scale=.88)
fig.suptitle('Popularity by Genre (Last 10 Years)')
colors = ['#62b5e5','#046a38','#005587','#0076a8','#7f7f7f','#007680']
l = ['Adventure', 'Action', 'Fantasy', 'Crime','War','Science Fiction']
for i in range(len(l)):
    ax[int(i%3)][int(i//3)].set_xlim(2010,2019)
    ax[int(i%3)][int(i//3)].set_ylim(2.5,15)
    ax[int(i%3)][int(i//3)].set_title(l[i])
    ax[1][0].set_ylabel('Popularity')
    ax[2][0].set_xlabel('Year')
    ax[2][1].set_xlabel('Year')
    
    sns.lineplot(ax=ax[i%3][i//3], x=avgPopByGenre[l[i]].index, y=avgPopByGenre[l[i]]['popularity'].values, color=colors[i])
sns.set_style('darkgrid')

While war movies were not popular in the early 2010s, they have been gaining some momentum. Action and crime movies have been increasing steadily, while the others seem more or less random. With only 10 years of data, it's hard to tell where these genres are really headed.

***
### Profitability By Genre (average)
This section will help determine which genres have the highest profitability.

In [None]:
#view the data we will be working with
movieInfo.head()

In [None]:
# there are only USD in this dataset. No conversions needed
movieInfo['currency'].value_counts()

In [None]:
# Visualize the distribution of box office sales
sns.displot(movieInfo[movieInfo['box_office']>0]['box_office'])
# There are many small numbers in box office sales.
# This resembles a power law distribution

In [None]:
# Check the type of the genre column
type(movieInfo['genre'][2])

The goal right now is to be able to work with the genre column. We want to create a column for each genre, where it lists true or false if the given movie is of that genre. This will make visualizations and analysis easier:

In [None]:
# The genre column is of type string (a list stored as text, as happens when cleaned data is read back from a CSV), when it should be a list.
# Changing the string to a list using the code below
import ast
movieInfo.genre = movieInfo.genre.map(lambda x: ast.literal_eval(x))

In [None]:
#confirm that the change was successful
type(movieInfo['genre'][2])

In [None]:
# Getting a sorted list of all the genres that show up in the genre column.
# Iterating over the rows directly avoids calling value_counts on unhashable lists
genres = sorted({genre for genreList in movieInfo['genre'] for genre in genreList})
print(genres)

In [None]:
# Creates a column for every value in the genres list, set to 1 if that genre shows up in the genre column and 0 otherwise
for genre in genres:
    movieInfo[genre] = movieInfo['genre'].map(lambda x: genre in x).astype(int)

In [None]:
# Ensure that the change was a success
movieInfo

We successfully created boolean columns for the genres. Now we may begin data analysis. 

Let's start by looking at the **median box office revenue** for every genre:

In [None]:
#create dictionary that shows median box office for every genre 
genre_dict2={}
for i in genres:
    genre_dict2[i] = movieInfo[(movieInfo[i]==1) & (movieInfo['box_office']>1)]['box_office'].astype(int).median()

In [None]:
# view this new dictionary
genre_dict2

In [None]:
# There are nan values in this, likely because those genres had no recorded revenue
# For now, we will leave them in the dataset. We will remove later

# Graph the median box office sales for each genre
fig, ax = plt.subplots(figsize=(7,10))
sns.barplot(y = list(genre_dict2.keys()), x=list(genre_dict2.values()))
plt.show()

This bar graph shows good information. Now let's clean it up to select only the genres with 25 or more recorded box office sales in the database.
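
The filtering step can be sketched in plain Python as follows. The function name `median_by_group`, the toy records, and the small `min_count` are made up for illustration; the idea is to drop groups with too few observations before ranking them by median.

```python
from statistics import median


def median_by_group(records, min_count):
    """Median value per group, keeping only groups with at least min_count values."""
    groups = {}
    for group, value in records:
        groups.setdefault(group, []).append(value)
    medians = {g: median(v) for g, v in groups.items() if len(v) >= min_count}
    # Sort from the highest median to the lowest
    return dict(sorted(medians.items(), key=lambda item: item[1], reverse=True))


# Toy (genre, box office) pairs; 'Western' has a single record and is dropped
toy_records = [
    ("Family", 10), ("Family", 30),
    ("Drama", 5), ("Drama", 7), ("Drama", 9),
    ("Western", 100),
]
ranked = median_by_group(toy_records, min_count=2)
print(ranked)
```

Requiring a minimum count keeps a genre with one unusually successful film from dominating the ranking.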

In [None]:
# Creating a dictionary that shows the number of movies with recorded box office sales for each genre
genre_dict_len={}
for i in genres:
    genre_dict_len[i] = len(movieInfo[(movieInfo[i]==1) & (movieInfo['box_office']>1)]['box_office'])

# Create a second genre length dictionary for genres with 25 or more recorded box office sales instances
gdl2 = {k:v for (k,v) in genre_dict_len.items() if v>=25}

In [None]:
# Create a new dictionary from the ones above that shows median box office sales 
# for films with over 24 recorded box office sales instances 
avgRevforPopularGenres = {k:v for (k,v) in genre_dict2.items() if k in gdl2.keys()}

# sort the values
avgRevforPopularGenres = {k:v for k,v in sorted(avgRevforPopularGenres.items(), key=lambda item: item[1], reverse = True)}


In [None]:
# Plot the median box office sales for films with over 24 recorded box office sales instances
fig, ax = plt.subplots(figsize=(5,5))
sns.barplot(y = list(avgRevforPopularGenres.keys()), x=list(avgRevforPopularGenres.values()))
plt.show()

The bar plot above shows that Family films have the highest median box office revenue, followed by Action and Adventure, then Fantasy and Science Fiction.

### Runtime and sales
We are going to investigate the effect that runtime has on box office sales.

In [None]:
# View both columns
movieInfo['runtime'].head()
movieInfo['box_office'].head()

In [None]:
# Create new table that only includes rows with recorded sales
sales = movieInfo[movieInfo['box_office']>0]

# Check to ensure there are no 0s
sales['box_office']

In [None]:
# Create a scatterplot to visualize the relationship between box office sales and runtime
sns.scatterplot(x=sales['runtime'], y=sales['box_office'])
# There looks to be little correlation.

It looks like there is little correlation between runtime and box office sales, so we will not pursue this relationship further.

### Gross Income by Year
We will view how the gross income has changed across years

In [None]:
# View the table we are working with
movieGross.head()

In [None]:
# Extract the year from the datetime column, and take the median domestic gross for each year
movieGross['year'] = pd.to_datetime(movieGross['year']).dt.year
dgByYear = movieGross['domestic_gross'].groupby(movieGross['year']).median()

In [None]:
#plot the distribution
sns.barplot(x = dgByYear.index, y=dgByYear)

There is no clear relationship between year and median gross revenue. This distribution is still interesting to see.

### Return on Investment by Month of Release
We will determine if there is a relationship between Return on Investment (ROI) and the month a movie is released.

                        ROI (%) = profit / cost * 100 = (revenue - cost) / cost * 100
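
As a quick worked example of the ROI calculation, the sketch below expresses ROI as a percentage. The function name `roi_percent` and the dollar amounts are made up for illustration.

```python
def roi_percent(revenue, cost):
    """Return on investment as a percentage: (revenue - cost) / cost * 100."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (revenue - cost) / cost * 100


# A film that costs 100M and earns 300M returns 200% of its budget as profit
example_roi = roi_percent(revenue=300_000_000, cost=100_000_000)

# A film that earns only half of its budget has a negative ROI
loss_roi = roi_percent(revenue=50, cost=100)

print(example_roi, loss_roi)
```

An ROI of 0% means a film exactly broke even, and the lowest possible value is -100%, which corresponds to no revenue at all.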

In [None]:
# Making new column for Return on investment
#ROI (%) = profit / cost * 100
movieBudgets['ROI']=(movieBudgets['worldwide_gross']-movieBudgets['production_budget'])/movieBudgets['production_budget']*100

In [None]:
# Check if the column was created and the numbers are reasonable
movieBudgets

In [None]:
#let's view the distribution of ROI to get a better understanding of the data
sns.displot(movieBudgets['ROI'])
#This resembles a power law distribution, where most ROI values are around 0

In [None]:
# To analyze by month and year, we need to convert the release date into a datetime
# Then, we need to make a month column. A year column is added in case we want to analyze further
movieBudgets['release_date'] = pd.to_datetime(movieBudgets['release_date'])
movieBudgets['month'] = movieBudgets['release_date'].dt.month
movieBudgets['year'] = movieBudgets['release_date'].dt.year
movieBudgets.head()

In [None]:
# Create a distribution plot to visualize the distribution of production budget 
# This should inform the insights we find
sns.displot(movieBudgets['production_budget'])
movieBudgets['production_budget'].median()
movieBudgets['worldwide_gross'].median()

In [None]:
# we only want to select movies with bigger budgets and revenue.
movieBudgets[(movieBudgets['production_budget']>1000000) & (movieBudgets['worldwide_gross']>1000000)]

In [None]:
# Create a Series that lists the median ROI based on the month
medianByMonth = movieBudgets['ROI'].groupby(movieBudgets['month']).median()

# Show the distribution of median ROI by month.
sns.barplot(x=medianByMonth.index, y=medianByMonth)
# The median ROI appears to differ noticeably between months

In [None]:
# We are going to use this in our presentation, so let's make it look nice

# movieBudgetAdj to select only movies with large budgets and revenue
movieBudgetsAdj = movieBudgets[(movieBudgets['production_budget']>1000000) & (movieBudgets['worldwide_gross']>1000000)]

# take the median ROI by month
medianByMonthAdj = movieBudgetsAdj['ROI'].groupby(movieBudgetsAdj['month']).median()

#set up chart
fig, ax = plt.subplots(figsize = (8,6))
# set title of chart
ax.set_title('Median ROI by Month')
# set colors to emphasize the important months
col = ['#A1D3EF','#A1D3EF','#A1D3EF','#A1D3EF','#86bc25','#86bc25','#86bc25','#A1D3EF','#A1D3EF','#A1D3EF','#86bc25','#86bc25']
sns.set_palette(sns.color_palette(col))
# Create and label the plot
sns.barplot(x=medianByMonthAdj.index, y=medianByMonthAdj)
ax.set_xticklabels(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
ax.set_xlabel('Month')
ax.set_ylabel('Return on Investment (%)')

This visualization is great for describing the optimal time to release a movie. Let's make more:

In [None]:
# Create a boxplot of the distribution of ROI to get a better idea of mins, maxes, and quartiles
sns.boxplot(x=movieBudgets['month'], y=movieBudgets['ROI'])
# This does not show a lot, lets change the y limits

In [None]:
# Create a boxplot chart to show the distribution of ROI for each month
fig, ax = plt.subplots(figsize = (8,6))
ax.set_ylim(-150,1000)
ax.set_title("ROI by Month")
sns.boxplot(x=movieBudgetsAdj['month'], y=movieBudgetsAdj['ROI'])
ax.set_xticklabels(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
ax.set_xlabel('Month')
ax.set_ylabel('Return on Investment')

In [None]:
julyMedian = medianByMonthAdj[7]
TotalMedian = movieBudgetsAdj['ROI'].median()

ROI seems to be higher in the green months. Let's try to find out why. Viewing the **mean production budget** may give us some insight.

In [None]:
# get mean budget by month for the movies with large budgets and revenue
meanBudgetByMonthAdj = movieBudgetsAdj['production_budget'].groupby(movieBudgetsAdj['month']).mean()

# Create and format graph
fig, ax = plt.subplots(figsize = (8,6))
ax.set_title('Mean Production Budget by Month')
col = ['#A1D3EF','#A1D3EF','#A1D3EF','#A1D3EF','#86bc25','#86bc25','#86bc25','#A1D3EF','#A1D3EF','#A1D3EF','#86bc25','#86bc25']
sns.set_palette(sns.color_palette(col))
sns.barplot(x=meanBudgetByMonthAdj.index, y=meanBudgetByMonthAdj/1000000)
ax.set_xticklabels(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
ax.set_xlabel('Month')
ax.set_ylabel('Production Budget (in Millions of $)')

In [None]:
# Create a boxplot chart to show the distribution of production budget for each month
fig, ax = plt.subplots(figsize = (8,6))
ax.set_ylim(-150,200000000)
ax.set_title("Production Budget by Month")
sns.boxplot(x=movieBudgets['month'], y=movieBudgets['production_budget'])
ax.set_xticklabels(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
ax.set_xlabel('Month')
ax.set_ylabel('Production Budget')

The production budget is also higher in those months, which suggests that studios invest more in the movies they release then. Let's also view the gross earnings by month:

In [None]:
movieBudgetsAdj = movieBudgets[(movieBudgets['production_budget']>1000000) & (movieBudgets['worldwide_gross']>1000000)]
fig, ax = plt.subplots(figsize = (12,8))
ax.set_ylim(-10000000,1000000000)
ax.set_title("Gross Earnings by Month")
sns.boxplot(x=movieBudgetsAdj['month'], y=movieBudgetsAdj['worldwide_gross'])
ax.set_xticklabels(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
ax.set_xlabel('Month')
ax.set_ylabel('Gross Earnings')

This distribution is very similar to the last two.

Finally, we will run statistical tests to ensure that this relationship is significant. We will start with a t-test:

**H0:** July mean ROI  <=  Population mean ROI

**H1:** July mean ROI > Population mean ROI
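
Because the alternative hypothesis is directional, a one-sided test matches it more closely than a two-sided one. The sketch below contrasts the two on synthetic data. It assumes SciPy 1.6 or later, where `ttest_1samp` accepts the `alternative` argument, and the sample is randomly generated rather than taken from the project data.

```python
import numpy as np
from scipy import stats

# Synthetic sample standing in for one month's ROI values (not project data)
rng = np.random.default_rng(0)
toy_sample = rng.normal(loc=1.0, scale=2.0, size=50)
toy_popmean = 0.0

# Two-sided test: H1 is "sample mean != popmean" (SciPy's default)
two_sided = stats.ttest_1samp(toy_sample, toy_popmean)

# One-sided test: H1 is "sample mean > popmean"
one_sided = stats.ttest_1samp(toy_sample, toy_popmean, alternative="greater")

t_stat = float(one_sided.statistic)
p_two_sided = float(two_sided.pvalue)
p_one_sided = float(one_sided.pvalue)
print(t_stat, p_two_sided, p_one_sided)
```

When the t statistic is positive, the one-sided p-value is half of the two-sided one, so the choice of alternative can change whether a result crosses a given significance threshold.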

In [None]:
# Create a function to run a t-test on some data given the information
def one_sample_ttest(sample, popmean, alpha):

    # Visualize sample distribution for normality 
    sns.displot(sample)
    
    # Population mean 
    mu = popmean
    
    # Sample mean (x̄) using NumPy mean()
    xbar = np.mean(sample)
    n = len(sample)
    
    # Sample standard deviation (s) using NumPy
    s = np.std(sample, ddof = 1)
    
    # Degrees of freedom
    df = n-1
    
    # Calculate the critical t-value
    t_crit = stats.t.ppf(1-alpha, df=df)
    
    # Calculate the results     
    results = stats.ttest_1samp(a=sample, popmean=mu)   
    # return results
    return results

In [None]:
julyROI = list(movieBudgetsAdj[movieBudgetsAdj['month']==7]['ROI'])
mu=movieBudgetsAdj['ROI'].mean()
alpha=.1  # significance level (90% confidence)
one_sample_ttest(julyROI, mu, alpha)

The p-value for this test is .21, well above any conventional significance level, so we fail to reject the null hypothesis. We cannot say that the July mean ROI is significantly higher than the population mean.
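Note that `stats.ttest_1samp` reports a two-sided p-value by default, while H1 above is one-sided. The conversion can be sketched as follows. This is a minimal illustration on synthetic values; the helper name `one_sided_p_greater` and the sample data are assumptions, not part of the project.

```python
import numpy as np
from scipy import stats


def one_sided_p_greater(sample, popmean):
    """Return (t statistic, one-sided p-value) for H1: mean(sample) > popmean."""
    t_stat, p_two_sided = stats.ttest_1samp(sample, popmean)
    # Halve the two-sided p-value when t is on the side H1 predicts;
    # otherwise the one-sided p-value is the complement.
    if t_stat > 0:
        p_one_sided = p_two_sided / 2
    else:
        p_one_sided = 1 - p_two_sided / 2
    return t_stat, p_one_sided


# Synthetic sample standing in for one month's ROI values (assumption)
rng = np.random.default_rng(0)
sample_roi = rng.normal(loc=2.2, scale=1.5, size=40)
t_stat, p_value = one_sided_p_greater(sample_roi, popmean=2.0)
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.3f}")
```

The one-sided value is never larger than the two-sided one when the sample mean lies on the side H1 predicts, so it is worth reporting which version a quoted p-value refers to.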

This is not the end of our tests. We can also use a one-way ANOVA to test whether the mean ROI is equal across all months. We will use an alpha value of .1 (90% confidence).

**H0:** The mean ROI is equal across all months

**H1:** The mean ROI is not equal across all months (at least one month is not the same as another)

In [None]:
di = {}
for i in range(1,13):
    di[i]=movieBudgetsAdj[movieBudgetsAdj['month']==i]['ROI']
    di[i] = list(di[i].values)
stats.f_oneway(di[1],di[2],di[3],di[4],di[5],di[6],di[7],di[8],di[9],di[10],di[11],di[12])

In [None]:
# Alternatively
import statsmodels.api as sm
from statsmodels.formula.api import ols

#perform ANOVA
mbCat = movieBudgetsAdj.copy()
mbCat['month'] = mbCat['month'].astype(str)

model = ols('ROI ~ C(month)', data=mbCat).fit()
sm.stats.anova_lm(model, typ=2)

In [None]:
mbCat['month'].value_counts()

The p-value of this test is .051. This is just above the conventional .05 threshold but below .10, so we can reject the null hypothesis at the 90% confidence level, though not quite at 95%. This is encouraging: it gives moderate statistical evidence, rather than proof, that mean ROI differs between months.
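The per-month groups passed to `stats.f_oneway` can also be built with a `groupby` and unpacked in one call. The sketch below uses a synthetic stand-in for the budget data (the `demo` frame and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the movie budget data (assumption): 12 months, 30 films each
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "month": np.repeat(np.arange(1, 13), 30),
    "ROI": rng.normal(loc=2.0, scale=1.0, size=360),
})

# One array of ROI values per month, in month order
groups = [g["ROI"].to_numpy() for _, g in demo.groupby("month")]

# f_oneway accepts any number of samples, so the list can be unpacked
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

Unpacking the list avoids typing out twelve arguments and keeps working if a month has no films after filtering.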

Below, we will use some statistical methods to create models to try to predict ROI and Box Office Sales, respectively.

In [None]:
# Run Regression model, Trying to use the month of the year to predict ROI 
mod = ols(formula='ROI ~ month', data = mbCat)
res = mod.fit()
print(res.summary())

The p-value for each month can be seen in the P>|t| column. No month has a p-value below .05, so no individual month is a statistically significant predictor of ROI. However, the p-values do reveal which months have the most reliable association with ROI, indicated by the lowest p-values. Because month is treated as a categorical variable, each coefficient gives the estimated difference in ROI between that month and the baseline month. The larger the absolute value of the coefficient, the greater the impact the month has on ROI, and months with negative coefficients have a lower ROI than the baseline.
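To make the interpretation of categorical coefficients concrete, here is a minimal sketch that fits dummy variables by ordinary least squares in NumPy. The three groups and their values are invented for illustration; the point is only that the intercept recovers the baseline group's mean and every other coefficient is a difference from that baseline.

```python
import numpy as np

# Synthetic ROI values for three groups (assumption, not project data)
roi = {
    "A": np.array([1.0, 2.0, 3.0]),   # baseline group, mean 2.0
    "B": np.array([4.0, 5.0, 6.0]),   # mean 5.0
    "C": np.array([0.0, 1.0, 2.0]),   # mean 1.0
}

y = np.concatenate(list(roi.values()))
n_a, n_b, n_c = (len(v) for v in roi.values())

# Design matrix: intercept plus one 0/1 dummy column for each non-baseline group
intercept = np.ones(len(y))
is_b = np.concatenate([np.zeros(n_a), np.ones(n_b), np.zeros(n_c)])
is_c = np.concatenate([np.zeros(n_a), np.zeros(n_b), np.ones(n_c)])
X = np.column_stack([intercept, is_b, is_c])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # intercept = baseline mean; other entries = difference from baseline
```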



In [None]:
# Build a regression model to predict box office sales based on runtime, rating, and genre.
cleanMI = movieInfo[(movieInfo['runtime']>30)& (movieInfo['box_office']>0)]
mod = ols(formula='box_office ~ runtime + rating + Documentary + Q("Art House and International") + Western +\
    Horror + Q("Cult Movies") + Action + Q("TV Movie") + Comedy + Mystery + Q("Faith and Spirituality") +\
    Classics + Fantasy + Q("Special Interest") + Drama + Animation + Q("Anime and Manga")+\
    Q("Gay and Lesbian") + Family  + Music + Romance + Q("Sports and Fitness")', data = cleanMI)
res = mod.fit()
print(res.summary())

Some interesting observations can be seen:
- Unrated movies have a large negative effect on sales (this makes sense, because they are likely smaller releases)
- Greater runtime is associated with more sales, likely because movies with very short runtimes do not do well (Coeff = 9.975e+05; p = 0.0)
- Art House and International movies perform poorly (Coeff = -1.74e+07; p = 0.081)
- Western movies perform poorly (Coeff = -3.773e+07; p = 0.076)
- Action movies sell well (Coeff = 2.005e+07; p = 0.009) (very strong relationship)
- Classics perform poorly (Coeff = -9.378e+07; p = 0.014) (strong relationship)
- Fantasy performs well (Coeff = 2.114e+07; p = 0.048) (strong relationship)
- Drama performs poorly (Coeff = -1.936e+07; p = 0.006)

While ROI and sales are hard to predict, trying to predict them gave some interesting insights.
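One way to separate the statistically significant effects from the merely suggestive ones is to filter the quoted coefficients by p-value. The sketch below copies the numbers from the observations above; the 0.05 cutoff is a conventional choice assumed here for illustration.

```python
# (term, coefficient, p-value) as quoted in the observations above
effects = [
    ("runtime", 9.975e05, 0.0),
    ("Art House and International", -1.74e07, 0.081),
    ("Western", -3.773e07, 0.076),
    ("Action", 2.005e07, 0.009),
    ("Classics", -9.378e07, 0.014),
    ("Fantasy", 2.114e07, 0.048),
    ("Drama", -1.936e07, 0.006),
]

ALPHA = 0.05  # conventional threshold, chosen here for illustration

significant = [(term, coef) for term, coef, p in effects if p < ALPHA]
suggestive = [(term, coef) for term, coef, p in effects if p >= ALPHA]

# Print significant terms from most positive to most negative effect
for term, coef in sorted(significant, key=lambda pair: pair[1], reverse=True):
    print(f"{term:<30} {coef:>14,.0f}")
```

Under this cutoff the Art House and International and Western effects fall in the suggestive group, so they deserve more caution than the others.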

### Finding Top 50 Profitable Movies in MovieInfo

In [None]:
# Convert the genre column from strings into Python lists
# ast.literal_eval safely parses each string representation of a list
import ast
movieInfo.genre = movieInfo.genre.map(ast.literal_eval)

In [None]:
# copy dataframe for safety and ease
mi = movieInfo.copy()

In [None]:
# make sub dataframe of the top 50 movies by box office
top_mi = mi.sort_values(by=['box_office'],ascending=False)[0:50]
top_mi

Identifying and charting top genres

In [None]:
# make list and frequency distribution of top genres
top_genres=[]
for genres in top_mi['genre']:
    top_genres += genres
genre_freq = {}
for g in top_genres:
    if g not in genre_freq.keys():
        genre_freq[g] = 1
    else:
        genre_freq[g] += 1


In [None]:
chart_genre = pd.Series(genre_freq)
chart_genre = chart_genre.sort_values(ascending = False)
chart_genre.plot(kind = 'bar', 
                 title = 'Genres in Top 50 Movies',
                 xlabel="Genres",
                 ylabel='Number of Movies',
                 fontsize=24)

Comparing top genres to genres of all movies
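Raw counts are hard to compare between a set of 50 films and the full dataset, so one option is to compare each genre's share of films in both sets. The following is a minimal sketch with hypothetical genre lists; `genre_share` and the sample data are assumptions, not the project's code or data.

```python
from collections import Counter

# Hypothetical genre lists, one list per film (not the project's data)
all_films = [
    ["Drama"], ["Drama", "Romance"], ["Comedy"], ["Comedy", "Drama"],
    ["Action", "Science Fiction"], ["Drama"], ["Comedy"], ["Action"],
]
top_films = [["Action", "Science Fiction"], ["Action"], ["Comedy"]]


def genre_share(films):
    """Fraction of films that carry each genre tag."""
    counts = Counter(g for genres in films for g in genres)
    return {genre: n / len(films) for genre, n in counts.items()}


share_all = genre_share(all_films)
share_top = genre_share(top_films)

# Ratio > 1 means the genre is over-represented among the top films
lift = {g: share_top[g] / share_all[g] for g in share_top}
for genre, ratio in sorted(lift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{genre:<16} {ratio:.2f}")
```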

In [None]:
# make list and frequency distribution of all genres
all_genres=[]
for genres in mi['genre']:
    if '-' not in genres:
        all_genres += genres
genre_freq_all = {}
for g in all_genres:
    if g not in genre_freq_all.keys():
        genre_freq_all[g] = 1
    else:
        genre_freq_all[g] += 1        
abb_genres = ['Act',
              'Adv',
              'Anim',
              'Anime',
              'Art',
              'Class',
              'Com',
              'Cult',
              'Doc',
              'Drama',
              'Faith',
              'Fam',
              'Fant',
              'LGBTQ+',
              'Hor',
              'Mus',
              'Myst',
              'Rom',
              'Sci-Fi',
              'Sp Int',
              'Sports',
              'TV',
              'West'
             ]
genre_keys = sorted(list(set(all_genres)))
genre_legend={}
for i in range(len(genre_keys)):
    genre_legend[genre_keys[i]]=abb_genres[i]


In [None]:
all_genre = pd.Series(genre_freq_all)
chart_genre_all = all_genre.sort_values(ascending = False)
chart_genre_all.plot(kind = 'bar', 
                     title = 'Genres in All Movies',
                     xlabel="Genres",
                     ylabel='Number of Movies',
                     fontsize=18)

Finding most common combos of genres

In [None]:
# make frequency distribution of all genre combos
genre_combos = mi['genre']
genre_combos = genre_combos.map(lambda x: ', '.join([genre_legend[i]for i in x if i != '-']))
genre_combo_freq = {}
for g in genre_combos:
    if g not in genre_combo_freq.keys():
        genre_combo_freq[g] = 1
    else:
        genre_combo_freq[g] += 1


In [None]:
chart_combos = pd.Series(genre_combo_freq)
chart_combos = chart_combos.sort_values(ascending = False)[:20]
# chart_combos = chart_combos.sort_values(ascending = True)
chart_combos.plot(kind = 'barh', 
                  title = 'Common Genre Combinations in All Movies',
                  xlabel="Genres",
                  ylabel='Number of Movies',
                  fontsize=16)

Making final visualization with color coding to emphasize the science fiction and fantasy genres

In [None]:
# using seaborn to make style consistent with other group members' charts
fig, ax = plt.subplots(figsize = (8,6))
sns.set_context('talk',font_scale=.99)
sns.set_style('darkgrid')
ax.set_title('Common Genre Combinations in All Movies')
# color coding the sci fi fantasy movies
col = ['#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#648D1C',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF']
sns.set_palette(sns.color_palette(col))
sns.barplot(y=chart_combos.index,x=chart_combos)
ax.set_xlabel('Frequency')
ax.set_ylabel('Genres')

In [None]:
# make frequency distribution of top genre combos
top_combos = top_mi['genre']
top_combos = top_combos.map(lambda x: ', '.join([genre_legend[i]for i in x if i != '-']))
top_combo_freq = {}
for g in top_combos:
    if g not in top_combo_freq.keys():
        top_combo_freq[g] = 1
    else:
        top_combo_freq[g] += 1

In [None]:
chart_combos_top = pd.Series(top_combo_freq)
chart_combos_top = chart_combos_top.sort_values(ascending = False)[:20]
# chart_combos_top = chart_combos_top.sort_values(ascending = True)
chart_combos_top.plot(kind = 'barh', 
                      title = 'Genre Combinations in Top 50 Movies',
                      xlabel="Genres",
                      ylabel='Number of Movies',
                      fontsize=18)

Making final visualization with color coding to emphasize the science fiction and fantasy genres

In [None]:
# using seaborn to make style consistent with other group members' charts
fig, ax = plt.subplots(figsize = (8,6))
sns.set_context('talk',font_scale=.99)
sns.set_style('darkgrid')
ax.set_title('Genre Combinations in Top 50 Movies')
# color coding the sci fi fantasy movies
col = ['#A1D3EF',
       '#648D1C',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#A1D3EF',
       '#648D1C',
       '#648D1C',
       '#A1D3EF',
       '#648D1C',
       '#648D1C',
       '#A1D3EF',
       '#A1D3EF',
       '#648D1C',
       '#A1D3EF',
       '#A1D3EF',
       '#648D1C']
sns.set_palette(sns.color_palette(col))
sns.barplot(y=chart_combos_top.index,x=chart_combos_top)
ax.set_xlabel('Frequency')
ax.set_ylabel('Genres')

### Exploring Common Words in Descriptions of Top 50 Films

In [None]:
# importing Natural Language Toolkit for tokenization
import nltk
nltk.download("stopwords")
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [None]:
# making list of words in the synopses of top 50 movies
top_corpus = []
for i in range(50):
    top_corpus += word_tokenize(top_mi['synopsis'].iloc[i])
    
# excluding common stop words and punctuation
stops = stopwords.words('english')
punctuation = """!@#$%^&*()_+-={}[]:;"'<>?,./"""
words_to_cut = ["'s",'--',"'nt","'snt","''","``","n't","c","also","...","'ll","'re"]
top_corpus = [x.lower() for x in top_corpus if x.lower() not in stops and\
              x not in punctuation and x not in words_to_cut]

# making frequency distribution as a dict, then a pandas Series
corp_freq = {}
for word in top_corpus:
    if word not in corp_freq.keys():
        corp_freq[word] = 1
    else:
        corp_freq[word] += 1
frequencies = pd.Series(corp_freq)
chart_freq = frequencies.sort_values(ascending = False)[:20]


Charting Findings

In [None]:
chart_freq.plot(kind = 'bar', 
                title = 'Most Common Words in Top 50 Movie Descriptions',
                xlabel="Words",
                ylabel='Frequency',
                fontsize=20,
                rot = 65)

### Exploring Common Words of Reviews of "Fresh" Movies

In [None]:
# quick look at review data
reviews.head()

In [None]:
# make Series of reviews with 'fresh' rating
reviews_best = reviews[reviews['fresh']=='fresh']['review']
reviews_best.head()

In [None]:
# making list of words in the reviews of 'fresh' movies
review_corpus = []
for i in range(len(reviews_best)):
    review_corpus += word_tokenize(reviews_best.iloc[i])
    
# excluding common stop words and punctuation
review_corpus = [x.lower() for x in review_corpus if x.lower() not in stops\
                 and x not in punctuation and x not in words_to_cut]

# making frequency distribution as a dict, then a pandas Series
review_freq = {}
for word in review_corpus:
    if word not in review_freq.keys():
        review_freq[word] = 1
    else:
        review_freq[word] += 1
r_frequencies = pd.Series(review_freq)
chart_review = r_frequencies.sort_values(ascending = False)[:20]


In [None]:
# charting results
chart_review.plot(kind = 'bar', 
                  title = 'Most Common Words in "Fresh" Movie Reviews',
                  xlabel="Words",
                  ylabel='Frequency',
                  fontsize=18,
                  rot = 65)

### Testing for Significance in Sci-fi Fantasy Profit

In [None]:
tmdbMovies['movie'] = tmdbMovies['original_title']
tmdbMovies.head()

In [None]:
tmdbMovies = tmdbMovies.drop_duplicates()
tmdbMovies.duplicated().value_counts()

In [None]:
# attempting to merge movieBudgets and tmdbMovies
# to get dataframe with genre and cost and income
df_merged = pd.merge(movieBudgets,tmdbMovies,on=['movie'],how = 'inner')
df_merged.head()

In [None]:
# seeing how many values are missing in worldwide and domestic gross
df_merged['worldwide_gross'].value_counts()

In [None]:
# since there are more missing values in domestic, I will use worldwide gross
df_merged['domestic_gross'].value_counts()

In [None]:
rows_to_drop = df_merged[df_merged['worldwide_gross']==0].index
df_merged=df_merged.drop(rows_to_drop)
df_merged['worldwide_gross'].value_counts()

In [None]:
# making column for worldwide profit
df_merged['profit']=df_merged['worldwide_gross']-df_merged['production_budget']
df_merged.head()

In [None]:
# making Series of the profits of sci-fi fantasy movies
sff_profits = df_merged[(df_merged['Science Fiction']==1)\
                        |(df_merged['Fantasy']==1)]['profit']
# making Series of the profits of all movies
all_profits = df_merged['profit']

**H0:** The mean profit of Sci-Fi and Fantasy movies is no different from the population mean

**H1:** The mean profit of Sci-Fi and Fantasy movies is higher than the population mean

We use alpha = 0.01 (99% confidence level).

In [None]:
# calculating z statistic
mu = all_profits.mean()
x_bar = sff_profits.mean()
# population standard deviation (np.std uses ddof=0 by default)
sigma = np.std(all_profits)
n = len(sff_profits)

z = (x_bar - mu)/(sigma/np.sqrt(n))
z

In [None]:
# calculating the one-sided (upper-tail) p-value
pval = stats.norm.sf(z)
pval

In [None]:
print(f'Average Profit of All Films: {mu}')
print(f'Average Profit of Sci-fi and Fantasy Films: {x_bar}')

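The p-value is below our alpha of 0.01, so we reject the null hypothesis at the 99% confidence level: Sci-Fi and Fantasy films earn significantly higher profits than the overall population of films.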
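The same one-sided z-test can be wrapped in a small reusable function. This is a sketch using only the standard library, with made-up profit figures; the function name `one_sided_z_test` and the data are assumptions for illustration.

```python
import math
from statistics import NormalDist, fmean, pstdev


def one_sided_z_test(sample, population):
    """Return (z, p) for H1: mean(sample) > mean(population).

    Treats `population` as the full population, so its standard
    deviation is computed with ddof = 0 (as np.std does by default).
    """
    mu = fmean(population)
    sigma = pstdev(population)
    z = (fmean(sample) - mu) / (sigma / math.sqrt(len(sample)))
    p = 1 - NormalDist().cdf(z)
    return z, p


# Hypothetical profits in millions of dollars (not the project's data)
population = [-20, -5, 0, 10, 15, 30, 45, 60, 120, 300]
subset = [45, 60, 120, 300]
z, p = one_sided_z_test(subset, population)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A z-test assumes the population standard deviation is known, which is a reasonable approximation here only because the full merged dataset is treated as the population.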

# Conclusions

This analysis leads to three recommendations for Computing Vision as it moves into the film industry:

1. Focus on release month
2. Don't rely on critics' reviews, which have little impact on the profitability of a film
3. Explore the science fiction fantasy genre

The findings behind these recommendations are worth keeping in mind as you create your studio and first films:

- The month a movie is released is correlated with ROI
- Critics’ reviews don’t have a large influence on the profitability of a film
- Science Fiction Fantasy presents a large opportunity for profitable films
- Art is subjective and there are many other factors that can impact profitability

# Next Steps

- Plan movie releases around popular months and “Dump Months”
- Focus on audience reviews and quality over critics’ reviews
- Create films that utilize the Science Fiction Fantasy genre
- Continue film industry research into areas such as streaming platforms, multimedia releases, and utilizing Artificial Intelligence.
