
# Matthew's EDA


**Stakeholder**: Microsoft's new movie studio head 

**Problem**: What type of films should the studio create? 

**Subproblem**: Which films are performing best at the box office, and how can this studio compete with original video content from other big companies?

In [2]:
# Import packages for exploration
import pandas as pd
import numpy as np


## Rotten Tomatoes

Rotten Tomatoes is a review-aggregation website and database for movies.

In [3]:
# These files are TSVs, so the separator is set to '\t'; they are encoded using 'latin-1'
rt_reviews = pd.read_csv('../Data/zippedData/rt.reviews.tsv.gz', sep='\t', encoding='latin-1')
rt_movie_info = pd.read_csv('../Data/zippedData/rt.movie_info.tsv.gz', sep='\t', encoding='latin-1')


### Movie Info


In [4]:
# Visually confirming table was read properly
rt_movie_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [5]:
# Making observations based on table metadata
rt_movie_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


**Notes:**  
1. Columns 8, 9, and 11 ('currency', 'box_office', and 'studio') contain significantly more NaN values than the other columns.
2. 'rating' is the MPAA rating.
3. 'theater_date' and 'dvd_date' are stored as strings, not datetime objects.
4. 'box_office' appears to be the money made at the box office, but it is stored as an object rather than an integer or float.
5. There is NO title associated with this table. Movies seem to be identified only by 'id'.
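The type issues in notes 3 and 4 could be handled along the following lines. This is a minimal sketch run on a few hand-typed rows rather than on the real file. The `clean_movie_info` helper, the `runtime_minutes` column, and the assumption that 'box_office' may contain thousands separators are illustrative and do not come from the notebook.

```python
import pandas as pd

def clean_movie_info(df):
    """Return a copy with dates, box office, and runtime converted to typed columns."""
    out = df.copy()
    # Dates look like 'Oct 9, 1971'; missing or unparseable values become NaT.
    for col in ['theater_date', 'dvd_date']:
        out[col] = pd.to_datetime(out[col], format='%b %d, %Y', errors='coerce')
    # Assumption: box_office is text that may include commas, e.g. '600,000'.
    out['box_office'] = pd.to_numeric(
        out['box_office'].astype(str).str.replace(',', '', regex=False),
        errors='coerce',
    )
    # Runtime looks like '104 minutes'; keep only the leading number.
    out['runtime_minutes'] = pd.to_numeric(
        out['runtime'].str.extract(r'(\d+)', expand=False), errors='coerce'
    )
    return out

# Hand-typed sample rows shaped like the head() output above.
sample = pd.DataFrame({
    'theater_date': ['Oct 9, 1971', 'Aug 17, 2012', None],
    'dvd_date': ['Sep 25, 2001', 'Jan 1, 2013', None],
    'box_office': [None, '600,000', None],
    'runtime': ['104 minutes', '108 minutes', '200 minutes'],
})
cleaned = clean_movie_info(sample)
print(cleaned.dtypes)
```

Coercing errors to NaT/NaN keeps the missing values visible instead of raising, which suits a table where most 'box_office' entries are empty.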


### Reviews


In [6]:
# Visually confirming table was read properly
rt_reviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [7]:
# Making observations based on table metadata
rt_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


**Notes:**  
1. 'id' seems to be a foreign key in this table. There is still no way to identify the NAME of the movie outside of speculation, though it may be irrelevant.
2. 'top_critic' appears to be a boolean value stored as an integer.
3. 'rating' seems to be a fraction out of 5 (for example, '3/5'). It will need to be converted to a float.
4. The 'date' column is not a datetime object.
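The conversions suggested in notes 2 to 4 might look like the sketch below. It assumes ratings are strings of the form 'a/b' and treats anything else as missing; the `fraction_to_float` helper, the `rating_score` column, and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

def fraction_to_float(value):
    """Convert a string such as '3/5' to 0.6; return NaN for anything else."""
    if not isinstance(value, str) or '/' not in value:
        return np.nan
    numerator, _, denominator = value.partition('/')
    try:
        numerator, denominator = float(numerator), float(denominator)
    except ValueError:
        return np.nan
    # Guard against a zero denominator rather than raising.
    return numerator / denominator if denominator else np.nan

# Hand-typed rows for illustration only.
sample_reviews = pd.DataFrame({
    'rating': ['3/5', None, '4/0'],
    'top_critic': [0, 0, 1],
    'date': ['November 10, 2018', 'May 23, 2018', 'January 4, 2018'],
})
sample_reviews['rating_score'] = sample_reviews['rating'].apply(fraction_to_float)
sample_reviews['top_critic'] = sample_reviews['top_critic'].astype(bool)
sample_reviews['date'] = pd.to_datetime(sample_reviews['date'], format='%B %d, %Y')
print(sample_reviews.dtypes)
```

Dividing by the denominator puts ratings on a common 0 to 1 scale, which would also cope with ratings expressed out of something other than 5 if any exist in the full table.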

## Final Rotten Tomatoes Notes

There are many useful statistics in here. If the two Rotten Tomatoes tables are joined on 'id', many observations could be made:

1. Critic ratings based on genre  
2. Box office earnings by rating (MPAA)  
3. Rating by Runtime  
4. Writers / Directors who result in the most box office revenue  

These are all examples of statistics that could be used to recommend actions to the head of the studio, such as:

1. What kind of genre is most likely to rate well.
2. What kind of movie rating (MPAA) is most likely to make the most box office earnings.
3. The target length for a movie to get the highest rating.
4. What kind of writers or directors to hire.
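As a rough illustration of the first observation (critic ratings by genre), the two tables can be joined on 'id' and the pipe-separated 'genre' field split into one row per genre. The frames below are toy rows typed by hand, so the numbers are not real results, and the `is_fresh` and `fresh_by_genre` names are made up for this sketch.

```python
import pandas as pd

# Toy stand-ins for rt_movie_info and rt_reviews.
movie_info = pd.DataFrame({
    'id': [1, 3],
    'genre': ['Action and Adventure|Classics|Drama',
              'Drama|Science Fiction and Fantasy'],
})
reviews = pd.DataFrame({
    'id': [3, 3, 3, 1],
    'fresh': ['fresh', 'rotten', 'fresh', 'fresh'],
})

# Inner join keeps only reviews whose movie appears in the info table.
merged = reviews.merge(movie_info, on='id', how='inner')
merged['is_fresh'] = merged['fresh'].eq('fresh')

# One row per (review, genre), then the share of fresh reviews per genre.
fresh_by_genre = (
    merged.assign(genre=merged['genre'].str.split('|'))
    .explode('genre')
    .groupby('genre')['is_fresh']
    .agg(['mean', 'count'])
)
print(fresh_by_genre)
```

Keeping the `count` next to the `mean` makes it easier to spot genres whose score rests on only a handful of reviews.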

**Limitations**
- The box office column has a large number of null values. About 78% of the entries (1,220 of 1,560) are missing their 'box_office' value.

Even without the box office column, this dataset could support plenty of other recommendations.


## The Movie DB

The Movie DB is a user-editable database for movies and TV shows.

In [20]:
# Open and assign data to a dataframe.
tmdb = pd.read_csv('../Data/zippedData/tmdb.movies.csv.gz')

In [21]:
# Visually confirm table was read properly
tmdb.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [22]:
# There appears to be an index column already in this table. I will drop the column rather than re-read the file.
tmdb.drop(columns='Unnamed: 0', inplace=True)

# Re-check table
tmdb.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186
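An alternative to dropping the column afterwards is to tell `read_csv` that the first column is the index. The sketch below demonstrates this on a small in-memory CSV rather than the real file; the `csv_text` and `demo` names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose first column is an unnamed saved index.
csv_text = (
    ",id,title\n"
    "0,12444,Harry Potter and the Deathly Hallows: Part 1\n"
    "1,10191,How to Train Your Dragon\n"
)
# index_col=0 uses the first column as the index instead of creating 'Unnamed: 0'.
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(demo.columns.tolist())
```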


In [23]:
tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          26517 non-null  object 
 1   id                 26517 non-null  int64  
 2   original_language  26517 non-null  object 
 3   original_title     26517 non-null  object 
 4   popularity         26517 non-null  float64
 5   release_date       26517 non-null  object 
 6   title              26517 non-null  object 
 7   vote_average       26517 non-null  float64
 8   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 1.8+ MB


**Notes:**  
1. No null values.
2. 'genre_ids' looks like a list of numbers, each of which corresponds to a genre.
3. 'vote_average' and 'vote_count' look to be about user ratings.
4. It may be interesting to look right away at the relationship between 'popularity' and 'vote_average'.

[Popularity](https://developers.themoviedb.org/3/getting-started/popularity) seems to be a TMDB-specific metric that measures how popular an entry is based on recent user interaction with it.
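Notes 2 and 4 could be explored roughly as follows. The sketch assumes 'genre_ids' is stored as text such as '[12, 14, 10751]', which the object dtype suggests but does not confirm, and it uses three rows copied from the head() output, so the correlation it prints says nothing about the full table.

```python
import ast
import pandas as pd

sample_tmdb = pd.DataFrame({
    'genre_ids': ['[12, 14, 10751]', '[14, 12, 16, 10751]', '[12, 28, 878]'],
    'popularity': [33.533, 28.734, 28.515],
    'vote_average': [7.7, 7.7, 6.8],
})
# Assumption: genre_ids is text, so parse each value into a real Python list.
sample_tmdb['genre_ids'] = sample_tmdb['genre_ids'].apply(ast.literal_eval)
# One row per (movie, genre id) makes per-genre grouping possible.
per_genre = sample_tmdb.explode('genre_ids')
# Pearson correlation between the two metrics mentioned in note 4.
correlation = sample_tmdb['popularity'].corr(sample_tmdb['vote_average'])
print(len(per_genre), correlation)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.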

In [24]:
# Let's look at some general stats for the numeric columns
tmdb.describe()

Unnamed: 0,id,popularity,vote_average,vote_count
count,26517.0,26517.0,26517.0,26517.0
mean,295050.15326,3.130912,5.991281,194.224837
std,153661.615648,4.355229,1.852946,960.961095
min,27.0,0.6,0.0,1.0
25%,157851.0,0.6,5.0,2.0
50%,309581.0,1.374,6.0,5.0
75%,419542.0,3.694,7.0,28.0
max,608444.0,80.773,10.0,22186.0


In [25]:
# The 'popularity' mean value looks a little off to me. Let's check the column for irregularities.
tmdb['popularity'].value_counts()

0.600     7037
1.400      649
0.840      587
0.624      104
0.625       92
          ... 
3.742        1
14.749       1
7.924        1
8.414        1
9.060        1
Name: popularity, Length: 7425, dtype: int64

**Notes**  
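It seems to me that '0.600' may be a placeholder value due to the sheer number of occurrences: 7,037 of 26,517 rows, roughly a quarter of the table. It is also the column's minimum and 25th percentile in the describe() output above.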
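One quick way to quantify a suspicion like this is to compute what share of rows hold the single most common value. The `modal_share` helper and the toy series below are illustrative only.

```python
import pandas as pd

def modal_share(series):
    """Return the most common value and the fraction of rows that hold it."""
    shares = series.value_counts(normalize=True)
    return shares.index[0], shares.iloc[0]

popularity = pd.Series([0.6, 0.6, 0.6, 1.4, 0.84, 27.92])  # toy values
value, share = modal_share(popularity)
print(value, share)
```

A large share for one exact value in a continuous metric is a hint, not proof, that the value is a default or floor.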

In [26]:
# The 'popularity' column has made me suspicious, so now let's check the 'vote_count' column
tmdb['vote_count'].value_counts()

1       6541
2       3044
3       1757
4       1347
5        969
        ... 
2328       1
6538       1
489        1
2600       1
2049       1
Name: vote_count, Length: 1693, dtype: int64

In [28]:
# It is certainly possible that there are 6541 movies with 1 vote, but it feels unlikely.
tmdb[tmdb['vote_count'] == 1]

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
770,"[28, 80, 18, 53]",51488,en,Full Love,2.288,2010-01-01,Full Love,10.0,1
873,"[878, 27]",27485,en,Megaconda,1.960,2010-01-01,Megaconda,7.0,1
1004,"[12, 80]",76747,ru,Burning Daylight,1.588,2010-11-14,Burning Daylight,6.0,1
1008,"[16, 10751]",52272,en,Bratz: Pampered Petz,1.579,2010-10-05,Bratz: Pampered Petz,5.0,1
1063,"[9648, 53]",295682,en,Bright Falls,1.400,2010-04-27,Bright Falls,9.0,1
...,...,...,...,...,...,...,...,...,...
26512,"[27, 18]",488143,en,Laboratory Conditions,0.600,2018-10-13,Laboratory Conditions,0.0,1
26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.600,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,"[14, 28, 12]",381231,en,The Last One,0.600,2018-10-01,The Last One,0.0,1
26515,"[10751, 12, 28]",366854,en,Trailer Made,0.600,2018-06-22,Trailer Made,0.0,1


**Notes**  
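Multiple issues were discovered here: 'vote_average', 'vote_count', and 'popularity' all seem to contain placeholder or low-information values. For example, the last rows shown pair a single vote with a 'vote_average' of 0.0 and a 'popularity' of 0.600.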
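A possible first cleaning step is to flag rows backed by very few votes before trusting their 'vote_average'. The `MIN_VOTES` cutoff of 10 and the `low_confidence` column are assumptions made for this sketch; the three sample rows are taken from the output above.

```python
import pandas as pd

MIN_VOTES = 10  # hypothetical cutoff; should be tuned against the real data

sample = pd.DataFrame({
    'title': ['Inception', 'Full Love', 'Trailer Made'],
    'popularity': [27.92, 2.288, 0.6],
    'vote_average': [8.3, 10.0, 0.0],
    'vote_count': [22186, 1, 1],
})
# Flag rows instead of dropping them so the decision stays reversible.
sample['low_confidence'] = sample['vote_count'] < MIN_VOTES
reliable = sample[~sample['low_confidence']]
print(reliable['title'].tolist())
```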

## Final The Movie DB Notes

**Example Observations:**

1. User rating (vote_average) based on genre
2. Most common words in the title of the movies with the highest user rating (vote_average)

These are all examples of statistics that could be used to recommend actions to the head of the studio, such as:

1. What kind of film genre is most likely to rate well.
2. What kind of words should be put in the movie's title.
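The title-word idea could be prototyped as in the sketch below. The `top_title_words` helper, its `top_n` and `min_votes` defaults, and the tiny `STOPWORDS` set are all hypothetical choices; the sample rows come from the head() output earlier in this section.

```python
import re
from collections import Counter

import pandas as pd

STOPWORDS = {'the', 'and', 'to', 'of', 'a', 'your'}  # tiny illustrative list

def top_title_words(df, top_n=2, min_votes=1000):
    """Count words in the titles of the top_n highest-rated, well-voted movies."""
    # Ignore movies with too few votes for the average to be meaningful.
    rated = df[df['vote_count'] >= min_votes]
    best = rated.nlargest(top_n, 'vote_average')
    words = Counter()
    for title in best['title']:
        tokens = re.findall(r"[a-z']+", title.lower())
        words.update(t for t in tokens if t not in STOPWORDS)
    return words

sample = pd.DataFrame({
    'title': ['Harry Potter and the Deathly Hallows: Part 1',
              'How to Train Your Dragon', 'Iron Man 2', 'Toy Story', 'Inception'],
    'vote_average': [7.7, 7.7, 6.8, 7.9, 8.3],
    'vote_count': [10788, 7610, 12368, 10174, 22186],
})
counts = top_title_words(sample)
print(counts.most_common())
```

Filtering on 'vote_count' first matters here because single-vote entries can carry a 'vote_average' of 10.0 and would otherwise dominate the top of the ranking.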

**Limitations**
- This table has no box office or revenue column, so it cannot answer box office questions on its own.
- The 'vote_count', 'vote_average', and 'popularity' columns all appear to contain some form of placeholder values.

The data here would require heavy cleaning but is far from unusable.