# Tyler's Workspace

First I need to import useful packages

In [6]:
#useful packages to import first
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Now I'll need to open some data and read it into a data frame or some other workable form. It will be easiest to open data in the form of csv/tsv. We'll start with Box Office Mojo data.

## Box Office Mojo

In [5]:
# Making a df from Box Office Mojo data
bom_df = pd.read_csv('../zippedData/bom.movie_gross.csv.gz')
# Looking to see if the data was read correctly
bom_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [None]:
bom_df.tail()

Everything looks good in the resulting data frames. Now its time look at the data more in-depth. 
We should look at the meta data and summary statistics to get a better idea of what we're working with.

In [None]:
# Looking at the meta data
bom_df.info()

Looks like the quality of the data is alright at first glance. There are lots of missing values from foreign_gross column, but that makes sense if several of the movies only showed in the U.S. foreign_gross also seems not to have been read in as the correct data type. Might as well try and clean that before we use .describe().

In [None]:
# Noticed there was a comma in at least one observation, lets get rid of that
bom_df['foreign_gross'] = bom_df['foreign_gross'].str.replace(",", "")
# Let's see if it worked
bom_df['foreign_gross'] = bom_df['foreign_gross'].astype(float)

In [None]:
# Nice, now we'll have 3 columns of summary stats
bom_df.describe()

Looks like the movies are from the year 2000 to 2018, which is good to know. The observations seem to be regularly distributed enough in regard to years. It would be helpful to know total gross income as well which should be easy enough to make. Furthermore, it would be helpful to see the distribution for total gross income to know how much money could be expected. We'll need to join this data with much more data later, but this will be a start.

In [None]:
# Let's make the total gross column and check it out
# First we need to replace Nan with 0, so we don't lose all of our smaller films
bom_df['foreign_gross'].fillna(0, inplace=True)
# If we didn't replace those, we would lose over 1000 values, now we can rest easy
bom_df['total_gross'] = bom_df['domestic_gross'] + bom_df['foreign_gross']
bom_df.head()

In [None]:
# DONT RUN THIS TWICE OR IT WILL DIVIDE 'total_gross' TWICE
# It's hard to make meaning out of all these 0's, let's make these numbers more readable
# We'll work with the total gross in millions, this will make the smaller observations less readable,
# but we're trying to compete in the big league. 
bom_df['total_gross'] = bom_df['total_gross'] / 1000000
#Time to rename the col for clarity.
bom_df.rename(columns={'total_gross' : 'total_gross(mil)'}, inplace=True)


In [None]:
# Check it out
bom_df.head()

In [None]:
# There are some numbers we can understand, lets looks at total_gross(mil) closer
bom_df['total_gross(mil)'].describe()

Here we are seeing a lot of interesting things. For one, the mean total gross is $74mil compared to a median of $5.5mil. That is a serious positive skew. Also our std and interquartile range are huge. We will try and graph this, but we will probably need to use a logarithmic scale to make sense of it.

In [None]:
# First we'll set our figure name to grossing_hist
grossing_hist, ax = plt.subplots()
# Now we'll make a histogram of the data, we choose the number of bins by square root of n
ax.hist(bom_df['total_gross(mil)'], bins=round(3359**.5))


In [None]:
# As expected, we're gonna need to use a logarithmic scale to make any sense of this data
log_of_gross_hist, ax = plt.subplots()
# Here we use the numpy function, log, which will broadcast to the entire series
ax.hist(np.log(bom_df['total_gross(mil)']), bins=round(3359**.5))

This looks much closer to a bell curve, although it is still by no means regular. (Don't be worried by the negative values, this is because many of the smaller films only grossed a fraction of a million dollars). This seems to suggest that as movies get more appealing, the amount of money they make increases exponentially. However, it's hard to say much about what makes a movie appealing giving the limited amount of data we have so far. There also is a pretty large negative skew on this graph, which seems to suggest that it is quite easy to get lumped into the middle with mediocre grossing films. Finally, it's important to take all of these observations with a small mound of salt, given we have have not even taken a peek at the rest of the data yet. 

Another critical thing we should be thinking about is the type of studios we will be competing with. The average movie studio won't have access to the same resources that we have at Microsoft, so we should be sure to whittle down observations to studios we will actually be competing with. It will ultimately be a failure if we invest a fortune into a new studio that fails to compete with other big-name studios with similiar levels of spending. 

In [None]:
# Let's see the studios with the most movies
counts = bom_df['studio'].value_counts()

In [None]:
bom_df.head()

In [None]:
# Now we are going to remove all the studios that made less than 10 movies
prolific_counts = counts[counts.values > 10]
print(prolific_counts.index)
# I need to figure this out later
prolific_studios = None

In [None]:
# Let's group by studio and see which ones make the most
means = bom_df.groupby('studio')['total_gross(mil)'].mean()
high_means = means.sort_values(ascending=False).head(20)
high_means

In [None]:
sums = bom_df.groupby('studio')['total_gross(mil)'].sum()
high_sums = sums.sort_values(ascending=False).head(20)
high_sums

In [None]:
# Not sure what all these movie studios are so let's make a dictionary to replace their names
# Google will be our best friend for figuring this out
studio_dict = {'BV': 'Disney', 'WB (NL)': 'WB', 'P/DW': 'Par.'}
# Let's see the movies by the studios and then google the studio
bom_df[bom_df['studio'] == 'HC'].head()
# This studio just has one Chinese propaganda flavored film, we should probably ignore this one
# (unless another movie by the studio comes up)
bom_df[bom_df['studio'] == 'P/DW']
# This one is Paramount Dreamworks, we'll just consider it Paramount
bom_df[bom_df['studio'] == 'GrtIndia']
# Once again just one movie by this studio, we'll ignore it for now

## IMDB

In [1]:
# This is a sqlite database, so we'll need to import sqlite3
import sqlite3

# Me overthinking this crap
# Also, this is a zipped file, so we'll need import zipfile
#import zipfile

In [4]:
#Also me overthinking this crap
#zipfolder = "./dsc-phase-1-project-v2-4/im.db.zip"
#destination = "C:/Users/TWood/Documents/FlatironMaterials/project_1/"
#with zipfile.ZipFile(zipfolder) as file:
#    file.extracttractall(destination)

# Connect to the unzipped db 
con = sqlite3.connect('C:/Users/TWood/Documents/FlatironMaterials/project_1/im.db')

In [7]:
# Make movie_basics data frame
imdb_basics = pd.read_sql("""
SELECT *
FROM movie_basics
""", con)
imdb_basics

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,


In [None]:
# Make movie_ratings df
imdb_ratings = pd.read_sql("""
SELECT *
FROM movie_ratings
""", con)
imdb_ratings

In [None]:
# We could get a lot more out of these Tables if we joined them
imdb_title_ratings = pd.read_sql("""
SELECT *
FROM movie_basics
JOIN movie_ratings
    USING(movie_id)
""", con)
imdb_title_ratings

In [None]:
imdb_title_ratings.info()

In [None]:
imdb_title_ratings.describe()

In [None]:
rated_imdb = imdb_title_ratings[imdb_title_ratings['numvotes'] > 100]

In [None]:
imdb_rating_hist, ax = plt.subplots()
ax.hist(rated_imdb['averagerating'])

In [None]:
con.close()

## Rotten Tomatoes

Here, I'm confused. There are no movie_titles in either file, and while we can technically get some amount of information from this data, we won't figure out a ton. Maybe some of the extra API's and Web scraping they would expect us to do would be data that would flesh out these data frames that are missing valueable data.

In [24]:
rt_info = pd.read_csv('../zippedData/rt.movie_info.tsv.gz', delimiter="\t", compression='gzip')
rt_info

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,
...,...,...,...,...,...,...,...,...,...,...,...,...
1555,1996,Forget terrorists or hijackers -- there's a ha...,R,Action and Adventure|Horror|Mystery and Suspense,,,"Aug 18, 2006","Jan 2, 2007",$,33886034,106 minutes,New Line Cinema
1556,1997,The popular Saturday Night Live sketch was exp...,PG,Comedy|Science Fiction and Fantasy,Steve Barron,Terry Turner|Tom Davis|Dan Aykroyd|Bonnie Turner,"Jul 23, 1993","Apr 17, 2001",,,88 minutes,Paramount Vantage
1557,1998,"Based on a novel by Richard Powell, when the l...",G,Classics|Comedy|Drama|Musical and Performing Arts,Gordon Douglas,,"Jan 1, 1962","May 11, 2004",,,111 minutes,
1558,1999,The Sandlot is a coming-of-age story about a g...,PG,Comedy|Drama|Kids and Family|Sports and Fitness,David Mickey Evans,David Mickey Evans|Robert Gunter,"Apr 1, 1993","Jan 29, 2002",,,101 minutes,


In [25]:
rt_reviews = pd.read_csv('../zippedData/rt.reviews.tsv.gz', delimiter="\t", encoding='unicode_escape')
rt_reviews

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"
...,...,...,...,...,...,...,...,...
54427,2000,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"


In [36]:
rt_join = rt_info.merge(rt_reviews, on='id', how='inner', suffixes=('', '_review'))
rt_join

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,review,rating_review,fresh,critic,top_critic,publisher,date
0,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54427,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"


In [37]:
#change fresh into True/False
rt_join['fresh_bool'] = rt_join['fresh'].loc[rt_join['fresh'] == 'fresh'] = float(1)
rt_join['fresh_bool'] = rt_join['fresh'].loc[rt_join['fresh'] == 'rotten'] = float(0)
rt_join

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,review,rating_review,fresh,critic,top_critic,publisher,date,fresh_bool
0,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,A distinctly gallows take on contemporary fina...,3/5,1,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018",0.0
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,It's an allegory in search of a meaning that n...,,0,Annalee Newitz,0,io9.com,"May 23, 2018",0.0
2,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,... life lived in a bubble in financial dealin...,,1,Sean Axmaker,0,Stream on Demand,"January 4, 2018",0.0
3,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,Continuing along a line introduced in last yea...,,1,Daniel Kasman,0,MUBI,"November 16, 2017",0.0
4,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One,... a perverse twist on neorealism...,,1,,0,Cinema Scope,"October 12, 2017",0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54427,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,The real charm of this trifle is the deadpan c...,,1,Laura Sinagra,1,Village Voice,"September 24, 2002",0.0
54428,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,1/5,0,Michael Szymanski,0,Zap2it.com,"September 21, 2005",0.0
54429,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2/5,0,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005",0.0
54430,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures,,2.5/5,0,Christopher Null,0,Filmcritic.com,"September 7, 2003",0.0


In [71]:
Fresh_values = rt_join[rt_join['fresh'] == 1]
Fresh_movies = pd.DataFrame(Fresh_values['id'].value_counts())
Fresh_movies.reset_index(inplace=True)
Fresh_movies

Unnamed: 0,index,id
0,782,308
1,1067,261
2,1960,255
3,1083,250
4,251,243
...,...,...
1052,1604,1
1053,1412,1
1054,406,1
1055,470,1


In [72]:
Rotten_values = rt_join[rt_join['fresh'] == 0]
Rotten_movies = pd.DataFrame(Rotten_values['id'].value_counts())
Rotten_movies.reset_index(inplace=True)
Rotten_movies

Unnamed: 0,index,id
0,1325,146
1,443,136
2,321,132
3,1071,128
4,1376,128
...,...,...
1008,1148,1
1009,1393,1
1010,1645,1
1011,1526,1


In [76]:
Fresh_counts = Fresh_movies.merge(Rotten_movies, on='index', how='inner', suffixes=('fresh','rotten'))
Fresh_counts

Unnamed: 0,index,idfresh,idrotten
0,782,308,30
1,1067,261,14
2,1960,255,3
3,1083,250,10
4,251,243,15
...,...,...,...
930,1828,1,1
931,1813,1,6
932,1845,1,11
933,1604,1,1


In [107]:
Fresh_counts['proportion_fresh'] = Fresh_counts['idfresh'] / (Fresh_counts['idfresh']+Fresh_counts['idrotten'])
Good_movies = Fresh_counts[Fresh_counts['proportion_fresh'] > .9]
select_good = list(Good_movies['index'].values)
rt_fresh = rt_join[rt_join['id'].isin(select_good)]
popular_directors = rt_fresh['director'].value_counts().head(10)
popular_directors.index

Index(['Martin McDonagh', 'Steven Spielberg', 'Luca Guadagnino', 'Sam Mendes',
       'Michel Hazanavicius', 'Joel Coen|Ethan Coen', 'Ava DuVernay',
       'Todd Haynes', 'Steven Soderbergh', 'Alexander Payne'],
      dtype='object')

In [128]:
list(['Drama']*30)

['Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama',
 'Drama']

In [141]:
rt_fresh['genre_list'] = rt_fresh['genre'].str.split('|')

n = len(rt_fresh['genre_list'])

Drama = pd.DataFrame(list(['Drama']*n))

rt_fresh[Drama[0].isin(rt_fresh['genre_list'])]

Genre_counts = {}
Genres = rt_fresh['genre_list'].iterrows()
for lst in Genres:
    
    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  rt_fresh['genre_list'] = rt_fresh['genre'].str.split('|')
  rt_fresh[Drama[0].isin(rt_fresh['genre_list'])]


IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

In [134]:
rt_fresh

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,...,review,rating_review,fresh,critic,top_critic,publisher,date,fresh_bool,genre list,genre_list
726,23,A fictional film set in the alluring world of ...,R,Drama,,,"Dec 20, 2013","Mar 18, 2014",$,99165609,...,The movie is great. It is interesting without ...,8/10,1,Debbie Baldwin,0,Ladue News,"November 2, 2018",0.0,[Drama],[Drama]
727,23,A fictional film set in the alluring world of ...,R,Drama,,,"Dec 20, 2013","Mar 18, 2014",$,99165609,...,It doesn't matter how much of the story is tru...,,1,Sarah Gopaul,0,Digital Journal,"October 31, 2018",0.0,[Drama],[Drama]
728,23,A fictional film set in the alluring world of ...,R,Drama,,,"Dec 20, 2013","Mar 18, 2014",$,99165609,...,David O. Russell follows The Fighter and Silve...,3.5/5,1,Alistair Ryder,0,Cinemole,"October 21, 2018",0.0,[Drama],[Drama]
729,23,A fictional film set in the alluring world of ...,R,Drama,,,"Dec 20, 2013","Mar 18, 2014",$,99165609,...,The movie and most of its characters too often...,,0,Pat Padua,0,DCist,"August 30, 2018",0.0,[Drama],[Drama]
730,23,A fictional film set in the alluring world of ...,R,Drama,,,"Dec 20, 2013","Mar 18, 2014",$,99165609,...,As cons within cons press loyalties on every s...,,1,Kathryn Reklis,0,The Christian Century,"August 21, 2018",0.0,[Drama],[Drama]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54042,1992,"The title character, played by John Turturro, ...",R,Comedy|Drama,Joel Coen,Joel Coen|Ethan Coen,"Aug 21, 1991","May 20, 2003",,,...,,4/5,1,Rich Cline,0,Shadows on the Wall,"October 4, 2003",0.0,"[Comedy, Drama]","[Comedy, Drama]"
54043,1992,"The title character, played by John Turturro, ...",R,Comedy|Drama,Joel Coen,Joel Coen|Ethan Coen,"Aug 21, 1991","May 20, 2003",,,...,,4/5,1,Brian J. Arthurs,0,Beach Reporter (Southern California),"November 8, 2002",0.0,"[Comedy, Drama]","[Comedy, Drama]"
54044,1992,"The title character, played by John Turturro, ...",R,Comedy|Drama,Joel Coen,Joel Coen|Ethan Coen,"Aug 21, 1991","May 20, 2003",,,...,,4/5,1,Philip Martin,0,Arkansas Democrat-Gazette,"August 29, 2002",0.0,"[Comedy, Drama]","[Comedy, Drama]"
54045,1992,"The title character, played by John Turturro, ...",R,Comedy|Drama,Joel Coen,Joel Coen|Ethan Coen,"Aug 21, 1991","May 20, 2003",,,...,,4/5,1,Jeffrey Westhoff,0,"Northwest Herald (Crystal Lake, IL)","August 16, 2002",0.0,"[Comedy, Drama]","[Comedy, Drama]"


## The Movie DB

In [8]:
movie_db = pd.read_csv("../zippedData/tmdb.movies.csv.gz", index_col=[0])
movie_db

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.920,2010-07-16,Inception,8.3,22186
...,...,...,...,...,...,...,...,...,...
26512,"[27, 18]",488143,en,Laboratory Conditions,0.600,2018-10-13,Laboratory Conditions,0.0,1
26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.600,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,"[14, 28, 12]",381231,en,The Last One,0.600,2018-10-01,The Last One,0.0,1
26515,"[10751, 12, 28]",366854,en,Trailer Made,0.600,2018-06-22,Trailer Made,0.0,1


In [None]:
mdb.info()

In [None]:
mdb.describe()

## The Numbers

In [6]:
numbers = pd.read_csv("../zippedData/tn.movie_budgets.csv.gz", index_col=[0])
numbers

Unnamed: 0_level_0,release_date,movie,production_budget,domestic_gross,worldwide_gross
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
...,...,...,...,...,...
78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0


In [7]:
numbers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5782 entries, 1 to 82
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   release_date       5782 non-null   object
 1   movie              5782 non-null   object
 2   production_budget  5782 non-null   object
 3   domestic_gross     5782 non-null   object
 4   worldwide_gross    5782 non-null   object
dtypes: object(5)
memory usage: 271.0+ KB


In [8]:
numbers['production_clean'] = numbers['production_budget'].str.replace("$", "").str.replace(",", "").astype(float)
numbers['domestic_clean'] = numbers['domestic_gross'].str.replace("$", "").str.replace(",", "").astype(float)
numbers['worldwide_clean'] = numbers['worldwide_gross'].str.replace("$", "").str.replace(",", "").astype(float)
numbers.head()

Unnamed: 0_level_0,release_date,movie,production_budget,domestic_gross,worldwide_gross,production_clean,domestic_clean,worldwide_clean
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279",425000000.0,760507625.0,2776345000.0
2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875",410600000.0,241063875.0,1045664000.0
3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350",350000000.0,42762350.0,149762400.0
4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963",330600000.0,459005868.0,1403014000.0
5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747",317000000.0,620181382.0,1316722000.0


In [9]:
numbers.describe()

Unnamed: 0,production_clean,domestic_clean,worldwide_clean
count,5782.0,5782.0,5782.0
mean,31587760.0,41873330.0,91487460.0
std,41812080.0,68240600.0,174720000.0
min,1100.0,0.0,0.0
25%,5000000.0,1429534.0,4125415.0
50%,17000000.0,17225940.0,27984450.0
75%,40000000.0,52348660.0,97645840.0
max,425000000.0,936662200.0,2776345000.0
