## Data Exploration 


In [2]:
import pandas as pd
df_imdb_basics = pd.read_csv('zippedData/imdb.title.basics.csv.gz', compression='gzip')
df_imdb_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
tconst             146144 non-null object
primary_title      146144 non-null object
original_title     146123 non-null object
start_year         146144 non-null int64
runtime_minutes    114405 non-null float64
genres             140736 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


A brief look at the data shows two things: (1) some movies haven't come out yet, and (2) some movies have no genres. Neither group will help my analysis, so I have decided to remove both. I also don't want to look at movies released before 2014, so I've removed those too.
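The observations above can be checked before filtering. The sketch below uses a small stand-in dataframe (`sample`, with made-up rows) instead of the real file, so the numbers are illustrative only. It counts missing genres and out-of-window years, then filters in one step.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook this would be df_imdb_basics.
sample = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "start_year": [2012, 2016, 2019, 2025],
    "genres": ["Drama", None, "Comedy,Drama", "Action"],
})

# Count rows with no genre and rows outside the 2014-2020 window.
missing_genres = sample["genres"].isna().sum()
in_window = sample["start_year"].between(2014, 2020)  # inclusive on both ends
outside_window = (~in_window).sum()

# Apply both conditions at once; .copy() makes the result an independent
# dataframe, so later in-place changes do not touch the original.
kept = sample[in_window & sample["genres"].notna()].copy()
print(missing_genres, outside_window, len(kept))
```

Combining the conditions in a single boolean mask is a design choice for the sketch; the notebook reaches the same result in two steps.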

In [3]:
df_filtered = df_imdb_basics[(df_imdb_basics['start_year'] <= 2020) & (df_imdb_basics['start_year'] >= 2014)]

In [4]:
df_filtered = df_filtered.dropna(subset=['genres'])


This still leaves me with about 89K rows.

In [5]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 89084 entries, 1 to 146140
Data columns (total 6 columns):
tconst             89084 non-null object
primary_title      89084 non-null object
original_title     89082 non-null object
start_year         89084 non-null int64
runtime_minutes    68721 non-null float64
genres             89084 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 4.8+ MB


Next, I want to explore the ratings data and merge it with the filtered dataframe. But first I want to remove titles that have too few votes (100 or fewer).

In [6]:
df_imdb_ratings = pd.read_csv('zippedData/imdb.title.ratings.csv.gz', compression='gzip')
df_imdb_ratings.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [7]:
df_imdb_ratings = df_imdb_ratings[(df_imdb_ratings['numvotes'] > 100)]

In [8]:
df_imdb_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28648 entries, 1 to 73855
Data columns (total 3 columns):
tconst           28648 non-null object
averagerating    28648 non-null float64
numvotes         28648 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 895.2+ KB


In [9]:
from pandasql import sqldf
# Helper that runs a SQL query against the dataframes in the global namespace
pysqldf = lambda q: sqldf(q, globals())

Now I join the two dataframes on tconst, so only those films that have ratings will show up.

In [18]:
q = '''SELECT *
        FROM df_filtered
        JOIN df_imdb_ratings
        USING(tconst)
        ;'''

imdb_joined_df = pysqldf(q)
imdb_joined_df.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517
1,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119
2,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy",8.1,263
3,tt0315642,Wazir,Wazir,2016,103.0,"Action,Crime,Drama",7.1,15378
4,tt0331314,Bunyan and Babe,Bunyan and Babe,2017,84.0,"Adventure,Animation,Comedy",5.0,302
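The same inner join can also be written without SQL. This is a minimal sketch using `DataFrame.merge` on two small stand-in dataframes (`basics` and `ratings`, both hypothetical), not the notebook's actual data.

```python
import pandas as pd

# Hypothetical stand-ins for df_filtered and df_imdb_ratings.
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primary_title": ["A", "B", "C"],
})
ratings = pd.DataFrame({
    "tconst": ["tt2", "tt3", "tt9"],
    "averagerating": [6.9, 7.1, 5.0],
    "numvotes": [4517, 15378, 302],
})

# An inner merge keeps only the tconst values present in both dataframes,
# which mirrors JOIN ... USING(tconst).
joined = basics.merge(ratings, on="tconst", how="inner")
print(joined)
```

Titles without a rating (`tt1`) and ratings without a title (`tt9`) both drop out of the result.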


In [22]:
imdb_joined_df['genres'] = imdb_joined_df['genres'].str.split(',')

In [23]:
imdb_joined_df['genres']

0                               [Drama]
1              [Comedy, Drama, Fantasy]
2        [Adventure, Animation, Comedy]
3                [Action, Crime, Drama]
4        [Adventure, Animation, Comedy]
                      ...              
17076                     [Documentary]
17077                           [Drama]
17078                           [Drama]
17079                           [Drama]
17080                   [Drama, Family]
Name: genres, Length: 17081, dtype: object
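Once each row holds a list of genres, one possible next step is to count how often each genre appears. The sketch below assumes pandas 0.25 or later (for `Series.explode`) and uses three genre strings copied from the output above, so the counts are illustrative only.

```python
import pandas as pd

# Three genre strings taken from the joined output above.
genres = pd.Series(
    ["Drama", "Comedy,Drama,Fantasy", "Adventure,Animation,Comedy"],
    name="genres",
)

# Split into lists, then give every list element its own row.
genre_lists = genres.str.split(",")
exploded = genre_lists.explode()

# Count how many films carry each genre.
genre_counts = exploded.value_counts()
print(genre_counts)
```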

In [32]:
df_bom_gross = pd.read_csv('zippedData/bom.movie_gross.csv.gz', compression='gzip')
df_bom_gross.sort_values('domestic_gross', ascending = False)

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1872,Star Wars: The Force Awakens,BV,936700000.0,1131.6,2015
3080,Black Panther,BV,700100000.0,646900000,2018
3079,Avengers: Infinity War,BV,678800000.0,1369.5,2018
1873,Jurassic World,Uni.,652300000.0,1019.4,2015
727,Marvel's The Avengers,BV,623400000.0,895500000,2012
...,...,...,...,...,...
1975,Surprise - Journey To The West,AR,,49600000,2015
2392,Finding Mr. Right 2,CL,,114700000,2016
2468,Solace,LGP,,22400000,2016
2595,Viral,W/Dim.,,552000,2016


In [28]:
df_rt_info = pd.read_csv('zippedData/rt.movie_info.tsv.gz', sep='\t', compression='gzip')
df_rt_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [30]:
df_rt_rev = pd.read_csv('zippedData/rt.reviews.tsv.gz', sep='\t', compression='gzip', encoding='iso-8859-1')
df_rt_rev.head()
df_rt_rev[df_rt_rev['id'] == 3]

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"
...,...,...,...,...,...,...,...,...
158,3,Beyond its withering critique of contemporary ...,,fresh,David Jenkins,0,Little White Lies,"May 25, 2012"
159,3,"Threatens to soar and to be important, but it ...",3/5,fresh,Dave Calhoun,1,Time Out,"May 25, 2012"
160,3,A parade of hollow didactic encounters.,,rotten,Owen Gleiberman,1,Entertainment Weekly,"May 25, 2012"
161,3,[An] agonisingly self-conscious and meagre pie...,2/5,rotten,Peter Bradshaw,0,Guardian,"May 25, 2012"


Looking at the movie budgets data, I want to see if I can get a sense of which movies have the highest return on investment. First I need to convert the budget and gross columns from strings into integers.

In [64]:
df_budgets = pd.read_csv('zippedData/tn.movie_budgets.csv.gz', compression='gzip')
df_budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [65]:
# Strip "$" and "," from each money column, then cast to integers
for col in ['production_budget', 'domestic_gross', 'worldwide_gross']:
    df_budgets[col] = df_budgets[col].str.replace(r'[$,]', '', regex=True).astype('int64')

In [66]:
df_budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
id                   5782 non-null int64
release_date         5782 non-null object
movie                5782 non-null object
production_budget    5782 non-null int64
domestic_gross       5782 non-null int64
worldwide_gross      5782 non-null int64
dtypes: int64(4), object(2)
memory usage: 271.2+ KB


In [67]:
# Divide the two columns element-wise so each row gets a single ratio
df_budgets['domestic_roi'] = df_budgets['domestic_gross'] / df_budgets['production_budget']

In [68]:
df_budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,domestic_roi
0,1,"Dec 18, 2009",Avatar,425000000,760507625,2776345279,1.789430
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,0.587101
2,3,"Jun 7, 2019",Dark Phoenix,350000000,42762350,149762350,0.122178
3,4,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963,1.388403
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,1.956408


In [69]:
df_budgets['domestic_roi'][0]

1.789429705882353
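To get a sense of which movies return the most on their budget, one option is to divide the two columns directly, which gives one value per row, and then rank the rows. The sketch below uses three rows from the budgets table above. The column names `domestic_ratio` and `domestic_net_roi` are my own, and the net version (profit divided by budget) is shown only as an alternative definition.

```python
import pandas as pd

# Three rows copied from the budgets table above, already as integers.
budgets = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix", "Avengers: Age of Ultron"],
    "production_budget": [425000000, 350000000, 330600000],
    "domestic_gross": [760507625, 42762350, 459005868],
})

# Column-by-column division is element-wise: one ratio per movie.
budgets["domestic_ratio"] = budgets["domestic_gross"] / budgets["production_budget"]

# Alternative definition: net return relative to budget.
budgets["domestic_net_roi"] = (
    budgets["domestic_gross"] - budgets["production_budget"]
) / budgets["production_budget"]

# Rank by ratio and keep the two best rows.
top = budgets.nlargest(2, "domestic_ratio")
print(top[["movie", "domestic_ratio", "domestic_net_roi"]])
```

The two definitions differ by exactly 1, so they rank movies identically. The net version reads more naturally as a return, since a value below 0 means the film earned less than it cost.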