# Exploratory Notebook

### Exploring the Data

Importing pandas with its alias pd, along with our CustomLibrary module as cl.

In [1]:
import pandas as pd
import CustomLibrary as cl

Unzipping the zipped data files (all at once)

In [2]:
#!find . -name '*.tsv.gz' -exec gzip -d {} \;

Reading in all the data files to understand which ones will help answer our business questions. We check each file with the .head() and .tail() methods to view a section of the data. The .shape attribute tells us how many rows and columns each dataset has. The .info() method shows the dtypes of the columns and reveals possible missing values.
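
As a minimal sketch of the checks described above, here is the same inspection on a made-up three-row frame. The summarize helper and the sample data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Collect the facts we usually read off .shape and .info()."""
    return {
        "rows": df.shape[0],                          # .shape is an attribute, not a method
        "columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),    # dtype of every column
        "missing": df.isna().sum().to_dict(),         # missing values per column
    }

# Hypothetical sample data standing in for one of the files
sample = pd.DataFrame({"title": ["A", "B", "C"], "gross": [1.0, None, 3.0]})

print(sample.head(2))   # first rows
print(sample.tail(1))   # last rows
summary = summarize(sample)
print(summary)
```

Collecting the summary in a dictionary makes it easy to compare several files side by side instead of scrolling through separate .info() printouts.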

In [3]:
df0= pd.read_csv('./data/zippedData/imdb.name.basics.csv')
df0.head()
#df0.shape

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"


In [4]:
df1= pd.read_csv('./data/zippedData/bom.movie_gross.csv')
df1.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [5]:
df2= pd.read_csv('./data/zippedData/imdb.title.akas.csv')
df2.head()

Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0369610,10,Джурасик свят,BG,bg,,,0.0
1,tt0369610,11,Jurashikku warudo,JP,,imdbDisplay,,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,,,short title,0.0
4,tt0369610,14,Jurassic World,FR,,imdbDisplay,,0.0


The dataset below gives us important information about the movies. Important columns: primary_title, start_year, and genres.

In [6]:
df3= pd.read_csv('./data/zippedData/imdb.title.basics.csv')
df3.head()
df_title_movie = df3.rename(columns = {'primary_title':'movie'})
df_title_movie.head()

Unnamed: 0,tconst,movie,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


Renamed the primary_title column to 'movie' so that the merge with the next important dataset goes smoothly.

In [7]:
df_title_movie.describe()

Unnamed: 0,start_year,runtime_minutes
count,146144.0,114405.0
mean,2014.621798,86.187247
std,2.733583,166.36059
min,2010.0,1.0
25%,2012.0,70.0
50%,2015.0,87.0
75%,2017.0,99.0
max,2115.0,51420.0


In [8]:
df4= pd.read_csv('./data/zippedData/imdb.title.crew.csv')
df4.head()

Unnamed: 0,tconst,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943


In [9]:
df5= pd.read_csv('./data/zippedData/imdb.title.principals.csv')
df5.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"


In [10]:
df6= pd.read_csv('./data/zippedData/imdb.title.ratings.csv')
df6.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [11]:
df7= pd.read_csv('./data/zippedData/tmdb.movies.csv')
df7.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


The next file is one of the most important for understanding the costs and profits of different movies.

In [12]:
df8= pd.read_csv('./data/zippedData/tn.movie_budgets.csv')
df8['start_year']= [int(x[-4:]) for x in df8['release_date']]
df8.head()
#df8.loc[df8['domestic_gross'] == '$0']

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,start_year
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279",2009
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875",2011
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350",2019
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963",2015
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747",2017


Created a new column called start_year that holds the release year as an integer, taken from the last four characters of the release_date string. This gives us a second key for merging this dataset with the title dataset chosen above.
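
The cell above slices the last four characters of each date string. An alternative sketch, assuming every release_date follows the 'Dec 18, 2009' pattern seen in the output, is to parse the dates and read the year from them. The small budgets frame here is made up for illustration.

```python
import pandas as pd

# Hypothetical rows in the same format as tn.movie_budgets.csv
budgets = pd.DataFrame({"release_date": ["Dec 18, 2009", "Jun 7, 2019"]})

# Parse with an explicit format so malformed dates raise an error
# instead of silently producing a wrong year.
parsed = pd.to_datetime(budgets["release_date"], format="%b %d, %Y")
budgets["start_year"] = parsed.dt.year

print(budgets)
```

Parsing is slower than slicing, but it fails loudly on a date that does not match the expected format.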

### Cleaning the Data

First, we merge the datasets so that we have all the information we need in one dataframe.
The 'inner' join is chosen because each dataset has important data for the analysis. A row that is missing from either dataset is useless for our chosen analysis. The merge is on 'movie' and 'start_year' so that movies with the same name but different release years are not matched to each other, which keeps duplicates to a minimum.
We also check for any remaining duplicate movie names and drop those rows from our merged dataset.

In [13]:
df_budget_merge = pd.merge(df8, df_title_movie, how = 'inner', on = ['movie', 'start_year'])
df_duplicates = df_budget_merge[df_budget_merge['movie'].duplicated()]
#df_budget_merge.loc[df_budget_merge['domestic_gross'] == '$0'].head()
df_budget_merge.drop(df_duplicates.index, axis = 0, inplace = True)
#df_budget_merge.head()
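
To make the effect of the two merge keys concrete, here is a small sketch on made-up data. The movie names, budgets, and genres are hypothetical.

```python
import pandas as pd

# Hypothetical budget rows: two different films share the name "Alpha"
budgets = pd.DataFrame({
    "movie": ["Alpha", "Alpha", "Beta"],
    "start_year": [2010, 2015, 2012],
    "production_budget": [10, 20, 30],
})

# Hypothetical title rows: "Alpha" (2010) is listed twice, "Gamma" has no budget
titles = pd.DataFrame({
    "movie": ["Alpha", "Alpha", "Beta", "Gamma"],
    "start_year": [2010, 2010, 2012, 2013],
    "genres": ["Drama", "Comedy", "Action", "Horror"],
})

# The inner join keeps only rows whose (movie, start_year) pair exists on both sides
merged = pd.merge(budgets, titles, how="inner", on=["movie", "start_year"])

# Keep the first row for each movie name and drop the rest
deduped = merged.drop_duplicates(subset="movie", keep="first")

print(merged)
print(deduped)
```

"Alpha" (2015) and "Gamma" drop out because they have no partner on the other side, and the repeated "Alpha" (2010) match is reduced to a single row.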

We next check for missing values. We use worldwide_gross as a check to see if any data necessary for the profit calculation is missing; in this file a missing gross appears as the string '$0'. If it is missing, we drop the row, because filling in the value would throw off our analysis. A '$0' gross also tends to mark more recent, streaming-only movies, such as those from Netflix, and we want to look only at movies released at the box office.

In [14]:
df_no_values = df_budget_merge.loc[df_budget_merge['worldwide_gross'] == '$0']
df_budget_merge.drop(df_no_values.index, axis = 0, inplace = True)
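
The same filter can be written with a boolean mask that returns a new frame instead of dropping rows in place. This is a sketch on made-up rows, not the notebook's data.

```python
import pandas as pd

# Hypothetical rows; "$0" marks a movie with no reported worldwide gross
films = pd.DataFrame({
    "movie": ["A", "B", "C"],
    "worldwide_gross": ["$1,000", "$0", "$250"],
})

# True for every row that has a reported gross
has_gross = films["worldwide_gross"] != "$0"

# .copy() gives an independent frame, so later column assignments
# do not trigger chained-assignment warnings
box_office = films.loc[has_gross].copy()

print(box_office)
```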

Below is a function that converts dollar-formatted strings into integers so we can use them in statistical calculations.

In [15]:
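def clean_dollars(dataframe, column_str):
    # Strip the thousands separators and the dollar sign, then cast to int.
    # regex=False treats '$' as a literal character rather than a regex anchor.
    dataframe[column_str] = dataframe[column_str].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(int)
    return dataframe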

clean_dollars(df_budget_merge, 'production_budget')
clean_dollars(df_budget_merge, 'domestic_gross')
clean_dollars(df_budget_merge, 'worldwide_gross')
df_budget_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1365 entries, 0 to 1542
Data columns (total 11 columns):
id                   1365 non-null int64
release_date         1365 non-null object
movie                1365 non-null object
production_budget    1365 non-null int64
domestic_gross       1365 non-null int64
worldwide_gross      1365 non-null int64
start_year           1365 non-null int64
tconst               1365 non-null object
original_title       1365 non-null object
runtime_minutes      1359 non-null float64
genres               1365 non-null object
dtypes: float64(1), int64(5), object(5)
memory usage: 128.0+ KB
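
A variant of the cleaning step that works on a single Series and does not modify its input could look like the sketch below. The dollars_to_int name is an assumption for illustration, and the sample values are copied from the Avatar row shown earlier plus a '$0' entry.

```python
import pandas as pd

def dollars_to_int(series: pd.Series) -> pd.Series:
    """Turn strings such as '$425,000,000' into 64-bit integers."""
    # Inside a character class, '$' is a literal dollar sign
    stripped = series.str.replace(r"[$,]", "", regex=True)
    return stripped.astype("int64")

amounts = pd.Series(["$425,000,000", "$760,507,625", "$0"])
cleaned = dollars_to_int(amounts)

print(cleaned)
```

Using int64 explicitly avoids overflow on platforms where the default integer is 32 bits, since several worldwide gross values exceed two billion.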


Next we create an advertisement column to account for another important cost of producing a movie. Because we do not have advertising data, we follow the heuristic that a movie's advertisement budget is generally equal to its production budget. With both costs in place, we create a profit column as the difference between the worldwide_gross column and the sum of the production_budget and advertisement_budget columns.

In [16]:
df_budget_merge['advertisement_budget'] = df_budget_merge['production_budget']
df_budget_merge['profit'] = df_budget_merge['worldwide_gross'] - df_budget_merge['production_budget'] - df_budget_merge['advertisement_budget']
df_budget_merge.head(20)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,start_year,tconst,original_title,runtime_minutes,genres,advertisement_budget,profit
0,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,2011,tt1298650,Pirates of the Caribbean: On Stranger Tides,136.0,"Action,Adventure,Fantasy",410600000,224463875
1,3,"Jun 7, 2019",Dark Phoenix,350000000,42762350,149762350,2019,tt6565702,Dark Phoenix,113.0,"Action,Adventure,Sci-Fi",350000000,-550237650
2,4,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963,2015,tt2395427,Avengers: Age of Ultron,141.0,"Action,Adventure,Sci-Fi",330600000,741813963
3,7,"Apr 27, 2018",Avengers: Infinity War,300000000,678815482,2048134200,2018,tt4154756,Avengers: Infinity War,149.0,"Action,Adventure,Sci-Fi",300000000,1448134200
4,9,"Nov 17, 2017",Justice League,300000000,229024295,655945209,2017,tt0974015,Justice League,120.0,"Action,Adventure,Fantasy",300000000,55945209
5,10,"Nov 6, 2015",Spectre,300000000,200074175,879620923,2015,tt2379713,Spectre,148.0,"Action,Adventure,Thriller",300000000,279620923
6,11,"Jul 20, 2012",The Dark Knight Rises,275000000,448139099,1084439099,2012,tt1345836,The Dark Knight Rises,164.0,"Action,Thriller",275000000,534439099
7,12,"May 25, 2018",Solo: A Star Wars Story,275000000,213767512,393151347,2018,tt3778644,Solo: A Star Wars Story,135.0,"Action,Adventure,Fantasy",275000000,-156848653
8,13,"Jul 2, 2013",The Lone Ranger,275000000,89302115,260002115,2013,tt1210819,The Lone Ranger,150.0,"Action,Adventure,Western",275000000,-289997885
9,14,"Mar 9, 2012",John Carter,275000000,73058679,282778100,2012,tt0401729,John Carter,132.0,"Action,Adventure,Sci-Fi",275000000,-267221900
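
As a worked check of the profit formula, the sketch below recomputes the first row of the output above. The add_profit helper and its ad_ratio parameter are hypothetical; ad_ratio = 1.0 reproduces the heuristic that advertising equals the production budget.

```python
import pandas as pd

def add_profit(df: pd.DataFrame, ad_ratio: float = 1.0) -> pd.DataFrame:
    """Return a copy with advertisement_budget and profit columns added."""
    out = df.copy()
    # Heuristic: advertising cost is ad_ratio times the production budget
    out["advertisement_budget"] = (out["production_budget"] * ad_ratio).astype("int64")
    out["profit"] = (
        out["worldwide_gross"]
        - out["production_budget"]
        - out["advertisement_budget"]
    )
    return out

# Figures taken from the first row of the output above
row = pd.DataFrame({
    "movie": ["Pirates of the Caribbean: On Stranger Tides"],
    "production_budget": [410600000],
    "worldwide_gross": [1045663875],
})

result = add_profit(row)
# 1,045,663,875 - 410,600,000 - 410,600,000 = 224,463,875
print(result["profit"].iloc[0])
```

Keeping the ratio as a parameter makes it easy to see how sensitive the profit ranking is to the advertising assumption.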


We decide to sort the dataset by the profit values, with greatest profit being at the top.

In [17]:
df_budget_merge.sort_values(by = ['profit'], axis = 0, ascending = False, inplace = True)

### Using Genres

Next we check which genre combinations appear in the rows and how often.

In [18]:
df_budget_merge.genres.value_counts()

Adventure,Animation,Comedy      67
Action,Adventure,Sci-Fi         51
Comedy                          48
Comedy,Drama,Romance            48
Drama                           43
                                ..
Action,Biography,Documentary     1
Drama,Family,Fantasy             1
Comedy,Fantasy,Sci-Fi            1
Adventure,Horror,Sci-Fi          1
Adventure,Mystery,Sci-Fi         1
Name: genres, Length: 214, dtype: int64

We decide to use only certain common genres instead of the 214 distinct genre combinations.
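
One way to see which individual genres are common, as opposed to which combinations are, is to split the comma-separated strings and count each genre on its own. This is a sketch on made-up rows; the notebook does not show this step.

```python
import pandas as pd

# Hypothetical rows using the same comma-separated format as the genres column
films = pd.DataFrame({
    "genres": ["Action,Adventure,Sci-Fi", "Comedy", "Comedy,Drama,Romance", "Drama"],
})

# Split each string into a list, give every genre its own row, then count
genre_counts = films["genres"].str.split(",").explode().value_counts()

print(genre_counts)
```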

In [19]:
# df_budget_merge['comedy_id'] = [1 if 'Comedy' in x
#                                 else 0 
#                                 for x in df_budget_merge['genres']]
# df_budget_merge['drama_id'] = [1 if 'Drama' in x
#                                 else 0 
#                                 for x in df_budget_merge['genres']]
# df_budget_merge['action_id'] = [1 if 'Action' in x
#                                 else 0 
#                                 for x in df_budget_merge['genres']]
# df_budget_merge.head()

We create a column to have a better understanding of the genres in each movie.

In [20]:
#df_budget_merge['genre_tuple'] = list(zip(df_budget_merge['comedy_id'], df_budget_merge['drama_id'], df_budget_merge['action_id']))

In [21]:
cl.indicator_str_parser(df_budget_merge, 'genres', ['Action', 'Adventure', 'Comedy', 'Drama', 'Family', 'Thriller', 'Documentary'])

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,start_year,tconst,original_title,runtime_minutes,...,profit,genres_not_parsed_id,genres_Action_id,genres_Adventure_id,genres_Comedy_id,genres_Drama_id,genres_Family_id,genres_Thriller_id,genres_Documentary_id,genre_tuple
3,7,"Apr 27, 2018",Avengers: Infinity War,300000000,678815482,2048134200,2018,tt4154756,Avengers: Infinity War,149,...,1448134200,0,1,1,0,0,0,0,0,"(1, 1, 0, 0, 0, 0, 0)"
23,34,"Jun 12, 2015",Jurassic World,215000000,652270625,1648854864,2015,tt0369610,Jurassic World,124,...,1218854864,0,1,1,0,0,0,0,0,"(1, 1, 0, 0, 0, 0, 0)"
46,67,"Apr 3, 2015",Furious 7,190000000,353007020,1518722794,2015,tt2820852,Furious Seven,137,...,1138722794,0,1,0,0,0,0,1,0,"(1, 0, 0, 0, 0, 1, 0)"
18,27,"May 4, 2012",The Avengers,225000000,623279547,1517935897,2012,tt0848228,The Avengers,143,...,1067935897,0,1,1,0,0,0,0,0,"(1, 1, 0, 0, 0, 0, 0)"
301,73,"Jul 10, 2015",Minions,74000000,336045770,1160336173,2015,tt2293640,Minions,91,...,1012336173,0,0,1,1,0,0,0,0,"(0, 1, 1, 0, 0, 0, 0)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199,42,"Jun 14, 2019",Men in Black: International,110000000,3100000,3100000,2019,tt2283336,Men in Black: International,115,...,-216900000,0,1,1,1,0,0,0,0,"(1, 1, 1, 0, 0, 0, 0)"
124,94,"Mar 11, 2011",Mars Needs Moms,150000000,21392758,39549758,2011,tt1305591,Mars Needs Moms,88,...,-260450242,0,0,1,0,0,1,0,0,"(0, 1, 0, 0, 1, 0, 0)"
9,14,"Mar 9, 2012",John Carter,275000000,73058679,282778100,2012,tt0401729,John Carter,132,...,-267221900,0,1,1,0,0,0,0,0,"(1, 1, 0, 0, 0, 0, 0)"
8,13,"Jul 2, 2013",The Lone Ranger,275000000,89302115,260002115,2013,tt1210819,The Lone Ranger,150,...,-289997885,0,1,1,0,0,0,0,0,"(1, 1, 0, 0, 0, 0, 0)"
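
The source of cl.indicator_str_parser is not shown in this notebook. The sketch below is only a guess at comparable behavior, inferred from the column names in the output above: one 0/1 column per requested genre plus a genre_tuple column. The add_genre_indicators name is hypothetical, the genres_not_parsed_id column is left out because its meaning is not clear from the output, and exact matching on the split genre names is an assumption (the real function may match substrings instead).

```python
import pandas as pd

def add_genre_indicators(df: pd.DataFrame, column: str, labels: list) -> pd.DataFrame:
    """Return a copy with one 0/1 indicator column per label and a tuple of all of them."""
    out = df.copy()
    # Turn 'Action,Adventure' into {'Action', 'Adventure'} for exact membership tests
    genre_sets = out[column].fillna("").str.split(",").apply(set)

    id_columns = []
    for label in labels:
        name = f"{column}_{label}_id"
        out[name] = genre_sets.apply(lambda s, label=label: int(label in s))
        id_columns.append(name)

    # Combine the indicators row by row into one tuple per movie
    tuples = list(zip(*(out[c] for c in id_columns)))
    out["genre_tuple"] = pd.Series(tuples, index=out.index)
    return out

# Hypothetical rows in the same format as the genres column
films = pd.DataFrame({
    "genres": ["Action,Adventure,Sci-Fi", "Adventure,Animation,Comedy", "Horror"],
})
tagged = add_genre_indicators(films, "genres", ["Action", "Adventure", "Comedy"])

print(tagged)
```

With exact matching, a label such as 'Music' would not also flag 'Musical', which matters for the leftover genres examined next.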


We next check to see which movies are not covered by any of our chosen genre categories.

In [22]:
df_budget_merge.loc[df_budget_merge['genre_tuple'] == (0, 0, 0, 0, 0, 0, 0)]

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,start_year,tconst,original_title,runtime_minutes,...,profit,genres_not_parsed_id,genres_Action_id,genres_Adventure_id,genres_Comedy_id,genres_Drama_id,genres_Family_id,genres_Thriller_id,genres_Documentary_id,genre_tuple
1303,65,"Oct 20, 2010",Paranormal Activity 2,3000000,84752907,177512032,2010,tt1536044,Paranormal Activity 2,91,...,171512032,0,0,0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0, 0)"
1423,12,"Jan 6, 2012",The Devil Inside,1000000,53262945,101759490,2012,tt1560985,The Devil Inside,83,...,99759490,0,0,0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0, 0)"
1061,61,"Oct 27, 2017",Jigsaw,10000000,38052832,102445196,2017,tt3348730,Jigsaw,92,...,82445196,0,0,0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0, 0)"
505,49,"Aug 12, 2011",Final Destination 5,40000000,42587643,155011165,2011,tt1622979,Final Destination 5,92,...,75011165,0,0,0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0, 0)"
1222,76,"Feb 27, 2015",The Lazarus Effect,5000000,25801570,38359310,2015,tt2918436,The Lazarus Effect,83,...,28359310,0,0,0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0, 0)"
1224,79,"Mar 9, 2018",The Strangers: Prey at Night,5000000,24431472,29960051,2018,tt1285009,The Strangers: Prey at Night,85,...,19960051,0,0,0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0, 0)"
1228,94,"Sep 2, 2011",Apollo 18,5000000,17686929,26517819,2011,tt1772240,Apollo 18,86,...,16517819,0,0,0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0, 0)"
508,55,"Apr 15, 2011",Scream 4,40000000,38180928,95989590,2011,tt1262416,Scream 4,111,...,15989590,0,0,0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0, 0)"
1426,29,"Jul 20, 2018",Unfriended: Dark Web,1000000,8866745,16434588,2018,tt4761916,Unfriended: Dark Web,92,...,14434588,0,0,0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0, 0)"
1232,4,"Jun 9, 2017",It Comes at Night,5000000,13985117,19720203,2017,tt4695012,It Comes at Night,91,...,9720203,0,0,0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0, 0)"


In [23]:
df_budget_merge.loc[df_budget_merge['genre_tuple'] == (0, 0, 0, 0, 0, 0, 0)].genres.value_counts()

Horror                    10
Crime,Horror,Mystery       4
Horror,Mystery,Sci-Fi      3
Horror,Mystery             2
Fantasy,Horror,Mystery     2
Horror,Sci-Fi              1
Music                      1
Musical                    1
Fantasy                    1
Name: genres, dtype: int64
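
The uncovered rows can also be found from the indicator columns directly, without comparing against a tuple. This is a sketch on made-up rows with only two indicator columns. In the notebook's frame the same name pattern would also pick up genres_not_parsed_id, which is all zeros in the describe() output, so the result should be unchanged.

```python
import pandas as pd

# Hypothetical rows with two of the indicator columns
films = pd.DataFrame({
    "genres": ["Horror", "Action,Thriller", "Music"],
    "genres_Action_id": [0, 1, 0],
    "genres_Thriller_id": [0, 1, 0],
})

# Select every indicator column by its naming pattern
id_columns = [c for c in films.columns if c.startswith("genres_") and c.endswith("_id")]

# A row is uncovered when none of its indicators is set
uncovered = films.loc[films[id_columns].sum(axis=1) == 0]
uncovered_counts = uncovered["genres"].value_counts()

print(uncovered_counts)
```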

In [24]:
df_budget_merge.describe()

Unnamed: 0,id,production_budget,domestic_gross,worldwide_gross,start_year,advertisement_budget,profit,genres_not_parsed_id,genres_Action_id,genres_Adventure_id,genres_Comedy_id,genres_Drama_id,genres_Family_id,genres_Thriller_id,genres_Documentary_id
count,1365.0,1365.0,1365.0,1365.0,1365.0,1365.0,1365.0,1365.0,1365.0,1365.0,1365.0,1365.0,1365.0,1365.0,1365.0
mean,50.795604,48116130.0,61174610.0,153326200.0,2013.866667,48116130.0,57093910.0,0.0,0.306227,0.249817,0.353114,0.486447,0.063004,0.169963,0.023443
std,28.458309,57202840.0,86846810.0,240176700.0,2.607475,57202840.0,167189200.0,0.0,0.461094,0.433066,0.478113,0.499999,0.243059,0.375738,0.151362
min,1.0,25000.0,0.0,26.0,2010.0,25000.0,-550237600.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,26.0,10000000.0,6810754.0,15362300.0,2012.0,10000000.0,-14056210.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,51.0,26000000.0,33078270.0,62599160.0,2014.0,26000000.0,5885836.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,75.0,60000000.0,74262030.0,173567600.0,2016.0,60000000.0,64041800.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
max,100.0,410600000.0,700059600.0,2048134000.0,2019.0,410600000.0,1448134000.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [25]:
df_budget_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1365 entries, 3 to 1
Data columns (total 22 columns):
id                       1365 non-null int64
release_date             1365 non-null object
movie                    1365 non-null object
production_budget        1365 non-null int64
domestic_gross           1365 non-null int64
worldwide_gross          1365 non-null int64
start_year               1365 non-null int64
tconst                   1365 non-null object
original_title           1365 non-null object
runtime_minutes          1359 non-null object
genres                   1365 non-null object
advertisement_budget     1365 non-null int64
profit                   1365 non-null int64
genres_not_parsed_id     1365 non-null int64
genres_Action_id         1365 non-null int64
genres_Adventure_id      1365 non-null int64
genres_Comedy_id         1365 non-null int64
genres_Drama_id          1365 non-null int64
genres_Family_id         1365 non-null int64
genres_Thriller_id       1365 non-n

In [29]:
df_budget_merge.genre_tuple.value_counts()

(0, 0, 0, 1, 0, 0, 0)    280
(0, 0, 1, 1, 0, 0, 0)    135
(0, 0, 1, 0, 0, 0, 0)    130
(1, 1, 0, 0, 0, 0, 0)    120
(0, 1, 1, 0, 0, 0, 0)     85
(0, 0, 0, 1, 0, 1, 0)     78
(1, 0, 0, 1, 0, 0, 0)     72
(0, 0, 0, 0, 0, 1, 0)     71
(1, 0, 1, 0, 0, 0, 0)     54
(1, 0, 0, 0, 0, 1, 0)     53
(1, 0, 0, 0, 0, 0, 0)     30
(0, 0, 0, 0, 0, 0, 1)     26
(1, 1, 0, 1, 0, 0, 0)     26
(0, 0, 0, 0, 0, 0, 0)     25
(1, 1, 1, 0, 0, 0, 0)     21
(0, 1, 0, 0, 1, 0, 0)     20
(0, 0, 1, 0, 1, 0, 0)     18
(0, 1, 0, 1, 0, 0, 0)     17
(0, 0, 0, 1, 1, 0, 0)     14
(1, 0, 0, 1, 0, 1, 0)     13
(1, 1, 0, 0, 0, 1, 0)     12
(0, 1, 1, 0, 1, 0, 0)     11
(0, 1, 1, 1, 0, 0, 0)     10
(0, 1, 0, 0, 0, 0, 0)      6
(1, 1, 0, 0, 1, 0, 0)      6
(0, 1, 0, 1, 1, 0, 0)      5
(0, 0, 1, 1, 1, 0, 0)      5
(1, 0, 1, 1, 0, 0, 0)      4
(0, 0, 1, 0, 0, 0, 1)      3
(0, 0, 1, 0, 0, 1, 0)      3
(0, 1, 0, 1, 0, 1, 0)      2
(0, 0, 0, 0, 1, 0, 0)      2
(1, 0, 1, 0, 1, 0, 0)      2
(1, 0, 0, 1, 1, 0, 0)      2
(1, 0, 0, 0, 0

In [26]:
df9= pd.read_csv('./data/zippedData/rt.movie_info.tsv', sep = '\t')
df9.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [27]:
df10= pd.read_csv('./data/zippedData/rt.reviews.tsv', sep = '\t', encoding = 'latin-1')
df10.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [28]:
df11= pd.read_csv('./data/zippedData/bom.movie_gross.csv')
df11.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
