# Data Analysis

In this interactive notebook, we analyze the data that we merged and cleaned in the [previous notebook](data-merging.ipynb). We aim to produce visualizations and summary statistics that help answer the questions outlined in the [README](README.md).

----

Let's start by importing the required libraries. Note that custom functions are stored in the file [`analytics_tools.py`](analytics_tools.py), which we will need to import.

In [18]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
sns.set_context('talk')

In [19]:
%matplotlib inline

Let us now import our cleaned and merged data from the pickle file stored at [cleaned_data/final_data_merged.pkl](./cleaned_data/final_data_merged.pkl).

In [20]:
final_data = pd.read_pickle('cleaned_data/final_data_merged.pkl')

In [21]:
final_data.head()

Unnamed: 0,imdb_id,primary_title,original_title,start_year,genres,directors,writers,averagerating,numvotes,release_date,production_budget,domestic_gross,worldwide_gross
20,tt0249516,Foodfight!,Foodfight!,2012,"[Action, Animation, Comedy]",[nm0440415],"[nm0440415, nm0923312, nm0295165, nm0841854, n...",1.9,8248.0,2012-12-31,45000000.0,0.0,73706.0
48,tt0337692,On the Road,On the Road,2012,"[Adventure, Drama, Romance]",[nm0758574],"[nm0449616, nm1433580]",6.1,37886.0,2013-03-22,25000000.0,720828.0,9313302.0
54,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,"[Adventure, Comedy, Drama]",[nm0001774],"[nm0175726, nm0862122]",7.3,275300.0,2013-12-25,91000000.0,58236838.0,187861200.0
58,tt0365907,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,"[Action, Crime, Drama]",[nm0291082],"[nm0088747, nm0291082]",6.5,105116.0,2014-09-19,28000000.0,26017685.0,62108590.0
60,tt0369610,Jurassic World,Jurassic World,2015,"[Action, Adventure, Sci-Fi]",[nm1119880],"[nm0415425, nm0798646, nm1119880, nm2081046, n...",7.0,539338.0,2015-06-12,215000000.0,652270625.0,1648855000.0


We can summarize the budget and revenue information as the return on investment (ROI), expressed as a percentage of the production budget. We store it in a new column for later use in plotting.

In [22]:
final_data['ROI'] = ((final_data['worldwide_gross']
                      - final_data['production_budget'])
                     / final_data['production_budget']) * 100
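
As a quick sanity check of the formula, here is a minimal sketch on a hypothetical two-film frame (the numbers are made up for illustration and are not rows from the dataset). A film that grosses 2.5 times its budget should come out at 150%, and a film that grosses nothing at -100%.

```python
import pandas as pd

# Hypothetical toy data, not rows from the merged dataset
toy = pd.DataFrame({
    'production_budget': [100.0, 50.0],
    'worldwide_gross': [250.0, 0.0],
})

# Same formula as above: profit as a percentage of the production budget
toy['ROI'] = ((toy['worldwide_gross'] - toy['production_budget'])
              / toy['production_budget']) * 100

print(toy['ROI'].tolist())
```

Note that a row with a production budget of zero would give an infinite or undefined ROI, so it may be worth checking for such rows before plotting.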

Next, we need to make good use of our genre descriptors. Each film currently carries a list of genre strings, and we would like to summarize this information by genre. Let's flatten the genre lists into dummy (0/1) variables, one column per genre. We have to do a bit of additional processing to make the input suitable for Pandas's `get_dummies` function.

In [23]:
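# Stack the genre lists into one row per (film, genre) pair, one-hot encode,
# then sum per film; groupby(level=0) replaces sum(level=0), which newer pandas removed
genre_dummies = pd.get_dummies(final_data.genres.apply(pd.Series).stack()).groupby(level=0).sum()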

In [24]:
genre_dummies.head()

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,Horror,Music,Musical,Mystery,Romance,Sci-Fi,Sport,Thriller,War,Western
20,1,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48,0,1,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
54,0,1,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
58,1,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
60,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [25]:
genre_dummies.columns

Index(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror',
       'Music', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller',
       'War', 'Western'],
      dtype='object')
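
To make the flattening step concrete, here is a minimal sketch on a hypothetical toy series of genre lists. It uses `Series.explode` instead of `apply(pd.Series).stack()`; this is a swapped-in alternative for illustration, not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical toy series of genre lists, indexed like the merged data
toy_genres = pd.Series(
    [['Action', 'Comedy'], ['Drama'], ['Action', 'Drama']],
    index=[20, 48, 54],
)

# explode() gives one row per (film, genre) pair, repeating the film's index;
# summing the dummies per index label collapses them back to one row per film
toy_dummies = (pd.get_dummies(toy_genres.explode(), dtype=int)
               .groupby(level=0)
               .sum())

print(toy_dummies)
```

Each film ends up with a 1 in every genre column it belongs to and a 0 elsewhere, which is the same shape as `genre_dummies`.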

Let's join our original merged data with our genre dummies, so that we can look at some summary information.

In [26]:
final_with_genre_dummies = final_data.join(genre_dummies, how='outer')

We might also want to take a look at a correlation table, to see if there are any variables that are obviously correlated, and that we can explore later on.

In [27]:
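# Restrict to numeric columns before computing pairwise correlations
correl_table = final_data.select_dtypes(include='number').corr()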
correl_table

Unnamed: 0,start_year,averagerating,numvotes,production_budget,domestic_gross,worldwide_gross,ROI
start_year,1.0,0.048592,-0.043684,0.102462,0.104921,0.115488,0.03186
averagerating,0.048592,1.0,0.464911,0.209422,0.287181,0.27093,0.026794
numvotes,-0.043684,0.464911,1.0,0.554988,0.653605,0.642382,0.089905
production_budget,0.102462,0.209422,0.554988,1.0,0.724452,0.792679,-0.020642
domestic_gross,0.104921,0.287181,0.653605,0.724452,1.0,0.947475,0.120453
worldwide_gross,0.115488,0.27093,0.642382,0.792679,0.947475,1.0,0.103602
ROI,0.03186,0.026794,0.089905,-0.020642,0.120453,0.103602,1.0


Let's do the same thing with the dataframe involving the genre dummies. In this case, we will do some subsetting so that we can get the 10 genres with the largest correlation to the return on investment.

In [28]:
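# Restrict to numeric columns (including the genre dummies) before correlating
correl_with_dummies = final_with_genre_dummies.select_dtypes(include='number').corr()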
correl_with_dummies.loc['Action':'Western', 'ROI'].nlargest(10)

Mystery      0.149433
Horror       0.143449
Thriller     0.085869
Romance      0.004900
Animation    0.002419
Sport       -0.001042
Music       -0.002574
Sci-Fi      -0.007390
Biography   -0.007580
Musical     -0.012171
Name: ROI, dtype: float64

In particular, Mystery and Horror films show the largest positive correlations with return on investment, although both correlations are weak (about 0.15 and 0.14). For a more visual description of the data, however, we should do a bit of plotting. Since we are going to plot within genres, we store the number of possible genres so that we know how many subplots we will need.

In [29]:
num_subplots = len(genre_dummies.columns)
num_subplots

21

Now, we can loop through the possible genres and draw, for each one, a scatter plot of the return on investment against the date of release.

In [30]:
# rows = 7
# cols = 3
# genre_subset_f, genre_subset_ax = plt.subplots(nrows=rows, ncols=cols, figsize=(40, 30))

# for i in range(num_subplots):
#     genre = genre_dummies.columns[i]
#     row = i // cols
#     col = i % cols
#     cur_ax = genre_subset_ax[row, col]
#     cur_subset = final_data[genre_dummies[genre] == 1]
#     sns.scatterplot(x='release_date', y='ROI', data=cur_subset, ax=cur_ax)
#     cur_ax.set_title(genre)
#     cur_ax.set_xlabel('Release Date')
#     cur_ax.set_ylabel('Return on Investment (%)')

# genre_subset_f.suptitle('Correlation of ROI to Release Date')
# genre_subset_f.tight_layout()
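
The plotting loops in this notebook place the 21 genres on a 7-by-3 grid using integer division and remainder. A minimal sketch of that index arithmetic, using `divmod` (equivalent to the `i // cols` and `i % cols` pair) and the genre names listed earlier:

```python
# Genre names as shown in genre_dummies.columns
genres = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
          'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror',
          'Music', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport',
          'Thriller', 'War', 'Western']

rows, cols = 7, 3

# Map each genre to its (row, col) position; the grid is filled row by row
positions = {genre: divmod(i, cols) for i, genre in enumerate(genres)}

# The grid must have at least as many cells as there are genres
assert rows * cols >= len(genres)

print(positions['Action'], positions['Biography'], positions['Western'])
```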

These plots help to illustrate a few facts. For one, there is no strong correlation between the date of release and the return on investment, which suggests that, within the given timeframe, the relative popularity of any given genre doesn't seem to be changing. The plots also show a few outliers that may be adversely affecting the analysis.
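
Because a few extreme ROI values can dominate a Pearson correlation, one option is to compare it with a rank-based (Spearman) correlation. The sketch below is an assumption about how one might check this, not something the notebook does; the data are hypothetical, and the Spearman coefficient is computed as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical data: a perfectly increasing relationship with one extreme value
x = pd.Series([1, 2, 3, 4, 5, 6], dtype=float)
y = pd.Series([1, 2, 3, 4, 5, 1000], dtype=float)

# Pearson correlation is pulled well below 1 by the single outlier
pearson = x.corr(y)

# Spearman correlation: Pearson correlation of the ranks, unaffected by
# how extreme the largest value is
spearman = x.rank().corr(y.rank())

print(round(pearson, 3), round(spearman, 3))
```

If the two coefficients differ substantially on the real data, that is a hint that outliers are driving the Pearson value.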

In [31]:
final_data['release_year'] = final_data.release_date.dt.year

In [32]:
final_data.groupby('release_year').describe()

release_year,start_year count,start_year mean,start_year std,start_year min,start_year 25%,start_year 50%,start_year 75%,start_year max,averagerating count,averagerating mean,...,worldwide_gross 75%,worldwide_gross max,ROI count,ROI mean,ROI std,ROI min,ROI 25%,ROI 50%,ROI 75%,ROI max
2010,175.0,2010.085714,0.65088,2010.0,2010.0,2010.0,2010.0,2017.0,175.0,6.194857,...,147451695.5,1068880000.0,175.0,218.498954,596.531301,-100.0,-15.394588,83.24375,232.421949,5817.067733
2011,201.0,2010.860697,0.361244,2010.0,2011.0,2011.0,2011.0,2012.0,201.0,6.29403,...,118729073.0,1123791000.0,201.0,190.738618,594.867565,-100.0,-53.183722,80.738155,214.043708,6558.059067
2012,197.0,2011.873096,0.856629,2010.0,2012.0,2012.0,2012.0,2019.0,197.0,6.216751,...,133085295.0,1517936000.0,197.0,271.163092,876.672379,-100.0,-49.066667,68.449583,293.62742,10075.949
2013,197.0,2012.730964,0.737969,2010.0,2013.0,2013.0,2013.0,2015.0,197.0,6.201523,...,127983283.0,1272470000.0,197.0,191.269182,422.672506,-100.0,-59.148083,100.703767,251.785447,2942.219367
2014,213.0,2013.713615,0.775602,2010.0,2014.0,2014.0,2014.0,2017.0,213.0,6.355869,...,111946251.0,1104039000.0,213.0,177.674962,446.176564,-100.0,-94.24629,54.872124,273.931388,3851.737231
2015,255.0,2014.313725,1.28115,2010.0,2014.0,2015.0,2015.0,2017.0,255.0,5.916863,...,67973090.0,1648855000.0,255.0,299.2688,2646.385969,-100.0,-100.0,-22.189083,188.249074,41556.474
2016,179.0,2015.530726,1.172199,2010.0,2016.0,2016.0,2016.0,2017.0,179.0,6.310615,...,145075382.5,1140069000.0,179.0,244.37797,541.392791,-100.0,-41.23035,95.91513,293.698348,4249.7008
2017,132.0,2016.348485,1.620408,2010.0,2016.0,2017.0,2017.0,2019.0,132.0,6.337121,...,242732923.0,1259200000.0,132.0,366.966122,767.769123,-100.0,14.954915,155.018377,385.728292,5479.29612
2018,119.0,2017.453782,1.640318,2010.0,2018.0,2018.0,2018.0,2019.0,119.0,6.311765,...,180655404.0,2048134000.0,119.0,305.297444,470.916246,-100.0,5.266511,147.34173,509.658293,2617.924114
2019,42.0,2017.333333,2.9022,2010.0,2017.25,2019.0,2019.0,2019.0,42.0,6.32381,...,118055757.5,1123062000.0,42.0,150.715082,347.558441,-100.0,-90.802054,6.804755,237.636109,1344.091235
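
The full `describe()` output is wide, and the ROI mean is sensitive to extreme values. A narrower per-year summary that puts the mean next to the median may be easier to read. The following is a minimal sketch on hypothetical data; only the column names `release_year` and `ROI` are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data, not rows from the merged dataset
toy = pd.DataFrame({
    'release_year': [2010, 2010, 2010, 2011, 2011],
    'ROI': [-100.0, 50.0, 500.0, 20.0, 40.0],
})

# Count, mean, and median ROI per release year
summary = toy.groupby('release_year')['ROI'].agg(['count', 'mean', 'median'])

print(summary)
```

In the toy data the 2010 mean (150) sits far above the median (50) because of a single large value, which is the same pattern visible in the 2015 row of the table above.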


Let's now consider a further visualization of the return on investment over time, broken up by genre. We may use Pandas's `Grouper` object to group releases by business quarter. From here, we can plot the mean return on investment as it develops across these quarters, separately for each genre.

In [33]:
# rows = 7
# cols = 3
# genre_subset_f2, genre_subset_ax2 = plt.subplots(nrows=rows, ncols=cols, figsize=(40, 30))

# for i in range(num_subplots):
#     genre = genre_dummies.columns[i]
#     row = i // cols
#     col = i % cols
#     cur_ax = genre_subset_ax2[row, col]
#     cur_subset = final_data[genre_dummies[genre] == 1]
#     grouped = cur_subset.groupby(pd.Grouper(key='release_date', freq='Q')).describe()['ROI'].reset_index()
#     grouped = grouped.dropna(subset=['mean'])  # dropna returns a new frame
#     sns.lineplot(x='release_date', y='mean', data=grouped, ax=cur_ax)
#     cur_ax.set_title(genre)
#     cur_ax.set_xlabel('Release Date')
#     cur_ax.set_ylabel('Mean ROI (%)')

# genre_subset_f2.suptitle('Mean ROI for Various Genres of Movie over Time')
# genre_subset_f2.tight_layout()
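
To illustrate the quarterly grouping on its own, here is a minimal sketch on hypothetical data. It groups on `release_date.dt.to_period('Q')` rather than `pd.Grouper`; this is a swapped-in alternative chosen for illustration, and unlike `Grouper` it does not create empty rows for quarters with no releases.

```python
import pandas as pd

# Hypothetical toy data, not rows from the merged dataset
toy = pd.DataFrame({
    'release_date': pd.to_datetime(['2015-01-15', '2015-02-20', '2015-07-10']),
    'ROI': [100.0, 300.0, 50.0],
})

# Convert each release date to its calendar quarter, then average ROI per quarter
quarter = toy['release_date'].dt.to_period('Q')
quarterly_mean = toy.groupby(quarter)['ROI'].mean()

print(quarterly_mean)
```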

This data reveals that many genres have suffered from long periods where the return on investment is not particularly high. While there are not really any trends indicating that one type of movie is becoming more popular than another, we can conclude from this plot that certain genres offer a higher probability of being successful in terms of the return on investment.

Similarly, we may want to check for trends in the average ratings over time, again broken up by genre. We can group quarterly as before and instead plot the mean of the average rating in each quarter.

In [34]:
# rows = 7
# cols = 3
# genre_subset_f3, genre_subset_ax3 = plt.subplots(nrows=rows, ncols=cols, figsize=(40, 30))

# for i in range(num_subplots):
#     genre = genre_dummies.columns[i]
#     row = i // cols
#     col = i % cols
#     cur_ax = genre_subset_ax3[row, col]
#     cur_subset = final_data[genre_dummies[genre] == 1]
#     grouped = cur_subset.groupby(pd.Grouper(key='release_date', freq='Q')).describe()['averagerating'].reset_index()
#     grouped = grouped.dropna(subset=['mean'])  # dropna returns a new frame
#     sns.lineplot(x='release_date', y='mean', data=grouped, ax=cur_ax)
#     cur_ax.set_title(genre)
#     cur_ax.set_xlabel('Release Date')
#     cur_ax.set_ylabel('Mean Average Rating')

# genre_subset_f3.suptitle('Mean Average Rating for Various Genres of Movie over Time')
# genre_subset_f3.tight_layout()
# plt.savefig('Plots/Mean ROI by genre')

Let's also take a look at some histograms of the average rating data.

In [35]:
# rows = 7
# cols = 3
# genre_subset_f4, genre_subset_ax4 = plt.subplots(nrows=rows, ncols=cols, figsize=(40, 30))

# for i in range(num_subplots):
#     genre = genre_dummies.columns[i]
#     row = i // cols
#     col = i % cols
#     cur_ax = genre_subset_ax4[row, col]
#     cur_subset = final_data[genre_dummies[genre] == 1]
#     sns.histplot(cur_subset.averagerating, kde=False, bins=10, ax=cur_ax)  # distplot is deprecated
#     cur_ax.set_xlim(0,10)
#     cur_ax.set_title(genre)
#     cur_ax.set_xlabel('Average Rating')
#     cur_ax.set_ylabel('Number of Movies')

# genre_subset_f4.suptitle('Genre Subsetted Histograms of Average Ratings')
# genre_subset_f4.tight_layout()
# plt.savefig('Plots/Genre Histogram')

We might also want to consider the full aggregation of the return on investment and average ratings by genre.

In [36]:
# genre_subset_f5, genre_subset_ax5 = plt.subplots(nrows=1, ncols=1, figsize=(15, 10))
# mean_rois = []

# for genre in genre_dummies.columns:
#     cur_subset = final_data[genre_dummies[genre] == 1]
#     mean_rois.append(cur_subset.ROI.mean())

# sns.barplot(x=genre_dummies.columns, y=mean_rois)
# genre_subset_ax5.set_title('Average Return on Investment By Genre (Full Timescale)')
# genre_subset_ax5.set_xlabel('Genre of Movie')
# genre_subset_ax5.set_ylabel('Return on Investment (%)')
# genre_subset_ax5.tick_params(axis='x', labelrotation=50.0)
# genre_subset_f5.tight_layout()
# plt.savefig('Plots/BarChart')

In [37]:
# genre_subset_f6, genre_subset_ax6 = plt.subplots(nrows=1, ncols=1, figsize=(15, 10))
# mean_ratings = []

# for genre in genre_dummies.columns:
#     cur_subset = final_data[genre_dummies[genre] == 1]
#     mean_ratings.append(cur_subset.averagerating.mean())

# sns.barplot(x=genre_dummies.columns, y=mean_ratings)
# genre_subset_ax6.set_title('Average Ratings By Genre (Full Timescale)')
# genre_subset_ax6.set_xlabel('Genre of Movie')
# genre_subset_ax6.set_ylabel('Average User Rating')
# genre_subset_ax6.tick_params(axis='x', labelrotation=50.0)
# genre_subset_f6.tight_layout()
# plt.savefig('Plots/Average Rating By Genre')
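
The per-genre means in the two bar charts above are computed with an explicit loop. As a rough sketch of an equivalent vectorized calculation (an alternative for illustration, shown on hypothetical data), the 0/1 dummies can be used as weights:

```python
import pandas as pd

# Hypothetical toy data: three films, two genre dummy columns
roi = pd.Series([100.0, 300.0, -50.0])
dummies = pd.DataFrame({
    'Action': [1, 0, 1],
    'Drama': [0, 1, 1],
})

# Sum of ROI over the films in each genre, divided by the number of films
# in that genre. This assumes ROI has no missing values; with NaNs the
# loop-based mean and this ratio would differ.
mean_roi_by_genre = dummies.mul(roi, axis=0).sum() / dummies.sum()

print(mean_roi_by_genre)
```

A film with several genres contributes to the mean of each of them, just as it does in the loop.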

Let's finally consider the correlation between average rating and return on investment broken down within genres.

In [38]:
# rows = 7
# cols = 3
# genre_subset_f7, genre_subset_ax7 = plt.subplots(nrows=rows, ncols=cols, figsize=(40, 30))

# for i in range(num_subplots):
#     genre = genre_dummies.columns[i]
#     row = i // cols
#     col = i % cols
#     cur_ax = genre_subset_ax7[row, col]
#     cur_subset = final_data[genre_dummies[genre] == 1]
#     sns.scatterplot(x='averagerating', y='ROI', data=cur_subset, ax=cur_ax)
#     cur_ax.set_title(genre)
#     cur_ax.set_xlabel('Average User Rating')
#     cur_ax.set_ylabel('Return on Investment (%)')

# genre_subset_f7.suptitle('Correlation of ROI to Average User Rating By Genre')
# genre_subset_f7.tight_layout()

In [39]:
# rows = 7
# cols = 3
# genre_subset_f8, genre_subset_ax8 = plt.subplots(nrows=rows, ncols=cols, figsize=(40, 30))

# for i in range(num_subplots):
#     genre = genre_dummies.columns[i]
#     row = i // cols
#     col = i % cols
#     cur_ax = genre_subset_ax8[row, col]
#     cur_subset = final_data[genre_dummies[genre] == 1]
#     sns.kdeplot(x=cur_subset.averagerating, y=cur_subset.ROI, ax=cur_ax, fill=True)  # fill replaces the deprecated shade
#     cur_ax.set_title(genre)
#     cur_ax.set_xlabel('Average User Rating')
#     cur_ax.set_ylabel('Return on Investment (%)')

# genre_subset_f8.suptitle('KDE of ROI to Average User Rating By Genre')
# genre_subset_f8.tight_layout()
# plt.savefig('Plots/KDE of ROI by user ratings')

In [198]:
action = final_with_genre_dummies[final_with_genre_dummies.Action == 1]
# Drop every genre dummy column except 'Action'
action_drop = action.drop(columns=genre_dummies.columns.drop('Action'))

# Sort by ROI (highest first) and keep the ten highest-ROI Action films
action_drop = action_drop.sort_values('ROI', ascending=False)
action_top10 = action_drop.head(10)

In [149]:
horror = final_with_genre_dummies[final_with_genre_dummies.Horror == 1]

In [155]:
horror.columns

Index(['imdb_id', 'primary_title', 'original_title', 'start_year', 'genres',
       'directors', 'writers', 'averagerating', 'numvotes', 'release_date',
       'production_budget', 'domestic_gross', 'worldwide_gross', 'ROI',
       'Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror',
       'Music', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller',
       'War', 'Western'],
      dtype='object')

In [162]:
# Drop every genre dummy column except 'Horror'
horror_drop = horror.drop(columns=genre_dummies.columns.drop('Horror'))

In [209]:
horror_drop = horror_drop.sort_values('ROI', ascending=False)
horror_top10 = horror_drop.head(10)

We also tried using a pivot table on the `final_with_genre_dummies` dataframe:

In [104]:
# table = pd.pivot_table(final_with_genre_dummies, 
#                        index=['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
#                               'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror',
#                               'Music', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller',
#                               'War', 'Western'],
#                        values=['ROI','worldwide_gross'])

In [200]:
mystery = final_with_genre_dummies[final_with_genre_dummies.Mystery == 1]
# Drop every genre dummy column except 'Mystery'
mystery_drop = mystery.drop(columns=genre_dummies.columns.drop('Mystery'))
# Sort by ROI (highest first) and keep the ten highest-ROI Mystery films
mystery_drop = mystery_drop.sort_values('ROI', ascending=False)
mystery_top10 = mystery_drop.head(10)

In [199]:
thriller = final_with_genre_dummies[final_with_genre_dummies.Thriller == 1]
# Drop every genre dummy column except 'Thriller'
thriller_drop = thriller.drop(columns=genre_dummies.columns.drop('Thriller'))
# Sort by ROI (highest first) and keep the ten highest-ROI Thriller films
thriller_drop = thriller_drop.sort_values('ROI', ascending=False)
thriller_top10 = thriller_drop.head(10)

We have now identified the top 10 movies by ROI in the Action, Horror, Mystery, and Thriller genres.
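
The four genre cells above repeat the same three steps: filter on one dummy column, drop the other dummy columns, and keep the highest-ROI rows. A minimal sketch of a helper that captures this pattern follows; the function name `top_n_by_roi` and the toy data are hypothetical and are not part of the notebook or of `analytics_tools.py`.

```python
import pandas as pd

def top_n_by_roi(frame, genre, genre_columns, n=10):
    """Return the n highest-ROI rows flagged with `genre`, without the other genre dummies."""
    other_genres = [g for g in genre_columns if g != genre]
    subset = frame[frame[genre] == 1].drop(columns=other_genres)
    return subset.sort_values('ROI', ascending=False).head(n)

# Hypothetical toy data, not rows from the merged dataset
toy = pd.DataFrame({
    'primary_title': ['A', 'B', 'C', 'D'],
    'ROI': [50.0, 400.0, -20.0, 120.0],
    'Action': [1, 0, 1, 1],
    'Horror': [0, 1, 1, 0],
})

top_action = top_n_by_roi(toy, 'Action', ['Action', 'Horror'], n=2)
print(top_action)
```

With the real data, a call such as `top_n_by_roi(final_with_genre_dummies, 'Horror', genre_dummies.columns)` would be expected to play the role of `horror_top10`.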

In [202]:
# thriller_top10 
# mystery_top10
# horror_top10
# action_top10 

In [250]:
action_top10

Unnamed: 0,imdb_id,primary_title,original_title,start_year,genres,directors,writers,averagerating,numvotes,release_date,production_budget,domestic_gross,worldwide_gross,ROI,Action
87402,tt5074352,Dangal,Dangal,2016,"[Action, Biography, Drama]",[nm4318159],"[nm6328029, nm6328031, nm6328030, nm8661566, n...",8.5,123638.0,2016-12-21,9500000.0,12391761.0,294654618.0,3001.627558,1
20344,tt1853739,You're Next,You're Next,2011,"[Action, Comedy, Horror]",[nm1417392],[nm1440023],6.6,79451.0,2013-08-23,1000000.0,18494006.0,26887177.0,2588.7177,1
79713,tt4573516,Sleight,Sleight,2016,"[Action, Drama, Sci-Fi]",[nm2300570],"[nm2300570, nm2242713]",5.9,7074.0,2017-04-28,250000.0,3930990.0,3934450.0,1473.78,1
129179,tt7961060,Dragon Ball Super: Broly,Doragon bôru chô: Burorî,2018,"[Action, Adventure, Animation]",[nm0619110],[nm0868066],8.0,16465.0,2019-01-16,8500000.0,30376755.0,122747755.0,1344.091235,1
7543,tt1431045,Deadpool,Deadpool,2016,"[Action, Adventure, Comedy]",[nm1783265],"[nm1014201, nm1116660]",8.0,820847.0,2016-02-12,58000000.0,363070709.0,801025593.0,1281.078609,1
50144,tt2975578,The Purge: Anarchy,The Purge: Anarchy,2014,"[Action, Horror, Sci-Fi]",[nm0218621],[nm0218621],6.5,126203.0,2014-07-18,9000000.0,71562550.0,111534881.0,1139.276456,1
71729,tt4094724,The Purge: Election Year,The Purge: Election Year,2016,"[Action, Horror, Sci-Fi]",[nm0218621],[nm0218621],6.0,80254.0,2016-07-01,10000000.0,79042440.0,118514727.0,1085.14727,1
48320,tt2872732,Lucy,Lucy,2014,"[Action, Sci-Fi, Thriller]",[nm0000108],[nm0000108],6.4,403194.0,2014-07-25,40000000.0,126573960.0,457507776.0,1043.76944,1
34723,tt2283362,Jumanji: Welcome to the Jungle,Jumanji: Welcome to the Jungle,2017,"[Action, Adventure, Comedy]",[nm0440458],"[nm0571344, nm1273099, nm0003298, nm0684374, n...",7.0,242735.0,2017-12-20,90000000.0,404508916.0,964496193.0,971.662437,1
104877,tt6133466,The First Purge,The First Purge,2018,"[Action, Horror, Sci-Fi]",[nm2618764],[nm0218621],5.1,41741.0,2018-07-04,13000000.0,69488745.0,136617305.0,950.902346,1


In [247]:
thriller_top10

Unnamed: 0,imdb_id,primary_title,original_title,start_year,genres,directors,writers,averagerating,numvotes,release_date,production_budget,domestic_gross,worldwide_gross,ROI,Thriller
35625,tt2309260,The Gallows,The Gallows,2015,"[Horror, Mystery, Thriller]","[nm4000389, nm3951039]","[nm3951039, nm4000389]",4.2,17763.0,2015-07-10,100000.0,22764410.0,41656474.0,41556.474,1
10236,tt1591095,Insidious,Insidious,2010,"[Horror, Mystery, Thriller]",[nm1490123],[nm1191481],6.9,254197.0,2011-04-01,1500000.0,54009150.0,99870886.0,6558.059067,1
64866,tt3713166,Unfriended,Unfriended,2014,"[Horror, Mystery, Thriller]",[nm0300174],[nm4532532],5.6,62043.0,2015-04-17,1000000.0,32789645.0,64364198.0,6336.4198,1
87039,tt5052448,Get Out,Get Out,2017,"[Horror, Mystery, Thriller]",[nm1443502],[nm1443502],7.7,400474.0,2017-02-24,5000000.0,176040665.0,255367951.0,5007.35902,1
24683,tt1991245,Chernobyl Diaries,Chernobyl Diaries,2012,"[Horror, Mystery, Thriller]",[nm0662086],"[nm2305431, nm1139317, nm0886749]",5.0,60304.0,2012-05-25,1000000.0,18119640.0,42411721.0,4141.1721,1
17175,tt1778304,Paranormal Activity 3,Paranormal Activity 3,2011,"[Horror, Mystery, Thriller]","[nm1413364, nm1160962]","[nm0484907, nm2305431]",5.8,85689.0,2011-10-21,5000000.0,104028807.0,207039844.0,4040.79688,1
56930,tt3322940,Annabelle,Annabelle,2014,"[Horror, Mystery, Thriller]",[nm0502954],[nm2477891],5.4,122039.0,2014-10-03,6500000.0,84273813.0,256862920.0,3851.737231,1
6605,tt1320244,The Last Exorcism,The Last Exorcism,2010,"[Drama, Horror, Thriller]",[nm0821844],"[nm1428204, nm0348640]",5.6,45815.0,2010-08-27,1800000.0,41034350.0,70165900.0,3798.105556,1
31244,tt2184339,The Purge,The Purge,2013,"[Horror, Thriller]",[nm0218621],[nm0218621],5.7,183549.0,2013-06-07,3000000.0,64473115.0,91266581.0,2942.219367,1
22460,tt1922777,Sinister,Sinister,2012,"[Horror, Mystery, Thriller]",[nm0220600],"[nm0220600, nm1803036]",6.8,198345.0,2012-10-12,3000000.0,48086903.0,87727807.0,2824.260233,1


In [248]:
horror_top10.corr()

Unnamed: 0,start_year,averagerating,numvotes,production_budget,domestic_gross,worldwide_gross,ROI,Horror
start_year,1.0,0.121566,0.370757,0.206001,0.370331,0.261709,0.350745,
averagerating,0.121566,1.0,0.907214,0.475555,0.710014,0.52653,-0.499435,
numvotes,0.370757,0.907214,1.0,0.461276,0.773212,0.572829,-0.304579,
production_budget,0.206001,0.475555,0.461276,1.0,0.777319,0.955569,-0.472738,
domestic_gross,0.370331,0.710014,0.773212,0.777319,1.0,0.886869,-0.347999,
worldwide_gross,0.261709,0.52653,0.572829,0.955569,0.886869,1.0,-0.408331,
ROI,0.350745,-0.499435,-0.304579,-0.472738,-0.347999,-0.408331,1.0,
Horror,,,,,,,,


In [246]:
mystery_top10.corr()

Unnamed: 0,start_year,averagerating,numvotes,production_budget,domestic_gross,worldwide_gross,ROI,Mystery
start_year,1.0,0.169812,0.149192,0.223361,0.332716,0.230621,0.207764,
averagerating,0.169812,1.0,0.871969,0.275771,0.622485,0.386298,-0.49288,
numvotes,0.149192,0.871969,1.0,0.249444,0.745721,0.491319,-0.310274,
production_budget,0.223361,0.275771,0.249444,1.0,0.653215,0.879768,-0.554221,
domestic_gross,0.332716,0.622485,0.745721,0.653215,1.0,0.874407,-0.305417,
worldwide_gross,0.230621,0.386298,0.491319,0.879768,0.874407,1.0,-0.399196,
ROI,0.207764,-0.49288,-0.310274,-0.554221,-0.305417,-0.399196,1.0,
Mystery,,,,,,,,


In [234]:
action_top10.production_budget.mean()

23925000.0

In [235]:
horror_top10.production_budget.mean()

2590000.0

In [237]:
action_top10.production_budget.mean()/horror_top10.production_budget.mean()

9.237451737451737
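
The Action figure can be checked by hand from the `action_top10` table shown earlier. The sketch below recomputes the mean from the ten production budgets listed there; the Horror mean is taken as reported above, since the individual `horror_top10` rows are not displayed.

```python
# Production budgets of the ten films in the action_top10 table, in order
action_budgets = [9_500_000, 1_000_000, 250_000, 8_500_000, 58_000_000,
                  9_000_000, 10_000_000, 40_000_000, 90_000_000, 13_000_000]

action_mean = sum(action_budgets) / len(action_budgets)

# Mean reported above for horror_top10 (taken as given, not recomputed)
horror_mean = 2_590_000.0

# How many times larger the mean Action budget is than the mean Horror budget
ratio = action_mean / horror_mean

print(action_mean, ratio)
```

In other words, the ten highest-ROI Action films cost on average roughly nine times as much to produce as the ten highest-ROI Horror films.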