## Initial Data Analysis

Computing Vision (a made-up company for the purposes of this project) sees all the big companies creating original video content and wants to get in on the fun. The company has decided to create a new movie studio but has little background in making movies. You are charged with exploring which types of films are currently performing best at the box office, using different samples of available data. You will then translate those findings into actionable insights that the head of Computing Vision's new movie studio can use to decide what types of films to create.

In simpler terms, we want to analyze data collected from platforms such as IMDb, Box Office Mojo, and Rotten Tomatoes to determine which genres perform best at the box office. This will show Computing Vision how to position itself to do well.

In [31]:
import pandas as pd
import numpy as np

boxoffice = pd.read_csv('zippedData/bom.movie_gross.csv.gz')  # not using this
reviews = pd.read_table('zippedData/rt.reviews.tsv.gz', encoding='windows-1252')  # not using this
movie_info = pd.read_table('zippedData/rt.movie_info.tsv.gz')
movies = pd.read_csv('zippedData/tmdb.movies.csv.gz')
movie_budgets = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')  # using this instead of boxoffice

#### Initial reaction to this dataset:

We have the domestic gross value, which I am assuming is the gross revenue earned in the US.

I am not exactly sure what `studio` means (I'm assuming WB is Warner Bros.), but it might be key information for determining which film genres do best at the box office.

In [3]:
movie_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [5]:
movie_info['box_office'].isna().sum()

1220

In [6]:
movie_info.shape

(1560, 12)

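This dataset does not contain film names, which makes it hard to join to the other tables by title. In addition, `box_office` is missing for 1,220 of its 1,560 rows (about 78%).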
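To put a missing-value count like the one above in context, it can help to look at the share of missing values in every column at once. The following is a minimal sketch on a small stand-in frame; `toy_info` and its rows are hypothetical and are not read from `rt.movie_info.tsv.gz`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_info, with only a few columns.
toy_info = pd.DataFrame({
    "id": [1, 3, 5, 6],
    "box_office": [np.nan, "600,000", np.nan, np.nan],
    "runtime": ["104 minutes", "108 minutes", "116 minutes", "128 minutes"],
})

# isna() marks missing cells; the mean of those booleans is the missing share per column.
missing_share = toy_info.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real `movie_info`, the same expression would list `box_office` alongside any other sparsely populated columns, which is useful when deciding which table to rely on.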

In [19]:
movies.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [12]:
movies.sort_values('popularity', ascending=False)

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
23811,23811,"[12, 28, 14]",299536,en,Avengers: Infinity War,80.773,2018-04-27,Avengers: Infinity War,8.3,13948
11019,11019,"[28, 53]",245891,en,John Wick,78.123,2014-10-24,John Wick,7.2,10081
23812,23812,"[28, 12, 16, 878, 35]",324857,en,Spider-Man: Into the Spider-Verse,60.534,2018-12-14,Spider-Man: Into the Spider-Verse,8.4,4048
11020,11020,"[28, 12, 14]",122917,en,The Hobbit: The Battle of the Five Armies,53.783,2014-12-17,The Hobbit: The Battle of the Five Armies,7.3,8392
5179,5179,"[878, 28, 12]",24428,en,The Avengers,50.289,2012-05-04,The Avengers,7.6,19673
...,...,...,...,...,...,...,...,...,...,...
13877,13877,[10749],401741,en,Crème Caramel,0.600,2014-05-20,Crème Caramel,5.0,1
13878,13878,[878],401427,en,Elegy,0.600,2014-09-10,Elegy,5.0,1
13879,13879,[35],399054,en,Jaguar,0.600,2014-09-21,Jaguar,5.0,1
13880,13880,[],381154,en,Unleashed! A Dog Dancing Story,0.600,2014-02-13,Unleashed! A Dog Dancing Story,5.0,1


Note: The `id` values in `movies` do not correspond to the `id` values in `reviews`.

In [36]:
movie_budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,425000000,760507625,2776345279
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,"Jun 7, 2019",Dark Phoenix,350000000,42762350,149762350
3,4,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747


In [35]:
# Convert string dollar values (e.g. '$425,000,000') into integers for calculations.
# Only non-numeric columns are converted: calling .str on an already-numeric column raises
# "AttributeError: Can only use .str accessor with string values!", for example when
# this cell is re-run after the conversion has already been applied.
for col in ['production_budget', 'domestic_gross', 'worldwide_gross']:
    if not pd.api.types.is_numeric_dtype(movie_budgets[col]):
        movie_budgets[col] = (movie_budgets[col]
                              .str.replace(r'[$,]', '', regex=True)
                              .astype(np.int64))

In [37]:
merge1 = pd.merge(movies, movie_budgets, left_on='original_title', right_on='movie')
merge1.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id_x,original_language,original_title,popularity,release_date_x,title,vote_average,vote_count,id_y,release_date_y,movie,production_budget,domestic_gross,worldwide_gross
0,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,30,"Mar 26, 2010",How to Train Your Dragon,165000000,217581232,494870992
1,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368,15,"May 7, 2010",Iron Man 2,170000000,312433331,621156389
2,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174,37,"Nov 22, 1995",Toy Story,30000000,191796233,364545516
3,2473,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174,37,"Nov 22, 1995",Toy Story,30000000,191796233,364545516
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186,38,"Jul 16, 2010",Inception,160000000,292576195,835524642
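The output above lists Toy Story twice because the TMDB table contains the same film on two rows, and a title-only join can also pair unrelated films that happen to share a name. One possible way to tighten the match, sketched here under assumptions, is to drop repeated TMDB ids and join on title plus release year. The `toy_movies` and `toy_budgets` frames are hypothetical stand-ins whose values are copied from the notebook's tables, and the date format string assumes the budget dates always look like `Nov 22, 1995`.

```python
import pandas as pd

# Hypothetical stand-ins for the TMDB and budget tables.
toy_movies = pd.DataFrame({
    "id": [862, 862, 27205, 509316],
    "original_title": ["Toy Story", "Toy Story", "Inception", "The Box"],
    "release_date": ["1995-11-22", "1995-11-22", "2010-07-16", "2018-03-04"],
})
toy_budgets = pd.DataFrame({
    "movie": ["Toy Story", "Inception", "The Box"],
    "release_date": ["Nov 22, 1995", "Jul 16, 2010", "Nov 6, 2009"],
    "production_budget": [30000000, 160000000, 25000000],
})

# 1. Keep one row per TMDB id so each film appears once.
toy_movies = toy_movies.drop_duplicates(subset="id").copy()

# 2. Derive a release year on both sides.
toy_movies["year"] = pd.to_datetime(toy_movies["release_date"]).dt.year
toy_budgets["year"] = pd.to_datetime(
    toy_budgets["release_date"], format="%b %d, %Y"
).dt.year

# 3. Join on title and year; same-named films from different years no longer match.
merged = toy_movies.merge(
    toy_budgets.rename(columns={"movie": "original_title"}),
    on=["original_title", "year"],
    suffixes=("_tmdb", "_tn"),
)
print(merged[["original_title", "year", "production_budget"]])
```

In this sketch the 2018 TMDB entry for The Box is no longer paired with the 2009 budget row. The trade-off is that films whose release year differs between the two sources would be dropped, so the row count before and after is worth checking.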


In [38]:
merge1['domestic_profit'] = merge1['domestic_gross'] - merge1['production_budget']
merge1['worldwide_profit'] = merge1['worldwide_gross'] - merge1['production_budget']

In [39]:
merge1 #includes popularity, movie name and budgets

Unnamed: 0.1,Unnamed: 0,genre_ids,id_x,original_language,original_title,popularity,release_date_x,title,vote_average,vote_count,id_y,release_date_y,movie,production_budget,domestic_gross,worldwide_gross,domestic_profit,worldwide_profit
0,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,30,"Mar 26, 2010",How to Train Your Dragon,165000000,217581232,494870992,52581232,329870992
1,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368,15,"May 7, 2010",Iron Man 2,170000000,312433331,621156389,142433331,451156389
2,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174,37,"Nov 22, 1995",Toy Story,30000000,191796233,364545516,161796233,334545516
3,2473,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174,37,"Nov 22, 1995",Toy Story,30000000,191796233,364545516,161796233,334545516
4,4,"[28, 878, 12]",27205,en,Inception,27.920,2010-07-16,Inception,8.3,22186,38,"Jul 16, 2010",Inception,160000000,292576195,835524642,132576195,675524642
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2311,26323,[],509316,en,The Box,0.600,2018-03-04,The Box,8.0,1,66,"Nov 6, 2009",The Box,25000000,15051977,34356760,-9948023,9356760
2312,26425,[10402],509306,en,The Box,0.600,2018-03-04,The Box,6.0,1,66,"Nov 6, 2009",The Box,25000000,15051977,34356760,-9948023,9356760
2313,26092,"[35, 16]",546674,en,Enough,0.719,2018-03-22,Enough,8.7,3,68,"May 24, 2002",Enough,38000000,39177215,50970660,1177215,12970660
2314,26322,[],513161,en,Undiscovered,0.600,2018-04-07,Undiscovered,8.0,1,7,"Aug 26, 2005",Undiscovered,9000000,1069318,1069318,-7930682,-7930682
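Since the goal is to compare genres, one possible next step is to split `genre_ids` into one row per genre and summarize profit for each. The sketch below makes several assumptions: `genre_ids` is stored as a string such as `"[12, 28, 878]"` after loading from CSV, the partial `GENRE_NAMES` mapping follows TMDB's public genre list and should be verified against it, and the median is only one of several reasonable summary statistics. The `toy_merge` frame is a hypothetical stand-in with values copied from the table above.

```python
import ast
import pandas as pd

# Hypothetical stand-in for merge1, using four rows from the output above.
toy_merge = pd.DataFrame({
    "title": ["How to Train Your Dragon", "Iron Man 2", "Toy Story", "Inception"],
    "genre_ids": ["[14, 12, 16, 10751]", "[12, 28, 878]",
                  "[16, 35, 10751]", "[28, 878, 12]"],
    "worldwide_profit": [329870992, 451156389, 334545516, 675524642],
})

# Assumed partial mapping of TMDB genre ids to names (verify against TMDB).
GENRE_NAMES = {12: "Adventure", 14: "Fantasy", 16: "Animation", 28: "Action",
               35: "Comedy", 878: "Science Fiction", 10751: "Family"}

by_genre = (
    toy_merge
    # Parse the string into a real Python list of ints.
    .assign(genre_id=toy_merge["genre_ids"].apply(ast.literal_eval))
    # One row per (film, genre) pair; films with an empty list get NaN.
    .explode("genre_id")
    .assign(genre=lambda d: d["genre_id"].map(GENRE_NAMES))
    # groupby drops rows whose genre is missing or unmapped.
    .groupby("genre")["worldwide_profit"]
    .agg(["median", "count"])
    .sort_values("median", ascending=False)
)
print(by_genre)
```

Because a film with several genres contributes to each of them, the per-genre counts add up to more than the number of films. The `count` column is kept so that genres backed by only one or two titles are easy to spot.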


In [21]:
# Connect to the IMDb SQLite database
import pandas as pd
import sqlite3
conn = sqlite3.connect('zippedData/im.db')
# Join movie_basics and movie_ratings on movie_id

pd.read_sql("""
SELECT * 
FROM movie_basics AS mb 
JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id
ORDER BY numvotes DESC, averagerating DESC;""", conn)

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,movie_id.1,averagerating,numvotes
0,tt1375666,Inception,Inception,2010,148.0,"Action,Adventure,Sci-Fi",tt1375666,8.8,1841066
1,tt1345836,The Dark Knight Rises,The Dark Knight Rises,2012,164.0,"Action,Thriller",tt1345836,8.4,1387769
2,tt0816692,Interstellar,Interstellar,2014,169.0,"Adventure,Drama,Sci-Fi",tt0816692,8.6,1299334
3,tt1853728,Django Unchained,Django Unchained,2012,165.0,"Drama,Western",tt1853728,8.4,1211405
4,tt0848228,The Avengers,The Avengers,2012,143.0,"Action,Adventure,Sci-Fi",tt0848228,8.1,1183655
...,...,...,...,...,...,...,...,...,...
73851,tt9366716,DaGram,DaGram,2018,75.0,Comedy,tt9366716,1.2,5
73852,tt2447822,Momok jangan cari pasal!,Momok jangan cari pasal!,2012,85.0,Comedy,tt2447822,1.0,5
73853,tt6792126,Jak se mori revizori,Jak se mori revizori,2018,,Comedy,tt6792126,1.0,5
73854,tt8426154,Pup Scouts,Pup Scouts,2018,72.0,Animation,tt8426154,1.0,5
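Because the query uses `SELECT *`, the result carries `movie_id` twice, and the bottom of the table is dominated by titles with only five votes. A hedged variation is sketched below: it names the columns explicitly and filters on a minimum vote count. The in-memory database, its three rows (copied from the output above), and the `min_votes` threshold of 1000 are all assumptions made for illustration; the real `im.db` schema may contain more columns.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory stand-in for im.db with a reduced schema.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT,
                           start_year INTEGER, genres TEXT);
CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
INSERT INTO movie_basics VALUES
  ('tt1375666', 'Inception', 2010, 'Action,Adventure,Sci-Fi'),
  ('tt1853728', 'Django Unchained', 2012, 'Drama,Western'),
  ('tt9366716', 'DaGram', 2018, 'Comedy');
INSERT INTO movie_ratings VALUES
  ('tt1375666', 8.8, 1841066),
  ('tt1853728', 8.4, 1211405),
  ('tt9366716', 1.2, 5);
""")

min_votes = 1000  # hypothetical cutoff; tune to the analysis

# Name each column once and pass the cutoff as a bound parameter.
ratings = pd.read_sql("""
SELECT mb.movie_id, mb.primary_title, mb.start_year, mb.genres,
       mr.averagerating, mr.numvotes
FROM movie_basics AS mb
JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id
WHERE mr.numvotes >= ?
ORDER BY mr.numvotes DESC, mr.averagerating DESC;
""", demo_conn, params=(min_votes,))
demo_conn.close()
print(ratings)
```

Selecting columns by name keeps later merges free of the `movie_id.1` duplicate, and the vote filter removes ratings that rest on a handful of votes.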
