# WORKBOOK: Film Performance Exploratory Data Analysis

Aidan O'Keefe

## Project Requirements (DELETE AFTER PROJECT)

Your analysis should yield three concrete business recommendations. The ultimate purpose of exploratory analysis is not just to learn about the data, but to help an organization perform better. Explicitly relate your findings to business needs by recommending actions that you think the business (Microsoft) should take.

Main question:
What types of films are currently doing the best at the box office?

Questions to answer:

- What genre of movie makes the most money?
- Which directors make the most money?
- Which genres/directors make the most profit?
- Which films have the best reviews for the lowest production budget?

Factors to consider:

- Do we just care about modern trends? Past 15 years?
- Which films are most profitable adjusted for inflation?

Presentation and notebook should describe:
goals, data, methods, and results, with 3 visualizations corresponding to 3 business recommendations.

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Case

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

Questions to consider:

- What are the business's pain points related to this project?
- How did you pick the data analysis question(s) that you did?
- Why are these questions important from a business perspective?

Microsoft sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t know anything about creating movies. You are charged with exploring what types of films are currently doing the best at the box office. You must then translate those findings into actionable insights that the head of Microsoft's new movie studio can use to help decide what type of films to create.

## Data Understanding

Describe the data being used for this project. 

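We are going to look at data taken from IMDB (`im.db`) and Rotten Tomatoes (`rt.movie_info.tsv`, `rt.reviews.tsv`), along with three more files loaded below: `bom.movie_gross.csv`, `tmdb.movies.csv`, and `tn.movie_budgets.csv`.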

Questions to consider:

- Where did the data come from, and how do they relate to the data analysis questions?
- What do the data represent? Who is in the sample and what variables are included?
- What is the target variable?
- What are the properties of the variables you intend to use?

In [1]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

%matplotlib inline

In [2]:
# Here you run your code to explore the data

## Data Preparation

Describe and justify the process for preparing the data for analysis.

Questions to consider:

- Were there variables you dropped or created?
- How did you address missing values or outliers?
- Why are these choices appropriate given the data and the business problem?


In [3]:
# Here you run your code to clean the data

### Importing SQL Data from IMDB

In [4]:
import sqlite3
conn = sqlite3.connect('/Users/Aidan/Documents/Flatiron/Phase_1/dsc-phase-1-project-v2-4/zippedData/im.db')

In [5]:
q = """

SELECT movie_basics.primary_title AS "Film Name",
        movie_basics.original_title AS "Original Film Name",
        movie_basics.start_year AS "Release Year",
        movie_basics.runtime_minutes AS "Runtime (min)",
        movie_basics.genres AS "Genres",
        movie_ratings.averagerating AS "IMDB Average Rating",
        movie_ratings.numvotes AS "IMDB Votes Count"
        

FROM movie_basics
JOIN movie_ratings
    USING(movie_id)

"""
imdb_df = pd.read_sql(q, conn)

In [6]:
imdb_df.shape

(73856, 7)

In [7]:
imdb_df.head()

Unnamed: 0,Film Name,Original Film Name,Release Year,Runtime (min),Genres,IMDB Average Rating,IMDB Votes Count
0,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77
1,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43
2,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517
3,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13
4,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119
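
Before writing a query like the one above, it can help to list the tables in the database and to confirm what the inner join drops. The following is a minimal sketch that uses a hypothetical in-memory database in place of `im.db`; the two tables carry only a few of the real columns and the rows are made up for illustration.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical in-memory stand-in for im.db with just the two joined tables.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.executescript(
        """
        CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, start_year INTEGER);
        CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
        INSERT INTO movie_basics VALUES ('tt1', 'Film A', 2013), ('tt2', 'Film B', 2018);
        INSERT INTO movie_ratings VALUES ('tt1', 7.0, 77);
        """
    )

    # List the tables available before writing the real query.
    tables = pd.read_sql(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
        demo_conn,
    )

    # Same JOIN ... USING(movie_id) shape as the notebook query. It is an inner
    # join, so a film without a ratings row (Film B here) is dropped.
    joined = pd.read_sql(
        """
        SELECT movie_basics.primary_title AS "Film Name",
               movie_ratings.averagerating AS "IMDB Average Rating"
        FROM movie_basics
        JOIN movie_ratings
            USING(movie_id)
        """,
        demo_conn,
    )

print(tables["name"].tolist())
print(joined)
```

Wrapping the connection in `closing(...)` releases the database file once the data frames have been read, which the notebook's long-lived `conn` does not do.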


### Importing CSV and TSV Data

In [8]:
bom_df = pd.read_csv("/Users/Aidan/Documents/Flatiron/Phase_1/dsc-phase-1-project-v2-4/zippedData/bom.movie_gross.csv")

In [9]:
bom_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [10]:
bom_df.shape

(3387, 5)

In [11]:
rt_mov_df = pd.read_csv(
    "/Users/Aidan/Documents/Flatiron/Phase_1/dsc-phase-1-project-v2-4/zippedData/rt.movie_info.tsv",
    delimiter="\t",
)

In [12]:
rt_mov_df.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [13]:
rt_mov_df.shape

(1560, 12)

In [14]:
rt_rev_df = pd.read_csv(
    "/Users/Aidan/Documents/Flatiron/Phase_1/dsc-phase-1-project-v2-4/zippedData/rt.reviews.tsv",
    delimiter="\t",
    encoding="windows-1252",
)

In [15]:
rt_rev_df.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [16]:
rt_rev_df.shape

(54432, 8)

In [17]:
tmbd_mov_df = pd.read_csv(
    "/Users/Aidan/Documents/Flatiron/Phase_1/dsc-phase-1-project-v2-4/zippedData/tmdb.movies.csv"
)

In [18]:
tmbd_mov_df.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [19]:
tmbd_mov_df.shape

(26517, 10)

In [20]:
mov_budgets_df = pd.read_csv(
    "/Users/Aidan/Documents/Flatiron/Phase_1/dsc-phase-1-project-v2-4/zippedData/tn.movie_budgets.csv"
)

In [21]:
mov_budgets_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [22]:
mov_budgets_df.shape

(5782, 6)

Strip the "$" and commas from the `production_budget`, `domestic_gross`, and `worldwide_gross` columns.

In [23]:
# regex=False treats "$" as a literal character rather than a regex end-of-string anchor
mov_budgets_df['production_budget'] = mov_budgets_df['production_budget'].str.replace(',', '', regex=False).str.replace('$', '', regex=False)

In [24]:
mov_budgets_df['domestic_gross'] = mov_budgets_df['domestic_gross'].str.replace(',', '', regex=False).str.replace('$', '', regex=False)

In [25]:
mov_budgets_df['worldwide_gross'] = mov_budgets_df['worldwide_gross'].str.replace(',', '', regex=False).str.replace('$', '', regex=False)

In [26]:
mov_budgets_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,425000000,760507625,2776345279
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,"Jun 7, 2019",Dark Phoenix,350000000,42762350,149762350
3,4,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747
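
The cleaned columns print like numbers, but `str.replace` leaves them as text, so sums and differences will not work until they are converted. Here is a minimal sketch of one way to do that conversion, using a small hypothetical frame in the same format as `tn.movie_budgets.csv`; the `profit` column is only an illustration and is not created in this notebook.

```python
import pandas as pd

# Hypothetical rows in the same "$1,234" text format as tn.movie_budgets.csv.
budgets = pd.DataFrame(
    {
        "movie": ["Avatar", "Dark Phoenix"],
        "production_budget": ["$425,000,000", "$350,000,000"],
        "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
    }
)

money_cols = ["production_budget", "worldwide_gross"]
for col in money_cols:
    budgets[col] = (
        budgets[col]
        .str.replace("[$,]", "", regex=True)  # drop "$" and "," in one pass
        .astype("int64")  # int64 because worldwide gross can exceed the int32 range
    )

# Arithmetic now works; "profit" is an illustrative column name.
budgets["profit"] = budgets["worldwide_gross"] - budgets["production_budget"]
print(budgets.dtypes)
print(budgets)
```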


### Exploring Data

6 data frames:

- `imdb_df`: used as the base data frame.
- `bom_df`: has the same information as `mov_budgets_df` but fewer records.
- `rt_mov_df`: information not useful.
- `rt_rev_df`: information not useful.
- `tmbd_mov_df`: vote average, vote count, and popularity might be useful.
- `mov_budgets_df`: all useful data.

### Consolidating Data Frames

#### Rotten Tomatoes Data Frame

In [27]:
rt_df = rt_mov_df.merge(rt_rev_df, how = 'left', on= "id")

In [28]:
rt_df['synopsis'][0]

'This gritty, fast-paced, and innovative police drama earned five Academy Awards, including Best Picture, Best Adapted Screenplay (written by Ernest Tidyman), and Best Actor (Gene Hackman). Jimmy "Popeye" Doyle (Hackman) and his partner, Buddy Russo (Roy Scheider), are New York City police detectives on narcotics detail, trying to track down the source of heroin from Europe into the United States. Suave Alain Charnier (Fernando Rey) is the French drug kingpin who provides a large percentage of New York City\'s dope, and Pierre Nicoli (Marcel Bozzuffi) is a hired killer and Charnier\'s right-hand man. Acting on a hunch, Popeye and Buddy start tailing Sal Boca (Tony Lo Bianco) and his wife, Angie (Arlene Faber), who live pretty high for a couple whose corner store brings in about 7,000 dollars a year. It turns out Popeye\'s suspicions are right -- Sal and Angie are the New York agents for Charnier, who will be smuggling 32 million dollars\' worth of heroin into the city in a car shipped 

In [29]:
rt_df.head()

Unnamed: 0,id,synopsis,rating_x,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,review,rating_y,fresh,critic,top_critic,publisher,date
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,,,,,,,,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0.0,Patrick Nabarro,"November 10, 2018"
2,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0.0,io9.com,"May 23, 2018"
3,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0.0,Stream on Demand,"January 4, 2018"
4,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0.0,MUBI,"November 16, 2017"


The information from the Rotten Tomatoes data frames does not seem very useful in furthering our business case, so it will not be used moving forward.

#### IMDB Database and CSV/TSV Merge

Merging based on the following columns in each data frame:
- `imdb_df` ("Original Film Name")
- `mov_budgets_df` ("movie")
- `tmbd_mov_df` ("original_title")
- `bom_df` ("title")

Create a `release_year` column from `release_date` in `mov_budgets_df`.

In [30]:
# The year is the last four characters of strings such as "Dec 18, 2009"
mov_budgets_df['release_year'] = mov_budgets_df['release_date'].str[-4:]

In [31]:
mov_budgets_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year
0,1,"Dec 18, 2009",Avatar,425000000,760507625,2776345279,2009
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,2011
2,3,"Jun 7, 2019",Dark Phoenix,350000000,42762350,149762350,2019
3,4,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963,2015
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,2017


In [32]:
imdb_bud_df = imdb_df.merge(mov_budgets_df, how="inner", left_on= "Original Film Name", right_on= "movie")

In [33]:
imdb_bud_df.shape

(2638, 14)

In [34]:
imdb_bud_df["release_year"] = imdb_bud_df["release_year"].astype(int)

In [35]:
imdb_bud_df["release_year"].dtype

dtype('int64')

In [36]:
imdb_bud_df= imdb_bud_df[imdb_bud_df["Release Year"] == imdb_bud_df["release_year"]]

In [37]:
imdb_bud_df.head(10)

Unnamed: 0,Film Name,Original Film Name,Release Year,Runtime (min),Genres,IMDB Average Rating,IMDB Votes Count,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year
0,Foodfight!,Foodfight!,2012,91.0,"Action,Animation,Comedy",1.9,8248,26,"Dec 31, 2012",Foodfight!,45000000,0,73706,2012
2,The Overnight,The Overnight,2015,79.0,"Comedy,Mystery",6.1,14828,21,"Jun 19, 2015",The Overnight,200000,1109808,1165996,2015
6,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.0,"Adventure,Comedy,Drama",7.3,275300,37,"Dec 25, 2013",The Secret Life of Walter Mitty,91000000,58236838,187861183,2013
7,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,114.0,"Action,Crime,Drama",6.5,105116,67,"Sep 19, 2014",A Walk Among the Tombstones,28000000,26017685,62108587,2014
8,Jurassic World,Jurassic World,2015,124.0,"Action,Adventure,Sci-Fi",7.0,539338,34,"Jun 12, 2015",Jurassic World,215000000,652270625,1648854864,2015
9,The Rum Diary,The Rum Diary,2011,119.0,"Comedy,Drama",6.2,94787,16,"Oct 28, 2011",The Rum Diary,45000000,13109815,21544732,2011
10,The Three Stooges,The Three Stooges,2012,92.0,"Comedy,Family",5.1,28570,4,"Apr 13, 2012",The Three Stooges,30000000,44338224,54052249,2012
11,Anderson's Cross,Anderson's Cross,2010,98.0,"Comedy,Drama,Romance",5.5,106,65,"Dec 31, 2010",Anderson's Cross,300000,0,0,2010
12,Tangled,Tangled,2010,100.0,"Adventure,Animation,Comedy",7.8,366366,15,"Nov 24, 2010",Tangled,260000000,200821936,586477240,2010
13,John Carter,John Carter,2012,132.0,"Action,Adventure,Sci-Fi",6.6,241792,14,"Mar 9, 2012",John Carter,275000000,73058679,282778100,2012


In [38]:
imdb_bud_df.shape

(1489, 14)
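
The merge on title followed by the year filter can also be written as a single merge on both keys, which never creates the mismatched title-only pairs in the first place. This is a minimal sketch on hypothetical miniature frames, not a replacement for the cells above; the budgets and dates are the ones shown for these films elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical miniature versions of imdb_df and mov_budgets_df.
imdb_demo = pd.DataFrame(
    {
        "Original Film Name": ["Robin Hood", "Robin Hood", "Tangled"],
        "Release Year": [2010, 2018, 2010],
    }
)
budgets_demo = pd.DataFrame(
    {
        "movie": ["Robin Hood", "Robin Hood", "Tangled"],
        "release_date": ["May 14, 2010", "Nov 21, 2018", "Nov 24, 2010"],
        "production_budget": [210000000, 99000000, 260000000],
    }
)

# Year as an integer so it compares equal to the IMDB "Release Year" column.
budgets_demo["release_year"] = budgets_demo["release_date"].str[-4:].astype(int)

# Matching on (title, year) pairs each remake with its own budget row.
merged = imdb_demo.merge(
    budgets_demo,
    how="inner",
    left_on=["Original Film Name", "Release Year"],
    right_on=["movie", "release_year"],
)
print(merged[["Original Film Name", "Release Year", "production_budget"]])
```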

In [39]:
imdb_bud_df2 = imdb_bud_df.merge(tmbd_mov_df, how="inner", left_on= "Original Film Name", right_on="original_title")

In [40]:
imdb_bud_df2.head(10)

Unnamed: 0.1,Film Name,Original Film Name,Release Year,Runtime (min),Genres,IMDB Average Rating,IMDB Votes Count,id_x,release_date_x,movie,...,Unnamed: 0,genre_ids,id_y,original_language,original_title,popularity,release_date_y,title,vote_average,vote_count
0,Foodfight!,Foodfight!,2012,91.0,"Action,Animation,Comedy",1.9,8248,26,"Dec 31, 2012",Foodfight!,...,8456,"[16, 28, 35, 10751]",116977,en,Foodfight!,4.705,2013-05-07,Foodfight!,2.1,46
1,The Overnight,The Overnight,2015,79.0,"Comedy,Mystery",6.1,14828,21,"Jun 19, 2015",The Overnight,...,14596,"[9648, 35]",308024,en,The Overnight,6.576,2015-06-19,The Overnight,6.0,200
2,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.0,"Adventure,Comedy,Drama",7.3,275300,37,"Dec 25, 2013",The Secret Life of Walter Mitty,...,7998,"[12, 35, 18, 14]",116745,en,The Secret Life of Walter Mitty,10.743,2013-12-25,The Secret Life of Walter Mitty,7.1,4859
3,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,114.0,"Action,Crime,Drama",6.5,105116,67,"Sep 19, 2014",A Walk Among the Tombstones,...,11053,"[80, 18, 9648, 53]",169917,en,A Walk Among the Tombstones,19.373,2014-09-19,A Walk Among the Tombstones,6.3,1685
4,Jurassic World,Jurassic World,2015,124.0,"Action,Adventure,Sci-Fi",7.0,539338,34,"Jun 12, 2015",Jurassic World,...,14193,"[28, 12, 878, 53]",135397,en,Jurassic World,20.709,2015-06-12,Jurassic World,6.6,14056
5,The Rum Diary,The Rum Diary,2011,119.0,"Comedy,Drama",6.2,94787,16,"Oct 28, 2011",The Rum Diary,...,2568,"[18, 35]",23514,en,The Rum Diary,12.011,2011-10-27,The Rum Diary,5.7,652
6,The Three Stooges,The Three Stooges,2012,92.0,"Comedy,Family",5.1,28570,4,"Apr 13, 2012",The Three Stooges,...,5331,[35],76489,en,The Three Stooges,9.358,2012-04-13,The Three Stooges,5.1,215
7,Anderson's Cross,Anderson's Cross,2010,98.0,"Comedy,Drama,Romance",5.5,106,65,"Dec 31, 2010",Anderson's Cross,...,2237,"[10749, 35, 18]",324352,en,Anderson's Cross,0.6,2010-05-20,Anderson's Cross,5.0,1
8,Tangled,Tangled,2010,100.0,"Adventure,Animation,Comedy",7.8,366366,15,"Nov 24, 2010",Tangled,...,13,"[16, 10751]",38757,en,Tangled,21.511,2010-11-24,Tangled,7.5,6407
9,John Carter,John Carter,2012,132.0,"Action,Adventure,Sci-Fi",6.6,241792,14,"Mar 9, 2012",John Carter,...,5197,"[28, 12, 878]",49529,en,John Carter,18.549,2012-03-09,John Carter,6.1,3338


In [41]:
imdb_bud_df2.shape

(1624, 24)
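
The row count grows from 1,489 to 1,624 after this inner merge, which suggests that some titles appear more than once in `tmbd_mov_df` and each copy is matched. A minimal sketch of how that fan-out can be detected, using hypothetical frames with made-up titles and values:

```python
import pandas as pd

# Hypothetical stand-ins: two right-hand rows share the title "Film A".
left = pd.DataFrame({"Original Film Name": ["Film A", "Film B"]})
tmdb_demo = pd.DataFrame(
    {
        "original_title": ["Film A", "Film A", "Film B"],
        "release_date": ["2017-06-02", "2013-09-29", "2010-11-24"],
        "vote_count": [100, 34, 640],
    }
)

# Count right-hand rows whose title occurs more than once.
dupe_titles = tmdb_demo["original_title"].duplicated(keep=False).sum()

# A plain inner merge silently fans out: 2 left rows become 3.
fanned_out = left.merge(
    tmdb_demo, how="inner", left_on="Original Film Name", right_on="original_title"
)

# validate="many_to_one" makes pandas raise when right-hand keys repeat.
try:
    left.merge(
        tmdb_demo,
        how="inner",
        left_on="Original Film Name",
        right_on="original_title",
        validate="many_to_one",
    )
    raised = False
except pd.errors.MergeError:
    raised = True

print(dupe_titles, len(fanned_out), raised)
```

Passing `validate` is a cheap guard: the merge fails loudly instead of adding rows that later have to be removed as duplicates.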

In [42]:
imdb_ultimate_df = imdb_bud_df2.merge(bom_df, how="inner", left_on= "Original Film Name", right_on="title")

In [43]:
imdb_alphabetical_df = imdb_ultimate_df.sort_values(by = "Film Name")

In [44]:
imdb_alphabetical_df.head()

Unnamed: 0,Film Name,Original Film Name,Release Year,Runtime (min),Genres,IMDB Average Rating,IMDB Votes Count,id_x,release_date_x,movie,...,popularity,release_date_y,title_x,vote_average,vote_count,title_y,studio,domestic_gross_y,foreign_gross,year
177,10 Cloverfield Lane,10 Cloverfield Lane,2016,103.0,"Drama,Horror,Mystery",7.2,260383,54,"Mar 11, 2016",10 Cloverfield Lane,...,17.892,2016-03-11,10 Cloverfield Lane,6.9,4629,10 Cloverfield Lane,Par.,72100000.0,38100000.0,2016
341,12 Strong,12 Strong,2018,130.0,"Action,Drama,History",6.6,50155,64,"Jan 19, 2018",12 Strong,...,13.183,2018-01-19,12 Strong,5.6,1312,12 Strong,WB,45800000.0,21600000.0,2018
726,12 Years a Slave,12 Years a Slave,2013,134.0,"Biography,Drama,History",8.1,577301,18,"Oct 18, 2013",12 Years a Slave,...,16.493,2013-10-30,12 Years a Slave,7.9,6631,12 Years a Slave,FoxS,56700000.0,131100000.0,2013
432,127 Hours,127 Hours,2010,94.0,"Adventure,Biography,Drama",7.6,323949,6,"Nov 5, 2010",127 Hours,...,11.435,2010-11-05,127 Hours,7.0,4469,127 Hours,FoxS,18300000.0,42400000.0,2010
731,13 Sins,13 Sins,2014,93.0,"Horror,Thriller",6.3,29550,51,"Apr 18, 2014",13 Sins,...,10.899,2014-04-18,13 Sins,6.3,576,13 Sins,RTWC,13800.0,,2014


In [45]:
imdb_alphabetical_df.shape

(1195, 29)

### Remove Duplicates

Here we run into the issue of some movies sharing the same name. These duplicates could be remakes made years later or domestic remakes of foreign films; in either case, the release years will differ. We tried to avoid the foreign-film issue by merging the data sets on `Original Film Name` rather than `Film Name`.

In [46]:
imdb_alphabetical_df.shape

(1195, 29)

In [47]:
imdb_alphabetical_df.duplicated(subset=["Film Name", "Release Year"]).value_counts()

False    1013
True      182
dtype: int64

In [48]:
imdb_alphabetical_df[imdb_alphabetical_df.duplicated(subset=["Film Name", "Release Year"])]

Unnamed: 0,Film Name,Original Film Name,Release Year,Runtime (min),Genres,IMDB Average Rating,IMDB Votes Count,id_x,release_date_x,movie,...,popularity,release_date_y,title_x,vote_average,vote_count,title_y,studio,domestic_gross_y,foreign_gross,year
1178,A Bad Moms Christmas,A Bad Moms Christmas,2017,104.0,"Adventure,Comedy",5.5,33056,40,"Nov 1, 2017",A Bad Moms Christmas,...,16.604,2017-11-01,A Bad Moms Christmas,6.3,1044,A Bad Moms Christmas,STX,72100000.0,58400000,2017
436,A Better Life,A Better Life,2011,98.0,"Drama,Romance",7.2,14602,84,"Jun 24, 2011",A Better Life,...,5.885,2011-06-24,A Better Life,7.1,97,A Better Life,Sum.,1800000.0,,2011
925,A Most Violent Year,A Most Violent Year,2014,125.0,"Crime,Drama,Thriller",7.0,60165,68,"Dec 31, 2014",A Most Violent Year,...,12.271,2014-12-31,A Most Violent Year,6.6,792,A Most Violent Year,A24,5700000.0,6300000,2014
1011,A Street Cat Named Bob,A Street Cat Named Bob,2016,103.0,"Biography,Comedy,Drama",7.4,23204,31,"Nov 18, 2016",A Street Cat Named Bob,...,7.120,2016-11-18,A Street Cat Named Bob,7.5,500,A Street Cat Named Bob,Cleopatra,82700.0,,2016
472,Abduction,Abduction,2011,106.0,"Action,Mystery,Thriller",5.1,72552,96,"Sep 23, 2011",Abduction,...,17.690,2011-09-23,Abduction,5.8,1791,Abduction,LGF,28100000.0,54000000,2011
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
656,Won't Back Down,Won't Back Down,2012,121.0,Drama,6.4,5915,49,"Sep 28, 2012",Won't Back Down,...,6.140,2012-09-28,Won't Back Down,6.3,64,Won't Back Down,Fox,5300000.0,,2012
875,Wonder,Wonder,2017,113.0,"Drama,Family",8.0,111632,95,"Nov 17, 2017",Wonder,...,20.101,2017-11-17,Wonder,8.2,3959,Wonder,LGF,132400000.0,173500000,2017
16,Wonder Woman,Wonder Woman,2017,141.0,"Action,Adventure,Fantasy",7.5,487527,55,"Jun 2, 2017",Wonder Woman,...,2.841,2013-09-29,Wonder Woman,5.9,34,Wonder Woman,WB,412600000.0,409300000,2017
1063,Woodlawn,Woodlawn,2015,123.0,"Drama,Sport",6.5,6070,81,"Oct 16, 2015",Woodlawn,...,5.575,2015-10-16,Woodlawn,7.0,79,Woodlawn,PFR,14400000.0,,2015
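
By default `duplicated` flags only the second and later copies, so the listing above hides the first copy of each film. To compare all copies side by side and decide which one to keep, `keep=False` can be used. A minimal sketch on a hypothetical frame with made-up values:

```python
import pandas as pd

# Hypothetical frame: "Film A" appears twice for the same year.
demo = pd.DataFrame(
    {
        "Film Name": ["Film A", "Film A", "Film B"],
        "Release Year": [2017, 2017, 2015],
        "popularity": [20.1, 2.8, 5.6],
    }
)
keys = ["Film Name", "Release Year"]

# Default keep="first": only the later copies are flagged.
later_copies = demo[demo.duplicated(subset=keys)]

# keep=False: every member of a duplicate group is flagged.
all_copies = demo[demo.duplicated(subset=keys, keep=False)]

# One possible rule (an assumption, not the notebook's choice): keep the most popular copy.
most_popular = demo.sort_values("popularity", ascending=False).drop_duplicates(subset=keys)

print(all_copies)
print(most_popular)
```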


In [49]:
new_imdb_df = imdb_alphabetical_df.drop_duplicates(subset= ["Film Name", "Release Year"], keep='first', inplace=False)

In [50]:
new_imdb_df.shape

(1013, 29)

In [51]:
new_imdb_df[new_imdb_df.duplicated(subset=["Film Name"])].head(20)

Unnamed: 0,Film Name,Original Film Name,Release Year,Runtime (min),Genres,IMDB Average Rating,IMDB Votes Count,id_x,release_date_x,movie,...,popularity,release_date_y,title_x,vote_average,vote_count,title_y,studio,domestic_gross_y,foreign_gross,year
98,Robin Hood,Robin Hood,2018,116.0,"Action,Adventure,Thriller",5.3,41588,9,"Nov 21, 2018",Robin Hood,...,15.444,2010-05-14,Robin Hood,6.3,2569,Robin Hood,Uni.,105300000.0,216400000,2010
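
The Robin Hood row above carries a `Release Year` of 2018 but a `release_date_y` and `year` from 2010, which suggests the title-only merges attached the earlier film's values to the remake. One way to find such rows is to compare the year columns before `year` is dropped. A minimal sketch, assuming a hypothetical slice of the merged frame whose values mirror rows shown in this notebook:

```python
import pandas as pd

# Hypothetical slice of the merged frame with only the columns needed here.
merged_demo = pd.DataFrame(
    {
        "Film Name": ["Robin Hood", "Robin Hood", "12 Strong"],
        "Release Year": [2010, 2018, 2018],  # from the IMDB data
        "year": [2010, 2010, 2018],          # from bom_df
        "studio": ["Uni.", "Uni.", "WB"],
    }
)

# Rows where the two sources disagree on the year are likely mismatched merges.
mismatch = merged_demo["Release Year"] != merged_demo["year"]
suspect = merged_demo[mismatch]
consistent = merged_demo[~mismatch]

print(suspect)
```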


In [52]:
imdb_clean_df = new_imdb_df.drop(
    columns=['id_x', 'id_y', 'movie', 'title_x', 'title_y', 'year', 'original_language',
             'release_date_y', 'Unnamed: 0', 'genre_ids', 'original_title', 'Original Film Name']
)

In [53]:
imdb_clean_df.shape

(1013, 17)

In [54]:
imdb_clean_df.head(10)

Unnamed: 0,Film Name,Release Year,Runtime (min),Genres,IMDB Average Rating,IMDB Votes Count,release_date_x,production_budget,domestic_gross_x,worldwide_gross,release_year,popularity,vote_average,vote_count,studio,domestic_gross_y,foreign_gross
177,10 Cloverfield Lane,2016,103.0,"Drama,Horror,Mystery",7.2,260383,"Mar 11, 2016",5000000,72082999,108286422,2016,17.892,6.9,4629,Par.,72100000.0,38100000.0
341,12 Strong,2018,130.0,"Action,Drama,History",6.6,50155,"Jan 19, 2018",35000000,45819713,71118378,2018,13.183,5.6,1312,WB,45800000.0,21600000.0
726,12 Years a Slave,2013,134.0,"Biography,Drama,History",8.1,577301,"Oct 18, 2013",20000000,56671993,181025343,2013,16.493,7.9,6631,FoxS,56700000.0,131100000.0
432,127 Hours,2010,94.0,"Adventure,Biography,Drama",7.6,323949,"Nov 5, 2010",18000000,18335230,60217171,2010,11.435,7.0,4469,FoxS,18300000.0,42400000.0
731,13 Sins,2014,93.0,"Horror,Thriller",6.3,29550,"Apr 18, 2014",4000000,9134,47552,2014,10.899,6.3,576,RTWC,13800.0,
242,2 Guns,2013,109.0,"Action,Comedy,Crime",6.7,182025,"Aug 2, 2013",61000000,75612460,132493015,2013,14.8,6.5,2368,Uni.,75600000.0,56300000.0
221,21 Jump Street,2012,109.0,"Action,Comedy,Crime",7.2,477771,"Mar 16, 2012",42000000,138447667,202812429,2012,14.836,6.8,6527,Sony,138400000.0,63100000.0
809,22 Jump Street,2014,112.0,"Action,Comedy,Crime",7.0,319504,"Jun 13, 2014",50000000,191719337,331333876,2014,11.176,6.9,5167,Sony,191700000.0,139600000.0
766,3 Days to Kill,2014,117.0,"Action,Drama,Thriller",6.2,81681,"Feb 21, 2014",28000000,30697999,38959900,2014,11.011,6.1,1279,Rela.,30700000.0,21900000.0
505,30 Minutes or Less,2011,83.0,"Action,Comedy,Crime",6.1,87254,"Aug 12, 2011",28000000,37053924,40966716,2011,11.315,5.6,825,Sony,37100000.0,3500000.0


In [55]:
imdb_clean_df.set_index('Film Name', inplace=True)

In [56]:
imdb_clean_df.reset_index(inplace=True)

In [57]:
imdb_clean_df.shape

(1013, 17)

In [58]:
imdb_clean_df

Unnamed: 0,Film Name,Release Year,Runtime (min),Genres,IMDB Average Rating,IMDB Votes Count,release_date_x,production_budget,domestic_gross_x,worldwide_gross,release_year,popularity,vote_average,vote_count,studio,domestic_gross_y,foreign_gross
0,10 Cloverfield Lane,2016,103.0,"Drama,Horror,Mystery",7.2,260383,"Mar 11, 2016",5000000,72082999,108286422,2016,17.892,6.9,4629,Par.,72100000.0,38100000
1,12 Strong,2018,130.0,"Action,Drama,History",6.6,50155,"Jan 19, 2018",35000000,45819713,71118378,2018,13.183,5.6,1312,WB,45800000.0,21600000
2,12 Years a Slave,2013,134.0,"Biography,Drama,History",8.1,577301,"Oct 18, 2013",20000000,56671993,181025343,2013,16.493,7.9,6631,FoxS,56700000.0,131100000
3,127 Hours,2010,94.0,"Adventure,Biography,Drama",7.6,323949,"Nov 5, 2010",18000000,18335230,60217171,2010,11.435,7.0,4469,FoxS,18300000.0,42400000
4,13 Sins,2014,93.0,"Horror,Thriller",6.3,29550,"Apr 18, 2014",4000000,9134,47552,2014,10.899,6.3,576,RTWC,13800.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1008,Youth,2015,124.0,"Comedy,Drama,Music",7.3,64418,"Dec 4, 2015",13000000,2703296,24001573,2015,9.265,6.9,1098,FoxS,2700000.0,
1009,Zero Dark Thirty,2012,157.0,"Drama,Thriller",7.4,251072,"Dec 19, 2012",52500000,95720716,134612435,2012,14.239,6.9,2553,Sony,95700000.0,37100000
1010,Zookeeper,2011,102.0,"Comedy,Family,Romance",5.2,52396,"Jul 8, 2011",80000000,80360866,170805525,2011,10.764,5.3,886,Sony,80400000.0,89500000
1011,Zoolander 2,2016,101.0,Comedy,4.7,59914,"Feb 12, 2016",50000000,28848693,55348693,2016,12.997,4.7,1374,Par.,28800000.0,27900000


### Dealing with Missing Values

Now let's take a look at the missing values of our newly compiled data frame.

In [59]:
imdb_clean_df.isna().sum()

Film Name               0
Release Year            0
Runtime (min)           1
Genres                  1
IMDB Average Rating     0
IMDB Votes Count        0
release_date_x          0
production_budget       0
domestic_gross_x        0
worldwide_gross         0
release_year            0
popularity              0
vote_average            0
vote_count              0
studio                  0
domestic_gross_y        0
foreign_gross          96
dtype: int64

We can see we are missing values in `Runtime (min)`, `Genres`, and `foreign_gross`. We might be able to impute the `foreign_gross` data by subtracting `domestic_gross_x` from `worldwide_gross`. We may also be able to fill in the missing `Runtime (min)` and `Genres` values by looking up the film, since each column is missing only 1 value.
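
The `foreign_gross` imputation described above can be sketched as follows. This is a minimal example on a hypothetical two-row frame whose values mirror rows shown earlier; it assumes the money columns are still text, so they are converted to numbers first.

```python
import pandas as pd

# Hypothetical rows shaped like imdb_clean_df, with money columns stored as text.
demo = pd.DataFrame(
    {
        "Film Name": ["13 Sins", "2 Guns"],
        "domestic_gross_x": ["9134", "75612460"],
        "worldwide_gross": ["47552", "132493015"],
        "foreign_gross": [None, "56300000"],
    }
)

for col in ["domestic_gross_x", "worldwide_gross", "foreign_gross"]:
    # Strip any thousands separators, then turn unparseable text into NaN.
    cleaned = demo[col].str.replace(",", "", regex=False)
    demo[col] = pd.to_numeric(cleaned, errors="coerce")

# Estimate foreign gross as worldwide minus domestic, and use it only where
# the reported value is missing.
estimated = demo["worldwide_gross"] - demo["domestic_gross_x"]
demo["foreign_gross"] = demo["foreign_gross"].fillna(estimated)

print(demo)
```

Filling only the gaps keeps the reported figures where they exist. For 2 Guns the reported value (56,300,000) and the computed difference (56,880,555) are close but not identical, so the two sources should not be expected to agree exactly.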

In [60]:
imdb_clean_df[imdb_clean_df['Genres'].isna()]

Unnamed: 0,Film Name,Release Year,Runtime (min),Genres,IMDB Average Rating,IMDB Votes Count,release_date_x,production_budget,domestic_gross_x,worldwide_gross,release_year,popularity,vote_average,vote_count,studio,domestic_gross_y,foreign_gross
720,The Bounty Hunter,2010,,,6.3,29,"Mar 19, 2010",45000000,67061228,135808837,2010,9.861,5.7,1627,Sony,67099999.0,69300000


In [61]:
imdb_clean_df[imdb_clean_df['Film Name'] == "Robin Hood"]

Unnamed: 0,Film Name,Release Year,Runtime (min),Genres,IMDB Average Rating,IMDB Votes Count,release_date_x,production_budget,domestic_gross_x,worldwide_gross,release_year,popularity,vote_average,vote_count,studio,domestic_gross_y,foreign_gross
595,Robin Hood,2010,140.0,"Action,Adventure,Drama",6.6,239480,"May 14, 2010",210000000,105487148,322459006,2010,15.444,6.3,2569,Uni.,105300000.0,216400000
596,Robin Hood,2018,116.0,"Action,Adventure,Thriller",5.3,41588,"Nov 21, 2018",99000000,30824628,84747441,2018,15.444,6.3,2569,Uni.,105300000.0,216400000


## Data Modeling


Describe and justify the process for analyzing or modeling the data.

Questions to consider:

- How did you analyze or model the data?
- How did you iterate on your initial approach to make it better?
- Why are these choices appropriate given the data and the business problem?


In [62]:
# Here you run your code to model the data

### Visualizing the Data

In [63]:
# fig, ax = plt.subplots(figsize = (10,5))

# x= imdb_clean_df["Genres"]

# y= imdb_clean_df["production_budget"]


# ax.bar(x, y);
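
Plotting `Genres` against `production_budget` directly would treat each comma-separated combination as its own category. A possible alternative is to split the genres into one row per film and genre, then aggregate before plotting. The sketch below assumes a hypothetical frame shaped like `imdb_clean_df` with budgets already numeric; the choice of the median is an assumption, not a decision made in this notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows shaped like imdb_clean_df, with numeric budgets.
demo = pd.DataFrame(
    {
        "Film Name": ["10 Cloverfield Lane", "12 Strong", "13 Sins"],
        "Genres": ["Drama,Horror,Mystery", "Action,Drama,History", "Horror,Thriller"],
        "production_budget": [5000000, 35000000, 4000000],
    }
)

# One row per (film, genre): split the comma-separated text, then explode the lists.
by_genre = demo.assign(Genre=demo["Genres"].str.split(",")).explode("Genre")

# Median budget per genre, sorted so the bars read from largest to smallest.
median_budget = (
    by_genre.groupby("Genre")["production_budget"].median().sort_values(ascending=False)
)

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(median_budget.index, median_budget.values)
ax.set_xlabel("Genre")
ax.set_ylabel("Median production budget ($)")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```

A film with three genres counts once in each of its genres, so the per-genre figures overlap and should not be summed.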



## Evaluation


Evaluate how well your work solves the stated business problem.

Questions to consider:

- How do you interpret the results?
- How well does your model fit your data? How much better is this than your baseline model?
- How confident are you that your results would generalize beyond the data you have?
- How confident are you that this model would benefit the business if put into use?


## Conclusions


Provide your conclusions about the work you've done, including any limitations or next steps.

Questions to consider:

- What would you recommend the business do as a result of this work?
- What are some reasons why your analysis might not fully solve the business problem?
- What else could you do in the future to improve this project?