# Top Earners in the Movie Industry

## Table of Contents

<ul>
    <li><a href="#intro">Introduction</a></li>
    <li><a href="#eda">Exploratory Data Analysis</a></li>
    <li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id="intro"></a>
## Introduction

> This project analyzes the IMDb movie data. When the analysis is complete, you should be able to identify the top 5 highest-grossing directors, the top 5 highest-grossing movie genres of all time, how the revenues of the highest-grossing movies compare, and which companies released the most movies.

> There are 10 columns that will not be needed for the analysis. Use pandas to drop these columns. HINT: Only the columns pertaining to revenue will be needed.

> To get you started, I've already placed the needed code for getting the packages and datafile that you will be using for the project. 

In [1]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

In [46]:
df = pd.read_csv('files/imdb-movies.csv')
df.head(5)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


### Drop columns without necessary information and remove all records with no financial information. Pay close attention to columns that tell you nothing about the financial data.

In [48]:
df.drop(df.query('budget == 0 or revenue == 0').index , inplace=True)
df.drop(['id','popularity','imdb_id','homepage','budget', 'runtime','tagline','keywords','vote_count','vote_average','release_date','budget_adj','overview'], axis=1, inplace=True)

In [49]:
df.dropna(inplace=True)
df.reset_index(drop=True,inplace=True)
df.head(5)

Unnamed: 0,revenue,original_title,cast,director,genres,production_companies,release_year,revenue_adj
0,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,2015,1392446000.0
1,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,2015,348161300.0
2,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,Robert Schwentke,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,2015,271619000.0
3,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,J.J. Abrams,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,2015,1902723000.0
4,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,James Wan,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,2015,1385749000.0


### Data Cleaning

In [180]:
# Delete all records with null, or empty values
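
A minimal sketch of this cleanup step, assuming that "empty" means an empty or whitespace-only string as well as NaN. The toy frame `toy` and its values are hypothetical and are not rows from the data file:

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; not rows from imdb-movies.csv
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "director": ["Jane Doe", "", None, "   "],
    "revenue": [100, 200, 300, 400],
})

# Treat empty or whitespace-only strings as missing, then drop incomplete rows
cleaned = (
    toy.replace(r"^\s*$", np.nan, regex=True)
       .dropna()
       .reset_index(drop=True)
)
print(cleaned)
```

Only record `A` survives here, because `dropna()` on its own would have kept the rows whose `director` is an empty or blank string.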



#### Here's a helpful hint from my own analysis when I ran this the first time. It may help shed light on what your data set should look like.

#### If I created one record for each of the `production_companies` a movie was released under and one record for each of its `genres`<br>and then tried to run calculations, it wouldn't work, because for many records the number of `production_companies`<br>and the number of `genres` aren't the same. So I'll create 2 dataframes: one without a `production_companies` column and one without a `genres` column.
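
The mismatch described in the hint can be seen on a single hypothetical record. This sketch assumes a pandas version that provides `DataFrame.explode` (0.25 or later); the frame `movie` and its values are made up for illustration:

```python
import pandas as pd

# One hypothetical film with 2 production companies and 3 genres
movie = pd.DataFrame({
    "original_title": ["Example Film"],
    "revenue": [100],
    "production_companies": ["Studio A|Studio B"],
    "genres": ["Action|Drama|Comedy"],
})

# Frame without `genres`: one row per production company
by_company = (
    movie.drop(columns="genres")
         .assign(production_companies=lambda d: d["production_companies"].str.split("|"))
         .explode("production_companies")
         .reset_index(drop=True)
)

# Frame without `production_companies`: one row per genre
by_genre = (
    movie.drop(columns="production_companies")
         .assign(genres=lambda d: d["genres"].str.split("|"))
         .explode("genres")
         .reset_index(drop=True)
)

print(len(by_company), len(by_genre))  # 2 3
```

Each frame repeats the film's revenue once per company or once per genre, so totals should be computed within one frame at a time.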

In [50]:
# Keep every record and split the pipe-delimited string into a list of companies
production_records = df.copy()
production_records['production_companies'] = production_records['production_companies'].str.split('|')

# explode() emits one row per production company and repeats the other columns
df_pc = production_records.explode('production_companies')

df_pc.drop('genres', axis=1, inplace=True)
df_pc.reset_index(drop=True, inplace=True)

df_pc.head()

Unnamed: 0,revenue,original_title,cast,director,production_companies,release_year,revenue_adj
0,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Universal Studios,2015,1392446000.0
1,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Amblin Entertainment,2015,1392446000.0
2,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Legendary Pictures,2015,1392446000.0
3,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Fuji Television Network,2015,1392446000.0
4,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Dentsu,2015,1392446000.0


In [53]:
# One row per (movie, production company), built from the list-valued
# production_records. The `genres` value on each row stays pipe-delimited
# (for example 'Action|Crime|Thriller') and is not split into single genres.
df_g = production_records.explode('production_companies')

df_g.drop('production_companies', axis=1, inplace=True)
df_g.reset_index(drop=True, inplace=True)

df_g.head()

Unnamed: 0,revenue,original_title,cast,director,genres,release_year,revenue_adj
0,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,2015,1392446000.0
1,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,2015,1392446000.0
2,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,2015,1392446000.0
3,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,2015,1392446000.0
4,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,2015,1392446000.0


In [15]:
# Copy of df without the `genres` column; used below for director and title totals
df_genre = df.drop('genres', axis=1)
df_genre

Unnamed: 0,revenue,original_title,cast,director,production_companies,release_year,revenue_adj
0,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,Universal Studios|Amblin Entertainment|Legenda...,2015,1.392446e+09
1,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,Village Roadshow Pictures|Kennedy Miller Produ...,2015,3.481613e+08
2,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,Robert Schwentke,Summit Entertainment|Mandeville Films|Red Wago...,2015,2.716190e+08
3,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,J.J. Abrams,Lucasfilm|Truenorth Productions|Bad Robot,2015,1.902723e+09
4,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,James Wan,Universal Pictures|Original Film|Media Rights ...,2015,1.385749e+09
5,532950503,The Revenant,Leonardo DiCaprio|Tom Hardy|Will Poulter|Domhn...,Alejandro González Iñárritu,Regency Enterprises|Appian Way|CatchPlay|Anony...,2015,4.903142e+08
6,440603537,Terminator Genisys,Arnold Schwarzenegger|Jason Clarke|Emilia Clar...,Alan Taylor,Paramount Pictures|Skydance Productions,2015,4.053551e+08
7,595380321,The Martian,Matt Damon|Jessica Chastain|Kristen Wiig|Jeff ...,Ridley Scott,Twentieth Century Fox Film Corporation|Scott F...,2015,5.477497e+08
8,1156730962,Minions,Sandra Bullock|Jon Hamm|Michael Keaton|Allison...,Kyle Balda|Pierre Coffin,Universal Pictures|Illumination Entertainment,2015,1.064192e+09
9,853708609,Inside Out,Amy Poehler|Phyllis Smith|Richard Kind|Bill Ha...,Pete Docter,Walt Disney Pictures|Pixar Animation Studios|W...,2015,7.854116e+08


<a id="eda"></a>
## Exploratory Data Analysis

> Use Matplotlib to display your data analysis

### Which production companies released the most movies in the last 10 years? Display the top 5 production companies.

In [55]:
# Count movies per production company for releases after 2007; show the ten largest counts
df_pc.query('release_year > 2007')['production_companies'].value_counts().nlargest(10)

Warner Bros.                              64
Universal Pictures                        63
Relativity Media                          58
Columbia Pictures                         57
Paramount Pictures                        44
Walt Disney Pictures                      34
Lionsgate                                 29
New Line Cinema                           28
Twentieth Century Fox Film Corporation    28
Legendary Pictures                        26
Name: production_companies, dtype: int64
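
The section asks for Matplotlib output, so here is one possible bar chart of the top five counts. The values are copied from the output above; the variable names and the `Agg` backend (chosen so the sketch also runs outside a notebook) are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a notebook, %matplotlib inline works instead
import matplotlib.pyplot as plt
import pandas as pd

# Top five counts copied from the value_counts() output above
top5 = pd.Series(
    [64, 63, 58, 57, 44],
    index=["Warner Bros.", "Universal Pictures", "Relativity Media",
           "Columbia Pictures", "Paramount Pictures"],
    name="movies_released",
)

# One bar per company, in descending order of movie count
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(top5.index, top5.values)
ax.set_ylabel("Movies released (release_year > 2007)")
ax.set_title("Top 5 production companies by number of movies")
ax.tick_params(axis="x", rotation=30)
fig.tight_layout()
```

In the notebook itself, the same chart could be drawn straight from the `value_counts()` result with `.nlargest(5)` instead of retyping the numbers.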

### Which 5 movie genres grossed the most of all time?

In [56]:
# Sum revenue per `genres` label in df_g and keep the five largest totals
df_g.groupby('genres')['revenue'].sum().nlargest(5)

genres
Drama                                       20684999483
Comedy                                      20593011762
Adventure|Fantasy|Action                    20543728652
Action|Adventure|Fantasy|Science Fiction    18454390078
Action|Science Fiction|Adventure            11525923981
Name: revenue, dtype: int64
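
For comparison, revenue can also be totaled per individual genre by splitting the pipe-delimited `genres` string before grouping. This is a sketch on hypothetical data (the frame `films` and its numbers are invented), not a recomputation of the figures above:

```python
import pandas as pd

# Hypothetical films; revenue values are invented for illustration
films = pd.DataFrame({
    "original_title": ["Film 1", "Film 2", "Film 3"],
    "genres": ["Action|Adventure", "Drama", "Action|Drama"],
    "revenue": [300, 100, 200],
})

# Split each genres string into a list, then emit one row per (film, genre)
per_genre = films.assign(genres=films["genres"].str.split("|")).explode("genres")

# Each film's revenue counts once toward every genre it belongs to
totals = per_genre.groupby("genres")["revenue"].sum().nlargest(5)
print(totals)
```

Because a film contributes its full revenue to each of its genres, the per-genre totals add up to more than the combined revenue of the films.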

### Who are the top 5 grossing directors?

In [41]:
# Total revenue per director, scaled to billions of dollars
df_genre.groupby('director')['revenue'].sum().nlargest(5)/1e9

director
Steven Spielberg     9.018564
Peter Jackson        6.523245
James Cameron        5.841895
Michael Bay          4.917208
Christopher Nolan    4.167549
Name: revenue, dtype: float64

### Compare the revenue of the highest grossing movies of all time.

In [44]:
df_genre.groupby('original_title')['revenue'].sum().nlargest(10)

original_title
Avatar                                          2781505847
Star Wars: The Force Awakens                    2068178225
Titanic                                         1845034188
The Avengers                                    1568080742
Jurassic World                                  1513528810
Furious 7                                       1506249360
Avengers: Age of Ultron                         1405035767
Harry Potter and the Deathly Hallows: Part 2    1327817822
Frozen                                          1274219009
Iron Man 3                                      1215439994
Name: revenue, dtype: int64
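
To make the comparison concrete, the gaps between consecutive films can be computed directly. The figures below are copied from the top five rows of the output above; the variable names are illustrative:

```python
import pandas as pd

# Top five revenues copied from the output above (US dollars)
top_revenue = pd.Series({
    "Avatar": 2781505847,
    "Star Wars: The Force Awakens": 2068178225,
    "Titanic": 1845034188,
    "The Avengers": 1568080742,
    "Jurassic World": 1513528810,
})

# Lead of each film over the next one down the ranking
lead_over_next = (top_revenue - top_revenue.shift(-1)).dropna().astype("int64")
print(lead_over_next)

# Avatar's lead over the runner-up, in dollars
avatar_lead = int(top_revenue["Avatar"] - top_revenue["Star Wars: The Force Awakens"])
print(f"{avatar_lead:,}")  # 713,327,622
```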

<a id="conclusions"></a>
## Conclusions

> Using the cell below, write a brief conclusion of what you have found from the analysis of the data. The cell below will allow you to write plain text instead of code.

Based on the numbers, Warner Bros. narrowly beat out Universal Pictures (64 movies to 63) for the most movies released since 2008. Surprisingly, the top three genre labels were very close in summed revenue: Drama, Comedy, and the Adventure|Fantasy|Action combination each totaled between about $20.5 billion and $20.7 billion. Steven Spielberg takes the cake as the highest-grossing director at about $9.0 billion, and Avatar is the highest-grossing film, ahead of Star Wars: The Force Awakens by about $713 million!