# Top Earners in the Movie Industry

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> I chose the IMDB movie dataset because I want to know how much different movie genres, directors, and production companies have grossed over time.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [5]:
df = pd.read_csv('imdb-movies.csv')

In [6]:
df


Unnamed: 0,popularity,budget,revenue,original_title,director,genres,production_companies,release_date,release_year
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,2015
1,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,2015
2,13.112507,110000000,295238201,Insurgent,Robert Schwentke,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2015
3,11.173104,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,2015
4,9.335014,190000000,1506249360,Furious 7,James Wan,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2015
...,...,...,...,...,...,...,...,...,...
10861,0.080598,0,0,The Endless Summer,Bruce Brown,Documentary,Bruce Brown Films,6/15/66,1966
10862,0.065543,0,0,Grand Prix,John Frankenheimer,Action|Adventure|Drama,Cherokee Productions|Joel Productions|Douglas ...,12/21/66,1966
10863,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mystery|Comedy,Mosfilm,1/1/66,1966
10864,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Action|Comedy,Benedict Pictures Corp.,11/2/66,1966


### Data Cleaning

In [3]:
# Drop columns without necessary information and remove all records with no financial information


In [15]:
df.isna().any()


popularity              False
budget                  False
revenue                 False
original_title          False
director                False
genres                  False
production_companies    False
release_date            False
release_year            False
dtype: bool

In [10]:
# I did these steps (dropping columns and records) in Excel

df.columns


Index(['popularity', 'budget', 'revenue', 'original_title', 'director',
       'genres', 'production_companies', 'release_date', 'release_year'],
      dtype='object')
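
The column and record cleanup was done in Excel, so the notebook does not record the exact rule that was applied. A minimal pandas sketch of the same idea follows. It uses a toy frame rather than the real file, the `homepage` column is a hypothetical example of an unneeded column, and treating a `0` in both `budget` and `revenue` as "no financial information" is an assumption.

```python
import pandas as pd

# Toy stand-in for the raw file; `homepage` is a hypothetical unneeded column.
raw = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [100, 0, 50, 0],
    "revenue": [300, 0, 0, 80],
    "homepage": ["http://a.example", None, None, None],
})

# Drop columns that are not needed for the analysis.
cleaned = raw.drop(columns=["homepage"])

# Assumption: a 0 in BOTH budget and revenue means "no financial information".
has_financials = (cleaned["budget"] > 0) | (cleaned["revenue"] > 0)
cleaned = cleaned[has_financials].reset_index(drop=True)

print(cleaned)
```

Doing this step in code rather than in a spreadsheet keeps the rule visible and repeatable when the CSV is refreshed.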

In [13]:
df.drop_duplicates(inplace=True)

In [14]:
df.dropna(inplace=True)

In [22]:
df.drop(['release_date'], axis='columns', inplace=True)
df

Unnamed: 0,popularity,budget,revenue,original_title,director,genres,production_companies,release_year
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,2015
1,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,2015
2,13.112507,110000000,295238201,Insurgent,Robert Schwentke,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,2015
3,11.173104,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,2015
4,9.335014,190000000,1506249360,Furious 7,James Wan,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,2015
...,...,...,...,...,...,...,...,...
10861,0.080598,0,0,The Endless Summer,Bruce Brown,Documentary,Bruce Brown Films,1966
10862,0.065543,0,0,Grand Prix,John Frankenheimer,Action|Adventure|Drama,Cherokee Productions|Joel Productions|Douglas ...,1966
10863,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mystery|Comedy,Mosfilm,1966
10864,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Action|Comedy,Benedict Pictures Corp.,1966


In [18]:
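# Convert the release_date strings to datetimes (this cell ran before release_date was dropped in In [22])
df = df.astype({'release_date': 'datetime64[ns]'})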

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9806 entries, 0 to 10865
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   popularity            9806 non-null   float64       
 1   budget                9806 non-null   int64         
 2   revenue               9806 non-null   int64         
 3   original_title        9806 non-null   object        
 4   director              9806 non-null   object        
 5   genres                9806 non-null   object        
 6   production_companies  9806 non-null   object        
 7   release_date          9806 non-null   datetime64[ns]
 8   release_year          9806 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(4)
memory usage: 766.1+ KB
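
One thing worth checking after a datetime conversion like this: the raw `release_date` strings carry two-digit years (for example `6/15/66`), and the `%y` directive maps 00-68 to 2000-2068. The sketch below, on a toy frame, shows one possible way to use `release_year` to pull such dates back a century. The explicit `format` string and the correction rule are assumptions, not something the notebook does.

```python
import pandas as pd

# Toy rows in the same shape as the dataset's date columns.
dates = pd.DataFrame({
    "release_date": ["6/9/15", "6/15/66", "12/21/66"],
    "release_year": [2015, 1966, 1966],
})

# Parse month/day/two-digit-year strings explicitly.
parsed = pd.to_datetime(dates["release_date"], format="%m/%d/%y")

# A parsed year later than release_year means the century was guessed wrong.
wrong_century = parsed.dt.year > dates["release_year"]

# Shift only the affected rows back by 100 years.
fixed = parsed.where(~wrong_century, parsed - pd.DateOffset(years=100))

print(fixed)
```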


#### If I created one record for each of the `production_companies` a movie was released under and one record for each of its `genres`,<br>and then tried to run calculations, it wouldn't work, because for many records the number of `production_companies`<br>and the number of `genres` aren't the same. So I'll create 2 dataframes: one without a `genres` column (one production company per record) and one without a `production_companies` column (one genre per record).
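
To illustrate the problem, here is a small sketch with one made-up movie (the titles, studios, and numbers are invented). Exploding both columns in sequence produces every genre/company combination, so a per-genre revenue sum counts the same movie several times.

```python
import pandas as pd

# One invented movie with 2 genres and 3 production companies.
movies = pd.DataFrame({
    "original_title": ["Movie A"],
    "revenue": [100],
    "genres": ["Action|Thriller"],
    "production_companies": ["Studio X|Studio Y|Studio Z"],
})

# Split both columns into lists, then explode one after the other.
both = (
    movies.assign(
        genres=movies.genres.str.split("|"),
        production_companies=movies.production_companies.str.split("|"),
    )
    .explode("genres")
    .explode("production_companies")
)

# 2 genres x 3 companies = 6 rows for a single movie.
print(len(both))

# Each genre now sums the movie's revenue once per company (300, not 100).
revenue_by_genre = both.groupby("genres").revenue.sum()
print(revenue_by_genre)
```

Exploding both columns in a single `explode` call is not a way around this either: pandas requires the lists on each row to have matching lengths when several columns are exploded together. Keeping two separate dataframes avoids both issues.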

#### One `production_companies` per record

In [26]:
multiple_production_companies = df.assign(production_companies=df.production_companies.str.split("|")).explode('production_companies')
multiple_production_companies

Unnamed: 0,popularity,budget,revenue,original_title,director,genres,production_companies,release_year
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Universal Studios,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Amblin Entertainment,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Legendary Pictures,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Fuji Television Network,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action|Adventure|Science Fiction|Thriller,Dentsu,2015
...,...,...,...,...,...,...,...,...
10862,0.065543,0,0,Grand Prix,John Frankenheimer,Action|Adventure|Drama,Joel Productions,1966
10862,0.065543,0,0,Grand Prix,John Frankenheimer,Action|Adventure|Drama,Douglas & Lewis Productions,1966
10863,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mystery|Comedy,Mosfilm,1966
10864,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Action|Comedy,Benedict Pictures Corp.,1966


In [27]:
multiple_production_companies.drop(['genres'], axis='columns', inplace=True)
multiple_production_companies

Unnamed: 0,popularity,budget,revenue,original_title,director,production_companies,release_year
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Universal Studios,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Amblin Entertainment,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Legendary Pictures,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Fuji Television Network,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Dentsu,2015
...,...,...,...,...,...,...,...
10862,0.065543,0,0,Grand Prix,John Frankenheimer,Joel Productions,1966
10862,0.065543,0,0,Grand Prix,John Frankenheimer,Douglas & Lewis Productions,1966
10863,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mosfilm,1966
10864,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Benedict Pictures Corp.,1966


In [1]:
# GENRES
# For every string of genres in a record, split the genres into a list and give each genre its own row.
# This way we should be able to query by whichever genre we want

In [33]:
genresdf = df.assign(genres=df.genres.str.split("|")).explode('genres')
genresdf

Unnamed: 0,popularity,budget,revenue,original_title,director,genres,production_companies,release_year
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action,Universal Studios|Amblin Entertainment|Legenda...,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Adventure,Universal Studios|Amblin Entertainment|Legenda...,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Science Fiction,Universal Studios|Amblin Entertainment|Legenda...,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Thriller,Universal Studios|Amblin Entertainment|Legenda...,2015
1,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,Action,Village Roadshow Pictures|Kennedy Miller Produ...,2015
...,...,...,...,...,...,...,...,...
10863,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mystery,Mosfilm,1966
10863,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Comedy,Mosfilm,1966
10864,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Action,Benedict Pictures Corp.,1966
10864,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Comedy,Benedict Pictures Corp.,1966


In [35]:
genresdf.drop(['production_companies'], axis='columns', inplace=True)
genresdf

Unnamed: 0,popularity,budget,revenue,original_title,director,genres,release_year
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Action,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Adventure,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Science Fiction,2015
0,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,Thriller,2015
1,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,Action,2015
...,...,...,...,...,...,...,...
10863,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Mystery,1966
10863,0.065141,0,0,Beregis Avtomobilya,Eldar Ryazanov,Comedy,1966
10864,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Action,1966
10864,0.064317,0,0,"What's Up, Tiger Lily?",Woody Allen,Comedy,1966


#### One `genres` per record

<a id='eda'></a>
## Exploratory Data Analysis

### Which production companies released the most movies in the last 10 years? Display the top 10 production companies.

In [30]:
multiple_production_companies.where(multiple_production_companies.release_year > 2005).production_companies.value_counts().head(10)

Universal Pictures                        149
Warner Bros.                              131
Relativity Media                          108
Columbia Pictures                         100
Paramount Pictures                         92
Walt Disney Pictures                       79
Twentieth Century Fox Film Corporation     79
New Line Cinema                            71
BBC Films                                  67
Lionsgate                                  58
Name: production_companies, dtype: int64
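
The query above filters with `.where(...)`, which keeps every row and fills the non-matching ones with `NaN`. The counts still come out right because `value_counts()` skips `NaN` by default. The sketch below, on invented data, compares that with plain boolean indexing, which drops the non-matching rows outright.

```python
import pandas as pd

# Invented rows in the shape of the one-company-per-record dataframe.
companies = pd.DataFrame({
    "production_companies": ["Studio X", "Studio Y", "Studio X", "Studio Z", "Studio X"],
    "release_year": [2004, 2010, 2015, 2012, 2011],
})
recent = companies.release_year > 2005

masked = companies.where(recent)   # same number of rows, old releases become NaN
filtered = companies[recent]       # only the rows released after 2005

counts_where = masked.production_companies.value_counts()
counts_filter = filtered.production_companies.value_counts()

print(len(masked), len(filtered))
print(counts_filter)
```

Both approaches give the same counts here; boolean indexing is usually the safer habit because later steps such as `mean()` or `len()` never see the placeholder rows.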

### Which 5 movie genres have grossed the most all-time?

In [43]:
genresdf.groupby('genres').revenue.sum().sort_values(ascending=False).head()

genres
Action       173417346979
Adventure    166317625752
Comedy       142141376544
Drama        138895805395
Thriller     121188594087
Name: revenue, dtype: int64

### Who are the top 10 highest-grossing directors?

In [45]:
df.groupby('director').revenue.sum().sort_values(ascending=False).head(10)

director
Steven Spielberg     9018563772
Peter Jackson        6523244659
James Cameron        5841894863
Michael Bay          4917208171
Christopher Nolan    4167548502
David Yates          4154295625
Robert Zemeckis      3869690869
Chris Columbus       3851491668
Tim Burton           3665414624
Ridley Scott         3649996480
Name: revenue, dtype: int64

### Compare the revenue of the highest grossing movies of all time.

In [48]:
df.groupby('original_title').revenue.sum().sort_values(ascending=False).head(10)

original_title
Avatar                                          2781505847
Star Wars: The Force Awakens                    2068178225
Titanic                                         1845034188
The Avengers                                    1568080742
Jurassic World                                  1513528810
Furious 7                                       1506249360
Avengers: Age of Ultron                         1405035767
Harry Potter and the Deathly Hallows: Part 2    1327817822
Frozen                                          1277284869
Iron Man 3                                      1215439994
Name: revenue, dtype: int64
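
A bar chart makes this comparison easier to read than the raw numbers. The sketch below plots the top five values from the output above, typed in by hand so that it runs on its own. The axis label, title, and figure size are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Top five revenues copied from the output above.
top_movies = pd.Series(
    {
        "Avatar": 2781505847,
        "Star Wars: The Force Awakens": 2068178225,
        "Titanic": 1845034188,
        "The Avengers": 1568080742,
        "Jurassic World": 1513528810,
    },
    name="revenue",
)

fig, ax = plt.subplots(figsize=(8, 4))

# Sort ascending so the largest bar ends up at the top of a horizontal chart.
(top_movies.sort_values() / 1e9).plot.barh(ax=ax)
ax.set_xlabel("Revenue (billions)")
ax.set_title("Highest-grossing movies in the dataset")
fig.tight_layout()
```

In the notebook itself, the same call can take the grouped series directly, and the `matplotlib.use("Agg")` line can be dropped so the figure displays inline.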

<a id='conclusions'></a>
## Conclusions

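* In this dataset, Avatar is the highest-grossing movie of all time.

* Steven Spielberg is the highest-grossing director of all time, measured by total revenue across his movies.

* Action movies (not to my surprise) are the highest-grossing genre.

* Walt Disney Pictures is not among the top 5 production companies by number of movies released during the last 10 years (2006-2015); it ties for sixth with Twentieth Century Fox Film Corporation at 79 movies.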