 
# IMDB Movie Deep Dive 

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


> The objective of this analysis is to look at the TMDB movie dataset taken from Kaggle to get an understanding of what influences movie grossing, by examining movie ratings, budgets and other properties.

#### Problem Statement

> Several questions guide the analysis:
    - Do the budget, runtime, genres or production companies affect the ratings?
    - What kinds of properties are associated with movies that have high revenues?
    - Have ratings improved or worsened over time?
    - Has the vote count increased over the years as technology adoption has grown?
    - Does the production company influence ratings, spending and revenue?
    - Which are the top 10 production companies by revenue generated?
    - Which are the top 10 highest-grossing movies of all time?
    - Which are the top 10 highest-rated movies of all time?
    - Which genres are most popular from year to year?
 
#### Potential Outcome 

> Based on these questions and a deep dive into the data, recommendations may be drawn to help stakeholders such as producers, production houses and actors make better decisions about movie budgets and casts, and produce better-rated movies. If the primary objective of movie production is profitability, this analysis may help to explain which factors drive it and help struggling production companies become more competitive.

Below is the code to import the packages used for the movie data deep dive.

In [5]:
# import the data handling and plotting libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

> After loading all the packages that will be used for the analysis, the next step is to load the movie dataset itself.

In [6]:
df = pd.read_csv('tmdb-movies.csv')

> Below are the first three rows of the movie dataset, to show what the data looks like and to confirm that the correct data was loaded.

In [7]:
df.head(3)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0


> The next part is to get an idea of the columns in the dataset as well as the data types associated.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64


### General Properties

> The original data set has 21 variables. They are listed below with their data types in brackets:
    - id (int64) : The unique identifier of the movie
    - imdb_id (object) : The unique identifier of the movie on IMDB
    - popularity (float64) : Popularity of the movie
    - budget (int64) : The cost associated with the production of the movie
    - revenue (int64) : The amount generated and earned by the movie
    - original_title (object) : The name or title of the movie
    - cast (object) : The actors who appeared in the movie
    - homepage (object) : URL of the movie's website
    - director (object) : Individual(s) who oversaw the production of the movie
    - tagline (object) : A catchphrase associated with the movie
    - keywords (object) : Words associated with, or that best describe, the movie
    - overview (object) : Brief description of the movie
    - runtime (int64) : Duration of the movie
    - genres (object) : Genres of the movie, such as comedy, action, etc.
    - production_companies (object) : Names of the companies that produced the movie
    - release_date (object) : Release date of the movie
    - vote_count (int64) : Number of votes received by the movie
    - vote_average (float64) : Average vote received for the movie in IMDB
    - release_year (int64) : Release year of the movie
    - budget_adj (float64) : The adjusted cost associated with the production of the movie
    - revenue_adj (float64) : The adjusted amount generated and earned by the movie
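
> The first rows of the dataset show that cast, genres and production_companies pack several values into a single string separated by `|`. Below is a minimal sketch of how such a column could be unpacked into one row per company before ranking by revenue. The frame, titles, company names and revenue figures are hypothetical placeholders, not values from the dataset.

```python
import pandas as pd

# Toy data: titles, company names and revenues are hypothetical placeholders.
movies = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "production_companies": ["Studio X|Studio Y", "Studio X", "Studio Z|Studio Y"],
    "revenue": [300, 100, 50],
})

# Split each string on '|' into a list, then give every company its own row.
per_company = (
    movies.assign(company=movies["production_companies"].str.split("|"))
          .explode("company")
)

# Sum revenue per company; a co-produced movie counts fully for each company.
revenue_by_company = (
    per_company.groupby("company")["revenue"].sum().sort_values(ascending=False)
)
print(revenue_by_company.head(10))
```

> Crediting the full revenue to every co-producer is one possible choice; splitting it evenly between companies would be another, and the ranking can differ between the two.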
    

_Note: The original data set from Kaggle provides no information about the units of budget and revenue. I have assumed them to be in US dollars._

### Data Cleaning

> As a first step, the shape of the dataset (number of rows and columns) is a useful starting point.

In [9]:
df.shape

(10866, 21)

> The df.info() output points to some data type issues that may need fixing. The release_date column is stored as an object (string) and release_year as an integer, rather than as timestamps. To rectify this, both columns are converted to timestamps.

In [11]:
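# release_date holds strings such as '6/9/15'; parse them into timestamps
df['release_date'] = pd.to_datetime(df['release_date'])
# release_year holds integers; without format='%Y' they would be read as nanoseconds since 1970
df['release_year'] = pd.to_datetime(df['release_year'].astype(str), format='%Y')
# budget is already numeric (int64) and is left unchanged; float(df['budget']) raises
# "TypeError: cannot convert the series to <class 'float'>", df['budget'].astype(float) is the working form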
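
> Converting integer years and short date strings needs some care. The sketch below uses a small hypothetical frame named `sample` (its values are copied from the three preview rows) to show how `pd.to_datetime` treats each column with and without an explicit format.

```python
import pandas as pd

# Toy frame in the same shape as the dataset's two date-related columns.
sample = pd.DataFrame({
    "release_date": ["6/9/15", "5/13/15", "3/18/15"],
    "release_year": [2015, 2015, 2015],
})

# Without a format, integers are interpreted as nanoseconds since 1970-01-01.
naive = pd.to_datetime(sample["release_year"])
print(naive.dt.year.unique())

# Going through strings with format="%Y" yields 1 January of each year.
as_year = pd.to_datetime(sample["release_year"].astype(str), format="%Y")
print(as_year.dt.year.unique())

# An explicit month/day/two-digit-year format removes any parsing ambiguity.
as_date = pd.to_datetime(sample["release_date"], format="%m/%d/%y")
print(as_date.dt.strftime("%Y-%m-%d").tolist())
```

> One caveat: with `%y`, two-digit years from 69 to 99 map to 1969-1999 and 00 to 68 map to 2000-2068, so any release before 1969 would land in the wrong century and would need correcting, for example by comparing against release_year.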

> Not all of the columns will be used in the analysis, so the columns that are not of interest are dropped.

In [96]:
df.drop(['homepage','cast','overview','tagline','keywords','id','imdb_id','genres'], axis=1, inplace=True)

In [97]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 13 columns):
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
director                10822 non-null object
runtime                 10866 non-null int64
production_companies    9836 non-null object
release_date            10866 non-null datetime64[ns]
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null datetime64[ns]
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: datetime64[ns](2), float64(4), int64(4), object(3)
memory usage: 1.1+ MB


> Check which variables have missing values; True is returned for every column that contains at least one.

In [98]:
df.isnull().any()

popularity              False
budget                  False
revenue                 False
original_title          False
director                 True
runtime                 False
production_companies     True
release_date            False
vote_count              False
vote_average            False
release_year            False
budget_adj              False
revenue_adj             False
dtype: bool

> Based on the composition of the dataset, the next thing to look at is missing values. None of the numeric variables has missing data. However, some string data is missing, namely in _production companies_ and _director_. The blanks will be filled in with the word '_unknown_'.

In [99]:
df.production_companies = df['production_companies'].fillna('unknown')
df.director = df['director'].fillna('unknown')
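
> Counting the gaps per column, rather than only flagging that they exist, shows how much data the placeholder will cover. Below is a minimal sketch on a hypothetical three-row frame named `credits`; the names in it are placeholders, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical names; NaN marks a missing entry.
credits = pd.DataFrame({
    "director": ["Director A", np.nan, "Director B"],
    "production_companies": [np.nan, np.nan, "Studio X"],
    "runtime": [124, 120, 119],
})

# Count missing entries per column.
missing_counts = credits.isnull().sum()
print(missing_counts)

# Fill only the text columns; numeric columns are left untouched.
text_cols = ["director", "production_companies"]
credits[text_cols] = credits[text_cols].fillna("unknown")
print(credits)
```

> Note that 'unknown' then behaves like any other director or company in a later group-by, so it may be worth excluding when ranking.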

> Another thing to look at is duplicates.

In [100]:
df.duplicated().sum()

1

> There is one duplicate row. The code below removes it.

In [101]:
# drop duplicate rows, keeping the first occurrence of each
df.drop_duplicates(inplace = True)

> To check that the row has been dropped, look at the shape again:

In [102]:
df.shape

(10865, 13)
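
> Because id and imdb_id were dropped earlier, rows are compared on the remaining 13 columns only, so it can be worth inspecting the duplicated rows before removing them. Below is a minimal sketch on a hypothetical frame named `films` (placeholder titles, not dataset values).

```python
import pandas as pd

# Toy frame in which the last row repeats the first.
films = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie A"],
    "runtime": [100, 95, 100],
})

# keep=False marks every member of a duplicated group, so both copies are shown.
print(films[films.duplicated(keep=False)])

# drop_duplicates keeps the first occurrence by default and removes the rest.
deduped = films.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```

> The `reset_index(drop=True)` call is optional; without it the index keeps a gap where the removed row used to be.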

In [103]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 13 columns):
popularity              10865 non-null float64
budget                  10865 non-null int64
revenue                 10865 non-null int64
original_title          10865 non-null object
director                10865 non-null object
runtime                 10865 non-null int64
production_companies    10865 non-null object
release_date            10865 non-null datetime64[ns]
vote_count              10865 non-null int64
vote_average            10865 non-null float64
release_year            10865 non-null datetime64[ns]
budget_adj              10865 non-null float64
revenue_adj             10865 non-null float64
dtypes: datetime64[ns](2), float64(4), int64(4), object(3)
memory usage: 1.2+ MB


In [87]:
df.describe()

Unnamed: 0,popularity,budget,revenue,runtime,vote_count,vote_average,budget_adj,revenue_adj
count,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0
mean,0.646441,14625700.0,39823320.0,102.070863,217.389748,5.974922,17551040.0,51364360.0
std,1.000185,30913210.0,117003500.0,31.381405,575.619058,0.935142,34306160.0,144632500.0
min,6.5e-05,0.0,0.0,0.0,10.0,1.5,0.0,0.0
25%,0.207583,0.0,0.0,90.0,17.0,5.4,0.0,0.0
50%,0.383856,0.0,0.0,99.0,38.0,6.0,0.0,0.0
75%,0.713817,15000000.0,24000000.0,111.0,145.75,6.6,20853250.0,33697100.0
max,32.985763,425000000.0,2781506000.0,900.0,9767.0,9.2,425000000.0,2827124000.0


In [105]:
df.corr()

Unnamed: 0,popularity,budget,revenue,runtime,vote_count,vote_average,budget_adj,revenue_adj
popularity,1.0,0.545481,0.66336,0.139032,0.800828,0.209517,0.513555,0.609085
budget,0.545481,1.0,0.734928,0.1913,0.632719,0.081067,0.968963,0.622531
revenue,0.66336,0.734928,1.0,0.16283,0.791174,0.172541,0.706446,0.919109
runtime,0.139032,0.1913,0.16283,1.0,0.163273,0.156813,0.221127,0.175668
vote_count,0.800828,0.632719,0.791174,0.163273,1.0,0.253818,0.587062,0.707941
vote_average,0.209517,0.081067,0.172541,0.156813,0.253818,1.0,0.093079,0.193062
budget_adj,0.513555,0.968963,0.706446,0.221127,0.587062,0.093079,1.0,0.646627
revenue_adj,0.609085,0.622531,0.919109,0.175668,0.707941,0.193062,0.646627,1.0
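
> The full matrix is easier to read when a single row is pulled out and sorted. The sketch below copies the revenue row from the output above into a Series (the variable names `revenue_corr` and `ranked` are illustrative) and ranks the other variables by their correlation with revenue.

```python
import pandas as pd

# The 'revenue' row copied from the correlation matrix above.
revenue_corr = pd.Series({
    "popularity": 0.66336, "budget": 0.734928, "revenue": 1.0,
    "runtime": 0.16283, "vote_count": 0.791174, "vote_average": 0.172541,
    "budget_adj": 0.706446, "revenue_adj": 0.919109,
})

# Drop the self-correlation and the adjusted counterpart of revenue, then sort.
ranked = revenue_corr.drop(["revenue", "revenue_adj"]).sort_values(ascending=False)
print(ranked)
```

> On the full data frame the same ranking would come from `df.corr()['revenue']`; recent pandas versions may need `df.corr(numeric_only=True)` because the frame still contains text columns. These are linear correlations, so they indicate association rather than cause.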
