
# Project: Investigate TMDB Movie Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


In this report, I present my investigation of the TMDB movie dataset and my findings by answering a set of questions.

### Assumptions
According to the [TMDB Movie Data metadata page](https://www.kaggle.com/tmdb/tmdb-movie-metadata), actors and actresses are listed in descending order of their billing. I therefore treat the first name listed in the `cast` column as the leading actor/actress of each movie when analysing the correlation between popularity and the actor/actress, and between revenue and the actor/actress.
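
A minimal sketch of this assumption, using a small made-up sample rather than the real file: the column name `lead_actor` and the id `tt0000000` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical three-row sample that mimics the pipe-delimited 'cast' column.
sample = pd.DataFrame({
    'imdb_id': ['tt0369610', 'tt1392190', 'tt0000000'],
    'cast': ['Chris Pratt|Bryce Dallas Howard|Irrfan Khan',
             'Tom Hardy|Charlize Theron',
             None],
})

# Split on '|' and keep the first name; a missing cast stays missing.
sample['lead_actor'] = sample['cast'].str.split('|').str[0]
print(sample[['imdb_id', 'lead_actor']])
```

Rows without any cast information keep a missing `lead_actor`, so they can be excluded later instead of being attributed to the wrong person.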

For the majority of movies, there is a single director. For movies with multiple directors, I consider all of them, rather than only the first one listed, when analysing the correlation between popularity and directors. The same applies when analysing the correlation between revenue and directors.
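
One possible way to consider every director is to give each of them a row of their own. The sketch below assumes that multiple directors are separated by `|`, as the names in the `cast` column are; the sample values are made up.

```python
import pandas as pd

# Hypothetical sample: the first movie has two directors, the second has one.
sample = pd.DataFrame({
    'imdb_id': ['tt0000001', 'tt0000002'],
    'director': ['Director A|Director B', 'Director C'],
    'popularity': [2.5, 1.0],
})

# Turn each director string into a list, then emit one row per director.
directors = (
    sample.assign(director=sample['director'].str.split('|'))
          .explode('director')
          .reset_index(drop=True)
)
print(directors)
```

Each co-director inherits the movie's full popularity in this layout, which is a modelling choice rather than something the dataset prescribes.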

I will use imdb_id as the primary key of each movie. Rows with an empty or duplicate imdb_id are therefore considered bad data and will be removed in the <a href="#cleaning">data cleaning</a> process.

### Questions to be Answered
These are the questions I am going to answer:

1. Who are the actors/actresses in the most popular movies?
2. Who are the actors/actresses in the movies with the most revenue?
3. Who are the directors of the most popular movies?
4. Who are the directors of the movies with the most revenue?
5. For an individual actor/actress, is there any difference between popularity and revenue?
6. Does popularity have a positive correlation with vote_average?
7. In general, how are popularity and revenue related? (double line chart)




In [1]:
# Import the packages used throughout this notebook.
import pandas as pd
pd.options.display.max_columns = None

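import sys
import numpy as np
# threshold=np.nan raises a ValueError in recent NumPy releases; sys.maxsize prints arrays in full.
np.set_printoptions(threshold=sys.maxsize)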

import matplotlib as mp
import unicodecsv

# Display the output of every expression in a cell, not only the last one.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Render matplotlib plots inline in the notebook.
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling


### General Properties

In the cell below, I load the CSV file and print the first 5 rows and the data type of each column:

In [2]:
# Load the data, preview the first rows, and inspect the column data types.
movies_df_original = pd.read_csv('tmdb-movies.csv')
movies_df_original.head()
movies_df_original.dtypes


Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,monster|dna|tyrannosaurus rex|velociraptor|island,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,future|chase|post-apocalyptic|dystopia|australia,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,based on novel|revolution|dystopia|sequel|dyst...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,android|spaceship|jedi|space opera|3d,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,car race|speed|revenge|suspense|car,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


id                        int64
imdb_id                  object
popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                   int64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object

Next, I check how many rows the loaded dataframe has in total, as well as the number of non-NaN cells in each column:

In [3]:
### shape[0] is the total number of rows in the dataframe
print('Number of rows in the original movie dataset: %s' % movies_df_original.shape[0])

### Next I want to see the count of non-NaN rows in each column
print('Number of non-NaN rows in each column:')
movies_df_original.count()

Number of rows in the original movie dataset: 10866
Number of non-NaN rows in each column:


id                      10866
imdb_id                 10856
popularity              10866
budget                  10866
revenue                 10866
original_title          10866
cast                    10790
homepage                 2936
director                10822
tagline                  8042
keywords                 9373
overview                10862
runtime                 10866
genres                  10843
production_companies     9836
release_date            10866
vote_count              10866
vote_average            10866
release_year            10866
budget_adj              10866
revenue_adj             10866
dtype: int64

The output shows that 10856 rows have a non-NaN imdb_id, whereas the dataframe has 10866 rows in total. This means that 10 rows have a NaN imdb_id.
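
The same number can be read off directly instead of subtracting the two counts. A minimal sketch on a made-up four-row sample (not the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing imdb_id values.
sample = pd.DataFrame({'imdb_id': ['tt0369610', np.nan, 'tt1392190', np.nan]})

total_rows = len(sample)                  # all rows, including NaN
non_null = sample['imdb_id'].count()      # count() skips NaN
missing = sample['imdb_id'].isnull().sum()

print(total_rows, non_null, missing)
```

Here `total_rows - non_null` and `missing` agree, which is the same arithmetic as 10866 - 10856 = 10 on the full dataframe.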

Later on, when analysing the correlation between popularity/revenue and actors, I will use imdb_id as the unique key to join dataframes, so I do not want imdb_id to be NaN. I will therefore remove these rows and return a new dataframe object, movies_df_cleaned, in the <a href="#cleaning">Data Cleaning</a> section.

Now I am curious to see whether, ignoring the NaN imdb_id values, there are any duplicate imdb_id values:

In [4]:
movies_df_original.groupby('imdb_id').filter(lambda x: len(x) > 1)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
2089,42194,tt0411951,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,,Dwight H. Little,Survival is no game,martial arts|dystopia|based on video game|mart...,"In the year of 2039, after World Wars destroy ...",92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,3/20/10,110,5.0,2010,30000000.0,967000.0
2090,42194,tt0411951,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,,Dwight H. Little,Survival is no game,martial arts|dystopia|based on video game|mart...,"In the year of 2039, after World Wars destroy ...",92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,3/20/10,110,5.0,2010,30000000.0,967000.0


The output shows that imdb_id tt0411951 has a duplicate row. Since the two rows have the same imdb_id, I assume that they are identical in every column. In the <a href="#cleaning">Data Cleaning</a> section I will therefore remove this duplicated row from the movies_df_cleaned dataframe.
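
The assumption that rows sharing an imdb_id are identical in every column can be checked rather than taken on trust. A sketch under that reading, using a reduced sample whose values are copied from the outputs above (the variable names are mine):

```python
import pandas as pd

# Reduced sample: only three columns of the rows shown above.
sample = pd.DataFrame({
    'imdb_id': ['tt0411951', 'tt0411951', 'tt0369610'],
    'original_title': ['TEKKEN', 'TEKKEN', 'Jurassic World'],
    'revenue': [967000, 967000, 1513528810],
})

# Rows that share an imdb_id with at least one other row.
same_key = sample.duplicated(subset=['imdb_id'], keep=False)
# Rows that are identical to another row in every column.
same_row = sample.duplicated(keep=False)

# Keys whose rows share an imdb_id but differ in some other column.
keys_with_conflicts = sample.loc[same_key & ~same_row, 'imdb_id'].unique()
print(list(keys_with_conflicts))
```

An empty result means every repeated imdb_id belongs to fully identical rows, so dropping one copy loses no information.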

Since I am interested in the correlation between revenue and who is acting in the movie, and between popularity and who is acting in the movie, I am going to look at the `revenue` and `popularity` columns to find out whether any data needs to be cleaned:

In [6]:
### idxmax()/idxmin() return the index label of the extreme value, which .loc then looks up
movies_df_original.loc[movies_df_original['revenue'].idxmax()]
movies_df_original.loc[movies_df_original['revenue'].idxmin()]

movies_df_original.loc[movies_df_original['popularity'].idxmax()]
movies_df_original.loc[movies_df_original['popularity'].idxmin()]

id                                                                  19995
imdb_id                                                         tt0499549
popularity                                                        9.43277
budget                                                          237000000
revenue                                                        2781505847
original_title                                                     Avatar
cast                    Sam Worthington|Zoe Saldana|Sigourney Weaver|S...
homepage                                      http://www.avatarmovie.com/
director                                                    James Cameron
tagline                                       Enter the World of Pandora.
keywords                culture clash|future|space war|space colony|so...
overview                In the 22nd century, a paraplegic Marine is di...
runtime                                                               162
genres                           Actio

id                                                                 265208
imdb_id                                                         tt2231253
popularity                                                        2.93234
budget                                                           30000000
revenue                                                                 0
original_title                                                  Wild Card
cast                    Jason Statham|Michael Angarano|Milo Ventimigli...
homepage                                                              NaN
director                                                       Simon West
tagline                       Never bet against a man with a killer hand.
keywords                                        gambling|bodyguard|remake
overview                When a Las Vegas bodyguard with lethal skills ...
runtime                                                                92
genres                                

id                                                                 135397
imdb_id                                                         tt0369610
popularity                                                        32.9858
budget                                                          150000000
revenue                                                        1513528810
original_title                                             Jurassic World
cast                    Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...
homepage                                    http://www.jurassicworld.com/
director                                                  Colin Trevorrow
tagline                                                 The park is open.
keywords                monster|dna|tyrannosaurus rex|velociraptor|island
overview                Twenty-two years after the events of Jurassic ...
runtime                                                               124
genres                          Action

id                                                                  18729
imdb_id                                                         tt0088583
popularity                                                        6.5e-05
budget                                                                  0
revenue                                                                 0
original_title                                    North and South, Book I
cast                    Patrick Swayze|Philip Casnoff|Kirstie Alley|Ge...
homepage                                                              NaN
director                                                              NaN
tagline                                                               NaN
keywords                                                              NaN
overview                Two friends, one northern and one southern, st...
runtime                                                               561
genres                                

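The outputs show that the `revenue` column contains 0s, and that the least popular movie also has a `budget` of 0. The minimum `popularity`, 6.5e-05, is very small but not 0. In the <a href="#cleaning">Data Cleaning</a> section I will remove the rows whose `revenue` is 0.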
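
Looking only at the minimum row does not say how many zeros a column holds. A minimal sketch of counting them, on a three-row sample whose values are copied from the outputs above:

```python
import pandas as pd

# Sample built from the Avatar, Wild Card and North and South rows shown above.
sample = pd.DataFrame({
    'original_title': ['Avatar', 'Wild Card', 'North and South, Book I'],
    'popularity': [9.43277, 2.93234, 0.000065],
    'revenue': [2781505847, 0, 0],
})

# Comparing with 0 gives a boolean frame; summing it counts the zeros per column.
zero_counts = (sample[['popularity', 'revenue']] == 0).sum()
print(zero_counts)
```

Run against the full dataframe, the same expression would show how many rows a zero-revenue filter removes before any are dropped.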

<a id='cleaning'></a>
### Data Cleaning

In this section, I am going to demonstrate the process of
1. Removing rows where imdb_id is NaN
2. Removing rows where imdb_id is duplicate

In [None]:
### First, locate the rows whose imdb_id is NaN:
missing_imdb_id = movies_df_original['imdb_id'].isnull()
np.where(missing_imdb_id)

### Double-check the rows to be removed and confirm that imdb_id is NaN for all of them:
movies_df_original[missing_imdb_id]

In [None]:
### Drop all rows whose imdb_id is NaN (10 rows) and return a new DataFrame instance, movies_df_cleaned:
movies_df_cleaned = movies_df_original.dropna(subset=['imdb_id'], how='any')

### Check that the rows were dropped. The output should be an empty dataframe:
movies_df_cleaned[movies_df_cleaned['imdb_id'].isnull()]

### Check the count of non-NaN rows in each column of movies_df_cleaned and confirm that the NaN imdb_id rows are gone:
movies_df_cleaned.count()

In [None]:
### Now check whether movies_df_cleaned contains any duplicate imdb_id or id values
movies_df_cleaned.groupby('imdb_id').filter(lambda x: len(x) > 1)

movies_df_cleaned.groupby('id').filter(lambda x: len(x) > 1)
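
The cells above only locate the duplicate; the removal itself could look like the sketch below. It keeps the first occurrence of each imdb_id, which is my assumption and is safe only if the duplicated rows are identical. The sample is a reduced copy of the rows shown earlier.

```python
import pandas as pd

# Reduced sample containing the duplicated TEKKEN row.
sample = pd.DataFrame({
    'id': [42194, 42194, 135397],
    'imdb_id': ['tt0411951', 'tt0411951', 'tt0369610'],
    'original_title': ['TEKKEN', 'TEKKEN', 'Jurassic World'],
})

# Keep the first row for each imdb_id and drop the later copies.
deduped = sample.drop_duplicates(subset=['imdb_id'], keep='first')

# imdb_id can now serve as a unique key.
print(len(sample), '->', len(deduped))
print(deduped['imdb_id'].is_unique)
```

On the real data this would be applied to movies_df_cleaned after the NaN rows have been dropped.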

<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1: Who are the actors/actresses in the most popular movies?

In [None]:


### Research Question 2: Who are the actors/actresses in the movies with the most revenue?

In [None]:


<a id='conclusions'></a>
## Conclusions
