# Project: TMDb Movie Data Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

This dataset contains information about more than 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. You can read more about it [here](https://www.kaggle.com/tmdb/tmdb-movie-metadata).

### Questions
- Which genres are most popular from year to year?
- What kinds of properties are associated with movies that have high revenues?

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

# Setting up the general theme of charts and color palette to use
sns.set_theme(style='white', palette='Set2')
base_color = '#00334e'
title_font = {'fontsize': 20, 'fontweight':'bold'}
axis_font = {'fontsize': 14}

# To display charts in the same notebook
%matplotlib inline

# Pandas display options
pd.options.display.max_columns = None
pd.options.display.max_rows = None

<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [2]:
# Load the data and display the first few rows
tmdb = pd.read_csv('data/tmdb-movies.csv')
tmdb.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,monster|dna|tyrannosaurus rex|velociraptor|island,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,future|chase|post-apocalyptic|dystopia|australia,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,based on novel|revolution|dystopia|sequel|dyst...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,android|spaceship|jedi|space opera|3d,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,car race|speed|revenge|suspense|car,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


In [3]:
# Display the dataset information
tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date       

In [5]:
# Display the number of missing values for each column or feature
tmdb.isnull().sum()

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

In [6]:
# Display summary statistics for the numeric features of the data
tmdb.describe()

Unnamed: 0,id,popularity,budget,revenue,runtime,vote_count,vote_average,release_year,budget_adj,revenue_adj
count,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0
mean,66064.177434,0.646441,14625700.0,39823320.0,102.070863,217.389748,5.974922,2001.322658,17551040.0,51364360.0
std,92130.136561,1.000185,30913210.0,117003500.0,31.381405,575.619058,0.935142,12.812941,34306160.0,144632500.0
min,5.0,6.5e-05,0.0,0.0,0.0,10.0,1.5,1960.0,0.0,0.0
25%,10596.25,0.207583,0.0,0.0,90.0,17.0,5.4,1995.0,0.0,0.0
50%,20669.0,0.383856,0.0,0.0,99.0,38.0,6.0,2006.0,0.0,0.0
75%,75610.0,0.713817,15000000.0,24000000.0,111.0,145.75,6.6,2011.0,20853250.0,33697100.0
max,417859.0,32.985763,425000000.0,2781506000.0,900.0,9767.0,9.2,2015.0,425000000.0,2827124000.0


In [39]:
# Display movies with a runtime of more than 3 hours
long_movies = tmdb[tmdb.runtime > 180]
print(f'There are {long_movies.shape[0]} movies longer than 3 hours')
long_movies

There are 128 movies longer than 3 hours


Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
415,340968,tt2492296,0.249595,0,0,Show Me a Hero,Oscar Isaac|Alfred Molina|Winona Ryder|Catheri...,,Paul Haggis,How does a politician know he's doing the righ...,mayor|politics|murder|tv mini-series|racism,Mayor Nick Wasicsko took office in 1987 during...,300,History|Crime|Drama,,8/16/15,32,7.7,2015,0.0,0.0
559,373977,tt4146128,0.031635,0,0,Childhood's End,Mike Vogel|Osy Ikhile|Daisy Betts|Georgina Hai...,,Nick Hurran,,,"After peaceful aliens invade earth, humanity f...",246,Thriller|TV Movie|Science Fiction|Drama,,12/14/15,21,6.2,2015,0.0,0.0
609,321640,tt4299972,0.033378,0,0,The Jinx: The Life and Deaths of Robert Durst,Robert Durst|Andrew Jarecki|Marc Smerling|Zach...,,Andrew Jarecki,Four Decades. Three Murders. And One Very Rich...,murder|crime|real life,"Robert Durst, scion of one of New York’s bil...",240,Documentary,Blumhouse Productions|Hit the Ground Running F...,2/8/15,72,8.4,2015,0.0,0.0
989,289314,tt3012698,0.369555,0,0,Olive Kitteridge,Frances McDormand|Richard Jenkins|Bill Murray|...,,Lisa Cholodenko,There's no such thing as a simple life.,woman director,The story focuses on a middle-school math teac...,233,Drama,Home Box Office (HBO)|Playtone Productions|As ...,11/2/14,41,7.1,2014,0.0,0.0
1077,289394,tt3132738,0.342044,0,0,Houdini,Adrien Brody|Kristen Connolly|Evan Jones|Tom B...,,Uli Edel,,magic|biography|houdini,Follow the man behind the magic as he finds fa...,210,TV Movie|Drama|History,A&E Television Networks|Lionsgate Television,9/1/14,53,7.1,2014,0.0,0.0
1183,312497,tt3696720,0.028695,0,0,Ascension,Tricia Helfer|Gil Bellows|Brian Van Holt|Andre...,,Mairzee Almas|Nick Copus|Robert Lieberman,Be part of mankind's last hope.,woman director,"In this three-part miniseries, a young woman's...",282,Drama|Science Fiction|TV Movie,,12/15/14,30,5.5,2014,0.0,0.0
1235,242754,tt2761630,0.093377,0,0,Klondike,Richard Madden|Abbie Cornish|Sam Shepard|Tim R...,http://www.klondiketv.com,Simon Cellan Jones,Stake your claim.,gold rush|tv mini-series,The story centers on the friendship of two adv...,285,Drama|History,Scott Free Productions|Discovery Channel|E1 En...,1/20/14,17,6.7,2014,0.0,0.0
1678,61872,tt1461312,0.342084,0,0,Alice,Caterina Scorsone|Kathy Bates|Andrew-Lee Potts...,,Nick Willing,Welcome to a whole new Wonderland.,,The story takes place in Wonderland 150 years ...,240,Fantasy|Drama|Science Fiction,,12/6/09,32,6.0,2009,0.0,0.0
1802,183894,tt1366312,0.189207,0,0,Emma,Romola Garai|Michael Gambon|Jonny Lee Miller|L...,,Jim O'Hanlon,,,"Emma Woodhouse seems to be perfectly content, ...",240,Romance|Comedy|Drama,,10/11/09,17,7.6,2009,0.0,0.0
1865,220903,tt1533395,0.102223,0,0,Life,David Attenborough|Oprah Winfrey,http://www.bbc.co.uk/programmes/b00lbpcy,Martha Holmes|Simon Blakeney|Stephen Lyle,From the Makers of Planet Earth,plants|animal species|biology|wildlife|ecology,David Attenborough's legendary BBC crew explai...,500,Documentary,British Broadcasting Corporation (BBC),12/14/09,24,7.0,2009,0.0,0.0


In [44]:
# Display movies with a runtime of less than 1 hour
short_movies = tmdb[tmdb.runtime < 60]
print(f'There are {short_movies.shape[0]} movies shorter than 1 hour')
short_movies

There are 318 movies shorter than 1 hour


Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
92,370687,tt3608646,1.876037,0,0,Mythica: The Necromancer,Melanie Stone|Adam Johnson|Kevin Sorbo|Nicola ...,http://www.mythicamovie.com/#!blank/y9ake,A. Todd Smith,,sword|magic|sorcery|necromancer,Mallister takes Thane prisoner and forces Mare...,0,Fantasy|Action|Adventure,Arrowstorm Entertainment|Camera 40 Productions...,12/19/15,11,5.4,2015,0.0,0.0
100,326359,tt4007502,1.724712,0,0,Frozen Fever,Kristen Bell|Idina Menzel|Jonathan Groff|Josh ...,,Chris Buck|Jennifer Lee,,sister sister relationship|birthday|song|birth...,"On Anna's birthday, Elsa and Kristoff are dete...",8,Adventure|Animation|Family,Walt Disney Pictures|Walt Disney Animation Stu...,3/9/15,475,7.0,2015,0.0,0.0
159,251516,tt3472226,0.953046,630019,0,Kung Fury,David Sandberg|Jorma Taccone|Leopold Nilsson|A...,http://www.kungfury.com/,David Sandberg,It takes a cop from the future to fight an ene...,video game|martial arts|kung fu|hacker|nazis,"During an unfortunate series of events, a frie...",31,Action|Comedy|Science Fiction|Fantasy,Laser Unicorns,5/28/15,487,7.7,2015,579617.2,0.0
181,322456,tt4189260,0.821443,0,0,LEGO DC Comics Super Heroes: Justice League vs...,Nolan North|Troy Baker|Diedrich Bader|Khary Pa...,http://www.lego.com/en-us/dccomicssuperheroes,Brandon Vietti,There are two sides to every hero,dc comics|based on comic book|super powers|lego,"Superman’s clone, Bizarro, has become an emb...",48,Action|Adventure|Animation|Family,Warner Bros. Animation,2/10/15,14,6.4,2015,0.0,0.0
216,286192,tt3824386,0.640151,0,0,Lava,Napua Greig|Kuana Torres Kahele,http://www.pixar.com/short_films/Theatrical-Sh...,James Ford Murphy,,pixar animated short|animation|pixar|short,The story follows the love story of two volcan...,7,Animation|Comedy|Family|Fantasy|Music,Pixar Animation Studios,6/19/15,298,7.3,2015,0.0,0.0
279,355338,tt4941804,0.442835,0,0,Riley's First Date?,Amy Poehler|Phyllis Smith|Bill Hader|Lewis Bla...,,Josh Cooley,,mother daughter relationship|rock music|girl|f...,"Riley, now 12, is hanging out at home with her...",5,Animation|Family,Walt Disney Pictures|Pixar Animation Studios,11/3/15,137,7.3,2015,0.0,0.0
284,364067,tt4537842,0.439598,0,0,A Very Murray Christmas,Bill Murray|Paul Shaffer|George Clooney|Miley ...,http://www.netflix.com/title/80042368,Sofia Coppola,,woman director|christmas,Bill Murray worries no one will show up to his...,56,Comedy|Music,American Zoetrope|South Beach Productions|Depa...,12/4/15,101,5.4,2015,0.0,0.0
305,359983,tt4938602,0.250209,0,0,The Lion Guard: Return of the Roar,Max Charles|Jeff Bennett|Dusan Brown|Sarah Hyl...,,Howy Parkins,,,"Set in the African savannah, the film follows ...",44,Family|TV Movie|Animation,Walt Disney Television Animation,11/22/15,48,5.9,2015,0.0,0.0
334,361931,tt5065822,0.357654,0,0,Ronaldo,Cristiano Ronaldo,http://www.ronaldothefilm.com,Anthony Wonke,Astonishing. Intimate. Definitive.,biography|soccer player,Filmed over 14 months with unprecedented acces...,0,Documentary,"On The Corner Films|We Came, We Saw, We Conque...",11/9/15,80,6.5,2015,0.0,0.0
343,366142,tt5223342,0.344994,0,0,Minions: The Competition,Pierre Coffin|Chris Renaud,,Kyle Balda|Julien Soret,,minions,Two minions working in a bomb lab get competit...,4,Animation,Illumination Entertainment,11/4/15,16,5.9,2015,0.0,0.0


In [50]:
# Display the number of movies with an adjusted budget of 0 USD
print(f'There are {tmdb[tmdb.budget_adj == 0].shape[0]} movies with a budget of 0 USD')

There are 5696 movies with a budget of 0 USD


In [51]:
# Display the number of movies with an adjusted revenue of 0 USD
print(f'There are {tmdb[tmdb.revenue_adj == 0].shape[0]} movies with a revenue of 0 USD')

There are 6016 movies with a revenue of 0 USD
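
The zero counts above, together with the zero medians in the summary statistics, suggest that 0 may stand in for an unreported figure instead of a true value. That reading is an assumption; the dataset description does not confirm it. Under that assumption, a minimal sketch of marking zeros as missing could look like this. The `money` frame and its values are made up for illustration and are not taken from the TMDb file.

```python
import numpy as np
import pandas as pd

# Synthetic example values (hypothetical, not from the TMDb file)
money = pd.DataFrame({
    'budget_adj': [0.0, 1.5e7, 0.0, 2.0e8],
    'revenue_adj': [0.0, 0.0, 3.0e6, 1.4e9],
})

# Assumption: a value of exactly 0 means "not reported"
money_na = money.replace(0, np.nan)

# Count the values now flagged as missing in each column
missing_per_column = money_na.isna().sum()

# Keep only the rows where both figures are known
known = money_na.dropna(subset=['budget_adj', 'revenue_adj'])

print(missing_per_column)
print(f'{len(known)} row(s) with both budget and revenue known')
```

Whether to drop such rows or keep them depends on the question: they are unusable for revenue analysis but may still be useful for genre or popularity analysis.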


In [32]:
# Display the number of unique cast strings (each value is a pipe-separated list of actors)
tmdb.cast.nunique()

10719

In [34]:
# Display the unique genres in the dataset
tmdb.genres.dropna().str.split('|').apply(lambda x: pd.Series(x).value_counts()).fillna(0).columns

Index(['Science Fiction', 'Adventure', 'Thriller', 'Action', 'Fantasy',
       'Crime', 'Western', 'Drama', 'Comedy', 'Family', 'Animation', 'Mystery',
       'Romance', 'War', 'History', 'Music', 'Horror', 'Documentary',
       'TV Movie', 'Foreign'],
      dtype='object')
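
As a possible alternative to the row-wise `apply` above, the pipe-separated strings can be split and flattened with `explode`, which also gives a count per genre. This is only a sketch on a small made-up Series (`genres` here is hypothetical sample data, not the real column).

```python
import pandas as pd

# Made-up sample in the same pipe-separated format as the genres column
genres = pd.Series(['Action|Adventure', 'Drama', None, 'Action|Drama'])

# Split each string into a list, then put every genre on its own row
flat_genres = genres.dropna().str.split('|').explode()

# Number of movies tagged with each genre
genre_counts = flat_genres.value_counts()

print(sorted(flat_genres.unique()))
print(genre_counts)
```

On the full dataset the same chain would be applied to `tmdb.genres` instead of the sample Series.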

### Data Cleaning 

- I will keep the following columns for my investigation and drop the rest: `popularity`, `original_title`, `cast`, `director`, `runtime`, `genres`, `vote_count`, `vote_average`, `release_year`, `budget_adj`, `revenue_adj`.
- Drop rows with missing values.
- Remove any movie shorter than 1 hour or longer than 3 hours.
- Convert the `genres` column to a one-hot encoding.
- Create a new feature, `profit/loss`, by subtracting `budget_adj` from `revenue_adj`.
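
The steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's final cleaning code: the `sample` frame is synthetic, `clean_movies` is a hypothetical helper name, and treating the 1-hour and 3-hour limits as inclusive is my own choice.

```python
import pandas as pd

# Tiny synthetic stand-in for the TMDb table (values are made up)
sample = pd.DataFrame({
    'original_title': ['A', 'B', 'C', 'D', 'E'],
    'runtime': [124, 8, 300, 95, 110],
    'genres': ['Action|Adventure', 'Animation', 'Drama', None, 'Drama|Action'],
    'budget_adj': [100.0, 10.0, 50.0, 20.0, 30.0],
    'revenue_adj': [250.0, 5.0, 40.0, 60.0, 10.0],
})

def clean_movies(df, min_runtime=60, max_runtime=180):
    """Hypothetical helper applying the cleaning steps to a TMDb-like frame."""
    # Drop rows with any missing value
    out = df.dropna().copy()
    # Keep movies between 1 and 3 hours long (inclusive bounds are an assumption)
    out = out[out['runtime'].between(min_runtime, max_runtime)]
    # One-hot encode the pipe-separated genres column
    out = out.join(out['genres'].str.get_dummies(sep='|'))
    # Positive values are profit, negative values are loss
    out['profit/loss'] = out['revenue_adj'] - out['budget_adj']
    return out

cleaned = clean_movies(sample)
print(cleaned)
```

On the real data the helper would be called on the column subset selected from `tmdb`, after deciding how to handle the many zero budget and revenue values.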

In [8]:
keep_cols = ['popularity', 'original_title', 'cast', 'director', 
             'runtime', 'genres', 'vote_count', 'vote_average', 
             'release_year', 'budget_adj', 'revenue_adj']

In [9]:
# Select the needed columns first, then copy only that subset
clean_tmdb = tmdb[keep_cols].copy()

In [10]:
clean_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   popularity      10866 non-null  float64
 1   original_title  10866 non-null  object 
 2   cast            10790 non-null  object 
 3   director        10822 non-null  object 
 4   runtime         10866 non-null  int64  
 5   genres          10843 non-null  object 
 6   vote_count      10866 non-null  int64  
 7   vote_average    10866 non-null  float64
 8   release_year    10866 non-null  int64  
 9   budget_adj      10866 non-null  float64
 10  revenue_adj     10866 non-null  float64
dtypes: float64(4), int64(3), object(4)
memory usage: 933.9+ KB


In [None]:
# Preview a one-hot style encoding of the genres column
tmdb.genres.dropna().str.split('|').apply(lambda x: pd.Series(x).value_counts()).fillna(0).head()

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1: Which genres are most popular from year to year?
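
One possible way to approach this question is sketched below on synthetic data. It assumes that "most popular" means the highest mean `popularity` score among the movies of a genre in a given year; other definitions (for example, the number of releases) would be equally defensible. The `sample` frame and its values are made up.

```python
import pandas as pd

# Synthetic sample with the same column names as the dataset (values are made up)
sample = pd.DataFrame({
    'release_year': [2014, 2014, 2014, 2015, 2015],
    'genres': ['Action|Thriller', 'Drama', 'Action', 'Action|Adventure', 'Drama'],
    'popularity': [2.0, 1.0, 4.0, 3.0, 5.0],
})

# One row per (movie, genre) pair
long_form = sample.assign(genre=sample['genres'].str.split('|')).explode('genre')

# Mean popularity of each genre in each year
yearly = long_form.groupby(['release_year', 'genre'])['popularity'].mean()

# For each year, pick the genre with the highest mean popularity;
# idxmax returns (year, genre) tuples, so keep only the genre part
top_genre = yearly.groupby(level='release_year').idxmax().map(lambda pair: pair[1])

print(top_genre)
```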

### Research Question 2: What kinds of properties are associated with movies that have high revenues?
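
A simple first look at this question could rank the numeric columns by their correlation with `revenue_adj`. The sketch below uses a synthetic frame with made-up values; it shows the mechanics only and says nothing about the real data. A correlation found this way would describe association, not causation.

```python
import pandas as pd

# Synthetic sample with a few of the dataset's numeric columns (values are made up)
sample = pd.DataFrame({
    'budget_adj': [1.0, 2.0, 3.0, 4.0, 5.0],
    'vote_average': [6.0, 5.5, 7.0, 6.5, 6.0],
    'runtime': [90, 100, 95, 120, 110],
    'revenue_adj': [2.0, 4.1, 5.9, 8.2, 9.8],
})

# Pearson correlation of every column with revenue_adj, strongest first
corr_with_revenue = (
    sample.corr()['revenue_adj']
    .drop('revenue_adj')
    .sort_values(ascending=False)
)

print(corr_with_revenue)
```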

<a id='conclusions'></a>
## Conclusions
