# Project: Investigate tmdb-movies Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

In [98]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

> **Tip**: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your steps carefully and justify your cleaning decisions.

### Loading Data

In [99]:
movies_df = pd.read_csv('tmdb-movies.csv')

### General Properties

In [100]:
# get shape of df
movies_df.shape

(10866, 21)

In [101]:
# get the attributes of df in list
list(movies_df.columns)

['id',
 'imdb_id',
 'popularity',
 'budget',
 'revenue',
 'original_title',
 'cast',
 'homepage',
 'director',
 'tagline',
 'keywords',
 'overview',
 'runtime',
 'genres',
 'production_companies',
 'release_date',
 'vote_count',
 'vote_average',
 'release_year',
 'budget_adj',
 'revenue_adj']

In [102]:
# get first 5 rows from the df
movies_df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


In [103]:
movies_df.describe()

| | id | popularity | budget | revenue | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10866.0 | 10866.0 |
| mean | 66064.177434 | 0.646441 | 14625700.0 | 39823320.0 | 102.070863 | 217.389748 | 5.974922 | 2001.322658 | 17551040.0 | 51364360.0 |
| std | 92130.136561 | 1.000185 | 30913210.0 | 117003500.0 | 31.381405 | 575.619058 | 0.935142 | 12.812941 | 34306160.0 | 144632500.0 |
| min | 5.0 | 6.5e-05 | 0.0 | 0.0 | 0.0 | 10.0 | 1.5 | 1960.0 | 0.0 | 0.0 |
| 25% | 10596.25 | 0.207583 | 0.0 | 0.0 | 90.0 | 17.0 | 5.4 | 1995.0 | 0.0 | 0.0 |
| 50% | 20669.0 | 0.383856 | 0.0 | 0.0 | 99.0 | 38.0 | 6.0 | 2006.0 | 0.0 | 0.0 |
| 75% | 75610.0 | 0.713817 | 15000000.0 | 24000000.0 | 111.0 | 145.75 | 6.6 | 2011.0 | 20853250.0 | 33697100.0 |
| max | 417859.0 | 32.985763 | 425000000.0 | 2781506000.0 | 900.0 | 9767.0 | 9.2 | 2015.0 | 425000000.0 | 2827124000.0 |


In [104]:
movies_df.keywords

0        monster|dna|tyrannosaurus rex|velociraptor|island
1         future|chase|post-apocalyptic|dystopia|australia
2        based on novel|revolution|dystopia|sequel|dyst...
3                    android|spaceship|jedi|space opera|3d
4                      car race|speed|revenge|suspense|car
                               ...                        
10861                             surfer|surfboard|surfing
10862                            car race|racing|formula 1
10863                             car|trolley|stealing car
10864                                                spoof
10865                  fire|gun|drive|sacrifice|flashlight
Name: keywords, Length: 10866, dtype: object

In [105]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date       

In [106]:
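# keep only the columns needed for the analysis; .copy() makes an independent
# frame so later in-place edits do not raise SettingWithCopyWarning
movies_df = movies_df[['id', 'popularity', 'budget', 'revenue', 'runtime', 'original_title', 'genres', 'cast', 'director', 'release_date', 'vote_count', 'vote_average', 'release_year']].copy()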

In [107]:
movies_df.shape

(10866, 13)

In [108]:
movies_df.head()

Unnamed: 0,id,popularity,budget,revenue,runtime,original_title,genres,cast,director,release_date,vote_count,vote_average,release_year
0,135397,32.985763,150000000,1513528810,124,Jurassic World,Action|Adventure|Science Fiction|Thriller,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,6/9/15,5562,6.5,2015
1,76341,28.419936,150000000,378436354,120,Mad Max: Fury Road,Action|Adventure|Science Fiction|Thriller,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,5/13/15,6185,7.1,2015
2,262500,13.112507,110000000,295238201,119,Insurgent,Adventure|Science Fiction|Thriller,Shailene Woodley|Theo James|Kate Winslet|Ansel...,Robert Schwentke,3/18/15,2480,6.3,2015
3,140607,11.173104,200000000,2068178225,136,Star Wars: The Force Awakens,Action|Adventure|Science Fiction|Fantasy,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,J.J. Abrams,12/15/15,5292,7.5,2015
4,168259,9.335014,190000000,1506249360,137,Furious 7,Action|Crime|Thriller,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,James Wan,4/1/15,2947,7.3,2015


In [109]:
movies_df.budget.value_counts()

0           5696
20000000     190
15000000     183
25000000     178
10000000     176
            ... 
51500000       1
25500000       1
1350000        1
7920000        1
4653000        1
Name: budget, Length: 557, dtype: int64

> There are 5,696 movies with a budget of zero; these rows will be dropped.
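The zero count can also be computed directly instead of being read off `value_counts()`. The following is a minimal sketch on invented data: `toy_df` is a hypothetical stand-in for `movies_df`, and it assumes a budget of 0 means "not recorded" rather than a genuinely free production.

```python
import pandas as pd

# Hypothetical toy data standing in for movies_df; the values are invented
toy_df = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [0, 20000000, 0, 15000000],
})

# Boolean mask marking rows whose budget is recorded as zero
zero_budget = toy_df["budget"] == 0

n_zero = int(zero_budget.sum())         # number of zero-budget rows
share_zero = float(zero_budget.mean())  # fraction of all rows

print(f"{n_zero} rows ({share_zero:.0%}) have a zero budget")
```

Reporting the share alongside the raw count makes it easier to judge how much of the data set a drop would remove.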

In [110]:
movies_df.revenue.value_counts()

0            6016
12000000       10
10000000        8
11000000        7
6000000         6
             ... 
53676580        1
617000          1
13001257        1
504050219       1
20518905        1
Name: revenue, Length: 4702, dtype: int64

In [111]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              10866 non-null  int64  
 1   popularity      10866 non-null  float64
 2   budget          10866 non-null  int64  
 3   revenue         10866 non-null  int64  
 4   runtime         10866 non-null  int64  
 5   original_title  10866 non-null  object 
 6   genres          10843 non-null  object 
 7   cast            10790 non-null  object 
 8   director        10822 non-null  object 
 9   release_date    10866 non-null  object 
 10  vote_count      10866 non-null  int64  
 11  vote_average    10866 non-null  float64
 12  release_year    10866 non-null  int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 1.1+ MB
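The non-null counts above imply 23 missing `genres`, 76 missing `cast`, and 44 missing `director` values. Before dropping rows, it may be worth checking how many rows are affected in total, since one row can be missing several fields. A minimal sketch on invented data (`toy_df` is hypothetical, not the real data set):

```python
import pandas as pd

# Hypothetical toy data with a few missing entries; the values are invented
toy_df = pd.DataFrame({
    "genres": ["Drama", None, "Comedy"],
    "cast": ["X|Y", "Z", None],
    "director": ["P", "Q", "R"],
})

# Missing values per column
missing_per_column = toy_df.isna().sum()

# Rows that dropna() would remove (at least one missing field)
rows_with_any_missing = int(toy_df.isna().any(axis=1).sum())

print(missing_per_column)
print("rows lost by dropna():", rows_with_any_missing)
```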


In [112]:
# drop rows with null values (reassign instead of modifying in place)
movies_df = movies_df.dropna()

In [113]:
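# change release_date to datetime
# note: the source dates use two-digit years, so '66' is parsed as 2066
# (see the tail() output further down)
movies_df['release_date'] = pd.to_datetime(movies_df['release_date'])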

In [114]:
movies_df.sample(5)

Unnamed: 0,id,popularity,budget,revenue,runtime,original_title,genres,cast,director,release_date,vote_count,vote_average,release_year
4410,62764,1.606999,85000000,183018522,106,Mirror Mirror,Adventure|Fantasy|Drama|Comedy|Science Fiction,Julia Roberts|Lily Collins|Armie Hammer|Nathan...,Tarsem Singh,2012-03-15,718,5.4,2012
2138,48466,0.449761,0,0,21,Scared Shrekless,Animation|Comedy,Mike Myers|Cameron Diaz|Antonio Banderas|Dean ...,Gary Trousdale|Raman Hui,2010-10-28,49,6.4,2010
10076,25018,0.388685,0,5765562,81,Leatherface: Texas Chainsaw Massacre III,Thriller|Horror,Kate Hodge|Ken Foree|R.A. Mihailoff|William Bu...,Jeff Burr,1990-01-12,30,5.0,1990
162,252512,0.93762,10000000,7587485,97,While We're Young,Drama|Comedy,Ben Stiller|Naomi Watts|Adam Driver|Amanda Sey...,Noah Baumbach,2015-03-27,265,5.9,2015
1192,254772,0.266087,0,0,120,Two Men in Town,Crime|Drama,Forest Whitaker|Harvey Keitel|Brenda Blethyn|L...,Rachid Bouchareb,2014-02-06,23,6.1,2014


In [115]:
# treat id as a string label, since it is an identifier and not a quantity
movies_df.id = movies_df.id.astype('str')

In [116]:
# split the pipe-delimited cast string into a list of names
movies_df.cast = movies_df.cast.apply(lambda x: x.split('|'))

In [117]:
# split the pipe-delimited genres string into a list of genres
movies_df.genres = movies_df.genres.apply(lambda x: x.split('|'))

In [118]:
# split the pipe-delimited director string into a list of directors
movies_df.director = movies_df.director.apply(lambda x: x.split('|'))

In [119]:
def objects_count(series):
    """Count how often each item occurs across the list-valued cells of a pandas Series.

    Returns a dict mapping each item to its number of occurrences.
    """
    # flatten the lists and count; dict() keeps keys in first-seen order
    return dict(Counter(item for items in series for item in items))

In [120]:
genre_dict = objects_count(movies_df.genres)

In [121]:
genre_dict

{'Action': 2377,
 'Adventure': 1465,
 'Science Fiction': 1222,
 'Thriller': 2903,
 'Fantasy': 908,
 'Crime': 1354,
 'Western': 164,
 'Drama': 4747,
 'Family': 1214,
 'Animation': 664,
 'Comedy': 3775,
 'Mystery': 808,
 'Romance': 1708,
 'War': 268,
 'History': 330,
 'Music': 399,
 'Horror': 1636,
 'Documentary': 470,
 'TV Movie': 162,
 'Foreign': 184}
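The same tally can be obtained without a helper function. As a sketch of an alternative (not the approach used in this notebook), `Series.explode()` puts each list element on its own row, and `value_counts()` then counts them. `toy_genres` below is a hypothetical stand-in for `movies_df.genres` after the split step:

```python
import pandas as pd

# Hypothetical list-valued Series standing in for movies_df.genres
toy_genres = pd.Series([
    ["Action", "Thriller"],
    ["Drama"],
    ["Action", "Drama"],
])

# One row per genre, then count occurrences (sorted by frequency)
genre_counts = toy_genres.explode().value_counts()

print(genre_counts)
```

The result is a Series sorted from most to least frequent, which can be passed straight to a bar plot.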

In [122]:
directors_dict = objects_count(movies_df.director)

In [123]:
directors_dict

{'Colin Trevorrow': 2,
 'George Miller': 10,
 'Robert Schwentke': 5,
 'J.J. Abrams': 5,
 'James Wan': 8,
 'Alejandro González Iñárritu': 5,
 'Alan Taylor': 2,
 'Ridley Scott': 23,
 'Kyle Balda': 6,
 'Pierre Coffin': 3,
 'Pete Docter': 4,
 'Sam Mendes': 7,
 'Lana Wachowski': 7,
 'Lilly Wachowski': 7,
 'Alex Garland': 1,
 'Chris Columbus': 14,
 'Joss Whedon': 5,
 'Quentin Tarantino': 14,
 'Olivier Megaton': 4,
 'Peyton Reed': 6,
 'Kenneth Branagh': 10,
 'Francis Lawrence': 6,
 'Brad Bird': 6,
 'Antoine Fuqua': 10,
 'Brad Peyton': 3,
 'Sam Taylor-Johnson': 2,
 'Adam McKay': 7,
 'Christopher McQuarrie': 3,
 'Seth MacFarlane': 3,
 'Matthew Vaughn': 5,
 'Tom McCarthy': 5,
 'Wes Ball': 2,
 'Bill Condon': 8,
 'Neill Blomkamp': 4,
 'Elizabeth Banks': 2,
 'Steven Spielberg': 30,
 'Rob Letterman': 4,
 'Lenny Abrahamson': 3,
 'Afonso Poyart': 1,
 'Peter Sohn': 2,
 'Jaume Collet-Serra': 6,
 'John Crowley': 4,
 'F. Gary Gray': 8,
 'Breck Eisner': 4,
 'Danny Boyle': 11,
 'Guy Ritchie': 8,
 'Lee To

In [124]:
len(directors_dict)

5298
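With several thousand distinct directors, the full dictionary is hard to read. One way to surface the most frequent entries is `collections.Counter.most_common`. The sketch below uses a small invented dictionary (`toy_counts` and its names are hypothetical) in place of `directors_dict`:

```python
from collections import Counter

# Hypothetical counts standing in for directors_dict; names are invented
toy_counts = {"Director A": 3, "Director B": 7, "Director C": 1}

# Wrap the dict in a Counter and take the two largest counts
top_two = Counter(toy_counts).most_common(2)

print(top_two)
```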

In [125]:
movies_df.tail()

Unnamed: 0,id,popularity,budget,revenue,runtime,original_title,genres,cast,director,release_date,vote_count,vote_average,release_year
10861,21,0.080598,0,0,95,The Endless Summer,[Documentary],"[Michael Hynson, Robert August, Lord 'Tally Ho...",[Bruce Brown],2066-06-15,11,7.4,1966
10862,20379,0.065543,0,0,176,Grand Prix,"[Action, Adventure, Drama]","[James Garner, Eva Marie Saint, Yves Montand, ...",[John Frankenheimer],2066-12-21,20,5.7,1966
10863,39768,0.065141,0,0,94,Beregis Avtomobilya,"[Mystery, Comedy]","[Innokentiy Smoktunovskiy, Oleg Efremov, Georg...",[Eldar Ryazanov],2066-01-01,11,6.5,1966
10864,21449,0.064317,0,0,80,"What's Up, Tiger Lily?","[Action, Comedy]","[Tatsuya Mihashi, Akiko Wakabayashi, Mie Hama,...",[Woody Allen],2066-11-02,22,5.4,1966
10865,22293,0.035919,19000,0,74,Manos: The Hands of Fate,[Horror],"[Harold P. Warren, Tom Neyman, John Reynolds, ...",[Harold P. Warren],2066-11-15,15,1.5,1966
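The `release_date` values above read 2066 while `release_year` reads 1966: the source dates carry two-digit years, and the parser placed them in the wrong century. One possible repair, sketched here on invented rows (`toy_df` is hypothetical) and assuming `release_year` is the trustworthy field, is to keep the parsed month and day and take the year from `release_year`:

```python
import pandas as pd

# Hypothetical rows in the same m/d/yy format as the raw data
toy_df = pd.DataFrame({
    "release_date": ["6/15/66", "6/9/15", "1/12/90"],
    "release_year": [1966, 2015, 1990],
})

# Parse with an explicit two-digit-year format; '66' comes out as 2066
parsed = pd.to_datetime(toy_df["release_date"], format="%m/%d/%y")

# Rebuild each date from the trusted release_year plus the parsed month/day
toy_df["release_date"] = pd.to_datetime(pd.DataFrame({
    "year": toy_df["release_year"],
    "month": parsed.dt.month,
    "day": parsed.dt.day,
}))

print(toy_df)
```

Rows that were already in the right century are unchanged, because their parsed year equals `release_year`.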


In [138]:
print("number of movies with no runtime recorded: {}".format(movies_df.runtime[movies_df.runtime == 0].count()))

number of movies with no runtime recorded: 28


> There are 6,016 movies with zero revenue; these rows will be dropped as well.
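The drop itself can be written as a boolean filter. The following is a minimal sketch on invented data (`toy_df` and `clean_df` are hypothetical names), assuming that a zero in either column means the figure was not recorded:

```python
import pandas as pd

# Hypothetical toy data standing in for movies_df; the values are invented
toy_df = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [0, 20000000, 10000000, 15000000],
    "revenue": [0, 50000000, 0, 30000000],
})

# Keep only rows where both budget and revenue are recorded (non-zero)
has_financials = (toy_df["budget"] > 0) & (toy_df["revenue"] > 0)
clean_df = toy_df[has_financials].copy()

print(clean_df)
```

Dropping these rows shrinks the data set considerably. An alternative would be to replace the zeros with `NaN` and keep the rows for questions that do not involve money.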


### Data Cleaning (Replace this with more specific notes!)

In [126]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.


<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

### Research Question 1 (Replace this header name!)

In [127]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [128]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed. Make sure that you are clear with regards to the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!
