# Project: Investigate IMDb Movie Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
IMDb Movie Data:
This dataset contains information on roughly 10,000 movies collected from The Movie
Database (TMDb), including user ratings and revenue.
Data provided by [Udacity](https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csv)<br>
Original data provided by [Kaggle](https://www.kaggle.com/tmdb/tmdb-movie-metadata)<br>

**Dataset features:**<br>
- id,
- imdb_id,
- popularity,
- budget,
- revenue,
- original_title,
- cast,
- homepage,
- director,
- tagline,
- keywords,
- overview,
- runtime,
- genres,
- production_companies,
- release_date,
- vote_count,
- vote_average,
- release_year,
- budget_adj*,
- revenue_adj*<br><br>
\* *The final two columns, ending with “_adj”, show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.*

Not all features are relevant to the questions this analysis sets out to answer, so a few columns are dropped during cleaning.

### Questions:
>describe the questions that you plan on exploring over the course of the report. Try to build your report around the analysis of at least one dependent variable and three independent variables.
>


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling
In this section, the data is <a href="#load">loaded</a>, <a href="#check">checked for cleanliness</a>, and then <a href="#trim">trimmed and cleaned</a> for analysis.

<a id = 'load'></a>

In [2]:
df_raw_data = pd.read_csv('../../Datasets/movies.csv')
df_raw_data.head(3)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0


In [3]:
print('Dataset consists of {} rows and {} columns.'.format(df_raw_data.shape[0],df_raw_data.shape[1]))

Dataset consists of 10866 rows and 21 columns.


<a id = 'check'></a>
#### Number of Duplicates

In [4]:
print('Of {} rows, {} are unique and {} is a duplicate.'.format(df_raw_data.shape[0],df_raw_data.id.nunique(),df_raw_data.duplicated().sum()))

Of 10866 rows, 10865 are unique and 1 is a duplicate.


#### Number of Missing Values for Each Column

In [5]:
pd.DataFrame(df_raw_data.isna().sum()).transpose()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,0,10,0,0,0,0,76,7930,44,2824,...,4,0,23,1030,0,0,0,0,0,0


#### Number of Empty Rows

In [6]:
df_raw_data.isna().all(1).sum()

0

#### Number of 0s for Each Column

In [7]:
pd.DataFrame(df_raw_data.eq(0).sum()).transpose()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,0,0,0,5696,6016,0,0,0,0,0,...,0,31,0,0,0,0,0,0,5696,6016
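
A budget, revenue, or runtime of exactly 0 is hard to take at face value. One option, which is an assumption here and not a step this notebook performs, is to treat those zeros as missing values so that later averages are not pulled toward zero. The sketch below uses a small hypothetical sample, not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the zeros stand in for values that were never recorded.
sample = pd.DataFrame({
    'budget': [150000000, 0, 6000000],
    'revenue': [1513528810, 0, 0],
    'runtime': [124, 0, 110],
})

# Assumption: a zero in these columns means "unknown", not a true value.
zero_as_missing = ['budget', 'revenue', 'runtime']
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

# Count how many entries per column are now flagged as missing.
missing_counts = cleaned.isna().sum().to_dict()
print(missing_counts)
```

Whether this is appropriate depends on the question being asked: it keeps the rows but excludes the unknown amounts from numeric summaries.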


#### Number of Unique Values for Each Column
Although this information is not important for every column, it helps us verify that no movie has more than one entry.

In [8]:
pd.DataFrame(df_raw_data.nunique()).transpose()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,10865,10855,10814,557,4702,10571,10719,2896,5067,7997,...,10847,247,2039,7445,5909,1289,72,56,2614,4840


In [9]:
df_raw_data[['original_title','director','cast']].duplicated().sum()


2

>At first glance, the number of unique values for ```original_title``` suggests that we have duplicate entries. However, the check above shows that most repeated titles belong to different movies that share a name: only 2 rows repeat the same title, director, and cast. One of them is the exact duplicate found earlier. The other is an additional entry for the same film, which ```duplicated()``` did not detect earlier because at least one other column differs.
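
The distinction drawn above, between exact duplicate rows and repeated entries for the same film, can be sketched on a small hypothetical frame. The three rows are invented for illustration and are not taken from the dataset; only `duplicated()` and its `subset` and `keep` parameters are pandas API.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 describe the same film but differ in
# runtime, while row 2 is a different film that happens to share the title.
toy = pd.DataFrame({
    'original_title': ['Film A', 'Film A', 'Film A'],
    'director': ['Director X', 'Director X', 'Director Y'],
    'cast': ['Actor 1|Actor 2', 'Actor 1|Actor 2', 'Actor 3'],
    'runtime': [0, 110, 95],
})

# Comparing whole rows finds nothing, because runtime differs.
exact_duplicates = toy.duplicated().sum()

# Comparing only the identifying columns flags the repeated entry.
key_columns = ['original_title', 'director', 'cast']
repeated_entries = toy.duplicated(subset=key_columns).sum()

# keep=False marks every copy, which makes side-by-side comparison easier.
all_copies = toy[toy.duplicated(subset=key_columns, keep=False)]
print(exact_duplicates, repeated_entries, len(all_copies))
```

Listing every copy with `keep=False` is useful before deciding which row to drop, since the default only shows the later occurrence.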

#### Check Columns' Datatype

In [10]:
df_raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date       

>Datatype for ```release_date``` should change to ```datetime```. 
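
One possible way to make that conversion is sketched below on hypothetical sample values, not the real column. It assumes the dates follow the month/day/two-digit-year pattern visible in the preview (for example `6/9/15`) and that `release_year` holds the correct four-digit year. With `%y`, two-digit years from 00 to 68 are read as 2000 to 2068, so a 1966 release would land in 2066 unless it is shifted back.

```python
import pandas as pd

# Hypothetical sample in the same shape as the dataset's date columns.
sample = pd.DataFrame({
    'release_date': ['6/9/15', '1/1/66'],
    'release_year': [2015, 1966],
})

# Parse month/day/two-digit-year strings into datetime64 values.
parsed = pd.to_datetime(sample['release_date'], format='%m/%d/%y')

# Any parsed year later than release_year was assigned to the wrong century,
# so move those dates back by 100 years and leave the rest untouched.
wrong_century = parsed.dt.year > sample['release_year']
sample['release_date'] = parsed.where(~wrong_century,
                                      parsed - pd.DateOffset(years=100))

print(sample['release_date'].dt.year.tolist())
```

The century correction only matters if the data contains releases from before 1969. Comparing against `release_year` makes the fix safe to apply either way.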

<a id ='trim'></a>	
### Data Cleaning and Trimming

#### Drop Irrelevant Columns


In [11]:
df_raw_data.drop(['imdb_id','popularity','homepage','tagline','overview'],axis =1, inplace = True)


#### Delete Duplicate Rows

In [12]:
df_raw_data.drop_duplicates(inplace = True)
print('Number of duplicate rows: {}'.format(df_raw_data.duplicated().sum()))

Number of duplicate rows: 0


In [13]:
# Show rows that repeat the title, director, and cast of an earlier row (entries on the same film)
df_raw_data[df_raw_data[['original_title','director','cast']].duplicated()]

Unnamed: 0,id,budget,revenue,original_title,cast,director,keywords,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
6701,16781,6000000,57231524,Madea's Family Reunion,Tyler Perry|Blair Underwood|Lynn Whitfield|Bor...,Tyler Perry,spanking|based on play,110,Drama|Comedy|Romance,Lions Gate Films,2/24/06,63,6.0,2006,6490015.0,61905570.0


In [14]:
df_raw_data[df_raw_data['original_title'] == "Madea's Family Reunion"]

Unnamed: 0,id,budget,revenue,original_title,cast,director,keywords,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
4063,28004,6000000,0,Madea's Family Reunion,Tyler Perry|Blair Underwood|Lynn Whitfield|Bor...,Tyler Perry,,0,Comedy,,1/25/02,49,5.9,2002,7273568.0,0.0
6701,16781,6000000,57231524,Madea's Family Reunion,Tyler Perry|Blair Underwood|Lynn Whitfield|Bor...,Tyler Perry,spanking|based on play,110,Drama|Comedy|Romance,Lions Gate Films,2/24/06,63,6.0,2006,6490015.0,61905570.0


In [15]:
df_raw_data.drop([4063],axis = 0, inplace = True)
df_raw_data[df_raw_data['original_title'] == "Madea's Family Reunion"]

Unnamed: 0,id,budget,revenue,original_title,cast,director,keywords,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
6701,16781,6000000,57231524,Madea's Family Reunion,Tyler Perry|Blair Underwood|Lynn Whitfield|Bor...,Tyler Perry,spanking|based on play,110,Drama|Comedy|Romance,Lions Gate Films,2/24/06,63,6.0,2006,6490015.0,61905570.0
