# Project: Investigate a Dataset - TMDb Movie Data
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description
This dataset contains information about roughly 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. It has the following attributes:
- id -
- imdb_id - 
- popularity - 
- budget - 
- revenue - 
- original_title - 
- cast - 
- homepage - 
- director - 
- tagline - 
- keywords - 
- overview - 
- runtime - 
- genres - 
- production_companies - 
- release_date - 
- vote_count - 
- vote_average - 
- release_year - 
- budget_adj - 
- revenue_adj - 

### Questions for Analysis
1. Which 5 genres are dominating the movie industry?
2. Are these genres making most of the profits in the movie industry?
3. What percentage of success do these movie genres account for in the movie industry?
4. Are these genres the most popular?
5. What's the relationship between a movie's popularity and its revenue?
6. Which 5 directors would you recommend to your friend? And why?

#### Import necessary libraries

In [103]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

**Gather the data to use for our analysis**

In [162]:
# Use pandas 'read_csv()' function to load in the csv file that holds our data
tmdb_data = pd.read_csv('tmdb-movies.csv')

# Confirm the operation by displaying the first five records with pandas 'head()' function
tmdb_data.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999939.3,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999939.3,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101199955.5,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999919.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799923.1,1385749000.0


In [163]:
# How many records and columns do we have in this data
shape = tmdb_data.shape
print('We have ', shape[0], 'records and ', shape[1], 'attributes in the TMDb Movies Dataset')

We have  10866 records and  21 attributes in the TMDb Movies Dataset


**Assess the data for analysis**

In [164]:
# What attributes do we have in this dataset? Are they all needed for our analysis?
tmdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date       

**Clean the data for analysis**

- I will remove *id* attribute leaving imdb_id as the unique identifier.

In [165]:
# Remove 'id' attribute from the data
tmdb_data.drop('id', axis=1, inplace=True)

# A custom function to check if attribute removal is successful
def check_attribute_status(attr_names:list):
    # Get the attributes in the dataset we're working with
    attributes = tmdb_data.columns

    # Loop through the list of attributes supplied
    for i in attr_names:
        # Checks if the attribute is in the dataset
        state = i in attributes

        # Print out a text stating if the attribute is in the dataset or not
        print(i, 'attribute in the dataset?', state)
    

# Check if we have indeed removed the column
check_attribute_status(['id']) # Should print False

id attribute in the dataset? False


- I will remove *cast, homepage, tagline, keywords, overview, release_date, budget_adj and revenue_adj* attributes from the dataset as I don't need them for my analysis

In [166]:
# Remove 'cast, homepage, tagline, keywords, overview, release_date, budget_adj and revenue_adj' attributes from the dataset
tmdb_data.drop(columns=['cast', 'homepage', 'tagline', 'keywords', 'overview', 'release_date', 'budget_adj', 'revenue_adj'], inplace=True)

# Validate the operation by checking if they are still in the dataset
check_attribute_status(['cast', 'homepage', 'tagline', 'keywords', 'overview', 'release_date', 'budget_adj', 'revenue_adj'])

cast attribute in the dataset? False
homepage attribute in the dataset? False
tagline attribute in the dataset? False
keywords attribute in the dataset? False
overview attribute in the dataset? False
release_date attribute in the dataset? False
budget_adj attribute in the dataset? False
revenue_adj attribute in the dataset? False


Are all these records unique? Let's check

In [167]:
# Check for duplicated records
tmdb_data.duplicated().sum()

1

In [168]:
# View the duplicate(s)
tmdb_data[tmdb_data.duplicated()]

Unnamed: 0,imdb_id,popularity,budget,revenue,original_title,director,runtime,genres,production_companies,vote_count,vote_average,release_year
2090,tt0411951,0.59643,30000000,967000,TEKKEN,Dwight H. Little,92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,110,5.0,2010


In [169]:
# View both records simultaneously
# Use the original title to get both
tmdb_data.query('original_title == "TEKKEN"')

Unnamed: 0,imdb_id,popularity,budget,revenue,original_title,director,runtime,genres,production_companies,vote_count,vote_average,release_year
2089,tt0411951,0.59643,30000000,967000,TEKKEN,Dwight H. Little,92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,110,5.0,2010
2090,tt0411951,0.59643,30000000,967000,TEKKEN,Dwight H. Little,92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,110,5.0,2010
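
As an alternative to querying by title, `duplicated(keep=False)` flags every member of a duplicate group, so both copies can be viewed without knowing the title in advance. A minimal sketch on a small made-up frame (the `toy` data is illustrative only and not taken from the dataset):

```python
import pandas as pd

# Illustrative data only: two identical rows and one distinct row
toy = pd.DataFrame({
    'imdb_id': ['tt0000001', 'tt0000001', 'tt0000002'],
    'original_title': ['Movie A', 'Movie A', 'Movie B'],
    'release_year': [2010, 2010, 2011],
})

# keep=False marks all rows of a duplicate group, not just the later copies
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

# The default (keep='first') marks only the repeated copies
repeats_only = toy[toy.duplicated()]
print(len(all_copies), len(repeats_only))
```

On the real data, the same call would be `tmdb_data[tmdb_data.duplicated(keep=False)]`.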


In [170]:
# Drop the duplicates
tmdb_data.drop_duplicates(inplace=True)

# Check again to verify the duplicate is out of the dataset
tmdb_data.duplicated().sum()

0

Do we have null values in our records? How many for each attribute?

In [171]:
# Check if the attributes contain null values
tmdb_data.isnull().any()

imdb_id                  True
popularity              False
budget                  False
revenue                 False
original_title          False
director                 True
runtime                 False
genres                   True
production_companies     True
vote_count              False
vote_average            False
release_year            False
dtype: bool

In [172]:
# Count the null values for each attribute
tmdb_data.isnull().sum()

imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
director                  44
runtime                    0
genres                    23
production_companies    1030
vote_count                 0
vote_average               0
release_year               0
dtype: int64

Our dataset contains null values, with production_companies containing the most. This attribute is not needed for our analysis, so let's drop it.

In [173]:
# Drop the production_companies attribute
tmdb_data.drop('production_companies', axis=1, inplace=True)

# Validate the operation
check_attribute_status(['production_companies'])

production_companies attribute in the dataset? False


In [174]:
# Check the null values for each attribute
tmdb_data.isnull().sum()

imdb_id           10
popularity         0
budget             0
revenue            0
original_title     0
director          44
runtime            0
genres            23
vote_count         0
vote_average       0
release_year       0
dtype: int64

Assuming each null value belongs to a different record, let's add them up and see what percentage of the records in the dataset contain null values.

In [175]:
total_null = tmdb_data.isnull().sum().sum() # Sum of all the null values together
total_records = tmdb_data.shape[0] # Get the number of records in our dataset

print('Null values in our modified dataset take up {:.2%} of the records'.format(total_null/total_records))

Null values in our modified dataset take up 0.71% of the records
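
The figure above sums null cells across attributes, so a record with nulls in two attributes is counted twice. Counting records with at least one null avoids relying on the one-null-per-record assumption. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Illustrative data only: row 1 has two nulls, row 3 has one
toy = pd.DataFrame({
    'imdb_id': ['tt1', None, 'tt3', 'tt4'],
    'director': ['A', None, 'C', 'D'],
    'genres': ['Drama', 'Action', 'Comedy', None],
})

# Cell-based count: every null cell is counted
null_cells = toy.isnull().sum().sum()

# Row-based count: a record is counted once if it has any null
null_rows = toy.isnull().any(axis=1).sum()

print('Null cells: {}, records with at least one null: {}'.format(null_cells, null_rows))
print('Share of records with nulls: {:.2%}'.format(null_rows / len(toy)))
```

If the two counts match on the real data, the assumption holds; otherwise the row-based share is the one that describes how many records would be removed.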


The null values account for less than 1% of the records, so we can remove the individual records that contain them.

In [176]:
# The function to remove null values from our dataset
def remove_nulls(attr_name):
    # Extract the null records from the dataframe
    null_df = tmdb_data[tmdb_data[attr_name].isnull()]

    # Remove records with the same index in the null dataframe
    tmdb_data.drop(index=null_df.index, inplace=True)

    # Print a statement to the console stating the number of nulls in the particular attribute
    print('Attribute', attr_name, 'contains', tmdb_data[attr_name].isnull().sum(), 'null value(s)')

In [177]:
# Remove null records from our dataset using the function created earlier
remove_nulls('imdb_id')
remove_nulls('director')
remove_nulls('genres')

Attribute imdb_id contains 0 null value(s)
Attribute director contains 0 null value(s)
Attribute genres contains 0 null value(s)
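
A similar result can be obtained with pandas' built-in `dropna()` and its `subset` parameter, which drops records that are null in any of the listed attributes. A minimal sketch on made-up data, offered as an alternative and not as a replacement for the function above:

```python
import pandas as pd

# Illustrative data only: rows 1, 2 and 3 each have one null
toy = pd.DataFrame({
    'imdb_id': ['tt1', None, 'tt3', 'tt4'],
    'director': ['A', 'B', None, 'D'],
    'genres': ['Drama', 'Action', 'Comedy', None],
    'runtime': [90, 100, 110, 120],
})

# Drop every record that is null in any of the listed attributes
cleaned = toy.dropna(subset=['imdb_id', 'director', 'genres'])

print(cleaned.isnull().sum())
print(cleaned.shape)
```

This returns a new DataFrame and leaves the original untouched, which makes it easier to compare record counts before and after.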


In [178]:
# Check for null values
tmdb_data.isnull().sum()

imdb_id           0
popularity        0
budget            0
revenue           0
original_title    0
director          0
runtime           0
genres            0
vote_count        0
vote_average      0
release_year      0
dtype: int64

Now that we're done with null values, let's assess the dtypes of our attributes.

In [179]:
# Display the data types of our attributes
tmdb_data.dtypes

imdb_id            object
popularity        float64
budget              int64
revenue             int64
original_title     object
director           object
runtime             int64
genres             object
vote_count          int64
vote_average      float64
release_year        int64
dtype: object

Our data types are good. I can start exploring and manipulating. Let's have a quick view once more.

In [180]:
# View the first five records
tmdb_data.head()

Unnamed: 0,imdb_id,popularity,budget,revenue,original_title,director,runtime,genres,vote_count,vote_average,release_year
0,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,124,Action|Adventure|Science Fiction|Thriller,5562,6.5,2015
1,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,120,Action|Adventure|Science Fiction|Thriller,6185,7.1,2015
2,tt2908446,13.112507,110000000,295238201,Insurgent,Robert Schwentke,119,Adventure|Science Fiction|Thriller,2480,6.3,2015
3,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,J.J. Abrams,136,Action|Adventure|Science Fiction|Fantasy,5292,7.5,2015
4,tt2820852,9.335014,190000000,1506249360,Furious 7,James Wan,137,Action|Crime|Thriller,2947,7.3,2015


The genres attribute contains multiple values separated by the pipe character. I need to split it, as I'll use it heavily in my analysis.

In [181]:
# Convert the genres string into a list, splitting on the pipe character
tmdb_data.genres = tmdb_data.genres.apply(lambda x: x.split('|'))

# Use pandas 'explode()' function to convert the list into individual records
tmdb_data = tmdb_data.explode('genres')

# View our dataset
tmdb_data.head()

Unnamed: 0,imdb_id,popularity,budget,revenue,original_title,director,runtime,genres,vote_count,vote_average,release_year
0,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,124,Action,5562,6.5,2015
0,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,124,Adventure,5562,6.5,2015
0,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,124,Science Fiction,5562,6.5,2015
0,tt0369610,32.985763,150000000,1513528810,Jurassic World,Colin Trevorrow,124,Thriller,5562,6.5,2015
1,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,George Miller,120,Action,6185,7.1,2015
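
One thing to keep in mind after `explode()`: each movie now appears once per genre, so per-movie figures such as revenue repeat across those records. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical) of how per-genre and per-movie totals differ:

```python
import pandas as pd

# Illustrative data only: one two-genre movie and one single-genre movie
toy = pd.DataFrame({
    'imdb_id': ['tt1', 'tt2'],
    'revenue': [100, 50],
    'genres': ['Action|Adventure', 'Drama'],
})

# Split on the pipe character, then give each genre its own record
exploded = toy.assign(genres=toy['genres'].str.split('|')).explode('genres')

# Per-genre totals work directly on the exploded frame
revenue_by_genre = exploded.groupby('genres')['revenue'].sum()
print(revenue_by_genre)

# Summing the exploded frame counts multi-genre movies once per genre
total_exploded = exploded['revenue'].sum()

# Keeping one record per movie restores the per-movie total
total_per_movie = exploded.drop_duplicates('imdb_id')['revenue'].sum()
print(total_exploded, total_per_movie)
```

Per-genre questions can use the exploded frame as is, while per-movie or per-director questions are safer on one record per `imdb_id`.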


Now I am done wrangling this dataset.

<a id='eda'></a>
## Exploratory Data Analysis

<a id='conclusions'></a>
## Conclusions