# Project: Investigate a Dataset - TMDb Movie Data
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description
This dataset contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. Its attributes are:
- id - Unique id for each entry
- imdb_id - IMDb id for each movie
- popularity - The popularity rating of each movie
- budget - The amount allocated to the production of the movie
- revenue - The amount the movie earned after its release
- original_title - The original title of each movie
- cast - The cast members who appeared in the movie
- homepage - The homepage of the movie
- director - The name of the person who directed the movie
- tagline - Short introduction about the movie
- keywords - Keywords associated with the movie
- overview - Storyline of the movie
- runtime - The duration of the movie
- genres - The genres that the movie belongs to
- production_companies - The companies that produced the movie
- release_date - The date that the movie was released
- vote_count - The number of votes that the movie got
- vote_average - The average vote for a particular movie
- release_year - The year that the movie was released in
- budget_adj - The budget in terms of 2010 dollars
- revenue_adj - The revenue in terms of 2010 dollars

### Questions for Analysis
1. Which 5 genres dominate the movie industry?
2. Do these genres make most of the profit in the movie industry?
3. What percentage of success do these genres account for in the movie industry?
4. Are these genres the most popular?
5. What's the relationship between a movie's popularity and its revenue?
6. Which 5 directors would you recommend to your friend, and why?

#### Import necessary libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

**Gather the data to use for our analysis**

In [None]:
# Use pandas 'read_csv()' function to load in the csv file that holds our data
tmdb_data = pd.read_csv('tmdb-movies.csv')

# Confirm the operation by displaying the first five records with pandas 'head()' function
tmdb_data.head()

In [None]:
# How many records and columns do we have in this data?
shape = tmdb_data.shape
print('We have', shape[0], 'records and', shape[1], 'attributes in the TMDb Movies Dataset')

**Assess the data for analysis**

In [None]:
# Which attributes do we have in this dataset, and are they all needed for our analysis?
tmdb_data.info()

**Clean the data for analysis**

- I will remove the *id* attribute, leaving *imdb_id* as the unique identifier.

In [None]:
# Remove 'id' attribute from the data
tmdb_data.drop('id', axis=1, inplace=True)

# A custom function to check if attribute removal is successful
def check_attribute_status(attr_names:list):
    # Get the attributes in the dataset we're working with
    attributes = tmdb_data.columns

    # Loop through the list of attributes supplied
    for i in attr_names:
        # Checks if the attribute is in the dataset
        state = i in attributes

        # Print out a text stating if the attribute is in the dataset or not
        print(i, 'attribute in the dataset?', state)
    

# Check if we have indeed removed the column
check_attribute_status(['id']) # Should print False

- I will remove the *cast, homepage, tagline, keywords, overview, release_date, budget_adj and revenue_adj* attributes from the dataset, as I don't need them for my analysis.

In [None]:
# Remove 'cast, homepage, tagline, keywords, overview, release_date, budget_adj and revenue_adj' attributes from the dataset
tmdb_data.drop(columns=['cast', 'homepage', 'tagline', 'keywords', 'overview', 'release_date', 'budget_adj', 'revenue_adj'], inplace=True)

# Validate the operation by checking if they are still in the dataset
check_attribute_status(['cast', 'homepage', 'tagline', 'keywords', 'overview', 'release_date', 'budget_adj', 'revenue_adj'])

Are all these records unique? Let's check

In [None]:
# Check for duplicated records
tmdb_data.duplicated().sum()

In [None]:
# View the duplicate(s)
tmdb_data[tmdb_data.duplicated()]

In [None]:
# View both records simultaneously
# Use the original title to get both
tmdb_data.query('original_title == "TEKKEN"')

In [None]:
# Drop the duplicates
tmdb_data.drop_duplicates(inplace=True)

# Check again to verify the duplicate is out of the dataset
tmdb_data.duplicated().sum()
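
As an alternative to looking up the duplicated title by hand, `duplicated(keep=False)` flags every copy in a duplicate group at once. The sketch below uses a small made-up frame (the titles and budgets are placeholders, not values from the TMDb file) to show the idea.

```python
import pandas as pd

# Toy data standing in for tmdb_data; titles and budgets are made up
movies = pd.DataFrame({
    'original_title': ['Alpha', 'Beta', 'Beta', 'Gamma'],
    'budget': [10, 20, 20, 30],
})

# keep=False marks every member of a duplicate group, not only the later copies
all_copies = movies[movies.duplicated(keep=False)]
print(all_copies)

# drop_duplicates() keeps the first copy of each group by default
deduplicated = movies.drop_duplicates()
print(len(movies), '->', len(deduplicated))
```

This avoids hard-coding a title in a query, which helps when more than one record is duplicated.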

Do we have null values in our records? How many for each attribute?

In [None]:
# Check if the attributes contain null values
tmdb_data.isnull().any()

In [None]:
# Count the null values for each attribute
tmdb_data.isnull().sum()

Our dataset contains null values, with production_companies containing the most. This attribute is not really needed for our analysis, so let's drop it.

In [None]:
# Drop the production_companies attribute
tmdb_data.drop('production_companies', axis=1, inplace=True)

# Validate the operation
check_attribute_status(['production_companies'])

In [None]:
# Check the null values for each attribute
tmdb_data.isnull().sum()

Assuming each null value falls in a different record, let's add them up and see the percentage of null values relative to the number of records in the dataset.

In [None]:
total_null = tmdb_data.isnull().sum().sum() # Sum of all the null values together
total_records = tmdb_data.shape[0] # Get the number of records in our dataset

print('Null values in our modified dataset take up {:.2%} of the records'.format(total_null/total_records))

The null values take up less than 1% of the records, so we can remove the individual records that contain them.
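
The figure above rests on the assumption that no record holds more than one null. If that assumption is in doubt, the share of affected records can be measured directly by checking each row once. A minimal sketch on made-up data (the values are placeholders):

```python
import pandas as pd

# Toy data: row 1 has two nulls, row 3 has one null
df = pd.DataFrame({
    'imdb_id': ['tt1', None, 'tt3', 'tt4'],
    'director': ['A', None, 'C', None],
    'genres': ['Drama', 'Action', 'Comedy', 'Drama'],
})

# Counts every null cell, so a row with two nulls is counted twice
null_cells = df.isnull().sum().sum()

# Counts each affected row once, regardless of how many nulls it holds
rows_with_null = df.isnull().any(axis=1).sum()

print('Null cells:', null_cells)
print('Rows with at least one null: {} ({:.2%} of records)'.format(
    rows_with_null, rows_with_null / len(df)))
```

The cell count is an upper bound on the number of affected rows, so the conclusion that the loss is small still holds whenever the cell-based percentage is small.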

In [None]:
# The function to remove null values from our dataset
def remove_nulls(attr_name):
    # Extract the null records from the dataframe
    null_df = tmdb_data[tmdb_data[attr_name].isnull()]

    # Remove records with the same index in the null dataframe
    tmdb_data.drop(index=null_df.index, inplace=True)

    # Report how many nulls remain in the attribute after the removal (should be 0)
    print('Attribute', attr_name, 'contains', tmdb_data[attr_name].isnull().sum(), 'null value(s)')

In [None]:
# Remove null records from our dataset using the function created earlier
remove_nulls('imdb_id')
remove_nulls('director')
remove_nulls('genres')

In [None]:
# Check for null values
tmdb_data.isnull().sum()
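
For comparison, pandas can drop the same records in one call with `dropna(subset=...)`. This is only a sketch of an equivalent approach on made-up data, not a replacement for the helper above; the extra `homepage` column is included to show that nulls outside the subset are ignored.

```python
import pandas as pd

# Toy data; the values are placeholders
df = pd.DataFrame({
    'imdb_id': ['tt1', None, 'tt3', 'tt4'],
    'director': ['A', 'B', None, 'D'],
    'genres': ['Drama', 'Action', 'Comedy', None],
    'homepage': [None, None, None, 'http://example.com'],
})

# Drop a record if any of the listed attributes is null;
# nulls in other columns (homepage here) do not trigger a drop
cleaned = df.dropna(subset=['imdb_id', 'director', 'genres'])
print(cleaned)
```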

Now that we're done with null values, let's assess the dtypes of our attributes.

In [None]:
# Display the data types of our attributes
tmdb_data.dtypes

Our data types are good. I can start exploring and manipulating. Let's have a quick view once more.

In [None]:
# View the first five records
tmdb_data.head()

The genres attribute contains multiple values separated by the pipe character. I need to split it, as I'll use it heavily in my analysis.

In [None]:
# Convert the string in the genres attribute into a list, splitting on the pipe character
tmdb_data.genres = tmdb_data.genres.apply(lambda x: x.split('|'))

# Use pandas 'explode()' function to convert the list into individual records
tmdb_data = tmdb_data.explode('genres')

# View our dataset
tmdb_data.head()
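
To make the effect of the split-and-explode step concrete, here is a small sketch on made-up data. It assumes a pipe-separated `genres` column like the one above and uses `.str.split`, the vectorised counterpart of the `apply` call.

```python
import pandas as pd

# Toy data; titles and revenues are placeholders
movies = pd.DataFrame({
    'original_title': ['Alpha', 'Beta'],
    'genres': ['Action|Thriller', 'Drama'],
    'revenue': [100, 50],
})

# Split each pipe-separated string into a list of genres
movies['genres'] = movies['genres'].str.split('|')

# One output row per (movie, genre) pair; the other columns are repeated
exploded = movies.explode('genres')
print(exploded)

# The original index is repeated for movies with several genres
print(exploded.index.tolist())
```

Note that every non-genre value (revenue, popularity, director and so on) is repeated once per genre, so any later count or average that is not grouped by genre weights multi-genre movies more heavily.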

Rename genres to genre

In [None]:
tmdb_data.rename(columns={'genres':'genre'}, inplace=True)
tmdb_data.head()

Now I am done wrangling this dataset.

<a id='eda'></a>
## Exploratory Data Analysis

#### 1. Which 5 genres are dominating the movie industry?

In [None]:
# Count the number of times each genre is in a movie
tmdb_data.genre.value_counts()

In [None]:
# Plot the first five on a bar chart
top_5_genres = tmdb_data.genre.value_counts().nlargest(5)

top_5_genres.plot(
    kind='bar',
    color='green',
    title='Top 5 Genres in the Dataset');

In [None]:
print('The five genres dominating the movie industry are:')
for i in top_5_genres.index.to_list():
    print(i)
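
Raw counts can also be expressed as shares with `value_counts(normalize=True)`. The sketch below uses a made-up genre series; on the exploded dataset the shares would be fractions of genre labels rather than of distinct movies, because each movie contributes one row per genre.

```python
import pandas as pd

# Toy genre labels standing in for tmdb_data.genre
genre = pd.Series(['Drama', 'Drama', 'Comedy', 'Action', 'Drama', 'Comedy'])

# normalize=True returns each genre's fraction of all labels
shares = genre.value_counts(normalize=True)

# Format the fractions as percentages for display
print(shares.map('{:.1%}'.format))
```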

#### 2. Are these genres making most of the profits in the movie industry?

In [None]:
# First, I need to calculate the profit for each record
# Record Profit = Record Revenue - Record Budget
tmdb_data['profit'] = tmdb_data['revenue'] - tmdb_data['budget']

# Then calculate the average profit per genre
# And get the top 10 genres by profit
top_10_genres_by_profit = tmdb_data.groupby('genre').profit.mean().nlargest(10)

# Now display two bar charts:
# 1. Showing the top 5 genres in the movie industry by movie count
# 2. Showing the top 10 genres in the movie industry by average profit
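# Pass width_ratios through gridspec_kw so the call also works on Matplotlib versions older than 3.6
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6), gridspec_kw={'width_ratios': [1.5, 2.5]})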

fig.suptitle('Top Genres in the Movie Industry')

# First chart
ax1.bar(top_5_genres.index, top_5_genres.values, color='green')
ax1.set_title('Top 5 Genres in the movie industry')
ax1.tick_params(labelrotation=90)
ax1.set_ylabel('Count')

# Second chart
ax2.bar(top_10_genres_by_profit.index, top_10_genres_by_profit.values, color='blue')
ax2.set_title('Top 10 Genres in the movie industry by Profit')
ax2.tick_params(labelrotation=90)
ax2.set_ylabel('Profit')

plt.show()

Only two of the five showed up in the lower half of the top 10 genres by profit.

**No. The five genres with the highest movie counts are not making the most profit in the industry. Action, Thriller and Comedy are profitable but not the most profitable.**
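
Reading the overlap off two charts is error-prone, so it can help to compute it. The following sketch uses made-up genres and profits, with hypothetical names `top_by_count` and `top_by_profit` playing the roles of `top_5_genres` and `top_10_genres_by_profit`.

```python
import pandas as pd

# Toy data; the profits are placeholders
df = pd.DataFrame({
    'genre':  ['Drama', 'Drama', 'Comedy', 'Comedy', 'Adventure', 'Action'],
    'profit': [10, 20, 30, 10, 90, 50],
})

# Most frequent genres (the analogue of top_5_genres)
top_by_count = df['genre'].value_counts().nlargest(2)

# Genres with the highest mean profit (the analogue of top_10_genres_by_profit)
top_by_profit = df.groupby('genre')['profit'].mean().nlargest(3)

# Genres that appear in both rankings, in profit order
overlap = [g for g in top_by_profit.index if g in top_by_count.index]
print(overlap)
```

The same comprehension applied to the real rankings would list exactly which of the five most frequent genres reach the top 10 by average profit.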

#### 3. Are these genres the most popular?

In [None]:
# Calculate the mean popularity for the genres and select the top 10 genres
top_10_genres_by_popularity = tmdb_data.groupby('genre').popularity.mean().nlargest(10)

# Now display two bar charts:
# 1. Showing the top 5 genres in the movie industry by movie count
# 2. Showing the top 10 genres in the movie industry by average popularity
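# Pass width_ratios through gridspec_kw so the call also works on Matplotlib versions older than 3.6
fig, (ax3, ax4) = plt.subplots(1, 2, figsize=(16, 6), gridspec_kw={'width_ratios': [1.5, 2.5]})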

fig.suptitle('Top Genres in the Movie Industry')

# First chart
ax3.bar(top_5_genres.index, top_5_genres.values, color='green')
ax3.set_title('Top 5 Genres in the movie industry')
ax3.tick_params(labelrotation=90)
ax3.set_ylabel('Count')

# Second chart
ax4.bar(top_10_genres_by_popularity.index, top_10_genres_by_popularity.values, color='red')
ax4.set_title('Top 10 Genres in the movie industry by Popularity')
ax4.tick_params(labelrotation=90)
ax4.set_ylabel('Popularity')

plt.show()

Action and Thriller are popular, as they show up in the top 10 genres by popularity, but they are not the most popular.

**No. The five genres leading by movie count are not the most popular in the movie industry.**

#### 4. Which attribute(s) go with popularity of a movie?

In [None]:
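# Let's see the correlation matrix between the numeric attributes
# (string columns such as director and genre are excluded, since newer pandas versions raise an error on them)
tmdb_data.select_dtypes(include='number').corr()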

In [None]:
# Visualize the matrix with a heatmap for better interpretation (numeric attributes only)
sns.heatmap(tmdb_data.select_dtypes(include='number').corr(), cmap='YlGnBu', linewidths=0.30, annot=True);

The popularity attribute correlates positively with the number of votes a movie receives: movies with more votes tend to be more popular.

Popularity also correlates positively with budget, revenue and profit. One possible reading is that investing more in a movie raises its quality and therefore its popularity, but a correlation alone does not establish that cause.

Attributes like average vote, release year and runtime show little or no correlation with the popularity of a movie.
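
To read the popularity relationships without scanning a full heatmap, the popularity column of the correlation matrix can be pulled out and sorted. A minimal sketch on made-up numbers (the column names mirror the dataset, the values do not):

```python
import pandas as pd

# Toy data; the values are placeholders
df = pd.DataFrame({
    'original_title': ['A', 'B', 'C', 'D', 'E'],
    'popularity': [1.0, 2.0, 3.0, 4.0, 5.0],
    'vote_count': [10, 25, 30, 45, 50],
    'runtime': [100, 90, 120, 95, 110],
})

# Keep numeric columns only so corr() does not trip over strings
numeric = df.select_dtypes(include='number')

# Pearson correlation of every numeric attribute with popularity,
# excluding popularity's correlation with itself
popularity_corr = (
    numeric.corr()['popularity']
    .drop('popularity')
    .sort_values(ascending=False)
)
print(popularity_corr)
```

On the exploded dataset, multi-genre movies appear several times, so dropping the `genre` column and de-duplicating rows before calling `corr()` may give a cleaner estimate.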

#### 5. Which 5 directors would you recommend to your friend? And why?

Let's see the relationship between profit and popularity

In [None]:
# Draw a scatter plot showing the relationship between the popularity and the profit of a movie
plt.scatter(tmdb_data.popularity, tmdb_data.profit, color='green')
plt.title('Popularity against Profit')
plt.xlabel('Popularity')
plt.ylabel('Profit');

My previous analysis shows that popularity and profit go hand in hand. My friend will want a director who can deliver both popularity and profit, so I'll look at those two measures specifically.

In [None]:
# Get the top ranking movies data by popularity
top_movies_data_by_popularity = tmdb_data[tmdb_data.popularity > tmdb_data.popularity.mean()]

# Get the top ranking movies data by profit
top_movies_data_by_profit = tmdb_data[tmdb_data.profit > tmdb_data.profit.mean()]

# Display two plots of the directors with the highest movie counts in the two categories
fig, (ax5, ax6) = plt.subplots(1, 2, figsize=(20, 5))

fig.suptitle('Top Ranking Movies')

profit_data = top_movies_data_by_profit.groupby('director').director.count().nlargest(10)
ax5.set_title('Directors with higher movie count when considering Profit')
ax5.bar(profit_data.index, profit_data.values)
ax5.tick_params(labelrotation=90)
ax5.set_ylabel('Movie Count')

popularity_data = top_movies_data_by_popularity.groupby('director').director.count().nlargest(10)
ax6.set_title('Directors with higher movie count when considering Popularity')
ax6.bar(popularity_data.index, popularity_data.values)
ax6.tick_params(labelrotation=90)
ax6.set_ylabel('Movie Count')

plt.show()

- The first 4 directors by movie count when considering profit are also the ones with the higher movie counts when considering popularity.
- Clint Eastwood, fifth when considering profit, is also in the top 10 when considering popularity.

Prioritizing profit over popularity, I'll recommend to my friend the first five directors by movie count when considering profit.

**I'll recommend *Steven Spielberg, Robert Zemeckis, Tim Burton, Ridley Scott and Clint Eastwood* to my friend because these directors have the most entries among movies with above-average profit, and their movies are also very popular.**
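
One caveat on the counts above: because the dataset was exploded by genre, `groupby('director').director.count()` counts one row per genre, not one per movie. The sketch below, on made-up data with hypothetical director names, shows how counting unique `imdb_id` values per director gives the number of distinct movies instead.

```python
import pandas as pd

# Toy exploded data: tt1 has three genres, tt2 has one, tt3 has two
exploded = pd.DataFrame({
    'imdb_id':  ['tt1', 'tt1', 'tt1', 'tt2', 'tt3', 'tt3'],
    'director': ['Dir A', 'Dir A', 'Dir A', 'Dir B', 'Dir B', 'Dir B'],
    'genre':    ['Action', 'Drama', 'Thriller', 'Drama', 'Comedy', 'Drama'],
})

# Row count per director: inflated by the number of genres per movie
rows_per_director = exploded.groupby('director')['director'].count()

# Distinct movies per director: each imdb_id is counted once
movies_per_director = exploded.groupby('director')['imdb_id'].nunique()

print(rows_per_director)
print(movies_per_director)
```

Here both directors have three rows, yet one directed a single movie and the other two. Whether this changes the ranking on the real data would need to be checked.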

<a id='conclusions'></a>
## Conclusions

- The top 5 genres in the movie industry by movie count are *Drama, Comedy, Thriller, Action, Romance*.
- These genres are neither the most popular nor the most profitable genres in the movie industry.
- Adventure is, on average, the most popular and the most profitable genre in the movie industry.
- The profit of a movie tends to increase with its popularity.
- The runtime, release year and average vote show little or no correlation with how popular a movie gets.
- But the number of votes a movie gets correlates positively with its popularity.
- The 5 directors that I recommend in the industry are **Steven Spielberg, Robert Zemeckis, Tim Burton, Ridley Scott and Clint Eastwood**.