# Project: Investigation of a Collection of 10,000 Movies in The Movie Database (TMDB)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description
The dataset investigated in this project is a collection of features of about 10,000 movies in The Movie Database (TMDb). [1](https://docs.google.com/document/d/e/2PACX-1vTlVmknRRnfy_4eTrjw5hYGaiQim5ctr9naaRd4V9du2B5bxpd8FEH3KtDgp8qVekw7Cj1GLk1IXdZi/pub?embedded=True) TMDb has been built by and for its community since 2008, and more than 400,000 developers currently use its metadata. [2](https://www.themoviedb.org/about)
The database consists of the following features:
- id: this is the TMDb unique movie identifier
- imdb_id: this is the IMDb unique movie identifier
- popularity: a metric that measures the acceptance of a movie by the viewers
- budget: estimate of the total amount of money earmarked for the movie
- revenue: the total amount of money realized from selling the movie
- original_title: the movie title
- cast: list of actors in the movie [3](https://dictionary.cambridge.org/dictionary/english/cast#:~:text=cast%20noun%20%28ACTORS%29%20B2%20%5B%20C%2C%20%2B%20sing%2Fpl,actors%20who%20were%20not%20playing%20the%20main%20parts%29.)
- homepage: the movie homepage
- director: the movie director
- tagline: a catchphrase for the movie advert
- keywords: words used in indexing and ranking movies in search results
- overview: high-level summary of the movie story line
- runtime: movie duration [4](https://www.merriam-webster.com/dictionary/running%20time#:~:text=Definition%20of%20running%20time%20%3A%20the%20duration%20of,recording%20Examples%20of%20running%20time%20in%20a%20Sentence)
- genres: movie category [5](https://www.studiobinder.com/blog/movie-genres-list/)
- production_companies: the company that produced the movie
- release_date: movie release date
- vote_count: number of votes
- vote_average: average vote score
- release_year: movie release year
- budget_adj: adjusted budget taking inflation into account
- revenue_adj: adjusted revenue taking inflation into account


### Question(s) for Analysis
The movie industry is a multi-billion dollar enterprise that generates substantial revenue [6](https://www.investopedia.com/articles/investing/091615/movie-vs-tv-industry-which-most-profitable.asp). Anyone who wants to invest in the industry should therefore have a grasp of its annual revenue trends.
In this notebook, I am going to investigate:
- the annual revenue growth of the movie industry
- the relationships between ***popularity, runtime, average votes and revenue generation***

In [1]:
# importing the necessary packages
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
# Upgrade pandas to use dataframe.explode() function. 
!pip install --upgrade pandas==0.25.0

<a id='wrangling'></a>
## Data Wrangling
In this section I am going to load, view and clean the dataset before analysis.

In [None]:
# loading the data
movies_df = pd.read_csv('tmdb-movies.csv')

In [None]:
# first five records in the dataset
movies_df.head()

From this we can observe that:
- the dataset consists of 21 features
- a movie can fall under more than one genre
- two or more production companies can collaborate to produce a movie.
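As a minimal sketch of how multi-genre entries could be separated for counting, the snippet below splits a genre string and gives each genre its own row with `explode()`. It assumes the genres are joined with a `|` character, which should be checked against the output of `head()`; the three-row `sample` frame is hypothetical and only stands in for the real data.

```python
import pandas as pd

# Hypothetical sample; the real values come from tmdb-movies.csv.
sample = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Adventure", "Drama", None],
})

# Assumption: several genres are stored in one string, separated by "|".
genre_lists = sample["genres"].str.split("|")

# explode() (pandas >= 0.25) puts each genre on its own row;
# a missing genre stays as a single missing row.
one_genre_per_row = sample.assign(genres=genre_lists).explode("genres")

# value_counts() ignores missing values by default.
genre_counts = one_genre_per_row["genres"].value_counts()
print(genre_counts)
```

The same pattern would apply to any other column that packs several values into one string, provided the separator is confirmed first.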

Let's check the number of records in the dataset.

In [None]:
# checking the size of the dataset
movies_df.shape

We can see that there are 10866 entries in the database. Let's investigate the metadata of the features.

In [None]:
movies_df.info()

We can see from the above that we have:
- three data types: ***float64, int64, object***
- the features: ***imdb_id, cast, homepage, director, tagline, keywords, overview, genres, and production_companies*** all have missing values.
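The missing values can also be counted directly rather than read off the `info()` output. The helper below is a sketch: `missing_summary` is a hypothetical name, and the `toy` frame is made-up data that only illustrates the shape of the result.

```python
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one gap, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical data standing in for movies_df.
toy = pd.DataFrame({
    "homepage": [None, None, "http://example.com", None],
    "director": ["A", None, "B", "C"],
    "runtime": [90, 120, 100, 95],
})
result = missing_summary(toy)
print(result)
```

Calling `missing_summary(movies_df)` in the notebook would list the affected features together with how much of each is missing.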

Let's check the summary statistics of the numerical features of the dataset.

In [None]:
movies_df.describe()

From the above summary statistics we note that:
- ***popularity, runtime, and vote_count*** have outliers because their maximum values far exceed their interquartile ranges.
- the records span 55 years of releases (1960-2015)
- more than 50% of the values in ***budget, revenue, budget_adj and revenue_adj*** are zero.


### Data Cleaning
In this section, the data will be cleaned prior to analysis:
- As noted above, the features ***imdb_id, cast, homepage, director, tagline, keywords, overview, genres, and production_companies*** have missing values. These will not be imputed because they are not among our features of interest.
- The outliers will be left intact because they are plausible values.
- The features ***revenue_adj and vote_average*** will be renamed to more readable names.
- A reusable function that drops the rows whose value in a given column equals a specified value will be defined.
- Zero revenue and budget are unlikely to be true values, but replacing them with the *mean, median or mode* would be hard to justify. Thus, all records with a value of zero in ***revenue_adj***, which is our target feature, will be dropped using the function mentioned above.


**Renaming the features revenue_adj and vote_average**

In [None]:
# renaming columns
movies_df = movies_df.rename(columns = {'vote_average' : 'average vote', 'revenue_adj' : 'adjusted revenue'})

**Defining the rows drop function**

In [None]:
def row_filter(df, column, value):
    '''
    Drops, in place, the rows of a dataframe whose value in a given column equals a given value.
    
    Arguments
    df: dataframe (modified in place)
    column: string, the name of the column to apply the filter to
    value: float, the value that marks a row for dropping
    
    Returns the same dataframe without the dropped rows
    '''
    indices = df[df[column] == value].index    # indices of the rows to be dropped
    df.drop(indices, inplace=True)             # dropping the rows in place
    return df

**Dropping the adjusted revenue zero valued rows**

In [None]:
row_filter(movies_df,'adjusted revenue',0)

In [None]:
movies_df.shape

We are now left with 4850 records, which account for about 45% of the original dataset.
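For comparison, the same filtering can be written without modifying the original dataframe, which makes it easy to compute the retained share afterwards. This is only a sketch: `keep_rows_not_equal` is a hypothetical helper and `toy` is made-up data.

```python
import pandas as pd

def keep_rows_not_equal(df, column, value):
    """Return a new dataframe without the rows where `column` equals `value`."""
    return df[df[column] != value].copy()

# Hypothetical data standing in for movies_df.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "adjusted revenue": [0.0, 1.2e8, 0.0, 3.4e7],
})

filtered = keep_rows_not_equal(toy, "adjusted revenue", 0)
retained_share = len(filtered) / len(toy)
print(f"{len(filtered)} of {len(toy)} rows kept ({retained_share:.1%})")
```

Because the input is left untouched, the row counts before and after filtering remain available for reporting.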

<a id='eda'></a>
## Exploratory Data Analysis
In this section, the features ***popularity, runtime, vote_average and release_year*** will be analysed with respect to the target variable, ***revenue_adj***, to explore the relationships between them.

### Question 1: What is the relationship between popularity, runtime, vote_average and revenue_adj?
To answer this question, let's define a function that will plot the graphical distributions of these features.

In [None]:
def hist_plots(df, feature, ylabel):
    '''
    Parameters
    df: dataframe
    feature: the string name of the feature whose distribution is to be plotted
    ylabel: a string label for the vertical axis
    
    Returns the axes holding the histogram of the feature
    '''
    proper_feature = feature.capitalize()           # capitalizes the feature name
    ax = df[feature].hist(grid=False)               # drawing the histogram
    ax.set_xlabel(proper_feature)
    ax.set_ylabel(ylabel)
    ax.set_title(proper_feature + " Distribution")  # crafting the title
    return ax

Below are the histograms of the features of interest.

In [None]:
# popularity distribution
hist_plots(movies_df, "popularity", "Frequency");

In [None]:
# runtime distribution
hist_plots(movies_df, "runtime", "Frequency");

In [None]:
# average vote distribution
hist_plots(movies_df, "average vote", "Frequency");

In [None]:
hist_plots(movies_df, "adjusted revenue", "Frequency");

From the above histograms, ***popularity and adjusted revenue*** are right-skewed: most values are small, with a long tail of large values. This is consistent with the very large maximum values seen in the summary statistics. However, ***runtime and average vote*** are approximately normally distributed, as expected.
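The direction of skew can be cross-checked numerically with `DataFrame.skew()`: a positive value indicates a long right tail, a negative value a long left tail, and a value near zero a roughly symmetric distribution. The `toy` values below are hypothetical and chosen only to show the two cases.

```python
import pandas as pd

# Hypothetical data: one column with a long right tail, one symmetric column.
toy = pd.DataFrame({
    "popularity": [0.3, 0.5, 0.6, 0.8, 1.0, 1.2, 9.0],
    "runtime": [85, 95, 100, 105, 110, 115, 125],
})

# Sample skewness of each numeric column.
skewness = toy.skew()
print(skewness)
```

Running `movies_df[['popularity', 'runtime', 'average vote', 'adjusted revenue']].skew()` in the notebook would give the corresponding figures for the real data.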

Now let's evaluate the correlations between variables and also plot the scatter plots for the variables of interest.

In [None]:
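# getting the correlation coefficients of the numerical features
# (text columns are excluded explicitly, which newer pandas versions require)
movies_df.select_dtypes(include='number').corr()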

In [None]:
# plotting the scatter plots for popularity, runtime, average votes against revenue adjusted

fig, ax = plt.subplots(3, figsize=(10, 20))
ax[0].scatter(x = movies_df['popularity'], y = movies_df['adjusted revenue']/1e9)
ax[0].set_xlabel("Popularity")
ax[0].set_ylabel("Revenue in billion USD")
ax[0].set_title('Revenue by Popularity')

ax[1].scatter(x = movies_df['runtime'], y = movies_df['adjusted revenue']/1e9)
ax[1].set_xlabel("Runtime")
ax[1].set_ylabel("Revenue in billion USD")
ax[1].set_title('Revenue by Runtime')

ax[2].scatter(x = movies_df['average vote'], y = movies_df['adjusted revenue']/1e9)
ax[2].set_xlabel("Average Votes")
ax[2].set_ylabel("Revenue in billion USD")
ax[2].set_title('Revenue by Average Votes')

plt.show()

From the correlation coefficients and scatter plots above, we can see that ***runtime and average votes*** have weak positive correlations with the ***adjusted revenue***, while there is a moderate positive correlation between ***popularity*** and the ***adjusted revenue***.
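To read the relevant coefficients without scanning the full matrix, the correlations with the target can be extracted on their own. The sketch below also computes Spearman's rank correlation, which is an addition not used in the analysis above; it is included only as a check that is less sensitive to extreme values. The helper name and the `toy` data are hypothetical.

```python
import pandas as pd

def correlations_with_target(df, features, target, method="pearson"):
    """Correlation of each feature with the target column."""
    matrix = df[features + [target]].corr(method=method)
    return matrix[target].drop(target)

# Hypothetical data standing in for movies_df; the last revenue is an outlier.
toy = pd.DataFrame({
    "popularity": [1, 2, 3, 4, 5],
    "runtime": [100, 90, 120, 95, 110],
    "adjusted revenue": [10, 20, 30, 40, 500],
})

features = ["popularity", "runtime"]
pearson = correlations_with_target(toy, features, "adjusted revenue")
spearman = correlations_with_target(toy, features, "adjusted revenue", method="spearman")
print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

In this toy example the single extreme revenue pulls the Pearson coefficient for popularity below the rank-based one, which is why comparing the two can be informative when outliers are kept in the data.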

### Question 2: What is the annual revenue trend of the movie industry?
To answer this, the annual revenues need to be computed and plotted in a time series chart. This will show the revenue trend over time.

In [None]:
# grouping and summing annual revenue of the industry
annual_revenue = movies_df.groupby('release_year')['adjusted revenue'].sum()

# changing the series to a dataframe
annual_revenue_df = pd.DataFrame(annual_revenue)

In [None]:
# plotting the line chart

plt.rcParams['figure.figsize'] = [10, 7]     # set before plotting so that it applies to this figure
plt.rc('axes', titlesize=24)                 # fontsize of the title
plt.rc('axes', labelsize=14)                 # fontsize of the x and y labels

plt.plot(annual_revenue_df/1e9, linewidth=3, color='red') # changing the scale to billion USD
plt.title('Annual Revenue Growth')
plt.xlabel('Release Year')
plt.ylabel('Total Annual Revenue in Billion USD')

The above line chart shows an increasing annual revenue trend over the years.
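The trend can also be expressed as year-over-year changes. The sketch below sums revenue per release year, counts the movies behind each total, and applies `pct_change()`. Showing the count is a suggestion rather than part of the original analysis: a yearly total depends on how many movies the dataset holds for that year. The `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical data standing in for movies_df.
toy = pd.DataFrame({
    "release_year": [2012, 2012, 2013, 2014, 2014],
    "adjusted revenue": [1.0e9, 0.5e9, 1.2e9, 2.0e9, 0.4e9],
})

# Total revenue and number of movies per release year.
annual = toy.groupby("release_year")["adjusted revenue"].agg(["sum", "count"])

# Relative change of the yearly total compared with the previous year.
annual["yoy_change"] = annual["sum"].pct_change()
print(annual)
```

The first year has no predecessor, so its change is missing; negative values mark the momentary declines visible in a line chart.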

<a id='conclusions'></a>
## Conclusions
The above analyses show that popularity is moderately positively correlated with adjusted revenue. It can also be observed that movie industry revenue is growing over time, despite some momentary declines, as indicated by the line chart of the annual revenue. In this dataset, the industry's total adjusted revenue grew from 950 million USD in 1960 to 24 billion USD in 2015. This industry is worth investing in!
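The two endpoint figures quoted above can be turned into an implied compound annual growth rate. This is a back-of-the-envelope sketch that uses only the rounded 1960 and 2015 totals from the text and assumes smooth growth between them, which the line chart shows is not strictly the case.

```python
# Rounded totals quoted in the text, in USD.
start_revenue = 0.95e9   # 1960
end_revenue = 24e9       # 2015
years = 2015 - 1960

# Compound annual growth rate: the constant yearly rate that would
# take start_revenue to end_revenue over the given number of years.
cagr = (end_revenue / start_revenue) ** (1 / years) - 1
print(f"Implied compound annual growth rate: {cagr:.2%}")
```

With these inputs the rate comes out at roughly 6% per year in inflation-adjusted terms, for the movies covered by the dataset.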

**Limitation**
The annual aggregation of industry revenue is limited because a movie released in a particular year may continue to generate revenue in subsequent years, and the dataset provides no way to attribute that revenue to the years in which it was earned.

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])

## References

- <a href="https://docs.google.com/document/d/e/2PACX-1vTlVmknRRnfy_4eTrjw5hYGaiQim5ctr9naaRd4V9du2B5bxpd8FEH3KtDgp8qVekw7Cj1GLk1IXdZi/pub">Investigate a Dataset</a>

- <a href="https://www.themoviedb.org/about">The Movie Database</a>

- <a href="https://dictionary.cambridge.org/dictionary/english/cast#:~:text=cast%20noun%20%28ACTORS%29%20B2%20%5B%20C%2C%20%2B%20sing%2Fpl,actors%20who%20were%20not%20playing%20the%20main%20parts%29.">Cambridge Dictionary</a>

- <a href="https://www.merriam-webster.com/dictionary/running%20time#:~:text=Definition%20of%20running%20time%20%3A%20the%20duration%20of,recording%20Examples%20of%20running%20time%20in%20a%20Sentence">Merriam-Webster Dictionary</a>

- <a href="https://www.studiobinder.com/blog/movie-genres-list/">Studio Binder, Ultimate Guide to Movie Genres</a>

- <a href="https://www.investopedia.com/articles/investing/091615/movie-vs-tv-industry-which-most-profitable.asp">Investopedia, Movie vs. TV Industry: Which Is More Profitable?</a>

 


    