            
# Project: The Movie Database (TMDb) Investigation
 
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
    
<li><a href="#wrangling">Data Wrangling</a></li>
    <ul>
        <li><a href="#Gen_Prop">General Properties</a></li>
        <li><a href="#Data_Clean">Initial Data Cleaning</a></li>
    </ul>     
<li><a href="#eda">Exploratory Data Analysis</a></li>
    <ul>
        <li><a href="#Question_1">Question 1</a></li>
        <li><a href="#Question_2">Question 2</a></li>
        <li><a href="#Question_3">Question 3</a></li>
        <li><a href="#Question_4">Question 4</a></li> 
    </ul>     
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

For this project, we will investigate the TMDb dataset provided by Udacity and Kaggle. I chose this dataset because of my interest in movies and because I wanted a challenge that would deepen my understanding of the Python data packages. I will look at how the movie industry has changed over the years, who the most successful directors are, and which genres are most common.

We will start by importing the packages used throughout the project.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


<a id='wrangling'></a>
## Data Wrangling


<a id='Gen_Prop'></a>
### General Properties 

We will start by loading the data into a dataframe and printing a few values to inspect its datatypes and characteristics.

In [None]:
# Load the data and preview the first few rows to inspect the datatypes
#   and look for missing or possibly errant values.
df = pd.read_csv('tmdb-movies.csv', encoding='UTF-8')
df.head(3)

In [None]:
df.describe()

The column titles are mostly self-explanatory. However, popularity appears to be an index that TMDb computes itself. On the TMDb support forum, I found the following answer:

> Popularity is a value that gets updated daily and takes a number of things into account like views, number of user ratings/watchlist/favourite additions and release date.

<cite>Elizabeth Jennings- https://www.themoviedb.org/talk/56e614a2c3a3685aa4008121?language=en</cite>

To check that answer against the data, I generated a correlation matrix and examined it as a heat map.

In [None]:
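# Restrict the matrix to numeric columns so that corr() does not fail on text columns.
corr = df.select_dtypes(include='number').corr()
correlation_matrix = sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)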


- Popularity shows little correlation with time (release_year) or with runtime.
- Revenue is more strongly correlated with popularity than budget is, which suggests that a movie's commercial success also matters.
- vote_count is the column most strongly correlated with popularity.
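
To read the same relationships as numbers rather than colors, one option is to sort the popularity column of the correlation matrix. The following is a minimal sketch that uses a small made-up frame (`toy`) in place of the TMDb data, so the values are illustrative only.

```python
import pandas as pd

# Made-up numbers standing in for the TMDb columns; not real data.
toy = pd.DataFrame({
    "popularity": [0.5, 1.2, 2.8, 4.1, 6.3],
    "vote_count": [20, 150, 900, 2100, 4800],
    "revenue": [1e6, 3e7, 2e8, 1e8, 9e8],
    "release_year": [2001, 1985, 2010, 1999, 1978],
})

# Correlation of every numeric column with popularity, strongest first.
pop_corr = (
    toy.select_dtypes(include="number")
       .corr()["popularity"]
       .drop("popularity")
       .sort_values(ascending=False)
)
print(pop_corr)
```

Applied to `df`, the same chain would list the columns in the order the heat map suggests.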

<a id='Data_Clean'></a>
### Data Cleaning 

At a glance, we can see a few characteristics of this dataset.
- Special characters: The '|' character separates multiple values within a single cell in columns such as 'cast' and 'genres'.
- Null values: Because the dataset includes modern fields such as websites, older movies have null values in those columns.
- Different datatypes: Paired columns such as 'revenue' and 'revenue_adj' do not share the same datatype.
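
These three characteristics can each be checked with a short pandas call. The sketch below uses hypothetical rows (`toy`) that mimic the layout described above; the names and values are placeholders, not TMDb records.

```python
import pandas as pd

# Hypothetical rows that mimic the layout described above; not real data.
toy = pd.DataFrame({
    "cast": ["Actor A|Actor B", "Actor C", None],
    "genres": ["Action|Adventure", "Drama", "Comedy|Drama"],
    "homepage": [None, None, "http://example.com"],
    "revenue": [100, 0, 250],
    "revenue_adj": [120.5, 0.0, 250.0],
})

# Count the missing values in each column.
null_counts = toy.isnull().sum()

# Compare the datatypes of the paired columns.
dtypes = toy.dtypes[["revenue", "revenue_adj"]]

# Split one pipe-separated cell into its individual values.
first_genres = toy.loc[0, "genres"].split("|")

print(null_counts, dtypes, first_genres, sep="\n\n")
```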

#### Removal of Unused Columns
We will first remove the columns that we will not use in this investigation.

In [None]:
# We will not use the homepage, keywords, tagline, imdb_id, production_companies, overview, or runtime columns, so we drop them.
df.drop(['homepage', 'keywords', 'tagline', 'imdb_id', 'production_companies', 'overview', 'runtime'], axis=1, inplace=True)

#### Check Duplicates and remove them
We expect every row in the dataframe to have a unique id, so we will check for duplicates in the id column and remove them.

<a id='eda'></a>

## Exploratory Data Analysis 
<a id='Question_1'></a>
### Question 1 - Top Grossing Movie Title Each Year

#### Data Preparation
Now we will answer our questions about this dataset. We start by looking at the success of movies over time. We want to find the top grossing movie of each year, so we will do some additional cleaning on the general dataframe created in the previous section.

In [None]:
# We will create a copy of our dataframe for the time-based analysis and drop the columns it does not need.
df2 = df.copy()
df2.drop(['id','release_date', 'director', 'cast', 'vote_count', 'vote_average','popularity' ], axis=1, inplace=True)
df2 = df2.sort_values('release_year')
# Boolean masks marking the rows that hold each year's maximum adjusted revenue and revenue.
max_rev_adj = df2.groupby(['release_year'])['revenue_adj'].transform('max') == df2['revenue_adj']
max_rev = df2.groupby(['release_year'])['revenue'].transform('max') == df2['revenue']
df2[max_rev_adj].head()

We found the top grossing title of each year, in terms of both revenue and adjusted revenue, using pandas' groupby. We will now prepare the data for visualization to see when the biggest successes occurred over the years.
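
A mask built with `transform` keeps every row that ties for the yearly maximum. If exactly one row per year is wanted, `idxmax` is one alternative. The following is a minimal sketch on made-up rows; the titles and figures are placeholders.

```python
import pandas as pd

# Placeholder titles and revenues; not real data.
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C", "Film D"],
    "release_year": [1960, 1960, 1961, 1961],
    "revenue": [10, 40, 25, 5],
})

# Mask approach: rows whose revenue equals the maximum of their year.
mask = toy.groupby("release_year")["revenue"].transform("max") == toy["revenue"]
top_by_mask = toy[mask]

# Alternative: idxmax returns the index label of the first maximum per year.
top_rows = toy.loc[toy.groupby("release_year")["revenue"].idxmax()]

print(top_rows)
```

Both approaches select the same rows here because no year has a tie.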

In [None]:
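# We will save our data in lists for presentation.
# Taking the yearly maximum directly keeps the three lists the same length,
# even if several movies tie for the top revenue in a year.
yearly_max = df2.groupby('release_year')[['revenue', 'revenue_adj']].max()
x_year = yearly_max.index.tolist()
y_revenue = yearly_max['revenue'].tolist()
y_revenue_adj = yearly_max['revenue_adj'].tolist()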

#### Data Visualization

We will generate bar graphs based on revenue and adjusted revenue.

In [None]:
fig=plt.figure(figsize=(18,10))
fig.suptitle("Top Grossing Movie Each Year", fontsize=14)

ax1 = plt.subplot(221)
ax2 = plt.subplot(222)
ax1.bar(x_year,y_revenue)
ax2.bar(x_year,y_revenue_adj)
ax1.set_title("Revenue (USD)")
ax2.set_title("Adjusted Revenue (USD)")



Looking at the revenue chart, we notice movies that far outperformed their peers in the 1990s, the 2000s, and the 2010s. From recollection, we can assume that these titles are 'Titanic', 'Avatar', and 'Star Wars: The Force Awakens'.

However, if we look at revenue adjusted for inflation, we can see notable successes in the 1970s that are almost on par with 'Avatar'.

This leads to our second question as we broaden the scope: what are the top grossing movie titles of each decade?

<a id='Question_2'></a>
### Question 2 - Top Grossing Movie Title Each Decade

#### Data Preparation
Since we already have a dataframe prepared for time-based analysis, we will reuse the dataframe from the previous question.
We will start by defining a function that returns the decade of each movie.

In [None]:
def decade_converter(x):
    # Map a release year to the first year of its decade, e.g. 1977 -> 1970.
    return (x // 10) * 10

Using the function that we created just above, we will assign the movie decades to create a new column in our time dataframe.

In [None]:
# Assign each movie its release decade.
df2 = df2.assign(release_decade=df2['release_year'].apply(decade_converter))
max_rev_adj = df2.groupby(['release_decade'])['revenue_adj'].transform('max') == df2['revenue_adj']
max_rev = df2.groupby(['release_decade'])['revenue'].transform('max') == df2['revenue']
# Keep the top grossing movie of each decade by revenue; copy so the new column is added to an independent dataframe.
df2 = df2[max_rev].copy()
df2["movie_decade"] = df2["original_title"].map(str) + " ("+ df2["release_year"].astype(str) + ")"

#### Data Visualization

We will now make a bar graph that will visualize the top grossing movies of each decade in terms of revenue and adjusted revenue.

In [None]:
df2.plot(kind='barh', y= 'revenue', x='movie_decade',legend = False)
plt.title("Top Grossing Movie Each Decade")
plt.ylabel("Movie (Release Year)")
plt.xlabel("Revenue (in Billions, USD)")

df2.plot(kind='barh', y= 'revenue_adj', x='movie_decade',legend = False)
plt.ylabel("Movie (Release Year)")
plt.xlabel("Revenue Adjusted (in Billions, USD)")

Our observations from the Question 1 visualization appear to hold. In terms of revenue, modern movies stand out more than older ones. In terms of adjusted revenue, however, older movies such as Star Wars were remarkably successful as well, which shows how much of a cultural sensation they were in their time.

<a id='Question_3'></a>
### Question 3 - Top 10 Directors

#### Data Preparation
For this question, we will look at the top grossing directors in the film industry. As with the time-based dataframe, we will perform additional data cleaning on the general dataframe.


In [None]:
# We will create a copy of our dataframe for directors count. 
df3 = df.copy()

# We will remove directors with null data in our analysis for this study. 
df3.dropna(subset=['director'], inplace=True)
df3 = df3.groupby('director').agg({'popularity':'max','revenue': 'sum','revenue_adj': 'sum', 'id': 'count','vote_average': 'median'})
df3 = df3.rename(index=str, columns={"id": "movie_count"})

rev_df3 = df3.sort_values('revenue', ascending = False)
rev_df3 = rev_df3[:10]
rev_df3 = rev_df3.reset_index().sort_values('revenue', ascending= True)
rev_df3.head()

In [None]:
# We are going to create a new column combining the director's name and movie count, so that we can use it for the axis labels.
rev_df3["director_count"] = rev_df3["director"].map(str) + " ("+ rev_df3["movie_count"].astype(str) + ")"

#### Data Visualization

We will now make a bar graph that will visualize the top grossing directors in terms of gross revenue and adjusted revenue.

In [None]:
# We will plot our graphs.

rev_df3.plot(kind='barh', y= 'revenue', x='director_count',legend = False,xlim= (0,1.7* 10**10))
plt.ylabel("Director (Movie Count)")
plt.xlabel("Revenue (USD)")
plt.title("Top 10 Grossing Directors")

rev_df3.plot(kind='barh', y= 'revenue_adj', x='director_count',legend = False, xlim= (0,1.7* 10**10))
plt.ylabel("Director (Movie Count)")
plt.xlabel("Revenue Adjusted (USD)")


At a glance, we can make some clear observations.
- Most of the top grossing directors have made more than 10 films. David Yates is the exception.
    - David Yates only directed independent and TV films prior to the Harry Potter series, so the success of that series is what made him one of the top grossing directors in the industry.
    
- Steven Spielberg has grossed more than any other director in the film industry. However, he also has the largest filmography among these directors.
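
Because total revenue favors directors with long filmographies, one way to put them on the same scale is to divide revenue by movie count. This is a minimal sketch with placeholder directors and figures, not values from the dataset; the `revenue_per_movie` column is an assumption introduced here for illustration.

```python
import pandas as pd

# Placeholder directors and figures; not taken from the TMDb data.
toy = pd.DataFrame({
    "director": ["Director A", "Director B", "Director C"],
    "revenue": [9_000, 4_000, 6_000],
    "movie_count": [30, 8, 24],
})

# Average revenue per film puts small and large filmographies on one scale.
toy["revenue_per_movie"] = toy["revenue"] / toy["movie_count"]
ranked = toy.sort_values("revenue_per_movie", ascending=False)

print(ranked[["director", "movie_count", "revenue_per_movie"]])
```

The same division could be applied to `rev_df3`, which already holds `revenue` and `movie_count` for each director.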

<a id='Question_4'></a>
### Question 4 - What is the genre distribution?

#### Data Preparation
For this question, we will look at the most frequent genres in the film industry. 
As mentioned in the data wrangling section, the genres column uses '|' to separate multiple values. 
We will perform additional data cleaning on the general dataframe. 

In [None]:
# We will create a copy of our original dataframe for the genre count. 
df4_pre = df.copy()

# We drop the null from the genres.
df4_pre.dropna(subset=['genres'], inplace=True)
df4 = df4_pre['genres']


# We will use a dictionary to keep count of the unique genres.
genre_count = dict()

for x in df4:
    genre_list = x.split('|')
    for genre in genre_list:

        if genre in genre_count:
            genre_count[genre] += 1
        else:
            genre_count[genre] = 1

# Convert the dictionary into a list and sort it based on its value.
genre_count =sorted(genre_count.items(), key=lambda x: x[1], reverse= True)
genre_df = pd.DataFrame(genre_count, columns=['Genre', 'Genre_count'])
genre_df
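
The dictionary loop can also be written with pandas string methods. The following is a sketch of that alternative on made-up genre strings in the same pipe-separated layout; it is not a replacement for the cell above, only another way to reach the same counts.

```python
import pandas as pd

# Made-up genre strings in the same pipe-separated layout; not real data.
genres = pd.Series(["Action|Adventure", "Drama", "Comedy|Drama", None])

genre_counts = (
    genres.dropna()
          .str.split("|")   # each cell becomes a list of genres
          .explode()        # one row per genre
          .value_counts()   # occurrences of each genre, most frequent first
)
print(genre_counts)
```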


As we have around 20 unique genres, we think it is best to use a pie chart to visualize the distribution. Since 20 slices would be too many, we will keep the top 9 genres and combine the rest into an 'Others' category.

In [None]:
df[df.duplicated(['id'],keep=False)]

We drop the duplicates and check that none remain in the dataframe.

In [None]:
df.drop_duplicates(['id'],inplace = True)
df[df.duplicated(['id'],keep=False)]

We do not have any duplicates in our dataset now.

### Data Cleaning - Change Datatype
We will convert the float columns to integers for a cleaner representation.


In [None]:
# Since budget and revenue are stored as integers, we convert the adjusted columns to integers as well.
df['budget_adj'] = df['budget_adj'].astype(int)
df['revenue_adj'] = df['revenue_adj'].astype(int)
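
Note that `astype(int)` truncates the fractional part rather than rounding, and it raises an error if the column contains missing values. The sketch below illustrates both points on made-up values; using the nullable `Int64` dtype is one possible workaround, not something the analysis above relies on.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not real data.
adj = pd.Series([1234.9, 0.4, np.nan])

# astype(int) drops the fraction and needs the missing value removed first.
truncated = adj.dropna().astype(int)

# Rounding first and using the nullable Int64 dtype keeps the missing entry.
rounded = adj.round().astype("Int64")

print(truncated.tolist())
print(rounded)
```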


In [None]:
# Assign parameters
topN = 9

# Sum the counts of every genre below the top N into a single 'Others' row
other_count = genre_df['Genre_count'][topN:].sum()
genre_other_df = pd.DataFrame([["Others", other_count]], columns=['Genre', 'Genre_count'])

# Concatenate the top N rows of the original data with the 'Others' row
genre_df_conc = pd.concat([genre_df[:topN], genre_other_df], ignore_index=True)

#### Data Visualization

We will now make a pie chart and a table to see the individual counts of each unique genre.