# Project: Investigate a Dataset - TMDB Movie Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 
> This project examines the TMDB movies dataset, a collection of movies of various genres released between 1960 and 2015. The dataset contains the following attributes:

>-    movie_id and imdb_id - The identifiers for each movie.
>-    cast - The names of the primary actors of the movie.
>-    director - The name of the director of the movie.
>-    budget - The amount budgeted for the production of the movie.
>-    genres - The genre or genres of the movie.
>-    homepage - The website of the movie.
>-    keywords - The tags related to the movie's content.
>-    original_title - The initial release title of the movie.
>-    overview - A summary description of the movie.
>-    popularity - The popularity score of the movie.
>-    production_companies - The names of the companies involved in the production of the movie.
>-    production_countries - The country in which the movie was produced.
>-    release_date - The date on which the movie was released to the market.
>-    revenue - The total revenue generated by the movie.
>-    runtime - The movie's duration in minutes.
>-    tagline - The movie's tagline.
>-    vote_count - The number of votes the movie received.
>-    vote_average - The average rating of the movie.
>-    budget_adj - The budget expressed in 2010 dollars, accounting for inflation.
>-    revenue_adj - The revenue expressed in 2010 dollars, accounting for inflation.

 



### Question(s) for Analysis
>Our analysis centres on the popularity of each movie, so "popularity" is our primary variable, treated as either a dependent or an independent variable in relation to other variables in the dataset.
<br>Popularity will be taken as the dependent variable when examining its relationship with (1) Budget (2) Genre (3) Runtime (4) Vote count.
<br>It will be taken as the independent variable when examining its relationship with (1) Revenue (2) Profit.
<br>The questions we will look into are:
 
>  - What is the distribution of the Popularity variable?
>  - What is the relationship between Budget and Popularity?
>  - What is the relationship between Genre and Popularity?
>  - What is the relationship between Runtime and Popularity?
>  - What is the relationship between Vote Count and Popularity?
>  - What is the relationship between Popularity and Revenue?
>  - What is the relationship between Popularity and Profit?


In [1]:
#import the needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


<a id='wrangling'></a>
## Data Wrangling

>In this section we will import the dataset and examine the data

In [2]:
movie_dataset = pd.read_csv('tmdb-movies.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'tmdb-movies.csv'
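
The error above appears when the CSV is not in the notebook's working directory. As a minimal sketch, the load could be wrapped in a small helper that reports the path it looked at. `load_movies` is a hypothetical name introduced here for illustration, and the default file name is simply the one used in the cell above.

```python
from pathlib import Path

import pandas as pd


def load_movies(csv_path="tmdb-movies.csv"):
    """Read the movie CSV, failing early with a clearer message if it is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected the dataset at {path.resolve()}; "
            "check the working directory or pass the full path."
        )
    return pd.read_csv(path)
```

Calling `movie_dataset = load_movies()` would then behave like the original cell when the file is present.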

In [None]:
movie_dataset.head()

In [None]:
movie_dataset.shape

In [None]:
movie_dataset.info()


### Data Cleaning
 

>In this section we will clean our dataset. We will examine the dataset for null and duplicate values and handle them as appropriate for our analysis.
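
As a small illustration of the two checks described here, the sketch below counts null values and duplicate rows on a toy table. The table and its values are made up; only the pandas calls mirror the ones used in this section.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie table; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "original_title": ["A", "B", "B", "C"],
        "homepage": [np.nan, "http://b.example", "http://b.example", np.nan],
        "runtime": [100, 95, 95, 120],
    }
)

nulls_per_column = toy.isnull().sum()         # missing values in each column
total_nulls = int(nulls_per_column.sum())     # missing values in the whole table
duplicate_rows = int(toy.duplicated().sum())  # rows identical to an earlier row

print(nulls_per_column)
print(total_nulls, duplicate_rows)
```

Looking at the per-column counts first helps decide whether to drop a column or drop rows.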


In [None]:
movie_dataset.isnull().sum().sum()

In [None]:
movie_dataset.isnull().sum()

In [None]:
movie_dataset.drop(["homepage","tagline","keywords","production_companies"], axis=1, inplace=True)

In [None]:
movie_dataset.info()

In [None]:
movie_dataset.isnull().sum().sum()

In [None]:
movie_dataset.dropna(inplace=True)

In [None]:
movie_dataset.info()

In [None]:
movie_dataset.describe()

Calling the describe() method on our dataset shows that some observations have a minimum value of zero in the runtime, budget, revenue, budget_adj and revenue_adj columns. We are going to drop the observations that have a zero in any of these columns.
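
The zero-value filter described above can also be sketched as a single boolean mask that returns a new table instead of modifying one in place. The toy data and the names `checked_columns`, `has_no_zero` and `cleaned` are assumptions made for this illustration.

```python
import pandas as pd

# Toy data (made up): rows 1 and 2 each contain a zero in a checked column.
toy = pd.DataFrame(
    {
        "runtime": [100, 0, 95, 120],
        "budget": [5_000_000, 1_000_000, 0, 2_000_000],
        "revenue": [9_000_000, 3_000_000, 4_000_000, 1_000_000],
    }
)

checked_columns = ["runtime", "budget", "revenue"]

# Keep a row only if none of the checked columns is zero.
has_no_zero = (toy[checked_columns] != 0).all(axis=1)
cleaned = toy[has_no_zero]

print(len(toy), "->", len(cleaned))
```

Printing the row count before and after makes it easy to see how many observations the filter removes.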

In [None]:
# define a function that finds the rows with a zero in column x and drops them from movie_dataset
def zero(x):
    y =  movie_dataset[movie_dataset[x] ==0]
    movie_dataset.drop(y.index, inplace=True)

In [None]:
# drop the observations with zero values
zero("runtime")
zero("budget")
zero("revenue")
zero("budget_adj")
zero("revenue_adj")

In [None]:
movie_dataset.duplicated().sum()

In [None]:
movie_dataset.drop_duplicates(inplace=True)

In [None]:
movie_dataset.info()

### Cleaning summary
Examining the dataset, we found 10866 observations with 21 attributes. Further examination showed that the dataset contains 13434 null values, so action was needed to protect the integrity of the data for analysis. Most of the null values are in the following columns: homepage, tagline, keywords and production_companies. These columns are categorical variables that we do not need for the objectives of this project, so we dropped the columns.
<br>After dropping the columns, 127 null values were left, and the rows containing them were dropped.
<br>We then ran summary statistics on the dataset, which showed that certain observations had zero in the runtime, budget, revenue, revenue_adj and budget_adj columns. It is not plausible for a movie that has been produced and released to have zero in those attributes, so observations with those values were dropped to protect the integrity of our analysis. The check for duplicates showed one duplicate, which was removed as well.
<br>Of the 10866 observations in the dataset, 3849 observations were left for the analysis.

#### **Limitations**
<br>There were many null values in the dataset; however, we were able to handle them in a way that did not significantly reduce the number of observations in our dataset.
<br>The currency of the budget and revenue columns is not given, and nothing indicates whether all amounts are in the same currency, given that the movies were produced in different countries. We assume that all amounts are in dollars.

<a id='eda'></a>
## Exploratory Data Analysis

> In this section, we will carry out EDA on our dataset with the goal of gaining insights that answer the questions stated above.

### Research Question 1 (What is the distribution of the Popularity variable?)

In [None]:
#compute the summary statistics to get an overview of the distribution of the popularity variable
movie_dataset["popularity"].describe()

In [None]:
#plot a boxplot to visualize the spread and outliers of popularity
plt.boxplot(movie_dataset["popularity"])
plt.xlabel("Popularity");

In [None]:
#plot a histogram with a density curve to show the shape of the distribution
sns.set(style="darkgrid")
sns.histplot(data=movie_dataset, x="popularity", kde=True, bins=50)
plt.title("Distribution of Popularity");
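
To put a number on the shape seen in a histogram, one option is the sample skewness from `Series.skew()`, together with a comparison of mean and median. The sketch below uses synthetic log-normal values as a stand-in; they are not taken from the TMDB data.

```python
import numpy as np
import pandas as pd

# Synthetic, log-normal "popularity" values; not taken from the TMDB data.
rng = np.random.default_rng(seed=0)
popularity = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skewness = popularity.skew()  # a value above 0 indicates a longer right tail
mean_above_median = popularity.mean() > popularity.median()

print(f"skewness: {skewness:.2f}, mean > median: {mean_above_median}")
```

On the real table, the same two lines could be applied to `movie_dataset["popularity"]`.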

### Research Question 2 (What is the relationship between Budget and Popularity?)

In [None]:
#Is there a relationship between budget and movie popularity? To find out, we will draw a scatter plot and compute the correlation
plt.scatter(movie_dataset["budget"],movie_dataset["popularity"])
plt.xlabel("budget")
plt.ylabel("popularity")
plt.title("Budget vs Popularity");

In [None]:
movie_dataset["budget"].corr(movie_dataset["popularity"])

### Research Question 3 (What is the relationship between Genre and Popularity?)

In [None]:
#look at the genres column
movie_dataset.genres.head(2)

In [None]:
#The genres column can hold more than one genre per movie; we will separate them so that each genre of a movie gets its own row
movie_dataset["genres"]= movie_dataset["genres"].str.replace("|",",",regex=False)

In [None]:
#source code for spliting https://programmer.ink/think/pandas-how-do-i-split-text-in-a-column-into-multiple-lines-python.html
movie_dataset2 = movie_dataset.drop("genres", axis=1,).join(movie_dataset["genres"].str.split(",",expand=True).stack().reset_index(level=1, drop=True).rename("genres")).reset_index()
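
As an alternative sketch of the same reshaping, recent pandas versions (0.25 and later) offer `DataFrame.explode`, which turns each element of a list into its own row. The toy table below is made up and uses the original pipe-separated format; this is an illustration, not the method used in the cell above.

```python
import pandas as pd

# Toy rows using the same pipe-separated format as the genres column.
toy = pd.DataFrame(
    {
        "original_title": ["A", "B"],
        "popularity": [1.5, 0.4],
        "genres": ["Action|Adventure", "Drama"],
    }
)

# Split each string into a list, then give every list element its own row.
per_genre = (
    toy.assign(genres=toy["genres"].str.split("|"))
    .explode("genres")
    .reset_index(drop=True)
)

print(per_genre)
```

The other columns are repeated for every genre of a movie, which is the same layout the split-and-stack approach produces.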

In [None]:
movie_dataset2.head(10)

In [None]:
# the number of occurrences of the genre theme in the movies dataset
genre_occurrence =movie_dataset2.groupby("genres")["genres"].count()
genre_occurrence

In [None]:
# a bar chart of the number of occurrences of each genre in the movies dataset
plt.bar(x=genre_occurrence.index, height=genre_occurrence.values)
plt.xlabel("Genres")
plt.ylabel("Number of occurrences")
plt.title("Number of occurrences of each genre")
plt.xticks(rotation=90);

In [None]:
#the popularity of each themed genre
genres_popularity =movie_dataset2.groupby("genres")["popularity"].sum()
genres_popularity

In [None]:
plt.bar(x=genres_popularity.index, height=genres_popularity.values)
plt.xlabel("Genres")
plt.ylabel("Sum of popularity")
plt.title("Sum of popularity for each genre")
plt.xticks(rotation=90);
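
A summed popularity grows with the number of movies in a genre, so it can help to look at the count, sum and mean side by side. The sketch below does this with `groupby(...).agg(...)` on made-up values; the numbers are chosen only to show how the sum and the mean can rank genres differently.

```python
import pandas as pd

# Toy one-row-per-genre data (made up): Drama is common, Western is rare.
toy = pd.DataFrame(
    {
        "genres": ["Drama", "Drama", "Drama", "Drama", "Western"],
        "popularity": [0.5, 0.5, 0.5, 0.5, 1.5],
    }
)

# One row per genre with the number of movies, total and average popularity.
summary = toy.groupby("genres")["popularity"].agg(["count", "sum", "mean"])
print(summary)
```

In this toy table Drama has the larger sum while Western has the larger mean.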

In [None]:
#creating a dataframe to hold the number of occurrence and popularity count by genre
genre_data = pd.DataFrame([genre_occurrence.index,genre_occurrence.values,genres_popularity.values])
genre_data = genre_data.T
genre_data.columns=["Genres","No. of Genre Occurrence","Genre Popularity Count"]
genre_data

In [None]:
#source code adapted from https://towardsdatascience.com/easy-grouped-bar-charts-in-python-b6161cdd563d
# Visualizing the number of occurrence and popularity by genre
plt.style.use("fivethirtyeight")
fig, ax = plt.subplots(1,1, figsize = (8,6))
label = genre_data["Genres"]
x = np.arange(len(label))
width = 0.3
rect1 = ax.bar(x - width/2,
              genre_data["No. of Genre Occurrence"],
              width = width, 
               label = 'No. of Genre Occurrence',
               edgecolor = "black"
              )
rect2 = ax.bar(x + width/2,
              genre_data["Genre Popularity Count"],
              width = width,
              label = 'Genre Popularity Count',
              edgecolor = "black")
ax.set_ylabel("number",
             fontsize = 20,
             labelpad = 20)
ax.set_xlabel("Genres",
             fontsize = 20,
             labelpad =20)
ax.set_title("No. of Genre Occurrence vis-a-vis Genre Popularity Count",
            fontsize = 30,
            pad = 20)
ax.set_xticks(x)
ax.set_xticklabels(label)
ax.legend(title = "Genres Indicator",
         fontsize = 16,
         title_fontsize = 20)
ax.tick_params(axis = "x",
              which = "both",
              labelrotation = 90)
ax.tick_params(axis = "y",
              which = "both",
              labelsize = 15)

In [None]:
#scatter plot for number of genre occurrence vs genre popularity count
plt.scatter(genre_data["No. of Genre Occurrence"],genre_data["Genre Popularity Count"])
plt.ylabel("Genre Popularity Count")
plt.xlabel("No. of Genre Occurrence")
plt.title("No. of Genre Occurrence vs Genre Popularity Count");

In [None]:
#check the correlation between the number of occurrences of each genre and the summed popularity of that genre
#the columns have object dtype after the transpose, so cast them to numeric types first
genre_data["No. of Genre Occurrence"].astype(int).corr(genre_data["Genre Popularity Count"].astype(float))

### Research Question 4  (What is the relationship between Runtime and Popularity?)

In [None]:
#To find out whether there is any relationship between movie popularity and runtime length
# a scatter plot to show the relationship
plt.scatter(movie_dataset["runtime"],movie_dataset["popularity"])
plt.xlabel("runtime")
plt.ylabel("popularity")
plt.title("movie runtime vs popularity");

In [None]:
movie_dataset["runtime"].astype(int).corr(movie_dataset["popularity"].astype(int))
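
The cell above casts popularity to integers before correlating, which drops the decimal part of each score. The sketch below uses made-up values to show that this cast can change the Pearson coefficient; how much it matters for the TMDB data is not checked here.

```python
import pandas as pd

# Made-up values; popularity scores are mostly small decimals.
runtime = pd.Series([90, 100, 110, 120, 130])
popularity = pd.Series([0.2, 0.5, 0.9, 1.4, 2.6])

r_float = runtime.corr(popularity)                  # uses the full decimal values
r_truncated = runtime.corr(popularity.astype(int))  # decimals are cut off first

print(round(r_float, 3), round(r_truncated, 3))
```

Since `Series.corr` works directly on float columns, leaving out the cast keeps all of the information in the scores.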

### Research Question 5  (What is the relationship between Vote Count and Popularity?)

In [None]:
#scatter plot to show the relation between movie vote count and popularity
plt.scatter(movie_dataset["vote_count"],movie_dataset["popularity"])
plt.xlabel("vote_count")
plt.ylabel("popularity")
plt.title("movie vote count vs popularity");

In [None]:
#correlation between movie vote count and movie popularity
movie_dataset["vote_count"].corr(movie_dataset["popularity"])

### Research Question 6  (What is the relationship between Popularity and Revenue?)

In [None]:
#scatterplot for movie popularity and revenue
plt.scatter(movie_dataset["popularity"],movie_dataset["revenue"])
plt.xlabel("popularity")
plt.ylabel("revenue")
plt.title("movie popularity vs revenue");

In [None]:
movie_dataset["popularity"].astype(int).corr(movie_dataset["revenue"].astype(int))

### Research Question 7  (What is the relationship between Popularity and Profit?)

In [None]:
#we will calculate the profit realized for each movie
movie_dataset["profit"] = movie_dataset["revenue"] - movie_dataset["budget"]
movie_dataset["profit"]

In [None]:
#scatterplot between movie popularity and movie profit
plt.scatter(movie_dataset["popularity"],movie_dataset["profit"])
plt.xlabel("popularity")
plt.ylabel("profit")
plt.title("movie popularity vs profit");

In [None]:
#correlation between movie popularity and movie profit
movie_dataset["popularity"].corr(movie_dataset["profit"])
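
The pairwise correlations computed one at a time in this section could also be gathered in a single call to `DataFrame.corr()`. The sketch below uses a made-up table whose column names follow the ones in this notebook; the values are for illustration only.

```python
import pandas as pd

# Toy numeric columns (made up) named after the ones used in this notebook.
toy = pd.DataFrame(
    {
        "popularity": [0.2, 0.5, 0.9, 1.4, 2.6],
        "budget": [1, 3, 2, 6, 9],
        "vote_count": [10, 40, 60, 150, 400],
        "revenue": [2, 5, 3, 12, 30],
    }
)
toy["profit"] = toy["revenue"] - toy["budget"]

# Pearson correlation of every column against popularity, sorted high to low.
corr_with_popularity = (
    toy.corr()["popularity"].drop("popularity").sort_values(ascending=False)
)
print(corr_with_popularity)
```

Sorting the result gives a quick ranking of which variables move most closely with popularity.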

<a id='conclusions'></a>
## Conclusions

> From our analysis we see that:
  <br>The distribution of the popularity of movies in our dataset is right-skewed.
  <br>There is a positive correlation of 0.45 between budget and popularity.
  <br>There is a strong positive correlation of 0.94 between the number of occurrences of a genre and the summed popularity of that genre.
  <br>There is a weak positive correlation of 0.21 between movie runtime and popularity.
  <br>There is a strong positive correlation of 0.78 between the vote count and the popularity of a movie.
  <br>There is a positive correlation of 0.58 between the popularity of a movie and the revenue generated, as well as a positive correlation of 0.59 between movie popularity and profit accrued.
  <hr>
As such, we can conclude that a movie's popularity is associated with how much budget is invested in its production, its genre and the number of votes received from reviewers, and that a movie's popularity is in turn associated with how much revenue it generates as well as the total profit accrued.
  


### Recommendations
><br>Investment in already popular and well-accepted movie genres will most likely yield more profit for movie production companies.
 <br>Certain movies were extremely popular, as shown in our popularity distribution chart. Further studies could give insight into why these movies were so popular and what that popularity means for revenue generation.

### References
https://programmer.ink/think/pandas-how-do-i-split-text-in-a-column-into-multiple-lines-python.html
<br>
https://towardsdatascience.com/easy-grouped-bar-charts-in-python-b6161cdd563d

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])