

# Project: Investigate TMDb movie data set
***

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


> For this project I have chosen to analyse the TMDb dataset, which contains information on more than 10,000 movies collected from The Movie Database (TMDb). It consists of 21 columns, such as imdb_id, revenue, budget, and vote_count.



> **The dataset can be used to answer the following questions:**

- How has movie production varied over the years?
- What are the top 10 movies in terms of revenue?
- Which genres are most popular?
- How is budget correlated with revenue? Does a higher budget mean higher revenue?


In [None]:
#Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# None means no truncation; passing -1 is deprecated in pandas 1.0 and later
pd.set_option('display.max_colwidth', None)

<a id='wrangling'></a>
## 1- Data Wrangling

> Before answering the proposed questions, we need to assess and clean the data.



### 1.1 - Data Gathering

In [None]:
#Load tmdb file into pandas df
df_tmdb = pd.read_csv('tmdb-movies.csv')

### 1.2 - Data Assessing

In [None]:

df_tmdb.head()

In [None]:
#Check for columns containing null values & data types
df_tmdb.info()

In [None]:
# Number of columns & rows in dataframe
df_tmdb.shape

In [None]:
df_tmdb.sort_values(by=['popularity'])

In [None]:
# Summary statistics
df_tmdb.describe()

In [None]:
# Columns with NULL values 
df_tmdb.isnull().sum()

In [None]:
# Number of duplicate rows
sum(df_tmdb.duplicated())

#### 1.2.1 - Observations:

  - Several columns are not needed for this analysis and can be removed.
  - The **release_date** column is stored as a string (str) instead of datetime.
  - Duplicate rows need to be removed.
  - Many movies have a zero value for **budget** or **revenue**, so these rows need to be removed.
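The last observation can be quantified before any rows are dropped. The snippet below is a minimal sketch on a small made-up DataFrame (`sample` and its values are hypothetical, not taken from the TMDb file); the same expressions should apply to `df_tmdb` assuming its `budget` and `revenue` columns are numeric.

```python
import pandas as pd

# Hypothetical sample standing in for the TMDb data; the values are made up.
sample = pd.DataFrame({
    'original_title': ['A', 'B', 'C', 'D'],
    'budget': [0, 1_000_000, 0, 5_000_000],
    'revenue': [0, 3_000_000, 2_000_000, 0],
})

money_cols = ['budget', 'revenue']

# Number of zero entries in each column
zero_counts = (sample[money_cols] == 0).sum()

# Number of rows where at least one of the two columns is zero
rows_with_zero = (sample[money_cols] == 0).any(axis=1).sum()

print(zero_counts)
print(rows_with_zero)
```

Comparing `rows_with_zero` with the total row count gives a sense of how much data the cleaning step removes.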

### 1.3 - Data Cleaning

**>> Removing unused columns:**

**Code:**

In [None]:
# List of columns to be deleted
del_col=['imdb_id','homepage','tagline','keywords','overview','budget_adj','revenue_adj']

# Remove unused columns
df_tmdb.drop(del_col, axis=1, inplace=True)

**Test:**

In [None]:
# Viewing the new dataset
df_tmdb.head()

**>> Convert the 'release_date' column to date format:**

**Code:**

In [None]:
df_tmdb['release_date']=pd.to_datetime(df_tmdb['release_date'])

**Test:**

In [None]:
df_tmdb.info()
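One thing worth checking after the conversion is whether the parsed year agrees with the `release_year` column. The sketch below assumes the dates are stored as month/day/two-digit-year strings (an assumption about the file, not something shown above); with that format, pandas reads two-digit years up to 68 as 20xx, so older movies can end up with dates in the future. The `sample` data is hypothetical.

```python
import pandas as pd

# Hypothetical sample; assumes dates look like 'm/d/yy' in the source file.
sample = pd.DataFrame({
    'release_date': ['6/9/15', '12/1/66', '3/20/99'],
    'release_year': [2015, 1966, 1999],
})

# Parse with an explicit format so the interpretation is unambiguous
parsed = pd.to_datetime(sample['release_date'], format='%m/%d/%y')

# Flag rows whose parsed year disagrees with the release_year column
mismatch = parsed.dt.year != sample['release_year']

# Shift only the mismatched rows back by one century
fixed = parsed.where(~mismatch, parsed - pd.DateOffset(years=100))

print(mismatch.sum())
print(fixed)
```

If `mismatch.sum()` is zero on the real data, no correction is needed.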

**>> Removing duplicate rows:**

**Code:**

In [None]:
# Drop duplicate rows
df_tmdb.drop_duplicates(keep = 'first', inplace = True)

**Test:**

In [None]:
sum(df_tmdb.duplicated())

**>> Removing movies with zero values in the budget and revenue columns:**

**Code:**

In [None]:
# List of columns with zero values
zero_rows = ['budget', 'revenue']

# Replace zero values with NaN (np.nan is the spelling supported by all NumPy versions)
df_tmdb[zero_rows] = df_tmdb[zero_rows].replace(0, np.nan)

# Drop any row with NaN in any of the columns listed in zero_rows
df_tmdb.dropna(subset=zero_rows, inplace=True)


**Test:**

In [None]:
df_tmdb.shape

<a id='eda'></a>
## Exploratory Data Analysis



### Research Question 1: How has movie production varied over the years?

In [None]:
# Number of movies produced each year
movies_prod_per_year= df_tmdb['release_year'].value_counts().sort_index();

movies_prod_per_year.plot(kind='line', figsize=(8, 4))
plt.title('Movie production over the years');
plt.xlabel('Year');
plt.ylabel('Number of movies released');


### Research Question 2: What are the top 10 movies in terms of revenue?

In [None]:
top_ten=df_tmdb.nlargest(10,'revenue');

top_ten.plot(kind='bar',x='original_title',y='revenue',figsize=(8, 4))

plt.title('Top 10 movies by revenue');
plt.xlabel('Movie title');
plt.ylabel('Revenue');


### Research Question 3: Which genres are most popular?

In [None]:
# Copy relevant columns into a new dataframe
df_genres = df_tmdb.filter(['id','popularity','genres','release_year'])
# Reference: https://www.codegrepper.com/code-examples/python/python+copy+columns+to+new+dataframe

df_genres.head()


In [None]:
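# Keep only the first genre listed for each movie (text before the first '|');
# expand=False returns a Series, which suits a single-column assignment
df_genres['filtered_genres'] = df_genres['genres'].str.extract('([^|]+)', expand=False)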
df_genres.head()


In [None]:
genres_count= df_genres['filtered_genres'].value_counts().sort_values( ascending=False);


genres_count.head()


In [None]:

genres_count.plot(kind='bar', figsize=(16, 8));
plt.title('Most Popular Genres');
plt.xlabel('Genres');
plt.ylabel('No. of movies');


### Research Question 4: How is budget correlated with revenue? Does a higher budget mean higher revenue?

In [None]:
# Plotting the correlation between Budget & Revenue
df_tmdb.plot(kind='scatter',x='budget',y='revenue',figsize=(8, 4))

plt.title('Budget vs Revenue');
plt.xlabel('Budget');
plt.ylabel('Revenue');
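The scatter plot shows the relationship visually; a single number can complement it. The sketch below computes the Pearson correlation coefficient with `Series.corr` on a small made-up DataFrame (`sample` and its values are hypothetical). On the real data the same call would be `df_tmdb['budget'].corr(df_tmdb['revenue'])`.

```python
import pandas as pd

# Hypothetical sample; the values are made up for illustration only.
sample = pd.DataFrame({
    'budget': [1e6, 5e6, 2e7, 1e8, 1.5e8],
    'revenue': [2e6, 4e6, 6e7, 2.5e8, 6e8],
})

# Pearson correlation coefficient, a value between -1 and 1
r = sample['budget'].corr(sample['revenue'])

print(round(r, 3))
```

A value close to 1 would indicate a strong positive linear relationship, but it does not by itself show that a higher budget causes higher revenue.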


<a id='conclusions'></a>
## Conclusions

> After completing this analysis, we found that movie production has increased significantly over the years, especially in the last 20 years, which suggests a growing direct and indirect economic impact. The most common genre was Drama, and the movie with the highest revenue was Avatar.

> **Limitations**: Only the first genre listed for each movie was used, so the genre counts reflect primary genres only. Movies with a zero budget or revenue were dropped during cleaning, so the results cover only the remaining movies.
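A possible way to count every genre rather than only the first one is to split the pipe-separated string and expand it to one row per genre. This is a minimal sketch of that alternative, not the method used in the analysis above; `sample` is a made-up DataFrame, and the sketch assumes genres are separated by `|` as in the extraction step.

```python
import pandas as pd

# Hypothetical sample; the values are made up for illustration only.
sample = pd.DataFrame({
    'id': [1, 2, 3],
    'genres': ['Action|Adventure', 'Drama', 'Drama|Action'],
})

# Split each string into a list of genres, then produce one row per genre
all_genres = sample['genres'].str.split('|').explode()

# Count how many movies carry each genre
genre_counts = all_genres.value_counts()

print(genre_counts)
```

With this approach a movie contributes to every genre it lists, so the counts add up to more than the number of movies.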
