# Project: TMDB-movies Data Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id = "intro"></a>
## Introduction

<p>
My analysis is based on the TMDb movies dataset. It contains information about roughly 10,000 movies collected from The Movie Database (TMDb), including each movie's popularity, genres, user ratings, and revenue. I'm interested in finding patterns in the dataset.
</p>

<h3>Questions</h3>
<ul>
  <li>Which genres gain more popularity annually?</li>
  <li>What Characteristics Are Associated With High-Profit Movies?</li>
</ul>



In [None]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

<a id='wrangling'></a>
## Data Wrangling

<p>
  In this section I load the dataset with pandas, check how clean it is, and then trim and clean it where necessary for the analysis.
</p>

#### Steps Involved:
<ol>
  <li>Load the data.</li>
  <li>Print some rows to understand what the data contains.</li>
  <li>Check the dimensions of the data, that is, the number of rows and columns.</li>
  <li>Check basic information about the data, including the data type and null count of each column.</li>
  <li>Compute basic statistics that give a concise summary of the data.</li>
</ol>

### General Properties

In [None]:
#load the data using pandas read_csv method
df = pd.read_csv("./tmdb-movies.csv")
#printing the top 5 rows of the data
df.head()

In [None]:
#printing the last 5 rows
df.tail()

In [None]:
# checking the dimension of the data
df.shape

In [None]:
#print basic information about the data
df.info()

In [None]:
#print a concise statistical summary of the numeric columns
df.describe()

>After observing the data, I found that it contains:
<ol>
  <li>10866 rows and 21 columns</li>
  <li>some columns with null values</li>
  <li>some movies with zero values in the budget and revenue columns</li>
</ol>


In [None]:
#count the total number of null values per column
df.isna().sum()

>Based on these observations and the questions stated above, I'll delete the irrelevant data and keep only what the analysis needs.

### Data Cleaning (Deleting non-essential data from the dataset)




>Data that must be changed or removed:
<ol>
<li>Delete duplicated rows.</li>
<li>Delete non-essential columns that are not required for the analysis.</li>
<li>Convert the release date from object to datetime format.</li>
<li>Delete rows that have inaccurate or improper values.</li>
</ol>

In [None]:
#create a copy of the original dataset
# changes made to the copy will not be reflected in the original data
df_copied = df.copy(deep=True)
df_copied

> I made a copy of the dataset so that changes made to the copy are not reflected in the original data.

><b>1. Check and remove duplicate rows</b>

In [None]:
# check the number of duplicated rows
sum(df_copied.duplicated())

In [None]:
#delete duplicated rows from the dataset
df_copied.drop_duplicates(inplace=True)
sum(df_copied.duplicated())

In [None]:
#shape after dropping the duplicate rows
df_copied.shape

<b>2. Deleting non-essential columns that are not required throughout the analysis</b>

In [None]:
df_copied.drop(["imdb_id","cast","homepage","tagline","keywords","overview","budget_adj","revenue_adj"],axis=1, inplace=True)
df_copied.head()

<b>3. Converting the release date from object to datetime format</b>

In [None]:
# change the data type of release_date to datetime format
df_copied["release_date"] =  pd.to_datetime(df_copied["release_date"])
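
One thing that may be worth checking after this conversion: if `release_date` is stored as two-digit-year strings such as `6/9/15` (an assumption about the file, not something verified here), years 00 to 68 are read as 2000 to 2068, so a 1966 release could come out as 2066. Below is a minimal sketch of a guard that uses `release_year` to correct such rows. The helper name `parse_release_dates` and the toy rows are hypothetical.

```python
import pandas as pd

def parse_release_dates(frame):
    """Parse m/d/yy strings; shift dates that land after release_year back 100 years."""
    parsed = pd.to_datetime(frame["release_date"], format="%m/%d/%y")
    # a parsed year later than the recorded release_year means the wrong century was chosen
    wrong_century = parsed.dt.year > frame["release_year"]
    corrected = parsed - pd.DateOffset(years=100)
    return parsed.where(~wrong_century, corrected)

# toy rows (made-up values, not taken from the dataset)
toy = pd.DataFrame({
    "release_date": ["6/9/15", "12/25/66"],
    "release_year": [2015, 1966],
})
toy["release_date"] = parse_release_dates(toy)
```

If the dates in the file already carry four-digit years, this guard changes nothing and can be skipped.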

In [None]:
df_copied.info()

<b>4. Delete rows that have values that are inaccurate or improper.</b>

In [None]:
#check the number of rows that have budget column to be zero(0)
print(df_copied[df_copied["budget"]==0].shape[0])
#check the number of rows that have revenue column to be zero(0)
print(df_copied[df_copied["revenue"]==0].shape[0])

In [None]:
#delete the rows whose budget is zero
df_copied.drop(df_copied.index[df_copied['budget'] == 0], inplace=True)

In [None]:
#delete the rows whose revenue is zero
df_copied.drop(df_copied.index[df_copied['revenue'] == 0], inplace=True)
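
The two drops above could also be written as a single boolean filter. Here is a small sketch on toy data (the values are made up) that keeps only the rows where both `budget` and `revenue` are non-zero:

```python
import pandas as pd

# toy rows (hypothetical values)
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [100, 0, 50, 0],
    "revenue": [300, 20, 0, 0],
})

# True only where both money columns hold a non-zero value
has_money_data = (toy["budget"] != 0) & (toy["revenue"] != 0)
# .copy() gives an independent dataframe rather than a view of toy
toy_clean = toy[has_money_data].copy()
```

The filter leaves the source dataframe untouched, which can make the cell safer to re-run than an in-place drop.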

In [None]:
df_copied.info()

In [None]:
df_copied.describe()

> After gathering, assessing, and cleaning the data, I will now investigate it based on the questions stated in the introduction.

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1: Which genres gain more popularity annually?

In [None]:
# genres of each movie as a string (a missing value becomes the string 'nan')
genre_i = list(map(str, df_copied["genres"]))

genre = ['Action','Adventure', 'Science Fiction', 'Thriller','Fantasy', 'Crime', 'Western', 'Family','Documentary', 'Animation','War','Mystery','Romance','TV Movie','Comedy','Drama' , 'History', 'Music', 'Horror','Foreign','nan']

year = np.array(df_copied['release_year'])
popularity = np.array(df_copied['popularity'])

# create a dataframe filled with zeros whose index is the genres and whose columns are the years
popularity_df = pd.DataFrame(0.0, index = genre, columns = range(1960, 2016))

# add each movie's popularity to every genre it belongs to, in its release year
for genres_str, yr, pop in zip(genre_i, year, popularity):
    genre_s = genres_str.split('|')
    popularity_df.loc[genre_s, yr] += pop

##### Note on the above cell
>I started by splitting the genres column into a list for each movie. A single movie can be categorized into several genres, and the split gets rid of the "|" that separates them.
>I then created a dataframe indexed by the unique genres, with one column per year, that holds the summed popularity of the movies in each genre.
>The function below standardizes these totals: for each year it subtracts the mean across genres and divides by the standard deviation. I treat the resulting score as the popularity of the genre in that year.
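
The splitting and summing described above can also be sketched with pandas' string and reshaping methods instead of an explicit loop. This is an alternative, not the code the notebook runs. The function name `genre_year_popularity` and the toy rows are hypothetical.

```python
import pandas as pd

def genre_year_popularity(movies):
    """Return a genre x year table of summed popularity."""
    exploded = (
        # turn "Action|Drama" into ["Action", "Drama"] ...
        movies.assign(genre=movies["genres"].astype(str).str.split("|"))
        # ... then give each genre its own row
        .explode("genre")
    )
    return exploded.pivot_table(
        index="genre", columns="release_year",
        values="popularity", aggfunc="sum", fill_value=0.0,
    )

# toy rows (made-up values)
toy = pd.DataFrame({
    "genres": ["Action|Drama", "Drama", "Action"],
    "release_year": [2010, 2011, 2010],
    "popularity": [1.0, 2.0, 0.5],
})
table = genre_year_popularity(toy)
```

Unlike the fixed genre list in the notebook, this table only has rows for genres that actually occur in the data.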

In [None]:
# standardize the popularity of the various genres (z-score within each year)
def calculate_std(data):
    return (data-data.mean())/data.std(ddof=0)

popularGenre = calculate_std(popularity_df)
popularGenre.head(5)
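
To make the effect of `calculate_std` concrete, here is a small check on toy numbers (made up for illustration). Because `mean()` and `std()` work column by column on a dataframe, each year is standardized separately, so every year's column should end up with mean 0 and standard deviation 1.

```python
import pandas as pd

def calculate_std(data):
    # same formula as in the cell above: column-wise z-score
    return (data - data.mean()) / data.std(ddof=0)

# toy genre x year table (hypothetical values)
toy = pd.DataFrame(
    {2014: [1.0, 2.0, 3.0], 2015: [10.0, 10.0, 40.0]},
    index=["Action", "Drama", "Comedy"],
)
scores = calculate_std(toy)

col_means = scores.mean()       # expected to be about 0 for every year
col_stds = scores.std(ddof=0)   # expected to be about 1 for every year
```

A consequence is that scores are comparable between genres within one year, but a genre's score in one year is only relative to the other genres in that same year.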

In [None]:
# basic information about the standardized genre popularity dataframe
popularGenre.info()

In [None]:
# show the standardized data for 2010 to 2015 as a horizontal bar plot
popularGenre.iloc[:,50:].plot(kind='barh',figsize = (20,40),fontsize=13)
# name the axes and title of the graph
plt.title("Most Popular Genre from 2010 to 2015",fontsize=16)
plt.xlabel("Popularity",fontsize=15)
plt.ylabel("Genres",fontsize = 15)


#### Brief description of the above plot
* The graph above depicts the popularity of genres from 2010 to 2015.
* In 2010, action movies were more popular than the others.
* In 2011, action movies were more popular than the others.
* In 2012, drama was the most popular.
* In 2013, drama was the most popular, while fantasy was the least popular.
* In 2014, action movies were more popular than the others.
* In 2015, action movies were more popular than the others.
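
The per-year leaders listed above could also be read off the table directly rather than from the bars. Here is a minimal sketch on toy scores (the values are made up). With the notebook's table, the equivalent call would be `popularGenre.iloc[:, 50:].idxmax()`.

```python
import pandas as pd

# toy standardized scores: rows are genres, columns are years (hypothetical values)
toy_scores = pd.DataFrame(
    {2014: [1.2, 0.3, -1.5], 2015: [0.9, 1.1, -2.0]},
    index=["Action", "Drama", "Fantasy"],
)

top_genre = toy_scores.idxmax()     # most popular genre in each year
bottom_genre = toy_scores.idxmin()  # least popular genre in each year
```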

In [None]:
figure, axis = plt.subplots(3,3,figsize = (16,10))

#set the title of the subplot.
figure.suptitle('Genre Popularity based on Various Years',fontsize = 16)

#plot the individual genre plot to see the popularity difference yearly.
popularGenre.loc['Action'].plot(label = "Action",color='#33FFB5',ax = axis[0][0],legend=True)
popularGenre.loc['Drama'].plot(label = "Drama",color = '#f67280',ax = axis[0][1],legend=True)
popularGenre.loc['Science Fiction'].plot(label = "Science Fiction",color='#6f6600',ax = axis[1][2],legend=True)
popularGenre.loc['Comedy'].plot(label = "Comedy",color='#fe5f55',ax = axis[0][2],legend=True)
popularGenre.loc['Romance'].plot(label = "Romance",color='#1a2c5b',ax = axis[1][1],legend=True)
popularGenre.loc['Thriller'].plot(label = "Thriller",color='#00818a',ax = axis[1][0],legend=True)
popularGenre.loc['Music'].plot(label = "Music",color='#db3b61',ax = axis[2][0],legend=True)
popularGenre.loc['Adventure'].plot(label = "Adventure",color='#08c299',ax = axis[2][1],legend=True)
popularGenre.loc['Crime'].plot(label = "Crime",color='c',ax = axis[2][2],legend=True)

##### A short note on the graph above
>The graph above shows the popularity of various genres from 1960 to 2015.
* It's worth noting that music's popularity peaked in 2010 and peaked in 1964, respectively.
* In terms of popularity, it grew in popularity in 2003 and peaked in 1975.
* The highest and lowest popularity for comedy were recorded in 1988 and 1971, respectively. 
* The highest and lowest popularity for drama were reported in 1976 and 1977, respectively.
* Thriller films had their peak popularity in 1964 and their lowest in 1961.
* In 1965, romance was at its peak, and in 1978, it was at its lowest.
* Science fiction had its peak popularity in 1979 and its lowest in 1964; adventure had its highest popularity in 1962 and its lowest in 1964.
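
Peak and lowest years like the ones listed above can be checked against the table by looking along each genre's row. This is a small sketch on toy scores (made-up values). On the notebook's data the same calls would be applied to `popularGenre`.

```python
import pandas as pd

# toy standardized scores: rows are genres, columns are years (hypothetical values)
toy_scores = pd.DataFrame(
    {1962: [0.4, -0.2], 1964: [1.3, -1.1], 2010: [-0.7, 0.8]},
    index=["Adventure", "Music"],
)

# axis=1 searches across the year columns for each genre
extremes = pd.DataFrame({
    "peak_year": toy_scores.idxmax(axis=1),
    "lowest_year": toy_scores.idxmin(axis=1),
})
```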
  

### Research Question 2  : What Characteristics Are Associated With High-Profit Movies?


In [None]:
# columns to examine alongside revenue
data=["id","popularity","budget","revenue","original_title","runtime","vote_average","release_year"]
# keep only those columns, sorted from highest to lowest revenue
xtics = df_copied[data].sort_values("revenue", ascending=False)

xtics.head()


In [None]:

# set the plot style before creating the figure so that it takes effect
sns.set_style("whitegrid")

figure, axis = plt.subplots(2,2,figsize = (16,6))
figure.suptitle("Revenue Vs (Budget,Popularity,Vote Average, Runtime)",fontsize=14)
# scatter plot with a fitted regression line for each property, showing its relationship with revenue
sns.regplot(x=df_copied['revenue'], y=df_copied['budget'],color='c',ax=axis[0][0])
sns.regplot(x=df_copied['revenue'], y=df_copied['popularity'],color='c',ax=axis[0][1])
sns.regplot(x=df_copied['revenue'], y=df_copied['vote_average'],color='c',ax=axis[1][0])
sns.regplot(x=df_copied['revenue'], y=df_copied['runtime'],color='c',ax=axis[1][1])

In [None]:
# check if there is any zero value and replace it with NaN
xtics['budget'] = xtics['budget'].replace(0,np.nan)
xtics['revenue'] = xtics['revenue'].replace(0,np.nan)
# calculate the correlation between revenue and the other properties
# (numeric_only=True skips text columns such as original_title)
corr = xtics.corr(numeric_only=True)
# keep only the correlation of revenue with each property of interest
corr_data = corr.loc[["budget","popularity","vote_average","runtime"], ["revenue"]]
# display the transpose of corr_data
corr_data.transpose()



In [None]:
fig,ax = plt.subplots( figsize =( 4 , 2 ) )
fig.suptitle("Correlation between revenue and (budget,popularity,vote Average and runtime)",fontsize=14)
cmap = sns.diverging_palette( 520 , 10 , as_cmap = True )
fig = sns.heatmap(corr_data.transpose(),cmap = cmap,square=True, cbar_kws={ 'shrink' : .9 }, ax=ax, annot = True, annot_kws = { 'fontsize' : 12 })

#### Explanation of the above plots

 * First Plot: Revenue Vs Budget
 >Quite a number of smaller movies have relatively large budgets, yet revenue still tends to increase with budget. Based on the graph, there is a good probability that movies with greater budgets earn higher revenue. The correlation coefficient is 0.69.
 
 *  Second Plot: Revenue Vs Popularity
 >Popularity is positively related to revenue: the more popular a movie is, the higher its revenue tends to be. The correlation is 0.65.

* Third Plot: Revenue Vs Vote Average
>Because the correlation between rating and revenue is low, vote average appears to have little or no relationship with a movie's revenue.

* Fourth Plot: Revenue Vs Runtime
>The relationship between runtime and revenue is weak. Although the correlation between them is positive (0.25), runtime appears to have little effect on a movie's revenue.
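
The research question asks about high-profit movies, while the cells above correlate each property with revenue. One possible extension, sketched here under the assumption that profit is simply `revenue - budget`, is to correlate the same properties with profit. The `profit` column and the toy values are hypothetical and not part of the original analysis.

```python
import pandas as pd

# toy rows (made-up values)
toy = pd.DataFrame({
    "budget": [10, 50, 100, 200],
    "revenue": [5, 120, 150, 700],
    "popularity": [0.5, 1.5, 1.0, 6.0],
    "vote_average": [6.1, 5.8, 7.0, 6.5],
    "runtime": [95, 110, 100, 140],
})

# assumed definition: profit is revenue minus budget
toy["profit"] = toy["revenue"] - toy["budget"]

# Pearson correlation of each property with profit
features = ["budget", "popularity", "vote_average", "runtime"]
profit_corr = toy[features].corrwith(toy["profit"])
```

On the real data this would be run on `df_copied` after the zero-budget and zero-revenue rows have been removed, since a zero in either column would distort the profit figure.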

<a id='conclusions'></a>
## Conclusions

>After performing exploratory analysis on the questions I posed, these are my findings:
* Action movies gained the most popularity over the years, followed by drama and then thrillers.
* Most movies with greater budgets and higher popularity have proven to earn more revenue.

#### Limitations
> During the data cleaning stage, I noticed that the genres column holds several genres per movie, separated by "|". As a result, mapping the genres to lists took much longer to run.

>Using the standardized values to represent the popularity of a genre might not be completely accurate. If the genres of each movie were more specific, the popularity measure would be more precise.