# Project: Movie Data Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> The aim of this report is to analyze the ‘TMDB’ movie database from a numerical standpoint and to explore the following questions: 
1) Does a movie's budget have an impact on other features such as revenue, popularity and voting? 
2) Is there a correlation between a movie's popularity and its rating?  

> The analysis was performed mainly using the following columns of the dataset:  
• popularity: a popularity index; higher values represent more popular movies  
• budget: the budget allocated to produce the movie, in USD  
• revenue: the revenue of the movie, in USD  
• original_title: the title of the movie  
• runtime: the duration of the movie, in minutes  
• vote_count: the number of votes the movie received  
• vote_average: the average score the movie received from its voters  
• release_year: the year in which the movie was released  

>Since the analysis is performed from a numerical standpoint, string columns (e.g. movie cast and genres) are out of scope, as they are not directly associated with a numerical value. Moreover, only the financial figures as recorded at the time of each movie are considered. The impact of inflation and projections of the future economy are therefore not studied in this analysis.
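As a rough illustration of how the two questions above can be approached numerically, the sketch below computes Pearson correlation coefficients with pandas. It assumes Pearson correlation is an acceptable measure (the report does not prescribe one), and it uses only the four rows shown by `df.head(4)` in the Data Wrangling section, which is far too few to draw conclusions from; on the full dataset the same calls would be applied to `df`. The name `sample` is introduced here for illustration only.

```python
import pandas as pd

# Values copied from the four rows shown by df.head(4) in the Data Wrangling section.
# `sample` is a hypothetical name; in the notebook the full data frame `df` would be used.
sample = pd.DataFrame({
    'popularity': [32.985763, 28.419936, 13.112507, 11.173104],
    'budget': [150000000, 150000000, 110000000, 200000000],
    'revenue': [1513528810, 378436354, 295238201, 2068178225],
    'vote_count': [5562, 6185, 2480, 5292],
    'vote_average': [6.5, 7.1, 6.3, 7.5],
})

# Question 1: Pearson correlation of budget with every other numeric column.
budget_corr = sample.corr()['budget'].drop('budget')

# Question 2: Pearson correlation between popularity and rating.
pop_vs_rating = sample['popularity'].corr(sample['vote_average'])

print(budget_corr)
print(pop_vs_rating)
```

A coefficient close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none; correlation alone does not establish that budget causes the other features to change.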


In [1]:
#importing the required packages for the project:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


<a id='wrangling'></a>
## Data Wrangling


### General Properties

In [2]:
#Reading the csv database
df = pd.read_csv('tmdb-movies.csv')

# presenting the first four rows of the data
df.head(4)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0


The data has 10,866 rows and 21 columns, as shown below:

In [224]:
df.shape

(10866, 21)

In [225]:
# Running info() to acquire general view of the data frame 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date       

### Dropping Unnecessary Data

The dataset has 10,866 rows. Several columns have missing values, including cast, homepage, director, tagline, keywords and production_companies. However, the scope of this analysis is to explore the data from a numerical standpoint. Since the missing values occur in string columns, those columns can be dropped from the data frame. An exception is made for original_title, as it is easier to identify a movie by its name.
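The number of missing values per column can be derived from the `df.info()` output by subtracting each non-null count from the total row count. The sketch below does that arithmetic for the columns whose counts are visible in that output; the names `TOTAL_ROWS`, `non_null` and `missing` are introduced here for illustration. In the notebook itself, `df.isnull().sum()` should give the same numbers directly.

```python
import pandas as pd

TOTAL_ROWS = 10866  # row count reported by df.shape

# Non-null counts copied from the df.info() output (columns 0 to 14 only;
# the remaining rows of that output are not shown, so they are omitted here).
non_null = pd.Series({
    'id': 10866,
    'imdb_id': 10856,
    'popularity': 10866,
    'budget': 10866,
    'revenue': 10866,
    'original_title': 10866,
    'cast': 10790,
    'homepage': 2936,
    'director': 10822,
    'tagline': 8042,
    'keywords': 9373,
    'overview': 10862,
    'runtime': 10866,
    'genres': 10843,
    'production_companies': 9836,
})

# Missing values per column, keeping only columns that have any.
missing = TOTAL_ROWS - non_null
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Every column that remains in `missing` has dtype `object` in the `info()` output, while the numeric columns listed there (popularity, budget, revenue, runtime) have no missing entries.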

In [226]:
# Drop the string columns that are outside the numerical scope of this project
df.drop(['cast', 'homepage', 'director', 'tagline', 'keywords', 'overview', 'genres', 'production_companies'], axis=1, inplace=True)
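One caveat with an in-place drop is that re-running the cell raises a `KeyError`, because the columns no longer exist. A possible alternative, sketched below under the assumption that a re-runnable cell is preferred, is to pass `errors='ignore'` and assign the result to a new name. The `demo` frame holds two rows with values copied from the `df.head(4)` output and only a few of the columns; `demo`, `string_cols` and `trimmed` are hypothetical names.

```python
import pandas as pd

# Two rows with values copied from the df.head(4) output; only a few columns
# are included to keep the example short.
demo = pd.DataFrame({
    'original_title': ['Jurassic World', 'Mad Max: Fury Road'],
    'director': ['Colin Trevorrow', 'George Miller'],
    'tagline': ['The park is open.', 'What a Lovely Day.'],
    'budget': [150000000, 150000000],
})

# Same column list as in the cell above.
string_cols = ['cast', 'homepage', 'director', 'tagline', 'keywords',
               'overview', 'genres', 'production_companies']

# errors='ignore' skips labels that are absent, so the call can be repeated safely.
trimmed = demo.drop(columns=string_cols, errors='ignore')
trimmed_again = trimmed.drop(columns=string_cols, errors='ignore')

print(list(trimmed.columns))  # ['original_title', 'budget']
```

Returning a new frame also leaves the original data available for comparison, at the cost of holding two copies in memory.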

In [4]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])

0