# Project: Investigate a TMDB Movies Data Set, as part of the Udacity Advanced Data Analysis Nanodegree, provided by Egypt ITIDA

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


The aim of this project is to provide clean, concise, and well-supported answers to some real-world questions about the movie business, based on a careful analysis of a fairly messy collected data source.

## Here are some of the possible questions I can start finding answers to:

- Which movie genres earned the highest profits? -> worth studying if you decide to start a movie business!

- Which directors earned the highest profits? -> this tells you whom to call when starting your movie business :)

- To start producing a movie, your budget should NOT be lower than ... USD? -> gives you a figure, so you can do your math correctly before making the call!

- If you invest a massive budget, will the movie earn higher revenue? -> trying to find any correlation between budget and revenue! https://classroom.udacity.com/courses/ud170/lessons/5428018709/concepts/54422617800923

- Which genres are most popular from year to year?
- What kinds of properties are associated with movies that have high revenues?

In [364]:
# Importing analysis libraries 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 
import datetime


<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [365]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.

# Reading the main DF of the analysis
tmdb_data = pd.read_csv('tmdb-movies.csv')

> **Notes:**
1. `vote_count` differs a lot from one movie to another, so `vote_average` rests on very different numbers of votes. I believe it is unfair to compare movies based on `vote_average`.
2. My analysis is built on popularity, genres, budget, revenue, and profit, so many columns can be dropped from the DF to simplify the analysis.
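
The notes above come from inspecting the raw table. Below is a minimal sketch of the kind of checks that could back them up. It runs on a tiny made-up `sample` frame instead of `tmdb-movies.csv`, so every value in it is an illustrative assumption, not real data.

```python
import pandas as pd

# Tiny made-up sample standing in for tmdb-movies.csv (illustrative values only).
sample = pd.DataFrame({
    "original_title": ["A", "B", "C", "C"],
    "budget": [100, 0, 50, 50],
    "revenue": [300, 0, 0, 0],
    "genres": ["Drama", None, "Comedy", "Comedy"],
    "vote_count": [5000, 12, 300, 300],
})

# Missing values per column.
missing_per_column = sample.isnull().sum()

# Zeros in the money columns most likely mean "unknown" rather than "free".
zero_counts = (sample[["budget", "revenue"]] == 0).sum()

# Number of fully duplicated rows.
duplicate_rows = int(sample.duplicated().sum())

# Spread of vote_count, to see how uneven the voting base is.
vote_summary = sample["vote_count"].describe()

print(missing_per_column, zero_counts, duplicate_rows, vote_summary, sep="\n\n")
```

On the real DF, the same calls would be applied to `tmdb_data` right after `pd.read_csv`.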

### Data Cleaning

In [367]:
#Check for and remove any possible duplicate entries: https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/
tmdb_data.shape

(10866, 21)

In [368]:
tmdb_data.drop_duplicates(keep = 'first', inplace = True)

In [369]:
tmdb_data.shape #There was one duplicate entry and it's now removed.

(10865, 21)

In [370]:
#Removing any non-analysis-related columns from the DF
#Create a list of columns names that will be deleted
columns_to_delete = ['id', 'imdb_id', 'homepage', 'tagline', 'overview', 'release_date', 
                     'budget_adj', 'revenue_adj', 'keywords', 'vote_average']

#Delete the columns from DF
tmdb_data.drop(columns=columns_to_delete, inplace=True)  # keyword form; a positional axis argument is rejected by newer pandas

Since my analysis includes the budget and revenue variables, all entries with a value of 0 in either column should be removed first.

In [371]:
tmdb_data.head(2)

Unnamed: 0,popularity,budget,revenue,original_title,cast,director,runtime,genres,production_companies,vote_count,release_year
0,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,5562,2015
1,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,6185,2015


In [372]:
tmdb_data.dtypes

popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
director                 object
runtime                   int64
genres                   object
production_companies     object
vote_count                int64
release_year              int64
dtype: object

In [373]:
#Replacing 0 values with NaN, so I can then remove them from the DF
nan_list = ['budget', 'revenue']
tmdb_data[nan_list] = tmdb_data[nan_list].replace(0, np.nan)

#Remove all rows with NaN values, using dropna() function.
tmdb_data.dropna(subset = nan_list, inplace = True)

#Checking DF after cleaning NaNs
tmdb_data.shape
tmdb_data.head(2)

Unnamed: 0,popularity,budget,revenue,original_title,cast,director,runtime,genres,production_companies,vote_count,release_year
0,32.985763,150000000.0,1513529000.0,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,5562,2015
1,28.419936,150000000.0,378436400.0,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,6185,2015


In [374]:
#Replacing 0 with NaN has turned both the budget and revenue columns into float type.
# I need to change them back to int, for cleaner calculations and DF outputs.
# I will also change the popularity column to int64 for easier reading (note that this drops the fractional part).

tmdb_data.dtypes

popularity              float64
budget                  float64
revenue                 float64
original_title           object
cast                     object
director                 object
runtime                   int64
genres                   object
production_companies     object
vote_count                int64
release_year              int64
dtype: object

In [375]:
#Adding a new column to show the profit gained for each movie, by subtracting budget from revenue
# https://www.geeksforgeeks.org/python-pandas-dataframe-insert/
tmdb_data.insert(3, 'profit', tmdb_data['revenue'] - tmdb_data['budget'])

In [376]:
#Change columns type to int64, https://stackoverflow.com/questions/43956335/convert-float64-column-to-int64-in-pandas
columns_to_change = ['popularity', 'budget', 'revenue', 'profit']
tmdb_data[columns_to_change] = tmdb_data[columns_to_change].astype(np.int64)
tmdb_data.dtypes

popularity               int64
budget                   int64
revenue                  int64
profit                   int64
original_title          object
cast                    object
director                object
runtime                  int64
genres                  object
production_companies    object
vote_count               int64
release_year             int64
dtype: object
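
One side effect worth keeping in mind: `astype(np.int64)` truncates instead of rounding, so a popularity of 32.985763 becomes 32, and any value below 1 becomes 0. The sketch below contrasts truncation with two possible alternatives. The first two values are taken from the rows shown earlier; the third (0.62) is a made-up value added only to show the effect on small numbers.

```python
import numpy as np
import pandas as pd

# First two values come from the DF preview; 0.62 is a hypothetical small value.
popularity = pd.Series([32.985763, 28.419936, 0.62])

truncated = popularity.astype(np.int64)        # drops the fractional part
rounded = popularity.round().astype(np.int64)  # rounds to the nearest integer first
two_decimals = popularity.round(2)             # stays float, just shorter to read

print(pd.DataFrame({
    "original": popularity,
    "truncated": truncated,
    "rounded": rounded,
    "two_decimals": two_decimals,
}))
```

Keeping popularity as a rounded float may be the safer choice if it is ever used for ranking, since many distinct values collapse to the same integer.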

In [377]:
tmdb_data.head(2)

Unnamed: 0,popularity,budget,revenue,profit,original_title,cast,director,runtime,genres,production_companies,vote_count,release_year
0,32,150000000,1513528810,1363528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,5562,2015
1,28,150000000,378436354,228436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,6185,2015
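
The `genres` column packs several genres into one pipe-separated string, so a plain `groupby('genres')` would treat each combination as its own genre. A possible way to approach the genre questions from the introduction is to split and explode that column first. This is only a sketch on made-up rows (`movies`, its titles and profit figures are all hypothetical), and it counts a movie's full profit once for every genre it lists.

```python
import pandas as pd

# Made-up rows that mimic the pipe-separated layout of the genres column.
movies = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Adventure", "Action|Comedy", "Drama"],
    "profit": [100, 40, -10],
})

# One row per (movie, genre) pair.
per_genre = (
    movies.assign(genre=movies["genres"].str.split("|"))
          .explode("genre")
)

# Total, mean, and number of movies per genre, highest total profit first.
profit_by_genre = (
    per_genre.groupby("genre")["profit"]
             .agg(["sum", "mean", "count"])
             .sort_values("sum", ascending=False)
)

print(profit_by_genre)
```

Adding `release_year` to the `groupby` keys would give a per-year view of the same numbers.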


<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Here I'm playing with the clean data to explore some insights before proceeding with the analysis.


In [378]:
#Create a function to find the movies with the highest and lowest value of a given column, such as revenue, profit, or budget
def compute(column):
    
    # highest number in function-passed column
    highest = tmdb_data[column].idxmax()
    highest_pd = pd.DataFrame(tmdb_data.loc[highest])
   
    
    # lowest number in function-passed column
    lowest = tmdb_data[column].idxmin()
    lowest_pd = pd.DataFrame(tmdb_data.loc[lowest])
    
    
    #put them in one output together
    concatenated = pd.concat([highest_pd, lowest_pd], axis = 1, keys=['Highest', 'Lowest'], sort = False)
        
    return concatenated

### Research Question 1 (Movies with highest/ lowest budgets)

In [379]:
#Use compute function on budget column
compute('budget')

Unnamed: 0_level_0,Highest,Lowest
Unnamed: 0_level_1,2244,2618
popularity,0,0
budget,425000000,1
revenue,11087569,100
profit,-413912431,99
original_title,The Warrior's Way,Lost & Found
cast,Kate Bosworth|Jang Dong-gun|Geoffrey Rush|Dann...,David Spade|Sophie Marceau|Ever Carradine|Step...
director,Sngmoo Lee,Jeff Pollack
runtime,100,95
genres,Adventure|Fantasy|Action|Western|Thriller,Comedy|Romance
production_companies,Boram Entertainment Inc.,Alcon Entertainment|Dinamo Entertainment


#### Observation: The Warrior's Way has the highest budget in the whole dataset, while Lost & Found has the lowest!
  I believe that a budget of 1 USD is a wrong entry in the dataset, since it is not plausible. However, the calculations here are based on the data as given, and are meant to showcase how to analyse a dataset and extract some useful insights.
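
If such entries should be set aside instead of kept, one option is to flag rows whose budget or revenue falls below a plausibility cutoff. The sketch below uses the budget and revenue figures reported in this notebook's outputs for three movies; the cutoff `MIN_PLAUSIBLE_USD` is a hypothetical value chosen only for illustration and would need to be justified for a real analysis.

```python
import pandas as pd

# Hypothetical cutoff: anything below this is treated as a likely data-entry error.
MIN_PLAUSIBLE_USD = 1_000

# Budget and revenue figures as they appear in this notebook's outputs.
movies = pd.DataFrame({
    "original_title": ["Lost & Found", "Shattered Glass", "Avatar"],
    "budget": [1, 6_000_000, 237_000_000],
    "revenue": [100, 2, 2_781_505_847],
})

# Rows where either money column is implausibly small.
is_suspicious = (movies["budget"] < MIN_PLAUSIBLE_USD) | (movies["revenue"] < MIN_PLAUSIBLE_USD)

suspicious = movies[is_suspicious]
plausible = movies[~is_suspicious]

print(suspicious)
print(plausible)
```

Reviewing `suspicious` by hand before dropping anything keeps the decision visible instead of silently changing the dataset.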

### Research Question 2  (Movies with highest/ lowest revenue)

In [380]:
#Use compute function on revenue column
compute('revenue')

Unnamed: 0_level_0,Highest,Lowest
Unnamed: 0_level_1,1386,5067
popularity,9,0
budget,237000000,6000000
revenue,2781505847,2
profit,2544505847,-5999998
original_title,Avatar,Shattered Glass
cast,Sam Worthington|Zoe Saldana|Sigourney Weaver|S...,Hayden Christensen|Peter Sarsgaard|Chloë Sevi...
director,James Cameron,Billy Ray
runtime,162,94
genres,Action|Adventure|Fantasy|Science Fiction,Drama|History
production_companies,Ingenious Film Partners|Twentieth Century Fox ...,Lions Gate Films|Cruise/Wagner Productions|Bau...


### Research Question 2  (Movies with highest/ lowest profit)

In [381]:
#Use compute function on profit column
compute('profit')

Unnamed: 0_level_0,Highest,Lowest
Unnamed: 0_level_1,1386,2244
popularity,9,0
budget,237000000,425000000
revenue,2781505847,11087569
profit,2544505847,-413912431
original_title,Avatar,The Warrior's Way
cast,Sam Worthington|Zoe Saldana|Sigourney Weaver|S...,Kate Bosworth|Jang Dong-gun|Geoffrey Rush|Dann...
director,James Cameron,Sngmoo Lee
runtime,162,100
genres,Action|Adventure|Fantasy|Science Fiction,Adventure|Fantasy|Action|Western|Thriller
production_companies,Ingenious Film Partners|Twentieth Century Fox ...,Boram Entertainment Inc.


#### Observation: Of course Avatar WINS! :D
   https://en.wikipedia.org/wiki/List_of_box_office_records_set_by_Avatar

### Research Question 3  (What is the runtime range of successful movies?)

In [393]:
#Getting the summary statistics of the runtime column
runtime_describe = tmdb_data['runtime'].describe()
runtime_describe

count    3854.000000
mean      109.220291
std        19.922820
min        15.000000
25%        95.000000
50%       106.000000
75%       119.000000
max       338.000000
Name: runtime, dtype: float64

The mean runtime is about 109 minutes and the median is 106 minutes; the middle half of the movies run between 95 and 119 minutes.

In [390]:
# Sum only the profit column per runtime, then keep the five largest totals
tmdb_data.groupby('runtime')['profit'].sum().sort_values(ascending = False).head(5)

runtime
130    7636942726
115    7531073476
124    6992887739
136    6697827329
93     6388365276
Name: profit, dtype: int64

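The five runtimes with the highest total profit range from 93 to 136 minutes, a good insight to have if you are starting a movie business. Keep in mind that these are profits summed over all movies sharing the same runtime, so runtimes shared by many movies are favored.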
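
Summing profit per exact runtime tends to favor runtimes that many movies happen to share. A possible alternative is to group runtimes into bands and compare the median profit per band. The sketch below does this on made-up numbers; the `movies` frame and the band edges in `bins` are assumptions for illustration only.

```python
import pandas as pd

# Made-up runtimes (minutes) and profits, for illustration only.
movies = pd.DataFrame({
    "runtime": [85, 92, 101, 110, 118, 125, 140, 162],
    "profit": [5, 20, 15, 40, 30, 60, 10, 250],
})

# Hypothetical band edges; each interval is closed on the right, e.g. (90, 120].
bins = [0, 90, 120, 150, 400]
labels = ["<=90", "91-120", "121-150", ">150"]
movies["runtime_band"] = pd.cut(movies["runtime"], bins=bins, labels=labels)

# Count, median, and total profit per band.
band_summary = (
    movies.groupby("runtime_band", observed=True)["profit"]
          .agg(["count", "median", "sum"])
)

print(band_summary)
```

Reading `count` next to `median` shows whether a band looks profitable because of many movies or because of a single outlier.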

### Research Question 4: Who are the directors making high profit movies?

In [396]:
# Grouping data by the 'director' column, summing profit, and showing the top 10 in descending order for easier reading
tmdb_data.groupby('director')['profit'].sum().sort_values(ascending = False).head(10)


director
Steven Spielberg     7467063772
Peter Jackson        5197244659
James Cameron        5081994863
Michael Bay          3557208171
David Yates          3379295625
Christopher Nolan    3162548502
Chris Columbus       3116631503
George Lucas         2955996893
Robert Zemeckis      2846690869
J.J. Abrams          2839169916
Name: profit, dtype: int64

### Research Question 5: What is the average budget of the top 10 movies in terms of profit?

In [399]:
#Getting the mean of the total profit of the top 10 directors from the previous cell
tmdb_data.groupby('director')['profit'].sum().sort_values(ascending = False).head(10).mean()

3960384477.3

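Note that the figure above, about 3.96 billion USD, is the mean of the total profit earned by the ten most profitable directors. It is not a budget, so on its own it does not give a minimum budget for a successful movie; answering that requires averaging the budget column over the ten most profitable movies.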
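
Research Question 5 asks for the average budget of the most profitable movies. One way to compute that directly is to pick the top rows by profit with `nlargest` and then average their budgets. The sketch below runs on made-up numbers and uses a hypothetical `TOP_N` of 3 so the result is easy to check by hand; on the real DF it would be 10 and applied to `tmdb_data`.

```python
import pandas as pd

# Made-up budgets and revenues, for illustration only.
movies = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E"],
    "budget": [200, 50, 120, 10, 80],
    "revenue": [900, 60, 500, 5, 300],
})
movies["profit"] = movies["revenue"] - movies["budget"]

# Hypothetical cutoff; the research question uses 10.
TOP_N = 3

# The TOP_N most profitable movies, then the mean of their budgets.
top_movies = movies.nlargest(TOP_N, "profit")
mean_budget_of_top = top_movies["budget"].mean()

print(top_movies[["original_title", "budget", "profit"]])
print(mean_budget_of_top)
```

The mean is sensitive to a few very large budgets, so reporting the median alongside it may give a fairer picture.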

#### Observation: Steven Spielberg tops the list of directors by total profit, which is consistent with his reputation as one of the most influential personalities in the history of cinema, see the IMDB article https://www.imdb.com/name/nm0000229/ :)
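
The introduction also asks whether a larger budget goes with higher revenue. A simple first look is the correlation between the two columns. The sketch below computes the Pearson coefficient and a rank-based (Spearman) coefficient on made-up numbers; the `movies` frame is hypothetical, and a high coefficient would only indicate association, not that a bigger budget causes higher revenue.

```python
import pandas as pd

# Made-up budgets and revenues, for illustration only.
movies = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [25, 30, 80, 70, 120],
})

# Pearson correlation: strength of the linear relationship.
pearson_r = movies["budget"].corr(movies["revenue"])

# Spearman correlation, computed as Pearson on the ranks:
# less sensitive to a few blockbuster outliers.
spearman_r = movies["budget"].rank().corr(movies["revenue"].rank())

print(round(pearson_r, 3), round(spearman_r, 3))
```

A scatter plot of budget against revenue is a natural companion to these two numbers.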

<a id='conclusions'></a>
## Conclusions
