# Project: Investigate the TMDB Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

**This is a TMDB movie dataset, and I am going to analyze it to understand:**
1. The impact of internet availability on the industry
2. The impact of economic recessions on the movie industry

**References**
1. https://github.com/palewire/cpi
2. https://github.com/celiao/tmdbsimple/
3. https://www.kaggle.com/tmdb/tmdb-movie-metadata/discussion/58203
4. https://www.kaggle.com/tmdb/tmdb-movie-metadata/discussion/271794
5. https://towardsdatascience.com/the-easiest-way-to-adjust-your-data-for-inflation-in-python-365490c03969

In [None]:
# Install the tmdbsimple library, used to fetch the missing movie information from TMDB
!pip install tmdbsimple
# Install the cpi library, used to convert budgets to 2010 dollars, the reference year as per the dataset documentation
!pip install cpi
# Upgrade pandas to be able to use the DataFrame.explode() function
!pip install --upgrade pandas
# !pip install --upgrade pandas==0.25.0

In [None]:
import pandas as pd
# Suppressing warnings is not a recommended option; it is enabled here because the cleaning code raised several SettingWithCopyWarning messages
## see: https://www.dataquest.io/blog/settingwithcopywarning/
pd.set_option('mode.chained_assignment', None)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# For internet calls that try to get as much of the missing data as possible
import requests
import json
# TMDB library
import tmdbsimple as tmdb
tmdb.API_KEY = 'INSERT_TMBD_TOKEN_HERE' # please add your own key if you reach the request limit, to be able to continue
# For inflation adjustment
import cpi
# For date and time functions
import datetime
# For time calculations, to measure how long a function takes
import time

In [None]:
# This can be commented out if you don't need the latest inflation data, especially as it takes a long time to run
cpi.update() 

<a id='wrangling'></a>
## Data Wrangling


### General Properties
<a id='generalfunctions'></a>
## General functions

> **Important**: This section includes the general functions that may be used by one or more of the cells below

In [None]:
def get_inflation_adjusted(amount, source_year, target_year):
    '''
    Convert `amount` from `source_year` dollars to `target_year` dollars using the cpi library.
    In the dataset, the final two columns ending with "_adj" show the budget and revenue
    of the associated movie in terms of **2010** dollars, accounting for inflation over time.
    '''
    amount_adjusted = cpi.inflate(amount, source_year, to=target_year)
    print('=== START === get_inflation_adjusted === START === ')
    print('amount ', amount, 'source_year ', source_year, 'target_year ', target_year)
    print('The inflation-adjusted value of ', amount, ' is ', amount_adjusted)
    print('=== FIN === get_inflation_adjusted === FIN === ')
    return amount_adjusted
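
The adjustment that `get_inflation_adjusted` delegates to the `cpi` library comes down to scaling an amount by the ratio of two price-index values. The sketch below only illustrates that arithmetic offline: `HYPOTHETICAL_CPI` holds made-up round numbers (not real CPI data) and `inflate_with_index` is a hypothetical helper, not part of `cpi`.

```python
# Hypothetical price index by year (illustrative values only, NOT real CPI data)
HYPOTHETICAL_CPI = {2000: 100.0, 2010: 125.0}


def inflate_with_index(amount, source_year, target_year, index=HYPOTHETICAL_CPI):
    """Scale `amount` from source_year dollars to target_year dollars."""
    return amount * index[target_year] / index[source_year]


# A 1,000,000 dollar budget from 2000 expressed in 2010 dollars
adjusted = inflate_with_index(1_000_000, 2000, 2010)
print(adjusted)  # 1250000.0
```

With the real library, the index values come from the downloaded CPI series rather than from a hand-written dictionary.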

In [None]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.
df = pd.read_csv('tmdb-movies.csv')

# display first ten rows of data
df.head(10)

In [None]:
# this returns a tuple of the dimensions of the dataframe
df.shape

In [None]:
# Get information about the data
df.info()

In [None]:
df.dtypes

In [None]:
# We will use this summary to compare the original description with the later one, after cleaning and completing the data from the internet
df.describe()

**Note** It is clear that we have many missing values, so we will try to fill them from a source of truth: TMDB and IMDb.
The types of missing values we will focus on are zeros and nulls.
We will count the missing values now and again later, to see how many values we were able to recover.

**Columns with Null or Zero values**

In [None]:
# Columns - With Null Value
# return a list of the columns which have missing values
columns_nulls = df.columns[df.isnull().any()]
print('columns_missing_values are: ', columns_nulls, '\n')

for column in columns_nulls:
    print('The number of nulls in ', column, ' is ', df[column].isnull().sum())

In [None]:
# Columns - With Zero Value
# return a list of the columns which have ZERO values
columns_zero_values = df.columns[df.isin([0]).any()]
print('columns_zero_values are: ', columns_zero_values)

'''
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 12  runtime               10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
'''
for column in columns_zero_values:
    value_zero = (df[column] == 0).sum()
    print('The number of zero values in ', column, ' is ', value_zero)
#     value_not_zero = (df[column] != 0).sum()
#     print('The number of non-zero values in ', column, ' is ', value_not_zero)
#     print('Total for verification is ', (value_zero + value_not_zero), '\n')

**Notes**
- Columns with null (missing) values: ['imdb_id', 'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview', 'genres', 'production_companies']

- Columns with zero values: ['budget', 'revenue', 'runtime', 'budget_adj', 'revenue_adj']

**Quick conclusion:**
- The finance-related columns (and 'runtime') contain zero values
- The descriptive, non-financial columns contain nulls (missing values)
- The 'release_date' and 'release_year' columns have no missing values, but the 'release_date' values are not all in the same format
- The 'id' column has no missing values, which helps when we look up the missing information on TMDB
- The 'popularity', 'original_title', and 'vote_count' columns have no missing values
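
The inconsistent `release_date` values could be unified by combining the month and day from the string with the four-digit `release_year` column, which avoids guessing the century of a two-digit year (Python's `%y` convention reads `66` as 2066, for instance). This is a minimal sketch on invented rows; the `m/d/yy` layout of the raw values is an assumption, and `unify_release_date` is a hypothetical helper.

```python
import pandas as pd

# Toy rows in an assumed m/d/yy layout; not taken from the real file
sample = pd.DataFrame({
    "release_date": ["6/9/15", "12/25/66", "1/1/99"],
    "release_year": [2015, 1966, 1999],
})


def unify_release_date(frame):
    """Build datetimes from the month/day in release_date and the 4-digit release_year."""
    parts = frame["release_date"].str.split("/", expand=True)
    components = pd.DataFrame({
        "year": frame["release_year"],
        "month": parts[0].astype(int),
        "day": parts[1].astype(int),
    })
    return pd.to_datetime(components)


unified = unify_release_date(sample)
print(unified)
```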

**Rows with null or zero values**

In [None]:
# Find the rows that contain a zero value
raws_with_zero_values_original = df[df.isin([0]).any(axis=1)]
print(raws_with_zero_values_original)

In [None]:
# Find the rows that contain a null value
raws_with_null_values_original = df[df.isnull().any(axis=1)]
print(raws_with_null_values_original)

**Note** We record the rows with null (missing) values and with zero values so that we can compare them later, after cleaning
- Rows with null values: [8874 rows x 21 columns]
- Rows with zero values: [7011 rows x 21 columns]

In [None]:
# Note: .size counts elements (rows x columns), not rows
print('raws_with_null_values (elements): ', raws_with_null_values_original.size)
print('raws_with_zero_values (elements): ', raws_with_zero_values_original.size)
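
Because `DataFrame.size` returns the number of elements rather than the number of rows, the values printed above are row counts multiplied by the 21 columns (8874 × 21 = 186354 and 7011 × 21 = 147231). The toy frame below, with invented values, shows the difference between the two measures.

```python
import pandas as pd

# Invented values: two of the three rows contain a zero
toy = pd.DataFrame({
    "budget": [0, 10, 20],
    "revenue": [5, 0, 30],
    "runtime": [90, 100, 110],
})

rows_with_zero = toy[toy.isin([0]).any(axis=1)]

n_rows = len(rows_with_zero)    # number of affected rows
n_cells = rows_with_zero.size   # affected rows multiplied by the number of columns

print(n_rows, n_cells)  # 2 6
```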


### Data Cleaning
1. Remove duplicates
2. Rename the id column to tmbd_id
3. Unify the release_date format
    3.1. Change the 2-digit years into 4-digit years
4. Fill the missing data from a trusted source
5. Adjust the inflation values
    5.1. Using values already in the dataframe, we will find
    - the budget from the adjusted budget and vice versa
    - the revenue from the adjusted revenue and vice versa
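
Deriving a value from its adjusted counterpart works because, within one release year, the ratio between the adjusted and the nominal amount is the same inflation factor for every movie. The sketch below estimates that factor per year from rows where both values are known and uses it to fill a zero on either side. The data is invented, `fill_from_counterpart` is a hypothetical helper, and a year with no complete row is left untouched.

```python
import pandas as pd

# Invented values: row 1 lacks the nominal budget, row 3 lacks the adjusted one
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2005, 2005],
    "budget":       [100.0, 0.0, 200.0, 50.0],
    "budget_adj":   [125.0, 250.0, 300.0, 0.0],
})


def fill_from_counterpart(frame, nominal="budget", adjusted="budget_adj"):
    """Fill zeros in one column from the other using a per-year inflation factor."""
    out = frame.copy()
    known = out[(out[nominal] > 0) & (out[adjusted] > 0)]
    # Inflation factor per year, estimated from rows where both values are known
    factor = (known[adjusted] / known[nominal]).groupby(known["release_year"]).median()
    row_factor = out["release_year"].map(factor)

    need_nominal = (out[nominal] == 0) & (out[adjusted] > 0) & row_factor.notna()
    need_adjusted = (out[adjusted] == 0) & (out[nominal] > 0) & row_factor.notna()

    out.loc[need_nominal, nominal] = out.loc[need_nominal, adjusted] / row_factor[need_nominal]
    out.loc[need_adjusted, adjusted] = out.loc[need_adjusted, nominal] * row_factor[need_adjusted]
    return out


filled = fill_from_counterpart(toy)
print(filled)
```

The same call with `nominal="revenue", adjusted="revenue_adj"` would cover the revenue pair.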

### Cleaning step 1 find & drop duplicates

In [None]:
# count the current duplicates
sum(df.duplicated())

We found only 1 duplicate row (no columns have been removed yet)

In [None]:
# drop duplicates
df.drop_duplicates(inplace=True)

In [None]:
# confirm correction by rechecking for duplicates in the data
sum(df.duplicated())

### Cleaning Step 2 rename confusing columns

In [None]:
# Cleaning Step 2 rename confusing columns
df.rename(columns={'id': 'tmbd_id'}, inplace=True)

In [None]:
# Confirm that the rename was successful
df.head(1)

### Cleaning Step 3 drop columns that will not be used
**NOTE** The analysis below focuses on financial figures rather than names, so I drop the columns that are not related to finance
- The following columns will be dropped
-- 'imdb_id', 'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview', 'genres', 'production_companies', 'release_date'

- The following columns will stay (tmbd_id was named id before the rename)
-- 'tmbd_id', 'popularity', 'budget', 'revenue', 'original_title', 'runtime', 'vote_count', 'vote_average', 'release_year', 'budget_adj', 'revenue_adj'

In [None]:
dfng = df.copy()

In [None]:
dfng.head(1)

In [None]:
dfng = dfng.drop(['imdb_id','cast','homepage','director','tagline','keywords','overview','genres','production_companies', 'release_date'], axis=1)

In [None]:
dfng.head(5)

### Cleaning step 4 fill the missing values from a source of truth

**Note** 
If data is missing and can be retrieved from a trusted source, retrieve it first and then do the analysis.
- Columns with zero values: ['budget', 'revenue', 'runtime', 'budget_adj', 'revenue_adj']

**Runtime**
- A movie with a runtime of 0 minutes does not make sense, so we try to fill the runtime column first; it has 31 zero values.

## !Note: the code below makes multiple internet calls and will take time
### You can skip it if you want, but your conclusions will then differ
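
Since every lookup is a network request, it helps to fetch each movie record once and reuse it for all the fields that are needed. The sketch below shows that pattern with a small in-memory cache. `fake_fetch` is a hypothetical stand-in for `tmdb.Movies(movie_id).info()` that returns invented data, so the example runs without a key or a connection.

```python
calls = []  # records every simulated network request


def fake_fetch(movie_id):
    """Hypothetical stand-in for tmdb.Movies(movie_id).info(); invented data, no network."""
    calls.append(movie_id)
    return {"id": movie_id, "runtime": 100 + movie_id, "budget": 0}


_cache = {}


def get_movie_info(movie_id, fetch=fake_fetch):
    """Fetch a movie record once and reuse it on later lookups."""
    if movie_id not in _cache:
        _cache[movie_id] = fetch(movie_id)
    return _cache[movie_id]


info = get_movie_info(7)
runtime, budget = info["runtime"], info["budget"]  # two fields, one request
get_movie_info(7)                                   # served from the cache

print(len(calls))  # 1
```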

In [None]:
# Columns with zero values: ['budget', 'revenue', 'runtime', 'budget_adj', 'revenue_adj']
# For every movie with a zero runtime, fetch its TMDB record and refresh the row.
for idx in dfng.index[dfng["runtime"] == 0]:
    title = dfng.at[idx, "original_title"]
    movie_tmbd_id = dfng.at[idx, "tmbd_id"]
    print('\nThe movie ', title, ' has zero runtime, & its TMDB id is ', movie_tmbd_id)
    # Get the movie info from TMDB
    try:
        # info() sends an HTTP request, so call it once and reuse the result
        info = tmdb.Movies(movie_tmbd_id).info()
    except requests.HTTPError as exception:
        # Failed requests are reported as requests.HTTPError, as in the later API loop
        print("Oops! That TMDB request failed")
        print(exception)
        continue

    # A movie cannot be 0 minutes long; skip the row if TMDB has no runtime either
    if not info["runtime"]:
        continue
    print('The movie ', title, ' has a TMDB runtime of ', info["runtime"], ' mins')
    # Write with .at so the value is stored in dfng itself rather than in a temporary copy
    dfng.at[idx, "runtime"] = np.int64(info["runtime"])
    print('The new runtime is ', dfng.at[idx, "runtime"], '\n')

    # Update budget and revenue when TMDB has a non-zero value that differs from ours
    for col in ("budget", "revenue"):
        if info[col] and dfng.at[idx, col] != info[col]:
            print('The movie ', title, ' has dfng ', col, ' of ', dfng.at[idx, col], ' & TMDB ', col, ' of ', info[col])
            dfng.at[idx, col] = info[col]
            # Fill the inflation-adjusted column (2010 dollars) only if it is still zero
            if dfng.at[idx, col + "_adj"] == 0:
                adjusted = get_inflation_adjusted(info[col], int(dfng.at[idx, "release_year"]), 2010)
                dfng.at[idx, col + "_adj"] = np.int64(adjusted)
                print('The movie ', title, ' has updated ', col + "_adj", ' of ', dfng.at[idx, col + "_adj"])

    # Refresh popularity and votes whenever TMDB reports a different value
    for col in ("popularity", "vote_count", "vote_average"):
        if dfng.at[idx, col] != info[col]:
            print('The movie ', title, ' has dfng ', col, ' of ', dfng.at[idx, col], ' & TMDB ', col, ' of ', info[col])
            dfng.at[idx, col] = info[col]

### Cleaning step 5 1st evaluation after filling the runtime from TMDB and dropping the unneeded columns

In [None]:
# Columns - With no Value
# return a list of the columns which have missing values
dfng_columns_missing_values = dfng.columns[dfng.isnull().any()]
print('dfng_columns_missing_values are: ', dfng_columns_missing_values, '\n')

for column in dfng_columns_missing_values:
    print('The number of missing values in ', column, ' is ', dfng[column].isnull().sum())

In [None]:
# Columns - With Zero Value
# return a list of the columns which have ZERO values
dfng_columns_zero_values = dfng.columns[dfng.isin([0]).any()]
print('dfng_columns_zero_values are: ', dfng_columns_zero_values)


for column in dfng_columns_zero_values:
    dfng_value_zero = (dfng[column] == 0).sum()
    print('The number of zero values in ', column, ' is ', dfng_value_zero)

In [None]:
# Find the rows that contain a zero value
dfng_raws_with_zero_values = dfng[dfng.isin([0]).any(axis=1)]

In [None]:
# Find the rows that contain a null value
dfng_raws_with_null_values = dfng[dfng.isnull().any(axis=1)]

In [None]:
print ('dfng_ raws_with_null_values : ',dfng_raws_with_null_values.size)
print ('dfng_ raws_with_zero_values : ',dfng_raws_with_zero_values.size)

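**Note: The result of updating the data from the internet shows**
1. Columns
    1.1. The number of columns with null values is zero, because every column that contained nulls was dropped in step 3
    1.2. The number of zero values decreased in these columns:
        1.2.1 runtime: from 31 to just 4
        1.2.2 budget and revenue: both decreased by 2
2. Rows (the figures below are `.size` values, that is element counts = rows × columns, not row counts)
    2.1. Elements in rows with null values decreased from 186354 (8874 rows × 21 columns) to zero
    2.2. Elements in rows with zero values decreased from 147231 (7011 rows × 21 columns) to 84108; part of this decrease comes from the smaller number of columns after step 3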
    

Note: we will drop the 4 rows that still have a zero runtime value, as a movie cannot be zero minutes long

In [None]:
dfng.shape

In [None]:
# Delete the 4 rows where the runtime value is zero
dfng = dfng.loc[dfng["runtime"] != 0]

In [None]:
# Confirm that our dataset decreased by 4 rows
dfng.shape

## !!!: the code below makes more than 5000 API calls to TMDB and will take a very long time
### You can skip it if you want, but your conclusions will then differ

We cannot simply drop this many rows.
The budget column has 5694 zero values, which is a huge number, so we will try to get the values from the TMDB database.

In [None]:
# Columns with zero values: ['budget', 'revenue', 'runtime', 'budget_adj', 'revenue_adj']
# A budget below 1600 dollars is treated as missing and looked up on TMDB.
for idx in dfng.index[dfng["budget"] < 1600]:
    title = dfng.at[idx, "original_title"]
    movie_tmbd_id = dfng.at[idx, "tmbd_id"]
    print('\nThe movie ', title, ' has a budget of less than 1600 dollars, its TMDB id is ', movie_tmbd_id)
    # Get the movie info from TMDB
    try:
        # info() sends an HTTP request, so call it once and reuse the result
        info = tmdb.Movies(movie_tmbd_id).info()
    except requests.HTTPError as exception:
        print("Oops! That TMDB request failed")
        print(exception)
        continue

    # The row is only refreshed when TMDB reports a non-zero budget
    if not info["budget"]:
        continue
    release_year = int(dfng.at[idx, "release_year"])

    # Update budget and revenue, then recompute their values in 2010 dollars
    for col in ("budget", "revenue"):
        if info[col] and dfng.at[idx, col] != info[col]:
            print('The movie ', title, ' has dfng ', col, ' of ', dfng.at[idx, col], ' & TMDB ', col, ' of ', info[col])
            dfng.at[idx, col] = info[col]
            adjusted = get_inflation_adjusted(info[col], release_year, 2010)
            if dfng.at[idx, col + "_adj"] != adjusted:
                # Replace the old adjusted value with the calculated one
                dfng.at[idx, col + "_adj"] = np.int64(adjusted)
                print('The movie ', title, ' has updated ', col + "_adj", ' of ', dfng.at[idx, col + "_adj"])

    # Refresh popularity and votes when TMDB has a non-zero value that differs from ours
    for col in ("popularity", "vote_count", "vote_average"):
        if info[col] and dfng.at[idx, col] != info[col]:
            print('The movie ', title, ' has dfng ', col, ' of ', dfng.at[idx, col], ' & TMDB ', col, ' of ', info[col])
            dfng.at[idx, col] = info[col]
Save the data to a CSV file for future reuse, to avoid repeating all the internet calls.

## Attention
**If you didn't run the long API calls above**
1. Don't run the next step, as I already ran it once before and saved the data to a file
2. Skip ahead to the step that reads from the saved CSV file
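
The choice between running the API calls and reading the saved file could also be made in code rather than by skipping cells manually. The sketch below reads the CSV when it already exists and otherwise builds and saves the frame. `load_or_build` and `expensive_build` are hypothetical names, and a temporary directory stands in for the notebook's working directory.

```python
import os
import tempfile

import pandas as pd


def load_or_build(path, build):
    """Read the saved CSV if it exists; otherwise build the frame and save it."""
    if os.path.exists(path):
        return pd.read_csv(path)
    frame = build()
    frame.to_csv(path, index=False)
    return frame


build_calls = []


def expensive_build():
    """Hypothetical stand-in for the long API-driven cleaning step."""
    build_calls.append(1)
    return pd.DataFrame({"tmbd_id": [1, 2], "budget": [10, 20]})


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "movies_cache.csv")
    first = load_or_build(csv_path, expensive_build)   # builds and saves
    second = load_or_build(csv_path, expensive_build)  # reads the saved file

print(len(build_calls))  # 1
```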

In [None]:
# run only if you already did the api calls
dfng.to_csv('tmdb-movies_live_122021_not_clean.csv', index=False)

### Cleaning step 5 2nd evaluation after the extensive TMDB API calls

In [None]:
# Columns - With no Value
# return a list of the columns which have missing values
dfng_columns_missing_values = dfng.columns[dfng.isnull().any()]
print('dfng_columns_missing_values are: ', dfng_columns_missing_values, '\n')

for column in dfng_columns_missing_values:
    print('The number of missing values in ', column, ' is ', dfng[column].isnull().sum())

In [None]:
# Columns - With Zero Value
# return a list of the columns which have ZERO values
dfng_columns_zero_values = dfng.columns[dfng.isin([0]).any()]
print('dfng_columns_zero_values are: ', dfng_columns_zero_values)


for column in dfng_columns_zero_values:
    dfng_value_zero = (dfng[column] == 0).sum()
    print('The number of zero values in ', column, ' is ', dfng_value_zero)

In [None]:
# Find the rows that contain a zero value
dfng_raws_with_zero_values = dfng[dfng.isin([0]).any(axis=1)]

In [None]:
# Find the rows that contain a null value
dfng_raws_with_null_values = dfng[dfng.isnull().any(axis=1)]

In [None]:
print ('dfng_ raws_with_null_values : ',dfng_raws_with_null_values.size)
print ('dfng_ raws_with_zero_values : ',dfng_raws_with_zero_values.size)

**Note:** loading data from the new live TMDB file

In [None]:
dfupdated = pd.read_csv('tmdb-movies_live_122021_not_clean.csv')

In [None]:
dfupdated.shape

In [None]:
# Count the elements (.size = rows x columns) in the rows that contain a zero value
dfupdated[dfupdated.isin([0]).any(axis=1)].size

In [None]:
# Now we want to know which columns contain a zero value
dfupdated.columns[dfupdated.isin([0]).any()]

**Note**:
- We still have many rows with zero values:
    - The number of zero values in budget is 4815
    - The number of zero values in revenue is 5567
    - The number of zero values in budget_adj is 4815
    - The number of zero values in revenue_adj is 5567
- These zero values affect more than 45% of the data, so substituting the mean would distort the data. We will therefore drop the rows that contain a zero value and work on actual data, which gives a cleaner result than using the mean.
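
To illustrate why mean substitution distorts a column when a large share of it is missing, the sketch below compares the spread of the observed values with the spread after replacing zeros by the mean. The numbers are invented and only demonstrate the effect; they are not taken from the dataset.

```python
import pandas as pd

# Invented budgets: half of the rows are zero (missing)
toy = pd.DataFrame({"budget": [0, 0, 10, 20, 30, 0, 40, 0]})

is_zero = toy["budget"] == 0
zero_share = is_zero.mean()                  # fraction of rows with a zero

observed = toy.loc[~is_zero, "budget"]
# Replace every zero with the mean of the observed values
imputed = toy["budget"].astype(float).mask(is_zero, observed.mean())

std_observed = observed.std()
std_imputed = imputed.std()

print(zero_share, std_observed, std_imputed)
```

The mean stays the same after imputation, but the standard deviation shrinks because many identical values are added at the center of the distribution.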

In [None]:
# Keep only the rows that have no zero value in any column
dfupdated = dfupdated.loc[~dfupdated.isin([0]).any(axis=1)]

In [None]:
# Check that no row has a zero value any more
# Find the rows that contain a zero value
dfupdated_raws_with_zero_values = dfupdated[dfupdated.isin([0]).any(axis=1)]
print ('dfupdated_raws_with_zero_values : ',dfupdated_raws_with_zero_values.size)

In [None]:
dfupdated.shape

In [None]:
dfupdated.info()

In [None]:
dfupdated.describe()

In [None]:
dfupdated.head(5)

In [None]:
dfupdated['profite'] = dfupdated['revenue_adj']-dfupdated['budget_adj']

In [None]:
# Runtime cost == the cost of producing every minute of the movie
dfupdated['runtime_cost'] = dfupdated['budget_adj']/ dfupdated['runtime']

In [None]:
dfupdated.head(5)

<a id='eda'></a>
## Exploratory Data Analysis


In [None]:
# A quick way to view histograms for all numerical columns in a dataframe is a hist function
dfupdated.hist(figsize=(12,12)); # the ";" removes the unwanted output

In [None]:
pd.plotting.scatter_matrix(dfupdated, figsize=(15,15));

In [None]:
# Plot a histogram for a specific column
dfupdated['release_year'].hist();

In [None]:
# This "value_counts" function aggregates counts for each unique value in a column
dfupdated['release_year'].value_counts()

In [None]:
dfupdated['release_year'].value_counts().plot(kind='bar', figsize=(12,12));

### Research Q 1 (Does the release year affect movie watchers? - Internet impact)

In [None]:
dfupdated.plot(x='release_year', y = 'popularity', kind='scatter');

In [None]:
dfupdated.plot(x='release_year', y = 'vote_average', kind='scatter'); 

In [None]:
dfupdated.plot(x='release_year', y = 'vote_count', kind='scatter');

### Research Q 1 Conclusion:
- With the availability of the internet, people now have more ways to watch movies and are no longer limited to movie theaters. This is reflected in the following:
1. More movies tend to have higher popularity
2. The average number of votes increases significantly, as the industry now has more means to collect user votes
3. The vote count increases significantly, as the industry now has more means to collect user votes
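
The scatter plots can be backed by a per-year summary, which makes trends in popularity and vote counts easier to read than individual points. This is a minimal sketch on invented rows that reuses the column names of `dfupdated`; the aggregated column names (`movies`, `median_popularity`, `median_vote_count`) are hypothetical.

```python
import pandas as pd

# Invented rows that mimic the relevant columns
toy = pd.DataFrame({
    "release_year": [1990, 1990, 2010, 2010, 2010],
    "popularity":   [0.5, 0.7, 1.5, 2.5, 0.5],
    "vote_count":   [20, 40, 300, 500, 100],
})

# One row per release year: number of movies and the median of each measure
per_year = toy.groupby("release_year").agg(
    movies=("popularity", "size"),
    median_popularity=("popularity", "median"),
    median_vote_count=("vote_count", "median"),
)

print(per_year)
```

Medians are used here because a few blockbusters can dominate a yearly mean.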

### Research Q 2 (When do people tend to vote? - Internet impact)

In [None]:
dfupdated.plot(x='popularity', y = 'vote_count', kind='scatter');

In [None]:
dfupdated.plot(x='popularity', y = 'vote_average', kind='scatter');

### Research Q 2 Conclusion:
1. The most popular movies have lower vote counts than the less popular ones
2. The most popular movies have a lower vote average

- **POV**: As people now have more ways to speak up, they tend to punish the movies that they do not like and avoid rewarding the movies that they like the most.
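
One way to check these two statements numerically is a rank (Spearman) correlation between popularity and each vote measure: a negative coefficient would support the statement, a positive one would contradict it. The sketch below computes it as the Pearson correlation of the ranks. The numbers are invented and say nothing about the real dataset.

```python
import pandas as pd

# Invented values, for illustration only
toy = pd.DataFrame({
    "popularity":   [0.2, 0.5, 1.0, 2.0, 4.0],
    "vote_count":   [10, 40, 90, 400, 1500],
    "vote_average": [6.0, 5.5, 7.0, 6.5, 7.5],
})

# Spearman correlation equals the Pearson correlation of the ranks
ranks = toy.rank()
rho_count = ranks["popularity"].corr(ranks["vote_count"])
rho_average = ranks["popularity"].corr(ranks["vote_average"])

print(rho_count, rho_average)
```

Applied to `dfupdated`, the same two lines would show whether the relationship seen in the scatter plots is negative or positive overall.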

### Research Q 3 (Do recession periods affect the movie industry economically?)
As per [Wikipedia](https://en.wikipedia.org/wiki/List_of_recessions_in_the_United_States), the following periods were recessions in the United States:
1. Apr 1960–Feb 1961	10 months
2. Dec 1969–Nov 1970	11 months
3. Nov 1973–Mar 1975	1 year 4 months
4. Jan 1980–July 1980	6 months
5. July 1981–Nov 1982	1 year 4 months
6. July 1990–Mar 1991	8 months
7. Mar 2001–Nov 2001	8 months
8. Dec 2007–June 2009	1 year 6 months
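
To relate these periods to the yearly data, each release year can be flagged as a recession year when any part of it falls inside one of the listed periods. The sketch below encodes the list as (start year, end year) pairs. `RECESSIONS` and `is_recession_year` are hypothetical names, and the 1960 to 2015 range is an assumption about the years covered by the dataset.

```python
# Recession periods from the list above as (start_year, end_year) pairs
RECESSIONS = [
    (1960, 1961), (1969, 1970), (1973, 1975), (1980, 1980),
    (1981, 1982), (1990, 1991), (2001, 2001), (2007, 2009),
]


def is_recession_year(year, periods=RECESSIONS):
    """Return True if any part of `year` falls inside a listed recession."""
    return any(start <= year <= end for start, end in periods)


# Assumed year range of the dataset
recession_years = [year for year in range(1960, 2016) if is_recession_year(year)]

print(recession_years)
```

With pandas, `dfupdated['release_year'].map(is_recession_year)` would give a boolean column for comparing recession and non-recession years, and `plt.axvspan(start, end + 1, alpha=0.2)` could shade each period on the line plots.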

In [None]:
plt.figure(figsize=(12,8))
plt.title('recession impact on budget', weight='bold')
plt.xlabel('Year', weight='bold')
plt.ylabel('budget', weight='bold');
plt.grid()
# Select the column before summing so that only the numeric values are aggregated
plt.plot(dfupdated.groupby('release_year')['budget_adj'].sum());

In [None]:
plt.figure(figsize=(12,8))
plt.title('recession impact on revenue', weight='bold')
plt.xlabel('Year', weight='bold')
plt.ylabel('revenue', weight='bold');
plt.grid()
# Select the column before summing so that only the numeric values are aggregated
plt.plot(dfupdated.groupby('release_year')['revenue_adj'].sum());

In [None]:
plt.figure(figsize=(12,8))
plt.title('recession impact on profit', weight='bold')
plt.xlabel('Year', weight='bold')
plt.ylabel('profit', weight='bold');
plt.grid()
# Select the column before summing so that only the numeric values are aggregated
plt.plot(dfupdated.groupby('release_year')['profite'].sum());

### Research Q 3 Conclusion:
**Budget**
- There is a decline in movie budgets that starts with a recession and continues after it.
- Longer recessions are accompanied by deeper budget cuts.

**Revenue and profit**
- There is an increase in profit and revenue during recessions, for example Dec 2007–June 2009 and Mar 2001–Nov 2001.
- **POV**: My guess is that people tend to look for a cheap way to entertain themselves during hard times.

<a id='conclusions'></a>
## Conclusions

- The availability of the internet and recessions both tend to have an impact on the movie industry.

### Limitations
- The data is not up to date. It would be interesting to compare the Great Recession (2007-2009) with the COVID-19 recession (2020-2021), as there was a significant lifestyle change.
- Online TV producers are barely included: Amazon has only 2 movies and Netflix only 8. This may be because those two companies joined the producers' club late. It would be interesting to investigate the difference between the movies they produce and those of the other industry producers.

## Submitting your Project 

> **Tip**: Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> **Tip**: Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> **Tip**: Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])