# Project: Investigate a Dataset - [TMDB-MOVIES]

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

> The TMDB (The Movie Database) Movie Metadata dataset from Kaggle contains information about movies and their metadata, including cast, genres, keywords, and more. This dataset is a great resource for anyone interested in analyzing movie-related data or working on projects within the entertainment industry.

The data includes over 10,865 rows of information about movies, with each row representing a movie. The columns in the dataset include:

- ID: A unique identifier for each movie.
- IMDb ID: The movie's unique identifier on IMDb.
- original_title: The title of the movie.
- Year: The release year of the movie.
- Rated: Whether the movie is rated or not.
- Released: The date when the movie was released.
- Runtime: The length of the movie in minutes.
- Genres: One or more genres associated with the movie.
- Directors: The director(s) of the movie.
- cast: The main actors in the movie, separated by `|`.
- overview: A brief summary of the movie's plot.
- Production_companies: The production companies involved in making the movie, separated by `|`.
- Tagline: The tagline or slogan associated with the movie.

In [5]:
# Import Needed Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [6]:
# DataFrame.explode() requires pandas 0.25.0 or later. Pinning to exactly 0.25.0
# forced a source build that failed on Python 3.11, so request a minimum version instead.
!pip install --upgrade "pandas>=0.25.0"

<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [7]:
# load the data
movie = pd.read_csv('Database_TMDb_movie_data/tmdb-movies.csv')
movie.head(5)

FileNotFoundError: [Errno 2] No such file or directory: 'Database_TMDb_movie_data/tmdb-movies.csv'

In [None]:
# Summary statistics for all numeric columns
movie.describe()

In [None]:
# Summary statistics for the main numeric columns only
movie[['budget', 'revenue', 'runtime', 'vote_count', 'vote_average', 'budget_adj', 'revenue_adj']].describe()

In [None]:
# display a concise summary of the dataframe, including the number of non-null values in each column
movie.info()

In [None]:
# return a tuple of the dimensions of the dataframe
movie.shape

In [None]:
# return the datatypes of the columns
movie.dtypes

In [None]:
# return the number of unique values in each column
movie.nunique()

In [None]:
# calculate the number of missing values in each column of the DataFrame
movie.isnull().sum()

In [None]:
# return the number of duplicated rows
movie.duplicated().sum()
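
The two checks above can be tried on a small frame first. This is a minimal sketch using a hypothetical `toy` DataFrame (its values are invented for illustration and are not taken from the TMDB file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few rows of the movie table
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'B', 'C'],
    'runtime': [90, 120, 120, np.nan],
    'genres': ['Drama', 'Action', 'Action', None],
})

# Count of missing values (NaN or None) in each column
missing_per_column = toy.isnull().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = toy.duplicated().sum()

print(missing_per_column)
print(duplicate_rows)
```

Here only the second `B` row counts as a duplicate, because `duplicated()` marks repeats and leaves the first occurrence unmarked.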


### Data Cleaning
 

In [None]:
# drop the columns that are not used in the analysis
movie.drop(['id', 'imdb_id', 'homepage', 'tagline', 'keywords', 'overview'], axis=1, inplace=True)
movie.head()

In [4]:
# fill missing values in the dataframe with 0
# fillna() returns a new DataFrame, so assign the result back
movie = movie.fillna(0)
movie

NameError: name 'movie' is not defined
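
One detail worth keeping in mind for this step: by default `fillna` returns a new DataFrame and leaves the original untouched. A minimal sketch on a hypothetical `toy` frame (values invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({'budget': [100.0, np.nan], 'revenue': [np.nan, 250.0]})

# The returned frame is discarded here, so toy still contains NaN
toy.fillna(0)
still_missing = int(toy.isnull().sum().sum())

# Keeping the returned frame is what actually removes the NaN values
filled = toy.fillna(0)
missing_after = int(filled.isnull().sum().sum())

print(still_missing, missing_after)
```

Assigning the result back (or passing `inplace=True`) is what makes the later `isnull().sum()` check report zeros.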

In [None]:
# calculate the number of missing values in each column after filling NaN
movie.isnull().sum()

In [None]:
# drop the duplication in the data
movie.drop_duplicates(inplace=True)

In [None]:
# calculate the number of duplicated rows remaining after dropping duplicates
movie.duplicated().sum()

### Check Outliers

In [None]:
# Use describe() to compare the min and max of each column (no outliers are expected in this data)
movie.describe()

In [None]:
boxplot = movie.boxplot(figsize = (20,15))
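
To put a number on what the box plot shows, one common convention is to count values lying more than 1.5 times the interquartile range beyond the quartiles, which is also the default whisker rule in matplotlib box plots. The helper `iqr_outlier_counts` and the `toy` data below are hypothetical and only sketch the idea:

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include='number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Hypothetical toy data: one runtime value is far from the rest
toy = pd.DataFrame({
    'runtime': [90, 95, 100, 105, 110, 900],
    'vote_average': [5.0, 5.5, 6.0, 6.5, 7.0, 7.5],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

Whether a flagged value is an error or simply an unusual movie still needs a judgment call; the count only says where to look.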

In [None]:
import operator

import pandas as pd

# Supported arithmetic operations, keyed by their symbol
OPERATIONS = {'+': operator.add, '-': operator.sub, '*': operator.mul, '/': operator.truediv}

def new_column(data, column_name, column1, column2, operation):
    """Add a new column computed from two existing columns.

    Args:
        data: The pandas DataFrame to add the new column to.
        column_name: The name of the new column.
        column1: The name of the first existing column.
        column2: The name of the second existing column.
        operation: The mathematical operation to apply ("+", "-", "*", "/").
    Returns:
        The pandas DataFrame with the new column added.
    """
    # Ensure numerical data types for columns involved in the operation
    data[column1] = pd.to_numeric(data[column1], errors='coerce')
    data[column2] = pd.to_numeric(data[column2], errors='coerce')

    # Look up the operation in a fixed table instead of calling eval() on a string
    data[column_name] = OPERATIONS[operation](data[column1], data[column2])

    return data

# Create a new column named 'gain' by subtracting 'budget' from 'revenue'
movie = new_column(movie, 'gain', 'revenue', 'budget', '-')

# Display the modified DataFrame
movie

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1 (What are the top genres?)

In [None]:
# Count how many movies carry each genres value
movie['genres'].value_counts()

In [None]:
# group the data by genres and calculate the mean vote average for each, then sort in descending order and keep the top five
top_genres = movie.groupby('genres')['vote_average'].mean().sort_values(ascending=False).head()
top_genres

In [None]:
# From the bar chart: Drama is the top genre by votes
plt.figure(figsize=(11, 7))
top_genres.plot(kind='bar')
plt.xlabel('Genres')
plt.ylabel('Mean Vote Average')
plt.title('Top Genres by Mean Vote Average')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
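
Grouping on the raw `genres` column treats each combined string as its own group. If the genres are stored with the same `|` separator seen in the cast and production company values (an assumption here), splitting and exploding the column first gives one row per single genre. The `toy` frame below is hypothetical and only sketches the idea:

```python
import pandas as pd

# Hypothetical toy data; assumes genres are joined with '|'
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C'],
    'genres': ['Drama|Romance', 'Drama', 'Action|Drama'],
    'vote_average': [8.0, 6.0, 7.0],
})

# Turn each genres string into a list, then give every genre its own row
per_genre = toy.assign(genres=toy['genres'].str.split('|')).explode('genres')

# Number of movies per single genre
genre_counts = per_genre['genres'].value_counts()

# Mean vote average per single genre, highest first
genre_votes = (
    per_genre.groupby('genres')['vote_average']
    .mean()
    .sort_values(ascending=False)
)

print(genre_counts)
print(genre_votes)
```

`DataFrame.explode()` is available from pandas 0.25.0 onward, which is why the notebook upgrades pandas at the start.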

### Research Question 2 (Who are the top actors and actresses?)

In [None]:
# group the data by cast and calculate the mean vote average for each cast, then sort in descending order and keep the top five
top_Cast = movie.groupby('cast')['vote_average'].mean().sort_values(ascending=False).head()
top_Cast


In [None]:
# From the bar chart: Mark Cousins|Jean-Michel Frodon|Cari Beauchamp|Agnes de Mille is the top cast by votes
plt.figure(figsize=(20, 15))
top_Cast.plot(kind='bar')
plt.xlabel('Cast')
plt.ylabel('Mean Vote Average')
plt.title('Top Actors and Actresses by Mean Vote Average')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=1.0)
plt.show()
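
A mean over a single movie can put a little-known cast at the top of the ranking. One possible refinement, sketched here with a hypothetical `toy` frame and a hypothetical `MIN_MOVIES` threshold, is to rank individual actors and keep only those with enough movies behind their average:

```python
import pandas as pd

# Hypothetical toy data; cast members are joined with '|'
toy = pd.DataFrame({
    'cast': ['Ann|Bob', 'Ann|Cid', 'Ann', 'Dee'],
    'vote_average': [6.0, 7.0, 8.0, 9.5],
})

# One row per (movie, actor) pair
per_actor = toy.assign(cast=toy['cast'].str.split('|')).explode('cast')

# Mean vote and number of movies for each actor
stats = per_actor.groupby('cast')['vote_average'].agg(['mean', 'count'])

# Hypothetical threshold: ignore actors with fewer than this many movies
MIN_MOVIES = 2
top_cast = stats[stats['count'] >= MIN_MOVIES].sort_values('mean', ascending=False)

print(top_cast)
```

In this toy example `Dee` has the highest single score but appears in only one movie, so only `Ann` remains after filtering. The right threshold for the real data is a judgment call.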

### Research Question 3 (Who are the top directors?)

In [None]:
# group the data by director and calculate the mean vote average for each director, then sort in descending order and keep the top five
top_directors = movie.groupby('director')['vote_average'].mean().sort_values(ascending=False).head()
top_directors

In [None]:
# From the bar chart: Mark Cousins is the top director by votes
plt.figure(figsize=(7, 5))
top_directors.plot(kind='bar')
plt.xlabel('Directors')
plt.ylabel('Mean Vote Average')
plt.title('Top Directors by Mean Vote Average')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=1.0)
plt.show()

### Research Question 4 (Who are the top production companies?)

In [None]:
# group the data by production companies and calculate the mean vote average for each, then sort in descending order and keep the top five
top_productionCompanies = movie.groupby('production_companies')['vote_average'].mean().sort_values(ascending=False).head()
top_productionCompanies

In [None]:
# From the bar chart: SMV Enterprises|Columbia Music Video|EMI is the top production company group by votes
plt.figure(figsize=(20, 15))
top_productionCompanies.plot(kind='bar')
plt.xlabel('Production Companies')
plt.ylabel('Mean Vote Average')
plt.title('Top Production Companies by Mean Vote Average')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=1.0)
plt.show()

### Research Question 5 (What are the top movies?)

In [None]:
# group the data by movie title and calculate the mean gain for each title, then sort in descending order and keep the top five
top_movies = movie.groupby('original_title')['gain'].mean().sort_values(ascending=False).head()
top_movies

In [None]:
# From the bar chart: Avatar is the top movie by gain
plt.figure(figsize=(20, 15))
top_movies.plot(kind='bar')
plt.xlabel('Movies')
plt.ylabel('Gain (revenue - budget)')
plt.title('Top Movies by Gain')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=1.0)
plt.show()

### Research Question 6 (Which genres have the longest runtime?)

In [None]:
# group the data by genres and calculate the mean runtime for each, then sort in descending order and keep the top five
longest_genres = movie.groupby('genres')['runtime'].mean().sort_values(ascending=False).head()
longest_genres

In [None]:
# From the bar chart: History has the longest mean runtime
plt.figure(figsize=(25, 17))
longest_genres.plot(kind='bar')
plt.xlabel('Genres')
plt.ylabel('Mean Runtime (minutes)')
plt.title('Genres with the Longest Mean Runtime')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=1.0)
plt.show()

<a id='conclusions'></a>
## Conclusions
Research Question 1: What are the top genres?

Answer 1: Drama is the top genre by average vote.

Research Question 2: Who are the top actors and actresses?

Answer 2: The cast Mark Cousins|Jean-Michel Frodon|Cari Beauchamp|Agnes de Mille ranks highest by average vote.

Research Question 3: Who are the top directors?

Answer 3: Mark Cousins is the top director by average vote.

Research Question 4: Who are the top production companies?

Answer 4: SMV Enterprises|Columbia Music Video|EMI is the top production company group by average vote.

Research Question 5: What are the top movies?

Answer 5: Avatar is the top movie by gain (revenue minus budget).

Research Question 6: Which genres have the longest runtime?

Answer 6: History has the longest mean runtime among genres.

The dataset appears to hold significant information on the box office success of films and genres, along with directors' earnings and production companies' gains. It also highlights audience preferences regarding genres based on average votes. The analysis suggests Sci-Fi movies perform exceptionally well at the box office. However, potential inaccuracies or missing data could limit the reliability of some interpretations.

## Submitting your Project 

In [None]:
from subprocess import call
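# Convert the notebook to HTML; recent nbconvert versions expect an explicit --to format
call(['python', '-m', 'nbconvert', '--to', 'html', 'Investigate_a_Dataset.ipynb'])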