# Movie Series Sub-Project 1: Movie Data Analysis

<p><b>Author</b>: Jingze Dai</p>
<p><b>McMaster University</b>, Honors Computer Science (Coop) student</p>
<p><b>Personal Email Address</b>: <a>david1147062956@gmail.com</a>, or <a>dai.jingze@icloud.com</a></p>
<a href="https://github.com/daijingz">GitHub Homepage</a>
<a href="https://www.linkedin.com/in/jingze-dai/">LinkedIn Webpage</a>
<a href="https://leetcode.com/david1147062956/">LeetCode Webpage</a>

<i>This sub-project analyzes the dataset and presents the observations obtained through visualization.</i>

<i>Your feedback is important for Jingze's further development. If you would like to give feedback or suggestions, or to work and learn together, please email Jingze at dai.jingze@icloud.com. If you would like Jingze to contribute to your research or open-source project, or to help with a programming issue, please email Jingze at david1147062956@gmail.com. Thank you for your help.</i>

### <a class="anchor" id=""><b>Section 1</b>: Dataset sources</a>

<b>Name</b>: TMDB 5000 Movie Dataset
<br>
<b>Source</b>: Kaggle
<br>
<b>Download Link</b>: <a href="https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata">TMDB 5000 Movie Dataset</a>

After downloading the compressed folder, unzip it. It should contain two datasets: <b>tmdb_5000_credits.csv</b> and <b>tmdb_5000_movies.csv</b>. Each dataset contains different information.

Put these two data files in the same folder as this notebook.
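
As a quick sanity check before loading anything, the sketch below reports which of the two expected files are absent from a folder. The helper name `missing_datasets` is hypothetical and not part of the original notebook, and the check assumes the files keep the names they have in the Kaggle archive.

```python
from pathlib import Path

# File names as listed above; assumed unchanged after unzipping.
EXPECTED_FILES = ("tmdb_5000_credits.csv", "tmdb_5000_movies.csv")


def missing_datasets(folder="."):
    """Return the expected CSV files that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_datasets()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Both datasets found.")
```

An empty list means both files sit next to the notebook and the later `pd.read_csv` calls can use bare file names.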

### <a class="anchor" id=""><b>Section 2</b>: Early-stage data observations</a>

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Walk the current working directory and list every CSV file found.
print("Existing Datasets: ")
for dirname, _, filenames in os.walk(os.getcwd()):
    for filename in filenames:
        if filename.endswith('.csv'):
            print(os.path.join(dirname, filename))

Existing Datasets: 
C:\Users\david\Downloads\Movies\tmdb_5000_credits.csv
C:\Users\david\Downloads\Movies\tmdb_5000_movies.csv


<b>Dataset 1 Name</b>: tmdb_5000_credits.csv
<br>
<b>Main Content</b>: Information about movie participants, including cast and crew.
<br>
<b>Columns</b>: 4

| Column   | Description   |      Range     | # of unique values | All values are unique |
|:---------|---------------|:--------------|:--------|--------:|
| movie_id |      movie identifier     |  5 to 459K |  4803   |       Yes |
| title    |      movie title     |    No fixed range   |    4800   |       No |
| cast     |      cast information     | No fixed range |     4761   |       No |
| crew     |      crew information     | No fixed range |     4776   |       No |

The identifier of each record is the variable "<b>movie_id</b>", an integer ranging from 5 to roughly 459,000. The variable "title" holds string values. The variables "cast" and "crew" each pack several kinds of information into a single field, so they have no specific value range.
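
The per-column figures in the table can be recomputed with pandas rather than read off by hand. The following is a minimal sketch, not the notebook's own method: `summarize_columns` is a hypothetical helper, and the three-row frame is made-up illustration data rather than rows from the dataset.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column: dtype, unique-value count, uniqueness flag."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        "all_unique": pd.Series({col: df[col].is_unique for col in df.columns}),
    })


# Hypothetical sample, for illustration only.
sample = pd.DataFrame({
    "movie_id": [5, 11, 12],
    "title": ["A", "B", "B"],
})

summary = summarize_columns(sample)
print(summary)
print("movie_id range:", sample["movie_id"].min(), "to", sample["movie_id"].max())
```

On the real data the same helper would be called as `summarize_columns(credit_df)` once the credit dataset has been loaded.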

In [13]:
credit_dataset_path = 'tmdb_5000_credits.csv'
credit_df = pd.read_csv(credit_dataset_path)
print("Number of rows:", credit_df.shape[0])
print("Number of columns:", credit_df.shape[1])
print("\nColumns:")
print(credit_df.columns)
print("\nData Types:")
print(credit_df.dtypes)

Number of rows: 4803
Number of columns: 4

Columns:
Index(['movie_id', 'title', 'cast', 'crew'], dtype='object')

Data Types:
movie_id     int64
title       object
cast        object
crew        object
dtype: object


<b>Dataset 2 Name</b>: tmdb_5000_movies.csv
<br>
<b>Main Content</b>: Basic movie information, such as content and budgets
<br>
<b>Columns</b>: 20

In [14]:
major_dataset_path = 'tmdb_5000_movies.csv'
major_df = pd.read_csv(major_dataset_path)
print("Number of rows:", major_df.shape[0])
print("Number of columns:", major_df.shape[1])
print("\nColumns:")
print(major_df.columns)
print("\nData Types:")
print(major_df.dtypes)

Number of rows: 4803
Number of columns: 20

Columns:
Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

Data Types:
budget                    int64
genres                   object
homepage                 object
id                        int64
keywords                 object
original_language        object
original_title           object
overview                 object
popularity              float64
production_companies     object
production_countries     object
release_date             object
revenue                   int64
runtime                 float64
spoken_languages         object
status                   object
tagline                  object
title                    object
vote_average       

### <a class="anchor" id=""><b>Section 3</b>: Data Preprocessing</a>

Before the main data analysis, checking for incomplete records, inappropriate values, and conflicts reduces the risk of problems in later steps.
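
Each of the three kinds of problem named above can be turned into a small check. The sketch below is one possible set of checks under stated assumptions, not the notebook's own method: `quality_report` is a hypothetical helper, treating non-positive numbers as "inappropriate" is an assumption, and so is matching `movie_id` in the credit data against `id` in the major data. The sample frames are made-up illustration data.

```python
import pandas as pd


def quality_report(credits, movies, numeric_cols=("budget", "runtime")):
    """Count simple data-quality problems across the two tables."""
    report = {
        # Incompleteness: cells with no value in either table.
        "missing_cells": int(credits.isnull().sum().sum()
                             + movies.isnull().sum().sum()),
        # Identifiers that appear more than once in the major table.
        "duplicate_ids": int(movies["id"].duplicated().sum()),
    }
    # Inappropriate values: assumed here to mean numbers <= 0.
    for col in numeric_cols:
        report[f"nonpositive_{col}"] = int((movies[col] <= 0).sum())
    # Conflicts: the same identifier carrying different titles in each table.
    merged = credits.merge(movies, left_on="movie_id", right_on="id",
                           suffixes=("_credit", "_major"))
    report["title_conflicts"] = int(
        (merged["title_credit"] != merged["title_major"]).sum())
    return report


# Hypothetical sample tables, for illustration only.
credits_sample = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["A", "B", "C"]})
movies_sample = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "X"],
    "budget": [100, 0, 50],
    "runtime": [90.0, None, 120.0],
})

report = quality_report(credits_sample, movies_sample)
print(report)
```

Missing values compare as `False` against `<= 0`, so they are counted only under `missing_cells` and not as non-positive.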

In [17]:
import pandas as pd

print("Missing values in each column of the credit data set:")
print(credit_df.isnull().sum())
print("\n")
print("Missing values in each column of the major data set:")
print(major_df.isnull().sum())

Missing values in each column of the credit data set:
movie_id    0
title       0
cast        0
crew        0
dtype: int64


Missing values in each column of the major data set:
budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64


The credit dataset has no missing values, but the major dataset has many. The "homepage" (3091) and "tagline" (844) columns are missing most often, while "overview" (3), "runtime" (2), and "release_date" (1) are missing only a few values.
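
Raw counts are easier to judge next to the share of rows they represent. The sketch below is a minimal example with a hypothetical helper, `missing_summary`, and a made-up four-row frame; it keeps only the columns that have gaps and sorts them by how many values are missing.

```python
import pandas as pd


def missing_summary(df):
    """Return columns with missing values, with counts and percentages."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with gaps, most incomplete first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical sample, for illustration only.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "homepage": [None, None, "http://example.com", None],
    "runtime": [90.0, None, 100.0, 95.0],
})

result = missing_summary(sample)
print(result)
```

Applied as `missing_summary(major_df)`, the counts printed above would correspond to roughly 64% of rows lacking a homepage and roughly 18% lacking a tagline.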