# CS-401: Milestone P2

This notebook provides basic pre-processing and a first analysis of the datasets we will be using. The pre-processing pipeline is divided into three parts: loading, cleaning, and merging. The code for this pipeline is in the `preprocessing` folder. For more information, see the `README`.

In [1]:
import sys

sys.path.append("./preprocessing")
sys.path.append("./analysis")

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

import loader
import cleaner
import merger

%load_ext autoreload
%autoreload 2

# I. Preprocessing

## I.1. Load the data

We start by loading the datasets: the CMU dataset, plus the external Bechdel and TMDB datasets, which will be useful later for a more in-depth analysis.

In [2]:
# =============== LOADER =============== #
# CMU Dataset
character_df, movie_df = loader.load_CMU_dataset()
summaries_df = loader.load_summaries()

# External Dataset
bechdel_df = loader.load_bechdel_dataset()
credits_df, meta_df = loader.load_tmdb_dataset()

FileNotFoundError: [Errno 2] No such file or directory: './data/MovieSummaries/character.metadata.tsv'
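
The `FileNotFoundError` above indicates that the CMU files were not found at the expected location. A minimal sketch of a check that could be run before calling the loaders is shown below. The path is taken from the error message; the helper name `find_missing_files` is hypothetical and not part of the `loader` module.

```python
from pathlib import Path

# Path taken from the error message above; extend the list as needed.
REQUIRED_FILES = [Path("./data/MovieSummaries/character.metadata.tsv")]


def find_missing_files(paths):
    """Return the entries of `paths` that do not exist as files on disk."""
    return [p for p in paths if not Path(p).is_file()]


missing = find_missing_files(REQUIRED_FILES)
if missing:
    print("Missing data files:")
    for path in missing:
        print(f"  {path}")
else:
    print("All required data files are present.")
```

Running this first gives a complete list of missing files instead of failing on the first one.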

## I.2. Basic cleaning

Here, we perform basic cleaning operations on the dataframes, such as renaming columns and converting date strings into Python datetime objects. The handling of the credits dataframe is implemented in this [function](./preprocessing/cleaner.py#clean_credit_df). Our objective is to analyze women's representation in the film industry, which requires the gender of each movie's crew members. We observed that this information was frequently missing, while full names were always provided. As a solution, we use the `gender-guesser` package to predict gender from the first name. This approach reduces the number of ungendered samples without discarding entire rows.
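
The idea of filling in missing genders from first names can be sketched as follows. This is only an illustration, not the code in `cleaner.py`: the column names `name` and `gender`, the function `fill_missing_gender`, and the toy lookup table are all assumptions. The lookup stands in for the real guesser so that the example runs without extra dependencies; with the `gender-guesser` package installed, one could presumably pass `gender_guesser.detector.Detector().get_gender` as `guess` instead (note that it returns labels such as `unknown` rather than a missing value).

```python
import pandas as pd


def fill_missing_gender(df, guess, name_col="name", gender_col="gender"):
    """Fill missing genders using `guess(first_name)`; keep existing values."""
    out = df.copy()
    # First whitespace-separated token of the full name.
    first_names = out[name_col].str.split().str[0]
    guessed = first_names.map(guess, na_action="ignore")
    # Only rows where the gender is missing are filled.
    out[gender_col] = out[gender_col].fillna(guessed)
    return out


# Toy stand-in for the real guesser (hypothetical lookup table).
TOY_LOOKUP = {"Kathryn": "female", "Quentin": "male"}


def toy_guess(first_name):
    return TOY_LOOKUP.get(first_name)  # None when the name is not in the table


crew = pd.DataFrame(
    {
        "name": ["Kathryn Bigelow", "Quentin Tarantino", "Xyz Abc", "Sofia Coppola"],
        "gender": [None, None, None, "female"],
    }
)
filled = fill_missing_gender(crew, toy_guess)
print(filled)
```

Rows whose first name cannot be resolved stay ungendered rather than being dropped, which matches the goal of keeping as many rows as possible.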

In [None]:
# =============== CLEAN =============== #
# Clean datasets
movie_df = cleaner.clean_movie_df(movie_df)
bechdel_df = cleaner.clean_bechdel_df(bechdel_df)
credits_df = cleaner.clean_credit_df(credits_df, meta_df)
meta_df = cleaner.clean_metadata_df(meta_df)

## I.3. Merging all datasets

In [None]:
# =============== MERGE =============== #
# Merge datasets
movies = merger.merge_with_char(movie_df, character_df)
movies = merger.merge_with_summaries(movies, summaries_df)
movies = merger.merge_with_metadata(movies, meta_df)
movies = merger.merge_with_credits(movies, credits_df)

# Put the actor_name column at the beginning
movies = movies.reindex(columns=['actor_name'] + [col for col in movies.columns if col != 'actor_name'])
display(movies.sample(5))

# This will be useful later, when we analyze the effect of the Bechdel rating
# on women's representation in movies
movies_with_bechdel = merger.merge_with_bechdel(movies, bechdel_df)

In [None]:
# Alternatively:
# You can directly load full cleaned and merged data
# movies = pd.read_pickle("./data/Processed/movies_cleaned.pkl")

# II. Exploratory & Basic Analysis

Now, let's explore and try to understand the data. 

In [None]:
# Import our visualization tools
import visualizer as viz

In [None]:
movies.drop(columns=['wikiID']).describe()

As one can see, there are outliers and dirty samples in the data. We remove them in the next cell.

In [None]:
movies = cleaner.clean_movies_ranges(movies)
movies = cleaner.clean_remove_outlier(movies, method='quantile', name='actor_height_meters')
movies = cleaner.clean_remove_outlier(movies, method='quantile', name='movie_runtime')

movies.drop(columns=['wikiID']).describe()
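
For readers unfamiliar with quantile-based outlier removal, here is a minimal sketch of the general technique. It is not the implementation of `cleaner.clean_remove_outlier`: the function name, the 5% and 95% cut-offs, and the choice to keep rows with missing values are assumptions made for illustration.

```python
import pandas as pd


def remove_outliers_quantile(df, column, lower=0.05, upper=0.95):
    """Keep rows whose `column` value is missing or inside the quantile range."""
    low, high = df[column].quantile([lower, upper])
    in_range = df[column].between(low, high)  # inclusive on both ends
    # Assumption: rows with a missing value are kept, not treated as outliers.
    return df[in_range | df[column].isna()]


# Toy data: two implausible runtimes (1 and 1000 minutes) and one missing value.
toy = pd.DataFrame(
    {"movie_runtime": [1.0, 90.0, 95.0, 100.0, 105.0, 110.0, 1000.0, None]}
)
trimmed = remove_outliers_quantile(toy, "movie_runtime")
print(trimmed)
```

On this toy data the two extreme runtimes are dropped, while the row with a missing runtime is kept.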

TODO: Explain the following plot

In [None]:
viz.visualize_year_distribution(movies)

## II.2. Missing values analysis

In [None]:
viz.visualize_missing_values(movies)

Based on this analysis, we will drop columns whose share of missing values exceeds a threshold (say 50%).
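
A minimal sketch of such a threshold-based drop is shown below. The function name `drop_sparse_columns` and the toy column `movie_name` are hypothetical; the 50% threshold is the value suggested above.

```python
import pandas as pd


def drop_sparse_columns(df, threshold=0.5):
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    keep = missing_share[missing_share <= threshold].index
    return df[keep]


toy = pd.DataFrame(
    {
        "movie_name": ["A", "B", "C", "D"],               # 0% missing
        "movie_runtime": [90.0, None, 100.0, 110.0],      # 25% missing
        "actor_height_meters": [None, None, None, 1.7],   # 75% missing
    }
)
reduced = drop_sparse_columns(toy)
print(reduced.columns.tolist())
```

Only `actor_height_meters` exceeds the threshold in this toy frame, so it is the only column removed.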

## II.3. Distribution of key features

In [None]:
viz.visualize_gender_distribution(movies)

TODO: Explain

In [None]:
viz.visualize_gender_proportion_repartition(movies)

TODO: Explain