# Title

# ToC

# Introduction 

In [15]:
import sys

sys.path.append("./preprocessing")
sys.path.append("./analysis")
sys.path.append("./nlp")

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats
import ast 
from wordcloud import WordCloud

# Our tools (see python files in preprocessing and analysis)
import loader
import cleaner
import merger
import roles
import visualizer as viz
import nlp_utils as nlputils

%load_ext autoreload
%autoreload 2

# 1. Pre-processing

In this section, we load and clean our datasets.

## 1.1. Loading and basic cleaning

### Part 1: CMU dataset

After loading these two datasets, we perform some cleaning on the movies:

- We convert date strings to Python `datetime` objects.
- We drop rows with no release date.
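
A minimal sketch of these two steps, assuming a `release_date` column that holds ISO-formatted (`YYYY-MM-DD`) strings. The column name, the date format, and the helper `parse_release_dates` are illustrative assumptions and are not taken from `cleaner.py`:

```python
import pandas as pd


def parse_release_dates(df: pd.DataFrame, column: str = "release_date") -> pd.DataFrame:
    """Convert a string date column to datetime and drop rows without a date."""
    out = df.copy()
    # Missing or malformed strings become NaT instead of raising an error
    out[column] = pd.to_datetime(out[column], format="%Y-%m-%d", errors="coerce")
    # Drop the rows whose date is missing after conversion
    return out.dropna(subset=[column]).reset_index(drop=True)


# Toy data (hypothetical values)
toy_movies = pd.DataFrame(
    {
        "movie_title": ["Alpha", "Beta", "Gamma"],
        "release_date": ["2001-05-04", None, "1999-01-01"],
    }
)
cleaned = parse_release_dates(toy_movies)
print(cleaned)
```

With this sketch, dates in any other format (for example year-only strings) would also be coerced to `NaT` and dropped, so real data may need a more permissive parser.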

In [16]:
# Loading 
character_df, movie_df = loader.load_CMU_dataset()
summaries_df = loader.load_summaries()

# Basic cleaning
movie_df = cleaner.clean_movie_df(movie_df)

### Part 2: TMDB Dataset

Here we load the TMDB dataset. Gender is often missing even when names are available, so we used Wikipedia to recover gender from names. For the few genders that were still missing, we used the `GenderGuesser` Python package, which fills most of the remaining missing values for directors and producers.
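
The fallback idea can be sketched as follows. This is only an illustration: the helper `fill_missing_gender`, the column names, and the toy lookup are hypothetical, and the `guess` callable stands in for whatever name-based source is used (Wikipedia or a name-guessing package).

```python
from typing import Callable, Optional

import pandas as pd


def fill_missing_gender(
    df: pd.DataFrame,
    name_col: str,
    gender_col: str,
    guess: Callable[[str], Optional[str]],
) -> pd.DataFrame:
    """Fill missing genders from names, leaving known genders untouched."""
    out = df.copy()
    # Only rows with a name but no gender are candidates for guessing
    missing = out[gender_col].isna() & out[name_col].notna()
    out.loc[missing, gender_col] = out.loc[missing, name_col].map(guess)
    return out


# Toy first-name lookup standing in for a real name-based source (hypothetical)
toy_lookup = {"Sofia": "F", "Martin": "M"}


def toy_guess(full_name: str) -> Optional[str]:
    """Guess a gender from the first name, or return None when unknown."""
    return toy_lookup.get(full_name.split()[0])


crew = pd.DataFrame(
    {
        "director": ["Sofia Coppola", "Martin Scorsese", "Unknown Person"],
        "director_gender": [None, "M", None],
    }
)
filled = fill_missing_gender(crew, "director", "director_gender", toy_guess)
print(filled)
```

Names that the guesser cannot resolve stay missing, which keeps the fallback from inventing values.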

In [None]:
# Loading
credits_df, meta_df = loader.load_tmdb_dataset()

# Basic cleaning (scraping takes time)
credits_df = cleaner.clean_credit_df(credits_df, meta_df)
meta_df = cleaner.clean_metadata_df(meta_df)

### Part 3: Bechdel Dataset 

In the cleaning process, we drop the `id` column, which is not useful for our analysis.

In [None]:
# Loading
bechdel_df = loader.load_bechdel_dataset()

# Basic cleaning
bechdel_df = cleaner.clean_bechdel_df(bechdel_df)

## 1.2. Putting all together

This section merges all of our loaded and pre-processed datasets into a single large `movies` dataframe that is used throughout the analysis.

### Part 1: Merging datasets

In [None]:
# Merge datasets
movies = merger.merge_with_char(movie_df, character_df)
movies = merger.merge_with_summaries(movies, summaries_df)
movies = merger.merge_with_metadata(movies, meta_df)
movies = merger.merge_with_credits(movies, credits_df)

# Put the actor_name column at the beginning since we have one row per actor name
movies = movies.reindex(
    columns=["actor_name"] + [col for col in movies.columns if col != "actor_name"]
)

display(movies.sample(5))

# This will be useful when we analyze the effect of the Bechdel score rating on
# women's representation in movies
movies_with_bechdel = merger.merge_with_bechdel(movies, bechdel_df)
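
The merges above are handled by our `merger` module. As a rough illustration of the pattern only, the sketch below joins a toy movie table with a toy character table and then moves `actor_name` to the front. Using `wikiID` as the join key and the toy values are assumptions made for this example.

```python
import pandas as pd

# Toy tables (hypothetical values); `wikiID` as join key is an assumption
movies_toy = pd.DataFrame({"wikiID": [1, 2], "movie_title": ["Alpha", "Beta"]})
characters_toy = pd.DataFrame(
    {"wikiID": [1, 1, 2], "actor_name": ["Ann", "Bob", "Cid"]}
)

# One row per (movie, actor); `validate` raises if a movie key is duplicated
# on the left side, which would silently multiply rows otherwise
merged = movies_toy.merge(
    characters_toy, on="wikiID", how="left", validate="one_to_many"
)

# Move `actor_name` to the first position, as in the cell above
merged = merged[["actor_name"] + [c for c in merged.columns if c != "actor_name"]]
print(merged)
```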

### Part 2: Add roles and categorize roles

Fetch all roles in the movies from TMDB and attempt to assign each of them to one of these categories:
- Named character (John)
- Relationship with named character (John's wife)
- Unnamed character with function (Mechanic)
- Relationship with unnamed character with function (Mechanic's wife)
- Uncredited
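
One possible way to assign these categories is a small rule-based classifier, sketched below. The job vocabulary, the label strings, and the function `categorize_role` are hypothetical; the real logic may differ and would need a much larger vocabulary.

```python
import re

# Tiny illustrative vocabulary of function/job words (hypothetical)
JOB_WORDS = {"mechanic", "doctor", "nurse", "waiter", "policeman"}


def categorize_role(role: str, job_words: set = JOB_WORDS) -> str:
    """Assign a role string to one of the five categories listed above."""
    text = role.strip()
    # Uncredited roles are flagged explicitly in the role string
    if re.search(r"\buncredited\b", text, flags=re.IGNORECASE):
        return "Uncredited"
    # A possessive such as "John's wife" marks a relationship to someone
    match = re.match(r"^(?P<owner>.+?)'s\s+\S", text)
    owner = match.group("owner") if match else text
    is_job = owner.lower() in job_words
    if match:
        return "Role to job" if is_job else "Role to named"
    return "Job" if is_job else "Named"


for example in ["John", "John's wife", "Mechanic", "Mechanic's wife", "Guard (uncredited)"]:
    print(example, "->", categorize_role(example))
```

Anything that is not in the job vocabulary is treated as a name here, which is a deliberate simplification.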

In [None]:
API_TMDB_TOKEN = "API_TMDB_TOKEN"  # placeholder: replace with your own TMDB API token

In [None]:
movies = roles.get_roles(movies, API_TMDB_TOKEN)

In [None]:
# Alternatively, load the fully cleaned and merged data directly
movies = pd.read_pickle("./data/Processed/movies_with_roles_cat.pkl")
movies_with_bechdel = pd.read_pickle("./data/Processed/movies_with_bechdel.pkl")

## 1.3. Advanced Cleaning and Pre-processing

This section is devoted to a more in-depth cleaning of our big dataset `movies`.

### Part 1. Outlier Removal 

In [None]:
movies.drop(columns=["wikiID"]).describe()

As one can see, the data contain outliers and dirty samples: negative actor ages, 16-hour movies, etc.
We remove them in the next cell; outliers are removed with the quantile method (see [here](./preprocessing/cleaner.py#clean_remove_outlier)).
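
The actual logic lives in `cleaner.clean_remove_outlier`. The sketch below only illustrates the general idea of quantile-based filtering; the 1% and 99% bounds, the choice to keep missing values, and the helper name are assumptions for this example.

```python
import pandas as pd


def remove_outliers_quantile(
    df: pd.DataFrame, name: str, lower: float = 0.01, upper: float = 0.99
) -> pd.DataFrame:
    """Keep rows whose `name` value lies between two quantiles (bounds included)."""
    low, high = df[name].quantile([lower, upper])
    # Missing values are kept here; they are handled in a separate step
    keep = df[name].between(low, high) | df[name].isna()
    return df[keep]


# Toy runtimes in minutes, plus one 16-hour movie (960 min)
toy = pd.DataFrame({"movie_runtime": list(range(80, 180)) + [960]})
filtered = remove_outliers_quantile(toy, "movie_runtime")
print(len(toy), "->", len(filtered))
```

Quantile bounds adapt to the observed distribution, but they also trim legitimate extreme values, so the bounds should be chosen with the feature in mind.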

In [None]:
movies = cleaner.clean_movies_ranges(movies)
movies = cleaner.clean_remove_outlier(
    movies, method="quantile", name="actor_height_meters"
)
movies = cleaner.clean_remove_outlier(movies, method="quantile", name="movie_runtime")

movies.drop(columns=["wikiID"]).describe()

In [None]:
viz.visualize_year_distribution(movies)

### Part 2. Handling missing values

Here, we are interested in the proportion of missing values.

In [None]:
viz.visualize_missing_values(movies)

The previous plot shows that the `writer` and `writer_gender` features have many missing values. Since this severely limits their usability, we remove them. We also drop the `actor_ethni_fbid` feature, which we would not have used anyway.

This corresponds to dropping columns with more than 50% missing values.
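
The 50% rule can be checked programmatically. The following is a minimal sketch on toy data; the helper `columns_above_missing_threshold` is hypothetical and not part of `cleaner.py`.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(df: pd.DataFrame, threshold: float = 0.5) -> list:
    """Return the columns whose share of missing values exceeds `threshold`."""
    # isna().mean() gives the fraction of missing values per column
    missing_share = df.isna().mean()
    return missing_share[missing_share > threshold].index.tolist()


# Toy data (hypothetical values): `writer` is 75% missing, `movie_runtime` 25%
toy = pd.DataFrame(
    {
        "actor_name": ["Ann", "Bob", "Cid", "Dee"],
        "writer": [None, None, None, "Eve"],
        "movie_runtime": [90.0, np.nan, 100.0, 120.0],
    }
)
to_drop = columns_above_missing_threshold(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
```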

In [None]:
movies = cleaner.clean_missing_values_by_dropping(
    movies, to_drop=["writer", "writer_gender", "actor_ethni_fbid"]
)
display(movies.sample(2))

# 2. Data Analysis 
## 2.1. Global Analysis 
### Part 1. Analysis of inequalities  

In this section, we will examine the distribution of women in various sectors of the movie industry. To begin, let's visualize the distribution of men and women across three roles: actors, film directors, and producers.

In [None]:
viz.visualize_gender_distribution(movies)

Women are clearly underrepresented both in front of the camera and behind the scenes, and the gap is even wider behind the camera than in front of it. This is a first interesting observation.
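
To put numbers on such a plot, the gender shares can be computed directly. The sketch below assumes one row per actor and hypothetical column names (`actor_gender`, `director`, `director_gender`); people are de-duplicated first so that a director is not counted once per cast member.

```python
import pandas as pd


def gender_shares(df: pd.DataFrame, gender_col: str, unit_cols: list) -> pd.Series:
    """Share of each gender among unique people identified by `unit_cols`."""
    # Count each person once, however many rows they appear in
    unique_people = df.drop_duplicates(subset=unit_cols)
    return unique_people[gender_col].value_counts(normalize=True)


# Toy data (hypothetical values): one row per actor, directors repeated per movie
toy = pd.DataFrame(
    {
        "movie_title": ["Alpha", "Alpha", "Beta"],
        "actor_name": ["Ann", "Bob", "Cid"],
        "actor_gender": ["F", "M", "M"],
        "director": ["Dan", "Dan", "Eve"],
        "director_gender": ["M", "M", "F"],
    }
)
actor_shares = gender_shares(toy, "actor_gender", ["actor_name"])
director_shares = gender_shares(toy, "director_gender", ["director"])
print(actor_shares)
print(director_shares)
```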

### Part 2. Analysis of stereotypes in gender representation

In this section, we explore the roles given to women and the stereotypes they reflect.

In [None]:
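# Named `chartypes` to match the merge in the next cell and to avoid shadowing the `roles` module
chartypes = loader.load_roles()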

In [None]:
actor_with_role = movies.merge(chartypes, on=["movie_title", "fbid_char_actor_map"], how="inner").copy(deep=True)

viz.visualize_wordcloud_roles(actor_with_role)

The contrast in the terms used for men and women in this word cloud reflects common gender stereotypes in storytelling and media, where men are often associated with power and action, while women are often portrayed in relation to beauty and emotional roles.

#### Analysis of roles played

In [None]:
viz.visualize_type_of_role_credited(movies, gender="B")
viz.visualize_type_of_role_credited(movies, gender="F")
viz.visualize_type_of_role_credited(movies, gender="M")

In [None]:
viz.visualize_proportion_gender_credited(movies)
viz.visualize_proportion_specific_gender_credited(movies, "F")
viz.visualize_proportion_specific_gender_credited(movies, "M")

In [None]:
viz.visualize_un_credited_stacked(movies, gender="B")
viz.visualize_un_credited_stacked(movies, gender="F")
viz.visualize_un_credited_stacked(movies, gender="M")

Looking at the "Job" category, what kinds of jobs do the characters hold?

In [None]:
viz.visualize_wordcloud_job_roles(movies)

Looking at our "Role to job" category (e.g. "Marx's Boss"), what kinds of jobs are involved?

In [None]:
viz.visualize_wordcloud_r2j_roles(movies)


### Part 3. Analysis of reception 
### Part 4. Across the world (Fine-Grained Analysis)


## 2.2. Bechdel Test 
## 2.3. NLP
# 3. Conclusion