## "MADAME" - Studying the representation of women through the movie director's lens

### Abstract

"What do we do now?" : this famous movie quote embodies how much women are discredited and set aside in cinema, often reduced to their appearance, or defined by their relationships with men. 
The MADAME project explores the dynamics of female representation in cinema, focusing on how the gender of a movie director influences the portrayal of women. By examining movie genres, character tropes, plot emotions and success, the project investigates how differently male and female directors depict women. Through an in-depth analysis of plot summaries and character attributes, the MADAME project identifies trends in gender representation in both male-directed and female-directed movies. Readers will follow the story of Madame, who leads the analysis, uncovering insights and sparking discussions on the way male and female directors actually represent women in movies. The ultimate goal of our work is to question the deep-rooted stereotypes of the film industry and to advocate for more inclusive and diverse narratives in film.

### Research Questions

- How does the **gender of a movie director** influence the portrayal of women in cinema?

- What are the **key female stereotypes** in movies?

- Are female directors **better than male directors** at depicting women in movies?

### Tools, Libraries, and Datasets

#### Tools and Libraries
- **Python Libraries**:
  - **Pandas**
  - **NumPy**
  - **Matplotlib**
  - **Seaborn** (display graphs)
  - **json** (clustering movie languages, genres and countries)
  - **tqdm** (progress bar when running functions)
  - **collections** (Counter)
  - [**Hugging Face’s transformers library**](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) (sentiment analysis)
  
- **Visualization**: Interactive visualization libraries (to be determined)

#### Main dataset
- **CMU Movie Summaries Dataset**: contains the following files:
  - **characters_metadata.tsv**  
  - **movie_metadata.tsv**  
  - **name_clusters.txt**  
  - **plot_summaries.txt**  
  - **tvtropes.clusters.txt**  

#### Proposed additional datasets
- [**IMDb Ratings**](https://datasets.imdbws.com/):
  - provides movie ratings data
  - reduces the initial dataset size by only keeping movies whose ratings are available
- [**TMDB Ratings**](https://www.kaggle.com/datasets/juzershakir/tmdb-movies-dataset):
  - provides box office and budget data
- [**Bechdel Test API**](https://bechdeltest.com/api/v1/doc):
  - provides the Bechdel Test result ('rating') for a set of movies
  - drastically reduces the size of the primary dataset
  - essential for assessing gender interaction trends and for understanding how accurately the Bechdel Test predicts gender equality in film
- [**Gender by name - UCI**](https://archive.ics.uci.edu/dataset/591/gender+by+name):
  - provides a wide range of first names and their associated gender
  - used to recover missing genders in the characters_metadata.tsv file
  - indirectly helps to analyze how gender correlates with character types
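
The two filtering and recovery steps mentioned above can be illustrated with a minimal pandas sketch. It is only an illustration under assumptions: the toy tables, the column names (`imdb_id`, `average_rating`, `character_name`, `gender`) and the small `name_gender` lookup are hypothetical and do not reflect the actual schema of the project files.

```python
import pandas as pd

# Toy stand-ins for the real tables; every column name here is an assumption.
movies = pd.DataFrame({
    "imdb_id": ["tt001", "tt002", "tt003"],
    "title": ["Movie A", "Movie B", "Movie C"],
})
ratings = pd.DataFrame({
    "imdb_id": ["tt001", "tt003"],
    "average_rating": [7.1, 5.4],
})

# An inner merge keeps only the movies that have a rating, which is why
# adding a ratings source shrinks the initial dataset.
rated_movies = movies.merge(ratings, on="imdb_id", how="inner")

characters = pd.DataFrame({
    "character_name": ["Alice Smith", "Bob Jones", "Zorg", None],
    "gender": ["F", None, None, None],
})

# Hypothetical first-name lookup, standing in for the UCI "Gender by name" data.
name_gender = {"alice": "F", "bob": "M"}

# Take the first token of each name, lowercase it, and look it up.
first_names = characters["character_name"].str.split().str[0].str.lower()

# Only fill genders that are missing; known values are left untouched.
characters["gender"] = characters["gender"].fillna(first_names.map(name_gender))

print(rated_movies)
print(characters)
```

Names that are absent from the lookup (or missing altogether) simply stay without a gender, so the step never overwrites information that is already in the dataset.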

### Repository Organization

Our repository is organized as follows:

- `data/`
  - `processed/`
    - `transitionary/`
    - `movies_complete.csv`
  - `raw/`
- `src/`
  - `data/`
    - `external_data/`
    - `data_cleaner.py`
    - `data_loader.py`
    - `data_transformer.py`
  - `models/`
    - `evaluate_model.py`
    - `models.py`
    - `train_model.py`
  - `utils/`
    - `methods.py`
    - `visualization.py`
- `.gitignore`
- `pip_requirements.text`
- `README.md`
- `results.ipynb`

The `data` folder contains most of the data used in our analysis. It is split into two folders: `processed`, which holds processed and filtered dataframes that are ready to use, and `raw`, which holds the initial, uncleaned datasets. Most of the dataframes in `processed` sit in a subfolder called `transitionary`. They are not used directly in the analysis; they are intermediate files that were shaped and transformed to build `movies_complete.csv`.

The `src` folder contains three subfolders: `data`, `models` and `utils`.

`src/data` contains the `external_data` folder, which groups the data files that were too large to push to GitHub; this folder is therefore listed in `.gitignore`. It also contains three `.py` files:
- `data_cleaner.py` cleans specific parts of the dataframes, such as the movie release dates, which were initially given either as full dates or as a year only.
- `data_loader.py` gathers the functions that load files and external datasets while handling exceptions.
- `data_transformer.py` preprocesses, filters and cleans the raw and external datasets.
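
As an illustration of the kind of cleaning described for the release dates, here is a minimal sketch that reduces mixed date strings to a year. The function name `extract_release_year` and the accepted formats (`YYYY`, `YYYY-MM`, `YYYY-MM-DD`) are assumptions made for this example; the actual code in `data_cleaner.py` may work differently.

```python
import re
from typing import Optional

# Accepts 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' (assumed formats, see lead-in).
_DATE_PATTERN = re.compile(r"^\s*(\d{4})(?:-\d{2}){0,2}\s*$")


def extract_release_year(raw_date: Optional[str]) -> Optional[int]:
    """Return the four-digit year of a release date, or None if unparseable."""
    if raw_date is None:
        return None
    match = _DATE_PATTERN.match(str(raw_date))
    # Anything that does not follow one of the assumed formats is treated
    # as missing rather than guessed.
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for value in ["2001-08-24", "1988", "2001-08", "unknown", None]:
        print(repr(value), "->", extract_release_year(value))
```

Returning `None` for unrecognized values keeps the cleaning step conservative: malformed dates become explicit missing values instead of wrong years.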

The `models` folder contains the Machine Learning model built by Mahlia: `models.py` defines the model, `train_model.py` trains it, and `evaluate_model.py` evaluates it.

The `utils` folder contains the inevitable `__pycache__` folder, as well as `methods.py`, which gathers the functions that process and shape the data from `movies_complete_df` so that it can be passed directly to the plotting functions gathered in `visualization.py`.

The `.gitignore`, `pip_requirements.text` and `README.md` files need little introduction. They are found in almost any GitHub repository and respectively keep certain files or folders out of GitHub, list the specific library versions to install, and present the project.

Lastly, `results.ipynb` contains all the analyses performed on the datasets we collected. The plotting functions were written and designed by the artistic directors of the ADApocalypse group, and the results are explained by its statisticians.

### Repository Structure 

This project is structured as follows:

```sh
├── .gitignore
├── data                                # data sources
│   ├── (processed)
│   │   ├── (characters_metadata.csv)    # pre-filtered youniverse files
│   │   ├── (imdb_ratings.csv)
│   │   ├── (movies_director.csv)
│   │   ├── (movies_directors_combined.csv)
│   │   ├── (movies_metadata_success.csv)
│   │   ├── (movies_metadata.csv)
│   │   └── (original)                  # original youniverse dataset
│   │       ├── (df_channels_en.tsv)
│   │       ├── (df_timeseries_en.tsv)
│   │       ├── (youtube_comments.tsv)
│   │       └── (yt_metadata_en.jsonl)
│   ├── (raw)
│   ├── games.csv
│   └── word_alpha.txt      
├── notebooks                          
│   ├── results.ipynb                   # main analysis notebook
│   └── prefiltering.ipynb              # data prefiltering notebook
├── src                                         
│   └── utils.py                        # utility functions
├── requirements.txt                    # pip requirements file
└── README.md
```

### Methods

#### Management of External Datasets

Several large datasets essential for the MADAME project were excluded from the Git repository and added to `.gitignore`. These files are located in the `src/data/external_data` directory and need to be downloaded manually from their respective sources. Below is the list of datasets used:

1. **`title.basics.tsv`**  
   - **Source**: IMDb database  
   - **Description**: Contains metadata about films, including titles, release years, and genres.  

2. **`title.ratings.tsv`**  
   - **Source**: IMDb database  
   - **Description**: Provides information about movie ratings and vote counts.  

3. **`TMDB_movie_dataset_v11.csv`**  
   - **Source**: [Kaggle](https://www.kaggle.com/)  
   - **Description**: A comprehensive dataset with detailed information about movies, such as budgets, revenues, and more.  

4. **`film_tropes.csv`**  
   - **Source**: [TV Tropes Repository](https://github.com/dhruvilgala/tvtropes?tab=readme-ov-file)  
   - **Description**: Includes data on film tropes and their associations.  

5. **`genderedness_filter.csv`**  
   - **Source**: [TV Tropes Repository](https://github.com/dhruvilgala/tvtropes?tab=readme-ov-file)  
   - **Description**: Provides insights into gender-related classifications of tropes.  
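
Since these files have to be downloaded by hand, a small check can confirm that they are all in place before running the analysis. The sketch below is not part of the repository: the helper name `missing_external_files` is hypothetical, and only the directory path and the file names are taken from the list above.

```python
from pathlib import Path

# File names taken from the list of external datasets above.
EXPECTED_FILES = [
    "title.basics.tsv",
    "title.ratings.tsv",
    "TMDB_movie_dataset_v11.csv",
    "film_tropes.csv",
    "genderedness_filter.csv",
]


def missing_external_files(data_dir):
    """Return the expected file names that are absent from data_dir."""
    directory = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]


if __name__ == "__main__":
    # Path assumed to be relative to the repository root.
    missing = missing_external_files("src/data/external_data")
    if missing:
        print("Missing external datasets:", ", ".join(missing))
    else:
        print("All external datasets found.")
```

Running the script from the repository root lists whichever files still need to be downloaded.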

#### Data Handling & Preprocessing
- **Data Wrangling**: extraction, cleaning and standardization of the data
- Alignment of the datasets on key attributes such as character TV tropes, the names and genders of characters and actors, plots, and movie genres
- Data filtering to comply with the proposed additional datasets and ensure compatibility across sources, which also reduces the size of the usable data
- Data clustering

TODO: add details.

#### Data Visualization
- **Univariate Analysis**: use of data visualization techniques (histograms, box plots, scatter plots, etc.) to conduct a graphical analysis of the gender distribution of characters and actors.

- **Multivariate Analysis**: further analysis to identify relationships between various factors (e.g. the presence of female characters, movie ratings, box office performance).

#### Data Description
Robust statistical methods are used to evaluate correlations, distributions, and outliers in the data.

TODO: specify which tests were used and add details.

#### Learning From Data
- **Machine Learning Techniques**: ML methods (TODO: specify which ones) were employed to build a model that classifies films based on their gender representation and predicts whether a film will pass or fail the Bechdel Test. The model reaches an accuracy of 70%.

#### Sentiment Analysis
The sentiment of character descriptions and plot summaries will be analyzed using pre-trained sentiment models to assess how women’s roles are portrayed physically and emotionally.

### Timeline

**Weeks 1 to 9**:
  - Individual exploration and data wrangling
  - Preliminary analysis on the CMU Movie Summaries dataset
  - Definition of project objectives, allocation of tasks and delineation of additional datasets

**Week 10**:  
  - Further data wrangling
  - Analysis on the preprocessed data

**Week 11**:  
  - Team collaboration in order to refine data handling steps
  - Work on initial visualizations and analysis 

**Week 12**:  
  - Finalization of data analysis and visualizations
  - Sentiment analysis on character descriptions and plot summaries
  - Website creation, using the Svelte UI framework
  - Data and visualization formatting in JSON

**Week 13**:  
  - Focus on predictive modeling and refining the analysis based on feedback
  - Further work on the structure of the web interface and improvement of its interactivity
  - Further work on data and visualization formatting in JSON
  - Repository cleaning (restructuring the Git repository, organizing functions and result files, etc.)
  - Writing of the storytelling

TODO: add details.

**Week 14**:  
  - Completion of the final project notebook
  - Final work on data and visualization formatting in JSON
  - Storytelling implementation on the webpage
  - Styling and design of the webpage
  - Editing of the README
  - Content proofreading

TODO: add details.

### Organization within the Team

- **Coralie**: "movie metadata" analysis, website interface management, formatting data and visualizations in JSON

- **Juliette**: "movie metadata" analysis, data preprocessing, project timeline management, cleaning of the repository

- **Mahlia**: "character metadata" analysis, ML Bechdel model, data preprocessing, cleaning of the repository

- **Maximilien**: "tvtropes" and "plot_summaries" analysis, "transformer" model analysis

- **Pernelle**: "character metadata" analysis, storytelling writing, graphic design of the website page 

