## "MADAME" - Studying Women’s Representation through the Movie Director’s Lens

### Abstract

"What do we do now?": this famous movie quote embodies how much women are discredited and set aside in cinema, often reduced to their appearance or defined by their relationships with men.
The MADAME project explores the dynamics of female representation in cinema, focusing on how the gender of a movie director influences the portrayal of women. By examining movie genres, character tropes, plot emotions and success, the project investigates how male and female directors depict women. Through an in-depth analysis of plot summaries and character attributes, the MADAME project identifies trends in gender representation in both male- and female-directed movies. Readers follow the story of Madame, who leads the analysis, uncovering insights and sparking discussion about how male and female directors actually represent women in movies. The ultimate goal of our work is to question the stereotypes rooted in the film industry and to advocate for more inclusive and diverse narratives in film.

### Research Questions

- How does the **gender of a movie director** influence the portrayal of women in cinema?

- What are the **key female stereotypes** in movies?

- Are female directors **better than men** at depicting women in movies?

### Tools, Libraries, and Datasets

#### Tools and Libraries
- **Python Libraries**:
  - **Pandas**
  - **NumPy**
  - **Matplotlib**
  - **Seaborn** (graph display)
  - **json** (clustering movie languages, genres and countries)
  - **tqdm** (progress bar when running functions)
  - **collections** (Counter)
  - [**Hugging Face’s transformers library**](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) (sentiment analysis)
  
- **Visualization**: Interactive visualization libraries (to be determined)

#### Main dataset
- **CMU Movie Summaries Dataset**: contains the following files:
  - **character.metadata.tsv**
  - **movie.metadata.tsv**
  - **name.clusters.txt**
  - **plot_summaries.txt**
  - **tvtropes.clusters.txt**

#### Proposed additional datasets
- [**IMDb Ratings**](https://datasets.imdbws.com/):
  - provides movie ratings data
  - reduces the initial dataset size to 32172 movies 
- [**TMDB Ratings**](https://www.kaggle.com/datasets/juzershakir/tmdb-movies-dataset):
  - provides box office and movie budget data
- [**Bechdel Test API**](https://bechdeltest.com/api/v1/doc): 
  - provides Bechdel Test result ('rating') for a group of movies.
  - drastically reduces primary dataset size
  - essential for assessing gender interaction trends and understanding the accuracy of the Bechdel Test as a predictor of gender equality in film.
- [**Gender by name - UCI**](https://archive.ics.uci.edu/dataset/591/gender+by+name):
  - provides a wide range of first names and their associated gender
  - used to recover missing genders in the `character.metadata.tsv` file
  - indirectly helps to analyze how gender correlates with character types
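
A minimal sketch of how a first-name table can fill in missing genders. The column names (`character_name`, `gender`, `name`, `count`), the helper `fill_missing_gender` and the toy rows are assumptions made for illustration, not the actual schema used in `src/data`.

```python
import pandas as pd


def fill_missing_gender(characters: pd.DataFrame, names: pd.DataFrame) -> pd.DataFrame:
    """Fill missing `gender` values from a first-name lookup table."""
    # Normalise names so that "Anna" and "anna" match.
    names = names.assign(name=names["name"].str.lower())
    # When a name appears with both genders, keep the most frequent one.
    lookup = (
        names.sort_values("count", ascending=False)
        .drop_duplicates("name")
        .set_index("name")["gender"]
    )
    # Use the first token of the character name as the first name.
    first_names = characters["character_name"].str.split().str[0].str.lower()
    guessed = first_names.map(lookup)

    filled = characters.copy()
    # Known genders are kept; only missing values are replaced.
    filled["gender"] = filled["gender"].fillna(guessed)
    return filled


# Toy data (hypothetical rows, not taken from the real datasets).
characters = pd.DataFrame(
    {
        "character_name": ["Mary Smith", "John Doe", "Zorg", "Anna Lee"],
        "gender": ["F", None, None, None],
    }
)
names = pd.DataFrame(
    {
        "name": ["Mary", "John", "Anna", "Anna"],
        "gender": ["F", "M", "F", "M"],
        "count": [100, 90, 80, 2],
    }
)

filled = fill_missing_gender(characters, names)
print(filled)
```

Names that are absent from the lookup table (here "Zorg") stay missing, which keeps the recovered values separate from guesses.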


## Repository Structure

```sh
├── .gitignore                              # Git ignore file 
├── data
│ ├── processed                             # processed data
│ │   └── transitionary                      
│ │       ├── (bechdel_ratings.csv)               
│ │       ├── (characters_metadata.csv) 
│ │       ├── (imdb_ratings.csv) 
│ │       ├── (movies_director.csv) 
│ │       ├── (movies_metadata.csv) 
│ │       ├── (movies_success.csv) 
│ │       ├── (plot_emotions.csv) 
│ │       └── (movies_complete.csv) 
│ └── raw                                   # raw data
│     ├── (character.metadata.tsv)
│     ├── (clusters.json) 
│     ├── (movie.metadata.tsv) 
│     ├── (name.clusters.txt) 
│     ├── (plot_summaries.txt) 
│     ├── (README.txt) 
│     └── (tvtropes.clusters.txt) 
├── src                                     # Source code 
│   ├── data                                # data processing 
│   │   ├── data_cleaner.py 
│   │   ├── data_loader.py 
│   │   └── data_transformer.py 
│   ├── models                              # model scripts 
│   │   ├── evaluate_model.py 
│   │   ├── models.py 
│   │   └── train_model.py                  # Bechdel ML model
│   └── utils                               # utility functions 
│       ├── methods.py 
│       └── visualization.py 
├── results.ipynb                           # results and analysis notebook 
│ 
├── log_reg_model.pkl                       # Trained model file 
├── pip_requirements.txt                    # pip requirements file
└── README.md 
```

## Methods

### Management of External Datasets

Several large datasets essential for the MADAME project were excluded from the Git repository and added to `.gitignore`. These files are located in the `src/data/external_data` directory and need to be downloaded manually from their respective sources. Below is the list of datasets used:

1. **`title.basics.tsv`**  
   - **Source**: IMDb database  
   - **Description**: Contains metadata about films, including titles, release years, and genres.  

2. **`title.ratings.tsv`**  
   - **Source**: IMDb database  
   - **Description**: Provides information about movie ratings and vote counts.  

3. **`TMDB_movie_dataset_v11.csv`**  
   - **Source**: [Kaggle](https://www.kaggle.com/)  
   - **Description**: A comprehensive dataset with detailed information about movies, such as budgets, revenues, and more.  

4. **`film_tropes.csv`**  
   - **Source**: [TV Tropes Repository](https://github.com/dhruvilgala/tvtropes?tab=readme-ov-file)  
   - **Description**: Includes data on film tropes and their associations.  

5. **`genderedness_filter.csv`**  
   - **Source**: [TV Tropes Repository](https://github.com/dhruvilgala/tvtropes?tab=readme-ov-file)  
   - **Description**: Provides insights into gender-related classifications of tropes.  
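
As an illustration, the two IMDb files listed above could be loaded and joined roughly as follows. This is a sketch, not the code in `data_loader.py`: the function name `load_imdb_ratings` is hypothetical, and the tiny in-memory tables stand in for the real downloads. It relies on the IMDb convention of writing missing values as `\N` and on the shared `tconst` identifier.

```python
import csv
import io

import pandas as pd


def load_imdb_ratings(basics_file, ratings_file) -> pd.DataFrame:
    """Join IMDb title metadata with ratings, keeping feature films only."""
    basics = pd.read_csv(
        basics_file,
        sep="\t",
        na_values="\\N",          # IMDb marks missing values with \N
        quoting=csv.QUOTE_NONE,   # titles may contain unbalanced quotes
        dtype=str,
        usecols=["tconst", "titleType", "primaryTitle", "startYear"],
    )
    ratings = pd.read_csv(ratings_file, sep="\t", na_values="\\N")
    movies = basics[basics["titleType"] == "movie"]
    # Inner join: titles without a rating are dropped.
    return movies.merge(ratings, on="tconst", how="inner")


# Toy stand-ins for title.basics.tsv and title.ratings.tsv (made-up rows).
basics_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0000001\tmovie\tFilm A\t1999\n"
    "tt0000002\tshort\tShort B\t\\N\n"
    "tt0000003\tmovie\tFilm C\t2005\n"
)
ratings_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t7.5\t1200\n"
    "tt0000002\t6.0\t50\n"
)

imdb = load_imdb_ratings(io.StringIO(basics_tsv), io.StringIO(ratings_tsv))
print(imdb)
```

With the real files, the same function would be called with the paths to `title.basics.tsv` and `title.ratings.tsv`. The inner join is one reason the merged dataset is smaller than the original one.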

#### Data Handling & Preprocessing
- **Data Wrangling**: extraction, cleaning and standardization of the data
- Alignment of the datasets on key attributes such as character TV tropes, character and actor names and genders, plots, and movie genres
- Data filtering to match the proposed additional datasets and ensure compatibility across sources, which reduces the usable data size
- Data clustering
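
One way the clustering step could look, assuming `clusters.json` maps each cluster name to the list of raw labels it groups. That file structure, the helper names and the sample rows are hypothetical; the real file may be organised differently.

```python
import json

import pandas as pd

# Hypothetical content of clusters.json: cluster name -> raw labels.
clusters_json = '{"Drama": ["Drama", "Melodrama"], "Comedy": ["Comedy", "Slapstick"]}'


def invert_clusters(clusters: dict) -> dict:
    """Turn {cluster: [labels]} into {label: cluster} for fast lookup."""
    return {label: cluster for cluster, labels in clusters.items() for label in labels}


def cluster_genres(genres: list, label_to_cluster: dict) -> list:
    """Map raw genre labels to sorted, de-duplicated cluster names."""
    # Labels without a known cluster are dropped.
    return sorted({label_to_cluster[g] for g in genres if g in label_to_cluster})


# Toy movies (made-up rows).
movies = pd.DataFrame(
    {
        "title": ["Film A", "Film B", "Film C"],
        "genres": [["Melodrama", "Drama"], ["Slapstick"], ["Western"]],
    }
)

mapping = invert_clusters(json.loads(clusters_json))
movies["genre_clusters"] = movies["genres"].apply(lambda g: cluster_genres(g, mapping))
print(movies)
```

The same lookup approach would apply to languages and countries, each with its own mapping.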

#### Data Visualization
- **Univariate Analysis**: use of data visualization techniques (histograms, box plots, scatter plots, etc.) to conduct a graphical analysis of the gender distribution of characters and actors.

- **Multivariate Analysis**: further analysis to identify relationships between various factors (e.g. the presence of female characters, movie ratings, box office performance)
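
A small sketch of a multivariate comparison, grouping movies by director gender. The column names and the rows are invented for illustration. The pass criterion assumes the Bechdel Test API's 0 to 3 `rating`, where 3 means that a film meets all three criteria.

```python
import pandas as pd

# Toy data (made-up rows, hypothetical column names).
movies = pd.DataFrame(
    {
        "director_gender": ["F", "F", "M", "M", "M"],
        "imdb_rating": [7.0, 6.0, 6.5, 7.5, 5.5],
        "bechdel_rating": [3, 3, 1, 3, 0],
    }
)

# A film passes the Bechdel Test only with the top rating.
movies["passes_bechdel"] = movies["bechdel_rating"] == 3

# One row per director gender, with several indicators side by side.
summary = movies.groupby("director_gender").agg(
    n_movies=("imdb_rating", "size"),
    mean_rating=("imdb_rating", "mean"),
    bechdel_pass_rate=("passes_bechdel", "mean"),
)
print(summary)
```

A table like `summary` can then be passed to Seaborn or Matplotlib for bar plots comparing the two groups.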

#### Data Description
Robust statistical methods are used to evaluate correlations, distributions, and outliers in the data.

#### Learning From Data
- **Machine Learning Techniques**: ML methods were used to build a model that classifies films based on their gender representation and predicts whether a film passes or fails the Bechdel Test. The trained model (`log_reg_model.pkl`) reaches an accuracy of 72%.
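
The file name `log_reg_model.pkl` suggests a logistic regression, although the exact method is not stated here. The sketch below shows that technique in plain NumPy on synthetic data. It is a stand-in written for illustration, not the project's training code: the two features (`share_female_cast`, `female_director`), the data-generating rule and all function names are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_logistic_regression(X, y, lr=0.5, n_iter=5000):
    """Fit weights by batch gradient descent on the logistic loss."""
    X = np.column_stack([np.ones(len(X)), X])  # add intercept column
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        gradient = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * gradient
    return w


def predict(X, w):
    """Return 1 (pass) when the predicted probability is at least 0.5."""
    X = np.column_stack([np.ones(len(X)), X])
    return (sigmoid(X @ w) >= 0.5).astype(int)


# Synthetic data: passing is made more likely by a larger share of women
# in the cast and by a female director (an invented rule, for the demo only).
rng = np.random.default_rng(0)
n = 400
share_female_cast = rng.uniform(0, 1, n)
female_director = rng.integers(0, 2, n)
logits = 8 * (share_female_cast - 0.5) + 1.0 * female_director - 0.5
y = (rng.uniform(0, 1, n) < sigmoid(logits)).astype(int)
X = np.column_stack([share_female_cast, female_director])

# Simple hold-out split: first 300 rows for training, the rest for testing.
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

weights = train_logistic_regression(X_train, y_train)
accuracy = (predict(X_test, weights) == y_test).mean()
print(f"test accuracy: {accuracy:.2f}")
```

The accuracy printed here only reflects the synthetic data and says nothing about the 72% obtained on the real Bechdel dataset.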

#### Sentiment Analysis
The sentiment of character descriptions and plot summaries will be analyzed using pre-trained sentiment models to assess how women’s roles are portrayed physically and emotionally.

### Timeline

**Weeks 1 to 9**:
  - Individual exploration and data wrangling
  - Preliminary analysis on the CMU Movie Summaries dataset
  - Definition of project objectives, allocation of tasks and delineation of additional datasets

**Week 10**:  
  - Further data wrangling
  - Analysis on the preprocessed data

**Week 11**:  
  - Team collaboration in order to refine data handling steps
  - Work on initial visualizations and analysis 

**Week 12**:  
  - Finalization of data analysis and visualizations.
  - Sentiment analysis on character descriptions and plot summaries
  - Website creation, using the Svelte UI framework
  - Data and visualization formatting in JSON

**Week 13**:  
  - Focus on predictive modeling and refining the analysis based on feedback
  - Further work on the web interface structure and improvement of its interactivity
  - Further work on data and visualization formatting in JSON
  - Repository cleaning (restructuring the Git repository, organizing functions and result files, etc.)
  - Storytelling writing

**Week 14**:  
  - Completion of the final project notebook
  - Final work on data and visualization formatting in JSON
  - Storytelling implementation on the webpage
  - Styling and design of the webpage
  - Editing the README
  - Content proofreading 

### Organization within the team

| Member | Tasks |
| --- | --- |
| Coralie | General analysis, focus on TV tropes / Formatting data and visualizations in JSON / Website development & design |
| Maximilien | TV tropes and plot summaries analysis / "transformer" model analysis / Bechdel ML model |
| Juliette | Movie success analysis / Data preprocessing / Project timeline management / Repository structure |
| Mahlia | General analysis, focus on TV tropes / Bechdel ML model / Data preprocessing / Repository organisation |
| Pernelle | General analysis / Storytelling management / Website graphic design / README management |

First, to explore the data, we split the work by dataset. Coralie, Pernelle and Mahlia were assigned to the "movie_metadata" and "character_metadata" datasets, Maximilien was in charge of "tv_tropes" and "plot_summaries", and Juliette worked on the "IMDb" and "TMDB" datasets.

Then, Juliette and Mahlia carried out the data preprocessing, the repository organisation and the related results analysis. Maximilien worked on the method functions and on results analysis, while Coralie and Pernelle focused on developing and designing the website, as well as on the storytelling.
Next, Mahlia and Maximilien created a machine learning model to predict the outcome of the Bechdel Test from several movie features.

Finally, everyone took part in creating the visualizations and graphs and in discussing the corresponding results.