# ProjectName - Project documentation
<a href="">ProjectName</a> is a project developed by Ahsan Syed, Chloe Papadopoulou, Francesca Budel, and Orsola Maria Borrini for the final exam of the course <a href="https://www.unibo.it/it/didattica/insegnamenti/insegnamento/2022/467047">"Information Visualization"</a> held by professor Marilena Daquino within the <a href="https://corsi.unibo.it/2cycle/DigitalHumanitiesKnowledge">Digital Humanities and Digital Knowledge Master Degree</a> (University of Bologna), during the A.Y. 2022/2023.

The project analyses the concept of "male gaze" in cinema as described by feminist film theorist Laura Mulvey in her essay "Visual Pleasure and Narrative Cinema".

## Background

write here about domain, problem

## Goals

## Research questions

## Data preparation and data analysis
Overall, the data preparation and analysis were conducted following Mulvey's description of the three "looks associated with cinema":
- **The camera**, recording the pro-filmic event
- **The characters**, looking at each other within the screen illusion
- **The audience**, watching the final product


Finally, a **"Gaze Score"** ranging from 0 to 100 was assigned to each film within our scope to rank them in a "male gaze" hierarchy. 

### The audience: webscraping, sentiment and sexism analyses
Focusing on the audience component of the male gaze meant examining some of the **reviews** written for the movies in our dataset, looking not only at each movie's overall reception but, above all, at individual reviewers' perception of it and any gender bias underlying their opinions.


Reviews are stored **without the name of the user who wrote them**, since that information was not useful for our analysis. Our review dataset comprises 1972 reviews of our chosen movies, all of them **public and available on IMDb's review pages**. It is also essential to underline that our analysis is limited in scope and intended to be neutral: it aims to offer useful reflections rather than harsh critiques.

#### Reviews webscraping
The first step of our audience analysis was to scrape the review pages listed as URLs in the movie.csv files. To do so, we used the [**BeautifulSoup library**](https://www.crummy.com/software/BeautifulSoup/) and inspected the HTML structure of a standard IMDb review page: the textual content of each review is stored inside a `div` block marked with the class `text`, and this is where we retrieved all of our data.
<br>
The task was mostly automated and only required dividing the URLs into chunks to speed up the overall scraping process (since we were working with huge amounts of data!).


We then stored our reviews in a dictionary, converted it into a dataframe, and finally exported it as a **`.csv` file** containing a single column, `Reviews`, alongside an index.
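
The extraction and chunking steps described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `chunked` and `extract_reviews` and the sample HTML are hypothetical, and the sketch assumes that the `text` marker mentioned above is a CSS class on each review `div`.

```python
from bs4 import BeautifulSoup


def chunked(urls, size):
    """Split a list of URLs into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def extract_reviews(html):
    """Return the text of every review found in one reviews page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: each review body is a <div> whose class list includes "text".
    return [div.get_text(" ", strip=True)
            for div in soup.find_all("div", class_="text")]


# Hypothetical fragment standing in for a downloaded reviews page.
sample_html = """
<div class="text show-more__control">A tense, well-acted thriller.</div>
<div class="text show-more__control">Too long, but the cast is great.</div>
<div class="actions">Was this review helpful?</div>
"""

# Dictionary with a single "Reviews" key, mirroring the final column name.
reviews = {"Reviews": extract_reviews(sample_html)}
print(reviews["Reviews"])
print(chunked(["url1", "url2", "url3", "url4", "url5"], 2))
```

A dictionary shaped like `reviews` can then be passed to `pandas.DataFrame` and written out with `to_csv`.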


#### Sentiment Analysis
With the reviews available, we could start the analysis proper. This second step focused on **retrieving the sentiment of our reviews**: *are they positive or negative?*
<br>
This information was later used to check for strong correlations between the possibly sexist tone of a review and its overall sentiment: for example, *how does a poor opinion of women affect the overall perception of a movie?* *Are negative reviews the most sexist?*


To carry out the sentiment analysis, we used the [**`NLTK` library**](https://www.nltk.org/) and its **`VADER`** module, a rule-based sentiment analyzer in which terms are labeled according to their semantic orientation as either positive or negative.
The result of this analysis was a **new dataframe** containing our `Reviews` column, a new `Scores` column (non-weighted sentiment analysis scores, divided into negative, neutral and positive values), a `Compound` column (normalized, weighted values between -1 and 1) and a `Sentiment` column, which provides a clear label distinguishing Positive reviews (pos) from Negative ones (neg).
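
As a rough sketch of this step, the snippet below scores each review with VADER and derives the `Sentiment` label from the compound value. The function names are hypothetical, and the cut-off used to separate `pos` from `neg` (a compound score of 0) is an assumption, since the threshold actually applied is not stated here.

```python
def label_from_compound(compound):
    """Map a VADER compound score to a label.

    Assumption: scores at or above 0 count as "pos", the rest as "neg".
    """
    return "pos" if compound >= 0 else "neg"


def score_reviews(reviews):
    """Return one row per review with the columns described above."""
    # Imported lazily so the labelling helper works without NLTK installed.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-off lexicon download
    analyzer = SentimentIntensityAnalyzer()
    rows = []
    for text in reviews:
        scores = analyzer.polarity_scores(text)
        rows.append({
            "Reviews": text,
            "Scores": {key: scores[key] for key in ("neg", "neu", "pos")},
            "Compound": scores["compound"],
            "Sentiment": label_from_compound(scores["compound"]),
        })
    return rows
```

Calling `score_reviews` on a list of review strings returns a list of dictionaries that can be loaded directly into a dataframe.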


#### Sexism Analysis
Having established the overall sentiment of our reviews, the final step of our audience analysis was **detecting possible traces of sexism in the reviews**.
<br>
To do this, we applied a model created and published by the group NLP-LTU on Hugging Face, the [**BERTweet-large-sexism-detector**](https://huggingface.co/NLP-LTU/bertweet-large-sexism-detector), a classification model for detecting sexism in Tweets or short text paragraphs. As some of our reviews were longer than the model's length limit, a few adjustments were implemented.
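
One possible way to handle over-long reviews is sketched below: each review is split into word-based chunks that are classified separately. This is only an illustration of the general idea, not necessarily the adjustment we implemented; the helper names and the `max_words` value are hypothetical, and the raw predictions are returned without interpreting the model's label names.

```python
def split_into_chunks(text, max_words=100):
    """Split a review into consecutive pieces of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def classify_review(text, classifier, max_words=100):
    """Classify every chunk of a review and return the raw predictions."""
    # `truncation=True` guards against chunks that still exceed the limit.
    return [classifier(chunk, truncation=True)[0]
            for chunk in split_into_chunks(text, max_words)]


def load_classifier():
    """Load the model named above (downloads the weights on first use)."""
    # Imported lazily because transformers is a heavy optional dependency.
    from transformers import pipeline

    return pipeline("text-classification",
                    model="NLP-LTU/bertweet-large-sexism-detector")
```

With the model loaded through `load_classifier()`, `classify_review(review, classifier)` yields one `{"label": ..., "score": ...}` dictionary per chunk, which can then be aggregated per review.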


In the end, the result seemed clear: our reviews were not sexist or, at least, they were *not completely* sexist.
<br>
The model classified them as lacking any kind of gender bias but, having inspected the reviews ourselves, we knew this was not true: a few reviews showed clear signs of misogyny and sexism, not just by using offensive words such as "bitch" or "tramp" when referring to actresses or their characters, but also by constantly describing them as sexy and beautiful or by comparing them to animals.
The model failed to recognize these cases because, measured quantitatively, those sentences weighed very little in the overall structure of a review that otherwise had a neutral or even positive tone.
What emerged from this analysis is that **the audience's gaze is rarely guided by pure prejudice or malevolence**: realistically, our reviews displayed sexism in a "natural" and subtle way, so subtle that even the sexism-detection model failed to acknowledge it when analysing the bigger picture.

### The characters: films and scripts analyses
The aim of this analysis is to measure the dominance of the male gaze within the films and their scripts. This is one of the most important analyses, since it dives directly into the core content of the cinema industry: the scripts, the basis of any film. We chose scripts because they address **the whole setting of the characters**, **how they are defined on camera** (for the viewers) and **how the male characters in the script perceive the non-male ones**. They also show what kinds of dialogue and actions are assigned to male versus non-male characters, providing a good basis for comparison.


#### Bechdel Test
The first step in this analysis is the well-known [Bechdel Test](https://bechdeltest.com/), used for measuring **how women are represented in a given film**. A film needs to satisfy three rules:

1. The movie has to have at least two women in it
2. The movie has to have at least two women who talk to each other
3. The movie has to have at least two women who talk to each other and it is about something other than a man

If a movie satisfies all three rules, it passes the Bechdel test. The test sets a bare-minimum bar that ideally every movie should clear. We will collect this data from already existing datasets and check the results for the movies in our scope.
 

#### Character Description
In this step we will be diving into the **actual descriptions of characters in the scripts**. Character descriptions help us understand how the camera is meant to show certain features of the characters through the use of angles: in this way the camera becomes the gaze and the (non-male) character becomes the object of the gaze.

Our aim is to automatically extract such descriptions from the scripts using Natural Language Processing and to show the words most often used to describe characters (both male and non-male), revealing the differences in the way they are portrayed. We also aim to **categorize female descriptions** as either *highly sexist* or *dubious but problematic*.


#### Character Dialogue
In this step we automatically extract all the dialogue spoken by male and non-male characters in each script, again using NLP tasks. The aim here is to show how the **division and representation of words** is split between male and non-male characters.
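
Once the dialogue has been extracted, the split between genders can be computed along these lines. The sketch assumes the lines are already available as `(gender, text)` pairs; the function name, the gender tags and the sample lines are hypothetical, and measuring the share in words rather than in number of lines is an assumption.

```python
from collections import Counter


def dialogue_share(lines):
    """Return each gender's share (in %) of all spoken words."""
    words = Counter()
    for gender, text in lines:
        words[gender] += len(text.split())
    total = sum(words.values())
    if total == 0:
        return {}
    return {gender: 100 * count / total for gender, count in words.items()}


# Hypothetical, already-tagged dialogue lines from one script.
sample = [
    ("male", "We leave at dawn"),
    ("non-male", "Fine"),
    ("male", "Bring the map"),
]
print(dialogue_share(sample))
```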


#### The "Gaze Score"
The final "gaze score" was calculated taking into account all the character-related factors detailed above.


The division of the score is as follows:
1. **Bechdel Test** (max. 40%), score assigned based on the following criteria
    1. If a movie passes **no rule**: 40%
    2. If a movie passes **only the first rule**: 26.66%
    3. If a movie passes **only the first and second rules**: 13.33%
    4. If a movie passes **all rules**: 0%
2. **Character description** (max. 35%), score assigned based on the following criteria
    1. If a female character's body is described **more than the observed average**: the percentage is assigned according to the number of occurrences, with a maximum value of 30%
    2. If a female character is described in a **dubious, problematic or sexist** manner: the score's calculation is more sensitive to these occurrences and the percentage is assigned accordingly, with a maximum value of 35%
    3. If a female character is not particularly described in any of the above manners: 0%
3. **Character dialogues** (max. 25%), score assigned based on the following criteria:
    1. If male characters have 50% or less of the overall dialogue in the script: 0%
    2. If male characters have 70% or more of the overall dialogue in the script: 25%
    3. If male characters have between 51% and 69% of the overall dialogue in the script: the percentage is assigned on the basis of the percentile, with values between 0.1% and 24.9%
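
The scoring scheme above can be sketched as a small function. This is a simplified illustration under stated assumptions, not the exact implementation: all function names are hypothetical, the description component is taken as an already computed value (capped at 35), and a linear interpolation stands in for the percentile-based assignment of the intermediate dialogue range.

```python
def bechdel_component(rules_passed):
    """40 points for no rule passed, down to 0 when all three are passed."""
    if rules_passed not in (0, 1, 2, 3):
        raise ValueError("rules_passed must be 0, 1, 2 or 3")
    return 40 * (3 - rules_passed) / 3


def dialogue_component(male_share):
    """Points for the male share (in %) of the overall dialogue."""
    if male_share <= 50:
        return 0.0
    if male_share >= 70:
        return 25.0
    # Assumption: linear interpolation between the two thresholds.
    return 25 * (male_share - 50) / 20


def gaze_score(rules_passed, description_points, male_share):
    """Combine the three components into a score between 0 and 100."""
    description = min(max(description_points, 0), 35)
    return (bechdel_component(rules_passed)
            + description
            + dialogue_component(male_share))


# Hypothetical film: passes only the first rule, 10 description points,
# and male characters speak 60% of the dialogue.
print(round(gaze_score(1, 10, 60), 2))
```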

### The camera: SPARQL metadata retrieval
After gathering some results from the ["audience"](#the-audience-webscraping-sentiment-and-sexism-analyses) and ["characters"](#the-characters-films-and-scripts-analyses) factors detailed above, we further researched some important aspects that emerged during these first analyses.

Specifically, we found out that:
- The audience results
    - [FRA WRITE THE RESULTS BRIEFLY HERE]
- The characters results
    - Bechdel test: of the 70 movies for which data was available, 38 passed the Bechdel test. Interestingly, 5 movies failed rule 1, 20 movies failed rule 2 and 9 movies failed rule 3.
    - Character dialogue analysis: [AHSAN WRITE SOMETHING BRIEFLY HERE]
    - Gaze score: [WRITE SOMETHING BRIEFLY HERE]


To better understand these results, we implemented the following queries on the SPARQL Endpoint of [**Wikidata**](https://www.wikidata.org/wiki/Wikidata:Main_Page):
1. The "characters" queries:
    1. Bechdel test: *how many of the [selected] films have **male** directors?*
        - The results show that **all** the [selected] films have male directors, regardless of the result achieved in the Bechdel test
    2. Character dialogue: *what is the proportion between male and female writers in the [selected] films?*
        - Across the 66 films analysed, 143 of the 154 writers are male and 11 are female. This result clearly indicates the absolute majority of male writers in the industry and could be a reason for the characteristics showcased in the ["Character Description"](#character-description) and ["Character Dialogue"](#character-dialogue) sections
2. Gaze score queries:
    1. *To which genres do the top 10 films in the gaze score ranking belong?*
        - The top 10 films in the Gaze Score ranking span a variety of genres, but the most represented one is superhero films, with a total of 4 films. This could be due to the traditionally male-coded nature of this genre, whereas women are associated with different genres
    2. *Is there any correlation between rank in the gaze score ranking, box-office and production costs?*
        - From this query we could see how...
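
As an illustration of the first kind of query listed above, the sketch below counts films by the gender of their directors through the Wikidata endpoint. It is a hedged example rather than the query we ran: the function names and the `User-Agent` string are hypothetical, and identifying films by their IMDb ID (`P345`) is an assumption. `P57` (director) and `P21` (sex or gender) are standard Wikidata properties.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def director_gender_query(imdb_ids):
    """Build a SPARQL query counting films per director gender."""
    values = " ".join(f'"{imdb_id}"' for imdb_id in imdb_ids)
    return f"""
SELECT ?gender ?genderLabel (COUNT(DISTINCT ?film) AS ?films) WHERE {{
  VALUES ?imdb {{ {values} }}
  ?film wdt:P345 ?imdb ;
        wdt:P57 ?director .
  ?director wdt:P21 ?gender .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
GROUP BY ?gender ?genderLabel
"""


def run_query(query):
    """Send the query to the endpoint and return the result bindings."""
    url = ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "gaze-docs-example/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Replacing `wdt:P57` with `wdt:P58` (screenwriter) gives the corresponding query for writers.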
    


### Disclaimer
A more detailed documentation on the data preparation and analysis phase can be found in `documentation > DATA.ipynb`.

## Data visualizations selected and reasons
The prepared and processed data was visualized to gather some insight on the conclusions drawn. Different visualizations were employed for each of the three aforementioned factors (audience, characters, camera).

### The audience visualizations
> 1. Layered bar chart (generic chart)
    > x: films
    > y: values for sexism and negativity
> 2. Donut chart for sexism (over total number of reviews)
> 3. Donut chart for negativity (over total number of reviews)


### The characters visualizations

#### Bechdel Test
> 1. Passed and not passed: bar chart --> highlights difference
> 2. Stacked or donut for not passed, showing 3 layers with dynamic list of the movies
 

#### Character Description
> 1. The first visualisation type, **word clouds**, although simple, are an effective way to provide an **overall picture** of the adjectives most often associated with male and female characters respectively, as found through the analysis of our scripts. We chose this particular visualisation in order to **immediately highlight the results** we uncovered through our analysis, and to **introduce the user** to the specific analysis section which presents the description of characters in the films. Following the stylistic choices of the previous visualization of character dialogues, the male and female characters' descriptions are presented in red and green shades respectively.
> 2. The second visualisation presents the number of occurrences of body descriptions for female characters and, where found, of sexist and problematic descriptions, for every film. To present each movie (a categorical variable) and the variation of these occurrences, we opted for a stacked bar chart, which lets the viewer browse the total occurrences for each movie in both categories and investigate differences. Lastly, the graph is sorted by the films' release year, which also provides a sense of how the values **have evolved over time**.


#### Character Dialogue
> 1. Vertical bar chart showing percentages between men and women

#### The "Gaze Score"
> 1. bar chart


### The camera visualizations
> - **Bechdel test**: no visualization
> - **Character dialogue**: pictogram
    >> Different icons for different genres
    >> Different colours for different dialogue percentage
> - **Gaze score**:
    >> - <span style="color:red;">Column charts with icons (x:genre, y:number of movies + gaze score??????)</span>
    >> - Clustered bar charts or bubblechart (x:box office and cost, y: gaze score)
    >> - Line chart (x: decades, y: gaze score)




## Data communication strategies

## Summary of results