# Through the Gaze - Project documentation
<a href="https://ahsanv101.github.io/ProjectGaze/">**Through the Gaze**</a> is a project developed by Ahsan Syed, Chloe Papadopoulou, Francesca Budel, and Orsola Maria Borrini for the final exam of the course <a href="https://www.unibo.it/it/didattica/insegnamenti/insegnamento/2022/467047">"Information Visualization"</a> held by professor Marilena Daquino within the <a href="https://corsi.unibo.it/2cycle/DigitalHumanitiesKnowledge">Digital Humanities and Digital Knowledge Master Degree</a> (Alma Mater Studiorum - University of Bologna), during the A.Y. 2022/2023.


## Background

The project analyses the concept of **"male gaze" in cinema** as described by feminist film theorist Laura Mulvey in her essay "Visual Pleasure and Narrative Cinema".
As noted by Jonathan Schroeder, the act of the gaze implies "a **psychological relationship of power**, in which the gazer is superior to the object of the gaze" (Schroeder, 1998). While this concept has been developed in a variety of different disciplines, the one we want to research with our project is feminist theory and, specifically, the "male gaze".

The term was introduced by John Berger, an English art critic, in 1972. Berger was analysing the different treatments of female and male nudes in European paintings and noticed the tendency to place men in the role of the "watcher" and women in that of the "watched". A few years later, Laura Mulvey adopted the same critique with regard to media representations of female characters in cinema, noticing how **masculine characters had more active roles, whereas feminine ones were passive**. Furthermore, Mulvey highlighted the all-encompassing presence of such gaze not only in the male characters' behaviour in the movies, but also in how the camera work depicts the scenes and how the audience consumes them.

The male gaze in narrative cinema has, therefore, **three different perspectives**:
1. that of the **man behind the camera**
2. that of the **male characters** within the film’s cinematic representations
3. that of the **spectator** gazing at the image

Overall, the "male gaze" can be summarised as the **depiction of women from a masculine, heterosexual perspective** that represents them as mere sexual objects for the pleasure of the heterosexual male viewer.

## Goals
Our objective is to present a significant and coherent **overview of the presence of the male gaze and its impact on the Western cinematic industry** by conducting a complete analysis through the three essential perspectives highlighted above.

The influence of post-war U.S. popular culture over the Western world cannot be disputed, and one of the domains in which this influence has been most extensive is certainly mainstream narrative cinema. Films can have a powerful effect on society (just as they are themselves influenced by our society and culture), and they can be instrumental in shaping one's personal views on a variety of topics.
Given these premises and considering that being constantly subjected to the male gaze in a patriarchal society has been proved to be deleterious for the mental health of women, we have decided to focus our research on the **ten highest-grossing U.S. films for each decade from the 1940s to the 2010s**, analysing their **scripts**, **reviews** on the web, and **metadata**.


## Research questions
Given the above, we are looking to uncover how much the film industry has perpetuated traditional gender presentation and its unfairness: *to what extent do popular films rely on the male gaze?*


## Data preparation and data analysis
Data preparation and analysis were conducted following Mulvey's definitions of the three perspectives associated with cinema:
1. The characters, looking at each other within the screen illusion
2. The audience, watching the final product
3. The camera, recording the pro-filmic event

Respectively, we have analysed the selected 80 films through: 
1. **Script analyses**, specifically examining the representation of women (Bechdel test), the descriptions of women (revealing the differences in the word choices used to portray them), and the division of dialogue between male and female characters
    - Finally, a **Gaze Score** (GS) ranging from 0 to 100 was assigned to each film; this score takes into account the above factors (weighted differently)
2. **Online reviews analyses**, focusing not only on the overall reception of the film, but mostly on the individuals' perception of it (sentiment analysis) and on possible gender bias underlying their opinions (sexism analysis)
3. **SPARQL metadata retrieval** using the most interesting results from the previous two analyses to deepen the research on the team working on the film production (e.g., gender of the directors, proportion between male and female writers, genres, box-office and production costs variables)


### The audience: webscraping, sentiment and sexism analyses
Focusing on the audience component of the male gaze meant looking through some of the **reviews** available for the movies in our dataset, considering not only the overall reception of each movie but, above all, the individual reviewers' perception of it and any gender bias underlying their opinions.


Reviews are **not accompanied by the name of the user who wrote them**, since that information was not useful for our analysis. What is important to keep in mind is that our review dataset comprises 1972 reviews of our chosen movies, and that they are all **public and available on IMDb's review pages**. Moreover, it is essential to underline that our analysis is partial and neutral, and aims to offer useful reflections rather than harsh critiques. 

#### Reviews webscraping
The first step of our audience analysis consisted of scraping the review pages whose URLs are provided in the movie.csv files. To do so, we used the [**BeautifulSoup library**](https://www.crummy.com/software/BeautifulSoup/) and inspected the HTML structure of a standard IMDb review page: the textual content of each review is stored inside a `div` block marked as "text", and this is where we access all of our data. 
<br>
The task, mostly automated, only required dividing the URLs into chunks to speed up the overall scraping process (since we were working with huge amounts of data!). 


We then stored our reviews in a dictionary, converted it into a dataframe, and finally saved it as a **`.csv` file** containing a single column, `Reviews`, alongside an index. 
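
The steps above can be sketched as follows. This is a minimal, illustrative sketch and not the project's actual code: it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup, it assumes that "text" is a CSS class on the review `div`, and the helper names (`ReviewTextParser`, `extract_reviews`, `chunked`, `reviews_to_csv`) are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class ReviewTextParser(HTMLParser):
    """Collect the text of every <div> whose class list contains `text`."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._depth = 0      # nesting depth inside a matching div (0 = outside)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # nested div inside a review block
        elif "text" in (dict(attrs).get("class") or "").split():
            self._depth = 1   # assumption: reviews live in <div class="text ...">
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Normalize whitespace before storing the review text
                self.reviews.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_reviews(html):
    """Return the review texts found in one page of HTML."""
    parser = ReviewTextParser()
    parser.feed(html)
    parser.close()
    return parser.reviews


def chunked(urls, size):
    """Split the list of URLs into chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def reviews_to_csv(reviews):
    """Serialize reviews as CSV text with an index column and a `Reviews` column."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["", "Reviews"])
    for index, review in enumerate(reviews):
        writer.writerow([index, review])
    return out.getvalue()
```

In a real run, each chunk of URLs would be downloaded and every page passed to `extract_reviews` before the combined list is written out with `reviews_to_csv`.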


#### Sentiment Analysis
Once our reviews were available, it was time to start the actual analysis. This second step focused on **retrieving the sentiment of our reviews**: *are they positive or negative?*
<br>
This aspect was later used to understand whether there was any strong correlation between the possibly sexist tone of a review and its overall sentiment: for example, *how does a poor opinion of women affect the overall perception of a movie?* *Are negative reviews the most sexist?*


To carry out the sentiment analysis, we used the [**`NLTK` library**](https://www.nltk.org/) and its **`VADER`** module, a rule-based sentiment analyzer in which terms are labeled according to their semantic orientation as either positive or negative. 
The result of this analysis was a **new dataframe** containing our `Reviews` column, a new `Scores` column (containing non-weighted sentiment analysis scores, divided into negative, neutral and positive values), a `Compound` column (a weighted, normalized value, which VADER reports between -1 and 1) and a `Sentiment` column that provides a clear label distinguishing positive reviews (`pos`) from negative ones (`neg`). 
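
As an illustration of how one row of such a dataframe could be assembled, here is a minimal sketch. It assumes the scores arrive as a dictionary with the keys `neg`, `neu`, `pos` and `compound`, which is the shape returned by NLTK's `SentimentIntensityAnalyzer().polarity_scores(text)`. The function name `label_review` and the threshold of 0.0 used to separate `pos` from `neg` are assumptions made for this example, not necessarily the values used in the project.

```python
def label_review(review, scores, threshold=0.0):
    """Build one row (as a dict) from a review and its VADER-style scores.

    `scores` is assumed to look like VADER's output:
    {"neg": ..., "neu": ..., "pos": ..., "compound": ...}.
    """
    compound = scores["compound"]
    return {
        "Reviews": review,
        "Scores": {key: scores[key] for key in ("neg", "neu", "pos")},
        "Compound": compound,
        # Assumption: a compound score at or above the threshold counts as positive.
        "Sentiment": "pos" if compound >= threshold else "neg",
    }
```

A list of such rows can be passed directly to a dataframe constructor to obtain the four columns described above.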


#### Sexism Analysis
Having established the overall sentiment of our reviews, the final step of our audience analysis consisted of **detecting possible traces of sexism in the reviews**.
<br>
To do this, we applied a model created and published by the group NLP-LTU on Hugging Face, the [**BERTweet-large-sexism-detector**](https://huggingface.co/NLP-LTU/bertweet-large-sexism-detector), a classification model for detecting sexism in Tweets or short text paragraphs. As some of our reviews were longer than the model's length limit, a few adjustments were implemented.
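
One possible way to handle reviews that exceed a model's length limit is to split them into smaller pieces and classify each piece separately. The sketch below only illustrates that idea and is not necessarily the adjustment we implemented: the word-based limit is a crude proxy for the model's token limit, `classify_chunk` stands for any function that returns a label for a short text, and the label name `"sexist"` is an assumption.

```python
def split_into_chunks(text, max_words=100):
    """Split a text into consecutive pieces of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def classify_review(text, classify_chunk, max_words=100, sexist_label="sexist"):
    """Classify each chunk and flag the review if any chunk is labeled sexist.

    `classify_chunk` is a caller-supplied function (hypothetical here) that
    maps a short text to a label string.
    """
    chunks = split_into_chunks(text, max_words)
    labels = [classify_chunk(chunk) for chunk in chunks]
    flagged = sum(label == sexist_label for label in labels)
    return {"chunks": len(chunks), "sexist_chunks": flagged, "is_sexist": flagged > 0}
```

Keeping the per-chunk counts makes it possible to see how small a part of a long review the flagged passages represent.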


In the end, we obtained a clear result: our reviews were not sexist or, at least, they were *not completely* sexist.
<br>
BERT categorized them as lacking any kind of gender bias but, having inspected the reviews ourselves, we knew this was not true: a few reviews showed clear signs of misogyny and sexism, not just by using offensive words such as "bitch" or "tramp" when referring to actresses or their characters, but also by constantly describing them as sexy and beautiful or by comparing them to animals. 
BERT simply failed to recognize them because, considered quantitatively, those sentences weighed very little in the general structure of the review, which otherwise had a very neutral or even positive tone. 
What emerged from this analysis is that **the audience's gaze is rarely guided by pure prejudice or malevolence**: realistically, our reviews displayed sexism in a "natural" and subtle way, so subtle that even the sexism-detector model failed to acknowledge it when analysing the bigger picture. 

### The characters: films and scripts analyses
The aim of this analysis is to measure the dominance of the male gaze within the films and their scripts. This is one of our most important analyses, as it dives directly into the core content of the cinema industry: the scripts, which are the basis of any film. We chose scripts because they address **the whole setting of the characters** as well as **how they are defined on camera** (for the viewers) and **how the male characters in the script perceive the non-male ones**. They also show what kind of dialogue or actions are assigned to male versus non-male characters, giving us a good basis for comparative analysis. 


#### Bechdel Test
The first step of this analysis is the well-known [Bechdel Test](https://bechdeltest.com/), used for measuring **how women are represented in a given film**. There are three rules that a film needs to satisfy:

1. The movie has to have at least two women in it
2. The movie has to have at least two women who talk to each other
3. The movie has to have at least two women who talk to each other and it is about something other than a man

If a movie satisfies all three rules, it passes the Bechdel test. This sets a bare-minimum bar that, ideally, every movie should clear. We collected this data from existing datasets and checked the results for the movies in our scope. 
 

#### Character Description
In this step we dive into the **actual descriptions of characters in the scripts**. The idea behind using character descriptions is to understand how the camera is meant to show certain features of the characters through the use of angles: in this way the camera becomes the gaze and the (non-male) character becomes the object of the gaze.

Our aim is to automatically extract such descriptions from the scripts using Natural Language Processing and to show the words most often used to describe characters (both male and non-male), revealing the differences in the way they are portrayed. We also aim to **categorize female descriptions** as either *highly sexist* or *dubious but problematic*.
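
The word-counting part of this step can be sketched as follows. The example assumes that the descriptions have already been extracted from the scripts and grouped by gender (the extraction itself requires an NLP pipeline and is not shown); the function name `top_description_words` and the stopword list are hypothetical.

```python
import re
from collections import Counter


def top_description_words(descriptions, stopwords=frozenset(), n=5):
    """Return the `n` most frequent words across a list of description strings."""
    counts = Counter()
    for text in descriptions:
        # Lowercase and keep alphabetic tokens (apostrophes allowed)
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)
```

Running the function once on the male descriptions and once on the non-male ones gives two frequency lists that can be compared side by side.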


#### Character Dialogue
In this step we automatically extract, again using NLP tasks, all the dialogue spoken by male and non-male characters in each script. The aim here is to show the **division and representation of words** between male and non-male characters. 
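
Once the dialogue lines have been attributed to speakers, the division can be computed with a few lines of code. The sketch below assumes that the share is measured in words (rather than in number of lines) and that a speaker-to-gender mapping is already available; both are assumptions made for illustration, and `dialogue_shares` is a hypothetical helper.

```python
from collections import Counter


def dialogue_shares(lines, gender_of):
    """Compute the percentage of spoken words per gender group.

    `lines` is an iterable of (speaker, text) pairs;
    `gender_of` maps a speaker name to a group such as "male" or "non-male".
    """
    words = Counter()
    for speaker, text in lines:
        group = gender_of.get(speaker)
        if group is None:
            continue  # skip speakers whose gender could not be determined
        words[group] += len(text.split())
    total = sum(words.values())
    if total == 0:
        return {}
    return {group: round(100 * count / total, 1) for group, count in words.items()}
```

Speakers missing from the mapping are ignored, so the two percentages always add up to 100 for the lines that could be attributed.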


#### The "Gaze Score"
The final "gaze score" was measured taking into account all the factors related to the characters aspects detailed above.


The division of the score is as follows:
1. **Bechdel Test** (max. 40%), score assigned based on the following criteria
    1. If a movie passes **no rule**: 40%
    2. If a movie passes **only the first rule**: 26.66%
    3. If a movie passes **only the first and second rules**: 13.33%
    4. If a movie passes **all rules**: 0%
2. **Character description** (max. 35%), score assigned based on the following criteria
    1. If a female character's body is described **more than the observed average**: the percentage is assigned according to the number of occurrences, with a maximum value of 30%
    2. If a female character is described in a **dubious, problematic or sexist** manner: the score's calculation is more sensitive to these occurrences and the percentage is assigned accordingly, with a maximum value of 35%
    3. If a female character is not particularly described in any of the above manners: 0%
3. **Character dialogues** (max. 25%), score assigned based on the following criteria:
    1. If male characters have 50% or less of the overall dialogue in the script: 0%
    2. If male characters have 70% or more of the overall dialogue in the script: 25%
    3. If male characters have between 51% and 69% of the overall dialogue in the script: the percentage is assigned according to where the share falls within that range, with values between 0.1% and 24.9%
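
The way the three components add up can be sketched as follows. This is an illustrative sketch rather than the exact formula used in the project: the Bechdel values are taken from the list above, the description component is assumed to be computed elsewhere and is only capped at 35 here, and the linear interpolation between 50% and 70% of male dialogue is an assumption. The names `BECHDEL_POINTS`, `dialogue_points` and `gaze_score` are hypothetical.

```python
# Points assigned by number of Bechdel rules passed (values from the list above)
BECHDEL_POINTS = {0: 40.0, 1: 26.66, 2: 13.33, 3: 0.0}


def dialogue_points(male_share):
    """Map the male share of dialogue (0-100) to 0-25 points."""
    if male_share <= 50:
        return 0.0
    if male_share >= 70:
        return 25.0
    # Assumption: linear interpolation between the two thresholds
    return round(25.0 * (male_share - 50) / 20, 2)


def gaze_score(rules_passed, description_points, male_share):
    """Combine the three components into a score between 0 and 100."""
    description = min(max(description_points, 0.0), 35.0)  # cap at the 35% maximum
    total = BECHDEL_POINTS[rules_passed] + description + dialogue_points(male_share)
    return round(total, 2)
```

For example, under these assumptions a film that passes only the first Bechdel rule, receives 10 description points and has 60% male dialogue would obtain 26.66 + 10 + 12.5 = 49.16.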

### The camera: SPARQL metadata retrieval
After gathering some results from the ["audience"](#the-audience-webscraping-sentiment-and-sexism-analyses) and ["characters"](#the-characters-films-and-scripts-analyses) factors detailed above, we further investigated some important aspects that emerged during these first analyses.

Specifically, we found out that:
- The audience results
    - Sentiment analysis: 10 out of 80 audiences expressed a very negative opinion of the movie they watched
    - Sexism detection: 17 movies had a sexist audience, but instances of such behaviour were rare and sporadic if considered over the total number of reviews for each movie
    - Overall, we found no direct link between an audience's sexism and the reviews' tone
- The characters results
    - Bechdel test: of the 70 movies for which data was available, 38 passed the Bechdel test. Interestingly, 5 movies failed rule 1, 20 movies failed rule 2 and 9 movies failed rule 3
    - Character dialogue analysis: out of all the scripts that we were able to retrieve:
        - 94% of the scripts were male dominated (more than 50% dialogues)
        - 6% of the scripts had a majority of non male dialogues
    - Gaze score: we were successfully able to assign an arbitrary value between 0-100 to each of the films under our research. This score will help us understand gaze and will help us compare it to other variables

To better understand these results, we implemented the following queries on the SPARQL Endpoint of [**Wikidata**](https://www.wikidata.org/wiki/Wikidata:Main_Page):
1. The "characters" queries:
    1. Bechdel test: *how many of the [selected] films have **male** directors?*
        - The result showed how **all** the [selected] films have male directors, no matter the result achieved in the Bechdel test
    2. Character dialogue: *what is the proportion between male and female writers in the [selected] films?*
        - Of the 66 films analysed, with a total of 154 writers, 143 writers are male and 11 female &rarr; this result clearly indicates the absolute majority of male writers in the industry and could be a reason for the dialogue characteristics showcased in the ["Character Description"](#character-description) and ["Character Dialogue"](#character-dialogue) sections
2. Gaze score queries:
    1. *Which genres do the top 10 films in the gaze score ranking belong to?*
        - The top 10 films in the Gaze Score ranking cover a variety of genres, but the most represented one is superhero films, with a total of 4 films &rarr; this could be due to the traditionally male-coded nature of this genre, whereas women are associated with different genres
    2. *Is there any correlation between rank in the gaze score ranking, box-office and production costs?*
        - The result of this query (and specifically its visualization in the form of a bubblechart), did not showcase any particular correlation between the rank of the movie in the GS ranking, its box-office and production costs; however, we must also underline how all films of the ranking have similar GS scores, making it very hard to notice any possible pattern &rarr; this by itself is already an interesting result:  no matter the box-office or production costs amounts, the highest grossing movies of the last 80 years all have similar (and high) GS scores
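
As an example of what such a query can look like, the sketch below builds a SPARQL query for the directors' gender and returns it as a string. It is an illustration, not the query actually used in the project: the function name `director_gender_query` is hypothetical, and the film identifiers passed to it are placeholders to be replaced with the Wikidata QIDs of the selected films. The Wikidata properties used are `P57` (director) and `P21` (sex or gender).

```python
def director_gender_query(film_qids):
    """Build a SPARQL query listing the directors of the given films and their gender."""
    values = " ".join(f"wd:{qid}" for qid in film_qids)
    return f"""
SELECT ?film ?filmLabel ?director ?directorLabel ?genderLabel WHERE {{
  VALUES ?film {{ {values} }}
  ?film wdt:P57 ?director .                   # P57: director
  OPTIONAL {{ ?director wdt:P21 ?gender . }}  # P21: sex or gender
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()
```

The resulting string can be pasted into the Wikidata Query Service or sent to its SPARQL endpoint. The writers query would follow the same pattern with `P58` (screenwriter) in place of `P57`.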
    


### Disclaimer
A more detailed documentation on the data preparation and analysis phase can be found in `documentation > DATA.ipynb`.

## Data visualizations selected and reasons
The prepared and processed data was visualized to gather some insight into the conclusions drawn. Different visualizations were employed for each of the three aforementioned factors (audience, characters, camera).

### The audience visualizations

The first visualization is a **layered bar chart**: on the x axis there are the films with a sexist audience, while on the y axis there are the sexist instances over the audience's tone.
By comparing the number of sexist reviews with each audience's tone (whose numerical value has been rounded down and multiplied by 100 to obtain an integer), we can distinguish a series of instances where usually the audience's sentiment greatly surpasses the examples of sexist language.
Values do not overlap, and because of this we can assume that a review's tone is rarely related or caused by sexism or lack thereof.


The second visualization is an **exploding pie chart** of the sexism detected over the total number of reviews. The two pies contain all the sexist audiences detected in our data. The main pie contains all the movies with a sexist audience, and each slice can be inspected through the second pie, which depicts the number of sexist reviews for a single audience over its total number of reviews.
The visualization makes clear that sexist instances are rare, and represent a small part of an audience's overall behavior.


The third visualization is a **bar chart** for audiences' general tone. With values from 0.0 to 0.9, this chart represents the overall sentiment of an audience in its reviews: a few movies were extremely criticized by their viewers.
It's important to underline that the lowest bar in the chart (the one depicting the tone for the movie The Exorcist) probably represents a misinterpretation made by the algorithm: since this instance corresponds to a horror movie, we have reason to believe that terms like fear, anxiety or terror influenced the final calculations.


### The characters visualizations

#### Bechdel Test
> 1. Passed and not passed: **bar chart**, to highlight the difference
> 2. Not passed: **stacked or donut chart** showing 3 layers, with a dynamic list of the movies
 

#### Character Description
The first visualisation type, **word clouds**, although simple, are an effective way to provide an **overall picture** of the adjectives most often associated with male and female characters respectively, as found through the analysis of our scripts.
We chose this particular visualisation in order to **immediately highlight the results** we uncovered through our analysis, and to **introduce the user** to the specific analysis section which presents the description of characters in the films. Following the stylistic choices of the previous visualization of character dialogues, the male and female characters' descriptions are presented in red and green shades respectively.


The second visualisation presents, for every film, the number of occurrences of body descriptions for female characters and, if found, of sexist and problematic descriptions. To present each movie (a categorical variable) and the variation of these occurrences, we opted for a **stacked bar chart**, which lets the viewer browse the total occurrences for each movie in both categories and investigate the differences. Lastly, the graph is sorted by the films' release year, which also provides a sense of how the values **have evolved over time**.


#### Character Dialogue

The purpose of using the **Vertical bar chart** was to show the end user, at first glance, the stark differences between the dialogue distributions of male and non-male characters in each of the scripts. The chart used is a sub-type of bar chart. On the **y-axis** we have the names of all the movies whose dialogues were successfully extracted. On the **x-axis** we have two variables: on the right-hand side, the percentage of the script's total dialogue spoken by non-male characters; on the left-hand side, the percentage spoken by male characters.

#### The "Gaze Score" (GS)

Calculating a gaze score for each script was beneficial for the overall project, as it provides a mathematical model to quantify the results and use them for comparative analysis. We used a **bar chart** simply to show the end user the final scores assigned to each script after processing its data. On the **y-axis** we have the names of the movies, whereas on the **x-axis** we have the scores of their scripts. We also chose to **order** the visualization by gaze score, with the films with the highest gaze score at the top and those with the lowest at the bottom. This allows users to navigate through the scripts and easily tell them apart based on their gaze score.


### The camera visualizations

1. *How many of the [selected] films have male directors?*

As this query expressed a very simple and straightforward result (all the [selected] films have male directors), **no visualization** was deemed necessary to convey it.
A **disclaimer** was added in the appropriate section of the website to warn users about the results of the Bechdel test (no matter the result achieved, all the films have male directors!). This result shows how the (Western) cinematic industry is still dominated by men, and it prevented us from further analysing the presence of the male gaze even among female directors.


2. *What is the proportion between male and female writers in the [selected] films?*

Of the 66 analysed films and the 154 writers, 143 are male and 11 female. The most appropriate and direct chart to showcase such an imbalance in the industry was deemed to be the **pictogram** with different icons and colours representing the gender.


3. *Which genres do the top ten films in the GS ranking belong to?*

To showcase the different genres present in the top 10 films of the GS ranking, we first selected the "main" genre of each movie (some of them fall under several genres, such as "Mission Impossible 2", which is considered an action film, a spy film, and a thriller film &rarr; in this case, the main genre was considered to be action). We then presented the results in a **pie chart** in which each colour represents a different genre and the segments' areas represent the number of films belonging to each genre.


4. *Is there any correlation between the GS score, box-office, and production costs?*

Since in this case we had to analyse three different numerical variables (GS score, box-office and production costs) and look for a correlation between them, the most appropriate chart to use was the **bubble chart**. The x axis shows the production costs, the y axis shows the box-office, and the area of the bubbles represents the GS score.


## Data communication strategies
We present our work as a website themed around the "gaze" concept at the very basis of our research (the website can be found [here](https://ahsanv101.github.io/ProjectGaze/)).

After three preliminary sections ("About", "Project", "Data") in which we briefly talk about the background of the domain and the problem, the aims of our project, and the source data and its manipulation, the actual **storytelling** section of our website starts with a sort of **"metascript"**. This represents a conversation between two of the main figures of the cinematic industry, both de-personalised and represented only by the name of their profession: a DIRECTOR and a SCREENWRITER. The two figures talk with each other about their jobs and the issues in the field, touching on the different topics we have analysed (in this way, the charts themselves are incorporated in what could be a normal, real-life conversation).

Finally, after dealing with all the issues of our research (representation of women through the Bechdel test and the dialogue/description analyses, audience reception of the films, and the characteristics of the films' crews), the two characters talk about another emerging topic, that of the **female gaze**, and discuss further interest and research in it.


## Summary of results

## Bibliography
- Mulvey, Laura (1989). "Visual Pleasure and Narrative Cinema". In: Visual and Other Pleasures. Language, Discourse, Society. Palgrave Macmillan, London
- Schroeder, Jonathan (1998). "Consuming Representation: A Visual Approach to Consumer Research". Representing Consumers: Voices, Views and Visions. New York: Routledge
- Snow, Edward (1989). "Theorizing the Male Gaze: Some Problems". Representations. 25 (25)
- Cavitt, Sam (2022). "How All of Us Are Changed by the Movies That Entertain Us". The Cinema Connoisseur, Vol. 1 Issue 3. https://cinema-connoisseur.com/how-all-of-us-are-changed-by-the-movies-that-entertain-us/ (last visited on April 13th, 2023)