# Through the Gaze - Data documentation
This Jupyter Notebook analyses the data preparation and processing phase for ["NameProject"](https://ahsanv101.github.io/ProjectGaze/).

For this project, we are interested in studying the concept of the **"male gaze"** in cinema, inspired by the essay "Visual Pleasure and Narrative Cinema" by the feminist film theorist Laura Mulvey. Mulvey underlines how the "male gaze" is made of three main components:
1. The audience
2. The characters
3. The camera (i.e. the director)

To give a coherent and meaningful overview of the male gaze's impact on the Western film industry, we will identify the **10 highest-grossing U.S. films for each decade from the 1940s to the 2010s**. We opted for the highest-grossing movies because box-office takings are a proxy for popularity (the highest-grossing films are the ones the most people saw) and because such films help establish a kind of cultural normativity.
Focusing on the highest-grossing movies per decade therefore helps us generalize our results to what was popular at the time.


### Disclaimer 
This Jupyter Notebook is purely informational: it is not meant to be used for the data preparation and processing, only to analyse and explain those processes.
<br>The Python files used for the clean-up can be found in the `code` folder of the [GitHub repository](https://github.com/ahsanv101/ProjectGaze).

## The audience: webscraping, sentiment and sexism
Focusing on the audience component of the male gaze meant examining some of the **reviews** written for the movies in our dataset, looking not only at the overall reception of each movie but, above all, at the individual reviewers' perception of it and at any gender bias underlying their opinions.


Reviews are **not accompanied by the user that provided them**, since that information was not useful for our analysis. What is important to keep in mind is that our reviews dataset comprises 1972 reviews of our chosen movies, and that they are all **public and available on IMDb's review pages**. Moreover, it is essential to underline that our analysis is partial and neutral: it hopes to offer useful reflections rather than harsh critiques.

### Reviews webscraping
The first step of our audience analysis consisted of scraping the review pages listed, as URLs, in the `movies.csv` file. To do so, we used the [**BeautifulSoup library**](https://www.crummy.com/software/BeautifulSoup/) and inspected the HTML structure of a standard IMDb review page: the textual content of each review is stored inside a `div` block marked by the class `text`, which is where we access all of our data.
<br>
The task, mostly automated, only required dividing the URLs into chunks to speed up the overall scraping process (since we were working with huge amounts of data!).


We then stored our reviews in a dictionary, converted it into a dataframe, and finally saved it as a **`.csv` file** containing a single column, `Reviews`, alongside an index.


In [1]:

#We used the following libraries!
import pandas as pd 
import requests
from bs4 import BeautifulSoup
import pprint
import re

#Here we initialize and modify our CSVs accordingly and we create a list for the webscraped reviews 
movies = pd.read_csv('movies.csv')
title_reviews = movies[['Title','Reviews']].copy()

text_reviews = []

#The webscraping starts here
batch_size = 79
urls = ['https://www.imdb.com/title/tt0038969/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0041838/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0031381/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0037536/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0034167/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0036872/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0039391/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0035575/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0034583/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0040806/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0049833/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0045793/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0044672/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0044672/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0047673/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0043949/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0051459/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0053291/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0048593/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0042192/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0059742/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0061722/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0064115/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0058331/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0056937/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0062622/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0055614/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0054215/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0056172/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0060164/?ref_=nv_sr_srsg_3', 'https://www.imdb.com/title/tt0073195/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0076759/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0070047/reviews?ref_=tt_urv', 
'https://www.imdb.com/title/tt0077631/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0071230/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0075148/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0066011/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0078346/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0067093/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0080684/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0083866/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0096895/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0086190/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0087332/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0088763/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0092099/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0092644/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0096438/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0081573/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0120338/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0120915/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0107290/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0116629/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0109830/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0119654/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0099653/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0103064/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0103776/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0112462/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0468569/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0383574/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0145487/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0417741/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0121766/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0316654/reviews?ref_=tt_urv', 
'https://www.imdb.com/title/tt0418279/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0325980/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0241527/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0120755/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt4154796/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt1825683/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt2488496/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0848228/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt2527336/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0499549/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0770828/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt3748528/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt1201607/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt1877832/reviews?ref_=tt_urv']
url_chunks = [urls[x:x+batch_size] for x in range(0, len(urls), batch_size)]

def scrape_url(url):
    #Returns the text of every review found on a single reviews page
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    return [div.get_text() for div in soup.find_all('div', class_='text')]

def scrape_batch(url_chunk):
    #Collects the reviews of all the URLs in a chunk
    chunk_reviews = []
    for url in url_chunk:
        chunk_reviews.extend(scrape_url(url))
    return chunk_reviews

for url_chunk in url_chunks:
    text_reviews.extend(scrape_batch(url_chunk))
    
#From the list, we store our results into a dictionary, to later convert into a new dataframe and CSV. 
reviews_dict = {'Reviews': text_reviews}
text_reviews = pd.DataFrame.from_dict(reviews_dict)
text_reviews.to_csv("text_reviews.csv")


### Sentiment Analysis
Now that our reviews were available, it was time to actually start working on our analysis: this second step focused mostly on **retrieving the sentiment of our reviews**: *are they positive or negative?*
<br>
This aspect was later used to understand whether there was any strong correlation between the possibly sexist tone of a review and its overall sentiment: for example, *how does a poor opinion of women affect the overall perception of a movie?* *Are negative reviews the most sexist?*


To perform the sentiment analysis, we used the [**library `NLTK`**](https://www.nltk.org/) and its **`VADER`** module, a lexicon- and rule-based sentiment analyzer whose lexicon rates terms by their semantic orientation, positive or negative.
The result of this analysis was a **new dataframe** containing our `Reviews` column, a new `scores` column (containing the sentiment analysis scores, divided into negative, neutral and positive values, plus the compound value), a `compound` column (normalized, weighted values between -1 and 1) and a `sentiment` column, which provides a clear label distinguishing positive reviews (`pos`) from negative ones (`neg`).

In [None]:
import nltk
nltk.download('vader_lexicon')
import numpy as np
import pandas as pd

from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer()

df = pd.read_csv('text_reviews.csv')

#Here starts the sentiment analysis: we first drop missing and whitespace-only reviews
df.dropna(inplace=True)
empty_objects = []
for index, review in df['Reviews'].items():
    if isinstance(review, str) and review.isspace():
        empty_objects.append(index)
df.drop(empty_objects, inplace=True)

#We calculate overall scores, compound value and the sentiment label. 
df['scores'] = df['Reviews'].apply(lambda Reviews: vader.polarity_scores(Reviews))
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df['sentiment'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

#... And then we obtain the CSV
df.to_csv('sentiment_reviews.csv')

### Sexism Analysis
Having established the overall sentiment of our reviews, the final step of our audience analysis consisted of **detecting possible traces of sexism in the reviews**.
<br>
To do this, we applied a model created and published by the group NLP-LTU on Hugging Face, the [**BERTweet Large Sexism Detector**](https://huggingface.co/NLP-LTU/bertweet-large-sexism-detector), a classification model for detecting sexism in Tweets or short text paragraphs. As some of our reviews were longer than the model's length limit, a few adjustments were needed.


In the end, we obtained a clear result: our reviews were not sexist or, at least, they were *not completely* sexist.
<br>
The model categorized them as lacking any kind of gender bias but, having inspected the reviews ourselves, we knew this was not true: a few reviews showed clear signs of misogyny and sexism, not just by using offensive words such as "bitch" or "tramp" when referring to actresses or their characters, but by constantly describing them as sexy and beautiful or by comparing them to animals.
The model simply failed to recognize them because, in quantitative terms, those sentences weighed very little in the general structure of the review, which otherwise had a very neutral or even positive tone.
What emerged from this analysis is that **the audience's gaze is rarely guided by pure prejudice or malevolence**: realistically, our reviews displayed sexism in a "natural" and subtle way, so subtle that even the sexism-detector model failed to acknowledge it when analysing the bigger picture.

However, we were not satisfied with this result: we wanted to isolate these instances of sexism and, to do so, we needed to narrow the detector's scope of analysis. Therefore, we introduced a simpler function that divides each review into individual sentences: by doing this, we could obtain a sexism score for each sentence and give it more significance.
If a review contained even a single sexist sentence, it was marked as sexist and sorted into the final CSV according to its final sexism score.
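
The review-level rule described above can be sketched as follows. This is only an illustration, not the code used in the project: the function names are hypothetical, the label name `sexist` is an assumption about the classifier's output, and taking the highest sentence score as the review's "final" score is one possible reading of the rule.

```python
def review_sexism_score(sentence_predictions, sexist_label="sexist"):
    """Return the highest 'sexist' score among a review's sentences (0.0 if none)."""
    scores = [p["score"] for p in sentence_predictions if p["label"] == sexist_label]
    return max(scores, default=0.0)


def rank_reviews(predictions_by_review):
    """Mark a review as sexist if any sentence is, then sort by descending score."""
    rows = []
    for review_id, predictions in predictions_by_review.items():
        score = review_sexism_score(predictions)
        rows.append({"review": review_id, "sexist": score > 0.0, "score": score})
    return sorted(rows, key=lambda row: row["score"], reverse=True)


# Made-up predictions, shaped like the output of a text-classification pipeline.
example = {
    "r1": [{"label": "not sexist", "score": 0.98}, {"label": "not sexist", "score": 0.91}],
    "r2": [{"label": "not sexist", "score": 0.95}, {"label": "sexist", "score": 0.87}],
}
ranked = rank_reviews(example)
```

With this rule a single flagged sentence is enough to surface a review, which is the effect we were after when the whole-review prediction diluted it.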

In [None]:
#For this code to work, the libraries Transformers and Torch are needed. 
import pandas as pd 
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer,pipeline
from transformers import BertForSequenceClassification, BertTokenizer
import torch

#We define the model, tokenizer and classifier we are going to use 
model = AutoModelForSequenceClassification.from_pretrained('NLP-LTU/bertweet-large-sexism-detector')
tokenizer = AutoTokenizer.from_pretrained('NLP-LTU/bertweet-large-sexism-detector') 
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

df = pd.read_csv('sentiment_reviews.csv')


#This portion of code generates a prediction for the OVERALL review. Reviews of up to 512 words are classified directly; longer ones are first split into 512-word chunks, and each chunk is classified separately.
import math

for item in df['Reviews']:
    words = item.split()
    n = max(1, math.ceil(len(words) / 512))
    for i in range(n):
        chunk = ' '.join(words[i*512:(i+1)*512])
        #truncation=True keeps each chunk within the model's token limit
        prediction = classifier(chunk, truncation=True)
        print(prediction, chunk)

          
#To work on the individual sentences, we used this instead.

reviews = []
sentences = []

for index, item in df.Reviews.items():
    #Split each review into sentences, skipping the empty strings left by split('.')
    sentence = [s.strip() for s in item.split('.') if s.strip()]
    prediction = classifier(sentence, truncation=True)
    sentences.append(sentence)
    reviews.append(prediction)
    print([sentence, prediction])

## The characters: film and scripts analysis
The aim of this analysis is to measure the dominance of the male gaze within the films and their scripts. This is one of our most important analyses, as it dives directly into the core content of the cinema industry: the scripts, the basis of any film. We chose scripts because they address **the whole setting of the characters** as well as **how they are defined on camera** (for the viewers) and **how the male characters in the script perceive the non-male ones**. They also show what kind of dialogue or actions are assigned to male versus non-male characters, giving us a good basis for a comparative analysis.


### Bechdel Test
The first step of this analysis is the well-known [Bechdel Test](https://bechdeltest.com/), used for measuring **how women are represented in a given film**. It consists of three rules that a film needs to satisfy:

1. The movie has to have at least two women in it
2. The movie has to have at least two women who talk to each other
3. The movie has to have at least two women who talk to each other and it is about something other than a man

If a movie satisfies all three rules, it passes the Bechdel test. The test sets a bare-minimum bar that ideally every movie should clear. We will collect this data from already existing datasets and check the results for the movies within our scope.

> **Graphs**
> 1. Passed and not passed: bar chart --> highlights difference
> 2. Stacked or donut for not passed, showing 3 layers with dynamic list of the movies
 

### Character Description
In this step we will be diving into the **actual descriptions of characters in the scripts**. The idea of using descriptions of the characters is to get an understanding of how the camera wants to show certain features of the characters through the use of angles: in this way the camera becomes the gaze and the (non-male) character becomes the object for the gaze.

Our aim is to automatically extract such descriptions from the scripts using Natural Language Processing and to show the words most often used to describe characters (both male and non-male), revealing the differences in the way they are portrayed. We also aim to **categorize female descriptions** as either *highly sexist* descriptions or *dubious but problematic* descriptions.
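
As a rough sketch of the word-frequency step behind the "him versus her" comparison, the snippet below counts the most common words in two lists of descriptions. It assumes the descriptions have already been extracted and grouped by gender; the sample sentences, the tiny stop-word list and the function name `top_words` are all made up for illustration.

```python
from collections import Counter
import re

# A deliberately tiny, illustrative stop-word list (an assumption, not the project's).
STOPWORDS = {"a", "an", "the", "and", "in", "with", "of", "her", "his", "is"}


def top_words(descriptions, n=3):
    """Return the n most frequent non-stop-words across a list of descriptions."""
    counts = Counter()
    for text in descriptions:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)


# Made-up character descriptions, grouped by gender.
her = ["A beautiful young woman in a red dress", "beautiful and mysterious"]
him = ["A tall, rugged man with a scar", "rugged and confident"]

her_top = top_words(her, 1)
him_top = top_words(him, 1)
```

Counts of this kind could feed the two word clouds directly, one per group.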

> **Graphs**
> 1. Overall picture: word cloud: him versus her
> 2. Division of descriptions in layers - donut or stacked bar with layers with dynamic list


### Character Dialogue
In this step we automatically extract, again through NLP tasks, all the dialogue spoken by male and non-male characters in each script. The aim here is to show **how the spoken words are divided** between male and non-male characters.
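
A minimal sketch of how such a division could be computed is shown below. It assumes the dialogue lines have already been extracted and tagged with the speaker's gender, which is the hard part and is not shown here; the function name `dialogue_share` and the sample lines are hypothetical.

```python
from collections import Counter


def dialogue_share(lines):
    """Return each group's percentage of the total spoken words.

    `lines` is an iterable of (gender, text) pairs.
    """
    words = Counter()
    for gender, text in lines:
        words[gender] += len(text.split())
    total = sum(words.values())
    if total == 0:
        return {}
    return {gender: 100 * count / total for gender, count in words.items()}


# Made-up dialogue lines, already tagged by speaker gender.
sample = [
    ("male", "We ride at dawn"),
    ("non-male", "Then we ride together"),
    ("male", "Fine"),
]
share = dialogue_share(sample)
```

Counting words rather than lines avoids treating a one-word reply and a long monologue as equal contributions.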

> **Graphs**
> 1. Vertical bar chart showing percentages between men and women

### Final "Gaze Score"
In this step we will be developing a mechanism in order to **assign a score to each film** within our scope. This scoring is important for us as we take into account all the factors analyzed above and assign a score from a **range of 0-100**.

The division of the score is as follows:
1. **Bechdel Test** (max. 40%), score assigned based on the following criteria
    1. If a movie passes **no rule**: 40%
    2. If a movie passes **only the first rule**: 26.66%
    3. If a movie passes **only the first and second rules**: 13.33%
    4. If a movie passes **all rules**: 0%
2. **Character description** (max. 35%), score assigned based on the following criteria
    1. If a female character is described in a **highly sexist** manner: 35%
    2. If a female character is described in a **dubious but problematic** manner: 17.5%
    3. If a female character is not described in any of the above manners: 0%
3. **Character dialogues** (max. 25%), score assigned based on the following criteria
    1. If male characters have 50% or less of the overall dialogue in the script: 0%
    2. If male characters have 70% or more of the overall dialogue in the script: 25%
    3. If male characters have between 51% and 69% of the overall dialogue in the script: the score is scaled with that percentage, between 0.1% and 24.9%
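
The scoring scheme above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the project's implementation: the function and key names are hypothetical, and the intermediate dialogue band is assumed to grow linearly between the 50% and 70% thresholds, which is one possible reading of the rule.

```python
# Points for the character-description component (key names are hypothetical).
DESCRIPTION_POINTS = {"highly_sexist": 35.0, "dubious": 17.5, "none": 0.0}


def bechdel_points(rules_passed):
    """40 points if no rule is passed, down to 0 if all three are passed."""
    if rules_passed not in (0, 1, 2, 3):
        raise ValueError("rules_passed must be 0, 1, 2 or 3")
    return 40 * (3 - rules_passed) / 3


def dialogue_points(male_share):
    """0 points up to 50% male dialogue, 25 from 70%, linear in between (assumption)."""
    if male_share <= 50:
        return 0.0
    if male_share >= 70:
        return 25.0
    return 25 * (male_share - 50) / 20


def gaze_score(rules_passed, description, male_share):
    """Sum the three components into a score between 0 and 100."""
    return (
        bechdel_points(rules_passed)
        + DESCRIPTION_POINTS[description]
        + dialogue_points(male_share)
    )


worst = gaze_score(0, "highly_sexist", 80)
best = gaze_score(3, "none", 40)
middle = gaze_score(1, "dubious", 60)
```

A higher score therefore indicates a stronger presence of the male gaze, with the three components capped at 40, 35 and 25 points respectively.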

> **Graphs**
> 1. bar chart


## The camera: SPARQL metadata retrieval

Finally, after gathering some preliminary results from the first analyses of film scripts and IMDb reviews, we deepened our research using [**Wikidata**](https://www.wikidata.org/wiki/Wikidata:Main_Page) and its **SPARQL endpoint**.

We had also found another interesting database with a SPARQL endpoint, the [**Linked Internet Movie Database (IMDb)**](https://triplydb.com/Triply/linkedmdb), and started an initial phase of **data exploration** (as it was unknown to us). However, we quickly found out that it was missing some of the information most relevant to the scope of our project, such as the gender of the people working on a movie (e.g. directors, writers...). Moreover, the "imdb id" it presented was actually different from the one on Wikidata, which, on the other hand, had all the necessary information.

The SPARQL queries are based on the results coming from the [script analysis](#The-characters:-film-and-scripts-analysis) and [review analysis](#The-audience:-webscraping,-sentiment-and-sexism) (respectively, the "characters" and "audience" sections):
- The audience results
    - [FRA WRITE THE RESULTS HERE]
- The characters results
    - Bechdel test: [CHLOE WRITE THE RESULTS HERE]
    - Character dialogue analysis: [AHSAN WRITE SOMETHING HERE]
    - Gaze score: [WRITE SOMETHING HERE]

Queries:
1. The "audience" query: *what audience is the most sexist?*, <span style="color:red;">*Is there any decade in which the reviews are the most sexist?*</span>
2. The "characters" queries:
    1. Bechdel test: *how many of the [selected] films have **male** directors?*
    2. Character dialogue: *what is the proportion between male and female writers in the [selected] films?*
3. Gaze score queries:
    1. *Which genres do the top 10 films in the gaze score ranking belong to?*
    2. *Is there any correlation between rank in the gaze score ranking, box-office and production costs?*
    3. *Is there any decade in which the films rank higher in the gaze score ranking?*

#### The "Audience" query: *what audience is the most sexist?*, *Is there any decade in which the reviews are the most sexist?*

In [None]:
import sparql_dataframe

wikidata_endpoint = 'https://query.wikidata.org/sparql'

#### The "Characters" queries
##### Bechdel test query: *how many of the [selected] films have **male** directors?*
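
As a hedged sketch of what such a query could look like, the helper below builds a SPARQL query that, for a list of IMDb IDs, retrieves each film's directors and their gender from Wikidata. It relies on the Wikidata properties `P345` (IMDb ID), `P57` (director) and `P21` (sex or gender); the function name is hypothetical, and the two IDs are simply taken from the review URLs listed earlier.

```python
def build_director_gender_query(imdb_ids):
    """Build a Wikidata SPARQL query returning directors and their gender per film."""
    values = " ".join(f'"{imdb_id}"' for imdb_id in imdb_ids)
    return f"""
SELECT ?imdb ?filmLabel ?directorLabel ?genderLabel WHERE {{
  VALUES ?imdb {{ {values} }}
  ?film wdt:P345 ?imdb ;
        wdt:P57 ?director .
  OPTIONAL {{ ?director wdt:P21 ?gender . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


query = build_director_gender_query(["tt0034583", "tt0120338"])

# Running the query needs network access and the sparql_dataframe package, e.g.:
# import sparql_dataframe
# directors = sparql_dataframe.get("https://query.wikidata.org/sparql", query, True)
```

Counting the rows whose `genderLabel` is "male" in the resulting dataframe would then answer the question in the heading.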

##### Character dialogue query: *what is the proportion between male and female writers in the [selected] films?*

#### Gaze score queries
##### GS query 1: *Which genres do the top 10 films in the gaze score ranking belong to?*

##### GS query 2: *Is there any correlation between rank in the gaze score ranking, box-office and production costs?*

##### GS query 3: *Is there any decade in which the films rank higher in the gaze score ranking?*