# Through the Gaze - Data documentation
This Jupyter Notebook documents the data preparation and processing phase for [**Through the Gaze**](https://ahsanv101.github.io/ProjectGaze/), a project developed for the final exam of the course <a href="https://www.unibo.it/it/didattica/insegnamenti/insegnamento/2022/467047">"Information Visualization"</a> taught by Professor Marilena Daquino at Alma Mater Studiorum - University of Bologna.

For this project, we are interested in studying the concept of the **"male gaze"** in cinema, inspired by the essay "Visual Pleasure and Narrative Cinema" by the feminist film theorist Laura Mulvey. Mulvey underlines how the "male gaze" is made of three main components:
1. The audience
2. The characters
3. The camera (i.e. the director)

To provide a coherent and meaningful overview of the male gaze's impact on the Western film industry, we identify the **10 highest-grossing U.S. films for each decade from the 1940s to the 2010s**. We chose the highest-grossing movies because box-office revenue is a proxy for popularity (the more a film earned, the more people saw it) and because such films help establish a kind of cultural norm.
Working with the highest-grossing movies of each decade therefore lets us generalize our results in terms of popularity.


### Disclaimer 
This Jupyter Notebook is informational only: it is not meant to be used for the data preparation and processing itself, but only to analyse and explain those processes.
<br>The Python files used for the cleanup can be found in the `code` folder of the [GitHub repository](https://github.com/ahsanv101/ProjectGaze).

## The audience: webscraping, sentiment and sexism
Focusing on the audience component of the male gaze meant looking through some of the **reviews** written for the movies in our dataset, concentrating not only on the overall reception of each movie, but above all on individual viewers' perception of it and on any gender bias underlying their opinions.


The reviews are stored **without the names of the users who wrote them**, since that information was not useful for our analysis. What is important to keep in mind is that our review dataset comprises 1972 reviews of our chosen movies, and that they are all **public and available on IMDb's review pages**. Moreover, it is essential to underline that our analysis is partial and neutral, and aims to offer useful reflections rather than harsh critiques. 

### Reviews webscraping
The first step of our audience analysis was scraping the review pages listed as URLs in the movie.csv files. To do so, we used the [**BeautifulSoup library**](https://www.crummy.com/software/BeautifulSoup/) and inspected the HTML structure of a standard IMDb review page: the text of each review is stored inside a `div` block with the class `text`, and this is where we access all of our data. 
<br>
The task, which was mostly automated, only required dividing the URLs into chunks to speed up the overall scraping process (since we were working with huge amounts of data!). 


We then stored our reviews in a dictionary, converted it to a dataframe, and finally exported it as a **`.csv` file** containing a single column, `Reviews`, alongside an index. 
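
As a minimal sketch of the chunking step described above, the snippet below splits a list of URLs into fixed-size batches. The helper name `chunk_urls` and the example URLs are illustrative assumptions and are not part of the project code.

```python
def chunk_urls(urls, batch_size):
    """Split a list of URLs into consecutive batches of at most batch_size items."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

# Hypothetical URLs, used only for illustration
example_urls = [f"https://example.org/reviews/{n}" for n in range(5)]
batches = chunk_urls(example_urls, batch_size=2)

# Show how many URLs ended up in each batch
print([len(batch) for batch in batches])
```

The last batch is simply shorter when the number of URLs is not a multiple of the batch size, so no URL is dropped.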


In [3]:

# We used the following libraries!
import pandas as pd 
import requests
from bs4 import BeautifulSoup
import pprint
import re

# Here we initialize and modify our CSVs accordingly and we create a list for the webscraped reviews 
movies = pd.read_csv('../data/webscrape/movies-checkpoint.csv')
title_reviews = movies[['Title','Reviews']].copy()

text_reviews = []

# The webscraping starts here
batch_size = 79
urls = ['https://www.imdb.com/title/tt0038969/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0041838/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0031381/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0037536/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0034167/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0036872/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0039391/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0035575/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0034583/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0040806/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0049833/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0045793/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0044672/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0044672/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0047673/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0043949/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0051459/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0053291/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0048593/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0042192/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0059742/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0061722/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0064115/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0058331/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0056937/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0062622/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0055614/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0054215/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0056172/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0060164/?ref_=nv_sr_srsg_3', 'https://www.imdb.com/title/tt0073195/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0076759/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0070047/reviews?ref_=tt_urv', 
'https://www.imdb.com/title/tt0077631/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0071230/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0075148/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0066011/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0078346/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0067093/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0080684/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0083866/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0096895/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0086190/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0087332/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0088763/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0092099/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0092644/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0096438/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0081573/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0120338/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0120915/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0107290/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0116629/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0109830/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0119654/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0099653/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0103064/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0103776/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0112462/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0468569/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0383574/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0145487/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0417741/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0121766/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0316654/reviews?ref_=tt_urv', 
'https://www.imdb.com/title/tt0418279/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0325980/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0241527/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0120755/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt4154796/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt1825683/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt2488496/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0848228/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt2527336/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0499549/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0770828/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt3748528/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt1201607/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt1877832/reviews?ref_=tt_urv']
url_chunks = [urls[x:x+batch_size] for x in range(0, len(urls), batch_size)]

def scrape_url(url):
    # Fetch one reviews page and collect the text of every <div class="text"> block
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    for block in soup.find_all('div', class_='text'):
        text_reviews.append(block.get_text())

def scrape_batch(url_chunk):
    # Scrape every URL of a chunk; the results accumulate in text_reviews
    for url in url_chunk:
        scrape_url(url)

for url_chunk in url_chunks:
    scrape_batch(url_chunk)
    
# From the list, we store our results into a dictionary, to later convert into a new dataframe and CSV. 
reviews_dict = {'Reviews': text_reviews}
text_reviews = pd.DataFrame.from_dict(reviews_dict)
# text_reviews.to_csv("text_reviews.csv")


### Sentiment Analysis
Now that our reviews were available, it was time to actually start working on our analysis: this second step focused mostly on **retrieving the sentiment of our reviews**: *are they positive or negative?*
<br>
This aspect was later used to understand whether there was any strong correlation between a possible sexist tone in a review and its overall sentiment: for example, *how does a poor opinion of women affect the overall perception of a movie?* *Are negative reviews the most sexist?*


To perform the sentiment analysis, we used the [**`NLTK` library**](https://www.nltk.org/) and its **`VADER`** module, a rule-based sentiment analyzer in which terms are labeled according to their semantic orientation as either positive or negative. 
The result of this analysis was a **new dataframe** containing our `Reviews` column, a new `scores` column (containing the raw VADER scores, divided into negative, neutral and positive values, together with the compound value), a `compound` column (a normalized, weighted score between -1 and 1) and a `sentiment` column, which provides a clear label distinguishing positive reviews (`pos`) from negative ones (`neg`). 
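
The labelling rule can be sketched in isolation as follows. This is only an illustration: the function name `label_sentiment`, its `threshold` parameter and the sample scores are assumptions, while the default rule (a compound score of zero or more counts as positive) mirrors the one used in the notebook.

```python
def label_sentiment(compound, threshold=0.0):
    """Return 'pos' when the compound score reaches the threshold, else 'neg'."""
    return 'pos' if compound >= threshold else 'neg'

# Hypothetical compound scores, not taken from the dataset
for score in (0.82, 0.0, -0.41):
    print(score, label_sentiment(score))
```

With this rule a review with a compound score of exactly 0 is labelled positive; raising `threshold` would be one way to make the positive label stricter.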

In [None]:
import nltk
nltk.download('vader_lexicon')
import numpy as np
import pandas as pd

from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer()

df = pd.read_csv('text_reviews.csv')

# Here starts the sentiment analysis 
df.dropna(inplace=True)
# Drop the reviews that contain only whitespace
blank_rows = df[df['Reviews'].str.isspace()].index
df.drop(blank_rows, inplace=True)

# We calculate overall scores, compound value and the sentiment label. 
df['scores'] = df['Reviews'].apply(lambda Reviews: vader.polarity_scores(Reviews))
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df['sentiment'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

#... And then we obtain the CSV
# df.to_csv('sentiment_reviews.csv')

### Sexism Analysis
Having established the overall sentiment of our reviews, the final step of our audience analysis was **detecting possible traces of sexism in the reviews**.
<br>
To do this, we applied a model created and published by the group NLP-LTU on Hugging Face, the [**BerTweet Large Sexism Detector**](https://huggingface.co/NLP-LTU/bertweet-large-sexism-detector), a classification model for detecting sexism in Tweets or short text paragraphs. As some of our reviews were longer than the model's length limit, a few adjustments were implemented.


In the end, we obtained a clear result: our reviews were not sexist or, at least, they were *not completely* sexist.
<br>
The model categorized them as lacking any kind of gender bias but, having inspected the reviews ourselves, we knew this was not true: a few reviews showed clear signs of misogyny and sexism, not just by using offensive words such as "bitch" or "tramp" when referring to actresses or their characters, but also by constantly describing them as sexy and beautiful or by comparing them to animals. 
The model simply failed to recognize these passages because, in quantitative terms, those sentences carried very little weight in the overall structure of a review that otherwise had a neutral or even positive tone. 
What emerged from this analysis is that **the audience's gaze is rarely guided by pure prejudice or malevolence**: realistically, our reviews displayed sexism in a "natural" and subtle way, so subtle that even the sexism-detection model failed to acknowledge it when analysing the bigger picture. 

However, we were not satisfied with this result: we wanted to isolate these instances of sexism and, to do so, we needed to narrow the detector's scope of analysis. Therefore, we introduced a simpler function that splits each review into individual sentences: this way, we could obtain a sexism score for each sentence and give it more weight. 
If a review contained even a single sexist sentence, it was marked as sexist and sorted in the final CSV according to its sexism score. 
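
The sentence-level rule can be sketched as below. The list-of-dictionaries shape follows the usual output of a Hugging Face text-classification pipeline; the label strings, the sample scores and the function name `aggregate_review` are assumptions made for illustration, not the project's actual code.

```python
def aggregate_review(sentence_predictions, sexist_label="sexist"):
    """Mark a review as sexist if any sentence is, and keep the highest sexist score."""
    sexist_scores = [p["score"] for p in sentence_predictions
                     if p["label"] == sexist_label]
    is_sexist = len(sexist_scores) > 0
    top_score = max(sexist_scores) if is_sexist else 0.0
    return is_sexist, top_score

# Hypothetical classifier output for a three-sentence review
predictions = [
    {"label": "not sexist", "score": 0.98},
    {"label": "sexist", "score": 0.91},
    {"label": "not sexist", "score": 0.87},
]
print(aggregate_review(predictions))
```

Taking the maximum keeps a single strongly sexist sentence from being diluted by the neutral sentences around it, which is the problem the whole-review prediction ran into.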

In [None]:
#For this code to work, the libraries Transformers and Torch are needed. 
import pandas as pd 
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer,pipeline
from transformers import BertForSequenceClassification, BertTokenizer
import torch

#We define the model, tokenizer and classifier we are going to use 
model = AutoModelForSequenceClassification.from_pretrained('NLP-LTU/bertweet-large-sexism-detector')
tokenizer = AutoTokenizer.from_pretrained('NLP-LTU/bertweet-large-sexism-detector') 
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

df = pd.read_csv('sentiment_reviews.csv')


#This portion of code generates a prediction for the OVERALL review. Short reviews are classified directly; reviews longer than 512 words are first split into 512-word chunks, and each chunk is classified separately.
for item in df['Reviews']:
    words = item.split()
    chunks = [' '.join(words[i:i+512]) for i in range(0, len(words), 512)]
    for chunk in chunks:
        # truncation=True keeps each chunk within the model's maximum token length
        prediction = classifier(chunk, truncation=True)
        print(prediction, chunk)
          
#To work on the individual sentences, we used this instead. 

reviews = []
sentences = []

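for index, item in df.Reviews.items(): 
      # Split the review into sentences and skip empty fragments
      sentence = [s.strip() for s in item.split('.') if s.strip()]
      prediction = classifier(sentence, truncation=True)
      sentences.append(sentence)  
      reviews.append(prediction)
      print([sentence, prediction])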

## The characters: film and scripts analysis
The aim of this analysis is to measure how dominant the male gaze is in the films and their scripts. This is one of our most important analyses, since it dives directly into the core content of the film industry: the scripts, which are the basis of any film. We chose scripts because they address **the whole setting of the characters**, **how they are defined on camera** (for the viewers) and **how the male characters in the script perceive the non-male ones**. They also show what kind of dialogue and actions are assigned to male versus non-male characters, which gives us a good basis for comparison. 


### Bechdel Test
The first step of this analysis is the well-known [Bechdel Test](https://bechdeltest.com/), used to measure **how women are represented in a given film**. A film needs to satisfy three rules:

1. The movie has to have at least two women in it
2. The movie has to have at least two women who talk to each other
3. The movie has to have at least two women who talk to each other and it is about something other than a man

If a movie satisfies all three rules, it passes the Bechdel test. This is a bare-minimum bar that ideally every movie should clear. We collect this data from already <a href= "https://www.kaggle.com/datasets/alisonyao/movie-bechdel-test-scores">existing datasets</a> and check the results for the movies in our scope. 

After importing our datasets and cleaning the title strings so that they merge correctly, we assign the corresponding Bechdel test values to our films.

In [None]:
import os
import pandas as pd

cwd = os.getcwd()
path ="/".join(list(cwd.split('/')[0:-1])) 

top_movies = path+'/Data/webscrape/finalmovies.csv'
movies_df = pd.read_csv(top_movies,header=0)
bechdel_data=path+'/Data/bechdel/Bechdel_detailed.csv'
bechdel_df= pd.read_csv(bechdel_data)
bechdel_df.rename(columns={"title":"Title"}, inplace=True)
bechdel_df

This dataset contains information and metadata about the Bechdel evaluation of a series of movies.
The information most relevant to us is the <b>rating column</b>, which contains a number from 0 to 3, where:
<ul>
<li>0 means there are fewer than two female characters,</li>
<li>1 means there are two female characters, but they do not talk to each other,</li>
<li>2 means they talk to each other, but only about a man,</li>
<li>3 means the movie passes the test.</li>
</ul>
The `dubious` column states whether the submitter considered the rating dubious.
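
The rating scale described above can be written down as a small lookup, sketched here for illustration only. The names `BECHDEL_LEVELS` and `passes_bechdel` are assumptions and do not appear in the project code.

```python
# Meaning of each rating value, following the scale described above
BECHDEL_LEVELS = {
    0: "fewer than two female characters",
    1: "two female characters who never talk to each other",
    2: "they talk to each other, but only about a man",
    3: "passes the test",
}

def passes_bechdel(rating):
    """A film passes only when it reaches the highest rating (3)."""
    return rating == 3

for rating in sorted(BECHDEL_LEVELS):
    print(rating, BECHDEL_LEVELS[rating], passes_bechdel(rating))
```

Because the comparison is numeric, the check also works when the rating is stored as a float such as `3.0`, which is what typically happens in a dataframe column that contains missing values.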


We will now perform cleaning and merging operations in order to get our final dataset.

In [None]:
import re

# Rename manually
bechdel_df.loc[bechdel_df['Title'].str.contains('Rogue One'), 'Title'] = bechdel_df['Title'].str.replace('Rogue One', 'Rogue One: A Star Wars Story')
bechdel_df.loc[bechdel_df['Title'].str.contains('Last Jedi'), 'Title'] = bechdel_df['Title'].str.replace('Star Wars: The Last Jedi', 'Star Wars: Episode VIII - The Last Jedi')
bechdel_df.loc[bechdel_df['Title'].str.fullmatch('Star Wars'), 'Title'] = bechdel_df['Title'].str.replace('Star Wars', 'Star Wars: Episode IV - A New Hope')

# Remove special characters etc
def normalize_string(s):
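    s = s.replace('&#39;','')  # Remove HTML-encoded apostrophes
    s = s.replace("'", '')  # Replace apostrophes with an empty string
    s = re.sub(r'\W+', '', s)  # Remove non-alphanumeric characters (including spaces)
    s = s.lower()  # Convert to lowercase
    s = s.replace(' ', '_')  # Replace any remaining spaces with underscores
    s = s.replace('the', '')  # Remove every occurrence of "the"
    s = s.replace('judgment', 'judgement')  # Use a single spelling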
    return s


bechdel_df['name_normalized'] = bechdel_df['Title'].apply(normalize_string)
movies_df['name_normalized'] = movies_df['Title'].apply(normalize_string)

final_df = pd.merge(bechdel_df, movies_df, on='name_normalized', how='right')

# Study missing values

null_values_x= final_df['Title_x'].isna()
null_values_x.sum()
final_df[null_values_x]

bechdel_no_data= final_df[null_values_x]
bechdel_no_data = bechdel_no_data.drop(["Unnamed: 0","name_normalized"], axis=1)
bechdel_no_data = bechdel_no_data.dropna(axis=1)
bechdel_no_data.rename(columns={"Title_y":"Title"}, inplace= True)
bechdel_no_data.reset_index(drop=True, inplace=True) #these are our movies that do not have any data regarding bechdel rules

# View movies that do not contain information
bechdel_no_data

10 of our selected movies, most of them released in the beginning of our time range, have not yet been evaluated.
Let's look at the information regarding the rest of the movies.


In [None]:
movies_bechdel = final_df.drop(['Unnamed: 0','submitterid','date','name_normalized', 'id','visible','Title_x'],axis=1)
movies_bechdel.rename(columns={"Title_y":"Title","rating":"bechdel_rating"}, inplace= True)
movies_bechdel = movies_bechdel[['Title', 'Decade', 'Genre', 'Director', 'year', 'bechdel_rating', 'dubious']]

title_duplicates = movies_bechdel[movies_bechdel['Title'].duplicated(keep=False)]
# Drop duplicates by their index
movies_bechdel= movies_bechdel.drop([24,25,29,77])
movies_bechdel.reset_index(drop=True, inplace=True)

movies_bechdel

For more specific information we can query the dataframe directly, for example if we want to take a look at which of our selected films have passed the bechdel test:

In [None]:
bechdel_passed = movies_bechdel[movies_bechdel["bechdel_rating"] == 3.0] 
bechdel_passed.reset_index(drop=True, inplace=True)
bechdel_passed

 

### Character Description
In this step we will be diving into the **actual descriptions of characters in the scripts**. The idea of using descriptions of the characters is to get an understanding of how the camera wants to show certain features of the characters through the use of angles: in this way the camera becomes the gaze and the (non-male) character becomes the object for the gaze.

Our aim is to extract these descriptions from the scripts automatically using Natural Language Processing and to show the words that are most often used to describe characters (both male and non-male), revealing the differences in the way they are portrayed. We also aim to **categorize female descriptions** into *body descriptions* relating to the male gaze, and *dubious but problematic* descriptions relating to both the body and the personality of female characters.



In [None]:
from pandas import *
import matplotlib.pyplot as plt
import os
# reading the script files

from PyPDF2 import PdfReader
#nltk tools
import nltk 
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [None]:
lemmatizer = WordNetLemmatizer()

part = wn.synsets('body_part')[0]

def is_body_part(candidate):
    for ss in wn.synsets(candidate):
        # only get those where the synset matches exactly
        name = ss.name().split(".", 1)[0]
        if name != candidate:
            continue
        hit = part.lowest_common_hypernyms(ss)
        if hit and hit[0] == part:
            return True
    return False



cwd = os.getcwd()
path ="/".join(list(cwd.split('/')[0:-1])) 
 
# assign directory
directory = path+'/Data/scripts'
 
# iterate over script files
ignore='.DS_Store'
files = []
for filename in os.scandir(directory):

     if filename.is_file() and ignore not in str(filename):
        files.append(filename.path)
 
def get_title(file_name): #get/clean script titles
    title = file_name.split("/")[-1]
    
    return title

# keywords to look out for (body descriptions and adjectives). The last two lists contain problematic vocabulary often associated with females.

words_0= ['body', 'blonde', 'brunette', 'lips', 'beauty', 'age', 'smile', 'pants', 'skirt', 'dress', 'shirt', 'glow', 'shorts', 'hand','face','finger', 'throat','neck','hair','skin','arm','figure','shoulder'] 
adj_0=['beautiful', 'gorgeous', 'cute', 'pretty', 'devoted','divine', 'lawful','housewife', 'silly', 'frightening']
words_1=['ass', 'buxom','chest','boob', 'bosom','buttock','breast', 'breasts','thigh', 'bottom', 'curve', 'underwear','thong','figure', 'panty', 'stocking', 'panties', 'lingerie', 'bra', 'nipple','vagina','cunt','womanhood']
adj_1= ['seductive','sexy','trashy', 'nude', 'sexuality','promiscuous', 'sexual', 'ignorant', 'hot', 'hottie', 'erotic','fuck-me', 'fuck me','juicy','sultry', 'banging','naked', 'topless',
        'stupid','helpless','fragile','dumb','weak','pitiful','enchanting', 'stunning','toned', 'breathtaking', 'breath-taking', 'perfect', 'bitch','slut','crazy',
        'sassy','dramatic','bubbly','hysterical', 'bitchy','catty','tease','prude','trollop']


# dictionary to store the data for each movie, for lists words_0 and adj_0
movies_dict={}
# dictionary  to store the data for each movie, for lists words_1 and adj_1
movies_dict_1={}



for f in files:
    movie_title = get_title(f)
    reader = PdfReader(f)
    lst=[]
    for i in range(0,len(reader.pages)):
        page = reader.pages[i]
        text = page.extract_text()
        lst.append(text)

    # dictionaries to store words and their number of occurences, for each list (body depiction and dubious words)
    word_counts={}
    word_counts_1={}
    
    check = ["she", "her", "woman", "woman's", "women", "women's", "she's","girl","girl's","girls"]

    for i in lst:
        tokens = word_tokenize(i)

        for k in range(0,len(tokens)):
            # lemmatize
            tokens[k] = lemmatizer.lemmatize(tokens[k])

            # if the token is a body part or in the list of keywords
            if is_body_part(tokens[k].lower()) == True or tokens[k].lower() in words_0+adj_0:  
                
                #get the n-grams near the token
                gram2 = tokens[k-2].lower()
                gram1 = tokens[k-1].lower()
                gram = tokens[k].lower()
                # use an empty string when there is no following token
                gram0 = tokens[k+1].lower() if k+1 < len(tokens) else ''

                # check whether they are associated with female pronouns
                if  gram2 in check or gram1 in check or gram0 in check:

                  #populate the dictionary
                  if tokens[k].lower() in word_counts:
                      word_counts[tokens[k].lower()] += 1
                  else:
                      word_counts[tokens[k].lower()] = 1

            # if the token is in the list of our problematic keywords
            if tokens[k].lower() in words_1+adj_1:

                # check the n-grams around the problematic token
                gram2 = tokens[k-2].lower()
                gram1 = tokens[k-1].lower()
                gram = tokens[k].lower()
                # use an empty string when there is no following token
                gram0 = tokens[k+1].lower() if k+1 < len(tokens) else ''

                # check whether they are associated with female pronouns
                if gram2 in check or gram1 in check or gram in check or gram0 in check:
            
                    # populate the dictionary for problematic keyword occurences
                    if tokens[k].lower() in word_counts_1:
                        word_counts_1[tokens[k].lower()] += 1
                    else:
                        word_counts_1[tokens[k].lower()] = 1
    
    # assign findings to the general dictionary for each key that is our movie
    movies_dict[movie_title] = word_counts
    movies_dict_1[movie_title] = word_counts_1
 

<p>The above code iterates over the scripts in our folder, and looks for:
<ul>
    <li>words that are used to describe one's body, with the function "is_body_part", </li>
    <li> words that we have deemed inappropriate for describing a female's body and character, given in a list of strings.</li>
    
</ul>
</p>
<p>Then, if these words are associated with <b>female characters</b> (through specific nouns and pronouns), they are added to the corresponding dictionary, in which each key is a movie name and each value is a dictionary containing the words and their number of occurrences.</p>
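
The neighbour check can be illustrated with a simplified, self-contained sketch. It leaves out the WordNet lookup and the lemmatization used in the notebook, and the function name `count_near_markers`, the sample sentence and the file name are assumptions made for the example.

```python
def count_near_markers(tokens, keywords, markers):
    """Count keywords that have a marker within two tokens before or one token after."""
    counts = {}
    lowered = [t.lower() for t in tokens]
    for k, token in enumerate(lowered):
        if token not in keywords:
            continue
        # Neighbours: two tokens to the left and one to the right, clipped to the list bounds
        window = lowered[max(0, k - 2):k] + lowered[k + 1:k + 2]
        if any(w in markers for w in window):
            counts[token] = counts.get(token, 0) + 1
    return counts

# Hypothetical sentence: only the first "hair" is close to a female marker
tokens = "She brushes her hair while he checks his hair".split()
word_counts = count_near_markers(tokens, {"hair"}, {"she", "her"})

# One entry per script: the key is the file name, the value is the word-count dictionary
example_movies_dict = {"example_script.pdf": word_counts}
print(example_movies_dict)
```

Slicing the token list, instead of indexing `k-2` and `k+1` directly, avoids wrapping around to the end of the page for the first tokens and running past the last one.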

Let's take a look at an example:

In [None]:
print('descriptions:',list(movies_dict.items())[0],'\n','problematic descriptions:',list(movies_dict_1.items())[0])

<br>
Next, we want to create a dataframe with the total number of occurrences for each film.

In [None]:
# get total amount for each movie
movie_stats = {movie: sum(words.values()) for movie, words in movies_dict.items()}
movie_stats_inapp = {movie: sum(words.values()) for movie, words in movies_dict_1.items()}

movie_descriptions = DataFrame.from_dict(movie_stats, orient='index', columns=['count'])
movie_descriptions_inapp= DataFrame.from_dict(movie_stats_inapp, orient='index', columns=['inappropriate_count'])

movie_descriptions.reset_index(inplace=True)
movie_descriptions_inapp.reset_index(inplace=True)

movie_descriptions = movie_descriptions.rename(columns={'index':'script_name'})
movie_descriptions_inapp = movie_descriptions_inapp.rename(columns={'index':'script_name'})

movie_desc_graph= merge(movie_descriptions,movie_descriptions_inapp, left_on='script_name', right_on='script_name')


# difflib will allow us to match our script names to the appropriate movie titles

import difflib
df_all_movies = read_csv(path+'/Data/Dialogue/dialogue_bechdel.csv')
titles = df_all_movies['Title'].to_list()
titles_to_check= movie_desc_graph['script_name'].to_list()
titles_match=[]
for i in titles_to_check:
    titles_match.append(difflib.get_close_matches(i, titles, len(titles), 0)[0])

fem_desc_graph = DataFrame(list(zip(titles_match,titles_to_check)),
               columns =['movie', 'script_name'])
fem_desc_graph  = fem_desc_graph.merge(movie_desc_graph, left_on="script_name", right_on= "script_name")
fem_desc_graph
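
The title matching used in the cell above relies on `difflib.get_close_matches`. The sketch below shows the same call on a few made-up names; the script names and titles are hypothetical and serve only as an illustration.

```python
import difflib

# Hypothetical file names and titles, for illustration only
script_names = ["avatar.pdf", "titanic_script.pdf"]
titles = ["Avatar", "Titanic", "Jaws"]

matches = {}
for name in script_names:
    # n=1 keeps only the best candidate; cutoff=0 disables the similarity threshold
    best = difflib.get_close_matches(name, titles, n=1, cutoff=0)[0]
    matches[name] = best

print(matches)
```

With `cutoff=0` a match is always returned, even for a script whose title is not in the list, so the resulting pairs are worth checking by eye.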


### Character Dialogue
In this step we automatically extract all the dialogue spoken by male and non-male characters in each script, again using NLP techniques. The aim is to show **how the words are divided** between male and non-male characters, and therefore how each group is represented. 



In [None]:

# importing required modules
from PyPDF2 import PdfReader
import nltk 
from nltk.corpus import wordnet as wn
import os
import re
import json
import gender_guesser.detector as gender
import csv
import pandas as pd
import difflib

 
# nltk.download('wordnet')

part = wn.synsets('body_part')[0]


# assign directory
directory = '../Data/scripts/'
 
# iterate over files in
# that directory

files = []
for filename in os.scandir(directory):
    if filename.is_file():
        files.append(filename.path)


In the next sections of the code, we first read through all the scripts page by page and retrieve the text on each page. From there, we start our computational language processing. Since scripts are generally written in an agreed-upon format, we adapt our code to that format: we aim to extract all the character names, which are written in the middle of the page and followed by their dialogue. 


For each page, we extract all the text. Then, using some helper functions, we extract the name of each character and their dialogue from that text. We store the result in a dictionary, where each key is a character and each value is all of that character's dialogue. Each of these dictionaries is in turn stored in a main dictionary, with the script file name as key and the character-dialogue dictionary as value.
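
The nested structure can be sketched with a deliberately simplified parser. Unlike the notebook's code, which looks ahead a fixed number of lines after each character name, this version attaches every line to the most recent upper-case cue. The function name `collect_dialogues`, the character names and the file name are assumptions made for the example.

```python
def collect_dialogues(lines):
    """Group dialogue lines under the most recent upper-case character cue."""
    dialogues = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Treat a short, fully upper-case line as a character cue
        if line.isupper() and len(line.split()) <= 2:
            current = line
            dialogues.setdefault(current, "")
        elif current is not None:
            dialogues[current] = (dialogues[current] + " " + line).strip()
    return dialogues

# Hypothetical script fragment
page = ["  ALICE", "  He came to me.", "", "  BOB", "  Who did?",
        "  ALICE", "  The visitor."]

# Main dictionary: script file name -> {character: all of their dialogue}
script_dialogues = {"example_script.pdf": collect_dialogues(page)}
print(script_dialogues)
```

Stripping each line before using it as a key means a character whose name appears with different indentation is still stored under a single entry.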

In [None]:


# This function returns True for a word that is not fully upper case and contains no ":", or that has length 0 or 1; otherwise it returns False
def checkUpper(s):
    if len(s) > 1 and s.isupper() != True:
        if ":" in s:
            return False
        else:
            return True
    elif len(s) == 1 or len(s) == 0:
        return True
    else:
        return False


# This function simply checks if there are any numbers in the string
def has_numbers(inputString):
    return any(char.isdigit() for char in inputString)

# files = ['../Data/scripts/ET_1.pdf']
# Data/scripts/avatar.php','Data/scripts/batmanforever.php'
# 'Data/scripts/backtothefuture.pdf
# dic={}

# This is the main dictionary where everything will be eventually stored


Maindic = {}
for f in files:
    dic = {}

    title = f.split("/")[-1]
    print('\n', '---------------------------------', title,
          '----------------------------------', '\n')
    reader = PdfReader(f)

    lst = []

    # for each page that the reader has read from the pdf
    for i in range(0, len(reader.pages)):

        page = reader.pages[i]
        text = page.extract_text()
#         splitting the text of the page on the basis of new line
        ls = text.split('\n')

        for j in range(0, len(ls)):

            #             for each line we clean it
            cleaned = ls[j].strip()

# Then we check if it is one word or two words or if it has : in it .
# this is to basically extract that this will be a character who will be saying some dialgoue
            if (len(cleaned.split(" ")) == 1 or len(cleaned.split(" ")) == 2) and (cleaned.isupper() or ":" in cleaned) and '!' not in cleaned and has_numbers(cleaned) == False:

                # A character cue needs at least 3 following lines on the page;
                # we take up to 6 of them as the candidate dialogue
                n_following = min(6, len(ls) - 1 - j)

                if n_following >= 3:
                    # Here we are adding all the lines together in one dialogue
                    word = ' '.join(line.strip()
                                    for line in ls[j+1:j+1+n_following])

                    # We want to keep only the words that belong to this dialogue and not to the
                    # next character's cue, so we use the checkUpper function and stop wherever there is a doubt
                    newword = []
                    for e in word.split(' '):
                        if checkUpper(e):
                            newword.append(e)
                        else:
                            break

                    word = ' '.join(newword)

                    # Then we store the dialogue in the local dictionary under the cleaned character name,
                    # appending it if the character has already spoken
                    if cleaned not in dic:
                        dic[cleaned] = word
                    else:
                        dic[cleaned] = dic[cleaned] + ' ' + word

        Maindic[title] = dic
#     Here is where we save each of the dic for each script in a main dic


In [None]:


import json

#   saving in a txt file
with open('convert.txt', 'w') as convert_file:
     convert_file.write(json.dumps(Maindic))
    

# reading the data from the file
with open('convert.txt') as f:
    data = f.read()
  

      
# # reconstructing the data as a dictionary
js = json.loads(data)
  

Once we have all the characters, with their names as written in the scripts, we try to determine the gender of these characters. We use the Python `gender-guesser` library for that purpose.

These values were also checked manually to correct any discrepancy. Note also that many characters that are neither male nor female were not labelled as either.
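
The name clean-up and the label mapping can be sketched without the library itself. In the snippet below, `clean_name`, `to_code` and `LABEL_TO_CODE` are hypothetical names introduced for illustration; the labels are the ones `gender-guesser` returns from `get_gender`, and the mapping of `andy` to `m` mirrors the choice made in this notebook before the manual check.

```python
import re

# Labels returned by gender-guesser, mapped to the codes used in this notebook.
# Anything else (for example "unknown") falls back to "u".
LABEL_TO_CODE = {
    "male": "m",
    "mostly_male": "m",
    "female": "f",
    "mostly_female": "f",
    "andy": "m",
}


def clean_name(raw):
    """Hypothetical helper: reduce a script cue such as "JOHN'S:" to a bare first name."""
    name = raw.split(" ")[0]            # keep the first token only
    name = re.sub(r"'\w+", "", name)    # drop possessives and contractions
    name = re.sub(r"-\w+", "", name)    # drop hyphenated suffixes
    name = re.sub(r"[:.]", "", name)    # drop colons and full stops
    return name.strip('()"')


def to_code(label):
    """Translate a gender-guesser label into "m", "f" or "u"."""
    return LABEL_TO_CODE.get(label, "u")


print(clean_name("JOHN'S:"), to_code("mostly_female"), to_code("unknown"))
```

Only the first token of the cue is used, so titles such as "DR." end up being passed to the detector instead of the actual name, which is one reason a manual check remains useful.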

In [1]:

import re
import gender_guesser.detector as gender

d = gender.Detector(case_sensitive=False)
data=[]

# Here we first clean some of the names that we have. A name might contain an apostrophe, a hyphen, a colon or a full stop
# at the beginning or end of the word, which we want to remove. Then we use a Python library called gender-guesser, which
# takes a name and assigns a gender to it: male, female, mostly male, mostly female, androgynous or unknown.


for key,value in js.items():
#     if key == 'titanic-numbered.pdf':
    print("-----------------"+key+"-----------------------")
#     data.append([key,'','',''])
    #     Sorted = sorted(value, key=lambda k: len(d[k]), reverse=True)
    for k in sorted(value, key=lambda k: len(value[k]), reverse=True):
        if len(value[k])>=10 and len(k)>=2:
            name = k.split(" ")[0]
            name  = re.sub(r'\'\w+', '', name)
            name  = re.sub(r'\-\w+', '', name)
            name  = re.sub(r'\:','', name)
            name  = re.sub(r'\.','', name)
            name  = re.sub(r'(CONT)','', name)
            name  = re.sub(r'','', name)
            name = name.strip("()")
            name = name.strip('"')
            print([key,k,name,"m",len(value[k])])
            
            if d.get_gender(name) == "male":
                data.append([key,k,name,"m",len(value[k])])
            elif d.get_gender(name) == "female":
                data.append([key,k,name,"f",len(value[k])])
            elif d.get_gender(name) == "mostly_male":
                data.append([key,k,name,"m",len(value[k])])
            elif d.get_gender(name) == "mostly_female":
                data.append([key,k,name,"f",len(value[k])])
            elif d.get_gender(name) == "andy":
                data.append([key,k,name,"m",len(value[k])])
            else:
                data.append([key,k,name,"u",len(value[k])])
            
            
#             print (k,len(value[k]),d.get_gender(k.split(' ')[0]))


Once all the male and non-male characters have been saved in a CSV file, we sum the dialogue lengths detected for the male and for the non-male characters of each script. This helps us understand how the dialogue is distributed between the two categories.
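
The aggregation can be illustrated on a handful of invented rows. This is only a sketch using the standard library; the notebook itself performs the same grouping with pandas. The rows, the `totals` structure and the `male_percentage` helper are assumptions made for the example.

```python
from collections import defaultdict

# Invented rows in the same shape as the notebook's `data` list:
# [movie, name from script, cleaned name, gender code, dialogue length]
rows = [
    ["film_a.pdf", "JOHN", "JOHN", "m", 300],
    ["film_a.pdf", "MARY", "MARY", "f", 100],
    ["film_a.pdf", "GUARD", "GUARD", "u", 50],
    ["film_b.pdf", "ANNA", "ANNA", "f", 250],
    ["film_b.pdf", "PETER", "PETER", "m", 250],
]

# Sum the dialogue length per movie and per gender code
totals = defaultdict(lambda: defaultdict(int))
for movie, _script_name, _name, gender_code, length in rows:
    totals[movie][gender_code] += length


def male_percentage(movie):
    """Share of male dialogue over male plus female dialogue, as a percentage."""
    m = totals[movie]["m"]
    f = totals[movie]["f"]
    return m / (m + f) * 100


for movie in totals:
    print(movie, male_percentage(movie))
```

As in the notebook, characters with an unknown gender code are left out of the total, so the percentage only compares the `m` and `f` groups.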

In [None]:


header = ['movie','name from script', 'lemmatized name','gender', 'len of words']

# Here we save all the results in a CSV file; the names that could not be classified are then fixed manually.

with open('char.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)

    # write the header
    writer.writerow(header)

    # write multiple rows
    writer.writerows(data)

# read by default 1st sheet of an excel file
df = pd.read_excel('char.xlsx')

# After reading from the excel we group by on the basis of the movie to get the sum of male and female characters.
df2 = df.groupby(['movie','gender'])['len of words'].sum()




# movies={}

    
#     dic={}
#     if row['movie'] not in movies:
#         movies[row['movie']]
#     else:
#         print(row['movie'], row['name from script'],row["lemmatized name"],row["gender"],row["len of words"])
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3,
                       ):
    print(df2)

In [None]:
df2=df2.reset_index()


In [None]:
ddic={}
for index, row in df2.iterrows():
#     lst=[]
    if row['movie'] not in ddic:
        if row['gender'] == "m":
            ddic[row["movie"]]=[('m',row['len of words'])]
        elif row['gender'] == "f":
            ddic[row["movie"]]=[('f',row['len of words'])]
    else:
        if row['gender'] == "m":
#             print(m,ddic[row['movie']])
            ddic[row["movie"]].append(('m',row['len of words']))
        elif row['gender'] == "f":
#             print(m,ddic[row['movie']])
            ddic[row["movie"]].append(('f',row['len of words']))
        


In [None]:
# print(ddic)
lss1 =[]
lss2 = []
lss3 = []
lss4 = []

# Here we are able to make a dataframe of male and female dialogue divisions for each movie
for key in ddic:
    if len(ddic[key])>1:
        print(key,ddic[key])
        lss1.append(key)
        if ddic[key][0][0]=='m':
            lss2.append(ddic[key][0][1])
        if ddic[key][0][0]=='f':
            lss3.append(ddic[key][0][1])
        if ddic[key][1][0]=='m':
            lss2.append(ddic[key][1][1])
        if ddic[key][1][0]=='f':
            lss3.append(ddic[key][1][1])
            
            
            
df_f = pd.DataFrame(list(zip(lss1,lss2,lss3)),
               columns =['Name', 'm','f'])
df_f

In [None]:
df_f["total"] = df_f["m"] + df_f["f"]
df_f

In [None]:
# adding the percentage of male dialogue from total
df_f["male_percen"] = (df_f["m"] / df_f["total"])*100
df_f

Once we have the male percentages, we can get the total and non-male percentages as well. Then we join them with the dataframe of all the movies, which is saved in another CSV file, to combine all our results.

In [None]:
import difflib  # part of the Python standard library, so no installation is needed

df_m = pd.read_csv('../Data/bechdel/all_movies_bechdel.csv')

# Here we use the difflib library to match each script file name to the closest movie title in the dataset already prepared

dd_1 = df_m['Title'].to_list()
dd_2= df_f['Name'].to_list()
dd_3 = df_f['male_percen'].to_list()
dd_4=[]
for i in dd_2:
    dd_4.append(difflib.get_close_matches(i, dd_1, len(dd_1), 0)[0])
#     print(i,difflib.get_close_matches(i, dd_1, len(dd_1), 0)[0])

df_5 = pd.DataFrame(list(zip(dd_2,dd_4,dd_3)),
               columns =['script_name', 'movie_name','male_percen'])
df_5
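
To make the fuzzy matching step more concrete, here is a small self-contained sketch. The helper `match_title` and the three titles are assumptions for illustration; unlike the cell above, it also lower-cases both sides and drops the file extension before comparing, which is a suggested variation and not the notebook's method.

```python
import difflib


def match_title(script_name, titles):
    """Hypothetical helper: return the title closest to a script file name."""
    # Normalise both sides: drop the file extension and compare in lower case
    stem = script_name.rsplit(".", 1)[0].lower()
    lowered = {t.lower(): t for t in titles}
    # n=1 keeps only the best match; cutoff=0 means a match is always returned
    best = difflib.get_close_matches(stem, list(lowered), n=1, cutoff=0)
    return lowered[best[0]]


titles = ["Back to the Future", "Avatar", "Titanic"]
print(match_title("backtothefuture.pdf", titles))
```

Because `cutoff=0` accepts any similarity, a script whose film is missing from the list is still paired with some title, so the resulting pairs are worth reviewing by eye.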


### Final "Gaze Score"
In this step we develop a mechanism to **assign a score to each film** within our scope. This scoring is important for us because it takes into account all the factors analyzed above and assigns each film a score in the **range 0-100**.

The division of the score is as follows:
1. **Bechdel Test** (max. 40%), score assigned based on the following criteria
    1. If a movie passes **no rule**: 40%
    2. If a movie passes **only the first rule**: 26.66%
    3. If a movie passes **only the first and second rules**: 13.33%
    4. If a movie passes **all rules**: 0%
2. **Character description** (max. 35%), score assigned based on the following criteria
    1. If a female character's body is described **more than the observed average**: the percentage is assigned according to the number of occurrences, with a maximum value of 30%
    2. If a female character is described in a **dubious, problematic or sexist** manner: the score's calculation is more sensitive to these occurrences and the percentage is assigned accordingly, with a maximum value of 35%
    3. If a female character is not particularly described in any of the above manners: 0%
3. **Character dialogues** (max. 25%), score assigned based on the following criteria:
    1. If male characters have less than or equal to 50% of the overall dialogue in the script: 0%
    2. If male characters have more than or equal to 70% of the overall dialogue in the script: 25%
    3. If male characters have more than 50% and less than 70% of the overall dialogue in the script: the percentage is scaled linearly between the two thresholds, giving values between 0.1% and 24.9%
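
The Bechdel and dialogue rules listed above can be written as two small functions. This is a sketch for illustration: the function names are hypothetical, and the linear scaling follows the `(i-50)*1.25` expression used later in this notebook.

```python
def bechdel_score(rating):
    """Map a Bechdel rating (number of rules passed, 0 to 3) to its share of the gaze score."""
    return {0: 40, 1: 26.66, 2: 13.33, 3: 0}[rating]


def dialogue_score(male_percen):
    """Linear scale: 0 at 50% male dialogue or less, 25 at 70% or more."""
    if male_percen <= 50:
        return 0.0
    if male_percen >= 70:
        return 25.0
    # Position inside the 50-70 range, stretched over the 25 available points
    return (male_percen - 50) * 25 / (70 - 50)


print(bechdel_score(1), dialogue_score(60))
```

A script with 60% male dialogue sits halfway between the two thresholds and therefore receives half of the 25 available points.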


In [None]:
# final.to_csv('dialogue_score.csv')
bechdel = pd.read_csv('../Data/bechdel/all_movies_bechdel.csv')
male = pd.read_csv('dialogue_score.csv')
fin=pd.merge(bechdel,male, left_on='Title', right_on='movie_name', how='left')

def categorise(row):  
    if row['bechdel_rating'] == 0.0:
        return 40
    elif row['bechdel_rating'] == 1.0:
        return 26.66
    elif row['bechdel_rating'] == 2.0:
        return 13.33
    elif row['bechdel_rating'] == 3.0:
        return 0
    
fin['bechdel_score'] = fin.apply(lambda row: categorise(row), axis=1)

fin.to_csv('dialogue_bechdel.csv')

<p>To assign a score based on the type of character descriptions, we use the dataframe containing all the counts of body descriptions and dubious words for each film. We chose not to penalize films that contain fewer than the average number of simple body descriptions, taking into consideration that film scripts will inadvertently contain such descriptions.</p>

<p>The following function assigns a score to each film according to its number of body-description occurrences, and it is more sensitive to the dubious values, if any are found.</p>

In [None]:
import numpy as np

vals = fem_desc_graph['count'].to_list()
inapp_vals = fem_desc_graph['inappropriate_count'].to_list()

# mean number of body descriptions across all films
mean_val =  sum(vals)/len(vals)

# highest number of body descriptions found in a single film
max_val = max(vals)
max_percentile = 35 # cap for the characters' description category

# list that will hold the score of each film
score = []

for i, val in enumerate(vals):
    if val < mean_val:
        film_score = 0
    else:
        film_score = ((val-mean_val)/(max_val-mean_val))*30 + 1
    if inapp_vals[i] > 0:
        sigmoid_val = 1 / (1 + np.exp(-(inapp_vals[i] - 2.5) / 2))  # adjust the 2.5 parameter to adjust the sensitivity
        inapp_score = sigmoid_val * 10
        film_score += inapp_score
    score.append(min(film_score, max_percentile))

fem_desc_graph['score'] = score

fem_desc_graph.drop(columns=['script_name'], inplace=True)
fem_desc_graph
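
The same rule can be tried on toy numbers. The sketch below restates the loop above as a single hypothetical function, `description_score`; the mean of 5 and maximum of 10 are invented values, and the function assumes that the maximum is strictly greater than the mean.

```python
import math


def description_score(count, inapp_count, mean_val, max_val, cap=35):
    """Score one film from its body-description and dubious-word counts."""
    if count < mean_val:
        score = 0.0
    else:
        # Counts at or above the mean are scaled onto the 1-31 range
        score = (count - mean_val) / (max_val - mean_val) * 30 + 1
    if inapp_count > 0:
        # Logistic curve centred on 2.5 occurrences, worth up to 10 points
        score += 10 / (1 + math.exp(-(inapp_count - 2.5) / 2))
    return min(score, cap)


for count, inapp in [(2, 0), (5, 0), (10, 0), (10, 50)]:
    print(count, inapp, description_score(count, inapp, mean_val=5, max_val=10))
```

With these numbers, a film exactly at the mean receives 1 point and the film with the most descriptions receives 31 before the dubious-word term is added; the cap then keeps the category within 35.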

Following the mathematical model presented above, we assign the dialogue scores accordingly.

In [None]:
df_5.to_csv('male_percen.csv')
final = pd.read_csv('male_percen.csv')
score=[]
l = final['male_percen'].to_list()

# Here we assign a score for the male percentage of dialogue.
# If male characters have less than or equal to 50% of the overall dialogue in the script: 0
# If male characters have more than or equal to 70% of the overall dialogue in the script: 25
# If male characters have more than 50% and less than 70% of the overall dialogue in the script,
# the score is scaled linearly between 0 and 25

for i in l:
    if i >= 70:
        print(i,25)
        score.append(25)
    elif i <=50:
        print(i,0)
        score.append(0)
    elif i > 50 and i < 70:
        # (i-50)/(70-50) is the position within the 50-70 range; multiplied by 25 it gives (i-50)*1.25
        ss = (i-50)*1.25
        print(i,ss)
        score.append(ss)
        
        
final['score'] = score

Lastly, we proceed with creating a <b>final dataframe</b> containing all of the data retrieved from our scripts analysis and the movies' details.

In [None]:
#merge final results to get the score
import os
import pandas as pd

cwd = os.getcwd()
path ="/".join(list(cwd.split('/')[0:-1])) 

df_bechdel_dialogue = pd.read_csv(path+ '/Data/Dialogue/dialogue_bechdel.csv')
df_descriptions= pd.read_csv(path+ "/Data/Descriptions/female_descriptions.csv")

final_scores = pd.merge(df_bechdel_dialogue,df_descriptions, left_on='Title',right_on="movie",how='outer',indicator='_merge')

final_scores = final_scores[['imdbid','Title','Decade','Genre','Director','year','bechdel_rating','male_percen',
                            'nonmale_percentage','dialogue_score','bechdel_score','count','inappropriate_count','score','_merge']]
final_scores.rename(columns={'score':'descriptions_score'},inplace=True)
                    
final_scores.drop_duplicates(inplace=True)
                    
final_scores

scores_to_count = final_scores[['dialogue_score','bechdel_score','descriptions_score']]

final_scores['gaze_score'] = scores_to_count.sum(axis=1)
final_scores.drop_duplicates(subset=['Title'], inplace=True)
final_scores.drop(columns=['_merge'], inplace=True)
final_scores

## The camera: SPARQL metadata retrieval

Finally, after gathering some preliminary results from the first analyses of film scripts and IMDB reviews, we deepened our research using [**Wikidata**](https://www.wikidata.org/wiki/Wikidata:Main_Page) and its **SPARQL endpoint**.

While we had found another interesting database with a SPARQL endpoint, the [**Linked Internet Movie Database (IMDb)**](https://triplydb.com/Triply/linkedmdb), and carried out an initial phase of **data exploration** (as it was unknown to us), we quickly found that it was missing some of the information most relevant to the scope of our project, such as the gender of the people working on a movie (e.g. directors, writers...). Moreover, the "imdb id" it presented was actually different from the one on Wikidata, which, on the other hand, had all the necessary information.

The SPARQL queries are based on the results coming from the [script analysis](###The-characters:-film-and-scripts-analysis) and [review analysis](##The-audience:-webscraping,-sentiment-and-sexism) (respectively, the "characters" and "audience" sections):
- The audience results
    - Sentiment analysis: 10 out of 80 audiences expressed a very negative opinion of the movie they watched
    - Sexism detection: 17 movies had a sexist audience, but instances of such behaviour were rare and sporadic if considered over the total number of reviews for each movie
    - Overall, we found no direct link between an audience's sexism and the reviews' tone.
- The characters results
    - Bechdel test: out of the 82 films that were evaluated:
        - 38 films passed the Bechdel test
        - 5 failed all rules
        - 20 failed the second and third rule
        - 9 failed the third rule
    - Character dialogue analysis: out of all the scripts that we were able to retrieve:
        - 94% of the scripts were male dominated (more than 50% dialogues)
        - 6% scripts had non male dialogues in majority
    - Gaze score: We were successfully able to assign an arbitrary value between 0-100 to each of the films under our research. This score will help us understand gaze and will help us compare it to other variables.

Queries:
1. The "characters" queries:
    1. Bechdel test: *how many of the [selected] films have **male** directors?*
    2. Character dialogue: *what is the proportion between male and female writers in the [selected] films?*
2. Gaze score queries:
    1. *To what genre belong the top 10 films in the gaze score ranking?*
    2. *Is there any correlation between rank in the gaze score ranking, box-office and production costs?*

#### The "Characters" queries
##### Bechdel test query: *how many of the [selected and tested for Bechdel] movies have **male** directors?*
To answer this query (and the following one, regarding character dialogues) we gather data from the `dialogue_bechdel.csv` file.

We first **read the CSV file as a dataframe and clean it**, dropping all the movies which have not been tested for the Bechdel test: these movies will have a `NaN` value under the `bechdel_rating` column.

We then create an **empty list `film_list_bechdel`**, which will be filled with tuples holding the IMDB id of the movie (`imdbid` column) and its result in the Bechdel test (column `bechdel_rating`).
If the `bechdel_rating` is...:
- 0 &rarr; FAILED the first criterion
- 1 &rarr; FAILED the second criterion
- 2 &rarr; FAILED the third criterion
- 3 &rarr; PASSED the test (passed all three criteria) 

As our starting IMDB ids are actually different from those present in Wikidata (which have a prefix differentiating between titles, names, companies, events, news...), before populating the `film_list_bechdel` we need to process the ids, adding the `tt` prefix used for titles and the leading zeros needed to reach seven digits.

We do so with the appropriate function `createIMDBid`.

Then, we populate the `film_list_bechdel` and measure its length: this is the total number of movies which have been tested for the Bechdel test (72).
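
The id formatting described above amounts to zero-padding the numeric id to seven digits and adding the `tt` prefix. As a sketch, the same result can be obtained with `str.zfill`; `to_imdb_id` is a hypothetical alternative to `createIMDBid`, and it assumes the ids are stored as whole numbers (possibly read as floats by pandas).

```python
def to_imdb_id(code):
    """Hypothetical helper: pad a numeric IMDB id to 7 digits and prefix it with 'tt'."""
    # int() also accepts values such as 34583.0
    return "tt" + str(int(code)).zfill(7)


for code in (12345, 123456, 1234567):
    print(code, to_imdb_id(code))
```

Unlike a chain of length checks, this version also returns a value for ids shorter than five digits, which may or may not be desirable depending on the data.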

In [4]:
import sparql_dataframe

wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

df = pd.read_csv('../data/dialogue/dialogue_bechdel.csv')

def createIMDBid(code):
    if len(str(code)) == 5:
        return "tt00"+str(code)
    elif len(str(code)) == 6:
        return "tt0"+str(code)
    elif len(str(code)) == 7:
        return "tt"+str(code)

bechdel_df = df.dropna(axis=0, subset=["bechdel_rating"])

film_list_bechdel = list()

for idx, row in bechdel_df.iterrows():
    imdb_id = createIMDBid(row["imdbid"])
    # append the (id, rating) pair directly, without shadowing the built-in `tuple`
    film_list_bechdel.append((imdb_id, row["bechdel_rating"]))

n_films_B = len(film_list_bechdel)    # 72
print("Total number of movies tested for Bechdel test:\t",n_films_B)


Total number of movies tested for Bechdel test:	 72



We create a `ids_tpl_bechdel` tuple (to be used in the SPARQL query) containing only the formatted IMDB's ids taken from the `film_list_bechdel`.

Finally, we query the SPARQL endpoint, selecting only the movies from our list which have a **male director** (that is, a director whose "sex or gender" property `wdt:P21` has the Wikidata item `wd:Q6581097`, "male", as its value).

We use the `FILTER` and `IN` clauses to run the query on all the IMDB ids contained in our list without having to open the query connection multiple times (in our experience, doing so overworks Wikidata's query service, and the IP of the computer used to run the query gets temporarily banned).
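
The list passed to `IN` can also be built explicitly as a string. The sketch below uses a hypothetical helper, `sparql_in_list`, and a shortened query template; it only prints the resulting query text and does not contact the endpoint.

```python
def sparql_in_list(ids):
    """Hypothetical helper: render ids as a SPARQL list such as ("tt0012345", "tt0123456")."""
    return "(" + ", ".join('"{}"'.format(i) for i in ids) + ")"


# Doubled braces are literal braces once str.format has been applied
query_template = '''
    SELECT ?imdb WHERE {{
        ?movie wdt:P345 ?imdb .
        FILTER (?imdb IN {list}) .
    }}
'''

query = query_template.format(list=sparql_in_list(["tt0012345", "tt0123456"]))
print(query)

# A one-element Python tuple is formatted with a trailing comma,
# which is not valid inside a SPARQL IN list
print(str(("tt0012345",)))
```

Formatting a Python tuple directly works for two or more ids because its printed form happens to look like a SPARQL list; building the string by hand also covers the single-id case.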

In [5]:
ids_tpl_bechdel = ()

for tpl in film_list_bechdel:
    ids_tpl_bechdel = ids_tpl_bechdel + (tpl[0],)

# SPARQL
query_gender_director = '''
        SELECT ?imdb ?Movie ?Director
        WHERE {{
            ?movie wdt:P345 ?imdb ;
                    wdt:P57 ?director ;
                    rdfs:label ?Movie .
            ?director rdfs:label ?Director ;
                        wdt:P21 wd:Q6581097 .
            FILTER ((lang(?Director) = "en") && (lang(?Movie) = "en")) .
            FILTER (?imdb IN {list}) .
        }}
    '''

result_bechdel_query = sparql_dataframe.get(
    wikidata_endpoint, query_gender_director.format(list=ids_tpl_bechdel), True)

Now, as we want to add the outcome of the Bechdel test to the result of the query, we create another dataframe `add_bech_df` from the list of tuples `film_list_bechdel` (in which the first element of each tuple is the formatted IMDB id, and the second element is the outcome of the Bechdel test).

We then merge the two dataframes `result_bechdel_query` and `add_bech_df` together, using the `imdb` column as the merge key.

Now, if we count the number of rows of the updated `result_bechdel_query`, we obtain the **total number of male directors of the movies tested for the Bechdel test**.
Please note that this number is actually higher than the total number of movies tested for the Bechdel test (`n_films_B`): this is because some movies have more than one director.

The result of this query means that **no matter the result of the Bechdel test, all the movies which have been tested for it have male directors**.

In [6]:
add_bech_df = pd.DataFrame(film_list_bechdel, columns=[
                           "imdb", "Bechdel_result"])

result_bechdel_query = result_bechdel_query.merge(
    add_bech_df, left_on="imdb", right_on="imdb")


total_Mdirectors = (len(result_bechdel_query.index))  # 79
print("Total number of movies tested for Bechdel test WITH male director(s):\t",total_Mdirectors)



Total number of movies tested for Bechdel test WITH male director(s):	 79
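
Since each row of the result is a (film, director) pair, the number of rows and the number of distinct films can differ. The sketch below shows the difference on invented rows; on the real dataframe, the distinct count could be obtained with `result_bechdel_query["imdb"].nunique()`.

```python
# Invented (imdb id, director) rows standing in for the query result
rows = [
    ("tt0000001", "Director A"),
    ("tt0000001", "Director B"),   # same film, second director
    ("tt0000002", "Director C"),
]

n_rows = len(rows)                           # one row per (film, director) pair
n_films = len({imdb for imdb, _ in rows})    # distinct films with at least one listed director

print(n_rows, n_films)
```

Comparing the distinct count with `n_films_B` is a direct way to check how many of the tested films have at least one male director.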


##### Characters dialogue query: *what is the proportion between male and female writers in the [selected] films?*

For this query we again use the data from the `dialogue_bechdel.csv` file.

The reasoning behind this query is more or less the same as the previous one:
1. **Read the CSV file as a dataframe and clean it** from all the movies that have no dialogue analysis (using the `.dropna` instruction, as they will have a `NaN` value under the `male_percen` column)
2. Create a **`film_list_dlg` of tuples containing the IMDB id of the movie** (`imdbid` column) and the **information on the dialogues** (`male_percen` and `nonmale_percentage` columns); then, measure its length: this is the total number of movies which have dialogue analysis
3. Use the previously defined `createIMDBid` function to add the `tt` prefix used by Wikidata to our starting IMDB ids


In [7]:

film_list_dlg = list()

dlg_df = df.dropna(axis=0, subset=["male_percen"])

for idx, row in dlg_df.iterrows():
    imdb_id = createIMDBid(row["imdbid"])
    # append the values directly, without shadowing the built-in `tuple`
    film_list_dlg.append((imdb_id, row["male_percen"], row["nonmale_percentage"]))

n_films = len(film_list_dlg)    # 66
print("Total number of movies with a dialogue analysis:\t",n_films)

Total number of movies with a dialogue analysis:	 66


4. Create the `ids_tpl_dlg` tuple with only the formatted IMDB's ids (taken from the `film_list_dlg`)
5. Query the SPARQL endpoint selecting **the writers for each movie (regardless of their gender) and their gender** "value"
    - The use of the `OPTIONAL` clause was necessary as it seems not all writers have the gender information available

In [8]:
ids_tpl_dlg = ()

for tpl in film_list_dlg:
    ids_tpl_dlg = ids_tpl_dlg + (tpl[0],)

# SPARQL
query_gender_director = '''
    SELECT ?imdb ?Movie ?Writer ?Gender
    WHERE {{
        ?movie wdt:P345 ?imdb ;
                wdt:P58 ?writer ;
                rdfs:label ?Movie .
        ?writer rdfs:label ?Writer .
        OPTIONAL {{
            ?writer wdt:P21 ?gender .
            ?gender rdfs:label ?Gender .
            FILTER ( (lang(?Gender) = "en") )
        }}
        FILTER ( (lang(?Writer) = "en") && (lang(?Movie) = "en"))
        FILTER ( ?imdb IN {list} )
}}
'''

result_dlg_query = sparql_dataframe.get(
    wikidata_endpoint, query_gender_director.format(list=ids_tpl_dlg), True)
result_dlg_query


6. Add the outcome of the dialogue analysis through a new dataframe `add_dlg_df`, created from the list of tuples `film_list_dlg`
7. Merge the two dataframes `result_dlg_query` and `add_dlg_df`
8. Save the dataframe in a CSV file

In [None]:
add_dlg_df = pd.DataFrame(film_list_dlg, columns=[
                           "imdb", "male_percentage", "nonmale_percentage"])

# Merge the two dataframes together using the IMDB ids columns
result_dlg_query = result_dlg_query.merge(
    add_dlg_df, left_on="imdb", right_on="imdb")

result_dlg_query
# result_dlg_query.to_csv('data/sparql/dlg.csv')

Now we can quickly compare the number of male and female writers in our selection of movies. We do so by simply iterating through the `result_dlg_query` dataframe and updating the number of writers (either `n_Mwriters` or `n_Fwriters`) depending on the value under the column `Gender`. We also print out the total number of writers in our 66 selected movies.

The difference is clear and pretty straightforward.

In [None]:
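n_writers = (len(result_dlg_query.index))
print("Total number of writers of the selected 66 movies:\t",n_writers) # 154


n_Mwriters = 0
n_Fwriters = 0
n_Uwriters = 0  # writers whose gender is missing or is neither 'male' nor 'female'
for idx, row in result_dlg_query.iterrows():
    if row["Gender"] == 'male':
        n_Mwriters += 1
    elif row["Gender"] == 'female':
        n_Fwriters += 1
    else:
        n_Uwriters += 1

print("Number of male writers\t:", n_Mwriters)  # 143
# the remaining 11 rows are female writers plus writers with other or missing gender
print("Number of female writers\t:", n_Fwriters)
print("Number of writers with other or missing gender\t:", n_Uwriters)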

# result_dlg_query.to_csv('data/sparql/dlg.csv')


#### Gaze score queries

For these queries, the data used comes from the `final_scores_df.csv` CSV file.
Again, the reasoning is the same as before.


##### GS query 1: *To what genre belong the top 10 films in the gaze score ranking?*

1. **Read the CSV file as a dataframe and clean it** from all the movies that have no male gaze score (using the `.dropna` instruction, as they will have a `NaN` value under the `gaze_score` column); then, **sort it** depending on the male gaze value (`MG_df`) and then **select only the top 10 movies** (`topMG_df`)
2. Create a **`film_list_mg1` of tuples containing the IMDB id of the movie** (`imdbid` column) and the **male gaze score** (`gaze_score` column); then, measure its length, which should be 10 as only the top 10 movies are selected
3. Use the previously defined `createIMDBid` function to add the `tt` prefix used by Wikidata to our starting IMDB ids
4. Create the `ids_tpl_mg1` tuple with only the formatted IMDB's ids (taken from the `film_list_mg1`)
5. Query the SPARQL endpoint selecting **the genres for each movie** of the movies in the list

In [None]:
df_mg = pd.read_csv('../data/final_scores/final_scores_df.csv')

# drop the movies without a gaze score, then sort from the highest to the lowest score
MG_df = df_mg.dropna(axis=0, subset=["gaze_score"]).sort_values(
    by="gaze_score", ascending=False, ignore_index=True)

topMG_df = MG_df.head(10)


film_list_mg1 = list()

for idx, row in topMG_df.iterrows():
    imdb_id = createIMDBid(row["imdbid"])
    # append the (id, score) pair directly, without shadowing the built-in `tuple`
    film_list_mg1.append((imdb_id, row["gaze_score"]))

print("Top 10 movies of the male gaze score ranking:\t",len(film_list_mg1))

ids_tpl_mg1 = ()

for tpl in film_list_mg1:
    ids_tpl_mg1 = ids_tpl_mg1 + (tpl[0],)

# SPARQL
query_10_mg = '''
    SELECT ?imdb ?Movie ?Genre
    WHERE {{
        ?movie wdt:P345 ?imdb ;
                wdt:P136 ?genre ;
                rdfs:label ?Movie .
        ?genre rdfs:label ?Genre .
        FILTER ( (lang(?Movie) = "en") && (lang(?Genre) = "en"))
        FILTER ( ?imdb IN {list} )
    }}
'''

result_mg1_query = sparql_dataframe.get(wikidata_endpoint, query_10_mg.format(list=ids_tpl_mg1),True)
#result_mg1_query

6. Add the male gaze score through a new dataframe `add_mg1_df`, created from the list of tuples `film_list_mg1`
7. Merge the two dataframes `result_mg1_query` and `add_mg1_df`
8. Save the dataframe in a CSV file

In [None]:
add_mg1_df = pd.DataFrame(film_list_mg1,columns=["imdb", "gaze_score"])

result_mg1_query = result_mg1_query.merge(add_mg1_df,left_on="imdb",right_on="imdb")
# result_mg1_query.to_csv('../data/sparql/mg1.csv')
print(result_mg1_query.head())

##### GS query 2: *Is there any correlation between rank in the gaze score ranking, box-office and production costs?*

1. Use the `MG_df` already cleaned and sorted from before
2. Create a **`film_list_mg2` of tuples containing the IMDB id of the movie** (`imdbid` column) and the **male gaze score** (`gaze_score` column); then, measure its length: this is the total number of movies which have a male gaze score
3. Use the previously defined `createIMDBid` function to add the `tt` prefix used by Wikidata to our starting IMDB ids
4. Create the `ids_tpl_mg2` tuple with only the formatted IMDB's ids (taken from the `film_list_mg2`)
5. Query the SPARQL endpoint selecting **the production costs and the box office** of the movies in the list
    - Again, the `OPTIONAL` clauses were necessary because the information on production costs and box office is missing for some of the movies in the list

In [None]:
film_list_mg2 = list()

for idx, row in MG_df.iterrows():
    imdb_id = createIMDBid(row["imdbid"])
    # append the (id, score) pair directly, without shadowing the built-in `tuple`
    film_list_mg2.append((imdb_id, row["gaze_score"]))

print("Total number of movies with a male gaze score:\t",len(film_list_mg2))

ids_tpl_mg2 = ()

for tpl in film_list_mg2:
    ids_tpl_mg2 = ids_tpl_mg2 + (tpl[0],)


# SPARQL
# The box office statement is linked to the movie through p:P2142, and the filters on its
# qualifiers sit inside the OPTIONAL block, so that movies without box office data are kept
query_costs_mg = '''
SELECT ?imdb ?Movie ?ProductionCosts ?BoxOffice 
WHERE {{
  ?movie wdt:P345 ?imdb ;
        rdfs:label ?Movie .
  OPTIONAL {{
    ?movie wdt:P2130 ?ProductionCosts .
  }}
  OPTIONAL {{
    ?movie wdt:P2142 ?BoxOffice ;
           p:P2142 ?statement .
    ?statement ps:P2142 ?BoxOffice ;
               pq:P3005 ?validity .
    FILTER ( (?validity = wd:Q30) || (?validity = wd:Q49) )
    FILTER NOT EXISTS {{ ?statement pq:P1264 ?o }}
    }}
  FILTER ( lang(?Movie) = "en" )
  FILTER ( ?imdb in {list} )
}}
'''

result_mg2_query = sparql_dataframe.get(wikidata_endpoint, query_costs_mg.format(list=ids_tpl_mg2),True)
result_mg2_query.head(10)

6. Add the male gaze score through a new dataframe `add_mg2_df`, created from the list of tuples `film_list_mg2`
7. Merge the two dataframes `result_mg2_query` and `add_mg2_df`
8. Save the dataframe in a CSV file

In [None]:
add_mg2_df = pd.DataFrame(film_list_mg2,columns=["imdb", "gaze_score"])

result_mg2_query = result_mg2_query.merge(add_mg2_df,left_on="imdb",right_on="imdb")
result_mg2_query

# result_mg2_query.to_csv('../data/sparql/mg2.csv')