# Through the Gaze - Data documentation
This Jupyter Notebook analyses the data preparation and processing phase for ["NameProject"](https://ahsanv101.github.io/ProjectGaze/).

For this project, we are interested in studying the concept of the **"male gaze"** in cinema, inspired by the essay "Visual Pleasure and Narrative Cinema" by the feminist film theorist Laura Mulvey. Mulvey underlines how the "male gaze" is made of three main components:
1. The audience
2. The characters
3. The camera (i.e. the director)

To give a coherent and representative overview of the male gaze's impact on the Western film industry, we will identify the **10 highest-grossing U.S. films for each decade from the 1940s to the 2010s**. We opted for the highest-grossing movies because box-office revenue is a good proxy for popularity (a high gross means that a large number of people saw the film), and because such widely seen films help to produce a sort of cultural normativity.
Taking the highest-grossing movies per decade will therefore help us generalize our results in terms of popularity.


### Disclaimer 
This Jupyter Notebook is of an informational nature only: it is not meant to be used for the data preparation and processing, but only for the analysis and explanation of such processes.
<br>The Python files used for the clean-up can be found in the `code` folder of the [GitHub repository](https://github.com/ahsanv101/ProjectGaze).

## The audience: webscraping, sentiment and sexism
Focusing on the audience component of the male gaze meant looking through some of the **reviews** written for the movies in our dataset, considering not only the overall reception of each movie, but above all the individual reviewers' perception of it and the possible gender bias underlying their opinions.


Reviews are **not accompanied by the user that provided them**, since that information was not useful for our analysis. What is important to keep in mind is that our review dataset comprises 1972 reviews of our chosen movies, and that they are all **public and available on IMDb's review pages**. Moreover, it is essential to underline that our analysis is partial and neutral, and aims to offer useful reflections rather than harsh critiques. 

### Reviews webscraping
The first step of our audience analysis was scraping the review pages whose URLs are provided in the `movies.csv` file. To do so, we used the [**BeautifulSoup library**](https://www.crummy.com/software/BeautifulSoup/) and inspected the HTML structure of a standard IMDb review page: the textual content of each review is stored inside a `div` block with the class `text`, and this is where we access all of our data. 
<br>
The task, mostly automated, only required dividing the URLs into chunks, to speed up the overall scraping process (since we were working with a large amount of data!). 


We then stored our reviews in a dictionary, converted it into a dataframe, and finally saved it as a **`.csv` file** containing a single column, `Reviews`, alongside an index. 


In [None]:

#We used the following libraries!
import pandas as pd 
import requests
from bs4 import BeautifulSoup
import pprint
import re

#Here we initialize and modify our CSVs accordingly and we create a list for the webscraped reviews 
movies = pd.read_csv('movies.csv')
title_reviews = movies[['Title','Reviews']].copy()

text_reviews = []

#The webscraping starts here
batch_size = 79
urls = ['https://www.imdb.com/title/tt0038969/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0041838/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0031381/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0037536/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0034167/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0036872/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0039391/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0035575/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0034583/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0040806/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0049833/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0045793/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0044672/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0044672/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0047673/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0043949/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0051459/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0053291/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0048593/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0042192/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0059742/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0061722/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0064115/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0058331/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0056937/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0062622/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0055614/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0054215/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0056172/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0060164/?ref_=nv_sr_srsg_3', 'https://www.imdb.com/title/tt0073195/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0076759/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0070047/reviews?ref_=tt_urv', 
'https://www.imdb.com/title/tt0077631/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0068646/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0071230/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0075148/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0066011/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0078346/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0067093/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0080684/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0083866/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0096895/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0086190/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0087332/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0088763/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0092099/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0092644/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0096438/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0081573/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0120338/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0120915/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0107290/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0116629/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0109830/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0119654/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0099653/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0103064/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0103776/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0112462/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0468569/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0383574/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0145487/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0417741/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0121766/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0316654/reviews?ref_=tt_urv', 
'https://www.imdb.com/title/tt0418279/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0325980/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0241527/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0120755/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt4154796/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt1825683/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt2488496/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0848228/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt2527336/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0499549/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt0770828/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt3748528/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt1201607/reviews?ref_=tt_urv', 'https://www.imdb.com/title/tt1877832/reviews?ref_=tt_urv']
url_chunks = [urls[x:x+batch_size] for x in range(0, len(urls), batch_size)]

def scrape_url(url):
    #Download one review page and collect the text of every review found on it
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    for review_div in soup.find_all('div', class_='text'):
        text_reviews.append(review_div.get_text())

def scrape_batch(url_chunk):
    #Scrape every URL of a chunk; the reviews accumulate in text_reviews
    for url in url_chunk:
        scrape_url(url)

for url_chunk in url_chunks:
    scrape_batch(url_chunk)
    
#From the list, we store our results into a dictionary, to later convert into a new dataframe and CSV. 
reviews_dict = {'Reviews': text_reviews}
text_reviews = pd.DataFrame.from_dict(reviews_dict)
text_reviews.to_csv("text_reviews.csv")

### Sentiment Analysis
Once our reviews were available, it was time to start the actual analysis: this second step focused on **retrieving the sentiment of our reviews**. *Are they positive or negative?*
<br>
This aspect was later used to understand whether there was any strong correlation between the possibly sexist tone of a review and its overall sentiment: for example, *how does a poor opinion of women affect the overall perception of a movie?* *Are negative reviews the most sexist?*


To carry out the sentiment analysis, we used the [**`NLTK` library**](https://www.nltk.org/) and its **`VADER`** module, a rule-based sentiment analyzer in which terms are labeled according to their semantic orientation as either positive or negative. 
The result of this analysis was a **new dataframe** containing our `Reviews` column, a new `scores` column (containing the sentiment analysis scores, divided into negative, neutral and positive values), a `compound` column (a normalized, weighted value between -1 and 1) and a `sentiment` column, which provides a clear label distinguishing positive reviews (`pos`) from negative ones (`neg`). 

In [None]:
import nltk
nltk.download('vader_lexicon')
import numpy as np
import pandas as pd

from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer()

df = pd.read_csv('text_reviews.csv')

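#Here starts the sentiment analysis: we first drop missing and whitespace-only reviews
df.dropna(inplace=True)
empty_objects = []
for row in df.itertuples():
    #row.Index is the dataframe index of the review, row.Reviews its text
    if isinstance(row.Reviews, str) and row.Reviews.isspace():
        empty_objects.append(row.Index)
df.drop(empty_objects, inplace=True)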

#We calculate overall scores, compound value and the sentiment label. 
df['scores'] = df['Reviews'].apply(lambda Reviews: vader.polarity_scores(Reviews))
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df['sentiment'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

#... And then we obtain the CSV
df.to_csv('sentiment_reviews.csv')

### Sexism Analysis
Having established the overall sentiment of our reviews, the final step of our audience analysis consisted of **detecting possible traces of sexism in the reviews**.
<br>
To do this, we applied a model created and published by the group NLP-LTU on Hugging Face, the [**BerTweet Large Sexism Detector**](https://huggingface.co/NLP-LTU/bertweet-large-sexism-detector), a classification model for detecting sexism in Tweets or short text paragraphs. As some of our reviews were longer than the model's length limit, a few adjustments were implemented.


At the end, we obtained a clear result: our reviews were not sexist or, at least, they were *not completely* sexist.
<br>
The model categorized them as lacking any kind of gender bias but, having inspected the reviews ourselves, we knew this was not true: a few reviews showed clear signs of misogyny and sexism, not just by using offensive words such as "bitch" or "tramp" when referring to actresses or their characters, but also by constantly describing them as sexy and beautiful or by comparing them to animals. 
The model simply failed to recognize them because, in quantitative terms, those sentences weighed very little within the overall structure of the review, which otherwise had a very neutral or even positive tone. 
What emerged from this analysis is that **the audience's gaze is rarely guided by pure prejudice or malevolence**: realistically, our reviews displayed sexism in a "natural" and subtle way, so subtle that even the sexism-detector model failed to acknowledge it when analysing the bigger picture. 

However, we were not satisfied with this result: we wanted to isolate these instances of sexism and, to do so, we needed to narrow the detector's scope of analysis. Therefore, we introduced a simpler function that splits each review into individual sentences: by doing this, we could obtain a separate sexism score for each sentence and give each of them more significance. 
If a review contained even a single sexist sentence, it was marked as sexist and sorted into the final CSV according to its final sexism score. 

In [None]:
#For this code to work, the libraries Transformers and Torch are needed. 
import pandas as pd 
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer,pipeline
from transformers import BertForSequenceClassification, BertTokenizer
import torch

#We define the model, tokenizer and classifier we are going to use 
model = AutoModelForSequenceClassification.from_pretrained('NLP-LTU/bertweet-large-sexism-detector')
tokenizer = AutoTokenizer.from_pretrained('NLP-LTU/bertweet-large-sexism-detector') 
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

df = pd.read_csv('sentiment_reviews.csv')


#This portion of code generates a prediction for the OVERALL review. Short reviews are classified directly; reviews longer than 512 words are first split into chunks of at most 512 words. 
import math

for item in df['Reviews']:
    words = item.split()
    if len(words) > 512:
        n = math.ceil(len(words)/512)
        for i in range(n):
            #Rebuild the i-th chunk from the original list of words
            chunk = ' '.join(words[i*512:(i+1)*512])
            #truncation=True asks the tokenizer to cut inputs that exceed the model's length limit
            prediction = classifier(chunk, truncation=True)
            print(prediction, chunk)
    else:
        prediction = classifier(item, truncation=True)
        print(prediction, item)
          
#To work on the individual sentences, we used this instead. 

reviews = []
sentences = []

for index, item in df.Reviews.items():
    #Split the review into sentences and classify each sentence separately
    sentence = item.split('.')
    prediction = classifier(sentence)
    sentences.append(sentence)
    reviews.append(prediction)
    print([sentence, prediction])

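The review-level rule described in this section (a review counts as sexist as soon as one of its sentences does) can be sketched as follows. This is only an illustration: the label name `sexist`, the helper `aggregate_review` and the sample predictions are assumptions, not the project's actual code or output.

```python
def aggregate_review(sentence_predictions, sexist_label="sexist"):
    """Collapse per-sentence predictions into one review-level verdict.

    sentence_predictions is a list of dicts shaped like the classifier
    output, e.g. {"label": "sexist", "score": 0.91}.
    The label name is an assumption made for this sketch.
    """
    sexist_scores = [p["score"] for p in sentence_predictions
                     if p["label"] == sexist_label]
    # One sexist sentence is enough to mark the whole review
    is_sexist = len(sexist_scores) > 0
    # The review inherits the highest sexist score among its sentences
    review_score = max(sexist_scores) if is_sexist else 0.0
    return {"sexist": is_sexist, "score": review_score}


# Hypothetical per-sentence predictions for two reviews
review_a = [{"label": "not sexist", "score": 0.98},
            {"label": "sexist", "score": 0.87}]
review_b = [{"label": "not sexist", "score": 0.99}]

results = [aggregate_review(review_a), aggregate_review(review_b)]
# Sort the reviews from the highest to the lowest sexism score
ranked = sorted(results, key=lambda r: r["score"], reverse=True)
print(ranked)
```
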
## The characters: film and scripts analysis
The aim of this analysis is to measure the dominance of the male gaze within the films and their scripts. This is one of the most important analyses, as it dives directly into the core content of the cinema industry: the scripts, the basis of any film. We chose scripts because they address **the whole setting of the characters**, **how they are defined on camera** (for the viewers) and **how the male characters in the script perceive the non-male ones**. They also show what kind of dialogue and actions are assigned to male versus non-male characters, which gives us a good basis for comparison. 


### Bechdel Test
The first step of this analysis is the well-known [Bechdel Test](https://bechdeltest.com/), used for measuring **how women are represented in a given film**. There are three rules that a film needs to satisfy:

1. The movie has to have at least two women in it
2. The movie has to have at least two women who talk to each other
3. The movie has to have at least two women who talk to each other and it is about something other than a man

If a movie satisfies all three rules, it passes the Bechdel test. The test sets a bare-minimum bar that ideally every movie should clear. We will collect this data from already existing datasets and check the results for the movies within our scope. 

> **Graphs**
> 1. Passed and not passed: bar chart --> highlights difference
> 2. Stacked or donut for not passed, showing 3 layers with dynamic list of the movies
 
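Since the three rules are cumulative, a film's result can be summarised by the number of rules it satisfies. Below is a minimal sketch of that idea, assuming the source dataset encodes each film as an integer from 0 to 3; the function name, the labels and the sample films are hypothetical.

```python
# Explanation attached to each possible number of satisfied rules
BECHDEL_LABELS = {
    0: "fails rule 1: fewer than two women",
    1: "fails rule 2: the women do not talk to each other",
    2: "fails rule 3: the women only talk about a man",
    3: "passes all three rules",
}


def bechdel_outcome(rules_passed):
    """Return (passed, explanation) for a number of satisfied rules (0-3)."""
    if rules_passed not in BECHDEL_LABELS:
        raise ValueError("rules_passed must be 0, 1, 2 or 3")
    return rules_passed == 3, BECHDEL_LABELS[rules_passed]


# Hypothetical ratings, for illustration only
ratings = {"Film A": 3, "Film B": 1, "Film C": 0}
for title, rules_passed in ratings.items():
    passed, explanation = bechdel_outcome(rules_passed)
    print(title, "passed" if passed else "not passed", "-", explanation)
```
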

### Character Description
In this step we will dive into the **actual descriptions of characters in the scripts**. The idea behind using character descriptions is to understand how the camera is meant to show certain features of the characters through the use of angles: in this way the camera becomes the gaze and the (non-male) character becomes the object of the gaze.

Our aim is to automatically extract such descriptions from the scripts using Natural Language Processing and to show the words most often used to describe characters (both male and non-male), revealing the differences in the way they are portrayed. We also aim to **categorize female descriptions** as either *highly sexist* or *dubious but problematic*.

> **Graphs**
> 1. Overall picture: word cloud: him versus her
> 2. Division of descriptions in layers - donut or stacked bar with layers with dynamic list

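Once the descriptions have been extracted, the word counts behind a "him versus her" word cloud could be obtained along these lines. This is only a sketch: the `(gender, description)` input format, the tiny stop-word list and the sample descriptions are all assumptions made for illustration, not data taken from the scripts.

```python
import re
from collections import Counter

# Deliberately small stop-word list, just for this example
STOP_WORDS = {"a", "an", "the", "and", "in", "with", "of", "her", "his", "is"}


def descriptive_words(descriptions):
    """Count the non stop-words used in descriptions, per gender label."""
    counts = {}
    for gender, text in descriptions:
        words = re.findall(r"[a-z']+", text.lower())
        counts.setdefault(gender, Counter()).update(
            word for word in words if word not in STOP_WORDS
        )
    return counts


# Hypothetical character descriptions
sample = [
    ("male", "A tall, confident man in a grey suit"),
    ("non-male", "A beautiful young woman in a tight dress"),
    ("non-male", "A beautiful blonde with a nervous smile"),
]
counts = descriptive_words(sample)
print(counts["non-male"].most_common(3))
```
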

### Character Dialogue
In this step we automatically extract, again through NLP tasks, all the dialogue spoken by male and non-male characters in each script. The aim here is to show **how the spoken words are divided** between male and non-male characters. 

> **Graphs**
> 1. Vertical bar chart showing percentages between men and women

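A minimal sketch of how the dialogue share could be computed once every line has been attributed to a speaker. The `(gender, line)` input format, the labels `male` and `non-male` and the sample lines are assumptions for illustration; counting words rather than lines is also just one possible choice.

```python
from collections import defaultdict


def dialogue_share(lines):
    """Return the percentage of spoken words for each gender label."""
    word_counts = defaultdict(int)
    for gender, text in lines:
        word_counts[gender] += len(text.split())
    total = sum(word_counts.values())
    if total == 0:
        return {}
    return {gender: round(100 * count / total, 2)
            for gender, count in word_counts.items()}


# Hypothetical dialogue lines
sample = [
    ("male", "We have to leave right now"),  # 6 words
    ("non-male", "Not without her"),         # 3 words
    ("male", "Fine"),                        # 1 word
]
print(dialogue_share(sample))
```
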
### Final "Gaze Score"
In this step we will develop a mechanism to **assign a score to each film** within our scope. This score is important for us because it takes into account all the factors analyzed above and condenses them into a single value in the **range 0-100**.

The division of the score is as follows:
1. **Bechdel Test** (max. 40%), score assigned based on the following criteria
    1. If a movie passes **no rule**: 40%
    2. If a movie passes **only the first rule**: 26.66%
    3. If a movie passes **only the first and second rules**: 13.33%
    4. If a movie passes **all rules**: 0%
2. **Character description** (max. 35%), score assigned based on the following criteria
    1. If a female character is described in a **highly sexist** manner: 35%
    2. If a female character is described in a **dubious but problematic** manner: 17.5%
    3. If a female character is not described in any of the above manners: 0%
3. **Character dialogues** (max. 25%), score assigned based on the following criteria:
    1. If a male character has less than or equal to 50% of the overall dialogue in the script: 0%
    2. If a male character has more than or equal to 70% of the overall dialogue in the script: 25%
    3. If a male character has between 51% and 69% of the overall dialogue in the script: the score is assigned on the basis of the percentile, with values between 0.1% and 24.9%

> **Graphs**
> 1. bar chart

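The scoring scheme can be sketched as a small function. The point values come from the criteria listed in this section; the linear interpolation used for dialogue shares between 50% and 70% is an assumption (the text only says the value is assigned "on the basis of the percentile"), and the function names and description labels are hypothetical.

```python
# Points for the number of Bechdel rules satisfied (0-3)
BECHDEL_POINTS = {0: 40.0, 1: 26.66, 2: 13.33, 3: 0.0}
# Points for the way female characters are described
DESCRIPTION_POINTS = {"highly_sexist": 35.0, "dubious": 17.5, "none": 0.0}


def dialogue_points(male_share):
    """Map the male share of dialogue (in percent) to 0-25 points."""
    if male_share <= 50:
        return 0.0
    if male_share >= 70:
        return 25.0
    # Assumption: linear interpolation between the two thresholds
    return 25.0 * (male_share - 50) / 20


def gaze_score(rules_passed, description_label, male_share):
    """Sum the three components into a score between 0 and 100."""
    total = (BECHDEL_POINTS[rules_passed]
             + DESCRIPTION_POINTS[description_label]
             + dialogue_points(male_share))
    return round(total, 2)


print(gaze_score(0, "highly_sexist", 80))  # worst case on every criterion
print(gaze_score(3, "none", 50))           # best case on every criterion
```
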

## The camera: SPARQL metadata retrieval

Finally, after gathering some preliminary results from the first analyses of film scripts and IMDb reviews, we further deepened our research using the [**Linked Movie Database (LinkedMDB)**](https://triplydb.com/Triply/linkedmdb) and its **SPARQL endpoint**, hosted on Triply.

Since we had no prior knowledge of the structure of this knowledge base, an initial phase of **data exploration** was necessary.

Afterwards, it was finally possible to run the queries and save the results in an appropriate format, so as to visualize the data and gather insights from it.

### Data exploration
In general, there are two different types of statements (triples) in knowledge bases: **T-Box** statements and **A-Box** statements.
1. **T-Box** (Terminological Box) statements describe the domain of interest defining classes and properties as the domain vocabulary; they contain information related to the **structure of the dataset**
2. **A-Box** (Assertional Box) statements provide facts associated with the T-Box's conceptual model or ontologies; they contain information on instances and the relationships between them: the **main content of the dataset**

For the exploration of the IMDB dataset, we will follow this theoretical structure.

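To make the distinction concrete, here is a rough sketch that sorts a few triples into the two boxes. It is a simplified heuristic, not a complete definition: the sets of schema-level predicates and classes are a partial selection, and the `ex:` resources are invented for the example.

```python
# Predicates that describe the vocabulary itself (partial list)
SCHEMA_PREDICATES = {"rdfs:subClassOf", "rdfs:subPropertyOf",
                     "rdfs:domain", "rdfs:range"}
# Classes whose instances are themselves classes or properties (partial list)
SCHEMA_CLASSES = {"rdfs:Class", "owl:Class", "rdf:Property",
                  "owl:ObjectProperty", "owl:DatatypeProperty"}


def box_of(triple):
    """Classify a (subject, predicate, object) triple as T-Box or A-Box."""
    _, predicate, obj = triple
    if predicate in SCHEMA_PREDICATES:
        return "T-Box"
    if predicate == "rdf:type" and obj in SCHEMA_CLASSES:
        return "T-Box"
    return "A-Box"


# Hypothetical triples using an invented ex: namespace
triples = [
    ("ex:Actor", "rdfs:subClassOf", "foaf:Person"),    # structure
    ("ex:Film", "rdf:type", "owl:Class"),              # structure
    ("ex:film1", "rdf:type", "ex:Film"),               # instance data
    ("ex:film1", "dcterms:title", "An Example Film"),  # instance data
]
for triple in triples:
    print(box_of(triple), triple)
```
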
#### T-Box (Terminological Box)
A first interesting query is to check the **number of triples** contained in the knowledge base, to get a sense of its size.


In [12]:
# Import library to display the results cleanly
import sparql_dataframe

# Reference resource: IMDB SPARQL endpoint URL
endpoint = 'https://api.triplydb.com/datasets/Triply/linkedmdb/services/linkedmdb/sparql'

# Query we want to run: how many triples are in the LOD source?
query_triples_count = '''
    SELECT (COUNT (*) AS ?tripleCount)
    WHERE {
        ?s ?p ?o
    }
'''

# Create dataframe and print it
df = sparql_dataframe.get(endpoint, query_triples_count)
print(f'The total number of triples is:\n {df}')

The total number of triples is:
    tripleCount
0      6950066


Then, to quickly understand the kind of data available, we **listed the predicates used**, sorted alphabetically, which immediately reveals interesting facts about the content of the dataset.

In [13]:
# List predicates 
query_predicates = '''
    SELECT DISTINCT ?p
    WHERE { 
    ?s ?p ?o .
    } ORDER BY ?p
'''

df = sparql_dataframe.get(endpoint, query_predicates)
print(f'The list of predicates:\n {df}')

The list of predicates:
                                                      p
0       http://dbpedia.org/property/hasPhotoCollection
1                        http://purl.org/dc/terms/date
2                       http://purl.org/dc/terms/title
3                       http://rdfs.org/ns/void#subset
4    http://www.openlinksw.com/schemas/virtrdf#dialect
..                                                 ...
227  https://triplydb.com/Triply/linkedmdb/vocab/st...
228  https://triplydb.com/Triply/linkedmdb/vocab/ty...
229  https://triplydb.com/Triply/linkedmdb/vocab/wr...
230  https://triplydb.com/Triply/linkedmdb/vocab/wr...
231  https://triplydb.com/Triply/linkedmdb/vocab/wr...

[232 rows x 1 columns]


The presence of `rdfs:subClassOf` indicates that the data has some structure, while `dcterms:title` shows that the knowledge graph deals with works (unsurprisingly, since this knowledge base is built on movie data); finally, `foaf` indicates the presence of information about people. These are only some of the many ontologies used in this knowledge graph.

Having sorted the resulting dataframe alphabetically, we can immediately and easily see the **many different ontologies** employed:

In [14]:
# Print each row of the dataframe
for idx, row in df.iterrows():
    print(row['p'])

http://dbpedia.org/property/hasPhotoCollection
http://purl.org/dc/terms/date
http://purl.org/dc/terms/title
http://rdfs.org/ns/void#subset
http://www.openlinksw.com/schemas/virtrdf#dialect
http://www.openlinksw.com/schemas/virtrdf#dialect-exceptions
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/2000/01/rdf-schema#SeeAlso
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/2000/01/rdf-schema#subClassOf
http://www.w3.org/2002/07/owl#sameAs
http://www.w3.org/2004/02/skos/core#subject
http://xmlns.com/foaf/0.1/based_near
http://xmlns.com/foaf/0.1/made
http://xmlns.com/foaf/0.1/page
https://triplydb.com/Triply/linkedmdb/id/oddlinker/link_source
https://triplydb.com/Triply/linkedmdb/id/oddlinker/link_target
https://triplydb.com/Triply/linkedmdb/id/oddlinker/link_type
https://triplydb.com/Triply/linkedmdb/id/oddlinker/linkage_date
https://triplydb.com/Triply/linkedmdb/id/oddlinker/linkage_method
https://triplydb.com/Triply/linkedmdb/id/oddlinker/linkage_run
https:

We have:
- [DBpedia](https://www.dbpedia.org/)
- [Dcterms](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/)
- [OWL](https://www.w3.org/TR/owl-features/)
- [RDF Schema](https://www.w3.org/TR/rdf-schema/)
- [SKOS](https://www.w3.org/TR/skos-reference/)
- [FOAF](http://xmlns.com/foaf/0.1/)
- [VoID](https://www.w3.org/TR/void/)
- [Virtrdf](https://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFViewNorthwindOntology)
- linkedmdb/id
- linkedmdb/vocab

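The same list can also be derived programmatically by cutting each predicate IRI at its last `#` or `/` and counting the resulting namespaces. This is a sketch: the helper `namespace_of` is hypothetical, and the sample IRIs are a handful copied from the predicate list in this section rather than the full query result.

```python
from collections import Counter


def namespace_of(iri):
    """Return the IRI up to and including its last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[:cut + 1]


# A few predicates taken from the list of predicates
predicates = [
    "http://purl.org/dc/terms/date",
    "http://purl.org/dc/terms/title",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "http://xmlns.com/foaf/0.1/page",
]

# Count how many predicates belong to each namespace
namespace_counts = Counter(namespace_of(p) for p in predicates)
for namespace, count in namespace_counts.items():
    print(namespace, count)
```
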
It is also interesting to understand which are the **most used properties**:

In [15]:
# Most used predicates list
query_predicate_repetition = '''
    SELECT ?p (COUNT(?p) AS ?predicate)
    WHERE { 
    ?s ?p ?o .
    }
    GROUP BY ?p
    ORDER BY DESC(?predicate)
'''

df = sparql_dataframe.get(endpoint, query_predicate_repetition)
print(f'The number of times each predicate is used:\n {df}')


The number of times each predicate is used:
                                                      p  predicate
0      http://www.w3.org/1999/02/22-rdf-syntax-ns#type     817073
1           http://www.w3.org/2000/01/rdf-schema#label     722092
2                       http://xmlns.com/foaf/0.1/page     577112
3    https://triplydb.com/Triply/linkedmdb/vocab/pe...     390322
4    https://triplydb.com/Triply/linkedmdb/vocab/actor     284409
..                                                 ...        ...
227  https://triplydb.com/Triply/linkedmdb/id/oddli...          7
228                     http://rdfs.org/ns/void#subset          2
229  http://www.openlinksw.com/schemas/virtrdf#dial...          1
230  http://www.openlinksw.com/schemas/virtrdf#dialect          1
231  https://triplydb.com/Triply/linkedmdb/vocab/fi...          1

[232 rows x 2 columns]


An interesting insight we get from this first exploration is the presence of a vocabulary specific to this dataset: **linkedmdb**.

As linkedmdb/id obviously contains only linkage knowledge, we are more interested in linkedmdb/vocab and, to analyse it further, we can select only the properties belonging to it:

In [16]:
# List linkedimdb/vocab predicates 
query_predicates = '''
    SELECT DISTINCT ?p
    WHERE { 
        ?s ?p ?o .
        FILTER regex(str(?p), "https://triplydb.com/Triply/linkedmdb/vocab/", "i")
    }
    ORDER BY ?p
'''

df = sparql_dataframe.get(endpoint, query_predicates)
print(f'The list of linkedimdb/vocab predicates:\n {df}')

The list of linkedimdb/vocab predicates:
                                                      p
0    https://triplydb.com/Triply/linkedmdb/vocab/actor
1    https://triplydb.com/Triply/linkedmdb/vocab/ac...
2    https://triplydb.com/Triply/linkedmdb/vocab/ac...
3    https://triplydb.com/Triply/linkedmdb/vocab/ac...
4    https://triplydb.com/Triply/linkedmdb/vocab/ac...
..                                                 ...
205  https://triplydb.com/Triply/linkedmdb/vocab/st...
206  https://triplydb.com/Triply/linkedmdb/vocab/ty...
207  https://triplydb.com/Triply/linkedmdb/vocab/wr...
208  https://triplydb.com/Triply/linkedmdb/vocab/wr...
209  https://triplydb.com/Triply/linkedmdb/vocab/wr...

[210 rows x 1 columns]


We can now use this as a sort of "ordered vocabulary" for easily finding and selecting the predicates that could be the most useful for our intended queries. 

Moving on to Classes, we first need to understand **how Classes are defined**: through other ontologies, such as `owl:Class` or `rdfs:Class` (declared with `rdf:type`/`a`), or autonomously by the dataset (in the latter case our queries return no results, which is what actually happens).

In [17]:
query_classes_rdfs = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?c
    WHERE {
        ?c a rdfs:Class .
    }
    ORDER BY ?c
'''
df1 = sparql_dataframe.get(endpoint, query_classes_rdfs)
print(f'The list of classes (rdfs:Class):\n {df1}')

query_classes_owl = '''
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT DISTINCT ?c
    WHERE {
        ?c a owl:Class .
    }
    ORDER BY ?c
'''
df2 = sparql_dataframe.get(endpoint, query_classes_owl)
print(f'The list of classes (owl:Class):\n {df2}')

The list of classes (rdfs:Class):
 Empty DataFrame
Columns: [c]
Index: []
The list of classes (owl:Class):
 Empty DataFrame
Columns: [c]
Index: []


Given the lack of results, it is clear that this dataset defines its classes autonomously. To gather the list of **class types** (alphabetically ordered) we can look for the type of each subject (declared with either `rdf:type` or `a`):

In [18]:
query_concepts = '''
    SELECT DISTINCT ?concept 
    WHERE {
        ?s a ?concept .
    }
    ORDER BY ?concept
'''
df = sparql_dataframe.get(endpoint, query_concepts)
print(f'The list of Classes types:\n {df}')

The list of Classes types:
                                               concept
0                     http://rdfs.org/ns/void#Dataset
1                     http://xmlns.com/foaf/0.1/Agent
2                    http://xmlns.com/foaf/0.1/Person
3   https://triplydb.com/Triply/linkedmdb/id/oddli...
4   https://triplydb.com/Triply/linkedmdb/id/oddli...
5   https://triplydb.com/Triply/linkedmdb/vocab/Actor
6   https://triplydb.com/Triply/linkedmdb/vocab/Ar...
7   https://triplydb.com/Triply/linkedmdb/vocab/Ca...
8   https://triplydb.com/Triply/linkedmdb/vocab/Ci...
9   https://triplydb.com/Triply/linkedmdb/vocab/Co...
10  https://triplydb.com/Triply/linkedmdb/vocab/Co...
11  https://triplydb.com/Triply/linkedmdb/vocab/Co...
12  https://triplydb.com/Triply/linkedmdb/vocab/Co...
13  https://triplydb.com/Triply/linkedmdb/vocab/Di...
14  https://triplydb.com/Triply/linkedmdb/vocab/Du...
15  https://triplydb.com/Triply/linkedmdb/vocab/Ed...
16   https://triplydb.com/Triply/linkedmdb/vocab/Film


Again, if we are more interested in the **linkedmdb/vocab Classes**, we can easily list them all and have an "ordered vocabulary" (the process is the same as for the predicates)

In [19]:
query_concepts_imdb = '''
    SELECT DISTINCT ?concept 
    WHERE {
        ?s a ?concept .
        FILTER regex(str(?concept), "https://triplydb.com/Triply/linkedmdb/vocab/", "i")
    }
    ORDER BY ?concept
'''
df = sparql_dataframe.get(endpoint, query_concepts_imdb)
print(f'The list of imdb Classes types:\n {df}')

The list of imdb Classes types:
                                               concept
0   https://triplydb.com/Triply/linkedmdb/vocab/Actor
1   https://triplydb.com/Triply/linkedmdb/vocab/Ar...
2   https://triplydb.com/Triply/linkedmdb/vocab/Ca...
3   https://triplydb.com/Triply/linkedmdb/vocab/Ci...
4   https://triplydb.com/Triply/linkedmdb/vocab/Co...
5   https://triplydb.com/Triply/linkedmdb/vocab/Co...
6   https://triplydb.com/Triply/linkedmdb/vocab/Co...
7   https://triplydb.com/Triply/linkedmdb/vocab/Co...
8   https://triplydb.com/Triply/linkedmdb/vocab/Di...
9   https://triplydb.com/Triply/linkedmdb/vocab/Du...
10  https://triplydb.com/Triply/linkedmdb/vocab/Ed...
11   https://triplydb.com/Triply/linkedmdb/vocab/Film
12  https://triplydb.com/Triply/linkedmdb/vocab/Fi...
13  https://triplydb.com/Triply/linkedmdb/vocab/Fi...
14  https://triplydb.com/Triply/linkedmdb/vocab/Fi...
15  https://triplydb.com/Triply/linkedmdb/vocab/Fi...
16  https://triplydb.com/Triply/linkedmdb/vocab/F

We can also check how many predicates are associated with each Class:

In [20]:
query_property_per_type = '''
    SELECT DISTINCT ?type (COUNT(DISTINCT ?p) AS ?count)
    WHERE {
        ?s a ?type . 
        ?s ?p ?o . 
    }
    GROUP BY ?type
    ORDER BY DESC(?count)
'''

df = sparql_dataframe.get(endpoint, query_property_per_type)
print(f'The number of properties per type in descending order:\n {df}')

The number of properties per type in descending order:
                                                  type  count
0    https://triplydb.com/Triply/linkedmdb/vocab/Film     48
1                    http://xmlns.com/foaf/0.1/Person     19
2   https://triplydb.com/Triply/linkedmdb/vocab/Co...     15
3   https://triplydb.com/Triply/linkedmdb/vocab/Fi...     13
4   https://triplydb.com/Triply/linkedmdb/vocab/Fi...     12
5   https://triplydb.com/Triply/linkedmdb/vocab/Pe...     12
6   https://triplydb.com/Triply/linkedmdb/vocab/Actor     10
7   https://triplydb.com/Triply/linkedmdb/vocab/Pe...      9
8   https://triplydb.com/Triply/linkedmdb/vocab/Fi...      9
9   https://triplydb.com/Triply/linkedmdb/vocab/Du...      9
10  https://triplydb.com/Triply/linkedmdb/vocab/Co...      9
11  https://triplydb.com/Triply/linkedmdb/id/oddli...      8
12  https://triplydb.com/Triply/linkedmdb/vocab/Fi...      8
13  https://triplydb.com/Triply/linkedmdb/vocab/Fi...      7
14  https://triplydb.com/Trip
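
Long URIs are cut off with `...` in the printed tables, which makes several Classes indistinguishable. A minimal sketch of one way to keep them readable is to replace the namespace with a short prefix. The prefixes `lmdb:` and `foaf:` and the helper `shorten_uri` are our own assumptions for this illustration, not names defined by the endpoint or by `sparql_dataframe`.

```python
# Namespace-to-prefix map. The short prefixes "lmdb:" and "foaf:" are our own
# labels for this sketch, not names defined by the endpoint.
PREFIXES = {
    "https://triplydb.com/Triply/linkedmdb/vocab/": "lmdb:",
    "http://xmlns.com/foaf/0.1/": "foaf:",
}


def shorten_uri(uri):
    """Replace a known namespace with its short prefix; leave other URIs unchanged."""
    for namespace, prefix in PREFIXES.items():
        if uri.startswith(namespace):
            return prefix + uri[len(namespace):]
    return uri


# Three rows copied from the output above (type URI, number of properties).
rows = [
    ("https://triplydb.com/Triply/linkedmdb/vocab/Film", 48),
    ("http://xmlns.com/foaf/0.1/Person", 19),
    ("https://triplydb.com/Triply/linkedmdb/vocab/Actor", 10),
]

short_rows = [(shorten_uri(uri), count) for uri, count in rows]
for name, count in short_rows:
    print(f"{name:<15}{count:>5}")
```

On the DataFrame returned by the query, the same function could be applied with `df['type'].map(shorten_uri)`. Alternatively, setting the pandas option `display.max_colwidth` to `None` prints the full URIs.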

#### A-Box (Assertional Box)

The instances of the Classes are the actual "content" of a dataset. A first query can therefore look at **how many instances each Class has**, which shows the most recurrent concepts.


In [21]:
query_instance_per_concept = '''
    SELECT ?concept (COUNT (?s) AS ?instanceCount) 
    WHERE {
    ?s a ?concept . 
    }
    GROUP BY ?concept
    ORDER BY DESC(?instanceCount)
'''

df = sparql_dataframe.get(endpoint, query_instance_per_concept)
print(f'The number of instances per class are:\n {df}')

The number of instances per class are:
                                               concept  instanceCount
0   https://triplydb.com/Triply/linkedmdb/vocab/Pe...         199771
1   https://triplydb.com/Triply/linkedmdb/id/oddli...         162199
2    https://triplydb.com/Triply/linkedmdb/vocab/Film          98816
3                    http://xmlns.com/foaf/0.1/Person          97858
4   https://triplydb.com/Triply/linkedmdb/vocab/Actor          68205
5   https://triplydb.com/Triply/linkedmdb/vocab/Fi...          45423
6   https://triplydb.com/Triply/linkedmdb/vocab/Wr...          23664
7   https://triplydb.com/Triply/linkedmdb/vocab/Di...          21966
8   https://triplydb.com/Triply/linkedmdb/vocab/Pr...          18408
9   https://triplydb.com/Triply/linkedmdb/vocab/Fi...          17237
10  https://triplydb.com/Triply/linkedmdb/vocab/Fi...          16118
11  https://triplydb.com/Triply/linkedmdb/vocab/Fi...          15256
12  https://triplydb.com/Triply/linkedmdb/vocab/Mu...          

From the previous analysis of the properties, we can see that the two most used properties are `rdf:type` and `rdfs:label`. Typographical errors in labels are an important issue, because they multiply the labels attached to the same instance. To get around it, we can use `SAMPLE`, a SPARQL aggregate function that returns a single, arbitrarily chosen value per group.

In [22]:
query_instance_label = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
    SELECT ?instance 
        (SAMPLE(?label) AS ?instanceLabel) 
        (COUNT(?instance) AS ?instanceCount) 
    WHERE { 
        ?instance a ?class . 
        OPTIONAL { ?instance rdfs:label ?label . } 
    }
    GROUP BY ?instance
    ORDER BY DESC(?instanceCount)
'''

df = sparql_dataframe.get(endpoint, query_instance_label)
print(f'The list of instances with labels and repetitions:\n {df}')

The list of instances with labels and repetitions:
                                                instance  \
0     https://triplydb.com/Triply/linkedmdb/id/actor...   
1     https://triplydb.com/Triply/linkedmdb/id/direc...   
2     https://triplydb.com/Triply/linkedmdb/id/actor...   
3     https://triplydb.com/Triply/linkedmdb/id/direc...   
4     https://triplydb.com/Triply/linkedmdb/id/actor...   
...                                                 ...   
9995  https://triplydb.com/Triply/linkedmdb/id/actor...   
9996  https://triplydb.com/Triply/linkedmdb/id/actor...   
9997  https://triplydb.com/Triply/linkedmdb/id/actor...   
9998  https://triplydb.com/Triply/linkedmdb/id/actor...   
9999  https://triplydb.com/Triply/linkedmdb/id/actor...   

                 instanceLabel  instanceCount  
0          Bill Knight (Actor)              2  
1      Susi Ganesan (Director)              2  
2         Linda Haynes (Actor)              2  
3          Lu Chuan (Director)              2  
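
To make the effect of `SAMPLE` more concrete, the grouping can be emulated in plain Python. This is only a sketch under stated assumptions: the rows and the instance identifiers are hypothetical, the misspelled label is invented for illustration, and taking the first label stands in for the arbitrary choice that `SAMPLE` makes.

```python
from collections import defaultdict

# Hypothetical solution rows (instance, class, label). The second instance has
# two spellings of its label; the misspelled one is invented for this sketch.
rows = [
    ("actor/1", "Actor", "Bill Knight (Actor)"),
    ("actor/1", "Person", "Bill Knight (Actor)"),
    ("actor/2", "Actor", "Linda Haynes (Actor)"),
    ("actor/2", "Actor", "Linda Hayns (Actor)"),
]

# GROUP BY ?instance: collect every label seen for each instance.
labels_per_instance = defaultdict(list)
for instance, _class, label in rows:
    labels_per_instance[instance].append(label)

# SAMPLE(?label) keeps one label per group (here: the first one seen),
# COUNT(?instance) counts the rows that fell into the group.
summary = {
    instance: {"instanceLabel": labels[0], "instanceCount": len(labels)}
    for instance, labels in labels_per_instance.items()
}

for instance, values in summary.items():
    print(instance, values["instanceLabel"], values["instanceCount"])
```

Each instance ends up on a single row regardless of how many label variants it has, which is the behaviour we rely on in the query.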

### SPARQL Queries

We can now properly state our queries to the knowledge graph, and we do so based on the results coming from the **script analysis** and **review analysis** (respectively, the "characters" and the "audience" factors):
- THE AUDIENCE RESULTS:
    - some results here
- THE CHARACTERS RESULTS:
    - BECHDEL TEST results
    - CHARACTER DIALOGUE results

Queries:
1. THE AUDIENCE QUERY
2. THE CHARACTERS QUERY    
    1. BECHDEL TEST: how many of the movies that **did not pass** the Bechdel test have **male directors**?
    2. CHARACTER DIALOGUE: how many of the movies with a **majority of male dialogue** have **male writers**?
3. GAZE SCORE
    1. Compare the movies: what are their characteristics in terms of metadata?
        > We can build different **scatterplots** to see whether there seem to be correlations between the movies' metadata (e.g. compare DIRECTOR, GENRE, DURATION)
        1. We can also use RATING to see the popularity of each movie
        2. CAST composition may be an interesting point to discuss
        3. ...

To query the SPARQL endpoint we will use the IMDB ID of each movie, already contained in our `data/Dialogue/dialogue_bechdel.csv`, together with the prefixes listed by Wikidata, which distinguish between titles ('tt'), names ('nm'), characters ('ch', deprecated), companies ('co'), events ('ev'), and news ('ni').

Specifically, we will use the 'tt'+00 prefix.
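
As an illustration, a title ID of this kind can be built from a bare number. The sketch below assumes that "'tt'+00" means the 'tt' prefix followed by the numeric ID padded with leading zeros to at least seven digits. The helper name `to_imdb_title_id` is hypothetical, and the format in which the IDs are stored in the CSV is an assumption as well.

```python
def to_imdb_title_id(raw_id, width=7):
    """Return an IMDB-style title ID: 'tt' plus the zero-padded numeric part.

    Assumption: IDs may arrive as integers, as digit strings, or already
    prefixed with 'tt'; all three forms are normalised to the same output.
    """
    digits = str(raw_id).strip()
    if digits.lower().startswith("tt"):
        digits = digits[2:]
    if not digits.isdigit():
        raise ValueError(f"Not a numeric IMDB title ID: {raw_id!r}")
    # zfill only adds zeros on the left; longer IDs are left untouched.
    return "tt" + digits.zfill(width)


print(to_imdb_title_id(68646))
print(to_imdb_title_id("0076759"))
print(to_imdb_title_id("tt0068646"))
```

Normalising every ID through one function keeps the values comparable before they are sent to the endpoint.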

The metadata we are interested in are director, genre and duration.
The input of the query is going to be a list of movies (selected from the gaze score ranking).
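
Passing a list of movies to a query can be sketched with a SPARQL `VALUES` block, which restricts a variable to a fixed set of values. This is a minimal sketch, not the query used in the project: the helper `build_values_clause`, the `ex:` namespace and the predicate `ex:imdbId` are placeholders that must be replaced with the property the knowledge graph actually uses, and the two IDs are only examples.

```python
def build_values_clause(variable, ids):
    """Return a SPARQL VALUES block binding `variable` to each ID as a string literal."""
    for value in ids:
        # Only plain alphanumeric IDs are accepted, so no escaping is needed.
        if not value.isalnum():
            raise ValueError(f"Unexpected characters in ID: {value!r}")
    literals = " ".join(f'"{value}"' for value in ids)
    return f"VALUES ?{variable} {{ {literals} }}"


# Placeholder template: ex:imdbId is a hypothetical predicate.
QUERY_TEMPLATE = """
PREFIX ex: <http://example.org/>
SELECT ?film ?imdbId
WHERE {{
    {values}
    ?film ex:imdbId ?imdbId .
}}
"""

# Example IDs only; in practice the list would come from the gaze score ranking.
imdb_ids = ["tt0068646", "tt0076759"]

query_movies = QUERY_TEMPLATE.format(values=build_values_clause("imdbId", imdb_ids))
print(query_movies)
```

The resulting string could then be sent to the endpoint in the same way as the exploratory queries, i.e. with `sparql_dataframe.get(endpoint, query_movies)`.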

#### The "Audience" query
#### The "Characters" queries
#### Gaze score queries

### References 
- Abox, Wikipedia. https://en.wikipedia.org/wiki/Abox
- "How to explore an unknown dataset - quickstart" by M. Daquino
- DuCharme Bob, "Exploring a SPARQL endpoint", August 24, 2014. https://www.bobdc.com/blog/exploring-a-sparql-endpoint/.
- DuCharme Bob, "Queries to explore a dataset", April 30, 2022. https://www.bobdc.com/blog/exploringadataset/.