# ProjectName - Data documentation

This Jupyter Notebook documents the data preparation and processing phase of ["NameProject"](https://github.com/ahsanv101/ProjectGaze), a data visualization project on the perception of the "male gaze" in the highest-grossing movies in the USA from the 1940s to the 2010s.

### Disclaimer
This Jupyter Notebook is informational only: it is not meant to be run for data preparation and processing, but to analyse and explain those processes.
<br>The Python files used for the clean up can be found in `scripts > [namefile].py`.

## Webscraping - SA on reviews

## Film Scripts analysis

## SPARQL Data retrieval

Finally, after gathering preliminary results from the first analyses of film scripts and IMDb reviews, we deepened our research using the [**Linked Movie Database (LinkedMDB)**](https://triplydb.com/Triply/linkedmdb) and its **SPARQL endpoint**, hosted on Triply.

Since we did not know the structure of this knowledge base beforehand, an initial phase of **data exploration** was necessary.

Only afterwards could we run the queries and save the results in a format suitable for visualizing the data and gaining insight from it.
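As a minimal sketch of the "save the results" step, assuming the query results are available as a list of dictionaries (one per row), they could be written to CSV with the standard library alone. The function name `rows_to_csv` and the sample row are illustrative only and are not taken from the project's scripts.

```python
import csv
import io


def rows_to_csv(rows, fieldnames, target):
    """Write a list of dict rows to a file-like object as CSV."""
    writer = csv.DictWriter(target, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Illustrative row only (made-up values, not real query output)
sample_rows = [{"film": "Example Film", "director": "Example Director"}]

# An in-memory buffer stands in for a file opened with open(path, "w", newline="")
buffer = io.StringIO()
rows_to_csv(sample_rows, ["film", "director"], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For the pandas dataframes used in this notebook, `df.to_csv(path, index=False)` would achieve the same result.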

### Data exploration
In general, there are two different types of statements (triples) in knowledge bases: **T-Box** statements and **A-Box** statements.
1. **T-Box** (Terminological Box) statements describe the domain of interest by defining classes and properties as the domain vocabulary; they contain information about the **structure of the dataset**.
2. **A-Box** (Assertional Box) statements provide facts associated with the T-Box's conceptual model or ontologies; they contain information on instances and the relationships between them, i.e. the **main content of the dataset**.

For the exploration of the LinkedMDB dataset, we will follow this theoretical structure.
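The T-Box/A-Box distinction can be illustrated with a deliberately simplified sketch. It assumes that a triple is terminological when it uses a schema-level RDFS predicate or types its subject as a class or property, and assertional otherwise. The `http://example.org/` URIs and the function name `box_of` are hypothetical, and real knowledge bases may blur this boundary.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# Predicates that describe the vocabulary itself (simplified selection)
SCHEMA_PREDICATES = {
    RDFS + "subClassOf",
    RDFS + "subPropertyOf",
    RDFS + "domain",
    RDFS + "range",
}
# Types whose instances are classes or properties (simplified selection)
SCHEMA_TYPES = {
    RDFS + "Class",
    OWL + "Class",
    OWL + "ObjectProperty",
    OWL + "DatatypeProperty",
}


def box_of(triple):
    """Return 'T-Box' for schema-level triples and 'A-Box' for all others."""
    _subject, predicate, obj = triple
    if predicate in SCHEMA_PREDICATES:
        return "T-Box"
    if predicate == RDF_TYPE and obj in SCHEMA_TYPES:
        return "T-Box"
    return "A-Box"


# Hypothetical triples, for illustration only
EX = "http://example.org/"
triples = [
    (EX + "Actor", RDFS + "subClassOf", EX + "Person"),  # structure
    (EX + "Film", RDF_TYPE, OWL + "Class"),              # structure
    (EX + "film/1", RDF_TYPE, EX + "Film"),              # content
    (EX + "film/1", EX + "actor", EX + "actor/1"),       # content
]
for triple in triples:
    print(box_of(triple), triple)
```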

#### T-Box (Terminological Box)
A first interesting query is to check the **number of triples** contained in the knowledge base, to get a sense of its size.


In [36]:
# Import library to display the results cleanly
import sparql_dataframe

# Reference resource: LinkedMDB SPARQL endpoint URL (hosted on Triply)
endpoint = 'https://api.triplydb.com/datasets/Triply/linkedmdb/services/linkedmdb/sparql'

# Query we want to run: how many triples are in the LOD source?
query_triples_count = '''
    SELECT (COUNT (*) AS ?tripleCount)
    WHERE {
        ?s ?p ?o
    }
'''

# Create dataframe and print it
df = sparql_dataframe.get(endpoint, query_triples_count)
print(f'The total number of triples is:\n {df}')

The total number of triples is:
    tripleCount
0      6950066
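The same kind of request can also be sketched without `sparql_dataframe`, using only the standard library. The sketch assumes that the endpoint follows the SPARQL 1.1 Protocol (query sent as a `query` parameter over HTTP GET) and can return the standard SPARQL JSON results format. `run_query` and `parse_sparql_json` are illustrative names, and the sample payload is hand-written to mirror the count printed above rather than captured from the endpoint.

```python
import json
import urllib.parse
import urllib.request


def parse_sparql_json(payload):
    """Flatten a SPARQL JSON result into a list of {variable: value} rows."""
    variables = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        # Unbound variables are simply absent from a binding
        rows.append({
            var: binding[var]["value"] if var in binding else None
            for var in variables
        })
    return rows


def run_query(endpoint, query, timeout=30):
    """Send a SELECT query over HTTP GET and return the parsed rows."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_sparql_json(json.load(response))


# Offline check with a hand-written payload in the standard result format
sample_payload = {
    "head": {"vars": ["tripleCount"]},
    "results": {"bindings": [
        {"tripleCount": {"type": "literal", "value": "6950066"}}
    ]},
}
rows = parse_sparql_json(sample_payload)
print(rows)
```

Note that values arrive as strings in this format, so numeric results need an explicit conversion such as `int(rows[0]["tripleCount"])`.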


Then, to quickly understand the kind of data available, we **listed the predicates used** in alphabetical order, which immediately reveals interesting facts about the content of the dataset.

In [37]:
# List predicates 
query_predicates = '''
    SELECT DISTINCT ?p
    WHERE { 
    ?s ?p ?o .
    } ORDER BY ?p
'''

df = sparql_dataframe.get(endpoint, query_predicates)
print(f'The list of predicates:\n {df}')

The list of predicates:
                                                      p
0       http://dbpedia.org/property/hasPhotoCollection
1                        http://purl.org/dc/terms/date
2                       http://purl.org/dc/terms/title
3                       http://rdfs.org/ns/void#subset
4    http://www.openlinksw.com/schemas/virtrdf#dialect
..                                                 ...
227  https://triplydb.com/Triply/linkedmdb/vocab/st...
228  https://triplydb.com/Triply/linkedmdb/vocab/ty...
229  https://triplydb.com/Triply/linkedmdb/vocab/wr...
230  https://triplydb.com/Triply/linkedmdb/vocab/wr...
231  https://triplydb.com/Triply/linkedmdb/vocab/wr...

[232 rows x 1 columns]


The presence of `rdfs:subClassOf` indicates that the data has some structure, `dcterms:title` shows that the knowledge graph deals with works (as expected for a knowledge base about movies), and `foaf` indicates the presence of information about people. These are only some of the many ontologies used in this knowledge graph.

Sorting the resulting dataframe alphabetically lets us immediately see the **many different ontologies** employed:

In [38]:
# Print each row of the dataframe
for idx, row in df.iterrows():
    print(row['p'])

http://dbpedia.org/property/hasPhotoCollection
http://purl.org/dc/terms/date
http://purl.org/dc/terms/title
http://rdfs.org/ns/void#subset
http://www.openlinksw.com/schemas/virtrdf#dialect
http://www.openlinksw.com/schemas/virtrdf#dialect-exceptions
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/2000/01/rdf-schema#SeeAlso
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/2000/01/rdf-schema#subClassOf
http://www.w3.org/2002/07/owl#sameAs
http://www.w3.org/2004/02/skos/core#subject
http://xmlns.com/foaf/0.1/based_near
http://xmlns.com/foaf/0.1/made
http://xmlns.com/foaf/0.1/page
https://triplydb.com/Triply/linkedmdb/id/oddlinker/link_source
https://triplydb.com/Triply/linkedmdb/id/oddlinker/link_target
https://triplydb.com/Triply/linkedmdb/id/oddlinker/link_type
https://triplydb.com/Triply/linkedmdb/id/oddlinker/linkage_date
https://triplydb.com/Triply/linkedmdb/id/oddlinker/linkage_method
https://triplydb.com/Triply/linkedmdb/id/oddlinker/linkage_run
https:

We have:
- [DBpedia](https://www.dbpedia.org/)
- [Dcterms](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/)
- [OWL](https://www.w3.org/TR/owl-features/)
- [RDF Schema](https://www.w3.org/TR/rdf-schema/)
- [SKOS](https://www.w3.org/TR/skos-reference/)
- [FOAF](http://xmlns.com/foaf/0.1/)
- [VoID](https://www.w3.org/TR/void/)
- [Virtrdf](https://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFViewNorthwindOntology)
- linkedmdb/id
- linkedmdb/vocab

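The list of ontologies above was read off the sorted predicates by eye. As a rough sketch, the same grouping could be derived programmatically by cutting each predicate URI after its last `#` or `/`. The helper name `namespace_of` is illustrative, and the sample list contains only a few of the predicates printed above.

```python
from collections import Counter


def namespace_of(uri):
    """Return the URI up to and including its last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1]


# A few predicates copied from the output above
predicates = [
    "http://purl.org/dc/terms/date",
    "http://purl.org/dc/terms/title",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "http://xmlns.com/foaf/0.1/page",
]

# Count how many predicates each namespace contributes
namespace_counts = Counter(namespace_of(p) for p in predicates)
for namespace, count in namespace_counts.most_common():
    print(count, namespace)
```

Applied to the full dataframe, something like `df['p'].map(namespace_of).value_counts()` should give the number of predicates per namespace.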
It is also interesting to see which properties are **used most often**:

In [39]:
# Most used predicates list
query_predicate_repetition = '''
    SELECT ?p (COUNT(?p) AS ?predicate)
    WHERE { 
    ?s ?p ?o .
    }
    GROUP BY ?p
    ORDER BY DESC(?predicate)
'''

df = sparql_dataframe.get(endpoint, query_predicate_repetition)
print(f'The number of times each predicate is used:\n {df}')


The number of times each predicate is used:
                                                      p  predicate
0      http://www.w3.org/1999/02/22-rdf-syntax-ns#type     817073
1           http://www.w3.org/2000/01/rdf-schema#label     722092
2                       http://xmlns.com/foaf/0.1/page     577112
3    https://triplydb.com/Triply/linkedmdb/vocab/pe...     390322
4    https://triplydb.com/Triply/linkedmdb/vocab/actor     284409
..                                                 ...        ...
227  https://triplydb.com/Triply/linkedmdb/id/oddli...          7
228                     http://rdfs.org/ns/void#subset          2
229  http://www.openlinksw.com/schemas/virtrdf#dial...          1
230  http://www.openlinksw.com/schemas/virtrdf#dialect          1
231  https://triplydb.com/Triply/linkedmdb/vocab/fi...          1

[232 rows x 2 columns]


An interesting insight from this first exploration is the presence of an ontology specific to this dataset: **linkedmdb**.

Since linkedmdb/id appears to contain mostly linkage information, we are more interested in linkedmdb/vocab; to analyse it further, we can select only the properties that belong to it:

In [40]:
# List linkedmdb/vocab predicates
query_predicates = '''
    SELECT DISTINCT ?p
    WHERE { 
        ?s ?p ?o .
        FILTER(STRSTARTS(STR(?p), "https://triplydb.com/Triply/linkedmdb/vocab/"))
    }
    ORDER BY ?p
'''

df = sparql_dataframe.get(endpoint, query_predicates)
print(f'The list of linkedmdb/vocab predicates:\n {df}')

The list of linkedmdb/vocab predicates:
                                                      p
0    https://triplydb.com/Triply/linkedmdb/vocab/actor
1    https://triplydb.com/Triply/linkedmdb/vocab/ac...
2    https://triplydb.com/Triply/linkedmdb/vocab/ac...
3    https://triplydb.com/Triply/linkedmdb/vocab/ac...
4    https://triplydb.com/Triply/linkedmdb/vocab/ac...
..                                                 ...
205  https://triplydb.com/Triply/linkedmdb/vocab/st...
206  https://triplydb.com/Triply/linkedmdb/vocab/ty...
207  https://triplydb.com/Triply/linkedmdb/vocab/wr...
208  https://triplydb.com/Triply/linkedmdb/vocab/wr...
209  https://triplydb.com/Triply/linkedmdb/vocab/wr...

[210 rows x 1 columns]


We can now use this as a sort of "ordered vocabulary" for easily finding and selecting the predicates that could be the most useful for our intended queries. 
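One possible way to make this "ordered vocabulary" searchable is to map each local name (the part of the URI after the namespace) to its full URI. This is only a sketch: `build_vocabulary` and `search` are illustrative helpers, and `example_property` is a made-up local name, whereas `actor` does appear in the output above.

```python
VOCAB = "https://triplydb.com/Triply/linkedmdb/vocab/"


def build_vocabulary(uris, namespace=VOCAB):
    """Map local names to full URIs, keeping only URIs in the namespace."""
    return {
        uri[len(namespace):]: uri
        for uri in uris
        if uri.startswith(namespace)
    }


def search(vocabulary, keyword):
    """Return the sorted local names containing the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return sorted(name for name in vocabulary if keyword in name.lower())


uris = [
    VOCAB + "actor",
    VOCAB + "example_property",        # hypothetical, for illustration only
    "http://xmlns.com/foaf/0.1/page",  # outside the namespace, so it is skipped
]
vocabulary = build_vocabulary(uris)
print(search(vocabulary, "act"))
```

With the dataframe from the previous cell, the input list would be `df['p'].tolist()`.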

Moving on to Classes, we first need to understand **how Classes are defined**: whether they are declared through other ontologies, as instances of `owl:Class` or `rdfs:Class` (via `rdf:type`, abbreviated `a`), or defined autonomously by the dataset (in the latter case our queries return no results, which is what actually happens).

In [41]:
query_classes_rdfs = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?c
    WHERE {
        ?c a rdfs:Class .
    }
    ORDER BY ?c
'''
df1 = sparql_dataframe.get(endpoint, query_classes_rdfs)
print(f'The list of classes (rdfs:Class):\n {df1}')

query_classes_owl = '''
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT DISTINCT ?c
    WHERE {
        ?c a owl:Class .
    }
    ORDER BY ?c
'''
df2 = sparql_dataframe.get(endpoint, query_classes_owl)
print(f'The list of classes (owl:Class):\n {df2}')

The list of classes (rdfs:Class):
 Empty DataFrame
Columns: [c]
Index: []
The list of classes (owl:Class):
 Empty DataFrame
Columns: [c]
Index: []


Given the lack of results, the dataset evidently does not declare its classes through `rdfs:Class` or `owl:Class`, but defines them autonomously. To gather the list of **class types** (alphabetically ordered), we can look at the types assigned to subjects through `rdf:type` (abbreviated `a`):

In [42]:
query_concepts = '''
    SELECT DISTINCT ?concept 
    WHERE {
        ?s a ?concept .
    }
    ORDER BY ?concept
'''
df = sparql_dataframe.get(endpoint, query_concepts)
print(f'The list of Classes types:\n {df}')

The list of Classes types:
                                               concept
0                     http://rdfs.org/ns/void#Dataset
1                     http://xmlns.com/foaf/0.1/Agent
2                    http://xmlns.com/foaf/0.1/Person
3   https://triplydb.com/Triply/linkedmdb/id/oddli...
4   https://triplydb.com/Triply/linkedmdb/id/oddli...
5   https://triplydb.com/Triply/linkedmdb/vocab/Actor
6   https://triplydb.com/Triply/linkedmdb/vocab/Ar...
7   https://triplydb.com/Triply/linkedmdb/vocab/Ca...
8   https://triplydb.com/Triply/linkedmdb/vocab/Ci...
9   https://triplydb.com/Triply/linkedmdb/vocab/Co...
10  https://triplydb.com/Triply/linkedmdb/vocab/Co...
11  https://triplydb.com/Triply/linkedmdb/vocab/Co...
12  https://triplydb.com/Triply/linkedmdb/vocab/Co...
13  https://triplydb.com/Triply/linkedmdb/vocab/Di...
14  https://triplydb.com/Triply/linkedmdb/vocab/Du...
15  https://triplydb.com/Triply/linkedmdb/vocab/Ed...
16   https://triplydb.com/Triply/linkedmdb/vocab/Film


Again, if we are more interested in the **linkedmdb/vocab Classes**, we can list them all to obtain an "ordered vocabulary" (the process is the same as for the predicates):

In [43]:
query_concepts_imdb = '''
    SELECT DISTINCT ?concept 
    WHERE {
        ?s a ?concept .
        FILTER(STRSTARTS(STR(?concept), "https://triplydb.com/Triply/linkedmdb/vocab/"))
    }
    ORDER BY ?concept
'''
df = sparql_dataframe.get(endpoint, query_concepts_imdb)
print(f'The list of imdb Classes types:\n {df}')

The list of imdb Classes types:
                                               concept
0   https://triplydb.com/Triply/linkedmdb/vocab/Actor
1   https://triplydb.com/Triply/linkedmdb/vocab/Ar...
2   https://triplydb.com/Triply/linkedmdb/vocab/Ca...
3   https://triplydb.com/Triply/linkedmdb/vocab/Ci...
4   https://triplydb.com/Triply/linkedmdb/vocab/Co...
5   https://triplydb.com/Triply/linkedmdb/vocab/Co...
6   https://triplydb.com/Triply/linkedmdb/vocab/Co...
7   https://triplydb.com/Triply/linkedmdb/vocab/Co...
8   https://triplydb.com/Triply/linkedmdb/vocab/Di...
9   https://triplydb.com/Triply/linkedmdb/vocab/Du...
10  https://triplydb.com/Triply/linkedmdb/vocab/Ed...
11   https://triplydb.com/Triply/linkedmdb/vocab/Film
12  https://triplydb.com/Triply/linkedmdb/vocab/Fi...
13  https://triplydb.com/Triply/linkedmdb/vocab/Fi...
14  https://triplydb.com/Triply/linkedmdb/vocab/Fi...
15  https://triplydb.com/Triply/linkedmdb/vocab/Fi...
16  https://triplydb.com/Triply/linkedmdb/vocab/F

We can also check how many predicates are associated with each Class:

In [44]:
query_property_per_type = '''
    SELECT DISTINCT ?type (COUNT(DISTINCT ?p) AS ?count)
    WHERE {
        ?s a ?type . 
        ?s ?p ?o . 
    }
    GROUP BY ?type
    ORDER BY DESC(?count)
'''

df = sparql_dataframe.get(endpoint, query_property_per_type)
print(f'The number of properties per type in descending order:\n {df}')

The number of properties per type in descending order:
                                                  type  count
0    https://triplydb.com/Triply/linkedmdb/vocab/Film     48
1                    http://xmlns.com/foaf/0.1/Person     19
2   https://triplydb.com/Triply/linkedmdb/vocab/Co...     15
3   https://triplydb.com/Triply/linkedmdb/vocab/Fi...     13
4   https://triplydb.com/Triply/linkedmdb/vocab/Fi...     12
5   https://triplydb.com/Triply/linkedmdb/vocab/Pe...     12
6   https://triplydb.com/Triply/linkedmdb/vocab/Actor     10
7   https://triplydb.com/Triply/linkedmdb/vocab/Pe...      9
8   https://triplydb.com/Triply/linkedmdb/vocab/Fi...      9
9   https://triplydb.com/Triply/linkedmdb/vocab/Du...      9
10  https://triplydb.com/Triply/linkedmdb/vocab/Co...      9
11  https://triplydb.com/Triply/linkedmdb/id/oddli...      8
12  https://triplydb.com/Triply/linkedmdb/vocab/Fi...      8
13  https://triplydb.com/Triply/linkedmdb/vocab/Fi...      7
14  https://triplydb.com/Trip

#### A-Box (Assertional Box)

The instances of Classes are the actual "content" of a dataset. A first query is therefore to look at **how many instances each Class has**, which shows the most recurrent concepts.


In [45]:
query_instance_per_concept = '''
    SELECT ?concept (COUNT (?s) AS ?instanceCount) 
    WHERE {
    ?s a ?concept . 
    }
    GROUP BY ?concept
    ORDER BY DESC(?instanceCount)
'''

df = sparql_dataframe.get(endpoint, query_instance_per_concept)
print(f'The number of instances per class is:\n {df}')

The number of instances per class is:
                                               concept  instanceCount
0   https://triplydb.com/Triply/linkedmdb/vocab/Pe...         199771
1   https://triplydb.com/Triply/linkedmdb/id/oddli...         162199
2    https://triplydb.com/Triply/linkedmdb/vocab/Film          98816
3                    http://xmlns.com/foaf/0.1/Person          97858
4   https://triplydb.com/Triply/linkedmdb/vocab/Actor          68205
5   https://triplydb.com/Triply/linkedmdb/vocab/Fi...          45423
6   https://triplydb.com/Triply/linkedmdb/vocab/Wr...          23664
7   https://triplydb.com/Triply/linkedmdb/vocab/Di...          21966
8   https://triplydb.com/Triply/linkedmdb/vocab/Pr...          18408
9   https://triplydb.com/Triply/linkedmdb/vocab/Fi...          17237
10  https://triplydb.com/Triply/linkedmdb/vocab/Fi...          16118
11  https://triplydb.com/Triply/linkedmdb/vocab/Fi...          15256
12  https://triplydb.com/Triply/linkedmdb/vocab/Mu...          

From the previous analysis of the properties, we can see that the two most used properties are `rdf:type` and `rdfs:label`. Typographical errors in labels are an important issue, since they multiply the labels attached to the same instance; to get around it, we can use `SAMPLE`, a SPARQL aggregate function that returns a single value from each group.

In [46]:
query_instance_label = '''
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
    SELECT ?instance 
        (SAMPLE(?label) AS ?instanceLabel) 
        (COUNT(?instance) AS ?instanceCount) 
    WHERE { 
        ?instance a ?class . 
        OPTIONAL { ?instance rdfs:label ?label . } 
    }
    GROUP BY ?instance
    ORDER BY DESC(?instanceCount)
'''

df = sparql_dataframe.get(endpoint, query_instance_label)
print(f'The list of instances with labels and repetitions:\n {df}')

The list of instances with labels and repetitions:
                                                instance  \
0     https://triplydb.com/Triply/linkedmdb/id/actor...   
1     https://triplydb.com/Triply/linkedmdb/id/direc...   
2     https://triplydb.com/Triply/linkedmdb/id/actor...   
3     https://triplydb.com/Triply/linkedmdb/id/direc...   
4     https://triplydb.com/Triply/linkedmdb/id/actor...   
...                                                 ...   
9995  https://triplydb.com/Triply/linkedmdb/id/actor...   
9996  https://triplydb.com/Triply/linkedmdb/id/actor...   
9997  https://triplydb.com/Triply/linkedmdb/id/actor...   
9998  https://triplydb.com/Triply/linkedmdb/id/actor...   
9999  https://triplydb.com/Triply/linkedmdb/id/actor...   

                 instanceLabel  instanceCount  
0          Bill Knight (Actor)              2  
1      Susi Ganesan (Director)              2  
2         Linda Haynes (Actor)              2  
3          Lu Chuan (Director)              2  
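To make the effect of `SAMPLE` more concrete, here is a small Python sketch of the same idea: rows are grouped by instance and a single label is kept per group. SPARQL does not specify which value `SAMPLE` returns, so taking the first label seen is just one possible choice. The `ex:` identifiers and the misspelled label variant are invented for illustration.

```python
from collections import defaultdict


def sample_label_per_instance(pairs):
    """Group (instance, label) pairs by instance and keep one label each."""
    labels = defaultdict(list)
    for instance, label in pairs:
        labels[instance].append(label)
    # SAMPLE may return any value of the group; here we keep the first one seen
    return {instance: values[0] for instance, values in labels.items()}


# Invented rows: the same instance carries two labels because of a typo
pairs = [
    ("ex:actor/1", "Bill Knight (Actor)"),
    ("ex:actor/1", "Bill Knigth (Actor)"),
    ("ex:director/1", "Lu Chuan (Director)"),
]
sampled = sample_label_per_instance(pairs)
for instance, label in sampled.items():
    print(instance, label)
```

Without this step, each label variant would produce its own row for the same instance.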

### SPARQL Queries

We can now properly state our queries to the knowledge graph, and we do so based on the results coming from the **script analysis** and **review analysis**:
- Result 1:
- Result 2:
- Result 3:
- Result 4:
- Result 5:

Queries:
1. How many of these movies have been directed by a male director?
    1. This query will probably require a **federated query** combining linkedmdb, which links a movie title to its director(s), and Wikidata, which can provide the gender of the director(s).
2. How many of these movies have a majority of male actors in the cast?
    1. Again, this will probably require a **federated query**.
3. What are the lengths of these movies compared to the other movies?
4. Some other query (hard to define for now)
5. Some other query (hard to define for now)
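Whatever form the retrieval queries eventually take, the second question reduces to a simple count once a gender value is available for each cast member. The sketch below assumes a hypothetical input shape, a dictionary mapping each movie title to a list of gender strings with `None` for unknown values. The titles and values are placeholders, not project data.

```python
def has_male_majority(genders):
    """True if strictly more than half of the known genders are 'male'."""
    known = [gender for gender in genders if gender is not None]
    if not known:
        return False  # no information: do not count the movie
    males = sum(gender == "male" for gender in known)
    return males * 2 > len(known)


def count_male_majority(casts):
    """Count the movies whose cast has a male majority."""
    return sum(has_male_majority(genders) for genders in casts.values())


# Placeholder data, for illustration only
casts = {
    "Movie A": ["male", "male", "female"],
    "Movie B": ["female", "male"],  # tie, so not a majority
    "Movie C": ["female", None, "female"],
    "Movie D": [None],              # nothing known
}
print(count_male_majority(casts), "of", len(casts), "movies")
```

Ignoring unknown values is a design choice; counting them against the majority would give a stricter measure.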

### References
- "Abox", Wikipedia. https://en.wikipedia.org/wiki/Abox
- Daquino, M., "How to explore an unknown dataset - quickstart".
- DuCharme, Bob, "Exploring a SPARQL endpoint", August 24, 2014. https://www.bobdc.com/blog/exploring-a-sparql-endpoint/
- DuCharme, Bob, "Queries to explore a dataset", April 30, 2022. https://www.bobdc.com/blog/exploringadataset/