# ProjectName - Data documentation

This Jupyter Notebook analyses the data preparation and processing phase of ["NameProject"](https://github.com/ahsanv101/ProjectGaze), a data visualization project on the perception of the "male gaze" in the highest-grossing US movies from the 1940s to the 2010s.

### Disclaimer
This Jupyter Notebook is purely informational: it is not meant to be used for data preparation and processing, only to analyse and explain those processes.
<br>The Python files used for the clean-up can be found in `scripts > [namefile].py`.

## Webscraping - SA on reviews

## Film Scripts analysis

## SPARQL Data retrieval

Finally, after gathering preliminary results from the first analyses of the film scripts and IMDb reviews, we deepened our research using the [**Linked Movie Database (LinkedMDB)**](https://triplydb.com/Triply/linkedmdb) and its **SPARQL endpoint**, hosted on Triply.

Since we had no prior knowledge of the structure of this knowledge base, an initial phase of **data exploration** was necessary.

Only afterwards was it possible to run the queries and save the results in a format suitable for visualizing the data and gathering insights from it.

### Data exploration
In general, there are two different types of statements (triples) in knowledge bases: **T-Box** statements and **A-Box** statements.
1. **T-Box** (Terminological Box) statements describe the domain of interest by defining classes and properties as the domain vocabulary; they contain information about the **structure of the dataset**.
2. **A-Box** (Assertional Box) statements provide facts that conform to the T-Box's conceptual model or ontologies; they contain information on instances and the relationships between them, that is, the **main content of the dataset**.

For the exploration of the LinkedMDB dataset, we will follow this theoretical structure.
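
The distinction can be made concrete with a small sketch. The heuristic below is only an approximation under stated assumptions: it treats a handful of RDFS/OWL schema predicates and class or property declarations as T-Box and everything else as A-Box. The `http://example.org/` IRIs and the function name `classify_triple` are hypothetical and serve illustration only.

```python
# Illustrative heuristic only: not a complete definition of T-Box vs A-Box.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
RDF_TYPE = RDF + "type"

# Predicates that describe the vocabulary itself
SCHEMA_PREDICATES = {
    RDFS + "subClassOf",
    RDFS + "subPropertyOf",
    RDFS + "domain",
    RDFS + "range",
}

# Objects of rdf:type that declare a class or a property
SCHEMA_TYPES = {
    RDFS + "Class",
    OWL + "Class",
    RDF + "Property",
    OWL + "ObjectProperty",
    OWL + "DatatypeProperty",
}


def classify_triple(s, p, o):
    """Return 'T-Box' for schema-level triples and 'A-Box' otherwise."""
    if p in SCHEMA_PREDICATES:
        return "T-Box"
    if p == RDF_TYPE and o in SCHEMA_TYPES:
        return "T-Box"
    return "A-Box"


# Hypothetical example triples (these IRIs are made up for illustration)
EX = "http://example.org/"
triples = [
    (EX + "Film", RDF_TYPE, OWL + "Class"),
    (EX + "Documentary", RDFS + "subClassOf", EX + "Film"),
    (EX + "film42", RDF_TYPE, EX + "Film"),
    (EX + "film42", EX + "director", EX + "person7"),
]

labels = [classify_triple(*t) for t in triples]
for triple, label in zip(triples, labels):
    print(label, triple)
```

The first two triples define vocabulary, so they fall in the T-Box; the last two state facts about one instance, so they fall in the A-Box.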

#### T-Box (Terminological Box)
A useful first query is to check the **number of triples** contained in the knowledge base, to get a sense of its size.


In [None]:
# Import library to display the results cleanly
import sparql_dataframe
import pandas

# Reference resource: IMDB SPARQL endpoint URL
endpoint = 'https://api.triplydb.com/datasets/Triply/linkedmdb/services/linkedmdb/sparql'

# Query we want to run: how many triples are in the LOD source?
query_triples_count = """
SELECT (COUNT (*) AS ?tripleCount)
WHERE {
?s ?p ?o
}
"""

# Create dataframe and print it
df = sparql_dataframe.get(endpoint, query_triples_count)
print(f'The total number of triples is:\n {df}')
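
If `sparql_dataframe` is not available, the same query could in principle be sent with the standard library alone, since a SPARQL endpoint accepts the query as an HTTP `query` parameter and can return results as JSON. The following is a minimal sketch under that assumption; the helper names `run_select` and `bindings_to_rows` are introduced here for illustration, and the sample response is hand-written with a placeholder count, not the real size of the dataset.

```python
import json
import urllib.parse
import urllib.request


def run_select(endpoint_url, query, timeout=30):
    """Send a SELECT query via HTTP GET and return the parsed JSON results."""
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def bindings_to_rows(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    variables = results["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in results["results"]["bindings"]
    ]


# Offline example: a hand-written response in the SPARQL JSON results format.
# The count is a made-up placeholder.
sample = {
    "head": {"vars": ["tripleCount"]},
    "results": {
        "bindings": [{"tripleCount": {"type": "literal", "value": "123"}}]
    },
}
rows = bindings_to_rows(sample)
print(rows)
```

Against the live endpoint, this would be used as `bindings_to_rows(run_select(endpoint, query_triples_count))`. Note that all values come back as strings and would need converting before any arithmetic.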

Then, to quickly understand the kind of data available, we **listed the predicates used**, sorted alphabetically. This list immediately reveals interesting facts about the content of the knowledge base.

In [None]:
# List predicates 
query_predicates = """
    SELECT DISTINCT ?p
    WHERE { 
    ?s ?p ?o .
    } ORDER BY ?p
"""

df = sparql_dataframe.get(endpoint, query_predicates)
print(f'The list of predicates:\n {df}')

The presence of `rdfs:subClassOf` indicates that the data has some class structure, `dcterms:title` shows that the knowledge graph deals with works (as expected for a knowledge base about movies), and `foaf` indicates the presence of information about people. These are only some of the many ontologies used in this knowledge graph.

Because the resulting dataframe is sorted alphabetically, the **many different ontologies** employed are immediately easy to see:

In [None]:
# Print each row of the dataframe
for idx, row in df.iterrows():
    print(row['p'])

We have:
- [DBpedia](https://www.dbpedia.org/)
- [Dcterms](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/)
- [OWL](https://www.w3.org/TR/owl-features/)
- [RDF Schema](https://www.w3.org/TR/rdf-schema/)
- [SKOS](https://www.w3.org/TR/skos-reference/)
- [FOAF](http://xmlns.com/foaf/0.1/)
- [VoID](https://www.w3.org/TR/void/)
- [Virtrdf](https://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFViewNorthwindOntology)
- linkedmdb/id
- linkedmdb/vocab
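
Rather than reading the namespaces off the printed list by eye, they could be grouped programmatically. The sketch below assumes that a namespace ends at the last `#` or `/` of a predicate IRI, which holds for the common vocabularies but is only a convention. The `predicates` list is a small illustrative sample (the last IRI in particular is hypothetical), and `namespace_of` is a helper name introduced here.

```python
from collections import Counter


def namespace_of(iri):
    """Return the IRI up to and including its last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1] if cut >= 0 else iri


# Illustrative sample of predicate IRIs, not the full query result
predicates = [
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "http://purl.org/dc/terms/title",
    "http://xmlns.com/foaf/0.1/page",
    "https://triplydb.com/Triply/linkedmdb/vocab/director",  # hypothetical
]

# Count how many distinct predicates each namespace contributes
namespace_counts = Counter(namespace_of(p) for p in predicates)
for namespace, count in namespace_counts.most_common():
    print(count, namespace)
```

With the real results, the sample list would be replaced by the `p` column of the dataframe, for example `df['p'].tolist()`.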

It is also interesting to understand which are the **most used properties**:

In [None]:
# Most used predicates list
query_predicate_repetition = '''
    SELECT ?p (COUNT(?p) AS ?usageCount)
    WHERE { 
    ?s ?p ?o .
    }
    GROUP BY ?p
    ORDER BY DESC(?usageCount)
'''

df = sparql_dataframe.get(endpoint, query_predicate_repetition)
print(f'The number of times each predicate is used:\n {df}')


An interesting insight from this first exploration is the presence of an ontology specific to this dataset: **linkedmdb**.

Since linkedmdb/id evidently contains only linking information, we are more interested in linkedmdb/vocab. To analyse it further, we can select only the properties that belong to it:

In [None]:
# List linkedmdb/vocab predicates
# Note: regex() expects a string, so the predicate IRI is converted with str()
query_predicates = """
    SELECT DISTINCT ?p
    WHERE { 
    ?s ?p ?o .
    FILTER regex(str(?p), "^https://triplydb.com/Triply/linkedmdb/vocab/", "i")
    } ORDER BY ?p
"""

df = sparql_dataframe.get(endpoint, query_predicates)
print(f'The list of linkedmdb/vocab predicates:\n {df}')

[here missing part on classes]
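
A common next step in T-Box exploration is to list the classes in use together with their number of instances. The following is a minimal sketch, assuming that instances in this knowledge base are typed via `rdf:type`; the names `query_classes` and `run_classes_query` are introduced here for illustration and are not taken from the project scripts.

```python
# Assumption: instances are typed with rdf:type (written "a" in SPARQL)
query_classes = """
    SELECT ?class (COUNT(?s) AS ?instances)
    WHERE {
    ?s a ?class .
    }
    GROUP BY ?class
    ORDER BY DESC(?instances)
"""


def run_classes_query(endpoint_url):
    """Run the query with sparql_dataframe and return the resulting dataframe."""
    # Imported lazily so that the query string can be inspected without the library
    import sparql_dataframe

    return sparql_dataframe.get(endpoint_url, query_classes)
```

It would be run as `df = run_classes_query(endpoint)`, reusing the endpoint URL defined earlier in this notebook.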

#### A-Box (Assertional Box)


### SPARQL Queries


We can now formulate our queries to the knowledge graph, based on the results of the **script analysis** and the **review analysis**:
- Result 1:
- Result 2:
- Result 3:
- Result 4:
- Result 5:

Queries:
1. How many of these movies have been directed by a male director?
    1. This query will probably require a **federated query** combining LinkedMDB, which links each movie title to its director(s), and Wikidata, which can provide the gender of the director(s).
2. How many of these movies have a majority of male actors in the cast?
    1. Again, this will probably require a **federated query**.
3. What are the lengths of these movies compared to the other movies?
4. Some other query (hard to define for now)
5. Some other query (hard to define for now)
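
To make the idea of a federated query for the first question more tangible, here is a rough sketch. It relies on several unverified assumptions: that LinkedMDB exposes a `director` property under its vocab namespace, and that director resources are linked to Wikidata entities via `owl:sameAs`. Both would need to be checked during data exploration. The Wikidata side uses `wdt:P21` (sex or gender) and `wd:Q6581097` (male). The helper `share_directed_by_men` and the sample rows are hypothetical.

```python
# Sketch of a federated query; the LinkedMDB property names are ASSUMPTIONS.
query_director_gender = """
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
    PREFIX lmdb: <https://triplydb.com/Triply/linkedmdb/vocab/>

    SELECT ?film ?gender
    WHERE {
    ?film lmdb:director ?director .          # hypothetical property name
    ?director owl:sameAs ?wikidataPerson .   # hypothetical link to Wikidata
    SERVICE <https://query.wikidata.org/sparql> {
        ?wikidataPerson wdt:P21 ?gender .    # P21 = sex or gender
    }
    }
"""

# Wikidata entity for "male"
MALE = "http://www.wikidata.org/entity/Q6581097"


def share_directed_by_men(rows):
    """Return the fraction of films with at least one male director.

    rows is an iterable of (film, gender_iri) pairs, one per director.
    """
    has_male_director = {}
    for film, gender in rows:
        has_male_director[film] = has_male_director.get(film, False) or gender == MALE
    if not has_male_director:
        return 0.0
    return sum(has_male_director.values()) / len(has_male_director)


# Made-up rows for illustration only (not real results)
FEMALE = "http://www.wikidata.org/entity/Q6581072"
sample_rows = [
    ("film1", MALE),
    ("film2", FEMALE),
    ("film2", MALE),
    ("film3", FEMALE),
    ("film4", FEMALE),
]
share = share_directed_by_men(sample_rows)
print(f"Share of films with a male director: {share:.0%}")
```

Counting per film rather than per row matters here, because a film with several directors appears once per director in the query results.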

## References 
- "Abox", Wikipedia. https://en.wikipedia.org/wiki/Abox
- Daquino, M., "How to explore an unknown dataset - quickstart".
- DuCharme, Bob, "Exploring a SPARQL endpoint", August 24, 2014. https://www.bobdc.com/blog/exploring-a-sparql-endpoint/.
- DuCharme, Bob, "Queries to explore a dataset", April 30, 2022. https://www.bobdc.com/blog/exploringadataset/.