# Analyse STW Thesaurus

### Importing Required Libraries

This cell imports the following:

- `rdflib.Graph`: the class used to create an RDF graph and to parse and query it.
- `rdflib.namespace.RDFS`: the namespace object that gives access to RDF Schema terms.
- `rdflib.URIRef`: the class used to create URI references for RDF resources.
- `rdflib`: the top-level package for working with RDF data.
- `urllib3`: a library for making HTTP requests.
- `json`: the standard-library module for working with JSON data.
- `collections.deque`: a double-ended queue class from the standard library.
- `numpy`: a library that supports large, multi-dimensional arrays and matrices.
- `pandas`: a library for data manipulation and analysis.

Subsequent cells in this Jupyter Notebook rely on these imports.

In [None]:
from rdflib import Graph
from rdflib.namespace import RDFS
from rdflib import URIRef
import rdflib
import urllib3
import json
from collections import deque
import numpy as np
import pandas as pd

### Setting Input File

In this cell, we are setting the input file for our analysis. The input file path is specified as follows:

In [None]:
input_file = "./stw9-16.ttl"

### Parsing RDF Data

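In this cell, we parse the RDF data with the `rdflib.Graph` class. The data is loaded with the `parse()` method of `Graph` from the path stored in `input_file`, which was set in the previous cell.

Here's the code:

In [None]:
g = Graph()
# Parse the file chosen above; the .ttl extension suggests Turtle, so the format is stated explicitly
g.parse(input_file, format="turtle")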

### Querying RDF Data

In this cell, we are querying the RDF data using the `rdflib.Graph.query()` method. The SPARQL query selects all distinct concepts from the RDF data.

Here's the code:


In [None]:
qres = g.query(
    """PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
       SELECT DISTINCT ?a
       WHERE {
          ?a a skos:Concept .
       }""")


# Map each concept to an initial depth of 1 (updated by the depth calculation further down)
topics = dict()
for row in qres:
    topics[row[0]] = 1

print("Number of concepts: {}".format(len(topics)))

### Querying Broader Concepts

In this cell, we are querying the broader concepts using the `rdflib.Graph.query()` method. The SPARQL query selects all distinct pairs of concepts and their broader concepts from the RDF data.

Here's the code:

In [None]:
qres = g.query(
    """PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
       SELECT DISTINCT ?a ?b
       WHERE {
          ?a skos:broader ?b .
       }""")

broaders = dict()
narrowers = dict()
for row in qres:
    if row[0] not in broaders:
        broaders[row[0]] = list()
    broaders[row[0]].append(row[1])
    if row[1] not in narrowers:
        narrowers[row[1]] = list()
    narrowers[row[1]].append(row[0])
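
The two mappings built above can be illustrated on a toy example. The following is a minimal sketch that uses only the standard library; the concept names and the variables `toy_pairs`, `toy_broaders` and `toy_narrowers` are hypothetical stand-ins for the SPARQL result rows, and `defaultdict` is used here in place of the explicit key checks.

```python
from collections import defaultdict

# Hypothetical (narrower, broader) pairs standing in for the SPARQL result rows.
toy_pairs = [
    ("cat", "mammal"),
    ("dog", "mammal"),
    ("mammal", "animal"),
    ("cat", "pet"),
]

toy_broaders = defaultdict(list)   # concept -> its direct broader concepts
toy_narrowers = defaultdict(list)  # concept -> its direct narrower concepts
for narrower, broader in toy_pairs:
    toy_broaders[narrower].append(broader)
    toy_narrowers[broader].append(narrower)

print(dict(toy_broaders))
print(dict(toy_narrowers))
```

A concept with more than one entry in `toy_broaders` (here `"cat"`) has several parents, which is what the last cell of this notebook reports for the real data.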

### Querying Alternative Labels (only English)

In this cell, we query the alternative labels with the `rdflib.Graph.query()` method. The SPARQL query selects all distinct pairs of concepts and their alternative labels, and the `FILTER` clause keeps only labels tagged as English. Only the label of each pair is collected into the `altlabel` list.

Here's the code:

In [None]:
qres = g.query(
    """PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
       SELECT DISTINCT ?a ?b
       WHERE {
          ?a skos:altLabel ?b .
          FILTER (lang(?b) = 'en')
       }""")

altlabel = list()
print(len(qres))
for row in qres:
    altlabel.append(row[1])
print(len(altlabel))

### Querying Alternative Labels (all Languages)

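In this cell, we query the alternative labels with the `rdflib.Graph.query()` method. The SPARQL query selects all distinct pairs of concepts and their alternative labels in every language. Only the label of each pair is collected, and the resulting `altlabel` list replaces the English-only list built in the previous cell.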

Here's the code:

In [None]:
qres = g.query(
    """PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
       SELECT DISTINCT ?a ?b
       WHERE {
          ?a skos:altLabel ?b .
       }""")

altlabel = list()
print(len(qres))
for row in qres:
    altlabel.append(row[1])
print(len(altlabel))
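
The cells above keep only the labels and discard the concept each label belongs to. If the pairs were needed later, they could be grouped per concept. The following is a minimal sketch under stated assumptions: the rows, concept identifiers and the name `labels_by_concept` are hypothetical, and the language is given as a plain string here, whereas with rdflib it would be read from the label literal.

```python
from collections import defaultdict

# Hypothetical (concept, label, language) rows standing in for query results.
toy_rows = [
    ("c1", "automobile", "en"),
    ("c1", "Kraftwagen", "de"),
    ("c2", "aeroplane", "en"),
    ("c1", "motor car", "en"),
]

labels_by_concept = defaultdict(list)  # concept -> its English alternative labels
for concept, label, lang in toy_rows:
    if lang == "en":  # same effect as FILTER (lang(?b) = 'en')
        labels_by_concept[concept].append(label)

print(dict(labels_by_concept))
```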

### Calculating Maximum Depth of Concepts

In this cell, we calculate the maximum depth of each concept in the `concepts` dictionary, which is an alias for `topics`. The `unhier` dictionary is an alias for `broaders` and maps each concept to its direct broader concepts.

For each concept, we initialise a queue and set the maximum depth to the concept's stored value, which is initially equivalent to 1. We then enqueue the concept together with that depth.

While the queue is not empty, we dequeue an element and check whether its concept has an entry in `unhier`. If it does, we raise the maximum depth when the incremented depth exceeds it, and enqueue each broader concept with the incremented depth.

Finally, we store the maximum depth as the value of the concept in the `concepts` dictionary.

The result is the length of the longest chain of broader concepts above each concept, counting the concept itself. The loop assumes that the broader relation contains no cycles; with a cycle it would not terminate.
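
The steps above can be sketched on a small, hand-made hierarchy. This is an illustration only, not the notebook's own cell: the concept names, `toy_hierarchy` and the helper `max_depth_of` are hypothetical, and the sketch assumes the hierarchy has no cycles.

```python
from collections import deque

# Hypothetical hierarchy: concept -> list of direct broader concepts.
toy_hierarchy = {
    "cat": ["mammal", "pet"],
    "mammal": ["animal"],
    "pet": ["animal"],
    "animal": ["thing"],
}

def max_depth_of(concept, broaders, start=1):
    """Return the longest chain of broader links above `concept`,
    counting the concept itself as depth `start`."""
    deepest = start
    queue = deque([(concept, start)])
    while queue:
        current, depth = queue.popleft()
        # Concepts without an entry have no broader concept and end the chain.
        for parent in broaders.get(current, []):
            deepest = max(deepest, depth + 1)
            queue.append((parent, depth + 1))
    return deepest

toy_depths = {c: max_depth_of(c, toy_hierarchy) for c in ["cat", "mammal", "animal", "thing"]}
print(toy_depths)  # {'cat': 4, 'mammal': 3, 'animal': 2, 'thing': 1}
```

Because shared ancestors such as `"animal"` are enqueued once per path that reaches them, this breadth-first approach can revisit the same concept several times. That is harmless for the result but may matter for very large hierarchies.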

In [None]:
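unhier = broaders   # concept -> list of direct broader concepts
concepts = topics   # concept -> depth (same dict object as topics)
for concept, value in concepts.items():
    queue = deque()
    max_depth = int(value)  # int() turns an initial True into 1
    queue.append({"t": concept, "d": max_depth})
    while len(queue) > 0:
        dequeued = queue.popleft()
        if dequeued["t"] in unhier:
            broads = unhier[dequeued["t"]]
            new_depth = dequeued["d"] + 1
            if new_depth > max_depth:
                max_depth = new_depth
            # Visit every broader concept one level further up
            for broader in broads:
                queue.append({"t": broader, "d": new_depth})

    # Only the value of an existing key changes, which is safe while iterating
    concepts[concept] = max_depth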

In [None]:
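# NOTE: altlabel is a list of label literals, not a dict keyed by concept,
# so the membership test below is not expected to match any concept and the
# depths computed in the previous cell are expected to stay unchanged.
unhier = altlabel
concepts = topics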
for concept, value in concepts.items():
    queue = deque() 
    max_depth = value
    queue.append({"t":concept,"d":value})
    while len(queue) > 0:
        dequeued = queue.popleft()
        if dequeued["t"] in unhier:
            broads = unhier[dequeued["t"]]
            new_depth = dequeued["d"]+1
            if new_depth > max_depth:
                max_depth = new_depth
            for broader in broads:
                queue.append({"t":broader,"d":dequeued["d"]+1})
    
    concepts[concept] = max_depth

In [None]:
list_of_depths = pd.DataFrame.from_dict(concepts, orient='index', columns=['depth'])
list_of_depths.sort_values('depth', inplace=True, ascending=False)

In [None]:
print("Concepts are ranked by maximum depth")
list_of_depths.head(20)

In [None]:
for k, v in broaders.items(): 
    if len(v) > 1: 
        print("{} has {} parents".format(k, len(v)))