# Analyse IEEE Thesaurus

### Importing Required Libraries

In this cell, we import the following libraries:

- `rdflib`: A library for working with RDF (Resource Description Framework) data.
- `Graph` and `URIRef` from `rdflib`: Classes for creating and manipulating RDF graphs and URIs.
- `RDFS` from `rdflib.namespace`: A namespace for RDF Schema vocabulary.
- `urllib3`: A library for making HTTP requests.
- `json`: A library for working with JSON data.
- `deque` from `collections`: A double-ended queue implementation.
- `numpy` and `pandas`: Libraries for data manipulation and analysis.

These libraries are required for the subsequent cells.

In [64]:
import rdflib
import urllib3
import json
import numpy as np
import pandas as pd
from rdflib import Graph
from rdflib import URIRef
from collections import deque
from rdflib.namespace import RDFS

In [42]:
input_file = 'ieee-thesaurus.xml' # input file path

### Parsing the Input File

In this cell, we parse the input file using the `rdflib` library. The input file path is stored in the variable `input_file`.

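The code snippet below parses the input file and stores the resulting RDF graph in the variable `g`. No `format` argument is passed, so `rdflib` infers the serialisation from the file extension (`.xml` is read as RDF/XML).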

In [None]:
g = Graph()
g.parse(input_file)

### Counting the Number of Concepts

In this cell, we count the number of concepts in the RDF graph using the `g.query()` method from the `rdflib` library. The SPARQL query selects distinct concepts that have the RDF type `skos:Concept`.

The code snippet below demonstrates how to count the number of concepts and store them in the `topics` dictionary. Finally, we print the total number of concepts.

In [None]:
qres = g.query(
    """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
       SELECT DISTINCT ?a
       WHERE {
          ?a a skos:Concept .
       }""")

topics = dict()
for row in qres:
    topics[row[0]] = True
    
print("Number of concepts: {}".format(len(topics)))

### Querying and Organising Concepts

In this cell, we query the RDF graph using the `g.query()` method from the `rdflib` library. The SPARQL query selects distinct concepts and their corresponding broader concepts.

The code snippet below demonstrates how to execute the query and organise the concepts into dictionaries `broaders` and `narrowers`. The `broaders` dictionary stores each concept as a key and its list of broader concepts as the corresponding value. Similarly, the `narrowers` dictionary stores each concept as a key and its list of narrower concepts as the corresponding value.

In [65]:
qres = g.query(
    """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
       SELECT DISTINCT ?a ?b
       WHERE {
          ?a skos:broader ?b .
       }""")


broaders = dict()
narrowers = dict()


for row in qres:
    if row[0] not in broaders:
        broaders[row[0]] = list()
    broaders[row[0]].append(row[1])
    if row[1] not in narrowers:
        narrowers[row[1]] = list()
    narrowers[row[1]].append(row[0])

### Calculating Maximum Depth of Concepts

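In this cell, we calculate the maximum depth of every concept. The `concepts` dictionary starts with each concept as a key and its initial depth as the value. For each concept, a breadth-first traversal follows the broader links upwards and records the length of the longest chain it finds.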

In [66]:
# Work on a copy so the `broaders` mapping built above stays untouched.
unhier = dict(broaders)
# Every concept starts at depth 1; `topics` only stored True as a placeholder.
concepts = dict.fromkeys(topics, 1)
for concept, value in concepts.items():
    queue = deque()
    max_depth = value
    queue.append({"t": concept, "d": value})
    # Breadth-first walk up the broader links, tracking the longest chain seen.
    while len(queue) > 0:
        dequeued = queue.popleft()
        if dequeued["t"] in unhier:
            broads = unhier[dequeued["t"]]
            new_depth = dequeued["d"] + 1
            if new_depth > max_depth:
                max_depth = new_depth
            for broader in broads:
                queue.append({"t": broader, "d": new_depth})

    concepts[concept] = max_depth

In [67]:
list_of_depths = pd.DataFrame.from_dict(concepts, orient='index', columns=['depth'])
list_of_depths.sort_values('depth', inplace=True, ascending=False)

In [None]:
print("Concepts are ranked by maximum depth")
list_of_depths.head(20)
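
The depth calculation can also be isolated into a small function and tried on a hand-made hierarchy. The sketch below is an illustration only: the helper `max_depth_of` and the single-letter concept names are hypothetical, and it assumes the broader links contain no cycles (a cycle would keep the queue from ever emptying).

```python
from collections import deque


def max_depth_of(concept, broaders, start_depth=1):
    """Return the length of the longest chain of broader links from `concept`."""
    deepest = start_depth
    queue = deque([(concept, start_depth)])
    while queue:
        current, depth = queue.popleft()
        # Concepts without an entry in `broaders` are top-level: nothing to follow.
        for parent in broaders.get(current, []):
            deepest = max(deepest, depth + 1)
            queue.append((parent, depth + 1))
    return deepest


# Hypothetical hierarchy: "d" reaches "a" by two routes of different length.
toy_broaders = {
    "b": ["a"],
    "c": ["b"],
    "d": ["b", "e"],
    "e": ["f"],
    "f": ["a"],
}

toy_depths = {c: max_depth_of(c, toy_broaders) for c in ["a", "b", "c", "d"]}
print(toy_depths)
```

Concept `d` shows why the maximum is tracked: the route through `b` reaches the top in three steps, while the route through `e` and `f` takes four, and the longer one determines the reported depth.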

### Checking for Monohierarchy

In this cell, we check whether the IEEE Thesaurus is monohierarchical, meaning that every concept has at most one broader concept.

The code snippet below first prints a reminder that no further output means the thesaurus is monohierarchical. It then iterates over the `broaders` dictionary and, for every concept with more than one broader concept, prints the concept and the number of parents it has.


In [None]:
print("If nothing is printed after this line, the IEEE Thesaurus is monohierarchical (each narrower concept has only one broader concept)")
for k, v in broaders.items():
    if len(v) > 1:
        print("{} has {} parents".format(k, len(v)))
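
If the result is needed programmatically rather than as printed output, the same check can return the offending concepts. This is a minimal sketch; the function name `polyhierarchical_concepts` and the toy mapping are hypothetical and not part of the notebook.

```python
def polyhierarchical_concepts(broaders):
    """Map each concept with more than one broader concept to its parent count."""
    return {concept: len(parents)
            for concept, parents in broaders.items()
            if len(parents) > 1}


# Hypothetical mapping: only "d" has two broader concepts.
toy_broaders = {"b": ["a"], "c": ["b"], "d": ["b", "e"]}

offenders = polyhierarchical_concepts(toy_broaders)
is_monohierarchical = not offenders
print(offenders, is_monohierarchical)
```

An empty result means the mapping is monohierarchical, which matches the "nothing is printed" case of the cell above.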