# Analyse TheSoz
Get the number of concepts and the depth of the ontology

This notebook requires rdflib, SPARQLWrapper, pandas, and numpy:
* pip install rdflib
* pip install SPARQLWrapper
* pip install pandas
* pip install numpy

Download the latest version of TheSoz from http://lod.gesis.org/download-thesoz.html

In [1]:
from rdflib import Graph
from rdflib.namespace import RDFS
from rdflib import URIRef
import rdflib
import json
from collections import deque
import numpy as np
import pandas as pd

In [2]:
input_file = "thesoz-komplett.xml"
g = Graph()
g.parse(input_file)

<Graph identifier=Nb8397126b45e41ae8f1c2fc1b012ef92 (<class 'rdflib.graph.Graph'>)>

In [3]:
qres = g.query(
    """PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
       SELECT DISTINCT ?a
       WHERE {
          ?a a skos:Concept .
       }""")


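# Map every concept URI to its starting depth of 1.
# A concept without a broader term keeps this value.
topics = dict()
for row in qres:
    topics[row[0]] = 1
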
print("Number of concepts: {}".format(len(topics)))

Number of concepts: 8223


In [4]:
qres = g.query(
    """PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
       SELECT DISTINCT ?a ?b
       WHERE {
          ?a skos:broader ?b .
       }""")

broaders = dict()
narrowers = dict()
for row in qres:
    if row[0] not in broaders:
        broaders[row[0]] = list()
    broaders[row[0]].append(row[1])
    if row[1] not in narrowers:
        narrowers[row[1]] = list()
    narrowers[row[1]].append(row[0])
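
The two lookup tables built above can also be filled with `collections.defaultdict`, which removes the explicit membership checks. The following is a minimal sketch on a hand-made list of (narrower, broader) pairs. The helper `build_hierarchy` and the concept names are hypothetical and only stand in for the rows of the SPARQL result. It also derives the top concepts, assuming a top concept is one that appears as a broader term but has no broader term of its own.

```python
from collections import defaultdict


def build_hierarchy(pairs):
    """Build child->parents and parent->children maps from (narrower, broader) pairs."""
    broaders = defaultdict(list)
    narrowers = defaultdict(list)
    for child, parent in pairs:
        broaders[child].append(parent)
        narrowers[parent].append(child)
    # Return plain dicts so that later lookups do not create empty entries.
    return dict(broaders), dict(narrowers)


# Hypothetical toy data, not taken from TheSoz.
pairs = [("cat", "mammal"), ("dog", "mammal"), ("mammal", "animal"), ("cat", "pet")]
broaders, narrowers = build_hierarchy(pairs)

# Top concepts: used as a broader term, but without a broader term themselves.
top_concepts = sorted(set(narrowers) - set(broaders))
print(top_concepts)  # ['animal', 'pet']
```

Returning plain dictionaries keeps the `in` checks used later in the notebook meaningful, because a `defaultdict` would silently insert a key on every failed lookup by index.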

In [5]:
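unhier = broaders   # concept -> list of its direct broader concepts
concepts = topics   # concept -> depth (overwritten below with the maximum depth)
for concept, value in concepts.items():
    # Breadth-first walk upwards along skos:broader, starting at depth 1.
    start_depth = int(value)
    max_depth = start_depth
    queue = deque([(concept, start_depth)])
    while queue:
        current, depth = queue.popleft()
        # Concepts without a broader term are roots and end this branch of the walk.
        for broader in unhier.get(current, []):
            queue.append((broader, depth + 1))
            max_depth = max(max_depth, depth + 1)

    # Only the value of an existing key changes, so updating during iteration is safe.
    concepts[concept] = max_depth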
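
The breadth-first walk above visits a shared ancestor once per path and would not terminate if the `skos:broader` links contained a cycle. As an alternative sketch, and not the method used in this notebook, the longest chain can be computed with memoised recursion and an explicit cycle guard. The function `max_depth` and the toy hierarchy are hypothetical; depths are counted from 1, as in the cell above.

```python
def max_depth(concept, broaders, cache=None, on_path=None):
    """Length of the longest broader-chain starting at `concept` (a root has depth 1)."""
    cache = {} if cache is None else cache
    on_path = set() if on_path is None else on_path
    if concept in cache:
        return cache[concept]
    if concept in on_path:
        raise ValueError("cycle detected at {}".format(concept))
    on_path.add(concept)
    parents = broaders.get(concept, [])
    # A concept is one level deeper than its deepest parent.
    depth = 1 + max(
        (max_depth(parent, broaders, cache, on_path) for parent in parents),
        default=0,
    )
    on_path.discard(concept)
    cache[concept] = depth
    return depth


# Hypothetical toy hierarchy, not taken from TheSoz.
broaders = {"cat": ["mammal", "pet"], "dog": ["mammal"], "mammal": ["animal"]}
cache = {}  # shared across calls so every concept is resolved only once
depths = {c: max_depth(c, broaders, cache) for c in ["cat", "dog", "mammal", "animal", "pet"]}
print(depths)  # {'cat': 3, 'dog': 3, 'mammal': 2, 'animal': 1, 'pet': 1}
```

Sharing one `cache` across all concepts means each concept is evaluated once, while the `on_path` set turns a cyclic hierarchy into an error instead of an endless loop.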

In [6]:
list_of_depths = pd.DataFrame.from_dict(concepts, orient='index', columns=['depth'])
list_of_depths.sort_values('depth', inplace=True, ascending=False)

In [7]:
print("Concepts are ranked by maximum depth")
list_of_depths.head(20)

Concepts are ranked by maximum depth


                                              depth
http://lod.gesis.org/thesoz/concept_10061648      6
http://lod.gesis.org/thesoz/concept_10058142      6
http://lod.gesis.org/thesoz/concept_10035238      6
http://lod.gesis.org/thesoz/concept_10036183      6
http://lod.gesis.org/thesoz/concept_10046475      6
http://lod.gesis.org/thesoz/concept_10082204      6
http://lod.gesis.org/thesoz/concept_10042743      6
http://lod.gesis.org/thesoz/concept_10061649      6
http://lod.gesis.org/thesoz/concept_10040091      6
http://lod.gesis.org/thesoz/concept_10050512      6
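
The ranking above shows the deepest concepts. To report the depth of the whole hierarchy, the per-concept values can be reduced to a single maximum and a count per level. A minimal sketch using only the standard library follows. The helper `depth_summary` and the sample dictionary are hypothetical and do not reproduce TheSoz results.

```python
from collections import Counter


def depth_summary(depths):
    """Return (overall maximum depth, {depth: number of concepts}) for a concept->depth map."""
    histogram = Counter(depths.values())
    return max(histogram), dict(sorted(histogram.items()))


# Hypothetical sample values, not taken from TheSoz.
sample = {"cat": 3, "dog": 3, "mammal": 2, "animal": 1, "pet": 1}
overall, histogram = depth_summary(sample)
print(overall, histogram)  # 3 {1: 2, 2: 1, 3: 2}
```

In the notebook, the same function could be called with the `concepts` dictionary once the depth cell has run.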


In [8]:
for k, v in broaders.items(): 
    if len(v) > 1: 
        print("{} has {} parents".format(k, len(v)))

http://lod.gesis.org/thesoz/concept_10051649 has 2 parents
http://lod.gesis.org/thesoz/concept_10051733 has 2 parents
http://lod.gesis.org/thesoz/concept_10035652 has 3 parents
http://lod.gesis.org/thesoz/concept_10034970 has 2 parents
http://lod.gesis.org/thesoz/concept_10038178 has 2 parents
http://lod.gesis.org/thesoz/concept_10093747 has 2 parents
http://lod.gesis.org/thesoz/concept_10043347 has 2 parents
http://lod.gesis.org/thesoz/concept_10047976 has 2 parents
http://lod.gesis.org/thesoz/concept_10040512 has 2 parents
http://lod.gesis.org/thesoz/concept_10042365 has 3 parents
http://lod.gesis.org/thesoz/concept_10043826 has 3 parents
http://lod.gesis.org/thesoz/concept_10036205 has 2 parents
http://lod.gesis.org/thesoz/concept_10043309 has 2 parents
http://lod.gesis.org/thesoz/concept_10035657 has 3 parents
http://lod.gesis.org/thesoz/concept_10044122 has 2 parents
http://lod.gesis.org/thesoz/concept_10039808 has 2 parents
http://lod.gesis.org/thesoz/concept_10035660 has 3 paren