# Experiment: Chat with the [Research Data Information knowledge graph](https://nfdi4culture.de/resources/knowledge-graph.html)

SPARQL query interfaces have a reputation for being difficult to use, which keeps many users from even trying to search knowledge graphs.
Even a seasoned professional can be baffled by not finding results that they suspect (or even know) are in a knowledge graph.

Can we also present an alternative interface to the Research Data Information knowledge graph (RDIKG) by using chat interfaces, like ChatGPT?

## Method

Index the RDIKG with an embedding model and store the embeddings in a fast retrieval system. Let the end user pose a question, look the question up in the embeddings, and retrieve all matching nodes from the RDIKG. Pass the retrieved nodes to ChatGPT as part of the prompt, re-posing the question to be answered.
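
To make the index-and-lookup part concrete before touching the real graph, here is a minimal sketch. The bag-of-words `embed` function, the `retrieve` helper and the two sample chunks are stand-ins invented for illustration; the notebook itself uses a sentence-transformers model further down.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(question: str, index: dict, top_k: int = 3) -> list:
    """Return the ids of the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda cid: cosine(q, index[cid][1]), reverse=True)
    return ranked[:top_k]

# Hypothetical chunks; the real ones are generated from the graph later on.
chunks = {
    "ex:org1": "name Example Music Archive type Organization",
    "ex:person1": "name Jane Doe type Person",
}
# Index step: keep each chunk's text together with its embedding.
index = {cid: (text, embed(text)) for cid, text in chunks.items()}

best = retrieve("which organization is a music archive", index, top_k=1)
print(best)
```

Swapping `embed` for a real model changes the quality of the ranking, not the shape of the pipeline.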


In [1]:
from pyoxigraph import *
import os, sys, json, random, io, rich

g = Store()
g.bulk_load("a.ttl", "text/turtle")
print(len(g))


14960


In [2]:
DEFAULT_PREFIXES = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
    "http://www.w3.org/2000/01/rdf-schema#": "rdfs",
    "http://www.w3.org/2002/07/owl#": "owl",
    "http://schema.org/": "schema",
    "http://www.wikidata.org/entity/": "wd",
    "http://www.wikidata.org/entity/statement/": "wds",
    "http://wikiba.se/ontology#": "wikibase",
    "http://www.wikidata.org/prop/direct/": "wdt",
    "http://www.w3.org/2004/02/skos/core#": "skos",
    "http://purl.org/dc/terms/": "dct",
    "http://purl.org/dc/elements/1.1/": "dc",
    "http://dbpedia.org/resource/": "dbr",
    "https://nfdi4culture.de/ontology#": "nfdico",
    "http://xmlns.com/foaf/0.1/": "foaf",
    "http://purl.org/cerif/frapo/": "frapo",
    "http://vivoweb.org/ontology/core#": "vivo"
}
class Namespace:
    def __init__(self, iri:str):
        self._iri = iri
    def __getattr__(self, key):
        return NamedNode(self._iri+str(key))
    
NFDICO = Namespace('https://nfdi4culture.de/ontology#')
RDF = Namespace("http://www.w3.org/1999/02/22-rdf-syntax-ns#")
RDFS = Namespace("http://www.w3.org/2000/01/rdf-schema#")
SDO = Namespace("http://schema.org/")

def pfx(item)-> str:
    for k, p in DEFAULT_PREFIXES.items():
        if item.value.startswith(k):
            return item.value.replace(k, p+":")
    return item.value

def pfy(item)-> str:
    for k, p in DEFAULT_PREFIXES.items():
        if item.value.startswith(k):
            return item.value.replace(k, "")
    return item.value


In [3]:
print(pfy(NFDICO.person))

person
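
One caveat with `pfx` as written: it returns on the first matching key in dictionary order, so an IRI under `http://www.wikidata.org/entity/statement/` is caught by the shorter `wd` namespace before `wds` is ever tried. A possible fix, sketched here on plain strings with a hypothetical `shorten` helper that is not part of the notebook, is to try the longest namespace first.

```python
PREFIXES = {
    "http://www.wikidata.org/entity/": "wd",
    "http://www.wikidata.org/entity/statement/": "wds",
    "http://schema.org/": "schema",
}

def shorten(iri: str, prefixes: dict = PREFIXES) -> str:
    """Replace the longest matching namespace with its prefix label."""
    for ns in sorted(prefixes, key=len, reverse=True):
        if iri.startswith(ns):
            return prefixes[ns] + ":" + iri[len(ns):]
    return iri  # no known namespace: return the IRI unchanged

print(shorten("http://www.wikidata.org/entity/statement/Q42-abc"))
print(shorten("http://www.wikidata.org/entity/Q42"))
print(shorten("https://example.org/unknown"))
```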


Looking at the [overview of the graph](https://nfdi4culture.de/ontology.html), we can see what the main kinds of "things" are.

Let's list all the classes with the number of instances of each, and all the predicates with the number of triples that use them.

## What are the classes and properties?

In [4]:
classes = {}
predicates = {}
class_by_s = {}

for s, p, o, _ in g.quads_for_pattern(None, None, None):
    if p == RDF.type:
        classes.setdefault(o, []).append(s)
        class_by_s[s] = o
    predicates.setdefault(p, []).append(s)
print('### Classes')
for k, v in reversed(sorted(classes.items(), key=lambda x: len(x[1]))):
    print(pfx(k), len(v))
print()
print('### Predicates')
for k, v in reversed(sorted(predicates.items(), key=lambda x: len(x[1]))):
    print(pfx(k), len(v))

### Classes
schema:OrganizationRole 1103
nfdico:Person 231
nfdico:Organization 185
schema:DefinedTerm 176
schema:Role 126
nfdico:Contribution 121
schema:Review 96
schema:NewsArticle 82
nfdico:MediaType 65
schema:Place 60
schema:GeoCoordinates 60
schema:Event 52
nfdico:Software 44
schema:MediaObject 31
nfdico:DataPortal 23
nfdico:Service 20
nfdico:AcademicDiscipline 8
schema:Guide 6
nfdico:Project 4
schema:WebSite 1

### Predicates
rdf:type 2494
schema:name 1307
schema:sameAs 1279
schema:roleName 1200
rdfs:label 1123
schema:member 1104
schema:memberOf 1103
schema:keywords 860
nfdico:mediaType 490
nfdico:url 380
schema:image 283
schema:url 246
schema:familyName 230
schema:givenName 230
owl:sameAs 224
schema:provider 194
schema:description 165
schema:knowsAbout 155
schema:honorificPrefix 150
frapo:hasAcronym 133
schema:author 123
nfdico:subjectArea 123
schema:contributor 101
schema:reviewBody 96
schema:review 96
schema:datePublished 77
schema:text 74
schema:organizer 71
nfdico:subsidiaryO

Looking at the above, we can intuit that 'Person' and 'Organization' might be the things we are interested in chunking as information. But how are they related? The ontology docs at https://nfdi4culture.de/ontology.html or https://nfdi.fiz-karlsruhe.de/ontology do not make the relations obvious at a glance. Maybe we can [browse the graph with shmarql](https://epoz.org/shmarql?e=https://nfdi4culture.de/sparql&s=%3Chttps%3A//nfdi4culture.de/id/E1835%3E) and see if that helps.

But we also want to see how things are related.

Let's nose around and look at some persons and [organizations](https://epoz.org/shmarql?e=https://nfdi4culture.de/sparql&s=%3Chttps%3A//nfdi4culture.de/id/E1835%3E).

> interlude: shmarql in current form does not allow browsing blanknodes. 😡 let's fix this

*sidenote* turns out we _can't_ fix this, as using blanknodes to query in a shmarql browse does not make sense. You need at least two triple patterns in a query to really make an interesting blanknode query. So it was only today that I really grokked this. 🙃

I did try running this [query](https://nfdi4culture.de/sparql):

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>
PREFIX nfdico: <https://nfdi4culture.de/ontology#>

SELECT distinct ?type ?p WHERE {
  ?s rdf:type ?type .
  ?a ?p ?type .
}
```
and it worked, but as soon as I tried ordering it by ?type, the server timed out.
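
Since the full graph is already loaded locally, the same question can be answered without the remote endpoint, with the sorting done client-side. The sketch below shows what the query computes, using plain tuples and made-up `ex:` identifiers rather than the pyoxigraph store: every predicate whose object is also used as a class somewhere.

```python
RDF_TYPE = "rdf:type"

# Hypothetical triples standing in for the graph.
triples = [
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:acme", RDF_TYPE, "ex:Organization"),
    ("ex:prop1", "rdfs:range", "ex:Person"),
]

def type_predicate_pairs(triples):
    """Distinct (type, predicate) pairs for: ?s rdf:type ?type . ?a ?p ?type ."""
    types = {o for _, p, o in triples if p == RDF_TYPE}
    pairs = {(o, p) for _, p, o in triples if o in types}
    return sorted(pairs)  # the ordering that the server could not deliver

for t, p in type_predicate_pairs(triples):
    print(t, p)
```

Note that `rdf:type` itself shows up for every class, because the type triple also matches the second pattern.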


In [61]:
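# Derive, from the instance data, which predicates are used on subjects of each class.
class_properties = {}
for s, p, o, _ in g.quads_for_pattern(None, None, None):
    t = class_by_s.get(s)
    if t is None:
        continue  # subject has no rdf:type, so it cannot be assigned to a class
    class_properties.setdefault(pfx(t), set()).add(pfx(p))
rich.print(class_properties)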

Of course, if you can read the OWL constraints in the TBox of the ontology natively, the above class-property list will already be clear; here we are not reading it from the TBox but deriving it from the ABox.

In [6]:
class Portion:
    """A small subgraph around one subject, rendered as a text chunk for embedding."""

    def __init__(self, sourcegraph, subject):
        self.graph = Store()
        for s, p, o, _ in sourcegraph.quads_for_pattern(subject, None, None):
            self.graph.add(Quad(s, p, o))
            if type(o) in (NamedNode, BlankNode):
                for ss, pp, oo, _ in sourcegraph.quads_for_pattern(o, None, None):
                    if type(oo) == Literal:
                        self.graph.add(Quad(ss, pp, oo))
                        continue
                    for sss, ppp, ooo, _ in sourcegraph.quads_for_pattern(
                        oo, None, None
                    ):
                        if type(sss) != BlankNode and type(ooo) != BlankNode:
                            self.graph.add(Quad(sss, ppp, ooo))
        # for s, p, o, _ in sourcegraph.quads_for_pattern(None, None, subject):
        #     for ss, pp, oo, _ in sourcegraph.quads_for_pattern(s, None, None):
        #         self.graph.add(Quad(ss, pp, oo))
        self._uri = subject
        self._data = {}
        for s, p, o, _ in self.graph.quads_for_pattern(None, None, None):
            if type(o) == BlankNode:
                continue
            if p == RDFS.seeAlso:
                continue
            self._data.setdefault(s, {}).setdefault(p, []).append(o)

    def __str__(self):
        buf = []
        for s, predicates in self._data.items():
            if type(s) == BlankNode:
                continue
            buf.append("\nID: " + pfx(s))
            for k, v in predicates.items():
                buf.append(pfy(k) + " " + "\n  ".join([pfx(vv) for vv in v]))
        return "\n".join(buf)

    def turtle(self):
        output = io.BytesIO()
        serialize(
            [
                Triple(s, p, o)
                for s, p, o, _ in self.graph.quads_for_pattern(None, None, None)
            ],
            output,
            "text/turtle",
        )
        return output.getvalue().decode("utf8")
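
The nested loops in `Portion.__init__` amount to following outgoing edges from the subject for a few hops. Below is a generalised sketch of that idea on plain tuples, with a hypothetical `neighbourhood` function and made-up data. Unlike `Portion`, it keeps every triple it visits and applies no blank-node or literal filtering.

```python
def neighbourhood(triples, subject, max_hops=2):
    """Follow outgoing edges from `subject`, expanding the frontier max_hops times."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((s, p, o))
    seen, frontier, result = {subject}, [subject], []
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for s, p, o in by_subject.get(node, []):
                result.append((s, p, o))
                # Only objects that are themselves subjects can be expanded further.
                if o in by_subject and o not in seen:
                    seen.add(o)
                    next_frontier.append(o)
        frontier = next_frontier
    return result

# Hypothetical data: a person linked to an organization through a role node.
triples = [
    ("ex:p1", "memberOf", "ex:role1"),
    ("ex:role1", "memberOf", "ex:org1"),
    ("ex:org1", "name", "Example Org"),
    ("ex:org2", "name", "Unrelated"),
]
print(neighbourhood(triples, "ex:p1", max_hops=2))
```

The `seen` set keeps cycles in the graph from being expanded twice, which the hand-unrolled loops avoid only by stopping at a fixed depth.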


In [12]:
p = Portion(g, random.choice(classes[NFDICO.Person]))
#p = Portion(g, NamedNode("https://nfdi4culture.de/id/E2118"))
print(p)


ID: https://nfdi4culture.de/id/E1824
hasAcronym Publication & Availability
description 
    Within the cultural heritage domain, both repository providers and researches from all disciplines face increasing challenges concerning the publication and long-term digital preservation of research outputs. The growing complexity calls for an improvement of existing standards and the optimisation of services across the subject areas art history and architecture, musicology, performing arts as well as film and media studies. In close collaboration with the community, we aim to close gaps within the existing infrastructure and to establish sustainable and reliable services.
Repositories

type nfdico:Organization
parentOrganization https://nfdi4culture.de/id/E1820
name Data publication and data availability
  Task Area 4: Data publication and data availability
image https://nfdi4culture.de/fileadmin/user_upload/task-areas/ta4.svg
label Task Area 4: Data publication and data availability

ID: htt

In [8]:
o = Portion(g, random.choice(classes[NFDICO.Organization]))
print(o)


ID: https://nfdi4culture.de/id/E2055
givenName Susanne
url https://nfdi4culture.de/about-us/people/c7c8260a-2c87-4f86-9d74-488aae68f639.html
type nfdico:Person
familyName Rode-Breymann
honorificPrefix Prof. Dr.
name Susanne  Rode-Breymann
label Susanne  Rode-Breymann

ID: https://nfdi4culture.de/id/E1865
url https://die-deutschen-musikhochschulen.de/die-rkm/
hasAcronym RdMH
knowsAbout https://nfdi4culture.de/id/E2314
type nfdico:Organization
sameAs wd:Q2142432
name Rektorenkonferenz der deutschen Musikhochschulen
label Rektorenkonferenz der deutschen Musikhochschulen

ID: https://nfdi4culture.de/id/E1940
givenName Bernd
url https://nfdi4culture.de/about-us/people/c3251983-44ef-4f36-96e6-b2b2680a5774.html
type nfdico:Person
familyName Redmann
honorificPrefix Prof. Dr.
name Bernd  Redmann
image https://nfdi4culture.de/fileadmin/user_upload/people/CSB/Redmann_Bernd__Rektorenkonferenz_der_deutschen_Musikhochschulen_neu.JPG
label Bernd  Redmann

ID: https://nfdi4culture.de/id/E2400
type sc

## Indexing Time

In [17]:
!ls /Users/etienne/.cache/torch/sentence_transformers

sentence-transformers_all-mpnet-base-v2


In [57]:
from rich.progress import track
from sentence_transformers import SentenceTransformer, util
# model = SentenceTransformer("msmarco-roberta-base-ance-firstp")
# model = SentenceTransformer("all-mpnet-base-v2")
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

corpus = []
corpus_ids = []
for x in classes[NFDICO.Person]:
    corpus.append(str(Portion(g, x)))
    corpus_ids.append(x)
for x in classes[NFDICO.Organization]:
    corpus.append(str(Portion(g, x)))
    corpus_ids.append(x)
corpus_embeddings = model.encode(corpus)


In [60]:
#a = embeddings["https://nfdi4culture.de/id/E3113"]
query_embedding = model.encode("partners of nfdi4culure")
search_results = util.semantic_search(query_embedding, corpus_embeddings)
for sr in search_results[0][:5]:
    print(sr)
    print(corpus[sr['corpus_id']])
    print("-----------------------------")


#r = [(util.cos_sim(e, q).item(), i) for i, e in embeddings.items()]
#r.sort(key=lambda x: x[0])
#rich.print(r[-5:])
#rich.print(r[:5])

{'corpus_id': 177, 'score': 0.45451000332832336}

ID: https://nfdi4culture.de/id/E1827
hasAcronym Governance & Administration
description 
    We bring together all administrative and co-ordinative activities, incentives for participation and inward-outward cooperation, dissemination, community engagement and outreach while taking care of reporting and all governance operations. We enable knowledge pooling and exchange between our task areas and other NFDI consortia while engaging in NFDI-wide cross-cutting topics.

type nfdico:Organization
subOrganization https://nfdi4culture.de/id/E1830
  https://nfdi4culture.de/id/E1832
name Governance and Administration
  Task Area 7: Governance and Administration
image https://nfdi4culture.de/fileadmin/user_upload/task-areas/ta7.svg
label Task Area 7: Governance and Administration

ID: https://nfdi4culture.de/id/E1929
type schema:Role
sameAs wd:Q9200127
label Member

ID: https://nfdi4culture.de/id/E1830
hasAcronym CCO
type nfdico:Organization
pare

In [46]:
print(corpus[220])
# print(Portion(g, NamedNode("https://nfdi4culture.de/id/E2137")))


ID: https://nfdi4culture.de/id/E2378
type schema:Role
sameAs wd:Q46135267
label Affiliation

ID: https://nfdi4culture.de/id/E2025
url https://rism.info
hasAcronym RISM
knowsAbout https://nfdi4culture.de/id/E2314
type nfdico:Organization
sameAs wd:Q2178828
name Répertoire International des Sources Musicales International e. V.
  Répertoire International des Sources Musicales International
label Répertoire International des Sources Musicales International

ID: https://nfdi4culture.de/id/E2137
givenName Klaus
url https://nfdi4culture.de/about-us/people/aacff1c5-47ad-42af-80c8-bd378efb1566.html
type nfdico:Person
familyName Pietschmann
honorificPrefix Prof. Dr.
name Klaus  Pietschmann
label Klaus  Pietschmann
