# Part 1: Construct a KG from unstructured data

This notebook walks through the internal steps of an NLP pipeline that constructs _knowledge graphs_ from unstructured data sources.



## Set up

In [None]:
%pip install -q ipywidgets

In [2]:
from IPython.display import display, HTML, Image, SVG

from collections import defaultdict
from dataclasses import dataclass
import itertools
import os
import typing
import warnings

from gliner_spacy.pipeline import GlinerSpacy
from icecream import ic
from pydantic import BaseModel
from pyinstrument import Profiler
import glirel
import matplotlib
import matplotlib.colors
import networkx as nx
import pandas as pd
import pyvis
import spacy
import transformers

Suppress the noisy log output from the Hugging Face `transformers` and `tokenizers` libraries: report errors only, and turn off tokenizer parallelism.

In [3]:
transformers.logging.set_verbosity_error()
os.environ["TOKENIZERS_PARALLELISM"] = "0"

Show a watermark of the OS, hardware, language environment, and dependent library versions.

In [None]:
%load_ext watermark
%watermark
%watermark --iversions

Start the statistical (sampling) stack trace profiler.

In [5]:
profiler: Profiler = Profiler()
profiler.start()

Define the model selections and parameter settings.

In [6]:
CHUNK_SIZE: int = 1024

GLINER_MODEL: str = "urchade/gliner_small-v2.1"

NER_LABELS: typing.List[str] = [
    "PERSON",    # For military/political leaders like Churchill, General Weygand, King Leopold
    "ORG",       # For organizations like Royal Air Force, Royal Navy, British Expeditionary Force
    "GPE",       # For geopolitical entities like Belgium, France, England
    "LOC",       # For locations like beaches, channels
    "FACILITY",  # For facilities like ports, fortifications
    "DATE",      # For temporal references
    "EVENT",     # For battles and military operations
    "PRODUCT",   # For military equipment/vehicles (e.g. Spitfire, Hurricane)
    "NORP",      # For nationalities/religious/political groups (e.g. German, British, French)
]

RE_LABELS: dict = {
    "glirel_labels": {
        "commands": {
            "allowed_head": ["PERSON"], 
            "allowed_tail": ["ORG"]
        },
        "located_in": {
            "allowed_head": ["FACILITY", "LOC"], 
            "allowed_tail": ["GPE"]
        },
        "member_of": {
            "allowed_head": ["PERSON"],
            "allowed_tail": ["ORG"]
        },
        "affiliated_with": {
            "allowed_head": ["ORG"],
            "allowed_tail": ["GPE"]
        },
        "participated_in": {
            "allowed_head": ["ORG", "PERSON"],
            "allowed_tail": ["EVENT"] 
        },
        "occurred_at": {
            "allowed_head": ["EVENT"],
            "allowed_tail": ["LOC", "GPE", "FACILITY"]
        },
        "occurred_on": {
            "allowed_head": ["EVENT"],
            "allowed_tail": ["DATE"]
        },
        "leader_of": {
            "allowed_head": ["PERSON"],
            "allowed_tail": ["GPE", "ORG"]
        },
        "allied_with": {
            "allowed_head": ["GPE"],
            "allowed_tail": ["GPE"]
        },
        "no_relation": {}
    }
}

SPACY_MODEL: str = "en_core_web_md"

STOP_WORDS: typing.Set[ str ] = set([
    "PRON.it",
    "PRON.that",
    "PRON.they",
    "PRON.those",
    "PRON.we",
    "PRON.which",
    "PRON.who",
])

TR_ALPHA: float = 0.85
TR_LOOKBACK: int = 3
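
Each entry under `glirel_labels` names a relation type and can restrict which entity labels are permitted at its head and tail. The following is a minimal sketch of how such a constraint could be checked, assuming that a missing key places no restriction on that side. The helper `is_allowed` and the `SCHEMA` excerpt are hypothetical, for illustration only; `GLiREL` applies these constraints internally and this is not its code.

```python
import typing

# a two-entry excerpt with the same shape as `RE_LABELS["glirel_labels"]`
SCHEMA: typing.Dict[str, dict] = {
    "located_in": {
        "allowed_head": ["FACILITY", "LOC"],
        "allowed_tail": ["GPE"],
    },
    "no_relation": {},
}


def is_allowed (
    rel: str,
    head_label: str,
    tail_label: str,
    schema: typing.Dict[str, dict],
    ) -> bool:
    """
Hypothetical helper: test whether a (head, tail) label pair satisfies
the constraints declared for one relation type.
    """
    rule: typing.Optional[dict] = schema.get(rel)

    if rule is None:
        # relation types missing from the schema are rejected
        return False

    # assumption: an absent key places no restriction on that side
    head_ok: bool = "allowed_head" not in rule or head_label in rule["allowed_head"]
    tail_ok: bool = "allowed_tail" not in rule or tail_label in rule["allowed_tail"]

    return head_ok and tail_ok


print(is_allowed("located_in", "FACILITY", "GPE", SCHEMA))
print(is_allowed("located_in", "PERSON", "GPE", SCHEMA))
```

Under these assumptions a `FACILITY` may be `located_in` a `GPE`, while a `PERSON` may not, and the empty `no_relation` entry accepts any pair.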

Load the models for `spaCy`, `GLiNER`, and `GLiREL`. This may take several minutes the first time it runs.

In [None]:
nlp: spacy.Language = spacy.load(SPACY_MODEL)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    nlp.add_pipe(
        "gliner_spacy",
        config = {
            "gliner_model": GLINER_MODEL,
            "labels": NER_LABELS,
            "chunk_size": CHUNK_SIZE,
            "style": "ent",
        },
    )
        
    nlp.add_pipe(
        "glirel",
        after = "ner",
    );

Define the global data structures -- which need to be reset for every run, not for each chunk iteration.

In [8]:
graph: nx.Graph = nx.Graph()
known_lemma: typing.List[ str ] = []

## Parse one text chunk

Define an input text chunk.

In [9]:
class TextChunk (BaseModel):
    uid: int
    url: str
    text: str
    
SAMPLE_CHUNK: TextChunk = TextChunk(
    uid = 1,
    url = "https://raw.githubusercontent.com/donbr/kg_rememberall/refs/heads/main/references/winston_churchill_we_shall_fight_speech_june_1940.txt",
    text = """
I have, myself, full confidence that if all do their duty, if nothing is neglected, and if the best arrangements are made, as they are being made, we shall prove ourselves once again able to defend our Island home, to ride out the storm of war, and to outlive the menace of tyranny, if necessary for years, if necessary alone. At any rate, that is what we are going to try to do. That is the resolve of His Majesty’s Government-every man of them. That is the will of Parliament and the nation. The British Empire and the French Republic, linked together in their cause and in their need, will defend to the death their native soil, aiding each other like good comrades to the utmost of their strength. Even though large tracts of Europe and many old and famous States have fallen or may fall into the grip of the Gestapo and all the odious apparatus of Nazi rule, we shall not flag or fail. We shall go on to the end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing confidence and growing strength in the air, we shall defend our Island, whatever the cost may be, we shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the streets, we shall fight in the hills; we shall never surrender, and even if, which I do not for a moment believe, this Island or a large part of it were subjugated and starving, then our Empire beyond the seas, armed and guarded by the British Fleet, would carry on the struggle, until, in God’s good time, the New World, with all its power and might, steps forth to the rescue and the liberation of the old.
    """.strip(),
)

chunk: TextChunk = SAMPLE_CHUNK

Parse the input text.

In [10]:
doc: spacy.tokens.doc.Doc = list(
    nlp.pipe(
        [( chunk.text, RE_LABELS )],
        as_tuples = True,
    )
)[0][0]

Visualize the `spaCy` parse and `GLiNER` _named entity recognition_ results.

In [None]:
for sent in doc.sents:
    spacy.displacy.render(
        sent,
        style = "ent",
        jupyter = True,
    )

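    # render with `jupyter = False` so that displaCy returns the SVG markup,
    # instead of displaying it directly and returning `None`
    parse_svg: str = spacy.displacy.render(
        sent,
        style = "dep",
        jupyter = False,
    )

    display(SVG(parse_svg))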

## Layer 1: construct a lexical graph

Scan the document tokens to add lemmas to the _textgraph_.

In [None]:
for sent in doc.sents:
    node_seq: typing.List[ int ] = []
    ic(sent)

    for tok in sent:
        text: str = tok.text.strip()
        
        if tok.pos_ in [ "NOUN", "PROPN" ]:
            key: str = tok.pos_ + "." + tok.lemma_.strip().lower()
            print(tok.i, key, tok.text.strip())

            if key not in known_lemma:
                # create a new node
                known_lemma.append(key)
                node_id: int = known_lemma.index(key)
                node_seq.append(node_id)

                graph.add_node(
                    node_id,
                    key = key,
                    kind = "Lemma",
                    pos = tok.pos_,
                    text = text,
                    chunk = chunk.uid,
                    count = 1,
                )
            else:
                # link to an existing node, adding weight
                node_id = known_lemma.index(key)
                node_seq.append(node_id)

                node: dict = graph.nodes[node_id]
                node["count"] += 1

    # create the textrank edges
    ic(node_seq)

    for hop in range(TR_LOOKBACK):
        # link each node to the one `hop + 1` positions later in the sentence
        for idx, src_node in enumerate(node_seq[: -1 - hop]):
            neighbor: int = node_seq[hop + idx + 1]
            graph.add_edge(
                src_node,
                neighbor,
                rel = "FOLLOWS_LEXICALLY",
            )
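
To make the sliding window concrete, here is a small standalone sketch that lists the pairs linked for one sentence, without touching the graph. The function name `lookback_pairs` and the node ids are hypothetical, chosen only for illustration.

```python
import typing


def lookback_pairs (
    node_seq: typing.List[int],
    lookback: int,
    ) -> typing.List[typing.Tuple[int, int]]:
    """
List the (source, neighbor) pairs linked by a sliding window over one sentence.
    """
    pairs: typing.List[typing.Tuple[int, int]] = []

    for hop in range(lookback):
        # pair each node with the one `hop + 1` positions later
        for idx, src in enumerate(node_seq[: -1 - hop]):
            pairs.append(( src, node_seq[hop + idx + 1] ))

    return pairs


# hypothetical node ids for a four-lemma sentence, with a window of 2
print(lookback_pairs([ 10, 11, 12, 13 ], 2))
```

With a window of 2, each node links to its immediate successor and to the node after that, so a larger `TR_LOOKBACK` yields a denser lexical graph.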

Keep track of the sentence numbers, which we'll use later for entity co-occurrence links.

In [13]:
sent_map: typing.Dict[ spacy.tokens.span.Span, int ] = {}

for sent_id, sent in enumerate(doc.sents):
    sent_map[sent] = sent_id

## Layer 2: overlay entities onto the graph

Classify spans as potential entities.

Note that if we had previously run [_entity resolution_](https://neo4j.com/developer-blog/entity-resolved-knowledge-graphs/) on _structured_ or _semi-structured_ data sources to generate a "backbone" for the knowledge graph, then we could use the contextualized _surface forms_ from that phase to perform _entity linking_ on the entities extracted here from _unstructured_ data.

In [14]:
@dataclass(order=False, frozen=False)
class Entity:
    loc: typing.Tuple[ int, int ]
    key: str
    text: str
    label: str
    chunk_id: int
    sent_id: int
    span: spacy.tokens.span.Span
    node: typing.Optional[ int ] = None


span_decoder: typing.Dict[ tuple, Entity ] = {}


def make_entity (
    span: spacy.tokens.span.Span,
    chunk: TextChunk,
    ) -> Entity:
    """
Instantiate one `Entity` dataclass object, adding it to the working "vocabulary".
    """
    key: str = " ".join([
        tok.pos_ + "." + tok.lemma_.strip().lower()
        for tok in span
    ])
    
    ent: Entity = Entity(
        ( span.start, span.end, ),
        key,
        span.text,
        span.label_,
        chunk.uid,
        sent_map[span.sent],
        span,
    )

    if ent.loc not in span_decoder:
        span_decoder[ent.loc] = ent
        ic(ent)

    return ent

In [None]:
for span in doc.ents:
    make_entity(span, chunk)

for span in doc.noun_chunks:
    make_entity(span, chunk)
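
The entity keys follow the same `POS.lemma` convention as the lemma nodes, joined by spaces for multi-token spans. A minimal sketch of that key format follows, using hand-written part-of-speech tags rather than a `spaCy` parse; the helper `make_key` is hypothetical, and the tags shown may differ from what the model assigns.

```python
import typing


def make_key (
    tokens: typing.List[typing.Tuple[str, str]],
    ) -> str:
    """
Build a node key from (part-of-speech, lemma) pairs, mirroring the entity key format.
    """
    return " ".join([
        pos + "." + lemma.strip().lower()
        for pos, lemma in tokens
    ])


# hand-written tags for illustration; in the notebook these come from `spaCy`
print(make_key([ ("ADJ", "British"), ("PROPN", "Fleet") ]))
print(make_key([ ("PRON", "we") ]))
```

A single-token pronoun span produces a key such as `PRON.we`, which is the form listed in `STOP_WORDS`, while a multi-token span produces a compound key that will not match those entries.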

Overlay the inferred entity spans atop the base layer constructed by _textgraph_ analysis of the `spaCy` parse trees.

In [16]:
def extract_entity (
    ent: Entity,
    ) -> None:
    """
Link one `Entity` into the existing graph.
    """
    if ent.key not in known_lemma:
        # add a new Entity node to the graph and link to its component Lemma nodes
        known_lemma.append(ent.key)
        node_id: int = known_lemma.index(ent.key)
        
        graph.add_node(
            node_id,
            key = ent.key,
            kind = "Entity",
            label = ent.label,
            pos = "NP",
            text = ent.text,
            chunk = ent.chunk_id,
            count = 1,
        )

        for tok in ent.span:
            tok_key: str = tok.pos_ + "." + tok.lemma_.strip().lower()

            if tok_key in known_lemma:
                tok_idx: int = known_lemma.index(tok_key)

                graph.add_edge(
                    node_id,
                    tok_idx,
                    rel = "COMPOUND_ELEMENT_OF",
                )
    else:
        node_id: int = known_lemma.index(ent.key)
        node: dict = graph.nodes[node_id]
        # promote to an Entity, in case the node had been a Lemma
        node["kind"] = "Entity"
        node["chunk"] = ent.chunk_id
        node["count"] += 1

        # select the more specific label
        if "label" not in node or node["label"] == "NP":
            node["label"] = ent.label

    ent.node = node_id

In [None]:
for ent in span_decoder.values():
    if ent.key not in STOP_WORDS:
        extract_entity(ent)
        ic(ent)

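Report the relations inferred by `GLiREL`, in descending order of score, and add each one to the graph as an edge between its head and tail entities.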

In [None]:
relations: typing.List[ dict ] = sorted(
    doc._.relations,
    key = lambda x: x["score"],
    reverse = True,
)

for item in relations:
    src_loc: typing.Tuple[ int, int ] = tuple(item["head_pos"])
    dst_loc: typing.Tuple[ int, int ] = tuple(item["tail_pos"])
    skip_rel: bool = False

    if src_loc not in span_decoder:
        print("MISSING src entity:", item["head_text"], item["head_pos"])
        
        src_ent: Entity = make_entity(
            doc[ item["head_pos"][0] : item["head_pos"][1] ],
            chunk,
        )

        if src_ent.key in STOP_WORDS:
            skip_rel = True
        else:
            extract_entity(src_ent)

    if dst_loc not in span_decoder:
        print("MISSING dst entity:", item["tail_text"], item["tail_pos"])

        dst_ent: Entity = make_entity(
            doc[ item["tail_pos"][0] : item["tail_pos"][1] ],
            chunk,
        )

        if dst_ent.key in STOP_WORDS:
            skip_rel = True
        else:
            extract_entity(dst_ent)

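    # link the connected nodes
    if not skip_rel:
        src_ent = span_decoder[src_loc]
        dst_ent = span_decoder[dst_loc]

        # skip spans filtered out earlier as stop words, since they have no graph node
        if src_ent.node is None or dst_ent.node is None:
            continue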

        rel: str = item["label"].strip().replace(" ", "_").upper()
        prob: float = round(item["score"], 3)

        print(f"{src_ent.text} {src_ent.node} -> {rel} -> {dst_ent.text} {dst_ent.node} | {prob}")

        graph.add_edge(
            src_ent.node,
            dst_ent.node,
            rel = rel,
            prob = prob,
        )

Connect the co-occurring entities.

In [19]:
ent_map: typing.Dict[ int, typing.Set[ int ]] = defaultdict(set)

for ent in span_decoder.values():
    if ent.node is not None:
        ent_map[ent.sent_id].add(ent.node)    

for sent_id, nodes in ent_map.items():
    for pair in itertools.combinations(list(nodes), 2):
        if not graph.has_edge(*pair):
            graph.add_edge(
                pair[0],
                pair[1],
                rel = "CO_OCCURS_WITH",
                prob = 1.0,
            )
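
The co-occurrence step groups entity nodes by sentence, then links every unordered pair within each group. A standalone sketch with hypothetical sentence and node ids shows the pairing on its own, collecting pairs in a set instead of adding graph edges:

```python
from collections import defaultdict
import itertools
import typing

# hypothetical (sentence id, node id) observations
observed: typing.List[typing.Tuple[int, int]] = [
    (0, 3), (0, 7), (0, 9),
    (1, 7),
    (2, 3), (2, 9),
]

by_sentence: typing.Dict[int, typing.Set[int]] = defaultdict(set)

for sent_id, node_id in observed:
    by_sentence[sent_id].add(node_id)

# collect unordered pairs, so that (3, 9) from sentences 0 and 2 appears once
pairs: typing.Set[typing.Tuple[int, int]] = set()

for nodes in by_sentence.values():
    for pair in itertools.combinations(sorted(nodes), 2):
        pairs.add(pair)

print(sorted(pairs))
```

Sentence 1 holds a single node and so contributes no pairs. In the notebook cell, the `has_edge` check plays a similar de-duplication role, and it also keeps a co-occurrence link from overwriting the attributes of an existing relation edge.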

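Run an eigenvector centrality measure (_PageRank_) to rank the entities. Note that the `weight` argument names an _edge_ attribute, and the edges built above carry no `count` attribute, so each edge defaults to a weight of 1.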

In [20]:
for node, rank in nx.pagerank(graph, alpha = TR_ALPHA, weight = "count").items():
    graph.nodes[node]["rank"] = rank
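
For intuition about what the ranking computes, here is a simplified power-iteration sketch of PageRank on a tiny undirected, unweighted graph. It is an illustrative approximation under those assumptions, not the NetworkX implementation; the function `pagerank_sketch` and the `STAR` graph are hypothetical.

```python
import typing


def pagerank_sketch (
    adj: typing.Dict[str, typing.Set[str]],
    alpha: float = 0.85,
    iters: int = 100,
    ) -> typing.Dict[str, float]:
    """
Simplified PageRank by power iteration, for an undirected graph given as {node: neighbors}.
    """
    nodes: typing.List[str] = list(adj)
    n: int = len(nodes)
    rank: typing.Dict[str, float] = { v: 1.0 / n for v in nodes }

    for _ in range(iters):
        # every node receives the "teleport" share first
        nxt: typing.Dict[str, float] = { v: (1.0 - alpha) / n for v in nodes }

        for v in nodes:
            if adj[v]:
                # split this node's damped rank evenly among its neighbors
                share: float = alpha * rank[v] / len(adj[v])

                for w in adj[v]:
                    nxt[w] += share
            else:
                # a node with no neighbors spreads its rank uniformly
                for w in nodes:
                    nxt[w] += alpha * rank[v] / n

        rank = nxt

    return rank


# hypothetical star graph: one hub linked to three leaves
STAR: typing.Dict[str, typing.Set[str]] = {
    "fleet": { "island", "empire", "seas" },
    "island": { "fleet" },
    "empire": { "fleet" },
    "seas": { "fleet" },
}

ranks: typing.Dict[str, float] = pagerank_sketch(STAR, alpha = 0.85)
print(max(ranks, key = ranks.get))
```

The hub ends up with the highest rank because it receives a share from every leaf, which is the same effect that lifts well-connected entities toward the top of the report below.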

Report the top-ranked entities extracted from this text chunk.

In [None]:
df: pd.DataFrame = pd.DataFrame([
    node_attr
    for node, node_attr in graph.nodes(data = True)
    if node_attr["kind"] == "Entity"
]).sort_values(by = [ "rank", "count" ], ascending = False)

df.head(20)

## Visualize the results

Use `pyvis` to provide an interactive visualization of both layers of the graph built so far.

In [None]:
pv_net: pyvis.network.Network = pyvis.network.Network(
    height = "750px",
    width = "100%",
    notebook = True,
    cdn_resources = "remote",
)

for node_id, node_attr in graph.nodes(data = True):
    if node_attr["kind"] == "Entity":
        color: str = "hsl(65, 46%, 58%)"
        size: int = round(200 * node_attr["rank"])
    else:
        color = "hsla(72, 10%, 90%, 0.95)"
        size = round(30 * node_attr["rank"])

    pv_net.add_node(
        node_id,
        label = node_attr["text"],
        title = node_attr.get("label"),
        color = color,
        size = size,
    )

for src_node, dst_node, edge_attr in graph.edges(data = True):
    pv_net.add_edge(
        src_node,
        dst_node,
        title = edge_attr.get("rel"),
    )

pv_net.toggle_physics(True)
pv_net.show("../data/processed/graphrag_contruct.html")

Show a cluster analysis of the _lexical graph_.

In [23]:
communities: typing.List[ typing.Set[ int ]] = nx.community.louvain_communities(graph)

comm_map: typing.Dict[ int, int ] = {
    node_id: i
    for i, comm in enumerate(communities)
    for node_id in comm
}

xkcd_colors: typing.List[ str ] = list(matplotlib.colors.XKCD_COLORS.values())

colors: typing.List[ str ] = [
    xkcd_colors[comm_map[n]]
    for n in list(graph.nodes())
]

labels: typing.Dict[ int, str ] = {
    node_id: node_attr["text"]
    for node_id, node_attr in graph.nodes(data = True)
}

In [None]:
SPRING_DISTANCE: float = 2.5

nx.draw_networkx(
    graph,
    pos = nx.spring_layout(
        graph,
        k = SPRING_DISTANCE / len(communities),
    ),
    labels = labels,
    node_color = colors,
    edge_color = "#bbb",
    with_labels = True,
    font_size = 8,
)

## Tear down

How much did the global data structures grow?

In [None]:
ic(len(known_lemma))
ic(len(span_decoder))
ic(len(graph.nodes()));

Stop the profiler and report the performance measures.

In [None]:
profiler.stop()
profiler.print()

## Quality checks

Are there any pronoun lemmas that we need to add to the `STOP_WORDS` list? Until we have a good _coreference_ stage in this workflow, the pronouns are too generic and tend to distort the graph results. NB: pronouns inside compound references are "contained" and not a problem.

In [None]:
for x in known_lemma:
    if "PRON" in x:
        print(x)

Which nodes should we promote to the next level?

In [None]:
kept_nodes: typing.Set[ int ] = set()

for node_id, node_attr in graph.nodes(data = True):
    if node_attr["kind"] == "Entity":
        print(node_id, node_attr["key"], node_attr["rank"], node_attr["label"], node_attr["text"], node_attr["chunk"])
        kept_nodes.add(node_id)

Which edges should we promote to the next level?

In [None]:
skip_rel: typing.Set[ str ] = set([ "FOLLOWS_LEXICALLY", "COMPOUND_ELEMENT_OF" ])

for src_id, dst_id, edge_attr in graph.edges(data = True):
    if src_id in kept_nodes and dst_id in kept_nodes:
        rel: str = edge_attr["rel"]

        if rel not in skip_rel:
            print(src_id, dst_id, rel, edge_attr["prob"])

## CITATION: inspired by Paco Nathan's presentation to GraphGeeks.org on 2024-08-14
- https://github.com/DerwenAI/strwythura