In [1]:
%config IPCompleter.greedy=True

### TEI Preprocessing
We preprocess a PDF of our source material: *Graph Representation Learning* by Hamilton, available [here](https://www.cs.mcgill.ca/~wlh/grl_book/files/GRL_Book.pdf).

Text extraction is done following Alpizar-Chacon & Sosnovsky, 2020.

Their data pipeline is available as a web service at https://intextbooks.science.uu.nl/.

The code for the TEI pipeline is available on GitHub ([link](https://github.com/intextbooks/ITCore?tab=readme-ov-file)), but it requires the deployment and coordination of multiple software components. Specifically, it requires MySQL, Apache Jena, and a partial local copy of DBpedia. We use the web service to avoid the effort of deploying the extraction pipeline locally.

We enabled the optional settings "identify index terms in text" and "link entities to DBPedia", using the category "https://<span/>dbpedia.org/page/Category:Technology".

### XML Data-Munging
We process the XML output of the TEI pipeline as described in Yao 2023.

#### install stuff

In [14]:
!pip install xmltodict==0.13.0



In [150]:
import itertools
import xmltodict

#### ingest xml

In [16]:
# the context manager closes the file even if parsing fails
with open("DLB_TEI/teiModel.xml") as f:
    book = xmltodict.parse(
        f.read(),
        xml_attribs=True,
    )
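
The cells below rely on how `xmltodict` maps XML onto Python objects: attributes become keys prefixed with `@`, element text is stored under `#text`, and a child tag becomes a `dict` when it occurs once but a `list` when it repeats. A minimal sketch of that shape follows. The fragment is hand-written rather than taken from `teiModel.xml`, and `as_list` is a hypothetical helper, not part of the pipeline.

```python
# Hand-written stand-ins for what xmltodict returns. The key names mirror
# the ones used in this notebook; the values are made up for illustration.
single = {"list": {"item": {"#text": "1 foo", "ref": {"@target": "seg_1"}}}}
repeated = {
    "list": {
        "item": [
            {"#text": "1 foo", "ref": {"@target": "seg_1"}},
            {"#text": "1.1 bar", "ref": {"@target": "seg_3"}},
        ]
    }
}


def as_list(node):
    """Hypothetical helper: normalize the dict-or-list ambiguity to a list."""
    if node is None:
        return []
    return node if isinstance(node, list) else [node]


for doc in (single, repeated):
    for item in as_list(doc["list"]["item"]):
        print(item["#text"], item["ref"]["@target"])
```

`xmltodict.parse` also accepts a `force_list` argument that makes chosen tags always parse as lists, which may be an alternative to checking types by hand.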

#### get section headings from table of contents

In [34]:
table_of_contents = book["TEI"]["front"]["div"]

# crawl the XML tree in search of items with text
# return a list of (heading text, target segment id) tuples
def grab_toc_headings(node):
    items = []
    if isinstance(node, dict):
        if "#text" in node:
            # a heading may lack a <ref> element or its target attribute
            ref = node.get("ref", {})
            target = ref.get("@target", "NO_TARGET") if isinstance(ref, dict) else "NO_TARGET"
            items.append((node["#text"], target))
        if "item" in node:
            items += grab_toc_headings(node["item"])
        if "list" in node:
            items += grab_toc_headings(node["list"])
    elif isinstance(node, list):
        for elem in node:
            items += grab_toc_headings(elem)
    return items

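# remove section numbers from heading text like "1.2.3 foo bar section"
# note: this drops the first space-separated token unconditionally, so an
# unnumbered multi-word heading would lose its first word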
def strip_toc_headings(lst):
    return [
        (heading.split(" ", maxsplit=1)[-1].strip(), ref) for heading, ref in lst
    ]

toc_headings = grab_toc_headings(table_of_contents)
clean_toc_headings = strip_toc_headings(toc_headings)

In [63]:
toc_headings[:4]

[('1 introduction', 'seg_1'),
 ('i applied math and machine learning basics', 'NO_TARGET'),
 ('ii deep networks: modern practices', 'NO_TARGET'),
 ('1.1 who should read this book?', 'seg_3')]

In [64]:
clean_toc_headings[:4]

[('introduction', 'seg_1'),
 ('applied math and machine learning basics', 'NO_TARGET'),
 ('deep networks: modern practices', 'NO_TARGET'),
 ('who should read this book?', 'seg_3')]

#### get index entries
For some books, the TEI pipeline fails to distinguish the *bibliography* section from the *index* section and merges citations and index terms into one list. It also tends to interpret page ranges in bibliographic citations ("pages 177-228") as if they were page references back into the book text. We filter these citations out.

We use a simple heuristic: check the string length of each item that TEI identifies as an index entry, and set a cutoff between the point where the actual index items end and the bibliographic citations begin.

This is not completely effective, because TEI also has trouble with the two-column layouts that are common in book indexes. For about 15% of the items, it produces combinations like "Point estimator, 119 Reinforcement learning". This results in several relatively long, garbled index items.

For the Deep Learning Book, we set the cutoff at 70 characters. In this book, that separates index items from bibliographic citations. In other books, it might also filter out extra-long garbled index items.
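
The garbled two-column entries described above could in principle be split rather than kept as-is. The following is only a sketch under a strong assumption: that a merged entry has the shape "term, page Next term", with the next term starting with a capital letter. The pattern and the `split_merged_entry` helper are hypothetical and are not used elsewhere in this notebook; note that the page number is discarded.

```python
import re

# Assumption: the boundary between two merged entries is a comma, a page
# number (or page range), and whitespace followed by a capital letter.
MERGED_BOUNDARY = re.compile(r",\s*\d+(?:-\d+)?\s+(?=[A-Z])")


def split_merged_entry(text):
    """Split a garbled two-column entry into its candidate terms."""
    return [part.strip() for part in MERGED_BOUNDARY.split(text) if part.strip()]


print(split_merged_entry("Point estimator, 119 Reinforcement learning"))
print(split_merged_entry("Activation function"))
```

An entry without such a boundary comes back unchanged as a one-element list, so the helper could be applied to every index item before the length filter.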

In [199]:
index_items = book["TEI"]["back"]["div"]["list"]["item"]

# remove bibliography citations that TEI mixed into the index for some reason
# we just use excessive length as the heuristic
# every index item for the Deep Learning Book is under 70 characters
def remove_overlong_items(lst):
    return [
        elem for elem in lst
        if len(elem.get("#text", "")) < 70
    ]

# index tuples are ("foo", set(seg_id...))
def grab_index_tuples(lst):
    tuples = []
    for elem in lst:
        elem_name = elem["#text"]
        if "ref" in elem.keys():
            ref = elem["ref"]
            if type(ref) is dict:
                target = ref.get("@target", "NO_TARGET")
                tup = (elem_name, set([target]))
                tuples.append(tup)
            if type(ref) is list:
                targets = set(r.get("@target", "NO_TARGET") for r in ref)
                tup = (elem_name, targets)
                tuples.append(tup)
    return tuples

# property URIs look something like https://intextbooks.science.uu.nl/model/XXX/property_name
model_domain = book["TEI"]["teiHeader"]["fileDesc"]["publicationStmt"]["pubPlace"]
model_id = book["TEI"]["@xml:id"]
model_property_uri_prefix = f"{model_domain}model/{model_id}/"

all_pairs = lambda lst: itertools.permutations(lst, 2)

normalize_prop_uri = lambda s: s.removeprefix(model_property_uri_prefix).replace("_", " ").lower()

# for index items that have a "FOO, see BAR"
# we make a dict of all pairs "foo=bar" and "bar=foo"
# all lowercase, for canonical lookups
def grab_index_aliases(lst):
    alias_dict = {}
    sameas_uri = "owl:sameAs"
    for elem in lst:
        elem_name = elem["#text"]
        ref = elem.get("seg", {}).get("ref", {})
        
        if type(ref) is dict and ref.get("@property", "") == sameas_uri:
            equivalents = map(str.lower, [
                elem_name,
                normalize_prop_uri(ref["@resource"]),
            ])
            alias_dict.update(all_pairs(equivalents))

        if type(ref) is list:
            equivalents = map(str.lower, [
                elem_name,
                *(
                    normalize_prop_uri(r["@resource"])
                    for r in ref
                    if r.get("@property", "") == sameas_uri
                ),
            ])
            alias_dict.update(all_pairs(equivalents))
    
    return alias_dict

index_items_filtered = remove_overlong_items(index_items)

index_tuples = grab_index_tuples(index_items_filtered)
alias_dict = grab_index_aliases(index_items_filtered)
index_dict = dict(index_tuples)

# add the "FOO, see BAR" terms to the index dict
# with the same segments as BAR
def enrich_with_aliases(index_dict, alias_dict):
    index_dict_copy = dict(index_dict)
    keys = list(index_dict_copy.keys())
    keys_lower = list(map(str.lower, keys))
    for key, alias in alias_dict.items():
        if key not in keys_lower:
            matching_term = next((k for k in keys if k.lower() == alias), None)
            if matching_term is not None:
                index_dict_copy[key] = index_dict[matching_term]
    return index_dict_copy

index_dict_all = enrich_with_aliases(index_dict, alias_dict)
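
To make the alias lookup concrete, here is a small sketch of what `dict.update` does with the ordered pairs from `itertools.permutations`. The terms `foo`, `bar`, and `baz` are placeholders, not entries from the book.

```python
import itertools

# Two equivalent names, as produced by an index entry like "foo, see bar".
pair = {}
pair.update(itertools.permutations(["foo", "bar"], 2))
print(pair)  # both directions are stored

# Three equivalent names: permutations yields six ordered pairs, but a dict
# keeps one value per key, so later pairs overwrite earlier ones.
triple = {}
triple.update(itertools.permutations(["foo", "bar", "baz"], 2))
print(triple)
```

With two names, each maps to the other. With three or more, each key retains only the last alias it was paired with; if every alias needs to be kept, a dict of sets might be a better fit.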

In [200]:
[*itertools.islice(index_dict_all.items(), 10)]

[('Absolute value rectification', {'seg_103'}),
 ('Accuracy', {'seg_207'}),
 ('Activation function', {'seg_99'}),
 ('Active constraint', {'seg_71'}),
 ('AdaGrad', {'seg_151'}),
 ('Adam', {'seg_153', 'seg_211'}),
 ('Adaptive linear element', {'seg_5'}),
 ('Adversarial example', {'seg_137'}),
 ('Adversarial training', {'seg_137', 'seg_141', 'seg_267'}),
 ('Affine', {'seg_77'})]

In [57]:
sorted([len(elem["#text"]) for elem in index_items], reverse=True)

[210,
 207,
 166,
 165,
 163,
 156,
 148,
 133,
 122,
 107,
 105,
 79,
 61,
 59,
 50,
 49,
 49,
 49,
 47,
 46,
 45,
 43,
 42,
 42,
 42,
 40,
 40,
 39,
 38,
 38,
 37,
 37,
 36,
 36,
 36,
 35,
 33,
 33,
 32,
 31,
 31,
 31,
 30,
 30,
 29,
 29,
 29,
 29,
 29,
 29,
 29,
 29,
 28,
 28,
 28,
 28,
 28,
 28,
 27,
 27,
 27,
 27,
 27,
 27,
 27,
 27,
 27,
 27,
 27,
 26,
 26,
 26,
 26,
 26,
 26,
 25,
 25,
 25,
 25,
 25,
 24,
 24,
 24,
 24,
 24,
 24,
 24,
 24,
 24,
 24,
 24,
 24,
 24,
 24,
 24,
 24,
 24,
 23,
 23,
 23,
 23,
 23,
 23,
 23,
 23,
 23,
 23,
 23,
 23,
 22,
 22,
 22,
 22,
 22,
 22,
 22,
 22,
 22,
 22,
 22,
 22,
 21,
 21,
 21,
 21,
 21,
 21,
 21,
 21,
 21,
 21,
 21,
 21,
 21,
 21,
 21,
 21,
 21,
 21,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 20,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 19,
 18,
 18,
 18,
 18,
 18,
 18,
 18,
 18,
 18,
 18,
 18,
 18,
 18,
 18,
 18,
 18,
 18,

#### Wikidata enrichment
Use the Wikidata SPARQL query service (https://query.wikidata.org/sparql) to look for matching entities in Wikidata.

#### Wikidata Query Notes
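- RDF: Resource Description Framework (W3C standard)
- OWL: Web Ontology Language
- Statements are subject-predicate-object triples
- A triple can be written as three full URIs like <http://www.wikidata.org/entity/Q30>, or with prefixes: wd:Q30  wdt:P36  wd:Q61 .
- The wdt: prefix selects "truthy" statements: statements carry a rank, and wdt: returns only the best-ranked ones
- Subject and predicate are URIs; the object can be a URI or a literal value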
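
As an illustration of the prefixed notation in these notes, the sketch below expands the example triple into full URIs. The two namespace URIs are the ones the Wikidata Query Service predefines for `wd:` and `wdt:`; the `expand` helper is hypothetical.

```python
# Namespaces predefined by the Wikidata Query Service.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}


def expand(prefixed_name):
    """Hypothetical helper: turn e.g. 'wd:Q30' into its full URI."""
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local


# subject, predicate, object
triple = ("wd:Q30", "wdt:P36", "wd:Q61")
print(" ".join(f"<{expand(term)}>" for term in triple) + " .")
```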

In [227]:
from urllib.request import Request, urlopen
from urllib.parse import urlencode
import json

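def get_uris_from_term(term):
    """Look up Wikidata items whose English label equals `term`.

    Four casings of the term are tried (as given, title, lower, upper).
    Returns the parsed SPARQL JSON response.
    """
    wikidata_url = "https://query.wikidata.org/sparql"
    
    req_headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/sparql-results+json",
    }

    # build an English-tagged SPARQL string literal; escape backslashes and
    # double quotes so that the term cannot terminate the literal early
    def literal(s):
        escaped = s.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"@en'

    query = {
        "query": f"""
            PREFIX wikibase: <http://wikiba.se/ontology#>
            
            SELECT DISTINCT ?item ?itemLabel ?itemDescription
            
            WHERE {{
              VALUES ( ?termAsis ?termTitle ?termLower ?termCaps )
              
              {{ (
                  {literal(term)}
                  {literal(term.title())}
                  {literal(term.lower())}
                  {literal(term.upper())}
              ) }}
              
              {{ ?item rdfs:label ?termAsis }}
              UNION
              {{ ?item rdfs:label ?termTitle }}
              UNION
              {{ ?item rdfs:label ?termLower }}
              UNION
              {{ ?item rdfs:label ?termCaps }} .
              
              SERVICE wikibase:label {{
                bd:serviceParam wikibase:language "en" .
              }}
            }}
        """
    }

    # passing `data` makes urllib send the query as a form-encoded POST
    body = urlencode(query).encode()
    req = Request(url=wikidata_url, headers=req_headers, data=body)
    with urlopen(req) as resp:
        return json.load(resp)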

get_uris_from_term("java")


{'head': {'vars': ['item', 'itemLabel', 'itemDescription']},
 'results': {'bindings': [{'item': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q2089134'},
    'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Java'},
    'itemDescription': {'xml:lang': 'en',
     'type': 'literal',
     'value': 'human settlement in Walworth County, South Dakota, United States of America'}},
   {'item': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q1430334'},
    'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Java'},
    'itemDescription': {'xml:lang': 'en',
     'type': 'literal',
     'value': 'dance which was developed in France in the early part of the 20th century'}},
   {'item': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q1441377'},
    'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Java'},
    'itemDescription': {'xml:lang': 'en',
     'type': 'literal',
     'value': 'fictional character in the Martin Myst

In [221]:
results["results"]["bindings"][0]

{'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q4677469'},
 'itemLabel': {'xml:lang': 'en',
  'type': 'literal',
  'value': 'activation function'},
 'itemDescription': {'xml:lang': 'en',
  'type': 'literal',
  'value': 'a function associated to a node in a computational network that defines the output of that node given an input or set of inputs'}}
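
The raw response nests each value under `results` → `bindings` → variable name → `value`. A minimal sketch of flattening it into `(uri, label, description)` tuples follows. The `summarize_bindings` helper is hypothetical, and `sample` is reconstructed from the single binding printed above so that the snippet runs without a network call.

```python
# Reconstructed from the binding shown in the output above.
sample = {
    "head": {"vars": ["item", "itemLabel", "itemDescription"]},
    "results": {
        "bindings": [
            {
                "item": {
                    "type": "uri",
                    "value": "http://www.wikidata.org/entity/Q4677469",
                },
                "itemLabel": {
                    "xml:lang": "en",
                    "type": "literal",
                    "value": "activation function",
                },
                "itemDescription": {
                    "xml:lang": "en",
                    "type": "literal",
                    "value": "a function associated to a node in a computational network that defines the output of that node given an input or set of inputs",
                },
            }
        ]
    },
}


def summarize_bindings(response):
    """Hypothetical helper: flatten SPARQL JSON bindings into tuples."""
    return [
        (
            binding["item"]["value"],
            # label and description may be absent for some items
            binding.get("itemLabel", {}).get("value", ""),
            binding.get("itemDescription", {}).get("value", ""),
        )
        for binding in response["results"]["bindings"]
    ]


for uri, label, description in summarize_bindings(sample):
    print(uri, "|", label, "|", description)
```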