Cylleneus + intertextuality
===========================

Once installed, the Cylleneus engine can be used together with the CLTK to run queries programmatically through the search API, which makes it a possible building block for NLP applications. One simple use of Cylleneus' lemma-based query functionality is to find 'intertexts': passages of text that are lexically similar to, but not morphologically identical to, some source text. Let's try it out.

First, set up the environment.

In [2]:
# Utility imports
from copy import copy
from pprint import pprint
from textwrap import wrap
import re

# Some standard CLTK imports
from cltk.stop.latin import STOPS_LIST
from cltk.tokenize.latin.word import WordTokenizer
STOPS_LIST += ['-que', '-ve', '-ne']

from cylleneus.corpus import Corpus
from cylleneus.search import Searcher, Collection

import multiwordnet

# Check MultiWordNet installation
for language in ["common", "english", "french", "hebrew", "italian", "latin", "spanish"]:
    if not multiwordnet.db.exists(language):
        multiwordnet.db.compile(language, verbose=False)



Let's use the pre-indexed Perseus Digital Library sample mini-corpus; it includes the major works of Vergil.

In [None]:
corpus = Corpus("perseus")
if not corpus.searchable:
    corpus.download()

Because we want to abstract away morphological details of our source text, we are also going to need to tokenize and lemmatize this text. In this case, for simplicity's sake, we will just be searching for a single phrase, which we can input manually. Since the text isn't coming from a structured corpus, we can use the built-in plaintext tokenizer and lemmatizer.

In [None]:
# The plaintext tokenizer is suitable for tokenizing plaintext sources.
from cylleneus.corpus.default import CachedTokenizer

# The lemma filter takes a sequence of tokens (word-forms) and uses the Latin WordNet for lemmatization and morphological analysis.
from cylleneus.engine.analysis.filters import CachedLemmaFilter

word_tokenizer = WordTokenizer()
tokenizer = CachedTokenizer()
lemmatizer = CachedLemmaFilter(cached=False)


Now let's run our "source" text through our lemmatization pipeline. In this fabricated example, we are going to search for texts similar to the phrase of Lucretius: *gelidamque pruinam* (Lucr. *RN.* 2.431).

In [None]:
text = 'gelidamque pruinam'

# Drop stop words (including the enclitics added to STOPS_LIST above) before tokenizing
words = [word for word in word_tokenizer.tokenize(text) if word not in STOPS_LIST]
# For efficiency the tokenizer reuses a single Token object, so each token needs to be copied to be preserved
tokens = [copy(token) for token in tokenizer(words, mode='index', tokenize=False)]

lemmas = []
for token in tokens:
    lemmatized = set()
    for lemma in lemmatizer([copy(token),]):
        lemmatized.add(lemma.text.split(':')[0])

    lemmas.append(list(lemmatized))
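
The shape of the result is easier to see on stand-in data. The sketch below mimics the two steps above without CLTK or Cylleneus: it drops enclitic stop tokens, then reduces each analysis to its headword by cutting at the first colon. The token list, the `demo_analyses` mapping, and the `headword:annotation` strings are all invented for illustration; real tokenizer and lemmatizer output may look different.

```python
# Stand-in data (hypothetical): what a word tokenizer might return for
# 'gelidamque pruinam', plus made-up analyses in a 'headword:annotation' shape.
demo_stops = {"-que", "-ve", "-ne"}
demo_raw_tokens = ["gelidam", "-que", "pruinam"]
demo_analyses = {
    "gelidam": ["gelidus:a", "gelidus:b"],  # two analyses, same headword
    "pruinam": ["pruina:a"],
}

# Step 1: filter out stop tokens such as the split-off enclitic
demo_words = [w for w in demo_raw_tokens if w not in demo_stops]

# Step 2: keep only the part before the first colon, de-duplicated per word
demo_lemmas = []
for word in demo_words:
    headwords = {analysis.split(":")[0] for analysis in demo_analyses.get(word, [])}
    demo_lemmas.append(sorted(headwords))  # sorted only to make the order stable

print(demo_words)   # ['gelidam', 'pruinam']
print(demo_lemmas)  # [['gelidus'], ['pruina']]
```

Each word ends up with its own list of candidate lemmas, which is the structure the query-building step expects.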

At this point we need to construct a well-formed lemma-based query for Cylleneus to execute. We could simply combine the lemmatized tokens together as a sequence.

NB. The lemmatizer tries to be as inclusive as possible, so a form like *fatis* will generate multiple lemmas for possible matching: *fatum* as well as *fatis* and *fatus*. This is why, if we were to inspect the `lemmas` object, we would find that each word of the original text resolves to a list of lemmas.

In [None]:
# pprint(lemmas)

In [None]:
# Construct sequential lemma-based query
subqueries = []
for i, lemma in enumerate(lemmas):
    # If lemmatization didn't produce anything, use the original form
    if len(lemma) == 0:
        subqueries.append(tokens[i].text)
    elif len(lemma) == 1:
        subqueries.append(f"<{lemma[0]}>")
    else:
        subqueries.append(f'''({' OR '.join([f"<{alt}>" for alt in lemma])})''')

# Join all subqueries into a single adjacency query
adjacency_lemmas = f'''"{' THEN '.join(subqueries)}"'''
# pprint(adjacency_lemmas)
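
To see what strings this produces without running the lemmatizer, the same three-way branching can be tried on hand-written input. This is a minimal sketch: `build_subqueries` is a hypothetical helper that restates the loop above, the *fatis* alternatives are the ones mentioned in the note, and `xyz` is a placeholder for a form with no lemma.

```python
def build_subqueries(forms, lemma_lists):
    """Turn per-word lemma lists into query terms, falling back to the surface form."""
    terms = []
    for form, alternatives in zip(forms, lemma_lists):
        if not alternatives:
            # Nothing from lemmatization: search for the form as written
            terms.append(form)
        elif len(alternatives) == 1:
            terms.append(f"<{alternatives[0]}>")
        else:
            # Several candidate lemmas: any one of them may match
            joined = " OR ".join(f"<{alt}>" for alt in alternatives)
            terms.append(f"({joined})")
    return terms

# Hand-written stand-ins for `tokens` and `lemmas`
demo_forms = ["fatis", "pruinam", "xyz"]
demo_lemma_lists = [["fatis", "fatum", "fatus"], ["pruina"], []]

demo_subqueries = build_subqueries(demo_forms, demo_lemma_lists)
print(demo_subqueries)
# ['(<fatis> OR <fatum> OR <fatus>)', '<pruina>', 'xyz']

demo_adjacency = '"' + " THEN ".join(demo_subqueries) + '"'
print(demo_adjacency)
# "(<fatis> OR <fatum> OR <fatus>) THEN <pruina> THEN xyz"
```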


To be more inclusive -- and to take account of that intervening *-que* in Lucretius -- we should probably do away with the strict sequential requirement and try instead using a proximity query. In this case, any text will match provided only that it contains the matching query terms, irrespective of their ordering.

In [None]:
proximity_lemmas = f'''{' AND '.join(subqueries)}'''
# pprint(proximity_lemmas)
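
The difference between the two query styles can be illustrated with a toy model. The sketch below is not how Cylleneus evaluates queries; it only contrasts "terms in sequence" with "all terms present" on passages represented as lists of lemmas. The function names and the sample passages are invented.

```python
def matches_in_sequence(query, passage):
    """True if the query lemmas occur consecutively and in order."""
    n = len(query)
    return any(passage[i:i + n] == query for i in range(len(passage) - n + 1))

def matches_all_terms(query, passage):
    """True if every query lemma occurs somewhere, in any order."""
    return set(query) <= set(passage)

demo_query = ["gelidus", "pruina"]
demo_passage_a = ["gelidus", "pruina", "cado"]       # terms adjacent, in order
demo_passage_b = ["pruina", "et", "gelidus", "nix"]  # terms separated and reversed

print(matches_in_sequence(demo_query, demo_passage_a))  # True
print(matches_in_sequence(demo_query, demo_passage_b))  # False
print(matches_all_terms(demo_query, demo_passage_a))    # True
print(matches_all_terms(demo_query, demo_passage_b))    # True
```

The looser test accepts everything the strict one does, plus passages where the words are reordered or separated.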

In [None]:
# Execute the query against the given collection of texts.
searcher = Searcher(Collection(works=corpus.works))
results = searcher.search(proximity_lemmas)

# Display results nicely
def display_text(text: str):
    subs = [("<pre>", ""), ("</pre>", ""), ("<match>", ""), ("</match>", ""), ("<post>", ""), ("</post>", ""), (r"<em>(.+?)</em>", r"\033[1m\033[36m\1\033[21m\033[0m")]
    for pat, sub in subs:
        # flags must be passed by keyword: the fourth positional argument of re.sub is count
        text = re.sub(pat, sub, text, flags=re.DOTALL)
    return "\n".join(wrap(text))

# Display the results if there are any matches
if results.count != (0, 0, 0):  # matches, docs, corpora
    for n, (c, author, title, urn, reference, text) in enumerate(results.to_text()):
        # print rather than pprint: pprint shows the string's repr, with newlines and ANSI codes escaped
        print(f"{n}. {author}, {title}: {reference}\n{display_text(text)}\n")
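
The clean-up step can be tried on its own, without a search result. The sketch below applies the same idea to a made-up string: structural tags are dropped and `<em>` spans are wrapped in ANSI codes for bold cyan. The tag names are inferred from the substitutions above, `strip_markup` is a hypothetical name, and the sample text is a placeholder rather than real search output.

```python
import re
from textwrap import wrap

def strip_markup(marked_up: str) -> str:
    """Remove structural tags and highlight <em>...</em> spans for a terminal."""
    # Drop <pre>, <match>, <post> and their closing tags
    plain = re.sub(r"</?(?:pre|match|post)>", "", marked_up)
    # Bold cyan on, matched text, reset; re.DOTALL lets a highlight span a newline.
    # Note the keyword: re.sub's fourth positional parameter is count, not flags.
    plain = re.sub(r"<em>(.+?)</em>", "\033[1m\033[36m\\1\033[0m", plain, flags=re.DOTALL)
    return "\n".join(wrap(plain))

demo_hit = "<pre>aaa </pre><match><em>bbb</em> ccc</match><post> ddd</post>"
demo_clean = strip_markup(demo_hit)
print(demo_clean)        # 'bbb' appears highlighted in an ANSI-capable terminal
print(repr(demo_clean))  # 'aaa \x1b[1m\x1b[36mbbb\x1b[0m ccc ddd'
```

Printing the `repr` makes the escape codes visible, which is useful when the highlighting does not render as expected.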


Let's go one step further and look for so-called 'semantic intertexts': texts that resemble the source not in word form but in meaning. In this case, we are going to abstract away from the phrase's lexical composition, with a query that will look something like this:

In [None]:
proximity_glosses = "[en?icy] AND [en?frost]"
results = searcher.search(proximity_glosses)

# Display the results if there are any matches
if results.count != (0, 0, 0):  # matches, docs, corpora
    for n, (c, author, title, urn, reference, text) in enumerate(results.to_text()):
        # print rather than pprint: pprint shows the string's repr, with newlines and ANSI codes escaped
        print(f"{n}. {author}, {title}: {reference}\n{display_text(text)}\n")
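
If the English glosses come from a list rather than being typed by hand, the query string can be assembled the same way as the lemma queries. A minimal sketch follows, assuming only the `[en?word]` syntax shown in the cell above; `gloss_query` is a hypothetical helper, and language codes other than `en` are not confirmed by this notebook.

```python
def gloss_query(glosses, lang="en", operator="AND"):
    """Join glosses into a query string of the form '[en?icy] AND [en?frost]'."""
    terms = [f"[{lang}?{gloss}]" for gloss in glosses]
    return f" {operator} ".join(terms)

demo_gloss_query = gloss_query(["icy", "frost"])
print(demo_gloss_query)  # [en?icy] AND [en?frost]
```

The resulting string could then be passed to `searcher.search(...)` in place of the hand-written one.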