Cylleneus + NLP
===============

Once installed, the Cylleneus engine can be used together with the CLTK to run queries programmatically through the search API, which makes it possible to build NLP applications on top of the engine. One simple use of Cylleneus' lemma-based query functionality is finding 'intertexts': passages of text that are lexically similar to, but not morphologically identical to, some source text. Let's try it out.

First, set up the environment.

In [10]:
# Some standard CLTK imports
from cltk.tokenize.latin.word import WordTokenizer
from cltk.stop.latin import STOPS_LIST
STOPS_LIST += ['-que', '-ve', '-ne']

# We need to tell Cylleneus what corpus we want to search, and then instantiate a Searcher object to execute specific queries.
from corpus import Corpus
from search import Searcher, Collection

# Utility imports
from copy import copy
import json
from pprint import pprint

Let's use the Perseus Digital Library minicorpus that comes pre-installed with the Cylleneus repository; it includes the major works of Vergil.

In [11]:
corpus = Corpus('perseus')

Because we want to abstract away morphological details of our source text, we are also going to need to tokenize and lemmatize this text. In this case, for simplicity's sake, we will just be searching for a single phrase, which we can input manually. Since the text isn't coming from a structured corpus, we can use the built-in plaintext tokenizer and lemmatizer.

In [12]:
# The plaintext tokenizer is suitable for 'raw' Latin text.
from corpus.default import CachedTokenizer

# The lemma filter takes a sequence of tokens (word-forms) and uses the Latin WordNet for lemmatization and morphological analysis.
from engine.analysis.filters import CachedLemmaFilter

word_tokenizer = WordTokenizer()
tokenizer = CachedTokenizer()
lemmatizer = CachedLemmaFilter(cached=False)

Now let's run our source text through our lemmatization pipeline. In this contrived example, we are going to search for texts similar to a phrase of Lucretius: *gelidamque pruinam* (Lucr. *RN.* 2.431).

In [13]:
text = 'gelidamque pruinam'

# For efficiency the tokenizer reuses a single Token object, so each token needs to be copied to be preserved
words = [word for word in word_tokenizer.tokenize(text) if word not in STOPS_LIST]
tokens = [copy(token) for token in tokenizer(words, mode='index', tokenize=False)]
pprint(tokens)

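# Collect the distinct lemmas for each token
lemmas = []
for token in tokens:
    lemmatized = set()
    for lemma in lemmatizer([copy(token)]):
        # Keep only the part of the lemma text before the first ':'
        lemmatized.add(lemma.text.split(':')[0])

    lemmas.append(list(lemmatized))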

[CylleneusToken(positions=True, chars=True, stopped=False, boost=1.0, removestops=True, mode='index', original='gelidam', text='gelidam', pos=0, startchar=0, endchar=7),
 CylleneusToken(positions=True, chars=True, stopped=False, boost=1.0, removestops=True, mode='index', original='pruinam', text='pruinam', pos=1, startchar=7, endchar=14)]


At this point we need to construct a well-formed lemma-based query for Cylleneus to execute. We could simply combine the lemmatized tokens together as a sequence.

NB. The lemmatizer tries to be as inclusive as possible, so a form like *fatis* will generate multiple lemmas for possible matching: *fatum* as well as *fatis* and *fatus*. This is why, if we inspect the `lemmas` object, we find that each word of the original text resolves to a list of lemmas.

In [14]:
pprint(lemmas)

[['gelidus', 'gelida'], ['pruina']]
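
Because the lemmatization cell collects the alternatives in a `set`, the order of the lemmas within each inner list is not guaranteed to be the same from run to run. If a stable order matters, for example to get reproducible query strings, one option is to deduplicate with `dict.fromkeys` instead. The following is only a sketch: `unique_lemmas` is a hypothetical helper, and the sample strings are invented stand-ins whose `lemma:annotation` shape is assumed solely from the `split(':')` call used in the lemmatization cell.

```python
def unique_lemmas(annotated):
    """Keep the part before the first ':' and drop duplicates in first-seen order."""
    return list(dict.fromkeys(item.split(':')[0] for item in annotated))


# Hypothetical lemmatizer output for a single token (the annotations are made up)
sample = ['gelidus:x', 'gelida:y', 'gelidus:z']
print(unique_lemmas(sample))
```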


In [15]:
# Construct sequential lemma-based query
subqueries = []
for i, lemma in enumerate(lemmas):
    # If lemmatization didn't produce anything, use the original form
    if len(lemma) == 0:  
        subqueries.append(tokens[i].text)
    elif len(lemma) == 1:
        subqueries.append(f"<{lemma[0]}>")
    else:
        subqueries.append(f'''({' OR '.join([f"<{alt}>" for alt in lemma])})''')

# Join all subqueries into a single adjacency query
adjacency_lemmas = f'''"{' THEN '.join(subqueries)}"'''
pprint(adjacency_lemmas)

'"(<gelidus> OR <gelida>) THEN <pruina>"'


To be more inclusive -- and to take account of that intervening *-que* in Lucretius -- we should probably drop the strict sequential requirement and try a proximity query instead. In this case, any text will match as long as it contains all of the query terms, irrespective of their order.

In [16]:
proximity_lemmas = ' AND '.join(subqueries)
pprint(proximity_lemmas)

'(<gelidus> OR <gelida>) AND <pruina>'
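
The adjacency and proximity queries differ only in the operator that joins the subqueries, so the construction could be folded into a single helper. This is a minimal sketch in plain Python: `build_query` is a hypothetical name and not part of Cylleneus, and the operator strings are simply the ones used in the cells above.

```python
def build_query(lemmas, forms, adjacent=False):
    """Build a lemma query string from per-token lemma lists.

    lemmas: one list of candidate lemmas per token
    forms: the original word forms, used when a token has no lemmas
    adjacent: if True, build a quoted THEN query; otherwise an AND query
    """
    subqueries = []
    for alternatives, form in zip(lemmas, forms):
        if not alternatives:
            # Lemmatization produced nothing, so fall back to the original form
            subqueries.append(form)
        elif len(alternatives) == 1:
            subqueries.append(f"<{alternatives[0]}>")
        else:
            joined = ' OR '.join(f"<{alt}>" for alt in alternatives)
            subqueries.append(f"({joined})")

    if adjacent:
        return '"' + ' THEN '.join(subqueries) + '"'
    return ' AND '.join(subqueries)


example_lemmas = [['gelidus', 'gelida'], ['pruina']]
example_forms = ['gelidam', 'pruinam']
print(build_query(example_lemmas, example_forms))
print(build_query(example_lemmas, example_forms, adjacent=True))
```

Passing the original forms alongside the lemmas keeps the fallback for unlemmatized words without depending on token objects.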


In [17]:
# Execute the query against the given collection of texts.
searcher = Searcher(Collection(corpus.works))
search = searcher.search(proximity_lemmas)  

# Display the results if there are any matches
if search.count != (0, 0, 0):  # matches, docs, corpora
    for result in json.loads(search.to_json())['results']:
        pprint([result['author'],
               result['title'],
               result['reference'],
               result['text']])

['Virgil',
 'Georgics',
 'poem: 2, line: 263',
 '<pre>ante supinatas aquiloni ostendere glaebas,</pre>\n'
 '\n'
 '<pre>quam laetum infodias vitis genus. Optima putri</pre>\n'
 '\n'
 '<match>arva solo: id venti curant <em>gelidaeque</em> '
 '<em>pruinae</em></match>\n'
 '\n'
 '<post>et labefacta movens robustus iugera fossor.</post>\n'
 '\n'
 '<post>Ac si quos haud ulla viros vigilantia fugit,</post>']
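
For an intertext search, the word forms that actually matched are often the most interesting part of a result. Judging only from the output shown here, matched words appear wrapped in `<em>` tags inside the `text` field, so they can be pulled out with a regular expression. This is a sketch under that assumption; `matched_forms` is a hypothetical helper, and the markup may differ for other corpora or versions.

```python
import re


def matched_forms(text):
    """Return the word forms wrapped in <em> tags, in order of appearance."""
    return re.findall(r'<em>(.*?)</em>', text)


# Excerpt of the 'text' field from the Georgics result
excerpt = ('<match>arva solo: id venti curant <em>gelidaeque</em> '
           '<em>pruinae</em></match>')
print(matched_forms(excerpt))
```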


Let's go one step further and look for so-called 'semantic intertexts': texts whose resemblance to the source rests on similarity of meaning rather than similarity of word form. In this case, we are going to abstract away from the phrase's lexical composition, with a query that will look something like this:

In [18]:
proximity_glosses = "[en?icy] AND [en?frost]"
search = searcher.search(proximity_glosses)

# Display the results if there are any matches
if search.count != (0, 0, 0):  # matches, docs, corpora
    for result in json.loads(search.to_json())['results']:
        pprint([result['author'],
               result['title'],
               result['reference'],
               result['text']])

['Virgil',
 'Aeneid',
 'book: 12, line: 905',
 '<pre>Sed neque currentem se nec cognoscit euntem</pre>\n'
 '\n'
 '<pre>tollentemve manus saxumve immane moventem;</pre>\n'
 '\n'
 '<match>genua labant, <em>gelidus</em> concrevit <em>frigore</em> '
 'sanguis.</match>\n'
 '\n'
 '<post>Tum lapis ipse viri, vacuum per inane volutus,</post>\n'
 '\n'
 '<post>nec spatium evasit totum neque pertulit ictum.</post>']
['Virgil',
 'Georgics',
 'poem: 2, line: 263',
 '<pre>ante supinatas aquiloni ostendere glaebas,</pre>\n'
 '\n'
 '<pre>quam laetum infodias vitis genus. Optima putri</pre>\n'
 '\n'
 '<match>arva solo: id venti curant <em>gelidaeque</em> '
 '<em>pruinae</em></match>\n'
 '\n'
 '<post>et labefacta movens robustus iugera fossor.</post>\n'
 '\n'
 '<post>Ac si quos haud ulla viros vigilantia fugit,</post>']
['Virgil',
 'Georgics',
 'poem: 3, line: 441-poem: 3, line: 443',
 '<pre>arduus ad solem et linguis micat ore trisulcis.</pre>\n'
 '\n'
 '<pre>Morborum quoque te causas et signa docebo.<