Cylleneus + NLP
===============

Most often, the Cylleneus search engine will be used through one of its more user-friendly interfaces. However, the engine can also be used as an API to run queries programmatically, which makes it possible to build NLP applications on top of it. One simple use of Cylleneus' lemma-based query functionality is to find 'intertexts': passages of text that are lexically similar to, but not morphologically identical to, some source text. Let's try it out.

First, set up the environment.

In [234]:
# Utility imports
import json
from lang.latin.stop_words import STOP_WORDS

# We need to tell Cylleneus which corpus we want to search, and then instantiate a Searcher object to execute specific queries.
from corpus import Corpus
from search import Searcher, Collection

Let's use the Perseus Digital Library minicorpus that comes pre-installed with the Cylleneus repository; it includes the major works of Vergil.

In [235]:
corpus = Corpus('perseus')

Because we want to abstract away morphological details of our source text, we are also going to need to tokenize and lemmatize this text. In this case, for simplicity's sake, we will just be searching for a single phrase, which we can input manually. Since the text isn't coming from a structured corpus, we can use the built-in plaintext tokenizer and lemmatizer.

In [236]:
# The plaintext tokenizer should be suitable for just about any 'raw' Latin text.
from engine.analysis.tokenizers import CachedPlainTextTokenizer

# The lemma filter takes a sequence of tokens (word-forms) and uses the Latin WordNet for lemmatization and morphological analysis.
from engine.analysis.filters import CachedLemmaFilter

tokenizer = CachedPlainTextTokenizer()
lemmatizer = CachedLemmaFilter()

Now let's run our source text through our lemmatization pipeline. In this fabricated example, we are going to search for texts similar to the phrase of Lucretius: *gelidamque pruinam* (Lucr. *RN.* 2.431).

In [237]:
# For efficiency the tokenizer reuses a single Token object, so each token needs to be copied to be preserved
from copy import copy

text = 'gelidamque pruinam'
tokens = [copy(token) for token in tokenizer(text, mode='index') if token.text not in STOP_WORDS['CONJUNCTIONS']]

lemmas = []
for token in tokens:
    lemmas.append(list({lemma.text.split(':')[0] for lemma in lemmatizer([token], mode='query')}))
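
One detail worth noting about the cell above: a Python `set` does not guarantee any particular ordering, so the order of the alternative lemmas (and hence of the terms in the final query string) can differ between runs. If a stable order matters, one option is to deduplicate with `dict.fromkeys` instead. The following is only a sketch: `unique_headwords` is a hypothetical helper, and the sample strings are made-up stand-ins for `lemma.text`, assuming nothing more than the "headword before the first colon" shape implied by `split(':')[0]`.

```python
def unique_headwords(analyses):
    """Deduplicate lemma headwords while keeping first-seen order."""
    # Keep only the part before the first colon, as in the cell above
    headwords = (analysis.split(":")[0] for analysis in analyses)
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(headwords))


# Hypothetical analysis strings, for illustration only
sample = ["gelidus:a", "gelida:n", "gelidus:n"]
print(unique_headwords(sample))  # ['gelidus', 'gelida']
```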

At this point we need to construct a well-formed lemma-based query for Cylleneus to execute. In the most basic kind of query, we would simply combine the lemmatized tokens together as a sequence.

NB. The lemmatizer tries to be as inclusive as possible, so a form like *fatis* will generate multiple lemmas for possible matching: *fatum* as well as *fatis* and *fatus*. This is why, if we were to inspect the `lemmas` object, we would find that each word of the original text resolves to a list of lemmas.

In [238]:
from pprint import pprint
pprint(lemmas)

[['gelidus', 'gelida'], ['pruina']]


In [239]:
# Construct sequential lemma-based query
subqueries = []
for i, lemma in enumerate(lemmas):
    if len(lemma) == 0:  # no lemma found, use the original form
        subqueries.append(tokens[i].text)
    elif len(lemma) == 1:
        subqueries.append(f"<{lemma[0]}>")
    else:
        subqueries.append(f'''({' OR '.join([f"<{alt}>" for alt in lemma])})''')

# Join all subqueries into a single adjacency query
adjacent_lemmas = f'''"{' '.join(subqueries)}"'''

To be more inclusive, we could do away entirely with the sequential requirement and instead try a proximity query. In this case, any text will match provided only that it contains the matching query terms, irrespective of their ordering.

In [240]:
proximal_lemmas = f'''{' '.join(subqueries)}'''
pprint(proximal_lemmas)

'(<gelidus> OR <gelida>) <pruina>'
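
If we wanted to reuse the query-building steps for other phrases, they could be wrapped up in a small function. This is a minimal sketch in plain Python, not part of Cylleneus itself: `build_lemma_queries` is a hypothetical helper that takes the surface forms and their candidate lemmas as ordinary lists of strings and returns both the quoted adjacency query and the unordered one.

```python
def build_lemma_queries(forms, lemmas):
    """Return (adjacency_query, unordered_query) for parallel lists of
    surface forms and candidate-lemma lists."""
    subqueries = []
    for form, candidates in zip(forms, lemmas):
        if not candidates:            # no lemma found: fall back to the literal form
            subqueries.append(form)
        elif len(candidates) == 1:    # unambiguous: a single lemma term
            subqueries.append(f"<{candidates[0]}>")
        else:                         # ambiguous: OR the alternatives together
            alternatives = " OR ".join(f"<{alt}>" for alt in candidates)
            subqueries.append(f"({alternatives})")
    joined = " ".join(subqueries)
    # Quoting the joined terms turns them into an adjacency (phrase) query
    return f'"{joined}"', joined


adjacent, unordered = build_lemma_queries(
    ["gelidamque", "pruinam"],
    [["gelidus", "gelida"], ["pruina"]],
)
print(adjacent)   # "(<gelidus> OR <gelida>) <pruina>"
print(unordered)  # (<gelidus> OR <gelida>) <pruina>
```

Either string could then be passed to `searcher.search(...)` in the same way as the hand-built queries.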


In [241]:
# Execute the query against the given collection of texts.
searcher = Searcher(Collection(corpus.works))
search = searcher.search(proximal_lemmas)

# Display the results if there are any matches
if search.count != (0, 0, 0):  # matches, docs, corpora
    for result in json.loads(search.to_json())['results']:
        pprint([result['author'],
               result['title'],
               result['reference'],
               result['text']])

['Virgil',
 'Georgics',
 'poem: 2, line: 263',
 '<pre>ante supinatas aquiloni ostendere glaebas,</pre>\n'
 '\n'
 '<pre>quam laetum infodias vitis genus. Optima putri</pre>\n'
 '\n'
 '<match>arva solo: id venti curant <em>gelidaeque</em> '
 '<em>pruinae</em></match>\n'
 '\n'
 '<post>et labefacta movens robustus iugera fossor.</post>\n'
 '\n'
 '<post>Ac si quos haud ulla viros vigilantia fugit,</post>']
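
For an intertext search, we will often want just the matched word-forms rather than the whole marked-up passage. Judging from the output above, the `text` field wraps the matching line in `<match>` tags and each hit in `<em>` tags, so a couple of regular expressions are enough to pull the hits out. This is only a sketch under that assumption: `matched_forms` is a hypothetical helper, and the sample string is abridged from the result shown above.

```python
import re


def matched_forms(result_text):
    """Return the word-forms wrapped in <em>...</em> inside the <match> span."""
    match_span = re.search(r"<match>(.*?)</match>", result_text, flags=re.DOTALL)
    if match_span is None:  # no <match> markup at all
        return []
    return re.findall(r"<em>(.*?)</em>", match_span.group(1))


# Abridged copy of the 'text' field from the Georgics result above
sample_text = (
    "<pre>quam laetum infodias vitis genus. Optima putri</pre>\n\n"
    "<match>arva solo: id venti curant <em>gelidaeque</em> "
    "<em>pruinae</em></match>\n\n"
    "<post>et labefacta movens robustus iugera fossor.</post>"
)
print(matched_forms(sample_text))  # ['gelidaeque', 'pruinae']
```

Here the source phrase *gelidamque pruinam* (accusative singular) is matched by *gelidaeque pruinae*, which is exactly the kind of lexically similar but morphologically different passage we set out to find.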
