Cylleneus + Translation Alignments
==================================

Once installed, the Cylleneus engine can be used in conjunction with the CLTK and other libraries to perform queries programmatically via the search API. In this way, the engine can be used in the service of text analysis. One use could be to query and manipulate so-called "translation alignments", where the translation of some Latin or Greek text is linked with the primary-language source text, supporting research in classical reception or translation studies. Let's see what we can do!

First, we set up our environment.

In [None]:
# Utility imports
import codecs

import lxml.etree as et

# We need to tell Cylleneus which corpus we want to search, and then instantiate a Searcher object to execute specific queries.
from cylleneus.corpus import Corpus
from cylleneus.search import Collection, Searcher
from cylleneus.utils import nrange
from cylleneus.settings import CORPUS_DIR

Robust translation-aligned corpora do not yet exist for Greek and Latin, but an alignment of A. T. Murray's translation of Homer's *Iliad*, produced by Gregory Crane, is available. Cylleneus comes with this text already indexed for immediate use.

In [None]:
corpus = Corpus('translation_alignments')
searcher = Searcher(Collection(works=corpus.works_for(author="Homer", title="Iliad (English)")))

We can start by simply searching the translation for a word (or lemma) that interests us. Perhaps we're interested in swords. Because Cylleneus supports simple word-form searching, we could search for occurrences of 'sword' or 'swords'.

In [None]:
# Let's see how many results we get.
search = searcher.search('sword')
print("Results for 'sword':", len(search.results))

search = searcher.search('swords')
print("Results for 'swords':", len(search.results))

But Cylleneus also supports lemma-based queries. Since we want to capture all occurrences of this English word (not just the singular or just the plural), we can change our query to search by the lemma 'sword'.

In [None]:
# This should get us all of them.
search = searcher.search('<sword>')
print("Results for <sword>:", len(search.results))

# View the raw search results.
# for hlite in search.highlights:
#     print(hlite.reference, "=", hlite.text)

That's better. But by now we must be wondering: what word or words in Greek is English 'sword' translating? There are, after all, at least three Greek words corresponding to this meaning: *ξίφος*, *ἄορ*, and *φάσγανον* (there is also *ῥομφαία*, but that is more like a 'broadsword' and at any rate is a late word). Does Homer use all three for 'sword'? We can use our translation alignment, along with a primary source text, to explore the question.

*ἄορ* is intriguing. It occurs in the *Odyssey* as a clear synonym of *ξίφος* and can be substituted for it in similar formulaic expressions (cf. 10.294, *σὺ ξίφος ὀξὺ ἐρυσσάμενος παρὰ μηροῦ* with 11.24, *ἐγὼ δ᾽ ἄορ ὀξὺ ἐρυσσάμενος παρὰ μηροῦ*). Does it ever bear the meaning of 'sword' in the *Iliad*?

In [None]:
# Let's first define some convenience functions to help us manipulate the alignment data.
def match_lemma(doc, alignment, lemma):    
    """ Match a lemma within a given alignment reference range """

    # Split the alignment reference into its start and end; a single reference is its own range
    try:
        start, end = alignment.split('-')
    except ValueError:
        start = end = alignment

    # Get all tokens within the reference range
    tokens = [t
        for ref in nrange(start.split('.'), end.split('.'))
        for t in doc.findall(".//t[@p='{}']".format('.'.join([str(n) for n in ref])))
    ]

    # Scan the tokens for the target lemma
    matches = []
    for token in tokens:
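        # A path lookup returns None if either <l> or <l1> is missing,
        # so unlemmatized tokens are skipped instead of raising AttributeError
        l1 = token.find('l/l1')

        if l1 is not None and l1.text == lemma:
            matches.append(token.get('p'))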
    return matches

def aligned_text(doc, alignment):
    """ Fetch source text for a given alignment reference range """
    
    # Split the alignment reference into its start and end; a single reference is its own range
    try:
        start, end = alignment.split('-')
    except ValueError:
        start = end = alignment

    # Get all tokens within the reference range
    tokens = [t
        for ref in nrange(start.split('.'), end.split('.'))
        for t in doc.findall(".//t[@p='{}']".format('.'.join([str(n) for n in ref])))
    ]

    # Reconstruct the text
    s = ""
    for token in tokens:
        f = token.find('f')
        join = token.get('join')
        if join and join == "b":
            s += f.text     
        else:
            s += " " + f.text
    return s
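
To make the two helpers above easier to follow, here is a minimal, self-contained sketch of the same steps on a toy document. The XML layout (`<t p="..." join="...">` tokens holding a form in `<f>` and a lemma in `<l><l1>`) is only inferred from the functions above, not taken from the corpus file itself. The toy data and the names `split_alignment` and `line_refs` are hypothetical; `line_refs` is a simplified stand-in for `nrange` that handles ranges within a single book only, and the standard library's `xml.etree.ElementTree` is used in place of `lxml` so the sketch runs anywhere.

```python
import xml.etree.ElementTree as ET

# Toy document; the element layout is an assumption inferred from the helpers above.
TOY = """
<text>
  <t p="1.1"><f>μῆνιν</f><l><l1>μῆνις</l1></l></t>
  <t p="1.1"><f>ἄειδε</f><l><l1>ἀείδω</l1></l></t>
  <t p="1.2"><f>οὐλομένην</f><l><l1>οὐλόμενος</l1></l></t>
  <t p="1.2" join="b"><f>,</f><l/></t>
</text>
"""


def split_alignment(alignment):
    """Split 'start-end' into (start, end); a bare reference is its own range."""
    start, _, end = alignment.partition("-")
    return start, end or start


def line_refs(start, end):
    """Hypothetical stand-in for nrange: line references within one book."""
    book, first = start.split(".")
    end_book, last = end.split(".")
    if book != end_book:
        raise ValueError("this sketch only handles ranges within one book")
    return ["{}.{}".format(book, n) for n in range(int(first), int(last) + 1)]


doc = ET.fromstring(TOY)
start, end = split_alignment("1.1-1.2")

# Collect every token whose 'p' attribute falls inside the reference range.
tokens = [
    t
    for ref in line_refs(start, end)
    for t in doc.findall(".//t[@p='{}']".format(ref))
]

# Lemma per token; None where the token has no <l1> (e.g. punctuation).
lemmas = [t.findtext("l/l1") for t in tokens]

# Rebuild the text: tokens marked join="b" attach to the preceding token.
text = ""
for t in tokens:
    form = t.findtext("f", default="")
    text += form if t.get("join") == "b" else " " + form

print(lemmas)
print(text.strip())
```

Using a path such as `l/l1` in `findtext` means that a token without lemma information simply yields `None` rather than raising an error.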

Now let's load a lemmatized text of the *Iliad* against which to check our results.

In [None]:
with codecs.open(CORPUS_DIR + "/eng/translation_alignments/source/tlg0012.tlg001.perseus-grc2.xml", 'rb') as fp:
    value = fp.read()
parser = et.XMLParser(encoding='utf-8')
doc = et.XML(value, parser=parser)

Now we can match our translations against the source text. Do any of the English results for 'sword' correspond to *ἄορ* in the Greek?


In [None]:
matched = []
hlites = list(search.highlights)
print("Results matching for 'ἄορ':")
for i, (hit, meta, fragment) in enumerate(search.results):
    text = aligned_text(doc, meta["start"]["alignment"])
    for match in match_lemma(doc, meta["start"]["alignment"], "ἄορ"):
        print(match, text, hlites[i].text)
        matched.append(meta["start"]["alignment"])

Yes! A fair number, in fact. What about *ξίφος*?

In [None]:
print("Results matching for 'ξίφος':")
for i, (hit, meta, fragment) in enumerate(search.results):
    text = aligned_text(doc, meta["start"]["alignment"])
    for match in match_lemma(doc, meta["start"]["alignment"], "ξίφος"):
        print(match, text, hlites[i].text)
        matched.append(meta["start"]["alignment"])

And *φάσγανον*?

In [None]:
print("Results matching for 'φάσγανον':")
for i, (hit, meta, fragment) in enumerate(search.results):
    text = aligned_text(doc, meta["start"]["alignment"])
    for match in match_lemma(doc, meta["start"]["alignment"], "φάσγανον"):
        print(match, text, hlites[i].text)
        matched.append(meta["start"]["alignment"])
        

*ξίφος* stands behind the greatest number of 'sword' appearances by far. You may have noticed, though, that the number of results matching for *ξίφος*, *ἄορ*, and *φάσγανον* together does not add up to the number of results for `<sword>` in Murray's translation. What accounts for the unmatched translations?

In [None]:
missing = [
    (meta["start"]["alignment"], hlites[i].text)
    for i, (hit, meta, fragment) in enumerate(search.results)
    if meta["start"]["alignment"] not in matched
]

for alignment, text in missing:
    print(alignment, text, aligned_text(doc, alignment))
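
The bookkeeping spread across the last few cells (one loop per Greek lemma, then a search for the leftovers) can also be done in a single pass. The following is only a sketch under stated assumptions: `tally_lemmas`, the `lookup` callable, and the toy data are hypothetical names introduced for illustration, and the references and lemma assignments are made up rather than drawn from the *Iliad*. In the notebook, `lookup` could wrap `match_lemma(doc, alignment, lemma)`.

```python
from collections import Counter


def tally_lemmas(alignments, lemmas, lookup):
    """Count, per lemma, the alignments in which it occurs, and collect
    the alignments in which none of the given lemmas occurs."""
    counts = Counter()
    unmatched = []
    for alignment in alignments:
        found = [lemma for lemma in lemmas if lookup(alignment, lemma)]
        counts.update(found)
        if not found:
            unmatched.append(alignment)
    return counts, unmatched


# Hypothetical data standing in for real alignment lookups.
TOY_LEMMAS = {
    "ref-1": ["ξίφος"],
    "ref-2": ["ἄορ"],
    "ref-3": ["ξίφος", "φάσγανον"],
    "ref-4": ["χαλκός"],
}
TARGETS = ["ξίφος", "ἄορ", "φάσγανον"]


def toy_lookup(alignment, lemma):
    """Truthy if the lemma occurs in the toy alignment."""
    return lemma in TOY_LEMMAS[alignment]


counts, unmatched = tally_lemmas(TOY_LEMMAS, TARGETS, toy_lookup)
for lemma in TARGETS:
    print(lemma, counts[lemma])
print("unmatched:", unmatched)
```

Because an alignment is recorded as unmatched only when none of the target lemmas is found in it, the leftover list falls out of the same loop that produces the counts.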
    

In the first case, 'sword' actually occurs as part of the translation of *χρυσάορος*, an epithet of Apollo ("golden-sword'd"): a part-of-speech mismatch. In the third, 'swords' translates *ξιφέεσσιν*, which should have been counted amongst the matches for that word. For some reason, however, it was not correctly lemmatized. The remaining occurrence is more revealing. It reminds us that *χαλκός* (like Latin *ferrum*) can also, by a kind of metonymy, have the sense of 'sword'. This is a case, then, where the translator has "flattened" the sense of the source text, replacing the figurative expression with a more prosaic one. It is also a case in which the translator has interpreted a generic term -- in addition to 'sword', we know that *χαλκός* (literally, 'bronze') can also cover the meaning of 'spear', 'knife', 'axe', or even 'armor' -- in a more specialized sense. Of course, Cylleneus can't explain why that choice was made. But it has helped us identify some interesting features of Murray's translation.