# More powerful concordances

In this notebook, we show how we can leverage Python NLTK, CTS-compliant texts and treebanks to create some powerful concordances with just a few lines of code.

It is amazing how much you can accomplish with Python NLTK! In this case, it is enough to override a couple of methods of the classes in the [text](http://www.nltk.org/_modules/nltk/text.html) module, and you can really have a lot of fun...

I will also make use of my special [corpus reader]() for CTS-compliant texts, which makes it easier to load and tokenize Perseus' and 1stKYearsOfGreek texts, as well as to keep tokens in sync with their citations.

How are our concordances better?

1. They include a **canonical citation** for each occurrence (something which you don't find often in corpus investigation tools)
2. The indexing is very **flexible**! It can be based on:
    * simple form (vanilla concordances)
    * lemma
    * morpho-tag
    * ... in brief: any metadata you want to attach to your tokens!

In what follows we see how to create these concordances with a few lines of code.

In [6]:
import sys
sys.path.append("../")
sys.path.append("../../")

In [10]:
from cite_corpus_reader import CapitainCorpusReader
from perseus_text import CitableConcordanceIndex
import os
# we use etree to parse the treebank XML: if you have MyCapytain installed, then you also have lxml...
from lxml import etree

# Set up

Now we load a series of sample files:
* the treebanks of the tragedies of Aeschylus
* the TEI editions of Aeschylus from the Perseus DL

In [11]:
pers_root = os.path.expanduser("~/cltk_data/greek/text/canonical-greekLit-master/data")
tb_root = os.path.expanduser("~/Documents/lavoro/treebank/files/AGDT2.X/PerseusDL-treebank_data-96df3cc/v2.1/Greek/texts")

## The simple text of Aeschylus' editions

In [16]:
aesch_corpus = CapitainCorpusReader(pers_root, r"tlg0085/tlg.+/tlg.+grc2\.xml")
aesch_corpus.fileids()

['tlg0085/tlg001/tlg0085.tlg001.perseus-grc2.xml',
 'tlg0085/tlg002/tlg0085.tlg002.perseus-grc2.xml',
 'tlg0085/tlg003/tlg0085.tlg003.perseus-grc2.xml',
 'tlg0085/tlg004/tlg0085.tlg004.perseus-grc2.xml',
 'tlg0085/tlg005/tlg0085.tlg005.perseus-grc2.xml',
 'tlg0085/tlg006/tlg0085.tlg006.perseus-grc2.xml',
 'tlg0085/tlg007/tlg0085.tlg007.perseus-grc2.xml']

Now we save the citations and the words in two parallel lists that we'll use later on.

In [17]:
cite_words = aesch_corpus.corpus_cite_words()
aesch_tokens = [t[-1] for t in cite_words]
aesch_cites = ["{}:{}".format(t[0], t[1]) for t in cite_words]
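The list comprehensions above assume that each item returned by `corpus_cite_words()` is a tuple whose first two elements identify the work and the passage, and whose last element is the token. The toy data below is invented to illustrate that assumed shape (the real tuples may carry more fields); it shows that the two resulting lists stay aligned position by position.

```python
# Hypothetical sample mimicking the assumed shape (work, passage, token).
sample_cite_words = [
    ("tlg0085.tlg001", "1", "Ζεὺς"),
    ("tlg0085.tlg001", "1", "μὲν"),
    ("tlg0085.tlg001", "2", "στόλον"),
]

sample_tokens = [t[-1] for t in sample_cite_words]
sample_cites = ["{}:{}".format(t[0], t[1]) for t in sample_cite_words]

# The i-th citation belongs to the i-th token.
for token, cite in zip(sample_tokens, sample_cites):
    print(token, cite)
```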

## Aeschylus treebank

In [15]:
# glob collects all the Aeschylus treebank files, so that we can parse them in one loop
from glob import glob
aes_tb_files = glob(tb_root+"/tlg0085*.xml")
aesch_tb_words = []

for f in aes_tb_files:
    x = etree.parse(f)
    ws = x.xpath("//word[@cite]")
    aesch_tb_words.extend(ws)

Here we turn every treebank token in Aeschylus into a tuple of form, lemma, postag and relation. We'll use these attributes later to index the entries in the tragedies.

In [46]:
tbcites = [":".join(w.attrib["cite"].split(":")[3:5]) for w in aesch_tb_words]
tbtokens = [(w.attrib["form"], w.attrib["lemma"], 
             w.attrib["postag"], w.attrib["relation"]) for w in aesch_tb_words]
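To make the slicing of the `cite` attribute concrete, here is a small self-contained sketch. It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs anywhere, and the XML snippet is a hand-made sample: the URN layout (`urn:cts:greekLit:<work>:<passage>`) is inferred from the citations printed later in this notebook, and the second word (the one without a `cite` attribute) is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-made sample, not copied from the treebank files.
SAMPLE = """
<treebank>
  <sentence id="1">
    <word id="1" form="ἐπίδοι" lemma="ἐπεῖδον" postag="v3saoa---"
          relation="PRED" cite="urn:cts:greekLit:tlg0085.tlg001:1"/>
    <word id="2" form="[0]" lemma="" postag="---------" relation="ExD"/>
  </sentence>
</treebank>
"""

root = ET.fromstring(SAMPLE)
# Keep only the words that carry a cite attribute.
words = root.findall(".//word[@cite]")

short_cites = []
for w in words:
    # parts is ['urn', 'cts', 'greekLit', 'tlg0085.tlg001', '1']
    parts = w.attrib["cite"].split(":")
    # elements 3 and 4 are the work and the passage
    short_cites.append(":".join(parts[3:5]))

print(short_cites)  # ['tlg0085.tlg001:1']
```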

# The concordances

## Simple-text concordances

Creating a concordance index is as easy as this:

In [23]:
aesch_index = CitableConcordanceIndex(aesch_tokens, aesch_cites)

Now let's use this object to print the first 10 occurrences of the word ἄναξ.

In [24]:
aesch_index.print_concordance("ἄναξ", lines=10)

Displaying 10 of 31 matches:
λιπεῖν ἔτλητε ; τίς κατέσκηψεν τύχη ; ἄναξ Πελασγῶν , αἰόλʼ ἀνθρώπων κακά . πόνο (tlg0085.tlg001:328)
 κλῦθί μου πρόφρονι καρδίᾳ , Πελασγῶν ἄναξ . ἴδε με τὰν ἱκέτιν φυγάδα περίδρομον (tlg0085.tlg001:349)
πειθὼ δʼ ἕποιτο καὶ τύχη πρακτήριος . ἄναξ ἀνάκτων , μακάρων μακάρτατε καὶ τελέω (tlg0085.tlg001:524)
ις ; αὐτὸς ὁ πατὴρ φυτουργὸς αὐτόχειρ ἄναξ γένους παλαιόφρων μέγας τέκτων , τὸ π (tlg0085.tlg001:592)
τοιάνδʼ ἔπειθεν ῥῆσιν ἀμφʼ ἡμῶν λέγων ἄναξ Πελασγῶν , ἱκεσίου Ζηνὸς κότον μέγαν  (tlg0085.tlg001:616)
οὐ κατοικτιεῖ . διωλόμεσθʼ · ἄσεπτʼ , ἄναξ , πάσχομεν — πολλοὺς ἄνακτας , παῖδας (tlg0085.tlg001:904)
άσκεις ; τὰ θεῶν μηδὲν ἀγάζειν . Ζεὺς ἄναξ ἀποστεροί - η γάμον δυσάνορα δάιον ,  (tlg0085.tlg001:1062)
ων φύλακες , κατὰ πρεσβείαν οὓς αὐτὸς ἄναξ Ξέρξης βασιλεὺς Δαρειογενὴς εἵλετο χώ (tlg0085.tlg002:5)
έφθιτο καὶ νὺξ ἐπῄει , πᾶς ἀνὴρ κώπης ἄναξ ἐς ναῦν ἐχώρει πᾶς θʼ ὅπλων ἐπιστάτης (tlg0085.tlg002:378)
ξεκείνωσεν πεσόν , ἐξ οὗτε τιμὴν Ζεὺς ἄναξ τήνδʼ ὤπασε

Cool, we have the lines and the (sort of) canonical citations (but we can make it 100% canonical if we want...)
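One way to get from these short references to full CTS URNs is to add the namespace and the edition. The sketch below rests on assumptions: `to_cts_urn` is a hypothetical helper, and the `greekLit` namespace and `perseus-grc2` edition are read off the file names listed above rather than taken from the corpus reader.

```python
def to_cts_urn(short_cite, namespace="greekLit", edition="perseus-grc2"):
    """Expand e.g. 'tlg0085.tlg001:328' into a full CTS URN (hypothetical helper)."""
    # Split only on the first colon: the passage part may itself be complex.
    work, passage = short_cite.split(":", 1)
    return "urn:cts:{}:{}.{}:{}".format(namespace, work, edition, passage)

print(to_cts_urn("tlg0085.tlg001:328"))
# urn:cts:greekLit:tlg0085.tlg001.perseus-grc2:328
```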

Can we make it prettier?

Sure! The `find_concordance` method locates all the information we need and returns a list of matches with context. Then we can format it as HTML, or any way we want...

In [33]:
conc = aesch_index.find_concordance("ἄναξ")
conc[0]

ConcordanceLine(left=['.', 'δοκεῖτε', 'δή', 'μοι', 'τῆσδε', 'κοινωνεῖν', 'χθονὸς', 'τἀρχαῖον', '.', 'ἀλλὰ', 'πῶς', 'πατρῷα', 'δώματα', 'λιπεῖν', 'ἔτλητε', ';', 'τίς', 'κατέσκηψεν', 'τύχη', ';'], query='ἄναξ', right=['Πελασγῶν', ',', 'αἰόλʼ', 'ἀνθρώπων', 'κακά', '.', 'πόνου', 'δʼ', 'ἴδοις', 'ἂν', 'οὐδαμοῦ', 'ταὐτὸν', 'πτερόν', '·', 'ἐπεὶ', 'τίς', 'ηὔχει', 'τήνδʼ', 'ἀνέλπιστον'], offset=1759, cite='tlg0085.tlg001:328', left_print='λιπεῖν ἔτλητε ; τίς κατέσκηψεν τύχη ;', right_print='Πελασγῶν , αἰόλʼ ἀνθρώπων κακά . πόνο', line='λιπεῖν ἔτλητε ; τίς κατέσκηψεν τύχη ; ἄναξ Πελασγῶν , αἰόλʼ ἀνθρώπων κακά . πόνο')

In [34]:
h = "<ul>\n"
for c in conc[:10]:
    h += '<li>{} <span style="color:blue">{}</span> {} (<span style="color:green">{}</span>)</li>\n'.format(
        c.left_print, c.query, c.right_print, c.cite)
# close the list so that the HTML is well-formed
h += "</ul>"


In [35]:
from IPython.display import HTML
HTML(h)
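If the context may contain characters such as `<` or `&`, it is safer to escape it before building the markup. The following is a sketch, not part of the original notebook: `Line` is a stand-in namedtuple with only the four fields used in the formatting loop, and `render_html` is a hypothetical helper.

```python
from collections import namedtuple
from html import escape

# Stand-in for the concordance lines: only the fields needed for display.
Line = namedtuple("Line", ["left_print", "query", "right_print", "cite"])

def render_html(lines):
    """Build an HTML list from concordance lines, escaping every text field."""
    items = []
    for c in lines:
        items.append(
            '<li>{} <span style="color:blue">{}</span> {} '
            '(<span style="color:green">{}</span>)</li>'.format(
                escape(c.left_print), escape(c.query),
                escape(c.right_print), escape(c.cite)))
    return "<ul>\n{}\n</ul>".format("\n".join(items))

demo = [Line("a < b", "ἄναξ", "c & d", "tlg0085.tlg001:328")]
print(render_html(demo))
```

The returned string can be passed to `IPython.display.HTML` exactly like `h` above.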

So we learn that there are 31 occurrences of ἄναξ in Aeschylus. What about the other forms of this word?

In [36]:
aesch_index.print_concordance("ἄνακτος")

Displaying 5 of 5 matches:
γός , τῆσδε γῆς ἀρχηγέτης . ἐμοῦ δʼ ἄνακτος εὐλόγως ἐπώνυμον γένος Πελασγῶν τήν (tlg0085.tlg001:252)
ἱδρῦσθαι χθονός . τῆλε πρὸς δυσμαῖς ἄνακτος Ἡλίου φθινασμάτων . ἀλλὰ μὴν ἵμειρʼ (tlg0085.tlg002:232)
χέρσον ἐληλαμέναι πέρι πύργον τοῦδʼ ἄνακτος ἄιον , Ἕλλας τʼ ἀμφὶ πόρον πλατὺν ε (tlg0085.tlg002:873)
γένοιτο δʼ οὖν μολόντος εὐφιλῆ χέρα ἄνακτος οἴκων τῇδε βαστάσαι χερί . τὰ δʼ ἄλ (tlg0085.tlg005:35)
τὰ μάσσω μὲν τί δεῖ σέ μοι λέγειν ; ἄνακτος αὐτοῦ πάντα πεύσομαι λόγον . ὅπως δ (tlg0085.tlg005:599)


How do we find them all?

## Annotation-based concordances

If we have a lemmatized corpus (e.g. a treebank), we can use the same extended concordance indexer to create a similar list that searches for lemmata instead of forms.

Above we read the treebank files and created a list of tokens with some additional information. E.g.:

In [38]:
tbtokens[3]

('ἐπίδοι', 'ἐπεῖδον', 'v3saoa---', 'PRED')

Now let's create a concordance indexer that uses lemmata instead of forms. Instead of passing a list of forms to the constructor, we pass a list of tuples and specify, through the `key` argument, a function that selects the element we want to index by; in this case, the lemma is the second element of the tuple (index 1).

In [48]:
tbindex = CitableConcordanceIndex(tbtokens, tbcites, key=lambda x:x[1])

With the `key` argument we passed a function that, for every tuple in the list, returns the element at index 1 (the lemma in this case); this function is used to build the lemma index. The default `key` of the class constructor (the one we used before) is:

```python
lambda x : x
```

That is, a function that merely returns the element passed and does nothing with it.
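To see what the `key` function does, here is a minimal stand-in for the indexer. It is not the code of `CitableConcordanceIndex` (whose source is not shown here): `TinyCitableIndex` is a hypothetical class that only maps each key value to the positions where it occurs and looks up the parallel citation. The toy tokens follow the (form, lemma, postag, relation) layout used above, with illustrative values.

```python
from collections import defaultdict

class TinyCitableIndex:
    """Hypothetical, stripped-down index: key value -> token positions."""

    def __init__(self, tokens, cites, key=lambda x: x):
        if len(tokens) != len(cites):
            raise ValueError("tokens and cites must have the same length")
        self._tokens = tokens
        self._cites = cites
        self._offsets = defaultdict(list)
        for i, tok in enumerate(tokens):
            # key decides what each token is filed under
            self._offsets[key(tok)].append(i)

    def matches(self, value):
        """Return (token, cite) pairs whose key equals value."""
        return [(self._tokens[i], self._cites[i])
                for i in self._offsets.get(value, [])]

# Illustrative values only, not taken from the treebank.
toks = [("ἄναξ", "ἄναξ", "n-s---mn-", "SBJ"),
        ("ἄνακτος", "ἄναξ", "n-s---mg-", "ATR"),
        ("τύχη", "τύχη", "n-s---fn-", "SBJ")]
cites = ["tlg0085.tlg001:328", "tlg0085.tlg001:252", "tlg0085.tlg001:523"]

by_lemma = TinyCitableIndex(toks, cites, key=lambda t: t[1])
by_relation = TinyCitableIndex(toks, cites, key=lambda t: t[3])
print(by_lemma.matches("ἄναξ"))     # both forms of the lemma
print(by_relation.matches("SBJ"))   # the two tokens tagged SBJ
```

The same token list serves both indexes; only the `key` changes.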

Now, let's see if we can use this indexer to get concordances of all the forms of ἄναξ in Aeschylus.

In [49]:
tbindex.print_concordance("ἄναξ", lines=20)

Displaying 20 of 40 matches:
ροις νυν ἐσθλὰ κηρυκευέτω . πάντων δ̓ ἀνάκτων τῶνδε κοινοβωμίαν σέβεσθ̓ · ἐν ἁγνῷ δ (tlg0085.tlg001:222)
ασγός , τῆσδε γῆς ἀρχηγέτης . ἐμοῦ δ̓ ἄνακτος εὐλόγως ἐπώνυμον γένος Πελασγῶν τήνδε (tlg0085.tlg001:252)
λιπεῖν ἔτλητε ; τίς κατέσκηψεν τύχη ; ἄναξ Πελασγῶν , αἰόλ̓ ἀνθρώπων κακά . πόνο (tlg0085.tlg001:328)
 κλῦθί μου πρόφρονι καρδίᾳ , Πελασγῶν ἄναξ . ἴδε με τὰν ἱκέτιν φυγάδα περίδρομον (tlg0085.tlg001:349)
πειθὼ δ̓ ἕποιτο καὶ τύχη πρακτήριος . ἄναξ ἀνάκτων , μακάρων μακάρτατε καὶ τελέω (tlg0085.tlg001:524)
 δ̓ ἕποιτο καὶ τύχη πρακτήριος . ἄναξ ἀνάκτων , μακάρων μακάρτατε καὶ τελέων τελειό (tlg0085.tlg001:524)
ις ; αὐτὸς ὁ πατὴρ φυτουργὸς αὐτόχειρ ἄναξ γένους παλαιόφρων μέγας τέκτων , τὸ π (tlg0085.tlg001:592)
τοιάνδ̓ ἔπειθεν ῥῆσιν ἀμφ̓ ἡμῶν λέγων ἄναξ Πελασγῶν , ἱκεσίου Ζηνὸς κότον μέγαν  (tlg0085.tlg001:616)
οὐ κατοικτιεῖ . διωλόμεσθ̓ · ἄσεπτ̓ , ἄναξ , πάσχομεν - πολλοὺς ἄνακτας , παῖδας (tlg0085.tlg001:904)
 · ἄσεπτ̓ , ἄναξ , πάσχομεν - πολλοὺς ἄνακτα

## More complex indexing

Can we make concordances of the main predicates in Aeschylus or the subjects?

In [55]:
syntindex = CitableConcordanceIndex(tbtokens, tbcites, key=lambda x:x[3])
syntindex.print_concordance("PRED", lines=20)

Displaying 20 of 1987 matches:
 ἐπίδοι προφρόνως στόλον ἡμέτερον νάιον ἀρθέν (tlg0085.tlg001:1)
ίαν δὲ λιποῦσαι χθόνα σύγχορτον Συρίᾳ φεύγομεν , οὔτιν̓ ἐφ̓ αἵματι δημηλασίαν ψήφῳ π (tlg0085.tlg001:5)
σίαρχος τάδε πεσσονομῶν κύδιστ̓ ἀχέων ἐπέκρανε , φεύγειν ἀνέδην διὰ κῦμ̓ ἅλιον , κέλ (tlg0085.tlg001:13)
ίν̓ ἂν οὖν χώραν εὔφρονα μᾶλλον τῆσδ̓ ἀφικοίμεθα σὺν τοῖσδ̓ ἱκετῶν ἐγχειριδίοις ἐριοστ (tlg0085.tlg001:20)
τὴρ τρίτος , οἰκοφύλαξ ὁσίων ἀνδρῶν , δέξασθ̓ ἱκέτην τὸν θηλυγενῆ στόλον αἰδοίῳ πνε (tlg0085.tlg001:27)
̓ ἐν ἀσώδει θεῖναι , ξὺν ὄχῳ ταχυήρει πέμψατε πόντονδ̓ · ἔνθα δὲ λαίλαπι χειμωνοτύπ (tlg0085.tlg001:33)
ις , τὰ δ̓ ἄελπτά περ ὄντα φανεῖται . γνώσεται δὲ λόγους τις ἐν μάκει . εἰ δὲ κυρεῖ  (tlg0085.tlg001:56)
πόλων ἔγγαιος οἶκτον [οἰκτρὸν] ἀίων , δοξάσει τις ἀκούειν ὄπα τᾶς Τηρεΐας Μήτιδος ο (tlg0085.tlg001:60)
ὶ ἐγὼ φιλόδυρ - τος Ἰαονίοισι νόμοισι δάπτω τὰν ἁπαλὰν Νειλοθερῆ παρειὰν ἀπειρόδα (tlg0085.tlg001:70)
ις ἐστὶ κηδεμών . ἀλλά , θεοὶ γενέται κλύετ̓ εὖ τὸ δίκαιον ἰδόντες · 

(**WARNING**: the formatting is not handled well when the matching token is at the beginning of the corpus. Also, I expect that the concordance will treat the end of the preceding text as "context" when the match comes at the beginning of a new text...)
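One possible way around the second problem is to trim the context so that it never crosses into another work. The sketch below assumes citations of the form `work:passage`, as used in this notebook; `context_within_work` is a hypothetical helper and is not part of `CitableConcordanceIndex`.

```python
def context_within_work(tokens, cites, offset, window=5):
    """Return (left, right) context of tokens[offset], limited to the same work."""
    work = cites[offset].split(":")[0]

    # Walk left while we stay inside the window and inside the same work.
    start = offset
    while (start > 0 and offset - start < window
           and cites[start - 1].split(":")[0] == work):
        start -= 1

    # Walk right under the same two conditions.
    end = offset
    while (end < len(tokens) - 1 and end - offset < window
           and cites[end + 1].split(":")[0] == work):
        end += 1

    return tokens[start:offset], tokens[offset + 1:end + 1]

demo_tokens = ["a", "b", "c", "d", "e"]
demo_cites = ["w1:1", "w1:2", "w2:1", "w2:1", "w2:2"]

# "c" opens work w2, so nothing from w1 leaks into its left context.
print(context_within_work(demo_tokens, demo_cites, 2))  # ([], ['d', 'e'])
```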

In [56]:
syntindex.print_concordance("SBJ", lines=10)

Displaying 10 of 2337 matches:
 Ζεὺς μὲν ἀφίκτωρ ἐπίδοι προφρόνως στόλον ἡ (tlg0085.tlg001:1)
γύπτου παίδων ἀσεβῆ ̓ ξονοταζόμεναι . Δαναὸς δὲ πατὴρ καὶ βούλαρχος καὶ στασίαρχος (tlg0085.tlg001:11)
ον , κέλσαι δ̓ Ἄργους γαῖαν , ὅθεν δὴ γένος ἡμέτερον τῆς οἰστροδόνου βοὸς ἐξ ἐπαφ (tlg0085.tlg001:16)
ες , ὄλοιντο , πρίν ποτε λέκτρων , ὧν θέμις εἴργει , σφετεριξάμενοι πατραδέλφειαν (tlg0085.tlg001:37)
ιν · ἐπωνυμίᾳ δ̓ ἐπεκραίνετο μόρσιμος αἰὼν εὐλόγως , Ἔπαφόν τ̓ ἐγέννασεν · ὅντ̓  (tlg0085.tlg001:46)
 ἐπιδείξω πιστὰ τεκμήρια γαιονόμοις , τὰ δ̓ ἄελπτά περ ὄντα φανεῖται . γνώσετα (tlg0085.tlg001:55)
ερ ὄντα φανεῖται . γνώσεται δὲ λόγους τις ἐν μάκει . εἰ δὲ κυρεῖ τις πέλας οἰων (tlg0085.tlg001:57)
 δὲ λόγους τις ἐν μάκει . εἰ δὲ κυρεῖ τις πέλας οἰωνοπόλων ἔγγαιος οἶκτον [οἰκτ (tlg0085.tlg001:58)
γαιος οἶκτον [οἰκτρὸν] ἀίων , δοξάσει τις ἀκούειν ὄπα τᾶς Τηρεΐας Μήτιδος οἰκτρ (tlg0085.tlg001:60)
τρᾶς ἀλόχου , κιρκηλάτου τ̓ ἀηδόνος , ἅτ̓ ἀπὸ χλωρῶν πετάλων ἐργομένα πενθεῖ μὲ (tlg0085.tlg001:63)

Now let's try indexing based on morphology. As you can see, the indexer is rather flexible: you can devise whatever function you like and pass it to the constructor as `key` instead of the default "do-nothing" lambda function.

However, the easiest solution for morphology-based indexing is to "massage" the morpho tag itself, build the token list from it, and pass that list to the constructor as its first argument.

Let's say we want a concordance of all the accusative nouns. We build a list of tokens where each token is a tuple (form, pos+case), so that it will be easier to query later.
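The treebank's nine-character postag encodes one feature per position. The sketch below unpacks it into named fields; the field order follows the usual Ancient Greek Dependency Treebank layout (part of speech, person, number, tense, mood, voice, gender, case, degree), which is consistent with the indices 0 and 7 used in the next cell. `decode_postag` is a hypothetical helper.

```python
POSTAG_FIELDS = ("pos", "person", "number", "tense", "mood",
                 "voice", "gender", "case", "degree")

def decode_postag(postag):
    """Map each character of a 9-character postag to its feature name."""
    if len(postag) != len(POSTAG_FIELDS):
        raise ValueError("expected a 9-character postag, got {!r}".format(postag))
    return dict(zip(POSTAG_FIELDS, postag))

# The tag of the verb form shown earlier in tbtokens[3].
tag = decode_postag("v3saoa---")
print(tag["pos"], tag["mood"])   # v o
print(tag["pos"] + tag["case"])  # v-  (pos + case, the kind of key built in the next cell)
```

A noun in the accusative yields the key `na`, which is the value queried below.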

In [62]:
# pos and case are the characters at index 0 and 7 of the postag string
postoks = [(w[0], w[2][0]+w[2][7]) for w in tbtokens]
posindex = CitableConcordanceIndex(postoks, tbcites, key=lambda x:x[1])
posindex.print_concordance("na", lines=20)

Displaying 20 of 2873 matches:
 στόλον ἡμέτερον νάιον ἀρθέντ̓ ἀπὸ προστομίων  (tlg0085.tlg001:2)
 Δίαν δὲ λιποῦσαι χθόνα σύγχορτον Συρίᾳ φεύγ (tlg0085.tlg001:4)
 χθόνα σύγχορτον Συρίᾳ φεύγομεν , οὔτιν̓ ἐφ̓  (tlg0085.tlg001:5)
τον Συρίᾳ φεύγομεν , οὔτιν̓ ἐφ̓ αἵματι δημηλασίαν ψήφῳ πόλεως γνωσθεῖσαν , ἀλλ̓ αὐτογενε (tlg0085.tlg001:6)
 γνωσθεῖσαν , ἀλλ̓ αὐτογενεῖ φυξανορίᾳ γάμον Αἰγύπτου παίδων ἀσεβῆ ̓ ξονοταζόμεναι  (tlg0085.tlg001:9)
τ̓ ἀχέων ἐπέκρανε , φεύγειν ἀνέδην διὰ κῦμ̓ ἅλιον , κέλσαι δ̓ Ἄργους γαῖαν , ὅθεν  (tlg0085.tlg001:14)
έδην διὰ κῦμ̓ ἅλιον , κέλσαι δ̓ Ἄργους γαῖαν , ὅθεν δὴ γένος ἡμέτερον τῆς οἰστροδόν (tlg0085.tlg001:15)
ιὸς εὐχόμενον τετέλεσται . τίν̓ ἂν οὖν χώραν εὔφρονα μᾶλλον τῆσδ̓ ἀφικοίμεθα σὺν το (tlg0085.tlg001:19)
ὕπατοί τε θεοί , καὶ βαρύτιμοι χθόνιοι θήκας κατέχοντες , καὶ Ζεὺς σωτὴρ τρίτος , ο (tlg0085.tlg001:25)
τος , οἰκοφύλαξ ὁσίων ἀνδρῶν , δέξασθ̓ ἱκέτην τὸν θηλυγενῆ στόλον αἰδοίῳ πνεύματι χώ (tlg0085.tlg001:27)
ν ἀνδρῶν , δέξασθ̓ ἱκέτην τὸν θηλυγενῆ στόλ

In [61]:
postoks[:5]

[('Ζεὺς', 'n'),
 ('μὲν', 'g'),
 ('ἀφίκτωρ', 'n'),
 ('ἐπίδοι', 'v'),
 ('προφρόνως', 'd')]

In [63]:
word = "ἄναξ"

In [64]:
half_width = (80 - len(word) - 2) // 2

In [65]:
half_width

37
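These last cells reproduce the arithmetic behind the fixed-width display: with a total width of 80 characters, the query word plus one space on each side leaves `(80 - len(word) - 2) // 2` characters for each half of the context. The sketch below applies that trimming to token lists. `trim_context` is a hypothetical helper; that `CitableConcordanceIndex` builds its `left_print` and `right_print` exactly this way is an assumption, although the 37-character strings in the `ConcordanceLine` shown earlier are consistent with it.

```python
def trim_context(left, word, right, width=80):
    """Cut the joined left/right context to half of the remaining width."""
    half_width = max((width - len(word) - 2) // 2, 0)
    left_text = " ".join(left)
    right_text = " ".join(right)
    # Keep the END of the left context and the START of the right context.
    left_print = left_text[-half_width:] if half_width else ""
    right_print = right_text[:half_width]
    return left_print, right_print, half_width

left_print, right_print, hw = trim_context(["x"] * 50, "king", ["y"] * 50)
print(hw, len(left_print), len(right_print))  # 37 37 37
```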