In [2]:
import sys
sys.path.append("../")

In [3]:
from cite_corpus_reader.reader import CapitainCorpusReader, AncientGreekPunktVar
from nltk.tokenize.punkt import PunktSentenceTokenizer
import cltk
import os
from cltk.corpus.utils.formatter import cltk_normalize

# iAligner

To install `iAligner`,
* download the latest release from the [GitHub repo](https://github.com/OpenGreekAndLatin/ILA_python.git)
* unzip the downloaded file
* cd into the unzipped directory
* run: `python setup.py install`

In [4]:
from iAlignment.iAligner import iAligner
from iAlignment.Viewer import Viewer
from iAlignment.MultipleAligner import MultipleAligner

# Making the CapitainCorpusReader and iAligner play together

iAligner expects the texts under comparison to be chunked into comparable units. In the visualization, each group of units is rendered as $n$ parallel blocks with the tokens aligned. For prose, the sentence is the natural unit for chunking and comparing the texts. For poetry, we can align editions either sentence by sentence or line by line. For the line-by-line approach a CapitainCorpusReader is perhaps overkill, but you can use it nonetheless.

## Prose texts

### Reading the texts

With a few lines of code we can take two (or more!) Capitain-compliant editions of a Greek text and compare them. As an example, we'll use two editions of Aristotle's *Analytica Priora*: Bekker's and Ross's.

In [46]:
# root of the two folders: Perseus Texts and First1kGreek
fist1kroot = os.path.expanduser("~/cltk_data/greek/text/greek_text_first1kgreek/data")
perseusroot = os.path.expanduser("~/cltk_data/greek/text/canonical-greekLit-master/data")

arist_root = os.path.join(fist1kroot, "tlg0086")
ar_an1_files = ["tlg001/tlg0086.tlg001.1st1K-grc1.xml", "tlg001/tlg0086.tlg001.1st1K-grc2.xml"]
bekker, ross = ar_an1_files
ar_an1 = CapitainCorpusReader(arist_root, ar_an1_files)

How big are these two texts (in number of words)?

In [23]:
print(len(ar_an1.words(bekker)), len(ar_an1.words(ross)))

71019 68575


As said, we work sentence by sentence. Let's now grab the sentences that we'll compare.

In [7]:
sents_bekker = ar_an1.sents(bekker)
sents_ross = ar_an1.sents(ross)

Let us compare the first sentence of each edition, token by token, in a quick-and-dirty way.

In [8]:
for b, r in zip(sents_bekker[0], sents_ross[0]):
    print(b,r)

ΑΝΑΛΥΤΙΚΩΝ ΑΝΑΛΥΤΙΚΩΝ
ΠΡΟΤΕΡΩΝ ΠΡΟΤΕΡΩΝ
Α Α
. .
ΠΡΩΤΟΝ Πρῶτον
εἰπεῖν εἰπεῖν
περὶ περὶ
τί τί
καὶ καὶ
τίνος τίνος
ἐστὶν ἐστὶν
ἡ ἡ
σκέψις σκέψις
, ,
ὅτι ὅτι
περὶ περὶ
ἀπόδειξιν ἀπόδειξιν
καὶ καὶ
ἐπιστήμης ἐπιστήμης
ἀποδεικτικῆς ἀποδεικτικῆς
· ·


They look the same, but don't be fooled! The dreaded problem of [Unicode compatibility](http://docs.cltk.org/en/latest/greek.html#normalization) is looming in the background...

Obviously, we want to solve that before we compare the texts.

In [20]:
print(sents_bekker[0][-3], sents_ross[0][-3])
print(sents_bekker[0][-3] == sents_ross[0][-3])
print(sents_bekker[0][-3][5], ord(sents_bekker[0][-3][5]),
     ord(sents_ross[0][-3][5]))

print(cltk_normalize(sents_bekker[0][-3]) == sents_ross[0][-3])

ἐπιστήμης ἐπιστήμης
False
ή 8053 942
True
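
The mismatch comes from two code points that render identically: the Bekker file has eta with *oxia* (U+1F75, decimal 8053), while the Ross file has eta with *tonos* (U+03AE, decimal 942). As a minimal sketch that relies only on the standard library's `unicodedata` (not on `cltk_normalize`, whose internals are not shown in this notebook), NFC normalization folds the first code point into the second:

```python
import unicodedata

eta_oxia = "\u1f75"   # eta as encoded in the Bekker file (ord 8053)
eta_tonos = "\u03ae"  # eta as encoded in the Ross file (ord 942)

# Print the official Unicode names of the two look-alike characters
print(unicodedata.name(eta_oxia))
print(unicodedata.name(eta_tonos))

# The raw strings differ; after NFC normalization they compare equal
print(eta_oxia == eta_tonos)
print(unicodedata.normalize("NFC", eta_oxia) == eta_tonos)
```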


### Passing the sentences to iAligner

iAligner provides some nice examples of how to use the intra-language alignment tool to generate HTML visualizations of the differences.

Here's a code snippet from one of them:

```python
html=[]
groups=content.split("\n\n")
for group in groups:
    sentences=group.split("\n")
    alignment = aligner.align(sentences)
    html.append(viewer.mAlignmentToHtmlCode(alignment,sentences))
    html.append(viewer.mAlignmentToTableCode(alignment, sentences))
viewer.exportHtml("<br>".join(html),"OutputExample5-1.html")

```

We can build the same kind of routine to process the first 20 sentences of the *Analytics*. As the corpus reader returns each sentence as a list of tokens, we have to join the tokens back into a string (iAligner uses its own tokenizer); that's somewhat redundant, but it doesn't hurt much.

As we know that Bekker's edition uses the older Unicode code points for accented characters, we'll also normalize the text with `cltk_normalize`.

In [56]:
html = []
aligner = MultipleAligner()
viewer = Viewer()
aligner.setOptions(1,0,1,1)

for b_s,r_s in zip(sents_bekker[:20], sents_ross[:20]):
    b_s = cltk_normalize(" ".join(b_s))
    r_s = cltk_normalize(" ".join(r_s))
    sentences = [b_s, r_s]
    alignment = aligner.align(sentences)
    html.append(viewer.mAlignmentToHtmlCode(alignment,sentences))
    html.append(viewer.mAlignmentToTableCode(alignment, sentences))
viewer.exportHtml("<br>".join(html),"AristotleExample.html")
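
Pairing the sentences with `zip` assumes that sentence *i* of one edition corresponds to sentence *i* of the other. If the sentence tokenizer splits the two editions differently, later pairs would drift apart. The following is only a sketch of a sanity check based on the standard library's `difflib`; the helper name `flag_drift` and the 0.5 threshold are illustrative assumptions, not part of iAligner or of the corpus reader.

```python
from difflib import SequenceMatcher


def flag_drift(units1, units2, threshold=0.5):
    """Return (index, ratio) for paired units that look too dissimilar.

    units1, units2: lists of token lists, as returned by ``sents()``.
    threshold: hypothetical cut-off on the token-level similarity ratio.
    """
    suspicious = []
    for i, (u1, u2) in enumerate(zip(units1, units2)):
        # ratio() is 1.0 for identical token sequences, 0.0 for disjoint ones
        ratio = SequenceMatcher(None, u1, u2).ratio()
        if ratio < threshold:
            suspicious.append((i, round(ratio, 2)))
    return suspicious


# Toy data standing in for two editions (not taken from the corpus)
edition_a = [["a", "b", "c"], ["d", "e", "f"]]
edition_b = [["a", "b", "c"], ["x", "y", "z"]]
print(flag_drift(edition_a, edition_b))
```

On the real data one could call `flag_drift(sents_bekker[:20], sents_ross[:20])`, ideally after normalizing the tokens so that encoding differences do not lower the ratio.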

## Poetry

In [72]:
def compare_texts(text1, text2, outname, options = (1,0,1,1)):
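    """Compare two texts using iAligner and export the result as HTML.

    The two texts must be lists of chunks to be compared.
    The chunks can be sentences or sections (e.g. lines);
    each chunk may be a list of tokens or a string.

    Parameters
    ----------
    text1: list
        text1 as a list of sections or sentences
    text2: list
        text2 as a list of sections or sentences
    outname: str
        name of the HTML file to write
    options: tuple
        values passed on to ``MultipleAligner.setOptions``

    Returns
    -------
    The alignment of the last pair of chunks.
    """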
    html = []
    aligner = MultipleAligner()
    viewer = Viewer()
    aligner.setOptions(*options)

    for s1,s2 in zip(text1,text2):
        if isinstance(s1, list):
            assert isinstance(s2, list), "Text1 and text2 must both be a list of lists or strings!"
            s1 = " ".join(s1)
            s2 = " ".join(s2)
        sentences = [s1, s2]
        alignment = aligner.align(sentences)
        html.append(viewer.mAlignmentToHtmlCode(alignment,sentences))
        html.append(viewer.mAlignmentToTableCode(alignment, sentences))
    viewer.exportHtml("<br>".join(html),outname)
    return alignment

Let us now load two versions of a hotly debated tragedy: the *Suppliants* of Aeschylus.

In [67]:
sigd = CapitainCorpusReader(fist1kroot, "tlg0085/tlg001/tlg0085.tlg001.opp-grc3.xml")

sigd_lines = [l.replace("᾽", "'") for l in sigd.paras()]
sigd_lines[:5]

['Ζεὺς μὲν ἀφίκτωρ ἐπίδοι προφρόνως ',
 "στόλον ἡμέτερον νάιον ἀρθέντ' ",
 'ἀπὸ προστομίων λεπτοψαμάθων ',
 'Νείλου. Δίαν δὲ λιποῦσαι ',
 'χθόνα σύγχορτον Συρίᾳ φεύγομεν, ']

In [80]:
smyth = CapitainCorpusReader(perseusroot, "tlg0085/tlg001/tlg0085.tlg001.perseus-grc2.xml")

smyth_lines = [l.replace("ʼ", "'") for l in smyth.paras()]
smyth_lines = [l.replace("·", ":") for l in smyth_lines]
smyth_lines[:5]

['Ζεὺς μὲν ἀφίκτωρ ἐπίδοι προφρόνως ',
 "στόλον ἡμέτερον νάιον ἀρθέντ' ",
 'ἀπὸ προστομίων λεπτοψαμάθων ',
 'Νείλου. Δίαν δὲ λιποῦσαι ',
 'χθόνα σύγχορτον Συρίᾳ φεύγομεν, ']
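
The chained `replace` calls used for the two editions can also be written as a single translation table applied to both. This is a sketch under assumptions: the code points listed (Greek koronis U+1FBD, modifier letter apostrophe U+02BC, right single quotation mark U+2019, middle dot U+00B7, Greek ano teleia U+0387) are a guess at the variants one may meet in such files, and the helper name `unify_punctuation` is hypothetical.

```python
# Map look-alike characters to one representative each.
# The selection of code points is an assumption, not taken from the files.
PUNCT_TABLE = str.maketrans({
    "\u1fbd": "'",  # GREEK KORONIS
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u00b7": ":",  # MIDDLE DOT
    "\u0387": ":",  # GREEK ANO TELEIA
})


def unify_punctuation(line):
    """Return ``line`` with the variant characters above replaced."""
    return line.translate(PUNCT_TABLE)


# Two spellings of the same elided word become identical
print(unify_punctuation("ἀρθέντ\u1fbd") == unify_punctuation("ἀρθέντ\u02bc"))
```

Using one table for every edition keeps the two texts from being cleaned in slightly different ways.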

Let's spot-check one character to make sure that we don't have any Unicode problems.

In [77]:
smyth_lines[0][11] == sigd_lines[0][11]

True
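
Checking a single character only proves that one position agrees. A slightly more thorough sketch, again using the standard library only, reports the first position at which two strings differ, together with the Unicode names of the characters involved. The function name `first_difference` is made up for this example.

```python
import unicodedata


def first_difference(s1, s2):
    """Return (index, name1, name2) for the first differing character,
    or None if the two strings are identical."""
    for i, (c1, c2) in enumerate(zip(s1, s2)):
        if c1 != c2:
            return i, unicodedata.name(c1, "?"), unicodedata.name(c2, "?")
    if len(s1) != len(s2):
        # One string is a prefix of the other
        return min(len(s1), len(s2)), "<end or extra>", "<end or extra>"
    return None


# The same word with eta-oxia (U+1F75) and with eta-tonos (U+03AE)
print(first_difference("ἐπιστ\u1f75μης", "ἐπιστ\u03aeμης"))
```

For the two editions one might run it over `zip(sigd_lines, smyth_lines)` and inspect every pair for which the result is not `None`.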

In [81]:
alignment = compare_texts(sigd_lines[:51], smyth_lines[:51], "suppliants.html")

In [66]:
alignment[2]

['τόποις', 'τόποις']

# Appendix: testing iAligner

In [56]:
from IPython.display import IFrame, HTML

In [24]:
from iAlignment.iAligner import iAligner
from iAlignment.Viewer import Viewer

# Example 1: simple alignment of two sentences
sentence1="And the earth was waste and void; and darkness was upon the face of the deep: and the Spirit of God moved upon the face of the waters. And God said, Let there be light: and there was light."
sentence2="And the earth was waste and without form; and it was dark on the face of the deep: and the Spirit of God was moving on the face of the waters. And God said, Let there be light: and there was light."

aligner = iAligner()
# alignment options
aligner.setOptions(1,0,1,0)
viewer = Viewer()

# run the alignment
alignment = aligner.align(sentence1, sentence2)

# render the alignment results as HTML and write them to a file
html = viewer.alignmentToText(alignment)
html += viewer.alignmentToHtmlCode(alignment)
viewer.exportHtml(html, "outputExample1.html")

In [27]:
alignment[1]

{'sentence1': 'the', 'sentence2': 'the', 'relation': 'Aligned-complete'}

In [27]:
IFrame("ialigner_output/outputExample1.html", 800, 600)

---

In [32]:
from cite_corpus_reader import reader

from importlib import reload
reload(reader)

<module 'cite_corpus_reader.reader' from '../cite_corpus_reader/reader.py'>

In [33]:
from cite_corpus_reader.reader import CapitainCorpusReader, AncientGreekPunktVar
from nltk.tokenize.punkt import PunktSentenceTokenizer