# Introduction

The code here attempts to produce an automatic metrical analysis and syllabification (according to the prosody of Greek poetry) of a Greek tragedy. We start with the iambic trimeters in Euripides' *Medea*, using the edition by Kovacs published in the [Perseus Digital Library](http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.01.0113%3Acard%3D1).

In [159]:
h = scan2html(syllMetre(l11),scanner.scan_text(l11+'\n')[0])
HTML(h)

| ˘ | ¯ | ˘ | ¯ | ˘ | ¯ | ˘ | ¯ | ˘ | ¯ | ˘ | x |
|---|---|---|---|---|---|---|---|---|---|---|---|
| φί | λων | τε | τῶν | πρὶ | νἀμ | πλα | κοῦ | σα | καὶ | πά | τρας |


In [117]:
from cltk.corpus.utils.importer import CorpusImporter
import cltk

In [118]:
import greek_accentuation
from greek_accentuation.characters import base
from greek_accentuation import syllabify

In [119]:
import requests
from lxml import etree
import re
import os

In [120]:
from cltk.prosody.greek.scanner import Scansion

In [121]:
import pyCTS

# Load the text

In [122]:
x = etree.parse("/Users/fmambrini/cltk_data/canonical_protected-master/CTS_XML_TEI/perseus/greekLit/tlg0006/tlg003/tlg0006.tlg003.perseus-grc1.xml")

First, we load some text. We fetch the first few lines of the *Medea* from the XML in Perseus.

(Eur. *Medea* is not available in the Perseus GitHub repo.)

In [123]:
x = etree.parse("http://www.perseus.tufts.edu/hopper/xmlchunk?doc=Perseus%3Atext%3A1999.01.0113%3Acard%3D1")

In [124]:
txt = x.xpath("//l/text()")
lines = x.xpath("//l")

In [125]:
for l in lines:
    print(l.xpath("text()")[0])

Εἴθ᾽ ὤφελ᾽ Ἀργοῦς μὴ διαπτάσθαι σκάφος
Κόλχων ἐς αἶαν κυανέας Συμπληγάδας,
μηδ᾽ ἐν νάπαισι Πηλίου πεσεῖν ποτε
τμηθεῖσα πεύκη, μηδ᾽ ἐρετμῶσαι χέρας
ἀνδρῶν ἀριστέων οἳ τὸ πάγχρυσον δέρος
Πελίᾳ μετῆλθον. οὐ γὰρ ἂν δέσποιν᾽ ἐμὴ
Μήδεια πύργους γῆς ἔπλευσ᾽ Ἰωλκίας
ἔρωτι θυμὸν ἐκπλαγεῖσ᾽ Ἰάσονος:
οὐδ᾽ ἂν κτανεῖν πείσασα Πελιάδας κόρας
πατέρα κατῴκει τήνδε γῆν Κορινθίαν
<φίλων τε τῶν πρὶν ἀμπλακοῦσα καὶ πάτρας.>
<καὶ πρὶν μὲν εἶχε κἀνθάδ᾽ οὐ μεμπτὸν βίον>
ξὺν ἀνδρὶ καὶ τέκνοισιν, ἁνδάνουσα μὲν
φυγὰς πολίταις ὧν ἀφίκετο χθόνα
αὐτῷ  τε πάντα ξυμφέρουσ᾽ Ἰάσονι:
ἥπερ μεγίστη γίγνεται σωτηρία,
ὅταν γυνὴ πρὸς ἄνδρα μὴ διχοστατῇ.
νῦν δ᾽ ἐχθρὰ πάντα, καὶ νοσεῖ τὰ φίλτατα.
προδοὺς γὰρ αὑτοῦ τέκνα δεσπότιν τ᾽ ἐμὴν
γάμοις Ἰάσων βασιλικοῖς εὐνάζεται,
γήμας Κρέοντος παῖδ᾽, ὃς αἰσυμνᾷ χθονός.
Μήδεια δ᾽ ἡ δύστηνος ἠτιμασμένη
βοᾷ μὲν ὅρκους, ἀνακαλεῖ δὲ δεξιᾶς
πίστιν μεγίστην, καὶ θεοὺς μαρτύρεται
οἵας ἀμοιβῆς ἐξ Ἰάσονος κυρεῖ.
κεῖται δ᾽ ἄσιτος, σῶμ᾽ ὑφεῖσ᾽ ἀλγηδόσιν,
τὸν πάντα συντήκουσα δακρύοις χρόνον
ἐπεὶ
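The last line printed above stops after its first word. One possible reason a verse comes back incomplete is that `text()[0]` returns only the first text node of `<l>`, so anything after a child element is dropped. The sketch below illustrates that effect with the standard library's `xml.etree.ElementTree` on a made-up fragment; the `<milestone>` element and its position are assumptions for illustration, not a description of the Perseus file.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment: the second verse is interrupted by an empty child element.
sample = (
    "<text>"
    "<l>τὸν πάντα συντήκουσα δακρύοις χρόνον</l>"
    "<l>ἐπεὶ<milestone unit='card'/> πρὸς ἀνδρὸς ᾔσθετ᾽ ἠδικημένη,</l>"
    "</text>"
)
root = ET.fromstring(sample)


def line_text(el):
    """Join every text node inside <l>, including text that follows child elements."""
    return "".join(el.itertext())


# Text before the first child only (what a "first text node" lookup sees)
first_nodes = [el.text for el in root.iter("l")]
# Complete verses
full_lines = [line_text(el) for el in root.iter("l")]

print(first_nodes[1])
print(full_lines[1])
```

With lxml, the same idea would be `"".join(l.itertext())` in place of `l.xpath("text()")[0]`.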

# Syllabify and scan the text

In [126]:
l = lines[0].xpath("text()")[0]
print(l)

Εἴθ᾽ ὤφελ᾽ Ἀργοῦς μὴ διαπτάσθαι σκάφος


We start our exploration of scanning Greek poetry using the scansion module of CLTK.

In [127]:
scanner = Scansion()

First, however, we have to work around a specific design choice of the module, which is not suited to poetry...

(We started some discussion about it, as you can see [here](https://github.com/cltk/cltk/issues/447))

In [373]:
scanner.punc_stops = ["\n"]

In [374]:
scanner.punc.append(".")

In [375]:
scanner.scan_text(l+"\n")

['˘¯˘˘¯¯˘¯˘¯˘x']

The fact is that the scansion module was designed to scan prose. It takes a chunk of text (or even a whole text), tokenizes it at sentence-final punctuation, and returns a list of metrical patterns. That doesn't work for poetry, where syntactic breaks rarely coincide with metrical units.

Also, I think the work on poetry may proceed line by line. So we changed the behavior of the class: it now tokenizes on the newline character (\n) and strips the period like all the other punctuation marks.
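To make the difference concrete, here is a minimal sketch that contrasts the two chunking strategies on lines 5 and 6 of the output above, where a sentence ends in the middle of a verse. It uses only the standard library and is not CLTK's implementation; the function names and the set of sentence-final marks are assumptions.

```python
import re

text = (
    "ἀνδρῶν ἀριστέων οἳ τὸ πάγχρυσον δέρος\n"
    "Πελίᾳ μετῆλθον. οὐ γὰρ ἂν δέσποιν᾽ ἐμὴ\n"
)


def split_on_sentences(t):
    """Prose-style chunking: break at sentence-final punctuation (assumed set)."""
    return [c.strip() for c in re.split(r"[.;·]", t) if c.strip()]


def split_on_lines(t):
    """Verse-style chunking: one chunk per line, whatever the syntax does."""
    return [c for c in t.split("\n") if c]


by_sentence = split_on_sentences(text)
by_line = split_on_lines(text)

# The first "sentence" runs across the line break, so its metrical pattern
# would mix the end of one trimeter with the start of the next.
print("\\n" in repr(by_sentence[0]))
print(len(by_line))
```

Splitting on lines keeps each chunk aligned with one trimeter, which is what the modified scanner relies on.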

Now the scanner behaves as we expect, but a minor problem is that the syllabification it uses (although effective for scanning) does not show clearly what's going on metrically.

I like the syllabified version that can be obtained with J. Tauber's `syllabify` module from [Greek Accentuation](https://github.com/jtauber/greek-accentuation/blob/master/docs.rst). CLTK's Scansion also has a (hidden) syllabification method, but its algorithm works by attaching consonants to the onset of the following syllable, which is practical, but less useful for teaching metre.

This is what the hidden method of CLTK's Scansion produces:

In [355]:
scanner._make_syllables(l+"\n")

[[['κα', 'κων:'],
  ['νε', 'α'],
  ['γαρ'],
  ['φρο', 'ντις'],
  ['ουκ'],
  ['α', 'λγεῖν'],
  ['φι', 'λεῖ']]]

Instead, if we use the hidden methods of CLTK's Scansion to strip diacritics and punctuation from the text, the result can be passed to `syllabify` to get a good (or at least plausible: some difficult scansions must be corrected) metrical analysis:

In [356]:
tok = scanner._tokenize(l+"\n")

First, we need to join the tokens into a single, continuous string, reproducing what in Greek metrics is known as **synapheia**.

In [357]:
s = "".join(tok[0])
print(s)

κακων:νεαγαρφροντιςουκαλγεῖνφιλεῖ.


In [358]:
syllabify.syllabify(s)

['κα',
 'κων:',
 'νε',
 'α',
 'γαρ',
 'φρον',
 'τι',
 'ςου',
 'καλ',
 'γεῖν',
 'φι',
 'λεῖ.']

That's much better (except for διαπτάσθαι)! The problems are due to the fact that the greek_accentuation module works with the ordinary syllabification rules: every consonant cluster that can be found at a word's onset is grouped together. This doesn't work for poetic scansion, so we must override one method:

In [186]:
def is_valid_poetic_consonant_cluster(b, c):
    s = base(b).lower() + ("".join(base(b2) for b2 in c)).lower()
    return s.startswith((
        "γλ", "γρ",
        "δρ",
        "θλ", "θν", "θρ", "θμ",
        "κλ", "κν", "κρ",
        "πλ", "πν", "πρ",
        "τρ", "τμ", "τν",
        "φλ", "φρ",
        "χλ", "χρ",
))

In [137]:
syllabify.is_valid_consonant_cluster = is_valid_poetic_consonant_cluster
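As a simplified illustration of the rule behind this override (and not of the library's actual algorithm), the sketch below divides a run of consonants standing between two vowels. The longest tail that is either a single consonant or one of the listed clusters opens the next syllable, and the rest closes the previous one. `POETIC_ONSETS` and `split_cluster` are hypothetical names, and the sketch matches clusters exactly, whereas the function above uses `startswith`.

```python
# Same cluster list as in is_valid_poetic_consonant_cluster above
POETIC_ONSETS = (
    "γλ", "γρ", "δρ", "θλ", "θν", "θρ", "θμ", "κλ", "κν", "κρ",
    "πλ", "πν", "πρ", "τρ", "τμ", "τν", "φλ", "φρ", "χλ", "χρ",
)


def split_cluster(cluster):
    """Return (coda, onset) for a run of consonants between two vowels."""
    for i in range(len(cluster)):
        tail = cluster[i:]
        # Try the longest tail first; a single consonant always qualifies.
        if len(tail) == 1 or tail in POETIC_ONSETS:
            return cluster[:i], tail
    return cluster, ""  # only reached for an empty cluster


for c in ("τ", "τρ", "πτ", "σθ", "μπλ"):
    print(c, split_cluster(c))
```

Under this reading, πτ and σθ are divided between two syllables, so the preceding syllable is closed, while τρ stays together at the onset.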

Let's see if the scansion is improved now.

In [138]:
syllabify.syllabify(s)

['ει',
 'θω',
 'φε',
 'λαρ',
 'γοῦς',
 'μη',
 'δι',
 'απ',
 'τασ',
 'θαισ',
 'κα',
 'φος.']

It's good! Now, let's see what we get if we chain CLTK's Scansion with Syllabify:

In [139]:
for syll,scan in zip(syllabify.syllabify(s) ,list(scanner.scan_text(l+"\n")[0])):
    print(syll,scan)

ει ¯
θω ¯
φε ˘
λαρ ¯
γοῦς ¯
μη ¯
δι ˘
απ ¯
τασ ¯
θαισ ¯
κα ˘
φος. x


Which is good enough to start! However, as a result of our workaround with consonants, a bit of postprocessing is needed: if the line opens with a consonant cluster other than *muta cum liquida*, you'll get a consonant-only first syllable...

Let's wrap this up in a few functions.

In [187]:
def cleanStr(s):
    reg = re.compile(r'[^\w]')
    return reg.sub("", s)

In [201]:
def _isconsonantonly(st):
    if syllabify.nucleus(st):
        return False
    else:
        return True

In [212]:
def syllMetre(line):
    l = line.rstrip("\n") + "\n"
    s = cleanStr(l)
    syllables = syllabify.syllabify(s)
    if _isconsonantonly(syllables[0]):
        syllables[1] = syllables[0] + syllables[1]
        syllables = syllables[1:]
    return syllables

In [218]:
syllMetre(l)

['Εἴ', 'θὤ', 'φε', 'λἈρ', 'γοῦς', 'μὴ', 'δι', 'απ', 'τάσ', 'θαισ', 'κά', 'φος']

In [189]:
syllables = syllMetre(l)

In [211]:
sy = "λρμθ"
print(_isconsonantonly(sy))
syllabify.nucleus(sy)

True


# CTS-URNs

Ideally, I'd like to provide each syllable in our scanned text with a CTS-URN, so that our metrical analysis can always be linked to the text and to any other resource that we create from this text (e.g. treebanks)!

We use Matteo Romanello's very practical Python port of the CTS specification, the [pyCTS](https://github.com/mromanello/pyCTS) library.

In [144]:
urn_base_string = "urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:{}#{}"

However, we need a way to map the preprocessed text (without spaces, punctuation etc.) onto the raw text. Therefore, we design a couple of functions that try to do that:

In [376]:
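def matchOnStr(fullstr, syllables):
    remain = fullstr
    matches = []
    for s in syllables:
        m = partMatch(remain, s)
        matches.append(m)
        # drop everything up to the end of the match; str.lstrip(m) would
        # strip a *set of characters*, not the matched prefix
        if m:
            remain = remain[remain.find(m) + len(m):]
    return matches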
        
def partMatch(substr, syll):
    r = ''
    for s in syll:
        r = r+s+r"[^\w]*"
    m = re.search(r, substr)
    if m:
        return m.group().rstrip(" ")
    else:
        return ""
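The same mapping can also be sketched with explicit search offsets, so that each search resumes exactly where the previous match ended. This is an alternative under stated assumptions, not the notebook's method: `match_pieces` is a hypothetical name, both strings are normalised to NFC first so that visually identical accented letters compare equal, and trailing punctuation is left out of the matched piece.

```python
import re
import unicodedata


def match_pieces(raw, syllables):
    """Return, for each cleaned syllable, the stretch of `raw` it covers."""
    raw = unicodedata.normalize("NFC", raw)
    pieces = []
    pos = 0  # search resumes where the previous match ended
    for syll in syllables:
        syll = unicodedata.normalize("NFC", syll)
        # Non-word characters (spaces, apostrophes, punctuation) may sit between letters.
        pattern = re.compile(r"[^\w]*".join(re.escape(ch) for ch in syll))
        m = pattern.search(raw, pos)
        if m is None:
            raise ValueError("syllable {!r} not found after offset {}".format(syll, pos))
        pieces.append(m.group())
        pos = m.end()
    return pieces


raw = "<φίλων τε τῶν πρὶν ἀμπλακοῦσα καὶ πάτρας.>"
sylls = ["φί", "λων", "τε", "τῶν", "πρὶ", "νἀμ", "πλα", "κοῦ", "σα", "καὶ", "πά", "τρας"]
pieces = match_pieces(raw, sylls)
print(pieces)
```

Raising an error on a missing syllable, instead of returning an empty string, makes alignment problems visible as soon as they occur.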

We also need to append an index to the reference, in case the substring we are trying to reference occurs more than once...

In [148]:
def _index(fullst, substr):
    """
    Return the substring followed by its occurrence index.
    Indexing starts at 1 and always considers the last match: "la" in "lala land" returns la[3].
    For an earlier match, slice the full string so that it ends with that match.
    """
    c = fullst.count(substr)
    return "{}[{}]".format(substr,c)
    

def getCiteSubstring(syllables):
    """
    Note: use it on the matched syllables
    """
    i = 0
    passages = []
    for s in syllables:
        p = []
        if " " in s:
            #split at whitespace
            parts = s.split(" ")
            #loop over first and last part and index them
            st = "".join(syllables[:i])
            for part in [parts[0],parts[-1]]:
                st = st + part
                p.append(_index(st, part))
        else:
            # count occurrences in the joined text, not in the list of syllables
            p.append(_index("".join(syllables[:i+1]), s))
        passages.append(p)
        i +=1
    return passages
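The indexing convention described in the docstring of `_index` can be checked on its own with character offsets. The sketch below is only an illustration: `occurrence_index` and `subreference` are hypothetical helpers, and `str.count` counts non-overlapping occurrences, which is an assumption about how repeated substrings should be ranked.

```python
def occurrence_index(passage, start, end):
    """1-based rank of passage[start:end] among its occurrences up to `end`."""
    sub = passage[start:end]
    # Slicing at `end` makes the occurrence we care about the last one counted.
    return passage[:end].count(sub)


def subreference(passage, start, end):
    """Format the substring with its occurrence index, e.g. 'la[3]'."""
    return "{}[{}]".format(passage[start:end], occurrence_index(passage, start, end))


passage = "lala land"
print(subreference(passage, 0, 2))
print(subreference(passage, 2, 4))
print(subreference(passage, 5, 7))
```

Slicing the passage so that it ends with the match of interest is the same trick the docstring recommends for reaching an earlier occurrence.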

In [149]:
l11 = "<φίλων τε τῶν πρὶν ἀμπλακοῦσα καὶ πάτρας.>"
l1 = "Εἴθ᾽ ὤφελ᾽ Ἀργοῦς μὴ διαπτάσθαι σκάφος"
syls = syllMetre(l11)

In [150]:
matched_syll = matchOnStr(l11, syls)

In [151]:
matched_syll

['φί',
 'λων',
 'τε',
 'τῶν',
 'πρὶ',
 'ν ἀμ',
 'πλα',
 'κοῦ',
 'σα',
 'καὶ',
 'πά',
 'τρας.>']

In [152]:
getCiteSubstring(matched_syll)

[['φί[1]'],
 ['λων[1]'],
 ['τε[1]'],
 ['τῶν[1]'],
 ['πρὶ[1]'],
 ['ν[3]', 'ἀμ[1]'],
 ['πλα[1]'],
 ['κοῦ[1]'],
 ['σα[1]'],
 ['καὶ[1]'],
 ['πά[1]'],
 ['τρας.>[1]']]

It's time now to generate a CTS URN for each of the syllables! Let's wrap it up in one function that:

* matches the syllabified text on the raw text
* appends the index to each syllable
* writes a CTS URN

In [302]:
def createCTS_URNs(syllables, linetxt, edurn, linenum):
    matched = [m.replace(":", "") for m in matchOnStr(linetxt, syllables)]
    substr = getCiteSubstring(matched)
    urns = []
    for s in substr:
        #if there's no space
        if len(s) == 1:
            urn = pyCTS.CTS_URN(edurn + ":{}@{}".format(linenum, s[0]))
        else:
            sub_range = "{0}@{1}-{0}@{2}".format(linenum, s[0], s[1]) #1@λ᾽[1]-1@Ἀρ[1]
            urn = pyCTS.CTS_URN(edurn + ":" + sub_range)
        urns.append(urn)
    return urns
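The string layout that `createCTS_URNs` hands to pyCTS can be seen in isolation with plain string formatting. This is a sketch of the layout only, without pyCTS and without any validation; `passage_urn` is a hypothetical helper.

```python
def passage_urn(edition_urn, line, subrefs):
    """Build the URN string for one syllable.

    `subrefs` holds one indexed substring (syllable inside a word) or two
    (syllable straddling a word boundary, expressed as a range).
    """
    if len(subrefs) == 1:
        return "{}:{}@{}".format(edition_urn, line, subrefs[0])
    first, last = subrefs[0], subrefs[-1]
    return "{0}:{1}@{2}-{1}@{3}".format(edition_urn, line, first, last)


edition = "urn:cts:greekLit:tlg0006.tlg003.perseus-grc1"
print(passage_urn(edition, "11", ["πρὶ[1]"]))
print(passage_urn(edition, "11", ["ν[3]", "ἀμ[1]"]))
```

A syllable that straddles two words is expressed as a range whose two ends both repeat the line number.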

In [154]:
createCTS_URNs(syllMetre(l11), l11, "urn:cts:greekLit:tlg0006.tlg003.perseus-grc1", "11")

[urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:11@φί[1],
 urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:11@λων[1],
 urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:11@τε[1],
 urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:11@τῶν[1],
 urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:11@πρὶ[1],
 urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:11@ν[3]-11@ἀμ[1],
 urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:11@πλα[1],
 urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:11@κοῦ[1],
 urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:11@σα[1],
 urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:11@καὶ[1],
 urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:11@πά[1],
 urn:cts:greekLit:tlg0006.tlg003.perseus-grc1:11@τρας.>[1]]

# Display the scanned lines

We display the scansion of the syllables in a nice HTML table.

In [155]:
from IPython.display import HTML

In [160]:
def scan2html(syllabs, scan,style=True):
    html = '''<table>
<tr>'''
    if style:
        scanstyle = ' style="border:0px;text-align:center;padding:0pt 10pt;background-color: #f2f2f2"'
        syllstyle = " style='border:0px;text-align:center;padding : 0pt 10pt;'"
    else:
        scanstyle,syllstyle = ('','')
    scans = [s for s in scan]
    assert len(scan) == len(syllabs), "Scan and syllables do not match!"
    for sc in scans:
        html = html + '\n<th{}>{}</th>'.format(scanstyle, sc)
    html = html + "</tr>\n<tr>"
    for sy in syllabs:
        html = html + "\n<td{}>{}</td>".format(syllstyle,sy)
        
    html = html + '''\n</tr>
</table>
    '''
    return html
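Some lines of this edition carry angle brackets, and matched syllables can end in `>`. A variant of the table builder that escapes cell content might look like the sketch below. It is an assumption-laden alternative, not a replacement for `scan2html`: `scan_table` is a hypothetical name, styling is left out, and a length mismatch raises `ValueError` rather than tripping an `assert`.

```python
from html import escape


def scan_table(syllables, scan):
    """Two-row HTML table: scansion marks on top, syllables below."""
    if len(syllables) != len(scan):
        raise ValueError(
            "{} syllables but {} scansion marks".format(len(syllables), len(scan))
        )
    # escape() turns <, > and & into entities so they cannot be read as markup
    head = "".join("<th>{}</th>".format(escape(mark)) for mark in scan)
    body = "".join("<td>{}</td>".format(escape(syl)) for syl in syllables)
    return "<table>\n<tr>{}</tr>\n<tr>{}</tr>\n</table>".format(head, body)


table = scan_table(["πά", "τρας.>"], ["˘", "x"])
print(table)
```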

# Wrapping it up 

As it turns out, there are also a couple of bugs that we have to correct in the Scansion class, so it's practical to write those modifications in a subclass that overrides whatever we'd like to override in Scansion. We'll call it **PoetryScansion**.

We'll also add a method to scan a single line, one that returns the metrical syllabification, and the method that displays scansion and syllables as an HTML table.

## A new Scansion class

In [319]:
class PoetryScansion( Scansion ) :
    """
    Override a few properties and methods of cltk.prosody.greek.scanner.Scansion
    """

    def __init__(self):
        super(PoetryScansion, self).__init__()
        self.punc_stops = ["\n"]
        self.punc = self.punc + [".", "<", ">"]        
    
    def _clean_accents(self, text):
        """Remove most accent marks.
        Note that the circumflexes over alphas and iotas are kept in the text since
        they determine vocalic quantity.
        :param text: raw text
        :return: clean text with minimum accent marks
        :rtype : string
        """
        accents = {
            'ὲέἐἑἒἓἕἔ': 'ε',
            'ὺύὑὐὒὓὔὕ': 'υ',
            'ὸόὀὁὂὃὄὅ': 'ο',
            'ὶίἰἱἲἳἵἴ': 'ι',
            'ὰάἁἀἂἃἅἄᾳᾂᾃᾁ': 'α',
            'ὴήἠἡἢἣἥἤἧἦῆῄῂῇῃᾓᾒᾗᾖᾑᾐᾔ': 'η',
            'ὼώὠὡὢὣὤὥὦὧῶῲῴῷῳᾧᾦᾢᾣᾡᾠ': 'ω',
            'ἶἷ': 'ῖ',
            'ἆἇᾷᾆᾇ': 'ᾶ',
            'ὖὗ': 'ῦ',
            }
        text = self._clean_text(text)
        for char in text:
            for key in accents.keys():
                if char in key:
                    text = text.replace(char, accents.get(key))
                else:
                    pass
        return text
    
    def scan_line(self, line):
        """
        Scan a single line of poetry
        :param line: the line to scan
        :return: a list with the scansion
        """
        
        #makes sure that the line ends with a line break, but no more than 1 line break
        l = line.rstrip("\n") + "\n"
        return list(self.scan_text(l)[0])
    
    def _isconsonantonly(self, st):
        if syllabify.nucleus(st):
            return False
        else:
            return True
    
    def _cleanStr(self, s):
        reg = re.compile(r'[^\w]')
        return reg.sub("", s)
    
    def get_metrical_syllables(self, line):
        s = self._cleanStr(line)
        syllables = syllabify.syllabify(s)
        if self._isconsonantonly(syllables[0]):
            syllables[1] = syllables[0] + syllables[1]
            syllables = syllables[1:]
        return syllables
    
    def scan2html(self, syllabs, scan, style=True):
        """Format scansion and syllables as an html table
        :param syllabs: list of syllables
        :param scan: list of scansions
        :style: (optional) if True, style instructions are added to every <th> and <td>; useful for Jupyter Notebooks
        :raise: AssertionError if scans and syllables are not of the same length
        :return: string with <table> element
        """
        
        html = '''<table>
        <tr>'''
        if style:
            scanstyle = ' style="border:0px;text-align:center;padding:0pt 10pt;background-color: #f2f2f2"'
            syllstyle = " style='border:0px;text-align:center;padding : 0pt 10pt;'"
        else:
            scanstyle,syllstyle = ('','')
        scans = list(scan)
        assert len(scan) == len(syllabs), "Scan and syllables do not match!\nLine: {}".format("".join(syllabs))
        for sc in scans:
            html = html + '\n<th{}>{}</th>'.format(scanstyle, sc)
        html = html + "</tr>\n<tr>"
        for sy in syllabs:
            html = html + "\n<td{}>{}</td>".format(syllstyle,sy)

        html = html + '''\n</tr>
        </table>
        '''
        return html

In [289]:
scanner = PoetryScansion()

In [290]:
sylls = scanner.get_metrical_syllables(txt[3])
sylls

['τμη', 'θεῖ', 'σα', 'πεύ', 'κη', 'μη', 'δἐ', 'ρετ', 'μῶ', 'σαι', 'χέ', 'ρας']

In [291]:
scan = scanner.scan_line(txt[3])
scan

['¯', '¯', '˘', '¯', '¯', '¯', '˘', '˘', '¯', '¯', '˘', 'x']

In [293]:
HTML(scanner.scan2html(sylls, scan))

| ¯ | ¯ | ˘ | ¯ | ¯ | ¯ | ˘ | ˘ | ¯ | ¯ | ˘ | x |
|---|---|---|---|---|---|---|---|---|---|---|---|
| τμη | θεῖ | σα | πεύ | κη | μη | δἐ | ρετ | μῶ | σαι | χέ | ρας |


We can now proceed to create a metrical analysis for the whole prologue.

## Main loop

In [295]:
txt = [l.xpath("text()")[0] for l in lines]

In [321]:
scanner = PoetryScansion()
tabs = []
cts_urns = []
i = 1
for line in txt:
    sylls = scanner.get_metrical_syllables(line)
    scan = scanner.scan_line(line)
    tab = scanner.scan2html(sylls, scan, style=False)
    urn = createCTS_URNs(sylls, line, "urn:cts:greekLit:tlg0006.tlg003.perseus-grc1", str(i))
    cts_urns.append(urn)
    tabs.append(tab)
    i += 1

Now we write the output of our analysis of the prologue to two files:
* an HTML file with the lines and the tables
* a file with the list of CTS URNs

In [353]:
with open("Medea_scansion.html", "w", encoding="utf-8") as out:
    html = '''<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<style>

table {
    border-collapse : collapse;
    margin-bottom : 30pt;
    margin-left: 20pt;
    }

th, td, table {
    border-bottom: 1px solid black;
    border-top: 1px solid black;
    text-align: center;
    padding:0pt 10pt
}

th { background-color: #f2f2f2 ;
    padding-top:10pt;
}

p { 
    font-weight: bold;
    }


</style>
</head>
<body>\n
'''
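    i = 1
    for l,tab in zip(txt,tabs):
        # close the paragraph properly before the table
        html = html + "<p>{}. {}</p>\n{}".format(str(i), l, tab)
        i += 1
    # close the elements opened in the header
    html = html + "</body>\n</html>\n"
    out.write(html)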

In [372]:
with open("/Users/fmambrini/Desktop/Medea_Metrical_CTS.txt", "w", encoding="utf-8") as out:
    for c in cts_urns:
        for u in c:
            out.write(u._as_string + "\n")

Done for now!

# To do

* evaluate!
* check which scans do not validate against all possible instances of the iambic trimeter
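
As a first step towards the second item, a scan can be tested against the bare scheme of the iambic trimeter, x ¯ ˘ ¯ | x ¯ ˘ ¯ | x ¯ ˘ x. The sketch below is deliberately minimal: it assumes exactly twelve positions and ignores resolution and every other licence, so a rejected line is either a scanning error or a trimeter that this simple pattern does not cover. `is_plain_trimeter` is a hypothetical name.

```python
import re

SHORT, LONG = "˘", "¯"

# Two metra of (anceps, long, short, long), then anceps, long, short,
# and a final position that the scanner marks as 'x'.
METRON = "[{s}{l}]{l}{s}{l}".format(s=re.escape(SHORT), l=re.escape(LONG))
TRIMETER = re.compile(
    "(?:{m}){{2}}[{s}{l}]{l}{s}[{s}{l}x]".format(
        m=METRON, s=re.escape(SHORT), l=re.escape(LONG)
    )
)


def is_plain_trimeter(scan):
    """True if the scan (string or list of marks) fits the unresolved scheme."""
    return TRIMETER.fullmatch("".join(scan)) is not None


# Scansion printed above for line 1 (Εἴθ᾽ ὤφελ᾽ ...)
line_1 = ["¯", "¯", "˘", "¯", "¯", "¯", "˘", "¯", "¯", "¯", "˘", "x"]
# Scansion printed above for txt[3] (τμηθεῖσα πεύκη ...): position 8 is short
line_4 = ["¯", "¯", "˘", "¯", "¯", "¯", "˘", "˘", "¯", "¯", "˘", "x"]

print(is_plain_trimeter(line_1))
print(is_plain_trimeter(line_4))
```

Lines that fail this check are candidates for manual review, not necessarily errors.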