# Tutorial: Python 3, lxml, and Greek Syntax Trees

[jonathanrobie.biblicalhumanities.org](http://jonathanrobie.biblicalhumanities.org)

This Jupyter Notebook shows how to work with the biblicalhumanities.org [Greek syntax trees](https://github.com/biblicalhumanities/greek-new-testament/syntax-trees) using Python 3 and the [lxml library](http://lxml.de/xpathxslt.html). These syntax trees are represented as XML text files. In this tutorial, we parse the files once and find what we want using XPath. The lxml library supports only XPath 1.0; I will explore using XPath 3.1 and XQuery 3.1 with the [BaseX Client API](https://github.com/BaseXdb/basex/tree/master/basex-api/src/main/python) in a future notebook.

## Install lxml

Install lxml - see instructions [here](http://lxml.de/installation.html).

If you want to create notebooks (highly recommended!) then install Jupyter - see instructions [here](https://jupyter.org/install.html).

## Get the syntax trees

Get the syntax trees using git:

```
$ git clone https://github.com/biblicalhumanities/greek-new-testament
```

On my machine, I clone repos into the ~/git subdirectory. If you use a different directory structure, set the following variable to the location of the file `nestle1904lowfat.xml` on your machine.

In [1]:
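import os

# etree.parse() does not expand "~", so turn it into an absolute path first.
TREEBANK = os.path.expanduser("~/git/greek-new-testament/syntax-trees/nestle1904-lowfat/xml/nestle1904lowfat.xml")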

This file contains a series of XInclude statements that each include one file from the Greek New Testament:

```xml
<gnt xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="01-matthew.xml"/>
    <xi:include href="02-mark.xml"/>
    <xi:include href="03-luke.xml"/>
    <xi:include href="04-john.xml"/>
    !!! SNIP !!!
</gnt>
```

Each of these included files contains a `<book/>` element that looks like this:

```xml
<book xml:base="01-matthew.xml">
  <sentence>
    <milestone unit="verse" id="Matt.1.1">Matt.1.1</milestone>
    <p>Βίβλος γενέσεως Ἰησοῦ Χριστοῦ υἱοῦ Δαυεὶδ υἱοῦ Ἀβραάμ.</p>
    <wg class="cl">
      <wg class="cl" n="400010010010082">
        <wg role="p" class="np" n="400010010010080">
          <w class="noun"
             type="common"
             osisId="Matt.1.1!1"
             n="400010010010010"
             lemma="βίβλος"
             normalized="Βίβλος"
             strong="976"
             number="singular"
             gender="feminine"
             case="nominative"
             head="true">Βίβλος</w>
    !!! SNIP !!!
```

Note the following:

- The `<book/>` element contains a sequence of `<sentence/>` elements, which represent the sentences of a book.
- When expanded with XInclude, the `xml:base` attribute identifies the book.  
- Verses are represented using `<milestone/>` elements that occur within sentences.
- For the sake of readability, each sentence has a `<p/>` element that contains the sentence in plain text.
- Sentences contain word groups (`<wg/>`) and words (`<w/>`), each of which can carry `class` and `role` attributes. For instance, a clause is a word group where `class="cl"`, and a noun phrase is a word group where `class="np"`.
- More details on this format can be found in the [Nestle1904 Lowfat README](https://github.com/biblicalhumanities/greek-new-testament/tree/master/syntax-trees/nestle1904-lowfat/README.md).  The values that an attribute can take are more fully documented in the [Nestle 1904 Documentation](https://github.com/biblicalhumanities/greek-new-testament/blob/master/syntax-trees/nestle1904/doc/SBLGNT%20Treebank%20Documentation.pdf).

## Import lxml and parse the syntax trees

Import the `lxml` library and parse the treebank file (`nestle1904lowfat.xml`), then use XInclude to expand each book inline.

In [2]:
from lxml import etree

In [None]:
tree = etree.parse(TREEBANK)

In [None]:
tree.xinclude()
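
To see what the XInclude step does without the full treebank, here is a minimal sketch that uses two made-up book files in a temporary directory. The file contents and the `load_expanded` helper are assumptions for illustration, not part of the real data. It counts the `<book/>` elements before and after calling `xinclude()`.

```python
import tempfile
from pathlib import Path

from lxml import etree

# Toy master file standing in for the real treebank (not the actual data).
MASTER = """<gnt xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="01-matthew.xml"/>
  <xi:include href="02-mark.xml"/>
</gnt>"""


def load_expanded(directory):
    """Write the toy files, parse the master file, and expand the includes."""
    directory = Path(directory)
    (directory / "gnt.xml").write_text(MASTER, encoding="utf-8")
    for name in ("01-matthew.xml", "02-mark.xml"):
        (directory / name).write_text("<book><sentence/></book>", encoding="utf-8")

    tree = etree.parse(str(directory / "gnt.xml"))
    before = len(tree.xpath("/gnt/book"))  # includes are not expanded yet
    tree.xinclude()                        # pulls each included file into the tree
    after = len(tree.xpath("/gnt/book"))
    return before, after


with tempfile.TemporaryDirectory() as tmp:
    counts = load_expanded(tmp)

print(counts)
```

The relative `href` values are resolved against the location of the master file, which is why the files need to sit in the same directory here.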

## Books, verses

Most realistic use cases involving books and verses also involve linguistic units (sentences, word groups, words), but let's start with just the books and verses. After we introduce linguistic units, we will show how to combine the two.

### Books

Each book is an element located directly under the `<gnt/>` element:

```xml
<gnt xmlns:xi="http://www.w3.org/2001/XInclude">
    <book xml:base="01-matthew.xml">
```

Let's use XPath to find the books.

In [None]:
books = tree.xpath('/gnt/book')

In [None]:
len(books)

Now let's print out the `osisId` attribute of each book to make sure that we have the books we are expecting.

In [None]:
for book in books:
    print(book.get("osisId"))
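
The sample markup earlier shows an `xml:base` attribute on each `<book/>` element. If a book element has no `osisId` attribute, `get` simply returns `None`, so it can be useful to read `xml:base` as well. Because `xml:base` lives in the XML namespace, lxml expects the attribute name in `{namespace}name` form. The sketch below uses a tiny made-up document rather than the real treebank.

```python
from lxml import etree

# The namespace that the reserved "xml:" prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Toy document for illustration only.
doc = etree.fromstring(
    "<gnt>"
    '<book xml:base="01-matthew.xml"/>'
    '<book xml:base="02-mark.xml"/>'
    "</gnt>"
)

bases = [book.get(f"{{{XML_NS}}}base") for book in doc.xpath("/gnt/book")]
print(bases)
```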

### Verses

Verses can be found in several ways. Each sentence has a milestone element identifying the starting verse for the sentence:

```xml
 <sentence>
    <milestone unit="verse" id="Matt.1.1">Matt.1.1</milestone>
    <p>Βίβλος γενέσεως Ἰησοῦ Χριστοῦ υἱοῦ Δαυεὶδ υἱοῦ Ἀβραάμ.</p>
```

Note that each milestone marks the starting point of a sentence, so the number of milestones is the same as the number of sentences. Because many verses contain multiple sentences, the number of milestones is larger than the number of verses. Also, be aware that some sentences span multiple verses.

In [None]:
verses = tree.xpath('//milestone[@unit="verse"]')

In [None]:
verses[0].text

In Python, the index `-1` refers to the last item in a list, so the following finds the last milestone:

In [None]:
verses[-1].text
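
Since several sentences can start in the same verse, the list of milestones contains repeated verse references. A minimal sketch of de-duplicating them while keeping document order, using a made-up three-sentence document:

```python
from lxml import etree

# Toy data: two sentences begin in Matt.1.2 (invented for illustration).
doc = etree.fromstring(
    "<book>"
    '<sentence><milestone unit="verse" id="Matt.1.1">Matt.1.1</milestone></sentence>'
    '<sentence><milestone unit="verse" id="Matt.1.2">Matt.1.2</milestone></sentence>'
    '<sentence><milestone unit="verse" id="Matt.1.2">Matt.1.2</milestone></sentence>'
    "</book>"
)

milestones = doc.xpath('//milestone[@unit="verse"]')

# dict.fromkeys keeps the first occurrence of each key, in order.
distinct_verses = list(dict.fromkeys(m.text for m in milestones))

print(len(milestones), distinct_verses)
```

This only counts verses in which a sentence starts; verses that fall in the middle of a longer sentence would need separate handling.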

You can find the milestones in a book by matching those whose text starts with the name of the book followed by a period:

In [None]:
john_verses = tree.xpath('//milestone[@unit="verse" and starts-with(., "John.")]')
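
If you run this kind of query for many books or chapters, lxml lets you pass XPath variables as keyword arguments instead of pasting strings into the expression. The helper name `milestones_starting_with` and the toy milestones below are assumptions for illustration.

```python
from lxml import etree

# Toy milestones; the book abbreviations are only examples.
doc = etree.fromstring(
    "<gnt>"
    '<milestone unit="verse">John.3.16</milestone>'
    '<milestone unit="verse">John.4.1</milestone>'
    '<milestone unit="verse">1John.1.1</milestone>'
    "</gnt>"
)


def milestones_starting_with(node, prefix):
    """Return verse milestones whose text starts with `prefix`."""
    return node.xpath(
        '//milestone[@unit="verse" and starts-with(., $prefix)]',
        prefix=prefix,  # bound to $prefix in the expression
    )


john = [m.text for m in milestones_starting_with(doc, "John.")]
print(john)
```

Including the trailing period in the prefix matters: it keeps `John.` from matching a longer book name that merely begins with the same letters.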

## Linguistic units

Let's look at the sentences, clauses, and words in our treebank.  

### Sentences

First, let's count all sentences in our treebank:

In [None]:
sentences = tree.xpath('//sentence')

In [None]:
len(sentences)

Now let's look at just the sentences in John 3 - that is, sentences containing a milestone that starts with the string `John.3.`:

In [None]:
sentences = tree.xpath('//sentence[ milestone[@unit="verse" and starts-with(., "John.3.")] ]')

In [None]:
len(sentences)
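
Each sentence also carries its plain text in a `<p/>` element, so the same kind of predicate can return readable text for a given verse reference. A small sketch on a toy document; the second sentence is a placeholder, not real data:

```python
from lxml import etree

doc = etree.fromstring(
    "<book>"
    "<sentence>"
    '<milestone unit="verse" id="Matt.1.1">Matt.1.1</milestone>'
    "<p>Βίβλος γενέσεως Ἰησοῦ Χριστοῦ υἱοῦ Δαυεὶδ υἱοῦ Ἀβραάμ.</p>"
    "</sentence>"
    "<sentence>"
    '<milestone unit="verse" id="Matt.1.2">Matt.1.2</milestone>'
    "<p>(placeholder sentence)</p>"
    "</sentence>"
    "</book>"
)

# Select the plain text of every sentence that starts at the given verse.
texts = doc.xpath(
    '//sentence[milestone[@unit="verse" and . = $ref]]/p/text()',
    ref="Matt.1.1",
)
print(texts)
```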

### Clauses and Phrases

Now let's look at clauses and noun phrases, which are both represented as `<wg/>` elements.  The `class` attribute identifies the class of the word group:

In [None]:
clauses = tree.xpath('//wg[@class="cl"]')

In [None]:
len(clauses)

We can pick one of these clauses and use XPath to see the words that it contains:

In [None]:
clauses[1].xpath(".//w/text()")

Now let's look at noun phrases:

In [None]:
nps = tree.xpath('//wg[@class="np"]')

In [None]:
len(nps)

In [None]:
nps[2].xpath(".//w/text()")
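
The `.//w/text()` query returns a list of strings, one per word. To read a word group as running text, join the list; to count words without building a list, XPath's `count()` function works too. The nested noun phrase below is made up for illustration and does not claim to match the real analysis.

```python
from lxml import etree

# Toy word group (structure invented for this example).
np = etree.fromstring(
    '<wg class="np">'
    "<w>Βίβλος</w>"
    '<wg class="np"><w>γενέσεως</w><w>Ἰησοῦ</w><w>Χριστοῦ</w></wg>'
    "</wg>"
)

phrase = " ".join(np.xpath(".//w/text()"))  # words in document order
word_count = np.xpath("count(.//w)")        # XPath 1.0 numbers come back as floats

print(phrase, word_count)
```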

### Words

Now let's look at individual words.

In [None]:
words = tree.xpath('//w')

In [None]:
len(words)

In [None]:
words[0].text

In [None]:
words[-1].text

We can choose words of a given class, e.g. nouns or verbs:

In [None]:
nouns = tree.xpath('//w[@class="noun"]')

In [None]:
len(nouns)

In [None]:
nouns[0].text

In [None]:
nouns[-1].text

In [None]:
verbs = tree.xpath('//w[@class="verb"]')

In [None]:
len(verbs)

In [None]:
verbs[0].text

In [None]:
verbs[-1].text
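
Rather than running one query per word class, you can tally every value of the `class` attribute in a single pass with `collections.Counter`. This is a sketch on three toy words, not output from the treebank.

```python
from collections import Counter

from lxml import etree

# Toy words for illustration.
doc = etree.fromstring(
    "<sentence>"
    '<w class="noun">Βίβλος</w>'
    '<w class="noun">γενέσεως</w>'
    '<w class="verb">πιστεύων</w>'
    "</sentence>"
)

# Count words by the value of their class attribute.
by_class = Counter(w.get("class") for w in doc.xpath("//w"))

for word_class, total in by_class.most_common():
    print(word_class, total)
```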

## Morphology

Words contain attributes that describe their morphology.  These can be used in queries on words.  For instance, here is an example of a word in John 3:16:

```xml
   <w role="v"
      class="verb"
      osisId="John.3.16!17"
      n="430030160170010"
      lemma="πιστεύω"
      normalized="πιστεύων"
      strong="4100"
      number="singular"
      gender="masculine"
      case="nominative"
      tense="present"
      voice="active"
      mood="participle"
      head="true">πιστεύων</w>
```

Let's do some queries using the attributes we see in this example.  

How many times does the lemma πιστεύω occur?

In [None]:
pisteuw = tree.xpath('//w[@lemma="πιστεύω"]')

In [None]:
len(pisteuw)

Because we did not restrict the morphology, the results include many different forms of this verb. Let's look at the first few:

In [None]:
pisteuw[0].text

In [None]:
pisteuw[1].text

The `osisId` attribute identifies the verse in which each word occurs, followed by `!` and the word's position in that verse:

In [None]:
pisteuw[0].get("osisId")

In [None]:
pisteuw[1].get("osisId")
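
For sorting or grouping it can help to split an `osisId` into its parts. The sketch below assumes the shape seen in the two examples in this notebook (`Matt.1.1!1` and `John.3.16!17`), that is, `Book.chapter.verse!position`. The `WordRef` type and `parse_osis_id` function are hypothetical helpers, and other ID shapes are not handled.

```python
from typing import NamedTuple


class WordRef(NamedTuple):
    book: str
    chapter: int
    verse: int
    position: int


def parse_osis_id(osis_id):
    """Split an ID such as 'John.3.16!17' (assumed format) into its parts."""
    verse_ref, position = osis_id.split("!")
    book, chapter, verse = verse_ref.split(".")
    return WordRef(book, int(chapter), int(verse), int(position))


ref = parse_osis_id("John.3.16!17")
print(ref.book, ref.chapter, ref.verse, ref.position)
```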

Now let's look for participle forms of this verb that are singular, masculine, and nominative:

In [None]:
smn = tree.xpath('//w[@mood="participle" and @number="singular" and @gender="masculine" and @case="nominative" and @lemma="πιστεύω"]')

In [None]:
len(smn)

That still allows multiple forms, e.g. both present and aorist forms of the verb:

In [None]:
smn[0].text

In [None]:
smn[1].text
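
To see which forms a query has let through, group the results by the attribute you left unconstrained. Here is a sketch that tallies tense and surface form for participles of one lemma, using three toy words rather than real query results.

```python
from collections import Counter

from lxml import etree

# Toy words for illustration.
doc = etree.fromstring(
    "<sentence>"
    '<w lemma="πιστεύω" mood="participle" tense="present">πιστεύων</w>'
    '<w lemma="πιστεύω" mood="participle" tense="aorist">πιστεύσας</w>'
    '<w lemma="πιστεύω" mood="participle" tense="present">πιστεύων</w>'
    "</sentence>"
)

hits = doc.xpath('//w[@mood="participle" and @lemma="πιστεύω"]')

# Key each hit by (tense, surface form) and count the repeats.
by_tense = Counter((w.get("tense"), w.text) for w in hits)

for (tense, form), total in by_tense.items():
    print(tense, form, total)
```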

Now let's query based on tense, voice, and mood, allowing a different set of possible forms:

In [None]:
pap = tree.xpath('//w[@tense="present" and @voice="active" and @mood="participle" and @lemma="πιστεύω"]')

In [None]:
len(pap)

In [None]:
pap[0].text

In [None]:
pap[1].text

In [None]:
pap[2].text
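
The morphology queries above all have the same shape: a list of attribute tests joined with `and`. One way to avoid retyping them is a small helper that builds the predicate from keyword arguments and passes the values as XPath variables. `find_words` is a hypothetical helper, and it assumes the keyword names are plain attribute names such as `lemma` or `tense`.

```python
from lxml import etree

# Toy words for illustration.
doc = etree.fromstring(
    "<sentence>"
    '<w class="verb" lemma="πιστεύω" tense="present">πιστεύων</w>'
    '<w class="verb" lemma="πιστεύω" tense="aorist">πιστεύσας</w>'
    '<w class="noun" lemma="βίβλος">Βίβλος</w>'
    "</sentence>"
)


def find_words(node, **attrs):
    """Return <w> elements whose attributes equal all the given values."""
    if not attrs:
        return node.xpath("//w")
    # Each name becomes a test such as @tense=$tense; values are bound as variables.
    predicate = " and ".join(f"@{name}=${name}" for name in attrs)
    return node.xpath(f"//w[{predicate}]", **attrs)


aorist = [w.text for w in find_words(doc, lemma="πιστεύω", tense="aorist")]
verbs = find_words(doc, **{"class": "verb"})  # "class" is a Python keyword
print(aorist, len(verbs))
```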

## Syntax

Syntax is largely about exploring relationships among clauses, and the `@role` attribute expresses some particularly important relationships.  First, let's take a look at all clauses.

In [None]:
clauses = tree.xpath('//wg[@class="cl"]')

In [None]:
len(clauses)

Adverbial clauses are marked with the role `adv`:

In [None]:
adverbial_clauses = tree.xpath('//wg[@role="adv" and @class="cl"]')

In [None]:
len(adverbial_clauses)

In [None]:
adverbial_clauses[0].xpath('.//w/text()')

And we can also look for clauses that contain adverbial clauses:

In [None]:
clauses_containing_adverbial_clauses = tree.xpath('//wg[@class="cl" and wg[@role="adv" and @class="cl"]]')

In [None]:
len(clauses_containing_adverbial_clauses)

In [None]:
clauses_containing_adverbial_clauses[0].xpath('.//w/text()')
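
Note that `wg[...]` inside the predicate tests only direct children. To find clauses that contain an adverbial clause at any depth, use `.//wg[...]` instead. The difference is easiest to see on a small nested example; the `label` attribute is made up here so the results are easy to read.

```python
from lxml import etree

# Three nested clauses; only the innermost one is adverbial (toy data).
doc = etree.fromstring(
    '<wg class="cl" label="outer">'
    '<wg class="cl" label="middle">'
    '<wg class="cl" role="adv" label="inner"/>'
    "</wg>"
    "</wg>"
)

# Child axis: the adverbial clause must be an immediate child.
direct = doc.xpath('//wg[@class="cl" and wg[@role="adv" and @class="cl"]]')

# Descendant axis: the adverbial clause may be nested at any depth.
nested = doc.xpath('//wg[@class="cl" and .//wg[@role="adv" and @class="cl"]]')

direct_labels = [wg.get("label") for wg in direct]
nested_labels = [wg.get("label") for wg in nested]
print(direct_labels, nested_labels)
```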

A clause can also be the object of another clause:

In [None]:
object_clauses = tree.xpath('//wg[@role="o" and @class="cl"]')

In [None]:
len(object_clauses)

In [None]:
object_clauses[1].xpath(".//w/text()")

In [None]:
clauses_containing_object_clauses = tree.xpath('//wg[@class="cl" and wg[@role="o" and @class="cl"]]')

In [None]:
len(clauses_containing_object_clauses)

In [None]:
clauses_containing_object_clauses[1].xpath(".//w/text()")
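
You can also start from the object clause and walk up to the clause that contains it with `getparent()`. The sketch below pairs each object clause with any verb that is a direct child of its parent clause. The markup, the `label` attribute, and the placeholder words are invented for illustration, and the real trees may place the verb deeper in the structure.

```python
from lxml import etree

# Toy clause with a verb and an object clause (placeholder words).
doc = etree.fromstring(
    '<wg class="cl" label="main">'
    '<w role="v" class="verb">VERB</w>'
    '<wg class="cl" role="o" label="object"><w class="noun">NOUN</w></wg>'
    "</wg>"
)

pairs = []
for obj in doc.xpath('//wg[@role="o" and @class="cl"]'):
    parent = obj.getparent()
    if parent is None:  # an object clause at the root has no parent
        continue
    governing_verbs = parent.xpath('w[@role="v"]/text()')
    pairs.append((governing_verbs, obj.xpath(".//w/text()")))

print(pairs)
```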