# Tutorial: Python 3, lxml, and Greek Syntax Trees

[jonathanrobie.biblicalhumanities.org](http://jonathanrobie.biblicalhumanities.org)

**NOTE: This notebook does not render correctly on some mobile devices. If it looks like a JSON document, read it in a desktop web browser.**

This Jupyter Notebook shows how to work with the biblicalhumanities.org [Greek syntax trees](https://github.com/biblicalhumanities/greek-new-testament/syntax-trees) using Python 3 and the [lxml library](http://lxml.de/xpathxslt.html). These syntax trees are represented as XML text files. In this tutorial, we parse the files once and find what we want using XPath. The lxml library supports only XPath 1.0; I will explore XPath 3.1 and XQuery 3.1 using the [BaseX Client API](https://github.com/BaseXdb/basex/tree/master/basex-api/src/main/python) in a future notebook.

## Install Jupyter

Follow the instructions [here](https://jupyter.org/install.html).

## Install lxml

Follow the lxml installation instructions [here](http://lxml.de/installation.html).

If you want to create notebooks of your own (highly recommended!), you will also need Jupyter; see the instructions [here](https://jupyter.org/install.html).

## Get the Syntax trees

Get the syntax trees using git:

```
$ git clone https://github.com/biblicalhumanities/greek-new-testament
```

On my machine, I clone repos into the ~/git subdirectory. If you use a different directory structure, set the following variable to the location of the file `nestle1904lowfat.xml` on your machine.

In [1]:
TREEBANK = "/Users/jonathan/git/greek-new-testament/syntax-trees/nestle1904-lowfat/xml/nestle1904lowfat.xml"
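
If you would rather not hard-code an absolute path, here is a minimal sketch that builds the same location from your home directory. It assumes you cloned the repository into `~/git` as I did; `REPO` and `TREEBANK_PATH` are names I made up for this illustration.

```python
from pathlib import Path

# Assumption: the repository was cloned into ~/git, as described above.
REPO = Path.home() / "git" / "greek-new-testament"

# Hypothetical name; convert with str(TREEBANK_PATH) if you want a plain string.
TREEBANK_PATH = REPO / "syntax-trees" / "nestle1904-lowfat" / "xml" / "nestle1904lowfat.xml"

# A quick sanity check before parsing: does the file exist on this machine?
print(TREEBANK_PATH, TREEBANK_PATH.exists())
```

If the check prints `False`, adjust `REPO` to match your own directory structure.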

This file contains a series of XInclude statements that each include one file from the Greek New Testament:

```xml
<gnt xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="01-matthew.xml"/>
    <xi:include href="02-mark.xml"/>
    <xi:include href="03-luke.xml"/>
    <xi:include href="04-john.xml"/>
    !!! SNIP !!!
</gnt>
```

Each of these included files contains a `<book/>` element that looks like this:

```xml 
<book xml:base="01-matthew.xml">
  <sentence>
    <milestone unit="verse" id="Matt.1.1">Matt.1.1</milestone>
    <p>Βίβλος γενέσεως Ἰησοῦ Χριστοῦ υἱοῦ Δαυεὶδ υἱοῦ Ἀβραάμ.</p>
    <wg class="cl">
      <wg class="cl" n="400010010010082">
        <wg role="p" class="np" n="400010010010080">
          <w class="noun"
             type="common"
             osisId="Matt.1.1!1"
             n="400010010010010"
             lemma="βίβλος"
             normalized="Βίβλος"
             strong="976"
             number="singular"
             gender="feminine"
             case="nominative"
             head="true">Βίβλος</w>
    !!! SNIP !!!
```

Note the following:

- The `<book/>` element contains a sequence of `<sentence/>` elements, which represent the sentences of a book.
- When expanded with XInclude, the `xml:base` attribute identifies the book.  
- Verses are represented using `<milestone/>` elements that occur within sentences.
- For the sake of readability, each sentence has a `<p/>` element that contains the sentence in plain text.
- Sentences contain word groups (`<wg/>`) and words (`<w/>`), which can each carry `class` and `role` attributes.  For instance, a clause is a word group where `class="cl"`, and a noun phrase is a word group where `class="np"`.
- More details on this format can be found in the [Nestle1904 Lowfat README](https://github.com/biblicalhumanities/greek-new-testament/tree/master/syntax-trees/nestle1904-lowfat/README.md).  The values that an attribute can take are more fully documented in the [Nestle 1904 Documentation](https://github.com/biblicalhumanities/greek-new-testament/blob/master/syntax-trees/nestle1904/doc/SBLGNT%20Treebank%20Documentation.pdf).

## Import lxml and parse the syntax trees

Import the `lxml` library and parse the file that `TREEBANK` points to, then use XInclude to expand each book inline.

In [2]:
from lxml import etree

In [3]:
tree = etree.parse(TREEBANK)

In [4]:
tree.xinclude()
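
To see what `xinclude()` does without touching the real treebank, here is a small sketch that writes two throwaway files to a temporary directory and expands one into the other. The file names and the `osisId="Demo"` value are invented for this illustration.

```python
import tempfile
from pathlib import Path

from lxml import etree

with tempfile.TemporaryDirectory() as tmp_name:
    tmp_dir = Path(tmp_name)

    # An invented "book" file, standing in for e.g. 01-matthew.xml.
    (tmp_dir / "01-demo.xml").write_text(
        '<book osisId="Demo"><sentence/></book>', encoding="utf-8"
    )

    # An invented top-level file that pulls the book in with XInclude.
    (tmp_dir / "gnt-demo.xml").write_text(
        '<gnt xmlns:xi="http://www.w3.org/2001/XInclude">'
        '<xi:include href="01-demo.xml"/>'
        "</gnt>",
        encoding="utf-8",
    )

    demo = etree.parse(str(tmp_dir / "gnt-demo.xml"))
    before = len(demo.xpath("/gnt/book"))  # only <xi:include/> is present
    demo.xinclude()                        # replace each include with its target
    after = len(demo.xpath("/gnt/book"))

print(before, after)  # 0 1
```

Before expansion there is no `<book/>` element to find, which is why the tutorial calls `tree.xinclude()` before running any queries.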

## Books, verses

Most realistic use cases involving books and verses also require linguistic units (sentences, word groups, words), but let's start by showing just the books and verses.  After we introduce linguistic units, we will show how to combine the two.

### Books

Each book is an element located directly under the `<gnt/>` element:

```xml
<gnt xmlns:xi="http://www.w3.org/2001/XInclude">
    <book xml:base="01-matthew.xml">
```

Let's use XPath to find the books. 
    

In [5]:
books = tree.xpath('/gnt/book')

In [6]:
len(books)

27

Now let's print out the `osisId` attribute of each book to make sure that we have the books we are expecting.

In [7]:
for book in books:
    print(book.get("osisId"))

Matt
Mark
Luke
John
Acts
Rom
1Cor
2Cor
Gal
Eph
Phil
Col
1Thess
2Thess
1Tim
2Tim
Titus
Phlm
Heb
Jas
1Pet
2Pet
1John
2John
3John
Jude
Rev
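
Once you know the identifiers, you can pick out a single book or build a lookup table. The following is a sketch on a tiny invented document rather than the real treebank; `books_by_id` is a hypothetical name.

```python
from lxml import etree

# Invented miniature document with three empty books.
sample = etree.fromstring(
    '<gnt><book osisId="Matt"/><book osisId="Mark"/><book osisId="John"/></gnt>'
)

# Select one book directly with an attribute predicate ...
john = sample.xpath('/gnt/book[@osisId="John"]')

# ... or index all books by their osisId for repeated lookups.
books_by_id = {book.get("osisId"): book for book in sample.xpath("/gnt/book")}

print(len(john), list(books_by_id))
```

With the real data, `books_by_id["John"].xpath(".//w")` would then restrict a query to one book, because a path starting with `.` is evaluated relative to that element.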


### Verses

Verses can be found in several ways. Each sentence has a milestone element identifying the starting verse for the sentence:

```xml
 <sentence>
    <milestone unit="verse" id="Matt.1.1">Matt.1.1</milestone>
    <p>Βίβλος γενέσεως Ἰησοῦ Χριστοῦ υἱοῦ Δαυεὶδ υἱοῦ Ἀβραάμ.</p>
```

Note that each milestone marks the starting point of a sentence, so the number of milestones is the same as the number of sentences. Because many verses contain multiple sentences, the number of milestones is larger than the number of verses.  Also, be aware that some sentences span multiple verses.

In [8]:
verses = tree.xpath('//milestone[@unit="verse"]')

In [9]:
verses[0].text

'Matt.1.1'

In Python, the index `-1` refers to the last item in a list, so the following finds the last milestone:

In [10]:
verses[-1].text

'Rev.22.21'

You can search for verses in a book by looking for milestones whose text starts with the name of the book followed by a period:

In [11]:
john_verses = tree.xpath('//milestone[@unit="verse" and starts-with(., "John.")]')
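
Here is a small sketch of that query on invented data. It illustrates two things: the prefix `John.` does not match `1John.`, and because milestones repeat when a verse holds several sentences, you need a set to get distinct verses. The sample milestones are made up to mirror the description above.

```python
from lxml import etree

# Invented data: John.1.1 appears twice, as if it held two sentences.
sample = etree.fromstring(
    "<gnt>"
    '<sentence><milestone unit="verse" id="John.1.1">John.1.1</milestone></sentence>'
    '<sentence><milestone unit="verse" id="John.1.1">John.1.1</milestone></sentence>'
    '<sentence><milestone unit="verse" id="John.1.2">John.1.2</milestone></sentence>'
    '<sentence><milestone unit="verse" id="1John.1.1">1John.1.1</milestone></sentence>'
    "</gnt>"
)

john = sample.xpath('//milestone[@unit="verse" and starts-with(., "John.")]')

# Collapse repeated milestones into distinct verse references.
distinct = sorted({milestone.text for milestone in john})

print(len(john), distinct)  # 3 ['John.1.1', 'John.1.2']
```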

## Linguistic units

Let's look at the sentences, clauses, and words in our treebank.  

### Sentences

First, let's count all sentences in our treebank:

In [12]:
sentences = tree.xpath('//sentence')

In [13]:
len(sentences)

8011

Now let's look at just the sentences in John 3, that is, sentences containing a milestone whose text starts with the string `John.3.`:

In [14]:
sentences = tree.xpath('//sentence[ milestone[@unit="verse" and starts-with(., "John.3.")] ]')

In [15]:
len(sentences)

42
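
If you run this kind of query often, you can wrap it in a function and pass the prefix as an XPath variable, which lxml supports through keyword arguments to `xpath()`. This is only a sketch on invented data; `sentences_in` is a hypothetical helper. It also shows why the trailing period in the prefix matters.

```python
from lxml import etree

# Invented data: one sentence from chapter 1, one from chapter 13.
sample = etree.fromstring(
    "<gnt>"
    '<sentence><milestone unit="verse" id="John.1.1">John.1.1</milestone><p>first</p></sentence>'
    '<sentence><milestone unit="verse" id="John.13.1">John.13.1</milestone><p>second</p></sentence>'
    "</gnt>"
)


def sentences_in(node, prefix):
    """Return sentences whose verse milestone starts with the given prefix."""
    return node.xpath(
        '//sentence[milestone[@unit="verse" and starts-with(., $prefix)]]',
        prefix=prefix,  # bound to $prefix in the expression
    )


with_period = [s.findtext("p") for s in sentences_in(sample, "John.1.")]
without_period = [s.findtext("p") for s in sentences_in(sample, "John.1")]

print(with_period)     # ['first']
print(without_period)  # ['first', 'second']
```

Without the trailing period, `John.1` also matches `John.13.1`, so chapter queries should always end the prefix with a period.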

### Clauses and Phrases

Now let's look at clauses and noun phrases, which are both represented as `<wg/>` elements.  The `class` attribute identifies the class of the word group:

In [16]:
clauses = tree.xpath('//wg[@class="cl"]')

In [17]:
len(clauses)

61551

We can pick one of these clauses and use XPath to see the words that it contains:

In [18]:
clauses[1].xpath(".//w/text()")

['Βίβλος', 'γενέσεως', 'Ἰησοῦ', 'Χριστοῦ', 'υἱοῦ', 'Δαυεὶδ', 'υἱοῦ', 'Ἀβραάμ.']

Now let's look at noun phrases:

In [19]:
nps = tree.xpath('//wg[@class="np"]')

In [20]:
len(nps)

38507

In [21]:
nps[2].xpath(".//w/text()")

['Ἰησοῦ', 'Χριστοῦ', 'υἱοῦ', 'Δαυεὶδ', 'υἱοῦ', 'Ἀβραάμ.']
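
Since we keep asking for the words inside a word group, a small helper can save typing. The sketch below uses a heavily abbreviated, invented fragment in the shape of the treebank; `text_of` is a hypothetical name.

```python
from lxml import etree

# Abbreviated, invented fragment: a clause containing nested noun phrases.
sample = etree.fromstring(
    "<sentence>"
    '<wg class="cl">'
    '<wg class="np" role="p">'
    '<w class="noun">Βίβλος</w>'
    '<wg class="np"><w class="noun">γενέσεως</w></wg>'
    "</wg>"
    "</wg>"
    "</sentence>"
)


def text_of(node):
    """Join the text of every <w/> below node, in document order."""
    return " ".join(node.xpath(".//w/text()"))


# Print each word group's class next to the words it spans.
for wg in sample.xpath("//wg"):
    print(wg.get("class"), "|", text_of(wg))
```

Note the leading `.` in `.//w`: it keeps the search inside the element you pass in instead of scanning the whole document.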

### Words

Now let's look at individual words.

In [22]:
words = tree.xpath('//w')

In [23]:
len(words)

137779

In [24]:
words[0].text

'Βίβλος'

In [25]:
words[-1].text

'πάντων.'

We can choose words of a given class, e.g. nouns or verbs:

In [26]:
nouns = tree.xpath('//w[@class="noun"]')

In [27]:
len(nouns)

28455

In [28]:
nouns[0].text

'Βίβλος'

In [29]:
nouns[-1].text

'Ἰησοῦ'

In [30]:
verbs = tree.xpath('//w[@class="verb"]')

In [31]:
len(verbs)

28357

In [32]:
verbs[0].text

'ἐγέννησεν'

In [33]:
verbs[-1].text

'ἔρχου'
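
Rather than running one query per word class, you can tally every value of the `class` attribute in a single pass. Here is a sketch using `collections.Counter` on three invented words; with the real data you would iterate over `tree.xpath('//w')` instead.

```python
from collections import Counter

from lxml import etree

# Invented miniature sentence with two nouns and one verb.
sample = etree.fromstring(
    "<sentence>"
    '<w class="noun">Βίβλος</w>'
    '<w class="noun">γενέσεως</w>'
    '<w class="verb">ἐγέννησεν</w>'
    "</sentence>"
)

# Count how often each word class occurs.
class_counts = Counter(w.get("class") for w in sample.xpath("//w"))

print(class_counts.most_common())  # [('noun', 2), ('verb', 1)]
```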

## Morphology

Words contain attributes that describe their morphology.  These can be used in queries on words.  For instance, here is an example of a word in John 3:16:

```xml
   <w role="v"
      class="verb"
      osisId="John.3.16!17"
      n="430030160170010"
      lemma="πιστεύω"
      normalized="πιστεύων"
      strong="4100"
      number="singular"
      gender="masculine"
      case="nominative"
      tense="present"
      voice="active"
      mood="participle"
      head="true">πιστεύων</w>
```

Let's do some queries using the attributes we see in this example.  

How many times does the lemma πιστεύω occur?

In [34]:
pisteuw = tree.xpath('//w[@lemma="πιστεύω"]')

In [35]:
len(pisteuw)

241

Because we did not specify morphology, this verb will occur in many forms.  Let's look at the first few:

In [36]:
pisteuw[0].text

'ἐπίστευσας'

In [37]:
pisteuw[1].text

'Πιστεύετε'

The `osisId` attribute identifies the verse in which each word occurs:

In [38]:
pisteuw[0].get("osisId")

'Matt.8.13!9'

In [39]:
pisteuw[1].get("osisId")

'Matt.9.28!15'
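
Judging from the examples in this notebook, an `osisId` looks like `Book.Chapter.Verse!Position`, where the number after `!` appears to be the word's position within the verse. Under that assumption, here is a sketch of a hypothetical helper that splits the identifier into its parts.

```python
def split_osis_id(osis_id):
    """Split 'Book.Chapter.Verse!Position' into its four parts.

    Assumption: the number after '!' is the word's position in the verse.
    """
    reference, _, position = osis_id.partition("!")
    book, chapter, verse = reference.split(".")
    return book, int(chapter), int(verse), int(position)


print(split_osis_id("Matt.8.13!9"))  # ('Matt', 8, 13, 9)
```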

Now let's look for participle forms of this verb that are singular, masculine, and nominative:

In [40]:
smn = tree.xpath('//w[@mood="participle" and @number="singular" and @gender="masculine" and @case="nominative" and @lemma="πιστεύω"]')

In [41]:
len(smn)

27

That still allows multiple forms, e.g. both present and aorist forms of the verb:

In [42]:
smn[0].text

'πιστεύσας'

In [43]:
smn[1].text

'πιστεύων'

Now let's query based on tense, voice, and mood, allowing a different set of possible forms:

In [44]:
pap = tree.xpath('//w[@tense="present" and @voice="active" and @mood="participle" and @lemma="πιστεύω"]')

In [45]:
len(pap)

53

In [46]:
pap[0].text

'πιστευόντων'

In [47]:
pap[1].text

'πιστεύοντες'

In [48]:
pap[2].text

'πιστεύοντι.'
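
The morphology queries above all follow the same pattern: a list of `@attribute="value"` tests joined with `and`. As a sketch, a hypothetical helper `find_words` can build that predicate from keyword arguments and pass the values as XPath variables. The three sample words are invented for this illustration.

```python
from lxml import etree

# Invented sample: three forms of the same lemma.
sample = etree.fromstring(
    "<sentence>"
    '<w lemma="πιστεύω" tense="present" mood="participle">πιστεύων</w>'
    '<w lemma="πιστεύω" tense="aorist" mood="participle">πιστεύσας</w>'
    '<w lemma="πιστεύω" tense="aorist" mood="indicative">ἐπίστευσας</w>'
    "</sentence>"
)


def find_words(node, **features):
    """Return <w/> elements whose attributes equal all the given values."""
    names = list(features)
    # Build e.g. '@lemma=$v0 and @mood=$v1'; values travel as XPath variables.
    conditions = " and ".join(f"@{name}=$v{i}" for i, name in enumerate(names))
    variables = {f"v{i}": features[name] for i, name in enumerate(names)}
    expression = f"//w[{conditions}]" if conditions else "//w"
    return node.xpath(expression, **variables)


participles = find_words(sample, lemma="πιστεύω", mood="participle")
print([w.text for w in participles])  # ['πιστεύων', 'πιστεύσας']
```

Because `class` is a reserved word in Python, that attribute has to be passed as `find_words(tree, **{"class": "verb"})`. Attribute names are pasted into the expression as-is, so only pass names you trust.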

## Syntax

Syntax is largely about exploring relationships among clauses, and the `@role` attribute expresses some particularly important relationships.  First, let's take a look at all clauses.

In [49]:
clauses = tree.xpath('//wg[@class="cl"]')

In [50]:
len(clauses)

61551

Adverbial clauses are marked with the role `adv`:

In [51]:
adverbial_clauses = tree.xpath('//wg[@role="adv" and @class="cl"]')

In [52]:
len(adverbial_clauses)

2622

In [53]:
adverbial_clauses[0].xpath('.//w/text()')

['πρὶν', 'ἢ', 'συνελθεῖν', 'αὐτοὺς']

We can also look for clauses that directly contain an adverbial clause:

In [54]:
clauses_containing_adverbial_clauses = tree.xpath('//wg[@class="cl" and wg[@role="adv" and @class="cl"]]')

In [55]:
len(clauses_containing_adverbial_clauses)

2343

In [56]:
clauses_containing_adverbial_clauses[0].xpath('.//w/text()')

['πρὶν',
 'ἢ',
 'συνελθεῖν',
 'αὐτοὺς',
 'εὑρέθη',
 'ἐν',
 'γαστρὶ',
 'ἔχουσα',
 'ἐκ',
 'Πνεύματος',
 'Ἁγίου.']
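
One detail worth knowing: inside a predicate, `wg[...]` only looks at direct children, while `.//wg[...]` looks at all descendants. The sketch below shows the difference on an invented three-level nesting; the `label` attribute is made up so we can tell the clauses apart and is not part of the treebank format.

```python
from lxml import etree

# Invented nesting: outer > middle > inner, where inner is adverbial.
sample = etree.fromstring(
    "<sentence>"
    '<wg class="cl" label="outer">'
    '<wg class="cl" label="middle">'
    '<wg class="cl" role="adv" label="inner"><w>πρὶν</w></wg>'
    "</wg>"
    "</wg>"
    "</sentence>"
)

# Child axis: the adverbial clause must be an immediate child.
direct = sample.xpath('//wg[@class="cl" and wg[@role="adv" and @class="cl"]]')

# Descendant axis: the adverbial clause may be nested at any depth.
nested = sample.xpath('//wg[@class="cl" and .//wg[@role="adv" and @class="cl"]]')

print([wg.get("label") for wg in direct])  # ['middle']
print([wg.get("label") for wg in nested])  # ['outer', 'middle']
```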

A clause can also be the object of another clause:

In [57]:
object_clauses = tree.xpath('//wg[@role="o" and @class="cl"]')

In [58]:
len(object_clauses)

1284

In [59]:
object_clauses[1].xpath(".//w/text()")

['αὐτὴν', 'δειγματίσαι,']

In [60]:
clauses_containing_object_clauses = tree.xpath('//wg[@class="cl" and wg[@role="o" and @class="cl"]]')

In [61]:
len(clauses_containing_object_clauses)

1284

In [62]:
clauses_containing_object_clauses[1].xpath(".//w/text()")

['Ἰωσὴφ',
 'ὁ',
 'ἀνὴρ',
 'αὐτῆς,',
 'δίκαιος',
 'ὢν',
 'καὶ',
 'μὴ',
 'θέλων',
 'αὐτὴν',
 'δειγματίσαι,',
 'ἐβουλήθη',
 'λάθρᾳ',
 'ἀπολῦσαι',
 'αὐτήν.']
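
You can also go the other way, from an object clause up to the clause that contains it, using lxml's `getparent()`. Here is a sketch on a simplified, invented fragment that pairs each object clause with the verb of its parent clause; the real trees are more deeply nested than this.

```python
from lxml import etree

# Simplified, invented fragment: a clause with a verb and an object clause.
sample = etree.fromstring(
    "<sentence>"
    '<wg class="cl">'
    '<w role="v" class="verb">θέλων</w>'
    '<wg class="cl" role="o"><w>αὐτὴν</w><w>δειγματίσαι,</w></wg>'
    "</wg>"
    "</sentence>"
)

pairs = []
for object_clause in sample.xpath('//wg[@role="o" and @class="cl"]'):
    parent = object_clause.getparent()               # the containing clause
    governing = parent.xpath('w[@role="v"]/text()')  # verbs directly in it
    pairs.append((governing, object_clause.xpath(".//w/text()")))

print(pairs)  # [(['θέλων'], ['αὐτὴν', 'δειγματίσαι,'])]
```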