# Tutorial: Python 3, BaseX, and Greek Syntax Trees

[jonathanrobie.biblicalhumanities.org](http://jonathanrobie.biblicalhumanities.org)

This Jupyter Notebook shows how to work with the biblicalhumanities.org [Greek syntax trees](https://github.com/biblicalhumanities/greek-new-testament/tree/master/syntax-trees) using Python 3 and the [BaseXClient Library](https://pypi.python.org/pypi/BaseXClient/8.4.4). BaseX provides several advantages:

- Full support for XQuery 3.1 and XPath 3.1, including grouping, windowing, and JSON
- Indexes and query optimization
- A GUI that is very productive for testing and optimizing queries

## Install BaseX and the BaseX Client

First, [download and install BaseX](http://basex.org/products/download/all-downloads/).

Then install the BaseXClient for Python 3 using `pip` or `pip3` (depending on your system):

```
$ pip install BaseXClient
Collecting BaseXClient
  Using cached BaseXClient-8.4.4-py2.py3-none-any.whl
Installing collected packages: BaseXClient
Successfully installed BaseXClient-8.4.4
```

## Get the Syntax Trees and Add Them to BaseX Using the BaseX GUI

Get the syntax trees using git:

```
$ git clone https://github.com/biblicalhumanities/greek-new-testament
```

Follow these BaseX instructions to create a database:

- [Create Database](http://docs.basex.org/wiki/Graphical_User_Interface#Create_Database)

On my machine, I clone repos into the ~/git subdirectory, and the XML file you need to create the database is here:

```
~/git/greek-new-testament/syntax-trees/nestle1904-lowfat/xml/nestle1904lowfat.xml
```

This file contains a series of `xi:include` elements, each of which includes one book of the Greek New Testament:

```xml
<gnt xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="01-matthew.xml"/>
    <xi:include href="02-mark.xml"/>
    <xi:include href="03-luke.xml"/>
    <xi:include href="04-john.xml"/>
    !!! SNIP !!!
</gnt>
```

Each of these included files contains a `<book/>` element that looks like this:

```xml 
<book xml:base="01-matthew.xml">
  <sentence>
    <milestone unit="verse" id="Matt.1.1">Matt.1.1</milestone>
    <p>Βίβλος γενέσεως Ἰησοῦ Χριστοῦ υἱοῦ Δαυεὶδ υἱοῦ Ἀβραάμ.</p>
    <wg class="cl">
      <wg class="cl" n="400010010010082">
        <wg role="p" class="np" n="400010010010080">
          <w class="noun"
             type="common"
             osisId="Matt.1.1!1"
             n="400010010010010"
             lemma="βίβλος"
             normalized="Βίβλος"
             strong="976"
             number="singular"
             gender="feminine"
             case="nominative"
             head="true">Βίβλος</w>
    !!! SNIP !!!
```

Note the following:

- The `<book/>` element contains a sequence of `<sentence/>` elements, which represent the sentences of a book.
- When the file is expanded with XInclude, the `xml:base` attribute identifies the file each book came from, and therefore the book.
- Verses are represented using `<milestone/>` elements that occur within sentences.
- For the sake of readability, each sentence has a `<p/>` element that contains the sentence in plain text.
- Sentences contain word groups (`<wg/>`) and words (`<w/>`), which can each carry `class` and `role` attributes. For instance, a clause is a word group where `class="cl"`, and a noun phrase is a word group where `class="np"`.
- More details on this format can be found in the [Nestle1904 Lowfat README](https://github.com/biblicalhumanities/greek-new-testament/tree/master/syntax-trees/nestle1904-lowfat/README.md).  The values that an attribute can take are more fully documented in the [Nestle 1904 Documentation](https://github.com/biblicalhumanities/greek-new-testament/blob/master/syntax-trees/nestle1904/doc/SBLGNT%20Treebank%20Documentation.pdf).
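
To make the structure described in these notes concrete, here is a minimal sketch that walks a hand-closed copy of the Matthew 1:1 excerpt. It uses the standard library's `xml.etree.ElementTree` rather than BaseX, so no server is needed. The `SAMPLE` string is an abbreviated illustration with most attributes omitted, not the full markup.

```python
import xml.etree.ElementTree as ET

# Hand-closed, abbreviated copy of the Matthew 1:1 excerpt shown above.
SAMPLE = """
<book xml:base="01-matthew.xml">
  <sentence>
    <milestone unit="verse" id="Matt.1.1">Matt.1.1</milestone>
    <p>Βίβλος γενέσεως Ἰησοῦ Χριστοῦ υἱοῦ Δαυεὶδ υἱοῦ Ἀβραάμ.</p>
    <wg class="cl">
      <wg class="cl" n="400010010010082">
        <wg role="p" class="np" n="400010010010080">
          <w class="noun" type="common" osisId="Matt.1.1!1"
             lemma="βίβλος" case="nominative" head="true">Βίβλος</w>
        </wg>
      </wg>
    </wg>
  </sentence>
</book>
"""

book = ET.fromstring(SAMPLE)
sentence = book.find("sentence")

# The milestone names the verse; <p> holds the plain text of the sentence.
verse_id = sentence.find("milestone").get("id")
text = sentence.find("p").text

# Word groups nest; iter() visits them in document order.
group_classes = [wg.get("class") for wg in sentence.iter("wg")]
first_word = next(sentence.iter("w"))

print(verse_id, group_classes, first_word.get("lemma"), first_word.text)
```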

## Import BaseXClient and open the database

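Import the client, connect to a running BaseX server, and open the database. The session below uses port 1984 and the `admin` credentials; adjust the host, port, and credentials to match your installation.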

In [1]:
from BaseXClient import BaseXClient

In [2]:
session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')

In [3]:
session.execute("open nestle1904lowfat")

''
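
In a notebook the session usually stays open across cells. In a standalone script it can be convenient to guarantee that the session is closed again. The following is a minimal sketch under that assumption. The helper name `opened` is hypothetical, and it relies only on the `execute()` and `close()` methods of the session object.

```python
from contextlib import contextmanager

@contextmanager
def opened(session, database):
    """Open `database` on an existing session; close the session on exit."""
    try:
        session.execute("open " + database)
        yield session
    finally:
        # Runs even if a query inside the with-block raises.
        session.close()
```

With the client imported as above, usage might look like `with opened(BaseXClient.Session('localhost', 1984, 'admin', 'admin'), "nestle1904lowfat") as session: ...`.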

## Define some helper functions

Sometimes we want query results as text, sometimes as an etree so that the XML is easier to process in Python, and sometimes we just want to print them. Let's define a function for each.

In [4]:
from lxml import etree

def xquery(query):
    return session.query(query).execute()

def xquery_etree(query):
    return etree.XML(xquery(query))

# https://stackoverflow.com/questions/40085818/jupyter-notebook-output-cell-syntax-highlighting

from pygments import highlight
from pygments.lexers import XmlLexer
from pygments.formatters import HtmlFormatter
from IPython.display import HTML, display

def xml_display(xml):
    # Render a string of XML as syntax-highlighted HTML in the output cell
    formatter = HtmlFormatter()
    display(
        HTML('<style type="text/css">{}</style>{}'.format(
            formatter.get_style_defs('.highlight'),
            highlight(xml, XmlLexer(), formatter))))

def xquery_display(query):
    xml_display(xquery(query))

Now let's demonstrate each of these functions with a simple query. The `xquery()` function does a query and returns the result as a string.

In [5]:
xquery("//milestone[@id='Matt.1.1']")

'<milestone xmlns:xi="http://www.w3.org/2001/XInclude" unit="verse" id="Matt.1.1">Matt.1.1</milestone>'

The `xquery_etree()` function does a query and returns the result as an `etree` object.

In [6]:
xquery_etree("//milestone[@id='Matt.1.1']")

<Element milestone at 0x1109b3f08>
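
`etree.XML()` expects a single root element, so `xquery_etree()` suits queries that return exactly one element. For a result string holding several sibling elements, one option is to wrap the string in a root element before parsing. This is a minimal sketch of that idea using the standard library parser. The helper name `parse_many` and the wrapper element name are hypothetical, and the sample string is typed by hand rather than fetched from BaseX.

```python
import xml.etree.ElementTree as ET

def parse_many(result_text, wrapper="results"):
    """Parse a string holding zero or more sibling elements under one root."""
    return ET.fromstring("<{0}>{1}</{0}>".format(wrapper, result_text))

# Hand-typed stand-in for a query result that contains two items.
two_items = (
    '<milestone unit="verse" id="Matt.1.1">Matt.1.1</milestone>\n'
    '<milestone unit="verse" id="Matt.1.2">Matt.1.2</milestone>'
)

root = parse_many(two_items)
ids = [m.get("id") for m in root]
print(ids)
```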

The `xml_display()` function formats and displays a string of XML.

In [7]:
xml_display('<milestone xmlns:xi="http://www.w3.org/2001/XInclude" unit="verse" id="Matt.1.1">Matt.1.1</milestone>')

The `xquery_display()` function does a query, then formats and displays the result.

In [8]:
xquery_display("//milestone[@id='Matt.1.1']")

## Books, Verses

Most realistic use cases that involve books and verses also involve linguistic units (sentences, word groups, words), but let's start with just the books and verses. As we introduce linguistic units, we will show how to combine the two.

### Books

Each book is an element located directly under the `<gnt/>` element:

```xml
<gnt xmlns:xi="http://www.w3.org/2001/XInclude">
    <book xml:base="01-matthew.xml">
```

Let's use XPath to find the books. First, let's just count the books to make sure we get the right number.
    

In [9]:
xquery("count(//book)")

'27'

Books are too long to display entirely, so let's look at the first 1000 characters of the first book to make sure that we are looking at the right thing.

In [10]:
matthew = xquery("//book[1]")

In [11]:
xml_display(matthew[0:1000])

Note the `osisId` attribute, which can be used to identify a book, and the `xml:base` attribute, which identifies the file that contains the book.  Let's make sure we have all the books of the New Testament, and that they are in the right order.

In [12]:
q = """
    for $book in //book
    return
       <book>
         {
            $book/@osisId,
            $book/@xml:base
         }
       </book>
"""

xquery_display(q)

### Verses

Verses can be found in several ways. Each sentence has a milestone element identifying the starting verse for the sentence:

```xml
 <sentence>
    <milestone unit="verse" id="Matt.1.1">Matt.1.1</milestone>
    <p>Βίβλος γενέσεως Ἰησοῦ Χριστοῦ υἱοῦ Δαυεὶδ υἱοῦ Ἀβραάμ.</p>
```

Note that the milestones correspond to the starting point of a sentence, so the count will be the same as the count of sentences. Because many verses contain multiple sentences, the number of milestones is larger than the number of verses.  Also, be aware that some sentences span multiple verses.

Let's start by looking for the first and last verses in the Greek New Testament:

In [13]:
xquery_display('(//milestone[@unit="verse"])[1]')

In [14]:
xquery_display('(//milestone[@unit="verse"])[last()]')

You can search for verses in a book by looking for milestones with `id` attributes that start with the name of the book: 

In [15]:
john_verses = xquery('//milestone[@unit="verse" and starts-with(@id, "John.")]')

You can search for verses in a chapter by looking for milestones with `id` attributes that start with the name of the book and the chapter number, followed by a period so that the prefix cannot match another chapter:

In [16]:
john3_verses = xquery('//milestone[@unit="verse" and starts-with(@id, "John.3.")]')
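
Prefix matching on ids is sensitive to the trailing period: a prefix such as `Matt.1` also matches chapters 10 through 19. The sketch below shows the difference on a few hand-typed ids in plain Python. The helper name `in_chapter` is hypothetical.

```python
def in_chapter(milestone_id, book, chapter):
    """True if an id such as 'Matt.1.18' belongs to the given book and chapter."""
    return milestone_id.startswith("{}.{}.".format(book, chapter))

ids = ["Matt.1.1", "Matt.1.18", "Matt.10.1", "Matt.11.2"]

# Without the trailing period, chapters 10 and 11 slip through.
loose = [i for i in ids if i.startswith("Matt.1")]

# With the trailing period, only chapter 1 matches.
strict = [i for i in ids if in_chapter(i, "Matt", 1)]

print(loose)
print(strict)
```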

## Syntactic Units: Sentences, Clauses, Phrases, Word Groups, and Words

Let's look at the sentences, clauses, and words in our treebank.  

### Sentences

First, let's count all sentences in our treebank:

In [17]:
xquery('count(//sentence)')

'8011'

You can use milestones to identify sentences. Let's look at just the sentences in John 3, that is, sentences containing a `milestone` with an `id` attribute that starts with the string `John.3.`:

In [18]:
xquery('count(//sentence[ milestone[starts-with(@id, "John.3.")] ] )')

'42'

Let's show the syntax tree for the sentence that contains John 3:16:

In [19]:
xquery_display('//sentence[ milestone[@id = "John.3.16"] ]')

Once you find a sentence, you can return any part of it and apply functions to its contents. For instance, the `p` element contains the text of the sentence in sentence order. This query returns just that part of John 3:16.

In [20]:
q = """
  let $s := //sentence[ milestone[@id = "John.3.16"] ]
  return string($s/p)
"""

xquery(q)

'Οὕτως γὰρ ἠγάπησεν ὁ Θεὸς τὸν κόσμον, ὥστε τὸν Υἱὸν τὸν μονογενῆ ἔδωκεν, ἵνα πᾶς ὁ πιστεύων εἰς αὐτὸν μὴ ἀπόληται ἀλλ’ ἔχῃ ζωὴν αἰώνιον.'
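
When the verse reference comes from a variable rather than being typed into the query, it helps to check its shape before splicing it into XQuery source. The following is a minimal sketch. The helper name `sentence_text_query` is hypothetical, and the pattern for references is an assumption based on the ids shown in this tutorial (book, chapter, verse separated by periods).

```python
import re

# Assumed id shape: Book.chapter.verse, e.g. "John.3.16"
REF = re.compile(r"^[1-3]?[A-Za-z]+\.\d+\.\d+$")

def sentence_text_query(ref):
    """Return XQuery source for the plain text of the sentence marked with `ref`."""
    if not REF.match(ref):
        raise ValueError("unexpected verse reference: {!r}".format(ref))
    return '//sentence[ milestone[@id = "{}"] ]/p ! string(.)'.format(ref)

print(sentence_text_query("John.3.16"))
```

The resulting string can be passed to the `xquery()` helper defined earlier.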

We can also find sentences based on the clauses, phrases, and words they contain.  Before we do that, let's learn how these are represented in the trees.

### Clauses and Phrases

Now let's look at clauses and noun phrases, which are both represented as `<wg/>` elements. The `class` attribute identifies the class of the word group. Let's start by counting word groups:

In [21]:
xquery('count(//wg)')

'112898'

Now let's narrow that down to clauses, which are word groups where the `class` attribute is `cl`:

In [22]:
xquery('count(//wg[@class="cl"])')

'61551'

Let's look at the first clause:

In [23]:
xquery_display('(//wg[@class="cl"])[1]')

Now let's count noun phrases:

In [24]:
xquery('count(//wg[@class="np"])')

'38507'

And look at the first noun phrase (which is found inside the first clause shown above):

In [25]:
xquery_display('(//wg[@class="np"])[1]')

And the first noun phrase that does not contain any word groups:

In [26]:
xquery_display('(//wg[@class="np" and not(.//wg)])[1]')
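
The `not(.//wg)` test selects word groups that have no word groups nested inside them. The sketch below applies the same test in Python to a toy tree using the standard library. The tree and its `n` values are invented for illustration and are far simpler than the real markup.

```python
import xml.etree.ElementTree as ET

# Toy tree for illustration only; not actual treebank markup.
TOY = """
<sentence>
  <wg class="cl">
    <wg class="np" n="outer">
      <wg class="np" n="inner"><w>a</w><w>b</w></wg>
      <w>c</w>
    </wg>
  </wg>
</sentence>
"""
root = ET.fromstring(TOY)

noun_phrases = [wg for wg in root.iter("wg") if wg.get("class") == "np"]

# Equivalent of not(.//wg): keep groups with no descendant word groups.
leaf_nps = [wg for wg in noun_phrases if wg.find(".//wg") is None]

print([wg.get("n") for wg in noun_phrases])
print([wg.get("n") for wg in leaf_nps])
```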

### Words

Now let's look at individual words.  First, let's count the words in the New Testament:

In [27]:
xquery('count(//w)')

'137779'

Let's look at the first word:

In [28]:
xquery_display('(//w)[1]')

And the last:

In [29]:
xquery_display('(//w)[last()]')

We can choose words of a given class, e.g. nouns or verbs.  Let's count the nouns:

In [30]:
xquery('count(//w[@class="noun"])')

'28455'

Let's see how many are proper and how many are common:

In [31]:
xquery('count(//w[@class="noun" and @type="proper"])')

'4632'

In [32]:
xquery('count(//w[@class="noun" and @type="common"])')

'23627'

Now let's count the verbs:

In [33]:
xquery('count(//w[@class="verb"])')

'28357'
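
The proper and common counts above sum to 28259, which is 196 fewer than the 28455 nouns, so some nouns evidently carry a different `type` value or none at all. A tally over attribute values shows every value at once, including missing ones. Here is a minimal sketch of such a tally on a toy word list. The words are placeholders, not treebank data.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Placeholder words for illustration only.
TOY = """
<wg>
  <w class="noun" type="common">w1</w>
  <w class="noun" type="proper">w2</w>
  <w class="noun">w3</w>
  <w class="verb">w4</w>
</wg>
"""
words = ET.fromstring(TOY).iter("w")

# A missing attribute comes back as None, so untyped nouns get their own bucket.
tally = Counter((w.get("class"), w.get("type")) for w in words)

for key, count in sorted(tally.items(), key=str):
    print(key, count)
```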

## Morphology

Words contain attributes that describe their morphology.  These can be used in queries on words.  For instance, here is an example of a word in John 3:16:

```xml
   <w role="v"
      class="verb"
      osisId="John.3.16!17"
      n="430030160170010"
      lemma="πιστεύω"
      normalized="πιστεύων"
      strong="4100"
      number="singular"
      gender="masculine"
      case="nominative"
      tense="present"
      voice="active"
      mood="participle"
      head="true">πιστεύων</w>
```

Let's do some queries using the attributes we see in this example.  

How many times do we see the word πιστεύω?

In [34]:
xquery('count(//w[@lemma="πιστεύω"])')

'241'

The osisId identifies the verse in which each word occurs and the position of the word within the verse:

In [35]:
xquery_display('//w[@lemma="πιστεύω"]/@osisId')
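
Because an `osisId` has a regular shape, it is easy to split into its parts on the Python side. This is a minimal sketch that assumes every id follows the `Book.chapter.verse!position` pattern seen in the example above. The names `WordRef` and `parse_osis_id` are hypothetical.

```python
from collections import namedtuple

WordRef = namedtuple("WordRef", "book chapter verse position")

def parse_osis_id(osis_id):
    """Split an id such as 'John.3.16!17' into its four parts."""
    verse_ref, position = osis_id.split("!")
    book, chapter, verse = verse_ref.split(".")
    return WordRef(book, int(chapter), int(verse), int(position))

ref = parse_osis_id("John.3.16!17")
print(ref)
```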

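Now let's use XQuery grouping to list the participle forms of this verb that are singular, masculine, and nominative. The query does not test `@mood`; among Greek verb forms, only participles inflect for gender and case, so the three conditions below are enough to select them.

In this query, the `normalized` attribute supplies each form with punctuation removed and with accent differences due to phonological context normalized.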

In [36]:
q = """   
    for $w in //w[@lemma="πιστεύω"]
    where $w/@number="singular"  
      and $w/@gender="masculine"
      and $w/@case="nominative"
    let $form := $w/@normalized ! lower-case(.)
    group by $form
    order by $form
    return $form
""" 

xquery_display(q)

We can also create an XML structure that shows where each of these forms occurs:

In [37]:
q = """   
    for $w in //w[@lemma="πιστεύω"]
    for $id in $w/@osisId
    where $w/@number="singular"  
      and $w/@gender="masculine"
      and $w/@case="nominative"
    let $form := $w/@normalized ! lower-case(.)
    group by $form
    order by $form
    return 
      <form>
       { attribute n  {$form}}
       { $id ! <loc>{ .}</loc> }
      </form>
""" 

xquery_display(q)
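
For readers more at home in Python than in XQuery, the grouping step corresponds roughly to collecting values in a dictionary keyed by the lower-cased form. The sketch below shows that correspondence on a hand-made list of pairs. Only the first pair comes from the John 3:16 example above; the `Demo.` ids are placeholders invented for illustration.

```python
from collections import defaultdict

# (normalized form, osisId) pairs; the "Demo." entries are invented placeholders.
pairs = [
    ("πιστεύων", "John.3.16!17"),
    ("Πιστεύων", "Demo.1.1!1"),
    ("πιστεύσας", "Demo.2.1!2"),
]

# group by: collect the locations of each lower-cased form
locations = defaultdict(list)
for form, osis_id in pairs:
    locations[form.lower()].append(osis_id)

# order by: visit the groups in sorted order of their key
for form in sorted(locations):
    print(form, locations[form])
```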

## Syntax

Syntax is largely about exploring relationships within a clause, and the `@role` attribute identifies these relationships.

Adverbial clauses are marked with the role `adv`. Let's count adverbial clauses:

In [38]:
xquery('count(//wg[@role="adv" and @class="cl"])')

'2622'

Let's look at the first adverbial clause. To make it easier to read, we will use the `string-join()` function to show just the text of the words in the clause.

In [39]:
q = """
    let $adv := (//wg[@class="cl" and @role="adv"])[1]
    return string-join($adv//w, " ")
"""

xquery_display(q)

We can look at the parent of this clause to see the clause that contains it.  

In [40]:
q = """
    let $adv := (//wg[@class="cl" and @role="adv"])[1]
    let $parent := $adv/parent::wg
    return string-join($parent//w, " ")
"""

xquery_display(q)

If we show the parent in XML, we can see that the verb is marked with the attribute `role="v"` and ἐν γαστρὶ ἔχουσα ἐκ Πνεύματος Ἁγίου is interpreted as an object of the verb:

In [41]:
q = """
    let $adv := (//wg[@class="cl" and @role="adv"])[1]
    let $parent := $adv/parent::wg
    return $parent
"""

xquery_display(q)

Let's find the verb using that `role="v"` attribute:

In [42]:
q = """
    let $adv := (//wg[@class="cl" and @role="adv"])[1]
    let $parent := $adv/parent::wg
    let $verb := $parent/*[@role="v"]
    return string($verb)
"""

xquery_display(q)

And we can look for the object within this same parent clause using the `role="o"` attribute:

In [43]:
q = """
    let $adv := (//wg[@class="cl" and @role="adv"])[1]
    let $parent := $adv/parent::wg
    let $object := $parent/*[@role="o"]
    return string-join($object//w, " ")
"""

xquery_display(q)
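
The same navigation (find an adverbial clause, step up to its parent, then pick out the verb and object by `role`) can be done in Python once a result has been parsed. The standard library's ElementTree has no parent axis, so this sketch builds a child-to-parent lookup first; lxml elements offer `getparent()` instead. The clause below uses placeholder tokens and is not actual treebank markup.

```python
import xml.etree.ElementTree as ET

# Toy clause with placeholder tokens; not actual treebank markup.
TOY = """
<sentence>
  <wg class="cl">
    <wg class="cl" role="adv"><w>ADV1</w><w>ADV2</w></wg>
    <w role="v" class="verb">VERB</w>
    <wg class="cl" role="o"><w>OBJ1</w><w>OBJ2</w></wg>
  </wg>
</sentence>
"""
root = ET.fromstring(TOY)

# Build a child -> parent lookup once for the whole tree.
parent_of = {child: parent for parent in root.iter() for child in parent}

def words(node):
    """Join the text of all words inside `node`, in document order."""
    return " ".join(w.text for w in node.iter("w"))

# First adverbial clause, then its parent clause.
adv = next(wg for wg in root.iter("wg")
           if wg.get("class") == "cl" and wg.get("role") == "adv")
parent = parent_of[adv]

# Direct children of the parent, selected by role.
verb = next(child for child in parent if child.get("role") == "v")
obj = next(child for child in parent if child.get("role") == "o")

print(words(adv), "|", verb.text, "|", words(obj))
```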