<img align="right" src="tf-small.png"/>

# SBLGNT and Text-Fabric

The source of the SBLGNT data in TF is really a treebank, a hierarchical structure.
We converted an XML representation of it into TF.

This notebook asks two questions: how can we measure the syntactic complexity of sentences from their tree structure,
and how is that complexity distributed over the text of the Greek New Testament?

## Complexity
It is too simple to equate complexity with the depth of the syntax trees.
For example, the genealogy in Luke 3 has a depth of more than 100, but this is a *tail* embedding: after the 
embedded elements, the embedders do not resume.
Real complexity arises when embedders resume after embeddings at many levels.
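
To make the point about depth concrete, here is a minimal sketch on toy trees, not on the SBLGNT data. It assumes a tree is a nested Python list and a terminal is a string; the helpers `depth` and `tail_chain` are hypothetical names introduced only for this illustration.

```python
def depth(tree):
    """Depth of a toy tree: a string is a terminal (depth 0), a list is a node."""
    if isinstance(tree, str):
        return 0
    return 1 + max((depth(child) for child in tree), default=0)


def tail_chain(n):
    """Build a right-branching chain: every level is one word plus one embedded subtree."""
    tree = "x"
    for i in range(n):
        tree = [f"w{i}", tree]
    return tree


# A genealogy-like tail embedding: very deep, but no embedder ever resumes.
genealogy_like = tail_chain(100)

# A small center embedding: "a ... g" resumes after "b ... f", which resumes after "c d e".
center = ["a", ["b", ["c", "d", "e"], "f"], "g"]

print(depth(genealogy_like))  # 100
print(depth(center))          # 3
```

Depth alone ranks the tail chain as far more complex than the center embedding, which is the opposite of the intuition described above.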

<img align="left" src="Complexity.png" width="40%"/>

## Rank
The following definition tries to capture this notion of complexity by discounting chains of nodes that all branch in the same direction. The intuition is that every such chain contributes only one level to the depth.

1. The *rank* of a tree is the maximum of the ranks of its paths (all paths going from the root to a terminal node).
2. The *rank* of a path is the rank of its terminal node.
3. The *rank* of a node is determined as follows:
   1. The *rank* of the root is 0.
   2. The child of a unary node has the same rank as its parent.
   3. If a node has rank 0 and has 2 or more children, all its children have rank 1.
   4. If a node $n$ has a positive rank $r$ and at least two children, consider its nearest ancestor $p$ with rank $r-1$:
      * If the path from $p$ to $n$ did not start with a left-most or right-most child, then all children of $n$
        have rank $r+1$.
      * If the path from $p$ to $n$ started with a left-most child, then all children of $n$, except the left-most one,
        have rank $r+1$, and the left-most one has rank $r$.
      * If the path from $p$ to $n$ started with a right-most child, then all children of $n$, except the right-most one,
        have rank $r+1$, and the right-most one has rank $r$.
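
The rules above can be tried out on toy trees before applying them to the treebank. The sketch below assumes a tree is a nested Python list with strings as terminals; `terminal_ranks` is a hypothetical helper for illustration, not part of Text-Fabric. It tracks how the current node was reached (`-1` via a left-most child, `1` via a right-most child, `0` otherwise), which plays the role of "the path from $p$ to $n$ started with ...".

```python
def terminal_ranks(tree, rank=0, branch=0, out=None):
    """Return (terminal, rank) pairs of a toy tree in left-to-right order.

    tree   : a string (terminal) or a list of subtrees (node)
    branch : -1 if we got here via a left-most child, 1 via a right-most
             child, 0 otherwise (also used for the root)
    """
    if out is None:
        out = []
    if isinstance(tree, str):
        out.append((tree, rank))
        return out
    n = len(tree)
    if n == 1:
        # unary node: the child inherits rank and branch direction
        terminal_ranks(tree[0], rank, branch, out)
        return out
    for i, child in enumerate(tree):
        # only a child that continues the chain in the same direction keeps the rank
        continues_chain = (branch == -1 and i == 0) or (branch == 1 and i == n - 1)
        child_rank = rank if continues_chain else rank + 1
        child_branch = -1 if i == 0 else 1 if i == n - 1 else 0
        terminal_ranks(child, child_rank, child_branch, out)
    return out


def tree_rank(tree):
    """Rank of a tree: the maximum rank over its terminals."""
    return max(r for _, r in terminal_ranks(tree))


# Tail embedding of depth 4: the right-branching chain counts only once.
tail = ["a", ["b", ["c", ["d", "e"]]]]
# Center embedding of depth 3: every level resumes after its embedding.
center = ["a", ["b", ["c", "d", "e"], "f"], "g"]

print(terminal_ranks(tail))  # [('a', 1), ('b', 2), ('c', 2), ('d', 2), ('e', 1)]
print(tree_rank(tail))       # 2
print(tree_rank(center))     # 3
```

In this sketch the rank of a right-branching chain stays at 2 however long the chain becomes, while each additional level of center embedding adds 1.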
   
## Distribution
We compute the rank of every terminal node in the GNT, sort the terminals by rank, and look at how complexity is distributed over the chapters.

Here is an [Excel sheet](complexity.xlsx) of all words with rank >= 10 (2721 words).
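
One possible way to summarize the distribution over chapters is to keep the highest rank seen in each chapter. The sketch below is only an illustration under assumptions: `max_rank_per_chapter` is a hypothetical helper, and the rows in `sample` are made-up values, not results from the corpus. In the notebook, such rows could be built from the terminal ranks together with the section of each terminal.

```python
from collections import defaultdict


def max_rank_per_chapter(rows):
    """Map (book, chapter) to the highest rank among its words.

    rows: iterable of (book, chapter, rank) triples, one per word
    """
    best = defaultdict(int)
    for book, chapter, rank in rows:
        key = (book, chapter)
        best[key] = max(best[key], rank)
    return dict(best)


# Made-up sample rows, for illustration only
sample = [
    ("Luke", 3, 2),
    ("Luke", 3, 1),
    ("Hebrews", 1, 5),
    ("Hebrews", 1, 3),
    ("Hebrews", 2, 4),
]

for (book, chapter), rank in sorted(max_rank_per_chapter(sample).items()):
    print("{} {}: max rank {}".format(book, chapter, rank))
```

Other summaries, such as the mean rank per chapter, would fit the same pattern; the maximum is used here because the definition of the rank of a tree is itself a maximum.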

In [1]:
import collections
import xlsxwriter
from tf.fabric import Fabric

In [2]:
TF = Fabric(modules='greek/sblgnt')

This is Text-Fabric 2.3.5
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
60 features found and 0 ignored


In [3]:
api = TF.load('''
    Unicode UnicodeLemma Mood
    book book@en chapter verse
    otype
    nodeId
    child
''')

api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.02s B otype                from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.00s B book                 from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.00s B chapter              from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.04s B verse                from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.11s B Unicode              from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.08s B UnicodeLemma         from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.01s B Mood                 from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.00s B book@en              from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.21s B nodeId               from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.38s B child                from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.00s Feature overview: 57 for nodes; 2 fo

In [4]:
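def computeRanks():
    # start at every sentence node with rank 0 and no branching direction
    for s in F.otype.s('sentence'):
        computeRankNodes(s, 0, 0)

def computeRankNodes(node, rank, branchType):
    # branchType records how this node was reached:
    # -1 via a left-most child, 1 via a right-most child, 0 otherwise
    children = E.child.f(node)
    if not children:
        # terminal node: store its rank in the global dict ranks
        ranks[node] = rank
        return
    lc = len(children)
    if lc == 1:
        # unary node: the child inherits rank and branch type
        computeRankNodes(children[0], rank, branchType)
    else:
        for (i, c) in enumerate(children):
            # a child that continues a chain in the same direction keeps the rank
            continuesChain = (
                (branchType == -1 and i == 0) or
                (branchType == 1 and i == lc - 1)
            )
            newRank = rank if continuesChain else rank + 1
            newBranchType = -1 if i == 0 else 1 if i == lc - 1 else 0
            computeRankNodes(c, newRank, newBranchType)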

In [5]:
indent(reset=True)
info('computing ranks ...')
ranks = dict()
computeRanks()
info('assigned ranks to {} terminals'.format(len(ranks)))

  0.00s computing ranks ...
  1.12s assigned ranks to 137794 terminals


## Distribution of ranks

In [6]:
# count how many terminals have each rank
rankCounter = collections.Counter(ranks.values())

for (r, ns) in sorted(rankCounter.items()):
    print('Rank {:>3}: {:>6} terminal{}'.format(r, ns, '' if ns == 1 else 's'))

Rank   0:      2 terminals
Rank   1:  17105 terminals
Rank   2:  46511 terminals
Rank   3:  32770 terminals
Rank   4:  29764 terminals
Rank   5:   7783 terminals
Rank   6:   3249 terminals
Rank   7:    495 terminals
Rank   8:    101 terminals
Rank   9:     13 terminals
Rank  10:      1 terminal
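
When choosing a rank threshold for an export, it can help to know how many terminals lie at or above each rank. The sketch below copies the counts printed above into a plain dict and accumulates them from the highest rank downward; `tail_counts` is a hypothetical helper name.

```python
# Counts per rank, copied from the output above
rank_counts = {
    0: 2,
    1: 17105,
    2: 46511,
    3: 32770,
    4: 29764,
    5: 7783,
    6: 3249,
    7: 495,
    8: 101,
    9: 13,
    10: 1,
}


def tail_counts(counts):
    """Map each rank r to the number of terminals with rank >= r."""
    running = 0
    tails = {}
    for r in sorted(counts, reverse=True):
        running += counts[r]
        tails[r] = running
    return tails


for r, n in sorted(tail_counts(rank_counts).items()):
    print('Rank >= {:>2}: {:>6} terminals'.format(r, n))
```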


In [22]:
result = []
for (t, r) in sorted((x for x in ranks.items() if x[1] >= 10), key=lambda y: -y[1]):
    (book, chapter, verse) = T.sectionFromNode(t)
    result.append(dict(
        book=book,
        chapter=chapter,
        verse=verse,
        node=t,
        nodeId=F.nodeId.v(t),
        word=F.Unicode.v(t),
        rank=r,
    ))
    
print('{} rows'.format(len(result)))

2721 rows


In [24]:
workbook = xlsxwriter.Workbook('complexity.xlsx', {'strings_to_urls': False})
worksheet = workbook.add_worksheet('complexity')

greekFormat = workbook.add_format({'font_name': 'Times New Roman', 'font_size': 14})
codeFormat = workbook.add_format({'font_name': 'Courier New', 'font_size': 11})
smallFormat = workbook.add_format({'font_name': 'Arial', 'font_size': 10})
nodeFormat = workbook.add_format({'font_name': 'Arial', 'font_size': 9})
normalFormat = workbook.add_format({'font_name': 'Arial', 'font_size': 11})

# columns: book chapter verse node nodeId word rank
fields = '''
    book
    chapter
    verse
    node
    nodeId
    word
    rank
'''.strip().split()

fieldSpecs = dict(
    book=(10, smallFormat),
    chapter=(3, smallFormat),
    verse=(3, smallFormat),
    node=(7, nodeFormat),
    nodeId=(20, codeFormat),
    word=(30, greekFormat),
    rank=(4, normalFormat),
)

fieldOrder = list(enumerate(fields))

for (f, field) in fieldOrder:
    (width, fmt) = fieldSpecs[field]
    worksheet.set_column(f, f, width, fmt)

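# set the height of the header row and of one row per result
for r in range(len(result) + 1):
    worksheet.set_row(r, 24)

# header row
for (f, field) in fieldOrder:
    worksheet.write(0, f, field)
# data rows, starting below the header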
for (r, row) in enumerate(result):
    for (f, field) in fieldOrder:
        worksheet.write(r+1, f, row[field])
workbook.close()