<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/ninologo.png" width="128"/>
<img align="right" src="images/dans.png" width="128"/>

---

To get started: consult [start](start.ipynb)

---

# Part of Speech tagging

## Team

* Alba de Ridder: Assyriology, master student @ NINO, Leiden
* Martijn Kokken: Assyriology, master student @ NINO, Leiden
* Dirk Roorda: Computer Science, researcher @ DANS, Den Haag
* Cale Johnson: Assyriology, researcher & lecturer @ Univ Birmingham
* Caroline Waerzeggers: Assyriology, head @ NINO, Leiden

In [1]:
COLOPHON = dict(
  acronym='ABB-pos',
  corpus='Old Babylonian Letter Corpus (ABB)',
  dataset='oldbabylonian',
  compiler='Dirk Roorda',
  editors='Alba de Ridder, Martijn Kokken',
  initiators='Cale Johnson, Caroline Waerzeggers',
  institute='NINO, DANS',
)

## Status

* 2019-06-05 Dirk has reorganised the messy code after the sprint into a repeatable and documented workflow.
  The workflow covers special cases, prepositions, and nouns, not yet the extra insights of the sprint.
* 2019-06-03/04 Martijn, Alba and Dirk do a two-day sprint to follow up on heuristics supplied by Cale Johnson.
  Martijn and Alba provide extra insights.

# Introduction

We collect and execute ideas to tag all word occurrences with a part-of-speech, such as `noun`, `prep`, `verb`.

In the end, we intend to provide extra features to the Old Babylonian corpus, as a standard module that will always be loaded
alongside the corpus.

This notebook will produce some word-level features:

* `pos`: main category of the word: `noun`, `verb`, `prep`, `pcl` (particle)
* `subpos`: secondary category of the word: `rel` (relation), `neg` (negation)

In the meantime, this is work in progress: as we go, we collect candidate assignments in sets, which we save to disk.

These sets correspond to `noun`, `prep`, `nonprep` words as far as we have tagged them in the current state of the workflow.

The sets are all saved in a file `sets.tfx`, both next to this notebook (so that you can get it through GitHub) and in a shared
Dropbox folder `obb`, so that the Akkadian specialists (Alba de Ridder, Martijn Kokken, Cale Johnson) have instant access to them and
can test them in their TF-browser.

See **Usage** at the end of this notebook for how you can make use of these results.

# Method

## Overview

We perform the following steps in this order:

### Known words
We identify a bunch of words in closed categories that tend to interfere with noun/verb detection.
After identification, we exclude them from all subsequent pattern detection.

### Prepositions
We detect a few prepositions, especially those that (nearly) always precede a noun.

### Nouns
We use several markers to detect nouns:

* determinatives
* prepositions
* Sumerian logograms
* numerals

We collect the marked occurrences and then look up the unmarked occurrences of the same words.
In this way we extend the detection of nouns considerably.

We have to deal with one big complication, though: **unknowns**.
If we have marked word occurrences with unknown signs in them, we cannot be confident that unmarked occurrences
of the same thing are really occurrences of the same underlying word.

So, if we transfer categorizations from marked occurrences to unmarked occurrences, we only do so if
the word in question does not have unknowns.
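
The transfer rule above can be sketched as follows. This is only an illustration under stated assumptions, not the code inside `PosTag`: the function name `infer_unmarked`, the toy word forms, and the node numbers are all hypothetical, and we assume that a word form is a hyphen-separated string of sign readings.

```python
# Assumed spellings of unknown signs in a word form.
UNKNOWN_SIGNS = {"x", "n", "..."}


def infer_unmarked(occurrences, marked_nodes):
    """Collect unmarked occurrences of forms that also occur marked.

    occurrences: dict mapping a word form to the list of its occurrence nodes
    marked_nodes: set of nodes that carry a positive marker
    """
    unmarked = set()
    for (form, nodes) in occurrences.items():
        if any(sign in UNKNOWN_SIGNS for sign in form.split("-")):
            # forms with unknown signs may hide different words: skip them
            continue
        if any(n in marked_nodes for n in nodes):
            unmarked.update(n for n in nodes if n not in marked_nodes)
    return unmarked


# Toy data: forms and node numbers are made up for illustration.
occurrences = {
    "a-wi-lum": [1, 2, 3],
    "x-lum": [4, 5],
    "szum-ma": [6],
}
marked = {1, 4}
inferred = infer_unmarked(occurrences, marked)
print(sorted(inferred))
```

In this toy run only nodes 2 and 3 are inferred: `x-lum` is skipped because of its unknown sign, and `szum-ma` has no marked occurrence at all.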

We save a lot of intermediate sets: for each step we save the nouns that result from that step.

These sets may overlap.

We also save subsets of these sets, namely the occurrences that are positively marked, and
the occurrences that lack marking and have been inferred.
These marked and unmarked subsets of each step are disjoint. 

whole step | marked | unmarked
--- | --- | ---
`noundet` | `nounMdet` | `nounUdet`
`nounprep` | `nounMprep` | `nounUprep`
`nounlogo` | `nounMlogo` | `nounUlogo`
`nounnum` | `nounMnum` | `nounUnum`

**Note on determinatives**

Determinatives and phonetic complements are signs marked in ATF by being inside `{ }`, and in TF by having `det=1`.
From now on, we abbreviate: a **det** is a determinative or a phonetic complement.

# Start the engines

We load the Python modules we need.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import os
import collections

from tf.app import use

from pos import PosTag

We load the corpus and obtain a handle to it: `A`.

In [4]:
A = use('oldbabylonian:local', checkout='local', hoist=globals(), silent='deep')

# Run the workflow

We set up the detection machinery.

In [5]:
PT = PosTag(A)

## Step 0: Inventory

We collect all the words and their occurrences and sift through determinatives and numerals.

We make a dictionary of words and their occurrences.
When we compute the word form, we pick the basic info of a sign, not the full ATF-representation with flags and brackets.

We also store the form without the *dets* that are present in the word.

In [6]:
PT.prepare()


kind of word | distinct forms | number of occurrences
--- | --- | ---
all | 15958 | 76503
with unknown sign | 955 | 9926
unknown numeral | 3 | 32
numeral | 47 | 2238
without dets | 13872 | 70349
with det | 2088 | 6157
with det cut away | 1880 | 2088


## Step: Known words

The case specification is a string.

In [7]:
cases = '''
  la + u2-ul + u2-la = pcl, neg
  sza = pcl, rel
  u3 + u2-lu + u2 = pcl, conj
  lu = pcl
  an-nu-um + an-ni-im + an-nu-u2 = prn, dem
  i-na-an-na + a-nu-um-ma = adv, tmp
'''

To be read as follows:

Each line specifies one or more words on the left-hand side of the `=`, separated by `+`;
the right-hand side specifies the categories those words receive, separated by `,`.

The first category is the `pos` (main part-of-speech),
the second category is the `subpos` (subcategory within the main part-of-speech).

We use abbreviated forms, because users of this dataset will have to type them quite often.
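
To make the format concrete, here is a minimal sketch of how such a specification could be parsed. It is not the code behind `PT.doKnownCases`; the function name `parse_cases` and the choice to represent a missing subcategory as `None` are assumptions made for illustration.

```python
def parse_cases(spec):
    """Map each word to a (pos, subpos) pair; subpos is None when absent."""
    result = {}
    for line in spec.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        (words_part, cats_part) = line.split("=", 1)
        cats = [c.strip() for c in cats_part.split(",")]
        pos = cats[0]
        subpos = cats[1] if len(cats) > 1 else None
        for word in words_part.split("+"):
            result[word.strip()] = (pos, subpos)
    return result


spec = """
  la + u2-ul + u2-la = pcl, neg
  sza = pcl, rel
  u3 + u2-lu + u2 = pcl, conj
  lu = pcl
  an-nu-um + an-ni-im + an-nu-u2 = prn, dem
  i-na-an-na + a-nu-um-ma = adv, tmp
"""
parsed = parse_cases(spec)
print(len(parsed))
print(parsed["u2-ul"])
print(parsed["lu"])
```

Parsed this way, the specification yields 13 distinct words, and `lu` is the only word without a subcategory.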

### Categories

category | subcategory | meaning
--- | --- | ---
`pcl` | &nbsp; | particle (unspecified)
`pcl` | `neg` | negative particle
`pcl` | `rel` | relative particle
`pcl` | `conj` | conjunction
`prn` | `dem` | demonstrative pronoun
`adv` | `tmp` | temporal adverb

In [8]:
PT.doKnownCases(cases)

    distinct words:     13
   pos assignments:   7681
subpos assignments:   7293


## Step: Pronouns - personal

In [9]:
prnPrs = '''
nom:
  1csg:
    - a-na-ku
    - a-na-ku-ma
    - a-na-ku-u2
    - a-na-ku-u2-ma
    - a-na-ku-ma-mi
  2msg:
    - at-ta
    - at-ta-ma
    - at-ta-a
    - at-ta-a-ma
  2fsg:
    - at-ti
    - at-ti-ma
    - at-ti-i-ma
  3msg:
    - szu-u2
    - szu-u2-ma
  3fsg:
    - szi-i
    - szi-i-ma
  1mpl:
    - ni-nu
    - ni-i-ni
  2mpl:
    - at-tu-nu
    - at-tu-nu-ma
    - at-tu-nu-u2
    - at-tu-u2-nu
    - at-tu-u2-nu-ma
  2fpl:
    - at-ti-na-ma
  3mpl:
    - szu-nu
    - szu-nu-ma
    - szu-nu-mi
    - szu-nu-u2
  3fpl:
    - szi-na

acg:
  1csg:
    - ia-ti
    - ia-ti-i-ma
    - ia-a-ti
  2msg:
    - ka-ta
    - ka-ta-a-ma
    - ka-a-ti
    - ka-ti:
        - P510880 reverse:8
        - P306656 obverse:8
    - ka-ti-i:
        - P292855 obverse:4
        - P292983 obverse:4
  2fsg:
    - ka-ti
  3csg:
    - szu-a-ti
    - szu-a-tu
    - sza-a-ti
    - sza-a-tu
    - szi-a-ti
  1cpl:
    - ni-a-ti
  2mpl:
    - ku-nu-ti
  2fpl:
    - /
  3mpl:
    - szu-nu-ti
  3fpl:
    - szi-na-ti

dat:
  1csg:
    - ia-szi
    - ia-szi-im
    - ia-a-szi
    - ia-a-szi-im
  2csg:
    - ka-szi-im
    - ka-szi-im-ma
    - ka-a-szum
  3msg:
    - szu-a-szi-im
  1cpl:
    - /
  2mpl:
    - ku-nu-szi-im
  2fpl:
    - /
  3mpl:
    - szu-nu-szi-im-ma
  3fpl:
    - /
'''

In [17]:
PT.doPrnPrs(prnPrs)

    distinct words:     55
   pos assignments:   9121
subpos assignments:   8733
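
The keys under each case in the specification (`1csg`, `2msg`, `3fpl`, and so on) appear to combine person, gender, and number, in line with the values documented for the `ps`, `gn`, and `nu` features in the metadata further on. How `PT.doPrnPrs` reads them is not shown here; the sketch below is one possible decomposition, and the helper name `split_label` is hypothetical.

```python
import re

# person (1, 2, 3), gender (m, f, c), number (sg, du, pl)
LABEL_RE = re.compile(r"^([123])([mfc])(sg|du|pl)$")


def split_label(label):
    """Split a label such as '2msg' into person, gender and number."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"not a person-gender-number label: {label!r}")
    (ps, gn, nu) = match.groups()
    return dict(ps=ps, gn=gn, nu=nu)


print(split_label("2msg"))
print(split_label("1cpl"))
```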


## Step: Prepositions

The following prepositions are known to precede nouns.

In [18]:
preps = '''
  i-na
  a-na
  e-li
  isz-tu
  it-ti
  ar-ki
'''

In [19]:
PT.doPreps(preps)

 distinct words:      6
pos assignments:   5943
  non-prep occs:  70562


We have made a set of all non-prepositions, i.e. all word occurrences that are not occurrences of one of these prepositions.

## Step: Nouns

### pass: Determinatives

We take all words that have a *det*.

We collect the *markedData* for this step: all words that have a *det* inside.

The *unmarkedData* for this step are the occurrences of the stripped forms of the marked words, i.e.
the forms with the *det*s removed.
We only do this if those forms do not have an unknown in them, i.e. an `x`, `n`, or `...`.
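
As an illustration of the stripping step, here is a small sketch that works on ATF-like strings. The real workflow operates on TF sign nodes with `det=1`, so the regular expression, the helper name `strip_dets`, and the example forms are assumptions.

```python
import re

DET_RE = re.compile(r"\{[^}]*\}")  # a det: anything inside { }
UNKNOWN_SIGNS = {"x", "n", "..."}  # unknown sign readings named in the text


def strip_dets(form):
    """Remove all dets from a word form and tidy up leftover hyphens."""
    stripped = DET_RE.sub("", form)
    return re.sub(r"-{2,}", "-", stripped).strip("-")


# Hypothetical marked forms (words containing a det).
marked_forms = ["{d}utu", "ka2-dingir-ra{ki}", "{d}x-ba-ni"]

# Keep only stripped forms without unknown signs: these are the forms
# we may look up among the words that have no det.
lookup_forms = [
    stripped
    for stripped in map(strip_dets, marked_forms)
    if UNKNOWN_SIGNS.isdisjoint(stripped.split("-"))
]
print(lookup_forms)
```

The third form is dropped because its stripped version still contains the unknown sign `x`.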

### pass: Prepositions

Words after the given set of prepositions are usually nouns.
However, sometimes there are multiple prepositions in a row.
We take care not to mark those second prepositions as nouns.
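
A minimal sketch of this rule, assuming that we look at the words of a single line as a plain list of strings. The function name `nouns_after_preps` and the example line are hypothetical; the real pass works on TF word nodes and also leaves out the known words identified earlier.

```python
PREPS = {"i-na", "a-na", "e-li", "isz-tu", "it-ti", "ar-ki"}


def nouns_after_preps(words):
    """Return the positions of words that directly follow a preposition.

    A word that is itself a preposition is never marked as a noun.
    """
    marked = []
    for i in range(1, len(words)):
        if words[i - 1] in PREPS and words[i] not in PREPS:
            marked.append(i)
    return marked


# Toy line with two prepositions in a row (made up for illustration).
line = ["a-na", "i-na", "bi-tim", "u2-ul", "il-li-ik"]
positions = nouns_after_preps(line)
print([line[i] for i in positions])
```

Here only `bi-tim` is marked: `i-na` follows a preposition but is a preposition itself.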

### pass: Sumerian logograms

Any word that has one or more Sumerian logograms in it will be marked as a noun.

Sumerian logograms are defined as signs within the scope of an enclosing `_ _` pair.

In TF such signs are characterized by having `langalt=1`.

The unmarked data are the occurrences of the same words, but where none of the signs have `langalt=1`.
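
For illustration, this is how logogram marking could be detected on the ATF spelling of a single word. It is only a sketch: the workflow itself relies on the `langalt` feature of signs, and a `_ _` pair that stretches over several words would not be caught by this per-word test. The helper name `has_logogram` is hypothetical; the example words are taken from the query results at the end of this notebook.

```python
import re

LOGO_RE = re.compile(r"_[^_]+_")  # one or more signs enclosed in _ _


def has_logogram(atf_word):
    """True if the ATF spelling of the word contains a _ _ enclosed span."""
    return LOGO_RE.search(atf_word) is not None


words = ["_sza3-gal_", "_in-nu_", "ta-asz-pu-ra-am", "la"]
marked_words = [w for w in words if has_logogram(w)]
print(marked_words)
```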

### pass: Numerals

Numerals are individual signs, but they can be part of words.
In those cases, we call the whole word a numeral.

We consider the category of numeral words as a subcategory of the nouns.

Note that there are also unknown numerals: those with reading `n`.

A numeral is always marked; there is no concept of unmarked occurrences of numerals.

In [20]:
PT.doNouns()

Before step det                    :     0 words in      0 occurrences
Due to step det marked             :  2088 words in   6173 occurrences
Due to step det unmarked           :   290 words in   1920 occurrences
Due to step det all                :  2378 words in   8093 occurrences
After  step det                    :  2378 words in   8093 occurrences
----------------------------------------
Before step prep                   :  2378 words in   8093 occurrences
Due to step prep marked            :  2222 words in   5825 occurrences
Due to step prep unmarked          :  2112 words in  14263 occurrences
Due to step prep all               :  2222 words in  20088 occurrences
After  step prep                   :  4010 words in  23245 occurrences
----------------------------------------
Before step logo                   :  4010 words in  23245 occurrences
Due to step logo marked            :  1616 words in  11647 occurrences
Due to step logo unmarked          :  1572 words in   3593 occurrences

# Results

We specify the metadata that we want to include into our new features.

In [24]:
metaData = {
  '': COLOPHON,
  'pos': {
    'valueType': 'str',
    'description': 'primary part-of-speech category on full words',
  },
  'subpos': {
    'valueType': 'str',
    'description': 'secondary category within part-of-speech on full words',
  },
  'cs': {
    'valueType': 'str',
    'description': 'grammatical case: nom, acc, acg, gen, dat',
  },
  'ps': {
    'valueType': 'str',
    'description': 'grammatical person: 1, 2, 3',
  },
  'gn': {
    'valueType': 'str',
    'description': 'grammatical gender: m, f, c',
  },
  'nu': {
    'valueType': 'str',
    'description': 'grammatical number: sg, du, pl',
  },
}

The next cell saves the features to disk, and the sets as well.

In [25]:
PT.export(metaData)


---

## Features

**6 TF features saved: cs, gn, nu, pos, ps, subpos**.

11 categories.

category | % | number of nodes
--- | --- | ---
none | 46 | 34881
all | 54 | 41624
noun- | 32 | 24322
prep- | 8 | 5943
pcl-conj | 3 | 2570
pcl-rel | 3 | 2363
noun-numeral | 3 | 2238
pcl-neg | 2 | 1909
prn-prs | 2 | 1210
adv-tmp | 1 | 399
pcl- | 1 | 388
noun-prs | 0 | 230
prn-dem | 0 | 52



---

## Sets

**22 sets written to disk (GitHub repo and Dropbox)**.

set | number of nodes
--- | ---
advtmp | 399
nonprep | 70562
noun | 26599
nounMdet | 6173
nounMlogo | 11647
nounMnum | 2238
nounMprep | 5825
nounUdet | 1920
nounUlogo | 3593
nounUnum | 0
nounUprep | 14263
noundet | 8093
nounlogo | 15240
nounnum | 2238
nounprep | 20088
pcl | 388
pclconj | 2570
pclneg | 1909
pclrel | 2363
prep | 5943
prndem | 52
prnprs | 1440
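
The numbers in this table let us check two claims made in the method overview: within each step the marked and unmarked subsets are disjoint and together make up the whole step, while the steps themselves overlap. The sketch below only does arithmetic on the sizes reported above; it does not read `sets.tfx`.

```python
# Set sizes copied from the table above.
sizes = {
    "noun": 26599,
    "noundet": 8093, "nounMdet": 6173, "nounUdet": 1920,
    "nounprep": 20088, "nounMprep": 5825, "nounUprep": 14263,
    "nounlogo": 15240, "nounMlogo": 11647, "nounUlogo": 3593,
    "nounnum": 2238, "nounMnum": 2238, "nounUnum": 0,
}
steps = ("det", "prep", "logo", "num")

# Disjoint marked and unmarked subsets must add up to the whole step.
checks = {}
for step in steps:
    whole = sizes[f"noun{step}"]
    parts = sizes[f"nounM{step}"] + sizes[f"nounU{step}"]
    checks[step] = whole == parts
    print(f"{step:<4}: {parts:>5} = {whole:>5} ? {checks[step]}")

# If the steps did not overlap, their sizes would add up to the noun set.
step_total = sum(sizes[f"noun{step}"] for step in steps)
print(f"sum of steps: {step_total}, all nouns: {sizes['noun']}")
```

The subsets add up for every step, and the step sizes sum to far more than the size of `noun`, so many occurrences are found by more than one step.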


# Usage

For now, you can make use of a bunch of sets in your queries, whether in the TF-browser or in a notebook.

## Getting the sets

Here is how you can get the sets.

### With Dropbox

If you are synchronized to the `obb` shared folder on Dropbox
(that means, you have installed the Dropbox client and accepted the invitation to `obb`):

You are all set: the newest version of the sets file arrives on your computer seconds after
it has been updated.

### With GitHub

First get the tutorials repo:

For the first time:

```sh
cd ~/github/annotation
git clone https://github.com/annotation/tutorials
```

Advice: do not work in your clone directly, but in a working directory outside this clone.
When you want to update the repo:

```sh
cd ~/github/annotation/tutorials
git pull origin master
```

(This will fail if you have worked inside your clone).

## Using the sets and features

You can use the sets and features directly in your programs, or in TF-queries, whether in notebooks or in the TF-browser.

### TF-browser

To start the TF browser:

```sh
text-fabric oldbabylonian --sets=~/Dropbox/obb/sets.tfx --mod=annotation/tutorials/oldbabylonian/cookbook/pos/tf
```

or 

```sh
text-fabric oldbabylonian --sets=~/github/annotation/tutorials/oldbabylonian/cookbook/sets.tfx --mod=annotation/tutorials/oldbabylonian/cookbook/pos/tf
```

### In notebooks

See below how you can work with the new data in a notebook.

## Using sets in queries

You can use the names of sets in all places where you currently use `word`, `sign`, `face`, etc.
More info in the [docs](https://annotation.github.io/text-fabric/Use/Search/#search-template-reference).

# Example

We load the corpus again but now with the new features.

In [39]:
A = use('oldbabylonian', hoist=globals(), mod='annotation/tutorials/oldbabylonian/cookbook/pos/tf')

	connecting to online GitHub repo annotation/app-oldbabylonian ... connected
Using TF-app in /Users/dirk/text-fabric-data/annotation/app-oldbabylonian/code:
	rv0.2=#4bb2530bfb94dc93601f8b3df7722cb0e5df7a43 (latest release)
	connecting to online GitHub repo Nino-cunei/oldbabylonian ... connected
Using data in /Users/dirk/text-fabric-data/Nino-cunei/oldbabylonian/tf/1.0.4:
	rv1.4 (latest release)
	connecting to online GitHub repo annotation/tutorials ... connected
Using data in /Users/dirk/text-fabric-data/annotation/tutorials/oldbabylonian/cookbook/pos/tf/1.0.4:
	#8c7d5be76a10610263b7fb24db0c1d94f548c1cf (latest commit)
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


Note that the features `pos` and `subpos` are loaded now.

Let's print the frequency lists of their values.

In [32]:
for (p, n) in F.pos.freqList():
  print(f'{p:<12}: {n:>5} x')

noun        : 26790 x
pcl         :  7230 x
prep        :  5943 x
adv         :   399 x
prn         :    52 x


In [33]:
 
for (p, n) in F.subpos.freqList():
  print(f'{p:<12}: {n:>5} x')

conj        :  2570 x
rel         :  2363 x
numeral     :  2238 x
neg         :  1909 x
tmp         :   399 x
dem         :    52 x


We still need to load the sets.

In [34]:
from tf.lib import readSets

In [35]:
sets = readSets('~/github/annotation/tutorials/oldbabylonian/cookbook/sets.tfx')
sorted(sets)

['advtmp',
 'nonprep',
 'noun',
 'nounMdet',
 'nounMlogo',
 'nounMnum',
 'nounMprep',
 'nounUdet',
 'nounUlogo',
 'nounUnum',
 'nounUprep',
 'noundet',
 'nounlogo',
 'nounnum',
 'nounprep',
 'pcl',
 'pclconj',
 'pclneg',
 'pclrel',
 'prep',
 'prndem']

We perform a query with the new sets:

In [36]:
query = '''
pclneg
<: noun
'''
results = A.search(query)

 0 
 1 pclneg
 2 <: noun
 3 
line 1: Unknown object type: "pclneg"
line 2: Unknown object type: "noun"
Valid object types are: document, face, line, word, cluster, sign


  0.01s 0 results


Oops! Of course, we have to inform `A.search()` about the sets:

In [40]:
query = '''
pclneg
<: noun
'''
results = A.search(query, sets=sets)

  0.01s 81 results


In [41]:
A.table(results, end=10)

n,p,word,word.1
1,P509376 obverse:6,u2-ul,ta-asz-pu-ra-am
2,P509376 obverse:8,u2-ul,ta-asz-pu-ra-am
3,P509377 reverse:12,la,_sza3-gal_
4,P481192 obverse:12',la,_in-nu_
5,P510526 obverse:12,la,ki
6,P510551 reverse:2,la,"s,u2-ha-ri-ka"
7,P510562 obverse:12,u2-ul,ta-asz-pu-ra-am
8,P510562 reverse:2,u2-ul#,ta-asz-pu-ra#-am
9,P510569 obverse:14,u2-ul,ta-asz-pu-ra-am#
10,P510576 reverse:10,u2-ul#,[ta-asz]-pu#-ra-am
