<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/ninologo.png" width="128"/>
<img align="right" src="images/dans.png" width="128"/>

---

To get started: consult [start](start.ipynb)

---

# Part of Speech tagging

## Team

* Alba de Ridder: Assyriology, master student @ NINO, Leiden
* Martijn Kokken: Assyriology, master student @ NINO, Leiden
* Dirk Roorda: Computer Science, researcher @ DANS, Den Haag
* Cale Johnson: Assyriology, researcher & lecturer @ Univ Birmingham
* Caroline Waerzeggers: Assyriology, head @ NINO, Leiden

In [1]:
COLOPHON = dict(
  acronym='ABB-pos',
  corpus='Old Babylonian Letter Corpus (ABB)',
  dataset='oldbabylonian',
  compiler='Dirk Roorda',
  editors='Alba de Ridder, Martijn Kokken',
  initiators='Cale Johnson, Caroline Waerzeggers',
  institute='NINO, DANS',
)

## Status

* 2019-06-05 Dirk has reorganised the messy code after the sprint into a repeatable and documented workflow.
  The workflow covers special cases, prepositions, and nouns, not yet the extra insights of the sprint.
* 2019-06-03/04 Martijn, Alba and Dirk do a two-day sprint to follow up on heuristics supplied by Cale Johnson.
  Martijn and Alba provide extra insights.

# Introduction

We collect and execute ideas to tag all word occurrences with a part-of-speech, such as `noun`, `prep`, `verb`.

In the end, we intend to provide extra features to the Old Babylonian corpus, as a standard module that will always be loaded
alongside the corpus.

This notebook will produce two word-level features:

* `pos`: main category of the word: `noun`, `verb`, `prep`, `pcl` (particle)
* `subpos`: secondary category of the word: `rel` (relative), `neg` (negation)

In the meantime, this is work in progress: as we go, we collect candidate assignments in sets, which we save to disk.

These sets correspond to `noun`, `prep`, `nonprep` words as far as we have tagged them in the current state of the workflow.

The sets are all saved in a file `sets.tfx`, both next to this notebook (so that you can get it through GitHub) and in a shared
Dropbox folder `obb`, so that the Akkadian specialists (Alba de Ridder, Martijn Kokken, Cale Johnson) have instant access to them and
can test them in their TF-browser.

See **Usage** at the end of this notebook for how you can make use of these results.

# Method

## Overview

We perform the following steps in this order:

### Known words
We identify a bunch of words in closed categories that tend to interfere with noun/verb detection.
After identification, we exclude them from all subsequent pattern detection.

### Prepositions
We detect a few prepositions, especially those that (nearly) always precede a noun.

### Nouns
We use several markers to detect nouns:

* determinatives
* prepositions
* Sumerian logograms
* numerals

We collect the marked occurrences and then look up the unmarked occurrences of the same words.
In this way we extend the detection of nouns considerably.

We have to deal with one big complication, though: **unknowns**.
If we have marked word occurrences with unknown signs in it, we cannot be confident that unmarked occurrences
of the same thing are really occurrences of the same underlying word.

So, if we transfer categorizations from marked occurrences to unmarked occurrences, we only do so if
the word in question does not have unknowns.
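
The rule can be illustrated with a minimal sketch. It assumes that word occurrences are kept in a dictionary keyed by word form, and that unknown signs are written `x`, `n` or `...`. The names `hasUnknown` and `transfer` and the toy node numbers are hypothetical; they are not taken from the `PosTag` implementation.

```python
# Assumed notation for unknown signs; see the det pass further down.
UNKNOWN_SIGNS = {'x', 'n', '...'}


def hasUnknown(form):
  '''True if any sign of a hyphen-separated word form is unknown.'''
  return any(sign in UNKNOWN_SIGNS for sign in form.split('-'))


def transfer(occurrences, markedNodes):
  '''Collect marked occurrences and the unmarked ones that inherit from them.

  occurrences: dict mapping a word form to the list of its word nodes
  markedNodes: set of word nodes that carry a marker
  '''
  marked = set()
  unmarked = set()
  for (form, nodes) in occurrences.items():
    hits = [n for n in nodes if n in markedNodes]
    if not hits:
      # no marked occurrence, so nothing to transfer
      continue
    marked.update(hits)
    if hasUnknown(form):
      # forms with unknown signs do not pass their category on
      continue
    unmarked.update(n for n in nodes if n not in markedNodes)
  return (marked, unmarked)


# Toy data: node numbers are made up for the illustration.
occurrences = {'a-wi-lum': [1, 2, 3], 'x-lum': [4, 5], 'szum-ma': [6]}
(marked, unmarked) = transfer(occurrences, markedNodes={1, 4})
print(sorted(marked))    # [1, 4]
print(sorted(unmarked))  # [2, 3]
```

Node 5 is left alone because its form contains `x`, and node 6 because its form has no marked occurrence at all.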

# Start the engines

We load the Python modules we need.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import os
import collections

from tf.app import use

from utils import PosTag

We load the corpus and obtain a handle to it: `A`.

In [4]:
A = use('oldbabylonian:local', checkout='local', hoist=globals(), silent='deep')

We set up the detection machinery.

In [5]:
PT = PosTag(A)

We collect all the words and their occurrences and sift through determinatives and numerals.

We make a dictionary of words and their occurrences.
When we compute the word form, we pick the basic info of a sign, not the full ATF-representation with flags and brackets.

We also store the form without the determinatives that are present in the word.

# Run the workflow

## Step 0: Inventory

In [6]:
PT.prepare()

Words (all)          : 15958
Words (nondet)       : 13872
Words (det)          :  2088
Words (det, stripped):  1880
Words (numeral)      :    47


## Step 1: Known words

The case specification is a string, to be read as follows:

each line specifies a bunch of words, separated by `+` on the left hand side of the `=`;
the right hand side specifies the categories those words receive, separated by `,`.

The first category is the `pos` (main part-of-speech),
the second category is the `subpos` (sub category within the main part-of-speech).

We use abbreviated forms, because users of this dataset will have to type them quite often.
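
To make the format concrete, here is a minimal sketch of how such a specification could be parsed. The function `parseCases` is hypothetical and only follows the description above; the actual parsing happens inside `PT.doKnownCases` and may differ.

```python
def parseCases(spec):
  '''Turn a case specification into a dict: word -> (pos, subpos).'''
  assignments = {}
  for line in spec.strip().split('\n'):
    line = line.strip()
    if not line:
      continue
    # left of "=": words separated by "+"; right: categories separated by ","
    (wordPart, catPart) = line.split('=', 1)
    cats = [cat.strip() for cat in catPart.split(',')]
    pos = cats[0]
    subpos = cats[1] if len(cats) > 1 else None
    for word in wordPart.split('+'):
      assignments[word.strip()] = (pos, subpos)
  return assignments


example = '''
  la + u2-ul + u2-la = pcl, neg
  sza = pcl, rel
  lu = pcl
'''
parsed = parseCases(example)
print(parsed['u2-ul'])  # ('pcl', 'neg')
print(parsed['lu'])     # ('pcl', None)
```

A line without a second category, such as the one for `lu`, yields a `pos` without a `subpos`.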

### Categories

category | subcategory | meaning
--- | --- | ---
`pcl` | &nbsp; | particle (unspecified)
`pcl` | `neg` | negative particle
`pcl` | `rel` | relative particle
`pcl` | `conj` | conjunction
`prn` | `dem` | demonstrative pronoun
`adv` | `tmp` | temporal adverb

In [7]:
cases = '''
  la + u2-ul + u2-la = pcl, neg
  sza = pcl, rel
  u3 + u2-lu + u2 = pcl, conj
  lu = pcl
  an-nu-um + an-ni-im + an-nu-u2 = prn, dem
  i-na-an-na + a-nu-um-ma = adv, tmp
'''

In [8]:
PT.doKnownCases(cases)

    distinct words:     13
   pos assignments:   7681
subpos assignments:   7293


## Step 2: Prepositions

The following prepositions are known to precede nouns.

In [9]:
preps = '''
  i-na
  a-na
  e-li
  isz-tu
  it-ti
  ar-ki
'''

In [10]:
PT.doPreps(preps)

 distinct words:      6
pos assignments:   5943
  non-prep occs:  70562


We have made a set of all non-prepositions, i.e. all word occurrences not of one of these prepositions.

## Step 3: Nouns

### pass 1: Determinatives

We take all words that have a determinative or a phonetic complement.
Both are signs marked in ATF by being inside `{ }`, and in TF by having `det=1`.
From now on, we will abbreviate it: a **det** is a determinative or a phonetic complement.

We collect the *markedData* for this step: all words that have a *det* inside.

The *unmarkedData* for this step are the occurrences of the stripped forms of the marked words, i.e.
the forms with the *det*s removed.
But only if those forms do not have `x`, `n`, `...` in them.
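
A minimal sketch of the stripping step, assuming each word is available as a list of `(reading, det)` pairs, with `det` equal to `1` for a *det*. The helper names and the toy node numbers are hypothetical, and the check for unknown signs is left out for brevity.

```python
def hasDet(signs):
  '''True if at least one sign of the word is a det.'''
  return any(det == 1 for (reading, det) in signs)


def stripDets(signs):
  '''Word form with the det signs removed, signs joined by hyphens.'''
  return '-'.join(reading for (reading, det) in signs if det != 1)


# Toy data: word node -> signs; node numbers are made up.
words = {
  101: [('d', 1), ('utu', 0)],  # marked: contains a det
  102: [('utu', 0)],            # unmarked occurrence of the stripped form
  103: [('a', 0), ('na', 0)],   # unrelated word
}

markedNodes = [w for (w, signs) in words.items() if hasDet(signs)]
strippedForms = {stripDets(words[w]) for w in markedNodes}
unmarkedNodes = [
  w for (w, signs) in words.items()
  if not hasDet(signs) and stripDets(signs) in strippedForms
]
print(markedNodes)    # [101]
print(strippedForms)  # {'utu'}
print(unmarkedNodes)  # [102]
```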

### pass 2: Prepositions

Words after the given set of prepositions are usually nouns.
However, sometimes there are multiple prepositions in a row.
We take care that we do not mark those second prepositions as nouns.
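
The idea can be sketched on a plain list of word forms, using the prepositions of Step 2. The function `nounsAfterPreps` is hypothetical. It ignores line and document boundaries, and it assumes that the word after a run of prepositions is still marked, which the actual workflow may handle differently.

```python
PREPS = {'i-na', 'a-na', 'e-li', 'isz-tu', 'it-ti', 'ar-ki'}


def nounsAfterPreps(words):
  '''Positions of words that directly follow a preposition
  and are not prepositions themselves.

  words: list of word forms in text order
  '''
  marked = []
  for i in range(1, len(words)):
    if words[i - 1] in PREPS and words[i] not in PREPS:
      marked.append(i)
  return marked


print(nounsAfterPreps(['a-na', 'be-li2-ia', 'qi2-bi2-ma']))  # [1]
print(nounsAfterPreps(['isz-tu', 'i-na', 'e2-gal']))         # [2]
```

In the second example the preposition `i-na` follows `isz-tu` and is skipped; only the word after it is marked.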

### pass 3: Sumerian logograms

Any word that has one or more Sumerian logograms in it will be marked as a noun.

Sumerian logograms are defined as signs within the scope of an enclosing `_ _` pair.

In TF such signs are characterized by having `langalt=1`.

The unmarked data are the occurrences of the same words, but where none of the signs have `langalt=1`.

### pass 4: Numerals

Numerals are individual signs, but they can be part of words.
In those cases, we call the whole word a numeral.

We consider the category of numeral words as a subcategory of the nouns.

Note that there are also unknown numerals: those with reading `n`.

A numeral is always marked; there is no concept of unmarked occurrences of numerals.

In [11]:
PT.doNouns()

Before step det                    :     0 words in      0 occurrences
Due to step det marked             :  2088 words in   6173 occurrences
Due to step det unmarked           :   290 words in   1920 occurrences
Due to step det all                :  2378 words in   8093 occurrences
After  step det                    :  2378 words in   8093 occurrences
----------------------------------------
Before step prep                   :  2378 words in   8093 occurrences
Due to step prep marked            :  2222 words in   5825 occurrences
Due to step prep unmarked          :  2112 words in  14263 occurrences
Due to step prep all               :  2222 words in  20088 occurrences
After  step prep                   :  4010 words in  23245 occurrences
----------------------------------------
Before step logo                   :  4010 words in  23245 occurrences
Due to step logo marked            :  1616 words in  11647 occurrences
Due to step logo unmarked          :  1572 words in   3593 occurre

# Results

In [12]:
metaData = {
  '': COLOPHON,
  'pos': {
    'valueType': 'str',
    'description': 'primary part-of-speech category on full words',
  },
  'subpos': {
    'valueType': 'str',
    'description': 'secondary category within part-of-speech on full words',
  },
}

In [15]:
PT.export(metaData)


---

## Features

**2 TF features saved: pos, subpos**.

9 categories.

category | % | number of nodes
--- | --- | ---
none | 47 | 36091
all | 53 | 40414
noun- | 32 | 24552
prep- | 8 | 5943
pcl-conj | 3 | 2570
pcl-rel | 3 | 2363
noun-numeral | 3 | 2238
pcl-neg | 2 | 1909
adv-tmp | 1 | 399
pcl- | 1 | 388
prn-dem | 0 | 52
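
As a cross-check, the percentages in this table can be recomputed from the node counts. The sketch below copies the counts from the table and assumes that the percentages are rounded to whole numbers, which matches the values shown.

```python
# Node counts per category, copied from the table above.
counts = {
  'noun-': 24552,
  'prep-': 5943,
  'pcl-conj': 2570,
  'pcl-rel': 2363,
  'noun-numeral': 2238,
  'pcl-neg': 1909,
  'adv-tmp': 399,
  'pcl-': 388,
  'prn-dem': 52,
}
untagged = 36091  # the row "none"

tagged = sum(counts.values())  # should equal the row "all"
total = tagged + untagged
percentages = {cat: round(100 * n / total) for (cat, n) in counts.items()}

print(tagged)                          # 40414
print(total)                           # 76505
print(round(100 * untagged / total))   # 47
print(percentages['noun-'])            # 32
```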



---

## sets

**21 sets written to disk (GitHub repo and Dropbox)**.

set | number of nodes
--- | ---
advtmp | 399
nonprep | 70562
noun | 26599
nounMdet | 6173
nounMlogo | 11647
nounMnum | 2238
nounMprep | 5825
nounUdet | 1920
nounUlogo | 3593
nounUnum | 0
nounUprep | 14263
noundet | 8093
nounlogo | 15240
nounnum | 2238
nounprep | 20088
pcl | 388
pclconj | 2570
pclneg | 1909
pclrel | 2363
prep | 5943
prndem | 52


# Usage

For now, you can make use of a bunch of sets in your queries, whether in the TF-browser or in a notebook.

## Getting the sets

Here is how you can get the sets.

### With Dropbox

If you are synchronized to the `obb` shared folder on Dropbox
(that means, you have installed the Dropbox client and accepted the invitation to `obb`):

You are all set, you have the newest version of the sets file on your computer seconds after
it has been updated.

### With GitHub

If you do not have the tutorials repo yet, clone it first:

```sh
cd ~/github/annotation
git clone https://github.com/annotation/tutorials
```

Advice: do not work in your clone directly, but in a working directory outside this clone.
When you want to update the repo:

```sh
cd ~/github/annotation/tutorials
git pull origin master
```

(This will fail if you have worked inside your clone).

## Using the sets and features

You can use the sets and features directly in your programs, or in TF-queries, whether in notebooks or in the TF-browser.

### TF-browser

Then start the TF-browser as follows:

```sh
text-fabric oldbabylonian --sets=~/Dropbox/obb/sets.tfx --mod=annotation/tutorials/oldbabylonian/cookbook/pos/tf
```

or 

```sh
text-fabric oldbabylonian --sets=~/github/annotation/tutorials/oldbabylonian/cookbook/data/sets.tfx --mod=annotation/tutorials/oldbabylonian/cookbook/pos/tf
```

### In queries

You can load the new features as follows:

```python
A = use('oldbabylonian', hoist=globals(), mod='annotation/tutorials/oldbabylonian/cookbook/pos/tf')
```

You can use the names of sets in all places where you currently use `word`, `sign`, `face`, etc.
More info in the [docs](https://annotation.github.io/text-fabric/Use/Search/#search-template-reference).

As an example, we have used a few sets already in this notebook in order to find the words immediately
following a preposition, without being a preposition themselves.

If you are running queries in a notebook, you can import the set by means of
[readSets](https://annotation.github.io/text-fabric/Api/Lib/#sets):

```python
from tf.lib import readSets
sets = readSets('~/Dropbox/obb/sets.tfx')
```

And then in queries:

```python
results = A.search('''
prep
:> nonprep
''', sets=sets)
```