For the main tutorial go to [start](../start.ipynb)

---

# Part of speech

We collect and execute ideas to tag all word occurrences with a part of speech.

## Usage

**At the bottom of this notebook there are instructions on how to use its results in the TF browser.**

In [1]:
import collections
from functools import reduce

from tf.app import use
from tf.lib import writeSets

In [2]:
A = use('oldbabylonian:local', checkout='local', hoist=globals())

Using TF-app in /Users/dirk/text-fabric-data/annotation/app-oldbabylonian/code:
	rv0.2=#4bb2530bfb94dc93601f8b3df7722cb0e5df7a43 offline under ~/text-fabric-data (local release)
Using data in /Users/dirk/text-fabric-data/Nino-cunei/oldbabylonian/tf/1.0.4:
	rv1.4 offline under ~/text-fabric-data (local release)


# Nouns

## Step 1: Determiners

We take all words that have a determiner or a phonetic complement.
Both are signs marked in ATF by being inside `{ }`, and in TF by having `det=1`.

We make a dictionary of words and their occurrences.
When we compute the form of a word, we pick the basic info of each sign, not the full ATF with flags and brackets.

Then we pick all words with a determiner, and look for occurrences of those same words without a determiner.

Looking at the
[feature documentation, section Text-Formats](https://github.com/Nino-cunei/oldbabylonian/blob/master/docs//transcription.md#text-formats), 
we choose `text-orig-rich` for our representation.

In order to make a word representation, we need a function that leaves out all unusable bits of a word.

In some cases we also want to strip the determiners from a word.

In [3]:
def usable(s, stripDet=True):
  return (
    (not stripDet or not F.det.v(s)) and
    not F.type.v(s) == 'unknown' and
    (F.reading.v(s) or F.grapheme.v(s))
  )

def wordFromSigns(signs):
  return '-'.join((F.reading.v(s) or F.grapheme.v(s)) for s in signs)
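
To see what these two helpers do without loading the corpus, here is a minimal sketch on toy data. The dictionaries below are hypothetical stand-ins for the TF features `reading`, `grapheme`, `det` and `type`; the names `usable_toy` and `word_from_signs_toy` are made up for this illustration and the sign values are invented.

```python
# Hypothetical stand-ins for TF sign nodes: one dict per sign.
signs = [
    {"reading": None, "grapheme": "ARAD", "det": None, "type": "grapheme"},
    {"reading": "d", "grapheme": None, "det": 1, "type": "reading"},
    {"reading": "utu", "grapheme": None, "det": None, "type": "reading"},
    {"reading": None, "grapheme": None, "det": None, "type": "unknown"},
]


def usable_toy(sign, strip_det=True):
    # Same three conditions as usable(): optionally drop determinatives,
    # drop unknown signs, and require a reading or a grapheme.
    return bool(
        (not strip_det or not sign["det"])
        and sign["type"] != "unknown"
        and (sign["reading"] or sign["grapheme"])
    )


def word_from_signs_toy(signs):
    # Join the reading (or, failing that, the grapheme) of each sign with "-".
    return "-".join(s["reading"] or s["grapheme"] for s in signs)


with_det = word_from_signs_toy([s for s in signs if usable_toy(s, strip_det=False)])
stripped = word_from_signs_toy([s for s in signs if usable_toy(s, strip_det=True)])

print(with_det)  # ARAD-d-utu
print(stripped)  # ARAD-utu
```

The unknown sign is dropped in both forms; the determinative `d` is dropped only when `strip_det=True`.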

Now we calculate all words, and while we're at it, we collect the words that do not have a determinative or phonetic complement into a set.

We also collect all words with a det, but stripped of all dets.

In [4]:
wordsOccs = collections.defaultdict(set)
wordsWithoutDet = set()
wordsWithDet = collections.defaultdict(set)
wordsStrippedDet = collections.defaultdict(set)

for w in F.otype.s('word'):
  signs = L.d(w, otype='sign')
  signsUsable = [s for s in signs if usable(s, stripDet=False)]
  if len(signsUsable) == 0:
    continue
  word = wordFromSigns(signsUsable)
  wordsOccs[word].add(w)
  
  if all(not F.det.v(s) for s in signsUsable):
    wordsWithoutDet.add(word)
    continue
    
  signsNonDet = [s for s in signs if usable(s, stripDet=True)]
  if len(signsNonDet) == 0:
    continue
  wordStripped = wordFromSigns(signsNonDet)
  wordsWithDet[word].add(w)
  wordsStrippedDet[wordStripped].add(word)
print(f'Words (all)          : {len(wordsOccs):>5}')
print(f'Words (nondet)       : {len(wordsWithoutDet):>5}')
print(f'Words (det)          : {len(wordsWithDet):>5}')
print(f'Words (det, stripped): {len(wordsStrippedDet):>5}')

Words (all)          : 15642
Words (nondet)       : 13587
Words (det)          :  2056
Words (det, stripped):  1852


In [5]:
wordSorted = sorted(
  wordsOccs.items(),
  key=lambda x: (-len(x[1]), x[0]),
)

for (word, occs) in wordSorted[0:20]:
  wordRep = f'"{word}"'
  print(f'{wordRep:<30} : {len(occs):>5}')

"a-na"                         :  4072
"u3"                           :  2517
"sza"                          :  2364
"um-ma"                        :  1645
"i-na"                         :  1441
"..."                          :  1388
"qi2-bi2-ma"                   :  1140
"la"                           :  1076
"disz"                         :  1053
"u2-ul"                        :   797
"d-utu"                        :   779
"d-marduk"                     :   630
"ku3-babbar"                   :   586
"ki-ma"                        :   552
"asz-szum"                     :   540
"hi-a"                         :   407
"lu"                           :   392
"ki-a-am"                      :   363
"szum-ma"                      :   341
"li-ba-al-li-t,u2-ka"          :   322


Now we can take the words with a determiner and check whether they also occur without a determiner.

First we find the stripped forms that also occur as words without a determiner; after that we count their occurrences.

In [6]:
withOrWithout = wordsWithoutDet & set(wordsStrippedDet)
nWithOrWithout = len(withOrWithout)
print(f'words with or without det: {nWithOrWithout}')

words with or without det: 321
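
The bookkeeping behind this intersection can be sketched on a tiny, invented corpus. Everything below is hypothetical toy data; only the set logic mirrors the cells above: forms without a determinative go into one set, stripped forms become dictionary keys, and the intersection gives the stripped forms that are also attested on their own.

```python
import collections

# Hypothetical toy occurrences: (full form, form with dets stripped, has a det?)
occurrences = [
    ("d-utu", "utu", True),
    ("utu", "utu", False),
    ("a-na", "a-na", False),
    ("d-marduk", "marduk", True),
]

words_without_det = set()
words_stripped_det = collections.defaultdict(set)

for full, stripped, has_det in occurrences:
    if has_det:
        # Remember which full forms a stripped form comes from.
        words_stripped_det[stripped].add(full)
    else:
        words_without_det.add(full)

# Stripped forms that also occur as independent words without a det.
with_or_without = words_without_det & set(words_stripped_det)
print(sorted(with_or_without))  # ['utu']
```

Here `marduk` never occurs without its determinative, so it does not end up in the intersection, while `utu` does.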


In [7]:
sorted(withOrWithout)[0:20]

['...',
 'ARAD-esz-sze-szi',
 'ARAD-esz3-esz3',
 'ARAD-i3-li2-szu',
 'ARAD-ku-bi',
 'ARAD-si-gar',
 'ARAD2-NI-tim',
 'ARAD2-i3-li2-szu',
 'ARAD2-ku-bi',
 'BA',
 'LI',
 'LU',
 'NIG2',
 'SU',
 'UD',
 'a-a',
 'a-am-ma',
 'a-bu-um-wa-qar',
 'a-bu-wa-qar',
 'a-da-ia-tum']

In [8]:
nDetFull = 0
for (word, wordDets) in wordsStrippedDet.items():
  for wordDet in wordDets:
    nDetFull += len(wordsOccs[wordDet])
    
print(f'detFull occs of nouns: {nDetFull}')

nDetLess = sum(len(occs) for (word, occs) in wordsOccs.items() if word in wordsStrippedDet)
print(f'detLess occs of nouns: {nDetLess}')

detFull occs of nouns: 6114
detLess occs of nouns: 7478


## Result of step 1: determinatives

In [9]:
nouns = {}
nouns[''] = set()

def nOccs(data):
  n = 0
  for word in data:
    n += len(wordsOccs[word])
  return n

def gather(label, data):
  prefix = f'Before step {label}'
  print(f'{prefix:<25}: {len(nouns[""]):>5} words in {nOccs(nouns[""]):>6} occurrences')
  prefix = f'Due to step {label}'
  print(f'{prefix:<25}: {len(data):>5} words in {nOccs(data):>6} occurrences')
  nouns[label] = set(data)
  nouns[''] |= set(data)
  prefix = f'After  step {label}'
  print(f'{prefix:<25}: {len(nouns[""]):>5} words in {nOccs(nouns[""]):>6} occurrences')

In [10]:
data = (
    (wordsWithoutDet & set(wordsStrippedDet)) |
    reduce(
      set.union,
      wordsStrippedDet.values(),
      set(),
    )
)

len(data)

2377

In [11]:
print('\n'.join(sorted(data)[0:20]))

...
...-d-en-lil2
...-d-la-ga-ma-al
...-ki
ARAD-d-e2-ul-masz
ARAD-d-la-ah-mi-ma
ARAD-d-mar-tu
ARAD-d-marduk
ARAD-d-nanna
ARAD-d-suen
ARAD-d-tasz-me-tum
ARAD-d-ul-masz-szi-tum
ARAD-d-utu
ARAD-d-utu-ma
ARAD-esz-sze-szi
ARAD-esz3-esz3
ARAD-i3-li2-szu
ARAD-ku-bi
ARAD-si-gar
ARAD2-NI-tim


In [12]:
gather('Det', data)

Before step Det          :     0 words in      0 occurrences
Due to step Det          :  2377 words in  13592 occurrences
After  step Det          :  2377 words in  13592 occurrences


## Step 2: Prepositions

In [13]:
preps = set(
'''
  i-na
  a-na
  e-li
  isz-tu
  it-ti
  ar-ki
'''.strip().split()
)

There are cases with several prepositions in a row.

To exclude them, we first build a set of preposition words, so that we can then require that a preposition is immediately followed by a non-preposition.

In [14]:
query = '''
word
/with/
  =: sign reading=i
  <: sign reading=na
/or/
  =: sign reading=a
  <: sign reading=na
/or/
  =: sign reading=e
  <: sign reading=li
/or/
  =: sign reading=isz
  <: sign reading=tu
/or/
  =: sign reading=it
  <: sign reading=ti
/or/
  =: sign reading=ar
  <: sign reading=ki
/-/
/without/
  sign
  <: sign
  <: sign
/-/
'''

In [15]:
results = A.search(query, shallow=True)
len(results)

  1.55s 5943 results


5943

In [16]:
sets = dict(
  prep=results,
  nonprep=set(F.otype.s('word')) - results
)

In [17]:
query = '''
prep
<: nonprep
'''

In [18]:
results = A.search(query, sets=sets)

  0.08s 5927 results


In [19]:
results[0:10]

[(258163, 258164),
 (258171, 258172),
 (258199, 258200),
 (258205, 258206),
 (258261, 258262),
 (258274, 258275),
 (258276, 258277),
 (258314, 258315),
 (258336, 258337),
 (258362, 258363)]

In [20]:
data = {wordFromSigns(L.d(x[1], otype='sign')) for x in results}
len(data)

2225
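
The idea of this step, stripped of TF search, is roughly: find every preposition that is immediately followed by a non-preposition, and keep the second word as a noun candidate. A minimal sketch on an invented word list follows; the list `words` is hypothetical, and the sketch ignores the line and document boundaries that matter in the real corpus.

```python
preps = {"i-na", "a-na", "e-li", "isz-tu", "it-ti", "ar-ki"}

# Hypothetical running text, already split into word forms.
words = ["a-na", "d-utu", "i-na", "a-na", "e2", "isz-tu"]

# Index pairs (i, i + 1) where a preposition is followed by a non-preposition.
pairs = [
    (i, i + 1)
    for i in range(len(words) - 1)
    if words[i] in preps and words[i + 1] not in preps
]

# The second member of each pair is the noun candidate.
candidates = {words[j] for (_, j) in pairs}

print(pairs)               # [(0, 1), (3, 4)]
print(sorted(candidates))  # ['d-utu', 'e2']
```

The pair `i-na a-na` is skipped because the second word is itself a preposition, and the final `isz-tu` is skipped because nothing follows it.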

In [21]:
sorted(data)[0:20]

['...',
 '...-ia',
 '...-x',
 '...-x-um-x',
 'AB',
 'ARAD-d-e2-ul-masz',
 'ARAD-d-marduk',
 'ARAD-d-suen',
 'ARAD-dingir-imin',
 'ARAD-esz3-esz3',
 'ARAD-i3-li2-szu',
 'ARAD-ku-bi',
 'ARAD-si-gar',
 'ARAD2-d-suen',
 'ARAD2-i3-li2-szu',
 'ARAD2-ku-bi',
 'BI',
 'BI-ia-a',
 'BI-ti',
 'BI-x']

In [22]:
gather('Prep', data)

Before step Prep         :  2377 words in  13592 occurrences
Due to step Prep         :  2225 words in  25022 occurrences
After  step Prep         :  4021 words in  28571 occurrences
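
The "After" figures are smaller than the sum of the "Before" and "Due to" figures, because `gather` takes a set union and some words were already found in step 1. The size of that overlap follows from inclusion-exclusion. The check below uses only the numbers printed above, and assumes that every occurrence is counted under exactly one word form, so that occurrence counts add up over disjoint sets of words.

```python
# Figures printed by gather('Prep', data) above.
before_words, step_words, after_words = 2377, 2225, 4021
before_occs, step_occs, after_occs = 13592, 25022, 28571

# |A ∩ B| = |A| + |B| - |A ∪ B|
overlap_words = before_words + step_words - after_words
overlap_occs = before_occs + step_occs - after_occs

print(overlap_words)  # 581
print(overlap_occs)   # 10043
```

So 581 of the words found after a preposition had already been marked as nouns by the determinative step.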


## Step 3: Sumerian logograms

In [23]:
query = '''
word
/with/
  sign langalt
/-/
'''

In [24]:
results = A.search(query)

  0.21s 11672 results


In [25]:
results[0:10]

[(258164,),
 (258167,),
 (258168,),
 (258170,),
 (258174,),
 (258175,),
 (258176,),
 (258177,),
 (258186,),
 (258189,)]

In [26]:
data = {wordFromSigns(L.d(x[0], otype='sign')) for x in results}
len(data)

1619

In [27]:
gather('Logo', data)

Before step Logo         :  4021 words in  28571 occurrences
Due to step Logo         :  1619 words in  19651 occurrences
After  step Logo         :  4784 words in  34864 occurrences


## Export sets:

In [36]:
for (name, data) in nouns.items():
  nodeSet = set()
  for word in data:
    nodeSet |= wordsOccs[word]
  sets[f'noun{name}'] = nodeSet

writeSets(sets, 'data/nounSets.tf')

True
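
What the export loop does is turn sets of word forms into sets of occurrence nodes, one set per step plus one for the union. A minimal sketch with invented node numbers follows; `words_occs`, `nouns` and the node numbers are hypothetical toy data, and the call to `writeSets` is left out so that the sketch runs without Text-Fabric.

```python
# Hypothetical toy data: word form -> set of word nodes where it occurs.
words_occs = {
    "d-utu": {101, 205},
    "e2": {102},
    "a-na": {100, 204},
}

# Word forms found per step; the empty key holds the union of all steps.
nouns = {
    "": {"d-utu", "e2"},
    "Det": {"d-utu"},
    "Prep": {"d-utu", "e2"},
}

sets = {}
for name, forms in nouns.items():
    node_set = set()
    for form in forms:
        # Collect every occurrence node of this word form.
        node_set |= words_occs.get(form, set())
    sets[f"noun{name}"] = node_set

for name in sorted(sets):
    print(name, sorted(sets[name]))
```

The resulting names (`noun`, `nounDet`, `nounPrep`) are the ones that can be used as node types in search templates once the sets are loaded.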

In [37]:
for (name, data) in sorted(sets.items()):
  print(f'set {name:<9} with {len(data):>5} elements')

set nonprep   with 70562 elements
set noun      with 34864 elements
set nounDet   with 13592 elements
set nounLogo  with 19651 elements
set nounPrep  with 25022 elements
set prep      with  5943 elements


In [38]:
sorted(sets['nounDet'] & sets['nounPrep'])[0]

258164

In [39]:
A.plain(258164)

Usage:

First get the tutorials repo:

For the first time:

```sh
cd ~/github/annotation
git clone https://github.com/annotation/tutorials
```

When you want to update the repo:

```sh
cd ~/github/annotation/tutorials
git pull origin master
```

Then start the TF browser as follows:

```sh
text-fabric oldbabylonian --sets=~/github/annotation/tutorials/oldbabylonian/cookbook/data/nounSets.tf
```