<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Collocation" data-toc-modified-id="Collocation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Collocation</a></span><ul class="toc-item"><li><span><a href="#TF-ad" data-toc-modified-id="TF-ad-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>TF ad</a></span></li><li><span><a href="#Back-to-collocation" data-toc-modified-id="Back-to-collocation-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Back to collocation</a></span></li><li><span><a href="#Collect-sign-pairs" data-toc-modified-id="Collect-sign-pairs-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Collect sign-pairs</a></span></li></ul></li></ul></div>

<img align="right" src="images/tf-small.png"/>

In [3]:
import sys, os
LOC = ('~/github', 'Nino-cunei/uruk', 'start')
sys.path.append(os.path.expanduser(f'{LOC[0]}/{LOC[1]}/programs'))
from cunei import Cunei
A = Cunei(*LOC)
A.api.makeAvailableIn(globals())

Found 2724 tablet linearts
Found 2095 ideographs



**Documentation:**
[Feature docs](https://github.com/Nino-cunei/uruk/blob/master/docs/transcription.md)
[Cunei API](https://github.com/Nino-cunei/uruk/blob/master/docs/cunei.md)
[Text-Fabric API](https://github.com/Dans-labs/text-fabric)



Go to
[nbviewer](http://nbviewer.jupyter.org/github/Nino-cunei/tutorials/blob/master/start.ipynb)
to view this notebook with all its pdf images.


## Collocation

We end this tutorial with a cliffhanger: collocation.

For the study of cuneiform corpora it is useful to know which signs co-occur: on tablets and faces,
in columns and lines, and in cases.

Here we just show how to compute collocation information with respect to tablets.

We refer to a future notebook [collocation](collocation.ipynb) that will be dedicated to the art and craft
of collocation.

### TF ad
Even the task of computing the collocation of signs on tablets shows a typical pattern in working with Text-Fabric: to compute collocation efficiently, we first have to grab a significant swath of the data and reorganize it before we can do business.

**The bad news is**: Text-Fabric does not have the right organization for this particular problem.

**The good news is**: You can put the data in the right order.

**The best news is**: because of the IKEA-like organization of the data in TF, you can easily
put your bits and pieces in a cart, walk outside, and stack them in new ways to your liking.
Indeed, the code that draws the data from TF and puts it into the required form is only
a few lines long.

### Back to collocation

This is what we do:

* we collect all pairs of signs that co-occur on a tablet
* we compute a measure of co-occurrence: 
  * closer together is better
  * more tablets with the same co-occurrence is better
  
We explain the steps as we go.

### Collect sign-pairs

We want signs with primes, variants and modifiers, but without flags.

In a first round, we collect all pairs of signs that have a co-occurrence on a tablet.

Suppose two signs co-occur on a tablet.
Both may have multiple occurrences.

The question is: what is a sensible measure for the degree of co-occurrence of that pair on
that tablet?

In this tutorial we ignore the faces, columns, lines and cases that the signs occur in.
The only thing that counts is the distance between two occurrences, seen as slots.
Every sign has a sequence number, its slot number, which tells you where the sign stands in the whole
corpus. The distance between two slots is just the difference of those slots as numbers.

The distance between two signs on a tablet is the minimum distance you can find between an occurrence
of the one and an occurrence of the other.
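
As a minimal sketch of this definition, assume the occurrences of each sign on a tablet are available as plain lists of slot numbers. The function name `minDistance` and the sample slots are made up for illustration; they are not part of the Cunei or Text-Fabric API.

```python
def minDistance(slotsA, slotsB):
    """Smallest absolute slot difference between any occurrence of A and any of B."""
    return min(abs(a - b) for a in slotsA for b in slotsB)


# hypothetical slot numbers of two signs on one tablet
occurrencesA = [40, 75, 120]
occurrencesB = [60, 130]

print(minDistance(occurrencesA, occurrencesB))  # 10, from slots 120 and 130
```

This brute-force version compares every occurrence of one sign with every occurrence of the other, which is fine for tablets of a few hundred signs.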

In fact, we turn distance into closeness.
If, on a tablet of 200 signs long, two signs have occurrences at slots 40 and 60,
their distance is 20, but their closeness is 200 - 20 = 180.
We make that closeness relative to the length of the tablet (in signs):
180 / 200 = 0.9

The same signs may co-occur on other tablets. We also compute the relative closeness there.
In the end, we add it all up.

So every pair of signs gets a measure that expresses the total relative closeness of its co-occurrences
on all tablets where they co-occur.
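
The arithmetic above can be sketched in a few lines. The helper `relativeCloseness` and the list `coOccurrences` are hypothetical names, and the distances and tablet lengths other than the 20 / 200 example are invented.

```python
def relativeCloseness(distance, length):
    """Turn a distance into a closeness between 0 and 1, relative to the tablet length."""
    return (length - distance) / length


# the example from the text: a tablet of 200 signs, occurrences at slots 40 and 60
print(relativeCloseness(60 - 40, 200))  # 0.9

# hypothetical (distance, tablet length) values for one pair of signs on three tablets
coOccurrences = [(20, 200), (5, 50), (1, 10)]
total = sum(relativeCloseness(d, n) for (d, n) in coOccurrences)
print(f'{total:.2f}')  # 2.70
```

Because each tablet contributes at most 1, the total for a pair can never exceed the number of tablets on which the pair co-occurs.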

We have to visit nodes multiple times and get their ATF representation each time,
so we compute it once and store it.

We exclude the empty graphemes and the `…` and `X` graphemes.

In [75]:
NA = {'', '…', 'X'}

signFromNode = dict()

for tablet in F.otype.s('tablet'):
    for s in L.d(tablet, otype='sign'):
        if F.grapheme.v(s) in NA:
            continue
        signFromNode[s] = A.atfFromSign(s)
print(len(signFromNode))

LIMIT = 20
n = 0
for i in sorted(signFromNode):
    print(f'{i:>2} = {signFromNode[i]}')
    n += 1
    if n > LIMIT:
        break

91362
 6 = 3(N14)
 8 = SANGA~a
22 = 3(N14)
24 = 1(N14)
25 = SUHUR
28 = 1(N01)
29 = DUG~b
30 = 1(N57)
41 = 1(N46)
42 = 2(N19)
43 = 4(N41)
44 = AB~a
45 = APIN~a
46 = NUN~a
51 = SZE~a
52 = DU
53 = NUN~a
58 = n
60 = KA~a
61 = n
62 = 2(N14)


Now we work per tablet.
First we collect the relevant sign slots in a list.

Then we loop through all distinct pairs of slots of that list, and store the difference between the slots
for each pair.
If we encounter the same pair with a smaller difference, we replace the bigger difference with the smaller one.
We end up with a dictionary that gives for each pair of signs the minimal difference.

Then we turn distance into closeness and make it proportional to the length of the tablet, for all pairs.

In [76]:
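import collections

# total relative closeness per unordered pair of signs, summed over all tablets
pairs = collections.Counter()

for tablet in F.otype.s('tablet'):
    slots = L.d(tablet, otype='sign')
    # span of the tablet in slots (last slot minus first slot)
    length = slots[-1] - slots[0]
    # minimal slot difference per pair of signs on this tablet
    thesePairs = {}
    for i in range(len(slots)):
        slotI = slots[i]
        # skip excluded signs (empty, `…`, `X`)
        if slotI not in signFromNode:
            continue
        signI = signFromNode[slotI]
        for j in range(i + 1, len(slots)):
            slotJ = slots[j]
            if slotJ not in signFromNode:
                continue
            signJ = signFromNode[slotJ]
            # a sign does not form a pair with itself
            if signJ == signI:
                continue
            # store each pair in a fixed order, so that (a, b) and (b, a) coincide
            pair = (signI, signJ) if signI < signJ else (signJ, signI)
            difference = slotJ - slotI
            oldDifference = thesePairs.get(pair, None)
            if oldDifference is None or oldDifference > difference:
                thesePairs[pair] = difference
    # turn the minimal distances into relative closeness and add them to the totals
    for ((signI, signJ), difference) in thesePairs.items():
        relativeCloseness = (length - difference) / length
        pairs[(signI, signJ)] += relativeCloseness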

len(pairs)

117472
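
To see what this loop does on a small scale, here is a toy version that runs without the corpus. It is a sketch under assumptions: each tablet is represented by a made-up dictionary from slot number to sign name (a stand-in for `signFromNode` restricted to one tablet), and `tabletPairs` is a hypothetical helper, not a Cunei function.

```python
import collections


def tabletPairs(signAtSlot):
    """Minimal slot difference per unordered pair of distinct signs on one tablet."""
    slots = sorted(signAtSlot)
    result = {}
    for (i, slotI) in enumerate(slots):
        for slotJ in slots[i + 1:]:
            (signI, signJ) = (signAtSlot[slotI], signAtSlot[slotJ])
            if signI == signJ:
                continue
            # fixed order, so that (a, b) and (b, a) are the same pair
            pair = tuple(sorted((signI, signJ)))
            difference = slotJ - slotI
            if pair not in result or result[pair] > difference:
                result[pair] = difference
    return result


# two made-up tablets; missing slot numbers stand for excluded signs
toyTablets = [
    {1: 'A', 2: 'B', 3: 'A', 5: 'C'},
    {10: 'B', 11: 'C', 14: 'A'},
]

toyPairs = collections.Counter()
for signAtSlot in toyTablets:
    length = max(signAtSlot) - min(signAtSlot)
    for (pair, difference) in tabletPairs(signAtSlot).items():
        toyPairs[pair] += (length - difference) / length

for (pair, closeness) in sorted(toyPairs.items()):
    print(pair, f'{closeness:.2f}')
```

On the first toy tablet, `A` and `B` are adjacent (difference 1 on a span of 4, closeness 0.75); on the second they sit at opposite ends (difference 4, closeness 0), so the pair ends up with a total of 0.75.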

In [77]:
for ((signI, signJ), closeness) in sorted(pairs.items(), key=lambda x: (-x[1], x[0]))[0:100]:
    print(f'{signI:<10} <=> {signJ:<10} at closeness {closeness:>7.2f}')

1(N01)     <=> 2(N01)     at closeness  794.77
1(N01)     <=> 1(N14)     at closeness  641.24
1(N01)     <=> EN~a       at closeness  573.55
1(N14)     <=> 2(N01)     at closeness  572.39
1(N01)     <=> 3(N01)     at closeness  507.99
2(N01)     <=> 3(N01)     at closeness  439.41
1(N01)     <=> N          at closeness  434.06
1(N14)     <=> 3(N01)     at closeness  413.32
1(N01)     <=> AN         at closeness  402.56
1(N01)     <=> GAL~a      at closeness  387.34
1(N14)     <=> 5(N01)     at closeness  383.53
1(N01)     <=> 4(N01)     at closeness  373.17
1(N01)     <=> 5(N01)     at closeness  370.17
1(N14)     <=> 2(N14)     at closeness  364.15
2(N01)     <=> EN~a       at closeness  355.88
1(N01)     <=> SZE~a      at closeness  352.99
1(N01)     <=> 2(N14)     at closeness  351.48
1(N01)     <=> U4         at closeness  349.05
2(N01)     <=> 2(N14)     at closeness  343.82
1(N14)     <=> 4(N01)     at closeness  341.40
2(N01)     <=> 4(N01)     at closeness  335.39
2(N01)     <=

We print all collocations to the file
[collocations-tablet.tsv](https://github.com/Dans-labs/Nino-cunei/blob/master/reports/collocations-tablet.tsv)

In [78]:
with open(f'{A.reportDir}/collocations-tablet.tsv', 'w') as fh:
    fh.write('sign1\tsign2\tcloseness\n')
    for ((signI, signJ), closeness) in sorted(pairs.items(), key=lambda x: (-x[1], x[0])):
        fh.write(f'{signI}\t{signJ}\t{closeness:>7.2f}\n')
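
A file in this layout can be read back with the standard `csv` module. The sketch below assumes the three-column, tab-separated layout written by the cell above; to stay self-contained it first writes a small stand-in file to a temporary directory, using two rows copied from the output shown earlier.

```python
import csv
import os
import tempfile

# stand-in for the real report: header plus two rows in the same layout
sample = 'sign1\tsign2\tcloseness\n1(N01)\t2(N01)\t 794.77\n1(N01)\t1(N14)\t 641.24\n'
path = os.path.join(tempfile.mkdtemp(), 'collocations-tablet.tsv')
with open(path, 'w') as fh:
    fh.write(sample)

with open(path, newline='') as fh:
    rows = [
        # float() tolerates the leading spaces that the `>7.2f` format produces
        (row['sign1'], row['sign2'], float(row['closeness']))
        for row in csv.DictReader(fh, delimiter='\t')
    ]

print(rows[0])
```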

Finally, we show an overview of how the closeness of collocated pairs is distributed.

In [79]:
closenessDistribution = collections.Counter()
for ((signI, signJ), closeness) in pairs.items():
    closenessDistribution[int(round(closeness))] += 1

for (closeness, amount) in sorted(closenessDistribution.items()):
    print(f'{amount:>4} pairs with closeness ~ {closeness:>4}')

10493 pairs with closeness ~    0
56654 pairs with closeness ~    1
16882 pairs with closeness ~    2
8277 pairs with closeness ~    3
5045 pairs with closeness ~    4
3409 pairs with closeness ~    5
2449 pairs with closeness ~    6
1880 pairs with closeness ~    7
1517 pairs with closeness ~    8
1224 pairs with closeness ~    9
1022 pairs with closeness ~   10
 845 pairs with closeness ~   11
 741 pairs with closeness ~   12
 625 pairs with closeness ~   13
 503 pairs with closeness ~   14
 480 pairs with closeness ~   15
 410 pairs with closeness ~   16
 357 pairs with closeness ~   17
 325 pairs with closeness ~   18
 289 pairs with closeness ~   19
 268 pairs with closeness ~   20
 207 pairs with closeness ~   21
 197 pairs with closeness ~   22
 210 pairs with closeness ~   23
 195 pairs with closeness ~   24
 156 pairs with closeness ~   25
 140 pairs with closeness ~   26
 137 pairs with closeness ~   27
 129 pairs with closeness ~   28
 109 pairs with closeness ~   29
 113 pa