# Case

In [39]:
import os, sys, collections
from IPython.display import display, Markdown
from tf.extra.cunei import Cunei

In [2]:
import sys, os
LOC = ('~/github', 'Nino-cunei/uruk', 'casesByLevel')
CN = Cunei(*LOC)
CN.api.makeAvailableIn(globals())

Found 2095 ideograph linearts
Found 2724 tablet linearts
Found 5495 tablet photos


**Documentation:** <a target="_blank" href="https://github.com/Nino-cunei/uruk/blob/master/docs/about.md" title="provenance of this corpus">Uruk IV-III (v1.0)</a> <a target="_blank" href="https://github.com/Nino-cunei/uruk/blob/master/docs/transcription.md" title="feature documentation">Feature docs</a> <a target="_blank" href="https://github.com/Dans-labs/text-fabric/wiki/Cunei" title="cunei api documentation">Cunei API</a> <a target="_blank" href="https://github.com/Dans-labs/text-fabric/wiki/api" title="text-fabric-api">Text-Fabric API</a>


This notebook online:
<a target="_blank" href="http://nbviewer.jupyter.org/github/Nino-cunei/tutorials/blob/master/bits-and-pieces/casesByLevel.ipynb">NBViewer</a>
<a target="_blank" href="https://github.com/Nino-cunei/tutorials/blob/master/bits-and-pieces/casesByLevel.ipynb">GitHub</a>


## Level 0

If we do `casesByLevel(0, terminal=False)` we get all lines.

If we do `casesByLevel(0)`, we get precisely the undivided lines.

In [3]:
test0Cases = set(CN.casesByLevel(0, terminal=False))
allLines = set(F.otype.s('line'))
types0 = {F.otype.v(n) for n in test0Cases}
print(f'test0Cases: {len(test0Cases):>5}')
print(f'allLines : {len(allLines):>5}')
print(f'test0Cases equal to allLines: {test0Cases == allLines}')
print(f'types of test0Cases: {types0}')

test0CasesT = set(CN.casesByLevel(0))
print(f'test0CasesT: {len(test0CasesT):>5}')
print(f'Divided lines: {len(test0Cases) - len(test0CasesT):>5}')

test0Cases: 35842
allLines : 35842
test0Cases equal to allLines: True
types of test0Cases: {'line'}
test0CasesT: 32732
Divided lines:  3110


## Level 1

If we do `casesByLevel(1, terminal=False)` we get all cases (not lines) that are the first subdivision of a line.

If we do `casesByLevel(1)`, we get a subset of these cases, namely the ones that are not themselves subdivided.

In [4]:
test1Cases = set(CN.casesByLevel(1, terminal=False))
types1 = {F.otype.v(n) for n in test1Cases}

print(f'test1Cases: {len(test1Cases):>5}')
print(f'types of test1Cases: {types1}')

test1CasesT = set(CN.casesByLevel(1))
print(f'test1CasesT: {len(test1CasesT):>5}')
print(f'Divided cases: {len(test1Cases) - len(test1CasesT):>5}')

test1Cases:  6559
types of test1Cases: {'case'}
test1CasesT:  5468
Divided cases:  1091


## Example tablet
Here we use an example tablet to show the difference between `terminal=False` and
`terminal=True` when calling `CN.casesByLevel`.

We'll use tablet `P471695`.

In [5]:
examplePnum = 'P471695'
exampleTablet = T.nodeFromSection((examplePnum,))
print('\n'.join(CN.getSource(exampleTablet)))

&P471695 = Anonymous 471695
#atf: lang qpc 
@obverse 
@column 1 
1.a. 3(N01) , APIN~a 3(N57) UR4~a 
1.b1. , (EN~a DU ZATU759)a 
1.b2. , (BAN~b KASZ~c)a 
1.b3. , (KI@n SAG)a 
2.a. 1(N14) 2(N01) , [...] 
2.b1. , (3(N57) PAP~a)a 
2.b2. , (SZU KI X)a 
$ n lines broken 
2.b3'. , (EN~a AN EZINU~d)a 
2.b4'. , (IDIGNA [...])a 
$ rest broken 
$ (for a total of 12 sub-cases with PNN) 
@column 2 
1.a. 1(N01) , ISZ~a#? 
1.b1. , (PAP~a GIR3~c)a 
$ blank space 
$ rest broken 
@reverse 
$ beginning broken 
1'. [1(N14)] 6(N01)# , [...] 
1'. [1(N14)] 6(N01)# , [...] 


Above we have selected all cases of level 1 from the whole corpus, in two ways.
Now we take the intersection of these sets with the cases of the example tablet.

In [6]:
exampleCases = (
    set(L.d(exampleTablet, otype='case'))
    |
    set(L.d(exampleTablet, otype='line'))
)
example2 = test1Cases & exampleCases
example2T = test1CasesT & exampleCases

In [7]:
print(f'\n{"-" * 48}\n'.join('\n'.join(CN.getSource(c)) for c in sorted(example2)))

1.a. 3(N01) , APIN~a 3(N57) UR4~a 
------------------------------------------------
1.b1. , (EN~a DU ZATU759)a 
1.b2. , (BAN~b KASZ~c)a 
1.b3. , (KI@n SAG)a 
------------------------------------------------
2.a. 1(N14) 2(N01) , [...] 
------------------------------------------------
2.b1. , (3(N57) PAP~a)a 
2.b2. , (SZU KI X)a 
$ n lines broken 
2.b3'. , (EN~a AN EZINU~d)a 
2.b4'. , (IDIGNA [...])a 
$ rest broken 
$ (for a total of 12 sub-cases with PNN) 
------------------------------------------------
1.a. 1(N01) , ISZ~a#? 
------------------------------------------------
1.b1. , (PAP~a GIR3~c)a 
$ blank space 
$ rest broken 


In [8]:
print(f'\n{"-" * 48}\n'.join('\n'.join(CN.getSource(c)) for c in sorted(example2T)))

1.a. 3(N01) , APIN~a 3(N57) UR4~a 
------------------------------------------------
2.a. 1(N14) 2(N01) , [...] 
------------------------------------------------
1.a. 1(N01) , ISZ~a#? 


What about case `1.b`?
It is a case at level 1.
Why isn't it in `example2T`?

Because it is not a terminal case: it has subcases (`1.b1`, `1.b2`, `1.b3`).
That is why `1.b` is left out.
With `terminal=True`, which is the default, only cases without children are in the result.
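
The effect of `terminal` can be mimicked on a toy tree.
The sketch below is not the Cunei implementation: the `children` mapping, the node names and the
function `cases_by_level` are hypothetical stand-ins that only imitate the shape of a divided line.

```python
# Toy model of lines and cases. All names here are hypothetical;
# they only mimic a line that is divided into cases and subcases.
children = {
    "line1": ["1.a", "1.b"],
    "1.b": ["1.b1", "1.b2", "1.b3"],
    "line2": [],
}
top_level = ["line1", "line2"]


def cases_by_level(level, terminal=True):
    """Return the nodes at nesting depth `level` (lines are depth 0).

    With terminal=True, keep only the nodes that have no children.
    """
    current = list(top_level)
    for _ in range(level):
        # Step one level down: replace every node by its direct children.
        current = [child for node in current for child in children.get(node, [])]
    if terminal:
        current = [node for node in current if not children.get(node)]
    return current


print(cases_by_level(1, terminal=False))  # all level-1 cases, divided or not
print(cases_by_level(1))                  # only the undivided level-1 cases
```

In this toy model `1.b` shows up in the first list but not in the second, because it has children.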

## Level 2

What if we want all signs that occur in a subcase, i.e. a case at level 2?

We can call `casesByLevel(2, terminal=False)`, iterate through the resulting cases, and
collect all signs per case.
However, we will encounter signs multiple times,
because if a sign is in a subcase, it is also in its containing case and in its containing line.
We can solve this by collecting the signs in a set.
Then we lose the corpus order of the signs, but we can easily reorder the set into a list.

There is an alternative method: a search template.
Search delivers unordered results, so we will reorder the search results as well.

Text-Fabric has an API function for sorting nodes into corpus order: `sortNodes`.
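
The collect-in-a-set-then-reorder idea can be tried out without the corpus.
In this minimal sketch the node numbers and the `signs_per_case` mapping are invented, and plain
`sorted` stands in for `sortNodes`, on the assumption that the toy integers are already numbered in
corpus order.

```python
# Hypothetical sign nodes per case; the two cases share some signs.
signs_per_case = {
    "case_a": [12, 10, 11],
    "case_b": [11, 12, 13],
}

sign_set = set()
for signs in signs_per_case.values():
    # The set union silently drops signs we have already seen.
    sign_set |= set(signs)

# A set has no meaningful order, so we sort it back into a list.
ordered = sorted(sign_set)
print(ordered)
```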

Let us try out both methods and compare the outcomes.

### Method casesByLevel

In [11]:
cases = CN.casesByLevel(2, terminal=False)
signSet = set()
for case in cases:
    signSet |= set(L.d(case, otype='sign'))
signsA = sortNodes(signSet)
len(signsA)

7738

### Method search

You might want to read the 
[docs](https://github.com/Dans-labs/text-fabric/wiki/Api#search) first.

#### Explanation

The search template is basically

```
line
   case
      case
         sign
```

This bare template looks for a sign within a case within a case within a line.
Indentation acts as shorthand for embedding.

But this is not enough, because a subsubcase of a case is also embedded in that case.
We look for a situation where the first case is *directly* embedded in the line,
and the second case is *directly* embedded in the first case.

In our data we have an *edge* (relationship), called `sub`, that connects lines/cases with
cases that are directly embedded in them.

So

```
c0 -sub> c1
```

means that `c0` is `sub`-related to `c1`, i.e. `c1` is directly embedded in `c0`.

We just give names to the nodes in the template (`c0`, `c1`, `c2`) and state
the desired relationships between them.

The result of this query will therefore contain the signs that occur in
subcases of cases of lines.
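
To see why plain embedding is not enough, consider a toy chain in which each case contains exactly
one deeper case.
The `sub` dictionary and the node names below are made up for illustration; only the idea of an
edge for direct embedding is taken from the corpus.

```python
# Hypothetical direct-embedding edges: line > c1 > c2 > c3.
sub = {"line": ["c1"], "c1": ["c2"], "c2": ["c3"]}


def descendants(node):
    """All nodes embedded in `node`, directly or indirectly."""
    result = []
    for child in sub.get(node, []):
        result.append(child)
        result.extend(descendants(child))
    return result


# Embedding only: any case inside any case inside the line.
embedded = [(a, b) for a in descendants("line") for b in descendants(a)]

# Direct embedding: follow exactly one `sub` edge per step.
direct = [(a, b) for a in sub.get("line", []) for b in sub.get(a, [])]

print(embedded)
print(direct)
```

The embedding-only version also pairs `c1` with `c3` and `c2` with `c3`, which are not
case/subcase pairs hanging directly under the line; the direct version keeps only `(c1, c2)`.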

In [10]:
query = '''
c0:line
   c1:case
      c2:case
         sign
c0 -sub> c1
c1 -sub> c2
'''
results = S.search(query)
signsB = sortNodes(r[3] for r in results)
len(signsB)

7738

A word about the results.
The query mentions four quantities: `line`, `case`, `case`, `sign`.
Every result of the query is an instantiation of those four quantities, hence a tuple of nodes:

```
(resultLine, resultCase1, resultCase2, resultSign)
```

For our purposes we are only interested in the `resultSign` part, so we select it with
`r[3]` when we walk through all results `r`.

### Check

Both methods yield the same number of results, but are they exactly the same results?

In [12]:
signsA == signsB

True

Yes!

### Twist

Now we want to restrict ourselves to non-numerical signs.
If you look at the feature docs (see the link at the start of the notebook),
and read about the `type` feature for signs, you see that it can have the values
`empty`, `unknown`, `numeral` and `ideograph`.
We just want the ideographs.

We'll adapt both methods to get them and ignore the numerals and the less well-defined graphemes.

Of course, we could just filter the result list that we already have,
but this is a tutorial, and it may come in handy to have a well-stocked repertoire
of direct ways to drill into your data.

### Method casesByLevel (revisited)

In [13]:
cases = CN.casesByLevel(2, terminal=False)
signSet = set()
for case in cases:
    signSet |= set(s for s in L.d(case, otype='sign') if F.type.v(s) == 'ideograph')
signsA = sortNodes(signSet)
len(signsA)

3815

### Method search (revisited)

Note that it is very easy to add the desired condition to the template.

This method is much easier to adapt than the first method!

In [14]:
query = '''
c0:line
   c1:case
      c2:case
         sign type=ideograph
c0 -sub> c1
c1 -sub> c2
'''
results = S.search(query)
signsB = sortNodes(r[3] for r in results)
len(signsB)

3815

In [15]:
signsA == signsB

True

## Supercase versus subcase

We finish off with a comparison of the frequencies of signs that occur on lines and level-1 cases, and the frequencies of signs that occur on level-2 and deeper cases.

From both groups we pick the top 20.
We make a nice markdown table showing the frequencies of those top-20 signs in both groups.

We do this for non-numeric ideographs only.

Note that we have already collected the group of the subcases and deeper: `signsB`.

We give this sequence another name: `subSigns`.

In [44]:
subSigns = signsB
len(subSigns)

3815

We still need to collect the group of signs in lines and in level-1 cases.
So we have to exclude lines and cases that are subdivided into subcases.

For that, we use the feature `terminal`, which exists and is equal to `1` for undivided
cases and lines, and which does not exist for divided cases and lines.

We get this group with two queries.

In [45]:
query0 = '''
line terminal=1
   sign type=ideograph
'''
signs0 = [r[1] for r in S.search(query0)]
len(signs0)

42485

In [46]:
query1 = '''
c0:line
   c1:case terminal=1
      sign type=ideograph
c0 -sub> c1
'''
signs1 = [r[2] for r in S.search(query1)]
len(signs1)

7665

Let us collect both results into `superSigns`.
Note that `signs0` and `signs1` have no occurrences in common:
a sign in `signs1` is part of a case, so the line that contains that case is divided,
so it has no value for the `terminal` feature, so it is not in the results of `query0`.

In [47]:
superSigns = signs0 + signs1

Also note that `superSigns` and `subSigns` have nothing in common, for the same kind of reasoning as why `signs0` and `signs1` have no occurrences in common.

That said, reasoning is one thing, and using data to verify assertions is another thing.
Let us just check!

In [48]:
set(signs0) & set(signs1)

set()

In [23]:
set(subSigns) & set(superSigns)

set()

Check!

Last, but not least, we want to compare the frequencies of the super and sub groups with the
overall frequencies.

In [31]:
queryA = '''
line
   sign type=ideograph
'''
allSigns = [r[1] for r in S.search(queryA)]
len(allSigns)

53965

### Frequency and rank

We are going to make a frequency distribution for each group.
We do not want to repeat ourselves
([DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)),
so we write a function that, given a list of items,
produces a frequency list.

While we're at it, we also produce a ranking list: the most frequent item has rank 1,
the second most frequent item has rank 2, and so on.

When we compute the frequencies, we count the number of times a sign, identified by its
ATF transcription (without flags), occurs.

In [49]:
def getFreqs(items):
    freqs = collections.Counter()
    for item in items:
        freqs[CN.atfFromSign(item)] += 1
    ranks = {}
    for item in sorted(freqs, key=lambda i: -freqs[i]):
        ranks[item] = len(ranks) + 1
    return (freqs, ranks)
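
To see what the frequency and rank dictionaries look like, here is a corpus-free variant of the
same idea.
The function `get_freqs_toy` and its `key` parameter are hypothetical: `key` takes the place of
`CN.atfFromSign`, so that plain strings can be counted.

```python
import collections


def get_freqs_toy(items, key=lambda item: item):
    """Count items by `key` and rank them from most to least frequent."""
    freqs = collections.Counter(key(item) for item in items)
    ranks = {}
    # sorted() is stable, so items with equal frequency keep the order
    # in which they were first counted and still get distinct ranks.
    for item in sorted(freqs, key=lambda i: -freqs[i]):
        ranks[item] = len(ranks) + 1
    return (freqs, ranks)


toy_freqs, toy_ranks = get_freqs_toy(["EN", "AN", "EN", "GI", "EN", "AN", "BA"])
print(dict(toy_freqs))
print(toy_ranks)
```

Note that ties are not shared: `GI` and `BA` both occur once, yet they receive consecutive ranks.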

In [50]:
(allFreqs, allRanks) = getFreqs(allSigns)
(superFreqs, superRanks) = getFreqs(superSigns)
(subFreqs, subRanks) = getFreqs(subSigns)

Now we want the top scorers in the super and sub teams.
We make it customisable whether you want the top-20 or top-100, or whatever.

In [51]:
def getTop(ranks, amount):
    return sorted(ranks, key=lambda i: ranks[i])[0:amount]

In [52]:
AMOUNT = 20
superTop = getTop(superRanks, AMOUNT)
subTop = getTop(subRanks, AMOUNT)

We combine the two tops without duplication ...

In [53]:
combiTopSet = set(superTop) | set(subTop)

... and sort them by overall rank:

In [54]:
combiTop = sorted(combiTopSet, key=lambda i: allRanks[i])
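
The combined list can be longer than `AMOUNT`, because a sign that is in only one of the two tops
still makes it into the union.
A small sketch with invented ranks and tops (none of these values come from the corpus) shows the
effect:

```python
# Hypothetical overall ranks and two top-3 lists.
toy_all_ranks = {"EN": 1, "AN": 2, "GI": 3, "BA": 4, "DU": 5}
toy_super_top = ["EN", "AN", "GI"]
toy_sub_top = ["EN", "BA", "GI"]

# Union without duplicates, then sorted by overall rank.
toy_combi = sorted(
    set(toy_super_top) | set(toy_sub_top),
    key=lambda sign: toy_all_ranks[sign],
)
print(toy_combi)
```

Two top-3 lists yield four signs here, since `AN` and `BA` each occur in one list only.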

Now that we have our top signs ready, let us just show them.
We group them into horizontal lines.

In [55]:
def chunk(items, chunkSize):
    chunks = [[]]
    j = 0
    for item in items:
        if j == chunkSize:
            chunks.append([])
            j = 0
        chunks[-1].append(item)
        j += 1
    return chunks
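
As an aside, the same grouping can be written with slicing.
This is only an alternative sketch, and `chunk_by_slicing` is a hypothetical name; unlike `chunk`
above, it returns an empty list rather than a list with one empty chunk when there are no items.

```python
def chunk_by_slicing(items, chunk_size):
    """Split items into consecutive lists of at most chunk_size elements."""
    items = list(items)  # accept any iterable, e.g. a range
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


batches = chunk_by_slicing(range(1, 26), 4)
print(len(batches), batches[-1])
```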
    

In [56]:
for batch in chunk(combiTop, 4):
    display(Markdown('\n\n---\n\n'))
    display(CN.lineart(batch, height=80, width=60))



[Output: lineart images of the signs in `combiTop`, four per row, rows separated by horizontal rules]



We can now compose our table.

For each sign we make a row in which we report the frequency and rank of that sign in all
groups.

In [60]:
table = f'''
### Frequencies and ranks of non-numeral signs
sign | all F | all R | super F | super R | sub F | sub R
---  | ---   | ---   | ---     | ---     | ---   | ---
'''
for sign in combiTop:
    allF = allFreqs[sign]
    allR = allRanks[sign]
    superF = superFreqs.get(sign, ' ')
    superR = superRanks.get(sign, ' ')
    subF = subFreqs.get(sign, ' ')
    subR = subRanks.get(sign, ' ')
    row = f'**{sign}** | **{allF}** | **{allR}** | {superF} | *{superR}* | {subF} | *{subR}*'
    table += f'{row}\n'
display(Markdown(table))


### Frequencies and ranks of non-numeral signs
sign | all F | all R | super F | super R | sub F | sub R
---  | ---   | ---   | ---     | ---     | ---   | ---
**EN~a** | **1830** | **1** | 1670 | *1* | 160 | *1*
**SZE~a** | **1294** | **2** | 1178 | *2* | 116 | *2*
**GAL~a** | **1164** | **3** | 1136 | *3* | 28 | *30*
**U4** | **1022** | **4** | 936 | *5* | 86 | *7*
**AN** | **1020** | **5** | 946 | *4* | 74 | *9*
**SAL** | **876** | **6** | 795 | *6* | 81 | *8*
**PAP~a** | **851** | **7** | 765 | *7* | 86 | *6*
**GI** | **849** | **8** | 754 | *8* | 95 | *4*
**BA** | **781** | **9** | 679 | *11* | 102 | *3*
**NUN~a** | **719** | **10** | 677 | *12* | 42 | *20*
**N** | **716** | **11** | 714 | *9* | 2 | *241*
**SANGA~a** | **714** | **12** | 698 | *10* | 16 | *60*
**SZU** | **680** | **13** | 611 | *14* | 69 | *10*
**BU~a** | **653** | **14** | 561 | *17* | 92 | *5*
**NAM2** | **649** | **15** | 625 | *13* | 24 | *39*
**E2~a** | **646** | **16** | 583 | *15* | 63 | *11*
**UDU~a** | **616** | **17** | 574 | *16* | 42 | *19*
**A** | **600** | **18** | 557 | *18* | 43 | *18*
**KI** | **546** | **19** | 503 | *20* | 43 | *17*
**DUG~b** | **509** | **20** | 506 | *19* | 3 | *226*
**DU** | **480** | **21** | 435 | *23* | 45 | *14*
**HI** | **408** | **30** | 363 | *32* | 45 | *15*
**TUR** | **382** | **33** | 330 | *37* | 52 | *12*
**KU3~a** | **264** | **53** | 220 | *59* | 44 | *16*
**HI@g~a** | **235** | **59** | 186 | *68* | 49 | *13*
