# Compare the Afifi and Lakhnawi editions of the Fusus

In [1]:
%load_ext autoreload
%autoreload 2

In [141]:
import collections
from Levenshtein import distance, ratio

In [3]:
from tf.app import use

In [4]:
BASE = "~/github/among/fusus"
VERSION = "0.7"

# Load both editions

Normally, when we load a single data source in a notebook, we store the handle in a variable called
`A`, and we hoist additional variables `F`, `L`, `T`, etc to the global namespace.

But now we work with two data sources, so we store the handles in a dictionary `A`, with
a key `LK` for the Lakhnawi edition and a key `AF` for the Afifi edition.

We also make dictionaries for `F`, `L`, `T`, etc, keyed with the same keys.

In that way we can systematically select our handles for the desired editions.

In [5]:
LK = "LK"
AF = "AF"

EDITIONS = {
    LK: "Lakhnawi",
    AF: "Afifi",
}

A = {}
F = {}
E = {}
L = {}
T = {}
N = {}

In [6]:
for (acro, name) in EDITIONS.items():
    A[acro] = use(f"among/fusus/tf/{name}:clone", writing="ara", version=VERSION)
    F[acro] = A[acro].api.F
    E[acro] = A[acro].api.E
    L[acro] = A[acro].api.L
    T[acro] = A[acro].api.T
    N[acro] = A[acro].api.N

This is Text-Fabric 9.1.3
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

27 features found and 0 ignored


This is Text-Fabric 9.1.3
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

17 features found and 0 ignored


Let's find out the max slot of both editions.

In [7]:
maxSlot = {acro: F[acro].otype.maxSlot for acro in EDITIONS}
maxSlot

{'LK': 40379, 'AF': 40271}

We set up our comparison.

We work with the Latin transcriptions, in order to avoid complications with right-to-left writing
when we display the places where discrepancies occur.

The result of the comparison will be a table that aligns the LK slots with the AF slots.

In [147]:
getTextLK = F[LK].lettersn.v
getTextAF = F[AF].lettersn.v

maxLK = maxSlot[LK]
maxAF = maxSlot[AF]

comparison = []
indexLK = {}
indexAF = {}

We define auxiliary functions for finding discrepancies and inspecting them.

In [148]:
def printLines(start=0, end=None):
    if start < 0:
        start = 0
    if end is None or end > len(comparison):
        end = len(comparison)
    lines = []
    for (iLK, left, distance, right, iAF) in comparison[start:end]:
        textLK = getTextLK(iLK) if iLK else ""
        textAF = getTextAF(iAF) if iAF else ""
        lines.append(f"{iLK:>5} {left:<2} {textLK:>20} @{distance:<2} {textAF:<20} {right:>2} {iAF:>5}")
    return "\n".join(lines)
        
        
def printComparison(path):
    with open(path, "w") as fh:
        fh.write(printLines())
        fh.write("\n")

            
def printDiff(before, after):
    print(printLines(start=len(comparison) - before))
    lastLK = None
    lastAF = None
    for c in range(len(comparison) - 1, -1, -1):
        comp = comparison[c]
        if lastLK is None:
            if comp[0]:
                lastLK = comp[0]
        if lastAF is None:
            if comp[4]:
                lastAF = comp[4]
        if lastLK is not None and lastAF is not None:
            break
    if lastLK is not None and lastAF is not None:
        for i in range(after):
            iLK = lastLK + 1 + i
            iAF = lastAF + 1 + i
            textLK = getTextLK(iLK) if iLK <= maxLK else ""
            textAF = getTextAF(iAF) if iAF <= maxAF else ""
            print(f"{iLK:>5} =  {textLK:>20} ??? {textAF:<20}  = {iAF:>5}")

Now the proper algorithm.

We stop when we cannot solve a discrepancy.

When solving discrepancies, we adjust the mapping and we record the severity of the
discrepancy as a distance value in the corresponding entries of `comparison`.

We need to compute whether $n$ consecutive words left are similar to $m$ consecutive words
right.

We assume there is an upper bound $C$ on the number of words that we will combine on either side.

We need to walk through all possible combinations, from simplest and shortest to longest and most complex.

Every combination can be characterized by $(n, m)$, where $n$ is the number of words on the left
and $m$ is the number of words on the right. $n$ and $m$ are in the range $1 \ldots C$.

Suppose $C = 3$, then we want to compare combinations in the following order:

combination|x
---|---
$(1, 1)$|--
$(1, 2)$|--
$(2, 1)$|--
$(2, 2)$|--
$(1, 3)$|--
$(3, 1)$|--
$(2, 3)$|--
$(3, 2)$|--
$(3, 3)$|--

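In fact, we list all possible combinations and then sort them first by the sum of the pair
and then by increasing absolute difference of the pair.

We fix a `C` (called `COMBINE`) and compute the sequence of combinations up front.
The pair $(1, 1)$ is left out of that sequence, because single words are compared directly
before any combinations are tried.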
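
As a check on this ordering, here is a minimal, self-contained sketch that reproduces the table for $C = 3$.
The name `orderedCombis` is illustrative and not part of the notebook. It keeps $(1, 1)$ so that
its output can be compared with the table line by line.

```python
def orderedCombis(c):
    """All pairs (n, m) with 1 <= n, m <= c, simplest combinations first."""
    pairs = [(n, m) for n in range(1, c + 1) for m in range(1, c + 1)]
    # primary key: total number of words involved;
    # secondary key: how unequal the two sides are (equal sides first)
    return sorted(pairs, key=lambda p: (p[0] + p[1], abs(p[0] - p[1])))


for pair in orderedCombis(3):
    print(pair)
```

Because Python's sort is stable, pairs with the same key keep the order in which they were generated,
which is why $(1, 2)$ comes before $(2, 1)$.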

In [149]:
def getCombis(c):
    combis = []
    for i in range(1, c + 1):
        for j in range(1, c + 1):
            if i != 1 or j != 1:
                combis.append((i, j))
    return tuple(sorted(combis, key=lambda x: (x[0] + x[1], abs(x[0] - x[1]))))

In [150]:
COMBINE = 4

COMBIS = getCombis(COMBINE)
COMBIS

((1, 2),
 (2, 1),
 (2, 2),
 (1, 3),
 (3, 1),
 (2, 3),
 (3, 2),
 (1, 4),
 (4, 1),
 (3, 3),
 (2, 4),
 (4, 2),
 (3, 4),
 (4, 3),
 (4, 4))

In [151]:
def similar(s1, s2, strictness):
    if s1 == s2:
        return (True, 0)
    
    d = distance(s1, s2)
    if type(strictness) is int:
        return (d <= strictness, d)
    else:
        return (ratio(s1, s2) >= strictness, d)

        
def catchupAF(start, end):
    for i in range(start, end + 1):
        indexAF[i] = len(comparison)
        comparison.append(("", "-", 99, "=", i))
        
        
def catchupLK(start, end):
    for i in range(start, end + 1):
        indexLK[i] = len(comparison)
        comparison.append((i, "=", 99, "-", ""))
        
        
def findCombi(iLK, iAF, strictness):
    found = None
    
    for (cLK, cAF) in COMBIS:
        # skip combinations that would run past the end of either edition
        if iLK + cLK > maxLK or iAF + cAF > maxAF:
            continue
        textLK = "".join(getTextLK(iLK + i) for i in range(cLK))
        textAF = "".join(getTextAF(iAF + i) for i in range(cAF))
        (isSimilar, d) = similar(textLK, textAF, strictness)
        if isSimilar:
            found = (cLK, cAF)
            common = min((cLK, cAF))
            for i in range(max((cLK, cAF))):
                nComparison = len(comparison)
                if i < common:
                    comparison.append((iLK + i, f"+{cLK}", d, f"{cAF}+", iAF + i))
                    indexLK[iLK + i] = nComparison
                    indexAF[iAF + i] = nComparison
                elif i < cLK:
                    comparison.append((iLK + i, f"+{cLK}", d, f"{cAF}^", ""))
                    indexLK[iLK + i] = nComparison
                elif i < cAF:
                    comparison.append(("", f"^{cLK}", d, f"{cAF}+", iAF + i))
                    indexAF[iAF + i] = nComparison
            break
    return found
        
    
def doCase(iLK, iAF):
    if iLK not in cases:
        return None
    
    (cLK, cAF) = cases[iLK]
    common = min((cLK, cAF))
    for i in range(max((cLK, cAF))):
        nComparison = len(comparison)
        if i < common:
            comparison.append((iLK + i, f"+{cLK}", 88, f"{cAF}+", iAF + i))
            indexLK[iLK + i] = nComparison
            indexAF[iAF + i] = nComparison
        elif i < cLK:
            comparison.append((iLK + i, f"+{cLK}", 88, f"{cAF}^", ""))
            indexLK[iLK + i] = nComparison
        else:
            comparison.append(("", f"^{cLK}", 88, f"{cAF}+", iAF + i))
            indexAF[iAF + i] = nComparison
    return (iLK + cLK, iAF + cAF)
    
    
def compare(iLK, iAF, strictness):
    """Strictness is edit distance if it is an integer, otherwise it is ratio
    """
    textLK = getTextLK(iLK)
    textAF = getTextAF(iAF)
    (isSimilar, d) = similar(textLK, textAF, strictness)
    if isSimilar:
        nComparison = len(comparison)
        comparison.append((iLK, "=", d, "=", iAF))
        indexLK[iLK] = nComparison
        indexAF[iAF] = nComparison
        return (iLK + 1, iAF + 1)

    combi = findCombi(iLK, iAF, strictness)
    if combi is not None:
        (cLK, cAF) = combi
        return (iLK + cLK, iAF + cAF)
    
    return None
            
    
def lookup(iLK, iAF, strictness, start, end):
    step = None
    
    for i in range(start, end + 1):
        prevComparisonIndex = len(comparison)
        
        if iAF + i <= maxAF:
            step = compare(iLK, iAF + i, strictness)
            if step:
                thisComparison = list(comparison[prevComparisonIndex:])
                comparison[prevComparisonIndex:] = []
                
                catchupAF(iAF, iAF + i - 1)
                for thisComp in thisComparison:
                    nComparison = len(comparison)
                    thisLK = thisComp[0]
                    thisAF = thisComp[4]
                    if thisLK:
                        indexLK[thisLK] = nComparison
                    if thisAF:
                        indexAF[thisAF] = nComparison
                    comparison.append(thisComp)
                break

        if iLK + i <= maxLK:
            step = compare(iLK + i, iAF, strictness)
            if step:
                thisComparison = list(comparison[prevComparisonIndex:])
                comparison[prevComparisonIndex:] = []
                
                catchupLK(iLK, iLK + i -1)
                for thisComp in thisComparison:
                    nComparison = len(comparison)
                    thisLK = thisComp[0]
                    thisAF = thisComp[4]
                    if thisLK:
                        indexLK[thisLK] = nComparison
                    if thisAF:
                        indexAF[thisAF] = nComparison
                    comparison.append(thisComp)
                break
    return step

In [221]:
def doDiffs(startLK=1, startAF=1, steps=-1, show=False):
    comparison.clear()
    
    step = (startLK, startAF)

    complete = False
    it = 0
    
    while it != steps:
        it += 1
        step = doDiff(*step)
        
        if step is True:
            printComparison("zipLK-AF-complete.txt")
            print(f"Comparison complete, {len(comparison)} entries.")
            break
        elif step is False:
            printComparison("zipLK-AF-incomplete.txt")
            print(f"Comparison blocked, {len(comparison)} entries.")
            printDiff(20, 20)
            break
            
    if show:
        print(printLines())
        
        
def doDiff(iLK, iAF):
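        if iLK > maxLK or iAF > maxAF:
            # one edition is exhausted: the remaining slots of the other
            # edition (including its last slot) have no counterpart
            if iAF <= maxAF:
                catchupAF(iAF, maxAF)
            if iLK <= maxLK:
                catchupLK(iLK, maxLK)
            return True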
            
        strictness = 1
        
        step = doCase(iLK, iAF)
        if step:
            return step
            
        step = compare(iLK, iAF, strictness)
        if step:
            return step
            
        strictness = 2
        
        step = compare(iLK, iAF, strictness)
        if step:
            return step
            
        strictness = 3
        
        step = compare(iLK, iAF, strictness)
        if step:
            return step
            
        strictness = 0.5
        
        step = compare(iLK, iAF, strictness)
        if step:
            return step
            
        strictness = 0.6
        
        step = lookup(iLK, iAF, strictness, 1, 5)
        if step:
            return step
            
        strictness = 0.8
        
        step = lookup(iLK, iAF, strictness, 5, 10)
        if step:
            return step
            
        strictness = 1
        
        step = lookup(iLK, iAF, strictness, 10, 1000)
        if step:
            return step
            
        return False

# Check the result

We must make sure that the algorithm has not skipped material, duplicated material, or put material in the wrong order.

We examine the comparison list and check that we have all slot numbers of LK in the right order, and likewise all slot numbers of AF.

The mapping itself is needed elsewhere in Text-Fabric, so let us write it to file.
We write it as an edge feature into the AF edition.
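
How the mapping is stored is not shown here. As a first step it has to be derived from the rows of `comparison`;
the following is a minimal sketch of that step only, under the assumption that a row pairs the LK slot and the AF slot
that it lists side by side. The name `slotMapping` and the toy rows are hypothetical.

```python
def slotMapping(rows):
    """Map each LK slot to the AF slot on the same row, skipping one-sided rows."""
    mapping = {}
    for (iLK, left, d, right, iAF) in rows:
        # an empty string marks a missing slot on that side
        if iLK != "" and iAF != "":
            mapping[iLK] = iAF
    return mapping


# toy rows in the same shape as the entries of `comparison` (hypothetical data)
toyRows = [
    ("", "-", 99, "=", 1),
    (1, "+1", 5, "2+", 2),
    ("", "^1", 5, "2+", 3),
    (2, "=", 1, "=", 4),
]
print(slotMapping(toyRows))
```

Rows with a word on one side only contribute nothing to the mapping, so AF slots 1 and 3 of the toy data stay unmapped.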

In [318]:
# a run of at most this many good lines between bad lines
# does not interrupt a bad stretch

LOOKAHEAD = 3


def analyseStretch(start, end):
    total = 0
    good = 0
    onlyLK = 0
    onlyAF = 0
    
    for (iLK, left, d, right, iAF) in comparison[start:end + 1]:
        total += 1
        if not iLK:
            onlyAF += 1
        if not iAF:
            onlyLK += 1
        if d == 0:
            good += 1
    
    suspect = onlyAF > 1 and onlyLK > 1 and onlyAF + onlyLK > 5
    return suspect
    
def checkComparison():
    errors = {}
    prevILK = 0
    prevIAF = 0
    
    where = collections.Counter()
    agreement = collections.Counter()
    badStretches = collections.defaultdict(lambda: [])
    
    startBad = 0
    
    for (c, (iLK, left, d, right, iAF)) in enumerate(comparison):
        thisBad = d > 0 or not iLK or not iAF
        # a good line between bad lines is counted as bad
        if not thisBad and startBad:
            nextGood = True
            for j in range(1, LOOKAHEAD + 1):
                if c + j < len(comparison):
                    compJ = comparison[c + j]
                    if compJ[2] > 0 or not compJ[0] or not compJ[-1]:
                        nextGood = False
                        break
            if not nextGood:
                thisBad = True
        if startBad:
            if not thisBad:
                badStretches[c - startBad].append(startBad)
                startBad = 0
        else:
            if thisBad:
                startBad = c
        
        agreement[d] += 1
        
        if iLK:
            if iLK != prevILK + 1:
                errors.setdefault("wrong iLK", []).append(f"{c:>5}: Expected {prevILK + 1}, found {iLK}")
            prevILK = iLK
            if iAF:
                where["both"] += 1
        else:
            where[AF] += 1
        if iAF:
            if iAF != prevIAF + 1:
                errors.setdefault("wrong iAF", []).append(f"{c:>5}: Expected {prevIAF + 1}, found {iAF}")
            prevIAF = iAF
        else:
            where[LK] += 1
            
    if startBad:
        badStretches[len(comparison) - startBad].append(startBad)
            
    if prevILK < maxLK:
        errors.setdefault("missing iLKs at the end", []).append(f"last is {prevILK}, expected {maxLK}")
    elif prevILK > maxLK:
        errors.setdefault("too many iLKs at the end", []).append(f"last is {prevILK}, expected {maxLK}")
    if prevIAF < maxAF:
        errors.setdefault("missing iAFs at the end", []).append(f"last is {prevIAF}, expected {maxAF}")
    elif prevIAF > maxAF:
        errors.setdefault("too many iAFs at the end", []).append(f"last is {prevIAF}, expected {maxAF}")
    
    print("\nSANITY\n")
    if not errors:
        print("All OK")
    else:
        for (kind, msgs) in errors.items():
            print(f"ERROR {kind} ({len(msgs):>5}x):")
            for msg in msgs[0:10]:
                print(f"\t{msg}")
            if len(msgs) > 10:
                print(f"\t ... and {len(msgs) - 10} more ...")
                
    print(f"\nAGREEMENT\n")
    print("Where are the words?\n")
    print(f"\t{LK}-only: {where[LK]:>5} slots")
    print(f"\t{AF}-only: {where[AF]:>5} slots")
    print(f"\tboth:    {where['both']:>5} slots")
    
    print("\nHow well is the agreement?\n")
    for (d, n) in sorted(agreement.items()):
        print(f"edit distance {d:>3} : {n:>5} words")
    print(f"NB: 88 are special cases that have been declared explicitly")
    
    print(f"\nBAD STRETCHES\n")
    print("How many of which size?\n")
    allSuspects = []
    someBenigns = []
    for (size, starts) in sorted(badStretches.items(), key=lambda x: (-x[0], x[1])):
        suspects = {start: size for start in starts if analyseStretch(start, start + size)}
        benigns = {start: size for start in starts if start not in suspects}
        allSuspects.extend([(start, start + size) for (start, size) in suspects.items()])
        someBenigns.extend([(start, start + size) for (start, size) in list(benigns.items())[0:3]])
        examples = ", ".join(str(start) for start in list(suspects.keys())[0:3])
        if not suspects:
            examples = ", ".join(str(start) for start in list(benigns.keys())[0:3])
        print(f"bad stretches of size {size:>3} : {len(suspects):>4} suspect of total {len(starts):>4} x see e.g. {examples}")
        
    print(f"\nShowing all {len(allSuspects)} inversion suspects" if len(allSuspects) else "\nNo suspect bad stretches\n")
    for (i, (start, end)) in enumerate(reversed(allSuspects)):
        print(f"\nSUSPECT {i + 1:>2}")
        print(printLines(max((1, start - 5)), min((len(comparison), end + 5))))
    print(f"\nShowing some ({len(someBenigns)}) benign examples" if len(someBenigns) else "\nNo bad stretches\n")
    for (i, (start, end)) in enumerate(someBenigns):
        print(f"\nBENIGN {i + 1:>2}")
        print(printLines(max((1, start - 2)), min((len(comparison), end + 2))))

# Special cases

The special cases are declared in a dict.
The keys are slot positions in the LK, the values are pairs: the number of LK words and the number of AF words
that are to be identified with each other.

For example:

```
1000: (2,5)
```

means that slots 1000 and 1001 in the LK match with slots x, x + 1, x + 2, x + 3, x + 4 in the AF,
where x is the slot in the AF that corresponds to slot 1000 in the LK.
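
To make this concrete, here is a minimal sketch of the rows that such a case yields. It mirrors the logic of `doCase`
but returns the rows instead of appending them to `comparison`. The name `expandCase` and the value 2000 for x
are assumptions made for the illustration.

```python
def expandCase(iLK, iAF, cLK, cAF):
    """Rows produced when cLK words in LK are declared to match cAF words in AF."""
    rows = []
    common = min(cLK, cAF)
    for i in range(max(cLK, cAF)):
        if i < common:
            # both editions still have a word at this position
            rows.append((iLK + i, f"+{cLK}", 88, f"{cAF}+", iAF + i))
        elif i < cLK:
            # only LK has words left
            rows.append((iLK + i, f"+{cLK}", 88, f"{cAF}^", ""))
        else:
            # only AF has words left
            rows.append(("", f"^{cLK}", 88, f"{cAF}+", iAF + i))
    return rows


# the example case 1000: (2, 5), with x = 2000 as a hypothetical AF slot
for row in expandCase(1000, 2000, 2, 5):
    print(row)
```

The first two rows pair LK slots 1000 and 1001 with AF slots 2000 and 2001; the remaining three rows have an empty LK side.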

In [299]:
cases = {
    13539: (2, 1),
    8273: (4, 0),
    14878: (1, 1),
    14879: (1, 0),
    14880: (12, 0),
}

# Run the comparison

Here we go!

In [309]:
doDiffs()

Comparison complete, 40971 entries.


# Check the results

The result of the alignment is a table where the first column contains all the slots in LK,
and the second column all the slots in AF:

Left you see the LK slot numbers and the words in those slots.

Right you see the AF words and their AF slot numbers.

The middle column is a measure for the edit distance between the words at both sides.

`99` means: at one of the sides the word is missing.
`88` means: this was a prescribed, special case, no edit distance has been measured.

All other values are the number of edits you have to make in order to change one word to the other.
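
The notebook takes these numbers from the `distance` function of the `Levenshtein` package.
For readers without that package, the following pure-Python stand-in (the name `editDistance` is ours)
is a minimal sketch of the same measure; the word pairs are taken from output tables in this notebook.

```python
def editDistance(s1, s2):
    """Levenshtein distance: minimal number of single-character insertions,
    deletions and substitutions needed to turn s1 into s2."""
    prev = list(range(len(s2) + 1))
    for (i, c1) in enumerate(s1, start=1):
        cur = [i]
        for (j, c2) in enumerate(s2, start=1):
            cur.append(min(
                prev[j] + 1,               # delete c1
                cur[j - 1] + 1,            # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


print(editDistance("llh", "lh"))       # one deletion
print(editDistance("slymān", "lyān"))  # two deletions
```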

Sometimes words are combined: a number of words left corresponds to a possibly different number of words right.
You see that indicated by `+n` and `n+`. If the numbers are not equal, empty words are inserted with `^n` or `n^`.

If just a single word left is aligned with a single word right, you see it marked with `=` on the left and right.

In [310]:
print(printLines(start=0, end=20))

      -                       @99 bnzlylālʿ             =     1
    1 +1                ālḥmd @5  ylrʿā                2+     2
      ^1                      @5  ālḥmd                2+     3
    2 =                   llh @1  lh                    =     4
    3 =                  mnzl @0  mnzl                  =     5
    4 =                 ālḥkm @1  ālḥk                  =     6
    5 =                   ʿlá @0  ʿlá                   =     7
    6 =                  ḳlwb @0  ḳlwb                  =     8
    7 =                 ālklm @0  ālklm                 =     9
    8 =                bāḥdyŧ @0  bāḥdyŧ                =    10
    9 =                ālṭryḳ @0  ālṭryḳ                =    11
   10 =                 ālāmm @0  ālāmm                 =    12
   11 =                    mn @0  mn                    =    13
   12 =                ālmḳām @0  ālmḳām                =    14
   13 +1               ālāḳdm @1  ā                    2+    15
      ^1                      @1  ālāḳdm

# Quality check

How did the alignment perform?
It did complete, but what have we got?
It could be just garbage.

## Sanity

First of all we need to know whether all words of both AF and LK occur left and right, without gaps and duplications
and in the right order. We check that.
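
The core of such a check can be sketched in a few lines. This is a simplified illustration, not the notebook's
`checkComparison`; the name `checkSlots` and the toy rows are hypothetical.

```python
def checkSlots(rows, column, maxSlot):
    """Return a list of problems with the slot numbers in one column of rows."""
    problems = []
    prev = 0
    for (r, row) in enumerate(rows):
        slot = row[column]
        if slot == "":
            # no word on this side of the row
            continue
        if slot != prev + 1:
            problems.append(f"row {r}: expected {prev + 1}, found {slot}")
        prev = slot
    if prev != maxSlot:
        problems.append(f"last slot is {prev}, expected {maxSlot}")
    return problems


toyRows = [
    ("", "-", 99, "=", 1),
    (1, "=", 0, "=", 2),
    (2, "=", 0, "=", 4),  # AF slot 3 is skipped here on purpose
]
print(checkSlots(toyRows, 0, 2))  # LK column
print(checkSlots(toyRows, 4, 4))  # AF column
```

Gaps, duplications and inversions all show up as a slot that is not one more than its predecessor.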

## Agreement

We provide information about the agreement between both sources.
How many words are there for which there is no alignment in the other edition?

And how close are the words for which an alignment could be established?

Note that there are two reasons for bad agreement results:

1. The editions are really very different
2. The alignment is not optimal and fails to align many words that would have matched under another
   alignment strategy.
   
## Bad stretches

Are there long stretches of poorly matching alignments?
We are going to examine them.

If they contain many cases of a left missing word and many cases of a right missing word,
they are suspect, because they might contain largely the same words, but the algorithm has failed
to spot them.

We show all suspect bad stretches.

The remaining stretches are benign.
We show examples of benign bad stretches (at most three examples per size).
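
The suspect rule can be stated compactly. The sketch below uses the same thresholds as `analyseStretch`
in this notebook, but works on a list of rows rather than on index positions; the name `isSuspect`
and the toy rows are hypothetical.

```python
def isSuspect(rows):
    """Apply the suspect rule to a stretch of comparison rows."""
    onlyLK = sum(1 for row in rows if row[4] == "")  # word present in LK only
    onlyAF = sum(1 for row in rows if row[0] == "")  # word present in AF only
    # suspect: several missing words on both sides, more than five in total
    return onlyAF > 1 and onlyLK > 1 and onlyAF + onlyLK > 5


lkOnly = [(i, "=", 99, "-", "") for i in range(1, 4)]
afOnly = [("", "-", 99, "=", i) for i in range(1, 4)]
print(isSuspect(lkOnly + afOnly))  # missing words on both sides
print(isSuspect(lkOnly + lkOnly))  # missing words on one side only
```

A stretch where only one edition has extra words is not suspect: it looks like a genuine addition or omission
rather than a failed alignment.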

In [319]:
checkComparison()


SANITY

All OK

AGREEMENT

Where are the words?

	LK-only:   700 slots
	AF-only:   592 slots
	both:    39679 slots

How well is the agreement?

edit distance   0 : 36785 words
edit distance   1 :  2920 words
edit distance   2 :   441 words
edit distance   3 :   204 words
edit distance   4 :   108 words
edit distance   5 :   162 words
edit distance   6 :    73 words
edit distance   7 :    43 words
edit distance   8 :    52 words
edit distance   9 :    29 words
edit distance  10 :    38 words
edit distance  11 :     6 words
edit distance  12 :     8 words
edit distance  13 :     6 words
edit distance  14 :     4 words
edit distance  15 :     3 words
edit distance  88 :    20 words
edit distance  99 :    69 words
NB: 88 are special cases that have been declared explicitly

BAD STRETCHES

How many of which size?

bad stretches of size  37 :    0 suspect of total    1 x see e.g. 40934
bad stretches of size  30 :    0 suspect of total    1 x see e.g. 18394
bad stretches of size  25 :    0 s

# Individual decisions

You can direct the alignment machinery to a specific set of points in the LK and AF texts and
let it run for a number of steps. After running it, you can ask it to show the output table.

This is handy when you explore bad stretches and want to look very closely at how the decision to align is taken.
You can then add debug statements, or change the code and see what happens in that case.

In [305]:
doDiffs(startLK=8273, startAF=8286, steps=5, show=True)

 8273 +4                  ṣlá @88                      0^      
 8274 +4                 āllh @88                      0^      
 8275 +4                 ʿlyh @88                      0^      
 8276 +4                 wslm @88                      0^      
 8277 =                  wḳāl @0  wḳāl                  =  8286
 8278 =                  āllh @0  āllh                  =  8287
 8279 =                 tʿālá @0  tʿālá                 =  8288
 8280 =              lābrāhym @0  lābrāhym              =  8289


In [306]:
doDiffs(startLK=23983, startAF=23877, steps=1, show=True)

23983 =                slymān @2  lyān                  = 23877


# THE END