# Resume here

## What we’ve learned

1. Unambiguous MFS is better (or, at least, not worse) than any alternative starting point.
1. Full-depth partitioning reduces the complexity enormously, so if we do it first, we turn an unscalably large task into multiple much more scalable small tasks.
1. After each full-depth partition, start from the beginning (create new MFS) inside each partition.
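
As a rough illustration of what a full-depth partition buys us, the sketch below splits every witness at the positions aligned by a full-depth pattern, yielding smaller, independent alignment tasks. The function name `partition_witnesses` and the list-of-lists representation of a partition are assumptions made for this example, not part of the notebook; the pattern format `(token, [(witness, position), ...])` matches what the code further down produces.

```python
def partition_witnesses(witnesses, pattern):
    """Split each witness at the positions aligned by a full-depth pattern.

    Assumes every step of `pattern` has exactly one position per witness.
    Returns a list of partitions; each partition holds one token list per witness.
    """
    n = len(witnesses)
    # cuts[w] collects the aligned positions in witness w, in pattern order
    cuts = [[] for _ in range(n)]
    for _token, positions in pattern:
        for w, pos in positions:
            cuts[w].append(pos)

    partitions = []
    starts = [0] * n
    for k in range(len(pattern)):
        # tokens between the previous aligned token and the k-th aligned token
        partitions.append([witnesses[w][starts[w]:cuts[w][k]] for w in range(n)])
        starts = [cuts[w][k] + 1 for w in range(n)]
    # tokens after the last aligned token
    partitions.append([witnesses[w][starts[w]:] for w in range(n)])
    return partitions


witnesses = [
    ["the", "red", "and", "the", "black", "cat"],
    ["the", "black", "and", "the", "red", "cat"],
    ["the", "black", "cat"],
]
pattern = [
    ("black", [(0, 4), (1, 1), (2, 1)]),
    ("cat", [(0, 5), (1, 5), (2, 2)]),
]
partitions = partition_witnesses(witnesses, pattern)
```

Each non-empty partition would then be treated as its own alignment task, with a fresh MFS built from its tokens alone.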

## What we still need to learn

1. At some point we run out of full-depth partitions and have to do something different, and we haven't yet decided what that something different is.
1. Length matters, so once we run out of full-depth partitions, depth is no longer paramount. Cf. our earlier block priority, which balanced depth, length, and token infrequency.
1. High-frequency tokens may lead to spurious alignments. We can tabulate token frequencies since we have to touch every token anyway, and then treat tokens with a frequency above a certain threshold as if they were ambiguous.
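
The frequency idea in the last item can be sketched as follows. This is a minimal illustration only: the helper name `high_frequency_tokens` and the threshold value are assumptions, and what counts as "too frequent" is still undecided.

```python
from collections import Counter


def high_frequency_tokens(witnesses, threshold):
    """Return the tokens whose total count across all witnesses exceeds `threshold`."""
    counts = Counter(token for witness in witnesses for token in witness)
    return {token for token, n in counts.items() if n > threshold}


witnesses = [
    ["the", "red", "and", "the", "black", "cat"],
    ["the", "black", "and", "the", "red", "cat"],
    ["the", "black", "cat"],
]
# threshold=3 is an arbitrary value chosen for this example
frequent = high_frequency_tokens(witnesses, threshold=3)
```

Tokens in the returned set would then be handled the same way as ambiguous tokens, that is, skipped during placement.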

## Procedure

1. Build MFS out of the box, keeping only the first instance in case of repetition. Use the compact feature to remove subsequences. (This requires the more elaborate source code, not the briefer version below.) Or use topk instead of frequent?
1. Prioritize MFS based first on depth and then length. Start with only full-depth MFS.
1. Place MFS tokens in order, checking each token for ambiguity and placing all of the unambiguous ones.
1. After each full-depth partitioning, each partition is a separate alignment task, so go back to step #1, above, separately for each partition and build new MFS.
1. When there are no more full-depth MFSs, change strategy. How? MFS but not full depth? Back to blocks?
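
Step 2 of the procedure (full-depth MFS first, longest first) might look like the sketch below. It assumes results in the `(depth, pattern)` shape printed by the code further down; the function name `full_depth_by_length` is hypothetical.

```python
def full_depth_by_length(results, witness_count):
    """Keep only non-empty patterns found in every witness, longest first."""
    full = [
        (depth, patt)
        for depth, patt in results
        if depth == witness_count and patt  # drop partial-depth and empty patterns
    ]
    return sorted(full, key=lambda r: len(r[1]), reverse=True)


# a few (depth, pattern) tuples in the same shape as the notebook's results
results = [
    (3, []),
    (3, [("the", [(0, 0), (1, 0), (2, 0)])]),
    (3, [("the", [(0, 0), (1, 0), (2, 0)]),
         ("black", [(0, 4), (1, 1), (2, 1)]),
         ("cat", [(0, 5), (1, 5), (2, 2)])]),
    (2, [("the", [(0, 0), (1, 0), (2, 0)]),
         ("red", [(0, 1), (1, 4)]),
         ("cat", [(0, 5), (1, 5)])]),
]
ranked = full_depth_by_length(results, witness_count=3)
best_tokens = [token for token, _ in ranked[0][1]]
```

An empty return value would signal the point at which the strategy has to change.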

## What to do next

1. Identify ambiguous tokens in MFS so that we can skip them.
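
One possible way to do this is sketched below. It assumes that a token is ambiguous when it occurs more than once in at least one witness; that definition, and the helper names `ambiguous_tokens` and `unambiguous_steps`, are assumptions for illustration rather than settled decisions.

```python
from collections import Counter


def ambiguous_tokens(witnesses):
    """Return the tokens that are repeated within at least one witness."""
    ambiguous = set()
    for witness in witnesses:
        counts = Counter(witness)
        ambiguous.update(token for token, n in counts.items() if n > 1)
    return ambiguous


def unambiguous_steps(pattern, witnesses):
    """Drop the steps of an MFS pattern whose token is ambiguous."""
    skip = ambiguous_tokens(witnesses)
    return [(token, positions) for token, positions in pattern if token not in skip]


witnesses = [
    ["the", "red", "and", "the", "black", "cat"],
    ["the", "black", "and", "the", "red", "cat"],
    ["the", "black", "cat"],
]
pattern = [
    ("the", [(0, 0), (1, 0), (2, 0)]),
    ("black", [(0, 4), (1, 1), (2, 1)]),
    ("cat", [(0, 5), (1, 5), (2, 2)]),
]
placeable = unambiguous_steps(pattern, witnesses)
```

Here "the" is skipped because it is repeated in the first two witnesses, leaving "black" and "cat" to be placed.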

In [1]:
import pprint
pp = pprint.PrettyPrinter(indent=2)

In [2]:
from collections import defaultdict

def frequent_rec(patt, mdb):
    results.append((len(mdb), patt))

    occurs = defaultdict(list) # keys are token strings, values are lists of tuples of (witness number, witness token offset)
    for (i, startpos) in mdb:
        seq = db[i] # witness tokens
        for j in range(startpos + 1, len(seq)): # index into witness tokens
            l = occurs[seq[j]] # positions recorded so far for the token at position j
            if len(l) == 0 or l[-1][0] != i: # no entries yet, or the last entry is from a different witness (keeps only the first occurrence per witness)
                l.append((i, j))

#     pp.pprint(patt)
#     pp.pprint(occurs)

    for (c, newmdb) in occurs.items(): # c is word token, newmdb is list of tuples
        if len(newmdb) >= minsup: # number of witnesses in which c occurs after the current position
            frequent_rec(patt + [(c, newmdb)], newmdb)

db = [
    ["the", "red", "and", "the", "black", "cat"],
    ["the", "black", "and", "the", "red", "cat"],
    ["the", "black", "cat"],
]

# db = [
#     [0, 1, 2, 3, 4],
#     [1, 1, 1, 3, 4],
#     [2, 1, 2, 2, 0],
#     [1, 1, 1, 2, 2],
# ]

minsup = 2

results = []

frequent_rec([], [(i, -1) for i in range(len(db))]) # void; updates global results (list) in place

for result in sorted(results, key=lambda x: (x[0], len(x[1])), reverse=True):
    print(result)

(3, [('the', [(0, 0), (1, 0), (2, 0)]), ('black', [(0, 4), (1, 1), (2, 1)]), ('cat', [(0, 5), (1, 5), (2, 2)])])
(3, [('the', [(0, 0), (1, 0), (2, 0)]), ('black', [(0, 4), (1, 1), (2, 1)])])
(3, [('the', [(0, 0), (1, 0), (2, 0)]), ('cat', [(0, 5), (1, 5), (2, 2)])])
(3, [('black', [(0, 4), (1, 1), (2, 1)]), ('cat', [(0, 5), (1, 5), (2, 2)])])
(3, [('the', [(0, 0), (1, 0), (2, 0)])])
(3, [('black', [(0, 4), (1, 1), (2, 1)])])
(3, [('cat', [(0, 5), (1, 5), (2, 2)])])
(3, [])
(2, [('the', [(0, 0), (1, 0), (2, 0)]), ('and', [(0, 2), (1, 2)]), ('the', [(0, 3), (1, 3)]), ('cat', [(0, 5), (1, 5)])])
(2, [('the', [(0, 0), (1, 0), (2, 0)]), ('red', [(0, 1), (1, 4)]), ('cat', [(0, 5), (1, 5)])])
(2, [('the', [(0, 0), (1, 0), (2, 0)]), ('and', [(0, 2), (1, 2)]), ('the', [(0, 3), (1, 3)])])
(2, [('the', [(0, 0), (1, 0), (2, 0)]), ('and', [(0, 2), (1, 2)]), ('cat', [(0, 5), (1, 5)])])
(2, [('the', [(0, 0), (1, 0), (2, 0)]), ('the', [(0, 3), (1, 3)]), ('cat', [(0, 5), (1, 5)])])
(2, [('and', [(0, 2)