# Alignment with MUFS (maximal unambiguous frequent sequences)
## What we’ve learned

1. Unambiguous MFS is better (or, at least, not worse) than any alternative starting point.
1. Full-depth partitioning reduces the complexity enormously, so if we do it first, we turn an unscalably large task into multiple much more scalable small tasks.
1. After each full-depth partition, start from the beginning (create new MFS) inside each partition.

## What we still need to learn

1. At some point we run out of full-depth partitions and have to do something different, and we haven't yet decided what that something different is.
1. Length matters, so once we run out of full-depth partitions, depth is no longer paramount. Cf. our earlier block priority, which balanced depth, length, and token infrequency.
1. High-frequency tokens may lead to spurious alignments. We can tabulate token frequencies since we have to touch every token anyway, and then treat tokens with a frequency above a certain threshold as if they were ambiguous.
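
The frequency tabulation in the last item could be sketched roughly as follows. This is a minimal sketch, not a settled design: the function names, the `threshold` parameter, and its value are all hypothetical, and the toy witnesses are the ones from the commented-out example further down.

```python
from collections import Counter
from typing import Dict, List, Set


def tabulate_token_frequencies(witnesses: List[List[str]]) -> Dict[str, int]:
    """Count how often each token occurs across all witnesses."""
    counts: Counter = Counter()
    for tokens in witnesses:
        counts.update(tokens)
    return dict(counts)


def high_frequency_tokens(witnesses: List[List[str]], threshold: int) -> Set[str]:
    """Return tokens whose total count exceeds threshold (an assumed cutoff)."""
    frequencies = tabulate_token_frequencies(witnesses)
    return {token for token, count in frequencies.items() if count > threshold}


sample = [
    ["the", "red", "and", "the", "black", "cat"],
    ["the", "black", "and", "the", "red", "cat"],
    ["the", "black", "cat"],
]
# Tokens returned here would be treated as if they were ambiguous
print(high_frequency_tokens(sample, threshold=3))
```

How to choose the threshold (absolute count, relative frequency, something else) is still open.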

## Procedure

1. Build MFS out of the box (OOTB), keeping only the first instance in case of repetition. Use the compact feature to remove subsequences. (This requires the more elaborate source code, not the briefer version below.) Or use topk instead of frequent?
1. Prioritize MFS based first on depth and then length. Start with only full-depth MFS.
1. Place MFS tokens in order, checking each token for ambiguity and placing all of the unambiguous ones.
1. After each full-depth partitioning, each partition is a separate alignment task, so go back to step #1, above, separately for each partition and build new MFS.
1. When there are no more full-depth MFS, change strategy. How? Use MFS that are not full-depth? Go back to blocks?
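
The prioritization in step 2 might look something like the sketch below. It assumes results shaped like the `(depth, pattern)` tuples that the MFS cell further down produces, where a pattern is a list of `(token, positions)` tuples. The helper names `prioritize` and `full_depth_only` are hypothetical, and the sample results are hand-written for illustration, not real output.

```python
from typing import List, Tuple

Position = Tuple[int, int]                # (witness number, token offset)
Pattern = List[Tuple[str, List[Position]]]
Result = Tuple[int, Pattern]              # (depth, pattern)


def prioritize(results: List[Result]) -> List[Result]:
    """Sort by depth first and pattern length second, largest first."""
    return sorted(results, key=lambda r: (r[0], len(r[1])), reverse=True)


def full_depth_only(results: List[Result], witness_count: int) -> List[Result]:
    """Keep only patterns that occur in every witness."""
    return [r for r in results if r[0] == witness_count]


# Hand-written illustrative results for three witnesses
sample_results: List[Result] = [
    (3, [("the", [(0, 0), (1, 0), (2, 0)])]),
    (2, [("the", [(0, 0), (1, 0)]), ("red", [(0, 1), (1, 4)])]),
    (3, [("the", [(0, 0), (1, 0), (2, 0)]), ("black", [(0, 4), (1, 1), (2, 1)])]),
]

for depth, pattern in prioritize(full_depth_only(sample_results, witness_count=3)):
    print(depth, [token for token, _ in pattern])
```

This only covers ordering and the full-depth filter; it does not address what to do once the full-depth MFS run out.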

## What to do next

1. Identify ambiguous tokens in MFS so that we can skip them.
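
One possible starting point, assuming that a token counts as ambiguous when it occurs more than once in at least one witness (that definition is our working assumption here, not something the code below has settled). The function names are hypothetical.

```python
from collections import Counter
from typing import List, Set


def ambiguous_tokens(witnesses: List[List[str]]) -> Set[str]:
    """Return tokens that are repeated within at least one witness."""
    ambiguous: Set[str] = set()
    for tokens in witnesses:
        ambiguous.update(t for t, n in Counter(tokens).items() if n > 1)
    return ambiguous


def unambiguous_part(pattern_tokens: List[str], witnesses: List[List[str]]) -> List[str]:
    """Drop ambiguous tokens from a pattern, keeping the rest in order."""
    skip = ambiguous_tokens(witnesses)
    return [t for t in pattern_tokens if t not in skip]


toy_witnesses = [
    ["the", "red", "and", "the", "black", "cat"],
    ["the", "black", "and", "the", "red", "cat"],
    ["the", "black", "cat"],
]
print(ambiguous_tokens(toy_witnesses))
print(unambiguous_part(["the", "black", "cat"], toy_witnesses))
```

Inside a partition the check would presumably run over the partition's tokens only, since a token repeated in the whole witness may be unique within a partition.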

In [1]:
# Imports
%load_ext autoreload
%autoreload 2
%load_ext line_profiler

from collections import defaultdict, deque
from typing import Set, List
from dataclasses import dataclass
from bitarray import bitarray
import networkx as nx
import re
import queue

import numpy as np
import numba as nb
from numba import jit

import graphviz
from IPython.display import SVG

import pprint
pp = pprint.PrettyPrinter(indent=2)
debug = False

In [2]:
# Load data into plain_text_witnesses (dictionary)
#
# Load first chapter of six editions of the Origin of species from disk
# Each paragraph is a line, with trailing newlines and intervening blank lines, which we strip on import
# sigla = ['w0', 'w1', 'w2', 'w3', 'w4', 'w5']
# filenames = ['darwin1859.txt', 'darwin1860.txt', 'darwin1861.txt', 'darwin1866.txt', 'darwin1869.txt', 'darwin1872.txt', ]
sigla = ['w0', 'w3', 'w4']
filenames = ['darwin1859.txt', 'darwin1866.txt', 'darwin1869.txt']
first_paragraph = 0
last_paragraph = 1
how_many_paragraphs = last_paragraph - first_paragraph
plain_text_witnesses = {}
for siglum, filename in zip(sigla, filenames):
    with open(filename) as f:
        # Drop blank lines and strip trailing newlines, so that each remaining line is one paragraph
        lines = [line.strip() for line in f if line.strip()]
    plain_text_witnesses[siglum] = " ".join(lines[first_paragraph : last_paragraph])
if debug:
    print(f"{how_many_paragraphs} paragraphs from {len(sigla)} witnesses")

In [3]:
# Tokenize witnesses
def tokenize_witnesses(witness_strings: List[str]): # one string per witness
    '''Return list of witnesses, each represented by a list of tokens'''
    # TODO: handle punctuation, upper- vs lowercase
    witnesses = []
    for witness_string in witness_strings:
        witness_tokens = witness_string.split()
        witnesses.append(witness_tokens)
    return witnesses

In [4]:
# Witness sigla and witness token lists
witness_sigla = list(plain_text_witnesses)
witness_token_lists = tokenize_witnesses(list(plain_text_witnesses.values())) # list of lists

In [5]:
# Create MFS; void, updates global (results)
# TODO: get rid of global (challenging with a recursive function)
# TODO: get rid of brute force repeated transitions
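def frequent_rec(patt, mdb):
    """Recursively collect frequent sequences into the global results list.

    patt is the pattern so far, a list of (token, positions) tuples.
    mdb is a list of (witness number, token offset) tuples, one per witness
    that contains patt, pointing at the last matched token (-1 at the start).
    """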
    results.append((len(mdb), patt))
    print(len(results))
    if len(results) > 10:
        print(results)
        raise Exception("Results list too long!")

    occurs = defaultdict(list) # keys are token strings, values are lists of tuples of (witness number, witness token offset)
    for (i, startpos) in mdb:
        seq = db[i] # witness tokens
        for j in range(startpos + 1, len(seq)): # index into witness tokens
            l = occurs[seq[j]] # positions recorded so far for the token at position j
            if len(l) == 0 or l[-1][0] != i: # no entry yet, or last entry is from another witness (keep only first occurrence per witness)
                l.append((i, j))
    for (c, newmdb) in occurs.items(): # c is word token, newmdb is list of tuples
        if len(newmdb) >= minsup: # number of witnesses in which c occurs after the current pattern
            frequent_rec(patt + [(c, newmdb)], newmdb)
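
# One possible way to address the "get rid of global" TODO is to turn the
# recursion into a generator, so that the caller collects the results.
# This is a sketch only: frequent_sequences is a hypothetical name, db and
# minsup become parameters instead of globals, and the debugging guard on
# the length of the results list is left out. The traversal is the same.

```python
from collections import defaultdict
from typing import Iterator, List, Optional, Tuple

Position = Tuple[int, int]                 # (witness number, token offset)
Pattern = List[Tuple[str, List[Position]]]


def frequent_sequences(
    db: List[List[str]],
    minsup: int,
    patt: Optional[Pattern] = None,
    mdb: Optional[List[Position]] = None,
) -> Iterator[Tuple[int, Pattern]]:
    """Yield (depth, pattern) for every sequence found in at least minsup witnesses."""
    if patt is None or mdb is None:
        patt, mdb = [], [(i, -1) for i in range(len(db))]
    yield len(mdb), patt
    occurs = defaultdict(list)             # token -> first position after patt, per witness
    for i, startpos in mdb:
        seq = db[i]
        for j in range(startpos + 1, len(seq)):
            positions = occurs[seq[j]]
            if not positions or positions[-1][0] != i:
                positions.append((i, j))
    for token, newmdb in occurs.items():
        if len(newmdb) >= minsup:
            yield from frequent_sequences(db, minsup, patt + [(token, newmdb)], newmdb)


toy_db = [["a", "b"], ["a", "b"], ["b"]]
toy_results = list(frequent_sequences(toy_db, minsup=2))
for depth, pattern in toy_results:
    print(depth, [token for token, _ in pattern])
```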

In [7]:
# db = [
#     ["the", "red", "and", "the", "black", "cat"],
#     ["the", "black", "and", "the", "red", "cat"],
#     ["the", "black", "cat"],
# ]
db = witness_token_lists
minsup = 2 # global constant, used by frequent_rec()

results = []

frequent_rec([], [(i, -1) for i in range(len(db))]) # void; updates global results (list) in place

for result in sorted(results, key=lambda x: (x[0], len(x[1])), reverse=True):
    pp.pprint(result)

1
2
3
4
5
6
7
8
9
10
11
[(3, []), (3, [('WHEN', [(0, 0), (1, 3), (2, 3)])]), (3, [('WHEN', [(0, 0), (1, 3), (2, 3)]), ('we', [(0, 1), (1, 4), (2, 4)])]), (2, [('WHEN', [(0, 0), (1, 3), (2, 3)]), ('we', [(0, 1), (1, 4), (2, 4)]), ('look', [(0, 2), (1, 5)])]), (2, [('WHEN', [(0, 0), (1, 3), (2, 3)]), ('we', [(0, 1), (1, 4), (2, 4)]), ('look', [(0, 2), (1, 5)]), ('to', [(0, 3), (1, 6)])]), (2, [('WHEN', [(0, 0), (1, 3), (2, 3)]), ('we', [(0, 1), (1, 4), (2, 4)]), ('look', [(0, 2), (1, 5)]), ('to', [(0, 3), (1, 6)]), ('the', [(0, 4), (1, 7)])]), (2, [('WHEN', [(0, 0), (1, 3), (2, 3)]), ('we', [(0, 1), (1, 4), (2, 4)]), ('look', [(0, 2), (1, 5)]), ('to', [(0, 3), (1, 6)]), ('the', [(0, 4), (1, 7)]), ('individuals', [(0, 5), (1, 8)])]), (2, [('WHEN', [(0, 0), (1, 3), (2, 3)]), ('we', [(0, 1), (1, 4), (2, 4)]), ('look', [(0, 2), (1, 5)]), ('to', [(0, 3), (1, 6)]), ('the', [(0, 4), (1, 7)]), ('individuals', [(0, 5), (1, 8)]), ('of', [(0, 6), (1, 9)])]), (2, [('WHEN', [(0, 0), (1, 3), (2, 3)]),

Exception: Results list too long!