# Code for Chapter 4: Learning Mappings

This notebook provides the code written for and used in Chapter 4 of my dissertation **_SigmaPie_ for subregular and subsequential grammar induction**. All the links will be added soon. :)

# Generators and evaluators: the setup for the experiments

## Step 1: loading dependencies, including _SigmaPie_

In [1]:
import codecs
from random import choice, randint
from pprint import pprint

In [2]:
# accessing SigmaPie toolkit: I know, horrible!
# I promise I'll make it a package soon
%cd local_sigmapie/code/
from main import *
%cd ../..

/home/alenaks/subregular-experiments/local_sigmapie/code

You successfully loaded SigmaPie. 

Formal language classes and grammars available:
	* strictly piecewise: SP(alphabet, grammar, k, data, polar);
	* strictly local: SL(alphabet, grammar, k, data, edges, polar);
	* tier-based strictly local: TSL(alphabet, grammar, k, data, edges, polar, tier);
	* multiple tier-based strictly local: MTSL(alphabet, grammar, k, data, edges, polar).

Alternatively, you can initialize a transducer: FST(states, sigma, gamma, initial, transitions, stout).
Learning algorithm:
	OSTIA: ostia(sample, sigma, gamma).
/home/alenaks/subregular-experiments


## Step 2: defining a general harmonic generator

Here, I describe the artificial harmonic generator (AHG) that I use throughout Chapters 3 and 4 of my dissertation.
It can generate two types of samples:

* Samples of **well-formed words**, i.e. words that don't violate the rules of the harmony; and
* Samples of **underlying -> surface forms**, i.e. pairs in which the underlying form specifies only the first occurrence of every harmonic class (this is the feature value that needs to be spread), while all later members of the same class are masked with the name of that class.
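
To make the masking concrete, here is a minimal standalone sketch of it. The function name `mask_word` is hypothetical and is not part of the `Harmony` class defined below; it only assumes that classes are given as a dictionary from member tuples to class names.

```python
def mask_word(word, classes):
    """Keep the first member of each harmonic class and mask the later ones.

    `classes` maps a tuple of class members to the class name, e.g.
    {("a", "o"): "A"}. Single-member (transparent) classes stay unmasked.
    """
    # map every harmonizing symbol to the name of its class
    symbol_to_class = {s: name for members, name in classes.items()
                       for s in members if len(members) > 1}
    seen = set()   # classes whose first member was already emitted
    masked = []
    for symbol in word:
        name = symbol_to_class.get(symbol)
        if name is None:
            masked.append(symbol)   # transparent item or any other symbol
        elif name in seen:
            masked.append(name)     # non-initial member: mask it
        else:
            seen.add(name)          # first member: keep its value
            masked.append(symbol)
    return "".join(masked)


classes = {("a", "o"): "A", ("b", "p"): "B"}
print(mask_word("abbaabab", classes))   # abBAABAB
```

The surface form `abbaabab` keeps its first `a` and its first `b`, and every later vowel or consonant of those classes is replaced by `A` or `B`.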

### Parameters of the generator

List of the parameters that are available:

* number of strings to be generated;
* harmonic classes and their members (harmonic class is a class of segments that don't co-occur unless there is a blocker in-between them);
* minimal and maximal cluster length of each of the harmonic classes;
* blockers and the new domain that they introduce;
* the probability of observing a blocker (1 / n, where n is the parameter): on average, roughly every n-th cluster is a blocker.
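
As a quick sanity check of the last parameter, the sketch below estimates how often a draw of `randint(1, n) == 1` succeeds, which is the test the generator uses to decide whether to emit a blocker. The helper name `blocker_rate`, the number of trials and the seed are my own choices for illustration.

```python
from random import Random


def blocker_rate(n, trials=100_000, seed=0):
    """Estimate how often randint(1, n) == 1, i.e. how often a blocker is drawn."""
    rng = Random(seed)   # a private, seeded generator keeps the check repeatable
    hits = sum(rng.randint(1, n) == 1 for _ in range(trials))
    return hits / trials


for n in (2, 4, 5):
    print(n, round(blocker_rate(n), 3), "expected about", round(1 / n, 3))
```

The estimates should land close to 1 / n; they are not exact because the draws are random.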

In [3]:
class Harmony(object):
    """
    Class defining the toy generator for the harmonic datasets.
    
    Attributes:
        cl_members (dict): dictionary of the type {(harmonic_class_1):class_id_1,
            (harmonic_class_2):class_id_2, ...} that contains info about the present
            harmonic classes. Note that the transparent element can be encoded by 
            a harmonic class containing a single element.
            Example: {("a", "o"):"A", ("b", "p"):"B", ("c"):"C"}
        cl_lengths (dict): dictionary of the type {class_id:(min_len, max_len)},
            where min_len and max_len denote the min and max len of the cluster
            made out of elements of class_id.
            Example: {"A":(1, 3), "B":(2, 4), "C":(4, 8)}
        blockers (dict): dictionary of the type {"b_1":"u_1", "b_2":"u_2", ...} where
            "b" is the blocker, and "u" is the newly introduced value.
            Example: {"t":"p"}
        blocker_prob (int): the chance of observing a blocker; the probability
            is 1 / blocker_prob.
            Example: 5
    """
    def __init__(self, cl_members, cl_lengths = None, blockers = None, blocker_prob = 5):
        """
        Init function for the Harmony class.
        """
        self.cl_members = cl_members
        if cl_lengths is not None:
            self.cl_lengths = cl_lengths
        else:
            self.cl_lengths = {i:(1, 3) for i in self.cl_members.values()}
        self.blockers = blockers
        self.blocker_prob = blocker_prob
        

        
    def generate_words(self, n = 3, length = 10):
        """
        Generates n strings of a given length.
        
        Arguments:
            n (int): how many strings need to be generated;
            length (int): length of the strings.
            
        Returns:
            list[str]: n generated strings.
        """
        # check if the harmony rules are well-formed
        if not self._verify_classes():
            raise ValueError("Cannot generate dataset: the sets are overlapping.")
            
        # unpack the dictionary for a quicker lookup
        unpacked = self._unpack_classes()
        transparent = self._transparent()
        generated = [self._generate(unpacked, length) for i in range(n)]
        return generated
    

    def generate_pairs(self, n = 3, length = 10):
        """
        Generates n pairs of strings of a given length.
        
        Arguments:
            n (int): how many strings need to be generated;
            length (int): length of the strings.
            
        Returns:
            list[tuple[str]]: n generated pairs of strings.
        """
        transparent = self._transparent()
        outputs = self.generate_words(n, length)
        inputs = self._mask_words(outputs, transparent)
        return list(zip(inputs, outputs))
        
        
    def _generate(self, unpacked, length):
        """
        Generates a single string of the given length; helper function.
        
        Output type: str
        """
        
        # initialize the specifications of this particular string
        string = ""
        specs = self._specify()
        
        while len(string) < length:
            
            
            # check if we can now output the blocker
            if self.blockers is not None:
                while randint(1, self.blocker_prob) == 1:
                    b = choice(list(self.blockers))
                    string += b
                    
                    if len(string) == length:
                        return string
                    
                    # rewrite the specification because of the blocker
                    if self.blockers[b] not in specs:
                        for spec in specs:
                            if unpacked[spec] == unpacked[self.blockers[b]]:
                                specs.remove(spec)
                                specs.append(self.blockers[b])
                                break
                                
            # make sure that we don't generate a cluster of the same
            # harmonic set as the previous one
            if len(string) > 0:
                change = string[-1] in unpacked
            else:
                change = False
            
            # select and add new possible character as many times as
            # cl_lengths indicate
            if not change:
                newchar = choice(specs)
            else:
                collection = [i for i in specs]
                collection.remove(string[-1])
                newchar = choice(collection)
            freq_b, freq_e = self.cl_lengths[unpacked[newchar]]
            string += newchar * randint(freq_b, freq_e)
            
            # output
            if len(string) > length:
                string = ""
            elif len(string) == length:
                return string
            
            
    def _mask(self, string, transparent):
        """
        Masks all non-initial mentions of the specified allophone: helper function.
        
        Output type: str
        """
        classes = {i:False for i in self.cl_members.keys()}
        undergoers = self._undergoers()
        new = ""
        for s in string:
            if (s in undergoers) and (s not in transparent.values()):
                for c in classes:
                    
                    # rewrite the non-initial mention of the harmonic set member
                    # as its harmony_class_id
                    if s in c and not classes[c]:
                        classes[c] = True
                        new += s
                    elif s in c:
                        new += self.cl_members[c]
            else:
                new += s
        return new

    
    def _mask_words(self, words, transparent):
        """
        Masks every word of a given list; helper function.
        
        Output type: list[str]
        """
        return [self._mask(w, transparent) for w in words]
            
            
    def _undergoers(self):
        """
        Collects all undergoers; helper function.
        
        Output type: list[char]
        """
        items = []
        for i in self.cl_members:
            items.extend(list(i))
        return items
    
    def _transparent(self):
        """
        Checks if there are transparent items, i.e. if there is
        a harmonic class or classes that only contain a single item.
        
        Output type: dict[str:str]
        """
        transparent = dict()
        for i in self.cl_members:
            if len(i) == 1:
                transparent[self.cl_members[i]] = i[0]
        return transparent
        
        
    def _verify_classes(self):
        """
        Verifies that no set (harmonic sets or the set of blockers)
        overlaps with each other.
        
        Output type: bool
        """
        items = self._undergoers()
        if self.blockers is not None:
            block_ok = all([i not in items for i in self.blockers])
        else:
            block_ok = True
        return len(items) == len(set(items)) and block_ok
    
    
    def _unpack_classes(self):
        """
        Creates a dictionary where every harmonizing element 
        is mapped to its harmonic class; helps to optimize 
        the lookup of this information.
        
        Output type: dict
        """
        items = self._undergoers()
        unpacked = {}
        for i in items:
            for j in self.cl_members:
                if i in j:
                    unpacked[i] = self.cl_members[j]
        return unpacked

    
    def _specify(self):
        """
        Randomly initialize a specification from all given
        harmonic datasets.
        
        Output type: list[char]
        """
        return list(map(choice, self.cl_members.keys()))

### Examples of the data generated by AHG

#### Parallel vowel and consonant harmonies
Harmony of a class "A" that contains "a" and "o" and of a class "B" that contains "b" and "p". Linguistically, these are simultaneous and independent vowel and consonant harmonies.

In [4]:
s1 = {("a", "o"):"A", ("b", "p"):"B"}
h1 = Harmony(s1)

Now, let's generate a sample of well-formed words.

In [5]:
print(h1.generate_words(n = 5, length = 10))

['aaabbabbaa', 'bboobooobo', 'aappaaapap', 'bbbaabbaaa', 'appaappaaa']
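
The sample printed above can be double-checked with a short independent sketch: a word respects the two parallel harmonies if it contains at most one member of every harmonic class. The helper `respects_harmony` is hypothetical and only covers the blocker-free case.

```python
def respects_harmony(word, classes):
    """True if the word uses at most one member of every harmonic class."""
    return all(sum(member in word for member in members) <= 1
               for members in classes)


classes = [("a", "o"), ("b", "p")]
sample = ['aaabbabbaa', 'bboobooobo', 'aappaaapap', 'bbbaabbaaa', 'appaappaaa']
print(all(respects_harmony(w, classes) for w in sample))   # True
```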


And now, a sample of well-formed pairs.

In [6]:
pprint(h1.generate_pairs(n = 4, length = 8))

[('abBAABAB', 'abbaabab'),
 ('bBoAABAB', 'bbooobob'),
 ('aAbBBABB', 'aabbbabb'),
 ('pBBaBAAB', 'pppapaap')]


#### Harmony with a transparent element

Transparent (or irrelevant) items, which only introduce the long-distance effect in the dataset, can be modeled by providing an extra harmonic class with just a single item in it.

In [7]:
s2 = {("a", "o"):"A", ("x"):"X"}
l2 = {"A":(1, 2), "X":(2, 4)}
h2 = Harmony(s2, l2)

Now, let us generate some well-formed words.

In [8]:
print(h2.generate_words(n = 5, length = 10))

['aaxxxaaxxx', 'aaxxxaxxxx', 'xxxaaxxxxa', 'aaxxxaxxaa', 'oxxxoxxxxo']


And now the underlying and surface forms. Note that transparent items are not masked!

In [9]:
pprint(h2.generate_pairs(n = 4, length = 8))

[('xxoxxxAA', 'xxoxxxoo'),
 ('xxxaAxxx', 'xxxaaxxx'),
 ('oAxxAxxx', 'ooxxoxxx'),
 ('xxaxxxxA', 'xxaxxxxa')]


#### Parallel vowel and consonant harmonies with a blocking effect

Harmony of a class "A" and of a class "B", where once "t" has occurred, "b" cannot be observed anymore: class "B" changes its specification to "p". In other words, "t" is a blocker that only allows "p" after itself.

Additionally, clusters of A-elements are 1 to 3 elements long, and clusters of B-elements are 2 to 4 elements long. The probability of observing the blocker is $\frac{1}{4}$ at every step of the generation.

In [10]:
s3 = {("a", "o"):"A", ("b", "p"):"B"}
l3 = {"A":(1, 3), "B":(2, 4)}
b3 = {"t":"p"}
p3 = 4
h3 = Harmony(s3, l3, b3, p3)

Let's first generate some well-formed words.

In [11]:
print(h3.generate_words(n = 5, length = 10))

['aaabbbttpp', 'oooppppooo', 'tpppaatppt', 'ppooppppoo', 'appptppppa']


And now, some underlying and surface forms.

In [12]:
pprint(h3.generate_pairs(n = 5, length = 15))

[('pBBBttotBBABBBB', 'ppppttotppopppp'),
 ('tapBBAtBBBBABBA', 'tapppatppppappa'),
 ('apBABBBBAAABBBB', 'appappppaaapppp'),
 ('pBBtttBBBBoAAtA', 'ppptttppppoooto'),
 ('pBBoAABBAAABBBA', 'pppoooppooopppo')]
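
The mapping that a learner has to acquire from such pairs can be spelled out by hand. The sketch below is my own illustration, not part of the generator: the hypothetical function `surface_form` tracks the current value of every class, lets a blocker overwrite it, and fills in each masked symbol.

```python
def surface_form(ur, classes, blockers=None):
    """Rebuild a surface form from a masked underlying form."""
    blockers = blockers or {}
    # map every concrete value to the name of its class, e.g. "p" -> "B"
    value_to_class = {v: name for members, name in classes.items()
                      for v in members}
    current = {}   # class name -> value currently being spread
    out = []
    for symbol in ur:
        if symbol in value_to_class:
            # a specified value: remember it and copy it to the output
            current[value_to_class[symbol]] = symbol
            out.append(symbol)
        elif symbol in blockers:
            # a blocker imposes a new value on the class of that value
            new_value = blockers[symbol]
            current[value_to_class[new_value]] = new_value
            out.append(symbol)
        elif symbol in current:
            out.append(current[symbol])   # masked symbol: spread the value
        else:
            out.append(symbol)
    return "".join(out)


classes = {("a", "o"): "A", ("b", "p"): "B"}
blockers = {"t": "p"}
print(surface_form("tapBBAtBBBBABBA", classes, blockers))   # tapppatppppappa
```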


## Step 3: Turkish generators and evaluators

I will use the following two functions to verify the well-formedness of generated Turkish or fake Turkish words:
  * `backness_harmony` takes a string as input and tells whether that string is well-formed with respect to the rules of Turkish backness harmony;
  * `rounding_harmony` does the same for rounding harmony.

In [13]:
def backness_harmony(string):
    """
    Tells if a string is well-formed according to rules
    of Turkish backness harmony.
    """
    # I, a, o, u are the back vowels; i, e, O, U are the front vowels
    back_class, front_class = "Iaou", "ieOU"
    front, back = False, False
    
    for v in front_class + back_class:
        if v in string:
            front = True if v in front_class else front
            back = True if v in back_class else back

    return not (front and back)

In [14]:
def rounding_harmony(string):
    """
    Tells if a string is well-formed according to rules
    of Turkish rounding harmony.
    """
    high, low, rounded = "iIuU", "aeoO", "uUoO"
    
    vowels = "".join([v for v in string if v in high + low])
    if len(vowels) < 2:
        return True
    
    ro = vowels[0] in rounded
    
    for v in vowels[1:]:
        if v in low:
            if v in rounded:
                return False
            ro = False
        elif (ro and v not in rounded) or (not ro and v in rounded):
            return False
            
    return True

In [15]:
def backness_and_rounding(string):
    return backness_harmony(string) and rounding_harmony(string)

Additionally, to generate simplified Turkish data I will be using `turkish_word` and `generate_turkish_words`, which generate a single word and a dataset, respectively.

Their parameters are:
* `length` is the desired length of the Turkish word;
* `cons` is the choice of the "consonant" that separates the vowels;
* `vowel_cluster` is a tuple of integers representing the minimal and maximal length of the vowel cluster;
* `cons_cluster` is a tuple of integers representing the minimal and maximal length of the consonantal cluster;
* `n` (available for `generate_turkish_words` only) is the number of examples that need to be generated.

In [16]:
def turkish_word(length = 10, cons = "x", vowel_cluster = (1, 2),
                          cons_cluster = (0, 3)):
    """
    This generator generates fake Turkish words: namely, the words in which
    the harmonic system and rules of Turkish are preserved, but all consonants
    were substituted by a single given consonant.
    
    Arguments:
    * length (int): a length of a word that needs to be generated;
    * cons (str): a single character (or an empty string if only vowels
                  need to be generated), a "choice" of the consonant 
                  that makes this harmony long-distant;
    * vowel_cluster (tuple[int, int]): a tuple of integers representing
                                       minimal and maximal length of
                                       the vowel cluster;
    * cons_cluster (tuple[int, int]): a tuple of integers representing
                                      minimal and maximal length of
                                      the consonantal cluster.
                                      
    Returns:
    * str: a fake Turkish harmonic word, where all consonants are masked.
    """
    if length < 1:
        raise ValueError("Words cannot be so short.")
    
    vowels = {
        (True, True, True):"u",
        (True, True, False):"I",
        (True, False, True):"o",
        (True, False, False):"a",
        (False, True, True):"U",
        (False, True, False):"i",
        (False, False, True):"O",
        (False, False, False):"e"
    }
    
    backness = choice([True, False])
    height = choice([True, False])
    rounding = choice([True, False])
    
    specs = (backness, height, rounding)
    word = ""
    
    if choice([0, 1]):
        word += cons * randint(*cons_cluster)
            
    while len(word) < length:
        vc = vowels[specs] * randint(*vowel_cluster)
        
        # this part is needed to avoid the word-initial *oo clusters
        if len(vc) > 1 and not height and rounding:
            rounding = False
            vc = vc[0] + vowels[(backness, height, rounding)] * (len(vc) - 1)
            
        word += vc
        word += cons * randint(*cons_cluster)
        
        height = choice([True, False])
        rounding = False if not height else rounding
        specs = (backness, height, rounding)
        
    return word[:length]

In [17]:
def generate_turkish_words(n = 10, length = 10, cons = "x",
                           vowel_cluster = (1, 2), cons_cluster = (1, 3)):
    """
    This generator generates a list of fake Turkish words.
    
    Arguments:
    * n (int): a number of strings that need to be generated;
    ... for the rest of the arguments, see turkish_word.
    
    Outputs:
    * list: the list containing n fake Turkish words.
    """
    return [turkish_word(length, cons, vowel_cluster, cons_cluster) for i in range(n)]

Additionally, I wrote `turkish_pair` and `generate_turkish_pairs`, which produce pairs (UR -> SF) based on the Turkish words. In this case, all non-initial vowels are masked, and only their height specification remains (`H` or `L`).

In [18]:
def turkish_pair(word):
    """
    This function takes Turkish surface form as input, and returns
    the pair of its underlying representation and the surface form.
    """
    high, low = "iIuU", "aeoO"
    initial = False
    
    UR = ""
    for s in word:
        if s in high + low and not initial:
            UR += s
            initial = True
        elif s in high:
            UR += "H"
        elif s in low:
            UR += "L"
        else:
            UR += s
            
    return (UR, word)

In [19]:
def generate_turkish_pairs(n = 10, length = 10, cons = "x",
                           vowel_cluster = (1, 2), cons_cluster = (1, 3)):
    """
    This generator generates a list of fake Turkish pairs.
    
    Arguments:
    * n (int): a number of strings that need to be generated;
    ... for the rest of the arguments, see turkish_word.
    
    Outputs:
    * list: the list containing n fake Turkish pairs.
    """
    words = [turkish_word(length, cons, vowel_cluster, cons_cluster) for i in range(n)]
    return [turkish_pair(w) for w in words]
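
To show what the UR -> SF mapping amounts to for the fake Turkish data, here is a rough sketch of the inverse of the masking. It assumes the same vowel encoding as the generator (a triple of backness, height and rounding) and the rules checked by the evaluators above: backness comes from the first vowel, a low vowel is unrounded, and a high vowel copies the rounding of the preceding vowel. The names `VOWELS`, `FEATURES` and `turkish_surface` are hypothetical.

```python
# (back, high, rounded) -> vowel, following the table used by the generator
VOWELS = {
    (True, True, True): "u", (True, True, False): "I",
    (True, False, True): "o", (True, False, False): "a",
    (False, True, True): "U", (False, True, False): "i",
    (False, False, True): "O", (False, False, False): "e",
}
FEATURES = {vowel: feats for feats, vowel in VOWELS.items()}


def turkish_surface(ur):
    """Fill in the masked vowels (H or L) of an underlying form."""
    back = rounded = None   # unknown until the first vowel is read
    out = []
    for s in ur:
        if s in FEATURES and back is None:
            # the first vowel is fully specified: read off its features
            back, _, rounded = FEATURES[s]
            out.append(s)
        elif s in "HL" and back is not None:
            high = (s == "H")
            # rounding survives only through an unbroken chain of high vowels
            rounded = rounded and high
            out.append(VOWELS[(back, high, rounded)])
        else:
            out.append(s)   # consonants are copied unchanged
    return "".join(out)


print(turkish_surface("oxHxLxH"))   # oxuxaxI
```

In the example, the first `H` surfaces as rounded `u` after `o`, the `L` surfaces as unrounded `a`, and the final `H` is unrounded `I` because the low vowel before it stopped the rounding.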

## Step 4: other harmonic evaluators

The function `harmonic_evaluator` below takes two arguments: `data` and `rule`. `data` is a list of words that need to be evaluated, and `rule` is the evaluation function for some concrete harmony. This function will be further used in order to evaluate the performance of the learners on the generated datasets.

In [20]:
def harmonic_evaluator(data, rule):
    """
    Evaluates the provided data with respect to a given
    rule of harmony.
    
    Arguments:
    * data (list[str]): a list of strings that need to be evaluated;
    * rule (function): a function that evaluates a string according
                       to some harmony.
                       
    Results:
    * Prints the report that shows if the data follows the rule.
    """
    correct = 0
    for w in data:
        correct = (correct + 1) if rule(w) else correct
        
    ratio = (correct / len(data))
    print(f"Percentage of harmonic words: {int(ratio * 100)}%.")
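
If the score is needed as a number rather than as a printed report, a variant along the following lines may be convenient. This is only a sketch: `harmonic_ratio` and the toy rule `no_a_and_o` are hypothetical names, and returning 0.0 for an empty dataset is my own choice.

```python
def harmonic_ratio(data, rule):
    """Return the share of words in `data` that satisfy `rule` (0.0 if empty)."""
    if not data:
        return 0.0
    return sum(1 for w in data if rule(w)) / len(data)


def no_a_and_o(word):
    """Toy rule: 'a' and 'o' cannot occur in the same word."""
    return not ("a" in word and "o" in word)


print(harmonic_ratio(["aaxa", "oxo", "axo", "xx"], no_a_and_o))   # 0.75
```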

### Finnish

The function `front_harmony` tells whether a given string follows the rule of Finnish vowel harmony.

In [21]:
def front_harmony(string):
    """
    Tells if a string is well-formed according to rules
    of Finnish backness harmony.
    """
    front_class, back_class = "AOy", "aou"
    front, back = False, False
    
    for v in front_class + back_class:
        if v in string:
            front = True if v in front_class else front
            back = True if v in back_class else back

    return not (front and back)

### Fake harmonies evaluators

This section would need to eventually be redone.

In [22]:
def single_harmony_no_blockers(string):
    """
    Checks if a single [a, o] harmony is well-formed.
    """
    return not("a" in string and "o" in string)

In [23]:
def single_harmony_with_blockers(string):
    """
    Checks if a single [a, o] harmony with a blocker f:a is well-formed.
    """
    if "f" in string:
        s1 = string[:string.index("f")]
        s2 = string[string.index("f") + 1:]
        return single_harmony_no_blockers(s1) and (not "o" in s2)
    else:
        return single_harmony_no_blockers(string)

In [24]:
def double_harmony(string, group = ["a", "o", "u", "e"]):
    """
    Tells if a string contains at most one of the four
    vowels of the harmonic group, i.e. checks that vowels
    of different classes do not co-occur within one word.
    
    Arguments:
    * string (str): a string that needs to be verified;
    * group (list[char]): the harmonic class.
    """
    assert len(group) == 4
    classes = 0
    
    for i in group:
        classes = (classes + 1) if i in string else classes
        
    return classes in [0, 1]

In [25]:
def double_harmony_no_blockers(string):
    """
    Checks if a double [a, o] and [b, p] harmony is well-formed.
    """
    vowels = not("a" in string and "o" in string)
    consonants = not("b" in string and "p" in string)
    return vowels and consonants

In [26]:
def double_harmony_with_blockers(string):
    """
    Checks if a double [a, o] and [b, p] harmony with a blocker t:p
    is well-formed.
    """
    if "a" in string and "o" in string:
        return False
    
    if "t" in string:
        s1 = string[:string.index("t")]
        s2 = string[string.index("t") + 1:]
        return double_harmony_no_blockers(s1) and ("b" not in s2)
    else:
        return double_harmony_no_blockers(string)

## Step 5: Word-final devoicing generators and evaluators

The functions `word_final_devoicing` and `generate_wfd` imitate the process of word-final devoicing.
The former generates a string or a pair of strings (UR -> SF) implementing that rule, and the latter generates a dataset consisting of such strings or pairs.

Their arguments are the following:
* `sigma` is a list of symbols that can be used in the words;
* `devoice` contains two tuples, where the first tuple represents voiced obstruents, and the second one stands for their voiceless counterparts;
* `length` is the length of the intended words;
* if `pairs` is True, (UR, SF) pairs will be returned; if False, only the surface forms;
* `n` (available only for `generate_wfd`) is a number of strings or pairs that need to be generated.
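
The `devoice` argument pairs voiced and voiceless segments by position. The deterministic sketch below illustrates that pairing with three obstruents; the function `devoice_final` and its default segments are hypothetical. Note that a one-element tuple needs a trailing comma, as in `("b",)`: the defaults written as `("b")` in this notebook are plain strings, which behave the same way only because every segment is a single character.

```python
def devoice_final(word, voiced=("b", "d", "g"), voiceless=("p", "t", "k")):
    """Replace a word-final voiced segment by its voiceless counterpart."""
    if word and word[-1] in voiced:
        # counterparts are matched by their position in the two tuples
        return word[:-1] + voiceless[voiced.index(word[-1])]
    return word


for ur in ["bad", "bag", "abba", "dab"]:
    print(ur, "->", devoice_final(ur))
# bad -> bat
# bag -> bak
# abba -> abba
# dab -> dap
```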

In [27]:
def word_final_devoicing(sigma = ("a", "b", "p"), devoice = (("b"), ("p")),
                         length = 10, pairs = False):
    """
    This function generates either a word grammatical with respect to a rule
    of the word final devoicing, or a fake UR -> SF pair.
    
    Arguments: 
    * sigma (list[str]): a list of symbols that can be used in the words;
    * devoice (tuple[tuple, tuple]): the first tuple represents voiced
                                     obstruents, and the second one stands
                                     for their voiceless counterparts;
    * length (int): a length of the intended words;
    * pairs (bool): if True, (UR, SF) pairs will be returned, if False, only
                    the surface forms.
                    
    Outputs:
    * str/tuple: a string or a tuple of strings (depending on the parameter 
                 `pairs`) representing the application of the word-final 
                 devoicing.
    """
    if length < 1:
        raise ValueError("The string has a very weird length.")
        
    before, after = devoice
    string = "".join([choice(sigma) for i in range(length)])
    
    if string[-1] not in before:
        return (string, string) if pairs else string
    
    devoiced = string[:-1] + after[before.index(string[-1])]
    return (string, devoiced) if pairs else devoiced

In [28]:
def generate_wfd(n = 10, sigma = ("a", "b", "p"), devoice = (("b"), ("p")),
                 length = 10, pairs = False):
    """
    Generates a set of strings or pairs that satisfy the rule of
    the word-final devoicing.
    
    Arguments:
    * n (int): the number of strings that need to be generated;
    ... for the rest of the arguments see word_final_devoicing.
    
    Outputs:
    * list: a list of strings or tuples (depending on the parameter `pairs`)
            representing the application of the word-final devoicing.
    """
    return [word_final_devoicing(sigma, devoice, length, pairs) for i in range(n)]

The following functions `evaluate_wfd_words` and `evaluate_wfd_pairs` evaluate words and pairs of words with respect to the rules of the word-final devoicing.

In [29]:
def evaluate_wfd_words(data, voiced = ("b")):
    """
    Evaluates the provided words with respect to the rule 
    of the word-final devoicing.
    
    Arguments:
    * data (list[str]): a list of strings that need to be evaluated;
    * voiced (tuple[char]): a list of voiced characters, i.e. those
                            that cannot be word-final.
                       
    Results:
    * Prints the report that shows if the data follows the rule.
    """
    correct = 0
    for w in data:
        
        if not len(w):
            correct += 1
            continue
            
        correct = (correct + 1) if w[-1] not in voiced else correct
        
    ratio = (correct / len(data))
    print(f"Percentage of well-formed words: {int(ratio * 100)}%.")

In [30]:
def evaluate_wfd_pairs(data, devoice = (("b"), ("p"))):
    """
    Evaluates the provided pairs with respect to the rule 
    of the word-final devoicing.
    
    Arguments:
    * data (list[tuple[str, str]]): a list of (UR, SF) pairs that need
                                    to be evaluated;
    * devoice (tuple[tuple, tuple]): the first tuple represents voiced
                                     obstruents, and the second one stands
                                     for their voiceless counterparts.
                       
    Results:
    * Prints the report that shows if the data follows the rule.
    """
    correct = 0
    before, after = devoice
    
    for w in data:
        
        UR, SF = w
        assert len(UR) == len(SF)
        
        if not len(UR):
            correct += 1
            continue
        
        if UR[-1] not in before:
            correct = (correct + 1) if UR == SF else correct
            continue
        
        SF_bar = UR[:-1] + after[before.index(UR[-1])]
        correct = (correct + 1) if SF == SF_bar else correct
        
    ratio = (correct / len(data))
    print(f"Percentage of well-formed words: {int(ratio * 100)}%.")

As before, we can generate some words or pairs of words representing the rule of the word-final devoicing, and then check if the evaluator considers that those datasets are well-formed.

In [31]:
evaluate_wfd_words(generate_wfd(n = 1000, pairs = False))
evaluate_wfd_pairs(generate_wfd(n = 1000, pairs = True))

Percentage of well-formed words: 100%.
Percentage of well-formed words: 100%.


## Step 6: UTP generator and evaluator

The function `generate_tonal_pattern` takes the length of the string that needs to be generated, and returns a random string of high (H) and low (L) tones as output. `utp_tones` takes that string of tones as input, and rewrites it according to the UTP rules: no L tones are allowed in-between two H tones.

In [32]:
def generate_tonal_pattern(length = 5):
    """ Generates a random sequence of tones of a given length. """
    return "".join(choice(["H", "L"]) for i in range(length))

In [33]:
def utp_tones(string):
    """ Rewrites a tonal string with respect to the rules of UTP. """
    
    if set(string) not in [{"H", "L"}, {"H"}, {"L"}, set("")]:
        print(string)
        raise ValueError("Unexpected symbols in the tonal string!")
    if not ("H" in string and "L" in string):
        return string
    
    first_h = string.find("H")
    last_h = len(string) - string[::-1].find("H")
    return string[:first_h] + "H" * (last_h - first_h) + string[last_h:]
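
For a worked example of the rule, the same plateauing can be sketched with `str.find` and `str.rfind`: everything between the first and the last H becomes H, and the rest of the string stays as it is. The name `plateau` is hypothetical, and unlike `utp_tones` this version does not validate the alphabet.

```python
def plateau(tones):
    """Turn every tone between the first and the last H into H."""
    first, last = tones.find("H"), tones.rfind("H")
    if first == -1:   # no H at all: nothing to spread
        return tones
    return tones[:first] + "H" * (last - first + 1) + tones[last + 1:]


for ur in ["LHLLHL", "LLL", "HLH", "LHL"]:
    print(ur, "->", plateau(ur))
# LHLLHL -> LHHHHL
# LLL -> LLL
# HLH -> HHH
# LHL -> LHL
```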

Then, `generate_utp_strings` and `generate_utp_pairs` generate strings and pairs (UR -> SF) of tones that are well-formed according to the rules of UTP. As before, `n` signifies the number of strings that need to be generated, and `length` is the length of those strings.

In [34]:
def generate_utp_strings(n = 10, length = 5):
    """ Generates n strings of tones that follow UTP rules. """
    return [utp_tones(generate_tonal_pattern(length)) for i in range(n)]

In [35]:
def generate_utp_pairs(n = 10, length = 5):
    """ Generates n pairs of tones (UR -> SF) that follow UTP rules. """
    pairs = []
    for i in range(n):
        before = generate_tonal_pattern(length)
        pairs.append((before, utp_tones(before)))
    return pairs

Finally, `evaluate_utp_strings` and `evaluate_utp_pairs` calculate the percentage of the input data (strings or pairs of strings) that is well-formed with respect to the rules of UTP.

In [36]:
def evaluate_utp_strings(data):
    """ Evaluates the correctness of the given sample of tonal strings. """
    correct = 0
    for w in data:
        correct = (correct + 1) if utp_tones(w) == w else correct
        
    ratio = (correct / len(data))
    print(f"Percentage of well-formed tonal layers: {int(ratio * 100)}%.")
    
    
def evaluate_utp_pairs(data):
    """
    Evaluates the correctness of the given sample
    of tonal pairs (UR -> SF).
    """
    correct = 0
    for pair in data:
        before, after = pair
        
        if len(set(after)) > 2:
            continue
            
        correct = (correct + 1) if utp_tones(before) == after else correct
        
    ratio = (correct / len(data))
    print(f"Percentage of well-formed tonal layers: {int(ratio * 100)}%.")

As before, we can verify the correctness of the generator using the evaluation functions.

In [37]:
evaluate_utp_strings(generate_utp_strings(n = 1000))
evaluate_utp_pairs(generate_utp_pairs(n = 1000))

Percentage of well-formed tonal layers: 100%.
Percentage of well-formed tonal layers: 100%.


## Step 7: First-last harmony generators and evaluators

In [4]:
def first_last_UR(n = 10, length = 10):
    """ Generates URs of first-last harmony words. """
    strings = []
    for i in range(n):
        new = choice(["a", "o"])
        new += "".join([choice(["a", "o", "x"]) for j in range(length - 2)])
        new += choice(["a", "o"])
        strings.append(new)
    return strings

def first_last(string):
    """ Makes the first and the last segment of the string the same. """
    return string[:-1] + string[0]

def first_last_words(n = 10, length = 10):
    """ Generates N first-last words. """
    return [first_last(w) for w in first_last_UR(n, length)]

def first_last_pairs(n = 10, length = 10):
    """ Generates N first-last pairs of words. """
    URs = first_last_UR(n, length)
    pairs = []
    for ur in URs:
        pairs.append((ur, first_last(ur)))
    return pairs

In [5]:
def evaluate_first_last_words(data):
    """
    Evaluates the correctness of the given sample
    of first-last harmony words.
    """
    newdata = [i for i in data if len(i) > 1]
    correct = 0
    for w in newdata:
        if w[0] == w[-1]:
            correct += 1
        
    ratio = (correct / len(newdata))
    print(f"Percentage of first-last harmonic words: {int(ratio * 100)}%.")
    
    
def evaluate_first_last_pairs(data):
    """
    Evaluates the correctness of the given sample
    of first-last harmony pairs (UR -> SF).
    """
    # keep only the pairs whose UR is longer than one segment
    newdata = [i for i in data if len(i[0]) > 1]
    correct = 0
    for pair in newdata:
        if first_last(pair[0]) == pair[1]:
            correct += 1

    ratio = (correct / len(newdata))
    print(f"Percentage of first-last harmonic pairs: {int(ratio * 100)}%.")

### Quick helper functions

In [6]:
def test_fst(fst, testdata):
    """ Prints the percentage of (input, output) pairs that the FST rewrites correctly. """
    n = 0
    for i in testdata:
        if fst.rewrite(i[0]) == i[1]:
            n += 1
    print("Score:", str(n / len(testdata) * 100) + "%")

In [7]:
from random import randint

# Preparing the training samples

## Dataset 1: word-final devoicing

In [84]:
wfd_data = lambda x, y: [word_final_devoicing(sigma = ("a", "b", "p"), devoice = (("b"), ("p")), 
                                           length = y, pairs = True) for i in range(x)]

data_1 = []
for i in range(1500):
    new = word_final_devoicing(sigma = ("a", "b", "p"), devoice = (("b"), ("p")),
                               length = randint(1, 5), pairs = True)
    data_1.append(new)

test_1 = []
for i in range(1000):
    new = word_final_devoicing(sigma = ("a", "b", "p"), devoice = (("b"), ("p")),
                               length = randint(1, 5), pairs = True)
    test_1.append(new)

    
sigma1 = ["a", "b", "p"]
gamma1 = ["a", "b", "p"]

print(data_1[:15])

[('aa', 'aa'), ('p', 'p'), ('ab', 'ap'), ('p', 'p'), ('ppbbp', 'ppbbp'), ('bpba', 'bpba'), ('bapb', 'bapp'), ('abbap', 'abbap'), ('p', 'p'), ('apa', 'apa'), ('apbap', 'apbap'), ('a', 'a'), ('b', 'p'), ('ppa', 'ppa'), ('pabap', 'pabap')]


In [85]:
o1 = ostia(data_1, sigma1, gamma1)
test_fst(o1, test_1)

# Score: 100.0%

Score: 100.0%


In [45]:
# o1.E

# [['', 'b', '', 'b'],
#  ['b', 'b', 'b', 'b'],
#  ['', 'a', 'a', ''],
#  ['', 'p', 'p', ''],
#  ['b', 'a', 'ba', ''],
#  ['b', 'p', 'bp', '']]

In [46]:
# o1.stout

# {'': '', 'b': 'p'}
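
The two commented-out tables above can be read as a small subsequential transducer. The sketch below assumes that every row of `E` has the shape `[source state, input symbol, output string, target state]`, that `stout` maps a state to the string appended when the input ends in that state, and that the initial state is the empty string. These readings are inferred from the printout, not taken from the SigmaPie source, and `run_transducer` is a hypothetical helper rather than part of the toolkit.

```python
# Transition table and state outputs copied from the commented cells above.
E = [['', 'b', '', 'b'],
     ['b', 'b', 'b', 'b'],
     ['', 'a', 'a', ''],
     ['', 'p', 'p', ''],
     ['b', 'a', 'ba', ''],
     ['b', 'p', 'bp', '']]
stout = {'': '', 'b': 'p'}


def run_transducer(transitions, state_output, string, initial=''):
    """
    Hypothetical helper: walks the transition table symbol by symbol.
    Assumes rows of the form [source, input, output, target].
    """
    delta = {(src, sym): (out, tgt) for src, sym, out, tgt in transitions}
    state, result = initial, ''
    for symbol in string:
        out, state = delta[(state, symbol)]
        result += out
    # the state output is appended once the whole input has been read
    return result + state_output[state]


for ur in ['ab', 'bapb', 'ppbbp']:
    print(ur, '->', run_transducer(E, stout, ur))
```

Under this reading, the machine holds back every `b` by moving to state `'b'`: the `b` is written out unchanged if more input follows, and it surfaces as `p` through the state output if the word ends there.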

In [119]:
test_11 = []
for i in range(1000):
    new = word_final_devoicing(sigma = ("a", "b", "p"), devoice = (("b"), ("p")),
                               length = randint(15, 20), pairs = True)
    test_11.append(new)
test_fst(o1, test_11)

# Score: 100.0%

Score: 100.0%


## Dataset 2: one harmony, no blockers

In [121]:
a = {("a", "o"):"A", ("x"):"X"}
b = ({"A":(1, 2), "X":(1, 3)})
h = Harmony(a, b)

data_2 = []
for i in range(5000):
    data_2.extend(h.generate_pairs(n = 1, length = randint(1, 10)))
    
test_2 = []
for i in range(1000):
    test_2.extend(h.generate_pairs(n = 1, length = randint(1, 10)))

sigma2 = ["a", "o", "A", "x"]
gamma2 = ["a", "o", "x"]

print(data_2[:15])

[('axx', 'axx'), ('xaxA', 'xaxa'), ('xxxox', 'xxxox'), ('axx', 'axx'), ('oAxxA', 'ooxxo'), ('oAxxxAAxxx', 'ooxxxooxxx'), ('xxxoAxx', 'xxxooxx'), ('oAxxxAxxx', 'ooxxxoxxx'), ('axAxx', 'axaxx'), ('aAxAxxxA', 'aaxaxxxa'), ('xxxoA', 'xxxoo'), ('axxxAA', 'axxxaa'), ('xxaA', 'xxaa'), ('xaxxxA', 'xaxxxa'), ('aA', 'aa')]


In [88]:
o2 = ostia(data_2, sigma2, gamma2)
# test_fst(o2, test_2)

# Score: 100.0%

In [49]:
# o2.E

# [['', 'o', 'o', ''],
#  ['', 'x', 'x', 'x'],
#  ['x', 'x', 'x', 'xx'],
#  ['xx', 'x', 'x', ''],
#  ['x', 'o', 'o', 'xx'],
#  ['', 'a', 'a', 'a'],
#  ['a', 'x', 'x', 'a'],
#  ['xx', 'o', 'o', ''],
#  ['xx', 'a', 'a', 'a'],
#  ['a', 'A', 'a', 'a'],
#  ['x', 'a', 'a', 'a'],
#  ['', 'A', 'o', 'xx'],
#  ['x', 'A', 'o', ''],
#  ['xx', 'A', 'o', '']]

In [50]:
# o2.stout
# {'': '', 'x': '', 'xx': '', 'a': ''}

In [122]:
test_22 = []
for i in range(1000):
    test_22.extend(h.generate_pairs(n = 1, length = randint(15, 20)))
test_fst(o2, test_22)

# Score: 100.0%

Score: 100.0%


## Dataset 3: one harmony with blockers

In [142]:
a = {("a", "o"):"A", ("x"):"X"}
b = ({"A":(1, 2), "X":(1, 3)})
c = {"f":"a"}
d = 5
h = Harmony(a, b, c, d)

data_3 = []
for i in range(5000):
    data_3.extend(h.generate_pairs(n = 1, length = randint(1, 10)))
    
test_3 = []
for i in range(1000):
    test_3.extend(h.generate_pairs(n = 1, length = randint(1, 10)))
    
sigma3 = ["a", "o", "A", "x", "f"]
gamma3 = ["a", "o", "x", "f"]

print(data_3[:15])

[('xoAxxx', 'xooxxx'), ('faxxxfA', 'faxxxfa'), ('oAffAA', 'ooffaa'), ('xoAxxfA', 'xooxxfa'), ('aAxfAA', 'aaxfaa'), ('o', 'o'), ('oAxxAx', 'ooxxox'), ('oA', 'oo'), ('fxxfaxx', 'fxxfaxx'), ('aA', 'aa'), ('f', 'f'), ('xxxa', 'xxxa'), ('oxxAAx', 'oxxoox'), ('ff', 'ff'), ('aAxxAxxxA', 'aaxxaxxxa')]


In [94]:
o3 = ostia(data_3, sigma3, gamma3)
# test_fst(o3, test_3)

# Score: 99.2%

In [53]:
# len(o3.E)
# 247

In [54]:
# len(o3.stout) 
# 74

In [143]:
test_33 = []
for i in range(1000):
    test_33.extend(h.generate_pairs(n = 1, length = randint(15, 20)))
test_fst_special(o3, test_33)

# % Score: 95.19999999999999%

('fxxafAxxxAAxxxAAxxx', 'fxxafaxxxaaxxxaaxxx') xxx fxxafaaxxxaaxxxaaxxx
('axAxxfAxxAxAAxxAAfA', 'axaxxfaxxaxaaxxaafa') xxx axaxxfooxxoofa
('xoAxxAAffxxAAfxx', 'xooxxooffxxaafxx') xxx xooxxooxfxxaafxx
('xoAxAxxAAxxAxxx', 'xooxoxxooxxoxxx') xxx xooxoxxooxxaafxxx
('oxAAxAAxxfxAAxxf', 'oxooxooxxfxaaxxf') xxx oxooxooxxfffaxxxxaxxf
('ffffafxxAAfAAfx', 'ffffafxxaafaafx') xxx ffffaafxxaafaafx
('faAxxAxAxxAfAAxxxfxx', 'faaxxaxaxxafaaxxxfxx') xxx faaxxaxaxxafaafxx
('aAxxxAAxxxfxxxffAAfA', 'aaxxxaaxxxfxxxffaafa') xxx aaxxxaaxxxfxxxffaafaa
('xfffxxaxxxfAAxxx', 'xfffxxaxxxfaaxxx') xxx xfffxxaxxxf
('xaAxAAxxAxAxxfAxfxA', 'xaaxaaxxaxaxxfaxfxa') xxx xaaxaaxxaxaxxfxfxa
('aAxAAxxxAfAxxAxAxxfA', 'aaxaaxxxafaxxaxaxxfa') xxx aaxaaxxxafaxxaxaxxf
('xxaxxAfAAxffxAA', 'xxaxxafaaxffxaa') xxx xxaxxafaaxffxoo
('faxxffAAxxAxAAxxA', 'faxxffaaxxaxaaxxa') xxx faxxffaaooxxo
('fxxxaxAAxAxxxfxxxAxA', 'fxxxaxaaxaxxxfxxxaxa') xxx fxxxaxaaxaxxxfxxxoxo
('fxxxaxAxxffAxAAx', 'fxxxaxaxxffaxaax') xxx fxxxaxaxxffaaxaax
('xxfxxfx

In [140]:
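def test_fst_special(fst, testdata):
    """ Like test_fst, but also prints every pair that the FST gets wrong. """
    n = 0
    for i in testdata:
        if fst.rewrite(i[0]) == i[1]:
            n += 1
        else:
            # the pair (UR, expected SF), followed by the FST's actual output
            print(i, "xxx", fst.rewrite(i[0]))
    print("Score:", str(n / len(testdata) * 100) + "%")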

## Dataset 4: double harmony, no blockers

In [125]:
a = {("a", "o", "e", "u"):"A", ("x"):"X"}
b = ({"A":(1, 2), "X":(1, 3)})
h = Harmony(a, b)

data_4 = []
for i in range(5000):
    data_4.extend(h.generate_pairs(n = 1, length = randint(1, 10)))
    
test_4 = []
for i in range(1000):
    test_4.extend(h.generate_pairs(n = 1, length = randint(1, 10)))

sigma4 = ["a", "o", "e", "u", "A", "x"]
gamma4 = ["a", "o", "e", "u", "x"]

print(data_4[:15])

[('xxxox', 'xxxox'), ('xxxuAxAxxx', 'xxxuuxuxxx'), ('oA', 'oo'), ('xax', 'xax'), ('x', 'x'), ('xxx', 'xxx'), ('xoAxxxA', 'xooxxxo'), ('oxxAxxxAA', 'oxxoxxxoo'), ('xxxuxxAxxA', 'xxxuxxuxxu'), ('aAxx', 'aaxx'), ('xuxA', 'xuxu'), ('xxxoxAAxxx', 'xxxoxooxxx'), ('u', 'u'), ('xoxA', 'xoxo'), ('x', 'x')]


In [99]:
o4 = ostia(data_4, sigma4, gamma4)
# test_fst(o4, test_4)

# Score: 100.0%

In [57]:
# len(o4.E)
# 90

In [58]:
# len(o4.stout)
# 37

In [126]:
test_44 = []
for i in range(1000):
    test_44.extend(h.generate_pairs(n = 1, length = randint(15, 20)))
test_fst(o4, test_44)

# Score: 100.0%

Score: 100.0%


## Dataset 5: double harmony with blockers

In [127]:
data_5 = []
for i in range(5000):
    data_5.extend(generate_turkish_pairs(n = 1, length = randint(1, 10), cons = "x", vowel_cluster = (1, 2),
                                cons_cluster = (1, 2)))
test_5 = []
for i in range(5000):
    test_5.extend(generate_turkish_pairs(n = 1, length = randint(1, 10), cons = "x", vowel_cluster = (1, 2),
                                cons_cluster = (1, 2)))
    
sigma5 = ["H", "L", "x", "a", "e", "o", "O", "u", "U", "i", "I"]
gamma5 = ["x", "a", "e", "o", "O", "u", "U", "i", "I"]

print(data_5[:15])

[('xx', 'xx'), ('axxLxxLxxH', 'axxaxxaxxI'), ('oxx', 'oxx'), ('iHxH', 'iixi'), ('xiHx', 'xiix'), ('oLxxH', 'oaxxI'), ('xxa', 'xxa'), ('OxxLLx', 'Oxxeex'), ('xiHxxL', 'xiixxe'), ('xiHxH', 'xiixi'), ('oL', 'oa'), ('UHxHxxLx', 'UUxUxxex'), ('xuHxxH', 'xuuxxu'), ('ixLL', 'ixee'), ('xxOxxHH', 'xxOxxUU')]


In [60]:
# o5 = ostia(data_5, sigma5, gamma5)
# test_fst(o5, test_5)

# Score: 97.92%

Score: 97.92%


In [61]:
# len(o5.E)
# 318

318

In [62]:
# len(o5.stout)
# 107

107

In [144]:
test_55 = []
for i in range(5000):
    test_55.extend(generate_turkish_pairs(n = 1, length = randint(15, 20), cons = "x", vowel_cluster = (1, 2),
                                cons_cluster = (1, 2)))
test_fst_special(o5, test_55)
# 'uxHxLxxLLxxHxxL', 'uxuxaxxaaxxIxxa') xxx uxuxaxxaaxxuxxa
# Score: 91.42%

('oxxHHxxLxHxHHxHxxLLx', 'oxxuuxxaxIxIIxIxxaax') xxx oxxuuxxaxIxIIxIIxxaax
('xxoxxHHxHxLxxHx', 'xxoxxuuxuxaxxIx') xxx xxoxxuuxuxaxxux
('oLxxHxHxHxxLxxH', 'oaxxIxIxIxxaxxI') xxx oaxxIxIxIxxaI
('oxHxxLxHxLxHxLLx', 'oxuxxaxIxaxIxaax') xxx oxuxxaxIxaxIxaaxxx
('xxexxLxLxxHxLxxLxLxx', 'xxexxexexxixexxexexx') xxx xxexxexexxUxexxexexx
('xxaxLLxxLxLLxHHxHH', 'xxaxaaxxaxaaxIIxII') xxx xxaxaaxxaxaaxxxUU
('uxxLLxxHxxHxHHxxHH', 'uxxaaxxIxxIxIIxxII') xxx uxxaaxxIxxIxIIxIIIIx
('xxIHxxHxHxLLxHHxx', 'xxIIxxIxIxaaxIIxx') xxx xxIIxxIxaxIIxx
('xOxxHxxHxxHHxxHHxHxL', 'xOxxUxxUxxUUxxUUxUxe') xxx xOxxUxxixxiixxiixixe
('xxIHxxLLxHxLLxHxLxxH', 'xxIIxxaaxIxaaxIxaxxI') xxx xxIIxxaaxIxaaxxxaxxI
('xaLxxLLxHxHHxxH', 'xaaxxaaxIxIIxxI') xxx xaaxxaaxIxIIxIII
('uxHxLxxLLxxHxxL', 'uxuxaxxaaxxIxxa') xxx uxuxaxxaaxxuxxa
('xIHxxLxxHHxxHxx', 'xIIxxaxxIIxxIxx') xxx xIIxxaxxuuxxuuxIxx
('xaLxxHxxHxxHxHxHx', 'xaaxxIxxIxxIxIxIx') xxx xaaxxIxxIxxIx
('OxxLxxLxxLLxxHHxL', 'Oxxexxexxeexxiixe') xxx OxxexxexxeaxxIIxa
('xoxLxxHHxxLxxLL

('OxHxHHxxLLxHxxHxxL', 'OxUxUUxxeexixxixxe') xxx OxUxUUxexIxxaaxIxxIxxa
('UxxLxxHxHHxxHHxHxH', 'Uxxexxixiixxiixixi') xxx UxxexxixiixixIxxauxIxIxI
('xaLxLxxLLxHxHHxLLx', 'xaaxaxxaaxIxIIxaax') xxx xaaxaxaaxIxIIxaax
('xUHxxLLxHxHxHHxLx', 'xUUxxeexixixiixex') xxx xUUxxeexixUxUUxex
('xxIxxHxHxxHxxHxxHxx', 'xxIxxIxIxxIxxIxxIxx') xxx xxIxxIxIxxauxxuuxIxxIxx
('xxaLxxLxxLxHxHH', 'xxaaxxaxxaxIxII') xxx xxaaxxaxxaxIxIIxII
('xexxHxxLLxLxLxHxxHH', 'xexxixxeexexexixxii') xxx xexxixxeexexexUxxUU
('xaxxLxxHxHxxHxHH', 'xaxxaxxIxIxxIxII') xxx xaxxaxxIxIxxIxIIx
('axHxHHxHxHHxLxLxLx', 'axIxIIxIxIIxaxaxax') xxx axIxIIxIxIIxIIaaxIxax
('xxaxLLxLLxLLxLLxHHx', 'xxaxaaxaaxaaxaaxIIx') xxx xxaxaaxaaxxaxaaxIIx
('xoLxLLxHxHxxLxL', 'xoaxaaxIxIxxaxa') xxx xoaxaaxIxIxxa
('xxuHxHxxLLxHxHH', 'xxuuxuxxaaxIxII') xxx xxuuxuxxaaxIxIIxII
('uxLxLxLxxLxLLxLLxL', 'uxaxaxaxxaxaaxaaxa') xxx uxaxaxaxxaxaaxxaxa
('OxxHxLxxHHxLLxLxxH', 'OxxUxexxiixeexexxi') xxx OxxUxexxUUxeexexxU
('IHxxLLxxHxLxLxxLLxxH', 'IIxxaaxxIxaxaxxaaxxI') xxx I

('IxLLxHxLLxLxxHx', 'IxaaxIxaaxaxxIx') xxx IxaaxIxaaxxxuuxIx
('exHxLLxLxxHxxHHxxHx', 'exixeexexxixxiixxix') xxx exixeaxaxxIxxIIxxIx
('UxxHxxHxxLxxHxLLxx', 'UxxUxxUxxexxixeexx') xxx UxxUxxUxxexxixeaxx
('UxxHHxHxHHxLxxH', 'UxxUUxUxUUxexxi') xxx UxxUUxUxUUxexxU
('xUHxHxxLxHHxLxHxLLx', 'xUUxUxxexiixexixeex') xxx xUUxUxxexUUxexUxeex
('xxoxLxLxxHxxHHxLxxL', 'xxoxaxaxxIxxIIxaxxa') xxx xxoxaxaxxuxxuuxaxxa
('xxixxLLxLxLxxHx', 'xxixxeexexexxix') xxx xxixxeexexexxUx
('aLxHxxLxHxxLLxxHH', 'aaxIxxaxIxxaaxxII') xxx aaxIxxaxIxxaaxxIIx
('xxUHxLxxHxLxLxHH', 'xxUUxexxixexexii') xxx xxUUxexxixexexUU
('aLxHxxLLxHHxxLxxLxLx', 'aaxIxxaaxIIxxaxxaxax') xxx aaxIxxaaxxxxexxexex
('xoLxxHxLLxHxLLxxHH', 'xoaxxIxaaxIxaaxxII') xxx xoaxxIxaaxIxaaxxIIx
('xexHHxLLxHxxHxHH', 'xexiixeexixxixii') xxx xexiixeexixxixiixixI
('xxIxxHxHHxxHHxx', 'xxIxxIxIIxxIIxx') xxx xxIxxIxIIxIIIIxx
('aLxLLxLLxHxHxHHxH', 'aaxaaxaaxIxIxIIxI') xxx aaxaaxaaxIxIxI
('xxIxHHxxLLxLLxHH', 'xxIxIIxxaaxaaxII') xxx xxIxIIxxaaxaaxx
('axHxxHxLLxLxHHxxLxH

## Dataset 6: two independent harmonies, no blockers

In [146]:
a = {("a", "o"):"A", ("b", "p"):"B"}
b = ({"A":(1, 2), "B":(1, 2)})
h = Harmony(a, b)

data_6 = []
for i in range(5000):
    data_6.extend(h.generate_pairs(n = 1, length = randint(1, 10)))
    
test_6 = []
for i in range(1000):
    test_6.extend(h.generate_pairs(n = 1, length = randint(1, 10)))

sigma6 = ["A", "a", "o", "B", "b", "p"]
gamma6 = ["a", "o", "b", "p"]

print(data_6[:15])

[('oAbB', 'oobb'), ('bBoABAA', 'bbooboo'), ('paBAABAAB', 'papaapaap'), ('oAbA', 'oobo'), ('ob', 'ob'), ('pBaABBABBA', 'ppaappappa'), ('abBAABABB', 'abbaababb'), ('opBABAB', 'oppopop'), ('pBaABBAABB', 'ppaappaapp'), ('bBoBABBA', 'bbobobbo'), ('o', 'o'), ('oApB', 'oopp'), ('aAbBAABAA', 'aabbaabaa'), ('obAABABA', 'oboobobo'), ('aAp', 'aap')]


In [64]:
# o6 = ostia(data_6, sigma6, gamma6)
# test_fst(o6, test_6)

# Score: 100.0%
# FLT NOTEBOOK

Score: 100.0%


In [65]:
# len(o6.E)

# T.E = [['', 'p', 'p', ''], ['', 'a', 'a', ''], ['', 'o', 'o', 'o'],
#        ['o', 'p', 'p', 'o'], ['', 'b', 'b', 'b'], ['b', 'a', 'a', 'b'],
#        ['o', 'A', 'o', 'o'], ['o', 'b', 'b', 'ob'], ['ob', 'A', 'o', 'ob'],
#        ['b', 'B', 'b', 'b'], ['b', 'o', 'o', 'ob'], ['ob', 'B', 'b', 'ob'],
#        ['o', 'B', 'p', 'o'], ['', 'B', 'p', ''], ['', 'A', 'a', ''],
#        ['b', 'A', 'a', 'b']]

48

In [66]:
# len(o6.stout)

# T.Q = ['', 'o', 'b', 'ob']

15

In [147]:
test_66 = []
for i in range(1000):
    test_66.extend(h.generate_pairs(n = 1, length = randint(15, 20)))
test_fst(o6, test_66)

# Score: 100.0%

Score: 100.0%


## Dataset 7: two independent harmonies with blockers

In [148]:
a = {("a", "o"):"A", ("b", "p"):"B"}
b = ({"A":(1, 2), "B":(1, 2)})
c = {"t":"p"}
d = 7
h = Harmony(a, b, c, d)

data_7 = []
for i in range(5000):
    data_7.extend(h.generate_pairs(n = 1, length = randint(1, 10)))
    
test_7 = []
for i in range(1000):
    test_7.extend(h.generate_pairs(n = 1, length = randint(1, 10)))

sigma7 = ["A", "a", "o", "B", "b", "p", "t"]
gamma7 = ["a", "o", "b", "p", "t"]

print(data_7[:15])

[('aApAB', 'aapap'), ('baBBA', 'babba'), ('aApABBAAB', 'aapappaap'), ('abBAA', 'abbaa'), ('btBoBAABBA', 'btpopooppo'), ('pBat', 'ppat'), ('pBoA', 'ppoo'), ('ttpB', 'ttpp'), ('oApAABABB', 'oopoopopp'), ('paABBtAABB', 'paapptaapp'), ('aAt', 'aat'), ('bB', 'bb'), ('apBAA', 'appaa'), ('oAbBA', 'oobbo'), ('pBaA', 'ppaa')]


In [68]:
# o7 = ostia(data_7, sigma7, gamma7)
# test_fst(o7, test_7)

# Score: 96.39999999999999%

Score: 96.39999999999999%


In [69]:
# len(o7.E)
# 406

406

In [70]:
# len(o7.stout)
# 101

101

In [149]:
test_77 = []
for i in range(1000):
    test_77.extend(h.generate_pairs(n = 1, length = randint(15, 20)))
test_fst_special(o7, test_77)

# Score: 87.3%

('attAtpBABABAAtAA', 'attatppapapaataa') xxx attatppapapaatoo
('poABBABBAABBABBAAt', 'pooppoppooppoppoot') xxx pooppoppooppobboot
('poBAABAtBBAABAtBABB', 'popoopotppoopotpopp') xxx popoopotppoobotpopp
('pttBoABBABAAtABBAAt', 'pttpooppopootoppoot') xxx pttpooppopaapaatoppoot
('oApABBAtttBBABA', 'oopoppotttppopo') xxx oopoppotttpbobo
('pBoBBtBBABBABtABBAB', 'ppopptppoppoptoppop') xxx ppopptppoppoptpappap
('ttaAtpAtAABAABB', 'ttaatpataapaapp') xxx ttaatapaapp
('opBAABAABAtBBAABAAB', 'oppoopoopotppoopoop') xxx oppoopoopotppooboob
('pBoAtABBAABBABtAt', 'ppootoppooppoptot') xxx ppootoppooppobtot
('poBBAABABBABABABt', 'poppoopoppopopopt') xxx poppoopoppobobobt
('btBoBBtAABABBAABABAB', 'btpopptoopoppoopopop') xxx btpotpoaapaaobobob
('oApABBABBAABABBABBt', 'oopoppoppoopoppoppt') xxx oopoppoppoopoppobbt
('baABtABABBAABAtBBAB', 'baabtapappaapatppap') xxx baabtapappaapatapop
('oAbABtABBAtAABBtBABA', 'oobobtoppotoopptpopo') xxx oobobtpppataapptpapa
('obABtBAABABAtBABAAB', 'obobtpoopopotpopoop') xxx

## Dataset 8: unbounded tone plateauing

In [152]:
data_8 = []
for i in range(5000):
    data_8.extend(generate_utp_pairs(n = 1, length = randint(1, 6)))
    
test_8 = []
for i in range(1000):
    test_8.extend(generate_utp_pairs(n = 1, length = randint(1, 6)))
    
    
sigma8 = ["H", "L"]
gamma8 = ["H", "L"]
    
    
print(data_8[:15])

[('LHHHHL', 'LHHHHL'), ('L', 'L'), ('LHLLH', 'LHHHH'), ('HHLHHL', 'HHHHHL'), ('HLHHH', 'HHHHH'), ('HLL', 'HLL'), ('LH', 'LH'), ('HHLHLH', 'HHHHHH'), ('L', 'L'), ('LLH', 'LLH'), ('L', 'L'), ('LH', 'LH'), ('L', 'L'), ('HHHHH', 'HHHHH'), ('LLLH', 'LLLH')]


In [151]:
o8 = ostia(data_8, sigma8, gamma8)
test_fst(o8, test_8)

Score: 100.0%


In [73]:
len(o8.E)

12

In [79]:
o8.E

[['', 'L', 'L', ''],
 ['', 'H', 'H', 'H'],
 ['H', 'H', 'H', 'H'],
 ['H', 'L', '', 'HL'],
 ['HL', 'L', '', 'HLL'],
 ['HLL', 'L', '', 'HLLL'],
 ['HL', 'H', 'HH', 'H'],
 ['HLLL', 'H', 'HHHH', ''],
 ['HLL', 'H', 'HHH', 'H'],
 ['HLLL', 'L', '', 'HLLLL'],
 ['HLLLL', 'L', 'LLLLL', ''],
 ['HLLLL', 'H', 'HHHHH', '']]

In [74]:
len(o8.stout)

6
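
One possible way to see why a machine of this size cannot generalize is to trace the transition table printed above on an input with two plateaus. The sketch below assumes that rows have the shape `[source state, input symbol, output string, target state]` and that the initial state is `''`; both are inferred from the printout. The values of `o8.stout` are not shown in this notebook, so the hypothetical helper `emitted_prefix` only reports what the transitions write before any state output is appended.

```python
# Transition table copied from the o8.E output above.
E8 = [['', 'L', 'L', ''],
      ['', 'H', 'H', 'H'],
      ['H', 'H', 'H', 'H'],
      ['H', 'L', '', 'HL'],
      ['HL', 'L', '', 'HLL'],
      ['HLL', 'L', '', 'HLLL'],
      ['HL', 'H', 'HH', 'H'],
      ['HLLL', 'H', 'HHHH', ''],
      ['HLL', 'H', 'HHH', 'H'],
      ['HLLL', 'L', '', 'HLLLL'],
      ['HLLLL', 'L', 'LLLLL', ''],
      ['HLLLL', 'H', 'HHHHH', '']]


def emitted_prefix(transitions, string, initial=''):
    """
    Hypothetical helper: returns the output written by the transitions
    alone, together with the state reached at the end of the input.
    """
    delta = {(src, sym): (out, tgt) for src, sym, out, tgt in transitions}
    state, emitted = initial, ''
    for symbol in string:
        out, state = delta[(state, symbol)]
        emitted += out
    return emitted, state


ur = 'HLLLHLLH'          # expected surface form: 'HHHHHHHH'
prefix, state = emitted_prefix(E8, ur)
print(prefix, repr(state))
```

Under these assumptions, the row `['HLLL', 'H', 'HHHH', '']` sends the machine back to the initial state, so it no longer remembers that an `H` has been seen, and the following `LL` is written out unchanged. A state output can only append material, so the wrong prefix cannot be repaired afterwards. Training strings of length 1 to 6 never contain a second plateau after `HLLLH`, which may be why this merge went unnoticed during learning.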

In [153]:
test_88 = []
for i in range(1000):
    test_88.extend(generate_utp_pairs(n = 1, length = randint(15, 20)))
    
test_fst_special(o8, test_88)

('HLLLLHHLHHLLLLLHLL', 'HHHHHHHHHHHHHHHHLL') xxx HHHHHHHHHHLLLLLHLL
('HHLHLLLLLLHHLHHLLHH', 'HHHHHHHHHHHHHHHHHHH') xxx HHHHLLLLLLHHHHHHHHH
('HHLLLHLLHHHLLLHLHLLH', 'HHHHHHHHHHHHHHHHHHHH') xxx HHHHHHLLHHHHHHHLHHHH
('LLHLLLLHLLLHHLHLHLLH', 'LLHHHHHHHHHHHHHHHHHH') xxx LLHHHHHHLLLHHHHHHHHH
('LHLLLHHHLLLHLHL', 'LHHHHHHHHHHHHHL') xxx LHHHHHHHHHHHLHL
('LHHHHLLHHLLLLLHLHLHL', 'LHHHHHHHHHHHHHHHHHHL') xxx LHHHHHHHHLLLLLHHHHHL
('LLLHHHHLHHLLLLLHLHL', 'LLLHHHHHHHHHHHHHHHL') xxx LLLHHHHHHHLLLLLHHHL
('LHLLHHLHLLLLLLH', 'LHHHHHHHHHHHHHH') xxx LHHHHHHHLLLLLLH
('HLHLLLHLHLHLLLLHL', 'HHHHHHHHHHHHHHHHL') xxx HHHHHHHLHHHHHHHHL
('LLHHLHLHHLLLHLLLLH', 'LLHHHHHHHHHHHHHHHH') xxx LLHHHHHHHHHHHLLLLH
('HHHLLLHLHHLHLLLLL', 'HHHHHHHHHHHHLLLLL') xxx HHHHHHHLHHHHLLLLL
('HLLLHLLLHHHHHHLHL', 'HHHHHHHHHHHHHHHHL') xxx HHHHHLLLHHHHHHHHL
('LLLHLLLLLHHHLLLH', 'LLLHHHHHHHHHHHHH') xxx LLLHLLLLLHHHHHHH
('LLHHLHHLLHHHLLLLLHHL', 'LLHHHHHHHHHHHHHHHHHL') xxx LLHHHHHHHHHHLLLLLHHL
('LHLLLLLHLHLHHHLLHH', 'LHHHHHHHHHHHHHHHHH') xxx LH

## Dataset 9: first-last harmony

In [8]:
data_9 = []
for i in range(5000):
    data_9.extend(first_last_pairs(n = 1, length = randint(1, 6)))

test_9 = []
for i in range(5000):
    test_9.extend(first_last_pairs(n = 1, length = randint(1, 6)))
    
    
sigma9 = ["a", "o", "x"]
gamma9 = ["a", "o", "x"]
    
    
print(data_9[:15])

[('oaxxo', 'oaxxo'), ('oxxaa', 'oxxao'), ('aaooo', 'aaooa'), ('ooaxo', 'ooaxo'), ('axxxxo', 'axxxxa'), ('aoaoo', 'aoaoa'), ('oxxo', 'oxxo'), ('ooo', 'ooo'), ('oxxoaa', 'oxxoao'), ('aao', 'aaa'), ('ao', 'aa'), ('aooxo', 'aooxa'), ('ooaaa', 'ooaao'), ('axaa', 'axaa'), ('aoxaao', 'aoxaaa')]


In [9]:
o9 = ostia(data_9, sigma9, gamma9)
test_fst(o9, test_9)

Score: 100.0%


In [23]:
o9.E

[['', 'o', 'o', 'o'],
 ['o', 'a', '', 'oa'],
 ['oa', 'x', 'ax', 'o'],
 ['o', 'x', 'x', 'o'],
 ['', 'a', 'a', 'a'],
 ['a', 'a', 'a', 'a'],
 ['o', 'o', 'o', 'o'],
 ['a', 'x', 'x', 'a'],
 ['a', 'o', '', 'ao'],
 ['ao', 'a', 'oa', 'a'],
 ['ao', 'o', 'o', 'ao'],
 ['ao', 'x', 'ox', 'a'],
 ['oa', 'o', 'ao', 'o'],
 ['oa', 'a', 'a', 'oa']]

In [24]:
o9.stout

{'': '*', 'o': '', 'oa': 'o', 'a': '', 'ao': 'a'}

In [13]:
test_99 = []
for i in range(5000):
    test_99.extend(first_last_pairs(n = 1, length = randint(10, 15)))

test_fst(o9, test_99)

Score: 100.0%


In [15]:
t = [('oo', 'oo'), ('oxo', 'oxo'), ('axxo', 'axxa')]

In [16]:
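def add_xxx(strings, xrange):
    """
    Pads both members of every pair with the same random number of x's
    on each side; xrange is an inclusive (min, max) tuple.
    """
    new = []
    for p0, p1 in strings:
        initial_x = randint(*xrange)
        final_x = randint(*xrange)
        annotate = lambda s: "x" * initial_x + s + "x" * final_x
        annotated = (annotate(p0), annotate(p1))
        new.append(annotated)
    return new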

In [17]:
add_xxx(t, (4, 6))

[('xxxxxxooxxxx', 'xxxxxxooxxxx'),
 ('xxxxoxoxxxxx', 'xxxxoxoxxxxx'),
 ('xxxxxxaxxoxxxx', 'xxxxxxaxxaxxxx')]
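
The padded pairs can be checked in the same spirit as the earlier evaluators. The sketch below is a minimal, hypothetical check (`check_padded_pair` is not defined anywhere in this notebook): it assumes that the padding symbol is a single character, here `x`, and that the unpadded core starts and ends with a vowel, as the strings produced by `first_last_UR` do, so that stripping the outer `x`s recovers the core exactly.

```python
def first_last(string):
    """ Same mapping as above: the last segment becomes a copy of the first. """
    return string[:-1] + string[0]


def check_padded_pair(ur, sf, pad="x"):
    """
    Hypothetical check: strips the outer padding from the UR, applies
    first-last harmony to the core, restores the padding, and compares
    the result with the given SF.
    """
    core = ur.strip(pad)
    if not core:
        # nothing but padding: the mapping should be the identity
        return ur == sf
    left = len(ur) - len(ur.lstrip(pad))
    right = len(ur) - len(ur.rstrip(pad))
    expected = pad * left + first_last(core) + pad * right
    return expected == sf


padded = [('xxxxxxooxxxx', 'xxxxxxooxxxx'),
          ('xxxxoxoxxxxx', 'xxxxoxoxxxxx'),
          ('xxxxxxaxxoxxxx', 'xxxxxxaxxaxxxx')]
print(all(check_padded_pair(ur, sf) for ur, sf in padded))
```

With padding, the harmony holds between the first and the last vowel rather than between the string edges, so the unpadded evaluator `evaluate_first_last_pairs` would not be the right check for these samples.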

In [18]:
data_9 = []
for i in range(5000):
    data_9.extend(first_last_pairs(n = 1, length = randint(1, 6)))
data_9 = add_xxx(data_9, (0, 3))
    
test_9 = []
for i in range(5000):
    test_9.extend(first_last_pairs(n = 1, length = randint(1, 6)))
test_9 = add_xxx(test_9, (0, 3))
    
sigma9 = ["a", "o", "x"]
gamma9 = ["a", "o", "x"]
    
    
print(data_9[:15])

[('xoaaxxx', 'xoaoxxx'), ('xxxaxoxaxxx', 'xxxaxoxaxxx'), ('aooxxo', 'aooxxa'), ('xoaaxaa', 'xoaaxao'), ('xoaaoaa', 'xoaaoao'), ('xoaxx', 'xooxx'), ('xxoxaaxxx', 'xxoxaoxxx'), ('aaao', 'aaaa'), ('xxoaaaaxx', 'xxoaaaoxx'), ('xxoaoxxx', 'xxoaoxxx'), ('xoaxoxx', 'xoaxoxx'), ('xxxaxaxaaxxx', 'xxxaxaxaaxxx'), ('xxaoaoo', 'xxaoaoa'), ('xoxaxxx', 'xoxoxxx'), ('xxxaaoaxxx', 'xxxaaoaxxx')]


In [19]:
o99 = ostia(data_9, sigma9, gamma9)
test_fst(o99, test_9)

Score: 99.53999999999999%


In [20]:
len(o99.E)

235

In [21]:
len(o99.stout)

79

In [22]:
test_999 = []
for i in range(5000):
    test_999.extend(first_last_pairs(n = 1, length = randint(10, 15)))
test_999 = add_xxx(test_999, (0, 7))

test_fst(o99, test_999)

Score: 33.92%
