<div style="text-align: right">
    <i>
        Alëna Aksënova
    </i>
</div>
 

# [_SigmaPie_](https://github.com/alenaks/SigmaPie) for subregular grammar induction

This toolkit is relevant for anyone who works, or plans to work, with subregular grammars from the perspective of either theoretical linguistics or formal language theory.

## Importance of formal languages

**Why might theoretical linguists be interested in formal language theory?** <br>
_Formal language theory_ explains how potentially infinite string sets, or _formal languages_,
can be generalized to grammars encoding the desired patterns and what properties those
grammars have. It also allows one to compare different grammars regarding parameters such as **expressivity**.


**The Chomsky hierarchy** aligns the main classes of formal languages with respect to their expressive power [(Chomsky 1959)](http://www.cs.utexas.edu/~cannata/pl/Class%20Notes/Chomsky_1959%20On%20Certain%20Formal%20Properties%20of%20Grammars.pdf).

  * **Regular** grammars are as powerful as finite-state devices or regular expressions, and they cannot produce patterns that require counting up to an arbitrary number (no $a^{n}b^{n}$ patterns);
  * **Context-free** grammars have access to a potentially infinite _stack_ that allows them to reproduce patterns that involve center embedding;
  * **Mildly context-sensitive** grammars are powerful enough to handle cross-serial dependencies such as some types of copying;
  * **Context-sensitive** grammars can handle non-linear patterns such as $a^{2^{n}}$ for $n > 0$;
  * **Recursively enumerable** grammars are as powerful as any theoretically possible computer and generate languages such as $a^n$, where $n \in \textrm{primes}$.



<img src="images/chomhier.png" width="600">


Both phonology and morphology frequently display properties of regular languages.

**Phonology** does not require the power of center-embedding, which is a property of context-free languages. For example, consider a harmony where the first vowel agrees with the last vowel, the second vowel agrees with the penultimate one, etc. The following example shows this rule using English orthography.
    
    GOOD: "arugula", "tropicalization", "electrotelethermometer", etc.
    BAD:  any other word violating the rule.


While it is a theoretically possible pattern, harmonies of that type are unattested in natural languages.
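
To make this hypothetical pattern concrete, here is a minimal sketch of a checker for it. The function name `has_mirrored_vowels` and the five-letter orthographic vowel inventory are assumptions made for illustration only; they are not part of _SigmaPie_.

```python
VOWELS = set("aeiou")  # assumed orthographic vowel inventory


def has_mirrored_vowels(word):
    """Return True if the vowels of `word` read the same in both directions."""
    vowels = [ch for ch in word.lower() if ch in VOWELS]
    return vowels == vowels[::-1]


for word in ["arugula", "tropicalization", "electrotelethermometer", "harmony"]:
    print(word, has_mirrored_vowels(word))
```

The check compares the first vowel with the last one, the second with the penultimate one, and so on. This nested comparison is what places the pattern beyond the regular languages.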

**Morphology** avoids center-embedding as well. In [Aksënova et al. (2016)](https://www.aclweb.org/anthology/W16-2019) we show that it is possible to iterate prefixes with the meaning "after" in Russian. In Ilocano, where the same semantics are expressed via a circumfix, its iteration is prohibited.
    
    RUSSIAN: "zavtra" (tomorrow), "posle-zavtra" (the day after tomorrow), 
             "posle-posle-zavtra" (the day after the day after tomorrow), ...
    ILOCANO: "bigat" (morning), "ka-bigat-an" (the next morning),
             <*>"ka-ka-bigat-an-an" (the morning after the next one).

## Subregular language classes


A typological review of attested patterns shows that **phonology and morphology do not require the full power of regular languages**. As an example of an unattested pattern, [Heinz (2011)](http://jeffreyheinz.net/papers/Heinz-2011-CPF.pdf) provides a language where a word must have an even number of nasals to be well-formed. Regular languages can be subdivided into another nested hierarchy of classes with decreasing expressive power: the **subregular hierarchy**.
One of the most important characteristics of subregular languages is that they are learnable from positive data alone, whereas more powerful classes require negative input as well.
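
The "even number of nasals" language is regular because a device with two states can recognize it. The sketch below illustrates this with a hand-written acceptor; the name `has_even_nasals` and the toy nasal inventory `{m, n}` are assumptions of this example.

```python
NASALS = set("mn")  # assumed nasal inventory for this toy example


def has_even_nasals(word):
    """Two-state acceptor: track whether the nasal count seen so far is even."""
    even = True
    for symbol in word:
        if symbol in NASALS:
            even = not even  # every nasal flips the state
    return even


for word in ["mama", "moon", "sun", "ma"]:
    print(word, has_even_nasals(word))
```

The acceptor has to remember parity across the whole word, no matter how far apart the nasals are. The subregular classes discussed in this tutorial cannot express this kind of modulo counting.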


<img src="images/subreg.png" width="250">


The _SigmaPie_ toolkit currently contains functionality for the following subregular language and grammar classes:
  * strictly piecewise (SP);
  * strictly local (SL);
  * tier-based strictly local (TSL);
  * multiple tier-based strictly local (MTSL).
  
| Language | Dependencies it can handle                                    |
|----------|---------------------------------------------------------------|
| SL       | _only_ local dependencies                                     |
| SP       | _only_ multiple long-distance dependencies _without_ blocking |
| TSL      | long-distance dependencies _with_ blocking                    |
| MTSL     | multiple long-distance dependencies _with_ blocking           |


The work here is based on **string representations**. The exemplified learning algorithms focus on **structural properties** and are limited to non-probabilistic algorithms that evaluate the well-formedness of input strings. This approach is currently being extended to features ([Chandlee et al. 2019](https://www.aclweb.org/anthology/W19-5708/)) and autosegmental representations ([Chandlee and Jardine 2019](https://www.aclweb.org/anthology/Q19-1010/); [Rawski and Dolatian (to appear)](https://drive.google.com/file/d/19Ft6j7ta71uTTw3qkLRa6bqhR98caJGk/view)) in order to be more consistent with the representations used in linguistics. However, statistical algorithms and algorithms working with non-string-based representations are not implemented yet.

## Linguistically-inspired examples

In this section, I present two examples that I will later use to exemplify the functionality of the package:

  * toy double harmony example;
  * toy tone plateauing example.
  
### Vowel harmony and consonant harmony

In Bukusu, vowels agree in height, and a liquid "l" assimilates to "r" if an "r" precedes it somewhere in the word [(Odden 1994)](https://www.jstor.org/stable/415830?seq=1#metadata_info_tab_contents).

  * <b>r</b><i>ee</i>b-<i>e</i><b>r</b>- _ask-APPL_
  * <b>l</b><i>i</i>m-<i>i</i><b>l</b>- _cultivate-APPL_
  * <b>r</b><i>u</i>m-<i>i</i><b>r</b>- _send-APPL_
  
This pattern involves two long-distance assimilations: one of them affects vowels, and the other one is concerned with the consonants. To capture the big picture, we can simplify the dependency as follows: the two harmonic classes of vowels are mapped to "a" and "o", and the affected consonants are mapped to "b" and "p".

    Good strings: aaabbabba, oppopooo, aapapapp, obooboboboobbb, ...
    Bad strings:  <*>aabaoob, <*>paabab, <*>obabooo, ...
    Generalization: if a string contains "a", it cannot contain "o", and vice versa;
                    if a string contains "p", it cannot contain "b", and vice versa.

```python
harm_data = ['aabbaabb', 'abab', 'aabbab', 'abaabb', 'aabaab', 'abbabb', 'ooppoopp',
             'opop', 'ooppop', 'opoopp', 'oopoop', 'oppopp', 'aappaapp', 'apap',
             'aappap', 'apaapp', 'aapaap', 'appapp', 'oobboobb', 'obob', 'oobbob',
             'oboobb', 'ooboob', 'obbobb', 'aabb', 'ab', 'aab', 'abb', 'oopp', 'op',
             'oop', 'opp', 'oobb', 'ob', 'oob', 'obb', 'aapp', 'ap', 'aap', 'app',
             'aaa', 'ooo', 'bbb', 'ppp', 'a', 'o', 'b', 'p', '']
```
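
The generalization behind the toy double harmony can be stated directly as a reference checker. This is a minimal sketch under the simplification introduced above; the helper `is_harmonic` is a hypothetical name and is not part of _SigmaPie_.

```python
def is_harmonic(string):
    """True if the string mixes neither the vowels a/o nor the consonants b/p."""
    symbols = set(string)
    mixed_vowels = {"a", "o"} <= symbols
    mixed_consonants = {"b", "p"} <= symbols
    return not (mixed_vowels or mixed_consonants)


good = ["aaabbabba", "oppopooo", "aapapapp", "obooboboboobbb"]
bad = ["aabaoob", "paabab", "obabooo"]
print([is_harmonic(s) for s in good])
print([is_harmonic(s) for s in bad])
```

A checker of this kind is handy later on, when samples generated by learned grammars need to be compared against the intended pattern.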

### Tone plateauing

In some of the Bantu languages, the prosodic domain cannot have more than one stretch of H tones. For example, in Luganda (Bantu) a low tone (L) cannot intervene in-between two high tones (H): L is changed to H in such configurations. This pattern is called _tone plateauing_, and its computational properties are discussed in [Jardine (2015,](https://adamjardine.net/files/jardinecomptone-short.pdf) [2016)](https://adamjardine.net/files/jardine2016dissertation.pdf). Consider the following Luganda data from [Hyman and Katamba (2010)](http://linguistics.berkeley.edu/phonlab/documents/2010/Hyman_Katamba_Paris_PLAR.pdf), cited by [Jardine (2016)](https://www.cambridge.org/core/services/aop-cambridge-core/content/view/B01C656A2B96316F3ADCC836BD2A6244/S0952675716000129a.pdf/computationally_tone_is_different.pdf).

  * /tw-áa-mú-láb-a, walúsimbi/ $\Rightarrow$ tw-áá-mu-lab-a, walúsimbi <br>
    ‘we saw him, Walusimbi’ <br>
    **HHLLL, LHLL**
    
  * /tw-áa-láb-w-a walúsimbi/ $\Rightarrow$ tw-áá-láb-wá wálúsimbi <br>
    ‘we were seen by Walusimbi’ <br>
    **HHHHHHLL**
    
  * /tw-áa-láb-a byaa=walúsimbi/ $\Rightarrow$ tw-áá-láb-á byáá-wálúsimbi <br>
    ‘we saw those of Walusimbi’ <br>
    **HHHHHHHHLL**
    
Intuitively, this pattern can be generalized as "make sure that there is no L tone in-between two H tones".

    Good strings: HHLLL, LHHHLL, LLLLHHHH, ...
    Bad strings:  <*>LLHLHLLL, <*>HLHLLL, <*>HLLLLLHL, ...
    Generalization: no L tone should intervene in-between two H tones.

```python
tone_data = ["LLLL", "HHLLL", "LHHHLL", "LLLLHHHH", "HHH", "HHHHHLLL", "LLLLHH"]
```
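
As a quick sanity check of the intuition, the pattern can be verified with a regular expression that looks for an L with an H somewhere to its left and an H somewhere to its right. The function name `obeys_plateauing` is an assumption of this sketch.

```python
import re

# An L is illicit when some H precedes it and some H follows it.
ILLICIT = re.compile(r"H.*L.*H")


def obeys_plateauing(tones):
    """True if no L tone is flanked by H tones at any distance."""
    return ILLICIT.search(tones) is None


print([obeys_plateauing(t) for t in ["HHLLL", "LHHHLL", "LLLLHHHH"]])
print([obeys_plateauing(t) for t in ["LLHLHLLL", "HLHLLL", "HLLLLLHL"]])
```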

## Organization of SigmaPie
### The functionality of the toolkit

The functionality implemented in SigmaPie includes, but is not limited to...

  * **learners:** extract grammars from string sets;
  * **scanners:** evaluate strings with respect to a given grammar;
  * **sample generators:** generate stringsets for a given grammar;
  * **FSM constructors:** translate subregular grammars to finite state machines;
  * **polarity converters:** switch negative grammars to positive, and vice versa.

### How to run the code

#### Way 1: running from the terminal
  1. Download the code from the [SigmaPie GitHub folder](https://github.com/alenaks/SigmaPie).
  2. Open the terminal and use `cd` to move to the `SigmaPie/code/` directory.
  3. Start the Python 3 interpreter by typing `python3`.
  4. Run `from main import *` to load all the modules of the package.
 
  <img src="images/terminal.png" width="650">
  
#### Way 2: running from the Jupyter notebooks
  1. Download the code from the [SigmaPie GitHub folder](https://github.com/alenaks/SigmaPie).
  2. Modify the second line in the cell below so that it contains the correct path to `SigmaPie/code/`.
  3. Run that cell.

```python
%cd
%cd SigmaPie/code/

from main import *
```

### Intuitions behind the implemented subregular classes

Grammars can be positive or negative. **Positive grammars** list all substructures allowed in their language, whereas **negative grammars** list the substructures that must not occur in well-formed strings of their language. The two formats are equivalent: for every negative grammar, it is possible to construct a positive grammar that generates the same language, and vice versa.
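
The equivalence of the two polarities can be sketched in a few lines: a grammar of one polarity is the complement of the other within the set of all possible $k$-grams. The sketch assumes single-character symbols, plain strings instead of tuples, and no delimiters; `all_ngrams` and `flip_polarity` are hypothetical helpers, not the toolkit's own functions.

```python
from itertools import product


def all_ngrams(alphabet, n):
    """Every string of length n that can be built from `alphabet`."""
    return {"".join(gram) for gram in product(sorted(alphabet), repeat=n)}


def flip_polarity(grammar, alphabet, k):
    """Return the complement of `grammar` within all k-grams over `alphabet`."""
    return all_ngrams(alphabet, k) - set(grammar)


negative = {"ab", "ba"}
positive = flip_polarity(negative, {"a", "b", "c"}, 2)
print(sorted(positive))
```

Flipping the polarity twice returns the original grammar, which is why switching between the two formats never changes the language.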

**Negative strictly piecewise (`SP`)** grammars prohibit the occurrence of sequences of symbols at an arbitrary distance from each other. Every SP grammar is associated with the value of $k$ that defines the size of the longest sequence that this grammar can prohibit. Alternatively, if the grammar is positive, it lists all subsequences that are allowed in well-formed words of the language. **SP grammars capture only long-distance dependencies that do not include blockers.**

    k = 2
    POLARITY: negative
    GRAMMAR:  ab, ba
    LANGUAGE: accaacc, cbccc, cccacaaaa, ...
              <*>accacba, <*>bcccacbb, <*>bccccccca, ...
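
The example above can be replayed with a small scanner. This is a sketch of the idea rather than the toolkit's implementation: it assumes single-character symbols, and the names `subsequences` and `sp_scan_negative` are chosen for illustration.

```python
from itertools import combinations


def subsequences(string, k):
    """All length-k subsequences of `string` (order preserved, gaps allowed)."""
    return {"".join(combo) for combo in combinations(string, k)}


def sp_scan_negative(string, banned, k=2):
    """True if no banned k-subsequence occurs anywhere in `string`."""
    return subsequences(string, k).isdisjoint(banned)


banned = {"ab", "ba"}
print([sp_scan_negative(s, banned) for s in ["accaacc", "cbccc", "cccacaaaa"]])
print([sp_scan_negative(s, banned) for s in ["accacba", "bcccacbb", "bccccccca"]])
```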

**Negative strictly $k$-local (`SL`)** grammars prohibit the occurrence of consecutive substrings consisting of up to $k$ symbols. The value of $k$ defines the longest substring that cannot be present in a well-formed string of a language. Positive SL grammars define substrings that can be present in the language.
To define _first_ and _last_ elements, SL languages use delimiters (">" and "<") that indicate the beginning and the end of the string. In phonology, changes involve adjacent segments very frequently, and the notion of locality is therefore extremely important. A discussion of local processes in phonology can be found in [(Chandlee 2014)](http://dspace.udel.edu/bitstream/handle/19716/13374/2014_Chandlee_Jane_PhD.pdf). **SL grammars only capture local dependencies.**

    k = 2
    POLARITY: positive
    GRAMMAR:  >a, ab, ba, b<
    LANGUAGE: ab, abab, abababab, ...
              <*>babab, <*>abaab, <*>bababba, ...
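
A positive SL grammar can be checked by collecting the $k$-grams of the delimited string and verifying that all of them are licensed. The following is a minimal sketch assuming single-character symbols; `kgrams` and `sl_scan_positive` are hypothetical names.

```python
def kgrams(string, k):
    """All length-k substrings of `string` padded with the delimiters > and <."""
    padded = ">" * (k - 1) + string + "<" * (k - 1)
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def sl_scan_positive(string, allowed, k=2):
    """True if every k-gram of the delimited string is in `allowed`."""
    return kgrams(string, k) <= set(allowed)


allowed = {">a", "ab", "ba", "b<"}
print([sl_scan_positive(s, allowed) for s in ["ab", "abab", "abababab"]])
print([sl_scan_positive(s, allowed) for s in ["babab", "abaab", "bababba"]])
```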
              
**Tier-based strictly local (`TSL`)** grammars operate just like strictly local ones, but they have the power to _ignore_ a certain set of symbols completely. The symbols that are not ignored are called **tier** symbols, and the ones that do not matter for the well-formedness of strings are the **non-tier** ones [(Heinz et al. 2011)](https://pdfs.semanticscholar.org/b934/bfcc962f65e19ae139426668e8f8054e5616.pdf). The representation of a string with all non-tier symbols ignored is a _tier image_ of that string, and TSL grammars can then be defined as _SL grammars that operate over a tier._ **TSL grammars capture a single long-distance dependency that can possibly include blockers.**

    k = 2
    POLARITY: negative
    TIER:     b
    GRAMMAR:  ><, bb
    LANGUAGE: aaaabaaaa, b, aaaab, baaaa, aaabaaaa, ...
              <*>aaaaa, <*>aaaabaabaa, <*>baaabaaa, ...
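
The same example can be expressed as "project the tier, then run an SL check on the tier image". The sketch below assumes single-character symbols and uses hypothetical helper names; it is not the toolkit's own code.

```python
def tier_image(string, tier):
    """Erase every symbol of `string` that does not belong to `tier`."""
    return "".join(symbol for symbol in string if symbol in tier)


def tsl_scan_negative(string, tier, banned, k=2):
    """True if the delimited tier image contains no banned k-gram."""
    image = ">" * (k - 1) + tier_image(string, tier) + "<" * (k - 1)
    grams = {image[i:i + k] for i in range(len(image) - k + 1)}
    return grams.isdisjoint(banned)


tier, banned = {"b"}, {"><", "bb"}
print([tsl_scan_negative(s, tier, banned) for s in ["aaaabaaaa", "b", "aaaab"]])
print([tsl_scan_negative(s, tier, banned) for s in ["aaaaa", "aaaabaabaa", "baaabaaa"]])
```

Banning `><` on the tier rules out strings without any "b", and banning `bb` rules out strings with more than one, so the grammar accepts exactly the strings with a single "b".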
              
**Multiple tier-based strictly local (`MTSL`)** grammars are a conjunction of multiple TSL grammars: they consist of several tiers, and a set of restrictions is defined for every one of those tiers. In fact, numerous examples from the typological literature show that some phonological patterns are beyond the power of TSL languages. One example is any pattern where several long-distance dependencies affect different sets of elements; see [McMullin (2016)](https://www.dropbox.com/s/txmk4efif9f5bvb/McMullin_Dissertation_UBC.pdf?dl=0) and [Aksënova and Deshmukh (2018)](https://www.aclweb.org/anthology/W18-0307.pdf) for examples and discussion of such patterns. **MTSL grammars capture multiple long-distance dependencies that can possibly include blockers.**

    k = 2
    POLARITY:   negative
    TIER_1:     o, ö, u
    GRAMMAR_1:  oö, öo, uö
    TIER_2:     p, b
    GRAMMAR_2:  bp
    LANGUAGE:   obobo, öpöpbbu, öpuupobbo, opuopo, ...
                <*>öbuuupoo, <*>oobbböb, <*>poobböp, ...

### Attributes implemented for grammars

Languages ($L$) in the _SigmaPie_ toolkit are defined as objects initialized with the following attributes:

  * `L.polar` is the polarity of the grammar (default: positive). This attribute should be set directly only at initialization; to change it later, use `L.switch_polarity(new_value)`, which also converts the grammar to the new polarity;
  * `L.alphabet` is a set of symbols used in the language of the grammar;
  * `L.grammar` lists allowed or banned sequences;
  * `L.k` defines the locality window of the grammar;
  * `L.data` is a training sample;
  * `L.fsm` corresponds to a finite state device or devices that correspond to the grammar;
  * `L.edges` lists the delimiters implemented in the grammar (not relevant for SP);
  * `L.tier` is a list or lists of tier symbols (not relevant for SP and SL).


| Attributes         | SP | SL | TSL | MTSL |
|----------|----|----|-----|------|
| `L.polar`    | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `L.alphabet` | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `L.grammar`  | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `L.k`        | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `L.data`     | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `L.fsm`      | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `L.edges`    | $\neg\exists$  | $\surd$  | $\surd$   | $\surd$    |
| `L.tier`     | $\neg\exists$  | $\neg\exists$  | $\surd$   | $\surd$    |

### Methods implemented for grammars

The following methods are available in the toolkit for every language class $L$:

  * `L.learn()` extracts the grammar from the provided training sample in `L.data`;
  * `L.scan(string)` verifies if the `string` is well-formed with respect to `L.grammar` or not;
  * `L.check_polarity()` returns the `L.polar` of the grammar;
  * `L.switch_polarity(new_value)` changes the polarity of the grammar to `new_value`; if no argument is provided, it switches to the opposite polarity;
  * `L.extract_alphabet()` fills the `L.alphabet` attribute by determining it based on `L.data` or `L.grammar`;
  * `L.generate_all_ngrams(alphabet, n)` generates a list of all possible n-grams (length `n`) for the list of symbols in `alphabet`;
  * `L.generate_sample(n, repeat, safe)` randomly generates `n` strings that are well-formed with respect to `L.grammar`; `repeat` allows or prohibits repeated strings in the sample, and `safe` detects cases when `n` strings cannot be generated, i.e. when the grammar generates a finite language with fewer than `n` strings (safe mode is on by default);
  * `L.fsmize()` creates a FSM or a collection of FSMs that correspond to `L.grammar`;
  * `L.clean_grammar()` detects uninformative elements of `L.grammar` and removes them;
  * `L.tier_image(string)` returns a tier image or a list of tier images of the given `string` (relevant for TSL and MTSL);
  * `L.subsequences(string)` returns all possible `L.k`-long subsequences of `string`.
  
| Methods             | SP | SL | TSL | MTSL |
|---------------------|----|----|-----|------|
| `learn`               | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `scan`                | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `check_polarity`      | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `switch_polarity`     | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `extract_alphabet`    | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `generate_all_ngrams` | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `generate_sample`     | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `fsmize`              | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `clean_grammar`       | $\surd$  | $\surd$  | $\surd$   | $\surd$    |
| `tier_image`          | $\neg\exists$  | $\neg\exists$  | $\surd$   | $\surd$    |
| `subsequences`        | $\surd$  | $\neg\exists$  | $\neg\exists$   | $\neg\exists$    |


**Learning algorithms:**
  * **_k_-SL** and **_k_-SP** learning strategies are explained by [Heinz (2010)](http://jeffreyheinz.net/papers/Heinz-2010-SEL.pdf);
  * **_k_TSLIA** is a learning algorithm for $k$-TSL languages, designed by [Jardine and McMullin (2017)](https://adamjardine.net/files/jardinemcmullin2016tslk.pdf), which in turn is based on [Jardine and Heinz (2016)](http://jeffreyheinz.net/papers/Jardine-Heinz-2016-LTSLL.pdf);
  * **MTSL2IA** is a learning algorithm for $2$-MTSL languages, developed by [McMullin et al. (2019)](https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=scil). Currently we are working on extending this algorithm to an arbitrary window size.
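
To give a flavor of the simplest of these strategies, here is a sketch of positive $k$-SL and $k$-SP learning in the spirit of the string extension learners discussed by Heinz (2010): the grammar is the set of substructures attested in the sample. The functions `learn_sl` and `learn_sp` are illustrative assumptions and not the code used inside _SigmaPie_; they assume single-character symbols and collect subsequences of length exactly $k$.

```python
from itertools import combinations


def learn_sl(data, k=2):
    """Collect every k-gram attested in the delimited training strings."""
    grammar = set()
    for string in data:
        padded = ">" * (k - 1) + string + "<" * (k - 1)
        grammar.update(padded[i:i + k] for i in range(len(padded) - k + 1))
    return grammar


def learn_sp(data, k=2):
    """Collect every length-k subsequence attested in the training strings."""
    grammar = set()
    for string in data:
        grammar.update("".join(combo) for combo in combinations(string, k))
    return grammar


sample = ["ab", "abab"]
print(sorted(learn_sl(sample)))
print(sorted(learn_sp(sample)))
```

Both learners only ever add attested substructures to the grammar, which is why they can work from positive data alone.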


These methods enable a wide variety of ways to use the _SigmaPie_ toolkit.
For example, `L.scan(string)` checks if the current grammar still works with an updated or held-out dataset.
`L.generate_sample(n, repeat, safe)` can be used to generate an arbitrary-sized dataset for artificial grammar learning experiments, or neural models. Finally, if the grammar is hand-written, `L.clean_grammar()` will detect and remove its purposeless elements.

## Generalizing the harmonic pattern

The harmonic pattern involves two independent long-distance assimilations.

    Good strings: aaabbabba, oppopooo, aapapapp, obooboboboobbb, ...
    Bad strings:  <*>aabaoob, <*>paabab, <*>obabooo, ...
    Generalization: if a string contains "a", it cannot contain "o", and vice versa;
                    if a string contains "p", it cannot contain "b", and vice versa.

```python
harm_data = ['aabbaabb', 'abab', 'aabbab', 'abaabb', 'aabaab', 'abbabb', 'ooppoopp',
             'opop', 'ooppop', 'opoopp', 'oopoop', 'oppopp', 'aappaapp', 'apap',
             'aappap', 'apaapp', 'aapaap', 'appapp', 'oobboobb', 'obob', 'oobbob',
             'oboobb', 'ooboob', 'obbobb', 'aabb', 'ab', 'aab', 'abb', 'oopp', 'op',
             'oop', 'opp', 'oobb', 'ob', 'oob', 'obb', 'aapp', 'ap', 'aap', 'app',
             'aaa', 'ooo', 'bbb', 'ppp', 'a', 'o', 'b', 'p', '']
```

The first step is to initialize four grammars, one for each of the available classes, and to provide the training sample.

```python
sp_h = SP()
sl_h = SL()
tsl_h = TSL()
mtsl_h = MTSL()

sp_h.data = harm_data
sl_h.data = harm_data
tsl_h.data = harm_data
mtsl_h.data = harm_data
```

We can then simply fill the `L.alphabet` attributes by applying the method `L.extract_alphabet()`.

```python
sp_h.extract_alphabet()
sl_h.extract_alphabet()
tsl_h.extract_alphabet()
mtsl_h.extract_alphabet()

print("SP alphabet:", sp_h.alphabet)
print("SL alphabet:", sl_h.alphabet)
print("TSL alphabet:", tsl_h.alphabet)
print("MTSL alphabet:", mtsl_h.alphabet)
```

### Learning

After the data and alphabet are established, we can extract the grammar with its corresponding complexity from every one of those classes using the method `L.learn()`.

```python
sp_h.learn()
sl_h.learn()
tsl_h.learn()
mtsl_h.learn()
```

The learned **SP grammar** lists all possible $2$-long subsequences observed in the training sample. The sequences are represented as tuples instead of strings in order to avoid restricting the basic alphabet units to a single symbol.

```python
print("Positive SP grammar:", sp_h.grammar)
```

In order to express the same generalization using a negative grammar, we can call `L.switch_polarity()` on the grammar.

```python
sp_h.switch_polarity()
print("Polarity of the SP grammar:", sp_h.check_polarity())
print("SP grammar:", sp_h.grammar)
```

The negative **SL grammar** contains the same set of restrictions as its SP counterpart. However, the two grammars interpret these restrictions differently: SP grammars express long-distance restrictions, whereas SL grammars only limit local relations.

```python
sl_h.switch_polarity()
print("Polarity of the SL grammar:", sl_h.check_polarity())
print("SL grammar:", sl_h.grammar)
```

**TSL grammars** also express the same limitations over a tier as the previous two, and the tier includes all elements that participate in some sort of long-distance agreement. In this case, the tier includes all elements of the alphabet.

```python
print("TSL tier:", tsl_h.tier)
tsl_h.switch_polarity()
print("Polarity of the TSL grammar:", tsl_h.check_polarity())
print("TSL grammar:", tsl_h.grammar)
```

In **MTSL grammars**, the value of the attribute `L.grammar` is represented in the following way:

    G = {
            tier_1 (tuple): tier_1_restrictions (list),
            tier_2 (tuple): tier_2_restrictions (list),
                ...
            tier_n (tuple): tier_n_restrictions (list)
        }
        
The learned grammar detected two tiers: a tier of vowels `("a", "o")` and a tier of consonants `("p", "b")`. For every one of these tiers, it learned the corresponding set of restrictions.

```python
print("MTSL tiers:", mtsl_h.tier)
mtsl_h.switch_polarity()
print("Polarity of the MTSL grammar:", mtsl_h.check_polarity())
print("MTSL grammar:", mtsl_h.grammar)
```
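
To illustrate how a grammar in this tier-to-restrictions layout can be interpreted, here is a hand-written negative $2$-MTSL grammar for the double harmony together with a small scanner. The grammar below is an assumption made for illustration and may differ from what the learner actually outputs (for instance, it contains no delimiters); `tier_image` and `mtsl_scan` are hypothetical helpers.

```python
# Hypothetical negative 2-MTSL grammar: tier (tuple) -> banned bigrams (list)
grammar = {
    ("a", "o"): [("a", "o"), ("o", "a")],
    ("b", "p"): [("b", "p"), ("p", "b")],
}


def tier_image(string, tier):
    """Keep only the symbols of `string` that belong to `tier`."""
    return [symbol for symbol in string if symbol in tier]


def mtsl_scan(string, grammar):
    """True if no tier image of `string` contains a bigram banned on that tier."""
    for tier, banned in grammar.items():
        image = tier_image(string, tier)
        if any(bigram in banned for bigram in zip(image, image[1:])):
            return False
    return True


print([mtsl_scan(s, grammar) for s in ["aaabbabba", "oppopooo", "aapapapp"]])
print([mtsl_scan(s, grammar) for s in ["aabaoob", "paabab", "obabooo"]])
```

Each tier is checked independently, and a string is well-formed only if it passes on every tier.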

### Generating new strings

Now that all the grammars are learned, we can generate new data using the `L.generate_sample(n, repeat, safe)` method.

The **SP-generated** data is consistent with the desired pattern: no "a" is followed by "o" in the generated strings of the language, and the consonantal agreement is preserved as well. SP grammar succeeded in capturing the pattern!

```python
print(sp_h.generate_sample(25, repeat=False, safe=True))
```

As we can see below, **SL grammar** captured the local effect of the learned pattern ("p" is never adjacent to "b", "o" is never adjacent to "a", etc.), but it failed to generalize it to long-distance relations.

```python
print(sl_h.generate_sample(25, repeat=False))
```

Similarly to the errors made by the SL grammar, **TSL grammar** only captured a local dependency, and failed to generalize the pattern.

```python
print(tsl_h.generate_sample(25, repeat=False))
```

Increasing the locality window `L.k` does not solve the issue: violations still occur, only at a greater distance from each other.

```python
tsl_h.k = 3
tsl_h.learn()
print(tsl_h.generate_sample(25, repeat=False))
```

And, as the next cell exemplifies, the **MTSL grammar** also successfully captured the double harmony pattern. Intuitively, MTSL grammar learned that on the tier of vowels, "o" and "a" should never be adjacent, and on the tier of consonants, restrictions "pb" and "bp" need to be imposed.

```python
print(mtsl_h.generate_sample(25, repeat=False))
```

Overall, we can see that two grammars (SP and MTSL) successfully handled the double harmonic pattern.

  * **SP solution:** a word cannot contain two different vowels or two different consonants;
  * **MTSL solution:** if we only look at vowels, all vowels that are adjacent to each other need to be the same, and similarly for consonants.

## Generalizing the tone plateauing pattern

The tone plateauing process ensures that there is no L tone between two H tones.

    Good strings: HHLLL, LHHHLL, LLLLHHHH, ...
    Bad strings:  <*>LLHLHLLL, <*>HLHLLL, <*>HLLLLLHL, ...
    Generalization: no L tone should intervene in-between two H tones.

```python
tone_data = ['', 'LLLHL', 'HHLLLL', 'LLHHLL', 'HLL', 'HHHHHHHL', 'LHHHH', 'LH', 'HHL', 'HH',
             'HL', 'LLLHLLLLL', 'LHL', 'LLLLHLLLL', 'HHHHHH', 'HHHHH', 'LLLHLLLL', 'HHHHL',
             'HLLLLL', 'LLL', 'LHLLLL', 'L', 'LLLLL', 'HHH', 'HLLLL', 'HHHH', 'HHLLL',
             'LLLLLLLLH', 'HHHLL', 'LLHHHHH', 'LLLHHHHL', 'LLHL', 'LHHHL', 'LLLLH', 'LL',
             'HHLL', 'HHHLLLL', 'LHH', 'LHHLLL', 'HLLL', 'LHHH', 'LHLL', 'H', 'LLLHHHL', 'HHHL',
             'LLHLL', 'HHHHLL', 'LLH', 'HLLLLLLL', 'LHHLLLLL']
```

For the first step, let us directly initialize negative SP, SL, TSL and MTSL grammars.

```python
sp_tp = SP(polar='n')
sl_tp = SL(polar='n')
tsl_tp = TSL(polar='n')
mtsl_tp = MTSL(polar='n')
```

The next step is, as before, to provide the input data to language objects, and to extract the alphabets automatically.

```python
sp_tp.data = tone_data
sp_tp.extract_alphabet()

sl_tp.data = tone_data
sl_tp.extract_alphabet()

tsl_tp.data = tone_data
tsl_tp.extract_alphabet()

mtsl_tp.data = tone_data
mtsl_tp.extract_alphabet()
```

For this pattern, we will need to increase the locality window of the grammar to $3$.

```python
sp_tp.k = 3
sl_tp.k = 3
tsl_tp.k = 3
mtsl_tp.k = 3
```

### Learning

Similarly, we can now generalize the tone plateauing pattern with respect to those grammars.

```python
sp_tp.learn()
sl_tp.learn()
tsl_tp.learn()
mtsl_tp.learn()
```

The **SP grammar** learned the desired pattern, i.e. no "HLH" subsequence must be found across the input string.

```python
print("Negative SP grammar:", sp_tp.grammar)
```

**SL, TSL and MTSL** seem to detect the same restriction as well: "HLH" is prohibited by every one of those grammars.

```python
print("Negative SL grammar:", sl_tp.grammar)
print("Negative TSL grammar:", tsl_tp.grammar)
print("Negative MTSL grammar:", mtsl_tp.grammar)
```

### Generating new strings

Now that all initialized languages have learned their grammars, we can generate new data to see whether it is still consistent with the rule of tone plateauing, i.e. "don't have L tones in-between H tones".

The **SP grammar** consistently predicts correct forms: it allows the occurrence of H tones between L tones, but not the other way around.

```python
print("SP sample:", sp_tp.generate_sample(20, repeat=False))
```

However, none of the **local grammars** generalized the pattern correctly. Even though we do not see the "HLH" substring locally in any of the strings generated by those grammars, we observe the illicit substructure spread across some of those strings.

```python
print("SL sample:", sl_tp.generate_sample(20, repeat=False), "\n")
print("TSL tier:", tsl_tp.tier)
print("TSL sample:", tsl_tp.generate_sample(20, repeat=False), "\n")
print("MTSL tiers:", mtsl_tp.tier)
print("MTSL sample:", mtsl_tp.generate_sample(20, repeat=False))
```

Only the SP grammar was able to correctly generalize the tone plateauing pattern. The logic of that grammar is "nowhere in the string can a low tone be simultaneously followed and preceded by a high tone".
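
The difference between the local and the piecewise reading of the same restriction can be seen on a single string. In the sketch below, `has_substring` and `has_subsequence` are hypothetical helpers that check for "HLH" as a contiguous block and as a possibly discontinuous sequence, respectively.

```python
def has_substring(string, pattern):
    """True if `pattern` occurs in `string` as a contiguous block."""
    return pattern in string


def has_subsequence(string, pattern):
    """True if the symbols of `pattern` occur in `string` in order, gaps allowed."""
    remaining = iter(string)
    # `symbol in remaining` advances the iterator past the first match.
    return all(symbol in remaining for symbol in pattern)


word = "HLLH"
print(has_substring(word, "HLH"), has_subsequence(word, "HLH"))
```

The string "HLLH" violates tone plateauing, but a window of three adjacent symbols never contains "HLH", so a $3$-local grammar lets it through, while the SP grammar rejects it.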

## Grammar learning exercise: liquid dissimilation

In several languages (Latin and Georgian, among others), liquids tend to alternate.
Consider the following Latin data: if the final liquid of the stem is "l", the adjectival affix is realized as "aris". And vice versa, if the final liquid is "r", the choice of the affix is "alis".

  * mi<b>l</b>ita<b>r</b>is \~ <*>mi<b>l</b>ita<b>l</b>is _"military"_
  * f<b>l</b>o<b>r</b>a<b>l</b>is \~ <*>f<b>l</b>o<b>r</b>a<b>r</b>is _"floral"_
  * p<b>l</b>u<b>r</b>a<b>l</b>is \~ <*>p<b>l</b>u<b>r</b>a<b>r</b>is _"plural"_
  
We can simplify this pattern by mapping the liquids to themselves, and any intervening vowel or consonant to `c`:

    Good strings: clcccrccl, ccrccclcccrclc, lccccrlcccrcclccc, ...
    Bad strings:  <*>ccrcccrccl, <*>clccrcr, <*>lccrccrc, ...
    Generalization: the liquid following "r" needs to be "l", and vice versa;
                    "c" is irrelevant for the ordering of the liquids. 
                    
                    
The following dataset exemplifies the pattern of liquid dissimilation.

```python
liquid_data = ["", "ccc", "lccrcccclcr", "lrl", "rcclc"]
```

Use the cell below to explore which language type generalizes the liquid dissimilation pattern the best.
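
While experimenting, it can help to have a hand-written reference checker against which generated samples are compared. The following is one possible sketch of such a checker for the simplified pattern; `obeys_dissimilation` is a hypothetical helper and does not use any part of _SigmaPie_.

```python
LIQUIDS = {"l", "r"}


def liquid_tier(string):
    """Keep only the liquids of `string`, in their original order."""
    return [symbol for symbol in string if symbol in LIQUIDS]


def obeys_dissimilation(string):
    """True if neighboring liquids (ignoring everything else) always differ."""
    tier = liquid_tier(string)
    return all(first != second for first, second in zip(tier, tier[1:]))


good = ["clcccrccl", "ccrccclcccrclc", "lccccrlcccrcclccc"]
bad = ["ccrcccrccl", "clccrcr", "lccrccrc"]
print([obeys_dissimilation(s) for s in good])
print([obeys_dissimilation(s) for s in bad])
```

Running a checker of this kind over the output of `L.generate_sample(...)` for each grammar shows which of the learned grammars overgenerate.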

**NB:** an improved version of the MTSL learning algorithm is coming soon!

## Future work

Formal languages and their corresponding finite-state acceptors map strings to truth values. They answer the question **"Is this string well-formed according to the given grammar?"** This question helps to define the well-formedness conditions for _phonotactics_.

However, to capture _phonological processes_, we need to also ask
**"What string will be the output if we process the input string according to the given mapping?"** Subregular mappings and finite-state transductions map strings to strings, so they can help us with finding the answer to this question.
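
As an illustration of the difference between the two questions, tone plateauing can be written as a string-to-string function instead of a well-formedness check. This is a deliberately simplified sketch over a single prosodic domain, not an analysis of the Luganda data and not an implementation of any of the transduction learners; the function name `plateau` is an assumption.

```python
def plateau(tones):
    """Raise every L that sits between the first and the last H of `tones`."""
    first, last = tones.find("H"), tones.rfind("H")
    if first == -1:  # no H tone at all: nothing to change
        return tones
    return tones[:first] + "H" * (last - first + 1) + tones[last + 1:]


for underlying in ["HLLH", "LHLHLL", "HHLLL", "LLL"]:
    print(underlying, "->", plateau(underlying))
```

Where an acceptor would simply reject "HLLH", the mapping returns the repaired form, which is the kind of input-output behavior that transducers are meant to model.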

<img src="images/scheme.png" width="400">

Thus, the next steps of the development of _SigmaPie_ include the implementation of transducers and different transduction learning algorithms, such as:
  * Onward Subsequential Transducer Inference Algorithm (_OSTIA_) by [Oncina, Garcia and Vidal (1993)](https://pdfs.semanticscholar.org/9058/01c8e75daacb27d70ccc3c0b587411b6d213.pdf) and [de la Higuera (2014)](https://www.cambridge.org/core/books/grammatical-inference/CEEB229AC5A80DFC6436D860AC79434F);
  * Input Strictly Local Function Learning Algorithm (_ISLFLA_) by [Chandlee, Eyraud and Heinz (2014)](https://hal.archives-ouvertes.fr/hal-01193047/document);
  * Output Strictly Local Function Inference Algorithm (_OSLFIA_) by [Chandlee, Eyraud and Heinz (2015)](https://www.aclweb.org/anthology/W15-2310.pdf);
  
... and others.

**Suggestions** 

If your research can benefit in any way from the extension of _SigmaPie_, please let me know by shooting me an email at <alena.aksenova@stonybrook.edu>!


**Acknowledgments** 

I am very grateful to [_Thomas Graf_](https://thomasgraf.net/), [_Jeffrey Heinz_](http://jeffreyheinz.net/), [_Aniello De Santo_](https://aniellodesanto.github.io/about/) and _Ayla Karakaş_ whose input on different parts of this project was extremely helpful.