<div style="text-align: right">
    <i>
        AMP 2019 (October 12) <br>
        Alëna Aksënova
    </i>
</div>

# _SigmaPie_ for subregular grammar induction

## Subregular languages in phonology

This toolkit is relevant for anyone who works, or plans to work, with subregular grammars from the perspective of either theoretical linguistics or formal language theory.

**Why should theoretical linguistics be interested in formal language theory?** <br>
_Formal language theory_ explains how potentially infinite stringsets, or _formal languages_,
can be generalized to grammars encoding the desired patterns and what properties those
grammars have. It also allows one to compare different grammars with respect to parameters such as expressivity.

The **Chomsky hierarchy** ranks the main classes of formal languages by their expressive power.
  * **Regular** grammars are as powerful as finite state devices or regular expressions: they can "count" only up to a certain threshold (no $a^{n}b^{n}$ patterns);
  * **Context-free** grammars have access to a potentially infinite _stack_ that allows them to reproduce patterns that involve center embedding;
  * **Mildly context-sensitive** grammars are powerful enough to handle some types of cross-serial dependencies such as copying;
  * **Context-sensitive** grammars are restricted to a [memory tape](https://en.wikipedia.org/wiki/Punched_tape) whose length is bounded by the length of the input;
  * **Recursively enumerable** grammars are as powerful as any theoretically possible computer: they can use an infinitely long memory tape.



<img src="images/chomhier.png" width="600">


Both phonology and morphology frequently display properties of regular languages.

**Phonology** does not require the power of center-embedding. For example, consider a harmony where the first vowel agrees with the last vowel, the second vowel agrees with the penultimate vowel, etc.
    
    GOOD: "arugula", "tropicalization", "electrotelethermometer", etc.
    BAD:  any other word violating the rule.


While this is a theoretically possible pattern, harmonies of that type are unattested in natural languages.
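
The hypothetical pattern above can be made concrete with a few lines of Python. This is only an illustrative sketch: it assumes that "agreement" means identity, and that the orthographic vowels `a, e, i, o, u` stand in for phonological ones. The helper name `has_mirror_harmony` is made up for this example and is not part of _SigmaPie_.

```python
VOWELS = set("aeiou")  # assumption: orthographic vowels approximate phonological ones


def has_mirror_harmony(word):
    """Return True if the i-th vowel is identical to the i-th vowel from the end."""
    vowels = [ch for ch in word.lower() if ch in VOWELS]
    # The vowel sequence must read the same in both directions.
    return vowels == vowels[::-1]


for word in ["arugula", "tropicalization", "electrotelethermometer", "harmony"]:
    print(word, has_mirror_harmony(word))
```

Checking this pattern requires comparing vowels pairwise from the outside in, which is exactly the kind of center-embedded dependency that a finite state device cannot track for words of unbounded length.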

**Morphology** avoids center-embedding as well. In [Aksënova et al. (2016)](https://www.aclweb.org/anthology/W16-2019) we show that it is possible to iterate prefixes with the meaning "after" in Russian. In Ilocano, where the same semantics is expressed via a circumfix, its iteration is prohibited.
    
    RUSSIAN: "zavtra" (tomorrow), "posle-zavtra" (the day after tomorrow), 
             "posle-posle-zavtra" (the day after the day after tomorrow), ...
    ILOCANO: "bigat" (morning), "ka-bigat-an" (the next morning),
             <*>"ka-ka-bigat-an-an" (the morning after the next one).


Moreover, a typological review of patterns shows that phonology and morphology do not require the full power of regular languages. As an example of an unattested pattern, [Heinz (2011)](http://jeffreyheinz.net/papers/Heinz-2011-CPF.pdf) provides a language where a word must have an even number of vowels to be well-formed.
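
To see why the "even number of vowels" language is regular, note that a two-state automaton suffices to recognize it. The sketch below simulates such an automaton in plain Python; the function name `even_vowels` and the use of orthographic vowels are assumptions made for illustration only.

```python
VOWELS = set("aeiou")  # assumption: orthographic vowels


def even_vowels(word):
    """Simulate a two-state automaton that tracks the parity of vowels."""
    state = 0  # 0 = even number of vowels so far, 1 = odd
    for ch in word:
        if ch in VOWELS:
            state = 1 - state  # every vowel flips the parity
    return state == 0  # accept only in the "even" state


for word in ["", "baba", "banana"]:
    print(repr(word), even_vowels(word))
```

The automaton needs only finite memory, so the language is regular, yet no attested phonological or morphological pattern seems to rely on this kind of parity counting.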


Regular languages can be sub-divided into another nested hierarchy of language classes of decreasing expressive power: the **subregular hierarchy**.


<img src="images/subreg.png" width="250">


This tutorial and the _SigmaPie_ toolkit currently cover the following classes:
  * strictly piecewise (SP);
  * strictly local (SL);
  * tier-based strictly local (TSL);
  * multiple tier-based strictly local (MTSL).

## Functionality of the toolkit

  * **Learners** extract grammars from stringsets.
  * **Scanners** evaluate strings with respect to a given grammar.
  * **Sample generators** generate stringsets for a given grammar.
  * **FSM constructors** translate subregular grammars to finite state machines.
  * **Polarity converters** switch negative grammars to positive, and vice versa.

In [None]:
%cd
%cd Desktop/SigmaPie/code/

from main import *

## Strictly piecewise languages

**Negative strictly $k$-piecewise (SP)** grammars prohibit the occurrence of sequences of $k$ symbols at an arbitrary distance from each other. The value of $k$ defines the size of the window of the grammar, or the length of the longest sequence that the grammar can prohibit. Alternatively, if the grammar is positive, it lists the subsequences that are allowed in well-formed words of the language.

    k = 2
    POLARITY: negative
    GRAMMAR:  ab, ba
    LANGUAGE: accaacc, cbccc, cccacaaaa, ...
              <*>accacba, <*>bcccacbb, <*>bccccccca, ...
              
              
In phonology, an example of an SP pattern is _tone plateauing_, discussed in [Jardine (2015,](https://adamjardine.net/files/jardinecomptone-short.pdf) [2016)](https://adamjardine.net/files/jardine2016dissertation.pdf).
For example, in Luganda (Bantu) a low tone (L) cannot intervene between two high tones (H): L is changed to H in such a configuration.
As a result, a prosodic domain cannot have more than one stretch of H tones.

**Luganda verb and noun combinations** (Hyman and Katamba (2010), cited by Jardine (2016))

  * /tw-áa-mú-láb-a, walúsimbi/ $\Rightarrow$ tw-áá-mu-lab-a, walúsimbi <br>
    ‘we saw him, Walusimbi’ <br>
    **HHLLL, LHLL**
    
  * /tw-áa-láb-w-a walúsimbi/ $\Rightarrow$ tw-áá-láb-wá wálúsimbi <br>
    ‘we were seen by Walusimbi’ <br>
    **HHHHHHLL**
    
  * /tw-áa-láb-a byaa=walúsimbi/ $\Rightarrow$ tw-áá-láb-á byáá-wálúsimbi <br>
    ‘we saw those of Walusimbi’ <br>
    **HHHHHHHHLL**
    
This pattern can be described using the SP grammar $G_{SP_{neg}} = \{HLH\}$.
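
What it means for a string to satisfy a negative SP grammar can be sketched without the toolkit. The helpers below (`contains_subsequence`, `sp_scan`) are hypothetical names introduced for this illustration and are not the _SigmaPie_ implementation: a string is accepted if none of the prohibited sequences occurs in it as a subsequence, that is, in order but not necessarily adjacent.

```python
def contains_subsequence(string, pattern):
    """Return True if the symbols of `pattern` occur in `string` in order,
    with any amount of material in between."""
    remaining = iter(string)
    # `symbol in remaining` advances the iterator past the first match,
    # so each symbol is searched for only after the previous one.
    return all(symbol in remaining for symbol in pattern)


def sp_scan(string, prohibited):
    """Accept `string` if it contains none of the prohibited subsequences."""
    return not any(contains_subsequence(string, p) for p in prohibited)


grammar = ["HLH"]
for tones in ["HHLLL", "LHLL", "HHHHHHLL", "HLLLLLLH"]:
    print(tones, sp_scan(tones, grammar))
```

The surface tone strings from the Luganda examples are accepted, while a string such as `HLLLLLLH`, where low tones separate two high tones, is rejected no matter how far apart the two H tones are.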

### Learning tone plateauing pattern

Let us say that `luganda` represents a "toy" example of the tonal plateauing (TP) pattern.

In [None]:
luganda = ["LLLL", "HHLLL", "LHHHLL", "LLLLHHHH"]

Our goal will be to learn the generalization behind TP.

Negative and positive SP grammars are implemented in the package in the `SP()` class.

In [None]:
tp_pattern = SP()

### Attributes of SP grammars
  * `alphabet` (list) is the set of symbols that the grammar uses;
  * `grammar` (list of tuples) is the list of allowed or prohibited substructures of the language;
  * `k` (int) is the size of the locality window of the grammar, by default it is $2$;
  * `data` (list of strings) is the learning sample;
  * `fsm` (FSM object) is the finite state device that corresponds to the grammar; in this case, the device is an FSM family constructed according to [Heinz & Rogers (2013)](https://www.aclweb.org/anthology/W13-3007).
  
The initial step is to define the training sample and the alphabet.

In [None]:
tp_pattern.data = luganda
tp_pattern.alphabet = ["H", "L"]

By default, the locality window of the grammar is $2$ and the delimiters are ">" and "<".

In [None]:
print("Locality of the SP grammar:", tp_pattern.k)
print("Delimiters:", tp_pattern.edges)

All these attributes can be directly accessed. For example, let us change the locality of the window from $2$ to $3$:

In [None]:
tp_pattern.k = 3
print("Locality of the SP grammar:", tp_pattern.k)

### Methods for SP grammars
  * `check_polarity()` and `switch_polarity()` display and change the polarity of the grammar;
  * `learn()` extracts prohibited or allowed subsequences from the training sample;
  * `scan(string)` tells if a given string is well-formed with respect to a learned grammar;
  * `extract_alphabet()` collects the alphabet based on the provided data;
  * `generate_sample(n, repeat)` generates $n$ strings based on the given grammar; by default, `repeat` is set to False, and repetitions of the generated strings are not allowed, but this parameter can be set to True;
  * `fsmize()` creates the corresponding FSM family by following the steps outlined in [Heinz & Rogers (2013)](https://www.aclweb.org/anthology/W13-3007);
  * `subsequences(string)` returns all $k$-piecewise subsequences of the given string;
  * `generate_all_ngrams()` generates all possible strings of length $k$ based on the provided alphabet.

**Checking and changing polarity of the grammar**

By default, the grammars are positive. The polarity can be checked by running the `check_polarity` method:

In [None]:
print("Polarity of the grammar:", tp_pattern.check_polarity())

If the polarity needs to be changed, this can be done using the `switch_polarity` method. It will automatically switch the grammar, if one is provided or already extracted, to the opposite one.

In [None]:
tp_pattern.switch_polarity()
print("Polarity of the grammar:", tp_pattern.check_polarity())

**Learning the SP grammar**

The `learn` method extracts allowed or prohibited subsequences from the learning sample based on the polarity of the grammar and the locality window. Currently, $k=3$ and the grammar is negative.

In [None]:
tp_pattern.learn()
print("Extracted grammar:", tp_pattern.grammar)

Indeed, it learned the TP pattern!

$n$-grams are represented as tuples of strings, because in this case, elements of the alphabet are not restricted to characters, and it allows for other representations to be learned as well.
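
The idea behind learning a negative SP grammar can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the toolkit's own code: the helper names `k_subsequences` and `learn_negative_sp` are hypothetical, and the sketch simply treats every possible $k$-long sequence over the alphabet that is never attested as a subsequence in the data as prohibited.

```python
from itertools import combinations, product


def k_subsequences(string, k):
    """All k-long subsequences of `string`, as tuples of symbols."""
    return set(combinations(string, k))


def learn_negative_sp(data, alphabet, k):
    """Prohibit every k-long sequence that is unattested in the data."""
    attested = set()
    for string in data:
        attested |= k_subsequences(string, k)
    possible = set(product(alphabet, repeat=k))
    return sorted(possible - attested)


luganda = ["LLLL", "HHLLL", "LHHHLL", "LLLLHHHH"]
print(learn_negative_sp(luganda, ["H", "L"], k=3))  # [('H', 'L', 'H')]
```

With $k=3$, seven of the eight possible sequences over `H` and `L` are attested in the toy sample, so the only prohibited one is `HLH`.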

**Scanning strings and telling if they are part of the language**

The `scan` method takes a string as input and returns True or False depending on whether that string is contained in the language of the grammar:

In [None]:
tp = ["HHHLLL", "L", "HHL", "LLHLLL"]
no_tp = ["LLLLHLLLLH", "HLLLLLLH", "LLLHLLLHLLLHL"]

print("Tonal plateauing:")
for string in tp:
    print("String", string, "is in L(G):", tp_pattern.scan(string))
    
print("\nNo tonal plateauing:")
for string in no_tp:
    print("String", string, "is in L(G):", tp_pattern.scan(string))

**Generating a data sample**

Based on the learned grammar, a data sample of the desired size can be generated.

In [None]:
sample = tp_pattern.generate_sample(n = 10)
print("Sample:", sample)

**Extracting subsequences**

Finally, the toolkit can also extract the subsequences of an input word: feed the word to the `subsequences` method.

In [None]:
tp_pattern.k = 3
print("k = 3:", tp_pattern.subsequences("regular"), "\n")
tp_pattern.k = 5
print("k = 5:", tp_pattern.subsequences("regular"))

While SP languages capture multiple long-distance processes such as tone plateauing and some harmonies, they are unable to capture local processes or blocking effects.

## Strictly local languages

**Negative strictly $k$-local (SL)** grammars prohibit the occurrence of substrings of up to $k$ consecutive symbols. The value of $k$ defines the longest substring that cannot be present in a well-formed string of the language. Positive SL grammars define the substrings that can be present in the language.

Importantly, in order to define _first_ and _last_ elements, SL languages use delimiters (">" and "<") that indicate the beginning and the end of the string.

    k = 2
    POLARITY: positive
    GRAMMAR:  >a, ab, ba, b<
    LANGUAGE: ab, abab, abababab, ...
              <*>babab, <*>abaab, <*>bababba, ...
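
The example above can be checked with a short, toolkit-independent sketch. The helpers `k_grams` and `sl_scan` are hypothetical names used only for illustration: the string is padded with the delimiters, and it is accepted by a positive SL grammar if every $k$-gram of the padded string is listed in the grammar.

```python
def k_grams(string, k=2, edges=(">", "<")):
    """Pad `string` with k-1 delimiters on each side and list its k-grams."""
    padded = [edges[0]] * (k - 1) + list(string) + [edges[1]] * (k - 1)
    return [tuple(padded[i:i + k]) for i in range(len(padded) - k + 1)]


def sl_scan(string, grammar, k=2):
    """Accept `string` if all of its k-grams are allowed by the positive grammar."""
    return all(gram in grammar for gram in k_grams(string, k))


grammar = {(">", "a"), ("a", "b"), ("b", "a"), ("b", "<")}
for word in ["ab", "abab", "babab", "abaab", "bababba"]:
    print(word, sl_scan(word, grammar))
```

Here `babab` fails because `>b` is not allowed, and `abaab` fails because of the substring `aa`.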

In phonology, changes very frequently involve adjacent segments, so the notion of locality is extremely important. Local processes in phonology are discussed in [Chandlee (2014)](http://dspace.udel.edu/bitstream/handle/19716/13374/2014_Chandlee_Jane_PhD.pdf).


**Russian word-final devoicing**

In Russian, the final obstruent of a word cannot be voiced. <br>
  * "lug" \[luK\] _meadow_ $\Rightarrow$ "lug-a" \[luGa\] _of the meadow_
  * "luk" \[luK\] _onion_ $\Rightarrow$ "luk-a" \[luKa\] _of the onion_
  * "porog" \[paroK\] _doorstep_ $\Rightarrow$ "porog-a" \[paroGa\] _of the doorstep_
  * "porok" \[paroK\] _vice_ $\Rightarrow$ "porok-a" \[paroKa\] _of the vice_

### Learning word-final devoicing

Assume a toy dataset that uses the following mapping:
  * "a" stands for a vowel;
  * "b" stands for a voiced obstruent;
  * "p" stands for any other consonant.

In [None]:
russian = ["", "ababa", "babbap", "pappa", "pabpaapba", "aap"]

In these terms, the Russian word-final devoicing generalization would be _"do not have "b" at the end of the word"_. However, in order to define "beginning" and "end", we need to use the delimiters ">" and "<".

This pattern can then be described using SL grammar $G_{SL_{neg}} = \{b<\}$.

Let us initialize an SL object.

In [None]:
wf_devoicing = SL()
wf_devoicing.data = russian

### Attributes of SL grammars
  * `alphabet` (list) is the set of symbols that the grammar uses;
  * `grammar` (list of tuples) is the list of allowed or prohibited substructures of the language;
  * `k` (int) is the size of the locality window of the grammar, by default it is $2$;
  * `data` (list of strings) is the learning sample;
  * `edges` (list of two characters) are the delimiters used by the grammar, the default value is ">" and "<";
  * `fsm` (FSM object) is the finite state device that corresponds to the grammar.
  
### Methods defined for SL grammars
  * `check_polarity()` and `switch_polarity()` display and change the polarity of the grammar;
  * `learn()` extracts prohibited or allowed substrings from the training sample;
  * `scan(string)` tells if a given string is well-formed with respect to a learned grammar;
  * `extract_alphabet()` collects the alphabet based on the provided data;
  * `generate_sample(n, repeat)` generates $n$ strings based on the given grammar; by default, `repeat` is set to False, and repetitions of the generated strings are not allowed, but this parameter can be set to True;
  * `fsmize()` creates the corresponding FSA;
  * `clean_grammar()` removes useless $k$-grams from the grammar.

**Extracting alphabet and learning SL grammar**

As before, the `learn()` method extracts dependencies from the data. It simply extracts $k$-grams of the indicated size from the data, and the default value of $k$ is $2$.

In [None]:
wf_devoicing.learn()
print("The grammar is", wf_devoicing.grammar)
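
The extraction step can be sketched independently of the toolkit. The function `learn_positive_sl` below is a hypothetical helper, not the _SigmaPie_ implementation, and it assumes that the empty string contributes the bigram `(">", "<")`: each training string is padded with delimiters, and every $k$-gram found in the padded strings is collected.

```python
def learn_positive_sl(data, k=2, edges=(">", "<")):
    """Collect every k-gram attested in the delimiter-padded training strings."""
    grammar = set()
    for string in data:
        padded = [edges[0]] * (k - 1) + list(string) + [edges[1]] * (k - 1)
        grammar.update(tuple(padded[i:i + k]) for i in range(len(padded) - k + 1))
    return grammar


russian = ["", "ababa", "babbap", "pappa", "pabpaapba", "aap"]
positive = learn_positive_sl(russian)
print(sorted(positive))
print(("b", "<") in positive)  # False: no training word ends in "b"
```

In this sketch, the only well-formed bigram over `a`, `b`, `p` and the delimiters that never shows up in the sample is `b<`, which is exactly the word-final devoicing restriction.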

In order to automatically extract the alphabet from the data, it is possible to run `extract_alphabet()`.

In [None]:
print("Original value of the alphabet is", wf_devoicing.alphabet)
wf_devoicing.extract_alphabet()
print("Modified value of the alphabet is", wf_devoicing.alphabet)

**Changing polarity of the grammar**

The grammar printed above is positive. If we want to capture the pattern using restrictions rather than allowed substrings, we can call `switch_polarity()` on the grammar:

In [None]:
wf_devoicing.switch_polarity()
print("The grammar is", wf_devoicing.grammar)
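
Conceptually, switching polarity amounts to taking the complement of the grammar with respect to all well-formed $k$-grams. The sketch below illustrates this for bigrams only; `all_bigrams` and `complement` are hypothetical helpers rather than toolkit functions, and the sketch assumes that a bigram is well-formed when ">" appears only in first position and "<" only in second position (so `("<", "a")` or `("a", ">")` are never generated).

```python
from itertools import product


def all_bigrams(alphabet, edges=(">", "<")):
    """All well-formed bigrams: '>' may only come first, '<' may only come last."""
    start, end = edges
    firsts = [start] + list(alphabet)
    seconds = list(alphabet) + [end]
    return set(product(firsts, seconds))


def complement(grammar, alphabet):
    """Return the grammar of opposite polarity that defines the same language."""
    return all_bigrams(alphabet) - set(grammar)


# Positive bigrams attested in the toy Russian sample, written compactly.
positive = {tuple(g) for g in
            ">< >a >b >p a< p< aa ab ap ba bb bp pa pb pp".split()}
negative = complement(positive, "abp")
print(negative)  # {('b', '<')}
```

Applying `complement` to the negative grammar returns the original positive one, which is why the polarity can be switched back and forth without losing information.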

**Scanning strings**

As before, the `scan(string)` method returns True or False depending on the well-formedness of the given string with respect to the learned grammar.

In [None]:
wfd = ["apapap", "papa", "abba"]
no_wfd = ["apab", "apapapb"]

print("Word-final devoicing:")
for string in wfd:
    print("String", string, "is in L(G):", wf_devoicing.scan(string))
    
print("\nNo word-final devoicing:")
for string in no_wfd:
    print("String", string, "is in L(G):", wf_devoicing.scan(string))

**Generating data samples**

If the grammar is non-empty, the data sample can be generated in the same way as before for SP grammars: `generate_sample(n, repeat)`, where `n` is the number of examples that need to be generated, and `repeat` is a flag allowing or prohibiting repetitions of the same string in the generated data.

In [None]:
sample = wf_devoicing.generate_sample(5, repeat = False)
print(sample)
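
One simple way to generate strings from a positive bigram grammar is a random walk from ">" to "<". The sketch below is an illustration under stated assumptions, not the toolkit's algorithm: `random_sample` is a hypothetical function, it only handles $k=2$, it assumes the grammar contains no useless bigrams (otherwise the walk can reach a dead end), and with `repeat=False` it assumes the language has at least `n` distinct strings.

```python
import random


def random_sample(grammar, n, repeat=False, edges=(">", "<")):
    """Generate n strings by random walks through a positive bigram grammar."""
    start, end = edges
    sample = []
    while len(sample) < n:
        symbols, current = [], start
        while True:
            # Pick any symbol that is allowed to follow the current one.
            current = random.choice([y for x, y in grammar if x == current])
            if current == end:
                break
            symbols.append(current)
        string = "".join(symbols)
        if repeat or string not in sample:
            sample.append(string)
    return sample


grammar = [(">", "a"), ("a", "b"), ("b", "a"), ("b", "<")]
print(random_sample(grammar, 5))
```

Every generated string is well-formed by construction, because each step of the walk follows a bigram that the grammar allows.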

**Cleaning grammar**

A grammar provided by the user can potentially contain "useless" $k$-grams. For example, consider the following grammar:

In [None]:
sl = SL()
sl.grammar = [(">", "a"), ("b", "a"), ("a", "b"), ("b", "<"),
              (">", "g"), ("f", "<"), ("t", "t")]

This grammar contains $3$ useless bigrams:
  
  * `(">", "g")` can never be used because nothing can follow "g";
  * `("f", "<")` is useless because there is no way to start a string that would lead to "f";
  * `("t", "t")` has both problems listed above.
  
The `clean_grammar()` method detects and removes such $k$-grams by constructing the corresponding finite state machine and trimming all inaccessible nodes of that FSM.

In [None]:
print("Old grammar:", sl.grammar)
sl.clean_grammar()
print("Clean grammar:", sl.grammar)
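
For bigrams, the trimming idea can be approximated with plain graph reachability. The sketch below is a simplified stand-in for the FSM-based procedure, limited to $k=2$; `reachable` and `clean_bigrams` are hypothetical names. A bigram is kept only if its first symbol can be reached from ">" and its second symbol can reach "<".

```python
def reachable(source, arcs):
    """Return all symbols reachable from `source` by following `arcs`."""
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        for x, y in arcs:
            if x == node and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def clean_bigrams(grammar, delimiters=(">", "<")):
    """Keep bigrams that lie on some path from the start to the end delimiter."""
    start, end = delimiters
    forward = reachable(start, grammar)                        # reachable from ">"
    backward = reachable(end, [(y, x) for x, y in grammar])    # can reach "<"
    return [g for g in grammar if g[0] in forward and g[1] in backward]


grammar = [(">", "a"), ("b", "a"), ("a", "b"), ("b", "<"),
           (">", "g"), ("f", "<"), ("t", "t")]
print(clean_bigrams(grammar))
```

On the grammar above, the sketch discards `(">", "g")`, `("f", "<")` and `("t", "t")` and keeps the remaining four bigrams.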

Even though SP and SL languages can capture a large portion of phonological well-formedness conditions, there are numerous examples of patterns that require increased complexity. For example, **harmony with a blocking effect** cannot be captured using SP grammars because they will "miss" a blocker, and cannot be encoded via SL grammars because they cannot be used for long-distance processes.

## Tier-based strictly local languages

**Acknowledgements** 
  * Thomas
  * Jeff
  * Ani

**Bibliography**

  * Make a reference to Chomsky (?) for the Chomsky hierarchy
  * Kaplan and Kay (1994)
  * Karttunen et al. (1992)
  * Shieber (1985)
  * Heinz (2011)
  * Aksënova et al. (2016)
  * Jardine (2015, 2016)
  * Hyman and Katamba (2010)
  * Heinz and Rogers (2013). Learning subregular classes of languages with factored deterministic automata. In Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13), pages 64–71, Sofia, Bulgaria. Association for Computational Linguistics.
  * Chandlee (2014)