<div style="text-align: right">
    <i>
        AMP 2019 (October 12) <br>
        Alëna Aksënova
    </i>
</div>

# _SigmaPie_ for subregular grammar induction

## Subregular languages in phonology

This toolkit is relevant for anyone who works, or plans to work, with subregular grammars, from the perspective of either theoretical linguistics or formal language theory.

**Why should theoretical linguistics be interested in formal language theory?** <br>
_Formal language theory_ explains how potentially infinite stringsets, or _formal languages_,
can be generalized to grammars encoding the desired patterns and what properties those
grammars have. It also allows one to compare different grammars with respect to parameters such as expressivity.

The **Chomsky hierarchy** orders the main classes of formal languages by their expressive power.
  * **Regular** grammars are as powerful as finite-state devices or regular expressions: they can "count" only up to a certain threshold (no $a^{n}b^{n}$ patterns);
  * **Context-free** grammars have access to a potentially infinite _stack_ that allows them to reproduce patterns involving center embedding;
  * **Mildly context-sensitive** grammars are powerful enough to handle some types of cross-serial dependencies such as copying;
  * **Context-sensitive** grammars are restricted to a [memory tape](https://en.wikipedia.org/wiki/Punched_tape) whose length is bounded by the length of the input;
  * **Recursively enumerable** grammars are as powerful as any theoretically possible computer: they can use an infinitely long memory tape.
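
To make the "counting" limitation of regular grammars concrete, here is a minimal sketch in plain Python (it is not part of _SigmaPie_, and the function names are hypothetical). The regular expression `a*b*` can only check the order of the symbols, while recognizing $a^{n}b^{n}$ requires a counter that can grow without bound.

```python
import re

# A regular pattern: any number of a's followed by any number of b's.
A_STAR_B_STAR = re.compile(r"a*b*")


def is_astar_bstar(word):
    """Finite-state check: a's precede b's, with no comparison of counts."""
    return A_STAR_B_STAR.fullmatch(word) is not None


def is_anbn(word):
    """Check membership in a^n b^n (n >= 0) using an unbounded counter."""
    count = 0
    seen_b = False
    for symbol in word:
        if symbol == "a":
            if seen_b:          # an 'a' after a 'b' breaks the pattern
                return False
            count += 1
        elif symbol == "b":
            seen_b = True
            count -= 1
            if count < 0:       # more b's than a's so far
                return False
        else:
            return False        # symbol outside the alphabet {a, b}
    return count == 0


print(is_astar_bstar("aab"), is_anbn("aab"))    # order is fine, counts differ
print(is_astar_bstar("aabb"), is_anbn("aabb"))  # both checks succeed
```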



<img src="images/chomhier.png" width="600">


Both phonology and morphology frequently display properties of regular languages.

**Phonology** does not require the power of center embedding. For example, consider a harmony where the first vowel agrees with the last vowel, the second vowel agrees with the penultimate vowel, etc.
    
    GOOD: "arugula", "tropicalization", "electrotelethermometer", etc.
    BAD:  any other word violating the rule.


While this pattern is theoretically possible, harmonies of that type are unattested in natural languages.
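
The hypothetical harmony above amounts to requiring that the sequence of vowels reads the same in both directions. A minimal sketch of such a check, assuming the vowel inventory `a, e, i, o, u` (with `y` treated as a consonant) and hypothetical function names:

```python
# Assumed vowel inventory for the illustration; 'y' is treated as a consonant.
VOWELS = set("aeiou")


def vowel_sequence(word):
    """Return the vowels of the word in order, ignoring all consonants."""
    return [symbol for symbol in word.lower() if symbol in VOWELS]


def has_mirror_harmony(word):
    """True if the first vowel matches the last, the second the penultimate, etc."""
    vowels = vowel_sequence(word)
    return vowels == vowels[::-1]


for word in ["arugula", "tropicalization", "electrotelethermometer", "linguistics"]:
    print(word, has_mirror_harmony(word))
```

The first three words satisfy the rule, while "linguistics" (vowels `i, u, i, i`) violates it.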

**Morphology** avoids center-embedding as well. In [Aksënova et al. (2016)](https://www.aclweb.org/anthology/W16-2019) we show that it is possible to iterate prefixes with the meaning "after" in Russian. In Ilocano, where the same semantics is expressed via a circumfix, its iteration is prohibited.
    
    RUSSIAN: "zavtra" (tomorrow), "posle-zavtra" (the day after tomorrow), 
             "posle-posle-zavtra" (the day after the day after tomorrow), ...
    ILOCANO: "bigat" (morning), "ka-bigat-an" (the next morning),
             <*>"ka-ka-bigat-an-an" (the morning after the next one).


Moreover, a typological review of patterns shows that phonology and morphology do not require the full power of regular languages. As an example of an unattested pattern, [Heinz (2011)](http://jeffreyheinz.net/papers/Heinz-2011-CPF.pdf) provides a language where a word must have an even number of vowels to be well-formed.
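
The even-vowel language is regular even though it is unattested: a machine with two states is enough to recognize it. The sketch below illustrates this in plain Python, assuming the vowel inventory `a, e, i, o, u`. The state and function names are hypothetical.

```python
# Assumed vowel inventory for the illustration.
VOWELS = set("aeiou")


def has_even_vowels(word):
    """Simulate a two-state automaton that tracks the parity of vowels."""
    state = "even"                      # start state, also the accepting state
    for symbol in word.lower():
        if symbol in VOWELS:            # every vowel flips the state
            state = "odd" if state == "even" else "even"
        # consonants leave the state unchanged
    return state == "even"


print(has_even_vowels("data"))    # two vowels
print(has_even_vowels("banana"))  # three vowels
```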


Regular languages can be subdivided into another nested hierarchy of language classes of decreasing expressive power: the **subregular hierarchy**.


<img src="images/subreg.png" width="250">


This tutorial and the _SigmaPie_ toolkit currently cover the following classes:
  * strictly piecewise (SP);
  * strictly local (SL);
  * tier-based strictly local (TSL);
  * multiple tier-based strictly local (MTSL).

## Functionality of the toolkit

  * **Learners** extract grammars from stringsets.
  * **Scanners** evaluate strings with respect to a given grammar.
  * **Sample generators** generate stringsets for a given grammar.
  * **FSM constructors** translate subregular grammars to finite state machines.
  * **Polarity converters** switch negative grammars to positive, and vice versa.

In [55]:
%cd
%cd Desktop/SigmaPie/code/

from main import *

/home/alenaks
/home/alenaks/Desktop/SigmaPie/code


## Strictly piecewise languages

**Negative strictly $k$-piecewise (SP)** grammars prohibit the occurrence of subsequences of $k$ symbols, that is, sequences whose symbols appear in the given order at an arbitrary distance from each other. The value of $k$ defines the size of the window of the grammar, or the length of the longest sequence that the grammar can prohibit. Alternatively, if the grammar is positive, it lists the subsequences that are allowed in well-formed words of the language.

    k = 2
    POLARITY: negative
    GRAMMAR:  ab, ba
    LANGUAGE: accaacc, cbccc, cccacaaaa, ...
              <*>accacba, <*>bcccacbb, <*>bccccccca, ...
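
The evaluation of a string against a negative SP grammar can be sketched in a few lines of plain Python. This is only an illustration of the definition, not the scanner implemented in _SigmaPie_, and the function names are hypothetical.

```python
def is_subsequence(piece, word):
    """True if the symbols of `piece` occur in `word` in order, at any distance."""
    remaining = iter(word)
    # `symbol in remaining` consumes the iterator up to the first match,
    # so each symbol of the piece must be found after the previous one.
    return all(symbol in remaining for symbol in piece)


def is_well_formed(word, forbidden):
    """A word is well-formed if it contains none of the forbidden subsequences."""
    return not any(is_subsequence(piece, word) for piece in forbidden)


grammar = ["ab", "ba"]  # the negative 2-SP grammar from the example above

for word in ["accaacc", "cbccc", "cccacaaaa", "accacba", "bcccacbb", "bccccccca"]:
    print(word, is_well_formed(word, grammar))
```

The first three words are accepted and the last three are rejected, matching the example.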
              
              
In phonology, an example of an SP pattern is _tone plateauing_, discussed in [Jardine (2015,](https://adamjardine.net/files/jardinecomptone-short.pdf) [2016)](https://adamjardine.net/files/jardine2016dissertation.pdf).
For example, in Luganda (Bantu) a low tone (L) cannot intervene between two high tones (H): in such a configuration, L is changed to H.
The prosodic domain cannot have more than one stretch of H tones.

**Luganda verb and noun combinations** (Hyman and Katamba (2010), cited by Jardine (2016))

  * /tw-áa-mú-láb-a, walúsimbi/ $\Rightarrow$ tw-áá-mu-lab-a, walúsimbi <br>
    ‘we saw him, Walusimbi’ <br>
    **HHLLL, LHLL**
    
  * /tw-áa-láb-w-a walúsimbi/ $\Rightarrow$ tw-áá-láb-wá wálúsimbi <br>
    ‘we were seen by Walusimbi’ <br>
    **HHHHHHLL**
    
  * /tw-áa-láb-a byaa=walúsimbi/ $\Rightarrow$ tw-áá-láb-á byáá-wálúsimbi <br>
    ‘we saw those of Walusimbi’ <br>
    **HHHHHHHHLL**
    
This pattern can be described using the negative SP grammar $G_{SP_{neg}} = \{HLH\}$.

### Learning the tone plateauing pattern

Negative and positive SP grammars are implemented in the package in the `SP()` class.

In [60]:
sp_neg = SP()
sp_neg.change_polarity()
sp_neg.k = 3
sp_neg.data = ["LLLL", "HHLLL", "LHHHLL", "LLLLHHHH"]
sp_neg.extract_alphabet()
sp_neg.learn()
sp_neg.grammar

[('H', 'L', 'H')]
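
The idea behind learning a negative SP grammar can be illustrated independently of the package: collect every $k$-subsequence attested in the data and prohibit all the others. The sketch below is a plain-Python approximation under that assumption. It does not reproduce the internals of the `SP()` class, and the function names are hypothetical.

```python
from itertools import combinations, product


def k_subsequences(word, k):
    """All length-k subsequences of the word, as tuples of symbols."""
    return set(combinations(word, k))


def learn_negative_sp(data, k):
    """Return the k-subsequences over the alphabet that never occur in the data."""
    alphabet = sorted({symbol for word in data for symbol in word})
    attested = set()
    for word in data:
        attested |= k_subsequences(word, k)
    # Everything that is possible but unattested becomes a restriction.
    return [piece for piece in product(alphabet, repeat=k) if piece not in attested]


data = ["LLLL", "HHLLL", "LHHHLL", "LLLLHHHH"]
print(learn_negative_sp(data, 3))  # [('H', 'L', 'H')]
```

On this sample, seven of the eight possible 3-subsequences over `H` and `L` are attested, so the only restriction left is `HLH`.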

**Acknowledgements** 
  * Thomas
  * Jeff
  * Ani

**Bibliography**

  * Make a reference to Chomsky (?) for the Chomsky hierarchy
  * Kaplan and Kay (1994)
  * Karttunen et al. (1992)
  * Shieber (1985)
  * Heinz (2011)
  * Aksënova et al. (2016)
  * Jardine (2015, 2016)
  * Hyman and Katamba (2010)