In [1]:
from time import time

In [2]:
# accessing SigmaPie toolkit: I know, horrible!
# I promise I'll make it a package soon

%cd local_sigmapie/code/
from main import *
%cd ../..

/home/alenaks/subregular-experiments/local_sigmapie/code

You successfully loaded SigmaPie. 

Formal language classes and grammars available:
	* strictly piecewise: SP(alphabet, grammar, k, data, polar);
	* strictly local: SL(alphabet, grammar, k, data, edges, polar);
	* tier-based strictly local: TSL(alphabet, grammar, k, data, edges, polar, tier);
	* multiple tier-based strictly local: MTSL(alphabet, grammar, k, data, edges, polar).

Alternatively, you can initialize a transducer: FST(states, sigma, gamma, initial, transitions, stout).
Learning algorithm:
	OSTIA: ostia(sample, sigma, gamma).
/home/alenaks/subregular-experiments


# Experiments on SL languages

## Learning toy pattern of word-final devoicing

Word-final devoicing: we are trying to learn the constraint "no voiced obstruents at the end of the word", written `*b<`. It is a pretty widespread phonological rule (German, Russian, and others).

In [3]:
toy_german = ["aaba", "abp", "bbaaapapp", "pbaa", "pabpp"]

In [4]:
TG = SL(polar = "n")
TG.alphabet = ["a", "b", "p"]
TG.k = 2
TG.data = toy_german

In [5]:
TG.learn()

In [6]:
print(TG.grammar)

[('b', '<'), ('>', '<')]


Side effect: the grammar also bans `('>', '<')`, i.e. it learns that strings cannot be empty, because the sample contains no empty string.
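To make the result above less of a black box, here is a minimal plain-Python sketch of what learning a negative 2-SL grammar amounts to: collect every bigram attested in the data (with `>` and `<` as edge markers) and ban the rest. The function name `learn_negative_sl2` is hypothetical and this is not SigmaPie's implementation, just the idea behind it.

```python
from itertools import product

def learn_negative_sl2(data, alphabet):
    """Return the bigrams over `alphabet` (plus edge markers) unattested in `data`.

    Hypothetical helper for illustration; not part of SigmaPie.
    """
    attested = set()
    for word in data:
        padded = [">"] + list(word) + ["<"]
        attested.update(zip(padded, padded[1:]))
    # ">" can only open a bigram and "<" can only close one.
    firsts = [">"] + list(alphabet)
    seconds = list(alphabet) + ["<"]
    return sorted(set(product(firsts, seconds)) - attested)

toy_german = ["aaba", "abp", "bbaaapapp", "pbaa", "pabpp"]
banned_bigrams = learn_negative_sl2(toy_german, ["a", "b", "p"])
print(banned_bigrams)
```

On the toy data this leaves exactly two unattested bigrams, `('b', '<')` and `('>', '<')`, matching the grammar printed above.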

## Learning the word-final devoicing from the real data (German)

The data comes from the [wordlist by enz](https://github.com/enz/german-wordlist).

Loading the data:

In [7]:
# the wordlist is UTF-8 encoded, one word per line
words = []
with open('words.txt', encoding='utf-8') as f:
    for line in f:
        word = line.rstrip("\n")  # drop only the newline, never a real character
        if word != "":
            words.append(word)
            
print(words[:10])

['Aa', 'Aachener', 'Aachenerin', 'Aachenerinnen', 'Aachenern', 'Aacheners', 'Aaden', 'Aak', 'Aake', 'Aaken']


What is the length of the wordlist?

In [8]:
print(len(words))
# total, we have 685618 words

685618


How many /b/, /d/ and /g/ are final in the current dataset?

In [9]:
count_final_b = 0
count_final_d = 0
count_final_g = 0

for i in words:
    
    if i[-1] == "b":
        count_final_b += 1
    elif i[-1] == "d":
        count_final_d += 1
    elif i[-1] == "g":
        count_final_g += 1
        
print("Number of /b/:", count_final_b) # 1599, or 0.2% words
print("Number of /d/:", count_final_d) # 15294, or 2.2% words
print("Number of /g/:", count_final_g) # 17098, or 2.4 % words

Number of /b/: 1599
Number of /d/: 15294
Number of /g/: 17098


##### Preprocessing, step 1: "implementing" word-final devoicing in the dataset

German orthography doesn't reflect word-final devoicing, so first of all I rewrite every word-final /b/, /d/ and /g/ as /p/, /t/ and /k/, respectively.

In [10]:
apply_wf = []
for w in words:
    word = w.lower()
    if word[-1] == "b":
        word = word[:-1] + "p"
    elif word[-1] == "d":
        word = word[:-1] + "t"
    elif word[-1] == "g":
        word = word[:-1] + "k"
        
    apply_wf.append(word)
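The same rewrite can be expressed as a lookup table, which keeps the voiced/voiceless pairs in one place. This is only a sketch: `FINAL_DEVOICING` and `devoice_final` are hypothetical names, and like the cell above it works on spelling only.

```python
# Voiced obstruent -> voiceless counterpart (orthographic approximation).
FINAL_DEVOICING = {"b": "p", "d": "t", "g": "k"}

def devoice_final(word):
    """Lowercase `word` and devoice its final letter if it is b, d or g."""
    word = word.lower()
    if word and word[-1] in FINAL_DEVOICING:
        return word[:-1] + FINAL_DEVOICING[word[-1]]
    return word

examples = [devoice_final(w) for w in ["Hund", "Tag", "Lob", "Aachener"]]
print(examples)
```

The `if word` guard also makes the function safe on empty strings, which the index `word[-1]` alone would not be.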

##### Preprocessing, step 2

Secondly, I remove words with "non-German" characters.

In [11]:
german_wordlist = []
banned = []

ban = ['à', 'á', 'â', 'å', 'ç', 'è', 'é', 'ê', 'ë', 'í', 'î', 'ñ', 'ó', 'õ', 'ú',
       'û', 'č', 'ē', 'ī', 'ł', 'ō', 'œ', 'š', 'ū']

for w in apply_wf:
    present = False
    for i in ban:
        if i in w:
            banned.append(w)
            present = True
            break
            
    if present:
        continue
        
    german_wordlist.append(w)

What words did we ban?

In [12]:
print(banned[:15])

['abbé', 'abbés', 'abrégé', 'abrégés', 'acheuléen', 'acheuléens', 'agrément', 'agréments', 'ampère', 'ångström', 'ångströms', 'aperçu', 'aperçus', 'apéro', 'apéros']


How long is the wordlist now?

In [16]:
print(len(german_wordlist))

685147


### 2-SL model of German word-final devoicing

In [14]:
G = SL(polar = "n")
G.data = german_wordlist
G.k = 2
G.extract_alphabet()

In [15]:
begin = time()
G.learn()
end = time()
# takes around 4-4.5 seconds on my 16 core laptop
print("It took", end - begin, "seconds.")

It took 4.097039699554443 seconds.


Did it learn the rule of the word-final devoicing?

In [17]:
if all([(goal in G.grammar) for goal in [("b", "<"), ("g", "<"), ("d", "<")]]):
    print("Yes, it did!")
else:
    print("Nope, it didn't.")

Yes, it did!


How big is the grammar?

In [18]:
print(len(G.grammar))

109


What's there?!

In [20]:
print(G.grammar[:20], "...")

[('b', 'x'), ('b', 'ß'), ('b', '<'), ('c', 'j'), ('c', 'v'), ('c', 'w'), ('c', 'x'), ('c', 'ß'), ('d', 'x'), ('d', 'ß'), ('d', '<'), ('f', 'x'), ('f', 'ß'), ('g', 'x'), ('g', 'ß'), ('g', '<'), ('h', 'x'), ('h', 'ß'), ('j', 'c'), ('j', 'd')] ...


Apart from the target grammar, it also learned all the bigrams that happen not to occur in this German corpus. The next question is then **can SL grammars generate words that look German?**

## Generating fake German words

I'll be using positive SL grammars instead of negative ones. It's much faster, since all grammars are positive by default and switching the polarity of a grammar is an $O(|\Sigma|^k)$ operation. That is okay for $k=2$, but takes ages for $k=4$ or more.
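To see where the $O(|\Sigma|^k)$ cost comes from, here is a rough sketch of polarity switching as a set complement over all possible $k$-grams. It is a simplification under stated assumptions: edge markers are ignored, the function below is a stand-in and not SigmaPie's `switch_polarity`, and the 30-symbol alphabet is an illustrative figure rather than the size of the alphabet extracted from the corpus.

```python
from itertools import product

def switch_polarity(grammar, symbols, k):
    """Complement a set of k-grams against every k-gram over `symbols`.

    Illustrative stand-in: enumerates all len(symbols) ** k candidates.
    """
    return set(product(symbols, repeat=k)) - set(grammar)

symbols = ["a", "b", "p"]
positive = {("a", "a"), ("a", "b"), ("b", "a")}
negative = switch_polarity(positive, symbols, 2)
print(len(negative))

# Size of the candidate space for a hypothetical 30-symbol alphabet.
sizes = {k: 30 ** k for k in (2, 3, 4)}
print(sizes)
```

Every candidate $k$-gram has to be enumerated once, so the work grows by a factor of $|\Sigma|$ each time $k$ goes up by one.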

### 2-SL generation

In [23]:
G.switch_polarity()
print(G.generate_sample(n = 7))

['uiriäknclzndkküfsewrzbfäßmui', 'q', 'ücoxwaopmhmldctfzädwwpwpäknohämawasfyzuoh', 'ep', 'kua', 'aßszrqikqanovzhqalrntüdühuxhanbäßupmböwsödaxemouq', 'yoevdafjiffeßnjö']


Absolutely horrible.
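One way to see why the 2-SL samples look so bad: generation from a positive bigram grammar can be thought of as a random walk in which each step only looks at the previous symbol. The sketch below assumes a uniform choice among allowed successors and a tiny hand-made grammar; `generate_sl2` is a hypothetical helper, and SigmaPie's `generate_sample` may well work differently.

```python
import random

def generate_sl2(allowed, rng):
    """Random walk from ">" to "<" that only follows bigrams in `allowed`."""
    successors = {}
    for first, second in allowed:
        successors.setdefault(first, []).append(second)
    word, current = [], ">"
    while True:
        # The next symbol depends on the current symbol only.
        current = rng.choice(sorted(successors[current]))
        if current == "<":
            return "".join(word)
        word.append(current)

# Toy positive grammar: words of the shape a, aba, ababa, ...
allowed = {(">", "a"), ("a", "b"), ("b", "a"), ("a", "<")}
rng = random.Random(0)
sample = [generate_sl2(allowed, rng) for _ in range(5)]
print(sample)
```

Nothing constrains word length or anything further back than one symbol, which is consistent with the very long and very short strings in the sample above.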

### 3-SL generation

In [24]:
G3 = SL()
G3.data = german_wordlist
G3.k = 3
G3.extract_alphabet()
G3.learn()
print(G3.generate_sample(n = 5))

['eguslugfähaesäbbyfoh', 'uvöliraoglyrrh', 'nisjeaulh', 'örm', 'tjoye']


Tiny bit better.

### 4-SL generation

Careful, this is incredibly slow. We can at least decrease the number of examples we are generating.

In [26]:
G4 = SL()
G4.data = german_wordlist
G4.k = 4
G4.extract_alphabet()
G4.learn()
print(G4.generate_sample(n = 3))

['wanzyklorbnic', 'avoya', 'wyando']


Starts to resemble German, but only sometimes.

## What SL languages cannot learn

Let's try to learn a simple harmony pattern. The transparent element is `x`, and a single word can contain either `a`s or `o`s but not both, i.e. we're trying to fake vowel harmony in rounding.

In [67]:
sl_harmony = ['axxxxaaxxa', 'ooxxooxxoo', 'xxxxaaxxxa', 'aaxxxaxxxx', 'xxxooxxxoo']
print(sl_harmony)

['axxxxaaxxa', 'ooxxooxxoo', 'xxxxaaxxxa', 'aaxxxaxxxx', 'xxxooxxxoo']


In [95]:
slh = SL(polar = "n")
slh.data = sl_harmony
slh.extract_alphabet()
slh.learn()
print(slh.grammar)

[('a', 'o'), ('o', 'a'), ('>', '<')]


The grammar looks reasonable! Let's see what words it predicts to be grammatical.

In [107]:
slh.generate_sample(n = 10)

['o', 'ooxxaaa', 'aaxo']

Nope, it learned the \*ao and \*oa generalizations _locally_ but not in a _long-distance_ manner: `ooxxaaa` is accepted because the intervening `x`s keep every `o` from being adjacent to an `a`. Tier-based strictly local or strictly piecewise grammars would be more suitable here.
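The difference can be sketched in a few lines of plain Python. The helpers below are hypothetical and hand-written for illustration (they are not SigmaPie's `SL` or `TSL` classes): the tier-based check first deletes every symbol that is not on the tier `{a, o}` and only then looks for banned bigrams.

```python
def bigrams(symbols):
    """Set of adjacent pairs in `symbols`, padded with edge markers."""
    padded = [">"] + list(symbols) + ["<"]
    return set(zip(padded, padded[1:]))

def sl_accepts(word, banned):
    """Strictly local check: no banned bigram among adjacent symbols."""
    return not (bigrams(word) & banned)

def tsl_accepts(word, banned, tier):
    """Tier-based check: erase non-tier symbols, then apply the SL check."""
    projected = [s for s in word if s in tier]
    return sl_accepts(projected, banned)

banned = {("a", "o"), ("o", "a")}
word = "ooxxaaa"
sl_ok = sl_accepts(word, banned)
tsl_ok = tsl_accepts(word, banned, tier={"a", "o"})
print(sl_ok, tsl_ok)
```

For `ooxxaaa` the projection is `ooaaa`, where `o` and `a` become adjacent, so the tier-based check rejects a word that the purely local check lets through.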