In [1]:
import numpy as np

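import os
import sys


def add_path_to_local_module(module_name):
    """Append a directory next to this notebook to sys.path so it can be imported."""
    module_path = os.path.abspath(module_name)
    if module_path not in sys.path:
        sys.path.append(module_path)


# Make the local epitran and panphon checkouts importable.
add_path_to_local_module("epitran")
add_path_to_local_module("panphon")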


import panphon
import panphon.distance
import epitran
import pickle

from wordgen import *

In [2]:
wg = WordgenLearned(3,"ind-Latn")

Loading Epitran with language code ind-Latn... success!


In [3]:
wg.learn_distribution("slice.txt")

About to learn from slice.txt
For each word, each chunk of 3 sounds will be considered.
Some of the words that could not be processed will be printed below; just check that nothing too bad is happening.
"[[1]]" was not processed.
"|" was not processed.                
"." was not processed.
"," was not processed.                
"," was not processed.
"," was not processed.
"," was not processed.
"." was not processed.
"," was not processed.                
"(" was not processed.
":" was not processed.
")," was not processed.
"." was not processed.
"," was not processed.                
"." was not processed.
"," was not processed.                
";" was not processed.
"," was not processed.
"." was not processed.
"." was not processed.                
"(" was not processed.                
")" was not processed.
"(" was not processed.
")." was not processed.
"|" was not processed.                
"400px" was not processed.
"|" was not processed.
"(" was not processed.
")" was not pro

In [4]:
wg.generate_word()

'an'

Next, we want to generate a "phonology" for a fictional language. Here a phonology is an equivalence relation on the set of IPA symbols: symbols in the same class count as the same sound. After that, we generate an orthography so that words can be displayed.
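
To make the "equivalence relation" idea concrete, here is a small sketch. It stores the relation as a dict from IPA symbol to class ID, which is one possible representation and only an assumption on my part. The table `toy_phonology` and the helpers `classes_of` and `collapse` are hypothetical and hand-written for illustration; nothing here comes from panphon or epitran.

```python
# Hypothetical hand-written phonology: each IPA symbol maps to the ID of its
# equivalence class. Symbols sharing an ID count as the same sound.
toy_phonology = {"p": 0, "b": 0, "t": 1, "d": 1, "a": 2, "ɑ": 2, "i": 3}


def classes_of(phonology):
    """Group symbols by class ID, i.e. recover the partition the mapping induces."""
    groups = {}
    for symbol, class_id in phonology.items():
        groups.setdefault(class_id, set()).add(symbol)
    return groups


def collapse(symbols, phonology):
    """Replace each symbol of a transcribed word by its class ID."""
    return [phonology[s] for s in symbols]


print(classes_of(toy_phonology))
# Two transcriptions that differ only within classes collapse to the same sequence.
print(collapse(["p", "a", "t", "i"], toy_phonology))
print(collapse(["b", "ɑ", "d", "i"], toy_phonology))
```

In this toy language, voicing is not contrastive, so the two words in the last lines would be indistinguishable.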

In [5]:
ft=panphon.FeatureTable()
dst=panphon.distance.Distance()



In [31]:
dst.weighted_feature_edit_distance('k','ɡ'),dst.weighted_feature_edit_distance('k','d')

(0.125, 1.1875)

In [10]:
epi = epitran.Epitran("eng-Latn")

In [43]:
epi.transliterate("party")

'pɑɹti'

In [13]:
ft = panphon.featuretable.FeatureTable()

In [38]:
ft.word_array(['voi'],"pɑɹti")

array([[-1],
       [ 1],
       [ 1],
       [-1],
       [ 1]])

In [46]:
for v in ft.word_to_vector_list("pɑɹti",numeric=True): print(list(v))

[-1, -1, 1, -1, -1, -1, -1, 0, -1, -1, -1, 1, -1, 0, 1, -1, -1, -1, -1, -1, 0, -1]
[1, 1, -1, 1, 0, -1, -1, 0, 1, -1, -1, 0, -1, 0, -1, -1, 1, 1, -1, -1, 1, -1]
[-1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, 1, 1, -1, -1, 1, -1, 1, 1, -1, 0, -1]
[-1, -1, 1, -1, -1, -1, -1, 0, -1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, 0, -1]
[1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, 0, -1, 0, -1, 1, -1, -1, -1, -1, 1, -1]


In [52]:
# I think these weights were chosen by the panphon authors based on their sense of which features matter
# most for phonological differences (probably informed by empirical knowledge, since they are linguists).
print(list(zip(dst.fm.names,dst.fm.weights)))

[('syl', 1.0), ('son', 1.0), ('cons', 1.0), ('cont', 0.5), ('delrel', 0.25), ('lat', 0.25), ('nas', 0.25), ('strid', 0.125), ('voi', 0.125), ('sg', 0.125), ('cg', 0.125), ('ant', 0.25), ('cor', 0.25), ('distr', 0.125), ('lab', 0.25), ('hi', 0.25), ('lo', 0.25), ('back', 0.25), ('round', 0.25), ('velaric', 0.25), ('tense', 0.125)]


To generate a phonology, here's my idea:
- Pick a random batch of weights like the one shown above: draw each weight from a Gaussian centered on its value above, with a variance that still has to be chosen sensibly. The chosen weights stay fixed for the rest of the phonology generation.
- Pick a maximum number of phonemes $M$, such as 40 or 50 (either randomly or hard-coded from the start).
- Gather all IPA symbols somehow. There are CSV files in panphon/data such as ipa_all.csv, but I don't know whether they cover everything that Epitran.transliterate can produce. Alternatively, go through the directory of saved WordgenLearned objects and take the union of all the sets of IPA symbols found there (excluding the start and end tokens). Call the resulting set of IPA symbols $S$.
- The phonology will be a mapping $f:S\rightarrow\mathbb{N}$. Think of it as an enumeration of equivalence classes: two symbols are the same phoneme exactly when $f$ assigns them the same value. The weights chosen above define a particular distance function $d:S\times S\rightarrow \mathbb{R}$. To generate a phonology:
  - Start with an injective mapping $f:S\rightarrow\mathbb{N}$, so that every symbol is its own phoneme.
  - Randomly pick an element $x$ of $S$ and randomly choose a radius $r$. The radius is drawn from a distribution fixed beforehand (uniform? Gaussian? Picking it will require experimentation).
  - For each $y\in S$ with $d(y,x)<r$, redefine $f(y)$ to be $f(x)$ (i.e. "identify" those two sounds).
  - Now count the phonemes (i.e. the size of the image of $f$). If it is $\leq M$, stop. Otherwise pick another $x\in S$ and a new radius $r$, and repeat.
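
The steps above can be sketched roughly as follows. This is a toy version under several assumptions, not a finished implementation. The feature vectors in `TOY_FEATURES` are made up rather than taken from panphon. `distance` is a simple weighted count of differing features, standing in for `dst.weighted_feature_edit_distance`. The relative standard deviation of 0.2 and the uniform radius on $[0, 2]$ are arbitrary placeholders. The names `sample_weights`, `distance`, and `generate_phonology` are hypothetical. The phoneme count is checked before each merge round, and `max_rounds` is an added safety guard.

```python
import random

# A few of the weights printed above, plus made-up toy feature vectors
# (order: syl, son, cons, cont, voi, hi). These vectors are NOT from panphon.
BASE_WEIGHTS = {"syl": 1.0, "son": 1.0, "cons": 1.0, "cont": 0.5, "voi": 0.125, "hi": 0.25}
TOY_FEATURES = {
    "p": (-1, -1, 1, -1, -1, -1),
    "b": (-1, -1, 1, -1, 1, -1),
    "s": (-1, -1, 1, 1, -1, -1),
    "z": (-1, -1, 1, 1, 1, -1),
    "m": (-1, 1, 1, -1, 1, -1),
    "a": (1, 1, -1, 1, 1, -1),
    "i": (1, 1, -1, 1, 1, 1),
}


def sample_weights(base, rel_sigma, rng):
    """Draw each weight from a Gaussian centered on its base value, clipped at 0."""
    return {name: max(0.0, rng.gauss(w, rel_sigma * w)) for name, w in base.items()}


def distance(x, y, features, weights):
    """Weighted count of differing features (a stand-in for panphon's distance)."""
    return sum(w * abs(a - b) / 2
               for w, a, b in zip(weights.values(), features[x], features[y]))


def generate_phonology(symbols, dist, max_phonemes, draw_radius, rng, max_rounds=10_000):
    """Merge symbols around random centers until at most max_phonemes classes remain."""
    f = {s: i for i, s in enumerate(symbols)}  # injective start: one class per symbol
    rounds = 0
    while len(set(f.values())) > max_phonemes:
        if rounds >= max_rounds:
            raise RuntimeError("no convergence; try a radius distribution with larger values")
        x = rng.choice(symbols)
        r = draw_radius(rng)
        for y in symbols:
            if dist(y, x) < r:
                f[y] = f[x]  # identify y with x
        rounds += 1
    return f


rng = random.Random(0)  # seeded so the sketch is reproducible
weights = sample_weights(BASE_WEIGHTS, 0.2, rng)
symbols = sorted(TOY_FEATURES)
phonology = generate_phonology(
    symbols,
    lambda x, y: distance(x, y, TOY_FEATURES, weights),
    max_phonemes=3,
    draw_radius=lambda g: g.uniform(0.0, 2.0),
    rng=rng,
)
print(phonology)
```

One thing the sketch makes visible: because only $f(y)$ is reassigned, a later merge can pull a symbol out of a class it joined earlier. The number of classes never increases, but the classes are not simply unions of earlier ones. Whether that is desirable is part of the experimentation mentioned above.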