# On tier representation

As in biology, linguistic data and relationships are commonly represented and visualized with one of three fundamental structures: the sequence, for sequential data; the tree, for hierarchical non-sequential data; and the network, for non-hierarchical non-sequential data. Typical examples are n-grams, syntax trees, and colexification networks.

![ngrams](ngrams.jpg)

Ngrams

![syntax tree](syntaxtree.jpg)

Syntax Tree

![colex](colex.jpg)

Colexification

The sequential representation is by far the most common one for sound sequences, both because it is easy to use (alphabets being the obvious example) and because sound sequences are themselves sequential and temporally delimited. As such, most mathematical models of utterances represent words as sequences of elements drawn from a random distribution. It is important to remember that, in this sense, "random" does not necessarily mean "without a specific pattern" or "unpredictable" (the sequence of sounds is usually quite predictable, given the language's inventory and phonotactics), but only that the sequence is composed of different elements.

The most common way of modelling such sequences is with a Markov model. While there are many different structures and approaches that qualify as "Markovian" (such as Markov chains, Hidden Markov Models, etc.), we can think of a Markov model simply as a model describing a sequence of possible events (the "elements") in which the probability of each event depends only on the state attained in the previous event. For example, in English, if we have no other information about the context of occurrence, after the sound /ð/ (represented in writing by "TH") we are right to expect the sound /iː/ ("E") as the most probable, as the article "THE" is the most common word in the language.

We can, in fact, easily train a model on such premises. If we consider a context composed of only one sound (a "monogram"; larger contexts, combined contexts, and other alternatives are possible), we can derive a matrix of transition probabilities from state to state (i.e., from sound to sound) such as the following:

|       | aɪ    | aʊ    | b     | d     | dʒ    | (...) | ʒ     | θ     |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| aɪ    | 0.03  | 0.04  | 2.53  | 9.18  | 0.55  | (...) | 0.00  | 0.10  |
| aʊ    | 0.03  | 0.00  | 2.01  | 4.68  | 0.25  | (...) | 0.00  | 2.20  |
| b     | 2.24  | 1.20  | 0.00  | 0.51  | 0.15  | (...) | 0.00  | 0.01  |
| d     | 2.50  | 1.09  | 0.79  | 0.02  | 0.02  | (...) | 0.00  | 0.05  |
| dʒ    | 1.18  | 0.14  | 0.04  | 2.49  | 0.00  | (...) | 0.00  | 0.00  |
| (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) |
| ʒ     | 0.55  | 1.11  | 0.00  | 2.21  | 0.00  | (...) | 0.00  | 0.00  |
| θ     | 1.77  | 0.71  | 1.15  | 0.62  | 0.09  | (...) | 0.00  | 0.00  |
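
As a rough illustration of how such a matrix could be estimated, the following minimal sketch counts sound-to-sound transitions in a segmented word list and normalizes each row. The function name `transition_probabilities` and the toy corpus are assumptions made for this example; they are not the data or the code behind the table, and the sketch returns relative frequencies between 0 and 1 rather than values on the table's scale.

```python
from collections import Counter, defaultdict


def transition_probabilities(words):
    """Estimate first-order transition probabilities from segmented words.

    `words` is an iterable of sound sequences (lists of segments).
    Returns a nested dict: probs[previous][following] = relative frequency.
    """
    counts = defaultdict(Counter)
    for segments in words:
        # Pair every sound with the sound that immediately follows it.
        for previous, following in zip(segments, segments[1:]):
            counts[previous][following] += 1

    probs = {}
    for previous, followers in counts.items():
        total = sum(followers.values())
        probs[previous] = {s: n / total for s, n in followers.items()}
    return probs


# Hypothetical toy corpus, already segmented into sounds.
toy_corpus = [
    ["ð", "iː"],
    ["ð", "iː"],
    ["ð", "ɪ", "s"],
    ["b", "aɪ"],
]

probs = transition_probabilities(toy_corpus)
for previous, followers in probs.items():
    print(previous, followers)
```

Each row of the resulting mapping sums to one, so it can be read as the distribution over the next sound given the current one.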


While a model such as this is reasonably effective given its simplicity, it fails to support more advanced kinds of modelling such as those described above. In terms of sound patterns, for example, the probability distributions it involves could serve as a first step for generalizing observations, but the context is limited and not much can be done without human intervention at every stage. In terms of phonetic representation, it cannot fully account for suprasegmental features such as tones, i.e., contrastive elements that cannot be analyzed as distinct segments but belong to a subgroup of them (sometimes without even following the boundaries of the segments themselves). For pseudo-word generation, not only is the context too limited to capture complex patterns and medium- and long-distance relationships, but the model also cannot encode information above the level of the individual elements, such as whether the pseudo-word being generated is a verb or a noun.

While some of these problems can be solved or partially remedied with more complex approaches, such as higher-order n-grams (i.e., more context) or combined information, we cannot get past the limit that each element can encode only one item of information.
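
To make the limitation concrete, here is a small sketch of higher-order context counting. The helper `context_counts`, the boundary marker `"#"`, and the toy corpus are all assumptions introduced for illustration. Raising `order` widens the window, but every position in the window still holds exactly one symbol.

```python
from collections import Counter, defaultdict


def context_counts(words, order):
    """Count which segment follows each context of `order` preceding segments."""
    counts = defaultdict(Counter)
    for segments in words:
        # Pad the start so that word-initial sounds also have a full context.
        padded = ["#"] * order + list(segments)
        for i in range(order, len(padded)):
            context = tuple(padded[i - order:i])
            counts[context][padded[i]] += 1
    return counts


# Hypothetical toy corpus of segmented words.
toy_corpus = [
    ["s", "t", "r", "iː", "t"],
    ["s", "t", "iː", "l"],
    ["t", "r", "iː"],
]

order1 = context_counts(toy_corpus, 1)
order2 = context_counts(toy_corpus, 2)

print(order1[("t",)])       # followers of "t" with one sound of context
print(order2[("s", "t")])   # followers of "t" when it is preceded by "s"
print(order2[("#", "t")])   # followers of a word-initial "t"
```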

One possible objection to this is that information such as IPA graphemes is by no means atomic: even without considering possible suprasegmental information or more context, a grapheme such as /b/ already carries many levels, or "tiers", of information, such as its manner of articulation ("occlusive"), its place of articulation ("bilabial"), its voicing ("voiced"), and so on. In actual words, more information can be obtained by inspecting the word in which it is pronounced, such as the stress of its syllable and its intonation, and even the system in which it is pronounced, such as the frequency of the word, its age, the donor language in the case of borrowings, and so on.
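
The point that a single grapheme bundles several levels of information can be sketched as follows. The `FEATURES` table is a hand-written, hypothetical fragment covering three segments only, and `feature_tiers` is a name chosen for this example; a real feature system would be far larger.

```python
# Hypothetical, hand-written feature table for three segments only.
FEATURES = {
    "b": {"manner": "occlusive", "place": "bilabial", "voicing": "voiced"},
    "p": {"manner": "occlusive", "place": "bilabial", "voicing": "voiceless"},
    "s": {"manner": "fricative", "place": "alveolar", "voicing": "voiceless"},
}


def feature_tiers(segments, features=FEATURES):
    """Expand a segment sequence into one parallel sequence per feature."""
    names = sorted({name for seg in segments for name in features[seg]})
    return {
        name: [features[seg].get(name) for seg in segments]
        for name in names
    }


tiers = feature_tiers(["b", "s", "p"])
for name, values in tiers.items():
    print(name, values)
```

Each resulting list has the same length as the input sequence, so every feature can be read position by position alongside the segments themselves.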

Our proposed solution is to consider parallel, multilayered and conceptually linked sequences, which are to a point analogous to some solutions with marginal adoption in stochastic methods, such as Layered Hidden Markov Models. In our proposal, a potentially enormous number of tiers can be expressed in relation to a given sequence (i.e., a word). While the most obvious ones are distinctive features, suprasegmental information and extra-lexical information, as just described, the approach can potentially accommodate even the relationship between two or more words, such as cognates. One reduced example is presented here:

| Tier           | Description            | 1 | 2 | 3  | 4 | 5 |
|----------------|------------------------|---|---|----|---|---|
| SOURCE         | source sounds          | s | w | e  | r | d |
| CV ?X          | previous sound C or V  | ∅ | C | C  | V | C |
| CV X?          | following sound C or V | C | V | C  | C | ∅ |
| SOUND CLASS ?X | previous sound class   | ∅ | S | W  | V | R |
| SOUND CLASS X? | following sound class  | W | V | R  | T | ∅ |
| STRESS         | stress in source       | 1 | 1 | 1  | 1 | 1 |
| TARGET         | target sounds          | ʃ | v | eː | r | t |
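
Context tiers of this kind can be derived mechanically from the source sequence. The sketch below is one possible way to do so, under stated assumptions: the `CV` and `SOUND_CLASS` mappings are hypothetical and cover only the five sounds of the example, the helper names `shift` and `build_tiers` are invented here, and the stress tier is left out because it has to come from the data rather than from the segments.

```python
# Hypothetical mappings covering only the sounds of the example word.
CV = {"s": "C", "w": "C", "e": "V", "r": "C", "d": "C"}
SOUND_CLASS = {"s": "S", "w": "W", "e": "V", "r": "R", "d": "T"}
GAP = "∅"


def shift(values, offset):
    """Return, for each position, the value `offset` positions away (GAP if none)."""
    n = len(values)
    return [values[i + offset] if 0 <= i + offset < n else GAP for i in range(n)]


def build_tiers(source, target):
    """Build aligned context tiers for a source/target pair of equal length."""
    cv = [CV[s] for s in source]
    sound_class = [SOUND_CLASS[s] for s in source]
    return {
        "SOURCE": list(source),
        "CV ?X": shift(cv, -1),                   # previous sound, C or V
        "CV X?": shift(cv, +1),                   # following sound, C or V
        "SOUND CLASS ?X": shift(sound_class, -1),
        "SOUND CLASS X?": shift(sound_class, +1),
        "TARGET": list(target),
    }


tiers = build_tiers(["s", "w", "e", "r", "d"], ["ʃ", "v", "eː", "r", "t"])
for name, values in tiers.items():
    print(f"{name:<16}", " ".join(values))
```

Because every tier is a list of the same length as the source, a column of the output can be read as the full set of observations attached to one position of the word.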

While a huge number of tiers is not feasible for human inspection and manipulation, tiers can easily be pre-processed by algorithms to find relationships that would take many hours of manual inspection; in particular, automatic methods should be able to deal with, and remove, the many layers of information that are either strongly correlated or plainly redundant (for example, that a vowel is voiced). One preliminary example of this is shown below, where we build an aligned dataset of reconstructed Proto-Germanic and German words and test, with a decision tree, which features are significant for predicting the initial sound in German given the initial sound in Proto-Germanic.

![dt](dt.png)
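
To give an idea of how a decision tree ranks tiers, the sketch below computes information gain, one common split criterion, for each tier of a tiny table. It is an assumption that a criterion of this kind is comparable to the one behind the figure; the rows are hypothetical toy data, not the Proto-Germanic and German dataset, and the names `entropy` and `information_gain` are chosen for this example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, tier, target):
    """Reduction in entropy of `target` obtained by splitting `rows` on `tier`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[tier]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical toy rows, one per aligned initial sound (not the real dataset).
rows = [
    {"source": "s", "next_cv": "C", "stress": "1", "target": "ʃ"},
    {"source": "s", "next_cv": "C", "stress": "1", "target": "ʃ"},
    {"source": "s", "next_cv": "V", "stress": "1", "target": "z"},
    {"source": "s", "next_cv": "V", "stress": "1", "target": "z"},
]

gains = {
    tier: information_gain(rows, tier, "target")
    for tier in ("source", "next_cv", "stress")
}
for tier, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{tier:<8} {gain:.3f}")
```

In this toy table the tier for the following sound separates the two outcomes completely, while the tiers that never vary contribute nothing, which is the kind of constant or redundant layer an automatic method can discard.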

We can also show the results of a Random Forest algorithm, which tells us which features are more significant and the weight of each of them in the prediction.

```
TYPE +1 0.0463960538092521
TYPE +2 0.03964767314605851
bilabial 0.14551587078536848
central +1 0.0041946829560337105
close +1 0.07958534505038174
close-mid +1 0.024241467989517797
epiglottal +2 0.0305714628091662
fricative +2 0.031108215364567888
labial +1 0.02417614434351352
labial +2 0.00285658934186892
labialized-velar +1 0.029780864044775292
mid +2 0.017456456188379096
nasal +1 0.07773802387731676
nasal +2 0.027646999094609288
near-back +1 0.03614749000236881
open-mid +2 0.001983432117108016
post-alveolar +1 0.022261976122683224
rounded +1 0.06091768909118193
to-mid +2 0.007023064882485494
to-mid-low +2 0.035949329648071814
unrounded +2 0.04830601414148979
velar 0.08780810923391782
voiceless 0.09297693511751451
voiceless +2 0.025710110842369276
```
