# Using Phylogenies in Computational Historical Linguistics (Tiago Tresoldi, Mei-Shin Wu, and Nathanael E. Schweikhard)






## 1. Introduction

### 1.1 Inner and Outer Language History
The origins of phylogenetics go back to the 19th century, and its principles have not changed much since. One of the less well-known early linguists is Georg von der Gabelentz (1840-1893). His 1891 characterization of the questions that phylogenetic reconstruction seeks to answer may still hold true:
> * Are languages A and B related and to what degree?
> * Does this word or word form exist in that language or at that time of language history?
> * How is it pronounced there?
> * What kinds of rules can be found in the sound differences?
> * Is it a native or a loan word?
> * Which words belong to the original lexicon, which have been added later?
>
> (Gabelentz 1891: 145: translation by NES)

He therefore distinguishes between "outer and inner language history": the former is "the history of its spatial and temporal spreading, of its branchings and possible interminglings - in other words, its genealogy", while the latter "shows and tries to explain how the language slowly changed in regards to meaning and form" (ibid 146: translation by NES). But even when restricted to outer language history, there are still differences in how people have interpreted the answers to these questions, as can also be seen in the various approaches to visualizing the results.

### 1.2 Trees, Waves, and Nets

#### Trees

This first phylogenetic tree for the Slavic languages, by František Čelakovský (1799-1852), was published in 1853, shortly before August Schleicher published his tree model theory in the same year:

![alt text](img/S14.1.png)

> The length of lines implies time depth, their distance the degree of relationship. (Schleicher 1861: 6: translation by NES)

#### Waves

Not long after Schleicher published his tree model, opposition arose among Indo-Europeanists and language historians. The most famous alternative is the wave theory of Johannes Schmidt (1843-1901), who described the relationships between the Indo-European languages with "the image of a wave spreading in concentric circles that grow weaker and weaker the farther away they are from the center" and as "a plane tilted in one direct line from Sanskrit to Celtic" (Schmidt 1872: 27: translation by NES). He did so in order to account for Lithuanian words that share characteristics with both German and Indo-Iranian cognates (cf. ibid: 26).

#### Nets and Other Kinds of Entanglements

The biggest problem of Schmidt's wave theory was that nobody knew exactly how to illustrate outer language history schematically with it. As a result, many different approaches to visualizing wave theory can be found.

![alt text](img/S14.2.png)

### 1.3 A Matter of Perspective

One basic difference between the tree and the wave model lies in their different goals: Wave theory is oriented towards an epistemological perspective on outer language history whereas tree theory is oriented towards an ontological perspective on it. In other words, waves describe what we would like to know, trees what we can know. That is why Schmidt opposed the tree model on the grounds of the impossibility of representing the facts with it:

> Twist it how you will, as long as you insist on the opinion that historical languages came to be by multiple offshoots from the original language, i. e. as long as you believe in a tree model of Indo-European languages, you will never be able to explain all the facts in question scientifically.
> 
> (Schmidt 1872: 17: translation by NES)

The facts he refers to here are mainly lists of roots shared among different Indo-European languages (cognates in the broadest sense). The problem with these facts, however, is that they themselves depend on the state of research. If, for example, you compare Schmidt's number of roots shared among Greek, Old Indian, and Latin with the numbers given by Nicholaev (2007), you will notice that the closeness between Latin and Greek suggested by Schmidt's data is much less pronounced in Nicholaev's data. The same holds for the very small similarity between Old Indian and Latin in Schmidt's data:

![alt text](img/S14.3.png)

### 1.4 Not Seeing the Tree for the Waves...

Computer-assisted methods for phylogenetic reconstruction still have their problems, but these are not always caused, as is often assumed, by the methods being aimed only at detecting trees and thus being based on the wrong model. The opposite problem can also occur: treelike processes can disguise themselves as non-treelike ones. In biology, this phenomenon is called incomplete lineage sorting:

> Incomplete lineage sorting occurs when an ancestral species undergoes several speciation events in a short period of time. If, for a given gene, the ancestral polymorphism is not fully resolved into two monophyletic lineages when the second speciation occurs, then with some probability the gene tree will be different from the species tree.
> (Galtier and Daubin 2008: 4023)

Below you see examples of how polymorphism works in linguistics:

![alt text](img/S14.3.png)

Images A and B illustrate how replacements (e.g. new words replacing old ones) at different points in time inform us about the genealogical relationship of languages. Image C demonstrates that the real situation might be more complicated than a simple tree can show, because competing word forms can exist in the same language simultaneously.
Images D to F show the same for the phenomenon of paradigmatic leveling, which may apply in different directions in different branches.

## 2. Inferring Phylogenies


### 2.1 UPGMA and Neighbor-joining tree

#### 2.1.1 UPGMA 
UPGMA stands for Unweighted Pair Group Method with Arithmetic Mean. The algorithm assumes a constant rate of evolution, which does not always hold in real-world scenarios. At each step, the two taxa (or clusters) with the shortest distance between them are taken to be the last two to have diverged and are joined under a common ancestor.

See the following example:

|   | A | B | C | D |
|---|---|---|---|---|
| A | 0 | 2 | 3 | 4 |
| B | 2 | 0 | 5 | 6 |
| C | 3 | 5 | 0 | 7 |
| D | 4 | 6 | 7 | 0 |

The distance between A and B is the shortest of all pairs, so A and B are joined first, which reduces the matrix to 3×3. The distance from the new cluster to each remaining taxon is the arithmetic mean of the original distances. This process is repeated until only one cluster remains.
The result is:
<img src="img/UPGMA-toyexample.png" width="200" height="200">
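
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the textbook UPGMA procedure applied to the toy matrix, not the code LingPy uses internally; the function name `upgma` and the nested-tuple representation of the tree are assumptions made for this example.

```python
from itertools import combinations

names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 3, 4],
    [2, 0, 5, 6],
    [3, 5, 0, 7],
    [4, 6, 7, 0],
]


def upgma(names, matrix):
    """Return the UPGMA tree as nested tuples plus a list of (subtree, height)."""
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between clusters, keyed by an unordered pair of cluster ids.
    dist = {frozenset(p): matrix[p[0]][p[1]]
            for p in combinations(range(len(names)), 2)}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two clusters with the smallest distance.
        pair = min(dist, key=dist.get)
        # Put the larger cluster first so the nested tuples read naturally.
        i, j = sorted(pair, key=lambda c: (-clusters[c][1], c))
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        merged = (tree_i, tree_j)
        # Under a constant rate, the new node sits at half the distance.
        merges.append((merged, dist[pair] / 2))
        # Keep the distances that do not involve the two joined clusters.
        new_dist = {key: d for key, d in dist.items()
                    if i not in key and j not in key}
        # Distance to every other cluster: mean weighted by cluster size,
        # which equals the plain mean over all pairs of leaves.
        for k in clusters:
            d_ik = dist[frozenset((i, k))]
            d_jk = dist[frozenset((j, k))]
            new_dist[frozenset((next_id, k))] = (
                (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
            )
        dist = new_dist
        clusters[next_id] = (merged, n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merges


tree, merges = upgma(names, matrix)
print(tree)  # ((('A', 'B'), 'C'), 'D')
for subtree, height in merges:
    print(subtree, round(height, 3))
```

For the toy matrix, A and B are joined at height 1, C is added at height 2 (the mean of 3 and 5 is 4), and D is added last at height 17/6 ≈ 2.833 (the mean of 4, 6, and 7 is 17/3).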

#### 2.1.2 Neighbor-joining tree
The goal of the neighbor-joining algorithm is to find the tree with the shortest total branch length. It starts from a star-shaped tree and iteratively looks for "neighbors", that is, pairs of nodes that are connected through a single shared node.

The same example may result in a different tree:
<img src="img/NJ-toyexample.png" width="200" height="200">
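
To see why neighbor-joining can behave differently from UPGMA on the same data, it helps to compute the selection criterion for the first joining step. The sketch below assumes the standard Q-matrix formulation of neighbor-joining, Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of row i; the function names are hypothetical, and this is not LingPy's implementation.

```python
from itertools import combinations

names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 3, 4],
    [2, 0, 5, 6],
    [3, 5, 0, 7],
    [4, 6, 7, 0],
]


def q_matrix(names, matrix):
    """Q(i, j) = (n - 2) * d(i, j) - r_i - r_j for every pair of taxa."""
    n = len(names)
    row_sums = [sum(row) for row in matrix]
    return {
        (names[i], names[j]): (n - 2) * matrix[i][j] - row_sums[i] - row_sums[j]
        for i, j in combinations(range(n), 2)
    }


def limb_lengths(i, j, matrix):
    """Branch lengths from taxa i and j to the new node that joins them."""
    n = len(matrix)
    r_i, r_j = sum(matrix[i]), sum(matrix[j])
    length_i = matrix[i][j] / 2 + (r_i - r_j) / (2 * (n - 2))
    return length_i, matrix[i][j] - length_i


q = q_matrix(names, matrix)
for pair, value in q.items():
    print(pair, value)  # every pair scores -18

print(limb_lengths(0, 1, matrix))  # (0.0, 2.0)
```

In this particular toy matrix all six Q values are tied at -18, so the criterion does not prefer any pair, and which pair is joined first depends on how an implementation breaks ties. UPGMA, by contrast, simply picks the smallest raw distance (A and B). If A and B are joined, the branch leading to A has length 0 and the one leading to B has length 2, which shows that neighbor-joining does not force the two branches to be equally long.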


### 2.2 Distance-matrix-based phylogeny analysis via LingPy

This section shows how to infer and draw a non-Bayesian phylogeny with LingPy.

```python
from lingpy import *

# Load the wordlist data
wl = Wordlist('../data/S08-computed.tsv')

# Neighbor-joining tree (tree_calc='neighbor')
mytree = Tree(wl.get_tree(tree_calc='neighbor', force=True))
print('Neighbor-joining tree')
print(mytree.asciiArt())

# UPGMA tree (tree_calc='upgma')
print('UPGMA tree')
mytreeupgma = Tree(wl.get_tree(tree_calc='upgma', force=True))
print(mytreeupgma.asciiArt())
```

```
2018-07-10 12:24:05,799 [INFO] Successfully calculated tree.


Neighbor-joining tree
                                        /-Hawaiian
                              /edge.1--|
                             |         |          /-Mangareva
                             |          \edge.0--|
                    /edge.4--|                    \-North_Marquesan
                   |         |
                   |         |          /-Rapanui
                   |          \edge.3--|
          /edge.5--|                   |          /-Sikaiana
         |         |                    \edge.2--|
         |         |                              \-Tuamotuan
-root----|         |
         |          \-Maori
         |
         |          /-Ra’ivavae
          \edge.7--|
                   |          /-Rurutuan
                    \edge.6--|
                              \-Tahitian
UPGMA tree


2018-07-10 12:24:05,803 [INFO] Successfully calculated tree.


                    /-Sikaiana
          /edge.5--|
         |         |          /-Maori
         |          \edge.4--|
         |                   |          /-Rapanui
         |                    \edge.3--|
         |                             |          /-Hawaiian
         |                              \edge.2--|
-root----|                                       |          /-Mangareva
         |                                        \edge.1--|
         |                                                 |          /-North_Marquesan
         |                                                  \edge.0--|
         |                                                            \-Tuamotuan
         |
         |          /-Ra’ivavae
          \edge.7--|
                   |          /-Rurutuan
                    \edge.6--|
                              \-Tahitian
```


#### LingPy can also output the distance matrix and tree files from a wordlist.

```python
# Uses the `wl` wordlist loaded above

# Distance matrix
wl.output('dst', filename='../data/S14-distanceoutput', taxa='DOCULECT', ref='COGID', prettify=False)

# Neighbor-joining tree (tree_calc='neighbor')
wl.output('tre', filename='../data/S14-njtree', taxa='DOCULECT', ref='COGID', tree_calc='neighbor')

# UPGMA tree (tree_calc='upgma')
wl.output('tre', filename='../data/S14-upgmatree', taxa='DOCULECT', ref='COGID', tree_calc='upgma')
```

```
2018-07-10 12:24:05,813 [INFO] Data has been written to file <../data/S14-distanceoutput.dst>.
2018-07-10 12:24:05,815 [INFO] Data has been written to file <../data/S14-njtree.tre>.
2018-07-10 12:24:05,817 [INFO] Data has been written to file <../data/S14-upgmatree.tre>.
```
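
To give an intuition for what such a distance matrix contains, here is a minimal sketch of one common way to turn cognate judgments into distances: one minus the proportion of shared cognate sets among the concepts attested in both languages. The toy data, the language names, and the function `cognate_distance` are invented for illustration, and the exact computation in LingPy may differ.

```python
from itertools import combinations

# Hypothetical toy data: language -> {concept: cognate set ID}
cognates = {
    "Lang1": {"hand": 1, "eye": 2, "water": 3, "stone": 4},
    "Lang2": {"hand": 1, "eye": 2, "water": 5, "stone": 4},
    "Lang3": {"hand": 6, "eye": 2, "water": 5},
}


def cognate_distance(words_a, words_b):
    """1 - (shared cognate sets / concepts attested in both languages)."""
    shared_concepts = set(words_a) & set(words_b)
    if not shared_concepts:
        return 1.0  # nothing to compare: treat as maximally distant
    matches = sum(words_a[c] == words_b[c] for c in shared_concepts)
    return 1 - matches / len(shared_concepts)


distances = {
    (a, b): cognate_distance(cognates[a], cognates[b])
    for a, b in combinations(sorted(cognates), 2)
}
for pair, d in distances.items():
    print(pair, round(d, 3))
```

Here Lang1 and Lang2 share three of four cognate sets (distance 0.25), while Lang3 lacks a word for "stone", so it is compared on three concepts only.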


#### LingPy can output a Nexus file for further Bayesian phylogenetic analysis.

```python
# Uses the `wl` wordlist loaded above
wl.output('paps.nex', filename='../data/S14-pap', taxa='DOCULECT', ref='COGID')
```

```
2018-07-10 12:24:05,864 [INFO] Data has been written to file <../data/S14-pap.paps.nex>.
```
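
Bayesian tools typically expect cognate data as a binary presence/absence matrix, with one column per cognate set. The following sketch shows the general idea on invented toy data; the names `concept_of` and `encode`, and the use of `?` for concepts that a language does not attest, are assumptions for this example and do not describe the exact layout of the file that LingPy writes.

```python
# Hypothetical toy data: language -> {concept: cognate set ID}
cognates = {
    "Lang1": {"hand": 1, "eye": 2, "water": 3, "stone": 4},
    "Lang2": {"hand": 1, "eye": 2, "water": 5, "stone": 4},
    "Lang3": {"hand": 6, "eye": 2, "water": 5},
}

# Remember which concept each cognate set belongs to.
concept_of = {}
for words in cognates.values():
    for concept, cognate_id in words.items():
        concept_of[cognate_id] = concept
cognate_ids = sorted(concept_of)  # one column per cognate set


def encode(words):
    """Encode one language as a string over 1 (present), 0 (absent), ? (missing)."""
    present = set(words.values())
    row = []
    for cognate_id in cognate_ids:
        if cognate_id in present:
            row.append("1")
        elif concept_of[cognate_id] in words:
            row.append("0")  # concept attested, but with another cognate set
        else:
            row.append("?")  # no word recorded for this concept
    return "".join(row)


binary_matrix = {language: encode(words) for language, words in cognates.items()}
for language, row in binary_matrix.items():
    print(language, row)
```

Distinguishing true absence (`0`) from missing data (`?`) matters because a gap in the word list is not evidence that a language lost a cognate.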


## 3. Split graphs

Besides trees, one type of graph that is finding its way into quantitative historical linguistics is the split graph. It is important to know how split graphs are generated and how they should be interpreted. First, graphs that are apparently of the same type can be produced by different methods; in particular, they can be

> produced by distance-based network methods such as NeighborNet and Split Decomposition, by character-based methods such as Median Networks and Parsimony Splits, and by tree-based methods such as Consensus Networks and SuperNetworks. They are all interpreted in the same way[... :] An essential point to understand is that splits graphs are separation networks. That is, the edges in the graph represent separation between two clusters of nodes in the network; or, they split the graph in two. Formally, each edge represents a bipartition (or split) of the taxa based on one or more characteristics. If there is no conflict in the data then each bipartition is represented by a single edge, and if there are contradictory patterns then the each bipartition is represented by a set of parallel edges. The edge lengths represent the relative amount of support in the whole dataset for each of the splits. [from David Morrison's http://phylonetworks.blogspot.com/2012/08/how-to-interpret-splits-graphs.html]

Morrison (2012) offers a primer using data "about opinion polls prior to a few Australian elections", comparing the "Actual" results of the elections to four different polls: Morgan (from Roy Morgan Research), Saulwick (from Saulwick Poll), McNair (from McNair Survey), and Other (from a pool of polls). The network resulting from this simple data, where a distance is calculated for the actual and the predicted result, is the following:

![auspolls](img/AusPolls.gif)

As explained by Morrison (2012), the network has "five informative splits (bipartitions), each represented by a different set of parallel edges[; the] remaining five splits are simply shown as the single edges leading to each of the five sources of data". The informative bipartitions in order of decreasing support and with the weight of support are:

| Partition | Group 1                | Group 2                | Weight |
|-----------|------------------------|------------------------|--------|
| #1        | Actual Morgan Other    | McNair Saulwick        | 0.813  |
| #2        | Actual Morgan Saulwick | McNair Other           | 0.649  |
| #3        | Actual Morgan          | McNair Other Saulwick  | 0.493  |
| #4        | Actual Other           | McNair Morgan Saulwick | 0.486  |
| #5        | Actual McNair Other    | Morgan Saulwick        | 0.188  |

These splits can be represented graphically as follows:

![split1](img/AusPolls_split1.gif)
![split2](img/AusPolls_split2.gif)
![split3](img/AusPolls_split3.gif)
![split4](img/AusPolls_split4.gif)
![split5](img/AusPolls_split5.gif)
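
The sets of parallel edges arise because some of these bipartitions contradict each other. A minimal sketch of the usual compatibility test follows: two splits are compatible (they can sit on the same tree) if at least one of the four intersections between their sides is empty. The groups are taken from the table above; the function name `compatible` is an assumption made for this example.

```python
from itertools import combinations

# The five informative splits from the table above: number -> (group 1, group 2)
splits = {
    1: ({"Actual", "Morgan", "Other"}, {"McNair", "Saulwick"}),
    2: ({"Actual", "Morgan", "Saulwick"}, {"McNair", "Other"}),
    3: ({"Actual", "Morgan"}, {"McNair", "Other", "Saulwick"}),
    4: ({"Actual", "Other"}, {"McNair", "Morgan", "Saulwick"}),
    5: ({"Actual", "McNair", "Other"}, {"Morgan", "Saulwick"}),
}


def compatible(split_a, split_b):
    """True if at least one of the four side-by-side intersections is empty."""
    return any(not (side_a & side_b) for side_a in split_a for side_b in split_b)


conflicts = [
    (a, b) for a, b in combinations(sorted(splits), 2)
    if not compatible(splits[a], splits[b])
]
print(conflicts)  # [(1, 2), (1, 5), (2, 4), (3, 4), (3, 5)]
```

Since several pairs of splits conflict, no single tree can display all five of them, which is why the data is drawn as a network.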

Distances can be computed as follows:

![distance](img/AusPolls_distance.gif)
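
In a split graph, the path length between two taxa is the sum of the weights of all splits that separate them. The sketch below applies this to the five informative splits listed in the table above. Because the weights of the five single-edge splits are not given here, the results are only partial path lengths, and the function name `split_distance` is an assumption made for this example.

```python
# The five informative splits: (group 1, group 2, weight)
splits = [
    ({"Actual", "Morgan", "Other"}, {"McNair", "Saulwick"}, 0.813),
    ({"Actual", "Morgan", "Saulwick"}, {"McNair", "Other"}, 0.649),
    ({"Actual", "Morgan"}, {"McNair", "Other", "Saulwick"}, 0.493),
    ({"Actual", "Other"}, {"McNair", "Morgan", "Saulwick"}, 0.486),
    ({"Actual", "McNair", "Other"}, {"Morgan", "Saulwick"}, 0.188),
]


def split_distance(taxon_a, taxon_b):
    """Sum the weights of the splits that put the two taxa on different sides."""
    return sum(
        weight
        for group_1, _group_2, weight in splits
        if (taxon_a in group_1) != (taxon_b in group_1)
    )


polls = ["Morgan", "Other", "Saulwick", "McNair"]
partial = {poll: round(split_distance("Actual", poll), 3) for poll in polls}
print(partial)
# {'Morgan': 0.674, 'Other': 1.142, 'Saulwick': 1.98, 'McNair': 2.441}
print(min(partial, key=partial.get))  # Morgan
```

Even without the single-edge weights, Morgan comes out as the poll separated from Actual by the least split weight.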

As per Morrison (2012), the pathlengths

> can also be used to evaluate the relative success of the opinion polls. That is, the network pathlength distance from Actual to Morgan is the shortest, which we can interpret as indicating that Roy Morgan Research was the most "successful" of the four opinion polls. That is, its predictions were the "least different" from the actual election results, across all of the elections.

This should make it easier to interpret language split graphs. Let's discuss a famous one, from Gray et al. (2010):

![network](img/Language-network.jpg)

And this one, by Sean Roberts:

![game](img/language_game.png)

## 4. Reception and Limitations

The reception of quantitative methods in historical linguistics, particularly of phylogenetic methods, has not been unanimously positive. The best-known response is probably the book "The Indo-European Controversy - Facts and Fallacies in Historical Linguistics" by Asya Pereltsvaig and Martin W. Lewis (2015), which focuses on the papers by Gray et al. on the homeland of the Indo-Europeans; it is a somewhat scathing review of the work by experts that might focus too much on the reception of that research by the mainstream media. Another well-known criticism is that of Roger Blench (2015). In general, we can say that most scholars who do not deal directly with quantitative methods remain somewhat skeptical, although they are becoming more and more optimistic.

Some of the resistance is due to a general skepticism towards quantitative research in fields traditionally reserved for the humanities, caused in part by the arrogant behavior of some fields which believe that century-old questions can be solved with some quick and dirty algorithms combined with statistical approximations. In historical linguistics, a traditional resistance to quantitative linguistics is also at play: while mathematical methods in the field go back at least to Dumont d'Urville in the 1830s, quantitative methods are still strongly associated with the work of Swadesh, whose lexicostatistics is nowadays mostly taken as a "first look" that only explains what is already known, and whose glottochronology is better left unmentioned. Alternative takes such as those of Greenberg and Starostin, while more grounded, are mostly regarded as not mainstream, in part due to their defense of supra-families, such as Nostratic, towards which historical linguists in general either take an agnostic point of view or which they reduce to chance resemblance. As such, specialists in key phyla, such as Indo-European and Austronesian, have been very slow to accept, if at all, some of the insights from quantitative methods; ironically, the authors of the datasets and cognacy judgments used sometimes explicitly reject the phylogenetic results.

Objections also tend to be raised because this kind of research has only recently (at most in the last ten years) started to appear in solid linguistic journals: most of the findings are still being published in general scientific publications (even though reliable ones, such as "Nature" or "Science") or in "hard science" (mostly statistics) outlets instead of peer-reviewed linguistic journals. The skepticism of traditional historical linguists, while in some cases it might be just an expression of anti-quantitative or even neo-Luddite feelings, is generally well-grounded and must be taken into account. Among the objections we find:

- That the value of the findings cannot be assessed
- That the results and especially the inner workings are difficult to interpret and, thus, to question
- That it is not clear how results and methods could be improved
- That the research is not scientific in the sense of being reproducible and falsifiable
- That the judgments tend to be based only on lexical cognacy, i.e., the subjective assessment that two lexemes share an ancestor, something which is hard to apply to other areas such as phonology, morphology, and syntax (also because such areas are assumed to be less resistant to change)
- That automatic cognacy judgment is not on par with expert judgment, and that experts are more likely to detect morphological components and issues such as corrupted data
- That phylogenetic methods operate on shared retentions and, especially, shared innovations, which have not been proven to be as good predictors in language history as they are in biology
- That the assumption that rates of change are comparable is not proven, and that there is more evidence to the contrary
- That it is assumed that cognacy based on inheritance can be reliably distinguished from that based on borrowing, especially among related languages
- That we cannot assume that all languages necessarily diversify according to arborescent or starburst-like structures, which means that visualizations can be misleading
- That the systems are not ready to deal with known phenomena such as Wanderwörter

Let's discuss each point.

Blench (2015) even says that phylogenetic analyses are not subject to the "same critical gaze" as traditional historical linguistics and that there is a parallel with generative analysis in the sense that they are "impossible to falsify". Cognacy judgments are questioned in particular, because the methods assume that they are mostly correct (and, especially, that the assumptions on transmission and evolution hold), that inherited cognates can be distinguished from borrowings, that each language is somewhat closed (i.e., that it is, in fact, a taxon), that relationships can be expressed by splits which are, at the very lowest level, binary, and so on. The problem of differences in cognate judgment among different authors is generally considered particularly serious, as the trees would only reproduce the tree assumed when the cognates were judged (not to mention that differences are found even within the work of the same author; Blench likes to recall Hal Fleming, who famously said that it is easier to find cognates after a good lunch).

While we discuss such limitations, we must keep in mind that historical linguistics in general is, in the end, based on subjective evaluations, sometimes stemming from circular reasoning, that operate on probabilities even when this is not admitted. If basic components of the data for quantitative methods, such as cognacy judgments, are inherently subjective and based on intuition, we could actually say that the simple fact that the methods require machine-readable data (and even more so in our approach, where the data must also be human-readable and open) makes the computational methods, in this respect, more transparent and replicable than traditional ones. We do not usually get to observe the starting point and the steps of traditional historical linguistics and, even worse, they are sometimes kept as precious treasures by their "owners". Even the fact that the data is usually collected by one set of specialists, analyzed by a different set of specialists, and evaluated by yet another set is part of a kind of collaboration that is still lacking in a field of "lone wolves".

That said, we want to thank you all for your participation and hope that what we covered will help you to take part in this new wave of research in historical linguistics. See you around! ```:)```