# Using Phylogenies in Computational Historical Linguistics






## 1. Introduction

### 1.1 Inner and Outer Language History
The origins of phylogenetics go back to the 19th century, and its principles have not changed much since. One of the lesser-known early linguists is Georg von der Gabelentz (1840-1893). The following description of phylogenetic reconstruction, based on his words from 1891, may still hold true:

>This branch of linguistics is at first concerned with the most dry individual facts: Are languages A and B related and to what degree? Does this word or word form exist in that language or at that time of language history? How is it pronounced there? What kinds of rules can be found in the sound differences? Is it a native or a loan word? Which words belong to the original lexicon, which have been added later? And so on. All this sounds and really is very dry. The innermost drive of human language, that which otherwise makes comparative linguistics one of the most lively subjects, stands back at first: Merely a few of the field's offshoots wind to the mental and social life of the speakers. A linguist focused on a single language hurries to absorb it into themselves whereas a language historian stands outside of it, like an anatomist next to a corpse.
[...]
We should best distinguish between outer and inner language history. Outer language history is the history of its spatial and temporal spreading, of its branchings and possible interminglings - in other words, its genealogy. Inner language history shows and tries to explain how the language slowly changed with regard to meaning and form.
(Gabelentz 1891: 145f., translated/rephrased)

### 1.2 Trees, Waves, and Nets
   
#### Trees

The first phylogenetic tree of the Slavic languages, by František Čelakovský (1799-1852), was published in 1853, shortly before August Schleicher published his tree model theory in the same year:

![alt text](https://github.com/digling/calc-seminar/blob/master/data/S14.1.png)

>The oldest splits of Indo-European until the creation of the basic languages of which the language trunk consists can be illustrated by this diagram. The length of the lines implies time depth, their distance the degree of relationship.
(Schleicher 1861: 6, translated/rephrased)
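Schleicher's reading of the diagram can be made concrete with a small data structure. The following is only a minimal sketch in Python under several assumptions: the tree, its node names, and its branch lengths are invented for illustration, and the path length between two languages is used as a modern stand-in for what Schleicher calls the degree of relationship. It is not a reconstruction of his diagram.

```python
# Hypothetical toy tree: each node maps to (parent, branch_length).
# Node names and branch lengths are invented for illustration only.
TREE = {
    "root": (None, 0.0),
    "AB": ("root", 1.0),
    "A": ("AB", 2.0),
    "B": ("AB", 2.0),
    "C": ("root", 3.0),
}


def path_to_root(node, tree):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while tree[path[-1]][0] is not None:
        path.append(tree[path[-1]][0])
    return path


def depth(node, tree):
    """Sum of branch lengths from the root down to `node` (time depth)."""
    return sum(tree[n][1] for n in path_to_root(node, tree))


def tree_distance(a, b, tree):
    """Sum of branch lengths on the path between two nodes."""
    ancestors_a = path_to_root(a, tree)
    ancestors_b = set(path_to_root(b, tree))
    # The first ancestor of `a` that is also an ancestor of `b` is their
    # most recent common ancestor.
    mrca = next(n for n in ancestors_a if n in ancestors_b)
    return depth(a, tree) + depth(b, tree) - 2 * depth(mrca, tree)


if __name__ == "__main__":
    print(depth("A", TREE))               # time depth of A below the root
    print(tree_distance("A", "B", TREE))  # closely related pair
    print(tree_distance("A", "C", TREE))  # more distantly related pair
```

In this toy tree, A and B share the intermediate ancestor AB, so the path between them is shorter than the path between A and C, which runs through the root.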

#### Waves

Not long after Schleicher published his tree model, it met with opposition among Indo-Europeanists and language historians. The most famous alternative is the wave theory by Johannes Schmidt (1843-1901):

>There is no choice, we need to recognize that Lithuanian is inseparably connected both with German and with Indo-Iranian. The European, German, and Indo-Iranian features pervade each other so fully that a whole list of phenomena are caused just by their organic collaboration and that there are words whose forms are neither fully European nor fully Indo-Iranian and can only be understood when seen as the result of these two streams crossing each other.
[...]
If we now want to represent the relationships of the Indo-European languages in an image that illustrates how their differences came to be, we need to fully give up on the idea of the tree model. Instead I would like to use the image of a wave spreading in concentric circles that grow weaker and weaker the farther away they are from the center. It's of no issue that our language area doesn't form a circle but a circle's sector at best and that the oldest language doesn't lie in the center but rather at one end of the area. I also consider the image of a plane tilted in one direct line from Sanskrit to Celtic to be not unfitting.
(Schmidt 1872: 27, translated/rephrased)

#### Nets and Other Kinds of Entanglements

The biggest problem of Schmidt's wave theory was that nobody knew exactly how to use it to illustrate outer language history schematically. As a result, many different approaches to visualizing wave theory can be found.

![alt text](https://github.com/digling/calc-seminar/blob/master/data/S14.2.png)

### 1.3 A Matter of Perspective

One basic difference between the tree and the wave model lies in their goals: wave theory takes an epistemological perspective on outer language history, whereas tree theory takes an ontological one. In other words, waves describe what we can know, trees what we would like to know. That is why Schmidt opposed the tree model on the grounds that the facts cannot be represented with it:

>The fact is, twist it how you will, as long as you insist on the opinion that historical languages came to be by multiple offshoots from the original language, i.e. as long as you believe in a tree model of Indo-European languages, you will never be able to explain all the facts in question scientifically.
(Schmidt 1872: 17, translated/rephrased)

The facts he refers to here are mainly lists of roots shared among different Indo-European languages (cognates in the broadest sense). The problem with these facts, however, is that they themselves depend on the state of research. If, for example, you compare Schmidt's counts of roots shared among Greek, Old Indian, and Latin with the counts by Nicholaev (2007), you will notice that the closeness between Latin and Greek suggested by Schmidt's data is much less pronounced in Nicholaev's data. The same holds true for the very low similarity between Old Indian and Latin in Schmidt's data:

![alt text](https://github.com/digling/calc-seminar/blob/master/data/S14.3.png)

### 1.4 Not Seeing the Tree for the Waves...

Computer-assisted methods for phylogenetic reconstruction still have their problems, but these are not always caused by the methods being aimed solely at detecting trees, and thus being based on the wrong model, as is often assumed. The opposite problem can also occur: treelike processes can disguise themselves as non-treelike. In biology, this phenomenon is called incomplete lineage sorting:

>Incomplete lineage sorting occurs when an ancestral species undergoes several speciation events in a short period of time. If, for a given gene, the ancestral polymorphism is not fully resolved into two monophyletic lineages when the second speciation occurs, then with some probability the gene tree will be different from the species tree.
(Galtier and Daubin 2008: 4023)

Below you see examples of how polymorphism works in linguistics:

![alt text](https://github.com/digling/calc-seminar/blob/master/data/S14.3.png)
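The effect of incomplete lineage sorting can be illustrated with a small simulation. The following is a minimal sketch in Python under strong simplifying assumptions, none of which come from the sources cited above: the true tree is `((A, B), C)`, the proto-language has exactly two competing synonyms (`"x"` and `"y"`) for each concept, every language eventually keeps one of them at random, and `persist_prob` is a hypothetical parameter for the chance that both synonyms survive in the common ancestor of A and B until the second split.

```python
import random


def simulate_concept(persist_prob, rng):
    """Return the variant kept by languages A, B, and C for one concept.

    Assumed true tree: ((A, B), C). The proto-language is assumed to
    have two competing synonyms, "x" and "y".
    """
    variants = ["x", "y"]
    # Language C eventually keeps one of the two synonyms.
    c = rng.choice(variants)
    # The ancestor of A and B either still has both synonyms at the
    # second split (polymorphism persists) or has settled on one.
    if rng.random() < persist_prob:
        ab_pool = variants
    else:
        ab_pool = [rng.choice(variants)]
    # After the second split, A and B each settle independently.
    a = rng.choice(ab_pool)
    b = rng.choice(ab_pool)
    return a, b, c


def conflict_rate(persist_prob, n_concepts=10_000, seed=42):
    """Share of concepts whose pattern contradicts the tree ((A, B), C)."""
    rng = random.Random(seed)
    conflicts = 0
    for _ in range(n_concepts):
        a, b, c = simulate_concept(persist_prob, rng)
        # With two variants, A != B means that C agrees with exactly one
        # of them, which suggests the grouping (A, C) or (B, C).
        if a != b:
            conflicts += 1
    return conflicts / n_concepts


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        print(p, conflict_rate(p))
```

Although every concept in this sketch evolves strictly along the tree, the share of concepts that appear to contradict it grows with `persist_prob`. When the polymorphism is always resolved before the second split (`persist_prob = 0.0`), no conflicting patterns arise at all.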

## 2. Inferring Phylogenies


## 3. Mapping Characters on Reference Phylogenies
+++introduce the major idea of gene tree reconciliation, character mapping, etc. +++