# Automatic Detection of Correspondence Patterns (Johann-Mattis List)




## Correspondence Patterns in Historical Linguistics

### Introduction

One of the fundamental insights of early historical linguistic research was that – as
a result of systemic changes in the sound system of languages – genetically related
languages exhibit structural similarities in those parts of their lexicon which were
commonly inherited from their ancestral languages. These similarities surface in form
of correspondence relations between sounds from different languages in cognate words.

Given the increasing application of automatic methods in historical linguistics after
the “quantitative turn” ([Geisler and List 2013](http://bibliography.lingpy.org?key=Geisler2013), 111) at the beginning of this millennium,
scholars have repeatedly attempted either to infer regular sound correspondences directly
across genetically related languages ([Kondrak 2009](http://bibliography.lingpy.org?key=Kondrak2009), [2003](http://bibliography.lingpy.org?key=Kondrak2003); [Brown et al. 2013](http://bibliography.lingpy.org?key=Brown2013)) or to integrate the inference into workflows for automatic
cognate detection ([List 2014](http://bibliography.lingpy.org?key=List2014d)). What is
interesting in this context, however, is that almost all approaches dealing with regular
sound correspondences, be it early formal – but classically grounded – accounts
([Grimes and Agard 1959](http://bibliography.lingpy.org?key=Grimes1959); [Hoenigswald 1960](http://bibliography.lingpy.org?key=Hoenigswald1960)) or computer-based methods ([Kondrak 2003](http://bibliography.lingpy.org?key=Kondrak2003); List 2014), only consider sound correspondences between *pairs* of languages.

A rare exception can be found in the work of [Anttila](http://bibliography.lingpy.org?key=Anttila1972) (1972, 229-263), who presents
the search for regular sound correspondences across multiple languages as the basic technique
underlying the comparative method for historical language comparison. Anttila’s
description starts from a set of cognate word forms (or morphemes) across the languages
under investigation. These words are then arranged in such a way that corresponding
sounds in all words are placed into the same column of a matrix. The extraction of
regularly recurring sound correspondences in the languages under investigation is then
based on the identification of similar patterns recurring across different columns within
the cognate sets.

The procedure is illustrated in the following figure, where four cognate sets in Sanskrit, Ancient Greek, Latin, and Gothic are shown.

![img](img/s10-fig1.png)

While it seems trivial to identify sound correspondences across multiple languages
from the few examples provided in the figure, the problem can become quite complicated
if we add more cognate sets and languages to the comparative sample. Especially the
handling of missing reflexes for a given cognate set becomes a problem here, as missing
data makes it difficult for linguists to decide which alignment columns to group with each
other. This can already be seen from the examples given in Figure 1, where we have two
possibilities to group the patterns A, C, E, and F.

### Preliminaries on Sound Correspondence Patterns


Sound correspondences are most easily defined for pairs of languages. Thus, it is straightforward
to state that German [d] regularly corresponds to English [θ], that German [ts]
regularly corresponds to English [t], and that German [t] corresponds to English [d].
We can likewise expand this view to multiple languages by adding another Germanic
language, such as, for example, Dutch to our comparison, which has [d] in the case of
German [d] and English [θ], [t] in the case of German [ts] and English [t], and [d]
in the case of German [t] and English [d]. Examples for all forms are given along with
proto-forms in Proto-Germanic in the table below.

![img](img/s10-tab1.png)

The more languages we add to the sample, however, the more complex the picture will
get, and while we can state three (basic) patterns for the case of English, German, and
Dutch, given in our example, we may easily get more patterns, due to secondary sound
changes in the different languages, although we would still reconstruct only three sounds
in the proto-language ([θ, t, d]). Thus, there is a one-to-n relationship between what we
interpret as a proto-sound of the proto-language, and the regular correspondence patterns
which we may find in our data.

While we will reserve the term sound correspondence for
pairwise language comparison, we will use the term *sound correspondence pattern* (or
simply *correspondence pattern*) for the abstract notion of regular sound correspondences
across a set of languages which we can find in the data.

### Correspondence Patterns and Proto-Forms

Scholars like Meillet ([1908](http://bibliography.lingpy.org?key=Meillet1908), 23) have stated that the core of historical linguistics is not
linguistic reconstruction, but the inference of correspondence patterns, emphasizing that
'reconstructions are nothing else but the signs by which one points to the correspondences in short form'. However, given the one-to-n relation between proto-sounds and
correspondence patterns, it is clear that this is not quite correct. Having inferred regular
correspondence patterns in our data, our reconstructions will add a different level of
analysis by further clustering these patterns into groups which we believe to reflect one
single sound in the ancestral language.

That there is usually more than one correspondence pattern for a reconstructed
proto-sound is nothing new to most practitioners of linguistic reconstruction. Unfortunately, however, linguists rarely list all possible correspondence patterns exhaustively
when presenting their reconstructions, but instead select the most frequent ones, leaving
the explanation of weird or unexpected patterns to comments written in prose.

### Correspondence Patterns in Classical Linguistic Literature

What scholars do instead is to provide tables which summarise the correspondence
patterns in a rough form, e.g., by showing the reflexes of a given proto-sound in the
descendant languages in a table, where multiple reflexes for one and the same language
are put in the same cell. An example, taken with modifications from Clackson ([2007](http://bibliography.lingpy.org?key=Clackson2007): 37),
is given in the following table.

![img](img/s10-tab2.png)

### Correspondence Patterns and Alignments

In order to infer correspondence patterns, the data must be available in aligned form, that is, we must know which of the
sound segments that we compare across cognate sets are assumed to go back to the
same ancestral segment. This is illustrated in the following figure, where the cognate sets from the table above are presented in aligned form, following the alignment annotations of LingPy and EDICTOR.

![img](img/s10-fig2.png)


It is important to keep in mind that strict alignments can only be made of cognate
words (or parts of cognate words) that are *directly related*. The notion of directly related
word (parts) is close to the notion of orthologs in evolutionary biology ([List 2016](http://bibliography.lingpy.org?key=List2016f)) and
refers to words or word parts whose development has not been influenced by secondary
changes due to morphological processes.

Following evolutionary biology, a given column of an alignment is called an *alignment
site* (or simply a site). An alignment site may reflect the same values as we find
in a correspondence pattern, and correspondence patterns are usually derived from
alignment sites, but in contrast to a correspondence pattern, an alignment site may
reflect a correspondence pattern only incompletely, due to missing data in one or more
of the languages under investigation.

In alignments, "gaps" due to missing reflexes of a given cognate set are not the same as
the gaps inside an alignment, since the latter are due to the (regular) loss or gain of a
sound segment in a given alignment site, while gaps due to missing reflexes may either
reflect processes of lexical replacement (List 2014, 37f), or a preliminary stage of research
resulting from insufficient data collections or insufficient search for potential reflexes.

While we follow the LingPy annotation for gaps in alignments by using the dash as a
symbol for gaps in alignment sites, we will use the character Ø (denoting the empty set)
to represent missing data in correspondence patterns and alignment sites. This is illustrated in the following figure.

![img](img/s10-fig3.png)
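The distinction between the two symbols can be made concrete with a small sketch that turns the aligned words of one cognate set into alignment sites. The function name `alignment_sites` is hypothetical and not part of LingPy; the segments are those of the "worm" cognate set that is printed in the LingPy session in Section 3.1.

```python
MISSING = "Ø"  # no reflex attested; "-" stays reserved for alignment gaps


def alignment_sites(aligned, languages, missing=MISSING):
    """Turn the aligned words of one cognate set into alignment sites.

    `aligned` maps a language to its aligned segments; languages without
    a reflex are absent from the mapping and are filled with `missing`.
    """
    lengths = {len(segments) for segments in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned words must all have the same length")
    return [
        [aligned[lang][i] if lang in aligned else missing for lang in languages]
        for i in range(lengths.pop())
    ]


languages = ["Danish", "Dutch", "English", "German",
             "Icelandic", "Norwegian", "Swedish"]
worm = {
    "Danish": ["-", "o", "ʁˀ", "m"],
    "Dutch": ["ʋ", "ɔ", "r", "m"],
    "English": ["ʋ", "ɜː", "r", "m"],
    "German": ["v", "ʊ", "r", "m"],
    "Icelandic": ["-", "ɔ", "r", "m"],
}
sites = alignment_sites(worm, languages)
for site in sites:
    print("\t".join(site))
```

The first site contains both symbols: a dash for Danish and Icelandic, where the initial segment is absent from the aligned word, and Ø for Norwegian and Swedish, where no cognate word is available at all.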


## 2 An Algorithm for Automatic Correspondence Pattern Recognition

### 2.1 Preliminary Thoughts

#### 2.1.1 Compatibility of Alignment Sites

If we recall the problem we had in grouping the alignment sites E and F from Figure 1
with either A or C, we can see that the general problem of grouping alignment sites to
correspondence patterns is their *compatibility*. If we had reflexes for all languages under
investigation in all cognate sets, the compatibility would not be a problem, since we
could simply group all identical sites with each other, and the task could be considered
as solved. However, since it is rather an exception than the norm to have reflexes for all
languages under consideration in a number of cognate sets, we will always find alternative
possibilities to group our alignment sites in correspondence patterns. In the following, I
will assume that two alignment sites are compatible, if they (a) share at least one sound
which is not a gap symbol, and (b) do not have any conflicting sounds. We can further
weight the compatibility by counting how many sounds are shared among two alignment
sites. This is illustrated in the following figure for our four alignment sites A, C, E, and F from
the figure above. As we can see from the figure, only two sites are incompatible, namely
A and C, as they show different sounds for the reflexes in Gothic. Given that the reflex
for Latin is missing in site C, we can further see that C shares only two sounds with E
and F.

![img](img/s10-fig4.png)
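The compatibility criterion can be written down in a few lines. The following is a minimal sketch, not the code used in the plugin: the function name `compatibility` is hypothetical, and the four sites are toy values chosen to be consistent with the description above (language order Sanskrit, Greek, Latin, Gothic), not values copied from the figure.

```python
GAP = "-"
MISSING = "Ø"


def compatibility(site_a, site_b):
    """Number of shared non-gap sounds, or 0 if the sites are incompatible."""
    shared = 0
    for a, b in zip(site_a, site_b):
        if MISSING in (a, b):
            continue        # missing data can never cause a conflict
        if a != b:
            return 0        # conflicting sounds violate criterion (b)
        if a != GAP:
            shared += 1     # only real sounds count towards criterion (a)
    return shared


# Toy sites (assumed values for illustration)
A = ["u", "u", "u", "au"]
C = ["u", "u", "Ø", "u"]
E = ["u", "u", "u", "Ø"]
F = ["u", "u", "u", "Ø"]

print(compatibility(A, C))  # conflict in the last language
print(compatibility(A, E))
print(compatibility(C, E))
```

A return value of 0 covers both ways of failing the criterion: a conflict, or no shared sound at all. Any positive value can serve directly as the weight of the compatibility.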

#### 2.1.2 Modeling Sound Correspondence Patterns in Networks

Having established the concept of alignment site compatibility in the previous section, it
is straightforward to go a step further and model alignment sites in form of a network.
Here, all sites in the data represent nodes (or vertices), and edges are only drawn between
those nodes which are compatible, following the criterion of compatibility outlined in
the previous section. We can further weight the edges in the alignment site network, for
example, by using the number of matching sounds (where no missing data is encountered)
to represent the strength of the connection (but we will disregard weighting in our
method). The following figure illustrates how an alignment site network can be created from the compatibility comparison shown in the figure above.

![img](img/s10-fig5.png)
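As a rough sketch of this modeling step, an alignment site network can be stored as a plain adjacency mapping. The names `compatible` and `site_network` are hypothetical, and the sites are toy values (assumed for illustration) in which only A and C conflict.

```python
from itertools import combinations

MISSING = "Ø"


def compatible(a, b):
    """True if two sites share a non-gap sound and never conflict."""
    pairs = [(x, y) for x, y in zip(a, b) if MISSING not in (x, y)]
    return all(x == y for x, y in pairs) and any(x != "-" for x, _ in pairs)


def site_network(sites):
    """Unweighted network: one node per site, edges between compatible sites."""
    network = {name: set() for name in sites}
    for a, b in combinations(sites, 2):
        if compatible(sites[a], sites[b]):
            network[a].add(b)
            network[b].add(a)
    return network


# Toy sites (assumed values for illustration)
sites = {
    "A": ["u", "u", "u", "au"],
    "C": ["u", "u", "Ø", "u"],
    "E": ["u", "u", "u", "Ø"],
    "F": ["u", "u", "u", "Ø"],
}
network = site_network(sites)
for node in sorted(network):
    print(node, sorted(network[node]))
```

Every pair of sites except A and C ends up connected, so the network has five edges.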

#### 2.1.3 Correspondence Pattern Recognition as a Clique Coverage Problem

As mentioned before, the main problem of assigning different
alignment sites to correspondence patterns is to decide about those cases where one site
could be assigned to more than one pattern. Having shown how the data can be modeled
in form of a network, we can rephrase the task of identifying correspondence patterns
as a *network partitioning task* with the goal to split the network into non-overlapping
sets of nodes. Given that our main criterion for a valid correspondence pattern is full 
compatibility among all alignment sites of a given partition, we can further specify the
task as a clique partitioning task. A clique in a network is 'a maximal subset of the
vertices [nodes] in an undirected network such that every member of the set is connected
by an edge to every other' ([Newman 2010](http://bibliography.lingpy.org?key=Newman2010): 193). Demanding that sound correspondence
patterns should form a clique of compatible nodes in the network of alignment sites is
directly reflecting the basic practice of historical language comparison as outlined by
Anttila (1972), according to which a further grouping of incompatible alignment sites by
proposing a proto-form would require us to identify a phonetic environment that could
show incompatible sites to be complementary.

The minimum clique cover problem is a well-known problem in graph theory and
computer science, although it is usually more prominently discussed in form of its inverse
problem, the graph coloring problem, which tries to assign different colors to all nodes
in a graph which are directly connected ([Hetland 2010](http://bibliography.lingpy.org?key=Hetland2010): 276). While the problem is
generally known to be NP-hard (ibid.), fast approximate solutions like the Welsh-Powell
algorithm ([Welsh and Powell 1967](http://bibliography.lingpy.org?key=Welsh1967)) are available. Using approximate solutions seems
to be appropriate for the task of correspondence pattern recognition, given that we do
not (yet) have formal linguistic criteria to favor one clique cover over another.
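The relation between the two problems can be illustrated with a short sketch: a clique cover of a graph is a proper coloring of its complement, so a Welsh-Powell coloring of the complement yields an approximate clique cover. This is a textbook-style illustration with hypothetical function names, not the variant implemented in the plugin, and the example network is a toy graph in which only A and C are unconnected.

```python
def welsh_powell(adjacency):
    """Greedy Welsh-Powell coloring; returns a mapping node -> color index."""
    order = sorted(adjacency, key=lambda n: len(adjacency[n]), reverse=True)
    colors = {}
    color = 0
    while len(colors) < len(order):
        for node in order:
            if node in colors:
                continue
            # use the current color if no neighbor already carries it
            if all(colors.get(nb) != color for nb in adjacency[node]):
                colors[node] = color
        color += 1
    return colors


def complement(adjacency):
    """Connect exactly those nodes that are NOT connected in the input."""
    nodes = set(adjacency)
    return {n: nodes - adjacency[n] - {n} for n in adjacency}


def clique_cover(adjacency):
    """Approximate clique cover: color classes of the complement graph."""
    cover = {}
    for node, color in welsh_powell(complement(adjacency)).items():
        cover.setdefault(color, set()).add(node)
    return list(cover.values())


# Toy network (assumed for illustration): all pairs connected except A and C
network = {
    "A": {"E", "F"},
    "C": {"E", "F"},
    "E": {"A", "C", "F"},
    "F": {"A", "C", "E"},
}
cover = clique_cover(network)
print(sorted(sorted(group) for group in cover))
```

Nodes with the same color are never adjacent in the complement, which means they are all mutually connected in the original network and therefore form a clique.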

### 2.2 A Method for Correspondence Pattern Recognition

#### 2.2.1 General Workflow

The general workflow underlying the method for automatic correspondence pattern
recognition can be divided into five different stages. Starting from a multilingual wordlist
in which translations for a concept list are provided in form of phonetic transcriptions
for the languages under investigation, the words in the same semantic slot are manually
or automatically searched for cognates (A) and (again manually or automatically)
phonetically aligned (B). The alignment sites are then used to construct an alignment site
network in which edges are drawn between compatible sites (C). The alignment sites are
then partitioned into distinct non-overlapping subsets using an approximate algorithm
for the minimum clique cover problem (D). In a final step, potential correspondence
patterns are extracted from the non-overlapping subsets, and all individual alignment
sites are assigned to those patterns with which they are compatible (E). While there are
both standard algorithms and annotation frameworks for stages (A) and (B), the major
contribution of this paper is to provide the algorithms for stages (C), (D), and (E). The
workflow is further illustrated in the following figure. In the following sections, I will provide more
detailed explanations on the different stages.

![img](img/s10-fig6.png)

#### 2.2.2 Input Format

The input format follows the general input format used in LingPy and EDICTOR. 
In addition to the generally needed information on the identifier of each word (ID),
on the language (DOCULECT), the concept or elicitation gloss (CONCEPT), the
(not necessarily required) orthographic form (FORM), and the phonetic transcription
provided in space-segmented form (TOKENS), the method requires information on the
type of sound (consonant or vowel, STRUCTURE), the cognate set (COGID), and the
alignment (ALIGNMENT). This is illustrated in the following table.

![img](img/s10-tab3.png)
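To give an impression of what such a file looks like, the sketch below parses a tab-separated table with the columns named above using only the standard library and runs two simple consistency checks. The two rows are invented for illustration and are not taken from the datasets used later; the helper `check_rows` is hypothetical and not part of LingPy.

```python
import csv
import io

# Invented two-word example; real data would be read from a file instead.
RAW = (
    "ID\tDOCULECT\tCONCEPT\tFORM\tTOKENS\tSTRUCTURE\tCOGID\tALIGNMENT\n"
    "1\tGerman\thand\tHand\th a n t\tc v c c\t1\th a n t\n"
    "2\tEnglish\thand\thand\th æ n d\tc v c c\t1\th æ n d\n"
)


def check_rows(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    site_counts = {}
    for row in rows:
        tokens = row["TOKENS"].split()
        structure = row["STRUCTURE"].split()
        alignment = row["ALIGNMENT"].split()
        # every segment needs exactly one structure label
        if len(tokens) != len(structure):
            problems.append(f"ID {row['ID']}: TOKENS and STRUCTURE differ in length")
        # all alignments of one cognate set must have the same number of sites
        expected = site_counts.setdefault(row["COGID"], len(alignment))
        if len(alignment) != expected:
            problems.append(f"ID {row['ID']}: alignment length differs in COGID {row['COGID']}")
    return problems


rows = list(csv.DictReader(io.StringIO(RAW), delimiter="\t"))
print(check_rows(rows) or "no problems found")
```

Checks of this kind are cheap to run before the analysis and catch segmentation mistakes that are easy to make during manual annotation.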

#### 2.2.3 Cognate Detection and Phonetic Alignment

Given that the method is implemented in form of a plugin for the LingPy library, all
cognate detection and phonetic alignment methods offered in LingPy are also available
for the approach and have been tested. The automatic methods for cognate detection and phonetic alignments, however, are
not necessarily needed in order to apply the automatic method for correspondence pattern
recognition. Alternatively, users can prepare their data with the help of the EDICTOR tool,
which allows users both to annotate cognates and alignments from scratch and to refine cognate sets
and alignments that have been derived from automatic approaches.

#### 2.2.4 Correspondence Pattern Recognition

The method for correspondence pattern recognition consists of three stages (C-E in our
general workflow). It starts with the construction of an alignment site network in which
each node represents a unique alignment site, and links between alignment sites are
drawn if the sites are compatible, following the criterion for site compatibility outlined
in Section 2.1.1 (C). It then uses a greedy algorithm to compute an approximate minimal
clique cover of the network (D). All partitions proposed in stage (D) qualify as potentially
valid correspondence patterns of our data. But the individual alignment sites in a given
dataset may as well be compatible with more than one correspondence pattern. For this
reason, the method iterates again over all alignment sites in the data and checks with
which of the correspondence patterns inferred in stage (D) they are compatible. This
procedure yields a (potentially) fuzzy assignment of each alignment site to at least one
but potentially more different sound correspondence patterns (E). By further weighting
and sorting the fuzzy patterns to which a given site has been assigned, the number of
fuzzy alignment sites can be further reduced.

The clique cover algorithm consists of two steps. In a first step, the data is sorted,
using a customized variant of the Quicksort algorithm ([Hoare 1962](http://bibliography.lingpy.org?key=Hoare1962)), which seeks to
sort patterns according to compatibility and similarity. By iterating over the sorted
patterns, all compatible patterns are assigned to the same cluster in this first pass, which
provides a first very rough partition of the network. While this procedure is by no means
perfect, it has the advantage of detecting major signals in the data very quickly. For this
reason, it has also been introduced into the EDICTOR tool, where a more
refined method addressing the clique cover problem could not be used, due to the typical
limitations of JavaScript running on the client side.

In a second step, an inverse version of the Welsh-Powell algorithm for graph
coloring (Welsh and Powell 1967) is employed. This algorithm starts from sorting all
existing partitions by size, beginning with the largest partitions. It then consecutively
compares the currently largest partition with all other partitions, merging those which
are compatible with each other, and keeping the incompatible partitions in the queue.
The algorithm stops, once all partitions have been visited and compared against the
remaining partitions.

The figure below gives an artificial example that illustrates how the basic method infers the
clique cover. Starting from the data in (A), the method assembles patterns A and B in (B)
and computes their pattern, thereby retaining the non-missing data for each language in
the pattern as the representative value. Having added C and D in this fashion in steps (C)
and (D), the remaining three alignment sites, E-G are merged to form a new partition,
accordingly, in steps (E) and (F).

![img](img/s10-fig8.png)
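The merging step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the plugin's implementation: it starts from single sites instead of the pre-sorted partitions of the first pass, the names `compatible`, `merge`, and `greedy_cover` are hypothetical, and the sites are toy values.

```python
MISSING = "Ø"


def compatible(a, b):
    """True if two sites share a non-gap sound and never conflict."""
    pairs = [(x, y) for x, y in zip(a, b) if MISSING not in (x, y)]
    return all(x == y for x, y in pairs) and any(x != "-" for x, _ in pairs)


def merge(a, b):
    """Consensus pattern: keep the non-missing value for each language."""
    return [y if x == MISSING else x for x, y in zip(a, b)]


def greedy_cover(sites):
    """Greedily merge compatible partitions, visiting the largest first."""
    queue = sorted(
        ((list(site), [name]) for name, site in sites.items()),
        key=lambda partition: len(partition[1]),
        reverse=True,
    )
    cover = []
    while queue:
        pattern, members = queue.pop(0)
        remaining = []
        for other_pattern, other_members in queue:
            if compatible(pattern, other_pattern):
                pattern = merge(pattern, other_pattern)
                members = members + other_members
            else:
                remaining.append((other_pattern, other_members))
        cover.append((pattern, members))
        queue = remaining  # incompatible partitions stay in the queue
    return cover


# Toy sites (assumed values for illustration)
sites = {
    "A": ["u", "u", "u", "au"],
    "C": ["u", "u", "Ø", "u"],
    "E": ["u", "u", "u", "Ø"],
    "F": ["u", "Ø", "u", "Ø"],
}
cover = greedy_cover(sites)
for pattern, members in cover:
    print("-".join(pattern), members)
```

Because each candidate is compared against the growing consensus pattern and not against individual members, a site is only added if it is compatible with everything already merged into the partition.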

In the final stage of assigning alignment sites to correspondence patterns, our method
first assembles all correspondence patterns inferred from the greedy clique cover analysis
and then iterates over all alignment sites, checking again whether they are compatible
with a given pattern or not. Since alignment sites may suffer from missing data, their
assignment is not always unambiguous. The example alignment from above, for
example, would yield two general correspondence patterns, namely u-u-u-au vs. u-u-u-u.
While the assignment of the alignment sites A and C in the figure would be unambiguous,
the sites E and F would be assigned to both patterns, since, judging from the data, we
could not tell what correspondence pattern they represent in the end.
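This final stage can be sketched in a few lines. The two patterns are those named in the paragraph above, while the sites are toy values assumed for illustration. Fuzziness is computed here as the average number of patterns per site, which is an assumption about how such a score could be defined rather than a statement about the plugin's exact formula.

```python
MISSING = "Ø"


def compatible(a, b):
    """True if two sites share a non-gap sound and never conflict."""
    pairs = [(x, y) for x, y in zip(a, b) if MISSING not in (x, y)]
    return all(x == y for x, y in pairs) and any(x != "-" for x, _ in pairs)


def assign_sites(sites, patterns):
    """Map each site to all patterns with which it is compatible."""
    return {
        name: [label for label, pattern in patterns.items()
               if compatible(site, pattern)]
        for name, site in sites.items()
    }


def fuzziness(assignment):
    """Average number of patterns per site (assumed definition)."""
    return sum(len(labels) for labels in assignment.values()) / len(assignment)


patterns = {
    "u-u-u-au": ["u", "u", "u", "au"],
    "u-u-u-u": ["u", "u", "u", "u"],
}
# Toy sites (assumed values for illustration)
sites = {
    "A": ["u", "u", "u", "au"],
    "C": ["u", "u", "Ø", "u"],
    "E": ["u", "u", "u", "Ø"],
    "F": ["u", "u", "u", "Ø"],
}
assignment = assign_sites(sites, patterns)
print(assignment)
print(fuzziness(assignment))
```

Under this definition a value of 1 means that every site belongs to exactly one pattern, and higher values indicate more ambiguous assignments.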

## 3 Automatic Correspondence Pattern Recognition with LingPy

The correspondence pattern algorithm is currently only available as a plugin for LingPy which can be downloaded from [the open science framework](https://osf.io/mbzsj/?view_only=b7cbceac46da4f0ab7f7a40c2f457ada). To install the plugin, one needs to make sure to have all dependencies installed (which are the same as the ones needed by LingPy), and then install the package by unpacking it and running the typical `sudo python setup.py develop` routine. The package itself is called `lingrex`, and the main module of this package is the `copar` module (=*Correspondence Pattern Recognition*). The code runs both with full and with partial cognates, and there are multiple tweaks to work with the data. In the following, however, I will restrict the demonstration to two examples, where we first align known cognates in a Germanic dataset and investigate their correspondence patterns, and then detect and align partial cognates in a Chinese dataset.

### 3.1 Correspondence Pattern Analysis with Full Cognates

We start by loading the data (file `S10-GER.tsv`) and carrying out an automatic alignment analysis, as illustrated in an earlier session. The concrete method we use to create the alignments is less relevant for us in this context, so we just use the defaults offered by LingPy. This analysis could also be carried out in the correspondence pattern library itself, since the relevant class, `CoPaR`, directly inherits from the `Alignments` class of LingPy. But to stick more closely to the original workflow, we carry out the alignments now with the help of the normal `Alignments` class.



In [1]:
from lingpy import *
alms = Alignments('../data/S10-GER.tsv', ref='cogid')
alms.align()

To be able to carry out our correspondence pattern analysis on this data, we now need to add a dummy `STRUCTURE` column. This is a dummy column in our example, because we want to analyze *all* correspondence patterns, and not restrict the analysis to certain contexts. In many situations, however, it may be useful to split the data when carrying out the analysis and only investigate, for example, the patterns on vowels or the patterns on consonants. For this reason, the `copar` package expects a column name as input that indicates in which part of the input data the table with the structural representation can be found. By "structure" we mean here nothing else than an additional column that lists for each segment (column `TOKENS`) its respective "class", which can be the same class for all segments (and would enable a complete comparison of all patterns against each other), or a very fine-grained model, depending on the knowledge of the researcher about a particular language family. 

The `lingrex` package allows us to carry out these calculations from within the class, so we can now already start loading the data.

In [2]:
from lingrex.copar import CoPaR
cop = CoPaR(alms, ref='cogid', segments='tokens', alignment='alignment')


Adding structure (in our case: the dummy structure that allows us to cluster all elements against all, regardless of whether they are vowels or consonants) is easily done with the help of the `add_structure` method. This method currently supports two models: "cv" (split into vowel, consonant, and tone) and "c" (disregard all class distinctions). The resulting representation of the `TOKENS` in the dataset is added to a new column whose name is passed via the `structure` keyword.

In [3]:
cop.add_structure(model='c', structure='structure')
for i in [x for x in cop][:5]:
    print(' '.join([x.rjust(5, ' ') for x in cop[i, 'tokens']]))
    print(' '.join([x.rjust(5, ' ') for x in cop[i, 'structure'].split()]))
    print('')

    a     l
    c     c

   ɔː     l
    c     c

   æˀ     l
    c     c

    a    lː
    c     c

    a   t͡l     i     r
    c     c     c     c



We can now run the analysis by 1) getting the sites (`CoPaR.get_sites`), 2) sorting the patterns (`CoPaR.sort_patterns`), and 3) clustering the patterns (`CoPaR.cluster_patterns`). For the first function we need to specify which "structure" we have chosen as the base representation. Don't forget to specify the structure keyword! The column in which our structural (phonotactic) representation of the data resides is now named "structure", so we need to indicate this. Note also that we set the keyword `minrefs` to 2, which means that we basically accept all alignment sites (`minrefs` refers to the minimal number of reflexes). You can raise this value to obtain more conservative results, which, however, may also exclude many alignment sites prior to the analysis.

In [4]:
patterns, sites = cop.get_sites(pos='c', structure='structure', minrefs=2)

                                                              

Let us first inspect our sites object, which is an `OrderedDict`, and thus behaves like a normal Python dictionary. The keys consist of tuples of integers, the first indicating the cognate identifier (our `COGID` in the data), and the second the concrete position in the alignment, using Python indices, so that 0 marks the first position. If we know a certain cognate set, we can query the corresponding alignment site:

In [5]:
print('\t'.join([x[:3] for x in cop.cols]))
print('\t'.join(['---' for x in cop.cols]))
for i in range(4):
    print('\t'.join(sites[181, i]))

Dan	Dut	Eng	Ger	Ice	Nor	Swe
---	---	---	---	---	---	---
-	ʋ	ʋ	v	-	Ø	Ø
o	ɔ	ɜː	ʊ	ɔ	Ø	Ø
ʁˀ	r	r	r	r	Ø	Ø
m	m	m	m	m	Ø	Ø


From our cognate set for "worm", we can see clearly why we need two gap symbols, the classical `-` for alignments, pointing to zero-correspondences in the data, and the new gap symbol `Ø` pointing to missing data (here for Norwegian and Swedish, where the word for "worm" has been replaced by non-cognate words).
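The effect of a threshold like `minrefs` can be sketched with the four "worm" sites printed above. The sketch assumes that a reflex is any entry that is not `Ø`, so that alignment gaps (`-`) still count as attested; this is an assumption for illustration, and the helper names `reflexes` and `filter_sites` are hypothetical rather than part of `lingrex`.

```python
MISSING = "Ø"

# The four alignment sites of cognate set 181 ("worm") as printed above
worm_sites = {
    (181, 0): ["-", "ʋ", "ʋ", "v", "-", "Ø", "Ø"],
    (181, 1): ["o", "ɔ", "ɜː", "ʊ", "ɔ", "Ø", "Ø"],
    (181, 2): ["ʁˀ", "r", "r", "r", "r", "Ø", "Ø"],
    (181, 3): ["m", "m", "m", "m", "m", "Ø", "Ø"],
}


def reflexes(site):
    """Count the languages for which the site has data (gaps included)."""
    return sum(1 for segment in site if segment != MISSING)


def filter_sites(sites, minrefs=2):
    """Keep only the sites with at least `minrefs` reflexes."""
    return {key: site for key, site in sites.items() if reflexes(site) >= minrefs}


print({key: reflexes(site) for key, site in worm_sites.items()})
print(len(filter_sites(worm_sites, minrefs=2)))
print(len(filter_sites(worm_sites, minrefs=6)))
```

With five of seven languages attested, all four sites pass a threshold of 2, while a threshold of 6 would discard the whole cognate set.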

We can now follow the basic procedure of the workflow and sort the patterns, as mentioned above. We derive a first set of patterns by using the `sort_patterns` method of the `CoPaR` class (and count how many patterns the algorithm finds):

In [6]:
patterns = cop.sort_patterns()
print(len(patterns))

455


The structure of the patterns object is a dictionary in which the pattern is now the key, and the values are lists of the tuples indicating the index of a given alignment site (cognate identifier plus the position in the alignment). By counting the patterns whose list of values contains more than one entry, we can see how many patterns are non-unique in our data, that is, how many recur in more than one alignment site:

In [7]:
print(len([p for p in patterns.values() if len(p) > 1]))

88


To see whether the improved method for clustering yields more promising results than the simple sorting approach, let's now cluster the patterns with the inverse version of the Welsh-Powell algorithm. Here, we specify a number of iterations, as the algorithm may fail to cluster all patterns sufficiently during the first run.

In [8]:
cop.cluster_patterns(iterations=2)
print(len(cop.clusters), len([p for p in cop.clusters.values() if len(p) > 1]))

                                                                              

312 108




We can see that this has increased the number of non-unique patterns to quite some degree. 108 non-unique patterns for such a small dataset may still seem surprising, but since correspondence patterns have not yet been thoroughly investigated from a quantitative perspective, it may well be "normal", given the multiple possibilities where secondary change can occur that may disturb a given pattern. As a final example, we now assign patterns to fuzzy clusters and compute the fuzziness of our data. We also refine the patterns by adding each pattern to only one cluster, during which we take the information on fuzziness into account.

In [9]:
cop.sites_to_pattern()
cop.refine_sites()
print('{0:.2f}'.format(cop.fuzziness()), len(cop.clusters), len([p for p in cop.clusters.values() if len(p) > 1]))

                                                                              

1.01 312 100




We can see that the fuzziness for this dataset is not very high, which is promising, as it means that our 100 patterns (we further reduced the number of clusters needed for our 312 different patterns) are potentially rather regular instances of sound correspondences. We can now finally export these results to a text file that presents all patterns in spreadsheet form. Furthermore, we can export the wordlist format of the data, in which the patterns are given in a specific column, so that the data can be displayed in EDICTOR. For this, we first call the `add_patterns` function.

In [10]:
cop.write_patterns('../data/S10-patterns-out.tsv')
cop.add_patterns(ref='patterns')
cop.output('tsv', filename='../data/S10-wl-with-patterns')

The following table shows the basic structure of our file `S10-patterns-out.tsv`:

ID | FREQUENCY | Danish | Dutch | English | German | Icelandic | Norwegian | Swedish | COGIDS | CONCEPTS
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | 4 | g | x | ɡ | ɡ | k | g | g | 13:0, 56:0, 57:0, 150:0 | big / good / green / walk (V)
2 | 4 | g | - | - | - | k | k | k | 2:2, 4:3, 51:3, 180:3 | ashes / bark / fish / worm
3 | 2 | g | Ø | Ø | Ø | c | Ø | - | 28:1, 117:1 | cloud / skin
4 | 1 | g | k | k | k | k | Ø | kː | 36:4 | drink (V)
5 | 1 | g | Ø | Ø | Ø | c | k | kː | 99:1 | not
6 | 1 | g | - | ɡ | - | kː | g | gː | 44:1 | egg
7 | 1 | g | x | j | ɡ | k | g | g | 161:0 | yellow
8 | 1 | g | x | ɡ | ɡ | c | j | j | 55:0 | give (V)
9 | 1 | g | - | - | - | - | k | - | 101:5 | person
10 | 1 | g | ɣ | - | ɡ | cː | g | gː | 74:2 | lie (V)

What we can see from the table is that it lists all regular and irregular correspondence patterns inferred by the algorithm, with one column per language. To facilitate inspection, entries are sorted according to the first language in the list, but if users select one language as their "proto-language", this language will be used for sorting. In our case, we can see that there is good evidence for the first two patterns, corresponding to 8 cognate sets, while the rest of the alignment sites and patterns are less convincing, given that they contain many gaps or occur only once. This tutorial is not the place to discuss the nature of these findings (whether they are artifacts of the data or reflect interesting aspects of sound change), but we can easily see that patterns occur with different frequencies in our data.
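For further inspection, the exported patterns can be filtered by frequency with the standard library alone. The sketch below assumes that the file is tab-separated, that its header matches the column names shown in the table, and that the entries in `COGIDS` are separated by a comma and a space. It embeds the first four rows of the table as sample data; the helper `recurring` is hypothetical.

```python
import csv
import io

# First four rows of the table above, stored tab-separated (assumed layout).
SAMPLE = (
    "ID\tFREQUENCY\tDanish\tDutch\tEnglish\tGerman\tIcelandic\tNorwegian\tSwedish\tCOGIDS\tCONCEPTS\n"
    "1\t4\tg\tx\tɡ\tɡ\tk\tg\tg\t13:0, 56:0, 57:0, 150:0\tbig / good / green / walk (V)\n"
    "2\t4\tg\t-\t-\t-\tk\tk\tk\t2:2, 4:3, 51:3, 180:3\tashes / bark / fish / worm\n"
    "3\t2\tg\tØ\tØ\tØ\tc\tØ\t-\t28:1, 117:1\tcloud / skin\n"
    "4\t1\tg\tk\tk\tk\tk\tØ\tkː\t36:4\tdrink (V)\n"
)


def recurring(rows, min_frequency=2):
    """Keep the patterns attested in at least `min_frequency` alignment sites."""
    return [row for row in rows if int(row["FREQUENCY"]) >= min_frequency]


# For a real file, replace io.StringIO(SAMPLE) with
# open(path, encoding="utf-8", newline="").
rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
for row in recurring(rows):
    sites = row["COGIDS"].split(", ")
    print(row["ID"], row["FREQUENCY"], sites)
```

Filtering with a minimum frequency of 2 keeps the first three patterns of the sample and drops the pattern that is attested only once.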

### 3.2 Correspondence Pattern Analysis with Partial Cognates

For the illustration of correspondence pattern detection with partial cognates, we will use the file `S10-BAI.tsv`, a file showing examples from the Bai dialects. This file has been very carefully segmented morphologically. Note that without the explicit segmentation, the `CoPaR` method usually throws an error. It is therefore very important to check that the data is carefully segmented and aligned, especially when analysing data from manual analyses.

We start again by preparing the data, but this time, we carry out an automatic cognate detection analysis first, since our data does not contain information on cognacy so far.


In [11]:
# load partial cognate detection tool
from lingpy.compare.partial import Partial
part = Partial('../data/S10-BAI.tsv')
part.partial_cluster(method='sca', cluster_method='upgma', threshold=0.45, ref='cogids')

# pass to alignment algorithm and align the data
alms = Alignments(part, ref='cogids', fuzzy=True)
alms.align()

                                                                              

We can now carry out the correspondence detection method. This time, we will give fewer explanations (users can refer back to the correspondence pattern detection for full cognates for questions on the details). In contrast to the previous example, however, we will now restrict the analysis to consonants by adding "cv"-structures to the data instead of "c"-structures, in which each segment is represented with a "c".

In [12]:
cop = CoPaR(alms, ref='cogids', segments='tokens')
cop.add_structure(model='cv', structure='structure')
patterns, sites = cop.get_sites(structure='structure', pos='c')
patterns = cop.sort_patterns()
print(len(patterns), len([p for p in patterns.values() if len(p) == 2]))

                                                              

265 39




We can see that there are fewer patterns in the data than we observed for the Germanic languages. This is due less to a lower diversity of the dialects (although this is also evident) than to the fact that we now restrict the analysis to consonants, ignoring all vowels.
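The effect of restricting the analysis to consonants can be illustrated with a minimal sketch in plain Python. It does not reproduce what `add_structure` or `get_sites` do internally: the vowel inventory `VOWELS`, the helper names `cv_label` and `consonant_columns`, and the toy alignment are all assumptions made for illustration.

```python
# Hypothetical vowel inventory (an assumption for this sketch); a real
# analysis would derive the labels from a sound-class model.
VOWELS = set("aeiouyɛɔəɯɑæø")


def cv_label(segment):
    """Label a segment as consonant ('c'), vowel ('v'), or gap ('-')."""
    if segment == "-":
        return "-"
    return "v" if segment[0] in VOWELS else "c"


def consonant_columns(alignment):
    """Return the indices of columns whose non-gap segments are all consonants."""
    columns = []
    for idx in range(len(alignment[0])):
        labels = {cv_label(row[idx]) for row in alignment} - {"-"}
        if labels == {"c"}:
            columns.append(idx)
    return columns


# Toy alignment (invented data): three words, three alignment sites.
alignment = [
    ["n", "a", "-"],
    ["ŋ", "a", "n"],
    ["n", "ɔ", "-"],
]
print(consonant_columns(alignment))  # [0, 2]
```

Only the first and the last column survive the filter, which is why a consonant-only analysis yields fewer alignment sites and therefore fewer patterns.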

In the last step, we follow the same approach as for full cognates in clustering the patterns fully and writing them to file.

In [13]:
cop.cluster_patterns()
print('clustered 1:', len(cop.clusters), len([p for p in cop.clusters.values() if len(p) > 1]))
cop.sites_to_pattern()
cop.refine_sites()
cop.write_patterns('../data/S10-pat-bai.tsv')
cop.add_patterns(ref='patterns')
cop.output('tsv', filename='../data/S10-bai-with-patterns')
print('clustered 2:', len(cop.clusters), len([p for p in cop.clusters.values() if len(p) > 1]))

                                                                              

clustered 1: 156 72
clustered 2: 155 62




The final algorithmic enhancement has again led to a reduction of our clusters with more than one member (from 72 to 62).
A look at the data, however, shows that we are facing a considerable degree of diversity here, as some counterparts of *n* in Dashi Bai reveal.

ID | FREQUENCY | Dashi | Ega | Enqi | Gongxing | Jinman | Jinxing | Mazhelong | Tuolo | Zhoucheng | COGIDS | CONCEPTS
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | 11 | n | - | - | n | - | n | n | n | - | 126:3, 176:2, 274:3, 282:2, 313:2, 332:3, 353:2, 383:2, 448:2, 605:3, 614:2 | ear / heart / moon / near / new / nose / person / round / sun / wind / woman
2 | 4 | n | ŋ | Ø | n | Ø | - | n | n | - | 73:2, 413:0, 537:2, 589:2 | burn / sleep / warm / who
3 | 3 | n | n | n | n | n | n | n | n | n | 484:0, 502:0, 510:0 | that / this / thou
4 | 2 | n | ŋ | ŋ | ŋ | ŋ | Ø | Ø | ŋ | ŋ | 84:0, 145:0 | cloud / fish
5 | 1 | n | Ø | Ø | Ø | Ø | n | n | - | - | 138:3 | far
6 | 1 | n | - | - | n | - | n | - | n | - | 132:2 | egg
7 | 1 | n | - | - | - | - | n | n | n | - | 423:2 | smoke
8 | 1 | n | - | Ø | ŋ | Ø | n | - | Ø | Ø | 110:2 | ear
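To summarize such a table, one can count for each dialect how often each counterpart of Dashi *n* occurs, weighted by the frequency of the pattern. The following is a minimal sketch that works on the first four rows of the table above, pasted as text. The function name `reflex_counts` is hypothetical, and the symbols `-` and `Ø` are simply counted as they appear.

```python
from collections import Counter, defaultdict

# First four rows of the table above (COGIDS and CONCEPTS columns omitted).
TABLE = """\
ID | FREQUENCY | Dashi | Ega | Enqi | Gongxing | Jinman | Jinxing | Mazhelong | Tuolo | Zhoucheng
1 | 11 | n | - | - | n | - | n | n | n | -
2 | 4 | n | ŋ | Ø | n | Ø | - | n | n | -
3 | 3 | n | n | n | n | n | n | n | n | n
4 | 2 | n | ŋ | ŋ | ŋ | ŋ | Ø | Ø | ŋ | ŋ
"""


def reflex_counts(table, source="Dashi", sound="n"):
    """Count the counterparts of `sound` in `source` for all other languages."""
    lines = [line.split(" | ") for line in table.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    counts = defaultdict(Counter)
    for row in rows:
        record = dict(zip(header, row))
        if record[source] != sound:
            continue
        # The first two columns are ID and FREQUENCY, the rest are languages.
        for language in header[2:]:
            if language != source:
                counts[language][record[language]] += int(record["FREQUENCY"])
    return counts


counts = reflex_counts(TABLE)
print(counts["Ega"].most_common())
```

For these four rows, the counterparts in Ega are split between `-`, *ŋ*, and *n*, which illustrates the diversity mentioned above.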


## Correspondence Pattern Inspection with EDICTOR

EDICTOR offers two ways to inspect correspondence patterns, both of which are provided in the `ANALYZE->Correspondence Patterns` panel of the tool. For the first analysis, EDICTOR employs the simple sorting algorithm to arrive at a rough clustering of the data. This is convenient for users interested in learning how frequently correspondence patterns recur in their data. All that is needed to use this module is a column with the cognate identifiers (`COGIDS`), a column with segments (`TOKENS`), the language (`DOCULECT`) and the concept (`CONCEPT`), and, of course, the (correct) alignment (`ALIGNMENT`). Before loading the panel in the EDICTOR tool, a popup will ask the user to indicate the preference (full or partial cognates). Afterwards, the window displays all cognate sets in the data arranged by compatible alignment sites. The following screenshot is just an example for the Germanic data in which the preview is restricted to 10 rows:

![img](img/s10-fig9.png)

There are now several possibilities to analyze the data further. Users can change the threshold of reflexes that must be included in the data. After selecting, one needs to press the "REFRESH" button to make sure that the tool reloads the data. Users can also select specific patterns according to their basic characteristics, which are derived from the sounds of the first language in the sample (again: if a proto-language is in the sample, users can specify this and will be given a convenient way to browse their data). Clicking on the cells in the COGNATES column will open the corresponding alignment, and clicking on the individual cells for each language will show the whole word in question. Finally, the SIZE column indicates roughly how "filled" a column is, although this score is only sub-optimal for the time being.
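The idea of a reflex threshold and of a "fill" score can be sketched in a few lines of Python. This is not how EDICTOR computes its SIZE column. It is a simplified illustration that assumes `Ø` marks a missing reflex while `-` counts as an attested gap, and the names `fill_ratio` and `filter_patterns` are hypothetical.

```python
MISSING = "Ø"  # assumption: this symbol marks a missing reflex


def fill_ratio(pattern):
    """Proportion of languages with an attested reflex in the pattern."""
    return sum(1 for sound in pattern if sound != MISSING) / len(pattern)


def filter_patterns(patterns, min_reflexes):
    """Keep only the patterns with at least `min_reflexes` attested reflexes."""
    return {
        key: pattern
        for key, pattern in patterns.items()
        if sum(1 for sound in pattern if sound != MISSING) >= min_reflexes
    }


# Patterns 3, 4, and 5 from the Bai table shown earlier in this tutorial.
patterns = {
    3: ["n", "n", "n", "n", "n", "n", "n", "n", "n"],
    4: ["n", "ŋ", "ŋ", "ŋ", "ŋ", "Ø", "Ø", "ŋ", "ŋ"],
    5: ["n", "Ø", "Ø", "Ø", "Ø", "n", "n", "-", "-"],
}
print(sorted(filter_patterns(patterns, min_reflexes=6)))  # [3, 4]
print(round(fill_ratio(patterns[4]), 2))  # 0.78
```

Raising the threshold hides sparsely attested patterns such as pattern 5, which makes the well-supported ones easier to inspect.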

A probably even more useful way to inspect correspondence patterns is to go through the analyses made by the Python algorithm, since these explore the data much more exhaustively. In order to do so, one only needs a file with the data fields (columns) indicated above and an extra column that should be called `PATTERNS`.
Since the format of this column is not straightforward to edit manually, the file should be created with LingPy, as shown above. If this file is loaded and the patterns view is selected, EDICTOR will display the patterns inferred by the algorithm rather than inferring them itself. The advantage is a much more refined view of correspondence patterns that makes it very convenient for scholars to check how convincing their alignments and cognates are for a given dataset.
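Before loading a file into EDICTOR, it can be useful to check that all required columns are present. The following is a minimal sketch using only the Python standard library. The function name `missing_columns` is hypothetical, and the sketch assumes a tab-separated file whose first line that is neither empty nor starts with `#` is the header.

```python
import csv
import io

# Columns named in the text above; PATTERNS is only needed for the second view.
REQUIRED = ["DOCULECT", "CONCEPT", "TOKENS", "COGIDS", "ALIGNMENT"]


def missing_columns(handle, with_patterns=True):
    """Return the required column names that are absent from the header."""
    # Skip empty lines and comment lines before the header (an assumption
    # about the file layout, see the lead-in).
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    header = [name.strip().upper() for name in next(csv.reader(rows, delimiter="\t"))]
    needed = REQUIRED + (["PATTERNS"] if with_patterns else [])
    return [name for name in needed if name not in header]


# Toy header without a PATTERNS column (invented for this example).
sample = io.StringIO("# comment\nID\tDOCULECT\tCONCEPT\tTOKENS\tCOGIDS\tALIGNMENT\n")
print(missing_columns(sample))  # ['PATTERNS']
```

For a file on disk, the same function can be called on an opened file object, for example `open(path, encoding="utf-8")`.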


## Open Questions

There are a couple of open questions:

* how can the algorithm be refined?
* could the algorithm for correspondence pattern recognition be used on other types of data, such as structural features, e.g., to identify instances of features that evolve together?
* how can we conveniently modify EDICTOR to allow users to re-define patterns so that they can annotate data for correspondence patterns without relying on the computer?