# Session 3: Concepticon (Johann-Mattis List, Tiago Tresoldi, and Mei-Shin Wu)



## 1 Reference Catalogs

While we often think of linguistic data in terms of a datapoint tied to a specific language variety or a specific language family, many aspects of linguistic data can be generalized. The most prominent example is language names: the names which scholars give to the language varieties they study often vary (for socio-cultural reasons, because of the traditions of specific scholarly fields, etc.), and this variation is handled differently from study to study and from dataset to dataset. Given that the first and foremost goal of computer-assisted approaches to historical language comparison is to make sure that data *is* comparable across applications, it is important to have some kind of authority that can be used in order to reference a given language variety. Since most linguists are anarchists deep down in their hearts, it is easy to understand that following some authority is not the first thing they think of when creating and curating cross-linguistic data that includes different language varieties. If a certain name is common for the variety they work on in their specific field of study, but the authorities give it another name which follows the conventions of other fields, it is obvious that they would rather use the name that is most appropriate for their peers than the name that most other scholars use.

This is where reference catalogs for language varieties come into play. A reference catalog of language names, geolocations, and expert classifications like the one by the [Glottolog](http://glottolog.org) project offers (a) unique identifiers for language names, (b) a release and update strategy to cope with errors in the data, and its expansions, and (c) exhaustive metadata (like references, classifications, etc.) linked to each of the unique identifiers. While scholars don't need to change the actual names they use to denote their language varieties in their work, all they need to do is to provide a mapping of the names they use and the identifiers provided by Glottolog in order to render their data comparable. But in contrast to what some scholars think, mapping language names in a given dataset to Glottolog does not result in unidirectional profit (from the scholar to the community), but instead enables those scholars who link their data to employ abundant amounts of metadata (references, classifications, geolocations) which they won't need to assemble themselves, since they are already there.

This means that if scholars link their data to reference catalogs like Glottolog, both sides profit: the community can easily find out which studies were devoted to which language varieties, while scholars who want to assemble linguistic data (mostly typologists and historical linguists) can profit from the groundwork on assigning metadata to languages that has already been done by the Glottolog editorial board.
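The mapping described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical dataset that uses its own language names; the function name is made up for this example, and the glottocodes are given for illustration and should be checked against the current Glottolog release.

```python
# Names as used in a (hypothetical) dataset, mapped to Glottolog identifiers.
# The glottocodes are illustrative; verify them against the current release.
NAME_TO_GLOTTOCODE = {
    "German": "stan1295",
    "Deutsch": "stan1295",   # a different name for the same variety
    "English": "stan1293",
    "Mandarin": "mand1415",
}


def link_languages(names):
    """Return (linked, unlinked): names with a glottocode and names without one."""
    linked = {n: NAME_TO_GLOTTOCODE[n] for n in names if n in NAME_TO_GLOTTOCODE}
    unlinked = [n for n in names if n not in NAME_TO_GLOTTOCODE]
    return linked, unlinked


linked, unlinked = link_languages(["Deutsch", "German", "Klingon"])
print(linked)    # both names resolve to the same identifier
print(unlinked)  # names that still need a manual decision
```

The original names stay untouched in the data; only the mapping table is added, and it is this table that makes two datasets comparable.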

The idea of reference catalogs can be further expanded to other kinds of linguistic data. Concepts, for example, are notoriously difficult to define, and yet scholars usually try to define them, as a lot of research is based on questionnaires where scholars assemble a list of concepts in some elicitation language and then go out to the field to ask their informants how they express those concepts. Since questionnaires differ a lot, not only regarding the languages used for elicitation, but also regarding the specific conventions that scholars use to avoid the fuzziness of words in the elicitation language (e.g., the distinction between noun and verb in words like *hand*), a catalog of concepts can help to render different studies devoted to lexical data comparable.

Another example are phonetic transcriptions: Scholars often use different transcription systems, at times even customized variants, for several reasons like ease of typing, phonological considerations, or traditions in their field, but even if scholars use a system like the one proposed by the International Phonetic Alphabet, they may use it in slightly different ways. As a result, phonetic transcriptions are not necessarily directly comparable across datasets and studies. A reference catalog in which scholars provide their original transcriptions in a standardized form would again increase the comparability across datasets, but it would also be helpful for the scholars, as they could use such a reference catalog to receive information on additional aspects of sounds, such as common feature systems, general frequency distributions, etc.

## 2 Background on Concept Lists

In [1950](http://bibliography.lingpy.org?key=Swadesh1950), Morris Swadesh (1909 – 1967) proposed the idea that certain parts of the lexicon of human languages are universal, stable over time, and rather resistant to borrowing. As a result, he claimed that this part of the lexicon, which was later called basic vocabulary, would be very useful to address the problem of subgrouping in historical linguistics:

> [...] it is a well known fact that certain types of
> morphemes are relatively stable. Pronouns and
> numerals, for example, are occasionally replaced
> either by other forms from the same language or
> by borrowed elements, but such replacement is
> rare. The same is more or less true of other everyday expressions connected with concepts and
> experiences common to all human groups or to
> the groups living in a given part of the world during a given epoch. (Swadesh, 1950, 157)

He illustrated this by proposing a first list of basic concepts,
which was, in fact, nothing else than a collection of concept
labels, as shown below:

> I, thou, he, we, ye, one, two, three, four, five,
> six, seven, eight, nine, ten, hundred, all, animal,
> ashes, back, bad, bark, belly, big, [...] this,
> tongue, tooth, tree, warm, water, what, where,
> white, who, wife, wind, woman, year, yellow.
> (Swadesh, 1950, 161)

In the following years, Swadesh refined his original concept
lists of basic vocabulary items, thereby reducing the original test list of 215 items first to 200 ([Swadesh, 1952](http://bibliography.lingpy.org?key=Swadesh1952)) and
then to 100 items ([Swadesh, 1955](http://bibliography.lingpy.org?key=Swadesh1955)). Scholars working on
different language families and different datasets provided
further modifications, be it that the concepts which Swadesh
had proposed were lacking proper translational equivalents
in the languages they were working on, or that they turned
out to be not as stable and universal as Swadesh had claimed
([Matisoff, 1978](http://bibliography.lingpy.org?key=Matisoff1978); [Alpher and Nash, 1999](http://bibliography.lingpy.org?key=Alpher1999)). Up to today,
dozens of different concept lists have been compiled for various purposes. They are used as heuristic tools for the detection of deep genetic relationships among languages ([Dolgopolsky, 1964](http://bibliography.lingpy.org?key=Dolgopolsky1964)), as basic values for traditional lexicostatistical and glottochronological studies 
([Dyen et al., 1992](http://bibliography.lingpy.org?key=Dyen1992);
[Starostin, 1991](http://bibliography.lingpy.org?key=Starostin1991)), or as a litmus test for dubious cases of language relationship which might be due to inheritance or borrowing ([McMahon et al., 2005](http://bibliography.lingpy.org?key=McMahon2005a); 
[Chén Bǎoyà 陈保亚, 1996](http://bibliography.lingpy.org?key=Chen1996);
[Wang and Wang, 2004](http://bibliography.lingpy.org?key=Wang2004)).

Apart from concept lists proposed for the application in
historical linguistics, there is a large amount of not explicitly diachronic data, including concept lists serving as the
basis for field work in specific linguistic areas ([Kraft, 1981](http://bibliography.lingpy.org?key=Kraft1981)),
concept lists which serve as the basis for large surveys on
specific linguistic phenomena ([Haspelmath and Tadmor, 2009](http://bibliography.lingpy.org?key=Haspelmath2009)), or concept lists which deal with the internal structuring of concepts, be it cognitive associations 
([Nelson et al., 2004](http://bibliography.lingpy.org?key=Nelson2004); [Hill et al., 2014](http://bibliography.lingpy.org?key=Hill2014)), cross-linguistic polysemies 
([List et al., 2014](http://bibliography.lingpy.org?key=List2014f)), or frequently recurring semantic shifts 
([Bulakh et al., 2013](http://bibliography.lingpy.org?key=Bulakh2013)). Concept lists also play an important role in
education, where they are used to measure and aid learners’ progress ([Dolch, 1936](http://bibliography.lingpy.org?key=Dolch1936)), in psycholinguistics, where different kinds of word norm data, like frequency and concreteness, are needed to control for variables in experiments
([Wilson, 1988](http://bibliography.lingpy.org?key=Wilson1988)), and in public health studies, where standardized
naming tests are used to assess the degree of aphasia
or language disturbance ([Nicholas et al., 1989](http://bibliography.lingpy.org?key=Nicholas1989); 
[Ardila, 2007](http://bibliography.lingpy.org?key=Ardila2007)).

In brief: a large number of different concept lists have been used, and are still actively being used, in different branches of science related to language. These lists are not directly comparable, however, as scholars do not normalize the way they name concepts in their elicitation glosses.

## 3 Linking Concept Lists within the Concepticon Project

The basic idea of the Concepticon reference catalog is to provide consistent links across the multitude of lexical questionnaires that linguists have been using to elicit words, be it during field work, for
the purpose of establishing new datasets from published resources, or as a starting point for typological or
historical language comparison. One major problem with the way people construct their questionnaires is
that they have never been standardized across datasets, but were only considered to be applicable within one
application. The Concepticon project tries to make up for this by linking the glosses that scholars use to
elicit a certain meaning to concept sets which are themselves linked to additional metadata, such as a short
definition, a rough semantic field, ontological categories (reflecting the more language-specific notion of
part of speech), as well as additional metadata derived from norm datasets in psycholinguistics and natural language processing, including age-of-acquisition information for individual languages ([Kuperman et al. 2012](http://bibliography.lingpy.org?key=Kuperman2012)), ontologies like WordNet ([Princeton University 2010](http://bibliography.lingpy.org?key=Wordnet2010)), or word frequency counts, again for individual languages
([Brysbaert and New 2009](http://bibliography.lingpy.org?key=Brysbaert2009)).

In addition to the metadata, Concepticon concept sets can further be linked among each other with the help
of a simplifying ontology that identifies concept sets which are broader or narrower with respect to their
denotation range. The concept set [ARM OR HAND](http://concepticon.clld.org/parameters/2121) for example, 
which is best represented by the Russian
word *ruká*, insofar as it typically refers to the whole upper limb but also to the part which other languages
denote as hand, is considered broader than
[ARM](http://concepticon.clld.org/parameters/1673) and
[HAND](http://concepticon.clld.org/parameters/1277).
While scholars might object to this procedure, preferring to represent a comparative concept reflecting the semantics of Russian *ruká* by linking
it to both ARM
and HAND, it is important to emphasize that this practice, which may seem counterintuitive
from the perspective of a given language, is indispensable to guarantee a rigorous mapping of word elicitation glosses in questionnaires to lexical comparative concepts. If a given questionnaire contains the gloss
*arm/hand* (as we can find across many questionnaires which have been used to assemble a large number
of data points) and we linked it to both ARM and HAND, we would lose the essential information
that the original questionnaire was asking for the word expressing the concept that covers both concept sets
in a single term. Since the ontology allows us to derive the information that ARM OR HAND is semantically broader than ARM and HAND, we could automatically turn the Concepticon data into a form where
the elicitation gloss *arm/hand* links to both narrower concept sets, but we could not get back to the more
rigorous representation.
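The one-way nature of this derivation can be illustrated with a small sketch. The dictionaries and the function name below are hypothetical and only model the ARM OR HAND example from the text; they do not reflect how the Concepticon data is stored.

```python
# Broader concept set -> narrower concept sets (only the example from the text).
NARROWER = {"ARM OR HAND": ["ARM", "HAND"]}

# Rigorous mapping: each elicitation gloss links to exactly one concept set.
rigorous = {"arm/hand": "ARM OR HAND", "hand": "HAND", "arm": "ARM"}


def expand(mapping, narrower):
    """Derive the looser representation: a gloss links to all narrower sets."""
    return {
        gloss: sorted(narrower.get(concept_set, [concept_set]))
        for gloss, concept_set in mapping.items()
    }


loose = expand(rigorous, NARROWER)
print(loose["arm/hand"])  # ['ARM', 'HAND']
print(loose["hand"])      # ['HAND']
```

Going from `rigorous` to `loose` is mechanical, but `loose` alone no longer tells us whether the questionnaire asked for a single term covering both meanings, which is why the broader concept set is kept as the primary link.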

To inspect how the Concepticon organizes this, let us look at a concrete example, namely the elicitation glosses linked to the concept set [FAT (ORGANIC SUBSTANCE)](http://concepticon.clld.org/parameters/323). 

The immediate advantage of linking data to the Concepticon is that it enables the merging of data from
different sources quickly. Instead of working with ad-hoc solutions that would provide links between a
range of different resources, it is much quicker to link different questionnaires to the Concepticon (leaving
out those concepts which cannot be found, or adding them as new concepts). 


## 4 Design Principles of Concept List Linking

### 4.1 The Base File: `concepticon.tsv`

| ID  | GLOSS        | SEMANTICFIELD       | DEFINITION                                                                                                       | ONTOLOGICAL_CATEGORY |
|-----|--------------|---------------------|------------------------------------------------------------------------------------------------------------------|----------------------|
| 1   | CONTEMPTIBLE | Emotions and values | Deserving of contempt or scorn.                                                                                  | Property             |
| 2   | DUST         | The physical world  | Any kind of solid material divided in particles of very small size.                                              | Person/Thing         |
| 3   | BRAVE        | Emotions and values | Having or characterized by courage.                                                                              | Property             |
| 4   | COURTYARD    | The house           | An area wholly or partly surrounded by walls or buildings.                                                       | Person/Thing         |
| 5   | GAZELLE      | Animals             | An antelope of the genus Gazella mostly native to Africa and capable of running at high speeds for long periods. | Person/Thing         |
| ... | ...          | ...                 | ...                                                                                                              | ...                  |

### 4.2 Information on Concept Lists: `conceptlists.tsv`

| ID                | AUTHOR                                                       | YEAR | ITEMS | SOURCE_LANGUAGE | TARGET_LANGUAGE         | REFS         | NOTE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|-------------------|--------------------------------------------------------------|------|-------|-----------------|-------------------------|--------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Sun-1991-1004     | Sun, Hongkai et al.                                          | 1991 | 1004  | English         | Tibeto-Burman languages | Sun1991      | This concept list originally served as a questionnaire for a large-scale investigation on Tibeto-Burman languages. The original questionnaire was in Chinese with no English translation. Later the STEDT project (:bib:Matisoff2015) digitized the data, translating Chinese concept labels to English, but not listing the original Chinese forms. As a result, some concept labels are identical, although they are different in the Chinese version. We are still trying to add the original Chinese concept labels to this resource, but for the moment, we only link the STEDT version, occasionally adding Chinese concept labels, where we figure they are important to distinguish the original meaning. |
| Robinson-2012-398 | Robinson, Laura C and Holton, Gary                           | 2012 | 398   | English         | Alor-Pantar languages   | Robinson2012 | The authors inform that the list is built upon the about 400 items of [Holton (2012)](:bib:Holton2012), but this source did not submit the dataset                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| Heggarty-2017-200 | Heggarty, Paul and Anderson, Cormac and Scarborough, Matthew | 2017 | 200   | English         | Indo-European languages | Heggarty2017 | This is the basic list of 200 items which were selected for the Cognates in Basic Lexicon project. The compilers selected the items following various different criteria, such as ease of elicitation, non-fuzziness of the meaning, representation in Indo-European languages, and more. The numbers are at times higher than 200. This results from the earlier selection which contained more concepts which were no longer used in the official version |
| ...               | ...                                                          | ...  | ...   | ...             | ...                     | ...          | ...                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |

### 4.3 Concept Set Relations: `conceptrelations.tsv`

| SOURCE | SOURCE_GLOSS          | RELATION | TARGET | TARGET_GLOSS             |
|--------|-----------------------|----------|--------|--------------------------|
| 559    | BROTHER (OF MAN)      | narrower | 2414   | OLDER BROTHER (OF MAN)   |
| 1759   | OLDER BROTHER         | narrower | 2414   | OLDER BROTHER (OF MAN)   |
| 2383   | AUNT OR MOTHER-IN-LAW | narrower | 1272   | AUNT                     |
| 2383   | AUNT OR MOTHER-IN-LAW | narrower | 2256   | MOTHER-IN-LAW (OF WOMAN) |
| 1212   | WE                    | hasform  | 2310   | US (OBLIQUE CASE OF WE)  |
| ...    | ...                   | ...      | ...    | ...                      |
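As a rough sketch of how such a relation table can be queried, the rows shown above can be indexed so that the broader concept sets of a given concept set are easy to look up. Judging from the glosses, a `narrower` row is read here as "SOURCE has the narrower concept set TARGET"; this reading, and the variable names, are assumptions made for the example.

```python
from collections import defaultdict

# Rows from the table above as (SOURCE, RELATION, TARGET).
RELATIONS = [
    ("559", "narrower", "2414"),
    ("1759", "narrower", "2414"),
    ("2383", "narrower", "1272"),
    ("2383", "narrower", "2256"),
    ("1212", "hasform", "2310"),
]

# Index: narrower concept set -> all concept sets that are broader than it.
broader_of = defaultdict(set)
for source, relation, target in RELATIONS:
    if relation == "narrower":
        broader_of[target].add(source)

# OLDER BROTHER (OF MAN) is narrower than both BROTHER (OF MAN) and OLDER BROTHER.
print(sorted(broader_of["2414"]))  # ['1759', '559']
```

Other relation types, such as `hasform`, are skipped in this index because they do not express a broader/narrower hierarchy.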

### 4.4 Concrete Concept Lists in `conceptlists/`

| ID                | CONCEPTICON_ID | CONCEPTICON_GLOSS | NUMBER | CHINESE | ENGLISH  |
|-------------------|----------------|-------------------|--------|---------|----------|
| Sun-1991-1004-79  | 2847           | APRIL             | 79     | 四月    | April    |
| Sun-1991-1004-83  | 2851           | AUGUST            | 83     | 八月    | August   |
| Sun-1991-1004-315 |                |                   | 315    | 汉族    | Chinese  |
| Sun-1991-1004-87  | 2855           | DECEMBER          | 87     | 十二月  | December |
| Sun-1991-1004-77  | 2037           | FEBRUARY          | 77     | 二      | February |
| Sun-1991-1004-969 | 1209           | I                 | 969    | 我      | I        |
| ...               | ...            | ...               | ...    | ...     | ...      |
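Because unlinked items simply have empty `CONCEPTICON_ID` and `CONCEPTICON_GLOSS` cells, it is easy to check how much of a list has been linked. The following sketch uses only Python's standard library and a few rows copied from the table above; the variable names are chosen for this example.

```python
import csv
import io

# Rows copied from the table above (the real files are tab-separated).
SAMPLE = (
    "ID\tCONCEPTICON_ID\tCONCEPTICON_GLOSS\tNUMBER\tCHINESE\tENGLISH\n"
    "Sun-1991-1004-79\t2847\tAPRIL\t79\t四月\tApril\n"
    "Sun-1991-1004-83\t2851\tAUGUST\t83\t八月\tAugust\n"
    "Sun-1991-1004-315\t\t\t315\t汉族\tChinese\n"
    "Sun-1991-1004-969\t1209\tI\t969\t我\tI\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))

# Items without a Concepticon link have an empty CONCEPTICON_ID cell.
unlinked = [row["ENGLISH"] for row in rows if not row["CONCEPTICON_ID"]]
coverage = (len(rows) - len(unlinked)) / len(rows)

print(unlinked)           # ['Chinese']
print(f"{coverage:.0%}")  # 75%
```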


## 5 Fun with Concept Sets: `pyconcepticon` API


To make sure that your data is comparable in terms of the concepts that you investigated, you should link your questionnaire to the Concepticon ([List et al. 2016](http://bibliography.lingpy.org?key=List2016a)). Many scholars still have a huge problem in understanding what the Concepticon actually is. We won't go into the details here, but if you are interested in selecting comparable questionnaires (e.g., words less prone to borrowing) for your language sample, you should definitely have a close look at the Concepticon website at http://concepticon.clld.org, since it is highly likely that your specific questionnaire has already been linked. In this case, you should download the concept list in the form in which it is provided by the Concepticon project, as this will spare you the time of typing it off yourself (which may introduce new errors), and you will get a lot of meta-information which may be useful. For example, if you download the Leipzig-Jakarta list ([Tadmor 2009](http://bibliography.lingpy.org?key=Tadmor2009), [Tadmor-2009-100](http://concepticon.clld.org/contributions/Tadmor-2009-100)), you may first learn a lot about how it was constructed, but you can also directly compare it with lists that may be similar. If you want to know how stable the concepts in this list are, for example, you could have a look at the basic list underlying the original project ([Haspelmath-2009-1460](http://concepticon.clld.org/contributions/Haspelmath-2009-1460)), where you will receive explicit ranks for all concepts.

Note that for the following illustration, we assume that you have a local copy of the [calc-seminar](https://github.com/digling/calc-seminar) folder on your computer, ideally by typing

```
$ git clone https://github.com/digling/calc-seminar.git
```

in your terminal in your preferred location. We also assume that you `cd`ed into the [data folder](https://github.com/digling/calc-seminar/tree/master/data) in the repository by typing:

```
$ cd PATH_TO/calc-seminar/data
```

This will give you direct access to all data example files that we will be using in this session.


### 5.1 Comparing Concept Lists

If you want to check the overlap between the Leipzig-Jakarta list and Swadesh's ([1955](http://bibliography.lingpy.org?key=Swadesh1955)) list of 100 items, you can use the Concepticon API, querying for the intersection of both lists:

```shell
$ concepticon intersection Tadmor-2009-100 Swadesh-1955-100
  1   ARM OR HAND            [2121] HAND (1, Swadesh-1955-100)
  2   ASH                    [646 ] 
  3   BIG                    [1202] 
  4   BIRD                   [937 ] 
  5   BITE                   [1403] 
  6   BLACK                  [163 ] 
  7   BLOOD                  [946 ] 
  8   BONE                   [1394] 
  9   BREAST                 [1402] 
 10   BURN                   [2102] BURNING (1, Tadmor-2009-100)
 11   COME                   [1446] 
 12   DOG                    [2009] 
 13   DRINK                  [1401] 
 14   EAR                    [1247] 
 15   EARTH (SOIL)           [1228] 
 16   EAT                    [1336] 
 17   EGG                    [744 ] 
 18   EYE                    [1248] 
 19   FIRE                   [221 ] 
 20   FISH                   [227 ] 
 21   FLESH OR MEAT          [2615] 
 22   FLY (MOVE THROUGH AIR) [1441] 
 23   FOOT OR LEG            [2098] FOOT (1, Swadesh-1955-100)
 24   GIVE                   [1447] 
 25   GO                     [695 ] WALK (1, Swadesh-1955-100)
 26   GOOD                   [1035] 
 27   HAIR                   [1040] 
 28   HEAR                   [1408] 
 29   HORN (ANATOMY)         [1393] 
 30   I                      [1209] 
 31   KNEE                   [1371] 
 32   KNOW (SOMETHING)       [1410] 
 33   LEAF                   [628 ] 
 34   LIVER                  [1224] 
 35   LONG                   [1203] 
 36   LOUSE                  [1392] 
 37   MOUTH                  [674 ] 
 38   NAME                   [1405] 
 39   NECK                   [1333] 
 40   NEW                    [1231] 
 41   NIGHT                  [1233] 
 42   NOSE                   [1221] 
 43   NOT                    [1240] 
 44   ONE                    [1493] 
 45   RAINING OR RAIN        [2108] RAIN (PRECIPITATION) (1, Tadmor-2009-100)
 46   RED                    [156 ] 
 47   ROOT                   [670 ] 
 48   SAND                   [671 ] 
 49   SAY                    [1458] 
 50   SEE                    [1409] 
 51   SKIN                   [763 ] 
 52   SMALL                  [1246] 
 53   SMOKE (EXHAUST)        [778 ] 
 54   STAND                  [1442] 
 55   STAR                   [1430] 
 56   STONE OR ROCK          [2125] STONE (1, Swadesh-1955-100)
 57   TAIL                   [1220] 
 58   THIS                   [1214] 
 59   THOU                   [1215] 
 60   TONGUE                 [1205] 
 61   TOOTH                  [1380] 
 62 * TREE OR WOOD           [2141] WOOD (1, Tadmor-2009-100), 
                                    TREE (1, Swadesh-1955-100)
 63   WATER                  [948 ] 
 64   WHAT                   [1236] 
 65   WHO                    [1235] 
```

From this output, you can learn that Leipzig-Jakarta lists "arm or hand" as a concept, while Swadesh is more concrete, listing only "hand". You can also learn that Swadesh is not very concrete regarding the concept "rain" where he fails to inform us whether it was intended as a noun or a verb. From the match 62, you can further see that "tree" and "wood" are both judged to be subsets of the meta-concept "tree or wood", and indeed, there are quite a few languages which do not distinguish between the two.

There are more possibilities: The ```concepticon union``` command allows you to calculate the union of different lists, thus allowing you to create your own questionnaires based on different concept lists. By typing the following command in the command line, for example, you can learn that the union of Leipzig-Jakarta and Swadesh's 100-item list comprises 135 concepts:

```shell
$ concepticon union Tadmor-2009-100 Swadesh-1955-100 | wc -l
135
```
And if you add the 200-item list by Swadesh ([1952](http://bibliography.lingpy.org?key=Swadesh1952)), you will see that the union has 222 concepts:

```shell
$ concepticon union Tadmor-2009-100 Swadesh-1955-100 Swadesh-1952-200 | wc -l
222
```
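Conceptually, the two commands correspond to set operations on concept set identifiers, extended by the broader/narrower relations that produce matches such as ARM OR HAND versus HAND in the intersection output. The following sketch illustrates this idea with toy subsets (not the real lists) and a single relation; it is not the implementation used by the command-line tool.

```python
# Toy subsets, not the real lists: concept set IDs taken from the outputs
# shown in this tutorial.
list_a = {"946", "1394", "2009", "2121"}   # BLOOD, BONE, DOG, ARM OR HAND
list_b = {"946", "1394", "2009", "1277"}   # BLOOD, BONE, DOG, HAND

# Broader concept set -> narrower concept sets (one relation, for illustration).
NARROWER = {"2121": {"1277", "1673"}}

exact = list_a & list_b   # concept sets present in both lists
union = list_a | list_b   # concept sets present in at least one list

# Related matches: a broader set in one list, one of its narrower sets in the other.
related = {
    (broad, narrow)
    for broad, narrows in NARROWER.items()
    for narrow in narrows
    if (broad in list_a and narrow in list_b)
    or (broad in list_b and narrow in list_a)
}

print(sorted(exact))  # ['1394', '2009', '946']
print(len(union))     # 5
print(related)        # {('2121', '1277')}
```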

### 5.2 Linking Concept Lists

More importantly, if you want to merge data from different questionnaires or datasets where you do not know to which degree the concepts overlap, you can use the automatic mapping algorithm provided by the Concepticon API to get a first intelligent guess as to which concepts your data contains. This works even across different languages, as we have so far assembled concept labels in quite a few different language varieties which we can use to search for similar concepts. The command is as simple as typing ```concepticon map_concepts <yourconceptlist>``` in your terminal, where you replace ```<yourconceptlist>``` with your filename. We have prepared three files, one in English, one in Chinese, and one in German, all showing the following tabular structure (the following being from the file ```C_concepts.tsv```):

```
NUMBER	ENGLISH
1	word
2	hand
3	eggplant
4	aubergine
5	simpsons (tv series)
```

In order to link this English file to the Concepticon, all we have to do is to type:

```shell
$ concepticon map_concepts C_concepts.tsv
NUMBER	ENGLISH	CONCEPTICON_ID	CONCEPTICON_GLOSS	SIMILARITY
1	word	1599	WORD	2
2	hand	1277	HAND	2
3	eggplant	1146	AUBERGINE	2
4	aubergine	1146	AUBERGINE	4
5	simpsons (tv series)		???	
#	4/5	80%	
```

The output tells us, first, whether the concepts can be linked to Concepticon, and second, it gives us the overall percentage of inferred links. You can see that the mapping algorithm is not based on simple string identity, as it correctly links "eggplant" to the concept set ```AUBERGINE```.

Similarly, we can try to link our file with Chinese concepts, the file ```C_concepts-chinese.tsv```:

```shell
$ concepticon --language=zh map_concepts C_concepts-chinese.tsv
NUMBER	GLOSS	CONCEPTICON_ID	CONCEPTICON_GLOSS	SIMILARITY
1	我	1209	I	2
2	你	1215	THOU	2
3	太陽	1343	SUN	2
4	吃飯		???	
5	月亮	1313	MOON	2
#	4/5	80%	
```

And accordingly also our file ```C_concepts-german.tsv```:

```shell
$ concepticon --language=de map_concepts C_concepts-german.tsv
NUMBER	GLOSS	CONCEPTICON_ID	CONCEPTICON_GLOSS	SIMILARITY
1	Hand	1277	HAND	2
2	Schuh	1381	SHOE	2
3	Fuß	1301	FOOT	2
4	Abend	1629	EVENING	2
5	Sonne	1343	SUN	2
#	5/5	100%	
```

As a final example, let us see what the Concepticon API does if we encounter a "fuzzy" matching:

```bash
$ concepticon map_concepts C_concepts-fuzzy.tsv 
NUMBER	ENGLISH	CONCEPTICON_ID	CONCEPTICON_GLOSS	SIMILARITY
1	word	1599	WORD	2
#<<<				
2	hand / arm	1277	HAND	4
2	hand / arm	1019	RIGHT	4
2	hand / arm	244	LEFT	4
2	hand / arm	1673	ARM	4
2	hand / arm	2121	ARM OR HAND	4
#>>>				
3	eggplant	1146	AUBERGINE	2
4	aubergine	1146	AUBERGINE	4
#<<<				
5	man (male)	1554	MAN	2
5	man (male)	2106	MALE PERSON	2
#>>>				
#	5/5	100%	

```

Here, you can see that the concept labels "hand / arm" and "man (male)" are linked to multiple concept sets. The output further indicates which of those multiple links form a block: The characters "#<<<" in a line indicate the start, and the characters "#>>>" the end. This allows you to conveniently jump from block to block in order to select the best match (or manually add a better match). Note that the final mapping to the Concepticon should NEVER link one concept in your data to two or more concept sets in the Concepticon. The linking to Concepticon is, as a requirement, always *n* to 1, with *n* ideally being 1 as well. 
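If a mapped file is long, it can help to collect the ambiguous blocks programmatically before resolving them by hand. The sketch below assumes that the marker lines start with `#<<<` and `#>>>` and that the columns are tab-separated, as in the output above; the function name is made up for this example.

```python
# A shortened copy of the output above (tab-separated).
OUTPUT = """\
NUMBER\tENGLISH\tCONCEPTICON_ID\tCONCEPTICON_GLOSS\tSIMILARITY
1\tword\t1599\tWORD\t2
#<<<\t\t\t\t
2\thand / arm\t1277\tHAND\t4
2\thand / arm\t1673\tARM\t4
2\thand / arm\t2121\tARM OR HAND\t4
#>>>\t\t\t\t
3\teggplant\t1146\tAUBERGINE\t2
"""


def ambiguous_blocks(text):
    """Collect candidate glosses between '#<<<' and '#>>>', keyed by label."""
    blocks, inside = {}, False
    for line in text.splitlines():
        cells = line.split("\t")
        if cells[0] == "#<<<":
            inside = True
        elif cells[0] == "#>>>":
            inside = False
        elif inside:
            # cells[1] is the elicitation gloss, cells[3] the candidate concept set
            blocks.setdefault(cells[1], []).append(cells[3])
    return blocks


blocks = ambiguous_blocks(OUTPUT)
print(blocks)  # {'hand / arm': ['HAND', 'ARM', 'ARM OR HAND']}
```

Each key in `blocks` is a concept label for which exactly one candidate still has to be chosen, in line with the *n* to 1 requirement.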

<span style="color:red">You may wonder why the API gives you certain similarity scores. For example, why would "eggplant" rank higher than "aubergine"? The reason can be found in the specific mapping algorithm that we use and which may need future refinement. This algorithm essentially divides a "gloss" (a concept label) into different parts, and also tries to determine information regarding part of speech and the like. This algorithm is currently being revised, and we hope to be able to provide more information soon.</span>



### 5.3 The Concepticon look-up tool

We have created a standalone look-up tool that allows you to quickly check for a certain concept in Concepticon. This tool is available within the `pyconcepticon` API, and you can create it by typing:

```
$ concepticon app
```
This will open a browser window where you can select different languages and then type in potential elicitation glosses. As you type words into the field, the tool tries to match them to potential concept sets in Concepticon, based on the actual elicitation glosses inside Concepticon, but also based on a fuzzy search that accounts for small spelling errors.
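
To give a rough idea of what a fuzzy search that tolerates small spelling errors does, here is a minimal Python sketch based on the standard library's `difflib`. It is not the algorithm used by the look-up tool; the small gloss table (with IDs taken from the mapping examples above), the function name `lookup`, and the cutoff value are assumptions for illustration.

```python
import difflib

# Concept set IDs and glosses taken from the mapping examples above.
GLOSSES = {
    "hand": (1277, "HAND"),
    "arm": (1673, "ARM"),
    "foot": (1301, "FOOT"),
    "evening": (1629, "EVENING"),
    "sun": (1343, "SUN"),
    "word": (1599, "WORD"),
}


def lookup(query, cutoff=0.7):
    """Return concept sets whose gloss is similar to the (possibly misspelled) query."""
    matches = difflib.get_close_matches(
        query.lower().strip(), list(GLOSSES), n=3, cutoff=cutoff
    )
    return [GLOSSES[match] for match in matches]


print(lookup("evenig"))  # a misspelling of "evening"
print(lookup("fot"))     # a misspelling of "foot"
```

A lower `cutoff` returns more, but less reliable, candidates; whichever value is chosen, the candidates are only suggestions that need to be checked by hand.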

### 5.4 Contributing to Concepticon

The Concepticon is a collaborative effort intended to render our linguistic data more comparable. The more questionnaires we can add to our collection, the easier it will be for future research to build on these resources. Even if you think that you do not need to link your data to Concepticon because you use the "standard list" by Swadesh anyway, you should at least provide a `concepts.tsv` file in which you list your links explicitly. In this way, you guarantee that others can re-use your data, and you also contribute to the collaborative efforts currently under way in the context of the CLDF initiative.

### 5.5 Example process: contributing to Concepticon
1. Prepare your data and make sure that the columns are in the order `ID`, `NUMBER`, `ENGLISH`.
2. Fork and then clone the repository.
```console
$ git clone your/own/concepticon-data/url
```

3. Map your concepts with `concepticon`.
```console
$ concepticon map_concepts [FILE NAME] > [OUTPUT]
```

4. Clean the output file and then move it into the concept folder. **Please name the file [First Author]-[Year]-[Concept Number], e.g., Marrison-1967-909.**
```console
$ mv [mapped file] ~/concepticon-data/concepticondata/.
```

5. Add new concepts to **concepticon.tsv**.
6. Add the new metadata to **conceptlists.tsv**.
```console
$ awk 'NR==2{print $0}' [META DATA FILE] >> ~/concepticon-data/concepticondata/conceptlists.tsv
```

7. Add the new reference (in BibTeX format) to **references.bib**.
8. Run a test.
```console
$ concepticon test
```

9. Commit your changes and push them to GitHub.
```console
$ git commit -a 
$ git push origin master
```

10. Navigate to your GitHub repository and press **New pull request**.
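
Two of the conventions in the steps above are easy to check mechanically before opening a pull request: the column order from step 1 and the file name pattern from step 4. The following Python sketch is a hypothetical helper, not part of `pyconcepticon`; the function names and the exact regular expression are assumptions derived from the example `Marrison-1967-909`.

```python
import csv
import io
import re

EXPECTED_COLUMNS = ["ID", "NUMBER", "ENGLISH"]
# Assumed reading of [First Author]-[Year]-[Concept Number], e.g. Marrison-1967-909.
NAME_PATTERN = re.compile(r"^[A-Za-z]+-\d{4}-\d+$")


def valid_list_name(name):
    """Check a concept list name (without file extension) against the assumed pattern."""
    return bool(NAME_PATTERN.match(name))


def leading_columns_ok(tsv_text):
    """Check that the first three columns of a tab-separated file are ID, NUMBER, ENGLISH."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    return header[:3] == EXPECTED_COLUMNS


print(valid_list_name("Marrison-1967-909"))
print(leading_columns_ok("ID\tNUMBER\tENGLISH\nMarrison-1967-909-1\t1\thand\n"))
```

Such a check does not replace `concepticon test`; it only catches the two most mechanical mistakes early.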

### 5.6 Useful resources when cleaning the data
Sometimes it is not clear which Concepticon concept sets are the appropriate ones. The following resources can help you find good matches:
1. Concepticon Lookup: http://calc.digling.org/concepticon/
2. Glosbe online dictionary: https://glosbe.com/
3. STEDT (if your data was collected from this database): http://stedt.berkeley.edu/~stedt-cgi/rootcanal.pl/source/GEM-CNL
4. Longman dictionary: https://www.ldoceonline.com/
5. The original articles.


## 6 Caveats

When linking to the Concepticon, we have found that it may be difficult for scholars to understand which principle should guide the way they link concepts. Usually, scholars assume that they NEED to find some link at any cost, even if it is not there.
As a general rule, however, scholars should do the opposite: if their elicitation gloss denotes a specific meaning that cannot (yet) be found in Concepticon, they should **never** link it to some approximately similar concept set, but rather leave the item unlinked. If scholars find that the concept is important, they can file an issue on GitHub and ask to have it included, or alternatively make a pull request on GitHub and include it themselves. There are some further cases that are difficult when dealing with Concepticon mappings. These concern (a) translation problems and the original "meaning" of an elicitation gloss (as opposed to its apparent meaning in its surface form), (b) how to handle approximate linkings, and (c) how to "read" the definitions we provide in the Concepticon project. In the following, we will briefly discuss these points separately, providing examples.

### Translation problems and the "original meaning" of an elicitation gloss

When inspecting a concept list that lists the word "blunt" as its English elicitation gloss, the obvious guess is that this should refer to [BLUNT](http://concepticon.clld.org/parameters/379), i.e.:

> Having a thick edge or point, as an instrument; not sharp.

If we look at the reflexes (i.e., all concrete elicitation glosses for the concept set that have been collected so far), we can further see that some scholars use the term *dull* instead. As both *blunt* and *dull* can denote the concept intended by the concept set, this is not surprising. However, since *dull* can also mean *stupid*, it is possible that scholars confuse the two items, so we have to be careful when we see *dull* as an elicitation gloss. An example of such a case, where the elicitation gloss is *dull* but we should not link to BLUNT, is the list by [Wang and Wang (2004)](http://concepticon.clld.org/contributions/Wang-2004-200), where the authors translate the traditional Swadesh item BLUNT in their Chinese elicitation gloss with [呆、笨](http://concepticon.clld.org/values/Wang-2004-200-114), which means *stupid*. This shows that when linking concept lists to Concepticon, we cannot simply look at the English gloss that we are provided with, but also need to understand what concept the authors actually elicited. From the data by Wang and Wang (2004), we can clearly see that they elicited *stupid* and not *blunt/dull*, and we link it accordingly.

This means that, when linking to Concepticon, we should *never* take a literal reading of elicitation glosses for granted. Instead, we need to find out what the authors *intended* when setting up their questionnaire and what they actually asked in the field. This may be easy to spot if one knows the languages that were documented, but at times it is impossible. In any case, elicitation glosses should never simply be linked to Concepticon literally; one should always strive to find the best solution possible.

### Approximate linkings 

Approximate linkings, e.g., linking an elicitation gloss *some kind of a tree* to [TREE](http://concepticon.clld.org/parameters/906), should never be done. If field workers write *some kind of a tree*, they mean a specific species, not the generic concept TREE. In general, we recommend that scholars leave all doubtful entries unlinked instead of linking an elicitation gloss to Concepticon despite having doubts. There will usually be enough data left for which they can be sure.
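
In a mapped concept list, leaving an entry unlinked simply means leaving its Concepticon columns empty. The following Python sketch illustrates this with two rows; the column layout follows the `map_concepts` output shown earlier, while the function names `to_tsv` and `unlinked` and the sample rows are assumptions for illustration.

```python
import csv
import io

COLUMNS = ["NUMBER", "ENGLISH", "CONCEPTICON_ID", "CONCEPTICON_GLOSS"]

ROWS = [
    {"NUMBER": "1", "ENGLISH": "tree", "CONCEPTICON_ID": "906", "CONCEPTICON_GLOSS": "TREE"},
    # Doubtful entry: deliberately left without a link.
    {"NUMBER": "2", "ENGLISH": "some kind of a tree", "CONCEPTICON_ID": "", "CONCEPTICON_GLOSS": ""},
]


def to_tsv(rows):
    """Serialize rows as tab-separated text, keeping unlinked entries with empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def unlinked(rows):
    """Return the elicitation glosses that have no Concepticon link."""
    return [row["ENGLISH"] for row in rows if not row["CONCEPTICON_ID"]]


print(to_tsv(ROWS))
print(unlinked(ROWS))
```

Keeping the unlinked rows in the file, rather than deleting them, preserves the full questionnaire and makes it easy to add links later if a suitable concept set is created.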

In the same spirit, care must be taken when establishing concept relations. In such cases, it is very important to remember that Concepticon is not a lexical or ontological network, like [WordNet](https://wordnet.princeton.edu/) or [BabelNet](http://babelnet.org/), but a network for establishing relations among concept lists. For example, while general ontologies and programming languages use the `instanceof` property to determine relationships between objects and their classes, Concepticon's algorithms for automatic mapping do not employ it, as it could wrongly map specific instances to the general concepts. The `narrower` and `broader` properties, on the other hand, are used by the same algorithms when establishing fuzzy mappings via database operations of "union" and "set". As such, and unlike what might be expected, a celebration like `NEW YEAR'S EVE` or `DRAGON BOAT FESTIVAL` should be mapped as an instance of `FESTIVAL` and not as a narrower concept; otherwise the algorithm would try to link these specific events with the general concept of "festival" from other wordlists. The `narrower` property is used in cases where a concept is entirely part of the definition of a second concept, usually in cases where two concepts are referred to by the same word. For example, both `ARM` and `HAND` are concepts narrower than the `ARM OR HAND` concept, which is usually found in concept lists applied to languages that use the same word for the entire upper limb.
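
The difference between the relation types can be illustrated with a toy example. The sketch below is not Concepticon's mapping code; it merely assumes that relations are stored as (source, relation, target) triples, read as "source is narrower than target" or "source is an instance of target", and shows that following only `narrower`/`broader` links keeps `DRAGON BOAT FESTIVAL` apart from `FESTIVAL`, while `ARM` and `HAND` remain reachable from `ARM OR HAND`. The triple layout and the function name `related` are assumptions.

```python
# Toy relations as (source, relation, target); the concepts are the examples from the text.
RELATIONS = [
    ("ARM", "narrower", "ARM OR HAND"),
    ("HAND", "narrower", "ARM OR HAND"),
    ("NEW YEAR'S EVE", "instanceof", "FESTIVAL"),
    ("DRAGON BOAT FESTIVAL", "instanceof", "FESTIVAL"),
]


def related(gloss, follow=("narrower", "broader")):
    """Return the concept sets linked to `gloss` by one of the relations in `follow`."""
    found = set()
    for source, relation, target in RELATIONS:
        if relation not in follow:
            continue
        if source == gloss:
            found.add(target)
        elif target == gloss:
            found.add(source)
    return found


print(sorted(related("ARM OR HAND")))
print(sorted(related("FESTIVAL")))
```

With the default setting, `related("FESTIVAL")` is empty, because the `instanceof` links are deliberately not followed.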

### Interpreting Concepticon definitions

Concepticon definitions have been drawn from different sources. They are by no means perfect; on the contrary, they often do not express what the concepts that have been linked to a Concepticon concept set actually express. We try hard to improve them, but we are only a small team of contributors so far, so we have limited time to work on this. We recommend that scholars who intend to link their data to Concepticon do not take the definitions literally, but rather check, along with the definition of a given concept set, which concepts we have actually linked to it. In most cases, this gives a much clearer idea of what has been done.

As an example, consider the concept set [BOAR](http://concepticon.clld.org/parameters/1348). While the definition says that it is actually intended to denote

> An adult male pig.

the reflexes of the concept set are the following:

![image](img/concept_boar.png)

We can see that only the Chinese elicitation gloss clearly meets the intended definition. The other concepts reflect the boar as a wild pig species, as opposed to the domesticated pig. This is a clear bug in Concepticon; it has already been put on the [issue tracker](https://github.com/clld/concepticon-data/issues/466) and will hopefully be corrected soon. It illustrates, however, that scholars linking their data to Concepticon should always be aware of potential problems due to inconsistencies in Concepticon itself (*errare humanum est*), but also due to some ill-worded definitions, which should *never* be taken literally. What needs to be taken literally are the links, i.e., the reflexes per concept set. If they are inconsistent, it is a bug in Concepticon; if they are not, the definition is best derived from them.