# Session 3: Concepticon (Johann-Mattis List and Tiago Tresoldi)



## 1 Reference Catalogs



## 2 Background on Concept Lists



## 3 Linking Concept Lists within the Concepticon Project



## 4 Design Principles of Concept List Linking

### 4.1 The Base File: `concepticon.tsv`

### 4.2 Information on Concept Lists: `conceptlists.tsv`

### 4.3 Concept Set Relations: `conceptrelations.tsv`

### 4.4 Concrete Concept Lists in `conceptlists/`

### 4.5 Concept Set Metadata in `concept_set_meta`



## 5 Fun with Concept Sets: `pyconcepticon` API


To make sure that your data is comparable in terms of the concepts you investigate, you should link your questionnaire to the Concepticon ([List et al. 2016](http://bibliography.lingpy.org?key=List2016a)). Many scholars still find it difficult to understand what the Concepticon actually is. We won't go into the details here, but if you are interested in selecting comparable questionnaires (e.g., words less prone to borrowing) for your language sample, you should have a close look at the Concepticon website at http://concepticon.clld.org, since it is highly likely that your specific questionnaire has already been linked. In this case, you should download the concept list in the form in which the Concepticon project provides it: this spares you the time of typing it up yourself (which may introduce new errors), and you get a lot of meta-information which may be useful. For example, if you download the Leipzig-Jakarta list ([Tadmor 2009](http://bibliography.lingpy.org?key=Tadmor2009), [Tadmor-2009-100](http://concepticon.clld.org/contributions/Tadmor-2009-100)), you can learn a lot about how it was constructed, and you can also directly compare it with similar lists. If you want to know how stable the concepts in this list are, for example, you can have a look at the basic list underlying the original project ([Haspelmath-2009-1460](http://concepticon.clld.org/contributions/Haspelmath-2009-1460)), where you will find explicit ranks for all concepts.


### 5.1 Comparing Concept Lists

If you want to check the overlap between the Leipzig-Jakarta list and Swadesh's ([1955](http://bibliography.lingpy.org?key=Swadesh1955)) list of 100 items, you can use the Concepticon API, querying for the intersection of both lists:

```shell
$ concepticon intersection Tadmor-2009-100 Swadesh-1955-100
  1   ARM OR HAND            [2121] HAND (1, Swadesh-1955-100)
  2   ASH                    [646 ] 
  3   BIG                    [1202] 
  4   BIRD                   [937 ] 
  5   BITE                   [1403] 
  6   BLACK                  [163 ] 
  7   BLOOD                  [946 ] 
  8   BONE                   [1394] 
  9   BREAST                 [1402] 
 10   BURN                   [2102] BURNING (1, Tadmor-2009-100)
 11   COME                   [1446] 
 12   DOG                    [2009] 
 13   DRINK                  [1401] 
 14   EAR                    [1247] 
 15   EARTH (SOIL)           [1228] 
 16   EAT                    [1336] 
 17   EGG                    [744 ] 
 18   EYE                    [1248] 
 19   FIRE                   [221 ] 
 20   FISH                   [227 ] 
 21   FLESH OR MEAT          [2615] 
 22   FLY (MOVE THROUGH AIR) [1441] 
 23   FOOT OR LEG            [2098] FOOT (1, Swadesh-1955-100)
 24   GIVE                   [1447] 
 25   GO                     [695 ] WALK (1, Swadesh-1955-100)
 26   GOOD                   [1035] 
 27   HAIR                   [1040] 
 28   HEAR                   [1408] 
 29   HORN (ANATOMY)         [1393] 
 30   I                      [1209] 
 31   KNEE                   [1371] 
 32   KNOW (SOMETHING)       [1410] 
 33   LEAF                   [628 ] 
 34   LIVER                  [1224] 
 35   LONG                   [1203] 
 36   LOUSE                  [1392] 
 37   MOUTH                  [674 ] 
 38   NAME                   [1405] 
 39   NECK                   [1333] 
 40   NEW                    [1231] 
 41   NIGHT                  [1233] 
 42   NOSE                   [1221] 
 43   NOT                    [1240] 
 44   ONE                    [1493] 
 45   RAINING OR RAIN        [2108] RAIN (PRECIPITATION) (1, Tadmor-2009-100)
 46   RED                    [156 ] 
 47   ROOT                   [670 ] 
 48   SAND                   [671 ] 
 49   SAY                    [1458] 
 50   SEE                    [1409] 
 51   SKIN                   [763 ] 
 52   SMALL                  [1246] 
 53   SMOKE (EXHAUST)        [778 ] 
 54   STAND                  [1442] 
 55   STAR                   [1430] 
 56   STONE OR ROCK          [2125] STONE (1, Swadesh-1955-100)
 57   TAIL                   [1220] 
 58   THIS                   [1214] 
 59   THOU                   [1215] 
 60   TONGUE                 [1205] 
 61   TOOTH                  [1380] 
 62 * TREE OR WOOD           [2141] WOOD (1, Tadmor-2009-100), 
                                    TREE (1, Swadesh-1955-100)
 63   WATER                  [948 ] 
 64   WHAT                   [1236] 
 65   WHO                    [1235] 
```

From this output, you can learn that Leipzig-Jakarta lists "arm or hand" as a concept, while Swadesh is more concrete, listing only "hand". You can also learn that Swadesh is less specific regarding the concept "rain": his list does not tell us whether it was intended as a noun or a verb. From match 62, you can further see that "tree" and "wood" are both judged to be subsets of the meta-concept "tree or wood", and indeed, there are quite a few languages which do not distinguish between the two.
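The idea behind matches such as number 62 can be sketched in a few lines of plain Python. This is not the implementation behind `concepticon intersection`, only an illustration: the glosses are taken from the output above, while the `BROADER` table and the function `fuzzy_intersection` are hypothetical names introduced for this example.

```python
# Illustrative only: the glosses come from the output above, but the data
# structures and the function name are assumptions, not the pyconcepticon API.

# Hypothetical excerpt of "narrower -> broader" concept set relations.
BROADER = {
    "HAND": "ARM OR HAND",
    "FOOT": "FOOT OR LEG",
    "TREE": "TREE OR WOOD",
    "WOOD": "TREE OR WOOD",
}


def fuzzy_intersection(list_a, list_b):
    """Return exact matches plus matches that only hold via a broader concept set."""
    exact = sorted(set(list_a) & set(list_b))
    via_broader = []
    for a in sorted(set(list_a) - set(list_b)):
        for b in sorted(set(list_b) - set(list_a)):
            # Lift both glosses to their broader concept set (or keep them as is).
            if BROADER.get(a, a) == BROADER.get(b, b):
                via_broader.append((BROADER.get(a, a), a, b))
    return exact, via_broader


# Tiny excerpts of the two lists, reduced to concept set glosses.
leipzig_jakarta = ["ARM OR HAND", "ASH", "WOOD", "FOOT OR LEG"]
swadesh_100 = ["HAND", "ASH", "TREE", "FOOT"]

exact, via_broader = fuzzy_intersection(leipzig_jakarta, swadesh_100)
print("exact:", exact)
for broader, a, b in via_broader:
    print(f"{broader}: {a} (list 1), {b} (list 2)")
```

In this sketch, "WOOD" and "TREE" never match directly; they are only brought together because both point to the same broader concept set, which mirrors the starred line in the output.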

There are more possibilities: The ```concepticon union``` command allows you to calculate the union of different lists, thus allowing you to create your own questionnaires based on different concept lists. By typing the following command in the command line, for example, you can learn that the union of Leipzig-Jakarta and Swadesh's 100-item list comprises 135 concepts:

```shell
$ concepticon union Tadmor-2009-100 Swadesh-1955-100 | wc -l
135
```
And if you add the 200-item list by Swadesh ([1952](http://bibliography.lingpy.org?key=Swadesh1952)), you will see that the union has 222 concepts:

```shell
$ concepticon union Tadmor-2009-100 Swadesh-1955-100 Swadesh-1952-200 | wc -l
222
```
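The first of these numbers can be cross-checked against the intersection shown earlier. The following lines are a small sanity check, assuming that both lists contribute exactly 100 concept sets each and that `union` and `intersection` treat the fuzzy matches in the same way (neither assumption is stated by the tool itself):

```python
# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|.
# The list sizes are assumed from the list names; 65 is the number of
# matches reported by the intersection command above.
size_leipzig_jakarta = 100
size_swadesh_100 = 100
shared = 65

union_size = size_leipzig_jakarta + size_swadesh_100 - shared
print(union_size)  # 135
```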

### 5.2 Linking Concept Lists

More importantly, if you want to merge data from different questionnaires or datasets where you do not know to which degree the concepts overlap, you can use the automatic mapping algorithm provided by the Concepticon API to get a first intelligent guess as to which concepts your data contains. This works even across different languages, as we have so far assembled concept labels in quite a few different language varieties which we can use to search for similar concepts. The command is as simple as typing ```concepticon map_concepts <yourconceptlist>``` in your terminal, where you replace ```<yourconceptlist>``` with your filename. We have prepared three files, one in English, one in Chinese, and one in German, all showing the following tabular structure (the following being from the file ```C_concepts.tsv```):

```
NUMBER	ENGLISH
1	word
2	hand
3	eggplant
4	aubergine
5	simpsons (tv series)
```

In order to link this English file to the Concepticon, all we have to do is to type:

```shell
$ concepticon map_concepts C_concepts.tsv
NUMBER	ENGLISH	CONCEPTICON_ID	CONCEPTICON_GLOSS	SIMILARITY
1	word	1599	WORD	2
2	hand	1277	HAND	2
3	eggplant	1146	AUBERGINE	2
4	aubergine	1146	AUBERGINE	4
5	simpsons (tv series)		???	
#	4/5	80%	
```

The output tells us, first, whether each concept can be linked to the Concepticon and, second, the overall percentage of inferred links. You can see that the mapping algorithm is not based on simple string identity, as it correctly links "eggplant" to the concept set ```AUBERGINE```.

Similarly, we can try to link our file with Chinese concepts, the file ```C_concepts-chinese.tsv```:

```shell
$ concepticon --language=zh map_concepts C_concepts-chinese.tsv
NUMBER	GLOSS	CONCEPTICON_ID	CONCEPTICON_GLOSS	SIMILARITY
1	我	1209	I	2
2	你	1215	THOU	2
3	太陽	1343	SUN	2
4	吃飯		???	
5	月亮	1313	MOON	2
#	4/5	80%	
```

And accordingly also our file ```C_concepts-german.tsv```:

```shell
$ concepticon --language=de map_concepts C_concepts-german.tsv
NUMBER	GLOSS	CONCEPTICON_ID	CONCEPTICON_GLOSS	SIMILARITY
1	Hand	1277	HAND	2
2	Schuh	1381	SHOE	2
3	Fuß	1301	FOOT	2
4	Abend	1629	EVENING	2
5	Sonne	1343	SUN	2
#	5/5	100%	
```
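Since the mapping output is plain tab-separated text, it is easy to post-process. The following sketch uses only Python's standard library and the English output from above to list the labels that could not be linked. The function name `mapping_summary` is made up for this example, and the sketch assumes output without ambiguous blocks, where every label occurs on exactly one line:

```python
import csv

# Output of `concepticon map_concepts C_concepts.tsv`, copied from above.
MAPPED = (
    "NUMBER\tENGLISH\tCONCEPTICON_ID\tCONCEPTICON_GLOSS\tSIMILARITY\n"
    "1\tword\t1599\tWORD\t2\n"
    "2\thand\t1277\tHAND\t2\n"
    "3\teggplant\t1146\tAUBERGINE\t2\n"
    "4\taubergine\t1146\tAUBERGINE\t4\n"
    "5\tsimpsons (tv series)\t\t???\t\n"
    "#\t4/5\t80%\t\n"
)


def mapping_summary(text, gloss_column="ENGLISH"):
    """Return (linked, total, unmapped labels) for unambiguous mapping output."""
    # Lines starting with "#" carry the summary, not data rows.
    lines = [line for line in text.splitlines() if line and not line.startswith("#")]
    rows = list(csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE))
    # An empty CONCEPTICON_ID marks a label for which no link was inferred.
    unmapped = [row[gloss_column] for row in rows if not row["CONCEPTICON_ID"]]
    return len(rows) - len(unmapped), len(rows), unmapped


linked, total, unmapped = mapping_summary(MAPPED)
print(f"{linked}/{total} linked, still to check: {unmapped}")
```

For the Chinese and German files, the label column is called `GLOSS` in the output above, so `gloss_column="GLOSS"` would be needed there.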

As a final example, let us see what the Concepticon API does if we encounter a "fuzzy" match:

```bash
$ concepticon map_concepts C_concepts-fuzzy.tsv 
NUMBER	ENGLISH	CONCEPTICON_ID	CONCEPTICON_GLOSS	SIMILARITY
1	word	1599	WORD	2
#<<<				
2	hand / arm	1277	HAND	4
2	hand / arm	1019	RIGHT	4
2	hand / arm	244	LEFT	4
2	hand / arm	1673	ARM	4
2	hand / arm	2121	ARM OR HAND	4
#>>>				
3	eggplant	1146	AUBERGINE	2
4	aubergine	1146	AUBERGINE	4
#<<<				
5	man (male)	1554	MAN	2
5	man (male)	2106	MALE PERSON	2
#>>>				
#	5/5	100%	

```

Here, you can see that the concept labels "hand / arm" and "man (male)" are linked to multiple concept sets. The output further indicates which of those multiple links form a block: a line starting with "#<<<" marks the start of a block, and a line starting with "#>>>" marks its end. This allows you to conveniently jump from block to block in order to select the best match (or manually add a better match). Note that the final mapping should NEVER link one concept in your data to two or more concept sets in the Concepticon. The linking to Concepticon is, as a requirement, always *n* to 1, with *n* ideally being 1 as well.
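The block markers also make it straightforward to handle ambiguous matches in a script. The sketch below is one possible way to do this in plain Python; it is not part of the Concepticon API. The functions `ambiguous_blocks` and `resolve` are hypothetical, and the entries in `CHOICES` are example decisions made by hand for illustration, not recommendations.

```python
# Output of `concepticon map_concepts C_concepts-fuzzy.tsv`, copied from above.
FUZZY_OUTPUT = "\n".join([
    "NUMBER\tENGLISH\tCONCEPTICON_ID\tCONCEPTICON_GLOSS\tSIMILARITY",
    "1\tword\t1599\tWORD\t2",
    "#<<<\t\t\t\t",
    "2\thand / arm\t1277\tHAND\t4",
    "2\thand / arm\t1019\tRIGHT\t4",
    "2\thand / arm\t244\tLEFT\t4",
    "2\thand / arm\t1673\tARM\t4",
    "2\thand / arm\t2121\tARM OR HAND\t4",
    "#>>>\t\t\t\t",
    "3\teggplant\t1146\tAUBERGINE\t2",
    "4\taubergine\t1146\tAUBERGINE\t4",
    "#<<<\t\t\t\t",
    "5\tman (male)\t1554\tMAN\t2",
    "5\tman (male)\t2106\tMALE PERSON\t2",
    "#>>>\t\t\t\t",
    "#\t5/5\t100%\t",
])


def ambiguous_blocks(text):
    """Collect the candidate rows between each '#<<<' and '#>>>' marker."""
    blocks, current = [], None
    for line in text.splitlines():
        if line.startswith("#<<<"):
            current = []
        elif line.startswith("#>>>") and current is not None:
            blocks.append(current)
            current = None
        elif current is not None:
            current.append(line.split("\t"))
    return blocks


def resolve(blocks, choices):
    """Keep exactly one concept set per ambiguous label (n-to-1 linking)."""
    resolved = {}
    for block in blocks:
        label = block[0][1]
        candidates = {row[2]: row[3] for row in block}
        chosen = choices[label]
        if chosen not in candidates:
            raise ValueError(f"{chosen} is not a candidate for {label!r}")
        resolved[label] = (chosen, candidates[chosen])
    return resolved


# Example decisions taken by hand (hypothetical, for illustration only).
CHOICES = {"hand / arm": "2121", "man (male)": "1554"}

blocks = ambiguous_blocks(FUZZY_OUTPUT)
resolved = resolve(blocks, CHOICES)
for label, (cid, gloss) in resolved.items():
    print(label, "->", cid, gloss)
```

Because `resolve` stores a single concept set per label, the result can never link one label to several concept sets, which is the requirement stated above.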

<span style="color:red">You may wonder why the API gives you certain similarity scores. For example, why would "eggplant" rank higher than "aubergine" (a score of 2 versus 4 in the output above)? The reason can be found in the specific mapping algorithm that we use, which may need future refinement. This algorithm essentially divides a "gloss" (a concept label) into different parts, and also tries to determine information regarding part of speech and the like. The algorithm is currently being revised, and we hope to be able to provide more information soon.</span>

### 5.3 Contributing to Concepticon

The Concepticon is a collaborative effort that is supposed to render our linguistic data more comparable. The more questionnaires we can add to our collection, the easier it will be for future research to build on these resources. Even if you think that you do not need to link your data to Concepticon, since you use the "standard list" by Swadesh anyway, you should at least provide a ```concepts.tsv``` file in which you list your explicit links. In this way, you guarantee that others can re-use your data, and you also contribute to the collaborative efforts currently under way in the context of the CLDF initiative.
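A file with explicit links can be produced with a few lines of Python. The sketch below is only one possible layout: the column names are borrowed from the mapping output shown earlier, and whether a given project expects exactly these columns is an assumption of this example. The function name `write_concepts` is likewise made up for illustration.

```python
import csv
import io

# Column names follow the mapping output shown earlier; treating them as the
# layout of `concepts.tsv` is an assumption of this sketch.
FIELDS = ["NUMBER", "ENGLISH", "CONCEPTICON_ID", "CONCEPTICON_GLOSS"]

# Links taken from the English mapping example above.
LINKS = [
    {"NUMBER": 1, "ENGLISH": "word", "CONCEPTICON_ID": 1599, "CONCEPTICON_GLOSS": "WORD"},
    {"NUMBER": 2, "ENGLISH": "hand", "CONCEPTICON_ID": 1277, "CONCEPTICON_GLOSS": "HAND"},
    {"NUMBER": 3, "ENGLISH": "eggplant", "CONCEPTICON_ID": 1146, "CONCEPTICON_GLOSS": "AUBERGINE"},
]


def write_concepts(links, handle):
    """Write one tab-separated row per concept, with exactly one link each."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(links)


# Written to memory here; see the note below for writing to disk.
buffer = io.StringIO()
write_concepts(LINKS, buffer)
concepts_tsv = buffer.getvalue()
print(concepts_tsv)
```

To write the file to disk, pass a file handle instead of the buffer, for example `open("concepts.tsv", "w", encoding="utf-8", newline="")`.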