The third installment of CLICS - the database of Cross-Linguistic Colexifications.
Cite as
Rzymski, Tresoldi et al. 2019. The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies. DOI: doi.org/10.17613/5awv-6w15
This repository contains
- specifications of the source data as a human-readable table and as a requirements file to install with pip
- a map showing the geographic distribution of languages in the CLICS database, encoded in GeoJSON
- instructions to compile the CLICS database (see below) and
- the CLICS data:
  - the (zipped) SQLite database created in step 4 below, zipped by running
    zip -9 clics3.sqlite.zip clics.sqlite
  - the full, GML-encoded network created in step 5 below, zipped by running
    zip -9 clics3-network.gml.zip graphs/network-3-families.gml
  - the exact set of pip requirements used to create the artefacts above.
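To work with the distributed artefacts directly instead of rebuilding them, the archives first have to be unpacked. A minimal sketch using Python's standard `zipfile` module, assuming the archive names listed above; the helper name `unpack` and the target directory are illustrative and not part of pyclics:

```python
import zipfile
from pathlib import Path


def unpack(archive, target_dir="."):
    """Extract all members of a zip archive and return their paths."""
    target = Path(target_dir)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        return [target / name for name in zf.namelist()]


if __name__ == "__main__":
    # Archive names as listed above; archives not present locally are skipped.
    for name in ("clics3.sqlite.zip", "clics3-network.gml.zip"):
        if Path(name).exists():
            print(unpack(name))
```

The extracted members keep the paths they had when zipped, so the network file ends up under graphs/.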
This data is licensed under CC BY 4.0.
- Analyses should ideally be carried out within a virtual environment, which guarantees that the required libraries do not interfere with other pipelines. This is particularly important if you have multiple Python versions installed on your system. If you run into any issues, please confirm that you are running clics3 in a virtual environment. There are many ways to set up virtual environments, with details depending on the system you use (for example, if you are using Conda). A good overview is provided here but, for most systems, it should be enough to create an environment with
python -m venv env
and activate it with
source env/bin/activate
Once all the work has been carried out, you can leave the virtual environment by closing your shell or issuing the deactivate command.
- Install version 3 of the Python package pyclics:
pip install "pyclics>=3.0"
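If one of the steps fails, it can help to confirm that Python is really running inside the virtual environment set up earlier. A minimal sketch, assuming an environment created with `python -m venv`; the function name `in_virtualenv` is illustrative:

```python
import sys


def in_virtualenv():
    """Return True if the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

This check does not detect Conda environments, where both prefixes are the same; there, inspecting `sys.executable` is the more useful signal.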
- Download and install the Lexibank datasets from which to aggregate colexifications:
curl -O https://raw.githubusercontent.com/clics/clics3/master/datasets.txt
pip install -r datasets.txt
- Download data for the reference catalogs Glottolog and Concepticon:
cldfbench catconfig
- Create the SQLite database clics.sqlite:
clics load --glottolog-version v4.0 --concepticon-version v2.2.0
and confirm that all 30 datasets have been loaded:
clics datasets
# | Dataset | Parameters | Concepticon | Varieties | Glottocodes | Families |
---|---|---|---|---|---|---|
1 | abrahammonpa | 304 | 304 | 30 | 16 | 2 |
2 | allenbai | 499 | 499 | 9 | 9 | 1 |
3 | bantubvd | 420 | 415 | 10 | 10 | 1 |
4 | beidasinitic | 736 | 735 | 18 | 18 | 1 |
5 | bodtkhobwa | 553 | 536 | 8 | 8 | 1 |
6 | bowernpny | 338 | 338 | 175 | 172 | 1 |
7 | castrosui | 510 | 508 | 16 | 3 | 1 |
8 | chenhmongmien | 793 | 793 | 22 | 20 | 1 |
9 | diacl | 537 | 537 | 371 | 351 | 25 |
10 | halenepal | 699 | 662 | 13 | 13 | 2 |
11 | hantganbangime | 299 | 299 | 22 | 22 | 5 |
12 | hubercolumbian | 346 | 345 | 69 | 65 | 16 |
13 | ids | 1310 | 1308 | 320 | 275 | 60 |
14 | kraftchadic | 433 | 428 | 66 | 59 | 2 |
15 | lexirumah | 604 | 602 | 357 | 231 | 12 |
16 | logos | 707 | 707 | 5 | 5 | 1 |
17 | marrisonnaga | 580 | 572 | 40 | 39 | 1 |
18 | mitterhoferbena | 342 | 335 | 13 | 13 | 1 |
19 | naganorgyalrongic | 969 | 877 | 10 | 8 | 1 |
20 | northeuralex | 952 | 951 | 107 | 107 | 21 |
21 | robinsonap | 391 | 391 | 13 | 13 | 1 |
22 | satterthwaitetb | 418 | 418 | 18 | 18 | 1 |
23 | sohartmannchin | 279 | 279 | 8 | 7 | 1 |
24 | suntb | 929 | 929 | 49 | 49 | 1 |
25 | tls | 1140 | 811 | 126 | 107 | 1 |
26 | transnewguineaorg | 904 | 865 | 1004 | 760 | 106 |
27 | tryonsolomon | 317 | 314 | 111 | 96 | 5 |
28 | wold | 1459 | 1458 | 41 | 41 | 24 |
29 | yanglalo | 875 | 869 | 7 | 7 | 1 |
30 | zgraggenmadang | 311 | 310 | 98 | 98 | 1 |
| | TOTAL | 0 | 2906 | 3156 | 2271 | 200 |
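Beyond the dataset listing, the resulting clics.sqlite file can be inspected with Python's standard `sqlite3` module. The sketch below makes no assumptions about the schema pyclics creates: it reads the table names from SQLite's own `sqlite_master` catalog and counts the rows in each. The function name `table_row_counts` is illustrative.

```python
import os
import sqlite3


def table_row_counts(db_path):
    """Map each table name in an SQLite file to its number of rows."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier, doubling any embedded double quotes.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()


if __name__ == "__main__":
    # sqlite3.connect would create an empty file, so check for existence first.
    if os.path.exists("clics.sqlite"):
        for name, count in table_row_counts("clics.sqlite").items():
            print(f"{name}: {count}")
```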
- Create the colexification network (encoded as GML graph):
clics -t 3 -f families colexification --show 20 --format pipe
This will create the graph at graphs/network-3-families.gml and show the 20 most common colexifications:
ID A | Concept A | ID B | Concept B | Families | Languages | Words |
---|---|---|---|---|---|---|
906 | TREE | 1803 | WOOD | 59 | 348 | 361 |
1313 | MOON | 1370 | MONTH | 57 | 324 | 327 |
72 | CLAW | 1258 | FINGERNAIL | 55 | 236 | 243 |
1297 | LEG | 1301 | FOOT | 52 | 349 | 358 |
1352 | KNIFE | 3210 | KNIFE (FOR EATING) | 51 | 268 | 282 |
2267 | SON-IN-LAW (OF MAN) | 2266 | SON-IN-LAW (OF WOMAN) | 49 | 261 | 280 |
763 | SKIN | 1204 | BARK | 49 | 209 | 213 |
1307 | LANGUAGE | 1599 | WORD | 49 | 148 | 149 |
1277 | HAND | 1673 | ARM | 48 | 294 | 300 |
1408 | HEAR | 1608 | LISTEN | 48 | 107 | 109 |
634 | MEAT | 2259 | FLESH | 47 | 252 | 262 |
2265 | DAUGHTER-IN-LAW (OF MAN) | 2264 | DAUGHTER-IN-LAW (OF WOMAN) | 47 | 234 | 256 |
763 | SKIN | 629 | LEATHER | 46 | 236 | 258 |
837 | BLUE | 1425 | GREEN | 46 | 195 | 204 |
2261 | MALE (OF PERSON) | 2263 | MALE (OF ANIMAL) | 45 | 145 | 163 |
1199 | WIFE | 962 | WOMAN | 44 | 289 | 301 |
480 | PLATE | 481 | DISH | 44 | 155 | 170 |
2260 | FEMALE (OF PERSON) | 2262 | FEMALE (OF ANIMAL) | 44 | 146 | 154 |
1228 | EARTH (SOIL) | 626 | LAND | 43 | 159 | 167 |
667 | ROAD | 2252 | PATH | 43 | 133 | 153 |
705 | GO UP (ASCEND) | 1102 | CLIMB | 43 | 132 | 146 |
683 | PERSON | 1554 | MAN | 41 | 199 | 205 |
2255 | FATHER-IN-LAW (OF MAN) | 2254 | FATHER-IN-LAW (OF WOMAN) | 41 | 187 | 204 |
133 | WEAVE | 3294 | BRAID (VERB) OR WEAVE (BASKET) | 41 | 122 | 133 |
2261 | MALE (OF PERSON) | 1554 | MAN | 41 | 104 | 115 |
1474 | SEA | 645 | OCEAN | 41 | 101 | 110 |
215 | LIE DOWN | 1585 | SLEEP | 40 | 191 | 197 |
1265 | HIGH | 711 | TALL | 40 | 168 | 182 |
256 | FOOD | 1526 | MEAL | 40 | 124 | 136 |
1732 | SKY | 1565 | HEAVEN | 40 | 117 | 120 |
1443 | WALK | 695 | GO | 39 | 288 | 320 |
2257 | MOTHER-IN-LAW (OF MAN) | 2256 | MOTHER-IN-LAW (OF WOMAN) | 39 | 181 | 203 |
1618 | GRANDSON | 1619 | GRANDDAUGHTER | 39 | 133 | 151 |
1203 | LONG | 711 | TALL | 39 | 113 | 121 |
2260 | FEMALE (OF PERSON) | 962 | WOMAN | 39 | 109 | 119 |
948 | WATER | 666 | RIVER | 38 | 197 | 200 |
1229 | OLD | 406 | OLD MAN | 38 | 103 | 107 |
531 | HOW MUCH | 3450 | HOW MANY PIECES | 37 | 184 | 203 |
706 | DARK | 163 | BLACK | 37 | 95 | 97 |
855 | SEIZE | 702 | CATCH | 36 | 150 | 161 |
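The pipe-formatted output above is easy to post-process, for instance to filter or re-sort the colexifications. A minimal sketch using only the standard library, assuming the table has been saved or captured as text in the layout shown above; `parse_pipe_table` and `SAMPLE` are illustrative names, and the sample rows are copied from the table:

```python
def parse_pipe_table(text):
    """Parse a pipe-delimited table into a list of dicts keyed by header."""
    rows = []
    header = None
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if all(set(cell) <= set("-:") for cell in cells):
            continue  # separator row (---|---|...) or blank line
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


SAMPLE = """
ID A | Concept A | ID B | Concept B | Families | Languages | Words |
---|---|---|---|---|---|---|
906 | TREE | 1803 | WOOD | 59 | 348 | 361 |
1313 | MOON | 1370 | MONTH | 57 | 324 | 327 |
"""

if __name__ == "__main__":
    for row in parse_pipe_table(SAMPLE):
        print(row["Concept A"], row["Concept B"], int(row["Families"]))
```

All values are returned as strings; numeric columns such as Families need an explicit `int()` conversion before sorting.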
- Run subgraph and infomap cluster algorithms:
clics --seed 42 -t 3 -f families makeapp
The clustered networks will be written to GML graphs and exported in a way suitable for exploring with the CLICS javascript app. We can get some summary statistics by running
clics -t 3 -f families --graphname infomap graph_stats
----------- ----
nodes 1647
edges 2967
components 92
communities 249
----------- ----
Note that clustering may be non-deterministic, i.e. you may compute slightly different clusters than the ones distributed in the GML files in this repository.
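The nodes, edges and components figures in the summary are standard graph measures: a component is a maximal set of concepts connected to each other through colexification links. As an illustration of what the components count means (this is not how pyclics computes it), here is a union-find sketch over a hypothetical toy edge list; `count_components` is an illustrative name:

```python
def count_components(nodes, edges):
    """Count connected components of an undirected graph via union-find."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the root, halving the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)  # merge the two components

    return len({find(node) for node in nodes})


if __name__ == "__main__":
    # Toy data: two linked concept pairs and one isolated concept.
    nodes = ["TREE", "WOOD", "MOON", "MONTH", "SEA"]
    edges = [("TREE", "WOOD"), ("MOON", "MONTH")]
    print(count_components(nodes, edges))  # 3
```

Every endpoint in `edges` must also appear in `nodes`; otherwise the lookup raises a `KeyError`.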
- Finally, we can explore the clusters in the CLICS javascript app:
clics runapp