CLDF Examples

Examples for testing

Small toy examples for testing implementations of the specification are available from the pycldf test suite at https://github.com/cldf/pycldf/tree/master/tests/data

A WALS feature as CLDF structure dataset

Feature 1A of WALS Online, converted to a CLDF StructureDataset, is available at wals_1A_cldf. This dataset was created using the code and instructions in the pycldf repository.

One of the design goals of CLDF was to make re-use of existing tools for linguistic data possible. As an example, we can convert the CLDF dataset to the legacy custom tab-delimited export of WALS features (e.g. http://wals.info/feature/1A.tab), using tools from the csvkit package:

  • We use csvjoin three times, to augment the ValueTable with metadata about languages, parameters and feature values.
  • Then we use csvcut to prune the excess columns and re-order the remaining ones.
  • Using sed we replace the first line, thereby renaming the columns.
  • Finally we use csvformat to switch to tab-delimited values.
$ csvjoin -c Language_ID,ID wals_1A_cldf/values.csv wals_1A_cldf/languages.csv \
| csvjoin -c Parameter_ID,ID - wals_1A_cldf/parameters.csv \
| csvcut -C Name2 \
| csvjoin -c Code_ID,ID - wals_1A_cldf/codes.csv \
| csvcut -c Language_ID,Name,Value,Name2,Latitude,Longitude,Genus,Family,Area \
| sed "1s/.*/wals code,name,value,description,latitude,longitude,genus,family,area/" \
| csvformat -T \
| head -n 5
wals code	name	value	description	latitude	longitude	genus	family	area
abi	Abipón	2	Moderately small	-29.0	-61.0	South Guaicuruan	Guaicuruan	Phonology
abk	Abkhaz	5	Large	43.0833333333	41.0	Northwest Caucasian	Northwest Caucasian	Phonology
ach	Aché	1	Small	-25.25	-55.1666666667	Tupi-Guaraní	Tupian	Phonology
acm	Achumawi	2	Moderately small	41.5	-121.0	Palaihnihan	Hokan	Phonology
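
The header-renaming and delimiter steps can be tried in isolation. The following is a minimal sketch using only sed and tr; the `sample` data and the shortened three-column header are hypothetical stand-ins for the real csvcut output, and tr replaces csvformat only under the assumption that no field contains a quoted comma:

```shell
# Hypothetical two-row CSV standing in for the output of the csvcut step.
sample='Language_ID,Name,Value
abk,Abkhaz,5'

# 1s/.*/.../ rewrites only line 1, so data rows pass through unchanged.
# tr is a naive stand-in for csvformat -T: it is only safe here because
# no field contains a quoted comma.
renamed=$(printf '%s\n' "$sample" \
  | sed '1s/.*/wals code,name,value/' \
  | tr ',' '\t')

printf '%s\n' "$renamed"
```

For real data, csvformat -T remains the safer choice, since it parses the CSV quoting rather than replacing every comma.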

If we wanted to be fancy, we could even add the citation information, using jq to read the CLDF metadata:

$ cat wals_1A_cldf/StructureDataset-metadata.json | jq -r '."dc:bibliographicCitation"'
Ian Maddieson. 2013. Consonant Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/1 )

We can also append the access date:

$ cat wals_1A_cldf/StructureDataset-metadata.json \
| jq -r '."dc:bibliographicCitation"' \
| sed "s/ )/, Accessed on $(date -I).)/g"
Ian Maddieson. 2013. Consonant Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/1, Accessed on 2021-04-22.)
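
The -I option of date is not specified by POSIX, so it may be unavailable on some systems. A minimal sketch of a more portable variant, assuming only a POSIX shell, date and sed; the `citation` variable is a hypothetical stand-in for the last line of the jq output:

```shell
# Hypothetical stand-in for the last line of the jq output.
citation='(Available online at http://wals.info/chapter/1 )'

# %Y-%m-%d produces the same YYYY-MM-DD form as the access date shown above.
today=$(date +%Y-%m-%d)

# Replace the " )" that closes the citation with the access date.
stamped=$(printf '%s\n' "$citation" | sed "s/ )/, Accessed on ${today}.)/")
printf '%s\n' "$stamped"
```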

Putting it all together:

$ ( cat wals_1A_cldf/StructureDataset-metadata.json \
| jq -r '."dc:bibliographicCitation"' \
| sed "s/ )/, Accessed on $(date -I).)/g" ; \
echo "" ; \
csvjoin -c Language_ID,ID wals_1A_cldf/values.csv wals_1A_cldf/languages.csv \
| csvjoin -c Parameter_ID,ID - wals_1A_cldf/parameters.csv \
| csvcut -C Name2 \
| csvjoin -c Code_ID,ID - wals_1A_cldf/codes.csv \
| csvcut -c Language_ID,Name,Value,Name2,Latitude,Longitude,Genus,Family,Area \
| sed "1s/.*/wals code,name,value,description,latitude,longitude,genus,family,area/" \
| csvformat -T \
| head -n 5 )

will yield

Ian Maddieson. 2013. Consonant Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/1, Accessed on 2021-04-22.)

wals code	name	value	description	latitude	longitude	genus	family	area
abi	Abipón	2	Moderately small	-29.0	-61.0	South Guaicuruan	Guaicuruan	Phonology
abk	Abkhaz	5	Large	43.0833333333	41.0	Northwest Caucasian	Northwest Caucasian	Phonology
ach	Aché	1	Small	-25.25	-55.1666666667	Tupi-Guaraní	Tupian	Phonology
acm	Achumawi	2	Moderately small	41.5	-121.0	Palaihnihan	Hokan	Phonology

A Wordlist with cognate judgements

The directory wordlist contains an example of a CLDF Wordlist, including cognates and partial cognates.

A first inspection with the cldf stats command from the pycldf package reveals:

$ cldf stats Wordlist-metadata.json
<cldf:v1.0:Wordlist at ../../glottobank/cldf/examples/wordlist>
key            value
-------------  --------------------------------------------
dc:conformsTo  http://cldf.clld.org/v1.0/terms.rdf#Wordlist

Path                 Type                     Rows
-------------------  ---------------------  ------
forms.csv            Form Table               1825
cognates.csv         Cognate Table            1825
partialcognates.csv  Partial Cognate Table    2531
sources.bib          Sources                     2

Again, we can use the tools from the csvkit package, e.g. to show the alignments for all cognate sets for a particular concept:

$ csvjoin -c Form_ID,ID cognates.csv forms.csv \
| csvgrep -c Segment_slice -r "." -i \
| csvsort -c Concept,Cognateset_ID,Language - \
| csvcut -c Concept,Cognateset_ID,Alignment,Language - \
| csvgrep -c Concept -m "the skin" \
| csvformat -T
Concept	Cognateset_ID	Alignment	Language
the skin	2	ʃ ɔ̆ ³⁵ + j a m ⁵⁵	Maru
the skin	3	a ³¹ - + ʐ ɿ - ⁵⁵	Achang_Longchuan
the skin	3	- - a + r i j -	Old_Burmese
the skin	3	a ³¹ - + ʐ ɿ - ⁵⁵	Xiandao
the skin	204	ʃ ŏ ²¹ + k ṵ - ʔ ⁵⁵	Atsi
the skin	204	ʃ ă ³⁵ + k a̰ u ʔ ⁵⁵	Bola
the skin	204	ʃ ŏ ⁵⁵ + k ṵ - k ⁵⁵	Lashi
the skin	343	ɑ ⁵³ + tθ ɑ ⁵⁵ + ɑ ⁵³ + j e ²²	Rangoon

Note that the first invocation of csvgrep is used to filter out partial cognates.
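
The filter works because -r "." matches every row with at least one character in Segment_slice, and -i inverts the selection, so only rows with an empty Segment_slice remain. A minimal sketch of the same idea with awk; the `sample` rows and their values are hypothetical, and splitting on commas assumes no quoted commas in the data:

```shell
# Hypothetical rows: only f2 has a non-empty Segment_slice.
sample='Form_ID,Segment_slice,Cognateset_ID
f1,,2
f2,1:2,204
f3,,3'

# Keep the header plus every row whose second field is empty.
# Splitting on "," is a simplification that real CSV parsing avoids.
kept=$(printf '%s\n' "$sample" | awk -F, 'NR == 1 || $2 == ""')
printf '%s\n' "$kept"
```

Here the row for f2 is dropped, which mirrors how the csvgrep step removes rows that carry a segment slice.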

Examples "in the wild"