I am a field linguist with a spreadsheet of words and their meanings that I collected long ago. I would like to make the data available for re-use by adopting the CLDF format. Where do I start?

Starting Point

For this scenario, let's assume a simple spreadsheet file with the following structure:

ID	Chinese	English
1	天	sky
2	太阳	sun
3	月亮	moon
4	星星	star
5	云	cloud

You can find a corresponding Excel file here. This sample data is taken from the first 5 rows of the Allen-2007-500 concept list.

In order to make this simple (singular) spreadsheet into a CLDF-conformant data set, we have to obey the three principle components that make a CLDF data set (see also the CLDF repository):

a set of UTF-8 encoded CSV files (our above-mentioned spreadsheet, see also CSV caveats below)
a JSON file that describes the spreadsheet in detail (a CLDF Metadata file),
- in particular including a dc:conformsTo property that points to the CLDF module that we're using as a basis for describing our spreadsheet.

Excel to CSV

Please note that Excel to proper CSV is somewhat tricky, since the delimiter that is being used for the export can only be specified within the regional settings of the machine (see for example here).

Additionally, take special note of the exporting options with respect to the encoding schemes. See this screenshot for the option that I'm using for this tutorial (macOS High Sierra, Microsoft Excel for Mac 16.15).

The resulting CSV file can be found here. Opening the file with a simple text editor (Notepad, TextEdit, etc.) reveals the default CSV separator employed by Excel, ;. We can either change this here in the file (e.g. by using search and replace) or change the delimiter in the JSON file that describes our data for the CLDF framework. For the sake of better understanding the CLDF JSON file, we're going to do the latter.

Find an Appropriate CLDF Module

In order to satisfy the other two requirements (a JSON file describing our table, a dc:conformsTo property) we need to figure out what CLDF modules fit our data (or go the route of creating a custom module, which goes beyond the scope of this basic introduction).

For this purpose, we can browse the CLDF ontology and the modules in the CLDF repository. Note that the individual module descriptions provide further guidance for finding something that fits your specific use case.

For the purpose of the above-mentioned scenario, the Wordlist module seems appropriate.

Download/Inspect the Module's JSON

Download and/or inspect the JSON file accompanying the module of your choice. For the purpose of this tutorial, I'm going to download the Wordlist-metadata.json to a cldf folder (also in this tutorial's directory) and rename it to MyData.json.

Anatomy of a CLDF Module JSON

Opening the JSON file reveals a simple structure:

there is supposed to be a single file (look for "url": "forms.csv") called forms.csv that ...
has got one table (analogous to one sheet in Excel) (look for "tables") ...
that can have certain columns (look for "tableSchema").

Inspecting the "tableSchema" more closely reveals that there has to be (look for "required")

a column ID,
a column Language_ID,
a column Parameter_ID,
and a column Form.

Additionally, there can be

a column Segments,
a column Comment,
and a column Source, separated by ;.

Implementing the Requirements

Note that a data set can be CLDF-conformant without strictly following the default modules (see also here). For the purpose of this tutorial, we'll stick to the defaults as closely as possible. However, you're free to customise! This is the CLDF spirit!

First, I move the JSON file and the CSV file into on directory (cldf in this tutorial).

Next, I edit MyData.json to fit my data set:

"url": "Simple_Wordlist.csv"

(Our data set is not in form.csv but rather Simple_Wordlist.csv.)

"dialect": {
    "commentPrefix": null,
    "delimiter": ";"
},

(Our data set is not comma-separated, but semicolon-separated.)

{
    "name": "Language_ID",
    "virtual": true,
    "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#languageReference",
    "valueUrl": "comm1247"
}

(Rather than adding a language column that repeats the same Glottocode again and again in each row, I add a virtual column that specifies one and the same value for all rows. Note that this has to be the very last column in the JSON file. See MyData.json if you're not sure how that's supposed to look like.)

{
    "name": "Parameter_ID",
    "required": true,
    "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#parameterReference",
    "datatype": "string",
    "titles": "English"
},

(The parameter or the concept I'm describing in the table is found in the English column. This can be specified by adding "titles": "English".)

{
    "name": "Form",
    "required": true,
    "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#form",
    "datatype": "string",
    "titles": "Chinese"
},

(The form for which I'm providing information for can be found in Chinese, thus "titles": "Chinese".)

With these changes, the joint-venture of MyData.json and Simple_Wordlist.csv comprise a CLDF-conformant data set. This data set can be validated by running (in the directory with MyData.json):

cldf validate MyData.json

(See the pycldf repository for more information about the cldf command.)

A More Advanced Example

I am an expert on languages of a particular region/linguistic family. I have a spreadsheet in which rows are languages, columns are concepts, and cells contain the word for concept X in language Y. I would like to make these data available for re-use by ensuring they follow the CLDF format. Where do I start?

Let's take the following data as a starting point:

	ashes	bird	bone	cloud	eye
Yem	bèd-nā	kàs-à	mèg-ā	šāːr-ù	áːf-ā
Chara	bˈend-a	kˈaf-a	mert-ˈa	šˈač-a	ˈaːp-a
Bench	ˈaːp-a	šǒyt	mèrt	dùldűl	ȁp
Dorze	buɗ-o	kapʼ-o	mekʼeð-a	guːl-e	aypʼ-e

(Data taken from here. You can find an Excel version of the data in this repository.)

This data is in "wide format" as opposed to "long format", i.e. column headers are values, not variable names.

Depending on the complexity of the data in question, transforming the data from a wide to a long format might be more involved and external tools (e.g. R, csvkit) might be required, although this problem can also be tackled with Excel semi-automatically.

Essentially, we need to make the following modifications (analogously to the previous tutorial, you're free to change the Wordlist JSON to your needs; for the sake of this tutorial, we'll stick as closely as possible to the original):

add an ID column
add a header for the language column
move the data points into individual rows
add a Parameter_ID column in which to paste the concept labels

ID	Concept	Form	Language_ID
1	ashes	bèd-nā	Yem
2	bird	kàs-à	Yem
3	bone	mèg-ā	Yem
4	cloud	šāːr-ù	Yem
5	eye	áːf-ā	Yem
6	ashes	bˈend-a	Chara
7	bird	kˈaf-a	Chara
8	bone	mert-ˈa	Chara
9	cloud	šˈač-a	Chara
10	eye	ˈaːp-a	Chara
11	ashes	cʼyāk	Bench
12	bird	šǒyt	Bench
13	bone	mèrt	Bench
14	cloud	dùldűl	Bench
15	eye	ȁp	Bench
16	ashes	buɗ-o	Dorze
17	bird	kapʼ-o	Dorze
18	bone	mekʼeð-a	Dorze
19	cloud	guːl-e	Dorze
20	eye	aypʼ-e	Dorze

(You can also find a CSV version of this table in this repository.)

Adapting the Default Wordlist JSON

As in the introductory tutorial, we now need to describe the data with the help of a JSON file. As a starting point, we're again going to use the Wordlist JSON. We have to change the following things:

file URL
delimiter
column name (Parameter_ID is Concept)

You can find the modified JSON file here in this repository. This file can again be validated with the help of pycldf:

cldf validate WideNarrow.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Starting Point

Excel to CSV

Find an Appropriate CLDF Module

Download/Inspect the Module's JSON

Anatomy of a CLDF Module JSON

Implementing the Requirements

A More Advanced Example

Adapting the Default Wordlist JSON

Files

README.md

Latest commit

History

README.md

File metadata and controls

Starting Point

Excel to CSV

Find an Appropriate CLDF Module

Download/Inspect the Module's JSON

Anatomy of a CLDF Module JSON

Implementing the Requirements

A More Advanced Example

Adapting the Default Wordlist JSON