Cross-Linguistic Data Formats and Beyond (Christoph Rzymski and Nathanael E. Schweikhard and Tiago Tresoldi and Johann-Mattis List)

# 1 Introduction

Welcome to this session, in which we will cover Orthographic Profiles and the Cross-Linguistic Data Formats (CLDF).


# 2 Orthographic Profiles

## 2.1 The Idea

Say you have a huge amount of linguistic data that all uses the same transcription system, and you want to convert that data into IPA.

Orthographic Profiles provide an easy method of replacing letters and letter combinations with the corresponding IPA symbols by means of a simple replacement table. They work much like finite-state automata: they read the data from left to right, check whether the text at the current position matches an entry in the replacement table, and replace it with the corresponding IPA. Because they also take letter combinations into account, they try the entries from the longest to the shortest.

For example, if you wanted to convert a few Castilian Spanish words into a fairly phonemic IPA rendering:

*hola llamar amiga*

you would need the following correspondences:

| Grapheme | IPA  |
|----------|------|
| h        | NULL |
| o        | o    |
| l        | l    |
| a        | a    |
| ll       | ʎ    |
| m        | m    |
| r        | ɾ    |
| g        | ɣ    |
| i        | i    |

to obtain the output:

o l a # ʎ a m a ɾ # a m i ɣ a

The &lt;ll> in *llamar* is converted correctly into /ʎ/ rather than into two instances of /l/, because &lt;ll>, being longer than &lt;l>, is checked first. You therefore don't need to worry about putting the rows into any specific order.

To delete a letter that is not pronounced (like &lt;h>), use the value NULL in the IPA column.

## 2.2 Using Your Orthographic Profile in Python

To use the profile you created, you first need to save the table as a simple tab-separated values file, e.g. by copy-pasting it from your spreadsheet software into an editor like Notepad++ and then saving it with the file extension .tsv.
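
If you prefer not to copy and paste by hand, the file can also be written from Python. The sketch below uses only the standard library; the function name `write_profile` and the temporary output location are choices made for this illustration, while the column headers `Grapheme` and `IPA` are taken from the table above.

```python
import csv
import tempfile
from pathlib import Path

# The rows of the replacement table shown above.
ROWS = [
    ("h", "NULL"), ("o", "o"), ("l", "l"), ("a", "a"), ("ll", "ʎ"),
    ("m", "m"), ("r", "ɾ"), ("g", "ɣ"), ("i", "i"),
]


def write_profile(path, rows):
    """Write a header line and one tab-separated line per table row."""
    # UTF-8 is set explicitly so that the IPA symbols survive on any platform.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["Grapheme", "IPA"])
        writer.writerows(rows)


# A temporary directory keeps the example self-contained;
# in practice you would pick a path inside your own project.
profile_path = Path(tempfile.mkdtemp()) / "orthographic_profile1.tsv"
write_profile(profile_path, ROWS)
print(profile_path.read_text(encoding="utf-8"))
```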

Afterwards you can use the following Python code to turn your data into the correct IPA:

```python
# import relevant modules
from segments.tokenizer import Tokenizer

# load the tokenizer object
tk = Tokenizer('../data/orthographic_profile1.tsv')

# convert a string to test it
print(tk('hola llamar amiga', column='IPA'))
```

```
o l a # ʎ a m a ɾ # a m i ɣ a
```


which should give you the output mentioned above.

If the pronunciation depends on whether the letter is at the beginning or end of the word or somewhere in the middle, you can use the symbols ^ and $ for that.

So if we also want to include a word that starts with &lt;g> (which in Spanish is pronounced differently from a &lt;g> in the middle of a word), we could add the following rows. Besides the row for word-initial &lt;g>, we need rows for the word-boundary symbols themselves:

| Grapheme | IPA  |
|----------|------|
| ^g       | g    |
| ^        | NULL |
| $        | NULL |

We then also need to prepare the input data to include word borders, but Python can do that for us:

```python
# import relevant modules
import csv

from segments.tokenizer import Tokenizer

# load the tokenizer object
tk = Tokenizer('../data/orthographic_profile2.tsv')

# read the word forms from the FORM column of the input file
# (UTF-8 is set explicitly so that special characters are read correctly)
with open('../data/input_file.tsv', encoding='utf-8', newline='') as infile:
    reader = csv.DictReader(infile, delimiter='\t')
    words = [row['FORM'] for row in reader]

# add the word borders
words = ['^' + word + '$' for word in words]

# apply the orthographic profile to each word
# and save the output in a file, one word per line
with open('../data/output_file.tsv', 'w', encoding='utf-8') as outfile:
    for word in words:
        outfile.write(tk(word, column='IPA'))
        outfile.write('\n')
```

This example also shows how to read the input from a file and write the output to a file, instead of only printing it in the terminal.

If your input data is already very close to IPA and only needs a bit of cleaning up, you can also create an Orthographic Profile automatically with the help of LingPy in the terminal, by giving it your data in one column named "IPA" in a TSV file:

```
lingpy profile -i data/input_file.tsv -o data/simple_profile.tsv --column=ipa
```

The output file is then an Orthographic Profile which you can use in the same way as one you have created manually. It might contain a few mistakes, though, so you should check whether everything was recognized correctly.

Further information on how to create Orthographic Profiles automatically can be found here: http://lingpy.org/docu/sequence/profile.html

## 2.3 More Complex Profiles

So far this sounds pretty easy. There are, however, some caveats.

Orthographic Profiles need to be prepared individually not only for each language but also for each transcription system.

They work well if there is a one-to-one correspondence between grapheme and sound. But they are somewhat tedious to create if the pronunciation depends on the surrounding letters, as that can easily turn into a long list of letter combinations you need to include.

One simple example of this was the &lt;ll> above; here is another: in Spanish, the letter &lt;c> is pronounced differently depending on the following letter. Normally it is /k/, but before the front vowels &lt;e> and &lt;i> it is /θ/ (depending on the dialect). Therefore you need to add the following rows to account for this:

| Grapheme | IPA  |
|----------|------|
| c        | k    |
| ci       | θi    |
| ce       | θe    |
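
To see how these three rows interact, here is a small sketch that uses a regular expression with the graphemes sorted from longest to shortest, so that `ci` and `ce` are tried before plain `c`. It is only an illustration under stated assumptions: the extra letters in `PROFILE`, the test words, and the function name `to_ipa` are not part of the original profile.

```python
import re

# Hypothetical mini-profile: the three rows from the table above,
# plus the other letters needed for the test words.
PROFILE = {
    "c": "k", "ci": "θi", "ce": "θe",
    "a": "a", "e": "e", "i": "i", "n": "n", "s": "s",
}

# Longest graphemes first, so that "ci" and "ce" win over plain "c".
PATTERN = re.compile(
    "|".join(re.escape(g) for g in sorted(PROFILE, key=len, reverse=True))
)


def to_ipa(word):
    """Return the space-separated IPA segments for one word.

    Characters that are not in PROFILE are silently skipped in this sketch.
    """
    return " ".join(PROFILE[match.group()] for match in PATTERN.finditer(word))


for word in ["casa", "cine", "cena"]:
    print(word, "->", to_ipa(word))
```

In this sketch, a value such as `θi` comes out as a single segment; writing the value as `θ i` instead would keep the two sounds separated by a space in the output.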

Spanish orthography being as regular as it is, making a full Orthographic Profile for it wouldn't be too difficult a task (at least for a phonemic rendering of standard dialects). In other orthographic systems, however, irregular exceptions to the pronunciation rules each have to be listed individually. For English, for example, creating an Orthographic Profile would be nearly impossible.

Furthermore, Orthographic Profiles cannot handle information that is not visible in the orthography (which often includes accent, morpheme borders, tone, ...), and therefore also cannot handle pronunciation differences that depend on this information. If your data depends on such things, you first need to add this kind of information to the input (often manually).

For example, if you have some German input in standard orthography, you would first need to mark the morpheme borders (e.g. with `.`) and, in particular, mark vowel length in some fashion.

Otherwise your computer wouldn't be able to tell that there is a glottal stop before the second &lt;a> in *Hausarzt* but not in *Bausatz*. Nor would it be able to figure out that the &lt;a> in *Tat* is long but the &lt;a> in *hat* is short.

Therefore, the usefulness of Orthographic Profiles depends heavily on what kind of data you are dealing with and on how exact the IPA needs to be, but they can be very helpful in many contexts. It basically comes down to whether it takes more effort to create the Orthographic Profile or to transliterate all the linguistic data manually.

# 3 Cross-Linguistic Data

## 3.1 The Current Situation

In order to annotate linguistic data, scholars often make use of ad-hoc formats based on spreadsheet software (Excel, LibreOffice, Google Sheets). The most common format, which many linguists also claim to be the most universal and the easiest to understand, is the one in which rows represent concepts and columns represent languages. This format essentially reserves one column for each language and one row for each concept. The language names are given in the first row, and the concept labels are given in the first column of the spreadsheet.

<style>img[alt="Table 1: Tabular data format with languages in columns and concepts in rows."]{width:400px;}</style>
<table>
  <tr>
    <th rowspan="2">Concepts</th>
    <th colspan="5">Languages</th>
  </tr>
  <tr>
    <th>English</th>
    <th>German</th>
    <th>Dutch</th>
    <th>Danish</th>
    <th>Swedish</th>
  </tr>
  <tr>
    <th>&quot;hand&quot;</th>
    <td>hænd</td>
    <td>hant</td>
    <td>hɑnt</td>
    <td>hʌnˀ</td>
    <td>hanːd</td>
  </tr>
  <tr>
    <th>&quot;ashes&quot;</th>
    <td>æʃ</td>
    <td>aʃə</td>
    <td>ɑs</td>
    <td>asg</td>
    <td>asːka</td>
  </tr>
  <tr>
    <th>&quot;bark&quot;</th>
    <td>bɑːrk</td>
    <td>rɪndə</td>
    <td>bɑst</td>
    <td>bɑːg</td>
    <td>barːk</td>
  </tr>
  <tr>
    <th>...</th>
    <td>...</td>
    <td>...</td>
    <td>...</td>
    <td>...</td>
    <td>...</td>
  </tr>
</table>

This format not only lacks flexibility, as there is only one piece of information that we can give for each concept in a given language; it also becomes more and more impractical when we are dealing with many different languages, as they are extremely hard to inspect on a screen (scrolling horizontally is always harder than scrolling vertically).

Despite these shortcomings, this format, or some variant of it, is one of the most widespread forms in which language data is annotated nowadays. The problem of adding essential information on cognacy or allowing for synonyms is again mostly handled in an ad-hoc manner. Some scholars add additional rows for the same concept in order to list more than one word per meaning and per language:

<!-- ![Table 2: Tabular data format with additional rows for synonym rendering.](img/table-2.png) -->

<style>img[alt="Table 2: Tabular data format with additional rows for synonym rendering."]{width:500px;}</style>

<table>
  <tr>
    <th rowspan="2">Concepts</th>
    <th colspan="5">Languages</th>
  </tr>
  <tr>
    <th>English</th>
    <th>German</th>
    <th>Dutch</th>
    <th>Danish</th>
    <th>Swedish</th>
  </tr>
  <tr>
    <th>&quot;bark&quot;</th>
    <td>bɑːrk</td>
    <td>rɪndə</td>
    <td>bɑst</td>
    <td>bɑːg</td>
    <td>barːk</td>
  </tr>
  <tr>
    <th>&quot;bark&quot;</th>
    <td></td>
    <td>bɔrkə</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</table>

Some scholars use commas or other separators to place several entries in the same cell:

<!-- ![Table 3: Multiple synonyms in the same cell.](images/table-3.png) -->

<style>img[alt="Table 3: Multiple synonyms in the same cell."]{width:450px;}</style>

<table>
  <tr>
    <th rowspan="2">Concepts</th>
    <th colspan="5">Languages</th>
  </tr>
  <tr>
    <th>English</th>
    <th>German</th>
    <th>Dutch</th>
    <th>Danish</th>
    <th>Swedish</th>
  </tr>
  <tr>
    <th>&quot;bark&quot;</th>
    <td>bɑːrk</td>
    <td>rɪndə, bɔrkə</td>
    <td>bɑst</td>
    <td>bɑːg</td>
    <td>barːk</td>
  </tr>
</table>

And some scholars add another column for the language that has the synonym:

<!-- ![Table 4: Additional column for language to render synonyms.](images/table-4.png) -->

<style>img[alt="Table 4: Additional column for language to render synonyms."]{width:450px;}</style>
<table>
  <tr>
    <th rowspan="2">Concepts</th>
    <th colspan="6">Languages</th>
  </tr>
  <tr>
    <th>English</th>
    <th>German</th>
    <th>German (b)</th>
    <th>Dutch</th>
    <th>Danish</th>
    <th>Swedish</th>
  </tr>
  <tr>
    <th>&quot;bark&quot;</th>
    <td>bɑːrk</td>
    <td>rɪndə</td>
    <td>bɔrkə</td>
    <td>bɑst</td>
    <td>bɑːg</td>
    <td>barːk</td>
  </tr>
</table>


For people concerned with a consistent representation of knowledge, this is a nightmare, but the nightmare gets even more frightening when it comes to the annotation of cognate sets. Here, people have shown an incredible amount of imagination in creating solutions that are not only computationally difficult to track but also extremely prone to errors. Scholars have been using colors:

<!-- ![Table 5: Color-based annotation of cognate sets.](images/table-5.png) -->

<style>img[alt="Table 5: Color-based annotation of cognate sets."]{width:450px;}</style>
<table>
  <tr>
    <th rowspan="2">Concepts</th>
    <th colspan="5">Languages</th>
  </tr>
  <tr>
    <th>English</th>
    <th>German</th>
    <th>Dutch</th>
    <th>Danish</th>
    <th>Swedish</th>
  </tr>
  <tr>
    <th>&quot;bark&quot;</th>
    <td style="background-color:LightYellow;">bɑːrk</td>
    <td style="background-color:LightBlue;">rɪndə</td>
    <td style="background-color:LightGreen;">bɑst</td>
    <td style="background-color:LightYellow;">bɑːg</td>
    <td style="background-color:LightYellow;">barːk</td>
  </tr>
  <tr>
    <th></th>
    <td></td>
    <td style="background-color:LightYellow;">bɔrkə</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</table>

They often even just put the information on cognacy in a separate sheet, which makes it incredibly difficult to compare their judgments, especially when the number of languages exceeds a handful:

<!-- ![Table 6: Multi-sheet-based annotation of cognate sets.](images/table-6.png) -->

<style>img[alt="Table 6: Multi-sheet-based annotation of cognate sets."]{width:800px;}</style>
<table>
  <tr><th>Sheet 1</th><th>Sheet 2</th></tr>
  <tr><td>
  <table>
  <tr>
    <th rowspan="2">Concepts</th>
    <th colspan="5">Languages</th>
  </tr>
  <tr>
    <th>English</th>
    <th>German</th>
    <th>Dutch</th>
    <th>Danish</th>
    <th>Swedish</th>
  </tr>
  <tr>
    <th>&quot;bark&quot;</th>
    <td>bɑːrk</td>
    <td>rɪndə, bɔrkə</td>
    <td>bɑst</td>
    <td>bɑːg</td>
    <td>barːk</td>
  </tr>
  </table></td>
  <td>
  <table>
  <tr>
    <th rowspan="2">Concepts</th>
    <th colspan="5">Languages</th>
  </tr>
  <tr>
    <th>English</th>
    <th>German</th>
    <th>Dutch</th>
    <th>Danish</th>
    <th>Swedish</th>
  </tr>
  <tr>
    <th>&quot;bark&quot;</th>
    <td>A</td>
    <td>B, A</td>
    <td>C</td>
    <td>A</td>
    <td>A</td>
  </tr>
  </table>
  </td></tr>
</table>

At times, they may even binarise the data manually, which is even more dangerous, as it is almost guaranteed that manually binarising cognate sets will yield errors (not to speak of the waste of time and the fact that one cannot trace the characters back when carrying out phylogenetic analyses).

<!-- ![Table 7: Binary annotation of cognate sets in multiple sheets.](images/table-7.png) -->

<style>img[alt="Table 7: Binary annotation of cognate sets in multiple sheets."]{width:500px;}</style>
<table>
  <tr><th>Sheet 1</th><th>Sheet 2</th></tr>
  <tr><td>
  <table>
  <tr>
    <th rowspan="2">Concepts</th>
    <th colspan="5">Languages</th>
  </tr>
  <tr>
    <th>English</th>
    <th>German</th>
    <th>Dutch</th>
    <th>Danish</th>
    <th>Swedish</th>
  </tr>
  <tr>
    <th>&quot;bark&quot;</th>
    <td>bɑːrk</td>
    <td>rɪndə, bɔrkə</td>
    <td>bɑst</td>
    <td>bɑːg</td>
    <td>barːk</td>
  </tr>
  </table></td>
  <td>
  <table>
  <tr><th>Characters</th><th>A</th><th>B</th><th>C</th></tr>
  <tr>
    <th>English</th>
    <td>1</td><td>0</td><td>0</td></tr>
  <tr><th>German</th><td>1</td><td>1</td><td>0</td></tr>
  <tr><th>Dutch</th><td>0</td><td>0</td><td>1</td></tr>
  <tr><th>Danish</th><td>1</td><td>0</td><td>0</td></tr>
  <tr><th>Swedish</th><td>1</td><td>0</td><td>0</td></tr>
  </table></td></tr>
</table>
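
Binarisation of this kind is better left to a script, which produces the same matrix reproducibly and keeps the link between characters and cognate classes. The following is a minimal sketch using the cognate classes for "bark" from the tables above; the names `COGNATE_CLASSES` and `binarise` are made up for this illustration.

```python
# Cognate classes per language for the concept "bark", as in the tables above.
COGNATE_CLASSES = {
    "English": ["A"],
    "German": ["B", "A"],
    "Dutch": ["C"],
    "Danish": ["A"],
    "Swedish": ["A"],
}


def binarise(classes_by_language):
    """Return the sorted characters and one 0/1 row per language."""
    characters = sorted(
        {label for classes in classes_by_language.values() for label in classes}
    )
    matrix = {
        language: [int(character in classes) for character in characters]
        for language, classes in classes_by_language.items()
    }
    return characters, matrix


characters, matrix = binarise(COGNATE_CLASSES)
print("Characters", *characters, sep="\t")
for language, row in matrix.items():
    print(language, *row, sep="\t")
```

Since the column labels are derived from the cognate classes themselves, each binary character can always be traced back to the class it encodes.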

## 3.2 More consistent ways of data representation


Software packages like STARLING try to circumvent the above-mentioned problems by allowing for additional columns which add additional information for the same language. LingPy and EDICTOR, however, employ a different approach which greatly increases the flexibility of the format. The major principle of this approach is to reserve one row in the spreadsheet for exactly one word form. Additional information for each word form is provided in additional columns (which can be flexibly added by the user both in LingPy and EDICTOR). The content of each column in a LingPy/EDICTOR-spreadsheet is given in the header of the file, with the first column being reserved for a numeric ID which should be greater than 0:

<!-- ![Table 9: Polynesian data example for standard format in LingPy and EDICTOR.](images/table-8.png) -->

<style>img[alt="Table 9: Polynesian data example for standard format in LingPy and EDICTOR."]{width:700px;}</style>
<table>
  <tr>
    <th>ID</th>
    <th>DOCULECT</th>
    <th>CONCEPT</th>    
    <th>VALUE</th>
    <th>FORM</th>
    <th>TOKENS</th>
    <th>BORROWING</th>
    <th>COGID</th>
  </tr>
  <tr>
    <td>3631</td>
    <td>East_Futuna</td>
    <td>above</td>
    <td>à/luga/</td>
    <td>luga</td>
    <td>l u g a</td>
    <td>0</td>
    <td>1382</td>
  </tr>
    <tr>
    <td>284</td>
    <td>Wallisian</td>
    <td>above</td>
    <td>'o/luga/</td>
    <td>luga</td>
    <td>l u g a</td>
    <td>1</td>
    <td>1382</td>
  </tr>
  <tr>
    <td>5391</td>
    <td>Futuna_Aniwa</td>
    <td>above</td>
    <td>weihlunga</td>
    <td>weihlunga</td>
    <td>w e i + ʰl u ŋ a</td>
    <td>0</td>
    <td>1382</td>
  </tr>
  <tr>
    <td>761</td>
    <td>Maori</td>
    <td>above</td>
    <td>i runga</td>
    <td>i runga</td>
    <td>i _ r u ŋ a</td>
    <td>0</td>
    <td>1382</td>
  </tr>
  <tr>
    <td>3332</td>
    <td>North_Marquesan</td>
    <td>above</td>
    <td>'una</td>
    <td>'una</td>
    <td>ʔ u n a</td>
    <td>0</td>
    <td>1382</td>
  </tr>
  <tr>
    <td>4214</td>
    <td>Mele-Fila</td>
    <td>all</td>
    <td>euči</td>
    <td>euči</td>
    <td>e u tʃ i</td>
    <td>0</td>
    <td>1115</td>
  </tr>
  <tr>
    <td>3917</td>
    <td>Pukapuka</td>
    <td>all</td>
    <td>katoa(toa)</td>
    <td>katoa</td>
    <td>k a + t o a</td>
    <td>0</td>
    <td>293</td>
  </tr>
  <tr>
    <td>560</td>
    <td>Proto-Polynesian</td>
    <td>yellow</td>
    <td>*reŋareŋa, *felo(-felo)</td>
    <td>*reŋareŋa</td>
    <td>r e ŋ a + r e ŋ a</td>
    <td>0</td>
    <td>162</td>
  </tr>
  <tr>
    <td>561</td>
    <td>Proto-Polynesian</td>
    <td>yellow</td>
    <td>*reŋareŋa, *felo(-felo)</td>
    <td>*felo</td>
    <td>f e l o</td>
    <td>0</td>
    <td>230</td>
  </tr>
</table>

While this format seems rather redundant at first sight, it offers so much more flexibility that linguists who seriously test this kind of data representation quickly understand its advantages. What you need to keep in mind is that the number of columns is theoretically unlimited. So you can easily add your own columns, either for enhanced ways to annotate and model your data, or for notes in prose which you can later include in your publication. You can add sources, and you can be very detailed, listing the page number for each word form to trace the source from which it was originally taken. The possibilities are virtually unlimited once you get a clearer understanding of this way of handling linguistic data.
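
Moving from the concept-by-language layout to one row per word form can be done mechanically. The sketch below converts the "bark" row with comma-separated synonyms shown earlier into the long format; the function name `wide_to_long` and the running numeric IDs are assumptions for this illustration, while the column names follow the table above.

```python
# The "bark" row in the wide format, with two German synonyms in one cell.
WIDE = {
    "bark": {
        "English": "bɑːrk",
        "German": "rɪndə, bɔrkə",
        "Dutch": "bɑst",
        "Danish": "bɑːg",
        "Swedish": "barːk",
    },
}


def wide_to_long(wide, separator=","):
    """Return one dictionary per word form, with a numeric ID greater than 0."""
    rows = []
    for concept, cells in wide.items():
        for doculect, cell in cells.items():
            for form in cell.split(separator):
                form = form.strip()
                if form:
                    rows.append({
                        "ID": len(rows) + 1,
                        "DOCULECT": doculect,
                        "CONCEPT": concept,
                        "VALUE": cell,   # the cell as originally entered
                        "FORM": form,    # one individual word form
                    })
    return rows


for row in wide_to_long(WIDE):
    print("\t".join(str(value) for value in row.values()))
```

Further columns such as `TOKENS` or `COGID` can then be added to each row without touching the rows of other word forms.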




## 3.3 Simple but important rules for data consistency

In the previous sections we have tried to show that the ad-hoc formats employed by linguists for data collection usually have huge disadvantages in terms of transparency and interoperability, and we recommend that all users take the time to read more about the formats underlying LingPy and EDICTOR, but also about the *Cross-Linguistic Data Formats* initiative ([Forkel et al. 2017](http://bibliography.lingpy.org?key=Forkel2017a), see http://cldf.clld.org). These formats are to a large degree compatible (for LingPy and EDICTOR, they are almost identical) and will be introduced below.

The most obvious mistake in data annotation that many scholars make when preparing their data in Excel or other spreadsheet software is that they put multiple different types of information into one cell. Thus, if a word has a variant, scholars will place both forms into one cell in their spreadsheet software and separate the entries by a comma, a colon, a tilde, a dash, or at times even a backslash, often using several of these separators inconsistently within the same dataset. A first and general rule that people creating data must understand and follow is that

  **`1.` Only one type of information should be put into one cell in a spreadsheet.** 

This rule is non-negotiable: in our experience with a large number of differently coded datasets, scholars inevitably make annotation errors, even if they try to be consistent. Computers are not like humans, and if you want computers to ease your work, assume that they cannot tell whether you use a comma and a colon interchangeably when listing word variants or whether the difference is intentional. In fact, humans are also unlikely to understand this unless they created the data themselves.
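
A simple script can help to find cells that break this rule before the data is processed any further. The following sketch flags every cell containing one of the separators mentioned above; the name `suspicious_cells` and the choice of separator characters are assumptions for this illustration, and a flagged cell still needs to be checked by a human, since a dash, for instance, may be a legitimate part of a form.

```python
import re

# Comma, colon, tilde, backslash, and dash, as listed in the text above.
SEPARATOR = re.compile(r"[,:~\\-]")


def suspicious_cells(rows):
    """Yield (row number, column, cell) for each cell containing a separator."""
    for number, row in enumerate(rows, start=1):
        for column, cell in row.items():
            if SEPARATOR.search(cell):
                yield number, column, cell


# Part of the "bark" row with two German synonyms in one cell.
ROWS = [
    {"CONCEPT": "bark", "English": "bɑːrk", "German": "rɪndə, bɔrkə", "Dutch": "bɑst"},
]

for hit in suspicious_cells(ROWS):
    print(hit)
```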

A more general rule deriving from this first rule is the rule that 

  **`2.` All information valid for a given analysis needs to be consistently annotated.**

This means, for example, that, if root alternation is important for your reconstruction and cognate decisions, you need to think about how to model this in consistent markup. If your data contains reflexes of an alternating protoform *&ast;ka- vs. &ast;ku-*, for example, it is not sufficient to simply write *&ast;ka- vs. &ast;ku-* and list the reflexes, assuming that your readers will understand which reflex stems from which of the two alternants. Instead, two proto-forms should be listed, the reflexes should be assigned to the proto-form from which they evolved, and the additional information should be given that *&ast;ka-* and *&ast;ku-* are variants of the same root. This practice is rarely followed systematically in etymological dictionaries, and therefore also often disregarded in databases, but it is clearly the only way to transparently list which reflex stems from which proto-variant. In fact, this is not a matter of a more computational approach to historical linguistics, but rather a matter of improving our common practice in historical linguistics, which has for too long been based on lax guidelines.

# 4 The CLDF Initiative

## 4.1 Introduction to CLDF

The CLDF initiative is an attempt to standardize cross-linguistic data. This attempt can be traced back almost twenty years (Dimitriadis 2001). With the efforts to standardize the software stack for cross-linguistic databases within the CLLD project, we became aware of the need for a general data model that could be used to represent cross-linguistic data in the form of word lists and typological surveys. This data model aims to provide a convenient format that eases comparison across datasets and applications.

All of this culminated in a series of workshops, hosted by the Max Planck Institute for Psycholinguistics in Nijmegen (Language Comparison with Linguistic Databases, 2014), the Max Planck Institute for Evolutionary Anthropology in Leipzig (Language Comparison with Linguistic Databases 2, 2015), the Lorentz Center in Leiden (Capturing Phylogenetic Algorithms for Linguistics, 2015), and numerous follow-up workshops organized as part of the Glottobank project (http://glottobank.org), an international research consortium established to document and understand the world's linguistic diversity, funded by the Max Planck Institute for the Science of Human History in Jena (2014-2017).


## 4.2 CLDF Idea

Generally speaking, data in historical and typological linguistic research have a (seemingly) simple structure:

* `languages` have
* `features` that, in turn, have different
* `values`.

The triple (or set of triples in a complete study) of `(language, feature, value)` appears easy and straightforward enough, but carries multiple potential pitfalls. Can you think of any, or have you encountered any in your own research?

The data model underlying the [CLDF specification](https://github.com/cldf/cldf) aims to be a remedy to these (potential) issues. In general, CLDF aims to make linguistic data `FAIR`:

* findable,
* accessible,
* interoperable,
* and re-usable.

CLDF places a heavy emphasis on the re-usability aspect, since we probably all know stories of interesting linguistic data sets that have been lost to time, coding issues, or both.

## 4.3 CLDF Data Model

How does the CLDF specification aim to achieve more FAIRness for linguistic data?

First, the CLDF specification avoids unnecessary complexity and sticks to the previously established triple of `(language, feature, value)`, but refines the notion of the individual entities:

* `languages` are `languoids` and represent the object under investigation
* `features` are `parameters` and entail the comparative concepts that can be compared across different `languoids`
* `values` are `values` and represent measurements for `languoid`-`parameter` pairs

One additional layer expands this model as each triple can be and should be linked to one or multiple sources.
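
As a rough illustration of this model, each measurement can be thought of as a small record that ties a value to a languoid and a parameter and optionally lists its sources. The sketch below is not part of the CLDF specification or of any CLDF software; the class `Value`, the helper functions, and the source key `placeholder-source` are hypothetical, and the word forms are taken from the "bark" example earlier in this session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    """One measurement for a languoid-parameter pair, with optional sources."""
    languoid: str
    parameter: str
    value: str
    sources: tuple = ()


VALUES = [
    Value("German", "bark", "rɪndə", sources=("placeholder-source",)),
    Value("German", "bark", "bɔrkə", sources=("placeholder-source",)),
    Value("Dutch", "bark", "bɑst"),
]


def forms_by_languoid(values, parameter):
    """Group the values recorded for one parameter by languoid."""
    grouped = {}
    for item in values:
        if item.parameter == parameter:
            grouped.setdefault(item.languoid, []).append(item.value)
    return grouped


def unsourced(values):
    """Return the values that are not yet linked to any source."""
    return [item for item in values if not item.sources]


print(forms_by_languoid(VALUES, "bark"))
print(unsourced(VALUES))
```

Keeping one record per value means that synonyms simply become additional records, and missing sources can be found with a one-line check.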

## 4.4 CLDF Ontology

For our purposes, the CLDF Ontology serves as a catalogue of the entities that can exist within a CLDF data set. More generally speaking, the ontology lists the things that can exist within a CLDF data set and specifies the relationships between them.

Visiting [the CLDF ontology web site](http://cldf.clld.org/v1.0/terms.rdf) with a reasonably modern browser (Firefox recommended!) nicely 'illustrates' the CLDF ontology and its components.

## 4.5 CLDF in the Wild

To support working with real-world data, CLDF is organized into `modules` and `components`, which in turn allow tackling and modelling real-world problems.

The top-most level of organization of a CLDF data set is the `module` that is being used. CLDF provides support for multiple different `modules`:

* [`Wordlist`](https://github.com/cldf/cldf/tree/master/modules/Wordlist)
* [`StructureDataset`](https://github.com/cldf/cldf/tree/master/modules/StructureDataset)
* [`Dictionary`](https://github.com/cldf/cldf/tree/master/modules/Dictionary)
* [`ParallelText`](https://github.com/cldf/cldf/tree/master/modules/ParallelText)
* [`Generic`](https://github.com/cldf/cldf/tree/master/modules/Generic)

A finer-grained control over a module's content can be achieved by employing the different `components` the CLDF specification has to offer:

* [`Language Metadata`](https://github.com/cldf/cldf/tree/master/components/languages)
* [`Parameter Metadata`](https://github.com/cldf/cldf/tree/master/components/parameters)
* [`Values`](https://github.com/cldf/cldf/tree/master/components/values)
* [`Codes`](https://github.com/cldf/cldf/tree/master/components/codes)
* [`Entries`](https://github.com/cldf/cldf/tree/master/components/entries)
* [`Senses`](https://github.com/cldf/cldf/tree/master/components/senses)
* [`Examples`](https://github.com/cldf/cldf/tree/master/components/examples)
* [`Forms`](https://github.com/cldf/cldf/tree/master/components/forms)
* [`Cognates`](https://github.com/cldf/cldf/tree/master/components/cognates)
* [`CognateSets`](https://github.com/cldf/cldf/tree/master/components/cognatesets)
* [`Borrowings`](https://github.com/cldf/cldf/tree/master/components/borrowings)
* [`Functional Equivalents`](https://github.com/cldf/cldf/tree/master/components/functionalequivalents)
* [`Functional Equivalents Sets`](https://github.com/cldf/cldf/tree/master/components/functionalequivalentsets)

### 4.5.1 pycldf

...

### 4.5.2 CLDF & R -- It's just CSV!

...

## 4.6 CLDF Resources

You can find more information, tutorials, and examples here:

* http://cldf.clld.org/
* https://github.com/cldf/cldf
* https://github.com/cldf/cldf/tree/master/examples
* https://github.com/cldf/pycldf
* https://github.com/cldf/cookbook
* https://github.com/lexibank (see the individual data set repositories)

## 4.7 What's in It for You?

Creating CLDF-compliant data sets might at first seem daunting and not worth the additional overhead. However, you shouldn't be scared by the wealth of modules and components; for very many basic research scenarios a basic file structure that follows the `Generic` CLDF idea is fine and already brings a multitude of benefits.
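
To give an impression of how little is needed for the data itself, the sketch below writes and reads a plain CSV table of values with the Python standard library. It is an illustration only: the column names `ID`, `Language_ID`, `Parameter_ID`, and `Value` are assumed here to follow the defaults of the CLDF `Values` component, the language and parameter identifiers are simple placeholders, and a complete CLDF data set additionally comes with a metadata file describing its tables, which this sketch does not create.

```python
import csv
import tempfile
from pathlib import Path

# Assumed column names; check them against the CLDF specification before use.
COLUMNS = ["ID", "Language_ID", "Parameter_ID", "Value"]
ROWS = [
    ["1", "German", "bark", "rɪndə"],
    ["2", "German", "bark", "bɔrkə"],
    ["3", "Dutch", "bark", "bɑst"],
]


def write_values(directory, rows):
    """Write the rows to values.csv in the given directory and return its path."""
    path = Path(directory) / "values.csv"
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)
        writer.writerows(rows)
    return path


def read_values(path):
    """Read the table back as a list of dictionaries, one per value."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


# A temporary directory keeps the example self-contained.
path = write_values(tempfile.mkdtemp(), ROWS)
for row in read_values(path):
    print(row["Language_ID"], row["Parameter_ID"], row["Value"])
```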