pycldf

A Python package to read and write CLDF datasets.

Writing CLDF

from pycldf import Wordlist, Source

dataset = Wordlist.in_dir('mydataset')
dataset.add_sources(Source('book', 'Meier2005', author='Hans Meier', year='2005', title='The Book'))
dataset.write(FormTable=[
    {
        'ID': '1', 
        'Value': 'word',
        'Language_ID': 'abcd1234', 
        'Parameter_ID': '1277', 
        'Source': ['Meier2005[3-7]'],
    }])

results in

$ ls -1 mydataset/
forms.csv
sources.bib
Wordlist-metadata.json
  • mydataset/forms.csv
ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
1,abcd1234,1277,word,,,Meier2005[3-7]
  • mydataset/sources.bib
@book{Meier2005,
    author = {Meier, Hans},
    year = {2005},
    title = {The Book}
}

  • mydataset/Wordlist-metadata.json

Advanced writing

To add predefined CLDF components to a dataset, use the add_component method:

from pycldf import StructureDataset, term_uri

dataset = StructureDataset.in_dir('mydataset')
dataset.add_component('ParameterTable')
dataset.write(
    ValueTable=[{'ID': '1', 'Language_ID': 'abc', 'Parameter_ID': '1', 'Value': 'x'}],
    ParameterTable=[{'ID': '1', 'Name': 'Grammatical Feature'}])

It is also possible to add generic tables:

dataset.add_table('contributors.csv', term_uri('id'), term_uri('name'))

which can also be linked to other tables:

dataset.add_columns('ParameterTable', 'Contributor_ID')
dataset.add_foreign_key('ParameterTable', 'Contributor_ID', 'contributors.csv', 'ID')
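
Conceptually, the foreign key declared above states that every Contributor_ID value in the parameter table should match an ID in contributors.csv. The following is a minimal standard-library sketch of that constraint on in-memory rows. It does not use pycldf; the helper name dangling_references and the sample rows are hypothetical and only serve as illustration.

```python
def dangling_references(rows, column, target_rows, target_column):
    """Return the values of `column` in `rows` that have no match in `target_rows`."""
    known = {row[target_column] for row in target_rows}
    # An empty or missing value means "no reference" and is not reported.
    return [row[column] for row in rows if row.get(column) and row[column] not in known]


# Hypothetical sample data, standing in for contributors.csv and the parameter table.
contributors = [{'ID': 'c1', 'Name': 'Hans Meier'}]
parameters = [
    {'ID': '1', 'Name': 'Grammatical Feature', 'Contributor_ID': 'c1'},
    {'ID': '2', 'Name': 'Another Feature', 'Contributor_ID': 'c9'},
]

print(dangling_references(parameters, 'Contributor_ID', contributors, 'ID'))
```

Here the second parameter row refers to a contributor that does not exist, so its Contributor_ID is the only value reported.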

Addressing tables and columns

Tables in a dataset can be referenced using a Dataset's __getitem__ method, passing

  • a full CLDF Ontology URI for the corresponding component,
  • the local name of the component in the CLDF Ontology,
  • the url of the table.

Columns in a dataset can be referenced using a Dataset's __getitem__ method, passing a tuple (<TABLE>, <COLUMN>) where <TABLE> specifies a table as explained above and <COLUMN> is

  • a full CLDF Ontology URI used as propertyUrl of the column,
  • the name property of the column.
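
The three ways of addressing a table can be pictured with a small lookup function. This is only a sketch of the idea, not pycldf's implementation: the TABLES list, the file names and the function find_table are hypothetical, and the base URI is assumed to be the prefix that term_uri produces.

```python
# Assumed base URI of the CLDF Ontology terms.
CLDF_TERMS = 'http://cldf.clld.org/v1.0/terms.rdf#'

# Hypothetical stand-in for the tables of a dataset: each entry records the
# component it conforms to (if any) and the url of its data file.
TABLES = [
    {'component': 'ValueTable', 'url': 'values.csv'},
    {'component': 'ParameterTable', 'url': 'parameters.csv'},
    {'component': None, 'url': 'contributors.csv'},
]


def find_table(key):
    """Resolve `key` given as ontology URI, component name or table url."""
    for table in TABLES:
        component = table['component']
        if component and key in (CLDF_TERMS + component, component):
            return table
        if key == table['url']:
            return table
    raise KeyError(key)


# All three spellings address the same table.
print(find_table(CLDF_TERMS + 'ParameterTable') is find_table('ParameterTable'))
print(find_table('ParameterTable') is find_table('parameters.csv'))
```

A generic table such as contributors.csv has no component, so in this sketch it can only be addressed by its url.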

Reading CLDF

>>> from pycldf.dataset import Wordlist
>>> dataset = Wordlist.from_metadata('mydataset/Wordlist-metadata.json')
>>> print(dataset)
<cldf:v1.0:Wordlist at mydataset>
>>> forms = list(dataset['FormTable'])
>>> forms[0]
OrderedDict([('ID', '1'), ('Language_ID', 'abcd1234'), ('Parameter_ID', '1277'), ('Value', 'word'), ('Segments', []), ('Comment', None), ('Source', ['Meier2005[3-7]'])])
>>> refs = list(dataset.sources.expand_refs(forms[0]['Source']))
>>> refs
[<Reference Meier2005[3-7]>]
>>> print(refs[0].source)
Meier, Hans. 2005. The Book.
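
A reference such as Meier2005[3-7] combines a source key with an optional context (typically pages) in square brackets. The sketch below shows how such a string could be taken apart using only the standard library; REF_PATTERN and split_ref are hypothetical names, and this is not how expand_refs is implemented.

```python
import re

# A reference is a source key, optionally followed by a context in brackets.
REF_PATTERN = re.compile(r'^(?P<key>[^\[\]]+)(\[(?P<context>[^\]]*)\])?$')


def split_ref(ref):
    """Split a reference string into (source key, context or None)."""
    match = REF_PATTERN.match(ref)
    if match is None:
        raise ValueError('malformed reference: {0}'.format(ref))
    return match.group('key'), match.group('context')


print(split_ref('Meier2005[3-7]'))  # ('Meier2005', '3-7')
print(split_ref('Meier2005'))       # ('Meier2005', None)
```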

Command line usage

Installing the pycldf package will also install a command line interface cldf, which provides some sub-commands to manage CLDF datasets.

Summary statistics

$ cldf stats mydataset/Wordlist-metadata.json 
<cldf:v1.0:Wordlist at mydataset>

Path                   Type          Rows
---------------------  ----------  ------
forms.csv              Form Table       1
mydataset/sources.bib  Sources          1

Validation

By default, data files are read in strict mode, i.e. invalid rows cause an exception to be raised. To validate a data file, read it in validating mode instead, where problems are reported as warnings.

For example, the following output is generated

$ cldf validate mydataset/forms.csv
WARNING forms.csv: duplicate primary key: (u'1',)
WARNING forms.csv:4:Source missing source key: Mei2005

when reading the file

ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
1,abcd1234,1277,word,,,Meier2005[3-7]
1,stan1295,1277,hand,,,Meier2005[3-7]
2,stan1295,1277,hand,,,Mei2005[3-7]
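
The two kinds of problems reported above can be reproduced with a short standard-library sketch that collects findings instead of stopping at the first one. It is a simplified illustration, not the code behind cldf validate: check_forms and KNOWN_SOURCES are hypothetical, each row is assumed to carry at most one reference in Source, and the set of known keys is assumed to come from sources.bib.

```python
import csv
import io

FORMS_CSV = """\
ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
1,abcd1234,1277,word,,,Meier2005[3-7]
1,stan1295,1277,hand,,,Meier2005[3-7]
2,stan1295,1277,hand,,,Mei2005[3-7]
"""
KNOWN_SOURCES = {'Meier2005'}  # keys assumed to be defined in sources.bib


def check_forms(text, known_sources):
    """Collect (line number, message) pairs instead of raising on the first problem."""
    problems, seen_ids = [], set()
    # Data rows start on line 2 because line 1 holds the header.
    for lineno, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        if row['ID'] in seen_ids:
            problems.append((lineno, 'duplicate primary key: ' + row['ID']))
        seen_ids.add(row['ID'])
        # Drop the bracketed context to get the bare source key.
        key = row['Source'].split('[')[0]
        if key and key not in known_sources:
            problems.append((lineno, 'missing source key: ' + key))
    return problems


for lineno, message in check_forms(FORMS_CSV, KNOWN_SOURCES):
    print('WARNING forms.csv:{0}: {1}'.format(lineno, message))
```

Collecting all problems in one pass is what distinguishes this from strict reading, where the first invalid row would already abort processing.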

See also