Importing Corpora

The CLTK stores all data in the local directory cltk_data, which is created in the user's home directory (~/cltk_data) the first time the CorpusImporter() class is initialized. Within it are an originals directory, which preserves untouched copies of downloaded or copied files, and one directory for every language for which a corpus has been downloaded. It also contains cltk.log, which holds all CLTK logging.
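The layout described above can be inspected with the standard library alone. The following is a minimal sketch, not part of the CLTK API: the helper name describe_cltk_data is hypothetical, and it assumes the data directory is ~/cltk_data unless another path is passed in.

```python
from pathlib import Path


def describe_cltk_data(data_dir=None):
    """Summarize a cltk_data directory (hypothetical helper, not a CLTK function)."""
    # Assumption: the data directory is ~/cltk_data unless told otherwise.
    root = Path(data_dir) if data_dir is not None else Path.home() / "cltk_data"
    if not root.is_dir():
        # Nothing has been initialized or imported yet.
        return {"exists": False, "languages": [],
                "has_originals": False, "has_log": False}
    subdirs = sorted(p.name for p in root.iterdir() if p.is_dir())
    return {
        "exists": True,
        # Every subdirectory other than "originals" is treated as a language.
        "languages": [name for name in subdirs if name != "originals"],
        "has_originals": "originals" in subdirs,
        "has_log": (root / "cltk.log").is_file(),
    }
```

Calling describe_cltk_data() with no argument reports on the default location; passing a path is useful for testing against a scratch directory.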

Listing corpora

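To see all of the corpora available for importing in a given language, use list_corpora on a CorpusImporter instance. As the session below shows, it is accessed without parentheses.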

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: corpus_importer = CorpusImporter('greek')  # e.g., or CorpusImporter('latin')

In [3]: corpus_importer.list_corpora


Importing a corpus

To download a remote corpus, pass its name to import_corpus(). For example, for the Latin Library:

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: corpus_importer = CorpusImporter('latin')  # e.g., or CorpusImporter('greek')

In [3]: corpus_importer.import_corpus('latin_text_latin_library')
Downloaded 100% , 35.53 MiB | 3.28 MiB/s s

For a local corpus, such as the TLG, you must pass the filepath to the corpus as a second argument, e.g.:

In [4]: corpus_importer.import_corpus('tlg', '~/Documents/corpora/TLG_E/')
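Because a local import depends on a directory already being on disk, it can help to check the path before handing it to import_corpus(). The sketch below uses only the standard library; resolve_local_corpus_path is a hypothetical helper name, not a CLTK function, and whether import_corpus() itself expands "~" is not assumed either way.

```python
import os


def resolve_local_corpus_path(path):
    """Expand and validate a local corpus path (hypothetical helper)."""
    # Expand "~" and make the path absolute, independent of the shell.
    full_path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(full_path):
        raise FileNotFoundError("Local corpus directory not found: " + full_path)
    return full_path


# Possible usage with the importer from the session above:
# corpus_importer.import_corpus(
#     'tlg', resolve_local_corpus_path('~/Documents/corpora/TLG_E/'))
```

Failing early with the fully expanded path in the message makes a mistyped directory easier to spot.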

User-defined, distributed corpora

Most users will want to use the CLTK's publicly available corpora. However, users can import any repository that is hosted on a Git server. The benefit is that users can work with corpora that the CLTK organization cannot distribute itself (because they are too specific, carry license restrictions, etc.).

Let's say a user wants to keep a particular corpus in a Git repository. It can be cloned into the ~/cltk_data/ directory by declaring it in a manually created YAML file at ~/cltk_data/distributed_corpora.yaml, like the following:

example_distributed_latin_corpus:
    language: latin
    type: text

    language: pali
    type: treebank

Each block defines a separate corpus. The first line of a block (e.g., example_distributed_latin_corpus) gives the custom corpus a unique name; however, this name is not used elsewhere. The first example block would allow a user to fetch the repo and install it at ~/cltk_data/latin/text/latin_corpus_newton_example.
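The install location in the example above appears to follow the pattern ~/cltk_data/<language>/<type>/<repository name>. The sketch below spells that pattern out. It is inferred from the single example path in this section rather than taken from the CLTK source, and distributed_corpus_path is a hypothetical helper name.

```python
from pathlib import Path


def distributed_corpus_path(language, corpus_type, repo_name, data_dir=None):
    """Build the expected install path for a distributed corpus.

    Hypothetical helper; the <language>/<type>/<repo name> layout is an
    assumption inferred from the example path in the text.
    """
    # Assumption: the data directory is ~/cltk_data unless told otherwise.
    root = Path(data_dir) if data_dir is not None else Path.home() / "cltk_data"
    return root / language / corpus_type / repo_name


# The first example block (language: latin, type: text):
print(distributed_corpus_path("latin", "text", "latin_corpus_newton_example"))
```

The language and type values come straight from the YAML block, while the last path component is the repository's name rather than the block's name.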