
Add FilteredCorpus reader, helper methods, organize latin corpus types, correct prosody annotations, etc. (#846)

* Initial release with unit tests and doctests

* Added sections and preliminary documentation for:
  - Scansion of Poetry
  - About the use of macrons in poetry
  - Metrical Validator
  - StringUtils module

Made minor formatting corrections elsewhere to quiet warnings encountered while transpiling the rst files during testing and verification.

* corrected documentation and doctest comments that were causing errors.
Doctests run with an added command-line switch:
nosetests --no-skip --with-coverage --cover-package=cltk --with-doctest

* fixing broken doctest comment

* correcting a documentation comment that caused a doctest error

* Corrections to make the build pass:
1. Added an install of gensim to the travis build script; its absence was causing an error during the build.
2. The macronizer is now initialized on instantiation of the Transcriber class rather than at the module level. The macronizer file is 32MB, which also seems to cause an error with travis, since github does not make large files displayable and so the file may not be available for the build. The macronizer object has been made a component of "self".

* moved a package import inside of main so that it does not prevent the build from completing.
Soon we should update the dependencies of word2vec: gensim pulls in boto, which isn't Python 3 compliant. There is a boto3 version that we may be able to slot in, but perhaps the larger question is whether boto is necessary at all.

* correcting documentation

* corrections for type annotations

* corrections for type annotations; renamed modules to align more closely with standard Python naming conventions

* Adding FilteredCorpusReader and an assemble_corpus function, with tests.
Adding utility code for featurization, and functions for matrix operations on a corpus in matrix format.
Adding a latin library corpus type file and directory mapping.
Doctests incorporated for every function.
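The featurization utilities described above can be sketched with plain Python: turn tokenized documents into a term-count matrix that the matrix-operation helpers would then consume. Function names here are illustrative, not the helpers added in this commit:

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(docs: List[List[str]]) -> Dict[str, int]:
    """Map each distinct token to a stable column index."""
    vocab: Dict[str, int] = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_count_matrix(docs: List[List[str]],
                    vocab: Dict[str, int]) -> List[List[int]]:
    """One row per document, one column per vocabulary term."""
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = [0] * len(vocab)
        for token, count in counts.items():
            if token in vocab:
                row[vocab[token]] = count
        matrix.append(row)
    return matrix
```

With the corpus in this matrix form, downstream operations (row sums, similarity, etc.) reduce to ordinary matrix arithmetic.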

* adding get_corpus_reader to latin corpus
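A self-contained sketch of the factory pattern behind a `get_corpus_reader` convenience function — look a corpus name up in a registry and hand back a configured reader. The reader class and registry contents below are stand-ins, not the actual CLTK implementation:

```python
from typing import Callable, Dict, List


class PlaintextReader:
    """Tiny stand-in for an NLTK-style plaintext corpus reader."""

    def __init__(self, fileid_to_text: Dict[str, str]):
        self._docs = fileid_to_text

    def fileids(self) -> List[str]:
        return sorted(self._docs)

    def words(self, fileid: str) -> List[str]:
        return self._docs[fileid].split()


# Registry of known corpora; a real registry would point at on-disk paths.
_CORPORA: Dict[str, Callable[[], PlaintextReader]] = {
    'latin_text_latin_library': lambda: PlaintextReader(
        {'vergil/aen1.txt': 'arma virumque cano'}),
}


def get_corpus_reader(corpus_name: str) -> PlaintextReader:
    """Look up and build a reader for a named corpus."""
    try:
        return _CORPORA[corpus_name]()
    except KeyError:
        raise ValueError(f'Unknown corpus: {corpus_name!r}') from None
```

Centralizing construction this way means callers never hard-code corpus paths or tokenizer setup; they ask for a corpus by name.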

* corrections for type annotations

* corrections for tests

* corrections for tests

* increasing test coverage

* adjusting travis to pull the test corpora

* remove unused imports

* making FilteredCorpus tests clearer about what they are looking for

* add corpus_readers documentation, clean up readers and test_corpus

* add carriage return
todd-cook authored and kylepjohnson committed Jan 6, 2019
1 parent de1fd6d commit 320e810184204d9171e683241adea4c5c1c73a04
@@ -24,6 +24,7 @@ before_script:
 - pip install numpy
 - pip install scipy
 - pip install scikit-learn
+- pip install gensim
 - python cltk/tests/

# Notes on nose:
@@ -1,44 +1,11 @@
-# CLTK: Latin Corpus Readers
-
-__author__ = ['Patrick J. Burns <>']
-__license__ = 'MIT License. See LICENSE.'
-
-"""CLTK Latin corpus readers"""
+"""CLTK: Corpus Latin properties"""
-
-import os.path
-from nltk.corpus.reader.plaintext import PlaintextCorpusReader
-from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
-
-from cltk.tokenize.sentence import TokenizeSentence
-from cltk.tokenize.word import WordTokenizer
-
-# Would like to have this search through a CLTK_DATA environment variable
-# Better to use something like make_cltk_path in cltk.utils.file_operations?
-home = os.path.expanduser('~')
-cltk_path = os.path.join(home, 'cltk_data')
-if not os.path.isdir(cltk_path):
-
-word_tokenizer = WordTokenizer('latin')
+__author__ = ['Patrick J. Burns <>', 'Todd Cook <>']
+__license__ = 'MIT License. See LICENSE.'
-
-if os.path.exists(cltk_path + 'latin/model/latin_models_cltk/tokenizers/sentence'):
-    sent_tokenizer = TokenizeSentence('latin')
-    punkt_param = PunktParameters()
-    abbreviations = ['c', 'l', 'm', 'p', 'q', 't', 'ti', 'sex', 'a', 'd', 'cn', 'sp', "m'", 'ser', 'ap', 'n', 'v', 'k', 'mam', 'post', 'f', 'oct', 'opet', 'paul', 'pro', 'sert', 'st', 'sta', 'v', 'vol', 'vop']
-    punkt_param.abbrev_types = set(abbreviations)
-    sent_tokenizer = PunktSentenceTokenizer(punkt_param)
-
-# Latin Library
-latinlibrary = PlaintextCorpusReader(cltk_path + '/latin/text/latin_text_latin_library',
-except IOError as e:
-    # print("Corpus not found. Please check that the Latin Library is installed in CLTK_DATA.")
+abbreviations = ['c', 'l', 'm', 'p', 'q', 't', 'ti', 'sex', 'a', 'd', 'cn', 'sp', "m'", 'ser',
+                 'ap', 'n', 'v', 'k', 'mam', 'post', 'f', 'oct', 'opet', 'paul', 'pro', 'sert',
+                 'st', 'sta', 'v', 'vol', 'vop']
@@ -0,0 +1,303 @@
"""`latin_library_corpus_types` - a mapping of corpus types into common periods, based largely on:
and some personal choices, e.g.: the inscrutable Twelve Tables is placed in an 'early' latin
classification, while Plautus and Terence are in the Old latin section, some uncertain items are
binned into 'misc'. Pull requests to further sort this out are welcome!

texts_to_remove_from_fileids = [

# ontology map directories

corpus_directories_by_type = {

'republican': [
'augustan': [
'early_silver': [
'late_silver': [
'old': [
'christian': [
'medieval': [
'renaissance': [
'neo_latin': [
#: uncategorized
'early': []

#### by text

corpus_texts_by_type = {
'republican': [
'augustan': [
'early_silver': [
'late_silver': [
'old': [
'early': [
'medieval': [
'christian': [
'renaissance': [
'neo_latin': [
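A mapping like `corpus_directories_by_type` is meant to drive corpus filtering: given the periods a caller wants, keep only the fileids under directories mapped to those periods. A minimal sketch, assuming fileids are paths prefixed by their directory (the mapping contents below are abbreviated and illustrative, not the real tables):

```python
from typing import Dict, List

# Abbreviated, illustrative stand-in for corpus_directories_by_type.
corpus_directories_by_type: Dict[str, List[str]] = {
    'republican': ['./cicero', './caesar'],
    'augustan': ['./vergil', './horace'],
}


def fileids_for_types(all_fileids: List[str],
                      type_map: Dict[str, List[str]],
                      wanted_types: List[str]) -> List[str]:
    """Keep only fileids under a directory mapped to one of the wanted types."""
    prefixes = tuple(prefix
                     for wanted in wanted_types
                     for prefix in type_map.get(wanted, []))
    # str.startswith accepts a tuple of prefixes; an empty tuple matches nothing.
    return [fid for fid in all_fileids if fid.startswith(prefixes)]
```

A filtered corpus reader can then be built from just the fileids that survive this selection.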
