
Add FilteredCorpus reader, helper methods, organize latin corpus types, correct prosody annotations, etc #846

Merged
merged 28 commits into from
Jan 6, 2019

Conversation

todd-cook
Collaborator

@todd-cook todd-cook commented Dec 30, 2018

I wrote this PR in part to address the corpus access problem: we have different corpora (text files, JSON files, XML, and other formats), and they should be accessed in a transparent way, i.e. the user shouldn't need to know about the backing format.
Also, people who add a corpus should guide the user by partitioning it into typical genres or eras if the corpus is large and diverse. For Latin, it's ideal for corpus access to allow filtering of works, so that one can grab Golden Age works without grabbing works from the medieval period, etc.

To these ends, I added a FilteredCorpusReader, definitions for grouping corpus types and files, and a generic method for assembling a filtered corpus. Ideally, I think this approach should be adopted elsewhere as well.
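In spirit, era-based filtering works something like the sketch below. All names here (the mapping, the helper function) are illustrative only, not the PR's actual API:

```python
# Hypothetical sketch of era-based corpus filtering; names are illustrative,
# not necessarily those used in this PR.
from typing import Dict, List

# A mapping of period -> author/directory names, as a corpus contributor might define it
corpus_directories_by_type: Dict[str, List[str]] = {
    "republican": ["caesar", "cicero", "lucretius"],
    "augustan": ["vergil", "horace", "ovid"],
    "medieval": ["abelard", "aquinas"],
}

def assemble_corpus_fileids(types_requested: List[str],
                            type_dirs: Dict[str, List[str]]) -> List[str]:
    """Collect file-id patterns for the requested corpus types only."""
    fileids = []
    for corpus_type in types_requested:
        for directory in type_dirs.get(corpus_type, []):
            fileids.append(f"{directory}/.*\\.txt")
    return fileids

# Grab Golden Age-era works without touching the medieval material
golden_age = assemble_corpus_fileids(["republican", "augustan"],
                                     corpus_directories_by_type)
```

The resulting patterns could then be handed to a plaintext corpus reader, which is the transparency goal: the caller asks for eras, not file paths.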
I added some utility functions in matrix_corpus_ops: a collection of functions useful for cleaning and diagnostics on a corpus in matrix format (e.g. in the scikit-learn pipeline Transformer approach).
There is also a function for featurization, and a helper function for getting an MD5 hash of a file, which is used in a forthcoming Jupyter notebook.
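For reference, an MD5 file-hash helper along these lines can be written with only the standard library. This is a sketch, not necessarily the PR's exact code:

```python
# Sketch of an MD5 file-hash helper for checking corpus-file integrity
# (illustrative; the helper added in this PR may differ in signature).
import hashlib

def md5_of_file(filepath: str, chunk_size: int = 8192) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(filepath, "rb") as infile:
        for chunk in iter(lambda: infile.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks matters for corpus work, since some data files (like the 32 MB macronizer file mentioned below) are too large to slurp comfortably.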

I have some Jupyter notebooks on a branch that leverage some of these changes, so I think the changes will make sense and can be well vetted; but of course, feedback is appreciated!

I also took the liberty of correcting some of my old code, namely updating annotations in the cltk.prosody.latin module so that it works better with the current mypy implementation.

All methods have doctest examples, and I also added some tests for the filtered corpus creation and assembly.

Lastly, although these methods are going to be used shortly in a notebook, the game plan is that the notebooks will help vet a revision of the CLTK word vectors, and possibly the creation of newer embeddings with FastText.

Scansion of Poetry
About the use of macrons in poetry
HexameterScanner
Hexameter
ScansionConstants
Syllabifier
Metrical Validator
ScansionFormatter
StringUtils module

Made minor formatting corrections elsewhere to quiet warnings encountered while transpiling the rst files during testing and verification.
doctests run with an added command line switch:
nosetests --no-skip --with-coverage --cover-package=cltk --with-doctest
1. Added gensim to the Travis build script's install step; its absence causes an error in word2vec.py during the build.
2. Modified transcription.py so that the macronizer is initialized on instantiation of the Transcriber class and not at the module level. The macronizer file is 32 MB, and this also seems to cause an error with Travis, as GitHub does not make large files displayable, so it may not be available for the build. The macronizer object has been made a component of `self`.
…uild from completing;

Soon we should update the dependencies of word2vec: gensim pulls in boto, which isn't Python 3 compliant. There is a boto3 version which we may be able to slot in, but perhaps the larger question is whether boto is necessary at all.
…nment with standard Python naming conventions
Adding utility code for featurization, functions for matrix operations on a corpus in matrix format.
Adding latin library corpus type file and directory mapping.
Doctest incorporated for every function.
@codecov-io

codecov-io commented Dec 30, 2018

Codecov Report

Merging #846 into master will increase coverage by 0.11%.
The diff coverage is 92.17%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #846      +/-   ##
==========================================
+ Coverage   89.38%   89.49%   +0.11%     
==========================================
  Files         185      190       +5     
  Lines       11641    11946     +305     
==========================================
+ Hits        10405    10691     +286     
- Misses       1236     1255      +19
Impacted Files Coverage Δ
cltk/prosody/latin/scansion_constants.py 100% <ø> (ø)
cltk/prosody/latin/pentameter_scanner.py 82.53% <100%> (ø)
cltk/utils/file_operations.py 88.57% <100%> (+2.85%) ⬆️
cltk/prosody/latin/hendecasyllable_scanner.py 85.71% <100%> (ø)
cltk/prosody/latin/metrical_validator.py 87.28% <100%> (ø)
cltk/prosody/latin/syllabifier.py 93.96% <100%> (ø)
cltk/prosody/latin/hexameter_scanner.py 89.4% <100%> (ø)
cltk/corpus/latin/latin_library_corpus_types.py 100% <100%> (ø)
cltk/prosody/latin/scansion_formatter.py 96.07% <100%> (ø)
cltk/corpus/latin/__init__.py 100% <100%> (+13.04%) ⬆️
... and 19 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de1fd6d...6ea27a5. Read the comment docs.

@kylepjohnson kylepjohnson self-assigned this Jan 6, 2019
Member

@kylepjohnson kylepjohnson left a comment


In my review, I focused on the corpus reader. I haven't pulled and tested locally, but the tests demonstrate the usage well.

Your approach to "filtering" docs by era is fine. No temporal boundaries are perfect, but your choices are sane.

About genre and future filtering capabilities, I will share some ideas I've seen (a file inside a corpus's repo which acts as a map; or leaning on some third party's canonical list of authors and texts, as here: http://docs.cltk.org/en/latest/latin.html#id1). IIRC, in an email you suggested adding metadata (e.g. genre, date, region) at the top of each file. I am less enthusiastic about this because it would set a precedent of the CLTK editing others' data. But let's keep discussing, so we understand each other well.

I'd like to use the CLTK logger but won't hold up merging for this.

Finally, about docs: since this is newish, we don't need to add docs yet, so long as we don't forget. I'll add them if you like; just let me know.

from cltk.tokenize.sentence import TokenizeSentence
from cltk.tokenize.word import WordTokenizer

LOG = logging.getLogger(__name__)
Member


We also have our own logger, example here:

logger.error(pickle_error)

Is there anything stopping us from using it here?

Collaborator Author

@todd-cook todd-cook Jan 6, 2019


The lines:

    LOG = logging.getLogger(__name__)
    LOG.addHandler(logging.NullHandler())

are the preferred method for library logging, as recommended by the Python Cookbook.

A common logging gateway such as the CLTK logger is fine for methods or approaches that wrap the library; but if someone is composing their own tools out of the CLTK classes, it's best to delegate to the common logger they instantiate.

Here's the quote from the Python Cookbook section:

13.12. Adding Logging to Libraries

Problem
You would like to add a logging capability to a library, but don't want it to interfere with programs that don't use logging.

Solution
For libraries that want to perform logging, you should create a dedicated logger object, and initially configure it as follows:

    # somelib.py
    import logging
    log = logging.getLogger(__name__)
    log.addHandler(logging.NullHandler())

    # Example function (for testing)
    def func():
        log.critical('A Critical Error!')
        log.debug('A debug message')

With this configuration, no logging will occur by default. For example:

    >>> import somelib
    >>> somelib.func()

However, if the logging system gets configured, log messages will start to appear. For example:

    >>> import logging
    >>> logging.basicConfig()
    >>> somelib.func()
    CRITICAL:somelib:A Critical Error!

Discussion
Libraries present a special problem for logging, since information about the environment in which they are used isn’t known. As a general rule, you should never write library code that tries to configure the logging system on its own or which makes assumptions about an already existing logging configuration. Thus, you need to take great care to provide isolation.
The call to getLogger(__name__) creates a logger module that has the same name as the calling module. Since all modules are unique, this creates a dedicated logger that is likely to be separate from other loggers.
The log.addHandler(logging.NullHandler()) operation attaches a null handler to the just created logger object. A null handler ignores all logging messages by default. Thus, if the library is used and logging is never configured, no messages or warnings will appear.
One subtle feature of this recipe is that the logging of individual libraries can be independently configured, regardless of other logging settings. For example, consider the following code:
>>> import logging
>>> logging.basicConfig(level=logging.ERROR) 
>>> import somelib
>>> somelib.func()
CRITICAL:somelib:A Critical Error!
>>> # Change the logging level for 'somelib' only
>>> logging.getLogger('somelib').level = logging.DEBUG
>>> somelib.func()
CRITICAL:somelib:A Critical Error!
DEBUG:somelib:A debug message

Here, the root logger has been configured to only output messages at the ERROR level or higher. However, the level of the logger for somelib has been separately configured to output debugging messages. That setting takes precedence over the global setting.
The ability to change the logging settings for a single module like this can be a useful debugging tool, since you don’t have to change any of the global logging settings—simply change the level for the one module where you want more output.

(Excuse the sloppy copy/paste, but I wanted to convey the full argument.)

Member


I will read your note carefully when I am able. To flag this topic, I have opened an issue so it doesn't fall through the cracks: #853

from cltk.prosody.latin.Syllabifier import Syllabifier
import cltk.prosody.latin.StringUtils as StringUtils
from cltk.prosody.latin.VerseScanner import VerseScanner
import cltk.prosody.latin.string_utils as StringUtils
Member


Not a big deal, but curious why StringUtils is camel case.

Collaborator Author


I'm trying to be consistent and still falling down. Good news and bad news: there'll be another PR.

LOG.addHandler(logging.NullHandler())


def word_to_features(word: str, max_word_length: int = 20) -> List[int]:
Member


👍

Member


Note: In the future we may want something similar to TF's VocabularyProcessor (http://tflearn.org/data_utils/#vocabulary-processor), and for whatever features our model-builders regularly need (thinking of prefix/suffix strings).

Collaborator Author


Good point. Yes, we'll need to revisit this, and/or expand on it.
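For context, a featurizer with the signature shown above could, as a rough sketch, encode each character as an integer and pad to a fixed width. This is a hypothetical implementation; the PR's actual encoding may differ:

```python
from typing import List

def word_to_features(word: str, max_word_length: int = 20) -> List[int]:
    """Map each character to its ordinal value, zero-padded to a fixed width.

    Sketch only: the real featurizer in the PR may use a different encoding
    (e.g. a learned vocabulary index rather than raw ordinals).
    """
    features = [ord(char) for char in word[:max_word_length]]  # truncate long words
    features += [0] * (max_word_length - len(features))        # pad short words
    return features
```

Fixed-width integer vectors like this are what a scikit-learn Transformer or a TF-style VocabularyProcessor would consume downstream.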

@kylepjohnson kylepjohnson merged commit 320e810 into cltk:master Jan 6, 2019
kylepjohnson added a commit that referenced this pull request Jan 6, 2019
Addition of a corpus reader, by @todd-cook , which inherits from the NLTK's, and adds to it filtering specific to the CLTK's hosted Latin Library texts (<#846>). Also misc cleanup of verse modules by Todd.
@todd-cook todd-cook deleted the corpus_reader branch January 6, 2019 06:04