
Add FilteredCorpus reader, helper methods, organize latin corpus types, correct prosody annotations, etc #846

Merged
merged 28 commits into from
Jan 6, 2019

Conversation

todd-cook
Collaborator

@todd-cook todd-cook commented Dec 30, 2018

I wrote this PR in part to address the corpus access problem: we have different corpora (text files, JSON files, XML, and other formats), and they should be accessed in a transparent way, i.e. the user shouldn't need to know about the backing format.
Also, people who add a corpus should guide the user by partitioning it into typical genres or eras if the corpus is large and diverse. For Latin, it's ideal for corpus access to allow filtering of works, so that one can grab Golden Age works without grabbing works from the medieval period, etc.

To these ends, I added a FilteredCorpusReader, definitions for grouping corpus types and files, and a generic method for assembling a filtered corpus. Ideally, I think this approach should be adopted elsewhere as well.
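In spirit, era-based filtering works something like the sketch below. All names here (the mapping, the helper function) are illustrative only, not the PR's actual API:

```python
# Hypothetical sketch of era-based corpus filtering; names are illustrative,
# not necessarily those used in this PR.
from typing import Dict, List

# A mapping of period -> author/directory names, as a corpus contributor might define it
corpus_directories_by_type: Dict[str, List[str]] = {
    "republican": ["caesar", "cicero", "lucretius"],
    "augustan": ["vergil", "horace", "ovid"],
    "medieval": ["abelard", "aquinas"],
}

def assemble_corpus_fileids(types_requested: List[str],
                            type_dirs: Dict[str, List[str]]) -> List[str]:
    """Collect file-id patterns for the requested corpus types only."""
    fileids = []
    for corpus_type in types_requested:
        for directory in type_dirs.get(corpus_type, []):
            fileids.append(f"{directory}/.*\\.txt")
    return fileids

# Grab Golden Age-era works without touching the medieval material
golden_age = assemble_corpus_fileids(["republican", "augustan"],
                                     corpus_directories_by_type)
```

The resulting patterns could then be handed to a plaintext corpus reader, which is the transparency goal: the caller asks for eras, not file paths.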
I added some utility functions in matrix_corpus_ops: a collection of functions useful for cleaning and diagnostics on a corpus in matrix format (e.g. in the scikit-learn pipeline Transformer approach).
There is also a function for featurization, and a helper function for getting an MD5 hash of a file, which is used in a forthcoming Jupyter notebook.
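For reference, an MD5 file-hash helper along these lines can be written with only the standard library. This is a sketch, not necessarily the PR's exact code:

```python
# Sketch of an MD5 file-hash helper for checking corpus-file integrity
# (illustrative; the helper added in this PR may differ in signature).
import hashlib

def md5_of_file(filepath: str, chunk_size: int = 8192) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(filepath, "rb") as infile:
        for chunk in iter(lambda: infile.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks matters for corpus work, since some data files (like the 32 MB macronizer file mentioned below) are too large to slurp comfortably.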

I have some Jupyter notebooks on a branch that leverage some of these changes, so I think the changes will make sense and can be well vetted; but of course, feedback is appreciated!

I also took the liberty of correcting some of my old code, namely updating annotations in the cltk.prosody.latin module so that it works better with the current mypy implementation.

All methods have doctest examples, and I also added some tests for the filtered corpus creation and assembly.

Lastly, although these methods are going to be used shortly in a notebook, the game plan is that the notebooks will help vet a revision of the CLTK word vectors, and possibly the creation of newer embeddings with FastText.

Scansion of Poetry
About the use of macrons in poetry
HexameterScanner
Hexameter
ScansionConstants
Syllabifier
Metrical Validator
ScansionFormatter
StringUtils module

Made minor formatting corrections elsewhere to quiet warnings encountered while transpiling the rst files during testing and verification.
doctests run with an added command line switch:
nosetests --no-skip --with-coverage --cover-package=cltk --with-doctest
1. Added gensim to the Travis build script's install step; its absence causes an error in word2vec.py during the build.
2. Modified transcription.py so that the macronizer is initialized on instantiation of the Transcriber class and not at the module level. The macronizer file is 32 MB, and this also seems to cause an error with Travis, as GitHub does not make large files displayable, so it may not be available for the build. The macronizer object has been made a component of `self`.
…uild from completing;

Soon we should update the dependencies of word2vec: gensim pulls in boto, which isn't Python 3 compliant. There is a boto3 version which we may be able to slot in, but perhaps the larger question is whether boto is necessary at all.
…nment with standard Python naming conventions
Adding utility code for featurization, functions for matrix operations on a corpus in matrix format.
Adding latin library corpus type file and directory mapping.
Doctest incorporated for every function.
@codecov-io

codecov-io commented Dec 30, 2018

Codecov Report

Merging #846 into master will increase coverage by 0.11%.
The diff coverage is 92.17%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #846      +/-   ##
==========================================
+ Coverage   89.38%   89.49%   +0.11%     
==========================================
  Files         185      190       +5     
  Lines       11641    11946     +305     
==========================================
+ Hits        10405    10691     +286     
- Misses       1236     1255      +19
Impacted Files Coverage Δ
cltk/prosody/latin/scansion_constants.py 100% <ø> (ø)
cltk/prosody/latin/pentameter_scanner.py 82.53% <100%> (ø)
cltk/utils/file_operations.py 88.57% <100%> (+2.85%) ⬆️
cltk/prosody/latin/hendecasyllable_scanner.py 85.71% <100%> (ø)
cltk/prosody/latin/metrical_validator.py 87.28% <100%> (ø)
cltk/prosody/latin/syllabifier.py 93.96% <100%> (ø)
cltk/prosody/latin/hexameter_scanner.py 89.4% <100%> (ø)
cltk/corpus/latin/latin_library_corpus_types.py 100% <100%> (ø)
cltk/prosody/latin/scansion_formatter.py 96.07% <100%> (ø)
cltk/corpus/latin/__init__.py 100% <100%> (+13.04%) ⬆️
... and 19 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de1fd6d...6ea27a5. Read the comment docs.

@kylepjohnson kylepjohnson self-assigned this Jan 6, 2019
Member

@kylepjohnson kylepjohnson left a comment


In my review, I focused on the corpus reader. I haven't pulled and tested locally, but the tests demonstrate the usage well.

Your approach to "filtering" docs by era is fine. No temporal boundaries are perfect, but your choices are sane.

About genre and future filtering capabilities, I will share some ideas I've seen (a file inside a corpus's repo which acts as a map; or leaning on some third party's canonical list of authors and texts, as here: http://docs.cltk.org/en/latest/latin.html#id1). IIRC, in an email you suggested adding metadata (e.g. genre, date, region) at the top of each file. I am less enthusiastic about this because it would set a precedent of the CLTK editing others' data. But let's keep discussing, so we understand each other well.

I'd like to use the CLTK logger but won't hold up merging for this.

Finally, about docs: since this is newish, we don't need to add docs yet, so long as we don't forget. I'll add them if you like; just let me know.

from cltk.tokenize.sentence import TokenizeSentence
from cltk.tokenize.word import WordTokenizer

LOG = logging.getLogger(__name__)
Member


We also have our own logger, example here:

logger.error(pickle_error)

Is there anything stopping us from using it here?

Collaborator Author

@todd-cook todd-cook Jan 6, 2019


The lines:

    LOG = logging.getLogger(__name__)
    LOG.addHandler(logging.NullHandler())

are the preferred method for library logging, as recommended by the Python Cookbook.

A common logging gateway such as the CLTK logger is fine for methods or approaches that wrap the library; but if someone is composing their own tools out of the CLTK classes, it's best to delegate to the common logger they instantiate.

Here's the quote from the Python Cookbook section:

13.12. Adding Logging to Libraries

Problem
You would like to add a logging capability to a library, but don't want it to interfere with programs that don't use logging.

Solution
For libraries that want to perform logging, you should create a dedicated logger object, and initially configure it as follows:

    # somelib.py
    import logging
    log = logging.getLogger(__name__)
    log.addHandler(logging.NullHandler())

    # Example function (for testing)
    def func():
        log.critical('A Critical Error!')
        log.debug('A debug message')

With this configuration, no logging will occur by default. For example:

    >>> import somelib
    >>> somelib.func()

However, if the logging system gets configured, log messages will start to appear. For example:

    >>> import logging
    >>> logging.basicConfig()
    >>> somelib.func()
    CRITICAL:somelib:A Critical Error!

Discussion
Libraries present a special problem for logging, since information about the environment in which they are used isn’t known. As a general rule, you should never write library code that tries to configure the logging system on its own or which makes assumptions about an already existing logging configuration. Thus, you need to take great care to provide isolation.
The call to getLogger(__name__) creates a logger module that has the same name as the calling module. Since all modules are unique, this creates a dedicated logger that is likely to be separate from other loggers.
The log.addHandler(logging.NullHandler()) operation attaches a null handler to the just created logger object. A null handler ignores all logging messages by default. Thus, if the library is used and logging is never configured, no messages or warnings will appear.
One subtle feature of this recipe is that the logging of individual libraries can be independently configured, regardless of other logging settings. For example, consider the following code:
>>> import logging
>>> logging.basicConfig(level=logging.ERROR) 
>>> import somelib
>>> somelib.func()
CRITICAL:somelib:A Critical Error!
>>> # Change the logging level for 'somelib' only
>>> logging.getLogger('somelib').level = logging.DEBUG
>>> somelib.func()
CRITICAL:somelib:A Critical Error!
DEBUG:somelib:A debug message

Here, the root logger has been configured to only output messages at the ERROR level or higher. However, the level of the logger for somelib has been separately configured to output debugging messages. That setting takes precedence over the global setting.
The ability to change the logging settings for a single module like this can be a useful debugging tool, since you don’t have to change any of the global logging settings—simply change the level for the one module where you want more output.

(Excuse the sloppy copy/paste, but I wanted to convey the full argument.)

Member


I will read your note carefully when I am able. To flag this topic, I have opened an issue so it doesn't fall through the cracks: #853

from cltk.prosody.latin.Syllabifier import Syllabifier
import cltk.prosody.latin.StringUtils as StringUtils
from cltk.prosody.latin.VerseScanner import VerseScanner
import cltk.prosody.latin.string_utils as StringUtils
Member


Not a big deal, but curious why StringUtils is camel case.

Collaborator Author


I'm trying to be consistent and still falling down. Good news and bad news: there'll be another PR.

LOG.addHandler(logging.NullHandler())


def word_to_features(word: str, max_word_length: int = 20) -> List[int]:
Member


👍

Member


Note: In the future we may want something similar to TF's VocabularyProcessor (http://tflearn.org/data_utils/#vocabulary-processor), and for whatever features our model-builders regularly need (thinking of prefix/suffix strings).

Collaborator Author


Good point. Yes, we'll need to revisit this, and/or expand on it.
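For context, a featurizer with the signature shown above could, as a rough sketch, encode each character as an integer and pad to a fixed width. This is a hypothetical implementation; the PR's actual encoding may differ:

```python
from typing import List

def word_to_features(word: str, max_word_length: int = 20) -> List[int]:
    """Map each character to its ordinal value, zero-padded to a fixed width.

    Sketch only: the real featurizer in the PR may use a different encoding
    (e.g. a learned vocabulary index rather than raw ordinals).
    """
    features = [ord(char) for char in word[:max_word_length]]  # truncate long words
    features += [0] * (max_word_length - len(features))        # pad short words
    return features
```

Fixed-width integer vectors like this are what a scikit-learn Transformer or a TF-style VocabularyProcessor would consume downstream.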

@kylepjohnson kylepjohnson merged commit 320e810 into cltk:master Jan 6, 2019
kylepjohnson added a commit that referenced this pull request Jan 6, 2019
Addition of a corpus reader, by @todd-cook , which inherits from the NLTK's, and adds to it filtering specific to the CLTK's hosted Latin Library texts (<#846>). Also misc cleanup of verse modules by Todd.
@todd-cook todd-cook deleted the corpus_reader branch January 6, 2019 06:04