Add FilteredCorpus reader, helper methods, organize latin corpus types, correct prosody annotations, etc #846
Conversation
Scansion of Poetry: about the use of macrons in poetry. Touches HexameterScanner, Hexameter, ScansionConstants, Syllabifier, MetricalValidator, ScansionFormatter, and the StringUtils module. Made minor formatting corrections elsewhere to quiet warnings encountered while transpiling the rst file during testing and verification.
Doctests run with an added command-line switch: nosetests --no-skip --with-coverage --cover-package=cltk --with-doctest
1. Added a gensim install to the Travis build script; its absence was causing an error in word2vec.py during the build. 2. Modified transcription.py so that the macronizer is initialized on instantiation of the Transcriber class and not at the module level; the macronizer file is 32 MB, and this also seems to cause an error with Travis, since GitHub does not make large files displayable and so it may not be available for the build. The macronizer object has been made a component of "self".
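The deferred-initialization change described above can be sketched roughly as follows. The Transcriber and macronizer names come from the PR description, but the class bodies here are stand-ins, not the actual CLTK code:

```python
# Sketch of moving a heavyweight resource from module level into a class.
# Before: the macronizer was built at import time, so merely importing the
# module loaded the 32 MB data file:
#
#   MACRONIZER = Macronizer()   # module level: runs on every import
#
# After: it is built only when the class is instantiated, and kept on self.

class Macronizer:
    """Stand-in for the real (expensive-to-load) macronizer."""
    def __init__(self):
        self.loaded = True  # imagine a 32 MB data file being read here

class Transcriber:
    def __init__(self):
        # Deferred until someone actually constructs a Transcriber.
        self.macronizer = Macronizer()

    def transcribe(self, text: str) -> str:
        # Placeholder: the real method would macronize, then transcribe.
        return text
```

With this shape, importing the module is cheap; the cost is paid only by code that actually constructs a Transcriber.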
…uild from completing; soon we should update the dependencies of word2vec. gensim pulls in boto, which isn't Python 3 compliant; there is a boto3 version which we may be able to slot in, but perhaps the larger question is: is boto necessary at all?
…nment with standard Python naming conventions
Adding utility code for featurization and functions for matrix operations on a corpus in matrix format. Adding a Latin Library corpus type file and directory mapping. Doctests incorporated for every function.
Codecov Report
@@            Coverage Diff             @@
##           master     #846      +/-   ##
==========================================
+ Coverage   89.38%   89.49%   +0.11%
==========================================
  Files         185      190       +5
  Lines       11641    11946     +305
==========================================
+ Hits        10405    10691     +286
- Misses       1236     1255      +19
Continue to review full report at Codecov.
In my review, I focused on the corpus reader. I haven't pulled and tested locally, but the tests demonstrate its use well.
Your approach to "filtering" docs by era is fine. No temporal boundaries are perfect but your choices are sane.
About genre and future filtering capabilities, I will share some ideas I've seen: a file inside a corpus's repo which acts as a map, or leaning on some third party's canonical list of authors and texts, as here: http://docs.cltk.org/en/latest/latin.html#id1 . IIRC, in an email you suggested adding metadata (e.g. genre, date, region) at the top of each file. I am less enthusiastic about this because it would set a precedent of the CLTK editing others' data. But let's keep discussing, so we understand each other well.
I'd like to use the CLTK logger but won't hold up merging for this.
Finally, about docs: Since this is newish, we don't need to add docs, so long as we don't forget. I'll add if you like, just let me know.
from cltk.tokenize.sentence import TokenizeSentence
from cltk.tokenize.word import WordTokenizer

LOG = logging.getLogger(__name__)
We also have our own logger, example here:
cltk/cltk/utils/file_operations.py
Line 40 in 9b9cdb4
logger.error(pickle_error)
Is there anything stopping us from using it here?
The lines:

    LOG = logging.getLogger(__name__)
    LOG.addHandler(logging.NullHandler())

are the preferred method for library logging, as recommended by the Python Cookbook.
A common logging gateway such as the cltk logger is fine for methods or approaches that wrap the library; but if someone is composing applications out of the cltk classes, it's best to delegate to, or inherit from, the common logger they instantiate.
Here's the quote from the Python Cookbook section:
13.12. Adding Logging to Libraries

Problem
You would like to add a logging capability to a library, but don't want it to interfere with programs that don't use logging.

Solution
For libraries that want to perform logging, you should create a dedicated logger object, and initially configure it as follows:

    # somelib.py
    import logging
    log = logging.getLogger(__name__)
    log.addHandler(logging.NullHandler())

    # Example function (for testing)
    def func():
        log.critical('A Critical Error!')
        log.debug('A debug message')

With this configuration, no logging will occur by default. For example:

    >>> import somelib
    >>> somelib.func()

However, if the logging system gets configured, log messages will start to appear. For example:

    >>> import logging
    >>> logging.basicConfig()
    >>> somelib.func()
    CRITICAL:somelib:A Critical Error!

Discussion
Libraries present a special problem for logging, since information about the environment in which they are used isn't known. As a general rule, you should never write library code that tries to configure the logging system on its own or which makes assumptions about an already existing logging configuration. Thus, you need to take great care to provide isolation.

The call to getLogger(__name__) creates a logger module that has the same name as the calling module. Since all modules are unique, this creates a dedicated logger that is likely to be separate from other loggers. The log.addHandler(logging.NullHandler()) operation attaches a null handler to the just-created logger object. A null handler ignores all logging messages by default. Thus, if the library is used and logging is never configured, no messages or warnings will appear.

One subtle feature of this recipe is that the logging of individual libraries can be independently configured, regardless of other logging settings. For example, consider the following code:

    >>> import logging
    >>> logging.basicConfig(level=logging.ERROR)
    >>> import somelib
    >>> somelib.func()
    CRITICAL:somelib:A Critical Error!
    >>> # Change the logging level for 'somelib' only
    >>> logging.getLogger('somelib').level = logging.DEBUG
    >>> somelib.func()
    CRITICAL:somelib:A Critical Error!
    DEBUG:somelib:A debug message

Here, the root logger has been configured to only output messages at the ERROR level or higher. However, the level of the logger for somelib has been separately configured to output debugging messages. That setting takes precedence over the global setting. The ability to change the logging settings for a single module like this can be a useful debugging tool, since you don't have to change any of the global logging settings; simply change the level for the one module where you want more output.
(Excuse the sloppy copy/paste, but I wanted to convey the full argument.)
I will read your note carefully when I am able. To flag this topic, I have opened an issue so it doesn't fall through the cracks: #853
from cltk.prosody.latin.Syllabifier import Syllabifier
- import cltk.prosody.latin.StringUtils as StringUtils
+ import cltk.prosody.latin.string_utils as StringUtils
from cltk.prosody.latin.VerseScanner import VerseScanner
Not a big deal, but curious why StringUtils is camel case.
I'm trying to be consistent and still falling down. Good news, bad news: there'll be another PR.
LOG.addHandler(logging.NullHandler())

def word_to_features(word: str, max_word_length: int = 20) -> List[int]:
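One plausible implementation of a featurizer with this signature, kept as a hedged sketch since the PR's actual body may differ: map each character to an integer code, truncate at max_word_length, and zero-pad shorter words so every word yields a fixed-length vector.

```python
from typing import List

def word_to_features(word: str, max_word_length: int = 20) -> List[int]:
    """Encode a word as a fixed-length list of integer character codes.

    Sketch of what a featurizer with this signature might do; the PR's
    real implementation may differ. Characters beyond max_word_length
    are truncated; shorter words are padded with zeros.
    """
    codes = [ord(ch) for ch in word[:max_word_length]]
    return codes + [0] * (max_word_length - len(codes))
```

For example, word_to_features("amor") starts with the codes for 'a', 'm', 'o', 'r' and is padded with zeros out to length 20, so every word produces a vector a model can consume directly.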
👍
Note: In the future we may want something similar to TF's VocabularyProcessor
(http://tflearn.org/data_utils/#vocabulary-processor), and for whatever features our model-builders regularly need (thinking of prefix/suffix strings).
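To make the VocabularyProcessor idea concrete, here is a minimal sketch of what such a helper does (the class and method names mirror TF's API shape, but this toy version is an illustration, not the tflearn implementation): fit a token-to-id vocabulary over a corpus, then transform each document into a fixed-length sequence of ids.

```python
from typing import Dict, List

class VocabularyProcessor:
    """Toy sketch of TF's VocabularyProcessor idea: fit a vocabulary,
    then transform documents into fixed-length sequences of token ids.
    Id 0 is reserved for padding and unknown tokens."""

    def __init__(self, max_document_length: int):
        self.max_document_length = max_document_length
        self.vocab: Dict[str, int] = {}

    def fit(self, documents: List[str]) -> "VocabularyProcessor":
        # Assign ids in order of first appearance, starting at 1.
        for doc in documents:
            for token in doc.split():
                self.vocab.setdefault(token, len(self.vocab) + 1)
        return self

    def transform(self, documents: List[str]) -> List[List[int]]:
        out = []
        for doc in documents:
            ids = [self.vocab.get(tok, 0) for tok in doc.split()]
            ids = ids[:self.max_document_length]
            ids += [0] * (self.max_document_length - len(ids))
            out.append(ids)
        return out
```

The same fit/transform split would let prefix/suffix features be added later as extra columns without changing the calling code.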
Good point. Yes, we'll need to revisit this, and/or expand on it.
Addition of a corpus reader, by @todd-cook, which inherits from the NLTK's and adds filtering specific to the CLTK's hosted Latin Library texts (<#846>). Also miscellaneous cleanup of the verse modules by Todd.
I wrote this PR in part to address the corpus access problem: we have different corpora (text files, JSON files, XML, and other formats), and they should be accessed in a transparent way, i.e. the user shouldn't need to know about the backing format.
Also, people who add a large and diverse corpus should guide the user by partitioning it into typical genres or eras; e.g. for Latin, it's ideal for corpus access to filter works so that one can grab Golden Age works without also grabbing works from the medieval period, etc.
To these ends, I added a FilteredCorpusReader, definitions grouping corpus types and files, and a generic method for assembling a filtered corpus. Ideally, I think this approach should be adopted elsewhere.
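The grouping-and-assembly idea can be sketched like this. The era labels, file ids, and function name below are illustrative assumptions, not the PR's actual definitions:

```python
from typing import Dict, Iterable, List

# Hypothetical mapping from era labels to corpus file ids; the PR defines
# real mappings for the CLTK's hosted Latin Library texts.
CORPUS_GROUPS: Dict[str, List[str]] = {
    "republican": ["cicero/orations.txt", "caesar/gallic_war.txt"],
    "augustan": ["vergil/aeneid.txt", "ovid/metamorphoses.txt"],
    "medieval": ["aquinas/summa.txt"],
}

def assemble_corpus(groups: Dict[str, List[str]],
                    include: Iterable[str]) -> List[str]:
    """Generic assembly: collect the file ids for the requested groups,
    so a reader can be pointed at, e.g., Golden Age texts only."""
    fileids: List[str] = []
    for label in include:
        fileids.extend(groups.get(label, []))
    return fileids
```

A plaintext corpus reader (e.g. NLTK's) could then be constructed over just the returned fileids, keeping the backing format hidden from the caller.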
I added some utility functions in matrix_corpus_ops: a collection of functions useful for cleaning and diagnostics on a corpus when it is in matrix format (e.g. in the scikit-learn pipeline Transformer approach).
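As one illustration of the kind of matrix-format cleaning described (the function name here is hypothetical, not from the PR): drop rows that end up all-zero after cleaning, so downstream pipeline steps don't receive empty documents.

```python
from typing import List, Tuple

Matrix = List[List[float]]

def drop_empty_rows(matrix: Matrix) -> Tuple[Matrix, List[int]]:
    """Remove all-zero rows (documents with no surviving features) and
    report which row indices were dropped. Hypothetical example of a
    corpus-cleaning helper operating on a corpus in matrix format."""
    kept: Matrix = []
    dropped: List[int] = []
    for i, row in enumerate(matrix):
        if any(row):
            kept.append(row)
        else:
            dropped.append(i)
    return kept, dropped
```

Returning the dropped indices keeps the matrix aligned with any parallel list of document labels, which matters when the matrix feeds a scikit-learn pipeline.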
Also, a function for featurization, and a helper function for getting an MD5 hash of a file, which is used in a forthcoming Jupyter notebook.
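The MD5-of-a-file helper is presumably along these lines (a sketch; the PR's function name and signature may differ). Reading in chunks means even large corpus files never have to fit in memory:

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the MD5 hex digest of a file, reading in fixed-size
    chunks so large corpus files don't have to be loaded whole.
    Sketch of the helper described above, not the PR's exact code."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Such a digest is handy in a notebook for verifying that a downloaded corpus file matches the expected version before running experiments against it.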
I have some Jupyter notebooks on a branch which leverage some of these changes, so I think the changes will make sense and can be well vetted; but of course, feedback is appreciated!
I also took the liberty of correcting some of my old code, namely updating type annotations in the cltk.prosody.latin module so that it works better with the current mypy implementation.
All methods have doctest function examples, and I also added some tests for the filtered corpus creation and assembly.
Lastly, although these methods are going to be used shortly in a notebook, the game plan is that the notebooks will help vet a revision of the CLTK word2vec support, and possibly the creation of a newer embedding, FastText.