
Refactoring word tokenizers #907

Merged: 42 commits into cltk:master, May 14, 2019

Conversation

@diyclassics (Contributor) commented May 3, 2019

Following the recent refactoring of the sentence tokenizers (cf. #883), I have followed a similar pattern in refactoring the word tokenizers. The existing API for the word tokenizers has been left in place, so there should be no disruption to legacy code among current users, though it should perhaps be deprecated at some point. Docs and additional tests forthcoming...

@diyclassics (Contributor, Author) commented May 3, 2019

Closes #638

@codecov-io commented May 3, 2019

Codecov Report

Merging #907 into master will increase coverage by 0.12%.
The diff coverage is 90.8%.


@@            Coverage Diff             @@
##           master     #907      +/-   ##
==========================================
+ Coverage   89.61%   89.74%   +0.12%     
==========================================
  Files         206      221      +15     
  Lines       14046    14236     +190     
==========================================
+ Hits        12588    12776     +188     
- Misses       1458     1460       +2
Impacted Files Coverage Δ
cltk/stem/lemma.py 92.59% <ø> (ø) ⬆️
cltk/tokenize/latin/params.py 100% <100%> (ø) ⬆️
cltk/tests/test_nlp/test_stop.py 97.66% <100%> (-0.02%) ⬇️
cltk/tokenize/old_french/params.py 100% <100%> (ø)
cltk/tokenize/middle_high_german/word.py 100% <100%> (ø)
cltk/tokenize/middle_english/word.py 100% <100%> (ø)
cltk/tokenize/arabic/word.py 100% <100%> (ø)
cltk/tokenize/greek/word.py 100% <100%> (ø)
cltk/prosody/old_norse/verse.py 97.21% <100%> (+0.01%) ⬆️
cltk/tokenize/sanskrit/word.py 100% <100%> (ø)
... and 31 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cec0f43...bc73d53.

@todd-cook (Contributor) commented May 4, 2019

I vote that we actually not support backwards compatibility; if users need the old API fixtures, they can pin a version of their CLTK install, right?

Otherwise this PR sounds great and I'll review in detail soon.

@diyclassics requested a review from @todd-cook on May 4, 2019

@todd-cook (Contributor) left a comment

Excellent PR!
Some minor changes could be made by compiling and caching the regex operations; this might speed up processing, and it would be easy to verify with the built-in timeit library.

import re
import string

# enclitics is defined in the surrounding tokenizer code
exclude_flag = '~'
if 'ne' in enclitics and 'n' in enclitics:
    ne_compile = re.compile(r'^\b(\w+?)([n]e?)[%s]?\b' % re.escape(string.punctuation))
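
For reference, a minimal timeit sketch of that comparison, with an invented word list and sample text rather than the PR's actual data:

import re
import timeit

latin_exceptions = ['neque', 'quoque']  # hypothetical sample words
text = 'arma virumque cano neque quoque ' * 100

def sub_uncompiled():
    s = text
    for word in latin_exceptions:
        s = re.sub(rf'\b{word}\b', f'~{word}~', s, flags=re.IGNORECASE)
    return s

# compile once, reuse on every call
compiled = [(re.compile(rf'\b{w}\b', re.IGNORECASE), f'~{w}~')
            for w in latin_exceptions]

def sub_compiled():
    s = text
    for pattern, repl in compiled:
        s = pattern.sub(repl, s)
    return s

print(timeit.timeit(sub_uncompiled, number=1000))
print(timeit.timeit(sub_compiled, number=1000))

Note that the re module caches recently used patterns internally, so the gain from precompiling is mostly the avoided per-call cache lookup and flag parsing; timeit settles whether it matters here.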

@todd-cook (Contributor) commented May 4, 2019
A good optimization might be to re.compile at init and store the compiled patterns on the self object.
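
A minimal sketch of that idea, with a hypothetical class name and enclitic list rather than the PR's actual tokenizer: the pattern is compiled once in __init__ and reused from self afterwards.

import re
import string

class EncliticSplitter:  # hypothetical name for illustration
    def __init__(self, enclitics=('ne', 'n', 'que')):
        self.enclitics = enclitics
        self.ne_compile = None
        if 'ne' in enclitics and 'n' in enclitics:
            # built once at construction instead of on every call
            self.ne_compile = re.compile(
                r'^\b(\w+?)([n]e?)[%s]?\b' % re.escape(string.punctuation))

    def match_ne(self, token):
        # reuse the pattern compiled in __init__
        return self.ne_compile.match(token) if self.ne_compile else None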

@diyclassics (Contributor, Author) commented May 6, 2019
Great idea—and a lesson I feel like I should have learned by now!

# Excerpt from the tokenizer; sents, latin_exceptions, and self._matchcase
# are defined in the surrounding code.
sent_tokens_ = []
for sent in sents:
    for word in latin_exceptions:
        sent = re.sub(rf'\b{word}\b', self._matchcase(rf'~{word}~'), sent, flags=re.IGNORECASE)

@todd-cook (Contributor) commented May 4, 2019
Prefer a re.compile'd pattern object over bare re.sub, especially in a loop.
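
For illustration, a sketch of the compiled-object version of the loop above, with a hypothetical exception list and a plain replacement string standing in for the PR's _matchcase helper; the per-word patterns are built once, outside the per-sentence loop:

import re

latin_exceptions = ['neque', 'quoque']  # hypothetical sample
exception_patterns = [(re.compile(rf'\b{word}\b', re.IGNORECASE), rf'~{word}~')
                      for word in latin_exceptions]

def flag_exceptions(sents):
    sent_tokens_ = []
    for sent in sents:
        # each pattern was compiled exactly once above
        for pattern, repl in exception_patterns:
            sent = pattern.sub(repl, sent)
        sent_tokens_.append(sent)
    return sent_tokens_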

@diyclassics (Contributor, Author) commented May 6, 2019
I can see how this would be cleaner; curious, though, would it be faster? As written, we still need to compile per word. I suppose I could concatenate the strings into a single regex, e.g. (neque|quoque|etc..), and test to see whether that's faster.
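
A quick sketch of that single-alternation idea, again with a hypothetical word list; using a callable replacement also happens to preserve the matched token's case, which appears to be what _matchcase is for:

import re

latin_exceptions = ['neque', 'quoque']  # hypothetical sample
alternation = re.compile(
    r'\b(%s)\b' % '|'.join(re.escape(w) for w in latin_exceptions),
    re.IGNORECASE)

def flag_exceptions(sent):
    # one pass over the sentence instead of one pass per exception word
    return alternation.sub(lambda m: f'~{m.group(0)}~', sent)

print(flag_exceptions('arma virumque cano Neque quoque'))
# -> 'arma virumque cano ~Neque~ ~quoque~'

Whether this beats per-word compiled patterns would come down to the timeit comparison suggested earlier in the review.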

@kylepjohnson merged commit fb787df into cltk:master on May 14, 2019

3 checks passed:
- codecov/patch: 90.8% of diff hit (target 89.61%)
- codecov/project: 89.74% (+0.12%) compared to cec0f43
- continuous-integration/travis-ci/pr: The Travis CI build passed
@kylepjohnson (Member) commented May 14, 2019

@diyclassics I didn't notice these errors in your previous PR, but I see six now, all related to Old English: https://travis-ci.org/cltk/cltk/jobs/532362743#L2038
