
Add parameter for punctuation used by Latin punkt sentence tokenizer #944

Merged: 8 commits merged into cltk:master on Oct 7, 2019

Conversation

@diyclassics (Contributor) commented Oct 6, 2019

Adds a `strict` parameter to the Latin Punkt sentence tokenizer that extends the list of punctuation used for sentence segmentation from period, question mark, and exclamation point to also include colon, semicolon, and hyphen. Includes a test and docs. Fixes #943.
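For illustration, a rough usage sketch follows; the import path, class name, and keyword are assumptions based on the files touched in this PR (cltk/tokenize/latin/sentence.py), not confirmed API:

```python
# Hedged sketch: SentenceTokenizer and the strict keyword are assumed from
# the PR description; verify against cltk/tokenize/latin/sentence.py.
from cltk.tokenize.latin.sentence import SentenceTokenizer

text = "Arma virumque cano; Troiae qui primus ab oris Italiam venit."

default_tokenizer = SentenceTokenizer()            # splits on . ? !
strict_tokenizer = SentenceTokenizer(strict=True)  # also splits on : ; -

print(default_tokenizer.tokenize(text))  # expected: one sentence
print(strict_tokenizer.tokenize(text))   # expected: split at the semicolon
```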

@diyclassics (Contributor, Author) commented Oct 6, 2019

FWIW, the handling of PunktLanguageVars in NLTK is not intuitive: the sent_end_chars have to be changed on PunktLanguageVars after the tokenizer class is instantiated (cf. nltk/nltk#2068). We may want to rewrite some of the Punkt base classes at some point to make this part of training, or at least something that can be saved with the models themselves.
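For context, a minimal sketch of the general NLTK pattern being referred to: custom sentence-final punctuation is supplied through a PunktLanguageVars subclass, ideally passed in when the tokenizer is built rather than patched onto it afterwards. This is illustrative only and is not the exact code in cltk:

```python
from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer


class StrictLatinLanguageVars(PunktLanguageVars):
    # NLTK's default is ('.', '?', '!'); strict mode adds colon, semicolon, hyphen
    sent_end_chars = ('.', '?', '!', ':', ';', '-')


# Passing lang_vars at construction avoids having to mutate the instance later.
tokenizer = PunktSentenceTokenizer(lang_vars=StrictLatinLanguageVars())
print(tokenizer.tokenize("Arma virumque cano; Troiae qui primus ab oris."))
```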

@codecov-io commented Oct 6, 2019

Codecov Report

Merging #944 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #944      +/-   ##
==========================================
+ Coverage   89.77%   89.78%   +<.01%     
==========================================
  Files         222      222              
  Lines       14258    14270      +12     
==========================================
+ Hits        12800    12812      +12     
  Misses       1458     1458
Impacted Files                           Coverage   Δ
cltk/tokenize/sentence.py                88.88%     <ø> (ø) ⬆️
cltk/tests/test_nlp/test_tokenize.py     100%       <100%> (ø) ⬆️
cltk/tokenize/latin/params.py            100%       <100%> (ø) ⬆️
cltk/tokenize/latin/sentence.py          100%       <100%> (ø) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a58c2e4...b934a02.

@todd-cook (Contributor) commented Oct 6, 2019

Looks good, with one perhaps unrelated fix.

Do you get this:

from cltk.tokenize.word import WordTokenizer
toker = WordTokenizer('latin')
toker.tokenize('algetisne?')
['algetis', '-', '-ne', '?']

If so, we can fix it by patching
https://github.com/cltk/cltk/blob/master/cltk/tokenize/latin/word.py#L109
with:

else:
    # Tokens that already begin with '-' are kept as-is, so no stray
    # '-' token is emitted when the enclitic is split off.
    if token.startswith('-'):
        specific_tokens.append(token)
        is_enclitic = True
        continue
    specific_tokens += [token[:-len(enclitic)]] + ['-' + enclitic]
    is_enclitic = True
    break

It's probably a regression that crept in.
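As a quick sanity check of the patch above, the example from the start of this comment can be re-run; the expected output below is an assumption based on the bug description (the stray '-' token should disappear):

```python
from cltk.tokenize.word import WordTokenizer

toker = WordTokenizer('latin')
print(toker.tokenize('algetisne?'))
# expected after the patch (assumption): ['algetis', '-ne', '?']
```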

@todd-cook (Contributor) commented Oct 6, 2019

Should we also parameterize the somewhat aggressive tokenization conversions of
[word]-st -> [word] est
[word]-n -> [word] -ne

e.g.

toker.tokenize('Cenavin ego heri in navi in portu Persico?')
['Cenavi', '-ne', 'ego', 'heri', 'in', 'navi', 'in', 'portu', 'Persico', '?']
toker.tokenize('Dic si audes mihi, bellan videtur specie mulier?')
['Dic', 'si', 'audes', 'mihi', ',', 'bella', '-ne', 'videtur', 'specie', 'mulier', '?']

These conversions are valid, especially in verse, but I'm seeing a lot of false positives when applied to prose. Maybe we turn on these aggressive conversions with a "verse" switch?
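One possible shape for such a switch, sketched here with illustrative names and deliberately simplified rules (this is not the actual cltk WordTokenizer logic):

```python
# Hypothetical sketch: gate the aggressive -n -> -ne and -st -> est expansions
# behind a flag so they only run when the caller opts in for verse.
def expand_contractions(tokens, verse=False):
    if not verse:
        return list(tokens)
    expanded = []
    for token in tokens:
        if len(token) > 2 and token.endswith('n') and not token.endswith('ne'):
            # e.g. 'Cenavin' -> 'Cenavi', '-ne'
            expanded.extend([token[:-1], '-ne'])
        elif len(token) > 3 and token.endswith('st'):
            # e.g. 'bellumst' -> 'bellum', 'est'
            expanded.extend([token[:-2], 'est'])
        else:
            expanded.append(token)
    return expanded


print(expand_contractions(['Cenavin', 'ego', 'heri'], verse=True))
# ['Cenavi', '-ne', 'ego', 'heri']
print(expand_contractions(['Cenavin', 'ego', 'heri']))
# unchanged: ['Cenavin', 'ego', 'heri']
```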

Contributor review comment:

Looks good with suggestions noted in the main PR

@kylepjohnson (Member) commented Oct 6, 2019

Waiting on word from Patrick whether to merge now or add Todd's rec

@diyclassics (Contributor, Author) commented Oct 6, 2019

@todd-cook I'll make the -ne issue a priority as it's clearly a problem (cf. #79 (comment)). Thanks for the reminder.

@diyclassics (Contributor, Author) commented Oct 6, 2019

@todd-cook We might want to make enclitic splitting model-driven (i.e. not rules-based) anyway, cf. https://github.com/diyclassics/ll-experiments/blob/master/Splitting%20Enclitics.ipynb

@diyclassics (Contributor, Author) commented Oct 6, 2019

> Waiting on word from Patrick whether to merge now or add Todd's rec

I'll open another issue to discuss -ne enclitic splitting.

@diyclassics (Contributor, Author) commented Oct 6, 2019

Found something else that needs to be fixed—hold the merge until the next commit.


@diyclassics (Contributor, Author) commented Oct 6, 2019

I'm taking a crack at a model implementation for tokenization that could be used across several languages. I'll post soon. Things look promising but I'll need to finish and then vet it--I have my doubts ;-)

This is what I like about the OOPhilology model—you can add a new model implementation as a subclass of BaseSentenceTokenizer (vel sim) and it's another good option for our users.
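To make that pattern concrete, a self-contained sketch follows; the class and method names (BaseSentenceTokenizer, tokenize) are stand-ins suggested by the comment above, not the actual cltk API:

```python
import re
from abc import ABC, abstractmethod
from typing import List


class BaseSentenceTokenizer(ABC):
    """Shared interface: every backend only has to implement tokenize()."""

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        ...


class PunctuationSentenceTokenizer(BaseSentenceTokenizer):
    """Rule-based backend: split after sentence-final punctuation."""

    def __init__(self, sent_end_chars=('.', '?', '!')):
        chars = re.escape(''.join(sent_end_chars))
        self._splitter = re.compile(r'(?<=[' + chars + r'])\s+')

    def tokenize(self, text: str) -> List[str]:
        return [s for s in self._splitter.split(text) if s]


class ModelSentenceTokenizer(BaseSentenceTokenizer):
    """Statistical backend: wraps any trained model exposing a predict() method."""

    def __init__(self, model):
        self._model = model

    def tokenize(self, text: str) -> List[str]:
        return self._model.predict(text)
```

A new model-driven implementation can then be offered alongside the rule-based one without changing calling code, since both expose the same tokenize() method.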

@kylepjohnson merged commit f5386c5 into cltk:master on Oct 7, 2019

3 checks passed:
codecov/patch: 100% of diff hit (target 89.77%)
codecov/project: 89.78% (+<.01%) compared to a58c2e4
continuous-integration/travis-ci/pr: The Travis CI build passed
kylepjohnson added a commit that referenced this pull request Oct 7, 2019