Under-splitting on "set." and "ago." #22

Closed
leitneratselerity opened this issue Feb 28, 2022 · 0 comments · Fixed by #24
Labels
bug Something isn't working

leitneratselerity commented Feb 28, 2022

Syntok did not split:

Some of the cookies are essential for parts of the site to operate and have already been set. You may delete and block all cookies from this site, but if you do, parts of the site may not work.

and:

The sharpshooter appears to be checked out on his Kings experience, and an argument could easily be raised that he should have been moved two years ago. Now, his $23 million salary will be a tough pill for teams to swallow, even if there is decent chance of a solid bounce-back year at a new destination.

This happens because "set." and "ago." are the official abbreviations for two Spanish months (septiembre and agosto).

@leitneratselerity leitneratselerity changed the title from 'Under-splitting on "You"' to 'Under-splitting on "You" and "Now"' Feb 28, 2022
@leitneratselerity leitneratselerity changed the title from 'Under-splitting on "You" and "Now"' to 'Under-splitting on "set." and "ago."' Feb 28, 2022
peter-lang-dealogic added a commit to peter-lang-dealogic/syntok that referenced this issue Feb 28, 2022
Fixes fnl#22
Some month abbreviations are also valid English words (e.g., "set", "ago"), which causes false positives if we treat month abbreviations as generic abbreviations, because that consumes such sentence endings.
The code already contains logic to check whether a month abbreviation is preceded by a number; see the test case: "Am 13. Jän. 2006 war es regnerisch."
@fnl fnl closed this as completed in #24 Feb 28, 2022
fnl added a commit that referenced this issue Feb 28, 2022
Fixes #22

Some month abbreviations are also valid English words (e.g., "set", "ago"), which causes false positives if month abbreviations are treated as generic abbreviations, because syntok does not split after abbreviations.
Therefore, this change removes the months from the list of abbreviations.

However, this must not introduce over-splitting after month abbreviations that are followed by a numeric token.
After removing the months from the list of abbreviations, syntok would only refrain from splitting at a sentence terminal marker if it was followed by anything with 3+ digits (due to the next_is_a_large_number rule).
Therefore, this change also avoids splitting after a month abbreviation if it is followed by even a short number (i.e., a day or a 2-digit year).

Co-authored-by: fnl <me@fnl.es>
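
The rule described in the commit message can be illustrated with a small, self-contained sketch. This is not syntok's actual implementation: the tokenizer, the `MONTH_ABBREVIATIONS` set, and the function names `tokenize` and `split_sentences` are all hypothetical and exist only for illustration. The sketch assumes that a period after a month abbreviation ends the sentence unless the next token is a number, and (as a further assumption based on the "13. Jän. 2006" test case) that a period after a number does not end the sentence when a month abbreviation follows.

```python
import re
from typing import List

# Hypothetical subset of month abbreviations; the real list in syntok is larger.
MONTH_ABBREVIATIONS = {"jan", "jän", "feb", "ago", "set", "oct", "dec"}


def tokenize(text: str) -> List[str]:
    """Naive tokenizer (an assumption): runs of word characters or single symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_sentences(tokens: List[str]) -> List[List[str]]:
    """Split on '.' except around month abbreviations next to numbers."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for i, token in enumerate(tokens):
        current.append(token)
        if token != ".":
            continue
        prev = tokens[i - 1].lower() if i > 0 else ""
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # Month abbreviation followed by any number (day or year): no split.
        if prev in MONTH_ABBREVIATIONS and nxt.isdigit():
            continue
        # Number followed by a month abbreviation (e.g. "13. Jän."): no split.
        if prev.isdigit() and nxt.lower() in MONTH_ABBREVIATIONS:
            continue
        sentences.append(current)
        current = []
    if current:  # Flush tokens that are not closed by a terminal marker.
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for text in (
        "They have already been set. You may delete all cookies.",
        "Am 13. Jän. 2006 war es regnerisch.",
    ):
        for sentence in split_sentences(tokenize(text)):
            print(" ".join(sentence))
```

Under these assumptions, "set." and "ago." at the end of an English sentence are split as usual, while the date in "Am 13. Jän. 2006 war es regnerisch." stays in one sentence.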
@fnl fnl added the bug Something isn't working label Feb 28, 2022