Bug in not-contraction handling code #15

Closed
KDercksen opened this issue Dec 16, 2021 · 5 comments
Labels
bug Something isn't working

Comments

@KDercksen
Contributor

KDercksen commented Dec 16, 2021

Hi,

Thanks for the great library! I think I ran into a weird edge case in the not-contraction handling code. Take the following example:

from syntok.tokenizer import Tokenizer

tok = Tokenizer(replace_not_contraction=False)
tok.split("n't")

The output is [<Token '' : "n't" @ 1>]. Something is going wrong in the offset calculation there: that 1 should be a 0. The real example this came from is a sentence in the AIDA dataset: " Falling share prices in New York do n't hurt Mexico as long as it happens gradually , as earlier this week.

I see the same with "don't": [<Token '' : 'do' @ 0>, <Token '' : "n't" @ 3>]. That 3 should be a 2, no?

Would love to hear your thoughts; I'm not sure how to fix this neatly yet.
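
To make the expected offsets concrete, here is a minimal sketch in plain Python. It is not syntok's code: `split_not_contraction` and `NOT_CONTRACTION` are hypothetical names introduced only for illustration, and the regex is an assumption about what counts as a not-contraction. It only shows where the offsets of the two parts should land.

```python
import re

# Hypothetical helper, NOT part of syntok: splits a trailing "n't" off a word
# and returns (text, start offset) pairs relative to the original string.
NOT_CONTRACTION = re.compile(r"^(.*?)(n['’]t)$", re.IGNORECASE)


def split_not_contraction(word, offset=0):
    match = NOT_CONTRACTION.match(word)
    if match is None:
        # No contraction found: the word keeps its original offset.
        return [(word, offset)]
    stem, contraction = match.groups()
    parts = []
    if stem:
        parts.append((stem, offset))
    # The contraction begins directly after the stem.
    parts.append((contraction, offset + len(stem)))
    return parts


print(split_not_contraction("n't"))    # [("n't", 0)]
print(split_not_contraction("don't"))  # [('do', 0), ("n't", 2)]
```

With an empty stem, as in the bare "n't" case, the contraction starts at the offset of the word itself, which is why 0 rather than 1 is expected.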

@fnl
Owner

fnl commented Dec 16, 2021

Looks like a valid bug.
The best way to solve this kind of issue is to implement the corresponding (failing) test and then use a debugger to find and fix what's wrong.
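
One possible shape for such a test is an alignment check: every token's offset should point at that token's text in the input. The sketch below is an assumption about how the check could look, not the test that was added to the project. `misaligned_tokens` is a hypothetical helper, and the (value, offset) pairs are copied by hand from the output reported above rather than produced by the tokenizer. The check only makes sense when the token text is left unchanged, as with replace_not_contraction=False.

```python
def misaligned_tokens(text, tokens):
    """Return the (value, offset) pairs whose offset does not point at value in text."""
    return [
        (value, offset)
        for value, offset in tokens
        if text[offset:offset + len(value)] != value
    ]


# Offsets as reported in this issue for "don't" (copied by hand).
reported = [("do", 0), ("n't", 3)]
# Offsets the reporter expects instead.
expected = [("do", 0), ("n't", 2)]

print(misaligned_tokens("don't", reported))  # [("n't", 3)]
print(misaligned_tokens("don't", expected))  # []
```

In an actual test, the pairs would come from the tokenizer's output, and the assertion would be that the list of misaligned tokens is empty.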

@fnl fnl added the bug Something isn't working label Dec 16, 2021
@KDercksen
Contributor Author

Thanks sir, you will see the PR in a bit!

@fnl
Owner

fnl commented Dec 17, 2021

That would be very nice - thank you, Koen! Checking…

fnl pushed a commit that referenced this issue Dec 17, 2021
* fix not-contraction offsets + add test
* do not differ in offset calculation when using replace_not_contract=True or False
@fnl
Owner

fnl commented Dec 17, 2021

Fixed by PR #17

I will review in full over the weekend, release a new version, and close this bug thereafter.

Thanks, again, Koen!

@fnl
Owner

fnl commented Dec 18, 2021

Fixed by release 1.3.2; Closing.

@fnl fnl closed this as completed Dec 18, 2021