
fix not-contraction offsets + add test (resolves #15) #17

Merged: 2 commits, Dec 17, 2021

Conversation

KDercksen
Contributor

This PR implements a fix for the issue noted in #15. Specifically, in the case of "don't" you now get do @ 0, n't @ 2 when not replacing not-contractions; if you set replace_not_contraction=True, you will still get do @ 0, not @ 3 because of the implicit whitespace introduction. I assumed you would still want to introduce a space in that case, because keeping the text pristine doesn't necessarily matter there.

Let me know if you disagree or would like to see something changed!
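To make the described behavior concrete, here is a minimal standalone sketch of the offset logic as this PR initially implemented it. This is illustrative only, not the syntok source; the function name and the regex-based word splitting are assumptions made for the example.

```python
import re


def tokenize_not_contraction(text: str, replace_not_contraction: bool = False):
    """Split words ending in "n't" into a stem token and a negation token.

    Offsets index into the original text. With replacement enabled, "n't"
    becomes "not" and, per this PR's initial behavior, the offset shifts by
    one, as if an implicit whitespace had been introduced before "not".
    """
    tokens = []
    for match in re.finditer(r"\S+", text):
        word, start = match.group(), match.start()
        if word.lower().endswith("n't") and len(word) > 3:
            stem = word[:-3]
            tokens.append((stem, start))
            if replace_not_contraction:
                # "don't" -> do @ 0, not @ 3 (implicit whitespace)
                tokens.append(("not", start + len(stem) + 1))
            else:
                # "don't" -> do @ 0, n't @ 2
                tokens.append((word[-3:], start + len(stem)))
        else:
            tokens.append((word, start))
    return tokens
```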

@fnl
Owner

fnl commented Dec 17, 2021

I think offsets should always be truthful to the input text, as that is where their value lies: tracking the provenance of a token.

So even can’t would become “ca”:0, “not”:2 in that scenario.

Does that seem right to you, too?

@KDercksen
Contributor Author

Yes, I actually agree! I was doubting myself because of the comment written here, though.

I do think it would be best, even if the token text changes ("n't" -> "not"), that the offsets should be truthful to the original text. I can change the PR to reflect this!
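The agreed-upon behavior can be sketched as follows: the offset always points at where the original characters sit in the input, even when the emitted token text is rewritten. Again, this is a hypothetical standalone illustration under assumed names, not the merged syntok code.

```python
import re


def tokenize_truthful(text: str, replace_not_contraction: bool = False):
    """Emit (value, offset) pairs whose offsets always index into the
    original text, even when the token's text is rewritten ("n't" -> "not").
    """
    tokens = []
    for match in re.finditer(r"\S+", text):
        word, start = match.group(), match.start()
        if word.lower().endswith("n't") and len(word) > 3:
            stem = word[:-3]
            tokens.append((stem, start))
            value = "not" if replace_not_contraction else word[-3:]
            # Offset of the original "n't" characters, regardless of value:
            # "can't" -> ca @ 0, not @ 2
            tokens.append((value, start + len(stem)))
        else:
            tokens.append((word, start))
    return tokens
```

Under this scheme a caller can always recover the source span of a token from its offset, which is the provenance-tracking property argued for above.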

@fnl fnl merged commit 9d217f6 into fnl:master Dec 17, 2021
@fnl
Owner

fnl commented Dec 17, 2021

Thank you very much for your contribution, Koen, much appreciated!
