Hi,
when I run:
>>> list(syntok.tokenize('..A'))
[<Token '' : '.' @ 0>, <Token '' : '.' @ 0>, <Token '' : 'A' @ 2>]
Here the first two tokens have the same offset (0), although the second '.' sits at position 1 in the input. As I understand offsets, this is not the intended behavior.
The problem can be fixed by adding "+i" in tokenizer.py:197, making the line:
yield Token("", c, mo.start()+i)
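To illustrate why the "+i" matters, here is a minimal standalone sketch of the idea. It does not reproduce syntok's actual code: the `Token` tuple, the `split_prefix` helper, and the regular expression are hypothetical stand-ins, assuming the prefix is split character by character from a single regex match.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for a token: spacing, value, character offset."""
    spacing: str
    value: str
    offset: int


def split_prefix(text: str) -> Iterator[Token]:
    """Yield one token per leading punctuation character, then the remainder."""
    mo = re.match(r"[^\w\s]+", text)
    end = 0
    if mo:
        for i, c in enumerate(mo.group(0)):
            # Without "+ i", every prefix character would report mo.start().
            yield Token("", c, mo.start() + i)
        end = mo.end()
    if end < len(text):
        yield Token("", text[end:], end)


tokens = list(split_prefix("..A"))
offsets = [t.offset for t in tokens]
```

With the "+ i" in place, each prefix character gets its own position in the input, so slicing the original string at a token's offset returns that token's value.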
Indeed; thanks for catching that naughty little bug! Will push a fix shortly.
Fix for issue #6: Wrong offset with nonword-prefix
c22065f
Including a version bump to 1.2.1
Fixed with 6feb04c and in release v1.2.2
Thank you for reporting, and even more for tracking down the core issue! That helped massively in closing this ticket ASAP.