Wrong offset with nonword-prefix #6

Lingepumpe · 2019-11-11T10:55:26Z

Hi,

when I run:

>>> list(syntok.tokenize('..A'))
[<Token '' : '.' @ 0>, <Token '' : '.' @ 0>, <Token '' : 'A' @ 2>]

Here the first two tokens both report offset 0, although the second '.' sits at index 1. As I understand offsets, this is not the intended behavior.

The problem can be fixed by adding "+i" in tokenizer.py:197, making the line:

yield Token("", c, mo.start()+i)
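To illustrate why the "+i" matters, here is a minimal, self-contained sketch. It is not the actual syntok code: the `Token` tuple, the `NONWORD_PREFIX` pattern and the `split_nonword_prefix` helper are all hypothetical stand-ins, assuming the prefix is matched as one regex match and then emitted one character at a time.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for a token: spacing, value, offset."""
    spacing: str
    value: str
    offset: int


# Assumption: a run of non-word characters forms the "nonword prefix".
NONWORD_PREFIX = re.compile(r"\W+")


def split_nonword_prefix(text: str, pos: int = 0) -> Iterator[Token]:
    """Yield one token per character of the non-word run starting at pos."""
    mo = NONWORD_PREFIX.match(text, pos)
    if mo is None:
        return
    for i, c in enumerate(mo.group(0)):
        # mo.start() alone would give every character the offset of the
        # whole run; adding i gives each character its own position.
        yield Token("", c, mo.start() + i)


tokens = list(split_nonword_prefix("..A"))
# Each token's offset should point back at its own character in the text.
assert all("..A"[t.offset] == t.value for t in tokens)
```

With the "+i", the two dots in '..A' get offsets 0 and 1. The final assertion is the kind of invariant a regression test could check: the character found at a token's offset equals the token's value.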


fnl · 2019-11-11T14:03:51Z

Indeed; thanks for catching that naughty little bug! Will push a fix shortly.

Including a version bump to 1.2.1

fnl · 2019-11-11T14:53:21Z

Fixed with 6feb04c and in release v1.2.2

Thank you for reporting, and even more for tracking down the core issue!
That helped massively in closing this ticket ASAP.

fnl added the bug label Nov 11, 2019

fnl added a commit that referenced this issue Nov 11, 2019

Fix for issue #6: Wrong offset with nonword-prefix

c22065f

Including a version bump to 1.2.1

fnl closed this as completed Nov 11, 2019
