Tokenization of two ints separated by a space #508

Open
jmn319 opened this issue Apr 20, 2021 · 0 comments
Labels
bug Something isn't working

jmn319 commented Apr 20, 2021

Full disclosure: I have only spent a handful of hours with the tool, so my apologies if there is an easy fix for this.

I started with a data set in which two ints very commonly follow each other, separated by one or more spaces. In the labeling UI, I noticed that the two ints are grouped together as one token. They are grouped as one token even when they are separated by a comma. Screenshots for full repro below.

Any thoughts on how I can get these to be separate tokens? I'm hoping there are some simple settings I can change.

[Screenshots: udt-token1, udt-token2, udt-token3]
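To make the expected behavior concrete, here is a minimal Python sketch of the tokenization I was hoping for. It is not the tool's actual tokenizer; `tokenize` and `TOKEN_RE` are hypothetical names used only for illustration, and the regex assumes simple ASCII input.

```python
import re

# Hypothetical illustration only, not the labeling tool's tokenizer.
# Alternatives, in order: a run of digits, a run of ASCII letters,
# or any single character that is neither whitespace nor alphanumeric.
TOKEN_RE = re.compile(r"\d+|[A-Za-z]+|[^\sA-Za-z0-9]")

def tokenize(text):
    """Split text into tokens, dropping whitespace between them."""
    return TOKEN_RE.findall(text)

print(tokenize("12 34"))     # two ints, single space
print(tokenize("12    34"))  # two ints, multiple spaces
print(tokenize("12,34"))     # two ints, comma-separated
```

With a rule like this, each int would be its own token regardless of how many spaces sit between them, and a comma would become a separate token instead of gluing the ints together.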

jmn319 added the bug label Apr 20, 2021