
Inconsistent tokenization of "T-shirt". #4

Closed

tkurosaka opened this issue Jan 29, 2023 · 8 comments

@tkurosaka

I am seeing inconsistent tokenization for "T-shirt", and probably for any hyphen-separated words.
Below, each input line is followed by the tokenizer's output:

$ bin/opennlp TokenizerME en-tokenizer.onlpm
Loading Tokenizer model ... done (0.123s)
yellow t-shirt
yellow t - shirt
yellow t-shirt!
yellow t- shirt !

"t-shirt" is sometimes tokenized as three tokens and sometimes two tokens.

I tested with OpenNLP 1.9.1 and 2.1.0, and they show the same results.

The model distributed on opennlp.apache.org, opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin, doesn't have this problem: "t-shirt" is always tokenized as "t", "-", "shirt". I tested other words like "ice-cream", "truck-driver", "sign-in", and "warm-up", and they are consistent in that "-" is a separate token.

Is there any cure for this?

@tkurosaka
Author

I tested with other compound words. It seems that in the majority of cases, "-" is treated as a separate token. But there are cases where the compound word is kept as a single token (for example "e-commerce", "non-profit", "multi-lingual"), and there are compound words where the hyphen is attached to a neighboring word ("turn-key" => ["turn", "-key"]; "word-for-word" => ["word", "-", "for", "-word"]; "world-class" => ["world", "-class"]).
I suspect this might be caused by inconsistency in the annotation.

@abzif
Owner

abzif commented Feb 1, 2023

You made an interesting observation!
Your hypothesis that hyphenated words are inconsistently annotated seems right.
I looked into the *.conllu files, and hyphenated words are most often split, but there are cases where they are treated as a single word.
I noticed examples such as so-called, african-american, avant-garde, run-on, and others.
This may be the cause of the tokenizer's inconsistent behavior.

@abzif
Owner

abzif commented Feb 1, 2023

I think it can be fixed. During implementation I noticed inconsistent annotations for the 'de' and 'fi' languages and implemented a so-called Transformer, which reads conllu sentences and tries to fix or reject them.
EnTransformer can check each sentence and reject it if it contains a hyphenated word that was not split. The tokenizer model should then always split words at hyphens.
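To illustrate, a rough sketch of what such a rejection rule could look like (hypothetical code, not the actual EnTransformer implementation; it assumes it is given only the token lines of a CoNLL-U sentence, with the surface FORM in column 2):

    import java.util.List;

    public class HyphenFilter {
        // Returns true if the sentence should be rejected: some token's surface
        // form still contains a hyphen that was not split off into its own
        // token during annotation (e.g. "so-called" or "-shirt").
        static boolean hasUnsplitHyphenWord(List<String> conlluTokenLines) {
            for (String line : conlluTokenLines) {
                String[] cols = line.split("\t");
                String form = cols[1]; // column 2 of a CoNLL-U token line is FORM
                if (form.length() > 1 && form.contains("-")) {
                    return true;
                }
            }
            return false;
        }
    }

Training on only the surviving sentences would then leave the model a single, consistent pattern to learn: the hyphen is always its own token.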

@tkurosaka
Author

That sounds very promising! Will it take a while to get the model fixed?

@abzif
Owner

abzif commented Feb 7, 2023

Models are automatically recomputed on the 27th day of every month. However, I don't have much time at the moment, so I am not sure I can make a fix soon.

@abzif
Owner

abzif commented Feb 13, 2023

I've just made a small fix: sentences which contain unsplit hyphenated words are now rejected. Hopefully, the trained models will consistently split words at hyphens.
Please check after February 27th, when the new models are recomputed.

@tkurosaka
Author

Thank you. I'll check out the new model when it becomes available!

@tkurosaka
Author

In my smoke test, the new model handled compound words with "-" very consistently. "yellow t-shirt!" is now tokenized as "yellow", "t", "-", "shirt", and "!"; "post-brexit" is now "post", "-", "brexit". It seems this model fixed the issue. Thank you!
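For reference, the same smoke test through the API (same hypothetical setup as in the earlier sketch; the expected outputs are the ones listed above):

    import java.io.FileInputStream;
    import java.util.Arrays;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class SmokeTest {
        public static void main(String[] args) throws Exception {
            TokenizerME tok = new TokenizerME(
                    new TokenizerModel(new FileInputStream("en-tokenizer.onlpm")));
            System.out.println(Arrays.toString(tok.tokenize("yellow t-shirt!")));
            // with the retrained model: [yellow, t, -, shirt, !]
            System.out.println(Arrays.toString(tok.tokenize("post-brexit")));
            // with the retrained model: [post, -, brexit]
        }
    }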
