Return the "string" representation of tokens or their text offsets #38
You might want to take a look at the […]. Another thing you can do is use the […]. If neither of these options fits your needs, it would be helpful if you elaborated on your specific use case. Returning the string that every input_id was created from would be severely inefficient and would probably reduce the performance of FastBertTokenizer by integer factors, so it is unlikely I'd add such an API. Returning all the offsets for each input_id would probably be a smaller performance issue, but I'd need to think about the API and measure the impact. There are often more efficient ways to achieve the same end result, though, which is why it would be interesting to know what you want to do with the string representations/offsets.
Well, currently I am using BERT tokenization only as the first step in a processing pipeline. The token_ids returned by your tokenizer are used for NER with a corresponding model from Hugging Face (which I converted to an ONNX model to make it run in dotnet). The ONNX model returns predictions as scores for each of the trained labels (e.g. I-LOC) for each input token. Finally, I need the surface string of the detected named-entity span. At this point the token offsets would be helpful: I could simply take the substring of the incoming text.

I tried to use the offsets from the NER predictions to call the Decode method you mentioned. Unfortunately, it sometimes throws a KeyNotFoundException, as you already mentioned. For now I have switched back to BERTTokenizer, which I forked and adjusted to return the offsets (in addition to fixing the infinite loop issue). I also did some optimization and rework on that implementation, using memories and spans for tokenization. If time permits, I will have a deeper look at your new enumerator class; maybe it fits my needs better. I may also give it a try and add offsets to your implementation on my own. I agree with you that returning the string is not necessary, but offsets are very helpful :)
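The pipeline described above (token offsets + per-token NER labels → entity surface strings) can be sketched in a few lines. This is a minimal illustration in Python with made-up names and a made-up B-/I- label scheme; it is not FastBertTokenizer's API, just the post-processing step the offsets would enable:

```python
def entity_surfaces(text, offsets, labels):
    """Merge contiguous tokens sharing an entity label and slice the
    original text using each token's (start, end) character offsets."""
    spans = []
    current = None  # (entity kind, span start, span end)
    for (start, end), label in zip(offsets, labels):
        if label == "O":  # not part of any entity: close any open span
            if current:
                spans.append(current)
                current = None
            continue
        kind = label.split("-", 1)[-1]  # strip the B-/I- prefix
        if current and current[0] == kind and label.startswith("I-"):
            current = (kind, current[1], end)  # extend the open span
        else:
            if current:
                spans.append(current)
            current = (kind, start, end)  # start a new span
    if current:
        spans.append(current)
    # Slice the incoming text -- no per-token strings ever stored.
    return [(kind, text[s:e]) for kind, s, e in spans]

text = "Angela Merkel visited Paris."
offsets = [(0, 6), (7, 13), (14, 21), (22, 27), (27, 28)]
labels = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(entity_surfaces(text, offsets, labels))
# [('PER', 'Angela Merkel'), ('LOC', 'Paris')]
```

This also illustrates the maintainer's point: the offsets alone are enough to recover any surface string lazily, so the tokenizer never needs to materialize per-token strings.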
I fixed #39 and published a new version.
Did you give the new version a try since? Does it work for you? Are you still interested in the offsets? |
Hi,
I ran into the issue that I need the string representations of the tokens returned by the Encode method. An alternative would be to get the offsets. Is there a chance to get the offsets/string of an input_id?
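To make the request concrete, here is a sketch of the desired API shape: alongside the input_ids, return one (start, end) character offset per token, so a caller can map any input_id back to a substring of the original text. The trivial whitespace "tokenizer" and vocabulary below are stand-ins for illustration only, not the library's real Encode implementation:

```python
import re

# Toy vocabulary, purely illustrative.
VOCAB = {"hello": 1, "world": 2, "[UNK]": 0}

def encode_with_offsets(text):
    """Return (input_ids, offsets) where offsets[i] is the (start, end)
    character span in `text` that produced input_ids[i]."""
    ids, offsets = [], []
    for m in re.finditer(r"\S+", text):  # naive whitespace tokenization
        token = m.group().lower()
        ids.append(VOCAB.get(token, VOCAB["[UNK]"]))
        offsets.append((m.start(), m.end()))
    return ids, offsets

text = "Hello world"
ids, offsets = encode_with_offsets(text)
# Offsets recover each token's surface form without storing strings:
surfaces = [text[s:e] for s, e in offsets]
```

Hugging Face's fast tokenizers expose the same idea via `return_offsets_mapping=True`, which is the behavior being asked for here.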