
Return the "string" representation of tokens or their text offsets #38

Open
Egyptmaster opened this issue Dec 20, 2023 · 4 comments

@Egyptmaster

Hi,

I ran into the issue that I need to get the "string" representations of the tokens returned by the Encode method. An alternative would be to get the offsets. Is there a chance to get the offset/string of an input_id?

@georg-jung
Owner

georg-jung commented Dec 20, 2023

You might want to take a look at the CreateBatchEnumerator method added in the current version. If you're encoding some longer text, it will not only encode the first e.g. 512 input_ids of it but will continue with the rest of the text, optionally letting you specify some overlap. This method's return value includes TokenizedRange<TKey>(TKey Key, int Offset, int? LastTokenizedWordStartIndex). Thus, you will be able to know the exact piece of text that was tokenized, including its start offset and length. It also automatically enables you to e.g. encode full books without manually thinking about offsets. It won't tell you the offsets or strings corresponding to every single input_id though.
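
To make the correlation part concrete, here's a small sketch of how a TokenizedRange could be mapped back onto the source text. The record shape is the one described above (redeclared here only so the snippet is self-contained); the helper and its end-of-range handling are illustrative assumptions, not the library's actual API:

```csharp
// Shape as described above: which source text (Key), where the tokenized
// window starts (Offset) and where the last tokenized word begins.
public readonly record struct TokenizedRange<TKey>(TKey Key, int Offset, int? LastTokenizedWordStartIndex);

public static class TokenizedRangeExtensions
{
    // Hypothetical helper: recover the piece of text a tokenized window covers,
    // assuming a null LastTokenizedWordStartIndex means the window reaches the end of the text.
    public static string CoveredText<TKey>(this TokenizedRange<TKey> range, string sourceText)
    {
        var end = range.LastTokenizedWordStartIndex ?? sourceText.Length;
        return sourceText[range.Offset..end];
    }
}
```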

Another thing you can do is use the Decode method. It will translate any length of given input_ids back into string representations, based on the loaded vocabulary. You could pass shorter combinations of input_ids than what you encoded before. This will not necessarily return the exact same string you passed as input though. Depending on your vocabulary, diacritics and other characters might have been removed, the text might have been lowercased, and all spacing characters will have been replaced by a single space " ". Overall, the spacing could be different. Also, the Decode method currently assumes that the first input_id you pass as its argument represents the beginning of a word. If the first passed id represents a word suffix - or, put differently, some arbitrary position in a list of input_ids which doesn't happen to represent the start of a word - it will probably throw a KeyNotFoundException. This probably isn't the best way for this method to work, and I'm happy to fix that - if it helps you, I could do it quite soon; please let me know.
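
For reference, a round trip through Encode and Decode looks roughly like this (a minimal sketch; the model name and sample text are placeholders, and loading details may differ between versions):

```csharp
using FastBertTokenizer;

var tok = new BertTokenizer();
await tok.LoadFromHuggingFaceAsync("bert-base-uncased"); // placeholder; load the vocabulary matching your model

var (inputIds, attentionMask, tokenTypeIds) = tok.Encode("Lorem Ipsum is simply dummy text.");

// Decode maps the ids back to text based on the loaded vocabulary; casing,
// diacritics and spacing may differ from the original input.
var decoded = tok.Decode(inputIds.Span);
Console.WriteLine(decoded);
```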

If neither of these options fits your needs, it would be helpful if you elaborated further on your specific use case. Returning the string that every input_id was created from would be severely inefficient and would probably slow FastBertTokenizer down by integer factors - thus, it is unlikely I'd add such an API. Returning the offsets for each input_id would probably be a smaller performance issue, but I'd need to think about the API and measure the impact. There often are more efficient ways to achieve the same end result though, which is why it would be interesting to know what you want to do with the string representations/offsets.

@Egyptmaster
Author

Well, currently I am using the BERT tokenization only as the first step in a processing pipeline. The token_ids returned by your tokenizer are used for NER with a corresponding model from Hugging Face (which I converted to an ONNX model to make it run in dotnet). The ONNX model returns predictions as "scores" over the possible trained labels (e.g. I-LOC) for each of the input tokens. Finally, I need the surface form of the detected named-entity span. At this point the token offsets would be helpful to simply take the substring from the incoming text.
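
To illustrate what I mean, something like the following is what I'd like to be able to write (purely hypothetical: per-token (Start, Length) offsets are not an existing FastBertTokenizer API, and the sample values are made up):

```csharp
var text = "Angela lives in Berlin.";

// Hypothetical per-token offsets: (Start, Length) of the text slice each input_id was created from.
var offsets = new (int Start, int Length)[] { (0, 6), (7, 5), (13, 2), (16, 6), (22, 1) };

// Suppose the NER model tagged token index 3 ("Berlin") as I-LOC.
Console.WriteLine(EntitySurface(text, offsets, firstToken: 3, lastToken: 3)); // "Berlin"

static string EntitySurface(string s, (int Start, int Length)[] offs, int firstToken, int lastToken)
{
    var start = offs[firstToken].Start;
    var end = offs[lastToken].Start + offs[lastToken].Length;
    return s[start..end]; // surface form of the detected named entity
}
```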

I simply tried to use the returned offsets of the NER predictions to call the Decode method you mentioned. Unfortunately it sometimes throws a KeyNotFoundException, as you already mentioned.

For now I have switched back to BERTTokenizer, which I forked and adjusted to return the offsets (in addition to fixing the infinite loop issue). I also did some optimizations and rework on that implementation, using memories and spans for the tokenization.

If time permits I will have a deeper look at your new enumerator class; maybe that fits my needs better. I might also give it a try and add offsets to your implementation myself. I agree that returning the string is not necessary, but offsets are very helpful :)

@georg-jung
Owner

I fixed #39 and published 0.5.18-alpha on NuGet, which contains the fix. Let me know if that works for you. I'll take a look at adding an offset output. It should be quite straightforward to add, probably here. I need to think about the API though (and measure the perf impact, but I guess it won't be large), because my goal is to keep it as easy to use as possible. It would be interesting to know what you found hard to grasp from a consumer's perspective, if you don't mind sharing :).

@georg-jung
Owner

Did you get a chance to try the new version since then? Does it work for you? Are you still interested in the offsets?
