Longest sequence and truncation of sentence #181
Hi,
I wonder how the maximum length is set before getting an embedding for a given sentence. Let s be a sentence such that s = [x1, x2, x3, ..., xN]. Is there a maximum length parameter n such that if N > n, then all tokens at indices above n are removed, i.e. s is mapped to map(s) = [x1, x2, ..., xn]? (This is what we often see in BERT-like models.)
I am also confused about what get_sentence_features does and what _first_module corresponds to.
Comments
Hi @dataislife
A sentence is broken down into tokens (word pieces). Anything above the limit (e.g. 128) is truncated, i.e., only the first 128 word pieces are used in the default setting.
Best
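To make the word-piece step concrete, here is a small sketch using the Hugging Face tokenizer (`bert-base-uncased` is just an assumed example; the exact pieces depend on the model's vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# WordPiece first splits the sentence; rare words become sub-word units.
pieces = tokenizer.tokenize("Sentence embeddings get truncated to a fixed length")
print(pieces)       # e.g. ['sentence', 'em', '##bed', '##ding', '##s', ...]
print(len(pieces))  # only the first max_seq_length pieces are used downstream
```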
```python
for text in batch_tokens:
    sentence_features = self.get_sentence_features(text, longest_seq)
```

From the snippet above, do you set the maximum number of tokens to be considered for each sentence on the fly, per batch? In other words, would the limit you mention above, set to 128 by default, instead be set to the max_length of the sentences in each batch? It would be useful to clarify what self.get_sentence_features(text, longest_seq) does exactly. Maybe a link to the self._first_module().get_sentence_features(features) implementation would be useful as well.
Thanks,
Hi @MastafaF
The 128, which is set when e.g. models.Transformer or models.BERT is initialized, is the maximum length for any input. So if a sentence has more than 128 tokens, it will be truncated to 128 tokens.
Best
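For illustration, a minimal sketch of where that limit is set at initialization, following the usual sentence-transformers pattern (the model name is a placeholder):

```python
from sentence_transformers import SentenceTransformer, models

# max_seq_length is fixed at initialization: any input longer than
# 128 word pieces is cut to the first 128, regardless of the batch.
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```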
@sagar1411986 When you update to the most recent version, the warning will disappear. This is how you can set the input length (see the sketch below). Longformer is currently not supported. Also, as there is no good training data available for longer documents, using Longformer would not make a difference: without training data, you won't get good embeddings.
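A minimal sketch of one way to do this, assuming a recent sentence-transformers version that exposes the max_seq_length attribute (the model name is a placeholder):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")  # placeholder model name
print(model.max_seq_length)  # current limit, e.g. 128

# Raise the cap; inputs are then truncated to 256 word pieces instead.
# The underlying transformer must support position embeddings up to this length.
model.max_seq_length = 256
```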
This is an older post, but hopefully this will benefit someone else. The actual truncation of the tokens happens here: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L3559
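For anyone who wants to see that code path in action, a small sketch (again with `bert-base-uncased` as a stand-in model):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_text = "word " * 1000  # far more than 128 tokens
encoded = tokenizer(long_text, truncation=True, max_length=128)

# Everything beyond max_length is dropped by the tokenizer, not the model.
print(len(encoded["input_ids"]))  # 128, including the [CLS] and [SEP] tokens
```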