Longest sequence and truncation of sentence #181

Open · dataislife opened this issue Mar 30, 2020 · 7 comments

dataislife commented Mar 30, 2020

Hi,
I wonder how the maximum length is set before a sentence is mapped to an embedding. Let s be a sentence such that s = [x1, x2, x3, …, xN]. Is there a maximum length parameter n such that if N > n, all tokens at indices above n are removed, i.e. s is mapped to map(s) = [x1, x2, …, xn]? (This is what we often see in BERT-like models.)

From this code:

    longest_seq = 0

    for idx in length_sorted_idx[batch_start: batch_end]:
        sentence = sentences[idx]
        tokens = self.tokenize(sentence)
        longest_seq = max(longest_seq, len(tokens))
        batch_tokens.append(tokens)

    features = {}
    for text in batch_tokens:
        sentence_features = self.get_sentence_features(text, longest_seq)

I am confused about what get_sentence_features does, which is defined here (I also do not understand what _first_module corresponds to):

    def get_sentence_features(self, *features):
        return self._first_module().get_sentence_features(*features)
@nreimers
Member

Hi @dataislife
BERT-like models usually have a limit of 512 tokens. In the sentence-transformer models, you can set your own limit, which is usually set to 128 tokens.

A sentence is broken down into tokens and word pieces. Anything above the limit (e.g. 128) is truncated, i.e., only the first 128 word pieces are used in the default setting.

Best
Nils
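
For illustration, a minimal sketch of this truncation behaviour using a Hugging Face tokenizer directly (the model name and the 128 limit are example values here, not the sentence-transformers internals):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

    # A sentence far longer than the limit; with truncation enabled,
    # only the first 128 word pieces (including special tokens) are kept.
    encoded = tokenizer("a very long sentence " * 100,
                        max_length=128,
                        truncation=True)
    print(len(encoded['input_ids']))  # 128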

MastafaF commented Apr 17, 2020

    for text in batch_tokens:
        sentence_features = self.get_sentence_features(text, longest_seq)

From the snippet above, do you set, on the fly, the maximum number of tokens to be considered for each sentence in a given batch? Hence, would the limit you mention above, which is set to 128 by default, be replaced by the max length of the sentences in each batch?

It would be useful to clarify what self.get_sentence_features(text, longest_seq) does exactly. Maybe a link to the self._first_module().get_sentence_features(*features) implementation would be useful as well.

Thanks,

@nreimers
Member

Hi @MastafaF
Mini-batches must be padded before they can be processed by PyTorch. If you have a batch with 3 sentences of lengths 8, 12, and 7, then longest_seq would be 12. All sentences would be padded to length 12.

The 128, which is set when e.g. models.Transformer or models.BERT is initialized, is the maximum length for any input. So if a sentence has more than 128 tokens, it will be truncated to 128 tokens.

Best
Nils
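
For illustration, a sketch of this per-batch dynamic padding using a Hugging Face tokenizer (the actual sentence-transformers code is the snippet quoted in the question; the model name and limit here are example values):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

    batch = ["a short sentence",
             "a somewhat longer example sentence in this batch",
             "tiny"]

    # Pad every sentence to the longest one in this batch, but never
    # beyond the global maximum of 128 word pieces.
    encoded = tokenizer(batch,
                        padding='longest',
                        truncation=True,
                        max_length=128,
                        return_tensors='pt')
    print(encoded['input_ids'].shape)  # (3, length_of_longest_in_batch)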

@sagar1411986

Hi @nreimers
When I use the roberta-large model in sentence-transformers, it shows a warning message like "token indices sequence length is longer than the specified maximum sequence length for this model (1017 > 512)". How can I set my own limit? And another question: can I use a Longformer model with this?

@nreimers
Member

@sagar1411986
You can ignore this warning. Truncation happened a bit differently in versions of sentence-transformers before 0.4.1.

When you update to the most recent version, the warning will disappear.

This is how you can set the input length:
https://www.sbert.net/examples/applications/computing-embeddings/README.html#input-sequence-length
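
A minimal sketch of that setting, assuming a recent sentence-transformers version (the model name is just an example):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')
    print(model.max_seq_length)  # current truncation limit

    # Inputs longer than this many word pieces will be truncated.
    model.max_seq_length = 256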

Longformer is currently not supported. Also, as there is no good training data available for longer documents, it would not make a difference to use Longformer. Without training data, you won't get good embeddings.

@sagar1411986

@nreimers
Thanks.

@brentonmallen1

This is an older post, but hopefully this will benefit someone else. The actual truncation of the tokens happens here: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L3559
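
For illustration, a sketch of calling that tokenizer method directly (the model name and token counts are example values; this assumes a recent transformers version):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    ids = tokenizer.encode("some long text " * 50, add_special_tokens=False)

    # truncate_sequences() trims the ids once a limit is exceeded and
    # returns (truncated ids, pair ids, overflowing tokens).
    truncated, _, overflowing = tokenizer.truncate_sequences(
        ids,
        num_tokens_to_remove=len(ids) - 128,
        truncation_strategy='only_first',
    )
    print(len(truncated))  # 128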
