Longest sequence and truncation of sentence #181

Open · dataislife opened this issue Mar 30, 2020 · 7 comments

dataislife commented Mar 30, 2020

Hi,
I wonder how the maximum length is set before a sentence is mapped to an embedding. Let s be a sentence such that s = [x1, x2, x3, …, xN]. Is there a maximum length parameter n such that if N > n, all tokens at indices above n are removed, i.e. s is mapped to map(s) = [x1, x2, …, xn]? (This is what we often see in BERT-like models.)

From this code:

    longest_seq = 0

    for idx in length_sorted_idx[batch_start: batch_end]:
        sentence = sentences[idx]
        tokens = self.tokenize(sentence)
        longest_seq = max(longest_seq, len(tokens))
        batch_tokens.append(tokens)

    features = {}
    for text in batch_tokens:
        sentence_features = self.get_sentence_features(text, longest_seq)

I am confused about what get_sentence_features does, which is defined here (I also do not understand what _first_module corresponds to):

    def get_sentence_features(self, *features):
        return self._first_module().get_sentence_features(*features)
@nreimers
Member

Hi @dataislife
BERT-like models usually have a limit of 512 tokens. In the sentence-transformer models, you can set your own limit, which is usually set to 128 tokens.

A sentence is broken down into tokens and word pieces. Anything above the limit (e.g. 128) is truncated, i.e., only the first 128 word pieces are used in the default setting.

Best
Nils
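
For illustration, a minimal sketch of this truncation behaviour using a Hugging Face tokenizer directly (the model name and the 128 limit are example values here, not the sentence-transformers internals):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

    # A sentence far longer than the limit; with truncation enabled,
    # only the first 128 word pieces (including special tokens) are kept.
    encoded = tokenizer("a very long sentence " * 100,
                        max_length=128,
                        truncation=True)
    print(len(encoded['input_ids']))  # 128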

MastafaF commented Apr 17, 2020

    for text in batch_tokens:
        sentence_features = self.get_sentence_features(text, longest_seq)

From the snippet above, do you set, on the fly, the maximum number of tokens to be considered for each sentence in a given batch? Hence, would the limit you mention above, which is set to 128 by default, be replaced by the max length of the sentences in each batch?

It would be useful to clarify what self.get_sentence_features(text, longest_seq) does exactly. Maybe a link to the self._first_module().get_sentence_features(*features) implementation would be useful as well.

Thanks,

@nreimers
Member

Hi @MastafaF
Mini-batches must be padded before they can be processed by PyTorch. If you have a batch with 3 sentences of lengths 8, 12, and 7, then longest_seq would be 12. All sentences would be padded to length 12.

The 128, which is set when e.g. models.Transformer or models.BERT is initialized, is the maximum length for any input. So if a sentence has more than 128 tokens, it will be truncated to 128 tokens.

Best
Nils
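
For illustration, a sketch of this per-batch dynamic padding using a Hugging Face tokenizer (the actual sentence-transformers code is the snippet quoted in the question; the model name and limit here are example values):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

    batch = ["a short sentence",
             "a somewhat longer example sentence in this batch",
             "tiny"]

    # Pad every sentence to the longest one in this batch, but never
    # beyond the global maximum of 128 word pieces.
    encoded = tokenizer(batch,
                        padding='longest',
                        truncation=True,
                        max_length=128,
                        return_tensors='pt')
    print(encoded['input_ids'].shape)  # (3, length_of_longest_in_batch)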

@sagar1411986

Hi @nreimers
When I use the roberta-large model in sentence-transformers, it shows a warning message like "token indices sequence length is longer than the specified maximum sequence length for this model (1017 > 512)". How can I set my own limit? And another question: can I use a Longformer model with this?

@nreimers
Member

@sagar1411986
You can ignore this warning. Truncation happened a bit differently in versions of sentence-transformers before 0.4.1.

When you update to the most recent version, the warning will disappear.

This is how you can set the input length:
https://www.sbert.net/examples/applications/computing-embeddings/README.html#input-sequence-length
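
A minimal sketch of that setting, assuming a recent sentence-transformers version (the model name is just an example):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')
    print(model.max_seq_length)  # current truncation limit

    # Inputs longer than this many word pieces will be truncated.
    model.max_seq_length = 256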

Longformer is currently not supported. Also, as there is no good training data available for longer documents, it would not make a difference to use Longformer. Without training data, you won't get good embeddings.

@sagar1411986

@nreimers
Thanks.

@brentonmallen1

This is an older post, but hopefully this will benefit someone else. The actual truncation of the tokens happens here: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L3559
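
For illustration, a sketch of calling that tokenizer method directly (the model name and token counts are example values; this assumes a recent transformers version):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    ids = tokenizer.encode("some long text " * 50, add_special_tokens=False)

    # truncate_sequences() trims the ids once a limit is exceeded and
    # returns (truncated ids, pair ids, overflowing tokens).
    truncated, _, overflowing = tokenizer.truncate_sequences(
        ids,
        num_tokens_to_remove=len(ids) - 128,
        truncation_strategy='only_first',
    )
    print(len(truncated))  # 128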
