How come the networks include <cls>, <eos>, <unk> and other similar tokens? #18
For now, GitHub issues are a good place to ask questions 👍
@tueboesen To clarify further, it's important to follow the conventions if you use these models for downstream tasks. For example, cls/eos need to be appended and prepended to the sequences to get the best performance. Thanks for your interest and let us know if you have any more questions!
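A minimal sketch of the convention described above, prepending `<cls>` and appending `<eos>` before feeding a sequence to the model. The token ids and the `encode` helper here are purely illustrative assumptions, not the actual ESM alphabet (in practice the library's own batch converter handles this for you):

```python
# Illustrative token ids only; the real ESM alphabet defines its own mapping.
SPECIALS = {"<cls>": 0, "<pad>": 1, "<eos>": 2, "<unk>": 3}
AA_TO_ID = {aa: i + len(SPECIALS) for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def encode(seq):
    """Encode a protein sequence, prepending <cls> and appending <eos>."""
    body = [AA_TO_ID.get(aa, SPECIALS["<unk>"]) for aa in seq]
    return [SPECIALS["<cls>"]] + body + [SPECIALS["<eos>"]]

ids = encode("MKT")  # first id is <cls>, last id is <eos>
```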
@joshim5 @tomsercu Just to jump in, I have a few quick follow-up questions:

"The unusual tokens are completely unseen in training data": does this apply to the cls/eos tokens as well? I'd be surprised if the CLS token improved downstream performance without having been seen during training. Also, is there a need to manually append/prepend the cls/eos tokens? It seems like the Hugging Face version of the tokenizer adds them automatically.

"to get the best performance": does this also assume the CLS token is what's fed to classifiers? For some context, for other models like BERT or ViTs I'm seeing arguments for average pooling of the token embeddings rather than using the CLS token. I'm curious whether there's a recommendation for ESM.
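The mean-pooling alternative mentioned above can be sketched as follows. This is a common recipe, not an official ESM recommendation: average the per-residue embeddings while masking out the cls/eos/pad positions. The array shapes and the `mean_pool` helper are assumptions for illustration:

```python
import numpy as np

def mean_pool(embeddings, is_special):
    """Average token embeddings, excluding special-token positions.

    embeddings: (seq_len, dim) per-token representations
    is_special: (seq_len,) boolean mask, True at cls/eos/pad positions
    """
    keep = ~is_special
    return embeddings[keep].mean(axis=0)

# Toy example: 4 tokens (cls, two residues, eos), embedding dim 2.
emb = np.array([[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])
mask = np.array([True, False, False, True])
pooled = mean_pool(emb, mask)  # averages only the two residue rows
```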
I also have this question. I noticed in the Hugging Face code [...] Is there some justification for using the [...]?
This is not an issue, so I apologize for putting it here, but I didn't really know where else to ask.
I have been testing the various pretrained networks you have trained in this repository. They seem very interesting, and I might use them in a paper I'm working on, so I would like to understand them in detail.
One thing I do not understand about the networks is why they include so many special tokens.
I get that you need the masking token, and likewise the padding token for batching proteins of different lengths.
The cls and eos tokens are added just before and after a protein, but they seem unnecessary for proteins, unless I'm missing something?
The unk token should signal that an amino acid is unknown, if I understand correctly, but isn't X generally the catch-all in protein notation for unknown amino acids? So what is the use case here?
And similarly for the last few tokens, which I have no good guess for.
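To illustrate the roles of the tokens asked about above, here is a toy sketch, with a made-up vocabulary rather than ESM's actual alphabet: `<pad>` lets variable-length sequences be batched to a common length, while `<unk>` is a fallback for characters outside the vocabulary. Note that `X` can itself be a regular vocabulary entry, distinct from `<unk>`:

```python
# Illustrative vocabulary; real ESM models define their own alphabet,
# in which 'X' is typically a regular token distinct from <unk>.
VOCAB = {"<pad>": 0, "<unk>": 1, "A": 2, "G": 3, "X": 4}

def encode(seq):
    """Map characters to ids, falling back to <unk> for anything unseen."""
    return [VOCAB.get(c, VOCAB["<unk>"]) for c in seq]

def pad_batch(seqs):
    """Right-pad encoded sequences to the longest length with <pad>."""
    longest = max(len(s) for s in seqs)
    return [s + [VOCAB["<pad>"]] * (longest - len(s)) for s in seqs]

batch = pad_batch([encode("AG"), encode("AXGA")])
# 'X' maps to its own id; an out-of-alphabet char like 'B' maps to <unk>.
```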