
How come the networks include <cls>, <eos>, <unk> and other similar tokens? #18

Closed
tueboesen opened this issue Dec 25, 2020 · 4 comments
Labels
question Further information is requested

Comments

@tueboesen

This is not an issue, so I apologize for putting it here, but I didn't really know where else to ask.

I have been testing the various pretrained networks you have trained in this repository. They seem very interesting, and I might use them in a paper I'm working on, so I would like to understand them in detail.
One thing I do not understand is why the networks include so many special tokens.
I get that you need the masking token, and likewise the padding token for batching proteins of different lengths together.
The cls and eos tokens are placed just before and after a protein, but they seem unnecessary for proteins unless I'm missing something.
The unk token should signal that an amino acid is unknown, if I understand correctly, but isn't X generally the catch-all symbol for unknown amino acids in protein notation? So what is the use case here?
And similarly for the last few tokens in the vocabulary, which I have no good guess for.

@tomsercu tomsercu added the question Further information is requested label Jan 4, 2021
@tomsercu
Contributor

tomsercu commented Jan 4, 2021

For now, GitHub issues are a good place to ask questions 👍
You're right, there are a number of tokens in the vocab which have no good reason to be there. We use fairseq to train the models and largely stick to their conventions when it comes to vocab. The unusual tokens are completely unseen in training data, so shouldn't be used. But their dummy presence shouldn't hurt either.
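
For reference, the full token list can be inspected directly from the alphabet object. A minimal sketch, assuming the fair-esm package and using esm1b_t33_650M_UR50S as an example checkpoint:

```python
import esm

# Example checkpoint; other checkpoints expose the same alphabet interface.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()

# Full fairseq-style vocabulary, including the special tokens.
print(alphabet.all_toks)

# Indices of the special tokens discussed above.
print("cls:", alphabet.cls_idx)
print("pad:", alphabet.padding_idx)
print("eos:", alphabet.eos_idx)
print("mask:", alphabet.mask_idx)

# Whether the batch converter prepends cls / appends eos automatically.
print("prepend_bos:", alphabet.prepend_bos, "append_eos:", alphabet.append_eos)
```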

@tomsercu tomsercu closed this as completed Jan 4, 2021
@joshim5
Contributor

joshim5 commented Jan 4, 2021

@tueboesen To clarify further, it's important to follow the conventions if you use these models for downstream tasks. For example, the cls token needs to be prepended and the eos token appended to each sequence to get the best performance. Thanks for your interest, and let us know if you have any more questions!
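
For concreteness, the batch converter shipped with the package already handles this. A minimal sketch, assuming the fair-esm API; the checkpoint and sequence below are arbitrary examples:

```python
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()

# Arbitrary example sequence.
data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

# The converter prepends cls and appends eos, so the tensor has len(seq) + 2 positions.
assert tokens.shape[1] == len(data[0][1]) + 2
assert tokens[0, 0].item() == alphabet.cls_idx
assert tokens[0, -1].item() == alphabet.eos_idx
```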

@jiosephlee

jiosephlee commented Mar 5, 2024

@joshim5 @tomsercu Just to jump in, I have a few quick follow-up questions. Regarding "The unusual tokens are completely unseen in training data": does this apply to the cls/eos tokens as well? I'd be surprised if the CLS token improved downstream performance without having been seen during training. Also, is there a need to manually prepend/append the cls/eos tokens? It seems like the Hugging Face version of the tokenizer adds these tokens automatically.
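
A quick way to check that, using facebook/esm2_t6_8M_UR50D as an example checkpoint (the choice of checkpoint is just an assumption here):

```python
from transformers import AutoTokenizer

# Example checkpoint; other facebook/esm2_* checkpoints should behave the same way.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")

encoded = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# The printed list should start with <cls> and end with <eos>,
# i.e. the tokenizer adds the special tokens on its own.
```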

"to get the best performance" does this also depend on the fact that the CLS token is used for classifiers? For some context, for other models like BERT or ViTs, I'm seeing arguments for average pooling of the token embeddings rather than the CLS token. I'm curious if there's a recommendation for ESM.

@gorj-tessella

gorj-tessella commented Apr 15, 2024

I also have this question. I noticed that in the Hugging Face code, EsmForSequenceClassification uses EsmClassificationHead, which uses only the encoding at token position 0, which should be <cls> (the code notes "take <s> token (equiv. to [CLS])"). This is obviously different from the "mean_representations" value typically generated by extract.py, which is the average over the sequence tokens, excluding the <cls> and <eos> tokens.

Is there some justification for using the <cls> token embedding rather than the mean sequence token embedding?
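
For comparison, a sketch of both embeddings computed with the fair-esm API in the style of extract.py; the checkpoint and sequence are arbitrary examples, and the mean skips the <cls> and <eos> positions:

```python
import torch
import esm

# Example checkpoint and sequence.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]  # (batch, len(seq) + 2, dim)

# cls-style embedding: position 0, as in EsmClassificationHead.
cls_embedding = reps[0, 0]

# extract.py-style mean_representations: average over residues only,
# skipping the <cls> token at position 0 and the <eos> token at the end.
seq_len = len(strs[0])
mean_embedding = reps[0, 1 : seq_len + 1].mean(dim=0)
```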
