How come the networks include <cls>, <eos>, <unk> and other similar tokens? #18
For now, GitHub issues are a good place to ask questions 👍
@tueboesen To clarify further, it's important to follow the conventions if you use these models for downstream tasks. For example, cls/eos need to be appended and prepended to the sequences to get the best performance. Thanks for your interest and let us know if you have any more questions!
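A minimal sketch of the convention described above, prepending `<cls>` and appending `<eos>` before feeding a sequence to the model. The token ids and the `encode` helper here are purely illustrative assumptions, not the actual ESM alphabet (in practice the library's own batch converter handles this for you):

```python
# Illustrative token ids only; the real ESM alphabet defines its own mapping.
SPECIALS = {"<cls>": 0, "<pad>": 1, "<eos>": 2, "<unk>": 3}
AA_TO_ID = {aa: i + len(SPECIALS) for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def encode(seq):
    """Encode a protein sequence, prepending <cls> and appending <eos>."""
    body = [AA_TO_ID.get(aa, SPECIALS["<unk>"]) for aa in seq]
    return [SPECIALS["<cls>"]] + body + [SPECIALS["<eos>"]]

ids = encode("MKT")  # first id is <cls>, last id is <eos>
```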
@joshim5 @tomsercu Just to jump in, I have a few quick follow-up questions:

"The unusual tokens are completely unseen in training data": does this apply to the cls/eos tokens as well? I'd be surprised if the CLS token improved downstream performance without having been seen during training. Also, is there a need to manually append/prepend the cls/eos tokens? It seems like the Hugging Face version of the tokenizer adds them automatically.

"to get the best performance": does this also assume the CLS token is what's fed to classifiers? For some context, for other models like BERT or ViTs I'm seeing arguments for average pooling of the token embeddings rather than using the CLS token. I'm curious whether there's a recommendation for ESM.
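The mean-pooling alternative mentioned above can be sketched as follows. This is a common recipe, not an official ESM recommendation: average the per-residue embeddings while masking out the cls/eos/pad positions. The array shapes and the `mean_pool` helper are assumptions for illustration:

```python
import numpy as np

def mean_pool(embeddings, is_special):
    """Average token embeddings, excluding special-token positions.

    embeddings: (seq_len, dim) per-token representations
    is_special: (seq_len,) boolean mask, True at cls/eos/pad positions
    """
    keep = ~is_special
    return embeddings[keep].mean(axis=0)

# Toy example: 4 tokens (cls, two residues, eos), embedding dim 2.
emb = np.array([[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])
mask = np.array([True, False, False, True])
pooled = mean_pool(emb, mask)  # averages only the two residue rows
```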
I also have this question. I noticed in the Hugging Face code [...] Is there some justification for using the [...]?
This is not an issue, so I apologize for putting it here, but I didn't really know where else to ask.
I have been testing the various pretrained networks you have trained in this repository. They seem very interesting, and I might use them in a paper I'm working on, so I would like to understand them in detail.
One thing I do not understand about the networks is why they include so many special tokens.
I get that you need the masking token, and likewise the padding token for batching proteins of different lengths.
The cls and eos tokens are added just before and after a protein, but they seem unnecessary for proteins, unless I'm missing something?
The unk token should signal that an amino acid is unknown, if I understand correctly, but isn't X generally the catch-all in protein notation for unknown amino acids? So what is the use case here?
And similarly for the last few tokens, which I have no good guess for.
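To illustrate the roles of the tokens asked about above, here is a toy sketch, with a made-up vocabulary rather than ESM's actual alphabet: `<pad>` lets variable-length sequences be batched to a common length, while `<unk>` is a fallback for characters outside the vocabulary. Note that `X` can itself be a regular vocabulary entry, distinct from `<unk>`:

```python
# Illustrative vocabulary; real ESM models define their own alphabet,
# in which 'X' is typically a regular token distinct from <unk>.
VOCAB = {"<pad>": 0, "<unk>": 1, "A": 2, "G": 3, "X": 4}

def encode(seq):
    """Map characters to ids, falling back to <unk> for anything unseen."""
    return [VOCAB.get(c, VOCAB["<unk>"]) for c in seq]

def pad_batch(seqs):
    """Right-pad encoded sequences to the longest length with <pad>."""
    longest = max(len(s) for s in seqs)
    return [s + [VOCAB["<pad>"]] * (longest - len(s)) for s in seqs]

batch = pad_batch([encode("AG"), encode("AXGA")])
# 'X' maps to its own id; an out-of-alphabet char like 'B' maps to <unk>.
```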