This repository has been archived by the owner on Aug 1, 2024. It is now read-only.
Could I split the sequence into 1024 length chunks, run each separately with the BOS and EOS tokens occurring in the first and last chunks, concatenate the resulting embeddings, then take the average?
Since during training the model used random crops of sequences longer than 1024, it seems this should work, but I want to make sure.
Also, a warning that the sequence is too long would be helpful: as of now, trying to embed a sequence longer than 1024 while running on GPU results in the unhelpful "device-side assert triggered" CUDA runtime error.
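One way to surface that failure earlier is a plain length check before the tensors ever reach the GPU. This is only a sketch (the `MAX_LEN` constant and `check_length` helper are hypothetical, not part of the repo's API):

```python
MAX_LEN = 1024  # model context length, including BOS/EOS tokens

def check_length(tokens):
    # Fail early on the CPU with a clear message instead of letting the
    # GPU raise an opaque "device-side assert triggered" error.
    if len(tokens) > MAX_LEN:
        raise ValueError(
            f"Sequence of {len(tokens)} tokens exceeds the model's "
            f"maximum of {MAX_LEN}; split it into chunks first."
        )
```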
Yes, during training we took random crops of sequences longer than 1024, so taking crops is a sensible choice. We haven't experimented with concatenating or averaging the resulting embeddings, but there are a number of things you could try. For example, assuming the sequence is of length 3072, you could:
split into 3 non-overlapping chunks of length 1024 and concatenate the embeddings
split into overlapping chunks of length 1024 (start positions 0, 1, 2, ..., 2048) and average appropriately
split into chunks of different lengths
many other possibilities...
It's possible that these will all perform similarly, but as this is still an open research question, we would be eager to hear what you find.
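The first two options above can be sketched as follows. The `embed_chunk` function here is a deterministic toy stand-in for the model's forward pass (so the sketch is runnable); in practice it would call the ESM model and return one embedding vector per chunk:

```python
import numpy as np

EMBED_DIM = 8  # toy dimension; real model embeddings are much larger

def embed_chunk(chunk):
    # Hypothetical stand-in for a model forward pass: seed a generator
    # from the chunk contents so the output is deterministic per chunk.
    rng = np.random.default_rng(sum(chunk.encode()) % (2 ** 32))
    return rng.standard_normal(EMBED_DIM)

def chunked_concat(seq, chunk_len=1024):
    """Option 1: non-overlapping chunks, concatenate per-chunk embeddings."""
    chunks = [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]
    return np.concatenate([embed_chunk(c) for c in chunks])

def strided_average(seq, chunk_len=1024, stride=1):
    """Option 2: overlapping windows, average the per-window embeddings."""
    starts = range(0, len(seq) - chunk_len + 1, stride)
    return np.mean([embed_chunk(seq[s:s + chunk_len]) for s in starts], axis=0)
```

Note the trade-off: concatenation keeps per-region information but yields a variable-length representation, while averaging yields a fixed-size vector at the cost of smearing positional detail.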
Because I'm referencing this issue in the GitHub discussions, let me add another option to Josh's list:
If you know the domain boundaries (or can predict them), splitting the protein sequence at those boundaries would be a good approach, potentially again averaging the embeddings over a strided window of consecutive domains.
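A minimal sketch of that idea, assuming the boundary positions are already known or predicted (the `boundaries` argument here is hypothetical input, e.g. from a domain-prediction tool):

```python
def domain_windows(seq, boundaries, window=2):
    """Split `seq` at domain boundaries, then slide a window over
    `window` consecutive domains, yielding one subsequence per window.
    Each subsequence can then be embedded and the results averaged."""
    cuts = [0, *boundaries, len(seq)]
    domains = [seq[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]
    return ["".join(domains[i:i + window])
            for i in range(len(domains) - window + 1)]
```

This keeps each embedded chunk aligned with a biologically meaningful unit, rather than cutting at an arbitrary position 1024.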