Any suggestions for extracting embeddings for sequences with > 1024 residues? #21

ptkim1 · 2021-01-08T22:49:22Z

Could I split the sequence into chunks of length 1024, run each separately with the BOS and EOS tokens occurring only in the first and last chunks, concatenate the resulting embeddings, then take the average?

Since the model was trained on random crops of sequences longer than 1024 residues, this seems like it should work, but I want to make sure.

Also, a warning that the sequence is too long would be helpful. As of now, trying to embed a sequence longer than 1024 residues while running on GPU results in the unhelpful "device-side assert triggered" CUDA runtime error.
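A pre-flight length check along these lines would surface the problem before anything reaches the GPU. This is only a minimal sketch: `check_length` and `MAX_RESIDUES` are hypothetical names, not part of the library, and whether the 1024 limit counts the BOS/EOS tokens is an assumption left as a parameter.

```python
# Hypothetical helper, not part of the library discussed in this thread.
# Assumption: the limit is 1024 positions; adjust max_residues if BOS/EOS
# tokens count toward it for your model.
MAX_RESIDUES = 1024


def check_length(name, sequence, max_residues=MAX_RESIDUES):
    """Return the sequence unchanged, or raise a readable error if it is too long."""
    if len(sequence) > max_residues:
        raise ValueError(
            f"Sequence {name!r} has {len(sequence)} residues, above the "
            f"maximum of {max_residues}; split it into chunks before embedding."
        )
    return sequence
```

Calling this on every sequence before batching turns the opaque CUDA assert into an ordinary Python exception that names the offending sequence.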

joshim5 · 2021-01-19T23:49:14Z

Yes, during training we cropped sequences longer than 1024 residues, so taking crops is a sensible choice. We haven't experimented with concatenating or averaging the resulting embeddings, but there are a number of things you could try. For example, assuming the sequence is of length 3072, you could:

- split into 3 chunks of length 1024 and concatenate
- split into overlapping windows of length 1024 (start position = 0, 1, 2, ..., 2048) and average appropriately
- split into chunks of different lengths
- many other possibilities...

It's possible that these will all perform similarly, but as this is still an open research question, we would be eager to hear what you find.
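The first two options can be sketched with one windowing routine. This is an illustrative sketch, not a tested recipe: `embed_fn` is a hypothetical callable that takes a chunk and returns one vector (a list of floats) per residue, for example a wrapper around the model's forward pass, and the handling of the final short window (shifting it back so it ends at the sequence end) is an assumption.

```python
def chunk_starts(seq_len, window=1024, stride=1024):
    """Start offsets of windows covering the sequence.

    stride == window gives non-overlapping chunks; stride == 1 gives the
    fully overlapping sliding window. The last window is shifted back so
    that it ends exactly at seq_len (an assumption, not from the thread).
    """
    if seq_len <= window:
        return [0]
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] + window < seq_len:
        starts.append(seq_len - window)
    return starts


def embed_long(sequence, embed_fn, window=1024, stride=1024):
    """Per-residue embeddings, averaged over every window covering a position."""
    sums = [None] * len(sequence)
    counts = [0] * len(sequence)
    for start in chunk_starts(len(sequence), window, stride):
        chunk = sequence[start:start + window]
        vectors = embed_fn(chunk)  # hypothetical: one vector per residue
        for offset, vec in enumerate(vectors):
            pos = start + offset
            if sums[pos] is None:
                sums[pos] = list(vec)
            else:
                sums[pos] = [a + b for a, b in zip(sums[pos], vec)]
            counts[pos] += 1
    return [[v / n for v in total] for total, n in zip(sums, counts)]


def mean_pool(per_residue):
    """Average per-residue vectors into a single sequence-level vector."""
    dim = len(per_residue[0])
    return [sum(vec[d] for vec in per_residue) / len(per_residue) for d in range(dim)]
```

With `stride=window`, `embed_long` reduces to plain concatenation of chunk embeddings; with a smaller stride, each position is averaged over all windows that contain it. `mean_pool` then gives one vector for the whole protein.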

tomsercu · 2021-05-19T13:40:39Z

Because I'm referencing this issue in the GitHub Discussions, let me add another option to Josh's list:
If you know the domain boundaries (or can predict them), they would be a good place to split the protein sequence. You could again average the embeddings over a strided window of consecutive domains.
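One way to read this suggestion is sketched below. The function names are hypothetical, the boundaries are assumed to be 0-based cut positions supplied by an annotation or a predictor, and grouping `k` consecutive domains per window is just one interpretation of "strided window".

```python
def domain_spans(seq_len, boundaries, max_len=1024):
    """Turn cut positions into (start, end) spans, one per domain.

    Raises if any single domain is still longer than max_len, since it
    would need further chunking before embedding.
    """
    cuts = sorted({b for b in boundaries if 0 < b < seq_len})
    edges = [0] + cuts + [seq_len]
    spans = list(zip(edges, edges[1:]))
    for start, end in spans:
        if end - start > max_len:
            raise ValueError(
                f"Domain {start}:{end} has {end - start} residues, "
                f"above the maximum of {max_len}."
            )
    return spans


def consecutive_windows(spans, k=2, stride=1):
    """Spans covering k consecutive domains, advancing by `stride` domains.

    The merged spans are not checked against the length limit here;
    the caller should verify each one before embedding it.
    """
    if len(spans) <= k:
        return [(spans[0][0], spans[-1][1])]
    return [
        (spans[i][0], spans[i + k - 1][1])
        for i in range(0, len(spans) - k + 1, stride)
    ]
```

Each returned span can be sliced out of the sequence as `sequence[start:end]` and embedded separately; positions covered by more than one window can then be averaged.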

aliencaocao · 2023-09-04T14:57:10Z

So is there no need to add BOS and EOS to EACH chunk?

joshim5 closed this as completed Jan 19, 2021

joshim5 mentioned this issue Feb 9, 2021

Inconsistent dimension when generate contact maps and maximum sequence length #34

Closed

konstin mentioned this issue Mar 3, 2021

Proteins longer than 1024 causes an exception on the CPU and poisons the GPU #49

Closed

tomsercu mentioned this issue Mar 25, 2022

ValueError: Sequence length 1042 above maximum sequence length of 1024. #166

Closed
