This repository has been archived by the owner on Aug 1, 2024. It is now read-only.

Any suggestions for extracting embeddings for sequences with > 1024 residues? #21

Closed
ptkim1 opened this issue Jan 8, 2021 · 3 comments

Comments

@ptkim1

ptkim1 commented Jan 8, 2021

Could I split the sequence into 1024 length chunks, run each separately with the BOS and EOS tokens occurring in the first and last chunks, concatenate the resulting embeddings, then take the average?
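A minimal sketch of the chunk-then-average idea described above. `embed_fn` is a hypothetical callable, not part of any library: it is assumed to take one chunk and return one vector per residue with any special tokens already removed. Whether BOS and EOS go on every chunk or only on the first and last is left to that wrapper.

```python
from typing import Callable, List

Vector = List[float]
# Hypothetical interface: takes one chunk, returns one vector per residue.
EmbedFn = Callable[[str], List[Vector]]


def split_into_chunks(seq: str, max_len: int = 1024) -> List[str]:
    """Split seq into consecutive, non-overlapping chunks of at most max_len residues."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


def mean_embedding(seq: str, embed_fn: EmbedFn, max_len: int = 1024) -> Vector:
    """Embed each chunk, concatenate the per-residue vectors, and average them."""
    per_residue: List[Vector] = []
    for chunk in split_into_chunks(seq, max_len):
        per_residue.extend(embed_fn(chunk))  # concatenate along the sequence axis
    if not per_residue:
        raise ValueError("cannot embed an empty sequence")
    dim = len(per_residue[0])
    n = len(per_residue)
    return [sum(vec[d] for vec in per_residue) / n for d in range(dim)]
```

Averaging over all residues at once, rather than averaging the per-chunk means, keeps a short final chunk from being over-weighted.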

Since the model was trained on random crops of sequences longer than 1024 residues, this seems like it should work, but I want to make sure.

Also, a warning that the sequence is too long would be helpful: as of now, trying to embed a sequence longer than 1024 residues on GPU results in the unhelpful "device-side assert triggered" CUDA runtime error.
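Until such a warning exists, a caller-side guard can fail early with a readable message. This is only a sketch: `check_length` and `reserved_tokens` are hypothetical names, and the 1024 limit is taken from this thread. If special tokens such as BOS and EOS count toward the limit, the usable number of residues is smaller, which `reserved_tokens` is meant to express.

```python
MAX_TOKENS = 1024  # limit discussed in this thread, not read from the model


def check_length(seq: str, max_tokens: int = MAX_TOKENS, reserved_tokens: int = 0) -> None:
    """Raise a readable error before the sequence ever reaches the GPU.

    reserved_tokens is a hypothetical knob for positions taken up by
    special tokens (for example BOS/EOS) if they count toward the limit.
    """
    limit = max_tokens - reserved_tokens
    if len(seq) > limit:
        raise ValueError(
            f"Sequence has {len(seq)} residues but at most {limit} are supported; "
            "crop or split it before embedding."
        )
```

Calling `check_length(seq)` before batching turns the opaque CUDA assert into an ordinary Python exception.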

@joshim5
Contributor

joshim5 commented Jan 19, 2021

Yes, during training we cropped sequences longer than 1024 residues, so taking crops is a sensible choice. We haven't experimented with concatenating or averaging the resulting embeddings, but there are a number of things you could try. For example, assuming the sequence is of length 3072, you could:

  • split into 3 chunks of length 1024 and concatenate
  • split into overlapping chunks of length 1024 (start position = 0, 1, 2, ..., 2048) and average appropriately
  • split into chunks of different lengths
  • many other possibilities...

It's possible that these will all perform similarly, but as this is still an open research question, we would be eager to hear what you find.
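One way to read "average appropriately" for the overlapping-window option is to average, for each residue, the vectors from every window that covers it. The sketch below does that in plain Python. `embed_fn` is a hypothetical callable assumed to return one vector per residue of the chunk it is given, and the `stride` parameter is an addition for illustration (stride 1 on a length-3072 sequence gives 2049 windows, which is expensive).

```python
from typing import Callable, List

Vector = List[float]


def window_starts(seq_len: int, window: int = 1024, stride: int = 1) -> List[int]:
    """Start positions of windows covering the whole sequence."""
    if seq_len <= window:
        return [0]
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)  # make sure the tail is covered
    return starts


def sliding_window_embeddings(
    seq: str,
    embed_fn: Callable[[str], List[Vector]],  # hypothetical per-residue embedder
    window: int = 1024,
    stride: int = 512,
) -> List[Vector]:
    """Per-residue embeddings, averaged over all windows covering each residue."""
    n = len(seq)
    if n == 0:
        return []
    sums: List[Vector] = []
    counts = [0] * n
    for start in window_starts(n, window, stride):
        vecs = embed_fn(seq[start:start + window])
        if not sums:
            sums = [[0.0] * len(vecs[0]) for _ in range(n)]
        for offset, vec in enumerate(vecs):
            pos = start + offset
            counts[pos] += 1
            for d, x in enumerate(vec):
                sums[pos][d] += x
    return [[x / counts[i] for x in sums[i]] for i in range(n)]
```

Setting `stride` equal to `window` reduces this to the plain split-and-concatenate option, so the same function can be used to compare the two.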

@tomsercu
Contributor

Because I'm referencing this issue in the GitHub Discussions, let me add another option to Josh's list:
If you know the domain boundaries (or can predict them), they would be a good place to split the protein sequence. This could again be combined with averaging the embeddings over a strided window of consecutive domains.
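A rough sketch of the domain-based option, assuming the boundaries are already available as 0-based indices at which a new domain starts. Both function names are hypothetical, and predicting the boundaries is out of scope here. The second function builds the "strided window of consecutive domains" by joining neighbouring domains into one input.

```python
from typing import Iterable, List


def split_at_boundaries(seq: str, boundaries: Iterable[int]) -> List[str]:
    """Cut seq at the given 0-based positions, each marking the start of a domain."""
    inner = sorted({b for b in boundaries if 0 < b < len(seq)})
    cuts = [0] + inner + [len(seq)]
    return [seq[a:b] for a, b in zip(cuts, cuts[1:])]


def consecutive_domain_windows(
    domains: List[str],
    n_domains: int = 2,
    stride: int = 1,
    max_len: int = 1024,
) -> List[str]:
    """Join each run of n_domains consecutive domains into one model input."""
    if n_domains <= 0 or stride <= 0:
        raise ValueError("n_domains and stride must be positive")
    windows = []
    for i in range(0, max(len(domains) - n_domains, 0) + 1, stride):
        joined = "".join(domains[i:i + n_domains])
        if len(joined) > max_len:
            # Assumed policy: fail loudly rather than silently truncate.
            raise ValueError(f"window starting at domain {i} has {len(joined)} residues")
        windows.append(joined)
    return windows
```

Each window can then be embedded separately, and residues that appear in more than one window can be averaged in the same way as for overlapping crops.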

@aliencaocao

So is there no need to add BOS and EOS to each chunk?
