Issues with long proteins (>1024 residues) with ESM_1b? #2

Closed
salvatoreloguercio opened this issue Aug 12, 2021 · 2 comments

@salvatoreloguercio

Hello, I'm wondering whether evolocity works with long proteins (>1024 residues) when embedding with ESM_1b, since the ESM repo reports issues with these proteins:
facebookresearch/esm#49

Although I see in the preprint that evolocity was used with, e.g., Spike, which is above 1024 residues. So, not an issue?

@brianhie
Owner

Hi @salvatoreloguercio, yes, the 1024 residue limit of ESM-1b is unfortunate. The current workaround is to just divide the protein into 1022 residue windows (e.g., `batch_size = 1022`, 1022 + before/after sequence tokens), run these through the model separately, then concatenate the output, but this is definitely a heuristic.
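To make the windowing idea concrete, here is a minimal sketch in plain Python. It is not evolocity's actual code: the names `split_into_windows` and `embed_long` are made up for illustration, and `embed_fn` is a hypothetical callable that takes one chunk of residues and returns one vector per residue, with the start/end tokens already stripped.

```python
from typing import Callable, List, Sequence

# 1022 residues per window, leaving room for the two extra sequence tokens.
WINDOW = 1022


def split_into_windows(seq: str, window: int = WINDOW) -> List[str]:
    """Cut a sequence into consecutive, non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [seq[i:i + window] for i in range(0, len(seq), window)]


def embed_long(
    seq: str,
    embed_fn: Callable[[str], Sequence[Sequence[float]]],  # hypothetical embedder
    window: int = WINDOW,
) -> List[Sequence[float]]:
    """Embed each window separately and concatenate the per-residue outputs."""
    per_residue: List[Sequence[float]] = []
    for chunk in split_into_windows(seq, window):
        out = embed_fn(chunk)
        # Guard against an embedder that forgot to strip special tokens.
        if len(out) != len(chunk):
            raise ValueError("embed_fn must return one vector per residue")
        per_residue.extend(out)
    return per_residue
```

Because the windows do not overlap, residues near a window boundary never see context from the neighbouring window, which is one reason this is only a heuristic.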

Encouragingly, though, this seems to give reasonable results in the (zero-shot) deep mutational scan benchmark that contains long proteins (>1022 residues). For example, I think BRCA1 is longer than 1022, but the zero-shot mutational effect performance is still higher than DeepSequence.

@salvatoreloguercio
Author

I see, that makes sense - I think the ESM folks also suggested a similar workaround. Thank you very much!
