Issues with long proteins (>1024 residues) with ESM_1b? #2

salvatoreloguercio · 2021-08-12T18:16:45Z

Hello, wondering if evolocity works with long proteins (>1024 residues) when embedding with ESM_1b - since the ESM repo reports issues with these proteins:
facebookresearch/esm#49

Although I see in the preprint that evolocity was used with, e.g., Spike, which is above 1024 residues. So, not an issue?

brianhie · 2021-08-12T18:43:47Z

Hi @salvatoreloguercio, yes, the 1024 residue limit of ESM-1b is unfortunate. The current workaround is to just divide the protein into 1022-residue windows (e.g., `batch_size = 1022` at line 12 of evolocity/bin/fb_semantics.py in 2b162ff; that is 1022 + before/after sequence tokens), run these through the model separately, then concatenate the output, but this is definitely a heuristic.

Encouragingly, though, this seems to give reasonable results in the (zero-shot) deep mutational scan benchmark that contains long proteins (>1022 residues). For example, I think BRCA1 is longer than 1022, but the zero-shot mutational effect performance is still higher than DeepSequence.
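The windowing workaround described above can be sketched roughly as follows. This is a minimal illustration, not the code in fb_semantics.py: `split_windows`, `embed_long`, and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in for the real model call, assumed to return one vector per residue with the start/end token positions already dropped.

```python
from typing import Callable, List

# 1022 residues per window, leaving room for the before/after sequence tokens.
WINDOW = 1022


def split_windows(seq: str, window: int = WINDOW) -> List[str]:
    """Split seq into consecutive, non-overlapping windows of at most `window` residues."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [seq[i:i + window] for i in range(0, len(seq), window)]


def embed_long(
    seq: str,
    embed_window: Callable[[str], List[List[float]]],
    window: int = WINDOW,
) -> List[List[float]]:
    """Embed each window separately, then concatenate along the residue axis."""
    out: List[List[float]] = []
    for chunk in split_windows(seq, window):
        emb = embed_window(chunk)
        # Assumption: the embedder returns exactly one vector per residue.
        if len(emb) != len(chunk):
            raise ValueError("expected one embedding vector per residue")
        out.extend(emb)
    return out


def toy_embed(chunk: str) -> List[List[float]]:
    """Hypothetical stand-in for the model: a 2-dim vector per residue."""
    return [[float(ord(aa)), float(i)] for i, aa in enumerate(chunk)]


if __name__ == "__main__":
    seq = "A" * 1273  # any length above 1022 works for the illustration
    print([len(c) for c in split_windows(seq)])
    print(len(embed_long(seq, toy_embed)))
```

Because each window is embedded independently, residues near a window boundary do not see context from the neighbouring window, which is one reason this remains a heuristic.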

salvatoreloguercio · 2021-08-12T19:07:49Z

I see, that makes sense - I think the ESM folks also suggested a similar workaround. Thank you very much!

salvatoreloguercio closed this as completed Aug 12, 2021
