
Protein Embedding with last activation layers? #15

Closed
victormaricato opened this issue Jul 16, 2021 · 9 comments

@victormaricato

Is it possible to obtain the last activation values using AlphaFold?

Something like what ESM allows with its model.forward method.

@ptynecki

ptynecki commented Jul 19, 2021

Let me make the question more precise:

How can we execute the AF2 pipeline to get a fixed-length numeric vector that represents a single AA sequence?
If that is possible, should we expect the AA sequence length to be capped at 512, 1280, or some other limit?

@russbates

russbates commented Jul 20, 2021

Hi,
Although the ability to return the final representations/embeddings is not currently exposed in the RunModel container, it should be possible to enable it by adding a return_representations=True keyword argument here:
https://github.com/deepmind/alphafold/blob/d26287ea57e1c5a71372f42bf16f486bb9203068/alphafold/model/model.py#L64
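
For readers following along, a minimal sketch of what that edit might look like, based on how the AlphaFold module is invoked inside RunModel in model.py (treat this as an outline rather than a verified patch; the exact line varies between commits):

# alphafold/model/model.py -- sketch of the suggested edit;
# `modules` here is alphafold.model.modules.
def _forward_fn(batch):
    model = modules.AlphaFold(self.config.model)
    return model(
        batch,
        is_training=False,
        compute_loss=False,
        ensemble_representations=True,
        return_representations=True)  # the suggested keyword argument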

@xinformatics

I didn't run the actual model, but I was using the Jupyter notebook provided by @sokrypton. He suggested editing the class AlphaFold (located in alphafold/model/modules.py) and setting return_representations=True.

In the Jupyter notebook he provided,

prediction_result = model_runner.predict(processed_feature_dict)

returns prediction_result as a dictionary that includes a 'representations' key:

prediction_result.keys()
dict_keys(['distogram', 'experimentally_resolved', 'masked_msa', 'predicted_lddt', 'representations', 'structure_module', 'plddt'])

That key holds a nested dictionary:

prediction_result['representations'].keys()
dict_keys(['msa', 'msa_first_row', 'pair', 'single', 'structure_module'])

It contains the learned representations, although I am not sure which one to use. Hope it helps.
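
To make that concrete, here is a minimal sketch of inspecting those arrays (assuming model_runner and processed_feature_dict are set up as in the notebook; L is the sequence length and the channel sizes depend on the model config):

import numpy as np

# With return_representations=True, the prediction dict also carries the
# intermediate embeddings under the 'representations' key.
prediction_result = model_runner.predict(processed_feature_dict)
reps = prediction_result['representations']

single = np.asarray(reps['single'])  # per-residue embedding, shape (L, C)
pair = np.asarray(reps['pair'])      # pairwise embedding, shape (L, L, C_pair)

# Print every available representation together with its shape.
print({key: np.asarray(value).shape for key, value in reps.items()})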

@tfgg tfgg closed this as completed Jul 21, 2021
@xinformatics

@tfgg Could you suggest which representation would be a good choice as a protein embedding for downstream tasks, given that I get five different representations from the prediction result?

@ptynecki

@tfgg
Is there any reason why this thread was closed? @xinformatics shared some tips, but the main questions still haven't been answered.

Thank you for considering this.

@ricomnl

ricomnl commented Jul 26, 2021

@xinformatics The first section of the article The AlphaFold2 Method Paper: A Fount of Good Ideas suggests that s_i is the embedding you want to use. It corresponds to the 'single' key in the prediction_result['representations'] dict.

At every step of the process, {s_i} is kept updated, communicating back and forth with {z_{ij}}, so that whatever is built up in {z_{ij}} is made accessible to {s_i}. As a result {s_i} is front and center in all the major modules. And at the end, in the structure module, it is ultimately {s_i}, not {z_{ij}}, that encodes the structure (where the quaternions get extracted to generate the structure). This avoids the awkwardness of having to project the 2D representation onto 3D space.
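
Building on that, one way to get the fixed-length vector asked about above is to mean-pool the per-residue 'single' representation. A sketch (the pooling choice is an assumption, not something AlphaFold prescribes):

import numpy as np

# 'single' is per-residue, shape (L, C); averaging over the residue axis
# yields a fixed-length vector regardless of sequence length L.
single = np.asarray(prediction_result['representations']['single'])
embedding = single.mean(axis=0)  # shape (C,)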

@xinformatics

@rmeinl Thank you so much. I was thinking along similar lines. Actually, the problem in my case is that I only need the representations (not the final PDB output), and I haven't figured out how to run AF2 prediction in a loop. I have 964 sequences, and I want to avoid running AF2 manually on each one. The embedding extraction code is available in my Alphafold repository on GitHub.
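
A rough sketch of such a loop (sequences and make_features are hypothetical placeholders for your own I/O and feature-generation steps; process_features and predict are the RunModel methods used in the notebook):

import numpy as np

embeddings = {}
for seq_id, sequence in sequences.items():
    raw_features = make_features(sequence)  # placeholder: your feature pipeline
    processed = model_runner.process_features(raw_features, random_seed=0)
    result = model_runner.predict(processed)
    # Mean-pool the per-residue 'single' representation into one vector.
    embeddings[seq_id] = np.asarray(
        result['representations']['single']).mean(axis=0)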

@ricomnl

ricomnl commented Jul 26, 2021

Ah, interesting! I'm looking at a similar task. Two things I'll look at are 1) "turning off" the recycling step (doing a single pass only) and 2) using only one of the models (instead of running all of them and selecting the best-scoring one, as in the provided AlphaFold.ipynb).

[...]
model_names = ['model_1', 'model_2', 'model_3', 'model_4', 'model_5', 'model_2_ptm']

[...]
for model_name in model_names:
   [...]

[...]
# Find the best model according to the mean pLDDT
best_model_name = max(plddts.keys(), key=lambda x: plddts[x].mean())

[...]
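
A sketch of both ideas together, using the config keys from the 2021 release (the params path is a placeholder; double-check the key names against your AlphaFold version):

from alphafold.model import config, data, model

# Use a single model and turn recycling off (one forward pass only).
cfg = config.model_config('model_1')
cfg.data.common.num_recycle = 0  # input-pipeline side
cfg.model.num_recycle = 0        # network side

params = data.get_model_haiku_params(
    model_name='model_1', data_dir='/path/to/params')  # placeholder path
model_runner = model.RunModel(cfg, params)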

@pykao

pykao commented May 10, 2022

Hi @xinformatics,

I set return_representations=True within alphafold/model/modules.py, relaunched the Docker container, and ran the same experiment again. However, the features.pkl output is still the same. Could you please point out which Jupyter notebook ColabFold uses to generate the protein embedding?

Best,
Po-Yu
