Is there any alignment files to download? #113

Closed
Zhang690683220 opened this issue Jun 6, 2022 · 13 comments

Comments

@Zhang690683220
Contributor

Hi,

We're trying to reproduce the training process. However, the alignment step seems to take an extremely long time.

We used 128 nodes to align 128 mmCIF files (1 file on each node), but it took 13 hours to finish the entire job.

I'm wondering if there is a tar file with precomputed alignments for all mmCIF files that we could download, which would help a lot.

Thanks

@gahdritz
Collaborator

gahdritz commented Jun 6, 2022

There will be, approximately one week from now, when we release our full training data. Stay tuned.

@gahdritz gahdritz closed this as completed Jun 6, 2022
@llwx593

llwx593 commented Jul 15, 2022

Hi,
I'd like to know whether the alignment files for the full training data set have been published.
I haven't been able to find them.
Thanks

@gahdritz
Collaborator

Yes they have. See the RODA link in the README.

@llwx593

llwx593 commented Jul 20, 2022

Hi,
Thank you very much for the training data. I have a few follow-up questions:

  1. RODA contains two directories: one holds the alignments for the PDB dataset, and the other holds the alignments for the uniclust30 self-distillation dataset. Is my understanding correct?

  2. Is the PDB alignment data the same as what I would get by processing pdb_mmcif/mmcif_files with scripts/precompute_alignments.py? If so, should I set mmcif_dir to pdb_mmcif/mmcif_files, alignment_dir to RODA_PATH/pdb/, and template_mmcif_dir to pdb_mmcif/mmcif_files?
    Thanks

@gahdritz
Collaborator

gahdritz commented Jul 20, 2022

Correct. You can simultaneously use the distillation data using the --distillation... flags and the predicted structures uploaded to RODA.

@llwx593

llwx593 commented Jul 21, 2022

Thank you for your answer. I tried to run the training script using the method above, but I got a StopIteration exception. The following figure shows where the exception is raised (in openfold/data/data_modules.py):
image
The exception is thrown right after the value of "flag1" becomes 1. I have already checked that RODA_PATH/pdb/ looks normal. I don't know what else could cause this error.

@gahdritz
Collaborator

Could you print out "self.probabilities" for me?

@llwx593

llwx593 commented Jul 22, 2022

The value of "self.probabilities" is 1.
image
The file structure of the RODA data is:
RODAPATH
--pdb
----101m_A
------a3m
--------bfd_uniclust_hits.a3m, mgnify_hits.a3m, uniref90_hits.a3
------hhr
--------pdb70_hits.hhr
----uniclust30

@llwx593

llwx593 commented Jul 22, 2022

I found that some chains in RODAPATH/pdb do not exist in pdb_mmcif/mmcif_files/. This leads to a KeyError when the cache is queried with a chain ID from RODAPATH/pdb. Maybe my mmcif_files/ is different from yours. Can I simply delete these nonexistent chains? I'll try to see whether training runs normally. If it does, will it affect the accuracy?
Thanks.
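
One way to list the mismatched chains before deleting or skipping anything is sketched below. It is only a sketch under assumptions not confirmed in this thread: that chain directories are named `<pdb_id>_<chain>` (as in `101m_A`) and that the mmCIF files are named `<pdb_id>.cif`. The function name `chains_without_mmcif` is hypothetical and not part of OpenFold. It checks only that an mmCIF file exists for the entry, not that the specific chain is present in it.

```python
from pathlib import Path
from typing import List


def chains_without_mmcif(alignment_dir: str, mmcif_dir: str) -> List[str]:
    """Return chain directory names whose PDB entry has no .cif file in mmcif_dir.

    Hypothetical helper. Assumes directories are named <pdb_id>_<chain> and
    mmCIF files are named <pdb_id>.cif; IDs are compared case-insensitively.
    """
    # Collect the PDB IDs for which an mmCIF file is available.
    mmcif_ids = {p.stem.lower() for p in Path(mmcif_dir).glob("*.cif")}

    missing = []
    for chain_dir in sorted(Path(alignment_dir).iterdir()):
        if not chain_dir.is_dir():
            continue
        # "101m_A" -> "101m": drop the chain suffix after the last underscore.
        pdb_id = chain_dir.name.rsplit("_", 1)[0].lower()
        if pdb_id not in mmcif_ids:
            missing.append(chain_dir.name)
    return missing
```

Calling `chains_without_mmcif("RODAPATH/pdb", "pdb_mmcif/mmcif_files")` would return the chain directories to inspect; the function itself does not delete anything.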

@gahdritz
Collaborator

I see now. Since the RODA data is supposed to be generally applicable, it has a slightly different format than that expected by the OF dataloaders. For OF's sake, you should flatten the intermediate a3m and hhr directories, putting all .a3m and .hhr files directly in directories corresponding to the individual chains. So e.g.

alignment_dir/
---101m_A
------msa_1.a3m
------msa_2.a3m
------template_hits.hhr
---next_chain
------ etc.

If you also want to use the distillation set in uniclust30/, you should similarly flatten the file format directories.
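
The flattening described above could be scripted roughly as follows. This is a minimal sketch, not an official OpenFold utility: the function names `flatten_chain_dir` and `flatten_alignment_dir` are hypothetical, and the subdirectory names `a3m` and `hhr` are taken from the RODA layout posted earlier in this thread. It moves files in place, so you may want to try it on a copy of the data first.

```python
import shutil
from pathlib import Path


def flatten_chain_dir(chain_dir: Path, subdirs=("a3m", "hhr")) -> int:
    """Move files from chain_dir/<subdir>/ up into chain_dir.

    Hypothetical helper. Returns the number of files moved.
    """
    moved = 0
    for name in subdirs:
        sub = chain_dir / name
        if not sub.is_dir():
            continue  # already flattened, or this format is absent
        for f in sorted(sub.iterdir()):
            if not f.is_file():
                continue
            target = chain_dir / f.name
            if target.exists():
                # Never overwrite silently; let the user resolve the clash.
                raise FileExistsError(f"refusing to overwrite {target}")
            shutil.move(str(f), str(target))
            moved += 1
        # Remove the format directory only if nothing is left inside it.
        if not any(sub.iterdir()):
            sub.rmdir()
    return moved


def flatten_alignment_dir(alignment_dir: str) -> int:
    """Flatten every chain directory directly under alignment_dir."""
    total = 0
    for chain_dir in sorted(Path(alignment_dir).iterdir()):
        if chain_dir.is_dir():
            total += flatten_chain_dir(chain_dir)
    return total
```

Running `flatten_alignment_dir("RODAPATH/pdb")` would turn `101m_A/a3m/mgnify_hits.a3m` into `101m_A/mgnify_hits.a3m`, and so on. Running it a second time moves nothing, so it is safe to re-run after an interruption.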

@llwx593

llwx593 commented Jul 29, 2022

Thanks for your reply. I tried skipping the nonexistent chain IDs and found that training ran normally, even though I didn't flatten the data.
image

@gahdritz
Collaborator

It's important that you flatten the data, or the model is going to run with empty MSAs and templates. It doesn't know how to read un-flattened data like you have.
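
Because training can appear to run normally in this situation, a quick pre-flight check may help. The sketch below reports chain directories that have no `.a3m` or no `.hhr` file directly inside them. The function name `chains_missing_flat_files` is hypothetical, and the check covers only the two extensions mentioned in this thread; it says nothing about any other alignment formats the dataloader may accept.

```python
from pathlib import Path
from typing import List


def chains_missing_flat_files(alignment_dir: str) -> List[str]:
    """Return chain directories lacking a top-level .a3m or .hhr file.

    Hypothetical helper. Files nested in a3m/ or hhr/ subdirectories do not
    count, since Path.glob("*.a3m") matches only direct children.
    """
    bad = []
    for chain_dir in sorted(Path(alignment_dir).iterdir()):
        if not chain_dir.is_dir():
            continue
        has_msa = any(chain_dir.glob("*.a3m"))
        has_hits = any(chain_dir.glob("*.hhr"))
        if not (has_msa and has_hits):
            bad.append(chain_dir.name)
    return bad
```

An empty list from `chains_missing_flat_files("RODAPATH/pdb")` suggests every chain directory has at least one MSA and one template-hit file where the flat layout expects them.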

@llwx593

llwx593 commented Jul 29, 2022

Oh, so I was probably running with empty MSAs and templates. I will flatten the data. Thanks.
