New entries in obsolete.dat will throw up errors. #4

Open
sachinkadyan7 opened this issue Nov 12, 2021 · 15 comments
Assignees
Labels
bug Something isn't working

Comments

@sachinkadyan7
Collaborator

Traceback (most recent call last):
  File "/ocean/projects/bio210060p/kadyan/openfold-release/scripts/precompute_template_hits.py", line 224, in <module>
    main(args, template_pipeline_runner)
  File "/ocean/projects/bio210060p/kadyan/openfold-release/scripts/precompute_template_hits.py", line 116, in main
    feature_dict = template_pipeline_runner.run(a3m_dir, fasta_file_path)
  File "/ocean/projects/bio210060p/kadyan/openfold-release/scripts/precompute_template_hits.py", line 80, in run
    alignment_dir=a3m_dir,
  File "/ocean/projects/bio210060p/kadyan/openfold-release/openfold/data/data_pipeline.py", line 360, in process_fasta
    hits=hits_cat,
  File "/ocean/projects/bio210060p/kadyan/openfold-release/openfold/data/templates.py", line 1058, in get_templates
    kalign_binary_path=self._kalign_binary_path,
  File "/ocean/projects/bio210060p/kadyan/openfold-release/openfold/data/templates.py", line 828, in _process_single_hit
    with open(cif_path, "r") as cif_file:
FileNotFoundError: [Errno 2] No such file or directory: '/databases/pdb_mmcif/mmcif_files/6ek0.cif'

ISSUE: New entries added to obsolete.dat will fail because the corresponding replacement structures are not present in the pre-downloaded pdb_mmcif set.
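For context, obsolete.dat maps superseded PDB IDs to their replacements via OBSLTE records. A minimal sketch of parsing it into a lookup table (the function name is hypothetical, not OpenFold's actual implementation):

```python
def parse_obsolete(path):
    """Build {obsolete_id: replacement_id} from a PDB obsolete.dat file."""
    mapping = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Data lines look like: OBSLTE  26-SEP-18 6EK0     6QZP
            # (a few entries have no replacement and carry only 3 fields).
            if len(fields) >= 4 and fields[0] == "OBSLTE":
                mapping[fields[2].lower()] = fields[3].lower()
    return mapping
```

If the replacement ID is then missing from the local mmCIF directory, the pipeline still fails with the FileNotFoundError above.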

@sachinkadyan7 sachinkadyan7 self-assigned this Nov 12, 2021
@sachinkadyan7 sachinkadyan7 added the bug Something isn't working label Nov 12, 2021
@sachinkadyan7
Collaborator Author

sachinkadyan7 commented Nov 13, 2021

Upon investigation, the error is genuine but the cause is different.

Obsolete PDB IDs were not being replaced by their newer IDs — and not just for new additions to obsolete.dat, but for older entries as well. That is, in all cases.

FIXED:
Added the small missing piece of code that performs the replacement.

NOTE: I recommend people pull again, as this affects a lot of proteins.
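The fix amounts to consulting the obsolete-ID map before opening a template's mmCIF file. A hedged sketch of that substitution step (`resolve_cif_path` and `obsolete_map` are illustrative names, not the actual OpenFold code):

```python
import os

def resolve_cif_path(pdb_id, mmcif_dir, obsolete_map):
    """Return the mmCIF path for a template hit, falling back to the
    replacement ID when the original entry is obsolete."""
    pdb_id = pdb_id.lower()
    path = os.path.join(mmcif_dir, pdb_id + ".cif")
    # If the original file is absent but the ID is listed as obsolete,
    # point at the replacement structure instead.
    if not os.path.exists(path) and pdb_id in obsolete_map:
        path = os.path.join(mmcif_dir, obsolete_map[pdb_id] + ".cif")
    return path
```

With this in place, a hit on the obsolete '6ek0' would resolve to its replacement '6qzp' rather than raising FileNotFoundError.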

@yuzhiguo07
Contributor

yuzhiguo07 commented Nov 30, 2021

I used the latest version of the code, but I still got the same error:

  File "/mnt/smile1/protein_proj/codes/github/openfold_ori/openfold/openfold/data/data_modules.py", line 158, in __getitem__
    path + ".cif", file_id, chain_id, alignment_dir
  File "/mnt/smile1/protein_proj/codes/github/openfold_ori/openfold/openfold/data/data_modules.py", line 138, in _parse_mmcif
    chain_id=chain_id,
  File "/mnt/smile1/protein_proj/codes/github/openfold_ori/openfold/openfold/data/data_pipeline.py", line 463, in process_mmcif
    query_release_date=to_date(mmcif.header["release_date"])
  File "/mnt/smile1/protein_proj/codes/github/openfold_ori/openfold/openfold/data/data_pipeline.py", line 55, in make_template_features
    hits=hits_cat,
  File "/mnt/smile1/protein_proj/codes/github/openfold_ori/openfold/openfold/data/templates.py", line 1059, in get_templates
    kalign_binary_path=self._kalign_binary_path,
  File "/mnt/smile1/protein_proj/codes/github/openfold_ori/openfold/openfold/data/templates.py", line 827, in _process_single_hit
    with open(cif_path, "r") as cif_file:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/smile1/protein_proj/codes/github/openfold/data/pdb_mmcif/mmcif_files/6ek0.cif'

Epoch 0:  11%|█         | 62/575 [14:25<1:59:18, 13.95s/it, loss=7.65, v_num=0]

@gahdritz gahdritz reopened this Nov 30, 2021
@sachinkadyan7
Collaborator Author

Hi yuzhiguo07,

Sorry that you are facing this issue.

Can you please share some details about how you reproduced this? Namely, what was the specific protein for which the template generation failed?

Also, what version of the pdb_mmcif/obsolete.dat are you using? Specifically, what date did you download it on?

@yuzhiguo07
Contributor

I just attached the pdb_mmcif/obsolete.dat; I downloaded it on Nov 16, 2021.
obsolete.dat.tar.gz

I'm still working on printing the PDB ID on each iteration before the bug occurs. Could you give me some tips on where I should add the print (i.e., which Python file and which function)? Since the bug occurs in the middle of training, it may take some time to print it out.

Thank you so much for your work and effort!

@yuzhiguo07
Contributor

I followed the DeepMind MSA generation pipeline, which takes a very long time, so I used only a small amount of data to try training OpenFold.

@yuzhiguo07
Contributor

The failed protein is 6u4z_A. @sachinkadyan7

@sachinkadyan7
Collaborator Author

Thanks for letting us know.

It seems that, for some reason, the obsolete PDB ID '6ek0' was not replaced by the newer ID '6qzp' (as listed in obsolete.dat).

Is '6u4z_A' the protein for which you were trying to run the MSAs and templates?

@yuzhiguo07
Contributor

Yes, 6u4z_A is the target protein.

@sachinkadyan7
Collaborator Author

A couple of questions to help figure out this issue:

  1. Are you passing the obsolete.dat path in the script call?
  2. Did you generate the release_dates file? (It can be generated by running `scripts/generate_mmcif_cache.py`.)
  3. Is the release_dates file in the correct path?

The only way the above issue can occur is if the release_dates file or the obsolete.dat file is missing.
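The checks above can be scripted. A minimal sanity check, assuming obsolete.dat's OBSLTE record format and a JSON release-dates cache (`check_template_inputs` is a hypothetical helper, not part of OpenFold):

```python
import json
import os

def check_template_inputs(obsolete_path, release_dates_path):
    """Verify both template-pipeline inputs exist and are non-trivial.
    Returns (number of OBSLTE records, number of release-date entries)."""
    if not os.path.isfile(obsolete_path):
        raise FileNotFoundError(f"obsolete.dat not found: {obsolete_path}")
    if not os.path.isfile(release_dates_path):
        raise FileNotFoundError(f"release-dates cache not found: {release_dates_path}")
    with open(obsolete_path) as f:
        n_obsolete = sum(1 for line in f if line.startswith("OBSLTE"))
    with open(release_dates_path) as f:
        n_dates = len(json.load(f))
    return n_obsolete, n_dates
```

Running this before training makes the "missing file" failure mode visible up front instead of mid-epoch.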

@yuzhiguo07
Contributor

yuzhiguo07 commented Dec 14, 2021

Sorry for the late reply.
I did generate the release_dates file (mmcif_cache.json) and put it in the correct path.
I'm not sure whether I was passing the obsolete.dat path in the script call; the path is /mnt/smile1/protein_proj/dataset/open_fold/try_mmcif_files/obsolete.dat (following the default data path),
and my running command is:

python3 train_openfold.py /mnt/smile1/protein_proj/dataset/open_fold/try_mmcif_files/ /mnt/smile1/protein_proj/dataset/open_fold/try_alignments/ /mnt/smile1/protein_proj/codes/github/openfold/data/pdb_mmcif/mmcif_files /mnt/smile1/protein_proj/models/openfold/try 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --precision 16 --gpus 1 --replace_sampler_ddp=True --seed 42 --deepspeed_config_path deepspeed_config.json

@hellofinch

Hello, I hit the same error:
FileNotFoundError: [Errno 2] No such file or directory: '/hdd/nas_157/dataset/pdbmmcif/mmcif_files/3wxw.cif'
3wxw.cif cannot be found. I downloaded pdb_mmcif using the script scripts/download_pdb_mmcif.sh,
and my running command is:
python -u train_openfold.py /hdd/nas_157/dataset/fold2/mmcif/ /hdd/nas_157/dataset/fold2/features/ /hdd/nas_157/dataset/pdbmmcif/mmcif_files/ output/ 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --seed 42 --deepspeed_config_path deepspeed_config.json --gpus 1 --replace_sampler_ddp=True --precision 16

@sachinkadyan7
Collaborator Author

@hellofinch
It seems that your command does not pass the path of the obsolete.dat file to the script. Upon inspection, I found that this is because the training code specifically lacks a command-line parameter for the obsolete.dat file path (faulty code).

I also analyzed the code that actually parses the file and uses it to replace obsolete entries. There does not seem to be any way that the issue is happening in that part of the code. If the release_dates and obsolete_pdbs files are present, the obsolete hits should be replaced by their newer versions.

To verify, can you try running only the inference code through run_pretrained_openfold.py on the specific protein for which the training code failed? Make sure to add the obsolete_pdbs_path and the release_dates_path to the command.
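Outside the full pipeline, the replacement logic can also be exercised standalone. A hypothetical helper for that check (it only assumes the OBSLTE record format and the `<id>.cif` naming of the mmCIF directory):

```python
import os

def replacement_available(pdb_id, obsolete_dat_path, mmcif_dir):
    """Look up the replacement for an obsolete PDB ID and check that its
    mmCIF file is present locally. Returns the replacement ID or None."""
    pdb_id = pdb_id.lower()
    with open(obsolete_dat_path) as f:
        for line in f:
            fields = line.split()
            if (len(fields) >= 4 and fields[0] == "OBSLTE"
                    and fields[2].lower() == pdb_id):
                repl = fields[3].lower()
                if os.path.isfile(os.path.join(mmcif_dir, repl + ".cif")):
                    return repl
    return None
```

If this returns None for the failing template hit, either the entry is missing from obsolete.dat or the replacement structure was never downloaded.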

@hellofinch

@sachinkadyan7
I ran the inference code through run_pretrained_openfold.py on the protein 3wxw, which is a multimer. The inference code works fine and produces the PDB files, but nothing is output in the console.
My command is python run_pretrained_openfold.py ./3wxw.fasta /hdd/dataset/protein/uniref90/uniref90.fasta /hdd/dataset/protein/mgnify/mgy_clusters_2018_12.fa /hdd/dataset/protein/pdb70/pdb70 /hdd/nas_157/dataset/pdbmmcif/mmcif_files/ /hdd/dataset/protein/uniclust30/uniclust30_2018_08/uniclust30_2018_08 --output_dir ./output --bfd_database_path /hdd/dataset/protein/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --model_device cuda:1 --jackhmmer_binary_path /usr/bin/jackhmmer --hhblits_binary_path /usr/bin/hhblits --hhsearch_binary_path /usr/bin/hhsearch --kalign_binary_path /usr/bin/kalign --obsolete_pdbs_path /hdd/nas_157/dataset/pdbmmcif/obsolete.dat --release_dates_path ./mmcif_cache.json . It already has obsolete_pdbs_path and release_dates_path.

@sachinkadyan7
Collaborator Author

@hellofinch
Do you see the alignment files and the predicted structure in your output directory?
If the files are there, it means that there was no error during the execution and obsolete IDs were replaced.

@hellofinch

@sachinkadyan7
I checked the alignment files and they are there. Does that mean I have the right dataset?
I also tried adding obsolete_pdbs_path to my training command, like python -u train_openfold.py /hdd/nas_157/dataset/fold2/mmcif/ /hdd/nas_157/dataset/fold2/features/ /hdd/nas_157/dataset/pdbmmcif/mmcif_files/ output/ 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --seed 42 --deepspeed_config_path deepspeed_config.json --gpus 1 --replace_sampler_ddp=True --precision 16 --obsolete_pdbs_path /hdd/nas_157/dataset/pdbmmcif/obsolete.dat.
But it doesn't work; an error comes out: train_openfold.py: error: unrecognized arguments: --obsolete_pdbs_path /hdd/nas_157/dataset/pdbmmcif/obsolete.dat. It seems I should not add this option?
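The unrecognized-arguments error confirms the training script simply never declared the flag. Exposing it is a small argparse addition; a minimal sketch using the obsolete_pdbs_file_path name mentioned in the fix commit (only the new flag is shown, the script's other arguments are elided):

```python
import argparse

parser = argparse.ArgumentParser()
# ... the training script's existing positional and optional arguments ...
parser.add_argument(
    "--obsolete_pdbs_file_path",
    type=str,
    default=None,
    help="Path to an obsolete.dat file mapping superseded PDB IDs "
         "to their replacements",
)
```

Until the flag exists in the script, any value passed on the command line is rejected before training starts.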

sachinkadyan7 added a commit that referenced this issue Jan 26, 2022
Added obsolete_pdbs_file_path flag in the training script.
christinaflo pushed a commit that referenced this issue Aug 3, 2023
created Multimer dataloader and datamodule classes