Although the current argparse parser allows the user to omit train_chain_data_cache_path, the current implementation of data_modules.OpenFoldDataset (specifically the inner function looped_samples) assumes that the cache object is not None. If the user does not supply a cache path, the training script fails with a StopIteration, as it tries to get a cache entry from a None object on line 371 of openfold/openfold/data/data_modules.py (lines 360 to 374 in 59277de).
OpenFold's datasets seem to have been built to support parsing structure files on the fly as well, so which of the two options would be preferred going forward? 1) Make train_chain_data_cache_path required, so the user does not hit an unexpected failure when the data is loaded, or 2) add support in OpenFoldDataset/looped_samples for the case where the cache is None?
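To make the two options concrete, here is a rough sketch of what each might look like. This is not the actual OpenFold code: build_parser, looped_samples_sketch, and the keep callback are hypothetical names made up for illustration, and the real looped_samples may be structured quite differently.

```python
import argparse
from typing import Callable, Iterator, List, Optional


def build_parser(require_cache: bool) -> argparse.ArgumentParser:
    # Option 1 (hypothetical helper): reject a missing cache path at parse
    # time instead of failing later inside the data loader.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train_chain_data_cache_path",
        type=str,
        default=None,
        required=require_cache,
    )
    return parser


def looped_samples_sketch(
    chain_ids: List[str],
    chain_data_cache: Optional[dict],
    keep: Callable[[dict], bool],
) -> Iterator[str]:
    # Option 2 (hypothetical stand-in for looped_samples): when no cache
    # was loaded, skip the cache-based filter and yield every chain.
    while True:
        yielded = False
        for chain_id in chain_ids:
            if chain_data_cache is not None:
                entry = chain_data_cache.get(chain_id)
                if entry is None or not keep(entry):
                    continue
            yielded = True
            yield chain_id
        if not yielded:
            # Fail loudly rather than spinning forever on an empty pass.
            raise ValueError("no chain passed the filter; check the cache")
```

With option 1 the user gets an argparse error immediately; with option 2 training proceeds without filtering, which may or may not be the behaviour the maintainers want.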
Happy to help implement something either way!
Edit: You know, feel free to disregard the rest of my comment below. I thought about it more, and now I understand the point of the caches as storing file paths + sequences, rather than using them to hold coordinate information. I was confused because I thought the train_chain_data_cache would hold this information, but it does not. Thanks for your help!
Are training set and template structures supposed to be parsed during every training step, regardless of any precomputed caches? Or are the caches only supposed to store enough info to essentially filter certain protein structures?
To me, it seems like the training loop is not using any sort of indexing/caching for protein structure/coordinate data.
Let me explain why I think so:
1. In order to avoid parsing a file in OpenFoldSingleDataset, you must pass in a _structure_index object. When setting up the dataset here, no structure index is provided for the training dataset.
   - Even if a structure index is passed to the training dataset, each _structure_index entry simply points to a file path, and the file must still be read and parsed.
2. The chain_data_cache.json object created by generate_chain_data_cache.py (and discussed above) is a cache storing sequences, resolutions, and cluster sizes for each protein.
   - It does not contain ground-truth atomic coordinates, so it cannot serve as a cache of all the data needed to train the model, and a structure file must be parsed on every step.
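To illustrate the second point, here is a toy version of what I understand a chain data cache to hold. The chain IDs, key names ("seq", "resolution", "cluster_size"), and values below are assumptions for illustration, not copied from a real chain_data_cache.json, and filter_chains is a hypothetical helper.

```python
import json

# Hypothetical cache content; key names and values are assumptions.
example_cache_json = """
{
  "1abc_A": {"seq": "MKTAYIAK", "resolution": 1.8, "cluster_size": 12},
  "2xyz_B": {"seq": "GSHMLE", "resolution": 9.5, "cluster_size": 1}
}
"""


def filter_chains(cache: dict, max_resolution: float) -> list:
    # The metadata is enough to decide which chains to train on...
    return sorted(
        chain_id
        for chain_id, entry in cache.items()
        if entry["resolution"] <= max_resolution
    )


cache = json.loads(example_cache_json)
kept = filter_chains(cache, max_resolution=9.0)

# ...but no entry carries atomic coordinates, so the structure file for each
# kept chain would still have to be parsed when the sample is loaded.
has_coordinates = any("coordinates" in entry for entry in cache.values())
```

A cache like this can filter chains cheaply, but it cannot replace reading the structure file itself.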