Should train_chain_data_cache_path be a required argument? #251

Open
jonathanking opened this issue Dec 8, 2022 · 2 comments

Comments

@jonathanking
Contributor

Although the argparse parser allows the user to omit train_chain_data_cache_path, the current implementation of data_modules.OpenFoldDataset (specifically the inner function looped_samples) assumes that the cache object is not None. If the user does not supply a cache path, the training script simply fails with a StopIteration when it tries to get a cache entry from a None object on line 371:

```python
def looped_samples(dataset_idx):
    max_cache_len = int(epoch_len * probabilities[dataset_idx])
    dataset = self.datasets[dataset_idx]
    idx_iter = looped_shuffled_dataset_idx(len(dataset))
    chain_data_cache = dataset.chain_data_cache
    while True:
        weights = []
        idx = []
        for _ in range(max_cache_len):
            candidate_idx = next(idx_iter)
            chain_id = dataset.idx_to_chain_id(candidate_idx)
            chain_data_cache_entry = chain_data_cache[chain_id]
            if(not deterministic_train_filter(chain_data_cache_entry)):
                continue
```
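For concreteness, the immediate problem is just a subscript on None (a minimal illustration, not OpenFold code; in the training script this apparently surfaces as the StopIteration mentioned above):

```python
# When no cache path is supplied, dataset.chain_data_cache is None,
# so the lookup below raises immediately.
chain_data_cache = None
chain_data_cache["some_chain_id"]  # TypeError: 'NoneType' object is not subscriptable
```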

It seems like OpenFold's datasets have also been built to support parsing structure files on the fly, so which of the two options would be preferred going forward: (1) making train_chain_data_cache_path required, so the user does not hit an unexpected failure when the data is loaded, or (2) adding support in OpenFoldDataset/looped_samples for the case where the cache is None?
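For reference, option 2 might look roughly like the following inside the loop body of looped_samples (a sketch only, reusing the names from the snippet quoted above; a missing cache is treated as "no deterministic filtering", and the weight computation further down would need a similar guard):

```python
for _ in range(max_cache_len):
    candidate_idx = next(idx_iter)
    chain_id = dataset.idx_to_chain_id(candidate_idx)
    # Only apply the cache-based filter when a cache was actually
    # loaded; with no cache, accept every candidate.
    if chain_data_cache is not None:
        chain_data_cache_entry = chain_data_cache[chain_id]
        if not deterministic_train_filter(chain_data_cache_entry):
            continue
```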

Happy to help implement something either way!

@gahdritz
Collaborator

Good point. I think the former option is better, since performance suffers greatly if you have to parse mmCIF files all the time.
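Making it required should be a one-line argparse change. A sketch, assuming the flag is registered roughly like this in the training script (the real call site may carry a help string and other options):

```python
import argparse

parser = argparse.ArgumentParser()
# Make the cache path mandatory instead of defaulting to None.
parser.add_argument(
    "--train_chain_data_cache_path",
    type=str,
    required=True,  # previously optional, defaulting to None
)
```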

@jonathanking
Contributor Author

jonathanking commented Jan 30, 2023

Edit: Feel free to disregard the rest of my comment below. I thought about it more, and I now understand that the caches are meant to store file paths + sequences rather than coordinate information. I was confused because I thought the train_chain_data_cache would hold coordinates, but it does not. Thanks for your help!

Are training set and template structures supposed to be parsed during every training step, regardless of any precomputed caches? Or are the caches only supposed to store enough info to essentially filter certain protein structures?

To me, it seems like the training loop is not using any sort of indexing/caching for protein structure/coordinate data.

Let me explain why I think so:

1. In order to avoid parsing a file in OpenFoldSingleDataset, you must pass in a _structure_index object. When setting up the dataset here, no structure index is provided for the training dataset.
- Even if the structure index is passed to the training dataset, the _structure_index entry simply points to the file path, and the file at that path must still be read and parsed.
2. The chain_data_cache.json object created by generate_chain_data_cache.py (and discussed above) is a cache storing sequences, resolutions, and cluster sizes for each protein (see the illustrative entry after this list).
- Since it does not contain ground-truth atomic coordinates, this data structure cannot serve as a cache of all the data needed to train the model, so a structure file must still be parsed on every step.
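To illustrate, an entry in chain_data_cache.json presumably looks something like the following, shown here as a Python dict (the field names are my paraphrase of "sequences, resolutions, and cluster sizes"; the real keys may differ):

```python
# Hypothetical shape of one chain_data_cache.json entry. It holds
# filtering metadata only -- no atomic coordinates.
chain_data_cache = {
    "1abc_A": {
        "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "resolution": 1.8,
        "cluster_size": 42,
    },
}
```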
