Should train_chain_data_cache_path be a required argument? #251

Open
jonathanking opened this issue Dec 8, 2022 · 2 comments

Comments

@jonathanking
Contributor

Although the argparse parser allows the user to omit train_chain_data_cache_path, the current implementation of data_modules.OpenFoldDataset (specifically the inner function looped_samples) assumes that the cache object is not None. If the user does not supply a cache path, the training script simply fails with a StopIteration when it tries to get a cache entry from a None object on line 371:

```python
def looped_samples(dataset_idx):
    max_cache_len = int(epoch_len * probabilities[dataset_idx])
    dataset = self.datasets[dataset_idx]
    idx_iter = looped_shuffled_dataset_idx(len(dataset))
    chain_data_cache = dataset.chain_data_cache
    while True:
        weights = []
        idx = []
        for _ in range(max_cache_len):
            candidate_idx = next(idx_iter)
            chain_id = dataset.idx_to_chain_id(candidate_idx)
            chain_data_cache_entry = chain_data_cache[chain_id]
            if(not deterministic_train_filter(chain_data_cache_entry)):
                continue
```
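For concreteness, the immediate problem is just a subscript on None (a minimal illustration, not OpenFold code; in the training script this apparently surfaces as the StopIteration mentioned above):

```python
# When no cache path is supplied, dataset.chain_data_cache is None,
# so the lookup below raises immediately.
chain_data_cache = None
chain_data_cache["some_chain_id"]  # TypeError: 'NoneType' object is not subscriptable
```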

It seems like OpenFold's datasets have also been built to support parsing structure files on the fly, so which of the two options would be preferred going forward: (1) making train_chain_data_cache_path required, so the user does not hit an unexpected failure when the data is loaded, or (2) adding support in OpenFoldDataset/looped_samples for the case where the cache is None?
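For reference, option 2 might look roughly like the following inside the loop body of looped_samples (a sketch only, reusing the names from the snippet quoted above; a missing cache is treated as "no deterministic filtering", and the weight computation further down would need a similar guard):

```python
for _ in range(max_cache_len):
    candidate_idx = next(idx_iter)
    chain_id = dataset.idx_to_chain_id(candidate_idx)
    # Only apply the cache-based filter when a cache was actually
    # loaded; with no cache, accept every candidate.
    if chain_data_cache is not None:
        chain_data_cache_entry = chain_data_cache[chain_id]
        if not deterministic_train_filter(chain_data_cache_entry):
            continue
```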

Happy to help implement something either way!

@gahdritz
Collaborator

Good point. I think the former option is better, since performance suffers greatly if you have to parse mmCIF files all the time.
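Making it required should be a one-line argparse change. A sketch, assuming the flag is registered roughly like this in the training script (the real call site may carry a help string and other options):

```python
import argparse

parser = argparse.ArgumentParser()
# Make the cache path mandatory instead of defaulting to None.
parser.add_argument(
    "--train_chain_data_cache_path",
    type=str,
    required=True,  # previously optional, defaulting to None
)
```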

@jonathanking
Contributor Author

jonathanking commented Jan 30, 2023

Edit: Feel free to disregard the rest of my comment below. I thought about it more, and I now understand that the caches are meant to store file paths + sequences rather than coordinate information. I was confused because I thought the train_chain_data_cache would hold coordinates, but it does not. Thanks for your help!

Are training set and template structures supposed to be parsed during every training step, regardless of any precomputed caches? Or are the caches only supposed to store enough info to essentially filter certain protein structures?

To me, it seems like the training loop is not using any sort of indexing/caching for protein structure/coordinate data.

Let me explain why I think so:

1. In order to avoid parsing a file in OpenFoldSingleDataset, you must pass in a _structure_index object. When setting up the dataset here, no structure index is provided for the training dataset.
- Even if the structure index is passed to the training dataset, the _structure_index entry simply points to the file path, and the file at that path must still be read and parsed.
2. The chain_data_cache.json object created by generate_chain_data_cache.py (and discussed above) is a cache storing sequences, resolutions, and cluster sizes for each protein (see the illustrative entry after this list).
- Since it does not contain ground-truth atomic coordinates, this data structure cannot serve as a cache of all the data needed to train the model, so a structure file must still be parsed on every step.
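To illustrate, an entry in chain_data_cache.json presumably looks something like the following, shown here as a Python dict (the field names are my paraphrase of "sequences, resolutions, and cluster sizes"; the real keys may differ):

```python
# Hypothetical shape of one chain_data_cache.json entry. It holds
# filtering metadata only -- no atomic coordinates.
chain_data_cache = {
    "1abc_A": {
        "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "resolution": 1.8,
        "cluster_size": 42,
    },
}
```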
