list index out of range #13

Closed
yuehua-Song666 opened this issue May 1, 2024 · 3 comments

Comments

@yuehua-Song666

Dear authors,

Thank you so much for such great work. I'm really interested in it.
I got an issue here, after getting processed.pt, I tried to run main.py. It uses das_split.pt to split the data into train, val and test, right? But I got an "index out of range" error. I wonder if you have any clues why this happened? By the way, I saw under the data folder, you have three '_split.pt' files, can you please tell me the difference between them?

Error log:
Traceback (most recent call last):
  File "/home/yjwang/geometric-rna-design/main.py", line 246, in <module>
    main(config, device)
  File "/home/yjwang/geometric-rna-design/main.py", line 39, in main
    train_list, val_list, test_list = get_data_splits(config, split_type=config.split)
  File "/home/yjwang/geometric-rna-design/main.py", line 119, in get_data_splits
    train_list = index_list_by_indices(data_list, train_idx_list)
  File "/home/yjwang/geometric-rna-design/main.py", line 113, in index_list_by_indices
    return [lst[index] for index in indices]
  File "/home/yjwang/geometric-rna-design/main.py", line 113, in <listcomp>
    return [lst[index] for index in indices]
IndexError: list index out of range

Thanks in advance,
yuehua

@chaitjo
Owner

chaitjo commented May 21, 2024

Hi @yuehua-Song666, many thanks for your interest! And apologies for this very delayed response.

> I got an issue here, after getting processed.pt, I tried to run main.py.

Have you created the processed dataset yourself from the raw RNAsolo PDB files? Or have you downloaded it from our link: https://drive.google.com/file/d/1gcUUaRxbGZnGMkLdtVwAILWVerVCbu4Y/view?usp=sharing

> It uses das_split.pt to split the data into train, val and test, right? But I got an "index out of range" error. I wonder if you have any clues why this happened?

I think the index error could be happening if you have created the processed dataset by yourself and there are fewer entries/samples in the new processed dataset than there were when I created the splits. Could you check whether this is the case?

Essentially, index out of range means that the index lists in das_split.pt contain one or more indices that are too large for the processed data list. If the processed data list has length N, the valid indices are 0 to N - 1, so any index of N or greater raises this error.
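A quick way to test this hypothesis is to compare every split index against the length of the processed data list before indexing. The sketch below is plain Python and only illustrative: the function names `find_out_of_range` and `check_split` are hypothetical, and it assumes the split has already been loaded into a dict of index lists (the actual on-disk layout of das_split.pt and processed.pt is not shown in this thread).

```python
def find_out_of_range(indices, n_samples):
    """Return the indices that cannot address a list of length n_samples."""
    # Python accepts -n_samples .. n_samples - 1; anything else raises IndexError.
    return [i for i in indices if not -n_samples <= i < n_samples]


def check_split(split, n_samples):
    """Map each split name to its offending indices (empty dict if all are valid).

    `split` is assumed to be a dict such as {"train": [...], "val": [...], "test": [...]}.
    """
    report = {}
    for name, indices in split.items():
        bad = find_out_of_range(indices, n_samples)
        if bad:
            report[name] = bad
    return report


# Toy example: a processed list with 5 entries and a split that expects more.
data_list = ["rna0", "rna1", "rna2", "rna3", "rna4"]
split = {"train": [0, 1, 2], "val": [3], "test": [4, 7]}
print(check_split(split, len(data_list)))
```

If the report is non-empty, the split was built against a larger dataset than the one currently loaded, which matches the explanation above.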

> By the way, I saw under the data folder, you have three '_split.pt' files, can you please tell me the difference between them?

We have provided two splits used in our experiments in the data/ directory:

  • Single-state split from Das et al., 2010: data/das_split.pt (called the Das split for compatibility with older code)
    • This split is used to fairly evaluate gRNAde for single-state design on a set of RNA structures of interest from the PDB identified by the Das et al. paper, which mainly includes riboswitches, aptamers, and ribozymes.
    • We identify the structural clusters belonging to the RNAs identified in Das et al. and add all the RNAs in these clusters to the test set (100 samples).
    • The remaining clusters are randomly added to the training and validation splits.
  • Multi-state split of structurally flexible RNAs: data/structsim_split.pt
    • This split is used to test gRNAde's ability to design RNA with multiple distinct conformational states.
    • We order the structural clusters based on median intra-sequence RMSD among available structures within the cluster.
    • The top 100 samples from clusters with the highest median intra-sequence RMSD are added to the test set. The next 100 samples are added to the validation set and all remaining samples are used for training.
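The multi-state split procedure described above can be sketched roughly as follows. This is a simplified illustration, not the code used to build data/structsim_split.pt: the function name `split_by_flexibility`, the input layout (a dict from cluster id to `(sample_index, rmsd)` pairs), and the choice to slice a flattened sample list (which may cut through a cluster at the 100-sample boundaries) are all assumptions.

```python
from statistics import median


def split_by_flexibility(clusters, n_test=100, n_val=100):
    """Split samples by ranking clusters on median intra-sequence RMSD.

    `clusters` is assumed to map a cluster id to a list of
    (sample_index, rmsd) pairs; this layout is hypothetical.
    """
    # Rank clusters from most to least flexible.
    ranked = sorted(
        clusters.values(),
        key=lambda members: median(rmsd for _, rmsd in members),
        reverse=True,
    )
    # Flatten to a single list of sample indices in ranked-cluster order.
    ordered = [idx for members in ranked for idx, _ in members]
    test = ordered[:n_test]
    val = ordered[n_test:n_test + n_val]
    train = ordered[n_test + n_val:]
    return train, val, test


# Toy example with three clusters and small set sizes.
clusters = {
    "a": [(0, 1.0), (1, 1.2)],
    "b": [(2, 5.0)],
    "c": [(3, 0.1), (4, 0.2)],
}
train, val, test = split_by_flexibility(clusters, n_test=1, n_val=2)
print(train, val, test)
```

Because the resulting index lists refer to positions in one specific processed dataset, they are only valid for that dataset, which is why a re-processed dataset of a different size can trigger the IndexError discussed earlier.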

Let me know if this is helpful.

@chaitjo
Owner

chaitjo commented Jun 4, 2024

Hi @yuehua-Song666, I recently updated the instructions for preparing the data and for reproducing our splits for benchmarking: #16

Somebody else told me that RNAsolo no longer allows downloading older versions based on date cutoffs, and I suspect the issue you were facing could be due to the same reason. If you try the new data instructions in the README, I think it should work.

Let me know how it goes!

@yuehua-Song666
Author

Hi authors,

Thank you so much for such a useful reply! I figured it out. =)

Thanks a lot,
Yuehua
