# Investigate RNA3DB test splits

For train / test splits, we want to avoid any sequence or structural similarity between the two sets. For each split, we therefore construct an RNA3DB test split that excludes every structure that could be similar to anything in the training split. A side effect of this strategy is that some RNA3DB structures may never be included in any test split.

This notebook analyzes how many of the RNA3DB structures end up in at least one test split.
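
The selection rule described above can be illustrated on toy data. The following is a minimal sketch, not the notebook's actual pipeline: the family names, sequence names, split names, and the helper `select_test_sequences` are all hypothetical. It shows how a sequence that matches families from two different test splits is excluded from every test split.

```python
def select_test_sequences(matches_by_seq, train_families):
    """Return the sequences that match no family in the training split."""
    train_families = set(train_families)
    return {
        seq
        for seq, families in matches_by_seq.items()
        if not set(families) & train_families
    }


# Hypothetical toy data (not taken from RNA3DB or Rfam)
all_families = {"FAM_A", "FAM_B", "FAM_C"}
matches_by_seq = {
    "seq1": {"FAM_A"},           # matches a single family
    "seq2": {"FAM_A", "FAM_B"},  # matches families from two different test splits
    "seq3": set(),               # no family match at all
    "seq4": {"FAM_C"},
}
test_splits = {
    "split_1": {"FAM_A"},
    "split_2": {"FAM_B"},
    "split_3": {"FAM_C"},
}

per_split = {}
for name, test_families in test_splits.items():
    # Every family that is not held out for testing belongs to training
    train_families = all_families - test_families
    per_split[name] = select_test_sequences(matches_by_seq, train_families)

covered = set().union(*per_split.values())
never_covered = set(matches_by_seq) - covered
print(f"Covered by at least one test split: {sorted(covered)}")
print(f"Never covered: {sorted(never_covered)}")
```

In this toy setup `seq2` is never covered, because whichever of its two families is held out, the other one remains in the training split. A sequence with no matches (`seq3`) lands in every test split.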

In [5]:
import json
import pandas as pd

from rna_sdb.datasets import RNA_SDB_PATH, RNA3DB_PATH
from rna3db.tabular import read_tbls_from_dir

In [2]:
# Load RNA3DB
tbl = read_tbls_from_dir(RNA3DB_PATH / 'rna3db-cmscans')
with open(RNA3DB_PATH / 'rna3db-jsons/cluster.json') as f:
    clusters = json.load(f)

rna3db_seqs = set()
for component in clusters:
    rna3db_seqs.update(clusters[component].keys())
print(f'Number of sequences in RNA3Dbase: {len(rna3db_seqs)}')

Number of sequences in RNA3Dbase: 1645


In [3]:
# Load Rfam families
df_rfamseq_stats = pd.read_csv(RNA_SDB_PATH / "rfamseq_stats.csv", sep="\t")
rfam_families = set(df_rfamseq_stats['rfam_family'])

## Filter using the relaxed e-value threshold (1.0)

In [8]:
seqs_included_in_test = set()

for split in RNA_SDB_PATH.glob('test_split_*.lst'):
    with open(split) as f:  # glob() already yields paths under RNA_SDB_PATH
        test_families = set(f.read().splitlines())

    train_families = rfam_families.difference(test_families)

    rna3db_test_seqs = set()
    for seq in rna3db_seqs:
        # For each of the sequences in RNA3Dbase, check any possible matches to Rfam families
        # Use relaxed e-value threshold to find possible matches. 1.0 used in the paper as well.
        matched_families = set(tbl[seq].filter_e_value(1.0).target_accession)

        if not matched_families.intersection(train_families):
            # If there is no matched family from the training split, add the sequence to the test set
            rna3db_test_seqs.add(seq)

    print(f'Number of sequences in {split.stem}: {len(rna3db_test_seqs)}')
    seqs_included_in_test.update(rna3db_test_seqs)

print(f'Number of sequences in RNA3Dbase included in at least one test split: {len(seqs_included_in_test)}')

Number of sequences in test_split_9: 477
Number of sequences in test_split_5: 519
Number of sequences in test_split_1: 635
Number of sequences in test_split_7: 661
Number of sequences in test_split_6: 575
Number of sequences in test_split_2: 492
Number of sequences in test_split_3: 506
Number of sequences in test_split_8: 564
Number of sequences in test_split_4: 542
Number of sequences in RNA3Dbase included in at least one test split: 1200


## Filter using the stringent e-value threshold (1e-3)

In [11]:
seqs_included_in_test = set()

for split in RNA_SDB_PATH.glob('test_split_*.lst'):
    with open(split) as f:  # glob() already yields paths under RNA_SDB_PATH
        test_families = set(f.read().splitlines())

    train_families = rfam_families.difference(test_families)

    rna3db_test_seqs = set()
    for seq in rna3db_seqs:
        # For each of the sequences in RNA3Dbase, check any possible matches to Rfam families
        # Use the stringent e-value threshold (1e-3), so only confident matches count.
        matched_families = set(tbl[seq].filter_e_value(1e-3).target_accession)

        if not matched_families.intersection(train_families):
            # If there is no matched family from the training split, add the sequence to the test set
            rna3db_test_seqs.add(seq)

    print(f'Number of sequences in {split.stem}: {len(rna3db_test_seqs)}')
    seqs_included_in_test.update(rna3db_test_seqs)

print(f'Number of sequences in RNA3Dbase included in at least one test split: {len(seqs_included_in_test)}')

Number of sequences in test_split_9: 624
Number of sequences in test_split_5: 674
Number of sequences in test_split_1: 877
Number of sequences in test_split_7: 859
Number of sequences in test_split_6: 723
Number of sequences in test_split_2: 640
Number of sequences in test_split_3: 654
Number of sequences in test_split_8: 764
Number of sequences in test_split_4: 688
Number of sequences in RNA3Dbase included in at least one test split: 1521
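
The stringent threshold yields larger test sets than the relaxed one because fewer hits survive the filter, so fewer sequences are linked to a training family. The sketch below illustrates this on a single hypothetical sequence. The hit list, family names, and helper functions are assumptions for illustration only, and the `<=` comparison is an assumption about how the cutoff is applied, not a statement about `filter_e_value` in `rna3db`.

```python
def families_below(hits, max_e_value):
    """Families whose hit e-value is at or below the cutoff (assumed inclusive)."""
    return {family for family, e_value in hits if e_value <= max_e_value}


def stays_in_test(hits, train_families, max_e_value):
    """True if no surviving hit points to a family in the training split."""
    return not families_below(hits, max_e_value) & set(train_families)


# Hypothetical hits for one sequence: a confident match and a weak match
hits = [("FAM_A", 1e-10), ("FAM_B", 0.5)]
train_families = {"FAM_B"}  # FAM_A is held out for testing

kept_relaxed = stays_in_test(hits, train_families, max_e_value=1.0)
kept_stringent = stays_in_test(hits, train_families, max_e_value=1e-3)
print(f"Kept with threshold 1.0:  {kept_relaxed}")
print(f"Kept with threshold 1e-3: {kept_stringent}")
```

With the relaxed cutoff the weak `FAM_B` hit counts as a match to the training split and the sequence is dropped; with the stringent cutoff that hit is ignored and the sequence stays in the test split.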


## Filter using top hits only

In [14]:
seqs_included_in_test = set()

for split in RNA_SDB_PATH.glob('test_split_*.lst'):
    with open(split) as f:  # glob() already yields paths under RNA_SDB_PATH
        test_families = set(f.read().splitlines())

    train_families = rfam_families.difference(test_families)

    rna3db_test_seqs = set()
    for seq in rna3db_seqs:
        # For each of the sequences in RNA3Dbase, check any possible matches to Rfam families
        # Use the stringent e-value threshold (1e-3) and keep only the top hits.
        matched_families = set(tbl[seq].filter_e_value(1e-3).top_hits.target_accession)

        if not matched_families.intersection(train_families):
            # If there is no matched family from the training split, add the sequence to the test set
            rna3db_test_seqs.add(seq)

    print(f'Number of sequences in {split.stem}: {len(rna3db_test_seqs)}')
    seqs_included_in_test.update(rna3db_test_seqs)

print(f'Number of sequences in RNA3Dbase included in at least one test split: {len(seqs_included_in_test)}')

Number of sequences in test_split_9: 639
Number of sequences in test_split_5: 771
Number of sequences in test_split_1: 888
Number of sequences in test_split_7: 883
Number of sequences in test_split_6: 750
Number of sequences in test_split_2: 721
Number of sequences in test_split_3: 706
Number of sequences in test_split_8: 806
Number of sequences in test_split_4: 801
Number of sequences in RNA3Dbase included in at least one test split: 1585


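In summary, most RNA3DB structures are included in at least one test split: 1200 of 1645 (73%) with the relaxed threshold, 1521 (92%) with the stringent threshold, and 1585 (96%) when only top hits are used. Since running inference on the entire test set is inexpensive, it is best to run model predictions over the entire dataset and then report results for each test split.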
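
The coverage of each filtering strategy can be computed directly from the counts printed above. This is a small arithmetic sketch; the dictionary labels are informal names chosen here, not identifiers from the notebook.

```python
# Counts copied from the cell outputs in this notebook
total_sequences = 1645
included_in_any_test_split = {
    "e-value 1.0": 1200,
    "e-value 1e-3": 1521,
    "e-value 1e-3, top hits": 1585,
}

# Fraction of RNA3DB sequences that appear in at least one test split
coverage = {
    name: count / total_sequences
    for name, count in included_in_any_test_split.items()
}

for name, fraction in coverage.items():
    count = included_in_any_test_split[name]
    print(f"{name}: {count}/{total_sequences} = {fraction:.1%}")
```

This prints coverages of 72.9%, 92.5%, and 96.4% respectively.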