# Splitting and subsetting PPIs

The PPIRef package provides a unified approach to storing and processing data splits and other subsets of PPIs.

In [1]:
from ppiref.split import read_split, read_fold, read_split_source, write_split
from ppiref.definitions import PPIREF_TEST_DATA_DIR

**Example 1.** Create a toy data split. In this example, we use some of the PPIs extracted in the "Extracting PPIs" tutorial.

First, prepare the PPI ids for the split and use the `write_split` function to store the split as a JSON file. The function runs simple sanity checks on the split (e.g. that PPI ids do not overlap across folds and that the split is complete with respect to the source directory of PPI files). In this example, we introduce data leakage on purpose by including the same PPI in both the training and the test sets.

In [2]:
split = {'train': ['10gs_A_B'], 'test': ['1ahw_B_C', '10gs_A_B']}
write_split('demo_split', source=PPIREF_TEST_DATA_DIR / 'ppi_dir', folds=split)
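
To make the overlap check concrete, here is a minimal sketch in plain Python. It is not the implementation used by `write_split`; the helper name `find_fold_overlaps` is hypothetical and only illustrates the idea of comparing PPI ids across every pair of folds.

```python
from itertools import combinations


def find_fold_overlaps(folds):
    """Hypothetical helper (not part of ppiref): map each pair of fold
    names to the sorted list of PPI ids that the two folds share."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(folds.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = sorted(shared)
    return overlaps


# Same toy split as above, with the deliberate leak
split = {'train': ['10gs_A_B'], 'test': ['1ahw_B_C', '10gs_A_B']}
print(find_fold_overlaps(split))  # {('train', 'test'): ['10gs_A_B']}
```

An empty dictionary would indicate that no PPI id appears in more than one fold.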



Now, you can read the split using the `read_split` function.

In [3]:
read_split('demo_split')

{'train': [PPIPath('/Users/anton/dev/PPIRef/ppiref/data/test/ppi_dir/0g/10gs_A_B.pdb')],
 'test': [PPIPath('/Users/anton/dev/PPIRef/ppiref/data/test/ppi_dir/ah/1ahw_B_C.pdb'),
  PPIPath('/Users/anton/dev/PPIRef/ppiref/data/test/ppi_dir/0g/10gs_A_B.pdb')]}
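
Judging only from the paths printed above, each PPI file appears to sit in a subdirectory named after the two middle characters of its PDB code (`10gs` in `0g`, `1ahw` in `ah`). The sketch below reproduces that layout under this assumption; `guess_relative_ppi_path` is a hypothetical helper, not a ppiref function, and the actual layout is defined by the package.

```python
from pathlib import PurePosixPath


def guess_relative_ppi_path(ppi_id):
    """Hypothetical helper: guess a PPI file path relative to the source
    directory, assuming the layout <middle two chars of PDB code>/<id>.pdb."""
    pdb_code = ppi_id.split('_')[0]  # e.g. '10gs' from '10gs_A_B'
    return PurePosixPath(pdb_code[1:3]) / f'{ppi_id}.pdb'


print(guess_relative_ppi_path('10gs_A_B'))  # 0g/10gs_A_B.pdb
print(guess_relative_ppi_path('1ahw_B_C'))  # ah/1ahw_B_C.pdb
```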

Alternatively, read individual folds using the `read_fold` function. Fold names can be joined with `+` to read several folds at once.

In [4]:
read_fold('demo_split', 'train')

[PPIPath('/Users/anton/dev/PPIRef/ppiref/data/test/ppi_dir/0g/10gs_A_B.pdb')]

In [5]:
read_fold('demo_split', 'train+test')

[PPIPath('/Users/anton/dev/PPIRef/ppiref/data/test/ppi_dir/0g/10gs_A_B.pdb'),
 PPIPath('/Users/anton/dev/PPIRef/ppiref/data/test/ppi_dir/ah/1ahw_B_C.pdb'),
 PPIPath('/Users/anton/dev/PPIRef/ppiref/data/test/ppi_dir/0g/10gs_A_B.pdb')]
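
Note that the combined `'train+test'` output lists the leaked PPI twice, once per fold. If duplicates are unwanted downstream, one option is to drop them while keeping the original order. The sketch below works on plain id strings as stand-ins for the returned entries; whether `PPIPath` objects are hashable is an assumption that would need checking before applying it to them directly.

```python
def unique_in_order(items):
    """Drop repeated items, keeping the first occurrence of each."""
    # dict keys preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Ids in the same order as the 'train+test' output above
combined = ['10gs_A_B', '1ahw_B_C', '10gs_A_B']
print(unique_in_order(combined))  # ['10gs_A_B', '1ahw_B_C']
```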

**Example 2.** Read pre-defined data splits.

PPIRef50K used to train PPIformer:

In [6]:
fold = read_fold('ppiref_10A_filtered_clustered_03', 'whole', full_paths=False)
fold[:3]

['4q2p_A_B', '3q2s_A_D', '6q2v_B_E']

Test set from non-leaking SKEMPI v2.0:

In [7]:
fold = read_fold('skempi2_iclr24_split', 'test', full_paths=False)
fold[:3]

['1B3S_A_D', '1B2U_A_D', '1BRS_A_D']

DIPS set used to train and validate EquiDock and DiffDock-PP:

In [9]:
fold = read_fold('dips_equidock', 'train+val', full_paths=False)
fold[:3]

['1v6j_A_D', '2v6a_A_L', '2v6a_B_O']

For each of the pre-defined splits, you can read the source directory of the PPIs using the `read_split_source` function:

In [10]:
read_split_source('ppiref_6A_raw')

PosixPath('/Users/anton/dev/PPIRef/ppiref/data/ppiref/ppi_6A')