# Splitting a Protein Dataset into Training and Test Sets

Constructing careful splits of protein datasets can be tricky due to sequence homology. Graphein can take care of this for you.

We use BLAST to cluster sequences based on similarity. Disjoint training and test sets can then be constructed from these clusters.


We'll run through a small example of four sequences, which we split into equally sized training and test sets at a 25% sequence identity threshold. Note that this is really the bare minimum one can do to prevent data leakage due to homology. Here we account only for sequence homology; however, even proteins with 0% sequence identity can adopt very similar folds. Features for splitting based on SCOP and CATH annotations are priorities on our development roadmap. For a fuller discussion, see David Jones' excellent treatment of the potential pitfalls when working with machine learning in biology:

> Setting the standards for machine learning in biology
> David T. Jones
> Nature Reviews Molecular Cell Biology
> https://www.nature.com/articles/s41580-019-0176-5



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/a-r-j/graphein/blob/master/notebooks/splitting_a_dataset.ipynb)
[![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/a-r-j/graphein/blob/master/notebooks/splitting_a_dataset.ipynb)


## Requirements
This functionality relies on BLAST. On Debian-based Linux distributions (such as Ubuntu), you can install it with:

```bash
sudo apt install ncbi-blast+
```

Otherwise, please see: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
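
Before running the rest of the notebook, it can help to confirm that the BLAST+ executables are visible on your `PATH`. The snippet below is a minimal sketch using only the Python standard library. The tool names checked here (`makeblastdb` and `blastp`) are an assumption about which executables are needed, so adjust the list if your setup differs.

```python
import shutil


def missing_blast_tools(tools=("makeblastdb", "blastp")):
    """Return the names of executables that cannot be found on the PATH.

    The default tool names are an assumption, not a list taken from Graphein.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_blast_tools()
if missing:
    print("Missing BLAST+ executables:", ", ".join(missing))
else:
    print("BLAST+ executables found.")
```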


In [None]:
# Install graphein if necessary:
# !pip install graphein

# Install blast if necessary (linux):
# !sudo apt install ncbi-blast+

## Building the Dataset FASTA
First, we need to assemble our sequences into a FASTA file that contains all of our queries.

We can build it either from our own mapping of PDB codes to sequences or from a list of graphs.

In [7]:
from graphein.ml.clustering import build_fasta_file_from_mapping

pdb_sequence_mapping = {
    "3eiy": "SFSNVPAGKDLPQDFNVIIEIPAQSEPVKYEADKALGLLVVDRFIGTGMRYPVNYGFIPQTLSGDGDPVDVLVITPFPLLAGSVVRARALGMLKMTDESGVDAKLVAVPHDKVCPMTANLKSIDDVPAYLKDQIKHFFEQYKALEKGKWVKVEGWDGIDAAHKEITDGVANFKK",
    "1lds": "MIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNGERIEKVEHSDLSFSKDWSFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWD",
    "4hhb": "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR",
    "7wda": "GLVVSFYTPATDGATFTAIAQRCNQQFGGRFTIAQVSLPRSPNEQRLQLARRLTGNDRTLDVMALDVVWTAEFAEAGWALPLSDDPAGLAENDAVADTLPGPLATAGWNHKLYAAPVTTNTQLLWYRPDLVNSPPTDWNAMIAEAARLHAAGEPSWIAVQANQGEGLVVWFNTLLVSAGGSVLSEDGRHVTLTDTPAHRAATVSALQILKSVATTPGADPSITRTEEGSARLAFEQGKAALEVNWPFVFASMLENAVKGGVPFLPLNRIPQLAGSINDIGTFTPSDEQFRIAYDASQQVFGFAPYPAVAPGQPAKVTIGGLNLAVAKTTRHRAEAFEAVRCLRDQHNQRYVSLEGGLPAVRASLYSDPQFQAKYPMHAIIRQQLTDAAVRPATPVYQALSIRLAAVLSPITEIDPESTADELAAQAQKAIDG"
    }

build_fasta_file_from_mapping(pdb_sequence_mapping, "sequences.fasta")

In [13]:
# We could also build this mapping from a list of graphs
import graphein.protein as gp
from graphein.ml.clustering import build_fasta_file_from_graphs

# Build graphs
graphs = [gp.construct_graph(pdb_code=code) for code in ["3eiy", "1lds", "4hhb", "7wda"]]

# Build the FASTA file; the chains parameter selects a specific chain in each structure
build_fasta_file_from_graphs(graphs, fasta_out="sequences.fasta", chains=["A", "A", "A", "A"])

In [14]:
# Inspect the FASTA file:
with open("sequences.fasta", "r") as f:
    print(f.read())

>3eiy_A
SFSNVPAGKDLPQDFNVIIEIPAQSEPVKYEADKALGLLVVDRFIGTGMRYPVNYGFIPQTLSGDGDPVDVLVITPFPLLAGSVVRARALGMLKMTDESGVDAKLVAVPHDKVCPMTANLKSIDDVPAYLKDQIKHFFEQYKALEKGKWVKVEGWDGIDAAHKEITDGVANFKK
>1lds_A
MIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNGERIEKVEHSDLSFSKDWSFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWD
>4hhb_A
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>7wda_A
GLVVSFYTPATDGATFTAIAQRCNQQFGGRFTIAQVSLPRSPNEQRLQLARRLTGNDRTLDVMALDVVWTAEFAEAGWALPLSDDPAGLAENDAVADTLPGPLATAGWNHKLYAAPVTTNTQLLWYRPDLVNSPPTDWNAMIAEAARLHAAGEPSWIAVQANQGEGLVVWFNTLLVSAGGSVLSEDGRHVTLTDTPAHRAATVSALQILKSVATTPGADPSITRTEEGSARLAFEQGKAALEVNWPFVFASMLENAVKGGVPFLPLNRIPQLAGSINDIGTFTPSDEQFRIAYDASQQVFGFAPYPAVAPGQPAKVTIGGLNLAVAKTTRHRAEAFEAVRCLRDQHNQRYVSLEGGLPAVRASLYSDPQFQAKYPMHAIIRQQLTDAAVRPATPVYQALSIRLAAVLSPITEIDPESTADELAAQAQKAIDG
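
Printing the file works for four sequences, but for larger datasets it may be easier to check the contents programmatically. The following is a minimal sketch of a FASTA reader written in plain Python. `read_fasta` is a hypothetical helper introduced here for illustration and is not part of Graphein.

```python
def read_fasta(path):
    """Parse a FASTA file into a {header: sequence} dictionary.

    Hypothetical helper for illustration; sequences that span several
    lines are concatenated.
    """
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]
                records[header] = ""
            elif header is not None:
                records[header] += line
    return records


# Example usage (assumes sequences.fasta was written by the cells above):
# records = read_fasta("sequences.fasta")
# print({name: len(seq) for name, seq in records.items()})
```

Comparing the number of records and their lengths against your input is a quick way to catch missing or truncated entries before clustering.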



## Clustering the Data

In [11]:
from graphein.ml.clustering import train_and_test_from_fasta

train_and_test_from_fasta(
    fasta_file="sequences.fasta", number_of_sets=1, fraction_in_test=0.5,
    cluster_file_name="s2d_clusters.txt", seq_id_low_thresh=25.0,
    use_very_loose_condition=False, n_cpu=2,
    max_target_seqs=200, delete_db_when_done=True,
    train_set_key="LR", test_set_key="TS", early_break=True,
)


Clustering proteins by sequence identity
----------------------------------------
Parameters:
    - Number of sets to split: 1
    - Sequence identity low threshold: 25.0%
    - Fraction of sequences in test set: 0.5


*** Creating sequences pairs for clustering


Building a new DB, current time: 05/18/2022 18:38:39
New DB name:   /home/atj39/github/graphein/notebooks/sequences.fasta
New DB title:  sequences.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 4 sequences in 0.000277042 seconds.
*** Clustering from pairs of sequences
*** Generating sets
4 ids in 4 clusters
Generated set LR_00 and TS_00 - 2 sequences for 2 clusters in test, among (4, 4) in total. Test relative size is 0.5


['4hhb', '3eiy']
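
The log above reports four ids in four clusters, two of which end up in the test set. To illustrate why splitting at the cluster level prevents leakage, here is a simplified sketch of a cluster-based split. It is not Graphein's implementation: `split_clusters` is a hypothetical function, and the shuffle-then-fill strategy is an assumption chosen for brevity.

```python
import random


def split_clusters(clusters, fraction_in_test=0.5, seed=0):
    """Assign whole clusters to the test set without exceeding the target size.

    Illustrative only; this is not the algorithm used by Graphein.
    """
    rng = random.Random(seed)
    shuffled = list(clusters)
    rng.shuffle(shuffled)

    total = sum(len(cluster) for cluster in shuffled)
    target = fraction_in_test * total

    train, test = [], []
    for cluster in shuffled:
        # A cluster is never divided, so similar sequences stay together.
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


# Four singleton clusters, mirroring the example in this notebook.
clusters = [["3eiy"], ["1lds"], ["4hhb"], ["7wda"]]
train_ids, test_ids = split_clusters(clusters, fraction_in_test=0.5)
print("train:", train_ids)
print("test:", test_ids)
```

Because every member of a cluster lands on the same side of the split, no test sequence shares a cluster with a training sequence.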

## Inspecting the split

In [12]:
# Train Data
print("Train Data:")
with open("LR_00", "r") as f:
    print(f.read())

# Test Data
print("Test Data:")
with open("TS_00", "r") as f:
    print(f.read())

Train Data:
1lds
7wda

Test Data:
4hhb
3eiy
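
As a final sanity check, we can load both files and confirm that no identifier appears in both sets. This is a minimal sketch that assumes the split files contain one identifier per line, as in the output above. `read_ids` and `shared_ids` are hypothetical helpers, not Graphein functions.

```python
from pathlib import Path


def read_ids(path):
    """Read one identifier per non-empty line from a split file."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def shared_ids(train_ids, test_ids):
    """Return the identifiers present in both sets (empty list if disjoint)."""
    return sorted(set(train_ids) & set(test_ids))


# Example usage (assumes LR_00 and TS_00 exist from the clustering cell):
# overlap = shared_ids(read_ids("LR_00"), read_ids("TS_00"))
# assert not overlap, f"Leakage between train and test: {overlap}"
```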

