# How to split a dataset
The dataset is first clustered, then split along cluster boundaries, so that sequences in the test set have a low identity to the sequences in the training set. The maximum identity tolerated between the two sets is controlled by the `threshold` parameter.

The mechanism builds a graph in which each node is a sequence and an edge connects two nodes whenever their sequence identity is above the threshold. The Leiden algorithm then detects communities in this graph, and each community is treated as a cluster of sequences. The test set is built by sampling whole clusters. Because community detection can leave edges between different clusters, some training sequences may still share an identity above the threshold with a test sequence; these can be removed by setting the `post_filtering` argument to `True`.
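
The idea can be illustrated with a minimal, standard-library-only sketch. It is not the `qmap` implementation: `toy_identity` is a stand-in for the alignment-based identity, connected components are swapped in for Leiden communities, and all function names and default values here are hypothetical.

```python
import random
from difflib import SequenceMatcher


def toy_identity(a, b):
    """Stand-in similarity in [0, 1]; NOT the alignment-based identity of the library."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_sequences(sequences, threshold):
    """Build the identity graph and return its connected components as index lists."""
    n = len(sequences)
    # One node per sequence, one edge per pair whose identity is above the threshold.
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if toy_identity(sequences[i], sequences[j]) > threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Assumption: connected components replace Leiden communities in this sketch.
    clusters, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in neighbors[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(sorted(component))
    return clusters


def split_by_clusters(sequences, threshold=0.6, test_size=0.25, seed=0):
    """Move whole clusters into the test set until it holds about test_size of the data."""
    clusters = cluster_sequences(sequences, threshold)
    random.Random(seed).shuffle(clusters)
    target = test_size * len(sequences)
    test_idx = set()
    for cluster in clusters:
        if len(test_idx) >= target:
            break
        test_idx.update(cluster)
    train = [s for i, s in enumerate(sequences) if i not in test_idx]
    test = [sequences[i] for i in sorted(test_idx)]
    return train, test


def post_filter(train, test, threshold=0.6):
    """Drop training sequences that are still above the threshold against any test sequence."""
    return [s for s in train if all(toy_identity(s, t) <= threshold for t in test)]


toy_peptides = ["AAAAKKKK", "AAAAKKKR", "GGGGLLLL", "GGGGLLLV",
                "WWWWPPPP", "CCCCDDDD", "EEEEFFFF", "HHHHIIII"]
toy_train, toy_test = split_by_clusters(toy_peptides)
toy_train = post_filter(toy_train, toy_test)
```

With connected components no edge crosses the split, so `post_filter` removes nothing here. With a community detection method such as Leiden, edges between clusters can remain, which is where a post-filtering step becomes useful.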

In [None]:
from qmap.toolkit.split import train_test_split

# Imports for the example
import json


# Step 1: Load the training dataset.
# For this example, we load the DBAASP dataset, which is assumed to have already been downloaded to the data/build folder
with open('../../../data/build/dataset.json', 'r') as f:
    dataset = json.load(f)
    # Filter out sequences that are too long, because the aligner only supports sequences up to 100 amino acids long
    dataset = [sample for sample in dataset if len(sample["Sequence"]) < 100]

# Step 2: Split the dataset into train and validation sets.
sequences = [sample['Sequence'] for sample in dataset]
train_sequences, val_sequences, train_samples, val_samples = train_test_split(sequences, dataset,
                                                                              test_size=0.15, post_filtering=True)

In [None]:
# If you only need to split the sequences, pass just the sequences to the function:
train_sequences, val_sequences = train_test_split(sequences, test_size=0.15, post_filtering=True)
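
After splitting, a quick sanity check on the result can be useful. The helper below is a hypothetical sketch and not part of `qmap`: it only reports the set sizes, the effective validation fraction, and the number of sequences that appear verbatim in both sets. It does not recompute sequence identity.

```python
def summarize_split(train_sequences, val_sequences):
    """Return basic statistics about a train/validation split (hypothetical helper)."""
    total = len(train_sequences) + len(val_sequences)
    return {
        "n_train": len(train_sequences),
        "n_val": len(val_sequences),
        # Fraction of the retained sequences that ended up in the validation set.
        "val_fraction": len(val_sequences) / total if total else 0.0,
        # Number of distinct sequences present verbatim in both sets.
        "exact_overlap": len(set(train_sequences) & set(val_sequences)),
    }
```

Calling `summarize_split(train_sequences, val_sequences)` on the lists returned above gives a small dictionary to inspect. Note that when `post_filtering=True` removes training sequences, the effective validation fraction may differ from the requested `test_size`.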