# How to use the benchmark
The benchmark contains 5 different splits. A model should be tested on each split to evaluate its performance accurately. This means that the model must be trained five times, each time on training and validation sets that are separated from the corresponding test split. The hyperparameters of the model should stay the same for all splits. This notebook shows a simple example of how to use the benchmark.
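
The example in this notebook only builds a training set. If a validation set is also needed (for instance for early stopping), one option is to hold out part of the samples that the training mask allows. The following is a minimal sketch, not part of the benchmark API: `split_train_validation`, `validation_fraction`, and the toy data are hypothetical names introduced for illustration, and the mask is assumed to be a list of booleans like the one returned by `get_train_mask`. Note that a random hold-out does not guarantee that validation sequences are dissimilar from the training sequences.

```python
import random


def split_train_validation(samples, train_mask, validation_fraction=0.1, seed=0):
    """Keep only the samples allowed by the mask, then hold out a random validation part."""
    # Drop every sample that the mask marks as too similar to the test split
    allowed = [sample for keep, sample in zip(train_mask, samples) if keep]
    # Shuffle with a fixed seed so that the split is reproducible
    rng = random.Random(seed)
    rng.shuffle(allowed)
    n_validation = int(len(allowed) * validation_fraction)
    return allowed[n_validation:], allowed[:n_validation]


# Toy data for illustration only (hypothetical sequences and mask)
toy_sequences = ["AAKK", "GLFD", "KKLL", "RRWW", "FFGG", "LLKK",
                 "WWRR", "GGAA", "KLKL", "RWRW", "AGAG", "DFLG"]
toy_samples = [{"Sequence": sequence} for sequence in toy_sequences]
toy_mask = [True, False, True, True, True, True,
            True, True, True, True, True, False]

train_samples, validation_samples = split_train_validation(
    toy_samples, toy_mask, validation_fraction=0.2
)
print(len(train_samples), len(validation_samples))
```

Using the same seed and the same `validation_fraction` for all five splits keeps the procedure identical across splits, in line with keeping the hyperparameters fixed.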

## MIC

In [23]:
import numpy as np

from qmap.benchmark import QMAPBenchmark

# Imports for the example
import json
from sklearn.linear_model import LinearRegression
from qmap.toolkit.aligner import Encoder


# Step 1: Load the training dataset.
# For this example, we load the DBAASP dataset, which is assumed to have already been downloaded to the data/build folder
# (the relative path below depends on where the notebook is run)
with open('../../../data/build/dataset.json', 'r') as f:
    dataset = json.load(f)
    # Filter out sequences that are too long because the aligner supports sequences up to 100 amino acids long
    dataset = [sample for sample in dataset if len(sample["Sequence"]) < 100]

# Loop over all splits of the benchmark
for split in range(5):
    # Step 2: Load the benchmark. For this example, we will use the threshold of 60% which is more conservative
    # For this example, we will only consider the species Escherichia coli
    benchmark = QMAPBenchmark(split, 60, species_subset=['Escherichia coli'])
    # Step 3: Get the training mask. It flags the sequences that can be used to train the model because they are dissimilar enough from the test set.
    train_mask = benchmark.get_train_mask([sample['Sequence'] for sample in dataset])
    # Step 4: Filter the dataset to only keep the sequences that can be used for training
    train_dataset = [sample for is_valid, sample in zip(train_mask, dataset) if is_valid]

    # Step 5: Train the model on the training dataset
    # For this example, we will use a linear model on top of the encoder's embeddings
    encoder = Encoder()
    y_train = [sample['Targets'].get('Escherichia coli', (None, None, None))[0] for sample in train_dataset]
    X_train = encoder.encode([sample['Sequence'] for sample, target in zip(train_dataset, y_train) if target is not None]).embeddings
    y_train = np.log10(np.array([target for target in y_train if target is not None]))
    model = LinearRegression()
    model.fit(X_train.float(), y_train)

    # Step 6: Get the test sequences and their targets
    # The only inputs are the sequences because all other options are turned off.
    # To include C- and N-terminal modifications, enable them in the benchmark constructor. The inputs attribute
    # then returns the sequences, the C-terminal modifications, and the N-terminal modifications.
    # The same applies to unusual amino acids, and to species as inputs, in which case the species name is included with the sequence.
    test_sequences = benchmark.inputs
    # The targets have shape (n, n_species), which is (n, 1) in our case since we only consider one species (Escherichia coli).
    test_targets = benchmark.targets.reshape(-1)

    # Step 7: Get the predictions of the model on the test set
    X_test = encoder.encode(test_sequences).embeddings
    y_pred = model.predict(X_test.float())

    # Step 8: Evaluate the model
    # The benchmark provides a method to evaluate the model on the test set given the predictions.
    results = benchmark.compute_metrics(y_pred)
    print(results)
    # You can also get the results as a dictionary
    results_dict = results.dict()

QMAPMetrics(split: 0, threshold: 60%):
 - RMSE: 0.9588
 - MSE: 0.9193
 - MAE: 0.7596
 - R2: 0.1755
 - Spearman: 0.4869
 - Kendall's Tau: 0.3350
 - Pearson: 0.4663
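
The loop above prints one set of metrics per split. To report a single result for the whole benchmark, the per-split values can be summarized, for example as a mean and a standard deviation. The following is a minimal sketch under stated assumptions: `summarize_splits` is a hypothetical helper, and the metric names used as dictionary keys are assumptions about what `results.dict()` contains. In the notebook, the list would be filled by appending `results.dict()` inside the loop; here it holds placeholder numbers.

```python
from statistics import mean, stdev


def summarize_splits(per_split_metrics):
    """Return {metric name: (mean, sample standard deviation)} across splits."""
    summary = {}
    for name in per_split_metrics[0].keys():
        values = [metrics[name] for metrics in per_split_metrics]
        # The sample standard deviation needs at least two values
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[name] = (mean(values), spread)
    return summary


# Placeholder numbers for illustration only; they are NOT benchmark results.
placeholder_metrics = [
    {"RMSE": 1.0, "Spearman": 0.50},
    {"RMSE": 1.2, "Spearman": 0.40},
    {"RMSE": 0.8, "Spearman": 0.60},
]

for name, (average, spread) in summarize_splits(placeholder_metrics).items():
    print(f"{name}: {average:.3f} +/- {spread:.3f}")
```

Reporting the spread alongside the mean shows how much the score depends on the split.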

## Subsets
The benchmark can be divided into subsets to evaluate the model's performance under different conditions. For example, you can evaluate the model on high-complexity sequences, low-complexity sequences, or highly effective sequences. To do this, access the desired subset attribute of the benchmark object. A subset can be used like the benchmark object itself since it has almost the same interface.

In [24]:
# Run an example with the benchmark, encoder, and model left over from the last loop iteration (split 4)
sub_benchmark = benchmark.high_efficiency
test_sequences = sub_benchmark.inputs
test_targets = sub_benchmark.targets.reshape(-1)

# Get the predictions of the model on the test set
X_test = encoder.encode(test_sequences).embeddings
y_pred = model.predict(X_test.float())

# Evaluate the model
# The benchmark provides a method to evaluate the model on the test set given the predictions.
results = sub_benchmark.compute_metrics(y_pred)
print(results)

QMAPMetrics(split: 4, threshold: 60%):
 - RMSE: 0.9475
 - MSE: 0.8978
 - MAE: 0.7598
 - R2: -3.5476
 - Spearman: 0.2026
 - Kendall's Tau: 0.1384
 - Pearson: 0.1880
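
In the output above, the R2 is strongly negative even though the RMSE is close to the one obtained on the full test set. One possible explanation, stated here as an assumption and not verified against the benchmark, is that the subset covers a narrower range of target values: R2 compares the squared error to the variance of the targets, so the same error looks much worse when that variance is small. The sketch below illustrates this effect on synthetic data only; the threshold of 0.5 and the noise levels are arbitrary choices.

```python
import random
from statistics import fmean


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]) ** 0.5


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic targets and predictions; the prediction error does not depend on the target
rng = random.Random(0)
y_true = [rng.gauss(1.5, 1.0) for _ in range(5000)]
y_pred = [t + rng.gauss(0.0, 0.8) for t in y_true]

# Subset restricted to low target values, which has a much smaller target variance
subset = [(t, p) for t, p in zip(y_true, y_pred) if t < 0.5]
subset_true = [t for t, _ in subset]
subset_pred = [p for _, p in subset]

print(f"full set: RMSE={rmse(y_true, y_pred):.3f}, R2={r2(y_true, y_pred):.3f}")
print(f"subset:   RMSE={rmse(subset_true, subset_pred):.3f}, R2={r2(subset_true, subset_pred):.3f}")
```

For this reason, error metrics such as RMSE and MAE are often easier to compare between a subset and the full test set than R2.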
