# Supervised Sequence Classification

While unsupervised methods can be powerful for identifying antigen-specific sequences, leveraging known labels to guide the learning process can yield better results, provided there is enough data to learn from. The first type of supervised learning we will explore within DeepTCR is classifying a given TCR sequence to some label (i.e., its antigen specificity) using its sequence information.

First, we will load data from the Murine dataset, which has TCR sequences from 9 murine antigens with beta-chain information, including sequence, v-beta, and j-beta gene usage.

In [None]:
# import pandas as pd
# DF = pd.read_csv('Data/MOG/IED8_pos/Filtered_MOG_assays.csv')
# DF.head()
import warnings
warnings.filterwarnings("ignore")

In [None]:
import sys
sys.path.append('../')
from DeepTCR.DeepTCR import DeepTCR_SS

In [None]:
# Instantiate training object
DTCR_SS = DeepTCR_SS('MOG')

# Load data from the directory; the *_column arguments are column indices in each input file
DTCR_SS.Get_Data(directory='Data/MOG', Load_Prev_Data=False, aggregate_by_aa=True,
                 aa_column_beta=0, count_column=1, v_beta_column=2, j_beta_column=3)

We will then train the sequence classifier as follows. First, we will split the dataset into train, validation, and independent test cohorts so we can assess how well our model generalizes to new, unseen data. Then we will train our sequence classifier. The test_size parameter tells DeepTCR how much of the data to hold out for validation and testing; the held-out portion is split evenly between the two.

In [None]:
DTCR_SS.Get_Train_Valid_Test(test_size=0.1)
DTCR_SS.Train(suppress_output=True)
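
To make the effect of test_size concrete, here is a minimal sketch of a split in which the held-out fraction is divided evenly between validation and test. It only illustrates the description above and is not DeepTCR's own code: the function name split_indices is hypothetical, and the library may, for example, split per class rather than over the pooled data.

```python
import random

def split_indices(n_samples, test_size=0.1, seed=0):
    """Hypothetical helper: hold out `test_size` of the data, half for valid, half for test."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_holdout = int(round(n_samples * test_size))
    n_valid = n_holdout // 2
    holdout = idx[:n_holdout]
    valid, test = holdout[:n_valid], holdout[n_valid:]
    train = idx[n_holdout:]
    return train, valid, test

train_idx, valid_idx, test_idx = split_indices(1000, test_size=0.1)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With 1000 sequences and test_size=0.1, this leaves 900 for training and 50 each for validation and testing.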

When training is done, we can assess how well our classifier performs on the independent test set by looking at the ROC curves.

In [None]:
DTCR_SS.AUC_Curve()
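
As a reminder of what the area under a ROC curve measures, the sketch below computes the AUC for one class against the rest as the fraction of (positive, negative) pairs that the classifier ranks correctly. This is a generic illustration on toy numbers; the helper roc_auc is hypothetical and is not how DeepTCR computes its curves.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy one-vs-rest labels and predicted probabilities (not real data).
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the class from all others.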

We can also train our classifier with two other methods that allow for multiple iterations: a Monte Carlo method and a K-fold cross-validation method.

For the Monte Carlo method, we specify the number of times we want to train the classifier and the test size we want for each iteration. Of note, all parameters available for the Train method are also inputs to the Monte Carlo and K-fold cross-validation methods.

In [None]:
DTCR_SS.Monte_Carlo_CrossVal(test_size=0.25,folds=4, suppress_output=True)
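
The idea behind Monte Carlo cross-validation can be sketched as repeated, independent random splits. The example below is a simplified illustration, assuming a plain train/test split with no validation set; the generator monte_carlo_splits is hypothetical and does not reproduce DeepTCR's internals.

```python
import random

def monte_carlo_splits(n_samples, test_size, folds, seed=0):
    """Yield `folds` independent random (train, test) index splits."""
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_size))
    for _ in range(folds):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]

mc_splits = list(monte_carlo_splits(n_samples=20, test_size=0.25, folds=4))
for train_idx, test_idx in mc_splits:
    print(len(train_idx), len(test_idx))
```

Because each split is drawn independently, a given sequence may land in the test set several times or not at all across iterations.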

Once again, we can view the AUC curve.

In [None]:
DTCR_SS.AUC_Curve()

To run a K-fold cross-validation (here with 4 folds of the data), run the following command. In this case, no test_size is required, as the algorithm is trained on the entirety of the train folds and tested on the held-out fold.

In [None]:
DTCR_SS.K_Fold_CrossVal(folds=4, suppress_output=True)
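
The reason no test_size is needed can be seen in a small sketch of K-fold partitioning: the data is cut into K disjoint folds, and each fold serves as the test set exactly once. The generator k_fold_splits below is a hypothetical illustration, not DeepTCR's implementation.

```python
import random

def k_fold_splits(n_samples, folds, seed=0):
    """Yield (train, test) index splits where each fold is the test set once."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    # Deal the shuffled indices into `folds` roughly equal groups.
    fold_ids = [idx[i::folds] for i in range(folds)]
    for k in range(folds):
        test = fold_ids[k]
        train = [i for j, fold in enumerate(fold_ids) if j != k for i in fold]
        yield train, test

kf_splits = list(k_fold_splits(n_samples=20, folds=4))
for train_idx, test_idx in kf_splits:
    print(len(train_idx), len(test_idx))
```

With 20 samples and 4 folds, every iteration trains on 15 samples and tests on the remaining 5, so the test fraction is fixed by the number of folds.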

Once our algorithm has been trained, we may want to see which sequences are most strongly predicted for each label. To do this, we run the following command. Its output is a dictionary of dataframes stored within the object, which we can view. These dataframes can also be found in the results folder under the subdirectory 'Rep_Sequences'.

In [None]:
DTCR_SS.Representative_Sequences()

In [None]:
print(DTCR_SS.Rep_Seq['IED8-MOG'])
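
Conceptually, picking the most strongly predicted sequences for a label amounts to ranking sequences by their predicted probability for that label and keeping the top few. The sketch below shows this on toy values; the helper top_sequences, the sequences, and the probabilities are all made up for illustration, and the actual columns of the Rep_Seq dataframes may differ.

```python
def top_sequences(sequences, probabilities, top_k=3):
    """Return the top_k (sequence, probability) pairs, highest probability first."""
    ranked = sorted(zip(sequences, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

# Toy CDR3-like strings and toy predicted probabilities for one label (not real data).
toy_sequences = ["CASSLGQF", "CASSPDRF", "CASRGTEF", "CASSQETF"]
toy_probabilities = [0.12, 0.91, 0.47, 0.78]

top_two = top_sequences(toy_sequences, toy_probabilities, top_k=2)
for seq, prob in top_two:
    print(seq, prob)
```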

Furthermore, we may want to know which learned motifs are associated with a given label. To do this, we can run the following command with the label we want to know the predictive motifs for.

In [None]:
DTCR_SS.Motif_Identification('IED8-MOG')

The motifs can then be found in fasta files in the results folder under (label)_(alpha/beta)_Motifs. These fasta files can then be used with "https://weblogo.berkeley.edu/logo.cgi" for motif visualization. The file names have the magnitude of the enrichment as the first number, followed by '_feature_n'. The higher the number, the more enriched the motif is in the label-specific sequences.
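
Since the enrichment magnitude leads each file name, the motif files can be ordered from most to least enriched by parsing that number. The sketch below assumes names of the form '<magnitude>_feature_<n>.fasta'; the example file names and the '.fasta' extension are hypothetical, so adjust the parsing to the names actually written to your results folder.

```python
def enrichment_from_name(file_name):
    """Parse the leading enrichment magnitude from '<magnitude>_feature_<n>...'."""
    magnitude, _, _ = file_name.partition("_feature_")
    return float(magnitude)

# Hypothetical file names following the naming pattern described above.
file_names = ["0.42_feature_7.fasta", "1.35_feature_2.fasta", "0.08_feature_11.fasta"]

# Most enriched motif first.
ranked = sorted(file_names, key=enrichment_from_name, reverse=True)
for name in ranked:
    print(name)
```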

## Visualization

We can also visualize the latent space learned by the supervised sequence classifier by plotting a UMAP representation of the sequences in two dimensions.

In [None]:
DTCR_SS.UMAP_Plot(by_class=True,freq_weight=True,scale=1000)

We can also specify whether we want to plot only the sequences that were used in the train, valid, or test set with the 'set' parameter.

In [None]:
DTCR_SS.UMAP_Plot(by_class=True,freq_weight=True,scale=2000,set='test')

We can also visualize how the repertoires are related in this learned representation. This visualization is helpful when we want to compare how different TCR repertoires are related structurally.

In [None]:
# DTCR_SS.Repertoire_Dendrogram(gridsize=50,gaussian_sigma=0.75,lw=6,dendrogram_radius=0.3)

See documentation for how to use the full functionality of this method.

## Test on a New Set

In [None]:
import numpy as np
import pandas as pd

test_DF = pd.read_csv('Data/test/test.tsv', sep='\t')
new_beta = test_DF.iloc[:,0]
new_vbeta = test_DF.iloc[:,2]
new_jbeta = test_DF.iloc[:,3]

# new_beta = np.array([...])
# new_vbeta = np.array([...])
# new_jbeta = ...

In [None]:
set(new_vbeta)

In [None]:
train_data = pd.read_csv('Data/MOG/IED8-MOG/train.tsv', sep='\t')
set(train_data['Chain 2 J Gene'])
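
The two cells above list the gene names present in the new data and in the training data. One reason to compare them is to spot gene labels in the new set that never appeared during training, since a model can only have learned about labels it has seen. The sketch below shows such a check on toy gene names; the helper unseen_genes is hypothetical, and how DeepTCR treats unseen gene labels at inference time is not covered here.

```python
def unseen_genes(new_genes, train_genes):
    """Return the sorted gene names that occur in new_genes but not in train_genes."""
    return sorted(set(new_genes) - set(train_genes))

# Toy J-gene labels (illustrative values only).
train_j = ["TRBJ1-1", "TRBJ2-7", "TRBJ2-1"]
new_j = ["TRBJ2-7", "TRBJ1-4", "TRBJ2-1", "TRBJ1-4"]

missing = unseen_genes(new_j, train_j)
print(missing)
```

The same check applies to the V-gene column, and it will also surface naming mismatches between files (for example, differing prefixes or allele suffixes).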


The Sequence_Inference method used below returns two arrays, features and features_dist:

- features (array), shape = [N, latent_dim]: An N x latent_dim array containing features for all sequences. For the VAE, these are the features from the latent space. For the sequence classifier, these are the probabilities for every class; for the sequence regressor, the regressed value. When multiple models are used for inference in an ensemble, this is the average prediction over all models.

- features_dist (array), shape = [n_models, N, latent_dim]: An array containing the output of each model separately for every input sequence. This output is useful when multiple models are used in an ensemble to predict on a new sequence, as it describes the distribution of the predictions over all models.

In [None]:
predictions, dist = DTCR_SS.Sequence_Inference(
    beta_sequences=new_beta,
    v_beta=new_vbeta,
    j_beta=new_jbeta,
    batch_size=10,
    return_dist=True
)

In [None]:
predictions[:5,:]
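
To turn these outputs into class calls, one can average the per-model outputs and take the most probable class for each sequence. The sketch below does this on a toy array shaped like the per-model output described above, [n_models, N, n_classes]. The values and the class_names list are made up; in practice the column order must be taken from the trained object, which is not shown here.

```python
import numpy as np

# Toy per-model outputs with shape [n_models, N, n_classes] (not real predictions).
toy_dist = np.array([
    [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]],
    [[0.6, 0.4], [0.1, 0.9], [0.5, 0.5]],
])

# Ensemble prediction: average over the model axis, giving shape [N, n_classes].
toy_predictions = toy_dist.mean(axis=0)

# Hypothetical class names, assumed to match the column order of the predictions.
class_names = np.array(["label_A", "label_B"])
predicted = class_names[toy_predictions.argmax(axis=1)]

# Per-sequence spread across models, a rough indicator of ensemble disagreement.
spread = toy_dist.std(axis=0)

print(predicted)
print(spread.max(axis=1))
```

A large spread for a sequence means the individual models disagree, so its averaged probability deserves less trust.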