# Python setup #

The work with DeepTCR was carried out in a Python 3.7.17 virtual environment, so that the library and its dependencies install correctly. DeepTCR was run from Jupyter notebooks in VS Code. In addition to DeepTCR, the sys, pandas, numpy and pickle libraries must be imported.

For this analysis we import the DeepTCR_U module, which provides the functions needed to train and evaluate the unsupervised model.

In [None]:
import sys
import pandas as pd
import numpy as np
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_U
import pickle

# Unsupervised analysis #
The unsupervised analysis consists of training, graphically representing and then testing the variational autoencoder (VAE). The VAE is used to analyse the CDR3b sequences, V alleles and J alleles of the 1000 most expanded clonotypes of each sample, with the aim of obtaining a latent representation of the samples based on these characteristics. For this work, the VAE model is trained with sparsity, which aims to find the minimum number of latent features that explain 99% of the variance. For this purpose, we use the parameters sparsity_alpha=1.0 and var_explained=0.99 in the Train_VAE() function. The hyperparameter sparsity_alpha penalises the VAE model during training and modulates how sparse the latent representation should be. This acts as a form of regularisation that reduces the number of latent features and thus the risk of collinearity. The value of 1.0 is chosen because the DeepTCR developers consider it a good starting point. For more information, see the DeepTCR API (https://sidhomj.github.io/DeepTCR/api/).
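
To make the var_explained=0.99 criterion concrete, the following is a minimal sketch of how one could count the smallest number of latent features whose cumulative explained-variance ratio reaches a threshold. It is not DeepTCR's internal implementation: the function name `n_features_for_variance` and the toy ratios are hypothetical and serve only as an illustration.

```python
from itertools import accumulate


def n_features_for_variance(ratios, threshold=0.99):
    """Smallest number of features (largest ratios first) whose
    cumulative explained-variance ratio reaches `threshold`."""
    ordered = sorted(ratios, reverse=True)
    for n, total in enumerate(accumulate(ordered), start=1):
        if total >= threshold:
            return n
    # Threshold never reached: all features are needed.
    return len(ordered)


# Made-up explained-variance ratios (NOT results from this study).
toy_ratios = [0.60, 0.25, 0.10, 0.045, 0.003, 0.002]
n_kept = n_features_for_variance(toy_ratios, threshold=0.99)
print(n_kept)
```

With these toy values the cumulative sum is 0.60, 0.85, 0.95 and 0.995, so four features are enough to pass the 0.99 threshold.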

### 1. Getting data for the VAE analysis of the three clusters ###

In [None]:
DTCR_U = DeepTCR_U('unsuperv')
classes = ['ct1', 'ct2', 'ct3'] # Definition of classes of samples we compare
DTCR_U.Get_Data('clusters',
              Load_Prev_Data=True,
              aa_column_beta=0,
              v_beta_column=2,
              j_beta_column=3,
              count_column=1,
              data_cut=1000,
              type_of_data_cut='Num_Seq', # Keep the 1000 most expanded clonotypes per sample
              aggregate_by_aa=True) # Aggregate clonotypes sharing an amino acid sequence to avoid redundancy bias

### 2. VAE training ###

We call the Train_VAE() function with the hyperparameters described above for selecting the number of latent features (sparsity_alpha=1.0, var_explained=0.99).

In [28]:
# Train VAE model
DTCR_U.Train_VAE(Load_Prev_Data=False, sparsity_alpha=1.0, var_explained=0.99)

Training Done


### 3. Graphical representation ###

A clear and concise way to visualise the trained VAE model is a heatmap that shows, for each sample, the average value of each latent feature. For this we use the HeatMap_Samples() function.
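
As an illustration of what "the average value of each latent feature per sample" means, here is a minimal sketch that averages per-sequence feature vectors within each sample. It assumes a plain unweighted mean; whether HeatMap_Samples() aggregates in exactly this way (for example, with or without frequency weighting) is not stated here, and the function name `mean_latent_per_sample` and the toy rows are hypothetical.

```python
from collections import defaultdict


def mean_latent_per_sample(rows):
    """rows: iterable of (sample_id, feature_vector) pairs.
    Returns {sample_id: unweighted mean feature vector}."""
    sums = {}
    counts = defaultdict(int)
    for sample_id, features in rows:
        if sample_id not in sums:
            sums[sample_id] = [0.0] * len(features)
        # Accumulate each latent feature separately.
        sums[sample_id] = [s + f for s, f in zip(sums[sample_id], features)]
        counts[sample_id] += 1
    return {sid: [s / counts[sid] for s in total] for sid, total in sums.items()}


# Toy latent features for three sequences in two samples (made-up values).
toy_rows = [
    ("sample_A", [1.0, 0.0]),
    ("sample_A", [3.0, 2.0]),
    ("sample_B", [0.0, 4.0]),
]
sample_means = mean_latent_per_sample(toy_rows)
print(sample_means)
```

Each row of the resulting dictionary corresponds to one row of a sample-by-feature heatmap.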

In [None]:
DTCR_U.HeatMap_Samples(filename='heatmap.tif', # Save the figure
                       color_dict={'ct1': '#470E61', # Dictionary of class colouring
                                   'ct2': '#1F948C',
                                   'ct3': '#FDE725',
                                   },
                       labels=False) # Suppress sample labels in the plot

# Plot the cumulative sum of the explained variance of the latent features
import matplotlib.pyplot as plt
plt.plot(np.cumsum(DTCR_U.explained_variance_ratio_))

### 4. KNN sample classification ###

Additionally, a K-Nearest-Neighbor (KNN) classification was performed on the latent feature values of each sample obtained from the VAE above. The aim is to assess how well these latent features distinguish clusters 1, 2 and 3.
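
The per-class AUC used to judge the classifier can be illustrated with a small one-vs-rest sketch. It computes AUC as the fraction of (positive, negative) pairs that are ranked correctly, with ties counted as half. This is a generic textbook formulation, not DeepTCR's own code; the function names and the toy scores are hypothetical and are not results from this study.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive has the higher score; ties count as half a correct pair."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(true_classes, class_scores):
    """class_scores: {class_name: [predicted score for each sample]}."""
    return {
        c: binary_auc([1 if t == c else 0 for t in true_classes], scores)
        for c, scores in class_scores.items()
    }


# Made-up predicted scores for six samples (illustration only).
toy_truth = ["ct1", "ct1", "ct2", "ct2", "ct3", "ct3"]
toy_scores = {
    "ct1": [0.9, 0.6, 0.2, 0.1, 0.3, 0.7],
    "ct2": [0.1, 0.2, 0.8, 0.7, 0.3, 0.2],
    "ct3": [0.0, 0.2, 0.0, 0.2, 0.4, 0.1],
}
per_class_auc = one_vs_rest_auc(toy_truth, toy_scores)
print(per_class_auc)
```

An AUC of 1.0 means every sample of the class is scored above every sample of the other classes, while 0.5 corresponds to random ranking.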

In our study, KNN was applied and the corresponding AUC was computed for a model trained with 50 folds. The KL-Divergence, Euclidean, JS-Divergence, Wasserstein and Correlation distances, all predefined options of the distance_metric hyperparameter, were tested. The distance with the highest per-class AUC was selected.
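
To give some intuition for how these distances differ, the sketch below implements three of them (KL divergence, JS divergence and Euclidean distance) for two discrete distributions. These are the standard textbook definitions; how DeepTCR builds the per-sample distributions it compares is not covered here, and the function names and toy vectors are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; infinite when q is 0 where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and always finite."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Two toy distributions over three bins (made-up values).
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q), kl_divergence(q, p))  # KL is asymmetric
print(js_divergence(p, q), js_divergence(q, p))  # JS is symmetric
print(euclidean(p, q))
```

KL divergence depends on the order of its arguments and can be infinite, whereas JS divergence is symmetric and bounded, which is one reason the metrics can rank neighbours differently.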

In addition, because the number of sequences to analyse remains high even when only the 1000 most expanded clonotypes are considered, the number of sequences analysed in each fold was subsampled to 10000 to keep KNN computationally feasible. For more information, see the KNN_Repertoire_Classifier() function at https://sidhomj.github.io/DeepTCR/api/
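
The idea of capping the number of sequences per fold can be sketched as follows. The example assumes uniform random subsampling without replacement; how the `sample` argument of KNN_Repertoire_Classifier() draws sequences internally is not shown in this notebook, and the helper name `subsample` and the toy identifiers are hypothetical.

```python
import random


def subsample(sequences, max_n, seed=None):
    """Return at most `max_n` items, drawn uniformly without replacement."""
    if len(sequences) <= max_n:
        return list(sequences)
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(sequences, max_n)


# Toy example: cap 25 made-up sequence identifiers at 10.
toy_sequences = [f"seq_{i}" for i in range(25)]
subset = subsample(toy_sequences, max_n=10, seed=0)
print(len(subset))
```

Fixing the seed makes the drawn subset reproducible between runs, which helps when comparing distance metrics on the same sequences.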



In [None]:
# KNN with KL-Divergence
DTCR_U.KNN_Repertoire_Classifier(metrics=['AUC'],
                                distance_metric='KL',
                                plot_metrics=True,
                                by_class=True, # Show the performance metrics by class
                                sample=10000, 
                                folds=50,  n_jobs=3)


In [None]:
# KNN with correlation
DTCR_U.KNN_Repertoire_Classifier(metrics=['AUC'],
                                distance_metric='correlation',
                                plot_metrics=True,
                                by_class=True, 
                                sample=10000,
                                folds=50, n_jobs=3)

In [None]:
# KNN with euclidean distance
DTCR_U.KNN_Repertoire_Classifier(metrics=['AUC'],
                                distance_metric='euclidean',
                                plot_metrics=True,
                                by_class=True, 
                                sample=10000,
                                folds=50, n_jobs=3)

In [None]:
# KNN with Jensen-Shannon Divergence
DTCR_U.KNN_Repertoire_Classifier(metrics=['AUC'],
                                distance_metric='JS',
                                plot_metrics=True,
                                by_class=True, 
                                sample=10000,
                                folds=50, n_jobs=3)

In [None]:
# KNN with Wasserstein distance
# This method yielded the highest AUC values for the three clusters
DTCR_U.KNN_Repertoire_Classifier(metrics=['AUC'],
                                distance_metric='wasserstein',
                                plot_metrics=True,
                                by_class=True, 
                                sample=10000,
                                folds=50)