# Workflow #

The pipeline consists of (i) creating or reloading the work area via Get_Data(), (ii) building the repertoire classification model via Monte_Carlo_CrossVal(), and (iii) plotting the AUC-ROC curve. When the classification model gives positive results, the pipeline continues with (iv) motif identification via Motif_Identification() and (v) generation of the Residue Sensitivity Logos (RSL) for each cluster. This workflow is applied in all supervised analysis scripts. The procedure is based on the scripts deposited at https://github.com/sidhomj/DeepTCR_COVID19, whose results are published at https://doi.org/10.1038/s41598-021-93608-8

**Important:** The supervised model analyses stored in the *cluster1vcluster2*, *cluster1vcluster3*, *cluster2vcluster3* and *severity* .ipynb scripts share the same structure and code blocks (except the RSL logo code blocks in the *cluster2vcluster3* and *severity* files); they differ only in sample directories and in variable and file names. For simplicity, the detailed workflow comments are given in the *cluster1vcluster3.ipynb* script. They apply equally to the other scripts, which retain only basic documentation.

In [None]:
import sys
import pandas as pd
import numpy as np
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_WF
import pickle

In [None]:
DTCR_CT23 = DeepTCR_WF('CT2_CT3')
classes = ['ct2', 'ct3']  # Sample classes to compare
DTCR_CT23.Get_Data('ct2_ct3',
              Load_Prev_Data=True,
              aa_column_beta=0,
              v_beta_column=2,
              j_beta_column=3,
              count_column=1,
              data_cut=1000, # Keep the 1000 most expanded clonotypes
              type_of_data_cut='Num_Seq',
              aggregate_by_aa=True, # Aggregate clonotypes by amino-acid sequence to avoid redundancy bias from repeats
              classes=classes)
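
The combined effect of `aggregate_by_aa=True` and a `Num_Seq` cut of 1000 can be pictured with a small, library-independent sketch. It is only an illustration of the idea (merge repeated amino-acid sequences, then keep the most abundant ones) and not DeepTCR's internal implementation; the helper `aggregate_and_cut`, the choice to sum counts, and the toy sequences are hypothetical.

```python
from collections import defaultdict

def aggregate_and_cut(rows, top_n):
    """Hypothetical helper: merge rows sharing an amino-acid sequence
    by summing their counts, then keep the top_n most abundant."""
    totals = defaultdict(int)
    for seq, count in rows:
        totals[seq] += count
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

# Toy (CDR3 amino-acid sequence, count) pairs; the first sequence appears twice.
toy = [
    ("CASSLGQGYEQYF", 5),
    ("CASSPDRGNTEAFF", 3),
    ("CASSLGQGYEQYF", 4),
    ("CASSQETQYF", 1),
]
top = aggregate_and_cut(toy, top_n=2)
print(top)
```

In the notebook itself the cut is 1000 rather than 2, and the real aggregation is handled inside Get_Data().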

In [None]:
folds = 100  # 100-fold Monte Carlo cross-validation
epochs_min = 25
size_of_net = 'small'
num_concepts = 64  # Gave the best model performance on our data
hinge_loss_t = 0.1
train_loss_min = 0.1
seeds = np.array(range(folds))  # One seed per fold, for reproducible splits
graph_seed = 0

DTCR_CT23.Monte_Carlo_CrossVal(folds=folds,
                               epochs_min=epochs_min,
                               size_of_net=size_of_net,
                               num_concepts=num_concepts,
                               train_loss_min=train_loss_min,
                               combine_train_valid=True,
                               hinge_loss_t=hinge_loss_t,
                               multisample_dropout=True,  # Enable multi-sample dropout
                               weight_by_class=True,  # Enable class weighting
                               seeds=seeds,
                               graph_seed=graph_seed,
                               subsample=100)  # Subsample 100 sequences during training in each fold
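
For readers unfamiliar with Monte Carlo cross-validation, the sketch below shows the general idea behind the `folds` and `seeds` arguments: each fold draws a fresh random train/test partition of the samples, and fixing one seed per fold makes the partitions reproducible. This is a simplified illustration, not DeepTCR's code. The function `monte_carlo_splits`, the `test_fraction` value and the sample names are assumptions made for the example, and class stratification is ignored.

```python
import random

def monte_carlo_splits(sample_ids, folds, test_fraction, seeds):
    """Hypothetical helper: yield (train, test) lists, one random
    partition per fold, each driven by its own seed."""
    n_test = max(1, round(len(sample_ids) * test_fraction))
    for fold in range(folds):
        rng = random.Random(seeds[fold])  # independent, reproducible RNG per fold
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]

samples = [f"sample_{i}" for i in range(10)]
seeds = list(range(5))
splits = list(monte_carlo_splits(samples, folds=5, test_fraction=0.2, seeds=seeds))

for train, test in splits:
    print(len(train), "train /", len(test), "test:", sorted(test))
```

Unlike k-fold cross-validation, the test sets of different folds may overlap, so a sample can be held out in several folds and never in others.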

In [None]:
# Step (iv): identify motifs enriched in each class, evaluated at the sample level
DTCR_CT23.Motif_Identification('ct2', by_samples=True)
DTCR_CT23.Motif_Identification('ct3', by_samples=True)

In [None]:
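# Save the model predictions (DFs_pred) for later analysis
with open('ct23_model_preds.pkl', 'wb') as f:
    pickle.dump(DTCR_CT23.DFs_pred, f, protocol=4)

# Save the sequence predictions together with the beta sequences and the label encoder (lb)
with open('ct23_model_seq_preds.pkl', 'wb') as f:
    pickle.dump((DTCR_CT23.predicted, DTCR_CT23.beta_sequences, DTCR_CT23.lb), f, protocol=4)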
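
Step (iii), the AUC-ROC evaluation, can be reproduced from pickled predictions. The sketch below is a minimal, dependency-free illustration: it round-trips a stand-in object through pickle and computes the AUC as the probability that a positive sample is scored above a negative one. The dictionary layout (`y_test`, `y_pred`) and the values are assumptions for the example; the actual structure of `DFs_pred` should be inspected after loading the real file.

```python
import io
import pickle

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive has the higher score; ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Stand-in for saved predictions (hypothetical layout and values).
toy_preds = {"y_test": [1, 1, 0, 0, 1, 0],
             "y_pred": [0.9, 0.7, 0.4, 0.8, 0.6, 0.2]}

# Round-trip through pickle in memory, mirroring the dump/load on disk.
buffer = io.BytesIO()
pickle.dump(toy_preds, buffer, protocol=4)
buffer.seek(0)
reloaded = pickle.load(buffer)

auc = roc_auc(reloaded["y_test"], reloaded["y_pred"])
print(f"AUC = {auc:.3f}")
```

With the real file, `io.BytesIO()` would be replaced by `open('ct23_model_preds.pkl', 'rb')`.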