## Predicting antigen specificity 

In [7]:
import pandas as pd

In [8]:
dataset = pd.read_csv('vdjdb.csv')

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92771 entries, 0 to 92770
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   complex.id        92771 non-null  int64 
 1   gene              92771 non-null  object
 2   cdr3              92771 non-null  object
 3   v.segm            92670 non-null  object
 4   j.segm            91626 non-null  object
 5   species           92771 non-null  object
 6   mhc.a             92771 non-null  object
 7   mhc.b             92771 non-null  object
 8   mhc.class         92771 non-null  object
 9   antigen.epitope   92771 non-null  object
 10  antigen.gene      92709 non-null  object
 11  antigen.species   92771 non-null  object
 12  reference.id      91260 non-null  object
 13  method            92771 non-null  object
 14  meta              92771 non-null  object
 15  cdr3fix           92771 non-null  object
 16  vdjdb.score       92771 non-null  int64 
 17  web.method  

In [9]:
dataset.isnull().sum()

complex.id             0
gene                   0
cdr3                   0
v.segm               101
j.segm              1145
species                0
mhc.a                  0
mhc.b                  0
mhc.class              0
antigen.epitope        0
antigen.gene          62
antigen.species        0
reference.id        1511
method                 0
meta                   0
cdr3fix                0
vdjdb.score            0
web.method             0
web.method.seq         0
web.cdr3fix.nc         0
web.cdr3fix.unmp       0
dtype: int64

# Data Transformation

## Removing Unwanted Columns

Several columns describe the literature source (reference.id), the sequencing methods (method, web.method.seq), the collection method (web.method) and similar metadata, some of which feed into how **vdjdb.score** is calculated.

We therefore remove these columns first. The deleted columns are:
-  reference.id
-  method
-  meta
-  cdr3fix
-  web.method  
-  web.method.seq
-  web.cdr3fix.nc
-  web.cdr3fix.unmp

(Work by Yutong)

In [10]:
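# Drop the provenance/method columns by name rather than by position
drop_cols = ['reference.id', 'method', 'meta', 'cdr3fix', 'web.method', 'web.method.seq', 'web.cdr3fix.nc', 'web.cdr3fix.unmp']
dataset = dataset.drop(columns=drop_cols)
dataset.info()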

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92771 entries, 0 to 92770
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   complex.id       92771 non-null  int64 
 1   gene             92771 non-null  object
 2   cdr3             92771 non-null  object
 3   v.segm           92670 non-null  object
 4   j.segm           91626 non-null  object
 5   species          92771 non-null  object
 6   mhc.a            92771 non-null  object
 7   mhc.b            92771 non-null  object
 8   mhc.class        92771 non-null  object
 9   antigen.epitope  92771 non-null  object
 10  antigen.gene     92709 non-null  object
 11  antigen.species  92771 non-null  object
 12  vdjdb.score      92771 non-null  int64 
dtypes: int64(2), object(11)
memory usage: 9.2+ MB


---

## How to filter the dataset

1. Using only beta chains -- The CDR3 regions of TCRβ chains sit at the center of the paratope and are considered the key determinant of specificity in antigen recognition (Jiang, Huo and Cheng Li, 2023). Moreover, the database contains mostly beta-chain samples. -- <b>Done</b>
2. Taking only human samples (Why?) -- <b>Done</b>
    1. Humans account for most of the samples
3. Removed duplicates -- <b>Done</b>
4. Restricted to MHC class I (Why?) -- <b>Not sure?</b>
    1. There are around 80K MHC class I records
    2. The TEINet paper may have focused mainly on MHC class I
---
5. <b>Only kept the CDR3β and epitope sequences whose lengths lie between 5–30 and 7–15 amino acids, respectively (Why?)</b>
    1. This was likely done to focus the model on plausible sequence lengths and remove potentially noisy or erroneous data. (Explore more)
    2. Here, CDR3s with lengths between 10 and 20 were kept -- <b>Done</b>
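
A length filter of this kind might look like the sketch below. The defaults use the 10-20 range noted above for CDR3 and the 7–15 range for epitopes; whether the epitope range was actually applied in this notebook is an assumption, and `filter_by_length` and the demo rows are hypothetical.

```python
import pandas as pd

def filter_by_length(df, cdr3_range=(10, 20), epitope_range=(7, 15)):
    """Keep rows whose CDR3 and epitope lengths fall in the inclusive ranges."""
    cdr3_len = df["cdr3"].str.len()
    epi_len = df["antigen.epitope"].str.len()
    # Series.between is inclusive on both ends by default.
    mask = cdr3_len.between(*cdr3_range) & epi_len.between(*epitope_range)
    return df.loc[mask].reset_index(drop=True)

# Made-up rows: a 5-residue CDR3, a 13-residue CDR3, and a 4-residue epitope.
demo = pd.DataFrame({
    "cdr3": ["CASSF", "CASSLGQAYEQYF", "CASSLGQAYEQYF"],
    "antigen.epitope": ["GILGFVFTL", "GILGFVFTL", "GILG"],
})
kept = filter_by_length(demo)
kept_wide = filter_by_length(demo, cdr3_range=(5, 30))
print(len(kept), len(kept_wide))
```

Passing `cdr3_range=(5, 30)` reproduces the wider range quoted from the paper, which makes it straightforward to compare how many rows each choice retains.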


#### Explanation for CDR3 Lengths
The optimal CDR3 length for predicting TCR-peptide interactions depends on the dataset and the model used. The CDR3 region is typically 10-20 amino acids long, and its length can affect antigen specificity[3]. The mid-region of the CDR3 loop makes most of the contacts with the peptide, which is attributed to the flexibility of both the loop and the peptide[3].

The authors of [3] also show that their modeling pipeline can be used to study features of residue interactions in TCR:pMHC complexes, giving immunologists access to structural data and in silico structural modeling[3].

For the VDJdb dataset, a CDR3 length of 10-20 amino acids is therefore treated here as a suitable range for predicting TCR-peptide interactions.
    
Citations:
[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10464843/

#### Explanation for epitope lengths

In immunology and T-cell receptor (TCR) research, an epitope length of 7–15 amino acids is commonly used because it covers the typical length of peptides that Major Histocompatibility Complex (MHC) molecules present to TCRs. MHC molecules present peptide fragments derived from pathogens to TCRs on T cells, which can then recognize and respond to these peptides. Peptides that bind class I MHC molecules are generally 8-10 amino acids long, while those that bind class II MHC molecules are often longer, ranging from 13 to 25 amino acids, with a core region of about 9 amino acids that sits in the MHC binding groove.


---
6. Removing epitopes with fewer than 10 associated TCR sequences (Why?) <b>Done</b>
    1. The merged dataset was highly imbalanced: some epitopes had many associated TCRs and others very few.
    2. To address this imbalance, the authors removed epitopes with fewer than 10 associated TCR sequences.
    3. This guarantees a minimum number of examples per epitope, which can help the model learn more robust patterns.
7. Removing TCR sequences with ambiguous amino acids (Why?)
    1. Some TCR sequences contained ambiguous amino acid characters (B, J, O, U, X), which the authors removed.
8. I should keep both positive and negative samples. (Why?)
    1. What are negative samples? vdjdb.score in (0, 1)
    2. What are positive samples? vdjdb.score in (2, 3)
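
The two removal steps in this list (ambiguous residues and rare epitopes) could be sketched as below. This is only an illustration under stated assumptions: the helper names, the order of the two steps (ambiguous sequences are dropped first so that the per-epitope counts reflect the cleaned rows), and the demo data are not taken from the paper or the notebook.

```python
import pandas as pd

AMBIGUOUS = "BJOUX"  # the ambiguous characters listed in the notes above

def drop_ambiguous(df, col="cdr3"):
    """Remove rows whose sequence contains any ambiguous amino-acid letter."""
    has_ambiguous = df[col].str.contains(f"[{AMBIGUOUS}]", regex=True)
    return df.loc[~has_ambiguous]

def drop_rare_epitopes(df, min_tcrs=10, col="antigen.epitope"):
    """Keep only epitopes that appear in at least `min_tcrs` rows."""
    counts = df[col].map(df[col].value_counts())
    return df.loc[counts >= min_tcrs]

# Made-up rows; min_tcrs is lowered to 2 so the tiny demo keeps something.
demo = pd.DataFrame({
    "cdr3": ["CASSLGQAYEQYF", "CASSXGQAYEQYF",
             "CASSIRSSYEQYF", "CASSPGTGGYEQYF"],
    "antigen.epitope": ["GILGFVFTL", "GILGFVFTL", "GILGFVFTL", "NLVPMVATV"],
})
clean = drop_rare_epitopes(drop_ambiguous(demo), min_tcrs=2)
print(clean)
```

On the real data the call would use the default `min_tcrs=10`, matching the threshold in the notes.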

## How to prepare the data for the ML model?

1. Encoding the input sequences: (Is this process too complex?)
    1. The authors used two separate pre-trained TCRpeg models to encode the TCR (CDR3β) and epitope sequences into numerical vector representations.
    2. This transfer learning approach was used to leverage the knowledge learned from the pre-training tasks.
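
If the pre-trained TCRpeg encoders turn out to be too complex for a first pass, one simpler stand-in is a padded one-hot encoding. The sketch below is explicitly not TCRpeg and does not use its API; it only illustrates what "encoding a sequence into a numerical vector" means. The maximum lengths of 20 and 15 are assumptions chosen to match the length filters above, and `one_hot_encode` is a hypothetical helper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(seq, max_len):
    """Return a flattened (max_len * 20) one-hot vector, zero-padded on the right."""
    if len(seq) > max_len:
        raise ValueError(f"sequence longer than max_len={max_len}: {seq}")
    mat = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(seq):
        # A KeyError here means the sequence holds a non-standard residue.
        mat[pos, AA_INDEX[aa]] = 1.0
    return mat.ravel()

tcr_vec = one_hot_encode("CASSLGQAYEQYF", max_len=20)  # CDR3 capped at 20
epi_vec = one_hot_encode("GILGFVFTL", max_len=15)      # epitope capped at 15
print(tcr_vec.shape, epi_vec.shape)
```

A one-hot baseline carries no learned information about residue similarity, so it is mainly useful as a reference point against which a pre-trained encoder can be compared.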

## Which ML algorithm to use?

1. The encoded TCR and epitope vectors were concatenated and fed into the final fully connected neural network (FCN) of the TEINet model (Is this too complex?)
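
The concatenate-then-FCN idea can be illustrated with a small NumPy forward pass. This is a sketch of the general pattern only: the layer sizes, the ReLU and sigmoid activations, and the random untrained weights are assumptions for illustration and are not TEINet's actual architecture or parameters.

```python
import numpy as np

def init_fcn(in_dim, hidden_dim, seed=0):
    """Create random (untrained) weights for a two-layer network."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0, 0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0, 0.1, size=(hidden_dim, 1)),
        "b2": np.zeros(1),
    }

def predict_binding(tcr_vecs, epi_vecs, params):
    """Concatenate paired encodings and run a two-layer FCN.

    Returns one score in (0, 1) per TCR-epitope pair.
    """
    x = np.concatenate([tcr_vecs, epi_vecs], axis=1)      # (batch, d_tcr + d_epi)
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])  # ReLU hidden layer
    logits = h @ params["W2"] + params["b2"]              # (batch, 1)
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))            # sigmoid

# Random stand-ins for encoded TCRs (dim 8) and epitopes (dim 6).
rng = np.random.default_rng(1)
tcr_vecs = rng.normal(size=(4, 8))
epi_vecs = rng.normal(size=(4, 6))
params = init_fcn(in_dim=8 + 6, hidden_dim=16)
scores = predict_binding(tcr_vecs, epi_vecs, params)
print(scores.shape)
```

With untrained weights the scores are meaningless; in practice the weights would be fitted against binding labels using a binary cross-entropy loss in a framework such as PyTorch.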

---