# Full-length TCR reconstruction
This notebook shows how to reconstruct full-length TCR sequences from V, J and CDR3 annotations.
When run on Google Colab, the notebook is assigned its own online environment, so nothing done here affects your local files.
You can still upload and download files through the file browser (folder icon on the left).

The reconstruction algorithm has been benchmarked on an internal dataset (RootPath) and a publicly available dataset (10x).
On the RootPath dataset, this method matched nearly 100%$^1$ of the RootPath reconstructions (900+ TCRA and TCRB sequences).
The 10x data was a biological sample of 10k TCRs (50/50 TCRA/TCRB). The reconstruction fidelity on this dataset was $>85\%$.
The remaining $<15\%$ was explained by biological and/or technical noise and by missing information on allelic differences (alleles are often not annotated).

1. Small differences were explained by a possible error and by differing assumptions between the unknown RootPath script and this method.
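
The percentages above can be read as the share of reference sequences that the method reproduces. The snippet below is a minimal sketch of that calculation, assuming that fidelity means an exact string match between reconstruction and reference; the function name `exact_match_fraction` and the toy sequences are hypothetical and not part of the repository.

```python
def exact_match_fraction(reconstructed, reference):
    """Return the fraction of reference sequences reproduced exactly.

    Pairs where the reference is missing (None) are skipped;
    a missing reconstruction counts as a mismatch.
    """
    pairs = [(rec, ref) for rec, ref in zip(reconstructed, reference) if ref is not None]
    if not pairs:
        return 0.0
    matches = sum(1 for rec, ref in pairs if rec == ref)
    return matches / len(pairs)


# Toy strings for illustration only, not real TCR sequences.
reference = ['AAAC', 'GGGT', 'CCCA', None]
reconstructed = ['AAAC', 'GGGA', None, 'TTTT']
fidelity = exact_match_fraction(reconstructed, reference)
print(f'{fidelity:.0%} of reference sequences matched')
```

Computing the fraction over reference sequences only keeps entries without a reference from inflating or deflating the score.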

## Requirements
- A .csv file that contains the following columns (can be obtained by exporting/saving an Excel file as .csv)
    - V and J annotation columns should have the following column names: `'TRAV', 'TRAJ', 'TRBV', 'TRBJ'`
    - CDR3a and CDR3b annotation columns should have the following column names: `'cdr3_alpha_aa', 'cdr3_beta_aa'`
- Two translation dictionaries
    - Multiple versions can be found in `/IMGT_versions`
    - These translate (ambiguous) annotations to the IMGT standardized format and provide the corresponding V and J sequences.
    - When no
    - One should contain the IMGT annotations as keys with the $AA$ (amino acid) sequences as values
    - The other should contain the IMGT annotations as keys with the $NT$ (nucleotide) sequences as values
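
Before running the notebook, it can help to check that a file actually has the column names listed above. The following is a minimal sketch using only the standard library; the helper `missing_columns` is hypothetical (not part of the repository), and the semicolon delimiter is an assumption that mirrors the `delimiter=';'` used when the dataset is loaded.

```python
import csv
import io

REQUIRED_COLUMNS = ['TRAV', 'TRAJ', 'TRBV', 'TRBJ', 'cdr3_alpha_aa', 'cdr3_beta_aa']


def missing_columns(csv_file, delimiter=';'):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file, delimiter=delimiter), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# In-memory stand-in for a real file; the row values are placeholders.
example = io.StringIO('TRAV;TRAJ;TRBV;cdr3_alpha_aa;cdr3_beta_aa\nv;j;v;cdr3a;cdr3b\n')
print(missing_columns(example))
```

For a file on disk, pass an open file handle instead, for example `missing_columns(open(path, newline=''))`.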

### First of all: run this cell to copy the required files from GitHub (only needed when running on Google Colab)

In [None]:
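# `!cd` runs in a subshell and does not change the notebook's working directory,
# so the repository is cloned into the current directory (/content on Google Colab).
# -f avoids an error when the folder does not exist yet (first run).
!rm -rf TCR_reconstruction/
!git clone https://github.com/bpkwee/TCR_reconstruction
!pip install pysam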

### Next, import all relevant functions
- Import the functions that were downloaded from GitHub
- Initialize the logging

In [None]:
# from logger.logger import init_logger
# from vdj_reconstruction_utils import reconstruct_full_tcr
# from vdj_reconstruction_utils import reconstruct_vdj

from TCR_reconstruction.logger.logger import init_logger
from TCR_reconstruction.vdj_reconstruction_utils import reconstruct_vdj
from TCR_reconstruction.vdj_reconstruction_utils import reconstruct_full_tcr
import pandas

logger = init_logger('TCR_reconstruction.log', level_msg='INFO')

### Specify the file path and load the dataset
If you upload your own file, click the three dots to the right of the file name (in Google Colab) and select 'copy file path'.
If your file is not loaded correctly, consider changing `delimiter=';'` to the correct delimiter.

Example:
```

saved_dataset_file_path = '/content/TCR_reconstruction/saved_data/example.csv'
translation_dict_aa = '/content/TCR_reconstruction/IMGT_versions/after_benchmark/functional_with_L-PART1+V-EXON_after_benchmark/vdj_translation_dict_aa_2021-12-05_18h_.json'
translation_dict_nt = '/content/TCR_reconstruction/IMGT_versions/after_benchmark/functional_with_L-PART1+V-EXON_after_benchmark/vdj_translation_dict_nt_2021-12-05_18h_.json'
```
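
If you are unsure which delimiter your file uses, the header line usually gives it away. Below is a minimal sketch of a simple heuristic that counts candidate delimiters in the header; `guess_delimiter` is a hypothetical helper (not part of the repository), and the candidate list is an assumption.

```python
def guess_delimiter(header_line, candidates=(';', ',', '\t')):
    """Return the candidate that occurs most often in the header line."""
    counts = {candidate: header_line.count(candidate) for candidate in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError('No candidate delimiter found in the header line')
    return best


# With a real file, read the first line instead:
# with open(saved_dataset_file_path) as handle:
#     header_line = handle.readline()
header_line = 'TRAV,TRAJ,TRBV,TRBD,TRBJ,cdr3_alpha_aa,cdr3_beta_aa'
delimiter = guess_delimiter(header_line)
print(repr(delimiter))
```

The result can then be passed on as `pandas.read_csv(saved_dataset_file_path, delimiter=delimiter)`. The heuristic can be fooled by headers that contain a candidate character inside a column name, so check the loaded table with `dataset.head()`.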

In [None]:
saved_dataset_file_path = '/content/TCR_reconstruction/saved_data/example.csv'
translation_dict_aa = '/content/TCR_reconstruction/IMGT_versions/after_benchmark/functional_with_L-PART1+V-EXON_after_benchmark/vdj_translation_dict_aa_2021-12-05_18h_.json'
translation_dict_nt = '/content/TCR_reconstruction/IMGT_versions/after_benchmark/functional_with_L-PART1+V-EXON_after_benchmark/vdj_translation_dict_nt_2021-12-05_18h_.json'
dataset = pandas.read_csv(saved_dataset_file_path, delimiter=';', low_memory=False)
dataset.head(10)

Reconstruct the V, D and J amino acid (aa) and nucleotide (nt) sequences from the TRAV, TRAJ, TRBV, TRBD and TRBJ annotations.
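
Conceptually, this step maps each annotation to an IMGT name and looks up the matching sequence in a translation dictionary. The snippet below is a minimal sketch of such a lookup, assuming a dictionary keyed by IMGT allele names and assuming that an annotation without an allele falls back to `*01`. The function `lookup_sequence` and the toy dictionary are hypothetical; this is not the logic of `reconstruct_vdj`, whose internals are not shown in this notebook.

```python
def lookup_sequence(annotation, translation_dict, default_allele='*01'):
    """Return (imgt_name, sequence) for an annotation, or (None, None)."""
    if annotation is None:
        return None, None
    name = annotation.strip()
    # Try the annotation as written, then with an assumed default allele.
    for candidate in (name, name + default_allele):
        if candidate in translation_dict:
            return candidate, translation_dict[candidate]
    return None, None


# Toy dictionary: the keys follow IMGT naming, the values are placeholders,
# not real germline sequences.
toy_dict_aa = {'TRBV7-9*01': 'PLACEHOLDER_V', 'TRBJ2-7*01': 'PLACEHOLDER_J'}

print(lookup_sequence('TRBV7-9', toy_dict_aa))
print(lookup_sequence('TRBV99', toy_dict_aa))
```

Returning `(None, None)` for unmatched annotations leaves a missing value in the result, which is what the match counts further down rely on.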


In [None]:
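for vdj in ['TRAV', 'TRAJ', 'TRBV', 'TRBD', 'TRBJ']:
    # Judging by the column names, reconstruct_vdj returns four columns per gene segment:
    # IMGT name (aa), aa sequence, IMGT name (nt), nt sequence
    (dataset[vdj + '_imgt_aa'],
     dataset[vdj + '_seq_aa'],
     dataset[vdj + '_imgt_nt'],
     dataset[vdj + '_seq_nt']) = reconstruct_vdj(dataset,
                                                 vdj,
                                                 translation_dict_nt,
                                                 translation_dict_aa)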
dataset.head(10)

Calculate how many annotations could be matched to an IMGT sequence:

In [None]:
total_len = len(dataset)
for nt_or_aa in ['aa', 'nt']:
    for vdj in ['TRAV', 'TRAJ', 'TRBV', 'TRBD', 'TRBJ']:
        # Number of rows for which a sequence was found
        count = dataset[vdj + '_seq_' + nt_or_aa].notna().sum()
        # Number of rows that had an annotation to begin with
        count_original = dataset[vdj].notna().sum()
        print(f'{vdj} imputed: {count} / {count_original} total annotations '
              f'({total_len}) ({nt_or_aa.upper()})')

### Reconstruct the full sequence for the beta and alpha TCR and calculate statistics
For clarity, the cell below keeps only the original columns and the reconstructed sequences.
If you want to keep all columns, comment out the following line:
`dataset = dataset[['full_seq_reconstruct_beta_aa', 'full_seq_reconstruct_alpha_aa', 'cdr3_alpha_aa', 'cdr3_beta_aa', 'TRAV', 'TRAJ', 'TRBV', 'TRBD', 'TRBJ']]`
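
How a V sequence, a CDR3 and a J sequence might be joined into one chain can be sketched as follows. This is a simplified illustration, assuming that the CDR3 starts with the conserved cysteine at the end of the V region and ends with the F or W of the `[FW]GxG` motif in the J region. The function `stitch_tcr` and the toy sequences are hypothetical; `reconstruct_full_tcr` may work differently (it also takes nucleotide sequences and an `include_leader` option).

```python
import re


def stitch_tcr(v_aa, cdr3_aa, j_aa):
    """Join V, CDR3 and J amino acid sequences into one chain (toy sketch)."""
    # Keep the V region up to, but not including, its last cysteine;
    # the CDR3 is assumed to start with that conserved cysteine.
    v_cut = v_aa.rfind('C')
    # The CDR3 is assumed to end with the F or W of the J-region [FW]GxG motif.
    motif = re.search(r'[FW]G.G', j_aa)
    if v_cut == -1 or motif is None:
        return None
    # Keep the J region from the residue after that F or W onwards.
    return v_aa[:v_cut] + cdr3_aa + j_aa[motif.start() + 1:]


# Toy strings for illustration only, not real germline sequences.
v_toy = 'MKTLLVSDSAVYFCASS'
cdr3_toy = 'CASSLGQAYEQYF'
j_toy = 'SYEQYFGPGTRLTVT'
full_toy = stitch_tcr(v_toy, cdr3_toy, j_toy)
print(full_toy)
```

Returning `None` when an anchor residue cannot be found keeps failed reconstructions visible as missing values instead of producing a misleading sequence.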

In [None]:
# beta
dataset['full_seq_reconstruct_beta_aa'] = reconstruct_full_tcr(dataset['TRBV_seq_nt'],
                                                               dataset['TRBV_seq_aa'],
                                                               dataset['TRBJ_seq_nt'],
                                                               dataset['TRBJ_seq_aa'],
                                                               dataset['cdr3_beta_aa'],
                                                               include_leader=False)
# alpha
dataset['full_seq_reconstruct_alpha_aa'] = reconstruct_full_tcr(dataset['TRAV_seq_nt'],
                                                                dataset['TRAV_seq_aa'],
                                                                dataset['TRAJ_seq_nt'],
                                                                dataset['TRAJ_seq_aa'],
                                                                dataset['cdr3_alpha_aa'],
                                                                include_leader=False)

dataset = dataset[['full_seq_reconstruct_beta_aa', 'full_seq_reconstruct_alpha_aa',
                   'cdr3_alpha_aa', 'cdr3_beta_aa',
                   'TRAV', 'TRAJ', 'TRBV', 'TRBD', 'TRBJ']]
dataset.head(10)

### Calculate statistics on the reconstruction:

In [None]:
print('Could reconstruct full BETA TCR for {0} entries of total {1} CDR3b entries'.format(
    sum(dataset['full_seq_reconstruct_beta_aa'].notna()),
    sum(dataset['cdr3_beta_aa'].notna())))

print('Could reconstruct full ALPHA TCR for {0} entries of total {1} CDR3a entries'.format(
    sum(dataset['full_seq_reconstruct_alpha_aa'].notna()),
    sum(dataset['cdr3_alpha_aa'].notna())))

### Lastly, save the output
To download the output, use the three dots beside the file in the file browser.

In [None]:
dataset.to_csv('reconstructed_tcrs.csv')
