# Data
Before getting started, let's take a look at the provided data.

## Provided datasets

### GILGFVFTL_data
This dataset contains positive examples: TCR sequences that react with the GILGFVFTL epitope. Each row holds the alpha part, the beta part, or both.

In [1]:
# Load data/GILGFVFTL_data.tsv in a dataframe
import pandas as pd
# Despite the .tsv extension, this file is comma-separated, so the default separator is used (no sep='\t')
df_g = pd.read_csv('data/GILGFVFTL_data.tsv')
df_g.head()

Unnamed: 0.1,Unnamed: 0,GeneA,CDR3_alfa,TRAV,TRAJ,MHC A_alfa,Epitope,Score_alfa,GeneB,CDR3_beta,TRBV,TRBJ,MHC A_beta,Epitope.1,Score_beta
0,0,TRA,CAGAGSQGNLIF,TRAV27*01,TRAJ42*01,HLA-A*02:01:48,GILGFVFTL,3.0,TRB,CASSSRSSYEQYF,TRBV19*01,TRBJ2-7*01,HLA-A*02:01:48,GILGFVFTL,3.0
1,1,TRA,CAGAGSQGNLIF,TRAV27*01,TRAJ42*01,HLA-A*02:01:48,GILGFVFTL,3.0,TRB,CASSSRSSYEQYF,TRBV19*01,TRBJ2-7*01,HLA-A*02:01:48,GILGFVFTL,3.0
2,2,TRA,CAGAGSQGNLIF,TRAV27*01,TRAJ42*01,HLA-A*02:01:48,GILGFVFTL,3.0,TRB,CASSSRSSYEQYF,TRBV19*01,TRBJ2-7*01,HLA-A*02:01:48,GILGFVFTL,3.0
3,3,TRA,CAGAGSQGNLIF,TRAV27*01,TRAJ42*01,HLA-A*02:01:48,GILGFVFTL,3.0,TRB,CASSSRASYEQYF,TRBV19*01,TRBJ2-7*01,HLA-A*02:01:48,GILGFVFTL,3.0
4,4,TRA,CAGPGGSSNTGKLIF,TRAV35*01,TRAJ37*01,HLA-A*02:01:48,GILGFVFTL,3.0,TRB,CASSLIYPGELFF,TRBV27*01,TRBJ2-2*01,HLA-A*02:01:48,GILGFVFTL,3.0
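
Because the file extension did not match the actual separator here, it may be safer to detect the delimiter instead of trusting the extension. The following is a minimal sketch using `csv.Sniffer` from the standard library. `sniff_delimiter` is a hypothetical helper, and the in-memory sample (three columns abridged from the preview above) stands in for the first lines of the real file.

```python
import csv
import io

import pandas as pd


def sniff_delimiter(sample: str, candidates: str = ",\t;") -> str:
    """Guess the field delimiter of a text sample (hypothetical helper)."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


# Stand-in for the first lines of a file; abridged from the preview above.
sample_text = (
    "GeneA,CDR3_alfa,TRAV\n"
    "TRA,CAGAGSQGNLIF,TRAV27*01\n"
    "TRA,CAGPGGSSNTGKLIF,TRAV35*01\n"
)

delimiter = sniff_delimiter(sample_text)
df_sample = pd.read_csv(io.StringIO(sample_text), sep=delimiter)
print(repr(delimiter), df_sample.shape)
```

For a real file, the sample could be the first few kilobytes read with `open(path).read(4096)`. The sniffer is a heuristic, so the result is worth a quick look before relying on it.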


In [2]:
# Get per column the number of different values
df_g.nunique()

Unnamed: 0    5192
GeneA            1
CDR3_alfa     1093
TRAV            44
TRAJ            50
MHC A_alfa       3
Epitope          1
Score_alfa       3
GeneB            1
CDR3_beta     3473
TRBV            51
TRBJ            13
MHC A_beta       3
Epitope.1        1
Score_beta       4
dtype: int64

In [3]:
# Number of rows that contain an alpha part (column GeneA == 'TRA')
alpha_count_g = df_g[df_g['GeneA'] == 'TRA'].shape[0]
f"Number of genes containing alpha: {alpha_count_g} ({alpha_count_g/df_g.shape[0]*100:.2f}%)"

'Number of genes containing alpha: 2161 (41.62%)'

In [4]:
beta_count_g = df_g[df_g['GeneB'] == 'TRB'].shape[0]
f"Number of genes containing beta: {beta_count_g} ({beta_count_g/df_g.shape[0]*100:.2f}%)"

'Number of genes containing beta: 5190 (99.96%)'

In [5]:
# Columns to ignore (for now?): MHC A_alfa and MHC A_beta (only 3 distinct values each), Score_alfa and Score_beta (confidence scores of alfa and beta) -> drop those columns
df_g = df_g.drop(columns=['MHC A_alfa', 'Score_alfa', 'MHC A_beta', 'Score_beta'])
df_g.head()

Unnamed: 0.1,Unnamed: 0,GeneA,CDR3_alfa,TRAV,TRAJ,Epitope,GeneB,CDR3_beta,TRBV,TRBJ,Epitope.1
0,0,TRA,CAGAGSQGNLIF,TRAV27*01,TRAJ42*01,GILGFVFTL,TRB,CASSSRSSYEQYF,TRBV19*01,TRBJ2-7*01,GILGFVFTL
1,1,TRA,CAGAGSQGNLIF,TRAV27*01,TRAJ42*01,GILGFVFTL,TRB,CASSSRSSYEQYF,TRBV19*01,TRBJ2-7*01,GILGFVFTL
2,2,TRA,CAGAGSQGNLIF,TRAV27*01,TRAJ42*01,GILGFVFTL,TRB,CASSSRSSYEQYF,TRBV19*01,TRBJ2-7*01,GILGFVFTL
3,3,TRA,CAGAGSQGNLIF,TRAV27*01,TRAJ42*01,GILGFVFTL,TRB,CASSSRASYEQYF,TRBV19*01,TRBJ2-7*01,GILGFVFTL
4,4,TRA,CAGPGGSSNTGKLIF,TRAV35*01,TRAJ37*01,GILGFVFTL,TRB,CASSLIYPGELFF,TRBV27*01,TRBJ2-2*01,GILGFVFTL


### background
This dataset contains alpha and beta parts from TCR cells, without any epitope. They can be used as a negative dataset after filtering out the alphas and betas that also occur in the positive dataset.

In [6]:
import pandas as pd
df_b = pd.read_csv('data/background.tsv', sep='\t') # this is actually a tsv
df_b.head()

Unnamed: 0,CDR3_alfa,TRAV,TRAJ,CDR3_beta,TRBV,TRBJ
0,CAYGSTYNTDKLIF,TRAV38-2DV8,TRAJ34,CASSQEGSGVTDTQYF,TRBV4-3,TRBJ2-3
1,CILRDEGGGADGLTF,TRAV26-2,TRAJ45,CSAREGLAEFNEQFF,TRBV20-1,TRBJ2-1
2,CAGGSGYSTLTF,TRAV12-2,TRAJ11,CASSLGHYGYTF,TRBV12-3,TRBJ1-2
3,CAVRDLIVGANNLFF,TRAV3,TRAJ36,CASSQSFRDDEQYF,TRBV18,TRBJ2-7
4,CATYGGSQGNLIF,TRAV21,TRAJ42,CASSQAVGYNEQFF,TRBV4-1,TRBJ2-1


In [7]:
df_b.nunique()

CDR3_alfa    429396
TRAV             56
TRAJ             56
CDR3_beta    474559
TRBV             61
TRBJ             13
dtype: int64

## Combined Dataset

In [8]:
positive_dataset = df_g
negative_dataset = df_b

In [9]:
# Drop the rows from the negative dataset whose CDR3_alfa or CDR3_beta occurs in the positive dataset
# Note: check separately or together? (e.g. the alpha occurs in the positive dataset, but the alpha + beta combination does not)
negative_dataset = negative_dataset[~negative_dataset['CDR3_alfa'].isin(positive_dataset['CDR3_alfa'])]
negative_dataset = negative_dataset[~negative_dataset['CDR3_beta'].isin(positive_dataset['CDR3_beta'])]
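
The open question in the note above (separately or together?) can be made concrete on a toy example. The sketch below compares the per-column filter used here with a pair-wise filter that only drops a background row when the exact alpha + beta combination is a known positive. The sequences are made up for illustration, and `separate`, `pairs` and `joint` are names introduced only for this sketch.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the real files).
positive = pd.DataFrame({
    "CDR3_alfa": ["AAA", "BBB"],
    "CDR3_beta": ["XXX", "YYY"],
})
background = pd.DataFrame({
    "CDR3_alfa": ["AAA", "AAA", "CCC"],
    "CDR3_beta": ["XXX", "ZZZ", "YYY"],
})

# Separate check: drop a row if either chain was seen in any positive row.
separate = background[
    ~background["CDR3_alfa"].isin(positive["CDR3_alfa"])
    & ~background["CDR3_beta"].isin(positive["CDR3_beta"])
]

# Joint check: drop a row only if the exact (alpha, beta) pair is a positive.
pairs = positive[["CDR3_alfa", "CDR3_beta"]].drop_duplicates()
merged = background.merge(
    pairs, on=["CDR3_alfa", "CDR3_beta"], how="left", indicator=True
)
joint = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(len(separate), len(joint))
```

The separate check is the stricter one: it keeps a chain that is known to react out of the negatives entirely, at the cost of discarding more background rows. Which behaviour is preferable depends on whether a single chain is assumed to be enough for a reaction.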

In [10]:
# get the percentages of df_g containing alpha, beta and both
alpha_only_count_g = df_g[(df_g['GeneA'] == 'TRA') & (df_g['GeneB'] != 'TRB')].shape[0]
beta_only_count_g = df_g[(df_g['GeneA'] != 'TRA') & (df_g['GeneB'] == 'TRB')].shape[0]
both_count_g = df_g[(df_g['GeneA'] == 'TRA') & (df_g['GeneB'] == 'TRB')].shape[0]
non_count_g = df_g[(df_g['GeneA'] != 'TRA') & (df_g['GeneB'] != 'TRB')].shape[0]
print(f"Number of genes containing alpha only: {alpha_only_count_g} ({alpha_only_count_g/df_g.shape[0]*100:.2f}%)")
print(f"Number of genes containing beta only: {beta_only_count_g} ({beta_only_count_g/df_g.shape[0]*100:.2f}%)")
print(f"Number of genes containing both: {both_count_g} ({both_count_g/df_g.shape[0]*100:.2f}%)")
print(f"Number of genes containing neither: {non_count_g} ({non_count_g/df_g.shape[0]*100:.2f}%)")

Number of genes containing alpha only: 2 (0.04%)
Number of genes containing beta only: 3031 (58.38%)
Number of genes containing both: 2159 (41.58%)
Number of genes containing neither: 0 (0.00%)
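
The same four-way breakdown can also be read from a single cross-tabulation. This is a minimal sketch that assumes a missing part shows up as NaN in the CDR3 column (instead of checking `GeneA` / `GeneB`). The toy frame reuses sequences from the preview above, but the missing values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_g; which values are missing is made up for illustration.
toy = pd.DataFrame({
    "CDR3_alfa": ["CAGAGSQGNLIF", np.nan, "CAGPGGSSNTGKLIF", np.nan],
    "CDR3_beta": ["CASSSRSSYEQYF", "CASSLIYPGELFF", np.nan, "CASSSRASYEQYF"],
})

has_alpha = toy["CDR3_alfa"].notna().rename("has_alpha")
has_beta = toy["CDR3_beta"].notna().rename("has_beta")

# One 2x2 table replaces the four separate boolean filters.
presence = pd.crosstab(has_alpha, has_beta)
print(presence)
```

Passing `normalize=True` to `pd.crosstab` would give the proportions directly.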


In [11]:
def sample_and_drop(df, n):
    """Sample n rows from df and drop them from df"""
    # src: https://stackoverflow.com/questions/39835021/pandas-random-sample-with-remove
    df_subset = df.sample(n)
    df.drop(df_subset.index, inplace=True)
    return df_subset

In [12]:
# Test the sample and drop function
df_ex = pd.DataFrame({'a': [1,2,3,4,5,6,7,8,9,10], 'b': [1,2,3,4,5,6,7,8,9,10]})
# get the number of rows in df_ex
rows_before = df_ex.shape[0]
n = 2
df_subset = sample_and_drop(df_ex, n)
rows_after = df_ex.shape[0]
assert rows_before - rows_after == n
df_subset

Unnamed: 0,a,b
4,5,5
0,1,1
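
`sample_and_drop` modifies its argument in place and draws a different sample on every run. If reproducibility matters, a non-mutating variant with a seed is one option. The sketch below is such a variant, not the function used in the rest of this notebook. `sample_and_split` and its `seed` parameter are hypothetical names, and it assumes the index of `df` is unique.

```python
import pandas as pd


def sample_and_split(df, n, seed=None):
    """Return (sample, remainder) without modifying df (hypothetical variant)."""
    sample = df.sample(n=n, random_state=seed)
    # Dropping by index label only works as intended when the index is unique.
    remainder = df.drop(index=sample.index)
    return sample, remainder


df_ex = pd.DataFrame({"a": range(1, 11), "b": range(1, 11)})
sample_1, rest_1 = sample_and_split(df_ex, 2, seed=0)
sample_2, rest_2 = sample_and_split(df_ex, 2, seed=0)

assert len(df_ex) == 10           # the input is left untouched
assert sample_1.equals(sample_2)  # the same seed gives the same sample
assert len(rest_1) == 8
```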


In [13]:
import numpy as np

# now sample the same amount of negative examples from the negative dataset
alpha_only_rows_b = sample_and_drop(negative_dataset, alpha_only_count_g)
# Set CDR3_beta, TRBV and TRBJ to NaN
alpha_only_rows_b['CDR3_beta'] = np.nan
alpha_only_rows_b['TRBV'] = np.nan
alpha_only_rows_b['TRBJ'] = np.nan

# Is this the
beta_only_rows_b = sample_and_drop(negative_dataset, beta_only_count_g)
# Set CDR3_alfa, TRAV and TRAJ to NaN
beta_only_rows_b['CDR3_alfa'] = np.nan
beta_only_rows_b['TRAV'] = np.nan
beta_only_rows_b['TRAJ'] = np.nan

both_rows_b = sample_and_drop(negative_dataset, both_count_g)

non_rows_b = sample_and_drop(negative_dataset, non_count_g)
# Set CDR3_alfa, TRAV, TRAJ, CDR3_beta, TRBV and TRBJ to NaN
non_rows_b['CDR3_alfa'] = np.nan
non_rows_b['TRAV'] = np.nan
non_rows_b['TRAJ'] = np.nan
non_rows_b['CDR3_beta'] = np.nan
non_rows_b['TRBV'] = np.nan
non_rows_b['TRBJ'] = np.nan
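
The repeated column assignments above could be folded into a small helper. This is a sketch only: `blank_columns`, `ALPHA_COLS` and `BETA_COLS` are names introduced here, and the two example rows are copied from the background preview.

```python
import numpy as np
import pandas as pd

ALPHA_COLS = ["CDR3_alfa", "TRAV", "TRAJ"]
BETA_COLS = ["CDR3_beta", "TRBV", "TRBJ"]


def blank_columns(df, columns):
    """Return a copy of df with the given columns set to NaN (hypothetical helper)."""
    out = df.copy()
    out[columns] = np.nan
    return out


# First two rows of the background preview.
rows = pd.DataFrame({
    "CDR3_alfa": ["CAYGSTYNTDKLIF", "CILRDEGGGADGLTF"],
    "TRAV": ["TRAV38-2DV8", "TRAV26-2"],
    "TRAJ": ["TRAJ34", "TRAJ45"],
    "CDR3_beta": ["CASSQEGSGVTDTQYF", "CSAREGLAEFNEQFF"],
    "TRBV": ["TRBV4-3", "TRBV20-1"],
    "TRBJ": ["TRBJ2-3", "TRBJ2-1"],
})

beta_only = blank_columns(rows, ALPHA_COLS)   # keep beta, hide alpha
alpha_only = blank_columns(rows, BETA_COLS)   # keep alpha, hide beta
print(beta_only)
```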

In [14]:
# combine samples in new dataframe
negative_dataset_same_proportion = pd.concat([alpha_only_rows_b, beta_only_rows_b, both_rows_b, non_rows_b])
# Check whether number of rows is same as in positive dataset
assert negative_dataset_same_proportion.shape[0] == positive_dataset.shape[0]

In [15]:
# count the number of duplicates in the positive dataset
positive_dataset_duplicates = positive_dataset[positive_dataset.duplicated()]
print(f"Number of duplicates in positive dataset: {positive_dataset_duplicates.shape[0]}")

# count the number of duplicates in the negative dataset
negative_dataset_duplicates = negative_dataset_same_proportion[negative_dataset_same_proportion.duplicated()]
print(f"Number of duplicates in negative dataset: {negative_dataset_duplicates.shape[0]}")

# Count the number of duplicates in the combined dataset
combined_dataset_duplicates = pd.concat([positive_dataset, negative_dataset_same_proportion])
combined_dataset_duplicates = combined_dataset_duplicates[combined_dataset_duplicates.duplicated()]
print(f"Number of duplicates in combined dataset: {combined_dataset_duplicates.shape[0]}")

Number of duplicates in positive dataset: 0
Number of duplicates in negative dataset: 0
Number of duplicates in combined dataset: 0
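
A caveat on the zero counts above: at this point the positive dataset still carries the running `Unnamed: 0` column (5192 distinct values for 5192 rows), so no two rows can compare equal, even though rows 0 to 2 of the first preview look identical otherwise. The sketch below shows the effect on three rows modelled on rows 0, 1 and 3 of that preview, restricted to a subset of the columns. `sequence_cols` is a name introduced here.

```python
import pandas as pd

# Modelled on rows 0, 1 and 3 of the first preview (subset of columns).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 3],
    "CDR3_alfa": ["CAGAGSQGNLIF", "CAGAGSQGNLIF", "CAGAGSQGNLIF"],
    "CDR3_beta": ["CASSSRSSYEQYF", "CASSSRSSYEQYF", "CASSSRASYEQYF"],
})

# With the running index column included, every row looks unique.
all_columns_dupes = int(toy.duplicated().sum())

# Restricting the comparison to the sequence columns exposes the repeat.
sequence_cols = ["CDR3_alfa", "CDR3_beta"]
sequence_dupes = int(toy.duplicated(subset=sequence_cols).sum())

print(all_columns_dupes, sequence_dupes)
```

Whether such repeats should be removed is a modelling decision. The sketch only shows how to count them.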


In [15]:
# Add column 'reaction' 0 for negative_dataset_same_proportion, 1 for positive_dataset
positive_dataset['reaction'] = 1
negative_dataset_same_proportion['reaction'] = 0

In [16]:
# Keep only the columns of the positive dataset that are in the negative datasets
positive_dataset = positive_dataset[negative_dataset_same_proportion.columns]

In [17]:
# Combine the two datasets
combined_dataset = pd.concat([positive_dataset, negative_dataset_same_proportion])
# shuffle combined_dataset
combined_dataset = combined_dataset.sample(frac=1).reset_index(drop=True)
combined_dataset.head()

Unnamed: 0,CDR3_alfa,TRAV,TRAJ,CDR3_beta,TRBV,TRBJ,reaction
0,,,,CASSQLETYEQYF,TRBV28,TRBJ2-7,0
1,,,,CSVPLRWEQYF,TRBV29-1,TRBJ2-7,0
2,,,,CASSPTGVYNSPLHF,TRBV6-5*01,TRBJ1-6*01,1
3,CAGGGSQGNLIF,TRAV27*01,TRAJ42*01,CASSIRSAYEQYF,TRBV19*01,TRBJ2-7*01,1
4,,,,CASSIRSGPEAFF,TRBV19*01,TRBJ1-1*01,1


In [18]:
# Create a train and test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(combined_dataset, test_size=0.2, random_state=42)
# Save the train and test splits
test.to_csv('data/generated_combined_dataset_test.csv', index=False)
train.to_csv('data/generated_combined_dataset_train.csv', index=False)

In [19]:
train.head()

Unnamed: 0,CDR3_alfa,TRAV,TRAJ,CDR3_beta,TRBV,TRBJ,reaction
2310,,,,CASGNIAGGVNTGELFF,TRBV6-2,TRBJ2-2,0
5620,,,,CASSIGGWNEQYF,TRBV19*01,TRBJ2-7*01,1
9674,,,,CASSFPTIPYEQYF,TRBV13,TRBJ2-7,0
6547,CAGPSEPGDSNYQLIW,TRAV35*01,TRAJ33*01,CASSGLSNQPQHF,TRBV19*01,TRBJ1-5*01,1
8240,,,,CASSQERQTILEAFF,TRBV4-1*01,TRBJ1-1*01,1
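
Since the same sequence can appear in more than one row, a quick check after splitting is whether test rows share a sequence with the training set, and whether both classes are still represented. A minimal sketch on made-up frames (`train_toy` and `test_toy` stand in for `train` and `test`):

```python
import pandas as pd

# Toy stand-ins for the train and test frames; the sequences are made up.
train_toy = pd.DataFrame({
    "CDR3_beta": ["AAA", "BBB", "CCC"],
    "reaction": [1, 0, 1],
})
test_toy = pd.DataFrame({
    "CDR3_beta": ["AAA", "DDD"],
    "reaction": [1, 0],
})

# Test rows whose beta sequence also occurs somewhere in the training set.
shared = test_toy["CDR3_beta"].isin(train_toy["CDR3_beta"])
print(f"{shared.sum()} of {len(test_toy)} test rows share a CDR3_beta with train")

# Class balance of the test split.
print(test_toy["reaction"].value_counts(normalize=True))
```

If the overlap turns out to be large on the real data, splitting by unique sequence instead of by row would be one way to reduce it.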


# Summary
I was given two datasets. One contained 5192 positive samples (41.62% containing the alpha part, 99.96% containing the beta part) of TCR cells that react with the GILGFVFTL epitope (Influenza A virus).
The other dataset (20788 samples) could be used as a negative dataset, after filtering out the alphas and betas that occurred in the positive dataset and setting part of the alphas and betas to NaN to get the same proportions as in the positive dataset. The two sets were then combined into one large dataframe, which was split into a training file and a testing file.