**Immune2vec: Embedding B/T Cell Receptor Sequences Using Natural Language Processing**

In NLP, an “embedding” is a representation of symbolic units of text (words, phrases, or even whole sentences) as vectors of real numbers.
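
As a minimal illustration of this idea, the sketch below maps a few tokens to real-valued vectors and averages them into one vector for the whole sequence. The tokens, the numbers, and the helper name `embed_tokens` are made up for illustration; a trained model such as word2vec would learn the vectors from a corpus instead.

```python
# Toy embedding table: each token maps to a 3-dimensional real vector.
# Tokens and values are invented for illustration only.
toy_embedding = {
    "CAS": [0.2, -0.1, 0.7],
    "ASS": [0.0, 0.3, 0.5],
    "SSF": [-0.4, 0.9, 0.1],
}


def embed_tokens(tokens, table):
    """Return the element-wise mean of the vectors looked up for `tokens`."""
    vectors = [table[t] for t in tokens]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# One vector for the whole token sequence
sentence_vector = embed_tokens(["CAS", "ASS", "SSF"], toy_embedding)
print(sentence_vector)
```

Averaging token vectors is one simple way to move from word-level to sentence-level embeddings; it ignores token order.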

In [1]:
import pandas as pd
import numpy as np

• Data set

The first step in creating the model is building an adequate corpus for word2vec training. Here the corpus is restricted to VDJdb records with a score above 0, alpha and beta CDR3 sequences of length 12 to 14, and species HomoSapiens.

In [20]:
file_path = "vdjdb_full.txt"
# The file is tab-separated
df = pd.read_csv(file_path, sep='\t')
df = df.drop_duplicates()
# Keep only records whose vdjdb.score is above 0
df = df[df['vdjdb.score'] > 0]
# Keep only the columns used below
df = df[['cdr3.alpha','cdr3.beta','species','antigen.epitope','antigen.gene','vdjdb.score']]
print(df.shape)

(9300, 6)


In [21]:
# Length of each alpha-chain CDR3; missing (NaN) values count as length 0
df['cdr3.alpha.length'] = df['cdr3.alpha'].apply(lambda x: len(x) if isinstance(x, str) else 0)
df = df[df['cdr3.alpha.length'].between(12, 14)]

# Same length filter for the beta chain
df['cdr3.beta.length'] = df['cdr3.beta'].apply(lambda x: len(x) if isinstance(x, str) else 0)
df = df[df['cdr3.beta.length'].between(12, 14)]

# Keep human records only
df = df[df['species'] == 'HomoSapiens']

# Concatenate the alpha and beta CDR3 into one string per record
df['cdr3combined'] = df['cdr3.alpha'].fillna('') + df['cdr3.beta'].fillna('')

df = df.reset_index(drop=True)

In [22]:
print(df.iloc[0])
print(df.shape)

cdr3.alpha                         CAYRPPGTYKYIF
cdr3.beta                         CASSALASLNEQFF
species                              HomoSapiens
antigen.epitope                         FLKEKGGL
antigen.gene                                 Nef
vdjdb.score                                    2
cdr3.alpha.length                             13
cdr3.beta.length                              14
cdr3combined         CAYRPPGTYKYIFCASSALASLNEQFF
Name: 0, dtype: object
(555, 9)
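
The row-level rule applied by the filter above (both chains of length 12 to 14, species HomoSapiens) can be written as a plain function and checked on individual values. This is only a sketch: the helper names `cdr3_length` and `passes_filter` are hypothetical and do not appear in the notebook.

```python
def cdr3_length(value):
    """Length of a CDR3 string; missing or non-string values count as 0."""
    return len(value) if isinstance(value, str) else 0


def passes_filter(alpha, beta, species, lo=12, hi=14):
    """True if both chains have length in [lo, hi] and the species is human."""
    return (
        lo <= cdr3_length(alpha) <= hi
        and lo <= cdr3_length(beta) <= hi
        and species == "HomoSapiens"
    )


# First row of the filtered table shown above
print(passes_filter("CAYRPPGTYKYIF", "CASSALASLNEQFF", "HomoSapiens"))  # True

# A record with a missing alpha chain is rejected
print(passes_filter(float("nan"), "CASSALASLNEQFF", "HomoSapiens"))  # False
```

Because a missing chain has length 0, only records with both an alpha and a beta CDR3 survive the filter.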


**Embed CDR3 sequences with word2vec and classify epitopes**

In [19]:
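from gensim.models import Word2Vec
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Tokenize each combined sequence. The strings contain no whitespace, so
# split() returns a one-element list: the whole sequence is a single "word".
df['cdr3combined'] = df['cdr3combined'].apply(lambda x: x.split())

# Train word2vec on the corpus; min_count=1 keeps tokens that occur only once
model = Word2Vec(df['cdr3combined'], min_count=1)

# Represent each record as the mean of its token vectors
df['cdr3combined_vec'] = df['cdr3combined'].apply(lambda x: np.mean([model.wv[word] for word in x], axis=0))

# Encode epitope labels as integers and hold out 20% of the records for testing
le = LabelEncoder()
labels = le.fit_transform(df['antigen.epitope'])
X_train, X_test, y_train, y_test = train_test_split(df['cdr3combined_vec'].tolist(), labels, test_size=0.2, random_state=42)

# Fit a support vector classifier with default settings
clf = svm.SVC()
clf.fit(X_train, y_train)

print("Train accuracy:", clf.score(X_train, y_train))
print("Test accuracy:", clf.score(X_test, y_test))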

Train accuracy: 0.527027027027027
Test accuracy: 0.3333333333333333
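
Because the combined CDR3 strings contain no whitespace, `x.split()` in the cell above leaves each sequence as a single token, so word2vec is trained on one-word sentences with no surrounding context. A common alternative for biological sequences is to split each one into overlapping k-mers. The sketch below assumes k = 3 with a stride of 1; the helper name `kmers` is hypothetical, and this is not necessarily the tokenization used by Immune2vec itself.

```python
def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (stride 1).

    Sequences shorter than k are returned as a single token;
    an empty sequence yields an empty list.
    """
    if len(sequence) < k:
        return [sequence] if sequence else []
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Alpha-chain CDR3 from the first row shown earlier
tokens = kmers("CAYRPPGTYKYIF")
print(tokens)
```

Token lists built this way could be passed to `Word2Vec` in place of the single-token lists. The accuracies printed above were obtained without this change.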
