# Classifying Genetic Mutations



1. Packages

   1. Downloading Sent2Vec and Installing

2. Downloading data

3. Preprocessing data

4. Primary data analysis

5. Data featurization

   1. Text Feature

      - Traditional NLP: TF-IDF 

      - Using pre-trained model
        - Downloading BioSent2Vec
        - Loading BioSent2Vec
        - Applying BioSent2Vec on TEXT
   2. Categorical Features
      - Response encoding
      - One-hot encoding

6. Secondary data analysis

   1. Visualizing high dimensional text features by t-SNE
   2. Visualizing Gene & Variation features by t-SNE

7. ML model building

   1. Linear models
      1. SVM 
         1. RBF Kernel
         2. Linear kernel
      2. Logistic Regression (SGD)
   2. Tree-based models
      1. Decision tree

8. Conclusion
   

<a name="cell-id1"></a>
# Packages

## Downloading Sent2Vec and Installing

In [1]:
! curl 'https://codeload.github.com/epfml/sent2vec/zip/refs/heads/master' --output 'sent2vec-master.zip'
! unzip sent2vec-master && cd sent2vec-master && make && pip install .

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  355k    0  355k    0     0   797k      0 --:--:-- --:--:-- --:--:--  797k
Archive:  sent2vec-master.zip
f00a1b67f4330e5be99e7cc31ac28df94deed9ac
   creating: sent2vec-master/
  inflating: sent2vec-master/.gitignore  
  inflating: sent2vec-master/Dockerfile  
  inflating: sent2vec-master/LICENSE  
  inflating: sent2vec-master/Makefile  
  inflating: sent2vec-master/README.md  
  inflating: sent2vec-master/get_sentence_embeddings_from_pre-trained_models.ipynb  
  inflating: sent2vec-master/paper-sent2vec.pdf  
 extracting: sent2vec-master/requirements.txt  
  inflating: sent2vec-master/setup.py  
   creating: sent2vec-master/src/
  inflating: sent2vec-master/src/args.cc  
  inflating: sent2vec-master/src/args.h  
  inflating: sent2vec-master/src/asvoid.h  
  inflating: sent2vec-master/src/dictionary.cc  
  inflating: sent2vec-m

## Libraries

In [2]:
import pandas as pd
import numpy as np

import re
from tqdm import tqdm
from datetime import datetime

import sent2vec

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

from sklearn.preprocessing import OneHotEncoder


import warnings
warnings.filterwarnings("ignore")

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Downloading data

In [3]:
# Train Texts
! curl --header 'Host: storage.googleapis.com' --user-agent 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' --header 'Accept-Language: en-US,en;q=0.5' --referer 'https://www.kaggle.com/' --header 'DNT: 1' --header 'Upgrade-Insecure-Requests: 1' --header 'Sec-GPC: 1' 'https://storage.googleapis.com/kagglesdsdata/competitions/6841/44307/training_text.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1616748114&Signature=PCcLSRN0v8r2loPEYY0WdqNPem%2FsZHwTMCEGUtJrTLuW0LvOZ5lt6kl7nB%2FMKpqe67FN1%2BPDuZvC%2B8QWg6wIXLWEmz0f0UE0x5yYW5suq7AaRzedE8A6HZuDRj9mmhb1lYhksZL04ZAB2VYADur90n%2Fyhpsgt6aQ5VuEKWGHQuYH7k4X7NDkTPmfC0cTp4qujfz5I%2BagifBdFnZj%2FQMzMH2onbMJi4%2FAY806%2FDsEPQ5YKZZ4bNqOyjTNrEDWcdKu1TVQSlmrTnwOykSNK1s7viy%2Bhnrx2trZa8LvEMiP2GxDXsnFcK7IdhPJWevk5UX9m9d3vXi%2BVsJLoJuTiizezw%3D%3D&response-content-disposition=attachment%3B+filename%3Dtraining_text.zip' --output 'training_text.zip' && unzip training_text.zip

# Train Variants
! curl --header 'Host: storage.googleapis.com' --user-agent 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' --header 'Accept-Language: en-US,en;q=0.5' --referer 'https://www.kaggle.com/' --header 'DNT: 1' --header 'Upgrade-Insecure-Requests: 1' --header 'Sec-GPC: 1' 'https://storage.googleapis.com/kagglesdsdata/competitions/6841/44307/training_variants.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1616748147&Signature=HYq8rHWpYzz%2B2hIPZV6XWullEV0yQ3oepX6SicRqb5EU6MFOwznQLKprpTK2M2yiS3yvzVz%2Bzft0tlkQBFgF33JTwgSv3aZubyUvowDUrWHqYho%2FYnzWl8sJXl0x9nIFon4hU8rdg9JMbve9nyVvz%2F28fYUaapk73c7ZWC5SYOmPzSwgEKnxDsW8Vn2G4U6fe34i4JW1FveXebIA9wtYBG6%2F%2FC8Soh46d5qTI%2B6LHIo%2BmaIXmnL7jmpNpuL3wFmbN6tMTUG1f7yPOhRKreglmXUVFjDqqnkWRmvfwAHdzr%2FdpgKw4Ahu2EivDehaewTy%2B%2BTyOKDLutNgxj2nfbWfVA%3D%3D&response-content-disposition=attachment%3B+filename%3Dtraining_variants.zip' --output 'training_variants.zip' && unzip training_variants.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 60.9M  100 60.9M    0     0  62.2M      0 --:--:-- --:--:-- --:--:-- 62.2M
Archive:  training_text.zip
  inflating: training_text           
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24831  100 24831    0     0   137k      0 --:--:-- --:--:-- --:--:--  137k
Archive:  training_variants.zip
  inflating: training_variants       


In [4]:
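# The shell expands the unquoted *.zip before ls runs, so ls receives the archive
# names as arguments and lists them; quote the pattern (ls -I '*.zip') to hide them.
! ls -I *.zip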

training_text.zip  training_variants.zip


<a name="cell-id3"></a>
# Data Preprocessing

In [6]:
variants_data = pd.read_csv('/content/training_variants', index_col=False)
variants_data.head()

|   | ID | Gene | Variation | Class |
|---|----|------|-----------|-------|
| 0 | 0 | FAM58A | Truncating Mutations | 1 |
| 1 | 1 | CBL | W802* | 2 |
| 2 | 2 | CBL | Q249E | 2 |
| 3 | 3 | CBL | N454D | 3 |
| 4 | 4 | CBL | L399V | 4 |


In [7]:
texts_data = pd.read_csv('/content/training_text', 
                         sep=r"\|\|", 
                         engine="python", 
                         names=["ID","TEXT"], 
                         skiprows=1)

texts_data.head()

|   | ID | TEXT |
|---|----|------|
| 0 | 0 | Cyclin-dependent kinases (CDKs) regulate a var... |
| 1 | 1 | Abstract Background Non-small cell lung canc... |
| 2 | 2 | Abstract Background Non-small cell lung canc... |
| 3 | 3 | Recent evidence has demonstrated that acquired... |
| 4 | 4 | Oncogenic mutations in the monomeric Casitas B... |
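
The `read_csv` call above relies on a regular-expression separator, which is why it needs the Python parsing engine. A minimal sketch of the same idea on an in-memory sample follows; the two records and the header line are made up for illustration, on the assumption that the real file starts with a header that does not use the `||` delimiter.

```python
import io
import pandas as pd

# Hypothetical two-record sample: a header line, then one "ID||TEXT" record per line.
sample = io.StringIO(
    "ID,Text\n"
    "0||First example document, with a comma in it.\n"
    "1||Second example document.\n"
)

texts = pd.read_csv(
    sample,
    sep=r"\|\|",           # regex matching a literal "||"
    engine="python",       # regex separators require the Python engine
    names=["ID", "TEXT"],  # supply the column names explicitly
    skiprows=1,            # skip the header line, which is not "||"-delimited
)
print(texts)
```

Because the separator is `||` rather than a comma, commas inside the text stay within a single field.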


## Merging variants and texts dataframes

In [8]:
data = pd.merge(texts_data, variants_data, on='ID', how='left')

data.head()

|   | ID | TEXT | Gene | Variation | Class |
|---|----|------|------|-----------|-------|
| 0 | 0 | Cyclin-dependent kinases (CDKs) regulate a var... | FAM58A | Truncating Mutations | 1 |
| 1 | 1 | Abstract Background Non-small cell lung canc... | CBL | W802* | 2 |
| 2 | 2 | Abstract Background Non-small cell lung canc... | CBL | Q249E | 2 |
| 3 | 3 | Recent evidence has demonstrated that acquired... | CBL | N454D | 3 |
| 4 | 4 | Oncogenic mutations in the monomeric Casitas B... | CBL | L399V | 4 |


## High-level overview

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3321 entries, 0 to 3320
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ID         3321 non-null   int64 
 1   TEXT       3316 non-null   object
 2   Gene       3321 non-null   object
 3   Variation  3321 non-null   object
 4   Class      3321 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 155.7+ KB


1. The dataframe has 3321 entries.
2. ``ID`` and ``Class`` are integer columns (``Class`` is the numerical label of the genetic mutation class); the remaining columns have the object dtype, i.e. categorical and text data.
3. The ``TEXT`` column has 5 fewer non-null entries than the total of 3321; we will look at them later.

In [11]:
data['Class'].describe()

count    3321.000000
mean        4.365854
std         2.309781
min         1.000000
25%         2.000000
50%         4.000000
75%         7.000000
max         9.000000
Name: Class, dtype: float64
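
Since ``Class`` holds category labels encoded as integers, the mean and standard deviation above say little about how the labels are distributed. A per-class frequency count is one possible complement; the sketch below uses a small made-up series in place of `data['Class']`, so the numbers are not results from this dataset.

```python
import pandas as pd

# Hypothetical labels standing in for data['Class']; the values are made up.
labels = pd.Series([1, 2, 2, 3, 4, 7, 7, 7], name="Class")

# Rows per class, ordered by class label rather than by frequency.
counts = labels.value_counts().sort_index()
print(counts)

# Share of each class, which makes any imbalance easier to spot.
shares = (counts / counts.sum()).round(3)
print(shares)
```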

In [12]:
# Free memory (RAM): a 22 GB pretrained model is loaded later,
# so the intermediate dataframes are deleted here.

%xdel variants_data
%xdel texts_data

## Preprocessing

### NaN values

In [13]:
data[data.isnull().any(axis=1)]

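|   | ID | TEXT | Gene | Variation | Class |
|---|----|------|------|-----------|-------|
| 1109 | 1109 | NaN | FANCA | S1088F | 1 |
| 1277 | 1277 | NaN | ARID5B | Truncating Mutations | 1 |
| 1407 | 1407 | NaN | FGFR3 | K508M | 6 |
| 1639 | 1639 | NaN | FLT1 | Amplification | 6 |
| 2755 | 2755 | NaN | BRAF | G596C | 7 |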


- 5 rows have NaN in the ``'TEXT'`` column.
- We remove these rows: the column is free text, so the missing values cannot be replaced with other values.

In [14]:
data.dropna(subset = ["TEXT"], inplace=True)

### Standard NLP preprocessing
- Removing stop words
- Removing punctuation

In [15]:
stop_words = set(stopwords.words('english'))

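def text_processing(text):
  """Clean one text entry: normalize, lowercase, and drop stop words."""

  # Collapse runs of whitespace (including newlines) into a single space
  text = re.sub(r'\s+', ' ', text)
  # Replace every character that is not a letter, digit, or newline with a space
  text = re.sub(r'[^a-zA-Z0-9\n]', ' ', text)
  text = text.lower()

  tokens = [token for token in word_tokenize(text) if token not in punctuation and token not in stop_words]

  return ' '.join(tokens)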
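
To see what these cleaning steps do to a single sentence, here is a simplified, dependency-free sketch. It is not the function above: `simple_text_processing`, the tiny `STOP_WORDS` set, and the sample sentence are hypothetical stand-ins, and `str.split` replaces NLTK's `word_tokenize`.

```python
import re

# Assumption: a tiny hard-coded stop-word set standing in for NLTK's English list.
STOP_WORDS = {"a", "in", "is", "of", "the"}

def simple_text_processing(text):
    """Simplified stand-in for text_processing (str.split instead of word_tokenize)."""
    text = re.sub(r"\s+", " ", text)           # collapse whitespace runs
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)  # keep only letters and digits
    text = text.lower()
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)

# Made-up sentence, not taken from the dataset.
result = simple_text_processing("The W802* variant of CBL is discussed in the text.")
print(result)  # w802 variant cbl discussed text
```

Note that the `*` in `W802*` is dropped along with the punctuation, so variant names that differ only in such symbols become indistinguishable after cleaning.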

In [20]:
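# Series.iteritems() was removed in pandas 2.0; items() is the equivalent.
for index, text in tqdm(data["TEXT"].items()):
  # .loc avoids chained assignment, which may silently write to a copy
  data.loc[index, "TEXT"] = text_processing(text)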

3316it [02:31, 21.95it/s]


In [21]:
data.head(2)

|   | ID | TEXT | Gene | Variation | Class |
|---|----|------|------|-----------|-------|
| 0 | 0 | cyclin dependent kinases cdks regulate variety... | FAM58A | Truncating Mutations | 1 |
| 1 | 1 | abstract background non small cell lung cancer... | CBL | W802* | 2 |


<a name="cell-id4"></a>
# Primary data analysis

<a name="cell-id5"></a>
# Data featurization

## Categorical

## Text

<a name="cell-id6"></a>
# Secondary data analysis

<a name="cell-id7"></a>
# ML model building

<a name="cell-id8"></a>
# Conclusion