# Classifying variants 

> The aim of this notebook is to practise some Python, ML and, hopefully, NLP skills.

- The workflow of the notebook: 

    1. Data exploration
    2. Feature engineering
    3. Predict variant classification
    4. Evaluate predictions

## Data exploration

In [1]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

import re
import seaborn as sns

%load_ext autotime

I0206 20:56:33.083682 4466677184 file_utils.py:38] PyTorch version 1.4.0 available.


In [2]:
# The text files use "||" as the column separator, so they need a regex separator and the python engine
train_var = pd.read_csv("/Users/DZ_laptop/Dropbox/DZ_Work/data_science/kaggling/personalised_med/raw_data/msk-redefining-cancer-treatment/training_variants.zip")
train_text = pd.read_csv("/Users/DZ_laptop/Dropbox/DZ_Work/data_science/kaggling/personalised_med/raw_data/msk-redefining-cancer-treatment/training_text.zip", 
                         skiprows = 1, sep = r"\|\|", header = None, names = ["ID", "Text"], engine = 'python')

test_var = pd.read_csv("/Users/DZ_laptop/Dropbox/DZ_Work/data_science/kaggling/personalised_med/raw_data/msk-redefining-cancer-treatment/test_variants.zip")
# Read the test text file here (the original pointed at test_variants.zip a second time)
test_text = pd.read_csv("/Users/DZ_laptop/Dropbox/DZ_Work/data_science/kaggling/personalised_med/raw_data/msk-redefining-cancer-treatment/test_text.zip", 
                         skiprows = 1, sep = r"\|\|", header = None, names = ["ID", "Text"], engine = 'python')

time: 2.31 s


In [3]:
print(train_var.shape)
print(test_var.shape)
train_var.head()

(3321, 4)
(5668, 3)


| | ID | Gene | Variation | Class |
|---|---|---|---|---|
| 0 | 0 | FAM58A | Truncating Mutations | 1 |
| 1 | 1 | CBL | W802* | 2 |
| 2 | 2 | CBL | Q249E | 2 |
| 3 | 3 | CBL | N454D | 3 |
| 4 | 4 | CBL | L399V | 4 |


time: 11.4 ms


In [4]:
print(train_text.shape)
print(test_text.shape)
train_text.head()

(3321, 2)
(5668, 2)


| | ID | Text |
|---|---|---|
| 0 | 0 | Cyclin-dependent kinases (CDKs) regulate a var... |
| 1 | 1 | Abstract Background Non-small cell lung canc... |
| 2 | 2 | Abstract Background Non-small cell lung canc... |
| 3 | 3 | Recent evidence has demonstrated that acquired... |
| 4 | 4 | Oncogenic mutations in the monomeric Casitas B... |


time: 7.31 ms


## Feature engineering

- The main hurdle here is converting each abstract into a numeric form that a machine learning model can take as input.
- Let's try three different methods of encoding the text:

    1. Simple baseline model: clean the text, keep the x most frequent words, then tokenise.
    2. Tokenise using BERT (http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)
    3. Shifting window-based BERT

- BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google. It has been pre-trained on a huge text corpus and produces context-dependent encodings of sentences and text.
- Standard BERT accepts at most 512 tokens per input, which most of our abstracts exceed - https://stackoverflow.com/questions/58636587/how-to-use-bert-for-long-text-classification / https://github.com/google-research/bert/issues/66
- So for approach 3 let's try a shifting window approach: split each text into overlapping windows that each fit within the limit, and encode every window separately.
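A minimal sketch of how the shifting window in approach 3 could split a long list of token ids. The function name `sliding_windows`, the default `window` of 510 (leaving room for the two special tokens) and the default `stride` of 255 are assumptions made for illustration, not settings used elsewhere in this notebook.

```python
from typing import List


def sliding_windows(token_ids: List[int], window: int = 510, stride: int = 255) -> List[List[int]]:
    """Split token ids into overlapping chunks of at most `window` ids.

    Hypothetical helper: consecutive chunks start `stride` ids apart, so
    they overlap whenever stride < window.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) <= window:
        return [list(token_ids)]

    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + window]))
        # Stop once the current chunk reaches the end of the sequence
        if start + window >= len(token_ids):
            break
        start += stride
    return chunks


# Toy example: 12 "tokens", windows of 5 ids, moving 3 ids at a time
demo = sliding_windows(list(range(12)), window=5, stride=3)
print(demo)
```

Each chunk would then be wrapped in the special tokens and passed through the model on its own. How the per-chunk outputs get combined into one vector per text (averaging is one option) is a separate design choice that this sketch leaves open.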

In [5]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

I0206 20:56:38.539901 4466677184 tokenization_utils.py:418] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /Users/DZ_laptop/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
I0206 20:56:38.976819 4466677184 configuration_utils.py:254] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json from cache at /Users/DZ_laptop/.cache/torch/transformers/a41e817d5c0743e29e86ff85edc8c257e61bc8d88e4271bb1b243b6e7614c633.587f67ec28c540f4294c9c2ac7dcf7841ff371aeb12cdeb6a17f69da39ad9452
I0206 20:56:38.978065 4466677184 configuration_utils.py:290] Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "dim": 768,
  "do_sample": false,
  "dropout": 0.1,
  "eos_token_ids": 0,
 

time: 2.55 s


In [6]:
# tokenise the first 50 texts; max_length truncates each one so it stays within BERT's 512-token limit
tokenised = train_text["Text"].iloc[range(50)].apply(lambda x: tokenizer.encode(x, add_special_tokens = True, max_length = 510))

time: 46.7 s


In [7]:
# pad the token lists into a 2D matrix
max_len = max(len(i) for i in tokenised.values)

# list comprehension that appends 0s (the [PAD] token id) up to the max length
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenised.values])
padded[0:5]

array([[  101, 22330, 20464, ..., 21618,  1999,   102],
       [  101, 10061,  4281, ..., 11192,  4456,   102],
       [  101, 10061,  4281, ..., 11192,  4456,   102],
       [  101,  3522,  3350, ...,  2070,  1997,   102],
       [  101,  2006,  3597, ...,  2116, 25409,   102]])

time: 6.07 ms


In [8]:
# then we have to tell BERT to ignore the padded positions (1 = real token, 0 = padding)
attention_mask = np.where(padded != 0, 1, 0)
attention_mask[0:5]

array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]])

time: 2.83 ms


In [9]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

time: 26.2 s


In [10]:
# extract the feature we actually need: the final hidden state of the first token ([CLS]) of each text
features = last_hidden_states[0][:,0,:].numpy()
print(features)
len(features[0])

[[-0.3599693  -0.01222367 -0.15771253 ... -0.11386845  0.34729826
   0.42809564]
 [-0.44107428 -0.14908828 -0.3602752  ... -0.22257173  0.3804882
   0.5449228 ]
 [-0.44107428 -0.14908828 -0.3602752  ... -0.22257173  0.3804882
   0.5449228 ]
 ...
 [-0.33200416 -0.01765815 -0.18332963 ...  0.010189    0.3175863
   0.6008474 ]
 [-0.47746763 -0.13395074 -0.11828913 ... -0.30756983  0.39826
   0.44104475]
 [-0.37927583 -0.00662447 -0.26490277 ... -0.16988462  0.2973519
   0.65052885]]


768

time: 4.69 ms
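To make the slicing in the cell above easier to follow, here is a small sketch on a made-up array. The `hidden` array is a hypothetical stand-in for `last_hidden_states[0]`, with toy dimensions (2 texts, 4 tokens, hidden size 3) rather than the real 768-wide output.

```python
import numpy as np

# Hypothetical stand-in for the model output: shape (texts, tokens, hidden units)
hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)

# Keep every text, only token position 0, and every hidden unit
cls_vectors = hidden[:, 0, :]

print(cls_vectors.shape)  # (2, 3)
print(cls_vectors)
```

The result has one row per text, which is the 2D layout that scikit-learn estimators expect as a feature matrix.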


## Predict variant classification

In [11]:
labels = train_var.iloc[range(50)]["Class"]
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

time: 3.42 ms


In [None]:
lr_clf = LogisticRegression(solver = "lbfgs", multi_class = "multinomial")
lr_clf.fit(train_features, train_labels)

In [None]:
lr_clf.predict(test_features)
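The workflow lists an evaluation step after prediction. A minimal sketch of what that could look like follows, assuming accuracy and multi-class log loss as the metrics. The data is synthetic, and `X`, `y` and `clf` are hypothetical stand-ins for the BERT features, the `Class` labels and `lr_clf`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 300 samples, 8 features, 3 classes (0, 1, 2)
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)

# Stratify so every class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)         # hard class predictions
proba = clf.predict_proba(X_test)  # one probability per class

acc = accuracy_score(y_test, pred)
ll = log_loss(y_test, proba, labels=clf.classes_)
print(f"accuracy: {acc:.3f}, log loss: {ll:.3f}")
```

With only 50 real samples spread over several classes, any scores from this notebook would be very noisy, so a larger subset or cross-validation (for example `cross_val_score`, which is already imported) would give a steadier estimate.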