# Medical Treatment

## Business Problem

## Description

A lot has been said during the past several years about how precision medicine and, more concretely, how genetic testing is going to disrupt the way diseases like cancer are treated.

But this is only partially happening due to the huge amount of manual work still required. Once sequenced, a cancer tumor can have thousands of genetic mutations. But the challenge is distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers).

Currently this interpretation of genetic mutations is being done manually. This is a very time-consuming task where a clinical pathologist has to manually review and classify every single genetic mutation based on evidence from text-based clinical literature.

## Problem Statement

We need to develop a machine learning algorithm that predicts the effect of genetic variants to enable personalized medicine.

All data files are available at: https://www.kaggle.com/c/msk-redefining-cancer-treatment/data

## Importing the Library

Import the important libraries for loading data.

In [1]:
# Loading all required packages

In [2]:
# Loading training_variants. It's a comma-separated file

In [3]:
# Loading the training_text dataset. Its fields are separated by ||
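
As a rough sketch of the two loading steps, the snippet below reads a comma-separated variants table and a `||`-separated text table with pandas. The in-memory sample rows, the frame names `data_variants` and `data_text`, and the `TEXT` column name are assumptions for illustration; the notebook itself would read the Kaggle files instead.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the Kaggle files (made-up rows).
variants_csv = io.StringIO(
    "ID,Gene,Variation,Class\n"
    "0,GENE_A,Truncating Mutations,1\n"
    "1,GENE_B,W802*,2\n"
)
text_file = io.StringIO(
    "ID,Text\n"
    "0||First abstract, with a comma.\n"
    "1||Second abstract.\n"
)

data_variants = pd.read_csv(variants_csv)

# A multi-character separator is treated as a regular expression, so the pipes
# are escaped and the Python parsing engine is selected. The header row uses a
# comma, so it is skipped and the column names are supplied by hand.
data_text = pd.read_csv(
    text_file, sep=r"\|\|", engine="python", names=["ID", "TEXT"], skiprows=1
)
```

Using a regex separator keeps commas inside the clinical text from being mistaken for field boundaries.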

In [4]:
# Next we have to understand the data we have loaded. There are 4 fields above:

# ID 
# Gene 
# Variation 
# Class

In [5]:
# Checking the dimensions of the data

In [6]:
# Next, checking the columns in the above dataset

In [7]:
# Finally, checking the dimensions

In [8]:
# So, in short, the datasets look like this:

# data_variants (ID, Gene, Variation, Class)
# data_text (ID, text)

## We have a huge amount of text data, so we need to preprocess it

In [10]:
# We would like to remove all stop words like a, is, an, the, ...     

In [11]:
# so we collect all of them from the nltk library

In [12]:
# Remove int values from the text data, as they might not be important

In [13]:
# Next we will replace all special characters with a space

In [14]:
# Next, replace multiple spaces with a single space

In [15]:
# And convert the whole text to lower case

In [16]:
# If a word is not a stop word, retain it in the text
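
The cleaning steps listed above can be sketched as one small function. The name `clean_text` and the short hard-coded stop-word set are assumptions made so the example runs without downloads; the notebook collects its stop words from nltk. Treating non-string entries as empty is one possible reading of the "remove int values" step.

```python
import re

# Small stand-in stop-word list (assumption); the notebook uses nltk's list.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "and", "to"}


def clean_text(raw):
    """Apply the preprocessing steps to one document and return the result."""
    # Entries that are not strings (for example a number) carry no text.
    if not isinstance(raw, str):
        return ""
    # Replace every special character with a space.
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", raw)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    # Bring everything to lower case.
    text = text.lower()
    # Keep only the words that are not stop words.
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


cleaned = clean_text("The BRCA1 mutation (p.C61G) is  a  driver!")
```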

In [18]:
# Now finally merge both the gene_variations and text data based on ID
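
A minimal sketch of the merge, assuming both tables are pandas data frames that share an `ID` column. The rows and the name `result` are made up for illustration.

```python
import pandas as pd

data_variants = pd.DataFrame(
    {
        "ID": [0, 1, 2],
        "Gene": ["GENE_A", "GENE_B", "GENE_B"],
        "Variation": ["V1", "V2", "V3"],
        "Class": [1, 2, 2],
    }
)
data_text = pd.DataFrame({"ID": [0, 1, 2], "TEXT": ["text a", "text b", "text c"]})

# A left join keeps every variant row even if its text were missing.
result = pd.merge(data_variants, data_text, on="ID", how="left")
```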

## Creating Training, Test and Validation data

In [19]:
# Splitting the data into train and test set 

In [20]:
# Now split the train data into train and cross-validation sets
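
One way to produce the three sets is two stratified calls to scikit-learn's `train_test_split`. The 20% hold-out fractions, the synthetic data, and the variable names are assumptions; stratifying on the class keeps the class proportions similar across the sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.arange(1000).reshape(-1, 1)    # placeholder features
y = rng.integers(1, 10, size=1000)    # placeholder labels for classes 1..9

# First split: hold out a test set.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)
# Second split: carve a cross-validation set out of the remaining data.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_train_full, y_train_full, stratify=y_train_full, test_size=0.2, random_state=0
)
```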

## Visualizing the train class distribution

In [21]:
# Look at the distribution as percentages

In [22]:
# Let's visualize the same for the test set

In [23]:
# Let's visualize the same for the cross-validation set
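
As a small illustration, the percentage of points per class can be computed with pandas before plotting. The label values below are made up.

```python
import pandas as pd

y_train = pd.Series([7, 7, 4, 1, 7, 2, 4, 7])

# Share of each class, in percent, ordered by class label.
class_pct = y_train.value_counts(normalize=True).sort_index() * 100
```

The same call can be repeated on the test and cross-validation labels to check that the three distributions are similar.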

## Building a Random model

In [24]:
# We create an output array that has exactly the same size as the CV data

In [25]:
# Test-set error. We create an output array that has exactly the same size as the test data

In [26]:
# Let's get the index of the max probability

In [27]:
# Let's see the output. There will be 665 values, one per point in the test dataset
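
A sketch of such a random baseline, assuming 9 classes labelled 1 to 9 and synthetic labels: each point receives random numbers that are normalized into probabilities, and the resulting log loss serves as a reference that any real model should beat.

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(42)
n_classes = 9
y_true = rng.integers(1, n_classes + 1, size=500)   # placeholder labels

# One row of random numbers per point, scaled so each row sums to 1.
raw = rng.random((len(y_true), n_classes))
random_probs = raw / raw.sum(axis=1, keepdims=True)

baseline_loss = log_loss(y_true, random_probs, labels=np.arange(1, n_classes + 1))

# Index of the largest probability, shifted to the 1..9 class labels.
predicted_class = np.argmax(random_probs, axis=1) + 1
```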

## Confusion Matrix

In [28]:
# Use the confusion matrix to compare the predicted class with the original class.

## Precision matrix

In [30]:
# Use the precision matrix to compare the predicted class with the original class.

## Recall matrix

In [31]:
# Use the recall matrix to compare the predicted class with the original class.
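
The three matrices can be sketched on a toy example. With rows as original classes and columns as predicted classes, dividing each column by its sum gives a precision matrix and dividing each row by its sum gives a recall matrix. The labels below are made up, and the sketch assumes every class is predicted at least once, so no column sum is zero.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 2, 2, 2, 3]
y_pred = [1, 2, 2, 2, 3, 3]

# Rows are original classes, columns are predicted classes.
C = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])

# Column-normalized: of the points predicted as class j, which were truly class i.
precision_matrix = C / C.sum(axis=0, keepdims=True)

# Row-normalized: of the points truly in class i, which were predicted as class j.
recall_matrix = C / C.sum(axis=1, keepdims=True)
```

The diagonal of `precision_matrix` holds the per-class precision and the diagonal of `recall_matrix` holds the per-class recall.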

## Evaluating Gene Column

In [32]:
# Let's explore the Gene column and look at its distribution.

In [33]:
# The top 10 genes that occurred most

In [36]:
# Gene is a categorical feature, so we have 2 techniques to encode it:
   # - One-hot encoding
   # - Response encoding (mean imputation)
# Let's use both of them to see which one works best.

In [37]:
# one-hot encoding of Gene feature.

In [38]:
# column names after one-hot encoding for Gene column
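
A minimal one-hot sketch using scikit-learn's `OneHotEncoder`; the notebook may use a different encoder, and the gene lists are made up. Fitting on the train genes only and ignoring unknown values means a gene seen only at CV or test time becomes an all-zero row.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_genes = np.array(["BRCA1", "TP53", "EGFR", "BRCA1"]).reshape(-1, 1)
cv_genes = np.array(["TP53", "KRAS"]).reshape(-1, 1)   # KRAS is unseen in train

encoder = OneHotEncoder(handle_unknown="ignore")
train_onehot = encoder.fit_transform(train_genes)   # sparse matrix, one column per gene
cv_onehot = encoder.transform(cv_genes)

# Column names after encoding, in column order.
gene_columns = list(encoder.categories_[0])
```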

In [39]:
# Let's also create response encoding columns for the Gene column

In [40]:
# get_gv_fea_dict: Get Gene variation Feature Dict

In [41]:
# Get Gene variation feature

In [43]:
# Response coding of the Gene feature

In [44]:
# alpha is used for Laplace smoothing
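
Response coding replaces a categorical value with a vector of smoothed class probabilities estimated from the training data. The sketch below uses the generic Laplace form (count + alpha) / (total + alpha * n_classes); the helper names and this exact formula are assumptions, and the notebook's `get_gv_fea_dict` may use different constants. A value never seen in training gets a uniform vector.

```python
from collections import Counter, defaultdict

import numpy as np


def fit_response_table(values, labels, n_classes, alpha=1.0):
    """Map each category value to its smoothed P(class | value) vector."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    table = {}
    for value, per_class in counts.items():
        total = sum(per_class.values())
        table[value] = np.array(
            [
                (per_class[k] + alpha) / (total + alpha * n_classes)
                for k in range(1, n_classes + 1)
            ]
        )
    return table


def response_encode(values, table, n_classes):
    """Encode values with the fitted table; unseen values get equal probabilities."""
    uniform = np.full(n_classes, 1.0 / n_classes)
    return np.vstack([table.get(value, uniform) for value in values])


genes = ["BRCA1", "BRCA1", "TP53", "BRCA1"]
classes = [1, 1, 2, 3]
table = fit_response_table(genes, classes, n_classes=3, alpha=1.0)
encoded = response_encode(["BRCA1", "KRAS"], table, n_classes=3)
```

The table should be fitted on the training split only, since it is built from the class labels.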

In [45]:
# We need a hyperparameter for the SGD classifier.

In [46]:
# Let's plot the same to check the best alpha value

In [47]:
# Let's use the best alpha value, as seen in the graph above, and compute the log loss
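
The alpha search can be sketched as a loop over candidate values that keeps the one with the lowest cross-validation log loss. The synthetic data, the alpha grid, the default hinge loss, and the `CalibratedClassifierCV` wrapper used to obtain probabilities are all assumptions for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an encoded feature and its class labels.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

alphas = [10.0 ** p for p in range(-5, 1)]   # 1e-5 ... 1
cv_losses = []
for alpha in alphas:
    base = SGDClassifier(alpha=alpha, penalty="l2", random_state=0)
    # Calibration turns the classifier's scores into class probabilities.
    model = CalibratedClassifierCV(base, method="sigmoid")
    model.fit(X_train, y_train)
    cv_losses.append(log_loss(y_cv, model.predict_proba(X_cv), labels=model.classes_))

best_alpha = alphas[int(np.argmin(cv_losses))]
```

Plotting `cv_losses` against `alphas` on a log-scaled axis gives the curve from which the best alpha is read.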

## Evaluating Variation column

Variation is also a categorical variable, so we handle it the same way as the Gene column. We again build the one-hot encoding and the response encoding for the Variation column.

In [48]:
# Printing the top 10 variations that occurred most

In [49]:
# one-hot encoding of variation feature

In [50]:
# Now we need a hyperparameter for the SGD classifier.

In [51]:
# Let's plot the same to check the best alpha value

## Evaluating Text column

cls_text is a data frame. For every row in the data frame, take the 'TEXT', split it into words by spaces, and build a dict of those words, incrementing a word's count whenever we see it.

In [52]:
# Building a CountVectorizer with all the words that occurred at least 3 times in the train data

In [53]:
# getting all the feature names (words)

In [54]:
# train_text_feature_onehotCoding.sum(axis=0).A1 sums each column over all rows and returns a (1 * number of features) vector

In [55]:
# zip(list(text_features),text_fea_counts) pairs each word with the number of times it occurred
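
A sketch of the word-count step with scikit-learn's `CountVectorizer`. It assumes the threshold is set through `min_df=3`; note that `min_df` counts the documents containing a word, not its total number of occurrences. The toy documents are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_texts = [
    "brca1 mutation driver",
    "brca1 mutation passenger",
    "brca1 mutation mutation tumor",
    "egfr tumor",
]

vectorizer = CountVectorizer(min_df=3)
train_counts = vectorizer.fit_transform(train_texts)   # sparse documents x words

# Words in column order, read from the fitted vocabulary.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Summing along axis 0 adds up each column, giving one total per word.
totals = np.asarray(train_counts.sum(axis=0)).ravel()
word_counts = {word: int(total) for word, total in zip(words, totals)}
```

Here only `brca1` and `mutation` survive, because they are the only words present in at least three documents.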

In [56]:
# dict_list = [] contains 9 dictionaries, each corresponding to a class

In [57]:
# dict_list[i] is built on the i'th class text data; total_dict is built on the whole training text data

In [58]:
# response coding of text features

In [59]:
# Next we normalize each row so that its values sum to 1
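
The per-class dictionaries and the row normalization can be sketched as follows. The scoring rule (a smoothed geometric mean of per-word class ratios), the smoothing constant, and all names are assumptions; the notebook's own response coding of text may differ in its details.

```python
import math
from collections import Counter

import numpy as np

texts = ["brca1 driver mutation", "tp53 passenger mutation", "brca1 driver"]
labels = [1, 2, 1]
n_classes = 2

# One word-count dictionary per class, plus one over all training text.
class_counts = [Counter() for _ in range(n_classes)]
total_counts = Counter()
for text, label in zip(texts, labels):
    words = text.split()
    class_counts[label - 1].update(words)
    total_counts.update(words)


def response_code(text, smoothing=1.0):
    """Return one smoothed score per class for a document."""
    words = text.split()
    scores = []
    for counts in class_counts:
        log_sum = sum(
            math.log((counts[w] + smoothing) / (total_counts[w] + smoothing * n_classes))
            for w in words
        )
        # Averaging the logs keeps long documents from shrinking the score.
        scores.append(math.exp(log_sum / max(len(words), 1)))
    return scores


coded = np.array([response_code(text) for text in texts])
# Normalize each row so that its values sum to 1.
coded = coded / coded.sum(axis=1, keepdims=True)
```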

In [60]:
# Let's see the number of words for a given frequency

## Data preparation for Machine Learning models

In [61]:
# We now plot the confusion matrices given y_i, y_i_hat

In [62]:
# representing A in heatmap format

In [63]:
# representing B in heatmap format

In [64]:
# For calculating log_loss we will provide the array of probabilities belonging to each class

In [65]:
# calculating the number of data points that are misclassified

In [66]:
# This function is used only for naive Bayes: for the given indices, we print the feature names and check whether each feature is present in the test point's text

## Combining all 3 features together

In [67]:
# Merging gene, variation and text features
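
Stacking the three encoded feature blocks side by side can be sketched with `scipy.sparse.hstack`, assuming each block is a sparse matrix with one row per data point. The small matrices are placeholders.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Placeholder encodings for 3 data points.
gene_features = csr_matrix(np.array([[1, 0], [0, 1], [1, 0]]))
variation_features = csr_matrix(np.eye(3))
text_features = csr_matrix(np.array([[2, 0, 1, 0], [0, 1, 0, 3], [1, 1, 0, 0]]))

# Columns are concatenated; the row count must match across all blocks.
combined = hstack([gene_features, variation_features, text_features]).tocsr()
```

The same column order has to be used for the train, CV, and test matrices.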

## Building Machine Learning model

Let's start with the first model, one that is well suited to data with a lot of text. So, we will start with Naive Bayes.

## Naive Bayes

In [68]:
# Use naive bayes and plot the Cross validation error.

In [69]:
# Check the interpretability of our model
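
A sketch of tuning the Naive Bayes smoothing parameter by cross-validation log loss. The synthetic count data, the alpha grid, and the use of `MultinomialNB` without a calibration step are assumptions.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
n_classes, n_features = 3, 30
# Each class gets its own average word rates, mimicking count features.
rates = rng.random((n_classes, n_features)) * 5


def sample(n):
    """Draw n labelled rows of non-negative word counts."""
    y = rng.integers(0, n_classes, size=n)
    return rng.poisson(rates[y]), y


X_train, y_train = sample(300)
X_cv, y_cv = sample(100)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
cv_losses = {}
for alpha in alphas:
    model = MultinomialNB(alpha=alpha).fit(X_train, y_train)
    cv_losses[alpha] = log_loss(y_cv, model.predict_proba(X_cv), labels=model.classes_)

best_alpha = min(cv_losses, key=cv_losses.get)
```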

## K Nearest Neighbour Classification

In [72]:
# Use K Nearest Neighbour Classification and plot the Cross validation error.

## Logistic Regression

In [73]:
# Use Logistic Regression with class balancing

In [74]:
# To avoid rounding errors while multiplying probabilities, we use log-probability estimates
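
A short numeric illustration of why log-probabilities are used: multiplying many small probabilities underflows to zero in floating point, while summing their logarithms stays well within range. The values are made up.

```python
import math

probs = [0.01] * 1000

# The true product is 1e-2000, far below the smallest representable float.
product = math.prod(probs)

# The equivalent quantity in log space is an ordinary negative number.
log_sum = math.fsum(math.log(p) for p in probs)
```

Comparing classes by `log_sum` gives the same ranking as comparing the products would, without the loss of precision.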

## Feature importance

In [75]:
# Test query point and doing interpretability

In [76]:
# Use Logistic Regression without class balancing

In [77]:
# Testing query point and interpretability

## Linear Support Vector Machines

In [78]:
# Use Linear Support Vector Machines and plot the Cross validation error.

In [79]:
# Testing model with best alpha values

## Random Forest Classifier

In [80]:
# Model with One hot encoder

In [81]:
# Use RF with Response Coding

In [82]:
# Finally, use the stacking model
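
A hedged sketch of a stacking model built from logistic regression, a linear SVM, and Naive Bayes, with a logistic regression as the final estimator. It uses scikit-learn's `StackingClassifier` and synthetic count data, both of which are assumptions; the notebook may rely on a different stacking implementation, and this sketch does not reproduce its reported numbers.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_classes, n_features = 3, 20
rates = rng.random((n_classes, n_features)) * 4


def sample(n):
    """Draw n labelled rows of non-negative count features."""
    y = rng.integers(0, n_classes, size=n)
    return rng.poisson(rates[y]), y


X_train, y_train = sample(300)
X_test, y_test = sample(100)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", LinearSVC(max_iter=5000)),
        ("nb", MultinomialNB()),
    ],
    # The final estimator learns how to combine the base models' outputs.
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)

test_probs = stack.predict_proba(X_test)
test_log_loss = log_loss(y_test, test_probs, labels=stack.classes_)
misclassified_fraction = np.mean(stack.predict(X_test) != y_test)
```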

In [83]:
# Logistic Regression :  Log Loss: 1.07
# Support vector machines : Log Loss: 1.69
# Naive Bayes : Log Loss: 1.24

In [84]:
# Log loss (train) on the stacking classifier : 0.662940243923
# Log loss (CV) on the stacking classifier : 1.11432017542
# Log loss (test) on the stacking classifier : 1.11594191877
# Number of misclassified points : 0.35789473684210527

## Maximum voting Classifier

In [85]:
# import the library
from sklearn.ensemble import VotingClassifier

In [88]:
# You will get the final result.
# Refer: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
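
A minimal sketch of a soft-voting ensemble with `VotingClassifier`. The choice of base models, the synthetic count data, and `voting="soft"` are assumptions; soft voting averages the predicted class probabilities, so every base model must provide `predict_proba`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(1)
n_classes, n_features = 3, 20
rates = rng.random((n_classes, n_features)) * 4


def sample(n):
    """Draw n labelled rows of non-negative count features."""
    y = rng.integers(0, n_classes, size=n)
    return rng.poisson(rates[y]), y


X_train, y_train = sample(300)
X_test, y_test = sample(100)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", MultinomialNB()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
vote.fit(X_train, y_train)

vote_probs = vote.predict_proba(X_test)
vote_log_loss = log_loss(y_test, vote_probs, labels=vote.classes_)
vote_misclassified = np.mean(vote.predict(X_test) != y_test)
```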

In [89]:
# Log loss (train) on the VotingClassifier : 0.916449056762
# Log loss (CV) on the VotingClassifier : 1.1858758572
# Log loss (test) on the VotingClassifier : 1.20235899773
# Number of misclassified points : 0.35639097744360904