# vLife Virtusa

## Predicting Patient Eligibility for Cancer Trials Use Case
### Use Case Description
<b> Interventional cancer clinical trials are often too restrictive: patients are frequently excluded because of comorbidities, past or concomitant treatments, or because they are over a certain age. Using a deep neural network, this tool predicts a patient's eligibility for a cancer clinical trial from the patient's diagnosis notes. </b>

### Data Source
The dataset for this use case is available on Kaggle [here](https://www.kaggle.com/auriml/eligibilityforcancerclinicaltrials).

### Dataset Description
<p> A total of 6,186,572 labeled clinical statements were extracted from 49,201 interventional clinical trial (CT) protocols on cancer (the protocols can be downloaded freely from https://clinicaltrials.gov/ct2/results?term=neoplasmtype=Intrshowdow). Each downloaded CT is an XML file whose fields are defined by an XML schema for clinical trials [16]. The data relevant to this project come from the intervention, condition, and eligibility fields, all of which are written as unstructured free text. The eligibility criteria (both inclusion and exclusion) are sets of phrases or sentences in free format, such as paragraphs, bulleted lists, and enumerated lists. None of these fields follow common standards or enforce standardized terms from medical dictionaries and ontologies. The language also suffers from both polysemy and synonymy. </p>
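
To make the structure described above concrete, here is a minimal sketch of pulling the intervention, condition, and eligibility text out of one trial record. The sample XML and its element names (`clinical_study`, `intervention_name`, `criteria/textblock`) are assumptions for illustration, and `extract_fields` is a hypothetical helper, not part of this notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed trial record. Element names are assumed;
# real protocol files contain many more fields.
SAMPLE_XML = """
<clinical_study>
  <condition>Glioblastoma</condition>
  <intervention>
    <intervention_name>Bevacizumab</intervention_name>
  </intervention>
  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:
        - Histologically confirmed glioblastoma
        Exclusion Criteria:
        - Immunosuppressive agents other than corticosteroids
      </textblock>
    </criteria>
  </eligibility>
</clinical_study>
"""


def extract_fields(xml_text):
    """Return interventions, conditions, and eligibility lines from one record."""
    root = ET.fromstring(xml_text.strip())
    interventions = [e.text.strip() for e in root.iter("intervention_name")]
    conditions = [e.text.strip() for e in root.iter("condition")]
    criteria = root.findtext("eligibility/criteria/textblock", default="")
    # Free-text criteria: keep each non-empty line as a separate statement.
    lines = [line.strip() for line in criteria.splitlines() if line.strip()]
    return {
        "interventions": interventions,
        "conditions": conditions,
        "criteria": lines,
    }


record = extract_fields(SAMPLE_XML)
print(record["interventions"], record["conditions"])
for line in record["criteria"]:
    print(line)
```

Because the criteria are free text, any split into inclusion and exclusion statements has to rely on headings and list markers like the ones in this sample, which real protocols do not use consistently.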

### Import packages & Modules

In [1]:
import re
import string

import nltk
import numpy as np
import pandas as pd
import tensorflow as tf
import keras
from keras.layers import Dense
from keras.models import Sequential, model_from_json
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split

nltk.download('punkt')  # tokenizer models required by word_tokenize

print(tf.__version__)

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Using TensorFlow backend.


1.15.0


In [3]:
df = pd.read_table('Eligibility.csv', header = None)



In [4]:
df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | __label__0 | study interventions are recombinant CD40-ligan... |
| 1 | __label__0 | study interventions are Liposomal doxorubicin ... |
| 2 | __label__0 | study interventions are BI 836909 . multiple m... |
| 3 | __label__0 | study interventions are Immunoglobulins . recu... |
| 4 | __label__0 | study interventions are Paclitaxel . stage ova... |


## Exploratory Data Analysis

In [5]:
# reshape(-1, 2) infers the row count instead of hard-coding 1,000,000
clin_trial = pd.DataFrame(np.array(df).reshape(-1, 2), columns=['label', 'describe'])

In [6]:
clin_trial['label'].unique()

array(['__label__0', '__label__1'], dtype=object)

In [7]:
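# Split on the first period only: the text before it is the intervention, the rest is the diagnosis
parts = clin_trial['describe'].str.split('.', n=1, expand=True)
clin_trial['study_intervention'] = parts[0]
clin_trial['diagnosis'] = parts[1]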

In [8]:
clin_trial.head()

|   | label | describe | study_intervention | diagnosis |
|---|---|---|---|---|
| 0 | __label__0 | study interventions are recombinant CD40-ligan... | study interventions are recombinant CD40-ligand | melanoma skin diagnosis and no active cns met... |
| 1 | __label__0 | study interventions are Liposomal doxorubicin ... | study interventions are Liposomal doxorubicin | colorectal cancer diagnosis and cardiovascular |
| 2 | __label__0 | study interventions are BI 836909 . multiple m... | study interventions are BI 836909 | multiple myeloma diagnosis and indwelling cen... |
| 3 | __label__0 | study interventions are Immunoglobulins . recu... | study interventions are Immunoglobulins | recurrent fallopian tube carcinoma diagnosis ... |
| 4 | __label__0 | study interventions are Paclitaxel . stage ova... | study interventions are Paclitaxel | stage ovarian cancer diagnosis and patients m... |


In [9]:
clin_trial=clin_trial.drop('describe',axis=1)

In [10]:
# Raw string avoids the invalid-escape warning for \d; the captured digit is the class label
clin_trial['qualification']=clin_trial['label'].str.extract(r'(\d)', expand=True)

In [11]:
clin_trial.head()

|   | label | study_intervention | diagnosis | qualification |
|---|---|---|---|---|
| 0 | __label__0 | study interventions are recombinant CD40-ligand | melanoma skin diagnosis and no active cns met... | 0 |
| 1 | __label__0 | study interventions are Liposomal doxorubicin | colorectal cancer diagnosis and cardiovascular | 0 |
| 2 | __label__0 | study interventions are BI 836909 | multiple myeloma diagnosis and indwelling cen... | 0 |
| 3 | __label__0 | study interventions are Immunoglobulins | recurrent fallopian tube carcinoma diagnosis ... | 0 |
| 4 | __label__0 | study interventions are Paclitaxel | stage ovarian cancer diagnosis and patients m... | 0 |


In [12]:
clin_trial=clin_trial.drop('label',axis=1)

In [13]:
clin_trial['study_intervention']=clin_trial['study_intervention'].str.replace("study interventions are ",'')

In [14]:
clin_trial.to_csv('cancer_eligibility.csv',index=False)

In [15]:
clin_trial.head()

|   | study_intervention | diagnosis | qualification |
|---|---|---|---|
| 0 | recombinant CD40-ligand | melanoma skin diagnosis and no active cns met... | 0 |
| 1 | Liposomal doxorubicin | colorectal cancer diagnosis and cardiovascular | 0 |
| 2 | BI 836909 | multiple myeloma diagnosis and indwelling cen... | 0 |
| 3 | Immunoglobulins | recurrent fallopian tube carcinoma diagnosis ... | 0 |
| 4 | Paclitaxel | stage ovarian cancer diagnosis and patients m... | 0 |
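
The column transformations in this section can be summarized per row in plain Python. This is only a sketch: `parse_row` is a hypothetical helper, the raw string is reconstructed from the row 300 values shown later in the notebook, and its `__label__0` label is assumed for illustration.

```python
import re


def parse_row(label, describe):
    """Split one raw (label, describe) pair into the three final columns."""
    # "__label__0" -> 0: the first digit in the label is the class.
    qualification = int(re.search(r"\d", label).group())
    # Split on the first period only; later periods stay in the diagnosis.
    intervention, _, diagnosis = describe.partition(".")
    # Drop the constant prefix that every intervention string carries.
    intervention = intervention.replace("study interventions are ", "")
    return intervention, diagnosis, qualification


# Reconstructed example row; the label value is an assumption.
raw_label = "__label__0"
raw_text = (
    "study interventions are Sirolimus . sarcoma diagnosis and antifungals "
    "voriconazole itraconazole or ketoconazole"
)

intervention, diagnosis, qualification = parse_row(raw_label, raw_text)
print(repr(intervention))
print(repr(diagnosis))
print(qualification)
```

Note that the split leaves a trailing space on the intervention and a leading space on the diagnosis, which is why the notebook later calls `.strip()` before printing an intervention name.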


### Data Preprocessing & Cleaning

In [16]:
clin_trial['study_intervention'][300]

'Sirolimus '

In [17]:
clin_trial['diagnosis'][300]

' sarcoma diagnosis and antifungals voriconazole itraconazole or ketoconazole'

In [18]:
clin_trial['qualification']=clin_trial['qualification'].astype('int64')

In [19]:
clin_trial['summ']=clin_trial['study_intervention']+clin_trial['diagnosis']

In [20]:
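# Requires the NLTK stop word corpus: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """Lowercase, strip, and tokenize the text, then drop English stop words."""
    tokens = word_tokenize(text.lower().strip())
    return [token for token in tokens if token not in stop_words]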
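
For readers without the NLTK data installed, the same idea can be sketched with the standard library alone. The stop word set below is a tiny illustrative sample rather than NLTK's English list, and the regular expression is a simplification of `word_tokenize`, so the output will not match `clean_text` exactly on all inputs.

```python
import re

# Tiny illustrative stop word set (an assumption); NLTK's list is much larger.
STOP_WORDS = {"and", "or", "the", "of", "that", "may", "with", "than", "other"}


def clean_text_simple(text):
    """Lowercase the text, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


tokens = clean_text_simple(
    " sarcoma diagnosis and antifungals voriconazole itraconazole or ketoconazole"
)
print(tokens)
```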

In [21]:
X=clin_trial['summ']
y=clin_trial['qualification']  # same column as iloc[:,2], selected by name

In [22]:
xtrain, xval, ytrain, yval = train_test_split(X, y, test_size=0.2, random_state=9)

In [23]:
type(xtrain)

pandas.core.series.Series

In [24]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8)

In [25]:
xtrain.to_csv('xtrain.csv',index=False)



In [26]:
x_train=pd.read_csv('xtrain.csv',header=None)
#x_train=pd.Series(x_train.iloc[:,1])
type(x_train.iloc[:,0])

pandas.core.series.Series

In [27]:
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)

In [28]:
xtrain_tfidf

<800000x35511 sparse matrix of type '<class 'numpy.float64'>'
	with 13413896 stored elements in Compressed Sparse Row format>
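
To show what the vectorizer computes, here is a small pure-Python sketch of TF-IDF using the smoothed formula that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The three documents are made-up shortened examples, and whitespace tokenization is a simplification of the vectorizer's own token pattern, so this is an illustration rather than a drop-in replacement.

```python
import math
from collections import Counter


def tfidf(docs, max_df=0.8):
    """Return (vocabulary, rows) where each row maps term -> TF-IDF weight."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # max_df drops terms that occur in more than this fraction of documents.
    vocab = sorted(t for t, df in doc_freq.items() if df / n_docs <= max_df)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            t: counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t in vocab
            if counts[t]
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize so every non-empty row has unit length.
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vocab, rows


# Made-up miniature corpus for illustration only.
docs = [
    "paclitaxel ovarian cancer diagnosis",
    "bevacizumab glioblastoma diagnosis",
    "sirolimus sarcoma diagnosis",
]
vocab, rows = tfidf(docs)
print(vocab)
print(rows[0])
```

In this toy corpus the word "diagnosis" appears in every document, so `max_df=0.8` removes it from the vocabulary, which is the same effect the setting has on near-universal terms in the real data.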

## Predictive Models
### Keras ANN Deep Learning Architecture

In [29]:
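classifier = Sequential()
# Keras 2 argument names: units replaces output_dim, kernel_initializer replaces init
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 35511))  # input_dim = TF-IDF vocabulary size
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))  # sigmoid output = probability of class 1
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])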
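
As a quick sanity check on the size of this architecture, the number of trainable parameters follows from the layer widths: each dense layer has one weight per input-unit pair plus one bias per unit. The sketch below does that arithmetic for the 35511, 6, 6, 1 layout; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [35511, 6, 6, 1]  # input features, two hidden layers, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [213072, 42, 7] 213121
```

Almost all of the parameters sit in the first layer, because its input width equals the TF-IDF vocabulary size.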

In [29]:
classifier.fit(xtrain_tfidf, ytrain, batch_size = 10000, epochs = 10)  # epochs replaces the deprecated nb_epoch




Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7f56eed4fef0>

### Saving the ANN Architecture

In [34]:
model_json=classifier.to_json()

In [35]:
with open("model.json","w") as json_file:
    json_file.write(model_json)

In [36]:
classifier.save_weights("model.h5")

### Loading Trained ANN Model 

In [37]:
json_file=open('model.json','r')

In [38]:
loaded_model_json=json_file.read()

In [41]:
json_file.close()
loaded_model=model_from_json(loaded_model_json)
loaded_model.load_weights('model.h5')

In [42]:
y_pred = loaded_model.predict_proba(xval_tfidf)
#y_pred = (y_pred > 0.5)

In [43]:
y_pred[0][0]

0.8764373
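
The model outputs probabilities, while the evaluation metrics below compare class labels, so the probabilities have to be thresholded first. A minimal sketch, assuming the conventional 0.5 cutoff; `to_labels` is a hypothetical helper, and the third probability is made up for illustration.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 if it exceeds the threshold, else 0."""
    return [int(p > threshold) for p in probabilities]


# First two values appear in this notebook's outputs; 0.12 is a made-up example.
labels = to_labels([0.8764373, 0.974883, 0.12])
print(labels)  # [1, 1, 0]
```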

### Model Evaluation

In [83]:
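# y_pred holds probabilities, so threshold at 0.5 to get 0/1 labels before comparing
confusion_matrix(yval, (y_pred > 0.5).astype(int).ravel())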

array([[87962, 12337],
       [12663, 87038]])

In [84]:
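y_pred_label = (y_pred > 0.5).astype(int).ravel()  # threshold probabilities into 0/1 labels
print('Accuracy Score :', accuracy_score(yval, y_pred_label))
print('Report : ')
print(classification_report(yval, y_pred_label))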

Accuracy Score : 0.875
Report : 
              precision    recall  f1-score   support

           0       0.87      0.88      0.88    100299
           1       0.88      0.87      0.87     99701

    accuracy                           0.88    200000
   macro avg       0.88      0.87      0.87    200000
weighted avg       0.88      0.88      0.87    200000
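
The reported scores can be reproduced by hand from the confusion matrix above. The sketch below assumes scikit-learn's layout, where rows are true classes and columns are predicted classes, and computes the metrics for class 1.

```python
# Confusion matrix counts from the notebook output:
# rows are true classes, columns are predicted classes.
tn, fp = 87962, 12337  # true class 0
fn, tp = 12663, 87038  # true class 1

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of predicted eligible, how many truly are
recall = tp / (tp + fn)     # of truly eligible, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)             # 0.875
print(round(precision, 2))  # 0.88
print(round(recall, 2))     # 0.87
print(round(f1, 2))         # 0.87
```

The row sums also match the `support` column of the report: 100299 samples of class 0 and 99701 of class 1.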



In [134]:
sample_text = xval.iloc[1]  # second record in the validation set
intervention = clin_trial.iloc[xval.index[1]]['study_intervention'].strip()
probability = classifier.predict_proba(tfidf_vectorizer.transform(pd.Series(sample_text)))[0][0]
print("For the study intervention of '"+intervention+"' the probability that the patient is eligible for this cancer trial is",probability)

For the study intervention of 'Bevacizumab' the probability that the patient is eligible for this cancer trial is 0.974883


In [63]:
xval.iloc[1]

'Bevacizumab  glioblastoma diagnosis and co medication that may interfere with study results immuno suppressive agents other than corticosteroids'

In [153]:
yval.iloc[1]

1

## END