# vLife Virtusa

## Clinical Trial Feasibility Analyzer Use Case
### Use Case Description
<b> Clinical trials are a foundational pillar of the drug discovery process. Roughly 1 in 10 drugs tested in human subjects receives FDA approval. Given the title and summary of a clinical trial, this tool predicts whether the trial will be approved. It can be used at the nascent stage of the clinical trial process to help researchers make better decisions. The tool is trained on around 1200 approved/rejected clinical trials. Model used: Deep Neural Network </b>

### Data Source
Data for this use case can be found [here](https://www.kaggle.com/c/clinical-trials/data).
### Dataset Description
<b> File description
 - train-clinical_trial.csv - the training set
 - test-clinical_trial.csv - the test set
 - sample_submission-clinical_trial.csv - a sample submission file in the correct format </b> 


### Package Import

In [1]:
import csv
import pickle
import pandas as pd
import numpy as np
import string
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

from DShap import DShap
from shap_utils import *
from Shapley import *
%matplotlib inline
MEM_DIR = './'


import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [None]:
df = pd.read_csv('train_clinical_perfect.csv')

In [2]:
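# Concatenate title and abstract; no separator is inserted, so the last word
# of the title is fused with the first word of the abstract.
df['summary']=df['title']+df['abstract']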

In [4]:
df.head()

|   | title | abstract | trial | summary |
|---|-------|----------|-------|---------|
| 0 | "Adjuvant chemotherapy followed by goserelin c... | "PURPOSE: The purpose of this article is to co... | 0 | "Adjuvant chemotherapy followed by goserelin c... |
| 1 | "Relaxation and guided imagery program in pati... | "OBJECTIVE: Treatment of breast cancer is usua... | 0 | "Relaxation and guided imagery program in pati... |
| 2 | "Effect of age and radiation dose on local con... | "PURPOSE: To determine whether the effect of a... | 0 | "Effect of age and radiation dose on local con... |
| 3 | "The effect of systemic adjuvant chemotherapy ... | "A randomised trial has previously been repeat... | 0 | "The effect of systemic adjuvant chemotherapy ... |
| 4 | "Analysis of time to response to chemotherapy ... | "Chemotherapy is a major tool for metastatic b... | 0 | "Analysis of time to response to chemotherapy ... |


## Exploratory Data Analysis
### Data Preprocessing
> Cleaning the text data before feeding it into the machine learning model

In [5]:
def clean_text(text):
    # Lowercase, then strip digits and punctuation.
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.strip()
    # Tokenize and drop English stop words.
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in stop_words]
    text = ' '.join(tokens)
    # Remove the remaining words of three characters or fewer.
    return re.sub(r'\b\w{1,3}\b', '', text)

In [8]:
df['clean_note'] = df['summary'].apply(lambda x: clean_text(x))
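
To see what each cleaning step does, here is a minimal, dependency-free sketch that mirrors `clean_text` on a single sentence. The names `clean_text_demo` and `DEMO_STOP_WORDS` are hypothetical: the tiny stop-word set and `str.split` stand in for NLTK's English stop-word list and `word_tokenize`, and the final whitespace collapse is an extra step that the notebook does not perform.

```python
import re
import string

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses NLTK's full English list instead.
DEMO_STOP_WORDS = {"the", "of", "is", "to", "and", "in", "with", "a"}

def clean_text_demo(text):
    text = text.lower()                                   # 1. lowercase
    text = re.sub(r"\d+", "", text)                       # 2. drop digits
    text = text.translate(
        str.maketrans("", "", string.punctuation))        # 3. drop punctuation
    tokens = [t for t in text.split()
              if t not in DEMO_STOP_WORDS]                # 4. drop stop words
    text = " ".join(tokens)
    text = re.sub(r"\b\w{1,3}\b", "", text)               # 5. drop words of 1-3 chars
    return " ".join(text.split())                         # extra: collapse spaces

example = "PURPOSE: To compare 6 courses of CMF with goserelin in 874 women."
cleaned = clean_text_demo(example)
print(cleaned)
```

Note that step 5 also removes short domain abbreviations such as "CMF", which may or may not be desirable for clinical text.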

In [9]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)
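
The two vectorizer settings can be illustrated on a toy corpus. In this sketch (the documents are made up for illustration), `max_df=0.8` discards any term that appears in more than 80% of the documents, while `max_features` only caps the vocabulary size and has no effect when the corpus has fewer distinct terms than the cap.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "trial" occurs in all 5 documents (100%),
# "breast" and "cancer" occur in 4 of 5 (exactly 80%).
docs = [
    "breast cancer trial chemotherapy",
    "breast cancer trial goserelin",
    "breast cancer trial radiation",
    "breast cancer trial tamoxifen",
    "lung trial docetaxel",
]

vec = TfidfVectorizer(max_df=0.8, max_features=10000)
X = vec.fit_transform(docs)

# Terms above the 80% document-frequency threshold are dropped;
# terms at exactly 80% are kept.
vocab = sorted(vec.vocabulary_)
print(vocab)
print(X.shape)
```

Terms that occur in nearly every abstract carry little information for separating the two classes, which is the usual motivation for setting `max_df` below 1.0.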

In [10]:
xtrain, xval, ytrain, yval = train_test_split(df['clean_note'].head(10000), df['trial'], test_size=0.2, random_state=9)

In [11]:
trainindex=xtrain.index

In [12]:
xtrain.to_csv('xtrain.csv')



In [13]:
xtrain1=pd.read_csv('xtrain.csv')
xtrain1=pd.Series(xtrain1.iloc[:,1])
type(xtrain1)

pandas.core.series.Series

In [14]:
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)
#xval_tfidf
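
The split between `fit_transform` on the training text and `transform` on the validation text matters: the vocabulary and IDF weights are learned from the training data only, and words that were never seen during fitting are ignored. A small sketch with made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "goserelin chemotherapy breast",
    "radiation breast",
    "chemotherapy lung",
]
new_doc = ["goserelin immunotherapy breast"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_new = vec.transform(new_doc)           # reuses them without refitting

# The new document has the same number of columns as the training matrix,
# and the unseen word "immunotherapy" contributes no feature.
print(X_train.shape, X_new.shape, X_new.nnz)
```

Because the column layout comes from the fitted vocabulary, a model trained on `X_train` can score `X_new` directly.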



## Predictive Models
> <b> Logistic Regression </b>

In [15]:
lr = LogisticRegression()

In [16]:
lr.fit(xtrain_tfidf, ytrain)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [17]:
joblib.dump(lr, 'lrmodel.pkl')



['lrmodel.pkl']

> <b> Saving Pickled Model </b>

In [21]:
with open('lr_model.pkl','wb') as f:
    pickle.dump(lr,f)

In [19]:
with open('lr_model.pkl','rb') as f:
    lr_model=pickle.load(f)
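
A quick way to check a save/load round trip is to confirm that the restored model gives the same probabilities as the original. The sketch below uses a tiny made-up dataset and a temporary directory, so it does not touch the notebook's files; the key point is that the same path is used for writing and reading.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset, just enough to fit a model.
X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lr_model.pkl")
    with open(path, "wb") as f:   # context managers close the file handles
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

same_predictions = np.array_equal(model.predict_proba(X), restored.predict_proba(X))
print(same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same library version that wrote them.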

In [20]:
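y_prob = lr_model.predict_proba(xval_tfidf)  # class probabilities, one row per validation sample
y_pred = lr_model.predict(xval_tfidf)        # predicted class labels, required by the metric functions
y_prob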

array([[0.53004681, 0.46995319],
       [0.81172901, 0.18827099],
       [0.87909135, 0.12090865],
       [0.77186929, 0.22813071],
       [0.59924578, 0.40075422],
       [0.43555987, 0.56444013],
       [0.13521447, 0.86478553],
       [0.21759483, 0.78240517],
       [0.65431187, 0.34568813],
       [0.14339209, 0.85660791],
       [0.93284162, 0.06715838],
       [0.80255654, 0.19744346],
       [0.79799753, 0.20200247],
       [0.23043444, 0.76956556],
       [0.70524282, 0.29475718],
       [0.33242916, 0.66757084],
       [0.63331282, 0.36668718],
       [0.3894813 , 0.6105187 ],
       [0.11807317, 0.88192683],
       [0.86156291, 0.13843709],
       [0.72628616, 0.27371384],
       [0.60049392, 0.39950608],
       [0.72197224, 0.27802776],
       [0.29063322, 0.70936678],
       [0.09334571, 0.90665429],
       [0.11397251, 0.88602749],
       [0.65152674, 0.34847326],
       [0.71399354, 0.28600646],
       [0.6765982 , 0.3234018 ],
       [0.14867257, 0.85132743],
       ...
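
Each row of the `predict_proba` output holds `[P(class 0), P(class 1)]`, while the metric functions used next expect class labels. For a binary model, picking the class with the larger probability is equivalent to what `predict` returns. A minimal sketch using the first six rows printed above:

```python
# First six rows of the predict_proba output shown above.
proba = [
    [0.53004681, 0.46995319],
    [0.81172901, 0.18827099],
    [0.87909135, 0.12090865],
    [0.77186929, 0.22813071],
    [0.59924578, 0.40075422],
    [0.43555987, 0.56444013],
]

# Label 1 when P(class 1) is the larger of the two probabilities.
labels = [1 if p1 > p0 else 0 for p0, p1 in proba]

# Sanity check: each row is a probability distribution over the two classes.
row_sums = [round(p0 + p1, 6) for p0, p1 in proba]

print(labels)
```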

In [27]:
f1_score(yval, y_pred, average="micro")

0.8566666666666667

In [29]:
results = confusion_matrix(yval, y_pred) 
print ('Confusion Matrix :')
print(results) 
print ('Accuracy Score :',accuracy_score(yval, y_pred)) 
print ('Report : ')
print (classification_report(yval, y_pred) )

Confusion Matrix :
[[148  12]
 [ 31 109]]
Accuracy Score : 0.8566666666666667
Report : 
              precision    recall  f1-score   support

           0       0.83      0.93      0.87       160
           1       0.90      0.78      0.84       140

    accuracy                           0.86       300
   macro avg       0.86      0.85      0.85       300
weighted avg       0.86      0.86      0.86       300
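
The figures in the report can be reproduced by hand from the confusion matrix, which is a useful check on how to read it. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes). It also shows why the micro-averaged F1 score equals the accuracy here: for single-label classification, micro-averaging counts every correct prediction once.

```python
# Confusion matrix from the output above.
cm = [[148, 12],
      [31, 109]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy={accuracy:.4f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The lower recall for class 1 reflects the 31 approved trials that the model labeled as rejected.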



In [77]:
df.loc[trainindex[3]]

title         "Dexrazoxane cardioprotection in advanced brea...
abstract      "PURPOSE: We performed a randomized trial to e...
trial                                                         0
summary       "Dexrazoxane cardioprotection in advanced brea...
clean_note    dexrazoxane cardioprotection advanced breast c...
Name: 1065, dtype: object

In [56]:
lr_model=joblib.load('lrmodel.pkl')

> <b> Applying the TF-IDF Vectorizer to a New Sample </b> 

In [83]:
l=[]
s="adjuvant chemotherapy followed goserelin compared either modality alone impact amenorrhea  flashes quality life premenopausal patientsthe international breast cancer study group trial viiipurpose purpose article compare quality life  menopausal symptoms among premenopausal patients lymph nodenegative breast cancer receiving chemotherapy goserelin sequential combination investigate differential effects  patients methods evaluated  data  perimenopausal women lymph nodenegative breast cancer randomly assigned receive  courses classical cyclophosphamide methotrexate fluorouracil  chemotherapy ovarian suppression goserelin months  courses classical  followed months goserelin report  data collected years random assignment patients without disease recurrence results overall patients receiving goserelin alone showed marked improvement less deterioration  measures first months patients treated  differences years random assignment according treatment except  flashes reflected  flashes scores patients three treatment groups experienced induced amenorrhea onset ovarian function suppression slightly delayed patients receiving chemotherapy younger patients years received goserelin alone returned premenopausal status months cessation therapy received  showed marginal changes baseline  flashes scores conclusion ageadjusted risk profiles consider patientreported outcomes enable patients adapt disease treatment considering tradeoffs delayed endocrine symptoms higher risk permanent menopause chemotherapy immediate reversible endocrine symptoms goserelin younger premenopausal patients"
l.append(s)
l=pd.Series(l)
type(l)

pandas.core.series.Series

In [75]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)

In [80]:
traintfidf1 = tfidf_vectorizer.fit_transform(xtrain)

In [100]:
traintfidf=tfidf_vectorizer.transform(l)
traintfidf

<1x9469 sparse matrix of type '<class 'numpy.float64'>'
	with 108 stored elements in Compressed Sparse Row format>

In [101]:
lr_model.predict_proba(traintfidf)

array([[0.72859767, 0.27140233]])

### Display Predicted Results

In [103]:
print("The probability of clinical trial "+ df.loc[trainindex[5], 'title'] + " getting approved is "+str(lr_model.predict_proba(traintfidf))+"\n")

The probability of clinical trial "PXR, CAR and HNF4alpha genotypes and their association with pharmacokinetics and pharmacodynamics of docetaxel and doxorubicin in Asian patients." getting approved is [[0.72859767 0.27140233]]



In [81]:
print("The probability of clinical trial "+ df.loc[trainindex[5], 'title'] + " getting approved is "+str(lr.predict_proba(xval_tfidf[5])[0][1])+"\n")

The probability of clinical trial "PXR, CAR and HNF4alpha genotypes and their association with pharmacokinetics and pharmacodynamics of docetaxel and doxorubicin in Asian patients." getting approved is 0.5644401273631034



In [82]:
print("The probability of clinical trial "+ df.loc[trainindex[8], 'title'] + " getting approved is "+str(lr.predict_proba(xval_tfidf[8])[0][1])+"\n")

The probability of clinical trial "A comparative study of exemestane versus anastrozole in patients with postmenopausal breast cancer with visceral metastases." getting approved is 0.34568812907446583
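
The cells above take the title from `trainindex` while scoring rows of `xval_tfidf`, so the printed title and probability come from different samples. One way to keep them aligned is to look up the title through the validation index, since `train_test_split` preserves the original row labels. The sketch below uses a hypothetical stand-in DataFrame (`df_demo`) and a probability row copied from the output above purely to show the formatting.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the notebook's DataFrame.
df_demo = pd.DataFrame({
    "title": [f"Trial {i}" for i in range(10)],
    "clean_note": [f"text {i}" for i in range(10)],
    "trial": [0, 1] * 5,
})
xtr, xv, ytr, yv = train_test_split(
    df_demo["clean_note"], df_demo["trial"], test_size=0.2, random_state=9
)

# Row i of the validation matrix corresponds to label xv.index[i] in the DataFrame.
i = 0
title = df_demo.loc[xv.index[i], "title"]

# Probability row in predict_proba layout: [P(class 0), P(class 1)].
proba_row = [0.43555987, 0.56444013]

# The ".1%" format multiplies by 100, so the percent sign is correct.
msg = f"The probability of clinical trial {title} getting approved is {proba_row[1]:.1%}"
print(msg)
```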



In [18]:
array_train=xtrain_tfidf.toarray()

In [19]:
array_test=xval_tfidf.toarray()

In [20]:
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(ytrain)
Test_Y = Encoder.transform(yval)  # reuse the mapping fitted on the training labels

In [21]:
print(type(array_train))
print(type(Train_Y))
print(type(array_test))
print(type(Test_Y))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


### Application of Data Shapley
><b> Using Logistic Regression </b>

In [24]:
model = 'logistic'
problem = 'classification'
num_test = 1000
directory = './temp_NLP2'
dshap = DShap(array_train, Train_Y,array_test, Test_Y, num_test, sources=None, model_family=model, metric='accuracy',
              directory=directory, seed=0)

dshap.run(100, 0.1)

Starting LOO score calculations!
LOO values calculated!
10 out of 100 G-Shapley iterations
...
100 out of 100 G-Shapley iterations
10 out of 100 TMC_Shapley iterations.
...
100 out of 100 TMC_Shapley iterations.
... (further G-Shapley and TMC_Shapley rounds; output truncated)

In [25]:
model = 'logistic'
problem = 'classification'
num_test = 1000
directory = './temp_NLP2'
dshap = DShap(array_train, Train_Y,array_test, Test_Y, num_test, model_family=model, metric='accuracy',
              directory=directory, seed=1)

dshap.run(100, 0.1)

LOO values calculated!
10 out of 100 G-Shapley iterations
...
100 out of 100 G-Shapley iterations
10 out of 100 TMC_Shapley iterations.
...
100 out of 100 TMC_Shapley iterations.
10 out of 100 G-Shapley iterations
...
70 out of 100 G-Shapley iterations


MemoryError: 

In [None]:
model = 'logistic'
problem = 'classification'
num_test = 1000
directory = './temp_NLP2'
dshap = DShap(array_train, Train_Y,array_test, Test_Y, num_test, model_family=model, metric='accuracy',
              directory=directory, seed=2)
dshap.run(100, 0.1)
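
The idea behind the Monte Carlo Shapley estimates reported in the log can be sketched in a few lines: shuffle the training points, add them one at a time, and credit each point with the change in validation score it causes. The code below is an illustrative sketch only, not the `DShap` implementation. It uses a made-up 1-D dataset, a 1-nearest-neighbour classifier as the model, an assumed score of 0.5 for an empty training set, and it omits the truncation and convergence checks that a TMC-style estimator adds.

```python
import random

# Made-up 1-D training set; the last point is deliberately mislabeled.
train_x = [0.0, 0.2, 0.9, 1.0, 0.12]
train_y = [0, 0, 1, 1, 1]
test_x = [0.05, 0.15, 0.97, 0.85]
test_y = [0, 0, 1, 1]

def utility(subset):
    """Test accuracy of a 1-nearest-neighbour classifier trained on `subset`."""
    if not subset:
        return 0.5  # assumed chance-level score for an empty training set
    correct = 0
    for x, y in zip(test_x, test_y):
        nearest = min(subset, key=lambda i: abs(train_x[i] - x))
        correct += int(train_y[nearest] == y)
    return correct / len(test_x)

def monte_carlo_shapley(n_points, utility, n_permutations=200, seed=0):
    """Average each point's marginal contribution over random orderings."""
    rng = random.Random(seed)
    values = [0.0] * n_points
    for _ in range(n_permutations):
        order = list(range(n_points))
        rng.shuffle(order)
        subset = []
        previous = utility(subset)
        for idx in order:
            subset.append(idx)
            current = utility(subset)
            values[idx] += current - previous  # marginal contribution of idx
            previous = current
    return [v / n_permutations for v in values]

values = monte_carlo_shapley(len(train_x), utility)
full_gain = utility(list(range(len(train_x)))) - utility([])

for idx, value in enumerate(values):
    print(f"point {idx}: x={train_x[idx]} label={train_y[idx]} value={value:+.3f}")
```

Within every permutation the marginal contributions telescope, so the estimated values always sum to the score of the full training set minus the score of the empty set. Points with low or negative values are natural candidates for inspection. Each permutation retrains the model once per training point, which helps explain why the cost grows quickly with dataset size.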

In [24]:
title="Adjuvant chemotherapy followed by goserelin compared with either modality alone: the impact on amenorrhea, hot flashes, and quality of life in premenopausal patients--the International Breast Cancer Study Group Trial VIII"
abstract="PURPOSE: The purpose of this article is to compare quality of life (QOL) and menopausal symptoms among premenopausal patients with lymph node-negative breast cancer receiving chemotherapy, goserelin, or their sequential combination, and to investigate differential effects by age. PATIENTS AND METHODS: We evaluated QOL data from 874 pre- and perimenopausal women with lymph node-negative breast cancer who were randomly assigned to receive six courses of classical cyclophosphamide, methotrexate, and fluorouracil (CMF) chemotherapy, ovarian suppression with goserelin for 24 months, or six courses of classical CMF followed by 18 months of goserelin. We report QOL data collected during 3 years after random assignment in patients without disease recurrence. RESULTS: Overall, patients receiving goserelin alone showed a marked improvement or less deterioration in QOL measures over the first 6 months than those patients treated with CMF. There were no differences at 3 years after random assignment according to treatment except for hot flashes. As reflected in the hot flashes scores, patients in all three treatment groups experienced induced amenorrhea, but the onset of ovarian function suppression was slightly delayed for patients receiving chemotherapy. Younger patients (< 40 years) who received goserelin alone returned to their premenopausal status at 6 months after the cessation of therapy, while those who received CMF showed marginal changes from their baseline hot flashes scores. CONCLUSION: Age-adjusted risk profiles that consider patient-reported outcomes enable patients to adapt to their disease and treatment, such as considering the trade-offs between delayed endocrine symptoms, but higher risk of permanent menopause with chemotherapy, and immediate but reversible endocrine symptoms with goserelin, in younger premenopausal patients."
summary=title+abstract
# Apply the same cleaning steps used for the training text.
sample=pd.Series([clean_text(summary)])
# Refit the vectorizer on the saved training text so the features match the trained model.
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)
xtrain=pd.read_csv('xtrain.csv',header=None)
xtrain=pd.Series(xtrain.iloc[:,1])
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf=tfidf_vectorizer.transform(sample)
lr_model=joblib.load('lrmodel.pkl')
# predict_proba returns a fraction between 0 and 1, so no percent sign is appended.
res="The probability of clinical trial "+ title + " getting approved is "+str(lr_model.predict_proba(xval_tfidf)[0][1])

In [25]:
res

'The probability of clinical trial Adjuvant chemotherapy followed by goserelin compared with either modality alone: the impact on amenorrhea, hot flashes, and quality of life in premenopausal patients--the International Breast Cancer Study Group Trial VIII getting approved is 0.2714023333657697'

## END