# Notebook for CS598 Final Project 
Author: Binbin Weng (binbinw2@illinois.edu)

## 1. Introduction

The method to implement is from the paper, `A disease inference method based on symptom extraction and bidirectional Long Short Term Memory networks`, which introduces a method of multi-label classifier for disease inference with clinical text data. 

Since the data used in the paper is clinical text data, symptoms are first extracted from the clinical text data, and then a deep learning model, bidirectional Long Short Term Memory network (BiLSTM), with two different representations of the extracted symptoms, representations of TF-IDF (term frequency-inverse document frequency) and Word2Vec (an embedding method), is built to improve the performance of the classifier. 

Using this two representations of the extracted symptoms is because the method of TF-IDF can reflect the relation between symptom and target disease and the method of Word2Vec can reflect the relation between symptom and symptom, so that more information are involved in the modeling, which improves the performance of the classifier.

## 2. Data
The datasets used in the paper are two tables, `NOTEEVENTS` and `DIAGNOSES_ICD`, from MIMIC-III, which can be accessed through https://physionet.org/content/mimiciii/1.4/. 

After you access the datasets, please save it to the same folder as you save the notebook.

Per the paper, the clinical texts used are the discharge summaries.

The statistics about the two tables are as the follows:


In [6]:
import os
import pickle
import random
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import time
import re
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.feature_extraction.text import CountVectorizer
import math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import gensim
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from torch.utils.data import DataLoader
import csv

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/wangguanshen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
################## set seeds ##########################
seed = 2023
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)

In [7]:
########## processing NOTEVENTS Table #######################################
noteevents = pd.read_csv('NOTEEVENTS.csv.gz')
print("The shape of NOTEEVENTS table is: ", noteevents.shape)
print("Categories of the clinical text:")
print(noteevents['CATEGORY'].value_counts())
# select only data with the category of Discharge summary
discharge_summary_data = noteevents[noteevents['CATEGORY']=='Discharge summary']
print("The shape of Discharge Summary table is: ", discharge_summary_data.shape)
# subject_id is patient id, hadm_id is visit id
print("Number of unique patients in the discharge summary table: ",discharge_summary_data['SUBJECT_ID'].nunique())
print("Number of unique visits in the discharge summary table: ",discharge_summary_data['HADM_ID'].nunique())
#drop unnecessary columns from discharge summary data
discharge_summary_text = discharge_summary_data.drop(['ROW_ID','CHARTDATE','CHARTTIME','STORETIME','CATEGORY','DESCRIPTION','CGID','ISERROR'],axis =1)
#change datatype for following processing
discharge_summary_text['SUBJECT_ID'] = discharge_summary_text['SUBJECT_ID'].astype(str)
# function to concatenate contents in each column with the same 'HADM_ID'
def concat_values(group):
    return pd.Series({
        'SUBJECT_concat': ' '.join(group['SUBJECT_ID']),
        'TEXT_concat': ' '.join(group['TEXT'])
    })
discharge_summary_text = discharge_summary_text.groupby('HADM_ID').apply(concat_values)
discharge_summary_text = discharge_summary_text.reset_index()

  noteevents = pd.read_csv('NOTEEVENTS.csv.gz')


The shape of NOTEEVENTS table is:  (2083180, 11)
Categories of the clinical text:
Nursing/other        822497
Radiology            522279
Nursing              223556
ECG                  209051
Physician            141624
Discharge summary     59652
Echo                  45794
Respiratory           31739
Nutrition              9418
General                8301
Rehab Services         5431
Social Work            2670
Case Management         967
Pharmacy                103
Consult                  98
Name: CATEGORY, dtype: int64
The shape of Discharge Summary table is:  (59652, 11)
Number of unique patients in the discharge summary table:  41127
Number of unique visits in the discharge summary table:  52726


In [8]:
########## processing DIAGNOSES_ICD Table #######################################
dig = pd.read_csv('DIAGNOSES_ICD.csv.gz')
print("The shape of DIAGNOSES_ICD table is: ",dig.shape)
print("Number of unique patients in the DIAGNOSES_ICD table: ", dig['SUBJECT_ID'].nunique())
print("Number of unique visits in the DIAGNOSES_ICD table: ", dig['HADM_ID'].nunique())
# drop unnecessary columns from diagnoses data
diagnoses_data = dig.drop(['ROW_ID','SEQ_NUM'],axis = 1)
#change datatype for following processing
diagnoses_data['SUBJECT_ID'] = diagnoses_data['SUBJECT_ID'].astype(str)
diagnoses_data['ICD9_CODE'] = diagnoses_data['ICD9_CODE'].astype(str)
# function to concatenate contents in each column with the same 'HADM_ID'
def concat_values2(group):
    return pd.Series({
        'SUBJECT_concat': ' '.join(group['SUBJECT_ID']),
        'ICD_concat_CODEs': ','.join(group['ICD9_CODE'])
    })
diagnoses_data = diagnoses_data.groupby('HADM_ID').apply(concat_values2)
diagnoses_data = diagnoses_data.reset_index()

The shape of DIAGNOSES_ICD table is:  (651047, 5)
Number of unique patients in the DIAGNOSES_ICD table:  46520
Number of unique visits in the DIAGNOSES_ICD table:  58976


## 3. Extract symptoms from discharge summaries.
The purpose of the paper is to use the clinical text data to build a multi-label classifier for disease inference. The first thing needs to do is to extract symptoms from the discharge summaries. It uses the techniques introduced by MetaMap to extract symptoms from the clinical text data. 

MetaMap provides a database which you can download and set up the environment for it so that you can run the extraction from your local side. Please go to the website page, https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/run-locally/MainDownload.html, to download related database and follow the instructions to set up the environment.  

However, this method is very slow. I tried it. It took about 25 CPU hours to extract symptoms from about 1000 records. The dataset contains about 50,000 records, so obviously this method is not efficient enough.

Luckily, MetaMap also provides the service of Batch MetaMap, where you can upload your file and it will help extract symptoms from your file and return results to you. So the following part of code is to prepare the data which you can upload to Batch MetaMap. After running the following part of code, two files will be generated, `merged_data.csv` and `data_for_metamapBatch_full.txt`. The file, `merged_data.csv` is for later modeling use which contains labels for each records. The file, `data_for_metamapBatch_full.txt`, is to submit in Batch MetaMap to get the symptoms for each records.

In [None]:
import prepare_data_for_MetaMap
prepare_data_for_MetaMap.processing_text_for_MetaMap(discharge_summary_text,diagnoses_data)


## 4. Get symptoms from Batch MetaMap
Register an account on National Library of Medicine, https://www.nlm.nih.gov/. It may take some time to review your registration.

After your registration gets approved, go to https://ii.nlm.nih.gov/Batch/UTS_Required/MetaMap.html,Enter, enter FULL Email Address (which is used to contact you when the job is finished), upload the data_for_metamapBatch_full.txt from the previous step, in the Out/Display Options select Fielded MMI output (-N), in the Batch Specific Options select Single Line Delimited Input w/ ID, in the I would like to only use specific Semantic Types, enter `sosy,dsyn,neop,fngs,bact,virs,cgab,acab,lbtr,inpo,mobd,comd,anab`, then submit Batch MetaMap. It then will give you a link to track your job. It takes about 3 days to process the data. When the job finishes, you will receive an email containing link for the output files. Download the `text.out` file and `text.out.ERR` file to the same folder you store your other data.

## 5. Symptoms from Batch MetaMap
Running the following part of code will help process the symptoms gotten from Batch MetaMap

In [None]:
import prepare_symptoms
symptoms = pd.read_csv('text.out.txt', delimiter='|', header=None,
                       usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
symptoms_final = prepare_symptoms.symptoms(symptoms)

In [14]:
import prepare_symptoms
symptoms = pd.read_csv('text.out.txt', delimiter='|', header=None,
                       usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
symptoms_final = prepare_symptoms.symptoms(symptoms)
print("The shape of symptoms table is: ", symptoms_final.shape)

  symptoms = pd.read_csv('text.out.txt', delimiter='|', header=None,


The shape of symptoms table is:  (47887, 3)


## 6: Prepare data for modeling
After we get the results from symptoms, we need to merge the symptoms with the labels. Then we can start prepare the data for modeling.

1. Per the paper, just keep the first 50 symotoms if the number of symotoms in an observation is over 50; delete the observations with only one symotom
2. Per the paper, just keep the most common 50 diseases, because the observations with the most common 50 diseases take 97% of the dataset
3. Prepare the TF-IDF matrix. The way to get the TF-IDF matrix is as the follows:

Each symptom i will be represented as vector ${S_i= (W_{i,1},W_{i,2},...,W_{i,d})}$, where ${W_{i,j}}$ is the strength of the association between symptom i and disease j. More specifically, ${W_{i,j}=TF_{i,j}*log{\frac{N}{D_i}}}$, where $TF_{i,j}$ is the number of symptom i in the clinical texts data correlated with disease j, $N$ is the number of all diseases, $D_i$ is the number of diseases associated with symptom i. So the matrix should be of the shape (number of symptoms, number of diseases) 

4. Build the word2vec model for symtoms.
5. Prepare forward and backward for both TF-IDF representation and word2vec representation of the symptoms

In [15]:
################## merged the symptoms with the ICD codes with each visit ##########################
merged_data = pd.read_csv('merged_data.csv')
merged_data = merged_data.reset_index()
symptom_ICD = pd.merge(merged_data,symptoms_final,on = 'index',how = 'inner')
symptom_ICD.to_csv('symptom_ICD.csv',index = False)
print("The number of unique visits after merging with symptoms is: ", symptom_ICD.shape)

The number of unique visits after merging with symptoms is:  (47887, 9)


In [None]:
################## select symptoms and diseases ###############################################
import prepare_data_for_modeling
symptom_ICD = prepare_data_for_modeling.select_symptoms(symptom_ICD)
symptom_ICD = prepare_data_for_modeling.select_diseases(symptom_ICD)

In [18]:
################### generate dummy variables for symptoms and ICD codes #######################
s1 = symptom_ICD.shape[1]
symptom_ICD_new = (symptom_ICD['symptom_code_list'].apply(lambda x:','.join(x))).str.get_dummies(',')
symptom_ICD_new = symptom_ICD.join(symptom_ICD_new)
e1 = symptom_ICD_new.shape[1]

s2 = symptom_ICD_new.shape[1]
symptom_ICD_new = symptom_ICD_new.join((symptom_ICD_new['common50_icds'].apply(lambda x:','.join(x))).str.get_dummies(','))
e2 = symptom_ICD_new.shape[1]

In [19]:
################## calculate TF-IDF Matrix #########################################
start_time = time.time()
X = np.array(symptom_ICD_new.iloc[:,s1:e1])
Y = np.array(symptom_ICD_new.iloc[:,s2:e2])
print("X shape: ", X.shape)
print("Y shape: ", Y.shape)
item_tf =  np.zeros((X.shape[1],Y.shape[1]))
for i in range(X.shape[1]):
    for j in range(Y.shape[1]):
        item_tf[i,j] = sum(X[:,i]*Y[:,j])
D = []
for i in range(X.shape[1]):
    ans = 0
    for j in range(Y.shape[1]):
        if sum(X[:,i]*Y[:,j])>0:
            ans+=1
    D.append(ans)
W = np.zeros((X.shape[1],Y.shape[1]))
for i in range(X.shape[1]):
    for j in range(Y.shape[1]):
        W[i,j]=item_tf[i,j]*math.log(Y.shape[1]/D[i])
print("shape of W matrix: ",W.shape)
np.savetxt("tfidf_matrix.csv",W,delimiter = ',')
end_time = time.time()
diff_time = end_time-start_time
print("time spent on calculating TF-IDF matrix: ", diff_time)

X shape:  (46659, 18146)
Y shape:  (46659, 50)
shape of W matrix:  (18146, 50)
time spent on calculating TF-IDF matrix:  5243.654246807098


In [20]:
################## build word2vec model #########################################
start_time = time.time()
sentences = symptom_ICD_new['symptom_code_list'].apply(lambda x: [y.replace(" ","") for y in x])
embedding_model = Word2Vec(sentences, sg=1, window=5, vector_size=128, min_count=1, workers=4)
embedding_model.save('word2vec_model')
end_time = time.time()
diff_time = end_time-start_time
print("time spent on word2vec: ",diff_time)

time spent on word2vec:  18.809247970581055


In [21]:
################## prepare X ###################################################
numOfICD = 50
# the representation of TF-IDF
col = list(symptom_ICD_new.columns)
col = col[s1:e1]
col_map = {}
for i in range(len(col)):
    col_map[col[i].replace(" ","")] = i
symptom_ICD_new['tfidf_x'] = list(np.zeros((symptom_ICD_new.shape[0],50,numOfICD)))
symptom_ICD_new['tfidf_x_rev'] = list(np.zeros((symptom_ICD_new.shape[0],50,numOfICD)))
for i in range(symptom_ICD_new.shape[0]): 
    l = [x.replace(" ","") for x in symptom_ICD_new['selected_symtom_code'][i]]
    for j in range(len(l)):
        symptom_ICD_new['tfidf_x'][i][j] = W[col_map[l[j]],:]
        symptom_ICD_new['tfidf_x_rev'][i][len(l)-j-1]= W[col_map[l[j]],:]
        
# the representation of word2vec
symptom_ICD_new['word2vec_x'] = list(np.zeros((symptom_ICD_new.shape[0],50,128)))
symptom_ICD_new['word2vec_x_rev'] = list(np.zeros((symptom_ICD_new.shape[0],50,128)))
for i in range(symptom_ICD_new.shape[0]): 
    l = [x.replace(" ","") for x in symptom_ICD_new['selected_symtom_code'][i]]
    for j in range(len(l)):
        symptom_ICD_new['word2vec_x'][i][j] = embedding_model.wv[l[j]]
        symptom_ICD_new['word2vec_x_rev'][i][len(l)-j-1]= embedding_model.wv[l[j]]

In [22]:
################## prepare Y ###################################################
symptom_ICD_new['y'] = symptom_ICD_new.apply(lambda row: row.iloc[s2:e2].to_list(), axis = 1)

In [23]:
################### dataframe for X and Y #######################################
final_df = symptom_ICD_new.loc[:,['tfidf_x','tfidf_x_rev','word2vec_x','word2vec_x_rev','y']]

In [24]:
######################## split data into training (80%) and testing data (20%) #################
X_forward_tfidf = torch.tensor(final_df["tfidf_x"]).float()
X_backward_tfidf = torch.tensor(final_df["tfidf_x_rev"]).float()
X_forward_word2vec = torch.tensor(final_df["word2vec_x"]).float()
X_backward_word2vec = torch.tensor(final_df["word2vec_x_rev"]).float()
Y = torch.tensor(final_df["y"].tolist()).float()
X_train_forward_tfidf, X_test_forward_tfidf, X_train_backward_tfidf, X_test_backward_tfidf,X_train_forward_word2vec, X_test_forward_word2vec, X_train_backward_word2vec, X_test_backward_word2vec, y_train, y_test = train_test_split(X_forward_tfidf, X_backward_tfidf, X_forward_word2vec, X_backward_word2vec, Y, test_size=0.2, random_state=2023)

  X_forward_tfidf = torch.tensor(final_df["tfidf_x"]).float()


## 7: Modeling and evaluating
After prepare the data for modeling, in the following part, we will build and compare several models with the prepared data. 

The models are:
1. the bidirectional LSTM with the X representation of TF-IDF
2. the bidirectional LSTM with the X representation of word2vec
3. simply combine the results from the two models above
4. training two bidirectional LSTMs with the X representations of both TF-IDF and word2vec together
5. training two bidirectional GRUs with the X representations of both TF-IDF and word2vec together

The reason that bidirection LSTM is chosen is becuase the data used here is text data, which has the charactistics of seriality, and LSTM is a good deep learning model to catch this charactistics. In addition, bidirectional LSTM can catch the sequential information from both the forward side and backward side.

The structure of the bidirectional LSTM is as the follows:
<table>
<thead>
<tr>
<th>Layers</th>
<th>Configuration</th>
<th>Activation Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>Forward_LSTM</td>
<td>input size 50, hidden size 100</td>
<td>-</td>
</tr>
<tr>
<td>Backward_LSTM</td>
<td>input size 50, hidden size 100</td>
<td>-</td>
</tr>
<tr>
<td>Dropout</td>
<td>probability 0.8</td>
<td>-</td>
</tr>
<tr>
<td>Full connected</td>
<td>input size 200, output size 50</td>
<td>Sigmoid</td>
</tr>
</tbody>
</table>

The matrics to evaluate the models are four micro-averaging measurements: Precison (MiP), Recall (MiR), F1-score (MiF1) and AUC (area under the receiver operating characteristic curve). The measurements are defined as the follows:

$MiP = \frac{\sum_{i,j}{y^j_i}{\hat{y}^j_i}}{\sum_{i,j}{\hat{y}^j_i}}$

$MiR = \frac{\sum_{i,j}{y^j_i}{\hat{y}^j_i}}{\sum_{i,j}{y^j_i}}$

$MiF1=\frac{2*MiP*MiR}{MiP+MiR}$

In [25]:
class BiLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(BiLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size,num_layers,batch_first=True, bidirectional=False)
        self.fc = nn.Linear(hidden_size*2, output_size)
        self.dropout = nn.Dropout(0.8)
        self.sigmoid = nn.Sigmoid()

    def forward(self, forward_input, backward_input):
        h0_forward = torch.zeros(self.num_layers, forward_input.size(0), self.hidden_size)
        c0_forward = torch.zeros(self.num_layers, forward_input.size(0), self.hidden_size)
        h0_backward = torch.zeros(self.num_layers, backward_input.size(0), self.hidden_size)
        c0_backward = torch.zeros(self.num_layers, backward_input.size(0), self.hidden_size)
        out_forward, _ = self.lstm(forward_input, (h0_forward, c0_forward))
        out_backward, _ = self.lstm(backward_input, (h0_backward, c0_backward))
        output = torch.cat((out_forward[:,-1,:], out_backward[:,0,:]), dim=1)
        output = self.dropout(output)
        output = self.fc(output)
        output = self.sigmoid(output)
        return output

In [26]:
def evaluateModel(pred,truth,cutoff):
    label = np.array(((pred>cutoff)*1.0))
    truth2 = np.array(truth)
    acc = np.sum(label==truth2)/np.prod(truth2.shape)
    mip = np.sum(label*truth2)/np.sum(label)
    mir = np.sum(label*truth2)/np.sum(truth2)
    mif1 = (2*mip*mir)/(mip+mir)
    auc = roc_auc_score(truth2,np.array(pred))
    return [acc,mip,mir,mif1,auc]

In [27]:
results = pd.DataFrame(columns=['Model', 'Accuracy', 'MiP', 'MiR', 'MiF1', 'AUC'])

In [28]:
##################### build model1: the bidirectional LSTM with the X representation of TF-IDF ################
start_time = time.time()
train_data = torch.utils.data.TensorDataset(X_train_forward_tfidf, X_train_backward_tfidf, y_train)
train_loader = DataLoader(train_data, batch_size=400, shuffle=True)
model = BiLSTM(X_train_forward_tfidf.shape[2], 100, 1,y_train.shape[1])
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(),lr = 0.001)
num_epochs = 100
for epoch in range(num_epochs):
    running_loss = 0.0
    i = 0
    for i, data in enumerate(train_loader, 0):
        inputs_forward, inputs_backward, labels = data
        optimizer.zero_grad()
        outputs = model(inputs_forward, inputs_backward)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    if (epoch+1)%10 ==0:
        print('[Epoch %d/100] loss: %.3f' %
              (epoch + 1, running_loss/i))
end_time = time.time()
diff_time = end_time-start_time
print("time spent on training the model1 is: ",diff_time)

[Epoch 10/100] loss: 0.321
[Epoch 20/100] loss: 0.315
[Epoch 30/100] loss: 0.309
[Epoch 40/100] loss: 0.303
[Epoch 50/100] loss: 0.300
[Epoch 60/100] loss: 0.297
[Epoch 70/100] loss: 0.294
[Epoch 80/100] loss: 0.292
[Epoch 90/100] loss: 0.291
[Epoch 100/100] loss: 0.289
time spent on training the model1 is:  2451.0875160694122


In [30]:
start_time = time.time()
model.eval()
with torch.no_grad():
    y_pred_tfidf = model(X_test_forward_tfidf, X_test_backward_tfidf)
y_pred_tfidf = y_pred_tfidf.detach().numpy()
rslt = evaluateModel(y_pred_tfidf,y_test,0.2)
results.loc[len(results.index)]=['BiLSTM with TF-IDF',rslt[0],rslt[1],rslt[2],rslt[3],rslt[4]]
end_time = time.time()
diff_time = end_time-start_time
print("time spent on evaluating the model1 is: ",diff_time)

time spent on evaluating the model1 is:  1.9696598052978516


In [31]:
print(results)

                Model  Accuracy       MiP       MiR      MiF1       AUC
0  BiLSTM with TF-IDF   0.84488  0.418782  0.595707  0.491817  0.804495


In [32]:
##################### build model2: the bidirectional LSTM with the X representation of word2vec ################
start_time = time.time()
train_data = torch.utils.data.TensorDataset(X_train_forward_word2vec, X_train_backward_word2vec, y_train)
train_loader = DataLoader(train_data, batch_size=400, shuffle=True)
model2 = BiLSTM(X_train_forward_word2vec.shape[2], 100, 1,y_train.shape[1])
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model2.parameters(),lr = 0.001)
num_epochs = 100
for epoch in range(num_epochs):
    running_loss = 0.0
    i = 0
    for i, data in enumerate(train_loader, 0):
        inputs_forward, inputs_backward, labels = data
        optimizer.zero_grad()
        outputs = model2(inputs_forward, inputs_backward)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    if (epoch+1)%10 ==0:
        print('[Epoch %d/{num_epochs}] loss: %.3f' %
              (epoch + 1, running_loss/i))
end_time = time.time()
diff_time = end_time-start_time
print("time spent on training model2 is: ",diff_time)

[Epoch 10/{num_epochs}] loss: 0.316
[Epoch 20/{num_epochs}] loss: 0.308
[Epoch 30/{num_epochs}] loss: 0.301
[Epoch 40/{num_epochs}] loss: 0.297
[Epoch 50/{num_epochs}] loss: 0.293
[Epoch 60/{num_epochs}] loss: 0.290
[Epoch 70/{num_epochs}] loss: 0.288
[Epoch 80/{num_epochs}] loss: 0.285
[Epoch 90/{num_epochs}] loss: 0.283
[Epoch 100/{num_epochs}] loss: 0.281
time spent on training model2 is:  3197.1380009651184


In [43]:
start_time = time.time()
model2.eval()
with torch.no_grad():
    y_pred_word2vec = model2(X_test_forward_word2vec, X_test_backward_word2vec)
y_pred_word2vec = y_pred_word2vec.detach().numpy()
rslt = evaluateModel(y_pred_word2vec,y_test,0.2)
results.loc[len(results.index)]=['BiLSTM with word2vec',rslt[0],rslt[1],rslt[2],rslt[3],rslt[4]]
end_time = time.time()
diff_time = end_time-start_time
print("time spent on evaluating the model is: ",diff_time)

time spent on evaluating the model is:  2.74979305267334


In [44]:
print(results)

                  Model  Accuracy       MiP       MiR      MiF1       AUC
0    BiLSTM with TF-IDF  0.844880  0.418782  0.595707  0.491817  0.804495
1  BiLSTM with word2vec  0.861234  0.460096  0.583818  0.514625  0.814443


In [45]:
################## build model3: simply combine the results from model1 and model2 ################################
start_time = time.time()
rslt = evaluateModel(y_pred_tfidf*0.5+y_pred_word2vec*0.5,y_test,0.2)
results.loc[len(results.index)]=['BiLSTM with TF-IDF and word2vec',rslt[0],rslt[1],rslt[2],rslt[3],rslt[4]]
end_time = time.time()
diff_time = end_time-start_time
print("time spent on evaluating the model3 is: ",diff_time)

time spent on evaluating the model3 is:  0.13937115669250488


In [46]:
print(results)

                             Model  Accuracy       MiP       MiR      MiF1  \
0               BiLSTM with TF-IDF  0.844880  0.418782  0.595707  0.491817   
1             BiLSTM with word2vec  0.861234  0.460096  0.583818  0.514625   
2  BiLSTM with TF-IDF and word2vec  0.859053  0.455739  0.610504  0.521890   

        AUC  
0  0.804495  
1  0.814443  
2  0.829092  


In [47]:
### build model4: training two bidirectional LSTMs with the X representations of both TF-IDF and word2vec together ################################
class CombinedBiLSTM(nn.Module):
    def __init__(self, input_size, input_size2,hidden_size, num_layers,output_size):
        super(CombinedBiLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm_tfidf = nn.LSTM(input_size, hidden_size,num_layers,batch_first=True, bidirectional=False)
        self.lstm_word2vec = nn.LSTM(input_size2, hidden_size,num_layers,batch_first=True, bidirectional=False)
        self.fc = nn.Linear(hidden_size*2, output_size)
        self.dropout = nn.Dropout(0.8)
        self.sigmoid = nn.Sigmoid()

    def forward(self, forward_input_tfidf, backward_input_tfidf,forward_input_word2vec, backward_input_word2vec):
        h0_forward_tfidf = torch.zeros(self.num_layers, forward_input_tfidf.size(0), self.hidden_size)
        c0_forward_tfidf = torch.zeros(self.num_layers, forward_input_tfidf.size(0), self.hidden_size)
        h0_backward_tfidf = torch.zeros(self.num_layers, backward_input_tfidf.size(0), self.hidden_size)
        c0_backward_tfidf = torch.zeros(self.num_layers, backward_input_tfidf.size(0), self.hidden_size)
        out_forward_tfidf, _ = self.lstm_tfidf(forward_input_tfidf, (h0_forward_tfidf, c0_forward_tfidf))
        out_backward_tfidf, _ = self.lstm_tfidf(backward_input_tfidf, (h0_backward_tfidf, c0_backward_tfidf))
        output_tfidf = torch.cat((out_forward_tfidf[:,-1,:], out_backward_tfidf[:,0,:]), dim=1)
        output_tfidf = self.dropout(output_tfidf)
        output_tfidf = self.fc(output_tfidf)
        output_tfidf = self.sigmoid(output_tfidf)
        
        h0_forward_word2vec = torch.zeros(self.num_layers, forward_input_word2vec.size(0), self.hidden_size)
        c0_forward_word2vec = torch.zeros(self.num_layers, forward_input_word2vec.size(0), self.hidden_size)
        h0_backward_word2vec = torch.zeros(self.num_layers, backward_input_word2vec.size(0), self.hidden_size)
        c0_backward_word2vec = torch.zeros(self.num_layers, backward_input_word2vec.size(0), self.hidden_size)
        out_forward_word2vec, _ = self.lstm_word2vec(forward_input_word2vec, (h0_forward_word2vec, c0_forward_word2vec))
        out_backward_word2vec, _ = self.lstm_word2vec(backward_input_word2vec, (h0_backward_word2vec, c0_backward_word2vec))
        output_word2vec = torch.cat((out_forward_word2vec[:,-1,:], out_backward_word2vec[:,0,:]), dim=1)
        output_word2vec = self.dropout(output_word2vec)
        output_word2vec = self.fc(output_word2vec)
        output_word2vec = self.sigmoid(output_word2vec)
        
        output = output_word2vec*0.5+output_tfidf*0.5
                                    
        return output

In [48]:
start_time = time.time()
train_data = torch.utils.data.TensorDataset(X_train_forward_tfidf, X_train_backward_tfidf,X_train_forward_word2vec, X_train_backward_word2vec, y_train)
train_loader = DataLoader(train_data, batch_size=400, shuffle=True)
model3 = CombinedBiLSTM(X_train_forward_tfidf.shape[2],X_train_forward_word2vec.shape[2], 100,1, y_train.shape[1])
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model3.parameters(),lr = 0.001)
num_epochs = 100
for epoch in range(num_epochs):
    running_loss = 0.0
    i = 0
    for i, data in enumerate(train_loader, 0):
        inputs_forward_tfidf, inputs_backward_tfidf,inputs_forward_word2vec, inputs_backward_word2vec, labels = data
        optimizer.zero_grad()
        outputs = model3(inputs_forward_tfidf, inputs_backward_tfidf,inputs_forward_word2vec, inputs_backward_word2vec)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    if (epoch+1)%10 ==0:
        print('[Epoch %d/100] loss: %.3f' %
              (epoch + 1, running_loss/i))
end_time = time.time()
diff_time = end_time-start_time
print("time spent on training the model4 is: ",diff_time)

[Epoch 10/100] loss: 0.310
[Epoch 20/100] loss: 0.299
[Epoch 30/100] loss: 0.292
[Epoch 40/100] loss: 0.286
[Epoch 50/100] loss: 0.281
[Epoch 60/100] loss: 0.276
[Epoch 70/100] loss: 0.273
[Epoch 80/100] loss: 0.271
[Epoch 90/100] loss: 0.268
[Epoch 100/100] loss: 0.267
time spent on training the model4 is:  5808.835932016373


In [49]:
start_time = time.time()
model3.eval()
with torch.no_grad():
    y_pred_combined = model3(X_test_forward_tfidf, X_test_backward_tfidf,
                             X_test_forward_word2vec, X_test_backward_word2vec)
y_pred_combined = y_pred_combined.detach().numpy()
rslt = evaluateModel(y_pred_combined,y_test,0.2)
results.loc[len(results.index)]=['Combined training of BiLSTMs',rslt[0],rslt[1],rslt[2],rslt[3],rslt[4]]
end_time = time.time()
diff_time = end_time-start_time
print("time spent on evaluating the model4 is: ",diff_time)

time spent on evaluating the model4 is:  4.616988182067871


In [50]:
print(results)

                             Model  Accuracy       MiP       MiR      MiF1  \
0               BiLSTM with TF-IDF  0.844880  0.418782  0.595707  0.491817   
1             BiLSTM with word2vec  0.861234  0.460096  0.583818  0.514625   
2  BiLSTM with TF-IDF and word2vec  0.859053  0.455739  0.610504  0.521890   
3     Combined training of BiLSTMs  0.869381  0.486158  0.643076  0.553714   

        AUC  
0  0.804495  
1  0.814443  
2  0.829092  
3  0.842082  


In [51]:
### build model5: training two bidirectional GRUs with the X representations of both TF-IDF and word2vec together ################################
class CombinedBiGRU(nn.Module):
    def __init__(self, input_size, input_size2,hidden_size, num_layers,output_size):
        super(CombinedBiGRU, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.gru_tfidf = nn.GRU(input_size, hidden_size,num_layers,batch_first=True, bidirectional=False)
        self.gru_word2vec = nn.GRU(input_size2, hidden_size,num_layers,batch_first=True, bidirectional=False)
        self.fc = nn.Linear(hidden_size*2, output_size)
        self.dropout = nn.Dropout(0.8)
        self.sigmoid = nn.Sigmoid()

    def forward(self, forward_input_tfidf, backward_input_tfidf,forward_input_word2vec, backward_input_word2vec):
        h0_forward_tfidf = torch.zeros(self.num_layers, forward_input_tfidf.size(0), self.hidden_size)
        h0_backward_tfidf = torch.zeros(self.num_layers, backward_input_tfidf.size(0), self.hidden_size)
        out_forward_tfidf, _ = self.gru_tfidf(forward_input_tfidf, h0_forward_tfidf)
        out_backward_tfidf, _ = self.gru_tfidf(backward_input_tfidf, h0_backward_tfidf)
        output_tfidf = torch.cat((out_forward_tfidf[:,-1,:], out_backward_tfidf[:,0,:]), dim=1)
        output_tfidf = self.dropout(output_tfidf)
        output_tfidf = self.fc(output_tfidf)
        output_tfidf = self.sigmoid(output_tfidf)
        
        h0_forward_word2vec = torch.zeros(self.num_layers, forward_input_word2vec.size(0), self.hidden_size)
        h0_backward_word2vec = torch.zeros(self.num_layers, backward_input_word2vec.size(0), self.hidden_size)
        out_forward_word2vec, _ = self.gru_word2vec(forward_input_word2vec, h0_forward_word2vec)
        out_backward_word2vec, _ = self.gru_word2vec(backward_input_word2vec, h0_backward_word2vec)
        output_word2vec = torch.cat((out_forward_word2vec[:,-1,:], out_backward_word2vec[:,0,:]), dim=1)
        output_word2vec = self.dropout(output_word2vec)
        output_word2vec = self.fc(output_word2vec)
        output_word2vec = self.sigmoid(output_word2vec)
        
        output = output_word2vec*0.5+output_tfidf*0.5                       
        return output

In [52]:
start_time = time.time()
train_data = torch.utils.data.TensorDataset(X_train_forward_tfidf, X_train_backward_tfidf,X_train_forward_word2vec, X_train_backward_word2vec, y_train)
train_loader = DataLoader(train_data, batch_size=400, shuffle=True)
model4 = CombinedBiGRU(X_train_forward_tfidf.shape[2],X_train_forward_word2vec.shape[2], 100,1, y_train.shape[1])
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model4.parameters(),lr = 0.001)
num_epochs = 100
for epoch in range(num_epochs):
    running_loss = 0.0
    i = 0
    for i, data in enumerate(train_loader, 0):
        inputs_forward_tfidf, inputs_backward_tfidf,inputs_forward_word2vec, inputs_backward_word2vec, labels = data
        optimizer.zero_grad()
        outputs = model4(inputs_forward_tfidf, inputs_backward_tfidf,inputs_forward_word2vec, inputs_backward_word2vec)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    if (epoch+1)%10 ==0:
        print('[Epoch %d/100] loss: %.3f' %
              (epoch + 1, running_loss/i))
end_time = time.time()
diff_time = end_time-start_time
print("time spent on training the model4 is: ",diff_time)

[Epoch 10/100] loss: 0.306
[Epoch 20/100] loss: 0.293
[Epoch 30/100] loss: 0.283
[Epoch 40/100] loss: 0.277
[Epoch 50/100] loss: 0.272
[Epoch 60/100] loss: 0.269
[Epoch 70/100] loss: 0.266
[Epoch 80/100] loss: 0.264
[Epoch 90/100] loss: 0.263
[Epoch 100/100] loss: 0.261
time spent on training the model4 is:  5426.055936098099


In [53]:
start_time = time.time()
model4.eval()
with torch.no_grad():
    y_pred_combined_gru = model4(X_test_forward_tfidf, X_test_backward_tfidf,
                             X_test_forward_word2vec, X_test_backward_word2vec)
y_pred_combined_gru = y_pred_combined_gru.detach().numpy()
rslt = evaluateModel(y_pred_combined_gru,y_test,0.2)
results.loc[len(results.index)]=['Combined training of BiGRUs',rslt[0],rslt[1],rslt[2],rslt[3],rslt[4]]
end_time = time.time()
diff_time = end_time-start_time
print("time spent on evaluating the model4 is: ",diff_time)

time spent on evaluating the model4 is:  5.146279811859131


## 8: Compare the results
After we built the 5 models above, let's compare the performance of the 5 models:

In [57]:
print("The performance of the 5 models are as the follows: ")
results

The performance of the 5 models are as the follows: 


Unnamed: 0,Model,Accuracy,MiP,MiR,MiF1,AUC
0,BiLSTM with TF-IDF,0.84488,0.418782,0.595707,0.491817,0.804495
1,BiLSTM with word2vec,0.861234,0.460096,0.583818,0.514625,0.814443
2,BiLSTM with TF-IDF and word2vec,0.859053,0.455739,0.610504,0.52189,0.829092
3,Combined training of BiLSTMs,0.869381,0.486158,0.643076,0.553714,0.842082
4,Combined training of BiGRUs,0.876479,0.507595,0.658741,0.573374,0.856416


#### Result 1 
Based on the table, MiP, MiR, MiF1, and AUC of the BiLSTMs with the two representations are all higher than that of BiLSTM with only the representation of TF-IDF, which fully supports the claim in the paper that using the two representations of the symptoms in the modeling do improve the performance of the classifier.

#### Result 2 
Based on the table, MiR, MiF1 and AUC of the BiLSTMs with the two representations are higher than that of BiLSTM with only the representation of Word2Vec, which does not fully support the claim in the paper that using the two representations of the symptoms in the modeling do improve the performance of the classifier.


#### Result 3 
In the original paper, the two BiLSTMs are trained separately first and then the outputs of the two BiLSTMs are added together with weight to get the final results. I want to test if training the two BiLSTMs together would have better performance. Based on the results in the table, training the BiLSTMs together has higher MiP, MiR, MiF1, and AUC than the model that simply sums up the results from the two separately trained BiLSTMs with weight. So training the two BiLSTMs together has better performance.

#### Result 4
In addition, suprisingly, training two BiGRUs together has better performance than training the two BiLSTMs together. Since BiGRU is simpler than BiLSTM, it was expected that BiLSTM should have better performance because BiLSTM has better ability to catch the sequential information in text data. However, the results in the table show that the model of BiGRUs outperforms the model of BiLSTMs with all the measurements. 

The reason might be that since the symptoms are finally used in modeling instead of the original full texts, the long term memory is not important in predicting the results, while the short term memory or more specifically the memory from last time step is more important in predicting the results. The reason still needs to be tested with further experiments.

Based on the runtime of training the BiLSTMs and BiGRUs and performance of these two methods, the method with BiGRUs is better.

## 9. Reference:
[1]Donglin Guo, Guihua Duan, Ying Yu, Yaohang Li, Fang-Xiang Wu, Min Li: A disease inference method based on symptom extraction and bidirectional Long Short Term Memory networks 

[2]Parikshit Sondhi, Jimeng Sun, Hanghang Tong, ChengXiang Zhai: SympGraph: A Framework for Mining Clinical Notes through Symptom Relation Graphs

[3]https://pytorch.org/

[4]https://numpy.org/

[5]https://pandas.pydata.org/

[6]https://stackoverflow.com/

[7]https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/documentation/Installation.html

[8]https://scikit-learn.org/

[9]https://physionet.org/content/mimiciii/1.4/