<a href="https://colab.research.google.com/github/amyth18/CS598-Deep-Learning-Final-Project/blob/main/CS598_Deep_Learning_For_Healthcare_Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [73]:
! pip install gensim --upgrade



In [74]:
import pandas as pd
import torch

In [75]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Load Data

In [76]:
# read data
df = pd.read_csv("/content/drive/My Drive/DLH Final Project/mimic3/NOTEEVENTS.csv")



# Data Preprocessing
We need to focus on 2 tables:
1. NOTEEVENTS.csv
2. DIAGNOSES_ICD.csv

Join the tables on <subject_id, hadm_id>.

Construct 2 datasets from the "TEXT" field in NOTEEVENTS.csv for each <subject_id, hadm_id> pair (i.e. the discharge summary for that admission).

The datasets are (X1, y) and (X2, y):
X1 = sequence of vectors from word2vec
X2 = sequence of vectors from tf-idf
y = list of ICD codes for <subject_id, hadm_id>, i.e. the diagnoses made during the ICU admission.

Need to focus on the 50 and 100 most commonly diagnosed diseases.

Use NLTK + MetaMap to extract only the symptom-related entities (how to use MetaMap is still unknown).

Keep only the sections of the discharge summaries that relate to symptoms and ignore the rest, to speed things up.

Negative filters (e.g. "no sign of breath problem").
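
A negation filter could start as a simple cue-based heuristic. The sketch below is an assumption, not the method from the paper: `NEGATION_CUES` and `drop_negated_clauses` are hypothetical names, and the cue list is illustrative only. It splits a note into clauses and drops every clause that contains a negation cue.

```python
import re

# Hypothetical, deliberately small cue list; a real filter would need a
# clinically validated trigger list.
NEGATION_CUES = ("no sign of", "no evidence of", "denies", "without", "negative for")

def drop_negated_clauses(text):
    """Return the clauses of `text` that contain no negation cue."""
    # Split on sentence and clause punctuation, discarding empty pieces.
    clauses = [c.strip() for c in re.split(r"[.;,]", text) if c.strip()]
    kept = []
    for clause in clauses:
        lowered = clause.lower()
        if not any(cue in lowered for cue in NEGATION_CUES):
            kept.append(clause)
    return kept

note = "Patient reports chest pain. No sign of breath problem; denies fever."
print(drop_negated_clauses(note))  # ['Patient reports chest pain']
```

Dropping whole clauses is coarse: a clause such as "fever without chills" would lose "fever" too, so a scoped negation detector would likely be needed later.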

Generate Word2Vec embeddings (currently using Gensim) using "TEXT".

Generate TF-IDF vector for each symptom entity.

Generate multi-hot encoding for y


## Utility Routines For Data Processing

In [77]:
import re
import nltk

from nltk.corpus import stopwords
nltk.download('stopwords')

# a set gives constant-time membership checks in sanitize_words
eng_stop_words = set(stopwords.words('english'))

class MySentences(object):
    def __init__(self, dframe):
        self.dframe = dframe
    
    # TODO: Keeping only alpha numeric characters and spaces for now.
    # need to make this better. Find some good libraries.
    def sanitize_text(self, text):
      text = text.strip()
      text = re.sub(r'\s\s+', ' ', text)
      # A-Z (not A-z): the A-z range also matches [ \ ] ^ _ and `
      text = re.sub(r'[^a-zA-Z0-9\/\.\?\!\s;,\'\-]', '', text)
      text = re.sub(r'[\.\-\/\?\!;,]', ' ', text)
      text = re.sub(r'[\[\]]', '', text)
      return text

    # TODO: adding some basic checks again need to make it better.
    def sanitize_words(self, sentence):
      return [w for w in sentence if w not in eng_stop_words and not w.isdigit()]

    def __iter__(self):
        for idx in range(len(self.dframe)):
          text = self.sanitize_text(self.dframe["TEXT"][idx])
          yield self.sanitize_words(text.split())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
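
To see what the two sanitization steps do to a note, here is a small standalone sketch. It mirrors `sanitize_text` and `sanitize_words` as free functions and, as an assumption made only so the example runs without downloading the NLTK corpus, replaces the NLTK list with a tiny hypothetical `STOP_WORDS` set. The input string is made up.

```python
import re

# Small stand-in for NLTK's English stop word list (assumption).
STOP_WORDS = {"the", "was", "to", "with", "of", "and", "a"}

def sanitize_text(text):
    text = text.strip()
    text = re.sub(r"\s\s+", " ", text)                   # collapse whitespace runs
    text = re.sub(r"[^a-zA-Z0-9/.?!\s;,'\-]", "", text)  # drop other symbols
    text = re.sub(r"[.\-/?!;,]", " ", text)              # punctuation -> space
    return text

def sanitize_words(words):
    # Drop stop words and purely numeric tokens.
    return [w for w in words if w not in STOP_WORDS and not w.isdigit()]

raw = "  Admitted to the ICU with [**fever**],  cough/chills; temp 101. "
tokens = sanitize_words(sanitize_text(raw).split())
print(tokens)  # ['Admitted', 'ICU', 'fever', 'cough', 'chills', 'temp']
```

The stop word comparison is case-sensitive, so a capitalised stop word such as "The" would pass through unless the text is lowercased first.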


In [78]:
def pad_dataset(dataset):
  # Zero-pad variable-length sequences of vectors to the longest sequence.
  max_seq_length = max(len(seq) for seq in dataset)
  # Take the vector size from the first non-empty sequence.
  vec_dim = next(len(seq[0]) for seq in dataset if len(seq) > 0)

  # float32 matches the default dtype of the nn.LSTM parameters.
  padded_dataset = torch.zeros([len(dataset), max_seq_length, vec_dim],
                               dtype=torch.float32)
  for i, seq in enumerate(dataset):
    for j, vec in enumerate(seq):
      padded_dataset[i][j] = torch.as_tensor(vec, dtype=torch.float32)
  
  return padded_dataset
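

The effect of zero-padding on a batch can be seen with a small pure-Python sketch (no torch needed). `pad_with_lengths` is a hypothetical helper, not part of the notebook; besides the padded batch it also returns the original sequence lengths, which a later step could use to ignore the padded positions.

```python
def pad_with_lengths(sequences, dim):
    """Zero-pad a list of sequences of `dim`-sized vectors to equal length."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = []
    for seq in sequences:
        # Copy the real vectors, then append all-zero vectors as padding.
        rows = [list(vec) for vec in seq]
        rows += [[0.0] * dim for _ in range(max_len - len(seq))]
        padded.append(rows)
    return padded, lengths

batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # 3 tokens
    [[7.0, 8.0]],                          # 1 token
]
padded, lengths = pad_with_lengths(batch, dim=2)
print(lengths)    # [3, 1]
print(padded[1])  # [[7.0, 8.0], [0.0, 0.0], [0.0, 0.0]]
```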

## Data Filtering and Transformation

In [79]:
df_icd_codes = pd.read_csv(
    "/content/drive/My Drive/DLH Final Project/mimic3/DIAGNOSES_ICD.csv")

Get the top 50 ICD9 codes.

In [80]:
counts = df_icd_codes["ICD9_CODE"].value_counts().head(50)
top_icd_codes = counts.index.to_list()

Filter the data to include only admissions with the top 50 diseases, then group and reorganize it in the following format: <subject_id, hadm_id, [icd_code1, icd_code2 ...]>

In [81]:
df_admissions_with_top_diseases = \
df_icd_codes[df_icd_codes["ICD9_CODE"].isin(top_icd_codes)]

df_admissions_with_top_diseases = \
df_admissions_with_top_diseases.groupby(
['SUBJECT_ID', 'HADM_ID'])['ICD9_CODE'].apply(
        list).to_frame().reset_index()

Now select the discharge summaries for the admissions that contain at least one of the top 50 ICD codes.

In [82]:
df_dataset = pd.merge(df, df_admissions_with_top_diseases, 
                       on=["SUBJECT_ID", "HADM_ID"])

df_dataset = df_dataset[df_dataset["CATEGORY"] 
                          == 'Discharge summary'].reset_index()
# free up some memory
# del df

In [83]:
len(df_dataset)

55988

The ICD9 column needs to be converted to a multi-hot encoding. We keep the old column with the list of codes and add a new column with the multi-hot representation.

In [84]:
sorted_top_icd_codes = sorted(top_icd_codes)
icd_code_to_idx = dict((k, v) for v, k in enumerate(sorted_top_icd_codes))

In [85]:
# new col to be added to dataframe
multi_hot_ecoding_col = list()
for idx in range(len(df_dataset)):
  icd_codes = df_dataset.iloc[idx]['ICD9_CODE']
  # one slot per top ICD code instead of a hard-coded 50
  encoding = [0] * len(top_icd_codes)
  for code in icd_codes:
    encoding[icd_code_to_idx[code]] = 1
  multi_hot_ecoding_col.append(encoding)

# now add a new column with the multi-hot encoding.
df_dataset['ICD9_CODE_ENCODED'] = multi_hot_ecoding_col
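

As a worked example of the encoding, the sketch below uses three example code strings instead of the 50 most frequent codes. `multi_hot` and `decode` are hypothetical helper names; the index mapping follows the same sorted-codes convention as `icd_code_to_idx`.

```python
# Example code strings only; the notebook uses the 50 most frequent codes.
codes = sorted(["4019", "25000", "41401"])
code_to_idx = {code: idx for idx, code in enumerate(codes)}

def multi_hot(icd_codes):
    """Set one position per code that is present for the admission."""
    encoding = [0] * len(codes)
    for code in icd_codes:
        encoding[code_to_idx[code]] = 1
    return encoding

def decode(encoding):
    """Map a multi-hot vector back to the list of codes."""
    return [codes[i] for i, bit in enumerate(encoding) if bit == 1]

print(code_to_idx)                   # {'25000': 0, '4019': 1, '41401': 2}
print(multi_hot(["4019", "25000"]))  # [1, 1, 0]
print(decode([0, 1, 1]))             # ['4019', '41401']
```

Because the codes are strings, they sort lexicographically ("25000" before "4019"), which is fine as long as the same mapping is used for encoding and decoding.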

Extract symptoms from the text. Note: currently we treat all tokens as symptoms; the filters discussed in the paper still need to be added. For now we add a column called "SYMPTOMS", which is simply the tokenized form of "TEXT" after basic sanitization.

In [86]:
sgen = MySentences(df_dataset)
symptom_col = list()
for s in sgen:
  symptom_col.append(s)

# add the new column to the dataset.
df_dataset["SYMPTOMS"] = symptom_col

## Generate Word2Vec Embeddings

Word2Vec training using gensim.

In [87]:
# NOTE: this part is commented out so that we don't run it by mistake.

# import gensim
# sgen = MySentences(df_dataset) # a memory-friendly iterator
# model = gensim.models.Word2Vec(sgen, min_count=5, workers=4, sample=1e-05)
# model.save("/content/drive/My Drive/DLH Final Project/mimic3/word2vec-4.model")
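

`MySentences` implements `__iter__` instead of being a plain generator. That matters because Word2Vec training reads the corpus more than once (one pass to build the vocabulary, then one pass per epoch). The pure-Python sketch below, with a hypothetical `Corpus` class and made-up texts, shows the difference between a restartable iterable and a one-shot generator.

```python
class Corpus:
    """Restartable iterable: every call to __iter__ starts a fresh pass."""
    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        for text in self.texts:
            yield text.split()

texts = ["chest pain", "shortness of breath"]

corpus = Corpus(texts)
first_pass = list(corpus)
second_pass = list(corpus)        # a second pass works

gen = (t.split() for t in texts)  # one-shot generator
gen_first = list(gen)
gen_second = list(gen)            # already exhausted

print(len(first_pass), len(second_pass))  # 2 2
print(len(gen_first), len(gen_second))    # 2 0
```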

## Construct dataset with Word2Vec embeddings

In [88]:
from gensim.models import Word2Vec
model = Word2Vec.load('/content/drive/My Drive/DLH Final Project/mimic3/word2vec-4.model')

In [89]:
X_word2vec = list()
for idx in range(len(df_dataset)):
  # ignore words that are not in the vocabulary
  symptoms = df_dataset["SYMPTOMS"][idx]
  symptoms_emb = [model.wv[s] for s in symptoms if s in model.wv]
  X_word2vec.append(symptoms_emb)

# pad the dataset.
# X_word2vec = pad_dataset(X_word2vec)

In [90]:
# import pickle
# pfile = open("/content/drive/My Drive/DLH Final Project/X_word2vec", "ab")
# pickle.dump(X_word2vec, pfile)
# pfile.close()

# Construct data with TF-IDF encoding

In [92]:
import numpy as np
import itertools

vocab_size = len(model.wv)
tf = np.zeros((len(model.wv), len(top_icd_codes)))


for idx in range(len(df_dataset)):
  # XXX: TODO currently we treat all tokens from "TEXT" as symptoms
  # get the icd codes for this discharge summary
  symptoms = df_dataset['SYMPTOMS'][idx]
  icd_codes = df_dataset['ICD9_CODE'][idx]
  # create a cross product of symptoms and icd codes
  # and update tf matrix. tf matrix keeps count of how many 
  # (i.e frequency) times <symptom, icd code> pair occur in our dataset.
  for pair in itertools.product(symptoms, icd_codes):
    # update count of each (symptom, icd_code) pair to compute TF
    if pair[0] in model.wv:
      tf[model.wv.get_index(pair[0])][icd_code_to_idx[pair[1]]] += 1

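# Complete the TF-IDF matrix computation.
# Compute the number of ICD codes (i.e. diseases) each
# symptom is associated with.
D_i = np.sum(tf > 0, axis=1)
print(D_i.shape)

# Guard against division by zero for vocabulary words that never occur
# in this dataset; their tf rows are all zero, so they get a zero
# weight instead of NaN.
log_N_Di = np.log(len(top_icd_codes) / np.maximum(D_i, 1))
tf_idf = (tf.T * log_N_Di).T
print(tf_idf.shape)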

(64259,)
(64259, 50)
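
The weighting above can be checked by hand on a toy corpus. The sketch below re-implements the same counts and the same log(N / D_i) factor in plain Python; the records, tokens, and codes "A" and "B" are made up for illustration.

```python
import math
from itertools import product

# Toy corpus: (tokens, icd_codes) per discharge summary; all values are made up.
records = [
    (["fever", "cough"], ["A", "B"]),
    (["fever"], ["A"]),
    (["cough", "rash"], ["B"]),
]
codes = ["A", "B"]

# tf[symptom][code] = number of (symptom, code) co-occurrences.
tf = {}
for tokens, icd_codes in records:
    for symptom, code in product(tokens, icd_codes):
        tf.setdefault(symptom, {c: 0 for c in codes})[code] += 1

tf_idf = {}
for symptom, row in tf.items():
    d_i = sum(1 for count in row.values() if count > 0)  # codes seen with symptom
    idf = math.log(len(codes) / d_i)
    tf_idf[symptom] = [row[c] * idf for c in codes]

print(tf["fever"])      # {'A': 2, 'B': 1}
print(tf_idf["fever"])  # [0.0, 0.0]
print(tf_idf["rash"])   # second entry is log(2), about 0.693
```

A token that co-occurs with every code ("fever" here) gets a zero vector, while a token tied to a single code ("rash") keeps a non-zero weight for that code only.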


In [93]:
# build the X_tfidf dataset
X_tf_idf = list()
for idx in range(len(df_dataset)):
  symptoms = df_dataset["SYMPTOMS"][idx]
  # get tf-idf vector for each symptom
  # ignore words that are not in the vocabulary
  symptoms_tf_idf = [tf_idf[model.wv.get_index(s)] \
                     for s in symptoms if s in model.wv]
  X_tf_idf.append(symptoms_tf_idf)

# pad the dataset.
# X_tf_idf = pad_dataset(X_tf_idf)

In [None]:
# import pickle
# pfile = open("/content/drive/My Drive/DLH Final Project/X_tf_idf", "ab")
# pickle.dump(X_tf_idf, pfile)
# pfile.close()

# Construct Y (Multihot Encoding)

In [94]:
# multi-hot encoding for ICD codes diagnosed.
y = df_dataset['ICD9_CODE_ENCODED'].to_list()

In [95]:
print(len(X_word2vec))
print(len(X_tf_idf))
print(len(y))

55988
55988
55988


# Define Dataset and DataLoaders

In [99]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split

def collate_fn(data):
  x_w2v, x_tidf, y_batch = zip(*data)
  x_w2v = pad_dataset(x_w2v)
  x_tidf = pad_dataset(x_tidf)
  y_batch = torch.FloatTensor(y_batch)
  # print(f"{x_w2v.shape}, {x_tidf.shape}, {y_batch.shape}")
  return x_w2v, x_tidf, y_batch


class CustomDataset(Dataset):

  def __init__(self, X_w2v, X_tfidf, y):              
    self.X_w2v = X_w2v
    self.X_tfidf = X_tfidf
    self.y = y
    
  def __len__(self):                
    return len(self.y)
    
  def __getitem__(self, index):          
    # return one (word2vec sequence, tf-idf sequence, label) sample
    return self.X_w2v[index], self.X_tfidf[index], self.y[index]

dataset = CustomDataset(X_word2vec, X_tf_idf, y)

split = int(len(dataset)*0.8)
lengths = [split, len(dataset) - split]
train_dataset, test_dataset = random_split(dataset, lengths)

train_loader = DataLoader(train_dataset, shuffle=True, batch_size=32, 
                          collate_fn=collate_fn)

# no shuffling needed for evaluation
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=32, 
                         collate_fn=collate_fn)

# sanity check: fetch one batch and print the tensor shapes
p, q, r = next(iter(train_loader))
print(f"{p.shape}, {q.shape}, {r.shape}")

torch.Size([32, 1774, 100]), torch.Size([32, 1774, 50]), torch.Size([32, 50])


# Model Definition

In [32]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTM(nn.Module):
  def __init__(self, input_dim, embedding_dim, output_dim):   
    super(BiLSTM, self).__init__()
    self.lstm = nn.LSTM(input_size=input_dim, 
                        hidden_size=embedding_dim,
                        num_layers=2,
                        bidirectional=True,
                        batch_first=True)
    
    self.linear = nn.Linear(embedding_dim*2, 
                            output_dim)
  
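  def forward(self, X):
    # nn.LSTM returns (output, (h_n, c_n)), not a single tensor.
    _, (h_n, _) = self.lstm(X)
    # Use the last layer's final forward (h_n[-2]) and backward (h_n[-1])
    # hidden states as the sequence embedding: (batch, embedding_dim*2).
    emb = torch.cat([h_n[-2], h_n[-1]], dim=1)
    output = torch.sigmoid(self.linear(emb))
    return output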

In [33]:
class DiseasePredictionModel(nn.Module):
  def __init__(self, weight=0.4):    
    super(DiseasePredictionModel, self).__init__()
    self.weight = weight
    self.w2v_lstm = BiLSTM(input_dim=100, embedding_dim=50, output_dim=50)
    self.tf_idf_lstm = BiLSTM(input_dim=50, embedding_dim=50, output_dim=50)
  
  def forward(self, X_w2v, X_tidf):
    pred1 = self.w2v_lstm(X_w2v)
    pred2 = self.tf_idf_lstm(X_tidf)
    # compute the weighted average of predictions
    # from the 2 models.
    return self.weight * pred1 + (1-self.weight) * pred2
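

To make the combination step concrete, the plain-Python sketch below averages two made-up prediction vectors with the same 0.4 / 0.6 weighting and scores the result with binary cross-entropy, the loss used in the training cell. `combine` and `bce` are hypothetical helpers written for this illustration, not the torch implementations.

```python
import math

def combine(pred1, pred2, weight=0.4):
    """Weighted average of two per-label probability lists."""
    return [weight * a + (1 - weight) * b for a, b in zip(pred1, pred2)]

def bce(pred, target):
    """Mean binary cross-entropy over the labels of one sample."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for p, t in zip(pred, target)]
    return sum(terms) / len(terms)

pred_w2v = [0.9, 0.2]    # made-up output of the Word2Vec branch
pred_tfidf = [0.5, 0.1]  # made-up output of the TF-IDF branch
target = [1, 0]

combined = combine(pred_w2v, pred_tfidf)
print(combined)               # approximately [0.66, 0.14]
print(bce(combined, target))  # approximately 0.283
```

Because the two weights sum to 1 and both branches end in a sigmoid, the combined value stays between 0 and 1, so it remains a valid input for a binary cross-entropy loss.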

# Model Training

In [34]:
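# NOTE: this rebinds `model`, which previously held the Word2Vec model.
model = DiseasePredictionModel()
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for e in range(1):
  model.train()
  for x_w2v, x_tidf, y_batch in train_loader:
    optimizer.zero_grad()
    pred = model(x_w2v, x_tidf)
    # compare the predictions against the labels of the whole batch
    batch_loss = loss_fn(pred, y_batch)
    print(batch_loss.item())
    batch_loss.backward()
    optimizer.step()
    # smoke test: stop after a single batch
    break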