<a href="https://colab.research.google.com/github/amyth18/CS598-Deep-Learning-Final-Project/blob/main/Other_Baseline_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

In this notebook we provide an implementation of the DeepLabeler model proposed in the paper [Automated ICD-9 Coding via A Deep Learning Approach](https://ieeexplore.ieee.org/document/8320340) by Min Li et al. The model predicts ICD-9 diagnosis codes from the full clinical text: the output of a Convolutional Neural Network over the word2vec embeddings of the text is combined with a Doc2Vec embedding of the whole document. We use this model as one of the baseline models against which we compare our main model, implemented in [Main_Model.ipynb](https://github.com/amyth18/CS598-Deep-Learning-Final-Project/blob/main/Main_Model.ipynb).

# Pre-Requisites

Before you can run this notebook, you need access to the MIMIC III version 1.4 dataset from physionet.org. Please refer to the Pre-Requisites section of the Readme file in the GitHub repository for more details.

For this notebook we specifically need the `NOTEVENTS.csv` and `DIAGNOSES_ICD.csv` files from the MIMIC III dataset.

We use Google Drive to store all our data, including the original dataset, the transformed/pre-processed dataset, trained models and evaluation results. Before you get started, please create a top-level folder in your Google Drive and update the project-level variable ```PROJECT_PATH``` in the **Initial Setup** section.

Once the top-level folder is created, please save the MIMIC III dataset (i.e., the decompressed CSV files) in a folder called ```mimic3```. If you don't have space to save all the files, you can save only the `NOTEVENTS.csv` and `DIAGNOSES_ICD.csv` files.

Also, in the same top-level folder, create the following folders, where the notebook will save its results.
1. ```models```
2. ```results```
3. ```stats```
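
The folder layout described above could also be created from code. The following is a minimal sketch, not part of the original notebook: `create_project_folders` is a hypothetical helper, and the path in the usage comment assumes the default `PROJECT_PATH` used in the **Initial Setup** section.

```python
from pathlib import Path

def create_project_folders(project_path):
    """Create the sub-folders this notebook expects under project_path."""
    root = Path(project_path)
    created = []
    for name in ("mimic3", "models", "results", "stats"):
        folder = root / name
        # parents=True also creates the top-level folder if it is missing;
        # exist_ok=True makes the call safe to repeat.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created

# In Colab, after mounting Google Drive, this could be called as:
# create_project_folders("/content/drive/My Drive/DLH Final Project")
```

The MIMIC III CSV files still have to be copied into the `mimic3` folder by hand, since access to them requires PhysioNet credentials.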

# Initial Setup

In [None]:
! pip install gensim --upgrade

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn

In [None]:
PROJECT_PATH = "/content/drive/My Drive/DLH Final Project"
DATASET_PATH = f"{PROJECT_PATH}/mimic3/df_dataset_full_text.csv"
DATASET_D2V_PATH = f"{PROJECT_PATH}/mimic3/df_dataset_full_text_d2v.csv"
DOC2VEC_PATH = f"{PROJECT_PATH}/models/doc2vec.model"
W2V_MODEL_PATH = f"{PROJECT_PATH}/models/word2vec.model"

TRAINING_BATCH_SIZE = 400
MAX_WORDS = 1000
W2V_EMB_SIZE = 128

In [None]:
! ls "/content/drive/My Drive/DLH Final Project/models"

doc2vec.model			main-model-27-04-2022-19-11-16
doc2vec.model.syn1neg.npy	tf-idf-27-04-2022-16-29-54
doc2vec.model.wv.vectors.npy	word2vec-27-04-2022-17-47-59
main-model-27-04-2022-15-40-59	word2vec.model


# Data Preprocessing

In [None]:
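import ast  # literal_eval parses the stored list strings without executing code
df_dataset = pd.read_csv(DATASET_PATH, converters={'INPUT_TEXT': ast.literal_eval,
                                                   'ICD9_CODE': ast.literal_eval})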

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from os import path

model = None

if path.exists(DOC2VEC_PATH):
  print("Loading doc2vec model from disk.")
  model = Doc2Vec.load(DOC2VEC_PATH)
else:
  print("Building doc2vec model.")
  docs = [TaggedDocument(doc, [i])
          for i, doc in enumerate(df_dataset.INPUT_TEXT)]

  model = Doc2Vec(vector_size=128, window=2, 
                  min_count=1, 
                  workers=8, 
                  epochs = 40)

  model.build_vocab(docs)

  model.train(docs, total_examples=model.corpus_count, 
              epochs=model.epochs)
  model.save(DOC2VEC_PATH)

In [None]:
X_doc2vec = [model.infer_vector(doc) for doc in df_dataset['INPUT_TEXT']]

In [None]:
df_dataset['DOC2VEC'] = np.array(X_doc2vec).tolist()
df_dataset.to_csv(DATASET_D2V_PATH)

# Load Data

In [None]:
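import ast  # literal_eval parses the stored list strings without executing code
df_dataset = pd.read_csv(DATASET_D2V_PATH,
                         converters={'INPUT_TEXT': ast.literal_eval,
                                     'ICD9_CODE': ast.literal_eval,
                                     'DOC2VEC': ast.literal_eval})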

In [None]:
from gensim.models import Word2Vec

# load the model
model = Word2Vec.load(W2V_MODEL_PATH)

# now create a vector of word2vec embeddings for each discharge summary
X_word2vec = list()
for idx in range(len(df_dataset)):
  # ignore words that are not in the vocabulary
  text = df_dataset["INPUT_TEXT"][idx]
  word_emb = [model.wv[w] for w in text if w in model.wv]
  X_word2vec.append(word_emb)

In [None]:
# unique ICD codes in the dataset (expected to be the top 50 codes).
top_icd_codes = np.unique([code for codes in df_dataset['ICD9_CODE']
                           for code in codes])

sorted_top_icd_codes = sorted(top_icd_codes)
icd_code_to_idx = dict((k, v) for v, k in enumerate(sorted_top_icd_codes))

multi_hot_encoding_col = list()
for idx in range(len(df_dataset)):
  icd_codes = df_dataset.iloc[idx]['ICD9_CODE']
  encoding = [0] * 50
  for code in icd_codes:
    encoding[icd_code_to_idx[code]] = 1
  multi_hot_encoding_col.append(encoding)

# now add a new column with the multi-hot encoding.
df_dataset['ICD9_CODE_ENCODED'] = multi_hot_encoding_col

# multi-hot encoding of the diagnosed ICD codes.
y = df_dataset['ICD9_CODE_ENCODED'].to_list()
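
To make the multi-hot encoding step concrete, here is a small self-contained sketch on toy data. The helper name `multi_hot_encode` and the three example code strings are illustrative assumptions; the notebook itself works on the 50 codes found in `df_dataset`.

```python
def multi_hot_encode(code_lists, vocabulary):
    """Turn lists of label codes into fixed-length 0/1 vectors."""
    # Sorting gives every code a stable column index.
    code_to_idx = {code: idx for idx, code in enumerate(sorted(vocabulary))}
    encoded = []
    for codes in code_lists:
        row = [0] * len(code_to_idx)
        for code in codes:
            row[code_to_idx[code]] = 1
        encoded.append(row)
    return code_to_idx, encoded

# Toy example: three admissions with up to three codes each.
toy_codes = [["4019", "25000"], ["4280"], ["25000", "4280", "4019"]]
vocabulary = {code for codes in toy_codes for code in codes}

code_to_idx, encoded = multi_hot_encode(toy_codes, vocabulary)
print(code_to_idx)  # {'25000': 0, '4019': 1, '4280': 2}
print(encoded)      # [[1, 1, 0], [0, 0, 1], [1, 1, 1]]
```

Because the codes are strings, they are sorted lexicographically, which is why `'25000'` receives index 0 here.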

In [None]:
X_doc2vec = df_dataset["DOC2VEC"]

In [None]:
print(len(X_word2vec))
print(len(X_doc2vec))
print(len(y))

55988
55988
55988


# Datasets and Dataloaders

In [None]:
from torch.utils.data import Dataset, DataLoader, random_split

In [None]:
def pad_dataset(dataset, vec_size):
  # Zero-pad every sequence of word vectors to MAX_WORDS rows.
  # Longer sequences are truncated so they cannot overflow the buffer.
  padded_dataset = torch.zeros([len(dataset), MAX_WORDS, vec_size],
                               dtype=torch.float)
  for i, seq in enumerate(dataset):
    seq = seq[:MAX_WORDS]
    if len(seq) > 0:
      padded_dataset[i, :len(seq)] = torch.from_numpy(
          np.array(seq, dtype=np.float32))

  return padded_dataset
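
As a minimal illustration of what the padding step does, the sketch below applies the same idea to plain Python lists. `pad_or_truncate` is a hypothetical helper rather than the notebook's `pad_dataset`, and the truncation of over-long sequences is an added assumption about how such inputs should be handled.

```python
def pad_or_truncate(sequence, max_len, vec_size):
    """Return exactly max_len vectors: the input, cut or zero-padded."""
    padded = [list(vec) for vec in sequence[:max_len]]
    # Append all-zero vectors until the target length is reached.
    padded.extend([0.0] * vec_size for _ in range(max_len - len(padded)))
    return padded

# Two "documents" with 2-dimensional word vectors, padded to 3 words each.
short_doc = [[0.1, 0.2]]
long_doc = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]

print(pad_or_truncate(short_doc, 3, 2))  # [[0.1, 0.2], [0.0, 0.0], [0.0, 0.0]]
print(pad_or_truncate(long_doc, 3, 2))   # [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
```

Padding every document to the same length is what allows a batch to be stacked into a single tensor of shape `(batch, MAX_WORDS, vec_size)`.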

In [None]:
def collate_fn(data):
  x_w2v, x_d2v, y_batch = zip(*data)
  x_w2v = pad_dataset(x_w2v, W2V_EMB_SIZE)
  x_d2v = torch.FloatTensor(x_d2v)
  y_batch = torch.FloatTensor(y_batch)
  # move the batch to the GPU when one is available
  if torch.cuda.is_available():
    x_w2v, x_d2v, y_batch = x_w2v.cuda(), x_d2v.cuda(), y_batch.cuda()
  return (x_w2v, x_d2v), y_batch

In [None]:
class CustomDataset(Dataset):
  def __init__(self, X_w2v, X_d2v, y):              
    self.X_w2v = X_w2v
    self.X_d2v = X_d2v
    self.y = y
    
  def __len__(self):                
    return len(self.y)
    
  def __getitem__(self, index):
    return self.X_w2v[index], self.X_d2v[index], self.y[index]

dataset = CustomDataset(X_word2vec, X_doc2vec, y)
split = int(len(dataset)*0.8)
lengths = [split, len(dataset) - split]

train_dataset, test_dataset = random_split(dataset, lengths)

train_loader = DataLoader(train_dataset, shuffle=True, 
                          batch_size=TRAINING_BATCH_SIZE, 
                          collate_fn=collate_fn)

test_loader = DataLoader(test_dataset, shuffle=True, 
                         batch_size=TRAINING_BATCH_SIZE, 
                         collate_fn=collate_fn)

# Model Definition

In [None]:
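class CNNModel(nn.Module):

  def __init__(self):
    super(CNNModel, self).__init__()
    # 1 input channel, 64 filters, each spanning 5 words x 128 embedding dims
    self.conv1 = nn.Conv2d(1, 64, (5, 128), 1)
    self.max_pool = torch.nn.MaxPool2d(4)
    self.dropout = torch.nn.Dropout(0.75)
    self.relu = torch.nn.ReLU()

  def forward(self, X):
    # X: (batch, 1, MAX_WORDS, 128) -> (batch, 64, 996, 1)
    out = self.conv1(X)
    out = self.relu(out)
    # drop the trailing width-1 dimension -> (batch, 64, 996)
    x_in = torch.squeeze(out, dim=3)
    # MaxPool2d pools the last two dimensions -> (batch, 16, 249)
    out = self.max_pool(x_in)
    out = self.dropout(out)
    # flatten to (batch, 16 * 249) = (batch, 3984)
    out = torch.flatten(out, 1)
    return out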

class DeepLabeler(nn.Module):

  def __init__(self):
    super(DeepLabeler, self).__init__()
    self.cnn = CNNModel()
    # 3984 CNN features + 128 Doc2Vec dimensions = 4112 inputs, 50 ICD codes
    self.fc = nn.Linear(4112, 50)
    self.dropout = nn.Dropout(0.75)
    self.sigmoid = nn.Sigmoid()

  def forward(self, X_w2vec, X_d2vec):
    out1 = self.cnn(X_w2vec)
    # concatenate the CNN features with the document vector
    X_concat = torch.cat((out1, X_d2vec), 1)
    # one independent probability per ICD code
    out2 = self.sigmoid(self.fc(X_concat))
    return out2

  def get_name(self):
    return "deep-labeler"
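
The value 4112 passed to `nn.Linear` follows from the layer shapes. The sketch below recomputes it in plain Python, assuming that `MaxPool2d(4)` applied to the squeezed 3-D tensor pools its last two dimensions (filters and time steps). The helper names `conv_out` and `pool_out` are illustrative and not part of the notebook.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1

def pool_out(size, kernel):
    """Output length of unpadded max pooling with stride equal to the kernel."""
    return (size - kernel) // kernel + 1

MAX_WORDS, W2V_EMB_SIZE, D2V_SIZE = 1000, 128, 128
FILTERS, LABELS = 64, 50

steps = conv_out(MAX_WORDS, 5)                      # time steps after conv1
width = conv_out(W2V_EMB_SIZE, 128)                 # kernel covers the full embedding
pooled = (pool_out(FILTERS, 4), pool_out(steps, 4)) # pooled filter and time axes
cnn_features = pooled[0] * pooled[1]                # flattened CNN output
fc_inputs = cnn_features + D2V_SIZE                 # after concatenating Doc2Vec

conv_params = FILTERS * (1 * 5 * 128) + FILTERS     # weights + biases of conv1
fc_params = fc_inputs * LABELS + LABELS             # weights + biases of fc
total_params = conv_params + fc_params

print(steps, width, pooled, cnn_features, fc_inputs)  # 996 1 (16, 249) 3984 4112
print(total_params)                                   # 246674
```

The total agrees with the parameter count printed by the training cell, which supports this reading of the shapes.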

# Model Training

In [None]:
from datetime import datetime
import pytz

def _timestamped_path(folder, modelname):
  # e.g. <PROJECT_PATH>/models/<modelname>-<dd-mm-YYYY-HH-MM-SS>
  timestamp = datetime.now(pytz.timezone('Asia/Kolkata')).strftime(
      "%d-%m-%Y-%H-%M-%S")
  return f"{PROJECT_PATH}/{folder}/{modelname}-{timestamp}"

def get_model_file_name(modelname="model"):
  return _timestamped_path("models", modelname)

def get_stats_file_name(modelname="model"):
  return _timestamped_path("stats", modelname)

def get_results_file_name(modelname="model"):
  return _timestamped_path("results", modelname)

In [None]:
import psutil
import time
import pickle

no_of_epocs = 100

def train_model(model, loss, optimizer, train_loader):

  main_memory_usage = list()
  gpu_memory_usage = list()
  gpu_time = list()
  train_loss = list()

  for e in range(no_of_epocs):
    model.train()
    epoc_train_loss = 0
    start_time = time.time()

    # iterate over data in mini batches.
    for tup, y_batch in train_loader:
      X_w2v, X_d2v = tup
      X_w2v = torch.unsqueeze(X_w2v, dim=1)
      model.zero_grad()
      pred = model(X_w2v, X_d2v)
      l = loss(pred, y_batch)
      l.backward()
      optimizer.step()    
      epoc_train_loss += l.item()
      
    # print epoc level training loss.
    print(f"epoc: {e}: Train Loss: {epoc_train_loss/len(train_loader)}")
    
    # collect cpu and memory stats.
    memory_used = psutil.virtual_memory().used
    gpu_memory_used = torch.cuda.memory_allocated()
    run_time = time.time() - start_time
    print(f"time: {run_time} memory_used: {memory_used} gpu_memory_used: {gpu_memory_used}")
    print("\n")

    train_loss.append(epoc_train_loss/len(train_loader))
    main_memory_usage.append(memory_used)
    gpu_memory_usage.append(gpu_memory_used)
    gpu_time.append(run_time)
    # end of one epoc

  # save the model
  torch.save(model.state_dict(), get_model_file_name(model.get_name()))
  # print and collect stats.
  print(psutil.virtual_memory())

  stats = {
      "train_loss": train_loss,
      "gpu_mem": gpu_memory_usage,
      "main_mem": main_memory_usage,
      "gpu_time": gpu_time,
      "vmm_info": psutil.virtual_memory()
  }

  with open(get_stats_file_name(model.get_name()), "ab") as sfile:
    pickle.dump(stats, sfile)

In [None]:
model = DeepLabeler()
if torch.cuda.is_available():
  model.cuda()

loss_fn = nn.BCELoss()
optim = torch.optim.Adam(model.parameters(), lr=0.001)
print(f"No of parameters to train: \
        {sum(p.numel() for p in model.parameters() if p.requires_grad)}")

train_model(model, loss_fn, optim, train_loader)

No of parameters to train:         246674
epoc: 0: Train Loss: 0.26725409446018084
time: 392.7763228416443 memory_used: 10455932928 gpu_memory_used: 833817600


epoc: 1: Train Loss: 0.22767333313822746
time: 398.54994797706604 memory_used: 10453786624 gpu_memory_used: 833817600


epoc: 2: Train Loss: 0.2204462287149259
time: 396.1982464790344 memory_used: 10454323200 gpu_memory_used: 833817600


epoc: 3: Train Loss: 0.21659478106136834
time: 400.37121987342834 memory_used: 10454585344 gpu_memory_used: 833817600


epoc: 4: Train Loss: 0.2142102364450693
time: 402.3953001499176 memory_used: 10456240128 gpu_memory_used: 833817600


epoc: 5: Train Loss: 0.2124074496594923
time: 406.72699904441833 memory_used: 9789870080 gpu_memory_used: 833817600


epoc: 6: Train Loss: 0.211052455141076
time: 403.66787552833557 memory_used: 9788055552 gpu_memory_used: 833817600


epoc: 7: Train Loss: 0.20988512970507145
time: 403.0095522403717 memory_used: 9789186048 gpu_memory_used: 833817600


epoc: 8: T

TypeError: ignored

In [None]:
# save the trained weights to the models folder
torch.save(model.state_dict(), get_model_file_name("deep-labeler"))

# Model Evaluation

In [None]:
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt

def evaluate_model(model, test_loader):
  model.eval()
  y_pred_all = list()
  y_true_all = list()

  with torch.no_grad():  # gradients are not needed during evaluation
    for tup, y_batch in test_loader:
      X_w2v, X_d2v = tup
      X_w2v = torch.unsqueeze(X_w2v, dim=1)
      y_pred = model(X_w2v, X_d2v)
      y_pred = y_pred > 0.20 # TODO: remove hard coding
      y_pred_all.extend(y_pred.detach().to('cpu').numpy())
      y_true_all.extend(y_batch.detach().to('cpu').numpy())

  y_true_all = np.array(y_true_all)
  y_pred_all = np.array(y_pred_all)

  # micro level metrics
  p1, r1, f1, s1 = precision_recall_fscore_support(y_true_all, y_pred_all, 
                                                  average="micro")
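  # NOTE: y_pred_all holds thresholded labels, so the AUC values below are
  # computed from binary predictions rather than from predicted probabilities.
  micro_auc = roc_auc_score(y_true_all, y_pred_all, average="micro")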
  print(f"Micro Averaging. Precision: {p1}, Recall: {r1}, F1 Score: {f1}, \
          AUC: {micro_auc}")

  # macro level metrics
  p2, r2, f2, s2 = precision_recall_fscore_support(y_true_all, y_pred_all, 
                                                  average="macro")
  macro_auc = roc_auc_score(y_true_all, y_pred_all, average="macro")
  print(f"Macro Averaging. Precision: {p2}, Recall: {r2}, F1 Score: {f2}, \
          AUC: {macro_auc}")

  results = {
      "micro": [p1, r1, f1],
      "macro": [p2, r2, f2]
  }

  with open(get_results_file_name("deep-labeler"), "ab") as rfile:
    pickle.dump(results, rfile)
  
  # per-label precision, recall and F1 for each of the 50 ICD codes
  for idx in range(50):
    p, r, f, _ = precision_recall_fscore_support(y_true_all[:,idx],
                                                 y_pred_all[:,idx],
                                                 average='binary')
    print(f"p={p}, r={r}, f={f}")
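
The evaluation reports both micro and macro averages, which can differ noticeably when labels are imbalanced. The following pure-Python sketch shows the difference for precision and recall on made-up counts. The function names and the numbers are illustrative assumptions; the notebook relies on scikit-learn for the actual metrics.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def micro_macro(counts):
    """counts: one (tp, fp, fn) tuple per label."""
    # Macro: compute the metric per label, then take the unweighted mean.
    per_label = [precision_recall(*c) for c in counts]
    macro_p = sum(p for p, _ in per_label) / len(per_label)
    macro_r = sum(r for _, r in per_label) / len(per_label)
    # Micro: pool the counts over all labels, then compute the metric once.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro_p, micro_r = precision_recall(tp, fp, fn)
    return {"micro": (micro_p, micro_r), "macro": (macro_p, macro_r)}

# A frequent label predicted well and a rare label predicted poorly.
toy_counts = [(8, 2, 2), (1, 1, 3)]
averages = micro_macro(toy_counts)
print(averages["micro"])
print(averages["macro"])
```

In this toy case the micro precision is 9/12 = 0.75 while the macro precision is (0.8 + 0.5) / 2 = 0.65, because the macro average gives the rare, poorly predicted label the same weight as the frequent one.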

In [None]:
if model is None:
  print("load from disk")
  model = DeepLabeler()
  # NOTE: append the file name of a saved state dict to the path below.
  model.load_state_dict(torch.load(f"{PROJECT_PATH}/models/",
                                   map_location="cpu"))
  if torch.cuda.is_available():
    model.cuda()
  evaluate_model(model, test_loader)
else:
  print("evaluating in-memory model")
  evaluate_model(model, test_loader)

evaluating in-memory model
Micro Averaging. Precision: 0.43531539678950937, Recall: 0.5684117299744145, F1 Score: 0.49303913618710254,           AUC: 0.7474107992817327
Macro Averaging. Precision: 0.39562850169915237, Recall: 0.4724516202766275, F1 Score: 0.4190941059846913,           AUC: 0.6954960423444898
p=0.37510729613733906, r=0.5421836228287841, f=0.443429731100964
p=0.5836431226765799, r=0.5804066543438078, f=0.5820203892493049
p=0.36137366099558915, r=0.5758032128514057, f=0.44405729771583435
p=0.30420280186791193, r=0.3627684964200477, f=0.3309143686502177
p=0.38059047062262497, r=0.6899841017488076, f=0.4905802562170309
p=0.28212290502793297, r=0.2069672131147541, f=0.23877068557919623
p=0.23865877712031558, r=0.18113772455089822, f=0.20595744680851066
p=0.2871287128712871, r=0.3228744939271255, f=0.3039542639352072
p=0.3746630727762803, r=0.29324894514767935, f=0.3289940828402367
p=0.47126436781609193, r=0.5705765407554672, f=0.5161870503597122
p=0.2181571815718157, r=0.136