# 1 Summary
This notebook investigates how the KSI framework performs when the attention mechanism is removed from its document similarity learning model. The task is medical code prediction from clinical notes.

This is an ablation study within a reproduction study of the following paper:

Tian Bai and Slobodan Vucetic. 2019. Improving medical code prediction from clinical text via incorporating online knowledge sources. In The World Wide Web Conference, WWW ’19, pages 72–82, New York, NY, USA. Association for Computing Machinery.

# 2 Preparation

## 2.1 Check GPU Status

This notebook requires hardware acceleration with GPU. Run the following code to make sure the GPU is running.

In [None]:
!nvidia-smi

Sun Apr 30 03:47:01 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## 2.2 Dataset Loading

In this section, the following files are loaded:

* `NOTEEVENTS.csv`
* `DIAGNOSES_ICD.csv`
* `wikipedia_knowledge`
* `IDlist.npy`



To obtain `NOTEEVENTS.csv` and `DIAGNOSES_ICD.csv`, complete the training course "CITI Data or Specimens Only Research" at [https://about.citiprogram.org/](https://about.citiprogram.org/), then apply for access to the MIMIC-III Clinical Database on PhysioNet at [https://physionet.org/content/mimiciii/1.4/](https://physionet.org/content/mimiciii/1.4/). Once access is granted, download `NOTEEVENTS.csv` and `DIAGNOSES_ICD.csv` from PhysioNet and upload both CSV files to a folder named "cs598_project" in Google Drive.
  
Mount Google Drive in the Google Colab runtime.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


Display the files in the "cs598_project" folder. Make sure `NOTEEVENTS.csv` and `DIAGNOSES_ICD.csv` are uploaded to this folder.

In [None]:
!ls drive/MyDrive/cs598_project

CAML_model  DIAGNOSES_ICD.csv  KSI_CNN_model	  KSI_RNN_model   RNNattn_model
CNN_model   KSI_CAML_model     KSI_RNNattn_model  NOTEEVENTS.csv  RNN_model


Copy `NOTEEVENTS.csv` and `DIAGNOSES_ICD.csv` from Google Drive to Google Colab Runtime.

In [None]:
!cp drive/MyDrive/cs598_project/DIAGNOSES_ICD.csv DIAGNOSES_ICD.csv
!cp drive/MyDrive/cs598_project/NOTEEVENTS.csv NOTEEVENTS.csv

`wikipedia_knowledge` and `IDlist.npy` can be downloaded from the GitHub Repository of the original paper ([https://github.com/tiantiantu/KSI](https://github.com/tiantiantu/KSI)).

In [None]:
!wget https://raw.githubusercontent.com/tiantiantu/KSI/master/wikipedia_knowledge
!wget https://github.com/tiantiantu/KSI/raw/master/IDlist.npy

--2023-04-30 03:48:00--  https://raw.githubusercontent.com/tiantiantu/KSI/master/wikipedia_knowledge
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5311908 (5.1M) [text/plain]
Saving to: ‘wikipedia_knowledge’


2023-04-30 03:48:01 (174 MB/s) - ‘wikipedia_knowledge’ saved [5311908/5311908]

--2023-04-30 03:48:01--  https://github.com/tiantiantu/KSI/raw/master/IDlist.npy
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/tiantiantu/KSI/master/IDlist.npy [following]
--2023-04-30 03:48:01--  https://raw.githubusercontent.com/tiantiantu/KSI/master/IDlist.npy
Resolving raw.githubusercontent.com (raw.git

## 2.3 Load Python Files

`preprocessing.py`, `training.py`, and `testing.py` contain the functions for data preprocessing, training, and testing of the deep learning models. They can be downloaded from the GitHub repository of this reproduction study ([https://github.com/chenwusi2012/CS598_KSI](https://github.com/chenwusi2012/CS598_KSI)).

In [None]:
!wget https://raw.githubusercontent.com/chenwusi2012/CS598_KSI/main/preprocessing.py
!wget https://raw.githubusercontent.com/chenwusi2012/CS598_KSI/main/training.py
!wget https://raw.githubusercontent.com/chenwusi2012/CS598_KSI/main/testing.py

--2023-04-30 03:48:01--  https://raw.githubusercontent.com/chenwusi2012/CS598_KSI/main/preprocessing.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1717 (1.7K) [text/plain]
Saving to: ‘preprocessing.py’


2023-04-30 03:48:01 (40.8 MB/s) - ‘preprocessing.py’ saved [1717/1717]

--2023-04-30 03:48:02--  https://raw.githubusercontent.com/chenwusi2012/CS598_KSI/main/training.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4678 (4.6K) [text/plain]
Saving to: ‘training.py’


2023-04-30 03:48:02 (66.0 MB/s) - ‘training.py’ saved [4

## 2.4 Check Loaded Files

Check if the required files for this notebook have been loaded to the directory.

In [None]:
import os

assert os.path.exists('NOTEEVENTS.csv'), "NOTEEVENTS.csv is not in the directory"
assert os.path.exists('DIAGNOSES_ICD.csv'), "DIAGNOSES_ICD.csv is not in the directory"
assert os.path.exists('wikipedia_knowledge'), "wikipedia_knowledge is not in the directory"
assert os.path.exists('IDlist.npy'), "IDlist.npy is not in the directory"
assert os.path.exists('preprocessing.py'), "preprocessing.py is not in the directory"
assert os.path.exists('training.py'), "training.py is not in the directory"
assert os.path.exists('testing.py'), "testing.py is not in the directory"

## 2.5 Running Time Tracking

Import timeit and datetime to track the running time of the notebook.

In [None]:
import timeit
import datetime
total_start = timeit.default_timer()

# 3 Data Pre-processing

## 3.1 Data Pre-processing 1

Install `stop-words` to filter out stop words in the clinical notes.

In [None]:
!pip install stop-words

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stop-words
  Downloading stop-words-2018.7.23.tar.gz (31 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: stop-words
  Building wheel for stop-words (setup.py) ... done
  Created wheel for stop-words: filename=stop_words-2018.7.23-py3-none-any.whl size=32910 sha256=eea3f7508e9c2ed689a1d86ea72e45d5b251da26a0ec229797cdd40d5c011056
  Stored in directory: /root/.cache/pip/wheels/d0/1a/23/f12552a50cb09bcc1694a5ebb6c2cd5f2a0311de2b8c3d9a89
Successfully built stop-words
Installing collected packages: stop-words
Successfully installed stop-words-2018.7.23


Import the needed Python packages for this section.

In [None]:
import codecs
from collections import defaultdict
import csv
import string
from stop_words import get_stop_words    # download stop words package from https://pypi.org/project/stop-words/
import numpy as np
import datetime
import pandas as pd

Record the start timestamp of this section to track the total running time of this section.

In [None]:
start = timeit.default_timer()

Create a dictionary from `NOTEEVENTS.csv`. The key is `HADM_ID` (the ID of a hospital visit), and the value is a list of the discharge summary notes for that visit.  
For each note, replace line breaks with whitespace, remove punctuation, and lowercase all letters.  
From paper: "During preprocessing we lowercased all tokens and removed punctuations, stop words, words containing only digits, and words whose frequency is less than 10".


In [None]:
stop_words = get_stop_words('english')

admidic=defaultdict(list)
num_of_notes=0

with open('NOTEEVENTS.csv', 'r') as csvfile:
  spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
  for row in spamreader: # Iterate over all rows in NOTEEVENTS.csv
    if row[6]=='Discharge summary': # Keep only rows whose CATEGORY is discharge summary
      # Group the notes by HADM_ID (row[2]); the note text is the last column
      # Append the text to the list of notes for this HADM_ID (visit)
      # Replace line breaks with spaces, remove punctuation, and lowercase all words
      admidic[row[2]].append(row[-1].replace('\n',' ').translate(str.maketrans('','',string.punctuation)).lower())
      num_of_notes = num_of_notes+1 # Count the discharge summary notes
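
The cleaning applied to each note can be tried on a small string. The following is a minimal sketch; the helper name `clean_note` and the sample text are made up for illustration and are not part of the notebook.

```python
import string

def clean_note(text: str) -> str:
    """Replace line breaks with spaces, strip ASCII punctuation, and lowercase."""
    no_breaks = text.replace("\n", " ")
    no_punct = no_breaks.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower()

# Hypothetical sample text, not taken from MIMIC-III
sample = "Pt. admitted w/ CHF;\nStable on discharge."
print(clean_note(sample))
```

Note that punctuation is deleted rather than replaced by a space, so an abbreviation such as `b.i.d.` collapses into the single token `bid`.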

Calculate the word occurrence (including stop words) in `NOTEEVENTS.csv`.

In [None]:
u=defaultdict(int) # The default count for each word is 0
for i in admidic: # for each visit
  for jj in admidic[i]: # for each saved note
    line=jj.strip('\n').split() # split into a list of words
    for j in line: # Iterate each word
      u[j]=u[j]+1 # count the number of words

Print the number of words (vocabulary) in `NOTEEVENTS.csv`.  
Print the occurrence of the word "consistent" in `NOTEEVENTS.csv`.  
Check if the stop word "the" is in the dictionary.

In [None]:
print(f"Number of words in NOTEEVENTS.csv = {len(u)}")
print(u["consistent"])
print("the" in u)

Number of words in NOTEEVENTS.csv = 529818
35831
True


Remove the stop words.  
Remove the words that contain only digits.  
Remove the words that occur 10 times or fewer (the code keeps a word only if its count is greater than 10).  
From paper: "During preprocessing we lowercased all tokens and removed punctuations, stop words, words containing only digits, and words whose frequency is less than 10".

In [None]:
u2=defaultdict(int) # Create a new dict to filter out some words
for i in u: # iterate each word
  if i.isdigit()==False: # Make sure the word is not a number
    if u[i]>10: # Make sure the word occurs more than 10 times
      if i not in stop_words: # Make sure the word is not a stop word
        u2[i]=u[i]
num_of_words_in_notes = len(u2)
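
The three filters above can be checked on a toy word count. This is a minimal sketch: `filter_vocab`, the toy counts, and the toy stop-word set are hypothetical names and data chosen for illustration. The threshold mirrors the `> 10` comparison used in the cell above.

```python
from collections import Counter

def filter_vocab(counts, stop_words, min_count_exclusive=10):
    """Keep words that are not all digits, occur more than the threshold, and are not stop words."""
    return {
        word: count
        for word, count in counts.items()
        if not word.isdigit()
        and count > min_count_exclusive
        and word not in stop_words
    }

# Hypothetical counts: a stop word, a frequent word, a number, and two borderline words
toy_counts = Counter({"the": 5000, "pneumonia": 42, "2019": 300, "rare": 10, "chest": 11})
toy_stop_words = {"the", "and", "of"}

kept = filter_vocab(toy_counts, toy_stop_words)
print(kept)
```

With a strict `>` comparison, a word that occurs exactly 10 times ("rare" in the toy data) is dropped, while a word that occurs 11 times is kept.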

Print the number of words (vocabulary) in `NOTEEVENTS.csv` AFTER the removal of stop words (From paper: "The final word vocabulary contains 47,965 unique words.").  
Check if the stop word "the" is in the dictionary.

In [None]:
print(f"Number of words in NOTEEVENTS.csv = {num_of_words_in_notes}")
print("the" in u2)

Number of words in NOTEEVENTS.csv = 47964
False


Create a dictionary for `DIAGNOSES_ICD.csv`. The key is `HADM_ID`, and the value is a list of ICD-9 codes for that `HADM_ID`.  
Add a prefix "d_" to the ICD-9 codes.

In [None]:
u=[]   

file1=codecs.open('DIAGNOSES_ICD.csv','r')
ad2c=defaultdict(list)
line=file1.readline() # Skip the 1st line
line=file1.readline() # Read the 2nd line

while line:
  line=line.strip().split(',') # Split a row into a list

  if line[4][1:-1]!='': # If ICD9_CODE column is not empty
    ad2c[line[2]].append("d_"+line[4][1:-1]) # Append the code to list of codes for a HADM_ID
  
  line=file1.readline() # Read the next line

Print a sample key-value pair (ICD-9 codes for visit 172335).

In [None]:
print(ad2c["172335"])
print(len(ad2c["172335"]))

['d_40301', 'd_486', 'd_58281', 'd_5855', 'd_4254', 'd_2762', 'd_7100', 'd_2767', 'd_7243', 'd_45829', 'd_2875', 'd_28521', 'd_28529', 'd_27541']
14


Calculate the code occurrence in `DIAGNOSES_ICD.csv`.

In [None]:
codeu=defaultdict(int)
for i in ad2c:
  for j in ad2c[i]:
    codeu[j]=codeu[j]+1 # count the occurrence of codes

Print the number of unique ICD-9 codes (original codes) in `DIAGNOSES_ICD.csv`.  
Print the number of occurrences of one code.

In [None]:
print(f"number of codes in DIAGNOSES_ICD.csv = {len(codeu)}")
print(codeu["d_486"])

number of codes in DIAGNOSES_ICD.csv = 6984
4839


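Group the ICD-9 codes in `DIAGNOSES_ICD.csv` by their first three characters. The slice `code[0:5]` keeps the `d_` prefix plus the first three characters of the code (From paper: "We extracted all listed ICD-9 diagnosis codes for each visit and grouped them by their first three digits").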

In [None]:
ad2c2 = defaultdict(list)
for hadm_id in ad2c:
  for code in ad2c[hadm_id]:
    if code[0:5] not in ad2c2[hadm_id]:
      ad2c2[hadm_id].append(code[0:5])
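
The grouping step can be reproduced on the sample visit printed earlier. This is a minimal sketch: the helper `group_codes` is a hypothetical name, and it uses a set for the duplicate check instead of the list lookup in the cell above, which keeps the same order of first appearance.

```python
def group_codes(codes, width=3):
    """Truncate each 'd_'-prefixed code to its first `width` characters, dropping duplicates in order."""
    prefix_len = len("d_") + width
    grouped = []
    seen = set()
    for code in codes:
        key = code[:prefix_len]
        if key not in seen:
            seen.add(key)
            grouped.append(key)
    return grouped

# Codes of visit 172335 as printed earlier in this notebook
visit_codes = ['d_40301', 'd_486', 'd_58281', 'd_5855', 'd_4254', 'd_2762', 'd_7100',
               'd_2767', 'd_7243', 'd_45829', 'd_2875', 'd_28521', 'd_28529', 'd_27541']

grouped_codes = group_codes(visit_codes)
print(grouped_codes)
print(len(grouped_codes))
```

The two pairs `d_2762`/`d_2767` and `d_28521`/`d_28529` collapse into `d_276` and `d_285`, which is why 14 original codes become 12 grouped codes.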

Print a sample key-value pair (same as the one before ICD-9 code grouping).

In [None]:
print(ad2c2["172335"])
print(len(ad2c2["172335"]))

['d_403', 'd_486', 'd_582', 'd_585', 'd_425', 'd_276', 'd_710', 'd_724', 'd_458', 'd_287', 'd_285', 'd_275']
12


Calculate the occurrence of the grouped codes in `DIAGNOSES_ICD.csv`.

In [None]:
codeu2=defaultdict(int)
for hadm_id in ad2c2:
  for code in ad2c2[hadm_id]:
    codeu2[code]=codeu2[code]+1 # count the occurrence of codes
num_of_codes_in_notes = len(codeu2)

Print the number of unique ICD-9 codes (after grouping) in `DIAGNOSES_ICD.csv` (From paper: "The code vocabulary contains 942 codes.").

In [None]:
print(f"Number of codes after grouping = {num_of_codes_in_notes}")

Number of codes after grouping = 942


Iterate over the `HADM_ID` values in `IDlist.npy`, and combine the code data (from `DIAGNOSES_ICD.csv`) and the note data (from `NOTEEVENTS.csv`) into a single file `combined_dataset`.  
`combined_dataset`: A dataset that contains the clinical notes of hospital visits and the corresponding diagnoses (ICD-9 codes).

In [None]:
fileo=codecs.open("combined_dataset",'w')

num_of_aggregated_notes = 0
lower_limit_freq = 0
upper_limit_freq = 10000000

IDlist=np.load('IDlist.npy',encoding='bytes').astype(str) # a list of HADM_ID
for hadm_id in IDlist:
  if ad2c2[hadm_id]!=[]: # If this HADM_ID has codes in ad2c2
    add_to_set = False
    for code in ad2c2[hadm_id]:
      if codeu2[code] >= lower_limit_freq and codeu2[code] <= upper_limit_freq:
        add_to_set = True
    if add_to_set:
      fileo.write('start! '+hadm_id+'\n') # write the HADM_ID of this visit
      fileo.write('codes: ')
      tempc=[]
      for code in ad2c2[hadm_id]: # for each code
        if codeu2[code] >= lower_limit_freq and codeu2[code] <= upper_limit_freq: # if the code occurrence is within the limits
          if code not in tempc:
            tempc.append(code) # save d_ and first 3 digits of ICD-9 code
      
      for code in tempc:
        fileo.write(code+" ") # write code to combined dataset
      fileo.write('\n')
      fileo.write('notes:\n') # write note
      for line in admidic[hadm_id]: # iterate over the notes of this visit
        thisline=line.strip('\n').split() 
        for j in thisline: # iterate each word
          if j in u2: # if this word is in the filtered vocabulary
            fileo.write(j+" ") # write this word
        fileo.write('\n') # end the line after each note
      fileo.write('end!\n')
      num_of_aggregated_notes = num_of_aggregated_notes + 1
fileo.close()
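
The layout of `combined_dataset` (a `start!` line, a `codes:` line, a `notes:` line, the note words, and an `end!` line) is read back several times later in this notebook. The following sketch shows one way to parse that layout in a single pass. The function `read_records` and the sample text are hypothetical and are not used by the notebook.

```python
import io

def read_records(handle):
    """Yield (codes, words) for every record between a 'codes:' line and an 'end!' line."""
    codes, words, in_notes = None, [], False
    for raw in handle:
        tokens = raw.split()
        if not tokens:
            continue  # skip blank lines
        if in_notes:
            if tokens == ["end!"]:
                yield codes, words
                codes, words, in_notes = None, [], False
            else:
                words.extend(tokens)
        elif tokens[0] == "codes:":
            codes = tokens[1:]
        elif tokens[0] == "notes:":
            in_notes = True

# Hypothetical two-record sample in the same layout as combined_dataset
sample_text = (
    "start! 100001\n"
    "codes: d_486 d_518 \n"
    "notes:\n"
    "admission pneumonia \n"
    "\n"
    "chest clear \n"
    "end!\n"
    "start! 100002\n"
    "codes: d_403 \n"
    "notes:\n"
    "renal failure \n"
    "end!\n"
)

records = list(read_records(io.StringIO(sample_text)))
print(records)
```

Because punctuation is removed from the notes before they are written, the marker `end!` cannot be confused with a note word.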
 

Print the number of note-code pairs added to the dataset (From paper: "The number of aggregated discharge summary notes is 52,722.").

In [None]:
print(f"Number of notes written to combined dataset = {num_of_aggregated_notes}")

Number of notes written to combined dataset = 52722


Print out the total running time of this section.

In [None]:
stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s') 

Time duration: 0h 01m 45s


## 3.2 Data Pre-processing 2

Import the needed Python packages for this section.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

Record the start timestamp of this section to track the total running time of this section.

In [None]:
start = timeit.default_timer()

Build a vocabulary for Wiki documents. The key is a word, and the value is 1 if the word exists.

In [None]:
wikivocab={}
file1=codecs.open("wikipedia_knowledge",'r','utf-8')
line=file1.readline()
while line:
  if line[0:3]!='XXX': # if this line is not start or end of doc
    line=line.strip('\n')
    line=line.split()
    for i in line: # iterate each word
      wikivocab[i.lower()]=1 # if a word exists in doc, set to 1
  line=file1.readline()
num_of_words_in_wiki = len(wikivocab)

Print the size of the vocabulary for Wiki documents (From paper: "The size of the word vocabulary of Wikipedia documents is 60,968.").

In [None]:
print(f"Number of unique words in Wiki documents = {num_of_words_in_wiki}")

Number of unique words in Wiki documents = 60968


Build a vocabulary for `NOTEEVENTS.csv` (after word processing). The key is a word, and the value is 1 if the word exists.

In [None]:
notesvocab={}
filec=codecs.open("combined_dataset",'r','utf-8')

line=filec.readline()

while line:
  line=line.strip('\n')
  line=line.split()
  
  if line[0]=='codes:':
    line=filec.readline()
    line=line.strip('\n')
    line=line.split()
    
    if line[0]=='notes:': 
      line=filec.readline()
      while line!='end!\n':
        line=line.strip('\n')
        line=line.split()
        for word in line:
          notesvocab[word]=1
        line=filec.readline()            
  line=filec.readline()

Print the size of vocabulary for `NOTEEVENTS.csv` after word processing (From paper: "The final word vocabulary contains 47,965 unique words.").  
Print a sample key-value pair.

In [None]:
print(len(notesvocab))
print(notesvocab["consistent"])

47964
1


Get the intersection of the vocabulary for Wiki documents and the one for `NOTEEVENTS.csv`.

In [None]:
a1=set(notesvocab)
a2=set(wikivocab)
a3=a1.intersection(a2) # get the intersection
num_of_words_in_both = len(a3)

Print number of unique words in the intersection (From paper: "The size of the word vocabulary of Wikipedia documents is 60,968, out of which only 12,173 are also in the word vocabulary of MIMIC-III clinical notes.").

In [None]:
print(num_of_words_in_both)

12173


Create a list of lists. Each element list is a list of intersected words for one Wiki document.

In [None]:
wikidocuments=[] # a list of lists, each element is a list of intersected and filtered words for one doc
file2=codecs.open("wikipedia_knowledge",'r','utf-8')
line=file2.readline() 
while line:
  if line[0:4]=='XXXd': # Check if it is a start of a wiki doc
    tempf=[]
    line=file2.readline() # read the next line
    while line[0:4]!='XXXe': # keep reading until the end of that doc
      line=line.strip('\n')
      words=line.split()
      for word in words:
        if word.lower() in a3: # if this word also appears in note
          tempf.append(word.lower()) # add this word to the list for this doc
      line=file2.readline()
    wikidocuments.append(tempf) # add word list of doc to list of docs 
  line=file2.readline()

Create a list of lists. Each element list is a list of intersected words for one clinical note.

In [None]:
notesdocuments=[] # a list of lists, each element is a list of intersected and filtered words for one note
file3=codecs.open("combined_dataset",'r','utf-8')
line=file3.readline()
while line:
  line=line.strip('\n')
  line=line.split()
  if line[0]=='codes:': # if this line is for code
    line=file3.readline() # skip this line
    line=line.strip('\n')
    line=line.split()
    if line[0]=='notes:': # if this line is for note
      tempf=[]
      line=file3.readline()
      while line!='end!\n': # keep reading until the end of the note
        line=line.strip('\n')
        line=line.split()
        for word in line:
          if word in a3:
            tempf.append(word)     
        line=file3.readline()      
      notesdocuments.append(tempf)
  line=file3.readline()

Assign each intersected word a column index in the word matrices. The key is a word, and the value is its column index.  
This is a preparation for building the matrices of the intersected words for Wiki documents and clinical notes.

In [None]:
notesvocab={}
for i in notesdocuments: # for each element (is a list) in list
  for j in i: # for each word
    if j.lower() not in notesvocab: # if a word is not in dict
      # set the value of this word to the current size of the dict (its column index in the matrix)
      notesvocab[j.lower()]=len(notesvocab) 

Create a list of strings for the Wiki documents. Each element is a string that holds the intersected words of one Wiki document.  
This is a preparation for building the matrices of the intersected words for Wiki documents and clinical notes.

In [None]:
wikidata=[] # each element is a string, each string contains words for a wiki doc
for i in wikidocuments:
  temp=''
  for j in i:
    temp=temp+j+" "
  wikidata.append(temp)    

Create a list of strings for the clinical notes. Each element is a string that holds the intersected words of one note.  
This is a preparation for building the matrices of the intersected words for Wiki documents and clinical notes.

In [None]:
notedata=[] # each element is a string, each string contains words for a note
for i in notesdocuments: # for each element (list) in list
  temp=''
  for j in i: # for each word
    temp=temp+j+" " # create a string, words for a note separated by a space
  notedata.append(temp)

Create 2 word matrices (intersected words). One is for Wiki documents, and the other is for clinical notes.  
`notevec`: A matrix of intersected words for clinical notes.  
`wikivec`: A matrix of intersected words for Wiki documents.

In [None]:
# build binary bag-of-words matrices over the shared vocabulary
vect = CountVectorizer(min_df=1,vocabulary=notesvocab,binary=True)
# convert the list of strings to a matrix; an entry is 1 if the word occurs in the string
binaryn = vect.fit_transform(notedata)
binaryn=binaryn.toarray() # convert the sparse matrix to a dense ndarray
binaryn=np.array(binaryn,dtype=float)

vect2 = CountVectorizer(min_df=1,vocabulary=notesvocab,binary=True)
binaryk = vect2.fit_transform(wikidata)
binaryk=binaryk.toarray()
binaryk=np.array(binaryk,dtype=float)
notevec = binaryn
wikivec = binaryk
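
To make the content of `notevec` and `wikivec` concrete, the following pure-Python sketch builds the same kind of binary word-presence rows for a toy vocabulary. The function `binary_bow` and the toy data are hypothetical. It is an approximation of the `CountVectorizer` call above, not a replacement: scikit-learn's default tokenizer also ignores single-character tokens, which this sketch does not do.

```python
def binary_bow(docs, vocab):
    """Return one row per document; entry [i][j] is 1.0 if word j of vocab occurs in document i."""
    rows = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for word in doc.split():
            col = vocab.get(word)
            if col is not None:  # words outside the shared vocabulary are ignored
                row[col] = 1.0
        rows.append(row)
    return rows

# Hypothetical shared vocabulary (word -> column index) and documents
toy_vocab = {"pneumonia": 0, "chest": 1, "renal": 2}
toy_notes = ["chest pneumonia chest", "renal failure"]
toy_wiki = ["pneumonia infection lung chest"]

toy_notevec = binary_bow(toy_notes, toy_vocab)
toy_wikivec = binary_bow(toy_wiki, toy_vocab)
print(toy_notevec)
print(toy_wikivec)
```

Both matrices share the same columns because they are built from the same vocabulary, which is what allows a note row to be compared with a Wiki document row.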

Print the shape of the created matrices.  
For the matrix for clinical notes, the size of 1st dimension is the number of notes, and the size of the 2nd dimension is the number of intersected words.  
For the matrix for Wiki documents, the size of 1st dimension is the number of Wiki documents, and the size of the 2nd dimension is the number of intersected words.  

In [None]:
print(f"The shape of the matrix for clinical notes (notevec) = {binaryn.shape}")
print(f"The shape of the matrix for wiki docs (wikivec) = {binaryk.shape}")

The shape of the matrix for clinical notes (notevec) = (52722, 12173)
The shape of the matrix for wiki docs (wikivec) = (325, 12173)


Print out the total running time of this section.

In [None]:
stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')   

Time duration: 0h 04m 17s


## 3.3 Data Pre-processing 3

Import the needed Python packages for this section.

In [None]:
from sklearn.model_selection import train_test_split

Record the start timestamp of this section to track the total running time of this section.

In [None]:
start = timeit.default_timer()

Create 2 dictionaries for the ICD-9 codes in the Wiki documents.  
For the 1st dictionary, the key is an ICD-9 code that exists in the Wiki documents, and the value is 1.  
For the 2nd dictionary, the key is an ICD-9 code that exists in the Wiki documents, and the value is a list of the indices of the Wiki documents for this ICD-9 code.

In [None]:
wikivoc={}
codewiki=defaultdict(list)

file2=codecs.open("wikipedia_knowledge",'r','utf-8')
line=file2.readline()
count=0
while line:
  if line[0:4]=='XXXd': # the start line of a wiki doc
    line=line.strip('\n')
    codes=line.split()
    for code in codes:
      if code[0:2]=='d_': # if it is an ICD-9 code
        codewiki[code].append(count) # save the index of wiki doc to list for code
        wikivoc[code]=1 # set value of code to 1
    count=count+1
  line=file2.readline()
num_of_codes_in_wiki = len(wikivoc)
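
The indexing step can be sketched on a toy text, which also shows how codes that appear in more than one document can be detected. The function `index_wiki_codes` and the toy text are hypothetical. The sketch relies only on the two conventions used by the cell above: a document's start line begins with `XXXd`, and the codes on that line begin with `d_`. The rest of the toy header text is made up.

```python
import io
from collections import defaultdict

def index_wiki_codes(handle):
    """Map each 'd_' code found on a document start line to the indices of its documents."""
    code_to_docs = defaultdict(list)
    doc_index = 0
    for line in handle:
        if line.startswith("XXXd"):  # start line of a document
            for token in line.split():
                if token.startswith("d_"):
                    code_to_docs[token].append(doc_index)
            doc_index += 1
    return dict(code_to_docs)

# Hypothetical two-document sample; only the 'XXXd' and 'd_' prefixes follow the real file
toy_text = (
    "XXXdiseases d_486 d_480\n"
    "lung infection\n"
    "XXXend\n"
    "XXXdiseases d_486\n"
    "other text\n"
    "XXXend\n"
)

toy_codewiki = index_wiki_codes(io.StringIO(toy_text))
duplicated = [code for code, docs in toy_codewiki.items() if len(docs) > 1]
print(toy_codewiki)
print(duplicated)
```

A check like `duplicated` is one way to find codes that map to several documents before one document is chosen for each of them.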

Each of the following 4 ICD-9 codes appears in 2 Wiki documents.

In [None]:
print(codewiki['d_072'])
print(codewiki['d_698'])
print(codewiki['d_305'])
print(codewiki['d_386'])

[47, 214]
[106, 125]
[149, 250]
[219, 221]


For training purposes, each Wiki document can cover more than one ICD-9 code, but each ICD-9 code must map to only one Wiki document.  
For each of the 4 ICD-9 codes above, keep a single Wiki document.  
`wikivoc`: A dictionary whose keys are the ICD-9 codes that appear in the Wiki documents.  
`codewiki`: A dictionary that maps each ICD-9 code to the index of its Wiki document.

In [None]:
codewiki['d_072']=[214]
codewiki['d_698']=[125]
codewiki['d_305']=[250]
codewiki['d_386']=[219]

Prepare the features and labels for the deep learning model.  
Feature: A list of lists. Each element is the list of words of one clinical note.  
Label: A list of lists. Each element is the list of ICD-9 codes for one clinical note.

In [None]:
filec=codecs.open("combined_dataset",'r','utf-8')

line=filec.readline()

feature=[]
label=[]

while line:
  line=line.strip('\n')
  line=line.split()
  
  if line[0]=='codes:':
    temp=line[1:] # read the codes of that note
    label.append(temp) # add the code to list for label
    line=filec.readline()
    line=line.strip('\n')
    line=line.split()
    if line[0]=='notes:':
      tempf=[]
      line=filec.readline()
      
      while line!='end!\n': # read the notes until end
        line=line.strip('\n')
        line=line.split()
        tempf=tempf+line
        line=filec.readline()
      feature.append(tempf) # add list of words to list of feature
  line=filec.readline()

Create the index for the labels (ICD-9 codes). The key is an ICD-9 code, and the value is the position of that ICD-9 code in the code vector built later.

In [None]:
prevoc={}
for i in label:
  for j in i:
    if j not in prevoc:
      prevoc[j] = len(prevoc) # set up the order of codes (for vector)

Print sample key-value pairs (refer to `label[0]`).  
Print the number of key-value pairs (codes).

In [None]:
print(prevoc["d_486"])
print(prevoc["d_518"])
print(prevoc["d_511"])
print(len(prevoc))

0
1
2
941


Create mapping between ICD-9 codes and the index in the code vector by 2 dictionaries.  
**This mapping is for all codes found in the combined dataset.**

In [None]:
label_to_ix = {}
ix_to_label = {}

# create a mapping between code and index
for codes in label:
  for code in codes:
    if code not in label_to_ix:
      label_to_ix[code]=len(label_to_ix)
      ix_to_label[label_to_ix[code]]=code

Print sample key-value pairs.  
Print the number of ICD-9 codes found in combined dataset.

In [None]:
print(label_to_ix["d_486"])
print(ix_to_label[0])
print(f"Total number of codes = {len(label_to_ix)}")

0
d_486
Total number of codes = 941


Create a word vector (intersected words) for each of the ICD-9 codes found in `combined_dataset`.

*   If an ICD-9 code can be found in the Wiki documents: label index -> ICD-9 code -> index of the Wiki document -> vector of intersected words of the Wiki document (1 x number of intersected words).
*   If an ICD-9 code cannot be found in the Wiki documents: zero vector in the shape of (1 x number of intersected words).

Create a mapping between ICD-9 code index in label and corresponding

In [None]:
tempwikivec=[]

for i in range(0,len(ix_to_label)):
  if ix_to_label[i] in wikivoc: # if a code in note can be found in wiki docs
    temp=wikivec[codewiki[ix_to_label[i]][0]] # look up the word vector of the wiki doc for this code
    tempwikivec.append(temp)
  else:
    tempwikivec.append([0.0]*wikivec.shape[1])
wikivec=np.array(tempwikivec)
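
The lookup with a zero-vector fallback can be illustrated with plain lists. This is a minimal sketch: `vectors_for_codes` and the toy data are hypothetical, and `code_to_doc` holds one document index per code (the cell above uses the first entry of each list in `codewiki`).

```python
def vectors_for_codes(ix_to_code, code_to_doc, doc_vectors):
    """Return one vector per label index: the document vector if the code has one, else zeros."""
    width = len(doc_vectors[0])
    result = []
    for ix in range(len(ix_to_code)):
        code = ix_to_code[ix]
        if code in code_to_doc:
            result.append(list(doc_vectors[code_to_doc[code]]))
        else:
            result.append([0.0] * width)  # code has no document
    return result

# Hypothetical data: three label indices, two documents, three shared words
toy_ix_to_code = {0: "d_486", 1: "d_999", 2: "d_403"}
toy_code_to_doc = {"d_403": 0, "d_486": 1}
toy_doc_vectors = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

toy_code_vectors = vectors_for_codes(toy_ix_to_code, toy_code_to_doc, toy_doc_vectors)
print(toy_code_vectors)
```

After this step, row `i` of the result belongs to label index `i`, so the rows are ordered by label rather than by document.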

Create dataset. The dataset contains 3 parts:
*   Feature: a list of lists, and each element is a list of words (strings) for one clinical note.
*   Notevec: a list of vectors, and each element is a vector of intersected words for one clinical note.
*   Label: a list of lists, and each element is a list of ICD-9 codes for one clinical note.

In [None]:
data=[]
for i in range(0,len(feature)):
  # save feature (list of words for note), note matrix and label (code) as a tuple
  data.append((feature[i], notevec[i], label[i]))
    
data=np.array(data, dtype=object)

Create mapping between ICD-9 codes and the index in the code vector by 2 dictionaries.  
**Different from previous `label_to_ix` and `ix_to_label`, this mapping is for ICD-9 codes found in Wiki documents only.**

In [None]:
label_to_ix = {}
ix_to_label = {}

for doc, note, codes in data:
  for code in codes:
    if code not in label_to_ix:
      if code in wikivoc:
        label_to_ix[code]=len(label_to_ix)
        ix_to_label[label_to_ix[code]]=code

num_of_codes_in_both = len(label_to_ix)

Print sample key-value pairs.  
Print the number of ICD-9 codes which **exist in both clinical notes and Wiki documents** (From paper: "Of those codes, we selected a subset of 344 codes for which we found the corresponding Wikipedia document and used those codes in our experiments.").

In [None]:
print(label_to_ix["d_486"])
print(ix_to_label[0])
print(f"Total number of codes = {num_of_codes_in_both}")

0
d_486
Total number of codes = 344


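Split the data into training, validation, and test sets. The first split holds out 20% of the data for testing, and the second split takes 12.5% of the remaining 80% for validation, which gives a 70% / 10% / 20% split.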

In [None]:
training_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
training_data, val_data = train_test_split(training_data, test_size=0.125, random_state=42)
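
The effective share of each subset follows from the two `test_size` values. The short check below uses exact fractions; the variable names are chosen for illustration only.

```python
from fractions import Fraction

test_share = Fraction(1, 5)                 # test_size=0.2 in the first split
remaining = 1 - test_share                  # share passed to the second split
val_share = remaining * Fraction(1, 8)      # test_size=0.125 in the second split
train_share = remaining - val_share

print(train_share, val_share, test_share)
```

The actual subset sizes are rounded to whole samples by `train_test_split`, so they can differ from these shares by a sample or so.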

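Create an index for the words in the clinical notes. The index is built from the training data only, and index 0 is reserved for the placeholder token `OUT`.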

In [None]:
word_to_ix = {}
ix_to_word={}
ix_to_word[0]='OUT'

for doc, note, codes in training_data:
  for word in doc:
    if word not in word_to_ix:
      word_to_ix[word] = len(word_to_ix)+1
      ix_to_word[word_to_ix[word]]=word  
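
The sketch below shows how such an index could be used to turn a note into a list of integers. It is an assumption that words missing from the training vocabulary are mapped to index 0 (`OUT`); the notebook does not show this step, and the actual behavior is defined in `preprocessing.py`. The functions `build_word_index` and `encode` and the toy notes are hypothetical.

```python
def build_word_index(train_docs):
    """Assign indices starting at 1 in order of first appearance; 0 is left for unknown words."""
    word_to_ix = {}
    for doc in train_docs:
        for word in doc:
            if word not in word_to_ix:
                word_to_ix[word] = len(word_to_ix) + 1
    return word_to_ix

def encode(doc, word_to_ix, oov_index=0):
    """Map each word to its index; assumed fallback for unseen words is oov_index."""
    return [word_to_ix.get(word, oov_index) for word in doc]

# Hypothetical training notes and one unseen note
toy_train_docs = [["admission", "pneumonia"], ["pneumonia", "chest"]]
toy_word_to_ix = build_word_index(toy_train_docs)

encoded = encode(["chest", "effusion", "admission"], toy_word_to_ix)
print(toy_word_to_ix)
print(encoded)
```

Building the index from the training data alone keeps words that occur only in the validation or test notes out of the model vocabulary.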

Print sample key-value pairs.

In [None]:
print(ix_to_word[0])
print(ix_to_word[1])
print(word_to_ix['admission'])

OUT
admission
1


Create a word vector (intersected words) for each of the ICD-9 codes found in **both Wiki document and clinical notes (combined dataset)**.

In [None]:
newwikivec=[]
for i in range(0,len(ix_to_label)):
  newwikivec.append(wikivec[prevoc[ix_to_label[i]]])
newwikivec=np.array(newwikivec)

Print sample result.  
Print the number of vectors in wikivec and newwikivec.

In [None]:
print(ix_to_label[0])
print(prevoc[ix_to_label[0]])
print(wikivec[prevoc[ix_to_label[0]]])
print(len(newwikivec))
print(len(wikivec))

d_486
0
[1. 0. 0. ... 0. 0. 0.]
344
941


Print out the total running time of this section.

In [None]:
stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')  

Time duration: 0h 04m 21s


## 3.4 Data Pre-processing 4

Perform data processing on the training dataset, validation dataset, and test dataset. The function `preprocessing` can be found in `preprocessing.py`.

In [None]:
import torch
import torch.autograd as autograd
from preprocessing import preprocessing


wikisize=newwikivec.shape[0]
rvocsize=newwikivec.shape[1]
wikivec=autograd.Variable(torch.FloatTensor(newwikivec))

batchsize = 32

batchtraining_data=preprocessing(training_data, label_to_ix, word_to_ix, wikivoc, batchsize)
batchtest_data=preprocessing(test_data, label_to_ix, word_to_ix, wikivoc, batchsize)
batchval_data=preprocessing(val_data, label_to_ix, word_to_ix, wikivoc, batchsize) 

# 4 Data Statistics (Overview of Data)

The following table shows the statistics of the ***clinical notes from MIMIC-III dataset***. The values agree with the statistics in section 4.1 of the original paper, except that the word vocabulary here contains 47,964 words while the paper reports 47,965.

In [None]:
note_stat = []
note_stat.append(["Number of discharge summary notes", num_of_notes])
note_stat.append(["Number of aggregated discharge summary notes", num_of_aggregated_notes])
note_stat.append(["Number of words in discharge summary notes", num_of_words_in_notes])
note_stat.append(["Number of codes in discharge summary notes", num_of_codes_in_notes])
df_note_stat = pd.DataFrame(note_stat, columns=['Statistics', 'Result'])
df_note_stat

|   | Statistics | Result |
|---|---|---|
| 0 | Number of discharge summary notes | 59652 |
| 1 | Number of aggregated discharge summary notes | 52722 |
| 2 | Number of words in discharge summary notes | 47964 |
| 3 | Number of codes in discharge summary notes | 942 |


The following table shows the statistics of the ***Wikipedia articles for ICD-9 diagnosis codes***. The values match the statistics in Section 4.1 of the original paper.

In [None]:
wiki_stat = []
wiki_stat.append(["Number of words in Wiki articles", num_of_words_in_wiki])
wiki_stat.append(["Number of codes in Wiki articles", num_of_codes_in_wiki])
wiki_stat.append(["Number of words in both Wiki and notes", num_of_words_in_both])
wiki_stat.append(["Number of codes in both Wiki and notes", num_of_codes_in_both])
df_wiki_stat = pd.DataFrame(wiki_stat, columns=['Statistics', 'Result'])
df_wiki_stat

|   | Statistics | Result |
|---|---|---|
| 0 | Number of words in Wiki articles | 60968 |
| 1 | Number of codes in Wiki articles | 389 |
| 2 | Number of words in both Wiki and notes | 12173 |
| 3 | Number of codes in both Wiki and notes | 344 |


# 5 Training and Testing of Models

## 5.1 Convolutional Attention (CAML)

This section investigates the performance of the following model on the task of predicting ICD-9 diagnosis codes from the clinical notes in the MIMIC-III dataset:

*   Convolutional attention model with the KSI framework (KSI+CAML), but no attention mechanism in the document similarity learning model.

First, import the needed Python packages for this section.

In [None]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
torch.manual_seed(1)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
import copy
import pandas as pd

Record the start timestamp to track the total running time of this section.

In [None]:
start = timeit.default_timer()

Define the CAML model with the KSI framework (no attention mechanism in the document similarity learning model). The lines for the attention mechanism in the document similarity learning model are commented out.

In [None]:
Embeddingsize = 100
hidden_dim = 200

class CAML(nn.Module):

    def __init__(self, batch_size, vocab_size, tagset_size):
        super(CAML, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size+1, Embeddingsize, padding_idx=0)
        self.embed_drop = nn.Dropout(p=0.2)   
        
        
        self.convs1 = nn.Conv1d(Embeddingsize,300,10,padding=5)
        self.H=nn.Linear(300, tagset_size )   
        self.final = nn.Linear(300, tagset_size)
        
        self.layer2 = nn.Linear(Embeddingsize, 1)
        self.embedding=nn.Linear(rvocsize,Embeddingsize,bias=False)
        # self.vattention=nn.Linear(Embeddingsize,Embeddingsize)
        
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
    
        self.dropout = nn.Dropout(0.2)
    
    def forward(self, vec1, nvec, wiki, simlearning):
        
       
        thisembeddings=self.word_embeddings(vec1)
        thisembeddings = self.embed_drop(thisembeddings)
        thisembeddings=thisembeddings.transpose(1,2)
        
        
        thisembeddings=self.tanh(self.convs1(thisembeddings).transpose(1,2))  
        
        alpha=self.H.weight.matmul(thisembeddings.transpose(1,2))
        alpha=F.softmax(alpha, dim=2)
        
        m=alpha.matmul(thisembeddings)
       
        myfinal=self.final.weight.mul(m).sum(dim=2).add(self.final.bias)
        
        if simlearning==1:
            nvec=nvec.view(batchsize,1,-1)
            nvec=nvec.expand(batchsize,wiki.size()[0],-1)
            wiki=wiki.view(1,wiki.size()[0],-1)
            wiki=wiki.expand(nvec.size()[0],wiki.size()[1],-1)
            new=wiki*nvec
            new=self.embedding(new)
            # vattention=self.sigmoid(self.vattention(new))
            # new=new*vattention
            vec3=self.layer2(new)
            vec3=vec3.view(batchsize,-1)
        
       
        if simlearning==1:
            tag_scores = self.sigmoid(myfinal.detach()+vec3)
        else:
            tag_scores = self.sigmoid(myfinal)
              
        return tag_scores 
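
To make the ablation concrete, here is a minimal NumPy sketch of the similarity branch with and without the commented-out gate. It is not the notebook's PyTorch code: the sizes `B`, `L`, `V`, `D`, the random weights `W_e`, `W_a`, `b_a`, `w_o`, `b_o`, and the binary toy vectors are all assumptions chosen for illustration. They stand in for `self.embedding`, `self.vattention`, and `self.layer2` as configured in the `CAML` class.

```python
import numpy as np

rng = np.random.default_rng(0)
B, L, V, D = 2, 3, 5, 4   # batch, labels, shared vocabulary, embedding size (toy values)

note = rng.integers(0, 2, size=(B, V)).astype(float)   # toy note vectors (nvec)
wiki = rng.integers(0, 2, size=(L, V)).astype(float)   # toy Wikipedia vectors

W_e = rng.normal(size=(V, D))   # plays the role of self.embedding (no bias)
W_a = rng.normal(size=(D, D))   # plays the role of self.vattention
b_a = rng.normal(size=D)
w_o = rng.normal(size=D)        # plays the role of self.layer2
b_o = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Element-wise product of every note with every label document: (B, L, V).
z = note[:, None, :] * wiki[None, :, :]
e = z @ W_e                                        # (B, L, D)

# Full KSI: gate each embedding dimension before the output layer.
gate = sigmoid(e @ W_a + b_a)                      # (B, L, D)
score_with_attention = (gate * e) @ w_o + b_o      # (B, L)

# Ablation: skip the gate and feed e directly to the output layer.
score_without_attention = e @ w_o + b_o            # (B, L)

print(score_with_attention.shape, score_without_attention.shape)
```

One consequence worth noting: without the gate, the two linear layers compose into a single affine map of the intersection vector, so `score_without_attention` equals `z @ (W_e @ w_o) + b_o`.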

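Train the CAML model with the KSI framework (no attention mechanism in the document similarity learning model). The base CAML model is trained first (mode flag `0`), and the returned state dict initializes a second model that is trained with the KSI branch enabled (mode flag `1`). Training stops early when the top-10 recall on the validation dataset has not improved for 5 consecutive epochs. The function `trainmodel` can be found in `training.py`.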

In [None]:
from training import trainmodel

topk = 10
max_epochs = 5000 # Default is 5000
print(f"max epochs = {max_epochs}")

model = CAML(batchsize, len(word_to_ix), len(label_to_ix))
model.cuda()

loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters())

print("Train basemodel")
basemodel= trainmodel(model, 0, topk, max_epochs, batchtraining_data, batchval_data, wikivec, loss_function, optimizer)
torch.save(basemodel, 'CAML_model')

model = CAML(batchsize, len(word_to_ix), len(label_to_ix))
model.cuda()
model.load_state_dict(basemodel)
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
print("Train model with KSI")
KSImodel= trainmodel(model, 1, topk, max_epochs, batchtraining_data, batchval_data, wikivec, loss_function, optimizer)
torch.save(KSImodel, 'KSI_CAML_model')  

max epochs = 5000
Train basemodel
epoch = 0
validation recall @ top- 10 0.6044048484082241
Update the best model to epoch 0
epoch = 1
validation recall @ top- 10 0.7351386587887544
Update the best model to epoch 1
epoch = 2
validation recall @ top- 10 0.7708715147202543
Update the best model to epoch 2
epoch = 3
validation recall @ top- 10 0.7896376400647074
Update the best model to epoch 3
epoch = 4
validation recall @ top- 10 0.7971153736015517
Update the best model to epoch 4
epoch = 5
validation recall @ top- 10 0.8032056204173794
Update the best model to epoch 5
epoch = 6
validation recall @ top- 10 0.8069309121351105
Update the best model to epoch 6
epoch = 7
validation recall @ top- 10 0.805776089131652
epoch = 8
validation recall @ top- 10 0.8046463100059577
epoch = 9
validation recall @ top- 10 0.8076175557491282
Update the best model to epoch 9
epoch = 10
validation recall @ top- 10 0.8093299127083716
Update the best model to epoch 10
epoch = 11
validation recall @ top- 10 0.
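
The stopping rule can be sketched as follows. This is an illustration under stated assumptions, not the code in `training.py` (which is not shown here): the helper `early_stopping_summary` is hypothetical, a patience of 5 epochs with strict improvement is assumed, and the scores are made up rather than taken from the log above.

```python
def early_stopping_summary(recall_per_epoch, patience=5):
    """Return (best_epoch, best_recall, epochs_run) for a sequence of validation scores.

    `recall_per_epoch` stands in for "train one epoch, then evaluate on the
    validation set"; in real training each value would be computed on the fly.
    """
    best_recall = float("-inf")
    best_epoch = -1
    epochs_run = 0
    for epoch, recall in enumerate(recall_per_epoch):
        epochs_run = epoch + 1
        if recall > best_recall:
            # A real loop would snapshot the model weights here.
            best_recall, best_epoch = recall, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop.
            break
    return best_epoch, best_recall, epochs_run


# Illustrative scores only.
scores = [0.60, 0.74, 0.77, 0.79, 0.78, 0.78, 0.77, 0.78, 0.78, 0.76, 0.80]
print(early_stopping_summary(scores, patience=5))  # (3, 0.79, 9)
```

In this toy run the best epoch is 3, and the loop stops after epoch 8 without ever seeing the later, higher score. That is the usual trade-off of a fixed patience.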

Test the trained KSI+CAML model on the test dataset. The function `testmodel` can be found in `testing.py`.

In [None]:
from testing import testmodel

print('Test KSI+CAML')
model2 = CAML(batchsize, len(word_to_ix), len(label_to_ix))
camlKSI_loss, camlKSI_recall, camlKSI_mac_auc, camlKSI_mic_auc, camlKSI_mac_f1, camlKSI_mic_f1 = testmodel(model2, KSImodel, 1, batchtest_data, wikivec, topk)
 

Test KSI+CAML


Print out the total running time of this section.

In [None]:
stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Running time of this section: {duration[0]}h {duration[1]}m {duration[2]}s') 

Running time of this section: 0h 18m 25s


Display the performance metrics of the CAML model with the KSI framework (no attention mechanism in the document similarity learning model).

In [None]:
result_caml = [['KSI+CAML',  
         round(camlKSI_mac_auc, 3), 
         round(camlKSI_mic_auc, 3), 
         round(camlKSI_mac_f1, 3), 
         round(camlKSI_mic_f1, 3),
         round(camlKSI_loss, 3),
         round(camlKSI_recall, 3)]]
df_caml = pd.DataFrame(result_caml, columns=['Model', 'Macro_AUC', 'Micro_AUC', 'Macro_F1', 'Micro_F1', 'Loss', 'Recall@10'])
df_caml

|   | Model | Macro_AUC | Micro_AUC | Macro_F1 | Micro_F1 | Loss | Recall@10 |
|---|---|---|---|---|---|---|---|
| 0 | KSI+CAML | 0.855 | 0.978 | 0.273 | 0.659 | 0.033 | 0.808 |
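
For readers unfamiliar with the `Recall@10` column, the following pure-Python sketch shows one common definition of recall at top-k: for each note, the fraction of its true codes that appear among the k highest-scored codes, averaged over notes. Whether `testing.py` computes it in exactly this way is an assumption. The helper `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(scores, targets, k=10):
    """Mean over notes of (true codes ranked in the top k) / (true codes)."""
    per_note = []
    for row_scores, row_targets in zip(scores, targets):
        true_codes = {i for i, t in enumerate(row_targets) if t == 1}
        if not true_codes:
            continue  # skip notes without any gold code
        ranked = sorted(range(len(row_scores)), key=lambda i: row_scores[i], reverse=True)
        top_k = set(ranked[:k])
        per_note.append(len(true_codes & top_k) / len(true_codes))
    return sum(per_note) / len(per_note) if per_note else 0.0


# Two toy notes with four candidate codes each.
scores = [[0.9, 0.1, 0.8, 0.3],
          [0.2, 0.7, 0.1, 0.6]]
targets = [[1, 0, 0, 1],
           [0, 1, 0, 0]]
print(recall_at_k(scores, targets, k=2))  # 0.75
```

The first note recovers one of its two codes in the top 2 and the second recovers its only code, giving (0.5 + 1.0) / 2 = 0.75.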


## 5.2 Recurrent Neural Network with Attention (RNNatt)

This section investigates the performance of the following model on the task of predicting ICD-9 diagnosis codes from the clinical notes in the MIMIC-III dataset:

*   Recurrent neural network with attention and the KSI framework (KSI+RNNatt), but no attention mechanism in the document similarity learning model.

First, import the Python packages needed for this section.

In [None]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
torch.manual_seed(1)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
import copy

Record the start timestamp to track the total running time of this section.

In [None]:
start = timeit.default_timer()

Define the RNN model with attention and the KSI framework (no attention mechanism in the document similarity learning model). The lines for the attention mechanism in the document similarity learning model are commented out.

In [None]:
Embeddingsize = 100
hidden_dim = 200

class LSTMattn(nn.Module):

    def __init__(self, batch_size, vocab_size, tagset_size):
        super(LSTMattn, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size+1, Embeddingsize, padding_idx=0)
        self.lstm = nn.LSTM(Embeddingsize, hidden_dim)
        self.hidden = self.init_hidden()
        
        self.H=nn.Linear(hidden_dim, tagset_size )  
        self.final = nn.Linear(hidden_dim, tagset_size)
        
        self.layer2 = nn.Linear(Embeddingsize, 1,bias=False)
        self.embedding=nn.Linear(rvocsize,Embeddingsize)
        # self.vattention=nn.Linear(Embeddingsize,Embeddingsize,bias=False)
        
        self.softmax = nn.Softmax()
        self.sigmoid = nn.Sigmoid()
        self.embed_drop = nn.Dropout(p=0.2)
    
    def init_hidden(self):
        return (autograd.Variable(torch.zeros(1, batchsize, self.hidden_dim).cuda()),
                autograd.Variable(torch.zeros(1, batchsize, self.hidden_dim).cuda()))

    
    def forward(self, vec1, nvec, wiki, simlearning):
        
        
        thisembeddings=self.word_embeddings(vec1).transpose(0,1)
        thisembeddings = self.embed_drop(thisembeddings)
        
        
        if simlearning==1:
            nvec=nvec.view(batchsize,1,-1)
            nvec=nvec.expand(batchsize,wiki.size()[0],-1)
            wiki=wiki.view(1,wiki.size()[0],-1)
            wiki=wiki.expand(nvec.size()[0],wiki.size()[1],-1)
            new=wiki*nvec
            new=self.embedding(new)
            # vattention=self.sigmoid(self.vattention(new))
            # new=new*vattention
            vec3=self.layer2(new)
            vec3=vec3.view(batchsize,-1)
        
        
        lstm_out, self.hidden = self.lstm(
            thisembeddings, self.hidden)
        
        
        
        lstm_out=lstm_out.transpose(0,1)

        alpha=self.H.weight.matmul(lstm_out.transpose(1,2))
        alpha=F.softmax(alpha, dim=2)
        
        m=alpha.matmul(lstm_out)
        
        myfinal=self.final.weight.mul(m).sum(dim=2).add(self.final.bias)
        
        
        if simlearning==1:
            tag_scores = self.sigmoid(myfinal.detach()+vec3)
        else:
            tag_scores = self.sigmoid(myfinal)
                
        return tag_scores  
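
As a reading aid, the `forward` pass above can be written per label $\ell$ as follows. The symbols are introduced here for illustration and do not appear in the notebook: $H \in \mathbb{R}^{T \times d}$ is the LSTM output for one note of $T$ tokens, $u_\ell$ and $\beta_\ell$ are row $\ell$ of `self.H.weight` and `self.final.weight`, $b_\ell$ is entry $\ell$ of `self.final.bias`, $x$ is the note's vocabulary vector (`nvec`), $q_\ell$ is the Wikipedia vector of label $\ell$, and $W_e$, $b_e$, $w_o$ are the parameters of `self.embedding` and `self.layer2`.

$$
\alpha_\ell = \operatorname{softmax}_t\left(H u_\ell\right), \qquad
m_\ell = H^{\top} \alpha_\ell, \qquad
s_\ell = \beta_\ell^{\top} m_\ell + b_\ell
$$

$$
o_\ell = w_o^{\top}\left(W_e\,(x \odot q_\ell) + b_e\right), \qquad
\hat{y}_\ell =
\begin{cases}
\sigma\left(s_\ell + o_\ell\right) & \text{if simlearning} = 1 \\
\sigma\left(s_\ell\right) & \text{otherwise}
\end{cases}
$$

With the attention lines commented out, $o_\ell$ is an affine function of the intersection vector $x \odot q_\ell$. Because `myfinal` is detached when `simlearning==1`, gradients from the KSI stage do not flow back through $s_\ell$ into the LSTM and label-attention parameters.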

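Train the RNN model with attention and the KSI framework (no attention mechanism in the document similarity learning model). The base RNN model is trained first (mode flag `0`), and the returned state dict initializes a second model that is trained with the KSI branch enabled (mode flag `1`). Training stops early when the top-10 recall on the validation dataset has not improved for 5 consecutive epochs. The function `trainrnnmodel` can be found in `training.py`.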

In [None]:
from training import trainrnnmodel


topk = 10
max_epochs = 5000 # Default is 5000

model = LSTMattn(batchsize, len(word_to_ix), len(label_to_ix))
model.cuda()
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
print("Train basemodel")
basemodel= trainrnnmodel(model, 0, topk, max_epochs, batchtraining_data, batchval_data, wikivec, loss_function, optimizer)
torch.save(basemodel, 'RNNattn_model')

model = LSTMattn(batchsize, len(word_to_ix), len(label_to_ix))
model.cuda()
model.load_state_dict(basemodel)
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
print("Train model with KSI")
KSImodel= trainrnnmodel(model, 1, topk, max_epochs, batchtraining_data, batchval_data, wikivec, loss_function, optimizer)
torch.save(KSImodel, 'KSI_RNNattn_model')  

Train basemodel
epoch = 0
validation recall @ top- 10 0.3796880833045024
Update the best model to epoch 0
epoch = 1
validation recall @ top- 10 0.511078002036114
Update the best model to epoch 1
epoch = 2
validation recall @ top- 10 0.6083505897830405
Update the best model to epoch 2
epoch = 3
validation recall @ top- 10 0.6578323770023232
Update the best model to epoch 3
epoch = 4
validation recall @ top- 10 0.6940694854775656
Update the best model to epoch 4
epoch = 5
validation recall @ top- 10 0.7134766580187625
Update the best model to epoch 5
epoch = 6
validation recall @ top- 10 0.7351077450246006
Update the best model to epoch 6
epoch = 7
validation recall @ top- 10 0.7497311047325111
Update the best model to epoch 7
epoch = 8
validation recall @ top- 10 0.7638417743842686
Update the best model to epoch 8
epoch = 9
validation recall @ top- 10 0.7714110540940694
Update the best model to epoch 9
epoch = 10
validation recall @ top- 10 0.774969991821008
Update the best model to epo

Test the trained KSI+RNNatt model on the test dataset. The function `testrnnmodel` can be found in `testing.py`.

In [None]:
from testing import testrnnmodel


print('Test KSI+RNNatt')
model2 = LSTMattn(batchsize, len(word_to_ix), len(label_to_ix))
lstmattKSI_loss, lstmattKSI_recall, lstmattKSI_mac_auc, lstmattKSI_mic_auc, lstmattKSI_mac_f1, lstmattKSI_mic_f1 = testrnnmodel(model2, KSImodel, 1, batchtest_data, wikivec, topk)  

Test KSI+RNNatt


Print out the total running time of this section.

In [None]:
stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Running time of this section: {duration[0]}h {duration[1]}m {duration[2]}s') 

Running time of this section: 0h 45m 27s


Display the performance metrics of the RNN model with attention and the KSI framework (no attention mechanism in the document similarity learning model).

In [None]:
result_lstmatt = [['KSI+RNNatt', 
         round(lstmattKSI_mac_auc, 3), 
         round(lstmattKSI_mic_auc, 3), 
         round(lstmattKSI_mac_f1, 3), 
         round(lstmattKSI_mic_f1, 3),
         round(lstmattKSI_loss, 3), 
         round(lstmattKSI_recall, 3)]]
df_lstmatt = pd.DataFrame(result_lstmatt, columns=['Model', 'Macro_AUC', 'Micro_AUC', 'Macro_F1', 'Micro_F1', 'Loss', 'Recall@10'])
df_lstmatt

|   | Model | Macro_AUC | Micro_AUC | Macro_F1 | Micro_F1 | Loss | Recall@10 |
|---|---|---|---|---|---|---|---|
| 0 | KSI+RNNatt | 0.867 | 0.975 | 0.268 | 0.649 | 0.034 | 0.791 |


# 6 Performance Comparison

The following table shows the performance metrics of the two selected models without the attention mechanism in the document similarity learning model.

In [None]:
comparison_df = pd.concat([df_lstmatt, df_caml])
comparison_df

|   | Model | Macro_AUC | Micro_AUC | Macro_F1 | Micro_F1 | Loss | Recall@10 |
|---|---|---|---|---|---|---|---|
| 0 | KSI+RNNatt | 0.867 | 0.975 | 0.268 | 0.649 | 0.034 | 0.791 |
| 0 | KSI+CAML | 0.855 | 0.978 | 0.273 | 0.659 | 0.033 | 0.808 |
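
The gap between `Macro_F1` and `Micro_F1` in the table is consistent with how the two averages treat rare labels: macro-averaging gives every label equal weight, while micro-averaging pools the counts over all labels. The pure-Python sketch below illustrates this on made-up data. The helper `micro_macro_f1` is hypothetical and is not the `f1_score` call used by the notebook.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for binary multi-label indicator rows."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    macro = sum(f1(*c) for c in counts) / n_labels       # average of per-label F1
    micro = f1(*(sum(col) for col in zip(*counts)))      # F1 of pooled counts
    return micro, macro


# Toy data: label 0 is frequent and predicted well, label 1 is rare and always missed.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.889 0.5
```

A single missed rare label halves the macro score here while barely moving the micro score.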


# 7 Miscellaneous

Print out the total running time of the notebook.

In [None]:
total_stop = timeit.default_timer()
total_duration = str(datetime.timedelta(seconds=round(total_stop - total_start)))
total_duration = total_duration.split(":")
print(f'Notebook finished at {total_stop}')
print(f'Total running time of this notebook: {total_duration[0]}h {total_duration[1]}m {total_duration[2]}s')   

Notebook finished at 5236.365986309
Total running time of this notebook: 1h 15m 09s
