# Check GPU Status

This notebook requires a GPU runtime. Run the following cell to confirm that a GPU is attached.

In [None]:
!nvidia-smi

Wed Mar 22 18:35:20 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P0    24W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
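
The same check can be scripted so that later cells can stop early when no GPU is attached. This is a minimal sketch, assuming only that the `nvidia-smi` executable is on the `PATH` whenever the NVIDIA driver is installed; the helper name `gpu_available` is hypothetical and not part of the notebook.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi is found and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # no NVIDIA driver tools on PATH
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU detected" if gpu_available() else "No GPU detected")
```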

# Dataset Loading

In this section, the following files are loaded:

* NOTEEVENTS.csv
* DIAGNOSES_ICD.csv
* wikipedia_knowledge
* IDlist.npy



To load NOTEEVENTS.csv and DIAGNOSES_ICD.csv, complete the "CITI Data or Specimens Only Research" training at [https://about.citiprogram.org/](https://about.citiprogram.org/), then apply for access to the MIMIC-III Clinical Database on PhysioNet at [https://physionet.org/content/mimiciii/1.4/](https://physionet.org/content/mimiciii/1.4/). Once access is granted, download NOTEEVENTS.csv and DIAGNOSES_ICD.csv from PhysioNet and upload both CSV files to a folder named "cs598_project" in Google Drive.  
Mount Google Drive in the Google Colab runtime.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


List the files in the "cs598_project" folder to confirm that NOTEEVENTS.csv and DIAGNOSES_ICD.csv have been uploaded.

In [None]:
!ls drive/MyDrive/cs598_project

combined_dataset   IDlist.npy	   wikipedia_knowledge
DIAGNOSES_ICD.csv  NOTEEVENTS.csv


Copy NOTEEVENTS.csv and DIAGNOSES_ICD.csv from Google Drive to Google Colab Runtime.

In [None]:
!cp drive/MyDrive/cs598_project/DIAGNOSES_ICD.csv DIAGNOSES_ICD.csv
!cp drive/MyDrive/cs598_project/NOTEEVENTS.csv NOTEEVENTS.csv

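wikipedia_knowledge and IDlist.npy can be downloaded from the GitHub repository of the original paper. Use the `raw` URLs: the `blob` URLs return the GitHub HTML page instead of the file itself, which is why the log below, recorded with the `blob` URLs, reports `[text/html]`.

In [3]:
!wget https://github.com/tiantiantu/KSI/raw/master/wikipedia_knowledge
!wget https://github.com/tiantiantu/KSI/raw/master/IDlist.npy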

--2023-04-11 21:30:55--  https://github.com/tiantiantu/KSI/blob/master/wikipedia_knowledge
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘wikipedia_knowledge’

wikipedia_knowledge     [ <=>                ] 134.78K  --.-KB/s    in 0.007s  

2023-04-11 21:30:55 (18.6 MB/s) - ‘wikipedia_knowledge’ saved [138012]

--2023-04-11 21:30:55--  https://github.com/tiantiantu/KSI/blob/master/IDlist.npy
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘IDlist.npy’

IDlist.npy              [ <=>                ] 134.39K  --.-KB/s    in 0.009s  

2023-04-11 21:30:56 (14.0 MB/s) - ‘IDlist.npy’ saved [137612]
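
The log above reports `Length: unspecified [text/html]` for both downloads, which suggests that HTML pages were saved rather than the data files. A quick sanity check before going further can catch this. The sketch below is a heuristic only, and the helper name `looks_like_html` is hypothetical.

```python
from pathlib import Path


def looks_like_html(path: str, probe_bytes: int = 512) -> bool:
    """Heuristic: True if the file starts like an HTML page."""
    head = Path(path).read_bytes()[:probe_bytes].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


# Check the two downloaded files if they are present in the working directory.
for name in ("wikipedia_knowledge", "IDlist.npy"):
    if Path(name).exists():
        status = "HTML page, re-download" if looks_like_html(name) else "looks OK"
        print(f"{name}: {status}")
```

If either file is flagged, download it again from its raw URL or copy it from the Google Drive folder listed earlier.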



# Running Time Tracking

Import timeit and datetime to track the running time of the notebook.

In [None]:
import timeit
import datetime

# Data Pre-processing 1

Install stop-words to filter out stop words in the clinical notes.

In [None]:
!pip install stop-words

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stop-words
  Downloading stop-words-2018.7.23.tar.gz (31 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: stop-words
  Building wheel for stop-words (setup.py) ... done
  Created wheel for stop-words: filename=stop_words-2018.7.23-py3-none-any.whl size=32910 sha256=d11499108d1cf270475561cd7060c07a3d475f2471147a08c04b0a01ed8fb3b9
  Stored in directory: /root/.cache/pip/wheels/da/d8/66/395317506a23a9d1d7de433ad6a7d9e6e16aab48cf028a0f60
Successfully built stop-words
Installing collected packages: stop-words
Successfully installed stop-words-2018.7.23


In [None]:
import codecs
from collections import defaultdict
import csv
import string
from stop_words import get_stop_words    # download stop words package from https://pypi.org/project/stop-words/
import numpy as np
import datetime

In [None]:
start = timeit.default_timer()

Create a dictionary from NOTEEVENTS.csv. The key is HADM_ID (the ID of a hospital visit), and the value is the list of discharge summary notes for that visit.  
For each clinical note, replace line breaks with spaces, remove punctuation, and lowercase all letters.  
From paper: "During preprocessing we lowercased all tokens and removed punctuations, stop words, words containing only digits, and words whose frequency is less than 10".


In [None]:
stop_words = get_stop_words('english')

admidic=defaultdict(list)
count=0

with open('NOTEEVENTS.csv', 'r') as csvfile:
  spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
  for row in spamreader: # Iterate over all entries in NOTEEVENTS
    if row[6]=='Discharge summary': # Check if the category is discharge summary
      # Group the notes by HADM_ID (row[2], the visit ID); the note text is the last column
      # Append the text to the list of notes for this visit
      # after replacing line breaks, removing punctuation, and lowercasing
      admidic[row[2]].append(row[-1].replace('\n',' ').translate(str.maketrans('','',string.punctuation)).lower())
      count=count+1 # Count the number of discharge summary notes
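
The normalization applied to each note can be tried on a short string first. The sketch below repeats the same three steps as the cell above; the function name `normalize_note` and the sample strings are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_note(text: str) -> str:
    """Replace line breaks with spaces, drop punctuation, lowercase."""
    return text.replace("\n", " ").translate(_PUNCT_TABLE).lower()


sample = "Chest CT:\nNo acute findings."
print(normalize_note(sample))                  # chest ct no acute findings
print(normalize_note("process/tuberculosis"))  # processtuberculosis
```

Because punctuation is deleted rather than replaced with a space, tokens joined by a slash or hyphen are fused into one word, which is why fused tokens such as "processtuberculosis" show up in the sample note printed below.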

Print the number of discharge summary notes in NOTEEVENTS.csv (From paper: "The total number of discharge summary notes is 59,652.").  
Print a sample key-value pair (the clinical notes for visit 167853).

In [None]:
print(f"Number of notes in NOTEEVENTS.csv = {count}")
print(admidic['167853'])

Number of notes in NOTEEVENTS.csv = 59652
['admission date  2151716       discharge date  215184   service addendum  radiologic studies  radiologic studies also included a chest ct which confirmed cavitary lesions in the left lung apex consistent with infectious processtuberculosis  this also moderatesized left pleural effusion  head ct  head ct showed no intracranial hemorrhage or mass effect but old infarction consistent with past medical history  abdominal ct  abdominal ct showed lesions of t10 and sacrum most likely secondary to osteoporosis these can be followed by repeat imaging as an outpatient                                first name8 namepattern2  first name4 namepattern1 1775 last name namepattern1  md  md number1 1776  dictated byhospital 1807 medquist36  d  215185  1211 t  215185  1221 job  job number 1808 ', 'admission date  2151716       discharge date  215184    history of present illness  the patient is an 86 year old african american female who on the morning of 716 w

Calculate the number of occurrences of each word (including stop words) in NOTEEVENTS.csv.

In [None]:
u=defaultdict(int) # The default count for each word is 0
for i in admidic: # for each visit
  for jj in admidic[i]: # for each saved note
    line=jj.strip('\n').split() # split into a list of words
    for j in line: # Iterate each word
      u[j]=u[j]+1 # count the number of words
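
The same counting can be written with `collections.Counter`. This is an equivalent sketch on made-up data, not the notebook's own code; `toy_notes` stands in for `admidic`.

```python
from collections import Counter

# Toy stand-in for admidic: HADM_ID -> list of normalized notes (made-up text).
toy_notes = {
    "1": ["chest ct consistent with infection", "head ct normal"],
    "2": ["abdominal ct consistent with osteoporosis"],
}

# Count every whitespace-separated token across all notes of all visits.
word_counts = Counter(
    word
    for notes in toy_notes.values()
    for note in notes
    for word in note.split()
)

print(word_counts["ct"])          # 3
print(word_counts["consistent"])  # 2
print(len(word_counts))           # 9 unique words
```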

Print the vocabulary size (number of unique words) in NOTEEVENTS.csv.  
Print the number of occurrences of the word "consistent" in NOTEEVENTS.csv.  
Check whether the stop word "the" is in the dictionary.

In [None]:
print(f"Number of words in NOTEEVENTS.csv = {len(u)}")
print(u["consistent"])
print("the" in u)

Number of words in NOTEEVENTS.csv = 529818
35831
True


Remove the stop words and the words that contain only digits.  
Remove the words that occur 10 times or fewer (the code below keeps a word only if its count is greater than 10).  
From paper: "During preprocessing we lowercased all tokens and removed punctuations, stop words, words containing only digits, and words whose frequency is less than 10".

In [None]:
u2=defaultdict(int) # Create a new dict that keeps only the qualified words
for i in u: # Iterate over each word
  if not i.isdigit(): # Skip words made only of digits
    if u[i]>10: # Keep words that occur more than 10 times
      if i not in stop_words: # Skip stop words
        u2[i]=u[i]
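
The three filters can be collected into one helper and checked on a toy count table. This is a sketch under the same rules as the cell above (count strictly greater than the threshold); the name `filter_vocab` and the toy counts are hypothetical.

```python
from collections import Counter


def filter_vocab(counts, stop_words, min_count=10):
    """Keep words that are not pure digits, not stop words, and occur more than min_count times."""
    stop = set(stop_words)  # set membership avoids scanning a list for every word
    return {
        word: n
        for word, n in counts.items()
        if not word.isdigit() and n > min_count and word not in stop
    }


toy_counts = Counter({"the": 500, "ct": 40, "2151": 12, "rare": 3, "lesion": 11})
print(filter_vocab(toy_counts, stop_words=["the", "a"]))  # {'ct': 40, 'lesion': 11}
```

Converting the stop-word collection to a set is optional, but it keeps the membership test cheap when the vocabulary has hundreds of thousands of entries.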

Print the vocabulary size of NOTEEVENTS.csv AFTER the filtering (From paper: "The final word vocabulary contains 47,965 unique words.").  
Check whether the stop word "the" is still in the dictionary.

In [None]:
print(f"Number of words in NOTEEVENTS.csv = {len(u2)}")
print("the" in u2)

Number of words in NOTEEVENTS.csv = 47964
False


Create a dictionary for DIAGNOSES_ICD.csv. The key is HADM_ID, and the value is a list of ICD-9 codes for that HADM_ID.  
Add a prefix "d_" to the ICD-9 codes.

In [None]:
u=[]   

file1=codecs.open('DIAGNOSES_ICD.csv','r')
ad2c=defaultdict(list)
line=file1.readline() # Skip the 1st line
line=file1.readline() # Read the 2nd line

while line:
  line=line.strip().split(',') # Split a row into a list

  if line[4][1:-1]!='': # If ICD9_CODE column is not empty
    ad2c[line[2]].append("d_"+line[4][1:-1]) # Append the code to list of codes for a HADM_ID
  
  line=file1.readline() # Read the next line
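
The cell above strips the surrounding quotes with `[1:-1]`, which assumes the ICD9_CODE field is always quoted. As an alternative sketch, the `csv` module removes quotes itself. The toy rows below are made up; only the column positions (index 2 for HADM_ID, index 4 for ICD9_CODE) are taken from the code above.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the column order the notebook relies on:
# index 2 is HADM_ID and index 4 is ICD9_CODE.
toy_csv = io.StringIO(
    '"ROW_ID","SUBJECT_ID","HADM_ID","SEQ_NUM","ICD9_CODE"\n'
    '1,10,100001,1,"40301"\n'
    '2,10,100001,2,"486"\n'
    '3,11,100002,1,""\n'
)

codes_by_visit = defaultdict(list)
reader = csv.reader(toy_csv)
next(reader)  # skip the header row
for row in reader:
    icd9 = row[4]  # csv.reader has already removed the surrounding quotes
    if icd9:       # ignore rows with an empty ICD9_CODE
        codes_by_visit[row[2]].append("d_" + icd9)

print(dict(codes_by_visit))  # {'100001': ['d_40301', 'd_486']}
```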

Print a sample key-value pair (ICD-9 codes for visit 172335).

In [None]:
print(ad2c["172335"])
print(len(ad2c["172335"]))

['d_40301', 'd_486', 'd_58281', 'd_5855', 'd_4254', 'd_2762', 'd_7100', 'd_2767', 'd_7243', 'd_45829', 'd_2875', 'd_28521', 'd_28529', 'd_27541']
14


Calculate the number of occurrences of each code in DIAGNOSES_ICD.csv.

In [None]:
codeu=defaultdict(int)
for i in ad2c:
  for j in ad2c[i]:
    codeu[j]=codeu[j]+1 # count the occurrences of each code

Print the number of unique ICD-9 codes (original codes) in DIAGNOSES_ICD.csv.  
Print the number of occurrences of a sample code.

In [None]:
print(f"number of codes in DIAGNOSES_ICD.csv = {len(codeu)}")
print(codeu["d_486"])

number of codes in DIAGNOSES_ICD.csv = 6984
4839


Group ICD-9 codes in DIAGNOSES_ICD.csv by their first three digits, which are the first five characters once the "d_" prefix is included (From paper: "We extracted all listed ICD-9 diagnosis codes for each visit and grouped them by their first three digits").

In [None]:
ad2c2 = defaultdict(list)
for hadm_id in ad2c:
  for code in ad2c[hadm_id]:
    if code[0:5] not in ad2c2[hadm_id]:
      ad2c2[hadm_id].append(code[0:5])
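
The grouping step can be checked on a single visit. The sketch below truncates each prefixed code and removes duplicates while keeping first-seen order; `group_codes` is a hypothetical helper, and the input codes are a subset of the sample visit printed earlier.

```python
def group_codes(codes, prefix_len=2, digits=3):
    """Truncate each 'd_'-prefixed code to its first `digits` characters, keeping first-seen order."""
    keep = prefix_len + digits
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(code[:keep] for code in codes))


visit_codes = ["d_40301", "d_486", "d_2762", "d_2767", "d_28521", "d_28529"]
print(group_codes(visit_codes))  # ['d_403', 'd_486', 'd_276', 'd_285']
```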

Print a sample key-value pair (same as the one before ICD-9 code grouping).

In [None]:
print(ad2c2["172335"])
print(len(ad2c2["172335"]))

['d_403', 'd_486', 'd_582', 'd_585', 'd_425', 'd_276', 'd_710', 'd_724', 'd_458', 'd_287', 'd_285', 'd_275']
12


Calculate the number of occurrences of each grouped code.

In [None]:
codeu2=defaultdict(int)
for hadm_id in ad2c2:
  for code in ad2c2[hadm_id]:
    codeu2[code]=codeu2[code]+1 # count the occurrences of each grouped code

Print the number of unique ICD-9 codes (after grouping) in DIAGNOSES_ICD.csv (From paper: "The code vocabulary contains 942 codes.").

In [None]:
print(f"Number of codes after grouping = {len(codeu2)}")

Number of codes after grouping = 942


Iterate over the HADM_IDs in IDlist.npy, and combine the code data (from DIAGNOSES_ICD.csv) and the note data (from NOTEEVENTS.csv) into a single file "combined_dataset".  
**combined_dataset**: A dataset containing the clinical notes of hospital visits and the corresponding diagnoses (ICD-9 codes).

In [None]:
fileo=codecs.open("combined_dataset",'w')

num_of_notes = 0
lower_limit = 0
upper_limit = 10000000

IDlist=np.load('IDlist.npy',encoding='bytes').astype(str) # a list of HADM_ID
for hadm_id in IDlist:
  if ad2c2[hadm_id]!=[]: # If HADM_ID has at least one code in ad2c2
    add_to_set = False
    for code in ad2c2[hadm_id]:
      if codeu2[code] >= lower_limit and codeu2[code] <= upper_limit:
        add_to_set = True
    if add_to_set:
      fileo.write('start! '+hadm_id+'\n') # write the HADM_ID of this visit
      fileo.write('codes: ')
      tempc=[]
      for code in ad2c2[hadm_id]: # for each code
        if codeu2[code] >= lower_limit and codeu2[code] <= upper_limit: # if the code count is within the limits
          if code not in tempc:
            tempc.append(code) # save d_ and first 3 digits of ICD-9 code
      
      for code in tempc:
        fileo.write(code+" ") # write code to combined dataset
      fileo.write('\n')
      fileo.write('notes:\n') # write note
      for line in admidic[hadm_id]: # iterate over each note of this visit
        thisline=line.strip('\n').split() 
        for j in thisline: # iterate over each word
          if u2[j]!=0: # if this word is a qualified word
            fileo.write(j+" ") # write this word
        fileo.write('\n') # end the line after each note
      fileo.write('end!\n')
      num_of_notes = num_of_notes + 1
fileo.close()
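
Each visit is written as a block of the form `start! <HADM_ID>`, `codes: ...`, `notes:`, one or more lines of words, and `end!`. A small reader makes that layout easy to inspect. This is a sketch assuming exactly that layout; `read_records` and the toy block are hypothetical.

```python
import io


def read_records(lines):
    """Yield (hadm_id, codes, words) for each start!/end! block."""
    hadm_id, codes, words, in_notes = None, [], [], False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("start! "):
            hadm_id, codes, words, in_notes = line.split()[1], [], [], False
        elif line.startswith("codes: "):
            codes = line.split()[1:]
        elif line == "notes:":
            in_notes = True
        elif line == "end!":
            yield hadm_id, codes, words
            in_notes = False
        elif in_notes:
            words.extend(line.split())  # blank lines simply add nothing


toy_file = io.StringIO(
    "start! 100001\n"
    "codes: d_403 d_486 \n"
    "notes:\n"
    "chest ct consistent \n"
    "head ct \n"
    "end!\n"
)
for record in read_records(toy_file):
    print(record)
# ('100001', ['d_403', 'd_486'], ['chest', 'ct', 'consistent', 'head', 'ct'])
```

The marker lines cannot collide with note text because punctuation was removed from every note, so a word such as `end!` or `codes:` never appears inside a note.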
 

Print the number of note-code pairs added to the dataset (From paper: "The number of aggregated discharge summary notes is 52,722.").

In [None]:
print(f"Number of notes written to combined dataset = {num_of_notes}")

Number of notes written to combined dataset = 52722


In [None]:
stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s') 

Time duration: 0h 01m 49s


# Data Pre-processing 2

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
start = timeit.default_timer()

Build a vocabulary for Wiki documents. The key is a word, and the value is 1 if the word exists.

In [None]:
wikivocab={}
file1=codecs.open("wikipedia_knowledge",'r','utf-8')
line=file1.readline()
while line:
  if line[0:3]!='XXX': # if this line is not start or end of doc
    line=line.strip('\n')
    line=line.split()
    for i in line: # iterate each word
      wikivocab[i.lower()]=1 # if a word exists in doc, set to 1
  line=file1.readline()

Print the size of the vocabulary for Wiki documents (From paper: "The size
of the word vocabulary of Wikipedia documents is 60,968.").

In [None]:
print(f"Number of unique words in Wiki documents = {len(wikivocab)}")

Number of unique words in Wiki documents = 60968


Build a vocabulary for NOTEEVENTS.csv (after word processing). The key is a word, and the value is 1 if the word exists.

In [None]:
notesvocab={}
filec=codecs.open("combined_dataset",'r','utf-8')

line=filec.readline()

while line:
  line=line.strip('\n')
  line=line.split()
  
  if line[0]=='codes:':
    line=filec.readline()
    line=line.strip('\n')
    line=line.split()
    
    if line[0]=='notes:': 
      line=filec.readline()
      while line!='end!\n':
        line=line.strip('\n')
        line=line.split()
        for word in line:
          notesvocab[word]=1
        line=filec.readline()            
  line=filec.readline()

Print the size of vocabulary for NOTEEVENTS.csv after word processing (From paper: "The final word vocabulary contains 47,965 unique words.").  
Print a sample key-value pair.

In [None]:
print(len(notesvocab))
print(notesvocab["consistent"])

47964
1


Get the intersection of the vocabulary for Wiki documents and the one for NOTEEVENTS.csv.

In [None]:
a1=set(notesvocab)
a2=set(wikivocab)
a3=a1.intersection(a2) # get the intersection

Print number of unique words in the intersection (From paper: "The size of the word vocabulary of Wikipedia documents is 60,968, out of which only 12,173 are also in the word vocabulary of MIMIC-III clinical notes.").

In [None]:
print(len(a3))

12173


Create a list of lists. Each element list is a list of intersected words for one Wiki document.

In [None]:
wikidocuments=[] # a list of lists, each element is a list of intersected and filtered words for one doc
file2=codecs.open("wikipedia_knowledge",'r','utf-8')
line=file2.readline()
while line:
  if line[0:4]=='XXXd': # Check if it is a start of a wiki doc
    tempf=[]
    line=file2.readline() # read the next line
    while line[0:4]!='XXXe': # keep reading until the end of that doc
      line=line.strip('\n')
      words=line.split()
      for word in words:
        if word.lower() in a3: # if this word also appears in note
          tempf.append(word.lower()) # add this word to the list for this doc
      line=file2.readline()
    wikidocuments.append(tempf) # add word list of doc to list of docs 
  line=file2.readline()
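
The nested `while` loops above keep reading until they see an end marker, so a document without one would never terminate the inner loop. A single-pass variant avoids that. The sketch below assumes only what the notebook's code shows: a document starts with a line beginning with `XXXd` (which also carries the `d_` codes) and ends with a line beginning with `XXXe`. The exact marker strings in the toy text and the name `read_wiki_docs` are made up.

```python
import io


def read_wiki_docs(lines):
    """Yield (codes, words) per document delimited by XXXd... and XXXe... lines."""
    codes, words, inside = [], [], False
    for raw in lines:
        if raw.startswith("XXXd"):      # header line: carries the ICD-9 codes
            codes = [tok for tok in raw.split() if tok.startswith("d_")]
            words, inside = [], True
        elif raw.startswith("XXXe"):    # end-of-document marker
            if inside:
                yield codes, words
            inside = False
        elif inside:
            words.extend(tok.lower() for tok in raw.split())


# Made-up document; only the "XXXd"/"XXXe" prefixes are taken from the notebook.
toy_wiki = io.StringIO(
    "XXXdoc d_174 d_175\n"
    "Breast cancer develops from breast tissue\n"
    "XXXend\n"
)
print(list(read_wiki_docs(toy_wiki)))
# [(['d_174', 'd_175'], ['breast', 'cancer', 'develops', 'from', 'breast', 'tissue'])]
```

Filtering the yielded words against the intersection set `a3` would then give the same per-document lists as `wikidocuments`.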

Print a sample element list.

In [None]:
print(wikidocuments[0])
print(len(wikidocuments))

['breast', 'cancer', 'breast', 'cancer', 'cancer', 'develops', 'breast', 'signs', 'breast', 'cancer', 'may', 'include', 'lump', 'change', 'breast', 'dimpling', 'fluid', 'coming', 'newly', 'inverted', 'red', 'scaly', 'patch', 'distant', 'spread', 'may', 'bone', 'swollen', 'lymph', 'shortness', 'yellow', 'risk', 'factors', 'developing', 'breast', 'cancer', 'include', 'lack', 'physical', 'drinking', 'hormone', 'replacement', 'therapy', 'early', 'age', 'first', 'children', 'late', 'older', 'prior', 'history', 'breast', 'family', 'cases', 'due', 'inherited', 'including', 'among', 'breast', 'cancer', 'commonly', 'develops', 'cells', 'lining', 'milk', 'ducts', 'lobules', 'supply', 'ducts', 'cancers', 'developing', 'ducts', 'known', 'ductal', 'developing', 'lobules', 'known', 'lobular', 'breast', 'ductal', 'carcinoma', 'develop', 'diagnosis', 'breast', 'cancer', 'confirmed', 'taking', 'biopsy', 'concerning', 'diagnosis', 'tests', 'done', 'determine', 'cancer', 'spread', 'beyond', 'breast', 'tr

Create a list of lists. Each element list is a list of intersected words for one clinical note.

In [None]:
notesdocuments=[] # a list of lists, each element is a list of intersected and filtered words for one note
file3=codecs.open("combined_dataset",'r','utf-8')
line=file3.readline()
while line:
  line=line.strip('\n')
  line=line.split()
  if line[0]=='codes:': # if this line is for code
    line=file3.readline() # skip this line
    line=line.strip('\n')
    line=line.split()
    if line[0]=='notes:': # if this line is for note
      tempf=[]
      line=file3.readline()
      while line!='end!\n': # keep reading until the end of the note
        line=line.strip('\n')
        line=line.split()
        for word in line:
          if word in a3:
            tempf.append(word)     
        line=file3.readline()      
      notesdocuments.append(tempf)
  line=file3.readline()

Print a sample element list.

In [None]:
print(notesdocuments[0])

['admission', 'date', 'discharge', 'date', 'date', 'birth', 'sex', 'm', 'service', 'medicine', 'allergies', 'patient', 'recorded', 'known', 'allergies', 'drugs', 'name', 'un', 'chief', 'complaint', 'fever', 'cough', 'weakness', 'major', 'surgical', 'invasive', 'procedure', 'rigid', 'bronchoscopy', 'endobronchial', 'debulking', 'squamous', 'cell', 'lung', 'mass', 'endotracheal', 'intubation', 'history', 'present', 'illness', 'mr', 'known', 'man', 'squamous', 'cell', 'lung', 'cancer', 'admitted', 'hypotension', 'fever', 'secondary', 'pneumonia', 'patient', 'back', 'pain', 'weight', 'loss', 'productive', 'cough', 'mild', 'hemoptysis', 'beginning', 'month', 'month', 'time', 'revealed', 'opacity', 'ct', 'revealed', 'mass', 'x', 'cm', 'concerning', 'malignancy', 'patient', 'underwent', 'flexible', 'bronchoscopy', 'endobronchial', 'biopsies', 'linear', 'endobronchial', 'ultrasound', 'lymph', 'node', 'biopsies', 'procedures', 'performed', 'dr', 'last', 'name', 'endobronchial', 'biopsies', 'rev

Assign each intersected word a column index in the vocabulary matrix. The key is a word, and the value is its column index (words are numbered in the order they first appear).  
This is a preparation for building the matrices of the intersected words for Wiki documents and clinical notes.

In [None]:
notesvocab={}
for i in notesdocuments: # for each element (is a list) in list
  for j in i: # for each word
    if j.lower() not in notesvocab: # if a word is not in dict
      # set the value to the current dict size, i.e. the column index of this word in the matrix
      notesvocab[j.lower()]=len(notesvocab) 

Print sample key-value pairs (Compare with notesdocuments[0]).

In [None]:
print(notesvocab["admission"])
print(notesvocab["date"])
print(notesvocab["discharge"])
print(notesvocab["birth"])
print(notesdocuments[0])

0
1
2
3
['admission', 'date', 'discharge', 'date', 'date', 'birth', 'sex', 'm', 'service', 'medicine', 'allergies', 'patient', 'recorded', 'known', 'allergies', 'drugs', 'name', 'un', 'chief', 'complaint', 'fever', 'cough', 'weakness', 'major', 'surgical', 'invasive', 'procedure', 'rigid', 'bronchoscopy', 'endobronchial', 'debulking', 'squamous', 'cell', 'lung', 'mass', 'endotracheal', 'intubation', 'history', 'present', 'illness', 'mr', 'known', 'man', 'squamous', 'cell', 'lung', 'cancer', 'admitted', 'hypotension', 'fever', 'secondary', 'pneumonia', 'patient', 'back', 'pain', 'weight', 'loss', 'productive', 'cough', 'mild', 'hemoptysis', 'beginning', 'month', 'month', 'time', 'revealed', 'opacity', 'ct', 'revealed', 'mass', 'x', 'cm', 'concerning', 'malignancy', 'patient', 'underwent', 'flexible', 'bronchoscopy', 'endobronchial', 'biopsies', 'linear', 'endobronchial', 'ultrasound', 'lymph', 'node', 'biopsies', 'procedures', 'performed', 'dr', 'last', 'name', 'endobronchial', 'biopsie

Create a list of strings for the Wiki documents. Each element is a string holding the intersected words of one Wiki document.  
This is a preparation for building the matrices of the intersected words for Wiki documents and clinical notes.

In [None]:
wikidata=[] # each element is a string, each string contains words for a wiki doc
for i in wikidocuments:
  temp=''
  for j in i:
    temp=temp+j+" "
  wikidata.append(temp)    

Print a sample string (Compare with wikidocuments[0]).

In [None]:
print(wikidata[0])
print(wikidocuments[0])

breast cancer breast cancer cancer develops breast signs breast cancer may include lump change breast dimpling fluid coming newly inverted red scaly patch distant spread may bone swollen lymph shortness yellow risk factors developing breast cancer include lack physical drinking hormone replacement therapy early age first children late older prior history breast family cases due inherited including among breast cancer commonly develops cells lining milk ducts lobules supply ducts cancers developing ducts known ductal developing lobules known lobular breast ductal carcinoma develop diagnosis breast cancer confirmed taking biopsy concerning diagnosis tests done determine cancer spread beyond breast treatments likely balance benefits versus harms breast cancer screening review stated unclear screening good review us preventive services task force found evidence benefit years organization recommends screening every two years women years medications tamoxifen raloxifene may used effort preve

Create a list of strings for the clinical notes. Each element is a string holding the intersected words of one note.  
This is a preparation for building the matrices of the intersected words for Wiki documents and clinical notes.

In [None]:
notedata=[] # each element is a string, each string contains words for a note
for i in notesdocuments: # for each element (list) in list
  temp=''
  for j in i: # for each word
    temp=temp+j+" " # create a string, words for a note separated by a space
  notedata.append(temp)

Print a sample string (Compare with notesdocuments[0]).

In [None]:
print(notedata[0])
print(notesdocuments[0])

admission date discharge date date birth sex m service medicine allergies patient recorded known allergies drugs name un chief complaint fever cough weakness major surgical invasive procedure rigid bronchoscopy endobronchial debulking squamous cell lung mass endotracheal intubation history present illness mr known man squamous cell lung cancer admitted hypotension fever secondary pneumonia patient back pain weight loss productive cough mild hemoptysis beginning month month time revealed opacity ct revealed mass x cm concerning malignancy patient underwent flexible bronchoscopy endobronchial biopsies linear endobronchial ultrasound lymph node biopsies procedures performed dr last name endobronchial biopsies revealed squamous cell carcinoma patient presented dr first name office worsening cough fever weakness office sbp breathing ra diminished l upper lung breath sounds per pcp revealed new infiltrate peripheral mass lesions likely consistent referred pcp time done also showed likely pat

Create 2 binary word matrices over the intersected words. One is for Wiki documents, and the other is for clinical notes.  
**notevec.npy**: A binary matrix of intersected words for clinical notes (one row per note).  
**wikivec.npy**: A binary matrix of intersected words for Wiki documents (one row per document).

In [None]:
# create a binary bag-of-words vectorizer with a fixed vocabulary (column order from notesvocab)
vect = CountVectorizer(min_df=1,vocabulary=notesvocab,binary=True)
# transform the list of strings into a matrix; if a word exists in a string, set the value to 1
binaryn = vect.fit_transform(notedata)
binaryn=binaryn.toarray() # convert the sparse matrix to a dense ndarray (explicit equivalent of .A)
binaryn=np.array(binaryn,dtype=float)

# same vocabulary for the Wiki documents, so both matrices share their columns
vect2 = CountVectorizer(min_df=1,vocabulary=notesvocab,binary=True)
binaryk = vect2.fit_transform(wikidata)
binaryk=binaryk.toarray()
binaryk=np.array(binaryk,dtype=float)


np.save('notevec',binaryn) # save numpy array as a file
np.save('wikivec',binaryk) # save numpy array as a file
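
To make the content of these matrices concrete, the sketch below builds the same kind of 0/1 rows in plain Python. It illustrates the encoding only and is not scikit-learn's implementation; `binary_bow` and the toy data are hypothetical.

```python
def binary_bow(docs, vocab):
    """Return one 0/1 row per document; column order comes from vocab[word]."""
    rows = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for word in doc.split():
            col = vocab.get(word)
            if col is not None:
                row[col] = 1.0  # presence only, repeated words do not add up
        rows.append(row)
    return rows


toy_vocab = {"admission": 0, "date": 1, "fever": 2, "cough": 3}
toy_docs = ["admission date date fever", "cough cough unknownword"]
print(binary_bow(toy_docs, toy_vocab))
# [[1.0, 1.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
```

One difference is worth keeping in mind: with its default `token_pattern`, `CountVectorizer` only extracts tokens of two or more alphanumeric characters, so single-character vocabulary words such as "m" or "x" get all-zero columns in the real matrices, whereas this whitespace-based sketch would mark them.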

Print the shape of the created matrices.  
For the matrix for clinical notes, the size of 1st dimension is the number of notes, and the size of the 2nd dimension is the number of intersected words.  
For the matrix for Wiki documents, the size of 1st dimension is the number of Wiki documents, and the size of the 2nd dimension is the number of intersected words.  

In [None]:
print(f"The shape of the matrix for clinical notes (notevec) = {binaryn.shape}")
print(f"The shape of the matrix for wiki docs (wikivec) = {binaryk.shape}")

The shape of the matrix for clinical notes (notevec) = (52722, 12173)
The shape of the matrix for wiki docs (wikivec) = (325, 12173)


In [None]:
stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')   

Time duration: 0h 04m 32s


# Data Pre-processing 3

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
start = timeit.default_timer()

Create 2 dictionaries for the ICD-9 codes in the Wiki documents.  
For the 1st dictionary, the key is an ICD-9 code that exists in the Wiki documents, and the value is 1.  
For the 2nd dictionary, the key is an ICD-9 code that exists in the Wiki documents, and the value is the list of indices of the Wiki documents for this ICD-9 code.

In [None]:
wikivoc={}
codewiki=defaultdict(list)

file2=codecs.open("wikipedia_knowledge",'r','utf-8')
line=file2.readline()
count=0
while line:
  if line[0:4]=='XXXd': # read the start of a wiki doc
    line=line.strip('\n')
    codes=line.split()
    for code in codes:
      if code[0:2]=='d_': # if it is a icd code
        codewiki[code].append(count) # save the index of wiki doc to list for code
        wikivoc[code]=1 # set value of code to 1
    count=count+1
  line=file2.readline()

Print the 2 dictionaries.  
Print the number of ICD-9 codes in Wiki documents (**not all of them can be found in clinical notes**).

In [None]:
print(wikivoc)
print(codewiki)
print(f"Total number of codes = {len(wikivoc)}")

{'d_174': 1, 'd_175': 1, 'd_130': 1, 'd_540': 1, 'd_541': 1, 'd_542': 1, 'd_200': 1, 'd_357': 1, 'd_614': 1, 'd_615': 1, 'd_616': 1, 'd_191': 1, 'd_027': 1, 'd_323': 1, 'd_060': 1, 'd_136': 1, 'd_330': 1, 'd_331': 1, 'd_272': 1, 'd_715': 1, 'd_458': 1, 'd_076': 1, 'd_313': 1, 'd_147': 1, 'd_241': 1, 'd_314': 1, 'd_410': 1, 'd_740': 1, 'd_155': 1, 'd_695': 1, 'd_420': 1, 'd_345': 1, 'd_421': 1, 'd_268': 1, 'd_651': 1, 'd_551': 1, 'd_552': 1, 'd_553': 1, 'd_365': 1, 'd_084': 1, 'd_346': 1, 'd_203': 1, 'd_512': 1, 'd_860': 1, 'd_574': 1, 'd_586': 1, 'd_214': 1, 'd_135': 1, 'd_209': 1, 'd_511': 1, 'd_287': 1, 'd_056': 1, 'd_033': 1, 'd_101': 1, 'd_463': 1, 'd_578': 1, 'd_708': 1, 'd_359': 1, 'd_072': 1, 'd_207': 1, 'd_208': 1, 'd_811': 1, 'd_286': 1, 'd_087': 1, 'd_176': 1, 'd_094': 1, 'd_480': 1, 'd_347': 1, 'd_690': 1, 'd_694': 1, 'd_702': 1, 'd_707': 1, 'd_341': 1, 'd_193': 1, 'd_332': 1, 'd_134': 1, 'd_610': 1, 'd_260': 1, 'd_361': 1, 'd_579': 1, 'd_401': 1, 'd_309': 1, 'd_693': 1, 'd_

Each of the following 4 ICD-9 codes appears in 2 Wiki documents.

In [None]:
print(codewiki['d_072'])
print(codewiki['d_698'])
print(codewiki['d_305'])
print(codewiki['d_386'])

[47, 214]
[106, 125]
[149, 250]
[219, 221]
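
Codes that map to more than one document can also be found programmatically rather than by inspection. A minimal sketch, with a hypothetical helper name and a toy mapping whose two multi-document entries copy the values printed above:

```python
def codes_with_multiple_docs(code_to_docs):
    """Return {code: doc_indices} for codes listed in more than one document."""
    return {code: docs for code, docs in code_to_docs.items() if len(docs) > 1}


# Toy mapping; the single-document entry is made up.
toy_codewiki = {"d_174": [0], "d_072": [47, 214], "d_698": [106, 125]}
print(codes_with_multiple_docs(toy_codewiki))
# {'d_072': [47, 214], 'd_698': [106, 125]}
```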


For training, each Wiki document can have more than one ICD-9 code, but each ICD-9 code must map to only one Wiki document.  
For each of the 4 ICD-9 codes above, keep a single Wiki document.  
**wikivoc.npy**: The dictionary of ICD-9 codes that appear in Wiki documents, saved with np.save. Because it is a dictionary rather than an array, loading it back requires `np.load('wikivoc.npy', allow_pickle=True).item()`.

In [None]:
codewiki['d_072']=[214]
codewiki['d_698']=[125]
codewiki['d_305']=[250]
codewiki['d_386']=[219]

np.save('wikivoc',wikivoc)

Prepare the features and labels for the deep learning model.  
Feature: A list of lists. Each element is the list of words of one clinical note.  
Label: A list of lists. Each element is the list of ICD-9 codes for one clinical note.

In [None]:
filec=codecs.open("combined_dataset",'r','utf-8')

line=filec.readline()

feature=[]
label=[]

while line:
  line=line.strip('\n')
  line=line.split()
  
  if line[0]=='codes:':
    temp=line[1:] # read the codes of that note
    label.append(temp) # add the code to list for label
    line=filec.readline()
    line=line.strip('\n')
    line=line.split()
    if line[0]=='notes:':
      tempf=[]
      line=filec.readline()
      
      while line!='end!\n': # read the notes until end
        line=line.strip('\n')
        line=line.split()
        tempf=tempf+line
        line=filec.readline()
      feature.append(tempf) # add list of words to list of feature
  line=filec.readline()

Print a sample feature.  
Print a sample label.

In [None]:
print(feature[0])
print(label[0])
print(len(feature[0]))
print(len(label[0]))
print(len(feature))
print(len(label))

['admission', 'date', 'discharge', 'date', 'date', 'birth', 'sex', 'm', 'service', 'medicine', 'allergies', 'patient', 'recorded', 'known', 'allergies', 'drugs', 'attendinglast', 'name', 'un', 'chief', 'complaint', 'fever', 'cough', 'weakness', 'major', 'surgical', 'invasive', 'procedure', 'rigid', 'bronchoscopy', 'endobronchial', 'debulking', 'squamous', 'cell', 'lung', 'mass', 'endotracheal', 'intubation', 'history', 'present', 'illness', 'mr', 'known', 'lastname', 'yo', 'man', 'w', 'recentlydiagnosed', 'lul', 'poorlydifferentiated', 'squamous', 'cell', 'lung', 'cancer', 'admitted', 'hypotension', 'fever', 'secondary', 'postobstructive', 'pneumonia', 'patient', 'back', 'pain', 'weight', 'loss', 'productive', 'cough', 'mild', 'hemoptysis', 'beginning', 'month', 'month', 'time', 'revealed', 'lul', 'opacity', 'fu', 'ct', 'thorax', 'revealed', 'lul', 'mass', 'x', 'cm', 'concerning', 'malignancy', 'patient', 'underwent', 'flexible', 'bronchoscopy', 'endobronchial', 'biopsies', 'linear', '

Create the index for the labels (ICD-9 codes). The key is an ICD-9 code, and the value is the position of that code in the code vector used later.

In [None]:
prevoc={}
for i in label:
  for j in i:
    if j not in prevoc:
      prevoc[j] = len(prevoc) # set up the order of codes (for vector)

Print sample key-value pairs (refer to label[0]).  
Print the number of key-value pairs (codes).

In [None]:
# print(prevoc["d_486"])
# print(prevoc["d_518"])
# print(prevoc["d_511"])
print(len(prevoc))

941


Load notevec.npy and wikivec.npy.

In [None]:
notevec=np.load('notevec.npy')
wikivec=np.load('wikivec.npy')

Create mapping between ICD-9 codes and the index in the code vector by 2 dictionaries.  
**This mapping is for all codes found in the combined dataset.**

In [None]:
label_to_ix = {}
ix_to_label = {}

# create a mapping between code and index
for codes in label:
  for code in codes:
    if code not in label_to_ix:
      label_to_ix[code]=len(label_to_ix)
      ix_to_label[label_to_ix[code]]=code

Print sample key-value pairs.  
Print the number of ICD-9 codes found in the combined dataset.

In [None]:
# print(label_to_ix["d_486"])
print(ix_to_label[0])
print(f"Total number of codes = {len(label_to_ix)}")

d_486
Total number of codes = 941


Create a word vector (over the intersected words) for each ICD-9 code found in the combined dataset.
*   If an ICD-9 code can be found in the Wiki documents: label index -> ICD-9 code -> index of the Wiki document -> vector of intersected words for that Wiki document (1 x number of intersected words).
*   If an ICD-9 code cannot be found in the Wiki documents: a zero vector of shape (1 x number of intersected words).
Create a mapping between ICD-9 code index in label and corresponding 

In [None]:
tempwikivec=[]

for i in range(0,len(ix_to_label)):
  if ix_to_label[i] in wikivoc: # if a code in note can be found in wiki docs
    temp=wikivec[codewiki[ix_to_label[i]][0]] # vector of the wiki doc at index codewiki[code][0]
    tempwikivec.append(temp)
  else:
    tempwikivec.append([0.0]*wikivec.shape[1])
wikivec=np.array(tempwikivec)
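
The row alignment with a zero-vector fallback can be sketched on toy data. This is an illustration under assumptions, not the notebook's code: `align_code_vectors` and all `toy_*` names are hypothetical, and a single dictionary stands in for both `wikivoc` (membership test) and `codewiki` (code to document indices).

```python
def align_code_vectors(ix_to_code, code_to_docs, doc_vectors, dim):
    """Return one row per label index; codes without a wiki doc get zeros."""
    rows = []
    for i in range(len(ix_to_code)):
        code = ix_to_code[i]
        if code in code_to_docs:
            # Use the vector of the first wiki doc listed for this code.
            rows.append(list(doc_vectors[code_to_docs[code][0]]))
        else:
            rows.append([0.0] * dim)
    return rows

toy_ix_to_code = {0: "d_486", 1: "d_999"}
toy_code_to_docs = {"d_486": [1]}  # code -> list of wiki doc indices
toy_doc_vectors = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
aligned = align_code_vectors(toy_ix_to_code, toy_code_to_docs, toy_doc_vectors, dim=3)
print(aligned)  # [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```

After this step, row `i` of the result corresponds to label index `i`, regardless of where the document sat in the original Wiki matrix.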

Print sample result.

In [None]:
# print(f"If the ICD-9 code is = {ix_to_label[0]}")
# print(f"The index of the corresponding wiki doc = {codewiki[ix_to_label[0]]}")
# print(f"The index of the corresponding wiki doc = {codewiki[ix_to_label[0]][0]}")
# print(f"The vector of intersected words for the corresponding wiki doc = {wikivec[codewiki[ix_to_label[0]][0]]}")
# print(f"The shape of the vector is = {wikivec[codewiki[ix_to_label[0]][0]].shape}")
# print(f"If the ICD-9 code cannot be found in Wiki docs, the vector = {[0.0]*wikivec.shape[1]}")

Create dataset. The dataset contains 3 parts:
*   Feature: a list of lists, and each element is a list of words (strings) for one clinical note.
*   Notevec: a list of vectors, and each element is a vector of intersected words for one clinical note.
*   Label: a list of lists, and each element is a list of ICD-9 codes for one clinical note.

In [None]:
data=[]
for i in range(0,len(feature)):
  # save feature (list of words for note), note vector and label (codes) as a tuple
  data.append((feature[i], notevec[i], label[i]))
    
data=np.array(data, dtype=object)

Print the first set of data.

In [None]:
print(data[0][0])
print(data[0][1])
print(data[0][2])

['admission', 'date', 'discharge', 'date', 'date', 'birth', 'sex', 'm', 'service', 'medicine', 'allergies', 'patient', 'recorded', 'known', 'allergies', 'drugs', 'attendinglast', 'name', 'un', 'chief', 'complaint', 'fever', 'cough', 'weakness', 'major', 'surgical', 'invasive', 'procedure', 'rigid', 'bronchoscopy', 'endobronchial', 'debulking', 'squamous', 'cell', 'lung', 'mass', 'endotracheal', 'intubation', 'history', 'present', 'illness', 'mr', 'known', 'lastname', 'yo', 'man', 'w', 'recentlydiagnosed', 'lul', 'poorlydifferentiated', 'squamous', 'cell', 'lung', 'cancer', 'admitted', 'hypotension', 'fever', 'secondary', 'postobstructive', 'pneumonia', 'patient', 'back', 'pain', 'weight', 'loss', 'productive', 'cough', 'mild', 'hemoptysis', 'beginning', 'month', 'month', 'time', 'revealed', 'lul', 'opacity', 'fu', 'ct', 'thorax', 'revealed', 'lul', 'mass', 'x', 'cm', 'concerning', 'malignancy', 'patient', 'underwent', 'flexible', 'bronchoscopy', 'endobronchial', 'biopsies', 'linear', '

Create a two-way mapping between ICD-9 codes and their indices in the code vector using 2 dictionaries.  
**Unlike the previous label_to_ix and ix_to_label, this mapping covers only the ICD-9 codes found in Wiki documents.**

In [None]:
label_to_ix = {}
ix_to_label = {}

for doc, note, codes in data:
  for code in codes:
    if code not in label_to_ix:
      if code in wikivoc:
        label_to_ix[code]=len(label_to_ix)
        ix_to_label[label_to_ix[code]]=code

np.save('label_to_ix',label_to_ix)
np.save('ix_to_label',ix_to_label)

Print sample key-value pairs.  
Print the number of ICD-9 codes that **exist in both the clinical notes and the Wiki documents** (from the paper: "Of those codes, we selected a subset of 344 codes for which we found the corresponding Wikipedia document and used those codes in our experiments.").

In [None]:
# print(label_to_ix["d_486"])
# print(ix_to_label[0])
print(f"Total number of codes = {len(label_to_ix)}")

Total number of codes = 344


Split the data into training (70%), validation (10%), and test (20%) sets.

In [None]:
training_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
training_data, val_data = train_test_split(training_data, test_size=0.125, random_state=42)

np.save('training_data',training_data)
np.save('test_data',test_data)
np.save('val_data',val_data)
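
The resulting proportions follow from applying the two `test_size` values in sequence. The short check below uses exact fractions; the variable names are illustrative only.

```python
from fractions import Fraction

test_frac = Fraction(1, 5)              # test_size=0.2 on the full dataset
remaining = 1 - test_frac               # 4/5 left after the first split
val_frac = remaining * Fraction(1, 8)   # test_size=0.125 on the remainder
train_frac = remaining - val_frac

print(train_frac, val_frac, test_frac)  # 7/10 1/10 1/5
```

The actual set sizes may differ by a sample or so from these fractions, since the split sizes are rounded to whole samples.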

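Create an index for the words in the clinical notes, using the training data only. Index 0 is reserved for 'OUT'; words outside the training vocabulary and padding positions both map to 0 later.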

In [None]:
word_to_ix = {}
ix_to_word={}
ix_to_word[0]='OUT'

for doc, note, codes in training_data:
  for word in doc:
    if word not in word_to_ix:
      word_to_ix[word] = len(word_to_ix)+1
      ix_to_word[word_to_ix[word]]=word  
    
np.save('word_to_ix',word_to_ix)
np.save('ix_to_word',ix_to_word)

Print sample key-value pairs.

In [None]:
print(ix_to_word[0])
print(ix_to_word[1])
print(word_to_ix['admission'])

OUT
admission
1
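
A small sketch of how such a vocabulary behaves at lookup time. `build_vocab`, `encode`, and the toy notes are hypothetical names and data introduced for illustration; the fallback to 0 mirrors what the batching code does later for words missing from `word_to_ix`.

```python
def build_vocab(docs):
    """Map each training word to an index starting at 1; 0 is reserved."""
    word_to_ix = {}
    ix_to_word = {0: "OUT"}
    for doc in docs:
        for word in doc:
            if word not in word_to_ix:
                word_to_ix[word] = len(word_to_ix) + 1
                ix_to_word[word_to_ix[word]] = word
    return word_to_ix, ix_to_word

def encode(doc, word_to_ix):
    """Replace each word by its index; unseen words fall back to 0."""
    return [word_to_ix.get(word, 0) for word in doc]

toy_train = [["fever", "cough"], ["cough", "weakness"]]
w2i, i2w = build_vocab(toy_train)
print(encode(["fever", "hemoptysis", "weakness"], w2i))  # [1, 0, 3]
```

Building the vocabulary from the training split alone keeps validation and test words from leaking into the model's input space.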


Create a word vector (over the intersected words) for each ICD-9 code found in **both the Wiki documents and the clinical notes (combined dataset)**.

In [None]:
newwikivec=[]
for i in range(0,len(ix_to_label)):
  newwikivec.append(wikivec[prevoc[ix_to_label[i]]])
newwikivec=np.array(newwikivec)
np.save('newwikivec',newwikivec)

Print sample result.  
Print the number of vectors in wikivec and newwikivec.

In [None]:
print(ix_to_label[0])
print(prevoc[ix_to_label[0]])
print(wikivec[prevoc[ix_to_label[0]]])
print(len(newwikivec))
print(len(wikivec))

d_486
0
[1. 0. 0. ... 0. 0. 0.]
344
941
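
The step above re-indexes rows from the full label space (941 codes) into the reduced one (344 codes). A toy sketch of that re-indexing, with hypothetical names (`reindex_rows`, `old_index`, `new_ix_to_code`) and made-up data:

```python
def reindex_rows(old_rows, old_index, new_ix_to_code):
    """Pick rows from old_rows in the order given by the new label index."""
    return [old_rows[old_index[new_ix_to_code[i]]]
            for i in range(len(new_ix_to_code))]

old_index = {"a": 0, "b": 1, "c": 2}             # plays the role of prevoc
old_rows = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # "b" has no wiki doc
new_ix_to_code = {0: "a", 1: "c"}                # codes kept for training
new_rows = reindex_rows(old_rows, old_index, new_ix_to_code)
print(new_rows)  # [[1.0, 0.0], [0.0, 1.0]]
```

This works only if the old index was built in the same order as the rows of the old matrix, which is why `prevoc` and the first `label_to_ix` are constructed with the same loop.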


In [None]:
stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')  

Time duration: 0h 04m 47s


# CAML

In [None]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
torch.manual_seed(1)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
import copy
import pandas as pd

In [None]:
start = timeit.default_timer()

##########################################################

label_to_ix=np.load('label_to_ix.npy', allow_pickle=True).item()
ix_to_label=np.load('ix_to_label.npy', allow_pickle=True).item()
training_data=np.load('training_data.npy', allow_pickle=True)
test_data=np.load('test_data.npy', allow_pickle=True)
val_data=np.load('val_data.npy', allow_pickle=True)
word_to_ix=np.load('word_to_ix.npy', allow_pickle=True).item()
ix_to_word=np.load('ix_to_word.npy', allow_pickle=True).item()
newwikivec=np.load('newwikivec.npy', allow_pickle=True)
wikivoc=np.load('wikivoc.npy', allow_pickle=True).item()

wikisize=newwikivec.shape[0]
rvocsize=newwikivec.shape[1]
wikivec=autograd.Variable(torch.FloatTensor(newwikivec))

batchsize = 32

def preprocessing(data):

    new_data=[]
    for i, note, j in data:
        templabel=[0.0]*len(label_to_ix)
        for jj in j:
            if jj in wikivoc:
                templabel[label_to_ix[jj]]=1.0
        templabel=np.array(templabel,dtype=float)
        new_data.append((i, note, templabel))
    new_data=np.array(new_data, dtype=object)
    
    lenlist=[]
    for i in new_data:
        lenlist.append(len(i[0]))
    sortlen=sorted(range(len(lenlist)), key=lambda k: lenlist[k])  
    new_data=new_data[sortlen]
    
    batch_data=[]
    
    for start_ix in range(0, len(new_data)-batchsize+1, batchsize):
        thisblock=new_data[start_ix:start_ix+batchsize]
        mybsize= len(thisblock)
        numword=np.max([len(ii[0]) for ii in thisblock])
        main_matrix = np.zeros((mybsize, numword), dtype=int)
        for i in range(main_matrix.shape[0]):
            for j in range(main_matrix.shape[1]):
                try:
                    if thisblock[i][0][j] in word_to_ix:
                        main_matrix[i,j] = word_to_ix[thisblock[i][0][j]]
                    
                except IndexError:
                    pass       # the matrix is initialized with 0, so missing positions stay padded with 0
    
        xxx2=[]
        yyy=[]
        for ii in thisblock:
            xxx2.append(ii[1])
            yyy.append(ii[2])
        
        xxx2=np.array(xxx2)
        yyy=np.array(yyy)
        batch_data.append((autograd.Variable(torch.from_numpy(main_matrix)),autograd.Variable(torch.FloatTensor(xxx2)),autograd.Variable(torch.FloatTensor(yyy))))
    return batch_data


batchtraining_data=preprocessing(training_data)
batchtest_data=preprocessing(test_data)
batchval_data=preprocessing(val_data)

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')   

Time duration: 0h 00m 57s
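
The batching logic in `preprocessing` can be summarized as: sort notes by length, cut full batches, and zero-pad each batch to its longest note. The sketch below reproduces that behavior on already-encoded toy notes; `make_batches` and `toy_notes` are hypothetical names, and plain lists stand in for tensors.

```python
def make_batches(encoded_notes, batch_size):
    """Sort notes by length, cut full batches, and zero-pad each batch."""
    ordered = sorted(encoded_notes, key=len)
    batches = []
    # The range stops early, so a trailing incomplete batch is dropped.
    for start in range(0, len(ordered) - batch_size + 1, batch_size):
        block = ordered[start:start + batch_size]
        width = max(len(note) for note in block)
        batches.append([note + [0] * (width - len(note)) for note in block])
    return batches

toy_notes = [[5, 2, 9], [4], [7, 7], [1, 2, 3, 4], [8, 8, 8, 8, 8]]
toy_batches = make_batches(toy_notes, batch_size=2)
print(toy_batches)  # [[[4, 0], [7, 7]], [[5, 2, 9, 0], [1, 2, 3, 4]]]
```

Because the sort is ascending and only full batches are kept, the notes left out are the longest ones (here the five-token note). The same applies to the training, validation, and test sets in the cell above.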


In [None]:
start = timeit.default_timer()

######################################################################
# Create the model:

Embeddingsize = 100
hidden_dim = 200

class CAML(nn.Module):

    def __init__(self, batch_size, vocab_size, tagset_size):
        super(CAML, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size+1, Embeddingsize, padding_idx=0)
        self.embed_drop = nn.Dropout(p=0.2)   
        
        
        self.convs1 = nn.Conv1d(Embeddingsize,300,10,padding=5)
        self.H=nn.Linear(300, tagset_size )   
        self.final = nn.Linear(300, tagset_size)
        
        self.layer2 = nn.Linear(Embeddingsize, 1)
        self.embedding=nn.Linear(rvocsize,Embeddingsize,bias=False)
        self.vattention=nn.Linear(Embeddingsize,Embeddingsize)
        
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
    
        self.dropout = nn.Dropout(0.2)
    
    def forward(self, vec1, nvec, wiki, simlearning):
        
       
        thisembeddings=self.word_embeddings(vec1)
        thisembeddings = self.embed_drop(thisembeddings)
        thisembeddings=thisembeddings.transpose(1,2)
        
        
        thisembeddings=self.tanh(self.convs1(thisembeddings).transpose(1,2))  
        
        alpha=self.H.weight.matmul(thisembeddings.transpose(1,2))
        alpha=F.softmax(alpha, dim=2)
        
        m=alpha.matmul(thisembeddings)
       
        myfinal=self.final.weight.mul(m).sum(dim=2).add(self.final.bias)
        
        if simlearning==1:
            nvec=nvec.view(batchsize,1,-1)
            nvec=nvec.expand(batchsize,wiki.size()[0],-1)
            wiki=wiki.view(1,wiki.size()[0],-1)
            wiki=wiki.expand(nvec.size()[0],wiki.size()[1],-1)
            new=wiki*nvec
            new=self.embedding(new)
            vattention=self.sigmoid(self.vattention(new))
            new=new*vattention
            vec3=self.layer2(new)
            vec3=vec3.view(batchsize,-1)
        
       
        if simlearning==1:
            tag_scores = self.sigmoid(myfinal.detach()+vec3)
        else:
            tag_scores = self.sigmoid(myfinal)
              
        return tag_scores

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')   

Time duration: 0h 00m 00s
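
The per-label attention in `CAML.forward` can be followed more easily for a single note without the batch dimension. The NumPy sketch below is an illustration under assumptions, not the model itself: `label_attention_scores` is a hypothetical name, random arrays stand in for the convolution output and the learned weights, and the sizes are tiny.

```python
import numpy as np

def label_attention_scores(H, U, W, b):
    """H: (T, d) conv features; U, W: (L, d) per-label weights; b: (L,) bias."""
    logits = U @ H.T                                     # (L, T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(logits)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)     # softmax over positions
    m = alpha @ H                                        # (L, d) label-specific summaries
    z = (W * m).sum(axis=1) + b                          # (L,) one logit per label
    return alpha, 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
T, d, L = 6, 4, 3
alpha, probs = label_attention_scores(
    rng.normal(size=(T, d)), rng.normal(size=(L, d)),
    rng.normal(size=(L, d)), np.zeros(L))
print(alpha.shape, probs.shape)  # (3, 6) (3,)
```

Here `U` corresponds to `self.H.weight` and `W`, `b` to `self.final.weight` and `self.final.bias`. Each label gets its own attention distribution over token positions, so different codes can focus on different parts of the note.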


In [None]:
start = timeit.default_timer()

######################################################################
# Train the model:

topk = 10
max_epochs = 5000 # Default is 5000
print(f"max epochs = {max_epochs}")

def trainmodel(model, sim):
    modelsaved=[]
    modelperform=[]
    
    bestresults=-1
    bestiter=-1
    for epoch in range(max_epochs):  
        model.train()
        
        lossestrain = []
        recall=[]
        for mysentence in batchtraining_data:
            model.zero_grad()
            
            targets = mysentence[2].cuda()
            tag_scores = model(mysentence[0].cuda(),mysentence[1].cuda(),wikivec.cuda(),sim)
            loss = loss_function(tag_scores, targets)
            loss.backward()
            optimizer.step()
            lossestrain.append(loss.data.mean())
        print (f"epoch = {epoch}")
        modelsaved.append(copy.deepcopy(model.state_dict()))
        model.eval()
    
        recall=[]
        for inputs in batchval_data:
           
            targets = inputs[2].cuda()
            tag_scores = model(inputs[0].cuda(),inputs[1].cuda() ,wikivec.cuda(),sim)
    
            loss = loss_function(tag_scores, targets)
            
            targets=targets.data.cpu().numpy()
            tag_scores= tag_scores.data.cpu().numpy()
            
            
            for iii in range(0,len(tag_scores)):
                temp={}
                for iiii in range(0,len(tag_scores[iii])):
                    temp[iiii]=tag_scores[iii][iiii]
                temp1=[(k, temp[k]) for k in sorted(temp, key=temp.get, reverse=True)]
                thistop=int(np.sum(targets[iii]))
                hit=0.0
                for ii in temp1[0:max(thistop,topk)]:
                    if targets[iii][ii[0]]==1.0:
                        hit=hit+1
                if thistop!=0:
                    recall.append(hit/thistop)
            
        print ('validation recall @ top-',topk, np.mean(recall))
           
        modelperform.append(np.mean(recall))
        if modelperform[-1]>bestresults:
            bestresults=modelperform[-1]
            bestiter=len(modelperform)-1
            print(f"Update the best model to epoch {bestiter}")
        
        if (len(modelperform)-bestiter)>5:
            print (modelperform,bestiter)
            return modelsaved[bestiter]

    print(f"Reach the max epochs, return the best model at epoch {bestiter}")
    return modelsaved[bestiter]

model = CAML(batchsize, len(word_to_ix), len(label_to_ix))
model.cuda()

loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters())

print("Train basemodel")
basemodel= trainmodel(model, 0)
torch.save(basemodel, 'CAML_model')

model = CAML(batchsize, len(word_to_ix), len(label_to_ix))
model.cuda()
model.load_state_dict(basemodel)
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
print("")
print("Train model with KSI")
KSImodel= trainmodel(model, 1)
torch.save(KSImodel, 'KSI_CAML_model')

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')   

max epochs = 5000
Train basemodel
epoch = 0
validation recall @ top- 10 0.6044048484082241
Update the best model to epoch 0
epoch = 1
validation recall @ top- 10 0.7351386587887544
Update the best model to epoch 1
epoch = 2
validation recall @ top- 10 0.7708715147202543
Update the best model to epoch 2
epoch = 3
validation recall @ top- 10 0.7896376400647074
Update the best model to epoch 3
epoch = 4
validation recall @ top- 10 0.7972106688739532
Update the best model to epoch 4
epoch = 5
validation recall @ top- 10 0.8032410262762794
Update the best model to epoch 5
epoch = 6
validation recall @ top- 10 0.8068210941543067
Update the best model to epoch 6
epoch = 7
validation recall @ top- 10 0.8057932495000231
epoch = 8
validation recall @ top- 10 0.8050636399093615
epoch = 9
validation recall @ top- 10 0.8070068471661362
Update the best model to epoch 9
epoch = 10
validation recall @ top- 10 0.8080802677316081
Update the best model to epoch 10
epoch = 11
validation recall @ top- 10 0
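
The stopping rule in `trainmodel` ends training once five consecutive epochs fail to improve the validation recall, then returns the weights from the best epoch. A minimal sketch of that rule on a made-up score sequence (`early_stop` and `toy_scores` are hypothetical):

```python
def early_stop(scores, patience=5):
    """Return (best_epoch, stop_epoch) for a sequence of validation scores."""
    best_score, best_epoch = -1, -1
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        # Same test as (len(modelperform) - bestiter) > 5 in trainmodel.
        if (epoch + 1) - best_epoch > patience:
            return best_epoch, epoch
    return best_epoch, len(scores) - 1

toy_scores = [0.60, 0.73, 0.77, 0.76, 0.75, 0.76, 0.74, 0.72, 0.80]
print(early_stop(toy_scores))  # (2, 7)
```

In this toy run the higher score at epoch 8 is never reached, which is the usual trade-off of patience-based stopping.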

In [None]:
start = timeit.default_timer()

######################################################################
# Test the model:


def testmodel(modelstate, sim):
    model = CAML(batchsize, len(word_to_ix), len(label_to_ix))
    model.cuda()
    # model.cpu()
    model.load_state_dict(modelstate)
    loss_function = nn.BCELoss()
    model.eval()
    recall=[]
    lossestest = []
    
    y_true=[]
    y_scores=[]
    
    
    for inputs in batchtest_data:
       
        targets = inputs[2].cuda()
        # targets = inputs[2].cpu()
        
        tag_scores = model(inputs[0].cuda(),inputs[1].cuda() ,wikivec.cuda(),sim)
        # tag_scores = model(inputs[0].cpu(),inputs[1].cpu() ,wikivec.cpu(),sim)

        loss = loss_function(tag_scores, targets)
        targets = targets.data.cpu().numpy()
        tag_scores= tag_scores.data.cpu().numpy()
               
        # lossestest.append(loss.data.mean())
        lossestest.append(loss.data.cpu().mean())
        y_true.append(targets)
        y_scores.append(tag_scores)
        
        for iii in range(0,len(tag_scores)):
            temp={}
            for iiii in range(0,len(tag_scores[iii])):
                temp[iiii]=tag_scores[iii][iiii]
            temp1=[(k, temp[k]) for k in sorted(temp, key=temp.get, reverse=True)]
            thistop=int(np.sum(targets[iii]))
            hit=0.0
            
            for ii in temp1[0:max(thistop,topk)]:
                if targets[iii][ii[0]]==1.0:
                    hit=hit+1
            if thistop!=0:
                recall.append(hit/thistop)
    y_true=np.concatenate(y_true,axis=0)
    y_scores=np.concatenate(y_scores,axis=0)
    y_true=y_true.T
    y_scores=y_scores.T
    temptrue=[]
    tempscores=[]
    for  col in range(0,len(y_true)):
        if np.sum(y_true[col])!=0:
            temptrue.append(y_true[col])
            tempscores.append(y_scores[col])
    temptrue=np.array(temptrue)
    tempscores=np.array(tempscores)
    y_true=temptrue.T
    y_scores=tempscores.T
    y_pred=(y_scores>0.5).astype(int)
    # print ('test loss', np.mean(lossestest))
    # print ('top-',topk, np.mean(recall))
    # print ('macro AUC', roc_auc_score(y_true, y_scores,average='macro'))
    # print ('micro AUC', roc_auc_score(y_true, y_scores,average='micro'))
    # print ('macro F1', f1_score(y_true, y_pred, average='macro'))
    # print ('micro F1', f1_score(y_true, y_pred, average='micro'))
    test_loss = np.mean(lossestest)
    test_recall = np.mean(recall)
    test_mac_auc = roc_auc_score(y_true, y_scores,average='macro')
    test_mic_auc = roc_auc_score(y_true, y_scores,average='micro')
    test_mac_f1 = f1_score(y_true, y_pred, average='macro')
    test_mic_f1 = f1_score(y_true, y_pred, average='micro')
    return test_loss, test_recall, test_mac_auc, test_mic_auc, test_mac_f1, test_mic_f1

print('Test CAML baseline')
caml_loss, caml_recall, caml_mac_auc, caml_mic_auc, caml_mac_f1, caml_mic_f1 = testmodel(basemodel, 0)
print('Test KSI+CAML')
camlKSI_loss, camlKSI_recall, camlKSI_mac_auc, camlKSI_mic_auc, camlKSI_mac_f1, camlKSI_mic_f1 = testmodel(KSImodel, 1)

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')   

Test CAML baseline
Test KSI+CAML
Time duration: 0h 00m 27s
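
The recall figure computed in `trainmodel` and `testmodel` takes the top `max(k, number of true codes)` predictions for each note, so a note with more than `k` true codes is not capped at `k`. A pure-Python sketch of the per-note computation, with hypothetical names and toy scores:

```python
def recall_at_k(scores, targets, k=10):
    """Recall over the top max(k, n_true) ranked labels; None if no true label."""
    n_true = int(sum(targets))
    if n_true == 0:
        return None  # such notes are skipped when averaging
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:max(n_true, k)] if targets[i] == 1.0)
    return hits / n_true

toy_scores = [0.9, 0.1, 0.8, 0.3, 0.7]
toy_targets = [1.0, 1.0, 0.0, 0.0, 1.0]
print(recall_at_k(toy_scores, toy_targets, k=2))  # 0.6666666666666666
```

With three true codes and `k=2`, the top three predictions are inspected and two of them are correct.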


In [None]:
result_caml = [['CAML Baseline', 
                caml_loss, 
                caml_recall, 
                caml_mac_auc, 
                caml_mic_auc, 
                caml_mac_f1, 
                caml_mic_f1], 
        ['CAML + KSI', 
         camlKSI_loss, 
         camlKSI_recall, 
         camlKSI_mac_auc, 
         camlKSI_mic_auc, 
         camlKSI_mac_f1, 
         camlKSI_mic_f1]]
df_caml = pd.DataFrame(result_caml, columns=['Model', 'Loss', 'Recall@10', 'Macro_AUC', 'Micro_AUC', 'Macro_F1', 'Micro_F1'])
df_caml

| | Model | Loss | Recall@10 | Macro_AUC | Micro_AUC | Macro_F1 | Micro_F1 |
|---|---|---|---|---|---|---|---|
| 0 | CAML Baseline | 0.03352 | 0.807766 | 0.852964 | 0.978119 | 0.278527 | 0.656805 |
| 1 | CAML + KSI | 0.033304 | 0.808454 | 0.853904 | 0.978277 | 0.279034 | 0.657468 |


# CNN

In [None]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
import numpy as np
torch.manual_seed(1)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
import copy

In [None]:
start = timeit.default_timer()

##########################################################

label_to_ix=np.load('label_to_ix.npy', allow_pickle=True).item()
ix_to_label=np.load('ix_to_label.npy', allow_pickle=True).item()
training_data=np.load('training_data.npy', allow_pickle=True)
test_data=np.load('test_data.npy', allow_pickle=True)
val_data=np.load('val_data.npy', allow_pickle=True)
word_to_ix=np.load('word_to_ix.npy', allow_pickle=True).item()
ix_to_word=np.load('ix_to_word.npy', allow_pickle=True).item()
newwikivec=np.load('newwikivec.npy', allow_pickle=True)
wikivoc=np.load('wikivoc.npy', allow_pickle=True).item()

wikisize=newwikivec.shape[0]
rvocsize=newwikivec.shape[1]
wikivec=autograd.Variable(torch.FloatTensor(newwikivec))

batchsize = 32

def preprocessing(data):

    new_data=[]
    for i, note, j in data:
        templabel=[0.0]*len(label_to_ix)
        for jj in j:
            if jj in wikivoc:
                templabel[label_to_ix[jj]]=1.0
        templabel=np.array(templabel,dtype=float)
        new_data.append((i, note, templabel))
    new_data=np.array(new_data, dtype=object)
    
    lenlist=[]
    for i in new_data:
        lenlist.append(len(i[0]))
    sortlen=sorted(range(len(lenlist)), key=lambda k: lenlist[k])  
    new_data=new_data[sortlen]
    
    batch_data=[]
    
    for start_ix in range(0, len(new_data)-batchsize+1, batchsize):
        thisblock=new_data[start_ix:start_ix+batchsize]
        mybsize= len(thisblock)
        numword=np.max([len(ii[0]) for ii in thisblock])
        main_matrix = np.zeros((mybsize, numword), dtype= int)
        for i in range(main_matrix.shape[0]):
            for j in range(main_matrix.shape[1]):
                try:
                    if thisblock[i][0][j] in word_to_ix:
                        main_matrix[i,j] = word_to_ix[thisblock[i][0][j]]
                    
                except IndexError:
                    pass       # the matrix is initialized with 0, so missing positions stay padded with 0
    
        xxx2=[]
        yyy=[]
        for ii in thisblock:
            xxx2.append(ii[1])
            yyy.append(ii[2])
        
        xxx2=np.array(xxx2)
        yyy=np.array(yyy)
        batch_data.append((autograd.Variable(torch.from_numpy(main_matrix)),autograd.Variable(torch.FloatTensor(xxx2)),autograd.Variable(torch.FloatTensor(yyy))))
    return batch_data


batchtraining_data=preprocessing(training_data)
batchtest_data=preprocessing(test_data)
batchval_data=preprocessing(val_data)

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')    

Time duration: 0h 00m 56s


In [None]:
start = timeit.default_timer()

######################################################################
# Create the model:

Embeddingsize = 100
hidden_dim = 200

class CNN(nn.Module):

    def __init__(self, batch_size, vocab_size, tagset_size):
        super(CNN, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size+1, Embeddingsize, padding_idx=0)
        self.embed_drop = nn.Dropout(p=0.2)
        
        self.hidden2tag = nn.Linear(300, tagset_size)
        
        
        self.convs1 = nn.Conv1d(Embeddingsize,100,3)
        self.convs2 = nn.Conv1d(Embeddingsize,100,4)
        self.convs3 = nn.Conv1d(Embeddingsize,100,5)
        
        
        self.layer2 = nn.Linear(Embeddingsize, 1,bias=False)
        self.embedding=nn.Linear(rvocsize,Embeddingsize)
        self.vattention=nn.Linear(Embeddingsize,Embeddingsize)
        
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
        self.dropout = nn.Dropout(0.2)
    
    def forward(self, vec1, nvec, wiki, simlearning):
       
        thisembeddings=self.word_embeddings(vec1)
        thisembeddings = self.embed_drop(thisembeddings)
        thisembeddings=thisembeddings.transpose(1,2)
        
        output1=self.tanh(self.convs1(thisembeddings))
        output1=nn.MaxPool1d(output1.size()[2])(output1)
        
        output2=self.tanh(self.convs2(thisembeddings))
        output2=nn.MaxPool1d(output2.size()[2])(output2)
        
        output3=self.tanh(self.convs3(thisembeddings))
        output3=nn.MaxPool1d(output3.size()[2])(output3)
        
        output4 = torch.cat([output1,output2,output3], 1).squeeze(2)
        
        if simlearning==1:
            nvec=nvec.view(batchsize,1,-1)
            nvec=nvec.expand(batchsize,wiki.size()[0],-1)
            wiki=wiki.view(1,wiki.size()[0],-1)
            wiki=wiki.expand(nvec.size()[0],wiki.size()[1],-1)
            new=wiki*nvec
            new=self.embedding(new)
            vattention=self.sigmoid(self.vattention(new))
            new=new*vattention
            vec3=self.layer2(new)
            vec3=vec3.view(batchsize,-1)
        
       
        vec2 = self.hidden2tag(output4)
        if simlearning==1:
            tag_scores = self.sigmoid(vec2.detach()+vec3)
        else:
            tag_scores = self.sigmoid(vec2)
        
        
        return tag_scores

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s') 

Time duration: 0h 00m 00s
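
The `simlearning==1` branch adds a knowledge-based logit to each label. For a single note, the computation can be sketched in NumPy as below. This is an illustration under assumptions: `ksi_logits` is a hypothetical name, the weights are random stand-ins for `self.embedding`, `self.vattention`, and `self.layer2`, and the base logits are set to zero instead of coming from the CNN.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ksi_logits(note_vec, wiki_vecs, We, be, Wa, ba, w2):
    """note_vec: (V,); wiki_vecs: (L, V); returns one extra logit per label."""
    shared = wiki_vecs * note_vec   # (L, V) nonzero only where both are nonzero
    e = shared @ We.T + be          # (L, E) embedded intersection
    gate = sigmoid(e @ Wa.T + ba)   # (L, E) attention gate in (0, 1)
    return (e * gate) @ w2          # (L,) weighted sum per label

rng = np.random.default_rng(0)
V, E, L = 8, 4, 3
note_vec = rng.integers(0, 2, size=V).astype(float)
wiki_vecs = rng.integers(0, 2, size=(L, V)).astype(float)
extra = ksi_logits(note_vec, wiki_vecs, rng.normal(size=(E, V)), np.zeros(E),
                   rng.normal(size=(E, E)), np.zeros(E), rng.normal(size=E))
base_logits = np.zeros(L)           # stand-in for the CNN output
print(sigmoid(base_logits + extra).shape)  # (3,)
```

In the model, the base logits are detached before the sum, so the second training stage updates only the knowledge branch.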


In [None]:
start = timeit.default_timer()

######################################################################
# Train the model:

topk = 10
max_epochs = 5000 # Default is 5000
print(f"max epochs = {max_epochs}")

def trainmodel(model, sim):
    modelsaved=[]
    modelperform=[]
    
    bestresults=-1
    bestiter=-1
    for epoch in range(max_epochs):  
        model.train()
        
        lossestrain = []
        recall=[]
        for mysentence in batchtraining_data:
            model.zero_grad()
            
            targets = mysentence[2].cuda()
            tag_scores = model(mysentence[0].cuda(),mysentence[1].cuda(),wikivec.cuda(),sim)
            loss = loss_function(tag_scores, targets)
            loss.backward()
            optimizer.step()
            lossestrain.append(loss.data.mean())
        print (f"epoch = {epoch}")
        modelsaved.append(copy.deepcopy(model.state_dict()))
        model.eval()
    
        recall=[]
        for inputs in batchval_data:
           
            targets = inputs[2].cuda()
            tag_scores = model(inputs[0].cuda(),inputs[1].cuda() ,wikivec.cuda(),sim)
    
            loss = loss_function(tag_scores, targets)
            
            targets=targets.data.cpu().numpy()
            tag_scores= tag_scores.data.cpu().numpy()
            
            
            for iii in range(0,len(tag_scores)):
                temp={}
                for iiii in range(0,len(tag_scores[iii])):
                    temp[iiii]=tag_scores[iii][iiii]
                temp1=[(k, temp[k]) for k in sorted(temp, key=temp.get, reverse=True)]
                thistop=int(np.sum(targets[iii]))
                hit=0.0
                for ii in temp1[0:max(thistop,topk)]:
                    if targets[iii][ii[0]]==1.0:
                        hit=hit+1
                if thistop!=0:
                    recall.append(hit/thistop)
            
        print ('validation recall @ top-',topk, np.mean(recall))
        
        
        
        modelperform.append(np.mean(recall))
        if modelperform[-1]>bestresults:
            bestresults=modelperform[-1]
            bestiter=len(modelperform)-1
            print(f"Update the best model to epoch {bestiter}")
        
        if (len(modelperform)-bestiter)>5:
            print (modelperform,bestiter)
            return modelsaved[bestiter]

    print(f"Reach the max epochs, return the best model at epoch {bestiter}")
    return modelsaved[bestiter]
    
model = CNN(batchsize, len(word_to_ix), len(label_to_ix))
model.cuda()

loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters())

print("Train basemodel")
basemodel= trainmodel(model, 0)
torch.save(basemodel, 'CNN_model')

model = CNN(batchsize, len(word_to_ix), len(label_to_ix))
model.cuda()
model.load_state_dict(basemodel)
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
print("")
print("Train model with KSI")
KSImodel= trainmodel(model, 1)
torch.save(KSImodel, 'KSI_CNN_model')

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s') 

max epochs = 5000
Train basemodel
epoch = 0
validation recall @ top- 10 0.4424304468846055
Update the best model to epoch 0
epoch = 1
validation recall @ top- 10 0.5308329047788358
Update the best model to epoch 1
epoch = 2
validation recall @ top- 10 0.5740794226579505
Update the best model to epoch 2
epoch = 3
validation recall @ top- 10 0.6079622025717314
Update the best model to epoch 3
epoch = 4
validation recall @ top- 10 0.6447288187979041
Update the best model to epoch 4
epoch = 5
validation recall @ top- 10 0.6614662603343429
Update the best model to epoch 5
epoch = 6
validation recall @ top- 10 0.6826196291581947
Update the best model to epoch 6
epoch = 7
validation recall @ top- 10 0.6908042089212478
Update the best model to epoch 7
epoch = 8
validation recall @ top- 10 0.7006907045359452
Update the best model to epoch 8
epoch = 9
validation recall @ top- 10 0.7086330055181878
Update the best model to epoch 9
epoch = 10
validation recall @ top- 10 0.7143359639700324
Update t

In [None]:
start = timeit.default_timer()

######################################################################
# Test the model:

def testmodel(modelstate, sim):
    model = CNN(batchsize, len(word_to_ix), len(label_to_ix))
    model.cuda()
    model.load_state_dict(modelstate)
    loss_function = nn.BCELoss()
    model.eval()
    recall=[]
    lossestest = []
    
    y_true=[]
    y_scores=[]
    
    
    for inputs in batchtest_data:
       
        targets = inputs[2].cuda()
        
        tag_scores = model(inputs[0].cuda(),inputs[1].cuda() ,wikivec.cuda(),sim)

        loss = loss_function(tag_scores, targets)
        
        targets=targets.data.cpu().numpy()
        tag_scores= tag_scores.data.cpu().numpy()
        
        
        lossestest.append(loss.data.cpu().mean())
        y_true.append(targets)
        y_scores.append(tag_scores)
        
        for iii in range(0,len(tag_scores)):
            temp={}
            for iiii in range(0,len(tag_scores[iii])):
                temp[iiii]=tag_scores[iii][iiii]
            temp1=[(k, temp[k]) for k in sorted(temp, key=temp.get, reverse=True)]
            thistop=int(np.sum(targets[iii]))
            hit=0.0
            
            for ii in temp1[0:max(thistop,topk)]:
                if targets[iii][ii[0]]==1.0:
                    hit=hit+1
            if thistop!=0:
                recall.append(hit/thistop)
    y_true=np.concatenate(y_true,axis=0)
    y_scores=np.concatenate(y_scores,axis=0)
    y_true=y_true.T
    y_scores=y_scores.T
    temptrue=[]
    tempscores=[]
    for  col in range(0,len(y_true)):
        if np.sum(y_true[col])!=0:
            temptrue.append(y_true[col])
            tempscores.append(y_scores[col])
    temptrue=np.array(temptrue)
    tempscores=np.array(tempscores)
    y_true=temptrue.T
    y_scores=tempscores.T
    y_pred=(y_scores>0.5).astype(int)
    # print ('test loss', np.mean(lossestest))
    # print ('top-',topk, np.mean(recall))
    # print ('macro AUC', roc_auc_score(y_true, y_scores,average='macro'))
    # print ('micro AUC', roc_auc_score(y_true, y_scores,average='micro'))
    # print ('macro F1', f1_score(y_true, y_pred, average='macro')  )
    # print ('micro F1', f1_score(y_true, y_pred, average='micro')  )
    test_loss = np.mean(lossestest)
    test_recall = np.mean(recall)
    test_mac_auc = roc_auc_score(y_true, y_scores,average='macro')
    test_mic_auc = roc_auc_score(y_true, y_scores,average='micro')
    test_mac_f1 = f1_score(y_true, y_pred, average='macro')
    test_mic_f1 = f1_score(y_true, y_pred, average='micro')
    return test_loss, test_recall, test_mac_auc, test_mic_auc, test_mac_f1, test_mic_f1

print('Test CNN baseline')
cnn_loss, cnn_recall, cnn_mac_auc, cnn_mic_auc, cnn_mac_f1, cnn_mic_f1 = testmodel(basemodel, 0)
print('Test KSI+CNN')
cnnKSI_loss, cnnKSI_recall, cnnKSI_mac_auc, cnnKSI_mic_auc, cnnKSI_mac_f1, cnnKSI_mic_f1 = testmodel(KSImodel, 1)
######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s') 

Test CNN baseline
Test KSI+CNN
Time duration: 0h 00m 24s


In [None]:
result_cnn = [['CNN Baseline', 
               cnn_loss, 
               cnn_recall, 
               cnn_mac_auc, 
               cnn_mic_auc, 
               cnn_mac_f1, 
               cnn_mic_f1], 
        ['CNN + KSI', 
         cnnKSI_loss, 
         cnnKSI_recall, 
         cnnKSI_mac_auc, 
         cnnKSI_mic_auc, 
         cnnKSI_mac_f1, 
         cnnKSI_mic_f1]]
df_cnn = pd.DataFrame(result_cnn, columns=['Model', 'Loss', 'Recall@10', 'Macro_AUC', 'Micro_AUC', 'Macro_F1', 'Micro_F1'])
df_cnn

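|   | Model | Loss | Recall@10 | Macro_AUC | Micro_AUC | Macro_F1 | Micro_F1 |
|---|-------|------|-----------|-----------|-----------|----------|----------|
| 0 | CNN Baseline | 0.038858 | 0.754856 | 0.831774 | 0.967348 | 0.214947 | 0.627527 |
| 1 | CNN + KSI | 0.038051 | 0.765813 | 0.847441 | 0.970184 | 0.226481 | 0.635694 |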
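
The `Recall@10` column comes from the nested loop in `testmodel`, which ranks all labels by score and counts hits among the top `max(thistop, topk)` of them. The following is a minimal pure-Python sketch of that per-sample computation, not the notebook's code; the helper name `recall_at_k` and the toy scores are made up for illustration.

```python
def recall_at_k(scores, targets, k=10):
    """Recall among the top max(k, n_true) scored labels for one sample.

    Returns None when the sample has no positive label, mirroring the
    `if thistop != 0` guard in the notebook (such samples are skipped).
    """
    n_true = int(sum(targets))
    if n_true == 0:
        return None
    # Label indices ordered from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    cutoff = max(n_true, k)  # the cutoff grows when a note has more than k codes
    hits = sum(1 for i in ranked[:cutoff] if targets[i] == 1.0)
    return hits / n_true


# Toy example (hypothetical values): 5 labels, 3 of them true.
scores = [0.9, 0.1, 0.8, 0.3, 0.7]
targets = [1.0, 1.0, 0.0, 0.0, 1.0]
print(recall_at_k(scores, targets, k=2))
```

Because the cutoff is `max(n_true, k)` rather than a fixed `k`, a note with more than 10 true codes is evaluated over a longer ranked list, so the reported value is not a strict recall at 10.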


# LSTM

In [None]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
import numpy as np
torch.manual_seed(1)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
import copy

In [None]:
start = timeit.default_timer()

##########################################################

label_to_ix=np.load('label_to_ix.npy', allow_pickle=True).item()
ix_to_label=np.load('ix_to_label.npy', allow_pickle=True)
training_data=np.load('training_data.npy', allow_pickle=True)
test_data=np.load('test_data.npy', allow_pickle=True)
val_data=np.load('val_data.npy', allow_pickle=True)
word_to_ix=np.load('word_to_ix.npy', allow_pickle=True).item()
ix_to_word=np.load('ix_to_word.npy', allow_pickle=True)
newwikivec=np.load('newwikivec.npy', allow_pickle=True)
wikivoc=np.load('wikivoc.npy', allow_pickle=True).item()

wikisize=newwikivec.shape[0]
rvocsize=newwikivec.shape[1]
wikivec=autograd.Variable(torch.FloatTensor(newwikivec))

batchsize = 32

def preprocessing(data):

    new_data=[]
    for i, note, j in data:
        templabel=[0.0]*len(label_to_ix)
        for jj in j:
            if jj in wikivoc:
                templabel[label_to_ix[jj]]=1.0
        templabel=np.array(templabel,dtype=float)
        new_data.append((i, note, templabel))
    new_data=np.array(new_data, dtype=object)
    
    lenlist=[]
    for i in new_data:
        lenlist.append(len(i[0]))
    sortlen=sorted(range(len(lenlist)), key=lambda k: lenlist[k])  
    new_data=new_data[sortlen]
    
    batch_data=[]
    
    for start_ix in range(0, len(new_data)-batchsize+1, batchsize):
        thisblock=new_data[start_ix:start_ix+batchsize]
        mybsize= len(thisblock)
        numword=np.max([len(ii[0]) for ii in thisblock])
        main_matrix = np.zeros((mybsize, numword), dtype= int)
        for i in range(main_matrix.shape[0]):
            for j in range(main_matrix.shape[1]):
                try:
                    if thisblock[i][0][j] in word_to_ix:
                        main_matrix[i,j] = word_to_ix[thisblock[i][0][j]]
                    
                except IndexError:
                    pass       # main_matrix is zero-initialized, so shorter notes stay padded with 0
    
        xxx2=[]
        yyy=[]
        for ii in thisblock:
            xxx2.append(ii[1])
            yyy.append(ii[2])
        
        xxx2=np.array(xxx2)
        yyy=np.array(yyy)
        batch_data.append((autograd.Variable(torch.from_numpy(main_matrix)),autograd.Variable(torch.FloatTensor(xxx2)),autograd.Variable(torch.FloatTensor(yyy))))
    return batch_data


batchtraining_data=preprocessing(training_data)
batchtest_data=preprocessing(test_data)
batchval_data=preprocessing(val_data)

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')  

Time duration: 0h 00m 54s
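
The batching logic in `preprocessing` can be summarized as: sort notes by length, cut them into blocks of `batchsize`, and zero-pad each block to its longest note. Below is a small pure-Python sketch of that idea under simplifying assumptions (token lists only, no note vectors or labels). The function `make_batches` and the toy vocabulary are hypothetical.

```python
def make_batches(token_lists, word_to_ix, batch_size):
    """Sort notes by length, then build zero-padded index matrices."""
    order = sorted(range(len(token_lists)), key=lambda i: len(token_lists[i]))
    ordered = [token_lists[i] for i in order]
    batches = []
    # Same range as the notebook: a trailing partial batch is skipped.
    for start in range(0, len(ordered) - batch_size + 1, batch_size):
        block = ordered[start:start + batch_size]
        width = max(len(tokens) for tokens in block)
        matrix = [
            # Unknown words and padding positions both map to index 0.
            [word_to_ix.get(tok, 0) for tok in tokens] + [0] * (width - len(tokens))
            for tokens in block
        ]
        batches.append(matrix)
    return batches


# Toy vocabulary (assumed to start at 1 so that 0 is free for padding).
vocab = {"fever": 1, "cough": 2, "pain": 3}
notes = [["fever", "cough", "pain"], ["pain"], ["cough", "unknown"],
         ["fever"], ["pain", "fever"]]
batches = make_batches(notes, vocab, batch_size=2)
print(batches)
```

One consequence worth keeping in mind: when the number of notes is not a multiple of `batchsize`, the longest leftover notes in each split are never batched, so they contribute neither to training nor to the reported validation and test metrics.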


In [None]:
start = timeit.default_timer()

######################################################################
# Create the model:

Embeddingsize = 100
hidden_dim = 200

class LSTM(nn.Module):

    def __init__(self, batch_size, vocab_size, tagset_size):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size+1, Embeddingsize, padding_idx=0)
        self.lstm = nn.LSTM(Embeddingsize, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()
        
        
        self.layer2 = nn.Linear(Embeddingsize, 1,bias=False)
        self.embedding=nn.Linear(rvocsize,Embeddingsize)
        self.vattention=nn.Linear(Embeddingsize,Embeddingsize,bias=False)
        
        self.softmax = nn.Softmax()
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
        self.embed_drop = nn.Dropout(p=0.2)
    
    def init_hidden(self):
        # Initial (h_0, c_0), each of shape (num_layers=1, batchsize, hidden_dim), on the GPU
        return (autograd.Variable(torch.zeros(1, batchsize, self.hidden_dim).cuda()),
                autograd.Variable(torch.zeros(1, batchsize, self.hidden_dim).cuda()))

    
    def forward(self, vec1, nvec, wiki, simlearning):
      
        thisembeddings=self.word_embeddings(vec1).transpose(0,1)
        thisembeddings = self.embed_drop(thisembeddings)
       
        if simlearning==1:
            nvec=nvec.view(batchsize,1,-1)
            nvec=nvec.expand(batchsize,wiki.size()[0],-1)
            wiki=wiki.view(1,wiki.size()[0],-1)
            wiki=wiki.expand(nvec.size()[0],wiki.size()[1],-1)
            new=wiki*nvec
            new=self.embedding(new)
            vattention=self.sigmoid(self.vattention(new))
            new=new*vattention
            vec3=self.layer2(new)
            vec3=vec3.view(batchsize,-1)
        
        
        
        lstm_out, self.hidden = self.lstm(thisembeddings, self.hidden)
        
        lstm_out=lstm_out.transpose(0,2).transpose(0,1)
        
        output1=nn.MaxPool1d(lstm_out.size()[2])(lstm_out).view(batchsize,-1)
        
        vec2 = self.hidden2tag(output1)
        if simlearning==1:
            tag_scores = self.sigmoid(vec2.detach()+vec3)
        else:
            tag_scores = self.sigmoid(vec2)
        
        
        return tag_scores

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')  

Time duration: 0h 00m 00s
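
The `simlearning==1` branch of `forward` is the KSI part of the model: for every label, the note vector is multiplied element-wise with that label's Wikipedia vector, projected to an embedding, gated by a sigmoid attention, and reduced to one scalar that is added to the baseline logit. The sketch below walks through those steps for a single note in pure Python. It is an illustration only; `ksi_logits` and all weights are hypothetical toy values, and the real model uses batched tensors.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def ksi_logits(note_vec, wiki_vecs, W_e, b_e, W_a, w_o):
    """Per-label KSI logit for one note.

    note_vec : bag-of-words vector over the wiki vocabulary (length V)
    wiki_vecs: one vector per label (L x V)
    W_e, b_e : embedding layer, like self.embedding (d x V, d)
    W_a      : attention layer without bias, like self.vattention (d x d)
    w_o      : output layer without bias, like self.layer2 (d)
    """
    logits = []
    for wiki in wiki_vecs:
        z = [n * w for n, w in zip(note_vec, wiki)]          # wiki * nvec
        e = [x + b for x, b in zip(matvec(W_e, z), b_e)]      # self.embedding
        gate = [sigmoid(x) for x in matvec(W_a, e)]           # sigmoid(self.vattention)
        gated = [x * g for x, g in zip(e, gate)]              # new * vattention
        logits.append(sum(w * x for w, x in zip(w_o, gated))) # self.layer2
    return logits


# Toy setup: V = 2 words, d = 1 embedding dimension, L = 2 labels.
note = [1.0, 0.0]
wiki = [[1.0, 1.0], [0.0, 1.0]]
print(ksi_logits(note, wiki, W_e=[[1.0, 1.0]], b_e=[0.0], W_a=[[0.0]], w_o=[2.0]))
```

In the notebook the result (`vec3`) is added to `vec2.detach()`, so during the KSI stage gradients flow only through the KSI layers and not back into the LSTM baseline.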


In [None]:
start = timeit.default_timer()

######################################################################
# Train the model:

topk = 10
max_epochs = 5000 # Default is 5000
print(f"max epochs = {max_epochs}")

def trainmodel(model, sim):
    modelsaved=[]
    modelperform=[]
    
    bestresults=-1
    bestiter=-1
    for epoch in range(max_epochs):  
        
        model.train()
        
        lossestrain = []
        recall=[]
        for mysentence in batchtraining_data:
            model.zero_grad()
            model.hidden = model.init_hidden()
            targets = mysentence[2].cuda()
            tag_scores = model(mysentence[0].cuda(),mysentence[1].cuda(),wikivec.cuda(),sim)
            loss = loss_function(tag_scores, targets)
            loss.backward()
            optimizer.step()
            lossestrain.append(loss.data.mean())
        print (f"epoch = {epoch}")
        
        modelsaved.append(copy.deepcopy(model.state_dict()))
        model.eval()
    
        recall=[]
        for inputs in batchval_data:
            model.hidden = model.init_hidden()
            targets = inputs[2].cuda()
            tag_scores = model(inputs[0].cuda(),inputs[1].cuda() ,wikivec.cuda(),sim)
    
            loss = loss_function(tag_scores, targets)
            
            targets=targets.data.cpu().numpy()
            tag_scores= tag_scores.data.cpu().numpy()
            
            
            for iii in range(0,len(tag_scores)):
                temp={}
                for iiii in range(0,len(tag_scores[iii])):
                    temp[iiii]=tag_scores[iii][iiii]
                temp1=[(k, temp[k]) for k in sorted(temp, key=temp.get, reverse=True)]
                thistop=int(np.sum(targets[iii]))
                hit=0.0
                for ii in temp1[0:max(thistop,topk)]:
                    if targets[iii][ii[0]]==1.0:
                        hit=hit+1
                if thistop!=0:
                    recall.append(hit/thistop)
            
        print ('validation recall @ top-',topk, np.mean(recall))
               
        modelperform.append(np.mean(recall))
        if modelperform[-1]>bestresults:
            bestresults=modelperform[-1]
            bestiter=len(modelperform)-1
            print(f"Update the best model to epoch {bestiter}")
        
        if (len(modelperform)-bestiter)>5:
            print (modelperform,bestiter)
            return modelsaved[bestiter]

    print(f"Reach the max epochs, return the best model at epoch {bestiter}")
    return modelsaved[bestiter]
    
model = LSTM(batchsize, len(word_to_ix), len(label_to_ix))
model.cuda()

loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
print("Train basemodel")
basemodel= trainmodel(model, 0)
torch.save(basemodel, 'LSTM_model')

model = LSTM(batchsize, len(word_to_ix), len(label_to_ix))
model.cuda()
model.load_state_dict(basemodel)
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
print("")
print("Train model with KSI")
KSImodel= trainmodel(model, 1)
torch.save(KSImodel, 'KSI_LSTM_model')

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')  

max epochs = 5000
Train basemodel
epoch = 0
validation recall @ top- 10 0.41915675729388885
Update the best model to epoch 0
epoch = 1
validation recall @ top- 10 0.4964443709283647
Update the best model to epoch 1
epoch = 2
validation recall @ top- 10 0.534239150672695
Update the best model to epoch 2
epoch = 3
validation recall @ top- 10 0.5581098953780302
Update the best model to epoch 3
epoch = 4
validation recall @ top- 10 0.5906730436186113
Update the best model to epoch 4
epoch = 5
validation recall @ top- 10 0.616447153576521
Update the best model to epoch 5
epoch = 6
validation recall @ top- 10 0.6329954780439777
Update the best model to epoch 6
epoch = 7
validation recall @ top- 10 0.6552253302863654
Update the best model to epoch 7
epoch = 8
validation recall @ top- 10 0.6678222285296515
Update the best model to epoch 8
epoch = 9
validation recall @ top- 10 0.6745732579811821
Update the best model to epoch 9
epoch = 10
validation recall @ top- 10 0.6857097047979456
Update th
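
`trainmodel` stops once the validation recall has not improved for five consecutive epochs and returns the weights saved at the best epoch. The stopping rule alone can be sketched as follows; the function name `best_epoch_with_patience` and the validation scores are hypothetical.

```python
def best_epoch_with_patience(val_scores, patience=5):
    """Return (best_epoch, last_epoch_run) under the notebook's stopping rule."""
    best_score, best_epoch = -1, -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        # Notebook condition: (len(modelperform) - bestiter) > 5,
        # where len(modelperform) equals epoch + 1 at this point.
        if (epoch + 1 - best_epoch) > patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores) - 1


# Hypothetical validation recalls: the peak is at epoch 2.
scores = [0.42, 0.50, 0.53, 0.52, 0.51, 0.50, 0.49, 0.48, 0.60]
print(best_epoch_with_patience(scores))
```

In this toy run training stops after epoch 7, so the later improvement at epoch 8 is never reached. Note also that the notebook appends a deep copy of the `state_dict` every epoch; keeping only the best copy would give the same return value with less memory, at the cost of not being able to inspect earlier checkpoints.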

In [None]:
start = timeit.default_timer()

######################################################################
# Test the model:

def testmodel(modelstate, sim):
    model = LSTM(batchsize, len(word_to_ix), len(label_to_ix))
    model.cuda()
    model.load_state_dict(modelstate)
    loss_function = nn.BCELoss()
    model.eval()
    recall=[]
    lossestest = []
    
    y_true=[]
    y_scores=[]
        
    for inputs in batchtest_data:
        model.hidden = model.init_hidden()
        targets = inputs[2].cuda()
        
        tag_scores = model(inputs[0].cuda(),inputs[1].cuda() ,wikivec.cuda(),sim)

        loss = loss_function(tag_scores, targets)
        
        targets=targets.data.cpu().numpy()
        tag_scores= tag_scores.data.cpu().numpy()
        
        lossestest.append(loss.data.cpu().mean())
        y_true.append(targets)
        y_scores.append(tag_scores)
        
        for iii in range(0,len(tag_scores)):
            temp={}
            for iiii in range(0,len(tag_scores[iii])):
                temp[iiii]=tag_scores[iii][iiii]
            temp1=[(k, temp[k]) for k in sorted(temp, key=temp.get, reverse=True)]
            thistop=int(np.sum(targets[iii]))
            hit=0.0
            
            for ii in temp1[0:max(thistop,topk)]:
                if targets[iii][ii[0]]==1.0:
                    hit=hit+1
            if thistop!=0:
                recall.append(hit/thistop)
    y_true=np.concatenate(y_true,axis=0)
    y_scores=np.concatenate(y_scores,axis=0)
    y_true=y_true.T
    y_scores=y_scores.T
    temptrue=[]
    tempscores=[]
    for  col in range(0,len(y_true)):
        if np.sum(y_true[col])!=0:
            temptrue.append(y_true[col])
            tempscores.append(y_scores[col])
    temptrue=np.array(temptrue)
    tempscores=np.array(tempscores)
    y_true=temptrue.T
    y_scores=tempscores.T
    y_pred=(y_scores>0.5).astype(int)
    # print ('test loss', np.mean(lossestest))
    # print ('top-',topk, np.mean(recall))
    # print ('macro AUC', roc_auc_score(y_true, y_scores,average='macro'))
    # print ('micro AUC', roc_auc_score(y_true, y_scores,average='micro'))
    # print ('macro F1', f1_score(y_true, y_pred, average='macro')  )
    # print ('micro F1', f1_score(y_true, y_pred, average='micro')  )
    test_loss = np.mean(lossestest)
    test_recall = np.mean(recall)
    test_mac_auc = roc_auc_score(y_true, y_scores,average='macro')
    test_mic_auc = roc_auc_score(y_true, y_scores,average='micro')
    test_mac_f1 = f1_score(y_true, y_pred, average='macro')
    test_mic_f1 = f1_score(y_true, y_pred, average='micro')
    return test_loss, test_recall, test_mac_auc, test_mic_auc, test_mac_f1, test_mic_f1

print('Test LSTM baseline')
lstm_loss, lstm_recall, lstm_mac_auc, lstm_mic_auc, lstm_mac_f1, lstm_mic_f1 = testmodel(basemodel, 0)
print('Test KSI+LSTM')
lstmKSI_loss, lstmKSI_recall, lstmKSI_mac_auc, lstmKSI_mic_auc, lstmKSI_mac_f1, lstmKSI_mic_f1 = testmodel(KSImodel, 1)

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')  

Test LSTM baseline
Test KSI+LSTM
Time duration: 0h 00m 33s
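
Before computing AUC and F1, `testmodel` transposes `y_true` and `y_scores` and keeps only the label columns that have at least one positive test example, since ROC AUC is undefined for a label with no positives. A pure-Python sketch of that filter is shown below; `drop_labels_without_positives` and the toy matrices are hypothetical.

```python
def drop_labels_without_positives(y_true, y_scores):
    """Keep only label columns with at least one positive in y_true."""
    n_labels = len(y_true[0])
    keep = [j for j in range(n_labels) if any(row[j] != 0 for row in y_true)]
    filtered_true = [[row[j] for j in keep] for row in y_true]
    filtered_scores = [[row[j] for j in keep] for row in y_scores]
    return filtered_true, filtered_scores, keep


# Toy data: 3 samples, 3 labels; label 1 never occurs in the test set.
y_true = [[1, 0, 0], [0, 0, 1], [1, 0, 1]]
y_scores = [[0.9, 0.2, 0.1], [0.3, 0.1, 0.8], [0.7, 0.4, 0.6]]
true_kept, scores_kept, kept = drop_labels_without_positives(y_true, y_scores)
print(kept)
```

As a result, the macro averages in the result tables are taken over the labels that actually appear in the test batches, not over the full label set.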


In [None]:
result_lstm = [['LSTM Baseline', 
                lstm_loss, 
                lstm_recall, 
                lstm_mac_auc, 
                lstm_mic_auc, 
                lstm_mac_f1, 
                lstm_mic_f1], 
        ['LSTM + KSI', 
         lstmKSI_loss, 
         lstmKSI_recall, 
         lstmKSI_mac_auc, 
         lstmKSI_mic_auc, 
         lstmKSI_mac_f1, 
         lstmKSI_mic_f1]]
df_lstm = pd.DataFrame(result_lstm, columns=['Model', 'Loss', 'Recall@10', 'Macro_AUC', 'Micro_AUC', 'Macro_F1', 'Micro_F1'])
df_lstm

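|   | Model | Loss | Recall@10 | Macro_AUC | Micro_AUC | Macro_F1 | Micro_F1 |
|---|-------|------|-----------|-----------|-----------|----------|----------|
| 0 | LSTM Baseline | 0.034217 | 0.768321 | 0.843087 | 0.970367 | 0.207647 | 0.646738 |
| 1 | LSTM + KSI | 0.033033 | 0.778417 | 0.858165 | 0.972597 | 0.225144 | 0.648586 |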
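
In these tables Macro_F1 is far lower than Micro_F1. A plausible reading is that many rare codes are never predicted above the 0.5 threshold: macro averaging weights every label equally, while micro averaging pools all decisions and is dominated by frequent labels. The toy sketch below illustrates the difference in pure Python; the helper names and data are made up and it is not a replacement for `sklearn.metrics.f1_score`.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_micro_f1(y_true, y_pred):
    """Return (macro F1, micro F1) for binary multi-label matrices."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    macro = sum(f1(*c) for c in counts) / n_labels   # mean of per-label F1
    micro = f1(sum(c[0] for c in counts),            # F1 of pooled counts
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    return macro, micro


# Label 0 is frequent and predicted perfectly; label 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
macro, micro = macro_micro_f1(y_true, y_pred)
print(macro, micro)
```

Here one missed rare label halves the macro score while the micro score stays high, which is the same pattern as in the tables.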


# LSTMatt

In [None]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
torch.manual_seed(1)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
import copy

In [None]:
start = timeit.default_timer()

##########################################################

label_to_ix=np.load('label_to_ix.npy', allow_pickle=True).item()
ix_to_label=np.load('ix_to_label.npy', allow_pickle=True)
training_data=np.load('training_data.npy', allow_pickle=True)
test_data=np.load('test_data.npy', allow_pickle=True)
val_data=np.load('val_data.npy', allow_pickle=True)
word_to_ix=np.load('word_to_ix.npy', allow_pickle=True).item()
ix_to_word=np.load('ix_to_word.npy', allow_pickle=True)
newwikivec=np.load('newwikivec.npy', allow_pickle=True)
wikivoc=np.load('wikivoc.npy', allow_pickle=True).item()

wikisize=newwikivec.shape[0]
rvocsize=newwikivec.shape[1]
wikivec=autograd.Variable(torch.FloatTensor(newwikivec))

batchsize = 32

def preprocessing(data):

    new_data=[]
    for i, note, j in data:
        templabel=[0.0]*len(label_to_ix)
        for jj in j:
            if jj in wikivoc:
                templabel[label_to_ix[jj]]=1.0
        templabel=np.array(templabel,dtype=float)
        new_data.append((i, note, templabel))
    new_data=np.array(new_data, dtype=object)
    
    lenlist=[]
    for i in new_data:
        lenlist.append(len(i[0]))
    sortlen=sorted(range(len(lenlist)), key=lambda k: lenlist[k])  
    new_data=new_data[sortlen]
    
    batch_data=[]
    
    for start_ix in range(0, len(new_data)-batchsize+1, batchsize):
        thisblock=new_data[start_ix:start_ix+batchsize]
        mybsize= len(thisblock)
        numword=np.max([len(ii[0]) for ii in thisblock])
        main_matrix = np.zeros((mybsize, numword), dtype= int)
        for i in range(main_matrix.shape[0]):
            for j in range(main_matrix.shape[1]):
                try:
                    if thisblock[i][0][j] in word_to_ix:
                        main_matrix[i,j] = word_to_ix[thisblock[i][0][j]]
                    
                except IndexError:
                    pass       # main_matrix is zero-initialized, so shorter notes stay padded with 0
    
        xxx2=[]
        yyy=[]
        for ii in thisblock:
            xxx2.append(ii[1])
            yyy.append(ii[2])
        
        xxx2=np.array(xxx2)
        yyy=np.array(yyy)
        batch_data.append((autograd.Variable(torch.from_numpy(main_matrix)),autograd.Variable(torch.FloatTensor(xxx2)),autograd.Variable(torch.FloatTensor(yyy))))
    return batch_data


batchtraining_data=preprocessing(training_data)
batchtest_data=preprocessing(test_data)
batchval_data=preprocessing(val_data)

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')   

Time duration: 0h 00m 55s


In [None]:
start = timeit.default_timer()

######################################################################
# Create the model:

Embeddingsize = 100
hidden_dim = 200

class LSTMattn(nn.Module):

    def __init__(self, batch_size, vocab_size, tagset_size):
        super(LSTMattn, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size+1, Embeddingsize, padding_idx=0)
        self.lstm = nn.LSTM(Embeddingsize, hidden_dim)
        self.hidden = self.init_hidden()
        
        self.H=nn.Linear(hidden_dim, tagset_size )  
        self.final = nn.Linear(hidden_dim, tagset_size)
        
        self.layer2 = nn.Linear(Embeddingsize, 1,bias=False)
        self.embedding=nn.Linear(rvocsize,Embeddingsize)
        self.vattention=nn.Linear(Embeddingsize,Embeddingsize,bias=False)
        
        self.softmax = nn.Softmax()
        self.sigmoid = nn.Sigmoid()
        self.embed_drop = nn.Dropout(p=0.2)
    
    def init_hidden(self):
        # Initial (h_0, c_0), each of shape (num_layers=1, batchsize, hidden_dim), on the GPU
        return (autograd.Variable(torch.zeros(1, batchsize, self.hidden_dim).cuda()),
                autograd.Variable(torch.zeros(1, batchsize, self.hidden_dim).cuda()))

    
    def forward(self, vec1, nvec, wiki, simlearning):
        
        
        thisembeddings=self.word_embeddings(vec1).transpose(0,1)
        thisembeddings = self.embed_drop(thisembeddings)
        
        
        if simlearning==1:
            nvec=nvec.view(batchsize,1,-1)
            nvec=nvec.expand(batchsize,wiki.size()[0],-1)
            wiki=wiki.view(1,wiki.size()[0],-1)
            wiki=wiki.expand(nvec.size()[0],wiki.size()[1],-1)
            new=wiki*nvec
            new=self.embedding(new)
            vattention=self.sigmoid(self.vattention(new))
            new=new*vattention
            vec3=self.layer2(new)
            vec3=vec3.view(batchsize,-1)
        
        
        lstm_out, self.hidden = self.lstm(
            thisembeddings, self.hidden)
        
        
        
        lstm_out=lstm_out.transpose(0,1)

        alpha=self.H.weight.matmul(lstm_out.transpose(1,2))
        alpha=F.softmax(alpha, dim=2)
        
        m=alpha.matmul(lstm_out)
        
        myfinal=self.final.weight.mul(m).sum(dim=2).add(self.final.bias)
        
        
        if simlearning==1:
            tag_scores = self.sigmoid(myfinal.detach()+vec3)
        else:
            tag_scores = self.sigmoid(myfinal)
                
        return tag_scores

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')   

Time duration: 0h 00m 00s
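
The difference between `LSTMattn` and the plain `LSTM` model is the label-wise attention: instead of max-pooling over time, each label gets its own softmax weighting over the LSTM outputs (`alpha`), a weighted sum of those outputs (`m`), and a label-specific linear score (`myfinal`). The following pure-Python sketch reproduces that computation for one note. The function `labelwise_attention` and the toy numbers are hypothetical.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def labelwise_attention(hidden_states, H, final_w, final_b):
    """One logit per label from per-label attention over time steps.

    hidden_states: LSTM outputs for one note (T x d)
    H            : attention weights, like self.H.weight (L x d)
    final_w/_b   : output weights and bias, like self.final (L x d, L)
    """
    d = len(hidden_states[0])
    logits = []
    for h_row, w_row, bias in zip(H, final_w, final_b):
        # alpha: softmax over time steps for this label
        alpha = softmax([dot(h_row, h_t) for h_t in hidden_states])
        # m: attention-weighted sum of the hidden states
        m = [sum(a * h_t[i] for a, h_t in zip(alpha, hidden_states)) for i in range(d)]
        logits.append(dot(w_row, m) + bias)
    return logits


# Toy input: T = 2 time steps, d = 2 hidden units, L = 2 labels.
states = [[1.0, 0.0], [0.0, 1.0]]
out = labelwise_attention(states,
                          H=[[0.0, 0.0], [100.0, 0.0]],
                          final_w=[[2.0, 4.0], [1.0, 0.0]],
                          final_b=[1.0, 0.0])
print(out)
```

The first label attends uniformly to both time steps, while the second puts nearly all of its weight on the first one. As in the notebook, only the weight matrix of `H` is used; its bias does not enter the attention scores.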


In [None]:
start = timeit.default_timer()

######################################################################
# Train the model:

topk = 10
max_epochs = 5000 # Default is 5000

def trainmodel(model, sim):
    modelsaved=[]
    modelperform=[]
    
    
    bestresults=-1
    bestiter=-1
    for epoch in range(max_epochs):  
       
        model.train()
        
        lossestrain = []
        recall=[]
        for mysentence in batchtraining_data:
            model.zero_grad()
            model.hidden = model.init_hidden()
            targets = mysentence[2].cuda()
            tag_scores = model(mysentence[0].cuda(),mysentence[1].cuda(),wikivec.cuda(),sim)
            loss = loss_function(tag_scores, targets)
            loss.backward()
            optimizer.step()
            lossestrain.append(loss.data.mean())
        print (f"epoch = {epoch}")
        modelsaved.append(copy.deepcopy(model.state_dict()))
        model.eval()
    
        recall=[]
        for inputs in batchval_data:
            model.hidden = model.init_hidden()
            targets = inputs[2].cuda()
            tag_scores = model(inputs[0].cuda(),inputs[1].cuda() ,wikivec.cuda(),sim)
    
            loss = loss_function(tag_scores, targets)
            
            targets=targets.data.cpu().numpy()
            tag_scores= tag_scores.data.cpu().numpy()
            
            
            for iii in range(0,len(tag_scores)):
                temp={}
                for iiii in range(0,len(tag_scores[iii])):
                    temp[iiii]=tag_scores[iii][iiii]
                temp1=[(k, temp[k]) for k in sorted(temp, key=temp.get, reverse=True)]
                thistop=int(np.sum(targets[iii]))
                hit=0.0
                for ii in temp1[0:max(thistop,topk)]:
                    if targets[iii][ii[0]]==1.0:
                        hit=hit+1
                if thistop!=0:
                    recall.append(hit/thistop)
            
        print ('validation recall @ top--',topk, np.mean(recall))
                
        modelperform.append(np.mean(recall))
        if modelperform[-1]>bestresults:
            bestresults=modelperform[-1]
            bestiter=len(modelperform)-1
            print(f"Update the best model to epoch {bestiter}")
        
        if (len(modelperform)-bestiter)>5:
            print (modelperform,bestiter)
            return modelsaved[bestiter]
    
    print(f"Reach the max epochs, return the best model at epoch {bestiter}")
    return modelsaved[bestiter]
    
model = LSTMattn(batchsize, len(word_to_ix), len(label_to_ix))
model.cuda()
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
print("Train basemodel")
basemodel= trainmodel(model, 0)
torch.save(basemodel, 'LSTMattn_model')

model = LSTMattn(batchsize, len(word_to_ix), len(label_to_ix))
model.cuda()
model.load_state_dict(basemodel)
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
print("")
print("Train model with KSI")
KSImodel= trainmodel(model, 1)
torch.save(KSImodel, 'KSI_LSTMattn_model')

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')   

Train basemodel
epoch = 0
validation recall @ top-- 10 0.3796880833045024
Update the best model to epoch 0
epoch = 1
validation recall @ top-- 10 0.5111196937177896
Update the best model to epoch 1
epoch = 2
validation recall @ top-- 10 0.604538903363407
Update the best model to epoch 2
epoch = 3
validation recall @ top-- 10 0.6585091762268965
Update the best model to epoch 3
epoch = 4
validation recall @ top-- 10 0.6937992263722687
Update the best model to epoch 4
epoch = 5
validation recall @ top-- 10 0.7130799462899884
Update the best model to epoch 5
epoch = 6
validation recall @ top-- 10 0.7296098975521876
Update the best model to epoch 6
epoch = 7
validation recall @ top-- 10 0.7457491858711384
Update the best model to epoch 7
epoch = 8
validation recall @ top-- 10 0.7620013719496009
Update the best model to epoch 8
epoch = 9
validation recall @ top-- 10 0.7701372446019467
Update the best model to epoch 9
epoch = 10
validation recall @ top-- 10 0.778430630381561
Update the best m

In [None]:
start = timeit.default_timer()

######################################################################
# Test the model:

def testmodel(modelstate, sim):
    model = LSTMattn(batchsize, len(word_to_ix), len(label_to_ix))
    model.cuda()
    model.load_state_dict(modelstate)
    loss_function = nn.BCELoss()
    model.eval()
    recall=[]
    lossestest = []
    
    y_true=[]
    y_scores=[]
    
    
    for inputs in batchtest_data:
        model.hidden = model.init_hidden()
        targets = inputs[2].cuda()
        
        tag_scores = model(inputs[0].cuda(),inputs[1].cuda() ,wikivec.cuda(),sim)

        loss = loss_function(tag_scores, targets)
        
        targets=targets.data.cpu().numpy()
        tag_scores= tag_scores.data.cpu().numpy()
        
        lossestest.append(loss.data.cpu().mean())
        y_true.append(targets)
        y_scores.append(tag_scores)
        
        for iii in range(0,len(tag_scores)):
            temp={}
            for iiii in range(0,len(tag_scores[iii])):
                temp[iiii]=tag_scores[iii][iiii]
            temp1=[(k, temp[k]) for k in sorted(temp, key=temp.get, reverse=True)]
            thistop=int(np.sum(targets[iii]))
            hit=0.0
            
            for ii in temp1[0:max(thistop,topk)]:
                if targets[iii][ii[0]]==1.0:
                    hit=hit+1
            if thistop!=0:
                recall.append(hit/thistop)
    y_true=np.concatenate(y_true,axis=0)
    y_scores=np.concatenate(y_scores,axis=0)
    y_true=y_true.T
    y_scores=y_scores.T
    temptrue=[]
    tempscores=[]
    for  col in range(0,len(y_true)):
        if np.sum(y_true[col])!=0:
            temptrue.append(y_true[col])
            tempscores.append(y_scores[col])
    temptrue=np.array(temptrue)
    tempscores=np.array(tempscores)
    y_true=temptrue.T
    y_scores=tempscores.T
    y_pred=(y_scores>0.5).astype(int)
    # print ('test loss', np.mean(lossestest))
    # print ('top-',topk, np.mean(recall))
    # print ('macro AUC', roc_auc_score(y_true, y_scores,average='macro'))
    # print ('micro AUC', roc_auc_score(y_true, y_scores,average='micro'))
    # print ('macro F1', f1_score(y_true, y_pred, average='macro')  )
    # print ('micro F1', f1_score(y_true, y_pred, average='micro')  )
    test_loss = np.mean(lossestest)
    test_recall = np.mean(recall)
    test_mac_auc = roc_auc_score(y_true, y_scores,average='macro')
    test_mic_auc = roc_auc_score(y_true, y_scores,average='micro')
    test_mac_f1 = f1_score(y_true, y_pred, average='macro')
    test_mic_f1 = f1_score(y_true, y_pred, average='micro')
    return test_loss, test_recall, test_mac_auc, test_mic_auc, test_mac_f1, test_mic_f1

print('Test LSTMatt baseline')
lstmatt_loss, lstmatt_recall, lstmatt_mac_auc, lstmatt_mic_auc, lstmatt_mac_f1, lstmatt_mic_f1 = testmodel(basemodel, 0)
print('Test KSI+LSTMatt')
lstmattKSI_loss, lstmattKSI_recall, lstmattKSI_mac_auc, lstmattKSI_mic_auc, lstmattKSI_mac_f1, lstmattKSI_mic_f1 = testmodel(KSImodel, 1)

######################################################################

stop = timeit.default_timer()
duration = str(datetime.timedelta(seconds=round(stop-start)))
duration = duration.split(":")
print(f'Time duration: {duration[0]}h {duration[1]}m {duration[2]}s')   

Test LSTMatt baseline
Test KSI+LSTMatt
Time duration: 0h 00m 34s


In [None]:
result_lstmatt = [['LSTM Attention Baseline', 
                   lstmatt_loss, 
                   lstmatt_recall, 
                   lstmatt_mac_auc, 
                   lstmatt_mic_auc, 
                   lstmatt_mac_f1, 
                   lstmatt_mic_f1], 
        ['LSTM Attention + KSI', 
         lstmattKSI_loss, 
         lstmattKSI_recall, 
         lstmattKSI_mac_auc, 
         lstmattKSI_mic_auc, 
         lstmattKSI_mac_f1, 
         lstmattKSI_mic_f1]]
df_lstmatt = pd.DataFrame(result_lstmatt, columns=['Model', 'Loss', 'Recall@10', 'Macro_AUC', 'Micro_AUC', 'Macro_F1', 'Micro_F1'])
df_lstmatt

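|   | Model | Loss | Recall@10 | Macro_AUC | Micro_AUC | Macro_F1 | Micro_F1 |
|---|-------|------|-----------|-----------|-----------|----------|----------|
| 0 | LSTM Attention Baseline | 0.033961 | 0.79326 | 0.843429 | 0.974453 | 0.249528 | 0.65025 |
| 1 | LSTM Attention + KSI | 0.032939 | 0.796518 | 0.865558 | 0.975679 | 0.258881 | 0.653255 |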


# Comparison

In [None]:
comparison_df = pd.concat([df_caml, df_cnn, df_lstm, df_lstmatt])
comparison_df

|   | Model | Loss | Recall@10 | Macro_AUC | Micro_AUC | Macro_F1 | Micro_F1 |
|---|-------|------|-----------|-----------|-----------|----------|----------|
| 0 | CAML Baseline | 0.03352 | 0.807766 | 0.852964 | 0.978119 | 0.278527 | 0.656805 |
| 1 | CAML + KSI | 0.033304 | 0.808454 | 0.853904 | 0.978277 | 0.279034 | 0.657468 |
| 0 | CNN Baseline | 0.038858 | 0.754856 | 0.831774 | 0.967348 | 0.214947 | 0.627527 |
| 1 | CNN + KSI | 0.038051 | 0.765813 | 0.847441 | 0.970184 | 0.226481 | 0.635694 |
| 0 | LSTM Baseline | 0.034217 | 0.768321 | 0.843087 | 0.970367 | 0.207647 | 0.646738 |
| 1 | LSTM + KSI | 0.033033 | 0.778417 | 0.858165 | 0.972597 | 0.225144 | 0.648586 |
| 0 | LSTM Attention Baseline | 0.033961 | 0.79326 | 0.843429 | 0.974453 | 0.249528 | 0.65025 |
| 1 | LSTM Attention + KSI | 0.032939 | 0.796518 | 0.865558 | 0.975679 | 0.258881 | 0.653255 |
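
To read the comparison more easily, the KSI gain per architecture can be computed directly from the table. The sketch below does this for Macro_AUC, using the values reported above copied by hand into a plain dictionary; the variable names are illustrative and the same pattern would apply to any other column.

```python
# Macro_AUC values copied from the comparison table above.
macro_auc = {
    "CAML": {"baseline": 0.852964, "ksi": 0.853904},
    "CNN": {"baseline": 0.831774, "ksi": 0.847441},
    "LSTM": {"baseline": 0.843087, "ksi": 0.858165},
    "LSTM Attention": {"baseline": 0.843429, "ksi": 0.865558},
}

# Absolute improvement of the KSI variant over its baseline.
gains = {name: round(v["ksi"] - v["baseline"], 6) for name, v in macro_auc.items()}

for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: +{gain:.6f} macro AUC")
```

On this run the gain is largest for LSTM Attention and smallest for CAML. These are single-run numbers, so small differences between models should be interpreted with caution.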


In [None]:
total_stop = timeit.default_timer()
total_duration = str(datetime.timedelta(seconds=round(total_stop - total_start)))
total_duration = total_duration.split(":")
print(f'Total time duration: {total_duration[0]}h {total_duration[1]}m {total_duration[2]}s')   

Total time duration: 2h 43m 28s


In [None]:
drive.flush_and_unmount()