Title:  Project Workbook Deep Learning ML Models

Authors:  Matthew Lopes and Chris Kabat

This notebook trains the deep learning models that support our CS 598 DLH project. The paper we chose for the reproducibility project is:
***Ensembling Classical Machine Learning and Deep Learning Approaches for Morbidity Identification from Clinical Notes***

Abstract:  The main goal of the paper is to extract morbidities from clinical notes.  The idea is to combine classical and deep learning methods to determine the best approach for classifying each note into one or more of 16 morbidity conditions.  The models use a combination of NLP techniques, including embeddings and bag-of-words representations.  The paper also measures the effect of including stop words.  Lastly, it uses ensemble techniques to tie together several of the classical and deep learning models to provide the most accurate results.

The data cannot be shared publicly because of the agreements required to obtain it, so we store the data locally and do not commit it to GitHub.

We are only training models using data that includes stop words.  

In this workbook, we are taking the following steps:

* Run bag-of-words hyperparameter tuning
* Run bag-of-words DL models across all diseases
* Run embedding hyperparameter tuning
* Run embedding DL models across all diseases

Note: it was very difficult to get CUDA working with the torchtext library because each torchtext release is tied to a specific torch version.  The following installation worked for us (assuming CUDA 11.7):

```
python -m pip uninstall torch

python -m pip install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 torchtext==0.14.1  --index-url https://download.pytorch.org/whl/cu117

python -m pip install torchdata==0.5.1
```
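
After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is ours and not part of any of these packages, and the expected pins are simply copied from the install commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version string, or None if missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Pins copied from the pip commands above.
expected = {
    "torch": "1.13.1",
    "torchvision": "0.14.1",
    "torchaudio": "0.13.1",
    "torchtext": "0.14.1",
    "torchdata": "0.5.1",
}

for name, found in installed_versions(expected).items():
    print(f"{name}: expected {expected[name]}, found {found}")
```

For CUDA wheels the reported string may carry a local build suffix such as `+cu117`, so compare the leading part of the version rather than the whole string.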

First we load the required libraries and retrieve our data.  Note that this notebook can take a very long time to run if you execute hyperparameter tuning and training.  The lines of code that launch those runs are commented out so the user can choose what to run.

In [None]:
import os
import random
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
import datetime
from datetime import timedelta
from tqdm import tqdm
import torchtext
from torch.utils.data import SubsetRandomSampler
from sklearn.model_selection import KFold,train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import tensorflow_hub as hub
from nltk.tokenize import sent_tokenize

# set seed
seed = 24
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)
# define data path
DATA_PATH = './obesity_data/'
RESULTS_PATH = './results/'
MODELS_PATH = './models/'
AOAI_PATH = './aoai/'

# create the output folders if they do not exist yet
os.makedirs(RESULTS_PATH, exist_ok=True)
os.makedirs(MODELS_PATH, exist_ok=True)


all_df = pd.read_pickle(DATA_PATH + '/all_df.pkl') 
all_df_expanded = pd.read_pickle(DATA_PATH + '/all_df_expanded.pkl')
allannot_df= pd.read_pickle(DATA_PATH + '/allannot_df.pkl')
alldocs_df_aoai = pd.read_pickle(AOAI_PATH + '/alldocs_df_aoai.pkl') 
voc = torch.load(DATA_PATH + '/voc.obj')

#This is created in the embeddings file
#max_tokens = 1416
#max_sentences = 380
(max_tokens, max_sentences) = torch.load(DATA_PATH + '/counts.obj')
max_sentences_aoai = 381
oai_col = 'ada_v2_sent'

#Download info for embeddings
word_embedding_size = 300
sentence_embedding_size = 512
aoai_embedding_size = 1536 #2048
use_embeddings = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
fasttext_embeddings = torchtext.vocab.FastText()
glove_embeddings = torchtext.vocab.GloVe(name='6B', dim=word_embedding_size)    

disease_list = all_df['disease'].unique().tolist()
embedding_list = ['GloVe', "FastText",'USE','AOAI']
result_cols = ['Batch','Disease','Embedding','AUROC','F1','F1_MACRO', 'F1_MICRO', 'Exec Time', 'Total Run (secs)','Epochs', 'Dropout', 'BatchSize', 'Hidden', 'LR', 'CV']
result_loss_cols = ['Loss','Epoch','Batch','Disease','Embedding','Epochs', 'Dropout', 'BatchSize', 'Hidden', 'LR']

***Common training and evaluation code***

Note: we tested some learning rate decay techniques, but we did not use them for the final results.

In [None]:
eps=1e-10

def train_model(tmodel, train_dataloader, n_epoch=5, lr=0.003, device=None, model_name='unk', use_decay=False):
    import torch.optim as optim
    
    device = device or torch.device('cpu')

    tmodel.train()

    loss_history = []

    optimizer = optim.Adam(tmodel.parameters(), lr=lr)
    # want to decay the learning rate as the number of epochs gets larger
    #scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma = 0.1)
    if use_decay:
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
            factor=0.1, patience=10, threshold=0.0001, threshold_mode='abs')

    #loss_func = nn.BCELoss()
    loss_func = nn.CrossEntropyLoss()
    #loss_func = nn.NLLLoss()

    for epoch in range(n_epoch):
        epoch = epoch+1
        curr_epoch_loss = []
        start = time.time()

        bs = train_dataloader.batch_size
        hs = tmodel.hidden_size
        do = tmodel.dropout

        for X, Y in tqdm(train_dataloader,desc=f"Training {model_name}-Lr{str(lr)}-Epoch{epoch}of{n_epoch}-BatchSize{bs}-HiddenState{hs}-Dropout{do}..."):
            optimizer.zero_grad()

            y_hat = tmodel(X.to(device))

            loss = loss_func(y_hat, Y.to(device))
            #loss = loss_func(torch.log(y_hat+ eps), Y)
            
            loss.backward()
            optimizer.step()
            if use_decay:
                scheduler.step(loss)
            
            curr_epoch_loss.append(loss.cpu().data.numpy())


        end = time.time()
        if epoch % 10 == 0:
            print(f"epoch{epoch}: curr_epoch_loss={np.mean(curr_epoch_loss)},execution_time={str(datetime.timedelta(seconds = (end-start)))},lr={optimizer.param_groups[0]['lr']}")

        #scheduler.step()
        loss_history += curr_epoch_loss
    return tmodel, loss_history

def eval_model(emodel, dataloader, device=None, model_name='unk'):
    """
    Run the model over every batch in the dataloader.

    :return:
        pred_all: model outputs for the dataloader, as a numpy array.
        Y_test: truth labels, as a numpy array of ints.
    """
    device = device or torch.device('cpu')
    emodel.eval()
    pred_all = []
    Y_test = []
    for X, Y in tqdm(dataloader, desc=f"Evaluating {model_name}..."):
        y_hat = emodel(X.to(device))
        
        pred_all.append(y_hat.cpu().data.numpy())
        Y_test.append(Y.cpu().data.numpy())
        
    pred_all = np.concatenate(pred_all, axis=0)
    Y_test = np.concatenate(Y_test, axis=0)

    return pred_all, Y_test
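
The `use_decay` option in `train_model` relies on PyTorch's `ReduceLROnPlateau`. To make the rule it applies easier to follow, here is a minimal pure-Python sketch of the same idea for `mode='min'` with an absolute threshold. The class name `PlateauDecay` is hypothetical and this is an illustration of the rule, not the PyTorch implementation.

```python
class PlateauDecay:
    """Illustrative reduce-on-plateau rule (minimise loss, absolute threshold)."""

    def __init__(self, lr, factor=0.1, patience=10, threshold=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, loss):
        # A step counts as an improvement only if it beats the best loss by the threshold.
        if loss < self.best - self.threshold:
            self.best = loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        # Shrink the learning rate once more than `patience` steps pass without improvement.
        if self.bad_steps > self.patience:
            self.lr *= self.factor
            self.bad_steps = 0
        return self.lr


decay = PlateauDecay(lr=0.01, patience=2)
for loss in [1.0, 0.5, 0.5, 0.5, 0.5]:
    print(loss, decay.step(loss))
```

Because `train_model` calls `scheduler.step(loss)` once per batch, its `patience=10` counts batches rather than epochs.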

***Common prediction code***

In [None]:
def evaluate_predictions(truth, pred):
    """
    Evaluate the predictions via AUROC and F1 scores.
    Each prediction in pred is a two-element score vector [s_0, s_1] (the raw model outputs).
    AUROC and the plain F1 score treat class 1 as the positive class.
    return: auroc, f1, f1_macro, f1_micro
    """
    from sklearn.metrics import roc_auc_score, f1_score

    auroc = roc_auc_score(truth, pred[:,1])
    f1 = f1_score(truth, np.argmax(pred,axis=1))
    f1_macro = f1_score(truth, np.argmax(pred,axis=1),average='macro')
    f1_micro = f1_score(truth, np.argmax(pred,axis=1),average='micro')

    return auroc, f1, f1_macro, f1_micro
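
The three F1 variants returned above differ only in how the per-class scores are combined. As a rough illustration, the sketch below computes them by hand on a toy label vector; the helper names are ours, and the notebook itself uses `sklearn.metrics.f1_score`.

```python
def f1_for_class(truth, pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(truth, pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(truth, pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(truth, pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_summary(truth, pred):
    classes = sorted(set(truth) | set(pred))
    per_class = [f1_for_class(truth, pred, c) for c in classes]
    # Macro: unweighted mean of the per-class F1 scores.
    macro = sum(per_class) / len(per_class)
    # Micro: pools TP/FP/FN over classes; for single-label data this equals accuracy.
    micro = sum(1 for t, p in zip(truth, pred) if t == p) / len(truth)
    return f1_for_class(truth, pred, 1), macro, micro


# Toy labels for illustration only (not project data).
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 0, 0, 1]
f1, f1_macro, f1_micro = f1_summary(truth, pred)
print(f1, f1_macro, f1_micro)
```

On imbalanced diseases the macro score is pulled down by whichever class is predicted worse, while the micro score tracks overall accuracy.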

***Common training code***

In [None]:
def trainAndEvaluate(train_loader, val_loader, model, model_desc, batch_name, results_file, disease, lr,  dataformat, embedding, device, n_epoch, do, batch_size, hs, cv, use_decay):
            
    return_val = False

    start_train = time.time()
    model, loss_history = train_model(model, train_loader, n_epoch=n_epoch, lr = lr, device=device, model_name=model_desc, use_decay=use_decay)
    end_train = time.time()

    try:
        #Evaluate model
        start_eval = time.time()
        pred, truth = eval_model(model, val_loader, device=device, model_name=model_desc)
        end_eval = time.time()

        auroc, f1, f1_macro, f1_micro = evaluate_predictions(truth, pred)
        runtime = f"Trn,Eval,Ttl={str(datetime.timedelta(seconds = (end_train-start_train)))},{str(datetime.timedelta(seconds = (end_eval-start_eval)))},{str(datetime.timedelta(seconds = (end_eval-start_train)))}"
        runtime_sec = end_eval-start_train

        return_val = True

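    except Exception as e:
        #Record sentinel values so the failed run still appears in the results file
        auroc = -1
        f1=-1
        f1_macro = -1
        f1_micro = -1
        runtime_sec = end_train-start_train
        runtime = 'Failure'
        print(f"Failure! {type(e).__name__}: {e}")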

    results_file_metrics = f"{results_file}.csv"
    results_file_loss = f"{results_file}_loss.csv"

    #Append to results
    if os.path.exists(results_file_metrics):
        results = pd.read_csv(results_file_metrics)
    else:
        results = pd.DataFrame(columns=result_cols)

    result = pd.DataFrame(columns=result_cols,data=[[batch_name, disease,embedding,auroc,f1,f1_macro,f1_micro,runtime,runtime_sec,n_epoch, do, batch_size, hs, lr,str(cv)]])
    results = pd.concat([results,result])

    #Save results - overwrite so we can see progress
    results.to_csv(results_file_metrics, index=False)

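    #write loss_history (Batch,Epoch,Loss)
    #Note: loss_history holds one entry per training batch, so the 'Epoch' column below is a running step index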
    df_loss = pd.DataFrame(loss_history)
    df_loss = df_loss.rename(columns={0:"Loss"})
    df_loss['Epoch'] = df_loss.index + 1
    df_loss['Batch'] = batch_name
    df_loss['Disease'] = disease
    df_loss['Embedding'] = embedding
    df_loss['Epochs'] = n_epoch
    df_loss['Dropout'] = do
    df_loss['BatchSize'] = batch_size
    df_loss['Hidden'] = hs
    df_loss['LR'] = lr


    #Append to results
    if os.path.exists(results_file_loss):
        results = pd.read_csv(results_file_loss)
    else:
        results = pd.DataFrame(columns=result_loss_cols)

    results = pd.concat([results,df_loss])

    #Save results - overwrite so we can see progress
    results.to_csv(results_file_loss, index=False)

    return return_val
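
`trainAndEvaluate` re-reads and rewrites the whole CSV after every run so progress is visible. An append-only alternative is sketched below using the standard `csv` module; `append_result` is a hypothetical helper that the notebook does not use, and the rows are placeholder values rather than real results.

```python
import csv
import os
import tempfile


def append_result(path, row, columns):
    """Append one result row, writing the header only when the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


columns = ["Disease", "Embedding", "AUROC", "F1"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.csv")
    # Placeholder metrics, not real results.
    append_result(path, {"Disease": "Asthma", "Embedding": "GloVe", "AUROC": 0.5, "F1": 0.5}, columns)
    append_result(path, {"Disease": "CAD", "Embedding": "USE", "AUROC": 0.5, "F1": 0.5}, columns)
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))

print(len(rows), rows[0]["Disease"])
```

Appending avoids re-reading a file that grows with every run, at the cost of not validating earlier rows against the current column list.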

****Bag of Word Models****

Here we do a final tokenization step: each note's token list is converted to a single string and stored in the `text_final` column.

In [None]:
for index, entry in enumerate(all_df['word_tokenized']):
    Final_words = []
    #print(entry)
    for word in entry:
        #print(word)
        Final_words.append(word)
    all_df.loc[index, 'text_final'] = str(Final_words)



Here we create a dataset to help load the data and provide a collate function.

In [None]:
class TDFClinicalNotesDataset(Dataset):
    def __init__(self, X_array, y):
        df = pd.DataFrame(index=y.index)
        
        df['tfidf_vector'] = [vector.tolist() for vector in X_array]
        
        self.tfidf_vector = df.tfidf_vector.tolist()
        self.targets = y.tolist()

    def __getitem__(self, i):
        return (self.tfidf_vector[i], self.targets[i])
    
    def __len__(self):
        return len(self.targets)

def collate_fn(batch):
    tfidf = torch.tensor([item[0] for item in batch]).float()
    target = torch.tensor([int(item[1]==True) for item in batch]).long()

    return tfidf, target        

Here we define our PyTorch model.  Note we allow for the number of tokens, the final dropout, and the hidden size to be passed as parameters.  The model itself has 2 bidirectional LSTM layers, a dropout layer, and a fully connected layer.

In [None]:
class ClincalNoteTDFNet(nn.Module):
    def __init__(self, tokens, dropout, hidden_size):
        super(ClincalNoteTDFNet, self).__init__()
        
        self.tokens = tokens
        self.dropout = dropout
        self.hidden_size = hidden_size

        self.hidden_dim1 = self.hidden_size
        self.hidden_dim2 = int(self.hidden_size/2)
        self.num_layers = 1

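        #Because it is bidirectional, the LSTM output is twice the per-direction hidden size, so each direction gets half.
        #Input is 2-D (batch, # of tokens); nn.LSTM (torch 1.11+) treats a 2-D tensor as one unbatched sequence, so the rows of a batch act as its time steps.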
        self.bilstm1 = nn.LSTM(input_size = self.tokens, hidden_size = int(self.hidden_dim1/2), bidirectional = True,  
                               batch_first = True, num_layers = self.num_layers) 
        
        self.bilstm2 = nn.LSTM(input_size = self.hidden_dim1, hidden_size = int(self.hidden_dim2/2), bidirectional = True,  
                               batch_first = True, num_layers=self.num_layers)

        self.do = nn.Dropout(self.dropout)
        self.flatten = nn.Flatten()

        self.fc1 = nn.Linear(self.hidden_dim2, 2)
 
    def forward(self, x):

        x, states = self.bilstm1(x)
        x, states = self.bilstm2(x)
        
        x = self.flatten(x)
        x = self.do(x)
        x = self.fc1(x)

        return x 
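
The layer sizes in the model above all derive from `hidden_size`. The sketch below reproduces that arithmetic so the dimensions can be checked without building the network. The function name is ours, and `tokens=600` and `hidden_size=64` are taken from the final training cell as example inputs (the actual token count is `X_training.shape[1]`, which may be smaller).

```python
def tfidf_net_layer_sizes(tokens, hidden_size):
    """Mirror the size arithmetic of ClincalNoteTDFNet without creating any layers."""
    hidden_dim1 = hidden_size
    hidden_dim2 = int(hidden_size / 2)
    per_dir1 = int(hidden_dim1 / 2)
    per_dir2 = int(hidden_dim2 / 2)
    sizes = {
        # A bidirectional LSTM concatenates both directions, so output = 2 * per-direction size.
        "bilstm1": {"input": tokens, "per_direction": per_dir1, "output": 2 * per_dir1},
        "bilstm2": {"input": hidden_dim1, "per_direction": per_dir2, "output": 2 * per_dir2},
        "fc1": {"input": hidden_dim2, "output": 2},
    }
    # The layers only line up if each output matches the next layer's declared input.
    sizes["consistent"] = (
        sizes["bilstm1"]["output"] == sizes["bilstm2"]["input"]
        and sizes["bilstm2"]["output"] == sizes["fc1"]["input"]
    )
    return sizes


sizes = tfidf_net_layer_sizes(tokens=600, hidden_size=64)
for name, value in sizes.items():
    print(name, value)
```

Because of the integer halving, the sizes stay consistent only when `hidden_size` is divisible by 4, which holds for the values 64 and 128 used in the sweep.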


Here we create a vocabulary that can be used within feature selection.

In [None]:
from keras.preprocessing.text import Tokenizer
import pandas as pd
import numpy as np
from sklearn.feature_selection import RFECV, RFE
from sklearn.tree import ExtraTreeClassifier
from sklearn.feature_selection import SelectKBest, SelectFromModel
from sklearn.feature_selection import f_classif, mutual_info_classif

def getVocab(X_train, y_train, feature, max_tokens):
 
    ## Step 1: Determine the Initial Vocabulary
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(X_train)
    vocab = list(tokenizer.word_index.keys())

    ## Step 2: Create term  matrix
    vectors = tokenizer.texts_to_matrix(X_train, mode='count')

    ## Do feature selection on term matrix (column headers are words)
    X = vectors
    y = y_train

    ##Choose algorithm
    if feature == 'SelectKBest':
        selector = SelectKBest(score_func=f_classif, k=max_tokens).fit(X,y)
    else: 
        if feature == 'InfoGainAttributeVal':
            #This should be similar to the InfoGain?
            selector = SelectKBest(score_func=mutual_info_classif, k=max_tokens).fit(X,y)
        else:
            #default to ExtraTreeClassifier
            estimator = ExtraTreeClassifier(random_state = seed)
            #selector = SelectFromModel(estimator, max_features = tokens,threshold=-np.inf)
            selector = SelectFromModel(estimator, max_features = max_tokens)
            selector = selector.fit(X, y)

    support_idx = selector.get_support(True)
    
    #print("Vocab:", [vocab[i-1].replace("'","") for i in support_idx])
    tokenizer2 = Tokenizer()
    tokenizer2.fit_on_texts([vocab[i-1].replace("'","") for i in support_idx])
    new_vocab = list(tokenizer2.word_index.keys())

    return new_vocab
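
To show what score-based vocabulary selection does, here is a small sketch that ranks terms by the textbook information-gain formula on binary term presence and keeps the top `k`. This is an illustration only: the notebook uses scikit-learn's `SelectKBest` with `f_classif` or `mutual_info_classif` on term counts, which are different estimators, and the toy documents below are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_gain(docs, labels, term):
    """H(label) - H(label | term present or absent)."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    conditional = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - conditional


def top_k_terms(docs, labels, k):
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: info_gain(docs, labels, t), reverse=True)
    return ranked[:k]


# Toy documents (sets of tokens) and judgments, for illustration only.
docs = [
    {"wheezing", "inhaler", "patient"},
    {"inhaler", "patient"},
    {"fracture", "patient"},
    {"patient", "cast"},
]
labels = [True, True, False, False]
print(top_k_terms(docs, labels, k=1))
```

A term that appears in every document, like `patient` here, carries no information about the label and scores zero, which is why it drops out of the selected vocabulary.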



This is a function to train and evaluate the model.

In [None]:
def iterateTrainAndEvaluateTFIDF(df, k, disease_list, feature_list, lr_list, 
                            batch_name, results_file, batch_size, dataformat, device, tokens, epoch_list, do, hs, cv = False, use_decay=False):

    for _,disease in enumerate(disease_list):
        for features_idx,feature in enumerate(feature_list):
            lr = lr_list[features_idx]
            n_epoch = epoch_list[features_idx]

            #Create a name for the model
            model_name = f"{disease}_{feature}_{batch_name}"

            disease_df = df[df['disease'] == disease].copy()

            X_train, X_test, y_train, y_test = train_test_split(disease_df[dataformat], disease_df['judgment'], test_size=0.2, random_state=seed)

            if feature != 'All':
                vocab = getVocab(X_train,y_train, feature, tokens)
                Tfidf_vect = TfidfVectorizer(max_features=tokens,vocabulary = vocab)
            else:
                Tfidf_vect = TfidfVectorizer(max_features=tokens)

            X_train_values_list = Tfidf_vect.fit_transform(X_train).toarray()
            X_training = pd.DataFrame(X_train_values_list, columns=Tfidf_vect.get_feature_names())
            X_training = np.asarray(X_training, dtype=float)
            X_training = torch.from_numpy(X_training).to(device)

            X_test_values_list = Tfidf_vect.transform(X_test).toarray()
            X_testing = pd.DataFrame(X_test_values_list, columns=Tfidf_vect.get_feature_names())
            X_testing = np.asarray(X_testing, dtype=float)
            X_testing = torch.from_numpy(X_testing).to(device)

            tokens_to_use = X_training.shape[1]

            #Create model
            model = ClincalNoteTDFNet(tokens_to_use,do,hs)
            model = model.to(device)

            ds_train = TDFClinicalNotesDataset(X_training, y_train)
            ds_test = TDFClinicalNotesDataset(X_testing, y_test)

            #Load Data 
            train_loader = torch.utils.data.DataLoader(ds_train, batch_size = batch_size, collate_fn=collate_fn)
            val_loader = torch.utils.data.DataLoader(ds_test, batch_size = batch_size,collate_fn=collate_fn)

            model_desc = f"{disease}_{feature}"

            trainAndEvaluate(train_loader, val_loader, model, model_desc, batch_name, results_file, disease, lr, dataformat, feature, device, n_epoch,  do, batch_size, hs, False, use_decay)

            #Save model
            torch.save(model.state_dict(), f'{MODELS_PATH}{model_name}.pkl')

            #Delete model
            del model


This is a function to train and evaluate the model while varying feature selection, learning rate, number of epochs, batch size, dropout, and hidden size to help tune the hyperparameters used.

In [None]:
def iterateTrainAndEvaluateTFIDFHP(df, k, disease_list, feature_list, lr_list, 
                            batch_name, results_file, batch_size_list, dataformat, device, tokens, epoch_list, dropout_list, hs_list, cv = False, use_decay=False):

    for _,disease in enumerate(disease_list):
        for _,feature in enumerate(feature_list):
            for _,lr in enumerate(lr_list):
                for _, n_epoch in enumerate(epoch_list):
                    for _,batch_size in enumerate(batch_size_list):
                        for _, do in enumerate(dropout_list):
                            for _, hs in enumerate(hs_list):
                                #Create a name for the model
                                model_name = f"{disease}_{feature}_{batch_name}"

                                disease_df = df[df['disease'] == disease].copy()

                                X_train, X_test, y_train, y_test = train_test_split(disease_df[dataformat], disease_df['judgment'], test_size=0.2, random_state=seed)

                                if feature != 'All':
                                    vocab = getVocab(X_train,y_train, feature, tokens)
                                    Tfidf_vect = TfidfVectorizer(max_features=tokens,vocabulary = vocab)
                                else:
                                    Tfidf_vect = TfidfVectorizer(max_features=tokens)

                                X_train_values_list = Tfidf_vect.fit_transform(X_train).toarray()
                                X_training = pd.DataFrame(X_train_values_list, columns=Tfidf_vect.get_feature_names())
                                X_training = np.asarray(X_training, dtype=float)
                                X_training = torch.from_numpy(X_training).to(device)

                                X_test_values_list = Tfidf_vect.transform(X_test).toarray()
                                X_testing = pd.DataFrame(X_test_values_list, columns=Tfidf_vect.get_feature_names())
                                X_testing = np.asarray(X_testing, dtype=float)
                                X_testing = torch.from_numpy(X_testing).to(device)

                                tokens_to_use = X_training.shape[1]

                                #Create model
                                model = ClincalNoteTDFNet(tokens_to_use,do,hs)
                                model = model.to(device)

                                ds_train = TDFClinicalNotesDataset(X_training, y_train)
                                ds_test = TDFClinicalNotesDataset(X_testing, y_test)

                                #Load Data 
                                train_loader = torch.utils.data.DataLoader(ds_train, batch_size = batch_size, collate_fn=collate_fn)
                                val_loader = torch.utils.data.DataLoader(ds_test, batch_size = batch_size,collate_fn=collate_fn)

                                model_desc = f"{disease}_{feature}"

                                trainAndEvaluate(train_loader, val_loader, model, model_desc, batch_name, results_file, disease, lr, dataformat, feature, device, n_epoch,  do, batch_size, hs, False, use_decay)
                                #Delete model
                                del model

                

Here is where we run a parameter sweep.

In [None]:
#device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu'
print(f'Using device: {device}')

#Override these if need be
disease_list = ['Asthma', 'CAD', 'CHF', 'Depression', 'Diabetes', 'Gallstones', 'GERD', 'Gout', 'Hypercholesterolemia', 'Hypertension', 'Hypertriglyceridemia', 'OA', 'OSA', 'PVD', 'Venous Insufficiency', 'Obesity']
feature_list = ['All','ExtraTreeClassifier','SelectKBest','InfoGainAttributeVal']


#0.01 seems to be the most effective with no decay
lr_list = [0.01,0.001]
epoch_list = [20,40,60]
dropout_list = [0, 0.1, 0.5]
batch_size_list = [32,64]
hs_list = [64,128]  #can't go above 128; 256 worked, but pushed memory to the limit

#training parameters
k = 2

#These should not change
dataformat = 'text_final'
tokens = 600

results_file = f'{RESULTS_PATH}DL_tfidf_results'
result_time = datetime.datetime.now()
result_name = result_time.strftime("%Y-%m-%d-%H-%M-%S")

descriptor = 'HP'
batch_name = f'DL_tfidf_{descriptor}_{result_name}'

#commented out because working on Embeddings
#iterateTrainAndEvaluateTFIDFHP(all_df, k, disease_list, feature_list, lr_list, batch_name, results_file, batch_size_list, dataformat, device, tokens, epoch_list, dropout_list, hs_list, False, False)
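
Before launching the sweep it is worth knowing how many runs it implies. The sketch below flattens the seven nested loops of `iterateTrainAndEvaluateTFIDFHP` with `itertools.product`, using the list values from the cell above and keeping the same loop order; it only counts combinations and does not train anything.

```python
from itertools import product

disease_count = 16  # length of disease_list in the sweep cell
feature_list = ['All', 'ExtraTreeClassifier', 'SelectKBest', 'InfoGainAttributeVal']
lr_list = [0.01, 0.001]
epoch_list = [20, 40, 60]
batch_size_list = [32, 64]
dropout_list = [0, 0.1, 0.5]
hs_list = [64, 128]

# Same nesting order as the function: disease, feature, lr, epochs, batch size, dropout, hidden size.
grid = list(product(range(disease_count), feature_list, lr_list, epoch_list,
                    batch_size_list, dropout_list, hs_list))

total_runs = len(grid)
total_epochs = sum(combo[3] for combo in grid)
print(total_runs, total_epochs)
```

With these lists the sweep comes to 4,608 training runs and 184,320 epochs in total, which is why the call is left commented out by default.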


Here is where we train the final models:

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
#device = 'cpu'
print(f'Using device: {device}')

#Override these if need be
disease_list = ['Asthma', 'CAD', 'CHF', 'Depression', 'Diabetes', 'Gallstones', 'GERD', 'Gout', 'Hypercholesterolemia', 'Hypertension', 'Hypertriglyceridemia', 'OA', 'OSA', 'PVD', 'Venous Insufficiency', 'Obesity']
#disease_list = ['Asthma']
feature_list = ['All','InfoGainAttributeVal','ExtraTreeClassifier','SelectKBest']

#0.01 seems to be the most effective with no decay
lr_list = [0.001,0.01,0.01,0.01]
epoch_list = [20,40,60,40]
dropout_list = [0, 0.1, 0.5]


#training parameters
k = 2
do = .1
batch_size = 32
hs = 64

#These should not change
dataformat = 'text_final'
tokens = 600

results_file = f'{RESULTS_PATH}DL_tfidf_results'
result_time = datetime.datetime.now()
result_name = result_time.strftime("%Y-%m-%d-%H-%M-%S")

descriptor = 'All'
batch_name = f'DL_tfidf_{descriptor}_{result_name}'

#commented out because working on Embeddings
#iterateTrainAndEvaluateTFIDF(all_df, k, disease_list, feature_list, lr_list, batch_name, results_file, batch_size, dataformat, device, tokens, epoch_list, do, hs, False, False)
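
`iterateTrainAndEvaluateTFIDF` picks the learning rate and epoch count by position (`lr_list[features_idx]`, `epoch_list[features_idx]`), so the three lists must stay aligned. A minimal sketch of that pairing, using the values from the cell above, makes the mapping explicit; the dictionary name `per_feature` is ours.

```python
feature_list = ['All', 'InfoGainAttributeVal', 'ExtraTreeClassifier', 'SelectKBest']
lr_list = [0.001, 0.01, 0.01, 0.01]
epoch_list = [20, 40, 60, 40]

# Positional pairing silently breaks if the lists drift apart, so check lengths first.
if not (len(feature_list) == len(lr_list) == len(epoch_list)):
    raise ValueError("feature_list, lr_list and epoch_list must have the same length")

per_feature = {
    feature: {"lr": lr, "n_epoch": n_epoch}
    for feature, lr, n_epoch in zip(feature_list, lr_list, epoch_list)
}

for feature, settings in per_feature.items():
    print(feature, settings)
```

Note that `feature_list` is ordered differently here than in the sweep cell, so reordering it without reordering `lr_list` and `epoch_list` would change which settings each feature selector gets.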


****DL Model using word embeddings****

First we start by creating a dataset.  Note this will have to take the disease as part of the init and filter just for those records.

In [None]:

class ClinicalNoteDataset(Dataset):

    def __init__(self, dataframe, disease, dataformat):
        """
        Init the Dataset instance. dataformat is the name of the dataframe column to use, e.g. 'vector_tokenized' or 'one_hot'.
        """
        self.disease = disease
        self.dataformat = dataformat

        if(self.dataformat == oai_col):
            #overriding the dataframe and merging in the annotations
            dataframe = pd.merge(allannot_df,alldocs_df_aoai, on='id')

        self.df = dataframe[dataframe['disease'] == disease].copy()
        self.df = self.df.reset_index()




    def __len__(self):
        """
        TODO: Denotes the total number of samples
        """
        return len(self.df)

    def __getitem__(self, i):
        """
        TODO: Generates one sample of data
            return X, y for the i-th data.
        """
        #Cannot make tensors yet, will need to happen in collate
        Y = self.df.iloc[i]['judgment']
        X = self.df.iloc[i][self.dataformat]

        return X,Y
        
def vectorize_batch_words(batch):
    embedding_size_used = 300
 
    Xi, Yi = batch[0]
    batch_size = len(batch)

    X = torch.zeros(batch_size, len(Xi), dtype=torch.long)
    Y = torch.zeros((batch_size), dtype=torch.long)
    
    for i in range(len(batch)):
        x, y = batch[i]
        vectors = voc.lookup_indices(x)

        X[i] = torch.tensor(vectors).long()
        Y[i] = torch.tensor(float(y == True))

    return X,Y

def vectorize_batch_USE(batch):
    embedding_size_used = 512

    Xi, Yi = batch[0]
    batch_size = len(batch)

    X = torch.zeros(batch_size, len(Xi), embedding_size_used, dtype=torch.float)
    Y = torch.zeros((batch_size), dtype=torch.long)

    for i in range(len(batch)):
        x, y = batch[i]
        
        tensor_flow_vectors = use_embeddings(x)
        array_vectors = tensor_flow_vectors.numpy()

        X[i] = torch.tensor(array_vectors).float()
        Y[i] = torch.tensor(float(y == True))

    return X,Y 

def vectorize_batch_AOAI(batch):
    embedding_size_used = aoai_embedding_size

    Xi, Yi = batch[0]
    batch_size = len(batch)

    X = torch.zeros(batch_size, Xi.shape[0], embedding_size_used, dtype=torch.float)
    Y = torch.zeros((batch_size), dtype=torch.long)

    for i in range(len(batch)):
        x, y = batch[i]

        #all the work was done and stored in another notebook
        
        X[i] = torch.tensor(x).float()
        Y[i] = torch.tensor(float(y == True))

    return X,Y 
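
Each collate function above sizes the batch tensor from the first item, which only works if every note has already been padded or truncated to the same length. We assume that happens in the separate embeddings notebook. A minimal sketch of such a padding step follows; the helper name and the `<pad>` token are assumptions for illustration, not the values used in our preprocessing.

```python
def pad_tokens(tokens, length, pad_token="<pad>"):
    """Truncate or right-pad a token list to exactly `length` items."""
    clipped = list(tokens[:length])
    return clipped + [pad_token] * (length - len(clipped))


# Toy notes for illustration only.
notes = [
    ["chest", "pain"],
    ["no", "acute", "distress", "noted"],
    ["stable"],
]
max_len = max(len(note) for note in notes)
padded = [pad_tokens(note, max_len) for note in notes]

for row in padded:
    print(row)
```

Once every row has the same length, a batch can be copied into a preallocated `(batch_size, length)` tensor exactly as the collate functions do.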
          


The following code prepares the weights matrix for the nn.Embedding object.

In [None]:
matrix_len = len(voc)
glove_weights_matrix = np.zeros((matrix_len, word_embedding_size))
fasttext_weights_matrix = np.zeros((matrix_len, word_embedding_size))

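#GloVe
#Note: range(matrix_len) covers every vocabulary index, including the last one.
#With torchtext's default unk_init, get_vecs_by_tokens returns a zero vector for unknown tokens
#instead of raising KeyError, so the random fallback below may never run.
for i in range(matrix_len):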
    word = voc.lookup_token(i)
    try: 
        glove_weights_matrix[i] = glove_embeddings.get_vecs_by_tokens(word)
    except KeyError:
        glove_weights_matrix[i] = np.random.normal(scale=0.6, size=(word_embedding_size, ))
#FastText
for i in range(matrix_len):
    word = voc.lookup_token(i)
    try: 
        fasttext_weights_matrix[i] = fasttext_embeddings.get_vecs_by_tokens(word)
    except KeyError:
        fasttext_weights_matrix[i] = np.random.normal(scale=0.6, size=(word_embedding_size, ))
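
The idea behind the weights matrix is one row per vocabulary index, filled from the pretrained vectors where available and from random noise otherwise. The sketch below shows that logic with plain Python lists and a dictionary standing in for the pretrained vectors; `build_weight_matrix` and the tiny three-dimensional vectors are hypothetical and only illustrate the lookup with fallback.

```python
import random


def build_weight_matrix(index_to_token, pretrained, dim, scale=0.6, seed=24):
    """Return one row per vocabulary index: pretrained vector if known, else Gaussian noise."""
    rng = random.Random(seed)
    matrix = []
    for token in index_to_token:
        vector = pretrained.get(token)
        if vector is None:
            # Out-of-vocabulary token: draw a random vector with the given standard deviation.
            vector = [rng.gauss(0.0, scale) for _ in range(dim)]
        matrix.append(list(vector))
    return matrix


# Toy vocabulary and vectors, for illustration only.
index_to_token = ["<unk>", "pain", "asthma"]
pretrained = {"pain": [0.1, 0.2, 0.3], "asthma": [0.4, 0.5, 0.6]}
weights = build_weight_matrix(index_to_token, pretrained, dim=3)

for token, row in zip(index_to_token, weights):
    print(token, row)
```

Checking membership explicitly, rather than catching an exception, keeps the fallback working even when the vector lookup returns a default value for unknown tokens.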



Here we define our PyTorch model.  Note we allow for the embedding type, number of tokens, the final dropout, and the hidden size to be passed as parameters.  The model itself has 2 bidirectional LSTM layers, a dropout layer, and a fully connected layer.  For the word embeddings we use an `nn.Embedding` layer, which really increases performance.

In [None]:
class ClincalNoteEmbeddingNet(nn.Module):
    def __init__(self, embedding_type, max_tokens, dropout, hidden_size):
        super(ClincalNoteEmbeddingNet, self).__init__()
        
        self.max_tokens = max_tokens
        self.dropout = dropout
        self.hidden_size = hidden_size

        if(embedding_type == 'USE'):
            self.embedding_dimension = sentence_embedding_size
            self.em = None
        else:
            if embedding_type == 'AOAI':
                self.embedding_dimension = aoai_embedding_size
                self.em = None                                        
            else:
                self.embedding_dimension = word_embedding_size
                if(embedding_type == 'GloVe'):
                    self.em = nn.Embedding.from_pretrained(torch.tensor(glove_weights_matrix).float(), freeze=False)
                else:
                    self.em = nn.Embedding.from_pretrained(torch.tensor(fasttext_weights_matrix).float(), freeze=False)

        self.hidden_dim1 = self.hidden_size
        self.hidden_dim2 = int(self.hidden_size/2)
        self.num_layers = 1

        #Because it is bidirectional, the LSTM output is twice the per-direction hidden size, so each direction gets half.
        #input is (batch, # of tokens, embedding_dimension)
        self.bilstm1 = nn.LSTM(input_size = self.embedding_dimension, hidden_size = int(self.hidden_dim1/2), bidirectional = True,  
                               batch_first = True, num_layers = self.num_layers) 
        
        self.bilstm2 = nn.LSTM(input_size = self.hidden_dim1, hidden_size = int(self.hidden_dim2/2), bidirectional = True,  
                               batch_first = True, num_layers=self.num_layers)

        self.do = nn.Dropout(self.dropout)
        self.flatten = nn.Flatten()

        self.fc1 = nn.Linear(self.hidden_dim2 * self.max_tokens, 2)


    def forward(self, x):
        #using an embedding layer instead of just vectors
        if self.em is not None:
            x = self.em(x)  

        x, states = self.bilstm1(x)
        x, states = self.bilstm2(x)

        x = self.flatten(x)
        x = self.do(x)
        x = self.fc1(x)

        return x 
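
Because the output is flattened across all time steps, the final linear layer grows with the sequence length. The sketch below estimates the trainable parameters of the LSTM and linear layers, assuming PyTorch's `nn.LSTM` layout of four gates with two bias vectors per direction. The inputs are assumptions for illustration: `embedding_dim=300` for the word embeddings, `max_tokens=1416` from the commented value in the setup cell, and `hidden_size=64`. The embedding table itself (vocabulary size times 300, trainable because `freeze=False`) is not included.

```python
def lstm_params(input_size, hidden_size, bidirectional=True):
    """Parameter count of a single-layer nn.LSTM."""
    # Per direction: weight_ih (4h x in), weight_hh (4h x h), bias_ih (4h), bias_hh (4h).
    per_direction = 4 * hidden_size * (input_size + hidden_size + 2)
    return per_direction * (2 if bidirectional else 1)


def embedding_net_params(embedding_dim, max_tokens, hidden_size):
    """Mirror the layer sizes of ClincalNoteEmbeddingNet, excluding the embedding table."""
    hidden_dim1 = hidden_size
    hidden_dim2 = int(hidden_size / 2)
    return {
        "bilstm1": lstm_params(embedding_dim, int(hidden_dim1 / 2)),
        "bilstm2": lstm_params(hidden_dim1, int(hidden_dim2 / 2)),
        # Linear layer over the flattened sequence: weights plus 2 biases.
        "fc1": hidden_dim2 * max_tokens * 2 + 2,
    }


counts = embedding_net_params(embedding_dim=300, max_tokens=1416, hidden_size=64)
print(counts, sum(counts.values()))
```

Under these assumptions the fully connected layer holds about as many parameters as the two LSTM layers combined, which is one reason longer inputs make the model heavier.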



Here is a function to do a parameter sweep.

In [None]:
def iterateTrainAndEvaluateHP(df, k, disease_list, embedding_list, lr_list, 
                            batch_name, results_file, device, epoch_list, dropout_list, batch_size_list, hs_list, use_decay = False):

    for _,disease in enumerate(disease_list):
        for _,embedding in enumerate(embedding_list):
            for _,lr in enumerate(lr_list):
                for _, n_epoch in enumerate(epoch_list):
                    for _,batch_size in enumerate(batch_size_list):
                        for _, do in enumerate(dropout_list):
                            for _, hs in enumerate(hs_list):
                                #Create a name for the model
                                model_name = f"{disease}_{embedding}_{batch_name}"

                                #Create model
                                if embedding == 'USE':
                                    model_tokens = max_sentences
                                else:
                                    if embedding == 'AOAI':
                                        model_tokens = max_sentences_aoai
                                    else:
                                        model_tokens = max_tokens
                                    
                                model = ClincalNoteEmbeddingNet(embedding, max_tokens = model_tokens, dropout = do, hidden_size = hs)
                                model = model.to(device)

                                if embedding == 'GloVe':
                                    custom_collate=vectorize_batch_words
                                    dataformat = 'vector_tokenized'
                                if embedding == 'FastText':
                                    custom_collate=vectorize_batch_words
                                    dataformat = 'vector_tokenized'
                                if embedding == 'USE':
                                    custom_collate=vectorize_batch_USE
                                    dataformat = 'sentence_tokenized'
                                if embedding == 'AOAI':
                                    custom_collate=vectorize_batch_AOAI
                                    dataformat = oai_col

                                ds = ClinicalNoteDataset(df, disease, dataformat)
                                ds_train, ds_test = train_test_split(ds, test_size=0.20, shuffle=True, random_state = seed)

                                #Load Data 
                                train_loader = torch.utils.data.DataLoader(ds_train, batch_size = batch_size, collate_fn=custom_collate)
                                val_loader = torch.utils.data.DataLoader(ds_test, batch_size = batch_size, collate_fn=custom_collate)
                                
                                model_desc = f"{disease}_{embedding}"

                                trainAndEvaluate(train_loader, val_loader, model, model_desc, batch_name, results_file, disease, lr, dataformat, embedding, device, n_epoch,  do, batch_size, hs, False, use_decay)

                                #Save model - don't need to save for hyper parameter tuning
                                #torch.save(model.state_dict(), f'{MODELS_PATH}{model_name}.pkl')

                                #Delete model
                                del model
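
The three innermost loops above visit every combination of batch size, dropout, and hidden size. As a minimal sketch (not the notebook's actual code), the same combinations can be generated with `itertools.product`, which keeps the nesting shallow. The helper name `iter_hp_grid` and the example values are assumptions for illustration.

```python
from itertools import product


def iter_hp_grid(batch_size_list, dropout_list, hs_list):
    """Yield (batch_size, dropout, hidden_size) tuples.

    Hypothetical helper: product() varies its last argument fastest,
    so the order matches three nested for loops.
    """
    yield from product(batch_size_list, dropout_list, hs_list)


# Illustrative values only
grid = list(iter_hp_grid([32, 64], [0, 0.1], [64, 128]))
print(len(grid))          # number of combinations
print(grid[0], grid[-1])  # first and last combination
```

Each tuple can then be unpacked in a single loop, for example `for batch_size, do, hs in iter_hp_grid(...)`.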



The following function trains and saves one model per disease and embedding, and optionally runs k-fold cross-validation afterwards.

In [None]:
def iterateTrainAndEvaluate(df, k, disease_list, embedding_list, lr_list, 
                            batch_name, results_file, device, epoch_list, do, batch_size, hs, cv = False, use_decay = False):

    for disease in disease_list:
        for embedding_idx,embedding in enumerate(embedding_list):
            lr = lr_list[embedding_idx]
            n_epoch = epoch_list[embedding_idx]
            #Create a name for the model
            model_name = f"{disease}_{embedding}_{batch_name}"

            #Create model
            if embedding == 'USE':
                model_tokens = max_sentences
            elif embedding == 'AOAI':
                model_tokens = max_sentences_aoai
            else:
                model_tokens = max_tokens
                
            model = ClincalNoteEmbeddingNet(embedding, max_tokens = model_tokens, dropout = do, hidden_size=hs)
            model = model.to(device)

            #Pick the collate function and data column for this embedding
            if embedding in ('GloVe', 'FastText'):
                custom_collate=vectorize_batch_words
                dataformat = 'vector_tokenized'
            elif embedding == 'USE':
                custom_collate=vectorize_batch_USE
                dataformat = 'sentence_tokenized'
            elif embedding == 'AOAI':
                custom_collate=vectorize_batch_AOAI
                dataformat = oai_col

            ds = ClinicalNoteDataset(df, disease, dataformat)
            ds_train, ds_test = train_test_split(ds, test_size=0.20, shuffle=True, random_state = seed)

            #Load Data 
            train_loader = torch.utils.data.DataLoader(ds_train, batch_size = batch_size, collate_fn=custom_collate)
            val_loader = torch.utils.data.DataLoader(ds_test, batch_size = batch_size, collate_fn=custom_collate)
            
            model_desc = f"{disease}_{embedding}"

            trainAndEvaluate(train_loader, val_loader, model, model_desc, batch_name, results_file, disease, lr, dataformat, embedding, device, n_epoch, do, batch_size, hs, False, use_decay)

            #Save model
            torch.save(model.state_dict(), f'{MODELS_PATH}{model_name}.pkl')

            #Delete model
            del model

            if cv:
                #note, cross validation is only used to validate the model works consistently
                splits=KFold(n_splits=k,shuffle=True,random_state=seed)

                for fold, (train_idx,val_idx) in enumerate(splits.split(np.arange(len(ds)))):
                    #for now, let's keep the results at the fold level
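                    #use the same token limit, dropout, and hidden size as the main model
                    model = ClincalNoteEmbeddingNet(embedding, max_tokens = model_tokens, dropout = do, hidden_size = hs)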
                    model = model.to(device)
                    
                    train_sampler = SubsetRandomSampler(train_idx)
                    val_sampler = SubsetRandomSampler(val_idx)
                    #Load Data 
                    train_loader = torch.utils.data.DataLoader(ds, batch_size = batch_size, sampler=train_sampler, collate_fn=custom_collate)
                    val_loader = torch.utils.data.DataLoader(ds, batch_size = batch_size, sampler=val_sampler, collate_fn=custom_collate)
                    
                    model_desc = f"{disease}_{embedding}_Fold{fold+1}"

                    trainAndEvaluate(train_loader, val_loader, model, model_desc, batch_name, results_file, disease, lr, dataformat, embedding, device, n_epoch, do, batch_size, hs, cv, use_decay)

                    del model
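
Both functions above repeat the same branching from embedding name to settings. One way to keep that mapping in a single place is a lookup table. This is a sketch under assumptions: `EmbeddingConfig`, `EMBEDDING_CONFIGS`, and `get_embedding_config` are hypothetical names, and the notebook's collate functions and variables (`oai_col`, `max_tokens`, `max_sentences`, `max_sentences_aoai`) are referenced by name as strings so the block runs on its own.

```python
from typing import NamedTuple


class EmbeddingConfig(NamedTuple):
    collate_name: str      # name of the notebook's collate function
    dataformat: str        # DataFrame column passed to the dataset
    token_limit_name: str  # name of the notebook variable holding the length limit


# Hypothetical table mirroring the if/else branches in the functions above.
# 'oai_col' is a placeholder for the value of the notebook's oai_col variable.
EMBEDDING_CONFIGS = {
    'GloVe':    EmbeddingConfig('vectorize_batch_words', 'vector_tokenized', 'max_tokens'),
    'FastText': EmbeddingConfig('vectorize_batch_words', 'vector_tokenized', 'max_tokens'),
    'USE':      EmbeddingConfig('vectorize_batch_USE', 'sentence_tokenized', 'max_sentences'),
    'AOAI':     EmbeddingConfig('vectorize_batch_AOAI', 'oai_col', 'max_sentences_aoai'),
}


def get_embedding_config(embedding):
    """Return the settings for an embedding, or raise on an unknown name."""
    try:
        return EMBEDDING_CONFIGS[embedding]
    except KeyError:
        raise ValueError(f"Unknown embedding: {embedding!r}") from None


cfg = get_embedding_config('USE')
print(cfg.collate_name, cfg.dataformat, cfg.token_limit_name)
```

With a table like this, a misspelled embedding name fails immediately instead of leaving `custom_collate` and `dataformat` unset or carried over from the previous loop iteration.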

The next cell configures the hyperparameter sweep. The call that starts it is commented out so it only runs when explicitly enabled.

In [None]:
#device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cpu'
print(f'Using device: {device}')

#Override these if need be
#disease_list = ['Asthma', 'CAD', 'CHF', 'Depression', 'Diabetes', 'Gallstones', 'GERD', 'Gout', 'Hypercholesterolemia', 'Hypertension', 'Hypertriglyceridemia', 'OA', 'OSA', 'PVD', 'Venous Insufficiency', 'Obesity']
disease_list = ['Asthma']
#embedding_list = ['USE','GloVe','FastText']
embedding_list = ['AOAI']
#epoch_list = [15,25,35]
epoch_list = [25,50, 75]
#0.01 seems to be the most effective, although FastText prefers 0.001
lr_list = [0.01,0.001]
dropout_list = [0, 0.1]
batch_size_list = [32,64] #avoid going above 128; 256 worked, but pushed memory to the limit
hs_list = [64,128]


results_file = f'{RESULTS_PATH}DL_embedding_results'

#training parameters
k = 2

#These should not change

result_time = datetime.datetime.now()
result_name = result_time.strftime("%Y-%m-%d-%H-%M-%S")
descriptor = 'HP_AOAI'
batch_name = f'DL_er_{descriptor}_{result_name}'

#iterateTrainAndEvaluateHP(all_df_expanded, k, disease_list, embedding_list, lr_list, batch_name, results_file, device, epoch_list, dropout_list, batch_size_list, hs_list, False)
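
Because a sweep can run for a long time, it can help to estimate the number of training runs before enabling the call. The sketch below assumes that every list in the cell is swept as an independent axis (a full grid). That assumption should be checked against the loop structure of `iterateTrainAndEvaluateHP`. The dictionary `sweep_axes` is a hypothetical name used only here.

```python
from math import prod

# Values copied from the sweep cell above; the axis labels are illustrative.
sweep_axes = {
    'disease': ['Asthma'],
    'embedding': ['AOAI'],
    'n_epoch': [25, 50, 75],
    'lr': [0.01, 0.001],
    'dropout': [0, 0.1],
    'batch_size': [32, 64],
    'hidden_size': [64, 128],
}

# Under the full-grid assumption, the run count is the product of the axis lengths.
n_runs = prod(len(values) for values in sweep_axes.values())
print(n_runs)  # 48 under the full-grid assumption
```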



Here is where we train the models for each disease and embedding.

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
#device = 'cpu'
print(f'Using device: {device}')

#Override these if need be
disease_list = ['Asthma', 'CAD', 'CHF', 'Depression', 'Diabetes', 'Gallstones', 'GERD', 'Gout', 'Hypercholesterolemia', 'Hypertension', 'Hypertriglyceridemia', 'OA', 'OSA', 'PVD', 'Venous Insufficiency', 'Obesity']
embedding_list = ['AOAI','GloVe','FastText','USE']
epoch_list = [50,25,25,50]
lr_list = [0.01,0.01,0.001,0.01]

results_file = f'{RESULTS_PATH}DL_embedding_results'

#training parameters
batch_size = 32
k = 2
dropout = 0.1
hs = 128

#These should not change

result_time = datetime.datetime.now()
result_name = result_time.strftime("%Y-%m-%d-%H-%M-%S")
descriptor = 'All_HSChange64'
batch_name = f'DL_er_{descriptor}_{result_name}'
#iterateTrainAndEvaluate(all_df_expanded, k, disease_list, embedding_list, lr_list, batch_name, results_file, device, epoch_list, dropout, batch_size, hs, False, False)
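
`iterateTrainAndEvaluate` looks up the learning rate and epoch count by the embedding's position in `embedding_list`, so the three lists must line up. A small check before a long run can catch a mismatch early. The sketch below is an illustration only; `build_schedule` is a hypothetical helper, not part of the notebook.

```python
def build_schedule(embedding_list, lr_list, epoch_list):
    """Pair each embedding with its learning rate and epoch count, by position."""
    if not (len(embedding_list) == len(lr_list) == len(epoch_list)):
        raise ValueError(
            f"List lengths differ: {len(embedding_list)} embeddings, "
            f"{len(lr_list)} learning rates, {len(epoch_list)} epoch counts"
        )
    return {
        emb: {'lr': lr, 'n_epoch': n_epoch}
        for emb, lr, n_epoch in zip(embedding_list, lr_list, epoch_list)
    }


# Values copied from the training cell above
schedule = build_schedule(
    ['AOAI', 'GloVe', 'FastText', 'USE'],
    [0.01, 0.01, 0.001, 0.01],
    [50, 25, 25, 50],
)
print(schedule['FastText'])  # {'lr': 0.001, 'n_epoch': 25}
```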