This notebook supports the data preparation for our CS 598 DLH project.  The paper we have chosen for the reproducibility project is:
***Ensembling Classical Machine Learning and Deep Learning Approaches for Morbidity Identification from Clinical Notes***

This notebook creates the multiple embedding formats described in the study.

The data cannot be shared publicly because of the agreements required to obtain it, so we store the data locally and do not commit it to GitHub.

We create embeddings only for the data that includes stop words.

In [1]:
import pandas as pd
import numpy as np

DATA_PATH = './obesity_data/'
# DATA_PATH already ends with a slash, so the file names are appended directly
test_df = pd.read_pickle(DATA_PATH + 'test_df.pkl')
train_df = pd.read_pickle(DATA_PATH + 'train_df.pkl')

test_df['word_count'] = test_df['lower_text'].apply(lambda x: len(x.split()))
train_df['word_count'] = train_df['lower_text'].apply(lambda x: len(x.split()))

df_print = pd.DataFrame()
df_print['Min'] = [np.min(test_df['word_count']), np.min(train_df['word_count'])]
df_print['Mean'] = [np.mean(test_df['word_count']), np.mean(train_df['word_count'])]
df_print['Max'] = [np.max(test_df['word_count']), np.max(train_df['word_count'])]
df_print['Std'] = [np.std(test_df['word_count']), np.std(train_df['word_count'])]
df_print['MeanPlusStd'] = round(df_print['Mean'] + df_print['Std'],0)
token_max = int(round(np.max(df_print['MeanPlusStd']),0))

print(df_print)
print('Max Tokens:',token_max)
print('Test:', sum(test_df['word_count'] > token_max), "out of", len(test_df))
print('Train:', sum(train_df['word_count'] > token_max), "out of", len(train_df))


   Min        Mean   Max         Std  MeanPlusStd
0  206  991.907298  3124  437.621770       1430.0
1  113  958.774141  3748  444.819254       1404.0
Max Tokens: 1430
Test: 79 out of 507
Train: 72 out of 611


We are going to split each of these longer notes into 2 notes of at most token_max words each.  Note that 4 notes (1 in test and 3 in train) are longer than 2 times token_max words.  For now, we ignore the words beyond that limit, but we may want to add them in later (either by looping over chunks or by using left/middle/right splits).

In [2]:
test_df_ok = test_df[test_df['word_count'] <= token_max].copy()
test_df_large_right = test_df[test_df['word_count'] > token_max].copy()
test_df_large_left = test_df[test_df['word_count'] > token_max].copy()

train_df_ok = train_df[train_df['word_count'] <= token_max].copy()
train_df_large_right = train_df[train_df['word_count'] > token_max].copy()
train_df_large_left = train_df[train_df['word_count'] > token_max].copy()

# Get the left and right words, then concatenate all 3 frames and recalculate the word counts
# Left part: words [0, token_max); right part: words [token_max, 2*token_max)
test_df_large_left['lower_text'] = test_df_large_left['lower_text'].apply(lambda x: ' '.join(x.split()[:token_max]))
test_df_large_right['lower_text'] = test_df_large_right['lower_text'].apply(lambda x: ' '.join(x.split()[token_max:(2*token_max)]))
train_df_large_left['lower_text'] = train_df_large_left['lower_text'].apply(lambda x: ' '.join(x.split()[:token_max]))
train_df_large_right['lower_text'] = train_df_large_right['lower_text'].apply(lambda x: ' '.join(x.split()[token_max:(2*token_max)]))

test_df_expanded = pd.concat([test_df_ok,test_df_large_right,test_df_large_left])
test_df_expanded['word_count'] = test_df_expanded['lower_text'].apply(lambda x: len(x.split()))
train_df_expanded = pd.concat([train_df_ok,train_df_large_right,train_df_large_left])
train_df_expanded['word_count'] = train_df_expanded['lower_text'].apply(lambda x: len(x.split()))

print('Test:', sum(test_df_expanded['word_count'] > token_max), "out of", len(test_df_expanded))
print('Train:', sum(train_df_expanded['word_count'] > token_max), "out of", len(train_df_expanded))

Test: 0 out of 586
Train: 0 out of 683
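
The two-way split above drops any words past 2 times token_max. A minimal sketch of the looped alternative mentioned earlier is shown here. The helper name `split_into_chunks` is hypothetical and not part of the original notebook; it assumes whitespace tokenization, matching how word_count is computed.

```python
def split_into_chunks(text, max_tokens):
    """Split whitespace-delimited text into consecutive chunks of at most max_tokens words."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


# Toy example: 7 words with a limit of 3 gives chunks of 3, 3, and 1 words.
chunks = split_into_chunks("a b c d e f g", 3)
print(chunks)  # ['a b c', 'd e f', 'g']
```

With pandas, one possible way to use this would be to apply the helper to lower_text and then call `DataFrame.explode('lower_text')` so that each chunk becomes its own row. Note that the row counts would then differ from the 586 and 683 printed above for the notes that exceed 2 times token_max.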


![Note occurrences](images/note_occurrences.gif)


We need to encode each note as a sequence of vocabulary indices (the integer form of a one hot encoding, stored in the one_hot column) and pad it with the padding token up to token_max.

In [27]:
from typing import Union, Iterable
import torchtext, torch, torch.nn.functional as F
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

corpus = pd.concat([test_df_expanded['lower_text'],train_df_expanded['lower_text']])
tokenizer = get_tokenizer("basic_english")
tokens = [tokenizer(doc) for doc in corpus]

voc = build_vocab_from_iterator(tokens, specials = ['<pad>'])

# Encode tokens as vocabulary indices and add <pad> to reach token_max
# TODO: decide whether this is also needed for the text without stop words
def encode_and_pad(vocab, input_tokens, token_max):
    pad_count = token_max - len(input_tokens)
    result = vocab.lookup_indices(input_tokens)
    if pad_count > 0:
        # Pad with the index of '<pad>' (0, as the first special) using plain Python ints
        result.extend([vocab['<pad>']] * pad_count)
    return result

train_df_expanded['one_hot'] = train_df_expanded['lower_text'].apply(lambda x: encode_and_pad(voc, x.split(), token_max))
test_df_expanded['one_hot'] = test_df_expanded['lower_text'].apply(lambda x: encode_and_pad(voc, x.split(), token_max))

print(train_df_expanded.iloc[0])



id                                                                  4
text                368346277 | EMH | 64927307 | | 815098 | 3/29/1...
no_punc_text        368346277  EMH  64927307   815098  3291993 120...
no_numerics_text      EMH         AM  Discharge Summary  Signed  D...
lower_text          emh am discharge summary signed dis admission ...
tokenized_text      [emh, am, discharge, summary, signed, dis, adm...
tok_lem_text        [emh, am, discharge, summary, signed, dis, adm...
word_count                                                        413
one_hot             [7206, 73, 18, 126, 123, 138, 26, 53, 108, 36,...
Name: 2, dtype: object
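
The one_hot column above holds integer vocabulary indices rather than literal one hot vectors. The sketch below illustrates how the two relate, using a toy vocabulary. The names `encode_and_pad_toy`, `to_one_hot`, and `toy_vocab` are hypothetical and only for illustration; the notebook itself uses the torchtext vocabulary.

```python
def encode_and_pad_toy(token_to_index, tokens, max_tokens, pad_token="<pad>"):
    """Map tokens to indices, then pad with the pad index up to max_tokens."""
    indices = [token_to_index[tok] for tok in tokens]
    pad_index = token_to_index[pad_token]
    return indices + [pad_index] * (max_tokens - len(indices))


def to_one_hot(indices, vocab_size):
    """Expand each index into a row with a single 1 at that position."""
    return [[1 if col == idx else 0 for col in range(vocab_size)] for idx in indices]


# Toy vocabulary (assumption): '<pad>' takes index 0, as a first special token would.
toy_vocab = {"<pad>": 0, "discharge": 1, "summary": 2, "signed": 3}

encoded = encode_and_pad_toy(toy_vocab, ["discharge", "summary"], max_tokens=4)
print(encoded)  # [1, 2, 0, 0]

one_hot_rows = to_one_hot(encoded, len(toy_vocab))
print(one_hot_rows[0])  # [0, 1, 0, 0]
```

Storing indices is far more compact than storing full one hot rows, since each row would otherwise have one entry per vocabulary word.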


In [29]:
from torchtext.vocab import vocab


vec = torchtext.vocab.GloVe(name='6B', dim=300)

one_hot_test = train_df_expanded.iloc[0]['one_hot']

print(one_hot_test)
ret = vec.get_vecs_by_tokens(voc.lookup_tokens(one_hot_test))
print(ret.shape)

[7206, 73, 18, 126, 123, 138, 26, 53, 108, 36, 123, 18, 53, 5276, 93, 77, 72, 55, 100, 725, 463, 428, 55, 119, 169, 15, 273, 504, 169, 25, 3, 180, 228, 1, 13, 20, 8, 396, 392, 671, 20859, 56, 29809, 3176, 7, 8, 871, 25, 3, 451, 19, 24, 111, 280, 11, 1, 360, 4936, 4478, 10, 380, 7, 3258, 2715, 10, 2840, 398, 3, 22, 77, 72, 55, 1, 13, 24, 1291, 2, 24, 111, 536, 200, 1654, 11662, 3, 1, 451, 2274, 673, 22, 451, 24, 2076, 2, 19, 2147, 5, 2028, 374, 817, 753, 19, 4, 1590, 10, 1477, 66, 202, 97, 86, 25, 417, 10, 30, 498, 3, 41, 27, 11, 60, 119, 2, 25, 3, 463, 428, 55, 7, 2289, 197, 148, 130, 6, 148, 162, 1, 84, 340, 4, 57, 80, 494, 688, 2, 379, 162, 889, 389, 196, 3811, 57, 143, 80, 2, 135, 15, 689, 1475, 239, 288, 452, 215, 15, 114, 24, 30894, 2087, 366, 3, 433, 22, 148, 162, 4, 455, 6, 22, 2768, 26, 5, 1, 66, 157, 395, 63, 3705, 1032, 17, 1477, 66, 202, 313, 130, 22, 26, 313, 162, 4, 1403, 10, 8, 67, 1374, 2, 1736, 534, 162, 22, 150, 63, 66, 202, 657, 2841, 129, 1339, 11, 932, 607, 2, 193, 
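
The cell above maps indices back to tokens and then looks up one 300-dimensional GloVe vector per position. A minimal sketch of that idea with a toy embedding table follows. All names and values here are hypothetical. Returning a zero vector for tokens without an embedding (such as the pad token) is an assumption that mirrors what we understand to be the default behaviour of torchtext vectors for unknown tokens.

```python
def lookup_vectors(indices, index_to_token, embeddings, dim):
    """Return one vector per index; tokens without an embedding get a zero vector."""
    zero = [0.0] * dim
    return [list(embeddings.get(index_to_token[i], zero)) for i in indices]


# Toy data (assumption): a 3-word vocabulary and 3-dimensional embeddings.
index_to_token = ["<pad>", "discharge", "summary"]
toy_embeddings = {"discharge": [0.1, 0.2, 0.3], "summary": [0.4, 0.5, 0.6]}

matrix = lookup_vectors([1, 2, 0, 0], index_to_token, toy_embeddings, dim=3)
print(len(matrix), len(matrix[0]))  # 4 3
```

The padded positions contribute all-zero rows, so every note yields a matrix with the same number of rows regardless of its original length.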

In [21]:
type(train_df_expanded.iloc[0]['one_hot'][0])

int