This notebook supports the data preparation for our CS 598 DLH project. The paper we have chosen for the reproducibility project is:
***Ensembling Classical Machine Learning and Deep Learning Approaches for Morbidity Identification from Clinical Notes***

This notebook creates the multiple embedding formats described in the study.

 

The data cannot be shared publicly because of the agreements required to obtain it, so we store it locally and do not commit it to GitHub.

We are only creating embeddings for data that includes stop words.

In [71]:
pip install torchtext

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



In [1]:
import pandas as pd
import numpy as np

DATA_PATH = './obesity_data/'
test_df = pd.read_pickle(DATA_PATH + '/test_df.pkl')
train_df = pd.read_pickle(DATA_PATH + '/train_df.pkl')
test_annot_all_df_clean = pd.read_pickle(DATA_PATH + '/test_annot_all_df_clean.pkl') 
train_annot_all_df_clean = pd.read_pickle(DATA_PATH + '/train_annot_all_df_clean.pkl') 

test_df['word_count'] = test_df['lower_text'].apply(lambda x: len(x.split()))
train_df['word_count'] = train_df['lower_text'].apply(lambda x: len(x.split()))

df_print = pd.DataFrame()
df_print['Min'] = [np.min(test_df['word_count']), np.min(train_df['word_count'])]
df_print['Mean'] = [np.mean(test_df['word_count']), np.mean(train_df['word_count'])]
df_print['Max'] = [np.max(test_df['word_count']), np.max(train_df['word_count'])]
df_print['Std'] = [np.std(test_df['word_count']), np.std(train_df['word_count'])]
df_print['MeanPlusStd'] = round(df_print['Mean'] + df_print['Std'],0)
token_max = int(round(np.max(df_print['MeanPlusStd']),0))

print(df_print)
print('Max Tokens:',token_max)
print('Test:', sum(test_df['word_count'] > token_max), "out of", len(test_df))
print('Train:', sum(train_df['word_count'] > token_max), "out of", len(train_df))


   Min        Mean   Max         Std  MeanPlusStd
0  206  991.907298  3124  437.621770       1430.0
1  113  958.774141  3748  444.819254       1404.0
Max Tokens: 1430
Test: 79 out of 507
Train: 72 out of 611
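
The cutoff above is the mean word count plus one standard deviation, rounded, with the larger of the test and train values used as the limit. As a minimal sketch of that arithmetic for a single set of notes, using only the standard library (`token_threshold` and the word counts are hypothetical, and `pstdev` is used because `np.std` defaults to the population standard deviation):

```python
from statistics import mean, pstdev


def token_threshold(word_counts):
    """Return round(mean + population standard deviation) as an int."""
    return int(round(mean(word_counts) + pstdev(word_counts)))


# Hypothetical word counts for five notes (not taken from the dataset).
example_counts = [100, 200, 300, 400, 500]
threshold = token_threshold(example_counts)

# Number of notes that would need to be split at this threshold.
over_threshold = sum(count > threshold for count in example_counts)
print(threshold, over_threshold)
```

In the notebook the same calculation is run separately for test and train, and the maximum of the two results becomes token_max.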


We split these larger text blocks into two notes of at most token_max tokens each. Note that 4 notes (1 in test and 3 in train) are longer than 2 times token_max tokens. For now we ignore the overflow, but we may want to handle it later (either with a loop or with a left/middle/right split).

In [2]:
test_df_ok = test_df[test_df['word_count'] <= token_max].copy()
test_df_large_right = test_df[test_df['word_count'] > token_max].copy()
test_df_large_left = test_df[test_df['word_count'] > token_max].copy()

train_df_ok = train_df[train_df['word_count'] <= token_max].copy()
train_df_large_right = train_df[train_df['word_count'] > token_max].copy()
train_df_large_left = train_df[train_df['word_count'] > token_max].copy()

# Split each long note into a left part (first token_max words) and a right part (next token_max words),
# then concatenate both with the short notes and recalculate the word counts
test_df_large_left['lower_text'] = test_df_large_left['lower_text'].apply(lambda x: ' '.join(x.split()[:token_max]))
test_df_large_right['lower_text'] = test_df_large_right['lower_text'].apply(lambda x: ' '.join(x.split()[token_max:(2*token_max)]))
train_df_large_left['lower_text'] = train_df_large_left['lower_text'].apply(lambda x: ' '.join(x.split()[:token_max]))
train_df_large_right['lower_text'] = train_df_large_right['lower_text'].apply(lambda x: ' '.join(x.split()[token_max:(2*token_max)]))

test_df_expanded = pd.concat([test_df_ok,test_df_large_right,test_df_large_left])
test_df_expanded['word_count'] = test_df_expanded['lower_text'].apply(lambda x: len(x.split()))
train_df_expanded = pd.concat([train_df_ok,train_df_large_right,train_df_large_left])
train_df_expanded['word_count'] = train_df_expanded['lower_text'].apply(lambda x: len(x.split()))

print('Test:', sum(test_df_expanded['word_count'] > token_max), "out of", len(test_df_expanded))
print('Train:', sum(train_df_expanded['word_count'] > token_max), "out of", len(train_df_expanded))

Test: 0 out of 586
Train: 0 out of 683
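
The loop option mentioned above for notes longer than two chunks could look roughly like the following. This is only a sketch and not what the notebook currently does: `split_into_chunks` is a hypothetical helper that cuts a note into as many consecutive pieces as needed, so no words are dropped.

```python
def split_into_chunks(text, max_tokens):
    """Split whitespace-separated text into consecutive chunks of at most max_tokens words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


# A 7-word note with a limit of 3 words per chunk yields three chunks.
chunks = split_into_chunks("a b c d e f g", max_tokens=3)
print(chunks)
```

Each chunk would then become its own row with the original note id, the same way the left and right parts are handled above.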


We need to encode each note as a sequence of vocabulary indices (stored in the one_hot column) and pad it with the padding token up to token_max.

In [3]:
from typing import Union, Iterable
import torchtext, torch, torch.nn.functional as F
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

corpus = pd.concat([test_df_expanded['lower_text'],train_df_expanded['lower_text']])
tokenizer = get_tokenizer("basic_english")
tokens = [tokenizer(doc) for doc in corpus]

voc = build_vocab_from_iterator(tokens, specials = ['<pad>'])

# Encode tokens as vocabulary indices and pad with the <pad> index to reach token_max
def encode_and_pad(vocab, input_tokens, token_max):
    pad_count = token_max - len(input_tokens)
    result = vocab.lookup_indices(input_tokens)
    if pad_count > 0:
        # Look up the <pad> index instead of assuming it is 0
        result.extend([vocab['<pad>']] * pad_count)
    return result

# Keep the raw tokens and pad with '<pad>' to reach token_max
def token_and_pad(vocab, input_tokens, token_max):
    pad_count = token_max - len(input_tokens)
    result = list(input_tokens)
    if pad_count > 0:
        result.extend(['<pad>'] * pad_count)
    return result

train_df_expanded['one_hot'] = train_df_expanded['lower_text'].apply(lambda x: encode_and_pad(voc, x.split(), token_max))
test_df_expanded['one_hot'] = test_df_expanded['lower_text'].apply(lambda x: encode_and_pad(voc, x.split(), token_max))

train_df_expanded['vector_tokenized'] = train_df_expanded['lower_text'].apply(lambda x: token_and_pad(voc, x.split(), token_max))
test_df_expanded['vector_tokenized'] = test_df_expanded['lower_text'].apply(lambda x: token_and_pad(voc, x.split(), token_max))

print(train_df_expanded.iloc[0]['lower_text'])
print(train_df_expanded.iloc[0]['one_hot'])
print(train_df_expanded.iloc[0]['vector_tokenized'])



emh am discharge summary signed dis admission date report status signed discharge date principle diagnosis coronary artery disease other diagnoses peripheral vascular disease hypertension allergies no known drug allergies history of present illness the patient is a year old male immigrant from tope ri with a long history of angina he had been followed in the o lake jack for years with strong indication for interventional evaluation of his coronary artery disease the patient had refused and had been being treated medically inspite of the angina pattern recently his angina had worsened and he agreed to undergo more intensive workup he was referred for elective cardiac catheterization past medical history hospitalization for an episode of chest pain in s hypertension and history of peripheral vascular disease with claudication symptoms physical examination on physical exam the patients temperature was heart rate heent head and neck exam unremarkable lungs clear anteriorly heart regular ra
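
To make the encode-and-pad step concrete without torchtext, here is a small plain-Python sketch. The `build_vocab` helper and the two toy documents are hypothetical, and indices are assigned in order of first appearance, which differs from the frequency ordering torchtext uses. The truncation to `max_tokens` is an extra safeguard that the notebook does not need because long notes are split beforehand.

```python
PAD_TOKEN = "<pad>"


def build_vocab(token_lists):
    """Map each distinct token to an integer index, reserving 0 for padding."""
    vocab = {PAD_TOKEN: 0}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_and_pad(vocab, tokens, max_tokens):
    """Convert tokens to indices, then right-pad with the padding index."""
    indices = [vocab[token] for token in tokens[:max_tokens]]
    return indices + [vocab[PAD_TOKEN]] * (max_tokens - len(indices))


# Hypothetical toy documents, already tokenized.
docs = [["chest", "pain"], ["no", "chest", "pain", "today"]]
vocab = build_vocab(docs)
encoded = [encode_and_pad(vocab, doc, max_tokens=4) for doc in docs]
print(encoded)
```

Every encoded note has the same length, which is what lets the notes be stacked into a single tensor later.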

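Join the test and train documents with their associated annotations, then verify the record counts. For the original sets the merged count should equal the annotation count; for the expanded sets it is higher because each split note matches its annotations once per part.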

In [4]:
test_with_annot_df = pd.merge(test_annot_all_df_clean,test_df, on='id')
train_with_annot_df = pd.merge(train_annot_all_df_clean,train_df, on='id')


test_with_annot_df_expanded = pd.merge(test_annot_all_df_clean,test_df_expanded,on='id')
train_with_annot_df_expanded = pd.merge(train_annot_all_df_clean,train_df_expanded,on='id')



print("Test:", len(test_annot_all_df_clean), len(test_df), len(test_with_annot_df))
print("Train:", len(train_annot_all_df_clean), len(train_df), len(train_with_annot_df))

print("Test Expanded:", len(test_annot_all_df_clean), len(test_df_expanded), len(test_with_annot_df_expanded))
print("Train Expanded:", len(train_annot_all_df_clean), len(train_df_expanded), len(train_with_annot_df_expanded))



Test: 9641 507 9641
Train: 11273 611 11273
Test Expanded: 9641 586 11190
Train Expanded: 11273 683 12641
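
The jump in the expanded row counts follows from how a merge treats repeated keys: when an id appears twice on the notes side, every annotation for that id is matched twice. A minimal sketch with made-up data (the frames below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Two annotations per note id.
annotations = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "disease": ["Asthma", "CAD", "Asthma", "CAD"],
    "judgment": [True, False, False, True],
})

# Note 2 was split into a left and a right part, so its id appears twice.
notes = pd.DataFrame({
    "id": [1, 2, 2],
    "lower_text": ["short note", "left part", "right part"],
})

merged = pd.merge(annotations, notes, on="id")
print(len(annotations), len(merged))

# validate="many_to_one" makes pandas raise if ids repeat on the notes side.
try:
    pd.merge(annotations, notes, on="id", validate="many_to_one")
    ids_repeat = False
except pd.errors.MergeError:
    ids_repeat = True
print(ids_repeat)
```

Here note 1 contributes 2 rows and note 2 contributes 4, so the merged frame has 6 rows against 4 annotations.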


Try to validate that the numbers are close to those in the original paper. The counts are higher for some reason (and higher again in the expanded set, where each split note matches its annotations once per part), but the percentage occurrence of each disease doesn't change much, so we are good to use the expanded set.

In [5]:
all_df = pd.concat([test_with_annot_df, train_with_annot_df])
all_df_extended = pd.concat([test_with_annot_df_expanded, train_with_annot_df_expanded])

# Per-disease row counts and share of rows with judgment == True, before and after expansion
counts_before = all_df['disease'].value_counts().sort_index()
positives_before = all_df[all_df['judgment']==True]['disease'].value_counts().sort_index()
df_before = pd.concat([counts_before, positives_before/counts_before], axis=1)
counts_after = all_df_extended['disease'].value_counts().sort_index()
positives_after = all_df_extended[all_df_extended['judgment']==True]['disease'].value_counts().sort_index()
df_after = pd.concat([counts_after, positives_after/counts_after], axis=1)
df_all = pd.concat([df_before,df_after], axis=1)

df_all



Disease,Count (original),Positive rate (original),Count (expanded),Positive rate (expanded)
Asthma,1189,0.236333,1343,0.236783
CAD,1653,0.730188,1884,0.72983
CHF,1138,0.783831,1338,0.799701
Depression,1217,0.316352,1391,0.329978
Diabetes,1807,0.811289,2072,0.822394
GERD,1080,0.347222,1217,0.357436
Gallstones,1267,0.272297,1446,0.282158
Gout,1219,0.21493,1389,0.223182
Hypercholesterolemia,1407,0.684435,1599,0.689181
Hypertension,1809,0.885572,2064,0.888081


![Note occurrences](images\note_occurrences.gif)
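
The count and positive-rate columns in the comparison above can also be computed with a single groupby. The sketch below is an alternative formulation on made-up data, not the notebook's code: `disease_summary` is a hypothetical helper, and unlike the `value_counts` division it reports 0.0 rather than NaN for a disease with no positive rows.

```python
import pandas as pd


def disease_summary(df):
    """Return per-disease row counts and the share of rows with judgment == True."""
    return (
        df.assign(is_positive=df["judgment"].eq(True))
        .groupby("disease")["is_positive"]
        .agg(count="size", positive_rate="mean")
        .sort_index()
    )


# Hypothetical annotations: 2 Asthma rows (1 positive), 4 CAD rows (3 positive).
toy = pd.DataFrame({
    "disease": ["Asthma", "Asthma", "CAD", "CAD", "CAD", "CAD"],
    "judgment": [True, False, True, True, True, False],
})
summary = disease_summary(toy)
print(summary)
```

Running the helper on the original and the expanded frames and concatenating the results side by side would give the same four columns as the table.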


Save the final test and train datasets.

In [6]:
test_with_annot_df_expanded.to_pickle(DATA_PATH + '/test.pkl') 
train_with_annot_df_expanded.to_pickle(DATA_PATH + '/train.pkl') 
#corpus.to_pickle(DATA_PATH + '/corpus.pkl')

Testing the GloVe embedding.

In [None]:
vec = torchtext.vocab.GloVe(name='6B', dim=300)

one_hot_test = train_df_expanded.iloc[0]['one_hot']
vector_tokenized_test = train_df_expanded.iloc[0]['vector_tokenized']

# Look up vectors from the index encoding (indices are mapped back to tokens first)
print(one_hot_test)
ret = vec.get_vecs_by_tokens(voc.lookup_tokens(one_hot_test))
print(ret.shape)
#print(ret[0])

# Look up vectors directly from the padded token list
print(vector_tokenized_test)
ret = vec.get_vecs_by_tokens(vector_tokenized_test)
print(ret.shape)
#print(ret[0])




Testing the FastText embedding.

In [None]:

vec = torchtext.vocab.FastText()

one_hot_test = train_df_expanded.iloc[0]['one_hot']

print(one_hot_test)
ret = vec.get_vecs_by_tokens(voc.lookup_tokens(one_hot_test))
print(ret.shape)
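
Both embedding tests do the same thing: turn a padded token list into a matrix with one row per token. The sketch below imitates that lookup with a tiny in-memory table so the resulting shape is easy to see. `lookup_vectors` and the 3-dimensional vectors are hypothetical, and mapping unknown tokens such as `<pad>` to a zero vector is an assumption about how the pretrained lookup behaves by default.

```python
def lookup_vectors(embeddings, tokens, dim):
    """Return one vector per token, using a zero vector for tokens not in the table."""
    zero = [0.0] * dim
    return [embeddings.get(token, zero) for token in tokens]


# Hypothetical 3-dimensional embeddings; the GloVe vectors loaded above have 300 dimensions.
toy_embeddings = {
    "chest": [0.1, 0.2, 0.3],
    "pain": [0.4, 0.5, 0.6],
}

matrix = lookup_vectors(toy_embeddings, ["chest", "pain", "<pad>", "<pad>"], dim=3)
shape = (len(matrix), len(matrix[0]))
print(shape)
```

With the real vectors and a note padded to token_max tokens, the printed shape would be token_max rows by the embedding dimension.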