This notebook supports the data preparation for our CS 598 DLH project. The paper we have chosen for the reproducibility project is:
***Ensembling Classical Machine Learning and Deep Learning Approaches for Morbidity Identification from Clinical Notes***

This notebook creates the multiple embedding formats described in the study.

The data cannot be shared publicly because of the agreements required to obtain it, so we store it locally and do not commit it to GitHub.

We are only creating embeddings for data that includes stop words.

In [4]:
pip install torchtext

Defaulting to user installation because normal site-packages is not writeable
Collecting torchtext
  Downloading torchtext-0.15.1-cp39-cp39-win_amd64.whl (1.9 MB)
     ---------------------------------------- 1.9/1.9 MB 11.2 MB/s eta 0:00:00
Collecting torch==2.0.0
  Downloading torch-2.0.0-cp39-cp39-win_amd64.whl (172.3 MB)
     -------------------------------------- 172.3/172.3 MB 6.6 MB/s eta 0:00:00
Collecting torchdata==0.6.0
  Downloading torchdata-0.6.0-cp39-cp39-win_amd64.whl (1.3 MB)
     ---------------------------------------- 1.3/1.3 MB 20.7 MB/s eta 0:00:00
Installing collected packages: torch, torchdata, torchtext
Successfully installed torch-2.0.0 torchdata-0.6.0 torchtext-0.15.1
Note: you may need to restart the kernel to use updated packages.


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.14.1 requires torch==1.13.1, but you have torch 2.0.0 which is incompatible.
torchaudio 0.13.1 requires torch==1.13.1, but you have torch 2.0.0 which is incompatible.


In [1]:
import pandas as pd
import numpy as np

DATA_PATH = './obesity_data/'
test_df = pd.read_pickle(DATA_PATH + '/test_df.pkl')
train_df = pd.read_pickle(DATA_PATH + '/train_df.pkl')
test_annot_all_df_clean = pd.read_pickle(DATA_PATH + '/test_annot_all_df_clean.pkl') 
train_annot_all_df_clean = pd.read_pickle(DATA_PATH + '/train_annot_all_df_clean.pkl') 


test_df['word_count'] = test_df['lower_text'].apply(lambda x: len(x.split()))
train_df['word_count'] = train_df['lower_text'].apply(lambda x: len(x.split()))

df_print = pd.DataFrame()
df_print['Min'] = [np.min(test_df['word_count']), np.min(train_df['word_count'])]
df_print['Mean'] = [np.mean(test_df['word_count']), np.mean(train_df['word_count'])]
df_print['Max'] = [np.max(test_df['word_count']), np.max(train_df['word_count'])]
df_print['Std'] = [np.std(test_df['word_count']), np.std(train_df['word_count'])]
df_print['MeanPlusStd'] = round(df_print['Mean'] + df_print['Std'],0)
token_max = int(round(np.max(df_print['MeanPlusStd']),0))

print(df_print)
print('Max Tokens:',token_max)
print('Test:', sum(test_df['word_count'] > token_max), "out of", len(test_df))
print('Train:', sum(train_df['word_count'] > token_max), "out of", len(train_df))


   Min        Mean   Max         Std  MeanPlusStd
0  206  991.907298  3124  437.621770       1430.0
1  113  958.774141  3748  444.819254       1404.0
Max Tokens: 1430
Test: 79 out of 507
Train: 72 out of 611
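
The cutoff above is the rounded mean plus one standard deviation of the word counts, taking the larger of the test and train values. As a minimal sketch of that arithmetic (the helper name `token_limit` is ours, not part of the notebook, and it uses the standard library instead of NumPy):

```python
import statistics

def token_limit(word_counts):
    """Rounded mean + one population standard deviation of the word counts."""
    mean = statistics.fmean(word_counts)
    # pstdev is the population standard deviation, matching np.std's default (ddof=0)
    std = statistics.pstdev(word_counts)
    return int(round(mean + std))

# Toy word counts, for illustration only
toy_limit = token_limit([100, 200, 300])

# Reproduce the notebook's cutoff from the reported Mean and Std columns
test_limit = int(round(991.907298 + 437.621770))
train_limit = int(round(958.774141 + 444.819254))
reported_token_max = max(test_limit, train_limit)
```

With the reported statistics this gives 1430 for test and 1404 for train, so the larger value, 1430, becomes `token_max`.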


We split each of these longer notes into 2 notes of at most `token_max` words each. Note that 4 notes (1 in test and 3 in train) are longer than 2 × `token_max` words. For now we drop the words beyond that point, but we may want to keep them later (either by looping over chunks or by using left/middle/right parts).

In [2]:
test_df_ok = test_df[test_df['word_count'] <= token_max].copy()
test_df_large_right = test_df[test_df['word_count'] > token_max].copy()
test_df_large_left = test_df[test_df['word_count'] > token_max].copy()

train_df_ok = train_df[train_df['word_count'] <= token_max].copy()
train_df_large_right = train_df[train_df['word_count'] > token_max].copy()
train_df_large_left = train_df[train_df['word_count'] > token_max].copy()

# Get the left and right halves of each long note; all 3 sets are concatenated and the word counts recalculated below.
# The left half keeps words [0, token_max) so that no word is lost between the two halves.
test_df_large_left['lower_text'] = test_df_large_left['lower_text'].apply(lambda x: ' '.join(x.split()[:token_max]))
test_df_large_right['lower_text'] = test_df_large_right['lower_text'].apply(lambda x: ' '.join(x.split()[token_max:(2*token_max)]))
train_df_large_left['lower_text'] = train_df_large_left['lower_text'].apply(lambda x: ' '.join(x.split()[:token_max]))
train_df_large_right['lower_text'] = train_df_large_right['lower_text'].apply(lambda x: ' '.join(x.split()[token_max:(2*token_max)]))

test_df_expanded = pd.concat([test_df_ok,test_df_large_right,test_df_large_left])
test_df_expanded['word_count'] = test_df_expanded['lower_text'].apply(lambda x: len(x.split()))
train_df_expanded = pd.concat([train_df_ok,train_df_large_right,train_df_large_left])
train_df_expanded['word_count'] = train_df_expanded['lower_text'].apply(lambda x: len(x.split()))

print('Test:', sum(test_df_expanded['word_count'] > token_max), "out of", len(test_df_expanded))
print('Train:', sum(train_df_expanded['word_count'] > token_max), "out of", len(train_df_expanded))

Test: 0 out of 586
Train: 0 out of 683
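
The left/right split above handles notes of up to 2 × `token_max` words. One possible way to cover the few longer notes is the looping approach mentioned earlier. The sketch below is an illustration only; `split_into_chunks` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def split_into_chunks(text, token_max):
    """Split whitespace-separated text into consecutive chunks of at most token_max words."""
    words = text.split()
    return [' '.join(words[i:i + token_max]) for i in range(0, len(words), token_max)]

# Toy example: 7 words with a limit of 3 words per chunk
sample = 'w0 w1 w2 w3 w4 w5 w6'
chunks = split_into_chunks(sample, 3)

# No word is dropped: joining the chunks restores the original text
restored = ' '.join(chunks)
```

Applied to a DataFrame, each chunk would need its own row with the same `id`; `DataFrame.explode` on a column of chunk lists is one way that could be done.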


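We need to encode each note as a sequence of vocabulary indices and pad it with the padding token up to `token_max` entries. Note that the resulting `one_hot` column holds integer indices, not literal one-hot vectors.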

In [5]:
from typing import Union, Iterable
import torchtext, torch, torch.nn.functional as F
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

corpus = pd.concat([test_df_expanded['lower_text'],train_df_expanded['lower_text']])
tokenizer = get_tokenizer("basic_english")
tokens = [tokenizer(doc) for doc in corpus]

voc = build_vocab_from_iterator(tokens, specials = ['<pad>'])

# Encode the tokens as vocabulary indices, then append the <pad> index until
# the sequence has token_max entries.
# TODO: decide whether this is also needed for the data without stop words.
def encode_and_pad(vocab, input_tokens, token_max):
    pad_count = token_max - len(input_tokens)
    result = vocab.lookup_indices(input_tokens)
    if pad_count > 0:
        # '<pad>' is the first special token, so its index is 0
        result.extend([vocab['<pad>']] * pad_count)
    return result

train_df_expanded['one_hot'] = train_df_expanded['lower_text'].apply(lambda x: encode_and_pad(voc, x.split(), token_max))
test_df_expanded['one_hot'] = test_df_expanded['lower_text'].apply(lambda x: encode_and_pad(voc, x.split(), token_max))

print(train_df_expanded.iloc[0])



id                                                                  4
text                368346277 | EMH | 64927307 | | 815098 | 3/29/1...
no_punc_text        368346277  EMH  64927307   815098  3291993 120...
no_numerics_text      EMH         AM  Discharge Summary  Signed  D...
lower_text          emh am discharge summary signed dis admission ...
word_count                                                        413
one_hot             [7206, 73, 18, 126, 123, 138, 26, 53, 108, 36,...
Name: 2, dtype: object
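
To make the relationship between padded index sequences and literal one-hot vectors concrete, here is a small standalone sketch. It does not use torchtext: `toy_vocab`, `pad_indices` and `to_one_hot` are hypothetical helpers, and the toy vocabulary assigns indices in order of first appearance, whereas `build_vocab_from_iterator` orders tokens by frequency.

```python
def toy_vocab(tokenized_docs, pad_token='<pad>'):
    """Map each token to an index; the padding token gets index 0."""
    vocab = {pad_token: 0}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def pad_indices(vocab, tokens, max_len, pad_token='<pad>'):
    """Look up token indices, truncate to max_len, and right-pad with the pad index."""
    indices = [vocab[t] for t in tokens][:max_len]
    return indices + [vocab[pad_token]] * (max_len - len(indices))

def to_one_hot(indices, vocab_size):
    """Expand each index into a vector with a single 1 at that position."""
    return [[1 if j == idx else 0 for j in range(vocab_size)] for idx in indices]

docs = [['chest', 'pain'], ['pain', 'free', 'today']]
vocab = toy_vocab(docs)
encoded = pad_indices(vocab, docs[0], max_len=4)
one_hot = to_one_hot(encoded, len(vocab))
```

Storing the indices, as the notebook does, is far more compact than storing one vector of vocabulary size per token, and an embedding lookup only needs the index.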


Join the test and train documents with their associated annotations. Verify that the number of records is the same.

In [6]:
test_with_annot_df = pd.merge(test_annot_all_df_clean,test_df, on='id')
train_with_annot_df = pd.merge(train_annot_all_df_clean,train_df, on='id')


test_with_annot_df_expanded = pd.merge(test_annot_all_df_clean,test_df_expanded,on='id')
train_with_annot_df_expanded = pd.merge(train_annot_all_df_clean,train_df_expanded,on='id')



print("Test:", len(test_annot_all_df_clean), len(test_df), len(test_with_annot_df))
print("Train:", len(train_annot_all_df_clean), len(train_df), len(train_with_annot_df))

print("Test Expanded:", len(test_annot_all_df_clean), len(test_df_expanded), len(test_with_annot_df_expanded))
print("Train Expanded:", len(train_annot_all_df_clean), len(train_df_expanded), len(train_with_annot_df_expanded))



Test: 7542 507 7542
Train: 8783 611 8783
Test Expanded: 7542 586 8701
Train Expanded: 8783 683 9809
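
The expanded merges return more rows than there are annotations because a split note contributes two rows with the same `id`, and an inner merge repeats each annotation once per matching row. A toy illustration with made-up data (the column names follow the notebook, the values do not):

```python
import pandas as pd

# Toy annotations: two diseases judged for each of two notes
annotations = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'disease': ['Asthma', 'CAD', 'Asthma', 'CAD'],
    'judgment': [True, False, False, True],
})
# Note 2 was split into two chunks, so its id appears twice
notes_expanded = pd.DataFrame({
    'id': [1, 2, 2],
    'lower_text': ['short note', 'left half', 'right half'],
})

merged = pd.merge(annotations, notes_expanded, on='id')

# Each annotation row is repeated once per chunk of its note
chunks_per_id = notes_expanded['id'].value_counts()
expected_rows = int(annotations['id'].map(chunks_per_id).sum())
```

Here 4 annotations become 6 merged rows. If a one-row-per-annotation result were required, the `validate` argument of `merge` (for example `validate='many_to_one'`) could be used to raise an error when the right-hand keys are not unique.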


Validate that the numbers are close to those in the original paper. The counts in the expanded set are higher because each split note appears twice under the same `id`, so its annotations are matched twice in the merge. The percentage occurrence of each disease changes very little, so we are good to use the expanded set.

In [7]:
all_df = pd.concat([test_with_annot_df, train_with_annot_df])
all_df_extended = pd.concat([test_with_annot_df_expanded, train_with_annot_df_expanded])

def disease_summary(df):
    # Annotation count per disease and the fraction with judgment == True
    counts = df['disease'].value_counts().sort_index()
    positives = df[df['judgment'] == True]['disease'].value_counts().sort_index()
    return pd.concat([counts, positives / counts], axis=1)

df_before = disease_summary(all_df)
df_after = disease_summary(all_df_extended)
df_all = pd.concat([df_before,df_after], axis=1)

df_all



| Disease | Count (original) | Positive rate (original) | Count (expanded) | Positive rate (expanded) |
|---|---|---|---|---|
| Asthma | 1057 | 0.142857 | 1195 | 0.143933 |
| CAD | 1044 | 0.606322 | 1187 | 0.608256 |
| CHF | 723 | 0.672199 | 838 | 0.693317 |
| Depression | 1068 | 0.220974 | 1211 | 0.230388 |
| Diabetes | 1070 | 0.702804 | 1216 | 0.71875 |
| GERD | 924 | 0.239177 | 1036 | 0.247104 |
| Gallstones | 1097 | 0.164084 | 1244 | 0.171222 |
| Gout | 1102 | 0.131579 | 1250 | 0.1368 |
| Hypercholesterolemia | 961 | 0.548387 | 1088 | 0.553309 |
| Hypertension | 1037 | 0.812922 | 1177 | 0.816483 |


![Note occurrences](images/note_occurrences.gif)
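
The comparison of positive rates before and after expansion can also be expressed with a `groupby`. This is a sketch on toy data, assuming the `judgment` column is boolean; `positive_rate` is a hypothetical helper.

```python
import pandas as pd

def positive_rate(df):
    """Fraction of annotations with judgment == True, per disease."""
    return df.groupby('disease')['judgment'].mean().sort_index()

before = pd.DataFrame({
    'disease': ['Asthma', 'Asthma', 'CAD', 'CAD'],
    'judgment': [True, False, True, True],
})
# Simulate an expanded set in which one negative Asthma row is duplicated by a split note
after = pd.concat([before, before.iloc[[1]]], ignore_index=True)

# Absolute change in positive rate per disease, and the largest change overall
shift = (positive_rate(after) - positive_rate(before)).abs()
largest_shift = float(shift.max())
```

Among the rows shown in the table above, the largest change is for CHF (0.672199 to 0.693317, about 0.021).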


Save the final test/train datasets and the corpus.

In [8]:
test_with_annot_df_expanded.to_pickle(DATA_PATH + '/test.pkl') 
train_with_annot_df_expanded.to_pickle(DATA_PATH + '/train.pkl') 
corpus.to_pickle(DATA_PATH + '/corpus.pkl')

Testing the GloVe embedding.

In [9]:
from torchtext.vocab import vocab


vec = torchtext.vocab.GloVe(name='6B', dim=300)

one_hot_test = train_df_expanded.iloc[0]['one_hot']

print(one_hot_test)
ret = vec.get_vecs_by_tokens(voc.lookup_tokens(one_hot_test))
print(ret.shape)

.vector_cache\glove.6B.zip: 862MB [02:44, 5.23MB/s]                               
100%|█████████▉| 399999/400000 [00:52<00:00, 7550.17it/s]


[7206, 73, 18, 126, 123, 138, 26, 53, 108, 36, 123, 18, 53, 5276, 93, 77, 72, 55, 100, 725, 463, 428, 55, 119, 169, 15, 273, 504, 169, 25, 3, 180, 228, 1, 13, 20, 8, 396, 392, 671, 20859, 56, 29809, 3176, 7, 8, 871, 25, 3, 451, 19, 24, 111, 280, 11, 1, 360, 4936, 4478, 10, 380, 7, 3258, 2715, 10, 2840, 398, 3, 22, 77, 72, 55, 1, 13, 24, 1291, 2, 24, 111, 536, 200, 1654, 11662, 3, 1, 451, 2274, 673, 22, 451, 24, 2076, 2, 19, 2147, 5, 2028, 374, 817, 753, 19, 4, 1590, 10, 1477, 66, 202, 97, 86, 25, 417, 10, 30, 498, 3, 41, 27, 11, 60, 119, 2, 25, 3, 463, 428, 55, 7, 2289, 197, 148, 130, 6, 148, 162, 1, 84, 340, 4, 57, 80, 494, 688, 2, 379, 162, 889, 389, 196, 3811, 57, 143, 80, 2, 135, 15, 689, 1475, 239, 288, 452, 215, 15, 114, 24, 30894, 2087, 366, 3, 433, 22, 148, 162, 4, 455, 6, 22, 2768, 26, 5, 1, 66, 157, 395, 63, 3705, 1032, 17, 1477, 66, 202, 313, 130, 22, 26, 313, 162, 4, 1403, 10, 8, 67, 1374, 2, 1736, 534, 162, 22, 150, 63, 66, 202, 657, 2841, 129, 1339, 11, 932, 607, 2, 193, 
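
Conceptually, the lookup above maps each token to its pretrained vector. In torchtext, tokens without a pretrained vector get a zero vector by default, which presumably includes `<pad>`. The standalone sketch below mimics that behaviour with a toy dictionary; `lookup_vectors` and the 3-dimensional vectors are made up for illustration and are not the torchtext implementation.

```python
def lookup_vectors(vectors, tokens, dim):
    """Return one vector per token; tokens without a vector map to zeros."""
    zero = [0.0] * dim
    return [vectors.get(token, zero) for token in tokens]

# Toy 3-dimensional "pretrained" vectors
toy_vectors = {'chest': [0.1, 0.2, 0.3], 'pain': [0.4, 0.5, 0.6]}
tokens = ['chest', 'pain', '<pad>', '<pad>']

matrix = lookup_vectors(toy_vectors, tokens, dim=3)
shape = (len(matrix), len(matrix[0]))
```

By the same logic, we would expect `ret.shape` in the cell above to have one row per entry of the padded sequence (`token_max`) and 300 columns, the dimension chosen for GloVe.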

Testing the FastText embedding.

In [10]:
from torchtext.vocab import vocab

vec = torchtext.vocab.FastText()

one_hot_test = train_df_expanded.iloc[0]['one_hot']

print(one_hot_test)
ret = vec.get_vecs_by_tokens(voc.lookup_tokens(one_hot_test))
print(ret.shape)

.vector_cache\wiki.en.vec: 6.60GB [03:54, 28.1MB/s]                                
  0%|          | 0/2519370 [00:00<?, ?it/s]Skipping token b'2519370' with 1-dimensional vector [b'300']; likely a header
100%|██████████| 2519370/2519370 [06:15<00:00, 6707.30it/s]


[7206, 73, 18, 126, 123, 138, 26, 53, 108, 36, 123, 18, 53, 5276, 93, 77, 72, 55, 100, 725, 463, 428, 55, 119, 169, 15, 273, 504, 169, 25, 3, 180, 228, 1, 13, 20, 8, 396, 392, 671, 20859, 56, 29809, 3176, 7, 8, 871, 25, 3, 451, 19, 24, 111, 280, 11, 1, 360, 4936, 4478, 10, 380, 7, 3258, 2715, 10, 2840, 398, 3, 22, 77, 72, 55, 1, 13, 24, 1291, 2, 24, 111, 536, 200, 1654, 11662, 3, 1, 451, 2274, 673, 22, 451, 24, 2076, 2, 19, 2147, 5, 2028, 374, 817, 753, 19, 4, 1590, 10, 1477, 66, 202, 97, 86, 25, 417, 10, 30, 498, 3, 41, 27, 11, 60, 119, 2, 25, 3, 463, 428, 55, 7, 2289, 197, 148, 130, 6, 148, 162, 1, 84, 340, 4, 57, 80, 494, 688, 2, 379, 162, 889, 389, 196, 3811, 57, 143, 80, 2, 135, 15, 689, 1475, 239, 288, 452, 215, 15, 114, 24, 30894, 2087, 366, 3, 433, 22, 148, 162, 4, 455, 6, 22, 2768, 26, 5, 1, 66, 157, 395, 63, 3705, 1032, 17, 1477, 66, 202, 313, 130, 22, 26, 313, 162, 4, 1403, 10, 8, 67, 1374, 2, 1736, 534, 162, 22, 150, 63, 66, 202, 657, 2841, 129, 1339, 11, 932, 607, 2, 193, 
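
A further check that may be useful with either embedding is how much of our vocabulary actually has a pretrained vector. The sketch below is an assumption about how that could be measured and is not part of the original notebook; `coverage` is a hypothetical helper. In the notebook, the vocabulary tokens could come from `voc.get_itos()` and the pretrained tokens from the keys of `vec.stoi`.

```python
def coverage(vocab_tokens, pretrained_tokens):
    """Return the fraction of vocabulary tokens with a pretrained vector, and the missing ones."""
    if not vocab_tokens:
        return 0.0, []
    known = set(pretrained_tokens)
    missing = [t for t in vocab_tokens if t not in known]
    return 1 - len(missing) / len(vocab_tokens), missing

# Toy data, for illustration only
vocab_tokens = ['<pad>', 'chest', 'pain', 'emh']
pretrained_tokens = ['chest', 'pain', 'the']

fraction, missing = coverage(vocab_tokens, pretrained_tokens)
```

Clinical abbreviations and identifiers are the kind of tokens that might be missing from general-purpose vectors, so the list of missing tokens is worth inspecting alongside the fraction.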