Title:  Project Workbook Create Embeddings

Authors:  Matthew Lopes and Chris Kabat

This notebook prepares the data needed to create word and sentence embeddings for our CS 598 DLH project. To avoid saving large files, we do not create the embeddings here; we only prepare the data for them. The paper we have chosen for the reproducibility project is:
***Ensembling Classical Machine Learning and Deep Learning Approaches for Morbidity Identification from Clinical Notes***

Abstract:  The main goal of the paper is to extract morbidity from clinical notes.  The idea is to use a combination of classical and deep learning methods to determine the best approach for classifying these notes into one or more of 16 morbidity conditions.  The models use a combination of NLP techniques, including embeddings and bag-of-words implementations.  The paper also measures the effect of including stop words.  Lastly, it uses ensemble techniques to tie together a number of the classical and deep learning models to provide the most accurate results.

The data cannot be shared publicly because of the agreements required to obtain it, so we store it locally and do not put it in GitHub.

We are only creating embeddings for data that includes stop words.

In this workbook, we are taking the following steps:

* Split large documents into smaller sections (left and right)
* Create an integer-index encoding of the text (stored in the `one_hot` column)
* Create a tokenized and padded representation of the words and sentences

First we load the required libraries and create new fields that hold the word and sentence counts.

In [51]:
import pandas as pd
import numpy as np

DATA_PATH = './obesity_data/'

#Field to tokenize on
#tokenize_field = 'lower_text'
tokenize_field = 'tok_lem_text'
#tokenize_field = 'word_tokenized'
isTokenized = True

#Don't need to do this for the one with no stop words
alldocs_df = pd.read_pickle(DATA_PATH + '/alldocs_df.pkl')
allannot_df= pd.read_pickle(DATA_PATH + '/allannot_df.pkl')


alldocs_df['sentence_count'] = alldocs_df['sentence_tokenized'].apply(lambda x: len(x))
sentence_max = np.max(alldocs_df['sentence_count'])
print('Max Sentences:', sentence_max)

if isTokenized:
    alldocs_df['word_count'] = alldocs_df[tokenize_field].apply(lambda x: len(x))
else:
    alldocs_df['word_count'] = alldocs_df[tokenize_field].apply(lambda x: len(x.split()))

df_print = pd.DataFrame()
df_print['Min'] = [np.min(alldocs_df['word_count']), np.min(alldocs_df['word_count'])]
df_print['Mean'] = [np.mean(alldocs_df['word_count']), np.mean(alldocs_df['word_count'])]
df_print['Max'] = [np.max(alldocs_df['word_count']), np.max(alldocs_df['word_count'])]
df_print['Std'] = [np.std(alldocs_df['word_count']), np.std(alldocs_df['word_count'])]
df_print['MeanPlusStd'] = round(df_print['Mean'] + df_print['Std'],0)
token_max = int(round(np.max(df_print['MeanPlusStd']),0))

print(df_print)
print('Max Tokens:',token_max)
print('All:', sum(alldocs_df['word_count'] > token_max), "out of", len(alldocs_df))



Max Sentences: 380
   Min        Mean   Max         Std  MeanPlusStd
0  113  973.840787  3748  441.912616       1416.0
1  113  973.840787  3748  441.912616       1416.0
Max Tokens: 1416
All: 156 out of 1118
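
The cutoff computed above is the mean word count plus one standard deviation, rounded to an integer. A minimal sketch of that rule on made-up counts follows; the helper name `length_cutoff` and the numbers are illustrative and do not come from the dataset. `statistics.pstdev` is used because `np.std` computes the population standard deviation by default.

```python
from statistics import mean, pstdev

def length_cutoff(word_counts):
    """Return round(mean + population std) of the word counts as an int."""
    return int(round(mean(word_counts) + pstdev(word_counts)))

# Hypothetical word counts for five notes (not real data).
counts = [100, 200, 300, 400, 1000]
cutoff = length_cutoff(counts)

# Notes longer than the cutoff are the ones that would be split.
too_long = [c for c in counts if c > cutoff]
print(cutoff, too_long)
```

With this rule a single very long note raises the cutoff only moderately, so most notes stay whole while the outliers are split.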


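We split each of these larger notes into 2 notes of at most `token_max` tokens: the first `token_max` tokens (left) and the last `token_max` tokens (right).  For notes shorter than 2 times `token_max` the two halves overlap.  Note that 4 notes (1 in test and 3 in train) are longer than 2 times `token_max` tokens.  In those cases we keep only the top and bottom of the document and drop the middle.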

In [52]:
alldocs_df_ok = alldocs_df[alldocs_df['word_count'] <= token_max].copy()
alldocs_df_large_left = alldocs_df[alldocs_df['word_count'] > token_max].copy()
alldocs_df_large_right = alldocs_df[alldocs_df['word_count'] > token_max].copy()

#Get the left words and the right words, then concatenate all 3 and recalculate
if isTokenized:
    #alldocs_df_large_left[tokenize_field] = alldocs_df_large_left[tokenize_field].apply(lambda x: [word for word in x[:(token_max-1)]])
    #alldocs_df_large_right[tokenize_field] = alldocs_df_large_right[tokenize_field].apply(lambda x: [word for word in x[token_max:(2*token_max)]])
    alldocs_df_large_left[tokenize_field] = alldocs_df_large_left[tokenize_field].apply(lambda x: [word for word in x[:(token_max)]])
    alldocs_df_large_right[tokenize_field] = alldocs_df_large_right[tokenize_field].apply(lambda x: [word for word in x[(len(x)-token_max):len(x)]])   
else:
    #alldocs_df_large_left[tokenize_field] = alldocs_df_large_left[tokenize_field].apply(lambda x: ' '.join([word for word in x.split()[:(token_max-1)]]))
    #alldocs_df_large_right[tokenize_field] = alldocs_df_large_right[tokenize_field].apply(lambda x: ' '.join([word for word in x.split()[token_max:(2*token_max)]]))
    alldocs_df_large_left[tokenize_field] = alldocs_df_large_left[tokenize_field].apply(lambda x: ' '.join([word for word in x.split()[:(token_max)]]))
    alldocs_df_large_right[tokenize_field] = alldocs_df_large_right[tokenize_field].apply(lambda x: ' '.join([word for word in x.split()[(len(x.split())-token_max):len(x.split())]]))

alldocs_df_expanded = pd.concat([alldocs_df_ok,alldocs_df_large_right,alldocs_df_large_left])

if isTokenized:
    alldocs_df_expanded['word_count'] = alldocs_df_expanded[tokenize_field].apply(lambda x: len(x))
else:
    alldocs_df_expanded['word_count'] = alldocs_df_expanded[tokenize_field].apply(lambda x: len(x.split()))

print('All:', sum(alldocs_df_expanded['word_count'] > token_max), "out of", len(alldocs_df_expanded))

All: 0 out of 1274
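
The left/right split used above can be illustrated on a toy token list. This is a sketch only: the helper name `split_head_tail` is hypothetical, and it assumes `token_max` is a positive integer.

```python
def split_head_tail(tokens, token_max):
    """Return one window for short notes, or head and tail windows for long ones."""
    if len(tokens) <= token_max:
        return [list(tokens)]
    head = list(tokens[:token_max])   # first token_max tokens (left)
    tail = list(tokens[-token_max:])  # last token_max tokens (right)
    return [head, tail]

tokens = [f"w{i}" for i in range(10)]

# More than 2 * token_max tokens: the middle (w4, w5) is dropped.
for window in split_head_tail(tokens, token_max=4):
    print(window)

# Fewer than 2 * token_max tokens: the two windows overlap (w4, w5 appear twice).
for window in split_head_tail(tokens, token_max=6):
    print(window)
```

The overlap case is the common one here: a note only slightly longer than the cutoff yields two windows that share most of their tokens.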


In [53]:
alldocs_df[alldocs_df['id']==1]

Unnamed: 0,id,text,sentence_tokenized,word_tokenized,no_punc_text,no_numerics_text,lower_text,tokenized_text,tok_lem_text,sentence_count,word_count
0,1,490646815 | WMC | 31530471 | | 9629480 | 11/23...,[wmc am anemia sign di admiss date report stat...,"[wmc, am, anemia, sign, di, admiss, date, repo...",490646815 WMC 31530471 9629480 11232006 1...,WMC AM ANEMIA Signed DIS Admissi...,wmc am anemia signed dis admission date report...,"[wmc, am, anemia, signed, dis, admission, date...","[wmc, am, anemia, signed, dis, admission, date...",110,1474


In [54]:
alldocs_df_expanded[alldocs_df_expanded['id']==1]


Unnamed: 0,id,text,sentence_tokenized,word_tokenized,no_punc_text,no_numerics_text,lower_text,tokenized_text,tok_lem_text,sentence_count,word_count
0,1,490646815 | WMC | 31530471 | | 9629480 | 11/23...,[wmc am anemia sign di admiss date report stat...,"[wmc, am, anemia, sign, di, admiss, date, repo...",490646815 WMC 31530471 9629480 11232006 1...,WMC AM ANEMIA Signed DIS Admissi...,wmc am anemia signed dis admission date report...,"[wmc, am, anemia, signed, dis, admission, date...","[with, ejection, fraction, of, to, who, presen...",110,1416
0,1,490646815 | WMC | 31530471 | | 9629480 | 11/23...,[wmc am anemia sign di admiss date report stat...,"[wmc, am, anemia, sign, di, admiss, date, repo...",490646815 WMC 31530471 9629480 11232006 1...,WMC AM ANEMIA Signed DIS Admissi...,wmc am anemia signed dis admission date report...,"[wmc, am, anemia, signed, dis, admission, date...","[wmc, am, anemia, signed, dis, admission, date...",110,1416


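We need to encode each note as a list of vocabulary indices and pad it to `token_max` with the padding token.  Note that the column is named `one_hot`, but it holds integer indices rather than one-hot vectors.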

In [55]:
from typing import Union, Iterable
import torchtext, torch, torch.nn.functional as F
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator


##Words
if isTokenized:
    voc = build_vocab_from_iterator(alldocs_df_expanded[tokenize_field].to_list(), specials = ['<pad>'])
else:
    corpus = alldocs_df_expanded[tokenize_field]
    tokenizer = get_tokenizer("basic_english")
    tokens = [tokenizer(doc) for doc in corpus]
    voc = build_vocab_from_iterator(tokens, specials = ['<pad>'])

#Encode tokens as vocabulary indices and pad with the <pad> index up to token_max
def encode_and_pad(vocab, input_tokens, token_max):
    pad_count = token_max - len(input_tokens)
    result = vocab.lookup_indices(list(input_tokens))
    if pad_count > 0:
        #Look up the pad index instead of assuming it is 0
        result.extend([vocab['<pad>']] * pad_count)
    return result

#Pad the tokens with '<pad>' up to token_max
def token_and_pad(vocab, input_tokens, token_max):
    pad_count = token_max - len(input_tokens)
    #Copy so the list stored in the source column is not modified in place
    result = list(input_tokens)
    if pad_count > 0:
        result.extend(['<pad>'] * pad_count)
    return result

#Pad the sentences with '\n' up to sentence_max
def token_and_pad_sentence(input_sentences, sentence_max):
    pad_count = sentence_max - len(input_sentences)
    #Copy so the list stored in the source column is not modified in place
    result = list(input_sentences)
    if pad_count > 0:
        result.extend(['\n'] * pad_count)
    return result

if isTokenized:
    alldocs_df_expanded['one_hot'] = alldocs_df_expanded[tokenize_field].apply(lambda x: encode_and_pad(voc, x, token_max))
    alldocs_df_expanded['vector_tokenized'] = alldocs_df_expanded[tokenize_field].apply(lambda x: token_and_pad(voc, x, token_max))
else:
    alldocs_df_expanded['one_hot'] = alldocs_df_expanded[tokenize_field].apply(lambda x: encode_and_pad(voc, x.split(), token_max))
    alldocs_df_expanded['vector_tokenized'] = alldocs_df_expanded[tokenize_field].apply(lambda x: token_and_pad(voc, x.split(), token_max))

alldocs_df_expanded['sentence_tokenized'] = alldocs_df_expanded['sentence_tokenized'].apply(lambda x: token_and_pad_sentence(x, sentence_max))



print(alldocs_df_expanded.iloc[0][tokenize_field])
print(alldocs_df_expanded.iloc[0]['one_hot'])
print(alldocs_df_expanded.iloc[0]['vector_tokenized'])
print(alldocs_df_expanded.iloc[0]['sentence_tokenized'])



['emh', 'am', 'discharge', 'summary', 'signed', 'dis', 'admission', 'date', 'report', 'status', 'signed', 'discharge', 'date', 'principle', 'diagnosis', 'coronary', 'artery', 'disease', 'other', 'diagnosis', 'peripheral', 'vascular', 'disease', 'hypertension', 'allergy', 'no', 'known', 'drug', 'allergy', 'history', 'of', 'present', 'illness', 'the', 'patient', 'is', 'a', 'year', 'old', 'male', 'immigrant', 'from', 'tope', 'ri', 'with', 'a', 'long', 'history', 'of', 'angina', 'he', 'had', 'been', 'followed', 'in', 'the', 'o', 'lake', 'jack', 'for', 'year', 'with', 'strong', 'indication', 'for', 'interventional', 'evaluation', 'of', 'his', 'coronary', 'artery', 'disease', 'the', 'patient', 'had', 'refused', 'and', 'had', 'been', 'being', 'treated', 'medically', 'inspite', 'of', 'the', 'angina', 'pattern', 'recently', 'his', 'angina', 'had', 'worsened', 'and', 'he', 'agreed', 'to', 'undergo', 'more', 'intensive', 'workup', 'he', 'wa', 'referred', 'for', 'elective', 'cardiac', 'catheteriza
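
To make the encoding and padding step easier to follow, here is a dependency-free sketch of the same idea. The plain dict vocabulary built by `build_vocab` is a stand-in for the torchtext vocabulary and is an assumption of this example; its ordering (padding first, then tokens by descending frequency) may not match torchtext exactly, and `pad_tokens` is a hypothetical name.

```python
from collections import Counter

PAD = "<pad>"

def build_vocab(docs):
    """Map tokens to indices: <pad> gets 0, then tokens by descending frequency."""
    freq = Counter(tok for doc in docs for tok in doc)
    ordered = sorted(freq, key=lambda t: (-freq[t], t))
    return {tok: i for i, tok in enumerate([PAD] + ordered)}

def encode_and_pad(stoi, tokens, length):
    """Return vocabulary indices, right-padded with the <pad> index."""
    ids = [stoi[t] for t in tokens]
    return ids + [stoi[PAD]] * (length - len(ids))

def pad_tokens(tokens, length):
    """Return a padded copy of the tokens; the input list is left untouched."""
    return list(tokens) + [PAD] * (length - len(tokens))

docs = [["chest", "pain"], ["pain"]]
stoi = build_vocab(docs)
print(encode_and_pad(stoi, docs[0], 4))
print(pad_tokens(docs[1], 3))
```

Both helpers return new lists, so the original token column keeps its unpadded length.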

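Join the documents with their associated annotations.  For the original set, verify that the merged record count matches the annotation count; the expanded set has more rows because each split note appears twice.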

In [56]:
all_df = pd.merge(allannot_df,alldocs_df, on='id')
all_df_expanded= pd.merge(allannot_df,alldocs_df_expanded,on='id')

print("All:", len(allannot_df), len(alldocs_df), len(all_df))
print("All Expanded:", len(allannot_df), len(alldocs_df_expanded), len(all_df_expanded))

All: 16325 1118 16325
All Expanded: 16325 1274 18584
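
The larger row count for the expanded set follows from how an inner join on `id` multiplies matching rows. A small sketch that predicts the merged row count from the key columns alone; the helper name `expected_inner_join_rows` and the toy ids are assumptions for illustration.

```python
from collections import Counter

def expected_inner_join_rows(left_keys, right_keys):
    """Rows produced by an inner join: sum over keys of left count * right count."""
    left, right = Counter(left_keys), Counter(right_keys)
    return sum(n * right[key] for key, n in left.items())

annot_ids = [1, 1, 1, 2, 2]    # 3 annotations for note 1, 2 for note 2
doc_ids = [1, 2]               # one row per note
expanded_doc_ids = [1, 1, 2]   # note 1 was split into left and right

print(expected_inner_join_rows(annot_ids, doc_ids))           # 5
print(expected_inner_join_rows(annot_ids, expanded_doc_ids))  # 8
```

Every annotation of a split note is matched once per half, which is why the expanded merge grows while the original merge keeps the annotation count.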


Try to validate that the numbers are close to those in the original paper.  The counts after expansion are higher, as expected since each split note is counted twice, but the percentage occurrence of each disease doesn't change much, so we are good to use the expanded set.

In [71]:
def disease_stats(df):
    #Rows per disease and the share of rows with a positive judgment
    counts = df['disease'].value_counts().sort_index()
    positives = df[df['judgment']==True]['disease'].value_counts().sort_index()
    return pd.concat([counts, positives/counts], axis=1)

df_all = pd.concat([disease_stats(all_df), disease_stats(all_df_expanded)], axis=1)
df_all.columns = ['Count Before', '% Before', 'Count After', '% After']
df_all



Unnamed: 0,Count Before,% Before,Count After,% After
Asthma,1057,0.142857,1200,0.143333
CAD,1044,0.606322,1192,0.608221
CHF,723,0.672199,841,0.693222
Depression,1068,0.220974,1216,0.231086
Diabetes,1070,0.702804,1221,0.719902
GERD,924,0.239177,1039,0.246391
Gallstones,1097,0.164084,1249,0.172938
Gout,1102,0.131579,1255,0.136255
Hypercholesterolemia,961,0.548387,1092,0.553114
Hypertension,1037,0.812922,1182,0.816413
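
The two statistics in the table above are the number of annotation rows per disease and the share of those rows with a positive judgment. A dependency-free sketch of the same calculation on toy rows; the helper name `disease_summary` and the data are illustrative only. Unlike the pandas version, a disease with no positive rows gets a share of 0.0 here rather than NaN.

```python
from collections import defaultdict

def disease_summary(rows):
    """rows: (disease, judgment) pairs -> {disease: (count, positive_share)}."""
    totals, positives = defaultdict(int), defaultdict(int)
    for disease, judgment in rows:
        totals[disease] += 1
        positives[disease] += bool(judgment)
    return {d: (totals[d], positives[d] / totals[d]) for d in sorted(totals)}

# Toy annotation rows (not real data).
rows = [
    ("Asthma", True), ("Asthma", False), ("Asthma", False), ("Asthma", False),
    ("CAD", True), ("CAD", False),
]
summary = disease_summary(rows)
for disease, (count, share) in summary.items():
    print(disease, count, share)
```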


![Note occurrences](images/note_occurrences.gif)


Save the final test/train dataset.  We are also saving the vocabulary used and the max number of tokens and sentences.

In [58]:
all_df.to_pickle(DATA_PATH + '/all_df.pkl') 
all_df_expanded.to_pickle(DATA_PATH + '/all_df_expanded.pkl') 
torch.save(voc, DATA_PATH + '/voc.obj')
torch.save((token_max, sentence_max), DATA_PATH + '/counts.obj')

Testing the GloVe embedding, looking up vectors both from the index encoding and from the padded tokens.

In [59]:
from torchtext.vocab import vocab


vec = torchtext.vocab.GloVe(name='6B', dim=300)

one_hot_test = all_df_expanded.iloc[0]['one_hot']
vector_tokenized_test = all_df_expanded.iloc[0]['vector_tokenized']

print(one_hot_test)
ret = vec.get_vecs_by_tokens(voc.lookup_tokens(one_hot_test))
print(ret.shape)
#print(ret[0])
print(vector_tokenized_test)
ret = vec.get_vecs_by_tokens(vector_tokenized_test)
print(ret.shape)
#print(ret[0])



[8, 318, 351, 3, 5, 153, 140, 51, 403, 8, 6, 1288, 691, 3, 911, 2, 800, 11, 90, 80, 14, 22, 22, 544, 47, 1204, 2, 618, 27, 42, 91, 6, 355, 618, 543, 330, 186, 472, 43, 22, 109, 544, 11, 108, 80, 2, 14, 31, 61, 684, 1187, 6, 59, 3, 1483, 11, 672, 913, 170, 14, 343, 5, 52, 12879, 1093, 1561, 3, 911, 2, 800, 11, 90, 80, 14, 31, 22, 201, 226, 27, 12, 6, 7105, 2315, 970, 15, 47, 867, 14, 4, 333, 5, 54, 6, 179, 3, 469, 51, 1790, 23, 472, 2, 4, 458, 5, 1, 370, 763, 11, 1159, 2, 727, 3, 15, 406, 3617, 50, 2128, 974, 134, 90, 357, 13, 37, 497, 60, 456, 346, 768, 60, 60, 60, 456, 122, 846, 9, 37, 1111, 9, 37, 728, 9, 34, 2050, 9, 34, 232, 9, 34, 518, 760, 9, 13, 34, 2, 697, 34, 104, 94, 26, 152, 986, 57, 1165, 248, 5, 286, 58, 11548, 73, 11323, 868, 2060, 671, 102, 245, 957, 286, 58, 124, 8, 318, 351, 3, 5, 2023, 2237, 57, 8, 420, 327, 543, 2, 977, 327, 769, 163, 258, 26, 3, 430, 265, 1216, 42, 91, 47, 355, 618, 543, 330, 186, 472, 281, 26, 16, 281, 26, 3, 986, 57, 30, 58, 57, 377, 26, 14, 31, 1
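
The lookup above returns one vector per token, so a padded note yields a matrix with `token_max` rows. The sketch below mimics that behaviour with a plain dict. It assumes that tokens missing from the pretrained table, which probably includes `<pad>`, fall back to a zero vector, which is our understanding of the torchtext default; the table values and the helper name `vectors_for_tokens` are made up.

```python
def vectors_for_tokens(tokens, table, dim):
    """Return one vector per token; unknown tokens get a zero vector."""
    zero = [0.0] * dim
    return [list(table.get(tok, zero)) for tok in tokens]

# Tiny stand-in for a pretrained embedding table (made-up values).
table = {"chest": [0.1, 0.2, 0.3], "pain": [0.4, 0.5, 0.6]}

matrix = vectors_for_tokens(["chest", "pain", "<pad>"], table, dim=3)
print(len(matrix), len(matrix[0]))
print(matrix[-1])
```

Under this assumption the padding rows contribute nothing to sums or averages over the note, but they still count toward the sequence length.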

Testing the FastText embedding.

In [60]:
from torchtext.vocab import vocab

vec = torchtext.vocab.FastText()

one_hot_test = all_df_expanded.iloc[0]['one_hot']

print(one_hot_test)
ret = vec.get_vecs_by_tokens(voc.lookup_tokens(one_hot_test))
print(ret.shape)
print(vector_tokenized_test)
ret = vec.get_vecs_by_tokens(vector_tokenized_test)
print(ret.shape)

[8, 318, 351, 3, 5, 153, 140, 51, 403, 8, 6, 1288, 691, 3, 911, 2, 800, 11, 90, 80, 14, 22, 22, 544, 47, 1204, 2, 618, 27, 42, 91, 6, 355, 618, 543, 330, 186, 472, 43, 22, 109, 544, 11, 108, 80, 2, 14, 31, 61, 684, 1187, 6, 59, 3, 1483, 11, 672, 913, 170, 14, 343, 5, 52, 12879, 1093, 1561, 3, 911, 2, 800, 11, 90, 80, 14, 31, 22, 201, 226, 27, 12, 6, 7105, 2315, 970, 15, 47, 867, 14, 4, 333, 5, 54, 6, 179, 3, 469, 51, 1790, 23, 472, 2, 4, 458, 5, 1, 370, 763, 11, 1159, 2, 727, 3, 15, 406, 3617, 50, 2128, 974, 134, 90, 357, 13, 37, 497, 60, 456, 346, 768, 60, 60, 60, 456, 122, 846, 9, 37, 1111, 9, 37, 728, 9, 34, 2050, 9, 34, 232, 9, 34, 518, 760, 9, 13, 34, 2, 697, 34, 104, 94, 26, 152, 986, 57, 1165, 248, 5, 286, 58, 11548, 73, 11323, 868, 2060, 671, 102, 245, 957, 286, 58, 124, 8, 318, 351, 3, 5, 2023, 2237, 57, 8, 420, 327, 543, 2, 977, 327, 769, 163, 258, 26, 3, 430, 265, 1216, 42, 91, 47, 355, 618, 543, 330, 186, 472, 281, 26, 16, 281, 26, 3, 986, 57, 30, 58, 57, 377, 26, 14, 31, 1