This notebook supports the data preparation for our CS 598 DLH project.  The paper we have chosen for the reproducibility project is:
***Ensembling Classical Machine Learning and Deep Learning Approaches for Morbidity Identification from Clinical Notes***

This notebook creates the multiple embedding formats described in the study.

 

The data cannot be shared publicly because of the agreements required to obtain it, so we store the data locally and do not commit it to GitHub.

We are only creating embeddings for data that includes stop words.

In [None]:
#pip install torchtext

In [None]:
import pandas as pd
import numpy as np

#No trailing slash: the file names below already start with '/'
DATA_PATH = './obesity_data'

#Field to tokenize on
#tokenize_field = 'lower_text'
tokenize_field = 'tok_lem_text'
isTokenized = True

#Don't need to do this for the one with no stop words
alldocs_df = pd.read_pickle(DATA_PATH + '/alldocs_df.pkl')
allannot_df= pd.read_pickle(DATA_PATH + '/allannot_df.pkl')


alldocs_df['sentence_count'] = alldocs_df['sentence_tokenized'].apply(lambda x: len(x))
sentence_max = np.max(alldocs_df['sentence_count'])
print('Max Sentences:', sentence_max)

if isTokenized:
    alldocs_df['word_count'] = alldocs_df[tokenize_field].apply(lambda x: len(x))
else:
    alldocs_df['word_count'] = alldocs_df[tokenize_field].apply(lambda x: len(x.split()))

#Summary statistics of the token counts per note
df_print = pd.DataFrame()
df_print['Min'] = [np.min(alldocs_df['word_count'])]
df_print['Mean'] = [np.mean(alldocs_df['word_count'])]
df_print['Max'] = [np.max(alldocs_df['word_count'])]
df_print['Std'] = [np.std(alldocs_df['word_count'])]
df_print['MeanPlusStd'] = round(df_print['Mean'] + df_print['Std'],0)
#Cap each note at the mean plus one standard deviation of the token counts
token_max = int(round(np.max(df_print['MeanPlusStd']),0))

print(df_print)
print('Max Tokens:',token_max)
print('All:', sum(alldocs_df['word_count'] > token_max), "out of", len(alldocs_df))



We split each of these larger text blocks into 2 notes of at most `token_max` tokens.  Note that 4 notes (1 in test and 3 in train) are longer than 2 × `token_max` tokens, so the tokens beyond that point are dropped.  For now we ignore this, but we may want to handle it later (either by looping or by using left/middle/right splits).
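
The "loop" option mentioned above could be sketched as follows. This is only an illustration on a toy token list, not the splitting code used in this notebook; the helper name `split_into_chunks` and the example tokens are hypothetical.

```python
from typing import List


def split_into_chunks(tokens: List[str], max_len: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # Step through the list in strides of max_len; the last chunk may be shorter.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


# Toy example: 7 tokens with a limit of 3 tokens per chunk.
example = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]
chunks = split_into_chunks(example, 3)
print([len(c) for c in chunks])  # [3, 3, 1]
```

With this approach no tokens are dropped, at the cost of producing more than two rows for very long notes.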

In [None]:
alldocs_df_ok = alldocs_df[alldocs_df['word_count'] <= token_max].copy()
alldocs_df_large_left = alldocs_df[alldocs_df['word_count'] > token_max].copy()
alldocs_df_large_right = alldocs_df[alldocs_df['word_count'] > token_max].copy()

#Get the left words and the right words, then concatenate all 3 and recalculate
#The left half keeps tokens [0, token_max) and the right half keeps [token_max, 2*token_max),
#so no token is lost at the boundary
if isTokenized:
    alldocs_df_large_left[tokenize_field] = alldocs_df_large_left[tokenize_field].apply(lambda x: x[:token_max])
    alldocs_df_large_right[tokenize_field] = alldocs_df_large_right[tokenize_field].apply(lambda x: x[token_max:(2*token_max)])
else:
    alldocs_df_large_left[tokenize_field] = alldocs_df_large_left[tokenize_field].apply(lambda x: ' '.join(x.split()[:token_max]))
    alldocs_df_large_right[tokenize_field] = alldocs_df_large_right[tokenize_field].apply(lambda x: ' '.join(x.split()[token_max:(2*token_max)]))

alldocs_df_expanded = pd.concat([alldocs_df_ok,alldocs_df_large_right,alldocs_df_large_left])

if isTokenized:
    alldocs_df_expanded['word_count'] = alldocs_df_expanded[tokenize_field].apply(lambda x: len(x))
else:
    alldocs_df_expanded['word_count'] = alldocs_df_expanded[tokenize_field].apply(lambda x: len(x.split()))

print('All:', sum(alldocs_df_expanded['word_count'] > token_max), "out of", len(alldocs_df_expanded))

We need to encode each note as a sequence of vocabulary indices (stored in the `one_hot` column) and pad it with the index of the padding token `<pad>` up to `token_max`.
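
As a toy illustration of this encoding, and not the torchtext code used in this notebook, the sketch below builds a small vocabulary by hand with `<pad>` at index 0 and pads each index sequence to a fixed length. The names `build_toy_vocab` and `encode_and_pad_toy` and the example documents are hypothetical. Indices here are assigned in order of first appearance, which may differ from the ordering torchtext produces.

```python
from typing import Dict, List

PAD_TOKEN = "<pad>"


def build_toy_vocab(docs: List[List[str]]) -> Dict[str, int]:
    """Map each token to an integer index, reserving index 0 for the pad token."""
    vocab = {PAD_TOKEN: 0}
    for doc in docs:
        for token in doc:
            # Assign the next free index the first time a token is seen.
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_and_pad_toy(vocab: Dict[str, int], tokens: List[str], max_len: int) -> List[int]:
    """Convert tokens to indices and right-pad with the pad index up to max_len."""
    indices = [vocab[t] for t in tokens]
    # A negative multiplier yields an empty list, so long inputs are left unpadded.
    return indices + [vocab[PAD_TOKEN]] * (max_len - len(indices))


toy_docs = [["chest", "pain"], ["no", "chest", "pain", "today"]]
toy_vocab = build_toy_vocab(toy_docs)
print(encode_and_pad_toy(toy_vocab, toy_docs[0], 4))  # [1, 2, 0, 0]
```

The function returns a new list rather than extending its input, so the original token list is left unchanged.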

In [None]:
from typing import Union, Iterable
import torchtext, torch, torch.nn.functional as F
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator


##Words
if isTokenized:
    voc = build_vocab_from_iterator(alldocs_df_expanded[tokenize_field].to_list(), specials = ['<pad>'])
else:
    corpus = alldocs_df_expanded[tokenize_field]
    tokenizer = get_tokenizer("basic_english")
    tokens = [tokenizer(doc) for doc in corpus]
    voc = build_vocab_from_iterator(tokens, specials = ['<pad>'])

#need to create the index encoding but add <pad> to reach token_max
def encode_and_pad(vocab, input_tokens, token_max):
    pad_zeros = token_max - len(input_tokens)
    result = vocab.lookup_indices(input_tokens)
    if pad_zeros > 0:
        #pad with the index of <pad> (0, because it is the first special token)
        result.extend([vocab['<pad>']] * pad_zeros)
    return result

#need to create tokens add <pad> to reach token_max
def token_and_pad(vocab, input_tokens, token_max):
    pad_zeros = token_max - len(input_tokens)
    #copy so the padding does not modify the list stored in the dataframe
    result = list(input_tokens)
    if pad_zeros > 0:
        result.extend(['<pad>'] * pad_zeros)
    return result

#need to create tokens add '\n' to reach sentence_max
def token_and_pad_sentence(input_sentences, sentence_max):
    pad_spaces = sentence_max - len(input_sentences)
    #copy so the padding does not modify the list stored in the dataframe
    result = list(input_sentences)
    if pad_spaces > 0:
        result.extend(['\n'] * pad_spaces)
    return result

if isTokenized:
    alldocs_df_expanded['one_hot'] = alldocs_df_expanded[tokenize_field].apply(lambda x: encode_and_pad(voc, x, token_max))
    alldocs_df_expanded['vector_tokenized'] = alldocs_df_expanded[tokenize_field].apply(lambda x: token_and_pad(voc, x, token_max))
else:
    #use the same tokenizer that built the vocabulary so every token has an index
    alldocs_df_expanded['one_hot'] = alldocs_df_expanded[tokenize_field].apply(lambda x: encode_and_pad(voc, tokenizer(x), token_max))
    alldocs_df_expanded['vector_tokenized'] = alldocs_df_expanded[tokenize_field].apply(lambda x: token_and_pad(voc, tokenizer(x), token_max))

alldocs_df_expanded['sentence_tokenized'] = alldocs_df_expanded['sentence_tokenized'].apply(lambda x: token_and_pad_sentence(x, sentence_max))



print(alldocs_df_expanded.iloc[0][tokenize_field])
print(alldocs_df_expanded.iloc[0]['one_hot'])
print(alldocs_df_expanded.iloc[0]['vector_tokenized'])
print(alldocs_df_expanded.iloc[0]['sentence_tokenized'])



Join the documents with their associated annotations.  Verify that the number of records is the same.
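
One way to reason about the record counts is to compute how many rows an inner join on `id` should produce: for each `id`, the number of annotation rows times the number of document rows. The sketch below does this on toy id lists. The helper name `expected_merge_rows` and the example ids are hypothetical, and it assumes the join key is a single `id` column as in the merge in this notebook.

```python
from collections import Counter
from typing import Hashable, Iterable


def expected_merge_rows(annot_ids: Iterable[Hashable], doc_ids: Iterable[Hashable]) -> int:
    """Row count of an inner join on id: sum over ids of (annotation rows x document rows)."""
    annot_counts = Counter(annot_ids)
    doc_counts = Counter(doc_ids)
    # Counter returns 0 for ids that have no document, so those annotations drop out.
    return sum(n * doc_counts[i] for i, n in annot_counts.items())


# Toy example: ids 1 and 2 each have two annotations and id 3 has one;
# note 2 was split into two rows and note 3 has no document.
toy_annot_ids = [1, 1, 2, 2, 3]
toy_doc_ids = [1, 2, 2]
print(expected_merge_rows(toy_annot_ids, toy_doc_ids))  # 6
```

Applied to the `id` columns of the annotation and document dataframes, the result can be compared with the length of the merged dataframe.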

In [None]:
all_df = pd.merge(allannot_df,alldocs_df, on='id')
all_df_expanded= pd.merge(allannot_df,alldocs_df_expanded,on='id')



print("All:", len(allannot_df), len(alldocs_df), len(all_df))
print("All Expanded:", len(allannot_df), len(alldocs_df_expanded), len(all_df_expanded))



Try to validate that the numbers are close to those in the original paper.  The counts are higher, presumably because each split note now contributes two rows, but the percentage occurrence of each disease doesn't change much, so we are good to use the expanded set.

In [None]:
#Count of annotations per disease and the fraction of them judged positive
def disease_summary(df):
    counts = df['disease'].value_counts().sort_index()
    positives = df[df['judgment']==True]['disease'].value_counts().sort_index()
    return pd.concat([counts, positives/counts], axis=1)

df_before = disease_summary(all_df)
df_after = disease_summary(all_df_expanded)
df_all = pd.concat([df_before,df_after], axis=1)

df_all


![Note occurrences](images/note_occurrences.gif)


Save the final test/train dataset.

In [None]:
all_df.to_pickle(DATA_PATH + '/all_df.pkl') 
all_df_expanded.to_pickle(DATA_PATH + '/all_df_expanded.pkl') 
#corpus.to_pickle(DATA_PATH + '/corpus.pkl')
torch.save(voc, DATA_PATH + '/voc.obj')
torch.save((token_max, sentence_max), DATA_PATH + '/counts.obj')

Testing the GloVe embedding.

In [None]:
from torchtext.vocab import vocab


vec = torchtext.vocab.GloVe(name='6B', dim=300)

one_hot_test = all_df_expanded.iloc[0]['one_hot']
vector_tokenized_test = all_df_expanded.iloc[0]['vector_tokenized']

print(one_hot_test)
ret = vec.get_vecs_by_tokens(voc.lookup_tokens(one_hot_test))
print(ret.shape)
#print(ret[0])
print(vector_tokenized_test)
ret = vec.get_vecs_by_tokens(vector_tokenized_test)
print(ret.shape)
#print(ret[0])



Testing the FastText embedding.

In [None]:
from torchtext.vocab import vocab

vec = torchtext.vocab.FastText()

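one_hot_test = all_df_expanded.iloc[0]['one_hot']
#defined here as well so this cell does not depend on the GloVe cell having run
vector_tokenized_test = all_df_expanded.iloc[0]['vector_tokenized']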

print(one_hot_test)
ret = vec.get_vecs_by_tokens(voc.lookup_tokens(one_hot_test))
print(ret.shape)
print(vector_tokenized_test)
ret = vec.get_vecs_by_tokens(vector_tokenized_test)
print(ret.shape)