This notebook supports the data preparation for our CS 598 DLH project.  The paper we have chosen for the reproducibility project is:
***Ensembling Classical Machine Learning and Deep Learning Approaches for Morbidity Identification from Clinical Notes***

Abstract:  The main goal of the paper is to identify morbidities from clinical notes.  The idea was to use a combination of classical and deep learning methods to determine the best approach for classifying these notes into one or more of 16 morbidity conditions.  These models used a combination of NLP techniques, including embeddings and bag-of-words implementations.  The paper also measured the effect of including stop words.  Lastly, it used ensemble techniques to tie together a number of the classical and deep learning models to provide the most accurate results.

The dataset was retrieved from the DBMI Data Portal of the Department of Biomedical Informatics (DBMI) in the Blavatnik Institute at Harvard Medical School.  This dataset was originally created for the i2b2 Obesity Challenge conducted in 2008.
The data was provided in XML format with a test and a training set.  Along with the test and training sets, labeled data of two forms were included, called Intuitive and Textual.  Textual judgments were derived by multiple experts looking at the notes.  When the experts didn't agree, a resident doctor annotated the record with an Intuitive judgment.

In this workbook, we are taking the following steps:

* Loading test and train data along with annotations
* Exploring the best annotation data sets to use
* Preprocessing the data using NLP techniques described below
* Saving the data as pkl files for use in additional notebooks

In [None]:
pip install xmltodict


The data cannot be shared publicly because of the agreements required to obtain it, so we store the data locally and do not put it on GitHub.

In [None]:
DATA_PATH = './obesity_data/'

from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

Next we create a function to load the data from XML files and convert to a more usable dataframe structure.
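
As a dependency-free illustration of the structure the annotation loader has to flatten, the sketch below parses a tiny made-up annotation document with the standard library's `xml.etree.ElementTree`. The element and attribute names (`diseaseset`, `diseases`, `source`, `disease`, `name`, `doc`, `id`, `judgment`) are taken from the keys the loading code reads; the sample values and the helper name `flatten_annotations` are assumptions for illustration only, not the notebook's actual loader.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like the keys the loader reads; not real challenge data.
SAMPLE_XML = """
<diseaseset>
  <diseases source="intuitive">
    <disease name="Asthma">
      <doc id="1" judgment="Y"/>
      <doc id="2" judgment="N"/>
    </disease>
    <disease name="CHF">
      <doc id="1" judgment="Q"/>
    </disease>
  </diseases>
</diseaseset>
"""

def flatten_annotations(xml_text):
    """Return one dict per (source, disease, document) annotation."""
    root = ET.fromstring(xml_text)
    rows = []
    for diseases in root.findall("diseases"):
        for disease in diseases.findall("disease"):
            for doc in disease.findall("doc"):
                rows.append({
                    "source": diseases.get("source"),
                    "disease": disease.get("name"),
                    "id": int(doc.get("id")),
                    "judgment": doc.get("judgment"),
                })
    return rows

rows = flatten_annotations(SAMPLE_XML)
for row in rows:
    print(row)
```

`findall` always returns a list, so a disease with a single `doc` needs no special case, unlike `xmltodict`, which returns a dict rather than a list for a lone child node.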

In [None]:
import pandas as pd
import xmltodict

def load_dataset(filepath, xpath):
    # Read the patient records selected by the XPath expression into a dataframe
    return pd.read_xml(filepath, xpath=xpath)

def _as_list(node):
    # xmltodict returns a dict for a single child node and a list for repeated
    # nodes; wrap single nodes so both cases can be looped over the same way.
    return node if isinstance(node, list) else [node]

def load_annotations(filepath):
    with open(filepath, "r") as f:
        data = xmltodict.parse(f.read())['diseaseset']['diseases']

    # Collect one row per (source, disease, document) annotation
    rows = []
    for diseases in _as_list(data):
        source = diseases['@source']
        for disease in _as_list(diseases['disease']):
            disease_name = disease['@name']
            for doc in _as_list(disease['doc']):
                rows.append({"source": source,
                             "disease": disease_name,
                             "id": doc['@id'],
                             "judgment": doc['@judgment']})

    df = pd.DataFrame(rows, columns=['source', 'disease', 'id', 'judgment'])

    # Drop any exact duplicate annotations
    return df.drop_duplicates()

Now we load the test and train datasets and examine the notes. Note that we keep the training data combined with the second training file (`obesity_patient_records_training2.xml`) in a separate data frame (`train_df_with2`), as that file relates to the addendums, which we believe were not used by the paper.

In [None]:
test_df = load_dataset(DATA_PATH + 'obesity_patient_records_test.xml', xpath='/root/docs/*')
test_df['id'] = pd.to_numeric(test_df['id'])
print(test_df.head())
print(len(test_df))

train_df = load_dataset(DATA_PATH + 'obesity_patient_records_training.xml', xpath='/root/docs/*')
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
train_df_with2 = pd.concat([train_df, load_dataset(DATA_PATH + 'obesity_patient_records_training2.xml', xpath='/root/docs/*')])
train_df['id'] = pd.to_numeric(train_df['id'])
train_df_with2['id'] = pd.to_numeric(train_df_with2['id'])
print(train_df.head())
print(len(train_df))
print(len(train_df_with2))

print(test_df['text'][0])

The annotation data came in two forms: textual and intuitive.  It was also provided both as separate files per form and as a single file containing all forms.  We do some exploration to determine which set of data is the closest to the study.

In [None]:
test_annot_intuitive_df = load_annotations(DATA_PATH + "obesity_standoff_annotations_test_intuitive.xml")
test_annot_intuitive_df['id'] = pd.to_numeric(test_annot_intuitive_df['id'])

test_annot_textual_df = load_annotations(DATA_PATH + "obesity_standoff_annotations_test_textual.xml")
test_annot_textual_df['id'] = pd.to_numeric(test_annot_textual_df['id'])

print(test_annot_intuitive_df.head())
print(len(test_annot_intuitive_df))

print(test_annot_textual_df.head())
print(len(test_annot_textual_df))


The test file with all forms is explored, and the record count seems to be the same as when combining the separate files.

In [None]:
#trying to verify the same number of records in the one with both intuitive and textual
test_annot_all_df = load_annotations(DATA_PATH + "obesity_standoff_annotations_test.xml")
test_annot_all_df['id'] = pd.to_numeric(test_annot_all_df['id'])

print(test_annot_all_df.head())
print(len(test_annot_all_df))

We then do the same analysis with the training annotations.

In [None]:
train_annot_intuitive_df = load_annotations(DATA_PATH + "obesity_standoff_intuitive_annotations_training.xml")
train_annot_intuitive_df['id'] = pd.to_numeric(train_annot_intuitive_df['id'])
train_annot_textual_df = load_annotations(DATA_PATH + "obesity_standoff_textual_annotations_training.xml")
train_annot_textual_df['id'] = pd.to_numeric(train_annot_textual_df['id'])

print(train_annot_intuitive_df.head())
print(len(train_annot_intuitive_df))

print(train_annot_textual_df.head())
print(len(train_annot_textual_df))

When we look at the full file with addendums, we see there is a lot more data than in the separate files.

In [None]:
#trying to verify the same number of records in the one with both intuitive and textual (It isn't according to tally.pdf it should be 22285 with the annotations and addendums)
train_annot_all_df = load_annotations(DATA_PATH + "obesity_standoff_annotations_training.xml")
train_annot_all_df_with2 = pd.concat([train_annot_all_df,load_annotations(DATA_PATH + "obesity_standoff_annotations_training_addendum.xml")])
train_annot_all_df_with2 = pd.concat([train_annot_all_df_with2,load_annotations(DATA_PATH + "obesity_standoff_annotations_training_addendum2.xml")])
train_annot_all_df_with2 = pd.concat([train_annot_all_df_with2,load_annotations(DATA_PATH + "obesity_standoff_annotations_training_addendum3.xml")])

train_annot_all_df['id'] = pd.to_numeric(train_annot_all_df['id'])
train_annot_all_df_with2['id'] = pd.to_numeric(train_annot_all_df_with2['id'])

print(train_annot_all_df.head())
print(len(train_annot_all_df))
print(len(train_annot_all_df_with2))


We are going to start with the annotations in one file (test_annot_all_df, train_annot_all_df).  The paper only used annotations that were clearly marked 'Y' or 'N' (It excluded the 'Q' and 'U').
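
A minimal plain-Python sketch of this filter, assuming annotations are available as (disease, id, judgment) tuples. The rows below are made up for illustration, and the names `annotations`, `KEEP`, `clean` and `per_disease` are hypothetical.

```python
from collections import Counter

# Hypothetical annotation rows: (disease, document id, judgment).
annotations = [
    ("Asthma", 1, "Y"),
    ("Asthma", 2, "N"),
    ("Asthma", 3, "Q"),
    ("CHF", 1, "U"),
    ("CHF", 2, "Y"),
]

KEEP = {"Y", "N"}  # judgments retained; 'Q' and 'U' are excluded

# Keep only clearly marked judgments, then count what is left per disease.
clean = [row for row in annotations if row[2] in KEEP]
per_disease = Counter(disease for disease, _, _ in clean)

print(len(annotations), len(clean))
print(dict(per_disease))
```

In pandas, the same filter can also be written as `df[df['judgment'].isin(['Y', 'N'])]`.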

In [None]:
print(len(test_annot_all_df),len(train_annot_all_df))
test_annot_all_df_clean = test_annot_all_df[(test_annot_all_df['judgment']  == 'Y') | (test_annot_all_df['judgment']  == 'N')]
train_annot_all_df_clean = train_annot_all_df[(train_annot_all_df['judgment']  == 'Y') | (train_annot_all_df['judgment']  == 'N')]
print(len(test_annot_all_df_clean),len(train_annot_all_df_clean))


In [None]:
print(test_annot_all_df_clean.groupby('disease').size())


In [None]:

print(train_annot_all_df_clean.groupby('disease').size())


The paper specifically calls out 6 files and does not mention the addendums, so we will stick with the separately labeled files for our study. There seems to be only one record in each of the test and training sets where the textual and intuitive judgments disagree.
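
The disagreement check amounts to an inner join on (id, disease) followed by a comparison of the two judgments. A minimal sketch in plain Python, assuming each annotation set is held as a dict keyed by (id, disease); the judgments and the helper name `find_disagreements` are made up for illustration.

```python
# Hypothetical judgments keyed by (document id, disease).
intuitive = {(101, "OA"): "Y", (101, "CHF"): "N", (102, "OA"): "N"}
textual = {(101, "OA"): "N", (101, "CHF"): "N", (103, "OA"): "Y"}

def find_disagreements(a, b):
    """Return keys present in both mappings whose judgments differ."""
    shared = a.keys() & b.keys()
    return sorted(key for key in shared if a[key] != b[key])

conflicts = find_disagreements(intuitive, textual)
print(conflicts)
```

Keys that appear in only one of the two sets are ignored, which matches the behaviour of an inner merge.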

In [None]:
test_annot_intuitive_df_clean = test_annot_intuitive_df[(test_annot_intuitive_df['judgment']  == 'Y') | (test_annot_intuitive_df['judgment']  == 'N')]
test_annot_textual_df_clean = test_annot_textual_df[(test_annot_textual_df['judgment']  == 'Y') | (test_annot_textual_df['judgment']  == 'N')]
train_annot_intuitive_df_clean = train_annot_intuitive_df[(train_annot_intuitive_df['judgment']  == 'Y') | (train_annot_intuitive_df['judgment']  == 'N')]
train_annot_textual_df_clean = train_annot_textual_df[(train_annot_textual_df['judgment']  == 'Y') | (train_annot_textual_df['judgment']  == 'N')]

print(len(test_annot_intuitive_df_clean))
print(len(test_annot_textual_df_clean))


df = test_annot_intuitive_df_clean.merge(test_annot_textual_df_clean, on=['id','disease'])
print(df[df['judgment_x'] != df['judgment_y']])

print(len(train_annot_intuitive_df_clean))
print(len(train_annot_textual_df_clean))

df = train_annot_intuitive_df_clean.merge(train_annot_textual_df_clean, on=['id','disease'])
print(df[df['judgment_x'] != df['judgment_y']])


Let's remove those two records from the textual table and recheck.

In [None]:
df = test_annot_textual_df_clean
df = df.reset_index()
index_names = df[(df['disease'] == 'OA') & (df['id'] == 8)].index
test_annot_textual_df_clean = df.drop(index_names)

df = train_annot_textual_df_clean
df = df.reset_index()
index_names = df[(df['disease'] == 'CHF') & (df['id'] == 1072)].index
train_annot_textual_df_clean = df.drop(index_names)

print(len(test_annot_textual_df_clean))
df = test_annot_intuitive_df_clean.merge(test_annot_textual_df_clean, on=['id','disease'])
print(df[df['judgment_x'] != df['judgment_y']])

print(len(train_annot_textual_df_clean))
df = train_annot_intuitive_df_clean.merge(train_annot_textual_df_clean, on=['id','disease'])
print(df[df['judgment_x'] != df['judgment_y']])


The paper builds a classification model for each disease separately, so we need to be able to loop through each one. Using the separate files seems to come closer to the disease counts the paper reports before pre-processing, so we will use this data. The study must have done some additional processing that is not evident from the paper, so our results may be a little different.
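
One way to picture the per-disease loop is to group the annotated documents by disease, so that each disease gets its own list of (text, label) pairs. The sketch below uses plain Python and made-up inputs; `docs`, `annotations` and `build_disease_datasets` are hypothetical names, not part of the notebook.

```python
from collections import defaultdict

# Hypothetical inputs: document texts by id, and (disease, id, label) annotations.
docs = {1: "patient has asthma", 2: "no acute distress", 3: "history of chf"}
annotations = [
    ("Asthma", 1, True),
    ("Asthma", 2, False),
    ("CHF", 3, True),
    ("CHF", 4, False),  # no matching document, so it is skipped
]

def build_disease_datasets(docs, annotations):
    """Group (text, label) pairs by disease, keeping only annotated documents."""
    datasets = defaultdict(list)
    for disease, doc_id, label in annotations:
        if doc_id in docs:
            datasets[disease].append((docs[doc_id], label))
    return dict(datasets)

datasets = build_disease_datasets(docs, annotations)
for disease, pairs in datasets.items():
    print(disease, len(pairs))
```

Each value in `datasets` could then be handed to a separate binary classifier for that disease.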

In [None]:
disease_list = train_annot_intuitive_df_clean['disease'].unique().tolist()

train_annot_all_df_clean = pd.concat([train_annot_intuitive_df_clean,train_annot_textual_df_clean])
train_annot_all_df_clean = train_annot_all_df_clean.drop(['source','index'], axis=1)
train_annot_all_df_clean = train_annot_all_df_clean.drop_duplicates()

test_annot_all_df_clean = pd.concat([test_annot_intuitive_df_clean,test_annot_textual_df_clean])
test_annot_all_df_clean = test_annot_all_df_clean.drop(['source','index'], axis=1)
test_annot_all_df_clean = test_annot_all_df_clean.drop_duplicates()

annot_all_df_clean = pd.concat([train_annot_all_df_clean,test_annot_all_df_clean])
annot_all_df_clean = annot_all_df_clean.drop_duplicates()
samples = 0

for disease in disease_list:
  print('Disease:',disease)
  print('Train:',sum(train_annot_intuitive_df_clean['disease'] == disease),
        sum(train_annot_textual_df_clean['disease'] == disease),
        sum(train_annot_all_df_clean['disease'] == disease))
  print('Test:',sum(test_annot_intuitive_df_clean['disease'] == disease),
        sum(test_annot_textual_df_clean['disease'] == disease),
        sum(test_annot_all_df_clean['disease'] == disease))  
  print('All:',sum(annot_all_df_clean['disease'] == disease))
  samples = sum(annot_all_df_clean['disease'] == disease) + samples

print("Samples:",samples)


In [None]:
allannot_df = pd.concat([train_annot_all_df_clean,test_annot_all_df_clean])
alldocs_df = pd.concat([train_df, test_df])

Datasets to use for the rest of the study:
* alldocs_df [id,text] (document, clinical notes)
* allannot_df [disease,id,judgment,index] (disease, document, judgment)

For the annotations, we convert the judgment to a Boolean label (True for 'Y', False for 'N').

Our next step is to continue preprocessing the data.  We want to do this separately from the annotations; they can be joined when doing classification for each disease. This includes:

* Lower-casing of the text
* Removing punctuation and numeric values from the text
* Tokenization of text 
* Lemmatization of the tokens
* TF-IDF matrix generation (TfidfVectorizer from the scikit-learn library)
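
To make the TF-IDF step in the list above concrete, here is a small hand-rolled sketch on made-up notes. It uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults, but tokenisation here is a plain whitespace split, so treat it as an illustration rather than a drop-in replacement. All names (`docs`, `doc_freq`, `tfidf_vector`) are hypothetical.

```python
import math
from collections import Counter

docs = ["chest pain and asthma", "asthma controlled", "no chest pain"]  # made-up notes
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(t for doc in tokenized for t in set(doc))
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalised TF-IDF weights for one document, ordered by vocab."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

matrix = [tfidf_vector(doc) for doc in tokenized]
print(vocab)
print([round(w, 3) for w in matrix[1]])
```

Terms that occur in fewer documents (such as "controlled" here) receive a larger IDF than terms shared across documents (such as "asthma").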

We have an optional parameter to remove stop words as the paper discusses the fact that stop words should be included for deep learning models.
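
The cleaning steps and the optional stop-word switch can be sketched without any external libraries. The function below is an assumption-laden illustration: `clean_text` and `STOP_WORDS` are hypothetical names, the stop list is a tiny stand-in for NLTK's English list, and lemmatization is left out because it needs NLTK's WordNet data.

```python
import re

# Tiny illustrative stop list; the notebook itself uses NLTK's English list.
STOP_WORDS = {"the", "a", "of", "and", "is", "has", "no"}

def clean_text(text, remove_stop_words=False):
    """Lower-case, strip punctuation and digits, then split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    text = re.sub(r"\d+", "", text)      # drop numeric values
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

note = "The patient, aged 54, has a history of CHF and asthma."
print(clean_text(note))
print(clean_text(note, remove_stop_words=True))
```

Keeping stop-word removal behind a flag makes it easy to produce both variants of the corpus from the same raw text.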

In [None]:
#Convert Y/N to True/False
#test_annot_all_df_clean['judgment'] = test_annot_all_df_clean['judgment'] == 'Y'
#train_annot_all_df_clean['judgment'] = train_annot_all_df_clean['judgment'] == 'Y'
allannot_df['judgment'] = allannot_df['judgment'] == 'Y'

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import ExtraTreesClassifier

wn = WordNetLemmatizer()
stemmer = PorterStemmer()

import re
import string
import nltk
#nltk.download('wordnet')
#nltk.download('stopwords')

cachedStopWords = stopwords.words("english")
vectorizer = TfidfVectorizer(stop_words = cachedStopWords, max_features = 600)
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
# add param above: vocabulary=custom_vocab , where custom_vocab is the vocabulary of
# ranked features selected by applying the feature selection algorithms.

In [None]:
import re

def cleanword(word):
    word = word.lower()

    word = ''.join([i for i in word if not i.isdigit()])
    # Escape the backslash explicitly; "\]" is an invalid escape sequence in newer Python versions
    symbols = "|,!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"
    word = word.translate(str.maketrans('', '', symbols))
    word = stemmer.stem(word)
    word = wn.lemmatize(word)

    return word

def cleansentence(sentence):
    sentence = ' '.join([cleanword(word) for word in sentence.split()])
    sentence = re.sub(' +', ' ', sentence).strip()
    return sentence


##Sentences - we can't do data cleansing until after sentence tokenized
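# Series.replace only matches whole cell values, so use .str.replace to act inside each note;
# newlines become spaces so that words on adjacent lines are not glued together
alldocs_df['sentence_tokenized'] = alldocs_df['text'].str.replace("\n", " ", regex=False)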
alldocs_df['sentence_tokenized'] = alldocs_df['sentence_tokenized'].apply(lambda x: sent_tokenize(x)) # this is a list of sentences
alldocs_df['sentence_tokenized'] = alldocs_df['sentence_tokenized'].apply(lambda lst:[cleansentence(sentence) for sentence in lst]) # this is a list of sentences

##Trying a different approach to cleansing for embeddings, treat text as one big sentence
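# Series.replace only matches whole cell values, so use .str.replace to act inside each note
alldocs_df['word_tokenized'] = alldocs_df['text'].str.replace("\n", " ", regex=False)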
alldocs_df['word_tokenized'] = alldocs_df['word_tokenized'].apply(lambda x: cleansentence(x))
alldocs_df['word_tokenized'] = alldocs_df.apply(lambda row: word_tokenize(row['word_tokenized']), axis=1)


In [None]:
def preparetext(df, removestopwords = False):

    ndf = df.copy()

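    # regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal by default
    ndf["no_punc_text"] = ndf['text'].str.replace(r'[^\w\s]', '', regex=True)
    ndf["no_numerics_text"] = ndf['no_punc_text'].str.replace(r'\d+', '', regex=True)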
    ndf["lower_text"] = ndf['no_numerics_text'].apply(str.lower)

    #this has a side effect of getting rid of all of the carriage returns, etc.
    if removestopwords:
         ndf['lower_text'] =  ndf['lower_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (cachedStopWords)]))
    else:
         ndf['lower_text'] =  ndf['lower_text'].apply(lambda x: ' '.join([word for word in x.split()]))

    ndf["tokenized_text"] = ndf.apply(lambda row: word_tokenize(row['lower_text']), axis=1)

    ndf["tok_lem_text"] = ndf['tokenized_text'].apply(
        lambda lst:[wn.lemmatize(word) for word in lst])
    
    #X = vectorizer.fit_transform(ndf['lower_text'])
    #print(X[0])

    return ndf

Create separate dataframes with stop words included and removed.

In [None]:
alldocs_df = preparetext(alldocs_df, removestopwords=False)
#print(alldocs_df['tok_lem_text'][0])
alldocs_df_ns = preparetext(alldocs_df, removestopwords=True)
#print(alldocs_df_ns['tok_lem_text'][0])

In [None]:
def apply_feature_selection(df):
    # apply ExtraTreesClassifier
    #extra_tree_feature_selection = clf.
    
    # apply InfoGainAttributeEval
    
    # apply SelectKBest
    
    
    
    
    X = vectorizer.fit_transform(df['lower_text'])
    #print(X[0])
    print(X.shape) # For DL models needs to be dimension {n x 600}, with n the number of text documents (clinical records)

Save each of these data frames for use in future notebooks.

In [None]:
#test_df.to_pickle(DATA_PATH + '/test_df.pkl')
#train_df.to_pickle(DATA_PATH + '/train_df.pkl')
#test_df_ns.to_pickle(DATA_PATH + '/test_df_ns.pkl') 
#train_df_ns.to_pickle(DATA_PATH + '/train_df_ns.pkl')
#test_annot_all_df_clean.to_pickle(DATA_PATH + '/test_annot_all_df_clean.pkl') 
#train_annot_all_df_clean.to_pickle(DATA_PATH + '/train_annot_all_df_clean.pkl') 

alldocs_df.to_pickle(DATA_PATH + '/alldocs_df.pkl')
alldocs_df_ns.to_pickle(DATA_PATH + '/alldocs_df_ns.pkl')
allannot_df.to_pickle(DATA_PATH + '/allannot_df.pkl')