This notebook supports the data preparation for our CS 598 DLH project.  The paper we have chosen for the reproducibility project is:
***Ensembling Classical Machine Learning and Deep Learning Approaches for Morbidity Identification from Clinical Notes***

Abstract:  The main goal of the paper is to extract morbidity information from clinical notes.  The idea was to use a combination of classical and deep learning methods to determine the best approach for classifying these notes into one or more of 16 morbidity conditions.  These models used a combination of NLP techniques, including embeddings and bag-of-words implementations.  The paper also measured the effect of including stop words.  Lastly, it used ensemble techniques to tie together a number of the classical and deep learning models to provide the most accurate results.

The dataset was retrieved from the DBMI Data Portal of the Department of Biomedical Informatics (DBMI) in the Blavatnik Institute at Harvard Medical School.  It was originally created for the i2b2 Obesity Challenge conducted in 2008.
The data was provided in XML format with a test and a training set.  Along with these sets, labeled data of two forms was included, called Intuitive and Textual.  Textual judgments were derived by multiple experts looking at the notes.  When the experts didn’t agree, a resident doctor annotated the note with an Intuitive judgment.

In this notebook, we take the following steps:

* Loading test and train data along with annotations
* Exploring the best annotation data sets to use
* Preprocessing the data using the NLP techniques described below
* Saving the data as pkl files for use in additional notebooks

In [18]:
pip install xmltodict


Note: you may need to restart the kernel to use updated packages.


The data cannot be shared publicly because of the agreements required to obtain it, so we store it locally and do not put it on GitHub.

In [19]:
DATA_PATH = './obesity_data/'

Next we create a function to load the data from XML files and convert to a more usable dataframe structure.

In [20]:
import pandas as pd
import xmltodict

def load_dataset(filepath, xpath):
    # Read an XML file into a dataframe with one row per node matched by xpath.
    return pd.read_xml(filepath, xpath=xpath)

def load_annotations(filepath):
    # Flatten a standoff-annotation XML file into rows of
    # (source, disease, id, judgment).
    with open(filepath, "r") as f:
        data = f.read()

    df = pd.DataFrame(columns=['source', 'disease', 'id', 'judgment'])

    data = xmltodict.parse(data)['diseaseset']['diseases']

    # xmltodict returns a list for repeated nodes but a dict for a single node.
    # Iterating over a dict yields its keys (strings), so a string item means
    # the parent itself is the single node.
    for diseases_item in data:
        if isinstance(diseases_item, str):
            source = data['@source']
            disease = data['disease']
        else:
            source = diseases_item['@source']
            disease = diseases_item['disease']

        for disease_item in disease:
            if isinstance(disease_item, str):
                disease_name = disease['@name']
                doc = disease['doc']
            else:
                disease_name = disease_item['@name']
                doc = disease_item['doc']

            for doc_item in doc:
                if isinstance(doc_item, str):
                    doc_id = doc['@id']
                    judgment = doc['@judgment']
                else:
                    doc_id = doc_item['@id']
                    judgment = doc_item['@judgment']
                df_temp = pd.DataFrame([{"source": source, "disease": disease_name, "id": doc_id, "judgment": judgment}])
                df = pd.concat([df, df_temp])

    # A single node is visited once per key, producing repeated rows;
    # dropping duplicates removes them.
    return df.drop_duplicates()
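
The single-node behavior handled above arises because xmltodict returns a dict rather than a list when an element has only one child, so iterating over it yields key strings. As an alternative sketch (not what this notebook uses), the standard library's `xml.etree.ElementTree` returns child elements as a list from `findall`, which avoids the special cases. The sample XML and the function name `parse_annotations` are invented for illustration; the sketch only assumes the element and attribute names that `load_annotations` reads.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure load_annotations expects.
SAMPLE_XML = """
<diseaseset>
  <diseases source="intuitive">
    <disease name="Asthma">
      <doc id="3" judgment="Y"/>
      <doc id="5" judgment="N"/>
    </disease>
    <disease name="CAD">
      <doc id="3" judgment="N"/>
    </disease>
  </diseases>
</diseaseset>
"""

def parse_annotations(xml_text):
    """Return one dict per (source, disease, id, judgment) annotation."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    # findall returns a list even when only one child matches.
    for diseases in root.findall("diseases"):
        source = diseases.get("source")
        for disease in diseases.findall("disease"):
            for doc in disease.findall("doc"):
                rows.append({
                    "source": source,
                    "disease": disease.get("name"),
                    "id": int(doc.get("id")),
                    "judgment": doc.get("judgment"),
                })
    return rows

rows = parse_annotations(SAMPLE_XML)
for row in rows:
    print(row)
```

The resulting list of dicts could be passed to `pd.DataFrame(rows)` in a single call, which also avoids concatenating inside the loop.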

Now we load the test and train datasets and examine the notes. Note that we load the second training file (training2) into a separate dataframe, since it relates to the addendums, which we believe were not used by the paper.

In [23]:
test_df = load_dataset(DATA_PATH + 'obesity_patient_records_test.xml', xpath='/root/docs/*')
test_df['id'] = pd.to_numeric(test_df['id'])
print(test_df.head())
print(len(test_df))

train_df = load_dataset(DATA_PATH + 'obesity_patient_records_training.xml', xpath='/root/docs/*')
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
train_df_with2 = pd.concat([train_df, load_dataset(DATA_PATH + 'obesity_patient_records_training2.xml', xpath='/root/docs/*')])
train_df['id'] = pd.to_numeric(train_df['id'])
train_df_with2['id'] = pd.to_numeric(train_df_with2['id'])
print(train_df.head())
print(len(train_df))
print(len(train_df_with2))

print(test_df['text'][0])

   id                                               text
0   3  470971328 | AECH | 09071283 | | 6159055 | 5/26...
1   5  508283935 | KFM | 67491508 | | 9707967 | 9/25/...
2   7  248652055 | CM | 07563073 | | 5027467 | 8/29/2...
3   8  052907410 | FTH | 50999409 | | 7815179 | 10/6/...
4   9  628477951 | MBCH | 30737210 | | 5713924 | 12/1...
507
   id                                               text
0   1  490646815 | WMC | 31530471 | | 9629480 | 11/23...
1   2  159644670 | VH | 60656526 | | 6334749 | 11/29/...
2   4  368346277 | EMH | 64927307 | | 815098 | 3/29/1...
3   6  018858680 | AOH | 80239131 | | 9725704 | 11/4/...
4  13  908761918 | MMC | 45427009 | | 0927689 | 5/26/...
611
730
470971328 | AECH | 09071283 | | 6159055 | 5/26/2006 12:00:00 AM | PNUEMONIA | Signed | DIS | Admission Date: 4/22/2006 Report Status: Signed

Discharge Date: 7/27/2006
ATTENDING: CARINE , WALTER MD
SERVICE:
Medicine Service.
ADMISSION INFORMATION AND CHIEF COMPLAINT:
Hypoxemic respiratory failure.
HISTO

The annotation data came in two forms: textual and intuitive.  It was provided both as separate files per form and as a single file with both forms together.  We do some exploration to determine which set of data is closest to the study.

In [24]:
test_annot_intuitive_df = load_annotations(DATA_PATH + "obesity_standoff_annotations_test_intuitive.xml")
test_annot_intuitive_df['id'] = pd.to_numeric(test_annot_intuitive_df['id'])

test_annot_textual_df = load_annotations(DATA_PATH + "obesity_standoff_annotations_test_textual.xml")
test_annot_textual_df['id'] = pd.to_numeric(test_annot_textual_df['id'])

print(test_annot_intuitive_df.head())
print(len(test_annot_intuitive_df))

print(test_annot_textual_df.head())
print(len(test_annot_textual_df))


      source disease  id judgment
0  intuitive  Asthma   3        Y
0  intuitive  Asthma   5        N
0  intuitive  Asthma   7        N
0  intuitive  Asthma   9        Y
0  intuitive  Asthma  10        N
7399
    source disease  id judgment
0  textual  Asthma   3        Y
0  textual  Asthma   5        U
0  textual  Asthma   7        U
0  textual  Asthma   8        U
0  textual  Asthma   9        Y
8044


The test file containing both forms is explored next. Its record count (15443) equals the sum of the separate intuitive and textual files (7399 + 8044).

In [25]:
#trying to verify the same number of records in the one with both intuitive and textual
test_annot_all_df = load_annotations(DATA_PATH + "obesity_standoff_annotations_test.xml")
test_annot_all_df['id'] = pd.to_numeric(test_annot_all_df['id'])

print(test_annot_all_df.head())
print(len(test_annot_all_df))

      source disease  id judgment
0  intuitive  Asthma   3        Y
0  intuitive  Asthma   5        N
0  intuitive  Asthma   7        N
0  intuitive  Asthma   9        Y
0  intuitive  Asthma  10        N
15443


We then do the same analysis with the training annotations.

In [26]:
train_annot_intuitive_df = load_annotations(DATA_PATH + "obesity_standoff_intuitive_annotations_training.xml")
train_annot_intuitive_df['id'] = pd.to_numeric(train_annot_intuitive_df['id'])
train_annot_textual_df = load_annotations(DATA_PATH + "obesity_standoff_textual_annotations_training.xml")
train_annot_textual_df['id'] = pd.to_numeric(train_annot_textual_df['id'])

print(train_annot_intuitive_df.head())
print(len(train_annot_intuitive_df))

print(train_annot_textual_df.head())
print(len(train_annot_textual_df))

      source disease  id judgment
0  intuitive  Asthma   1        N
0  intuitive  Asthma   2        Y
0  intuitive  Asthma   4        N
0  intuitive  Asthma   6        N
0  intuitive  Asthma  15        N
8621
    source disease  id judgment
0  textual  Asthma   1        U
0  textual  Asthma   2        Y
0  textual  Asthma   4        U
0  textual  Asthma   6        U
0  textual  Asthma  13        U
9655


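When we add the addendum files to the combined training file, we see a lot more data. The combined file alone has 18276 records, matching the sum of the separate files (8621 + 9655), while including the addendums brings the total to 22285.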

In [27]:
# Verify the record count of the file with both intuitive and textual annotations.
# On its own it does not match tally.pdf, which gives 22285 for the annotations plus addendums.
train_annot_all_df = load_annotations(DATA_PATH + "obesity_standoff_annotations_training.xml")
train_annot_all_df_with2 = pd.concat([train_annot_all_df,load_annotations(DATA_PATH + "obesity_standoff_annotations_training_addendum.xml")])
train_annot_all_df_with2 = pd.concat([train_annot_all_df_with2,load_annotations(DATA_PATH + "obesity_standoff_annotations_training_addendum2.xml")])
train_annot_all_df_with2 = pd.concat([train_annot_all_df_with2,load_annotations(DATA_PATH + "obesity_standoff_annotations_training_addendum3.xml")])

train_annot_all_df['id'] = pd.to_numeric(train_annot_all_df['id'])
train_annot_all_df_with2['id'] = pd.to_numeric(train_annot_all_df_with2['id'])

print(train_annot_all_df.head())
print(len(train_annot_all_df))
print(len(train_annot_all_df_with2))


      source disease  id judgment
0  intuitive  Asthma   1        N
0  intuitive  Asthma   2        Y
0  intuitive  Asthma   4        N
0  intuitive  Asthma   6        N
0  intuitive  Asthma  15        N
18276
22285


We start with the annotations in the combined files (test_annot_all_df, train_annot_all_df).  The paper only used annotations that were clearly marked 'Y' or 'N' (it excluded 'Q' and 'U').

In [28]:
print(len(test_annot_all_df),len(train_annot_all_df))
test_annot_all_df_clean = test_annot_all_df[(test_annot_all_df['judgment']  == 'Y') | (test_annot_all_df['judgment']  == 'N')]
train_annot_all_df_clean = train_annot_all_df[(train_annot_all_df['judgment']  == 'Y') | (train_annot_all_df['judgment']  == 'N')]
print(len(test_annot_all_df_clean),len(train_annot_all_df_clean))


15443 18276
9642 11274


In [29]:
print(test_annot_all_df_clean.groupby('disease').size())


disease
Asthma                  541
CAD                     756
CHF                     650
Depression              549
Diabetes                829
GERD                    494
Gallstones              580
Gout                    552
Hypercholesterolemia    650
Hypertension            826
Hypertriglyceridemia    496
OA                      544
OSA                     562
Obesity                 648
PVD                     528
Venous Insufficiency    437
dtype: int64


In [31]:

print(train_annot_all_df_clean.groupby('disease').size())


disease
Asthma                  648
CAD                     897
CHF                     489
Depression              668
Diabetes                978
GERD                    586
Gallstones              687
Gout                    667
Hypercholesterolemia    757
Hypertension            983
Hypertriglyceridemia    602
OA                      654
OSA                     678
Obesity                 801
PVD                     639
Venous Insufficiency    540
dtype: int64


The paper specifically calls out 6 files and does not mention the addendums, so we will stick with the separately labeled files for our study. There seems to be only one record in each of the test and training sets where the textual and intuitive judgments disagree.

In [32]:
test_annot_intuitive_df_clean = test_annot_intuitive_df[(test_annot_intuitive_df['judgment']  == 'Y') | (test_annot_intuitive_df['judgment']  == 'N')]
test_annot_textual_df_clean = test_annot_textual_df[(test_annot_textual_df['judgment']  == 'Y') | (test_annot_textual_df['judgment']  == 'N')]
train_annot_intuitive_df_clean = train_annot_intuitive_df[(train_annot_intuitive_df['judgment']  == 'Y') | (train_annot_intuitive_df['judgment']  == 'N')]
train_annot_textual_df_clean = train_annot_textual_df[(train_annot_textual_df['judgment']  == 'Y') | (train_annot_textual_df['judgment']  == 'N')]

print(len(test_annot_intuitive_df_clean))
print(len(test_annot_textual_df_clean))


df = test_annot_intuitive_df_clean.merge(test_annot_textual_df_clean, on=['id','disease'])
print(df[df['judgment_x'] != df['judgment_y']])

print(len(train_annot_intuitive_df_clean))
print(len(train_annot_textual_df_clean))

df = train_annot_intuitive_df_clean.merge(train_annot_textual_df_clean, on=['id','disease'])
print(df[df['judgment_x'] != df['judgment_y']])


7385
2257
       source_x disease  id judgment_x source_y judgment_y
1712  intuitive      OA   8          N  textual          Y
8598
2676
      source_x disease    id judgment_x source_y judgment_y
571  intuitive     CHF  1072          Y  textual          N


Let's remove those two records from the textual table and recheck.

In [33]:
df = test_annot_textual_df_clean
df = df.reset_index()
index_names = df[(df['disease'] == 'OA') & (df['id'] == 8)].index
test_annot_textual_df_clean = df.drop(index_names)

df = train_annot_textual_df_clean
df = df.reset_index()
index_names = df[(df['disease'] == 'CHF') & (df['id'] == 1072)].index
train_annot_textual_df_clean = df.drop(index_names)

print(len(test_annot_textual_df_clean))
df = test_annot_intuitive_df_clean.merge(test_annot_textual_df_clean, on=['id','disease'])
print(df[df['judgment_x'] != df['judgment_y']])

print(len(train_annot_textual_df_clean))
df = train_annot_intuitive_df_clean.merge(train_annot_textual_df_clean, on=['id','disease'])
print(df[df['judgment_x'] != df['judgment_y']])


2256
Empty DataFrame
Columns: [source_x, disease, id, judgment_x, index, source_y, judgment_y]
Index: []
2675
Empty DataFrame
Columns: [source_x, disease, id, judgment_x, index, source_y, judgment_y]
Index: []


The paper builds a classification model for each disease separately, so we need to be able to loop through each disease. Using the separate files seems to come closer to the disease counts the paper reports before pre-processing, so we will use this data. The study must have done some additional processing that is not evident from the paper, so our results may differ slightly.

In [34]:
disease_list = train_annot_intuitive_df_clean['disease'].unique().tolist()

train_annot_all_df_clean = pd.concat([train_annot_intuitive_df_clean,train_annot_textual_df_clean])
train_annot_all_df_clean = train_annot_all_df_clean.drop(['source'], axis=1)
train_annot_all_df_clean = train_annot_all_df_clean.drop_duplicates()

test_annot_all_df_clean = pd.concat([test_annot_intuitive_df_clean,test_annot_textual_df_clean])
test_annot_all_df_clean = test_annot_all_df_clean.drop(['source'], axis=1)
test_annot_all_df_clean = test_annot_all_df_clean.drop_duplicates()

annot_all_df_clean = pd.concat([train_annot_all_df_clean,test_annot_all_df_clean])
annot_all_df_clean = annot_all_df_clean.drop_duplicates()
samples = 0

for disease in disease_list:
  print('Disease:',disease)
  print('Train:',sum(train_annot_intuitive_df_clean['disease'] == disease),
        sum(train_annot_textual_df_clean['disease'] == disease),
        sum(train_annot_all_df_clean['disease'] == disease))
  print('Test:',sum(test_annot_intuitive_df_clean['disease'] == disease),
        sum(test_annot_textual_df_clean['disease'] == disease),
        sum(test_annot_all_df_clean['disease'] == disease))  
  print('All:',sum(annot_all_df_clean['disease'] == disease))
  samples = sum(annot_all_df_clean['disease'] == disease) + samples

print("Samples:",samples)


Disease: Asthma
Train: 572 76 648
Test: 471 70 541
All: 1189
Disease: CAD
Train: 548 349 897
Test: 457 299 756
All: 1653
Disease: CHF
Train: 243 245 488
Test: 434 216 650
All: 1138
Disease: Depression
Train: 582 86 668
Test: 477 72 549
All: 1217
Disease: Diabetes
Train: 567 411 978
Test: 479 350 829
All: 1807
Disease: Gallstones
Train: 593 94 687
Test: 491 89 580
All: 1267
Disease: GERD
Train: 487 99 586
Test: 424 70 494
All: 1080
Disease: Gout
Train: 596 71 667
Test: 500 52 552
All: 1219
Disease: Hypercholesterolemia
Train: 502 255 757
Test: 431 219 650
All: 1407
Disease: Hypertension
Train: 531 452 983
Test: 446 380 826
All: 1809
Disease: Hypertriglyceridemia
Train: 587 15 602
Test: 486 10 496
All: 1098
Disease: OA
Train: 565 89 654
Test: 458 85 543
All: 1197
Disease: Obesity
Train: 553 248 801
Test: 447 201 648
All: 1449
Disease: OSA
Train: 590 88 678
Test: 493 69 562
All: 1240
Disease: PVD
Train: 556 83 639
Test: 464 64 528
All: 1167
Disease: Venous Insufficiency
Train: 526 14 540


Datasets to use for rest of the study:
* test_df [id,text] (document, clinical notes)
* train_df [id,text] (document, clinical notes)
* test_annot_all_df_clean [disease,id,judgment,index] (disease, document, judgment)
* train_annot_all_df_clean [disease,id,judgment,index] (disease, document, judgment)

For the annotations, we should convert judgment to a numeric label.

Our next step is to continue preprocessing the data.  We want to do this separately from the annotations; the two can be joined when doing classification for each disease. This includes:

* Lower-casing of the text
* Removing punctuation and numeric values from the text
* Tokenization of text 
* Lemmatization of the tokens
* TF-IDF matrix generation (TF-IDF Vectorizer from the scikit-learn library)
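
To make the last step concrete, here is a hand-rolled sketch of TF-IDF weighting on documents that are already tokenized. It uses the textbook form tf × log(N / df). scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these. The toy documents and the function name `tfidf` are invented.

```python
import math
from collections import Counter

# Toy tokenized documents (invented for illustration).
docs = [
    ["patient", "has", "asthma"],
    ["patient", "has", "diabetes"],
    ["patient", "denies", "asthma"],
]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tfidf(docs)
for w in weights:
    print(w)
```

A term that appears in every document, such as "patient" here, gets weight zero, while a term unique to one document gets the largest weight in that document.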

We provide an optional parameter to remove stop words, since the paper notes that stop words should be kept for the deep learning models.
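
A minimal sketch of how such an optional parameter could look. The function name `clean_text`, the `remove_stop_words` flag, and the tiny stop-word set are all hypothetical; a real run would more likely use a full list such as NLTK's English stop words, and lemmatization is omitted here to keep the example dependency-free.

```python
import re

# Tiny illustrative stop-word set (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "with"}

def clean_text(text, remove_stop_words=False):
    """Strip punctuation and digits, lower-case, and tokenize on whitespace."""
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    text = re.sub(r"\d+", "", text)       # drop numeric values
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

sample = "The patient is an 85-year-old woman with a history of diabetes."
with_stops = clean_text(sample)
without_stops = clean_text(sample, remove_stop_words=True)
print(with_stops)
print(without_stops)
```

Keeping the flag off by default matches the deep learning case, where stop words stay in the text.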

In [35]:

test_df
train_df
test_annot_all_df_clean
train_annot_all_df_clean

Unnamed: 0,disease,id,judgment,index
0,Asthma,1,N,
0,Asthma,2,Y,
0,Asthma,4,N,
0,Asthma,6,N,
0,Asthma,15,N,
...,...,...,...,...
2671,Venous Insufficiency,879,Y,0.0
2672,Venous Insufficiency,989,Y,0.0
2673,Venous Insufficiency,1055,Y,0.0
2674,Venous Insufficiency,1149,Y,0.0


In [36]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re
import string
import nltk
#nltk.download('wordnet')

In [37]:
#wn = WordNetLemmatizer()

#def black_txt(token):
  #return token not in list(string.punctuation) and len(token) > 2

#def clean_txt(text):
  #clean_text = []
  
  #text = re.sub(re.escape("'"), "", text)
  #text = re.sub(re.escape("\\d|\\W)+"), " ", text)
  #clean_text = [wn.lemmatize(word, pos = "v") for word in word_tokenize(text.lower()) if black_txt(word)]

  #return " ".join(clean_text)


In [38]:
#train_df['Clean_Description'] = train_df['text'].map(str).apply(clean_txt)
wn = WordNetLemmatizer()

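# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching.
train_df["no_punc_text"] = train_df['text'].str.replace(r'[^\w\s]', '', regex=True)
train_df["no_numerics_text"] = train_df['no_punc_text'].str.replace(r'\d+', '', regex=True)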
train_df["lower_text"] = train_df['no_numerics_text'].apply(str.lower)

train_df["tokenized_text"] = train_df.apply(lambda row: word_tokenize(row['lower_text']), axis=1)

train_df["tok_lem_text"] = train_df['tokenized_text'].apply(
    lambda lst:[wn.lemmatize(word) for word in lst])



In [39]:
print(train_df['tok_lem_text'][0])

['wmc', 'am', 'anemia', 'signed', 'dis', 'admission', 'date', 'report', 'status', 'signed', 'discharge', 'date', 'attending', 'truka', 'deon', 'xavier', 'md', 'service', 'bh', 'principal', 'diagnosis', 'anemia', 'and', 'gi', 'bleed', 'secondary', 'diagnosis', 'diabetes', 'mitral', 'valve', 'replacement', 'atrial', 'fibrillation', 'and', 'chronic', 'kidney', 'disease', 'history', 'of', 'present', 'illness', 'the', 'patient', 'is', 'an', 'yearold', 'woman', 'with', 'a', 'history', 'of', 'diabetes', 'chronic', 'kidney', 'disease', 'congestive', 'heart', 'failure', 'with', 'ejection', 'fraction', 'of', 'to', 'who', 'present', 'from', 'clinic', 'with', 'a', 'chief', 'complaint', 'of', 'fatigue', 'and', 'weakness', 'for', 'one', 'week', 'she', 'had', 'had', 'worsening', 'right', 'groin', 'and', 'hip', 'pain', 'status', 'post', 'a', 'total', 'hip', 'replacement', 'approximately', 'year', 'ago', 'which', 'had', 'been', 'worsening', 'for', 'two', 'week', 'and', 'she', 'ha', 'also', 'recently', 