This notebook supports the data preparation for our CS 598 DLH project. The paper we have chosen for the reproducibility project is:
***Ensembling Classical Machine Learning and Deep Learning Approaches for Morbidity Identification from Clinical Notes***

The data cannot be shared publicly because of the agreements required to obtain it, so we store it locally and do not commit it to GitHub.

In [1]:
DATA_PATH = './obesity_data/'

**Classical Machine Learning - TF-IDF - All Features**

![CML TFIDF All](images/cml-tfidf-all.gif)

**Classical Machine Learning - TF-IDF - ExtraTreesClassifier Features**

![CML TFIDF ExtraTrees](images/cml-tfidf-extra.gif)

**Classical Machine Learning - TF-IDF - InfoGain Features**

![CML TFIDF InfoGain](images/cml-tfidf-infogain.gif)

**Classical Machine Learning - TF-IDF - SelectKBest Features**

![CML TFIDF SelectKBest](images/cml-tfidf-selectkbest.gif)

**Classical Machine Learning - Word Embeddings - No Stopwords**

![CML WE No Stopwords](images/cml-we-swno.gif)

**Classical Machine Learning - Word Embeddings - Stopwords**

![CML WE Stopwords](images/cml-we-swyes.gif)

In [8]:
import os
import random
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
import datetime
from datetime import timedelta
from tqdm import tqdm
import torchtext

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn import model_selection, svm, naive_bayes
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# set seed
seed = 24
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)
# define data path
DATA_PATH = './obesity_data/'
RESULTS_PATH = './results/'
MODELS_PATH = './models/'

all_docs_df = pd.read_pickle(DATA_PATH + '/alldocs_df.pkl')
all_docs_df_ns = pd.read_pickle(DATA_PATH + '/alldocs_df_ns.pkl')
all_annot_df = pd.read_pickle(DATA_PATH + '/alannot_df.pkl')

#corpus = pd.read_pickle(DATA_PATH + '/corpus.pkl')
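# test_df is never defined in this notebook; read the disease names from the annotations instead
disease_list = all_annot_df['disease'].unique().tolist()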

vectorizer = TfidfVectorizer(max_features = 600) #stop_words = cachedStopWords, max_features = 600

In [33]:
all_df = pd.merge(all_docs_df,all_annot_df, on='id')
all_df_ns = pd.merge(all_docs_df_ns,all_annot_df, on='id')

In [19]:
y = all_df['judgment']

In [39]:
all_df

Unnamed: 0,id,text,no_punc_text,no_numerics_text,lower_text,tokenized_text,tok_lem_text,disease,judgment
0,1,490646815 | WMC | 31530471 | | 9629480 | 11/23...,490646815 WMC 31530471 9629480 11232006 1...,WMC AM ANEMIA Signed DIS Admissi...,wmc am anemia signed dis admission date report...,"[wmc, am, anemia, signed, dis, admission, date...","[wmc, am, anemia, signed, dis, admission, date...",Asthma,False
1,1,490646815 | WMC | 31530471 | | 9629480 | 11/23...,490646815 WMC 31530471 9629480 11232006 1...,WMC AM ANEMIA Signed DIS Admissi...,wmc am anemia signed dis admission date report...,"[wmc, am, anemia, signed, dis, admission, date...","[wmc, am, anemia, signed, dis, admission, date...",CHF,True
2,1,490646815 | WMC | 31530471 | | 9629480 | 11/23...,490646815 WMC 31530471 9629480 11232006 1...,WMC AM ANEMIA Signed DIS Admissi...,wmc am anemia signed dis admission date report...,"[wmc, am, anemia, signed, dis, admission, date...","[wmc, am, anemia, signed, dis, admission, date...",Depression,False
3,1,490646815 | WMC | 31530471 | | 9629480 | 11/23...,490646815 WMC 31530471 9629480 11232006 1...,WMC AM ANEMIA Signed DIS Admissi...,wmc am anemia signed dis admission date report...,"[wmc, am, anemia, signed, dis, admission, date...","[wmc, am, anemia, signed, dis, admission, date...",Diabetes,True
4,1,490646815 | WMC | 31530471 | | 9629480 | 11/23...,490646815 WMC 31530471 9629480 11232006 1...,WMC AM ANEMIA Signed DIS Admissi...,wmc am anemia signed dis admission date report...,"[wmc, am, anemia, signed, dis, admission, date...","[wmc, am, anemia, signed, dis, admission, date...",Gallstones,False
...,...,...,...,...,...,...,...,...,...
16320,1247,216095265 | CCVGH | 79004464 | | 818191 | 10/1...,216095265 CCVGH 79004464 818191 10101997 ...,CCVGH AM CONGESTIVE HEART FAILURE ...,ccvgh am congestive heart failure rule out myo...,"[ccvgh, am, congestive, heart, failure, rule, ...","[ccvgh, am, congestive, heart, failure, rule, ...",OA,False
16321,1247,216095265 | CCVGH | 79004464 | | 818191 | 10/1...,216095265 CCVGH 79004464 818191 10101997 ...,CCVGH AM CONGESTIVE HEART FAILURE ...,ccvgh am congestive heart failure rule out myo...,"[ccvgh, am, congestive, heart, failure, rule, ...","[ccvgh, am, congestive, heart, failure, rule, ...",Obesity,True
16322,1247,216095265 | CCVGH | 79004464 | | 818191 | 10/1...,216095265 CCVGH 79004464 818191 10101997 ...,CCVGH AM CONGESTIVE HEART FAILURE ...,ccvgh am congestive heart failure rule out myo...,"[ccvgh, am, congestive, heart, failure, rule, ...","[ccvgh, am, congestive, heart, failure, rule, ...",OSA,False
16323,1247,216095265 | CCVGH | 79004464 | | 818191 | 10/1...,216095265 CCVGH 79004464 818191 10101997 ...,CCVGH AM CONGESTIVE HEART FAILURE ...,ccvgh am congestive heart failure rule out myo...,"[ccvgh, am, congestive, heart, failure, rule, ...","[ccvgh, am, congestive, heart, failure, rule, ...",PVD,False


In [20]:
X_train, X_test, y_train, y_test = train_test_split(all_df, y, test_size=0.25, shuffle=True)
X_train_ns, X_test_ns, y_train_ns, y_test_ns = train_test_split(all_df_ns, y, test_size=0.25, shuffle=True)

In [21]:
#y_train = all_df['judgement']

In [34]:
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(y_train)
# reuse the mapping learned from the training labels instead of refitting on the test labels
Test_Y = Encoder.transform(y_test)
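
A minimal sketch of the label-encoding step on toy values (the labels below are invented for illustration): the encoder is fitted on the training labels only and then reused for the test labels, so both splits share a single label-to-integer mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for y_train / y_test (illustrative, not project data)
toy_train_labels = [True, False, False, True]
toy_test_labels = [False, True]

toy_encoder = LabelEncoder()
toy_train_encoded = toy_encoder.fit_transform(toy_train_labels)  # learn the mapping on train only
toy_test_encoded = toy_encoder.transform(toy_test_labels)        # apply the same mapping to test

print(toy_encoder.classes_)
print(toy_train_encoded, toy_test_encoded)
```

Calling `transform` on the test labels also surfaces any label that never appeared in training as an error, rather than silently assigning it a new code.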

In [41]:
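Tfidf_vect = TfidfVectorizer(max_features=600)
# fit the vocabulary and IDF weights on the training text only, then reuse them for the test text
Train_X_Tfidf = Tfidf_vect.fit_transform(X_train['lower_text'])
Test_X_Tfidf = Tfidf_vect.transform(X_test['lower_text'])

In [42]:
print(Train_X_Tfidf)
# vocabulary_ (with the trailing underscore) is the fitted term-to-column mapping
print(Tfidf_vect.vocabulary_)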

  (0, 141)	0.010501750360443594
  (0, 140)	0.004753545959583979
  (0, 155)	0.010685702549068274
  (0, 184)	0.010752777406464098
  (0, 522)	0.011272450812701633
  (0, 337)	0.011618677373828091
  (0, 583)	0.005913566910296181
  (0, 208)	0.008444341061781357
  (0, 108)	0.00634211772057659
  (0, 495)	0.0057969581637492195
  (0, 150)	0.006410488352462493
  (0, 213)	0.011640245400886104
  (0, 505)	0.029813564836250833
  (0, 550)	0.025300044198250066
  (0, 286)	0.013753253950078011
  (0, 342)	0.01167027197353661
  (0, 461)	0.020740474154472555
  (0, 484)	0.021220382967324144
  (0, 105)	0.009631205500512045
  (0, 531)	0.007662603994428354
  (0, 281)	0.014362456660890425
  (0, 156)	0.011199169652638289
  (0, 560)	0.01426013956862515
  (0, 254)	0.018843229087035993
  (0, 465)	0.010797100012253367
  :	:
  (12242, 29)	0.22099132574014543
  (12242, 258)	0.015852276408213923
  (12242, 28)	0.07533871523846838
  (12242, 584)	0.15870412752083518
  (12242, 593)	0.028291937123920422
  (12242, 276)	0.0521
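
Each printed entry above is one non-zero cell of a sparse matrix, shown as `(row, column)` followed by its TF-IDF weight. The sketch below reproduces this on a toy corpus to show how the vocabulary fitted on the training text fixes the column layout for both splits. The documents and the `toy_` names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative only, not project data)
toy_train_docs = [
    "patient has asthma",
    "patient denies chest pain",
    "history of diabetes",
]
toy_test_docs = ["asthma and diabetes history"]

toy_tfidf = TfidfVectorizer(max_features=600)
toy_train_matrix = toy_tfidf.fit_transform(toy_train_docs)  # learns vocabulary and IDF weights
toy_test_matrix = toy_tfidf.transform(toy_test_docs)        # reuses them; unseen words are dropped

# Both matrices have one column per term in the fitted vocabulary
print(toy_train_matrix.shape, toy_test_matrix.shape)
print(sorted(toy_tfidf.vocabulary_))
```

Because the test matrix is built with `transform`, a word such as "and" that never occurs in the training documents has no column and is ignored.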

# Support Vector Machine (SVM)

https://link.springer.com/article/10.1007/BF00994018

In [43]:
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
# fit on the TF-IDF matrix and encoded labels; passing the raw DataFrame (X_train) raises
# "ValueError: could not convert string to float" because it still contains the note text
SVM.fit(Train_X_Tfidf, Train_Y)

# predict labels
predictions_SVM = SVM.predict(Test_X_Tfidf)

# get the accuracy
print("Accuracy: ", accuracy_score(Test_Y, predictions_SVM)*100)

# k-Nearest Neighbours (kNN)

https://link.springer.com/article/10.1007/BF00153759
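
The notebook has no kNN cell yet. The following is a minimal sketch of how one could look, assuming scikit-learn's `KNeighborsClassifier` on TF-IDF features. The toy documents, the labels, and `n_neighbors=3` are assumptions for illustration and are not settings from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data (illustrative only, not project data)
knn_train_docs = [
    "asthma inhaler wheezing",
    "asthma albuterol wheezing",
    "knee pain arthritis",
    "hip pain arthritis",
]
knn_train_labels = [1, 1, 0, 0]
knn_test_docs = ["wheezing treated with inhaler", "arthritis of the knee"]
knn_test_labels = [1, 0]

# Vectorize: fit on train, transform test with the same vocabulary
knn_vect = TfidfVectorizer()
knn_train_X = knn_vect.fit_transform(knn_train_docs)
knn_test_X = knn_vect.transform(knn_test_docs)

# Classify each test note by majority vote among its 3 nearest training notes
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(knn_train_X, knn_train_labels)
knn_predictions = knn.predict(knn_test_X)

print("kNN Accuracy: ", accuracy_score(knn_test_labels, knn_predictions) * 100)
```

With the notebook's own variables, the same calls would take `Train_X_Tfidf`, `Train_Y`, `Test_X_Tfidf`, and `Test_Y` in place of the toy arrays.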

# Naive Bayes

https://arxiv.org/abs/1302.4964

In [None]:
# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)

# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)

# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)

# Random Forest

https://link.springer.com/article/10.1023/A:1010933404324
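
No Random Forest cell exists in the notebook yet. As a hedged sketch, scikit-learn's `RandomForestClassifier` could be trained on TF-IDF features as shown below. The toy data, `n_estimators=100`, and `random_state=24` (mirroring the notebook's seed) are assumptions for illustration, not the paper's configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Toy data (illustrative only, not project data)
rf_train_docs = [
    "asthma inhaler wheezing",
    "asthma albuterol wheezing",
    "knee pain arthritis",
    "hip pain arthritis",
]
rf_train_labels = [1, 1, 0, 0]
rf_test_docs = ["wheezing treated with inhaler", "arthritis of the knee"]
rf_test_labels = [1, 0]

# Vectorize: fit on train, transform test with the same vocabulary
rf_vect = TfidfVectorizer()
rf_train_X = rf_vect.fit_transform(rf_train_docs)
rf_test_X = rf_vect.transform(rf_test_docs)

# An ensemble of decision trees, each grown on a bootstrap sample of the training notes
forest = RandomForestClassifier(n_estimators=100, random_state=24)
forest.fit(rf_train_X, rf_train_labels)
rf_predictions = forest.predict(rf_test_X)

print("Random Forest Accuracy: ", accuracy_score(rf_test_labels, rf_predictions) * 100)
```

Fixing `random_state` keeps the bootstrap samples and feature subsets the same across runs, which matters for a reproducibility study.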

# Random Tree

https://onlinelibrary.wiley.com/doi/10.1002/rsa.3240050207

# J48

http://server3.eca.ir/isi/forum/Programs%20for%20Machine%20Learning.pdf

# JRip

https://www.sciencedirect.com/science/article/pii/B9781558603776500232?via%3Dihub