# Developing software solutions for EHR analytics


Medical practitioners have long preferred paper-based notes. Fortunately, with the advent of computers, a majority of hospitals have now started to **digitize clinical notes**. This is particularly useful for data scientists, as there is a treasure trove of information locked away in this free text that can be used to build analytical models. There have been several attempts to unlock it using NLP techniques. The following work uses existing NLP techniques to **extract features** from the free-text notes and runs machine learning models to perform the following tasks:
* Predict the top 5 diagnoses in the form of ICD9 codes
* Extract the top 5 keywords from a clinical note showcasing the most important information
* Predict the probability of a '30-day readmission' for the patient 


### Dataset
We use MIMIC-III (Medical Information Mart for Intensive Care III), a freely available hospital database. It contains de-identified data from over 50,000 patients who were admitted to Beth Israel Deaconess Medical Center in Boston, Massachusetts from 2001 to 2012. To get the data for this project, you will need to request access at this link (https://mimic.physionet.org/gettingstarted/access/).

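### **Please install the following tools/dependencies in order for this notebook to run**
* pandas
* numpy
* scikit-learn (imported as sklearn)
* gensim
* nltk
* matplotlib

The re, string and pickle modules are also used; they ship with Python's standard library and need no installation.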

In [2]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import string
from string import digits
from sklearn import preprocessing
import gensim



In [97]:
import re
import string
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier


In [5]:
#Read the files from your local directory. This might take a while
note=pd.read_csv('NOTEEVENTS.csv')
diag = pd.read_csv('DIAGNOSES_ICD.csv')
adm=pd.read_csv('admission_new.csv')
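
Loading the full NOTEEVENTS.csv is slow and memory-hungry. The following is a minimal sketch of a lighter read, assuming only a handful of columns are needed downstream. The in-memory CSV and the name `note_small` are stand-ins for illustration, not part of the notebook.

```python
import io
import pandas as pd

# Tiny stand-in for NOTEEVENTS.csv (hypothetical rows, same column names).
sample_csv = io.StringIO(
    "ROW_ID,SUBJECT_ID,HADM_ID,CATEGORY,ISERROR,TEXT\n"
    "1,109,172335,Discharge summary,,Admission Date: ...\n"
    "2,109,,Radiology,1,CHEST (PA & LAT)\n"
)

# Read only the columns used later and state their types up front,
# so pandas does not have to infer them while parsing.
note_small = pd.read_csv(
    sample_csv,
    usecols=["SUBJECT_ID", "HADM_ID", "CATEGORY", "TEXT"],
    dtype={"SUBJECT_ID": "int64", "HADM_ID": "float64", "CATEGORY": str, "TEXT": str},
)
print(note_small.dtypes)
```

HADM_ID is read as a float here because the toy data contains a note without an admission id.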



In [8]:
#Number of notes of each type. We only consider discharge summaries, as they contain both the free text and the actual ground truth
note.CATEGORY.value_counts()

Nursing/other        822497
Radiology            522279
Nursing              223556
ECG                  209051
Physician            141624
Discharge summary     59652
Echo                  45794
Respiratory           31739
Nutrition              9418
General                8301
Rehab Services         5431
Social Work            2670
Case Management         967
Pharmacy                103
Consult                  98
Name: CATEGORY, dtype: int64

In [12]:
print('Column names for Diagnosis dataframe: ', diag.columns)
print('Column names for Noteevents dataframe: ',note.columns)
print('Column names for Admission dataframe: ',adm.columns)


Column names for Diagnosis dataframe:  Index(['ROW_ID', 'SUBJECT_ID', 'HADM_ID', 'SEQ_NUM', 'ICD9_CODE'], dtype='object')
Column names for Noteevents dataframe:  Index(['ROW_ID', 'SUBJECT_ID', 'HADM_ID', 'CHARTDATE', 'CHARTTIME',
       'STORETIME', 'CATEGORY', 'DESCRIPTION', 'CGID', 'ISERROR', 'TEXT'],
      dtype='object')
Column names for Admission dataframe:  Index(['ROW_ID', 'SUBJECT_ID', 'HADM_ID', 'ADMITTIME', 'DISCHTIME',
       'DEATHTIME', 'ADMISSION_TYPE', 'ADMISSION_LOCATION',
       'DISCHARGE_LOCATION', 'INSURANCE', 'LANGUAGE', 'RELIGION',
       'MARITAL_STATUS', 'ETHNICITY', 'EDREGTIME', 'EDOUTTIME', 'DIAGNOSIS',
       'HOSPITAL_EXPIRE_FLAG', 'HAS_CHARTEVENTS_DATA'],
      dtype='object')


In [15]:
#converting the type of the hospital admission id (HADM_ID) from float to int
note.HADM_ID=note.HADM_ID.values.astype(int)
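
A float column usually signals missing values, and casting NaN to int gives meaningless ids. A minimal sketch of a safer conversion, assuming some notes carry no HADM_ID (the frame `toy_note` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: the second note has no admission id.
toy_note = pd.DataFrame({
    "HADM_ID": [172335.0, np.nan, 145834.0],
    "CATEGORY": ["Discharge summary", "Radiology", "Discharge summary"],
})

# Drop notes without an admission id, then cast the remaining ids to int.
toy_note = toy_note.dropna(subset=["HADM_ID"]).copy()
toy_note["HADM_ID"] = toy_note["HADM_ID"].astype(int)
print(toy_note)
```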

In [16]:
#considering only discharge summaries
notes=note[note.CATEGORY=='Discharge summary']

**ICD-9-CM contains a list of codes corresponding to diagnoses and procedures recorded in conjunction with hospital care in the United States. These codes may be entered onto a patient's electronic health record and used for diagnostic, billing and reporting purposes. Related information also classified and codified in the system includes symptoms, patient complaints, causes of injury, and mental disorders. Below are 28 of the most frequent ICD9 codes in the dataset. These represent almost 80% of the total diagnoses made.**

In [17]:
ICD9codes=['4019','4280','42731','41401','5849','25000','2724','51881','5990','53081','2859','2449','486','2851','2762','496','99592','V5861','5070','0389','5859','40390','311','3051','412','2875','41071','2761']


In [23]:
data_key=pd.merge(diag[['SUBJECT_ID','HADM_ID','ICD9_CODE']],notes[['SUBJECT_ID','HADM_ID','CATEGORY','TEXT']],on=['SUBJECT_ID','HADM_ID'],how='left')

In [25]:
data_readmission= pd.merge(adm[['SUBJECT_ID','HADM_ID','ADMITTIME','DISCHTIME','DEATHTIME','DIAGNOSIS','DAYS_NEXT_ADMIT','NEXT_ADMITTIME','ADMISSION_TYPE','DEATHTIME']],
                notes[['SUBJECT_ID','HADM_ID','CATEGORY','TEXT']],on = ['SUBJECT_ID','HADM_ID'],how='left')
                                 

In [24]:
data_key.HADM_ID.value_counts().head(4)

149101    136
195655    120
104995    108
128930    105
Name: HADM_ID, dtype: int64

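**HADM_ID identifies a single hospital admission. It repeats in the output above because the merge creates one row for every diagnosis code (and every discharge note) attached to that admission. Re-admissions show up as several distinct HADM_IDs under the same SUBJECT_ID.**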
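
To make the difference between rows per admission and admissions per patient concrete, here is a small sketch on made-up data (`toy_diag` and `toy_notes` are hypothetical stand-ins for `diag` and `notes`):

```python
import pandas as pd

# Toy diagnoses: admission 1 has three codes, admission 2 has one.
toy_diag = pd.DataFrame({
    "SUBJECT_ID": [10, 10, 10, 10],
    "HADM_ID": [1, 1, 1, 2],
    "ICD9_CODE": ["4019", "4280", "486", "4019"],
})
# Toy notes: one discharge summary per admission.
toy_notes = pd.DataFrame({
    "SUBJECT_ID": [10, 10],
    "HADM_ID": [1, 2],
    "TEXT": ["note for admission 1", "note for admission 2"],
})

merged = pd.merge(toy_diag, toy_notes, on=["SUBJECT_ID", "HADM_ID"], how="left")

# Rows per admission = number of codes x number of notes for that admission.
rows_per_admission = merged.HADM_ID.value_counts()
# Distinct admissions per patient is what actually points to re-admissions.
admissions_per_patient = merged.groupby("SUBJECT_ID").HADM_ID.nunique()
print(rows_per_admission)
print(admissions_per_patient)
```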

In [26]:
data_readmission.HADM_ID.value_counts().head(4)

172599    7
120654    7
186706    6
145911    6
Name: HADM_ID, dtype: int64

In [29]:
data_readmission.shape #Rows after the merge (one per admission and discharge note) and number of columns

(65902, 12)

In [32]:
data_readmission_clean = data_readmission.loc[data_readmission.ADMISSION_TYPE != 'NEWBORN'].copy() #removing NEWBORN admissions

In [34]:
#preparing labels
data_readmission_clean['LABEL'] = (data_readmission_clean.DAYS_NEXT_ADMIT < 30).astype('int')
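
The DAYS_NEXT_ADMIT and NEXT_ADMITTIME columns come ready-made in admission_new.csv, and the notebook does not show how they were built. One plausible derivation, sketched on toy data with made-up ids and dates, is to look up each patient's next admission and measure the gap from the current discharge:

```python
import pandas as pd

# Toy admissions: patient 17 is admitted twice, patient 21 once.
toy_adm = pd.DataFrame({
    "SUBJECT_ID": [17, 17, 21],
    "HADM_ID": [1, 2, 3],
    "ADMITTIME": pd.to_datetime(["2134-12-27 07:15", "2135-01-10 14:11", "2134-09-11 12:17"]),
    "DISCHTIME": pd.to_datetime(["2134-12-31 16:05", "2135-01-13 15:05", "2134-09-24 16:15"]),
})

# Order each patient's admissions in time and take the following admit time.
toy_adm = toy_adm.sort_values(["SUBJECT_ID", "ADMITTIME"])
toy_adm["NEXT_ADMITTIME"] = toy_adm.groupby("SUBJECT_ID").ADMITTIME.shift(-1)

# Gap in days between this discharge and the next admission (NaN if none).
gap = toy_adm.NEXT_ADMITTIME - toy_adm.DISCHTIME
toy_adm["DAYS_NEXT_ADMIT"] = gap.dt.total_seconds() / (24 * 3600)

# NaN compares as False, so admissions without a follow-up get label 0.
toy_adm["LABEL"] = (toy_adm.DAYS_NEXT_ADMIT < 30).astype(int)
print(toy_adm[["SUBJECT_ID", "HADM_ID", "DAYS_NEXT_ADMIT", "LABEL"]])
```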

In [36]:
data_readmission_clean.head(50)

Unnamed: 0,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DEATHTIME,DIAGNOSIS,DAYS_NEXT_ADMIT,NEXT_ADMITTIME,ADMISSION_TYPE,DEATHTIME.1,CATEGORY,TEXT,LABEL
1,3,145834,2101-10-20 19:08:00,2101-10-31 13:58:00,,HYPOTENSION,,,EMERGENCY,,Discharge summary,Admission Date: [**2101-10-20**] Discharg...,0
2,4,185777,2191-03-16 00:28:00,2191-03-23 18:41:00,,"FEVER,DEHYDRATION,FAILURE TO THRIVE",,,EMERGENCY,,Discharge summary,Admission Date: [**2191-3-16**] Discharge...,0
4,6,107064,2175-05-30 07:15:00,2175-06-15 16:00:00,,CHRONIC RENAL FAILURE/SDA,,,ELECTIVE,,Discharge summary,Admission Date: [**2175-5-30**] Dischar...,0
7,9,150750,2149-11-09 13:06:00,2149-11-14 10:15:00,2149-11-14 10:15:00,HEMORRHAGIC CVA,,,EMERGENCY,2149-11-14 10:15:00,Discharge summary,Admission Date: [**2149-11-9**] Dischar...,0
8,9,150750,2149-11-09 13:06:00,2149-11-14 10:15:00,2149-11-14 10:15:00,HEMORRHAGIC CVA,,,EMERGENCY,2149-11-14 10:15:00,Discharge summary,"Name: [**Known lastname 10050**], [**Known fi...",0
10,11,194540,2178-04-16 06:18:00,2178-05-11 19:00:00,,BRAIN MASS,,,EMERGENCY,,Discharge summary,Admission Date: [**2178-4-16**] ...,0
11,12,112213,2104-08-07 10:15:00,2104-08-20 02:57:00,2104-08-20 02:57:00,PANCREATIC CANCER/SDA,,,ELECTIVE,2104-08-20 02:57:00,Discharge summary,Admission Date: [**2104-8-7**] Discharge ...,0
12,13,143045,2167-01-08 18:43:00,2167-01-15 15:15:00,,CORONARY ARTERY DISEASE,,,EMERGENCY,,Discharge summary,Admission Date: [**2167-1-8**] Discharg...,0
13,13,143045,2167-01-08 18:43:00,2167-01-15 15:15:00,,CORONARY ARTERY DISEASE,,,EMERGENCY,,Discharge summary,"Name: [**Known lastname 9900**], [**Known fir...",0
15,17,194023,2134-12-27 07:15:00,2134-12-31 16:05:00,,PATIENT FORAMEN OVALE\ PATENT FORAMEN OVALE MI...,128.920833,2135-05-09 14:11:00,ELECTIVE,,Discharge summary,Admission Date: [**2134-12-27**] ...,0


In [38]:
sum(data_readmission_clean.LABEL==1) # Number of 30-day readmissions

3479


In [40]:
# Number of rows per ICD9 code; 4019 (hypertension) tops the list
data_key.ICD9_CODE.value_counts()

4019     23049
4280     15200
42731    14776
41401    14051
5849     10155
25000    10105
2724      9465
51881     8519
5990      7612
53081     6908
2720      6639
2859      5983
V053      5956
V290      5687
486       5463
2449      5410
2851      5134
496       5052
2762      4983
5070      4340
99592     4207
V5861     4167
0389      4149
311       3753
5859      3677
40390     3672
3051      3669
412       3657
V3000     3648
41071     3556
         ...  
8065         1
1608         1
V230         1
85316        1
8875         1
2541         1
4473         1
E8002        1
5824         1
1898         1
2454         1
E9250        1
36106        1
E9854        1
8679         1
8400         1
71835        1
49301        1
37942        1
81511        1
05881        1
8703         1
99586        1
28652        1
3452         1
3221         1
92301        1
90450        1
61800        1
9996         1
Name: ICD9_CODE, Length: 6984, dtype: int64
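
A coverage figure for the most frequent codes can be checked directly from these counts. A minimal sketch on a made-up code column (`toy_codes` is hypothetical; on the real data the same lines would run on data_key.ICD9_CODE):

```python
import pandas as pd

# Toy diagnosis column: 10 rows over 4 codes.
toy_codes = pd.Series(["4019"] * 5 + ["4280"] * 3 + ["486"] + ["311"], name="ICD9_CODE")

counts = toy_codes.value_counts()            # sorted from most to least frequent
coverage = counts.cumsum() / counts.sum()    # cumulative share of all rows

top_k = 2
share_top_k = coverage.iloc[top_k - 1]
print(f"top {top_k} codes cover {share_top_k:.0%} of diagnosis rows")
```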

In [51]:
# Keep only rows with the selected codes. '311' is listed in ICD9codes but was never part of this filter, so it stays excluded.
selected_codes=[code for code in ICD9codes if code!='311']
data_key_30=data_key[data_key.ICD9_CODE.isin(selected_codes)]

In [52]:
data_key_30.ICD9_CODE.value_counts() # counts for the 27 selected diagnosis codes

4019     23049
4280     15200
42731    14776
41401    14051
5849     10155
25000    10105
2724      9465
51881     8519
5990      7612
53081     6908
2859      5983
486       5463
2449      5410
2851      5134
496       5052
2762      4983
5070      4340
99592     4207
V5861     4167
0389      4149
5859      3677
40390     3672
3051      3669
412       3657
41071     3556
2875      3416
2761      3389
Name: ICD9_CODE, dtype: int64

In [54]:
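def preprocess_text(df):
    # Clean the TEXT column in place: fill missing notes, drop newlines, digits,
    # de-identification placeholders ([** ... **]) and markup, lowercase, and
    # remove punctuation and one- or two-letter tokens.
    # regex=True is passed explicitly because newer pandas versions treat the
    # pattern as a literal string by default.
    df.TEXT = df.TEXT.fillna(' ')
    df.TEXT = df.TEXT.str.replace('\n', ' ', regex=False)
    df.TEXT = df.TEXT.str.replace('\r', ' ', regex=False)
    df.TEXT = df.TEXT.str.replace(r'\d+', '', regex=True)
    df.TEXT = df.TEXT.str.replace(r'\[\*\*[^\]]*\*\*\]', '', regex=True)
    df.TEXT = df.TEXT.str.replace(r'<[^>]*>', '', regex=True)
    df.TEXT = df.TEXT.str.lower()
    df.TEXT = df.TEXT.str.replace(r'[^a-z0-9]+', ' ', regex=True)
    # Python's re module has no POSIX classes such as [[:alpha:]], so the
    # short-token filter uses an explicit character range instead.
    df.TEXT = df.TEXT.str.replace(r'\b[a-z]{1,2}\b', ' ', regex=True)
    df.TEXT = df.TEXT.str.replace(r'\s+', ' ', regex=True).str.strip()

    return df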
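
To see what these steps do to a single note, here is a standalone sketch using only the re module. `clean_note` is a hypothetical helper written for this illustration, not a function from the notebook, and dropping one- and two-letter tokens is what the last replacement in preprocess_text appears to intend.

```python
import re

def clean_note(text):
    """Clean one raw note string, mirroring the steps of preprocess_text."""
    text = re.sub(r'[\r\n]', ' ', text)              # newlines and carriage returns
    text = re.sub(r'\[\*\*[^\]]*\*\*\]', '', text)   # de-identification placeholders
    text = re.sub(r'<[^>]*>', '', text)              # markup tags
    text = re.sub(r'\d+', '', text)                  # digits
    text = text.lower()
    text = re.sub(r'[^a-z]+', ' ', text)             # punctuation and symbols
    text = re.sub(r'\b[a-z]{1,2}\b', ' ', text)      # one- and two-letter tokens
    return re.sub(r'\s+', ' ', text).strip()         # collapse whitespace

raw = 'Admission Date:  [**2106-4-6**]\n\nService: MEDICINE\n68M with h/o t4 paraplegia x 2yrs'
cleaned = clean_note(raw)
print(cleaned)
# admission date service medicine with paraplegia yrs
```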

In [55]:
#cleaning the unstructured data for both data frames
#This might take a while (~15 minutes). Load the saved files named below
data_cleaned_key=preprocess_text(data_key)
data_cleaned_readmission=preprocess_text(data_readmission)

In [56]:
data_cleaned_key.to_csv('cleaned_text_for_keywords.csv')

In [57]:
data_cleaned_readmission.to_csv('cleaned_text_for_readmission.csv')

In [58]:
#load the cleaned data. You can start directly from here
#data_cleaned_key=pd.read_csv('cleaned_text_for_keywords.csv')
#data_cleaned_readmission=pd.read_csv('cleaned_text_for_readmission.csv')

In [60]:
data_cleaned_readmission.shape

(65902, 12)

In [75]:
data_cleaned_key.head(10)

Unnamed: 0,SUBJECT_ID,HADM_ID,ICD9_CODE,CATEGORY,TEXT
0,109,172335,40301,Discharge summary,admission date discharge date date of birth se...
1,109,172335,486,Discharge summary,admission date discharge date date of birth se...
2,109,172335,58281,Discharge summary,admission date discharge date date of birth se...
3,109,172335,5855,Discharge summary,admission date discharge date date of birth se...
4,109,172335,4254,Discharge summary,admission date discharge date date of birth se...
5,109,172335,2762,Discharge summary,admission date discharge date date of birth se...
6,109,172335,7100,Discharge summary,admission date discharge date date of birth se...
7,109,172335,2767,Discharge summary,admission date discharge date date of birth se...
8,109,172335,7243,Discharge summary,admission date discharge date date of birth se...
9,109,172335,45829,Discharge summary,admission date discharge date date of birth se...


## Predict keywords and diagnosis codes from the text

In [61]:
train, test = train_test_split(data_cleaned_key, random_state=42, test_size=0.33, shuffle=True)


In [71]:
X_train=train.TEXT
X_test=test.TEXT
all_text=data_cleaned_key.TEXT

In [67]:
def tokenizer_better(text):
    # split the already cleaned (lowercased, punctuation-free) text into tokens with NLTK's word_tokenize
    tokens = word_tokenize(text)
    return tokens

In [68]:
more_stop_words=['admission','discharge','patient','date','service','medicine']

In [74]:
vect= TfidfVectorizer(max_features = 20000, tokenizer = tokenizer_better, stop_words=stopwords.words('english')+more_stop_words)

In [114]:
vect1= TfidfVectorizer(max_features = 20000, tokenizer = tokenizer_better, stop_words=stopwords.words('english')+more_stop_words)

In [102]:
#Fitting the vectorizer on the training set alone, so that accuracy can be judged on unseen test data
vect.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=20000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',..., "won't", 'wouldn', "wouldn't", 'admission', 'discharge', 'patient', 'date', 'service', 'medicine'],
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function tokenizer_better at 0x7f4395a05268>,
        use_idf=True, vocabulary=None)

In [115]:
#Fitting the vectorizer on the entire text corpus
vect1.fit(all_text)



TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=20000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',..., "won't", 'wouldn', "wouldn't", 'admission', 'discharge', 'patient', 'date', 'service', 'medicine'],
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function tokenizer_better at 0x7f4395a05268>,
        use_idf=True, vocabulary=None)

In [112]:
pickle.dump(vect,open('feature.pkl','wb'))
#storing the vectorizer fitted on the training set

In [116]:
pickle.dump(vect1,open('new_feature.pkl','wb'))
#storing the vectorizer fitted on the entire corpus

In [132]:
vecy=pickle.load(open('new_feature.pkl',"rb"))

In [133]:
X_train_tf = vecy.transform(X_train)
X_test_tf = vecy.transform(X_test)




In [134]:
X_train.shape

(486330,)

In [613]:
X_train_tf[1:2]

<1x40000 sparse matrix of type '<class 'numpy.float64'>'
	with 349 stored elements in Compressed Sparse Row format>

In [129]:
from sklearn.metrics import accuracy_score
logreg = LogisticRegression(C=12.0)


In [395]:
from sklearn.svm import LinearSVC
svm = LinearSVC()
clf=RandomForestClassifier(max_depth=30)

In [615]:
lab=['4019','4280','42731','41401','5849','25000','2724','51881','5990','53081']

In [373]:
lab1=['4019','4280','42731']

In [374]:
train.ICD9_CODE.shape

(80292,)

In [375]:
s=test.TEXT.values[1000]

In [137]:
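for label in ICD9codes:
    # one binary (one-vs-rest) classifier per ICD9 code
    y=train.ICD9_CODE==label
    yt=test.ICD9_CODE==label
    logreg.fit(X_train_tf,y)
    pickle.dump(logreg,open("feature"+label+'.pkl',"wb"))
    #y_pred=logreg.predict(X_train_tf)
    print('Test accuracy for label: ', label, ' ' ,accuracy_score(yt,logreg.predict(X_test_tf)))
    # preprocess_textual is not defined in the cells shown here; it is assumed to be
    # a single-string counterpart of preprocess_text
    print(logreg.predict_proba(vecy.transform([preprocess_textual(s)])))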

Test accuracy for label:  4019   0.9685725736423753
[[0.9317629 0.0682371]]
Test accuracy for label:  4280   0.9790010687328836
[[0.99247103 0.00752897]]
Test accuracy for label:  42731   0.9794435909424888
[[0.99616576 0.00383424]]
Test accuracy for label:  41401   0.9809130986574043
[[0.88098958 0.11901042]]
Test accuracy for label:  5849   0.9858726871952441
[[0.99778744 0.00221256]]
Test accuracy for label:  25000   0.9858852114087235
[[0.96001641 0.03998359]]
Test accuracy for label:  2724   0.9869372453409926
[[0.99606431 0.00393569]]
Test accuracy for label:  51881   0.9882856856589406
[[9.99894462e-01 1.05538254e-04]]
Test accuracy for label:  5990   0.9896132522877563
[[0.99740887 0.00259113]]
Test accuracy for label:  53081   0.9905400440852314
[[0.99654323 0.00345677]]
Test accuracy for label:  2859   0.9918467370249149
[[0.99696213 0.00303787]]
Test accuracy for label:  2449   0.9924938213880168
[[0.99843915 0.00156085]]
Test accuracy for label:  486   0.9924604234854051
[[
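
Accuracies this high should be read with care: each code is present in only a small fraction of rows, so a model that never predicts the code already scores well. A minimal sketch on made-up labels (the 3-in-100 ratio is an assumption for illustration, not a figure from the dataset):

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 3 positives out of 100 rows, in the spirit of a rare ICD9 code.
y_true = [1] * 3 + [0] * 97
# A "classifier" that never predicts the code.
y_never = [0] * 100

acc = accuracy_score(y_true, y_never)
rec = recall_score(y_true, y_never)
print("accuracy:", acc)   # high, although nothing was detected
print("recall:  ", rec)   # share of true positives that were found
```

Reporting recall or precision next to accuracy makes this kind of failure visible.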

In [144]:
print(logreg.predict_proba(vecy.transform([preprocess_textual(s)]))[0][1])

0.12063658470860801
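
The positive-class probability printed above is the building block for the top-5 diagnosis task: score the note with every per-code model and keep the codes with the highest probabilities. A minimal sketch on made-up notes, assuming one logistic regression per code as in the loop earlier (`toy_text`, `models` and `rank_codes` are hypothetical names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up cleaned notes, each tagged with one ICD9 code.
toy_text = [
    "hypertension blood pressure elevated",
    "hypertension on lisinopril",
    "heart failure edema lasix",
    "congestive heart failure dyspnea",
    "atrial fibrillation on warfarin",
    "atrial fibrillation rapid ventricular rate",
]
toy_code = ["4019", "4019", "4280", "4280", "42731", "42731"]

toy_vect = TfidfVectorizer()
X = toy_vect.fit_transform(toy_text)

# One binary model per code, each kept as its own object.
models = {}
for code in sorted(set(toy_code)):
    y = [int(c == code) for c in toy_code]
    models[code] = LogisticRegression(C=12.0).fit(X, y)

def rank_codes(note, k=5):
    """Return up to k (code, probability) pairs, most likely code first."""
    x = toy_vect.transform([note])
    scores = {code: m.predict_proba(x)[0][1] for code, m in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

ranking = rank_codes("patient with heart failure and edema")
print(ranking)
```

With only three codes in the toy data the ranking has three entries; with the full code list the same function would return the top five.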


In [100]:
s1


'Admission Date:  [**2106-4-6**]              Discharge Date:   [**2106-4-15**]\n\nDate of Birth:  [**2038-4-1**]             Sex:   M\n\nService: MEDICINE\n\nAllergies:\nPatient recorded as having No Known Allergies to Drugs\n\nAttending:[**First Name3 (LF) 1990**]\nChief Complaint:\nFever, hypotension\n\nMajor Surgical or Invasive Procedure:\nBedside debridement of ulcerations by plastic surgery team\n\n\nHistory of Present Illness:\n68M with h/o t4 paraplegia x 2yrs, felt [**3-13**] "inflammatory spinal\ndisease", with a chronic indwelling foley, sacral decubitus\nulcers, presents to [**Hospital1 18**] from rehab after RN noted 1d of fever\n(tmax 101.8).  [**Name8 (MD) **] RN caring for pt at rehab, pt noted some mild\nabdominal discomfort (chronic), but otherwise denied any recent\nsymptoms of cough, n/v, constipation, rash.  Pt has been having\nchronic diarrhea (x3/day, x2-3/night) for past 1yr, etiology\nunclear.  [**Name2 (NI) 227**] persistent fevers x24hrs, pt was brought to\n