## Problem Statement
Machine learning algorithms in natural language processing (NLP) are widely used in the health care and life sciences domain to extract information from unstructured text. One application of NLP in healthcare is automated document classification. Several methods, such as term frequency-inverse document frequency (TF-IDF), LDA topic modeling, keyword extraction, and convolutional neural networks, have been proposed for classifying clinical documents. I will explore two approaches to classifying medical notes into clinical domains.


## Data Exploration 

I will be using the medical transcriptions dataset from Kaggle: https://www.kaggle.com/tboyle10/medicaltranscriptions

Although the dataset contains samples from 40 medical domains, I will only be using samples from 5 medical specialties: Gastroenterology, Neurology, Orthopedic, Radiology and Urology. 

#### Importing Libraries 

In [1]:
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
import spacy 
import scispacy
import en_core_sci_sm
from scispacy.linking import EntityLinker
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore")

#### Reading in data 

In [2]:
data = pd.read_csv('mtsamples.csv')
data.head()

| | Unnamed: 0 | description | medical_specialty | sample_name | transcription | keywords |
|---|---|---|---|---|---|---|
| 0 | 0 | A 23-year-old white female presents with comp... | Allergy / Immunology | Allergic Rhinitis | SUBJECTIVE:, This 23-year-old white female pr... | allergy / immunology, allergic rhinitis, aller... |
| 1 | 1 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 2 | PAST MEDICAL HISTORY:, He has difficulty climb... | bariatrics, laparoscopic gastric bypass, weigh... |
| 2 | 2 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 1 | HISTORY OF PRESENT ILLNESS: , I have seen ABC ... | bariatrics, laparoscopic gastric bypass, heart... |
| 3 | 3 | 2-D M-Mode. Doppler. | Cardiovascular / Pulmonary | 2-D Echocardiogram - 1 | 2-D M-MODE: , ,1. Left atrial enlargement wit... | cardiovascular / pulmonary, 2-d m-mode, dopple... |
| 4 | 4 | 2-D Echocardiogram | Cardiovascular / Pulmonary | 2-D Echocardiogram - 2 | 1. The left ventricular cavity size and wall ... | cardiovascular / pulmonary, 2-d, doppler, echo... |


I will only be using the columns 'medical_specialty', 'transcription', and 'keywords'. Let's take a look at these columns.

In [3]:
# select columns 'medical_specialty', 'transcription', 'keywords'
data = data[['medical_specialty', 'transcription', 'keywords']]

#### 1. How many medical domains are there in our dataset?

In [4]:
# Count the number of samples per medical specialty
data.groupby('medical_specialty', as_index=False).size()

| | medical_specialty | size |
|---|---|---|
| 0 | Allergy / Immunology | 7 |
| 1 | Autopsy | 8 |
| 2 | Bariatrics | 18 |
| 3 | Cardiovascular / Pulmonary | 372 |
| 4 | Chiropractic | 14 |
| 5 | Consult - History and Phy. | 516 |
| 6 | Cosmetic / Plastic Surgery | 27 |
| 7 | Dentistry | 27 |
| 8 | Dermatology | 29 |
| 9 | Diets and Nutritions | 10 |


The dataset is imbalanced across its 39 medical domains. I will focus on just 5 of them:

- Gastroenterology
- Neurology
- Orthopedic 
- Radiology 
- Urology

In [5]:
# Select only 5 of the clinical domains.
# The raw labels carry a leading space; it is stripped later during cleaning.
data = data[data.medical_specialty.isin([' Gastroenterology', ' Neurology', ' Orthopedic', ' Radiology', ' Urology'])]

#### 2. Let's look at an example of a transcription

In [6]:
data.transcription.iloc[0]

'CC:, Confusion and slurred speech.,HX , (primarily obtained from boyfriend): This 31 y/o RHF experienced a "flu-like illness 6-8 weeks prior to presentation. 3-4 weeks prior to presentation, she was found "passed out" in bed, and when awoken appeared confused, and lethargic. She apparently recovered within 24 hours. For two weeks prior to presentation she demonstrated emotional lability, uncharacteristic of her ( outbursts of anger and inappropriate laughter). She left a stove on.,She began slurring her speech 2 days prior to admission. On the day of presentation she developed right facial weakness and began stumbling to the right. She denied any associated headache, nausea, vomiting, fever, chills, neck stiffness or visual change. There was no history of illicit drug/ETOH use or head trauma.,PMH:, Migraine Headache.,FHX: , Unremarkable.,SHX: ,Divorced. Lives with boyfriend. 3 children alive and well. Denied tobacco/illicit drug use. Rarely consumes ETOH.,ROS:, Irregular menses.,EXAM:

#### 3. Let's look at the keywords for medical specialties

In [7]:
data[['medical_specialty', 'keywords']].head()

| | medical_specialty | keywords |
|---|---|---|
| 12 | Neurology | |
| 18 | Urology | urology, sterilization, vas, fertile male, bil... |
| 20 | Urology | urology, prostate cancer, technetium, whole bo... |
| 22 | Urology | urology, vasectomy, allis clamp, catgut, hemoc... |
| 23 | Urology | urology, hemiscrotum, bilateral vasectomy, vol... |


### Data Preparation/Data Cleaning 

Cleaning applied to the text of the 'transcription' and 'medical_specialty' columns:

- Remove punctuation and special characters
- Remove numbers
- Lowercase the text
- Lemmatize the text (not done currently)

In [8]:
def remove_punct(my_str):
    '''Lowercase letters, replace every non-letter with a space, and collapse whitespace.'''
    new_str = ''.join([ch.lower() if ch.isalpha() else ' ' for ch in my_str])
    new_str = ' '.join(new_str.split())
    return new_str

# Strip extra whitespace from the 'medical_specialty' names and convert them to lowercase
data.medical_specialty = data.medical_specialty.apply(lambda x: x.strip().lower())

# Clean the transcription text; str(x) turns a missing value (NaN) into the string 'nan'
data.transcription = data.transcription.apply(lambda x: remove_punct(str(x)))
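
To make the effect of this cleaning step concrete, here is a small self-contained check. It repeats the `remove_punct` helper and applies it to the opening fragment of the sample transcription shown earlier.

```python
def remove_punct(my_str):
    """Lowercase letters, replace every other character with a space, collapse whitespace."""
    new_str = ''.join(ch.lower() if ch.isalpha() else ' ' for ch in my_str)
    return ' '.join(new_str.split())

# Opening fragment of the neurology sample transcription
sample = 'CC:, Confusion and slurred speech.,HX , (primarily obtained from boyfriend): This 31 y/o RHF'
cleaned = remove_punct(sample)
print(cleaned)
```

Note that digits disappear entirely and "y/o" is split into the single-letter tokens "y" and "o". With the default `CountVectorizer` token pattern, which keeps only tokens of two or more characters, such single letters are dropped at vectorization time.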

#### Add a column encoding the categorical variable 'medical_specialty' as an integer 

In [9]:
data['medical_specialty_id'] = data['medical_specialty'].factorize()[0]
ms_id_data = data[['medical_specialty', 'medical_specialty_id']].drop_duplicates().sort_values('medical_specialty_id')

# Dictionary to look up medical_specialties and their corresponding Id. 
ms_to_id = dict(ms_id_data.values)
id_to_ms = dict(ms_id_data[['medical_specialty_id', 'medical_specialty']].values)
data.head()

| | medical_specialty | transcription | keywords | medical_specialty_id |
|---|---|---|---|---|
| 12 | neurology | cc confusion and slurred speech hx primarily o... | | 0 |
| 18 | urology | procedure elective male sterilization via bila... | urology, sterilization, vas, fertile male, bil... | 1 |
| 20 | urology | indication prostate cancer technique hours fol... | urology, prostate cancer, technetium, whole bo... | 1 |
| 22 | urology | description the patient was placed in the supi... | urology, vasectomy, allis clamp, catgut, hemoc... | 1 |
| 23 | urology | preoperative diagnosis voluntary sterility pos... | urology, hemiscrotum, bilateral vasectomy, vol... | 1 |
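
The integer ids follow the order in which each specialty first appears in the data, which is why neurology (the first row) gets id 0. A pure-Python sketch of that behaviour follows; the `factorize` helper and the short label list are hypothetical stand-ins for the pandas method and the real column.

```python
def factorize(labels):
    """Assign integer ids in order of first appearance (illustrative helper, not pandas)."""
    label_to_id = {}
    codes = []
    for label in labels:
        if label not in label_to_id:
            label_to_id[label] = len(label_to_id)
        codes.append(label_to_id[label])
    return codes, label_to_id

# Made-up label sequence for illustration
labels = ['neurology', 'urology', 'urology', 'radiology', 'neurology']
codes, ms_to_id = factorize(labels)
id_to_ms = {i: name for name, i in ms_to_id.items()}

print(codes)
print(id_to_ms)
```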


## Model Training and Selection 

### Naive Bayes 

#### Split into training/testing sets (80/20 split)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(data['transcription'], data['medical_specialty'], test_size = 0.2, random_state = 42)

In [11]:
print("Train Samples:" , len(X_train))
print("Test Samples: " , len(X_test))

Train Samples: 991
Test Samples:  248
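
The split above is purely random. Because the classes are imbalanced, one option (not used in this notebook) is to pass `stratify` to `train_test_split` so that each split keeps roughly the same class proportions. A minimal sketch on made-up labels:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels (made up for illustration): 8 of one class, 2 of another
docs = ['note %d' % i for i in range(10)]
labels = ['orthopedic'] * 8 + ['urology'] * 2

# stratify=labels keeps the 4:1 class ratio in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.5, random_state=42, stratify=labels)

print(Counter(y_tr))
print(Counter(y_te))
```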


In [12]:
id_to_ms

{0: 'neurology',
 1: 'urology',
 2: 'radiology',
 3: 'orthopedic',
 4: 'gastroenterology'}

In [13]:
# nb = Pipeline([('vect', CountVectorizer()),
#                ('tfidf', TfidfTransformer()),
#                ('clf', MultinomialNB()),
#               ])
# nb.fit(X_train, y_train)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = count_vect.transform(X_test)
y_pred = clf.predict(X_test_counts)

print('Testing accuracy %s' % accuracy_score(y_test, y_pred))
cmtx = pd.DataFrame(
    confusion_matrix(y_test, y_pred), 
    index=['true:gastroenterology', 'true:neurology', 'true:orthopedic', 'true:radiology',  'true:urology'], 
    columns=['pred:gastroenterology', 'pred:neurology', 'pred:orthopedic', 'pred:radiology',  'pred:urology']
)
print(metrics.classification_report(y_test, y_pred))
cmtx

Testing accuracy 0.4717741935483871
                  precision    recall  f1-score   support

gastroenterology       0.90      0.23      0.37        39
       neurology       0.58      0.33      0.42        42
      orthopedic       0.39      0.95      0.55        77
       radiology       0.77      0.33      0.47        60
         urology       1.00      0.03      0.06        30

       micro avg       0.47      0.47      0.47       248
       macro avg       0.73      0.38      0.37       248
    weighted avg       0.67      0.47      0.42       248



| | pred:gastroenterology | pred:neurology | pred:orthopedic | pred:radiology | pred:urology |
|---|---|---|---|---|---|
| true:gastroenterology | 9 | 0 | 29 | 1 | 0 |
| true:neurology | 0 | 14 | 27 | 1 | 0 |
| true:orthopedic | 0 | 1 | 73 | 3 | 0 |
| true:radiology | 0 | 9 | 31 | 20 | 0 |
| true:urology | 1 | 0 | 27 | 1 | 1 |


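Naive Bayes gives an accuracy below 50% and classifies most of the notes as 'orthopedic'. One likely contributor is a feature mismatch in the cell above: the classifier is fit on TF-IDF features (`X_train_tfidf`) but predicts on raw counts (`X_test_counts`), so the test features never pass through `tfidf_transformer`. The commented-out `Pipeline` would apply both steps to the test data automatically. Naive Bayes also tends to work less well when features are highly correlated.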
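
As a sketch of how a `Pipeline` keeps the training and test transformations consistent, the example below fits the same three steps as the commented-out code on a handful of made-up documents. The documents and labels are invented for illustration and say nothing about performance on the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy documents and labels, made up for illustration
train_docs = ['knee joint fracture repair', 'bladder kidney stone removal',
              'hip joint replacement', 'prostate bladder biopsy']
train_labels = ['orthopedic', 'urology', 'orthopedic', 'urology']
test_docs = ['fracture of the hip joint', 'kidney stone in the bladder']

nb = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB())])
nb.fit(train_docs, train_labels)

# predict() sends the raw test text through the fitted 'vect' and 'tfidf'
# steps before it reaches the classifier, so no step can be skipped
y_pred = nb.predict(test_docs)
print(list(y_pred))
```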


#### Linear Support Vector Machine 

In [16]:
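from sklearn.feature_extraction.text import TfidfVectorizer
# Note: the vectorizer is fit on all transcriptions before the split,
# so the vocabulary and IDF weights also reflect the test documents.
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(data.transcription).toarray()
# Use the integer specialty ids as labels
labels = data.medical_specialty_id
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = 0.2, random_state = 42)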
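
For intuition about the features built here, the sketch below computes TF-IDF weights by hand. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document). The toy corpus and the `tfidf_vector` helper are made up for illustration.

```python
import math
from collections import Counter

# Toy tokenized corpus, made up for illustration
docs = [['knee', 'pain', 'knee'], ['bladder', 'pain'], ['bladder', 'stone']]
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """TF-IDF weights for one tokenized document, L2-normalized."""
    counts = Counter(doc)
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
           for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf_vector(docs[0])
print(weights)
```

In the first document, "knee" occurs twice and appears in no other document, so it receives a larger weight than "pain", which is shared with the second document.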

In [17]:
def generate_report(y_test, y_pred):
    '''Print accuracy and a per-class report; return the confusion matrix as a DataFrame.'''
    # Class names ordered by integer id, matching the sorted label order used by sklearn
    names = [id_to_ms[i] for i in sorted(id_to_ms)]
    print('Testing accuracy %s' % accuracy_score(y_test, y_pred))
    print(metrics.classification_report(y_test, y_pred, target_names=names))
    cmtx = pd.DataFrame(
        confusion_matrix(y_test, y_pred),
        index=['true:' + name for name in names],
        columns=['pred:' + name for name in names])
    return cmtx


In [19]:
svc = LinearSVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
cmtx = generate_report(y_test, y_pred)
cmtx

Testing accuracy 0.6975806451612904
                  precision    recall  f1-score   support

       neurology       0.54      0.48      0.51        42
         urology       1.00      0.83      0.91        30
       radiology       0.51      0.47      0.49        60
      orthopedic       0.74      0.82      0.78        77
gastroenterology       0.80      0.95      0.87        39

       micro avg       0.70      0.70      0.70       248
       macro avg       0.72      0.71      0.71       248
    weighted avg       0.69      0.70      0.69       248



| | pred:neurology | pred:urology | pred:radiology | pred:orthopedic | pred:gastroenterology |
|---|---|---|---|---|---|
| true:neurology | 20 | 0 | 14 | 7 | 1 |
| true:urology | 0 | 25 | 1 | 0 | 4 |
| true:radiology | 13 | 0 | 28 | 15 | 4 |
| true:orthopedic | 3 | 0 | 11 | 63 | 0 |
| true:gastroenterology | 1 | 0 | 1 | 0 | 37 |
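
The accuracy, precision, and recall figures in the report can be read directly off the confusion matrix. As a worked check, the sketch below recomputes them from the linear SVM matrix reported above (rows are true classes, columns are predictions). The helper names are illustrative.

```python
# Confusion matrix reported above for the linear SVM (rows: true, columns: predicted)
names = ['neurology', 'urology', 'radiology', 'orthopedic', 'gastroenterology']
cm = [
    [20, 0, 14, 7, 1],
    [0, 25, 1, 0, 4],
    [13, 0, 28, 15, 4],
    [3, 0, 11, 63, 0],
    [1, 0, 1, 0, 37],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

def precision(i):
    """Correct predictions of class i divided by all predictions of class i."""
    return cm[i][i] / sum(row[i] for row in cm)

def recall(i):
    """Correct predictions of class i divided by all true members of class i."""
    return cm[i][i] / sum(cm[i])

print('accuracy %.4f' % accuracy)
for i, name in enumerate(names):
    print('%-17s precision %.2f  recall %.2f' % (name, precision(i), recall(i)))
```

For example, urology has perfect precision because nothing else was predicted as urology, while its recall is 25/30 because five urology notes were assigned to other classes.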


#### Random Forest 

In [20]:
rfc_model = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=42)
rfc_model.fit(X_train, y_train)
y_pred = rfc_model.predict(X_test)
cmtx = generate_report(y_test, y_pred)
cmtx


Testing accuracy 0.5403225806451613
                  precision    recall  f1-score   support

       neurology       0.58      0.36      0.44        42
         urology       0.00      0.00      0.00        30
       radiology       0.45      0.63      0.53        60
      orthopedic       0.52      0.79      0.63        77
gastroenterology       0.95      0.51      0.67        39

       micro avg       0.54      0.54      0.54       248
       macro avg       0.50      0.46      0.45       248
    weighted avg       0.52      0.54      0.50       248



| | pred:neurology | pred:urology | pred:radiology | pred:orthopedic | pred:gastroenterology |
|---|---|---|---|---|---|
| true:neurology | 15 | 0 | 21 | 6 | 0 |
| true:urology | 0 | 0 | 6 | 24 | 0 |
| true:radiology | 10 | 0 | 38 | 11 | 1 |
| true:orthopedic | 1 | 0 | 15 | 61 | 0 |
| true:gastroenterology | 0 | 0 | 4 | 15 | 20 |


#### Logistic Regression

In [21]:
logreg = LogisticRegression(random_state = 42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
cmtx = generate_report(y_test, y_pred)
cmtx

Testing accuracy 0.7258064516129032
                  precision    recall  f1-score   support

       neurology       0.58      0.62      0.60        42
         urology       1.00      0.70      0.82        30
       radiology       0.63      0.55      0.59        60
      orthopedic       0.74      0.83      0.79        77
gastroenterology       0.82      0.92      0.87        39

       micro avg       0.73      0.73      0.73       248
       macro avg       0.75      0.72      0.73       248
    weighted avg       0.73      0.73      0.72       248



| | pred:neurology | pred:urology | pred:radiology | pred:orthopedic | pred:gastroenterology |
|---|---|---|---|---|---|
| true:neurology | 26 | 0 | 8 | 7 | 1 |
| true:urology | 2 | 21 | 1 | 2 | 4 |
| true:radiology | 11 | 0 | 33 | 13 | 3 |
| true:orthopedic | 4 | 0 | 9 | 64 | 0 |
| true:gastroenterology | 2 | 0 | 1 | 0 | 36 |


## Keyword Extraction + Entity Linking 

In [22]:
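# Load the small scispaCy model and attach a UMLS entity linker.
# Passing the component object to add_pipe follows the older spaCy 2.x style API.
nlp = en_core_sci_sm.load()
linker = EntityLinker(k = 10, max_entities_per_mention = 2, name='umls')
nlp.add_pipe(linker)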


In [None]:
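def get_uml_terms(key_str):
    '''Return the canonical UMLS name of the first linked candidate for each entity in key_str.'''
    terms = []
    dc = nlp(key_str)
    for entity in dc.ents:
        try:
            # kb_ents holds (concept_id, score) candidates; take the first one
            concept_id = entity._.kb_ents[0][0]
            terms.append(linker.kb.cui_to_entity[concept_id].canonical_name)
        except (IndexError, KeyError):
            # Skip entities with no linked candidate or no knowledge-base entry
            continue
    return terms

# Link the entities in every cleaned transcription
keyword_lst = [get_uml_terms(str(k)) for k in data['transcription']]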

In [24]:
keyword_lst[1]

['Admission Type - Elective',
 'Males',
 'Sexual sterilization',
 'Bilateral vasectomy',
 'Preoperative',
 'Diagnosis Code',
 'Fertility',
 'Males',
 'Family',
 'Postoperative Period',
 'Diagnosis Code',
 'Fertility',
 'Males',
 'Family Relationship',
 'Anesthesia procedures',
 'Local',
 'Conscious Sedation',
 'Complication',
 'Blood Loss',
 'year',
 'Office',
 'Sexual sterilization',
 'Bilateral vasectomy',
 'Indication of (contextual qualifier)',
 'Physical Medical Procedure',
 'Patients',
 'Details',
 'Prophylactic behavior',
 'Sufficient',
 'Patients',
 'Supine Position',
 'CDISC SDTM Body Position Terminology',
 'Operating Tables',
 'Table - furniture',
 'Genitalia',
 'Solution Dosage Form',
 'Drapes (device)',
 'Physical Medical Procedure',
 'Started',
 'Structure of right vas deferens',
 'Scrotum',
 'Levels (qualifier value)',
 'Skin, Human',
 'Skin, Human',
 'Infiltration',
 'Xylocaine',
 'Sharp sensation quality',
 'Hemostat',
 'Inferior',
 'Surgical wound',
 'Anatomical segme
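
The linked terms repeat within a single note ('Males' already appears three times in the first dozen entries). A quick way to inspect which concepts dominate a note is to count them, as in this sketch. It uses only the first few terms from the output above.

```python
from collections import Counter

# First few linked terms from the output above
terms = ['Admission Type - Elective', 'Males', 'Sexual sterilization',
         'Bilateral vasectomy', 'Preoperative', 'Diagnosis Code', 'Fertility',
         'Males', 'Family', 'Postoperative Period', 'Diagnosis Code',
         'Fertility', 'Males']

# Count how often each canonical name occurs in this note
term_counts = Counter(terms)
print(term_counts.most_common(3))
```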

#### Logistic Regression

In [25]:
# Vectorizing lists of tokens: http://www.davidsbatista.net/blog/2018/02/28/TfidfVectorizer/
def dummy_fun(doc):
    '''Identity function: each document is already a list of terms.'''
    return doc

# Pass identity functions so the vectorizer does not re-tokenize the term lists
tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)
feat = tfidf.fit_transform(keyword_lst).toarray()
X_train, X_test, y_train, y_test = train_test_split(feat, data['medical_specialty_id'], test_size=0.2, random_state=42)
logreg = LogisticRegression(random_state = 42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
cmtx = generate_report(y_test, y_pred)
cmtx

Testing accuracy 0.7016129032258065
                  precision    recall  f1-score   support

       neurology       0.64      0.50      0.56        42
         urology       1.00      0.63      0.78        30
       radiology       0.55      0.58      0.56        60
      orthopedic       0.70      0.81      0.75        77
gastroenterology       0.86      0.95      0.90        39

       micro avg       0.70      0.70      0.70       248
       macro avg       0.75      0.69      0.71       248
    weighted avg       0.71      0.70      0.70       248



| | pred:neurology | pred:urology | pred:radiology | pred:orthopedic | pred:gastroenterology |
|---|---|---|---|---|---|
| true:neurology | 21 | 0 | 13 | 7 | 1 |
| true:urology | 0 | 19 | 2 | 7 | 2 |
| true:radiology | 10 | 0 | 35 | 12 | 3 |
| true:orthopedic | 2 | 0 | 13 | 62 | 0 |
| true:gastroenterology | 0 | 0 | 1 | 1 | 37 |
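
The `TfidfVectorizer` configuration used for these keyword features passes identity functions as tokenizer and preprocessor, so each document is consumed as a ready-made list of terms and multi-word concept names stay intact as single features. A small sketch on made-up term lists (the `identity` helper and the lists are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(doc):
    """Return the document unchanged; it is already a list of terms."""
    return doc

# Made-up term lists standing in for the linked UMLS names
term_lists = [['Bilateral vasectomy', 'Males', 'Scrotum'],
              ['Supine Position', 'Males', 'Surgical wound']]

vec = TfidfVectorizer(analyzer='word', tokenizer=identity,
                      preprocessor=identity, token_pattern=None)
matrix = vec.fit_transform(term_lists)

# Each distinct term, including multi-word ones, becomes one column
print(sorted(vec.vocabulary_))
print(matrix.shape)
```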


#### Random Forest

In [28]:
rfc_model = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=42)
rfc_model.fit(X_train, y_train)
y_pred = rfc_model.predict(X_test)
cmtx = generate_report(y_test, y_pred)
cmtx

Testing accuracy 0.5
                  precision    recall  f1-score   support

       neurology       0.75      0.07      0.13        42
         urology       0.00      0.00      0.00        30
       radiology       0.49      0.63      0.55        60
      orthopedic       0.45      0.84      0.58        77
gastroenterology       0.90      0.46      0.61        39

       micro avg       0.50      0.50      0.50       248
       macro avg       0.52      0.40      0.37       248
    weighted avg       0.52      0.50      0.43       248



| | pred:neurology | pred:urology | pred:radiology | pred:orthopedic | pred:gastroenterology |
|---|---|---|---|---|---|
| true:neurology | 3 | 0 | 22 | 17 | 0 |
| true:urology | 0 | 0 | 3 | 27 | 0 |
| true:radiology | 1 | 0 | 38 | 19 | 2 |
| true:orthopedic | 0 | 0 | 12 | 65 | 0 |
| true:gastroenterology | 0 | 0 | 3 | 18 | 18 |


#### Linear Support Vector Machine

In [27]:
svc = LinearSVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
cmtx = generate_report(y_test, y_pred)
cmtx

Testing accuracy 0.7016129032258065
                  precision    recall  f1-score   support

       neurology       0.57      0.48      0.52        42
         urology       1.00      0.80      0.89        30
       radiology       0.53      0.52      0.53        60
      orthopedic       0.73      0.79      0.76        77
gastroenterology       0.79      0.97      0.87        39

       micro avg       0.70      0.70      0.70       248
       macro avg       0.73      0.71      0.71       248
    weighted avg       0.70      0.70      0.70       248



| | pred:neurology | pred:urology | pred:radiology | pred:orthopedic | pred:gastroenterology |
|---|---|---|---|---|---|
| true:neurology | 20 | 0 | 13 | 8 | 1 |
| true:urology | 0 | 24 | 2 | 1 | 3 |
| true:radiology | 12 | 0 | 31 | 13 | 4 |
| true:orthopedic | 3 | 0 | 11 | 61 | 2 |
| true:gastroenterology | 0 | 0 | 1 | 0 | 38 |
