In [281]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 30

#Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 8)
import plotly.graph_objs as go
import chart_studio.plotly as py
from matplotlib import style
plt.style.use('ggplot')

from IPython.core.interactiveshell import InteractiveShell
import plotly.figure_factory as ff
InteractiveShell.ast_node_interactivity = 'all'
from plotly.offline import iplot

#Statistics and basic modeling
import scipy
import sklearn

#NLP Tools
from textblob import TextBlob
import spacy
import re
from nltk.corpus import stopwords
from collections import Counter
import string
import nltk
nltk.download('wordnet')
from nltk.stem.porter import PorterStemmer

import warnings 
warnings.filterwarnings('ignore')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/belalabusaleh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## There are a few things we are going to want to do before modeling
1)**What is our question**: this needs to be identified and addressed first before moving forward. Cannot lose sight of what our objective is as the question will dictate our approach and solutions.


2)**Data Wrangling**: Extract what we need from our raw data. Are there multiple sources of data? How are we going to merge things into one dataframe if this is a factor?


3a)**Exploratory Data Analysis**: We need to know everything that we reasonably can about this data, our model will only be as good as our understanding of the data and the problem at hand. This will involve cleaning and pre-processing of our data as well. There will be many visualizations here as well as they work hand in hand with one another.


3b)**Visualization**: Any relevant visualizations can give us (and a general audience) some base level insights about our data and how to move forward with it. Histograms, boxplots, heatmaps, etc.


Once all of these are addressed we can begin to think about modeling. Which models fit best, and fits reasonably into our pipeline? Is this a NLP problem, Word2Vec or tf-idf? Latent Semantic Analysis? Is this time sensitive? Maybe slightly sacrifice accuracy for speed, etc. What metrics will we use? Did we answer our question from section 1. How can we effectively communicate results to a broader audience.


Always keep in mind...

BUSINESS QUESTION --> DATA QUESTION --> DATA PROBLEM --> DATA SOLUTION --> BACK TO BUSINESS

### 1) What is our question
Our question for these data is..Can we build a multi-label classifier to correctly identify which medical specialty a transcription came from in order to streamline the patient experience in the future.

### 2) Data wrangling
Let's load in our data so we can begin reading and working on it. Merge any relevant data sources together here

In [282]:
#Load in data and make a dataframe to work with

data = pd.read_csv('mtsamples.csv')
df = pd.DataFrame(data)

df.head()

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."


### 3) Exploratory Analysis
Let's get to know and clean our data!

In [283]:
#Shape of data before any processing, columns, types
print('Dataset size:', df.shape)
print('Columns are:', df.columns)
print('Column types are:', df.dtypes)

Dataset size: (4999, 6)
Columns are: Index(['Unnamed: 0', 'description', 'medical_specialty', 'sample_name',
       'transcription', 'keywords'],
      dtype='object')
Column types are: Unnamed: 0            int64
description          object
medical_specialty    object
sample_name          object
transcription        object
keywords             object
dtype: object


In [284]:
#We can identify the columns here that have many missing values and decide what to do
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 6 columns):
Unnamed: 0           4999 non-null int64
description          4999 non-null object
medical_specialty    4999 non-null object
sample_name          4999 non-null object
transcription        4966 non-null object
keywords             3931 non-null object
dtypes: int64(1), object(5)
memory usage: 234.4+ KB


#### The keywords row has NaNs
This is not something we can impute so we will drop this

In [285]:
df = df.dropna() #drop our NaN values
df = df.reset_index(drop=True)

In [286]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3898 entries, 0 to 3897
Data columns (total 6 columns):
Unnamed: 0           3898 non-null int64
description          3898 non-null object
medical_specialty    3898 non-null object
sample_name          3898 non-null object
transcription        3898 non-null object
keywords             3898 non-null object
dtypes: int64(1), object(5)
memory usage: 182.8+ KB


In [287]:
#We can see here a list and freq of the specialities that these transcriptions are coming from
df.medical_specialty.value_counts()

 Surgery                          1021
 Orthopedic                        303
 Cardiovascular / Pulmonary        280
 Radiology                         251
 Consult - History and Phy.        234
 Gastroenterology                  195
 Neurology                         168
 General Medicine                  146
 SOAP / Chart / Progress Notes     142
 Urology                           140
 Obstetrics / Gynecology           130
 ENT - Otolaryngology               84
 Neurosurgery                       81
 Ophthalmology                      79
 Discharge Summary                  77
 Nephrology                         63
 Hematology - Oncology              62
 Pain Management                    58
 Office Notes                       44
 Podiatry                           42
 Pediatrics - Neonatal              42
 Emergency Room Reports             31
 Cosmetic / Plastic Surgery         25
 Dentistry                          25
 Dermatology                        25
 Letters                 

In [288]:
#I want to compare transcription vs description here

print(df.transcription[3])
print(len(df.transcription[3]))

print(df.description[3])
print(len(df.description[3]))


2-D M-MODE: , ,1.  Left atrial enlargement with left atrial diameter of 4.7 cm.,2.  Normal size right and left ventricle.,3.  Normal LV systolic function with left ventricular ejection fraction of 51%.,4.  Normal LV diastolic function.,5.  No pericardial effusion.,6.  Normal morphology of aortic valve, mitral valve, tricuspid valve, and pulmonary valve.,7.  PA systolic pressure is 36 mmHg.,DOPPLER: , ,1.  Mild mitral and tricuspid regurgitation.,2.  Trace aortic and pulmonary regurgitation.
495
 2-D M-Mode. Doppler.  
23


#### We can see the difference between transcription and description
Transcription is an in depth summary of the diagnosis/checkup while description is a title of the meeting. It seems there is much more info in the transcription to work with. There are problems: potential grammatical errors, a lot of punctuation, numbers, etc. All of which need to be addressed.

##### Let's get to cleaning and visualizing

In [289]:
#Drop columns we dont want/need/redundant
df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()

Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."


In [290]:
#Let's build our text cleaning function

stopword = nltk.corpus.stopwords.words('english')
wn = nltk.WordNetLemmatizer()

def text_cleaner2(text):
    text = re.sub('[0-9]+', '', str(text)) #remove numbers
    text = text.lower() #lowercase words
    text = re.sub(r'[^\x00-\x7f]',r'', text) #remove non-english characters
    text = re.split('\W+', text) #tokenize
    text = [word for word in text if word not in stopword] #remove stop words
    text = [wn.lemmatize(word) for word in text] #apply stem / lemmas
    text = " ".join([char for char in text if char not in string.punctuation]) #remove punct
    return text

In [291]:
#Apply our cleaner and engineer some additional features
df['trans_clean'] = df['transcription'].apply(lambda x: text_cleaner2(x))

#TextBlob for sentiment
df['polarity'] = df['trans_clean'].map(lambda text: TextBlob(text).sentiment.polarity) #sentiment
df['review_len'] = df['trans_clean'].astype(str).apply(len) #Review length
df['word_count'] = df['trans_clean'].apply(lambda x: len(str(x).split())) #review word count

In [292]:
df.head(2)

Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords,trans_clean,polarity,review_len,word_count
0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller...",subjective year old white female present compl...,-0.013542,840,117
1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh...",past medical history difficulty climbing stair...,0.008789,1817,239


##### Check to see if our sentiment  is correct, print top 5 and bottom 5 reviews based on sentiment
Sentiment is ranked from -1 being the lowest to 1 being the highest

In [293]:
#Check our highest and lowest values of polarity. Range is (-1,1)
print(df.polarity.min())
print(df.polarity.max())

-0.45
0.8


In [294]:
#Top
print('Review with the highest positive polarity: \n')
cl = df.loc[df.polarity == .8, ['trans_clean']].sample(1).values
for c in cl:
    print(c[0])

Random review with the highest positive transaction polarity: 

stable time require intervention today visit asked return six month followup dilated examination would happy see sooner notice change vision


In [295]:
#Bottom
print('Review with the most negative polarity: \n')
cl = df.loc[df.polarity == -0.45, ['trans_clean']].sample(1).values
for c in cl:
    print(c[0])

Reviews with the most negative polarity: 

procedure newborn circumcision indication parental preference anesthesia dorsal penile nerve block description procedure baby prepared draped sterile manner lidocaine ml without epinephrine instilled base penis clock clock penile foreskin removed using xxx gomco hemostasis achieved minimal blood loss sign infection baby tolerated procedure well vaseline applied penis baby diapered nursing staff


##### So it is working properly, lets plot some visualizations now
We see the positive review saying something close to the patient is stable, returning after 6 months, is happy to see that his vision is getting better sooner than expected (roughly)

While the negative review is unfortunately about a newborn child suffering from an infection due to his circumcision, and the steps the staff are taking to handle it

In [296]:
#Create classifier for category
factor = pd.factorize(df['medical_specialty'])
df.category_class = factor[0]
definitions = factor[1]

df['medical_class'] = df.category_class
print(definitions)

#Re-order our columns to make sense
df = df[['keywords', 'description', 'sample_name', 'transcription', 'trans_clean', 'medical_specialty', 'medical_class', 'polarity', "review_len", 'word_count']]
df.head()

Index([' Allergy / Immunology', ' Bariatrics', ' Cardiovascular / Pulmonary',
       ' Dentistry', ' Urology', ' General Medicine', ' Surgery',
       ' Speech - Language', ' SOAP / Chart / Progress Notes',
       ' Sleep Medicine', ' Rheumatology', ' Radiology',
       ' Psychiatry / Psychology', ' Podiatry', ' Physical Medicine - Rehab',
       ' Pediatrics - Neonatal', ' Pain Management', ' Orthopedic',
       ' Ophthalmology', ' Office Notes', ' Obstetrics / Gynecology',
       ' Neurosurgery', ' Neurology', ' Nephrology', ' Letters',
       ' Lab Medicine - Pathology', ' IME-QME-Work Comp etc.',
       ' Hospice - Palliative Care', ' Hematology - Oncology',
       ' Gastroenterology', ' ENT - Otolaryngology', ' Endocrinology',
       ' Emergency Room Reports', ' Discharge Summary',
       ' Diets and Nutritions', ' Dermatology', ' Cosmetic / Plastic Surgery',
       ' Consult - History and Phy.', ' Chiropractic'],
      dtype='object')


Unnamed: 0,keywords,description,sample_name,transcription,trans_clean,medical_specialty,medical_class,polarity,review_len,word_count
0,"allergy / immunology, allergic rhinitis, aller...",A 23-year-old white female presents with comp...,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...",subjective year old white female present compl...,Allergy / Immunology,0,-0.013542,840,117
1,"bariatrics, laparoscopic gastric bypass, weigh...",Consult for laparoscopic gastric bypass.,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...",past medical history difficulty climbing stair...,Bariatrics,1,0.008789,1817,239
2,"bariatrics, laparoscopic gastric bypass, heart...",Consult for laparoscopic gastric bypass.,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...",history present illness seen abc today pleasan...,Bariatrics,1,0.03038,3103,437
3,"cardiovascular / pulmonary, 2-d m-mode, dopple...",2-D M-Mode. Doppler.,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...",mode left atrial enlargement left atrial diame...,Cardiovascular / Pulmonary,2,0.121905,381,50
4,"cardiovascular / pulmonary, 2-d, doppler, echo...",2-D Echocardiogram,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,left ventricular cavity size wall thickness ap...,Cardiovascular / Pulmonary,2,0.10538,1267,157


In [297]:
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

In [298]:
#Univariate: Sentiment Distribution

df['polarity'].iplot(
    kind='hist',
    bins=50,
    xTitle='polarity',
    linecolor='black',
    yTitle='count',
    title='Sentiment Polarity Distribution')

#### We can see here that the vast majority of sentiment is relatively neutral
This makes a lot of sense, contrary to what TV Medical Dramas would have us believe, I am sure that an average day/diagnosis is relatively uneventful

In [299]:
#Univariate: Review length

df['review_len'].iplot(
    kind='hist',
    bins=50,
    xTitle='review length',
    linecolor='black',
    yTitle='count',
    title='Transcription Text Length Distribution')

In [300]:
#Univariate: Word Count

df['word_count'].iplot(
    kind='hist',
    bins=45,
    xTitle='word count',
    linecolor='black',
    yTitle='count',
    title='Transcription Text Word Count Distribution')

In [301]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#Plot 15 most common words

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words = 'english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_words(df['trans_clean'], 15)
for word, freq in common_words:
    print(word, freq)
df2 = pd.DataFrame(common_words, columns = ['ReviewText' , 'count'])
df2.groupby('ReviewText').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 15 words in Transcriptions after removing stop words')

patient 16269
right 8610
left 8390
procedure 7274
placed 6283
normal 4697
diagnosis 4085
history 3850
incision 3551
using 3497
anesthesia 3471
time 3322
performed 3259
pain 3167
used 3100


In [302]:
#Plot 15 most common bigrams (words found side by side)

def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_bigram(df['trans_clean'], 15)
for word, freq in common_words:
    print(word, freq)
df4 = pd.DataFrame(common_words, columns = ['ReviewText' , 'count'])
df4.groupby('ReviewText').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 15 bigrams (side by side words) in Transcriptions after removing stop words')

year old 1899
preoperative diagnosis 1634
postoperative diagnosis 1596
operating room 1477
procedure patient 1417
prepped draped 1396
tolerated procedure 1069
patient tolerated 947
patient year 870
blood loss 864
procedure performed 813
recovery room 802
draped usual 778
stable condition 770
patient taken 756


#### Word Plots here are telling us a story
When we look at word count as well as length, we see a distribution skewed to the left. It seems most of these doctors are keeping things brief while there is a minority of doctors who really enjoy writing up quite a detailed report.

There are also 2 plots for most common unigram and bigrams. Most common unigrams make sense but dont offer too much insight, words like patient, right, left, are all very common but don't tell us a story. The bigrams however do. We see words like preopertaive diagnosis, postoperative diagnosis, operation room, blood loss, procedure performed, etc. Even without knowing anything about the data we could infer that these data are coming from pre and post operations / surgery. 

When it comes to modeling, we will check to see if unigrams or bigrams are more appropriate by testing both.

In [303]:
#I want to see association of word count vs polarity, is this showing us a pattern?

y0 = df.loc[df['word_count'] == 100]['polarity']
y1 = df.loc[df['word_count'] == 200]['polarity']
y2 = df.loc[df['word_count'] == 300]['polarity']
y3 = df.loc[df['word_count'] == 400]['polarity']
y4 = df.loc[df['word_count'] == 500]['polarity']
y5 = df.loc[df['word_count'] == 600]['polarity']

trace0 = go.Box(
    y=y0,
    name = '100 Words',
    marker = dict(
        color = 'rgb(214, 12, 140)',
    )
)
trace1 = go.Box(
    y=y1,
    name = '200 Words',
    marker = dict(
        color = 'rgb(0, 128, 128)',
    )
)
trace2 = go.Box(
    y=y2,
    name = '300 Words',
    marker = dict(
        color = 'rgb(10, 140, 208)',
    )
)
trace3 = go.Box(
    y=y3,
    name = '400 Words',
    marker = dict(
        color = 'rgb(12, 102, 14)',
    )
)
trace4 = go.Box(
    y=y4,
    name = '500 Words',
    marker = dict(
        color = 'rgb(10, 0, 100)',
    )
)
trace5 = go.Box(
    y=y5,
    name = '600 Words',
    marker = dict(
        color = 'rgb(100, 0, 10)',
    )
)
data = [trace0, trace1, trace2, trace3, trace4, trace5]
layout = go.Layout(
    title = "Sentiment Polarity Boxplot Based on Wordcount"
)

fig = go.Figure(data=data,layout=layout)
iplot(fig, filename = "Sentiment Polarity Boxplot Name")

#### There doesn't seem to be any clear correlation here
Our most common word counts all seem to be hovering close to neutral polarity (0)

Again this makes sense, the majority of interactions should be uneventful


## Modeling
We have a good sense of what our data is trying to tell us here, lets start modeling to see if we can build an accurate classifier.

In [304]:
df.head(2)

Unnamed: 0,keywords,description,sample_name,transcription,trans_clean,medical_specialty,medical_class,polarity,review_len,word_count
0,"allergy / immunology, allergic rhinitis, aller...",A 23-year-old white female presents with comp...,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...",subjective year old white female present compl...,Allergy / Immunology,0,-0.013542,840,117
1,"bariatrics, laparoscopic gastric bypass, weigh...",Consult for laparoscopic gastric bypass.,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...",past medical history difficulty climbing stair...,Bariatrics,1,0.008789,1817,239


In [330]:
df.medical_class.value_counts()

# Get names of indexes for which medical_class has value 6
indexNames = df[df['medical_class'] == 6].index
 
# Delete these row indexes from dataFrame
df.drop(indexNames, inplace=True)

17    303
2     280
11    251
37    234
29    195
22    168
5     146
8     142
4     140
20    130
30     84
21     81
18     79
33     77
23     63
28     62
16     58
19     44
15     42
13     42
32     31
3      25
36     25
35     25
24     20
12     19
1      18
9      18
31     15
14     11
34     10
7       8
25      8
10      7
27      5
38      4
26      4
0       3
Name: medical_class, dtype: int64

In [331]:
vec = CountVectorizer(stop_words = 'english').fit(df['trans_clean'])
print(len(vec.vocabulary_))

messages = vec.transform(df['trans_clean'])
count_vect_df = pd.DataFrame(messages.toarray(), columns=vec.get_feature_names())

count_vect_df.head()

15825


Unnamed: 0,___,____,_____,______,_______,________,_________,__________,___________,____________,_________________,_blunt,_cast,_knotless,aa,...,zometa,zone,zonegran,zoonotic,zoster,zosyn,zuba,zumi,zung,zygoma,zygomatic,zymar,zyprexa,zyrtec,zyvox
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [332]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split

tfidf_transformer = TfidfTransformer().fit(messages) 

#Transforming to a sparse format(For training purposes)
one = tfidf_transformer.transform(messages)
X = one
y = df.medical_class #Multi-class

#Splitting train and test data for our models
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [333]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import cross_val_score

classifier = MultinomialNB() #Using Naive Bayes algorithm (Simple and Common in NLP)
classifier.fit(X_train, y_train) #training the model

print('Training accuracy Unigram Naive Bayes is: {}'.format(classifier.score(X_train,y_train)))
print('Test accuracy Unigram Naive Bayes is: {}'.format(classifier.score(X_test, y_test)))

y_pred = classifier.predict(X_test) #predict y based on x_test

print(classification_report(y_test, y_pred)) #true value vs predicted value

scores = cross_val_score(classifier, X_train, y_train, cv=5)
print("Cross Val Naive Bayes Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Training accuracy normal Naive Bayes is: 0.4473806212331943
Test accuracy normal Naive Bayes is: 0.3972222222222222
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           2       0.54      0.81      0.65        69
           3       0.00      0.00      0.00         8
           4       0.74      0.52      0.61        33
           5       0.00      0.00      0.00        31
           7       0.00      0.00      0.00         3
           8       0.00      0.00      0.00        40
           9       0.00      0.00      0.00         7
          10       0.00      0.00      0.00         2
          11       0.37      0.38      0.38        73
          12       0.00      0.00      0.00         5
          13       0.00      0.00      0.00         8
          14       0.00      0.00      0.00         4
          15       0.00      0.00      0.00        10
          16       0.00      0.00      0.00        11
          17       

In [334]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)

print('Training accuracy Unigram LogisticRegression is: {}'.format(lr.score(X_train,y_train)))
print('Test accuracy Unigram LogisticRegression is: {}'.format(lr.score(X_test, y_test)))

y_pred = lr.predict(X_test) #predict y based on x_test

print(classification_report(y_test, y_pred)) #true value vs predicted value

scores = cross_val_score(lr, X_train, y_train, cv=5)
print("Cross Val LogisticRegression Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

Training accuracy normal LogisticRegression is: 0.6351414000927214
Test accuracy normal LogisticRegression is: 0.4152777777777778
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.00      0.00      0.00         0
           2       0.52      0.72      0.61        69
           3       1.00      0.38      0.55         8
           4       0.66      0.82      0.73        33
           5       0.10      0.10      0.10        31
           7       0.00      0.00      0.00         3
           8       0.17      0.17      0.17        40
           9       0.00      0.00      0.00         7
          10       0.00      0.00      0.00         2
          11       0.22      0.25      0.23        73
          12       0.00      0.00      0.00         5
          13       0.00      0.00      0.00         8
          14       0.00      0.00      0.00         4
          15       0.00      0.00      0.00        10
     

In [335]:
from sklearn import ensemble

rfc = ensemble.RandomForestClassifier()
rfc.fit(X_train, y_train)

print('Training accuracy Unigram RandomForest is: {}'.format(rfc.score(X_train,y_train)))
print('Test accuracy Unigram RandomForest is: {}'.format(rfc.score(X_test, y_test)))

y_pred = rfc.predict(X_test) #predict y based on x_test

print(classification_report(y_test, y_pred)) #true value vs predicted value

scores = cross_val_score(rfc, X_train, y_train, cv=5)
print("Cross Val RandomForestClassifier Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Training accuracy normal RandomForest is: 0.7116365322206769
Test accuracy normal RandomForest is: 0.28888888888888886
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.00      0.00      0.00         0
           2       0.47      0.62      0.53        69
           3       0.88      0.88      0.88         8
           4       0.52      0.52      0.52        33
           5       0.00      0.00      0.00        31
           7       0.00      0.00      0.00         3
           8       0.08      0.07      0.08        40
           9       0.20      0.14      0.17         7
          10       0.00      0.00      0.00         2
          11       0.13      0.12      0.13        73
          12       0.00      0.00      0.00         5
          13       0.00      0.00      0.00         8
          14       0.00      0.00      0.00         4
          15       0.00      0.00      0.00        10
          16    

In [336]:
#Here we can check if Bigrams will produce a better model

from sklearn.feature_extraction.text import TfidfVectorizer

tvec = TfidfVectorizer(stop_words = 'english', ngram_range=(2, 2)).fit(df['trans_clean'])
print(len(vec.vocabulary_))

messages = tvec.transform(df['trans_clean'])
count_tvect_df = pd.DataFrame(messages.toarray(), columns=tvec.get_feature_names())

count_tvect_df.head()

15825


Unnamed: 0,___ cordis,___ pin,___ placement,___ position,___ skin,___ staple,___ trocar,____ anastomosis,____ introduced,____ pin,____ pocket,_____ anastomosed,_____ arthroscopic,_____ changed,_____ completing,...,zyrtec directed,zyrtec environmental,zyrtec fairly,zyrtec flonase,zyrtec hydrocodone,zyrtec instead,zyrtec itching,zyrtec mg,zyrtec nasonex,zyrtec physical,zyrtec serum,zyrtec used,zyrtec worked,zyvox levaquin,zyvox mg
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.107138,0.0,0.0,0.0,0.0,0.0,0.0,0.107138,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [337]:
tfidf_transformer = TfidfTransformer().fit(messages) 

#Transforming to a sparse format(For training purposes)
X = tfidf_transformer.transform(messages)
y = df.medical_class 

#Splitting train and test data for our models
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

In [338]:
classifier = MultinomialNB() #Using Naive Bayes algorithm (Simple and Common in NLP)
classifier.fit(X_train, y_train) #training the model

print('Training accuracy Bigram Naive Bayes is: {}'.format(classifier.score(X_train,y_train)))
print('Test accuracy Bigram Naive Bayes is: {}'.format(classifier.score(X_test, y_test)))

y_pred = classifier.predict(X_test) #predict y based on x_test

print(classification_report(y_test, y_pred)) #true value vs predicted value

scores = cross_val_score(classifier, X_train, y_train, cv=5)
print("Cross Val Naive Bayes Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Training accuracy normal Naive Bayes is: 0.5750511247443763
Test accuracy normal Naive Bayes is: 0.2569444444444444
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           2       0.47      0.71      0.57        45
           3       0.00      0.00      0.00         4
           4       0.50      0.35      0.41        23
           5       0.00      0.00      0.00        18
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00        22
           9       0.00      0.00      0.00         5
          10       0.00      0.00      0.00         2
          11       0.05      0.04      0.04        47
          12       0.00      0.00      0.00         3
          13       0.00      0.00      0.00         5
          14       0.00      0.00      0.00         2
          15       0.00      0.00      0.00         7
          16       1.00      0.12      0.22         8
          17       

In [339]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

print('Training accuracy Bigram LogisticRegression is: {}'.format(lr.score(X_train,y_train)))
print('Test accuracy Bigram LogisticRegression is: {}'.format(lr.score(X_test, y_test)))

y_pred = lr.predict(X_test) #predict y based on x_test

print(classification_report(y_test, y_pred)) #true value vs predicted value

scores = cross_val_score(lr, X_train, y_train, cv=5)
print("Cross Val LogisticRegression Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

Training accuracy normal LogisticRegression is: 0.5934560327198364
Test accuracy normal LogisticRegression is: 0.25925925925925924
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           2       0.44      0.67      0.53        45
           3       0.00      0.00      0.00         4
           4       0.47      0.35      0.40        23
           5       0.00      0.00      0.00        18
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00        22
           9       0.00      0.00      0.00         5
          10       0.00      0.00      0.00         2
          11       0.05      0.04      0.04        47
          12       0.00      0.00      0.00         3
          13       0.00      0.00      0.00         5
          14       0.00      0.00      0.00         2
          15       0.00      0.00      0.00         7
          16       0.67      0.25      0.36         8
    

In [340]:
rfc = ensemble.RandomForestClassifier()
rfc.fit(X_train, y_train)

print('Training accuracy Bigram RandomForest is: {}'.format(rfc.score(X_train, y_train)))
print('Test accuracy Bigram RandomForest is: {}'.format(rfc.score(X_test, y_test)))

y_pred = rfc.predict(X_test) #predict y based on x_test

print(classification_report(y_test, y_pred)) #true value vs predicted value

scores = cross_val_score(rfc, X_train, y_train, cv=5)
print("Cross Val RandomForestClassifier Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Training accuracy normal RandomForest is: 0.6768916155419223
Test accuracy normal RandomForest is: 0.26157407407407407
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.00      0.00      0.00         0
           2       0.47      0.60      0.53        45
           3       0.67      1.00      0.80         4
           4       0.45      0.43      0.44        23
           5       0.07      0.11      0.09        18
           7       0.00      0.00      0.00         2
           8       0.08      0.09      0.09        22
           9       0.50      0.20      0.29         5
          10       0.00      0.00      0.00         2
          11       0.00      0.00      0.00        47
          12       0.00      0.00      0.00         3
          13       0.00      0.00      0.00         5
          14       0.00      0.00      0.00         2
          15       0.00      0.00      0.00         7
          16    

### These Results are not great. Unigrams are best in general with the best model being Logistic Regression with a 42% F1_Score and 43% cross validation
Let's do some deep learning, not enough data to reasonbly justify this but we can try

In [355]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense , Input , LSTM , Embedding, Dropout , Activation, GRU, Flatten
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model, Sequential
from keras.layers import Convolution1D
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.utils import to_categorical

max_features = 6000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(df['trans_clean'])
list_tokenized_train = tokenizer.texts_to_sequences(df['trans_clean'])

maxlen = 130
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
y = to_categorical(df['medical_class'])

embed_size = 128
model = Sequential()
model.add(Embedding(max_features, embed_size))
model.add(Bidirectional(LSTM(32, return_sequences = True)))
model.add(GlobalMaxPool1D())
model.add(Dense(20, activation="relu"))
model.add(Dropout(0.05))
model.add(Dense(39, activation="sigmoid"))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

batch_size = 100
epochs = 10
model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.4)


Train on 1726 samples, validate on 1151 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2009b5d9b0>

### Write-up and Analysis
Overview

1)Introduction

2)Background

3)Methods

a. Experimental Setup
b. Feature
c. Classifier
4) Results

5) Discussion & Future Work



### Overview
The medical field is one that is ever changing, in flux, and always adapting. It makes sense then that it would be quick to adopt Natural language processes when it comes to machine learning. What I propose is a multi-label classification system to help get patients to the departments they need to be in faster. Using NLP of doctor's transcriptions we can identify which department these transcriptions were coming from, thus in the future allowing general practioners to potentially diagnose more readily and help skip certain specialists at times and go directly to surgery. Using a simple Logistic Regression model with the minimal data given we were able to achieve an optimal F1 Score of 42% given 39 different classifiers. 

### Introduction
With the evolution of machine learning and data science, Natural Language processing is becoming an increasingly popular resource for companies, the medical industry is no different. Being able to identify and respond to a multitude of different types of text based data a company obtains can have a massive impact on the companies bottom line. 

The purpose of this report is to analyze the data given to the point where we understand enough and are comfortable enough to begin modeling and predictions. The data given is text based data for the medical industry. We have doctors who are specialists in certain fields and their transcriptions of the meetings they had with their patients. 

This is important work as it could streamline the process of getting a solution for a patient and ultimately helping save their life. Anything that can be done to speed this process along should be done. 


### 2)Background

The first thing we wanted to do with this data is do an analysis. We start with some basic analysis looking at the dimensions of the data with df.shape(). From here we can see the number of rows and columns but to get more specific we use df.columns() to print out a list of the names. We do this to quickly identify if there are any potential redundancies or irrelevant columns in the data that we'd like to drop. Finally, we look at the dtypes of all the columns to insure they weren't assigned an awkward or wrong dtype. We can then start to analyze the meat of these data. Finding the null values (if any) and dropping / imputing them. We can look for potential input error using df.describe() to identify any strange values / outliers. From this point we have a general idea of what our data holds and how to move forward with cleaning and processing.

The first step in cleaning is to simply remove columns that are no needed. The following columns were dropped:"Unnamed: 0", this is just a relic from importing a CSV and holds no value. We were able to identify the amount of null values in the data, specifically in "Keywords" and we dealt with that by dropping them, we cannot impute this as it is very specific to each row.

From this point onwards we can begin processing the textual columns into useable data for our modeling purposes.


### Methods
Before we move forward we must process our data. Using NLTK and REGEX libraries we were able to clean and create fairly uniform data. This could be done with the Spacy library however I wanted to demonstrate a deeper understand of how this data cleans as well as allow the ability to customize exactly how we wanted this data cleaned. We designed a funtion to clean our data, when the natural language (transcription data) enters our function the following occurs: All float and integer based numbers are removed, all characters are converted to lowercase, we remove non-english characters, we tokenize the words, removed common stop words, we then lemmatize our words (lemmatizing was found to be better than stemming here), and all punctuation was removed as well. This gives us a workable and uniform text. 

We then engineer some features, this new text being one of those features as well as a sentiment feature using TextBlob "polarity", word counts, and character counts. These are done for general data analysis and visualizations. Another engineered feature was the 'medical_class' feature, this is the result of each unique medical_specialty being assigned an integer so that we may use this as a target variable in our modeling.

We then began to look at visualizations of our data to see if there was any information to be gleaned. Indeed we could see that polarity was relatively neutral most of the time, which makes sense as a standard visit to the Doctor shouldn't be dire. We checked to see trends in the data with word count and length of the transcriptions and found it was skewed left (not as many words / shorter length.) It seemed most doctors kept their transcriptions short. We were also able to identify our most common words using Countvectorizer/ TFIDF checking both Unigrams (single words) and bigrams (2 words together). Ultimately through testing of different models I did not find that bigrams had a positive impact on prediction. 

There was a class imbalance issue here as well, medical_specialty had a high of over 1000 values for 'surgery'. I found there was not nearly enough data nor time to start testing under/over sampling techniques. Due to the vague nature of 'surgery' and potential overlap with other categories as well I saw it fit to drop these values. 

Using TFIDF we turn our words into features. Essentially vectorizing each word so that we can use it in different modeling techniques. Before doing so we need to transform these features into a sparse format for training using tfidf_transformer. From there we can set our y to our target feature and begin trying different models.


### Results

Metrics used were accuracy score for overfitting comparisons and ultimately F1_Score for validation of the model. Unigram words were used for these results.

The first model tried was Multinomial Naive Bayes (MNB). MNB is commonly used in NLP and is quite fast due to its simplicity. Ultimately its results were not optimal. F1 Score of 40%

The next model used was Logistic regression. Here we found the best results with an optimal F1 Score of 42%. This model did take longer to run than MNB however this is to be expected due to the simplicity of MNB. 

The final base model ran was a Random Forest Classifier, this model took by far the longest to run and also gave the worst results with an F1 Score of 29%.

Bigram models were all run as well but each bigram fared worse than its unigram counterpart and was thus not included in this writeup.

Following these base models we used a deep learning neural network to see if we could have better results. Neural networks love lots of data and we were aware this wasn't nearly enough to produce amazing results however sometimes a neural network can surprise you! A sequential model with Embeddings was used from the Keras package using the Tensorflow backend. It could be seen that accuracy was getting higher and higher as the epochs increased, this however is meaningless as the val_accurace remained at 0 throughout. This implies the nerual network model is simply overfitting and thus giving poor results. A neural network here is not appropriate.

### Discussion / Future Work
Ultimately this was a smaller data set and thus is hard to get amazing results, especially in such a short time frame! I am however happy with the result of 42% with 39 different classifiers involved. Random chance would be ~ 2.5% and so we have significanlty beaten that.
Moving forward more data would be key, with more data a neural network could be justified here and would (I believe) give us much better results than even our Logistic Regression model. This work could be very valuable as it could be the start of much more streamlined diagnoses for general practioners using automation in NLP. Again, any time saved here could help save lives and would be more cost efficient for hospitals. The plan here was to build a powerpoint presentation for a more general audience as well but sadly I ran out of time completely. Presentation ability is something I excel at and wanted to show here. 

### Thank you
Thank you for taking the time to go over my work. I know this may seem a little all over the place but the time crunch really was hard. I hope I was able to demonstrate to you my ability in data analysis and modeling, as well as a profiency in coding and data manipulation. 

I sincerly hope you'll give me the chance to meet with you. I am a hardworking, honest, and smart person and I have so much more to demonstrate than what was possible here. 

Thank you so much for your time, I look forward to all of your feedback, and I hope you have a great day.

-Belal Abusaleh