# LDA and pyLDAvis with medical transcriptions and AWS SageMaker

Begin by installing a few packages:
* gensim is a Natural Language Processing package developed by Radim Rehurek
* plotly is a Python visualization package developed by Chris Parmer
* pyLDAvis was developed by Ben Mabey and displays the results of LDA topic modeling with gensim

In [2]:
! pip install gensim

Collecting gensim
  Downloading gensim-3.8.3-cp36-cp36m-manylinux1_x86_64.whl (24.2 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-3.0.0.tar.gz (113 kB)
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py) ... done
  Created wheel for smart-open: filename=smart_open-3.0.0-py3-none-any.whl size=107097 sha256=25e27525cd1368ca89fde5f442e3a325f3af578f12ec796c70d4687a2fe5ef19
  Stored in directory: /home/ec2-user/.cache/pip/wheels/88/2a/d4/f2e9023989d4d4b3574f268657cb6cd23994665a038803f547
Successfully built smart-open
Installing collected packages: smart-open, gensim
Successfully installed gensim-3.8.3 smart-open-3.0.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [3]:
! pip install plotly

You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [4]:
! pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-2.1.2.tar.gz (1.6 MB)
Collecting funcy
  Downloading funcy-1.15-py2.py3-none-any.whl (32 kB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-2.1.2-py2.py3-none-any.whl size=97711 sha256=3d700773f45820a534568df16d3eb926301e16c18c7faab4938f90b808c4ab42
  Stored in directory: /home/ec2-user/.cache/pip/wheels/57/de/11/0a038be70c2c212ce45fa0f4f9da165bb5dd87de1288394dc3
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.15 pyLDAvis-2.1.2
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


Next, we import all the libraries that we're going to need for the analysis.

In [5]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import models
from gensim.corpora import Dictionary, MmCorpus

In [6]:
# Load the NLTK libraries
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer

from nltk.corpus import stopwords


In [7]:
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [8]:
import pandas as pd
import numpy as np
np.random.seed(400)
np.set_printoptions(precision=3, suppress=True)

In [9]:
# accessing the SageMaker Python SDK
import boto3
import sagemaker
from sagemaker.amazon.common import numpy_to_record_serializer
from sagemaker.predictor import csv_serializer, json_deserializer
from sagemaker import get_execution_role

In [10]:
import tempfile
import string
import sys
import logging
import os
import pickle
import re

In [11]:
import pyLDAvis.gensim as gensimvis
import pyLDAvis

In [12]:
# logging is already imported above; write gensim's INFO-level messages to gensim.log
logging.basicConfig(filename='gensim.log',
                    format="%(asctime)s:%(levelname)s:%(message)s",
                    level=logging.INFO)

## Getting the data

In [13]:
# get the data
df = pd.read_csv('mtsamples.csv').drop(['Unnamed: 0'], axis=1)

print(df.columns)
df.head()

Index(['description', 'medical_specialty', 'sample_name', 'transcription',
       'keywords'],
      dtype='object')


| | description | medical_specialty | sample_name | transcription | keywords |
|---|---|---|---|---|---|
| 0 | A 23-year-old white female presents with comp... | Allergy / Immunology | Allergic Rhinitis | SUBJECTIVE:, This 23-year-old white female pr... | allergy / immunology, allergic rhinitis, aller... |
| 1 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 2 | PAST MEDICAL HISTORY:, He has difficulty climb... | bariatrics, laparoscopic gastric bypass, weigh... |
| 2 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 1 | HISTORY OF PRESENT ILLNESS: , I have seen ABC ... | bariatrics, laparoscopic gastric bypass, heart... |
| 3 | 2-D M-Mode. Doppler. | Cardiovascular / Pulmonary | 2-D Echocardiogram - 1 | 2-D M-MODE: , ,1. Left atrial enlargement wit... | cardiovascular / pulmonary, 2-d m-mode, dopple... |
| 4 | 2-D Echocardiogram | Cardiovascular / Pulmonary | 2-D Echocardiogram - 2 | 1. The left ventricular cavity size and wall ... | cardiovascular / pulmonary, 2-d, doppler, echo... |


In [14]:
# data cleaning: remove 33 rows with missing data
print(df['transcription'].isnull().sum())
df=df.dropna(subset=['transcription']).copy()
df['transcription'].isnull().sum()

33


0

In [15]:
# data cleaning: strip leading and trailing whitespace, then list the 3 most common specialties
df['medical_specialty']=df['medical_specialty'].str.strip()
spec_list = df['medical_specialty'].value_counts().head(3).index.tolist()
spec_list

['Surgery', 'Consult - History and Phy.', 'Cardiovascular / Pulmonary']

In [16]:
# data cleaning: filter to surgery only, save the transcript text as a separate pandas object
surgery = df[df['medical_specialty']==spec_list[0]]['transcription']

## Step 1. Data Preprocessing
We will perform the following steps:

* Tokenization: Split the text into words. Lowercase the words and remove punctuation.
* Words with 3 or fewer characters are removed.
* All stopwords are removed.
* Lemmatization (changing words in third person to first person and verbs in past and future tenses to present) and stemming (reducing words to their root form) are common further steps. The NLTK classes for them are imported above, but the `preprocess` function below does not apply them.
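To make the tokenization and filtering steps concrete, here is a minimal plain-Python sketch. It is only a stand-in for illustration: `demo_preprocess` and `DEMO_STOPWORDS` are hypothetical names, the stopword list is a tiny made-up subset, and the regular expression is a simplification of what gensim's `simple_preprocess` does.

```python
import re

# Hypothetical, tiny stopword list for illustration only; the notebook itself
# relies on gensim's much larger STOPWORDS set.
DEMO_STOPWORDS = {"the", "was", "and", "with", "then", "this"}

def demo_preprocess(text, stopwords=DEMO_STOPWORDS, min_len=4):
    """Lowercase, keep alphabetic runs only, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

sample = "The wound was irrigated and then closed with 3-0 Vicryl suture."
print(demo_preprocess(sample))
# ['wound', 'irrigated', 'closed', 'vicryl', 'suture']
```

Digits and punctuation disappear because only alphabetic runs are kept, and `min_len=4` mirrors the `len(token) > 3` test used in the notebook's own function.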

In [17]:
# Preprocess a single document:
# tokenize and lowercase with gensim's simple_preprocess (which also drops punctuation),
# then remove gensim stopwords and tokens of 3 or fewer characters.
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(token.lower())
    return result

In [18]:
# apply the function to the text data about surgeries
processed_docs = []
specialty=surgery
for doc in specialty:
    processed_docs.append(preprocess(doc))
print(len(processed_docs))

1088


## Step 2. Corpus and dictionary

In [19]:
# establish list of common stop words
stop = set(stopwords.words('english'))
def nltk_stopwords():
    return set(nltk.corpus.stopwords.words('english'))

In [20]:
# Prepare the corpus using gensim. The function returns:
# * a dictionary mapping each retained token to an integer id, and
# * the vectorized text, using the bag-of-words (BoW) method

def prep_corpus(docs, additional_stopwords=None, no_below=3, no_above=0.5):
    print('Building dictionary...')
    dictionary = Dictionary(docs)
    # combine NLTK's English stopwords with any caller-supplied ones
    all_stopwords = nltk_stopwords().union(additional_stopwords or set())
    # look up ids only for stopwords that actually occur in the dictionary
    stopword_ids = [dictionary.token2id[w] for w in all_stopwords if w in dictionary.token2id]
    dictionary.filter_tokens(bad_ids=stopword_ids)
    dictionary.compactify()
    # drop tokens found in fewer than no_below documents or in more than no_above of all documents
    dictionary.filter_extremes(no_below=no_below, no_above=no_above, keep_n=None)
    print('Building corpus...')
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    return dictionary, corpus
# this function is from Xuan Qi
# https://github.com/XuanX111/Friends_text_generator/blob/master/Friends_LDAvis_Xuan_Qi.ipynb
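
To show what the dictionary filtering and the bag-of-words step do, here is a small plain-Python sketch on made-up documents. The helpers `demo_filter_extremes` and `demo_doc2bow` are hypothetical and only approximate the behavior of gensim's `Dictionary.filter_extremes` and `Dictionary.doc2bow`; they are not the library's implementation.

```python
from collections import Counter

def demo_filter_extremes(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` docs and in at most
    a fraction `no_above` of all docs (simplified illustration)."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

def demo_doc2bow(doc, vocab):
    """Return sorted (token_id, count) pairs for tokens present in `vocab`."""
    token2id = {token: i for i, token in enumerate(vocab)}
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Made-up toy documents, already tokenized
docs = [
    ["incision", "suture", "suture", "patient"],
    ["incision", "bladder", "patient"],
    ["patient", "bone", "lateral"],
    ["patient", "suture", "vicryl"],
]
vocab = demo_filter_extremes(docs)
print(vocab)                         # ['incision', 'suture']
print(demo_doc2bow(docs[0], vocab))  # [(0, 1), (1, 2)]
```

In this toy example "patient" is dropped because it appears in every document, and "bladder", "bone", "lateral" and "vicryl" are dropped because each appears in only one.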

In [21]:
dictionary, corpus = prep_corpus(processed_docs)

Building dictionary...
Building corpus...


## Step 3. LDA model

In [22]:
# build a model with 3 topic clusters
lda_model = models.ldamodel.LdaModel(corpus=corpus,
         id2word=dictionary,
         num_topics=3,
         eval_every=10,
         passes=50,
         iterations=5000,
         random_state=np.random.RandomState(15))

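Perplexity captures how surprised a model is by new data it has not seen before, and is measured from the normalized log-likelihood of a held-out test set. The benefit of this statistic comes in comparing perplexity across different models with varying numbers of topics (k). The model with the lowest perplexity is generally considered the “best”. Two caveats apply to the cell below: it evaluates the model on the same corpus it was trained on rather than on a held-out set, and gensim's `log_perplexity` returns a per-word log-likelihood bound rather than the perplexity itself. Gensim reports the perplexity as 2 raised to the negative of that bound, so a bound closer to zero corresponds to a lower perplexity.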
* https://cfss.uchicago.edu/notes/topic-modeling/
* https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
* http://qpleple.com/perplexity-to-evaluate-topic-models/
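
As a small worked example, the per-word bound that gensim's `log_perplexity` returns can be converted to a perplexity value. The helper name `bound_to_perplexity` is hypothetical; the conversion itself (2 raised to the negative bound) is the one gensim uses when it logs a perplexity estimate.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into perplexity.

    Hypothetical helper for illustration; gensim logs the same quantity
    as 2 ** (-bound) alongside the bound itself.
    """
    return 2 ** (-per_word_bound)

# The bound printed by this notebook's 3-topic model
bound = -7.473611635814531
print(bound_to_perplexity(bound))
```

For the bound reported in this notebook, the result is roughly 178. When comparing models, a bound closer to zero gives a smaller perplexity.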

In [23]:
# Compute the per-word log-likelihood bound (gensim reports perplexity as 2 ** -bound).
# A bound closer to zero means lower perplexity, which is better.
print('\nPerplexity: ', lda_model.log_perplexity(corpus))


Perplexity:  -7.473611635814531


Topic coherence measures score a single topic by measuring the degree of semantic similarity between its high-scoring words. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference. Higher is better.
* https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
* https://rare-technologies.com/what-is-topic-coherence/
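
To give a feel for how a coherence score rewards words that occur together, here is a plain-Python sketch of a UMass-style measure on made-up documents. Note the assumption: this is not the `c_v` measure used in the cell below, which is more involved. UMass is a simpler co-occurrence measure (gensim offers it as `coherence='u_mass'`), and `umass_coherence` is a hypothetical helper written only for illustration.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass-style coherence for one topic (illustrative sketch).

    For each pair of top words, take log((co-document count + 1) / document
    count of the higher-ranked word), then average over all pairs.
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = [
        math.log((doc_count(w_hi, w_lo) + 1) / doc_count(w_hi))
        for w_hi, w_lo in combinations(top_words, 2)  # w_hi is ranked above w_lo
    ]
    return sum(scores) / len(scores)

# Made-up toy documents, already tokenized
docs = [
    ["suture", "vicryl", "closed"],
    ["suture", "vicryl", "wound"],
    ["bone", "lateral", "medial"],
    ["bone", "lateral", "wound"],
]
coherent = umass_coherence(["suture", "vicryl"], docs)  # words that co-occur
mixed = umass_coherence(["suture", "bone"], docs)       # words that never co-occur
print(coherent > mixed)  # True
```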

In [24]:
# Compute Coherence Score
from gensim.models import CoherenceModel

coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_docs, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.3945775889739644


## Step 5. Save the topic clusters back to the dataset

In order to carry out multiclass classification, we need to apply the topic clusters back onto the original text dataset, and then use a train-test split to evaluate the performance of the LDA model we created above.

In [25]:
# Write a function that adds the topic labels back onto the original dataset
def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=surgery):
    # Collect one row per document: dominant topic, its contribution, and its keywords
    rows = []

    # Get main topic in each document
    for doc_topics in ldamodel[corpus]:
        # The dominant topic is the (topic_num, prop_topic) pair with the highest probability
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])

    # DataFrame.append was removed in pandas 2.0, so build the frame in one step
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts).reset_index(drop=True)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df
# https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#18dominanttopicineachsentence
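
The core of the function above is picking the most probable topic for each document. A minimal sketch of that single step, assuming the per-document output is a list of `(topic_id, probability)` pairs; the numbers below are made up for illustration and `dominant_topic` is a hypothetical helper.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up topic distribution for one document
doc_topics = [(0, 0.0012), (1, 0.0007), (2, 0.9981)]
topic_id, weight = dominant_topic(doc_topics)
print(topic_id, weight)  # 2 0.9981
```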

In [27]:
# apply the function to the 3-topic model and the surgery transcripts
# returns a pandas DataFrame with one row per document

df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=surgery)

In [28]:
# check out the first model (3 topics)
df_topic_sents_keywords.head()

| | Dominant_Topic | Perc_Contribution | Topic_Keywords | transcription |
|---|---|---|---|---|
| 0 | 2.0 | 0.9981 | lateral, bone, wound, medial, noted, tissue, t... | PREOPERATIVE DIAGNOSES:,1. Hallux rigidus, le... |
| 1 | 0.0 | 0.9781 | vicryl, suture, closed, bladder, noted, normal... | PREOPERATIVE DIAGNOSIS: , Secondary capsular m... |
| 2 | 2.0 | 0.9979 | lateral, bone, wound, medial, noted, tissue, t... | TITLE OF OPERATION: , Youngswick osteotomy wit... |
| 3 | 2.0 | 0.6239 | lateral, bone, wound, medial, noted, tissue, t... | PREOPERATIVE DIAGNOSES,1. Open wound from rig... |
| 4 | 0.0 | 0.794 | vicryl, suture, closed, bladder, noted, normal... | PREOPERATIVE DIAGNOSIS:, Visually significant... |


## Step 6. Multiclass Classification

Now we're ready to compare the performance of the models using multiclass classification.

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [36]:
# convert text to vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(surgery)

In [37]:
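# use a for loop to create a list of accuracy scores
# NOTE: labeled_datasets and model_list are not defined earlier in this notebook.
# As an assumption, we build them here from the single 3-topic model labeled above.
labeled_datasets = {'Model_0': df_topic_sents_keywords}
model_list = list(labeled_datasets)
accuracy_scores=[]
model_numbers=[]
for x, model_name in enumerate(model_list):
    # the target is the dominant LDA topic assigned to each transcript
    y = labeled_datasets[model_name]['Dominant_Topic'].copy()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    y_preds=clf.predict(X_test)
    acc= accuracy_score(y_test,y_preds)
    accuracy_scores.append(acc)
    model_numbers.append('Model_'+str(x))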
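
When reading the accuracy scores, it can help to compare them with a majority-class baseline, since topic labels are rarely balanced. The sketch below is an optional sanity check, not part of the original analysis; `majority_baseline_accuracy` is a hypothetical helper and the labels are made up.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    return sum(label == most_common_label for label in y_test) / len(y_test)

# Made-up topic labels for illustration
y_train_demo = [2, 0, 2, 2, 1, 0, 2]
y_test_demo = [2, 0, 2, 1]
print(majority_baseline_accuracy(y_train_demo, y_test_demo))  # 0.5
```

A classifier whose accuracy is close to this baseline has learned little beyond the class frequencies.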

In [38]:
# display the various accuracy scores using Plotly. Higher accuracy is better.
import plotly.graph_objects as go
mycolors=['#A7226E',   '#EC2049', '#16697a', '#db6400', '#ffa62b']
data=[go.Bar(
    y=accuracy_scores, 
    x=model_numbers,
    marker_color=mycolors[0]
)]
layout=go.Layout(title='Classification Accuracy, by model',
                 xaxis=dict(title='LDA Model'),
                 yaxis=dict(title='Accuracy Score - Multinomial Naive Bayes'),
    )
fig = go.Figure(data, layout)
fig.update_xaxes(tickangle = 45)
fig.show()
fig.write_html("compare_models.html")

## Step 7. Visualize the Final Model

As a final step in our analysis, let's display the results of the LDA model using the pyLDAvis visualization tool. I'll discuss the meaning and interpretation of the clusters in the final report.

In [29]:
vis_data1 = gensimvis.prepare(lda_model, corpus, dictionary)
# write the interactive visualization to an HTML file; the with-block closes the file
with open('surgery.html', 'w') as surgery_lda:
    pyLDAvis.save_html(vis_data1, surgery_lda)