I'm trying to make sense of the 600+ descriptions of the various drugs. Is there some way to classify them?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv("../part_b_spend_clean/output/part_b_spend_clean.csv")

Let's collect the descriptions into a set:

In [3]:
descriptions = set(df['hcpcs_description'])
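
A side note on the set: Python randomizes string hashing per process unless `PYTHONHASHSEED` is set, so the iteration order of a set of strings can change between runs, and with it the row order of anything fitted on it. A minimal sketch of a deterministic alternative, assuming the column may contain missing values (which pandas stores as floats). The sample values and the helper name `unique_sorted` are made up for illustration:

```python
# Hypothetical sample standing in for df['hcpcs_description'];
# the real values are not reproduced here.
raw_descriptions = [
    "Injection, drug A, 10 mg",
    "Injection, drug B, 1 mg",
    "Injection, drug A, 10 mg",  # duplicate
    None,                        # missing value
]

def unique_sorted(values):
    """Return the distinct string values in sorted order, skipping non-strings."""
    return sorted({v for v in values if isinstance(v, str)})

descriptions_list = unique_sorted(raw_descriptions)
print(descriptions_list)
```

A sorted list can be passed to the vectorizers in the same way as the set.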

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from time import time

In [22]:
n_features = 100
n_topics = 5
n_top_words = 20

In [23]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
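
The slice `topic.argsort()[:-n_top_words - 1:-1]` is easy to misread: `argsort()` returns indices in ascending order of weight, and the slice walks backwards from the end, giving the `n_top_words` highest-weight indices with the largest first. A pure-Python sketch of the same idea, with made-up weights and vocabulary (`top_indices` is a hypothetical helper, not part of scikit-learn):

```python
def top_indices(weights, n):
    """Indices of the n largest weights, largest first."""
    # Ascending order of weight, like numpy's argsort().
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Same slice as in print_top_words: take the last n, reversed.
    return order[:-n - 1:-1]

weights = [0.1, 0.7, 0.3, 0.9]                  # made-up topic weights
vocab = ["mg", "oral", "vaccine", "injection"]  # made-up feature names
top = top_indices(weights, 2)
print([vocab[i] for i in top])
```

If `n` exceeds the number of features, the slice simply returns every index in descending order of weight.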

In [24]:
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(descriptions)
print("done in %0.3fs." % (time() - t0))

Extracting tf-idf features for NMF...
done in 0.019s.


In [25]:
# Fit the NMF model
print("Fitting the NMF model with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (len(descriptions), n_features))
t0 = time()
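# Note: `alpha` is the regularization argument in the scikit-learn version used here;
# scikit-learn 1.2 and later replace it with `alpha_W` / `alpha_H`.
nmf = NMF(n_components=n_topics, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)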
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
# Note: scikit-learn 1.0 and later provide get_feature_names_out() for this.
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

Fitting the NMF model with tf-idf features, n_samples=623 and n_features=100...
done in 0.082s.

Topics in NMF model:
Topic #0:
mg injection 100 hydrochloride hcl sulfate 500 acetate 20 200 25 250 mesylate immune 30 globulin mcg vaccine muscle use
Topic #1:
10 injection mg hydrochloride units human sulfate specified alfa ml solution complex mesylate hcl preservative square 000 micrograms intravenous cc
Topic #2:
sodium 500 gm injection mg 250 1000 iu phosphate units gram complex 50 unit lyophilized globulin liquid immune non dialysis
Topic #3:
oral mg 25 anti emetic 100 therapeutic regimen exceed substitute prescription time chemotherapy hour complete treatment iv dosage approved fda
Topic #4:
50 ml hcl infusion human injection mcg contrast osmolar material iodine concentration micrograms 25 gm unit phosphate hydrochloride dme 250



In [26]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(descriptions)
print("done in %0.3fs." % (time() - t0))

Extracting tf features for LDA...
done in 0.020s.


In [27]:
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (len(descriptions), n_features))
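# Note: `n_topics` was renamed to `n_components` in scikit-learn 0.19
# and removed in 0.21; use n_components=n_topics on newer versions.
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,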
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
# Note: scikit-learn 1.0 and later provide get_feature_names_out() for this.
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Fitting LDA models with tf features, n_samples=623 and n_features=100...
done in 0.661s.

Topics in LDA model:
Topic #0:
vaccine injection muscle ml influenza years older virus contrast patient hepatitis administered material iodine osmolar concentration skin age 000 intramuscular
Topic #1:
injection mg 10 sodium 50 500 100 ml hydrochloride globulin hcl immune units 25 sulfate human use 250 mcg intravenous
Topic #2:
administered solution dose unit non dme final fda product compounded approved inhalation form mg drug hyaluronan intra micrograms 300 used
Topic #3:
factor iu injection recombinant antihemophilic human complex specified microgram 000 100 units mg acetate vaccine dose 50 emetic dme intravenous
Topic #4:
oral anti emetic square centimeter mg use treatment therapeutic dosage iv exceed fda substitute approved prescription complete hour regimen chemotherapy



In [29]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string

In [30]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

In [31]:
descr_clean = [clean(descr).split() for descr in descriptions]  

In [32]:
import gensim
from gensim import corpora

In [33]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(descr_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
descr_term_matrix = [dictionary.doc2bow(descr) for descr in descr_clean]

In [34]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document term matrix.
ldamodel = Lda(descr_term_matrix, num_topics=5, id2word = dictionary, passes=50)



In [36]:
print(ldamodel.print_topics(num_topics=5, num_words=3))

[(0, '0.092*"injection" + 0.049*"1" + 0.048*"mg"'), (1, '0.101*"injection" + 0.037*"mg" + 0.031*"vaccine"'), (2, '0.038*"administered" + 0.036*"mg" + 0.033*"unit"'), (3, '0.117*"mg" + 0.110*"injection" + 0.036*"per"'), (4, '0.143*"injection" + 0.119*"mg" + 0.036*"1"')]
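
Every gensim topic above is led by "injection" or "mg", which suggests a few dosage and route words appear in most descriptions. One way to check this is to count, for each token, the share of documents that contain it. A minimal sketch on made-up descriptions written in the style of the data (the sample strings and the helper `document_frequency` are assumptions, not taken from the CSV):

```python
from collections import Counter

# Made-up descriptions; the real ones live in df['hcpcs_description'].
sample = [
    "Injection, drug A, 10 mg",
    "Injection, drug B, 1 mg",
    "Drug C, oral, 25 mg",
]

def document_frequency(docs):
    """Count in how many documents each lowercased token appears."""
    counts = Counter()
    for doc in docs:
        # A set, so a token repeated within one document counts once.
        tokens = set(doc.lower().replace(",", " ").split())
        counts.update(tokens)
    return counts

doc_freq = document_frequency(sample)
for token, n_docs in doc_freq.most_common(3):
    print(token, round(n_docs / len(sample), 2))
```

For comparison, the `max_df=0.95` used with the scikit-learn vectorizers earlier only drops terms found in more than 95% of documents, so a word present in, say, half of the descriptions still passes through.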


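OK, perhaps we should have seen this sooner, but this isn't a good way to classify the drugs: in all three models the topics are dominated by dosage and route terms such as "injection", "mg" and "oral" rather than by what the drugs treat. We will need to find a database that matches drugs to conditions and classify them that way.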
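
A rough sketch of what that lookup-based classification could look like, assuming such a database can be reduced to a mapping from drug name to condition. The table contents, the names `DRUG_TO_CONDITION` and `classify`, and the substring matching rule are all placeholders for illustration; no real database has been chosen yet:

```python
# Hypothetical lookup table; a real one would come from an external drug database.
DRUG_TO_CONDITION = {
    "drug a": "condition X",
    "drug b": "condition Y",
}

def classify(description, lookup=DRUG_TO_CONDITION, default="unclassified"):
    """Return the condition of the first lookup key found in the description."""
    text = description.lower()
    for drug_name, condition in lookup.items():
        if drug_name in text:
            return condition
    return default

print(classify("Injection, drug A, 10 mg"))
print(classify("Some other product, 5 mg"))
```

Applied to the data, this could be something like `df['hcpcs_description'].map(classify)`, provided missing descriptions are dropped first. Substring matching is crude (one drug name can be contained in another), so a real version would probably need matching on a proper identifier.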