# Model Comparison

This notebook compares the trained Naive Bayes classifier with the LDA topic model.

In [1]:
import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import pyLDAvis
import pyLDAvis.sklearn  # note: this module was renamed to pyLDAvis.lda_model in pyLDAvis 3.4+

from importlib import reload  # `imp` is deprecated and was removed in Python 3.12


In [2]:
# load the trained models 
with open('../models/classifier.pkl', 'rb') as f:
    clf = pickle.load(f)
with open('../models/classifier_features.pkl', 'rb') as f:
    clf_features = pickle.load(f)
with open('../models/lda_model.pkl', 'rb') as f:
    tm = pickle.load(f)

In [3]:
# load the dataset 
df = pd.read_pickle('../data/processed/drugs.pkl')

In [4]:
# relabel every class outside the top 5 as 'OTHER' (modifies df in place)
df.loc[~df['target'].isin(['ORAL', 'TOPICAL', 'INTRAVENOUS', 'DENTAL', 'INTRAMUSCULAR']), 'target'] = 'OTHER'
list(df['target'].unique())

['ORAL', 'OTHER', 'TOPICAL', 'INTRAVENOUS', 'INTRAMUSCULAR', 'DENTAL']

## Classifier

As shown in the Classifier section above, the Naive Bayes model is 70% accurate on the test data. Below are its precision, recall, and F1-score for each class.

|Class        |Precision|Recall|F1-Score|
|-------------|---------|------|--------|
|ORAL         |98%      |63%   |77%     |
|TOPICAL      |99%      |80%   |88%     |
|OTHER        |49%      |64%   |56%     |
|INTRAVENOUS  |17%      |96%   |29%     |
|DENTAL       |64%      |92%   |75%     |
|INTRAMUSCULAR|16%      |97%   |27%     |
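
The F1-Score column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick sanity check, the sketch below recomputes it from the rounded precision and recall values in the table. The helper name `f1_score_pct` is introduced here for illustration and is not part of the project code; small differences from the reported values are expected because the inputs are already rounded.

```python
# Rounded (precision %, recall %, reported F1 %) copied from the table above.
REPORTED = {
    "ORAL": (98, 63, 77),
    "TOPICAL": (99, 80, 88),
    "OTHER": (49, 64, 56),
    "INTRAVENOUS": (17, 96, 29),
    "DENTAL": (64, 92, 75),
    "INTRAMUSCULAR": (16, 97, 27),
}


def f1_score_pct(precision_pct, recall_pct):
    """Harmonic mean of precision and recall, both given in percent."""
    if precision_pct + recall_pct == 0:
        return 0.0
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)


for label, (p, r, f1) in REPORTED.items():
    print(f"{label:<13} reported F1 = {f1}%, recomputed = {f1_score_pct(p, r):.1f}%")
```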

The model classifies TOPICAL drugs best, with the highest F1-score (88%). It also does well on ORAL drugs, with an F1-score of 77%. Both classes have very high precision (99% and 98%), which means that when the model predicts ORAL or TOPICAL, we can be very confident in the prediction.

In contrast, the model struggles most with INTRAMUSCULAR and INTRAVENOUS drugs, which have high recall (97% and 96%) but very low precision (16% and 17%). In other words, the model is oversensitive to these classes and predicts them more often than it should. This is reflected in the list of most informative features, which is dominated by the INTRAVENOUS and INTRAMUSCULAR classes.
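
To see how a class can combine near-perfect recall with very low precision, consider the sketch below. The counts are hypothetical, chosen only to land near the INTRAVENOUS row of the table; they are not taken from the project's confusion matrix, and `precision_recall` is an illustrative helper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for one class from its raw counts."""
    predicted = true_positives + false_positives  # everything labeled as the class
    actual = true_positives + false_negatives     # everything truly in the class
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts for a rare class: 100 true documents, 96 of them found,
# plus 470 documents from other classes wrongly given this label.
precision, recall = precision_recall(
    true_positives=96, false_positives=470, false_negatives=4
)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Because the class is rare, even a modest number of false positives drawn from the much larger classes outweighs the true positives.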

In [5]:
clf.show_most_informative_features(25)

Most Informative Features
                  stable = True           INTRAV : TOPICA =   5773.3 : 1.0
                 reapply = True           TOPICA : ORAL   =   5678.6 : 1.0
                      iv = True           INTRAM : TOPICA =   5671.7 : 1.0
                swimming = True           TOPICA : ORAL   =   5494.4 : 1.0
                injected = True           INTRAM : TOPICA =   5389.8 : 1.0
                 diluted = True           INTRAV : TOPICA =   5323.3 : 1.0
                   aging = True           TOPICA : ORAL   =   4970.6 : 1.0
                spectrum = True           TOPICA : ORAL   =   4598.6 : 1.0
          reconstitution = True           INTRAV : TOPICA =   4483.1 : 1.0
                lactated = True           INTRAV : TOPICA =   4308.1 : 1.0
          individualized = True           INTRAM : TOPICA =   4168.2 : 1.0
                     rub = True           TOPICA : ORAL   =   4052.5 : 1.0
                 divided = True           INTRAM : TOPICA =   4047.4 : 1.0

## Topic Model

In [6]:
# bag-of-words
count_vectorizer = CountVectorizer(min_df=5, max_df=0.7)
count_vectors = count_vectorizer.fit_transform(df['tokens_str'])
count_vectors.shape

(85328, 13217)

In [7]:
# This function comes from the blueprint for text analytics 

def display_topics(model, features, no_top_words=5):
    """Print each topic's top words with their share of the topic's total weight (in %)."""
    for topic, words in enumerate(model.components_):
        total = words.sum()
        largest = words.argsort()[::-1]  # word indices sorted by descending weight
        print("\nTopic %02d" % topic)
        for i in range(no_top_words):
            print("  %s (%2.2f)" % (features[largest[i]], abs(words[largest[i]] * 100.0 / total)))

In [None]:
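# compare the LDA topic assignments to the original class tallies
# note: fit_transform refits the loaded model on these vectors; tm.transform(count_vectors)
# would reuse the pickled fit, provided the vocabulary matches the one used in training
document_topic_matrix = tm.fit_transform(count_vectors)
df['topic'] = document_topic_matrix.argmax(axis=1)  # most probable topic per document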
pd.crosstab(df['topic'], df['target'])

In [None]:
display_topics(tm, count_vectorizer.get_feature_names_out())

The topics do not neatly align with the classes. 
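
One way to put a number on this is topic purity: for each topic, the share of its documents that belong to the topic's most common class. The sketch below computes it from paired topic and class labels. The function name `topic_purity` and the toy labels are illustrative assumptions, not project results; with the notebook's data the inputs would be `df['topic']` and `df['target']`.

```python
from collections import Counter, defaultdict


def topic_purity(topics, targets):
    """Return {topic: (dominant_class, share_of_topic_documents)}."""
    counts = defaultdict(Counter)
    for topic, target in zip(topics, targets):
        counts[topic][target] += 1
    purity = {}
    for topic, class_counts in counts.items():
        dominant, n_dominant = class_counts.most_common(1)[0]
        purity[topic] = (dominant, n_dominant / sum(class_counts.values()))
    return purity


# Toy labels (hypothetical): topic 0 is mixed, topic 1 is entirely DENTAL.
toy_topics = [0, 0, 0, 0, 1, 1, 1, 1]
toy_targets = ["ORAL", "ORAL", "OTHER", "TOPICAL",
               "DENTAL", "DENTAL", "DENTAL", "DENTAL"]

purity = topic_purity(toy_topics, toy_targets)
for topic, (label, share) in sorted(purity.items()):
    print(f"Topic {topic}: mostly {label} ({share:.0%} of documents)")
```

A purity near 1.0 means the topic maps onto a single class; values close to the class's overall share of the dataset mean the topic carries little information about the route of administration.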

In [None]:
lda_display = pyLDAvis.sklearn.prepare(tm, count_vectors, count_vectorizer, sort_topics=False)
pyLDAvis.display(lda_display)

Instead of aligning with the drug's route of administration, which is what the classifier was trained to predict, the topics seem to align more with the drug's purpose. We can see this by setting lambda to 0 in the visualization and inspecting the Top-30 Most Relevant Terms for each topic. Note that pyLDAvis numbers topics starting from 1 by default, so Topic 1 here should correspond to topic 0 in the cross-tabulation and in the `display_topics` output.

- Topic 1 seems to be about immunotherapy and cancer-fighting drugs.
- Topic 2 seems to be about psychiatric drugs like antidepressants and antipsychotics.
- Topic 3 seems to be about hormonal drugs for hypothyroidism or adrenal issues.
- Topic 4 seems to be about dental products, which is corroborated by the topic-target cross-tabulation above.
- Topic 5 seems to be a mix of sunscreen and painkillers. This topic is the most aligned with the classifier because it is dominated by the TOPICAL class. One of its most common and most relevant words is "sun," which is also one of the classifier's most informative features for the TOPICAL class.
- Topic 6 seems to be about antibiotics and antidiabetic drugs.

It is interesting that the topics most aligned with a specific class (Topics 4 and 5 with DENTAL and TOPICAL, respectively) are those with the greatest intertopic distances.
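
The lambda slider controls the term relevance score defined in the LDAvis paper by Sievert and Shirley: relevance = λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)). At λ = 1 terms are ranked by their probability within the topic, and at λ = 0 they are ranked purely by lift, which favors terms that are distinctive to the topic. The sketch below illustrates the effect; the probabilities are made up for illustration and the `relevance` helper is not a pyLDAvis function.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    log_prob = math.log(p_term_given_topic)
    log_lift = math.log(p_term_given_topic / p_term)
    return lam * log_prob + (1 - lam) * log_lift


# Hypothetical values: (p(term | topic), p(term) across the whole corpus).
terms = {
    "dose": (0.050, 0.045),       # frequent in the topic, but frequent everywhere
    "sunscreen": (0.010, 0.001),  # rarer in the topic, but concentrated in it
}

rankings = {}
for lam in (1.0, 0.0):
    rankings[lam] = sorted(
        terms, key=lambda t: relevance(terms[t][0], terms[t][1], lam), reverse=True
    )
    print(f"lambda = {lam}: {rankings[lam]}")
```

With λ = 1 the generic term ranks first; with λ = 0 the topic-specific term does, which is why the low-lambda view is useful for interpreting what a topic is about.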

# Recommendations and Next Steps

## 1. More Aggressive Stopword Removal

Some tokens like "dose," "patients," and "mg" are very common. The classifier is less affected because it is fed TF-IDF vectors, which downweight terms that appear in many documents. However, the topic model uses bag-of-words vectors, so these very common tokens dominate many of the topics. Removing them may help produce better-defined topics.
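
A minimal sketch of this step, assuming the `tokens_str` column holds space-separated tokens (as its use with `CountVectorizer` in this notebook suggests). The stopword set is a hypothetical starting point built from the examples above, and `remove_domain_stopwords` is an illustrative helper, not existing project code.

```python
# Hypothetical domain stopwords, taken from the examples in the text.
DOMAIN_STOPWORDS = frozenset({"dose", "patients", "mg"})


def remove_domain_stopwords(tokens_str, stopwords=DOMAIN_STOPWORDS):
    """Drop stopword tokens from a space-separated token string."""
    kept = [tok for tok in tokens_str.split() if tok.lower() not in stopwords]
    return " ".join(kept)


example = "patients should take one 50 mg dose daily"
cleaned = remove_domain_stopwords(example)
print(cleaned)
```

In the notebook this could be applied with `df['tokens_str'].apply(remove_domain_stopwords)` before vectorizing. Alternatively, the same list could be passed to `CountVectorizer` through its `stop_words` parameter, or the `max_df` threshold (currently 0.7) could be lowered so that the most frequent tokens are dropped automatically.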

## 2. Lemmatization

This project did not include lemmatization as a preprocessing step. As a result, many tokens convey nearly the same information, such as "dose," "dosage," and "dosing." Collapsing these into a single lemma could remove feature redundancy and noise.
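
As a toy illustration of the effect, the sketch below collapses variants with a hand-written lookup table. This is a stand-in, not a real lemmatizer: the mapping and the `normalize_tokens` helper are hypothetical. A general-purpose lemmatizer (for example from NLTK or spaCy) may keep "dosage" separate from "dose" because they are distinct dictionary words, so a small domain-specific mapping like this one might still be useful alongside it.

```python
from collections import Counter

# Hypothetical mapping from surface forms to a shared base form.
LEMMA_MAP = {
    "doses": "dose",
    "dosing": "dose",
    "dosage": "dose",
    "dosages": "dose",
}


def normalize_tokens(tokens, lemma_map=LEMMA_MAP):
    """Replace each token by its mapped base form, leaving unknown tokens as-is."""
    return [lemma_map.get(tok, tok) for tok in tokens]


tokens = "adjust dosing if the dosage exceeds the recommended dose".split()
before = Counter(tokens)
after = Counter(normalize_tokens(tokens))
print(f"distinct tokens before: {len(before)}, after: {len(after)}")
print(f"count of 'dose' before: {before['dose']}, after: {after['dose']}")
```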

## 3. Class Balancing

The classifier's target, route of administration, is heavily imbalanced across classes. This project did not balance the classes before training the classifier or the topic model. Future work should examine how balancing the training data affects the classifier's performance and the extracted topics.
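
One simple balancing strategy is random undersampling: keep only as many documents per class as the rarest class has. A minimal sketch under that assumption follows. The `undersample` helper, the seed, and the toy records are illustrative, and other strategies such as oversampling may suit the data better.

```python
import random
from collections import Counter, defaultdict


def undersample(records, label_of, seed=0):
    """Randomly keep the same number of records per class (the size of the rarest class)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    if not by_class:
        return []
    n_per_class = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_per_class))
    rng.shuffle(balanced)  # avoid leaving the records grouped by class
    return balanced


# Toy (document id, label) records with an 8 / 3 / 2 imbalance (hypothetical).
toy_docs = (
    [(f"doc{i}", "ORAL") for i in range(8)]
    + [(f"doc{i}", "TOPICAL") for i in range(8, 11)]
    + [(f"doc{i}", "DENTAL") for i in range(11, 13)]
)
balanced = undersample(toy_docs, label_of=lambda rec: rec[1])
class_counts = Counter(label for _, label in balanced)
print(class_counts)
```

Balancing would normally be applied to the training split only, so that test metrics still reflect the real class distribution.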

# Link to GitHub Repo

https://github.com/andrewabeles/drug-labels