# Data Science Fundamentals - Assignment 2
##### By Alexandra de Carvalho, nmec 93346  

This work combines text analysis and machine learning techniques, such as prediction and classification, in a real-world case study. The tasks are to predict drug effectiveness and side effects ratings from text reviews, and to classify review texts into one of three categories: benefits review, side effects review, or general comments review.

In [53]:
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from textblob import TextBlob
from sklearn import metrics
import pandas as pd
import numpy as np
import re
import nltk

### Dataset Preprocessing

Preprocessing is a key step in any analysis and can affect the quality of the results. The dataset provided is already divided into training and testing sets, but it is useful for both to go through the same preprocessing pipeline. We load each set into a pandas DataFrame. Each record contains the name of the drug being reviewed, the condition it was used to treat, the overall rating (numerical, 1-10) given by the user, a categorical drug effectiveness rating and a benefits text review, a categorical side effects rating and a side effects text review, and an overall comment on the drug.

In [2]:
df_train = pd.read_csv('dataset/drugLibTrain_raw.tsv', sep='\t')
df_test = pd.read_csv('dataset/drugLibTest_raw.tsv', sep='\t')

Given the goal of the assignment, the first (unnamed) column and the condition column are not needed and won't be used. Therefore, we can drop them.

In [3]:
df_train = df_train.loc[:, ~df_train.columns.str.contains('^Unnamed')]
df_train = df_train.loc[:, ~df_train.columns.str.contains('condition')]

df_test = df_test.loc[:, ~df_test.columns.str.contains('^Unnamed')]
df_test = df_test.loc[:, ~df_test.columns.str.contains('condition')]

df_train

,urlDrugName,rating,effectiveness,sideEffects,benefitsReview,sideEffectsReview,commentsReview
0,enalapril,4,Highly Effective,Mild Side Effects,slowed the progression of left ventricular dys...,"cough, hypotension , proteinuria, impotence , ...","monitor blood pressure , weight and asses for ..."
1,ortho-tri-cyclen,1,Highly Effective,Severe Side Effects,Although this type of birth control has more c...,"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...","I Hate This Birth Control, I Would Not Suggest..."
2,ponstel,10,Highly Effective,No Side Effects,I was used to having cramps so badly that they...,Heavier bleeding and clotting than normal.,I took 2 pills at the onset of my menstrual cr...
3,prilosec,3,Marginally Effective,Mild Side Effects,The acid reflux went away for a few months aft...,"Constipation, dry mouth and some mild dizzines...",I was given Prilosec prescription at a dose of...
4,lyrica,2,Marginally Effective,Severe Side Effects,I think that the Lyrica was starting to help w...,I felt extremely drugged and dopey. Could not...,See above
...,...,...,...,...,...,...,...
3102,vyvanse,10,Highly Effective,Mild Side Effects,"Increased focus, attention, productivity. Bett...","Restless legs at night, insomnia, headache (so...","I took adderall once as a child, and it made m..."
3103,zoloft,1,Ineffective,Extremely Severe Side Effects,Emotions were somewhat blunted. Less moodiness.,"Weight gain, extreme tiredness during the day,...",I was on Zoloft for about 2 years total. I am ...
3104,climara,2,Marginally Effective,Moderate Side Effects,---,Constant issues with the patch not staying on....,---
3105,trileptal,8,Considerably Effective,Mild Side Effects,Controlled complex partial seizures.,"Dizziness, fatigue, nausea",Started at 2 doses of 300 mg a day and worked ...


It is important that our class labels are numeric, because machine learning models operate on numbers. Therefore, let's replace the text classes with meaningful numbers, on an ordinal scale from 0 to 4.

In [4]:
# Ordinal encoding; the result is assigned back rather than relying on inplace=True on a column selection
side_effects_map = {'No Side Effects': 0, 'Mild Side Effects': 1, 'Moderate Side Effects': 2, 'Severe Side Effects': 3, 'Extremely Severe Side Effects': 4}
df_train["sideEffects"] = df_train["sideEffects"].replace(side_effects_map)
df_test["sideEffects"] = df_test["sideEffects"].replace(side_effects_map)
df_train

,urlDrugName,rating,effectiveness,sideEffects,benefitsReview,sideEffectsReview,commentsReview
0,enalapril,4,Highly Effective,1,slowed the progression of left ventricular dys...,"cough, hypotension , proteinuria, impotence , ...","monitor blood pressure , weight and asses for ..."
1,ortho-tri-cyclen,1,Highly Effective,3,Although this type of birth control has more c...,"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...","I Hate This Birth Control, I Would Not Suggest..."
2,ponstel,10,Highly Effective,0,I was used to having cramps so badly that they...,Heavier bleeding and clotting than normal.,I took 2 pills at the onset of my menstrual cr...
3,prilosec,3,Marginally Effective,1,The acid reflux went away for a few months aft...,"Constipation, dry mouth and some mild dizzines...",I was given Prilosec prescription at a dose of...
4,lyrica,2,Marginally Effective,3,I think that the Lyrica was starting to help w...,I felt extremely drugged and dopey. Could not...,See above
...,...,...,...,...,...,...,...
3102,vyvanse,10,Highly Effective,1,"Increased focus, attention, productivity. Bett...","Restless legs at night, insomnia, headache (so...","I took adderall once as a child, and it made m..."
3103,zoloft,1,Ineffective,4,Emotions were somewhat blunted. Less moodiness.,"Weight gain, extreme tiredness during the day,...",I was on Zoloft for about 2 years total. I am ...
3104,climara,2,Marginally Effective,2,---,Constant issues with the patch not staying on....,---
3105,trileptal,8,Considerably Effective,1,Controlled complex partial seizures.,"Dizziness, fatigue, nausea",Started at 2 doses of 300 mg a day and worked ...


In [5]:
# Ordinal encoding; the result is assigned back rather than relying on inplace=True on a column selection
effectiveness_map = {'Ineffective': 0, 'Marginally Effective': 1, 'Moderately Effective': 2, 'Considerably Effective': 3, 'Highly Effective': 4}
df_train["effectiveness"] = df_train["effectiveness"].replace(effectiveness_map)
df_test["effectiveness"] = df_test["effectiveness"].replace(effectiveness_map)
df_train

,urlDrugName,rating,effectiveness,sideEffects,benefitsReview,sideEffectsReview,commentsReview
0,enalapril,4,4,1,slowed the progression of left ventricular dys...,"cough, hypotension , proteinuria, impotence , ...","monitor blood pressure , weight and asses for ..."
1,ortho-tri-cyclen,1,4,3,Although this type of birth control has more c...,"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...","I Hate This Birth Control, I Would Not Suggest..."
2,ponstel,10,4,0,I was used to having cramps so badly that they...,Heavier bleeding and clotting than normal.,I took 2 pills at the onset of my menstrual cr...
3,prilosec,3,1,1,The acid reflux went away for a few months aft...,"Constipation, dry mouth and some mild dizzines...",I was given Prilosec prescription at a dose of...
4,lyrica,2,1,3,I think that the Lyrica was starting to help w...,I felt extremely drugged and dopey. Could not...,See above
...,...,...,...,...,...,...,...
3102,vyvanse,10,4,1,"Increased focus, attention, productivity. Bett...","Restless legs at night, insomnia, headache (so...","I took adderall once as a child, and it made m..."
3103,zoloft,1,0,4,Emotions were somewhat blunted. Less moodiness.,"Weight gain, extreme tiredness during the day,...",I was on Zoloft for about 2 years total. I am ...
3104,climara,2,1,2,---,Constant issues with the patch not staying on....,---
3105,trileptal,8,3,1,Controlled complex partial seizures.,"Dizziness, fatigue, nausea",Started at 2 doses of 300 mg a day and worked ...


Now, we should check for missing values. The test dataframe has none. The training dataframe has 10, all in text review columns (2 in sideEffectsReview and 8 in commentsReview), which are the inputs to our models, so they would affect our work. Since they touch at most 10 of more than 3000 rows, removing those rows should not affect the reliability of our analysis. So, let's drop those records, as well as any duplicated records.

In [6]:
df_train.isna().sum()

urlDrugName          0
rating               0
effectiveness        0
sideEffects          0
benefitsReview       0
sideEffectsReview    2
commentsReview       8
dtype: int64

In [7]:
df_test.isna().sum()

urlDrugName          0
rating               0
effectiveness        0
sideEffects          0
benefitsReview       0
sideEffectsReview    0
commentsReview       0
dtype: int64

In [8]:
df_train = df_train.dropna()
df_train = df_train.drop_duplicates()

df_test = df_test.dropna()
df_test = df_test.drop_duplicates()
df_test

,urlDrugName,rating,effectiveness,sideEffects,benefitsReview,sideEffectsReview,commentsReview
0,biaxin,9,3,1,The antibiotic may have destroyed bacteria cau...,"Some back pain, some nauseau.",Took the antibiotics for 14 days. Sinus infect...
1,lamictal,9,4,1,Lamictal stabilized my serious mood swings. On...,"Drowsiness, a bit of mental numbness. If you t...",Severe mood swings between hypomania and depre...
2,depakene,4,2,3,Initial benefits were comparable to the brand ...,"Depakene has a very thin coating, which caused...",Depakote was prescribed to me by a Kaiser psyc...
3,sarafem,10,4,0,It controlls my mood swings. It helps me think...,I didnt really notice any side effects.,This drug may not be for everyone but its wond...
4,accutane,10,4,1,Within one week of treatment superficial acne ...,Side effects included moderate to severe dry s...,Drug was taken in gelatin tablet at 0.5 mg per...
...,...,...,...,...,...,...,...
1031,accutane,7,3,3,Detoxing effect by pushing out the system thro...,"Hairloss, extreme dry skin, itchiness, raises ...",Treatment period is 3 months/12 weeks. Dosage ...
1032,proair-hfa,10,4,0,"The albuterol relieved the constriction, irrit...",I have experienced no side effects.,I use the albuterol as needed because of aller...
1033,accutane,8,3,2,Serve Acne has turned to middle,"Painfull muscles, problems with seeing at night","This drug is highly teratogenic ,females must ..."
1034,divigel,10,4,0,"My overall mood, sense of well being, energy l...",No side effects of any kind were noted or appa...,Divigel is a topically applied Bio-Identical H...


In [9]:
df_test['effectiveness'].value_counts()

4    411
3    308
2    155
0     82
1     76
Name: effectiveness, dtype: int64

### Prediction

The model will perform prediction using Multinomial Naive Bayes, which is suitable for classification with discrete features. The only hyperparameter to tune is the smoothing parameter alpha, whose value is selected by a grid search based on cross-validated performance. The prediction function is shown below. We use 5 splits, so that each fold holds out 20% of the training data for validation.

In [10]:
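def predictor():
    """Grid search over the MultinomialNB smoothing parameter alpha, with 5-fold cross-validation."""
    param_grid = {
        'alpha': [0.01, 0.1, 0.5, 1.0, 10.0],
    }

    # 5 contiguous folds (no shuffling): each fold validates on 20% of the training data
    cv = KFold(n_splits=5, shuffle=False)
    return GridSearchCV(estimator=MultinomialNB(), param_grid=param_grid, cv=cv)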
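
To make the fold sizes concrete, here is a minimal standard-library sketch of how five contiguous, unshuffled folds partition a dataset. The helper `contiguous_folds` is hypothetical: it only mimics the index layout of `KFold(n_splits=5, shuffle=False)` and is not scikit-learn's implementation.

```python
def contiguous_folds(n_samples, n_splits=5):
    """Return (start, stop) index ranges of the validation folds, in order.

    Hypothetical helper: the first n_samples % n_splits folds get one extra sample.
    """
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        folds.append((start, start + size))
        start += size
    return folds


folds = contiguous_folds(10)
for start, stop in folds:
    share = (stop - start) / 10
    print(f"validate on rows {start}..{stop - 1} ({share:.0%} of the data)")
```

Because the folds are contiguous, the row order matters: if the rows are grouped by class, a validation fold may contain only some of the classes. Shuffling or a stratified splitter would be possible alternatives.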

### Text Processing

The raw text of the reviews cannot be fed directly to the predictor: it must first be converted into a numerical vector of fixed size. To process the text in a meaningful way, the first step is to create a set of stopwords, common words that we can discard as not relevant to the meaning of the message (warning: sometimes some of these words are important!). The ones used here are those provided by the NLTK module. Because, in this case, we need short words like 'bad' or 'not' to understand the reviews, we will not filter out short words.

In [11]:
# Load one stopword per line; the file is closed automatically
with open('stopwords.txt', 'r') as f:
    stopwords = set(line.strip() for line in f)

The tokenization splits the text with a regular expression based on '\W', which matches any non-alphanumeric character such as spaces and punctuation ('\W' does not match underscores, so this character is added to the expression explicitly). While splitting, the tokenization also discards tokens made up only of digits and tokens present in the stopword set. The remaining tokens are converted to lower case and, finally, stemmed with the Snowball Stemmer when stemming is enabled.

In [42]:
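stemmer = SnowballStemmer(language='english')  # created once and reused for every token

def tokenization(reviewText, stemming=True):
    # Split on runs of non-word characters or underscores; drop empty, numeric and stopword tokens.
    # Note: the stopword check runs before lower-casing, so it is case-sensitive.
    tokens_list = [word.lower() for word in re.split(r'[\W_]+', reviewText) if word and not word.isnumeric() and word not in stopwords]
    if stemming:
        return " ".join(stemmer.stem(word) for word in tokens_list)
    return " ".join(tokens_list)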
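
As a worked example of the steps described above, the following sketch runs a simplified tokenizer on one sentence. It uses only the standard library, skips stemming, and relies on a hypothetical three-word stopword set (`example_stopwords`) and a made-up input sentence, so it illustrates the idea rather than reproducing the notebook's exact output.

```python
import re

# Hypothetical stopword set, for illustration only
# (the notebook loads its own from stopwords.txt)
example_stopwords = {"and", "it", "was"}


def simple_tokenize(text, stopwords):
    """Split on non-word characters and underscores, filter, then lower-case."""
    words = re.split(r"[\W_]+", text)
    kept = [w.lower() for w in words
            if w and not w.isnumeric() and w not in stopwords]
    return " ".join(kept)


cleaned = simple_tokenize("took 2 pills_daily, and it was not bad!", example_stopwords)
print(cleaned)  # took pills daily not bad
```

The number "2" and the stopwords are removed, the underscore acts as a separator, and "not" and "bad" survive because short words are not filtered.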

### Feature Selection - Bag of Words (BoW)

The first problem addressed is the classification of each review as a general review, a side effects review, or a benefits review. We start with this problem to apply BoW, which is a simpler approach to feature selection by text vectorization. Since the review types should use distinct vocabularies, it should be the easiest problem in which to apply BoW, even though more words mean a longer and sparser vector.

The BoW approach counts the number of times each distinct word occurs in each document of the corpus, modelling each document as a feature vector of word frequencies. To construct the bag of words, let's first concatenate the three review columns into a single corpus, and then use the CountVectorizer class from scikit-learn.

The steps here are instantiating the vectorizer and fitting it to the training data (making it learn the vocabulary). This creates a sparse matrix, which can then be converted to a dense matrix with .toarray(). A dense matrix stores every zero explicitly, which is wasteful, since most documents use a very small subset of the vocabulary and so, typically, more than 99% of the values are zero. But this is what we will use to train our model: each token frequency is treated as a feature, and the vector of all token frequencies for a given document is considered a multivariate sample.

Then, in order to make a prediction, the test data must have the same features as the training data, so we use the transform() method, which reuses the vocabulary learned during fitting. It simply drops every unknown word: since the model was not trained with that word, it has no information about how it relates to the result, and so cannot base a prediction on it.
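
The fit/transform split can be illustrated with a small standard-library sketch. It is not `CountVectorizer` itself: the toy documents and the helper `to_counts` are assumptions made for this example, and real tokenization is reduced to splitting on spaces.

```python
from collections import Counter

train_docs = ["dry mouth and nausea", "nausea headache nausea"]
test_doc = "nausea dizziness dry"

# "fit": learn a sorted vocabulary from the training documents only
vocabulary = sorted({word for doc in train_docs for word in doc.split()})


def to_counts(doc, vocabulary):
    """Count vocabulary words in doc; words outside the vocabulary are ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


# "transform": every document becomes a vector with one entry per vocabulary word
train_matrix = [to_counts(doc, vocabulary) for doc in train_docs]
test_vector = to_counts(test_doc, vocabulary)

print(vocabulary)    # ['and', 'dry', 'headache', 'mouth', 'nausea']
print(train_matrix)  # [[1, 1, 0, 1, 1], [0, 0, 1, 0, 2]]
print(test_vector)   # [0, 1, 0, 0, 1] -> "dizziness" is dropped
```

The test vector has the same length as the training vectors, and the unseen word "dizziness" contributes nothing to it.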

In [13]:
X_train = np.concatenate((df_train['benefitsReview'], df_train['sideEffectsReview'],df_train['commentsReview']))
X_test = np.concatenate((df_test['benefitsReview'], df_test['sideEffectsReview'],df_test['commentsReview']))

y = pd.Series(['benefitsReview', 'sideEffectsReview', 'commentsReview']).repeat([df_train['benefitsReview'].size,df_train['sideEffectsReview'].size, df_train['commentsReview'].size]).to_list()
y_test = pd.Series(['benefitsReview', 'sideEffectsReview', 'commentsReview']).repeat([df_test['benefitsReview'].size,df_test['sideEffectsReview'].size, df_test['commentsReview'].size]).tolist()

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform([tokenization(val) for val in X_train]).toarray() # creating dense feature matrix
X_test = vectorizer.transform([tokenization(val) for val in X_test]).toarray() # creating the feature matrix

predictr = predictor()

##### Results

Now, we want to train our Multinomial Naive Bayes model, using the fit() method. Then we apply our model to the testing dataset.

In [14]:
predictr.fit(X_train,y)
y_prediction = predictr.predict(X_test)

In [15]:
print("Accuracy: ", metrics.accuracy_score(y_test, y_prediction))
print("Recall: ", metrics.recall_score(y_test, y_prediction, average='weighted'))  # support-weighted recall coincides with accuracy
print("Precision: ", metrics.precision_score(y_test, y_prediction, average='weighted'))
print("F1: ", metrics.f1_score(y_test, y_prediction, average='weighted'))
print("Confusion Matrix: \n", metrics.confusion_matrix(y_test, y_prediction))

Accuracy:  0.7790697674418605
Recall:  0.7790697674418605
Precision:  0.7793466911820623
F1:  0.7788435692455223
Confusion Matrix: 
 [[752 164 116]
 [149 809  74]
 [ 72 109 851]]
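
As a cross-check, the weighted scores can be recomputed by hand from the confusion matrix above. This is a plain-Python sketch, assuming the usual scikit-learn layout (rows are true classes, columns are predicted classes); the variable names are chosen for this example.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = [[752, 164, 116],
      [149, 809, 74],
      [72, 109, 851]]

n_classes = len(cm)
support = [sum(row) for row in cm]  # true samples per class
predicted = [sum(row[j] for row in cm) for j in range(n_classes)]  # predictions per class
total = sum(support)

recall = [cm[i][i] / support[i] for i in range(n_classes)]
precision = [cm[i][i] / predicted[i] for i in range(n_classes)]

# "weighted" average: each class is weighted by its share of true samples
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

print(f"accuracy={accuracy:.4f} recall={weighted_recall:.4f} precision={weighted_precision:.4f}")
```

With support-weighted averaging, recall reduces to the fraction of correctly classified samples, so it always coincides with the accuracy; weighted precision does not have this property.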


### Feature Selection - TF-IDF Score

Now, let's address the prediction of drug effectiveness based on the benefits reviews. 

TF-IDF stands for "term frequency - inverse document frequency". Its improvement over the simple term frequency is that it is intended to reflect how important a word is to a document within the corpus: words that occur in many documents are weighted down. Hence, we apply TF-IDF scoring to this problem.
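
The weighting can be sketched in a few lines of plain Python. This example assumes the smoothed idf, ln((1 + n) / (1 + df)) + 1, and the L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the toy corpus and the helper `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical data)
docs = [["pain", "relief", "fast"], ["pain", "gone"], ["pain", "relief"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """TF-IDF weights for one document, L2-normalized."""
    tf = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


weights = tfidf(docs[0])
for term, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "pain" appears in every document of the corpus and gets the lowest weight, while "fast" appears only once in the corpus and gets the highest.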

In [44]:
X_train = df_train['benefitsReview']
X_test = df_test['benefitsReview']

y = df_train['effectiveness']
y_test = df_test['effectiveness']

vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform([tokenization(val, False) for val in X_train]).toarray()
X_test = vectorizer.transform([tokenization(val, False) for val in X_test]).toarray() # creating the feature matrix

predictr = predictor()

In [45]:
predictr.fit(X_train,y)
y_prediction = predictr.predict(X_test)

In [46]:
print("Accuracy: ", metrics.accuracy_score(y_test, y_prediction))
print("Recall: ", metrics.recall_score(y_test, y_prediction, average='weighted'))  # support-weighted recall coincides with accuracy
print("Precision: ", metrics.precision_score(y_test, y_prediction, average='weighted'))
print("F1: ", metrics.f1_score(y_test, y_prediction, average='weighted'))
print("Confusion Matrix: \n", metrics.confusion_matrix(y_test, y_prediction))

Accuracy:  0.4496124031007752
Recall:  0.4496124031007752
Precision:  0.4576128867509117
F1:  0.39000125688845616
Confusion Matrix: 
 [[ 24   1   1   9  47]
 [  5   2   5  26  38]
 [  2   1  12  45  95]
 [  3   0   6  91 208]
 [  0   0   2  74 335]]


### N-Gram

For the last task, predicting the side effects rating from the side effects reviews, let's use TF-IDF again, but with n-grams. This technique takes into account the frequency and importance not only of single terms, but also of sequences of n consecutive terms; here we use single terms and pairs of terms (ngram_range=(1,2)). This captures some of the context in which a term is used, which can lead to better predictions.
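
To show what the extra features look like, here is a minimal sketch that builds unigrams and bigrams from a token list. The helper `word_ngrams` is hypothetical and only imitates the effect of `ngram_range=(1, 2)`; it is not the vectorizer's own code.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all n-grams of consecutive tokens for min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


features = word_ngrams("no side effects".split())
print(features)  # ['no', 'side', 'effects', 'no side', 'side effects']
```

Provided that "no" survives stopword filtering, a bigram such as "no side" carries a negation that the unigrams "side" and "effects" cannot express on their own.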

In [47]:
X_train = df_train['sideEffectsReview']
X_test = df_test['sideEffectsReview']

y = df_train['sideEffects']
y_test = df_test['sideEffects']

vectorizer = TfidfVectorizer(ngram_range=(1,2))

X_train = vectorizer.fit_transform([tokenization(val, False) for val in X_train]).toarray() # creating the feature matrix
X_test = vectorizer.transform([tokenization(val, False) for val in X_test]).toarray() # creating the feature matrix

predictr = predictor()

In [48]:
predictr.fit(X_train,y)
y_prediction = predictr.predict(X_test)

In [49]:
print("Accuracy: ", metrics.accuracy_score(y_test, y_prediction))
print("Recall: ", metrics.recall_score(y_test, y_prediction, average='weighted'))  # support-weighted recall coincides with accuracy
print("Precision: ", metrics.precision_score(y_test, y_prediction, average='weighted'))
print("F1: ", metrics.f1_score(y_test, y_prediction, average='weighted'))
print("Confusion Matrix: \n", metrics.confusion_matrix(y_test, y_prediction))

Accuracy:  0.5300387596899225
Recall:  0.5300387596899225
Precision:  0.5603469898054614
F1:  0.48805586575523985
Confusion Matrix: 
 [[197  62   8   0   0]
 [ 28 272  25   2   0]
 [ 11 165  54   6   0]
 [  3  71  29  18   1]
 [  4  34  20  16   6]]


### Sentiment Analysis

Sentiment analysis identifies and extracts subjective information from text, measuring the attitudes, sentiments, evaluations, and emotions of the writer. For this, we will use TextBlob. For each text, it returns a polarity score (from -1, negative, to 1, positive) and a subjectivity score (from 0, objective, to 1, subjective).

In [61]:
X_train = df_train['commentsReview']
X_test = df_test['commentsReview']

tbs = [TextBlob(val) for val in X_train]

In [81]:
print("Positive Comments:\n")
for review in [tb for tb in tbs if tb.polarity > 0.35 and tb.subjectivity > 0.5]:
    print(review, review.sentiment,'\n')
    print("*"*25)

Positive Comments:

I take 600mg of Gabapentin three times day. It seems to be effective but I feel I need a hire dosage as the longer I take it the more I experience relapses.I first started out with 300mg three times a day but it was ineffective. Sentiment(polarity=0.45, subjectivity=0.5444444444444444) 

*************************
Take medication twice a day in combination with good skin care Sentiment(polarity=0.7, subjectivity=0.6000000000000001) 

*************************
Taken for anxiety,initially in an acute stage. This was followed by a decision by patient (myself) to continue for 3 months in order to reduce arousal levels whilst undertaking therapy using the theraputic model ACT. Sentiment(polarity=0.6, subjectivity=0.9) 

*************************
I started on the medication 3yrs ago and still take it today with continued good results and no side effects. Sentiment(polarity=0.7, subjectivity=0.6000000000000001) 

*************************
I take one 300mg Tekturna per day, 

In [83]:
print("Negative Comments:\n")
for review in [tb for tb in tbs if tb.polarity < -0.35 and tb.subjectivity > 0.5]:
    print(review, review.sentiment,'\n')
    print("*"*25)

Negative Comments:

I Hate This Birth Control, I Would Not Suggest This To Anyone. Sentiment(polarity=-0.8, subjectivity=0.9) 

*************************
I have taken both the shots and the pills (separately) and they both worked well.  I would need to take the pills before getting sick to stomach, because if after, I could not keep pills down.  The shot worked regardless of when I took it. Sentiment(polarity=-0.43492063492063493, subjectivity=0.573015873015873) 

*************************
The cream is applied at bedtime for five nights a week for Basal Cell Carcinoma.  Three night per week for actinic keratosis.  You do this for eight weeks in a row.  By the end of eight weeks, the bad cells react gradually as scabby spots.  This is how you know the medicine is working.  You put the cream on the spots your doctor tells you to, with about an inch margin all around it. Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666) 

*************************
This is by far the