# Data Science Fundamentals - Assignment 2
##### By Alexandra de Carvalho, nmec 93346  

This work aims at combining text analysis and machine learning techniques, such as prediction and classification, into a real-world case study. The tasks at hand are predicting drug effectiveness and side effects ratings, based on text reviews, as well as classifying review texts into one of three possible categories: benefit review, side effect review, or general review category. 

In [3]:
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
import pandas as pd
import numpy as np
import re

### Dataset Preprocessing

Preprocessing is a key step in any analysis and can affect how good the results will be. The dataset provided is already divided into training and testing sets, but it is usefull that both go through the same preprocessing pipeline. We will load them to pandas dataframe. We can see that we have a name for the drug being reviewed, the condition it was used to treat, the overall rating (numerical 1-10) by the user, a drug effectiveness categorical rating and a benefit text review, a side effects categorical rating and a side effects text review, and an overall comment on the drug.

In [4]:
df_train = pd.read_csv('dataset/drugLibTrain_raw.tsv', sep='\t')
df_test = pd.read_csv('dataset/drugLibTest_raw.tsv', sep='\t')

Having the goal of the assignment in mind, we can understand that the first, unnamed, column, as well as the condition, aren't essential columns, and won't be used. Therefore, we can delete them.

In [5]:
df_train = df_train.loc[:, ~df_train.columns.str.contains('^Unnamed')]
df_train = df_train.loc[:, ~df_train.columns.str.contains('condition')]

df_test = df_test.loc[:, ~df_test.columns.str.contains('^Unnamed')]
df_test = df_test.loc[:, ~df_test.columns.str.contains('condition')]

df_train

Unnamed: 0,urlDrugName,rating,effectiveness,sideEffects,benefitsReview,sideEffectsReview,commentsReview
0,enalapril,4,Highly Effective,Mild Side Effects,slowed the progression of left ventricular dys...,"cough, hypotension , proteinuria, impotence , ...","monitor blood pressure , weight and asses for ..."
1,ortho-tri-cyclen,1,Highly Effective,Severe Side Effects,Although this type of birth control has more c...,"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...","I Hate This Birth Control, I Would Not Suggest..."
2,ponstel,10,Highly Effective,No Side Effects,I was used to having cramps so badly that they...,Heavier bleeding and clotting than normal.,I took 2 pills at the onset of my menstrual cr...
3,prilosec,3,Marginally Effective,Mild Side Effects,The acid reflux went away for a few months aft...,"Constipation, dry mouth and some mild dizzines...",I was given Prilosec prescription at a dose of...
4,lyrica,2,Marginally Effective,Severe Side Effects,I think that the Lyrica was starting to help w...,I felt extremely drugged and dopey. Could not...,See above
...,...,...,...,...,...,...,...
3102,vyvanse,10,Highly Effective,Mild Side Effects,"Increased focus, attention, productivity. Bett...","Restless legs at night, insomnia, headache (so...","I took adderall once as a child, and it made m..."
3103,zoloft,1,Ineffective,Extremely Severe Side Effects,Emotions were somewhat blunted. Less moodiness.,"Weight gain, extreme tiredness during the day,...",I was on Zoloft for about 2 years total. I am ...
3104,climara,2,Marginally Effective,Moderate Side Effects,---,Constant issues with the patch not staying on....,---
3105,trileptal,8,Considerably Effective,Mild Side Effects,Controlled complex partial seizures.,"Dizziness, fatigue, nausea",Started at 2 doses of 300 mg a day and worked ...


Now, we should investigate if there are any missing values. In the test dataframe, there are no missing values. Because the missing values on the training dataframe are on text reviews, they will affect our work. Since there are only 10 missing values out of 3000+ rows, removing those rows won't affect the reliability of our analysis. So, let's drop those records, as well as duplicated records.

In [6]:
df_train.isna().sum()

urlDrugName          0
rating               0
effectiveness        0
sideEffects          0
benefitsReview       0
sideEffectsReview    2
commentsReview       8
dtype: int64

In [7]:
df_test.isna().sum()

urlDrugName          0
rating               0
effectiveness        0
sideEffects          0
benefitsReview       0
sideEffectsReview    0
commentsReview       0
dtype: int64

In [8]:
df_train = df_train.dropna()
df_train = df_train.drop_duplicates()

df_test = df_test.dropna()
df_test = df_test.drop_duplicates()
df_test

Unnamed: 0,urlDrugName,rating,effectiveness,sideEffects,benefitsReview,sideEffectsReview,commentsReview
0,biaxin,9,Considerably Effective,Mild Side Effects,The antibiotic may have destroyed bacteria cau...,"Some back pain, some nauseau.",Took the antibiotics for 14 days. Sinus infect...
1,lamictal,9,Highly Effective,Mild Side Effects,Lamictal stabilized my serious mood swings. On...,"Drowsiness, a bit of mental numbness. If you t...",Severe mood swings between hypomania and depre...
2,depakene,4,Moderately Effective,Severe Side Effects,Initial benefits were comparable to the brand ...,"Depakene has a very thin coating, which caused...",Depakote was prescribed to me by a Kaiser psyc...
3,sarafem,10,Highly Effective,No Side Effects,It controlls my mood swings. It helps me think...,I didnt really notice any side effects.,This drug may not be for everyone but its wond...
4,accutane,10,Highly Effective,Mild Side Effects,Within one week of treatment superficial acne ...,Side effects included moderate to severe dry s...,Drug was taken in gelatin tablet at 0.5 mg per...
...,...,...,...,...,...,...,...
1031,accutane,7,Considerably Effective,Severe Side Effects,Detoxing effect by pushing out the system thro...,"Hairloss, extreme dry skin, itchiness, raises ...",Treatment period is 3 months/12 weeks. Dosage ...
1032,proair-hfa,10,Highly Effective,No Side Effects,"The albuterol relieved the constriction, irrit...",I have experienced no side effects.,I use the albuterol as needed because of aller...
1033,accutane,8,Considerably Effective,Moderate Side Effects,Serve Acne has turned to middle,"Painfull muscles, problems with seeing at night","This drug is highly teratogenic ,females must ..."
1034,divigel,10,Highly Effective,No Side Effects,"My overall mood, sense of well being, energy l...",No side effects of any kind were noted or appa...,Divigel is a topically applied Bio-Identical H...


### Prediction

The pipeline will evaluate the treated text as input and preform prediction using a Decision Tree Classifier to obtain the results. The hyperparameter tunning will be preformed based on the accuracy, with a grid search. The hyperparameters used for tunning the model are the criterion, to measure the quality of the split (with values "gini" for the Giny impurity and "entropy" for the information gain), and the max_depth of the tree. The prediction function is shown below. In this case, we are using 5 splits to get 20% of the data in each fold.

In [9]:
def decision():
    
    l1 = np.arange(0.2, 0.9, 0.1)

    param_grid = {
        'criterion' : ['gini', 'entropy'],
        'max_depth' : [2,4,6,8,10,12]
    }

    i = 1
    cv = KFold(n_splits=5, shuffle=False)

    return GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=param_grid, cv=cv, scoring='accuracy')

### Text Processing

In order to understand text, it needs to be processed, so that the meaningful words can be extracted. The first step is to create a set of stopwords. The ones used here are the ones given by NLTK module. Because, in this case, we need words like 'bad' or 'not' to understand the reviews, we will not proceed with filtering small length words.

In [10]:
f = open('stopwords.txt','r')
stopwords = set(map(lambda x: x.strip(), f.readlines()))
f.close()

The tokenization splits by the regular expression '\W+', which splits any alphanumeric word by spaces and punctuation (the original expression doesn't remove underscores, so this character is then added to the list). Then, the tokenization also removes words with only numbers, and words present in the stopword set. The final step is converting all characters to lower case. Then, the stemming is done with the use of the Snowball Stemmer.

In [11]:
def tokenization(reviewText):
    tokens_list = [word.lower() for word in re.split('[\W+,\_]', reviewText) if not word.isnumeric() and not word in stopwords]
    return " ".join([SnowballStemmer(language='english').stem(word) for word in tokens_list])

### Feature Selection - Bag of Words (BoW)

The first problem addressed is the classification of the review into general review, side effects review, or beneficts review. We are starting with this problem to apply the BoW, which is a simpler approach to feature selection. Since the review types should have distinct words between them, it should be the easiest problem in which to apply BoW, despite more words meaning a more sparse and a longer vector.

The BoW approach captures the number of times each different word occurrs in each document of the corpus, without considering any order, modelling the documents by a feacture vector, representing the frequency of the words in it. To construct the bag of words, based on the word counts in the respective documents, let's first append the three columns as our corpus' Series, and then use the CountVectorizer class, from scikit-learn.

In [12]:
X_train = np.concatenate((df_train['benefitsReview'], df_train['sideEffectsReview'],df_train['commentsReview']))
X_test = np.concatenate((df_test['benefitsReview'], df_test['sideEffectsReview'],df_test['commentsReview']))

y = pd.Series(['benefitsReview', 'sideEffectsReview', 'commentsReview']).repeat([df_train['benefitsReview'].size,df_train['sideEffectsReview'].size, df_train['commentsReview'].size]).to_list()
y_test = pd.Series(['benefitsReview', 'sideEffectsReview', 'commentsReview']).repeat([df_test['benefitsReview'].size,df_test['sideEffectsReview'].size, df_test['commentsReview'].size]).tolist()

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform([tokenization(val) for val in X_train]).toarray() # creating the feature matrix
X_test = vectorizer.transform([tokenization(val) for val in X_test]).toarray() # creating the feature matrix

dt = decision()

##### Results

Now that we built the bag of words, we will use it as input and apply our grid search to find the best predictor!

In [2]:
dt.fit(X_train,y)
y_prediction = dt.predict(X_test)

NameError: name 'dt' is not defined

In [None]:
print("Accuracy: ", metrics.accuracy_score(y_test, y_prediction))
print("Recall: ", metrics.precision_score(y_test, y_prediction, average='weighted'))
print("Precision: ", metrics.precision_score(y_test, y_prediction, average='weighted'))
print("F1: ", metrics.f1_score(y_test, y_prediction, average='weighted'))
print("Confusion Matrix: \n", metrics.confusion_matrix(y_test, y_prediction))

Accuracy:  0.5671834625322998
Recall:  0.5892339753355301
Precision:  0.5892339753355301
F1:  0.5659415857195985
Confusion Matrix: 
 [[704 234  94]
 [341 573 118]
 [362 191 479]]


### TF-IDF

In [13]:
X_train = df_train['benefitsReview']
X_test = df_test['benefitsReview']

y = df_train['effectiveness']
y_test = df_test['effectiveness']

vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform([tokenization(val) for val in X_train]).toarray() # creating the feature matrix
X_test = vectorizer.transform([tokenization(val) for val in X_test]).toarray() # creating the feature matrix

dt = decision()

In [14]:
dt.fit(X_train,y)
y_prediction = dt.predict(X_test)

In [15]:
print("Accuracy: ", metrics.accuracy_score(y_test, y_prediction))
print("Recall: ", metrics.precision_score(y_test, y_prediction, average='weighted'))
print("Precision: ", metrics.precision_score(y_test, y_prediction, average='weighted'))
print("F1: ", metrics.f1_score(y_test, y_prediction, average='weighted'))
print("Confusion Matrix: \n", metrics.confusion_matrix(y_test, y_prediction))

Accuracy:  0.4428294573643411
Recall:  0.3894257211490456
Precision:  0.3894257211490456
F1:  0.3852505972346568
Confusion Matrix: 
 [[ 80 207   8   1  12]
 [ 62 327   5   5  12]
 [  7  31  40   2   2]
 [ 15  40  10   4   7]
 [ 38 101   3   7   6]]


In [None]:
X_train = df_train['sideEffectsReview']
X_test = df_test['sideEffectsReview']

y = df_train['sideEffects']
y_test = df_test['sideEffects']

vectorizer = TfidfVectorizer(ngram_range=(1,2))

X_train = vectorizer.fit_transform([tokenization(val) for val in X_train]).toarray() # creating the feature matrix
X_test = vectorizer.transform([tokenization(val) for val in X_test]).toarray() # creating the feature matrix

dt = decision()