# Citation
Kallumadi, Surya and Gräßer, Felix. (2018). Drug Review Dataset (Drugs.com). UCI Machine Learning Repository. https://doi.org/10.24432/C5SK5S.

Additional Information

The dataset provides patient reviews on specific drugs along with related conditions and a 10 star patient rating reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites. The intention was to study

(1) sentiment analysis of drug experience over multiple facets, i.e. sentiments learned on specific aspects such as effectiveness and side effects,
(2) the transferability of models among domains, i.e. conditions, and
(3) the transferability of models among different data sources (see 'Drug Review Dataset (Druglib.com)').

The data is split into a train (75%) and a test (25%) partition (see publication), stored in two separate .tsv (tab-separated values) files.


Important notes:

When using this dataset, you agree that you
1) only use the data for research purposes
2) don't use the data for any commercial purposes
3) don't distribute the data to anyone else
4) cite us

Attribute Information

1. drugName (categorical): name of drug
2. condition (categorical): name of condition
3. review (text): patient review
4. rating (numerical): 10 star patient rating
5. date (date): date of review entry
6. usefulCount (numerical): number of users who found review useful

# Importing Libraries

In [None]:
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

import sqlite3
import nltk
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

from nltk.corpus import stopwords

# Reading data

In [None]:
train = pd.read_csv(r"/content/output.csv")
train.head()

Unnamed: 0,Unnamed: 1,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9,"May 20, 2012",27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8,"April 27, 2010",192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5,"December 14, 2009",17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8,"November 3, 2015",10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9,"November 27, 2016",37


In [None]:
train.shape

(161297, 7)

# Checking Null Values

In [None]:
train.isnull().sum()

                 0
drugName         0
condition      899
review           0
rating           0
date             0
usefulCount      0
dtype: int64

In [None]:
train.rating.value_counts()

10    50989
9     27531
1     21619
8     18890
7      9456
5      8013
2      6931
3      6513
6      6343
4      5012
Name: rating, dtype: int64

# Mapping ratings to positive (rating >= 7) and negative (rating < 7) labels

In [None]:
def partition(x):
    if x >=7:
        return 1
    return 0
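
As a quick sanity check on the threshold, the sketch below applies the same rule to the rating counts printed above and totals each class. The `rating_counts` dictionary is copied by hand from that output; the variable names are illustrative only.

```python
def partition(x):
    """Map a 1-10 star rating to a binary label: 1 for >= 7, else 0."""
    if x >= 7:
        return 1
    return 0

# Rating counts copied from the value_counts() output above
rating_counts = {10: 50989, 9: 27531, 1: 21619, 8: 18890, 7: 9456,
                 5: 8013, 2: 6931, 3: 6513, 6: 6343, 4: 5012}

n_positive = sum(c for r, c in rating_counts.items() if partition(r) == 1)
n_negative = sum(c for r, c in rating_counts.items() if partition(r) == 0)

print(n_positive, n_negative)
print("positive share:", round(n_positive / (n_positive + n_negative), 3))
```

Roughly two thirds of the reviews fall in the positive class, which is the imbalance the following cells try to correct.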

In [None]:
actual_rating = train['rating']
positiveNegative = actual_rating.map(partition)
train['rating'] = positiveNegative
print("Number of data points in our data", train.shape)
print("***************"*10)
print(train['rating'].value_counts())
print()
print(train.head())

Number of data points in our data (161297, 7)
******************************************************************************************************************************************************
1    106866
0     54431
Name: rating, dtype: int64

                           drugName                     condition  \
0  206461                 Valsartan  Left Ventricular Dysfunction   
1   95260                Guanfacine                          ADHD   
2   92703                    Lybrel                 Birth Control   
3  138000                Ortho Evra                 Birth Control   
4   35696  Buprenorphine / naloxone             Opiate Dependence   

                                              review  rating  \
0  "It has no side effect, I take it in combinati...       1   
1  "My son is halfway through his fourth week of ...       1   
2  "I used to take another oral contraceptive, wh...       0   
3  "This is my first time using any form of birth...       1   
4  "Suboxone has

In [None]:
# The data is imbalanced, so split positive and negative reviews separately in order to balance them
positive = train[train['rating']== 1]
negative = train[train['rating']== 0]

In [None]:
negative = negative.sample(positive.shape[0], replace=True)

In [None]:
print(positive.shape, negative.shape)

(106866, 7) (106866, 7)


In [None]:
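# The classes are now balanced; combine them back into a single dataset
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
train = pd.concat([positive, negative], ignore_index=True)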

In [None]:
train.shape

(213732, 7)
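
Note that sampling the negative class with replacement before the train/test split means the same review can end up on both sides of the split. One possible alternative, sketched here with the standard library only, is to split first and oversample the minority class inside the training part. The function name `split_then_oversample`, the `(text, label)` row layout and the demo data are assumptions made for illustration; they are not part of the notebook.

```python
import random

def split_then_oversample(rows, test_fraction=0.2, seed=101):
    """Split (text, label) rows first, then balance only the training part."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    test, train = rows[:n_test], rows[n_test:]

    pos = [r for r in train if r[1] == 1]
    neg = [r for r in train if r[1] == 0]
    minority, majority = (neg, pos) if len(neg) < len(pos) else (pos, neg)
    if not minority:
        return train, test  # nothing to resample from

    # Draw extra minority rows (with replacement) until both classes match
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra, test

# Hypothetical demo data: 20 positive and 10 negative reviews
demo_rows = [(f"review {i}", 1 if i % 3 else 0) for i in range(30)]
balanced_train, held_out = split_then_oversample(demo_rows)

print(len(balanced_train), len(held_out))
print(sum(label for _, label in balanced_train), "positive rows in training")
```

With this ordering, duplicated rows exist only in the training data, so the held-out rows stay unseen by the model.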

In [None]:
train.rating

0         1
1         1
2         1
3         1
4         1
         ..
213727    0
213728    0
213729    0
213730    0
213731    0
Name: rating, Length: 213732, dtype: int64

In [None]:
# Sorting the data into ascending order by drugName

sorted_data = train.sort_values('drugName', axis=0, ascending=True, inplace=False,
                                    kind='quicksort', na_position='last' )

In [None]:
sorted_data.drugName.value_counts()

Etonogestrel                                     5177
Levonorgestrel                                   4728
Ethinyl estradiol / norethindrone                4221
Nexplanon                                        3299
Ethinyl estradiol / norgestimate                 3072
                                                 ... 
Dextromethorphan / phenylephrine / pyrilamine       1
Dexpanthenol                                        1
Pyridoxine                                          1
Pyrimethamine                                       1
A + D Cracked Skin Relief                           1
Name: drugName, Length: 3406, dtype: int64

In [None]:
sorted_data.review

6510      "I have severe cracked skin on my hands.  I&#0...
12123     "It numbs the pain. It makes my ear feel heavi...
28481     "Handable headaches at first but disappeared a...
39333     "Went from a viral load of 17,000 to undetecta...
25780     "No side effects. Reached undetectable in less...
                                ...                        
160118    "Recently switched from birth conrtol which ke...
139698    "I was on femHRT for four months and had BV (b...
125082    "I was on femHRT for four months and had BV (b...
375       "This medication completely changed my life fo...
203124    "Recently switched from birth conrtol which ke...
Name: review, Length: 213732, dtype: object

In [None]:
# printing some sample random reviews

sent_0 = sorted_data['review'].values[0]
print(sent_0)
print("="*20)

sent_200 = sorted_data['review'].values[200]
print(sent_200)
print("="*20)

sent_1500 = sorted_data['review'].values[1500]
print(sent_1500)
print("="*20)

sent_3000 = sorted_data['review'].values[3000]
print(sent_3000)
print("="*20)

sent_4110 = sorted_data['review'].values[4110]
print(sent_4110)
print("="*20)

sent_4800 = sorted_data['review'].values[4800]
print(sent_4800)
print("="*20)

# Importing english stopwords

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
stopwords = set(stop_words)

In [None]:
pip install contractions

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Applying all the cleaning steps to the entire text in one go


In [None]:

from tqdm import tqdm
import re
from bs4 import BeautifulSoup
import contractions
preprocesed_reviews = []

# tqdm is for printing the status bar
for sentence in tqdm(sorted_data['review'].values):
    sentence = re.sub(r"http\S+", "", sentence)
    sentence = BeautifulSoup(sentence, 'lxml').get_text()
    sentence = contractions.fix(sentence)
    sentence = re.sub(r"\S*\d\S*", ' ', sentence).strip()  # remove tokens that contain digits
    sentence = re.sub('[^A-Za-z]+',' ', sentence)
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
    preprocesed_reviews.append(sentence.strip())

100%|██████████| 213732/213732 [01:30<00:00, 2365.69it/s]


In [None]:
sent0 = preprocesed_reviews[0]

In [None]:
sent100 = preprocesed_reviews[100]

In [None]:
sent250 = preprocesed_reviews[250]

In [None]:
sent500 = preprocesed_reviews[500]

In [None]:
print(sent0)
print("="*50)
print(sent100)
print("="*50)
print(sent250)
print("="*50)
print(sent500)

severe cracked skin hands tried many different products skin extremely sensitive product helps heal skin sting greasy important using hands best product found condition hard find drugstore cannot even order anymore
works great side effects minimal
antidepressants years several different meds doc celexa mg still experiencing anxiety worsening depression tried abilify mg terrific day anxiety horrible could sit still med tripled anxiety stopped taking celexa mg started back welbutrin today finally anxiety caused abilify gone see doctor tomorrow wondering else works anxiety
going medication switching less expensive one became fuzzy headed unclear horrible went back took four weeks start working finally started seeing things clearly things getting easier sort miracle drug anxiety kind fear would recommend med
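
Because the same cleaning is needed again for the unseen data later on, it may help to wrap the steps in a single function. The sketch below is a standard-library approximation, not the notebook's exact pipeline: `html.unescape` stands in for BeautifulSoup, contraction expansion is omitted, and `DEMO_STOPWORDS` is a tiny illustrative list rather than NLTK's English stop words.

```python
import html
import re

# Tiny illustrative stop-word list (assumption); the notebook uses NLTK's list
DEMO_STOPWORDS = {"i", "ve", "for", "the", "a", "it", "my", "and", "is"}

def clean_review(text, stop_words=DEMO_STOPWORDS):
    """Apply the cleaning steps to one review and return the cleaned string."""
    text = re.sub(r"http\S+", "", text)      # drop URLs
    text = html.unescape(text)               # decode entities such as &#039;
    text = re.sub(r"\S*\d\S*", " ", text)    # drop tokens containing digits
    text = re.sub(r"[^A-Za-z]+", " ", text)  # keep letters only
    words = (w.lower() for w in text.split())
    return " ".join(w for w in words if w not in stop_words)

sample = "I&#039;ve taken Celexa 20mg for 3 weeks, see http://example.com"
cleaned = clean_review(sample)
print(cleaned)
```

Decoding entities before removing digit-bearing tokens matters here: `&#039;` itself contains digits, so reversing the two steps would delete the whole word it is attached to.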


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix

In [None]:
# Splitting data into train and test data
x_train,x_test,y_train,y_test = train_test_split(preprocesed_reviews,sorted_data['rating'],train_size=.80,random_state=101)

# Random Forest

In [None]:
# Pipeline for tfidf vectorizer and random forest
classification_model_rand = Pipeline([('tfidf', TfidfVectorizer()),
                                ('randomforest', RandomForestClassifier(n_estimators=100,n_jobs=-1,min_samples_split=10))])
classification_model_rand.fit(x_train, y_train)

# Predict on the training and test sets
y_pred_train = classification_model_rand.predict(x_train)
y_pred_test = classification_model_rand.predict(x_test)

# Confusion matrix (a bare expression mid-cell is not echoed; wrap it in print() to display it)
confusion_matrix(y_train, y_pred_train)

# Classification report
print(classification_report(y_train, y_pred_train))

print("*************************"*10)

print(classification_report(y_test, y_pred_test))

# Accuracy score
from sklearn.metrics import accuracy_score
print(accuracy_score(y_train, y_pred_train))

print("*************************"*10)

print(accuracy_score(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85692
           1       1.00      1.00      1.00     85293

    accuracy                           1.00    170985
   macro avg       1.00      1.00      1.00    170985
weighted avg       1.00      1.00      1.00    170985

**********************************************************************************************************************************************************************************************************************************************************
              precision    recall  f1-score   support

           0       0.95      0.96      0.95     21174
           1       0.96      0.95      0.95     21573

    accuracy                           0.95     42747
   macro avg       0.95      0.95      0.95     42747
weighted avg       0.95      0.95      0.95     42747

0.999461941105945
*****************************************************************************
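
The near-perfect training score and the high test score of the random forest may partly reflect reviews that appear in both splits, since the negative class was sampled with replacement before splitting. A minimal way to estimate this, assuming the splits are plain lists of cleaned strings, is to count the test reviews that also occur verbatim in the training data. The helper name `leaked_fraction` and the demo lists are hypothetical.

```python
def leaked_fraction(train_texts, test_texts):
    """Share of test texts that also occur verbatim in the training texts."""
    if not test_texts:
        return 0.0
    seen = set(train_texts)
    return sum(1 for t in test_texts if t in seen) / len(test_texts)

# Hypothetical demo data: one of the two test reviews is also in training
train_demo = ["works great", "no side effects", "made me dizzy", "made me dizzy"]
test_demo = ["made me dizzy", "helped my anxiety"]

overlap = leaked_fraction(train_demo, test_demo)
print(overlap)
```

In the notebook this would be called as `leaked_fraction(x_train, x_test)`; a large value would suggest the test metrics overstate how well the model handles new reviews.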

# Logistic Regression

In [None]:
# Pipeline for tfidf vectorizer and Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = Pipeline([('tfidf', TfidfVectorizer()),
                                ('Logistic_regression', LogisticRegression())])
# Train the model
lr.fit(x_train, y_train)

# Make predictions
y_pred_train_lr = lr.predict(x_train)
y_pred_test_lr = lr.predict(x_test)

# Evaluate the model
print("Training Set:")
print(classification_report(y_train, y_pred_train_lr))
print("*************************"*10)
print("Testing Set:")
print(classification_report(y_test, y_pred_test_lr))


# Accuracy score
from sklearn.metrics import accuracy_score
print(accuracy_score(y_train, y_pred_train_lr))

print("*************************"*10)

print(accuracy_score(y_test, y_pred_test_lr))


Training Set:
              precision    recall  f1-score   support

           0       0.85      0.85      0.85     85692
           1       0.85      0.84      0.85     85293

    accuracy                           0.85    170985
   macro avg       0.85      0.85      0.85    170985
weighted avg       0.85      0.85      0.85    170985

**********************************************************************************************************************************************************************************************************************************************************
Testing Set:
              precision    recall  f1-score   support

           0       0.82      0.84      0.83     21174
           1       0.84      0.82      0.83     21573

    accuracy                           0.83     42747
   macro avg       0.83      0.83      0.83     42747
weighted avg       0.83      0.83      0.83     42747

0.8488171477030149
*************************************************

# Importing unseen data for testing

In [None]:
test_data = pd.read_csv('/content/drugsComTest_raw.tsv', sep='\t')

In [None]:
test_data.head()

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,163740,Mirtazapine,Depression,"""I&#039;ve tried a few antidepressants over th...",10.0,"February 28, 2012",22
1,206473,Mesalamine,"Crohn's Disease, Maintenance","""My son has Crohn&#039;s disease and has done ...",8.0,"May 17, 2009",17
2,159672,Bactrim,Urinary Tract Infection,"""Quick reduction of symptoms""",9.0,"September 29, 2017",3
3,39293,Contrave,Weight Loss,"""Contrave combines drugs that were used for al...",9.0,"March 5, 2017",35
4,97768,Cyclafem 1 / 35,Birth Control,"""I have been on this birth control for one cyc...",9.0,"October 22, 2015",4


In [None]:
from tqdm import tqdm
import re
from bs4 import BeautifulSoup
import contractions

reviews = []

# tqdm is for printing the status bar
for sentence in tqdm(test_data['review'].values):
    sentence = re.sub(r"http\S+", "", sentence)
    sentence = BeautifulSoup(sentence, 'lxml').get_text()
    sentence = contractions.fix(sentence)
    sentence = re.sub(r"\S*\d\S*", ' ', sentence).strip()  # remove tokens that contain digits
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)
    sentence = ' '.join(sentence.split())  # Split and join the words to remove extra spaces
    reviews.append(sentence)

100%|██████████| 53766/53766 [00:17<00:00, 3026.29it/s]


In [None]:
sent_1000 = reviews[100:110]

In [None]:
data = test_data[100:110]
data

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
100,41991,Clonidine,ADHD,"""My 5 year old son was diagnosed with ADHD jus...",10.0,"April 30, 2011",159
101,85618,Ethinyl estradiol / norgestimate,Birth Control,"""I&#039;d never been on birth control up until...",4.0,"July 20, 2014",35
102,62652,Nicoderm CQ,Smoking Cessation,"""I will say this about the patch. It work for ...",10.0,"February 24, 2017",14
103,129850,Levonorgestrel,Emergency Contraception,"""on March 21-25 I had my period. On March 26 I...",8.0,"April 27, 2015",9
104,103401,Celecoxib,Osteoarthritis,"""Celebrex did nothing for my pain.""",1.0,"February 12, 2009",35
105,45260,Fluoxetine,Major Depressive Disorde,"""I have Major Depressive Disorder, Bipolar Dis...",10.0,"November 23, 2015",58
106,222701,Topamax,Migraine Prevention,"""I seemed to catch everything while I was on t...",1.0,"September 14, 2015",31
107,218886,Depakote,Bipolar Disorde,"""General tiredness with the medication but no ...",8.0,"January 22, 2012",22
108,224062,Riboflavin,Migraine Prevention,"""I take 400 mg a day and it helps.""",8.0,"August 24, 2011",22
109,9723,Lo Loestrin Fe,Birth Control,"""BC from below. Rapid weight gain, swelling an...",1.0,"October 2, 2015",3


In [None]:
# Expected labels for rows 100-109 (rating >= 7 -> 1, otherwise 0)
[1,0,1,1,0,1,0,1,1,0]

In [None]:
# Testing on random forest
test_rf = classification_model_rand.predict(sent_1000)

In [None]:
test_rf

array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0])

In [None]:
# Testing on logistic regression
test_lr = lr.predict(sent_1000)

In [None]:
test_lr

array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
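
Rather than comparing the arrays by eye, the predictions can be scored against labels derived from the ratings. The sketch below uses the ten ratings and the two prediction arrays shown above, copied by hand; the helper name `accuracy_against_ratings` is an assumption for illustration.

```python
def accuracy_against_ratings(predictions, ratings, threshold=7):
    """Fraction of predictions that match labels derived from star ratings."""
    expected = [1 if r >= threshold else 0 for r in ratings]
    correct = sum(1 for p, e in zip(predictions, expected) if p == e)
    return correct / len(expected)

# Ratings of rows 100-109 and both models' predictions, copied from the outputs above
ratings = [10.0, 4.0, 10.0, 8.0, 1.0, 10.0, 1.0, 8.0, 8.0, 1.0]
rf_predictions = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]
lr_predictions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

rf_accuracy = accuracy_against_ratings(rf_predictions, ratings)
lr_accuracy = accuracy_against_ratings(lr_predictions, ratings)
print("random forest:", rf_accuracy)
print("logistic regression:", lr_accuracy)
```

Ten rows are far too few to rank the models; applying the same function to predictions for the whole unseen file would give a more reliable comparison.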

## After testing on many more samples, I found that the logistic regression model generalises better and gives better results than the random forest, even though its test accuracy is lower