https://archive.ics.uci.edu/dataset/462/drug+review+dataset+drugs+com

# Flatiron Phase 5 Project

## Aaron Galbraith

https://www.linkedin.com/in/aarongalbraith \
https://github.com/aarongalbraith

### Submitted: November 21, 2023

## working contents

- **[functions](#functions)<br>**
- **[rough overview](#rough-overview)<br>**
- **[duplicates](#duplicates)<br>**
- **[missing values](#missing-values)<br>**
- **[contractions](#contractions)<br>**
- **[dates](#dates)<br>**
- **[ratings](#ratings)<br>**
- **[focusing on birth control](#focusing-on-birth-control)<br>**
- **[feature engineering ideas](#feature-engineering-ideas)<br>**
- **[rudimentary word cloud maker](#rudimentary-word-cloud-maker)<br>**
- **[end](#end)<br>**


## Contents

- **[Business Understanding](#Business-Understanding)**<br>
- **[Data Understanding](#Data-Understanding)**<br>
- **[Data Preparation](#Data-Preparation)**<br>
- **[Exploration](#Exploration)**<br>
- **[Modeling](#Modeling)**<br>
- **[Evaluation](#Evaluation)**<br>
- **[Recommendations](#Recommendations)**<br>
- **[Further Inquiry](#Further-Inquiry)**<br>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import FreqDist
from nltk.corpus import stopwords
import string
from wordcloud import WordCloud

import html
import contractions

import re

from IPython.display import display

from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix, plot_confusion_matrix, accuracy_score, precision_score, f1_score
from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV


SEED = 1979

do_grids = True

In [2]:
d1 = pd.read_csv('../data/drugsComTrain_raw.tsv', delimiter='\t', encoding='latin-1')
d2 = pd.read_csv('../data/drugsComTest_raw.tsv', delimiter='\t', encoding='latin-1')
df = pd.concat([d1,d2]).reset_index().drop(columns=['Unnamed: 0', 'index'])

# functions

In [3]:
def show_review(index):
    print(df.review.loc[index])
    display(df[df.review == df.loc[index].review][['drugName', 'condition', 'rating', 'date', 'usefulCount']])

In [4]:
def show_similar(index):
    
    count_total = df[
        (df.drugName == df.loc[index].drugName) & \
        (df.condition == df.loc[index].condition) & \
        (df.date == df.loc[index].date)
    ].review.count()
    
    count_similar = df[
        (df.drugName == df.loc[index].drugName) & \
        (df.condition == df.loc[index].condition) & \
        (df.rating == df.loc[index].rating) & \
        (df.date == df.loc[index].date)
    ].review.count()
    
    print('On', df.loc[index].date, df.loc[index].drugName, 'was reviewed', count_total, \
          'times and received a rating of', df.loc[index].rating, count_similar, 'times.\n')
    print('From that date, here are all', count_similar, 'reviews with the same rating:\n')
    for ind in df[
        (df.drugName == df.loc[index].drugName) & \
        (df.condition == df.loc[index].condition) & \
        (df.rating == df.loc[index].rating) & \
        (df.date == df.loc[index].date)
    ].index:
        print(df.loc[ind].review,'\n')
    
    print('Here is a breakdown of all the dates when reviewers gave the same drug name and condition THIS RATING:')
    display(df[
        (df.drugName == df.loc[index].drugName) & \
        (df.condition == df.loc[index].condition) & \
        (df.rating == df.loc[index].rating)
    ].date.value_counts())

# missing values

In [5]:
len(df[df.condition.isna()])

1194

In [6]:
df.condition.fillna('missing', inplace=True)

In [7]:
len(df[df.condition == 'missing'])

1194

Another condition label, one containing "Not Listed", was also meant to indicate a missing value and should be changed accordingly.

In [8]:
df.condition = df.condition.apply(lambda x: 'missing' if 'Not Listed' in x else x)

In [9]:
len(df[df.condition == 'missing'])

1786

We've identified some actual missing condition labels, but we noticed there are more condition labels that seem suspicious, particularly ones that start with something other than an upper case character. Let's look at all such condition labels.

In [10]:
set(df[(~df.condition.str[0].isin(list(string.ascii_uppercase))) &
   (df.condition != 'missing')
  ].condition)

{'0</span> users found this comment helpful.',
 '100</span> users found this comment helpful.',
 '105</span> users found this comment helpful.',
 '10</span> users found this comment helpful.',
 '110</span> users found this comment helpful.',
 '11</span> users found this comment helpful.',
 '121</span> users found this comment helpful.',
 '123</span> users found this comment helpful.',
 '12</span> users found this comment helpful.',
 '135</span> users found this comment helpful.',
 '13</span> users found this comment helpful.',
 '142</span> users found this comment helpful.',
 '145</span> users found this comment helpful.',
 '146</span> users found this comment helpful.',
 '14</span> users found this comment helpful.',
 '15</span> users found this comment helpful.',
 '16</span> users found this comment helpful.',
 '17</span> users found this comment helpful.',
 '18</span> users found this comment helpful.',
 '19</span> users found this comment helpful.',
 '1</span> users found this comm

These fall into three categories. Ones that include "users found this comment helpful" should be regarded as erroneous and therefore missing.

In [11]:
df.condition = df.condition.apply(lambda x: 'missing' if 'users found' in x else x)

In [12]:
len(df[df.condition == 'missing'])

2957
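
The same relabeling can be done without `apply`. This is a minimal sketch on a made-up frame (the rows in `toy` are illustrative, not taken from the dataset) that uses `Series.str.contains` and `Series.mask` to flag the scraped page residue and the "Not Listed" placeholder in one pass:

```python
import pandas as pd

# Toy data for illustration only; not rows from the actual dataset.
toy = pd.DataFrame({'condition': ['Depression',
                                  '4</span> users found this comment helpful.',
                                  'Not Listed',
                                  None]})

# Treat true NaNs as missing first, so the string methods only see strings.
cond = toy['condition'].fillna('missing')

# Flag scraped page residue and the "Not Listed" placeholder together.
is_residue = cond.str.contains('users found|Not Listed', regex=True)
toy['condition'] = cond.mask(is_residue, 'missing')

print(toy['condition'].tolist())
```

On a frame of this size the difference is cosmetic, but the vectorized form keeps each rule visible as a single pattern.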

Ones that show a clipped copy of the drug name and end with a parenthesis should also be regarded as missing.

In [13]:
df.condition = df.condition.apply(lambda x: 'missing' \
                                  if x[0] not in list(string.ascii_uppercase) and \
                                  x[-1] in ['(', ')'] \
                                  else x)

In [14]:
len(df[df.condition == 'missing'])

3286

Most of the ones that show a clipped version of the condition label can possibly be restored.

In [15]:
def condition_restore(condition):
    if condition.split()[-1] in ['Disorde', 'eve', 'Shoulde', 'Cance']:
        condition = condition+'r'
    if condition.split()[0] in ['acial', 'ibrocystic', 'ungal', 'amilial', 'ailure', 'ever', \
                                'emale', 'unctional', 'actor', 'ibromyalgia', 'atigue']:
        condition = 'F'+condition
    if condition.split()[0] in ['llicular', 'llicle', 'lic', 'cal']:
        condition = 'Fo'+condition
    if condition.split()[0] in ['mance']:
        condition = 'Perfor'+condition
    if condition.split()[0] in ['zen']:
        condition = 'Fro'+condition
    if condition.split()[0] in ['mis']:
        condition = 'Dermatitis Herpetifor'+condition
    return condition

df.condition = df.condition.apply(condition_restore)
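
The chain of `if` statements can also be written as two lookup tables. This is a hedged sketch of that idea, where `restore_clipped`, `SUFFIX_FIXES` and `PREFIX_FIXES` are hypothetical names and the fragments are copied from `condition_restore` above. The example inputs at the bottom are assembled from those fragments for illustration and are not necessarily labels in the dataset.

```python
# Hypothetical table-driven variant of condition_restore.
SUFFIX_FIXES = {'Disorde': 'r', 'eve': 'r', 'Shoulde': 'r', 'Cance': 'r'}
PREFIX_FIXES = {
    **dict.fromkeys(['acial', 'ibrocystic', 'ungal', 'amilial', 'ailure', 'ever',
                     'emale', 'unctional', 'actor', 'ibromyalgia', 'atigue'], 'F'),
    **dict.fromkeys(['llicular', 'llicle', 'lic', 'cal'], 'Fo'),
    'mance': 'Perfor',
    'zen': 'Fro',
    'mis': 'Dermatitis Herpetifor',
}

def restore_clipped(condition):
    words = condition.split()
    if not words:
        return condition
    # Restore the clipped trailing letter first, so 'eve' -> 'ever' -> 'Fever'.
    condition += SUFFIX_FIXES.get(words[-1], '')
    first_word = condition.split()[0]
    # Prepend the missing leading characters, if the first word is a known fragment.
    return PREFIX_FIXES.get(first_word, '') + condition

for label in ['eve', 'zen Shoulde', 'mance Anxiety', 'Depression']:
    print(label, '->', restore_clipped(label))
```

Adding a newly discovered fragment then means adding one dictionary entry rather than another branch.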

Let's look at what we have left.

In [16]:
set(df[(~df.condition.str[0].isin(list(string.ascii_uppercase))) &
   (df.condition != 'missing')
  ].condition)

{'m Pain Disorder', 'me', 't Care', "von Willebrand's Disease"}

"von Willebrand's Disease" appears to be a naturally uncapitalized condition. The others have been impossible to restore and will also be regarded as missing.

In [17]:
df.condition = df.condition.apply(lambda x: 'missing' \
                                  if x[0] not in list(string.ascii_uppercase) and \
                                  x.split()[0] != 'von' \
                                  else x)

In [18]:
len(df[df.condition == 'missing'])

3293

## proposed solutions for missing values

1. For every record with a missing condition, we will assign it the condition that is most common for the drug indicated by that record.

2. Before executing solution 1, find each record's twin and use the condition label from the twin where applicable.

For now, we'll just execute solution 1.
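
A minimal sketch of what the twin lookup in solution 2 might look like, assuming a "twin" is a record that matches another on `review`, `rating`, `date` and `usefulCount` and differs only in `drugName`. The rows in `toy` are made up for illustration.

```python
import pandas as pd

# Toy rows for illustration; the first two are twins under the assumption above.
toy = pd.DataFrame({
    'drugName':    ['Plan B', 'Levonorgestrel', 'Viagra'],
    'condition':   ['Emergency Contraception', 'missing', 'missing'],
    'review':      ['same text', 'same text', 'other text'],
    'rating':      [1.0, 1.0, 9.0],
    'date':        ['September 23, 2016', 'September 23, 2016', 'May 1, 2015'],
    'usefulCount': [10, 10, 3],
})

key = ['review', 'rating', 'date', 'usefulCount']

# Hide the placeholder so that 'first' picks a real label when a twin has one.
known = toy['condition'].where(toy['condition'] != 'missing')
from_twin = toy.assign(known=known).groupby(key)['known'].transform('first')

# Records whose whole twin group is unlabeled stay 'missing'.
toy['condition'] = from_twin.fillna('missing')
print(toy[['drugName', 'condition']])
```

Records that remain `'missing'` after this step would then fall through to solution 1.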

In [19]:
drugs_w_missing_condition = list(set(df[df.condition == 'missing'].drugName))

In [20]:
len(drugs_w_missing_condition)

842

This applies to about a quarter of the drugs. We'll create a dictionary that reports the most common condition for these drugs.

In [21]:
most_common_condition = {}

for drug in drugs_w_missing_condition:
    condition = df[df.drugName == drug].condition.value_counts().idxmax()
    if condition == 'missing' and len(set(df[df.drugName == drug].condition)) > 1:
        condition = df[(df.drugName == drug) &
                       (df.condition != 'missing')
                      ].condition.value_counts().idxmax()
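    # share of this drug's reviews that carry its most frequent label (positional access via .iloc)
    proportion = round(df[df.drugName == drug].condition.value_counts(normalize=True).iloc[0],2)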
    most_common_condition[drug] = [condition, proportion]

In [22]:
most_common_condition['Viagra']

['Erectile Dysfunction', 0.87]

For example, if a review with an unlisted condition is about Viagra, we will assume the condition is Erectile Dysfunction.
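
The per-drug lookup can also be built with a single `groupby`. This is a sketch on made-up rows, where `DrugX` is a hypothetical drug with no labeled reviews at all:

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    'drugName':  ['Viagra', 'Viagra', 'Viagra', 'DrugX'],
    'condition': ['Erectile Dysfunction', 'Erectile Dysfunction', 'missing', 'missing'],
})

labeled = toy[toy['condition'] != 'missing']

# Most frequent real label per drug; drugs with no labeled reviews are absent.
top_condition = labeled.groupby('drugName')['condition'].agg(
    lambda s: s.value_counts().idxmax())

# Fill only the missing rows, leaving drugs without any label as 'missing'.
is_missing = toy['condition'] == 'missing'
fallback = toy['drugName'].map(top_condition)
toy.loc[is_missing, 'condition'] = fallback[is_missing].fillna('missing')

print(toy)
```

Filtering out `'missing'` before grouping removes the need for the special case where `'missing'` is itself the most common label.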

In [23]:
df['condition'] = df.apply(lambda x: most_common_condition[x.drugName][0] \
                           if x.condition == 'missing' \
                           else x.condition, axis = 1)

In [24]:
len(df[df.condition == 'missing'])

105

This is how many records still have no condition label. The drugs in these records appear *only* in reviews without a listed condition. They may still have "twin" records that we could match them to, but since we're skipping that step, there's not really anything we can do with these records, and we may as well drop them.

In [25]:
df.drop(df[df.condition == 'missing'].index, inplace=True)

# duplicates

In [26]:
df.duplicated().value_counts()

False    214956
True          2
Name: count, dtype: int64

In [27]:
df[df.duplicated()]

| | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|
| 178703 | Levonorgestrel | Emergency Contraception | "I had a quickie n he decided to finish it off... | 1.0 | September 23, 2016 | 10 |
| 191001 | Plan B | Emergency Contraception | "I had a quickie n he decided to finish it off... | 1.0 | September 23, 2016 | 10 |


In [28]:
show_review(178703)

"I had a quickie n he decided to finish it off in me... Well IMMEDIATELY we went 2 our local pharmacy n bought this plan b 1 step pill.I took it immediately.2 weeks later,took a pregnancy test n got the world&#039;s BIGGEST POSITIVE. The small pill was $50.That was the 1st time in a year n a half that I had intercourse n the last after I had my first son. I honestly believe this pill is ineffective because they just want u to think it works when n reality, it would never work. Alot of women don&#039;t know their bodies when they ovulate so if your not fertile and he ejaculates n u and u take the pill n dont get preg., The pill is supposed to make u think it worked. DO NOT buy. Was NEVER effective. Thank u!"


| | drugName | condition | rating | date | usefulCount |
|---|---|---|---|---|---|
| 131531 | Levonorgestrel | Emergency Contraception | 1.0 | September 23, 2016 | 10 |
| 143768 | Plan B | Emergency Contraception | 1.0 | September 23, 2016 | 10 |
| 178703 | Levonorgestrel | Emergency Contraception | 1.0 | September 23, 2016 | 10 |
| 191001 | Plan B | Emergency Contraception | 1.0 | September 23, 2016 | 10 |


This is curious. The same review is recorded four times. There are two identical pairs, where the difference between the pairs is the drug name. We can drop one from each pair, but this will need to be revisited.
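
To see how widespread this pattern is, one option is to group on every field except `drugName` and list the names attached to each group. This is a sketch on made-up rows; the column names match the dataset, the values do not come from it.

```python
import pandas as pd

# Toy rows for illustration: one review stored four times under two drug names.
toy = pd.DataFrame({
    'drugName':    ['Levonorgestrel', 'Plan B', 'Levonorgestrel', 'Plan B', 'Viagra'],
    'review':      ['r1', 'r1', 'r1', 'r1', 'r2'],
    'rating':      [1.0, 1.0, 1.0, 1.0, 9.0],
    'date':        ['September 23, 2016'] * 4 + ['May 1, 2015'],
    'usefulCount': [10, 10, 10, 10, 3],
})

key = ['review', 'rating', 'date', 'usefulCount']

# One row per distinct review, with how many copies and which drug names it has.
summary = (toy.groupby(key)['drugName']
              .agg(copies='size',
                   n_names='nunique',
                   names=lambda s: ' / '.join(sorted(set(s))))
              .reset_index())

# Reviews recorded under more than one drug name.
shared = summary[summary['n_names'] > 1]
print(shared[['copies', 'names']])
```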

In [29]:
df.drop_duplicates(inplace=True)

# contractions

Here is an example of a contraction.

In [30]:
df.review[3][56:69]

'I&#039;m glad'

Here is how the html function fixes it.

In [31]:
html.unescape(df.review.loc[3])[56:64]

"I'm glad"

Here is how the contractions function fixes (the html function's fix of) it.

In [32]:
contractions.fix(html.unescape(df.review.loc[3]))[56:65]

'I am glad'

Here is an instance of "ain't" with the same functions applied.

In [33]:
df.review.loc[507][75:99]

'I ain&#039;t complaining'

In [34]:
html.unescape(df.review.loc[507])[75:94]

"I ain't complaining"

In [35]:
contractions.fix(html.unescape(df.review.loc[507]))[75:96]

'I are not complaining'

In [36]:
len(df[df.review.str.contains('ain&#039;t')])

53

There are 53 instances of "ain't".

As shown above, `contractions.fix` turns "I ain't" into "I are not". I'm currently having difficulty downloading the package that correctly expands "ain't" into "is not", "are not", etc. This shouldn't matter after I remove stop words. I think it will be helpful to exclude negatives like "no" and "not" from the stop words, and it could also help to look for bigrams like "not good".
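
A small sketch of that idea in plain Python. The stop-word set here is a short illustrative list, not NLTK's, and `tokens_keep_negations` and `bigrams` are hypothetical helper names. With NLTK the same effect could be had by subtracting the negations from `stopwords.words('english')`.

```python
import re

# Illustrative stop-word list, not NLTK's; negations are deliberately kept out of it.
STOP_WORDS = {'i', 'it', 'is', 'was', 'the', 'a', 'this', 'for', 'me', 'but', 'and'}
NEGATIONS = {'no', 'not', 'never'}

def tokens_keep_negations(text):
    """Lowercase, split into words, and drop stop words except negations."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w in NEGATIONS or w not in STOP_WORDS]

def bigrams(words):
    """Pair each token with the one that follows it."""
    return list(zip(words, words[1:]))

toks = tokens_keep_negations('This was not good for me but it is not terrible')
print(toks)
print(bigrams(toks))
```

Note that forming bigrams after stop-word removal makes words adjacent that were not adjacent in the review, so some pairs will be artifacts of the filtering.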

In [37]:
df.review = df.review.apply(html.unescape)

# make some dummy dfs to use

In [38]:
df_old = df.copy()
len(df_old[df_old.duplicated(subset = df_old.columns.difference(['drugName']))])

85967

In [39]:
df_bc = df_old.drop(df_old[df_old.condition != 'Birth Control'].index)
df_bc.reset_index(inplace=True)
df_bc.drop(columns='index', inplace=True)
len(df_bc[df_bc.duplicated(subset = df_bc.columns.difference(['drugName']))])

19420

In [40]:
df_20000 = df_bc[df_bc.index < 20000]
len(df_20000[df_20000.duplicated(subset = df_20000.columns.difference(['drugName']))])

5072

In [41]:
df_10000 = df_bc[df_bc.index < 10000]
len(df_10000[df_10000.duplicated(subset = df_10000.columns.difference(['drugName']))])

1241

In [42]:
df_5000 = df_bc[df_bc.index < 5000]
len(df_5000[df_5000.duplicated(subset = df_5000.columns.difference(['drugName']))])

310

In [43]:
df_2000 = df_bc[df_bc.index < 2000]
len(df_2000[df_2000.duplicated(subset = df_2000.columns.difference(['drugName']))])

50

# *I messed this up because I did not restore the list bucket_B in the third trial of 2,000 or 5,000*

# experiment: 2,000 records // 50 duplicates

In [44]:
df = df_2000.copy()

In [45]:
df.drop(columns='drugName', inplace=True)

In [46]:
%%time
bucket_A = df[df.duplicated()].index.tolist()
bucket_B = df[~df.index.isin(bucket_A)].index.tolist()

CPU times: user 5.74 ms, sys: 1.39 ms, total: 7.13 ms
Wall time: 5.74 ms


In [47]:
%%time
found_pairs = 0
for i in bucket_A:
    for j in bucket_B:
        if df.loc[i].equals(df.loc[j]):
            found_pairs += 1
print(found_pairs)

50
CPU times: user 13.6 s, sys: 39.2 ms, total: 13.6 s
Wall time: 13.7 s


In [48]:
%%time
found_pairs = 0
for i in bucket_A:
    for j in bucket_B:
        if df.loc[i].equals(df.loc[j]):
            found_pairs += 1
            bucket_B.remove(j)
            break
print(found_pairs)

50
CPU times: user 3.53 s, sys: 13 ms, total: 3.55 s
Wall time: 3.58 s


In [49]:
%%time
found_pairs = 0
twins = []
bucket_B = df[~df.index.isin(bucket_A)].index.tolist()
for i in bucket_A:
    for j in bucket_B:
        if df.loc[i].equals(df.loc[j]):
            found_pairs += 1
            bucket_B.remove(j)
            twins.append([i,j])
            break
print(found_pairs)

50
CPU times: user 3.47 s, sys: 11.3 ms, total: 3.48 s
Wall time: 3.5 s


# experiment: 5,000 records // 310 duplicates

In [50]:
df = df_5000.copy()

In [51]:
df.drop(columns='drugName', inplace=True)

In [52]:
%%time
bucket_A = df[df.duplicated()].index.tolist()
bucket_B = df[~df.index.isin(bucket_A)].index.tolist()

CPU times: user 12.5 ms, sys: 1.52 ms, total: 14 ms
Wall time: 13.3 ms


In [53]:
%%time
found_pairs = 0
for i in bucket_A:
    for j in bucket_B:
        if df.loc[i].equals(df.loc[j]):
            found_pairs += 1
print(found_pairs)

310
CPU times: user 3min 53s, sys: 881 ms, total: 3min 54s
Wall time: 3min 57s


In [54]:
%%time
found_pairs = 0
for i in bucket_A:
    for j in bucket_B:
        if df.loc[i].equals(df.loc[j]):
            found_pairs += 1
            bucket_B.remove(j)
            break
print(found_pairs)

310
CPU times: user 1min 17s, sys: 237 ms, total: 1min 17s
Wall time: 1min 18s


In [55]:
%%time
found_pairs = 0
twins = []
bucket_B = df[~df.index.isin(bucket_A)].index.tolist()
for i in bucket_A:
    for j in bucket_B:
        if df.loc[i].equals(df.loc[j]):
            found_pairs += 1
            bucket_B.remove(j)
            twins.append([i,j])
            break
print(found_pairs)

310
CPU times: user 1min 17s, sys: 251 ms, total: 1min 18s
Wall time: 1min 18s


# experiment: 39,499 records // 19,420 duplicates

In [None]:
df = df_bc.copy()

In [None]:
df.drop(columns='drugName', inplace=True)

In [None]:
%%time
bucket_A = df[df.duplicated()].index.tolist()
bucket_B = df[~df.index.isin(bucket_A)].index.tolist()

In [None]:
%%time
found_pairs = 0
twins = []
for i in bucket_A:
    for j in bucket_B:
        if df.loc[i].equals(df.loc[j]):
            found_pairs += 1
            bucket_B.remove(j)
            twins.append([i,j])
            break
print(found_pairs)
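
The nested loops compare every row in `bucket_A` against rows in `bucket_B`, which is why the run time grew from a few seconds at 2,000 records to over a minute at 5,000. A hedged alternative is to hash each row once and look matches up in a dictionary. The sketch below uses a hypothetical helper, `find_twins`, and toy data. Unlike the loops above, it pairs every repeated row with the first occurrence, even when a row appears three or more times.

```python
import pandas as pd

def find_twins(frame):
    """Pair each repeated row with the index of its first occurrence."""
    first_seen = {}   # row contents (as a tuple) -> index of first occurrence
    twins = []
    rows = frame.itertuples(index=False, name=None)
    for idx, row in zip(frame.index, rows):
        if row in first_seen:
            twins.append([idx, first_seen[row]])
        else:
            first_seen[row] = idx
    return twins

# Toy frame standing in for df after drugName has been dropped.
toy = pd.DataFrame({'review': ['a', 'b', 'a', 'c', 'b'],
                    'rating': [1.0, 5.0, 1.0, 3.0, 5.0]})
print(find_twins(toy))
```

Each row is visited once, so the work grows roughly linearly with the number of records rather than with the product of the two bucket sizes.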

# attempt to fix twin duplicates problem

In [None]:
df[df.duplicated(subset=['review', 'rating', 'date', 'usefulCount'])]

In [None]:
df['condition'] = 'Birth Control'

In [None]:
show_review(96)

In [None]:
assert df.index.max() == len(df) - 1

In [None]:
twin_dates = []
for i in df[df.duplicated(subset=['review', 'rating', 'date', 'usefulCount'])].index:
    twin = df[df.date == df.loc[i].date]
    twin_dates.append(len(twin))

In [None]:
max(twin_dates)

In [None]:
len(twin_dates)

In [None]:
found_pairs = 0

for i in df[df.duplicated(subset=['review', 'rating', 'date', 'usefulCount'])].index:
    for j in df[~df.duplicated(subset=['review', 'rating', 'date', 'usefulCount'])].index:
        if df.loc[i].review == df.loc[j].review and \
        df.loc[i].rating == df.loc[j].rating and \
        df.loc[i].date == df.loc[j].date and \
        df.loc[i].usefulCount == df.loc[j].usefulCount:
            found_pairs += 1
            break

print(found_pairs)

In [None]:
found_pairs
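
As a sanity check on the loop above: every row flagged by `duplicated(subset=...)` has, by construction, an earlier first occurrence that is not flagged. Assuming none of the four key columns contain NaN, the loop's `found_pairs` should therefore equal the number of flagged rows, which can be computed directly. This is a sketch on made-up rows:

```python
import pandas as pd

# Toy rows for illustration: review 'a' appears three times, 'b' once.
toy = pd.DataFrame({
    'review':      ['a', 'b', 'a', 'a'],
    'rating':      [1.0, 5.0, 1.0, 1.0],
    'date':        ['May 1, 2015'] * 4,
    'usefulCount': [3, 7, 3, 3],
})

key = ['review', 'rating', 'date', 'usefulCount']

# Rows after the first occurrence of each key combination.
found_pairs = int(toy.duplicated(subset=key).sum())
print(found_pairs)
```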

# end