https://archive.ics.uci.edu/dataset/462/drug+review+dataset+drugs+com

# Flatiron Phase 5 Project

## Aaron Galbraith

https://www.linkedin.com/in/aarongalbraith \
https://github.com/aarongalbraith

### Submitted: November 21, 2023

## working contents

- **[rough overview](#rough-overview)<br>**
- **[missing values](#missing-values)<br>**
- **[contractions](#contractions)<br>**
- **[dates](#dates)<br>**
- **[ratings](#ratings)<br>**
- **[focusing on birth control](#focusing-on-birth-control)<br>**
- **[end](#end)<br>**


## Contents

- **[Business Understanding](#Business-Understanding)**<br>
- **[Data Understanding](#Data-Understanding)**<br>
- **[Data Preparation](#Data-Preparation)**<br>
- **[Exploration](#Exploration)**<br>
- **[Modeling](#Modeling)**<br>
- **[Evaluation](#Evaluation)**<br>
- **[Recommendations](#Recommendations)**<br>
- **[Further Inquiry](#Further-Inquiry)**<br>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import FreqDist
from nltk.corpus import stopwords
import string
from wordcloud import WordCloud

import html
import contractions

import re

In [2]:
d1 = pd.read_csv('../data/drugsComTrain_raw.tsv', delimiter='\t', encoding='latin-1')
d2 = pd.read_csv('../data/drugsComTest_raw.tsv', delimiter='\t', encoding='latin-1')
# stack train and test sets, renumber the rows, and drop the leftover id column
df = pd.concat([d1, d2], ignore_index=True).drop(columns=['Unnamed: 0'])

# rough overview

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.drugName.value_counts()

In [14]:
pd.set_option("display.max_rows", None)
print(df.drugName.value_counts())
pd.set_option("display.max_rows", 10)

drugName
Levonorgestrel                                                                                      4930
Etonogestrel                                                                                        4421
Ethinyl estradiol / norethindrone                                                                   3753
Nexplanon                                                                                           2892
Ethinyl estradiol / norgestimate                                                                    2790
Ethinyl estradiol / levonorgestrel                                                                  2503
Phentermine                                                                                         2085
Sertraline                                                                                          1868
Escitalopram                                                                                        1747
Mirena                                        

Oddly, the condition labels often (always?) omit initial 'F' and terminal 'r'.

# missing values

In [3]:
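# assign back rather than using inplace=True on an attribute accessor,
# which may not update df under newer pandas versions
df['condition'] = df.condition.fillna('missing')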

In [10]:
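def missing_fix(condition):
    """Return True if the condition label is a placeholder rather than a real condition."""
    # the parameter is named `condition` to avoid shadowing the imported `string` module
    return 'users found this comment' in condition or 'Not Listed' in condition

df['condition'] = df.condition.apply(lambda x: 'missing' if missing_fix(x) else x)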

In [11]:
len(df[df.condition == 'missing'])

2957

With the placeholder labels included, there are 2,957 missing conditions (around 1,200 were null in the raw data). We can search for the drugs that have the most missing conditions and potentially infer the condition from the most common condition for that drug. For example:

In [8]:
df[df.condition == 'missing'].drugName.value_counts()

drugName
Ethinyl estradiol / norethindrone     134
Ethinyl estradiol / norgestimate      108
Ethinyl estradiol / levonorgestrel    103
Loestrin 24 Fe                         70
Drospirenone / ethinyl estradiol       44
                                     ... 
Azelaic acid                            1
Antabuse                                1
Bepotastine                             1
Lo / Ovral-28                           1
Dulera                                  1
Name: count, Length: 704, dtype: int64

In [12]:
df[df.drugName == 'Ethinyl estradiol / norethindrone'].condition.value_counts(normalize=True)

condition
Birth Control                0.820943
Acne                         0.045031
missing                      0.037570
Menstrual Disorders          0.033040
Abnormal Uterine Bleeding    0.030109
Polycystic Ovary Syndrome    0.016520
Endometriosis                0.014921
Postmenopausal Symptoms      0.001599
Gonadotropin Inhibition      0.000266
Name: proportion, dtype: float64

We should note that "Not Listed" etc. essentially counts as a missing value.

In [None]:
# nulls were already filled with 'missing', so filter on that label instead of isna()
df[(df.condition == 'missing') &
   (df.drugName == 'Ethinyl estradiol / norethindrone')
  ].review[1253]

# contractions

Here is an example of a contraction.

In [None]:
df.review[3][56:69]

Here is how the html function fixes it.

In [None]:
html.unescape(df.review[3])[56:64]
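
As a standalone illustration of the same step, here is a minimal sketch that unescapes a made-up review string (the text of `raw` is invented for this example and does not come from the data set). It only uses `html.unescape` from the standard library.

```python
import html

# made-up review text containing the HTML entity for an apostrophe
raw = "It&#039;s been great, but I ain&#039;t sure about the side effects."

# html.unescape converts character references such as &#039; back to characters
clean = html.unescape(raw)
print(clean)
```

The same call could presumably be applied to the whole `review` column with `df.review.apply(html.unescape)`.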

Here is how the contractions function fixes (the html function's fix of) it.

In [None]:
contractions.fix(html.unescape(df.review[3]))[56:65]

Here is an instance of "ain't" with the same functions applied.

In [None]:
df.review.loc[507][75:99]

In [None]:
html.unescape(df.review.loc[507])[75:94]

In [None]:
contractions.fix(html.unescape(df.review.loc[507]))[75:96]

In [None]:
len(df[df.review.str.contains('ain&#039;t')])

There are 53 reviews that contain "ain't".

I'm currently having difficulty downloading the package that appropriately fixes "ain't" into "is not" or "are not" etc. This shouldn't matter after I remove stop words. I think it will be helpful to exclude negatives like "no" and "not" from the stop words. It could certainly be of help to look for bigrams like "not good".
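
A rough sketch of that idea follows. It is not what the notebook currently does: the tiny `stop_words` set is a hand-written stand-in for NLTK's English list, and the sentence in `tokens` is invented. The point is only to show negations being kept out of the stop list and then counted as part of bigrams.

```python
from collections import Counter

# tiny illustrative stop list; a stand-in for NLTK's English list (assumption)
stop_words = {'i', 'it', 'is', 'was', 'the', 'a', 'no', 'not', 'but', 'and'}

# negations we want to keep so that phrases like "not good" survive
negations = {'no', 'not'}
stop_words_kept_negations = stop_words - negations

# invented example text, already lowercased and split into tokens
tokens = 'it was not good but the second month was not bad and i had no pain'.split()
filtered = [t for t in tokens if t not in stop_words_kept_negations]

# count adjacent pairs of the remaining tokens
bigram_counts = Counter(zip(filtered, filtered[1:]))
print(bigram_counts.most_common(3))
```

One caveat with this ordering: because the bigrams are built after stop words are removed, a pair can span words that were not adjacent in the original review.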

# dates

In [None]:
sample = df.date.loc[0]

In [None]:
sample

In [None]:
re.split(r'\W+', sample)

There's probably a datetime method for this, but the following will produce month // day // year, and then we can figure out the earliest and latest dates.

In [None]:
df['month'] = df.date.apply(lambda x: re.split(r'\W+', x)[0])
df['day'] = df.date.apply(lambda x: int(re.split(r'\W+', x)[1]))
df['year'] = df.date.apply(lambda x: int(re.split(r'\W+', x)[2]))

In [None]:
df.year.min()

In [None]:
df[df.year == 2008].month.value_counts()

In [None]:
df[(df.year == 2008) &
   (df.month == 'February')
  ].day.min()

In [None]:
df.year.max()

In [None]:
df[df.year == 2017].month.value_counts()

In [None]:
df[(df.year == 2017) &
   (df.month == 'November')
  ].day.max()

The reviews span from February 24, 2008 to November 30, 2017.
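
For comparison, here is a sketch of the datetime route mentioned above, using `pd.to_datetime`. The format string `'%B %d, %Y'` is an assumption inferred from how the regex split yields month name, day, and year; the three sample strings are illustrative rather than read from the file.

```python
import pandas as pd

# sample date strings in the assumed "Month D, YYYY" layout
dates = pd.Series(['May 20, 2012', 'February 24, 2008', 'November 30, 2017'])

# parse the strings into proper timestamps
parsed = pd.to_datetime(dates, format='%B %d, %Y')

# earliest and latest dates fall out directly
earliest, latest = parsed.min(), parsed.max()
print(earliest.date(), latest.date())
```

On the real frame this would likely reduce to `pd.to_datetime(df.date, format='%B %d, %Y')` followed by `.min()` and `.max()`, without needing separate month, day, and year columns.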

# ratings

In [None]:
len(df)/2

In [None]:
df.rating.value_counts()

In [None]:
len(df[df.rating > 8.5])

In [None]:
len(df[df.rating < 8.5])

To split the ratings roughly in half we would split between 8 and 9, making the groups 1-8 and 9-10.

In [None]:
len(df)/3

In [None]:
len(df[df.rating > 9.5])

In [None]:
len(df[df.rating < 6.5])

To split the ratings roughly in thirds we would make the splits 1-6, 7-9, and 10.
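
A minimal sketch of how the three-way split could be encoded with `pd.cut`. The sample ratings and the labels `'low'`, `'mid'`, and `'high'` are hypothetical names chosen for this example.

```python
import pandas as pd

# sample ratings standing in for df.rating
ratings = pd.Series([1, 6, 7, 9, 10, 10, 3, 8])

# bins are right-inclusive by default: (0, 6], (6, 9], (9, 10]
rating_group = pd.cut(ratings, bins=[0, 6, 9, 10], labels=['low', 'mid', 'high'])
print(rating_group.value_counts())
```

The two-way split would work the same way with `bins=[0, 8, 10]` and two labels.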

# focusing on birth control

In [19]:
len(df[df.condition == 'Birth Control'])

38436

In [36]:
birth_control_drugs = set(df[df.condition == 'Birth Control'].drugName)

In [37]:
len(birth_control_drugs)

181

In [44]:
df[df.condition == 'Birth Control'].drugName.value_counts()

drugName
Etonogestrel                          4394
Ethinyl estradiol / norethindrone     3081
Levonorgestrel                        2884
Nexplanon                             2883
Ethinyl estradiol / levonorgestrel    2107
                                      ... 
Norlyda                                  1
Larin 24 Fe                              1
Loestrin 21 1.5 / 30                     1
Lillow                                   1
Cyclafem 7 / 7 / 7                       1
Name: count, Length: 181, dtype: int64

In [49]:
birth_control_drugs_with_missing_conditions = set()

for drug in birth_control_drugs:
    count = len(df[(df.drugName == drug) &
       (df.condition == 'missing')
      ])
    if count > 0:
        birth_control_drugs_with_missing_conditions.add(drug)

In [51]:
len(birth_control_drugs_with_missing_conditions)

91

In [56]:
birth_control_drugs_with_missing_conditions_whose_top_condition_is_birth_control = set()

for drug in birth_control_drugs_with_missing_conditions:
    # idxmax returns the most frequent condition label for this drug
    if df[df.drugName == drug].condition.value_counts().idxmax() == 'Birth Control':
        birth_control_drugs_with_missing_conditions_whose_top_condition_is_birth_control.add(drug)

In [57]:
len(birth_control_drugs_with_missing_conditions_whose_top_condition_is_birth_control)

88

In [60]:
for drug in birth_control_drugs_with_missing_conditions_whose_top_condition_is_birth_control:
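    # .iloc[0] takes the top proportion by position; plain [0] on a label-indexed Series is deprecated
    proportion = df[df.drugName == drug].condition.value_counts(normalize=True).iloc[0]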
    if proportion >= .8:
        print(proportion)

0.9926470588235294
0.825
0.9981412639405205
0.950354609929078
0.99389278443791
0.8
0.820943245403677
0.9803063457330415
0.8983050847457628
0.9534206695778749
0.9968879668049793
0.9933598937583001
0.9950738916256158
0.9629629629629629
0.9904191616766467
0.984375
0.833922261484099
0.9987515605493134
0.8837209302325582
0.9271523178807947
0.9980353634577603
0.9333333333333333
0.9975786924939467
0.8333333333333334
0.8
0.9635416666666666
0.8658536585365854
0.8333333333333334
0.8571428571428571
0.9090909090909091
0.9
0.9152542372881356
0.9125
0.86
0.9882352941176471
0.8661971830985915
0.9980353634577603
0.9090909090909091
0.9612676056338029
0.8571428571428571
0.9846153846153847
0.9951219512195122
0.9916666666666667
0.8301158301158301
0.8470254957507082
0.8928571428571429
0.9951456310679612
0.9444444444444444
0.9761904761904762
0.8372093023255814
0.8849557522123894
0.8417898521773871
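
One way the 0.8 threshold above might be used is to relabel missing conditions for the qualifying drugs. The sketch below is only an illustration on toy data: `impute_birth_control` is a hypothetical helper, not something defined in this notebook, and the toy frame stands in for `df`.

```python
import pandas as pd

# toy data standing in for df; drug 'A' is 90% birth control, drug 'B' is 25%
toy = pd.DataFrame({
    'drugName': ['A'] * 10 + ['B'] * 4,
    'condition': ['Birth Control'] * 9 + ['missing']
                 + ['Acne', 'Acne', 'Birth Control', 'missing'],
})

def impute_birth_control(frame, threshold=0.8):
    """Hypothetical helper: relabel 'missing' as 'Birth Control' for drugs
    whose share of 'Birth Control' reviews is at least `threshold`."""
    out = frame.copy()
    # share of each drug's reviews labeled 'Birth Control'
    # (missing rows stay in the denominator, as in the proportions above)
    share = (out.condition == 'Birth Control').groupby(out.drugName).mean()
    qualifying = share[share >= threshold].index
    # only rows that are both missing and belong to a qualifying drug change
    mask = (out.condition == 'missing') & out.drugName.isin(qualifying)
    out.loc[mask, 'condition'] = 'Birth Control'
    return out

imputed = impute_birth_control(toy)
print(imputed.condition.value_counts())
```

Whether 0.8 is the right cutoff is a judgment call; a stricter threshold would relabel fewer reviews but with more confidence.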


# rudimentary word cloud maker

In [None]:
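# make lists of all reviews
# note: dfbcpos and dfbcneg are not defined above; they are assumed to hold the
# positively and negatively rated birth control reviews and must exist before this cell runs
reviews_pos = dfbcpos.review.to_list()
reviews_neg = dfbcneg.review.to_list()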

In [None]:
# # make tokenizer
# tokenizer = TweetTokenizer(
#     preserve_case=False,
#     strip_handles=True
# )

# create lowercase lists of tokens from the reviews
# (the NLTK stop words removed later are all lowercase)
tokens_pos = word_tokenize(','.join(reviews_pos).lower())
tokens_neg = word_tokenize(','.join(reviews_neg).lower())

In [None]:
# make lemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize the list of words
tokens_lemmatized_pos = [lemmatizer.lemmatize(word) for word in tokens_pos]
tokens_lemmatized_neg = [lemmatizer.lemmatize(word) for word in tokens_neg]

In [None]:
# show the most frequently occurring tokens
FreqDist(tokens_lemmatized_pos).most_common(25)

In [None]:
# show the most frequently occurring tokens
FreqDist(tokens_lemmatized_neg).most_common(25)

In [None]:
# obtain the standard list of stopwords
nltk.download('stopwords', quiet=True)
# start our own list of stopwords with these words
stop_list = stopwords.words('english')
# add punctuation characters
for char in string.punctuation:
    stop_list.append(char)
# add the empty string and the lemmatizer artifacts 'ha' and 'wa' (from "has" and "was")
stop_list.extend(['', 'ha', 'wa'])

In [None]:
# make stopped list of tokens
tokens_stopped_pos = [word for word in tokens_lemmatized_pos if word not in stop_list]
tokens_stopped_neg = [word for word in tokens_lemmatized_neg if word not in stop_list]

In [None]:
# show the most frequently occurring tokens
FreqDist(tokens_stopped_pos).most_common(25)

In [None]:
# show the most frequently occurring tokens
FreqDist(tokens_stopped_neg).most_common(25)

In [None]:
# a function that generates a word cloud from a given list of words
def make_wordcloud(wordlist, colormap='Greens'):
    # instantiate wordcloud
    wordcloud = WordCloud(
        width=600,
        height=400,
        colormap=colormap,
        collocations=True
    )
    return wordcloud.generate(','.join(wordlist))

# a function that displays a generated word cloud, with an optional title
def plot_wordcloud(wordcloud, title=None):
    plt.figure(figsize=(12, 15))
    plt.imshow(wordcloud)
    if title is not None:
        plt.title(title)
    plt.axis('off')

In [None]:
# word cloud of stopped words
plot_wordcloud(make_wordcloud(tokens_stopped_pos))

In [None]:
# word cloud of stopped words
plot_wordcloud(make_wordcloud(tokens_stopped_neg))

# end