https://archive.ics.uci.edu/dataset/462/drug+review+dataset+drugs+com

# Flatiron Phase 5 Project

## Aaron Galbraith

https://www.linkedin.com/in/aarongalbraith \
https://github.com/aarongalbraith

### Submitted: November 21, 2023

## working contents

- **[functions](#functions)<br>**
- **[rough overview](#rough-overview)<br>**
- **[missing values](#missing-values)<br>**
- **[duplicates](#duplicates)<br>**
- **[brand / generic pairs](#brand-/-generic-pairs)<br>**
- **[further exploration of duplicates (skip for now)](#further-exploration-of-duplicates-(skip-for-now))<br>**
- **[contractions](#contractions)<br>**
- **[dates](#dates)<br>**
- **[ratings](#ratings)<br>**
- **[focusing on birth control](#focusing-on-birth-control)<br>**
- **[save and reload preprocessed set](#save-and-reload-preprocessed-set)<br>**
- **[feature engineering ideas](#feature-engineering-ideas)<br>**
- **[rudimentary word cloud maker](#rudimentary-word-cloud-maker)<br>**
- **[end](#end)<br>**


## Contents

- **[Business Understanding](#Business-Understanding)<br>**
- **[Data Understanding](#Data-Understanding)**<br>
- **[Data Preparation](#Data-Preparation)**<br>
- **[Exploration](#Exploration)**<br>
- **[Modeling](#Modeling)**<br>
- **[Evaluation](#Evaluation)**<br>
- **[Recommendations](#Recommendations)<br>**
- **[Further Inquiry](#Further-Inquiry)**<br>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import FreqDist
from nltk.corpus import stopwords
import string
from wordcloud import WordCloud

import html
import contractions

import re

from IPython.display import display

from sklearn.model_selection import train_test_split

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, f1_score
from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV

from pathlib import Path

SEED = 1979

do_grids = True

In [None]:
d1 = pd.read_csv('../data/drugsComTrain_raw.tsv', delimiter='\t', encoding='latin-1')
d2 = pd.read_csv('../data/drugsComTest_raw.tsv', delimiter='\t', encoding='latin-1')
df = pd.concat([d1,d2]).reset_index().drop(columns=['Unnamed: 0', 'index'])

# functions

In [None]:
def show_review(index):
    print(df.review.loc[index])
    display(df[df.review == df.loc[index].review][['drugName', 'condition', 'rating', 'date', 'usefulCount']])

# rough overview

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

There are some missing condition labels.

In [None]:
df.drugName.value_counts()

In [None]:
df.drugName.value_counts().quantile(.90)

There are 3,671 unique drug names, and 10% of the drug names have more than 120 reviews.

In [None]:
df.condition.value_counts()

In [None]:
df.condition.value_counts().quantile(.90)

There are 916 unique conditions, and 10% of the conditions have more than 332 reviews.

In [None]:
df.rating.value_counts()

Most of the ratings lie at the extremes, and more of them appear to be at the positive extreme.

In [None]:
df.groupby('drugName').condition.nunique().value_counts()

This means that, for example, 1869 drugs treat 1 condition only, etc.

In [None]:
df.groupby('condition').drugName.nunique().value_counts()

This means that 188 conditions are treatable by two drugs, etc.
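To make the reading of these two tallies concrete, here is a minimal sketch on a made-up frame (the drug and condition names are placeholders, not values from the data set). It counts distinct conditions per drug and then tallies how many drugs share each count.

```python
import pandas as pd

# toy stand-in for df; the names here are placeholders
toy = pd.DataFrame({
    'drugName': ['A', 'A', 'B', 'B', 'C'],
    'condition': ['Pain', 'Pain', 'Pain', 'Fever', 'Fever'],
})

# number of distinct conditions recorded for each drug: A -> 1, B -> 2, C -> 1
conditions_per_drug = toy.groupby('drugName').condition.nunique()

# tally of those counts: two drugs have 1 condition, one drug has 2
tally = conditions_per_drug.value_counts()
print(tally)
```

The index of `tally` is the number of conditions, and the value is how many drugs have that number.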

In [None]:
pd.set_option("display.max_rows", None)
print(df.drugName.value_counts())
pd.set_option("display.max_rows", 10)

A casual overview of the drug names indicates that they all seem valid. Some seem to specify drug combinations and/or dosage amounts.

In [None]:
pd.set_option("display.max_rows", None)
print(df.condition.value_counts())
pd.set_option("display.max_rows", 10)

Oddly, the condition labels often (always?) omit initial 'F' and terminal 'r'. We can isolate instances of the former by searching for conditions that start with a lower case letter.

We will eventually trim our records to a number of conditions that Planned Parenthood specializes in treating, but we will need all the records to help us determine missing condition labels. After we have restored (or discarded) all missing condition labels, we can drop the conditions outside the scope of this review.

## dates

(Do this date analysis *after* we have trimmed to just the records we'll use?)

There's probably a datetime method for this, but the following will produce month // day // year, and then we can figure out the earliest and latest dates.

In [None]:
df['month'] = df.date.apply(lambda x: re.split(r'\W+', x)[0])
df['day'] = df.date.apply(lambda x: int(re.split(r'\W+', x)[1]))
df['year'] = df.date.apply(lambda x: int(re.split(r'\W+', x)[2]))
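For reference, the datetime route mentioned above might look like the following sketch. It assumes every date string follows the "Month day, year" pattern (for example `May 25, 2016`) and uses a small made-up series in place of `df.date`.

```python
import pandas as pd

# toy stand-in for df.date; assumes strings shaped like 'May 25, 2016'
dates = pd.Series(['February 24, 2008', 'May 25, 2016', 'November 30, 2017'])

# parse with an explicit format: full month name, day of month, four-digit year
parsed = pd.to_datetime(dates, format='%B %d, %Y')

# earliest and latest dates come straight from min() and max()
earliest, latest = parsed.min(), parsed.max()

# year, month, and day are available through the .dt accessor
years = parsed.dt.year
print(earliest.date(), latest.date(), years.tolist())
```

With a parsed column, the separate month, day, and year lookups below reduce to a single `min()` and `max()`.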

In [None]:
df.year.min()

In [None]:
df[df.year == 2008].month.value_counts()

In [None]:
df[(df.year == 2008) &
   (df.month == 'February')
  ].day.min()

In [None]:
df.year.max()

In [None]:
df[df.year == 2017].month.value_counts()

In [None]:
df[(df.year == 2017) &
   (df.month == 'November')
  ].day.max()

The reviews span from February 24, 2008 to November 30, 2017.

In [None]:
df.year.hist();

## review text

In [None]:
for i in range(10):
    print(df.review[i], '\n-----')

# language cleaning

Before we go any further, we would like to clean up some of the review text. In particular, there are many escaped characters, especially apostrophes.

Here is an example of a contraction.

In [None]:
df.review[3][56:69]

Here is how the html function fixes it.

In [None]:
html.unescape(df.review.loc[3])[56:64]

Here is how the contractions function fixes (the html function's fix of) it.

In [None]:
contractions.fix(html.unescape(df.review.loc[3]))[56:65]

Here is an instance of "ain't" with the same functions applied.

In [None]:
df.review.loc[507][75:99]

In [None]:
html.unescape(df.review.loc[507])[75:94]

In [None]:
contractions.fix(html.unescape(df.review.loc[507]))[75:96]

In [None]:
len(df[df.review.str.contains('ain&#039;t')])

There are 53 instances of "ain't".

I'm currently having difficulty downloading the package that appropriately fixes "ain't" into "is not" or "are not" etc. This shouldn't matter after I remove stop words. I think it will be helpful to exclude negatives like "no" and "not" from the stop words. It could certainly be of help to look for bigrams like "not good".
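A rough sketch of the bigram idea, using only the standard library. The stop list and the reviews here are made up for illustration; the point is that leaving "no" and "not" out of the stop list lets pairs like "not good" survive.

```python
from collections import Counter

# hypothetical mini stop list that deliberately leaves out 'no' and 'not'
stop_words = {'the', 'is', 'it', 'was', 'a', 'i', 'for', 'me'}

# made-up reviews, already lower-cased
reviews = [
    'the pill was not good for me',
    'it is not good',
    'no side effects',
]

bigram_counts = Counter()
for review in reviews:
    # drop stop words but keep the negatives
    tokens = [t for t in review.split() if t not in stop_words]
    # pair each token with the one that follows it
    bigram_counts.update(zip(tokens, tokens[1:]))

print(bigram_counts.most_common(3))
```

If "not" were in the stop list, both of the first two reviews would collapse to tokens that look positive.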

In [None]:
df.review = df.review.apply(lambda x: html.unescape(x))
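As a small self-contained illustration of what this step does, the sketch below unescapes a made-up string containing the same kind of HTML entities found in the reviews.

```python
import html

# made-up review text with escaped apostrophes, an ampersand, and quotes
raw = 'I&#039;ve had no issues &amp; it ain&#039;t expensive. &quot;Great&quot;'

# html.unescape converts named and numeric character references to characters
clean = html.unescape(raw)
print(clean)
```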

# missing values

In [None]:
len(df[df.condition.isna()])

In [None]:
df['condition'] = df.condition.fillna('missing')

In [None]:
len(df[df.condition == 'missing'])

We noticed another condition label that was meant to indicate missing and should be accordingly changed.

In [None]:
df.condition = df.condition.apply(lambda x: 'missing' if 'Not Listed' in x else x)

In [None]:
len(df[df.condition == 'missing'])

We've identified some actual missing condition labels, but we noticed there are more condition labels that seem suspicious, particularly ones that start with something other than an upper case character. Let's look at all such condition labels.

In [None]:
set(df[(~df.condition.str[0].isin(list(string.ascii_uppercase))) &
   (df.condition != 'missing')
  ].condition)

These fall into three categories. Ones that include "users found this comment helpful" should be regarded as erroneous and therefore missing.

In [None]:
df.condition = df.condition.apply(lambda x: 'missing' if 'users found' in x else x)

In [None]:
len(df[df.condition == 'missing'])

Ones that show a clipped copy of the drug name and end with a parenthesis should also be regarded as missing.

In [None]:
df.condition = df.condition.apply(lambda x: 'missing' \
                                  if x[0] not in list(string.ascii_uppercase) and \
                                  x[-1] in ['(', ')'] \
                                  else x)

In [None]:
len(df[df.condition == 'missing'])

# restoring condition labels with errors

(skip this step because we're just focusing on birth control?)

Most of the ones that show a clipped version of the condition label can possibly be restored.

In [None]:
def condition_restore(condition):
    if condition.split()[-1] in ['Disorde', 'eve', 'Shoulde', 'Cance']:
        condition = condition+'r'
    if condition.split()[0] in ['acial', 'ibrocystic', 'ungal', 'amilial', 'ailure', 'ever', \
                                'emale', 'unctional', 'actor', 'ibromyalgia', 'atigue']:
        condition = 'F'+condition
    if condition.split()[0] in ['llicular', 'llicle', 'lic', 'cal']:
        condition = 'Fo'+condition
    if condition.split()[0] in ['mance']:
        condition = 'Perfor'+condition
    if condition.split()[0] in ['zen']:
        condition = 'Fro'+condition
    if condition.split()[0] in ['mis']:
        condition = 'Dermatitis Herpetifor'+condition
    return condition

df.condition = df.condition.apply(lambda x: condition_restore(x))
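The same idea can be sketched with lookup tables instead of a chain of `if` statements. This is only an illustrative variant, not a replacement for the function above: the names `LEADING`, `TRAILING`, and `restore_label` are hypothetical, the tables hold just a few of the fragments, and rejoining the words collapses any repeated whitespace.

```python
# hypothetical lookup tables holding a few of the clipped fragments
TRAILING = {'Disorde', 'eve', 'Shoulde', 'Cance'}          # last word lost its final 'r'
LEADING = {'emale': 'F', 'ever': 'F', 'ibromyalgia': 'F',  # first word lost its beginning
           'llicular': 'Fo', 'mance': 'Perfor', 'zen': 'Fro'}

def restore_label(label):
    words = label.split()
    # restore a clipped final 'r' first, so that 'eve' becomes 'ever'
    if words[-1] in TRAILING:
        words[-1] += 'r'
    # then restore a clipped beginning, so that 'ever' becomes 'Fever'
    prefix = LEADING.get(words[0])
    if prefix:
        words[0] = prefix + words[0]
    return ' '.join(words)

for label in ['eve', 'emale Infertility', 'Bipolar Disorde', 'Birth Control']:
    print(label, '->', restore_label(label))
```

Labels that match neither table pass through unchanged.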

Let's look at what we have left.

In [None]:
set(df[(~df.condition.str[0].isin(list(string.ascii_uppercase))) &
   (df.condition != 'missing')
  ].condition)

"von Willebrand's Disease" appears to be a naturally uncapitalized condition. The others have been impossible to restore and will also be regarded as missing.

In [None]:
df.condition = df.condition.apply(lambda x: 'missing' \
                                  if x[0] not in list(string.ascii_uppercase) and \
                                  x.split()[0] != 'von' \
                                  else x)

In [None]:
len(df[df.condition == 'missing'])

We will be able to restore more of these missing condition labels after we do some work with duplicates.

# duplicates

In [None]:
df.duplicated().value_counts()

In [None]:
df[df.duplicated()]

In [None]:
show_review(178703)

This is curious. The same review is recorded four times. There are two identical pairs, where the difference between the pairs is the drug name. We can drop one from each pair, but the pairs themselves will need to be revisited.

In [None]:
df.drop_duplicates(inplace=True)

# brand / generic pairs

The main type of duplicate we should look out for is records with duplicate reviews, as those likely indicate some kind of actual erroneous duplication. Let's see how many of those there are.

In [None]:
df.duplicated(subset=['review']).value_counts()

That's a lot!

Let's explore some facets of these duplicates.

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['drugName']))])

The vast majority of duplicate reviews are accounted for by different drug names. Let's explore some examples.

In [None]:
df[df.duplicated(subset=df.columns.difference(['drugName']))].head()

In [None]:
show_review(524)

In [None]:
show_review(574)

In [None]:
show_review(726)

In [None]:
show_review(1070)

In [None]:
show_review(1375)

These five examples suggest that the vast majority of duplicates are due to double entry: (nearly) every review is entered once with its generic name and once with its brand name.

We can use this phenomenon to restore some of the missing condition labels. If a missing condition label is part of such a unique pair, then we can confidently assign it the condition of its pair-mate.

Let's broaden our search to records that duplicate every feature other than drug name and condition.

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['drugName', 'condition']))])

This is how many records are duplicates of other records in all values EXCEPT (POSSIBLY) drug name and condition. If a record is duplicated in this manner, the second (and third, fourth, etc.) instance will be captured in this bucket of dupes.
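Here is a minimal sketch of how `duplicated` behaves with this kind of column subset, on a made-up three-row frame. By default only the second and later members of a group are flagged; `keep=False` flags every member.

```python
import pandas as pd

# toy stand-in for df: the first two rows are a brand/generic pair
toy = pd.DataFrame({
    'drugName': ['Levonorgestrel', 'Plan B', 'Sertraline'],
    'condition': ['Emergency Contraception', 'missing', 'Depression'],
    'review': ['worked fine', 'worked fine', 'helped a lot'],
    'rating': [9.0, 9.0, 8.0],
    'date': ['May 25, 2016', 'May 25, 2016', 'June 1, 2016'],
    'usefulCount': [3, 3, 5],
})

# compare on every column except drug name and condition
cols = toy.columns.difference(['drugName', 'condition'])

# default keep='first': only the later occurrence is marked as a duplicate
later_only = toy.duplicated(subset=cols)

# keep=False: every member of a duplicated group is marked
all_members = toy.duplicated(subset=cols, keep=False)

print(later_only.tolist(), all_members.tolist())
```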

If we check only this bucket for dupes, we can see whether there are any triplets, etc.

In [None]:
df_dupes = df[df.duplicated(subset=df.columns.difference(['drugName', 'condition']))]

In [None]:
len(df_dupes[df_dupes.duplicated(subset=df_dupes.columns.difference(['drugName', 'condition']))])

There is only one.

In [None]:
df_dupes[df_dupes.duplicated(subset=df_dupes.columns.difference(['drugName', 'condition']))]

In [None]:
show_review(140144)

There are 6 records with the same review, date, rating, and condition. Because they're on the *same day*, it seems likely that these reviews were entered repeatedly by the same person. The two with a useful count of 10 are likely a brand/generic pair. As for the other 4, a possible explanation is that Sandostatin and Octreotide are brand names for the two types of insulin, and one of them somehow acquired an erroneous useful count. Let's reassign the useful count of Sandostatin to 3 and let them pair off that way.

In [None]:
df.at[133212, 'usefulCount'] = 3

In [None]:
%%time
# ⏰ record the time for this cell -- usually 11-12 s

# create stripped down dataframe that does not have drug names or conditions
# we don't need these features for this operation because we're checking for matches on all other features
df_pairs = df.drop(columns=['drugName', 'condition']).copy()

# create a list of indices of records that duplicate everything other than drug name and condition
df_dupes = df_pairs[df_pairs.duplicated()].index.tolist().copy()

# create and populate a dictionary whose keys are dates and values are indices
dates_bucket = {}
# populate dictionary with keys that are dates belonging to the duplicates
for date_ in list(set(df[df.index.isin(df_dupes)].date.tolist())):
    dates_bucket[date_] = []
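# populate dictionary with values that are indices that are NOT from the duplicate list but DO share that date
for i in df[~df.index.isin(df_dupes)].index:
    # skip dates that never occur among the duplicates (this avoids a KeyError)
    if df.loc[i].date in dates_bucket:
        dates_bucket[df.loc[i].date].append(i)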

In [None]:
%%time
# ⏰ record the time for this cell -- usually 2–4 mins

# create a list of record pairs where each entry is a list of two indices
pairs = []

# iterate over the indices from the dupes list
for i in df_dupes:
    # set the date to the date from index i
    date_i = df.loc[i].date
    # iterate over OTHER indices who share that date
    for j in dates_bucket[date_i]:
        # check for a match
        if df_pairs.loc[i].equals(df_pairs.loc[j]):
            # remove this index from the dates dictionary so we have fewer to search through in later iterations
            dates_bucket[date_i].remove(j)
            # add this pair to the pairs list
            pairs.append([i,j])
            break
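A possible faster alternative to the nested loop is to let `groupby` collect the indices that share all of the matching columns. The sketch below runs on made-up data and makes two assumptions that the loop above does not: it only reports groups of exactly two records, and it relies on `groupby` defaults, which skip rows with missing values in the key columns. The names `key_cols` and `pairs_sketch` are hypothetical.

```python
import pandas as pd

# toy stand-in for df, with a non-default index to show that labels are preserved
toy = pd.DataFrame(
    {
        'drugName': ['Levonorgestrel', 'Plan B', 'Sertraline', 'Zoloft', 'Ibuprofen'],
        'condition': ['Emergency Contraception', 'missing', 'Depression', 'Depression', 'Pain'],
        'review': ['worked fine', 'worked fine', 'helped a lot', 'helped a lot', 'ok'],
        'rating': [9.0, 9.0, 8.0, 8.0, 5.0],
        'date': ['May 25, 2016', 'May 25, 2016', 'June 1, 2016', 'June 1, 2016', 'June 2, 2016'],
        'usefulCount': [3, 3, 5, 5, 0],
    },
    index=[10, 11, 12, 13, 14],
)

# every column except drug name and condition
key_cols = ['review', 'rating', 'date', 'usefulCount']

# .groups maps each distinct key tuple to the index labels that share it
groups = toy.groupby(key_cols).groups

# keep only groups with exactly two members
pairs_sketch = [sorted(int(i) for i in labels) for labels in groups.values() if len(labels) == 2]
print(sorted(pairs_sketch))
```

Groups with more than two members, like the six-record case examined earlier, would still need to be handled by hand.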

Let's take a look at several of the pairs we've collected.

In [None]:
pairs[:10]

Here we'll create a dictionary that matches the index of one pair member to the other member of the pair.

In [None]:
pairs_dict = {}

for pair in pairs:
    for i in range(2):
        pairs_dict[pair[i]] = pair[1-i]

# restore missing condition labels

We will restore missing condition labels in two ways, in order of certainty:

1. For each missing value that has a pair match, we will assign the condition of its match.
2. For each remaining missing value, we will assign the condition that is most commonly associated with its drug name.

First we'll restore missing condition labels for any record that belongs to one of these pairs.

In [None]:
len(df[df.condition == 'missing'])

In [None]:
%%time
# ⏰ record the time for this cell -- usually 15 seconds

# iterate over each record pair
for pair in pairs:
    # iterate over each member of the pair
    for i in range(2):
        # identify a pair member whose condition is missing
        if df.loc[pair[i]].condition == 'missing':
            # assign to the pair member the condition of its pair-mate
            df.at[pair[i], 'condition'] = df.loc[pairs_dict[pair[i]]].condition

In [None]:
len(df[df.condition == 'missing'])

We'll make a feature that names the indicated drug and, if applicable, the paired drug.

This is not a *final* replacement for the drug name feature, but it will allow us to better recognize the relationship between the generic and brand drug names.

In [None]:
%%time
# ⏰ record the time for this cell -- usually 20-30 seconds

df['ind'] = df.index

def drugList_fix(index, drugName_):
    drugList = [drugName_]
    if index in pairs_dict:
        drugList.append(df.loc[pairs_dict[index]].drugName)
        drugList.sort()
    return drugList

df['drugList'] = df.apply(lambda x: drugList_fix(x.ind, x.drugName), axis=1)

df.drop(columns='ind', inplace=True)

In [None]:
df['drugSetString'] = df.drugList.apply(lambda x: x[0] + ' ' + x[1] if len(x) == 2 else x[0])

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['drugName', 'drugSet', 'drugList']))])

With this new feature in place, we can drop one record from each of the brand/generic pairs. The drug name feature will retain only one member of each pair, chosen arbitrarily (whichever record appears first), which will make this feature more or less useless for the moment.

Before we drop these records, we'll create a bookmark copy of the dataframe.

In [None]:
df_bookmark_1 = df.copy()

In [None]:
df.drop_duplicates(subset=df.columns.difference(['drugName', 'drugSet', 'drugList']), inplace=True)

For every remaining record with a missing condition, we will assign it the condition that is most common for the drug indicated by that record. (This will not be biased by duplicates from brand/generic pairs, because we have dropped those duplicates.)

In [None]:
drugs_w_missing_condition = list(set(df[df.condition == 'missing'].drugSetString))

In [None]:
len(drugs_w_missing_condition)

In [None]:
df.drugSetString.nunique()

This applies to some 20% of the drugs. We'll create a dictionary that reports the most common condition for these drugs.

In [None]:
%%time
# record the time for this cell -- 10-20 seconds

most_common_condition = {}

for drug in drugs_w_missing_condition:
    condition = df[df.drugSetString == drug].condition.value_counts().idxmax()
    if condition == 'missing' and len(set(df[df.drugSetString == drug].condition)) > 1:
        condition = df[(df.drugSetString == drug) &
                       (df.condition != 'missing')
                      ].condition.value_counts().idxmax()
    proportion = round(df[df.drugSetString == drug].condition.value_counts(normalize=True).iloc[0],2)
    most_common_condition[drug] = [condition, proportion]

In [None]:
most_common_condition['Sildenafil Viagra']

For example, if a review with an unlisted condition is about Viagra, we will assume the condition is Erectile Dysfunction.
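The fill-in rule can be sketched on a made-up frame as follows. This is an illustrative variant using `groupby` and `map` rather than the dictionary built above, and the rows are invented. A drug that only ever appears with a missing condition ends up with `NaN` here, which corresponds to the records that stay unlabeled.

```python
import pandas as pd

# toy stand-in for df with invented rows
toy = pd.DataFrame({
    'drugSetString': ['Sildenafil Viagra'] * 4 + ['Sertraline Zoloft'] * 2,
    'condition': ['Erectile Dysfunction', 'Erectile Dysfunction', 'Pulmonary Hypertension',
                  'missing', 'Depression', 'missing'],
})

# most common non-missing condition for each drug
known = toy[toy.condition != 'missing']
most_common = known.groupby('drugSetString').condition.agg(lambda s: s.value_counts().idxmax())

# keep existing labels; replace 'missing' with the drug's most common condition
filled = toy.condition.where(toy.condition != 'missing', toy.drugSetString.map(most_common))
print(filled.tolist())
```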

In [None]:
len(df[df.condition == 'missing'])

In [None]:
df['condition'] = df.apply(lambda x: most_common_condition[x.drugSetString][0] \
                           if x.condition == 'missing' \
                           else x.condition, axis = 1)

In [None]:
len(df[df.condition == 'missing'])

This is how many records still have no condition label. The drugs named in these records appear *only* in records without a condition label, so there is nothing to infer a condition from. We cannot do anything useful with these records, and we may as well drop them.

In [None]:
df.drop(df[df.condition == 'missing'].index, inplace=True)

# trim to just birth control

At this point, we have decided exactly which records pertain to birth control. Now we can drop all records that pertain to other conditions. We'll make another bookmark copy first.

In [None]:
df_bookmark_2 = df.copy()

In [None]:
df.drop(df[df.condition != 'Birth Control'].index, inplace=True)

# exploring generic and brand names

Now that we have a smaller number of records to deal with, we can sort out generic and brand names.

First we'll create a list of all values from the drug name feature. (Some of these have been dropped from the drug name feature itself, but all of them were included in the drug list feature.)

In [None]:
drugs_raw = df.drugList.tolist()

Next, we'll create a set of drug names, all of which appear in a generic/brand pair.

In [None]:
all_drug_names = set()

for record in drugs_raw:
    if len(record) == 2:
        all_drug_names.add(record[0])
        all_drug_names.add(record[1])

len(all_drug_names)

Now we'll establish a dictionary whose keys are all the drug names that appear in a generic/brand pair.

In [None]:
drug_dict = {}

for drug in all_drug_names:
    drug_dict[drug] = set()

We'll assign values to those keys according to the pairings. For example, if drug name A is in a generic/brand pair with drug name B, then they will appear on each other's list of values in this dictionary.

In [None]:
for record in drugs_raw:
    if len(record) == 2:
        drug_dict[record[0]].add(record[1])
        drug_dict[record[1]].add(record[0])

In [None]:
len(drug_dict)

Let's find out how many of these drug names are associated with exactly one other drug name.

In [None]:
count = 0

for drug in drug_dict:
    if len(drug_dict[drug]) == 1:
        count += 1
count

That should mean that exactly the remainder are associated with multiple drug names.

In [None]:
count = 0

for drug in drug_dict:
    if len(drug_dict[drug]) > 1:
        count += 1
count

It would make sense that drug names that belong to multiple generic/brand pairs are themselves the generic name. On that assumption, we'll create a list of generic drug names.

In [None]:
generics = set()

for drug in drug_dict:
    if len(drug_dict[drug]) > 1:
        generics.add(drug)

Now we'll check to make sure that the drug names we've just designated as "generic" do NOT belong to a generic/brand pair with *another* "generic".

In [None]:
for drug in generics:
    for match in drug_dict[drug]:
        if match in generics:
            print(drug, drug_dict[drug])

Great.

Then we can begin designating drug names as "brands" if they are in a generic/brand pair with a generic.

In [None]:
brands = set()

for generic in generics:
    for match in drug_dict[generic]:
        brands.add(match)

In [None]:
len(generics)

In [None]:
len(brands)

Now let's see what drugs remain.

In [None]:
brands

In [None]:
set(drug for drug in all_drug_names if drug not in generics and drug not in brands)

Through this method we have identified some generic names and associated them with a greater number of brand names. Together, this accounts for a majority of the drug names. The remaining ones must be unique pairs that only ever appear with each other. We'll need outside information to identify which of these are the generic names and which are the brand names.
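The whole inference can be condensed into a short standard-library sketch. The pairs below are a made-up sample, and the rule is the same assumption used above: a name paired with more than one other name is treated as generic.

```python
from collections import defaultdict

# made-up sample of two-name drug lists
toy_pairs = [
    ['Levonorgestrel', 'Plan B'],
    ['Levonorgestrel', 'Mirena'],
    ['Etonogestrel', 'Nexplanon'],
    ['Etonogestrel', 'Implanon'],
    ['Drospirenone / ethinyl estradiol', 'Yaz'],
]

# map each name to the set of names it is paired with
partners = defaultdict(set)
for a, b in toy_pairs:
    partners[a].add(b)
    partners[b].add(a)

# assumption: a name with several partners is the generic name
generics_sketch = {name for name, others in partners.items() if len(others) > 1}

# any partner of a generic is then a brand name
brands_sketch = {b for g in generics_sketch for b in partners[g]}

# names in one-to-one pairs cannot be classified without outside information
unresolved = set(partners) - generics_sketch - brands_sketch
print(generics_sketch, brands_sketch, unresolved)
```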

# further exploration of duplicates (skip for now)

In [None]:
len(df[df.duplicated(subset=['drugName', 'condition', 'rating', 'date'])])

That also seems like a lot. Let's explore these now.

In [None]:
df[df.duplicated(subset=['drugName', 'condition', 'rating', 'date'])].head()

We'll use the "show_similar" function to explore these records that duplicate drug name, condition, rating, and date.

In [None]:
show_similar(2450)

In [None]:
show_similar(3597)

In [None]:
show_similar(4892)

In [None]:
df[df.duplicated(subset=['drugName', 'condition', 'rating', 'date'])].rating.value_counts()

In [None]:
df[
    (df.drugName == df.loc[8576].drugName) & \
    (df.condition == df.loc[8576].condition) & \
    (df.date == df.loc[8576].date)
    
]

In [None]:
df[(df.drugName == 'Miconazole') & \
   (df.condition == 'Vaginal Yeast Infection') & \
   (df.rating == 1.0) & \
   (df.date == 'May 25, 2016') & \
   (df.usefulCount == 6) \
  ]

In [None]:
show_review(8737)

In [None]:
len(df[df.duplicated(subset=['review'])])

An enormous number of records have duplicated reviews.

In [None]:
show_review(524)

In [None]:
show_review(574)

In [None]:
show_review(726)

In [None]:
show_review(1070)

In [None]:
show_review(1375)

In all of the instances we checked, the duplicated record occurs because it is listed once under its chemical name and once under its brand name. We'll assume this is mostly the reason for the vast majority of review duplications and deal with them after we address other types of review duplications.

In [None]:
len(df[(df.duplicated(subset=['review'])) &
   ~df.duplicated(subset=df.columns.difference(['drugName']))
  ])

This is how many records have identical reviews but differences *other than the drug name*. Let's explore a few of these.

In [None]:
df[(df.duplicated(subset=['review'])) &
   ~df.duplicated(subset=df.columns.difference(['drugName']))
  ].head(15)

In [None]:
show_review(2664)

In [None]:
show_review(6465)

In [None]:
show_review(9735)

In [None]:
show_review(13125)

Some of these are just common, short reviews, e.g. "Great". But others seem to have issues with the condition label as well.

We found earlier that many duplicate reviews come in pairs where the drug name is generic and brand name in the two records. It seems that more of these pairs exist in instances where the condition is "missing" for some reason. Where this specific phenomenon occurs, we'll relabel the condition to match its partner in the pair. This will reduce the number of "missing" conditions but increase the number of duplicate pairs.

In [None]:
len(df[df.condition == 'missing'])

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['drugName']))])

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['condition']))])

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['rating']))])

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['date']))])

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['usefulCount']))])

In [None]:
df[df.duplicated(subset=df.columns.difference(['usefulCount']))].head()

In [None]:
show_review(33451)

In [None]:
show_review(42728)

In [None]:
show_review(61617)

In [None]:
show_review(69518)

In [None]:
show_review(72794)

This appears to be an instance of someone re-posting a review multiple times. It seems that we should drop the duplicates in this case, but possibly we should tally up the useful count?

# dates

In [None]:
sample = df.date.loc[0]

In [None]:
sample

In [None]:
re.split(r'\W+', sample)

# ratings

In [None]:
len(df)/2

In [None]:
df.rating.value_counts()

In [None]:
len(df[df.rating > 8.5])

In [None]:
len(df[df.rating < 8.5])

To split the ratings roughly in half we would split between 8 and 9, that is, into 1-8 and 9-10.

In [None]:
len(df)/3

In [None]:
len(df[df.rating > 9.5])

In [None]:
len(df[df.rating < 6.5])

To split the ratings roughly in thirds we would make the splits 1-6, 7-9, and 10.
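If we go with the three-way split, `pd.cut` is one way to turn ratings into labeled bins. A minimal sketch on made-up ratings, where the labels `low`, `mid`, and `high` are placeholders:

```python
import pandas as pd

# made-up ratings on the 1-10 scale
ratings = pd.Series([1.0, 6.0, 7.0, 9.0, 10.0])

# bins are right-inclusive by default: (0, 6], (6, 9], (9, 10]
rating_bins = pd.cut(ratings, bins=[0, 6, 9, 10], labels=['low', 'mid', 'high'])
print(rating_bins.tolist())
```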

# focusing on birth control

In [None]:
len(df[df.condition == 'Birth Control'])

This many records pertain to the condition of birth control.

In [None]:
birth_control_drugs = set(df[df.condition == 'Birth Control'].drugName)

In [None]:
len(birth_control_drugs)

This many drugs treat birth control.

In [None]:
df[df.condition == 'Birth Control'].drugName.value_counts()

These are the most frequent drug names that treat birth control.

In [None]:
list(set(df[(df.condition != 'Birth Control') &
   (df.drugName.isin(birth_control_drugs))
  ].condition))

These are other conditions that are (at least sometimes) treated by drugs that (also) treat birth control.

# save and reload preprocessed set

At this stage we will save and reload the preprocessed set in order to avoid taking the time to repeat earlier work every time we open the notebook.

The saved version has restored or deleted all records with missing condition labels.

We have established pairs in the list `pairs` but we have NOT yet deleted either member of any pair or dealt with the confusion between brand and generic drug names.

The size of the dataframe is nearly the same as its original version, roughly 215,000 records.

In [None]:
filepath = Path('../data/preprocessed.csv')
filepath.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(filepath)

In [None]:
%store pairs

In [None]:
# read the saved index back in as the index so that it still matches the indices stored in `pairs`
df = pd.read_csv('../data/preprocessed.csv', index_col=0)

In [None]:
%store -r pairs

# feature engineering ideas

- word count
- character count
- words in all caps
- average word length
- whether words are in English (spelled correctly)
- whether it includes characters such as exclamation points, question marks, (especially repeatedly), and emoticons
- whether it mentions the brand or generic name in the review
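Several of these ideas can be sketched with the standard library alone. The helper name `text_features` and the sample review are hypothetical, and the feature definitions are one possible choice: for example, single-letter words such as "I" are not counted as all caps.

```python
import re

def text_features(review):
    words = review.split()
    return {
        'word_count': len(words),
        'char_count': len(review),
        # words of two or more characters written entirely in capitals
        'all_caps_words': sum(1 for w in words if len(w) > 1 and w.isupper()),
        # punctuation attached to a word counts toward its length here
        'avg_word_length': sum(len(w) for w in words) / len(words) if words else 0.0,
        'exclamations': review.count('!'),
        # runs of two or more exclamation points or question marks
        'repeated_punct': len(re.findall(r'[!?]{2,}', review)),
    }

sample = 'I LOVE this pill!!! No side effects.'
features = text_features(sample)
print(features)
```

Applied to the dataframe, this could become `df.review.apply(text_features).apply(pd.Series)`, though that has not been tried here.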

In [None]:
df['word_count'] = df.review.apply(lambda x: len(x.split()))

In [None]:
df['char_count'] = df.review.apply(lambda x: len(x))

In [None]:
'!' in df.loc[5].review

In [None]:
df_bookmark_2 = df.copy()

# truncate to just birth control

In [None]:
df.drop(df[df.condition != 'Birth Control'].index, inplace=True)

In [None]:
df.usefulCount.value_counts()

In [None]:
df.usefulCount.quantile(.99)

In [None]:
df[df.rating > 8].usefulCount.quantile(.95)

In [None]:
df[df.rating < 2].usefulCount.quantile(.95)

In [None]:
show_review(17598)

# rudimentary word cloud maker

In [None]:
df['review'] = df['review'].str.lower()

In [None]:
# take a copy so that adding a column below does not trigger SettingWithCopyWarning
dfbc = df[df.condition == 'Birth Control'].copy()

dfbc['sentiment'] = dfbc.rating.apply(lambda x: 1 if x > 5 else 0)

dfbcpos = df[
    (df.condition == 'Birth Control') & \
    (df.rating > 9.5)
]

dfbcneg = df[
    (df.condition == 'Birth Control') & \
    (df.rating < 6.5)
]

In [None]:
# make list of all reviews
reviews_pos = dfbcpos.review.to_list()
reviews_neg = dfbcneg.review.to_list()

In [None]:
# # make tokenizer
# tokenizer = TweetTokenizer(
#     preserve_case=False,
#     strip_handles=True
# )

# create list of tokens from data set
tokens_pos = word_tokenize(','.join(reviews_pos))
tokens_neg = word_tokenize(','.join(reviews_neg))


# tokens = [word for word in tokens]

In [None]:
# make lemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize the list of words
tokens_lemmatized_pos = [lemmatizer.lemmatize(word) for word in tokens_pos]
tokens_lemmatized_neg = [lemmatizer.lemmatize(word) for word in tokens_neg]

In [None]:
# show the most frequently occurring tokens
FreqDist(tokens_lemmatized_pos).most_common(25)

In [None]:
# show the most frequently occurring tokens
FreqDist(tokens_lemmatized_neg).most_common(25)

In [None]:
negatives = ['no', 'not', "don't", "aren't", "couldn't", "didn't", "doesn't", "hadn't", "hasn't", "haven't", \
             "isn't", "wasn't", "weren't", "won't", "wouldn't"]

In [None]:
# obtain the standard list of stopwords
nltk.download('stopwords', quiet=True)
# start our own list of stopwords with these words
stop_list = [word for word in stopwords.words('english') if word not in negatives]
# add punctuation characters
for char in string.punctuation:
    stop_list.append(char)
# add empty string and lemmatizer artifacts ('ha' from 'has', 'wa' from 'was')
stop_list.extend(['', 'ha', 'wa'])

In [None]:
stop_list

In [None]:
# make stopped list of tokens
tokens_stopped_pos = [word for word in tokens_lemmatized_pos if word not in stop_list]
tokens_stopped_neg = [word for word in tokens_lemmatized_neg if word not in stop_list]

In [None]:
# show the most frequently occurring tokens
FreqDist(tokens_stopped_pos).most_common(25)

In [None]:
# show the most frequently occurring tokens
FreqDist(tokens_stopped_neg).most_common(25)

In [None]:
# a function that generates a word cloud of a given list of words
def make_wordcloud(wordlist, colormap='Greens', title=None):
    # instantiate wordcloud
    wordcloud = WordCloud(
        width=600,
        height=400,
        colormap=colormap,
        collocations = True
    )
    return wordcloud.generate(','.join(wordlist))

def plot_wordcloud(wordcloud):
    # plot wordcloud
    plt.figure(figsize = (12, 15)) 
    plt.imshow(wordcloud) 
    plt.axis('off');

In [None]:
# word cloud of stopped words
plot_wordcloud(make_wordcloud(tokens_stopped_pos))

In [None]:
# word cloud of stopped words
plot_wordcloud(make_wordcloud(tokens_stopped_neg))

# end