https://archive.ics.uci.edu/dataset/462/drug+review+dataset+drugs+com

# Flatiron Phase 5 Project

## Aaron Galbraith

https://www.linkedin.com/in/aarongalbraith \
https://github.com/aarongalbraith

### Submitted: November 21, 2023

## working contents

- **[functions](#functions)**<br>
- **[rough overview](#rough-overview)**<br>
- **[duplicates](#duplicates)**<br>
- **[missing values](#missing-values)**<br>
- **[contractions](#contractions)**<br>
- **[dates](#dates)**<br>
- **[ratings](#ratings)**<br>
- **[focusing on birth control](#focusing-on-birth-control)**<br>
- **[feature engineering ideas](#feature-engineering-ideas)**<br>
- **[rudimentary word cloud maker](#rudimentary-word-cloud-maker)**<br>
- **[end](#end)**<br>


## Contents

- **[Business Understanding](#Business-Understanding)**<br>
- **[Data Understanding](#Data-Understanding)**<br>
- **[Data Preparation](#Data-Preparation)**<br>
- **[Exploration](#Exploration)**<br>
- **[Modeling](#Modeling)**<br>
- **[Evaluation](#Evaluation)**<br>
- **[Recommendations](#Recommendations)**<br>
- **[Further Inquiry](#Further-Inquiry)**<br>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import FreqDist
from nltk.corpus import stopwords
import string
from wordcloud import WordCloud

import html
import contractions

import re

from IPython.display import display

from sklearn.model_selection import train_test_split, GridSearchCV

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, f1_score
from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


SEED = 1979

do_grids = True

In [2]:
d1 = pd.read_csv('../data/drugsComTrain_raw.tsv', delimiter='\t', encoding='latin-1')
d2 = pd.read_csv('../data/drugsComTest_raw.tsv', delimiter='\t', encoding='latin-1')
df = pd.concat([d1,d2]).reset_index().drop(columns=['Unnamed: 0', 'index'])

# functions

In [3]:
def show_review(index):
    print(df.review.loc[index])
    display(df[df.review == df.loc[index].review][['drugName', 'condition', 'rating', 'date', 'usefulCount']])

In [4]:
def show_similar(index):
    
    count_total = df[
        (df.drugName == df.loc[index].drugName) & \
        (df.condition == df.loc[index].condition) & \
        (df.date == df.loc[index].date)
    ].review.count()
    
    count_similar = df[
        (df.drugName == df.loc[index].drugName) & \
        (df.condition == df.loc[index].condition) & \
        (df.rating == df.loc[index].rating) & \
        (df.date == df.loc[index].date)
    ].review.count()
    
    print('On', df.loc[index].date, df.loc[index].drugName, 'was reviewed', count_total, \
          'times and received a rating of', df.loc[index].rating, count_similar, 'times.\n')
    print('From that date, here are all', count_similar, 'reviews with the same rating:\n')
    for ind in df[
        (df.drugName == df.loc[index].drugName) & \
        (df.condition == df.loc[index].condition) & \
        (df.rating == df.loc[index].rating) & \
        (df.date == df.loc[index].date)
    ].index:
        print(df.loc[ind].review,'\n')
    
    print('Here is a breakdown of all the dates when reviewers gave the same drug name and condition THIS RATING:')
    display(df[
        (df.drugName == df.loc[index].drugName) & \
        (df.condition == df.loc[index].condition) & \
        (df.rating == df.loc[index].rating)
    ].date.value_counts())

# rough overview

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.drugName.value_counts()

In [None]:
df.condition.value_counts()

In [None]:
df.rating.value_counts()

In [None]:
df.groupby('drugName').condition.nunique().value_counts()

This means that, for example, 2047 drugs treat one condition only, and 753 drugs treat two conditions, etc.

In [None]:
df.groupby('condition').drugName.nunique().value_counts()

This means that 180 conditions are treatable by two drugs, etc.

In [None]:
pd.set_option("display.max_rows", None)
print(df.drugName.value_counts())
pd.set_option("display.max_rows", 10)

A casual overview of the drug names indicates that they all seem valid.

In [None]:
pd.set_option("display.max_rows", None)
print(df.condition.value_counts())
pd.set_option("display.max_rows", 10)

Oddly, the condition labels often (always?) omit initial 'F' and terminal 'r'. We can isolate instances of the former by searching for conditions that start with a lower case letter.

# missing values

In [5]:
len(df[df.condition.isna()])

1194

In [6]:
df.condition.fillna('missing', inplace=True)

In [7]:
len(df[df.condition == 'missing'])

1194

We noticed another condition label that was meant to indicate missing and should be accordingly changed.

In [8]:
df.condition = df.condition.apply(lambda x: 'missing' if 'Not Listed' in x else x)

In [9]:
len(df[df.condition == 'missing'])

1786

We've identified some actual missing condition labels, but we noticed there are more condition labels that seem suspicious, particularly ones that start with something other than an upper case character. Let's look at all such condition labels.

In [10]:
set(df[(~df.condition.str[0].isin(list(string.ascii_uppercase))) &
   (df.condition != 'missing')
  ].condition)

{'0</span> users found this comment helpful.',
 '100</span> users found this comment helpful.',
 '105</span> users found this comment helpful.',
 '10</span> users found this comment helpful.',
 '110</span> users found this comment helpful.',
 '11</span> users found this comment helpful.',
 '121</span> users found this comment helpful.',
 '123</span> users found this comment helpful.',
 '12</span> users found this comment helpful.',
 '135</span> users found this comment helpful.',
 '13</span> users found this comment helpful.',
 '142</span> users found this comment helpful.',
 '145</span> users found this comment helpful.',
 '146</span> users found this comment helpful.',
 '14</span> users found this comment helpful.',
 '15</span> users found this comment helpful.',
 '16</span> users found this comment helpful.',
 '17</span> users found this comment helpful.',
 '18</span> users found this comment helpful.',
 '19</span> users found this comment helpful.',
 '1</span> users found this comm

These fall into three categories. Ones that include "users found this comment helpful" should be regarded as erroneous and therefore missing.

In [11]:
df.condition = df.condition.apply(lambda x: 'missing' if 'users found' in x else x)

In [12]:
len(df[df.condition == 'missing'])

2957

Ones that show a clipped copy of the drug name and end with a parenthesis should also be regarded as missing.

In [13]:
df.condition = df.condition.apply(lambda x: 'missing' \
                                  if x[0] not in list(string.ascii_uppercase) and \
                                  x[-1] in ['(', ')'] \
                                  else x)

In [14]:
len(df[df.condition == 'missing'])

3286

Most of the ones that show a clipped version of the condition label can possibly be restored.

In [15]:
def condition_restore(condition):
    if condition.split()[-1] in ['Disorde', 'eve', 'Shoulde', 'Cance']:
        condition = condition+'r'
    if condition.split()[0] in ['acial', 'ibrocystic', 'ungal', 'amilial', 'ailure', 'ever', \
                                'emale', 'unctional', 'actor', 'ibromyalgia', 'atigue']:
        condition = 'F'+condition
    if condition.split()[0] in ['llicular', 'llicle', 'lic', 'cal']:
        condition = 'Fo'+condition
    if condition.split()[0] in ['mance']:
        condition = 'Perfor'+condition
    if condition.split()[0] in ['zen']:
        condition = 'Fro'+condition
    if condition.split()[0] in ['mis']:
        condition = 'Dermatitis Herpetifor'+condition
    return condition

df.condition = df.condition.apply(lambda x: condition_restore(x))

Let's look at what we have left.

In [16]:
set(df[(~df.condition.str[0].isin(list(string.ascii_uppercase))) &
   (df.condition != 'missing')
  ].condition)

{'m Pain Disorder', 'me', 't Care', "von Willebrand's Disease"}

"von Willebrand's Disease" appears to be a naturally uncapitalized condition. The others have been impossible to restore and will also be regarded as missing.

In [17]:
df.condition = df.condition.apply(lambda x: 'missing' \
                                  if x[0] not in list(string.ascii_uppercase) and \
                                  x.split()[0] != 'von' \
                                  else x)

In [18]:
len(df[df.condition == 'missing'])

3293

## proposed solutions for missing values

1. For every record with a missing condition, we will assign it the condition that is most common for the drug indicated by that record.

2. Before executing solution 1, find each record's twin and use the condition label from the twin where applicable.

For now, we'll just execute solution 1.

In [19]:
drugs_w_missing_condition = list(set(df[df.condition == 'missing'].drugName))

In [20]:
len(drugs_w_missing_condition)

842

This applies to about a quarter of the drugs. We'll create a dictionary that reports the most common condition for these drugs.

In [21]:
most_common_condition = {}

for drug in drugs_w_missing_condition:
    condition = df[df.drugName == drug].condition.value_counts().idxmax()
    if condition == 'missing' and len(set(df[df.drugName == drug].condition)) > 1:
        condition = df[(df.drugName == drug) &
                       (df.condition != 'missing')
                      ].condition.value_counts().idxmax()
    # .iloc[0] takes the top proportion by position (the index here holds condition labels)
    proportion = round(df[df.drugName == drug].condition.value_counts(normalize=True).iloc[0], 2)
    most_common_condition[drug] = [condition, proportion]

In [22]:
most_common_condition['Viagra']

['Erectile Dysfunction', 0.87]

For example, if a review with an unlisted condition is about Viagra, we will assume the condition is Erectile Dysfunction.

In [23]:
df['condition'] = df.apply(lambda x: most_common_condition[x.drugName][0] \
                           if x.condition == 'missing' \
                           else x.condition, axis = 1)

In [24]:
len(df[df.condition == 'missing'])

105

This is how many records still have no condition label. The drugs in these records appear *only* in records without a condition, so there is no most-common condition to borrow. They may still have "twin" records that we could match them to, but since we're skipping that step for now, there's nothing more we can do with these records, and we may as well drop them.

In [25]:
df.drop(df[df.condition == 'missing'].index, inplace=True)

# duplicates

In [26]:
df.duplicated().value_counts()

False    214956
True          2
Name: count, dtype: int64

In [27]:
df[df.duplicated()]

Unnamed: 0,drugName,condition,review,rating,date,usefulCount
178703,Levonorgestrel,Emergency Contraception,"""I had a quickie n he decided to finish it off...",1.0,"September 23, 2016",10
191001,Plan B,Emergency Contraception,"""I had a quickie n he decided to finish it off...",1.0,"September 23, 2016",10


In [29]:
show_review(178703)

"I had a quickie n he decided to finish it off in me... Well IMMEDIATELY we went 2 our local pharmacy n bought this plan b 1 step pill.I took it immediately.2 weeks later,took a pregnancy test n got the world&#039;s BIGGEST POSITIVE. The small pill was $50.That was the 1st time in a year n a half that I had intercourse n the last after I had my first son. I honestly believe this pill is ineffective because they just want u to think it works when n reality, it would never work. Alot of women don&#039;t know their bodies when they ovulate so if your not fertile and he ejaculates n u and u take the pill n dont get preg., The pill is supposed to make u think it worked. DO NOT buy. Was NEVER effective. Thank u!"


Unnamed: 0,drugName,condition,rating,date,usefulCount
131531,Levonorgestrel,Emergency Contraception,1.0,"September 23, 2016",10
143768,Plan B,Emergency Contraception,1.0,"September 23, 2016",10
178703,Levonorgestrel,Emergency Contraception,1.0,"September 23, 2016",10
191001,Plan B,Emergency Contraception,1.0,"September 23, 2016",10


This is curious. The same review is recorded four times. There are two identical pairs, where the difference between the pairs is the drug name. We can drop one from each pair, but this will need to be revisited.

In [30]:
df.drop_duplicates(inplace=True)

# temporary solution for duplicate records

For the time being, we will drop records that duplicate ALL values EXCEPT drug name. This MIGHT drop some genuinely different records that happen to have the same condition, review (e.g. "It works!"), rating, date, and useful count.

In [None]:
len(df)

In [None]:
df.drop(df[df.duplicated(subset=df.columns.difference(['drugName']))].index, inplace=True)

In [None]:
len(df)

# further exploration of duplicates (skip for now)

The main type of duplicate we should look out for is records with duplicate reviews, as those likely indicate some kind of actual erroneous duplication. Let's see how many of those there are.

In [None]:
len(df[df.duplicated(subset=['review'])])

That's a lot!

The other type of duplicate we should possibly be aware of is a kind of "intentional" duplicate, where a user seems to be logging multiple reviews for the same product on the same day with the same rating in some deliberate attempt to boost or bomb the product's average. Let's see how many records duplicate the drug name, condition, rating, and date.

In [None]:
len(df[df.duplicated(subset=['drugName', 'condition', 'rating', 'date'])])

That also seems like a lot. Let's explore these now.

In [None]:
df[df.duplicated(subset=['drugName', 'condition', 'rating', 'date'])].head()

We'll use the "show_similar" function to explore these reviews that duplicate drug name, condition, rating, and date.

In [None]:
show_similar(2450)

In [None]:
show_similar(3597)

In [None]:
show_similar(4892)

In [None]:
show_review(183510)

In [None]:
df[df.duplicated(subset=['drugName', 'condition', 'rating', 'date'])].rating.value_counts()

In [None]:
df[
    (df.drugName == df.loc[8576].drugName) & \
    (df.condition == df.loc[8576].condition) & \
    (df.date == df.loc[8576].date)
    
]

In [None]:
df[(df.drugName == 'Miconazole') & \
   (df.condition == 'Vaginal Yeast Infection') & \
   (df.rating == 1.0) & \
   (df.date == 'May 25, 2016') & \
   (df.usefulCount == 6) \
  ]

In [None]:
show_review(8737)

In [None]:
len(df[df.duplicated(subset=['review'])])

An enormous number of records have duplicated reviews.

In [None]:
df.duplicated(subset=df.columns.difference(['drugName'])).value_counts()

Most of the duplicate reviews are accounted for by different drug names. Let's explore some examples.

In [None]:
df[df.duplicated(subset=df.columns.difference(['drugName']))].head()

Let's look at each of these to see the full review and all the instances of duplication.

In [None]:
show_review(524)

In [None]:
show_review(574)

In [None]:
show_review(726)

In [None]:
show_review(1070)

In [None]:
show_review(1375)

In all of the instances we checked, the duplicate arises because the review is listed once under the drug's chemical name and once under its brand name. We'll assume this explains the vast majority of review duplications and deal with them after we address other types of review duplications.

In [None]:
len(df[(df.duplicated(subset=['review'])) &
   ~df.duplicated(subset=df.columns.difference(['drugName']))
  ])

This is how many records have identical reviews but differences *other than the drug name*. Let's explore a few of these.

In [None]:
df[(df.duplicated(subset=['review'])) &
   ~df.duplicated(subset=df.columns.difference(['drugName']))
  ].head(15)

In [None]:
show_review(2664)

In [None]:
show_review(6465)

In [None]:
show_review(9735)

In [None]:
show_review(13125)

Some of these are just common, short reviews, e.g. "Great". But others seem to have issues with the condition label as well.

We found earlier that many duplicate reviews come in pairs where the drug name is generic in one record and a brand name in the other. It seems that more of these pairs exist in instances where the condition of one record is "missing" for some reason. Where this specific phenomenon occurs, we'll relabel the condition to match its partner in the pair. This will reduce the number of "missing" conditions but increase the number of duplicate pairs.

In [None]:
len(df[df.condition == 'missing'])

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['drugName']))])

In [None]:
def condition_match(x):
    # find this record's "twins": same review, rating, date, and useful count,
    # but a different drug name and a condition that is not 'missing'
    twins = df[
        (df.drugName != x.drugName) &
        (df.condition != 'missing') &
        (df.review == x.review) &
        (df.rating == x.rating) &
        (df.date == x.date) &
        (df.usefulCount == x.usefulCount)
    ]
    # borrow the twin's condition only when the match is unambiguous
    if len(twins) == 1:
        return twins.iloc[0]['condition']
    return x.condition

# not applied yet (this section is skipped for now):
# df.condition = df.apply(lambda x: condition_match(x) if x.condition == 'missing' else x.condition, axis=1)

In [None]:
df[df.condition.str[-1] == ')'][['drugName', 'condition']].value_counts()

Is there a way to do a search for conditions whose last "word" is a string that appears in the drug name value?
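One possible answer to the question above is a row-wise check, since the comparison involves two columns of the same record. The sketch below runs on a small made-up frame (the rows are illustrative, not taken from the dataset) and uses a hypothetical helper, `last_word_in_drug`, that strips parentheses from the condition's last word and looks for it inside `drugName`, ignoring case.

```python
import pandas as pd

# Made-up rows for illustration; in the notebook this would be df.
toy = pd.DataFrame({
    'drugName': ['Amlodipine / valsartan', 'Viagra', 'Acetaminophen / caffeine'],
    'condition': ['ge (amlodipine / valsartan)', 'Erectile Dysfunction', 'mist ('],
})

def last_word_in_drug(row):
    # Hypothetical helper: True when the condition's last "word", stripped of
    # parentheses, occurs inside the drug name (case-insensitive).
    words = row['condition'].split()
    if not words:
        return False
    last = words[-1].strip('()').lower()
    return bool(last) and last in row['drugName'].lower()

mask = toy.apply(last_word_in_drug, axis=1)
suspicious = toy[mask]
print(suspicious)
```

A substring test like this can also flag legitimate labels when a short word happens to occur in a drug name, so the matches would still need a manual look.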

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['condition']))])

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['rating']))])

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['date']))])

In [None]:
len(df[df.duplicated(subset=df.columns.difference(['usefulCount']))])

In [None]:
df[df.duplicated(subset=df.columns.difference(['usefulCount']))].head()

In [None]:
show_review(42728)

In [None]:
show_review(61617)

In [None]:
show_review(69518)

In [None]:
show_review(72794)

This appears to be an instance of someone re-posting a review multiple times. It seems that we should drop the duplicates in this case, but possibly we should tally up the useful count?
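If we do decide to collapse re-posted reviews while keeping their combined useful count, one option is to group on every column except `usefulCount` and sum it. This is only a sketch on made-up rows, not a step the notebook currently runs, and whether summing is the right aggregation (rather than, say, taking the maximum) is an open assumption.

```python
import pandas as pd

# Made-up rows: the first two differ only in usefulCount.
toy = pd.DataFrame({
    'drugName': ['Nexplanon', 'Nexplanon', 'Apri'],
    'condition': ['Birth Control'] * 3,
    'review': ['"It works!"', '"It works!"', '"No complaints."'],
    'rating': [9.0, 9.0, 8.0],
    'date': ['April 21, 2017', 'April 21, 2017', 'August 25, 2010'],
    'usefulCount': [5, 3, 18],
})

# Group on everything except usefulCount, then add up the useful counts of
# records that are otherwise identical. Note that groupby drops rows with
# NaN in a key column by default, so missing values need handling first.
key_cols = [col for col in toy.columns if col != 'usefulCount']
collapsed = toy.groupby(key_cols, as_index=False, sort=False)['usefulCount'].sum()
print(collapsed)
```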

# contractions

Here is an example of a contraction.

In [31]:
df.review[3][56:69]

'I&#039;m glad'

Here is how the html function fixes it.

In [32]:
html.unescape(df.review.loc[3])[56:64]

"I'm glad"

Here is how the contractions function fixes (the html function's fix of) it.

In [33]:
contractions.fix(html.unescape(df.review.loc[3]))[56:65]

'I am glad'

Here is an instance of "ain't" with the same functions applied.

In [34]:
df.review.loc[507][75:99]

'I ain&#039;t complaining'

In [35]:
html.unescape(df.review.loc[507])[75:94]

"I ain't complaining"

In [36]:
contractions.fix(html.unescape(df.review.loc[507]))[75:96]

'I are not complaining'

In [37]:
len(df[df.review.str.contains('ain&#039;t')])

53

There are 53 instances of "ain't".

I'm currently having difficulty downloading the package that appropriately fixes "ain't" into "is not" or "are not" etc. This shouldn't matter after I remove stop words. I think it will be helpful to exclude negatives like "no" and "not" from the stop words. It could certainly be of help to look for bigrams like "not good".
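To illustrate why keeping negatives out of the stop list matters for bigrams, here is a minimal standard-library sketch. The stop list is an abbreviated stand-in (the notebook builds its real list from nltk further down), the two reviews are invented, and `bigrams_keeping_negatives` is a hypothetical helper name.

```python
from collections import Counter

# Abbreviated, assumed stop list for illustration only.
toy_stop_words = {'the', 'a', 'is', 'it', 'this', 'was', 'and', 'i', 'no', 'not'}
negatives = {'no', 'not'}
stop_words = toy_stop_words - negatives  # negatives survive the filter

def bigrams_keeping_negatives(text):
    # Lowercase, split on whitespace, strip surrounding punctuation,
    # drop stop words, then pair each remaining token with its successor.
    tokens = [word.strip('.,!?"').lower() for word in text.split()]
    tokens = [word for word in tokens if word and word not in stop_words]
    return list(zip(tokens, tokens[1:]))

reviews = ['This pill is not good.', 'It was not good and I am not happy.']
counts = Counter(bigram for review in reviews
                 for bigram in bigrams_keeping_negatives(review))
print(counts.most_common(3))
```

Had "not" been removed as a stop word, both reviews would have contributed the token "good" with no sign that it was negated.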

In [38]:
df.review = df.review.apply(lambda x: html.unescape(x))

# dates

In [None]:
sample = df.date.loc[0]

In [None]:
sample

In [None]:
re.split(r'\W+', sample)

There's probably a datetime method for this, but the following will produce month // day // year, and then we can figure out the earliest and latest dates.
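For reference, the datetime route could look like the sketch below, which uses `pd.to_datetime` with an explicit format string. The three dates are toy values in the same "Month D, YYYY" shape as the `date` column, and parsing full month names with `%B` assumes an English locale.

```python
import pandas as pd

# Toy dates in the same format as the date column.
dates = pd.Series(['May 20, 2012', 'February 24, 2008', 'November 30, 2017'])

# %B is the full month name, %d the day of the month, %Y the four-digit year.
parsed = pd.to_datetime(dates, format='%B %d, %Y')

earliest, latest = parsed.min(), parsed.max()
print(earliest.date(), latest.date())

# Year, month, and day are then available through the .dt accessor.
years = parsed.dt.year
```

With a parsed column, the earliest and latest review dates come straight from `min()` and `max()`, without filtering by year and month first.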

In [None]:
df['month'] = df.date.apply(lambda x: re.split(r'\W+', x)[0])
df['day'] = df.date.apply(lambda x: int(re.split(r'\W+', x)[1]))
df['year'] = df.date.apply(lambda x: int(re.split(r'\W+', x)[2]))

In [None]:
df.year.min()

In [None]:
df[df.year == 2008].month.value_counts()

In [None]:
df[(df.year == 2008) &
   (df.month == 'February')
  ].day.min()

In [None]:
df.year.max()

In [None]:
df[df.year == 2017].month.value_counts()

In [None]:
df[(df.year == 2017) &
   (df.month == 'November')
  ].day.max()

The reviews span from February 24, 2008 to November 30, 2017.

# ratings

In [None]:
len(df)/2

In [None]:
df.rating.value_counts()

In [None]:
len(df[df.rating > 8.5])

In [None]:
len(df[df.rating < 8.5])

To split the ratings roughly in half we would make the splits 1-8 and 9-10.

In [None]:
len(df)/3

In [None]:
len(df[df.rating > 9.5])

In [None]:
len(df[df.rating < 6.5])

To split the ratings roughly in thirds we would make the splits 1-6, 7-9, and 10.
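The splits described above could be encoded as follows. This is a sketch on a handful of toy ratings; the label names `low`, `mid`, and `high` are placeholders chosen here, not names used elsewhere in the notebook.

```python
import pandas as pd

ratings = pd.Series([1.0, 6.0, 7.0, 9.0, 10.0])

# Three-way split. pd.cut bins are right-inclusive by default,
# so these are (0, 6], (6, 9], and (9, 10].
thirds = pd.cut(ratings, bins=[0, 6, 9, 10], labels=['low', 'mid', 'high'])

# Two-way split: 1-8 becomes 0 and 9-10 becomes 1.
halves = (ratings > 8.5).astype(int)

print(thirds.tolist(), halves.tolist())
```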

# focusing on birth control

In [39]:
len(df[df.condition == 'Birth Control'])

39499

This many records pertain to the condition of birth control.

In [40]:
birth_control_drugs = set(df[df.condition == 'Birth Control'].drugName)

In [41]:
len(birth_control_drugs)

181

This many distinct drug names appear in reviews for birth control.

In [42]:
list(set(df[(df.condition != 'Birth Control') &
   (df.drugName.isin(birth_control_drugs))
  ].condition))

['Menorrhagia',
 'Acne',
 'Polycystic Ovary Syndrome',
 'Abnormal Uterine Bleeding',
 'Renal Cell Carcinoma',
 'Amenorrhea',
 'Ovarian Cysts',
 'Premenstrual Dysphoric Disorder',
 'Postmenopausal Symptoms',
 'Gonadotropin Inhibition',
 'Menstrual Disorders',
 'Premenstrual Syndrome',
 'Endometriosis',
 'Endometrial Hyperplasia, Prophylaxis',
 'Emergency Contraception']

These are other conditions that are (at least sometimes) listed for drugs that are (also) reviewed for birth control.

In [43]:
df[df.condition == 'Birth Control'].drugName.value_counts()

drugName
Etonogestrel                          4413
Ethinyl estradiol / norethindrone     3222
Levonorgestrel                        2924
Nexplanon                             2892
Ethinyl estradiol / levonorgestrel    2213
                                      ... 
Briellyn                                 1
Loestrin Fe 1.5 / 30                     1
Philith                                  1
Lillow                                   1
Cyclafem 7 / 7 / 7                       1
Name: count, Length: 181, dtype: int64

These are the drug names reviewed for birth control, ordered from most to least frequently reviewed.

# feature engineering ideas

- word count
- character count
- words in all caps
- average word length
- whether words are in English (spelled correctly)
- whether it includes characters such as exclamation points, question marks, (especially repeatedly), and emoticons
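Several of the ideas listed above can be computed with plain string operations. The following is a rough sketch using a hypothetical helper, `review_features`; the feature names are made up here, and the spell-check and emoticon ideas are left out because they would need extra resources.

```python
import re

def review_features(text):
    # Hypothetical helper computing a few of the listed features for one review.
    words = text.split()
    # Words of two or more characters written entirely in capitals, so that
    # a lone "I" or "A" is not counted as shouting.
    stripped = [w.strip('.,!?"') for w in words]
    all_caps = [w for w in stripped if len(w) > 1 and w.isupper()]
    return {
        'word_count': len(words),
        'char_count': len(text),
        'all_caps_count': len(all_caps),
        'avg_word_length': sum(len(w) for w in words) / len(words) if words else 0.0,
        'exclamation_count': text.count('!'),
        'question_count': text.count('?'),
        # Runs of two or more ! or ? characters, e.g. "!!" or "?!?".
        'repeated_punct_count': len(re.findall(r'[!?]{2,}', text)),
    }

features = review_features('DO NOT buy!! Was NEVER effective. Why?')
print(features)
```

To turn this into columns, something like `pd.DataFrame(df.review.apply(review_features).tolist(), index=df.index)` would give one column per feature.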

# truncate to just birth control

In [45]:
# drop() with inplace=True returns None, so the index is reset in a separate step below
df.drop(df[df.condition != 'Birth Control'].index, inplace=True)

In [59]:
df.drop(columns='condition', inplace=True)

In [52]:
len(df)

39499

In [64]:
df.reset_index(inplace=True)

In [65]:
df.drop(columns='index', inplace=True)

# attempt to fix twin duplicates problem

In [66]:
df[df.duplicated(subset=['review', 'rating', 'date', 'usefulCount'])]

Unnamed: 0,level_0,drugName,review,rating,date,usefulCount
96,96,Nexplanon,"""First had implanon then got Nexplanon, had a ...",9.0,"April 21, 2017",5
134,134,Orsythia,"""I have only been on orsythia for about 1 mont...",2.0,"October 8, 2015",7
252,252,Ethinyl estradiol / norethindrone,"""I have been taking my first pack of Lo Loestr...",8.0,"February 1, 2012",7
368,368,Aviane,"""I have been taking Aviane for about 3 years n...",9.0,"January 24, 2011",1
428,428,Norethindrone,"""Long story short: I've never been able to tak...",9.0,"February 14, 2016",10
...,...,...,...,...,...,...
39494,39494,Etonogestrel,"""So I got Nexplanon just under a year ago. I o...",9.0,"November 5, 2013",3
39495,39495,Levonorgestrel,"""I first would like to thank all of you that p...",3.0,"January 20, 2010",140
39496,39496,Microgestin Fe 1 / 20,"""I was on Microgestin for about 3 years. Over ...",6.0,"August 1, 2014",15
39497,39497,Apri,"""I started taking Apri about 7 months ago. My ...",9.0,"August 25, 2010",18


In [75]:
df['condition'] = 'Birth Control'

In [78]:
show_review(96)

"First had implanon then got Nexplanon, had a period first month and I have not had one since. I'm due to remove it next year.  I do notice spotting  sometimes for a day but it honestly  usually coincides with when I'm stressed. 
Had some weight gain also.

So far the best BC I've  had in all my years.  I plan on trying for a baby next year then I will be back on it."


Unnamed: 0,drugName,condition,rating,date,usefulCount
69,Etonogestrel,Birth Control,9.0,"April 21, 2017",5
96,Nexplanon,Birth Control,9.0,"April 21, 2017",5


In [72]:
assert df.index.max() == len(df) - 1

In [97]:
twin_dates = []
for i in df[df.duplicated(subset=['review', 'rating', 'date', 'usefulCount'])].index:
    twin = df[df.date == df.loc[i].date]
    twin_dates.append(len(twin))

In [100]:
max(twin_dates)

60

In [101]:
len(twin_dates)

19420

In [83]:
found_pairs = 0

for i in df[df.duplicated(subset=['review', 'rating', 'date', 'usefulCount'])].index:
    for j in df[~df.duplicated(subset=['review', 'rating', 'date', 'usefulCount'])].index:
        if df.loc[i].review == df.loc[j].review and \
        df.loc[i].rating == df.loc[j].rating and \
        df.loc[i].date == df.loc[j].date and \
        df.loc[i].usefulCount == df.loc[j].usefulCount:
            found_pairs += 1
            break

print(found_pairs)

KeyboardInterrupt: 

In [84]:
found_pairs

159
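The nested loop above compares every duplicate against every non-duplicate, which is why it had to be interrupted. A grouped count over the four twin-defining columns gets at the same information in one pass. The following is a sketch on made-up rows; `twin_cols`, `group_size`, and `is_later_copy` are names introduced here for illustration, and it assumes no missing values in those four columns.

```python
import pandas as pd

# Made-up rows: the first two are twins listed under different drug names.
toy = pd.DataFrame({
    'drugName': ['Etonogestrel', 'Nexplanon', 'Apri', 'Aviane'],
    'review': ['"Best BC so far."', '"Best BC so far."',
               '"Seven months in."', '"Three years now."'],
    'rating': [9.0, 9.0, 9.0, 9.0],
    'date': ['April 21, 2017', 'April 21, 2017', 'August 25, 2010', 'January 24, 2011'],
    'usefulCount': [5, 5, 18, 1],
})

twin_cols = ['review', 'rating', 'date', 'usefulCount']

# Number of records sharing all four twin-defining values, per row.
group_size = toy.groupby(twin_cols)['drugName'].transform('size')

# Later occurrences within each group: the rows the outer loop iterates over.
is_later_copy = toy.duplicated(subset=twin_cols)

# Each later copy has exactly one first occurrence to pair with, so this is
# the total the loop would have reached had it been allowed to finish.
found_pairs = int(is_later_copy.sum())

# All members of groups with more than one record, for side-by-side review.
twins = toy[group_size > 1]
print(found_pairs)
print(twins)
```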

# rudimentary word cloud maker

In [None]:
df['review'] = df['review'].str.lower()

In [None]:
# copy so that adding a column does not trigger SettingWithCopyWarning
dfbc = df[df.condition == 'Birth Control'].copy()

dfbc['sentiment'] = dfbc.rating.apply(lambda x: 1 if x > 5 else 0)

dfbcpos = df[
    (df.condition == 'Birth Control') & \
    (df.rating > 9.5)
]

dfbcneg = df[
    (df.condition == 'Birth Control') & \
    (df.rating < 6.5)
]

In [None]:
# make list of all reviews
reviews_pos = dfbcpos.review.to_list()
reviews_neg = dfbcneg.review.to_list()

In [None]:
# # make tokenizer
# tokenizer = TweetTokenizer(
#     preserve_case=False,
#     strip_handles=True
# )

# create list of tokens from data set
tokens_pos = word_tokenize(','.join(reviews_pos))
tokens_neg = word_tokenize(','.join(reviews_neg))


# tokens = [word for word in tokens]

In [None]:
# make lemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize the list of words
tokens_lemmatized_pos = [lemmatizer.lemmatize(word) for word in tokens_pos]
tokens_lemmatized_neg = [lemmatizer.lemmatize(word) for word in tokens_neg]

In [None]:
# show the most frequently occurring tokens
FreqDist(tokens_lemmatized_pos).most_common(25)

In [None]:
# show the most frequently occurring tokens
FreqDist(tokens_lemmatized_neg).most_common(25)

In [None]:
negatives = ['no', 'not', "don't", "aren't", "couldn't", "didn't", "doesn't", "hadn't", "hasn't", "haven't", \
             "isn't", "wasn't", "weren't", "won't", "wouldn't"]

In [None]:
# obtain the standard list of stopwords
nltk.download('stopwords', quiet=True)
# start our own list of stopwords with these words
stop_list = [word for word in stopwords.words('english') if word not in negatives]
# add punctuation characters
for char in string.punctuation:
    stop_list.append(char)
# add empty string and the lemmatizer's clipped forms of "has" and "was"
stop_list.extend(['', 'ha', 'wa'])

In [None]:
stop_list

In [None]:
# make stopped list of tokens
tokens_stopped_pos = [word for word in tokens_lemmatized_pos if word not in stop_list]
tokens_stopped_neg = [word for word in tokens_lemmatized_neg if word not in stop_list]

In [None]:
# show the most frequently occurring tokens
FreqDist(tokens_stopped_pos).most_common(25)

In [None]:
# show the most frequently occurring tokens
FreqDist(tokens_stopped_neg).most_common(25)

In [None]:
# a function that generates a word cloud of a given list of words
def make_wordcloud(wordlist, colormap='Greens', title=None):
    # instantiate wordcloud
    wordcloud = WordCloud(
        width=600,
        height=400,
        colormap=colormap,
        collocations = True
    )
    return wordcloud.generate(','.join(wordlist))

def plot_wordcloud(wordcloud):
    # plot wordcloud
    plt.figure(figsize = (12, 15)) 
    plt.imshow(wordcloud) 
    plt.axis('off');

In [None]:
# word cloud of tokens from positive reviews, stop words removed
plot_wordcloud(make_wordcloud(tokens_stopped_pos))

In [None]:
# word cloud of tokens from negative reviews, stop words removed
plot_wordcloud(make_wordcloud(tokens_stopped_neg))

# end