# Exploratory Data Analysis: WebMD Reviews of Type 2 Diabetes Treatments

## Introduction:
Project Description:
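This notebook loads user reviews of Type 2 diabetes treatments scraped from WebMD, checks the scrape for quality problems, and then explores the data.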

### Contents:
 - Read in scraped data
 - Perform QA on data from scrape
 - Exploratory Data Analysis
 - Insights

#### Read in scraped data:
First, let's import numpy and pandas, and read in the scraped data.

In [1]:
import numpy as np
import pandas as pd

In [2]:
reviews = pd.read_csv('webmd_reviews.csv')

Let's take a look at some sample data.

In [3]:
reviews.sample(10)

Unnamed: 0,Rdate,comment,condition,drug,easeofuse,effectiveness,helpful,reviewer,satisfaction
4823,10/30/2013 6:14:09 PM,,Type 2 Diabetes Mellitus,Invokana oral,5,4,1,65-74 Female on Treatment for less than 1 mon...,4
584,2/6/2008 12:41:44 PM,,Type 2 Diabetes Mellitus,Avandia oral,5,4,6,"mikeoco, 65-74 Male on Treatment for 2 to les...",3
3229,2/26/2008 3:35:16 PM,,Type 2 Diabetes Mellitus,Actos oral,5,4,1,"chowdawgpup65, 65-74 Male on Treatment for 1 ...",4
3748,12/24/2012 2:07:01 PM,,Type 2 Diabetes Mellitus,Actos oral,5,5,1,"amadio, 75 or over Male on Treatment for 10 y...",5
1397,1/7/2009 9:52:52 PM,,Type 2 Diabetes Mellitus,metformin oral,2,2,2,"xring, 55-64 Male on Treatment for less than ...",1
1611,9/24/2009 3:58:57 AM,,Type 2 Diabetes Mellitus,metformin oral,4,2,5,"bramaw, 55-64 Female on Treatment for 1 to 6 ...",1
522,2/26/2009 11:56:36 AM,,Type 2 Diabetes Mellitus,Lantus Solostar subcutaneous,5,3,9,"tndsteward, 35-44 Female on Treatment for les...",3
573,12/13/2007 12:52:36 AM,I have taken this medication for the last 10 y...,Type 2 Diabetes Mellitus,Avandia oral,4,3,2,"Victim of Avandia, 55-64 Female on Treatment ...",1
4517,5/5/2010 12:41:15 PM,,Additional Medication for Diabetes Type 2,Byetta subcutaneous,4,5,6,"karenb, 65-74 Female on Treatment for 2 to le...",5
2455,3/23/2012 10:38:58 AM,,Type 2 Diabetes Mellitus,nateglinide oral,5,4,5,65-74 Female (Patient),2


In [4]:
reviews.columns

Index(['Rdate', 'comment', 'condition', 'drug', 'easeofuse', 'effectiveness',
       'helpful', 'reviewer', 'satisfaction'],
      dtype='object')

#### Perform Basic QA

Now let's check against WebMD to ensure counts and classifications were scraped as expected.

Reference URL: https://www.webmd.com/drugs/2/condition-594/type%202%20diabetes%20mellitus

How many reviews did we pull?

In [5]:
len(reviews.drug)

5495

How many unique drugs with reviews did we scrape?

In [6]:
reviews['drug'].nunique()

70

How many reviews from each drug did we scrape?

In [7]:
reviews['drug'].value_counts()

metformin oral                               1227
Actos oral                                    622
Januvia oral                                  440
Byetta subcutaneous                           373
Janumet oral                                  208
glipizide oral                                195
Invokana oral                                 193
glimepiride oral                              172
Trulicity subcutaneous                        126
Glucophage oral                               111
Bydureon subcutaneous                         110
Onglyza oral                                  106
glyburide oral                                101
Avandia oral                                   96
Amaryl oral                                    95
Tradjenta oral                                 90
Farxiga oral                                   72
Actoplus MET oral                              68
Kombiglyze XR oral                             58
Levemir Flexpen subcutaneous                   56


We searched for "Type 2 Diabetes Mellitus" on WebMD, but is that the only condition in the scraped data? Let's check.

In [8]:
reviews['condition'].value_counts()

Type 2 Diabetes Mellitus                     4955
Additional Medication for Diabetes Type 2     483
Diabetes                                       41
Type 1 Diabetes Mellitus                       16
Name: condition, dtype: int64

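OK, so about 90% of our reviews are specifically for "Type 2 Diabetes Mellitus" and about 9% are for "Additional Medication for Diabetes Type 2". The remaining 1% or so are labeled "Diabetes" or "Type 1 Diabetes Mellitus".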
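The percentages can be checked directly from the counts. The sketch below rebuilds the `value_counts()` output as a Series (the counts are copied from the cell above; the variable names are only illustrative) and converts it to shares.

```python
import pandas as pd

# Counts copied from the reviews['condition'].value_counts() output above.
condition_counts = pd.Series({
    "Type 2 Diabetes Mellitus": 4955,
    "Additional Medication for Diabetes Type 2": 483,
    "Diabetes": 41,
    "Type 1 Diabetes Mellitus": 16,
})

# Share of all reviews per condition, as a percentage rounded to one decimal.
condition_share = (condition_counts / condition_counts.sum() * 100).round(1)
print(condition_share)
```

On the real DataFrame, `reviews['condition'].value_counts(normalize=True)` should give the same shares as fractions without the intermediate step.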

How many distinct values does the 'reviewer' field have?

In [9]:
reviews['reviewer'].nunique()

3916

Check for null values.

In [10]:
reviews.isnull().sum()

Rdate               0
comment          4520
condition           0
drug                0
easeofuse           0
effectiveness       0
helpful             0
reviewer          253
satisfaction        0
dtype: int64

Uh oh! 4,520 null comments out of 5,495 reviews (about 82%)... this can't be right. Let's grab some examples and debug.
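Null counts are easier to judge as a share of rows. A minimal sketch on a toy frame (the rows are made up for illustration and are not taken from the scrape): because `isnull()` returns booleans, taking the column mean gives the fraction of nulls.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; column names mirror the scraped data.
toy = pd.DataFrame({
    "comment": [np.nan, "love it", np.nan, np.nan],
    "reviewer": ["cutie54, 45-54 Female", np.nan, "45-54 Male (Patient)", "tipe2, 35-44 Male"],
    "helpful": [9, 4, 8, 9],
})

# Mean of a boolean column is the fraction of True values, i.e. the null share.
null_share = toy.isnull().mean()
print(null_share)
```

The same call on `reviews` would report the null share for each column in one step.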

In [21]:
reviews.loc[reviews.isnull().any(axis=1), :]

Unnamed: 0,Rdate,comment,condition,drug,easeofuse,effectiveness,helpful,reviewer,satisfaction
1,9/18/2007 11:18:23 AM,,Type 2 Diabetes Mellitus,metformin oral,5,4,9,"cutie54, 45-54 Female on Treatment for 1 to 6...",5
3,4/5/2011 1:31:52 PM,,Type 2 Diabetes Mellitus,Kombiglyze XR oral,5,4,4,25-34 Female on Treatment for less than 1 mon...,4
4,3/7/2011 4:35:50 PM,,Type 2 Diabetes Mellitus,Kombiglyze XR oral,3,1,8,45-54 Male (Patient),1
6,3/25/2015 12:19:46 PM,,Type 2 Diabetes Mellitus,Afrezza inhalation,5,5,9,"tipe2, 35-44 Male on Treatment for less than ...",5
7,3/24/2015 5:08:47 PM,,Type 2 Diabetes Mellitus,Afrezza inhalation,1,1,1,55-64 Female on Treatment for 1 to 6 months (...,1
9,10/21/2007 10:23:33 AM,,Type 2 Diabetes Mellitus,Actoplus MET oral,5,5,2,35-44 Male on Treatment for 1 to less than 2 ...,5
10,10/14/2007 9:24:09 PM,,Type 2 Diabetes Mellitus,Actoplus MET oral,5,5,6,"muzings, 45-54 Male on Treatment for 6 months...",5
11,10/14/2007 8:59:09 PM,,Type 2 Diabetes Mellitus,Actoplus MET oral,4,4,4,55-64 Female on Treatment for 6 months to les...,4
13,5/7/2014 3:05:19 PM,,Type 2 Diabetes Mellitus,Farxiga oral,5,4,2,"Karmacomes knockin, 55-64 Female",4
15,9/21/2007 10:48:32 AM,love it,Type 2 Diabetes Mellitus,Avandia oral,3,3,4,,4


Note: The review parser had bugs that produced nulls in the 'comment' field. Some comments on WebMD are genuinely blank, but nowhere near 80% of them. The 253 null 'reviewer' values, by contrast, are genuinely blank fields on the website.

The spider's review parsing has been debugged. Before re-crawling, let's take a look at the ratings fields (below).

Now let's see some summary stats for our ratings!

In [11]:
reviews.describe()

| | easeofuse | effectiveness | helpful | satisfaction |
|---|---|---|---|---|
| count | 5495.0 | 5495.0 | 5495.0 | 5495.0 |
| mean | 4.055505 | 3.442038 | 3.079709 | 3.052957 |
| std | 1.246186 | 1.42596 | 2.591082 | 1.596296 |
| min | 1.0 | 1.0 | 0.0 | 1.0 |
| 25% | 3.0 | 2.0 | 1.0 | 1.0 |
| 50% | 5.0 | 4.0 | 2.0 | 3.0 |
| 75% | 5.0 | 5.0 | 5.0 | 5.0 |
| max | 5.0 | 5.0 | 9.0 | 5.0 |


Note: 'easeofuse', 'effectiveness', and 'satisfaction' are all ratings with a maximum of 5; 'helpful', however, is a counter of user upvotes, so a maximum of 9 isn't plausible. Reviewing the site against the examples in the .loc table above revealed a parser bug that captured only a single digit of the count. The parser has been fixed to capture multi-digit counts, and the spider needs to be rerun.
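The spider's code is not shown in this notebook, so the following is only a hypothetical illustration of how a single-digit bug of this kind can arise, assuming the count is pulled out of text with a regular expression. The sample string is invented, not actual WebMD markup.

```python
import re

# Hypothetical scraped text; the real page markup may differ.
raw = "16 people found this review helpful"

# r"\d" matches exactly one digit, so a count of 16 is truncated to 1.
single_digit = int(re.search(r"\d", raw).group())

# r"\d+" matches a run of one or more digits and keeps the whole count.
full_count = int(re.search(r"\d+", raw).group())

print(single_digit, full_count)
```

A pattern like the first one would cap every parsed value at 9, which matches the symptom seen in the summary statistics.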

With the review parser debugged for the 'helpful' and 'comment' fields, it's time to re-crawl WebMD.

##### QA round 2:
...2 hours later...

Load 2nd-round file and QA again!

In [26]:
reviews = pd.read_csv('webmd_reviews2.csv')

In [27]:
reviews.sample(10)

Unnamed: 0,Rdate,comment,condition,drug,easeofuse,effectiveness,helpful,reviewer,satisfaction
4320,2/5/2009 10:15:18 PM,"I have been on Byetta for one week now, and si...",Additional Medication for Diabetes Type 2,Byetta subcutaneous,3,4,1,"masuccij, 45-54 Male on Treatment for 5 to le...",3
3687,6/4/2013 11:08:03 AM,my type 2 diabetes has remained relatively unc...,Type 2 Diabetes Mellitus,Actos oral,3,2,16,"james, 75 or over Male on Treatment for 2 to ...",2
2332,9/21/2007 1:06:43 AM,does it cause gas..im having a gas problem..i ...,Type 2 Diabetes Mellitus,Avandamet oral,5,5,10,"smichel6112, 55-64 Female on Treatment for 2 ...",5
1210,9/2/2009 6:13:28 PM,Good reactions -generally--\r\nvery responsive...,Type 2 Diabetes Mellitus,metformin oral,5,4,2,55-64 Female on Treatment for 2 to less than ...,4
4218,4/14/2008 7:11:22 PM,DX with Type 2 diabetes. On Byetta for 16 mon...,Additional Medication for Diabetes Type 2,Byetta subcutaneous,5,5,18,75 or over Female on Treatment for 1 to 6 mon...,5
2120,5/17/2013 11:17:37 AM,I used this medicine for three months. My su...,Type 2 Diabetes Mellitus,Jentadueto oral,4,4,6,"MommE2, 45-54 Female on Treatment for less th...",3
2373,1/8/2016 10:18:32 AM,Started Trulicity October 2015. A1C was 12 and...,Type 2 Diabetes Mellitus,Trulicity subcutaneous,1,1,30,"Jltc, 55-64 Female on Treatment for less than...",1
1408,7/12/2010 8:33:18 AM,"It does help lower blood sugar, however, I get...",Type 2 Diabetes Mellitus,metformin oral,4,3,5,"gego, 65-74 Female on Treatment for 2 to less...",2
1732,4/7/2013 3:24:43 PM,Started taking this in July of 12. 1000 mg twi...,Type 2 Diabetes Mellitus,metformin oral,5,3,46,,3
4984,4/16/2009 9:28:14 AM,no fuss-no muss-easey to use and no problems,Type 2 Diabetes Mellitus,glipizide oral,5,5,4,"teachmjk, 45-54 Female on Treatment for 1 to ...",5


In [28]:
reviews.isnull().sum()

Rdate              0
comment            9
condition          0
drug               0
easeofuse          0
effectiveness      0
helpful            0
reviewer         253
satisfaction       0
dtype: int64

In [29]:
reviews.describe()

| | easeofuse | effectiveness | helpful | satisfaction |
|---|---|---|---|---|
| count | 5495.0 | 5495.0 | 5495.0 | 5495.0 |
| mean | 4.055505 | 3.442038 | 9.202002 | 3.052957 |
| std | 1.246186 | 1.42596 | 9.672916 | 1.596296 |
| min | 1.0 | 1.0 | 0.0 | 1.0 |
| 25% | 3.0 | 2.0 | 3.0 | 1.0 |
| 50% | 5.0 | 4.0 | 7.0 | 3.0 |
| 75% | 5.0 | 5.0 | 13.0 | 5.0 |
| max | 5.0 | 5.0 | 79.0 | 5.0 |


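This looks more like it! Null comments are down from 4,520 to 9, and 'helpful' now ranges up to 79 instead of being capped at 9.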
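Checks like these can be collected into a small function so that any future re-crawl is validated the same way. This is a sketch only: the function name, the 5% null threshold, and the toy frames are assumptions made for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

RATING_COLUMNS = ["easeofuse", "effectiveness", "satisfaction"]

def qa_problems(df, max_null_comment_share=0.05):
    """Return a list of QA failures; an empty list means every check passed."""
    problems = []
    for col in RATING_COLUMNS:
        # Ratings are expected to fall between 1 and 5, inclusive.
        if not df[col].between(1, 5).all():
            problems.append(f"{col} has values outside 1-5")
    # 'helpful' is an upvote counter, so it should never be negative.
    if (df["helpful"] < 0).any():
        problems.append("helpful has negative counts")
    # The threshold is an assumed tolerance for genuinely blank comments.
    null_share = df["comment"].isnull().mean()
    if null_share > max_null_comment_share:
        problems.append(f"{null_share:.0%} of comments are null")
    return problems

# Toy frames for illustration only, not rows from the scrape.
good = pd.DataFrame({
    "easeofuse": [5, 3], "effectiveness": [4, 2], "satisfaction": [5, 1],
    "helpful": [16, 0], "comment": ["works well", "no change"],
})
bad = good.assign(comment=[np.nan, np.nan], satisfaction=[5, 7])

print(qa_problems(good))
print(qa_problems(bad))
```

Calling `qa_problems(reviews)` after each load would flag a regression in the parser before any analysis is run on the data.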

#### Exploratory Data Analysis