# Data Exploration Notebook

---



#### Import Libraries

In [4]:
# Import General Libraries
import os
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Read in Data
Because this notebook runs on SageMaker, we download the data directly from the UCI archive. For convenience, we extract the files, move them into a new subdirectory, and delete the original zip file.

In [2]:
def download_and_format():
    '''
    Download and unzip the raw archive, remove the zip, create a subdirectory
    for the data, and move the extracted files into it. Returns two
    dataframes (train, test).
    '''
    # Shell escapes: these need wget and unzip on the host (available on SageMaker)
    !wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
    !unzip -o drugsCom_raw.zip
    os.remove('drugsCom_raw.zip')
    data_dir = 'data'
    # exist_ok avoids a FileExistsError when the cell is re-run
    os.makedirs(data_dir, exist_ok=True)
    Path('drugsComTrain_raw.tsv').rename(data_dir + '/drugsComTrain_raw.tsv')
    Path('drugsComTest_raw.tsv').rename(data_dir + '/drugsComTest_raw.tsv')
    train_data = pd.read_csv(data_dir + '/drugsComTrain_raw.tsv', sep='\t')
    test_data = pd.read_csv(data_dir + '/drugsComTest_raw.tsv', sep='\t')
    return train_data, test_data

train, test = download_and_format()

'wget' is not recognized as an internal or external command,
operable program or batch file.
unzip:  cannot find or open drugsCom_raw.zip, drugsCom_raw.zip.zip or drugsCom_raw.zip.ZIP.


FileNotFoundError: [WinError 2] The system cannot find the file specified: 'drugsCom_raw.zip'
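
The errors above come from running the cell on a machine without `wget` and `unzip`. A minimal sketch of a portable alternative, using only the standard library for the download and extraction, could look like the following. The function names `extract_tsvs` and `download_and_format_portable` are hypothetical, and the sketch assumes the two `.tsv` files sit at the root of the archive, as the original cell implies.

```python
import urllib.request
import zipfile
from pathlib import Path

import pandas as pd

# URL copied from the download cell above.
DATA_URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
            "00462/drugsCom_raw.zip")


def extract_tsvs(zip_path, data_dir="data"):
    """Extract every .tsv member of zip_path into data_dir and return the paths."""
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # safe to re-run
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".tsv"):
                archive.extract(name, out_dir)
                extracted.append(out_dir / name)
    return sorted(extracted)


def download_and_format_portable(url=DATA_URL, data_dir="data"):
    """Download the archive, unpack it, delete the zip, and load both frames."""
    zip_path = Path("drugsCom_raw.zip")
    urllib.request.urlretrieve(url, zip_path)
    extract_tsvs(zip_path, data_dir)
    zip_path.unlink()  # remove the zip once the files are extracted
    # Assumption: both files are at the archive root, as in the original cell.
    train = pd.read_csv(Path(data_dir) / "drugsComTrain_raw.tsv", sep="\t")
    test = pd.read_csv(Path(data_dir) / "drugsComTest_raw.tsv", sep="\t")
    return train, test

# Usage (needs network access):
# train, test = download_and_format_portable()
```

Splitting extraction into its own function keeps the network call separate, so the unpacking step can be checked against a local zip file.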

In [12]:
train_data = pd.read_csv('drugsComTrain_raw.tsv', sep = '\t')
test_data = pd.read_csv('drugsComTest_raw.tsv', sep = '\t')

## Initial Data Exploration
We begin by looking at the data and assessing its form: we check for missing values and extreme outliers, and compute summary statistics.

In [13]:
train_data.head()

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,"May 20, 2012",27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,"November 27, 2016",37


In [28]:
train_data.isna().sum()

Unnamed: 0       0
drugName         0
condition      899
review           0
rating           0
date             0
usefulCount      0
dtype: int64
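
The missing-value count above covers one of the checks named at the start of this section. The other two (summary statistics and extreme outliers) could be gathered in one place, as in the rough sketch below. The helper names `iqr_outlier_mask` and `overview` are hypothetical, and the 1.5 × IQR rule is an assumed outlier definition, not one the notebook specifies.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


def overview(df, numeric_cols):
    """Collect missing-value counts, summary statistics, and outlier counts."""
    return {
        "missing": df.isna().sum(),
        "stats": df[numeric_cols].describe(),
        "outliers": pd.Series(
            {col: int(iqr_outlier_mask(df[col]).sum()) for col in numeric_cols}
        ),
    }

# Usage on the frame loaded above:
# report = overview(train_data, ["rating", "usefulCount"])
# report["outliers"]
```

For a bounded score such as `rating`, an IQR rule may flag legitimately low ratings, so the counts are a prompt for inspection rather than a list of rows to drop.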

In [31]:
# Open question: rows with a missing condition still have a review and rating,
# so perhaps drop them for exploration only and keep them for training.
train_data[train_data.condition.isna()].head()

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
30,51452,Azithromycin,,"""Very good response. It is so useful for me. """,10.0,"August 18, 2010",1
148,61073,Urea,,"""Accurate information.""",10.0,"July 2, 2011",13
488,132651,Doxepin,,"""So far so good. Good for me and I can take it...",10.0,"October 20, 2010",25
733,44297,Ethinyl estradiol / norgestimate,,"""I haven&#039;t been on it for a long time and...",8.0,"January 24, 2011",1
851,68697,Medroxyprogesterone,,"""I started the shot in July 2015 and ended in ...",6.0,"March 23, 2017",1
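
One way to act on the idea in the comment above is to explore a filtered copy while leaving the full frame available for training. This is only a sketch of that option: the helper name `exploration_view` is hypothetical, and whether these rows should stay in the training data remains an open question.

```python
import pandas as pd


def exploration_view(df, column="condition"):
    """Return a copy without rows where `column` is NaN; `df` is left untouched."""
    return df.dropna(subset=[column]).copy()

# Usage on the frame loaded above:
# explore = exploration_view(train_data)   # rows with a known condition only
# len(train_data) - len(explore)           # number of rows set aside
```

Returning a copy means later edits to the exploration frame cannot leak back into `train_data`.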


In [25]:
# Some conditions hold scraped markup instead of a condition name.
# Open question: remove every condition of the form '<number></span> users found this comment helpful.'?
train_data[train_data.condition == '12</span> users found this comment helpful.'].head()

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
6652,172053,Amitiza,12</span> users found this comment helpful.,"""I was extremely constipated and no one seems ...",9.0,"December 26, 2011",12
13189,78220,Zyprexa,12</span> users found this comment helpful.,"""My mom has suffered from schizophrenia for ov...",9.0,"September 11, 2009",12
16448,131490,Generess Fe,12</span> users found this comment helpful.,"""I was on it for 7 months and it was my first ...",4.0,"October 12, 2012",12
19620,204964,Toradol,12</span> users found this comment helpful.,"""I had to go to the ER at 3am with intense pai...",9.0,"July 30, 2015",12
26370,125477,Dulcolax,12</span> users found this comment helpful.,"""It worked just fine. Took one and 12 hours la...",7.0,"February 19, 2012",12
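
The cell above matches only the `12</span>` variant. To catch any number, a regular expression could be used, as in the sketch below. The pattern is inferred from the rows shown here and may miss other malformed values; `ARTIFACT_PATTERN` and `artifact_mask` are hypothetical names.

```python
import pandas as pd

# Inferred from the rows above: digits, a stray closing span tag, then fixed text.
ARTIFACT_PATTERN = r"^\d+</span> users found this comment helpful\.$"


def artifact_mask(conditions):
    """Return True where a condition value is scraped markup, not a condition name."""
    # na=False treats missing conditions as non-matching instead of propagating NaN.
    return conditions.str.contains(ARTIFACT_PATTERN, regex=True, na=False)

# Usage on the frame loaded above:
# mask = artifact_mask(train_data["condition"])
# mask.sum()                    # how many rows are affected
# cleaned = train_data[~mask]   # one option: drop the affected rows
```

Dropping the rows is only one option. Since the reviews and ratings in these rows look valid, setting the condition to missing and keeping the row may suit training better.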
