# <font color='violet'>Data Wrangling</font>

Before working with psychedelic experience reports, I need to build a predictive model from labeled data from formal studies of the effects of prescription psychiatric medications. Results from several studies are combined here into one large dataset for modeling. The data comes from the following sources:

- https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Druglib.com%29
- https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29
- https://www.askapatient.com/store/#!/Psytar-Data-Set/p/449080512/category=129206256


In [1]:
# ! pip install pandas

In [2]:
import pandas as pd

In [3]:
# Load the train and test sets from drugs.com (concatenated in the next cell)
drugs_dotcom_train = pd.read_csv('../data/raw/drugsComTrain_raw.tsv', sep='\t')
drugs_dotcom_test = pd.read_csv('../data/raw/drugsComTest_raw.tsv', sep='\t')

print(drugs_dotcom_test.columns, drugs_dotcom_train.columns, drugs_dotcom_test.shape, 
      drugs_dotcom_train.shape)

Index(['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date',
       'usefulCount'],
      dtype='object') Index(['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date',
       'usefulCount'],
      dtype='object') (53766, 7) (161297, 7)


In [4]:
drugs_dotcom = pd.concat([drugs_dotcom_test, drugs_dotcom_train])
drugs_dotcom.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 215063 entries, 0 to 161296
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Unnamed: 0   215063 non-null  int64  
 1   drugName     215063 non-null  object 
 2   condition    213869 non-null  object 
 3   review       215063 non-null  object 
 4   rating       215063 non-null  float64
 5   date         215063 non-null  object 
 6   usefulCount  215063 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 13.1+ MB
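
Each file keeps its own 0-based row labels, so the concatenated frame repeats labels (the index above runs only to 161296 for 215063 rows). A minimal sketch with toy frames (hypothetical values, not the real data) of how `ignore_index=True` would give unique labels instead:

```python
import pandas as pd

# Toy stand-ins for the test and train files (hypothetical values)
test_part = pd.DataFrame({"drugName": ["A", "B"], "rating": [9.0, 4.0]})
train_part = pd.DataFrame({"drugName": ["C", "D", "E"], "rating": [7.0, 1.0, 10.0]})

# Default concat keeps each frame's own labels, so labels 0 and 1 appear twice
stacked = pd.concat([test_part, train_part])
print(stacked.index.is_unique)

# ignore_index=True relabels the rows 0..n-1
relabeled = pd.concat([test_part, train_part], ignore_index=True)
print(relabeled.index.is_unique)
```

Repeated labels are harmless for `info()`, but label-based selection such as `.loc[0]` would return more than one row.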


In [5]:
# Do the same with data from druglib
druglib_train = pd.read_csv('../data/raw/drugLibTrain_raw.tsv', sep='\t')
druglib_test = pd.read_csv('../data/raw/drugLibTest_raw.tsv', sep='\t')

print(druglib_train.columns, druglib_test.columns, druglib_train.shape, druglib_test.shape)

Index(['Unnamed: 0', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects',
       'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview'],
      dtype='object') Index(['Unnamed: 0', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects',
       'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview'],
      dtype='object') (3107, 9) (1036, 9)


In [6]:
druglib = pd.concat([druglib_train, druglib_test])
druglib.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4143 entries, 0 to 1035
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         4143 non-null   int64 
 1   urlDrugName        4143 non-null   object
 2   rating             4143 non-null   int64 
 3   effectiveness      4143 non-null   object
 4   sideEffects        4143 non-null   object
 5   condition          4142 non-null   object
 6   benefitsReview     4143 non-null   object
 7   sideEffectsReview  4141 non-null   object
 8   commentsReview     4135 non-null   object
dtypes: int64(2), object(7)
memory usage: 323.7+ KB


In [7]:
# Finally, import from psytar
psytar = pd.read_csv('../data/raw/PsyTAR_dataset_samples.csv')
psytar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   index            891 non-null    int64  
 1   comment_index    891 non-null    int64  
 2   comment_id       891 non-null    int64  
 3   drug_id          891 non-null    object 
 4   rating           891 non-null    int64  
 5   disorder         891 non-null    object 
 6   side-effect      877 non-null    object 
 7   comment          768 non-null    object 
 8   gender           881 non-null    object 
 9   age              879 non-null    float64
 10  dosage_duration  888 non-null    object 
 11  date             891 non-null    object 
 12  category         891 non-null    object 
dtypes: float64(1), int64(4), object(8)
memory usage: 90.6+ KB


**<font color='violet'>Find a way to combine datasets</font>**

In [8]:
print(drugs_dotcom.columns)
print(druglib.columns)

Index(['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date',
       'usefulCount'],
      dtype='object')
Index(['Unnamed: 0', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects',
       'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview'],
      dtype='object')
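
The overlap between the two column sets can also be computed rather than read off by eye. A small sketch using the column names printed above; the variable names are illustrative only:

```python
import pandas as pd

# Column names copied from the printed output above
dotcom_cols = pd.Index(['Unnamed: 0', 'drugName', 'condition', 'review',
                        'rating', 'date', 'usefulCount'])
druglib_cols = pd.Index(['Unnamed: 0', 'urlDrugName', 'rating', 'effectiveness',
                         'sideEffects', 'condition', 'benefitsReview',
                         'sideEffectsReview', 'commentsReview'])

# Names present in both tables, and names unique to each
shared = dotcom_cols.intersection(druglib_cols)
only_dotcom = dotcom_cols.difference(druglib_cols)
only_druglib = druglib_cols.difference(dotcom_cols)

print(shared)
print(only_dotcom)
print(only_druglib)
```

Matching is by exact name, so `drugName` and `urlDrugName` show up as unshared even though they hold the same kind of information.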


In [9]:
# Both tables hold a drug name (under different column names), a condition, and a rating.
# Perhaps one of druglib's review cols pairs well with drugs.com's review col
drugs_dotcom.head()

| | Unnamed: 0 | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|---|
| 0 | 163740 | Mirtazapine | Depression | "I&#039;ve tried a few antidepressants over th... | 10.0 | February 28, 2012 | 22 |
| 1 | 206473 | Mesalamine | Crohn's Disease, Maintenance | "My son has Crohn&#039;s disease and has done ... | 8.0 | May 17, 2009 | 17 |
| 2 | 159672 | Bactrim | Urinary Tract Infection | "Quick reduction of symptoms" | 9.0 | September 29, 2017 | 3 |
| 3 | 39293 | Contrave | Weight Loss | "Contrave combines drugs that were used for al... | 9.0 | March 5, 2017 | 35 |
| 4 | 97768 | Cyclafem 1 / 35 | Birth Control | "I have been on this birth control for one cyc... | 9.0 | October 22, 2015 | 4 |
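
The drugs.com reviews shown above contain HTML entities such as `&#039;` and are wrapped in literal double quotes. A minimal cleaning sketch using the standard-library `html.unescape`; the sample strings are hypothetical, written in the style of the rows above, and the sketch assumes no missing reviews:

```python
import html
import pandas as pd

# Hypothetical strings in the style of the drugs.com reviews
reviews = pd.Series(['"I&#039;ve felt better since starting"',
                     '"Quick reduction of symptoms"'])

# Decode HTML entities, then drop the literal double quotes wrapping each review
cleaned = reviews.map(html.unescape).str.strip('"')
print(cleaned.tolist())
```

On the real data the same expression would be applied to the `review` column before any text modeling.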


In [10]:
druglib.head()

| | Unnamed: 0 | urlDrugName | rating | effectiveness | sideEffects | condition | benefitsReview | sideEffectsReview | commentsReview |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2202 | enalapril | 4 | Highly Effective | Mild Side Effects | management of congestive heart failure | slowed the progression of left ventricular dys... | cough, hypotension , proteinuria, impotence , ... | monitor blood pressure , weight and asses for ... |
| 1 | 3117 | ortho-tri-cyclen | 1 | Highly Effective | Severe Side Effects | birth prevention | Although this type of birth control has more c... | Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon... | I Hate This Birth Control, I Would Not Suggest... |
| 2 | 1146 | ponstel | 10 | Highly Effective | No Side Effects | menstrual cramps | I was used to having cramps so badly that they... | Heavier bleeding and clotting than normal. | I took 2 pills at the onset of my menstrual cr... |
| 3 | 3947 | prilosec | 3 | Marginally Effective | Mild Side Effects | acid reflux | The acid reflux went away for a few months aft... | Constipation, dry mouth and some mild dizzines... | I was given Prilosec prescription at a dose of... |
| 4 | 1951 | lyrica | 2 | Marginally Effective | Severe Side Effects | fibromyalgia | I think that the Lyrica was starting to help w... | I felt extremely drugged and dopey. Could not... | See above |


In [11]:
# "commentsReview" appears to be the general product review in druglib,
# so it can be aligned with "review" in drugs.com.
druglib = druglib.rename(columns={'urlDrugName': 'drugName', 'commentsReview': 'review'})

# Note: with no `on` argument, merge joins on every column name the two frames share.
druglib_dotcom = pd.merge(left=druglib, right=drugs_dotcom, how='outer')
druglib_dotcom.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 219206 entries, 0 to 219205
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Unnamed: 0         219206 non-null  int64  
 1   drugName           219206 non-null  object 
 2   rating             219206 non-null  float64
 3   effectiveness      4143 non-null    object 
 4   sideEffects        4143 non-null    object 
 5   condition          218011 non-null  object 
 6   benefitsReview     4143 non-null    object 
 7   sideEffectsReview  4141 non-null    object 
 8   review             219198 non-null  object 
 9   date               215063 non-null  object 
 10  usefulCount        215063 non-null  float64
dtypes: float64(2), int64(1), object(8)
memory usage: 20.1+ MB
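
The row count above (219206 = 4143 + 215063) indicates that no rows matched across the two sources, so the outer merge behaved like stacking. A sketch with toy frames (hypothetical values) comparing an outer merge with `pd.concat`, which states that intent directly and never collapses rows:

```python
import pandas as pd

left = pd.DataFrame({"drugName": ["a", "b"], "rating": [4, 10],
                     "effectiveness": ["High", "Low"]})
right = pd.DataFrame({"drugName": ["c"], "rating": [9.0],
                      "date": ["May 17, 2009"]})

# No row of `right` matches `left` on the shared columns (drugName, rating),
# so the outer merge and the concat both yield 3 rows with the same columns
merged = pd.merge(left=left, right=right, how="outer")
stacked = pd.concat([left, right], ignore_index=True)
print(len(merged), len(stacked))

# If a row does match on every shared column, merge folds it into one row,
# while concat keeps both copies
twin = pd.DataFrame({"drugName": ["a"], "rating": [4], "date": ["June 1, 2010"]})
merged_twin = pd.merge(left=left, right=twin, how="outer")
stacked_twin = pd.concat([left, twin], ignore_index=True)
print(len(merged_twin), len(stacked_twin))
```

Which behavior is wanted depends on whether identical reviews across sources should count once or twice.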


In [12]:
# What about the third dataset?
psytar.head()

| | index | comment_index | comment_id | drug_id | rating | disorder | side-effect | comment | gender | age | dosage_duration | date | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 372 | 1 | lexapro.1 | 1 | depression and anxiety | extreme weight gain, short-term memory loss, h... | I am detoxing from Lexapro now. I slowly cut m... | F | 39.0 | 5 years20mg 1X D | 2011-02-21 0:00:00 | ssri |
| 1 | 2 | 4 | 2 | lexapro.2 | 1 | depression | COMPLETELY DESTROYED SEXUALLY FUNCTIONING - EV... | Just TWO tablets of Lexapro 10mg completely de... | M | 40.0 | 2 days10mg 1X D | 2016-08-21 0:00:00 | ssri |
| 2 | 3 | 419 | 3 | lexapro.3 | 1 | depression | Nausea, Blurred Vision, 3 to 5 hours sleep, Su... | Be careful with this medication. This was my ... | M | 50.0 | 2 days10mg 1X D | 2010-10-04 0:00:00 | ssri |
| 3 | 4 | 1305 | 4 | lexapro.4 | 1 | severe gad, minor depression, etc | Plenty! First 10 days were HORRIBLE, like a lo... | It didn't help me out, at all. My anxiety is w... | M | 20.0 | 7 weeks | 2007-07-05 0:00:00 | ssri |
| 4 | 5 | 909 | 5 | lexapro.5 | 1 | depression, anxiety | Chronic cough, weight gain, no sexual interest... | I would not suggest taking this medication. I ... | F | 43.0 | 2 months | 2008-10-04 0:00:00 | ssri |


In [13]:
psytar.columns

Index(['index', 'comment_index', 'comment_id', 'drug_id', 'rating', 'disorder',
       'side-effect', 'comment', 'gender', 'age', 'dosage_duration', 'date',
       'category'],
      dtype='object')

In [14]:
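# Align psytar's column names with the combined drugs.com/druglib table.
# Caveats: 'drug_id' values carry a suffix (e.g. 'lexapro.1'), and psytar's free-text
# 'side-effect' lands in 'sideEffects', which holds categorical labels in druglib.
psytar = psytar.rename(columns={'drug_id': 'drugName', 'disorder': 'condition',
                                'side-effect': 'sideEffects', 'comment': 'review'})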

psytar.columns

Index(['index', 'comment_index', 'comment_id', 'drugName', 'rating',
       'condition', 'sideEffects', 'review', 'gender', 'age',
       'dosage_duration', 'date', 'category'],
      dtype='object')

In [15]:
df = pd.merge(left=druglib_dotcom, right=psytar, how='outer')
df.head(1)

| | Unnamed: 0 | drugName | rating | effectiveness | sideEffects | condition | benefitsReview | sideEffectsReview | review | date | usefulCount | index | comment_index | comment_id | gender | age | dosage_duration | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2202.0 | enalapril | 4.0 | Highly Effective | Mild Side Effects | management of congestive heart failure | slowed the progression of left ventricular dys... | cough, hypotension , proteinuria, impotence , ... | monitor blood pressure , weight and asses for ... | | | | | | | | | |


In [16]:
df = df.drop(columns=['Unnamed: 0','index', 'comment_index', 'comment_id', 'usefulCount'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 220097 entries, 0 to 220096
Data columns (total 13 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   drugName           220097 non-null  object 
 1   rating             220097 non-null  float64
 2   effectiveness      4143 non-null    object 
 3   sideEffects        5020 non-null    object 
 4   condition          218902 non-null  object 
 5   benefitsReview     4143 non-null    object 
 6   sideEffectsReview  4141 non-null    object 
 7   review             219966 non-null  object 
 8   date               215954 non-null  object 
 9   gender             881 non-null     object 
 10  age                879 non-null     float64
 11  dosage_duration    888 non-null     object 
 12  category           891 non-null     object 
dtypes: float64(2), object(11)
memory usage: 23.5+ MB
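
The non-null counts above follow the source of each row: `effectiveness` is filled only for the 4143 druglib rows, and `gender` only for psytar rows. The combined frame itself does not record where a row came from. One possible way to keep that information, sketched on toy frames with a hypothetical `source` column that is not part of the original notebook:

```python
import pandas as pd

# Toy stand-ins for two of the sources (hypothetical values)
druglib_toy = pd.DataFrame({"drugName": ["a", "b"], "effectiveness": ["High", "Low"]})
psytar_toy = pd.DataFrame({"drugName": ["c"], "gender": ["F"]})

# Tag each frame with its origin before stacking
combined = pd.concat(
    [druglib_toy.assign(source="druglib"), psytar_toy.assign(source="psytar")],
    ignore_index=True,
)

# Non-null counts per column within each source
coverage = combined.groupby("source").count()
print(coverage)
```

A per-source coverage table like this makes it easy to see which columns are usable across the whole dataset and which exist for one source only.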


In [17]:
# Rename to shorter snake_case labels; every target name must be unique.
df = df.rename(columns={'drugName': 'drug', 'sideEffects': 'side_effects',
                        'benefitsReview': 'benefits',
                        'sideEffectsReview': 'side_effects_review'})
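
Renaming two source columns to the same label leaves a frame with duplicate column names, and selecting that label then returns a DataFrame rather than a Series. A quick check, sketched on a toy frame; the name `side_effects_review` is just one possible distinct label:

```python
import pandas as pd

toy = pd.DataFrame({"sideEffects": ["Mild Side Effects"],
                    "sideEffectsReview": ["dry mouth"]})

# Both columns mapped to one label: the frame now has duplicate names
clashing = toy.rename(columns={"sideEffects": "side_effects",
                               "sideEffectsReview": "side_effects"})
print(clashing.columns.duplicated().any())
print(type(clashing["side_effects"]).__name__)

# Distinct target labels keep the columns separately addressable
distinct = toy.rename(columns={"sideEffects": "side_effects",
                               "sideEffectsReview": "side_effects_review"})
print(distinct.columns.is_unique)
```

Asserting `df.columns.is_unique` after any bulk rename is a cheap guard against this.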