# <font color='violet'> Further Cleaning of Duplicate Reviews
Using prescription drug reviews initially wrangled here: https://github.com/fractaldatalearning/psychedelic_efficacy/blob/main/notebooks/1-kl-wrangle-tabular.ipynb

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/interim/studies_initial_cleaning.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50637 entries, 0 to 50636
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  50637 non-null  int64  
 1   drug        50637 non-null  object 
 2   rating      50637 non-null  float64
 3   condition   50637 non-null  object 
 4   review      50637 non-null  object 
 5   date        50637 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 2.3+ MB


In [3]:
# Drop "Unnamed" column; it's redundant with the index
df = df.drop(columns=['Unnamed: 0'])
df.head(2)

Unnamed: 0,drug,rating,condition,review,date
0,vyvanse,9.0,add,I had began taking 20mg of Vyvanse for three m...,0
1,dextroamphetamine,8.0,add,Switched from Adderall to Dexedrine to compare...,0


During EDA, I discovered that many reviews are duplicated. It seems that one person may have written a single long review covering all of their drugs and entered it multiple times, with a different drug and rating each time. Is this behavior an outlier, or are there many examples like it?
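One way to get a feel for whether this is an outlier is to count how many rows share each review text. The following is a minimal sketch on made-up toy data (the `toy` frame and its values are assumptions for illustration, not rows from the real dataset):

```python
import pandas as pd

# Toy stand-in for the working dataframe; values are invented for illustration.
toy = pd.DataFrame({
    'drug':   ['paxil', 'paroxetine', 'xanax', 'ambien', 'zolpidem', 'lamictal'],
    'rating': [10.0, 10.0, 7.0, 1.0, 1.0, 9.0],
    'review': ['life changing', 'life changing', 'like a sledgehammer',
               'it is addictive', 'it is addictive', 'a life saver'],
})

# Number of rows sharing each review text.
rows_per_review = toy.groupby('review').size()

# Distribution: how many reviews appear once, twice, three times, ...
copies_distribution = rows_per_review.value_counts().sort_index()
print(copies_distribution)

# Share of rows whose review text also appears on at least one other row.
share_duplicated = toy['review'].duplicated(keep=False).mean()
print(f'{share_duplicated:.0%} of rows share their review with another row')
```

If most duplicated reviews appear exactly twice, that would point toward paired entries rather than a few people pasting one narrative many times.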

<font color='violet'> Decide what to do about duplicated reviews. 

In [4]:
df[df.review.duplicated()==True]

Unnamed: 0,drug,rating,condition,review,date
668,Quetiapine,9.0,depression,"""been great for me except for the weight gain ...","October 23, 2016"
686,Buprenorphine / naloxone,1.0,addiction,"""I was on suboxone strips which was working gr...","June 28, 2017"
732,Desvenlafaxine,4.0,anxiety,"""I am into my 4th week of Pristiq and it hasn&...","October 8, 2011"
816,Suboxone,9.0,addiction,"""My personal experience with suboxone is good ...","May 27, 2017"
821,Lorazepam,8.0,anxiety,"""Most subtle of the benzos i have tried. Made...","October 28, 2013"
...,...,...,...,...,...
50631,Geodon,3.0,bipolar,"""I was in a very bad place at the time I start...","July 25, 2016"
50632,Venlafaxine,9.0,anxiety,"""Had panic attacks and social anxiety starting...","November 10, 2016"
50634,Ativan,9.0,anxiety,"""I was super against taking medication. I&#039...","August 16, 2016"
50635,Fluoxetine,8.0,ocd,"""I have been off Prozac for about 4 weeks now....","January 21, 2015"


Many rows actually contain duplicate reviews, each connected with multiple different drugs. Did the data start out this way, or did I make an error during initial wrangling?

In [5]:
drugs_dotcom_train = pd.read_csv('../data/raw/drugsComTrain_raw.tsv', sep='\t')
drugs_dotcom_test = pd.read_csv('../data/raw/drugsComTest_raw.tsv', sep='\t')
druglib_train = pd.read_csv('../data/raw/drugLibTrain_raw.tsv', sep='\t')
druglib_test = pd.read_csv('../data/raw/drugLibTest_raw.tsv', sep='\t')
psytar = pd.read_csv('../data/raw/PsyTAR_dataset_samples.csv')

In [6]:
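# Make a function to help figure out what's going on
def inspect_duplicate_reviews(df, column):
    # Sort so that identical reviews sit next to each other
    df = df.sort_values(by=column)
    # Flag every repeat after the first occurrence
    is_duplicate = df[column].duplicated()
    # Print the total row count, then the number of duplicated rows
    print(len(df), is_duplicate.sum())
    return df[is_duplicate].head()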

# What my current working data looks like
inspect_duplicate_reviews(df, 'review')

50637 19078


Unnamed: 0,drug,rating,condition,review,date
38058,Paxil,10.0,depression,"""\r\nIn few words - Life changing\r\nAll nega...","April 3, 2016"
31832,Alprazolam,7.0,anxiety,"""\r\nxanax forums are full of how xanax can b...","December 11, 2015"
4706,Zolpidem,1.0,insomnia,""" I hate the doctors that prescribe ambien to...","March 28, 2017"
26943,Dronabinol,9.0,eating disorder,""" I have common variable immunodeficiency whic...","July 24, 2015"
38490,Lamictal,9.0,bipolar,""" I was diagnosed with bipolar 2 recently at 3...","April 5, 2017"


In [7]:
# Check out each of the other raw datasets
drugs_dotcom_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161297 entries, 0 to 161296
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Unnamed: 0   161297 non-null  int64  
 1   drugName     161297 non-null  object 
 2   condition    160398 non-null  object 
 3   review       161297 non-null  object 
 4   rating       161297 non-null  float64
 5   date         161297 non-null  object 
 6   usefulCount  161297 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 8.6+ MB


In [8]:
inspect_duplicate_reviews(drugs_dotcom_train, 'review')

161297 48968


Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
73400,124699,Tri-Previfem,Birth Control,"""\r\nFirst of all, the worst side effect for m...",1.0,"September 12, 2017",2
145940,37325,Vyvanse,ADHD,"""\r\nGood. Concentration, happy, easy to talk ...",5.0,"October 17, 2015",10
9906,148712,Mirena,Birth Control,"""\r\nI got tired of taking the pill so I figur...",3.0,"June 8, 2016",1
33031,39621,Contrave,Obesity,"""\r\nMost insurance companies won&#039;t pay f...",8.0,"March 14, 2016",15
30158,79026,Plan B One-Step,Emergency Contraception,"""\r\nMy bf and I had a condom break and I pani...",7.0,"June 30, 2017",2


In [9]:
# 30% of the original reviews from that set were duplicates. 
inspect_duplicate_reviews(drugs_dotcom_test, 'review')

53766 5486


Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
45612,113336,Bisacodyl,Constipation,"""\r\nHell no, never again! severe stomach cram...",1.0,"July 27, 2015",10
13632,88800,Necon 1 / 35,Endometriosis,""" I&#039;m on my 2nd round of necon 1/35. I st...",2.0,"July 8, 2016",5
10441,21780,Guaifenesin / pseudoephedrine,Cough and Nasal Congestion,""" It got rid of my cough but then made my nose...",1.0,"February 15, 2016",16
21080,142825,Levonorgestrel,Emergency Contraception,""" On May 18th this guy came completely inside ...",10.0,"July 5, 2017",6
10712,374,Medroxyprogesterone,Abnormal Uterine Bleeding,"""&quot;just stopped because I have been on it ...",10.0,"March 27, 2015",7


In [10]:
# 10% of drugs_dotcom_test was duplicates
druglib_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3107 entries, 0 to 3106
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         3107 non-null   int64 
 1   urlDrugName        3107 non-null   object
 2   rating             3107 non-null   int64 
 3   effectiveness      3107 non-null   object
 4   sideEffects        3107 non-null   object
 5   condition          3106 non-null   object
 6   benefitsReview     3107 non-null   object
 7   sideEffectsReview  3105 non-null   object
 8   commentsReview     3099 non-null   object
dtypes: int64(2), object(7)
memory usage: 218.6+ KB


In [11]:
inspect_duplicate_reviews(druglib_train, 'commentsReview')

3107 59


Unnamed: 0.1,Unnamed: 0,urlDrugName,rating,effectiveness,sideEffects,condition,benefitsReview,sideEffectsReview,commentsReview
1408,3024,cipro,10,Highly Effective,Mild Side Effects,rare kidney infection,My daughter is playing now finally and she see...,Blistering rash,.
249,1843,yasmin,10,Highly Effective,No Side Effects,birth control,"I've been on yasmin four years now, it works s...",,.
2282,1922,zithromax,8,Highly Effective,Mild Side Effects,sinusitis,"It is extremely, powerful antibiotic, which gi...",nausea,500 mg of azithromycin once in a day...for thr...
2575,1894,doxycycline,10,Highly Effective,No Side Effects,severe peridontal disease,"I had persistent periodontal problems, both ...",None.,A dentist in my dental HMO prescribed it and w...
439,1660,climara,10,Highly Effective,Mild Side Effects,menopausal,Climara patch almost completely stopped the se...,The only side effect from the Climara is mild ...,After being miserable with frequent (20+ times...


In [12]:
# Only about 2% of druglib_train comments (59 of 3107) were duplicates
psytar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   index            891 non-null    int64  
 1   comment_index    891 non-null    int64  
 2   comment_id       891 non-null    int64  
 3   drug_id          891 non-null    object 
 4   rating           891 non-null    int64  
 5   disorder         891 non-null    object 
 6   side-effect      877 non-null    object 
 7   comment          768 non-null    object 
 8   gender           881 non-null    object 
 9   age              879 non-null    float64
 10  dosage_duration  888 non-null    object 
 11  date             891 non-null    object 
 12  category         891 non-null    object 
dtypes: float64(1), int64(4), object(8)
memory usage: 90.6+ KB


In [13]:
inspect_duplicate_reviews(psytar, 'comment')

891 124


Unnamed: 0,index,comment_index,comment_id,drug_id,rating,disorder,side-effect,comment,gender,age,dosage_duration,date,category
236,237,1412,18,zoloft.18,1,depression,"Weight gain (20lbs.), no sexual feelings at al...","At first, I din't realize all of these side ef...",F,24.0,9 months,2003-09-01 0:00:00,ssri
444,445,2289,13,cymbalta.13,1,depression,"nonstop headache, constipation, racing thought...",Bad Drug!,F,53.0,2 weeks,2006-06-25 0:00:00,snri
7,8,156,8,lexapro.8,1,depression/ anxiety,Extreme Weight Gain 30 pounds,,M,16.0,1 years5 MG,2014-02-19 0:00:00,ssri
46,47,1793,47,lexapro.47,2,depression,weight gain,,F,27.0,4 months,2006-05-16 0:00:00,ssri
50,51,574,51,lexapro.51,2,depression,Problems with memory. Inability to focus/conce...,,F,66.0,7 weeks,2009-12-17 0:00:00,ssri


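This last raw dataset has about 14% duplicate values (124 of 891) but few rows overall. Note that `comment` has only 768 non-null values, and `duplicated()` counts repeated missing values as duplicates, so most of these are likely empty comments rather than repeated text.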
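To separate repeated text from repeated missing values, the duplicate check can be restricted to rows that actually contain a comment. A minimal sketch on a toy Series (the values are invented; only `duplicated()` and `dropna()` are real pandas calls):

```python
import numpy as np
import pandas as pd

# Toy comments column with both repeated text and missing values.
comments = pd.Series(['Bad Drug!', np.nan, 'Bad Drug!', np.nan, np.nan, 'helped a lot'])

# duplicated() treats missing values as equal to one another,
# so every NaN after the first one is flagged.
naive_count = comments.duplicated().sum()

# Restrict the check to rows that actually contain text.
text_only = comments.dropna()
text_count = text_only.duplicated().sum()

print(naive_count, text_count)
```

Applied to `psytar['comment']`, the second count would show how many of the flagged rows are genuinely repeated narratives.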

I went back to the wrangling notebook and don't see any errors that would have caused this. I think I didn't notice earlier because I expected duplicates in many of the columns (drug, condition) without that being a problem, and I only looked for and took care of completely duplicated rows. It didn't cross my mind that the review column specifically would have duplicates across multiple drugs.

There are enough duplicated reviews in the raw data to account for all the duplicates in my current dataframe. My best working hypothesis is that duplicate reviews appear more often with psych meds because people may cycle through many drugs and then write one long narrative to submit for each. Or perhaps they feel one way about a drug's effects and later go back to change their rating, which results in two rows that differ only by rating. I may need to inspect each set of duplicates more closely, find out which drugs each review is actually relevant to, and remove the remaining rows.
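The two hypotheses predict different patterns: one narrative pasted under several drugs, versus one drug with a changed rating. A minimal sketch, on invented toy data, of how each set of duplicates could be labeled by what differs between its rows (the `differs_by` column is a hypothetical name introduced here):

```python
import pandas as pd

# Toy data: one review repeated across drugs, one repeated with a new rating.
toy = pd.DataFrame({
    'drug':   ['paxil', 'paroxetine', 'vyvanse', 'vyvanse', 'xanax'],
    'rating': [10.0, 10.0, 9.0, 10.0, 7.0],
    'review': ['life changing', 'life changing',
               'helps me focus', 'helps me focus', 'like a sledgehammer'],
})

# Keep only rows whose review text appears more than once.
dupes = toy[toy['review'].duplicated(keep=False)]

# For each duplicated review, count distinct drugs and distinct ratings.
variation = dupes.groupby('review')[['drug', 'rating']].nunique()

# Label each set of duplicates by what differs between its rows.
variation['differs_by'] = 'nothing'
variation.loc[variation['drug'] > 1, 'differs_by'] = 'drug'
variation.loc[(variation['drug'] == 1) & (variation['rating'] > 1),
              'differs_by'] = 'rating only'
print(variation)
```

Counting the labels with `variation['differs_by'].value_counts()` would indicate which hypothesis explains more of the duplicates.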

<font color='violet'> Remove rows with irrelevant duplicated reviews

In [14]:
# Start with just one set of duplicates and see what I find.
df.head(8)

Unnamed: 0,drug,rating,condition,review,date
0,vyvanse,9.0,add,I had began taking 20mg of Vyvanse for three m...,0
1,dextroamphetamine,8.0,add,Switched from Adderall to Dexedrine to compare...,0
2,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0
3,saizen,8.0,fatigue,1 subcutaneous injection of somatropin in abdo...,0
4,zyprexa,3.0,dementia,Since many of these s/s are also s/s of the di...,0
5,vyvanse,10.0,add,I was diagnosed with ADD three years ago. Have...,0
6,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0
7,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0


It appears that somebody submitted the same review for vyvanse, dextroamphetamine, saizen, and zyprexa. And with vyvanse, they submitted it as being used to treat both add and adhd. And for add they gave it a rating of 9 with one submission and 10 with another. 

I can already see that this definitely pertains to vyvanse. Since the add ratings are ambiguous, I can just get rid of those and keep the row for adhd.

In [15]:
df = df.drop(labels=[0,5])
df.head(6)

Unnamed: 0,drug,rating,condition,review,date
1,dextroamphetamine,8.0,add,Switched from Adderall to Dexedrine to compare...,0
2,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0
3,saizen,8.0,fatigue,1 subcutaneous injection of somatropin in abdo...,0
4,zyprexa,3.0,dementia,Since many of these s/s are also s/s of the di...,0
6,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0
7,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0


In [16]:
# Take a closer look at the full review to see if it pertains to the other drugs.
df.review[1]

'Switched from Adderall to Dexedrine to compare the effects. Dexedrine is simply dextroamphetamine while Adderall is a mixture of Amphetamine salts. This might explain the increased effectiveness of the Dexedrine contrary to popular belief. I found it important to take several relatively low doses frequently to achieve a balanced effect. Dexerine IR tablets need to be taken more often than adderall and also wear off more abruptly. Generic tablets are not high quality and somewhat expensive due to the need for a large quantity of tablets. Smoother more gradual onset and effect than Adderall. Effective at controlling ADD symptoms previously controlled by Adderall however this drug seems more natural and transparent. Less tension and anxiety. Less "druggy" unnatural feelings and thoughts.'

In [17]:
# This only pertains to vyvanse. Drop other rows. 
df = df.drop(labels=[1,3,4])
df.head(2)

Unnamed: 0,drug,rating,condition,review,date
2,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0
6,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0


How many sets of duplicates will I need to work with? 

In [18]:
len(df[df.review.duplicated()==True]['review'].unique())

18992

There are so many sets of duplicates that I'm going to need some way to do automated/batch deletion.

One option is to group by review so that there is just one row per review, with its various drug/rating/condition combinations aggregated. Each set of duplicates could then be analyzed in batches, making it quicker to identify which values to keep or delete.
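A minimal sketch of that grouping idea, using named aggregation on invented toy data (the output column names `drugs`, `ratings`, `n_conditions`, and `n_rows` are hypothetical):

```python
import pandas as pd

# Toy stand-in for the working dataframe; values are invented.
toy = pd.DataFrame({
    'drug':      ['paxil', 'paroxetine', 'xanax'],
    'rating':    [10.0, 9.0, 7.0],
    'condition': ['depression', 'depression', 'anxiety'],
    'review':    ['life changing', 'life changing', 'like a sledgehammer'],
})

# Collapse to one row per review, collecting the values attached to it.
per_review = (
    toy.groupby('review')
       .agg(drugs=('drug', list),
            ratings=('rating', list),
            n_conditions=('condition', 'nunique'),
            n_rows=('drug', 'count'))
       .reset_index()
)
print(per_review)
```

Each row of `per_review` then describes one set of duplicates, which makes it easier to filter for sets that need a decision (for example, `per_review[per_review.n_rows > 1]`).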

In [19]:
# Create a column to record whether each row should be kept or deleted.
# Work until every row is filled with a value, then delete indicated rows.
df['keep'] = ''
df.head()

Unnamed: 0,drug,rating,condition,review,date,keep
2,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,
6,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,
7,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,
8,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,
9,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,


<font color='violet'> First, mark for keeping any non-duplicate reviews

In [20]:
df.loc[(df.review.duplicated(keep=False)==False),'keep'] = 'yes'
df[df.review.duplicated(keep=False)==False]

Unnamed: 0,drug,rating,condition,review,date,keep
2,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes
6,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes
7,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,yes
8,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,yes
9,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes
...,...,...,...,...,...,...
50614,Clonazepam,10.0,anxiety,"""Had terrible anxiety attacks .Have been on 0....","May 17, 2017",yes
50619,Buspirone,1.0,anxiety,"""Not good experience AT ALL. I Have anxiety an...","November 29, 2016",yes
50628,Lorazepam,8.0,anxiety,"""About 4 years ago I started having early-morn...","November 21, 2017",yes
50630,Hydroxyzine,10.0,sedation,"""Honestly , This works pretty well for me. It ...","September 13, 2017",yes
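The `keep=False` argument matters here: by default `duplicated()` leaves the first copy of each repeated review unflagged, so that first copy would be wrongly treated as unique. A small sketch of the difference on a toy Series (values invented):

```python
import pandas as pd

reviews = pd.Series(['a', 'b', 'a', 'c', 'a'])

# Default (keep='first'): the first occurrence of each value is NOT flagged.
later_copies = reviews.duplicated()

# keep=False: every member of a duplicated set is flagged.
any_copy = reviews.duplicated(keep=False)

# Reviews that appear exactly once are the ones safe to mark 'yes'.
unique_mask = ~any_copy

print(later_copies.tolist())
print(any_copy.tolist())
print(reviews[unique_mask].tolist())
```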


<font color='violet'> Mark for keeping any rows where the name of the drug is contained in the text of the review. 

In [21]:
grouped_df = df.groupby(['review', 'drug']).count()
grouped_df

Unnamed: 0_level_0,Unnamed: 1_level_0,rating,condition,date,keep
review,drug,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"This medication caused me to be nervous, tremble and I became slightly irritable. This medication enabled me to think clearly. My thought processes were more coherent. It assisted me in concentrating and therefore I was able to complete tasks --- finally had follow-thru! I was able to manage my time better.",ritalin-la,1,1,1,1
"""\r\n\r\n please tell the ones who is suffering from anxiety to use lavender chamomile spray by air wick. it gives immediate relief , doctors not letting know patients about this. please spread the word!!. Please keep this post here.""",Quetiapine,1,1,1,1
"""\r\nIn few words - Life changing\r\nAll negative thoughts change to positive ones\r\nAnd that is a life changing pill\r\nI was born depressed really ,with a wierd point of veiw, very critic of everybody, now I am a happy ordinary guy with a beautifull family\r\nSomething I could not have achieved with out paxil, I take one every day for 23 years\r\nThe effects goes out just after a glass of wine , so be carefull , quiting the pill in a fast way will cause &quot; rage&quot; yes , you would not recognized your self kind of rage so be carefull.\r\nYou ar""",Paroxetine,1,1,1,1
"""\r\nIn few words - Life changing\r\nAll negative thoughts change to positive ones\r\nAnd that is a life changing pill\r\nI was born depressed really ,with a wierd point of veiw, very critic of everybody, now I am a happy ordinary guy with a beautifull family\r\nSomething I could not have achieved with out paxil, I take one every day for 23 years\r\nThe effects goes out just after a glass of wine , so be carefull , quiting the pill in a fast way will cause &quot; rage&quot; yes , you would not recognized your self kind of rage so be carefull.\r\nYou ar""",Paxil,1,1,1,1
"""\r\nxanax forums are full of how xanax can be so strong for some people. to me it was like a sledgehammer over my head. I woke 6 hours later with no recollection of what went on other than I slept. this was a low dose. It is either panic or been knocked out. Where are all these pleasant experiences I hear about?""",Alprazolam,1,1,1,1
...,...,...,...,...,...
"patient taking one 36 mg tablet, 7 days a week, in the morning after breakfast. Tablet was a slow release lasting throughout the day. patient preformed with better concentration and focus skills. also helped out in after school homework. less trips from the study table, more time spent studying, and less time having to be reminded to get back to work. patient took this drug to help with concentration and focus problems occurring in a classroom environment with many distractions.",concerta,1,1,1,1
patient was to take one dose of concerta per day - earlier in the morning due to the time release effect. it was obvious while taking the medication that it was time release as each layer was released. dry mouth patient was able to focus more readily on tasks at hand. there was an increase in intensity/focus which was noticable upon taking the medication.,concerta,1,1,1,1
"rec'd speed but no focus. I took the dexerine in morning and it wears off at\r\r\nnight. I lost weight for the first time as the metabolism was great, sadly\r\r\ni gained 15 lbs off the same diet. insomnia, I got obsessive complusive and had to stop. I also think I have a bit\r\r\nof liver damage of taking meds over 10 years. I only took dexedrine for a few months. weight loss right away. I felt it working in brain right away, but lost the efficacy when I built my tolerance. My MD gave me more and the same thing happened.\r\r\n\r\r\nMeds don't cure add 100%, so now I take supplements.",dexedrine,1,1,1,1
simply remember to take medication weight loss and loss of appetite My depression has lifted and I am able to concentrate more than I was previuosly,dextroamphetamine,1,1,1,1


In [22]:
# grouped_df has a (review, drug) MultiIndex. Gather positions of the entries to keep.
grouped_df_indices_to_keep = []

# Check whether the review text (level 0) contains the drug name (level 1), ignoring case.
for row, (review, drug) in enumerate(grouped_df.index):
    if drug.lower() in review.lower():
        grouped_df_indices_to_keep.append(row)
        
grouped_df_indices_to_keep[:5]

[3, 5, 8, 13, 14]

In [23]:
len(grouped_df_indices_to_keep)

16564
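The same case-insensitive substring test could also be run directly on the rows of a dataframe, without grouping first. A minimal sketch on invented toy data (the `drug_in_review` column name is hypothetical):

```python
import pandas as pd

# Toy data: the same review filed under a brand name and a generic name.
toy = pd.DataFrame({
    'drug':   ['Paxil', 'Paroxetine', 'Xanax', 'Quetiapine'],
    'review': ['could not have achieved this without paxil',
               'could not have achieved this without paxil',
               'xanax forums are full of stories',
               'use lavender chamomile spray'],
})

# Case-insensitive substring test, one row at a time.
toy['drug_in_review'] = [
    drug.lower() in review.lower()
    for drug, review in zip(toy['drug'], toy['review'])
]
print(toy[['drug', 'drug_in_review']])
```

Exact substring matching has limits worth keeping in mind: a drug recorded as `wellbutrin-sr` will not match a review that only says "Wellbutrin", and a review that uses a brand name will not match its generic name.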

In [24]:
# It seems many rows should be kept. Check that this worked correctly.
grouped_df.index[1]

('"\r\n\r\n please tell the ones who is suffering from anxiety to use lavender chamomile spray by air wick.  it gives immediate relief , doctors not letting know patients about this. please spread the word!!.  Please keep this post here."    ',
 'Quetiapine')

In [25]:
# Position 1 is not in the keep list, and 'Quetiapine' indeed does not appear in that review.
# Isolate just the rows to keep
grouped_to_keep = pd.MultiIndex.to_frame(grouped_df.index[grouped_df_indices_to_keep])
grouped_to_keep.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,review,drug
review,drug,Unnamed: 2_level_1,Unnamed: 3_level_1
"""\r\nIn few words - Life changing\r\nAll negative thoughts change to positive ones\r\nAnd that is a life changing pill\r\nI was born depressed really ,with a wierd point of veiw, very critic of everybody, now I am a happy ordinary guy with a beautifull family\r\nSomething I could not have achieved with out paxil, I take one every day for 23 years\r\nThe effects goes out just after a glass of wine , so be carefull , quiting the pill in a fast way will cause &quot; rage&quot; yes , you would not recognized your self kind of rage so be carefull.\r\nYou ar""",Paxil,"""\r\nIn few words - Life changing\r\nAll nega...",Paxil
"""\r\nxanax forums are full of how xanax can be so strong for some people. to me it was like a sledgehammer over my head. I woke 6 hours later with no recollection of what went on other than I slept. this was a low dose. It is either panic or been knocked out. Where are all these pleasant experiences I hear about?""",Xanax,"""\r\nxanax forums are full of how xanax can b...",Xanax
""" I hate the doctors that prescribe ambien to patients like me. They are putting patent&#039;s life in danger. I know some people taking this pill during the day too and they take way too much . It is adicctive""",Ambien,""" I hate the doctors that prescribe ambien to...",Ambien
""" I have common variable immunodeficiency which causes anorexia. I found Marinol 5mg taken a couple hours before meals makes a huge difference. If I do not want to eat, ulcers etc, I found it easier with Marinol and have actually gained around 10lbs in the initial month or two..... I would recommend it for any patient with weight loss issues!""",Marinol,""" I have common variable immunodeficiency whic...",Marinol
""" I was diagnosed with bipolar 2 recently at 34 years old. I&#039;m also diagnosed with severe depression, generalized anxiety disorder, and PTSD. After being on what seems every antidepressant in the world, finally, a combination of Lamictal, Wellbutrin, and Xanax has been a life saver.I was prescribed the Lamictal a couple months ago. I noticed that after a couple weeks the Lamictal started to wear off a little and the depression started to appear again. I was started on 25 mg and then upped to 50 mg. A few days ago I noticed the depression again. I&#039;m going to ask my Dr. to up my dose again. Sorry for the long response lol. I just wanted to let anyone else know that I&#039;m going through a similar experience.""",Lamictal,""" I was diagnosed with bipolar 2 recently at 3...",Lamictal


In [26]:
grouped_to_keep = grouped_to_keep.reset_index(drop=True)
grouped_to_keep

Unnamed: 0,review,drug
0,"""\r\nIn few words - Life changing\r\nAll nega...",Paxil
1,"""\r\nxanax forums are full of how xanax can b...",Xanax
2,""" I hate the doctors that prescribe ambien to...",Ambien
3,""" I have common variable immunodeficiency whic...",Marinol
4,""" I was diagnosed with bipolar 2 recently at 3...",Lamictal
...,...,...
16559,already described above. I do not believe in ...,dexedrine
16560,"before ritalin, artificial stimulants were req...",ritalin
16561,overall I would definitely suggest vyvanse if ...,vyvanse
16562,patient was to take one dose of concerta per d...,concerta


In [27]:
# This is the correct number of rows for reviews that contain the drug name
# Add the keep column so that this df can be merged with the original df
grouped_to_keep['keep'] = 'yes'
grouped_to_keep.head()

Unnamed: 0,review,drug,keep
0,"""\r\nIn few words - Life changing\r\nAll nega...",Paxil,yes
1,"""\r\nxanax forums are full of how xanax can b...",Xanax,yes
2,""" I hate the doctors that prescribe ambien to...",Ambien,yes
3,""" I have common variable immunodeficiency whic...",Marinol,yes
4,""" I was diagnosed with bipolar 2 recently at 3...",Lamictal,yes


In [28]:
df = df.merge(right=grouped_to_keep, how='left', on=['review', 'drug'])
df

Unnamed: 0,drug,rating,condition,review,date,keep_x,keep_y
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes,yes
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes,yes
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,yes,
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,yes,
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes,yes
...,...,...,...,...,...,...,...
50627,Venlafaxine,9.0,anxiety,"""Had panic attacks and social anxiety starting...","November 10, 2016",,
50628,Vortioxetine,2.0,depression,"""This is the third med I&#039;ve tried for anx...","July 17, 2016",yes,
50629,Ativan,9.0,anxiety,"""I was super against taking medication. I&#039...","August 16, 2016",,yes
50630,Fluoxetine,8.0,ocd,"""I have been off Prozac for about 4 weeks now....","January 21, 2015",,


In [29]:
# This contains the correct number of rows to match the original df
# keep_y has the values I need for knowing which rows to keep so far

df = df.drop(columns=['keep_x'])
df.head()

Unnamed: 0,drug,rating,condition,review,date,keep_y
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes
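In the merged output, a row such as `ritalin-la` carries 'yes' in `keep_x` but nothing in `keep_y`, so dropping `keep_x` discards the marks set for non-duplicated reviews. If those marks are meant to be preserved, one option is to combine the two columns instead. This is only a sketch on invented toy rows, not the approach taken in this notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged frame; '' means "not yet decided" in keep_x.
merged = pd.DataFrame({
    'drug':   ['vyvanse', 'ritalin-la', 'Venlafaxine', 'Ativan'],
    'keep_x': ['yes', 'yes', '', ''],
    'keep_y': ['yes', np.nan, np.nan, 'yes'],
})

# A row is kept if either step marked it 'yes'; everything else stays unknown ('z').
either_yes = merged['keep_x'].eq('yes') | merged['keep_y'].eq('yes')
merged['keep'] = np.where(either_yes, 'yes', 'z')

# The two intermediate columns are no longer needed.
merged = merged.drop(columns=['keep_x', 'keep_y'])
print(merged)
```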


In [30]:
df = df.rename(columns={'keep_y':'keep'})
df.head()

Unnamed: 0,drug,rating,condition,review,date,keep
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes


In [31]:
# Fill na in keep column to make it easier to work with later.
df['keep'] = df.keep.fillna('z')
df.head()

Unnamed: 0,drug,rating,condition,review,date,keep
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,z
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,z
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes


Dig further into rows where the name of the drug is not in the review. This does not necessarily mean the review isn't applicable to the associated drug. But if a review mentions one drug by name, then copies of that same review attached to a different drug that it does not mention should be dropped.

<font color='violet'> Drop rows where text doesn't contain drug name but drug name is present in the same review for a different drug. 

In [32]:
no_drug_in_review = df.groupby(['review', 'keep']).count().sort_values(
    by=['review', 'keep'])
no_drug_in_review

Unnamed: 0_level_0,Unnamed: 1_level_0,drug,rating,condition,date
review,keep,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"This medication caused me to be nervous, tremble and I became slightly irritable. This medication enabled me to think clearly. My thought processes were more coherent. It assisted me in concentrating and therefore I was able to complete tasks --- finally had follow-thru! I was able to manage my time better.",z,1,1,1,1
"""\r\n\r\n please tell the ones who is suffering from anxiety to use lavender chamomile spray by air wick. it gives immediate relief , doctors not letting know patients about this. please spread the word!!. Please keep this post here.""",z,1,1,1,1
"""\r\nIn few words - Life changing\r\nAll negative thoughts change to positive ones\r\nAnd that is a life changing pill\r\nI was born depressed really ,with a wierd point of veiw, very critic of everybody, now I am a happy ordinary guy with a beautifull family\r\nSomething I could not have achieved with out paxil, I take one every day for 23 years\r\nThe effects goes out just after a glass of wine , so be carefull , quiting the pill in a fast way will cause &quot; rage&quot; yes , you would not recognized your self kind of rage so be carefull.\r\nYou ar""",yes,1,1,1,1
"""\r\nIn few words - Life changing\r\nAll negative thoughts change to positive ones\r\nAnd that is a life changing pill\r\nI was born depressed really ,with a wierd point of veiw, very critic of everybody, now I am a happy ordinary guy with a beautifull family\r\nSomething I could not have achieved with out paxil, I take one every day for 23 years\r\nThe effects goes out just after a glass of wine , so be carefull , quiting the pill in a fast way will cause &quot; rage&quot; yes , you would not recognized your self kind of rage so be carefull.\r\nYou ar""",z,1,1,1,1
"""\r\nxanax forums are full of how xanax can be so strong for some people. to me it was like a sledgehammer over my head. I woke 6 hours later with no recollection of what went on other than I slept. this was a low dose. It is either panic or been knocked out. Where are all these pleasant experiences I hear about?""",yes,1,1,1,1
...,...,...,...,...,...
"patient taking one 36 mg tablet, 7 days a week, in the morning after breakfast. Tablet was a slow release lasting throughout the day. patient preformed with better concentration and focus skills. also helped out in after school homework. less trips from the study table, more time spent studying, and less time having to be reminded to get back to work. patient took this drug to help with concentration and focus problems occurring in a classroom environment with many distractions.",z,1,1,1,1
patient was to take one dose of concerta per day - earlier in the morning due to the time release effect. it was obvious while taking the medication that it was time release as each layer was released. dry mouth patient was able to focus more readily on tasks at hand. there was an increase in intensity/focus which was noticable upon taking the medication.,yes,1,1,1,1
"rec'd speed but no focus. I took the dexerine in morning and it wears off at\r\r\nnight. I lost weight for the first time as the metabolism was great, sadly\r\r\ni gained 15 lbs off the same diet. insomnia, I got obsessive complusive and had to stop. I also think I have a bit\r\r\nof liver damage of taking meds over 10 years. I only took dexedrine for a few months. weight loss right away. I felt it working in brain right away, but lost the efficacy when I built my tolerance. My MD gave me more and the same thing happened.\r\r\n\r\r\nMeds don't cure add 100%, so now I take supplements.",yes,1,1,1,1
simply remember to take medication weight loss and loss of appetite My depression has lifted and I am able to concentrate more than I was previuosly,z,1,1,1,1


In [33]:
len(no_drug_in_review)

43133

There are fewer indices this time because some rows have multiple drugs aggregated within the 'z' row for a review. If a review has only unknown (z) keep values, it should remain unknown for now. But if there is a yes row for the review, then that review's z's should become no's. 

Specifically, identify reviews for rows to keep. Then, since yes comes before z in the sorting, the yes row is on top in each set of rows per review. So, the row directly below each yes row can be deleted, IF it has the same review. (If it doesn't have the same review, then it should remain unknown for now). 
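
The same rule (a z row becomes no when its review also has a yes row) can be sketched without relying on sort order. This is a minimal sketch on invented toy data, not the cell used in this notebook; the frame `toy` and its values are hypothetical, and only the column names `review`, `drug`, and `keep` follow the notebook.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only
toy = pd.DataFrame({
    "review": ["r1", "r1", "r2", "r3", "r3", "r3"],
    "drug":   ["a",  "b",  "c",  "d",  "e",  "f"],
    "keep":   ["yes", "z", "z",  "z",  "yes", "z"],
})

# True for every row whose review has at least one 'yes' row
has_yes = toy["keep"].eq("yes").groupby(toy["review"]).transform("any")

# Unknown rows that share a review with a 'yes' row become 'no';
# reviews with only 'z' rows are left untouched
toy.loc[has_yes & toy["keep"].eq("z"), "keep"] = "no"
print(toy)
```

Because the mask is computed per review, it does not matter where the yes row falls within its group.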

In [34]:
indices_to_drop = []

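for idx in range(len(no_drug_in_review) - 1):
    # A 'yes' row sorts directly above the 'z' row for the same review. If the next
    # index entry has the same review text, mark that next row for dropping.
    # Stopping one short of the end keeps idx+1 in bounds.
    if (no_drug_in_review.index[idx][1] == 'yes' and no_drug_in_review.index[idx][0] ==
        no_drug_in_review.index[idx+1][0]):
        indices_to_drop.append(idx+1)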

indices_to_drop[:5]

[3, 5, 9, 13, 15]

In [35]:
len(indices_to_drop)

11579

In [36]:
# Confirm this worked correctly
no_drug_in_review.index[1]

('"\r\n\r\n please tell the ones who is suffering from anxiety to use lavender chamomile spray by air wick.  it gives immediate relief , doctors not letting know patients about this. please spread the word!!.  Please keep this post here."    ',
 'z')

In [37]:
no_drug_in_review.index[2]

('"\r\nIn few words  - Life changing\r\nAll negative thoughts change to positive ones\r\nAnd that is a life changing pill\r\nI was born depressed really ,with a wierd point of veiw, very critic of everybody, now I am a happy ordinary guy with a beautifull family\r\nSomething I could not have achieved with out paxil, I take one every day for 23 years\r\nThe effects goes out just after a glass of wine , so be carefull , quiting the pill in a fast way will cause  &quot; rage&quot; yes , you would not recognized your self kind of rage so be carefull.\r\nYou ar"    ',
 'yes')

In [38]:
# This worked correctly. Index 2 is labeled yes to keep, and index 3 (the first entry in
# indices_to_drop) is the z row directly below it with the same review.
# Now, isolate the rows to drop.

un_reviewed_to_drop = no_drug_in_review.index[indices_to_drop].to_frame()
un_reviewed_to_drop.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,review,keep
review,keep,Unnamed: 2_level_1,Unnamed: 3_level_1
"""\r\nIn few words - Life changing\r\nAll negative thoughts change to positive ones\r\nAnd that is a life changing pill\r\nI was born depressed really ,with a wierd point of veiw, very critic of everybody, now I am a happy ordinary guy with a beautifull family\r\nSomething I could not have achieved with out paxil, I take one every day for 23 years\r\nThe effects goes out just after a glass of wine , so be carefull , quiting the pill in a fast way will cause &quot; rage&quot; yes , you would not recognized your self kind of rage so be carefull.\r\nYou ar""",z,"""\r\nIn few words - Life changing\r\nAll nega...",z
"""\r\nxanax forums are full of how xanax can be so strong for some people. to me it was like a sledgehammer over my head. I woke 6 hours later with no recollection of what went on other than I slept. this was a low dose. It is either panic or been knocked out. Where are all these pleasant experiences I hear about?""",z,"""\r\nxanax forums are full of how xanax can b...",z
""" I hate the doctors that prescribe ambien to patients like me. They are putting patent&#039;s life in danger. I know some people taking this pill during the day too and they take way too much . It is adicctive""",z,""" I hate the doctors that prescribe ambien to...",z
""" I have common variable immunodeficiency which causes anorexia. I found Marinol 5mg taken a couple hours before meals makes a huge difference. If I do not want to eat, ulcers etc, I found it easier with Marinol and have actually gained around 10lbs in the initial month or two..... I would recommend it for any patient with weight loss issues!""",z,""" I have common variable immunodeficiency whic...",z
""" I was diagnosed with bipolar 2 recently at 34 years old. I&#039;m also diagnosed with severe depression, generalized anxiety disorder, and PTSD. After being on what seems every antidepressant in the world, finally, a combination of Lamictal, Wellbutrin, and Xanax has been a life saver.I was prescribed the Lamictal a couple months ago. I noticed that after a couple weeks the Lamictal started to wear off a little and the depression started to appear again. I was started on 25 mg and then upped to 50 mg. A few days ago I noticed the depression again. I&#039;m going to ask my Dr. to up my dose again. Sorry for the long response lol. I just wanted to let anyone else know that I&#039;m going through a similar experience.""",z,""" I was diagnosed with bipolar 2 recently at 3...",z


In [39]:
un_reviewed_to_drop = un_reviewed_to_drop.reset_index(drop=True)
un_reviewed_to_drop.head()

Unnamed: 0,review,keep
0,"""\r\nIn few words - Life changing\r\nAll nega...",z
1,"""\r\nxanax forums are full of how xanax can b...",z
2,""" I hate the doctors that prescribe ambien to...",z
3,""" I have common variable immunodeficiency whic...",z
4,""" I was diagnosed with bipolar 2 recently at 3...",z


In [40]:
# Change keep value to no
un_reviewed_to_drop['keep'] = 'no'
un_reviewed_to_drop.head()

Unnamed: 0,review,keep
0,"""\r\nIn few words - Life changing\r\nAll nega...",no
1,"""\r\nxanax forums are full of how xanax can b...",no
2,""" I hate the doctors that prescribe ambien to...",no
3,""" I have common variable immunodeficiency whic...",no
4,""" I was diagnosed with bipolar 2 recently at 3...",no


This can again be merged with df. There may be multiple drugs per "no keep" review, and that's okay; each one can be filled with no because these reviews should be dropped wherever they appear, since they already have an associated yes review that is definitely relevant to its associated drug. Wherever the new keep column says no but the old keep column says yes, the value should be yes.
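
The merge and the precedence rule described here can be sketched on toy data as follows. The frames `toy_df` and `to_drop` are hypothetical stand-ins for `df` and `un_reviewed_to_drop`. The `validate="many_to_one"` argument is an optional guard I am adding for illustration: it makes pandas raise an error if a review appears more than once in the right-hand frame, which would otherwise silently duplicate rows.

```python
import pandas as pd

# Hypothetical stand-ins for df and un_reviewed_to_drop
toy_df = pd.DataFrame({
    "review": ["r1", "r1", "r2"],
    "drug":   ["a", "b", "c"],
    "keep":   ["yes", "z", "z"],
})
to_drop = pd.DataFrame({"review": ["r1"], "keep": ["no"]})

# Left merge keeps every row of toy_df; the overlapping 'keep' columns
# come out as keep_x (old) and keep_y (new)
merged = toy_df.merge(to_drop, on="review", how="left", validate="many_to_one")

# Only overwrite unknown rows, so an existing 'yes' wins over the merged 'no'
mask = merged["keep_y"].eq("no") & merged["keep_x"].eq("z")
merged.loc[mask, "keep_x"] = "no"

result = merged.drop(columns="keep_y").rename(columns={"keep_x": "keep"})
print(result)
```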

In [41]:
df = df.merge(right=un_reviewed_to_drop, on='review', how='left')
df

Unnamed: 0,drug,rating,condition,review,date,keep_x,keep_y
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes,
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes,
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,z,
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,z,
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes,
...,...,...,...,...,...,...,...
50627,Venlafaxine,9.0,anxiety,"""Had panic attacks and social anxiety starting...","November 10, 2016",z,
50628,Vortioxetine,2.0,depression,"""This is the third med I&#039;ve tried for anx...","July 17, 2016",z,
50629,Ativan,9.0,anxiety,"""I was super against taking medication. I&#039...","August 16, 2016",yes,no
50630,Fluoxetine,8.0,ocd,"""I have been off Prozac for about 4 weeks now....","January 21, 2015",z,no


In [42]:
# Now, if keep_x = yes, that's the row to keep for that review. 
# anyplace where keep_x = z but keep_y = no, the keep value should end up as no

needs_no = (df['keep_y'] == 'no') & (df['keep_x'] == 'z')
df.loc[needs_no, 'keep_x'] = 'no'

df[df.keep_y=='no']

Unnamed: 0,drug,rating,condition,review,date,keep_x,keep_y
106,Ambien,2.0,insomnia,"""Ditto on rebound sleepless when discontinued....","January 13, 2015",yes,no
115,Campral,10.0,addiction,"""Been a heavy drinker for over 6 years since a...","August 23, 2013",yes,no
117,Wellbutrin,8.0,depression,"""Coming from a very problematic childhood, I&#...","March 6, 2015",yes,no
121,Zoloft,1.0,depression,"""Zoloft did not help me at all. I was on it f...","January 14, 2013",yes,no
122,Ziprasidone,10.0,schizophrenia,"""Geodon is a very effective drug for me. Comp...","April 20, 2008",no,no
...,...,...,...,...,...,...,...
50622,Cymbalta,9.0,anxiety,"""I have been taking Cymbalta for 15 months now...","June 10, 2013",yes,no
50626,Geodon,3.0,bipolar,"""I was in a very bad place at the time I start...","July 25, 2016",yes,no
50629,Ativan,9.0,anxiety,"""I was super against taking medication. I&#039...","August 16, 2016",yes,no
50630,Fluoxetine,8.0,ocd,"""I have been off Prozac for about 4 weeks now....","January 21, 2015",no,no


In [43]:
# Check if this worked correctly
df[df.review == df.loc[122,'review']]

Unnamed: 0,drug,rating,condition,review,date,keep_x,keep_y
122,Ziprasidone,10.0,schizophrenia,"""Geodon is a very effective drug for me. Comp...","April 20, 2008",no,no
37111,Geodon,10.0,schizophrenia,"""Geodon is a very effective drug for me. Comp...","April 20, 2008",yes,no


In [44]:
# This looks correct. The drug name is in the review associated with the yes row.
# The matching review now says no in keep_x, so I can delete the column keep_y.

df = df.drop(columns=['keep_y'])
df.head()

Unnamed: 0,drug,rating,condition,review,date,keep_x
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,z
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,z
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes


In [45]:
df = df.rename(columns={'keep_x':'keep'})
df.head()

Unnamed: 0,drug,rating,condition,review,date,keep
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,z
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,z
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes


In [46]:
# What remains? How many rows still have a keep value of z?
len(df[df.keep=='z'])

22483

<font color='violet'> Deal with any reviews that are just duplicates related to multiple conditions.  

In [47]:
grouped_by_condition = df.groupby(['review', 'condition']).count()
grouped_by_condition

Unnamed: 0_level_0,Unnamed: 1_level_0,drug,rating,date,keep
review,condition,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"This medication caused me to be nervous, tremble and I became slightly irritable. This medication enabled me to think clearly. My thought processes were more coherent. It assisted me in concentrating and therefore I was able to complete tasks --- finally had follow-thru! I was able to manage my time better.",adhd,1,1,1,1
"""\r\n\r\n please tell the ones who is suffering from anxiety to use lavender chamomile spray by air wick. it gives immediate relief , doctors not letting know patients about this. please spread the word!!. Please keep this post here.""",anxiety,1,1,1,1
"""\r\nIn few words - Life changing\r\nAll negative thoughts change to positive ones\r\nAnd that is a life changing pill\r\nI was born depressed really ,with a wierd point of veiw, very critic of everybody, now I am a happy ordinary guy with a beautifull family\r\nSomething I could not have achieved with out paxil, I take one every day for 23 years\r\nThe effects goes out just after a glass of wine , so be carefull , quiting the pill in a fast way will cause &quot; rage&quot; yes , you would not recognized your self kind of rage so be carefull.\r\nYou ar""",depression,2,2,2,2
"""\r\nxanax forums are full of how xanax can be so strong for some people. to me it was like a sledgehammer over my head. I woke 6 hours later with no recollection of what went on other than I slept. this was a low dose. It is either panic or been knocked out. Where are all these pleasant experiences I hear about?""",anxiety,2,2,2,2
""" Caused depression and negative, self defeating thoughts early on. They just got worse and worse until finally it peaked in major anxiety and panic attacks so bad I could barley speak. Then I had to step down from the drug slowly due to the well documented withdrawal problems. So more time feeling god awful and wasted time out of my life. In my opinion, avoid this med if you can. It&#039;s a serious drug.""",bipolar,1,1,1,1
...,...,...,...,...,...
"patient taking one 36 mg tablet, 7 days a week, in the morning after breakfast. Tablet was a slow release lasting throughout the day. patient preformed with better concentration and focus skills. also helped out in after school homework. less trips from the study table, more time spent studying, and less time having to be reminded to get back to work. patient took this drug to help with concentration and focus problems occurring in a classroom environment with many distractions.",adhd,1,1,1,1
patient was to take one dose of concerta per day - earlier in the morning due to the time release effect. it was obvious while taking the medication that it was time release as each layer was released. dry mouth patient was able to focus more readily on tasks at hand. there was an increase in intensity/focus which was noticable upon taking the medication.,adhd,1,1,1,1
"rec'd speed but no focus. I took the dexerine in morning and it wears off at\r\r\nnight. I lost weight for the first time as the metabolism was great, sadly\r\r\ni gained 15 lbs off the same diet. insomnia, I got obsessive complusive and had to stop. I also think I have a bit\r\r\nof liver damage of taking meds over 10 years. I only took dexedrine for a few months. weight loss right away. I felt it working in brain right away, but lost the efficacy when I built my tolerance. My MD gave me more and the same thing happened.\r\r\n\r\r\nMeds don't cure add 100%, so now I take supplements.",add,1,1,1,1
simply remember to take medication weight loss and loss of appetite My depression has lifted and I am able to concentrate more than I was previuosly,add,1,1,1,1


In [48]:
# Those duplicated by condition would show up where 2 subsequent indices have the same review.
indices_duplicated_by_condition = []
# Stop one short of the end so that idx+1 always exists
for idx in range(len(grouped_by_condition) - 1):
    if grouped_by_condition.index[idx][0] == grouped_by_condition.index[idx+1][0]:
        # Note: a review with 3+ conditions gets its middle indices appended twice
        indices_duplicated_by_condition.append(idx)
        indices_duplicated_by_condition.append(idx+1)

indices_duplicated_by_condition[:5]

[871, 872, 1627, 1628, 3019]
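
An alternative way to find reviews filed under more than one condition, sketched on invented toy data: count the distinct conditions per review and keep each (review, condition) pair only once. This is a sketch under my own assumptions, not the approach taken in this notebook; the frame `toy` and the name `multi` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; r3 has three distinct conditions, one of them repeated
toy = pd.DataFrame({
    "review":    ["r1", "r1", "r2", "r3", "r3", "r3", "r3"],
    "condition": ["anxiety", "depression", "adhd",
                  "anxiety", "addiction", "ocd", "ocd"],
})

# Number of distinct conditions attached to each row's review
n_conditions = toy.groupby("review")["condition"].transform("nunique")

# Keep only reviews with more than one condition, one row per pair
multi = (
    toy.loc[n_conditions > 1, ["review", "condition"]]
    .drop_duplicates()
    .sort_values(["review", "condition"])
    .reset_index(drop=True)
)
print(multi)
```

Since every pair appears exactly once, a later merge on `['review', 'condition']` would not add rows to the frame it is merged into.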

In [49]:
# Take a look at the rows I've identified
duplicated_by_condition = pd.MultiIndex.to_frame(grouped_by_condition.index[
    indices_duplicated_by_condition])
duplicated_by_condition

Unnamed: 0_level_0,Unnamed: 1_level_0,review,condition
review,condition,Unnamed: 2_level_1,Unnamed: 3_level_1
"""After many months spent being given ten different types of antidepressants which none agreed with me my Dr suggested Venlafaxine 37.5 twice a day but I personally found it too much ( turned me into a Zombie) so we agreed on one 37.5 dosage daily slowly but surly it has given me my life back no major side effects other than insomnia .... Darkness is all I could see before venlafaxine 10/10 highly recommended""",anxiety,"""After many months spent being given ten diffe...",anxiety
"""After many months spent being given ten different types of antidepressants which none agreed with me my Dr suggested Venlafaxine 37.5 twice a day but I personally found it too much ( turned me into a Zombie) so we agreed on one 37.5 dosage daily slowly but surly it has given me my life back no major side effects other than insomnia .... Darkness is all I could see before venlafaxine 10/10 highly recommended""",depression,"""After many months spent being given ten diffe...",depression
"""Awesome.""",addiction,"""Awesome.""",addiction
"""Awesome.""",anxiety,"""Awesome.""",anxiety
"""Didn&#039;t work for me.""",anxiety,"""Didn&#039;t work for me.""",anxiety
...,...,...,...
"""Works great""",sedation,"""Works great""",sedation
"""Works well.""",addiction,"""Works well.""",addiction
"""Works well.""",anxiety,"""Works well.""",anxiety
"""Works well.""",anxiety,"""Works well.""",anxiety


Here, I think it makes sense to keep just one of the conditions. If there were many pairs like this, I might create columns "condition1" and "condition2", but if "condition2" would only have 4 values out of tens of thousands of rows, that seems like a waste. Instead, I'll keep the row for the less-common condition, so as to balance rather than further unbalance the condition column. 
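
As a rough sketch of that choice: given a mapping from condition to prevalence rank (0 = most common), `idxmax` on the rank picks the row of the least-common condition within each review, so the condition label stays paired with its rank. The frame `pairs` is invented toy data; the three rank values are copied from the `conditions_rank` dictionary built in the cells that follow.

```python
import pandas as pd

# Hypothetical toy data: two reviews, each filed under two conditions
pairs = pd.DataFrame({
    "review":    ["r1", "r1", "r2", "r2"],
    "condition": ["anxiety", "depression", "anxiety", "insomnia"],
})

# Subset of the rank mapping (0 = most common condition)
conditions_rank = {"depression": 0, "anxiety": 1, "insomnia": 4}
pairs["rank"] = pairs["condition"].map(conditions_rank)

# Row label of the highest rank (least-common condition) per review
keep_idx = pairs.groupby("review")["rank"].idxmax()

pairs["keep"] = "no"
pairs.loc[keep_idx, "keep"] = "yes"
print(pairs)
```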

First I'll need a dictionary that maps each condition to its rank by prevalence (0 = most common).

In [50]:
conditions_rank = df.condition.value_counts().to_frame()
conditions_rank.head()

Unnamed: 0,condition
depression,14424
anxiety,14108
bipolar,5601
addiction,5192
insomnia,5016


In [51]:
conditions_rank['rank'] = range(len(conditions_rank))
conditions_rank.head()

Unnamed: 0,condition,rank
depression,14424,0
anxiety,14108,1
bipolar,5601,2
addiction,5192,3
insomnia,5016,4


In [52]:
conditions_rank = conditions_rank.drop(columns=['condition']).reset_index().rename(
    columns={'index':'condition'})
conditions_rank.head()

Unnamed: 0,condition,rank
0,depression,0
1,anxiety,1
2,bipolar,2
3,addiction,3
4,insomnia,4


In [53]:
conditions_rank = conditions_rank.set_index('condition').to_dict()['rank']
conditions_rank

{'depression': 0,
 'anxiety': 1,
 'bipolar': 2,
 'addiction': 3,
 'insomnia': 4,
 'hrt': 5,
 'schizophrenia': 6,
 'ocd': 7,
 'other': 8,
 'schizoaffective disorder': 9,
 'ptsd': 10,
 'sedation': 11,
 'eating disorder': 12,
 'bpd': 13,
 'asd': 14,
 'alzheimers': 15,
 'fatigue': 16,
 'psychosis': 17,
 'sexual dysfunction': 18,
 'hypersomnia': 19,
 'mania': 20,
 'nightmares': 21,
 'add': 22,
 'paranoia': 23,
 'adhd': 24,
 'sad': 25,
 'body dysmorphia': 26,
 'auditory processing disorder': 27,
 'cognitive impairment': 28,
 'hypoactive sexual desire disorder': 29,
 'dementia': 30,
 'anger': 31,
 'somatic disorder': 32,
 'failure to thrive': 33,
 'mood disorder': 34,
 'did': 35,
 'neurosis': 36,
 'agoraphobia': 37}
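
For reference, a dictionary of this shape can also be built in one step from `value_counts()`, which returns counts sorted from most to least common. A minimal sketch on an invented Series (the values below are not from the dataset):

```python
import pandas as pd

# Hypothetical condition column with no ties in frequency
conditions = pd.Series(
    ["depression", "depression", "depression", "anxiety", "anxiety", "bipolar"],
    name="condition",
)

# value_counts() sorts descending, so enumerate() yields the rank directly
counts = conditions.value_counts()
conditions_rank = {condition: rank for rank, condition in enumerate(counts.index)}
print(conditions_rank)
```

Conditions with equal counts would get adjacent ranks in whatever order `value_counts()` returns them.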

In [54]:
# Prepare dataframe of just reviews that have multiple conditions attached
duplicated_by_condition = duplicated_by_condition.reset_index(drop=True)
duplicated_by_condition.head()

Unnamed: 0,review,condition
0,"""After many months spent being given ten diffe...",anxiety
1,"""After many months spent being given ten diffe...",depression
2,"""Awesome.""",addiction
3,"""Awesome.""",anxiety
4,"""Didn&#039;t work for me.""",anxiety


In [55]:
# Get this in a format where the conditions for each review can be compared
for row in range(len(duplicated_by_condition)):
    duplicated_by_condition.loc[row,'rank'] = conditions_rank[duplicated_by_condition.loc[
        row, 'condition']]

duplicated_by_condition.head()

Unnamed: 0,review,condition,rank
0,"""After many months spent being given ten diffe...",anxiety,1.0
1,"""After many months spent being given ten diffe...",depression,0.0
2,"""Awesome.""",addiction,3.0
3,"""Awesome.""",anxiety,1.0
4,"""Didn&#039;t work for me.""",anxiety,1.0


In [56]:
# Identify max rank as the condition to keep for each review
condition_to_keep = duplicated_by_condition.groupby(['review']).max()
condition_to_keep.head()

Unnamed: 0_level_0,condition,rank
review,Unnamed: 1_level_1,Unnamed: 2_level_1
"""After many months spent being given ten different types of antidepressants which none agreed with me my Dr suggested Venlafaxine 37.5 twice a day but I personally found it too much ( turned me into a Zombie) so we agreed on one 37.5 dosage daily slowly but surly it has given me my life back no major side effects other than insomnia .... Darkness is all I could see before venlafaxine 10/10 highly recommended""",depression,1.0
"""Awesome.""",anxiety,3.0
"""Didn&#039;t work for me.""",insomnia,4.0
"""Excellent""",insomnia,4.0
"""Good""",psychosis,17.0


In [57]:
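# groupby().max() takes the max of each column separately, so the condition listed is
# just the alphabetically last one, not the one that belongs to the max rank.
# The rank itself is the correct one to keep.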

condition_to_keep = condition_to_keep.drop(columns=['condition'])
condition_to_keep.head()

Unnamed: 0_level_0,rank
review,Unnamed: 1_level_1
"""After many months spent being given ten different types of antidepressants which none agreed with me my Dr suggested Venlafaxine 37.5 twice a day but I personally found it too much ( turned me into a Zombie) so we agreed on one 37.5 dosage daily slowly but surly it has given me my life back no major side effects other than insomnia .... Darkness is all I could see before venlafaxine 10/10 highly recommended""",1.0
"""Awesome.""",3.0
"""Didn&#039;t work for me.""",4.0
"""Excellent""",4.0
"""Good""",17.0


In [58]:
# Change rank to int type
condition_to_keep['rank'] = condition_to_keep['rank'].astype(int)
condition_to_keep.head()

Unnamed: 0_level_0,rank
review,Unnamed: 1_level_1
"""After many months spent being given ten different types of antidepressants which none agreed with me my Dr suggested Venlafaxine 37.5 twice a day but I personally found it too much ( turned me into a Zombie) so we agreed on one 37.5 dosage daily slowly but surly it has given me my life back no major side effects other than insomnia .... Darkness is all I could see before venlafaxine 10/10 highly recommended""",1
"""Awesome.""",3
"""Didn&#039;t work for me.""",4
"""Excellent""",4
"""Good""",17


In [59]:
# Create regular df to iterate through:
condition_to_keep = condition_to_keep.reset_index()
condition_to_keep.head()

Unnamed: 0,review,rank
0,"""After many months spent being given ten diffe...",1
1,"""Awesome.""",3
2,"""Didn&#039;t work for me.""",4
3,"""Excellent""",4
4,"""Good""",17


In [60]:
# Refill conditions by looking each rank up in an inverted copy of conditions_rank
rank_to_condition = {rank: condition for condition, rank in conditions_rank.items()}
condition_to_keep['condition'] = condition_to_keep['rank'].map(rank_to_condition)
            
condition_to_keep.head()

Unnamed: 0,review,rank,condition
0,"""After many months spent being given ten diffe...",1,anxiety
1,"""Awesome.""",3,addiction
2,"""Didn&#039;t work for me.""",4,insomnia
3,"""Excellent""",4,insomnia
4,"""Good""",17,psychosis


In [61]:
# These conditions should have a keep value of 'yes'
condition_to_keep['keep'] = 'yes'
condition_to_keep.head()

Unnamed: 0,review,rank,condition,keep
0,"""After many months spent being given ten diffe...",1,anxiety,yes
1,"""Awesome.""",3,addiction,yes
2,"""Didn&#039;t work for me.""",4,insomnia,yes
3,"""Excellent""",4,insomnia,yes
4,"""Good""",17,psychosis,yes


In [62]:
# Merge with duplicated_by_condition so as to be able to mark remaining rows with "no"
duplicated_by_condition = duplicated_by_condition.merge(condition_to_keep, how='left')
duplicated_by_condition

Unnamed: 0,review,condition,rank,keep
0,"""After many months spent being given ten diffe...",anxiety,1.0,yes
1,"""After many months spent being given ten diffe...",depression,0.0,
2,"""Awesome.""",addiction,3.0,yes
3,"""Awesome.""",anxiety,1.0,
4,"""Didn&#039;t work for me.""",anxiety,1.0,
...,...,...,...,...
71,"""Works great""",sedation,11.0,yes
72,"""Works well.""",addiction,3.0,
73,"""Works well.""",anxiety,1.0,
74,"""Works well.""",anxiety,1.0,


In [63]:
duplicated_by_condition = duplicated_by_condition.drop(columns=['rank']).fillna('no')
duplicated_by_condition.head()

Unnamed: 0,review,condition,keep
0,"""After many months spent being given ten diffe...",anxiety,yes
1,"""After many months spent being given ten diffe...",depression,no
2,"""Awesome.""",addiction,yes
3,"""Awesome.""",anxiety,no
4,"""Didn&#039;t work for me.""",anxiety,no


In [64]:
# Now duplicated_by_condition can be merged with the rest of the df
df = df.merge(duplicated_by_condition, on=['review', 'condition'], how='left')
df

Unnamed: 0,drug,rating,condition,review,date,keep_x,keep_y
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes,
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes,
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,z,
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,z,
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes,
...,...,...,...,...,...,...,...
50646,Venlafaxine,9.0,anxiety,"""Had panic attacks and social anxiety starting...","November 10, 2016",z,
50647,Vortioxetine,2.0,depression,"""This is the third med I&#039;ve tried for anx...","July 17, 2016",z,
50648,Ativan,9.0,anxiety,"""I was super against taking medication. I&#039...","August 16, 2016",yes,
50649,Fluoxetine,8.0,ocd,"""I have been off Prozac for about 4 weeks now....","January 21, 2015",no,


In [65]:
# How did that work? What does the first review with duplicated conditions look like?
df[df.review.str.contains('After many months spent being given ten')]

Unnamed: 0,drug,rating,condition,review,date,keep_x,keep_y
42826,Venlafaxine,9.0,depression,"""After many months spent being given ten diffe...","July 22, 2016",yes,no
47087,Venlafaxine,10.0,anxiety,"""After many months spent being given ten diffe...","July 25, 2016",yes,yes


In [66]:
# I'd previously mis-labeled some rows. 
df.sort_values(by=['keep_y', 'keep_x']).head(7)

Unnamed: 0,drug,rating,condition,review,date,keep_x,keep_y
13014,Fluoxetine,10.0,depression,"""Most of my life I have struggled with severe ...","March 15, 2015",no,no
15201,Duloxetine,10.0,anxiety,"""I&#039;m hoping my comments reaches out to th...","March 14, 2017",no,no
38582,Prozac,10.0,depression,"""Most of my life I have struggled with severe ...","March 15, 2015",yes,no
42700,Cymbalta,10.0,anxiety,"""I&#039;m hoping my comments reaches out to th...","March 14, 2017",yes,no
42826,Venlafaxine,9.0,depression,"""After many months spent being given ten diffe...","July 22, 2016",yes,no
162,Bupropion,10.0,depression,"""Saved my life.""","August 5, 2012",z,no
163,Bupropion,10.0,depression,"""Saved my life.""","August 5, 2012",z,no


In [67]:
# Wherever keep_y is not null, that is the value that should be kept. 
# Otherwise keep the value of keep_x

df = df.reset_index(drop = True)
df.head()

Unnamed: 0,drug,rating,condition,review,date,keep_x,keep_y
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes,
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes,
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,z,
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,z,
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes,


In [68]:
# Prefer keep_y where it is set (yes/no); fall back to keep_x where it is null
df['keep'] = df['keep_y'].fillna(df['keep_x'])
        
df.head()

Unnamed: 0,drug,rating,condition,review,date,keep_x,keep_y,keep
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes,,yes
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes,,yes
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,z,,z
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,z,,z
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes,,yes


In [69]:
df.sort_values(by=['keep_y', 'keep_x']).head()

Unnamed: 0,drug,rating,condition,review,date,keep_x,keep_y,keep
13014,Fluoxetine,10.0,depression,"""Most of my life I have struggled with severe ...","March 15, 2015",no,no,no
15201,Duloxetine,10.0,anxiety,"""I&#039;m hoping my comments reaches out to th...","March 14, 2017",no,no,no
38582,Prozac,10.0,depression,"""Most of my life I have struggled with severe ...","March 15, 2015",yes,no,no
42700,Cymbalta,10.0,anxiety,"""I&#039;m hoping my comments reaches out to th...","March 14, 2017",yes,no,no
42826,Venlafaxine,9.0,depression,"""After many months spent being given ten diffe...","July 22, 2016",yes,no,no


In [70]:
# This looks correct so far. Clean up. 
df = df.drop(columns=['keep_x', 'keep_y'])
df.head()

Unnamed: 0,drug,rating,condition,review,date,keep
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,z
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,z
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes


Now, wherever a review is duplicated, a row for that review is kept if it contains the drug name and was submitted for the least-common condition. Rows are marked for removal if they don't contain the drug name but a duplicate does, or if they were submitted for a more-common condition when the same review was also submitted for a less-common one. 

But wherever there is no drug name at all in the review, duplicates likely still exist across multiple drugs. This may be a place where new columns drug1, drug2, drug3 are necessary.

<font color='violet'> Deal with remaining reviews duplicated across multiple drugs.

In [71]:
# How many reviews remain to deal with?
len(df[(df.review.duplicated(keep=False)==True) & (df.keep=='z')])

14374

In [72]:
# What's the highest number of drugs associated with a single review?
row_count = df.groupby(['review']).count()
row_count.sort_values(by='drug', ascending=False)

Unnamed: 0_level_0,drug,rating,condition,date,keep
review,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"""Good""",16,16,16,16,16
"""Saved my life.""",14,14,14,14,14
"""Great""",9,9,9,9,9
"""Works well.""",8,8,8,8,8
"""Works great""",8,8,8,8,8
...,...,...,...,...,...
"""I take lexapro because my wife turns me nuts. Yes, I got depressed because I have been opressed, by her. So I started taking lex, now I don&#039;t care about my wife&#039;s outrages. I think it is her that needs it. Also my libido is down. That is a down""",1,1,1,1,1
"""I take lithium 300 mg twice daily. The lithium makes me pee constantly and I really have to watch it if I decide to have any alcohol - I drink just 2 small beers and I&#039;m very buzzed. The lithium makes me feel stable and it keeps me from having severe or deep depression. It has eliminated my severe depressions and it&#039;s the only drug I&#039;ve tried that does this. I still get the blues or feel slightly depressed but NOTHING like the suicidal black hole depression I used to experience. Also, it slows me down during hypomania so I don&#039;t escalate to full blown mania. But if I neglected to take my lithium during hypomania (ran out) when in full blown mania -lithium helps, but in my case I must take an anti psychotic to ground me as well.""",1,1,1,1,1
"""I take lorazepam for social anxiety. It seems to help. I feel more relaxed and talkative when I take it. It also helps me with severe depression.""",1,1,1,1,1
"""I take many medications for Depression (Bi-Polar) and this one helps without the sleepiness that the other ones give. I take a total of four different medications for depression some to help me sleep and others as a supplement.""",1,1,1,1,1


The review "Good" is associated with 16 different drugs. Add columns drug0...drug15 wherever a review has more than one associated drug. First, sort drugs by prevalence, then enumerate drugs per review so that column can then become multiple new columns. Finally, create a pivot table and fill values of the new drug_n columns with drug names.
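
The three steps can be sketched end to end on invented toy data. This is an illustration under my own assumptions rather than the exact cells below: the frame `toy`, the helper column `drug_count`, and the `drug0`, `drug1`, ... column prefix are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: Sertraline appears 3 times, Citalopram 2, Lexapro 1
toy = pd.DataFrame({
    "review": ["Good", "Good", "Good", "Great", "Great", "Unique text"],
    "drug":   ["Lexapro", "Sertraline", "Citalopram",
               "Sertraline", "Citalopram", "Sertraline"],
})

# Step 1: attach each drug's overall frequency
toy["drug_count"] = toy["drug"].map(toy["drug"].value_counts())

# Step 2: within each review, number the drugs from most to least common
ordered = toy.sort_values(["review", "drug_count"], ascending=[True, False])
ordered["drug_n"] = ordered.groupby("review").cumcount()

# Step 3: pivot to one row per review with one column per drug slot
wide = ordered.pivot(index="review", columns="drug_n", values="drug").add_prefix("drug")
print(wide)
```

Reviews with fewer drugs than the widest review get missing values in the unused slots.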

In [73]:
# Go back and sort drugs according to how common they are so they're enumerated that way
by_drug = df.groupby('drug').count().sort_values(by='rating', ascending=False)
by_drug

Unnamed: 0_level_0,rating,condition,review,date,keep
drug,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Sertraline,1855,1855,1855,1855,1855
Escitalopram,1713,1713,1713,1713,1713
Citalopram,1300,1300,1300,1300,1300
Bupropion,1279,1279,1279,1279,1279
Lexapro,1231,1231,1231,1231,1231
...,...,...,...,...,...
Dasetta 7 / 7 / 7,1,1,1,1,1
Dasatinib,1,1,1,1,1
Gentian violet,1,1,1,1,1
Magnesium sulfate / potassium sulfate / sodium sulfate,1,1,1,1,1


In [74]:
by_drug['drug_prevalance'] = range(len(by_drug))
by_drug = by_drug.drop(columns=[
    'rating', 'condition', 'review', 'date', 'keep']).reset_index()
by_drug

Unnamed: 0,drug,drug_prevalance
0,Sertraline,0
1,Escitalopram,1
2,Citalopram,2
3,Bupropion,3
4,Lexapro,4
...,...,...
644,Dasetta 7 / 7 / 7,644
645,Dasatinib,645
646,Gentian violet,646
647,Magnesium sulfate / potassium sulfate / sodium...,647


In [75]:
# Merge with df so that each row carries its drug's prevalence rank (0 = most common)
df = df.merge(by_drug, how='left')
df

Unnamed: 0,drug,rating,condition,review,date,keep,drug_prevalance
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes,189
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes,189
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,z,482
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,z,458
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes,228
...,...,...,...,...,...,...,...
50646,Venlafaxine,9.0,anxiety,"""Had panic attacks and social anxiety starting...","November 10, 2016",z,5
50647,Vortioxetine,2.0,depression,"""This is the third med I&#039;ve tried for anx...","July 17, 2016",z,21
50648,Ativan,9.0,anxiety,"""I was super against taking medication. I&#039...","August 16, 2016",yes,52
50649,Fluoxetine,8.0,ocd,"""I have been off Prozac for about 4 weeks now....","January 21, 2015",no,10


In [76]:
# Create drug_n to enumerate drugs per review, most common drug first (0, 1, 2, ...)
df['drug_n'] = df.sort_values(by='drug_prevalance').groupby(['review']).cumcount()
# The highest drug_n is one less than the largest number of drugs sharing a review
df.drug_n.max()

15

In [77]:
# That appears to have worked. drug_n contains values 0-15, i.e. at most 16 drugs per review
# Now pivot so that each drug_n value becomes its own column holding the drug name
wide_df = pd.pivot(data=df, columns='drug_n', values='drug', index='review')
wide_df.head()

drug_n,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
review,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
"This medication caused me to be nervous, tremble and I became slightly irritable. This medication enabled me to think clearly. My thought processes were more coherent. It assisted me in concentrating and therefore I was able to complete tasks --- finally had follow-thru! I was able to manage my time better.",ritalin-la,,,,,,,,,,,,,,,
"""\r\n\r\n please tell the ones who is suffering from anxiety to use lavender chamomile spray by air wick. it gives immediate relief , doctors not letting know patients about this. please spread the word!!. Please keep this post here.""",Quetiapine,,,,,,,,,,,,,,,
"""\r\nIn few words - Life changing\r\nAll negative thoughts change to positive ones\r\nAnd that is a life changing pill\r\nI was born depressed really ,with a wierd point of veiw, very critic of everybody, now I am a happy ordinary guy with a beautifull family\r\nSomething I could not have achieved with out paxil, I take one every day for 23 years\r\nThe effects goes out just after a glass of wine , so be carefull , quiting the pill in a fast way will cause &quot; rage&quot; yes , you would not recognized your self kind of rage so be carefull.\r\nYou ar""",Paroxetine,Paxil,,,,,,,,,,,,,,
"""\r\nxanax forums are full of how xanax can be so strong for some people. to me it was like a sledgehammer over my head. I woke 6 hours later with no recollection of what went on other than I slept. this was a low dose. It is either panic or been knocked out. Where are all these pleasant experiences I hear about?""",Alprazolam,Xanax,,,,,,,,,,,,,,
""" Caused depression and negative, self defeating thoughts early on. They just got worse and worse until finally it peaked in major anxiety and panic attacks so bad I could barley speak. Then I had to step down from the drug slowly due to the well documented withdrawal problems. So more time feeling god awful and wasted time out of my life. In my opinion, avoid this med if you can. It&#039;s a serious drug.""",Quetiapine,,,,,,,,,,,,,,,


In [78]:
wide_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 31554 entries,   This medication caused me to be nervous, tremble and I became slightly irritable. This medication enabled me to think clearly.  My thought processes were more coherent.  It assisted me in concentrating and therefore I was able to complete tasks --- finally had follow-thru! I was able to manage my time better. to slightly increased attention in a 13 year old girl suffering with ADHD (predominately inattentive) and aspergers syndrome but had bad side effects so was taken off medication and switched to Adderall XR instead very high dialostic blood pressure, weight loss, fatigue could focus slightly better in school and also at home
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       31554 non-null  object
 1   1       18992 non-null  object
 2   2       32 non-null     object
 3   3       23 non-null     object
 4   4       12 non-null     object
 5   5       11 

In [79]:
# Drugs are now distributed across columns 0-15. Get this in a format to re-merge w/ full df
wide_df = wide_df.reset_index().rename(columns={0:'drug0', 1:'drug1', 2:'drug2', 3:'drug3', 
                                                4:'drug4', 5:'drug5', 6:'drug6', 7:'drug7', 
                                                8:'drug8', 9:'drug9', 10:'drug10', 11:'drug11', 
                                                12:'drug12', 13:'drug13', 14:'drug14', 
                                                15:'drug15'})
wide_df.head()

drug_n,review,drug0,drug1,drug2,drug3,drug4,drug5,drug6,drug7,drug8,drug9,drug10,drug11,drug12,drug13,drug14,drug15
0,"This medication caused me to be nervous, tre...",ritalin-la,,,,,,,,,,,,,,,
1,"""\r\n\r\n please tell the ones who is sufferin...",Quetiapine,,,,,,,,,,,,,,,
2,"""\r\nIn few words - Life changing\r\nAll nega...",Paroxetine,Paxil,,,,,,,,,,,,,,
3,"""\r\nxanax forums are full of how xanax can b...",Alprazolam,Xanax,,,,,,,,,,,,,,
4,""" Caused depression and negative, self defeati...",Quetiapine,,,,,,,,,,,,,,,
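The explicit rename dictionary above could probably be shortened. A minimal sketch, assuming the pivoted table still has integer column labels and the review text as its index: `DataFrame.add_prefix` turns `0, 1, ...` into `drug0, drug1, ...` before the index is reset. The `wide` and `renamed` names and the toy values are stand-ins for this example, not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-in for the pivoted table: integer column labels, review text as index
wide = pd.DataFrame(
    {0: ['Alprazolam', 'Quetiapine'], 1: ['Xanax', None]},
    index=pd.Index(['review a', 'review b'], name='review'),
)
wide.columns.name = 'drug_n'

# Prefix the integer labels first, then move the review text out of the index;
# doing it in the other order would also rename 'review' to 'drugreview'
renamed = wide.add_prefix('drug').reset_index()
print(list(renamed.columns))
```

The order matters here: prefixing before `reset_index` leaves the `review` column name untouched.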


In [80]:
drug_cols_df = df.merge(wide_df, on='review', how='left')
drug_cols_df

Unnamed: 0,drug,rating,condition,review,date,keep,drug_prevalance,drug_n,drug0,drug1,...,drug6,drug7,drug8,drug9,drug10,drug11,drug12,drug13,drug14,drug15
0,vyvanse,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,yes,189,0,vyvanse,,...,,,,,,,,,,
1,vyvanse,7.0,add,So far the throwing up has stopped and the hea...,0,yes,189,0,vyvanse,,...,,,,,,,,,,
2,ritalin-la,8.0,adhd,"This medication caused me to be nervous, tre...",0,z,482,0,ritalin-la,,...,,,,,,,,,,
3,wellbutrin-sr,8.0,adhd,Only been on Wellbutrin for three weeks and al...,0,z,458,0,wellbutrin-sr,,...,,,,,,,,,,
4,concerta,8.0,adhd,The treatment details were pretty basic. I ju...,0,yes,228,0,concerta,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50646,Venlafaxine,9.0,anxiety,"""Had panic attacks and social anxiety starting...","November 10, 2016",z,5,0,Venlafaxine,Effexor XR,...,,,,,,,,,,
50647,Vortioxetine,2.0,depression,"""This is the third med I&#039;ve tried for anx...","July 17, 2016",z,21,0,Vortioxetine,,...,,,,,,,,,,
50648,Ativan,9.0,anxiety,"""I was super against taking medication. I&#039...","August 16, 2016",yes,52,1,Lorazepam,Ativan,...,,,,,,,,,,
50649,Fluoxetine,8.0,ocd,"""I have been off Prozac for about 4 weeks now....","January 21, 2015",no,10,0,Fluoxetine,Prozac,...,,,,,,,,,,
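One way to guard a merge like this is the `validate` argument of `DataFrame.merge`, which raises a `MergeError` if the right-hand table has more than one row per key. A minimal sketch on made-up data (`long_df` and `wide` are hypothetical names and values), assuming the wide table holds exactly one row per review:

```python
import pandas as pd

# Hypothetical long table: one row per (drug, review) pair
long_df = pd.DataFrame({
    'drug': ['Alprazolam', 'Xanax', 'Quetiapine'],
    'review': ['review a', 'review a', 'review b'],
})

# Hypothetical wide table: one row per review
wide = pd.DataFrame({
    'review': ['review a', 'review b'],
    'drug0': ['Alprazolam', 'Quetiapine'],
    'drug1': ['Xanax', None],
})

# validate='many_to_one' checks that 'review' is unique in the right-hand table
merged = long_df.merge(wide, on='review', how='left', validate='many_to_one')

# A left merge against unique keys should not change the row count
print(len(merged) == len(long_df))
```

If the check passes, every copy of a review receives the same drug0...drugN values, which is what the later de-duplication relies on.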


In [81]:
# This has the correct number and type of rows and columns. Clean up columns. 
drug_cols_df = drug_cols_df.drop(columns=['drug', 'drug_prevalance', 'drug_n'])
drug_cols_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50651 entries, 0 to 50650
Data columns (total 21 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   rating     50651 non-null  float64
 1   condition  50651 non-null  object 
 2   review     50651 non-null  object 
 3   date       50651 non-null  object 
 4   keep       50651 non-null  object 
 5   drug0      50651 non-null  object 
 6   drug1      38089 non-null  object 
 7   drug2      169 non-null    object 
 8   drug3      142 non-null    object 
 9   drug4      98 non-null     object 
 10  drug5      93 non-null     object 
 11  drug6      69 non-null     object 
 12  drug7      55 non-null     object 
 13  drug8      39 non-null     object 
 14  drug9      30 non-null     object 
 15  drug10     30 non-null     object 
 16  drug11     30 non-null     object 
 17  drug12     30 non-null     object 
 18  drug13     30 non-null     object 
 19  drug14     16 non-null     object 
 20  drug15

This is now in a format where there are (hopefully) completely duplicated rows. Every copy of a duplicated review, including those with a keep value of z, should now have the same drugs associated with it, just spread over multiple columns. See if it works to simply drop fully duplicated rows.
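A small sketch of what dropping full duplicates can and cannot do, on made-up rows (the `rows` and `deduped` names and the values are hypothetical): copies that match in every column collapse to one, while a copy that differs in any column, such as the rating, survives and still shares its review text with another row.

```python
import pandas as pd

# Hypothetical rows: the first two are identical in every column, the third differs in rating
rows = pd.DataFrame({
    'rating': [10.0, 10.0, 8.0],
    'review': ['Helps me a lot.'] * 3,
    'date':   ['March 7, 2012'] * 3,
    'drug0':  ['Alprazolam'] * 3,
    'drug1':  ['Xanax'] * 3,
})

# With no subset given, drop_duplicates compares all columns and keeps the first copy
deduped = rows.drop_duplicates()
print(len(deduped))

# Rows that still share a review text after the full-row de-duplication
print(deduped['review'].duplicated(keep=False).sum())
```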

<font color='violet'> Delete duplicates and rows marked for deletion.

In [82]:
drug_cols_df = drug_cols_df.drop_duplicates()
drug_cols_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43190 entries, 0 to 50650
Data columns (total 21 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   rating     43190 non-null  float64
 1   condition  43190 non-null  object 
 2   review     43190 non-null  object 
 3   date       43190 non-null  object 
 4   keep       43190 non-null  object 
 5   drug0      43190 non-null  object 
 6   drug1      30628 non-null  object 
 7   drug2      90 non-null     object 
 8   drug3      72 non-null     object 
 9   drug4      48 non-null     object 
 10  drug5      45 non-null     object 
 11  drug6      33 non-null     object 
 12  drug7      25 non-null     object 
 13  drug8      19 non-null     object 
 14  drug9      14 non-null     object 
 15  drug10     14 non-null     object 
 16  drug11     14 non-null     object 
 17  drug12     14 non-null     object 
 18  drug13     14 non-null     object 
 19  drug14     9 non-null      object 
 20  drug15

In [83]:
# That did get rid of 7k rows. 
drug_cols_df = drug_cols_df[drug_cols_df.keep!='no'].copy()
drug_cols_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31562 entries, 0 to 50650
Data columns (total 21 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   rating     31562 non-null  float64
 1   condition  31562 non-null  object 
 2   review     31562 non-null  object 
 3   date       31562 non-null  object 
 4   keep       31562 non-null  object 
 5   drug0      31562 non-null  object 
 6   drug1      19000 non-null  object 
 7   drug2      40 non-null     object 
 8   drug3      29 non-null     object 
 9   drug4      13 non-null     object 
 10  drug5      12 non-null     object 
 11  drug6      8 non-null      object 
 12  drug7      6 non-null      object 
 13  drug8      4 non-null      object 
 14  drug9      2 non-null      object 
 15  drug10     2 non-null      object 
 16  drug11     2 non-null      object 
 17  drug12     2 non-null      object 
 18  drug13     2 non-null      object 
 19  drug14     1 non-null      object 
 20  drug15

In [84]:
# Another ~11.6k rows taken care of. Check out what's up now with duplicated reviews
drug_cols_df.review.duplicated(keep=False).sum()

16

In [85]:
# This is very easy to deal with now
drug_cols_df[drug_cols_df.review.duplicated(keep=False)].sort_values(by='review')

Unnamed: 0,rating,condition,review,date,keep,drug0,drug1,drug2,drug3,drug4,...,drug6,drug7,drug8,drug9,drug10,drug11,drug12,drug13,drug14,drug15
26615,10.0,insomnia,"""After 20 years of getting up every hour and a...","November 16, 2009",z,Zolpidem,Zolpidem,Ambien,Ambien,,...,,,,,,,,,,
28271,10.0,insomnia,"""After 20 years of getting up every hour and a...","January 9, 2009",z,Zolpidem,Zolpidem,Ambien,Ambien,,...,,,,,,,,,,
2784,10.0,anxiety,"""Best medicine for anxiety.""","May 1, 2014",z,Clonazepam,Clonazepam,Klonopin,Klonopin,,...,,,,,,,,,,
17638,10.0,anxiety,"""Best medicine for anxiety.""","May 15, 2009",z,Clonazepam,Clonazepam,Klonopin,Klonopin,,...,,,,,,,,,,
3750,10.0,insomnia,"""Great""","April 20, 2016",yes,Citalopram,Citalopram,Varenicline,Chantix,Zolpidem,...,Pregabalin,Acetaminophen / diphenhydramine,Tylenol PM,,,,,,,
14531,8.0,insomnia,"""Great""","May 11, 2017",yes,Citalopram,Citalopram,Varenicline,Chantix,Zolpidem,...,Pregabalin,Acetaminophen / diphenhydramine,Tylenol PM,,,,,,,
919,10.0,anxiety,"""Helps me a lot.""","July 22, 2011",z,Alprazolam,Alprazolam,Xanax,,,...,,,,,,,,,,
23248,10.0,anxiety,"""Helps me a lot.""","March 7, 2012",z,Alprazolam,Alprazolam,Xanax,,,...,,,,,,,,,,
31289,8.0,anxiety,"""Hi Everyone, \r\r\nI am a 22 yr old female an...","June 10, 2016",yes,Escitalopram,Escitalopram,Lexapro,Lexapro,,...,,,,,,,,,,
42884,8.0,anxiety,"""Hi Everyone, \r\r\nI am a 22 yr old female an...","June 9, 2016",yes,Escitalopram,Escitalopram,Lexapro,Lexapro,,...,,,,,,,,,,


The remaining duplicates are reviews that were either identical but submitted on two different dates or that differed only in their rating. I'll keep just the latest copy of each.
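Keeping the latest copy could also be done without listing index labels by hand. A minimal sketch, assuming dates follow the "Month D, YYYY" pattern seen in the output above and that unparseable dates (such as `0`) should lose to any real date. The four rows are copied from the table above; the `dupes`, `parsed`, and `latest` names and the `_parsed` helper column are introduced only for this example.

```python
import pandas as pd

# Four of the duplicated rows shown above, with their original index labels
dupes = pd.DataFrame(
    {
        'rating': [10.0, 10.0, 10.0, 8.0],
        'review': ['"Best medicine for anxiety."', '"Best medicine for anxiety."',
                   '"Great"', '"Great"'],
        'date': ['May 1, 2014', 'May 15, 2009', 'April 20, 2016', 'May 11, 2017'],
    },
    index=[2784, 17638, 3750, 14531],
)

# Parse the date strings; anything that does not match the pattern becomes NaT
parsed = pd.to_datetime(dupes['date'], format='%B %d, %Y', errors='coerce')

latest = (
    dupes.assign(_parsed=parsed)
    .sort_values('_parsed', na_position='first')    # oldest first, undated rows before those
    .drop_duplicates(subset='review', keep='last')  # keep the most recent copy per review
    .drop(columns='_parsed')
    .sort_index()
)
print(latest.index.tolist())
```

On these four rows the sketch keeps 2784 and 14531, the same rows that survive the manual drop list in the next cell.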

<font color='violet'> Nuke remaining duplicate reviews

In [86]:
# Index labels of the duplicate rows to discard, one per remaining pair
rows_to_drop = [28271, 17638, 3750, 919, 42884, 5937, 972, 31390]
final_df = drug_cols_df.drop(index=rows_to_drop).drop(columns=['keep'])
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31554 entries, 0 to 50650
Data columns (total 20 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   rating     31554 non-null  float64
 1   condition  31554 non-null  object 
 2   review     31554 non-null  object 
 3   date       31554 non-null  object 
 4   drug0      31554 non-null  object 
 5   drug1      18992 non-null  object 
 6   drug2      32 non-null     object 
 7   drug3      23 non-null     object 
 8   drug4      12 non-null     object 
 9   drug5      11 non-null     object 
 10  drug6      7 non-null      object 
 11  drug7      5 non-null      object 
 12  drug8      3 non-null      object 
 13  drug9      2 non-null      object 
 14  drug10     2 non-null      object 
 15  drug11     2 non-null      object 
 16  drug12     2 non-null      object 
 17  drug13     2 non-null      object 
 18  drug14     1 non-null      object 
 19  drug15     1 non-null      object 
dtypes: flo

There are null values in the drug1...drug15 columns, but they are genuine: a null means the review has no further associated drug. They'll need to be changed prior to modeling, but for the purposes of EDA they should be kept. This should finally be ready to use for EDA. Pick that up in the next notebook:

In [87]:
final_df.to_csv('../data/interim/studies_no_duplicates.csv')