# <font color='violet'> Exploration & Parsing
Using prescription drug review data wrangled here: https://github.com/fractaldatalearning/psychedelic_efficacy/blob/main/notebooks/1-kl-wrangle-tabular.ipynb

In [1]:
# ! pip install tqdm 
# !{sys.executable} -m pip install contractions

In [2]:
import numpy as np
import pandas as pd
import sys
import contractions
import re
import string
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import unicodedata

In [3]:
# prepare to add local python functions; import modules from src directory
src = '../src'
sys.path.append(src)

# import local functions
from nlp.parse import remove_accented_chars, strip_most_punc, strip_apostrophe, \
strip_non_emoji_emoji_symbol

In [4]:
df = pd.read_csv('../data/interim/studies_no_duplicates.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31554 entries, 0 to 31553
Data columns (total 21 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  31554 non-null  int64  
 1   rating      31554 non-null  float64
 2   condition   31554 non-null  object 
 3   review      31554 non-null  object 
 4   date        31554 non-null  object 
 5   drug0       31554 non-null  object 
 6   drug1       18992 non-null  object 
 7   drug2       32 non-null     object 
 8   drug3       23 non-null     object 
 9   drug4       12 non-null     object 
 10  drug5       11 non-null     object 
 11  drug6       7 non-null      object 
 12  drug7       5 non-null      object 
 13  drug8       3 non-null      object 
 14  drug9       2 non-null      object 
 15  drug10      2 non-null      object 
 16  drug11      2 non-null      object 
 17  drug12      2 non-null      object 
 18  drug13      2 non-null      object 
 19  drug14      1 non-null   

In [5]:
# Drop "Unnamed" column; it's redundant with the index
df = df.drop(columns=['Unnamed: 0'])
df.head(2)

Unnamed: 0,rating,condition,review,date,drug0,drug1,drug2,drug3,drug4,drug5,drug6,drug7,drug8,drug9,drug10,drug11,drug12,drug13,drug14,drug15
0,8.0,adhd,I have only been on Vyvanse for 2 weeks. I st...,0,vyvanse,,,,,,,,,,,,,,,
1,7.0,add,So far the throwing up has stopped and the hea...,0,vyvanse,,,,,,,,,,,,,,,


<font color='violet'> Explore each column, starting with drug columns

In [6]:
# drug0 holds the most commonly reviewed of the drugs associated with each row.
# How many unique values does it contain?
len(df.drug0.unique())

373

In [7]:
# Which drug names appear in drug0?
df.drug0.unique()

array(['vyvanse', 'ritalin-la', 'wellbutrin-sr', 'concerta', 'strattera',
       'ritalin', 'adderall', 'diazepam', 'pristiq', 'dextrostat',
       'clonazepam', 'provigil', 'baclofen', 'dexedrine', 'adderall-xr',
       'dextroamphetamine', 'chantix', 'lorazepam', 'amphetamine',
       'methadone', 'wellbutrin-xl', 'focalin-xr', 'methylphenidate',
       'ativan', 'effexor', 'Mirtazapine', 'Methadone', 'Quetiapine',
       'Zolpidem', 'Varenicline', 'Clonazepam', 'Trazodone',
       'Aripiprazole', 'Lurasidone', 'Lamotrigine', 'Escitalopram',
       'Acamprosate', 'Gabapentin', 'Bupropion', 'Venlafaxine',
       'Sertraline', 'Pregabalin', 'Buspirone', 'Nicotine',
       'Divalproex sodium', 'Fluoxetine', 'Desvenlafaxine',
       'Lisdexamfetamine', 'Buprenorphine / naloxone', 'Temazepam',
       'Diazepam', 'Drospirenone / ethinyl estradiol', 'Vilazodone',
       'Caffeine', 'Fluoxetine / olanzapine', 'Tranexamic acid',
       'Nefazodone', 'Diphenhydramine', 'Paroxetine', 'Vortioxet

In [None]:
# Which drugs are most commonly reviewed?
freq_drugs = df.drug.value_counts().head(10)
freq_drugs

It would be best to add a 'drug class' column when I come to feature engineering so that all these drugs are categorized. That column existed previously but came from one of the original tables where too few of the rows had reviews for psych meds. I could eventually do this by creating a dictionary of drugs and their classes using information scraped from this website: https://www.drugs.com/drug-classes.html

Alternatively, drugs could be understood by the conditions they treat. 
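
A minimal sketch of the drug-class dictionary idea mentioned above. The mapping here is hand-written and hypothetical (a real one would be built from the drugs.com page), and the toy frame and its `drug` column stand in for the real data. Names are lower-cased before the lookup because the drug names printed earlier mix cases (for example 'clonazepam' and 'Clonazepam').

```python
import pandas as pd

# Hypothetical, hand-written mapping used only for illustration.
drug_classes = {
    'sertraline': 'SSRI',
    'escitalopram': 'SSRI',
    'clonazepam': 'benzodiazepine',
}

# Toy stand-in for the review data.
toy = pd.DataFrame({'drug': ['Sertraline', 'clonazepam', 'Clonazepam', 'Caffeine']})

# Lower-case each name so differently-cased spellings hit the same key;
# drugs missing from the mapping are labeled 'unknown'.
toy['drug_class'] = toy['drug'].str.lower().map(drug_classes).fillna('unknown')
print(toy)
```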

<font color='violet'> Explore conditions

In [None]:
len(df.condition.unique())

In [None]:
freq_conditions = df.condition.value_counts().head(10)
freq_conditions

In [None]:
# Which of the most common drugs are used to treat which of the most common conditions?

freq_drugs = ['Sertraline', 'Escitalopram', 'Citalopram', 'Bupropion', 'Lexapro', 
             'Venlafaxine', 'Varenicline', 'Zoloft', 'Quetiapine', 'Clonazepam']
freq_conditions = ['depression', 'anxiety', 'bipolar', 'addiction', 'insomnia', 'hrt',
                  'schizophrenia', 'ocd', 'other', 'schizoaffective disorder']
freq_drug_conditions = df[df['drug'].isin(freq_drugs) & df['condition'].isin(freq_conditions)]

freq_drug_conditions.head()

In [None]:
freq_combo_summary = freq_drug_conditions.pivot_table(index='condition', columns='drug', 
                                                    aggfunc='count', values='review')
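# Reorder (rather than relabel) the alphabetically sorted pivot columns to match freq_drugs
freq_combo_summary = freq_combo_summary.reindex(columns=freq_drugs)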
freq_combo_summary = freq_combo_summary.sort_values(by=freq_drugs, ascending=False)
freq_combo_summary

In [None]:
# Visualize distribution of reviews across common conditions & drugs with a heatmap.
sns.heatmap(freq_combo_summary, cmap='gray_r')
plt.show()

<font color='violet'> Which drugs & conditions have the highest ratings?

In [None]:
top_drugs = df.groupby(['drug'])['rating'].mean().sort_values(ascending=False)
top_drugs.head(80)

In [None]:
successful_conditions = df.groupby(['condition'])['rating'].mean().sort_values(
    ascending=False)
successful_conditions.head(10)

In [None]:
top_combo = df.groupby(['drug', 'condition'])['rating'].mean().sort_values(
    ascending=False)
top_combo.head(140)

In [None]:
top_freq_drugs = set(freq_drugs).intersection(set(top_drugs.index[0:79]))
top_freq_drugs

In [None]:
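# top_combo has a (drug, condition) MultiIndex, so compare against its drug level only
top_freq_drugs_by_condition = set(freq_drugs).intersection(
    set(top_combo.index[0:138].get_level_values('drug')))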
top_freq_drugs_by_condition

In [None]:
successful_freq_conditions = set(freq_conditions).intersection(set(
    successful_conditions.index[0:10]))
successful_freq_conditions

Anxiety, addiction, and ocd are conditions for which there are many drug reviews and high rates of success with treatment. 

The 10 most frequently-reviewed drugs have nothing in common with the 79 perfectly-rated drugs or the 138 drug-condition combinations that are rated perfectly. My hypothesis is that the perfectly-rated drugs may have only one or very few reviews each, which is how their average rating is so high. 

<font color='violet'> Explore distribution of ratings

In [None]:
sns.histplot(df.rating)
plt.axvline(df.rating.mean(), color='orange')
plt.axvline(df.rating.median(), color='violet')
plt.show()

More participants gave their drug a high rating than a low one, and fewer still gave a mediocre rating.

Now, find out: of drugs that received an average rating of 10, how many reviews is that mean derived from?

In [None]:
perfect_avg_rating = set(top_drugs.index[0:79])
df[df.drug.isin(perfect_avg_rating)].value_counts(subset='drug')[0:17]

Of the 79 drugs with perfect average ratings, only 16 of them had more than one rating, and only 3 of them had more than 3 ratings. Given that there are about 50500 ratings and 650 drugs, the average number of ratings per drug is about 80, so the perfectly-rated drugs definitely seem like outliers. I'd not be surprised if a model eventually has a difficult time correctly classifying the extreme ratings, but for now I'll just keep this in mind and see what happens. 
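
One way to see the mean rating and the number of ratings behind it side by side is to aggregate both at once. This is a sketch on toy data; the `min_reviews` threshold is a hypothetical parameter, not a value taken from the analysis above.

```python
import pandas as pd

# Toy stand-in for the review data.
toy = pd.DataFrame({
    'drug': ['a', 'a', 'a', 'b', 'c', 'c'],
    'rating': [8.0, 6.0, 10.0, 10.0, 10.0, 9.0],
})

# Mean rating and number of ratings per drug in a single table.
summary = toy.groupby('drug')['rating'].agg(['mean', 'count'])

# Hypothetical cutoff: only rank drugs with at least this many ratings.
min_reviews = 2
reliable = summary[summary['count'] >= min_reviews].sort_values('mean', ascending=False)
print(summary)
print(reliable)
```

Here drug 'b' has a perfect mean built on a single rating, so it drops out of `reliable` once the cutoff is applied.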

<font color='violet'> What does the distribution of ratings look like for drugs with at least 20 ratings (20 = 25% of the average number of ratings)?

In [None]:
df['ratings_count'] = df.groupby(['drug'])['drug'].transform('count')
df.sort_values('ratings_count')

In [None]:
twenty_plus_ratings = df[df.ratings_count>=20]
twenty_plus_ratings.sort_values('ratings_count')

In [None]:
sns.histplot(twenty_plus_ratings.rating)
plt.axvline(twenty_plus_ratings.rating.mean(), color='orange')
plt.axvline(twenty_plus_ratings.rating.median(), color='violet')
plt.show()

This distribution doesn't look much different from that which includes all reviews, which tells me the outliers aren't affecting the distribution too much. So it's probably a good idea to keep all rows in the dataset when moving forward.  

<font color='violet'> What is the relationship between date and reviews?

In [None]:
df.date.unique()

In [None]:
df.date = df.date.replace('0', np.nan)
df.date

In [None]:
df.date = pd.to_datetime(df.date)
df.date

In [None]:
# Are there more or fewer reviews from any given point in time?
df['count_by_date'] = df.groupby(['date'])['date'].transform('count')
unique_dates = df.drop_duplicates(subset=['date'])
unique_dates.head()

In [None]:
df.describe()

In [None]:
sns.lineplot(data=unique_dates, x='date', y='count_by_date')
plt.show()

There was an increase in the number of reviews submitted daily around 2015. 

<font color='violet'> Do ratings change with time?

In [None]:
sns.lineplot(data=df, x='date', y='rating')
plt.show()

This looks like something other than total random noise, like maybe there were some current events happening around 2009 and again in 2015 that led people to start rating their psych meds less favorably. There may also be some annual seasonality. Whatever the reason, it seems that date could be correlated with rating and should not be removed. Process this column further to better understand the relationship between date and rating. 

In [None]:
rating_date = df[['date', 'rating']].dropna().set_index('date')
rating_date

In [None]:
downsample_week = rating_date.resample('W').mean()
downsample_week.plot()

In [None]:
rolling_mean = downsample_week.rolling(window=30).mean()
rolling_mean.plot()

In [None]:
# Check for seasonality
index_month = rating_date.index.month
rating_by_month = rating_date.groupby(index_month).mean()
rating_by_month.plot()

It seems that seasonal variation is less extreme than variation by year (average range of 7.3-7.55 instead of 6.5-9.0), with people rating their drugs as being, on average, very slightly less effective in July-November. 

It is even clearer now that weekly average ratings of drugs in these studies did in fact dip in 2009 and again in 2015. The cause of these trends isn't so important (though I have some guesses as to what was happening in 2009 and 2015). The date, though, will be a valuable variable alongside narrative text features when predicting ratings, so as to compare like with like with respect to current events. 

Now, move from the more quantitative data into the narrative column, cleaning up language therein. 

<font color='violet'> Parse Language
    
The review column contains narratives where patients explain their experience with a prescription psych med. Language features from that column need to be extracted or created after any necessary cleaning of strings has been completed. Do any preparations necessary to conduct sentiment analysis. I'll be drawing quite a bit from the following resource: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

In [None]:
# View a sample string. Search for special characters.
df.review[0]

In [None]:
df[df['review'].str.find("é")!=-1].head(1)

In [None]:
df[df['review'].str.find("ä")!=-1].head(1)

<font color='violet'> Remove Most Special Characters

...if there are any. I haven't been able to find the most common accented characters (é or ä) in the data, but I'm running the removal just in case. 

In [None]:
# This function works in the test suite, but there may not be examples in the data
df['review'] = df['review'].apply(remove_accented_chars)
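
The local `remove_accented_chars` lives in `src/nlp/parse.py` and isn't shown here. As a rough idea of what such a function can look like, here is a sketch using only the standard library; `remove_accents_sketch` is a hypothetical name and may differ from the real implementation.

```python
import unicodedata

def remove_accents_sketch(text):
    """Replace accented letters with their unaccented ASCII base letters."""
    # NFKD splits a character like 'é' into 'e' plus a combining accent mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors ignored drops the combining marks.
    # Note that it also drops any other non-ASCII character.
    return decomposed.encode('ascii', 'ignore').decode('utf-8', 'ignore')

print(remove_accents_sketch('café über'))
```

It could be applied the same way as above, with `df['review'].apply(remove_accents_sketch)`.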

<font color='violet'> Expand Contractions

In [None]:
# First, find some to confirm it works. 
df[df['review'].str.find("'")!=-1].head(1)

In [None]:
df.review[9]

In [None]:
df['review'] = df['review'].apply(contractions.fix)
df.review[9]

"Don't" got changed to "do not"; contraction expansion worked. 

<font color='violet'> Remove punctuation/special characters where appropriate. 
    
Try to keep those correlated with sentiment: ! ? # % ;) :( .  Again, first find an example to confirm it works.

In [None]:
df[df['review'].str.find("!")!=-1].head(1)

In [None]:
df.review[6]

In [None]:
# Use function from package I made to get rid of most of the punctuation.
strip_most_punc(df, 'review')
df.review[6]
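
`strip_most_punc` comes from my local package and its code isn't shown in this notebook. The sketch below illustrates the general idea with a hypothetical function that works on a single string: remove all ASCII punctuation except the characters I want to keep for now (the sentiment-related ones, plus the apostrophe and emoticon characters that are handled in later steps).

```python
import re
import string

# Characters to leave in place: sentiment-related marks, the apostrophe,
# and the characters that can form emoticons.
KEEP = set("!?#%':;()")

# Everything else in string.punctuation gets stripped.
to_strip = ''.join(ch for ch in string.punctuation if ch not in KEEP)
STRIP_PATTERN = re.compile('[' + re.escape(to_strip) + ']')

def strip_most_punc_sketch(text):
    """Remove punctuation that is not in KEEP from one string."""
    return STRIP_PATTERN.sub('', text)

print(strip_most_punc_sketch('Wow!! It "worked", 100% ;) ok.'))
```

Unlike the real function, which takes the dataframe and column name, this version would be applied with `df['review'].apply(strip_most_punc_sketch)`.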

In [None]:
# left to strip are ' and : ; () where they don't appear as emoji.

df[df['review'].str.find("'")!=-1].head(1)

In [None]:
df.review[16]

In [None]:
strip_apostrophe(df, 'review')
df.review[16]

Stripping apostrophes worked. 

<font color='violet'> Zoom in on characters that are commonly used in emoji and remove them where they don't appear as part of an emoticon. 
    
Now remove :;() when they appear next to a letter, not emoji. This isn't a perfect solution, as many characters that I already removed can get used in emoji, but at least the most common emoji will be preserved. I'm not going to search for places where these appear next to numbers because my assumption is that symbols appear next to numbers more often as emoji, compared with letters which appear more often next to symbols used for basic punctuation. 
    
Row 6 from earlier has an emoji ;) as well as other ( and ) symbols. Where might I find some other : and ; to see if I'm successfully removing them?
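
The rule described above (remove `: ; ( )` only where they touch a letter) can be sketched with regex lookarounds. This is not the code of `strip_non_emoji_emoji_symbol`, which lives in my local package; the function name below is hypothetical.

```python
import re

# Match one of : ; ( ) when a letter sits directly before it or directly after it.
NEXT_TO_LETTER = re.compile(r"(?<=[A-Za-z])[:;()]|[:;()](?=[A-Za-z])")

def strip_symbols_next_to_letters(text):
    """Remove : ; ( ) where they are adjacent to a letter; leave the rest alone."""
    return NEXT_TO_LETTER.sub('', text)

print(strip_symbols_next_to_letters('I felt fine (mostly) ;) note: ok'))
```

The standalone `;)` survives because neither character touches a letter. The same rule is what makes this imperfect: an emoticon typed directly against a word, or one that contains a letter such as `:D`, loses its symbol.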

In [None]:
df[df['review'].str.find(":")!=-1].head(1)

In [None]:
df.review[13]

In [None]:
# That example has lots of weird stuff going on; deal with that eventually if necessary.
# For now I can at least see where the : is (row 2) and check if my function below deletes it
# Finally, find an example of ;

df[df['review'].str.find(";")!=-1].head(8)

In [None]:
# Yikes, I'm glad I discovered those duplicate reviews; deal with that shortly
df.review[78]

<font color='violet'> This was working several times, and I changed nothing but it stopped working. Debug later. 

In [None]:
# Note: the ; in that last example above shows up in the third to last row.
# Use a function that can remove these characters appropriately
strip_non_emoji_emoji_symbol(df,'review')

df.review[6]

In [None]:
# The ;) is intact, but all other parentheses that had been in the string from row 6 are gone!
# Check on the two other strings.
df.review[13]

In [None]:
df.review[78]

Symbol removal so far has worked well. I'm going to stop with that because I don't need perfection, just cleaner text than I started with so as to end up with fewer oddballs to deal with if I want to do something like making a bag of words. 

I do definitely want to figure out if there are a bunch of rows with duplicate reviews. It seems that what I discovered is one person may have just written one big review for all their drugs and entered it multiple times, with a different drug and rating each time. Is this behavior an outlier or are there other examples like this? 

<font color='violet'> Decide what to do about duplicated reviews. 

In [None]:
df[df.review.duplicated()==True]

Many rows actually contain duplicate reviews, each connected with multiple different drugs. Did the data start out this way, or did I make an error during wrangling?

In [None]:
drugs_dotcom_train = pd.read_csv('../data/raw/drugsComTrain_raw.tsv', sep='\t')
drugs_dotcom_test = pd.read_csv('../data/raw/drugsComTest_raw.tsv', sep='\t')
druglib_train = pd.read_csv('../data/raw/drugLibTrain_raw.tsv', sep='\t')
druglib_test = pd.read_csv('../data/raw/drugLibTest_raw.tsv', sep='\t')
psytar = pd.read_csv('../data/raw/PsyTAR_dataset_samples.csv')

In [None]:
# Make a function to help figure out what's going on 
def inspect_duplicate_reviews(df, column):
    df = df.sort_values(by=column)
    print(len(df), len(df[df[column].duplicated()==True]))
    return df[df[column].duplicated()==True].head()

# What my current working data looks like
inspect_duplicate_reviews(df, 'review')

In [None]:
# Check out each of the other raw datasets
drugs_dotcom_train.info()

In [None]:
inspect_duplicate_reviews(drugs_dotcom_train, 'review')

In [None]:
# 30% of the original reviews from that set were duplicates. 
inspect_duplicate_reviews(drugs_dotcom_test, 'review')

In [None]:
# 10% of drugs_dotcom_test was duplicates
druglib_train.info()

In [None]:
inspect_duplicate_reviews(druglib_train, 'commentsReview')

In [None]:
# Fewer of these were duplicates
psytar.info()

In [None]:
inspect_duplicate_reviews(psytar, 'comment')

This last raw dataset has about 15% duplicate values but few rows overall. 

I did go back to the wrangling notebook and don't see any errors that would have caused this. I think I just didn't notice earlier because I would expect there to be duplicates in many of the columns (drug, condition) without it being a problem at all. I may also have checked for completely duplicated rows and taken care of those. But it didn't cross my mind that the review column specifically would have duplicates across multiple drugs. 

There are enough duplicated reviews in the raw data to account for all the duplicates in my current dataframe. My best working hypothesis is that the duplicate reviews appeared more often with psych meds because people may cycle through and try many drugs and then write up one big narrative to submit. Or perhaps, they feel one way about the drug's effects and go back to change their rating later, which results in two rows varying only by rating. I may need to more closely inspect each set of duplicates and find out which drugs the reviews are actually relevant for, removing the rest of the rows. 

<font color='violet'> Remove rows with irrelevant duplicated reviews

In [None]:
# Start with just one set of duplicates and see what I find.
df.head(8)

It appears that somebody submitted the same review for vyvanse, dextroamphetamine, saizen, and zyprexa. And with vyvanse, they submitted it as being used to treat both add and adhd. And for add they gave it a rating of 9 with one submission and 10 with another. 

I can see already that this definitely pertains to vyvanse. Since the add ratings are ambiguous, I can just get rid of those and keep the row for adhd. 

In [None]:
df = df.drop(labels=[0,5])
df.head(6)

In [None]:
# Take a closer look at the full review to see if it pertains to the other drugs.
df.review[1]

In [None]:
# This only pertains to vyvanse. Drop other rows. 
df = df.drop(labels=[1,3,4])
df.head(2)

How many sets of duplicates will I need to work with? 

In [None]:
len(df[df.review.duplicated()==True]['review'].unique())

There are so many sets of duplicates, I'm going to need to find some way to automate or otherwise speed up row deletion.

One option is to group by review so that each review has a single row, with its drug/rating/condition combinations aggregated. Each set of duplicates could then be analyzed in batches, making it quicker to identify which values to keep or delete. 

In [None]:
# I no longer need the ratings counts
long_df = df.drop(columns=['ratings_count', 'count_by_date'])
long_df.head()

In [None]:
# Create a column where I can hold whether each row should be kept or deleted. 
# Work until every row is filled with a value, then delete indicated rows.
long_df['keep'] = ''
long_df.head()

<font color='violet'> Mark for keeping any rows where the name of the drug is contained in the text of the review. 

In [None]:
grouped_df = long_df.groupby(['review', 'drug']).count()
grouped_df

In [None]:
# Row indices are defined by the drug column. Gather indices for reviews to keep.
grouped_df_indices_to_keep = []

# Find if the review column contains the string from the drug column.
for row in tqdm(range(len(grouped_df.index))):
    if (grouped_df.index[row][1].lower() in grouped_df.index[row][0].lower()) == True:
        grouped_df_indices_to_keep.append(row)
        
grouped_df_indices_to_keep[:5]
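
For comparison, the same check (is the drug name contained in the review text?) can be run row by row on the long-format frame, without building a grouped index first. This is a sketch on toy data rather than a replacement for the cell above; the `'yes'` and `'z'` labels follow the convention used later in this notebook.

```python
import pandas as pd

# Toy stand-in for long_df: one review submitted for two drugs, plus one more row.
toy = pd.DataFrame({
    'review': ['Vyvanse helped me focus', 'Vyvanse helped me focus', 'No side effects'],
    'drug': ['vyvanse', 'zyprexa', 'ritalin'],
})

# Case-insensitive substring test, pairing each review with its own drug name.
mentions_drug = [drug.lower() in review.lower()
                 for review, drug in zip(toy['review'], toy['drug'])]

# 'yes' = drug is named in the review; 'z' = still unknown.
toy['keep'] = ['yes' if mentioned else 'z' for mentioned in mentions_drug]
print(toy)
```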

In [None]:
len(grouped_df_indices_to_keep)

In [None]:
# It seems many rows should be kept. Check that this worked correctly.
grouped_df.index[1]

In [None]:
# The drug name is in the review narrative. 
# Isolate just the rows to keep
grouped_to_keep = pd.MultiIndex.to_frame(grouped_df.index[grouped_df_indices_to_keep])
grouped_to_keep.head()

In [None]:
grouped_to_keep = grouped_to_keep.reset_index(drop=True)
grouped_to_keep

In [None]:
# This is the correct number of rows for reviews that contain the drug name
# Add the keep column so that this df can be merged with the original long_df
grouped_to_keep['keep'] = 'yes'
grouped_to_keep.head()

In [None]:
long_df = long_df.merge(right=grouped_to_keep, how='left', on=['review', 'drug'])
long_df

In [None]:
# This contains the correct number of rows to match the original long_df
# keep_y has the values I need for knowing which rows to keep so far

long_df = long_df.drop(columns=['keep_x'])
long_df.head()

In [None]:
long_df = long_df.rename(columns={'keep_y':'keep'})
long_df.head()

In [None]:
# Fill na in keep column to make it easier to work with later.
long_df['keep'] = long_df.keep.fillna('z')
long_df.head()

Dig further into rows where the name of the drug is not in the review. This does not necessarily mean the review isn't applicable to the associated drug. But, I'd say that if there is a review that contains a drug name, that same review should be dropped wherever it appears along with a different drug not mentioned. 

<font color='violet'> Drop rows where text doesn't contain drug name but drug name is present in the same review for a different drug. 

In [None]:
no_drug_in_review = long_df.groupby(['review', 'keep']).count().sort_values(
    by=['review', 'keep'])
no_drug_in_review

In [None]:
len(no_drug_in_review)

There are fewer indices this time because some rows have multiple drugs aggregated within the 'z' row for a review. If a review has only unknown (z) keep values, that should remain unknown for now. But if there is a yes row for the review, then that review's z's should become no's. 

Specifically, identify reviews for rows to keep. Then, since yes comes before z in the sorting, the yes row is on top in each set of rows per review. So, the row directly below each yes row can be deleted, IF it has the same review. (If it doesn't have the same review, then it should remain unknown for now). 
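
The rule above can also be expressed without relying on sort order, by asking for each review whether any of its rows is a yes. This is a sketch on toy data, offered as an alternative view of the same logic rather than the approach used in the cells below.

```python
import pandas as pd

# Toy stand-in for long_df with the keep labels assigned so far.
toy = pd.DataFrame({
    'review': ['r1', 'r1', 'r2', 'r2', 'r3'],
    'drug': ['a', 'b', 'c', 'd', 'e'],
    'keep': ['yes', 'z', 'z', 'z', 'yes'],
})

# For every row: does its review have at least one 'yes' row?
has_yes = (toy['keep'] == 'yes').groupby(toy['review']).transform('any')

# Unknown rows whose review is already claimed by a 'yes' row become 'no'.
toy.loc[has_yes & (toy['keep'] == 'z'), 'keep'] = 'no'
print(toy)
```

Review r2 has no yes row, so both of its rows stay unknown.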

In [None]:
indices_to_drop = []

# Stop one short of the end so that idx+1 always exists
for idx in tqdm(range(len(no_drug_in_review) - 1)):
    # If this row is a 'yes' and the next row has the same review, mark the next row to drop
    if (no_drug_in_review.index[idx][1] == 'yes' and no_drug_in_review.index[idx][0] == 
        no_drug_in_review.index[idx+1][0]):
        indices_to_drop.append(idx+1)

indices_to_drop[:5]

In [None]:
len(indices_to_drop)

In [None]:
# Confirm this worked correctly
no_drug_in_review.index[1]

In [None]:
no_drug_in_review.index[2]

In [None]:
# This worked correctly. Index 2 is slotted for dropping, and it has the same review as 
# index 1, which is labeled yes to keep. Now, isolate the rows to drop.

un_reviewed_to_drop = pd.MultiIndex.to_frame(no_drug_in_review.index[indices_to_drop])
un_reviewed_to_drop.head()

In [None]:
un_reviewed_to_drop = un_reviewed_to_drop.reset_index(drop=True)
un_reviewed_to_drop.head()

In [None]:
# Change keep value to no
un_reviewed_to_drop['keep'] = 'no'
un_reviewed_to_drop.head()

This can again be merged with long_df. There may be multiple drugs per "no keep" review, and that's okay; each one can be filled with no because these reviews should be dropped wherever they appear, since they already have an associated yes review that is definitely relevant to its associated drug. Wherever the new keep column says no but the old keep column says yes, the value should be yes.

In [None]:
long_df = long_df.merge(right=un_reviewed_to_drop, on='review', how='left')
long_df

In [None]:
# Now, if keep_x = yes, that's the row to keep for that review. 
# anyplace where keep_x = z but keep_y = no, the keep value should end up as no

for row in tqdm(range(len(long_df))):
    if long_df.loc[row,'keep_y'] == 'no' and long_df.loc[row,'keep_x'] == 'z':
        long_df.loc[row,'keep_x'] = 'no'

long_df[long_df.keep_y=='no']

In [None]:
# Check if this worked correctly
long_df[long_df.review == long_df.loc[122,'review']]

In [None]:
# This looks correct. The drug name is in the review associated with the yes row
# The matching review now says no in keep_x. I can delete the column keep_y

long_df = long_df.drop(columns=['keep_y'])
long_df.head()

In [None]:
long_df = long_df.rename(columns={'keep_x':'keep'})
long_df.head()

In [None]:
# What remains? How many rows still have a keep value of z?
len(long_df[long_df.keep=='z'])

<font color='violet'> Deal with any reviews that are just duplicates related to multiple conditions.  

In [None]:
grouped_by_condition = long_df.groupby(['review', 'condition']).count()
grouped_by_condition

In [None]:
# Those duplicated by condition would show up where 2 subsequent indices have the same review.
indices_duplicated_by_condition = []
for idx in tqdm(range(len(grouped_by_condition))):
    # The final idx has no idx+1 to compare with, so catch only that IndexError
    try:
        if grouped_by_condition.index[idx][0] == grouped_by_condition.index[idx+1][0]:
            indices_duplicated_by_condition.append(idx)
            indices_duplicated_by_condition.append(idx+1)
    except IndexError:
        pass
        
indices_duplicated_by_condition[:5]    

In [None]:
# Take a look at the rows I've identified
duplicated_by_condition = pd.MultiIndex.to_frame(grouped_by_condition.index[
    indices_duplicated_by_condition])
duplicated_by_condition

Here, I think it would make sense to just choose one of the conditions to keep. If there were many pairs like this, I might create columns "condition1" and "condition2", but if "condition2" would only have 4 values out of tens of thousands of rows, that seems like a waste. Instead, I'll go ahead and just keep the row for the less-common condition, so as to balance rather than further un-balance the condition column. 

First I'll need a dictionary mapping each condition to its frequency rank.

In [None]:
conditions_rank = long_df.condition.value_counts().to_frame()
conditions_rank.head()

In [None]:
conditions_rank['rank'] = range(len(conditions_rank))
conditions_rank.head()

In [None]:
conditions_rank = conditions_rank.drop(columns=['condition']).reset_index()
conditions_rank.head()

In [None]:
conditions_rank = conditions_rank.rename(columns={'index':'condition'})
conditions_rank.head()

In [None]:
conditions_rank = conditions_rank.set_index('condition').to_dict()['rank']
conditions_rank
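
The column that `value_counts().to_frame()` produces is named differently across pandas versions, so the drop/rename steps above may need adjusting. A more compact sketch that avoids those column names builds the same kind of dictionary straight from the sorted index. It uses toy data and a hypothetical variable name.

```python
import pandas as pd

# Toy stand-in for long_df.condition.
conditions = pd.Series(['depression', 'anxiety', 'depression',
                        'ocd', 'depression', 'anxiety'])

# value_counts sorts from most to least common, so position in the index is the rank.
counts = conditions.value_counts()
conditions_rank_sketch = {condition: rank for rank, condition in enumerate(counts.index)}
print(conditions_rank_sketch)
```

Rank 0 is the most common condition, so the highest rank within a set of duplicates points to the least common condition.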

In [None]:
# Prepare dataframe of just reviews that have multiple conditions attached
duplicated_by_condition = duplicated_by_condition.reset_index(drop=True)
duplicated_by_condition.head()

In [None]:
# Get this in a format where the conditions for each review can be compared
for row in range(len(duplicated_by_condition)):
    duplicated_by_condition.loc[row,'rank'] = conditions_rank[duplicated_by_condition.loc[
        row, 'condition']]

duplicated_by_condition.head()

In [None]:
# Identify max rank as the condition to keep for each review
condition_to_keep = duplicated_by_condition.groupby(['review']).max()
condition_to_keep.head()

In [None]:
# This is the wrong condition listed, but the correct condition rank that should be kept.

condition_to_keep = condition_to_keep.drop(columns=['condition'])
condition_to_keep.head()

In [None]:
# Change rank to int type
condition_to_keep['rank'] = condition_to_keep['rank'].astype(int)
condition_to_keep.head()

In [None]:
# Create regular df to iterate through:
condition_to_keep = condition_to_keep.reset_index()
condition_to_keep.head()

In [None]:
# Refill conditions 
for row in range(len(condition_to_keep)):
    for key, value in conditions_rank.items():
        if condition_to_keep.loc[row,'rank'] == value:
                condition_to_keep.loc[row,'condition'] = key
            
condition_to_keep.head()

In [None]:
# These conditions should have a keep value of 'yes'
condition_to_keep['keep'] = 'yes'
condition_to_keep.head()

In [None]:
# Merge with duplicated_by_condition so as to be able to mark remaining rows with "no"
duplicated_by_condition = duplicated_by_condition.merge(condition_to_keep, how='left')
duplicated_by_condition

In [None]:
duplicated_by_condition = duplicated_by_condition.drop(columns=['rank']).fillna('no')
duplicated_by_condition.head()

In [None]:
# Now duplicated_by_condition can be merged with the rest of the long_df
long_df = long_df.merge(duplicated_by_condition, on=['review', 'condition'], how='left')
long_df

In [None]:
# How did that work? What does the first review with duplicated conditions look like?
long_df[long_df.review.str.contains('After many months spent being given ten')]

In [None]:
# I'd previously mis-labeled some rows. 
long_df.sort_values(by=['keep_y', 'keep_x']).head(7)

In [None]:
# Wherever keep_y is not null, that is the value that should be kept. 
# Otherwise keep the value of keep_x

long_df = long_df.reset_index(drop = True)
long_df.head()

In [None]:
for row in tqdm(range(len(long_df))):
    if long_df.loc[row,'keep_y'] == 'yes' or long_df.loc[row,'keep_y'] == 'no':
        long_df.loc[row,'keep'] = long_df.loc[row,'keep_y']
    else: long_df.loc[row,'keep'] = long_df.loc[row,'keep_x']
        
long_df.head()
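
The same "take keep_y where it exists, otherwise fall back to keep_x" rule can be written as a single column operation. This is a sketch on toy data, and it assumes keep_y holds only 'yes', 'no', or a missing value, as it does after the left merge above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged frame.
toy = pd.DataFrame({
    'keep_x': ['yes', 'z', 'no'],
    'keep_y': [np.nan, 'no', 'yes'],
})

# Prefer keep_y; wherever it is missing, use keep_x instead.
toy['keep'] = toy['keep_y'].fillna(toy['keep_x'])
print(toy)
```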

In [None]:
long_df.sort_values(by=['keep_y', 'keep_x']).head()

In [None]:
# This looks correct so far. Clean up. 
long_df = long_df.drop(columns=['keep_x', 'keep_y'])
long_df.head()

Now, everywhere there is a duplicated review, a row for that review is being kept if it contains the drug name and it is submitted for the least-common condition. Reviews are marked for removal if they don't contain the name of the drug but their duplicate does. They are also marked for removal if submitted for a more-common condition where the same review is also submitted for a less-common condition. 

But, wherever there is no drug name at all in the review, duplicates likely still exist across multiple drugs. This may be a place where new columns for drug1, drug2, drug3 may be necessary.

<font color='violet'> Deal with any remaining reviews duplicated across multiple drugs. But before going any further, mark all non-duplicated reviews to keep

In [None]:
long_df.loc[(long_df.review.duplicated(keep=False)==False),'keep'] = 'yes'
long_df[long_df.review.duplicated(keep=False)==False]

In [None]:
# How many reviews remain to deal with?
len(long_df[(long_df.review.duplicated(keep=False)==True) & (long_df.keep=='z')])

In [None]:
# What's the highest number of drugs associated with a single review?
row_count = long_df.groupby(['review']).count()
row_count.sort_values(by='drug', ascending=False)

The review "Good" is associated with 24 different drugs. Add columns drug0...drug23 wherever a review has more than one associated drug. First, sort drugs by prevalence, then enumerate drugs per review so that column can then become multiple new columns. Finally, create a pivot table and fill values of new drug_n columns with drug names.

In [None]:
# Go back and sort drugs according to how common they are so they're enumerated that way
by_drug = long_df.groupby('drug').count().sort_values(by='rating', ascending=False)
by_drug

In [None]:
by_drug['drug_prevalance'] = range(len(by_drug))
# Keep only the prevalence rank; selecting it directly avoids a KeyError if drug_n
# has not been created yet
by_drug = by_drug[['drug_prevalance']].reset_index()
by_drug

In [None]:
# Merge with long_df so that drugs have their prevalence values associated
long_df = long_df.merge(by_drug, how='left')
long_df

In [None]:
# Create drug_n to enumerate drugs per review
long_df['drug_n'] = long_df.sort_values(by='drug_prevalance').groupby(['review']).cumcount()
long_df.sort_values(by=['review', 'drug_n'])
long_df.drug_n.max()

In [None]:
# That appears to have worked. drug_n should contain values 0:23, for max 24 duplicates/review
# Now fill in values for some new drug_n columns
wid_df = 
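
The cell above is unfinished. One way the pivot step described earlier might look is sketched below on toy data; the frame, the `wide` name, and the choice of `DataFrame.pivot` are assumptions, not the code I eventually used. It also ignores the rating, condition, and date columns, which would still need a rule for how to combine them per review.

```python
import pandas as pd

# Toy stand-in for long_df after drug_n has been assigned.
toy = pd.DataFrame({
    'review': ['Good', 'Good', 'Good', 'Helped a lot'],
    'drug': ['a', 'b', 'c', 'a'],
    'drug_n': [0, 1, 2, 0],
})

# One row per review, one column per drug_n value, filled with drug names.
wide = toy.pivot(index='review', columns='drug_n', values='drug')

# Rename columns 0, 1, 2 to drug0, drug1, drug2 and restore review as a column.
wide.columns = [f'drug{n}' for n in wide.columns]
wide = wide.reset_index()
print(wide)
```

Reviews with fewer drugs than the maximum get missing values in the unused columns.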

<font color='violet'> Lemmatize text and do further NLP & EDA in a new notebook once this first round of basic text cleaning is complete. 
    
Resources with tips for effective EDA visualization with NLP:

https://medium.com/plotly/nlp-visualisations-for-clear-immediate-insights-into-text-data-and-outputs-9ebfab168d5b
    
https://www.numpyninja.com/post/nlp-text-data-visualization
    
https://www.kaggle.com/code/sainathkrothapalli/nlp-visualisation-guide
    
https://medium.com/acing-ai/visualizations-in-natural-language-processing-2ca60dd34ce
    
https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a
    
https://towardsdatascience.com/getting-started-with-text-nlp-visualization-9dcb54bc91dd
    
https://www.kaggle.com/code/mitramir5/nlp-visualization-eda-glove
    
https://medium.com/analytics-vidhya/how-to-begin-performing-eda-on-nlp-ffdef92bedf6
    
https://inside-machinelearning.com/en/eda-nlp/
    
https://towardsdatascience.com/fundamental-eda-techniques-for-nlp-f81a93696a75
    
https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools
    
https://www.kdnuggets.com/2019/05/complete-exploratory-data-analysis-visualization-text-data.html
    
