# <font color='violet'> Exploration & Parsing
Using data wrangled here: https://github.com/fractaldatalearning/psychedelic_efficacy/blob/main/notebooks/1-kl-wrangle-tabular.ipynb

In [1]:
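# Uncomment to install missing dependencies (the second command needs `import sys` to have run first)
# ! pip install tqdm
# !{sys.executable} -m pip install contractions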

In [2]:
import pandas as pd
import sys
import contractions
import re
import string
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

In [3]:
# prepare to add local python functions; import modules from src directory
src = '../src'
sys.path.append(src)

# import local functions
from nlp.parse import remove_accented_chars, strip_most_punc, strip_apostrophe

In [4]:
df = pd.read_csv('../data/interim/studies_clean.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50652 entries, 0 to 50651
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  50652 non-null  int64  
 1   drug        50652 non-null  object 
 2   rating      50652 non-null  float64
 3   condition   50652 non-null  object 
 4   review      50652 non-null  object 
 5   date        50652 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 2.3+ MB


In [5]:
# Drop "Unnamed" column; it's redundant with the index
df = df.drop(columns=['Unnamed: 0'])
df.head(2)

Unnamed: 0,drug,rating,condition,review,date
0,vyvanse,9.0,add,I had began taking 20mg of Vyvanse for three m...,0
1,dextroamphetamine,8.0,add,Switched from Adderall to Dexedrine to compare...,0


<font color='violet'> Explore each column, starting with 'drug'

In [6]:
# How many drugs are in the data? 
len(df.drug.unique())

651

In [11]:
# Which drugs are most commonly reviewed?
df.drug.value_counts().head(10)

Sertraline      1855
Escitalopram    1714
Citalopram      1298
Bupropion       1278
Lexapro         1232
Venlafaxine     1212
Varenicline     1082
Zoloft          1075
Quetiapine      1065
Clonazepam      1004
Name: drug, dtype: int64

<font color='violet'> Parse Language
    
The `review` column contains narratives in which patients describe their experience with a prescription psychiatric medication. Before language features can be extracted or engineered from that column, the strings need to be cleaned and prepared for sentiment analysis. I'll be drawing quite a bit from the following resource:
- https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

In [None]:
# View a sample string. Search for special characters.
df.review[0]

In [None]:
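# Look for a review containing é (literal match, not a regex)
df[df['review'].str.contains("é", regex=False)].head(1)

In [None]:
# Look for a review containing ä
df[df['review'].str.contains("ä", regex=False)].head(1)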

<font color='violet'> Remove Special Characters

...if there are any. I haven't found either of the two common accented characters I searched for (é and ä) in the data, but I'm running the removal step anyway, just in case.
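
Searching for one character at a time can miss things, so a broader check may be useful here. The following is a minimal sketch that counts every non-ASCII character across a collection of strings. The helper name `count_non_ascii` and the sample strings are hypothetical and are not part of this notebook or its `src` package.

```python
from collections import Counter

def count_non_ascii(texts):
    """Count every non-ASCII character across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # ord(ch) > 127 marks anything outside the basic ASCII range
        counts.update(ch for ch in text if ord(ch) > 127)
    return counts

# Made-up strings for illustration (not taken from the dataset)
sample = ["caf\u00e9 au lait", "na\u00efve d\u00e9j\u00e0 vu", "plain ascii"]
print(count_non_ascii(sample).most_common())
```

On the notebook's data this could be called as `count_non_ascii(df['review'])`; an empty result would suggest that the accent-removal step has nothing to change.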

In [None]:
# This function works in the test suite, but there may not be examples in the data
df['review'] = df['review'].apply(remove_accented_chars)
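
The implementation of `remove_accented_chars` lives in `src/nlp/parse.py` and is not shown here. As a rough sketch of how this kind of function is commonly written (an assumption, not necessarily what the local function does), Unicode normalization can split accented letters into a base letter plus a combining mark, and the marks can then be dropped. The name `remove_accents_sketch` is hypothetical.

```python
import unicodedata

def remove_accents_sketch(text):
    """Replace accented letters with their unaccented ASCII base letters."""
    # NFKD decomposes an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors ignored drops the combining marks
    # (and any other character that has no ASCII form)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(remove_accents_sketch("caf\u00e9 na\u00efve"))
```

Note that this approach also silently deletes any other non-ASCII character, such as emoji, which may or may not be desirable ahead of sentiment analysis.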

<font color='violet'> Expand Contractions

In [None]:
# First, find a review that contains an apostrophe so the expansion can be confirmed.
df[df['review'].str.contains("'", regex=False)].head(1)

In [None]:
df.review[9]

In [None]:
df['review'] = df['review'].apply(contractions.fix)
df.review[9]

"Don't" got changed to "do not"; contraction expansion worked. 

<font color='violet'> Next, remove punctuation/special characters where appropriate. 
    
Try to keep the characters that tend to carry sentiment: `!`, `?`, `#`, `%`, and emoticons such as `;)` and `:(`. Again, first find an example to confirm the step works.
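
The local `strip_most_punc` function is not shown in this notebook. As an illustration of the general idea only, the sketch below removes every character in `string.punctuation` except a configurable keep-set. The function name `strip_punct_except` and the default keep-set are assumptions chosen to mirror the characters listed above, plus the apostrophe, which is handled in a later step.

```python
import string

def strip_punct_except(text, keep="!?#%;:()'"):
    """Remove ASCII punctuation from text, except characters listed in keep."""
    # Build the set of punctuation characters that should be deleted
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    # str.translate with a deletion table removes them in one pass
    return text.translate(str.maketrans("", "", to_remove))

print(strip_punct_except("Great, really great!! 100% better. :)"))
```

Applied per review, this could look like `df['review'].apply(strip_punct_except)`. Keeping `;`, `:`, `(` and `)` at this stage leaves emoticons intact, at the cost of a follow-up step to remove those characters where they are ordinary punctuation.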

In [None]:
df[df['review'].str.contains("!", regex=False)].head(1)

In [None]:
df.review[6]

In [None]:
strip_most_punc(df, 'review')
df.review[6]

In [None]:
# Still left to strip: apostrophes, plus : ; ( ) wherever they are not part of an emoticon.

df[df['review'].str.contains("'", regex=False)].head(1)

In [None]:
df.review[16]

In [None]:
strip_apostrophe(df, 'review')
df.review[16]

In [None]:
# Apostrophes stripped successfully. Move on to the final characters.

# Use a regex to identify : ; ( ) when they appear next to a letter rather than in an emoticon.
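
One possible way to do this, offered as a sketch rather than a finished solution: match emoticons and stray characters in a single pattern, keep the emoticon matches, and replace everything else with a space. The function name `strip_stray_emoticon_chars` is hypothetical, and the rule that an emoticon must not directly follow a letter or digit is an assumption based on the comment above.

```python
import re

# Group 1: an emoticon such as :) ;) :( or :-) that does NOT directly follow
# a letter or digit. Second alternative: any other single : ; ( or ) character.
EMOTICON_OR_STRAY = re.compile(r"((?<![A-Za-z0-9])[:;]-?[()])|[:;()]")

def strip_stray_emoticon_chars(text):
    """Remove : ; ( ) unless they form an emoticon not attached to a word."""
    # Keep the emoticon when group 1 matched; otherwise substitute a space
    cleaned = EMOTICON_OR_STRAY.sub(lambda m: m.group(1) or " ", text)
    # Collapse the extra whitespace left behind by the substitutions
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(strip_stray_emoticon_chars("Side effects (nausea): mild; felt better :)"))
```

A known limitation of this rule is that an emoticon typed directly after a word, as in `works:)`, is treated as ordinary punctuation and removed. Reversed emoticons such as `(:` are not recognized either.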



<font color='violet'> Lemmatize texts using spaCy.

<font color='violet'> More EDA

- https://medium.com/plotly/nlp-visualisations-for-clear-immediate-insights-into-text-data-and-outputs-9ebfab168d5b
- https://www.numpyninja.com/post/nlp-text-data-visualization
- https://www.kaggle.com/code/sainathkrothapalli/nlp-visualisation-guide
- https://medium.com/acing-ai/visualizations-in-natural-language-processing-2ca60dd34ce
- https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a
- https://towardsdatascience.com/getting-started-with-text-nlp-visualization-9dcb54bc91dd
- https://www.kaggle.com/code/mitramir5/nlp-visualization-eda-glove
- https://medium.com/analytics-vidhya/how-to-begin-performing-eda-on-nlp-ffdef92bedf6
- https://inside-machinelearning.com/en/eda-nlp/
- https://towardsdatascience.com/fundamental-eda-techniques-for-nlp-f81a93696a75
- https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools
- https://www.kdnuggets.com/2019/05/complete-exploratory-data-analysis-visualization-text-data.html
