# <font color='violet'> Continue Language Processing, Continue Deeper EDA
    
Using prescription drug review data analyzed and parsed here: https://github.com/fractaldatalearning/psychedelic_efficacy/blob/main/notebooks/3-kl-studies-early-eda-parse.ipynb

In [7]:
# ! pip install spacy
# ! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import pandas as pd
import spacy

In [3]:
df = pd.read_csv('../data/interim/studies_early_parsing.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31559 entries, 0 to 31558
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     31559 non-null  int64  
 1   rating         31559 non-null  float64
 2   condition      31559 non-null  object 
 3   review         31559 non-null  object 
 4   date           31451 non-null  object 
 5   drug0          31559 non-null  object 
 6   drug1          18992 non-null  object 
 7   drug2          32 non-null     object 
 8   drug3          23 non-null     object 
 9   drug4          12 non-null     object 
 10  drug5          11 non-null     object 
 11  drug6          7 non-null      object 
 12  drug7          5 non-null      object 
 13  drug8          3 non-null      object 
 14  drug9          2 non-null      object 
 15  drug10         2 non-null      object 
 16  drug11         2 non-null      object 
 17  drug12         2 non-null      object 
 18  drug13

In [4]:
# Drop the unnamed index column and the columns used for earlier EDA that aren't needed here.
df = df.drop(columns = ['Unnamed: 0', 'ratings_count', 'count_by_date'])
df.head(3)

Unnamed: 0,rating,condition,review,date,drug0,drug1,drug2,drug3,drug4,drug5,drug6,drug7,drug8,drug9,drug10,drug11,drug12,drug13,drug14,drug15
0,9.0,add,I had began taking 20mg of Vyvanse for three m...,,vyvanse,,,,,,,,,,,,,,,
1,8.0,add,Switched from Adderall to Dexedrine to compare...,,dextroamphetamine,,,,,,,,,,,,,,,
2,8.0,adhd,I have only been on Vyvanse for 2 weeks I sta...,,vyvanse,,,,,,,,,,,,,,,


<font color='violet'> Lemmatize text & Remove Stopwords

In [9]:
nlp = spacy.load('en_core_web_sm')

# Lemmatize each review and drop spaCy's built-in English stopwords (token.is_stop).
df['stops_removed'] = df.review.apply(
    lambda text: " ".join(token.lemma_ for token in nlp(text) if not token.is_stop))

df['stops_removed'].head()

0    begin take 20 mg Vyvanse month surprised find ...
1    switch Adderall Dexedrine compare effect Dexed...
2    Vyvanse 2 week   start 40 mg 60 mg week probab...
3    1 subcutaneous injection somatropin abdoman in...
4    ss ss diseaseLewy body Syndrome Demenia take r...
Name: stops_removed, dtype: object

In [11]:
# Check to see what happened with everything, including. There was one in row 6. 
df.stops_removed[6]

'far throwing stop headache come food look good eat craving easy diet ;) pass final amazing drug make strattera look like tylenol ! throw alot bad headache twitch crazy heart beat loss appetite   plus eye   happy thought   like wow soo beautiful reason pop head alot   negative thought worry homework constantly fine need switch straterra depressed quiet sleep like hour night drug horrible switch Vyvanse time take like anti deppresant normal happy hyper focus study start comme bad dry heave TERRIBLE HEADACHE hour throw irritated twitch ALOT heart beat fast heavy second day body start get use anymore problem'

I can see that each word is a basic lemma, and stopwords are removed successfully. 

Since I didn't want to simply strip all punctuation initially, I know some of it remains. Now, I should find any remaining punctuation and decide what to do with it on a more case-by-case basis. 

<font color='violet'> Deal with remaining punctuation

In [26]:
# Collect every non-alphabetic character (punctuation, digits, whitespace) left in the text.
punctuation = {char for char in df.stops_removed.str.cat(sep=' ') if not char.isalpha()}
punctuation

{'\t',
 '\n',
 '\r',
 ' ',
 '!',
 '#',
 '$',
 '%',
 '(',
 ')',
 '+',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '=',
 '?',
 '\\',
 '`',
 '\x7f'}
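
The set shows which non-alphabetic characters remain, but not how often each one occurs. A minimal sketch, assuming a hypothetical helper `count_non_alpha` (not part of the original notebook), tallies them with `collections.Counter` so that rare debris can be told apart from frequent, possibly meaningful symbols:

```python
from collections import Counter

def count_non_alpha(texts):
    """Count every non-alphabetic character across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(char for char in text if not char.isalpha())
    return counts

# Toy reviews standing in for df.stops_removed (illustrative data only).
sample = ["take 20 mg ;) amazing !", "bull \t19 April", "didn#039 t care"]
counts = count_non_alpha(sample)
print(counts.most_common(3))
```

On the real data this could be called as `count_non_alpha(df.stops_removed)`, since iterating over a pandas Series yields its values.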

In [45]:
# Identify obviously unwanted symbols. Some of the other symbols above still carry meaning.
replacements = ['\t', '\n', '\r', '\\', '`', '\x7f']

# Find where each of these occurs, to check whether removing them works.
df[df.stops_removed.str.find('\t')!=-1].head(1)

Unnamed: 0,rating,condition,review,date,drug0,drug1,drug2,drug3,drug4,drug5,...,drug7,drug8,drug9,drug10,drug11,drug12,drug13,drug14,drug15,stops_removed
1600,1.0,depression,bull \t19 April 2016\r\r\n\r\r\nBegan initial ...,2016-04-22,duloxetine,,,,,,...,,,,,,,,,,bull \t 19 April 2016 \r\r\n\r\r\n begin initi...


In [29]:
df[df.stops_removed.str.find('\\')!=-1].head(1)

Unnamed: 0,rating,condition,review,date,drug0,drug1,drug2,drug3,drug4,drug5,...,drug7,drug8,drug9,drug10,drug11,drug12,drug13,drug14,drug15,stops_removed
85,9.0,add,Took 20mg 3x a day Immediate release worked be...,,adderall,,,,,,...,,,,,,,,,,take 20 mg 3x day Immediate release work well ...


In [33]:
df.stops_removed[85]

'take 20 mg 3x day Immediate release work well extended version control med taper needsschedule day Dry mouth lead dental cavity desire smoke increase Jawhandfoot clench Loss appetite great keep thin force eatdrink Ability focus motivation energy stamina increase feeling wellbeing black hole depression sleep improve trouble go sleep quell disturbing\\busy dream leave exhausted morning lie awake night try slow calm mind'

In [30]:
df[df.stops_removed.str.find('`')!=-1].head(1)

Unnamed: 0,rating,condition,review,date,drug0,drug1,drug2,drug3,drug4,drug5,...,drug7,drug8,drug9,drug10,drug11,drug12,drug13,drug14,drug15,stops_removed
2535,10.0,other,I`ve had Ulcerative Colitis YOUC since I was...,2008-06-09,alprazolam,,,,,,...,,,,,,,,,,i`ve Ulcerative Colitis youc 17 age 25 thi...


In [32]:
df[df.stops_removed.str.find('\x7f')!=-1].head(1)

Unnamed: 0,rating,condition,review,date,drug0,drug1,drug2,drug3,drug4,drug5,...,drug7,drug8,drug9,drug10,drug11,drug12,drug13,drug14,drug15,stops_removed
25327,1.0,schizophrenia,I was told that Latuda is a better tablet than...,2016-10-06,lurasidone,latuda,,,,,...,,,,,,,,,,tell Latuda well tablet take old tablet stelaz...


In [35]:
df.stops_removed[25327]

'tell Latuda well tablet take old tablet stelazine 10ml day affect tired time sleep that#039 s I#039 m 60ml day effect bad one bad thing happen tonight life aggressive against\x7f daughter husband like Stelazine \r\r\n husband tear tonight attitude aggressive behavior didn#039 t feel remorse definitely good tablet Schizophrenia Bipolar depression don#039 t care hard Stelazine I#039 m go stelazine well family    '

In [51]:
# Replace these symbols with a space wherever they occur.
# Because \ is a regex escape character, loop over the symbols and use literal (regex=False) replacement.

df['stripped'] = df.stops_removed
for symbol in replacements:
    df['stripped'] = df.stripped.str.replace(symbol, ' ', regex=False)

df.stripped[1600]

'bull   19 April 2016        begin initial dose 2230 hour Felt medicine work frac12 ; hour good mood take Warfarin date drift sleep glance clock approx 2300 hrs odd feeling throat possibly close remember worry go ? feeling throat persist feel ldquo Adamrsquo s applerdquo   fluttering elevated heart rate soon fall asleep med take Atorvastatin        bull   20 April 2016         awake 0600 hour bathroom arise feel damp spot underwear pull sheet discovered Shit bed sleep Felt real dizzy drowsy Thought ldquo amp   why?rdquo   happen go shower        morning continue 0630 effect evident     1   ldquo Hot flashesrdquo   absolutely miserable   stay comfortable firstly flush sweat cold turn air conditioner fan half hour hot continue stop extremely bad migraine wake let Sensitive noise light sound     2   feeling throat wake drink water cool stop didnrsquo t     3   face tingle     4   moderately nauseous stand sit nausea continue day special K cereal bar ease morning have relief hour     5   T

In [53]:
df.stripped[85]

'take 20 mg 3x day Immediate release work well extended version control med taper needsschedule day Dry mouth lead dental cavity desire smoke increase Jawhandfoot clench Loss appetite great keep thin force eatdrink Ability focus motivation energy stamina increase feeling wellbeing black hole depression sleep improve trouble go sleep quell disturbing busy dream leave exhausted morning lie awake night try slow calm mind'

In [54]:
df.stripped[2535]

'i ve Ulcerative Colitis   youc   17 age 25 think have heart attack go emergency room diagnose panic attack give XANAX ( 1 mg   heart attack tell doctor story history youc prescribe 2 mg day I#039 m 37 year young XANAX free tyranny debilitate condition    '

In [55]:
df.stripped[25327]

'tell Latuda well tablet take old tablet stelazine 10ml day affect tired time sleep that#039 s I#039 m 60ml day effect bad one bad thing happen tonight life aggressive against  daughter husband like Stelazine     husband tear tonight attitude aggressive behavior didn#039 t feel remorse definitely good tablet Schizophrenia Bipolar depression don#039 t care hard Stelazine I#039 m go stelazine well family    '
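
The outputs above still contain runs of spaces where symbols were swapped out. A single-pass alternative to the loop can be sketched with `str.translate`, followed by a whitespace collapse. This is a sketch under stated assumptions rather than the notebook's method; `strip_symbols`, `UNWANTED`, and `TABLE` are hypothetical names introduced here:

```python
# Same symbols as the `replacements` list in the notebook.
UNWANTED = ['\t', '\n', '\r', '\\', '`', '\x7f']

# Map each unwanted character to a space; str.translate handles all of them in one pass.
TABLE = str.maketrans({symbol: ' ' for symbol in UNWANTED})

def strip_symbols(text):
    """Replace unwanted symbols with spaces, then collapse repeated whitespace."""
    return ' '.join(text.translate(TABLE).split())

print(strip_symbols('bull \t 19 April 2016 \r\r\n begin'))
print(strip_symbols('quell disturbing\\busy dream'))
```

It could be applied as `df.stops_removed.apply(strip_symbols)`. Unlike the loop, this version also collapses repeated spaces, so its output would not match the `stripped` column character for character.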

The replacement loop removed the most obviously meaningless symbols. Non-alphabetic characters yet to deal with are: ! # $ % ( ) + : ; = ? 0-9

I'm going to keep ! wherever it appears because it's so strongly indicative of sentiment. For the rest, I'd like to check each one individually and see where exactly it appears.
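
One way to check where a symbol appears is to print a few short snippets of surrounding text. The sketch below assumes a hypothetical helper `symbol_contexts` (not from the original notebook) that works on any iterable of strings:

```python
def symbol_contexts(texts, symbol, width=15, limit=5):
    """Return up to `limit` snippets showing `symbol` with `width` characters on each side."""
    snippets = []
    for text in texts:
        start = text.find(symbol)
        while start != -1 and len(snippets) < limit:
            left = max(0, start - width)
            snippets.append(text[left:start + len(symbol) + width])
            start = text.find(symbol, start + 1)
        if len(snippets) >= limit:
            break
    return snippets

# Toy strings taken from the outputs above (illustrative only).
sample = ["sleep that#039 s I#039 m 60ml day", "XANAX ( 1 mg   heart attack"]
for snippet in symbol_contexts(sample, '#', width=6):
    print(repr(snippet))
```

Calling it as `symbol_contexts(df.stripped, '#')` would show, for instance, whether `#` mostly appears in residue such as `#039`, which would suggest handling it differently from symbols like `$` or `+`.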

Resources with tips for effective EDA visualization with NLP:

https://medium.com/plotly/nlp-visualisations-for-clear-immediate-insights-into-text-data-and-outputs-9ebfab168d5b
    
https://www.numpyninja.com/post/nlp-text-data-visualization
    
https://www.kaggle.com/code/sainathkrothapalli/nlp-visualisation-guide
    
https://medium.com/acing-ai/visualizations-in-natural-language-processing-2ca60dd34ce
    
https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a
    
https://towardsdatascience.com/getting-started-with-text-nlp-visualization-9dcb54bc91dd
    
https://www.kaggle.com/code/mitramir5/nlp-visualization-eda-glove
    
https://medium.com/analytics-vidhya/how-to-begin-performing-eda-on-nlp-ffdef92bedf6
    
https://inside-machinelearning.com/en/eda-nlp/
    
https://towardsdatascience.com/fundamental-eda-techniques-for-nlp-f81a93696a75
    
https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools
    
https://www.kdnuggets.com/2019/05/complete-exploratory-data-analysis-visualization-text-data.html
    
