# <font color='violet'> Feature Engineering for Review Language 

On data that I started pre-processing here: https://github.com/fractaldatalearning/psychedelic_efficacy/blob/main/notebooks/5-kl-studies-lang-eda-preprocess.ipynb
    
Try multiple methods for engineering features out of the text of the reviews. 

In [1]:
# ! python -m spacy download en_core_web_lg
# ! pip install gensim

In [2]:
import pandas as pd
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import spacy
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from gensim.models.fasttext import FastText, load_facebook_model

# Might need to do this:
# import fasttext.util
# fasttext.util.download_model('en', if_exists='ignore')  # English
# ft = fasttext.load_model('cc.en.300.bin')

In [3]:
df = pd.read_csv('../data/interim/studies_w_sentiment.csv').drop(columns=['Unnamed: 0'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31559 entries, 0 to 31558
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rating             31559 non-null  float64
 1   condition          31559 non-null  object 
 2   review             31559 non-null  object 
 3   date               31451 non-null  object 
 4   drug0              31559 non-null  object 
 5   drug1              18992 non-null  object 
 6   review_len         31559 non-null  int64  
 7   complexity         31559 non-null  float64
 8   spell_corr         31559 non-null  object 
 9   no_stops_lemm      31558 non-null  object 
 10  no_stop_cap_lemm   31558 non-null  object 
 11  subjectivity       31559 non-null  float64
 12  original_polarity  31559 non-null  float64
dtypes: float64(4), int64(1), object(8)
memory usage: 3.1+ MB


In [4]:
# I'll just be using the cleanest text
df = df.drop(columns=['review', 'spell_corr', 'no_stops_lemm'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31559 entries, 0 to 31558
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rating             31559 non-null  float64
 1   condition          31559 non-null  object 
 2   date               31451 non-null  object 
 3   drug0              31559 non-null  object 
 4   drug1              18992 non-null  object 
 5   review_len         31559 non-null  int64  
 6   complexity         31559 non-null  float64
 7   no_stop_cap_lemm   31558 non-null  object 
 8   subjectivity       31559 non-null  float64
 9   original_polarity  31559 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 2.4+ MB


I'll start by engineering features with tfidf. Based on what I found during EDA, n-grams through n=4 produced meaningful-seeming phrases. But apart from "!!!!", the most commonly occurring quadgram, "post traumatic stress disorder", appeared in fewer than 1% of reviews. While PTSD could very likely appear later in scraped psychedelic experience reports, other quadgrams are unlikely to improve model performance enough to justify the overfitting that comes with including them. So, for tfidf, focus on n-grams of 1-3. 

Speaking of exclamation points, I'd wanted to get rid of excessive exclamations and reduce each run to just three (!!!). Do that real quick, and clean up any null values. I'd kept some null values in place earlier on until I understood the data more thoroughly and was ready to create a train-test split, and that time has come. With everything clean, modeling to test out the various feature engineering techniques will go more smoothly. 

In [5]:
# It appears as though stopword deletion left one very short review consisting of nothing
# Drop that row, then reset the index so it's in order, in order to replace !!!!. 
df[df.no_stop_cap_lemm.isnull()]

Unnamed: 0,rating,condition,date,drug0,drug1,review_len,complexity,no_stop_cap_lemm,subjectivity,original_polarity
29472,8.0,anxiety,2016-10-19,clonazepam,,7,-3.5,,0.0,0.0


In [6]:
df = df.drop(labels=29472).reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31558 entries, 0 to 31557
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rating             31558 non-null  float64
 1   condition          31558 non-null  object 
 2   date               31450 non-null  object 
 3   drug0              31558 non-null  object 
 4   drug1              18992 non-null  object 
 5   review_len         31558 non-null  int64  
 6   complexity         31558 non-null  float64
 7   no_stop_cap_lemm   31558 non-null  object 
 8   subjectivity       31558 non-null  float64
 9   original_polarity  31558 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 2.4+ MB


In [7]:
# That worked. Find some exclamation points that need replacing.
df[df.no_stop_cap_lemm.str.find('!!!!!!')!=-1]

Unnamed: 0,rating,condition,date,drug0,drug1,review_len,complexity,no_stop_cap_lemm,subjectivity,original_polarity
9255,7.0,addiction,2016-08-28,varenicline,chantix,763,4.1,want let everyone know react chantexday 7 8 so...,0.736111,-0.364583


In [8]:
df.no_stop_cap_lemm[9255]

'want let everyone know react chantexday 7 8 soon tired desire mad angry confused know ! get sleep tell need bed set bed ! think computer desk study ! 4 pm asleep 10 time wake all want sleep tired doctoream color tired thankfully hit weekend tomorrow work ! hopefully able ! worry have cigarette anymore really fierceness quit get smoke!!!!!!quit numb'

In [9]:
exclamation_replacement = {'!!!!':'!!!', '!!!!!':'!!!', '!!!!!!':'!!!',
                            '!!!!!!!':'!!!', '!!!!!!!!':'!!!'}

for row in tqdm(range(len(df))):
    str_to_reduce_exclam = df.loc[row,'no_stop_cap_lemm']
    for key, value in exclamation_replacement.items():
        str_to_reduce_exclam = str_to_reduce_exclam.replace(key,value)
    df.loc[row,'no_stop_cap_lemm'] = str_to_reduce_exclam
        
df.no_stop_cap_lemm[9255]

100%|██████████| 31558/31558 [00:22<00:00, 1383.76it/s]


'want let everyone know react chantexday 7 8 soon tired desire mad angry confused know ! get sleep tell need bed set bed ! think computer desk study ! 4 pm asleep 10 time wake all want sleep tired doctoream color tired thankfully hit weekend tomorrow work ! hopefully able ! worry have cigarette anymore really fierceness quit get smoke!!!quit numb'

That worked, move on. 
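
The chained replacements above handle the example row, but they can leave longer runs only partly reduced; a run of nine exclamation points, for instance, ends up as five. A regular expression collapses any run of four or more in one pass. A minimal sketch (the helper name `collapse_exclamations` is hypothetical, not part of the original notebook):

```python
import re

# Matches any run of four or more consecutive exclamation points
_EXCLAM_RUN = re.compile(r'!{4,}')


def collapse_exclamations(text, replacement='!!!'):
    """Replace every run of four or more '!' with exactly three."""
    return _EXCLAM_RUN.sub(replacement, text)


sample = 'quit get smoke!!!!!!quit numb'
cleaned = collapse_exclamations(sample)
print(cleaned)

# The same pattern can be applied to the whole column without a Python loop:
# df['no_stop_cap_lemm'] = df['no_stop_cap_lemm'].str.replace(r'!{4,}', '!!!', regex=True)
```

Runs of three or fewer are left untouched, so ordinary emphasis survives.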

<font color='violet'> Deal with null values. 

In [10]:
# Confirm the empty review is gone: this check should now return no rows.
df[df.no_stop_cap_lemm.isnull()]

Unnamed: 0,rating,condition,date,drug0,drug1,review_len,complexity,no_stop_cap_lemm,subjectivity,original_polarity


In [11]:
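# Caution: the empty review was already dropped in cell 6 and the index was reset, so
# this repeated drop removes a different, non-null row (31558 -> 31557 rows below).
df = df.drop(labels=29472).reset_index(drop=True)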
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31557 entries, 0 to 31556
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rating             31557 non-null  float64
 1   condition          31557 non-null  object 
 2   date               31449 non-null  object 
 3   drug0              31557 non-null  object 
 4   drug1              18991 non-null  object 
 5   review_len         31557 non-null  int64  
 6   complexity         31557 non-null  float64
 7   no_stop_cap_lemm   31557 non-null  object 
 8   subjectivity       31557 non-null  float64
 9   original_polarity  31557 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 2.4+ MB


In [12]:
# There are rows without values for drug1; just replace nan with the string "na"
df['drug1'] = df.drug1.fillna('na')
df.isnull().any()

rating               False
condition            False
date                  True
drug0                False
drug1                False
review_len           False
complexity           False
no_stop_cap_lemm     False
subjectivity         False
original_polarity    False
dtype: bool

There are missing dates. I don't want to introduce leakage by imputing missing values with the most common date overall, but I can start with the train_test_split and just impute all missing dates with the most common date from the training set. 

In [13]:
# First, turn the date column into a date type
df['date'] = pd.to_datetime(df.date)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31557 entries, 0 to 31556
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   rating             31557 non-null  float64       
 1   condition          31557 non-null  object        
 2   date               31449 non-null  datetime64[ns]
 3   drug0              31557 non-null  object        
 4   drug1              31557 non-null  object        
 5   review_len         31557 non-null  int64         
 6   complexity         31557 non-null  float64       
 7   no_stop_cap_lemm   31557 non-null  object        
 8   subjectivity       31557 non-null  float64       
 9   original_polarity  31557 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(1), object(4)
memory usage: 2.4+ MB


In [14]:
X = df.drop(columns='rating')
y = df.rating
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=17, 
                                                    stratify=y)
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22089 entries, 3172 to 30071
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   condition          22089 non-null  object        
 1   date               22010 non-null  datetime64[ns]
 2   drug0              22089 non-null  object        
 3   drug1              22089 non-null  object        
 4   review_len         22089 non-null  int64         
 5   complexity         22089 non-null  float64       
 6   no_stop_cap_lemm   22089 non-null  object        
 7   subjectivity       22089 non-null  float64       
 8   original_polarity  22089 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(4)
memory usage: 1.7+ MB


In [15]:
X_train.date.value_counts(ascending=False)

2016-02-21    28
2016-01-14    24
2017-01-25    23
2017-01-18    22
2015-10-12    22
              ..
2010-01-14     1
2010-02-16     1
2008-11-12     1
2009-04-15     1
2009-03-14     1
Name: date, Length: 3503, dtype: int64

In [16]:
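# Impute with the training set's most common date (2016-02-21) so nothing leaks from the test set
most_common_date = X_train.date.mode()[0]
X_train['date'] = X_train.date.fillna(most_common_date)
X_test['date'] = X_test.date.fillna(most_common_date)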
X_train.isnull().any()

condition            False
date                 False
drug0                False
drug1                False
review_len           False
complexity           False
no_stop_cap_lemm     False
subjectivity         False
original_polarity    False
dtype: bool

In [17]:
X_test.isnull().any()

condition            False
date                 False
drug0                False
drug1                False
review_len           False
complexity           False
no_stop_cap_lemm     False
subjectivity         False
original_polarity    False
dtype: bool

In [18]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22089 entries, 3172 to 30071
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   condition          22089 non-null  object        
 1   date               22089 non-null  datetime64[ns]
 2   drug0              22089 non-null  object        
 3   drug1              22089 non-null  object        
 4   review_len         22089 non-null  int64         
 5   complexity         22089 non-null  float64       
 6   no_stop_cap_lemm   22089 non-null  object        
 7   subjectivity       22089 non-null  float64       
 8   original_polarity  22089 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(4)
memory usage: 1.7+ MB


Ready to carry on with feature engineering

<font color='violet'> Implement tfidf 
    
I'll need to quickly create and evaluate a model after implementing each method of feature extraction from the no_stop_cap_lemm text. I'll do model tuning later, but for now, my understanding is that naive bayes is a common, solid model for nlp classification tasks, so I'll use that to compare various preprocessing techniques explored here. 
    
I'll start by modeling with just the extracted text features. These could be recombined with the other columns later (once they're encoded numerically) for improved model performance if I have some reason to do so. But I don't want to do all my modeling including variables like drugs and conditions, because those features will be absent from scraped psychedelic experience reports. 

In [19]:
X_train_models = X_train.drop(columns=['condition', 'date', 'drug0', 'drug1'])
X_test_models = X_test.drop(columns=['condition', 'date', 'drug0', 'drug1'])
X_train_models.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22089 entries, 3172 to 30071
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   review_len         22089 non-null  int64  
 1   complexity         22089 non-null  float64
 2   no_stop_cap_lemm   22089 non-null  object 
 3   subjectivity       22089 non-null  float64
 4   original_polarity  22089 non-null  float64
dtypes: float64(3), int64(1), object(1)
memory usage: 1.0+ MB


In [20]:
# Specifically, to get tfidf feature engineering to work, pare it down to just text column
X_train_tfidf = X_train_models.no_stop_cap_lemm
X_test_tfidf = X_test_models.no_stop_cap_lemm
X_train_tfidf.head()

3172                                     good give run gas
21607    75 mg x daily no noticeable effect 150 mg x da...
28990    take 145 mg 10 year fantastic insomnia really ...
23881    help stability mood help insomnia start experi...
16374                                  crazy eat sleep sit
Name: no_stop_cap_lemm, dtype: object

In [21]:
# Could change parameters to include min_df, max_df, but for now just use a simple version
tfidf = TfidfVectorizer(ngram_range=(1, 3), lowercase=False)

# Fit to training set, and transform both sets
X_train_tfidf = tfidf.fit_transform(X_train_tfidf)
X_test_tfidf = tfidf.transform(X_test_tfidf)

# Run through a model and evaluate one-vs-rest ROC AUC
nb_clf = MultinomialNB()
nb_clf.fit(X_train_tfidf, y_train)
pred = nb_clf.predict_proba(X_test_tfidf)
metrics.roc_auc_score(y_test, pred, multi_class='ovr')

0.6055234472676633

With only the text itself and no additional features such as sentiment polarity, which is quite well-correlated with rating, the one-vs-rest ROC AUC is about 0.61. Try some other methods for feature engineering with the text column. 

<font color='violet'> Implement feature engineering with CountVectorizer 

In [22]:
# Copy to create relevant train and test sets
X_train_cvect = X_train_models.no_stop_cap_lemm
X_test_cvect = X_test_models.no_stop_cap_lemm

# Instantiate count vectorizer
cvect = CountVectorizer(lowercase=False)

# Fit to training set, and transform both sets
X_train_cvect = cvect.fit_transform(X_train_cvect)
X_test_cvect = cvect.transform(X_test_cvect)

# Run through a model and evaluate one-vs-rest ROC AUC
nb_clf = MultinomialNB()
nb_clf.fit(X_train_cvect, y_train)
pred = nb_clf.predict_proba(X_test_cvect)
metrics.roc_auc_score(y_test, pred, multi_class='ovr')

0.6497113011239557

The features created by the count vectorizer worked better with the model than those from tfidf (AUC of about 0.65 versus 0.61), although the two runs also differ in n-gram range: tfidf used 1- to 3-grams, while the count vectorizer used its default of single words. Given that individual words are more likely than bigrams or trigrams to show up in unseen data from psychedelic experience reports, it also makes sense to favor the count vectorizer from the perspective of avoiding overfitting. 
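
Because the tfidf and count vectorizer runs differ in both weighting scheme and n-gram range, one way to separate those effects is to score each vectorizer with identical settings. The following is only a sketch on made-up toy data; the helper `score_vectorizer` and the sample texts are hypothetical, and in the notebook the real train and test text series would take their place.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB


def score_vectorizer(vectorizer, train_text, test_text, y_train, y_test):
    """Fit the vectorizer on training text only; return one-vs-rest ROC AUC."""
    X_tr = vectorizer.fit_transform(train_text)
    X_te = vectorizer.transform(test_text)
    clf = MultinomialNB().fit(X_tr, y_train)
    proba = clf.predict_proba(X_te)
    return roc_auc_score(y_test, proba, multi_class='ovr', labels=clf.classes_)


# Toy data, invented for illustration (three rating classes)
train_text = ['awful nausea headache', 'awful rash quit', 'terrible nausea quit',
              'okay mild effect', 'okay some relief', 'mild relief some nausea',
              'great relief sleep well', 'great mood sleep', 'wonderful relief mood']
y_train = [1, 1, 1, 5, 5, 5, 10, 10, 10]
test_text = ['awful headache quit', 'terrible rash', 'okay mild relief',
             'some effect okay', 'great sleep mood', 'wonderful relief well']
y_test = [1, 1, 5, 5, 10, 10]

# Same n-gram range and casing for both, so only the weighting scheme differs
shared = dict(ngram_range=(1, 2), lowercase=False)
scores = {
    'count': score_vectorizer(CountVectorizer(**shared), train_text, test_text, y_train, y_test),
    'tfidf': score_vectorizer(TfidfVectorizer(**shared), train_text, test_text, y_train, y_test),
}
print(scores)
```

Holding `ngram_range` fixed while swapping the vectorizer class, and then the reverse, would show which of the two changes is responsible for a difference in AUC.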

Finally, generate word embeddings. There are multiple methods with which to do this such as spacy's pretrained models, Word2Vec, GloVe, or FastText. I'm unsure about relative usability or performance, so play around. I did read that FastText is better for generalization to unknown words than Word2Vec or GloVe, so definitely go there and see if it's feasible, after starting with the simplest tool, spacy. 

<font color='violet'> Explore word embedding 

In [None]:
# Start with spaCy token vectors. Note that np.mean with no axis argument averages over
# every token and every dimension, so 'vector' holds one scalar per review, not a vector.

nlp = spacy.load('en_core_web_lg') 
df['vector'] = df['no_stop_cap_lemm'].apply(nlp).apply(
    lambda doc: np.mean([token.vector for token in doc]))
df.head()

In [None]:
# Is this vector column correlated with the ratings column? 
sns.boxplot(data=df, x='rating', y='vector').set(ylim=(-0.3,0.1))
plt.show()

It appears as though the mean word vector value for each review isn't very meaningful on its own. However, what about columns that contain each review's similarity to a mega-review made up of all the reviews with a given rating (one for rating 1, one for rating 2, and so on)? 
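
For reference, spaCy's `Doc.similarity` defaults to the cosine similarity between the two documents' vectors, and a document's vector defaults to the average of its token vectors. A minimal NumPy sketch of that calculation, using made-up 2-dimensional vectors in place of real embeddings (the helper names are hypothetical):

```python
import numpy as np


def mean_vector(token_vectors):
    """Average token vectors dimension by dimension (axis=0 keeps a vector)."""
    return np.mean(np.asarray(token_vectors, dtype=float), axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


# Invented token vectors standing in for one review and one mega-review
review_tokens = [[1.0, 0.0], [0.0, 1.0]]
meta_tokens = [[2.0, 2.0], [4.0, 4.0]]

review_vec = mean_vector(review_tokens)
meta_vec = mean_vector(meta_tokens)
similarity = cosine_similarity(review_vec, meta_vec)
print(review_vec, meta_vec, similarity)
```

Cosine similarity ignores vector length, so a long mega-review and a short review can still score highly when their average vectors point in the same direction.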

In [None]:
# Try creating just one column: similarity with reviews with a rating of 10
rating_10_review = ' '.join(df[df['rating']==10]['no_stop_cap_lemm'])
rating_10_review = nlp(rating_10_review[:100000])
df['similarity_w_10'] = df.no_stop_cap_lemm.apply(nlp).apply(
    lambda text: text.similarity(rating_10_review))
df.head()

In [None]:
sns.boxplot(data=df, x='rating', y='similarity_w_10').set(ylim=0.6)
plt.show()

In [None]:
# This appears slightly more meaningful than several other variables. Confirm. 

plt.figure(figsize=(8,6))
cmap = sns.diverging_palette(h_neg=0, h_pos=0, s=0, l=0, as_cmap=True)
sns.heatmap(df[['rating', 'review_len', 'complexity', 'subjectivity', 
                'original_polarity', 'vector', 'similarity_w_10']].corr(), linewidths=.1, cmap=cmap, center=0.0, annot=True)
plt.yticks(rotation=0);

"Similarity with 10" has a stronger correlation with rating than any variable other than polarity. It seems worth it to create more of these vector similarity columns. But, of course, only using the texts from the train set. Return to the train and test sets created earlier, and do any manipulation necessary to create spacy nlp docs for the train set's reviews associated with ratings 1-10. Then, create columns for each.  

<font color='violet'> Create columns for vector similarity with meta-reviews based on each rating. 

In [None]:
X_train['set'] = 'train'
X_test['set'] = 'test'
train_set = pd.concat([X_train, y_train], axis=1)
train_set.head()

In [None]:
rating_10_meta = ' '.join(train_set[train_set['rating']==10]['no_stop_cap_lemm'])
rating_10_meta = nlp(rating_10_meta[:100000])

rating_9_meta = ' '.join(train_set[train_set['rating']==9]['no_stop_cap_lemm'])
rating_9_meta = nlp(rating_9_meta[:100000])

rating_8_meta = ' '.join(train_set[train_set['rating']==8]['no_stop_cap_lemm'])
rating_8_meta = nlp(rating_8_meta[:100000])

rating_7_meta = ' '.join(train_set[train_set['rating']==7]['no_stop_cap_lemm'])
rating_7_meta = nlp(rating_7_meta[:100000])

rating_6_meta = ' '.join(train_set[train_set['rating']==6]['no_stop_cap_lemm'])
rating_6_meta = nlp(rating_6_meta[:100000])

rating_5_meta = ' '.join(train_set[train_set['rating']==5]['no_stop_cap_lemm'])
rating_5_meta = nlp(rating_5_meta[:100000])

rating_4_meta = ' '.join(train_set[train_set['rating']==4]['no_stop_cap_lemm'])
rating_4_meta = nlp(rating_4_meta[:100000])

rating_3_meta = ' '.join(train_set[train_set['rating']==3]['no_stop_cap_lemm'])
rating_3_meta = nlp(rating_3_meta[:100000])

rating_2_meta = ' '.join(train_set[train_set['rating']==2]['no_stop_cap_lemm'])
rating_2_meta = nlp(rating_2_meta[:100000])

rating_1_meta = ' '.join(train_set[train_set['rating']==1]['no_stop_cap_lemm'])
rating_1_meta = nlp(rating_1_meta[:100000])

len(rating_1_meta)

In [None]:
print(len(rating_2_meta), len(rating_3_meta), len(rating_4_meta), len(rating_5_meta), 
      len(rating_6_meta), len(rating_7_meta), len(rating_8_meta), len(rating_9_meta), 
      len(rating_10_meta))

For the entire dataset, now that each rating_n_meta doc contains only text from the training set, add a column and fill with each review's similarity to each meta-review. 

In [None]:
test_set = pd.concat([X_test, y_test], axis=1)
df = pd.concat([train_set, test_set])
df.sample(5)

In [None]:
# Create the columns
df['similarity_w_10'] = df.no_stop_cap_lemm.apply(nlp).apply(
    lambda text: text.similarity(rating_10_meta))

df['similarity_w_9'] = df.no_stop_cap_lemm.apply(nlp).apply(
    lambda text: text.similarity(rating_9_meta))

df['similarity_w_8'] = df.no_stop_cap_lemm.apply(nlp).apply(
    lambda text: text.similarity(rating_8_meta))

df['similarity_w_7'] = df.no_stop_cap_lemm.apply(nlp).apply(
    lambda text: text.similarity(rating_7_meta))

df['similarity_w_6'] = df.no_stop_cap_lemm.apply(nlp).apply(
    lambda text: text.similarity(rating_6_meta))

df['similarity_w_5'] = df.no_stop_cap_lemm.apply(nlp).apply(
    lambda text: text.similarity(rating_5_meta))

df['similarity_w_4'] = df.no_stop_cap_lemm.apply(nlp).apply(
    lambda text: text.similarity(rating_4_meta))

df['similarity_w_3'] = df.no_stop_cap_lemm.apply(nlp).apply(
    lambda text: text.similarity(rating_3_meta))

df['similarity_w_2'] = df.no_stop_cap_lemm.apply(nlp).apply(
    lambda text: text.similarity(rating_2_meta))

df['similarity_w_1'] = df.no_stop_cap_lemm.apply(nlp).apply(
    lambda text: text.similarity(rating_1_meta))

df.head()
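
The cell above parses every review ten times, once per similarity column. Below is a sketch of a loop-based alternative that parses each review once and builds the meta-reviews from training rows only. The function names are hypothetical, and `nlp` is assumed to be the spaCy pipeline loaded earlier (any callable that returns objects with a `similarity` method would work).

```python
import pandas as pd


def build_meta_docs(train_df, nlp, text_col='no_stop_cap_lemm',
                    rating_col='rating', max_chars=100_000):
    """Return {rating: parsed meta-review}, built from training rows only."""
    meta_docs = {}
    for rating, group in train_df.groupby(rating_col):
        joined = ' '.join(group[text_col])
        # Truncate as in the notebook to keep parsing time manageable
        meta_docs[rating] = nlp(joined[:max_chars])
    return meta_docs


def add_similarity_columns(df, meta_docs, nlp, text_col='no_stop_cap_lemm'):
    """Parse each review once, then compare it with every meta-review."""
    docs = [nlp(text) for text in df[text_col]]
    out = df.copy()
    for rating, meta in meta_docs.items():
        out[f'similarity_w_{int(rating)}'] = [doc.similarity(meta) for doc in docs]
    return out


# Intended use with the notebook's objects (not run here):
# meta_docs = build_meta_docs(train_set, nlp)
# df = add_similarity_columns(df, meta_docs, nlp)
```

Assuming ratings are the whole numbers 1-10, this should yield the same `similarity_w_1` through `similarity_w_10` columns while calling `nlp` on each review only once.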

In [None]:
plt.figure(figsize=(8,6))
cmap = sns.diverging_palette(h_neg=0, h_pos=0, s=0, l=0, as_cmap=True)
sns.heatmap(df[['rating', 'similarity_w_10', 'similarity_w_9', 'similarity_w_8', 
                'similarity_w_7', 'similarity_w_6', 'similarity_w_5', 'similarity_w_4', 
                'similarity_w_3', 'similarity_w_2', 'similarity_w_1']].corr(), 
            linewidths=.1, cmap=cmap, center=0.0, annot=True)
plt.yticks(rotation=0);

These columns are highly correlated with one another. Keep just the one column that has the highest correlation with rating, "similarity_w_10"
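
The choice of which similarity column to keep can also be made programmatically rather than by reading the heatmap. A small sketch on invented numbers; the helper `best_correlated` is hypothetical, and in the notebook the candidates would be the ten `similarity_w_*` columns:

```python
import pandas as pd


def best_correlated(df, candidates, target='rating'):
    """Return the candidate column with the largest absolute Pearson
    correlation with the target, plus all candidate correlations."""
    corr = df[candidates + [target]].corr()[target].drop(target)
    return corr.abs().idxmax(), corr


# Toy frame with made-up values
toy = pd.DataFrame({
    'rating': [1, 2, 3, 4, 5],
    'a': [1, 2, 3, 4, 6],   # rises with rating
    'b': [5, 1, 4, 2, 3],   # only weakly related to rating
})
best, correlations = best_correlated(toy, ['a', 'b'])
print(best)
print(correlations)
```

Using the absolute value means a strongly negative correlation would also be selected, which may or may not be what is wanted here.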

In [None]:
df = df.drop(columns=['similarity_w_9', 'similarity_w_8', 'similarity_w_7', 'similarity_w_6', 
                      'similarity_w_5', 'similarity_w_4', 'similarity_w_3', 'similarity_w_2', 
                      'similarity_w_1'])
df.head()

Several of these columns so far, if combined with the feature engineering done by CountVectorizer, could support solid modeling. Get started with that now, come back later to try more advanced techniques. 

With FastText, I'll use a pre-trained model and just update it because I don't want to limit my model to words present in this dataset. 

<font color='violet'> Explore FastText. 

I drew some code from these resources: 
- https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle/notebook


Resources related to FastText:
- https://fasttext.cc/docs/en/python-module.html
- https://www.kaggle.com/code/grantgasser/eda-naive-bayes-bert-glove-fasttext-nn
- https://pythonwife.com/fasttext-in-nlp/
- https://towardsdatascience.com/fasttext-for-text-classification-a4b38cbff27c
- https://towardsdatascience.com/sarcasm-classification-using-fasttext-788ffbacb77b
- https://thinkinfi.com/fasttext-word-embeddings-python-implementation/


Other resources to check out:
- Try using huggingface based on example here: https://towardsdatascience.com/a-beginners-guide-to-use-bert-for-the-first-time-2e99b8c5423
- re: using spacy's visualizer: https://medium.com/acing-ai/visualizations-in-natural-language-processing-2ca60dd34ce
- Or visualize word embeddings with t-sne
- Come back to this resource used in the previous notebook; still especially interested in digging deeper with visualizing word embeddings: https://medium.com/plotly/nlp-visualisations-for-clear-immediate-insights-into-text-data-and-outputs-9ebfab168d5b
    

In [None]:
df.to_csv('../data/interim/studies_w_vector_similarity.csv')

Final model selection and tuning here: 