# <font color='violet'> Modeling to Predict Ratings based on Reviews 
    
Using data with most features engineered here: https://github.com/fractaldatalearning/psychedelic_efficacy/blob/main/notebooks/6-kl-studies-finish-preprocess.ipynb

Other feature engineering will be completed as part of the modeling pipeline. 

In [1]:
# ! pip install xgboost

In [2]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer 
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

from sklearn.naive_bayes import ComplementNB
from xgboost import XGBClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from IPython.display import Audio
sound_file = './alert.wav'

In [3]:
df = pd.read_csv('../data/interim/studies_w_vector_similarity.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31557 entries, 0 to 31556
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         31557 non-null  int64  
 1   condition          31557 non-null  object 
 2   date               31557 non-null  object 
 3   drug0              31557 non-null  object 
 4   drug1              31557 non-null  object 
 5   review_len         31557 non-null  int64  
 6   complexity         31557 non-null  float64
 7   no_stop_cap_lemm   31557 non-null  object 
 8   subjectivity       31557 non-null  float64
 9   original_polarity  31557 non-null  float64
 10  set                31557 non-null  object 
 11  rating             31557 non-null  float64
 12  similarity_w_10    31557 non-null  float64
dtypes: float64(5), int64(2), object(6)
memory usage: 3.1+ MB


In [4]:
# Remove categorical features that aren't specifically related to the review text
df = df.drop(columns = ['Unnamed: 0', 'condition', 'date', 'drug0', 'drug1'])

# Rename the review column for clarity.
df = df.rename(columns={'no_stop_cap_lemm':'review'})

# Reorder columns for clarity
df = df[['review', 'rating', 'review_len', 'complexity', 'subjectivity', 
              'original_polarity', 'similarity_w_10', 'set']]

df.head()

Unnamed: 0,review,rating,review_len,complexity,subjectivity,original_polarity,similarity_w_10,set
0,good give run gas,9.0,36,-1.2,0.6,0.7,0.64005,train
1,75 mg x daily no noticeable effect 150 mg x da...,8.0,547,5.4,0.343056,0.031439,0.841263,train
2,take 145 mg 10 year fantastic insomnia really ...,8.0,390,4.8,0.591667,0.096296,0.922576,train
3,help stability mood help insomnia start experi...,7.0,156,8.2,1.0,-1.0,0.825203,train
4,crazy eat sleep sit,2.0,66,-0.4,0.9,-0.6,0.467668,train


There are going to be a bunch of moving parts once I get to the gridsearch. I'd like to take care of anything I can in advance rather than including normalization in the pipeline. 

<font color='violet'> Normalize numerical dependent variables prior to moving on

In [5]:
# Normalize numeric dependent variables prior to moving on. 
mms = MinMaxScaler()
df[['review_len', 'complexity', 'subjectivity', 'original_polarity', 
    'similarity_w_10']] = mms.fit_transform(df[['review_len', 'complexity', 'subjectivity', 
                                                'original_polarity', 'similarity_w_10']])

df.head()

Unnamed: 0,review,rating,review_len,complexity,subjectivity,original_polarity,similarity_w_10,set
0,good give run gas,9.0,0.004525,0.034848,0.6,0.85,0.681421,train
1,75 mg x daily no noticeable effect 150 mg x da...,8.0,0.087104,0.134848,0.343056,0.51572,0.867294,train
2,take 145 mg 10 year fantastic insomnia really ...,8.0,0.061732,0.125758,0.591667,0.548148,0.942408,train
3,help stability mood help insomnia start experi...,7.0,0.023917,0.177273,1.0,0.0,0.852459,train
4,crazy eat sleep sit,2.0,0.009373,0.04697,0.9,0.2,0.522181,train


<font color='violet'> Change ratings from 1-10 to 0-9 

In [6]:
df['rating'] = LabelEncoder().fit_transform(df.rating)
df.rating.unique()

array([8, 7, 6, 1, 9, 2, 0, 4, 3, 5])

In [7]:
# Recreate train-test sets
train_set = df[df.set=='train'].drop(columns=['set']).copy()
test_set = df[df.set=='test'].drop(columns=['set']).copy()

train_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22089 entries, 0 to 22088
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   review             22089 non-null  object 
 1   rating             22089 non-null  int64  
 2   review_len         22089 non-null  float64
 3   complexity         22089 non-null  float64
 4   subjectivity       22089 non-null  float64
 5   original_polarity  22089 non-null  float64
 6   similarity_w_10    22089 non-null  float64
dtypes: float64(5), int64(1), object(1)
memory usage: 1.3+ MB


In [8]:
X_train = train_set.drop(columns=['rating'])
X_test = test_set.drop(columns=['rating'])
y_train = train_set.rating
y_test = test_set.rating

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22089 entries, 0 to 22088
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   review             22089 non-null  object 
 1   review_len         22089 non-null  float64
 2   complexity         22089 non-null  float64
 3   subjectivity       22089 non-null  float64
 4   original_polarity  22089 non-null  float64
 5   similarity_w_10    22089 non-null  float64
dtypes: float64(5), object(1)
memory usage: 1.2+ MB


Create a pipeline in order to gridsearch and tune hyperparameters of PCA and classifiers. Include PCA and CountVectorizer in a column transformer. 

<font color='violet'> Conduct initial random grid search

In [None]:
# Try out adding both PCA and CV parameters to param grid

# Make variables for types of columns & instantiate what to do with each 
review_col = 'review'
numeric_cols = ['review_len', 'complexity', 'subjectivity', 'original_polarity', 
                'similarity_w_10']

num_trans = Pipeline(steps=[('pca', PCA(random_state=17)), ('mms', MinMaxScaler())])
text_trans = CountVectorizer(lowercase=False)

# Build column transformer
cols = ColumnTransformer([('num', num_trans, numeric_cols), 
                          ('text', text_trans, review_col)])

# Create parameters for PCA, CountVectorizer
# I found during preprocessing that 1,2,3 or all features might be meaningful. 
# counting character combinations might actually make the model work better on unseen data.
# Don't actually want character 1grams or word 3grams; change later if those come out on top.

pca_cv = {'cols__num__pca__n_components':['passthrough',1,2,3], 
              'cols__text__ngram_range':[(1,1), (1,2), (1,3)], 
              'cols__text__analyzer':['word', 'char']}

# list potential parameters for each classifier
knn_params = {'clf':(KNeighborsClassifier(),), 'clf__n_neighbors':list(np.arange(3,22,2)), 
              'clf__weights':['uniform','distance'], 
              'clf__leaf_size':list(np.arange(10,101,10)), 'clf__p':[1,2], 
              'clf__metric':['euclidean','chebyshev','minkowski']}
knn_params.update(pca_cv)

svc_params = {'clf':(SVC(probability=True, random_state=43),), 'clf__C':list(np.arange(1,11)), 
              'clf__kernel':['linear', 'poly', 'rbf', 'sigmoid'], 
              'clf__degree':list(np.arange(1,11)), 'clf__gamma':['scale', 'auto'], 
              'clf__coef0':list(np.arange(0,4,0.5)), 'clf__shrinking':[True,False], 
              'clf__probability':[True,False], 'clf__class_weight':[None,'balanced']}
svc_params.update(pca_cv)

cnb_params = {'clf':(ComplementNB(),), 'clf__alpha':[0.01, 0.1, 0.5, 1, 2, 5, 10]}
cnb_params.update(pca_cv)

# Only select XGB hyperparameters that I've read minimize overfitting, a problem w/ trees
xgb_params = {'clf':(XGBClassifier(),), 'clf__colsample_bytree':list(np.arange(0.1, 0.6, 0.1)), 
              'clf__subsample':list(np.arange(0, 0.6, 0.1)), 'clf__max_depth':list(np.arange(1,5)), 
              'clf__gamma':list(np.arange(4,11,1)), 'clf__eta':list(np.arange(0, 0.6, 0.1)), 
              'clf__min_child_weight':[5,20,50,100,200], 'clf__alpha':[5,10,20,50,100], 
              'clf__n_estimators':[5,10,20,50]}
xgb_params.update(pca_cv)

ridge_params = {'clf':(RidgeClassifier(class_weight='balanced', random_state=17),), 
                'clf__alpha':[1e-10, 1e-5, 1e-2, 1, 5, 10, 20, 50, 100, 200, 500]}
ridge_params.update(pca_cv)

lda_params = {'clf':(LinearDiscriminantAnalysis(shrinkage='auto'),), 
              'clf__solver':['svd', 'lsqr', 'eigen'], 'clf__store_covariance':[True,False]}
lda_params.update(pca_cv)


# Build final pipeline and param grid & instantiate gridsearch
pipe = Pipeline(steps=[('cols', cols), ('clf', KNeighborsClassifier())])
param_grid = [svc_params, cnb_params, xgb_params, ridge_params, lda_params]

rgs = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid,
                         scoring='roc_auc_ovr', random_state=17, error_score='raise')

# Run the gridsearch
rgs.fit(X_train, y_train)
print(rgs.best_params_)
print(rgs.best_score_)

Audio(sound_file, autoplay=True)

After 1 iterations of random grid-search, the best parameters:

{'cols__text__ngram_range': (1, 1), 'cols__text__analyzer': 'char', 'cols__num__pca__n_components': 2, 'clf__subsample': 0.1, 'clf__n_estimators': 20, 'clf__min_child_weight': 200, 'clf__max_depth': 3, 'clf__gamma': 4, 'clf__eta': 0.30000000000000004, 'clf__colsample_bytree': 0.30000000000000004, 'clf__alpha': 20, 'clf': XGBClassifier(alpha=20, base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.30000000000000004, early_stopping_rounds=None,
              enable_categorical=False, eta=0.30000000000000004,
              eval_metric=None, feature_types=None, gamma=4, gpu_id=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=3, max_leaves=None,
              min_child_weight=200, missing=nan, monotone_constraints=None,
              n_estimators=20, n_jobs=None, num_parallel_tree=None, ...)}
0.5796871769733423

In [None]:
# Final model tuning and evaluation

In [None]:
# Pickle the model for later use

<font color='violet'> Next Steps:

I'd like to come back to this portion of the project (prior notebook and this one) to improve text preprocessing, but for now move on to part two: 
    
Begin scraping the web for psychedelic experience reports so they can be preprocessed to match the format of this data and run through this model. Get started with wrangling here: 
    
Note: I drew ideas from these resources: 
- https://towardsdatascience.com/getting-the-most-out-of-scikit-learn-pipelines-c2afc4410f1a
- https://stackoverflow.com/questions/63467815/how-to-access-columntransformer-elements-in-gridsearchcv