# Modeling and Evaluation of Sentiment Prediction

This stage of the project prepares the review text for machine learning algorithms, splits the dataset into training and test sets, and then trains and evaluates several different classifiers. The best models will then be tested again.

Contents of this notebook:

<ul>
    <li>1. Imports</li>
    <li>2. Data</li>
    <li>3. Preparing Text</li>
        <ul>
            <li>3.1 Removing Missing values</li>
            <li>3.2 Creating three categories of labels from ratings</li>
            <li>3.3 Train/Test Split</li>
            <li>3.4 Vectorizing the text</li>
            <li>3.5 Feature Scaling</li>
        </ul>
    <li>4. Classification</li>
        <ul>
            <li>4.1 Further splitting data into a train and validation set</li>
            <li>4.2 Logistic Regression</li>
            <li>4.3 Multinomial Naive Bayes</li>
            <li>4.4 Random Forest</li>
            <li>4.5 Decision Tree</li>
            <li>4.6 K Neighbors</li>
            <li>4.7 AdaBoost</li>
            <li>4.8 XGBoost</li>
        </ul>
    <li>5. Evaluation</li>
        <ul>
            <li>5.1 Comparing scores from all models</li>
            <li>5.2 Fitting the best model with test data</li>
            <li>5.3 Further model metrics and tuning</li>
        </ul>
</ul>

# 1. Imports

In [1]:
#basic libraries for linear algebra and data processing
import numpy as np
import pandas as pd

#visualization
import matplotlib.pyplot as plt
import seaborn as sns

#data preparation tools
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
import nltk
from nltk.corpus import stopwords

import pickle

#classification class
from classification_py import Classification

#time and warnings
import time
import warnings

#settings
warnings.filterwarnings("ignore")
%matplotlib inline
sns.set_context('poster', font_scale=0.5)

# 2. Data

In [2]:
#loading the review dataset
review = pd.read_csv('../input/yelp-reviews-phoenix-az/review_prepared.csv')

In [3]:
#keeping only the text and stars columns
reviews = review[['text', 'stars']].reset_index(drop=True)

In [4]:
print(reviews.shape)
reviews.head()

(229130, 2)


Unnamed: 0,text,stars
0,My wife took me here on my birthday for breakf...,5
1,I have no idea why some people give bad review...,5
2,love the gyro plate. Rice is so good and I als...,4
3,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",5
4,General Manager Scott Petello is a good egg!!!...,5


# 3. Preparing Text

## 3.1 Removing Missing values

Checking for any missing values I didn't catch the first time.

In [5]:
reviews.isnull().sum()

text     6
stars    0
dtype: int64

In [6]:
reviews.dropna(inplace = True)
print(reviews.shape)

(229124, 2)


## 3.2 Creating three categories of labels from ratings

Creating positive, negative, and neutral labels from the ratings: 4 and 5 stars are positive, 1 and 2 stars are negative, and 3 stars are neutral.

In [7]:
reviews['stars'].value_counts()

4    79702
5    75911
3    35266
2    20897
1    17348
Name: stars, dtype: int64

In [8]:
#creating labels from stars
reviews['label'] = reviews['stars'].apply(lambda s: 'positive' if s >= 4 else ('negative' if s <= 2 else 'neutral'))

reviews.head()

Unnamed: 0,text,stars,label
0,My wife took me here on my birthday for breakf...,5,positive
1,I have no idea why some people give bad review...,5,positive
2,love the gyro plate. Rice is so good and I als...,4,positive
3,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",5,positive
4,General Manager Scott Petello is a good egg!!!...,5,positive


In [9]:
reviews['label'].value_counts()

positive    155613
negative     38245
neutral      35266
Name: label, dtype: int64
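
As a quick sanity check, the label counts can be reproduced from the star counts printed earlier. This is a minimal sketch: `star_to_label` and `star_counts` are illustrative names that do not exist in the notebook, and the numbers are copied from the `value_counts()` output above.

```python
import pandas as pd

def star_to_label(s):
    """Same rule as the lambda in the notebook: 4-5 positive, 1-2 negative, 3 neutral."""
    if s >= 4:
        return 'positive'
    if s <= 2:
        return 'negative'
    return 'neutral'

# Star counts copied from the value_counts() output of the stars column
star_counts = pd.Series({4: 79702, 5: 75911, 3: 35266, 2: 20897, 1: 17348})

# Relabel each star value, then sum the counts that share a label
label_counts = star_counts.rename(index=star_to_label).groupby(level=0).sum()
print(label_counts)
# negative 38245, neutral 35266, positive 155613
```

The sums agree with the label counts shown above, so the mapping behaves as intended.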

## 3.3 Train/Test Split

In [10]:
X = reviews['text']
y = reviews['label']

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

In [12]:
print('Shape of the X train set: ', X_train.shape)
print('Shape of the X test set: ', X_test.shape)
print('Shape of the y train set: ', y_train.shape)
print('Shape of the y test set: ', y_test.shape)

Shape of the X train set:  (153513,)
Shape of the X test set:  (75611,)
Shape of the y train set:  (153513,)
Shape of the y test set:  (75611,)
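
Since roughly two thirds of the labels are positive, one option worth considering is a stratified split, which keeps the class proportions about the same in both sets. The sketch below is not what the notebook ran; it uses a small made-up dataset (`X_demo`, `y_demo`) whose proportions only approximate the real ones.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels roughly mirroring the class balance of the reviews
y_demo = pd.Series(['positive'] * 68 + ['negative'] * 17 + ['neutral'] * 15)
X_demo = pd.Series([f'review {i}' for i in range(len(y_demo))])

# stratify=y_demo asks for the same label proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo)

train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share.round(2))
print(test_share.round(2))
```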


In [13]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

#fitting the encoder on the training labels only, then reusing the same mapping for the test labels
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)
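
`LabelEncoder` assigns integer codes to the classes in sorted order, which is what the `0`, `1`, `2` columns in the classification reports further down refer to. A minimal sketch with a few made-up labels (`encoder`, `train_codes`, `test_codes`, and `mapping` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Fit on the training labels; classes_ is stored in sorted order
train_codes = encoder.fit_transform(['positive', 'negative', 'neutral', 'positive'])

# Reuse the fitted mapping for held-out labels
test_codes = encoder.transform(['neutral', 'positive'])

mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
# {'negative': 0, 'neutral': 1, 'positive': 2}
```

So class 0 is negative, class 1 is neutral, and class 2 is positive.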

## 3.4 Vectorizing the text

In [14]:
#list of English stopwords (named stop_words so the imported nltk stopwords module is not overwritten)
stop_words = list(stopwords.words('english'))

In [15]:
#building the vectorizer
vectorizer = TfidfVectorizer(lowercase = True,
                             stop_words = stop_words,
                             ngram_range = (1,2),
                             min_df = 0.01)

In [16]:
#vectorizing the training set
X_train_vect = vectorizer.fit_transform(X_train)
print('Shape of X_train vectorized: ',X_train_vect.shape)

Shape of X_train vectorized:  (153513, 1095)


In [17]:
#vectorizing the test set with the vocabulary learned from the training set
X_test_vect = vectorizer.transform(X_test)
print('Shape of X_test vectorized: ',X_test_vect.shape)

Shape of X_test vectorized:  (75611, 1095)
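
The usual pattern is to learn the vocabulary and IDF weights from the training text only and then apply them unchanged to the test text, so that both matrices share the same columns. A minimal sketch on a few made-up sentences (`train_docs`, `test_docs`, and `vec` are illustrative names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['great food and great service', 'terrible service', 'food was okay']
test_docs = ['great pizza', 'okay service']

vec = TfidfVectorizer(ngram_range=(1, 2))

# fit_transform learns the vocabulary and IDF weights from the training text
train_matrix = vec.fit_transform(train_docs)

# transform reuses them; terms never seen in training (e.g. "pizza") are ignored
test_matrix = vec.transform(test_docs)

print(train_matrix.shape, test_matrix.shape)
print('pizza' in vec.vocabulary_)
```

Both matrices end up with the same number of columns, which is what a classifier trained on the first needs in order to predict on the second.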


## 3.5 Feature Scaling

In [18]:
#initializing StandardScaler
scaler = StandardScaler(with_mean = False)

In [19]:
#scaling X_train_vect
X_train_scaled = scaler.fit_transform(X_train_vect)

In [20]:
#scaling X_test_vect with the statistics learned from the training set
X_test_scaled = scaler.transform(X_test_vect)

In [21]:
print(X_train_scaled.shape)
print(y_train.shape)

(153513, 1095)
(153513,)
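
`with_mean = False` matters here because the TF-IDF output is a sparse matrix: subtracting the mean would turn every zero into a non-zero value. A small sketch on a made-up sparse matrix (`train` and `test` are illustrative) shows that the scaler then only divides each column by its standard deviation and leaves the zeros in place:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

train = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0], [4.0, 1.0]]))
test = sparse.csr_matrix(np.array([[2.0, 1.0]]))

# No centering, so the result stays sparse
scaler = StandardScaler(with_mean=False)
train_scaled = scaler.fit_transform(train)

# The test matrix is scaled with the training statistics
test_scaled = scaler.transform(test)

print(sparse.issparse(train_scaled), train_scaled.nnz == train.nnz)
print(test_scaled.toarray())
```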


# 4. Classification

## 4.1 Further splitting data into a train and validation set

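The training data is split once more into a train and a validation set. To improve model performance, I use GridSearch to find the best hyperparameters for each model. Because the dataset is rather imbalanced, the search uses Stratified k-fold Cross Validation, which keeps the class proportions roughly the same in every fold.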
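
The `Classification` helper is imported from `classification_py` and its internals are not shown in this notebook. As a rough idea of how a grid search and stratified folds fit together in plain scikit-learn, here is a minimal sketch on synthetic, imbalanced data; `X_demo`, `y_demo`, `skf_demo`, and `search` are illustrative names, and the helper may well work differently.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 3-class data with one dominant class, standing in for the vectorized reviews
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=6,
                                     n_classes=3, weights=[0.15, 0.15, 0.70],
                                     random_state=42)

# Each fold keeps roughly the same class proportions as the full dataset
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Every value of C is scored with 5-fold cross-validated accuracy
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={'C': [0.01, 0.05, 0.5, 5]},
                      cv=skf_demo, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```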

In [22]:
#splitting the train dataset into a train and validation dataset
X_train, X_val, y_train, y_val = train_test_split(X_train_scaled, y_train, 
                                                  test_size = 0.3, random_state = 42)

In [23]:
#initializing the Stratified K-fold CV
skf = StratifiedKFold(n_splits = 5, random_state = 42, shuffle = True)

## 4.2 Logistic Regression

In [24]:
#establishing parameters for GridSearch
parameters = {'penalty':['l1','l2'],
              'C':[0.01,0.05,0.5,5]}

In [25]:
#fitting the model
log_reg = Classification('Logistic Regression', X_train, X_val, y_train, y_val)

In [26]:
%%time

#getting scores
log_reg.get_scores(parameters, skf)

Unnamed: 0,Model Name,Train Accuracy,Validation Accuracy,Accuracy Difference
0,Logistic Regression,0.801329,0.788683,0.012646


The best hyperparameters are:  {'C': 0.01, 'penalty': 'l2'} 

CPU times: user 12.8 s, sys: 201 ms, total: 13 s
Wall time: 58.9 s


Unnamed: 0,0,1,2,accuracy,macro avg
precision,0.694542,0.51226,0.838515,0.788683,0.681772
recall,0.678042,0.268962,0.934322,0.788683,0.627108
f1-score,0.686192,0.352725,0.88383,0.788683,0.640916


## 4.3 Multinomial Naive Bayes

In [27]:
#establishing parameters for GridSearch
parameters = {'alpha': [0.001, 0.01, 0.5, 1.0]}

In [28]:
#fitting the model
mnb = Classification('Multinomial Naive Bayes', X_train, X_val, y_train, y_val)

In [29]:
%%time

#getting scores
mnb.get_scores(parameters, skf)

Unnamed: 0,Model Name,Train Accuracy,Validation Accuracy,Accuracy Difference
0,Multinomial Naive Bayes,0.696545,0.687671,0.008874


The best hyperparameters are:  {'alpha': 0.001} 

CPU times: user 367 ms, sys: 30.1 ms, total: 397 ms
Wall time: 1.12 s


Unnamed: 0,0,1,2,accuracy,macro avg
precision,0.526862,0.346822,0.910156,0.687671,0.594613
recall,0.673185,0.561293,0.720061,0.687671,0.651513
f1-score,0.591103,0.428732,0.804025,0.687671,0.607953


## 4.4 Random Forest

In [30]:
#establishing parameters for GridSearch
parameters = {'min_samples_leaf':[1,3,15,50],
          'max_depth':[5,10,15,20]}

In [31]:
#fitting the model
rf = Classification('Random Forest', X_train, X_val, y_train, y_val)

In [32]:
%%time

#getting scores
rf.get_scores(parameters, skf)

Unnamed: 0,Model Name,Train Accuracy,Validation Accuracy,Accuracy Difference
0,Random Forest,0.760727,0.717962,0.042766


The best hyperparameters are:  {'max_depth': 20, 'min_samples_leaf': 1} 

CPU times: user 3min 36s, sys: 500 ms, total: 3min 36s
Wall time: 4min 23s


Unnamed: 0,0,1,2,accuracy,macro avg
precision,0.789629,0.52381,0.714804,0.717962,0.676081
recall,0.245833,0.012315,0.994087,0.717962,0.417412
f1-score,0.374937,0.024063,0.831624,0.717962,0.410208


## 4.5 Decision Tree

In [33]:
#establishing parameters for GridSearch
parameters = {'min_samples_leaf':[3,15,50,100],
              'max_depth':[3,5,7,10]}

In [34]:
#fitting the model
tree = Classification('Decision Tree', X_train, X_val, y_train, y_val)

In [35]:
%%time

#getting scores
tree.get_scores(parameters, skf)

Unnamed: 0,Model Name,Train Accuracy,Validation Accuracy,Accuracy Difference
0,Decision Tree,0.711174,0.708625,0.002549


The best hyperparameters are:  {'max_depth': 10, 'min_samples_leaf': 100} 

CPU times: user 3min 9s, sys: 122 ms, total: 3min 9s
Wall time: 5min 27s


Unnamed: 0,0,1,2,accuracy,macro avg
precision,0.6946,0.4,0.720327,0.708625,0.604976
recall,0.195826,0.084523,0.97603,0.708625,0.418793
f1-score,0.305519,0.139556,0.828907,0.708625,0.424661


## 4.6 K Neighbors 

In [36]:
#establishing parameters for GridSearch
parameters = {'n_neighbors':[5,10,50,150,300]}

In [37]:
#fitting the model
knn = Classification('KNN', X_train, X_val, y_train, y_val)

In [38]:
%%time

#getting scores
knn.get_scores(parameters, skf)

Unnamed: 0,Model Name,Train Accuracy,Validation Accuracy,Accuracy Difference
0,KNN,0.680715,0.680636,8e-05


The best hyperparameters are:  {'n_neighbors': 150} 

CPU times: user 44min 6s, sys: 11min 51s, total: 55min 57s
Wall time: 2h 59min 58s


Unnamed: 0,0,1,2,accuracy,macro avg
precision,0.755319,0.0,0.680483,0.680636,0.478601
recall,0.009319,0.0,0.999553,0.680636,0.33629
f1-score,0.01841,0.0,0.809719,0.680636,0.276043


## 4.7 AdaBoost

In [39]:
#establishing parameters for GridSearch
parameters = {'learning_rate':[0.1,1,10]}

In [40]:
#fitting the model
ada = Classification('AdaBoost', X_train, X_val, y_train, y_val)

In [41]:
%%time

#getting scores
ada.get_scores(parameters, skf)

Unnamed: 0,Model Name,Train Accuracy,Validation Accuracy,Accuracy Difference
0,AdaBoost,0.690552,0.690515,3.6e-05


The best hyperparameters are:  {'learning_rate': 1} 

CPU times: user 1min 56s, sys: 2.36 s, total: 1min 59s
Wall time: 5min 5s


Unnamed: 0,0,1,2,accuracy,macro avg
precision,0.868885,0.45488,0.691333,0.690515,0.671699
recall,0.058275,0.034565,0.994279,0.690515,0.362373
f1-score,0.109225,0.064248,0.815583,0.690515,0.329685


## 4.8 XGBoost

In [42]:
#establishing parameters for GridSearch
parameters = {'eta':[0.001,0.005,0.1,0.5],
              'min_child_weight':[1,5,10]}

In [43]:
#fitting the model
xgb = Classification('XGBoost', X_train, X_val, y_train, y_val)

In [44]:
%%time

#getting scores
xgb.get_scores(parameters, skf)



Unnamed: 0,Model Name,Train Accuracy,Validation Accuracy,Accuracy Difference
0,XGBoost,0.825757,0.775546,0.050211


The best hyperparameters are:  {'eta': 0.001, 'min_child_weight': 10} 

CPU times: user 33min 30s, sys: 1.38 s, total: 33min 32s
Wall time: 3h 30min 42s


Unnamed: 0,0,1,2,accuracy,macro avg
precision,0.71717,0.508607,0.807231,0.775546,0.677669
recall,0.564116,0.219144,0.954105,0.775546,0.579122
f1-score,0.631502,0.306308,0.874544,0.775546,0.604118


In [45]:
xgb.scores_table

Unnamed: 0,Model Name,Train Accuracy,Validation Accuracy,Accuracy Difference
0,XGBoost,0.825757,0.775546,0.050211


# 5. Evaluation

## 5.1 Comparing scores from all models

In [46]:
models = pd.concat([log_reg.scores_table,
                    mnb.scores_table,
                    rf.scores_table,
                    tree.scores_table,
                    knn.scores_table,
                    ada.scores_table,
                    xgb.scores_table],
                    axis=0)

In [47]:
models

Unnamed: 0,Model Name,Train Accuracy,Validation Accuracy,Accuracy Difference
0,Logistic Regression,0.801329,0.788683,0.012646
0,Multinomial Naive Bayes,0.696545,0.687671,0.008874
0,Random Forest,0.760727,0.717962,0.042766
0,Decision Tree,0.711174,0.708625,0.002549
0,KNN,0.680715,0.680636,8e-05
0,AdaBoost,0.690552,0.690515,3.6e-05
0,XGBoost,0.825757,0.775546,0.050211
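
To make the comparison easier to read, the table can be sorted by validation accuracy. The sketch below rebuilds the table from the numbers printed above rather than from the fitted objects, so `results` and `ranked` are illustrative names only.

```python
import pandas as pd

# Scores copied from the models table above
results = pd.DataFrame({
    'Model Name': ['Logistic Regression', 'Multinomial Naive Bayes', 'Random Forest',
                   'Decision Tree', 'KNN', 'AdaBoost', 'XGBoost'],
    'Train Accuracy': [0.801329, 0.696545, 0.760727, 0.711174, 0.680715, 0.690552, 0.825757],
    'Validation Accuracy': [0.788683, 0.687671, 0.717962, 0.708625, 0.680636, 0.690515, 0.775546],
})

# Gap between train and validation accuracy, a rough indicator of overfitting
results['Accuracy Difference'] = results['Train Accuracy'] - results['Validation Accuracy']

# Best validation accuracy first, with a clean 0..n-1 index
ranked = results.sort_values('Validation Accuracy', ascending=False).reset_index(drop=True)
print(ranked)
```

Logistic Regression comes out on top, followed by XGBoost, which also shows the largest gap between train and validation accuracy.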


In [54]:
#saving models results as a csv
models.to_csv('./models_results.csv',index=False)

In [55]:
#saving each model to its own pickle file
for model in [log_reg, mnb, rf, tree, knn, ada, xgb]:
    with open(f'./{model.model_type}.pkl', 'wb') as f:
        pickle.dump(model, f)

## 5.2 Fitting the best model with test data

## 5.3 Further model metrics and tuning

In [57]:
xgb.classification_report

Unnamed: 0,0,1,2,accuracy,macro avg
precision,0.71717,0.508607,0.807231,0.775546,0.677669
recall,0.564116,0.219144,0.954105,0.775546,0.579122
f1-score,0.631502,0.306308,0.874544,0.775546,0.604118


In [60]:
log_reg.classification_report

Unnamed: 0,0,1,2,accuracy,macro avg
precision,0.694542,0.51226,0.838515,0.788683,0.681772
recall,0.678042,0.268962,0.934322,0.788683,0.627108
f1-score,0.686192,0.352725,0.88383,0.788683,0.640916
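
For reading these reports: the F1 score of a class is the harmonic mean of its precision and recall, and the macro average is the unweighted mean over the three classes, so the weak neutral class (column `1`) pulls it down even though overall accuracy is high. A short sketch that recomputes two of the values above from the printed numbers (variable names are illustrative):

```python
import numpy as np

# Logistic Regression, class 1 (neutral): precision and recall from the report above
precision_neutral, recall_neutral = 0.51226, 0.268962
f1_neutral = 2 * precision_neutral * recall_neutral / (precision_neutral + recall_neutral)

# Per-class F1 scores from the two reports above (classes 0, 1, 2)
log_reg_f1 = [0.686192, 0.352725, 0.88383]
xgb_f1 = [0.631502, 0.306308, 0.874544]

# Macro average: plain mean over classes, ignoring class sizes
macro_log_reg = np.mean(log_reg_f1)
macro_xgb = np.mean(xgb_f1)

print(round(f1_neutral, 4))
print(round(macro_log_reg, 6), round(macro_xgb, 6))
```

Both recomputed macro averages match the `macro avg` column, and Logistic Regression is ahead of XGBoost on every per-class F1 score.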
