In [1]:
import pandas as pd
import numpy as np
import sklearn.metrics as metrics

from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

In [2]:
columns = ["user_id", "product_id", "date", "full_review", "rating", "label", "word_list",
           "lemmatized_word_list", "num_words", "num_verbs", "avg_word_len", "emotiveness_ratio",
           "num_positive_words", "num_negative_words", "sentiment"]
preprocessed_data = pd.read_csv("../data/preprocessing and features for modeling/review_features_02.txt",
                                delimiter="\t",
                                names=columns)

In [3]:
preprocessed_data

Unnamed: 0,user_id,product_id,date,full_review,rating,label,word_list,lemmatized_word_list,num_words,num_verbs,avg_word_len,emotiveness_ratio,num_positive_words,num_negative_words,sentiment
0,5044,0,2014-11-16,"Drinks were bad, the hot chocolate was watered...",1.0,-1,"['drinks', 'bad', 'hot', 'chocolate', 'watered...","['drink', 'bad', 'hot', 'chocolate', 'water', ...",17,2,4.882353,0.416667,1,5,-0.235294
1,5045,0,2014-09-08,This was the worst experience I've ever had a ...,1.0,-1,"['worst', 'experience', 'ive', 'ever', 'casual...","['bad', 'experience', 'ive', 'ever', 'casual',...",118,21,5.533898,0.430380,4,12,-0.067797
2,5046,0,2013-10-06,This is located on the site of the old Spruce ...,3.0,-1,"['located', 'site', 'old', 'spruce', 'st', 'vi...","['locate', 'site', 'old', 'spruce', 'st', 'vid...",24,3,5.125000,0.500000,4,1,0.125000
3,5047,0,2014-11-30,I enjoyed coffee and breakfast twice at Toast ...,5.0,-1,"['enjoyed', 'coffee', 'breakfast', 'twice', 't...","['enjoy', 'coffee', 'breakfast', 'twice', 'toa...",129,15,5.651163,0.476744,20,4,0.124031
4,5048,0,2014-08-28,I love Toast! The food choices are fantastic -...,5.0,-1,"['love', 'toast', 'food', 'choices', 'fantasti...","['love', 'toast', 'food', 'choice', 'fantastic...",79,11,5.354430,0.500000,12,0,0.151899
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
608593,119664,5039,2013-01-20,When I first moved to the area I must say I wa...,4.0,1,"['first', 'moved', 'area', 'must', 'say', 'was...","['first', 'move', 'area', 'must', 'say', 'wasn...",27,5,5.111111,0.625000,5,0,0.185185
608594,56277,5039,2012-11-12,Kind of pricey. I guess I expected a ridiculou...,2.0,1,"['kind', 'pricey', 'guess', 'expected', 'ridic...","['kind', 'pricey', 'guess', 'expect', 'ridicul...",79,9,5.367089,0.396226,4,8,-0.050633
608595,265320,5039,2012-08-22,"Stopped by this restaurant yesterday, we just ...",1.0,1,"['stopped', 'restaurant', 'yesterday', 'wanted...","['stop', 'restaurant', 'yesterday', 'want', 'q...",72,10,5.222222,0.675000,1,6,-0.069444
608596,161722,5039,2011-05-11,Finally checked out The Best Subs in Claremont...,4.0,1,"['finally', 'checked', 'best', 'subs', 'clarem...","['finally', 'check', 'best', 'sub', 'claremont...",35,4,5.171429,0.571429,2,0,0.057143


In [4]:
# Model will only include numeric features at this point, so let's filter to relevant columns
filtered_features = ['num_words', 'num_verbs', 'avg_word_len', 'emotiveness_ratio',
                     'num_positive_words', 'num_negative_words', 'sentiment', 'rating']
label_col = 'label'

In [5]:
%%time
# Baseline model without any direct representation of words or word embeddings

# Using a linear kernel for now as I have seen pretty good results for linear SVM with text classification
# Using LinearSVC for better performance on large data sets
baseline_svm = LinearSVC()

# Best practice is to scale data when using geometry-based models like SVM, where feature magnitude matters
scaler = StandardScaler()

# Best practice to split the data before scaling, so that the scaler is fit on only the training data
X_train, X_test, y_train, y_test = train_test_split(preprocessed_data[filtered_features],
                                                    preprocessed_data[label_col].values,
                                                    test_size=.3,
                                                    random_state=24)

# Minority class is significantly less than majority class, so we should use SMOTE to upsample the data
resampler = SMOTE(random_state=24, k_neighbors=3)

# Using imblearn pipeline to manage the scaling and resampling logic in a simple way
svm_pipe = Pipeline([('scaler', scaler),
                     ('upsampler', resampler),
                     ('svc', baseline_svm)])

svm_pipe.fit(X_train, y_train)

CPU times: total: 3min 11s
Wall time: 3min 12s




In [6]:
preds = svm_pipe.predict(X_test)

In [7]:
svm_pipe[2].coef_

array([[ 0.32999912, -0.15179489, -0.04664113,  0.00379112,  0.08096823,
        -0.09776226, -0.09691944,  0.0769002 ]])

In [8]:
print(metrics.classification_report(y_test, preds))

              precision    recall  f1-score   support

          -1       0.18      0.66      0.28     24295
           1       0.91      0.53      0.67    158285

    accuracy                           0.55    182580
   macro avg       0.54      0.60      0.48    182580
weighted avg       0.81      0.55      0.62    182580



In [9]:
metrics.confusion_matrix(y_test, preds)

array([[15971,  8324],
       [73947, 84338]], dtype=int64)

Unfortunately, the untuned baseline model reaches only 55% overall accuracy on the test set. The confusion matrix shows that it misclassifies 73,947 of the 158,285 real reviews as fake, which is why precision on the fake class (label -1) is only 18%. We may need richer features to represent the differences between the classes, or we could compare alternative rebalancing methods.
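
The per-class figures in the classification report can be re-derived from the confusion matrix above. This is a minimal sketch that assumes scikit-learn's convention of rows as true labels and columns as predicted labels, ordered `[-1, 1]`; the variable names are only illustrative.

```python
# Confusion matrix from the baseline model above.
# Rows = true label, columns = predicted label, in the order [-1 (fake), 1 (real)].
cm = [[15971, 8324],
      [73947, 84338]]

true_fake_pred_fake, true_fake_pred_real = cm[0]
true_real_pred_fake, true_real_pred_real = cm[1]
total = sum(sum(row) for row in cm)

# Precision on the fake class: of everything flagged as fake, how much really is fake
fake_precision = true_fake_pred_fake / (true_fake_pred_fake + true_real_pred_fake)
# Recall on the fake class: of all fake reviews, how many were flagged
fake_recall = true_fake_pred_fake / (true_fake_pred_fake + true_fake_pred_real)
accuracy = (true_fake_pred_fake + true_real_pred_real) / total

print(f"fake precision: {fake_precision:.2f}")
print(f"fake recall:    {fake_recall:.2f}")
print(f"accuracy:       {accuracy:.2f}")
```

The large off-diagonal count in the second row (real reviews predicted as fake) is what drags the fake-class precision down.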

#### TF-IDF Vectorization
Term frequency-inverse document frequency (TF-IDF) is a common, simple bag-of-words weighting scheme that scores how distinctive each term is for a given document. It is the product of the term's frequency within a single document and the inverse document frequency, a log-scaled inverse of the fraction of documents in the corpus that contain the term. In this case, the training set is very large, so we may run into computational challenges training an SVM model on a TF-IDF vectorized dataset with only local resources.
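
To make the weighting concrete, here is a small pure-Python sketch of the calculation on a toy corpus. It assumes the smoothed IDF variant, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which matches the documented defaults of `TfidfVectorizer`; the toy documents and the `tfidf` helper are hypothetical and only for illustration.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized "reviews" (hypothetical data)
docs = [
    ["great", "food", "great", "service"],
    ["bad", "food"],
    ["great", "coffee"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    term_counts = Counter(doc)
    # Raw term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf(docs[1])
print(weights)
```

In the second document, "bad" appears in only one document of the corpus while "food" appears in two, so "bad" receives the larger weight.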

In [20]:
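# TF-IDF vectorizer with unigram representation (single words within each document)
# Note: the vocabulary and IDF weights are fit on the full dataset here, before the train/test split
vectorizer = TfidfVectorizer()
vectorizer.fit(preprocessed_data.lemmatized_word_list)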

In [21]:
tfidf_svm = LinearSVC()

X_train, X_test, y_train, y_test = train_test_split(preprocessed_data.lemmatized_word_list,
                                                    preprocessed_data[label_col].values,
                                                    test_size=.3,
                                                    random_state=24)

# Using imblearn pipeline to manage the resampling logic in a simple way
tfidf_svm_pipe = Pipeline([('upsampler', resampler),
                           ('svc', tfidf_svm)])

tfidf_x_train = vectorizer.transform(X_train)
tfidf_x_test = vectorizer.transform(X_test)

print(tfidf_x_train.shape)
print(tfidf_x_test.shape)

(426018, 323907)
(182580, 323907)


In [22]:
%%time
tfidf_svm_pipe.fit(tfidf_x_train, y_train)

CPU times: total: 2min 24s
Wall time: 2min 26s


In [23]:
tfidf_preds = tfidf_svm_pipe.predict(tfidf_x_test)

print(metrics.classification_report(y_test, tfidf_preds))
metrics.confusion_matrix(y_test, tfidf_preds)

              precision    recall  f1-score   support

          -1       0.26      0.44      0.33     24295
           1       0.90      0.81      0.85    158285

    accuracy                           0.76    182580
   macro avg       0.58      0.62      0.59    182580
weighted avg       0.82      0.76      0.78    182580



array([[ 10657,  13638],
       [ 30345, 127940]], dtype=int64)

The simple unigram TF-IDF model significantly outperforms the baseline model that uses only the computed features, achieving 76% overall accuracy. It still struggles to identify the fake reviews and flags far too many real reviews as fake, with a poor precision of 26% on the fake class. This is an improvement over the baseline, which had a precision of only 18%, although recall on the fake class dropped from 66% to 44%, so there is still significant room for improvement.

In [14]:
# Inspecting some of the lemmatized words, I see a few that appear to be errors in preprocessing
sorted(vectorizer.get_feature_names_out()[0:1000])

['aa',
 'aaa',
 'aaaa',
 'aaaaa',
 'aaaaaaaaaaaa',
 'aaaaaaaaaaaaaaaaaa',
 'aaaaaaaaaaaaaaaasaaaaaaaaaammmmmmmmmmmmmaaaaaaaaaa',
 'aaaaaaaaaaaaaand',
 'aaaaaaaaaaaaaayo',
 'aaaaaaaaaaaaah',
 'aaaaaaaaaaaallll',
 'aaaaaaaaaaaamazing',
 'aaaaaaaaaaamazing',
 'aaaaaaaaamazing',
 'aaaaaaaaand',
 'aaaaaaaahh',
 'aaaaaaaammazingggg',
 'aaaaaaaand',
 'aaaaaaaanyway',
 'aaaaaaah',
 'aaaaaaamazing',
 'aaaaaaamazzzzing',
 'aaaaaaand',
 'aaaaaaanywhoo',
 'aaaaaaay',
 'aaaaaach',
 'aaaaaah',
 'aaaaaahh',
 'aaaaaahhhhh',
 'aaaaaahhhmada',
 'aaaaaamazing',
 'aaaaaammaaazzzzingggg',
 'aaaaaammmmmaaazing',
 'aaaaaammmmmzzzzzzing',
 'aaaaaand',
 'aaaaaanyway',
 'aaaaaargh',
 'aaaaaarrrrrggghhhh',
 'aaaaah',
 'aaaaahhh',
 'aaaaahhhh',
 'aaaaahhhhh',
 'aaaaahhhhmaaaaayyyyziiiiiiing',
 'aaaaahhhhmazing',
 'aaaaahhhhmazzzing',
 'aaaaahmazing',
 'aaaaall',
 'aaaaalll',
 'aaaaallll',
 'aaaaamaazzzzing',
 'aaaaamahhzing',
 'aaaaamaxing',
 'aaaaamazing',
 'aaaaamazinggg',
 'aaaaammmaaazzziiinnnggg',
 'aaaaammm

In [15]:
# Preprocessing error: words separated by "..." (and possibly other characters/cases) are concatenated together
preprocessed_data[preprocessed_data.lemmatized_word_list.str.lower().str.contains('aboutforgetaboutit')].full_review.values

array(["Don Peppe's is old school Italian at it's best....Go in...mind your business...order the meal...and don't rat on your friends. And if you get \xa0pinched, tell them nothing!!!....Seriously, the baked clams, linguine white clam, shrimp Luciano over linguine and veal don peppe is what's it's all about....Forgetaboutit!!!!"],
      dtype=object)

Seeing these error cases suggested a potentially interesting feature: how punctuation is used in the review. We usually strip this information out during preprocessing, but it might be worth including counts of different punctuation marks (`!`, `.`, `,` and `?` are the main ones that come to mind). I'm not sure how predictive this is of fake versus authentic reviews, but it may be useful.
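
One way the punctuation idea could be prototyped is sketched below. The `punctuation_counts` helper and the feature names are hypothetical, and the sample string is a shortened excerpt in the style of the review above rather than a row from the dataset.

```python
# Hypothetical mapping from punctuation mark to feature name
PUNCTUATION_FEATURES = {
    "!": "num_exclamations",
    "?": "num_questions",
    ".": "num_periods",
    ",": "num_commas",
}


def punctuation_counts(text):
    """Count each punctuation mark of interest in the raw (unprocessed) review text."""
    return {name: text.count(mark) for mark, name in PUNCTUATION_FEATURES.items()}


sample = "Seriously, the baked clams....Forgetaboutit!!!!"
counts = punctuation_counts(sample)
print(counts)
```

Assuming the raw text is available in the `full_review` column, something like `preprocessed_data.full_review.apply(punctuation_counts).apply(pd.Series)` would turn these counts into columns that could be appended to the summary features.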

The TF-IDF model is moving in the right direction, but it is still struggling to properly identify the fake reviews. Let's try a slightly more complex representation, using bi-grams instead of uni-grams, and see if this helps with picking out the fake reviews. We can also try alternatives where we combine the summary features from the original model into the TF-IDF model and see if that helps as well, since there did seem to be some important information that those features were capturing.

In [24]:
# TF-IDF vectorizer with bigram only representation
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(2,2))
vectorizer.fit(preprocessed_data.lemmatized_word_list)

In [25]:
%%time
tfidf2_svm = LinearSVC()

# Using imblearn pipeline to manage the resampling logic in a simple way
tfidf2_svm_pipe = Pipeline([('upsampler', resampler),
                            ('svc', tfidf2_svm)])

tfidf2_x_train = vectorizer.transform(X_train)
tfidf2_x_test = vectorizer.transform(X_test)

print(tfidf2_x_train.shape)
print(tfidf2_x_test.shape)

(426018, 6357944)
(182580, 6357944)
CPU times: total: 1min 7s
Wall time: 1min 7s


In [26]:
%%time
tfidf2_svm_pipe.fit(tfidf2_x_train, y_train)

CPU times: total: 1min 31s
Wall time: 1min 32s


In [28]:
tfidf2_preds = tfidf2_svm_pipe.predict(tfidf2_x_test)

print(metrics.classification_report(y_test, tfidf2_preds))
metrics.confusion_matrix(y_test, tfidf2_preds)

              precision    recall  f1-score   support

          -1       0.25      0.34      0.29     24295
           1       0.89      0.84      0.87    158285

    accuracy                           0.78    182580
   macro avg       0.57      0.59      0.58    182580
weighted avg       0.81      0.78      0.79    182580



array([[  8177,  16118],
       [ 24863, 133422]], dtype=int64)

The bi-gram representation is significantly more complex (6,357,944 input features vs. 323,907 in the uni-gram model), and unfortunately this additional complexity does not appear to help model performance. Accuracy on the test set increased slightly (76% to 78%), but both precision and recall on the fake class decreased, and the number of fake reviews correctly identified fell from 10,657 to 8,177. More than likely the bi-gram representation is too complex for the amount of training data we have, so it makes more sense to keep the uni-gram representation and iterate on it to find a better model.
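
As a rough illustration of why the bi-gram feature space is so much larger and sparser, the sketch below builds uni-grams and bi-grams from a handful of lemmatized tokens. The `ngrams` helper is hypothetical and far simpler than what `TfidfVectorizer` does internally; the tokens are the first few lemmas of the first review shown earlier.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["drink", "bad", "hot", "chocolate", "water"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

Each document contributes about as many bi-grams as uni-grams, but a vocabulary of V words allows up to V squared distinct word pairs, so most bi-grams occur in very few documents and give the model little evidence per feature.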

Next, let's try to incorporate the original summary features into the uni-gram TF-IDF model and see whether those features improve performance.

In [36]:
from sklearn.compose import ColumnTransformer

tfidf_combined_svm = LinearSVC()
X_train, X_test, y_train, y_test = train_test_split(preprocessed_data[filtered_features + ['lemmatized_word_list']],
                                                    preprocessed_data[label_col].values,
                                                    test_size=.3,
                                                    random_state=24)

summary_pipeline = Pipeline([('scaler', StandardScaler())])
tfidf_pipeline = Pipeline([('vectorizer', TfidfVectorizer(analyzer='word', ngram_range=(1,1)))])
column_processor = ColumnTransformer([('summary', summary_pipeline, filtered_features),
                                      ('tfidf', tfidf_pipeline, 'lemmatized_word_list')])


tfidf_combined_svm_pipe = Pipeline([('preprocessing', column_processor),
                                    ('upsampler', resampler),
                                    ('svc', tfidf_combined_svm)])


print(X_train.shape)
print(X_test.shape)
tfidf_combined_svm_pipe

(426018, 9)
(182580, 9)


In [37]:
%%time
tfidf_combined_svm_pipe.fit(X_train, y_train)

CPU times: total: 10min 48s
Wall time: 10min 53s




In [39]:
tfidf_combined_preds = tfidf_combined_svm_pipe.predict(X_test)

print(metrics.classification_report(y_test, tfidf_combined_preds))
metrics.confusion_matrix(y_test, tfidf_combined_preds)

              precision    recall  f1-score   support

          -1       0.25      0.54      0.34     24295
           1       0.91      0.75      0.82    158285

    accuracy                           0.72    182580
   macro avg       0.58      0.64      0.58    182580
weighted avg       0.82      0.72      0.76    182580



array([[ 13076,  11219],
       [ 39714, 118571]], dtype=int64)

The model is still struggling with low precision. This variant has significantly improved on the recall of the previous TF-IDF models (54% on the fake class, versus 44% and 34%) while still maintaining better precision than the summary-only model (25% versus 18%). However, there is still significant room for improvement on precision: the model is still over-estimating the number of fake reviews in the dataset, though adding the TF-IDF representation has vastly improved performance over the original summary-only baseline.

#### Next steps

The TF-IDF uni-gram model appears to be the most promising model so far for identifying fake reviews. It still produces too many false positives (real reviews flagged as fake) and therefore poor precision on the fake class, but there have been notable improvements in performance over the original baseline model.

* **Managing imbalanced classes** - One potential confounder is the SMOTE oversampling that we are doing. In other applications, SMOTE has had a lot of success in rebalancing datasets and improving performance when datasets are heavily skewed towards one class, but it is not the only way to rebalance datasets. We could try alternatives such as random downsampling of the majority class, or we could use the built-in `class_weight` parameter of the `LinearSVC` model, without resampling, to weight the minority class more heavily.
* **Feature engineering** - These models have only incorporated the simple summary statistics provided by Lu, as well as the TF-IDF embeddings. There are many other potential features that we have discussed that we should try to include in further iterations of the model, including the punctuation idea from above.
* **Hyper-parameter tuning** - The `LinearSVC` model has several hyperparameters that we could tune to improve overall performance. For the time being, I have only explored the default parameters, but we could implement cross-validated grid searches to attempt to optimize the model. The primary hyperparameters would be `C`, the regularization parameter (regularization strength is inversely proportional to `C`), and `penalty`, which is L2 by default, though we could experiment with L1 for sparser coefficient vectors.
* **Alternative SVM classifier models** - The model I have been using is the `LinearSVC` class from sklearn, specifically because this class uses a different optimization algorithm from the standard `SVC` class that is _significantly_ more performant, but it limits the kernel to a linear kernel. Linear kernels have shown pretty good performance on text classification. Note that `SGDClassifier` with hinge loss is also a linear SVM, so if we want to attempt non-linear kernels we will have to use `SVC` itself, which scales poorly to a dataset of this size, or approximate the kernel (for example with sklearn's `kernel_approximation` module) and feed the result into a linear model such as `SGDClassifier`.
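
The class-weighting and hyperparameter-tuning ideas above could be combined roughly as follows. This is only a sketch under stated assumptions: it runs on a small synthetic dataset from `make_classification` (about 13% minority class, similar to the class ratio here) rather than the review data, and the grid values, the `f1_macro` scoring choice, and `cv=3` are placeholders, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the review features (assumption: not the real data)
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           weights=[0.87, 0.13], random_state=24)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=24)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    # class_weight="balanced" weights classes inversely to their frequency,
    # taking the place of the SMOTE step; dual=False is needed for the L1 penalty
    ("svc", LinearSVC(class_weight="balanced", dual=False, max_iter=5000)),
])

# Placeholder grid over the two hyperparameters discussed above
param_grid = {
    "svc__C": [0.01, 0.1, 1.0, 10.0],
    "svc__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"test macro-F1: {search.score(X_test, y_test):.3f}")
```

Scoring on macro-F1 rather than accuracy keeps the search from favoring models that simply predict the majority class. The same grid could in principle be attached to the TF-IDF pipelines above, at a correspondingly higher training cost.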
