# Helpful Reviews Modelling

IDEAS
+ a helpful review is a review that affects other users who read it
+ it may discourage or encourage buying behaviour
+ given the review data and product content:
  + how 'helpful' is a review? i.e. helpful votes / total votes
  + is it a discouraging review, an encouraging review, or one with no effect?
    + there is no label for this; helpfulness != encouragement
    + helpfulness + positive sentiment == encourage
    + helpfulness + negative sentiment == discourage
    + i.e. first get the sentiment value, then combine it with the helpfulness value
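
The combination rule in the last bullets can be sketched as a small function. This is only an illustration: the name `label_review`, the 0.5 helpfulness cut-off and the 0.05 neutral band are assumptions, not values fixed by this notebook.

```python
def label_review(helpfulness, sentiment, min_helpfulness=0.5, neutral_band=0.05):
    """Combine a helpfulness ratio (0..1) with a sentiment score (-1..1).

    Both thresholds are illustrative assumptions.
    """
    # An unhelpful or sentiment-neutral review is treated as having no effect.
    if helpfulness <= min_helpfulness or abs(sentiment) < neutral_band:
        return 'no effect'
    return 'encourage' if sentiment > 0 else 'discourage'


print(label_review(0.9, 0.8))   # helpful and positive
print(label_review(0.9, -0.6))  # helpful and negative
print(label_review(0.2, 0.8))   # not helpful, whatever the sentiment
```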

In [1]:
import os
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

In [2]:
from sqlalchemy import create_engine
from dotenv import load_dotenv # env variables
load_dotenv(verbose=True)

True

In [3]:
SQLALCHEMY_DATABASE_URI = os.getenv('DATABASE_URL')
engine = create_engine(SQLALCHEMY_DATABASE_URI)

## Predict Helpfulness of a Review
+ Take reviews with votes and predict whether a review is helpful or not
+ A review counts as helpful when helpful votes / total votes exceeds 0.5
1. Clean reviews with `re`
2. Tokenize with `sklearn` 
3. Train model with `sklearn`

TODO:
+ Use word embeddings? Gensim? spaCy?
+ Use custom methods to clean strings
+ Lemmatizing? Stemming? POS Tags?
  + `from nltk.stem import WordNetLemmatizer`
  + `from nltk.stem import PorterStemmer`
+ Feature Engineering
  + Time of review may affect its helpfulness
  + Length of headline or body
  + Number of line breaks used in the review
  + Maybe the first few words are more important?
+ Prediction: 
  + a numeric value between 0 and 1, regression
  + use different models
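
A minimal sketch of the length and line-break features from the list above, on made-up reviews. The column names `headline_len`, `body_len` and `n_breaks` are hypothetical. These counts would have to be taken from the raw text, because any cleaning step that strips `<br />` destroys the line-break information.

```python
import pandas as pd

# Toy reviews; real rows would come from the `food_reviews` query.
raw = pd.DataFrame({
    'review_headline': ['Five Stars', 'Not all natural'],
    'review_body': ['Great taste.<br />Would buy again.',
                    'Look at the ingredients.'],
})

engineered = raw.assign(
    headline_len=raw.review_headline.str.len(),     # characters in the headline
    body_len=raw.review_body.str.len(),             # characters in the body
    n_breaks=raw.review_body.str.count('<br />'),   # HTML line breaks in the body
)
print(engineered[['headline_len', 'body_len', 'n_breaks']])
```

Numeric columns like these could then be fed into the feature union through `NumberSelector`, ideally followed by a scaler.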

In [4]:
sql = \
"""
SELECT 
    helpful_votes, total_votes,
    review_headline, review_body
FROM 
    food_reviews
WHERE 
    energy_100g IS NOT NULL
    AND energy_100g < 3000
    AND review_date >= '2010-01-01'
    AND verified_purchase LIKE 'Y'
    AND total_votes > 0
"""
# helpfulness = helpful votes / total votes
df = pd.read_sql(sql, con=engine)\
    .assign(helpful=lambda df: df.helpful_votes/df.total_votes > .5)\
    .drop(['helpful_votes', 'total_votes'], axis=1)
#     .assign(helpfulness=lambda df: df.helpful_votes/df.total_votes)\
df.shape

(44363, 3)

### Process Reviews 

In [7]:
# processing with individual functions
def remove_nonword(string):
    # replace HTML line breaks with a space so adjacent words are not merged,
    # then drop everything except letters, periods and spaces
    string = re.sub(r'<br />', ' ', string.lower())
    return re.sub(r'[^a-z. ]', '', string)
def remove_space(string):
    # remove trailing whitespace at the end of the string
    return re.sub(r'\s+$', '', string)
def reduce_spaces(string):
    # collapse runs of spaces into a single space
    return re.sub(' +', ' ', string)
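
A worked example of the cleaning steps on a made-up string. `clean_text` is a hypothetical helper that chains the steps in one function; it turns `<br />` into a space so that the words on either side of a line break stay separate.

```python
import re

def clean_text(string):
    """Hypothetical one-step variant of the cleaning functions."""
    string = re.sub(r'<br />', ' ', string.lower())  # line breaks become spaces
    string = re.sub(r'[^a-z. ]', '', string)         # keep letters, periods, spaces
    return re.sub(r' +', ' ', string).strip()        # collapse and trim spaces

sample = "Great taste!<br />Would buy again...  "
print(clean_text(sample))
```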

In [8]:
# apply the processing functions
funcs = [remove_nonword, remove_space, reduce_spaces]
for func in funcs:
    df.review_headline = df.review_headline.apply(func)
    df.review_body = df.review_body.apply(func)
df.head()

|   | review_headline | review_body | helpful |
|---|---|---|---|
| 0 | five stars | it works well for nagging sensations in the st... | True |
| 1 | its not all natural and also surprise it got c... | look at the ingredients dear future customers.... | False |
| 2 | five stars | if you want angostura this is the real thing. ... | False |
| 3 | best old fashion drinks.. | great for making a cocktail name old fashion..... | False |
| 4 | five stars | excellent product and very fast shipping | False |


### Prepare Train and Test Data

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer


# from scipy.sparse import hstack

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['helpful'], axis=1),
    df.helpful, test_size=0.1, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((39926, 2), (4437, 2), (39926,), (4437,))

### Using Pipelines to Train and Test

In [12]:
# pipelines simplify the workflow
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion # join features
from sklearn.base import BaseEstimator, TransformerMixin # base class

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

from sklearn.preprocessing import StandardScaler

In [13]:
class TextSelector(BaseEstimator, TransformerMixin):
    """
    Use on text columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]

class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Use on numeric columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

In [17]:
# select text 
headlines = Pipeline([
    ('selector', TextSelector(key='review_headline')),
    ('countVec', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer(use_idf=True))
])
bodies = Pipeline([
    ('selector', TextSelector(key='review_body')),
    ('countVec', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer(use_idf=True))
])
# select the have_url feature (made no difference to accuracy)
# bool_url = Pipeline([
#     ('selector', NumberSelector(key='have_url'))
# ])

# union all features together
features = FeatureUnion([
    ('headlinesTfidf', headlines), 
    ('bodiesTfidf', bodies)
])
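
For comparison, a sketch of the same idea with scikit-learn's `ColumnTransformer`, which selects the columns itself and so needs no custom selector class. This is an alternative, not the approach used in this notebook; the toy data and the name `features_ct` are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

toy = pd.DataFrame({
    'review_headline': ['five stars', 'not natural'],
    'review_body': ['it works well', 'look at the ingredients'],
})

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step.
# Passing the column as a string (not a list) hands the vectorizer a 1-D Series.
features_ct = ColumnTransformer([
    ('headlinesTfidf', TfidfVectorizer(ngram_range=(1, 2)), 'review_headline'),
    ('bodiesTfidf', TfidfVectorizer(ngram_range=(1, 2)), 'review_body'),
])
X = features_ct.fit_transform(toy)

# The two TF-IDF blocks are stacked side by side, one row per review.
n_headline = len(features_ct.named_transformers_['headlinesTfidf'].vocabulary_)
n_body = len(features_ct.named_transformers_['bodiesTfidf'].vocabulary_)
print(X.shape, n_headline, n_body)
```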

In [19]:
lrPipe = Pipeline([('features', features), 
                 ('clf', LogisticRegression())])
lrPipe.fit(X=X_train, y=y_train)
lrPipe.score(X_test, y_test)



Pipeline(memory=None,
         steps=[('features',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('headlinesTfidf',
                                                 Pipeline(memory=None,
                                                          steps=[('selector',
                                                                  TextSelector(key='review_headline')),
                                                                 ('countVec',
                                                                  CountVectorizer(analyzer='word',
                                                                                  binary=False,
                                                                                  decode_error='strict',
                                                                                  dtype=<class 'numpy.int64'>,
                                                                                  encoding='utf-8

In [None]:
# support vector classifier on the same features
svcPipe = Pipeline([('features', features),
                    ('clf', SVC(gamma='auto'))])
svcPipe.fit(X=X_train, y=y_train)
svcPipe.score(X_test, y_test)

In [None]:
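# GaussianNB cannot be fitted on the sparse TF-IDF matrix produced by `features`;
# MultinomialNB accepts sparse input and is swapped in here
from sklearn.naive_bayes import MultinomialNB

mnbPipe = Pipeline([('features', features),
                    ('clf', MultinomialNB())])
mnbPipe.fit(X=X_train, y=y_train)
mnbPipe.score(X_test, y_test)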

### Error Analysis w/ Confusion Matrix

In [23]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
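# confusion matrix of the logistic regression pipeline on the held-out test set
labels = [False, True]  # not helpful / helpful
cf_matrix = confusion_matrix(y_true=y_test, y_pred=lrPipe.predict(X_test), labels=labels)
real_index = [['Real'] * len(labels), labels]
pred_column = [['Predicted'] * len(labels), labels]
pd.DataFrame(cf_matrix, columns=pred_column, index=real_index)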
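
Accuracy alone can hide class imbalance, so a per-class report may be a useful complement to the confusion matrix. A minimal sketch on made-up labels standing in for `y_test` and the pipeline's predictions; the class names passed to `target_names` are assumptions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels; in the notebook these would be y_test and the predictions.
y_true = [True, False, False, True, False, False]
y_pred = [True, False, False, False, False, True]

# Rows are the real classes, columns the predicted ones, ordered False, True.
print(confusion_matrix(y_true, y_pred, labels=[False, True]))

# Precision, recall and F1 per class.
print(classification_report(y_true, y_pred,
                            target_names=['not helpful', 'helpful']))
```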

## Classify a Review as Encouraging or Discouraging
+ Encourage = Positive Sentiment w/ High Helpfulness
+ Discourage = Negative Sentiment w/ High Helpfulness
1. Add a new attribute holding the sentiment
  + `from nltk.sentiment.vader import SentimentIntensityAnalyzer`
  + `import flair`
2. Add a new attribute combining the sentiment and helpfulness values
  + regression? a value between -1 and 1, where values around 0 mean not helpful at all
  + classification? four quadrants: helpfulness and sentiment
3. Apply the same string processing steps as before
4. Predict
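
Step 2 could look roughly like this on a DataFrame. The numbers are made up, and the column names `helpfulness`, `compound`, `influence` and `quadrant` are hypothetical; multiplying the two values is just one possible way to obtain a target between -1 and 1.

```python
import pandas as pd

# Made-up values: `helpfulness` = helpful_votes / total_votes,
# `compound` = a sentiment score in [-1, 1], e.g. VADER's compound value.
toy = pd.DataFrame({
    'helpfulness': [0.9, 0.8, 0.1, 0.2],
    'compound': [0.7, -0.6, 0.9, -0.5],
})

helpful_part = toy.helpfulness.gt(0.5).map({True: 'helpful', False: 'unhelpful'})
sentiment_part = toy.compound.ge(0).map({True: 'positive', False: 'negative'})

toy = toy.assign(
    # regression target in [-1, 1]; values near 0 mean little influence
    influence=toy.helpfulness * toy.compound,
    # classification target: one of four quadrants
    quadrant=helpful_part + '/' + sentiment_part,
)
print(toy)
```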

In [27]:
sql = \
"""
SELECT 
    helpful_votes, total_votes,
    review_headline, review_body
FROM 
    food_reviews
WHERE 
    energy_100g IS NOT NULL
    AND energy_100g < 3000
    AND review_date >= '2010-01-01'
    AND verified_purchase LIKE 'Y'
    AND total_votes > 0
"""
# helpfulness = helpful votes / total votes
df = pd.read_sql(sql, con=engine)\
    .assign(helpful=lambda df: df.helpful_votes/df.total_votes > .5)\
    .drop(['helpful_votes', 'total_votes'], axis=1)
#     .assign(helpfulness=lambda df: df.helpful_votes/df.total_votes)\
df.shape

(44363, 3)

In [5]:
# processing with individual functions
def remove_nonword(string):
    # replace HTML line breaks with a space so adjacent words are not merged,
    # then drop everything except letters, periods and spaces
    string = re.sub(r'<br />', ' ', string.lower())
    return re.sub(r'[^a-z. ]', '', string)
def remove_space(string):
    # remove trailing whitespace at the end of the string
    return re.sub(r'\s+$', '', string)
def reduce_spaces(string):
    # collapse runs of spaces into a single space
    return re.sub(' +', ' ', string)

In [6]:
# apply the processing functions
funcs = [remove_nonword, remove_space, reduce_spaces]
for func in funcs:
    df.review_headline = df.review_headline.apply(func)
    df.review_body = df.review_body.apply(func)
df.head()

|   | review_headline | review_body | helpful |
|---|---|---|---|
| 0 | five stars | it works well for nagging sensations in the st... | True |
| 1 | its not all natural and also surprise it got c... | look at the ingredients dear future customers.... | False |
| 2 | five stars | if you want angostura this is the real thing. ... | False |
| 3 | best old fashion drinks.. | great for making a cocktail name old fashion..... | False |
| 4 | five stars | excellent product and very fast shipping | False |


In [7]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\davilaYuan\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [14]:
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))

In [28]:
sid = SentimentIntensityAnalyzer()

In [29]:
print(df.review_body[1])
sid.polarity_scores(df.review_body[1])

Look at the ingredients dear future customers. Its not all natural and also surprise it got Caramel Color just like Coca Cola. This color is toxic and have big part in giving cancer. Look it up.


{'neg': 0.141, 'neu': 0.651, 'pos': 0.208, 'compound': 0.2709}

In [33]:
tmp = df.review_body[69]
print(tmp)
sid.polarity_scores(tmp)

I can't give vegemite any less than 5 stars, it being an Australian icon. I ordered this because I had to see what the fuss was about, has to be good- right? Ummmm....no. Lol I even had my dad try it. He said it tastes like salted s...<br />Sadly, I had to agree. We love Australia...but guys? You can keep your vegemite ;)


{'neg': 0.0, 'neu': 0.748, 'pos': 0.252, 'compound': 0.9447}

In [34]:
tmp = ' '.join([w for w in df.review_body[69].split(' ') if w not in stopWords])
print(tmp)
sid.polarity_scores(tmp)

I can't give vegemite less 5 stars, Australian icon. I ordered I see fuss about, good- right? Ummmm....no. Lol I even dad try it. He said tastes like salted s...<br />Sadly, I agree. We love Australia...but guys? You keep vegemite ;)


{'neg': 0.0, 'neu': 0.628, 'pos': 0.372, 'compound': 0.9447}
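
In the two outputs above, the compound score stays at 0.9447 with or without stopwords, and the capitalised "I" survives because the filter compares case-sensitively against a lowercase list. NLTK's English stopword list also contains negations such as "not" and "no", which sentiment scoring may depend on. A sketch of a case-insensitive filter that keeps negations; the tiny `STOPWORDS` set and the helper name `drop_stopwords` are assumptions for illustration.

```python
import re

# Small illustrative stopword set; the notebook uses NLTK's English list.
STOPWORDS = {'i', 'it', 'the', 'to', 'had', 'any', 'than', 'this', 'not', 'no'}
NEGATIONS = {'not', 'no'}

def drop_stopwords(text, stopwords=STOPWORDS, keep=NEGATIONS):
    """Case-insensitive stopword removal that keeps negation words."""
    kept = []
    for token in text.split():
        # Strip punctuation for the lookup only; the original token is kept.
        word = re.sub(r"[^a-z']", '', token.lower())
        if word in keep or word not in stopwords:
            kept.append(token)
    return ' '.join(kept)

print(drop_stopwords('I had to agree, this is NOT the best.'))
```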

In [26]:
df.iloc[20, :]

review_headline                                     overseas bitters
review_body        having found it impossible to purchase bitters...
helpful                                                         True
Name: 20, dtype: object