# Logistic Regression with Python

# Predicting sentiment from product reviews


In this notebook we will use product review data from Amazon.com to predict whether the sentiments about a product (from its reviews) are positive or negative.

* Use Pandas to do some feature engineering.
* Train a logistic regression model to predict the sentiment of product reviews.
* Inspect the weights (coefficients) of a trained logistic regression model.
* Make a prediction (both class and probability) of sentiment for a new product review.
* Given the logistic regression weights, predictors and ground truth labels, write a function to compute the **accuracy** of the model.
* Inspect the coefficients of the logistic regression model and interpret their meanings.
* Compare multiple logistic regression models.

Let's get started!

In [1]:
# Libraries Import
import string
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# read dataframe
dataframe = pd.read_csv("data/amazon_baby.csv")
dataframe.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [3]:
dataframe.info()
# contains null values for name, reviews

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183531 entries, 0 to 183530
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   name    183213 non-null  object
 1   review  182702 non-null  object
 2   rating  183531 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 4.2+ MB


In [4]:
# replace null values with empty string
dataframe = dataframe.fillna({'review': ''})

In [5]:
#remove punctuations
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator) 

dataframe["review_without_punctuation"] = dataframe['review'].apply(lambda x : remove_punctuation(x))
dataframe = dataframe[["name", "review_without_punctuation", "rating"]]
dataframe.head()

Unnamed: 0,name,review_without_punctuation,rating
0,Planetwise Flannel Wipes,These flannel wipes are OK but in my opinion n...,3
1,Planetwise Wipe Pouch,it came early and was not disappointed i love ...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase I h...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried nonstop when I tried...,5


In [6]:
# ignore all reviews with rating = 3, since they tend to have a neutral sentiment
dataframe = dataframe[dataframe["rating"] != 3].reset_index(drop=True)
dataframe.head()

Unnamed: 0,name,review_without_punctuation,rating
0,Planetwise Wipe Pouch,it came early and was not disappointed i love ...,5
1,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
2,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase I h...,5
3,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried nonstop when I tried...,5
4,Stop Pacifier Sucking without tears with Thumb...,When the Binky Fairy came to our house we didn...,5


In [7]:
# reviews with a rating of 4 or higher to be positive reviews, while the ones with rating of 2 
# or lower are negative. For the sentiment column, we use +1 for the positive class label and -1 
# for the negative class label
dataframe['sentiment'] = dataframe['rating'].apply(lambda rating : +1 if rating > 3 else -1)
dataframe.head()

Unnamed: 0,name,review_without_punctuation,rating,sentiment
0,Planetwise Wipe Pouch,it came early and was not disappointed i love ...,5,1
1,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,1
2,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase I h...,5,1
3,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried nonstop when I tried...,5,1
4,Stop Pacifier Sucking without tears with Thumb...,When the Binky Fairy came to our house we didn...,5,1


In [8]:
dataframe.describe(include = 'all')

Unnamed: 0,name,review_without_punctuation,rating,sentiment
count,166456,166752.0,166752.0,166752.0
unique,30729,165873.0,,
top,Vulli Sophie the Giraffe Teether,,,
freq,723,778.0,,
mean,,,4.233191,0.682247
std,,,1.295527,0.731124
min,,,1.0,-1.0
25%,,,4.0,1.0
50%,,,5.0,1.0
75%,,,5.0,1.0


In [9]:
dict_review = dataframe['review_without_punctuation'].to_dict()
print(f'{sum(len(review.split()) for review in list(dict_review.values())):_}')


13_353_959


In [10]:
from collections import Counter
words = Counter()
for review in list(dict_review.values()):
    words.update(review.split())

words
print(len(words))


165148


In [11]:
df = pd.DataFrame()
df['words'] = dataframe['review_without_punctuation'].str.split().str.len()
df

Unnamed: 0,words
0,30
1,23
2,74
3,76
4,93
...,...
166747,27
166748,64
166749,17
166750,170


In [12]:
df['words'].sum()

13353959

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(dataframe[["name","review_without_punctuation","rating"]], dataframe['sentiment'], test_size=0.2, random_state=1)

print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (133401, 3) (133401,)
Test set: (33351, 3) (33351,)


<img src='img/Count-Vectorization.jpg' width=600px>

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [14]:
# Build the word count vector for each review_without_punctuations
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(X_train['review_without_punctuation'])
test_matrix = vectorizer.transform(X_test['review_without_punctuation'])

In [15]:
# Logistic model fit
sentiment_model = LogisticRegression(solver='liblinear', n_jobs=1)
sentiment_model.fit(train_matrix, y_train)



`LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)`

In [16]:
len(sentiment_model.coef_[0])

121505

How many weights are greater than or equal to 0?

In [17]:
np.sum(sentiment_model.coef_ >= 0)

85930

Predict first five samples of X_test set:

In [18]:
X_test.head()

Unnamed: 0,name,review_without_punctuation,rating
162318,15 Plastic Alligator Grip Suspender Pacifier B...,These clips are just what I was looking for M...,5
144241,"green sprouts 2 Count Cool Hand Teether, Green...",This was a great buy the baby really loves che...,5
10916,Kidkusion Kid Safe Banister Guard,Its a little amusing that this is marketed as ...,5
50027,Mommy's Helper Car Seat Sun Shade,I live an area of the US where we get summers ...,5
40241,Gerber Graduates BPA Free 4 Pack Bunch-A-Bowls...,I ordered these to give to my daughter she lo...,5


In [19]:
sample_test_data = X_test.iloc[1:6]
sample_test_matrix = vectorizer.transform(sample_test_data['review_without_punctuation'])
print(sentiment_model.classes_)
print(sentiment_model.predict_proba(sample_test_matrix))
sample_test_data['review_without_punctuation'].to_dict()


[-1  1]
[[3.10885504e-03 9.96891145e-01]
 [2.10430733e-03 9.97895693e-01]
 [5.63125868e-03 9.94368741e-01]
 [6.13089739e-05 9.99938691e-01]
 [4.82599390e-02 9.51740061e-01]]


{144241: 'This was a great buy the baby really loves chewing on these  especially after putting them in the refrigerator  We love it',
 10916: 'Its a little amusing that this is marketed as an actual product the 34Banister Guard34 when in reality it is just a roll of plastic and some zip ties But it does work in guarding the banister so I guess Im OK with itThis kit also includes a hole punch for installation When punching holes in the plastic for the zip ties my recommendation is to toss that hole punch in with your artsandcrafts supplies and go grab an ice pick from the kitchen The ice pick worked great in creating holes that are the correct size for the zip ties The hole punch isnt quite strong enough to be effectiveOnce the plastic is installed it looks finevery transparent and if you can get it flat you can minimize any reflections Overall its a nice unobtrusive way to keep the kiddos away from those banister gaps',
 50027: 'I live an area of the US where we get summers up to 120 

Which of the following products are represented in the 20 most positive reviews?

In [20]:
X_test["postive_review_probability"] = [x[1] for x in np.asarray(sentiment_model.predict_proba(test_matrix))]
top_20 = list(X_test.sort_values("postive_review_probability", ascending=False)[:20]["name"])
options_list = ["Snuza Portable Baby Movement Monitor",
                "MamaDoo Kids Foldable Play Yard Mattress Topper, Blue",
                "Britax Decathlon Convertible Car Seat, Tiffany",
                "Safety 1st Exchangeable Tip 3 in 1 Thermometer",
                "Twist Breastfeeding Gift Set"
                ]
[x for x in options_list if x in top_20]

['Twist Breastfeeding Gift Set']

Which of the following products are represented in the 20 most negative reviews?

In [21]:
X_test["postive_review_probability"] = [x[0] for x in np.asarray(sentiment_model.predict_proba(test_matrix))]
top_20 = list(X_test.sort_values("postive_review_probability",ascending=False)[:20]["name"])
options_list = ["The First Years True Choice P400 Premium Digital Monitor, 2 Parent Unit",
                "JP Lizzy Chocolate Ice Classic Tote Set",
                "Belkin WeMo Wi-Fi Baby Monitor for Apple iPhone, iPad, and iPod Touch (Firmware Update)",
                "Peg-Perego Tatamia High Chair, White Latte",
                "Safety 1st High-Def Digital Monitor"
                ]
[x for x in options_list if x in top_20]

['Belkin WeMo Wi-Fi Baby Monitor for Apple iPhone, iPad, and iPod Touch (Firmware Update)']

What is the accuracy of the sentiment_model on the test_data?

In [22]:
def get_classification_accuracy(model, data, true_labels):
    pred_y = model.predict(data)
    correct = np.sum(pred_y==true_labels)
    accuracy = round(correct / len(true_labels), 2)
    return accuracy

get_classification_accuracy(sentiment_model, test_matrix, y_test)

0.93

### Simple model (20 words)

In [23]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']


vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 significant words
train_matrix_sub = vectorizer_word_subset.fit_transform(X_train['review_without_punctuation'])
test_matrix_sub = vectorizer_word_subset.transform(X_test['review_without_punctuation'])
# Logistic model fit
simple_model = LogisticRegression(solver='liblinear',n_jobs=1)
simple_model.fit(train_matrix_sub, y_train)

In [24]:
simple_model_coefficient = pd.DataFrame({'word':significant_words, 'simple_model_coefficient':simple_model.coef_.flatten()}).sort_values(['simple_model_coefficient'], ascending=False).reset_index(drop=True)
len(simple_model_coefficient[simple_model_coefficient["simple_model_coefficient"] > 0])

10

In [25]:
simple_model_coefficient.head(20)

Unnamed: 0,word,simple_model_coefficient
0,loves,1.730101
1,perfect,1.520583
2,love,1.358388
3,easy,1.172422
4,great,0.953138
5,well,0.516279
6,little,0.48667
7,able,0.208226
8,old,0.092202
9,car,0.062287


Are the positive words in the simple_model also positive words in the sentiment_model?

In [26]:
simple_model_coefficient = simple_model_coefficient.set_index("word",drop=True)

sentiment_model_coefficient = pd.DataFrame({'word':list(vectorizer.vocabulary_),'sentimental_model_coefficient':sentiment_model.coef_.flatten()}).sort_values(['sentimental_model_coefficient'], ascending=False).reset_index(drop=True)
sentiment_model_coefficient = sentiment_model_coefficient[sentiment_model_coefficient["word"].isin(significant_words)].set_index("word",drop=True)

simple_model_coefficient.join(sentiment_model_coefficient, on="word", how="left")

Unnamed: 0_level_0,simple_model_coefficient,sentimental_model_coefficient
word,Unnamed: 1_level_1,Unnamed: 2_level_1
loves,1.730101,0.008235
perfect,1.520583,-0.513307
love,1.358388,-0.272995
easy,1.172422,0.008235
great,0.953138,-6e-06
well,0.516279,0.000157
little,0.48667,-5.6e-05
able,0.208226,0.358951
old,0.092202,9e-06
car,0.062287,0.02905


Which model (sentiment_model or simple_model) has higher accuracy on the TRAINING set?

In [27]:
print("Sentiment Model: ", get_classification_accuracy(sentiment_model, train_matrix, y_train))
print("Simple Model: ", get_classification_accuracy(simple_model, train_matrix_sub, y_train))

Sentiment Model:  0.97
Simple Model:  0.87


Which model (sentiment_model or simple_model) has higher accuracy on the TEST set?

In [28]:
print("Sentiment Model: ", get_classification_accuracy(sentiment_model, test_matrix, y_test))
print("Simple Model: ", get_classification_accuracy(simple_model, test_matrix_sub, y_test))

Sentiment Model:  0.93
Simple Model:  0.87


Find the accuracy of the majority class classifier model on the test_data.

In [29]:
# Find Majority Class
freq = pd.crosstab(y_test, columns=["count"]).reset_index()
freq

col_0,sentiment,count
0,-1,5278
1,1,28073


In [30]:
# Majority class=1
baseline_model = round(freq[freq["sentiment"]==1]["count"].values[0]/freq["count"].sum(), 2)
print("Baseline Model: ", baseline_model)

Baseline Model:  0.84


In [31]:
my_review = 'Perfect mobile phone for my wife'
predict_review = pd.Series({1: my_review})
predict_matrix = vectorizer.transform(predict_review)
print(sentiment_model.predict_proba(predict_matrix))

[[0.02033068 0.97966932]]
