### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    # if the function receives a non-string (e.g. np.nan), return an empty string
    if not isinstance(text, str):
        return ''

    # build a translation table that deletes every punctuation character
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

| | name | review | rating |
|---|---|---|---|
| 0 | Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3 |
| 1 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 |
| 2 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 |
| 3 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase. I ... | 5 |
| 4 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 |


## Exercise 1 (data preparation)
a) Remove punctuation from the reviews using the given function.   
b) Replace all missing (nan) reviews with an empty string "".  
c) Drop all entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and all negative ($\leq$2) ratings to -1.

In [2]:
#a)
baby_df['review'] = baby_df['review'].apply(remove_punctuation)

#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

In [3]:
#b)
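# assign the result back instead of calling replace(..., inplace=True) on the column,
# which may act on a temporary copy in recent pandas versions
baby_df['review'] = baby_df['review'].fillna('')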

#short test:
baby_df["review"][38] == baby_df["review"][38]
print(baby_df['review'][38])




In [4]:
#c)
baby_df.drop(baby_df[baby_df['rating'] == 3].index, inplace=True)

#short test:
sum(baby_df["rating"] == 3)

0

In [5]:
#d) 
baby_df.loc[baby_df.rating <= 2, 'rating'] = -1
baby_df.loc[baby_df.rating >= 4, 'rating'] =  1

#short test:
sum(baby_df["rating"]**2 != 1)

0

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.
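
To make the representation concrete, here is a minimal pure-Python sketch of the same counting idea. The helper `bag_of_words` and the toy sentences are illustrative assumptions and not part of scikit-learn; `CountVectorizer` additionally handles tokenization details and sparse storage.

```python
from collections import Counter

def bag_of_words(sentences):
    """Return (vocabulary, vectors) for a list of sentences (hypothetical helper)."""
    # Lowercase and split on whitespace; a deliberately simple tokenizer.
    tokenized = [s.lower().split() for s in sentences]
    # The dictionary: every distinct word, sorted alphabetically.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One dimension per dictionary word; the value is its count in the sentence.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["We like apples", "We like like apples and oranges"])
print(vocabulary)
print(vectors)
```

Each vector has one entry per dictionary word, so the second sentence gets a 2 in the position of "like".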

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)
print(vectorizer.vocabulary_)

print(X_train_example.todense())



{'we': 9, 'like': 6, 'apples': 2, 'hate': 5, 'oranges': 7, 'adore': 0, 'bananas': 3, 'and': 1, 'they': 8, 'dislike': 4}
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [7]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)
print(vectorizer.vocabulary_)

print(X_test_example.todense())

{'we': 9, 'like': 6, 'apples': 2, 'hate': 5, 'oranges': 7, 'adore': 0, 'bananas': 3, 'and': 1, 'they': 8, 'dislike': 4}
[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words which are not in its dictionary.
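
The three facts above can be checked directly on a tiny corpus. This is a sketch under assumptions: the corpus is made up, and the alternative `token_pattern` shown is just one possible pattern that keeps one-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["We like apples", "I adore bananas"]

# Default settings: tokens need at least two characters, so "I" is dropped.
default_vec = CountVectorizer().fit(corpus)
print("i" in default_vec.vocabulary_)

# Assumed alternative pattern that also keeps one-character tokens.
short_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(corpus)
print("i" in short_vec.vocabulary_)

# Word order does not matter: both sentences map to the same vector.
a = default_vec.transform(["we like apples"]).toarray()
b = default_vec.transform(["apples like we"]).toarray()
print((a == b).all())

# "love" was never seen during fit, so only "we" and "bananas" are counted.
unseen = default_vec.transform(["We love bananas"]).toarray()
print(unseen.sum())
```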

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [8]:
#a)
from sklearn.model_selection import train_test_split
review_train, review_test, rating_train, rating_test = train_test_split(baby_df.review ,baby_df.rating, train_size=0.8)

Splitting the `review` and `rating` columns into training and test sets
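
The split in a) is drawn at random, so the numbers further down change from run to run. A possible variant, sketched on toy data: `random_state` and `stratify` are existing `train_test_split` parameters, while the toy lists and the seed value 0 are assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for baby_df.review / baby_df.rating (illustrative only).
reviews = [f"review {i}" for i in range(10)]
ratings = [1] * 8 + [-1] * 2

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
split_a = train_test_split(reviews, ratings, train_size=0.5, random_state=0, stratify=ratings)
split_b = train_test_split(reviews, ratings, train_size=0.5, random_state=0, stratify=ratings)

train_x, test_x, train_y, test_y = split_a
print(sorted(train_y), sorted(test_y))
print(split_a == split_b)
```

With the same seed both calls return the same partition, and each half contains one of the two negative examples.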

In [9]:
#b)
vectorizer = CountVectorizer()
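# Note: fitting on all reviews also puts test-only words into the dictionary;
# fitting on review_train alone would keep the test set fully unseen.
vectorizer.fit(baby_df.review)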

review_train_T = vectorizer.transform(review_train)
review_test_T = vectorizer.transform(review_test)

Transforming the `review` training and test sets into vectors

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [10]:
#a)
model = LogisticRegression()
model.fit(review_train_T, rating_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Train and fit model
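
The warning printed above means the solver stopped at its iteration limit before converging. One of the remedies it suggests is raising `max_iter`. A minimal sketch on toy data follows; the texts, labels, and the value 1000 are assumptions, not tuned settings for this dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy reviews standing in for the real training data (illustrative only).
texts = ["love it", "great product", "waste of money", "broke quickly"]
labels = [1, 1, -1, -1]

X = CountVectorizer().fit_transform(texts)

# max_iter=1000 is an arbitrary, assumed budget; the default is 100.
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# n_iter_ reports how many iterations the solver actually used.
print(model.n_iter_, model.max_iter)
```

If `n_iter_` stays below `max_iter`, the solver converged within its budget and the warning does not appear.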

In [11]:
#b)
indices = np.argsort(model.coef_)
words = np.array(vectorizer.get_feature_names_out())

print('Top 10 positive words: ', words[indices[0, -10:]])
print('Top 10 negative words: ', words[indices[0, :10]])
#hint: model.coef_, vectorizer.get_feature_names_out()

Top 10 positive words:  ['helped' 'perfect' 'highly' 'negative' 'plenty' 'lifesaver' 'satisfied'
 'pleased' 'worry' 'excellent']
Top 10 negative words:  ['worst' 'disappointing' 'poorly' 'worthless' 'concept' 'disappointment'
 'horrible' 'useless' 'disappointed' 'theory']


Sorting the indices by model coefficient and printing the 10 most positive and 10 most negative words
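
The indexing trick is easier to follow on a small example. In this sketch the word list and coefficient values are made up; they only stand in for `vectorizer.get_feature_names_out()` and `model.coef_`. Note that a slice such as `order[-2:]` is in ascending order, so the strongest word comes last unless the slice is reversed.

```python
import numpy as np

# Toy stand-ins for the feature names and for model.coef_ (shape (1, n_words)).
words = np.array(["awful", "fine", "great", "okay", "terrible"])
coef = np.array([[-1.5, 0.1, 2.0, 0.0, -2.2]])

# argsort returns indices that order the coefficients from smallest to largest.
order = np.argsort(coef[0])

most_negative = words[order[:2]]         # smallest coefficients first
most_positive = words[order[-2:][::-1]]  # reversed so the largest comes first
print(most_negative, most_positive)
```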

## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [12]:
#a)
pred = model.predict(review_test_T)

In [13]:
#b)
prob = model.predict_proba(review_test_T)
#hint: model.predict_proba()

In [15]:
#c)
positive = np.argsort(prob[:,1])[-5:]
print(review_test.iloc[positive])

negative = np.argsort(prob[:,0])[-5:]
print(review_test.iloc[negative])
#hint: use the results of b)

103297    I was very excited when I heard Chicco was fin...
135152    Weve been using Britax for our boy now 14 mont...
123632    I did a TON of research before I purchased thi...
161127    My dad just bought this car seat for memy son ...
21557     Ok I read all the reviews already posted here ...
Name: review, dtype: object
16042     We have not had ANY luck with FisherPrice prod...
10370     This product should be in the hall of fame sol...
89902     I am so incredibly disappointed with the strol...
120219    I have NEVER written a review before for anyth...
175191    I had to return this stroller for three reason...
Name: review, dtype: object
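
The columns of `predict_proba` follow the order of `model.classes_`, which is why column 0 corresponds to -1 and column 1 to 1 in the cell above. A small sketch of looking the columns up instead of hard-coding them; the arrays here are made-up stand-ins for `model.classes_` and the probability matrix.

```python
import numpy as np

# Toy stand-ins for model.classes_ and model.predict_proba(...) (illustrative only).
classes = np.array([-1, 1])
prob = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.4, 0.6]])

# Look up the column of each label instead of hard-coding 0 and 1.
pos_col = int(np.where(classes == 1)[0][0])
neg_col = int(np.where(classes == -1)[0][0])

# Row index of the most confidently positive / negative example.
most_positive = int(np.argmax(prob[:, pos_col]))
most_negative = int(np.argmax(prob[:, neg_col]))
print(pos_col, neg_col, most_positive, most_negative)
```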


In [16]:
#d) 
print(sum(pred == rating_test) / len(rating_test))

0.9295973134238853
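
An accuracy value is easier to judge next to a baseline that always predicts the most frequent class. The sketch below uses made-up labels and predictions in place of `rating_test` and `pred`; it illustrates the calculation and says nothing about the real class balance.

```python
import numpy as np

# Toy labels and predictions standing in for rating_test and pred (illustrative only).
y_true = np.array([1, 1, 1, -1, 1, -1, 1, 1])
y_pred = np.array([1, 1, 1, -1, 1, 1, 1, 1])

# Accuracy is the fraction of matching labels.
accuracy = np.mean(y_pred == y_true)

# Baseline: always predict the most frequent class in y_true.
values, counts = np.unique(y_true, return_counts=True)
majority = values[np.argmax(counts)]
baseline = np.mean(y_true == majority)

print(accuracy, baseline)
```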


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of each word from the dictionary.   
c) Compare the accuracy of predictions and the time of evaluation.

In [17]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [19]:
#a)
X_train, X_test, y_train, y_test = train_test_split(baby_df.review,baby_df.rating, train_size=0.8)

vectorizer = CountVectorizer()
vectorizer.fit(significant_words)

X_train_T_sig = vectorizer.transform(X_train.values)
X_test_T_sig = vectorizer.transform(X_test.values)

model_sig = LogisticRegression()
model_sig.fit(X_train_T_sig, y_train)

indices = np.argsort(model_sig.coef_)
words = np.array(vectorizer.get_feature_names_out())
print("Top 10 positive words: ", words[indices[0,-10:]])
print("Top 10 negative words: ", words[indices[0,:10]])

y_pred = model_sig.predict(X_test_T_sig)
y_pred_proba = model_sig.predict_proba(X_test_T_sig)

positive = np.argsort(y_pred_proba[:,1])[-5:]
negative = np.argsort(y_pred_proba[:,0])[-5:]
print(X_test.iloc[positive])
print(X_test.iloc[negative])

print('Accuracy of prediction: ', sum(y_pred == y_test) / len(y_test))

Top 10 positive words:  ['car' 'old' 'able' 'well' 'little' 'great' 'easy' 'love' 'perfect'
 'loves']
Top 10 negative words:  ['disappointed' 'return' 'waste' 'broke' 'money' 'work' 'even' 'would'
 'product' 'less']
103297    I was very excited when I heard Chicco was fin...
134265    We bought this stroller after selling our belo...
170607    PROS1 Good to grow with a toddler Its perfect ...
25525     We bought this stroller about 2 weeks ago I ab...
116072    Ive posted an UPDATE at the endFirst let me st...
Name: review, dtype: object
24325     This never worked right and I tried to return ...
77915     This review is about Seller only Seller would ...
35763     Day 1 Assembled it Had it up and running playi...
168391    I loved all the features of the car seat  It i...
2186      This is a long review but if you read the whol...
Name: review, dtype: object
Accuracy of prediction:  0.8669005427123625


In [20]:
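#b)
# Use model_sig here: model.coef_[0] belongs to the full-dictionary model, and zip
# would silently pair these 20 words with its first 20 coefficients.
# The values printed below come from the original model.coef_[0] call; rerun to refresh them.
coef_tuple = zip(vectorizer.get_feature_names_out(), model_sig.coef_[0])
for word, coef in coef_tuple:
    print(f"{word}: {coef}")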

able: 0.0013279435149441358
broke: 0.005287831260236116
car: 0.012551652157347245
disappointed: 0.001993154306148187
easy: 7.092650322974392e-06
even: 0.000604974931756302
great: 0.045090823299666925
less: 0.0898576535228941
little: -0.0055539240616342125
love: 0.017917659073607442
loves: -0.0006839304637590155
money: -0.026762579305358573
old: 0.11101281206457021
perfect: -0.0057083420758347134
product: 0.0
return: 7.662710542763792e-05
waste: -0.009547348473792915
well: 0.0011219703707157926
work: 0.0
would: 0.0


In [21]:
#c)
import sys, time

In [22]:
%%time
%%timeit
model_sig.predict(X_test_T_sig)

#hint: %time, %timeit

792 µs ± 53.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
CPU times: total: 5.66 s
Wall time: 6.51 s


In [23]:
print('Accuracy of prediction: ', sum(y_pred == y_test) / len(y_test))

Accuracy of prediction:  0.8669005427123625


In [24]:
%%time
%%timeit
model.predict(review_test_T)

11.2 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
CPU times: total: 7.8 s
Wall time: 9.15 s


In [25]:
print('Accuracy of prediction: ', sum(pred == rating_test) / len(rating_test))

Accuracy of prediction:  0.9295973134238853


As we can see, accuracy is higher when we do not limit the dictionary (about 0.93 versus 0.87), but prediction with the limited dictionary is over 10 times faster (roughly 0.8 ms versus 11 ms per call), so limiting the dictionary is sometimes worth considering. Note that the two models were evaluated on different random splits, so the accuracy comparison is approximate.
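
Outside a notebook, where the `%%timeit` magic is not available, a similar comparison can be made with the standard `timeit` module. In this sketch the two functions are placeholder workloads, not the real `predict` calls, and the repeat counts are arbitrary assumptions.

```python
import timeit

def predict_small():
    # Placeholder workload standing in for model_sig.predict(X_test_T_sig).
    return sum(range(1_000))

def predict_large():
    # Placeholder workload standing in for model.predict(review_test_T).
    return sum(range(20_000))

# 5 repeats of 200 calls each; take the best repeat to reduce noise.
small = min(timeit.repeat(predict_small, number=200, repeat=5)) / 200
large = min(timeit.repeat(predict_large, number=200, repeat=5)) / 200

print(f"small: {small * 1e6:.1f} µs per call")
print(f"large: {large * 1e6:.1f} µs per call")
print(f"ratio: {large / small:.1f}x")
```

Taking the minimum over repeats is a common convention because slower runs mostly reflect interference from other processes.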