# ADAM NOWAK 

### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) reviews with the empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) to -1.

In [29]:
#a)
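#fill missing reviews first: astype(str) alone would turn NaN into the literal string "nan"
#then convert to str before using the function
baby_df['review'] = baby_df['review'].fillna("").astype(str).apply(remove_punctuation)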

#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True
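
One side effect visible in the test string above is that words on either side of a removed punctuation mark get fused (`itThis`, `headachesThanks`). A possible variant, which is not part of the exercise, maps each punctuation character to a space instead. The helper name `replace_punctuation_with_space` and the sample sentence are assumptions made for illustration.

```python
import string

def remove_punctuation(text):
    # Same behaviour as the given function: punctuation is deleted.
    return text.translate(str.maketrans('', '', string.punctuation))

def replace_punctuation_with_space(text):
    # Hypothetical variant: every punctuation character becomes a space,
    # so neighbouring words stay separated.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return text.translate(table)

sample = "part from it.This is a must buy book!"  # made-up sample
print(remove_punctuation(sample))
print(replace_punctuation_with_space(sample))
```

Extra spaces are harmless here because CountVectorizer splits on word boundaries anyway.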

In [30]:
#b)
baby_df['review'] = baby_df["review"].fillna("")

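#short test: NaN is the only value that is not equal to itself,
#so this is True only if row 38 no longer holds NaN
baby_df["review"][38] == baby_df["review"][38]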

True

In [31]:
#c)
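#.copy() makes the filtered frame independent, so adding a column later does not raise SettingWithCopyWarning
baby_df = baby_df[baby_df["rating"] != 3].copy()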
#short test:
print(sum(baby_df["rating"] == 3))

0


Reviews with rating 3 are dropped because their sentiment is neutral.

In [32]:
#d) 
#filtered_df = baby_df[baby_df["rating"] > 4]
#print(filtered_df.head())
baby_df["new_rating"] = baby_df["rating"]

baby_df.loc[baby_df["new_rating"] <= 2, 'new_rating'] = -1
baby_df.loc[baby_df["new_rating"] > 3, 'new_rating'] = 1

#short test:

sum(baby_df["new_rating"]**2 != 1)
print(baby_df[['rating', 'new_rating']].head()) 

   rating  new_rating
1       5           1
2       5           1
3       5           1
4       5           1
5       5           1


The test expression evaluates to 0 when every value in `new_rating` is 1 or -1, because squaring either of them gives 1. I create the new column `new_rating` just for clarity.
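
The labeling rule and the squaring check can also be sketched on a plain list. This is a minimal illustration on toy ratings; `rating_to_label` is a hypothetical helper and is not used elsewhere in the notebook.

```python
def rating_to_label(rating):
    # Ratings of 4 or 5 are positive (1), ratings of 1 or 2 are negative (-1).
    # Rating 3 is neutral and is expected to be dropped before this step.
    if rating >= 4:
        return 1
    if rating <= 2:
        return -1
    raise ValueError("neutral rating 3 should have been removed")

ratings = [5, 1, 4, 2, 5]  # toy data, not taken from amazon_baby.csv
labels = [rating_to_label(r) for r in ratings]

# Squaring check: (+1)**2 == (-1)**2 == 1, so any other value stands out.
bad = sum(label ** 2 != 1 for label in labels)
print(labels, bad)
```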

In [33]:
negative_ratings = baby_df[baby_df['new_rating'] == -1]
negative_ratings.head()

Unnamed: 0,name,review,rating,new_rating
21,Nature's Lullabies Second Year Sticker Calendar,I only purchased a secondyear calendar for my ...,2,-1
41,"SoftPlay Giggle Jiggle Funbook, Happy Bear",This bear is absolutely adorable and I would g...,2,-1
50,"SoftPlay Cloth Book, Love",This book is boring Nothing to stimulate my gr...,1,-1
70,Hunnt&reg; Falling Flowers and Birds Kids Nurs...,The reasonSmall sizeHard to apply on the wall ...,1,-1
71,Wall Decor Removable Decal Sticker - Colorful ...,Would not purchase again or recommend The deca...,2,-1


In [43]:
#test
print("Positive rating", sum(baby_df["new_rating"] == 1))
print("Negative rating", sum(baby_df["new_rating"] == -1))

Positive rating 140259
Negative rating 26493


The relabeling worked as expected: every review is now either positive (1) or negative (-1).

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]
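
The counting behind this output can be sketched without scikit-learn. The helpers `build_vocabulary` and `to_count_vector` are hypothetical names, and the tokenization is simplified to lowercase plus whitespace splitting, which is an assumption and not CountVectorizer's exact rule.

```python
from collections import Counter

def build_vocabulary(sentences):
    # Sorted set of lowercase words; each word gets one vector dimension.
    words = {word for s in sentences for word in s.lower().split()}
    return sorted(words)

def to_count_vector(sentence, vocabulary):
    # Count how often each dictionary word occurs in the sentence.
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

sentences = ["We like apples", "We like like apples and oranges"]
vocabulary = build_vocabulary(sentences)
vectors = [to_count_vector(s, vocabulary) for s in sentences]
print(vocabulary)
print(vectors)
```

The word `like` appears twice in the second sentence, so its entry is 2, just as in the matrix printed by CountVectorizer.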


In [8]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary.
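
The one-letter rule comes from CountVectorizer's default `token_pattern`, a regular expression that only matches tokens of two or more word characters. The sketch below applies that pattern with the standard `re` module; the name `KEEP_SHORT_PATTERN` is an assumption introduced here for the alternative.

```python
import re

# Default token pattern of CountVectorizer: words of two or more characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps one-letter words.
KEEP_SHORT_PATTERN = r"(?u)\b\w+\b"

text = "I adore bananas".lower()
default_tokens = re.findall(DEFAULT_PATTERN, text)
all_tokens = re.findall(KEEP_SHORT_PATTERN, text)
print(default_tokens)
print(all_tokens)
```

Passing the second pattern as `token_pattern` when creating the CountVectorizer should keep words such as "I" in the dictionary.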

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [46]:
#a)
from sklearn.model_selection import train_test_split

# Split dataset into features (X) and target (y)
X = baby_df['review'].values
y = baby_df['new_rating'].values  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train[:2])


['You cannot imagine how much money youll save and it doesnt smell AT ALL You can use any kitchen size garbage bags without a problem and use it for A WEEK without having to change the bag ITS WONDERFULL'
 'I wanted to point out a couple things I didnt realize about the Kensington print in particular for others who are ordering and picking a colorprint  First it is much lighter than some of the other options which really helps when youre nursing and its hot outside my friends who have the solid color ones say that their babies are much warmer it gets a little warmer under there obviously with less air flow and seem uncomfortable compared to my little guy who seems relatively cool and happy Second and this is the main thing my baby loves seeing the print on the underside of the cover while hes nursing He loves the contrast between the light background and the blue and green vines which are visible even from the underside Added bonus  Finally Ive gotten several compliments on the pattern

In [50]:
#b)
from sklearn.feature_extraction.text import CountVectorizer

#testing if its working correctly
vectorizer_test = CountVectorizer()
X_train_vectors_test = vectorizer_test.fit_transform(X_train[:2])
print(vectorizer_test.get_feature_names_out()) 

#transformation
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test) #mapping the test data into the same vector space (no refitting)

print(vectorizer.get_feature_names_out())
print(X_train_vectors.shape)

['about' 'added' 'air' 'all' 'and' 'any' 'are' 'asked' 'at' 'babies'
 'baby' 'background' 'bag' 'bags' 'because' 'between' 'blue' 'bonus' 'can'
 'cannot' 'change' 'color' 'colorprint' 'compared' 'compliments'
 'contrast' 'cool' 'could' 'couple' 'cover' 'didnt' 'doesnt' 'dress'
 'even' 'finally' 'first' 'flow' 'for' 'friends' 'from' 'garbage' 'get'
 'gets' 'gotten' 'green' 'guy' 'happy' 'have' 'having' 'he' 'helps' 'hes'
 'hot' 'how' 'imagine' 'in' 'is' 'it' 'its' 'ive' 'kensington' 'kitchen'
 'less' 'light' 'lighter' 'liked' 'little' 'loves' 'main' 'make'
 'material' 'me' 'money' 'much' 'my' 'nursing' 'obviously' 'of' 'on' 'one'
 'ones' 'options' 'ordering' 'other' 'others' 'out' 'outside' 'particular'
 'pattern' 'picking' 'point' 'print' 'problem' 'realize' 'really'
 'relatively' 'save' 'say' 'second' 'seeing' 'seem' 'seems' 'several'
 'she' 'size' 'smell' 'so' 'solid' 'some' 'stopped' 'than' 'that' 'the'
 'their' 'there' 'thing' 'things' 'this' 'to' 'uncomfortable' 'under'
 'undersid

As the output shows, the feature names are extracted correctly, so fit_transform (which turns the training data into a count matrix) works as intended.

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [51]:
#a)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize LogisticRegression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vectors, y_train)
y_pred = model.predict(X_test_vectors)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')


Accuracy: 0.9329


In [52]:
model.predict(vectorizer.transform(['disappointed']))[0]

-1

In [53]:
model.predict(vectorizer.transform(['perfect']))[0]

1

In [63]:
#b)
print(model.coef_,"\n") #printing the coefficients of every word

coefficients = model.coef_[0]
feature_names = vectorizer.get_feature_names_out()

#zip coefficients with words (result = list of tuples) and sort in descending order to get the most positive ones
#the lambda sets the sorting key, which here is the coefficient value of each word
most_positive = sorted(zip(coefficients, feature_names), key=lambda x: x[0], reverse=True)
print("Top 10 Most Positive Words:")
for coef, word in most_positive[:10]:
    print(f"{coef}: {word}")

print("\n")
#same pairs sorted in ascending order to get the most negative ones
most_negative = sorted(zip(coefficients, feature_names), key=lambda x: x[0], reverse=False)
print("Top 10 Most Negative Words:")
for coef, word in most_negative[:10]:
    print(f"{coef}: {word}")

#hint: model.coef_, vectorizer.get_feature_names_out()

[[ 0.00045216  0.00858757  0.00624888 ...  0.00899608  0.00379664
  -0.00010044]] 

Top 10 Most Positive Words:
2.19140169878936: lifesaver
2.13701171530884: minor
2.0727530441180084: con
2.0315468023034486: skeptical
2.004349460490706: ply
2.0028946103789083: saves
1.9900021026894708: thankful
1.9331071335532408: rich
1.899027579850039: wonderfully
1.8746816226030407: hinder


Top 10 Most Negative Words:
-2.826389088337347: dissapointed
-2.7545931318772556: worthless
-2.611657494104371: worst
-2.569384876023774: useless
-2.527244518929706: poorly
-2.466253947504001: disappointing
-2.2487688955854543: unusable
-2.2141949030859354: disappointed
-2.195760912796035: poor
-2.1679255334604024: unacceptable


Above are the 10 words with the most positive coefficients and the 10 words with the most negative coefficients.
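
Both rankings can be taken from a single ascending sort by slicing its two ends. The sketch below uses made-up coefficients, and `top_words` is a hypothetical helper rather than part of the notebook.

```python
def top_words(coefficients, feature_names, k):
    # Pair each word with its coefficient and sort once, ascending.
    ranked = sorted(zip(coefficients, feature_names))
    most_negative = ranked[:k]
    most_positive = ranked[::-1][:k]
    return most_positive, most_negative

# Toy values for illustration only (not the fitted model's coefficients).
toy_coefficients = [1.2, -2.1, 0.1, -0.4, 2.0]
toy_words = ["great", "worthless", "car", "even", "lifesaver"]

positive, negative = top_words(toy_coefficients, toy_words, k=2)
print(positive)
print(negative)
```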

## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [96]:
#a)
y_pred = model.predict(X_test_vectors)
print("Predicted Sentiments for test data:", y_pred[:10])


Predicted Sentiments for test data: [ 1 -1 -1  1  1  1  1  1  1 -1]


This cell predicts the sentiment of each test review: positive (1) or negative (-1).

In [73]:
#b)
y_pred_prob = model.predict_proba(X_test_vectors)
print("Predicted Probabilities for test data: \n", y_pred_prob[:5])
#hint: model.predict_proba()

Predicted Probabilities for test data: 
 [[0.46685486 0.53314514]
 [0.78368871 0.21631129]
 [0.97120749 0.02879251]
 [0.00176079 0.99823921]
 [0.00164759 0.99835241]]


Now we are interested in the probability of each class, that is, how likely it is that a review belongs to class -1 or class 1. This tells us how confident the model is in its predictions.  
`model.predict_proba()` returns a 2D array with one row per review and one column per class, in the order of `model.classes_` (here -1, then 1).
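
For a binary logistic regression the two columns follow from a single decision score through the sigmoid function. The sketch below illustrates that relationship with hypothetical scores; `predict_proba_binary` and `predict_label` are illustrative names, not scikit-learn functions.

```python
import math

def predict_proba_binary(score):
    # P(class 1) is the sigmoid of the decision score;
    # P(class -1) is the remainder.
    p_positive = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_positive, p_positive]  # column order: [-1, 1]

def predict_label(score):
    # The predicted class is 1 exactly when P(class 1) exceeds 0.5,
    # i.e. when the score is positive.
    return 1 if score > 0 else -1

for score in (-3.0, 0.0, 3.0):  # hypothetical scores
    print(score, predict_proba_binary(score), predict_label(score))
```

A probability close to 0.5 in both columns therefore marks a review the model is unsure about.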

In [79]:
#c)
#finding the reviews with the highest and the lowest predicted probability of the positive class (1)
reviews_with_probabilities = [(review, prob[1]) for review, prob in zip(X_test, y_pred_prob)]
positive_reviews = sorted(reviews_with_probabilities, key=lambda x: x[1], reverse=True)[:5]
negative_reviews = sorted(reviews_with_probabilities, key=lambda x: x[1])[:5]

#printing results
print("Top 5 Most Positive Reviews:")
for i, (review, prob) in enumerate(positive_reviews, 1):
    print(f"{i}. Review: {review}\n   Probability of Positive: {prob:.4f}\n")

print("\nTop 5 Most Negative Reviews:")
for i, (review, prob) in enumerate(negative_reviews, 1):
    print(f"{i}. Review: {review}\n   Probability of Positive: {prob:.4f}\n")


Top 5 Most Positive Reviews:
1. Review: BOTTOM LINE I would buy this again in a heartbeat and I would buy it over any other travel crib or pack n play I have been so impressed with this crib 100 worth every penny It has been used every night for seven months and shows no signs of wearMy husband and I bought this crib as our only crib for our first baby We were nervous because it was such a large purchase for us This is the most expensive thing we have purchased besides our car After extensive research we decided on the babybjorn because my husband is in graduate school and we will be doing a lot of moving in the next few years She has slept in it since she was three days old and she is now seven months and the crib looks and functions as if it was newOur BabyBjorn has been in six states and five countries We have checked it at airports all over Europe and the US We have checked it in its carrying case alone and it has not been damaged in anyway It has been on trains buses metros trams 

In [82]:
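#d)
#the class with the higher probability is the prediction; negatives must be encoded as -1 to match y_test
y_pred_prob2 = [1 if prob[0] < prob[1] else -1 for prob in y_pred_prob]
accuracy2 = accuracy_score(y_test, y_pred_prob2)
print(f'Accuracy: {accuracy2:.4f}')

Accuracy: 0.9329


Picking the class with the higher probability reproduces the output of model.predict, so the accuracy equals the 0.9329 obtained in Exercise 3. The model classifies most test reviews correctly.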
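
Accuracy counts only exact label matches, so the predictions have to use the same -1/1 encoding as `y_test`. The sketch below, on toy labels with a hypothetical `accuracy` helper, shows how writing negatives as 0 silently lowers the score.

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where prediction and true label agree.
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

y_true = [1, -1, 1, -1, 1]        # toy labels in the -1/1 encoding

same_encoding = [1, -1, 1, 1, 1]  # one real mistake
zero_encoding = [1, 0, 1, 1, 1]   # same predictions, negatives written as 0

acc_same = accuracy(y_true, same_encoding)
acc_zero = accuracy(y_true, zero_encoding)
print(acc_same)
print(acc_zero)
```

With the mismatched encoding every correctly predicted negative review is counted as an error.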

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [83]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [84]:
#a)
#exercise 2 - everything the same except vocabulary
vectorizer2 = CountVectorizer(vocabulary=significant_words) 
X_train_vectors2 = vectorizer2.fit_transform(X_train)
X_test_vectors2 = vectorizer2.transform(X_test)
print(vectorizer2.get_feature_names_out())

['love' 'great' 'easy' 'old' 'little' 'perfect' 'loves' 'well' 'able'
 'car' 'broke' 'less' 'even' 'waste' 'disappointed' 'work' 'product'
 'money' 'would' 'return']


In [95]:
#a) exercise 3 - everything the same except vocabulary
model2 = LogisticRegression(max_iter=1000)
model2.fit(X_train_vectors2, y_train)
y_pred2 = model2.predict(X_test_vectors2)
accuracy = accuracy_score(y_test, y_pred2)
print(f'Accuracy: {accuracy:.4f}')

Accuracy: 0.8690


In [88]:
#a) exercise 4 - everything the same except vocabulary
y_pred_prob_limited = model2.predict_proba(X_test_vectors2)
print(y_pred_prob_limited[:5])

[[0.07624749 0.92375251]
 [0.21395717 0.78604283]
 [0.21395717 0.78604283]
 [0.0109731  0.9890269 ]
 [0.04805289 0.95194711]]


In [91]:
#b) the same thing was done in the exercise above but here it is on the new vocabulary
feature_names = vectorizer2.get_feature_names_out()
coefficients2 = model2.coef_[0]

word_impact_list = list(zip(coefficients2, feature_names))
for c,w in word_impact_list:
    print(f"{w}: {c:.4f}")

love: 1.3590
great: 0.9309
easy: 1.1932
old: 0.0734
little: 0.5024
perfect: 1.5151
loves: 1.6850
well: 0.4962
able: 0.1933
car: 0.0745
broke: -1.6806
less: -0.2016
even: -0.4897
waste: -1.9796
disappointed: -2.3988
work: -0.6356
product: -0.3137
money: -0.9464
would: -0.3422
return: -2.0928


In [98]:
#c)
from sklearn.metrics import accuracy_score

#time of evaluation and predictions accuracy on the full data set and dictionary 
print("Prediction time (full) measurement using %time: \n")
%time predictions = model.predict(X_test_vectors)

accuracy = accuracy_score(y_test, y_pred)
print(f"(full) Model accuracy: {accuracy:.4f}")

print("\n(full) measurement using %timeit")
%timeit model.predict(X_test_vectors)

print("\n")

#time evaluation and prediction accuracy of the LIMITED dictionary 
print("Prediction time (limited) measurement using %time: \n")
%time predictions = model2.predict(X_test_vectors2)

accuracy2 = accuracy_score(y_test, y_pred2)
print(f"(limited) Model accuracy: {accuracy2:.4f}")

print("\n(limited) measurement using %timeit")
%timeit model2.predict(X_test_vectors2)

#hint: %time, %timeit

Prediction time (full) measurement using %time: 

CPU times: user 4.35 ms, sys: 9.4 ms, total: 13.7 ms
Wall time: 11.9 ms
(full) Model accuracy: 0.9329

(full) measurement using %timeit
3.22 ms ± 112 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Prediction time (limited) measurement using %time: 

CPU times: user 804 μs, sys: 279 μs, total: 1.08 ms
Wall time: 969 μs
(limited) Model accuracy: 0.8690

(limited) measurement using %timeit
608 μs ± 12.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


1. Prediction time is shorter for the model with the limited dictionary (about 608 μs vs. 3.22 ms per call), which is expected because it uses only 20 features.
2. `%timeit` reports the mean time per call over repeated runs, so it is more reliable than the single measurement from `%time`.
3. Accuracy is higher with the full dictionary (0.9329 vs. 0.8690).
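
`%time` and `%timeit` are IPython magics and only work inside a notebook. In a plain script a similar measurement can be made with the standard `timeit` module. The sketch below uses a hypothetical stand-in, `predict_many`, instead of the fitted models, so the timings only illustrate how cost grows with the number of features.

```python
import timeit

def predict_many(weights, rows):
    # Stand-in for model.predict: a dot product and a sign per row.
    return [1 if sum(w * x for w, x in zip(weights, row)) > 0 else -1
            for row in rows]

small = ([0.5] * 20, [[1] * 20 for _ in range(100)])      # 20 features
large = ([0.5] * 1000, [[1] * 1000 for _ in range(100)])  # 1000 features

# 3 repetitions of 5 calls each; report the best time per call.
t_small = min(timeit.repeat(lambda: predict_many(*small), number=5, repeat=3)) / 5
t_large = min(timeit.repeat(lambda: predict_many(*large), number=5, repeat=3)) / 5
print(f"20 features:   {t_small * 1e3:.3f} ms per call")
print(f"1000 features: {t_large * 1e3:.3f} ms per call")
```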

The limited dictionary lowers the accuracy but also shortens the evaluation time.
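
To put both accuracies in context, it helps to compare them with a trivial model that always predicts the larger class. The sketch below uses the class counts printed in Exercise 1. Those counts cover the whole filtered dataset, so the share in the test split may differ slightly; treat the baseline as an approximation.

```python
# Class counts printed in Exercise 1 (full filtered dataset, before the split).
count_majority = 140259
count_minority = 26493

# Accuracy of a trivial model that always predicts the larger class.
baseline = count_majority / (count_majority + count_minority)
print(f"Majority-class baseline: {baseline:.4f}")

# Accuracies reported in the notebook.
for name, acc in [("full dictionary", 0.9329), ("limited dictionary", 0.8690)]:
    print(f"{name}: {acc:.4f} ({acc - baseline:+.4f} vs. baseline)")
```

Under this assumption the 20-word model is only a few percentage points above the baseline, while the full dictionary gains roughly nine.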