### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please, download the dataset amazon_baby.csv.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (NaN) reviews with the empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.  
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) to -1.

In [2]:
#b)
baby_df['review'] = baby_df['review'].fillna("")

#short test:
baby_df["review"][38] == baby_df["review"][38]

True

### Comment
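I changed the order of operations while preparing the data. Missing values have to be replaced with empty strings first, because `remove_punctuation` expects a string and would fail on a missing (NaN) value. The short test relies on the fact that NaN is not equal to itself, so the comparison returns True only after the value has been filled.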

Here we see that data cleansing is essential because, regardless of how sophisticated your problem or algorithm is, you can’t obtain good results from bad data.
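To make the reason for this ordering concrete, here is a minimal sketch in plain Python. It assumes a missing review behaves like a float NaN (which is how pandas usually stores it); the variable `missing` is only a stand-in and does not come from the dataset.

```python
import math
import string

def remove_punctuation(text):
    # Same helper as defined at the top of the notebook
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

missing = float("nan")  # stand-in for a missing review (assumption)

# NaN is not equal to itself, so `x == x` is False for a missing value
nan_equals_itself = (missing == missing)

# A float has no .translate() method, so cleaning a missing value fails
try:
    remove_punctuation(missing)
    cleaning_failed = False
except AttributeError:
    cleaning_failed = True

# Filling first (what fillna("") does for the whole column) makes cleaning safe
filled = "" if isinstance(missing, float) and math.isnan(missing) else missing
cleaned = remove_punctuation(filled)

print(nan_equals_itself, cleaning_failed, repr(cleaned))
```

This is why step b) has to run before step a).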

In [3]:
#a)
baby_df['review'] = baby_df['review'].apply(remove_punctuation)

#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

### Comment
In this step, I applied `remove_punctuation` to every review in the dataset to strip the punctuation marks.

In [4]:
#c)
print(f"Max rating [{baby_df.rating.max()}]")
print(f"Min rating [{baby_df.rating.min()}]")

baby_df = baby_df[baby_df.rating != 3]

#short test:
sum(baby_df["rating"] == 3)

Max rating [5]
Min rating [1]


0

### Comment
In this step, I first checked the range of the ratings: the lowest is 1 and the highest is 5. Then I removed all records with a rating of 3, because they are neutral and we will not use them.

In [5]:
#d) 
def set_rate(rate):
    if rate <= 2:
        return -1
    if rate >= 4:
        return 1

baby_df['rating'] = baby_df['rating'].apply(set_rate)
    
#short test:
sum(baby_df["rating"]**2 != 1)

0

### Comment
In this step, I assigned all positive ratings 1 and negative ratings -1.
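The same mapping can also be written without a helper function. The sketch below is one possible alternative, not the approach used above; it assumes the neutral ratings have already been dropped and uses a small made-up array of ratings.

```python
import numpy as np

# Toy ratings for illustration only; neutral 3s are assumed to be removed already
ratings = np.array([1, 2, 4, 5, 5, 1])

# Ratings of 4 or 5 become 1, everything else (here only 1 and 2) becomes -1
labels = np.where(ratings >= 4, 1, -1)

print(labels)
```

On a DataFrame column the same call would take `baby_df['rating']` in place of the toy array.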

In [6]:
baby_df.head(25).tail(10)

Unnamed: 0,name,review,rating
17,Nature's Lullabies Second Year Sticker Calendar,This was the only calender I could find for th...,1
18,Nature's Lullabies Second Year Sticker Calendar,I completed a calendar for my sons first year ...,1
19,Nature's Lullabies Second Year Sticker Calendar,We wanted to get something to keep track of ou...,1
20,Nature's Lullabies Second Year Sticker Calendar,I had a hard time finding a second year calend...,1
21,Nature's Lullabies Second Year Sticker Calendar,I only purchased a secondyear calendar for my ...,-1
22,Nature's Lullabies Second Year Sticker Calendar,I LOVE this calendar for recording events of m...,1
24,Nature's Lullabies Second Year Sticker Calendar,Wife loves this calender Comes with a lot of s...,1
25,Nature's Lullabies Second Year Sticker Calendar,My daughter had her 1st baby over a year ago S...,1
26,Baby's First Journal - Green,Extremely useful As a new mom tired and inexpe...,1
28,"Lamaze Peekaboo, I Love You",One of babys first and favorite books and it i...,1


### Comment
Our data has been cleaned and prepared for further analysis and testing.

## CountVectorizer
In order to analyze strings, we need to represent them numerically. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.
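As a rough illustration of this representation, the sketch below builds the count vectors by hand in plain Python. It is only an approximation of what CountVectorizer does (lowercasing and keeping tokens of two or more characters are assumptions about its default behaviour), and the helper names `tokenize` and `to_count_vector` are made up for this example.

```python
import re
from collections import Counter

reviews = ["We like apples",
           "We hate oranges",
           "I adore bananas",
           "We like like apples and oranges",
           "They dislike bananas"]

def tokenize(text):
    # Lowercase and keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# The dictionary: every distinct token, in alphabetical order
vocabulary = sorted({token for review in reviews for token in tokenize(review)})

def to_count_vector(text):
    # One dimension per dictionary word; words outside the dictionary are dropped
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

matrix = [to_count_vector(review) for review in reviews]

print(vocabulary)
for row in matrix:
    print(row)
```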

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [8]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should note a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary.

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [9]:
#a)
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(baby_df, test_size=0.3, random_state=44)

In [10]:
#b)
vectorizer = CountVectorizer()

x_train = vectorizer.fit_transform(train_df['review'])
x_test = vectorizer.transform(test_df['review'])

y_train = train_df['rating']
y_test = test_df['rating']
vectorizer.get_feature_names_out()

array(['00', '000', '001', ..., 'zzzzzz', 'zzzzzzz', 'zzzzzzzzzzz'],
      dtype=object)

### Comment
First, I split the data into training and test sets. I assigned 30% of the data to the test set. A split closer to 80/20 is more common, but I wanted to see how the model copes with less training data.

In the next step, I learned the vocabulary from the training data and built its document-term matrix (both in one step, using fit_transform). Finally, I transformed the test data with the same fitted vectorizer.

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [11]:
#a)
model = LogisticRegression(solver='sag', max_iter=200)
model.fit(x_train, y_train)



### Comment
The first step towards predicting ratings was to train a logistic regression model on the training data. In this model the dependent variable takes only two values (that is why we converted the ratings to 1 and -1).

As the dataset is fairly large, I used the 'sag' solver, which is better suited to large amounts of data than the default one. I also raised the maximum number of iterations, after which training stops even if the solver has not converged.

As the amount of training data and the complexity of the algorithms grow, the running time of such models can also increase significantly. For example, this model took a minute to train, but many business models used in companies can take several hours.

In [12]:
#b)
all_names = vectorizer.get_feature_names_out()
all_coefs = model.coef_[0]

paired_coefs = list(zip(all_coefs, all_names))
sorted_coefs = sorted(paired_coefs, key=lambda x: x[0])

print(f"Most positive words: {[x[1] for x in sorted_coefs[-10:]]}")
print(f"Most negative words: {[x[1] for x in sorted_coefs[:10]]}")

Most positive words: ['exactly', 'pleased', 'great', 'perfectly', 'best', 'happy', 'easy', 'love', 'perfect', 'loves']
Most negative words: ['disappointed', 'returned', 'waste', 'return', 'poor', 'useless', 'returning', 'broke', 'idea', 'unfortunately']


### Comment
Using the coefficient of each feature in the decision function, we can easily find the most positive and the most negative words. As we might have guessed, the positive words are mostly associated with human emotions. Among the negative ones, forms of the word "return" appear most often (returned, return, returning).

## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [13]:
#a)
prediction = model.predict(x_test)

# Count the predictions that differ from the true labels
diff = int((prediction != y_test.to_numpy()).sum())
        
pred_number = len(prediction)

print(f"Number of test elements [{pred_number}]")        
print(f"Number of mistakes in predictions [{diff}]")
print(f"Percentage of correct predictions [{np.round((pred_number - diff) * 100 / pred_number, 2)}%]")

Number of test elements [50026]
Number of mistakes in predictions [3493]
Percentage of correct predictions [93.02%]


### Comment
Using the predict method, I predicted the sentiment of the test data. Then I compared the result with the real values. The model performs well: only about 7% of the test predictions are wrong.

In [14]:
#b)
proba_prediction = model.predict_proba(x_test)
print(proba_prediction)
print(f"Model classes {model.classes_}")

[[5.42684318e-04 9.99457316e-01]
 [9.99846764e-01 1.53235709e-04]
 [6.01965609e-04 9.99398034e-01]
 ...
 [5.48193296e-04 9.99451807e-01]
 [8.45333125e-01 1.54666875e-01]
 [1.20174118e-01 8.79825882e-01]]
Model classes [-1  1]


### Comment
This method returns the estimated probability of each class for every test review. In our case there are two classes, and the columns follow the order of model.classes_: the first column is the probability of -1 and the second is the probability of 1. The closer a value is to 1, the more certain the model is that the review belongs to that class; the closer to 0, the more certain it is that it does not.
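A small sketch of how to read this output, using three of the rows printed above and the class order `[-1  1]`. It is written in plain Python for illustration; the variable names are not part of the notebook.

```python
classes = [-1, 1]  # same order as model.classes_ printed above

# Three rows copied from the printed predict_proba output
rows = [
    [5.42684318e-04, 9.99457316e-01],
    [9.99846764e-01, 1.53235709e-04],
    [8.45333125e-01, 1.54666875e-01],
]

# Each row is a probability distribution over the two classes
row_sums = [sum(row) for row in rows]

# The predicted label is the class with the larger probability
predicted = [classes[row.index(max(row))] for row in rows]

# Probability of the negative class, looked up by class label rather than by position
p_negative = [row[classes.index(-1)] for row in rows]

print(row_sums)
print(predicted)
print(p_negative)
```

Looking the column up through the class list avoids relying on a remembered column position.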

In [15]:
#c) 
paired_probs = list(zip(proba_prediction[:, 0], test_df['review']))
sorted_probs = sorted(paired_probs, key=lambda x: x[0])

print("Most positive reviews:")
for i in sorted_probs[:5]:
    print(f"* {i[1]}\n")

print("\nMost negative reviews:")
for i in sorted_probs[-5:]:
    print(f"* {i[1]}\n")

Most positive reviews:
* This review is going to compare 3 JuJuBe bags  Im writing a lot because I wish I had this info before I started shopping for a diaper bag  I am a first time mom so I am probably packing different than a mom thats been around the block  Also we are cloth diapering but not 100 of the time  We use the disposable biodegradable inserts for the cloth covers when we want to pack light  and those things take up less room than traditional disposables  But I wanted a bag that gave me options for either type of diaper  Here is what I found  I hope reading this long post will help you save some time and return shipping fees while shoppingI already had the JuJuBe BFF and Be Prepared before I bought the Be Right Back backpack  I bought them at a different store one with an amazing return policy so I left the tags on and debated the bags for a long time  I packed them and repacked them so many times and this is what I came up with  The BFF is too small and the Be Prepared is 

### Comment
In both cases the reviews are quite long, which is probably why they rank first: they contain many words that are characteristic of the given class.
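One way to see why long reviews can end up at the extremes: in logistic regression every occurrence of a word adds its coefficient to the score, and the score is then squashed by the sigmoid. The sketch below uses a hypothetical coefficient of 0.5 and assumes a zero intercept, so the numbers are illustrative only.

```python
import math

def sigmoid(score):
    # Maps a real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-score))

word_weight = 0.5  # hypothetical coefficient of one positive word (assumption)

# Probability of the positive class when the word occurs 1, 5 and 20 times
probabilities = {n: sigmoid(n * word_weight) for n in (1, 5, 20)}

for n, p in probabilities.items():
    print(f"{n:>2} occurrences -> P(positive) = {p:.5f}")
```

A review with many mildly positive words can therefore look more "certain" than a short, strongly worded one.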

In [16]:
#d) 
from sklearn.metrics import accuracy_score

print(f"Model accuracy [{round(model.score(x_test, y_test) * 100, 2)}%]")
print(f"Model accuracy [{round(accuracy_score(y_test, prediction) * 100, 2)}%]")

Model accuracy [93.02%]
Model accuracy [93.02%]


### Comment
I already calculated the accuracy of the model manually in Exercise 4a. Above I showed two more ways to obtain the accuracy using library functions. All three give the same result. The model is well trained.
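The agreement between the manual count and the library functions can be verified with the numbers printed in Exercise 4a. This is a short arithmetic check, not part of the original exercise.

```python
# Values printed by the manual comparison in Exercise 4a
n_test = 50026
n_mistakes = 3493

# Accuracy is the share of correct predictions
accuracy = (n_test - n_mistakes) / n_test
accuracy_percent = round(accuracy * 100, 2)

print(f"Accuracy as a fraction: {accuracy}")
print(f"Accuracy in percent:    {accuracy_percent}%")
```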

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [17]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [18]:
#a)
# Fit only to learn the vocabulary; the transformed output is not needed here
vectorizer_limited = CountVectorizer()
vectorizer_limited.fit(significant_words)

x_train_limited = vectorizer_limited.transform(train_df['review'])
x_test_limited = vectorizer_limited.transform(test_df['review'])

all_names_less = vectorizer_limited.get_feature_names_out()
print(f"Limited words: {all_names_less}")

model_limited = LogisticRegression(solver='sag', max_iter=200)
model_limited.fit(x_train_limited, y_train)

paired_coefs = list(zip(model_limited.coef_[0], all_names_less))
sorted_coefs = sorted(paired_coefs, key=lambda x: x[0])

print(f"Most positive words: {[x[1] for x in sorted_coefs[-10:]]}")
print(f"Most negative words: {[x[1] for x in sorted_coefs[:10]]}")

Limited words: ['able' 'broke' 'car' 'disappointed' 'easy' 'even' 'great' 'less' 'little'
 'love' 'loves' 'money' 'old' 'perfect' 'product' 'return' 'waste' 'well'
 'work' 'would']
Most positive words: ['car', 'old', 'able', 'well', 'little', 'great', 'easy', 'love', 'perfect', 'loves']
Most negative words: ['disappointed', 'return', 'waste', 'broke', 'money', 'work', 'even', 'would', 'product', 'less']


### Comment
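The coefficients again match our intuition about the emotions the words express. Because the dictionary has only 20 words, the two top-ten lists together cover the whole vocabulary, so each list also contains words that carry little sentiment.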

In [19]:

prediction = model_limited.predict(x_test_limited)
proba_prediction = model_limited.predict_proba(x_test_limited)
print(proba_prediction)
print(f"Model classes {model_limited.classes_}")

paired_probs = list(zip(proba_prediction[:, 0], test_df['review']))
sorted_probs = sorted(paired_probs, key=lambda x: x[0])

print("Most positive reviews:")
for i in sorted_probs[:5]:
    print(f"* {i[1]}\n")

print("\nMost negative reviews:")
for i in sorted_probs[-5:]:
    print(f"* {i[1]}\n")

[[0.00188068 0.99811932]
 [0.39802999 0.60197001]
 [0.01817549 0.98182451]
 ...
 [0.00572053 0.99427947]
 [0.05233346 0.94766654]
 [0.04712281 0.95287719]]
Model classes [-1  1]
Most positive reviews:
* My daughter 8 months old loves this toy and has for a few months I love it too  Its not only a stacking toy but several toys in one each providing development benefits The removable elephant head serves as a funnel into which you drop balls  The head itself is fun since the ears crinkle and the hair is little knots that provide tactile sensation  My daughter loves to grab it and wave it around Each stackable ring is soft and easy for my daughter to grab and wave around which she loves to do  Two of the rings the elephant feet have Velcro joining the feet that my daughter can separate and rejoin The 4 different colored balls rattle and each has a different patter pressed on it  My daughter loves to shake the balls which fit perfectly into her hands  but not her mouth and she also loves t

### Comment
We can see that this model picks entirely different reviews as the most positive and the most negative ones.

In [20]:
print(f"Model accuracy [{round(model_limited.score(x_test_limited, y_test) * 100, 2)}%]")

Model accuracy [86.82%]


### Comment
This model is less accurate. Restricting the words that CountVectorizer processes increases the prediction error, but it also speeds the model up considerably. The conclusion is that we can trade a little accuracy for speed by dropping words that contribute little to the result.

In [21]:
#b)
for pair in sorted_coefs:
    print("{:>12} - [{}]".format(pair[1], pair[0]))

disappointed - [-2.3246600103959714]
      return - [-2.1698051024225373]
       waste - [-1.997479240917202]
       broke - [-1.733239349706696]
       money - [-0.9197622099350969]
        work - [-0.6350615185326765]
        even - [-0.5132266216447512]
       would - [-0.3459985789056369]
     product - [-0.30532285575865337]
        less - [-0.17858237489337592]
         car - [0.05689075070597551]
         old - [0.08322216355316647]
        able - [0.1949608676143343]
        well - [0.48223049378912]
      little - [0.4856560228370704]
       great - [0.9413064790708808]
        easy - [1.150474165981367]
        love - [1.3435959910840163]
     perfect - [1.4739416952059505]
       loves - [1.704582085173517]


### Comment
The words loves, perfect and love have the greatest positive impact, and the words disappointed, return and waste have the greatest negative impact. We can also see that some words are nearly neutral, such as car, old and less.
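To put these numbers in perspective: in logistic regression each occurrence of a word adds its coefficient to the log-odds of the positive class, so `exp(coefficient)` is the factor by which one occurrence multiplies the odds. The sketch below uses coefficients copied from the output above; the example review counts are made up, and the intercept is left out because it was not printed.

```python
import math

# Coefficients copied from the printed output above
coefficients = {
    "loves": 1.704582085173517,
    "love": 1.3435959910840163,
    "car": 0.05689075070597551,
    "waste": -1.997479240917202,
    "disappointed": -2.3246600103959714,
}

# Factor by which a single occurrence multiplies the odds of a positive review
odds_multiplier = {word: math.exp(coef) for word, coef in coefficients.items()}

# Hypothetical review containing "love" twice and "waste" once
review_counts = {"love": 2, "waste": 1}
score_change = sum(coefficients[word] * n for word, n in review_counts.items())

for word, factor in odds_multiplier.items():
    print(f"{word:>12}: odds x {factor:.3f}")
print(f"Change in log-odds for the example review: {score_change:.4f}")
```

A word like car barely moves the odds, while a single disappointed cuts them to roughly a tenth.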

In [22]:
#c)
print(f"First model accuracy {model.score(x_test, y_test)}")
print(f"Limited model accuracy {model_limited.score(x_test_limited, y_test)}")

print("\nFirst model prediction time")
%timeit -n100 -r10 model.predict(x_test)

print("\nLimited model prediction time")
%timeit -n100 -r10 model_limited.predict(x_test_limited)

print("\nFirst model learning time")
%timeit -r1 model.fit(x_train, y_train)

print("\nLimited model learning time")
%timeit -r1 model_limited.fit(x_train_limited, y_train)

First model accuracy 0.9301763083196738
Limited model accuracy 0.8682285211689921

First model prediction time
11.5 ms ± 361 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

Limited model prediction time
864 µs ± 47.7 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

First model learning time
26.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Limited model learning time
612 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Comment
The model with the predefined word list is less accurate (by about 6 percentage points), but on our test set its predictions are more than 10 times faster (about 13 times). The training time mentioned earlier is also noticeably longer for the model that uses all words: about 43 times longer in this run (26.1 s versus 612 ms).
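The ratios behind this comparison follow directly from the timings and accuracies printed above. The sketch below only redoes that arithmetic; the timings come from a single run, so the ratios will vary between machines.

```python
# Timings copied from the %timeit output above, converted to seconds
predict_full_s = 11.5e-3     # 11.5 ms
predict_limited_s = 864e-6   # 864 µs
fit_full_s = 26.1            # 26.1 s
fit_limited_s = 612e-3       # 612 ms

predict_speedup = predict_full_s / predict_limited_s
fit_speedup = fit_full_s / fit_limited_s

# Accuracies copied from the output above
accuracy_drop_points = (0.9301763083196738 - 0.8682285211689921) * 100

print(f"Prediction speed-up: {predict_speedup:.1f}x")
print(f"Training speed-up:   {fit_speedup:.1f}x")
print(f"Accuracy drop:       {accuracy_drop_points:.2f} percentage points")
```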