## Rating prediction and model performance comparison

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.formula.api as smf
pd.set_option('display.max_colwidth', 1000)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,accuracy_score,f1_score
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

### 1.1 Load the data

In [31]:
# Load data
amreviews = pd.read_csv("amazon-reviews.csv.bz2", sep='\t')
#Viewing the data
amreviews.sample(5)

Unnamed: 0,date,summary,review,rating
14254,2014-01-03,Great for baby food.,We make all of our own baby food from real ingredients like garden grown veggies. These trays are convenient for freezing small portion sizes for feeding the babies. They are made in the USA so no funky Chinese chemicals to worry about. I would prefer they were not plastic but I don't think there is much concern when items are frozen and not hot. We also don't make baby food out of acidic veggies like tomatoes so less chance of leachate.,5
142501,2013-07-12,Great Seat,"Took over 30 minutes to install the first time around, after that it was easy. Only downside is I wish there were better support for a head rest when my little guy falls asleep. I ended up buying a neck pillow, but he will only use it if he's really sleepy. I am glad I will not have to invest in another car seat ever, this will suit my needs till we are out of the car seat and booster phase.",5
81835,2013-11-22,Love Love Love,"I was skeptical with the reviews with people stating how Sophie is a choking hazard. My 5 month old loves his Sophie. He smiles every time I make her squeak. She is easy for him to hold and I am comforted knowing she is made from safe materials. The Blue Chan he hasn't gotten yet (its a Christmas present) but he squeaks VERY easily. Almost too easily actually. But they are still nice toys. For something that is intended for teething babies, I'd say they are well worth the money. I also got the mini Sophie for my son as well.",5
34214,2011-02-07,"Love the pattern, chair okay",I don't think there is anything really exciting about this highcair. Its pretty standard with the perk of being able to raise it higher or lower so that it can pull up to the table. In my opinion the best thing about this chair is the pattern. In the world of baby products they usually are bright and colorful and for my dining room I really just wanted something cute. For us we only used a highchair for about 5 months but it was a lifesaver while it last. I bought mine from diapers.com and was very happy. If you use the code BAIL2798 you can get 20% off your whole purchase from all their sites. Amazon also has good deals sometimes.,3
112400,2012-08-25,Don't waste money,"I have tryed both this and the Medela Pump, even though the Medela is $100.00 more it is worth it. I did not care for the way you have to hold both pumps. The Medela suctions right to your breasts and keeps you hands free.",3


In [32]:
# view dimensions
amreviews.shape

(205331, 4)

### 1.2 Remove missing and empty observations for review and rating

In [33]:
#Check for na values
amreviews.review.isna().sum()

80

In [34]:
#drop na 
amreviews_mod = amreviews.drop(amreviews[amreviews.review.isna()].index).reset_index()

In [35]:
#Check data
amreviews_mod.sample(3)

Unnamed: 0,index,date,summary,review,rating
24310,24318,2013-07-24,3rd or 4th set because it gets stinky,"I hate that I have to keep buying this, and it's because my child doesn't like the Baby Bjorn hard bib. My child doesn't mind using this one, except I hate that it keeps getting a stinky smell after a short period of use. I keep having to throw it out and buying new ones. This time around, I'm going to try wiping it after each use instead of washing it. I'm hoping that this way, it will keep it from stinking up so much. If your kid is ok with using the Baby Bjorn, I would recommend that one instead.",3
105925,105967,2014-04-01,Too bulky,I really wanted to like this bag. Great idea with the food area separate from the diaper area. It is just too bulky. I returned it.,3
117875,117919,2012-08-30,Great Umbrella,"I have a 2011 Bumbleride Flite in Lava and also a BOB Revolution SE. If you are looking for a sturdy, well-designed umbrella this is a great purchase. I leave this stroller in my car and use it for running errands/travel. It has been a functional partner to my larger jogging stroller.I like the:- Design. Very sturdy, quality stroller. Easy to fold and unfold (when using two hands) and locks securely. Durable handle on the side to grab and lift into your car. Folds compactly.- Basket. It's HUGE. BE SURE TO ACCESS FROM THE FRONT though. I can fit my large Vera Bradley diaper bag in there (which will not fit in my BOB basket...)- Included car seat adaptor. Works with tons of car seat brands. Great EXTRA option for an umbrella.- Canopy. It is HUGE for an umbrella (I had my heart set on a Maclaren but when comparing the strollers side by side, the canopy on the Bumbleride was far superior).- Handles. I am 5'11 and this is a very comfortable height for me to push. I like the angle of the...",4


In [36]:
# drop previous index column
amreviews_mod.drop('index', axis = 1, inplace= True)

In [37]:
#Check for empty strings in the cleaned data
np.where(amreviews_mod.review == '')

(array([], dtype=int64),)

In [38]:
#Check for value counts for rating
amreviews.rating.value_counts(dropna=False)

5    120434
4     42916
3     21911
2     10939
1      9131
Name: rating, dtype: int64

After dropping the 80 rows with a missing review, no missing or empty values remain in review, and rating has no missing values.

### 1.3 Create outcome variable

In [39]:
# create outcome variable
amreviews_mod['5_star'] = np.where(amreviews_mod.rating < 5, 0, 1)
amreviews_mod.sample(4)

Unnamed: 0,date,summary,review,rating,5_star
11238,2011-09-06,Pain in the butt!,If you're anything like me and don't have a lot of time to set up or take down a child gate then don't buy this one.,2,0
96748,2011-04-21,RAVING,"I seriously don't ""rave"" about products, especially a prefold DIAPER!! But this is really an amazingly absorbant piece of cloth. Exteamly thin. If I had the money, I'd replace all my lousy, bulky kissaluvs and prefolds for these. My daughter is 10 months and is a very heavy wetter. I don't add a doubler with this prefold at naptime because it's THAT good!",5,1
192676,2016-07-26,FA BU LO US,"I love this Body Blur, it took a little effort to rub it in evenly but my legs look fabulously tanned. It gives a great natural glow, way better than self-tanner, and has not been rubbing off on my clothing. It's a nice thick moisturizing consistency.",5,1
118438,2014-02-22,from wifey....,"This is from my wife, she loves these and I got them for her for Christmas. They are okay. '",3,0


### 1.4 Sample data

In [51]:
#check some sample reviews
np.random.seed(400)
amreviews_mod.sample(7)

Unnamed: 0,date,summary,review,rating,5_star
11811,2006-06-19,This carrier was not for us,"We received this carrier as a gift, so we didn't have much input on the selection. We liked the idea of it being a front or back carrier, however it just doesn't seem to work with our bodies. Our daughter seemed to be carried with her head by my ear when I tried it as a front pack. The plastic side buckles to hold the child in place were not that secure, so when we used it we still held her for fear of them opening up - defeats the purpose of having your hands free! We much rather use our FABULOUS Quattro tour system than this!!!",2,0
173365,2014-02-17,Nice Nail Polish,"I'm not a connoisseur of different brands, but when I did an online search, Essie and OPI came up as as some of the better brands. I chose Essie because of the colors that so many of the other reviews where raving about. However, I liked the color of this nail polish better on the screen than on my nails. But the color is objective, it probably just doesn't go with my skin tone. The quality seems good - I had to put on about 3 thin coats to get a solid color, and if you dry them well in between, it will stay on for at least 5 or 6 days.",3,0
204694,2014-04-17,My favorite pedal - great for metal,"I bought this based on video reviews on the net. My first impressions:To my ears, if the OCD / Ultimate OD simulates ""fat"" power tube distortion, this pedal simulates a lightly distorted tube pre-amp, with a bit of edge, sharpness, the kind of edge the tube amps have. The focus knob seems like a tone knob, maybe more in the upper mids than the treble, but it does the job well.I wish it had a bit more gain. Surprisingly, this pedal with the gain on 10 and the tone on 10, gives me a great metal tone all by itself! If i boost the pedal with say a few db of clean boost, it gets me exactly where I'd want to be (a dry, tight metal tone, kinda 5150ish. And while this isn't the purpose of the pedal, it is how I've ended up using it just because it sounds better than any amp sim and produces less heat than my tube amps. It's tight, its raw, there is no fizz and no boomyness, just a healthy low end. Crank the knobs on this thing to 10 and you have a very aggressive, tight tone that ma...",5,1
127503,2014-06-07,"What we expected, easy installation","We just got this seat yesterday and judging by the few low-star comments I was expecting a nightmare of an installation. It was actually really easy for me and took about 5 minutes. I think people get a confused because there are a few different installation set ups, but you just have to follow the instructions for the one you are doing and ignore the rest.I primarily bought this seat in this color because they do not use flame retardants on the cover. It does have that typical &#34;chemical&#34; smell to it from the packaging and foam inside, but I feel like a lot of that will come off when I wash it. The material is soft and cozy and, yes, the seat is huge, but that's to be expected for a car seat that can turn into a booster and hold 100lbs. We can not fit it rear-facing behind the driver's seat of our mid-sized SUV, and I am only about 5'8. The passenger seat does have to be moved up quite a bit to fit it in, but my wife is small so it's not a big deal. I can see that it would...",4,0
182068,2018-04-16,Gets better as it sits,"Edited:\n\nI originally was not crazy about this formula, but after using it several days, I like it. It is pretty full coverage, and it sits well and doesnt seem to oxidize. It does come off easily to the touch, but overall, I like this for a more full coverage look than their B.B. cream.\n\nOriginal review:\n\nI have been using this brands BB creams for years. While I love those, although they could cover a little better, Im not a big fan of this one. It goes on more difficulty than the BB cream, and I have to use more product too. This foundation also clings to any dry skin illuminating it.\n\nI kept the foundation on for an hour or so and looked at my face. Surprisingly, it seemed to have settled nicely. My dry skin was no longer obvious, and it left me with a nice smooth finish. Overall, this is just okay. It may just be my skin, but I prefer the B.B. cream.",4,0
169332,2013-11-04,Great shampoo overall,"I really like this shampoo. It cleans effectively, smells nice, and lathers well. Not sure if it has done anything to noticeably thicken my hair, but works well as a daily shampoo.",4,0
107421,2014-02-07,Perfect for 4 month old,"I bought these for my child for Christmas as she was just beginning to reach out and feel the different textures of things. This was perfect for her and she still loves them, especially the ball with the rattle inside she loves to shake it and hear the noise she is making with it.",5,1


As a human reader, I would find it fairly easy to tell a 5-star review from a lower-rated one. The algorithms may find this harder unless the review is uniformly glowing. Take, for example, this review: "I typically dislike facial sunscreens because they tend to feel oily and heavy."

Words such as "dislike" and "don't expect" might lead the algorithm to classify this as a less-than-5-star review.
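To illustrate the concern, here is a minimal sketch of a binary bag of words in plain Python. The `binary_bow` helper and the second sentence are hypothetical and written only for this illustration; the first sentence is the one quoted above.

```python
import re

def binary_bow(text):
    """Return the set of lowercase word tokens: presence only, no counts or order."""
    return set(re.findall(r"[a-z']+", text.lower()))

# Opening sentence quoted in the text above
quoted_opening = (
    "I typically dislike facial sunscreens because they tend to feel oily and heavy."
)
# Hypothetical negative sentence, invented for comparison
hypothetical_complaint = "I dislike this sunscreen, it feels oily and heavy."

# Tokens the two sentences have in common once word order is discarded
shared = binary_bow(quoted_opening) & binary_bow(hypothetical_complaint)
print(sorted(shared))
```

Both sentences contribute "dislike", "oily" and "heavy" as features, so a model that only sees word presence has little to separate a complaint from the set-up of a positive review.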

### 1.5 Convert reviews into BOW using Count Vectorizer

In [52]:
# import libraries
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

In [53]:
#drop rating column
amreviews_mod.drop('rating', axis=1, inplace=True)

In [54]:
# convert reviews to a binary BOW (word present/absent), dropping English stop words
# stop_words='english' selects scikit-learn's built-in list; the separate
# sklearn.feature_extraction.stop_words module was removed in scikit-learn 0.24
vectorizer = CountVectorizer(stop_words='english', binary = True)
X = vectorizer.fit_transform(amreviews_mod.review.values)

### 1.6 Here come the models

In [55]:
#set random seed
np.random.seed(894523)

In [56]:
#Splitting data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, amreviews_mod['5_star'], test_size=0.2)

In [57]:
# function that takes a supervised learning algorithm, trains the model
# and prints its accuracy and F1-score on the test set
def model_run(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    # scikit-learn metrics expect the arguments in (y_true, y_pred) order
    print("Model Accuracy:", accuracy_score(y_test, prediction))
    print("Model F1-Score:", f1_score(y_test, prediction))
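To make the two printed metrics concrete, the following minimal sketch computes accuracy and binary F1 by hand from confusion counts. The `accuracy_and_f1` helper and the toy label lists are illustrative assumptions, not part of the notebook.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and binary F1 (positive class = 1) from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no true positives
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1

# Hypothetical labels, for illustration only
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

F1 ignores true negatives, so on data where most labels are positive it can look respectable even when accuracy is close to the majority-class rate.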

### SVC with basic linear kernel

In [58]:
model_run(LinearSVC(),X_train, X_test, y_train, y_test)

Model Accuracy: 0.7591776083408444
Model F1-Score: 0.8016293442491372




### SVC with basic polynomial kernel

In [59]:
poly_model = SVC(kernel = 'poly',degree=2, max_iter = 1000)
model_run(poly_model,X_train, X_test, y_train, y_test)



Model Accuracy: 0.5902170470877689
Model F1-Score: 0.7422310756972113


### SVC with rbf kernel

In [60]:
rbf_model = SVC(kernel = 'rbf', max_iter = 1000)
model_run(rbf_model,X_train, X_test, y_train, y_test)



Model Accuracy: 0.5904606465128742
Model F1-Score: 0.7422421194652276


### SVC with sigmoid kernel

In [61]:
sig_model = SVC(kernel = 'sigmoid', max_iter = 1000)
model_run(sig_model,X_train, X_test, y_train, y_test)



Model Accuracy: 0.5904850064553848
Model F1-Score: 0.7422772079903109
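The three kernel SVC runs land near 59% accuracy and 0.74 F1. As a rough sanity check, the sketch below computes what a constant "always 5-star" predictor would score, using the rating counts printed in section 1.2. Those counts describe the full data set rather than the actual test split, so the figures are only approximate.

```python
# Rating counts copied from the value_counts() output in section 1.2
rating_counts = {5: 120434, 4: 42916, 3: 21911, 2: 10939, 1: 9131}

total = sum(rating_counts.values())
positives = rating_counts[5]      # reviews that would be labelled 5_star = 1
negatives = total - positives

# A predictor that always answers "5-star" gets every positive right and
# every negative wrong: TP = positives, FP = negatives, FN = 0.
baseline_accuracy = positives / total
baseline_f1 = 2 * positives / (2 * positives + negatives)

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline F1:       {baseline_f1:.4f}")
```

If the kernel SVC scores sit close to these values, that would suggest the models, capped at `max_iter = 1000`, are predicting the majority class for almost every review rather than having learned a useful decision boundary.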


### Logistic Regression

In [62]:
model_run(LogisticRegression(),X_train, X_test, y_train, y_test)



Model Accuracy: 0.7757423692480085
Model F1-Score: 0.8163647969360888


### Random Forest Classifier

In [63]:
# Random Forest Classifier with default hyperparameters on the BOW features
model_run(RandomForestClassifier(),X_train, X_test, y_train, y_test)



Model Accuracy: 0.7137219556161848
Model F1-Score: 0.7547681649346855


### Multinomial NB

In [66]:
model_run(MultinomialNB(alpha=10),X_train, X_test, y_train, y_test)

Model Accuracy: 0.7503593091520304
Model F1-Score: 0.8083052749719417


### Logistic Regression with solver = sag and a range of C values

In [85]:
c = np.arange(0.1,1,0.2)

for pen in c:
    print("For C=", pen)
    model_run(LogisticRegression(solver='sag', C=pen),X_train, X_test, y_train, y_test)
    print()

For C= 0.1
Model Accuracy: 0.7832939514262747
Model F1-Score: 0.8240993395816032

For C= 0.30000000000000004
Model Accuracy: 0.7809310370027527
Model F1-Score: 0.8216134727153711

For C= 0.5000000000000001
Model Accuracy: 0.7799322793598207
Model F1-Score: 0.8204832684206343

For C= 0.7000000000000001
Model Accuracy: 0.7788604418893571
Model F1-Score: 0.8194869755418572

For C= 0.9000000000000001
Model Accuracy: 0.77808092372902
Model F1-Score: 0.8187641745911749
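The printed values such as `0.30000000000000004` are ordinary floating-point artifacts of building the grid from a 0.2 step. A minimal sketch of one way to get cleaner grid values is to derive each one from an integer numerator; the name `c_grid` is illustrative.

```python
# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so sums built from such steps pick up tiny errors.
print(0.1 + 0.2 == 0.3)   # False

# Deriving each value from an integer numerator avoids accumulated error
c_grid = [n / 10 for n in (1, 3, 5, 7, 9)]
print(c_grid)
```

The fitted models should be essentially unaffected either way; the difference is mainly in how the C values print.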



source - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression

### 1.7 Best model amongst chosen ones

For rating classification I chose the following main models: 
- Support Vector Classifier (linear, rbf, polynomial and sigmoid)
- Logistic Regression
- Multinomial Naive Bayes
- Random Forest Classifier

I first ran basic implementations of each and compared their performance. Of these, Logistic Regression performed best, with an accuracy of about 78% and an F1-score of about 82%.

I then tuned the regularization parameter C for Logistic Regression with the 'sag' solver and got the best performance at C = 0.1 (accuracy 78.3%, F1-score 82.4%). Since C is the inverse of the regularization strength, this is the most strongly regularized setting I tried. Performance degraded slightly as C increased.
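The C values here were compared on the test set, which can make the reported test score slightly optimistic. A hedged sketch of one alternative is to select C by cross-validation on the training split only. It uses synthetic data from `make_classification` as a stand-in for the BOW matrix, and the grid, `max_iter` value and variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the review features (not the Amazon data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

c_values = [0.1, 0.3, 0.5, 0.7, 0.9]
search = GridSearchCV(
    LogisticRegression(solver="sag", max_iter=5000),
    param_grid={"C": c_values},
    scoring="f1",
    cv=5,
)
search.fit(X_tr, y_tr)   # the test split is never used for selection

print("best C:", search.best_params_["C"])
print("held-out F1:", search.score(X_te, y_te))
```

With this setup the test split is scored once, after C has been chosen, so it remains an untouched estimate of how the model generalizes.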

### 1.8 Model prediction vs manual prediction

In [88]:
model = LogisticRegression(solver='sag', C=0.1)
model.fit(X_train, y_train)
prediction = model.predict(X)

In [90]:
amreviews_mod['prediction'] = prediction
amreviews_mod[amreviews_mod['5_star'] != amreviews_mod['prediction']].sample(10)

Unnamed: 0,date,summary,review,5_star,prediction
17453,2013-10-28,Much better pumping experience!,I got the Medela Advanced pump and it comes with the 24mm breastshileds (because apparently every woman is average sized...not!) I was having such pain and difficulty pumping. I was dissapointed the pump only came with one size shields because I found out that was why I was having such difficulty. Well turns out I have smaller than &#34;average&#34; nipples and the 21 mm shields were perfect! No more pain! And pumping is going so well!,1,0
77340,2013-07-13,I like it,"I like it&#65292;It smell not smell, but my baby is too young to use it,and it is very easy to dirty",0,1
128167,2012-04-20,Nifty little thing!,"This is a pretty cool little portable changer- first of all, the design (Nixon black) is really cool in person. You can't fit a ton in it, but for me that's fine as I really just wanted it for times when I didn't want to bring my entire diaper bag out. I usually put 3 diapers, the portable wipes case that comes with it, a small tube of diaper cream, and maybe a small burp rag in it. I love that it has the buckle wrist strap so you can just strap it onto your stroller and off you go!It would also be a handy thing to keep in you car as a backup changer or whatever.Definitely won't fit in most diaper bags- but that's not what I needed it for so it worked out great.",0,1
69286,2011-07-24,A very nice baby book,I have been looking for a babybook since my son was born. I am glad that I bought this book. The layout allows me to record the important events with lots of room for 1-2 photos and writings and yet it does not look too overcrowded. It also has photo sleeves to put additional photos which I am only using for birthday photos or photos from big event because we have a seperate baby photo album. The design of this baby book is neutral and clean line looking.,0,1
126549,2013-08-06,Handy and versatile,"This has been very useful for carrying swimsuits and baby towels home from the pool. I've also used it for lotion, shampoo, etc. for overnight trips: it's much larger than most make-up bags, and if something leaks, the mess would be contained. The Keith Haring print is not childish, making this even more versatile for travel, if that matters to you. It's certainly a lot more attractive and tidier than a grocery bag, which is what I would have used before getting this.",0,1
21854,2014-03-05,Clean Nose All Around,"I get such satisfaction from using these, and it's much more sanitary than fingers. My daughter doesn't like it, but she does enjoy easier breathing from an unclogged nose, so it's worth it to me for the temporary wrestle to get the job done. These are basically just plastic tweezers with soft rounded tips. Work great for us.",1,0
81352,2011-07-06,OK cup. Just use caution!,"This cup helps with transitioning a child from a bottle to a sippy cup, it has to be used just right,for the child to get liquids out of the cup. It has detachable handles and a travel top and is dishwasher safe.But once your child has teeth and can chew, they WILL chew the spout right off. My child did! Use EXTRA caution! Also the color is chosen for you.",0,1
138074,2013-01-25,Don't trust them,"I received these locks as a gift, and thought looked like they would be great. I installed them into an outlet and the cover would not lock. Once my 8 month figured out what I was doing, he came to investigate. It took him all of 10 seconds to figure out on how to pull the cover out of the outlet. Won't be using these at all.",0,1
157782,2014-01-17,Toddler likes it better than baby,"Baby is mobile and doesn't play with it much. Picks it up, drops it and moves on. Toddler likes to throw it and play with it more.",0,1
155221,2013-08-13,Potty POD,"The Potty POD is easy to clean, the accessories aren't expensive, so you can have an extra lining while of them while one is being cleaned out. The pee blocker is great for little boys even when they try to aim. Pee time is supposed to be fun for them, but not a mess for parents and prince lionheart got it right.",0,1


The Logistic Regression model's predictions generally match the rating I would assign to a review. Based on the reviews above, it slips up in the following cases:
- When the customer writes an ambivalent review but still gives the product a 5-star rating.
- When the customer describes problems with previous products to contrast them with the performance or quality of the one being reviewed.
- When the review isn't negative as such, but the customer assigns a lower rating for personal reasons.
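As a small companion to this reading, the sketch below tallies the ten sampled misclassifications by direction. The `(5_star, prediction)` pairs are transcribed from the table above; the `error_type` helper is an illustrative name.

```python
from collections import Counter

# (5_star, prediction) pairs transcribed from the ten sampled rows above
sampled_errors = [
    (1, 0), (0, 1), (0, 1), (0, 1), (0, 1),
    (1, 0), (0, 1), (0, 1), (0, 1), (0, 1),
]

def error_type(actual, predicted):
    """Name the direction of a misclassification."""
    if predicted == 1 and actual == 0:
        return "false positive"   # predicted 5-star, actual rating lower
    return "false negative"       # predicted lower, actual rating 5-star

tally = Counter(error_type(a, p) for a, p in sampled_errors)
print(tally)
```

In this sample most errors are reviews rated below 5 stars that the model labels as 5-star, which fits the last case in the list. Ten rows are too few to generalize from, so a confusion matrix over the whole test set would be the more reliable check.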