# Predicting the rating of online dress reviews
<i>In this assignment I predict whether dress reviews are positive (4 or 5 stars) or neutral/negative (3 stars or fewer). To make this prediction I work through the following steps: </i>

- A brief explanation of how the bag-of-words model and Naive Bayes work, and how they work together.
- Pre-processing steps where I also filter out all non-dress reviews.
- The head() of the resulting dataframe.
- Text pre-processing steps resulting in a document-feature matrix.
- Splitting the data into a training and a test set.
- Training a Naive Bayes classifier that predicts whether a review is positive or neutral/negative.
- Evaluating the performance of my model on the test set.
- Inspecting 3 cases where the model is off target and explaining why the model trips up.

I use the Women’s E-Commerce Clothing Reviews data set (https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews) from Kaggle for this assignment. 

## Bag-of-words model and Naive Bayes

<b>Bag-of-words model</b>
Bag-of-words is a representation that counts how often each word occurs in a text and ignores word order and grammar. Because the information about order and structure is discarded, words that belong together are treated as separate words and counted separately. For example, "ice cream" becomes "ice" and "cream", and "high school" becomes "high" and "school".

<b>Naive Bayes</b>
The Naive Bayes classifier is an algorithm that assigns a class to a document based on its features (here, the word counts from the bag-of-words model). The algorithm treats these features as separate and independent of each other. It is called naive because this is most often not the case. Using the Naive Bayes algorithm, we can predict whether a review is positive or neutral/negative.
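
To illustrate how the two work together, here is a small sketch that trains `MultinomialNB` on word counts from four invented mini-reviews and classifies a new one. The texts, labels and variable names are assumptions made up for this example; the real model is trained further down on the dress reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# invented mini-reviews: 1 = positive, 0 = neutral/negative
train_docs = [
    "love this dress",
    "perfect fit love it",
    "poor quality returned it",
    "too small returned",
]
train_labels = [1, 1, 0, 0]

# bag-of-words: turn every text into a row of word counts
toy_vect = CountVectorizer()
X_toy = toy_vect.fit_transform(train_docs)

# Naive Bayes: learn how often each word occurs per class
toy_nb = MultinomialNB().fit(X_toy, train_labels)

# classify an unseen text; words outside the vocabulary are ignored
new_doc = toy_vect.transform(["love the fit"])
prediction = toy_nb.predict(new_doc)[0]
probabilities = toy_nb.predict_proba(new_doc)[0]
print(prediction, probabilities)
```

The new text is labelled positive because "love" and "fit" occur more often in the positive examples than in the negative ones, and each word contributes independently to the score.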
 


## Data pre-processing

In [14]:
# import the modules
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import numpy as np

In [15]:
# import the csv file
df = pd.read_csv('DataClothingReviews.csv')

In [16]:
# filtering all non-dress reviews out
df = df.loc[(df['Class Name'] == 'Dresses')]

In [17]:
# only keeping review text and rating
df = df[['Review Text', 'Rating']]

In [18]:
# marking positive reviews as 1 and neutral/negative reviews as 0
df.loc[df['Rating'] < 4, 'Positive or Negative'] = '0' 
df.loc[df['Rating'] > 3, 'Positive or Negative'] = '1' 

In [19]:
# drop rows with missing values
df = df.dropna()

df.head()

| | Review Text | Rating | Positive or Negative |
|---|---|---|---|
| 1 | Love this dress! it's sooo pretty. i happene... | 5 | 1 |
| 2 | I had such high hopes for this dress and reall... | 3 | 0 |
| 5 | I love tracy reese dresses, but this one is no... | 2 | 0 |
| 8 | I love this dress. i usually get an xs but it ... | 5 | 1 |
| 9 | I'm 5"5' and 125 lbs. i ordered the s petite t... | 5 | 1 |


## Text pre-processing

In [21]:
# convert text to unicode
text = df['Review Text'].values.astype('U')

In [22]:
# create a vectorizer that removes English stopwords
vect = CountVectorizer(stop_words='english')
# fit the vectorizer on our text to build the vocabulary
vect = vect.fit(text)


In [28]:
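# get words from the vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_names = vect.get_feature_names_out().tolist()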

print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[700:720]}")

There are 8079 words in the vocabulary. A selection: ['assessed', 'assessing', 'assessment', 'assets', 'assistance', 'assistant', 'associate', 'associates', 'assume', 'assumed', 'assumption', 'assured', 'ast', 'astounded', 'astronomical', 'asylum', 'asymmetric', 'asymmetrical', 'ate', 'athlete']


## Document feature matrix

The matrix is mostly zeroes, so they are left out to save memory. 
Only the positions of the non-zero cells are shown, together with their values. 
The first (row) number is the document (review), and the second (column) number is the feature (word). 

In [34]:
# get the features
doc_feat = vect.transform(text)
# print a small part of the sparse matrix
print(doc_feat[0:50, 0:50])

  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (25, 40)	1
  (34, 12)	2
  (38, 31)	1
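
The relationship between this sparse printout and an ordinary table of counts can be sketched on two invented texts. The texts and variable names below are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two invented texts; the vocabulary becomes ['dress', 'fit', 'nice']
toy_docs = ["nice dress", "nice nice fit"]
toy_matrix = CountVectorizer().fit_transform(toy_docs)

# sparse view: only the non-zero cells, as (row, column) and value
print(toy_matrix)

# dense view: the same counts with all zeroes written out
dense = toy_matrix.toarray()
print(dense)

# number of stored (non-zero) cells versus all cells
print(toy_matrix.nnz, "of", dense.size, "cells are non-zero")
```

With thousands of vocabulary words and only a handful of distinct words per review, storing just the non-zero cells saves most of the memory.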


## Training the model

In [37]:
# create the model
nb = MultinomialNB()

# document-feature matrix as X and the positive/negative label as y
X = doc_feat
y = df['Positive or Negative']

# split into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# fit the model
nb = nb.fit(X_train, y_train)

## Evaluating the model


In [39]:
# make predictions
y_test_p = nb.predict(X_test)

# calculate the accuracy of those predictions
nb.score(X_test, y_test)

0.8508676789587852

In [40]:
df['Positive or Negative'].value_counts(normalize=True)

1    0.754109
0    0.245891
Name: Positive or Negative, dtype: float64

The accuracy of the Naive Bayes predictions is 85%, which is better than always guessing positive. Since 75% of the reviews are positive, always guessing positive would give an accuracy of about 75%.

In [41]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Neutral or Negative', 'Positive'], columns=['Neutral or Negative predictions', 'Positive predictions'])
cm

| | Neutral or Negative predictions | Positive predictions |
|---|---|---|
| Neutral or Negative | 301 | 168 |
| Positive | 107 | 1268 |


In [46]:
# classification report
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

           0       0.74      0.64      0.69       469
           1       0.88      0.92      0.90      1375

    accuracy                           0.85      1844
   macro avg       0.81      0.78      0.79      1844
weighted avg       0.85      0.85      0.85      1844
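
The numbers in this report follow directly from the confusion matrix. As a sketch, the snippet below recomputes them by hand from the four counts shown above; the variable names are my own, and the counts are copied from the confusion matrix rather than read from the model.

```python
# counts copied from the confusion matrix above
tn, fp = 301, 168    # actual neutral/negative: predicted 0, predicted 1
fn, tp = 107, 1268   # actual positive: predicted 0, predicted 1
total = tn + fp + fn + tp

# share of all test reviews that were classified correctly
accuracy = (tn + tp) / total

# precision: of the reviews predicted as a class, how many really belong to it
precision_pos = tp / (tp + fp)
precision_neg = tn / (tn + fn)

# recall: of the reviews that really belong to a class, how many were found
recall_pos = tp / (tp + fn)
recall_neg = tn / (tn + fp)

# accuracy of always guessing positive on this test set
baseline = (fn + tp) / total

print(f"accuracy {accuracy:.2f}, baseline {baseline:.2f}")
print(f"class 1: precision {precision_pos:.2f}, recall {recall_pos:.2f}")
print(f"class 0: precision {precision_neg:.2f}, recall {recall_neg:.2f}")
```

The recall of 0.64 for class 0 is the weak spot: 168 of the 469 neutral/negative test reviews are predicted as positive.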



## Off-target cases

In [52]:
# find wrong predictions to check
test_data = pd.DataFrame({'y_test' : y_test, 'y_test_p' : y_test_p})
test_data[test_data['y_test'] != test_data['y_test_p']].sort_index().head(15)

| | y_test | y_test_p |
|---|---|---|
| 23 | 0 | 1 |
| 52 | 0 | 1 |
| 383 | 0 | 1 |
| 480 | 0 | 1 |
| 506 | 0 | 1 |
| 533 | 0 | 1 |
| 560 | 1 | 0 |
| 734 | 1 | 0 |
| 755 | 0 | 1 |
| 1140 | 0 | 1 |


In [57]:
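# a negative review mistaken as positive
# note: test_data is indexed by df's row labels; df.iloc[23] selects by position, df.loc[23] by label
df.iloc[23]['Review Text']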

'Perfect dress for hot, humid, sticky weather.'

In [59]:
# another negative review mistaken as positive
df.iloc[52]['Review Text']

'This dress has potential, but it didn\'t work for me. it runs true to size to a little big, i ordered medium, my usual size for maeve). as for length it fit me as the model (5\'9"). the reason i\'m not keeping it is that i wish it had some darts in the back to help define the waist a bit,'

In [60]:
# a positive review mistaken as negative
df.iloc[1207]['Review Text']

"This jumpsuit is so cute and it looked amazing on. i had to return it because of the poor design. it was extremely hard to put on and take off! there are so many buttons to do and undo which are not as easy as snaps! the fabric buttons definitely make a great look to the jumpsuit but there should be a zipper on the side for easy access, or use snaps instead of the loops for the fabric buttons. what a shame of a lovely and classy design on a good quality material, but not worth it if it's difficu"

<b>First case</b>
<i> 'Perfect dress for hot, humid, sticky weather.'</i>

This review reads as sarcastic. The model only counts words such as "perfect" and is not able to detect sarcasm.

<b>Second case</b>
<i> 'This dress has potential, but it didn't work for me. it runs true to size to a little big, i ordered medium, my usual size for maeve). as for length it fit me as the model (5'9"). the reason i'm not keeping it is that i wish it had some darts in the back to help define the waist a bit,'</i>

The problem is harder to pinpoint here, but words like "potential", "true", "keeping" and "define" could throw the model off, because they tend to sound positive when taken out of context.

<b>Third case</b>
<i> 'This jumpsuit is so cute and it looked amazing on. i had to return it because of the poor design. it was extremely hard to put on and take off! there are so many buttons to do and undo which are not as easy as snaps! the fabric buttons definitely make a great look to the jumpsuit but there should be a zipper on the side for easy access, or use snaps instead of the loops for the fabric buttons. what a shame of a lovely and classy design on a good quality material, but not worth it if it's difficu'</i>

The first sentence is very positive, and the customer also uses a lot of positive words throughout the review.