# Week 6 - Naïve Bayes classifier

In this assignment, we are going to use text mining to predict the rating of a dress from online reviews.

We will predict whether dress reviews are positive (>3 stars) or neutral/negative (<4 stars).

### Dataset

- We are using the Women’s E-Commerce Clothing Reviews data set.


- The full data set is included, but we are running a model on the reviews of dresses only.


- The data set on Kaggle, for context: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/home


### To do:

- Explain briefly in your own words how the bag-of-words model and Naïve Bayes work, and how they work together.


- Pre-processing steps (don’t forget to filter out all non-dress reviews).


- The head() of the resulting dataframe.


- Text pre-processing steps resulting in a document-feature matrix.


- Split the file into a training and a test set.



- Train a Naïve Bayes classifier predicting whether a review is positive (>3 stars) or neutral/negative (<4 stars).


- Evaluate the performance of your model on the test set.


- Check out 3 cases where your model is off target. Inspect the associated texts. Do you understand why your model trips up? Explain.

## Bag-of-words model and Naïve Bayes relation

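A Naïve Bayes model assumes that, given the class, the value of a particular feature is independent of the value of any other feature.


In the bag-of-words model, a text is represented as the bag (multiset) of its words, disregarding grammar and even word order.


Together: each review becomes a vector of word counts, and Naïve Bayes uses how often each word occurs in the reviews of each class to estimate the probability that a new review belongs to that class.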
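
To make the combination concrete, here is a minimal pure-Python sketch of a multinomial Naïve Bayes classifier on bag-of-words counts. The four training reviews, the labels `pos`/`neg`, the helper names and the add-one smoothing (`alpha=1.0`) are assumptions made for illustration; this is not the sklearn implementation used later in the notebook.

```python
import math
from collections import Counter, defaultdict

# Toy training data (hypothetical reviews, not taken from the data set)
train = [
    ("love this dress so pretty", "pos"),
    ("love the fit and the fabric", "pos"),
    ("poor fit and cheap fabric", "neg"),
    ("cheap looking dress returned it", "neg"),
]

def bag_of_words(text):
    """Represent a text as word counts; word order is discarded."""
    return Counter(text.lower().split())

# Count how often each word occurs per class, and how many documents each class has
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    word_counts[label].update(bag_of_words(text))
    class_counts[label] += 1

vocabulary = {word for counts in word_counts.values() for word in counts}

def log_posterior(text, label, alpha=1.0):
    """log P(label) + sum of count * log P(word | label), with add-alpha smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word, count in bag_of_words(text).items():
        if word not in vocabulary:
            continue  # words never seen in training are ignored
        p_word = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocabulary))
        score += count * math.log(p_word)  # the "naive" step: word probabilities are multiplied independently
    return score

def predict(text):
    """Return the class with the highest (unnormalized) log posterior."""
    return max(class_counts, key=lambda label: log_posterior(text, label))

print(predict("love this pretty dress"))
print(predict("cheap fabric"))
```

The independence assumption is what allows the per-word probabilities to be summed in log space; the bag-of-words representation supplies the counts.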

### Import data

In [1]:
import pandas as pd

In [2]:
#Import the data and keep only the reviews for dresses
df = pd.read_csv('data clothing reviews.csv')
dresses = df['Department Name']=='Dresses' 
reviews_dresses = df[dresses]
reviews_dresses.head()

| Unnamed: 0.1 | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1080 | 34 |  | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 5 | 5 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 2 | 0 | 4 | General | Dresses | Dresses |
| 8 | 8 | 1077 | 24 | Flattering | I love this dress. i usually get an xs but it ... | 5 | 1 | 0 | General | Dresses | Dresses |
| 9 | 9 | 1077 | 34 | Such a fun dress! | I'm 5"5' and 125 lbs. i ordered the s petite t... | 5 | 1 | 0 | General | Dresses | Dresses |


To read the text and use it for our analysis, we need an object from sklearn called a CountVectorizer. 

CountVectorizer builds a vocabulary (a dictionary of words) from a collection of texts.

It lowercases the text and tokenizes it, using whitespace and punctuation as separators between words.

We pass a list of frequent English words ('stop words') that will not be counted, because they are not informative enough.

We need to convert the text to Unicode strings, a standard text format. We do so by using .values.astype('U').
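
One side effect of this conversion is worth knowing: if the column contains missing values, they are turned into the literal string 'nan', which is then tokenized like any other word. The sketch below shows this on a hypothetical two-element array (the review text and the variable names are made up for illustration).

```python
import numpy as np

# Hypothetical mini column: one real review and one missing value (NaN)
raw = np.array(["Love this dress!", np.nan], dtype=object)

# Same conversion as .values.astype('U'): every element becomes a Unicode string
as_text = raw.astype("U")

print(as_text)
print("Entries that became the string 'nan':", int((as_text == "nan").sum()))
```

Whether this matters depends on how many reviews are empty; replacing missing values with an empty string before the conversion is one possible way to avoid it.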

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

text = reviews_dresses['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #Fit the vectorizer: learn the vocabulary from the review text
feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)

print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 8080 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']


In [4]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat[0:50,0:50]) #Let's print a little part of the matrix: the first 50 words & documents

  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (26, 40)	1
  (35, 12)	2
  (39, 31)	1


## Building the model


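Now, we will use the Naïve Bayes classifier from sklearn. Note that y below is the raw star rating (1 to 5), so the model is trained as a five-class classifier rather than on the binary positive vs. neutral/negative label described in the assignment.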

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

model = MultinomialNB() #create the model
X = docu_feat #the document-feature matrix is the X matrix
y = reviews_dresses['Rating'] #creating the y vector

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3) #split the data and store it

model = model.fit(X_train, y_train) #fit the model
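
The assignment asks for a binary target: positive (>3 stars) versus neutral/negative (<4 stars). A minimal sketch of how such a label could be derived from the Rating column is shown below, on a small hypothetical dataframe. The column name `positive`, the `toy` names, and the use of `stratify` and `random_state` are assumptions added for illustration; they are not part of the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical mini data set with the same column names as the reviews file
toy = pd.DataFrame({
    "Review Text": [f"review {i}" for i in range(10)],
    "Rating": [5, 5, 4, 4, 5, 3, 2, 1, 3, 4],
})

# 1 = positive (4 or 5 stars), 0 = neutral/negative (1 to 3 stars)
toy["positive"] = (toy["Rating"] > 3).astype(int)

# stratify keeps the share of positives similar in both parts;
# random_state makes the split reproducible between runs
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    toy["Review Text"], toy["positive"],
    test_size=0.3, stratify=toy["positive"], random_state=42,
)

print(toy["positive"].tolist())
print(len(X_toy_train), "training rows,", len(X_toy_test), "test rows")
```

In the notebook, the same idea would amount to replacing y with `(reviews_dresses['Rating'] > 3).astype(int)` before the split; the scores reported below would then change, since they were produced with the five-class target.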

## Evaluating the model

In [6]:
preds = model.predict(X_val)
print(preds)
model.score(X_val, y_val)

[5 5 3 ... 4 5 5]


0.5622362869198312

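- The accuracy is 56.2%.


- The model predicts one of 5 categories (1 to 5 stars), not the binary positive vs. neutral/negative label.


- With 5 categories, random guessing would score about 20%, so 56% looks good at first sight. However, about 54% of the dress reviews have 5 stars, so always predicting 5 stars would score nearly as well.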

In [7]:
reviews_dresses['Rating'].value_counts(normalize=True)

5    0.537585
4    0.220763
3    0.132616
2    0.072955
1    0.036082
Name: Rating, dtype: float64
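
These proportions give a useful reference point: a model that always predicts the most common rating. The short sketch below compares that baseline with the accuracy reported by model.score. The numbers are copied from the outputs above; the baseline is computed on the full dress data rather than on the validation split, so the comparison is approximate.

```python
# Class proportions copied from the value_counts(normalize=True) output
rating_share = {5: 0.537585, 4: 0.220763, 3: 0.132616, 2: 0.072955, 1: 0.036082}

# Accuracy of always predicting the most frequent rating
majority_rating = max(rating_share, key=rating_share.get)
baseline_accuracy = rating_share[majority_rating]

# Value returned by model.score(X_val, y_val) in the evaluation cell
model_accuracy = 0.5622362869198312

print(f"Always predicting {majority_rating} stars: {baseline_accuracy:.1%}")
print(f"Naive Bayes model: {model_accuracy:.1%}")
print(f"Gain over the baseline: {model_accuracy - baseline_accuracy:.1%}")
```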

In [8]:
from sklearn.metrics import classification_report

print(classification_report(y_val, preds))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00        64
           2       0.30      0.07      0.11       137
           3       0.31      0.19      0.23       268
           4       0.26      0.25      0.25       396
           5       0.69      0.88      0.77      1031

    accuracy                           0.56      1896
   macro avg       0.31      0.28      0.27      1896
weighted avg       0.49      0.56      0.51      1896
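
As a reminder of what the precision and recall columns mean, the sketch below computes both by hand for one class. The label lists are made up for illustration, and returning 0.0 when a class is never predicted is an assumption that mirrors what the report shows in that case.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single class, computed from raw label lists."""
    true_positives = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # how often the model predicted this class
    actual = sum(t == label for t in y_true)     # how often the class really occurs (the support)
    precision = true_positives / predicted if predicted else 0.0  # 0.0 if the class is never predicted
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# Hypothetical true and predicted ratings
y_true = [5, 5, 5, 4, 4, 3, 1, 1]
y_pred = [5, 5, 4, 5, 4, 5, 3, 5]

print("5 stars:", precision_recall(y_true, y_pred, 5))
print("1 star: ", precision_recall(y_true, y_pred, 1))
```

In this toy example the 1-star class is never predicted, so both its precision and recall come out as 0.0, the same pattern as the first row of the report.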



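- The precision for rating 1 is 0.00, which means that none of the reviews predicted to be 1 star were actually 1-star reviews (sklearn also reports 0.00 when a class is never predicted at all). The recall is 0.00 as well, so none of the 64 one-star reviews in the test set were recognized.


- For 5-star ratings, 69% of our predictions were correct, and 88% of the actual 5-star reviews were found. Not bad.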
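
The last item of the to-do list asks for a look at cases where the model is off target. One possible way to find them is sketched below on made-up data: compare predictions with the true labels and print the texts where they differ. All texts, labels and `toy_` names are hypothetical, and the binary labels (1 = positive, 0 = neutral/negative) are an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training and validation reviews
train_texts = ["love this dress", "beautiful and flattering",
               "poor quality returned it", "cheap fabric and bad fit"]
train_labels = [1, 1, 0, 0]
val_texts = np.array(["wanted to love this dress but the quality is bad",
                      "flattering fit, love it",
                      "not flattering at all, returned"])
val_labels = np.array([0, 1, 0])

toy_vect = CountVectorizer(stop_words="english")
toy_model = MultinomialNB().fit(toy_vect.fit_transform(train_texts), train_labels)
val_preds = toy_model.predict(toy_vect.transform(val_texts))

# Positions where the prediction differs from the true label
wrong_idx = np.flatnonzero(val_preds != val_labels)
for i in wrong_idx[:3]:  # inspect at most 3 cases
    print(f"true={val_labels[i]} predicted={val_preds[i]} text={val_texts[i]!r}")
```

In the notebook itself, y_val is a pandas Series that keeps the original row index after the split, so something like `reviews_dresses.loc[y_val.index[preds != y_val], 'Review Text']` should retrieve the misclassified review texts. Because bag-of-words ignores word order and negation, reviews such as "not flattering" are natural candidates for mistakes.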