# Naïve Bayes

**Explain briefly in your own words how the bag-of-words model and Naïve Bayes work**
Bag-of-words models count how often each word occurs in a text, ignoring word order. Naïve Bayes relates these word counts to the value we want to predict: in this case, effectively a bag of words from positive reviews and a bag of words from negative/neutral reviews. It then uses these counts to estimate the probability that a review is positive.
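
As a rough illustration of the two ideas, here is a minimal sketch on four made-up reviews. The toy reviews, labels and variable names are hypothetical and not taken from the assignment data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews, not taken from the assignment data
toy_reviews = [
    "love this dress love the fit",
    "pretty dress great fit",
    "poor quality returned this dress",
    "fabric felt cheap returned it",
]
toy_labels = [1, 1, 0, 0]  # 1 = positive, 0 = neutral/negative

# Bag-of-words: one row per review, one column per vocabulary word, cells hold counts
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_reviews)
print(sorted(toy_vect.vocabulary_))  # the words, in column order
print(toy_counts.toarray())

# Naive Bayes: learns how likely each word is within each class, plus the class priors
toy_nb = MultinomialNB().fit(toy_counts, toy_labels)
new_review = toy_vect.transform(["love the fabric"])
toy_proba = toy_nb.predict_proba(new_review)
print(toy_nb.classes_, toy_proba)  # probabilities in the order of classes_
```

"love" and "the" only occur in the positive toy reviews and "fabric" only in a negative one, so the predicted probability leans towards positive.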

## Import the needed libraries and the Data

Import what we need in this document, import the data and check it out. 

In [123]:
import sklearn as sk
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [124]:
df = pd.read_csv('Assignment text mining - data clothing reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


## Pre-processing
Subset the DataFrame to only the variables and cases we need.

In [125]:
df = df.loc[df['Class Name'] == 'Dresses'] # to only select dresses
df = df[['Review Text', 'Rating']] # variables we need. Review Text to analyze and Rating as the target (y)
df.head()

Unnamed: 0,Review Text,Rating
1,Love this dress! it's sooo pretty. i happene...,5
2,I had such high hopes for this dress and reall...,3
5,"I love tracy reese dresses, but this one is no...",2
8,I love this dress. i usually get an xs but it ...,5
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5


Recode Rating to 1s and 0s: 1 if a review is positive (rating 4 or 5), 0 otherwise.

In [126]:
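df.loc[df['Rating'] < 4, 'Rating'] = 0 # ratings 1-3 become 0 (neutral/negative)
df.loc[df['Rating'] > 3, 'Rating'] = 1 # ratings 4-5 become 1 (positive); the 0s set above are not > 3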
df.head()

Unnamed: 0,Review Text,Rating
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0
5,"I love tracy reese dresses, but this one is no...",0
8,I love this dress. i usually get an xs but it ...,1
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",1


In [127]:
df['Rating'].value_counts()

1    4792
0    1527
Name: Rating, dtype: int64

4792 positive reviews and 1527 neutral/negative reviews. This means that if we guess positive (1) every time we are right in ~76% of the cases.
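
The ~76% baseline can be checked with a couple of lines. This is a small sketch that re-enters the two counts from the output above by hand; `counts` and `baseline_accuracy` are names introduced here for illustration.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({1: 4792, 0: 1527})

# Always guessing the majority class is right as often as that class occurs
baseline_accuracy = counts.max() / counts.sum()
print(f"Majority-class baseline: {baseline_accuracy:.3f}")
```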

## Text pre-processing

This shows the non-zero entries in our bag of words, first printed as a sparse matrix and then converted to a dense DataFrame so it is more readable. We can only do this second step because this is not a very large dataset.

In [128]:
text = df['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the review text
docu_feat = vect.transform(text) # make a matrix
print(docu_feat[0:50,0:50]) #Let's print a little part of the matrix: the first 50 words & documents

  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (26, 40)	1
  (35, 12)	2
  (39, 31)	1


In [129]:
rev_words = pd.concat([df, pd.DataFrame(docu_feat.toarray())], axis=1)
rev_words.head(10)

Unnamed: 0,Review Text,Rating,0,1,2,3,4,5,6,7,...,8070,8071,8072,8073,8074,8075,8076,8077,8078,8079
0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Love this dress! it's sooo pretty. i happene...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,I had such high hopes for this dress and reall...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"I love tracy reese dresses, but this one is no...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,I love this dress. i usually get an xs but it ...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
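
The rows filled with NaN in the output above appear because `pd.concat` aligns rows on index labels: `df` still carries the original labels from before the dresses selection, while the new count DataFrame starts at 0. A minimal sketch of the effect and one possible way to line the rows up, using small hypothetical stand-ins (`toy_df`, `toy_feats`) instead of the real data:

```python
import pandas as pd

# Hypothetical stand-in for the filtered reviews: the index keeps gaps (1, 2, 5)
toy_df = pd.DataFrame(
    {"Review Text": ["a", "b", "c"], "Rating": [1, 0, 0]}, index=[1, 2, 5]
)
# Hypothetical stand-in for the count matrix: fresh index 0, 1, 2
toy_feats = pd.DataFrame([[0, 1], [1, 0], [2, 0]])

# concat aligns on index labels, so labels missing on one side give NaN cells
misaligned = pd.concat([toy_df, toy_feats], axis=1)

# Resetting the index first pairs row i of the reviews with row i of the counts
aligned = pd.concat([toy_df.reset_index(drop=True), toy_feats], axis=1)

print(misaligned)
print(aligned)
```

This only affects the display; the model below is trained on `docu_feat` directly, not on `rev_words`.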


## Building the model

We build the model with the document-feature matrix as X and Rating as y, the value we want to predict. We split the data into a training and a test set and train the model on the training set.

In [130]:
nb = MultinomialNB() #create the model
X = docu_feat #the document-feature matrix is the X matrix
y = df['Rating'] #creating the y vector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data and store it

nb = nb.fit(X_train, y_train) #fit the model X=features, y=labels
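
In the cells above the vocabulary is fitted on all reviews before the split. A possible variant, sketched here on hypothetical toy data, splits the raw text first and lets a scikit-learn `Pipeline` fit the `CountVectorizer` on the training part only. This is an alternative arrangement, not what the notebook does, and the names `toy_text`, `toy_y` and `pipe` are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['Review Text'] and df['Rating']
toy_text = ["love it", "so pretty", "great fit", "love the fit",
            "poor quality", "too tight", "returned it", "cheap fabric"]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text, keeping the class balance in both parts
text_train, text_test, y_tr, y_te = train_test_split(
    toy_text, toy_y, test_size=0.25, random_state=1, stratify=toy_y
)

# The pipeline fits the vectorizer on the training text only and
# reuses that vocabulary when predicting on the test text
pipe = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
pipe.fit(text_train, y_tr)
toy_pred = pipe.predict(text_test)
print(toy_pred)
```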

## Evaluating the model

Evaluate the model by predicting on the test set and scoring the predictions against the real ratings in the test set.

In [131]:
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.8554852320675106

86% accuracy, which is higher than the 76% we get by always guessing positive, but not that much higher. Let's see the confusion matrix for the bigger picture.

In [132]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Neutral_negative', 'Positive'], columns=['Neutral_negative pred', 'Positive pred'])
cm

Unnamed: 0,Neutral_negative pred,Positive pred
Neutral_negative,282,182
Positive,92,1340


In [133]:
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

           0       0.75      0.61      0.67       464
           1       0.88      0.94      0.91      1432

    accuracy                           0.86      1896
   macro avg       0.82      0.77      0.79      1896
weighted avg       0.85      0.86      0.85      1896



Precision (how many of the reviews predicted as neutral/negative really are neutral/negative) for neutral/negative is 282/(282+92), which is a bit over 75%.
Recall (how many of the actual neutral/negative reviews we catch) for neutral/negative is 282/(282+182), which is a bit below 61%.

Precision for positive is 1340/(1340+182), which is a bit over 88%.
Recall for positive is 1340/(1340+92), which is a bit below 94%.
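
The same arithmetic as a small sketch, with the four counts typed in from the confusion matrix above. The variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 282, 182   # actual neutral/negative: predicted negative, predicted positive
fn, tp = 92, 1340   # actual positive: predicted negative, predicted positive

precision_neg = tn / (tn + fn)   # of all predicted neutral/negative, how many are
recall_neg = tn / (tn + fp)      # of all actual neutral/negative, how many we catch
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

for name, value in [("precision 0", precision_neg), ("recall 0", recall_neg),
                    ("precision 1", precision_pos), ("recall 1", recall_pos),
                    ("accuracy", accuracy)]:
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals these match the classification report.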

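If you want to predict how happy someone is with a product, precision is more important. If you want to build a filter that only lets positive reviews show up in your webshop, recall is probably more important, specifically recall on the neutral/negative class: every neutral/negative review the model misses (182 of 464 here) would slip through as positive.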

### Evaluate the predictions
Because I selected only the dresses from the reviews, the index has gaps and is no longer consecutive. I need to reset the index to use the for loop.

Then I predict the probabilities for every row. This returns a two-dimensional array with one row per review, in index order, holding the probability of negative and the probability of positive. We loop through the array and, for each row, put the negative and positive probabilities in the DataFrame, plus the difference between them. This results in the DataFrame below.

In [134]:
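df = df.reset_index(drop=True) # renumber the rows 0..n-1 so they match the rows of X
predictions = nb.predict_proba(X[df.index]) # one row per review: [P(negative), P(positive)]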
for i in range(len(predictions)):
    df.at[i, 'pred_negative'] = predictions[i][0]
    df.at[i, 'pred_positive'] = predictions[i][1]
    df.at[i, 'diff'] = predictions[i][1] - predictions[i][0]
df.head()

Unnamed: 0,Review Text,Rating,pred_negative,pred_positive,diff
0,Love this dress! it's sooo pretty. i happene...,1,5.3e-05,0.9999466,0.999893
1,I had such high hopes for this dress and reall...,0,0.999999,9.021497e-07,-0.999998
2,"I love tracy reese dresses, but this one is no...",0,0.899449,0.1005509,-0.798898
3,I love this dress. i usually get an xs but it ...,1,0.000894,0.9991058,0.998212
4,"I'm 5""5' and 125 lbs. i ordered the s petite t...",1,1.3e-05,0.9999874,0.999975
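
The loop fills the three columns one cell at a time. A vectorized variant could assign whole columns at once. The sketch below uses a hypothetical two-word count matrix (`toy_X`) and a tiny model in place of the real `X` and `nb`, and looks up the column order in `classes_` rather than assuming it.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Hypothetical tiny count matrix and labels standing in for X and df['Rating']
toy_X = np.array([[2, 0], [1, 0], [0, 2], [0, 1]])
toy_df = pd.DataFrame({"Rating": [1, 1, 0, 0]})
toy_nb = MultinomialNB().fit(toy_X, toy_df["Rating"])

# predict_proba returns one column per class, ordered as in toy_nb.classes_
proba = toy_nb.predict_proba(toy_X)
neg_col = list(toy_nb.classes_).index(0)
pos_col = list(toy_nb.classes_).index(1)

# Whole columns are assigned at once instead of cell by cell
toy_df["pred_negative"] = proba[:, neg_col]
toy_df["pred_positive"] = proba[:, pos_col]
toy_df["diff"] = toy_df["pred_positive"] - toy_df["pred_negative"]
print(toy_df)
```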


Next we subset the 3 rows of the DataFrame whose values in the difference column are closest to 0. These are the rows the algorithm had the most difficulty predicting.

In [135]:
df_lowest_diff = df.iloc[(df['diff']).abs().argsort()[:3]] #https://stackoverflow.com/questions/30112202/how-do-i-find-the-closest-values-in-a-pandas-series-to-an-input-number
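
To see what the selection does, here is a small sketch on made-up `diff` values. It sorts the absolute values by position, as in the cell above, and compares that with `nsmallest`, which should pick the same rows by label. The `toy` DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical 'diff' values; the real ones come from the predictions above
toy = pd.DataFrame({"diff": [0.99, -0.05, 0.40, 0.01, -0.80, 0.10]})

# Positions of the three values closest to 0: argsort on the absolute values
closest_positions = toy["diff"].abs().to_numpy().argsort()[:3]
by_argsort = toy.iloc[closest_positions]

# Same selection by label, using nsmallest on the absolute values
by_nsmallest = toy.loc[toy["diff"].abs().nsmallest(3).index]

print(by_argsort)
print(by_nsmallest)
```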

And now we print each line in that DataFrame, including the actual rating (1 is positive, 0 is neutral/negative).

#### Line 1

In [136]:
print(f" Line 1: {df_lowest_diff.iloc[0,0]} \n")
print(f" Real rating: {df_lowest_diff.iloc[0,1]} \n")

 Line 1: Beautiful and high quality. unfortunately it made me look top-heavy. i am a 36dd . 

 Real rating: 1 



**Why Naïve Bayes has trouble:**
The first sentence is very positive, with the word "Beautiful" and the 2-gram "high quality" (which the model sees as the separate words "high" and "quality", since CountVectorizer counts single words by default). On the other hand, the second sentence contains negative words: "unfortunately" and "top-heavy". There are also quite a few words after "unfortunately", which could pull the predicted probability down.

#### Line 2

In [137]:
print(f" Line 2: {df_lowest_diff.iloc[1,0]} \n")
print(f" Real rating: {df_lowest_diff.iloc[1,1]} \n")

 Line 2: I am usually an xsp or 0p but i ordered up one size (2p) due to previous reviews complaining about the tight slip. however i definitely did not need to size up. the slip is fine and the outer 'shell' is billowy and i should have stuck with my usual size! when i put it on i was not in love due to the dress being so roomy, but a second check in the mirror showed that the fabric / seams draped well and it overall looked pretty cute. it hits about 2 inches above my knees and therefore seems pretty p 

 Real rating: 1 



**Why Naïve Bayes has trouble:**

It's a long review that talks about previous negative reviews. About her own order, she first says that she didn't like it, but her second judgement was very positive. Even reading this as a human being, you cannot tell whether it's neutral or positive.

#### Line 3

In [138]:
print(f" Line 3: {df_lowest_diff.iloc[2,0]} \n")
print(f" Real rating: {df_lowest_diff.iloc[2,1]} \n")

 Line 3: So disappointed! i couldn't even get it on. definitely order a size up! 

 Real rating: 1 



**Why Naïve Bayes has trouble:**
This is the opposite of the first line we evaluated, in the sense that it starts with some negative words ("disappointed" and "couldn't") but ends with "definitely order".

Reading this, I would assume it is a negative review, but it is actually rated as positive.