# Weekly Assignment 06

The purpose of this assignment is to predict the rating category (Positive or Neutral/Negative) of reviews of **dresses** in a dataset. Reviews with a rating of 4 or more (Rating >= 4) are considered **Positive**, and reviews with a rating of 3 or less (Rating <= 3) are considered **Neutral/Negative**. We are going to use the **Naive Bayes** algorithm for this classification.

# Data Processing

Let's start by importing the libraries and our dataset for analysis.

In [113]:
# importing the libraries we are going to use in our notebook
import pandas as pd
from random import randrange
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB


In [96]:
# importing dataset and viewing the first 5 items 
df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


After looking at the dataset, we determine the following:
* We have some *NaN* values that we need to get rid of.
* We need to choose only the **Dresses** from our dataset.
* We need to add a new column that classifies every review.

In [80]:
# applying a lambda because our operation is straightforward and does not require us to declare a complex function
df["Rating Category"] = df["Rating"].apply(lambda x: "Positive" if x > 3 else "Neutral/Negative")
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Rating Category
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates,Positive
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,Positive
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Neutral/Negative
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,Positive
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,Positive


After the classification step, we take a subset of our dataset, keeping only the reviews whose **Department Name** is _Dresses_. Then we drop the rows that contain _NaN_ values.

In [81]:
df_subset = df[df["Department Name"] == "Dresses"]
df_subset = df_subset[["Clothing ID","Review Text", "Rating", "Rating Category", "Department Name"]]
df_subset = df_subset.dropna()
df_subset.head()

Unnamed: 0,Clothing ID,Review Text,Rating,Rating Category,Department Name
1,1080,Love this dress! it's sooo pretty. i happene...,5,Positive,Dresses
2,1077,I had such high hopes for this dress and reall...,3,Neutral/Negative,Dresses
5,1080,"I love tracy reese dresses, but this one is no...",2,Neutral/Negative,Dresses
8,1077,I love this dress. i usually get an xs but it ...,5,Positive,Dresses
9,1077,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,Positive,Dresses


Now, let's look at the value count of the reviews we have in our subset of the data.

In [82]:
df_subset["Rating Category"].value_counts()

Positive            4634
Neutral/Negative    1511
Name: Rating Category, dtype: int64

# Text Modeling and Document Feature Matrix

Now that we have our dataset ready, we are going to create the **Document Feature Matrix** by tokenizing the text and fitting it into a matrix.

**Tokenizing** is the process of breaking a text into words or other units. We are going to do so by using *CountVectorizer* from the *sklearn.feature_extraction.text* module.
*CountVectorizer* converts the text to lower case, breaks it into a bag of words, and builds a dictionary (vocabulary) from the series of texts. Since the text is **English**, we are going to use the built-in _stop_words_ list for the English language.
The *stop_words* are frequently used English words that carry little information (such as "the" or "and") and therefore will not be counted.
Also, we need to convert the text to **Unicode**, and we are going to do so by using *astype('U')*.
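
To make the tokenizing step more concrete, here is a minimal sketch in plain Python of what a bag-of-words count does. It is not the *CountVectorizer* implementation: the stop-word set and the two example sentences are made up for illustration, and only the token pattern (words of two or more characters) follows the *CountVectorizer* default.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only;
# scikit-learn's built-in English list is much longer.
STOP_WORDS = {"the", "is", "it", "this", "and", "a", "but", "so"}

# Words of two or more characters, like the CountVectorizer default pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text, split it into tokens, then drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    terms = sorted({t for doc in docs for t in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def to_counts(doc, vocabulary):
    """Return one row of the document feature matrix as a dense list."""
    counts = Counter(tokenize(doc))
    return [counts.get(term, 0) for term in vocabulary]

# Made-up example sentences, not taken from the dataset.
docs = ["I love this dress, love it!", "The dress is cute but it runs large."]
vocabulary = build_vocabulary(docs)
rows = [to_counts(doc, vocabulary) for doc in docs]
print(vocabulary)
print(rows)
```

Each row has one entry per vocabulary word, so the word "love" appearing twice in the first sentence shows up as a count of 2 in its column.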

In [100]:
# converting the text to unicode and saving it in a NumPy array
text = df_subset['Review Text'].values.astype('U')
# creating the object CountVectorizer with English stop words
vect = CountVectorizer(stop_words="english")
# fitting the text into the CountVectorizer Object
vect = vect.fit(text)
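# getting the words from the vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 on)
feature_names = vect.get_feature_names_out().tolist()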
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 8079 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']


Now that we have created a dictionary of words, we are going to build a **Document Feature Matrix** out of it by using the *transform* method. The *transform* method turns a collection of documents into a **Document Feature Matrix**, also known as a **Document Term Matrix**, with one row per review and one column per word in the vocabulary.

In [107]:
# creating the document feature matrix
docu_feat = vect.transform(text)
docu_feat

<6145x8079 sparse matrix of type '<class 'numpy.int64'>'
	with 154409 stored elements in Compressed Sparse Row format>

A sparse matrix is a matrix that consists mostly of 0's, so only the non-zero entries are stored. We therefore view just a sample of it below; each line reads (review index, word index) followed by the count.

In [110]:
print(docu_feat[0:500, 0:500])

  (2, 8)	1
  (4, 72)	1
  (4, 224)	1
  (18, 210)	1
  (19, 80)	1
  (19, 435)	1
  (19, 447)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (22, 248)	1
  (24, 55)	1
  (25, 40)	1
  (27, 60)	1
  (27, 226)	1
  (27, 237)	1
  (28, 436)	1
  (30, 241)	1
  (31, 104)	1
  (31, 186)	1
  (31, 233)	1
  (31, 255)	1
  (32, 428)	1
  (34, 12)	2
  :	:
  (472, 304)	1
  (472, 316)	1
  (474, 104)	1
  (474, 243)	1
  (474, 422)	1
  (475, 84)	1
  (475, 96)	1
  (475, 186)	1
  (475, 233)	1
  (475, 242)	1
  (475, 248)	1
  (477, 86)	1
  (477, 485)	1
  (480, 37)	1
  (482, 66)	1
  (482, 484)	1
  (483, 45)	1
  (483, 184)	1
  (483, 201)	1
  (484, 96)	1
  (488, 108)	1
  (492, 167)	1
  (497, 447)	1
  (498, 60)	1
  (498, 408)	1
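
To get a feel for how sparse this matrix is, the sketch below computes its density from the shape and the stored-element count reported above. The dictionary-based layout in the second half is only an illustration of the idea of storing non-zero cells; it is not how the Compressed Sparse Row format is laid out internally.

```python
# Shape and stored-element count copied from the notebook output above.
n_docs, n_terms = 6145, 8079
stored = 154409

total_cells = n_docs * n_terms
density = stored / total_cells
avg_terms_per_review = stored / n_docs

print(f"cells: {total_cells:,}")
print(f"density: {density:.2%}")
print(f"average distinct terms per review: {avg_terms_per_review:.1f}")

# Illustrative dictionary-of-keys layout: only non-zero cells are kept.
# These three entries are taken from the printed sample.
sample = {(2, 8): 1, (4, 72): 1, (34, 12): 2}

def cell(matrix, row, col):
    """Look up a cell; anything not stored is an implicit zero."""
    return matrix.get((row, col), 0)

print(cell(sample, 34, 12), cell(sample, 0, 0))
```

Only about 0.31% of the cells hold a count, which is why storing just the non-zero entries saves so much memory.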


# Building the Naive Bayes Model

Now that we have our Document Feature Matrix ready, we are going to split our dataset into a training set and a test set to prepare for building our **Naive Bayes Model**.

In [85]:
# splitting the data set into a training set (70%) and a test set (30%)
X = docu_feat
y = df_subset["Rating Category"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
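
As a quick check on what the split produces, the sketch below works out the expected sizes of the two sets. It assumes that a fractional *test_size* is rounded up to a whole number of rows and that the remainder goes to the training set; this is our reading of how *train_test_split* behaves, and it agrees with the 1844 test rows in the classification report further down.

```python
import math

n_samples = 6145   # rows in the document feature matrix
test_size = 0.3

# Assumption: the fractional test size is rounded up,
# and the training set receives whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```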

Now we are going to create our Naive Bayes model.

Naive Bayes is an algorithm based on Bayes' Theorem. Bayes' Theorem gives the probability of an event given a condition related to that event, for example the probability of rain given that it is cloudy.

The general formula of Bayes' Theorem is as follows: $ P(A|B) = \frac{P(B|A) * P(A)}{P(B)} $

It gives the probability of A given the condition B.

The **Naive Bayes Algorithm** computes $ P(A|B) $ for every possible class $ A $ and assigns the class with the highest probability. It is called *naive* because it assumes that the features (here, the words) are independent of each other given the class.


So in our case it compares $ P(\text{Positive Review} | \text{Words}) $ with $ P(\text{Neutral/Negative Review} | \text{Words}) $ and makes the classification based on that. To put it in words, the Naive Bayes algorithm looks at the features of a review, in our case its words, and assigns the review to whichever of the two categories is more probable given those words.

**Naive Bayes** works in the following steps. We demonstrate them for $ P(\text{Positive Review} | \text{Word}) $, whose formula is:

$ P(\text{Positive Review} | \text{Word}) = \frac{P(\text{Word} | \text{Positive Review}) * P(\text{Positive Review})}{P(\text{Word})} $

* Calculate the Prior Probability $ P(\text{Positive Review}) $: the probability that a review in our dataset is positive, which is the number of positive reviews divided by the total number of reviews:

$ \frac{\text{Count Of Positive Reviews}}{\text{Total Number Of Reviews}} $

* Calculate the Marginal Likelihood $ P(\text{Word}) $: the likelihood of choosing a specific word if we picked a word at random:

$ \frac{\text{Count Of A Specific Word}}{\text{Total Number Of Words}} $

* Calculate the Likelihood $ P(\text{Word} | \text{Positive Review}) $: the likelihood of picking that specific word from the words of the Positive reviews:

$ \frac{\text{Count Of A Specific Word In Positive Reviews}}{\text{Total Number Of Words In Positive Reviews}} $

The same is done for $ P(\text{Neutral/Negative Review} | \text{Word}) $. For a review with several words, the per-word likelihoods are multiplied together. The two probabilities are then compared, and the review is assigned to the class with the higher one. Because $ P(\text{Word}) $ is the same for both classes, it does not affect the comparison.
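
The steps above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the *MultinomialNB* implementation: the three training texts are invented, the marginal likelihood is left out because it is identical for both classes, and logarithms are summed instead of multiplying probabilities. The add-one smoothing (`alpha=1.0`, the same default value *MultinomialNB* uses) keeps a word that never occurs in a class from producing a probability of zero.

```python
import math
from collections import Counter

# Hypothetical toy training data (not from the review dataset).
train = [
    ("love this dress", "Positive"),
    ("beautiful fit love it", "Positive"),
    ("cheap fabric poor fit", "Neutral/Negative"),
]

def fit(examples, alpha=1.0):
    """Collect log priors and smoothed log likelihoods per class."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = sorted({w for counts in word_counts.values() for w in counts})

    log_prior = {c: math.log(n / len(examples)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_likelihood

def predict(text, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped: they have no column.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get), scores

log_prior, log_likelihood = fit(train)
label, scores = predict("love the fit", log_prior, log_likelihood)
print(label, scores)
```

The scores are unnormalized log posteriors, so only their order matters: the class with the larger score is the prediction.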

In [111]:
#creating the model
mnb = MultinomialNB()
# training the model
mnb.fit(X_train, y_train)

MultinomialNB()

In [87]:
# creating prediction array with the test set
y_test_p = mnb.predict(X_test)
y_test_p

array(['Positive', 'Positive', 'Positive', ..., 'Positive', 'Positive',
       'Positive'], dtype='<U16')

In [88]:
mnb.score(X_test, y_test)

0.8508676789587852

The accuracy of our model is *85%*, which is fairly good: always predicting *Positive* would score about 75% on this test set (1375 of 1844 reviews). Now we are going to create a Confusion Matrix to analyze our classification.

In [112]:
# creating a confusion matrix with the test set
cm = confusion_matrix(y_test, y_test_p)
cm

array([[ 301,  168],
       [ 107, 1268]])

In [90]:
# to label the matrix, we first view the classes of the model to check their order
mnb.classes_

array(['Neutral/Negative', 'Positive'], dtype='<U16')

In [91]:
conf_matrix = pd.DataFrame(cm, index = ["Neutral/Negative (Actual)", "Positive (Actual)"], columns = ["Neutral/Negative (Predicted)", "Positive (Predicted)"])
conf_matrix

Unnamed: 0,Neutral/Negative (Predicted),Positive (Predicted)
Neutral/Negative (Actual),301,168
Positive (Actual),107,1268


After looking at the confusion matrix, we are going to calculate the **Precision** and the **Recall** of each class.

In [114]:
print(classification_report(y_test, y_test_p))

                  precision    recall  f1-score   support

Neutral/Negative       0.74      0.64      0.69       469
        Positive       0.88      0.92      0.90      1375

        accuracy                           0.85      1844
       macro avg       0.81      0.78      0.79      1844
    weighted avg       0.85      0.85      0.85      1844
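
The figures in this report can be reproduced by hand from the confusion matrix, which helps to see what each metric measures. The sketch below uses only the four counts printed earlier; the variable names are our own.

```python
# Confusion matrix copied from the output above:
# rows are actual classes, columns are predicted classes,
# both in the order ['Neutral/Negative', 'Positive'].
cm = [[301, 168],
      [107, 1268]]
labels = ["Neutral/Negative", "Positive"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

metrics = {}
for i, label in enumerate(labels):
    predicted_as_i = sum(row[i] for row in cm)   # column total
    actual_i = sum(cm[i])                        # row total
    precision = cm[i][i] / predicted_as_i        # correct among predicted
    recall = cm[i][i] / actual_i                 # correct among actual
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)

print(f"accuracy: {accuracy:.4f}")
for label, (p, r, f1) in metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The lower recall for *Neutral/Negative* comes from the 168 such reviews that the model labeled as *Positive*.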



We are going to apply our trained model to the whole dataset (training and test rows alike), take the subset where the model was off, and try to analyze the reason.

In [115]:
df_subset["Rating_Prediction"] = mnb.predict(X)
df_contradictions = df_subset[df_subset["Rating Category"] != df_subset["Rating_Prediction"]]
df_contradictions.head()

Unnamed: 0,Clothing ID,Review Text,Rating,Rating Category,Department Name,Rating_Prediction
23,1077,Cute little dress fits tts. it is a little hig...,3,Neutral/Negative,Dresses,Positive
52,1104,"Love the color and style, but material snags e...",3,Neutral/Negative,Dresses,Positive
311,1089,Looks beautiful online but has too much materi...,3,Neutral/Negative,Dresses,Positive
383,1104,This dress is not what i expected. the bottom ...,3,Neutral/Negative,Dresses,Positive
417,1083,"I love byron lars dresses, and this design is ...",2,Neutral/Negative,Dresses,Positive


In [95]:
for i in range(0,3):
    print(f"for item number {i+1}")
    print(f"the review text we have is:\n{df_contradictions.iloc[i,1]}\n")
    print(f"the actual Score for this review is:\n{df_contradictions.iloc[i,2]}\n")
    print(f"the actual Rating Category for this review is:\n{df_contradictions.iloc[i,3]}\n")
    print(f"the Predicted Category for this review is:\n{df_contradictions.iloc[i,5]}\n\n")

for item number 1
the review text we have is:
Cute little dress fits tts. it is a little high waisted. good length for my 5'9 height. i like the dress, i'm just not in love with it. i dont think it looks or feels cheap. it appears just as pictured.

the actual Score for this review is:
3

the actual Rating Category for this review is:
Neutral/Negative

the Predicted Category for this review is:
Positive


for item number 2
the review text we have is:
Love the color and style, but material snags easily

the actual Score for this review is:
3

the actual Rating Category for this review is:
Neutral/Negative

the Predicted Category for this review is:
Positive


for item number 3
the review text we have is:
Looks beautiful online but has too much material and the zipper catches on the lace. also runs very large, i am normally a small but would need and xs in this dress

the actual Score for this review is:
3

the actual Rating Category for this review is:
Neutral/Negative

the Predicted Category for this review is:
Positive

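Looking at the first 3 cases, we can see that the prediction was off because those reviews contain words such as "love" and "beautiful". Our model learned that these words occur mostly in positive reviews, so it assigned a higher probability to the Positive class despite the criticism in the rest of the text. Because the bag-of-words representation ignores word order, a phrase such as "not in love" still counts as an occurrence of "love".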
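
To illustrate how one strongly positive word can outweigh several critical ones, the sketch below computes a per-word contribution to the decision as the log of the ratio between the two class likelihoods. The tiny corpus is invented for illustration and is not the review data, so the numbers say nothing about the fitted model; with the real model, the same idea could be applied to its learned per-class word probabilities.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration (not the review data).
# Texts are already lowercased and stripped of stop words.
corpus = {
    "Positive": ["love dress", "love color", "love style material",
                 "beautiful dress love"],
    "Neutral/Negative": ["material snags easily", "poor fit"],
}

word_counts = {c: Counter(" ".join(texts).split()) for c, texts in corpus.items()}
vocab = sorted(set().union(*word_counts.values()))

def log_odds(word, alpha=1.0):
    """Smoothed log of P(word | Positive) / P(word | Neutral/Negative)."""
    probs = {}
    for c, counts in word_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        probs[c] = (counts[word] + alpha) / denom
    return math.log(probs["Positive"] / probs["Neutral/Negative"])

# Content words of the second misclassified review shown above.
review = "love color style material snags easily"
contributions = {w: log_odds(w) for w in review.split() if w in vocab}

# Log odds of the class priors: 4 positive texts against 2 negative ones.
prior = math.log(len(corpus["Positive"]) / len(corpus["Neutral/Negative"]))
total = prior + sum(contributions.values())

for word, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:>10}: {value:+.2f}")
print(f"{'prior':>10}: {prior:+.2f}")
print(f"{'total':>10}: {total:+.2f}  (above zero means Positive)")
```

In this toy setting "snags" and "easily" pull towards *Neutral/Negative*, but "love" together with the larger share of positive texts still tips the total towards *Positive*.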