# Assignment 6

## Bag-of-words model and Naïve Bayes - how it works
### Bag-of-words
This model represents each text by how often each word occurs in it, ignoring grammar and word order. It is simple, but quite effective.
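
To make the counting concrete, here is a minimal sketch using only the Python standard library. The two reviews are made up for illustration and are not taken from the dataset; the notebook itself uses `CountVectorizer` for this step.

```python
from collections import Counter

# Two made-up reviews (hypothetical examples, not from the dataset)
reviews = [
    "love this dress love the fit",
    "the dress is too small",
]

# Vocabulary: every distinct word across all reviews, sorted alphabetically
vocabulary = sorted({word for review in reviews for word in review.split()})

# One row per review, one column per vocabulary word, each cell a word count
bag_of_words = [
    [Counter(review.split())[word] for word in vocabulary]
    for review in reviews
]

print(vocabulary)
for row in bag_of_words:
    print(row)
```

Each row is one review and each column is one word, so "love" gets a 2 in the first row and a 0 in the second. The order of the words inside a review is lost.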

### Naïve Bayes
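This classifier calculates the probability that a text belongs to a certain category. It treats the words as independent of each other given the category, which is the "naïve" assumption.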

Together, the two models let you classify text: the bag-of-words model turns each text into word counts, and Naïve Bayes uses those counts to calculate the probability of each category.
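
As a rough illustration of how the two fit together, the sketch below implements a tiny multinomial Naïve Bayes by hand. The training sentences, the function name `predict_proba` and the smoothing value `alpha=1.0` are assumptions chosen for this example; it is meant to show the idea, not to reproduce scikit-learn's implementation exactly.

```python
import math
from collections import Counter

# Hypothetical toy training data: (review text, label)
train = [
    ("love this dress", "Positive"),
    ("great fit love it", "Positive"),
    ("too small poor quality", "Negative"),
]

vocabulary = {word for text, _ in train for word in text.split()}
labels = sorted({label for _, label in train})

# Word counts per class and number of documents per class
word_counts = {label: Counter() for label in labels}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1


def predict_proba(text, alpha=1.0):
    """Return {label: probability} using Bayes' rule with add-alpha smoothing."""
    log_scores = {}
    for label in labels:
        total_words = sum(word_counts[label].values())
        # Log prior: share of training documents with this label
        score = math.log(doc_counts[label] / len(train))
        for word in text.split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Smoothed log likelihood of the word given the class
            score += math.log(
                (word_counts[label][word] + alpha)
                / (total_words + alpha * len(vocabulary))
            )
        log_scores[label] = score
    # Normalise so that the probabilities sum to 1
    norm = sum(math.exp(score) for score in log_scores.values())
    return {label: math.exp(score) / norm for label, score in log_scores.items()}


probs = predict_proba("love the fit")
print(probs)
```

Because "love" and "fit" only occur in the positive training sentences, the positive class ends up with the higher probability. The smoothing keeps the negative probability above zero even though those words never appeared in a negative review.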

In [88]:
import pandas as pd

## Data pre-processing steps

In [89]:
df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


Keep only the rows whose class name is Dresses. Then keep only the two columns needed for the prediction, Review Text and Rating.

In [90]:
df_subset = df.loc[(df['Class Name'] == 'Dresses')] #only show the dresses
df_subset = df_subset[['Review Text', 'Rating']] #only show the columns review text and rating
df_subset.head()

Unnamed: 0,Review Text,Rating
1,Love this dress! it's sooo pretty. i happene...,5
2,I had such high hopes for this dress and reall...,3
5,"I love tracy reese dresses, but this one is no...",2
8,I love this dress. i usually get an xs but it ...,5
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5


Next I recode the Rating column. Ratings of 4 or 5 become the string value Positive, and ratings of 1, 2 or 3 become the string value Negative.

In [91]:
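rating_labels = {1: 'Negative', 2: 'Negative', 3: 'Negative', 4: 'Positive', 5: 'Positive'} #ratings 1-3 are negative, 4-5 are positive
df_subset = df_subset.assign(Rating=df_subset['Rating'].map(rating_labels)) #assign returns a new dataframe, so we do not modify a slice in place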
df_subset.head()

Unnamed: 0,Review Text,Rating
1,Love this dress! it's sooo pretty. i happene...,Positive
2,I had such high hopes for this dress and reall...,Negative
5,"I love tracy reese dresses, but this one is no...",Negative
8,I love this dress. i usually get an xs but it ...,Positive
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",Positive


## Text pre-processing steps

Then I convert the review texts to Unicode strings and build a vocabulary (dictionary) of the words they contain.

In [92]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = df_subset['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode

vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the review text
feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (get_feature_names() in scikit-learn versions before 1.0)
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")


There are 8080 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']


### Document-feature matrix

Then I can count the occurrences of each word in each review with the document-feature matrix.

In [93]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat[0:500,0:950]) #Let's print a little part of the matrix: the first 500 documents and the first 950 words

  (0, 855)	2
  (2, 8)	1
  (2, 545)	1
  (2, 768)	1
  (4, 72)	1
  (4, 224)	1
  (5, 664)	1
  (7, 877)	1
  (8, 941)	1
  (10, 768)	1
  (10, 941)	1
  (11, 775)	1
  (12, 636)	1
  (14, 546)	1
  (14, 885)	1
  (14, 910)	2
  (15, 586)	1
  (15, 876)	2
  (16, 512)	1
  (16, 876)	1
  (17, 899)	1
  (18, 210)	1
  (18, 546)	1
  (18, 927)	1
  (18, 941)	1
  :	:
  (488, 943)	1
  (489, 211)	1
  (489, 304)	1
  (489, 316)	1
  (489, 720)	1
  (489, 817)	1
  (489, 877)	2
  (491, 104)	1
  (491, 243)	1
  (491, 422)	1
  (492, 84)	1
  (492, 96)	1
  (492, 186)	1
  (492, 233)	1
  (492, 242)	1
  (492, 248)	1
  (493, 828)	1
  (494, 86)	1
  (494, 485)	1
  (494, 768)	1
  (494, 924)	1
  (496, 927)	1
  (497, 37)	1
  (497, 674)	1
  (498, 876)	1
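
The printed matrix is sparse: each line reads `(document, word column) count`, and every pair that is not listed is a zero. As a small sketch of how to read it, the snippet below stores a few of the entries printed above in a plain dictionary. The helper name `count` is made up for this example.

```python
# A few (document, word column) -> count entries copied from the output above
entries = {(0, 855): 2, (2, 8): 1, (2, 545): 1, (2, 768): 1}


def count(doc, word_index):
    """Count stored for a document/word pair; pairs that are not stored are zeros."""
    return entries.get((doc, word_index), 0)


print(count(0, 855))  # listed in the output, so the stored count is returned
print(count(1, 855))  # document 1 has no entry in this slice, so the count is 0

# Dense view of document 2 for three chosen columns
columns = [8, 545, 900]
print([count(2, column) for column in columns])
```

Storing only the non-zero cells is what keeps a matrix with thousands of word columns manageable, since most reviews use only a handful of the words in the vocabulary.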


## Splitting the data into a training and test set

In [94]:
from sklearn.model_selection import train_test_split
y = df_subset['Rating'] #We need to take out the rating as our Y-variable
X = docu_feat #and the document-feature matrix as our X variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data into training and test set, store it into different variables

## Naïve Bayes classifier

Then I train a Naïve Bayes classifier to predict whether a review is positive or negative.

In [95]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train, y_train) #train the classifier on the training set

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Evaluation

In [96]:
y_test_p = clf.predict(X_test)#the prediction of the model
clf.score(X_test, y_test) #the accuracy of the model

0.8554852320675106

The accuracy is about 86%. That is well above the roughly 76% you would get by always predicting Positive, since 1432 of the 1896 test reviews are positive.

In [97]:
from sklearn.metrics import confusion_matrix
y_test_pred = clf.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[ 282,  182],
       [  92, 1340]], dtype=int64)

In [98]:
clf.classes_ #see which label is what

array(['Negative', 'Positive'], dtype='<U8')

In [99]:
#To make it easier to read, let's make a dataframe out of it and add labels to it.
conf_matrix = pd.DataFrame(cm, index=['Negative', 'Positive'], columns = ['Negative (predicted)', 'Positive (predicted)']) 
conf_matrix

Unnamed: 0,Negative (predicted),Positive (predicted)
Negative,282,182
Positive,92,1340


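The confusion matrix shows how many reviews are predicted right and wrong.
Of the 464 negative reviews, 282 are correctly predicted and 182 are wrongly predicted as positive. Of the 1432 positive reviews, 1340 are correctly predicted and 92 are wrongly predicted as negative.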
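
The accuracy, precision and recall can all be read off the confusion matrix. The sketch below recomputes them by hand from the four numbers shown above, using plain Python. The variable names are chosen for this example.

```python
# Confusion matrix from the output above
# Rows: true Negative, true Positive. Columns: predicted Negative, predicted Positive.
cm = [[282, 182],
      [92, 1340]]
labels = ["Negative", "Positive"]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total  # correct predictions / all predictions

precision = {}
recall = {}
for i, label in enumerate(labels):
    correct = cm[i][i]
    predicted_as_label = cm[0][i] + cm[1][i]  # column sum
    actually_label = sum(cm[i])               # row sum
    precision[label] = correct / predicted_as_label
    recall[label] = correct / actually_label
    print(f"{label}: precision={precision[label]:.2f}, recall={recall[label]:.2f}")

print(f"accuracy={accuracy:.4f}")
```

Precision divides by a column sum (everything predicted as that label), while recall divides by a row sum (everything that really has that label).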

In [100]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

    Negative       0.75      0.61      0.67       464
    Positive       0.88      0.94      0.91      1432

   micro avg       0.86      0.86      0.86      1896
   macro avg       0.82      0.77      0.79      1896
weighted avg       0.85      0.86      0.85      1896



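For precision and recall you see a clear difference between negative and positive reviews. The negative reviews score lower, with a precision of 0.75 and a recall of 0.61, so the model misses about 4 in 10 negative reviews. Positive reviews on the other hand score high, with a precision of 0.88 and a recall of 0.94.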
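
The f1-score and the two averages in the report follow from the same confusion matrix. As a sketch, assuming the usual definitions (f1 as the harmonic mean of precision and recall, macro as the plain mean over classes, weighted as the mean weighted by support), they can be recomputed like this:

```python
# Confusion matrix from the evaluation above
# Rows: true Negative, true Positive. Columns: predicted Negative, predicted Positive.
cm = [[282, 182], [92, 1340]]
support = [sum(row) for row in cm]  # number of true reviews per class

f1_scores = []
for i in range(2):
    precision = cm[i][i] / (cm[0][i] + cm[1][i])
    recall = cm[i][i] / support[i]
    # Harmonic mean of precision and recall
    f1_scores.append(2 * precision * recall / (precision + recall))

# Macro: every class counts equally. Weighted: classes count by their support.
macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f1 * n for f1, n in zip(f1_scores, support)) / sum(support)

print([round(f1, 2) for f1 in f1_scores], round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average is lower than the weighted average because it gives the weaker negative class the same weight as the much larger positive class.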

## Predicting probabilities

Then I predict the probabilities for the reviews. For every row you see the actual rating (positive or negative), the review text, and the predicted probability that the review is negative or positive.

In [103]:
for i in range(55): #get the predictions for the text of the first 55 rows
    prob = clf.predict_proba(X[i]) 
    print(i, f"The rating is: {df_subset.iloc[i,1]}")
    print(df_subset.iloc[i,0])
    print(f"Negative: {prob[0,0]}, Positive: {prob[0,1]}")

0 The rating is: Positive
Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
Negative: 5.337412695289174e-05, Positive: 0.9999466258730467
1 The rating is: Negative
I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
Negative: 0.9999990978503296, Positive: 9.021496679830375e-07
2 The rating is: Ne

### Off target

At rows 12, 13 and 53 the rating is negative, but the prediction is positive. This is probably because of individual words (beautiful, good, like and love). They sound positive separately, but when you read the whole review it does not sound as positive. This is where the model trips up.
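
The sketch below illustrates the problem with two made-up sentences (not taken from the dataset). Every word of the positive sentence also occurs in the disappointed one, so single-word counts alone cannot tell them apart on those words. Counting word pairs (bigrams) is one possible way to keep some context; this was not tried in this notebook, and `CountVectorizer` exposes it through its `ngram_range` parameter.

```python
from collections import Counter


def unigrams(text):
    """Single-word counts, as in the bag-of-words model."""
    return Counter(text.split())


def bigrams(text):
    """Counts of adjacent word pairs, which keep a little word order."""
    words = text.split()
    return Counter(zip(words, words[1:]))


# Hypothetical sentences, not taken from the dataset
praise = "i love this dress"
complaint = "i wanted to love this dress"

# Every word in the praise also appears in the complaint
shared_words = set(unigrams(praise)) & set(unigrams(complaint))
print(sorted(shared_words))

# The pair ("to", "love") only appears in the complaint
print(("to", "love") in bigrams(praise), ("to", "love") in bigrams(complaint))
```

With single words, "love" pushes both sentences toward Positive. With word pairs, the complaint contains combinations such as ("wanted", "to") and ("to", "love") that the praise does not, which gives a classifier something to distinguish them by.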