# Assignment 5

Fundamentals of Machine Learning <br>
Master Data-Driven Design <br>
University of Applied Science Utrecht

## Bag-of-words model and Naive Bayes

Bag-of-words is a text representation that counts the occurrences of words in a text. Since it only records the frequency of each individual word, context is lost: information about word order and sentence structure is discarded. For example, words that belong together, such as 'New York', are treated as the separate words 'New' and 'York' and counted separately.
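
As a small illustration of this, here is a sketch on two made-up sentences (they are not taken from the review dataset, and `vect_demo` is just a name chosen for this example). It shows that 'new' and 'york' end up as separate columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two made-up sentences, not taken from the review dataset
docs = ["I love New York", "New shoes, new dress, love both"]

vect_demo = CountVectorizer()
counts = vect_demo.fit_transform(docs)

# list the vocabulary in column order; 'new' and 'york' are separate entries
vocab = sorted(vect_demo.vocabulary_, key=vect_demo.vocabulary_.get)
dense = counts.toarray()

print(vocab)
print(dense)
```

Each row of `dense` is one sentence and each column one word. Nothing in the matrix records that 'york' followed 'new'.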

The Naive Bayes classifier, on the other hand, is an algorithm that assigns a class to an observation based on its features, which are the word counts in this case. The algorithm treats these features as independent of each other given the class. It is called naive because this assumption rarely holds in practice. Using the Naive Bayes algorithm, we can predict whether a review is positive or neutral/negative.
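
Written as a formula, the independence assumption looks roughly like this. The symbols are my own notation and not part of the assignment: $c$ is a class (positive or neutral/negative) and $w_1, \dots, w_n$ are the words in a review.

$$
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

The classifier picks the class $c$ for which the right-hand side is largest. The product is where the naive part comes in: each word contributes on its own, regardless of the other words around it.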

## Data Pre-Processing

In [1]:
# import the libraries used in this notebook
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [2]:
# importing the dataset
data = pd.read_csv('Assignment text mining - data clothing reviews.csv')

In [3]:
# discarding all reviews other than dresses
data = data.loc[(data['Class Name'] == 'Dresses')]

In [4]:
# only keeping review text and rating + nice and clean renaming
data = data[['Review Text', 'Rating']]
data = data.rename(columns = {'Review Text': 'review', 'Rating': 'rating'})

In [5]:
# transform the data into neutral/negative and positive
data['rating'] = data['rating'].map(
    {0: 'neutral/negative', 
     1: 'neutral/negative',
     2: 'neutral/negative',
     3: 'neutral/negative',
     4: 'positive',
     5: 'positive'})

In [6]:
# drop rows with missing values (e.g. reviews without text)
data = data.dropna()

In [7]:
# show the data
data.head()

| | review | rating |
|---|---|---|
| 1 | Love this dress! it's sooo pretty. i happene... | positive |
| 2 | I had such high hopes for this dress and reall... | neutral/negative |
| 5 | I love tracy reese dresses, but this one is no... | neutral/negative |
| 8 | I love this dress. i usually get an xs but it ... | positive |
| 9 | I'm 5"5' and 125 lbs. i ordered the s petite t... | positive |


In [8]:
# convert text to unicode
text = data['review'].values.astype('U') 

In [9]:
# create a vectorizer that ignores English stop words and learn the vocabulary from our text
vect = CountVectorizer(stop_words = 'english')
vect = vect.fit(text) 

In [10]:
# get the features
doc_feat = vect.transform(text)

In [11]:
# print a slice of the sparse document-feature matrix as (row, column) count
print(doc_feat[0 : 50, 0 : 50])

  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (25, 40)	1
  (34, 12)	2
  (38, 31)	1


## Training the model

In [12]:
# create the model
nb = MultinomialNB()

In [13]:
# document-feature matrix as x and rating as y
x = doc_feat
y = data['rating']

In [14]:
# create a test and training set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1)

In [15]:
# fit the model
nb = nb.fit(x_train, y_train)
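
One thing to note about the cells above: the vectorizer was fitted on all reviews before the split, so the vocabulary also contains words that only occur in the test set. A possible alternative is to split the raw text first and let a pipeline fit both steps on the training part only. This is only a sketch on a made-up mini dataset (the `texts`, `labels` and `pipe` names are invented for this example), not what was done in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# made-up mini dataset standing in for the review texts and ratings
texts = ["love this dress", "great fit, love it", "so flattering", "perfect and pretty",
         "too small", "cheap fabric", "did not fit", "runs small, returned it"]
labels = ["positive"] * 4 + ["neutral/negative"] * 4

# split the raw text first, so the test reviews stay unseen
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels)

# the pipeline learns the vocabulary and the classifier from the training part only
pipe = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
pipe.fit(txt_train, lab_train)

print(pipe.predict(txt_test))
```

Words in the test reviews that the vectorizer has not seen during training are simply ignored when predicting.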

## Evaluating the model

In [16]:
# make predictions
y_test_p = nb.predict(x_test)

In [17]:
# calculate the accuracy of those predictions
nb.score(x_test, y_test)

0.8508676789587852

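The model classifies about 85% of the test reviews correctly, which is quite okay. Accuracy alone can flatter a model when one class is much larger than the other, so the confusion matrix and the per-class scores below give a more complete picture.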

In [18]:
# make a confusion matrix
conf_matr = confusion_matrix(y_test, y_test_p)
pd.DataFrame(conf_matr, index = ['neutral/negative', 'positive'], 
             columns = ['neutral/negative pred', 'positive pred'])

| | neutral/negative pred | positive pred |
|---|---|---|
| neutral/negative | 301 | 168 |
| positive | 107 | 1268 |


In [19]:
# print classification report (labels must be passed by keyword in recent scikit-learn versions)
print(classification_report(y_test, y_test_p, labels = nb.classes_))

                  precision    recall  f1-score   support

neutral/negative       0.74      0.64      0.69       469
        positive       0.88      0.92      0.90      1375

        accuracy                           0.85      1844
       macro avg       0.81      0.78      0.79      1844
    weighted avg       0.85      0.85      0.85      1844
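
The numbers in this report follow directly from the confusion matrix above. The sketch below recomputes them by hand and adds a simple baseline: the accuracy you would get by always predicting the largest class. The counts are copied from the output above; the variable names are my own.

```python
import numpy as np

# counts copied from the confusion matrix above (rows: actual, columns: predicted)
conf = np.array([[301, 168],
                 [107, 1268]])

total = conf.sum()
accuracy = np.trace(conf) / total              # correct predictions / all predictions
recall = np.diag(conf) / conf.sum(axis=1)      # correct / actual count per class
precision = np.diag(conf) / conf.sum(axis=0)   # correct / predicted count per class
baseline = conf.sum(axis=1).max() / total      # always predict the largest class

print(f'accuracy: {accuracy:.2f}, baseline: {baseline:.2f}')
print(f'precision: {precision.round(2)}, recall: {recall.round(2)}')
```

Always predicting 'positive' would already be right for 1375 of the 1844 test reviews (about 75%), so the 85% accuracy is roughly ten percentage points above that baseline. The recall of 0.64 for neutral/negative shows where the model is weakest.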



In [20]:
# find wrong predictions to check
test_data = pd.DataFrame({'y_test' : y_test, 'y_test_p' : y_test_p})
test_data[test_data['y_test'] != test_data['y_test_p']].sort_index().head(10)

| | y_test | y_test_p |
|---|---|---|
| 23 | neutral/negative | positive |
| 52 | neutral/negative | positive |
| 383 | neutral/negative | positive |
| 480 | neutral/negative | positive |
| 506 | neutral/negative | positive |
| 533 | neutral/negative | positive |
| 560 | positive | neutral/negative |
| 734 | positive | neutral/negative |
| 755 | neutral/negative | positive |
| 1140 | neutral/negative | positive |


In [21]:
# a neutral/negative review mistaken for a positive one
prob = nb.predict_proba(x[23])
print(data.iloc[23]['review'])
print(f'negative: {prob[0, 0]}, positive: {prob[0, 1]}')

Perfect dress for hot, humid, sticky weather.
negative: 0.012275032134469864, positive: 0.9877249678655337


In [22]:
# another neutral/negative review mistaken for a positive one
prob = nb.predict_proba(x[52])
print(data.iloc[52]['review'])
print(f'negative: {prob[0, 0]}, positive: {prob[0, 1]}')

This dress has potential, but it didn't work for me. it runs true to size to a little big, i ordered medium, my usual size for maeve). as for length it fit me as the model (5'9"). the reason i'm not keeping it is that i wish it had some darts in the back to help define the waist a bit,
negative: 0.05943374614029698, positive: 0.9405662538597088


In [23]:
# a positive review mistaken for a neutral/negative one
prob = nb.predict_proba(x[560])
print(data.iloc[560]['review'])
print(f'negative: {prob[0, 0]}, positive: {prob[0, 1]}')

I love this dress. i'd get it in both colors if i could! the cut and fit is beautiful, i'd suggest sizing down if you feel like it's too boxy or lacks shape. the bottom skirt is round enough to where you can twirl ( #1 thing to look for in a dress ;) ) and it's just overall a classic pretty dress. my only complaint is that the overlay cut out seems a little bit delicate and i'm afraid it will be ruined after a few wears but it seems to be holding up fine so far and isn't incredibly delicate like
negative: 0.06752837329433599, positive: 0.932471626705665


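In the cases above the model appears to be wrong because of the way humans use language. In the first case the reviewer may have been sarcastic, which a bag-of-words model cannot detect. In the second case the model lacks context: words like 'potential', 'true', 'fit', 'define' and 'waist' make it seem like the reviewer was positive about the dress. The third case could also benefit from context: the reviewer gives suggestions to others and mentions small details that (s)he did not like, which could make it look like a negative review to the model.

These examples should be read with some caution. The printed probabilities put the third review at 93% positive, which does not match a neutral/negative prediction. The numbers in the table of wrong predictions are index labels, whereas `x[560]` and `data.iloc[560]` select by position. After filtering on dresses and calling `dropna()`, labels and positions no longer line up, so the reviews printed here are not guaranteed to be the misclassified ones.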
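
The difference between selecting by label and by position is easy to miss, so here is a minimal sketch on a made-up DataFrame (`demo` and its contents are invented for this example). Its index has gaps, just like `data` after filtering and `dropna()`.

```python
import pandas as pd

# made-up frame whose index has gaps, as happens after filtering and dropna()
demo = pd.DataFrame({'review': ['a', 'b', 'c', 'd']}, index=[1, 2, 5, 8])

by_label = demo.loc[2, 'review']        # row with index label 2
by_position = demo.iloc[2]['review']    # third row, whatever its label is

# translate an index label into a row position
position_of_5 = demo.index.get_loc(5)

print(by_label, by_position, position_of_5)
```

Assuming the same logic applies to this notebook, looking up a misclassified review by its label would mean using `data.loc[23, 'review']` for the text and `x[data.index.get_loc(23)]` for the matching row of the document-feature matrix.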

## Find most informative features

In [24]:
# get all features (get_feature_names_out() replaces get_feature_names() from scikit-learn 1.0 onwards)
feature_names = vect.get_feature_names_out()

# get top 10
top = 10

for i, class_label in enumerate(nb.classes_):
    
    # find top features (numbers)
    results = np.argsort(nb.feature_log_prob_[i])[-top:]
    
    # translate numbers to corresponding words
    print("%s: %s" % (class_label, ', '.join(feature_names[j] for j in results)))

neutral/negative: really, small, love, look, fit, just, fabric, size, like, dress
positive: flattering, fabric, just, great, like, wear, fit, love, size, dress


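I don't find these very informative, since 7 of the 10 words appear in both lists. That is to be expected: `feature_log_prob_` holds the log-probability of each word given the class, so sorting on it returns the most frequent words per class (such as 'dress' and 'size'), not the words that best separate the two classes.

If you cancel out the overlap, these words remain:
* neutral/negative: really, small, look
* positive: flattering, great, wear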
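
One way to get words that separate the classes better is to compare the two log-probabilities per word instead of sorting each class on its own. The sketch below does this on a made-up mini dataset (the `texts`, `labels`, `vect_demo` and `nb_demo` names are invented for this example). It is one possible approach and not part of the analysis above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# made-up mini dataset standing in for the reviews
texts = ["love this dress great fit", "great dress very flattering",
         "dress runs small", "small dress poor fabric"]
labels = ["positive", "positive", "neutral/negative", "neutral/negative"]

vect_demo = CountVectorizer()
nb_demo = MultinomialNB().fit(vect_demo.fit_transform(texts), labels)

# vocabulary in column order
words = np.array(sorted(vect_demo.vocabulary_, key=vect_demo.vocabulary_.get))

# classes_ is sorted alphabetically: row 0 is 'neutral/negative', row 1 is 'positive'
log_ratio = nb_demo.feature_log_prob_[1] - nb_demo.feature_log_prob_[0]
order = np.argsort(log_ratio)

print('most neutral/negative:', words[order[0]])
print('most positive:', words[order[-1]])
```

A word like 'dress' occurs equally often in both classes here, so its log-ratio is close to zero and it drops out of both ends of the ranking.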