# Text mining

The _bag-of-words_ model treats a text as an unordered collection (a bag) of words and counts how often each word occurs. It ignores other textual aspects, such as grammar and word order.
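
As a minimal sketch of the idea (not the code used in this notebook), a bag of words can be built with Python's `collections.Counter`. The naive lowercase-and-split tokenizer and the two example sentences are assumptions for illustration; `CountVectorizer` tokenizes differently.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    # Naive tokenizer (illustration only): lowercase, then split on whitespace
    return Counter(text.lower().split())


# Two made-up sentences with the same words in a different order
a = bag_of_words("the dress is not good it is bad")
b = bag_of_words("the dress is not bad it is good")

print(a["is"])  # 2: the word 'is' occurs twice
print(a == b)   # True: identical bags, although the sentences mean the opposite
```

The second print shows why word order matters for sentiment: both sentences yield exactly the same counts.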

_Naive Bayes_ methods are a set of supervised learning algorithms based on applying _Bayes' theorem_ with the naive assumption of conditional independence between every pair of features given the value of the class variable.
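
To make the independence assumption concrete, the sketch below implements a toy multinomial Naive Bayes classifier with Laplace smoothing in plain Python. It is not scikit-learn's implementation; the function names `train_nb` and `predict_nb` and the four short reviews are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = {w for d in docs for w in d.split()}
    word_counts = defaultdict(Counter)  # class -> word -> count
    class_counts = Counter(labels)      # class -> number of documents
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_prior = math.log(class_counts[c] / len(docs))
        # Laplace smoothing: add alpha to every word count
        log_lik = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[c] = (log_prior, log_lik)
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c, (log_prior, log_lik) in model.items():
        # Independence assumption: the per-word log likelihoods simply add up.
        # Words that are not in the vocabulary are skipped.
        scores[c] = log_prior + sum(log_lik[w] for w in doc.split() if w in log_lik)
    return max(scores, key=scores.get)


# Hypothetical toy training data
docs = ["love this dress", "great fit love it", "too small returned it", "cheap fabric too tight"]
labels = ["Positive", "Positive", "Negative", "Negative"]
model = train_nb(docs, labels)

print(predict_nb(model, "love the fit"))  # Positive
print(predict_nb(model, "too tight"))     # Negative
```

Because every word contributes independently, a review with many positive words is pulled towards Positive no matter how those words are used in the sentence.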

In [1]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Import the CountVectorizer object
from sklearn.feature_extraction.text import CountVectorizer

# Import Naive Bayes
from sklearn.naive_bayes import MultinomialNB

## Clothing reviews

The data set used in this analysis was downloaded from _Kaggle_ and can be viewed here: [Women's E-Commerce Clothing Reviews](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/home).

In [2]:
# Import data set
df = pd.read_csv("6_clothing reviews.csv")

df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


#### The data explained

The data set consists of 23486 rows and 10 feature variables. Each row corresponds to a customer review and includes the following variables:

* Clothing ID: Integer categorical variable that refers to the specific piece being reviewed
* Age: Positive integer variable for the reviewer's age
* Title: String variable for the title of the review
* Review Text: String variable for the review body
* Rating: Positive ordinal integer variable for the product score granted by the customer, from 1 (worst) to 5 (best)
* Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended
* Positive Feedback Count: Positive integer documenting the number of other customers who found this review positive
* Division Name: Categorical name of the product's high-level division
* Department Name: Categorical name of the product department
* Class Name: Categorical name of the product class

## Pre-processing and subsetting

In [3]:
# Does a review contain the word dress?
# df = df[df["Review Text"].str.contains("dress", regex=False, case=False, na=False)]

# Single out the dress reviews (best solution)
df_dress = df[df["Class Name"] == "Dresses"]

# Drop rows with a missing value in any column (this also removes reviews without a title)
df_dress = df_dress.dropna()

# Define Neutral / Negative or Positive rating, apply the function and create a new column
def rating_word(row):
    if row["Rating"] > 3:
        return "Positive"
    else:
        return "Neutral / Negative"

df_dress["Rating Word"] = df_dress.apply(rating_word, axis=1)

# Final subset
df_dress_final = df_dress[["Rating Word", "Review Text"]]

df_dress_final.head()

Unnamed: 0,Rating Word,Review Text
2,Neutral / Negative,I had such high hopes for this dress and reall...
5,Neutral / Negative,"I love tracy reese dresses, but this one is no..."
8,Positive,I love this dress. i usually get an xs but it ...
9,Positive,"I'm 5""5' and 125 lbs. i ordered the s petite t..."
10,Neutral / Negative,Dress runs small esp where the zipper area run...


## Naive Bayes

In [4]:
# Take the text from the DataFrame; convert it to Unicode
text = df_dress_final["Review Text"].values.astype("U")

# Create the CountVectorizer object with English stop words
vect = CountVectorizer(stop_words="english")

# Fit the model with the words from the review texts
vect = vect.fit(text)
vect

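# Get the words from the vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_names = vect.get_feature_names_out().tolist()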
# feature_names

print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 7747 words in the vocabulary. A selection: ['allusion', 'allusione', 'almsot', 'alr', 'alright', 'als', 'altar', 'alter', 'alteration', 'alterations', 'altered', 'altering', 'alternate', 'alternations', 'alternative', 'althetic', 'altho', 'altogether', 'am5', 'amadi']


#### Document-feature matrix

In [5]:
# Use the transform method from the CountVectorizer object to create a document-feature matrix
docu_feat = vect.transform(text)

# Print part of the matrix; the first 50 documents and words
print(docu_feat[0:50,0:50])

  (1, 7)	1
  (15, 37)	1
  (16, 3)	1
  (16, 44)	1
  (17, 11)	1
  (19, 39)	1
  (26, 11)	2
  (29, 30)	1
  (43, 0)	1


#### (Regular matrix)

In [6]:
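# Make a regular matrix out of docu_feat; make it into a DataFrame and concatenate it along the columns
# (the new DataFrame reuses the index of df_dress_final so that the rows line up)
# rev_words = pd.concat([df_dress_final, pd.DataFrame(docu_feat.toarray(), index=df_dress_final.index)], axis=1)
# rev_words.head(10)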

#### Train and test set

In [7]:
# Create a separate X and y
X = docu_feat
y = df_dress_final["Rating Word"]

In [8]:
# Randomly split the data into a 70% train set and a 30% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

#### Training the model

In [9]:
nb = MultinomialNB()

# Fit the model
nb = nb.fit(X_train, y_train)

## Evaluating the model

In [10]:
# Predicted values
y_pred = nb.predict(X_test)

# Evaluate the model
nb.score(X_test, y_test)

0.8542183622828784

The accuracy of the model is about 85%, which looks trustworthy at first glance. As we will see below, however, the model sometimes scores a negative review as positive because the review contains positive words. (The model doesn't consider context.)
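
A useful reference point for this accuracy is the majority-class baseline, i.e. the accuracy of always predicting the most frequent class. The sketch below is plain arithmetic on numbers reported elsewhere in this notebook's output (406 Neutral / Negative and 1206 Positive test reviews, 238 + 1139 correct predictions); the variable names are illustrative.

```python
# Test-set class counts and correct predictions, taken from this notebook's output
support = {"Neutral / Negative": 406, "Positive": 1206}
correct = 238 + 1139

total = sum(support.values())
baseline = max(support.values()) / total  # always predict the largest class
accuracy = correct / total

print(f"Baseline accuracy: {baseline:.3f}")  # 0.748
print(f"Model accuracy:    {accuracy:.3f}")  # 0.854
```

On these numbers the model beats the baseline by roughly ten percentage points, which puts the 85% in perspective.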

#### Confusion matrix

In [11]:
# Determine the order of the confusion matrix
nb.classes_

array(['Neutral / Negative', 'Positive'], dtype='<U18')

In [12]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Create DataFrame and label it
cm = pd.DataFrame(cm, index=["Neutral / Negative", "Positive"], columns=["Neutral / Negative pred.", "Positive pred."])
cm

Unnamed: 0,Neutral / Negative pred.,Positive pred.
Neutral / Negative,238,168
Positive,67,1139


From the confusion matrix we can conclude that:

* 238 Neutral / Negative reviews are correctly predicted as Neutral / Negative
* 1139 Positive reviews are correctly predicted as Positive
* 67 Positive reviews are wrongly predicted as Neutral / Negative
* 168 Neutral / Negative reviews are wrongly predicted as Positive

Of the 1612 predictions, 1377 are correct and 235 (about 15%) are wrong; a substantial number of errors.

#### Precision and recall

In [13]:
# Calculate precision and recall
print(classification_report(y_test, y_pred))

                    precision    recall  f1-score   support

Neutral / Negative       0.78      0.59      0.67       406
          Positive       0.87      0.94      0.91      1206

          accuracy                           0.85      1612
         macro avg       0.83      0.77      0.79      1612
      weighted avg       0.85      0.85      0.85      1612



For Neutral / Negative reviews, _precision_ is acceptable at 78%, but _recall_ is weak at 59%: roughly four out of ten of these reviews are labelled Positive.

Positive reviews are easier to predict and score clearly higher, most likely because they share the same positive words, such as 'love' or 'comfortable', and because they make up about three quarters of the test set (1206 of 1612). _Precision_ is good at 87% and _recall_ is even better at 94%.
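
The figures in the report can be reproduced by hand from the four counts in the confusion matrix. The sketch below does this in plain Python; the dictionary layout and the helper `precision_recall_f1` are illustrative and not part of scikit-learn.

```python
# Counts from the confusion matrix above, keyed as (actual class, predicted class)
cm = {
    ("Neutral / Negative", "Neutral / Negative"): 238,
    ("Neutral / Negative", "Positive"): 168,
    ("Positive", "Neutral / Negative"): 67,
    ("Positive", "Positive"): 1139,
}
classes = ["Neutral / Negative", "Positive"]


def precision_recall_f1(cls):
    """Compute precision, recall and F1 for one class from the counts in cm."""
    tp = cm[(cls, cls)]
    predicted_total = sum(cm[(a, cls)] for a in classes)  # column total
    actual_total = sum(cm[(cls, p)] for p in classes)     # row total
    precision = tp / predicted_total
    recall = tp / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for cls in classes:
    p, r, f1 = precision_recall_f1(cls)
    print(f"{cls:>18}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```

Precision divides the correct predictions by the column total (everything predicted as that class), while recall divides them by the row total (everything that actually belongs to that class).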

## Probabilities

#### Off target #1

In [14]:
# Get the probabilities for a certain text belonging to a class 
print(df_dress_final.iloc[0,1])
print(nb.predict_proba(X_test[0]))

I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
[[0.10654598 0.89345402]]


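The printed review is negative, yet the model returns an 89% probability of Positive. If both lines refer to the same review, words like 'comfortable' and 'nicely' are probably misleading the model. That is not guaranteed, though: `train_test_split` shuffles the rows, so `X_test[0]` is the first row of the test set and not necessarily the review at `df_dress_final.iloc[0]`.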

#### Off target #2

In [15]:
print(df_dress_final.iloc[114,1])
print(nb.predict_proba(X_test[114]))

I wanted to love this dress. the colors are heavenly and it looks light and airy. it isn't, it is very heavy, much too heavy for florida heat. the top layer is a beautiful sand color and while the fabric is nice, the heaviness of the top really weighs the whole dress down. i felt like it added ten pounds to my appearance, easily. back it went, boo hoo. if you are between sizes, i would size down, runs a bit big.
[[0.00307598 0.99692402]]


Here the result looks even worse: the model returns a probability of more than 99% for Positive, while the printed review clearly isn't positive. Assuming `X_test[114]` is indeed the printed review, words like 'love' and 'heavenly' suggest a positive review although they are used in a negative context.

#### Off target #3

In [16]:
print(df_dress_final.iloc[953,1])
print(nb.predict_proba(X_test[953]))

This dress is super cute and has the most striking color (i got the blue). i just love it. it's comfortable and flattering and hits me in all the right places. i am a petite girl and normally wear a 0 or 2 in retailer dresses, and, as the other reviewers mentioned, this runs small, so a 4 was what i needed, and even the 4 does not leave any wiggle room. that said, i like my dresses to fit sort of snugly so it works for me. i highly recommend it overall - the quality, cut, color, and comfort all ma
[[0.93953856 0.06046144]]


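'Super', 'cute', 'striking', 'love', 'comfortable', 'flattering' and 'recommend' all point to a positive review, but the model still returns a 94% probability of Neutral / Negative. Other than the words 'blue' and 'small', nothing in the text offers a clear-cut explanation. A more likely cause is an indexing mismatch: because `train_test_split` shuffles the rows, `X_test[953]` does not have to be the review at `df_dress_final.iloc[953]`, so the probabilities may belong to a different review.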
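
Positional lookups such as `df_dress_final.iloc[0, 1]` and `X_test[0]` only refer to the same review if the split preserves the row order, which `train_test_split` does not do by default. A minimal sketch of a lookup that stays aligned, assuming a made-up DataFrame `toy` in place of `df_dress_final` and a stand-in feature matrix in which every row stores its own original position:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a non-contiguous index, similar to df_dress_final
toy = pd.DataFrame(
    {
        "Rating Word": ["Positive", "Neutral / Negative"] * 5,
        "Review Text": [f"review {i}" for i in range(10)],
    },
    index=range(2, 32, 3),
)

# Stand-in for the document-feature matrix: row i contains the number i
X = np.arange(10).reshape(-1, 1)
y = toy["Rating Word"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# y_test keeps the original index labels, so they identify the test reviews
for pos in range(X_test.shape[0]):
    label = y_test.index[pos]
    print(X_test[pos, 0], "->", toy.loc[label, "Review Text"])
```

In this notebook's terms, the review scored by `nb.predict_proba(X_test[0])` would be `df_dress_final.loc[y_test.index[0], "Review Text"]` rather than `df_dress_final.iloc[0, 1]`.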