In this assignment, you are going to use text mining to predict the rating of a dress from online reviews.

Objective

Predict whether dress reviews are positive (>3 stars) or neutral/negative (<4 stars). Write a Jupyter Notebook report documenting your investigation.

Dataset

We are using the Women’s E-Commerce Clothing Reviews data set. Note that the full data set is included, but we are running a model on the reviews of dresses only.

- The data set
- The data set on Kaggle, for context

Included in your Jupyter Notebook

- Explain briefly in your own words how the bag-of-words model and Naïve Bayes work, and how they work together.
- Pre-processing steps (don’t forget to filter out all non-dress reviews).
- The head() of the resulting dataframe.
- Text pre-processing steps resulting in a document-feature matrix
- Split the file into a training and a test set.
- Train a Naïve Bayes classifier predicting whether a review is positive (>3 stars) or neutral/negative (<4 stars).
- Evaluate the performance of your model on the test set.
- Check out 3 cases where your model is off target. Inspect the associated texts. Do you understand why your model trips up? Explain.

Please provide a link to your Notebook on GitHub. Make sure the GitHub folder includes the data file so the Notebook runs without problems.

Note

- Only comments on the code should be in code formatting. Answers to questions in the assignment (e.g., "Explain how linear regression works in your own words" or "evaluate the performance of your model") belong in text (Markdown) cells.
- Use Markdown formula notation for mathematical formulas: http://csrgxtu.github.io/2015/03/20/Writing-Mathematic-Fomulars-in-Markdown/
- The Jupyter Notebook should run in its entirety. An assignment that doesn't run will not be scored "complete". If you can't get a certain section to run, please comment out the code and explain what you would want it to do.

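The bag-of-words model aims for the simplest possible representation of a text document. Word order and grammar are discarded, and each document is reduced to the counts of the words it contains, usually after removing uninformative stop words. Stacking these count vectors gives a document-feature matrix with one row per document and one column per word. Naïve Bayes is a probabilistic classifier that can take these word counts as input. It uses Bayes' rule to estimate, for each class, the probability of that class given the words in the document, under the "naive" assumption that words occur independently of each other within a class. The document is then assigned to the class with the highest probability.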
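
To make the interplay of the two steps concrete, here is a minimal from-scratch sketch. It is not the code used in this notebook (which relies on scikit-learn further down); the four toy reviews, the helper names `log_posterior` and `predict`, and the smoothing value `alpha=1.0` are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Toy, made-up reviews (not taken from the data set); 1 = positive, 0 = neutral/negative
train_docs = [
    ("love dress fits perfectly", 1),
    ("beautiful dress love fabric", 1),
    ("dress runs small returned", 0),
    ("cheap fabric returned dress", 0),
]

# Bag of words: per class, count how often each word occurs; word order is dropped
class_word_counts = {0: Counter(), 1: Counter()}
class_doc_counts = Counter()
for text, label in train_docs:
    class_word_counts[label].update(text.split())
    class_doc_counts[label] += 1

vocabulary = set(class_word_counts[0]) | set(class_word_counts[1])

def log_posterior(text, label, alpha=1.0):
    """Unnormalised log P(label | text) under multinomial Naive Bayes."""
    counts = class_word_counts[label]
    total = sum(counts.values())
    # log prior: share of training documents that carry this label
    score = math.log(class_doc_counts[label] / len(train_docs))
    for word in text.split():
        if word not in vocabulary:
            continue  # words never seen in training carry no information
        # Laplace smoothing (alpha) keeps unseen word/class pairs from getting probability zero
        score += math.log((counts[word] + alpha) / (total + alpha * len(vocabulary)))
    return score

def predict(text):
    # pick the class with the highest posterior score
    return max((0, 1), key=lambda label: log_posterior(text, label))

print(predict("love fabric"))
print(predict("returned small"))
```

Working in log space avoids multiplying many small probabilities, which would otherwise underflow for long reviews.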

In [1]:
# Importing  libraries
import seaborn as sns
import sklearn as sk 
import pandas as pd
import matplotlib.pyplot as plt 
import re
from bs4 import BeautifulSoup 
from sklearn.naive_bayes import MultinomialNB

In [2]:
# Preprocessing
DataFrame = pd.read_csv('Assignment text mining - data clothing reviews.csv') # import the data set
DataFrame = DataFrame[DataFrame['Class Name'] == 'Dresses'] # keep only the dress reviews (6319 reviews)
DataFrame = DataFrame.reset_index() # reset the index

DataFrame = DataFrame[['Review Text', 'Rating']].copy() # keep only the review text and the rating
DataFrame.loc[DataFrame['Rating'] < 4, 'Positive'] = 0 # neutral/negative: 1-3 stars
DataFrame.loc[DataFrame['Rating'] > 3, 'Positive'] = 1 # positive: 4-5 stars
DataFrame = DataFrame.drop(columns=['Rating']) # remove the old rating column
DataFrame = DataFrame.dropna() # drop rows that don't have a review text
DataFrame = DataFrame.reset_index() # reset the index; the old row labels are kept in an 'index' column
DataFrame.head()

print(len(DataFrame)) # 6145 reviews of dresses left

6145


In [6]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop = set(stopwords.words("english")) # build the set of stop words once, outside the loop

for i in range(len(DataFrame)):
    review_text = BeautifulSoup(DataFrame.loc[i, 'Review Text'], "html.parser").get_text() # remove html
    letters = re.sub("[^a-zA-Z]", " ", review_text) # keep letters only
    words = letters.lower().split() # lowercase the text and split it into words
    meaningful_words = [w for w in words if w not in stop] # drop the stop words
    DataFrame.loc[i, 'Review Text'] = " ".join(meaningful_words) # write back with .loc to avoid chained assignment
print(DataFrame)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aylakok/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


      index                                        Review Text  Positive
0         0  love dress sooo pretty happened find store gla...       1.0
1         1  high hopes dress really wanted work initially ...       0.0
2         2  love tracy reese dresses one petite feet tall ...       0.0
3         3  love dress usually get xs runs little snug bus...       1.0
4         4  lbs ordered petite make sure length long typic...       1.0
...     ...                                                ...       ...
6140   6314  surprised positive reviews product terrible cu...       0.0
6141   6315  happy snag dress great price easy slip flatter...       1.0
6142   6316  fit well top see never would worked glad able ...       0.0
6143   6317  bought dress wedding summer cute unfortunately...       0.0
6144   6318  dress lovely platinum feminine fits perfectly ...       1.0

[6145 rows x 3 columns]


In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer="word",preprocessor=None,stop_words="english",max_features=5000) # bag of words: keep the 5000 most frequent words

train_data_features = vectorizer.fit_transform(DataFrame['Review Text']) # document-feature matrix: one row per review, one column per word
train_data_features = train_data_features.toarray()

# separate x and y into a training and a test set, in file order
# note: the test slice starts at 4201, so row 4200 ends up in neither set
x_training=train_data_features[0:4200]
x_testing=train_data_features[4201:6145]

y_training=DataFrame['Positive'][0:4200]
y_testing=DataFrame['Positive'][4201:6145]
y_testing=y_testing.reset_index() # positions in the test set now run from 0 to 1943
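
The cell above fits the vectorizer on all reviews and splits the rows by position. A common alternative, sketched here under assumptions, is to shuffle and split first and to fit the vocabulary on the training texts only, so that no information from the test set reaches the model. The eight toy texts, the 25% test share and `random_state=0` are illustrative choices, not values from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned review texts and the 'Positive' labels
texts = [
    "love dress fits perfectly", "dress runs small returned",
    "beautiful fabric great fit", "cheap fabric poor quality",
    "flattering cut love colour", "zipper broke first wear",
    "comfortable pretty dress", "itchy lining sent back",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Shuffle and split; stratify keeps the class shares similar in both parts
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(train_texts)  # vocabulary learned from the training texts only
x_test = vectorizer.transform(test_texts)        # test texts are mapped onto that same vocabulary

print(x_train.shape, x_test.shape)
```

Words that occur only in the test texts are ignored by `transform`, which mirrors how the model would treat unseen words in new reviews.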

In [8]:
Bayes = MultinomialNB() # create the model
model = Bayes.fit(x_training,y_training) # train the model
y_predicted = model.predict(x_testing) # predict the labels of the test set


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_testing['Positive'],y_predicted) # rows: true labels, columns: predicted labels
accuracy = cm.trace()/cm.sum() # share of correct predictions
print(accuracy)

0.8410493827160493


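The accuracy of the model on the test set is 0.84: about 84% of the 1944 test reviews are classified correctly. This is a good result for such a simple model, although accuracy alone does not show how the errors are divided between the positive and the neutral/negative class.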
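
To put an accuracy figure in context, it helps to compare it with the share of the largest class (the score of a model that always predicts that class) and to look at precision and recall. The sketch below uses made-up labels, not the predictions from this notebook, so the numbers it prints say nothing about the model above.

```python
# Made-up labels for illustration only; 1 = positive, 0 = neutral/negative
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # positive reviews predicted positive
tn = sum(t == 0 and p == 0 for t, p in pairs)  # negative reviews predicted negative
fp = sum(t == 0 and p == 1 for t, p in pairs)  # negative reviews predicted positive
fn = sum(t == 1 and p == 0 for t, p in pairs)  # positive reviews predicted negative

accuracy = (tp + tn) / len(y_true)
# Accuracy of always predicting the most frequent class
n_positive = sum(y_true)
majority_baseline = max(n_positive, len(y_true) - n_positive) / len(y_true)
precision = tp / (tp + fp)    # how many predicted positives are truly positive
recall = tp / (tp + fn)       # how many true positives are found
specificity = tn / (tn + fp)  # how many true negatives are found

print(accuracy, majority_baseline, precision, recall, specificity)
```

If the accuracy is only slightly above the majority baseline, the model adds little over always predicting the most frequent class.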

Three examples of misclassified reviews are those at positions 13, 1941 and 1942 of the test set.

In [165]:
# the test set starts at row 4201, so test position k is DataFrame row 4201 + k
#DataFrame['Review Text'][4201 + 13]
#DataFrame['Review Text'][4201 + 1941]
#DataFrame['Review Text'][4201 + 1942]

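Review 13 contains both positive and negative statements, which makes it hard to classify. The same holds for the other two reviews: they say many good things but mention one strongly negative point. Because the bag-of-words representation ignores word order and context, and Naïve Bayes treats every word as independent evidence, the many positive words outweigh the single decisive complaint.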
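
Misclassified reviews can also be located programmatically instead of by eye. The sketch below uses made-up label lists standing in for `y_testing['Positive']` and `y_predicted`; the only value taken from this notebook is the start of the test slice (row 4201), which is needed to map a position in the test set back to a row of the data frame.

```python
test_start = 4201  # first DataFrame row of the test slice used in this notebook

# Made-up labels standing in for the true and predicted test labels
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

# Positions within the test set where prediction and true label disagree
wrong_positions = [k for k, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

# Corresponding rows in the full data frame, where the review text can be looked up
wrong_rows = [test_start + k for k in wrong_positions]

print(wrong_positions)
print(wrong_rows)
```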