# Text-mining - Week 6

<b> bag-of-words model </b>
Machine learning algorithms can't work directly with raw text, so you can use the bag-of-words model to convert the text into numbers (feature encoding). Basically, you build a representation of the text that describes the occurrence of words within a document. It consists of a vocabulary of known words and a measure of how often each of those words is present. The order of the words is ignored.

You first collect the data: small snippets/lines of text from your dataframe. Then you design the model's vocabulary by collecting all the unique words. The last step is scoring the words in each document, for example by counting how often each word occurs.
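The three steps above can be sketched in plain Python. This is only a minimal illustration: the two sentences in `docs` are made up, and splitting on whitespace is a simplifying assumption (it is not what scikit-learn does internally).

```python
from collections import Counter

# Step 1: collect the data (two made-up snippets, not from the review dataset)
docs = ["love this dress love it", "this dress is too small"]

# Step 2: design the vocabulary: all unique words, sorted for a stable order
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Step 3: score each document by counting how often each vocabulary word occurs
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a list of counts with one position per vocabulary word, so documents of different lengths end up with vectors of the same length.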

<b> Naïve Bayes </b>
A Naive Bayes classifier is a probabilistic machine learning model that's used for classification tasks. It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature: the features are treated as independent given the class. That assumption is why it is called "naive".

<b> how they work together </b>
Bayes' theorem can be used to calculate the probability that a snippet/line of text belongs to a certain category, given the words it contains (the bag-of-words features). How often each word occurs in the texts of each category then determines that probability.
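To make the combination concrete, here is a hand-rolled sketch of a multinomial Naive Bayes calculation on word counts. The training snippets, the labels `pos`/`neg` and the function name `posterior` are all made up for illustration, and add-one (Laplace) smoothing is an assumption of this sketch; it is not a replacement for scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter

# Made-up training snippets with a category label (assumption for illustration)
train = [
    ("love this dress", "pos"),
    ("love the fit", "pos"),
    ("too small and cheap", "neg"),
]

# Word counts per category and the number of snippets per category
word_counts = {"pos": Counter(), "neg": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocabulary = {word for counts in word_counts.values() for word in counts}


def posterior(text):
    """Return P(category | text) under the naive independence assumption."""
    log_scores = {}
    for label in word_counts:
        # Log prior: share of training snippets in this category
        log_score = math.log(doc_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities for unseen word/category pairs
            p_word = (word_counts[label][word] + 1) / (total + len(vocabulary))
            log_score += math.log(p_word)
        log_scores[label] = log_score
    # Normalise so the probabilities over all categories sum to 1
    norm = sum(math.exp(score) for score in log_scores.values())
    return {label: math.exp(score) / norm for label, score in log_scores.items()}


probs = posterior("love this fit")
print(probs)
```

Because "love", "this" and "fit" only occur in the `pos` snippets, the `pos` category gets the higher probability for this text.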

In [11]:
import pandas as pd

In [28]:
reviews = pd.read_csv('W_clothing_reviews.csv')
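reviews.dropna() #Note: dropna() returns a new DataFrame; the result is not stored here, so no rows are actually removed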
reviews.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


We need to filter out all the non-dress reviews, as we are only interested in the ratings of dresses.

In [30]:
dresses = reviews['Department Name']=='Dresses'
dress_reviews = reviews[dresses]
dress_reviews.dropna() #Note: the result is not stored, so dress_reviews still contains rows with missing values
dress_reviews.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses
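The filtering step uses a boolean mask, and the `dropna()` calls above do not change the dataframe because their result is not assigned. A small sketch on a made-up frame (the `toy` data and all variable names below are hypothetical) shows both behaviours. Whether rows with missing values should really be dropped is a separate choice: assigning the result in the notebook itself would change the numbers reported further down.

```python
import pandas as pd

# Tiny made-up frame standing in for the reviews data (values are illustrative)
toy = pd.DataFrame({
    "Department Name": ["Dresses", "Tops", "Dresses"],
    "Review Text": ["Love it", "Nice shirt", None],
})

# Boolean mask: True for rows whose department is 'Dresses'
is_dress = toy["Department Name"] == "Dresses"
toy_dresses = toy[is_dress]

# dropna() returns a new DataFrame; without assignment the original is unchanged
toy_dresses.dropna()
rows_without_assignment = len(toy_dresses)

# Assigning the result keeps only the rows that have a review text
toy_clean = toy_dresses.dropna(subset=["Review Text"])
rows_with_assignment = len(toy_clean)

print(rows_without_assignment, rows_with_assignment)
```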


To be able to use the text, we need to import the CountVectorizer object from sklearn, which creates a dictionary (vocabulary) from a series of texts. By default it lowercases the text and tokenizes it, using whitespace and punctuation as separators between words.
Stopwords are not informative enough to be valuable, so I will leave them out. To get the text into a standard format we convert it to Unicode strings with .values.astype('U').
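The preprocessing described above can be imitated in a few lines of plain Python. This is a rough sketch: the regular expression mirrors CountVectorizer's default `token_pattern` (words of two or more characters), but `STOP_WORDS` is a tiny hand-picked list chosen for this example, while scikit-learn's built-in English list is much longer.

```python
import re

# Pattern mirroring CountVectorizer's default token_pattern: words of 2+ characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

# A tiny hand-picked stop-word list (assumption for illustration only)
STOP_WORDS = {"this", "it", "is", "so", "the", "and"}


def tokenize(text):
    """Lowercase, split on whitespace/punctuation, and drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


tokens = tokenize("Love this dress! It's sooo pretty.")
print(tokens)
```

Note that single characters such as the "s" in "it's" are dropped by the pattern, and that informal spellings like "sooo" stay in the vocabulary as separate words.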

In [31]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = dress_reviews['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the review text
feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 8080 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']


Now that we have the dictionary of 8080 words (the vocabulary), we can count how many times each word occurs in each review. This way we create a document-feature matrix: documents (reviews) are the rows, and features (words) are the columns.

In [32]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat[0:50,0:50]) #Let's print a little part of the matrix: the first 50 words & documents

  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (26, 40)	1
  (35, 12)	2
  (39, 31)	1


The printed result shows no zeroes. Most of the matrix actually consists of zeroes, but these are left out to save memory. Instead, only the positions of the non-zero cells are listed, together with their values. This is called a sparse matrix. To convert it to a regular (dense) matrix we could use .toarray().
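To see why this saves memory, the printed cells can be put in a plain dictionary and expanded into a full 50 x 50 grid. The dictionary is only a stand-in chosen for this illustration; it is not how SciPy stores sparse matrices internally.

```python
# Non-zero cells as printed above: (row, column) -> count
nonzero = {(2, 8): 1, (20, 38): 1, (21, 4): 1, (21, 45): 1,
           (22, 12): 1, (26, 40): 1, (35, 12): 2, (39, 31): 1}

n_rows, n_cols = 50, 50

# Dense version: every cell is stored, including all the zeros
dense = [[0] * n_cols for _ in range(n_rows)]
for (row, col), count in nonzero.items():
    dense[row][col] = count

stored_sparse = len(nonzero)    # number of values kept in the sparse form
stored_dense = n_rows * n_cols  # number of values kept in the dense form
print(stored_sparse, stored_dense)
```

For this 50 x 50 slice the sparse form keeps 8 values where the dense form keeps 2500, and the gap grows with the full vocabulary of 8080 words.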

# Start building the model

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

In [34]:
nb = MultinomialNB() #create the model
X = docu_feat #the document-feature matrix is the X matrix
y = dress_reviews['Rating'] #creating the y vector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) #split the data and store it

nb = nb.fit(X_train, y_train) #fit the model: X=features (word counts), y=target (rating)

# Evaluate the model

In [35]:
#Evaluate the model
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.5870253164556962

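The accuracy is 58.7%. With five categories that is well above the 20% you would expect from random guessing, but accuracy alone can be misleading when some ratings are much more common than others.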

In [36]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00        59
           2       0.35      0.05      0.08       151
           3       0.34      0.26      0.29       246
           4       0.35      0.28      0.31       426
           5       0.68      0.91      0.78      1014

    accuracy                           0.59      1896
   macro avg       0.35      0.30      0.29      1896
weighted avg       0.52      0.59      0.53      1896



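Precision and recall for ratings 1 to 4 are clearly lower than for the 5-star rating (precision 0.68, recall 0.91). The model never predicts a 1-star review correctly (precision and recall 0.00), and recall for 2-star reviews is only 0.05. The test set is also imbalanced: 1014 of the 1896 reviews have 5 stars, and the model appears to lean heavily towards predicting 5 stars.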
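The averages at the bottom of the report, and a simple baseline, can be recomputed from the per-rating numbers. The sketch below copies the rounded recall and support values from the report, so results can differ slightly from the report in the last digit. The "majority baseline" is an extra comparison added here: it is the accuracy you would get by always predicting the most common rating.

```python
# Per-rating recall and support, copied from the classification report above
recall = {1: 0.00, 2: 0.05, 3: 0.26, 4: 0.28, 5: 0.91}
support = {1: 59, 2: 151, 3: 246, 4: 426, 5: 1014}

total = sum(support.values())

# Macro average: every rating counts equally
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each rating counts in proportion to its support
weighted_recall = sum(recall[r] * support[r] for r in recall) / total

# Baseline: accuracy of always predicting the most common rating
majority_baseline = max(support.values()) / total

print(f"macro recall: {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

The weighted recall equals the overall accuracy (0.59), while the majority baseline of about 0.535 shows that the model is only a few percentage points better than always answering "5 stars".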

In [46]:
df = dress_reviews[['Rating','Review Text']]
df.head()

Unnamed: 0,Rating,Review Text
1,5,Love this dress! it's sooo pretty. i happene...
2,3,I had such high hopes for this dress and reall...
5,2,"I love tracy reese dresses, but this one is no..."
8,5,I love this dress. i usually get an xs but it ...
9,5,"I'm 5""5' and 125 lbs. i ordered the s petite t..."


In [47]:
print(df.iloc[0,1])
print(nb.predict_proba(X[0]))

Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
[[3.60622795e-13 1.34163893e-08 3.11453866e-05 3.15131942e-01
  6.84836900e-01]]


In [48]:
for i in range(10):
    prob = nb.predict_proba(X[i])
    print(f"Review text: {i}. {df.iloc[i,1]}")
    print(f"1 star: {prob[0,0]}, 2 star: {prob[0,1]}, 3 star: {prob[0,2]}, 4 star: {prob[0,3]}, 5 star: {prob[0,4]}")

Review text: 0. Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
1 star: 3.606227948943918e-13, 2 star: 1.3416389267827808e-08, 3 star: 3.114538659714562e-05, 4 star: 0.3151319416911108, 5 star: 0.6848368995055337
Review text: 1. I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
1 star: 1.3827

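Some of the printed probabilities look larger than one (for example a value starting with 1.3827), although no error is raised. The most likely explanation is scientific notation: a value such as 3.6e-13 means 3.6 × 10^-13, which is very close to zero. The probabilities that predict_proba returns for one review always sum to 1.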
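A quick way to check how the printed numbers should be read is to convert them back to floats and print them in fixed-point notation. The values below are copied from the output for review 0; the variable names are just for this sketch.

```python
# Probabilities printed for review 0 above, copied as text
printed = ["3.606227948943918e-13", "1.3416389267827808e-08",
           "3.114538659714562e-05", "0.3151319416911108",
           "0.6848368995055337"]

probs = [float(p) for p in printed]

# 'e-13' means 'times 10 to the power -13', so the first value is tiny
for stars, p in enumerate(probs, start=1):
    print(f"{stars} star: {p:.6f}")

total = sum(probs)
print(f"sum: {total:.6f}")
```

Formatting with `:.6f` (or rounding the values) in the loop that prints the ten reviews would make the output easier to compare at a glance.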

For review (1): the probability for a 2-star review is the highest, which sounds fairly accurate as most of the text is negative.

For review (7): the highest probability is for a 2-star review, which doesn't sound accurate. This is probably because she explains that she doesn't trust other reviews and then continues that she actually thinks the dress is beautiful. So the doubt is probably causing the inaccuracy.

For review (9): the probability of a low star rating is very high, but this is not in line with the actual review. She is actually very positive, so the predicted rating should be higher. Not sure what caused this.
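One possible contributing factor (an assumption, not something verified on these reviews) is that a bag-of-words model ignores word order, so negation and contrast are lost. Two made-up sentences with opposite meaning can end up with exactly the same word counts:

```python
from collections import Counter

# Two made-up sentences with opposite meaning (not taken from the dataset)
positive = "not bad at all really good"
negative = "not good at all really bad"

bag_positive = Counter(positive.split())
bag_negative = Counter(negative.split())

# Both sentences contain the same words the same number of times
same_bag = bag_positive == bag_negative
print(same_bag)
```

A model that only sees these counts cannot tell the two sentences apart, which may explain some of the surprising predictions for reviews that mix positive and negative wording.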
