### Machine Learning: Bag-of-Words Model
___
#### Summary:

The Bag-of-Words model transforms a text into a "bag of words": a vector with one entry per vocabulary word, where each entry records either how often the word occurs in the text or simply whether it occurs. The example in this summary uses presence; the code later in this notebook uses scikit-learn's CountVectorizer with its default settings, which records counts. For example, given a dataset consisting of the texts "Henry jumped down the well" and "The bus was late again", the vocabulary would be: Henry, jumped, down, the, well, bus, was, late, again. The text "Henry jumped down the well" is then represented by [1, 1, 1, 1, 1, 0, 0, 0, 0], and a new text like "Henry jumped the bus again" by [1, 1, 0, 1, 0, 1, 0, 0, 1].

In this notebook we also preprocess the text before applying the bag-of-words model. Which preprocessing techniques are appropriate depends on the kind of text; they include stemming, lowercasing, and removing non-alphabetical symbols. Note that the Bag-of-Words model disregards word order and position, so it can be expected to perform poorly on tasks where meaning depends on how the words are arranged.
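
The presence-based example above can be reproduced with a few lines of plain Python. This is a minimal sketch, not the implementation used later in the notebook: the helper names `build_vocabulary` and `to_presence_vector` are made up for illustration, and the vocabulary is kept in order of first appearance so that it matches the word order listed in the summary.

```python
def build_vocabulary(texts):
    """Collect lowercased words in order of first appearance."""
    vocabulary = []
    for text in texts:
        for word in text.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def to_presence_vector(text, vocabulary):
    """Return 1 for each vocabulary word present in the text, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


texts = ["Henry jumped down the well", "The bus was late again"]
vocabulary = build_vocabulary(texts)
print(vocabulary)
print(to_presence_vector("Henry jumped down the well", vocabulary))
# [1, 1, 1, 1, 1, 0, 0, 0, 0]
print(to_presence_vector("Henry jumped the bus again", vocabulary))
# [1, 1, 0, 1, 0, 1, 0, 0, 1]
```

Words that are not in the vocabulary are silently ignored, which is also what happens to unseen words when a fitted vectorizer transforms new text.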
___
#### This notebook will include:
1. Preprocessing restaurant reviews
2. Preprocessing emails
3. Bag-of-Words model
4. Classifying the reviews with SVM, Naive Bayes, Random Forest, and a neural network
___
#### Reference:

Much of what is in this notebook was learned from the Machine Learning Coursera course by Andrew Ng and the Udemy course "Machine Learning: A-Z" by Kirill Eremenko.

In [62]:
# Preprocessing Restaurant Reviews
"""
When choosing how to preprocess any text you should consider what parts of the text contain information 
relevant to your task. For classifying reviews as positive or negative we apply the following 
preprocessing techniques: remove non-alphabetical symbols, apply lowercasing, stem words, and remove 
stop words (very common words such as "the" and "a" that carry little meaning on their own).

"""
# Function that preprocesses a string of text that is in the form of a review
def preprocess_review(review):
    # Importing the libraries
    import re
    # One-time setup if the stop word list is missing: import nltk; nltk.download('stopwords')
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer

    # Preprocessing the text
    review = re.sub('[^a-zA-Z]', ' ', review)      # replace non-letters with spaces
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))   # build the set once, not once per word
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    
    return review

# Applying the text preprocessor to an example review
review = 'The food was amazing but a little overpriced.'
preprocessed_review = preprocess_review(review)

# Printing the original and preprocessed review
print('Original:\n', review)
print('Preprocessed:\n', preprocessed_review)

Original:
 The food was amazing but a little overpriced.
Preprocessed:
 food amaz littl overpr
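
To see what each preprocessing step contributes, it can help to print the intermediate results. The sketch below is dependency-free and therefore makes two assumptions: stemming is left out, and `STOP_WORDS` is a small hypothetical stand-in for NLTK's English stop word list. The function name `preprocess_steps` is also made up for illustration.

```python
import re

# Hypothetical stand-in for NLTK's English stop word list (a small subset).
STOP_WORDS = {"the", "was", "but", "a", "is", "not"}


def preprocess_steps(review):
    """Return the intermediate result of each step (stemming is omitted)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", review)
    lowercased = letters_only.lower()
    tokens = lowercased.split()
    kept = [word for word in tokens if word not in STOP_WORDS]
    return letters_only, lowercased, tokens, kept


for step in preprocess_steps("The food was amazing but a little overpriced."):
    print(step)

# Removing "not" as a stop word also removes the negation.
negated = preprocess_steps("Crust is not good.")[-1]
print(negated)  # ['crust', 'good']
```

The second call shows a side effect worth keeping in mind for sentiment tasks: NLTK's English list, like the stand-in here, contains "not", so "Crust is not good." is reduced to the same words as a positive review would be.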


In [63]:
# Preprocessing emails
"""
When preprocessing emails you have to consider a few more things. Often emails contain email addresses 
and website URLs. These could be good indicators of whether the email is spam, so they
should be included in the bag-of-words model. For preprocessing emails we apply the same techniques used
on the restaurant reviews but we add the following new ones: remove HTML formatting, replace numbers
with the word "number", replace URLs with "httpaddr", replace email addresses with the word "emailaddr",
and replace "$" with the word "dollar".
"""
# Function that preprocesses a string of text that is in the form of an email
def preprocess_email(email):
    # Importing the libraries
    import re
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer

    # Preprocessing the text
    email = email.lower()
    email = re.sub(r'<[^<>]+>', ' ', email)                       # strip HTML tags
    email = re.sub(r'[0-9]+', 'number', email)                    # digits -> "number"
    email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)   # URLs -> "httpaddr"
    email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)          # addresses -> "emailaddr"
    email = re.sub(r'[$]+', 'dollar', email)                      # "$" -> "dollar"
    email = re.sub(r'[^a-zA-Z]', ' ', email)                      # drop remaining symbols
    email = email.split()
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))  # build the set once, not once per word
    email = [ps.stem(word) for word in email if word not in stop_words]
    email = ' '.join(email)
    
    return email

# Applying the text preprocessor to an example email
email = 'Get bitcoin for <p><b>free NOW</b></p>! Just subscribe to us at http://www.notascam.com and \
email us your private key at freebitcoin@notascam.com. This is a limited time offer so act fast!'
preprocessed_email = preprocess_email(email)

# Printing the original and preprocessed email
print('Original:\n', email)
print('Preprocessed:\n', preprocessed_email)

Original:
 Get bitcoin for <p><b>free NOW</b></p>! Just subscribe to us at http://www.notascam.com and email us your private key at freebitcoin@notascam.com. This is a limited time offer so act fast!
Preprocessed:
 get bitcoin free subscrib us httpaddr email us privat key emailaddr limit time offer act fast
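
The substitutions are applied in a fixed order, and each one sees the output of the previous one. The sketch below repeats only the regular-expression steps (no stop word removal or stemming, so it needs nothing beyond the standard library) on a made-up sample that contains a dollar amount, a URL with a digit, and an address with a digit. The function name `normalize_email_tokens` and the sample text are assumptions for illustration.

```python
import re


def normalize_email_tokens(text):
    """Apply the substitutions in the same order as preprocess_email."""
    text = text.lower()
    text = re.sub(r"<[^<>]+>", " ", text)               # strip HTML tags
    text = re.sub(r"[0-9]+", "number", text)            # digits -> "number"
    text = re.sub(r"(http|https)://[^\s]*", "httpaddr", text)
    text = re.sub(r"[^\s]+@[^\s]+", "emailaddr", text)
    text = re.sub(r"[$]+", "dollar", text)
    text = re.sub(r"[^a-zA-Z]", " ", text)              # drop leftover symbols
    return text.split()


sample = "<b>Win $100</b> at https://example.com/win2 or mail bob7@example.com"
print(normalize_email_tokens(sample))
# ['win', 'dollarnumber', 'at', 'httpaddr', 'or', 'mail', 'emailaddr']
```

Because the replacements insert no surrounding spaces, "$100" collapses into the single token "dollarnumber". The digits inside the URL and the address are rewritten first, but the later URL and address patterns still match the whole string, so both end up as their placeholder tokens.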


In [45]:
# Restaurant Review Dataset
"""
The dataset that will be used was obtained from the Udemy "Machine Learning: A-Z" course. It contains
2 columns: the review text and a label that is 1 if the review is positive and 0 if it is
negative.
"""
# Importing the libraries
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Datasets/RestaurantReviews/Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

# Printing the dataset to see what it looks like
print('\ndataset:\n', dataset)


dataset:
                                                 Review  Liked
0                             Wow... Loved this place.      1
1                                   Crust is not good.      0
2            Not tasty and the texture was just nasty.      0
3    Stopped by during the late May bank holiday of...      1
4    The selection on the menu was great and so wer...      1
5       Now I am getting angry and I want my damn pho.      0
6                Honeslty it didn't taste THAT fresh.)      0
7    The potatoes were like rubber and you could te...      0
8                            The fries were great too.      1
9                                       A great touch.      1
10                            Service was very prompt.      1
11                                  Would not go back.      0
12   The cashier had no care what so ever on what I...      0
13   I tried the Cape Cod ravoli, chicken, with cra...      1
14   I was disgusted because I was pretty sure that...     

In [64]:
# Bag-of-words model
"""
In this section we apply the preprocessing function to every review in the dataset and then transform
them using the bag-of-words model. This will give us a new representation of the data that can be
used as inputs into classifiers.
"""
# Importing the libraries
import numpy as np

# Preprocessing the entire dataset
corpus = [preprocess_review(review) for review in dataset['Review']]

# Obtaining the Bag-of-Words representation of the dataset
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

# Splitting the Bag-of-Words representation of the dataset into a training set and test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
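
What the vectorizer does with `max_features` can be approximated in plain Python: count every term across the corpus, keep the most frequent ones, and emit one row of counts per document. This is only a rough sketch under stated assumptions. The functions `fit_vocabulary` and `transform` are hypothetical names, tokens are split on whitespace, and ties between equally frequent terms are broken alphabetically, which may differ from what CountVectorizer does.

```python
from collections import Counter


def fit_vocabulary(corpus, max_features=None):
    """Keep the most frequent terms and index them alphabetically."""
    totals = Counter(word for document in corpus for word in document.split())
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(totals, key=lambda word: (-totals[word], word))
    kept = ranked[:max_features]
    return {word: index for index, word in enumerate(sorted(kept))}


def transform(corpus, vocabulary):
    """Return one row of term counts per document."""
    rows = []
    for document in corpus:
        counts = Counter(document.split())
        rows.append([counts[word] for word in vocabulary])
    return rows


# Made-up preprocessed reviews, not rows from the dataset.
corpus = ["food amaz littl overpr", "food good servic good", "servic slow"]
vocabulary = fit_vocabulary(corpus, max_features=3)
print(vocabulary)                     # {'food': 0, 'good': 1, 'servic': 2}
print(transform(corpus, vocabulary))  # [[1, 0, 0], [1, 2, 1], [0, 0, 1]]
```

The second row contains a 2 because "good" occurs twice, so the representation holds counts rather than presence flags. CountVectorizer accepts `binary=True` if 0/1 presence values are wanted instead.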

In [65]:
# Applying SVM
"""
Applying SVM to the bag-of-words representation of the dataset.
"""
# Creating and fitting the SVM model to the training set
from sklearn.svm import SVC
classifier = SVC(C = 1, kernel = 'rbf')
classifier.fit(X_train, y_train)

# Printing the accuracy on the training set
print('training accuracy', classifier.score(X_train, y_train))

# Printing the accuracy on the test set
print('test accuracy', classifier.score(X_test, y_test))

training accuracy 0.5025
test accuracy 0.49
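
For scikit-learn classifiers, `score` reports mean accuracy, and a value close to 0.5 on a two-class problem is easiest to interpret next to a trivial baseline. The sketch below computes accuracy by hand and compares it with always predicting the most common training label. The helper names and the label lists are hypothetical and are not taken from the restaurant review dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)


def majority_baseline(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [most_common_label] * len(y_test))


# Hypothetical labels, not taken from the restaurant review dataset.
y_train = [1, 0, 1, 1, 0, 1]
y_test = [1, 0, 0, 1]
print(accuracy(y_test, [1, 1, 0, 1]))      # 0.75 (3 of 4 correct)
print(majority_baseline(y_train, y_test))  # 0.5 (always predicts 1)
```

A classifier whose test accuracy is no better than this baseline has learned little beyond the class balance of the training set.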


In [66]:
# Applying Naive Bayes
"""
Applying Naive Bayes to the bag-of-words representation of the dataset.
"""
# Creating and fitting the Naive Bayes model to the training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Printing the accuracy on the training set
print('training accuracy', classifier.score(X_train, y_train))

# Printing the accuracy on the test set
print('test accuracy', classifier.score(X_test, y_test))

training accuracy 0.92625
test accuracy 0.685


In [67]:
# Random Forest classifier
"""
Applying the Random Forest model to the bag-of-words representation of the dataset.
"""
# Creating and fitting the Random Forest model to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 20)
classifier.fit(X_train, y_train)

# Printing the accuracy on the training set
print('training accuracy', classifier.score(X_train, y_train))

# Printing the accuracy on the test set
print('test accuracy', classifier.score(X_test, y_test))

training accuracy 0.9925
test accuracy 0.705


In [68]:
# 3-layer Neural Network
"""
Applying a 3-layer neural network to the bag-of-words representation of the dataset.
"""
# Importing the libraries
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense

# Creating and fitting the 3-layer neural network to the training set
classifier = Sequential()
classifier.add(Dense(units = 32, kernel_initializer = 'uniform', input_dim = X_train.shape[1], activation = 'relu'))  # input size follows the vocabulary size
classifier.add(Dense(units = 32, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

classifier.fit(X_train, y_train, batch_size = 64, epochs = 100)

# Printing the accuracy on the training set
print('training accuracy', classifier.evaluate(X_train, y_train)[1])

# Printing the accuracy on the test set
print('test accuracy', classifier.evaluate(X_test, y_test)[1])

Epoch 1/100
...
Epoch 78