__We will perform sentiment analysis using naive Bayes to predict whether customer feedback is positive or negative. I retrieved the data from the UCI Machine Learning Repository.__
-  We will start by separating out the most common positive and negative words.
-  We will then test various models using various features to see if we can minimize our error.
-  We will evaluate the model using a confusion matrix, a train/test split, and cross-validation.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns

We will start by defining and loading in the data, and labeling the columns.

In [2]:
# Grab and process the raw data
data = "amazon_cells_labelled.txt"
amazon_raw = pd.read_csv(data, delimiter= '\t', header=None)
amazon_raw.columns = ['message', 'feedback']
amazon_raw.head()

Unnamed: 0,message,feedback
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1


Below we will separate the positive feedback from the negative feedback, convert it all to lowercase, and print out the most common words for each.

In [3]:
# Creating new dataframes for positive and negative feedback.
positive_raw = amazon_raw[amazon_raw['feedback'] == 1]
negative_raw = amazon_raw[amazon_raw['feedback'] == 0]

# Creating a list of the individual words in the strings.
positive_words = []
negative_words = []
for x in positive_raw['message']:
    split = x.split()
    positive_words = positive_words + split

for x in negative_raw['message']:
    split = x.split()
    negative_words = negative_words + split

# Using sets to find the words that appear in both lists, then removing them from each list.
intersect = list(set(positive_words) & set(negative_words))

for x in intersect:
    while x in positive_words:
        positive_words.remove(x)
    while x in negative_words:
        negative_words.remove(x)
        
# Converting the words to lowercase
positive_words = [x.lower() for x in positive_words]
negative_words = [x.lower() for x in negative_words]

# Getting a count of the words in each list, and printing the top 10 most common in each.
negative_counts = Counter(negative_words)
positive_counts = Counter(positive_words)    
print('Most common negative words:', negative_counts.most_common(10))
print('Most common positive words:', positive_counts.most_common(10))

Most common negative words: [('waste', 13), ('worst', 13), ('bad', 11), ("don't", 11), ('not', 9), ('poor', 9), ('do', 8), ('what', 7), ('money', 7), ("didn't", 7)]
Most common positive words: [('works', 43), ('great', 30), ('love', 20), ('best', 19), ('nice', 19), ('good', 13), ('working', 9), ('pretty', 8), ('excellent', 8), ('fine', 6)]
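
One detail worth noting about the cell above: the shared words are removed before the lists are lowercased, so "Great" and "great" count as different tokens during the set intersection. The following is a minimal sketch of the alternative ordering (lowercase first, then filter), run on made-up toy sentences rather than the Amazon data. The helper name `distinctive_counts` is hypothetical, and this ordering would likely produce different top-10 lists than the ones printed above.

```python
from collections import Counter

def distinctive_counts(positive_msgs, negative_msgs):
    """Count words that appear in only one class, lowercasing first."""
    pos_words = [w.lower() for msg in positive_msgs for w in msg.split()]
    neg_words = [w.lower() for msg in negative_msgs for w in msg.split()]
    # Words seen in both classes carry little signal, so drop them.
    shared = set(pos_words) & set(neg_words)
    pos_counts = Counter(w for w in pos_words if w not in shared)
    neg_counts = Counter(w for w in neg_words if w not in shared)
    return pos_counts, neg_counts

# Toy data for illustration only (not from the dataset).
pos, neg = distinctive_counts(
    ["Great phone", "works great"],
    ["Not great", "waste of money"],
)
print(pos.most_common(3))
print(neg.most_common(3))
```

Filtering against a set also avoids the repeated `list.remove` calls, which rescan the list each time.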


__We will start by creating four different sets of keywords.__
-  The first set will contain the top 10 most common negative words.
-  The second set will contain the top 10 most common positive words.
-  The third set will contain the top 5 positive and the top 5 negative words.
-  The fourth set will contain the top 10 positive and the top 10 negative words.

In [4]:
keywords = ['waste', 'worst', 'bad', 'don\'t', 'not', 'poor', 'what', 'do', 'money', 'didn\'t']
keywords2 = ['works', 'great', 'love', 'best', 'nice', 'good', 'working', 'pretty', 'excellent', 'fine']
keywords3 = ['works', 'great', 'love', 'best', 'nice', 'waste', 'worst', 'bad', 'don\'t', 'not']
keywords4 = ['works', 'great', 'love', 'best', 'nice', 'good', 'working', 'pretty', 'excellent', 'fine', 'waste', 'worst', 'bad', 'don\'t', 'not', 'poor', 'what', 'do', 'money', 'didn\'t']

# keywords4 is the union of the other three lists, so a single pass
# creates every feature column we need.
for key in keywords4:
    # Note that we add spaces around the key so that we're matching the word,
    # not just a substring of a longer word.
    amazon_raw[str(key)] = amazon_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )
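
Padding the keyword with spaces has a side effect: a keyword at the very start or end of a message, or next to punctuation, is not matched. The sketch below compares the space-padded check with a whole-word regular expression. Both helper names are hypothetical, and the padded version assumes the keyword is already lowercase. Switching the features to this kind of match would change the error counts reported below, so treat it as an option to explore rather than what the notebook does.

```python
import re

def contains_word(message, word):
    """Return True if `word` occurs as a whole word in `message` (case-insensitive)."""
    # Lookarounds require that no word character touches either side of the keyword.
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, message, flags=re.IGNORECASE) is not None

def contains_padded(message, word):
    """Space-padded substring match, similar to the cell above."""
    return (" " + word + " ") in message.lower()

msg = "Great for the jawbone."
print(contains_padded(msg, "great"))  # no space precedes "Great", so this misses it
print(contains_word(msg, "great"))
print(contains_word(msg, "eat"))      # "eat" inside "Great" is not a whole word
```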


We will create our first model using the first set of keywords which contains the top 10 negative words.

In [5]:
data = amazon_raw[keywords]
target = amazon_raw['feedback']

# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 386


Using just the top 10 negative words, we were able to correctly classify 614 points out of 1000. Let's try it using just the positive words now.

In [6]:
data2 = amazon_raw[keywords2]

bnb.fit(data2, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data2)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data2.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 364


We were able to lower the number of mislabeled points even further using just the positive words: we correctly classified 636 out of 1000 messages. Below we will run the same test again with the top 5 positive and top 5 negative keywords.

In [7]:
data3 = amazon_raw[keywords3]

bnb.fit(data3, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data3)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data3.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 402


Interestingly, using a mix of the two resulted in only 598 points being classified correctly. Let's explore further by using the top 10 positive and top 10 negative words together.

In [8]:
data4 = amazon_raw[keywords4]

bnb.fit(data4, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data4)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data4.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 364


Using the top 10 positive and top 10 negative words produced the same number of mislabeled points (364) as using just the top 10 positive words. Let's try using CountVectorizer to see what kind of results we can get.
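
To compare the four keyword models on one scale, the mislabeled counts printed above can be converted to accuracy. This is plain arithmetic on the reported numbers; the dictionary labels are mine. For context, the dataset has 500 messages in each class, so always guessing a single label would score 50%.

```python
# Mislabeled counts reported by the four keyword models above (out of 1000).
mislabeled = {
    "top 10 negative": 386,
    "top 10 positive": 364,
    "top 5 of each": 402,
    "top 10 of each": 364,
}
total = 1000

# Accuracy is the share of points that were not mislabeled.
accuracy = {name: (total - errors) / total for name, errors in mislabeled.items()}
for name, acc in accuracy.items():
    print(f"{name}: {acc:.1%}")
```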

We will now build our model, fit it to the data, and display the result. CountVectorizer's fit_transform tokenizes the text in the message column and turns each message into a row of numeric values: one column per word in the vocabulary, holding the number of times that word occurs in the message. I will then display the number of mislabeled points.
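
To make the tokenize-and-count step concrete, here is a rough pure-Python sketch of what a bag-of-words count matrix looks like. It is a simplified stand-in, not scikit-learn's implementation: it lowercases the text and keeps tokens of two or more word characters (which mirrors CountVectorizer's defaults), but it returns a dense list of lists where CountVectorizer returns a sparse matrix. The function name `count_matrix` is hypothetical.

```python
import re
from collections import Counter

def count_matrix(messages):
    """Build a sorted vocabulary and a dense document-term count matrix."""
    # Lowercase, then keep tokens made of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", m.lower()) for m in messages]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    rows = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for tok, n in Counter(doc).items():
            row[index[tok]] = n
        rows.append(row)
    return vocabulary, rows

vocab, matrix = count_matrix(["The mic is great.", "Great case, great value."])
print(vocab)   # one column per distinct word
print(matrix)  # one row per message
```

Each row can then be fed to a classifier in place of the hand-picked keyword columns.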


In [9]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(amazon_raw['message'].values)

classifier = BernoulliNB()
targets = amazon_raw['feedback'].values
classifier.fit(counts,targets)
# Classify, storing the result in a new variable.
y_pred = classifier.predict(counts)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    counts.shape[0],
    (targets != y_pred).sum()
))


Number of mislabeled points out of a total 1000 points : 41


I will examine the model using a confusion matrix, a train/test split, and k-fold cross-validation.

In [10]:
# Display a confusion matrix to visualize our prediction results.
print('Confusion Matrix:', confusion_matrix(targets, y_pred))

# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(counts, targets, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(classifier.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(classifier.fit(counts, targets).score(counts, targets)))

# Use K Cross Validation to further examine how valid our model is.
bnbcross = cross_val_score(classifier, counts, targets, cv=10)
print('K cross validation score:', bnbcross.mean() )

Confusion Matrix: [[468  32]
 [  9 491]]
With 20% Holdout: 0.84
Testing on Sample: 0.959
K cross validation score: 0.8160000000000001
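
The cross-validation score above is the mean accuracy over 10 folds: the data is cut into 10 parts, and each part takes one turn as the test set while the model is trained on the other nine. The sketch below shows only the index bookkeeping, with a hypothetical helper `k_fold_indices`. It uses plain contiguous folds for simplicity, whereas `cross_val_score` with a classifier and an integer `cv` uses stratified folds that preserve the class balance.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, non-overlapping folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(1000, 10))
# Number of folds, training size, and test size for the first fold.
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```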


The Bernoulli classifier earns a B. It identified 468 out of 500 negatives, giving a specificity of 94%, and 491 out of 500 positives, giving a sensitivity of 98%. It had a 96% accuracy rate on the entire sample, but only an 84% accuracy rate on the test set with a 20% holdout. With 10-fold cross-validation, the mean accuracy is nearly 82%. The gap between the in-sample and held-out scores indicates some overfitting.
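
The rates above can be read straight off the confusion matrix. scikit-learn lays the matrix out with true labels as rows and predicted labels as columns, so for labels 0 and 1 it is `[[TN, FP], [FN, TP]]`. Specificity is the share of true negatives that were identified, and sensitivity is the share of true positives that were identified. A small sketch, with the hypothetical helper `binary_metrics`:

```python
def binary_metrics(matrix):
    """Compute rates from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    return {
        "specificity": tn / (tn + fp),   # negatives correctly identified
        "sensitivity": tp / (tp + fn),   # positives correctly identified
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
    }

# Confusion matrix printed for the Bernoulli classifier above.
bernoulli = binary_metrics([[468, 32], [9, 491]])
for name, value in bernoulli.items():
    print(f"{name}: {value:.3f}")
```

The accuracy computed this way matches the "Testing on Sample" score of 0.959.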

I will repeat the process above this time using a multinomial classifier to see how the results differ.
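
The main difference in what the two classifiers see: the multinomial model uses the word counts as they are, while the Bernoulli model only uses whether a word is present. BernoulliNB's default `binarize=0.0` thresholds each feature, mapping any count above zero to 1. The sketch below imitates that thresholding on toy counts; the `binarize` function here is a stand-in written for illustration, not the library's code.

```python
def binarize(count_rows, threshold=0.0):
    """Map each count to 1 if it exceeds the threshold, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in count_rows]

# Toy word counts for two messages (illustration only).
counts_example = [[0, 2, 1], [3, 0, 0]]
binary_example = binarize(counts_example)
print(binary_example)
```

Since most words appear at most once in a short review, the two views of the data will often be close, which may help explain why the results are similar.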

In [11]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(amazon_raw['message'].values)

mnb = MultinomialNB()
targets = amazon_raw['feedback'].values
mnb.fit(counts,targets)
# Classify, storing the result in a new variable.
y_pred = mnb.predict(counts)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    counts.shape[0],
    (targets != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 35


In [12]:
# Display a confusion matrix to visualize our prediction results.
print('Confusion Matrix:', confusion_matrix(targets, y_pred))

# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(counts, targets, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(mnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(mnb.fit(counts, targets).score(counts, targets)))

# Use K Cross Validation to further examine how valid our model is.
mnbcross = cross_val_score(mnb, counts, targets, cv=10)
print('K cross validation score:', mnbcross.mean())

Confusion Matrix: [[476  24]
 [ 11 489]]
With 20% Holdout: 0.825
Testing on Sample: 0.965
K cross validation score: 0.8169999999999998


Overall, the multinomial classifier appears to be slightly superior on the full sample, correctly identifying more negatives (476 versus 468). This model has a specificity of 95% and a sensitivity of 98%. The holdout score (82.5%) and the cross-validation score (about 82%) are quite similar to each other, and both are well below the 96.5% in-sample accuracy, so there still appears to be some overfitting.