# Sentiment Analyzer

Understand how positive or negative a text is. This is useful for understanding reviews, tweets, and comments.

The data:<br/>
[link to Multi-Domain Sentiment Dataset](http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html)

## About the data
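The notes in this section are quoted from the dataset authors' page, so "our" and "me" refer to them. This sentiment dataset was used in our paper: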

John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association of Computational Linguistics (ACL), 2007. [[PDF](http://www.cs.jhu.edu/~mdredze/publications/sentiment_acl07.pdf)]


If you use this data for your research or a publication, please cite the above paper as the reference for the data. Also, please drop me a line so I know that you found the data useful.


The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from 4 product types (domains): Kitchen, Books, DVDs, and Electronics. Each domain has several thousand reviews, but the exact number varies by domain. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed. This page contains some descriptions about the data. If you have questions, please email me directly ([email found here](http://www.cs.jhu.edu/~mdredze/)).
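The conversion from star ratings to binary labels mentioned above could look something like the sketch below. The cut-off used here (above 3 stars is positive, below 3 is negative, exactly 3 is dropped) is an assumption for illustration, and `rating_to_label` is a hypothetical helper. This notebook does not need it, because it relies on the pre-sorted positive and negative files instead.

```python
from typing import Optional

def rating_to_label(stars: float) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None (ambiguous)."""
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    # Assumption: 3-star reviews are treated as ambiguous and left unlabeled.
    return None

ratings = [5.0, 1.0, 3.0, 4.0, 2.0]
labels = [rating_to_label(r) for r in ratings]
print(labels)
```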


A few notes regarding the data.


1) There are 4 directories corresponding to each of the four domains. Each directory contains 3 files called positive.review, negative.review and unlabeled.review. (The books directory doesn't contain the unlabeled but the link is below.) While the positive and negative files contain positive and negative reviews, these aren't necessarily the splits we used in the experiments. We randomly drew from the three files ignoring the file names.


2) Each file contains a pseudo XML scheme for encoding the reviews. Most of the fields are self explanatory. The reviews have a unique ID field that isn't very unique. If it has two unique id fields, ignore the one containing only a number.
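To make the note about the two ID fields concrete, here is a minimal, dependency-free sketch that pulls fields out of a pseudo-XML block with regular expressions and discards the ID that contains only a number. The sample text and its values are made up for illustration (they are not copied from the dataset), and `field_values` is a hypothetical helper. The notebook itself uses Beautiful Soup rather than this approach.

```python
import re

# Hypothetical sample that imitates the file layout; the values are invented.
sample = """
<review>
<unique_id>
B00000TEST:great_cable:reviewer_one
</unique_id>
<unique_id>
12345
</unique_id>
<rating>
5.0
</rating>
<review_text>
Works great and the price was right.
</review_text>
</review>
"""

def field_values(block, tag):
    """Return the stripped contents of every <tag>...</tag> pair in block."""
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(tag))
    return [m.strip() for m in re.findall(pattern, block, flags=re.DOTALL)]

reviews = []
for block in field_values(sample, "review"):
    # Ignore the ID field that contains only a number.
    ids = [u for u in field_values(block, "unique_id") if not u.isdigit()]
    reviews.append({
        "unique_id": ids[0] if ids else None,
        "rating": float(field_values(block, "rating")[0]),
        "text": field_values(block, "review_text")[0],
    })

print(reviews[0]["unique_id"])
print(reviews[0]["text"])
```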

## The Plan
* Examine the electronics category (but the same code can be applied to the others)
* Do a classification on positive or negative because these reviews are already classified (we will ignore the starred ratings)
* The files use a pseudo-XML format, so we will need a parser (Beautiful Soup)
* We will need two passes: one to determine the vocabulary size (and remove stop words) and one to create the vectors.
* We will use a classifier model because this is a classification problem.
* And we'll use logistic regression so we can interpret the weights.
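The two-pass idea in the plan can be sketched on toy data. The first pass has to finish before the second starts, because the vector length depends on the final vocabulary size. Everything here (`docs`, `toy_stopwords`, `simple_tokens`) is a simplified stand-in, not the tokenizer or stop word list used later in the notebook.

```python
docs = ["great sound and great price", "the cable stopped working"]
toy_stopwords = {"and", "the"}  # stand-in for a real stop word list

def simple_tokens(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in toy_stopwords]

# Pass 1: assign every distinct token a column index.
vocab = {}
for doc in docs:
    for tok in simple_tokens(doc):
        vocab.setdefault(tok, len(vocab))

# Pass 2: with the vocabulary size known, build fixed-length count vectors.
vectors = []
for doc in docs:
    row = [0] * len(vocab)
    for tok in simple_tokens(doc):
        row[vocab[tok]] += 1
    vectors.append(row)

print(vocab)
print(vectors)
```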


## Logistic Regression

The key thing to remember about logistic regression is that the weights of the linear equation indicate each feature's impact on the outcome. Positive coefficients push the prediction toward the positive class, while negative coefficients push it toward the negative class. Coefficients at or near zero have little to no impact on the output. Logistic regression also outputs a probability, so we can gauge how certain each prediction is.
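As a small illustration of how weights turn into a probability, the sketch below computes the weighted sum of the features and passes it through the sigmoid function. The three weights and the two feature vectors are made-up numbers, not values learned from the reviews, and `predict_proba` here is a hand-written helper rather than the scikit-learn method of the same name.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias=0.0):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for three words: "great", "price", "broken".
weights = [3.0, 2.0, -2.5]

mostly_positive = [0.5, 0.5, 0.0]  # word proportions in one review
mostly_negative = [0.0, 0.2, 0.8]

p_pos = predict_proba(mostly_positive, weights)  # about 0.92
p_neg = predict_proba(mostly_negative, weights)  # about 0.17
print(round(p_pos, 2), round(p_neg, 2))
```

A probability near 0.5 means the model is unsure, while values near 0 or 1 indicate a confident prediction.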


In [120]:
#import the packages and libraries

import nltk
import numpy as np
from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from bs4 import BeautifulSoup as bs

In [121]:
# This will turn words into their base form (e.g. cats ==> cat)
lemmatizer = WordNetLemmatizer()

In [122]:
# create our list of stopwords
from nltk.corpus import stopwords
stopwords = stopwords.words('english')


In [123]:
# import the positive reviews
with open('data/sentiment/electronics/positive.review') as f:
    positive_reviews = bs(f.read(),'lxml')
# take a peek at the xml and you can see we are only looking for the element <review_text>
positive_reviews = positive_reviews.find_all('review_text')

In [124]:
# import the negative reviews
with open('data/sentiment/electronics/negative.review') as f:
    negative_reviews = bs(f.read(),'lxml')
# take a peek at the xml and you can see we are only looking for the element <review_text>
negative_reviews = negative_reviews.find_all('review_text')

In [125]:
#make sure the datasets are the same size
print('Num positive reviews: ',len(positive_reviews), '\nNum negative reviews: ', len(negative_reviews))

Num positive reviews:  1000 
Num negative reviews:  1000


**Note:** If the data sets were not the same size we would want to shuffle the larger set and then remove the extra rows.

`np.random.shuffle(larger_set)`<br/>
`larger_set = larger_set[:len(smaller_set)]`
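A self-contained version of the same balancing idea, using only the standard library, might look like the sketch below. It draws a random sample instead of shuffling in place, so the original lists are left untouched. The function name `balance` and the `seed` parameter are assumptions made for this example.

```python
import random

def balance(first, second, seed=0):
    """Randomly downsample the larger list so both lists have equal length."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    target = min(len(first), len(second))
    return rng.sample(first, target), rng.sample(second, target)

larger_set = ["pos_%d" % i for i in range(7)]
smaller_set = ["neg_%d" % i for i in range(4)]

balanced_large, balanced_small = balance(larger_set, smaller_set)
print(len(balanced_large), len(balanced_small))
```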

In [126]:
# create an index for each of the words in the documents.
# how many words are in the documents

def my_tokenizer(s):
    s = s.lower()
    #tokenize the string
    tokens = nltk.tokenize.word_tokenize(s)
    # remove words of 1 or 2 characters (e.g. "'s", "an", "it")
    tokens = [t for t in tokens if len(t)>2]
    # get the token lemmas
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    tokens = [t for t in tokens if t not in stopwords]
    return tokens
    
word_index_map = {}
current_index = 0

positive_tokenized = []
negative_tokenized = []

for review in positive_reviews:
    tokens = my_tokenizer(review.text)
    positive_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] =current_index
            current_index +=1
            
for review in negative_reviews:
    tokens = my_tokenizer(review.text)
    negative_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] =current_index
            current_index +=1

In [127]:
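# create the array with the labels and the features
def tokens_to_vector(tokens,label):
    # one slot per vocabulary word, plus a final slot for the label
    x = np.zeros(len(word_index_map) + 1)
    for t in tokens:
        i = word_index_map[t]
        x[i] += 1
    # convert counts to proportions; skip empty reviews to avoid dividing by zero
    total = x.sum()
    if total > 0:
        x = x/total
    x[-1] = label
    return x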
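Note that `tokens_to_vector` looks up every token directly in `word_index_map`, which is fine for the training reviews because the map was built from them. A review the model has never seen may contain words outside the vocabulary, and a direct lookup would then raise a `KeyError`. The sketch below shows one possible way to handle that by skipping unknown words. `vectorize_new_review` and `toy_map` are hypothetical names, and plain lists are used instead of NumPy to keep the example dependency-free.

```python
def vectorize_new_review(tokens, word_index_map):
    """Turn tokens into word proportions, skipping words outside the vocabulary."""
    counts = [0.0] * len(word_index_map)
    for t in tokens:
        idx = word_index_map.get(t)  # None for out-of-vocabulary words
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    # Leave an all-zero vector as is rather than dividing by zero.
    return [c / total for c in counts] if total else counts

toy_map = {"great": 0, "price": 1, "cable": 2}
vec = vectorize_new_review(["great", "great", "price", "gizmo"], toy_map)
print(vec)
```

Here "gizmo" is ignored, so the remaining three tokens yield proportions of two thirds for "great" and one third for "price".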

In [128]:
# identifying the number of rows, i.e. the N of the NxD array
N = len(positive_tokenized) + len(negative_tokenized)

# creating the array to hold the data
data = np.zeros((N,(len(word_index_map) + 1)))
i = 0

# put the tokens in the array

for tokens in positive_tokenized:
    xy = tokens_to_vector(tokens,1)
    data[i,:] = xy
    i += 1

for tokens in negative_tokenized:
    xy = tokens_to_vector(tokens,0)
    data[i,:] = xy
    i += 1
    
# shuffle the rows because right now all of the positive reviews are stacked on top of the negative ones
np.random.shuffle(data)

In [134]:
#divide the array into features and labels
X = data[:,:-1]
Y = np.ravel(data[:,-1:])


In [137]:
# split into train and test sets (the last 100 rows are held out for testing)
X_train = X[:-100,]
Y_train = Y[:-100,]
x_test = X[-100:,]
y_test = Y[-100:,]

In [139]:
# Create the model
model = LogisticRegression()
model.fit(X_train,Y_train)
print("Classification rate: ", model.score(x_test,y_test))

Classification rate:  0.73
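With only 100 held-out reviews, the classification rate is a fairly noisy estimate. As a rough sketch, assuming the test reviews behave like independent draws, the binomial standard error gives a feel for how much the number could move from one test sample to another. The 1.96 multiplier is the usual normal approximation for a 95% interval.

```python
import math

accuracy = 0.73  # classification rate printed above
n_test = 100     # number of held-out reviews

std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
low = accuracy - 1.96 * std_err
high = accuracy + 1.96 * std_err

print("standard error: %.3f" % std_err)
print("approx. 95%% interval: %.2f to %.2f" % (low, high))
```

The interval comes out at roughly 0.64 to 0.82. Because the shuffle above is not seeded, rerunning the notebook will also produce a somewhat different classification rate.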


In [147]:
# Now we can look at the weights and see which words carry the most weight.

threshold = .5
for word, index in word_index_map.items():
    weight = model.coef_[0][index]
    if weight > threshold or weight < -threshold:
        print (word, ': ', weight)

unit :  -0.5395088250463969
bad :  -0.6040821789820418
cable :  0.6806291275119105
time :  -0.713272987457447
used :  1.014175690767491
've :  0.520810917731183
month :  -0.6286165311318604
problem :  0.5584639450917689
need :  0.5955134261283141
good :  1.7288606307595995
sound :  1.0225085054263092
like :  0.578202625980877
lot :  0.5464679990212311
n't :  -1.8104674658150908
easy :  1.3634575382191199
get :  -1.0899412851930284
use :  1.6323682100032912
quality :  1.3135207212237405
best :  0.9044055154547704
item :  -0.848394414107202
working :  -0.5193466604828412
well :  0.9091145116394838
wa :  -1.215304786165066
perfect :  0.878869735019799
fast :  0.7766885555653344
ha :  0.636487413408659
price :  2.228570260288225
great :  3.4794455537548425
money :  -0.8746184448398857
memory :  0.8080703581303034
would :  -0.79392362629664
buy :  -0.9523372600784666
worked :  -0.7899525296038465
pretty :  0.509938066000245
could :  -0.5572956812832564
doe :  -0.9746099635138824
two :  -0.5
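The printed list is in dictionary order, which makes the strongest words hard to spot. One way to read it is to sort by weight, as in the sketch below. The dictionary holds a handful of the (word, weight) pairs from the output above, rounded to three decimals. In the notebook, the same sort could be applied to pairs built from `word_index_map` and `model.coef_[0]`.

```python
# A few (word, weight) pairs copied from the output above, rounded to 3 decimals.
weights = {
    "great": 3.479,
    "price": 2.229,
    "good": 1.729,
    "easy": 1.363,
    "n't": -1.810,
    "wa": -1.215,
    "get": -1.090,
}

# Sort from most positive to most negative weight.
ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

top_positive = ranked[:3]
top_negative = ranked[-3:][::-1]  # most negative first

print("Most positive:", top_positive)
print("Most negative:", top_negative)
```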