# Case Study: Sentiment Analysis

### Data Prep + Data Cleaning

**We'll be going through an example of using scikit-learn to perform sentiment analysis on Amazon Reviews.**
Looking at the head of the dataframe, we can see we have the Product Name, Brand, Price, Rating, Review text and the number of people who found the review helpful. For our purposes, we'll be focusing on the Rating and Reviews columns.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')
df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [3]:
df.columns

Index(['Product Name', 'Brand Name', 'Price', 'Rating', 'Reviews',
       'Review Votes'],
      dtype='object')

In [4]:
df.Rating.value_counts()

5    223605
1     72350
4     61392
3     31765
2     24728
Name: Rating, dtype: int64

In [5]:
df.dropna(inplace=True) # drop rows with any missing value
df = df[df['Rating']!=3].copy() # remove neutral 3-star reviews; copy so the new column below does not trigger SettingWithCopyWarning
df['Positively Rated'] = np.where(df['Rating']>3,1,0) # 1 if the rating is 4 or 5, 0 if it is 1 or 2
df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,1
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,1
5,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,1,I already had a phone with problems... I know ...,1.0,0
6,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,The charging port was loose. I got that solder...,0.0,0
7,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,"Phone looks good but wouldn't stay charged, ha...",0.0,0
8,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I originally was using the Samsung S2 Galaxy f...,0.0,1
11,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,This is a great product it came after two days...,0.0,1


In [6]:
df['Positively Rated'].mean()

0.74826860258793226

*So this is about determining which reviews are positively rated: a binary classification problem (0 or 1) whose only input is the text of each review. About 75% of the remaining reviews are positive, so the classes are imbalanced.*

In [7]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(df['Reviews'],df['Positively Rated'],random_state=0)
print(y_train.head())
print(X_train.head())
y_test.shape

97039     0
243783    1
88792     0
388802    1
161607    1
Name: Positively Rated, dtype: int64
97039     I bought a BB Black and was deliveried a White...
243783    overall i am very happy so far with this phone...
88792     the keyboard stutters! after i made a research...
388802    excellent smart phone, good performance. all p...
161607    I received my new Blu Vivo 5 Smartphone 3 days...
Name: Reviews, dtype: object


(77070,)

### CountVectorizer method

**CountVectorizer allows us to use the bag-of-words approach by converting a collection of text documents into a matrix of token counts.**

Fitting the CountVectorizer lowercases the text, tokenizes each document by finding all sequences of at least two letters or digits separated by word boundaries, and builds a vocabulary from these tokens.
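To make the tokenization rule concrete, here is a minimal pure-Python sketch of the bag-of-words step. It assumes the regular expression `\b\w\w+\b`, which mirrors CountVectorizer's default token pattern. The helper names (`tokenize`, `build_vocabulary`, `count_vector`) and the two sample reviews are made up for illustration; this is not scikit-learn's actual implementation.

```python
import re
from collections import Counter

# Pattern mirroring CountVectorizer's default: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return its tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vector(text, vocabulary):
    """Return the row of token counts for one document."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]

# Made-up sample reviews, not taken from the Amazon dataset.
docs = ["Great phone, I love it", "Phone broke. Not a great buy"]
vocab = build_vocabulary(docs)
matrix = [count_vector(doc, vocab) for doc in docs]
print(list(vocab))
print(matrix)
```

Single-character tokens such as "I" and "a" are dropped by the pattern, which is why they never appear in the vocabulary.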

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer().fit(X_train)
# every 2,000th token in the alphabetically sorted vocabulary
# (newer scikit-learn versions replace this method with get_feature_names_out)
vect.get_feature_names()[::2000]

['00',
 '4less',
 'adr6275',
 'assignment',
 'blazingly',
 'cassettes',
 'condishion',
 'debi',
 'dollarsshipping',
 'esteem',
 'flashy',
 'gorila',
 'human',
 'irullu',
 'like',
 'microsaudered',
 'nightmarish',
 'p770',
 'poori',
 'quirky',
 'responseive',
 'send',
 'sos',
 'synch',
 'trace',
 'utiles',
 'withstanding']

**We can get the vocabulary by using the get_feature_names method.** This vocabulary is built from every token that occurred in the training data. Looking at every 2,000th feature gives a small sense of what the vocabulary looks like: it is pretty messy, including tokens that contain digits as well as misspellings.

In [9]:
len(vect.get_feature_names())

53216

In [10]:
# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)
X_train_vectorized

<231207x53216 sparse matrix of type '<class 'numpy.int64'>'
	with 6117776 stored elements in Compressed Sparse Row format>

In [11]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_vectorized,y_train) # fit logistic regression on the document-term matrix built above

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [12]:
from sklearn.metrics import roc_auc_score

X_test_vectorized = vect.transform(X_test)
predictions = model.predict(X_test_vectorized)

# AUC of the test-set predictions against y_test (computed here from hard 0/1 labels, not probabilities)
print('AUC: ',roc_auc_score(y_test,predictions))

AUC:  0.92648398605
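Note that the cell above passes hard 0/1 predictions to `roc_auc_score`. AUC is usually computed from continuous scores (for example `model.predict_proba(X)[:, 1]` or `model.decision_function(X)`), and the two results can differ. The sketch below illustrates the gap with a hand-rolled pairwise AUC, which is equivalent to the area under the ROC curve. `pairwise_auc` is a hypothetical helper, not a scikit-learn function, and the numbers are made up rather than taken from the Amazon data.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.6]         # made-up probabilities
labels = [1 if s >= 0.5 else 0 for s in scores]  # hard predictions at 0.5

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_scores, auc_from_labels)
```

Thresholding throws away the ranking within each predicted class, so the label-based figure reflects a single operating point rather than the whole ROC curve.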


In [13]:
feature_names = np.array(vect.get_feature_names()) # get the feature names as numpy array

sorted_coef_index = model.coef_[0].argsort() # indices that sort the model coefficients in ascending order
print(sorted_coef_index)
sorted_coef_index.size

[52310 19183 52316 ..., 18547 18377 18376]


53216

In [14]:
# Find the 10 smallest and 10 largest coefficients
print('Smallest coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest coefs:\n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest

Smallest coefs:
['worst' 'false' 'worthless' 'junk' 'garbage' 'mony' 'useless' 'messing'
 'unusable' 'horrible']

Largest coefs:
['excelent' 'excelente' 'exelente' 'excellent' 'loving' 'loves' 'efficient'
 'perfecto' 'amazing' 'love']


Looking at the ten smallest and ten largest coefficients, we can see the model has connected words like worst, worthless and junk to negative reviews, and words like excellent, loves, and amazing to positive reviews. Misspelled variants such as excelent are learned as separate features.
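The slicing idiom used to pull out the extremes can be checked on a toy example. This is a plain-Python stand-in for NumPy's `argsort`, with made-up coefficients and token names that are not taken from the fitted model.

```python
# Made-up coefficients for six hypothetical tokens.
feature_names = ["junk", "love", "phone", "worst", "great", "ok"]
coefs = [-2.1, 3.0, 0.1, -3.4, 2.2, -0.2]

# Plain-Python equivalent of argsort: indices ordered by ascending coefficient.
sorted_index = sorted(range(len(coefs)), key=lambda i: coefs[i])

k = 2
# The first k indices point at the most negative coefficients.
smallest = [feature_names[i] for i in sorted_index[:k]]
# [:-(k + 1):-1] walks backwards from the end, so the largest comes first.
largest = [feature_names[i] for i in sorted_index[:-(k + 1):-1]]
print(smallest)
print(largest)
```

With `k = 10` the second slice becomes `[:-11:-1]`, the form used in the notebook.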

### Tfidf Vectorizer

**Tf–idf, or term frequency-inverse document frequency, allows us to weight terms based on how important they are to a document.**
High weight is given to terms that appear often in a particular document but rarely across the corpus as a whole. Features with low tf–idf are either used commonly across all documents, or used rarely and only in long documents.
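As a rough illustration of this weighting, the sketch below computes tf–idf by hand for three tiny tokenized documents. It assumes the smoothed formula that TfidfVectorizer uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each row to unit L2 norm. The documents and the helper names `idf` and `tfidf_row` are invented for this example.

```python
import math
from collections import Counter

# Made-up, already tokenized documents.
docs = [
    ["phone", "works", "great"],
    ["phone", "broke", "broke"],
    ["phone", "great", "battery"],
]
n_docs = len(docs)

# Document frequency: number of documents containing each token.
doc_freq = Counter(tok for doc in docs for tok in set(doc))

def idf(token):
    """Smoothed inverse document frequency (assumed default formula)."""
    return math.log((1 + n_docs) / (1 + doc_freq[token])) + 1

def tfidf_row(doc):
    """Raw count times idf, then scale the row to unit L2 norm."""
    counts = Counter(doc)
    weights = {tok: cnt * idf(tok) for tok, cnt in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}

row = tfidf_row(docs[1])
print(row)
```

In the second document, "broke" is repeated and appears nowhere else, so it receives a much higher weight than "phone", which occurs in every document.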

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(min_df=5).fit(X_train) # keep only tokens that appear in at least 5 documents
len(vect.get_feature_names()) # min_df shrinks the vocabulary compared with the 53,216 tokens found earlier

17951
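The drop from 53,216 to 17,951 features comes from discarding tokens that occur in fewer than five documents. A minimal sketch of that pruning rule, using an invented four-document corpus and a hypothetical `prune_vocabulary` helper (not part of scikit-learn):

```python
from collections import Counter

# Made-up, already tokenized documents.
docs = [
    ["great", "phone"],
    ["great", "battery"],
    ["phone", "broke"],
    ["great", "phone", "gooood"],
]

def prune_vocabulary(tokenized_docs, min_df):
    """Keep tokens that occur in at least `min_df` documents."""
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return sorted(tok for tok, freq in doc_freq.items() if freq >= min_df)

full_vocab = prune_vocabulary(docs, min_df=1)
pruned_vocab = prune_vocabulary(docs, min_df=2)
print(full_vocab)
print(pruned_vocab)
```

One-off tokens such as misspellings are the first to go, which is why a modest threshold removes a large share of a messy vocabulary.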

In [16]:
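# still the CountVectorizer matrix from In [10]; it has not yet been rebuilt with the tf-idf vectorizer
X_train_vectorized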

<231207x53216 sparse matrix of type '<class 'numpy.int64'>'
	with 6117776 stored elements in Compressed Sparse Row format>

In [17]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized,y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ',roc_auc_score(y_test,predictions)) # roughly the same score as the CountVectorizer model, but with fewer features

AUC:  0.926610066675


In [18]:
X_train_vectorized.max(0).toarray()

array([[ 0.71042189,  0.32454897,  0.31976905, ...,  0.4614497 ,
         0.49678755,  0.49678755]])

In [19]:
feature_names = np.array(vect.get_feature_names())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

In [20]:
sorted_tfidf_index

array([ 3624, 12532, 17320, ...,  7414,  2184,  4635])

In [21]:
print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf:\n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['commenter' 'pthalo' 'warmness' 'storageso' 'aggregration' '1300'
 '625nits' 'a10' 'submarket' 'brawns']

Largest tfidf:
['defective' 'batteries' 'gooood' 'epic' 'luis' 'goood' 'basico'
 'aceptable' 'problems' 'excellant']


**The features with the smallest tf–idf either appeared commonly across all reviews or appeared only rarely, in very long reviews.**

**The features with the largest tf–idf are words that appeared frequently within a review but did not appear commonly across all reviews.**

In [25]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]])) # tokens most strongly tied to negative reviews
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]])) # tokens most strongly tied to positive reviews

Smallest Coefs:
['not' 'worst' 'useless' 'disappointed' 'terrible' 'return' 'waste' 'poor'
 'horrible' 'doesn']

Largest Coefs: 
['love' 'great' 'excellent' 'perfect' 'amazing' 'awesome' 'perfectly'
 'easy' 'best' 'loves']


In [24]:
# Both reviews are predicted negative: the model ignores word order, so each one reduces to the same set of word counts
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]
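The reason both reviews get the same prediction can be seen directly from their token counts. The sketch below uses a hypothetical `bag_of_words` helper with a pattern like CountVectorizer's default; it is an illustration, not the vectorizer's own code.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Token counts using a pattern like CountVectorizer's default."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

first = bag_of_words("not an issue, phone is working")
second = bag_of_words("an issue, phone is not working")

# Same tokens, same counts: the model receives identical features.
print(first == second)
```

Because the two feature vectors are identical, no classifier trained on single-word counts can tell these reviews apart.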


### n-grams

One way we can add some context is by adding sequences of word features known as n-grams. For example, bigrams, which count pairs of adjacent words, could give us features such as "is working" versus "not working". Trigrams, which count triplets of adjacent words, could give us features such as "not an issue".
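A minimal sketch of how word n-grams are formed, assuming the same two-or-more-character token pattern as before. `word_ngrams` is a hypothetical helper written for this example; CountVectorizer builds its n-grams internally when `ngram_range` is set.

```python
import re

def word_ngrams(text, n):
    """Return the list of n-grams of adjacent tokens, joined by spaces."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

first = word_ngrams("not an issue, phone is working", 2)
second = word_ngrams("an issue, phone is not working", 2)
print(first)
print(second)

# Trigrams capture slightly longer context.
print(word_ngrams("not an issue, phone is working", 3))
```

The two reviews now yield different features ("is working" in one, "not working" in the other), at the cost of a much larger vocabulary.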

In [27]:
vect = CountVectorizer(min_df=5,ngram_range=(1,2)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
len(vect.get_feature_names())

198917

In [29]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.967143758101


In [30]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['no good' 'worst' 'junk' 'not good' 'not happy' 'horrible' 'garbage'
 'terrible' 'looks ok' 'nope']

Largest Coefs: 
['not bad' 'excelent' 'excelente' 'excellent' 'perfect' 'no problems'
 'exelente' 'awesome' 'no issues' 'great']


Just by adding bigrams, the number of features has increased to almost 200,000. After training our logistic regression model on the new features, the AUC score improves to 0.967. If we look at the features our model connected with negative reviews, we now see bigrams such as no good and not happy, while for positive reviews we see not bad and no problems.

In [31]:
# These reviews are now correctly identified
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]
