# Sentiment Analysis

### Importing the required libraries

In [1]:
import numpy as np
import pandas as pd

### Reading the data

Here we use the Amazon_Unlocked_Mobile dataset to predict whether a user's sentiment
is positive or negative, based on the text of the user's review.

In [2]:
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')
df.head() # it is used to print the first n data samples, by default n=5

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


### Data preprocessing

In [3]:
df.dropna(inplace=True) # remove all samples with missing values

df = df[df['Rating']!=3] # This is a binary classification task: we predict whether
                         # a review is positive or negative, using a rating of 3
                         # as the threshold. Any rating below 3 is a negative review,
                         # while any rating above 3 is a positive review. So, we
                         # remove all reviews with a rating equal to 3.

# Encode rating = 4 or 5 as 1 (rated positively)
# Encode rating = 1 or 2 as 0 (rated negatively)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)

df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,1
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,1
5,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,1,I already had a phone with problems... I know ...,1.0,0
6,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,The charging port was loose. I got that solder...,0.0,0
7,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,"Phone looks good but wouldn't stay charged, ha...",0.0,0
8,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I originally was using the Samsung S2 Galaxy f...,0.0,1
11,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,This is a great product it came after two days...,0.0,1


In [4]:
## We can see that about 75% of the reviews are positive.
df['Positively Rated'].mean()

0.7482686025879323
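
The mean of a 0/1 column is the share of positive labels, which is also the accuracy a trivial "always positive" classifier would reach. The following is a minimal sketch on made-up ratings (the `toy` frame and the `majority_baseline` name are illustrative, not part of the notebook) that repeats the labelling steps and computes that baseline.

```python
import numpy as np
import pandas as pd

# Made-up ratings for illustration only (not taken from the dataset).
toy = pd.DataFrame({"Rating": [5, 4, 3, 1, 2, 5, 3, 4]})

# Drop the neutral 3-star rows, then label ratings above 3 as positive (1).
toy = toy[toy["Rating"] != 3].copy()
toy["Positively Rated"] = np.where(toy["Rating"] > 3, 1, 0)

labels = toy["Positively Rated"].tolist()
positive_share = toy["Positively Rated"].mean()

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(positive_share, 1 - positive_share)

print(labels)
print(round(positive_share, 3), round(majority_baseline, 3))
```

With roughly 75% positive reviews in the real data, accuracy scores later in the notebook are best read against a baseline of about 0.75 rather than 0.5.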

In [5]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positively Rated'], 
                                                    random_state=0,
                                                    test_size=0.2)
# The test size is set to 20% of the entire set; the default is 25%.

In [6]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)

X_train first entry:

 Excelent!!


X_train shape:  (246621,)
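
Because the classes are imbalanced, one optional variation (not what the cell above does) is to pass `stratify` to `train_test_split` so that both splits keep the same class ratio. A small sketch on made-up labels, where `texts` and `labels` are hypothetical stand-ins for the review and label columns:

```python
from sklearn.model_selection import train_test_split

# Made-up data: 75 positive and 25 negative labels, mirroring the ~75% share.
labels = [1] * 75 + [0] * 25
texts = ["review %d" % i for i in range(100)]

# stratify=labels keeps the 75/25 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

print(len(X_tr), len(X_te))
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```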


## Bag-of-Words Model

We cannot work with text directly when using machine learning algorithms.

Instead, we need to convert the text to numbers.

We may want to perform classification of documents, so each document is an “input” and a class label is the “output” for our predictive algorithm. Algorithms take vectors of numbers as input, therefore we need to convert documents to fixed-length vectors of numbers.

A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW.

The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.

This can be done by assigning each word a unique number. Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words. The value in each position in the vector could be filled with a count or frequency of each word in the encoded document.

This is the bag of words model, where we are only concerned with encoding schemes that represent what words are present or the degree to which they are present in encoded documents without any information about order.
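
The encoding described above can be sketched in a few lines of plain Python. This is an illustration on two made-up sentences, not the implementation used later in the notebook; the `encode` helper is a hypothetical name.

```python
from collections import Counter

# Two made-up documents for illustration.
docs = ["the phone is great", "the battery is bad and the screen is bad"]

# Vocabulary: every distinct word, sorted so each word gets a stable position.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def encode(doc):
    """Return a count vector with one position per vocabulary word."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]

vectors = [encode(doc) for doc in docs]
print(vocabulary)
print(vectors)

# Word order is discarded: a shuffled sentence gets the same vector.
shuffled_is_same = encode("great is the phone") == encode(docs[0])
print(shuffled_is_same)
```

Every document maps to a vector of the same length (the vocabulary size), regardless of how long the document is.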

There are many ways to extend this simple method, both by better clarifying what a “word” is and in defining what to encode about each word in the vector.

## CountVectorizer

The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

You can use it as follows:

1. Create an instance of the CountVectorizer class.
2. Call the fit() function in order to learn a vocabulary from one or more documents.
3. Call the transform() function on one or more documents as needed to encode each as a vector.

An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document.

Because these vectors will contain a lot of zeros, we call them sparse. SciPy provides an efficient way of handling sparse vectors in the scipy.sparse package.
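
To make the sparsity point concrete, here is a small sketch with a hand-made `scipy.sparse` matrix (the values are made up). It also shows that wrapping a sparse matrix in `np.array()` does not produce a dense array, and gives a rough size estimate for a dense matrix of shape (246621, 54467), assuming 8 bytes per entry.

```python
import numpy as np
from scipy import sparse

# A tiny sparse matrix: only the two non-zero entries are stored.
small = sparse.csr_matrix([[0, 2, 0], [1, 0, 0]])
print(small.nnz)

wrapped = np.array(small)   # does NOT densify: an object array holding the sparse matrix
dense = small.toarray()     # a real dense 2-D ndarray

print(wrapped.dtype)
print(dense)

# Rough size of a dense int64 copy of a (246621, 54467) matrix, in gigabytes.
dense_gb = 246621 * 54467 * 8 / 1e9
print(round(dense_gb, 1))
```

At that size a full dense copy is impractical, so it is usually safer to densify only a few rows for inspection.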

The vectors returned from a call to transform() will be sparse vectors, and you can transform them back to numpy arrays to look and better understand what is going on by calling the toarray() function.
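
The three steps can be tried on a toy corpus before running them on the full dataset. This is a minimal sketch with made-up documents; `vect_demo` and the other names are illustrative. Feature names are read from the `vocabulary_` attribute, which maps each term to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration.
docs = ["good phone", "bad phone", "good good battery"]

vect_demo = CountVectorizer()          # 1. create an instance
vect_demo.fit(docs)                    # 2. learn the vocabulary
X_demo = vect_demo.transform(docs)     # 3. encode documents as count vectors

# Terms listed in column order, recovered from the term -> column mapping.
terms = sorted(vect_demo.vocabulary_, key=vect_demo.vocabulary_.get)
dense_demo = X_demo.toarray()
print(terms)
print(dense_demo)

# Words that were not seen during fit() are ignored when encoding new text.
unseen = vect_demo.transform(["great phone"]).toarray()
print(unseen)
```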

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

#Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

In [18]:
# Uncomment these to inspect the feature names and their ids
#len(vect.get_feature_names()) # number of features; get_feature_names() returns the names in sorted order
#vect.vocabulary_ # dictionary mapping each feature name to its column index (not its frequency)

In [9]:
#Transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)

In [10]:
print(X_train_vectorized.shape)
print(type(X_train_vectorized))
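# np.array(X_train_vectorized) would only wrap the sparse matrix in an object array,
# and a dense copy of the full matrix would need far too much memory,
# so convert just the first 5 rows to a dense numpy ndarray.
arr = X_train_vectorized[:5].toarray()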
print(type(arr))
#print(arr) # Uncomment to print arr

(246621, 54467)
<class 'scipy.sparse.csr.csr_matrix'>
<class 'numpy.ndarray'>


### Using Logistic regression to train the model

In [11]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_vectorized,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

#### Prediction using the logistic model

In [12]:
from sklearn.metrics import roc_auc_score,accuracy_score

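predictions = model.predict(vect.transform(X_test))
# Note: roc_auc_score receives hard 0/1 predictions here, so the value reflects a single
# operating point rather than the ranking quality of the model's scores.
print('AUC: {:.3f}'.format(roc_auc_score(y_test,predictions)))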
print('Accuracy score: {:.3f}'.format(accuracy_score(y_test,predictions)))

AUC: 0.925
Accuracy score: 0.950
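
`roc_auc_score` is normally given continuous scores (for example the positive-class column of `predict_proba`). When it is given hard 0/1 predictions, as in the cell above, it reduces to the mean of the true positive rate and the true negative rate. The sketch below uses made-up labels and scores to show that the two values can differ; it does not recompute the notebook's results.

```python
from sklearn.metrics import roc_auc_score

# Made-up ground truth and model scores for illustration.
y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.55, 0.7, 0.9]

# Hard predictions obtained by thresholding the scores at 0.5.
hard = [1 if s >= 0.5 else 0 for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard)

print(round(auc_from_scores, 4))
print(round(auc_from_labels, 4))
```

For a fitted `LogisticRegression`, the scores would come from `model.predict_proba(...)[:, 1]` or `model.decision_function(...)`.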


In [13]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

# The smallest coefficients are the most negative ones: they push the prediction
# toward the negative class. For example, a review containing 'worst' is likely negative.

Smallest Coefs:
['worst' 'false' 'junk' 'worthless' 'garbage' 'mony' 'useless' 'horrible'
 'terrible' 'unusable']

Largest Coefs: 
['excelent' 'excelente' 'excellent' 'exelente' 'loving' 'loves' 'amazing'
 'perfecto' 'love' 'lovely']
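
The `argsort` indexing used above is easier to follow on a tiny example. The names and coefficient values below are made up for illustration.

```python
import numpy as np

# Made-up feature names and coefficients.
names = np.array(["bad", "fine", "great", "awful", "okay"])
coefs = np.array([-1.5, 0.2, 2.0, -3.0, 0.0])

# argsort returns indices ordered from the most negative to the most positive value.
order = coefs.argsort()

smallest_two = names[order[:2]]      # first two indices: most negative coefficients
largest_two = names[order[:-3:-1]]   # walk backwards from the end: largest first

print(order)
print(smallest_two)
print(largest_two)
```

The slice `[:-3:-1]` takes two items in reverse, just as `[:-11:-1]` takes ten.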


## TF-IDF

Word counts are a good starting point, but are very basic.

One issue with simple counts is that some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”, which are the components of the resulting scores assigned to each word.

1. Term Frequency: This summarizes how often a given word appears within a document.
2. Inverse Document Frequency: This downscales words that appear a lot across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.

The inverse document frequencies are calculated for each word in the vocabulary. With scikit-learn's default settings, the lowest possible score is 1.0, which is assigned to a word that appears in every document; rarer words receive higher scores.
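
As a check on how the inverse document frequency is computed, the sketch below compares a manual calculation with the `idf_` attribute of a fitted `TfidfVectorizer`. It assumes scikit-learn's default `smooth_idf=True`, for which the documented formula is `ln((1 + n) / (1 + df)) + 1`; the documents are made up.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents; "phone" appears in all of them.
docs = ["good phone", "bad phone", "good battery phone"]
tfidf = TfidfVectorizer().fit(docs)

n_docs = len(docs)
manual_idf = {}
for term in tfidf.vocabulary_:
    # Document frequency: number of documents containing the term.
    doc_freq = sum(term in doc.split() for doc in docs)
    manual_idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

# idf_ is indexed by column; vocabulary_ maps each term to its column.
learned_idf = {term: float(tfidf.idf_[col]) for term, col in tfidf.vocabulary_.items()}

for term in sorted(manual_idf):
    print(term, round(manual_idf[term], 4), round(learned_idf[term], 4))
```

The word that occurs in every document gets the minimum weight of 1.0, and the rarer words get larger weights.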

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names())

18621

In [20]:
#Transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)

In [21]:
print(X_train_vectorized.shape)
print(type(X_train_vectorized))
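# np.array(X_train_vectorized) would only wrap the sparse matrix in an object array,
# and a dense copy of the full matrix would need far too much memory,
# so convert just the first 5 rows to a dense numpy ndarray.
arr = X_train_vectorized[:5].toarray()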
print(type(arr))
#print(arr) # Uncomment to print arr

(246621, 18621)
<class 'scipy.sparse.csr.csr_matrix'>
<class 'numpy.ndarray'>


### Using Logistic regression to train the model

In [22]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_vectorized,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

#### Prediction using the logistic model

In [24]:
from sklearn.metrics import roc_auc_score,accuracy_score

predictions = model.predict(vect.transform(X_test))
print('AUC: {:.3f}'.format(roc_auc_score(y_test,predictions)))
print('Accuracy score: {:.3f}'.format(accuracy_score(y_test,predictions)))

AUC: 0.927
Accuracy score: 0.950


In [33]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['no good' 'junk' 'worst' 'not good' 'not happy' 'horrible' 'garbage'
 'terrible' 'not satisfied' 'not worth']

Largest Coefs: 
['excelent' 'excelente' 'not bad' 'excellent' 'perfect' 'no problems'
 'exelente' 'awesome' 'great' 'no issues']

Note that the two-word features in this listing (for example 'no good' and 'not bad') cannot come from the single-word TfidfVectorizer fitted above. The execution count In [33] suggests this cell was last run after the n-gram model in the next section had been fitted, so the output shown here matches that model.


The issue with these models is that they consider each word on its own, so the word 'not' alone leads the model to predict the first review in the example below as negative.

In [26]:
# These reviews are treated the same by our current model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]
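
The reason both reviews receive the same prediction is that they contain exactly the same words, only in a different order. A small sketch with a fresh single-word `CountVectorizer` (fitted on just these two sentences for illustration, not the notebook's model) shows that their vectors are identical.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ['not an issue, phone is working',
           'an issue, phone is not working']

# A single-word (unigram) vectorizer fitted only on these two sentences.
unigram = CountVectorizer().fit(reviews)
rows = unigram.transform(reviews).toarray()

print(sorted(unigram.vocabulary_))
print(rows)

# Identical rows mean any model on top of these features must treat them alike.
same_vector = bool((rows[0] == rows[1]).all())
print(same_vector)
```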


To address this problem, we use a concept called n-grams.

## n-grams

An N-gram is simply a sequence of N words. For instance, let us take a look at the following examples:

1. San Francisco (is a 2-gram)
2. The Three Musketeers (is a 3-gram)
3. She stood up slowly (is a 4-gram)

Now which of these three N-grams have you seen quite frequently? Probably, “San Francisco” and “The Three Musketeers”. On the other hand, you might not have seen “She stood up slowly” that frequently. Basically, “She stood up slowly” is an example of an N-gram that does not occur as often in sentences as Examples 1 and 2.

Now if we assign a probability to the occurrence of an N-gram or the probability of a word occurring next in a sequence of words, it can be very useful. Why?

First of all, it can help in deciding which N-grams can be chunked together to form single entities (like “San Francisco” chunked together as one word, or “high school” being chunked as one word). It can also help make next word predictions. Say you have the partial sentence “Please hand over your”. Then it is more likely that the next word is going to be “test” or “assignment” or “paper” than “school”.
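
The next-word idea can be sketched by counting which word follows which in a corpus and turning the counts into relative frequencies. The corpus below is made up, and `next_word_probs` is a hypothetical helper, so the numbers only illustrate the mechanism.

```python
from collections import Counter, defaultdict

# Made-up corpus for illustration.
corpus = [
    "please hand over your paper",
    "please hand over your test",
    "please hand over your paper now",
    "your school is closed",
]

# follow[prev][nxt] counts how often the word nxt comes right after prev.
follow = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        follow[prev][nxt] += 1

def next_word_probs(prev):
    """Estimate P(next word | prev) from bigram counts."""
    total = sum(follow[prev].values())
    return {word: count / total for word, count in follow[prev].items()}

probs = next_word_probs("your")
print(probs)
```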

When performing machine learning tasks related to natural language processing, we usually need to generate n-grams from input sentences. For example, in text classification tasks, in addition to using each individual token found in the corpus, we may want to add bi-grams or tri-grams as features to represent our documents. 

When we generate a list of bi-grams from a sentence, every pair of adjacent tokens becomes one feature. The exact output depends on how we want to treat punctuation.
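
A minimal sketch of n-gram generation, using an illustrative sentence of our own (not the original sample input) and a hypothetical `ngrams` helper that splits on whitespace and ignores punctuation handling entirely:

```python
def ngrams(text, n):
    """Return all runs of n adjacent whitespace-separated tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative sentence, chosen for this example.
sample = "The phone is not working"

bigrams = ngrams(sample, 2)
trigrams = ngrams(sample, 3)
print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so longer n-grams are fewer per sentence but far more varied across a corpus.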

Fit the CountVectorizer to the training data, specifying a minimum
document frequency of 5 and extracting 1-grams and 2-grams:

In [28]:
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names())

209766

As we can see, the number of features has increased from 54467 to 209766.
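
To see what `ngram_range=(1,2)` adds, the sketch below fits a vectorizer on just two example reviews (for illustration only, without `min_df`) and lists the resulting features. The two reviews share every single word but differ in their two-word features.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ['not an issue, phone is working',
           'an issue, phone is not working']

# Extract both single words and pairs of adjacent words.
bigram = CountVectorizer(ngram_range=(1, 2)).fit(reviews)
terms = set(bigram.vocabulary_)
rows = bigram.transform(reviews).toarray()

print(len(terms))
print(sorted(t for t in terms if " " in t))

# The two reviews no longer map to the same vector.
vectors_differ = bool((rows[0] != rows[1]).any())
print(vectors_differ)
```

Features such as 'not working' and 'not an' are what allow a linear model to tell the two reviews apart.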

### Using Logistic regression to train the model

In [29]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_vectorized,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

#### Prediction using the logistic model

In [30]:
from sklearn.metrics import roc_auc_score,accuracy_score

predictions = model.predict(vect.transform(X_test))
print('AUC: {:.3f}'.format(roc_auc_score(y_test,predictions)))
print('Accuracy score: {:.3f}'.format(accuracy_score(y_test,predictions)))

AUC: 0.964
Accuracy score: 0.975


###### As we can see, using bigrams improves both scores: the AUC rises to 0.964 and the accuracy to 0.975, compared with 0.925 and 0.950 for the single-word CountVectorizer model.

In [34]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['no good' 'junk' 'worst' 'not good' 'not happy' 'horrible' 'garbage'
 'terrible' 'not satisfied' 'not worth']

Largest Coefs: 
['excelent' 'excelente' 'not bad' 'excellent' 'perfect' 'no problems'
 'exelente' 'awesome' 'great' 'no issues']


In [35]:
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]


We can also see that the first review, which was previously predicted incorrectly, is now predicted correctly.

References:

1. Applied Text Mining in Python by University of Michigan, Coursera.
2. https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/