# Sentiment Analysis

In this Jupyter Notebook we compare two approaches to sentiment analysis based on logistic regression.

The implementations evaluated are:
- `LogisticRegression` from `sklearn`
- `logistic_classifier.create` from `turicreate`

For comparability, both use the same dataset: "amazon_baby", which contains **183531** English-language Amazon reviews of baby products.

# I. Using sklearn's `LogisticRegression`

### Data Prep

In [43]:
import pandas as pd
import numpy as np
import sklearn
import string

In [1]:
# STEP 1) LOAD THE DATA

# Read the data
df = pd.read_csv('./data/amazon_baby.csv')
df.head()


In [5]:
# Remove any 'neutral' ratings equal to 3
# (.copy() so the new column below is added to an independent frame, not a slice)
df = df[df['rating'] != 3].copy()

# Encode 4s and 5s as 1 (positive reviews)
# Encode 1s and 2s as 0 (negative reviews)
df['sentiment'] = np.where(df['rating'] > 3, 1, 0)
df.head(10)

## NOTE --- should we remove punctuation?

Unnamed: 0,name,review,rating,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,1
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,1
6,A Tale of Baby's Days with Peter Rabbit,"Lovely book, it's bound tightly so you may not...",4,1
7,"Baby Tracker&reg; - Daily Childcare Journal, S...",Perfect for new parents. We were able to keep ...,5,1
8,"Baby Tracker&reg; - Daily Childcare Journal, S...",A friend of mine pinned this product on Pinter...,5,1
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4,1
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,4,1


In [40]:
len(df)

166752

## Train-Test split

In [7]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
# No test_size is given, so train_test_split uses its default: 25% test, 75% train
Xtrain, Xtest, ytrain, ytest = train_test_split(df['review'],
                                                df['sentiment'],
                                                random_state=0)

In [8]:
print('X_train first entry:\n\n', Xtrain.iloc[0])
print('\n\nX_train shape: ', Xtrain.shape)

# 125064 observations in sklearn training set 

X_train first entry:

 So far so good.  My baby is yet to come and try it out herself, but for the time spent on building the crib and the look/feel of it, I give it 5 stars.  Mattress fits well.  Good safety tips provided.  Instructions were at times a little hard to follow but if you look at the pictures you can't go wrong.  The crib is sturdy, yet light enough to be moved around (we've already done that twice!).  Bought the espresso finish and it goes well with the nursery theme.  It does include the toddler rail.  I was also impressed with the packing; they seem to care about the product.  I preferred this model over the Emily although they are similar and the Emily is cheaper - I didn't like the gap in the rail for when my baby grows older and would be able to stand up in the crib and put her hand through it.  Of course you can get suitable covers for the gap, but less the material surrounding your baby, the better!


X_train shape:  (125064,)
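
Because `train_test_split` is called without `test_size`, scikit-learn falls back to its default of 0.25. The short check below is a sketch that uses only the row counts reported in this notebook and reproduces the training-set size printed above.

```python
import math

n_rows = 166752          # rows left after dropping the neutral (rating == 3) reviews
test_fraction = 0.25     # scikit-learn's default when test_size and train_size are omitted

# The test share is rounded up and the remainder goes to training
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)   # 125064 41688
```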


## Word count vectors with CountVectorizer
`CountVectorizer` is a scikit-learn tool that converts text into a vector (array) of word frequency counts.

It is useful when there are many texts to analyze and each one needs to be represented as a row of word counts in a single matrix.

`CountVectorizer` builds a matrix in which each unique word in the corpus is a column and each text sample is a row. The value of each cell is the number of times that word occurs in that sample.
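
To make the row/column layout concrete, here is a minimal pure-Python sketch of the same idea. The two-document corpus is an assumption made up for illustration, and the result is a dense list of lists, whereas `CountVectorizer` returns a sparse matrix.

```python
import re
from collections import Counter

# Assumed toy corpus (not taken from the dataset)
docs = ["Great crib, great price", "The crib broke"]

# Lowercase, then keep runs of two or more word characters,
# which mirrors CountVectorizer's default token pattern
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

# Vocabulary: one column per unique token, in sorted order
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Document-term matrix: one row per document, one count per vocabulary column
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[tok] for tok in vocabulary])

print(vocabulary)  # ['broke', 'crib', 'great', 'price', 'the']
print(matrix)      # [[0, 1, 2, 1, 0], [1, 1, 0, 0, 1]]
```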

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
# Obtain the matrix of word counts (observations as rows, words as columns)
vect = CountVectorizer().fit(Xtrain.values.astype('U'))

In [29]:
len(vect.get_feature_names_out())

55486

In [12]:
# transform the documents in the training data to a document-term matrix
Xtrain_vectorized = vect.transform(Xtrain.values.astype('U'))
Xtrain_vectorized

<125064x55486 sparse matrix of type '<class 'numpy.int64'>'
	with 6620576 stored elements in Compressed Sparse Row format>

### Training the word count model

In [15]:
from sklearn.linear_model import LogisticRegression

# Train the model with training set and training labels
model = LogisticRegression()
model.fit(Xtrain_vectorized, ytrain)


# Predict the transformed test documents
predictions = model.predict(vect.transform(Xtest.values.astype('U')))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
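
The message above means the solver stopped at its iteration limit (`max_iter`, 100 by default) before converging, so the coefficients and metrics that follow come from a fit that had not fully converged. One option the warning itself suggests is to raise `max_iter`. The sketch below does this on small synthetic data; the data and the value `1000` are assumptions for illustration, not settings used in this notebook.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic count-like features standing in for the document-term matrix (assumption)
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 20)).astype(float)
y = (X[:, 0] + X[:, 1] > X[:, 2] + X[:, 3]).astype(int)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # max_iter=1000 is an illustrative value, not a tuned one
    clf = LogisticRegression(max_iter=1000).fit(X, y)

# True only if a ConvergenceWarning was raised during the fit
hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("iterations used:", clf.n_iter_[0], "| hit the limit:", hit_limit)
```

On the real data a larger limit makes each fit slower, so it may be worth checking `n_iter_` after fitting to see how close the solver came to the limit.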


### Obtaining Performance metrics for Word count model

In [17]:
from sklearn.metrics import roc_auc_score, precision_score


print('AUC: ', roc_auc_score(ytest, predictions))

print('Precision: ', precision_score(ytest, predictions))

AUC:  0.8472402772762485
Precision:  0.9490078643194985
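
Note that `roc_auc_score` receives the hard 0/1 output of `model.predict` here, so the AUC above describes a single operating point. On hard labels the AUC reduces to the mean of the true-positive and true-negative rates, while passing scores (for example `model.predict_proba(...)[:, 1]`) evaluates the full ranking. The pure-Python sketch below shows the difference on made-up labels and scores; none of the numbers come from this notebook.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Assumed toy labels and positive-class scores
labels = [0, 0, 1, 1, 1]
scores = [0.1, 0.6, 0.4, 0.8, 0.9]
hard = [1 if s >= 0.5 else 0 for s in scores]   # thresholded, like a predict() output

print(round(roc_auc(labels, scores), 4))  # 0.8333
print(round(roc_auc(labels, hard), 4))    # 0.5833
```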


#### Word count Extreme words

In [18]:
feature_names = np.array(vect.get_feature_names_out())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['disappointing' 'worst' 'concept' 'poorly' 'worthless' 'useless'
 'horrible' 'returning' 'disappointment' 'shame']

Largest Coefs: 
['excellent' 'worry' 'awesome' 'amazing' 'satisfied' 'negative'
 'fantastic' 'complaints' 'pleased' 'perfect']
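
The two slices used above are easy to misread, so here is a small sketch of how `argsort` combined with `[:k]` and `[:-(k + 1):-1]` picks both extremes. The vocabulary and coefficients are invented for illustration.

```python
import numpy as np

# Assumed toy vocabulary and coefficients (not from the fitted model)
feature_names = np.array(['awful', 'bad', 'fine', 'good', 'great', 'ok'])
coef = np.array([-2.1, -1.3, 0.2, 1.1, 1.9, 0.0])

order = coef.argsort()                         # indices from most negative to most positive
k = 2
smallest = feature_names[order[:k]]            # first k entries
largest = feature_names[order[:-(k + 1):-1]]   # last k entries, walked backwards

print(smallest)  # ['awful' 'bad']
print(largest)   # ['great' 'good']
```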




## Tf-idf vector

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Minimum document frequency 5
vect = TfidfVectorizer(min_df=5).fit(Xtrain.values.astype('U'))
len(vect.get_feature_names_out())

16962
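
The drop in vocabulary size comes from `min_df=5`, which discards terms that appear in fewer than five documents. The sketch below reproduces the idea in pure Python on a tiny invented corpus with `min_df=2`, and applies the weighting that `TfidfVectorizer` uses by default, assuming its standard settings (smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row).

```python
import math
from collections import Counter

# Assumed toy corpus, already tokenized
docs = [["great", "crib", "great"], ["crib", "broke"], ["great", "value"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

# Keep only terms that appear in at least min_df documents
min_df = 2
vocab = sorted(t for t, c in df.items() if c >= min_df)

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0   # L2 norm, guarding empty rows
    return [v / norm for v in raw]

print(vocab)                                       # ['crib', 'great']
print([round(v, 4) for v in tfidf_row(docs[0])])   # [0.4472, 0.8944]
```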

In [25]:
Xtrain_vectorized = vect.transform(Xtrain.values.astype('U'))

model = LogisticRegression()
model.fit(Xtrain_vectorized, ytrain)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Performance of Tf-idf vector

In [26]:
predictions = model.predict(vect.transform(Xtest.values.astype('U')))


print('AUC: ', roc_auc_score(ytest, predictions))
print('Precision: ', precision_score(ytest, predictions))

AUC:  0.8323408454703418
Precision:  0.9425091418987654


#### Extreme words Tf-Idf vector

In [28]:
feature_names = np.array(vect.get_feature_names_out())

sorted_tfidf_index = Xtrain_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['emotionally' 'routed' 'paragraph' 'consi' 'court' 'ly' '249' 'thumping'
 'bumpier' 'pds']

Largest tfidf: 
['awesome' 'nice' 'flimsy' 'love' 'good' 'works' 'gracias' 'excellent'
 'excelente' 'excelent']


In [29]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'disappointed' 'returned' 'useless' 'returning' 'return' 'waste'
 'disappointing' 'poor' 'worst']

Largest Coefs: 
['love' 'great' 'easy' 'perfect' 'loves' 'best' 'perfectly' 'highly'
 'happy' 'glad']


In [30]:
# These reviews are treated the same by our current model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]
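
The identical predictions are expected: with single-word features the two sentences contain exactly the same words, only in a different order. A quick pure-Python check, using an assumed tokenizer that mimics the vectorizer's default pattern:

```python
import re
from collections import Counter

def unigram_counts(text):
    # Lowercase and keep tokens of two or more word characters
    return Counter(re.findall(r"(?u)\b\w\w+\b", text.lower()))

a = unigram_counts('not an issue, phone is working')
b = unigram_counts('an issue, phone is not working')

print(a == b)  # True
```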


## n-grams

In [31]:
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(Xtrain.values.astype('U'))

Xtrain_vectorized = vect.transform(Xtrain.values.astype('U'))

len(vect.get_feature_names_out())

190992
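
Adding 2-grams is what lets the feature space encode local word order, which also explains the much larger vocabulary. The sketch below builds 1-grams and 2-grams by hand for two example sentences; the helper `ngrams` is hypothetical and only approximates what `CountVectorizer(ngram_range=(1, 2))` extracts.

```python
import re

def ngrams(text, n_min=1, n_max=2):
    # Tokenize, then join every run of n consecutive tokens for n_min <= n <= n_max
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

a = set(ngrams('not an issue, phone is working'))
b = set(ngrams('an issue, phone is not working'))

print(sorted(a - b))  # ['is working', 'not an']
print(sorted(b - a))  # ['is not', 'not working']
```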

In [32]:
model = LogisticRegression()
model.fit(Xtrain_vectorized, ytrain)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

### Performance for N-Gram Count Vector

In [34]:
predictions = model.predict(vect.transform(Xtest.values.astype('U')))

print('AUC: ', roc_auc_score(ytest, predictions))
print('Precision: ', precision_score(ytest, predictions))

AUC:  0.8919209034316948
Precision:  0.9639553249097473


#### Extreme words n-gram count vector

In [35]:
feature_names = np.array(vect.get_feature_names_out())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not worth' 'not recommend' 'disappointing' 'two stars' 'not happy'
 'to love' 'wouldn recommend' 'useless' 'very disappointed' 'worst']

Largest Coefs: 
['my only' 'not too' 'excellent' 'perfect' 'awesome' 'not leak'
 'be disappointed' 'just what' 'high quality' 'fantastic']


In [37]:
# With 1-grams and 2-grams the first review is now predicted positive;
# the second is still predicted positive as well (see the output below)
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 1]


In [39]:
len(Xtest) + len(Xtrain)

166752

# II. Using Turi Create's `logistic_classifier`

In [1]:
import turicreate
import math
import string  # needed by remove_punctuation below

### Data prep

In [2]:
data = turicreate.SFrame('./data/amazon_baby.sframe/')
data

name,review,rating
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0


## Word Count Vector

In [48]:
# Remove neutral reviews
data = data[data['rating'] != 3]

len(data) # 166,752

166752

In [52]:
# Remove punctuation

def remove_punctuation(text):
    # Delete every character listed in string.punctuation (no replacement character)
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

# Create array of punctuation-less reviews
review_without_punctuation = data['review'].apply(remove_punctuation)
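
One side effect of deleting punctuation without a replacement is that words separated only by punctuation get fused. Tokens such as `betterive` in the coefficient tables further down are consistent with this, although that link is an inference. The sketch below contrasts the deletion approach with a hypothetical alternative, `punctuation_to_spaces`, that is not part of this notebook.

```python
import string

def remove_punctuation(text):
    # Deletes punctuation outright, so neighbouring words can fuse
    return text.translate(str.maketrans('', '', string.punctuation))

def punctuation_to_spaces(text):
    # Hypothetical alternative: map each punctuation character to a space instead
    return text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))

sample = "It works better.I've tried the look/feel"
print(remove_punctuation(sample))     # It works betterIve tried the lookfeel
print(punctuation_to_spaces(sample))  # It works better I ve tried the look feel
```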

In [53]:
# Create a word count column from every punctuation-less review
data['word_count'] = turicreate.text_analytics.count_words(review_without_punctuation)

In [54]:
data['sentiment'] = data['rating'].apply(lambda r: +1 if r>3 else -1)

## Train-Test split

In [60]:
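# sklearn's train_test_split defaults to a 75/25 split, so use the same fraction here
# (random_split samples rows at random, so the resulting sizes match only approximately)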
train_data, test_data = data.random_split(0.75, seed=0)

In [62]:
print(len(test_data))

41630


## Train Logistic Classifier with Word count

In [63]:
# Turi Create logistic classifier implementation.

sentiment_model = turicreate.logistic_classifier.create(train_data,
                                                        target = 'sentiment',
                                                        features=['word_count'],
                                                        validation_set=None)

In [67]:
len(sentiment_model.coefficients)

116754

In [66]:
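# Coefficients of the fitted model, one row per feature (plus the intercept)
weights = sentiment_model.coefficients
weights.sort('value', ascending=False).print_rows(num_rows=10)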

+------------+-------------------+-------+--------------------+--------+
|    name    |       index       | class |       value        | stderr |
+------------+-------------------+-------+--------------------+--------+
| word_count |       offsi       |   1   | 22.24657743326877  |  None  |
| word_count |     apathetic     |   1   | 21.940532564887228 |  None  |
| word_count | intentionalbuying |   1   | 21.940532564887228 |  None  |
| word_count |   pullingguiding  |   1   | 21.940532564887228 |  None  |
| word_count |     inversely     |   1   | 21.940532564887228 |  None  |
| word_count |    unsurvivable   |   1   | 21.940532564887228 |  None  |
| word_count |    strongermore   |   1   | 21.940532564887228 |  None  |
| word_count |     betterive     |   1   | 21.940532564887228 |  None  |
| word_count |    featurebelt    |   1   | 21.940532564887228 |  None  |
| word_count |     restupper     |   1   | 21.940532564887228 |  None  |
+------------+-------------------+-------+--------------------+--------+

In [68]:
weights.sort('value', ascending=True).print_rows(num_rows=10)

+------------+----------------+-------+--------------------+--------+
|    name    |     index      | class |       value        | stderr |
+------------+----------------+-------+--------------------+--------+
| word_count |  chairsthere   |   1   | -32.12151548266517 |  None  |
| word_count |     winsof     |   1   |  -22.731830657705  |  None  |
| word_count |    leachops    |   1   |  -22.731830657705  |  None  |
| word_count |   aboutmommy   |   1   | -22.34199389422742 |  None  |
| word_count |   superibibs   |   1   | -22.34199389422742 |  None  |
| word_count | countertoptray |   1   | -22.34199389422742 |  None  |
| word_count | carrierupdate  |   1   | -22.34199389422742 |  None  |
| word_count |   rubberthe    |   1   | -22.34199389422742 |  None  |
| word_count |    towelin     |   1   | -22.34199389422742 |  None  |
| word_count |  resultsboth   |   1   | -22.34199389422742 |  None  |
+------------+----------------+-------+--------------------+--------+
[116754 rows x 5 columns]

In [69]:
predictions = sentiment_model.predict(test_data)

In [70]:
def get_classification_accuracy(model, data, true_labels):
    N = len(true_labels)
    # First get the predictions
    predictions = model.predict(data)
    
    # Compute the number of correctly classified examples
    num_correct = 0
    for i in range(N):
        num_correct += (predictions[i] == true_labels[i])
    # Then compute accuracy by dividing num_correct by total number of examples
    accuracy = num_correct / N
    
    return accuracy

get_classification_accuracy(sentiment_model, test_data, test_data['sentiment'])

0.9203699255344703

In [76]:
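predictions = sentiment_model.predict(test_data)
# Map the predicted classes from {-1, +1} to {0, 1}
predictions_norm = predictions.apply(lambda x: 0 if x<0 else 1)

# Caveat: these are hard class labels, not probabilities, and the targets are still
# encoded as -1/+1, so this AUC is not directly comparable to a score-based AUC
turicreate.evaluation.auc(test_data['sentiment'], predictions_norm)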

0.6573432644224521

# Conclusions


The sklearn models perform better on the AUC values reported above (0.83 to 0.89 versus 0.66 for Turi Create), and they also tokenize the text automatically, provided a sufficiently large dataset is available.

The set of most significant words at both extremes from Turi Create