# Predicting Sentiment From Product Reviews

# Fire Up GraphLab Create
(See [Getting Started with SFrames](../../week-1/work/Getting-Started-With-SFrames.ipynb) for setup instructions)

In [1]:
# Ignore GraphLab
# import graphlab

# Use Pandas
import pandas as pd
# Use NumPy
import numpy as np

In [None]:
# Limit number of worker processes. This preserves system memory, which prevents hosted notebooks from crashing.
# graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)

# Read Some Product Review Data

Loading reviews for a set of baby products.

In [2]:
# products = graphlab.SFrame('amazon_baby.gl/')

# Define the column type (if needed)
products_dtype = {
    "name": str,
    "review": str,
    "rating": int,
    "sentiment": int
}

# Import the CSV (comma-separated values) file
# products = pd.read_csv("amazon_baby.csv", dtype=products_dtype)
products = pd.read_csv("amazon_baby.csv")

In [3]:
# Fill the empty value(s) in the `review` column
# The imported value is `NaN`, fill it with empty (blank) value
products = products.fillna({"review": ""})
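
Filling the missing reviews matters because a later string operation such as `translate` is not defined for `NaN` (a float). The following is a minimal sketch on a made-up two-row frame (the `toy` data is hypothetical), showing that passing a dictionary to `fillna` only touches the named column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing name and one missing review
toy = pd.DataFrame({"name": [np.nan, "B"], "review": ["ok", np.nan]})

# Only the `review` column is filled; `name` keeps its NaN
filled = toy.fillna({"review": ""})
print(filled)
```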

# Let's Explore This Data Together

Data includes the product name, the review text and the rating of the review.

In [4]:
products.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


# Build The Word Count Vector For Each Review

In [5]:
def remove_punctuation(text):
    """Remove punctuation(s) from a line of string text
    
    Args:
        text (str): The line of string text
        
    Returns:
        A string for the line of text with punctuation(s) removed
    """
    # Use the string library
    import string
    # Use `str.maketrans` to build a translation table
    # Use `translate` and the translation table to remove punctuation
    return text.translate(str.maketrans("", "", string.punctuation))

# Create a new column `review_clean` from `review` with punctuation removed
products["review_clean"] = products["review"].apply(remove_punctuation)
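
To see what the translation table does, here is a small standalone sketch on a made-up sentence (the `sample` string and the `strip_punctuation` name are illustrative only). Note that hyphenated or apostrophized words are merged rather than split, which matches cleaned values such as `nonstop` in the table output further down:

```python
import string

def strip_punctuation(text):
    # Delete every character listed in string.punctuation
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "It's non-stop fun... isn't it?"
cleaned = strip_punctuation(sample)
print(cleaned)  # Its nonstop fun isnt it
```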

In [None]:
# products['word_count'] = graphlab.text_analytics.count_words(products['review'])
# vectorizer = CountVectorizer()
# re.split("\W+", products["review"][0])

In [6]:
products.head()

Unnamed: 0,name,review,rating,review_clean
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3,These flannel wipes are OK but in my opinion n...
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...


In [None]:
# graphlab.canvas.set_target('ipynb')

In [None]:
# products['name'].show()

# Examining The Reviews For The Most-Sold Product: "Vulli Sophie the Giraffe Teether"

In [None]:
# giraffe_reviews = products[products["name"] == "Vulli Sophie the Giraffe Teether"]

In [None]:
# len(giraffe_reviews)

In [None]:
# giraffe_reviews['rating'].show(view='Categorical')

# Build Sentiment Classifier

In [None]:
# products['rating'].show(view='Categorical')

## Define Positive Versus Negative Sentiment

We will ignore all reviews with `rating = 3`, since they tend to have a neutral sentiment. Reviews with a rating of 4 or higher will be considered positive, while the ones with rating of 2 or lower will have a negative sentiment.

In [7]:
# Ignore all 3* reviews
products = products[products["rating"] != 3]

In [None]:
# Positive sentiment = 4* or 5* reviews
# products["sentiment"] = products["rating"] >= 4

In [8]:
# Simple function to get positive or negative sentiment based on rating
def get_sentiment(rating):
    return +1 if rating > 3 else -1

# Apply the get_sentiment function to the rating column and save the result to a new sentiment column
products["sentiment"] = products["rating"].apply(get_sentiment)
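
As an alternative to `apply`, the same rule (ratings above 3 map to +1, the rest to -1, with 3-star reviews dropped first) can be expressed in vectorized form. This is only a sketch on a hypothetical `toy` frame, not a replacement for the cell above:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, one per star level
toy = pd.DataFrame({"rating": [1, 2, 3, 4, 5]})

# Drop neutral 3-star rows, then label the remainder in one vectorized step
toy = toy[toy["rating"] != 3].copy()
toy["sentiment"] = np.where(toy["rating"] > 3, 1, -1)
print(toy)
```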

In [9]:
products.head()

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...,1
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,When the Binky Fairy came to our house we didn...,1


## Let's Train The Sentiment Classifier

In [10]:
# train_data,test_data = products.random_split(.8, seed=0)

# Use JSON (JavaScript Object Notation)
import json

# Get the train data
# Get the train data index from JSON file
with open("train-idx.json") as train_file:
    train_index = json.load(train_file)
# Select the train data by row position (`iloc` is position-based, not label-based)
train_data = products.iloc[train_index,:]

# Get the test data
# Get the test data index from JSON file
with open("test-idx.json") as test_file:
    test_index = json.load(test_file)
# Select the test data by row position (`iloc` is position-based, not label-based)
test_data = products.iloc[test_index,:]
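
Because 3-star rows were removed earlier, the index labels of `products` have gaps, so it is worth being clear that `iloc` selects by position. A minimal sketch, assuming the JSON files hold row positions (the `toy` frame and the inline JSON string are made up for illustration):

```python
import json
import pandas as pd

# Hypothetical frame whose index labels have a gap (label 12 was filtered out)
toy = pd.DataFrame({"review": ["a", "b", "c", "d", "e"]},
                   index=[10, 11, 13, 14, 15])

# Stand-in for the contents of an index file such as train-idx.json
train_positions = json.loads("[0, 2, 4]")

# iloc picks the 1st, 3rd and 5th rows regardless of their labels
train_rows = toy.iloc[train_positions, :]
print(train_rows.index.tolist())  # [10, 13, 15]
```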

In [11]:
# Use scikit learn CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
# sentiment_model = graphlab.logistic_classifier.create(train_data,
#                                                      target='sentiment',
#                                                      features=['word_count'],
#                                                      validation_set=test_data)

# Create a CountVectorizer
# Convert a collection of text documents to a matrix of token counts
# Use token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
# First learn vocabulary from the train data and assign columns to words
# Then convert the train data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data["review_clean"])
# Second convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data["review_clean"])
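
The fit-then-transform pattern can be checked on a tiny made-up corpus (the documents below are hypothetical). The sketch shows that the custom `token_pattern` keeps the single-letter word `a`, and that words unseen during fitting are silently ignored when the test text is transformed:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["a great toy", "great great price"]
test_docs = ["a great unknown word"]

# Same token pattern as above, so one-letter tokens are kept
toy_vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
train_counts = toy_vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_counts = toy_vectorizer.transform(test_docs)        # reuses the vocabulary

print(sorted(toy_vectorizer.vocabulary_))  # ['a', 'great', 'price', 'toy']
print(train_counts.toarray())              # [[1 1 0 1], [0 2 1 0]]
print(test_counts.toarray())               # [[1 1 0 0]]
```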

# Evaluate The Sentiment Model

In [None]:
# sentiment_model.evaluate(test_data, metric='roc_curve')

In [None]:
# sentiment_model.show(view='Evaluation')

In [13]:
# Use Logistic Regression
from sklearn.linear_model import LogisticRegression

# Create the algorithm
logistic_regression = LogisticRegression(max_iter=2000)

# Train the logistic regression algorithm using train matrix and sentiment output
sentiment_model = logistic_regression.fit(train_matrix, train_data["sentiment"])

In [14]:
# Get the coefficients of the sentiment model (one per word in the vocabulary)
coefficient = sentiment_model.coef_[0]
# Take a look at the number of coefficients
print("Coefficient Length", coefficient.size)
# Count the coefficients that are greater than or equal to 0
print("Positive Coefficient", (coefficient >= 0).sum())

Coefficient Length 121712
Positive Coefficient 91052
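
Each coefficient belongs to one vocabulary word, so the weights can be read back as words. The sketch below trains on four made-up documents (all data and variable names here are hypothetical) and uses the vectorizer's `vocabulary_` mapping to find the most positive and most negative word:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["love it", "great love", "terrible broke", "terrible bad"]
labels = [1, 1, -1, -1]

vec = CountVectorizer(token_pattern=r"\b\w+\b")
X = vec.fit_transform(docs)
model = LogisticRegression().fit(X, labels)

# Invert the word -> column mapping so a column index gives back its word
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

weights = model.coef_[0]
order = np.argsort(weights)  # ascending: most negative first
most_negative = index_to_word[int(order[0])]
most_positive = index_to_word[int(order[-1])]
print(most_positive, most_negative)
```

The same inversion should work on the full `vectorizer` and `sentiment_model`, although with over a hundred thousand words it is usually more useful to look at the top few entries at each end.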


# Sample Evaluation

Pick a few observation records from the **test data** and make predictions.
* Predict the `score` of the sample record, then evaluate it for the sentiment
* Predict the sample record using the `predict` function
* Predict the probability using the `predict_proba` function (it returns 2 values for each observation record, ordered as in `model.classes_`)
    * First is the probability that the output is -1 (negative sentiment)
    * Second is the probability that the output is +1 (positive sentiment)
    * Use `model.predict_proba(test)[:,0]` or `model.predict_proba(test)[:,1]` to get the probability of -1 or +1 respectively
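
The relationship between the three prediction calls can be sketched on a one-feature toy problem (the `X` and `y` arrays are made up). The column order of `predict_proba` follows `classes_`, and for a binary model the second column is the sigmoid of the `decision_function` score:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data with labels -1 and +1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])

clf = LogisticRegression().fit(X, y)
scores = clf.decision_function(X)  # raw linear scores
proba = clf.predict_proba(X)       # one column per entry of clf.classes_
labels = clf.predict(X)            # +1 where the score is positive

print(clf.classes_)                # [-1  1]
sigmoid = 1.0 / (1.0 + np.exp(-scores))
print(np.allclose(proba[:, 1], sigmoid))
```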

In [15]:
# Pick a few sample observation records (copy so a new column can be added safely)
sample_test_data = test_data[10:13].copy()

# Transform the cleaned review text into a sparse word-count matrix
sample_test_matrix = vectorizer.transform(sample_test_data["review_clean"])
# Predict the class probabilities and store them
sample_predicted = sentiment_model.predict_proba(sample_test_matrix)
# Convert the predictions to a list and assign it back to `sample_test_data` as a new column
sample_test_data["predicted_sentiment"] = sample_predicted.tolist()

# Show the new `sample_test_data` DataFrame
sample_test_data


Unnamed: 0,name,review,rating,review_clean,sentiment,predicted_sentiment
59,Our Baby Girl Memory Book,Absolutely love it and all of the Scripture in...,5,Absolutely love it and all of the Scripture in...,1,"[0.0036801888659515614, 0.9963198111340484]"
71,Wall Decor Removable Decal Sticker - Colorful ...,Would not purchase again or recommend. The dec...,2,Would not purchase again or recommend The deca...,-1,"[0.9596424524424022, 0.040357547557597816]"
91,New Style Trailing Cherry Blossom Tree Decal R...,Was so excited to get this product for my baby...,1,Was so excited to get this product for my baby...,-1,"[0.9999703000228383, 2.969997716171231e-05]"
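
Adding a column to a slice of a DataFrame can trigger pandas' `SettingWithCopyWarning`, because pandas cannot tell whether the slice is a view or a copy. A minimal sketch of the usual remedy, taking an explicit `.copy()` before assigning (the `frame` data and the `predicted_positive` column name are hypothetical):

```python
import pandas as pd

frame = pd.DataFrame({"review_clean": ["good", "bad", "fine", "poor"]})

# An explicit copy makes the subset independent of the parent frame
subset = frame[1:3].copy()
subset["predicted_positive"] = [0.1, 0.6]

print(subset)
print(list(frame.columns))  # the parent frame is unchanged
```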


# Applying The Learned Model To Understand Sentiment For Giraffe

In [33]:
# giraffe_reviews['predicted_sentiment'] = sentiment_model.predict(giraffe_reviews, output_type='probability')

# Get the product for "Vulli Sophie the Giraffe Teether" (copy so new columns can be added safely)
giraffe_reviews = products[products["name"] == "Vulli Sophie the Giraffe Teether"].copy()

# Convert (transform) the Giraffe clean review to a matrix
giraffe_test_matrix = vectorizer.transform(giraffe_reviews["review_clean"])

# Predict the `score` (decision function value) for each review
giraffe_reviews["score"] = sentiment_model.decision_function(giraffe_test_matrix)

# Turn the score into a predicted sentiment: +1 if the score is non-negative, else -1
giraffe_reviews["predict_sentiment"] = giraffe_reviews["score"].apply(lambda x: 1 if x >= 0 else -1)

# Predict the class probabilities once; columns follow `sentiment_model.classes_`, here [-1, 1]
giraffe_probability = sentiment_model.predict_proba(giraffe_test_matrix)

# Store P(sentiment = -1) and P(sentiment = +1) as columns of `giraffe_reviews`
giraffe_reviews["predict_probability_0"] = giraffe_probability[:, 0]
giraffe_reviews["predict_probability_1"] = giraffe_probability[:, 1]

In [34]:
giraffe_reviews.head()

Unnamed: 0,name,review,rating,review_clean,sentiment,score,predict_sentiment,predict_probability_0,predict_probability_1
34313,Vulli Sophie the Giraffe Teether,He likes chewing on all the parts especially t...,5,He likes chewing on all the parts especially t...,1,6.825508,1,0.001085,0.998915
34314,Vulli Sophie the Giraffe Teether,My son loves this toy and fits great in the di...,5,My son loves this toy and fits great in the di...,1,7.51325,1,0.000546,0.999454
34315,Vulli Sophie the Giraffe Teether,There really should be a large warning on the ...,1,There really should be a large warning on the ...,-1,-0.967579,-1,0.724637,0.275363
34316,Vulli Sophie the Giraffe Teether,All the moms in my moms' group got Sophie for ...,5,All the moms in my moms group got Sophie for t...,1,4.307709,1,0.013285,0.986715
34317,Vulli Sophie the Giraffe Teether,I was a little skeptical on whether Sophie was...,5,I was a little skeptical on whether Sophie was...,1,-0.115729,-1,0.5289,0.4711
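
The `score` and `predict_probability_1` columns are linked by the logistic (sigmoid) function, which can be checked by hand with three scores copied from the table above. This is a small verification sketch using only the standard library:

```python
import math

def sigmoid(score):
    # Logistic function: maps a real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-score))

# Scores taken from rows 34313, 34315 and 34317 of the table
for score in (6.825508, -0.967579, -0.115729):
    p_pos = sigmoid(score)
    print(f"score={score:+.6f}  P(+1)={p_pos:.6f}  P(-1)={1 - p_pos:.6f}")
```

The last row is a useful reminder that the model can be wrong: the review has a 5-star rating, yet its score is slightly negative, so it is predicted as -1.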


## Sort The Reviews By Predicted Sentiment, Then Explore

In [45]:
# giraffe_reviews = giraffe_reviews.sort('predicted_sentiment', ascending=False)

giraffe_reviews = giraffe_reviews.sort_values(["predict_probability_1", "rating"], ascending=False)
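
Sorting on two keys means `rating` only acts as a tie-breaker among rows with the same probability. A short sketch on made-up values (the `toy` frame is hypothetical) shows the resulting order:

```python
import pandas as pd

toy = pd.DataFrame(
    {"predict_probability_1": [1.0, 0.2, 1.0, 0.7], "rating": [4, 1, 5, 5]},
    index=["a", "b", "c", "d"],
)

# Descending by probability first, then by rating among equal probabilities
ranked = toy.sort_values(["predict_probability_1", "rating"], ascending=False)
print(ranked.index.tolist())  # ['c', 'a', 'd', 'b']
```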

In [46]:
giraffe_reviews.head()

Unnamed: 0,name,review,rating,review_clean,sentiment,score,predict_sentiment,predict_probability_0,predict_probability_1
34892,Vulli Sophie the Giraffe Teether,"Sophie, oh Sophie, your time has come. My gran...",5,Sophie oh Sophie your time has come My grandda...,1,29.361635,1,1.771916e-13,1.0
34434,Vulli Sophie the Giraffe Teether,My Mom-in-Law bought Sophie for my son when he...,5,My MominLaw bought Sophie for my son when he w...,1,23.089634,1,9.382095e-11,1.0
34442,Vulli Sophie the Giraffe Teether,"Yes, it's imported. Yes, it's expensive. And y...",5,Yes its imported Yes its expensive And yes I l...,1,22.588523,1,1.548563e-10,1.0
34515,Vulli Sophie the Giraffe Teether,"As every mom knows, you always want to give yo...",5,As every mom knows you always want to give you...,1,21.263562,1,5.825755e-10,1.0
34410,Vulli Sophie the Giraffe Teether,Our son really likes Sopie...the problem is th...,5,Our son really likes Sopiethe problem is that ...,1,19.950326,1,2.166125e-09,1.0


## Most Positive Reviews For The Giraffe

In [41]:
# giraffe_reviews[0]['review']

giraffe_reviews["review"].iloc[0]

"Sophie, oh Sophie, your time has come. My granddaughter, Violet is 5 months old and starting to teeth. What joy little Sophie brings to Violet. Sophie is made of a very pliable rubber that is sturdy but not tough. It is quite easy for Violet to twist Sophie into unheard of positions to get Sophie into her mouth. The little nose and hooves fit perfectly into small mouths, and the drooling has purpose. The paint on Sophie is food quality.Sophie was born in 1961 in France. The maker had wondered why there was nothing available for babies and made Sophie from the finest rubber, phthalate-free on St Sophie's Day, thus the name was born. Since that time millions of Sophie's populate the world. She is soft and for babies little hands easy to grasp. Violet especially loves the bumpy head and horns of Sophie. Sophie has a long neck that easy to grasp and twist. She has lovely, sizable spots that attract Violet's attention. Sophie has happy little squeaks that bring squeals of delight from Viol

In [42]:
# giraffe_reviews[1]['review']

giraffe_reviews["review"].iloc[1]

'My Mom-in-Law bought Sophie for my son when he was just starting to really chew on things (and we were hearing some pretty scary things about toys not made in the USA). She did some research and came across Sophie and we are so glad that she did! While Sophie doesn\'t come from the USA, we love the fact that she is 100% safe and natural, and my son loves to play with her. I also love how soft Sophie is, my son tends to swing his toys around and when he\'s sitting on my lap I\'m usually in danger of being hit in the face with whatever he\'s holding, needless to say a soft toy is even better in my book! There\'s one last thing I want to comment on, I\'ve read reviews that said that Sophie was a "glorified dog toy" or something to that effect, and I don\'t want to seem rude, but I think they\'re crazy! Yes Sophie does squeak, (which my son didn\'t care about much at first but now he loves) but that\'s about as far as the comparison could go! If you want a quality teething toy for your ch

## Show Most Negative Reviews For Giraffe

In [43]:
# giraffe_reviews[-1]['review']

giraffe_reviews["review"].iloc[-1]

'I wanted to love this product and was excited to buy it when I became pregnant but am now hesitant to let my baby use it after reading about the recall in Europe. Apparently, as I understand it, their toxin standards of measurement are lower than ours so they have not been recalled here (apparently we are OK with low levels of nitrates in the toys our children put in their mouths, but Europeans are not...hmmm)...Be that as it may, toxins registering even CLOSE to a dangerous level made me nervous about using. After digging around online I did discover that the company claims to have changed the product after a certain date and lists manufacturing codes so you can check yours (those listed were made after a certain date and are said to be safer). Sadly mine was not made after the &#34;improved&#34; date but I could not return it because there was no formal recall in our country. I considered returning it and hunting for one with an approved manufacturing date but man that was just too 

In [44]:
# giraffe_reviews[-2]['review']

giraffe_reviews["review"].iloc[-2]

"I was so looking forward to getting this for my little girl, but from the second I opened the box I was disappointed.  It didn't smell like vanilla rubber; it smelled like latex.  I don't get HOW it can be called a teether.  It is a squeak toy.  Period.  It is completely hollow and has an obnoxious squeak when you barely even touch it.  It is so flexible that I don't see how a- it can be safe (which reading some of the other negative reviews I now see that it probably isn't safe) or b- can be effective.  This thing gives at the slightest touch, so how can it possibly aid in cutting teeth or massaging the gums?Additionally, you cannot sterilize this toy.  So... let's see.  My baby got thrush when she was only 2 weeks old and I had to sterilize everything that came in contact with her mouth.  What would I have done with this then?  And she just got over her first cold, so again everything was sterilized.  You need to sterilize things with a baby- you just do.  How can this be a toy that
