# Exploring Logistic Regression/Feature Engineering

The goal of this assignment is to explore logistic regression and feature engineering with existing GraphLab Create functions.

In this assignment, you will use product review data from Amazon.com to predict whether the sentiment of a product review is positive or negative. You will:

* Use SFrames to do some feature engineering.
* Train a logistic regression model to predict the sentiment of product reviews.
* Inspect the weights (coefficients) of a trained logistic regression model.
* Make a prediction (both class and probability) of sentiment for a new product review.
* Given the logistic regression weights, predictors and ground truth labels, write a function to compute the accuracy of the model.
* Inspect the coefficients of the logistic regression model and interpret their meanings.
* Compare multiple logistic regression models.

In [1]:
import graphlab
# import sframe

In [2]:
products = graphlab.SFrame('amazon_baby.gl/')



[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1533806212.log


In [3]:
import re

In [4]:
# Replace missing reviews with empty strings so the text processing below does not fail
products = products.fillna('review', '')

In [5]:
# def remove_punctuation(text):

#     reg_ex = r"[\.\?\,\!\;\s\:\)\(]+"
#     p = re.compile(reg_ex)
#     array = p.split(text)
#     array.pop()
#     return ' '.join(array)
def remove_punctuation(text):
    # Python 2 only: str.translate(None, chars) deletes every character in chars
    import string
    return text.translate(None, string.punctuation)



In [6]:
products

name,review,rating
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0


In [7]:
review_clean = products['review'].apply(remove_punctuation)
products['word_count'] = graphlab.text_analytics.count_words(review_clean)
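
To make the `word_count` feature concrete, here is a minimal sketch of a bag-of-words count using only the standard library. The helper name `count_words_sketch` is hypothetical, and lowercasing plus whitespace splitting is an assumption about the tokenization; `graphlab.text_analytics.count_words` may differ in its details.

```python
from collections import Counter

def count_words_sketch(text):
    # Hypothetical stand-in for a bag-of-words counter:
    # lowercase the text, split on whitespace, and count each token.
    return dict(Counter(text.lower().split()))

counts = count_words_sketch("it came early and was not disappointed i love it")
print(sorted(counts.items()))
```

Each review becomes a dictionary from word to count, which is the shape of the `word_count` column used as the model's only feature column.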


In [8]:
products.head()


name,review,rating,word_count
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0,"{'and': 5, 'stink': 1, 'because': 1, 'ordered': ..."
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0,"{'and': 3, 'love': 1, 'it': 3, 'highly': 1, ..."
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0,"{'and': 2, 'quilt': 1, 'it': 1, 'comfortable': ..."
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0,"{'and': 3, 'ingenious': 1, 'love': 2, 'what': 1, ..."
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0,"{'and': 2, 'all': 2, 'help': 1, 'cried': 1, ..."
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0,"{'and': 2, 'this': 2, 'her': 1, 'help': 2, ..."
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0,"{'shop': 1, 'noble': 1, 'is': 1, 'it': 1, 'as': ..."
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0,"{'and': 2, 'all': 1, 'right': 1, 'had': 1, ..."
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0,"{'and': 1, 'fantastic': 1, 'help': 1, 'give': 1, ..."
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0,"{'all': 1, 'standarad': 1, 'another': 1, 'when': ..."


In [9]:
products = products[products['rating']!=3]

In [10]:
len(products)

166752

In [11]:
products['sentiment'] = products['rating'].apply(lambda rating: +1 if rating >3 else -1)

In [12]:
products['sentiment']

dtype: int
Rows: 166752
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... ]

In [13]:
train_data, test_data = products.random_split(.8, seed=1)
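
As an illustration of what a seeded random split does, the sketch below sends each row to the first part with probability 0.8, using a seeded generator so the split is reproducible. This is an assumption-laden sketch, not a description of how `SFrame.random_split` is implemented; `random_split_sketch` and the toy rows are hypothetical.

```python
import random

def random_split_sketch(rows, fraction, seed):
    # Hypothetical helper: each row goes to the first part with
    # probability `fraction`; the seed makes the split repeatable.
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(10000))
train_rows, test_rows = random_split_sketch(rows, 0.8, seed=1)
print(len(train_rows) + len(test_rows))
```

With a per-row random draw, the two parts are only approximately 80% and 20% of the data, and rerunning with the same seed gives the same split.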

# Now we will generate a vector of word counts (bag-of-words features)

In [14]:
print len(test_data)
print len(train_data)

33336
133416


In [14]:
# from sklearn.feature_extraction.text import CountVectorizer


In [15]:
# vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
#this token pattern is used to keep single-letter words

# train_matrix = vectorizer.fit_transform(train_data['review_clean'])


## convert the training data into a sparse matrix (above)
## next, convert the test data into a sparse matrix, using the same word-to-column mapping

In [16]:
# test_matrix = vectorizer.transform(test_data['review_clean'])

In [28]:
# from sklearn.linear_model import LogisticRegression

Learn a logistic regression classifier using the training data. With scikit-learn, create an instance of the LogisticRegression class and call its fit() method to train the classifier. The model should use the sparse word-count matrix (train_matrix) as features and the sentiment column of train_data as the target. Use the default values for the other parameters, and call the model sentiment_model.
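
As a rough illustration of what training a logistic regression classifier involves, the sketch below fits weights on word-count dictionaries with plain batch gradient ascent on the log-likelihood. It is not the solver used by scikit-learn or GraphLab Create, and it applies no regularization; the name `train_logistic_sketch`, the step size, the iteration count, and the toy reviews are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_sketch(word_counts, labels, step=0.1, iters=200):
    # Hypothetical trainer: labels are +1 / -1, features are word-count dicts.
    weights = {}      # word -> coefficient
    intercept = 0.0
    for _ in range(iters):
        grad_w, grad_b = {}, 0.0
        for counts, y in zip(word_counts, labels):
            score = intercept + sum(weights.get(w, 0.0) * c for w, c in counts.items())
            # Gradient of the log-likelihood: indicator(y == +1) - P(y = +1 | x)
            error = (1.0 if y == 1 else 0.0) - sigmoid(score)
            grad_b += error
            for w, c in counts.items():
                grad_w[w] = grad_w.get(w, 0.0) + error * c
        intercept += step * grad_b
        for w, g in grad_w.items():
            weights[w] = weights.get(w, 0.0) + step * g
    return weights, intercept

toy_counts = [{'love': 1, 'great': 1}, {'great': 2},
              {'broke': 1, 'terrible': 1}, {'terrible': 2}]
toy_labels = [1, 1, -1, -1]
weights, intercept = train_logistic_sketch(toy_counts, toy_labels)
print(sorted(weights.items()))
```

Words that appear only in positive reviews end up with positive coefficients, and words that appear only in negative reviews end up with negative ones, which is the pattern to look for when inspecting a trained model's weights.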

In [29]:
# logisticRegr = LogisticRegression()

In [15]:
# sentiment_model = logisticRegr.fit(train_matrix, train_data['sentiment'])
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target='sentiment',
                                                      features=['word_count'],
                                                      validation_set=None)

In [16]:
sentiment_model

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 121713
Number of examples             : 133416
Number of classes              : 2
Number of feature columns      : 1
Number of unpacked features    : 121712

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : lbfgs
Solver iterations              : 6
Solver status                  : TERMINATED: Terminated due to numerical difficulties.
Training time (sec)            : 4.8883

Settings
--------
Log-likelihood                 : inf

Highest Positive Coefficients
-----------------------------
word_count[mobileupdate]       : 41.9847
word_count[placeid]            : 41.7354
word_count[labelbox]           : 41.151
word_count[httpwwwamazoncomreviewrhgg6qp7tdnhbrefcmcrprcmtieutf8asinb00318cla0nodeid] : 40.0454
word_count[knobskeeping]       : 36.2091

Lowest Negative Coeffi

In [18]:
# sentiment_model.coef_

weights = sentiment_model.coefficients
weights.column_names()

['name', 'index', 'class', 'value', 'stderr']

In [49]:
# sentiment_model.coef_[0]

array([-2.81793672, -2.62622303, -2.59970587, ...,  1.90700635,
        1.90859766,  1.96212432])

In [51]:
# sentiment_model.intercept_

array([ 1.37452252])

In [55]:
# array = graphlab.SArray(sentiment_model.coef_)

In [56]:
# array

dtype: array
Rows: 1
[array('d', [-2.817936721464668, -2.6262230255105123, -2.5997058667803024, -2.579634876974758, -2.472968248993654, -2.459335687834939, -2.4327366115688425, -2.2857585298067815, -2.2616563835738823, -2.246871823111771, -2.205164853196803, -2.1744598166993754, -2.084153080861963, -2.0612809832792913, -2.0210629720013222, -2.009344241247645, -1.9933031671176356, -1.9593412694468217, -1.9541728072725137, -1.9374568631460036, -1.858736165011739, -1.8505198745862081, -1.8490081705870416, -1.8458747949229122, -1.8356416687899624, -1.8251342653525435, -1.8224029032599818, -1.8018343757621689, -1.7592577164175522, -1.756481217736178, -1.7197397706400952, -1.7103470464432968, -1.7089518815453866, -1.696416057541035, -1.682016186017823, -1.6806203304349858, -1.6702464780238477, -1.6700238172185113, -1.6612906822996436, -1.6378105458158683, -1.6264470151507664, -1.6249240108274052, -1.6200859502601226, -1.6170191334758148, -1.6077013359276833, -1.5942699854816658, -1.594070706

In [61]:
non_neg_coef = []
def nonneg_coef(i):
    if i >= 0:
        non_neg_coef.append(i)
    return non_neg_coef


In [62]:
non_negs = array.apply(nonneg_coef)

In [64]:
print(list(sentiment_model.coef_))

[array([-2.81793672, -2.62622303, -2.59970587, ...,  1.90700635,
        1.90859766,  1.96212432])]


In [65]:
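coef_dict = {}
# Caution: coef_ has shape (1, n_features), so zipping it with the rows of
# train_matrix yields a single pair: (whole coefficient vector, first review row).
# To map words to weights, pair the vectorizer's vocabulary with coef_[0] instead.
for coef, feat in zip(sentiment_model.coef_, train_matrix):
    coef_dict[feat] = coef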

In [66]:
coef_dict

{<1x57185 sparse matrix of type '<type 'numpy.int64'>'
 	with 24 stored elements in Compressed Sparse Row format>: array([-2.81793672, -2.62622303, -2.59970587, ...,  1.90700635,
         1.90859766,  1.96212432])}

In [73]:
sentiment_model.coef_


array([[-2.81793672, -2.62622303, -2.59970587, ...,  1.90700635,
         1.90859766,  1.96212432]])

In [76]:
import numpy
nonneg_coef = []

In [78]:
for x in numpy.nditer(sentiment_model.coef_):
    if x >= 0:
        nonneg_coef.append(x)
        

In [79]:
nonneg_coef

[array(7.765641003538354e-09),
 array(1.1424898766396385e-08),
 array(2.503635789462224e-08),
 array(2.503635789462224e-08),
 array(2.5720381282060324e-08),
 array(2.5814507623585254e-08),
 array(2.6018944895522855e-08),
 array(3.662292132970274e-08),
 array(5.007271578924448e-08),
 array(5.529031554884796e-08),
 array(5.652239495545362e-08),
 array(5.75784553813202e-08),
 array(5.75784553813202e-08),
 array(5.928191404079741e-08),
 array(5.928191404079741e-08),
 array(6.474535701164077e-08),
 array(6.474535701164077e-08),
 array(6.474535701164077e-08),
 array(6.849418353633313e-08),
 array(8.234781899204441e-08),
 array(8.387222810735315e-08),
 array(8.739094110039777e-08),
 array(8.739094110039777e-08),
 array(1.0217417083948238e-07),
 array(1.0292699015048317e-07),
 array(1.0894126296043881e-07),
 array(1.0894126296043881e-07),
 array(1.1048824351823976e-07),
 array(1.1048824351823976e-07),
 array(1.1560545603786431e-07),
 array(1.204877718474158e-07),
 array(1.267231532989908e-07),

In [80]:
len(nonneg_coef)

39160
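
The counting loop above can also be written as a single expression. The sketch below uses a handful of coefficient values copied from the outputs in this notebook purely for illustration; the variable names are hypothetical.

```python
# A few coefficient values taken from the outputs above, for illustration only.
coefficients = [-2.81793672, -2.62622303, 7.765641003538354e-09, 1.96212432]

# Count the entries that are greater than or equal to zero.
num_nonneg = sum(1 for w in coefficients if w >= 0)
print(num_nonneg)
```

Counting directly avoids building an intermediate list of zero-dimensional arrays just to take its length.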

# Quiz Question for step 8

In [81]:
sample_test_data = test_data[10:13]

In [82]:
print sample_test_data

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
|   Our Baby Girl Memory Book   | Absolutely love it and all... |  5.0   |
| Wall Decor Removable Decal... | Would not purchase again o... |  2.0   |
| New Style Trailing Cherry ... | Was so excited to get this... |  1.0   |
+-------------------------------+-------------------------------+--------+
+-------------------------------+-----------+
|          review_clean         | sentiment |
+-------------------------------+-----------+
| Absolutely love it and all... |     1     |
| Would not purchase again o... |     -1    |
| Was so excited to get this... |     -1    |
+-------------------------------+-----------+
[3 rows x 5 columns]



In [83]:
sample_test_data[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [84]:
sample_test_data[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

In [85]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])

In [86]:
scores = sentiment_model.decision_function(sample_test_matrix)

In [87]:
print scores

[ 1.72579872  3.75315243  8.0066879 ]


In [88]:
sample_test_data[2]['review']

"Was so excited to get this product for my baby girls bedroom!  When I got it the back is NOT STICKY at all!  Every time I walked into the bedroom I was picking up pieces off of the floor!  Very very frustrating!  Ended up having to super glue it to the wall...very disappointing.  I wouldn't waste the time or money on it."

In [93]:
sentiment_model.predict(sample_test_matrix)



array([1, 1, 1])
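
These class predictions follow from the sign of the scores printed earlier: a positive score maps to +1 and anything else to -1. A minimal sketch, using the rounded scores from the output above (the helper name `predict_class` is hypothetical):

```python
scores = [1.72579872, 3.75315243, 8.0066879]

def predict_class(score):
    # A score above zero corresponds to a predicted probability above 0.5.
    return 1 if score > 0 else -1

predictions = [predict_class(s) for s in scores]
print(predictions)
```

All three scores are positive, so all three reviews are predicted as +1, even though the second and third sample reviews are labeled -1.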

In [104]:
import math
def find_probability(score):
    # score is already the linear term w^T x, so the sigmoid is applied to it directly
    return 1.0 / (1.0 + math.exp(-score))


In [105]:
scores = graphlab.SArray(scores)

In [106]:
print scores

[1.7257987241340567, 3.7531524305841355, 8.006687895142365]


In [107]:
results = scores.apply(find_probability)

In [111]:
sentiment_model.predict_proba(sample_test_matrix)



array([[  1.51125758e-01,   8.48874242e-01],
       [  2.29067060e-02,   9.77093294e-01],
       [  3.33115572e-04,   9.99666884e-01]])
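
The second column of this output is the probability of the +1 class, and it can be reproduced from the scores with the sigmoid link, P(y = +1 | x) = 1 / (1 + exp(-score)). A minimal sketch using the scores printed earlier (the names `sigmoid` and `probabilities` are chosen for this example):

```python
import math

def sigmoid(score):
    # Logistic link: maps a real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

scores = [1.7257987241340567, 3.7531524305841355, 8.006687895142365]
probabilities = [sigmoid(s) for s in scores]
print(probabilities)
```

The first column of the output is then one minus each of these values.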

In [113]:
sentiment_model.predict_proba?

In [114]:
test_matrix = vectorizer.transform(test_data['review_clean'])

In [117]:
sentiment_model.predict_proba(test_matrix)

array([[  1.22217290e-04,   9.99877783e-01],
       [  6.75160295e-04,   9.99324840e-01],
       [  8.16350467e-02,   9.18364953e-01],
       ..., 
       [  7.17341314e-01,   2.82658686e-01],
       [  8.24361859e-03,   9.91756381e-01],
       [  3.88564987e-02,   9.61143501e-01]])
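
One of the objectives of this assignment is a function that computes accuracy from predictions and ground-truth labels. A minimal sketch follows; the name `compute_accuracy` is hypothetical, and the example call uses the three sample reviews shown earlier (predicted +1, +1, +1 against labels +1, -1, -1).

```python
def compute_accuracy(predictions, labels):
    # Fraction of predictions that match the ground-truth labels.
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return float(correct) / len(labels)

sample_accuracy = compute_accuracy([1, 1, 1], [1, -1, -1])
print(sample_accuracy)
```

The same function could be applied to class predictions for the whole test set and the `sentiment` column of `test_data`.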

In [118]:
test_data.show()

Canvas is accessible via web browser at the URL: http://localhost:36619/index.html
Opening Canvas in default web browser.


In [119]:
graphlab.canvas.set_target('ipynb')

In [121]:
test_data.num_rows()

33336

In [122]:
id = range(1, len(test_data) + 1)  # 1-based row ids (in Python 2, range returns a list)

In [125]:
id = graphlab.SArray(id)

In [128]:
test_data['id'] = id

In [129]:
test_data

name,review,rating,review_clean,sentiment,id
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0,This has been an easy way for my nanny to record ...,1,1
"Baby Tracker&reg; - Daily Childcare Journal, ...",I love this journal and our nanny uses it ...,4.0,I love this journal and our nanny uses it ...,1,2
Nature's Lullabies First Year Sticker Calendar ...,"I love this little calender, you can keep ...",5.0,I love this little calender you can keep ...,1,3
Nature's Lullabies Second Year Sticker Calendar ...,"I had a hard time finding a second year calendar, ...",5.0,I had a hard time finding a second year calendar ...,1,4
"Lamaze Peekaboo, I Love You ...","One of baby's first and favorite books, and i ...",4.0,One of baby's first and favorite books and it is ...,1,5
"Lamaze Peekaboo, I Love You ...",My son loved this book as an infant. It was ...,5.0,My son loved this book as an infant It was perfect ...,1,6
"Lamaze Peekaboo, I Love You ...",Our baby loves this book & has loved it for a ...,5.0,Our baby loves this book & has loved it for a ...,1,7
"SoftPlay Giggle Jiggle Funbook, Happy Bear ...",This bear is absolutely adorable and I would ...,2.0,This bear is absolutely adorable and I would ...,-1,8
SoftPlay Peek-A-Boo Where's Elmo A Childr ...,I bought two for recent baby showers! The book ...,5.0,I bought two for recent baby showers The book is ...,1,9
Baby's First Year Undated Wall Calendar with ...,I searched high and low for a first year cale ...,5.0,I searched high and low for a first year cale ...,1,10


In [130]:
import string

In [132]:
'elizabeth a.....! asdfkj '.translate(None, string.punctuation)

'elizabeth a asdfkj '
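
The two-argument form `translate(None, string.punctuation)` works only on Python 2 byte strings; on Python 3, `str.translate` takes a single table argument and the call raises a `TypeError`. A sketch of a helper that behaves the same on both versions (the name `remove_punctuation_portable` is hypothetical):

```python
import string

def remove_punctuation_portable(text):
    # Hypothetical helper that strips ASCII punctuation on Python 2 and 3.
    if str is bytes:
        # Python 2: str.translate accepts a deletechars argument.
        return text.translate(None, string.punctuation)
    # Python 3: build a table that maps every punctuation character to None.
    return text.translate(str.maketrans("", "", string.punctuation))

cleaned = remove_punctuation_portable("elizabeth a.....! asdfkj ")
print(repr(cleaned))
```

On the test string used above, this returns the same result as the Python 2 call.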