# Sentiment Analysis with scikit-learn
* pandas for data preprocessing and feature engineering
* scikit-learn CountVectorizer for NLP feature extraction
* scikit-learn LogisticRegression for model fitting

* Interpret weights (coefficients)
* Make predictions (both class and probability)
* Evaluation.

In [1]:
import pandas as pd
import numpy as np
import math
import string
import json
import re
from bs4 import BeautifulSoup
#from nltk.corpus import stopwords # python3 only
from sklearn import feature_extraction
from sklearn import linear_model
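try:
    from sklearn.model_selection import train_test_split   # scikit-learn >= 0.18
except ImportError:
    from sklearn.cross_validation import train_test_split  # older releases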

# Data Preprocessing

In [2]:
products = pd.read_csv('data/amazon_baby.csv', parse_dates=True)

In [3]:
products.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


For sentiment analysis, **exclude** reviews with *rating = 3*, which we assume carry neutral sentiment.

In [4]:
print len(products)
products = products[products['rating'] != 3]
print len(products)
print
print 'Empty reviews: ', sum(pd.isnull(products.review))
products = products[~pd.isnull(products.review).values]
print len(products)

183531
166752

Empty reviews:  777
165975


## NLP Feature Extraction

Next, some of the classical steps in an NLP pipeline:
1. Remove punctuation.
2. Remove non-letters.
3. Remove stop words.
4. Convert to lower case.
5. Transform the reviews into word-counts.
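
The steps above can be sketched in plain Python. This is only a minimal illustration, not the notebook's implementation: the `STOP_WORDS` set and the `to_word_counts` helper are hypothetical, and `collections.Counter` stands in for `CountVectorizer`.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "and", "is", "it", "this", "i"}

def to_word_counts(text):
    """Turn one raw review into a {word: count} mapping."""
    # Steps 1-2: replace punctuation and every other non-letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Step 4: lower-case, then split on whitespace.
    words = letters_only.lower().split()
    # Step 3: drop stop words.
    words = [w for w in words if w not in STOP_WORDS]
    # Step 5: count the remaining words.
    return Counter(words)

counts = to_word_counts("I love this stroller, and the baby LOVES it!")
print(dict(counts))
```

Replacing punctuation with a space, rather than deleting it, avoids fusing two words into one token when the punctuation between them has no surrounding whitespace.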

In [5]:
def remove_punctuation(text):
    return str(text).translate(None, string.punctuation) 

In [6]:
def extract_features(raw_documents, stop=None):
    """
    Returns the X feature matrix in scipy sparse array format.
    To convert it to a regular numpy array format use .toarray()
    """
    vectorizer = feature_extraction.text.CountVectorizer(analyzer = "word",   \
                                                         tokenizer = None,    \
                                                         preprocessor = None, \
                                                         stop_words = stop,   \
                                                         min_df = 0,          \
                                                         max_features = 100000) 

    # Learn the vocabulary dictionary and return term-document matrix.
    doc_term_matrix = vectorizer.fit_transform(raw_documents)
    vocab = vectorizer.get_feature_names()
    
    # Sum the counts of each term over all the docs
    #dist = np.sum(doc_term_matrix.toarray(), axis=0)
    #for tag, count in zip(vocab, dist):
    #    print count, tag
    
    return vocab, doc_term_matrix

In [7]:
def text_preprocessing( raw_review, whitelist=None):
    """
    param whitelist: the opposite of a stop-word list. It acts as a positive filter:
    only the words in the whitelist are kept."""
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw HTML review), and 
    # the output is a single string (a preprocessed review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text() 
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. Remove stop words
    #   In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    #stops = set(stopwords.words("english"))                   
    #words = [w for w in words if not w in stops]   
    # 
    # 5. Include white list words
    if whitelist:
        whitelist = set(whitelist)                   
        words = [w for w in words if w in whitelist]   
    #
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( words ))   

In [8]:
raw_reviews = products['review'].apply(remove_punctuation)

In [9]:
prep_reviews = []
num_reviews = len(raw_reviews)
for i, review in enumerate(raw_reviews):
    if( (i+1)%25000 == 0 ):
        print "Review %d of %d" % ( i+1, num_reviews )                                                                    
    prep_reviews.append(text_preprocessing(review))

Review 25000 of 165975
Review 50000 of 165975
Review 75000 of 165975
Review 100000 of 165975
Review 125000 of 165975
Review 150000 of 165975


In [10]:
vocab, X = extract_features(prep_reviews, 'english')

In [11]:
# remove the u'' prefix in unicode strings
feature_names = json.dumps(vocab).translate(None, '"[] ').split(',')
print len(vocab), feature_names[0:100], len(feature_names)

100000 ['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaacd', 'aaaaahhhh', 'aaaaall', 'aaaaarrrrggghhhhthese', 'aaaahhhs', 'aaaallll', 'aaaand', 'aaaggees', 'aaah', 'aaahh', 'aaahs', 'aaai', 'aaargh', 'aaarghdo', 'aaas', 'aaasdfdfdfdfdfdfdfdfdfdfdfdfdfdfddd', 'aaasonly', 'aacells', 'aaddition', 'aae', 'aahem', 'aahhing', 'aahs', 'aaliyah', 'aalma', 'aamuch', 'aamzon', 'aand', 'aap', 'aapguidelinequoting', 'aaps', 'aardvark', 'aarghthe', 'aas', 'aasdont', 'aaseasy', 'aasi', 'aavailable', 'aawwws', 'ab', 'ababy', 'ababybjoumlrn', 'ababycom', 'ababys', 'aback', 'abandon', 'abandoned', 'abandoning', 'abandonment', 'abar', 'abasic', 'abassinet', 'abated', 'abattoiroh', 'abbington', 'abbout', 'abbrasivei', 'abbreviated', 'abby', 'abc', 'abcd', 'abck', 'abcs', 'abd', 'abdc', 'abdomen', 'abdoment', 'abdomin', 'abdominal', 'abdominalmuscles', 'abduct', 'abducted', 'abduction', 'abe', 'abed', 'abel', 'abena', 'aberrations', 'abesolute', 'abf', 'abg', 'abhor', 'abhore', 'abhorent', 'abide', 'abig', 'abilities

In [12]:
def as_dictionary(vocab, sparse_x_i):
    """
    Converts the sparse feature array of a sample to a dictionary.
    This function is just to help print the features in a human-interpretable format.
    """
    x_i = sparse_x_i.toarray()[0]
    feature_dict = {}
    for j, w in enumerate(vocab):
        if x_i[j] > 0:
            feature_dict[w] = x_i[j]
    
    return feature_dict

In [13]:
#products['word_count'] = [row.toarray() for row in X] # <-- this is a fast way to crash the kernel
print as_dictionary(feature_names, X[269])
products.iloc[269]['review']

{'playpen': 1, 'getting': 1, 'started': 1, 'money': 1, 'unraveling': 1, 'cheaper': 1, 'wash': 1, 'sheets': 1, 'recommend': 1, 'worth': 1}


'I would recommend getting cheaper sheets made for a playpen. These started unraveling after one wash! Not worth the money.'

## Provide Sentiment Labels

In order to train a logistic classifier, we need labels. In this case, because it is going to be a sentiment classifier, we want a sentiment label. We will assume that the star rating is equivalent to a sentiment.

To create the **label** (the sentiment column), we assign class +1 to star ratings of 4 or 5 and -1 to star ratings of 1 or 2.
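
As a small illustration of this labeling rule, the sketch below maps a handful of made-up ratings to labels. The `rating_to_sentiment` helper is hypothetical and only mirrors the rule stated above.

```python
def rating_to_sentiment(rating):
    """Map a star rating to +1 / -1; return None for the neutral rating 3."""
    if rating == 3:
        return None
    return 1 if rating > 3 else -1

# Made-up ratings; the neutral one is dropped before labeling.
ratings = [5, 1, 3, 4, 2]
labels = [rating_to_sentiment(r) for r in ratings if r != 3]
print(labels)  # [1, -1, 1, -1]
```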

In [14]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)

## Split data into training and test sets

Rather than splitting `X` and `y` themselves, I will split an array of row positions and use it to index them. Splitting the position array is enough.

In this case I leave most columns of the original dataset in the pandas DataFrame, and store the extracted features (word counts) in a separate matrix for the ML algorithm.

This presents a problem if the feature matrix is split directly into two randomly chosen sets, because I would lose the ability to know which row in the DataFrame corresponds to each row in the matrix.

In order to be able to refer back to the original row in the DataFrame from any row in the feature matrix, I will split an array of positional indices.

Alternatively, I could convert all the text columns to char arrays and concatenate them to the feature matrix, but all that extra data copying can be avoided by just working with indices.
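
A minimal sketch of the idea, using toy stand-ins instead of the real DataFrame and feature matrix (the `records` and `features` lists are hypothetical, and `random.shuffle` replaces `train_test_split` to keep the example dependency-free):

```python
import random

# Toy stand-ins (hypothetical data): a list of records and a parallel feature matrix.
records = ["review %d" % i for i in range(10)]
features = [[i, i * i] for i in range(10)]

# Shuffle the positions once, then cut them into test and train parts.
positions = list(range(len(records)))
random.Random(1).shuffle(positions)
n_test = int(0.2 * len(positions))
pos_test, pos_train = positions[:n_test], positions[n_test:]

# The same position addresses the matching row in both containers.
p = pos_test[0]
print((p, records[p], features[p]))
```

Because only the positions are shuffled, neither container is copied or reordered, and any row of the feature matrix can be traced back to its original record.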

In [15]:
y = products['sentiment'].values

pos = np.array(range(len(products)))
pos_train, pos_test = train_test_split(pos, test_size=0.2, random_state=1)

print X[pos_train].shape, X[pos_test].shape, y[pos_train][:10], sum(y[pos_train]<0)
print len(pos_train), len(pos_test), type(pos_train)

(132780, 100000) (33195, 100000) [ 1  1 -1  1  1 -1  1  1  1 -1] 21080
132780 33195 <type 'numpy.ndarray'>


# Train the sentiment classifier with logistic regression

In [16]:
sentiment_model = linear_model.LogisticRegression()
# fit() accepts a dense array or a scipy sparse matrix; X[pos_train] is a CSR sparse matrix here
sentiment_model.fit(X[pos_train], y[pos_train])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [17]:
coef = sentiment_model.coef_[0]
inter = sentiment_model.intercept_
print sentiment_model.coef_.shape

print coef[0:10]
num_positive_weights = sum(coef>0)
num_negative_weights = sum(coef<0)
print "Number of positive weights: %s " % num_positive_weights
print "Number of negative weights: %s " % num_negative_weights
print

print inter
num_positive_inter = sum(inter>0)
num_negative_inter = sum(inter<0)
print "Number of positive intercepts: %s " % num_positive_inter
print "Number of negative intercepts: %s " % num_negative_inter

(1, 100000)
[  2.49875619e-01   1.30748533e+00   6.24072804e-04  -1.20713079e-01
   8.95634613e-02   6.24686891e-05   3.86380070e-07   3.16755064e-01
  -2.49622333e-01   9.12890213e-08]
Number of positive weights: 62881 
Number of negative weights: 25276 

[ 1.18419346]
Number of positive intercepts: 1 
Number of negative intercepts: 0 


## Making predictions with logistic regression

Now that a model is trained, we can make predictions on the **test data**. In this section, we will explore this in the context of 3 examples in the test dataset.  We refer to this set of 3 examples as the **sample_test_data**.

In [18]:
#idx_test_sample = idx_test[14:17]
pos_test_sample = pos_test[15:18]
#print idx_test_sample
print pos_test_sample
print as_dictionary(feature_names, X[pos_test_sample[0]])
products.iloc[pos_test_sample]

[ 94825 100257  30688]
{'efficientvery': 1, 'think': 1, 'love': 1, 'dont': 1, 'just': 1, 'urban': 1, 'backwardshigh': 1, 'seat': 3, 'selfproclaimed': 1, 'junkie': 1, 'strollerthe': 1, 'purchased': 1, 'seen': 1, 'quality': 1, 'huge': 1, 'ease': 1, 'pregnant': 1, 'long': 1, 'better': 1, 'read': 1, 'yetits': 1, 'wobbly': 1, 'qualitythe': 1, 'wiped': 1, 'store': 1, 'nice': 1, 'model': 1, 'snaps': 1, 'consonce': 1, 'gear': 1, 'emergency': 1, 'easily': 1, 'positioners': 1, 'loose': 1, 'spot': 1, 'sturdyeasily': 1, 'wave': 2, 'strap': 1, 'connected': 1, 'prossun': 1, 'uselightweighthandle': 1, 'coordinating': 1, 'baby': 2, 'bit': 1, 'stiff': 1, 'stylishfeels': 1, 'easiest': 1, 'cleanthe': 1, 'boy': 1, 'tricky': 1, 'daycare': 1, 'aroundeasy': 1, 'car': 2, 'inexpensive': 1, 'bargain': 1, 'stroller': 2, 'try': 1, 'materials': 1, 'canopy': 1, 'obsess': 1, 'review': 1, 'say': 2, 'settings': 1}


Unnamed: 0,name,review,rating,sentiment
104808,"The First Years Via I450 Infant Carseat, Urban...",I must say I am in love with this car seat. I ...,5,1
110790,Medela 150 Ml Storage Bottle Case of 10 BPA FREE,Received this product and was a bit worried to...,1,-1
33772,Lamaze Play &amp; Grow My Friend Emily Take Al...,"a great toy for new born babies, has lot of st...",4,1


Let's dig deeper into the first row of the **sample_test_data**. Here's the full review:

In [19]:
products.iloc[pos_test_sample[0]]['review']

"I must say I am in love with this car seat. I am pregnant with baby boy #3 and I am a self-proclaimed baby gear junkie. I read every review, try every model in the store and obsess over every detail. We also purchased the coordinating Wave Urban Stroller.The pros:sun canopy is efficientvery stylishfeels very sturdyeasily wiped cleanthe car seat snaps on easily to the stroller (as long as you don't have it backwards)high quality materials all aroundeasy to uselightweighthandle has spot for emergency & daycare inforoomy seat with nice positioners for newbornschanging strap settings is the easiest i have seen yetIt's so inexpensive- this is a huge bargain for the qualitythe cons:once connected to the wave stroller it was a bit tricky to get off, but I think it's just stiff and will ease up, which I still say is better than being wobbly and loose."

That review seems pretty positive.

Now, let's see what the next row of the **sample_test_data** looks like. As we could guess from the sentiment (-1), the review is quite negative.

In [20]:
print products.iloc[pos_test_sample[1]]['review']

Received this product and was a bit worried to use it because I came in a zip lock type bag that was not even sealed so half of the bottles fell out into the box it was shipped in.  Was hesitant to use it so I am returning it.


We will now make a **class** prediction for the **sample_test_data**. 
The `sentiment_model` should predict **+1** if the sentiment is positive and **-1** if the sentiment is negative. The **score** (sometimes called **margin**) for the logistic regression model  is defined as:

$$
\mbox{score}_i = \mathbf{w}^T h(\mathbf{x}_i)
$$ 

where $h(\mathbf{x}_i)$ represents the features for example $i$.  We will write some code to obtain the **scores**. For each row, the **score** (or margin) is a real number in the range **(-inf, inf)**.
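
To make the score concrete, here is a minimal sketch for a review stored as a `{word: count}` dictionary. The weights and intercept are invented for illustration, and the `score` helper is hypothetical. The intercept is added to the dot product, which is how scikit-learn's linear models compute their decision function.

```python
def score(word_counts, weights, intercept):
    """Linear score w . h(x) + b for one review given as {word: count}."""
    # Words without a learned weight contribute nothing to the score.
    return intercept + sum(weights.get(w, 0.0) * c for w, c in word_counts.items())

# Hypothetical weights and intercept, chosen only for illustration.
weights = {"love": 1.5, "great": 1.0, "broke": -2.0, "waste": -2.5}
intercept = 0.5

s_pos = score({"love": 2, "stroller": 1}, weights, intercept)
s_neg = score({"broke": 1, "waste": 1}, weights, intercept)
print((s_pos, s_neg))  # (3.5, -4.0)
```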

### Predicting sentiment

These scores can be used to make class predictions as follows:

$$
\hat{y} = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) > 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$
Here the intercept $b$ is treated as part of the weights (as $w_0$, paired with a constant feature equal to 1). scikit-learn stores it separately in `intercept_` and adds it to the dot product, so in code the score is `w.x + b`.

Using scores, write code to calculate $\hat{y}$, the class predictions:

In [21]:
print "Calculation of predictions with formula (w.x + b > 0)" 
h_x = X[pos_test_sample]
w = sentiment_model.coef_
b = sentiment_model.intercept_
print 'Types:', type(w),  type(h_x)
print 'Dimension alignment for dot product:', h_x.shape, w.T.shape

Calculation of predictions with formula (w.x + b > 0)
Types: <type 'numpy.ndarray'> <class 'scipy.sparse.csr.csr_matrix'>
Dimension alignment for dot product: (3, 100000) (100000, 1)


In [22]:
# The intercept is added, matching sentiment_model.decision_function()
scores =  np.dot(h_x.toarray(), w.T) + b
print "Scores:" 
print " \n".join(["%.4f" % s[0] for s in scores])

Scores:
7.4336 
-0.6868 
3.1720


In [23]:
print "Class predictions (with formula w.x + b > 0):" 
print map(lambda s: 1 if s>0 else -1, scores)

Class predictions (with formula w.x + b > 0):
[1, -1, 1]


In [24]:
print "Class predictions (scikit-learn):" 
y_hats = sentiment_model.predict(X[pos_test_sample])
print y_hats

Class predictions (scikit-learn):
[ 1 -1  1]


### Probability predictions

We can also calculate the probability predictions from the scores using:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}.
$$

Using the variable **scores** calculated previously, write code to calculate the probability that a sentiment is positive using the above formula. For each row, the probabilities should be a number in the range **[0, 1]**.
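
A minimal sketch of this formula in plain Python. The scores are illustrative values, not the ones produced by the model, and the `prob_positive` helper is hypothetical.

```python
import math

def prob_positive(score):
    """P(y = +1 | x, w) = 1 / (1 + exp(-score))."""
    return 1.0 / (1.0 + math.exp(-score))

# Illustrative scores, not taken from the model.
scores = [2.0, -1.0, 0.0]
probs = [prob_positive(s) for s in scores]
# A probability above 0.5 corresponds to a score above 0, hence to class +1.
preds = [1 if p > 0.5 else -1 for p in probs]
print([round(p, 4) for p in probs])  # [0.8808, 0.2689, 0.5]
print(preds)                         # [1, -1, -1]
```

This direct form is fine for moderate scores. For very large negative scores `math.exp` can overflow, so a production implementation would typically use a numerically stable variant.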

In [25]:
# Returns the probability of the sample for each class in the model
proba_sample = sentiment_model.predict_proba(X[pos_test_sample])

print "Class probability predictions (scikit-learn):" 
print "          -1              1" 
print proba_sample


Class probability predictions (scikit-learn):
          -1              1
[[  5.90734235e-04   9.99409266e-01]
 [  6.65245851e-01   3.34754149e-01]
 [  4.02323390e-02   9.59767661e-01]]


Of the three data points in **sample_test_data**, the second has the **lowest probability** of being classified as a positive review (about 0.33).

# Find the most positive (and negative) review

Find the 20 reviews in the test set with the **highest probability** of being classified as a **positive review**.
1.  Make probability predictions for the whole test set, **X[pos_test]**.
2.  Sort the results in descending order by the probability of class '1' and pick the top 20 
    (or sort in ascending order and pick the last 20).
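
The two steps above can be sketched without NumPy as follows. The probabilities and dataset positions are made up, and `top_k_positions` is a hypothetical helper.

```python
def top_k_positions(probabilities, k):
    """Positions of the k largest values, most probable first."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return order[:k]

# Hypothetical P(+1) values for five test reviews, and their positions in the full dataset.
p_positive = [0.91, 0.15, 0.99, 0.60, 0.02]
pos_test = [40, 7, 23, 88, 5]

top2 = top_k_positions(p_positive, 2)
print(top2)                         # [2, 0]
print([pos_test[i] for i in top2])  # [23, 40]
```

The second print shows the extra lookup needed here: the sort returns positions within the test set, which must be mapped through `pos_test` to reach rows of the original dataset.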

In [26]:
# Returns the probability of the sample for each class in the model
proba_test = sentiment_model.predict_proba(X[pos_test])

print "Class probability predictions (scikit-learn):" 
print "     -1              1" 
print proba_test[0:5]

Class probability predictions (scikit-learn):
     -1              1
[[  2.01660624e-06   9.99997983e-01]
 [  7.94329813e-03   9.92056702e-01]
 [  5.47455478e-01   4.52544522e-01]
 [  9.88904233e-01   1.10957670e-02]
 [  1.27436044e-01   8.72563956e-01]]


In [27]:
# Pick the top 20: sort in ascending order and select the last 20.
# np.argsort() takes the column used as the sort key and returns the row indices
# that sort the matrix by that column when used to index it.
proba_test_sort_ascen = np.argsort(proba_test[:, 1])
print "        -1              1" 
print proba_test[proba_test_sort_ascen[-5:]]
orig_pos = pos_test[proba_test_sort_ascen[-20:]]
products.iloc[orig_pos]  # iloc indexes by position: the 'i' stands for 'integer', not 'index'

        -1              1
[[ 0.  1.]
 [ 0.  1.]
 [ 0.  1.]
 [ 0.  1.]
 [ 0.  1.]]


Unnamed: 0,name,review,rating,sentiment
114796,"Fisher-Price Cradle 'N Swing, My Little Snuga...",My husband and I cannot state enough how much ...,5,1
95572,Peg-Perego 2010 Gt3 for Two Performance Stroll...,I should probably start out by saying that I a...,4,1
54234,UPPAbaby PiggyBack Ride Along Board,We love our Vista with PiggyBack. There are a...,5,1
42430,bumGenius One-Size Cloth Diaper Twilight,new to cloth diapering- trying to figure out i...,5,1
133651,"Britax 2012 B-Agile Stroller, Red",[I got this stroller for my daughter prior to ...,4,1
152013,"UPPAbaby Vista Stroller, Denny",I researched strollers for months and months b...,5,1
111746,"Baby Jogger 2011 City Mini Double Stroller, Bl...","Before purchasing this stroller, I read severa...",5,1
168697,Graco FastAction Fold Jogger Click Connect Str...,Graco's FastAction Jogging Stroller definitely...,5,1
79357,WubbaNub Tabby Kitten,I first bought these when I again had to repla...,5,1
168081,Buttons Cloth Diaper Cover - One Size - 8 Colo...,"We are big Best Bottoms fans here, but I wante...",4,1


**Quiz Question**: Which of the following products are represented in the 20 most positive reviews? [multiple choice]


Now, let us repeat this exercise to find the "most negative reviews." Use the prediction probabilities to find the 20 reviews in the **test_data** with the **lowest probability** of being classified as a **positive review**. Repeat the same steps above but make sure you **sort in the opposite order**.

In [28]:
# sorting in reverse order [::-1]
# sorting by the same column, just get the rows with 
# the lowest probability of being classified as a positive review
proba_test_sort_desc = np.argsort(proba_test[:, 1])[::-1]
print "      -1              1" 
print proba_test[proba_test_sort_desc[-5:]] 
orig_pos = pos_test[proba_test_sort_desc[-20:]]

      -1              1
[[  1.00000000e+00   7.40554082e-11]
 [  1.00000000e+00   2.72453108e-11]
 [  1.00000000e+00   4.44859718e-12]
 [  1.00000000e+00   2.09069217e-13]
 [  1.00000000e+00   1.69573589e-20]]


In [29]:
products.iloc[orig_pos]

Unnamed: 0,name,review,rating,sentiment
7075,"Peace of Mind Two 900 Mhz Baby Receivers, Monitor",If we only knew when we registered how terribl...,1,-1
138762,Fuzzi Bunz Diaper Sprayer - White,Purchased from &#34;perrymerchant&#34;This spr...,1,-1
18599,Eddie Bauer Classic Wood Swing,What an awful product! The motor failed after ...,1,-1
36569,Summer Infant Secure Sounds Digital Multi Room...,"We had a multi-room baby monitor before, but i...",2,-1
110305,Eco Vessel Kids Stainless Steel Water Bottle w...,I was looking forward to receiving three quali...,1,-1
179608,Diy Monthly Chalkboard Calendar Vinyl Wall Dec...,Fantastic concept. Awful execution. Don't wa...,1,-1
59546,Ellaroo Mei Tai Baby Carrier - Hershey,This is basically an overpriced piece of fabri...,1,-1
37099,The First Years Clean Air Diaper Disposal System,Unbelievable... that's what I say every time I...,1,-1
176038,"Megaseat Play and Snack Tray, White",This comes in two pieces that sort of snap tog...,1,-1
134225,Tek-tok 2.5&quot; Color Video Baby Monitor and...,"Fellow Parents, I bought the Tek Tok baby moni...",1,-1


Notice that the reviews shown above, for which we predict the most negative sentiment, all have a low star rating (1 or 2), so the classifier agrees with the reviewers. When a review with a high star rating does get a negative prediction, it may be because the classifier is detecting negative sentiment in the words even though the reviewer did not want to give a bad rating.

## Compute accuracy of the classifier

We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by


$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

This can be computed as follows:

* **Step 1:** Use the trained model to compute class predictions
* **Step 2:** Count the number of data points when the predicted class labels match the ground truth labels (called `true_labels` below).
* **Step 3:** Divide the total number of correct predictions by the total number of data points in the dataset.

Complete the function below to compute the classification accuracy:

In [30]:
def get_classification_accuracy(model, data, true_labels):
    # First get the predictions
    y_hat = model.predict(data)
    
    # Compute the number of correctly classified examples (true negatives plus true positives)
    tn_tp = sum(y_hat == true_labels)

    # Then compute accuracy by dividing num_correct by total number of examples
    accuracy = tn_tp / float(data.shape[0])
    
    return accuracy

Now, let's compute the classification accuracy of the **sentiment_model** on the **test_data**.

In [31]:
ac_train = get_classification_accuracy(sentiment_model, X[pos_train], y[pos_train])
print "Accuracy on training data %.2f" % ac_train
ac_test = get_classification_accuracy(sentiment_model, X[pos_test], y[pos_test])
print "Accuracy on test data %.2f" % ac_test

Accuracy on training data 0.96
Accuracy on test data 0.92


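A higher accuracy on the **training_data** does not always imply that the classifier is better: it could be overfitting and losing its ability to generalize. Here the gap between training accuracy (0.96) and test accuracy (0.92) points in that direction.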

## Learn another classifier with fewer words

There were a lot of words in the model we trained above. We will now train a simpler logistic regression model using only a subset of the words that occur in the reviews. For this assignment, we selected 20 words to work with. These are:

In [32]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [33]:
len(significant_words)

20

For each review, we keep only the words that appear in the **significant_words** list above and discard the rest, by passing the list as the `whitelist` argument of `text_preprocessing`. Note that we are performing this on both the training and test set.

In [34]:
# preprocess
white_reviews = []
num_reviews = len(raw_reviews)
for i, review in enumerate(raw_reviews):
    if( (i+1)%25000 == 0 ):
        print "Review %d of %d" % ( i+1, num_reviews )                                                                    
    white_reviews.append(text_preprocessing(review, significant_words))

# feature extraction
white_vocab, white_X = extract_features(white_reviews)

Review 25000 of 165975
Review 50000 of 165975
Review 75000 of 165975
Review 100000 of 165975
Review 125000 of 165975
Review 150000 of 165975


In [35]:
# inspect the wordcounts
white_feature_names = json.dumps(white_vocab).translate(None, '"[] ').split(',')
print len(white_feature_names), white_feature_names
# All 20 whitelist words are kept: extract_features() was called without a stop-word list
print as_dictionary(white_feature_names, white_X[269])

# split the new features
# no need to split white_X because we reuse the pos index partition

# train the new model
white_sentiment_model = linear_model.LogisticRegression()
white_sentiment_model.fit(white_X[pos_train], y[pos_train])

20 ['able', 'broke', 'car', 'disappointed', 'easy', 'even', 'great', 'less', 'little', 'love', 'loves', 'money', 'old', 'perfect', 'product', 'return', 'waste', 'well', 'work', 'would']
{'money': 1, 'would': 1}


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Let's see what the first example of the dataset looks like:

In [36]:
print pos_train[0]
products.iloc[pos_train[0]]['review']

56113


'We had bought the Avent bottles & they gave our little girl colic.  So we switched to these & what a difference.  They have a super cute design too.  Love them & would highly recommend them.'

The word counts we have been working with so far look like the following for this review:

In [37]:
print as_dictionary(feature_names, X[pos_train[0]])

{'cute': 1, 'little': 1, 'difference': 1, 'bottles': 1, 'love': 1, 'switched': 1, 'highly': 1, 'avent': 1, 'design': 1, 'recommend': 1, 'gave': 1, 'girl': 1, 'bought': 1, 'super': 1, 'colic': 1}


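Since we are only working with a subset of these words, the whitelisted word counts contain far fewer entries. In this example, only 3 of the `significant_words` are present in this review. Note that `would` shows up here but not in the dictionary above, because the full model removed English stop words and the whitelisted model did not.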

In [38]:
print as_dictionary(white_feature_names, white_X[pos_train[0]])

{'little': 1, 'love': 1, 'would': 1}


We can compute the classification accuracy using the `get_classification_accuracy` function implemented earlier.

In [39]:
ac_train = get_classification_accuracy(white_sentiment_model, white_X[pos_train], y[pos_train])
print "Accuracy on training data %.2f" % ac_train
ac_test = get_classification_accuracy(white_sentiment_model, white_X[pos_test], y[pos_test])
print "Accuracy on test data %.2f" % ac_test

Accuracy on training data 0.87
Accuracy on test data 0.87


Now, we will inspect the weights (coefficients) of the **simple_model** (`white_sentiment_model` in the code):

In [40]:
white_coef = white_sentiment_model.coef_[0]
white_inter = white_sentiment_model.intercept_
print white_sentiment_model.coef_.shape

print white_coef[0:10]
num_positive_white = sum(white_coef>0)
num_negative_white = sum(white_coef<0)
print "Number of positive weights: %s " % num_positive_white
print "Number of negative weights: %s " % num_negative_white
print

print white_inter
num_positive_white_inter = sum(white_inter>0)
num_negative_white_inter = sum(white_inter<0)
print "Number of positive intercepts: %s " % num_positive_white_inter
print "Number of negative intercepts: %s " % num_negative_white_inter

(1, 20)
[ 0.22286894 -1.642837    0.08412074 -2.40920758  1.18133062 -0.52189432
  0.93987636 -0.18517016  0.51997707  1.34524368]
Number of positive weights: 10 
Number of negative weights: 10 

[ 1.29005289]
Number of positive intercepts: 1 
Number of negative intercepts: 0 


Let's sort the coefficients by **value** (in ascending order, so the coefficients with the most positive effect on the sentiment come last).

In [41]:
xs = np.argsort(white_coef)
print white_coef[xs]

[-2.40920758 -2.06217068 -1.97669413 -1.642837   -0.92120186 -0.63278865
 -0.52189432 -0.33844415 -0.31477555 -0.18517016  0.08412074  0.08970073
  0.22286894  0.51697152  0.51997707  0.93987636  1.18133062  1.34524368
  1.46918496  1.72105392]
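
The sorted values alone do not show which word each weight belongs to. The sketch below pairs the first ten vocabulary entries with the first ten coefficients printed above. It assumes that the coefficients are in the same (alphabetical) order as the vocabulary returned by the vectorizer, since the columns of the feature matrix follow that order.

```python
# First ten vocabulary entries and coefficients, copied from the outputs above.
names = ['able', 'broke', 'car', 'disappointed', 'easy',
         'even', 'great', 'less', 'little', 'love']
coefs = [0.22286894, -1.642837, 0.08412074, -2.40920758, 1.18133062,
         -0.52189432, 0.93987636, -0.18517016, 0.51997707, 1.34524368]

# Sort (coefficient, word) pairs from most negative to most positive.
ranked = sorted(zip(coefs, names))
for c, w in ranked:
    print("%-13s % .4f" % (w, c))
```

Among these ten words, `disappointed` carries the most negative weight and `love` the most positive one.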


Are the positive words in the **simple_model** (let us call them `positive_significant_words`) also positive words in the **sentiment_model**?

In [48]:
# white_coef follows the order of white_feature_names (alphabetical),
# not the order of significant_words, so iterate over white_feature_names.
for iwhite, wwhite in enumerate(white_feature_names):
    if white_coef[iwhite] > 0:
        # index() returns 0 for the first vocabulary entry, so test for i >= 0
        i = feature_names.index(wwhite) if wwhite in feature_names else -1
        print "%s\t\t%1.10f\t\t%1.10f\t\t%s" % (wwhite, white_coef[iwhite], 
                                                coef[i] if i>=0 else 0, feature_names[i] if i>=0 else '*stop word*')


# Comparing models

We will now compare the accuracy of the **sentiment_model** and the **simple_model** using the `get_classification_accuracy` method you implemented above.

First, compute the classification accuracy of the **sentiment_model** on the **train_data**:

In [45]:
ac_train = get_classification_accuracy(sentiment_model, X[pos_train], y[pos_train])
print "Accuracy on training data %.2f" % ac_train

Accuracy on training data 0.96


Now, compute the classification accuracy of the **simple_model** on the **train_data**:

In [46]:
ac_train_white = get_classification_accuracy(white_sentiment_model, white_X[pos_train], y[pos_train])
print "Accuracy on whitelisted training data %.2f" % ac_train_white

Accuracy on whitelisted training data 0.87


**Quiz Question**: Which model (**sentiment_model** or **simple_model**) has higher accuracy on the TRAINING set?

Now, we will repeat this exercise on the **test_data**. Start by computing the classification accuracy of the **sentiment_model** on the **test_data**:

In [175]:
ac_test = get_classification_accuracy(sentiment_model, X[pos_test], y[pos_test])
print "Accuracy on test data %.2f" % ac_test

Accuracy on test data 0.92


Next, we will compute the classification accuracy of the **simple_model** on the **test_data**:

In [47]:
ac_test_white = get_classification_accuracy(white_sentiment_model, white_X[pos_test], y[pos_test])
print "Accuracy on whitelisted test data %.2f" % ac_test_white

Accuracy on whitelisted test data 0.87


**Quiz Question**: Which model (**sentiment_model** or **simple_model**) has higher accuracy on the TEST set?

## Baseline: Majority class prediction

It is quite common to use the **majority class classifier** as a baseline (or reference) model for comparison with your classifier model. The majority class classifier predicts the majority class for all data points. At the very least, you should comfortably beat the majority class classifier; otherwise, the model is (usually) pointless.
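
A minimal, generic sketch of such a baseline, using made-up labels. The `majority_class_accuracy` helper is hypothetical and is not the function used in the cells that follow.

```python
from collections import Counter

def majority_class_accuracy(train_labels, eval_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in eval_labels if y == majority_label)
    return majority_label, hits / float(len(eval_labels))

# Hypothetical toy labels.
train = [1, 1, 1, -1, 1, -1]
test = [1, -1, 1, 1]
label, acc = majority_class_accuracy(train, test)
print((label, acc))  # (1, 0.75)
```

The majority label is learned from the training labels only and then applied to the evaluation labels, so the baseline is measured the same way as a trained model.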

What is the majority class in the **train_data**?

In [181]:
num_positive  = (y[pos_train] == +1).sum()
num_negative = (y[pos_train] == -1).sum()
print num_positive
print num_negative
print 'Majority class classifier:  predict always', 'positive' if num_positive > num_negative else 'negative'

111700
21080
Majority class classifier:  predict always positive


Now compute the accuracy of the majority class classifier on **test_data**.

In [183]:
def get_classification_accuracy_mcc(true_labels):
    # First get the predictions
    y_hat = np.ones(len(true_labels))
    
    # Compute the number of correctly classified examples (true negatives plus true positives)
    tn_tp = sum(y_hat == true_labels)

    # Then compute accuracy by dividing num_correct by total number of examples
    accuracy = tn_tp / float(len(true_labels))
    
    return accuracy

ac_test_mcc = get_classification_accuracy_mcc( y[pos_test])
print "Accuracy of majority class classifier on test data %.2f" % ac_test_mcc

Accuracy of majority class classifier on test data 0.84
