# Predicting sentiment from product reviews
##### Eric Andrés Jardón Chao

**Goal**: _to explore logistic regression and feature engineering with Turi Create_

We use product review data from Amazon.com to predict whether the sentiments about a product are positive or negative.

In this notebook we:

* use SFrames to do some minor feature engineering,
* train a **logistic regression model** to predict the sentiment of product reviews,
* make predictions for a new product review,
* write a function to compute the **accuracy** of the model, given the model, the data and the "ground truth" labels,
* inspect the coefficients of the logistic regression model and interpret their meaning,
* compare multiple logistic regression models.

In [1]:
import turicreate
import math

For this exercise we use a dataset consisting of baby product reviews from Amazon.com.

The dataset consists of roughly 180 thousand observations (166,752 remain once neutral reviews are removed below), with 3 columns: `name`, `review` and `rating`.

# Part 1. Preparing the Data

In [None]:
products = turicreate.SFrame('../data/amazon_baby.sframe/')

In [5]:
products

name,review,rating
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0


## For every review, obtain word count vectors

For every review text we have to perform 2 simple data transformations:

1. Remove punctuation.
2. Transform the text into word counts.

**Note**. In this notebook, we remove all punctuation for the sake of simplicity. A smarter approach would preserve contractions such as "I'd", "would've", "hadn't" and so forth.

In [7]:
import string 

def remove_punctuation(text):
    translator = text.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    return text

In [10]:
remove_punctuation("Why? It's like you don't care...")  # probably should also normalize to lowercase?

'Why Its like you dont care'

In [12]:
# Remove all punctuation from every review
review_without_punctuation = products['review'].apply(remove_punctuation)

# Create a word count column from every review
products['word_count'] = turicreate.text_analytics.count_words(review_without_punctuation)

In [14]:
# The new word_count column holds a dictionary: each key is a word and each value is
# the number of times that word occurs in the review text.

products[269]['word_count']

{'our': 1.0, 'in': 1.0, 'favorite': 1.0, 'house': 1.0, 'a': 1.0}
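The two transformations above can also be sketched in plain Python, without Turi Create. This is only an illustration: the helper name `count_words_sketch` is hypothetical, lowercasing and whitespace splitting are choices of this sketch, and the counts come out as integers rather than the floats shown above.

```python
import string
from collections import Counter

def count_words_sketch(text):
    """Strip ASCII punctuation, lowercase, split on whitespace and count tokens."""
    translator = str.maketrans('', '', string.punctuation)
    cleaned = text.translate(translator).lower()
    return dict(Counter(cleaned.split()))

example_counts = count_words_sketch("Why? It's like you don't care... why?")
print(example_counts)
```

Note how "It's" and "don't" collapse into `its` and `dont`, which is the loss of information mentioned in the note on punctuation.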

## Labeling sentiments (-1 or +1)

For classification tasks we must have a labeled dataset. <br>
We will label each of these reviews' sentiments according to the heuristic: 2 stars and lower is bad, 4 stars and up is good <br>
(For this exercise we will **ignore** all reviews with *rating = 3*, since they tend to have a neutral sentiment.)

In [16]:
# Remove neutral reviews
products = products[products['rating'] != 3]

len(products) # 166,752

166752

* Reviews with a rating of 4 or higher are *positive*.
* Reviews with a rating of 2 or lower are *negative*.

For the sentiment column, we use +1 for the positive class label and -1 for the negative class label.
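The labeling rule can be written as a small standalone function. The sketch below uses a hypothetical helper `rating_to_sentiment` on a plain list of ratings; it returns `None` for neutral reviews so that they can be filtered out, which mirrors the removal of 3-star reviews above.

```python
def rating_to_sentiment(rating):
    """Return +1 for ratings above 3, -1 for ratings below 3, None for neutral."""
    if rating == 3:
        return None
    return 1 if rating > 3 else -1

ratings = [5.0, 3.0, 1.0, 4.0, 2.0]
# Drop neutral reviews, then label the rest.
labels = [rating_to_sentiment(r) for r in ratings if r != 3]
print(labels)
```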

In [17]:
# Label every row in the dataset according to rating heuristic
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
products

name,review,rating,word_count,sentiment
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0,"{'recommend': 1.0, 'disappointed': 1.0, ...",1
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0,"{'quilt': 1.0, 'this': 1.0, 'for': 1.0, ...",1
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0,"{'tool': 1.0, 'clever': 1.0, 'binky': 2.0, ...",1
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0,"{'rock': 1.0, 'headachesthanks': 1.0, ...",1
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0,"{'thumb': 1.0, 'or': 1.0, 'break': 1.0, 'trying': ...",1
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0,"{'2995': 1.0, 'for': 1.0, 'barnes': 1.0, 'at': ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0,"{'right': 1.0, 'because': 1.0, 'questions': 1.0, ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0,"{'like': 1.0, 'and': 1.0, 'changes': 1.0, 'the': ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0,"{'in': 1.0, 'pages': 1.0, 'out': 1.0, 'run': 1.0, ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",I love this journal and our nanny uses it ...,4.0,"{'tracker': 1.0, 'now': 1.0, 'postits': 1.0, ...",1


## Train - Test split of the data

As usual we perform a train/test split from our subset of product reviews, with 80% for training set and 20% for the test set.
We use `seed=1` for reproducibility.

In [18]:
train_data, test_data = products.random_split(.8, seed=1)
print(len(train_data))
print(len(test_data))

133416
33336
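The split sizes are only approximately 80/20 (133416 of 166752 rows is slightly more than 80%). One way to see why is a seeded split that assigns each row independently. The sketch below is an assumption about the general idea, not Turi Create's actual procedure, and `random_split_sketch` is a hypothetical name.

```python
import random

def random_split_sketch(rows, fraction, seed):
    """Send each row to the first part with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

train_rows, test_rows = random_split_sketch(list(range(1000)), 0.8, seed=1)
print(len(train_rows), len(test_rows))
```

Running it twice with the same seed gives the same partition, which is the property relied on above for reproducibility.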


# Part 2. Train a sentiment classifier (Logistic Regression)

We train a logistic regression model with the column `word_count` as a feature and the column `sentiment` as the target. 
We pass `validation_set=None` so that no part of the training data is held out for validation, which keeps the results reproducible.

**Note:** This line may take 1-2 minutes.

In [20]:
# Turi Create logistic classifier implementation.

sentiment_model = turicreate.logistic_classifier.create(train_data,
                                                        target = 'sentiment',
                                                        features=['word_count'],
                                                        validation_set=None)

**Aside**. You may get a warning to the effect of `Terminated due to numerical difficulties --- this model may not be ideal`. It means that the quality metric could not be improved in the last iteration of the run. The difficulty appears when the model places too much weight on extremely rare words.

A way to rectify this is to apply regularization, which is covered in future notebooks. Regularization lessens the effect of extremely rare words. <br> 
For this assignment, however, we'll proceed with the model above.

In [27]:
# Inspect the learned weights: 121713 of them!

weights = sentiment_model.coefficients
print(weights.column_names())
print(len(weights))

['name', 'index', 'class', 'value', 'stderr']
121713


There are a total of `121713` coefficients in the model. Apart from the intercept, each of them corresponds to a unique word or token. <br>
For every weight $w_j$, a positive value means the word pushes the prediction towards positive sentiment, while a negative value means it pushes the prediction towards negative sentiment.

Calculate how many *weights* are nonnegative ( >= 0), **i.e.** how many entries in the `'value'` column of the SFrame *weights* are >= 0.

In [30]:
num_positive_weights = len(weights[weights['value'] >= 0])
num_negative_weights = len(weights[weights['value'] < 0])

print("Number of positive words: %s " % num_positive_weights)
print("Number of negative words: %s " % num_negative_weights)
print((weights['value']>=0).sum())

Number of positive words: 91073 
Number of negative words: 30640 
91073


**Quiz Question:** How many weights are >= 0?
`A = 91073`

## Making predictions with the logistic regression model

The Turi Create implementation computes a score for every observation, passes it through a link function, and decides on a label of -1 or +1 based on the output.

To examine this procedure, we take 3 examples from `test_data` and store them in `sample_test_data`.

In [37]:
# Work with observations 10-12 from dataset
sample_test_data = test_data[10:13]
sample_test_data

name,review,rating,word_count,sentiment
Our Baby Girl Memory Book,Absolutely love it and all of the Scripture in ...,5.0,"{'again': 1.0, 'book': 1.0, 'same': 1.0, ...",1
Wall Decor Removable Decal Sticker - Colorful ...,Would not purchase again or recommend. The decals ...,2.0,"{'peeling': 1.0, '5': 1.0, 'about': 1.0, 'f ...",-1
New Style Trailing Cherry Blossom Tree Decal ...,Was so excited to get this product for my baby ...,1.0,"{'on': 1.0, 'waste': 1.0, 'wouldnt': 1.0, ...",-1


In [38]:
sample_test_data[0]['review'] # seems positive

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [39]:
sample_test_data[1]['review'] # seems negative

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

We want to make predictions for the rows in **sample_test_data**. The `sentiment_model` should predict **+1** if the sentiment is positive and **-1** if the sentiment is negative. <br> The **score** (aka **margin**) for the logistic regression model  is defined as:


$$
\mbox{score}_i = \mathbf{w}^T h(\mathbf{x}_i)
$$ 

where $h(\mathbf{x}_i)$ represents the features (word counts dictionary) for example $i$.  <br>

Next we write some code to obtain the **scores** using Turi Create. 

For each row, the **score** (or margin) computed by logistic regression is a real number in the range **(-inf, inf)**.

In [41]:
# Set output type as margin (the computed score)
scores = sentiment_model.predict(sample_test_data, output_type='margin')
print(scores)

[4.788907309214048, -3.000782222462631, -8.188501360762764]
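To make the score formula concrete, here is a minimal sketch that computes $\mathbf{w}^T h(\mathbf{x}_i)$ from a word-count dictionary. The weights and the review below are made up for illustration and are not taken from `sentiment_model`; words without a weight contribute nothing to the sum.

```python
def score(word_counts, weights, intercept=0.0):
    """Intercept plus the sum of weight * count over the words in the review."""
    return intercept + sum(
        weights.get(word, 0.0) * count for word, count in word_counts.items()
    )

# Hypothetical weights and word counts, for illustration only.
toy_weights = {'love': 1.5, 'waste': -2.0, 'great': 1.0}
toy_review = {'love': 2.0, 'the': 3.0, 'waste': 1.0}

toy_score = score(toy_review, toy_weights)
toy_class = 1 if toy_score > 0 else -1
print(toy_score, toy_class)
```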


### Predicting sentiment from Margin with a threshold

Scores from the logistic regression model can be used to make class predictions as follows:

$$
\hat{y} = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) > 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$

Using scores, write code to calculate $\hat{y}$, the class predictions:

In [42]:
class_predictions = []
for score in scores:
    y = 1 if score > 0 else -1
    class_predictions.append(y)


Run the following code to verify that the class predictions obtained by your calculations are the same as that obtained from Turi Create.

In [43]:
print("Class predictions according to Turi Create:")
print(sentiment_model.predict(sample_test_data))
print("Class predictions according to Score:")
print(class_predictions)

Class predictions according to Turi Create:
[1, -1, -1]
Class predictions according to Score:
[1, -1, -1]


## Probability predictions

Recall from the lectures that we can also calculate the probability predictions (range 0 to 1.0) from the scores using a logistic function:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}.
$$

Using the **scores** array calculated previously we'll calculate the probability that a sentiment is positive using the above formula.

In [46]:
from math import e

def link_function(score):
    return 1 / (1 + e**(-score))

Making sure our probability predictions match the ones obtained from Turi Create:

In [49]:
print("Class probabilities according to Turi Create:")
print(sentiment_model.predict(sample_test_data, output_type='probability'))

prob_preds = []
for s in scores:
    prob_preds.append(link_function(s))
print("Class probabilities according to link function")
print(prob_preds)

Class probabilities according to Turi Create:
[0.9917471313286887, 0.0473905474871164, 0.0002777527712172623]
Class probabilities according to link function
[0.9917471313286887, 0.04739054748711641, 0.00027775277121726234]
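One caveat with the direct formula is that `e**(-score)` overflows for very negative scores. A numerically safer variant, sketched below under the hypothetical name `stable_sigmoid`, switches to an algebraically equivalent form when the score is negative. The three scores are the margins printed earlier.

```python
import math

def stable_sigmoid(score):
    """Logistic function that avoids overflow for large-magnitude scores."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # For negative scores exp(score) is at most 1, so it cannot overflow.
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)

scores = [4.788907309214048, -3.000782222462631, -8.188501360762764]
probabilities = [stable_sigmoid(s) for s in scores]
print(probabilities)
print(stable_sigmoid(-1000.0))  # the naive e**(-score) form would overflow here
```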


## Find most positive and negative reviews

Next we find the 20 reviews in the entire **test_data** with **highest probability** of being a **positive** review. 

Sort the data according to those predictions and pick the top 20. <br/> (Use the `.topk` method on an SFrame to find the top k rows sorted according to the value of a specified column.)
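For intuition, the same "top k by a column" selection can be sketched on a plain list of dictionaries with the standard library. The function name `topk_sketch` and the rows are hypothetical; the `reverse` flag here is meant to behave like the one on `.topk`, returning the smallest values instead of the largest.

```python
import heapq

def topk_sketch(rows, column, k, reverse=False):
    """Return the k rows with the largest `column` values (smallest if reverse)."""
    pick = heapq.nsmallest if reverse else heapq.nlargest
    return pick(k, rows, key=lambda row: row[column])

# Made-up rows, for illustration only.
toy_rows = [
    {'name': 'A', 'class_prob': 0.99},
    {'name': 'B', 'class_prob': 0.10},
    {'name': 'C', 'class_prob': 0.75},
    {'name': 'D', 'class_prob': 0.01},
]

most_positive_names = [r['name'] for r in topk_sketch(toy_rows, 'class_prob', k=2)]
most_negative_names = [r['name'] for r in topk_sketch(toy_rows, 'class_prob', k=2, reverse=True)]
print(most_positive_names, most_negative_names)
```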

In [54]:
# Make probability predictions on test_data
prob_predictions = sentiment_model.predict(test_data, output_type='probability')

test_data['class_prob'] = prob_predictions

In [55]:
# Use topk() to sort and pick the top 20 rows according to probability. Default order is descending
most_positive = test_data.topk('class_prob', k=20)

most_positive.print_rows(20)

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
| Fisher-Price Cradle 'N Swi... | My husband and I cannot st... |  5.0   |
| The Original CJ's BuTTer (... | I'm going to try to review... |  4.0   |
| Baby Jogger City Mini GT D... | We are well pleased with t... |  4.0   |
| Diono RadianRXT Convertibl... | Like so many others before... |  5.0   |
| Diono RadianRXT Convertibl... | I bought this seat for my ... |  5.0   |
| Graco Pack 'n Play Element... | My husband and I assembled... |  4.0   |
| Maxi-Cosi Pria 70 with Tin... | We love this car seat!! It... |  5.0   |
| Britax 2012 B-Agile Stroll... | [I got this stroller for m... |  4.0   |
| Quinny 2012 Buzz Stroller,... | Choice - Quinny Buzz 2011 ... |  4.0   |
| Roan Rocco Classic Pram St... | Great Pram Rocco!!!!!!I bo... |  5.0   |
| Britax Decathlon Conver

#### Now, find the "most negative reviews."

In [56]:
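# Sort by the 'class_prob' column created above; reverse=True returns the lowest values first
most_negative = test_data.topk('class_prob', k=20, reverse=True)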

most_negative.print_rows(20)

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
| Luna Lullaby Bosom Baby Nu... | I have the boppy deluxe pi... |  5.0   |
| The First Years True Choic... | Note: we never installed b... |  1.0   |
| Jolly Jumper Arctic Sneak ... | I am a "research-aholic" i... |  5.0   |
| Motorola MBP36 Remote Wire... | I could go on and on about... |  4.0   |
| VTech Communications Safe ... | This is my second video mo... |  1.0   |
| Fisher-Price Ocean Wonders... | We have not had ANY luck w... |  2.0   |
| Levana Safe N'See Digital ... | This is the first review I... |  1.0   |
| Safety 1st High-Def Digita... | We bought this baby monito... |  1.0   |
| Snuza Portable Baby Moveme... | I would have given the pro... |  1.0   |
| Adiri BPA Free Natural Nur... | I will try to write an obj... |  2.0   |
| Samsung SEW-3037W Wirel

**Quiz Question**: Which of the following products are represented in the 20 most negative reviews?  [multiple choice]

# Part 3. Computing accuracy of the classifier

We will now evaluate the accuracy of the trained classifier, according to the formula


$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

We do the following

* **Step 1:** Use the trained model to compute class predictions (**Hint:** Use the `predict` method)
* **Step 2:** Count the number of data points where the predicted class labels match the ground truth labels (called `true_labels` below).
* **Step 3:** Divide the total number of correct predictions by the total number of data points in the dataset.

Complete the function below to compute the classification accuracy:

In [58]:
def get_classification_accuracy(model, data, true_labels):
    N = len(true_labels)
    # First get the predictions
    predictions = model.predict(data)
    
    # Compute the number of correctly classified examples
    num_correct = 0
    for i in range(N):
        num_correct += (predictions[i] == true_labels[i])
    # Then compute accuracy by dividing num_correct by total number of examples
    accuracy = num_correct / N
    
    return accuracy

In [59]:
get_classification_accuracy(sentiment_model, test_data, test_data['sentiment'])

0.9221862251019919
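The same accuracy formula can be checked on small hand-made lists. The sketch below is a library-free variant with a hypothetical name, `accuracy_sketch`, that takes the predictions directly instead of a model; the labels are invented for the example.

```python
def accuracy_sketch(predictions, true_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have the same length")
    if len(true_labels) == 0:
        raise ValueError("cannot compute accuracy on an empty dataset")
    num_correct = sum(1 for p, y in zip(predictions, true_labels) if p == y)
    return num_correct / len(true_labels)

# Three of the four toy predictions match the toy labels.
toy_accuracy = accuracy_sketch([1, -1, -1, 1], [1, -1, 1, 1])
print(toy_accuracy)
```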

**Quiz Question**: What is the accuracy of the **sentiment_model** on the **test_data**? Round your answer to 2 decimal places (e.g. 0.76).
`A = 0.92`

NOTE: A higher accuracy on the training set does not always imply that the classifier is better; it may instead be a sign of _overfitting_.

# Part 4. Learn another classifier with fewer words

There were a lot of words in the model we trained above. We will now train a simpler logistic regression model using only a subset of 20 words that occur in the reviews. These are:

In [61]:
meaningful_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [62]:
len(meaningful_words)

20

For each review, we will use the **word_count** column and trim out all words that are **not** in the `meaningful_words` list above. We will use the [SArray dictionary trim by keys functionality](https://apple.github.io/turicreate/docs/api/generated/turicreate.SArray.dict_trim_by_keys.html). This is done on both the training and test set.

The input data our model works with is now much simpler, since at most 20 distinct words remain per review.

In [67]:
# Derive a new dictionary that only has the keys specified in meaningful_words
# Save this dictionary to a new column in both datasets
train_data['word_count_subset'] = train_data['word_count'].dict_trim_by_keys(meaningful_words, exclude=False)
test_data['word_count_subset'] = test_data['word_count'].dict_trim_by_keys(meaningful_words, exclude=False)

In [68]:
train_data[0]['review']

'it came early and was not disappointed. i love planet wise bags and now my wipe holder. it keps my osocozy wipes moist and does not leak. highly recommend it.'

In [69]:
print(train_data[0]['word_count'])

{'recommend': 1.0, 'disappointed': 1.0, 'wise': 1.0, 'love': 1.0, 'it': 3.0, 'planet': 1.0, 'and': 3.0, 'bags': 1.0, 'wipes': 1.0, 'highly': 1.0, 'not': 2.0, 'early': 1.0, 'came': 1.0, 'i': 1.0, 'does': 1.0, 'my': 2.0, 'was': 1.0, 'now': 1.0, 'wipe': 1.0, 'holder': 1.0, 'leak': 1.0, 'keps': 1.0, 'osocozy': 1.0, 'moist': 1.0}


In [70]:
print(train_data[0]['word_count_subset'])

{'disappointed': 1.0, 'love': 1.0}
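Trimming a dictionary to a set of allowed keys amounts to a dictionary comprehension. The sketch below (hypothetical helper `trim_by_keys`) reproduces the result above in plain Python, using the word list and an abbreviated copy of the word counts of the first training review.

```python
meaningful_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves',
                    'well', 'able', 'car', 'broke', 'less', 'even', 'waste',
                    'disappointed', 'work', 'product', 'money', 'would', 'return']

def trim_by_keys(word_count, keys):
    """Keep only the entries whose key is in `keys`."""
    allowed = set(keys)  # set membership is cheaper than scanning a list
    return {word: count for word, count in word_count.items() if word in allowed}

# Abbreviated copy of the word counts printed above for train_data[0].
first_review_counts = {'recommend': 1.0, 'disappointed': 1.0, 'wise': 1.0,
                       'love': 1.0, 'it': 3.0, 'and': 3.0, 'not': 2.0}

trimmed = trim_by_keys(first_review_counts, meaningful_words)
print(trimmed)
```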


## Train a logistic regression model on a subset of words

We will now build a classifier with **word_count_subset** as the feature and **sentiment** as the target. 

In [71]:
simple_model = turicreate.logistic_classifier.create(train_data,
                                                     target = 'sentiment',
                                                     features=['word_count_subset'],
                                                     validation_set=None)
simple_model

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 21
Number of examples             : 133416
Number of classes              : 2
Number of feature columns      : 1
Number of unpacked features    : 20

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : newton
Solver iterations              : 6
Solver status                  : SUCCESS: Optimal solution found.
Training time (sec)            : 0.3639

Settings
--------
Log-likelihood                 : 44323.7254

Highest Positive Coefficients
-----------------------------
word_count_subset[loves]       : 1.6773
word_count_subset[perfect]     : 1.5145
word_count_subset[love]        : 1.3654
(intercept)                    : 1.2995
word_count_subset[easy]        : 1.1937

Lowest Negative Coefficients
----------------------------
word_count_subset[disappointed] : -2.3551
wo

We can compute the classification accuracy using the `get_classification_accuracy` function you implemented earlier.

In [72]:
get_classification_accuracy(simple_model, test_data, test_data['sentiment'])

0.8693004559635229

Now, we will inspect the weights (coefficients) of the **simple_model**:

In [76]:
simple_model.coefficients # we have 20 coefficients (one per word) and an intercept

name,index,class,value,stderr
(intercept),,1,1.2995449552027043,0.0120888541330532
word_count_subset,disappointed,1,-2.3550925006107253,0.0504149888556979
word_count_subset,love,1,1.3654354936790372,0.0303546295109051
word_count_subset,well,1,0.5042567463979284,0.02138130063099
word_count_subset,product,1,-0.320555492995575,0.0154311321362016
word_count_subset,loves,1,1.6772714555592918,0.0482328275383501
word_count_subset,little,1,0.5206286360250184,0.0214691475664903
word_count_subset,work,1,-0.6217000124253143,0.0230330597945848
word_count_subset,easy,1,1.1936618983284648,0.0292888692020295
word_count_subset,great,1,0.9446912694798444,0.02095099265905
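As a worked example, these coefficients can be applied by hand to the first training review, whose `word_count_subset` is `{'disappointed': 1.0, 'love': 1.0}`. The sketch assumes the margin is the intercept plus the count-weighted sum of the word coefficients, following the score formula from Part 2; the numbers are copied from the table above.

```python
import math

# Coefficients copied from the simple_model output above.
intercept = 1.2995449552027043
coefficients = {'disappointed': -2.3550925006107253, 'love': 1.3654354936790372}
word_count_subset = {'disappointed': 1.0, 'love': 1.0}

# Margin: intercept plus coefficient * count for every word in the review.
margin = intercept + sum(
    coefficients[word] * count for word, count in word_count_subset.items()
)
probability = 1.0 / (1.0 + math.exp(-margin))
predicted_class = 1 if margin > 0 else -1
print(margin, probability, predicted_class)
```

The review actually says "not disappointed", but a bag-of-words model only sees the word `disappointed`, so its negative coefficient still pulls the margin down.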


Let's sort the coefficients (in descending order) by the **value** to obtain the coefficients with the most positive effect on the sentiment.

In [77]:
simple_model.coefficients.sort('value', ascending=False).print_rows(num_rows=21)

+-------------------+--------------+-------+----------------------+
|        name       |    index     | class |        value         |
+-------------------+--------------+-------+----------------------+
| word_count_subset |    loves     |   1   |  1.6772714555592918  |
| word_count_subset |   perfect    |   1   |  1.5144862670271348  |
| word_count_subset |     love     |   1   |  1.3654354936790372  |
|    (intercept)    |     None     |   1   |  1.2995449552027043  |
| word_count_subset |     easy     |   1   |  1.1936618983284648  |
| word_count_subset |    great     |   1   |  0.9446912694798443  |
| word_count_subset |    little    |   1   |  0.5206286360250184  |
| word_count_subset |     well     |   1   |  0.5042567463979284  |
| word_count_subset |     able     |   1   |  0.1914383022947509  |
| word_count_subset |     old      |   1   |  0.0853961886678159  |
| word_count_subset |     car      |   1   | 0.05883499006802042  |
| word_count_subset |     less     |   1   | -0.

**Quiz Question**: How many of the 20 coefficients (corresponding to the 20 `meaningful_words` and *excluding the intercept term*) are positive for the `simple_model`?
`A = 10`

**Quiz Question**: Are the positive words in the **simple_model** (let us call them `positive_significant_words`) also positive words in the **sentiment_model**? `A=YES`

In [78]:
positive_significant_words = simple_model.coefficients[simple_model.coefficients['value']>=0.0]
positive_significant_words = positive_significant_words[positive_significant_words['index'] != None]

In [79]:
positive_significant_words

name,index,class,value,stderr
word_count_subset,love,1,1.3654354936790372,0.0303546295109051
word_count_subset,well,1,0.5042567463979284,0.02138130063099
word_count_subset,loves,1,1.6772714555592918,0.0482328275383501
word_count_subset,little,1,0.5206286360250184,0.0214691475664903
word_count_subset,easy,1,1.1936618983284648,0.0292888692020295
word_count_subset,great,1,0.9446912694798444,0.02095099265905
word_count_subset,able,1,0.1914383022947509,0.0337581955697336
word_count_subset,perfect,1,1.5144862670271348,0.0498619522939948
word_count_subset,old,1,0.0853961886678159,0.0200863423024574
word_count_subset,car,1,0.0588349900680204,0.0168291532090873


In [98]:
psw = list(positive_significant_words['index'])
model_weights = sentiment_model.coefficients
for w in model_weights:
    if w['index'] in psw:
        print(w['index'], "value in full model: ", w['value'])

        # They are all also positive weights in the full model

love value in full model:  0.8405057320615064
well value in full model:  0.4010755749233182
loves value in full model:  0.9749823125142647
little value in full model:  0.40993162725717047
easy value in full model:  0.7349826255674927
great value in full model:  0.7789532883805086
able value in full model:  0.10752802191424245
perfect value in full model:  1.0447994204048683
old value in full model:  0.07967490900987588
car value in full model:  0.11965787650766


# Comparing models

We will now compare the accuracy of the **sentiment_model** and the **simple_model** using the `get_classification_accuracy` function you implemented above.

First, compute the classification accuracy of the **sentiment_model** on the **train_data**:

In [100]:
get_classification_accuracy(sentiment_model, train_data, train_data['sentiment'])

0.976494573364514

Now, compute the classification accuracy of the **simple_model** on the **train_data**:

In [101]:
get_classification_accuracy(simple_model, train_data, train_data['sentiment'])

0.8668150746537147

**Quiz Question**: Which model (**sentiment_model** or **simple_model**) has higher accuracy on the TRAINING set?
`A = Sentiment model`

Now, we will repeat this exercise on the **test_data**. Start by computing the classification accuracy of the **sentiment_model** on the **test_data**:

In [103]:
get_classification_accuracy(sentiment_model, test_data, test_data['sentiment'])

0.9221862251019919

Next, we will compute the classification accuracy of the **simple_model** on the **test_data**:

In [102]:
get_classification_accuracy(simple_model, test_data, test_data['sentiment'])

0.8693004559635229

**Quiz Question**: Which model (**sentiment_model** or **simple_model**) has higher accuracy on the TEST set?
`A = Full Sentiment model`

## Baseline: Majority class prediction

It is quite common to use the **majority class classifier** as a baseline (or reference) model for comparison with your classifier. The majority class classifier predicts the majority class for all data points. At the very least, your model should comfortably beat the majority class classifier; otherwise, it is (usually) pointless.

What is the majority class in the **train_data**?

In [80]:
# Vectorized operations: the boolean expression is evaluated for every row and returned
num_positive  = (train_data['sentiment'] == +1).sum()
num_negative = (train_data['sentiment'] == -1).sum()
print(num_positive)
print(num_negative)

112164
21252


Now compute the accuracy of the majority class classifier on **test_data**.

**Quiz Question**: Enter the accuracy of the majority class classifier model on the **test_data**. Round your answer to two decimal places (e.g. 0.76).

In [81]:
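# Caution: this is the positive fraction of train_data, only an estimate of the test-set value,
# which would be (test_data['sentiment'] == +1).sum() / len(test_data).
majority_class_acc = num_positive / len(train_data['sentiment'])
majority_class_acc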

0.8407087605684476

**Quiz Question**: Is the **sentiment_model** definitely better than the majority class classifier (the baseline)?
`A = YES`