# Sentiment analysis using logistic regression

We will use logistic regression to learn a classifier from review data.

Download the data from https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences. The folder `sentiment_labelled_sentences` (containing the data file `full_set.txt`) should be in the same directory as the notebook.

## Set up notebook, load and preprocess data

In [None]:
%matplotlib inline

import string
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

### Data

The **`sentiment`** data set consists of 3000 sentences taken from reviews on `imdb.com`, `amazon.com`, and `yelp.com`. Each sentence is labeled according to whether it comes from a positive or a negative review:
 - '1' if it came from a positive review
 - '0' if it came from a negative review

We will change the negative review label to '-1'.

In [None]:
## Read in the data set.
with open("../../_data/sentiment_labelled_sentences/full_set.txt") as f:
    content = f.readlines()
    
## Remove leading and trailing white space
content = [x.strip() for x in content]

## Separate the sentences from the labels
sentences = [x.split("\t")[0] for x in content]
labels = [x.split("\t")[1] for x in content]

## Transform the labels from '0 v.s. 1' to '-1 v.s. 1'
y = np.array(labels, dtype='int8')
y = 2*y - 1

### Preprocessing the text data

To transform this prediction problem into one amenable to linear classification, we will first need to preprocess the text data. We will do four transformations:

1. Remove punctuation and numbers.
2. Transform all words to lower-case.
3. Remove _stop words_.
4. Convert the sentences into vectors, using a bag-of-words representation.

We begin with the first two steps.

In [None]:
## full_remove takes a string x and a list of characters removal_list 
## returns x with all the characters in removal_list replaced by ' '
def full_remove(x, removal_list):
    for w in removal_list:
        x = x.replace(w, ' ')
    return x

## Remove digits
digits = [str(x) for x in range(10)]
digit_less = [full_remove(x, digits) for x in sentences]

## Remove punctuation
punc_less = [full_remove(x, list(string.punctuation)) for x in digit_less]

## Make everything lower-case
sents_lower = [x.lower() for x in punc_less]

### Stop words

Stop words are words that are filtered out because they are believed to contain no useful information for the task at hand. These usually include articles such as 'a' and 'the', pronouns such as 'i' and 'they', and prepositions such as 'to' and 'from'. We have put together a very small list of stop words, which is by no means comprehensive. Feel free to use something different; for instance, larger lists can easily be found on the web.

In [None]:
## Define our stop words
stop_set = set(['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from'])

## Remove stop words
sents_split = [x.split() for x in sents_lower]
sents_processed = [" ".join(list(filter(lambda a: a not in stop_set, x))) for x in sents_split]

What do the sentences look like so far?

In [None]:
sents_processed[0:10]

### Bag of words

In order to use linear classifiers on our data set, we need to transform our textual data into numeric data. The classical way to do this is known as the _bag of words_ representation. 

In this representation, each word is thought of as corresponding to a number in `{1, 2, ..., V}` where `V` is the size of our vocabulary. And each sentence is represented as a V-dimensional vector $x$, where $x_i$ is the number of times that word $i$ occurs in the sentence.
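As a small illustration of this representation, the sketch below builds count vectors by hand for a made-up three-sentence corpus. The corpus and the helper name `bag_of_words` are hypothetical, and indices start at 0 rather than 1, as is usual in Python.

```python
from collections import Counter

def bag_of_words(sentences):
    """Return (vocab, vectors) where vectors[k][i] counts word vocab[i] in sentence k."""
    # Vocabulary: every distinct word, in alphabetical order
    vocab = sorted({word for s in sentences for word in s.split()})
    index = {word: i for i, word in enumerate(vocab)}

    vectors = []
    for s in sentences:
        counts = Counter(s.split())
        vec = [0] * len(vocab)
        for word, n in counts.items():
            vec[index[word]] = n
        vectors.append(vec)
    return vocab, vectors

# Hypothetical toy corpus (already lower-cased, punctuation removed)
toy_sentences = ["good good movie", "bad movie", "good food"]
toy_vocab, toy_vectors = bag_of_words(toy_sentences)

print(toy_vocab)
for s, v in zip(toy_sentences, toy_vectors):
    print(v, s)
```

Each vector has one entry per vocabulary word, so a word that appears twice in a sentence (here 'good' in the first sentence) gets a count of 2.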

To do this transformation, we will make use of the `CountVectorizer` class in `scikit-learn`. We will cap the number of features at 4500, meaning a word will make it into our vocabulary only if it is one of the 4500 most common words in the corpus. This is often a useful step as it can weed out spelling mistakes and words which occur too infrequently to be useful.

The classifier also needs a bias term. We do not append a constant '1' feature to each vector for this, because the `scikit-learn` classifier used below fits an intercept by default.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

## Transform to bag of words representation.
vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=4500)
data_features = vectorizer.fit_transform(sents_processed)

## Convert the sparse count matrix to a dense array.
## (No constant '1' column is appended; the classifier fits its own intercept.)
data_mat = data_features.toarray()

### Training / test split

Finally, we split the data into a training set of 2500 sentences and a test set of 500 sentences (of which 250 are positive and 250 negative).

In [None]:
## Split the data into testing and training sets
np.random.seed(0)
neg_test = np.random.choice(np.where(y == -1)[0], 250, replace=False)
pos_test = np.random.choice(np.where(y == 1)[0], 250, replace=False)
test_inds = np.append(neg_test, pos_test)
train_inds = list(set(range(len(labels))) - set(test_inds))

train_data = data_mat[train_inds,]
train_labels = y[train_inds]

test_data = data_mat[test_inds,]
test_labels = y[test_inds]

print("train data: ", train_data.shape)
print("test data: ", test_data.shape)

## Fitting a logistic regression model to the training data

We could implement our own __logistic regression solver using stochastic gradient descent__, but fortunately, there is already one built into `scikit-learn`.

Due to the randomness in the SGD procedure, different runs can yield slightly different solutions (and thus different error values).
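For reference, a hand-written solver could look roughly like the sketch below. It is not what `SGDClassifier` does internally (the library uses its own learning-rate schedule and stopping rule); the function name `sgd_logistic`, the fixed step size, the epoch count, and the toy data are all assumptions chosen for illustration. It minimizes the logistic loss $\ln(1 + e^{-y(w \cdot x + b)})$ for labels $y \in \{-1, +1\}$.

```python
import numpy as np

def sgd_logistic(X, y, step=0.1, epochs=50, seed=0):
    """Fit (w, b) by SGD on the logistic loss; labels y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):          # visit points in random order
            z = y[i] * (X[i] @ w + b)
            # Derivative of ln(1 + e^{-z}) w.r.t. the score (w.x + b) is -y / (1 + e^{z})
            g = -y[i] / (1.0 + np.exp(z))
            w -= step * g * X[i]
            b -= step * g
    return w, b

# Hypothetical toy data: the label is the sign of the first feature
X_toy = np.array([[2.0, 1.0], [1.0, 0.0], [1.5, 2.0],
                  [-1.0, 1.0], [-2.0, 0.0], [-1.5, 2.0]])
y_toy = np.array([1, 1, 1, -1, -1, -1])

w_toy, b_toy = sgd_logistic(X_toy, y_toy)
preds_toy = np.where(X_toy @ w_toy + b_toy > 0, 1, -1)
print("training error:", np.mean(preds_toy != y_toy))
```

Because the points are visited in random order, changing `seed` gives slightly different `(w, b)`, which mirrors the run-to-run variation mentioned above.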

In [None]:
from sklearn.linear_model import SGDClassifier

## Fit logistic classifier on training data
## (scikit-learn >= 1.1 calls the logistic loss "log_loss"; older releases used "log")
clf = SGDClassifier(loss="log_loss", penalty=None, max_iter=1000, tol=1e-3)
clf.fit(train_data, train_labels)

## Pull out the parameters (w,b) of the logistic regression model
w = clf.coef_[0,:]
b = clf.intercept_

## Get predictions on training and test data
preds_train = clf.predict(train_data)
preds_test = clf.predict(test_data)

## Compute errors
errs_train = np.sum((preds_train > 0.0) != (train_labels > 0.0))
errs_test = np.sum((preds_test > 0.0) != (test_labels > 0.0))

print("Training error: ", float(errs_train)/len(train_labels))
print("Test error: ", float(errs_test)/len(test_labels))

### Analyzing the margin

The logistic regression model produces not just classifications but also conditional probability estimates. 
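For a logistic regression model with parameters `(w, b)`, that estimate is the sigmoid of the linear score, $\Pr(y=1|x) = 1/(1+e^{-(w \cdot x + b)})$. The sketch below computes it with made-up parameters; `w_demo`, `b_demo`, and `X_demo` are hypothetical stand-ins, not values taken from the fitted `clf`.

```python
import numpy as np

def prob_positive(X, w, b):
    """Pr(y=1|x) for each row x of X under a logistic model with parameters (w, b)."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Hypothetical parameters and points, for illustration only
w_demo = np.array([1.0, -2.0])
b_demo = 0.5
X_demo = np.array([[0.0, 0.0],    # score 0.5
                   [1.5, 1.0],    # score 0.0, i.e. on the decision boundary
                   [0.0, 2.0]])   # score -3.5

probs_demo = prob_positive(X_demo, w_demo, b_demo)
print(probs_demo)
```

A point whose score is exactly 0 gets probability 0.5, and the probability moves toward 0 or 1 as the score grows in magnitude.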

We will say that `x` has **margin** `gamma` if (according to the logistic regression model):

__`Pr(y=1|x) > (0.5)+gamma`__  
or  
__`Pr(y=1|x) < (0.5)-gamma`__  
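
To make the definition concrete, the sketch below applies it to a handful of made-up probability estimates (the array `probs_example` and the helper `has_margin` are hypothetical, not output from `clf`). With `gamma = 0.2`, a point has margin `gamma` only if its estimate lies above 0.7 or below 0.3.

```python
import numpy as np

def has_margin(probs, gamma):
    """Boolean mask: True where Pr(y=1|x) is more than gamma away from 0.5."""
    return (probs > 0.5 + gamma) | (probs < 0.5 - gamma)

# Hypothetical probability estimates Pr(y=1|x) for five points
probs_example = np.array([0.95, 0.65, 0.50, 0.25, 0.05])

mask = has_margin(probs_example, gamma=0.2)
print(mask)         # which points have margin 0.2
print(mask.sum())   # how many such points there are
```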


The function **margin_counts** computes how many points in the test set have margin `gamma`, that is, how many have an estimated probability more than `gamma` away from the decision boundary (0.5). It takes as input:
 - the classifier `clf`
 - the test set `test_data`
 - a value of `gamma`

In [None]:
def predictions(clf, test_data):
    """Prediction probabilities"""
    return clf.predict_proba(test_data)[:,1]

In [None]:
def margin_indices(clf, test_data, gamma):
    """Indices of test points at least a margin away from the decision boundary"""
    preds = predictions(clf, test_data)
    return np.where((preds > (0.5+gamma)) | (preds < (0.5-gamma)))[0]

In [None]:
def margin_counts(clf, test_data, gamma):
    """Number of test points at least a margin away from the decision boundary"""
    return len(margin_indices(clf, test_data, gamma))

In [None]:
def margin_counts_(clf, test_data, gamma):
    """Number of test points at least a margin away from the decision boundary"""
    
    ## Compute probability on each test point
    preds = clf.predict_proba(test_data)[:,1]
    
    margin_inds = np.where((preds > (0.5+gamma)) | (preds < (0.5-gamma)))[0]
    
    return len(margin_inds)

### Test set's distribution of margin values

In [None]:
gammas = np.arange(0, 0.5 ,0.01)

f = np.vectorize(lambda g: margin_counts(clf, test_data, g))

plt.plot(gammas, f(gammas)/len(test_data), linewidth=2, color='green')
plt.xlabel('Margin', fontsize=14)
plt.ylabel('Fraction of points above margin', fontsize=14)
plt.show();

### Are points `x` with larger margin more likely to be classified correctly?

The function **margin_errors** computes the error rate on the test points that are more than `gamma` away from the decision boundary, that is, the fraction of those points that are misclassified.

In [None]:
## Return error of predictions that lie in intervals [0, 0.5 - gamma) and (0.5 + gamma, 1]
def margin_errors(clf, test_data, test_labels, gamma):
    """Fraction of Misclassifications at least a margin away from the decision boundary"""
    
    preds = predictions(clf, test_data)
    margin_inds = margin_indices(clf, test_data, gamma)
    
    ## Compute error on those data points.
    num_errors = np.sum((preds[margin_inds] > 0.5) != (test_labels[margin_inds] > 0.0))
    return num_errors / len(margin_inds)

### Visualisation of margin and error rate

In [None]:
## Create grid of gamma values
gammas = np.arange(0, 0.5, 0.01)

## Compute margin_errors on test data for each value of g
f = np.vectorize(lambda g: margin_errors(clf, test_data, test_labels, g))

## Plot the result
plt.plot(gammas, f(gammas), linewidth=2)
plt.ylabel('Error rate', fontsize=14)
plt.xlabel('Margin', fontsize=14)
plt.show();

### Words with large influence

Finally, we attempt to partially **interpret** the logistic regression model.
Words whose coefficients in __`w`__ have the largest positive and negative values are the most influential.

In [None]:
## Convert vocabulary into a list.
## CountVectorizer assigns indices in alphabetical order, so index 0 is the alphabetically first word.
vocab = np.array([z[0] for z in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1])])
vocab[:5], vocab[-5:]

## Get indices of sorted coefs
inds = np.argsort(clf.coef_[0, :])
inds

## Words with large negative values
neg_inds = inds[:50]
print("\nHighly negative words: ")
print([str(x) for x in list(vocab[neg_inds])])

## Words with large positive values
pos_inds = inds[::-1][:50]
print("\nHighly positive words: ")
print([str(x) for x in list(vocab[pos_inds])])

### Take away

Suppose you are building a classifier, and can tolerate an error rate of at most some value __`e`__. Unfortunately, every classifier you try has a higher error than this. 

Therefore, you decide that the classifier is allowed to occasionally **abstain**: that is, to say *"don't know"*. When it actually makes a prediction, it must have error rate at most __`e`__. And subject to this constraint, it should abstain as infrequently as possible.
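
One way to act on this, sketched below under stated assumptions, is to abstain whenever the margin is below some threshold and to pick the smallest threshold whose error rate on the remaining points is at most `e`. The helper name `smallest_gamma`, the grid of thresholds, and the toy arrays are hypothetical; in practice the threshold would be chosen on held-out data rather than on the test set used for the final evaluation.

```python
import numpy as np

def smallest_gamma(probs, labels, e, gammas=np.arange(0, 0.5, 0.01)):
    """Smallest gamma in the grid whose non-abstained points have error rate <= e.

    probs  : estimates of Pr(y=1|x)
    labels : true labels in {-1, +1}
    Returns (gamma, fraction_abstained), or None if no gamma in the grid works.
    """
    for gamma in gammas:
        keep = (probs > 0.5 + gamma) | (probs < 0.5 - gamma)
        if not keep.any():          # abstaining on everything: stop searching
            break
        errors = (probs[keep] > 0.5) != (labels[keep] > 0)
        if errors.mean() <= e:
            return float(gamma), float(1.0 - keep.mean())
    return None

# Hypothetical estimates and labels: the two mistakes are both low-margin points
probs_toy = np.array([0.97, 0.90, 0.80, 0.585, 0.555, 0.445, 0.30, 0.10, 0.05, 0.02])
labels_toy = np.array([1, 1, 1, -1, 1, 1, -1, -1, -1, -1])

result = smallest_gamma(probs_toy, labels_toy, e=0.0)
print(result)
```

Scanning the thresholds in increasing order means the first one that satisfies the error constraint is also the one that abstains least often, since the set of abstained points only grows with `gamma`.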