# Sentiment analysis with support vector machines

In this notebook, we will revisit a learning task that we encountered earlier in the course: predicting the *sentiment* (positive or negative) of a single sentence taken from a review of a movie, restaurant, or product. The data set consists of 3000 labeled sentences, which we divide into a training set of size 2500 and a test set of size 500. Previously we fit a logistic regression classifier to this data. Today we will use a support vector machine.

Before starting on this notebook, make sure the folder `sentiment_labelled_sentences` (containing the data file `full_set.txt`) is available under `../../_data/`, which is the path the loading cell below reads from. Recall that the data can be downloaded from https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences.

## 1. Loading and preprocessing the data
 
Here we follow exactly the same steps as we did earlier.

In [None]:
%matplotlib inline

import string
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer

matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

### Data

In [None]:
!find ../../_data | grep -i full_set.txt

In [None]:
## Read in the data set.
with open("../../_data/sentiment_labelled_sentences/full_set.txt") as f:
    content = f.readlines()
    
## Remove leading and trailing white space
content = [x.strip() for x in content]

## Separate the sentences from the labels
sentences = [x.split("\t")[0] for x in content]
labels = [x.split("\t")[1] for x in content]

## Transform the labels from {0, 1} to {-1, +1}
y = np.array(labels, dtype='int8')
y = 2*y - 1

### Remove unwanted chars, symbols, digits, stopwords

In [None]:
def full_remove(x, removal_list):
    """returns x with all the characters in removal_list replaced by a space"""
    for w in removal_list:
        x = x.replace(w, ' ')
    return x

In [None]:
## Remove digits
digits = [str(x) for x in range(10)]
digit_less = [full_remove(x, digits) for x in sentences]

## Remove punctuation
punc_less = [full_remove(x, list(string.punctuation)) for x in digit_less]

## Make everything lower-case
sents_lower = [x.lower() for x in punc_less]

## Define our stop words
stop_set = set(['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from'])

## Remove stop words
sents_split = [x.split() for x in sents_lower]
sents_processed = [" ".join(list(filter(lambda a: a not in stop_set, x))) for x in sents_split]
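
To make the effect of these steps concrete, here is a minimal sketch that applies the same pipeline to a single sentence. The sentence is made up for illustration and is not taken from the data set, and the helper `preprocess` is a hypothetical wrapper around the steps above.

```python
import string

def full_remove(x, removal_list):
    """Return x with every item in removal_list replaced by a space."""
    for w in removal_list:
        x = x.replace(w, ' ')
    return x

# Same stop words as in the cell above
stop_set = {'the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from'}

def preprocess(sentence):
    # Hypothetical helper: digits -> punctuation -> lower-case -> stop words
    no_digits = full_remove(sentence, [str(d) for d in range(10)])
    no_punc = full_remove(no_digits, list(string.punctuation))
    words = no_punc.lower().split()
    return " ".join(w for w in words if w not in stop_set)

example = "I loved it, 10/10: the BEST phone I've owned!"  # made-up sentence
print(preprocess(example))
```

Note that splitting on the apostrophe leaves the fragment `ve` behind, which is a known side effect of replacing punctuation with spaces.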

In [None]:
len(sents_processed), sents_processed[:3]

### Transform sentences to Bag of Words

In [None]:
## Transform to bag of words representation.
vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=4500)
data_features = vectorizer.fit_transform(sents_processed)

## Convert the sparse count matrix to a dense NumPy array.
data_mat = data_features.toarray()
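
As a small illustration of the bag-of-words representation, the sketch below vectorizes two toy sentences (invented for this example) with default `CountVectorizer` settings. Each row counts how often each vocabulary word occurs in the corresponding sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, not taken from the data set
toy_sentences = ["good movie good", "bad movie"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_sentences).toarray()

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(vocab)       # column order of the count matrix
print(toy_counts)  # one row per sentence, one column per word
```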

### Train-test split

In [None]:
np.random.seed(0)
test_inds = np.append(np.random.choice((np.where(y==-1))[0], 250, replace=False), np.random.choice((np.where(y==1))[0], 250, replace=False))
train_inds = list(set(range(len(labels))) - set(test_inds))

train_data = data_mat[train_inds,]
train_labels = y[train_inds]

test_data = data_mat[test_inds,]
test_labels = y[test_inds]

print("train data: ", train_data.shape)
print("test data: ", test_data.shape)

In [None]:
train_data[:3]

## 2. Fitting a support vector machine to the data

In support vector machines, we are given a set of examples $(x_1, y_1), \ldots, (x_n, y_n)$ with labels $y_i \in \{-1, +1\}$, and we want to find a weight vector $w \in \mathbb{R}^d$ and slack variables $\xi_1, \ldots, \xi_n$ that solve the following optimization problem:

$$ \min_{w \in \mathbb{R}^d,\ \xi \in \mathbb{R}^n} \| w \|^2 + C \sum_{i=1}^n \xi_i $$
$$ \text{subject to } y_i \langle w, x_i \rangle \geq 1 - \xi_i \text{ and } \xi_i \geq 0 \text{ for all } i=1,\ldots, n$$
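
For a fixed $w$, and taking the slacks to be nonnegative, the smallest feasible slack is $\xi_i = \max(0,\, 1 - y_i \langle w, x_i \rangle)$, the hinge loss of example $i$. The following sketch evaluates the objective this way on three hand-made points. The function name `svm_objective` and the data are assumptions for illustration only.

```python
import numpy as np

def svm_objective(w, X, y, C):
    """Return (objective value, slacks) for the soft-margin problem at a fixed w."""
    margins = y * (X @ w)                   # y_i * <w, x_i>
    slack = np.maximum(0.0, 1.0 - margins)  # smallest feasible xi_i
    return float(w @ w + C * slack.sum()), slack

# Three hand-made points on a line; the second lies inside the margin
X_demo = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y_demo = np.array([1, 1, -1])
w_demo = np.array([1.0, 0.0])

value, slack = svm_objective(w_demo, X_demo, y_demo, C=1.0)
print("slacks:", slack)
print("objective:", value)
```

Only the second point has margin below 1, so it is the only one that contributes slack.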

`scikit-learn` provides an SVM solver that we will use. The following routine takes as input the constant `C` (from the above optimization problem) and returns the training and test error of the resulting SVM model. It is invoked as follows:

* `training_error, test_error = fit_classifier(C)`

Hyperparameter __`C`__ is the cost of misclassification (more precisely, the penalty on the slack variables). Reducing `C`:
 - lowers the cost of each violation, so expect more misclassified training points
 - widens the margin around the decision boundary
 - increases bias
 - lowers variance and, as a result, reduces overfitting

The default value for parameter `C` is 1.0.
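
The link between `C` and the margin can be checked on a small synthetic problem. The sketch below is not part of the sentiment pipeline: it draws two overlapping Gaussian clouds (an assumption made for illustration), fits `LinearSVC` with a small and a large `C`, and reports $\|w\|$ along with the margin width $2/\|w\|$. The intercept is switched off so the model matches the bias-free formulation above.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic 2-D data: two overlapping Gaussian clouds (illustrative only)
rng = np.random.RandomState(0)
n_per_class = 100
X_toy = np.vstack([rng.randn(n_per_class, 2) + 1.0,
                   rng.randn(n_per_class, 2) - 1.0])
y_toy = np.array([1] * n_per_class + [-1] * n_per_class)

norms = {}
for C_value in [0.01, 10.0]:
    clf_toy = LinearSVC(C=C_value, loss='hinge', dual=True,
                        fit_intercept=False, max_iter=100000)
    clf_toy.fit(X_toy, y_toy)
    norms[C_value] = float(np.linalg.norm(clf_toy.coef_))
    print("C = {:5.2f}  ||w|| = {:.3f}  margin width = {:.3f}".format(
        C_value, norms[C_value], 2.0 / norms[C_value]))
```

A smaller `C` should give a smaller $\|w\|$ and hence a wider margin.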

In [None]:
from sklearn import svm

In [None]:
def fit_classifier(C_value=1.0):
    clf = svm.LinearSVC(C=C_value, loss='hinge')
    clf.fit(train_data,train_labels)
    
    ## Get predictions on training data
    train_preds = clf.predict(train_data)
    train_error = float(np.sum((train_preds > 0.0) != (train_labels > 0.0)))/len(train_labels)
    
    ## Get predictions on test data
    test_preds = clf.predict(test_data)
    test_error = float(np.sum((test_preds > 0.0) != (test_labels > 0.0)))/len(test_labels)
    return train_error, test_error

In [None]:
cvals = [0.01,0.1,1.0,10.0,100.0,1000.0]
for c in cvals:
    train_error, test_error = fit_classifier(c)
    print("C: {:.3f} \ttrain-error: {:0.3f} test-error: {:0.3f}".format(c, train_error, test_error))

### Evaluating C by k-fold cross-validation

In [None]:
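def cross_validation_error(x, y, C_value, k):
    """Estimate the error of a linear SVM with parameter C_value by k-fold cross-validation on (x, y)."""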
    n = len(y)
    fold = int(n/k)
    ## Randomly shuffle indices
    indices = np.random.permutation(n)
    
    ## Initialize error
    err = 0.0
    
    ## Iterate over partitions
    for i in range(k):
        
        ## Split train-test indices
        test_indices = indices[i*fold:(i+1)*fold]
        train_indices = np.setdiff1d(indices, test_indices)
        
        ## Train classifier with parameter C_value
        clf = svm.LinearSVC(C=C_value, loss='hinge')
        clf.fit(x[train_indices], y[train_indices])
        
        ## Get predictions on test partition
        preds = clf.predict(x[test_indices])
        
        ## Compute error
        err += float(np.sum((preds > 0.0) != (y[test_indices] > 0.0)))/len(test_indices)
        
    return err/k

### Hyperparameter (C) optimisation

The procedure **cross_validation_error** (above) evaluates a single candidate value of `C`. We need to use it repeatedly to identify a good `C`. The procedure **choose_parameter**, defined below, does this and is invoked as follows:

* `c, err = choose_parameter(x,y,k)`

where
* `x,y` is the training data
* `k` is the number of folds of cross-validation
* `c` is the chosen value of the parameter `C`
* `err` is the cross-validation error estimate at `c`
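
For comparison, a simpler way to meet the same interface is a plain search over a logarithmic grid of candidates. The sketch below assumes scikit-learn's `cross_val_score` in place of the hand-written cross-validation routine and runs on synthetic data from `make_classification`. The name `grid_search_C` and the candidate grid are hypothetical choices, not the method used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the training data (illustrative only)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

def grid_search_C(x, y, k, candidates):
    """Return (best C, its k-fold CV error) over a fixed list of candidates."""
    best_c, best_err = None, float("inf")
    for c in candidates:
        clf = LinearSVC(C=c, loss='hinge', dual=True, max_iter=10000)
        # cross_val_score returns per-fold accuracy; error = 1 - accuracy
        err = 1.0 - cross_val_score(clf, x, y, cv=k).mean()
        if err < best_err:
            best_c, best_err = c, err
    return best_c, best_err

candidates = np.logspace(-2, 2, 5)  # 0.01, 0.1, 1, 10, 100
best_c, best_err = grid_search_C(X_toy, y_toy, 5, candidates)
print("Choice of C:", best_c)
print("Cross-validation error estimate:", best_err)
```

A fixed grid evaluates each candidate once, at the cost of a coarser resolution than an adaptive search.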

### Recursively narrow down the best C-value

In [None]:
plot_data = []
def zoom_range(x, y, k, c, err, low, hi):
    
    # Base case - plot results
    if hi - low < 0.05:
#         print('found in: [{:.3f} < {:.3f} < {:.3f}]'.format(low, err, hi))
        fig, ax = plt.figure(), plt.gca()
        ax.scatter([x for x, y in plot_data], [y for x, y in plot_data], linewidth=2, color='green')
        ax.set_xscale('log')
        plt.xlabel('C')
        plt.ylabel('Error')
        plt.show()
        return (c, err)
    
    # Reset hyperparam range
    c_space = np.linspace(low, hi, 5)
    err_space = np.zeros(5)
    
    # Get CV error scores
    for i, c in enumerate(c_space):
        err_space[i] = cross_validation_error(x, y, c, k)
        plot_data.append([c, err_space[i]])
#         print('index: {}, error: {:.3f}, C: {:.3f} [{:.3f} - {:.3f}]'.format(i, err_space[i], c, low,  hi))

    # Recursive call - on narrowed down hyperparam space
    
    # If the lowest error is at the left edge of the grid, move the window down: [c/4, 2c]
    if np.argmin(err_space) == 0:
        return zoom_range(x, y, k, c_space[0], err_space[0], c_space[0]/4, c_space[0]*2)
    
    # If the lowest error is at the right edge of the grid, move the window up: [c/2, 4c]
    elif np.argmin(err_space) == 4:
        return zoom_range(x, y, k, c_space[4], err_space[4], c_space[4]/2, c_space[4]*4)
    else:
        # Otherwise zoom in on the interval between the two neighbours of the best grid point
        return zoom_range(x, y, k, c_space[np.argmin(err_space)], err_space[np.argmin(err_space)], 
               c_space[np.argmin(err_space)-1], c_space[np.argmin(err_space)+1])

def choose_parameter(x, y, k):
    return zoom_range(x, y, k, c=0, err=1, low=.1, hi=10)


### Find best C and train model

In [None]:
c, err = choose_parameter(train_data, train_labels, 10)

print("Choice of C: ", c)
print("Cross-validation error estimate: ", err)

## Train the final model on the full training set
clf = svm.LinearSVC(C=c, loss='hinge')
clf.fit(train_data, train_labels)

### Test error

In [None]:
preds = clf.predict(test_data)
error = float(np.sum((preds > 0.0) != (test_labels > 0.0)))/len(test_labels)
print("Test error: ", error)

In [None]:
def sentiment(index):
    """Describe test sentence `index`, flagging it with ***** if it is misclassified."""
    idx = test_inds[index]
    pred = clf.predict([data_mat[idx]])[0]
    if y[idx] != pred:
        return 'Review *****: {}. (label:{}, prediction:{})'.format(sents_processed[idx], y[idx], pred)
    return 'Review clear: {}. (sentiment:{})'.format(sents_processed[idx], pred)

## Print the result for the first 50 test sentences
for i in range(50):
    print(sentiment(i))