we will be predicting the sentiment (positive or negative) of a single sentence taken from a review of a movie, restaurant or product. the dataset consists of 3000 labeled sentences, which we divide into a training set of size 2500 and test set of size 500. we have already used a logistic regression classifier. Now we will use a support vector machine.

In [16]:
import numpy as np
import string
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
with open("../data/full_set.txt") as f:
    content=f.readlines()
    
# remove leading and trailing white space:
content=[x.strip() for x in content]

In [9]:
# separate the sentences from the labels:
sentences=[x.split("\t")[0] for x in content]
labels=[x.split("\t")[1] for x in content]


# transform the labels from '0 versus 1' to '-1 versus 1'
y=np.array(labels, dtype='int8')
y=(2*y)-1

# let us define a function, "full_remove" that takes a string x and a list of characters from a 
# removal_list and returns with all the characters in removal_list replaced by ' ' 

def full_remove(x, removal_list):
    for w in removal_list:
        x=x.replace(w, ' ')
    
    return x

# remove digits
digits=[str(x) for x in range(10)]
digit_less=[full_remove(x, digits) for x in sentences]
    
# remove punctuation
punc_less=[full_remove(x, list(string.punctuation)) for x in digit_less]

# make everything lowercase
sents_lower=[x.lower() for x in punc_less]

# define our stop words
stop_set=set(['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from'])

# remove stop words
sents_split=[x.split() for x in sents_lower]
sents_processed=[" ".join(list(filter(lambda a: a not in stop_set, x))) for x in sents_split]


In [12]:
# transform to bag of words representation
vectorizer=CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=4500)
data_features=vectorizer.fit_transform(sents_processed)

In [13]:
# append '1' to the end of each vector
data_mat=data_features.toarray()

In [14]:
np.random.seed(0)
test_inds=np.append(np.random.choice((np.where(y==-1))[0], 250, replace=False), np.random.choice((np.where(y==1))[0], 250, replace=False))
train_inds=list(set(range(len(labels)))-set(test_inds))
train_data=data_mat[train_inds,]
train_labels=y[train_inds]
test_data=data_mat[test_inds,]
test_labels=y[test_inds]

print("train data: ", train_data.shape)
print("test data: ", test_data.shape)

train data:  (2500, 4500)
test data:  (500, 4500)


## Fitting a support vector machine to the data
In support vector machines, we are given a set of examples and we want to find a weight vector that solves the optimization problem


In [17]:
def fit_classifier(C_value=1.0):
    clf=svm.LinearSVC(C=C_value, loss="hinge")
    clf.fit(train_data, train_labels)
    
#     get predictions on training data
    train_preds=clf.predict(train_data)
    train_error=float(np.sum((train_preds>0.0)!=(train_labels>0.0))/len(train_labels))
    
#     get predictions on test data
    test_preds=clf.predict(test_data)
    test_error=float(np.sum((test_preds>0.0)!=(test_labels>0.0))/len(test_labels))
    
    return train_error, test_error


c_vals=[0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
for c in c_vals:
    train_error, test_error=fit_classifier(c)
    print("Error rate for C = %0.2f: train %0.3f test %0.3f" % (c, train_error, test_error))
    
    

Error rate for C = 0.01: train 0.215 test 0.250
Error rate for C = 0.10: train 0.074 test 0.174
Error rate for C = 1.00: train 0.011 test 0.152




Error rate for C = 10.00: train 0.002 test 0.188




Error rate for C = 100.00: train 0.002 test 0.198




Error rate for C = 1000.00: train 0.003 test 0.206
Error rate for C = 10000.00: train 0.001 test 0.204




### 3. Evaluating C by k-fold cross validation
* As we can see that, the choice of C has very significant impact on the performance of the SVM classifier. We were able to assess this because we have a separate test set. In general however, this is a luxury we wont possess.
* A reasonable way to estimate the error associated with a specific value of "C" is by K fold cross validation
    * partition the training set into "k" equal sized subsets
    * for i=1, 2, ...k, train a classifoer with parameter C
        * average the errors "e_1+e_2+e_3...e_k/k"
        
* The following procedure, cross_validation_error, does exactly this. It takes an input:
    * The training set x, y
    * the value of C to be evaluated
    * the integer K
    
* it returns the estimated error of the classifier for that particular setting of "C"

In [20]:
def cross_validation_error(x, y, C_value, k):
    n=len(y)
    indices=np.random.permutation(n)
#     Initialize error
    err=0.0
#     Iterate over partitions
    for i in range(k):
#         partition indices
        test_indices=indices[int(i*(n/k)):int((i+1)*(n/k)-1)]
        train_indices=np.setdiff1d(indices, test_indices)
#         Train classifier with parameter c
        clf=svm.LinearSVC(C=C_value, loss="hinge")
        clf.fit(x[train_indices], y[train_indices])
        
#         get prediction on the test partition
        preds=clf.predict(x[test_indices])
    
#         compute error
        err+=float(np.sum((preds>0.0) != (y[test_indices]>0.0)))/len(test_indices)
    return err/k


for k in range(2, 10):
    print("Cross Validation Error: ")
    print(cross_validation_error(train_data, train_labels, 1.0, k))

Cross Validation Error: 
0.21577261809447557
Cross Validation Error: 
0.19984003216671284
Cross Validation Error: 




0.19471153846153844
Cross Validation Error: 




0.20040080160320645
Cross Validation Error: 




0.19486696787148594
Cross Validation Error: 




0.18450301468902167
Cross Validation Error: 




0.19745547860499627
Cross Validation Error: 




0.18705273316009463
