# Support Vector Machine Model

In our project, we expect to use Support Vector Machine to train [Cornell Dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/), as comparsion to the performance of LSTM(RNN) version.

## 1 Loading Data

Load tokenized clean corpus generated by `Data_cleaning_saving.ipynb` in Naive Bayes Section.

In [1]:
import os

def open_file(path):
    with open(path, mode='r', errors='replace') as f:
        sentence_list = f.readlines()
    return sentence_list

pos_dir = 'data/pos_sample_tokenized.txt'
neg_dir = 'data/neg_sample_tokenized.txt'
all_dir = 'data/all_sample_tokenized.txt'

# Open text with positive sentences
pos_list = open_file(pos_dir)
# Open text with negative sentences
neg_list = open_file(neg_dir)
# Open text with all sentences
all_list = open_file(all_dir)

Build **bag of words** model for further training.

In [2]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Make labels for sentences
# 1: positive; 0: negative
y = np.hstack((np.ones(len(pos_list)),np.zeros(len(neg_list))))
print(f'Shape of label: {y.shape}')

# Create vectorizer
vectorizer = CountVectorizer(input='content', lowercase=False)

# Fit vectorizer with all processed tokens
# Transform them into sparse vector X
X = vectorizer.fit_transform(all_list).toarray()
print(f'Shape of vectorized dataset: {X.shape}')

Shape of label: (10655,)
Shape of vectorized dataset: (10655, 11952)


Split training set and test set.

The ratio is same as that in LSTM, which is 80% for training and 20% for testing.

In [3]:
from sklearn.model_selection import train_test_split

# Define dataset test split ratio
RANDOMNESS_SEED = 42
DATASET_TEST_SPLIT_RATIO = 0.2

Xtr, Xts, ytr, yts = train_test_split(X, y, 
                                      test_size=DATASET_TEST_SPLIT_RATIO, 
                                      random_state=RANDOMNESS_SEED, 
                                      shuffle=True)

print(f'Number of features: {Xtr.shape[1]}')
print(f'Number of training texts: {Xtr.shape[0]}')
print(f'Number of training texts: {Xts.shape[0]}')

Number of features: 11952
Number of training texts: 8524
Number of training texts: 2131


## 2 Support Vector Machine Implementation

### 2.1 Feature Selection Method 1

#### 2.1.1 Feature Selection

Selecting 3000 words as feature, based on appearing frequency in the overall data set.

In [4]:
from sklearn import svm
from sklearn.metrics import precision_score, recall_score
import pandas as pd

def show_features(feature_num, index, display_num, vectorizer=vectorizer):
    print(f'First 20 selected features in total of {feature_num} features:')
    j = 0 # List index
    for i in index[:display_num]:
        j += 1
        print(f'%02d. {vectorizer.get_feature_names()[i]}' % j)

In [6]:
# Select 3000 words
feature_num_1 = 3000

# Calculate frequency of each words in the overall dataset
frequency = np.sum(X, axis=0)

# Find the indices of the most frequent words
indices_1 = np.argsort(frequency)[-1:(-1*feature_num_1-1):-1]

# Training and testing data with only the selected feature
Xtr_1 = Xtr[:, indices_1]
Xts_1 = Xts[:, indices_1]

# Show selected features
show_features(feature_num=feature_num_1, index=indices_1, display_num=20)

First 20 selected features in total of 3000 features:
01. film
02. movi
03. like
04. one
05. make
06. stori
07. charact
08. time
09. comedi
10. good
11. even
12. much
13. work
14. perform
15. feel
16. way
17. get
18. littl
19. look
20. love


#### 2.1.2 Fitting SVM

The regularization coefficients we selected are 1, 5, 10, 50, 100, 200, 500, 1000, 1500, 2000.

In [7]:
# List of regularization coefficients
list_C = [1, 5, 10, 50, 100, 200, 500, 1000, 1500, 2000]

# Fitting SVMs with different regularization coefficients
classifiers = []
for C in list_C:
    svc = svm.SVC(C=C)
    svc.fit(Xtr_1, ytr)
    print(f'Fitting complete with C = {C}')
    classifiers.append(svc)

Fitting complete with C = 1
Fitting complete with C = 5
Fitting complete with C = 10
Fitting complete with C = 50
Fitting complete with C = 100
Fitting complete with C = 200
Fitting complete with C = 500
Fitting complete with C = 1000
Fitting complete with C = 1500
Fitting complete with C = 2000


Saving performance data.


In [9]:
# Accuracy, recall and precision scores
nc = len(list_C)
score_train = np.zeros(nc)
score_test = np.zeros(nc)
recall_test = np.zeros(nc)
precision_test = np.zeros(nc)

for count in range(nc):
    # Prediction
    yhat = classifiers[count].predict(Xts_1)
    
    # Results report
    score_train[count] = svc.score(Xtr_1, ytr)
    score_test[count]= svc.score(Xts_1, yts)
    recall_test[count] = recall_score(yts, yhat)
    precision_test[count] = precision_score(yts, yhat)

    print(f'Report complete for SVM with C = {list_C[count]}')

Report complete for SVM with C = 1
Report complete for SVM with C = 5
Report complete for SVM with C = 10
Report complete for SVM with C = 50
Report complete for SVM with C = 100
Report complete for SVM with C = 200
Report complete for SVM with C = 500
Report complete for SVM with C = 1000
Report complete for SVM with C = 1500
Report complete for SVM with C = 2000


#### 2.1.3 Performance Report

In [14]:
# Build performance matrix
matrix = np.matrix(np.c_[list_C, score_train, score_test, recall_test, precision_test])
# For better looking, save the matrix as pandas dataframe
models = pd.DataFrame(data = matrix, columns = ['C', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision'])

print(f'Performance of SVM using {feature_num_1} words with highest frequency.')
print(models)

print('\nConfiguration that achieves the best performance')
best_index = models['Test Precision'].idxmax()
models.iloc[best_index, :]

Performance of SVM using 3000 words with highest frequency.
        C  Train Accuracy  Test Accuracy  Test Recall  Test Precision
0     1.0        0.999531       0.695917     0.732276        0.745489
1     5.0        0.999531       0.695917     0.747201        0.738249
2    10.0        0.999531       0.695917     0.751866        0.730072
3    50.0        0.999531       0.695917     0.712687        0.712687
4   100.0        0.999531       0.695917     0.708022        0.704735
5   200.0        0.999531       0.695917     0.698694        0.700655
6   500.0        0.999531       0.695917     0.698694        0.697393
7  1000.0        0.999531       0.695917     0.698694        0.697393
8  1500.0        0.999531       0.695917     0.698694        0.697393
9  2000.0        0.999531       0.695917     0.698694        0.697393

Configuration that achieves the best performance


C                 1.000000
Train Accuracy    0.999531
Test Accuracy     0.695917
Test Recall       0.732276
Test Precision    0.745489
Name: 0, dtype: float64

### 2.2 Feature Selection Method 2

#### 2.2.1 Feature Selection

Selecting 3000 words as feature, based on Information Gain (IG) in the overall data set.

Information Gain value is calculated in Naive Bayes section, saved as `IG.txt`.

In [None]:
# Select 3000 words
feature_num_2 = 3000

# Read IG value
IG_dir = 'data/IG.txt'
IG = []
with open(IG_dir, 'r') as f:
    IG = f.readlines()
IG = np.array(IG).astype(np.float64)

# Find the indices of the most frequent words
indices_2 = np.argsort(IG)[-1:(-1*feature_num_2-1):-1]

# Training and testing data with only the selected feature
Xtr_2 = Xtr[:, indices_2]
Xts_2 = Xts[:, indices_2]

# Show selected features
show_features(feature_num=feature_num_2, index=indices_2, display_num=20)

First 20 selected features in total of 3000 features:
01. bad
02. bore
03. beauti
04. dull
05. perform
06. wast
07. joke
08. movi
09. heart
10. move
11. best
12. human
13. cultur
14. titl
15. film
16. examin
17. flat
18. entertain
19. unfunni
20. funni


#### 2.2.2 Fitting SVM

In [None]:
# List of regularization coefficients
list_C = [1, 5, 10, 50, 100, 200, 500, 1000, 1500, 2000]

# Fitting SVMs with different regularization coefficients
classifiers_2 = []
for C in list_C:
    svc = svm.SVC(C=C)
    svc.fit(Xtr_2, ytr)
    print(f'Fitting complete with C = {C}')
    classifiers_2.append(svc)

Fitting complete with C = 1
Fitting complete with C = 5
Fitting complete with C = 10
Fitting complete with C = 50
Fitting complete with C = 100
Fitting complete with C = 200
Fitting complete with C = 500
Fitting complete with C = 1000
Fitting complete with C = 1500
Fitting complete with C = 2000


Saving performance data.

In [None]:
# Accuracy, recall and precision scores
nc = len(list_C)
score_train_2 = np.zeros(nc)
score_test_2 = np.zeros(nc)
recall_test_2 = np.zeros(nc)
precision_test_2 = np.zeros(nc)

for count in range(nc):
    # Prediction
    yhat = classifiers_2[count].predict(Xts_2)
    
    # Results report
    score_train_2[count] = svc.score(Xtr_2, ytr)
    score_test_2[count]= svc.score(Xts_2, yts)
    recall_test_2[count] = recall_score(yts, yhat)
    precision_test_2[count] = precision_score(yts, yhat)

    print(f'Report complete for SVM with C = {list_C[count]}')

Report complete for SVM with C = 1
Report complete for SVM with C = 5
Report complete for SVM with C = 10
Report complete for SVM with C = 50
Report complete for SVM with C = 100
Report complete for SVM with C = 200
Report complete for SVM with C = 500
Report complete for SVM with C = 1000
Report complete for SVM with C = 1500
Report complete for SVM with C = 2000


#### 2.2.3 Performance Report

In [None]:
# Build performance matrix
matrix_2 = np.matrix(np.c_[list_C, score_train_2, score_test_2, recall_test_2, precision_test_2])
# For better looking, save the matrix as pandas dataframe
models_2 = pd.DataFrame(data = matrix_2, columns = ['C', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision'])

print(f'Performance of SVM using {feature_num_2} words with highest information gain.')
print(models_2)

print('\nConfiguration that achieves the best performance')
best_index_2 = models_2['Test Precision'].idxmax()
models_2.iloc[best_index_2, :]

Performance of SVM using 3000 words with highest information gain.
        C  Train Accuracy  Test Accuracy  Test Recall  Test Precision
0     1.0        0.995307       0.693102     0.753731        0.756554
1     5.0        0.995307       0.693102     0.756530        0.750926
2    10.0        0.995307       0.693102     0.747201        0.743043
3    50.0        0.995307       0.693102     0.721082        0.693896
4   100.0        0.995307       0.693102     0.720149        0.686833
5   200.0        0.995307       0.693102     0.720149        0.685613
6   500.0        0.995307       0.693102     0.720149        0.685613
7  1000.0        0.995307       0.693102     0.720149        0.685613
8  1500.0        0.995307       0.693102     0.720149        0.685613
9  2000.0        0.995307       0.693102     0.720149        0.685613

Configuration that achieves the best performance


C                 1.000000
Train Accuracy    0.995307
Test Accuracy     0.693102
Test Recall       0.753731
Test Precision    0.756554
Name: 0, dtype: float64