# NOTEBOOK 04c: SUPPORT VECTOR MACHINE

Support Vector Machine algorithms determine a hyperplane that separates classes with the largest possible margin. The support vectors identified during model fitting are the training points closest to this plane, and they alone define it. When the classes are not linearly separable in the original n-dimensional feature space, a kernel function implicitly maps the features into a higher-dimensional space in which a linear separation becomes possible. While this algorithm inherently separates data geometrically, we will also perform the modeling on SVD-transformed data in hopes of increasing the clustering of similar data points, and thereby increasing the performance of our model.
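To make the idea of lifting features into a higher dimension concrete, here is a minimal sketch on synthetic data; it is not part of this notebook's pipeline. Two concentric rings cannot be split by a straight line in 2-D, but adding one hand-built feature (the squared radius) makes them separable by a plane. The toy data and the names `rings`, `raw_acc`, and `lifted_acc` are assumptions for illustration. A kernel such as RBF performs a mapping of this kind implicitly rather than by adding a column.

```python
import numpy as np
from sklearn import svm

# Synthetic data (assumed for illustration): two noiseless concentric rings.
angles = np.linspace(0, 2 * np.pi, 50, endpoint=False)
unit_circle = np.column_stack([np.cos(angles), np.sin(angles)])
rings = np.vstack([unit_circle, 3 * unit_circle])  # radius 1 and radius 3
labels = np.array([0] * 50 + [1] * 50)

# A linear SVM on the raw 2-D features cannot separate the rings.
raw_acc = svm.SVC(kernel='linear').fit(rings, labels).score(rings, labels)

# Add the squared radius as a third feature; a plane at a fixed
# height along that axis now separates the two classes.
lifted = np.column_stack([rings, (rings ** 2).sum(axis=1)])
lifted_acc = svm.SVC(kernel='linear').fit(lifted, labels).score(lifted, labels)

print(raw_acc, lifted_acc)
```

The same linear classifier is used in both fits; only the feature space changes.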

In [1]:
import time
import pandas as pd
import numpy as np

from scipy import sparse
from sklearn import svm
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

np.random.seed(42)

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

%matplotlib inline

In [2]:
!ls '../assets'

1544988010_comments_df.csv     1545277666_y_train.csv
1545241316_clean_target.csv    1545336727_SVD_col.csv
1545241316_clean_text.csv      1545336727_XtestSVD_coo.npz
1545266972_clean_text.csv      1545336727_XtrainSVD_coo.npz
1545266972_cvec_coo.npz        cvec_1545266972_coo_col.csv
1545272821_eda_words.csv       file_log.txt
1545277666_tfidf_col.csv       test_1545277666_tfidf_coo.npz
1545277666_y_test.csv          train_1545277666_tfidf_coo.npz


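Loading in the data. The first block below loads the TF-IDF features; the cells after it load the SVD-transformed features, which are the ones the model is fit on.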

columns = pd.read_csv('../assets/1545277666_tfidf_col.csv', na_filter=False, header=None)
cols = np.array(columns[0])

X_train_tfidf_coo = sparse.load_npz('../assets/train_1545277666_tfidf_coo.npz')
X_train_tfidf = pd.SparseDataFrame(X_train_tfidf_coo, columns=cols)

X_test_tfidf_coo = sparse.load_npz('../assets/test_1545277666_tfidf_coo.npz')
X_test_tfidf = pd.SparseDataFrame(X_test_tfidf_coo, columns=cols)

X_train_tfidf.fillna(0, inplace=True)
X_test_tfidf.fillna(0, inplace=True)

y_train = pd.read_csv('../assets/1545277666_y_train.csv', header=None)
y_test = pd.read_csv('../assets/1545277666_y_test.csv', header=None)

In [None]:
columns = pd.read_csv('../assets/1545336727_SVD_col.csv', na_filter=False, header=None)
cols = np.array(columns[0])

In [None]:
XtrainSVD_coo = sparse.load_npz('../assets/1545336727_XtrainSVD_coo.npz')
X_train_svd = pd.SparseDataFrame(XtrainSVD_coo, columns=cols)

In [None]:
XtestSVD_coo = sparse.load_npz('../assets/1545336727_XtestSVD_coo.npz')
X_test_svd = pd.SparseDataFrame(XtestSVD_coo, columns=cols)

In [None]:
X_train_svd.fillna(0, inplace=True)
X_test_svd.fillna(0, inplace=True)
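The loading cells above rely on `pd.SparseDataFrame`, which was removed in pandas 1.0. On a newer pandas, one possible equivalent is `pd.DataFrame.sparse.from_spmatrix`. The sketch below assumes that route and uses a tiny stand-in matrix, a temporary file, and hypothetical column names (`feat_a`, `feat_b`, `feat_c`) rather than the project's assets.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse

# Tiny stand-in matrix and column names (hypothetical, for illustration only).
demo_coo = sparse.coo_matrix(np.array([[0.0, 1.5, 0.0],
                                       [2.0, 0.0, 0.0]]))
demo_cols = ['feat_a', 'feat_b', 'feat_c']

# Round-trip through an .npz file, mirroring how the notebook reads its assets.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_coo.npz')
    sparse.save_npz(demo_path, demo_coo)
    loaded = sparse.load_npz(demo_path)

# Build a DataFrame with sparse columns; unstored entries read back as 0,
# so no fillna step is needed afterwards.
demo_df = pd.DataFrame.sparse.from_spmatrix(loaded, columns=demo_cols)
print(demo_df.shape)
```

scikit-learn estimators also accept SciPy sparse matrices directly, so the DataFrame step is only needed when labeled columns are wanted.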

In [None]:
y_train = pd.read_csv('../assets/1545277666_y_train.csv', header=None)
y_test = pd.read_csv('../assets/1545277666_y_test.csv', header=None)

In [None]:
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

Instantiating the Support Vector Classifier. The hyperparameter of interest for a baseline run is C, the inverse of the regularization strength applied to the error term. C controls the bias/variance trade-off of the model. If C is large, misclassified training points are penalized heavily and the margin around the decision boundary will be narrow; if C is small, more margin violations are tolerated and the margin will be wide. Which setting generalizes better depends on how much variance the unseen data contains. Since we expect dense, grouped data as a result of SVD, a relatively high C should give a clean separation between classes. The other important hyperparameter is the kernel, which defines the implicit feature space in which the separating hyperplane is found.
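The effect of C on the margin can be checked on toy data. The following is a rough sketch, assuming a linear kernel and synthetic overlapping blobs (`X_toy`, `y_toy`, and `margin_widths` are illustrative names, not part of this notebook's data). For a linear SVM the margin width is 2 / ||w||, so it can be read off the fitted coefficients.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Synthetic, overlapping two-class data (assumed for illustration).
X_toy, y_toy = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                          cluster_std=1.0, random_state=42)

margin_widths = {}
for c_value in (0.01, 100.0):
    clf = svm.SVC(kernel='linear', C=c_value).fit(X_toy, y_toy)
    # Margin width of a linear SVM is 2 / ||w||.
    margin_widths[c_value] = 2 / np.linalg.norm(clf.coef_)

# The small-C model should report the wider margin.
print(margin_widths)
```

With a non-linear kernel such as RBF the weight vector is not available through `coef_`, so this direct measurement only applies to the linear case.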

In [9]:
svc = svm.SVC(gamma='auto', random_state=42)

In [10]:
params = {
    'C': [0.90, 0.95, 1],
}

gs = GridSearchCV(
    svc,
    param_grid=params,
    cv=3,
    verbose=2,
)

In [11]:
gs.fit(X_train_svd, y_train)

Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV] C=0.9 ...........................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


KeyboardInterrupt: 

In [80]:
gs.best_params_

{'C': 0.9, 'gamma': 0.001, 'kernel': 'rbf'}

In [None]:
gs.score(X_train_svd, y_train)

In [None]:
gs.score(X_test_svd, y_test)

In [82]:
gs.best_score_

0.9875311720698254

In [81]:
preds = gs.predict(X_test_svd)

In [None]:
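# SVC only exposes predict_proba when built with probability=True;
# decision_function returns per-class scores without that setting.
gs.decision_function(X_test_svd)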

In [74]:
results = pd.DataFrame(gs.predict(X_test_svd), columns=['predicted'])

results['true'] = y_test

In [75]:
# Check out first five rows.
results.head()

|   | predicted | true |
|---|-----------|------|
| 0 | 6 | 6 |
| 1 | 9 | 9 |
| 2 | 3 | 3 |
| 3 | 7 | 7 |
| 4 | 2 | 2 |


In [76]:
# Find all indices where predicted and true results 
# aren't the same, then save in an array.
row_ids = results[results['predicted'] != results['true']].index
print(row_ids)


Int64Index([133, 149, 159, 431, 516, 557], dtype='int64')
