# Support Vector Machine

Because my logistic regression performed so well, I wanted to see whether a Support Vector Machine (SVM) could improve the model's classification accuracy, even slightly.

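A support vector machine works by finding the maximum-margin hyperplane: the decision boundary that separates the two classes with the largest possible distance to the nearest training points. With a kernel, the SVM can fit this hyperplane in a higher-dimensional feature space, which lets it find boundaries that are nonlinear in the original features.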
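As a sketch of what "maximum margin" means formally, the standard soft-margin formulation is shown below. This is textbook notation rather than anything taken from this notebook, and it assumes the class labels are recoded as $y_i \in \{-1, +1\}$.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,
\qquad \xi_i \ge 0 .
```

The margin width is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin, while $C$ controls how heavily margin violations $\xi_i$ are penalized. The feature map $\phi$ is implied by the kernel; for the RBF kernel, $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^{2})$. These $C$ and $\gamma$ are the `svc__C` and `svc__gamma` hyperparameters searched over in the grid search.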

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, linear_model
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
# Read in DataFrame
df = pd.read_csv('../Data/casual_political.csv')

In [3]:
# Set predictor and resulting variables
X = df['lemma_totaltext']
y = df['binary_sub']

In [4]:
# Train/test split using the same random_state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=16)

In [5]:
# Creating scaffolding for hyperparameter gridsearch pipeline
pipe = Pipeline([
    ('tvec', TfidfVectorizer(analyzer='word', stop_words='english')),
    ('svc', svm.SVC())
])

In [9]:
# Creating parameters to grid search over
gamma_range = np.logspace(-5, 2, 10)
C_range = np.logspace(-3, 2, 10)
kernel_range = ['rbf', 'sigmoid', 'linear', 'poly']

param_grid = {
    'svc__gamma': gamma_range,
    'svc__C': C_range,
    'svc__kernel': kernel_range,
}

grid = GridSearchCV(pipe, param_grid, cv=3,
                    scoring='accuracy', verbose=1, n_jobs=-1)

# Fitting training data for SVM
grid.fit(X_train, y_train)

Fitting 3 folds for each of 400 candidates, totalling 1200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 17.4min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 79.0min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 175.8min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 289.5min
[Parallel(n_jobs=-1)]: Done 1200 out of 1200 | elapsed: 428.5min finished


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('tvec',
                                        TfidfVectorizer(stop_words='english')),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': array([1.00000000e-03, 3.59381366e-03, 1.29154967e-02, 4.64158883e-02,
       1.66810054e-01, 5.99484250e-01, 2.15443469e+00, 7.74263683e+00,
       2.78255940e+01, 1.00000000e+02]),
                         'svc__gamma': array([1.00000000e-05, 5.99484250e-05, 3.59381366e-04, 2.15443469e-03,
       1.29154967e-02, 7.74263683e-02, 4.64158883e-01, 2.78255940e+00,
       1.66810054e+01, 1.00000000e+02]),
                         'svc__kernel': ['rbf', 'sigmoid', 'linear', 'poly']},
             scoring='accuracy', verbose=1)
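One observation about this grid: in scikit-learn's `SVC`, `gamma` is only used by the `rbf`, `poly`, and `sigmoid` kernels, so the linear kernel is refit for ten gamma values that have no effect on it. The sketch below is not what was run here. It counts the candidates in the original grid and in a hypothetical alternative that splits the grid by kernel. The names `full_grid`, `split_grid`, and `count_candidates` are made up for illustration, and the ranges are rebuilt in plain Python to approximate the `np.logspace` calls.

```python
from itertools import product

# Approximately the same 10-point log-spaced ranges as
# np.logspace(-3, 2, 10) and np.logspace(-5, 2, 10), in plain Python.
C_range = [10 ** (-3 + 5 * i / 9) for i in range(10)]
gamma_range = [10 ** (-5 + 7 * i / 9) for i in range(10)]

# Original grid: every kernel is crossed with every gamma value.
full_grid = {
    'svc__kernel': ['rbf', 'sigmoid', 'linear', 'poly'],
    'svc__C': C_range,
    'svc__gamma': gamma_range,
}

# Hypothetical alternative: a list of grids, so the linear kernel
# is only crossed with C and never with gamma.
split_grid = [
    {'svc__kernel': ['linear'], 'svc__C': C_range},
    {'svc__kernel': ['rbf', 'sigmoid', 'poly'],
     'svc__C': C_range, 'svc__gamma': gamma_range},
]


def count_candidates(grid):
    """Count parameter combinations in one dict grid or a list of dict grids."""
    grids = [grid] if isinstance(grid, dict) else grid
    return sum(len(list(product(*g.values()))) for g in grids)


n_full = count_candidates(full_grid)
n_split = count_candidates(split_grid)
print(n_full, n_split)  # 400 310
```

With `cv=3`, the split grid would mean 930 fits instead of 1200. `GridSearchCV` accepts a list of dictionaries for `param_grid`, so `split_grid` could be passed in place of the original dictionary.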

In [10]:
print(grid.best_params_)
print(grid.best_score_)

{'svc__C': 7.742636826811277, 'svc__gamma': 0.4641588833612782, 'svc__kernel': 'rbf'}
0.9896547347663621


In [12]:
grid.score(X_test, y_test)

0.9877866400797607

This model matched the accuracy of the random forest and logistic regression (test accuracy of about 0.988), but the grid search took 428.5 minutes (roughly 7 hours) to run. Given that cost and no gain in accuracy, the best parameters are not worth analyzing further.
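To put the time cost in perspective, here is a rough back-of-the-envelope calculation using only the figures from the joblib log above (428.5 minutes, 1200 fits, 4 workers). The per-worker figure assumes all four workers were busy for the whole run, which is an approximation.

```python
# Figures copied from the GridSearchCV / joblib log above.
total_minutes = 428.5
n_fits = 1200
n_workers = 4

total_hours = total_minutes / 60

# Average wall-clock time per fit, with the workers running in parallel.
wall_seconds_per_fit = total_minutes * 60 / n_fits

# Rough compute time per fit, assuming all workers were busy throughout.
worker_seconds_per_fit = wall_seconds_per_fit * n_workers

print(f"{total_hours:.1f} h total")
print(f"{wall_seconds_per_fit:.1f} s wall-clock per fit")
print(f"{worker_seconds_per_fit:.1f} s of worker time per fit (approx.)")
```

So each pipeline fit on a cross-validation fold cost on the order of a minute or more of compute, which adds up quickly over 1200 fits.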

Had there been less data, the SVM might have been more accurate than the other models. With sufficient data, however, the SVM is not worth the extra training time.