# Support Vector Machine (SVM) Modeling

In this notebook, we iterate on an SVM baseline, trying different remedies for class imbalance. We also run a grid search to try to optimize the baseline's hyperparameters.

The idea behind SVMs is to classify by finding the separating line or (in higher dimensions) 'hyperplane' that maximizes the margin, i.e. the distance between the boundary and the nearest points of each class.
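To make the margin idea concrete, here is a minimal sketch in plain Python. The weight vector `w`, the offset `b`, and the two toy points are hypothetical values chosen for illustration; they are not taken from the models trained in this notebook.

```python
import math

# Hypothetical separating line w.x + b = 0 in 2D (illustrative values only).
w = (1.0, 1.0)
b = -3.0

# Toy points with labels in {-1, +1}; both sit exactly on the margin.
points = [((1.0, 1.0), -1), ((2.0, 2.0), +1)]


def decision(x):
    """Signed score w.x + b; its sign is the predicted class."""
    return w[0] * x[0] + w[1] * x[1] + b


norm_w = math.hypot(*w)

for x, y in points:
    score = decision(x)
    distance = abs(score) / norm_w  # geometric distance to the boundary
    print(x, "label", y, "score", score, "distance", round(distance, 4))

# When the closest points have |score| == 1, the margin width is 2 / ||w||.
margin_width = 2.0 / norm_w
print("margin width:", round(margin_width, 4))
```

An SVM searches for the `w` and `b` that make this width as large as possible while keeping the classes on opposite sides, with `C` controlling how much violation is tolerated.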

In [19]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns; sns.set()
%matplotlib inline
import nltk
from sklearn.feature_extraction import text 
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn import metrics, model_selection, svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc, classification_report
import pickle

In [20]:
# use a context manager so the file handle is closed after loading
with open('X_lem.pkl', 'rb') as f:
    X_lem = pickle.load(f)
y_lem = pd.read_pickle('y_lem.pkl')

In [21]:
# setting up stop words
stop_words = set(stopwords.words('english'))

## Train-Test Split & Vectorize

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X_lem, y_lem, test_size=0.20, random_state=15)

# Initialize vectorizer correctly
tfidf = TfidfVectorizer(stop_words=list(stop_words), ngram_range=(1, 2))

# Transform text
tfidf_data_train = tfidf.fit_transform(X_train)
tfidf_data_test = tfidf.transform(X_test)

## SVM Baseline

SVM Hyperparameters:
- `C` is the regularization parameter; the default is `1.0`. The strength of the regularization is inversely proportional to `C`.
- `kernel` specifies the kernel type used by the algorithm; the default is `rbf`. Kernels are different ways of drawing boundaries around classes: `linear` gives a straight boundary, while `rbf`, `poly` and `sigmoid` give non-linear ones.
- `degree` is the degree of the polynomial kernel function (`poly`); it is ignored by all other kernels.
- `gamma` is the kernel coefficient for `rbf`, `poly` and `sigmoid`; the default is `scale`.
- `class_weight` defaults to `None`, which gives every class a weight of 1. If `'balanced'`, it uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as `n_samples / (n_classes * np.bincount(y))`.
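As a rough illustration of the `balanced` formula, the sketch below applies it by hand. The class counts are an assumption derived from the Tomek Links output in this notebook (18627 majority samples after 48 were removed, and 1151 minority samples); the notebook itself never prints the training class counts.

```python
from collections import Counter

# Training-split class counts, derived from the Tomek Links output in this
# notebook (18627 + 48 removed = 18675 majority, 1151 minority). Assumption.
y_counts = Counter({0: 18675, 1: 1151})

n_samples = sum(y_counts.values())
n_classes = len(y_counts)

# Same formula as n_samples / (n_classes * np.bincount(y)), one class at a time.
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in y_counts.items()
}

for label, weight in sorted(balanced_weights.items()):
    print("class", label, "weight", round(weight, 4))
```

The minority class ends up with a weight well above 1 and the majority class with a weight below 1, so errors on hate speech are penalized more heavily during training.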


In [23]:
# note: `degree` and `gamma` have no effect with kernel='linear'
SVM_baseline = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto', class_weight='balanced', random_state=20)

In [24]:
%%time 
# fit the training dataset on the classifier
SVM_baseline.fit(tfidf_data_train, y_train)
# predict the labels on the test dataset
SVM_test_preds = SVM_baseline.predict(tfidf_data_test)


CPU times: total: 1min 42s
Wall time: 1min 58s


In [25]:
baseline_precision = precision_score(y_test, SVM_test_preds)
baseline_recall = recall_score(y_test, SVM_test_preds)
baseline_f1_score = f1_score(y_test, SVM_test_preds)
baseline_weighted_f1_score = f1_score(y_test, SVM_test_preds, average='weighted')

In [26]:
# printing evaluation metrics up to 4th decimal place
print('Testing Metrics for SVM Baseline with Lemmatization & TF-IDF Vectorization')
print('Precision: {:.4}'.format(baseline_precision))
print('Recall: {:.4}'.format(baseline_recall))
print('F1 Score: {:.4}'.format(baseline_f1_score))
print('Weighted F1 Score: {:.4}'.format(baseline_weighted_f1_score))

Testing Metrics for SVM Baseline with Lemmatization & TF-IDF Vectorization
Precision: 0.3609
Recall: 0.4373
F1 Score: 0.3955
Weighted F1 Score: 0.9281
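The precision, recall and F1 values above can be reproduced by hand from confusion-matrix counts. The counts in this sketch are not printed anywhere in the notebook; they are inferred from the reported precision and recall together with the class 1 support of 279 in the test set, so treat them as an assumption.

```python
# Class 1 (hate speech) counts inferred from the reported metrics. Assumption.
tp, fp, fn = 122, 216, 157  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # share of predicted hate speech that is correct
recall = tp / (tp + fn)     # share of actual hate speech that is found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Precision: {:.4}".format(precision))
print("Recall: {:.4}".format(recall))
print("F1 Score: {:.4}".format(f1))
```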


In [27]:
# creating dictionary with all metrics
metric_dict = {}
metric_dict['Baseline SVM'] = {'precision': baseline_precision, 'recall': baseline_recall, 'f1_score': baseline_f1_score, 'weighted_f1': baseline_weighted_f1_score}

## Baseline with SMOTE
SMOTE is used to over-sample the minority class (hate speech).
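The core step of SMOTE is to create a synthetic minority sample on the line segment between an existing minority sample and one of its minority-class neighbors. The sketch below shows only that interpolation step on hypothetical feature vectors; it is a simplified illustration, not the `imblearn` implementation, and it skips the nearest-neighbor search.

```python
import random


def interpolate(sample, neighbor, rng):
    """Return a synthetic point on the segment between sample and neighbor."""
    lam = rng.random()  # uniform in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(35)

# Hypothetical minority-class feature vectors (e.g. three TF-IDF weights).
minority_sample = [0.2, 0.0, 0.5]
minority_neighbor = [0.4, 0.1, 0.5]

synthetic = interpolate(minority_sample, minority_neighbor, rng)
print(synthetic)
```

Because every synthetic point lies between two real minority points, SMOTE fills in the minority region rather than duplicating rows.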

In [29]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=35)
smote_X_train, smote_y_train = sm.fit_resample(tfidf_data_train, y_train)


In [30]:
smote_SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto', random_state=15)

In [31]:
%%time
smote_SVM.fit(smote_X_train, smote_y_train)
smote_SVM_test_preds = smote_SVM.predict(tfidf_data_test)

CPU times: total: 6min 59s
Wall time: 8min 54s


In [32]:
smote_precision = precision_score(y_test, smote_SVM_test_preds)
smote_recall = recall_score(y_test, smote_SVM_test_preds)
smote_f1_score = f1_score(y_test, smote_SVM_test_preds)
smote_weighted_f1_score = f1_score(y_test, smote_SVM_test_preds, average='weighted')

In [33]:
# printing evaluation metrics up to 4th decimal place
print('Testing Metrics for Oversampled SVM Baseline with Lemmatization')
print('Precision: {:.4}'.format(smote_precision))
print('Recall: {:.4}'.format(smote_recall))
print('F1 Score: {:.4}'.format(smote_f1_score))
print('Weighted F1 Score: {:.4}'.format(smote_weighted_f1_score))

Testing Metrics for Oversampled SVM Baseline with Lemmatization
Precision: 0.3422
Recall: 0.276
F1 Score: 0.3056
Weighted F1 Score: 0.9258


SMOTE actually decreased the unweighted F1 (0.3056 vs. 0.3955 for the baseline), which also happened with Logistic Regression.

In [34]:
# adding these metrics to evaluation metric dict
metric_dict['Baseline SVM Oversampled with SMOTE'] = {'precision': smote_precision, 'recall': smote_recall, 'f1_score': smote_f1_score, 'weighted_f1': smote_weighted_f1_score}

## Baseline with Tomek Links
Tomek Links is used to under-sample the majority class (not hate speech).
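A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbor; under-sampling drops the majority-class member of each pair. The sketch below finds such pairs by brute force on hypothetical 2D points. It is a simplified illustration of the idea, not the `imblearn` implementation.

```python
import math


def nearest(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    return min(others, key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j) of mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and labels[i] != labels[j] and nearest(j, points) == i:
            links.append((i, j))
    return links


# Hypothetical toy data: 0 = majority class, 1 = minority class.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
labels = [0, 0, 0, 1, 0]

links = tomek_links(points, labels)

# Drop the majority-class member of every link.
to_drop = {idx for pair in links for idx in pair if labels[idx] == 0}
kept_labels = [lab for idx, lab in enumerate(labels) if idx not in to_drop]

print("links:", links)
print("kept labels:", kept_labels)
```

Only majority samples that sit right next to a minority sample are removed, which is why this method tends to remove few rows.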

In [35]:
from collections import Counter
from imblearn.under_sampling import TomekLinks

tl = TomekLinks()
tomek_X_train, tomek_y_train = tl.fit_resample(tfidf_data_train, y_train)
print('Resampled dataset shape %s' % Counter(tomek_y_train))

Resampled dataset shape Counter({0: 18627, 1: 1151})


Tomek Links removed only 48 samples from the majority class.

In [36]:
tomek_SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto', random_state=15)

In [37]:
%%time
tomek_SVM.fit(tomek_X_train, tomek_y_train)
# note: despite the `logreg` in its name, this variable holds the SVM predictions
tomek_logreg_test_preds = tomek_SVM.predict(tfidf_data_test)

CPU times: total: 1min 14s
Wall time: 1min 41s


In [38]:
tomek_precision = precision_score(y_test, tomek_logreg_test_preds)
tomek_recall = recall_score(y_test, tomek_logreg_test_preds)
tomek_f1_score = f1_score(y_test, tomek_logreg_test_preds)
tomek_weighted_f1_score = f1_score(y_test, tomek_logreg_test_preds, average='weighted')

In [39]:
# printing evaluation metrics up to 4th decimal place
print('Testing Metrics for Undersampled SVM Baseline with Lemmatization')
print('Precision: {:.4}'.format(tomek_precision))
print('Recall: {:.4}'.format(tomek_recall))
print('F1 Score: {:.4}'.format(tomek_f1_score))
print('Weighted F1 Score: {:.4}'.format(tomek_weighted_f1_score))

Testing Metrics for Undersampled SVM Baseline with Lemmatization
Precision: 0.6562
Recall: 0.2258
F1 Score: 0.336
Weighted F1 Score: 0.938


In [40]:
# adding these metrics to evaluation metric dict
metric_dict['Baseline SVM Undersampled with Tomek Links'] = {'precision': tomek_precision, 'recall': tomek_recall, 'f1_score': tomek_f1_score, 'weighted_f1': tomek_weighted_f1_score}

## Metrics for All Baselines

In [41]:
pd.DataFrame.from_dict(metric_dict, orient='index')

| | precision | recall | f1_score | weighted_f1 |
|---|---|---|---|---|
| Baseline SVM | 0.360947 | 0.437276 | 0.395462 | 0.928112 |
| Baseline SVM Oversampled with SMOTE | 0.342222 | 0.275986 | 0.305556 | 0.925813 |
| Baseline SVM Undersampled with Tomek Links | 0.65625 | 0.225806 | 0.336 | 0.937993 |


- The baseline SVM with `class_weight='balanced'` has the highest unweighted F1 (0.395).
- The undersampled (Tomek Links) baseline has a lower unweighted F1 (0.336) but the highest weighted F1 (0.938).
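The gap between the unweighted and weighted F1 comes from how the weighted score is built: it averages the per-class F1 scores, weighting each by its support, so the large class 0 dominates. The sketch below recomputes it for the `class_weight='balanced'` baseline. The confusion counts are an assumption inferred from the reported precision, recall and test supports (4678 and 279); the notebook does not print them.

```python
# Test-set supports and class 1 confusion counts for the balanced baseline.
# The counts are inferred from the reported metrics. Assumption.
support = {0: 4678, 1: 279}
tp1, fp1, fn1 = 122, 216, 157   # class 1 (hate speech)
tp0 = support[0] - fp1          # class 0 samples predicted as class 0
fp0, fn0 = fn1, fp1             # errors swap roles for the other class


def f1(tp, fp, fn):
    """F1 written directly in terms of confusion counts."""
    return 2 * tp / (2 * tp + fp + fn)


f1_by_class = {0: f1(tp0, fp0, fn0), 1: f1(tp1, fp1, fn1)}

total = sum(support.values())
weighted_f1 = sum(support[c] * f1_by_class[c] for c in support) / total
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

print("per-class F1:", {c: round(v, 4) for c, v in f1_by_class.items()})
print("weighted F1:", round(weighted_f1, 4))
print("macro F1:", round(macro_f1, 4))
```

Since class 0 makes up roughly 94% of the test set, the weighted F1 stays high even when the hate speech class is predicted poorly, which is why the unweighted class 1 F1 is the more informative number here.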

We can take a look at each model's classification report to get a better idea about what's happening.

In [42]:
from sklearn.metrics import classification_report
target_names = ['class 0', 'class 1']
# class_weight='balanced' Baseline report
print(classification_report(y_test, SVM_test_preds, target_names=target_names))
# Undersampled Baseline report
print(classification_report(y_test, tomek_logreg_test_preds, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.97      0.95      0.96      4678
     class 1       0.36      0.44      0.40       279

    accuracy                           0.92      4957
   macro avg       0.66      0.70      0.68      4957
weighted avg       0.93      0.92      0.93      4957

              precision    recall  f1-score   support

     class 0       0.96      0.99      0.97      4678
     class 1       0.66      0.23      0.34       279

    accuracy                           0.95      4957
   macro avg       0.81      0.61      0.65      4957
weighted avg       0.94      0.95      0.94      4957



There are some differences. Most notably, the baseline with `class_weight='balanced'` recalls far more of the hate speech (1) class than the undersampled model (0.44 vs. 0.23), although at lower precision (0.36 vs. 0.66).
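To put the trade-off in terms of actual tweets, the sketch below converts the class 1 precision and recall of both models back into counts. The true-positive and false-positive counts are an assumption inferred from the reported metrics and the class 1 support of 279; they are not printed in the notebook.

```python
# Class 1 (hate speech) counts inferred from the reported metrics. Assumption.
support_hate = 279
models = {
    "class_weight='balanced'": {"tp": 122, "fp": 216},
    "Tomek Links": {"tp": 63, "fp": 33},
}

results = {}
for name, counts in models.items():
    tp, fp = counts["tp"], counts["fp"]
    results[name] = {
        "caught": tp,                    # hate speech correctly flagged
        "missed": support_hate - tp,     # hate speech predicted as class 0
        "false_alarms": fp,              # class 0 flagged as hate speech
        "precision": tp / (tp + fp),
        "recall": tp / support_hate,
    }

for name, row in results.items():
    print(name, {k: round(v, 4) for k, v in row.items()})
```

Under these counts, the balanced model catches about twice as many hate speech samples but raises several times more false alarms, so the better choice depends on which error is more costly.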
