# SVM Example

This example uses a slightly modified SMS Spam data set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The original collection contains 5574 SMS messages labeled as ham or spam, in a tab-delimited file whose first column is the label and whose second is the message content. I edited the data set to remove some unwanted columns and add headings; the copy loaded below has 4837 rows.



In [17]:
import pandas as pd
df = pd.read_csv('sms-spam.csv', header=0, usecols=[1,2], encoding='latin-1')
print('rows and columns:', df.shape)
print(df.head())

rows and columns: (4837, 2)
   spam                                               text
0     0  Go until jurong point, crazy.. Available only ...
1     0                      Ok lar... Joking wif u oni...
2     1  Free entry in 2 a wkly comp to win FA Cup fina...
3     0  U dun say so early hor... U c already then say...
4     0  Nah I don't think he goes to usf, he lives aro...


In [18]:
# text preprocessing
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# keep the nltk module name intact and pass a list, which stays compatible
# with scikit-learn versions that validate stop_words as a list
stop_words = list(stopwords.words('english'))
vectorizer = TfidfVectorizer(stop_words=stop_words, binary=True)

In [19]:
# set up X and y
X = vectorizer.fit_transform(df.text)
y = df.spam

In [20]:
# divide into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)
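
The cells above fit the vectorizer on every message before splitting, so the vocabulary and IDF weights are computed partly from messages that later land in the test set. One common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a pipeline. The following is a minimal sketch on a made-up toy corpus: the messages, labels, and `toy_`-prefixed names are illustrative assumptions, not the SMS data, and it uses scikit-learn's built-in `'english'` stop list rather than NLTK's to stay self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for df.text / df.spam.
toy_texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free cash prize", "urgent you won a free holiday",
    "are we meeting for lunch", "see you at home tonight",
    "ok I will call you later", "thanks for the lift today",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test messages stay unseen.
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=1234,
    stratify=toy_labels)

# fit() learns the TF-IDF vocabulary from the training text only,
# then trains the SVM on the resulting features.
toy_model = make_pipeline(
    TfidfVectorizer(stop_words='english', binary=True),
    SVC(kernel='linear'),
)
toy_model.fit(toy_train, toy_y_train)

# predict() reuses the already-fitted vectorizer on the test text.
toy_pred = toy_model.predict(toy_test)
print(len(toy_pred), 'predictions for', len(toy_test), 'test messages')
```

With only eight messages the predictions themselves mean little; the point is the order of operations, which would carry over to `df.text` and `df.spam`.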

### Train and test

Fit a linear-kernel SVM on the training split, then evaluate it on the test split.

In [21]:
from sklearn import svm
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [22]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score
pred = classifier.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))

accuracy score:  0.981404958678
precision score:  0.990384615385
recall score:  0.858333333333
f1 score:  0.919642857143
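
The four scores above can be traced back to confusion-matrix counts. Assuming the test split holds 968 messages (20% of the 4837 rows, rounded up), the counts below are the ones that reproduce the printed values. They are inferred from the scores, not printed by the notebook. A minimal sketch of the arithmetic:

```python
# Counts inferred from the reported scores (an assumption, see above):
# true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 103, 1, 17, 847

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct / all test messages
precision = tp / (tp + fp)                   # flagged messages that are spam
recall = tp / (tp + fn)                      # spam messages that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy:  {accuracy:.4f}')   # 0.9814
print(f'precision: {precision:.4f}')  # 0.9904
print(f'recall:    {recall:.4f}')     # 0.8583
print(f'f1:        {f1:.4f}')         # 0.9196
```

Read this way, a single ham message is flagged as spam while 17 spam messages slip through, which is why precision is so much higher than recall.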


This is higher accuracy and precision than Naive Bayes achieved, and apart from choosing a linear kernel we used only the default settings. Tuning the C parameter might improve accuracy further.
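
For reference, the role of `C` (and of `gamma`, which the grid search below also varies for the RBF kernel) can be read off the standard soft-margin SVM formulation. This is the textbook form with labels $y_i \in \{-1, +1\}$; the symbols $w$, $b$, $\xi_i$, and $\phi$ are the usual notation, not names from this notebook.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\bigl(w^\top\phi(x_i)+b\bigr)\ \ge\ 1-\xi_i,\qquad \xi_i\ge 0

K_{\mathrm{rbf}}(x, x') = \exp\bigl(-\gamma\,\lVert x - x'\rVert^2\bigr)
```

A larger `C` penalizes margin violations more heavily, so the model fits the training data more tightly. A larger `gamma` makes the RBF kernel more local, so each training point influences a smaller neighborhood.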

In [23]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
import warnings

# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        print("# Tuning hyper-parameters for %s" % score)
        print()

        clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                           scoring='%s_macro' % score)
        clf.fit(X_train, y_train)

        print("Best parameters set found on development set:")
        print()
        print(clf.best_params_)
        print()
        print("Grid scores on development set:")
        print()
        means = clf.cv_results_['mean_test_score']
        stds = clf.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, clf.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
        print()

        print("Detailed classification report:")
        print()
        print("The model is trained on the full development set.")
        print("The scores are computed on the full evaluation set.")
        print()
        y_true, y_pred = y_test, clf.predict(X_test)
        print(classification_report(y_true, y_pred))
        print()

# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'C': 1000, 'kernel': 'rbf', 'gamma': 0.001}

Grid scores on development set:

0.433 (+/-0.001) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.001}
0.433 (+/-0.001) for {'C': 1, 'kernel': 'rbf', 'gamma': 0.0001}
0.433 (+/-0.001) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.001}
0.433 (+/-0.001) for {'C': 10, 'kernel': 'rbf', 'gamma': 0.0001}
0.962 (+/-0.013) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.001}
0.433 (+/-0.001) for {'C': 100, 'kernel': 'rbf', 'gamma': 0.0001}
0.983 (+/-0.006) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.001}
0.962 (+/-0.013) for {'C': 1000, 'kernel': 'rbf', 'gamma': 0.0001}
0.979 (+/-0.011) for {'C': 1, 'kernel': 'linear'}
0.977 (+/-0.005) for {'C': 10, 'kernel': 'linear'}
0.977 (+/-0.005) for {'C': 100, 'kernel': 'linear'}
0.977 (+/-0.005) for {'C': 1000, 'kernel': 'linear'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed

Run the SVM again with the parameters suggested by the precision search. There wasn't much room to improve: accuracy rose from 0.981 to 0.985 and recall from 0.858 to 0.883, while precision was essentially unchanged at about 0.99.

In [24]:
# only C, kernel, and gamma differ from the SVC defaults
classifier = svm.SVC(C=1000, kernel='rbf', gamma=0.001)

classifier.fit(X_train, y_train)

pred = classifier.predict(X_test)

print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))


accuracy score:  0.984504132231
precision score:  0.990654205607
recall score:  0.883333333333
f1 score:  0.933920704846
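
The tuned model above was rebuilt by copying the winning parameters by hand. An alternative is to reuse the fitted search object, since `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`). A small sketch on synthetic data, where `make_classification`, the reduced grid, and the `demo`-style names are stand-ins rather than part of this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 200 samples, 10 features, 2 classes.
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234)

# A deliberately small grid so the sketch runs quickly.
param_grid = [
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-2], 'C': [1, 10]},
    {'kernel': ['linear'], 'C': [1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5, scoring='precision_macro')
search.fit(Xd_train, yd_train)

# best_estimator_ is an SVC already refit with best_params_ on Xd_train,
# so there is no need to retype the parameters.
best = search.best_estimator_
print(search.best_params_)
print('test accuracy:', best.score(Xd_test, yd_test))
```

In this notebook `clf` is reassigned on each pass of the scoring loop, so after the loop it holds the search tuned for recall; to reuse the precision-tuned model this way, the search object for that pass would need to be kept separately.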
