<a href="https://colab.research.google.com/github/WelfLowe/ML4developers/blob/main/5_Kernel_Methods_and_SVMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SVMs and parameter selection

The [sklearn documentation on cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html) is a good entry point for this topic.

Import necessary libraries.

In [9]:
import pandas as pd
from sklearn import datasets
from sklearn import svm
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

Load Iris.

In [22]:
iris = datasets.load_iris()
X = iris.data
y = iris.target
X[0:5,:]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

In contrast to some other classifiers, SVM classifiers are sensitive to differing value ranges of the predictors. Hence, it is advisable to normalize the predictors, here by dividing each feature value by the maximum value of that feature.

**Note:** normalization should be applied to both the training and the test data set. However, "training" the normalizer, here calculating the maximum feature values, should only be done on the training data set. We ignore this in our example code.

**Note:** normalization does not always help. Find out how it affects Iris by commenting out the `preprocessing.normalize` line below and rerunning the notebook.

In [23]:
# Per-feature maxima before normalization
print(X.max(axis=0))
# Divide each feature (column) by its maximum value
X = preprocessing.normalize(X, axis=0, norm='max')
X[0:5,:]

[7.9 4.4 6.9 2.5]


array([[0.64556962, 0.79545455, 0.20289855, 0.08      ],
       [0.62025316, 0.68181818, 0.20289855, 0.08      ],
       [0.59493671, 0.72727273, 0.1884058 , 0.08      ],
       [0.58227848, 0.70454545, 0.2173913 , 0.08      ],
       [0.63291139, 0.81818182, 0.20289855, 0.08      ]])
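The note above, that the normalizer should be "trained" on the training data only, can be sketched as follows. This is a minimal sketch and not part of the original notebook: it swaps in `MaxAbsScaler`, which divides each feature by its maximum absolute value (the same as the maximum for the non-negative Iris features), and it uses new variable names (`X_raw`, `Xr_train`, `Xs_train`, ...) so that the notebook's own `X` and `y` stay untouched.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler

# Reload the raw (unnormalized) Iris features
X_raw, y_raw = datasets.load_iris(return_X_y=True)

# Split first, so the test rows cannot influence the scaling factors
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42)

# fit() learns the per-feature maxima from the training rows only
scaler = MaxAbsScaler().fit(Xr_train)
Xs_train = scaler.transform(Xr_train)
Xs_test = scaler.transform(Xr_test)  # reuses the training maxima

print(scaler.max_abs_)        # maxima seen in the training split
print(Xs_train.max(axis=0))   # training columns now peak at 1
print(Xs_test.max(axis=0))    # test columns may peak slightly below or above 1
```

The scaled test features are not guaranteed to stay within $[0, 1]$, since a test sample may exceed the largest training value; this is expected when the scaling factors come from the training data alone.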

Train an SVM model with fixed hyperparameters and assess the models using train-test splitting.

In [12]:
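# Hold out 20% of the samples for testing; try other seeds, e.g. random_state=2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Linear-kernel SVM with a fixed penalty parameter C
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
# Accuracy on the held-out test samples
score = clf.score(X_test, y_test)
score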

1.0

A score of 1.0 suggests that the classifier is quite good, but it stems from a single split and may be overly optimistic. Try different seeds and split sizes.
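Trying different seeds could look like the sketch below. It is an illustration, not part of the original notebook: the names `X_demo`, `y_demo`, and `seed_scores` and the choice of ten seeds are assumptions; the classifier settings are the same as in the cell above.

```python
import numpy as np
from sklearn import datasets, svm, preprocessing
from sklearn.model_selection import train_test_split

# Same data preparation as in the notebook
X_demo, y_demo = datasets.load_iris(return_X_y=True)
X_demo = preprocessing.normalize(X_demo, axis=0, norm='max')

# Repeat the train-test split with ten different seeds
seed_scores = []
for seed in range(10):
    Xa, Xb, ya, yb = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    model = svm.SVC(kernel='linear', C=1).fit(Xa, ya)
    seed_scores.append(model.score(Xb, yb))

seed_scores = np.array(seed_scores)
print(seed_scores)
print("min %0.2f, max %0.2f" % (seed_scores.min(), seed_scores.max()))
```

The spread between the minimum and maximum score gives a rough impression of how much a single train-test split can mislead.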

Train SVM models with the same fixed hyperparameters and assess the models using 5-fold cross validation.

In [13]:
clf = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
scores

array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1.        ])

Calculate the mean score over all 5 folds and its standard deviation.

In [14]:
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.96 accuracy with a standard deviation of 0.02


Obviously, the train-test split is random and the accuracy varies with it. Averaging over several splits, as done with cross validation, reduces the uncertainty in the accuracy estimation. 
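To make the averaging over several splits concrete, the sketch below spells out what `cross_val_score` with `cv=5` does for a classifier: scikit-learn uses stratified, unshuffled folds in that case. The variable names (`X_cv`, `fold_scores`, `reference`) are illustrative and not from the original notebook.

```python
import numpy as np
from sklearn import datasets, svm, preprocessing
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same data preparation as in the notebook
X_cv, y_cv = datasets.load_iris(return_X_y=True)
X_cv = preprocessing.normalize(X_cv, axis=0, norm='max')

# Manual 5-fold loop: train on four folds, score on the remaining one
fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_cv, y_cv):
    model = svm.SVC(kernel='linear', C=1)
    model.fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(model.score(X_cv[test_idx], y_cv[test_idx]))
fold_scores = np.array(fold_scores)

# Library call for comparison
reference = cross_val_score(svm.SVC(kernel='linear', C=1), X_cv, y_cv, cv=5)

print(fold_scores, fold_scores.mean(), fold_scores.std())
print(np.allclose(fold_scores, reference))
```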

Still, we need to guess the hyperparameters: here the $C$ hyperparameter, i.e., the penalty parameter of the error term, and the selected kernel, which may or may not come with further hyperparameters of its own, here $\gamma$.

Grid search in combination with cross validation approximates the optimum hyperparameter setting by systematically testing all combinations of a grid of hyperparameter values.

In [15]:
# Grid of candidate values; every combination is evaluated
parameters = {'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
              'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001]}
svc = svm.SVC()
# GridSearchCV uses 5-fold cross validation by default
clf = GridSearchCV(estimator=svc, param_grid=parameters)
clf.fit(X, y)
sorted(clf.cv_results_.keys())

['mean_fit_time',
 'mean_score_time',
 'mean_test_score',
 'param_C',
 'param_gamma',
 'param_kernel',
 'params',
 'rank_test_score',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'split4_test_score',
 'std_fit_time',
 'std_score_time',
 'std_test_score']
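Besides the per-candidate results in `cv_results_`, a fitted `GridSearchCV` object also exposes the winning combination directly. The sketch below is an addition to the notebook and rebuilds the search under new names (`X_gs`, `grid`, `search`) so that it runs on its own.

```python
from sklearn import datasets, svm, preprocessing
from sklearn.model_selection import GridSearchCV

# Same data preparation and grid as in the notebook
X_gs, y_gs = datasets.load_iris(return_X_y=True)
X_gs = preprocessing.normalize(X_gs, axis=0, norm='max')
grid = {'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
        'C': [0.1, 1, 10, 100],
        'gamma': [1, 0.1, 0.01, 0.001]}

search = GridSearchCV(svm.SVC(), grid)
search.fit(X_gs, y_gs)

print(search.best_params_)     # hyperparameters of the top-ranked candidate
print(search.best_score_)      # its mean cross-validated accuracy
print(search.best_estimator_)  # SVC refitted on all data with those settings
```

When several candidates share the best mean score, `best_params_` reports only one of them, so it is still worth inspecting the full results table.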

In [16]:
df = pd.DataFrame.from_dict(clf.cv_results_)
df[[
 'mean_test_score',
 'param_C',
 'param_kernel',
 'param_gamma',
 'std_test_score']].sort_values([
 'mean_test_score',
 'std_test_score',
 'param_kernel'],ascending=False).head(15)

| | mean_test_score | param_C | param_kernel | param_gamma | std_test_score |
|---|---|---|---|---|---|
| 33 | 0.966667 | 10.0 | rbf | 1.0 | 0.029814 |
| 53 | 0.966667 | 100.0 | rbf | 0.1 | 0.029814 |
| 49 | 0.966667 | 100.0 | rbf | 1.0 | 0.042164 |
| 50 | 0.966667 | 100.0 | poly | 1.0 | 0.042164 |
| 48 | 0.966667 | 100.0 | linear | 1.0 | 0.042164 |
| 52 | 0.966667 | 100.0 | linear | 0.1 | 0.042164 |
| 56 | 0.966667 | 100.0 | linear | 0.01 | 0.042164 |
| 60 | 0.966667 | 100.0 | linear | 0.001 | 0.042164 |
| 55 | 0.96 | 100.0 | sigmoid | 0.1 | 0.038873 |
| 32 | 0.96 | 10.0 | linear | 1.0 | 0.038873 |


The linear kernel with $C=1$ is not among the champions, so our initial fixed setting was not the best choice.

Recall the definition of [kernel functions](https://scikit-learn.org/dev/modules/svm.html#svm-kernels). The parameter $\gamma$ is not used in the linear kernel. Since $\gamma$ does not matter there, the best linear kernel, with $C=100$, appears in four equally good parameterizations with $\gamma \in \{1,0.1,0.01,0.001\}$ (all ignored). Testing them adds to training time; this is a drawback of handing one full grid to the grid search library instead of programming nested loops manually.
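One way to avoid the redundant linear-kernel candidates without writing nested loops is to pass `param_grid` as a list of dictionaries; `GridSearchCV` then explores each dictionary as a separate grid. The sketch below is an addition to the notebook, with illustrative names (`X_pg`, `reduced_grid`, `reduced_search`).

```python
from sklearn import datasets, svm, preprocessing
from sklearn.model_selection import GridSearchCV

# Same data preparation as in the notebook
X_pg, y_pg = datasets.load_iris(return_X_y=True)
X_pg = preprocessing.normalize(X_pg, axis=0, norm='max')

C_values = [0.1, 1, 10, 100]
gamma_values = [1, 0.1, 0.01, 0.001]

# Two sub-grids: gamma is only varied for kernels that actually use it
reduced_grid = [
    {'kernel': ['linear'], 'C': C_values},
    {'kernel': ['rbf', 'poly', 'sigmoid'], 'C': C_values, 'gamma': gamma_values},
]

reduced_search = GridSearchCV(svm.SVC(), reduced_grid)
reduced_search.fit(X_pg, y_pg)

# 4 linear candidates + 3 * 4 * 4 = 48 others, instead of 64 in the full grid
n_candidates = len(reduced_search.cv_results_['params'])
print(n_candidates)
print(reduced_search.best_params_, reduced_search.best_score_)
```

For the linear candidates, the `param_gamma` entry in `cv_results_` is masked, because that sub-grid does not set `gamma`.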