# SVM exercises

## Testing hypotheses with the full Iris data set

1. Apply an SVM classifier to the full, three-class Iris data set. Hold out 20% of the data as a test set and compute the confusion matrix for a linear kernel.
2. Try different values of the cost parameter C, from 1e-10 to 2. What do you observe?
3. Try using other kernels.
4. Formulate at least two hypotheses about this classification problem and the hyperparameters `kernel` and `C`. For example: kernel A is better than kernel B when C=c; or, for a given kernel, C=c1 performs the same as C=c2.
5. Check your hypotheses using 10-fold cross-validation and a t-test. Don't forget to shuffle.
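
Step 2 above can be sketched as a small sweep over C. This is a minimal sketch, not a reference solution: the grid `c_values` is a hypothetical choice that only spans the range named in the exercise, and the stratified 10-fold splitter with `random_state=0` is an assumption made so that every value of C is scored on the same folds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Fixed seed: every value of C is evaluated on the same shuffled folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Hypothetical grid covering the range given in the exercise (1e-10 to 2).
c_values = [1e-10, 1e-5, 1e-3, 1e-2, 1e-1, 0.5, 1.0, 2.0]

mean_accuracy = {}
for c in c_values:
    scores = cross_val_score(SVC(kernel="linear", C=c), X, y, cv=cv)
    mean_accuracy[c] = scores.mean()
    print(f"C={c:g}: mean accuracy = {scores.mean():.3f}")
```

Reading the printed means from the smallest to the largest C is one way to answer "What do you observe?" without relying on a single train/test split.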

In [50]:
# 1

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score


iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)

# Fit on the training split only, so the 20% holdout stays unseen by the model.
model = SVC(kernel='rbf', C=100)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm


array([[10,  0,  0],
       [ 0,  9,  1],
       [ 0,  0, 10]])
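
Step 1 of the exercise asks for a linear kernel, so a linear variant of the holdout evaluation might look like the sketch below. The value `C=1.0` and the seed `random_state=0` are assumptions chosen for reproducibility, not values taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Stratified 80/20 split: each class contributes 10 of the 30 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

# Linear kernel, trained on the training split only.
linear_model = SVC(kernel="linear", C=1.0)
linear_model.fit(X_train, y_train)

y_pred = linear_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(cm)
print("holdout accuracy:", accuracy_score(y_test, y_pred))
```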

In [51]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

# ShuffleSplit draws 10 independent random 90/10 splits; unlike k-fold, test sets may overlap.
cv = ShuffleSplit(n_splits=10)
scores = cross_val_score(model, X, y, cv=cv)
scores

array([0.93333333, 0.93333333, 0.93333333, 1.        , 1.        ,
       1.        , 1.        , 0.93333333, 1.        , 1.        ])
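
For step 3 (other kernels), the same cross-validation call can be repeated per kernel. The sketch below assumes a fixed `C=1.0` and a seeded stratified 10-fold splitter so that all kernels see identical folds. The kernel names are the built-in string options of `SVC`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Seeded splitter: every kernel is scored on the same 10 shuffled folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

kernel_scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    kernel_scores[kernel] = cross_val_score(SVC(kernel=kernel, C=1.0), X, y, cv=cv)
    mean = kernel_scores[kernel].mean()
    std = kernel_scores[kernel].std()
    print(f"{kernel:>8}: mean = {mean:.3f}, std = {std:.3f}")
```

Keeping the per-fold scores in `kernel_scores`, rather than only the means, makes them reusable for a paired test between any two kernels.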

In [59]:
import numpy as np
from scipy import stats

# Both models use the linear kernel; the hypothesis tested here is about C (1000 vs 0.1).
model_c1000 = SVC(kernel='linear', C=1000)
model_c01 = SVC(kernel='linear', C=1e-1)

# Note: cv has no fixed random_state, so each call below draws its own random splits.
scores_c1000 = cross_val_score(model_c1000, X, y, cv=cv)
scores_c01 = cross_val_score(model_c01, X, y, cv=cv)

print("scores C=1000:", scores_c1000)
print("scores C=0.1:", scores_c01)
print("mean C=1000:", np.mean(scores_c1000))
print("mean C=0.1:", np.mean(scores_c01))
stats.ttest_rel(scores_c1000, scores_c01)

scores C=1000: [1.         1.         0.86666667 0.86666667 1.         0.93333333
 1.         0.93333333 1.         1.        ]
scores C=0.1: [1.         0.93333333 0.93333333 1.         1.         1.
 0.93333333 0.86666667 0.86666667 0.93333333]
mean C=1000: 0.96
mean C=0.1: 0.9466666666666667


Ttest_relResult(statistic=0.5144957554275265, pvalue=0.6193005100381609)
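
With a p-value of about 0.62, well above the usual 0.05 threshold, this run gives no evidence that the two configurations differ in accuracy. A paired t-test assumes that the i-th score of each model comes from the same split, which requires a splitter with a fixed seed. The sketch below shows one way to arrange that. The helper `paired_cv_ttest`, the seed, and the significance level `alpha = 0.05` are assumptions for illustration, not part of scikit-learn or SciPy.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC


def paired_cv_ttest(model_a, model_b, X, y, n_splits=10, seed=0):
    """Score both models on identical shuffled folds, then run a paired t-test.

    Hypothetical helper for illustration only.
    """
    # An integer random_state makes the splitter yield the same folds on every call.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores_a = cross_val_score(model_a, X, y, cv=cv)
    scores_b = cross_val_score(model_b, X, y, cv=cv)
    if np.allclose(scores_a, scores_b):
        # Identical per-fold scores have zero variance, so the t statistic is undefined.
        return scores_a, scores_b, None
    return scores_a, scores_b, stats.ttest_rel(scores_a, scores_b)


X, y = load_iris(return_X_y=True)
scores_a, scores_b, result = paired_cv_ttest(
    SVC(kernel="linear", C=1000), SVC(kernel="linear", C=0.1), X, y
)

alpha = 0.05  # assumed significance level
print("mean A:", scores_a.mean(), "mean B:", scores_b.mean())
if result is None:
    print("Per-fold scores are identical; nothing to test.")
elif result.pvalue < alpha:
    print(f"p = {result.pvalue:.3f}: reject the hypothesis of equal accuracy.")
else:
    print(f"p = {result.pvalue:.3f}: cannot reject the hypothesis of equal accuracy.")
```

Because the training sets of different folds overlap, the per-fold scores are not fully independent, so the resulting p-value is best read as approximate.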