# SVMs in Python: worksheet

BAIT 509 Class Meeting 10

Let's work with the breast cancer dataset:

In [2]:
from sklearn import datasets
dat = datasets.load_breast_cancer()
y = dat.target
X = dat.data

Here are the predictors of breast cancer:

In [4]:
dat.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='<U23')

And the dimensions of the predictor matrix (rows are observations, columns are predictors):

In [5]:
X.shape

(569, 30)

## Fitting an SVM model

#### 0\. Scale the data 

SVMs are sensitive to the scale of the predictors, so standardize them first ([this page](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling) of sklearn's documentation is useful).
 
- Instantiate the transformer with `StandardScaler()` from `sklearn.preprocessing`.
- Fit the transformer using the `.fit()` method.
- Use the scaler to transform `X`, with the `.transform()` method.

In [8]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
Xscale = scaler.transform(X)
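As a quick sanity check on the scaling step, the sketch below confirms that every scaled column has mean close to 0 and standard deviation close to 1. It uses `.fit_transform()`, which is shorthand for calling `.fit()` and then `.transform()` on the same data; the variable names `col_means` and `col_sds` are only illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_breast_cancer().data

# fit_transform() is shorthand for .fit(X) followed by .transform(X)
Xscale = StandardScaler().fit_transform(X)

# After standardization, each column should have mean ~0 and sd ~1
col_means = Xscale.mean(axis=0)
col_sds = Xscale.std(axis=0)
print(np.allclose(col_means, 0), np.allclose(col_sds, 1))
```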

#### 1\. Fit an SVC (i.e., a linear SVM)

Fitting an SVM can be done using the `SVC` class in `sklearn.svm`.

In [9]:
from sklearn import svm

Do the fitting here:

In [14]:
my_svc = svm.SVC(C=1, kernel="linear")
my_svc.fit(Xscale, y)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

What's the accuracy on the training data? Try changing the `C` parameter.

In [15]:
sum(my_svc.predict(Xscale) == y) / len(y)

0.9876977152899824
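The accuracy above is measured on the same data used for fitting, so it tends to be optimistic. A minimal sketch of a held-out estimate follows, assuming an arbitrary 75/25 split and seed (neither comes from the worksheet). It also uses the classifier's `.score()` method, which returns mean accuracy and is equivalent to the manual calculation.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dat = datasets.load_breast_cancer()

# Assumption: a 75/25 split with a fixed seed, chosen only for illustration
X_train, X_test, y_train, y_test = train_test_split(
    dat.data, dat.target, test_size=0.25, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
clf = svm.SVC(C=1, kernel="linear").fit(scaler.transform(X_train), y_train)

# .score() returns the mean accuracy on the given data
train_acc = clf.score(scaler.transform(X_train), y_train)
test_acc = clf.score(scaler.transform(X_test), y_test)
print(train_acc, test_acc)
```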

#### 2\. Fit a radial-basis SVM

Try again, this time with a radial-basis (RBF) kernel. What's the training accuracy? Try changing the `C` and `gamma` parameters.

In [23]:
my_svc = svm.SVC(C=1, kernel="rbf", gamma=100)
my_svc.fit(Xscale, y)
sum(my_svc.predict(Xscale) == y) / len(y)

1.0
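A perfect training accuracy with a large `gamma` is worth checking against a cross-validated estimate, since a very narrow kernel can simply memorize the training points. The sketch below compares the two; the choice of 5 folds is an assumption, and the data are scaled once up front as in the worksheet.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

dat = datasets.load_breast_cancer()
Xscale = StandardScaler().fit_transform(dat.data)
y = dat.target

rbf = svm.SVC(C=1, kernel="rbf", gamma=100)

# Accuracy on the data the model was fitted to
train_acc = rbf.fit(Xscale, y).score(Xscale, y)

# Mean accuracy over 5 held-out folds (fold count is an arbitrary choice)
cv_acc = cross_val_score(rbf, Xscale, y, cv=5).mean()
print(train_acc, cv_acc)
```

If the cross-validated accuracy is much lower than the training accuracy, try a smaller `gamma`.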

## Cross validation 

Evaluate generalization over a grid of parameters, using the linear kernel. Here is the class we'll import:

In [24]:
from sklearn.model_selection import GridSearchCV

Define a grid of `C` (hyperparameter/tuning parameter) values:

In [25]:
C = [1, 10, 20]

Instantiate the model as usual, but leave `C` unspecified; the grid search will set it.

In [27]:
model = svm.SVC(kernel="linear")

From the instantiated model, set up cross validation with the `GridSearchCV` class, like so:

In [28]:
model_cv = GridSearchCV(model, param_grid={"C":C}, cv=10)

Now, "fit" the cross validation with the `.fit()` method (as if you're fitting a model). Warning: this will be slow if you did not scale the data!

In [29]:
model_cv.fit(Xscale, y)

GridSearchCV(cv=10, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1, param_grid={'C': [1, 10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
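Because `X` was scaled once before cross validation, each held-out fold has already influenced the scaler. One common way to avoid this, sketched here as an alternative and not as the worksheet's own method, is to put the scaler and the SVC in a pipeline so that scaling is refit within each training fold. The step name `svc` is generated by `make_pipeline`, which is why the grid key becomes `svc__C`.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

dat = datasets.load_breast_cancer()

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(StandardScaler(), svm.SVC(kernel="linear"))

# Parameters of a pipeline step are addressed as <step name>__<parameter>
pipe_cv = GridSearchCV(pipe, param_grid={"svc__C": [1, 10, 20]}, cv=10)
pipe_cv.fit(dat.data, dat.target)  # note: unscaled data go in

print(pipe_cv.best_params_, pipe_cv.best_score_)
```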

You can obtain the best parameters and the best score from the `.best_params_` and `.best_score_` attributes:

In [34]:
model_cv.best_params_
#model_cv.best_score_

{'C': 1}
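To see how close the candidates are, it can help to print the mean cross-validated score for every value of `C` next to the winner. The sketch below refits the same grid search so that it runs on its own; `best_score_` is the largest entry of `cv_results_["mean_test_score"]`.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

dat = datasets.load_breast_cancer()
Xscale = StandardScaler().fit_transform(dat.data)

model_cv = GridSearchCV(svm.SVC(kernel="linear"),
                        param_grid={"C": [1, 10, 20]}, cv=10)
model_cv.fit(Xscale, dat.target)

# One mean test score per candidate, in the same order as "params"
mean_scores = model_cv.cv_results_["mean_test_score"]
for params, mean_score in zip(model_cv.cv_results_["params"], mean_scores):
    print(params, round(mean_score, 4))

print(model_cv.best_params_, model_cv.best_score_)
```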

You can obtain info about all folds from the `.cv_results_` attribute. Folds are numbered from 0, so `split4` is the fifth fold. What are the test scores of the fifth fold for each value of `C`?

In [33]:
model_cv.cv_results_["split4_test_score"]

array([ 0.98245614,  0.96491228,  0.96491228])
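The per-fold keys in `cv_results_` are zero-indexed, so with `cv=10` they run from `split0_test_score` to `split9_test_score`. A small sketch that lists them, refitting the same grid search so it is self-contained (the name `split_keys` is only illustrative):

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

dat = datasets.load_breast_cancer()
Xscale = StandardScaler().fit_transform(dat.data)

model_cv = GridSearchCV(svm.SVC(kernel="linear"),
                        param_grid={"C": [1, 10, 20]}, cv=10)
model_cv.fit(Xscale, dat.target)

# Collect the per-fold test score keys; each maps to one score per C value
split_keys = sorted(
    k for k in model_cv.cv_results_
    if k.startswith("split") and k.endswith("_test_score")
)
print(split_keys)
print(model_cv.cv_results_["split4_test_score"].shape)
```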