# Building a support vector regression model

In this activity we build a support vector regression (SVR) model for a medical dataset on diabetes progression.

## Dataset

We use the diabetes dataset included in scikit-learn (442 patients, 10 baseline features), loaded with ```load_diabetes```.

In [1]:
##### added line to ensure plots are showing
%matplotlib inline
#####

import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler

dataset = load_diabetes()

X = pd.DataFrame(data=dataset['data'],columns=dataset['feature_names'])

# As before, scale the input variables: SVR is sensitive to feature scale
X = StandardScaler().fit_transform(X)

y = pd.DataFrame(data=dataset['target'],columns=['progression'])
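
The cell above scales all rows before any train/test split, so the scaler also sees rows that later end up in the test set. As a minimal sketch of an alternative, assuming we are free to split first, the scaler can be placed inside a pipeline so that it is fitted on the training rows only. The names `X_raw`, `X_tr`, `scaled_svr` and the choice `random_state=0` are illustrative and not part of the original activity.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Load features and target as NumPy arrays
X_raw, y_raw = load_diabetes(return_X_y=True)

# Split first, so the scaler never sees the held-out rows
# (random_state=0 is an arbitrary choice for reproducibility)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=0)

# The pipeline fits StandardScaler on the training rows only,
# then applies the same transformation before SVR sees the data
scaled_svr = make_pipeline(StandardScaler(), SVR(gamma='auto'))
scaled_svr.fit(X_tr, y_tr)

# Per-feature means and scales learned from the training split
scaler = scaled_svr.named_steps['standardscaler']
print(scaler.mean_.shape, scaler.scale_.shape)
```

Calling `scaled_svr.predict(X_te)` then applies the training-set scaling to the test rows automatically.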

## Building the regression

We can use almost exactly the same code as for classification, except that we use an instance of ```SVR```. A linear regression is fitted on the same split as a baseline:

In [2]:
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

lr = LinearRegression()
lr.fit(X_train,y_train.values.ravel())
pred_lr = lr.predict(X_test)
print('RMSE LR:',np.sqrt(mse(y_test,pred_lr)))

svr = SVR(gamma='auto')
svr.fit(X_train,y_train.values.ravel())
pred_svm = svr.predict(X_test)
print('RMSE SVM:', np.sqrt(mse(y_test,pred_svm)))

RMSE LR: 60.34742250148648
RMSE SVM: 71.84843439044327


That's worse than linear regression. A grid search over the kernel and the cost parameter ```C``` might improve the result:

In [3]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel':['linear','poly','rbf'],'C':[0.2,0.5,1.0]}

grid_search = GridSearchCV(SVR(gamma='auto'), parameters, cv=5,scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train.values.ravel())

means = grid_search.cv_results_['mean_test_score']
stds = grid_search.cv_results_['std_test_score']

print('Mean RMSE (+/- standard deviation), for parameters')
for mean, std, params in zip(means, stds, grid_search.cv_results_['params']):
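    # The MSE is returned negated, so we multiply it by -1 before taking the square root.
    # np.sqrt(std) is the square root of the spread of the fold MSEs,
    # which is only a rough indicator of how much the RMSE varies.
    print("%0.3f (+/- %0.03f) for %r"
          % (np.sqrt(-1*mean), np.sqrt(std), params))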

Mean RMSE (+/- standard deviation), for parameters
56.316 (+/- 18.890) for {'C': 0.2, 'kernel': 'linear'}
72.752 (+/- 22.029) for {'C': 0.2, 'kernel': 'poly'}
76.019 (+/- 23.862) for {'C': 0.2, 'kernel': 'rbf'}
52.795 (+/- 19.261) for {'C': 0.5, 'kernel': 'linear'}
68.965 (+/- 20.070) for {'C': 0.5, 'kernel': 'poly'}
74.277 (+/- 23.404) for {'C': 0.5, 'kernel': 'rbf'}
52.344 (+/- 19.721) for {'C': 1.0, 'kernel': 'linear'}
66.148 (+/- 18.524) for {'C': 1.0, 'kernel': 'poly'}
71.728 (+/- 22.983) for {'C': 1.0, 'kernel': 'rbf'}
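
The spread printed above is the square root of the standard deviation of the fold MSEs, which is not the same as the standard deviation of the fold RMSEs. As a hedged sketch of one way to get the latter, scikit-learn's `'neg_root_mean_squared_error'` scorer returns one (negated) RMSE per fold. The names `X_demo`, `y_demo` and `fold_rmse` are illustrative, and the single configuration shown (linear kernel, `C=1.0`) is just one entry of the grid.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_demo, y_demo = load_diabetes(return_X_y=True)

# Scaling happens inside the pipeline, so it is refitted on each training fold
model = make_pipeline(StandardScaler(), SVR(kernel='linear', C=1.0))

# One negated RMSE per fold; flip the sign to get positive RMSE values
fold_rmse = -cross_val_score(
    model, X_demo, y_demo, cv=5, scoring='neg_root_mean_squared_error')

# Mean and standard deviation are now both in the units of the target
print('RMSE per fold:', np.round(fold_rmse, 3))
print('Mean RMSE: %0.3f (+/- %0.3f)' % (fold_rmse.mean(), fold_rmse.std()))
```

The same scorer name can be passed as `scoring` to `GridSearchCV`, in which case `mean_test_score` and `std_test_score` only need a sign flip for the mean and no square root.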




We repeat the same search with 10-fold instead of 5-fold cross-validation:

In [4]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel':['linear','poly','rbf'],'C':[0.2,0.5,1.0]}

grid_search = GridSearchCV(SVR(gamma='auto'), parameters, cv=10,scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train.values.ravel())

means = grid_search.cv_results_['mean_test_score']
stds = grid_search.cv_results_['std_test_score']

print('Mean RMSE (+/- standard deviation), for parameters')
for mean, std, params in zip(means, stds, grid_search.cv_results_['params']):
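    # The MSE is returned negated, so we multiply it by -1 before taking the square root.
    # np.sqrt(std) is the square root of the spread of the fold MSEs,
    # which is only a rough indicator of how much the RMSE varies.
    print("%0.3f (+/- %0.03f) for %r"
          % (np.sqrt(-1*mean), np.sqrt(std), params))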

Mean RMSE (+/- standard deviation), for parameters
55.554 (+/- 21.183) for {'C': 0.2, 'kernel': 'linear'}
72.276 (+/- 29.010) for {'C': 0.2, 'kernel': 'poly'}
75.864 (+/- 31.302) for {'C': 0.2, 'kernel': 'rbf'}
52.680 (+/- 21.506) for {'C': 0.5, 'kernel': 'linear'}
68.667 (+/- 26.861) for {'C': 0.5, 'kernel': 'poly'}
73.875 (+/- 30.686) for {'C': 0.5, 'kernel': 'rbf'}
52.208 (+/- 21.951) for {'C': 1.0, 'kernel': 'linear'}
65.719 (+/- 25.198) for {'C': 1.0, 'kernel': 'poly'}
71.187 (+/- 29.652) for {'C': 1.0, 'kernel': 'rbf'}




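It seems that, again, the linear kernel works best, with the cost parameter ```C``` influencing the result only slightly. The spread across folds is wide, however, so the ranking of the configurations might not be very reliable. Note that the printed spreads are derived from the standard deviation of the fold MSEs; they are not confidence intervals.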
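
The cross-validated scores above are computed on the training split only, so they are not directly comparable to the test-set RMSE values printed earlier. A minimal sketch of closing that gap, assuming we want to score the best configuration once on held-out rows, is shown below. The names `X_fit`, `X_hold`, `search` and `hold_rmse` and the choice `random_state=0` are illustrative, and the scaler is placed in a pipeline, which differs from the cells above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_all, y_all = load_diabetes(return_X_y=True)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0)

# Same grid as above; the 'svr__' prefix routes each parameter to the SVR step
param_grid = {'svr__kernel': ['linear', 'poly', 'rbf'],
              'svr__C': [0.2, 0.5, 1.0]}
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVR(gamma='auto')),
    param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X_fit, y_fit)

# With the default refit=True, the best configuration is refitted on all of X_fit
print('Best parameters:', search.best_params_)

# Score the refitted model once on the rows held out from the search
hold_rmse = np.sqrt(mean_squared_error(y_hold, search.predict(X_hold)))
print('Held-out RMSE: %0.3f' % hold_rmse)
```

Because the split here uses a fixed seed and the cells above do not, the numbers will differ from the outputs shown in this activity.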