# California housing dataset regression with support vector machines

In this notebook, we'll use linear and non-linear [support vector machines (SVMs)](https://scikit-learn.org/stable/modules/svm.html#regression) to estimate median house values for California housing districts.

First, the needed imports. 

In [None]:
%matplotlib inline

import numpy as np
from sklearn import svm, datasets, __version__
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## Data

Then we load the California housing data. The first time this is run, the data has to be downloaded, which can take a while.

In [None]:
chd = datasets.fetch_california_housing()

We'll split the data into a training and a test set.

Let's also select a single attribute to start the analysis with, say *MedInc*.

In [None]:
test_size = 5000
single_attribute = 'MedInc'

X_train_all, X_test_all, y_train, y_test = train_test_split(
    chd.data, chd.target, test_size=test_size, shuffle=True)

attribute_index = chd.feature_names.index(single_attribute)
X_train_single = X_train_all[:, attribute_index].reshape(-1, 1)
X_test_single = X_test_all[:, attribute_index].reshape(-1, 1)
     
print()
print('California housing data: train:',len(X_train_all),'test:',len(X_test_all))
print()
print('X_train_all:', X_train_all.shape)
print('X_train_single:', X_train_single.shape)
print('y_train:', y_train.shape)
print()
print('X_test_all:', X_test_all.shape)
print('X_test_single:', X_test_single.shape)
print('y_test:', y_test.shape)

The training data matrix `X_train_all` has shape (`n_train`, 8), and `X_train_single` contains only the selected attribute (*MedInc* by default, which is also the first column). The vector `y_train` contains the target value (median house value) for each housing district in the training set.

Let's start our analysis with the single attribute. Later, you can set `only_single_attribute = False` to use all eight attributes in the regression.

As the final step, let's scale the input data to zero mean and unit variance: 

In [None]:
only_single_attribute = True

if only_single_attribute:
    X_train = X_train_single
    X_test = X_test_single
else:
    X_train = X_train_all
    X_test = X_test_all

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
print('X_train: shape:', X_train.shape, 'mean:', X_train.mean(axis=0), 'std:', X_train.std(axis=0))
print('X_test: shape:', X_test.shape, 'mean:', X_test.mean(axis=0), 'std:', X_test.std(axis=0))
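
To make the scaling step concrete, here is a minimal NumPy sketch of the same idea. The arrays `X_tr` and `X_te` are synthetic stand-ins introduced only for this illustration. The mean and standard deviation are estimated on the training set alone and then reused for the test set, which is why the scaled test set is only approximately zero-mean and unit-variance.

```python
import numpy as np

# Synthetic stand-ins for a single-attribute training and test matrix
# (hypothetical data, not the California housing values).
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=4.0, scale=2.0, size=(1000, 1))
X_te = rng.normal(loc=4.0, scale=2.0, size=(200, 1))

# The statistics are estimated on the training set only ...
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# ... and then applied unchanged to both sets.
X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std

print('train mean/std:', X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print('test mean/std: ', X_te_scaled.mean(axis=0), X_te_scaled.std(axis=0))
```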

## Linear SVM

We begin with SVM using a linear kernel.

### Learning

Let's use [`LinearSVR`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR), as it is specialized for linear SVMs. `C` is the regularization (penalty) parameter: larger values penalize training errors more heavily and therefore regularize less. (The general `SVR` also has a `kernel='linear'` option that can be used instead.)

In [None]:
%%time

C = 1.0
lin_reg = svm.LinearSVR(C=C)
lin_reg.fit(X_train, y_train)
print('coefficients:', lin_reg.coef_)
print('intercept:', lin_reg.intercept_)
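
As background for the fit above: by default, `LinearSVR` minimizes the epsilon-insensitive loss, which ignores residuals smaller than `epsilon` (the documented default for `LinearSVR` is `epsilon=0.0`). The following is a small NumPy sketch of that loss. The helper name `epsilon_insensitive_loss` and the sample values are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.0):
    """Per-sample loss max(0, |y_true - y_pred| - epsilon)."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Hypothetical targets and predictions.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 1.0])

# With a tube of half-width 0.1, the first residual (0.05) costs nothing.
loss_tube = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
# With epsilon=0 the loss reduces to the absolute error.
loss_no_tube = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.0)

print('epsilon=0.1:', loss_tube)
print('epsilon=0.0:', loss_no_tube)
```

Roughly speaking, a larger `C` gives this loss more weight relative to the regularization term on the coefficients.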

We can visualize the results if we are using only a single attribute:

In [None]:
if X_train.shape[1] == 1:
    plt.figure(figsize=(10, 10))
    plt.scatter(X_train, y_train, s=5)
    reg_x = np.arange(np.min(X_train), np.max(X_train), 0.01).reshape(-1, 1)
    plt.scatter(reg_x, lin_reg.predict(reg_x), s=8, label='linear SVR')
    plt.legend(loc='best');

### Inference

We use *mean squared error* as the performance measure for our regression algorithm:

In [None]:
%%time

predictions = lin_reg.predict(X_test)
print("Mean squared error: %.3f"
      % mean_squared_error(y_test, predictions))
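
For reference, the mean squared error is simply the average of the squared residuals. The sketch below computes it by hand on a few made-up values. It also takes the square root (RMSE), which is in the same units as the target. In this dataset the target is expressed in units of $100,000, so an RMSE of 0.37 would correspond to roughly $37,000.

```python
import numpy as np

# Hypothetical targets and predictions, in units of $100,000.
y_true = np.array([1.5, 2.0, 3.2, 0.9])
y_pred = np.array([1.7, 1.6, 2.9, 1.4])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

print('MSE:  %.3f' % mse)
print('RMSE: %.3f' % rmse)
```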

## Non-linear (or kernel) SVM

Beyond the linear kernel, SVMs can perform non-linear regression by implicitly mapping the input features into a high-dimensional feature space. This is sometimes called the *kernel trick*, as the implicit mapping is often computationally cheaper than explicitly operating in the high-dimensional space.
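
A small numerical check can make the kernel trick less abstract. The sketch below uses a homogeneous degree-2 polynomial kernel on two-dimensional inputs, chosen here only because its feature map is short enough to write out. The helper names `phi` and `poly2_kernel` are made up for this illustration. The kernel value computed in the original space equals the inner product of the explicitly mapped vectors.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel k(x, z) = (x . z)**2 in 2-D."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """The same similarity, computed without leaving the input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)       # kernel evaluated in the 2-D input space

print('explicit:', explicit, 'implicit:', implicit)
```

For the Gaussian kernel the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.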

### Learning

Let's start with a Gaussian (radial basis function) kernel, that is `kernel='rbf'`.

A polynomial kernel, that is `kernel='poly'`, is another common choice. The degree of the polynomial is set using the `degree` parameter.

Note that non-linear SVMs can be relatively slow to train, so it might be a good idea to start with a subset of the training data.

In [None]:
%%time

kernel = 'rbf'
C = 1.0
svm_reg = svm.SVR(kernel=kernel, C=C, gamma='auto')
svm_reg.fit(X_train, y_train)
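
If training on the full data is too slow, one option is to fit on a random subset first. The sketch below shows one way this could look. It uses synthetic data so that it runs on its own. The names `X_full`, `y_full`, `n_subset` and `svm_reg_small` are hypothetical, and in this notebook `X_train` and `y_train` would take the place of the synthetic arrays.

```python
import numpy as np
from sklearn import svm

# Synthetic stand-in for the scaled training data (hypothetical values).
rng = np.random.default_rng(0)
X_full = rng.normal(size=(2000, 1))
y_full = np.sin(X_full[:, 0]) + 0.1 * rng.normal(size=2000)

# Draw a random subset of rows without replacement.
n_subset = 500  # assumed subset size; adjust to the time budget
subset_idx = rng.choice(len(X_full), size=n_subset, replace=False)

svm_reg_small = svm.SVR(kernel='rbf', C=1.0, gamma='auto')
svm_reg_small.fit(X_full[subset_idx], y_full[subset_idx])

print('trained on', n_subset, 'samples,',
      len(svm_reg_small.support_), 'support vectors')
```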

In [None]:
if X_train.shape[1] == 1:
    plt.figure(figsize=(10, 10))
    plt.scatter(X_train, y_train, s=5)
    reg_x = np.arange(np.min(X_train), np.max(X_train), 0.01).reshape(-1, 1)
    plt.scatter(reg_x, lin_reg.predict(reg_x), s=8, label='linear SVR')
    plt.scatter(reg_x, svm_reg.predict(reg_x), s=8, label='non-linear ({}) SVR'.format(kernel))
    plt.legend(loc='best');

### Inference

In [None]:
%%time

predictions = svm_reg.predict(X_test)
print("Mean squared error: %.3f"
      % mean_squared_error(y_test, predictions))

## Model tuning

Try to reduce the mean squared error of the regression. Experiment with several single attributes and with using all attributes. See the documentation of [LinearSVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR) and [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR) for further options.
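
One systematic way to experiment with hyperparameters is a cross-validated grid search. The following is a minimal sketch on synthetic data, not a recommended configuration. The grid values and the names `X_demo`, `y_demo`, `param_grid` and `search` are assumptions chosen for illustration. In the notebook, `X_train` and `y_train` would be used instead, possibly subsampled to keep the search fast.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 1))
y_demo = np.sin(2.0 * X_demo[:, 0]) + 0.1 * rng.normal(size=300)

# Candidate values to try; every combination is evaluated with 3-fold CV.
param_grid = {'C': [0.1, 1.0, 10.0], 'gamma': ['scale', 'auto']}
search = GridSearchCV(svm.SVR(kernel='rbf'), param_grid,
                      scoring='neg_mean_squared_error', cv=3)
search.fit(X_demo, y_demo)

print('best parameters:', search.best_params_)
# The score is the negated MSE, so flip the sign for reporting.
print('cross-validated MSE: %.3f' % -search.best_score_)
```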

To further improve the results, it is possible to replace [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), which scales the input data to zero mean and unit variance, with more advanced preprocessing.
See [Preprocessing data](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-data) for more information.
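
As one possible example of alternative preprocessing, the sketch below chains `RobustScaler` (which centers on the median and scales by the interquartile range) with an `SVR` in a pipeline. The choice of `RobustScaler` and the synthetic, skewed data are assumptions made for illustration. Whether it actually helps on the housing data would have to be checked on the test set.

```python
import numpy as np
from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Skewed synthetic features (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.lognormal(mean=1.0, sigma=0.5, size=(300, 2))
y_demo = np.log(X_demo[:, 0]) + 0.1 * rng.normal(size=300)

# The pipeline fits the scaler and the regressor together, so the same
# scaling is applied automatically at prediction time.
model = make_pipeline(RobustScaler(), svm.SVR(kernel='rbf', C=1.0))
model.fit(X_demo, y_demo)
preds = model.predict(X_demo[:5])

print('predictions:', preds)
```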