# Non-Linear SVM
## Classification of linearly inseparable data

For the sake of simplicity, we'll be revisiting the Iris dataset.  
We will build and train a non-linear SVM classifier to detect whether data points
represent *I. setosa* or one of the other *Iris* varieties.  
Then, we will plot the decision boundaries resulting from different parameters
for training the support vector machine.

## Initial imports
The first step is to load the required libraries.

In [None]:
import numpy as np
from sklearn import datasets

## TensorFlow initialization
Next, we need to import TensorFlow and clear the default computational graph. The code in this notebook uses the TensorFlow 1.x graph API (`tf.Session`, `tf.placeholder`).

In [None]:
import tensorflow as tf
from tensorflow.python.framework import ops
ops.reset_default_graph()

## Session declaration

In [None]:
session = tf.Session()

## Loading the dataset

In [None]:
# Load dataset using sklearn.datasets.load_iris()
dataset = datasets.load_iris()

# Select two features (sepal length, petal width) rather than using the whole set
X = np.array([[row[0], row[3]] for row in dataset.data])

# Binarize class labels: 1 if I. setosa, -1 otherwise
y = np.array([1 if label == 0 else -1 for label in dataset.target])

## Setting up model parameters, placeholder grids

In [None]:
# Batch size is fixed ahead of time because `b` below holds one value per batch element
batch_size = 200

# Init X, y placeholder grids
X_grid = tf.placeholder(shape=[None, 2], dtype=tf.float32)
y_grid = tf.placeholder(shape=[None, 1], dtype=tf.float32)

# Grid for predictions
prediction_grid = tf.placeholder(shape=[None, 2], dtype=tf.float32)

# Trainable b-values for the SVM kernel, one per element of a batch
b = tf.Variable(tf.random_normal(shape=[1, batch_size]))

## Constructing the RBF kernel

The Gaussian / Radial Basis Function (RBF) kernel may be defined as follows:  

$$
K(\textbf{x}_{1}, \textbf{x}_{2})=\exp\left(-\gamma \, \|\textbf{x}_{1}-\textbf{x}_{2}\|^{2}\right)
$$  
where the squared distance between two points expands to  

$$\|\textbf{x}_{1}-\textbf{x}_{2}\|^{2}=\|\textbf{x}_{1}\|^{2}-2\,\textbf{x}_{1}\cdot\textbf{x}_{2}+\|\textbf{x}_{2}\|^{2}$$  

which is the relation we will use for our kernel calculation, since the dot products for every pair of points in a batch come from a single matrix product.

  * [Read more about the RBF Kernel](https://en.wikipedia.org/wiki/Radial_basis_function_kernel)
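
As a minimal NumPy sketch of the definition above (not the TensorFlow code used in this notebook), the kernel matrix for a small batch can be built from the expansion of the squared distance. The function name `rbf_kernel_matrix` and the sample points are assumptions made for illustration, and `gamma` is passed as a positive number here.

```python
import numpy as np

def rbf_kernel_matrix(points, gamma):
    """Pairwise RBF kernel K[i, j] = exp(-gamma * ||p_i - p_j||^2)."""
    sq_norms = np.sum(points ** 2, axis=1).reshape(-1, 1)
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, for all pairs at once
    sq_dists = sq_norms - 2.0 * points @ points.T + sq_norms.T
    # Clip tiny negative values caused by floating-point round-off
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-gamma * sq_dists)

# Hypothetical sample points
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel_matrix(points, gamma=0.5)
print(K)
```

Each point has kernel value 1 with itself, and the matrix is symmetric because the distance is.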

## What's with the gamma?
Gamma is a constant in the Radial Basis Function (RBF) kernel that determines how far the influence of a single training sample reaches, i.e., the radius. The comparisons below refer to the magnitude of gamma.

  * Smaller values for gamma *increase* that relative influence, producing a wider kernel and smoother decision boundaries.  
  * Larger values *decrease* the influence of a sample, producing 'tighter'-looking decision boundaries.
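
To make the effect concrete, here is a small sketch that evaluates the kernel for two samples at a fixed, hypothetical distance under several positive values of gamma. The distance and the gamma values are chosen only for illustration.

```python
import math

distance = 0.5  # hypothetical distance between two samples
similarities = {}
for gamma in (1.0, 10.0, 45.0):
    # Kernel value for this pair: exp(-gamma * distance^2)
    similarities[gamma] = math.exp(-gamma * distance ** 2)
    print(f"gamma={gamma:5.1f}  K={similarities[gamma]:.6f}")
```

The kernel value for the same pair shrinks as gamma grows, which is why a large gamma makes each sample's influence more local.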

The code below is a TensorFlow representation of the above RBF kernel (note that `gamma` is defined as a negative constant, so the minus sign in the formula is folded into it):

In [None]:
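gamma = tf.constant(-45.0) # The kernel's minus sign is folded into gamma
# Pairwise squared distances: ||a||^2 - 2*a.b + ||b||^2
sq_norms = tf.reshape(tf.reduce_sum(tf.square(X_grid), 1), [-1, 1])
sq_dists = tf.add(tf.subtract(sq_norms, tf.multiply(2., tf.matmul(X_grid, tf.transpose(X_grid)))), tf.transpose(sq_norms))
rbf_kernel = tf.exp(tf.multiply(gamma, tf.abs(sq_dists)))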

## Computational step
The non-linear SVM is trained by *maximizing* an objective function. Because TensorFlow optimizers minimize, we minimize the negative of that objective and call it the loss:

In [None]:
first = tf.reduce_sum(b)
b_cross = tf.matmul(tf.transpose(b), b)
y_grid_cross = tf.matmul(y_grid, tf.transpose(y_grid))
second = tf.reduce_sum(tf.multiply(rbf_kernel, tf.multiply(b_cross, y_grid_cross)))

# Loss is negative here because this value needs to be maximized.
# Minimizing a negative maximizes the positive equivalent.
loss = tf.negative(tf.subtract(first, second))
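
Written out, the quantity the cell above computes before negation is the following. This is a transcription of the code, with $b_{i}$ the entries of `b`, $y_{i}$ the labels in the batch, and $K$ the kernel matrix. The textbook SVM dual additionally carries a factor of $\tfrac{1}{2}$ on the second term and constraints on the coefficients, which this code does not impose.

$$
\sum_{i} b_{i} - \sum_{i}\sum_{j} b_{i}\, b_{j}\, y_{i}\, y_{j}\, K(\textbf{x}_{i}, \textbf{x}_{j})
$$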

## Building and applying a prediction kernel
Next, we need to produce a predictor kernel:

In [None]:
# RBF prediction kernel
rA = tf.reshape(tf.reduce_sum(tf.square(X_grid), 1),[-1,1])
rB = tf.reshape(tf.reduce_sum(tf.square(prediction_grid), 1),[-1,1])
pred_sq_dist = tf.add(tf.subtract(rA, tf.multiply(2., tf.matmul(X_grid, tf.transpose(prediction_grid)))), tf.transpose(rB))
pred_kernel = tf.exp(tf.multiply(gamma, tf.abs(pred_sq_dist)))

# Applying said kernel
pred_output = tf.matmul(tf.multiply(tf.transpose(y_grid),b), pred_kernel)
prediction = tf.sign(pred_output-tf.reduce_mean(pred_output))
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.squeeze(prediction), tf.squeeze(y_grid)), tf.float32))
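
The prediction step can be sketched in plain NumPy as follows. This is an illustration under assumptions, not the notebook's TensorFlow graph: the toy points, the labels, and the all-ones `coeffs` (a stand-in for trained values of `b`) are hypothetical, and `gamma` is positive here.

```python
import numpy as np

def rbf_cross_kernel(train, test, gamma):
    """K[i, j] = exp(-gamma * ||train_i - test_j||^2)."""
    r_train = np.sum(train ** 2, axis=1).reshape(-1, 1)
    r_test = np.sum(test ** 2, axis=1).reshape(-1, 1)
    sq_dists = r_train - 2.0 * train @ test.T + r_test.T
    return np.exp(-gamma * np.abs(sq_dists))

# Hypothetical toy data: two tight clusters with labels +1 and -1
train = np.array([[0.0, 0.0], [0.0, 0.1], [3.0, 3.0], [3.0, 3.1]])
labels = np.array([1.0, 1.0, -1.0, -1.0])
coeffs = np.ones(4)  # stand-in for the trained values of b

test = np.array([[0.05, 0.05], [3.0, 3.05]])
# Each test point gets a label-weighted sum of its similarities to the training points
scores = (labels * coeffs) @ rbf_cross_kernel(train, test, gamma=1.0)
# Center the scores before taking the sign, as the notebook's graph does
predictions = np.sign(scores - scores.mean())
print(predictions)
```

Each test point takes the label of the cluster it sits in, because nearby training points dominate the weighted sum.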

## Declare optimizer function, train step
Next, we need to declare an optimizer function to train the classifier.  
We'll use a `GradientDescentOptimizer` from `tensorflow.train`.

Furthermore, we aim to train the model by minimizing the loss function.  
This informs our definition for `train_step` below.

In [None]:
# Initialize gradient descent optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
# Note: the parameter (0.01) is the learning rate - toy with this and see what happens.

# Define training step
train_step = optimizer.minimize(loss)
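
For intuition about what the learning rate controls, here is a bare-bones gradient descent loop on a toy one-dimensional objective. It is a sketch of the general update rule, not of TensorFlow's optimizer internals; the function name `gradient_descent` and the objective are assumptions for illustration.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    value = start
    for _ in range(steps):
        value -= learning_rate * grad(value)
    return value

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3
result = gradient_descent(lambda w: 2.0 * (w - 3.0), start=0.0,
                          learning_rate=0.1, steps=100)
print(result)
```

On this toy objective, a much smaller learning rate would need more steps to get close to the minimum, and a rate above 1.0 would make the iterates diverge.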

## Training preparation: global variables and loop parameters
Before training, every variable in the graph (here, `b`) must be initialized:

In [None]:
# Initialize global variables
init = tf.global_variables_initializer()
session.run(init)

# Loop parameters/variables
num_iter = 300
# Note: here's where you'd want to track temp_loss
# and a temp_accuracy if you're interested in the
# additional exercises at the bottom of the notebook.

## Training the model (finally)
Finally, we construct a training loop that runs the `tf` session with our optimizer via `train_step`.

In [None]:
# Training loop
for i in range(num_iter):
    rand_index = np.random.choice(len(X), size=batch_size)
    rand_X = X[rand_index]
    rand_y = np.transpose([y[rand_index]])
    session.run(train_step, feed_dict={X_grid: rand_X, y_grid: rand_y})
    
    # It's a good idea to confirm that our loss values are decreasing:
    temp_loss = session.run(loss, feed_dict={X_grid: rand_X, y_grid: rand_y})

    if (i+1)%50==0:
        print('Loss @ step ' + str(i+1) + '= ' + str(temp_loss))

If the printed loss values decrease over the run, the optimizer is working and the model is (probably) learning.
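
One detail of the loop worth noting: the batch size (200) is larger than the dataset (150 rows), which works because `np.random.choice` samples with replacement by default. A small sketch, with the seed added only to make it repeatable:

```python
import numpy as np

np.random.seed(0)                 # seeded only to make the sketch repeatable
n_samples, batch_size = 150, 200  # Iris has 150 rows; each batch asks for 200

indices = np.random.choice(n_samples, size=batch_size)  # replace=True by default
n_unique = len(np.unique(indices))
print(indices.shape, n_unique)
```

Every batch therefore contains repeated rows, and some rows may be missing from any given batch.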

## Visualizing the classifier: grid construction
Having trained our classifier, we can visually check its decision boundary.

In [None]:
# Now that we're ready to plot, we should import pyplot
import matplotlib.pyplot as plt

# Construct numpy mesh for plotting
X_min, X_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(X_min, X_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
grid_points = np.c_[xx.ravel(), yy.ravel()]
[grid_predictions] = session.run(prediction, feed_dict={X_grid: rand_X,
                                                       y_grid: rand_y,
                                                       prediction_grid: grid_points})
grid_predictions = grid_predictions.reshape(xx.shape)
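
The mesh construction above is easier to follow on a tiny grid. The following sketch uses hypothetical integer coordinates and a stand-in "prediction" (the sum of the two coordinates) to show how grid nodes are flattened into rows and how per-point results are reshaped back onto the grid.

```python
import numpy as np

xs = np.arange(0, 2)   # 2 x-coordinates: 0, 1
ys = np.arange(0, 3)   # 3 y-coordinates: 0, 1, 2
xx, yy = np.meshgrid(xs, ys)                 # both have shape (3, 2)
grid_points = np.c_[xx.ravel(), yy.ravel()]  # shape (6, 2), one row per grid node

# A per-point result (here a stand-in: x + y) reshapes back onto the grid
values = grid_points.sum(axis=1).reshape(xx.shape)
print(grid_points)
print(values)
```

The reshaped array lines up with `xx` and `yy`, which is the layout `plt.contourf` expects.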

## Pulling per-class data from feature data

In [None]:
# Pulling out sepal length and petal width for each class for plotting
X1 = [x[0] for i,x in enumerate(X) if y[i]==1]
y1 = [x[1] for i,x in enumerate(X) if y[i]==1]
X2 = [x[0] for i,x in enumerate(X) if y[i]!=1]
y2 = [x[1] for i,x in enumerate(X) if y[i]!=1]
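
The same per-class split can also be written with NumPy boolean masks. The arrays below are hypothetical stand-ins for the notebook's `X` and `y`, so the snippet runs on its own.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's X (two features) and y (labels)
features = np.array([[5.1, 0.2], [7.0, 1.4], [4.9, 0.2], [6.3, 2.5]])
labels = np.array([1, -1, 1, -1])

is_setosa = labels == 1  # boolean mask, one entry per row
setosa_x, setosa_y = features[is_setosa, 0], features[is_setosa, 1]
other_x, other_y = features[~is_setosa, 0], features[~is_setosa, 1]
print(setosa_x, other_x)
```

Masking avoids the explicit `enumerate` loop and keeps the results as arrays.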

## Plotting results

In [None]:
%matplotlib inline
# Plot points and grid
plt.contourf(xx, yy, grid_predictions, cmap=plt.cm.Paired, alpha=0.8)
plt.plot(X1, y1, 'ro', label='I. setosa')
plt.plot(X2, y2, 'kx', label='Non setosa')
plt.title('RBF Kernel Results on Iris Data')
plt.xlabel('Sepal length')
plt.ylabel('Petal width')
plt.legend(loc='lower right')
plt.ylim([-0.5, 3.0])
plt.xlim([3.5, 8.5])
plt.show()

_Sweet victory!_
The classifier can clearly differentiate between *I. setosa* and the other varieties.

Notice that the decision boundary is a curve.
You can imagine that for each support vector selected, an ellipsoid is projected out with it at the center.
The overlap of the hyper-ellipsoids forms the decision boundaries.

## Going further: additional exercises
* Plot the boundaries produced when using different values for `gamma`.
* Plot per-batch accuracy over time.
* Plot the loss function over time (intuitively, how should this graph look?)
* Describe the tradeoffs between boundary smoothness and classification accuracy based on your understanding of `gamma` and the first question above.

# Save your notebook