# Stochastic Optimization

In [None]:
# sklearn

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm

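iris = datasets.load_iris()
X = iris.data[:, :2]  # keep only the first two features (sepal length and width)
y = iris.target
# gamma is ignored by the linear kernel; it only affects the RBF model
svc1 = svm.SVC(C=1.0, gamma='auto', kernel='linear')
svc2 = svm.SVC(C=1.0, gamma='auto', kernel='rbf')
svc1.fit(X, y)
svc2.fit(X, y)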

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100  # mesh step: 1/100 of the x-range
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

plt.subplot(1, 1, 1)

Z1 = svc1.predict(np.c_[xx.ravel(), yy.ravel()])
Z1 = Z1.reshape(xx.shape)

plt.contourf(xx, yy, Z1, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Sklearn SVC')
plt.show()

Z2 = svc2.predict(np.c_[xx.ravel(), yy.ravel()])
Z2 = Z2.reshape(xx.shape)
plt.contourf(xx, yy, Z2, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Sklearn RBF')
plt.show()

### Python - Spark

In [None]:
# spark
# From the Spark documentation: http://spark.apache.org/docs/2.2.0/mllib-linear-methods.html#linear-support-vector-machines-svms
# Assumes an existing SparkContext named `sc` (as provided by the pyspark shell).

from pyspark.mllib.classification import SVMWithSGD, SVMModel
from pyspark.mllib.regression import LabeledPoint

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = SVMWithSGD.train(parsedData, iterations=100)

# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

# Save and load model
model.save(sc, "target/tmp/pythonSVMWithSGDModel")
sameModel = SVMModel.load(sc, "target/tmp/pythonSVMWithSGDModel")

### Resources

* Implementing in R: https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/

## Stochastic Gradient Descent (SGD)

![450px-Gradient_ascent_%28surface%29.png](attachment:450px-Gradient_ascent_%28surface%29.png)

### Intuition

Wait, we've already seen this graphic before! It has come back to help us build our intuition for how stochastic gradient descent works. As we explored earlier, we can imagine that the output of the objective function we want to optimize lies on the $Z$ axis, and that the input variables of our objective function lie on the $X$ and $Y$ axes.

Recall that in traditional gradient descent we calculate the gradient (the vector of all the partial derivatives at a given point) and move in the opposite direction, which is the direction of steepest descent (hence the name). So how does *stochastic* gradient descent differ?

The "stochastic" part is that each update uses the gradient of a single training example rather than of the whole training set, and the order in which the examples are visited is randomly shuffled on every pass.

Do we know if we're ever going to get there? From Wikipedia: "when the learning rates decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum". For those of you interested in why this is so, I encourage you to check out the Robbins-Siegmund paper from 1971.
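One common way to make "decrease with an appropriate rate" precise is the classical Robbins-Monro step-size condition. The notation $\eta_t$ (the learning rate used at update $t$) is introduced here for illustration and does not appear in the quote:

$$\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty$$

For example, $\eta_t = \eta_0 / t$ satisfies both conditions: the steps shrink slowly enough to reach any point, but fast enough for the noise to average out.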

### Definition

Choose an initial vector of parameters $w$ and learning rate $\eta$.

Repeat until an approximate minimum is obtained:
* Randomly shuffle examples in the training set.
* For $i = 1, 2, \ldots, n$, do:
    * $w := w - \eta \nabla Q_i(w)$
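The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production optimizer: it assumes each $Q_i$ is the squared error of a linear model on example $i$, and the function name `sgd`, the fixed number of epochs, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=50, seed=0):
    """Plain SGD, assuming Q_i(w) = 0.5 * (x_i . w - y_i)^2 (an illustrative choice)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                            # initial vector of parameters
    for _ in range(epochs):                    # stand-in for "repeat until an approximate minimum"
        for i in rng.permutation(n):           # randomly shuffle the training examples
            grad_i = (X[i] @ w - y[i]) * X[i]  # gradient of Q_i at the current w
            w = w - eta * grad_i               # w := w - eta * grad Q_i(w)
    return w

# Hypothetical noiseless data generated from known weights.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

w_hat = sgd(X, y)
print(w_hat)  # should be close to true_w
```

Each inner-loop step touches a single example, which is what makes one update cheap compared with a full-batch gradient.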

### Python - Sklearn

In [None]:
# http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html

import numpy as np
from sklearn import linear_model

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
clf = linear_model.SGDClassifier()
clf.fit(X, Y)

print(clf.predict([[-0.8, -1]]))

## Simulated Annealing

Let's explore another method for optimizing SVMs that also involves a random variable. This technique attempts to find a global optimum, and was inspired by annealing in metallurgy, where a material is heated and then cooled in a controlled way.


* It is most useful when the search space is discrete.


### Pseudocode

* Let s = s0
* For k = 0 through kmax (exclusive):
    * T ← temperature(k ∕ kmax)
    * Pick a random neighbour, snew ← neighbour(s)
    * If P(E(s), E(snew), T) ≥ random(0, 1):
        * s ← snew
* Output: the final state s
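The pseudocode above leaves `temperature`, `neighbour`, `E`, and `P` unspecified. The sketch below fills them in with common but assumed choices: a linear cooling schedule, the Metropolis acceptance rule, and a small hypothetical energy over the integers with several local minima. All function names and parameter values here are illustrative.

```python
import math
import random

def energy(s):
    # Hypothetical energy: global minimum at s = 7, local minima every 3 steps.
    return abs(s - 7) + (0 if s % 3 == 1 else 3)

def neighbour(s, rng):
    # Random neighbour: move one step left or right.
    return s + rng.choice([-1, 1])

def temperature(r, t0=1.0):
    # Linear cooling (an assumption); r = k / kmax lies in [0, 1), so T stays positive.
    return t0 * (1.0 - r)

def accept_prob(e, e_new, t):
    # Metropolis rule: always accept improvements, sometimes accept worse states.
    return 1.0 if e_new <= e else math.exp(-(e_new - e) / t)

def simulated_annealing(s0, kmax=20_000, seed=0):
    rng = random.Random(seed)
    s = s0
    for k in range(kmax):
        t = temperature(k / kmax)
        s_new = neighbour(s, rng)
        if accept_prob(energy(s), energy(s_new), t) >= rng.random():
            s = s_new
    return s  # the final state

s_final = simulated_annealing(s0=20)
print(s_final, energy(s_final))
```

Early on, the high temperature lets the search climb out of local minima; as the temperature approaches zero, only improving moves are accepted and the state settles.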

### CODE: Scipy

In [None]:
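# https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.anneal.html
# Note: `optimize.anneal` and `method='Anneal'` were removed from later SciPy releases.
# This cell swaps in `optimize.dual_annealing`, a related but different annealing algorithm.

import numpy as np
from scipy import optimize

def f(z, *params):
    # Objective from the SciPy example linked above: a quadratic bowl plus two Gaussian wells.
    x, y = z
    a, b, c, d, e, f0, g, h, i, j, k, l, scale = params
    bowl = a * x**2 + b * x * y + c * y**2 + d * x + e * y + f0
    well1 = -g * np.exp(-((x - h)**2 + (y - i)**2) / scale)
    well2 = -j * np.exp(-((x - k)**2 + (y - l)**2) / scale)
    return bowl + well1 + well2

np.random.seed(777)  # Seed NumPy's global RNG, as in the original example.
x0 = np.array([2., 2.])  # Initial guess.
params = (2, 3, 7, 8, 9, 10, 44, -1, 2, 26, 1, -2, 0.5)
bounds = [(-10, 10), (-10, 10)]  # Search box (an assumption for this sketch).

res2 = optimize.dual_annealing(f, bounds, args=params, x0=x0)
print(res2.x, res2.fun)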

### CODE: Skyler

In [None]:
#https://github.com/skylergrammer/SimulatedAnnealing

## Particle Swarm Optimization

### Pseudocode

* For each particle i = 1, ..., S do:
    * Initialize the particle's position with a uniformly distributed random vector: xi ~ U(blo, bup)
    * Initialize the particle's best known position to its initial position: pi ← xi
    * If f(pi) < f(g) then:
        * Update the swarm's best known position: g ← pi
    * Initialize the particle's velocity: vi ~ U(-|bup-blo|, |bup-blo|)
* While a termination criterion is not met do:
    * For each particle i = 1, ..., S do:
        * For each dimension d = 1, ..., n do:
            * Pick random numbers: rp, rg ~ U(0,1)
            * Update the particle's velocity: vi,d ← ω vi,d + φp rp (pi,d-xi,d) + φg rg (gd-xi,d)
        * Update the particle's position: xi ← xi + vi
        * If f(xi) < f(pi) then:
            * Update the particle's best known position: pi ← xi
            * If f(pi) < f(g) then:
                * Update the swarm's best known position: g ← pi
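The pseudocode above can be sketched with NumPy as follows. This is a simplified, vectorized variant rather than a line-by-line translation: all particles move first and the swarm's best position is updated once per iteration, and the termination criterion is simply a fixed iteration count. The function name `pso`, the default values of ω, φp, and φg, and the test objective are assumptions made for illustration.

```python
import numpy as np

def pso(f, b_lo, b_up, n_particles=30, n_iters=300,
        omega=0.7, phi_p=1.5, phi_g=1.5, seed=0):
    rng = np.random.default_rng(seed)
    b_lo = np.asarray(b_lo, dtype=float)
    b_up = np.asarray(b_up, dtype=float)
    span = np.abs(b_up - b_lo)
    n = b_lo.size

    x = rng.uniform(b_lo, b_up, size=(n_particles, n))   # positions: xi ~ U(blo, bup)
    v = rng.uniform(-span, span, size=(n_particles, n))  # velocities: vi ~ U(-|bup-blo|, |bup-blo|)
    p = x.copy()                                         # best known position per particle
    p_val = np.array([f(xi) for xi in x])
    g = p[np.argmin(p_val)].copy()                       # swarm's best known position
    g_val = p_val.min()

    for _ in range(n_iters):                             # fixed budget as the termination criterion
        r_p = rng.uniform(size=(n_particles, n))         # rp ~ U(0, 1), per particle and dimension
        r_g = rng.uniform(size=(n_particles, n))         # rg ~ U(0, 1)
        v = omega * v + phi_p * r_p * (p - x) + phi_g * r_g * (g - x)
        x = x + v
        vals = np.array([f(xi) for xi in x])
        improved = vals < p_val                          # particles that beat their own best
        p[improved] = x[improved]
        p_val[improved] = vals[improved]
        if p_val.min() < g_val:                          # update the swarm's best if any particle beat it
            g_val = p_val.min()
            g = p[np.argmin(p_val)].copy()
    return g, g_val

# Hypothetical objective: a shifted sphere with its minimum at (1, -2).
target = np.array([1.0, -2.0])
g_best, g_val = pso(lambda z: float(np.sum((z - target) ** 2)),
                    b_lo=[-5.0, -5.0], b_up=[5.0, 5.0])
print(g_best, g_val)
```

No position clamping is applied here, so particles may leave the initial box; for bounded problems you would typically clip `x` to `[b_lo, b_up]` after each move.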

### CODE: Pyswarms