In [11]:
import tensorflow as tf
import numpy as np
sess = tf.Session()

# Optimization in TensorFlow

## Constraints that can't be satisfied: put them into the objective function as a penalty
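
As a minimal sketch of the penalty idea, assume a toy problem that is not part of this notebook: minimize $(x - 3)^2$ subject to $x \le 1$. The constraint is dropped and replaced by a penalty term that is zero when the constraint holds. The names `penalized_gradient` and `penalty_weight`, and the problem itself, are illustrative assumptions.

```python
# Toy penalty-method example (hypothetical problem, for illustration only):
# minimize (x - 3)^2 subject to x <= 1.

def penalized_gradient(x, penalty_weight):
    """Gradient of (x - 3)^2 + penalty_weight * max(0, x - 1)^2."""
    violation = max(0.0, x - 1.0)  # zero whenever the constraint is satisfied
    return 2.0 * (x - 3.0) + 2.0 * penalty_weight * violation

penalty_weight = 100.0
step = 1e-3
x = 0.0
for _ in range(5000):
    x -= step * penalized_gradient(x, penalty_weight)

print(x)  # ends slightly above the boundary x = 1
```

The penalized minimum sits at $(3 + \lambda)/(1 + \lambda)$ for penalty weight $\lambda$, so a larger weight pushes the answer closer to the boundary without ever enforcing it exactly.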

What happens if you want something not yet in TensorFlow?

Optimization is a great framework for modeling.

Be an **innovator**; don't just **follow**.

Read papers and implement them.

Common objective functions

1. MSE
2. MAE
3. Logistic Error
4. Cross Entropy
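
For reference, the four objectives above can be sketched in plain NumPy. This is one common set of conventions, not the only one: "Logistic Error" is read here as the logistic loss with labels in {-1, +1}, and cross entropy is taken over rows of probability distributions. The function names and the `eps` guard are assumptions of this sketch.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_hat))

def logistic_error(y_pm, score):
    """Logistic loss; y_pm holds labels in {-1, +1}, score is the raw model output."""
    return np.mean(np.log1p(np.exp(-y_pm * score)))

def cross_entropy(p_true, p_pred, eps=1e-12):
    """Cross entropy between rows of p_true and p_pred (each row sums to 1)."""
    return -np.mean(np.sum(p_true * np.log(p_pred + eps), axis=1))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])
print(mse(y, y_hat), mae(y, y_hat))
```

MSE punishes the single large miss more heavily than MAE does, which is the usual reason to prefer MAE when the data contain outliers.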

Generate some synthetic data - it is easier to understand an algorithm when the true answer is known.

In [2]:
a = np.array([[1, 2, 3, 4]], dtype=np.float32)
XX = np.random.rand(10000, 4)
YY = np.dot(XX, a.transpose())

In [5]:
w = tf.Variable(tf.random_uniform([4,1]))

In [6]:
# linear model; the objective defined in the next cell is the sum of squared errors
def f(X):
    return tf.matmul(X, w)

In [8]:
def objective(X, Y):
    # sum (not mean) of squared errors; tf.subtract replaces the deprecated tf.sub
    return tf.reduce_sum(tf.square(tf.subtract(Y, f(X))))

In [9]:
X = tf.placeholder(tf.float32, [None, 4])

In [10]:
Y = tf.placeholder(tf.float32, [None, 1])

In [13]:
grad = tf.gradients(objective(X, Y), [w])

In [14]:
sess.run(tf.global_variables_initializer())  # replaces the deprecated tf.initialize_all_variables

In [15]:
sess.run(grad, feed_dict={X:np.zeros([10,4]), Y:np.random.rand(10,1)})

[array([[ 0.],
        [ 0.],
        [ 0.],
        [ 0.]], dtype=float32)]
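
The zero gradient above follows from the closed form: for the sum of squared errors, $\nabla_w = -2 X^\top (Y - Xw)$, which vanishes when $X$ is all zeros. A minimal NumPy sketch checks this and compares the formula against central finite differences. The helper names, the seed, and the small random test data are assumptions of the sketch, not part of the notebook.

```python
import numpy as np

def sse(X, Y, w):
    """Sum of squared errors of the linear model X @ w."""
    return np.sum((Y - X @ w) ** 2)

def sse_gradient(X, Y, w):
    """Closed-form gradient of sse with respect to w."""
    return -2.0 * X.T @ (Y - X @ w)

def numerical_gradient(X, Y, w, h=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(w)
    for i in range(w.shape[0]):
        bump = np.zeros_like(w)
        bump[i, 0] = h
        grad[i, 0] = (sse(X, Y, w + bump) - sse(X, Y, w - bump)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
w0 = rng.random((4, 1))
X_small = rng.random((10, 4))
Y_small = rng.random((10, 1))

zero_case = sse_gradient(np.zeros((10, 4)), Y_small, w0)  # mirrors the cell above
analytic = sse_gradient(X_small, Y_small, w0)
numeric = numerical_gradient(X_small, Y_small, w0)
print(zero_case.ravel(), np.max(np.abs(analytic - numeric)))
```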

In [16]:
step = tf.constant(1e-5)

In [17]:
sess.run(tf.global_variables_initializer())  # replaces the deprecated tf.initialize_all_variables

In [19]:
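# Build the update op once; creating it inside the loop adds new graph nodes on every iteration
update = tf.assign_add(w, tf.multiply(-step, grad[0]))
for i in range(100):
    sess.run(update, feed_dict={X:XX, Y:YY})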

## The true weights were 1, 2, 3, 4

In [20]:
sess.run(w)

array([[ 1.33069658],
       [ 1.98263347],
       [ 2.94487882],
       [ 3.74473619]], dtype=float32)

In [22]:
# Run 1000 more steps, again with a single prebuilt update op
update = tf.assign_add(w, tf.multiply(-step, grad[0]))
for i in range(1000):
    sess.run(update, feed_dict={X:XX, Y:YY})

In [23]:
sess.run(w)

array([[ 1.00000656],
       [ 1.99999964],
       [ 2.99999642],
       [ 3.99999642]], dtype=float32)
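
The same loop can be replicated in plain NumPy to check the arithmetic independently of TensorFlow. This is a sketch under assumptions: the weights start at zero rather than at random uniform values, the data come from a seeded generator, and `run_gradient_descent` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true = np.array([[1.0, 2.0, 3.0, 4.0]])
XX_np = rng.random((10000, 4))
YY_np = XX_np @ a_true.T

def run_gradient_descent(w, n_steps, step=1e-5):
    """Plain gradient descent on the sum of squared errors."""
    for _ in range(n_steps):
        grad = -2.0 * XX_np.T @ (YY_np - XX_np @ w)
        w = w - step * grad
    return w

w_100 = run_gradient_descent(np.zeros((4, 1)), 100)   # after the first 100 steps
w_1100 = run_gradient_descent(w_100, 1000)            # after 1000 further steps
print(w_100.ravel())
print(w_1100.ravel())
```

As in the TensorFlow run, 100 steps leave a visible error and the further 1000 steps remove it, because the error along the low-curvature directions shrinks by only a small factor per step.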

## This is slow, of course - use TensorFlow gradient descent

In [25]:
optimizer = tf.train.GradientDescentOptimizer(step)

In [26]:
train = optimizer.minimize(objective(X,Y), var_list=[w])

In [27]:
for i in range(100):
    sess.run(train, feed_dict={X:XX, Y:YY})

In [28]:
sess.run(w)

array([[ 1.00000656],
       [ 1.99999964],
       [ 2.99999642],
       [ 3.99999642]], dtype=float32)

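## Does TensorFlow cut the number of iterations?

The weights printed after the optimizer run are identical, digit for digit, to those from the manual loop. That is what one would expect if `w` was not reinitialized before In [27]: the optimizer started from weights that had already converged, so its 100 iterations changed nothing. `GradientDescentOptimizer` applies the same update, $w \leftarrow w - \text{step} \cdot \nabla$, so for a given step size it needs as many iterations as the manual loop. What it saves is hand-written update code, not iterations.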

Try step = 1e-2.
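
Before trying a larger step, it helps to estimate how large a step plain gradient descent can tolerate. For a quadratic objective the iteration is stable only when the step is below $2 / \lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian; for the sum of squared errors the Hessian is $2 X^\top X$. The sketch below computes that bound for data generated the same way as `XX` (the seeded generator and variable names are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
XX_np = rng.random((10000, 4))

# Hessian of sum((Y - X w)^2) with respect to w
hessian = 2.0 * XX_np.T @ XX_np
largest_eigenvalue = np.linalg.eigvalsh(hessian).max()

# Plain gradient descent on a quadratic diverges above this step size
max_stable_step = 2.0 / largest_eigenvalue
print(max_stable_step)
```

The bound lands between 1e-5 and 1e-2, which suggests that `step = 1e-2` would overshoot and diverge on the summed objective; dividing the objective by the number of rows (a true MSE) would raise the bound by the same factor.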

## Other optimizers

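1. Gradient descent - no memory: each step uses only the current gradient
2. Adam optimizer - has memory: keeps running averages of the gradient and of its square
3. RMSProp - divides each step by a running average of gradient magnitudes, so the sign of the gradient matters more than its size
4. AdaGrad - different step sizes for different variables, based on each variable's accumulated squared gradients
5. Stochastic gradient descent - introduces noise that can help escape local minima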
6. CoordinateGradientDescent is available in a later version of TensorFlow
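
The update rules behind the first four entries can be sketched in NumPy as follows. These are textbook forms written for illustration, not TensorFlow's implementations; the toy quadratic objective, the function names, and all hyperparameter values are assumptions.

```python
import numpy as np

def gradient_descent(grad_fn, w, lr=0.1, n_steps=200):
    for _ in range(n_steps):
        w = w - lr * grad_fn(w)                      # no state kept between steps
    return w

def adagrad(grad_fn, w, lr=0.5, n_steps=200, eps=1e-8):
    accum = np.zeros_like(w)                         # per-variable sum of squared gradients
    for _ in range(n_steps):
        g = grad_fn(w)
        accum = accum + g ** 2
        w = w - lr * g / (np.sqrt(accum) + eps)
    return w

def rmsprop(grad_fn, w, lr=0.01, decay=0.9, n_steps=200, eps=1e-8):
    mean_sq = np.zeros_like(w)                       # running average of squared gradients
    for _ in range(n_steps):
        g = grad_fn(w)
        mean_sq = decay * mean_sq + (1 - decay) * g ** 2
        w = w - lr * g / (np.sqrt(mean_sq) + eps)
    return w

def adam(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, n_steps=200, eps=1e-8):
    m = np.zeros_like(w)                             # running average of gradients
    v = np.zeros_like(w)                             # running average of squared gradients
    for t in range(1, n_steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy objective: 0.5 * ||w - target||^2, whose gradient is w - target
target = np.array([1.0, 2.0, 3.0, 4.0])

def toy_loss(w):
    return 0.5 * np.sum((w - target) ** 2)

def toy_grad(w):
    return w - target

start = np.zeros(4)
final_losses = {
    fn.__name__: toy_loss(fn(toy_grad, start))
    for fn in (gradient_descent, adagrad, rmsprop, adam)
}
print(final_losses)
```

With these settings every method lowers the loss, but at different rates, since the adaptive methods move each variable by roughly `lr` per step regardless of the gradient's size.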

MiniBatch

* Adding noise to the gradient helps - it forces mistakes that move the search away from local minima
* Computing the full gradient is slow
* Computing the gradient on a few points is a good, cheap approximation
* Sampling gives an idea of where the global minimum is
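
The points above can be sketched with a mini-batch version of the regression in NumPy. Each step estimates the gradient from a random sample of rows instead of all 10000. The batch size, step size, iteration count, seed, and the use of a mean (rather than summed) squared error are assumptions chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true = np.array([[1.0, 2.0, 3.0, 4.0]])
XX_np = rng.random((10000, 4))
YY_np = XX_np @ a_true.T

batch_size = 100
step = 0.1
w_sgd = np.zeros((4, 1))
for _ in range(3000):
    # Sample a mini-batch of rows (with replacement)
    idx = rng.integers(0, XX_np.shape[0], size=batch_size)
    Xb, Yb = XX_np[idx], YY_np[idx]
    # Gradient of the mean squared error on the batch only
    grad = -2.0 / batch_size * Xb.T @ (Yb - Xb @ w_sgd)
    w_sgd -= step * grad

print(w_sgd.ravel())
```

Because the synthetic targets contain no noise, every batch agrees on the same solution and the iterates settle on it; with noisy targets they would keep fluctuating around the minimum unless the step size is decayed.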

## Revisit the linear regression problem with mini-batch

Approximate the $L_1$ norm with $\sqrt{(y - \hat{y})^2 + \epsilon}$.
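
A minimal sketch of this approximation and its derivative, assuming `eps = 1e-6` (the notebook does not give a value) and hypothetical function names. The point of the square root is that the derivative exists everywhere, including at a residual of zero where $|y - \hat{y}|$ has a kink.

```python
import numpy as np

def smooth_abs(residual, eps=1e-6):
    """Smooth approximation of |residual|: sqrt(residual^2 + eps)."""
    return np.sqrt(residual ** 2 + eps)

def smooth_abs_gradient(residual, eps=1e-6):
    """Derivative of smooth_abs with respect to the residual."""
    return residual / np.sqrt(residual ** 2 + eps)

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
approx = smooth_abs(residuals)
slopes = smooth_abs_gradient(residuals)
print(approx)
print(slopes)
```

The largest gap to the true absolute value is $\sqrt{\epsilon}$, reached at a residual of zero, and the slope stays strictly between -1 and 1, so a few large errors cannot dominate the gradient the way they do with squared error.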