##Lecture 7 - Optimization Problems using Theano

#What you will learn in this session
1.  How to use Theano to implement the gradient descent techniques that you've learned
2.  How to use Theano to implement the simple ANN that you've seen.  

#Order of Topics
1.  Using Theano for function minimization
2.  Using Theano to train a simple ANN

#Using Theano for function minimization (and maximization).

Now that you have the basics of Theano and an understanding of gradient descent, you're ready to combine the two.  The first problem is a simple maximization of a real-valued function of several variables.  As an example, here's a function in the shape of a multidimensional Gaussian density.  

$f(x) = e^{-\frac12(x - m)^T(x - m)}$

In this equation $x, m \in \mathbb{R}^n$.  $x$ is the variable that you'll find by gradient descent (strictly speaking gradient ascent, since the goal here is a maximum), and $m$ plays the role of the mean in a Gaussian density.  Notice that the argument of the exponential is always less than or equal to zero, and that it equals zero only when $x = m$.  Therefore the point that maximizes $f(x)$ is $x = m$.  The function $f(x)$ has the shape of a Gaussian density, but it lacks the multiplicative normalizing constant, so the area under $f(x)$ is not equal to 1.  The code snippet below shows Theano code for maximizing $f(x)$ by repeatedly stepping along the gradient.  

In [13]:
__author__ = 'mike.bowles'
import theano
from theano import tensor as T
from theano import function
import numpy as np

x = theano.shared(np.array([0.0, 0.0, 0.0]), name='x')
cost = T.scalar('cost')
err = T.vector(name='err')
gradient = T.vector(name='gradient')

#cost is essentially a multidimensional unit variance, Gaussian centered at m = [1, 2, 3].
#The only difference is that the multiplicative normalizing constant is left out.
#That doesn't change the location of the maximum which is still at "m"
err = x - np.array([1.0, 2.0, 3.0])
cost =  T.exp((- 0.5) * T.sum((err**2)))

#gradient of the cost wrt the vector x
gradient = T.grad(cost, wrt=x)

#update step: add the gradient (step size 1.0) because we are maximizing, not minimizing
update = [[x, x + 1.0 * gradient]]

train = function([], outputs = cost, updates=update)
#train = function([], outputs = cost, updates=update, mode='DebugMode')
nSteps = 100

for i in range(nSteps):
    train()
    if i % 10 == 0: print i, x.get_value(), cost.eval()

theano.printing.pprint(cost)
#theano.printing.debugprint(cost) 

0 [ 0.00091188  0.00182376  0.00273565] 0.000923592617972
10 [ 0.01067264  0.02134528  0.03201792] 0.00105799424176
20 [ 0.02182399  0.04364798  0.06547197] 0.00123362517897
30 [ 0.03479875  0.0695975   0.10439625] 0.00147176314327
40 [ 0.05026133  0.10052265  0.15078398] 0.00181073064956
50 [ 0.06930423  0.13860845  0.20791268] 0.00232656499668
60 [ 0.09390395  0.1878079   0.28171185] 0.00319211890747
70 [ 0.12818393  0.25636786  0.38455178] 0.00489058745742
80 [ 0.1830675  0.366135   0.5492025] 0.00935680898265
90 [ 0.30663705  0.6132741   0.91991116] 0.0345528499153


'exp((TensorConstant{-0.5} * Sum{acc_dtype=float64}(((x - TensorConstant{[ 1.  2.  3.]}) ** TensorConstant{2}))))'
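
The slow start in the output above has a simple explanation.  Differentiating $f$ gives $\nabla f(x) = -(x - m) f(x)$, and at the starting point $x = 0$ the factor $f(0) = e^{-7} \approx 0.0009$ makes the gradient tiny.  The plain-Python sketch below is not part of the lecture code.  It hand-codes that gradient and repeats the update `x + 1.0 * gradient` without Theano; the helper names `f`, `grad_f` and `ascend` are illustrative choices, not Theano functions.

```python
import math

# Center of the Gaussian-shaped function, same values as in the lecture code.
M = [1.0, 2.0, 3.0]

def f(x, m=M):
    """f(x) = exp(-0.5 * (x - m)^T (x - m))."""
    return math.exp(-0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, m)))

def grad_f(x, m=M):
    """Hand-derived gradient: grad f(x) = -(x - m) * f(x)."""
    fx = f(x, m)
    return [-(xi - mi) * fx for xi, mi in zip(x, m)]

def ascend(x, step_size=1.0, n_steps=1):
    """Repeat x <- x + step_size * grad f(x) (ascent, since we maximize)."""
    for _ in range(n_steps):
        g = grad_f(x)
        x = [xi + step_size * gi for xi, gi in zip(x, g)]
    return x

x1 = ascend([0.0, 0.0, 0.0], n_steps=1)
print("after 1 step:    x = %s, f(x) = %.12f" % (x1, f(x1)))

x200 = ascend([0.0, 0.0, 0.0], n_steps=200)
print("after 200 steps: x = %s" % x200)
```

The first step lands at $e^{-7} m$, which is the `0 [ 0.00091188  0.00182376  0.00273565]` line printed by the Theano run, so this is one way to check that `T.grad` is producing the gradient you expect.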

#Q's
1.  Walk through the code one line at a time and describe what each line is doing. 
2.  The pprint statement at the bottom of the code snippet is already active and produces the last line of the output above.  Describe what that line tells you about the expression graph for cost. 
3.  Remove the name='x' from the constructor for the shared variable x and describe how that changes the output.  
4.  Uncomment the debug print statement (last line) and explain the resulting output.
5.  Switch to the debug version of the train function and see if you can introduce a coding error that requires the debug output to unravel. 

#In-class coding exercise
1.  Modify the code above to use the momentum method.  The code box below will help you get started. 

In [16]:
__author__ = 'mike.bowles'
import theano
from theano import tensor as T
from theano import function
import numpy as np

x = theano.shared(np.array([0.0,0.0,0.0]),name='x')
cost = T.scalar('cost')
err = T.vector('err')
gradient = T.vector('gradient')
beta = theano.shared(x.x, name='beta')  #<================ Fill in value here
delta = theano.shared(x.x, name='delta')  #<=============== Fill in value here
xStep = theano.shared(np.array([0.0,0.0,0.0]),name='xStep')

#cost is essentially a multidimensional unit variance, Gaussian centered at m = [1, 2, 3].
#The only difference is that the multiplicative normalizing constant is left out.
#That doesn't change the location of the maximum which is still at "m"
err = x - np.array([1.0, 2.0, 3.0])
cost =  T.exp((- 0.5) * T.sum((err**2)))

#gradient of the cost wrt the vector x
gradient = T.grad(cost, wrt=x)
xStep =    #<========================== Fill in code here (hint: look at lecture notes on momentum)

#update step
update = [[x, x + xStep]]

train = function([], outputs = cost, updates=update)
#train = function([], outputs = cost, updates=update, mode='DebugMode')
nSteps = 100

for i in range(nSteps):
    train()
    if i % 10 == 0: print i, x.get_value(), cost.eval()

0 [ 0.00136782  0.00273565  0.00410347] 0.000929500163129
10 [ 0.016551    0.033102    0.04965299] 0.0011474610305
20 [ 0.03534802  0.07069605  0.10604407] 0.0014827244323
30 [ 0.05983992  0.11967984  0.17951976] 0.00205535472918
40 [ 0.09451585  0.18903169  0.28354754] 0.00321698448097
50 [ 0.15207815  0.3041563   0.45623445] 0.00652052581711
60 [ 0.30087555  0.6017511   0.90262665] 0.0326658541905
70 [ 0.99633654  1.99267309  2.98900963] 0.999906058086
80 [ 0.99999642  1.99999285  2.99998927] 0.99999999991
90 [ 1.          1.99999999  2.99999999] 1.0
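
For comparison with your Theano solution, here is a minimal plain-Python sketch of a momentum update.  It assumes the rule `xStep = beta * xStep + (1 - beta) * delta * gradient` that appears in the code later in this lecture, with the step carried over from one iteration to the next, and it uses `beta=0.7` and `delta=5.0` purely as example settings.  The function names `grad_f` and `momentum_ascent` are illustrative, and the gradient $-(x - m) f(x)$ is coded by hand rather than obtained from `T.grad`.

```python
import math

# Center of the Gaussian-shaped function, same values as in the lecture code.
M = [1.0, 2.0, 3.0]

def grad_f(x, m=M):
    """Hand-derived gradient of f(x) = exp(-0.5 * |x - m|^2)."""
    fx = math.exp(-0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, m)))
    return [-(xi - mi) * fx for xi, mi in zip(x, m)]

def momentum_ascent(x, beta, delta, n_steps):
    """Momentum ascent: the step is a running blend of old step and new gradient."""
    x_step = [0.0] * len(x)  # previous step, remembered between iterations
    for _ in range(n_steps):
        g = grad_f(x)
        # Blend the previous step with the scaled gradient (assumed rule, see lead-in).
        x_step = [beta * s + (1.0 - beta) * delta * gi
                  for s, gi in zip(x_step, g)]
        # Move uphill by the blended step.
        x = [xi + s for xi, s in zip(x, x_step)]
    return x

x_final = momentum_ascent([0.0, 0.0, 0.0], beta=0.7, delta=5.0, n_steps=300)
print("x after 300 steps: %s" % x_final)
```

The point to carry over to Theano is that the previous step must persist across calls, so in a compiled function it needs its own entry in the updates list just as `x` does.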


#Q's
1.  It's a pain to adjust the parameters beta and delta by altering the code, recompiling and rerunning.  Since those get changed frequently, it would be better to make them inputs to the function train().  Outline the steps required to do that. 

#In-class coding exercise
1.  Below is code that treats beta and delta as inputs and that makes the center of the Gaussian-like function an input as well.  It has a bug in it.  See if you can find it.  See if the graph printing statements help you find it quickly.

In [18]:
__author__ = 'mike.bowles'
import theano
from theano import tensor as T
from theano import function
import numpy as np

x = theano.shared(np.array([0.0,0.0,0.0]),name='x')
cost = T.scalar('cost')
err = T.vector('err')
gradient = T.vector('gradient')
beta = T.scalar(name='beta')
delta = T.scalar(name='delta')
xStep = theano.shared(np.array([0.0,0.0,0.0]),name='xStep')
m = T.vector(name='m')

#cost is essentially a multidimensional unit variance, Gaussian centered at m = [1, 2, 3].
#The only difference is that the multiplicative normalizing constant is left out.
#That doesn't change the location of the maximum which is still at "m"
err = x - m
cost =  T.exp((- 0.5) * T.sum((err**2)))

#gradient of the cost wrt the vector x
gradient = T.grad(cost, wrt=x)
xStep = beta * xStep + (1 - beta) * delta * gradient

#update step
update = [[x, x + xStep]]

train = function([beta, delta, m], outputs = cost, updates=update)
#train = function([beta, delta, m], outputs = cost, updates=update, mode='DebugMode')
nSteps = 100

for i in range(nSteps):
    train(0.7, 5.0, np.array([1.0, 2.0, 3.0]))
    if i % 10 == 0: print i, x.get_value()#, cost.eval()

0 [ 0.00136782  0.00273565  0.00410347]

MissingInputError: ("An input of the graph, used to compute Elemwise{sub,no_inplace}(x, m), was not provided and not given a value.Use the Theano flag exception_verbosity='high',for more information on this error.", m)

#Building a neural net with Theano
Now return to the simple neural net that you trained last week.  You now have all the Theano tools needed to train it.  The code box below does so: it generates noisy two-attribute data, then fits a single sigmoid unit by gradient descent on the mean squared error.  

In [20]:
__author__ = 'mike.bowles'
import random
import theano
from theano import tensor as T
from theano import pp
import numpy as np

numRows = 200
noiseSd = 0.5
X1 = []; X2 = []; Yin = []
for i in range(numRows):
    #generate attributes x1 and x2 by drawing from uniform (0,1)
    x1 = random.random()
    x2 = random.random()
    y = 0.0
    if x2 > (x1 + random.normalvariate(0.0, noiseSd)):
        y = 1.0
    X1.append(x1); X2.append(x2); Yin.append(y);

#Form (nrows x 2) np matrix of x-values
Xin = np.array([np.array([x1, x2]) for (x1, x2) in zip(X1, X2)])

X = T.matrix(name='X')
Y = T.vector(name='Y')
cost = T.scalar(name='cost')
y = T.vector(name='y')

w = theano.shared(np.asarray([-10.0, 10.0]), name='w')
y = T.nnet.sigmoid(T.dot(X, w))

cost = T.mean(T.sqr(y - Y))
gradient = T.grad(cost=cost, wrt=w)
updates = [[w, w - gradient * 1.0]]

train = theano.function(inputs=[X, Y], outputs=cost, updates=updates, allow_input_downcast=True)

for i in range(1000):
    train(Xin, Yin)
    if i % 50 == 0: print i, w.get_value()

0 [-9.99842204  9.99747987]
50 [-9.91386469  9.87526903]
100 [-9.82122285  9.75772079]
150 [-9.72342534  9.64210187]
200 [-9.62209163  9.52686775]
250 [-9.51810666  9.4111443 ]
300 [-9.4119438   9.29443483]
350 [-9.30384656  9.17645495]
400 [-9.19393077  9.05703986]
450 [-9.08224203  8.93609249]
500 [-8.96878786  8.81355495]
550 [-8.85355588  8.6893929 ]
600 [-8.73652406  8.56358734]
650 [-8.61766675  8.4361304 ]
700 [-8.49695832  8.3070236 ]
750 [-8.37437572  8.17627727]
800 [-8.2499003   8.04391084]
850 [-8.12351956  7.90995349]
900 [-7.9952288   7.77444519]
950 [-7.86503282  7.63743784]
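
To see what `T.grad` is computing for this network, it can help to write the gradient out by hand.  With $y_i = \sigma(x_i \cdot w)$ and cost $\frac{1}{N}\sum_i (y_i - Y_i)^2$, the chain rule gives $\frac{\partial\,\mathrm{cost}}{\partial w} = \frac{2}{N}\sum_i (y_i - Y_i)\, y_i (1 - y_i)\, x_i$.  The sketch below is a plain-Python stand-in under that derivation, not the lecture's Theano code.  The names `make_data` and `cost_and_gradient` and the fixed random seed are assumptions made for illustration, so the numbers will not match the run above.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_data(num_rows=200, noise_sd=0.5, seed=0):
    """Same recipe as the lecture code: label is 1 when x2 > x1 + noise."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    rows, labels = [], []
    for _ in range(num_rows):
        x1, x2 = rng.random(), rng.random()
        label = 1.0 if x2 > x1 + rng.normalvariate(0.0, noise_sd) else 0.0
        rows.append((x1, x2))
        labels.append(label)
    return rows, labels

def cost_and_gradient(w, rows, labels):
    """Mean squared error of a single sigmoid unit and its gradient w.r.t. w."""
    n = len(rows)
    cost = 0.0
    grad = [0.0, 0.0]
    for (x1, x2), target in zip(rows, labels):
        y = sigmoid(w[0] * x1 + w[1] * x2)
        cost += (y - target) ** 2 / n
        # Chain rule: d/dw (y - Y)^2 = 2 (y - Y) * y (1 - y) * x
        common = 2.0 * (y - target) * y * (1.0 - y) / n
        grad[0] += common * x1
        grad[1] += common * x2
    return cost, grad

rows, labels = make_data()
w = [-10.0, 10.0]  # same starting weights as the lecture code
initial_cost, _ = cost_and_gradient(w, rows, labels)
for step in range(1000):
    _, grad = cost_and_gradient(w, rows, labels)
    w = [w[0] - 1.0 * grad[0], w[1] - 1.0 * grad[1]]  # descent, step size 1.0
final_cost, _ = cost_and_gradient(w, rows, labels)
print("cost %.6f -> %.6f, w = %s" % (initial_cost, final_cost, w))
```

The factor $y_i(1 - y_i)$ is close to zero whenever the sigmoid is saturated, which is common at the starting weights $[-10, 10]$.  That is one plausible reason the weights in the output above move so slowly.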


#Q's
1.  Go through the code one line at a time and explain what each line is doing.  
2.  What is the shape of T.dot(X, w)?  What are the shapes of X and w?  

#Homework
1.  Modify the code above to use the momentum method instead of plain gradient descent. 