(Unfinished)

### Introduction

In this notebook we aim to build a neural network (a multi-layer perceptron) that can solve harder, non-linear problems.

We first work on the regression of a sine function with Gaussian noise:

In [1]:
import toy_data.sine_wave as swg
import bokeh.io
bokeh.io.output_notebook()

n_samples = 500
sw = swg.SineWave(n_samples=n_samples, frequency=0.5, sigma=2)
swg.visualize_1D_regression(sw)

### Multi-layer perceptrons (feed-forward neural networks)

One way to understand a standard feed-forward neural network is as a chain of linear models.

But if we chain raw linear models together, we will have

$$
y = W_n \ldots W_2 W_1 X = W_{new} X
$$,

which is still a linear model.
In order to achieve non-linearity, we can apply a non-linear function $a$ (an activation function) after each step, then:

$$
y = W_n \ldots a(W_2 a(W_1 X))
$$,
which has non-linear representational power.
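
As a quick numerical check of the two formulas above, here is a minimal sketch in plain Python. The matrices `W1`, `W2`, the input `X`, and the choice of a leaky ReLU for $a$ are assumptions made only for illustration. It shows that two chained linear maps give the same result as one merged matrix, while putting an activation in between does not.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def leaky_relu(M, leak=0.1):
    """Apply max(v, leak * v) to every entry of a matrix."""
    return [[max(v, leak * v) for v in row] for row in M]


# Hypothetical weights and a column-vector input, chosen for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
X = [[1.0], [2.0]]

chained = matmul(W2, matmul(W1, X))                 # W_2 (W_1 X)
W_new = matmul(W2, W1)                              # merged matrix W_2 W_1
merged = matmul(W_new, X)                           # W_new X
nonlinear = matmul(W2, leaky_relu(matmul(W1, X)))   # W_2 a(W_1 X)

print("chained  :", chained)
print("merged   :", merged)
print("nonlinear:", nonlinear)
```

The first two results agree, so the chain of raw linear models is no more expressive than a single one; the third differs because the activation changes the negative intermediate value before the second matrix is applied.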

For a regression problem with Gaussian noise, we have:
$$
y \sim N(f(X), \sigma)
$$,
if we define the whole network as $f(X)$.

Then 
$$
P(x,y\mid f) = P(y\mid f, x)P(x)
$$,
since $P(x)$ does not depend on $f$ (and the samples $x$ are i.i.d.), we have:
$$
\begin{align}
P(x,y\mid f) &\propto P(y\mid f, x) \\
&= \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y - f(x))^2}{2\sigma^2}}
\end{align}
$$.

#### Formulate the cost function by MLE

The likelihood, with the constants $C_1 = \frac{1}{\sqrt{2\pi}\sigma}$ and $C_2 = 2\sigma^2$, is:
$$
\ell(f) = \prod_i{C_1e^{-\frac{(f(x_i)-y_i)^2}{C_2}}}
$$

Maximizing the log-likelihood we get:
$$
\begin{eqnarray*}
  &  & \max_f \log \left( \prod_z C_1 \exp \left(- \frac{( f ( x_z) -
  y_z)^2}{C_2} \right) \right)\\
  & = & \max_f  \sum_z \left( - \frac{( f ( x_z) - y_z)^2}{C_2} + \log C_1 \right)\\
  & = & \max_f  \sum_z  -( f ( x_z) - y_z)^2
\end{eqnarray*}
$$,
which is equivalent to minimizing the squared error, because $C_2 > 0$ and $\log C_1$ does not depend on $f$.
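
To make this equivalence concrete, the sketch below compares a few candidate predictors by Gaussian log-likelihood and by summed squared error. The targets `ys`, the three candidate prediction lists, and the value `sigma=2.0` are made-up illustration data, not taken from the sine-wave set.

```python
import math


def gaussian_log_likelihood(preds, ys, sigma):
    """Sum of log N(y | pred, sigma^2) over all data points."""
    log_c1 = -math.log(math.sqrt(2 * math.pi) * sigma)
    return sum(log_c1 - (y - p) ** 2 / (2 * sigma ** 2)
               for p, y in zip(preds, ys))


def squared_error(preds, ys):
    """Sum of squared differences between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys))


# Hypothetical targets and candidate predictions f(x_z).
ys = [0.1, 0.9, 2.2, 2.8]
candidates = {
    "f_a": [0.0, 1.0, 2.0, 3.0],
    "f_b": [0.5, 0.5, 2.5, 2.5],
    "f_c": [1.0, 1.0, 1.0, 1.0],
}

best_by_likelihood = max(
    candidates,
    key=lambda k: gaussian_log_likelihood(candidates[k], ys, sigma=2.0))
best_by_error = min(
    candidates,
    key=lambda k: squared_error(candidates[k], ys))

print(best_by_likelihood, best_by_error)
```

Both criteria pick the same candidate, since the log-likelihood is a constant minus the squared error divided by a positive constant.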

Although there is no guarantee of reaching the global optimum,
we can still use gradient descent.
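
As a reminder of what gradient descent does, here is a minimal sketch on a one-parameter toy problem: fitting $y = wx$ by minimizing the squared error. The helper `gradient_descent`, the data, and the step size are assumptions for illustration. This toy loss is convex, so it does reach the global optimum, unlike the network loss in general.

```python
def gradient_descent(grad, w0, learning_rate, n_steps):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
    return w


# Toy data generated by y = 2 x (hypothetical, noise-free).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]


def grad(w):
    """Derivative of sum_z (w * x_z - y_z)^2 with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


w_fit = gradient_descent(grad, w0=0.0, learning_rate=0.01, n_steps=200)
print(w_fit)
```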

The parameters we tune in the network are the weights $w_{ij}^{(n)}$,
and we calculate their gradients in the next section.

#### Back-propagation

For the $w_{ij}$ in the top layer ($n=$number of layers),
we expand one dimension of $f$:
$$
f_i(x) = \sum_j w_{ij}^{(n)}a(f^{(n-1)}_j(x))
$$
then for a single data entry $(x, y)$, reusing $\ell = \sum_i ( f_i ( x) - y_i)^2$ to denote its squared error (only the $i$th term depends on $w_{ij}^{(n)}$):
$$
\newcommand{\mathd}{\mathrm{d}}
\newcommand{\nocomma}{}
\newcommand{\noplus}{}
\begin{eqnarray*}
  \frac{\mathd \ell}{\mathd w_{i j}^{( n)}}  & = & \frac{\mathd}{\mathd w_{i j}^{( n)}}  ( f_i ( x) - y_i)^2\\
  & = & 2 ( f_i ( x) - y_i)  \frac{\mathd}{\mathd w_{i j}^{( n)}}  ( f_i ( x) - y_i)\\
  & = & 2 ( f_i ( x) - y_i)  \frac{\mathd}{\mathd w_{i j}^{( n)}}  \sum_{j'} w_{i j'}^{( n)} a ( f_{j'}^{( n - 1)} ( x))\\
  & = & 2 ( f_i ( x) - y_i)  \frac{\mathd}{\mathd w_{i j}^{( n)}}  w_{i j}^{( n)} a ( f_j^{( n - 1)} ( x))\\
  & = & 2 ( f_i ( x) - y_i) a ( f_j^{( n - 1)} ( x))
\end{eqnarray*}
$$
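
The final expression of this derivation (twice the error of output $i$ times the activation feeding weight $w_{ij}$) can be checked numerically. The sketch below compares it with central finite differences for a small top layer. The activations `h`, targets `y`, and weights `W` are hypothetical values, and `h[j]` stands for $a(f_j^{(n-1)}(x))$.

```python
def output(W, h):
    """Top layer: f_i = sum_j W[i][j] * h[j]."""
    return [sum(w * hj for w, hj in zip(row, h)) for row in W]


def loss(W, h, y):
    """Squared error of one data entry, summed over output dimensions."""
    return sum((fi - yi) ** 2 for fi, yi in zip(output(W, h), y))


# Hypothetical values for illustration.
h = [0.2, -0.5, 1.0]                        # a(f_j^{(n-1)}(x))
y = [1.0, -1.0]                             # targets
W = [[0.1, 0.4, -0.3], [0.7, -0.2, 0.5]]    # w_{ij}^{(n)}

f = output(W, h)
analytic = [[2 * (f[i] - y[i]) * h[j] for j in range(len(h))]
            for i in range(len(y))]

eps = 1e-6
numeric = []
for i in range(len(y)):
    row = []
    for j in range(len(h)):
        W_plus = [r[:] for r in W]
        W_minus = [r[:] for r in W]
        W_plus[i][j] += eps
        W_minus[i][j] -= eps
        # Central difference approximation of d loss / d w_ij.
        row.append((loss(W_plus, h, y) - loss(W_minus, h, y)) / (2 * eps))
    numeric.append(row)

max_gap = max(abs(a - n)
              for ra, rn in zip(analytic, numeric)
              for a, n in zip(ra, rn))
print(max_gap)
```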

For any $w_{jk}$ in the $(n-1)$th layer, every output $f_i$ depends on it through unit $j$, so we sum over $i$:
$$
\begin{align*}
  \frac{\mathd \ell}{\mathd w_{j k}^{( n - 1)}} &= \sum_i 2 ( f_i ( x) - y_i) \frac{\mathd}{\mathd w_{j k}^{( n - 1)}} ( f_i ( x) - y_i)\\
  &= \sum_i 2 ( f_i ( x) - y_i) \frac{\mathd}{\mathd w_{j k}^{( n - 1)}} \sum_{j'} w_{i j'}^{( n)} a ( f_{j'}^{( n - 1)} ( x))\\
  &= \sum_i 2 ( f_i ( x) - y_i) w_{i j}^{( n)} \frac{\mathd a ( f_j^{( n - 1)} ( x))}{\mathd f_j^{( n - 1)} ( x)} \frac{\mathd f_j^{( n - 1)} ( x)}{\mathd w_{j k}^{( n - 1)}}\\
  &= \sum_i 2 ( f_i ( x) - y_i) w_{i j}^{( n)} a' ( f_j^{( n - 1)} ( x)) \frac{\mathd}{\mathd w_{j k}^{( n - 1)}} \left( \sum_{k'} w_{j k'}^{( n - 1)} a ( f_{k'}^{( n - 2)} ( x)) \right)\\
  &= \sum_i 2 ( f_i ( x) - y_i) w_{i j}^{( n)} a' ( f_j^{( n - 1)} ( x)) a ( f_k^{( n - 2)} ( x))
\end{align*}
$$

For any earlier layer $k$, the gradients are computed in the same way, by propagating the error terms back through the weights of layer $k+1$ and multiplying by the activation of the unit that feeds the weight.
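
The same kind of numerical check works one layer down. The sketch below assumes the logistic function for $a$ (so $a' = a(1-a)$) and uses made-up weights `W1` for layer $n-1$ and `W2` for layer $n$. The list `z` stands for the incoming activations $a(f_k^{(n-2)}(x))$. It compares the chain-rule gradient, summed over the outputs $i$, with central finite differences.

```python
import math


def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(W1, W2, z):
    """Return the pre-activations of layer n-1 and the network outputs."""
    pre = [sum(w * zk for w, zk in zip(row, z)) for row in W1]    # f_j^{(n-1)}
    act = [logistic(p) for p in pre]                              # a(f_j^{(n-1)})
    out = [sum(w * aj for w, aj in zip(row, act)) for row in W2]  # f_i
    return pre, out


def loss(W1, W2, z, y):
    _, out = forward(W1, W2, z)
    return sum((fi - yi) ** 2 for fi, yi in zip(out, y))


# Hypothetical values for illustration.
z = [0.3, -0.8]                                # a(f_k^{(n-2)}(x))
y = [0.5, -0.2]                                # targets
W1 = [[0.2, -0.4], [0.6, 0.1], [-0.3, 0.5]]    # w_{jk}^{(n-1)}
W2 = [[0.7, -0.1, 0.4], [0.2, 0.3, -0.6]]      # w_{ij}^{(n)}

pre, out = forward(W1, W2, z)


def analytic_grad(j, k):
    """sum_i 2 (f_i - y_i) w_ij a'(f_j) a(f_k), with a' = a (1 - a)."""
    s = logistic(pre[j])
    back = sum(2 * (out[i] - y[i]) * W2[i][j] for i in range(len(y)))
    return back * s * (1.0 - s) * z[k]


def numeric_grad(j, k, eps=1e-6):
    """Central difference approximation of d loss / d w_jk."""
    W_plus = [row[:] for row in W1]
    W_minus = [row[:] for row in W1]
    W_plus[j][k] += eps
    W_minus[j][k] -= eps
    return (loss(W_plus, W2, z, y) - loss(W_minus, W2, z, y)) / (2 * eps)


max_gap = max(abs(analytic_grad(j, k) - numeric_grad(j, k))
              for j in range(len(W1)) for k in range(len(z)))
print(max_gap)
```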

### TensorFlow implementation

#### Network construction

We first build a network of 6 layers (1 input, 4 hidden, 1 output),
with 10 units in each hidden layer.

The code defines both the logistic (sigmoid) function and a leaky ReLU; the hidden layers below use the leaky ReLU as the activation function $a$.

In [2]:
import tensorflow as tf

n_units_l = (1, 10, 10, 10, 10, 1)

io_tf = lambda dim: tf.placeholder(tf.float32, dim)

def tf_logistic(_X):  
    return 1/(1 + tf.exp(-_X))

def tf_leaky_relu(_X, leak=0.1):  
    return tf.maximum(_X, leak*_X)

x = io_tf([None, 1])
y = io_tf([None, 1])

def hidden_layer(_input, n_units):
    n_in = int(_input.get_shape()[1])
    W = tf.Variable(tf.random_uniform([n_in, n_units], minval=-1, maxval=1))
    b = tf.Variable(tf.random_uniform([n_units], minval=-1, maxval=1))
    return tf_leaky_relu(tf.matmul(_input, W) + b), W, b

# Stack the hidden layers, keeping each layer's output, weights and biases.
layer = []
Ws = []
bs = []
_l = x
for n_l in n_units_l[1:-1]:
    _l, _W, _b = hidden_layer(_l, n_l)
    Ws.append(_W)
    bs.append(_b)
    layer.append(_l)

# Linear output layer (no activation) for regression.
W_out = tf.Variable(tf.random_normal([n_units_l[-2], n_units_l[-1]]))
b_out = tf.Variable(tf.random_normal([n_units_l[-1]]))
net = tf.matmul(layer[-1], W_out) + b_out
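
For intuition about the two activation functions defined in this cell, here is a plain-Python sketch with scalar versions. The names `logistic` and `leaky_relu` are illustrative stand-ins and not the TensorFlow ops themselves. The logistic function saturates for large inputs, while the leaky ReLU keeps a small slope for negative inputs.

```python
import math


def logistic(v):
    """Scalar counterpart of 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


def leaky_relu(v, leak=0.1):
    """Scalar counterpart of max(v, leak * v)."""
    return max(v, leak * v)


for v in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(v, round(logistic(v), 4), leaky_relu(v))
```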

#### Define objective

The aim is to maximize the likelihood, that is, to minimize the squared error:

In [3]:
MSE = tf.reduce_mean(tf.squared_difference(net, y))
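
In plain Python, the quantity computed here can be sketched as follows. The helper name `mean_squared_error` and the sample numbers are assumptions for illustration. Taking the mean instead of the sum only rescales the objective by $1/N$, so the minimizer is the same as for the summed squared error derived earlier.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2
               for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical predictions and targets.
mse = mean_squared_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
print(mse)
```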

In [6]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# RMSProp with learning rate 0.01; Adadelta is an alternative:
# train_step = tf.train.AdadeltaOptimizer(0.01).minimize(MSE)
train_step = tf.train.RMSPropOptimizer(0.01).minimize(MSE)

# Replaces the deprecated tf.initialize_all_variables() in TensorFlow 1.x.
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

def regressor1D(x_mesh):
    """Evaluate the network on a 1-D mesh and return a flat array."""
    X = x_mesh.reshape(-1, 1)
    return np.ravel(sess.run(net, feed_dict={x: X}))

n_batches = 5
# Batch size from the actual training-set size, rounded up so no row is skipped.
n_train = sw.tr.X.shape[0]
batch_size = -(-n_train // n_batches)
print(sw.tr.X.shape)
for i in range(3000):
    for j in range(n_batches):
        # Python slices exclude the end index, so no "-1" is needed.
        batch = slice(batch_size * j, batch_size * (j + 1))
        sess.run(train_step, feed_dict={x: sw.tr.X[batch], y: sw.tr.y[batch]})
        # Report the full-training-set MSE and plot the fit every 400 epochs.
        if not i % 400 and j == 1:
            print(sess.run(MSE, feed_dict={x: sw.tr.X, y: sw.tr.y}))
            swg.visualize_1D_regression(sw, regressor1D)

swg.visualize_1D_regression(sw, regressor1D)
sess.close()

(411, 1)
102.45
4.49472
4.35384
4.2521
4.304
4.21675
4.20618
4.15073
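
The training loop above feeds the data in contiguous mini-batches. A minimal sketch of computing such slices so that every row is used exactly once is given below. The helper `batch_slices` is hypothetical, and 411 is the training-set size printed above.

```python
def batch_slices(n_rows, n_batches):
    """Split range(n_rows) into contiguous slices of near-equal size.

    May return fewer than n_batches slices when n_rows is very small.
    """
    batch_size = -(-n_rows // n_batches)  # ceiling division
    return [slice(start, min(start + batch_size, n_rows))
            for start in range(0, n_rows, batch_size)]


slices = batch_slices(411, 5)
covered = sum(s.stop - s.start for s in slices)
print(slices)
print(covered)
```

Because Python slices exclude their end index, consecutive slices share a boundary value without overlapping or dropping a row.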
