# TensorFlow
## Computational Graph
We start with a simple computational graph to show how strikingly similar `numpy` and TensorFlow code can be.

![computational graph](compu_graph.png)

The following is forward prop and gradient computation in `numpy`:

In [1]:
import numpy as np
np.random.seed(0)

N, D = 3, 4

x = np.random.randn(N, D)
y = np.random.randn(N, D)
z = np.random.randn(N, D)

# Forward prop
a = x * y
b = a + z
c = np.sum(b)

# Backward prop
grad_c = 1 # Gradient of c with respect to c 
grad_b = grad_c * np.ones((N, D))
grad_a = grad_b.copy()
grad_z = grad_b.copy() 
grad_x = grad_a * y
grad_y = grad_a * x

print("Gradient x:", grad_x)
print("Gradient y:", grad_y)
print("Gradient z:", grad_z)

Gradient x: [[ 0.76103773  0.12167502  0.44386323  0.33367433]
 [ 1.49407907 -0.20515826  0.3130677  -0.85409574]
 [-2.55298982  0.6536186   0.8644362  -0.74216502]]
Gradient y: [[ 1.76405235  0.40015721  0.97873798  2.2408932 ]
 [ 1.86755799 -0.97727788  0.95008842 -0.15135721]
 [-0.10321885  0.4105985   0.14404357  1.45427351]]
Gradient z: [[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]
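
One way to sanity-check a hand-written backward pass like the one above is to compare it with a numerical gradient. The sketch below assumes the same `x`, `y`, `z` setup; `forward` and `numerical_grad` are hypothetical helpers written for this check, and the step size `eps` is an arbitrary choice.

```python
import numpy as np

np.random.seed(0)
N, D = 3, 4
x = np.random.randn(N, D)
y = np.random.randn(N, D)
z = np.random.randn(N, D)


def forward(x, y, z):
    """Same graph as above: c = sum(x * y + z)."""
    return np.sum(x * y + z)


def numerical_grad(f, arr, eps=1e-5):
    """Central-difference estimate of df/d(arr), one entry at a time."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        f_plus = f()
        arr[idx] = old - eps
        f_minus = f()
        arr[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


f = lambda: forward(x, y, z)
num_grad_x = numerical_grad(f, x)
num_grad_y = numerical_grad(f, y)
num_grad_z = numerical_grad(f, z)

# Analytic gradients from the backward pass: dc/dx = y, dc/dy = x, dc/dz = 1
print(np.allclose(num_grad_x, y, atol=1e-6))
print(np.allclose(num_grad_y, x, atol=1e-6))
print(np.allclose(num_grad_z, np.ones((N, D)), atol=1e-6))
```

If any of the three checks fails, the mistake is almost always in the backward pass rather than in the finite-difference estimate.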


The following is forward prop and gradient computation in `tf`:

In [2]:
import tensorflow as tf
np.random.seed(0)

N, D = 3, 4

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
z = tf.placeholder(tf.float32)

a = x * y
b = a + z 
c = tf.reduce_sum(b)

grad_x, grad_y, grad_z = tf.gradients(c, [x, y, z])

with tf.Session() as sess:
    values = {
        x: np.random.randn(N, D),
        y: np.random.randn(N, D),
        z: np.random.randn(N, D),
    }

    out = sess.run([c, grad_x, grad_y, grad_z], feed_dict=values)
    c_val, grad_x_val, grad_y_val, grad_z_val = out

print("Gradient x:", grad_x_val)
print("Gradient y:", grad_y_val)
print("Gradient z:", grad_z_val)

Gradient x: [[ 0.76103771  0.12167501  0.44386324  0.33367434]
 [ 1.49407911 -0.20515826  0.3130677  -0.85409576]
 [-2.55298972  0.65361857  0.86443621 -0.74216503]]
Gradient y: [[ 1.76405239  0.40015721  0.97873801  2.24089313]
 [ 1.867558   -0.97727787  0.95008844 -0.1513572 ]
 [-0.10321885  0.41059852  0.14404356  1.45427346]]
Gradient z: [[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]
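
The TensorFlow gradients agree with the `numpy` ones only to about seven significant digits. That is expected: the placeholders are `tf.float32`, while `np.random.randn` produces `float64`. Here is a minimal `numpy`-only sketch of the same rounding effect, assuming the same seed as above (the names `grad_x_64`, `grad_x_32` and `rel_err` are just for illustration):

```python
import numpy as np

np.random.seed(0)
N, D = 3, 4
x = np.random.randn(N, D)  # float64, as in the numpy example
y = np.random.randn(N, D)

# Since c = sum(x * y + z), the gradient with respect to x is simply y.
grad_x_64 = y
# Feeding y into a float32 placeholder rounds it to single precision.
grad_x_32 = y.astype(np.float32)

print(grad_x_64.dtype, grad_x_32.dtype)
# Largest relative rounding error; float32 keeps roughly 7 significant digits.
rel_err = np.max(np.abs(grad_x_64 - grad_x_32) / np.abs(grad_x_64))
print(rel_err < 1e-6)
```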


## Silly Neural Network
The following example builds a silly fully connected 2-layer neural network with TensorFlow. As you can see, the code isn't much different from the `numpy` examples above, because many TensorFlow functions mirror their `numpy` counterparts.

In [3]:
N, D, H = 64, 1000, 100

x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.placeholder(tf.float32, shape=(D, H))
w2 = tf.placeholder(tf.float32, shape=(H, D))

h = tf.maximum(tf.matmul(x, w1), 0)
y_pred = tf.matmul(h, w2)
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff ** 2, axis=1))

grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

with tf.Session() as sess:
    values = {
        x: np.random.randn(N, D),
        y: np.random.randn(N, D),
        w1: np.random.randn(D, H),
        w2: np.random.randn(H, D),
    }
    
    out = sess.run([loss, grad_w1, grad_w2], feed_dict=values)
    loss_val, grad_w1_val, grad_w2_val = out

The workflow in `tf` is usually split into two stages. First, we define the computational graph and decide which gradients we are looking for, e.g. `grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])`.

```python
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.placeholder(tf.float32, shape=(D, H))
w2 = tf.placeholder(tf.float32, shape=(H, D))
h = tf.maximum(tf.matmul(x, w1), 0)
y_pred = tf.matmul(h, w2)
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff ** 2, axis=1))
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])
```
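
For comparison, here is a sketch of what `tf.gradients` saves us from writing: the same forward pass in `numpy`, followed by a hand-derived backward pass. The helper `forward` and the intermediate names (`h_pre`, `grad_y_pred`, and so on) are illustrative and not part of the original code.

```python
import numpy as np

np.random.seed(0)
N, D, H = 64, 1000, 100
x = np.random.randn(N, D)
y = np.random.randn(N, D)
w1 = np.random.randn(D, H)
w2 = np.random.randn(H, D)


def forward(w1, w2):
    """Forward pass; returns the loss and the intermediates backprop needs."""
    h_pre = x.dot(w1)
    h = np.maximum(h_pre, 0)  # ReLU
    y_pred = h.dot(w2)
    diff = y_pred - y
    loss = np.mean(np.sum(diff ** 2, axis=1))
    return loss, (h_pre, h, diff)


loss, (h_pre, h, diff) = forward(w1, w2)

# Backward pass, applying the chain rule from the loss back to the weights
grad_y_pred = 2.0 * diff / N           # d loss / d y_pred
grad_w2 = h.T.dot(grad_y_pred)         # shape (H, D)
grad_h = grad_y_pred.dot(w2.T)         # shape (N, H)
grad_h_pre = grad_h * (h_pre > 0)      # ReLU passes gradient where input > 0
grad_w1 = x.T.dot(grad_h_pre)          # shape (D, H)

print(loss, grad_w1.shape, grad_w2.shape)
```

Every line of the backward pass is something TensorFlow derives from the graph automatically.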

Then we run the graph, typically many times, feeding in values and using the returned gradients to update the weights. A single run looks like this:
```python
with tf.Session() as sess:
    values = {
        x: np.random.randn(N, D),
        y: np.random.randn(N, D),
        w1: np.random.randn(D, H),
        w2: np.random.randn(H, D),
    }
    
    out = sess.run([loss, grad_w1, grad_w2], feed_dict=values)
    loss_val, grad_w1_val, grad_w2_val = out
```

Remember that every `placeholder` we define must be fed a value through the keyword argument `feed_dict` of the `run()` function.

## Bottleneck
Notice that vanilla `numpy` only works on the CPU and that the output of `run()` is a `numpy` array. This creates a problem when we run our `tf` code on a GPU: because the weights are fed in through placeholders, every call copies values from CPU to GPU and the results back from GPU to CPU. These copies can become a bottleneck when the weights or the dataset are large.

The solution is to use `Variable` instead of `placeholder` for the weights.

```python
w1 = tf.Variable(tf.random_normal((D, H)))
w2 = tf.Variable(tf.random_normal((H, D)))
``` 

These variables are values that live inside the computational graph and persist throughout training. We then need to specify how to update them on each iteration, i.e. on each call to `sess.run()`.

```python
learning_rate = 1e-5  # step size (same value the optimizer section uses)
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)
```

To actually compute these updates, we group them under a dummy node and tell TensorFlow to evaluate that node in each iteration.

```python
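weight_updates = tf.group(new_w1, new_w2)
with tf.Session() as sess:
    # Variables must be initialized before they are first used.
    sess.run(tf.global_variables_initializer())
    # Only the data is fed now; w1 and w2 live inside the graph.
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
    for t in range(50):
        loss_val, _ = sess.run([loss, weight_updates], feed_dict=values)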
```
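
To make the update rule concrete, here is a `numpy` sketch of the same kind of loop: compute the gradients, then overwrite the weights in place, 50 times. The sizes and `learning_rate` below are smaller, hypothetical values chosen so the sketch runs instantly and stays stable; they are not the ones used above.

```python
import numpy as np

np.random.seed(0)
N, D, H = 16, 10, 5          # hypothetical small sizes
learning_rate = 1e-3         # hypothetical step size for these sizes

x = np.random.randn(N, D)
y = np.random.randn(N, D)
w1 = np.random.randn(D, H)
w2 = np.random.randn(H, D)

losses = []
for t in range(50):
    # Forward pass
    h_pre = x.dot(w1)
    h = np.maximum(h_pre, 0)
    diff = h.dot(w2) - y
    losses.append(np.mean(np.sum(diff ** 2, axis=1)))

    # Backward pass
    grad_y_pred = 2.0 * diff / N
    grad_w2 = h.T.dot(grad_y_pred)
    grad_w1 = x.T.dot(grad_y_pred.dot(w2.T) * (h_pre > 0))

    # Same rule as w.assign(w - learning_rate * grad_w)
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

print(losses[0], losses[-1])
```

The difference in TensorFlow is that `w1` and `w2` never leave the graph, so nothing has to be copied back to `numpy` between steps.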

## TF Optimizer
TensorFlow actually gives us an API to run gradient descent, which lives inside `tf.train`. What it does is basically similar to what's written above: it performs the updates on the `tf.Variable`s, groups them into a dummy node, and executes that node on each iteration.

```python
optimizer = tf.train.GradientDescentOptimizer(1e-5)
weight_updates = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    values = { x: np.random.randn(N, D), y: np.random.randn(N, D) }
    losses = []
    for t in range(50):
        loss_val, _ = sess.run([loss, weight_updates], feed_dict=values)
        losses.append(loss_val)
```
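
Conceptually, an optimizer object just bundles a learning rate with the update rule. The toy class below is a hypothetical `numpy` stand-in to illustrate the idea, not how `tf.train.GradientDescentOptimizer` is implemented; the name `ToyGradientDescent` and the quadratic objective are made up for this sketch.

```python
import numpy as np


class ToyGradientDescent(object):
    """Hypothetical stand-in for an optimizer: stores a step size, applies updates."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def apply_gradients(self, grads_and_vars):
        # Update every parameter array in place: var <- var - lr * grad
        for grad, var in grads_and_vars:
            var -= self.learning_rate * grad


# Minimize f(w) = sum(w ** 2), whose gradient is 2 * w.
w = np.array([1.0, -2.0, 3.0])
w0 = w.copy()
optimizer = ToyGradientDescent(learning_rate=0.1)
for t in range(50):
    optimizer.apply_gradients([(2 * w, w)])

print(w)  # each step scales w by 0.8, so w shrinks toward 0
```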

## TF Loss
If `tf` gives us an optimizer, it must also give us other convenient functions like L1 and L2 losses. That's right, it does!

```python
loss = tf.losses.mean_squared_error(labels=y, predictions=y_pred)
```
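
Note that this loss is scaled differently from the hand-written one. As far as I know, `tf.losses.mean_squared_error` with its default settings averages the squared differences over every element, whereas the earlier loss sums over the `D` outputs first and then averages over the batch, so the two differ by a factor of `D`. A `numpy` sketch of the two reductions, with random stand-ins for `y_pred` and `y`:

```python
import numpy as np

np.random.seed(0)
N, D = 64, 1000
y_pred = np.random.randn(N, D)  # stand-in for the network output
y = np.random.randn(N, D)
diff = y_pred - y

# Hand-written loss from earlier: sum over outputs, then mean over the batch
loss_manual = np.mean(np.sum(diff ** 2, axis=1))
# Plain mean squared error: mean over all N * D elements
loss_mse = np.mean(diff ** 2)

print(np.isclose(loss_manual, D * loss_mse))
```

In practice this means a learning rate tuned for one loss may need rescaling for the other.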

## TF Layers
So far we have omitted all the nitty-gritty details of defining biases and performing Xavier initialization for simplicity. It would take a good amount of code to piece everything together carefully if we were to write it from scratch. And again, `tf` provides everything out of the box for us!

```python 
init = tf.contrib.layers.xavier_initializer()
h = tf.layers.dense(inputs=x, units=H, activation=tf.nn.relu, kernel_initializer=init)
y_pred = tf.layers.dense(inputs=h, units=D, kernel_initializer=init)
loss = tf.losses.mean_squared_error(labels=y, predictions=y_pred)
```
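
To give a sense of what is being handled for us, below is a rough `numpy` sketch of two dense layers with Xavier (Glorot) uniform initialization and zero biases. It assumes the common formula `limit = sqrt(6 / (fan_in + fan_out))`; the helper names `xavier_uniform` and `dense` are made up for this sketch and are not TensorFlow functions.

```python
import numpy as np


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def dense(inputs, weights, bias, activation=None):
    """Affine transform followed by an optional activation."""
    out = inputs.dot(weights) + bias
    return activation(out) if activation is not None else out


rng = np.random.RandomState(0)
N, D, H = 64, 1000, 100
x = rng.randn(N, D)

w1, b1 = xavier_uniform(D, H, rng), np.zeros(H)
w2, b2 = xavier_uniform(H, D, rng), np.zeros(D)

relu = lambda a: np.maximum(a, 0)
h = dense(x, w1, b1, activation=relu)
y_pred = dense(h, w2, b2)

print(h.shape, y_pred.shape)
```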

The `tf.layers` library sets up the architecture for us, so we don't have to create each layer's weights and biases manually. Their initialization is handled right out of the box!

## Keras
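I guess I don't have to write much about this one since it's so popular these days. Keras wraps the layers, the loss, the optimizer, and the training loop behind a high-level API.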


```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

N, D, H = 64, 1000, 100

# Two-layer network: D inputs -> H hidden units (ReLU) -> D outputs
model = Sequential()
model.add(Dense(units=H, input_dim=D))
model.add(Activation('relu'))
model.add(Dense(units=D))

optimizer = SGD(lr=1e0)
model.compile(loss='mean_squared_error', optimizer=optimizer)

# Train on random data, one full-batch update per epoch
x = np.random.randn(N, D)
y = np.random.randn(N, D)
history = model.fit(x, y, epochs=50, batch_size=N, verbose=0)
```
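
Because `batch_size=N` here, every epoch is a single full-batch gradient step, which matches the 50-iteration loops in the earlier sections. A small sketch of the arithmetic, using a hypothetical helper `updates_per_epoch`:

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return int(math.ceil(num_samples / float(batch_size)))


N = 64
print(updates_per_epoch(N, batch_size=N))   # 1
print(updates_per_epoch(N, batch_size=16))  # 4
print(updates_per_epoch(N, batch_size=10))  # 7
```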