# Introduction to TensorFlow

View this tutorial at: [https://nbviewer.jupyter.org/github/berkeley-cocosci/tensorflow-tutorial/blob/master/Gaussians.ipynb](https://nbviewer.jupyter.org/github/berkeley-cocosci/tensorflow-tutorial/blob/master/Gaussians.ipynb)

In this tutorial, we'll walk through the basics of TensorFlow and go through a simple example of fitting a Gaussian distribution to some data.

## Setup

I recommend installing TensorFlow inside a conda environment. If you don't already have conda or Anaconda installed, you can install it quickly using Miniconda: http://conda.pydata.org/miniconda.html

After you have conda installed, run the following commands:

```bash
# create and activate the conda environment
conda create -n tensorflow python=3 numpy scipy matplotlib ipython
source activate tensorflow

# install the IPython kernel for the Jupyter notebook
ipython kernel install --user --name=tensorflow --display-name=tensorflow

# install tensorflow
conda install -c conda-forge tensorflow
```

You should now be able to open this notebook in Jupyter. Make sure you are running the right kernel by selecting "Kernel > Change kernel > tensorflow" from the menubar.

Note that this installation does not include GPU support, which you only need if you're running on a machine that has a good GPU. For our lab, that will mean `theano`, though it doesn't have a GPU yet.

## TensorFlow basics

First, we'll start off by defining some of the basic terminology surrounding TensorFlow:

**Tensors** -- these are like arrays in NumPy, but they are internal to the TensorFlow graph.

**Ops** -- these are operations that take tensors, manipulate them somehow, and produce a new tensor. For example, the `mul` op would take two tensors and multiply them together, producing a third tensor.

**Variables** -- variables are a special type of tensor that have values that are independent of the rest of the graph (in other words, they are not the product of an operation). Variables are what we care about in terms of being able to learn parameters -- the parameters that we want to learn are stored in variables, and we compute gradients with respect to variables.

**Graph** -- the graph is what defines the set of computations that will occur: which ops operate over which tensors and variables, which tensors are produced, etc. When we create a tensor or use an operation, those pieces are automatically added to the graph definition; however, *they are not computed, they are only defined*. You can think of the graph like a file containing code: the code is only the definition of the computation that will be performed.


**Session** -- the session is what actually performs the computation. You can think of it like a programming language interpreter (e.g. the Python interpreter). You tell the session to run operations by calling `session.run()`. Like an interpreter, the session maintains state, so if the values of any variables change, they will persist across different calls to `session.run()`.
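To make the split between defining and running concrete, here is a toy sketch in plain Python. The classes `Variable`, `Mul`, `Assign`, and `Session` below are hypothetical stand-ins written for this illustration; they are not TensorFlow's API, and unlike TensorFlow they initialize variables lazily instead of through an explicit initializer op.

```python
# Toy illustration only: hypothetical classes, not TensorFlow's API.

class Variable:
    """A graph node whose value is stored by the session between runs."""
    def __init__(self, initial_value):
        self.initial_value = initial_value

class Mul:
    """A graph node that describes a multiplication without performing it."""
    def __init__(self, a, b):
        self.inputs = (a, b)

class Assign:
    """A graph node that, when run, stores the value of `source` in `target`."""
    def __init__(self, target, source):
        self.target = target
        self.source = source

class Session:
    """Evaluates nodes on request and keeps variable values between calls."""
    def __init__(self):
        self.values = {}  # variable -> current value

    def run(self, node):
        if isinstance(node, Variable):
            return self.values.setdefault(node, node.initial_value)
        if isinstance(node, Mul):
            a, b = (self.run(n) for n in node.inputs)
            return a * b
        if isinstance(node, Assign):
            self.values[node.target] = self.run(node.source)
            return self.values[node.target]
        return node  # plain Python numbers act as constants

# Defining the graph: no arithmetic happens on these three lines.
w = Variable(3.0)
doubled = Mul(w, 2.0)
update = Assign(w, doubled)

# Running the graph: the session does the work and remembers the new value of w.
sess = Session()
first = sess.run(doubled)   # uses the initial value of w
sess.run(update)            # overwrites w with the doubled value
second = sess.run(doubled)  # sees the updated w
print(first, second)
```

The second call to `sess.run(doubled)` returns a different number from the first, because the assignment in between changed the state held by the session.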

## Constructing the graph

In [1]:
import tensorflow as tf
import numpy as np
import tfutils
import scipy.stats

%load_ext autoreload
%autoreload 2

For our first example, we'll construct a TensorFlow graph that computes the log probability of some data under a Gaussian distribution. To start, we can construct a custom op that computes the Gaussian log probability:

In [2]:
def gaussian_logpdf(X, mu, sigma, name=None):
    """Construct a TensorFlow op for a Gaussian log probability."""
    with tf.op_scope([X, mu, sigma], name, "gaussian_logpdf"):
        Z = -0.5 * tf.log(2 * np.pi * sigma ** 2)
        e = -(X - mu) ** 2 / (2 * sigma ** 2)
        logpdf = Z + e
        return logpdf
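The op above implements log N(x | mu, sigma) = -0.5 * log(2 * pi * sigma^2) - (x - mu)^2 / (2 * sigma^2). As a quick sanity check of that formula outside TensorFlow, the following sketch rewrites it with the standard library and compares it to the log of the density returned by `statistics.NormalDist`. The values of `x`, `mu`, and `sigma` are arbitrary and chosen only for illustration.

```python
import math
from statistics import NormalDist

def gaussian_logpdf(x, mu, sigma):
    """Plain-Python version of the formula used in the TensorFlow op."""
    z = -0.5 * math.log(2 * math.pi * sigma ** 2)  # log of the normalizing constant
    e = -(x - mu) ** 2 / (2 * sigma ** 2)          # log of the exponential term
    return z + e

# Arbitrary illustrative values (not taken from the tutorial's data).
x, mu, sigma = 1.5, 0.7, 0.3

from_formula = gaussian_logpdf(x, mu, sigma)
from_density = math.log(NormalDist(mu, sigma).pdf(x))
print(from_formula, from_density)
```

The two printed numbers should agree up to floating-point rounding.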

We also need initial values for our estimates of the parameters `mu` and `sigma`:

In [3]:
np.random.seed(123)
initial_mu = np.random.rand()
initial_sigma = np.random.rand()
print("Initial: mu={:.4f}, sigma={:.4f}".format(initial_mu, initial_sigma))

Initial: mu=0.6965, sigma=0.2861


Now, we are ready to start constructing the TensorFlow graph! We are going to create (1) a placeholder tensor for the data, (2) variables to hold the parameters mu and sigma, (3) the op to compute the log probability of the data given the parameters, and (4) a special TensorFlow operation to initialize the values of the variables:

In [4]:
# create a placeholder for the input to our Gaussian distribution. a
# placeholder is a special type of Tensor that is only given a value
# at runtime by the user (i.e., it is not the product of an op).
X = tf.placeholder(tf.float32, shape=(None, 1), name="X")

# create trainable variables for the parameters of the Gaussian distribution
mu = tf.Variable(initial_value=initial_mu, name="mu")
sigma = tf.Variable(initial_value=initial_sigma, name="sigma")

# compute the log probability of X given mu and sigma
logp = tf.reduce_sum(gaussian_logpdf(X, mu, sigma), name="logp")

# when we run this op, it will actually assign initial values to all the variables
init = tf.initialize_all_variables()

TensorFlow comes with some really nice visualization capabilities. Here, we can interactively peek at what the graph looks like. Notice that the graph includes variables for `mu` and `sigma`, and that those variables go into the op for `gaussian_logpdf`, which also takes the input tensor `X`. Try double-clicking on `gaussian_logpdf` and you'll be able to see the flow of operations taking place within that scope!

In [5]:
tfutils.show_graph(tf.get_default_graph())

## Doing a forward pass

Ok, now that we've defined our graph, let's actually do some computation with it. Remember that we have so far only defined a placeholder tensor for our data -- so we need to externally generate some data first:

In [6]:
true_mu = 6.9647
true_sigma = 2.8614
print("True:    mu={:.4f}, sigma={:.4f}".format(true_mu, true_sigma))

def sample_X():
    X_vals = np.random.normal(true_mu, true_sigma, (1000, 1))
    return X_vals

X_vals = sample_X()
X_vals[:10]

True:    mu=6.9647, sigma=2.8614


array([[ 11.9388723 ],
       [  8.12772775],
       [  7.8863743 ],
       [  6.81728719],
       [  6.38039936],
       [ 12.62840761],
       [  2.33123479],
       [  3.77720222],
       [  5.68439312],
       [ 11.73866437]])

To evaluate the log probability of these values, we create a `Session` and then call the `run()` method. Note that we pass the `feed_dict` keyword argument a dictionary where the keys are the actual tensors, and the values are the values we want to use for those tensors:

In [7]:
with tf.Session() as sess:
    sess.run(init)
    logp_val = sess.run(logp, feed_dict={X: X_vals})

logp_val

-275808.19

We can verify that we're getting the correct values from TensorFlow by comparing them to what SciPy says the log probability is:

In [8]:
scipy.stats.norm.logpdf(X_vals, initial_mu, initial_sigma).sum()

-275808.22153962194
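The two results differ slightly in the last digits. A plausible explanation is precision: the graph works in `float32`, while SciPy works in `float64`. The sketch below uses only the standard library to look at how finely `float32` can resolve numbers of this magnitude. It assumes that the printed `-275808.19` is a rounded display of a `float32` value; the helper names are made up for this example.

```python
import struct

def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def float32_spacing(x):
    """Gap between a positive float32 value and the next larger float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    next_up = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return next_up - to_float32(x)

# Magnitudes of the two results printed above.
scipy_value = 275808.22153962194
tf_printed = 275808.19

spacing = float32_spacing(scipy_value)
scipy_as_float32 = to_float32(scipy_value)
tf_as_float32 = to_float32(tf_printed)

print("float32 spacing near this value:", spacing)
print("SciPy result rounded to float32:", scipy_as_float32)
print("TensorFlow result as float32:   ", tf_as_float32)
print("difference:", scipy_as_float32 - tf_as_float32)
```

Under that assumption, the two results are neighbouring `float32` values, so the disagreement is the size of a single rounding step and not a sign of a bug in the op.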

## Computing gradients

One of the really awesome things about TensorFlow is that it knows how to compute the gradient for (almost) every one of its ops. Thus, if we want to compute the gradient of some quantity with respect to variables in the graph, all we have to do is call `tf.gradients`, which adds the necessary gradient ops to the graph:

In [9]:
# compute gradients of the negative log probability with respect to the parameters.
# note that we are using negative log probability here since when we do optimization
# it will be doing minimization, rather than maximization
grads = tf.gradients(-logp, [mu, sigma])
grads

[<tf.Tensor 'gradients/gaussian_logpdf/sub_grad/Reshape_1:0' shape=() dtype=float32>,
 <tf.Tensor 'gradients/AddN:0' shape=() dtype=float32>]

Now, when we inspect the graph, you'll notice that there is a new `gradients` op. If you open it up, you'll see what is almost a mirror image of our original graph, but where all the ops have names ending in `_grad`. Importantly, a key thing to take away here is that *computing gradients is just another operation*! There is absolutely nothing special about it; it is just that TensorFlow makes it incredibly easy to do because it knows about the derivatives of all its ops:

In [10]:
tfutils.show_graph(tf.get_default_graph())

Continuing with the point that computing gradients is just another operation, we can also compute the gradients and look at their values:

In [11]:
with tf.Session() as sess:
    sess.run(init)
    grad_vals = sess.run(grads, feed_dict={X: X_vals})

print("True:      mu={:.4f}, sigma={:.4f}".format(true_mu, true_sigma))
print("Initial:   mu={:.4f}, sigma={:.4f}".format(initial_mu, initial_sigma))
print("Gradients: mu={:.4f}, sigma={:.4f}".format(*grad_vals))

True:      mu=6.9647, sigma=2.8614
Initial:   mu=0.6965, sigma=0.2861
Gradients: mu=-74570.2734, sigma=-1926617.8750
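For this model the gradients can also be written down by hand: the derivative of the negative log probability with respect to `mu` is `-sum(x - mu) / sigma**2`, and with respect to `sigma` it is `N / sigma - sum((x - mu)**2) / sigma**3`. The sketch below, using only the standard library, checks these expressions against central finite differences. It draws its own sample with a fixed seed (an assumption made for reproducibility), so the numbers will not match the notebook output exactly, but the signs and orders of magnitude should be similar.

```python
import math
import random

def neg_logp(xs, mu, sigma):
    """Negative Gaussian log probability of the data, summed over samples."""
    return -sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
                - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

def analytic_grads(xs, mu, sigma):
    """Hand-derived gradients of neg_logp with respect to mu and sigma."""
    d_mu = -sum(x - mu for x in xs) / sigma ** 2
    d_sigma = len(xs) / sigma - sum((x - mu) ** 2 for x in xs) / sigma ** 3
    return d_mu, d_sigma

def numeric_grads(xs, mu, sigma, h=1e-6):
    """Central finite-difference approximation of the same gradients."""
    d_mu = (neg_logp(xs, mu + h, sigma) - neg_logp(xs, mu - h, sigma)) / (2 * h)
    d_sigma = (neg_logp(xs, mu, sigma + h) - neg_logp(xs, mu, sigma - h)) / (2 * h)
    return d_mu, d_sigma

# Same true and initial parameter values as in the tutorial, but a fresh sample.
rng = random.Random(0)
xs = [rng.gauss(6.9647, 2.8614) for _ in range(1000)]

exact = analytic_grads(xs, 0.6965, 0.2861)
approx = numeric_grads(xs, 0.6965, 0.2861)
print("analytic:", exact)
print("numeric: ", approx)
```

Both gradients are negative here, which says that increasing `mu` and `sigma` from their initial values will decrease the negative log probability.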


## Optimization

Ok, we are now ready to try to optimize the parameters of our Gaussian. We'll use the `AdamOptimizer` here as it is much more efficient than regular gradient descent (if you want to try it out yourself, change the optimizer to `tf.train.GradientDescentOptimizer` and set the learning rate to something small, like 0.0001):

In [12]:
optimizer = tf.train.AdamOptimizer(0.1)
train = optimizer.apply_gradients(zip(grads, [mu, sigma]))

# we need to recreate the initializer, because the Adam optimizer creates
# additional variables
init = tf.initialize_all_variables()

Like computing gradients, the optimizer is just another set of operations in our graph. You should now see a new node called `Adam`, along with a couple of new variables (`beta1_power` and `beta2_power`) that the Adam optimizer uses:

In [13]:
tfutils.show_graph(tf.get_default_graph())
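To get a feel for what the `Adam` node does, here is a sketch of a single Adam update in plain Python, following the standard published update rule with the usual default hyperparameters (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`). It is an illustration, not TensorFlow's exact implementation, and the function name is made up. In this formulation, the powers of `beta1` and `beta2` are what the bias-correction terms need, which is presumably what the `beta1_power` and `beta2_power` variables track. The gradient values are the ones printed in the gradient cell above.

```python
def adam_first_step(param, grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update, starting from zero-initialized moment estimates."""
    m = (1 - beta1) * grad        # first-moment estimate after step 1
    v = (1 - beta2) * grad ** 2   # second-moment estimate after step 1
    m_hat = m / (1 - beta1 ** 1)  # bias correction with t = 1
    v_hat = v / (1 - beta2 ** 1)
    return param - lr * m_hat / (v_hat ** 0.5 + eps)

# Initial parameters and gradients as printed earlier in the tutorial.
new_mu = adam_first_step(0.6965, -74570.2734)
new_sigma = adam_first_step(0.2861, -1926617.8750)
print("mu after one step:    {:.4f}".format(new_mu))
print("sigma after one step: {:.4f}".format(new_sigma))
```

On the first step the bias-corrected ratio reduces to the sign of the gradient, so each parameter moves by almost exactly the learning rate (0.1) no matter how large the gradient is. This is one reason Adam copes with the huge initial gradients here, where plain gradient descent needs a very small learning rate.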

We can now construct our training loop. We'll run for 1000 optimization steps, and compute at each step the log probability of the data, the current value of `mu`, the current value of `sigma`, and the actual training operation. We'll print out the status of the training after every 100 steps:

In [14]:
with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        ops_to_run = {
            "logp": logp,
            "mu": mu,
            "sigma": sigma,
            "train": train
        }
        results = sess.run(ops_to_run, feed_dict={X: sample_X()})
        if (i % 100) == 0:
            print("log p(X | mu={mu:.4f}, sigma={sigma:.4f}) = {logp:.4f}".format(**results))

log p(X | mu=0.7965, sigma=0.3861) = -294427.0000
log p(X | mu=3.4322, sigma=1.6038) = -5360.1357
log p(X | mu=4.7522, sigma=1.7714) = -3619.7363
log p(X | mu=5.6646, sigma=1.8733) = -2916.5569
log p(X | mu=6.2374, sigma=1.9476) = -2746.0251
log p(X | mu=6.5817, sigma=2.0106) = -2660.7000
log p(X | mu=6.7746, sigma=2.0696) = -2646.9377
log p(X | mu=6.8673, sigma=2.1254) = -2564.0066
log p(X | mu=6.9186, sigma=2.1784) = -2563.6431
log p(X | mu=6.9543, sigma=2.2289) = -2568.0984


Let's compare this to what we'd get if we computed the log probability under the true values of mu and sigma:

In [15]:
with tf.Session() as sess:
    sess.run(init)
    logp_val = sess.run(logp, feed_dict={X: sample_X(), mu: true_mu, sigma: true_sigma})
    print("log p(X | mu={mu:.4f}, sigma={sigma:.4f}) = {logp:.4f}".format(
            mu=true_mu, sigma=true_sigma, logp=logp_val))

log p(X | mu=6.9647, sigma=2.8614) = -2449.9692
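In the training log, `sigma` is still noticeably below its true value after 1000 steps, so the optimizer has not fully converged. For a Gaussian we can cross-check where it should end up, because the maximum-likelihood estimates have a closed form: the sample mean and the population standard deviation (dividing by N). The sketch below uses only the standard library and draws its own seeded sample, so the seed and the resulting numbers are assumptions of this example and not values from the notebook.

```python
import math
import random
import statistics

def logp(xs, mu, sigma):
    """Gaussian log probability of the data, summed over samples."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

true_mu, true_sigma = 6.9647, 2.8614
rng = random.Random(0)
xs = [rng.gauss(true_mu, true_sigma) for _ in range(1000)]

# Closed-form maximum-likelihood estimates for a Gaussian.
mle_mu = statistics.fmean(xs)
mle_sigma = statistics.pstdev(xs)  # population standard deviation (divides by N)

logp_mle = logp(xs, mle_mu, mle_sigma)
logp_true = logp(xs, true_mu, true_sigma)
print("MLE:  mu={:.4f}, sigma={:.4f}, logp={:.4f}".format(mle_mu, mle_sigma, logp_mle))
print("True: mu={:.4f}, sigma={:.4f}, logp={:.4f}".format(true_mu, true_sigma, logp_true))
```

On any given sample, the log probability at the maximum-likelihood estimates is at least as high as at the true parameters, so a fully converged optimizer run on a fixed sample should reach a value at or slightly above the "true" one.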


### A note on running operations in the session

When we tell the session to run something, it will *only* run what we tell it to, plus whatever operations the given ops depend on. So, if we told the session to just run our `logp` op, *it wouldn't run the training step or compute gradients* because `logp` doesn't depend on those operations. That's why we have to explicitly pass the `train` operation to the session if we want it to update the variables.
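The following toy sketch shows the same idea in plain Python: evaluating a node only touches the nodes it depends on. The dictionary-based graph, the `run` helper, and the simplified chain of dependencies are all hypothetical and only loosely mirror the graph built in this tutorial.

```python
# Toy dependency graph: name -> (function, names of its inputs).
# Hypothetical structure for illustration; not how TensorFlow stores its graph.
graph = {
    "X": (lambda: [1.0, 2.0, 3.0], []),
    "mu": (lambda: 2.0, []),
    "logp": (lambda X, mu: -sum((x - mu) ** 2 for x in X), ["X", "mu"]),
    "grads": (lambda logp: "gradients computed", ["logp"]),
    "train": (lambda grads: "variables updated", ["grads"]),
}

def run(fetch, evaluated):
    """Evaluate `fetch` and its dependencies, recording each node that runs."""
    func, deps = graph[fetch]
    args = [run(dep, evaluated) for dep in deps]
    evaluated.append(fetch)
    return func(*args)

evaluated_for_logp = []
run("logp", evaluated_for_logp)
print(evaluated_for_logp)   # nodes needed for logp only

evaluated_for_train = []
run("train", evaluated_for_train)
print(evaluated_for_train)  # now the gradient and training nodes run as well
```

Fetching `logp` never reaches the `grads` or `train` nodes, while fetching `train` pulls in everything upstream of it.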