In [None]:
import warnings
warnings.filterwarnings('ignore')

#Blocks tutorial
For more information see [documentation](http://blocks.readthedocs.org/en/latest) and [examples](https://github.com/mila-udem/blocks-examples/).

###Note
Blocks is at an early stage of development and changes very fast. You can use a stable [version](https://github.com/mila-udem/blocks/releases/tag/v0.0.1), but the development version has much more functionality.

We appreciate bug reports, pull requests, and constructive criticism, and we are happy to answer questions. The mailing list can be found [here](https://groups.google.com/forum/#!forum/blocks-users).

This tutorial mostly follows the MNIST [example](https://github.com/mila-udem/blocks-examples/blob/master/mnist/__init__.py).

##Bricks

###Introduction

Blocks provides tools that extend Theano. A `Brick` is a parametrized Theano operation, and bricks are used to construct Theano graphs.

Bricks are applied to Theano variables and return Theano variables:

In [None]:
from __future__ import print_function
import theano
from theano import tensor
from blocks.bricks import Linear
x = tensor.matrix('features') # dim: (batch, features)
linear = Linear(input_dim=784, output_dim=10)
y_hat = linear.apply(x)
y_hat = abs(2 * y_hat)
isinstance(y_hat, theano.Variable)

Now we can compile a Theano function:

In [None]:
import numpy
from theano import function
f = function([x], y_hat)
f(numpy.zeros((10, 784)))

The function works as expected, except that every output value is NaN. This is because all shared variables are initialized to NaN when they are allocated. So if your output is NaN, check that you have initialized every brick properly.
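
To see why even an all-zero input produces NaN, here is a minimal plain-Python sketch of the arithmetic `x W + b`. The helper `linear_forward` is hypothetical and is not part of Blocks; it only illustrates that NaN parameters contaminate every output, and that constant parameters (weights 1, biases 0) give a finite result.

```python
import math

def linear_forward(row, W, b):
    """Return row W + b for one input row (hypothetical helper, not Blocks API)."""
    return [sum(row[i] * W[i][j] for i in range(len(row))) + b[j]
            for j in range(len(b))]

x_row = [0.0, 0.0, 0.0]  # an all-zero input row

# Parameters as they are right after allocation: every entry is NaN.
W_nan = [[math.nan, math.nan] for _ in x_row]
b_nan = [math.nan, math.nan]
out_nan = linear_forward(x_row, W_nan, b_nan)

# Parameters after a constant initialization (weights 1, biases 0).
W_one = [[1.0, 1.0] for _ in x_row]
b_zero = [0.0, 0.0]
out_ok = linear_forward(x_row, W_one, b_zero)

print(out_nan)  # every entry is NaN, because 0 * NaN is NaN
print(out_ok)   # finite values once the parameters are real numbers
```

The next cell applies the same fix to the real brick.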

In [None]:
from blocks.initialization import Constant
linear.weights_init = Constant(1.)
linear.biases_init = Constant(0.)
linear.initialize()
f(numpy.zeros((10, 784)))

Every brick has a list of parameters and a list of child bricks:

In [None]:
print(linear.parameters)
print(linear.children)

The `Linear` brick has no children. An `MLP`, by contrast, is a sequence of linear transformations and activations.
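
Conceptually, an MLP with a tanh hidden layer and a softmax output computes `softmax(tanh(x W1 + b1) W2 + b2)`. The plain-Python sketch below spells this out. All function names are hypothetical, the layer sizes `[3, 2, 2]` stand in for `[784, 100, 10]`, and the weight values are made up for illustration.

```python
import math

def affine(row, W, b):
    """Affine map row W + b for one input row."""
    return [sum(row[i] * W[i][j] for i in range(len(row))) + b[j]
            for j in range(len(b))]

def tanh(row):
    return [math.tanh(v) for v in row]

def softmax(row):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(row, layers):
    """Apply (W, b, activation) triples in order."""
    for W, b, activation in layers:
        row = activation(affine(row, W, b))
    return row

# Made-up parameters for a 3 -> 2 -> 2 network.
toy_layers = [
    ([[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]], [0.0, 0.0], tanh),
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], softmax),
]
toy_probs = mlp_forward([1.0, 2.0, 3.0], toy_layers)
print(toy_probs, sum(toy_probs))  # class probabilities that sum to 1
```

In Blocks the same structure is declared once, and the linear transformations and activations become children of the `MLP` brick.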

In [None]:
from blocks.bricks import MLP, Tanh, Softmax
from blocks.initialization import IsotropicGaussian
mlp = MLP([Tanh(), Softmax()], [784, 100, 10],
          weights_init=IsotropicGaussian(0.01),
          biases_init=Constant(0))
mlp.initialize()
probs = mlp.apply(tensor.flatten(x, outdim=2))
mlp.children

Note that activations and costs are also bricks:

In [None]:
from blocks.bricks.cost import CategoricalCrossEntropy
y = tensor.lmatrix('targets')
cost = CategoricalCrossEntropy().apply(y.flatten(), probs)

###Brick lifecycle
The lifecycle of a brick is as follows:

1. **Configuration:** set (part of) the *attributes* of the brick. This can take
   place when the brick object is created, through the arguments of the
   constructor, or later, by setting the attributes of the brick object. No
   Theano variable is created in this phase.

2. **Allocation:** (optional) allocate the Theano shared variables for the
   *parameters* of the brick. When `Brick.allocate` is called, the
   required Theano variables are allocated and initialized to `NaN` by default.

3. **Application:** instantiate a part of the Theano computational graph,
   linking the inputs and the outputs of the brick through its *parameters*
   and according to its *attributes*. This cannot be performed (i.e., it results
   in an error) if the brick object is not fully configured.

4. **Initialization:** set the **numerical values** of the Theano variables
   that store the *parameters* of the brick. The user-provided values
   replace the default initialization value.

####Note
If the Theano variables of the brick object have not been allocated when `Application.apply` is called, Blocks will quietly call `Brick.allocate`.

For details see [this](http://blocks.readthedocs.org/en/latest/bricks_overview.html#bricks-life-cycle) tutorial.
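
The four phases can be mimicked with a toy class. This is a hypothetical stand-in, not how Blocks implements bricks: `ToyBrick` and its attributes are invented for illustration, and its `apply` computes numbers immediately, whereas a real brick builds a symbolic Theano graph that is evaluated later.

```python
import math

class ToyBrick:
    """Hypothetical stand-in for a brick; it only mimics the four phases."""

    def __init__(self, input_dim=None, output_dim=None, weights_init=None):
        # 1. Configuration: store attributes, create no parameters yet.
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.weights_init = weights_init
        self.W = None

    def allocate(self):
        # 2. Allocation: create the parameter storage, filled with NaN.
        if self.input_dim is None or self.output_dim is None:
            raise ValueError("brick is not fully configured")
        self.W = [[math.nan] * self.output_dim for _ in range(self.input_dim)]

    def apply(self, row):
        # 3. Application: allocate quietly if the user has not done so.
        if self.W is None:
            self.allocate()
        return [sum(row[i] * self.W[i][j] for i in range(self.input_dim))
                for j in range(self.output_dim)]

    def initialize(self):
        # 4. Initialization: overwrite the NaN placeholders with real values.
        if self.W is None:
            self.allocate()
        for weights_row in self.W:
            for j in range(self.output_dim):
                weights_row[j] = self.weights_init

toy = ToyBrick(input_dim=2)       # partial configuration at construction
toy.output_dim = 1                # configuration completed later
toy.weights_init = 0.5
before = toy.apply([1.0, 1.0])    # triggers allocation; the output is NaN
toy.initialize()
after = toy.apply([1.0, 1.0])     # parameters now hold 0.5
print(before, after)
```

Applying a `ToyBrick()` with no dimensions set raises an error, which mirrors the rule that application requires a fully configured brick.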

Child bricks get the same initialization scheme as their parent:

In [None]:
mlp.children[0].weights_init

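You can override the scheme of an individual child: push the parent's initialization configuration to the children first, then set the child's own scheme before calling `initialize`: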

In [None]:
mlp.push_initialization_config()
mlp.linear_transformations[0].weights_init = Constant(1.)
mlp.initialize()
mlp.children[0].weights_init

##Graph filtering and modifications
Using brick annotations, one can easily extract variables from the computation graph:

In [None]:
from blocks.graph import ComputationGraph
from blocks.filter import VariableFilter
from blocks.roles import WEIGHT

cg = ComputationGraph([cost])
W1, W2 = VariableFilter(roles=[WEIGHT])(cg.variables)
print("W1 brick:", W1.tag.annotations[0].name)
print("W2 brick:", W2.tag.annotations[0].name)

Now we can apply L2 regularization to both weight matrices.
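
Written out, the regularized cost built in the next cell is the original cost plus a squared-L2 penalty on each weight matrix. The symbols $C$ and $\lambda$ are notation introduced here, with $\lambda$ matching the coefficient `.00005` used in the code:

$$
C_{\text{final}} = C + \lambda \sum_{i,j} (W_1)_{ij}^2 + \lambda \sum_{i,j} (W_2)_{ij}^2,
\qquad \lambda = 5 \times 10^{-5}
$$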

In [None]:
cost = cost + .00005 * (W1 ** 2).sum() + .00005 * (W2 ** 2).sum()
cost.name = 'final_cost'

You can filter by:
- the roles of a variable,
- the bricks and applications that created a variable,
- variable names.

Blocks assigns roles such as `INPUT`, `OUTPUT`, `WEIGHT`, `BIAS`, `PARAMETER`, `AUXILIARY`, etc.

Variable filtering is a very powerful tool. It is typically used to extract variables for monitoring or to apply different types of regularization.
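
The idea of filtering by role can be sketched in plain Python. Everything below is hypothetical (the class names, `ToyVariable`, `filter_by_roles`, and the variable names); the sketch also assumes that roles form a hierarchy in which a weight is a kind of parameter, so that filtering by the more general role returns weights and biases alike.

```python
class Role(object):
    pass

class ParameterRole(Role):
    pass

class WeightRole(ParameterRole):
    pass

class BiasRole(ParameterRole):
    pass

class OutputRole(Role):
    pass

class ToyVariable(object):
    """Hypothetical variable that carries a name and a list of role objects."""
    def __init__(self, name, roles):
        self.name = name
        self.roles = roles

def filter_by_roles(variables, role_classes):
    """Keep variables with at least one role matching any requested class."""
    return [v for v in variables
            if any(isinstance(r, cls) for r in v.roles for cls in role_classes)]

toy_variables = [
    ToyVariable("linear_0.W", [WeightRole()]),
    ToyVariable("linear_0.b", [BiasRole()]),
    ToyVariable("linear_0_output", [OutputRole()]),
]

weights = filter_by_roles(toy_variables, [WeightRole])
parameters = filter_by_roles(toy_variables, [ParameterRole])
print([v.name for v in weights])
print([v.name for v in parameters])
```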

In [None]:
from blocks.graph import apply_dropout
cg_dropout = apply_dropout(cg, [W2], 0.5)

Internally, `apply_dropout` uses `ComputationGraph.replace`, which wraps Theano's `clone` with replacements. It is smarter than the plain Theano function and can apply several replacements at the same time.
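
The arithmetic that dropout performs on a variable can be sketched in plain Python. This assumes the common "inverted dropout" convention, in which surviving values are divided by the keep probability so that the expected value is unchanged; whether `apply_dropout` rescales in exactly this way should be checked against the Blocks documentation. The function name `dropout` and the seed are illustrative choices.

```python
import random

def dropout(values, drop_prob, rng):
    """Zero each value with probability drop_prob and rescale the survivors."""
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]

rng = random.Random(0)            # fixed seed so the run is repeatable
activations = [1.0] * 10000
dropped = dropout(activations, 0.5, rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / float(len(dropped))
mean_after = sum(dropped) / len(dropped)
print(kept_fraction, mean_after)  # roughly half survive; the mean stays near 1
```

The sketch covers only the arithmetic. The graph-replacement mechanism itself is what the next cell demonstrates.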

In [None]:
new_cg = cg.replace({x: x + 1})

##Main loop

The main loop in Blocks manages the training process. All additional functionality is added through extensions.

The main loop fetches data from a data stream, feeds it to a training algorithm, and runs the extensions.

###Algorithms

Currently, there is only one algorithm: `GradientDescent`. It computes gradients and updates the parameters according to a step rule, which depends on the gradients.
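
The division of labour between the algorithm and its step rule can be sketched on a one-dimensional problem. The names `fixed_rate` and `gradient_descent` are hypothetical, and the sketch assumes the usual convention that the step is subtracted from the parameter. The step rule here simply multiplies the gradient by a constant.

```python
def fixed_rate(learning_rate):
    """Step rule that multiplies the gradient by a fixed learning rate."""
    def rule(gradient):
        return learning_rate * gradient
    return rule

def gradient_descent(param, grad_fn, step_rule, n_steps):
    """Repeatedly subtract the step computed from the current gradient."""
    for _ in range(n_steps):
        param = param - step_rule(grad_fn(param))
    return param

# Minimize f(w) = (w - 3)^2, whose gradient is 2 (w - 3).
w_final = gradient_descent(0.0, lambda w: 2.0 * (w - 3.0), fixed_rate(0.1), 100)
print(w_final)  # very close to the minimizer w = 3
```

Swapping in a different `step_rule` changes the optimizer without touching the loop.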

The step rule does most of the work. The simplest step rule is `Scale`, which scales its input. The following code runs SGD with a fixed learning rate of `0.1`.

In [None]:
from blocks.algorithms import GradientDescent, Scale
algorithm = GradientDescent(
    cost=cost, parameters=cg.parameters,
    step_rule=Scale(learning_rate=0.1))

In general, a step rule can be a sequence of several step rules, each applied to the output of the preceding one. Blocks has a set of predefined step rules:
- `Scale` -- scales its input; applied alone to the gradients, it gives plain SGD
- `CompositeRule` -- composes several rules into a sequence
- `Momentum` -- adds momentum (SGD with momentum, but it can be combined with any kind of rule)
- `AdaDelta` -- the AdaDelta adaptive learning rate algorithm
- `RMSProp` -- divides the step by a running root-mean-square of recent gradients
- `Adam` -- adaptive step sizes based on running averages of the gradient and its square
- `StepClipping` -- clips the step; it performs gradient clipping if applied before other rules and step clipping if applied after them
- and others, see the [documentation](http://blocks.readthedocs.org/en/latest/api/algorithms.html)
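
Why the order of rules matters can be sketched with plain functions. The helpers below are hypothetical and only imitate the behaviour described in the list. In particular, `step_clipping` assumes that clipping rescales the whole step so that its L2 norm does not exceed the threshold.

```python
import math

def scale(learning_rate):
    """Rule that multiplies every component of the step by a constant."""
    def rule(step):
        return [learning_rate * s for s in step]
    return rule

def step_clipping(threshold):
    """Rule that rescales the step so its L2 norm is at most threshold."""
    def rule(step):
        norm = math.sqrt(sum(s * s for s in step))
        if norm <= threshold:
            return list(step)
        return [s * threshold / norm for s in step]
    return rule

def composite(rules):
    """Feed the output of each rule into the next one."""
    def rule(step):
        for r in rules:
            step = r(step)
        return step
    return rule

gradient = [3.0, 4.0]  # L2 norm is 5

clip_then_scale = composite([step_clipping(1.0), scale(0.1)])
scale_then_clip = composite([scale(0.1), step_clipping(1.0)])

step_a = clip_then_scale(gradient)  # gradient clipped to norm 1, then scaled
step_b = scale_then_clip(gradient)  # scaled step has norm 0.5, so no clipping
print(step_a, step_b)
```

Clipping first bounds the gradient itself, while clipping last bounds the final update, so the two orders give different steps for the same gradient.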

Here is an example of step composition:

In [None]:
from blocks.algorithms import CompositeRule, StepClipping
gradient_clipping = CompositeRule([StepClipping(threshold=1.0), Scale(learning_rate=0.1)])

For the next example we'll need the MNIST dataset from Fuel:

In [None]:
from fuel.datasets.mnist import MNIST
mnist_train = MNIST(("train",))
mnist_test = MNIST(("test",))
mnist_train.sources

Note that we deliberately gave our input variables `x` and `y` the same names as the dataset sources. The main loop finds inputs by name and feeds them the corresponding dataset sources.
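
Matching by name can be pictured as a dictionary lookup. The sketch below is hypothetical and does not reproduce the Blocks implementation: `feed_by_name` and the toy batch are invented to show why an input named `'features'` receives the `'features'` source, and what goes wrong when a name has no matching source.

```python
def feed_by_name(batch, input_names):
    """Return the batch entries requested by the model inputs, in input order."""
    missing = [name for name in input_names if name not in batch]
    if missing:
        raise KeyError("no data source for inputs: {}".format(missing))
    return [batch[name] for name in input_names]

# A made-up batch keyed by source name, as a data stream might provide it.
toy_batch = {"features": [[0.0, 1.0], [1.0, 0.0]], "targets": [[1], [0]]}

ordered = feed_by_name(toy_batch, ["features", "targets"])
print(ordered)
```

Asking for an input name such as `"labels"` that is not among the sources raises an error in this sketch.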

In [None]:
from blocks.main_loop import MainLoop
from blocks.model import Model
from blocks.extensions import FinishAfter, Printing
from fuel.streams import DataStream
from fuel.schemes import SequentialScheme
from fuel.transformers import Flatten

train_stream = Flatten(
    DataStream.default_stream(
        mnist_train,
        iteration_scheme=SequentialScheme(
            mnist_train.num_examples, 50)),
    which_sources=('features',))
main_loop = MainLoop(
        algorithm,
        train_stream,
        model=Model(cost),
        extensions=[FinishAfter(after_n_batches=5), Printing()])

main_loop.run()

This simple example uses Fuel to construct a data stream and trains for `5` batches (iterations).
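
The interplay between the loop, the algorithm, and the extensions can be sketched in a few lines of plain Python. All names here are hypothetical (`toy_main_loop`, `FinishAfterBatches`, `Recorder`, and the `state` dictionary), and the real `MainLoop` offers many more hooks. The sketch only shows the pattern of fetching a batch, processing it, and then letting each extension react.

```python
class FinishAfterBatches(object):
    """Hypothetical extension that requests a stop after n batches."""
    def __init__(self, n):
        self.n = n

    def after_batch(self, state):
        if state["batches_done"] >= self.n:
            state["stop"] = True

class Recorder(object):
    """Hypothetical extension that records the cost of every batch."""
    def __init__(self):
        self.costs = []

    def after_batch(self, state):
        self.costs.append(state["last_cost"])

def toy_main_loop(process_batch, stream, extensions):
    """Fetch batches, process each one, then let every extension react."""
    state = {"batches_done": 0, "last_cost": None, "stop": False}
    for batch in stream:
        state["last_cost"] = process_batch(batch)
        state["batches_done"] += 1
        for extension in extensions:
            extension.after_batch(state)
        if state["stop"]:
            break
    return state

recorder = Recorder()
final_state = toy_main_loop(
    # Stand-in for a training step: the "cost" is just the batch mean.
    process_batch=lambda batch: sum(batch) / len(batch),
    # 100 made-up batches of 50 numbers each.
    stream=([float(i)] * 50 for i in range(100)),
    extensions=[FinishAfterBatches(5), recorder])
print(final_state["batches_done"], recorder.costs)
```

Although the stream holds 100 batches, the stopping extension ends the loop after five, in the same spirit as `FinishAfter(after_n_batches=5)` above.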

###Logging

###Monitoring

###Serialization

##Exercises

Clone the blocks-examples [repository](https://github.com/mila-udem/blocks-examples) and open the MNIST [example](https://github.com/mila-udem/blocks-examples/blob/master/mnist/__init__.py).

###1 Use other activation
Change the activation in the MLP from tanh to ReLU and add one more layer. You will need to find the ReLU brick and import it. Run the example for several iterations.

###2 Apply dropout
Apply dropout with 0.8 drop probability to the input and with 0.5 to all other layers.

###3 Logging
Save the log separately (see the documentation for how to do this). Start a separate Python notebook and unpickle the log. Install pandas and convert the log to a pandas `DataFrame` (see the log documentation).