In [None]:
from __future__ import print_function

import numpy
import theano
from theano import config
from theano import tensor

# Blocks tutorial
For more information see [documentation](http://blocks.readthedocs.org/en/latest) and [examples](https://github.com/mila-udem/blocks-examples/).

### Note
Blocks is at an early stage of development, and its interface is subject to regular change. You can use the [stable version](https://github.com/mila-udem/blocks/releases/tag/v0.0.1), but the development version will always have more functionality.

We appreciate bug reports, pull requests and constructive criticism, and we are happy to answer questions. The mailing list can be found [here](https://groups.google.com/forum/#!forum/blocks-users).

## Bricks

### Basic Concepts

_Bricks_ are "parametrized ops". They can build Theano graphs given input variables.
The process is called _application of a brick_. Let us create a simple ``Linear`` brick and apply it:

In [None]:
from blocks.bricks import Linear
x = tensor.matrix('features') # dimensions: (batch, features)
linear = Linear(input_dim=784, output_dim=10)
# Applying a brick yields a (several) Theano variable(s)
linear_output = linear.apply(x)
print(type(linear_output))
# With the returned variable you can do whatever 
# you usually do with Theano variables
y_hat = abs(2 * linear_output)

In the example above ``linear.apply`` is an _application method_. A brick might have several application methods, see e.g. ``Softmax``. 

In [None]:
from blocks.bricks import Softmax
# The application methods are decorated by `@application` decorator in the code
# and in fact are somewhat more complex than usual python functions:
print(Softmax.categorical_cross_entropy)
print(Softmax.apply)
print(Softmax.__init__)

Let's take a step back and look at the computation graph we have built to check that it matches our expectations:

In [None]:
theano.printing.debugprint(y_hat)

We can see that $\hat{y} = |2(xW + b)|$, as expected. Note that the parameters $W$ and $b$ were created automatically by the ``linear`` object. They are also accessible as ``linear.parameters``:

In [None]:
linear.parameters

##### Annotated graphs

You might wonder why the debugprint above contains two ``Elemwise{identity}`` nodes. This is because bricks build  _annotated_ computation graphs. Variables created by bricks are heavily tagged:

In [None]:
linear_output.tag

Bricks insert special variables in the computation graph that carry these tags. This is done by calling the ``TensorVariable.copy`` method, which works as follows:

In [None]:
x_copy = x.copy(name='x_copy')
# The new variable is an output of a new Elemwise{identity} node
theano.printing.debugprint(x_copy)

That's why you have so many ``Elemwise{identity}`` nodes in your graphs created by bricks.

Let's go through the content of the tags added by bricks. For one, each variable in addition to a Theano-level name gets a brick-level name:

In [None]:
print(linear_output.tag.name)
print(linear_output.owner.inputs[0].owner.inputs[0].owner.inputs[0].tag.name)

Furthermore, each variable, including parameters, is assigned a (several) role(s):

In [None]:
print(linear.parameters[0].tag.roles)
print(linear.parameters[1].tag.roles)
print(linear_output.tag.roles)
print(linear_output.owner.inputs[0].owner.inputs[0].owner.inputs[0].tag.roles)

Finally, the ``annotations`` attribute of the tag contains references to the brick that created the variable and to the ``ApplicationCall`` object. For now, all we need to know about the latter is that it was created when ``linear.apply`` was called and that it refers by name to the application method that created it.

In [None]:
print(linear_output.tag.annotations)
print(linear_output.tag.annotations[0] == linear)
print(linear_output.tag.annotations[1].application.application.name)

To make exploration of the computation graph easier, variables created by bricks are assigned names composed from the name of the brick, the name of the application method and the local (``.tag.name``) variable name:

In [None]:
print(linear_output.name)

You can see these names when you use `theano.printing.debugprint`.

##### Lazy initialization

Finally, let's try to do a forward pass using the computation graph of $\hat{y}$:

In [None]:
f = theano.function([x], y_hat)
f(numpy.ones((10, 784), dtype=config.floatX))

Oops! Not so surprising, because we did not tell ``linear`` how to initialize its parameters!

In [None]:
from blocks.initialization import Constant
linear.weights_init = Constant(2.)
linear.biases_init = Constant(0.)
linear.initialize()
f(numpy.ones((3, 784), dtype=config.floatX))

Now it works. Alternatively, you could pass the ``weights_init`` and ``biases_init`` settings at construction time.

In [None]:
linear = Linear(input_dim=784, output_dim=10, 
                weights_init=Constant(2), biases_init=Constant(0))
linear.initialize()
f = theano.function([x], linear.apply(x))
f(numpy.ones((3, 784), dtype=config.floatX))

We deliberately allow you to create the brick first and set the initialization schemes later. This is called _lazy initialization_ and is implemented by decorating all brick constructors with `@lazy`. It is important to explicitly call the ``.initialize()`` method if you train from scratch; you can skip it when you continue training from loaded parameters.
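
To make the idea concrete, here is a minimal pure-Python sketch of lazy initialization. It is not how Blocks implements `@lazy`: `LazyLinear` and `constant` are hypothetical names invented for this illustration, and raising an error when a scheme is missing is an assumption of the sketch.

```python
class LazyLinear(object):
    """Toy stand-in for a brick whose initialization schemes may be set late."""

    def __init__(self, input_dim, output_dim, weights_init=None, biases_init=None):
        self.input_dim = input_dim
        self.output_dim = output_dim
        # The schemes may be left unset at construction time.
        self.weights_init = weights_init
        self.biases_init = biases_init
        self.W = None
        self.b = None

    def initialize(self):
        # Assumption of this sketch: missing schemes are reported as an error.
        if self.weights_init is None or self.biases_init is None:
            raise ValueError("initialization schemes are not set")
        self.W = [[self.weights_init() for _ in range(self.output_dim)]
                  for _ in range(self.input_dim)]
        self.b = [self.biases_init() for _ in range(self.output_dim)]

    def apply(self, x):
        # Computes x W + b for a single input row `x`.
        if self.W is None:
            raise ValueError("call initialize() first")
        return [sum(x[i] * self.W[i][j] for i in range(self.input_dim)) + self.b[j]
                for j in range(self.output_dim)]


def constant(value):
    """Return a scheme that always produces `value`."""
    return lambda: value


layer = LazyLinear(input_dim=3, output_dim=2)  # no schemes given yet
layer.weights_init = constant(2.0)             # configured after construction
layer.biases_init = constant(0.0)
layer.initialize()
output = layer.apply([1.0, 1.0, 1.0])
print(output)  # [6.0, 6.0]
```

The point of the design is that construction and configuration are decoupled: the object exists and can be wired into a larger structure before anyone decides how its parameters get their values.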

##### Nested bricks

``Linear`` is a very simple brick. More complex bricks often use other bricks to build parts of their computation graphs.

In [None]:
from blocks.bricks import MLP, Tanh, Softmax
mlp = MLP([Tanh(), Softmax()], [784, 100, 10])
probs = mlp.apply(x)

The bricks to be used are either passed at creation time or created by the brick itself, also at creation time. A brick that uses other bricks is called a _parent_, and the bricks it uses are called its _children_. Each brick keeps track of its children and parents. In theory a brick can be a child of multiple bricks; in practice it is sufficient to have tree-like brick hierarchies.

In [None]:
# The children of `mlp` are the linear bricks that it created plus
# the activations that we created.
print(mlp.children)
# `mlp` is not a child of any brick => it does not have a parent
print(mlp.parents)
# `Softmax()` is a basic brick and does not have children
print(mlp.children[-1].children)
# but it does have a parent, which is `mlp`
print(mlp.children[-1].parents)

Let's check the new computation graph:

In [None]:
theano.printing.debugprint(probs)

We can see that indeed all ``mlp.children`` were applied sequentially. Note how annotations help us read the computation graph!

You might wonder how the ``Linear`` children of ``mlp`` will initialize their parameters. No worries:

In [None]:
from blocks.bricks import Initializable
# All bricks that inherit from Initializable
# - have "weights_init" and "biases_init" properties
# - push them to children  before initialization
print(isinstance(mlp, Initializable))
mlp.weights_init = Constant(4.)
mlp.biases_init = Constant(0.)
mlp.initialize()
# Let's check that "weights_init" was indeed propagated
print(mlp.children[0].weights_init)
print(mlp.children[0].parameters[0].get_value().sum() == 784 * 100 * 4)

All right, but what if you want to initialize the linear layers differently? The way to go is to trigger configuration pushing _before_ initialization, so that you can adjust the configuration of the children in between:

In [None]:
mlp = MLP([Tanh(), Softmax()], [784, 100, 10], 
          weights_init=Constant(4.), biases_init=Constant(0.))
# Each brick has a `push_initialization_config` method that pushes 
# initialization settings downward. It is only called once, so when you
# trigger it explicitly, it will not be called in `initialize`.
mlp.push_initialization_config()
mlp.linear_transformations[1].weights_init = Constant(2.)
mlp.initialize()
print(mlp.linear_transformations[0].parameters[0].get_value().sum() == 784 * 100 * 4)
print(mlp.linear_transformations[1].parameters[0].get_value().sum() == 10 * 100 * 2)
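
The pushing mechanism can be pictured with a toy tree of objects. This is a sketch under simplifying assumptions, not the Blocks implementation: `ToyBrick` is a hypothetical class, a plain number stands in for an initialization scheme, and each brick's parameters are reduced to a single value.

```python
class ToyBrick(object):
    """Toy brick that tracks children and parents and pushes its config once."""

    def __init__(self, name, children=(), weights_init=None):
        self.name = name
        self.children = list(children)
        self.parents = []
        self.weights_init = weights_init
        self.config_pushed = False
        self.value = None
        for child in self.children:
            child.parents.append(self)

    def push_initialization_config(self):
        # Runs at most once per brick, so later manual overrides survive.
        if self.config_pushed:
            return
        for child in self.children:
            child.weights_init = self.weights_init
            child.push_initialization_config()
        self.config_pushed = True

    def initialize(self):
        self.push_initialization_config()  # no-op if already pushed
        self.value = self.weights_init
        for child in self.children:
            child.initialize()


first, second = ToyBrick("linear_0"), ToyBrick("linear_1")
mlp_like = ToyBrick("mlp", children=[first, second], weights_init=4.0)

mlp_like.push_initialization_config()  # push explicitly...
second.weights_init = 2.0              # ...then override one child
mlp_like.initialize()

print(first.value, second.value)  # 4.0 2.0
```

Because the push is guarded by a flag, the override on `second` is not overwritten when `initialize` runs, which mirrors the order of operations in the cell above.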

##### Basic concepts summary

We have introduced the following concepts: 

- bricks
- application methods
- brick parameters
- annotated computation graph
- lazy initialization
- children and parents of a brick

### Overview of available bricks

##### Basic bricks

With the basic bricks you can, for example, build a classifier for MNIST:

In [None]:
from blocks.initialization import IsotropicGaussian
mlp = MLP([Tanh(), Softmax()], [784, 100, 10],
          weights_init=IsotropicGaussian(0.01),
          biases_init=Constant(0))
mlp.initialize()
probs = mlp.apply(x)

In order to train ``mlp`` you need to compute the cross-entropy cost. The Blocks way to do that is to use another brick:

In [None]:
from blocks.bricks.cost import CategoricalCrossEntropy
y = tensor.lmatrix('targets') # dimensions: (batch, 1)
cost = CategoricalCrossEntropy().apply(y.flatten(), probs)

The gradient of the cost above is not numerically stable unless the Theano optimizer is very smart. To our knowledge it is not yet smart enough, so to avoid numerical instability in your training procedure you should use ``Softmax.categorical_cross_entropy``.

In [None]:
# Softmax() is moved out of MLP
mlp = MLP([Tanh(), None], [784, 100, 10],
          weights_init=IsotropicGaussian(0.01),
          biases_init=Constant(0))
mlp.initialize()
softmax = Softmax()
# This never computes the probabilities, only their logarithms.
# No NaNs in the gradients!
cost = softmax.categorical_cross_entropy(y.flatten(), mlp.apply(x)).mean()
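
Why working with log-probabilities helps can be seen in a few lines of plain Python. The sketch below is an illustration of the general log-sum-exp trick, not the code Blocks uses, and it shows the problem on the forward value rather than on the gradient; `naive_cross_entropy` and `stable_cross_entropy` are hypothetical helper names.

```python
import math


def naive_cross_entropy(logits, target):
    """Compute probabilities first, then take the log of the target one."""
    exps = [math.exp(z) for z in logits]
    prob = exps[target] / sum(exps)
    return -math.log(prob)


def stable_cross_entropy(logits, target):
    """Never form the probabilities: use log-sum-exp on shifted logits."""
    shift = max(logits)
    log_normalizer = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_normalizer - logits[target]


# For moderate pre-activations both versions agree.
moderate = [2.0, 1.0, 0.1]
difference = abs(naive_cross_entropy(moderate, 0) - stable_cross_entropy(moderate, 0))

# For extreme pre-activations the target probability underflows to zero.
extreme = [0.0, -1000.0]
try:
    naive_cross_entropy(extreme, 1)
    naive_failed = False
except ValueError:  # math.log(0.0) is a domain error
    naive_failed = True

print(naive_failed)                      # True
print(stable_cross_entropy(extreme, 1))  # 1000.0
```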

Some other basic bricks:

In [None]:
# activations
from blocks.bricks import Logistic, Tanh, Rectifier
# maxout related bricks
from blocks.bricks import Maxout, LinearMaxout
# for e.g. word embeddings
from blocks.bricks.lookup import LookupTable
# simply chains several bricks
from blocks.bricks import Sequence

##### Convolutional bricks

The bricks that build convolutional networks are not that much different from the basic ones.



In [None]:
# convolutions, subsampling
from blocks.bricks.conv import Convolutional, MaxPooling
# and classes which construct multi layer CNNs (like MLP)
from blocks.bricks.conv import ConvolutionalSequence

See our [LeNet demo](https://github.com/mila-udem/blocks-examples/blob/master/mnist_lenet/__init__.py).

##### Recurrent bricks

In [None]:
# We have a standard set of recurrent networks
from blocks.bricks.recurrent import SimpleRecurrent, GatedRecurrent, LSTM
# plus a brick that can stack a few of them!
from blocks.bricks.recurrent import RecurrentStack

See the [tutorial](https://blocks.readthedocs.org/en/latest/) and the [parity problem demo](https://github.com/mila-udem/blocks-examples/tree/master/parity_problem).

##### Sequence generator

``SequenceGenerator`` is a high-level brick that can be used to implement language models and Encoder-Decoders, with and without attention.

In [None]:
from blocks.bricks.sequence_generators import SequenceGenerator

See [the extensive API documentation](https://blocks.readthedocs.org/en/latest/api/bricks.html#blocks.bricks.sequence_generators.BaseSequenceGenerator), [Markov chain demo](https://github.com/mila-udem/blocks-examples/tree/master/markov_chain), [reverse words demo](https://github.com/mila-udem/blocks-examples/tree/master/reverse_words).

### Building your own bricks

Writing a new brick is very easy when you know that it will be neither a child nor a parent of any other brick. Consider an example of a brick that computes $x W x^T + xb$:

In [None]:
from blocks.bricks.base import application
from blocks.utils import shared_floatx_nans

# Inheriting from Initializable gives us 
# - lazy `weights_init` and `biases_init` attributes
# - `rng` attribute which is the random number generator 
#    that should be used to actually initialize parameters.
class Quadratic(Initializable):
    def __init__(self, input_dim, **kwargs):
        # Do not forget to call super()!!!
        super(Quadratic, self).__init__(**kwargs)
        self.input_dim = input_dim
        
    # You must put the code that creates shared variables
    # for parameters in `_allocate` method. This requirement
    # comes from the "lazy allocation" feature which we do not discuss today.
    def _allocate(self):
        self.parameters = [
            shared_floatx_nans((self.input_dim, self.input_dim), name='W'),
            shared_floatx_nans((self.input_dim,), name='b')]
        
    # You must put your actual initialization code 
    # in your `_initialize` method for lazy initialization to work
    def _initialize(self):
        self.weights_init.initialize(self.parameters[0], self.rng)
        self.biases_init.initialize(self.parameters[1], self.rng)
        
    # It is the `@application` decorator that actually takes
    # care of tagging input and output variables. The `inputs`
    # and `outputs` arguments define the brick-level names 
    # of inputs and outputs respectively. 
    @application(inputs=['input_'], outputs=['output'])
    def apply(self, input_):
        return (input_.dot(self.parameters[0]).dot(input_.transpose()) + 
                input_.dot(self.parameters[1]))
                
quadratic = Quadratic(input_dim=2, 
                      weights_init=Constant(2), biases_init=Constant(1))
quadratic.initialize()
result = quadratic.apply(x)
f = theano.function([x], [result])
print(f(3 * numpy.ones((1, 2), dtype=config.floatX))[0] ==  3 ** 2 * 2 * 2 * 2 + 3 * 2)
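
The expected value in the last line can be checked by hand. The following sketch redoes the arithmetic in plain Python for the same single input row $x = (3, 3)$, with every entry of $W$ equal to 2 and every entry of $b$ equal to 1; it only mirrors the formula and does not touch Theano.

```python
x = [3.0, 3.0]
W = [[2.0, 2.0],
     [2.0, 2.0]]
b = [1.0, 1.0]

# x W x^T: sum over all pairs (i, j) of x_i * W_ij * x_j
quadratic_term = sum(x[i] * W[i][j] * x[j] for i in range(2) for j in range(2))
# x b: ordinary dot product
linear_term = sum(x[i] * b[i] for i in range(2))
result = quadratic_term + linear_term

print(quadratic_term, linear_term, result)  # 72.0 6.0 78.0
```

The quadratic term has four summands of $3 \cdot 2 \cdot 3 = 18$, which is the `3 ** 2 * 2 * 2 * 2` in the check, and the linear term has two summands of 3, which is the `3 * 2`.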

Writing a new brick that works well in hierarchies requires deeper understanding of brick life-cycle, which is slightly out of the scope of this tutorial. You might find the [existing brick tutorial](http://blocks.readthedocs.org/en/latest/bricks_overview.html) and the  [work-in-progress brick development tutorial](https://github.com/mila-udem/blocks/pull/772) useful.

## Graph filtering and modifications

At this point you might wonder what the benefit of annotating the graph is. The most important use cases for the annotations are the following:

- you can extract inner variables from the graph and use them for debugging, monitoring or constructing various additive penalties (L1, L2, etc.)
- you can replace inner variables to implement advanced regularization such as dropout, weight noise addition, batch normalization (see [the PR](https://github.com/mila-udem/blocks/pull/851))
- the computation graph simply becomes more readable (we look forward to visualizations that make use of the annotations)

The two main classes that you need to work with annotated graphs are ``ComputationGraph`` and ``VariableFilter``. Here is a simple example of applying L2 regularization:

In [None]:
from blocks.graph import ComputationGraph
from blocks.filter import VariableFilter
from blocks.roles import WEIGHT

cg = ComputationGraph([cost])
# cg.variables is simply the list of all graph variables
W1, W2 = VariableFilter(roles=[WEIGHT])(cg.variables)
cost = cost + .00005 * (W1 ** 2).sum() + .00005 * (W2 ** 2).sum()
cost.name = 'training_cost'

``VariableFilter`` can filter by
- the roles of a variable (``INPUT``, ``OUTPUT``, ``WEIGHT``, ...)
- the bricks and the application method that created a variable
- the variable names
    
Examples:

In [None]:
from blocks.roles import OUTPUT
print(VariableFilter(bricks=[mlp.linear_transformations[0]])(cg.variables))
layer1_activations, = VariableFilter(
    roles=[OUTPUT], bricks=[mlp.activations[0]])(cg.variables)
print(layer1_activations)

In the first example we found the parameters, inputs, outputs and also the _auxiliary_ variables that ``linear_transformations[0]`` created.

In the second example we fetched activations of the first layer. Let's try to do something useful with them, e.g. apply dropout:

In [None]:
from blocks.graph import apply_dropout
cg_dropout = apply_dropout(cg, [layer1_activations], 0.5)

# Let's check what has been done
layer2_activations = VariableFilter(
    roles=[OUTPUT], bricks=[mlp.linear_transformations[1]])(cg_dropout.variables)
theano.printing.debugprint(layer2_activations)                           

Internally `apply_dropout` uses `ComputationGraph.replace`, which is a wrapper for ``theano.clone``. It is smarter than ``theano.clone`` and can handle several interdependent replacements.
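
Conceptually, the replacement substitutes each selected variable with a randomly masked version of itself. The sketch below shows one common variant ("inverted" dropout, where kept values are rescaled by $1 / (1 - p)$) on a plain list of numbers. Whether `apply_dropout` rescales in exactly this way is an assumption here, and `toy_dropout` is a hypothetical helper, not a Blocks function.

```python
import random


def toy_dropout(values, drop_prob, rng):
    """Zero each value with probability `drop_prob`, rescale the survivors."""
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]


rng = random.Random(0)          # fixed seed so the run is repeatable
activations = [1.0] * 10000     # stand-in for a layer's activations
dropped = toy_dropout(activations, 0.5, rng)

# Every entry is either dropped (0.0) or kept and doubled (2.0),
# so the mean stays close to the original mean of 1.0.
mean_after = sum(dropped) / len(dropped)
print(sorted(set(dropped)))
print(round(mean_after, 1))
```

The rescaling keeps the expected value of each activation unchanged, so the downstream layers see inputs on the same scale with and without dropout.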

##### Remarks
- we were able to apply L2 and dropout regularization to ``MLP`` without any explicit support on its side 
- if for example we had a ``Dropout`` brick, we would not be able to insert it into ``MLP``

This was to highlight the main advantage of our "search-and-replace" approach: we can keep our bricks plain and simple.

## Algorithms
Blocks has a collection of algorithms and allows you to use arbitrary combinations of them. Note that this part of the library is almost standalone.

The heart of blocks/algorithms is the ``GradientDescent`` class. Do not be confused by the name: ``GradientDescent`` is very generic! It computes gradients, feeds them to a ``StepRule`` object, and at every iteration subtracts the step proposed by the step rule from the parameters.

Let's make one SGD step with learning rate $0.1$.

In [None]:
from blocks.algorithms import GradientDescent, Scale
algorithm = GradientDescent(
    cost=cost, parameters=cg.parameters,
    step_rule=Scale(learning_rate=0.1))
# This call compiles the Theano function
algorithm.initialize()
# This makes one parameter update.
# Note that the input of process_batch must be a dictionary
# with the variable names as keys.
algorithm.process_batch(
    {'features' : numpy.ones((10, 784), dtype=config.floatX), 
     'targets' : numpy.ones((10, 1), dtype='int64')})
# You can check which updates are performed at each step
print(algorithm.updates)
# Uncomment the following line if you want to make sure that the update is right
# theano.printing.debugprint(algorithm.updates[0][1])

We have the following step rules:

- `Scale` -- scales its input, if it is a single rule applied to gradients it is SGD
- `CompositeRule` -- a rule used to compose several rules into a sequence
- `Momentum` -- adds momentum (SGD-Momentum, but can be applied to any kind of rule)
- `AdaDelta` -- adaptive learning rate AdaDelta algorithm
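- `RMSProp` -- scales the step by a running average of recent gradient magnitudes
- `Adam` -- adaptive steps based on running estimates of the first and second moments of the gradient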
- `StepClipping` -- clips the step, can be used for gradient clipping if applied before or for step clipping if applied after other rules
- and others, see [documentation](http://blocks.readthedocs.org/en/latest/api/algorithms.html)

Here is an example of step rule composition:

In [None]:
from blocks.algorithms import CompositeRule, StepClipping
gradient_clipping = CompositeRule([StepClipping(threshold=1.0), Scale(learning_rate=0.1)])
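
The order of the rules in a composition matters, which a toy version makes easy to see. The classes below are simplified re-implementations written for this illustration (hence the `Toy` prefix): they work on plain lists, and the method name `compute_steps` with this signature is an assumption of the sketch, not the Blocks interface.

```python
import math


class ToyScale(object):
    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def compute_steps(self, steps):
        return [self.learning_rate * s for s in steps]


class ToyStepClipping(object):
    def __init__(self, threshold):
        self.threshold = threshold

    def compute_steps(self, steps):
        # Rescale so that the total Euclidean norm does not exceed the threshold.
        norm = math.sqrt(sum(s * s for s in steps))
        if norm <= self.threshold:
            return list(steps)
        return [s * self.threshold / norm for s in steps]


class ToyCompositeRule(object):
    def __init__(self, components):
        self.components = components

    def compute_steps(self, steps):
        # Feed the output of each rule into the next one.
        for rule in self.components:
            steps = rule.compute_steps(steps)
        return steps


gradients = [3.0, 4.0]  # Euclidean norm 5

# Clipping applied to the gradients, then scaling: "gradient clipping".
clip_first = ToyCompositeRule(
    [ToyStepClipping(1.0), ToyScale(0.1)]).compute_steps(gradients)
# Scaling first, then clipping the proposed step: "step clipping".
scale_first = ToyCompositeRule(
    [ToyScale(0.1), ToyStepClipping(1.0)]).compute_steps(gradients)

# The generic update: subtract the proposed step from the parameters.
parameters = [1.0, 1.0]
new_parameters = [p - s for p, s in zip(parameters, clip_first)]

print([round(s, 2) for s in clip_first])   # [0.06, 0.08]
print([round(s, 2) for s in scale_first])  # [0.3, 0.4]
```

In the first composition the gradient is shrunk to norm 1 before the learning rate is applied; in the second the scaled step already has norm 0.5, so the clipping leaves it untouched.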

## Main loop

The main loop in Blocks manages the training process and glues together an algorithm, a data stream and extensions. The main loop itself is a very simple object: it fetches data from a data stream and feeds it to a training algorithm. All additional functionality is added with extensions.

A minimal example of a main loop requires a dataset; we'll use the MNIST dataset from Fuel:

In [None]:
from fuel.datasets.mnist import MNIST
mnist_train = MNIST(("train",))
mnist_test = MNIST(("test",))
mnist_train.sources

Note that we deliberately gave our input variables $x$ and $y$ the same names as the dataset sources. This allows the algorithm to parse the batches produced by the data stream.

In [None]:
# Some Fuel magic, we'll get a stream of batches of size 50
from fuel.transformers import Flatten
from fuel.streams import DataStream
from fuel.schemes import SequentialScheme

train_stream = Flatten(
    DataStream.default_stream(
        mnist_train,
        iteration_scheme=SequentialScheme(
            mnist_train.num_examples, 50)),
    which_sources=('features',))

Now we are ready to construct our first main loop. But even the termination of the main loop is handled by an extension, so the plain main loop will never finish:

In [None]:
from blocks.main_loop import MainLoop

# You have to provide a fresh algorithm object every time 
# you create a new main loop. 
algorithm = GradientDescent(
    cost=cost, parameters=cg.parameters,
    step_rule=Scale(learning_rate=0.1))
main_loop = MainLoop(algorithm, train_stream)
# If you run this, it will never finish
# main_loop.run()

Let's proceed to a more sensible example which terminates (thanks to `FinishAfter`) and prints something (thanks to `Printing`):

In [None]:
from blocks.model import Model
from blocks.extensions import FinishAfter, Printing

# Reinitialize algorithm
algorithm = GradientDescent(
    cost=cost, parameters=cg.parameters,
    step_rule=Scale(learning_rate=0.1))
# Define the main loop, `Model` is a very simple wrapper of the computational graph
main_loop = MainLoop(
    algorithm,
    train_stream,
    model=Model(cost),
    extensions=[
        Printing(),
        FinishAfter(after_n_batches=2)])
# And run it!
main_loop.run()
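
The division of labour between the loop and its extensions can be sketched without Blocks at all. In the toy version below the loop only fetches batches and calls the algorithm; stopping and printing are left to extensions. All class names are hypothetical, only an `after_batch` callback is modelled, and an endless counter stands in for a data stream that keeps producing batches.

```python
import itertools


class ToyFinishAfter(object):
    def __init__(self, after_n_batches):
        self.after_n_batches = after_n_batches

    def after_batch(self, loop):
        if loop.iterations_done >= self.after_n_batches:
            loop.finish_requested = True


class ToyPrinting(object):
    def __init__(self):
        self.lines = []

    def after_batch(self, loop):
        line = "iteration {}".format(loop.iterations_done)
        self.lines.append(line)
        print(line)


class ToyMainLoop(object):
    def __init__(self, process_batch, data_stream, extensions=()):
        self.process_batch = process_batch
        self.data_stream = data_stream
        self.extensions = list(extensions)
        self.iterations_done = 0
        self.finish_requested = False

    def run(self):
        for batch in self.data_stream:
            self.process_batch(batch)
            self.iterations_done += 1
            for extension in self.extensions:
                extension.after_batch(self)
            # Without an extension setting this flag the loop never ends.
            if self.finish_requested:
                break


processed = []
printing = ToyPrinting()
loop = ToyMainLoop(processed.append, itertools.count(),
                   extensions=[printing, ToyFinishAfter(after_n_batches=2)])
loop.run()
print(processed)  # [0, 1]
```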

### Logging
In the example above the printing extension prints the values from the log. You can access the log like this:

In [None]:
main_loop.log

The log is a dictionary that maps an iteration number to a dictionary of log records. Each record is a pair of a record name and its value.

As you may have seen in the previous example, the log also contains status information:

In [None]:
main_loop.log.status

The log has an SQLite backend; with this backend it can store only simple types such as booleans, numbers and strings.
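
The shape of the log described above, a mapping from iteration number to a dictionary of records, can be pictured with a plain `defaultdict`. This is only a sketch of the data layout, not the Blocks log class; the record names and all values below are made up for illustration.

```python
from collections import defaultdict

# iteration number -> {record name: value}
toy_log = defaultdict(dict)
toy_log[1]["cost"] = 1.0          # made-up value
toy_log[2]["cost"] = 0.5          # made-up value
toy_log[2]["saved_to"] = "mnist.pkl"

# Status information lives next to the records, not inside them.
toy_status = {"iterations_done": 2, "epochs_done": 0}

# Collect the history of one record across iterations.
cost_history = [(iteration, records["cost"])
                for iteration, records in sorted(toy_log.items())
                if "cost" in records]
print(cost_history)  # [(1, 1.0), (2, 0.5)]
```

Keeping only booleans, numbers and strings as values, as in this sketch, is what makes such a log easy to store in a simple backend.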

### Monitoring

There are two types of monitoring in Blocks: training monitoring and data stream monitoring. The first computes the monitored values during training, using the training batches, and adds almost no overhead. Data stream monitoring iterates over a dataset and computes the monitored quantities, which may take some time for a big dataset.

The first type is usually used for approximate monitoring on the training subset, and the second for validation/test subsets. However, you can also use `DataStreamMonitoring` on the training subset to get non-averaged values.

The following example shows regular usage of monitoring:

In [None]:
from blocks.extensions.monitoring import DataStreamMonitoring, TrainingDataMonitoring
from blocks.bricks.cost import MisclassificationRate

error_rate = MisclassificationRate().apply(y.flatten(), probs).copy(name='error_rate')

train_monitoring = TrainingDataMonitoring(
    [cost, error_rate], prefix='train', after_batch=True)
test_monitoring = DataStreamMonitoring(
    [cost, error_rate], 
    Flatten(
        DataStream.default_stream(
            mnist_test,
            iteration_scheme=SequentialScheme(
                mnist_test.num_examples, 500)),
        which_sources=('features',)), 
    prefix='test',
    after_batch=True)
print(train_monitoring.record_name(cost))
print(test_monitoring.record_name(cost))

To combine values you can use aggregation schemes. For example, for the mean gradient norm you can use:

In [None]:
from blocks.monitoring import aggregation
average_train_monitoring = TrainingDataMonitoring(
    [cost, error_rate, 
     aggregation.mean(algorithm.total_gradient_norm).copy(name='mean_total_grad_norm'),
     ], every_n_batches=2)
train_monitoring = TrainingDataMonitoring(
    [algorithm.total_gradient_norm.copy(name='total_grad_norm')], after_batch=True)

Internally the values are aggregated using a Theano shared variable. This means that the shape of the aggregated value should be constant. If you would like to monitor all the activations, for example, make sure that your batch size is constant and that the time series have the same length.

Gradient descent and the step rules have attributes holding variables that can be monitored: total gradient norm, total step norm, learning rate, momentum, etc.

In [None]:
from blocks.extensions import Printing
# Reinitialize algorithm
algorithm = GradientDescent(
    cost=cost, parameters=cg.parameters,
    step_rule=Scale(learning_rate=0.1))

main_loop = MainLoop(
        algorithm,
        train_stream,
        model=Model(cost),
        extensions=[FinishAfter(after_n_batches=4), 
                    train_monitoring, average_train_monitoring, test_monitoring, 
                    Printing(after_batch=True)])
main_loop.run()

You can see that the aggregation scheme averaged the gradient norm.
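
What a mean aggregation does between two readouts can be sketched with a running accumulator. This is a toy model in plain Python, assuming that the accumulator is reset after each readout; `ToyMeanAggregator` is a hypothetical class and the gradient norms are made-up numbers.

```python
class ToyMeanAggregator(object):
    """Accumulates values batch by batch and reports their mean on readout."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def accumulate(self, value):
        self.total += value
        self.count += 1

    def readout(self):
        mean = self.total / self.count
        # Start from scratch for the next monitoring interval.
        self.total = 0.0
        self.count = 0
        return mean


gradient_norms = [4.0, 2.0, 1.0, 3.0]  # made-up per-batch values
aggregator = ToyMeanAggregator()
readouts = []
for batch_index, norm in enumerate(gradient_norms, start=1):
    aggregator.accumulate(norm)
    if batch_index % 2 == 0:  # plays the role of every_n_batches=2
        readouts.append(aggregator.readout())

print(readouts)  # [3.0, 2.0]
```

The accumulator has a fixed shape (here a single number), which is the same constraint as the one on the shared variable mentioned above.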

### Serialization
We tried to make serialization work with any kind of user code and to make deserialization easy on different types of hardware. We partly addressed these goals by using `cPickle` and saving Theano shared variables as NumPy arrays with meta information (we use persistent IDs internally). Serialization is being refactored now, so pay attention to the mailing list.

The `blocks.serialization` module provides the functions `dump`, `secure_dump`, `load`, and `continue_training`.

`Checkpoint` is an extension that serializes the main loop; it has an option to save parts of the main loop separately. To continue training, you can use the `Load` extension or the `continue_training` function.

Note that Python cannot unpickle objects from the global namespace if you are unpickling in a different script. One way to solve this problem is to define all your objects in a module. Another way is to run `continue_training` from the same script, so that the objects in the global namespace are defined in the same way as when you ran serialization.

In [None]:
from blocks.extensions.saveload import Checkpoint
# Reinitialize algorithm
algorithm = GradientDescent(
    cost=cost, parameters=cg.parameters,
    step_rule=Scale(learning_rate=0.1))

main_loop = MainLoop(
        algorithm,
        train_stream,
        model=Model(cost),
        extensions=[FinishAfter(after_n_batches=2), 
                    train_monitoring, test_monitoring,
                    Checkpoint('mnist.pkl'), 
                    Printing(after_batch=True)])
main_loop.run()

Note the log entry called 'saved_to'. You can now load the checkpoint using `load`:

In [None]:
from blocks.serialization import load
loaded_main_loop = load('mnist.pkl')
loaded_main_loop.model.parameters[0].get_value()

This file can be loaded with `numpy.load` 

In [None]:
params = numpy.load('mnist.pkl')
params.keys()
params['mlp-linear_0.W']

### Other extensions

Blocks has a bunch of other useful extensions like
- `Printing` -- prints whatever was added to the log
- `ProgressBar` -- outputs a progress bar of the training procedure
- `TrackTheBest` -- checks if a quantity is the best so far and adds it to the log

Blocks-extras contains more extensions. The most useful one is `Plot`, which creates plots using Bokeh; the plots can be viewed in your browser.

### Predicates

We have only one predicate now: `OnLogRecord`, which triggers when a certain log record is found. It is usually used together with the `TrackTheBest` extension. For example, you can save the model to a separate location when you get the best validation score.

You can use the `add_condition` and `set_conditions` methods to modify the conditions under which the extension runs. Note that `set_conditions` overwrites the default ones.

In [None]:
from blocks.extensions.predicates import OnLogRecord
from blocks.extensions.training import TrackTheBest
checkpoint = Checkpoint('mnist.pkl')
extensions = [TrackTheBest('error_rate'), 
              checkpoint.add_condition(['after_epoch'],
                  OnLogRecord('error_rate'),
                  ("mnist_best.pkl",))]
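
The interplay between tracking the best value and a predicate on the log can be modelled in a few lines. The sketch is a toy: `ToyTrackTheBest` and `on_log_record` are hypothetical names, the notification record name used here is an assumption of the sketch, "best" is taken to mean "lowest", and the error rates are made up.

```python
class ToyTrackTheBest(object):
    """Adds a notification record whenever the tracked value improves."""

    def __init__(self, record_name):
        self.record_name = record_name
        # Assumed naming scheme for the notification record.
        self.notification_name = record_name + "_best_so_far"
        self.best_value = None

    def do(self, records):
        value = records.get(self.record_name)
        if value is None:
            return
        if self.best_value is None or value < self.best_value:
            self.best_value = value
            records[self.notification_name] = True


def on_log_record(record_name):
    """Build a predicate that is true when the record is present."""
    def predicate(records):
        return record_name in records
    return predicate


error_rates = [0.5, 0.3, 0.4, 0.2]  # made-up per-epoch values
tracker = ToyTrackTheBest("error_rate")
is_new_best = on_log_record(tracker.notification_name)

saved_after_epochs = []
for epoch, error_rate in enumerate(error_rates, start=1):
    records = {"error_rate": error_rate}
    tracker.do(records)
    if is_new_best(records):  # the condition attached to the checkpoint
        saved_after_epochs.append(epoch)

print(saved_after_epochs)  # [1, 2, 4]
```

The extension that saves the model does not need to know anything about error rates; it only reacts to the presence of a record in the log.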

### Writing your own extension
Sometimes one needs to perform certain actions during the training process. You can implement your own extension to do this job by inheriting from `TrainingExtension` or from `SimpleExtension`.

With the first one you need to implement one or several callbacks such as `before_batch`, `after_batch`, `after_epoch`, etc. With the second one you need to implement the `do` method and possibly add default callback names.

From the extension you can access the main loop and therefore the log, the model and the algorithm. If you need to access some other object, just pass a reference to it to the constructor of the extension.
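
The second style can be sketched with a stand-in base class. The code below does not import Blocks: `ToySimpleExtension` is a hypothetical, much simplified substitute that calls `do` for every enabled callback name, the `dispatch` method is invented for this sketch, and the main loop is replaced by a bare namespace object with a dictionary as its log.

```python
from types import SimpleNamespace


class ToySimpleExtension(object):
    """Hypothetical stand-in: runs `do` for each enabled callback name."""

    def __init__(self, callbacks):
        self.callbacks = set(callbacks)
        self.main_loop = None  # attached by whoever runs the loop

    def dispatch(self, callback_name, *args):
        if callback_name in self.callbacks:
            self.do(callback_name, *args)

    def do(self, which_callback, *args):
        raise NotImplementedError


class RecordHistory(ToySimpleExtension):
    """Collects one log record; runs after every batch by default."""

    def __init__(self, record_name, callbacks=("after_batch",)):
        super(RecordHistory, self).__init__(callbacks)
        self.record_name = record_name
        self.history = []

    def do(self, which_callback, *args):
        # The log is reached through the main loop.
        iteration = self.main_loop.iterations_done
        records = self.main_loop.log.get(iteration, {})
        if self.record_name in records:
            self.history.append((iteration, records[self.record_name]))


toy_loop = SimpleNamespace(log={}, iterations_done=0)
extension = RecordHistory("cost")
extension.main_loop = toy_loop

for cost in [1.0, 0.5]:  # made-up values
    extension.dispatch("before_batch")  # ignored: not an enabled callback
    toy_loop.iterations_done += 1
    toy_loop.log[toy_loop.iterations_done] = {"cost": cost}
    extension.dispatch("after_batch")

print(extension.history)  # [(1, 1.0), (2, 0.5)]
```

The record name is passed to the constructor, which is the pattern suggested above for anything the extension cannot reach through the main loop.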

## Exercises

Clone the blocks-examples [repository](https://github.com/mila-udem/blocks-examples) and open the MNIST [example](https://github.com/mila-udem/blocks-examples/blob/master/mnist/__init__.py).

### 1 Use other activation
Change activation from tanh to ReLU in the MLP and add one more layer. You will need to find the ReLU brick and import it. Run the example for several iterations.

### 2 Apply dropout
Apply dropout with 0.1 drop probability to the input and with 0.5 to all other layers.

### 3 Logging
Save the log separately (see the documentation for how to do it). Run a separate Python notebook and unpickle the log. Install pandas and convert the log to a pandas DataFrame (see the log documentation).