# Einops tutorial, part 2: deep learning

The [previous part](https://github.com/arogozhnikov/einops/tree/master/docs) provides visual examples with numpy.

## What's in this tutorial?

- working with deep learning packages
- important cases for deep learning models
- `einops.asnumpy` and `einops.layers`

In [1]:
from einops import rearrange, reduce

In [2]:
import numpy
x = numpy.random.RandomState(42).normal(size=[10, 32, 100, 200])

In [3]:
from IPython.display import display_html
def guess(x):
    display_html("""
    <h4>Answer is: <a class="anchor-link jp-InternalAnchorLink" href="#Z">{x}</a> (hover to see)</h4>
    """.format(x=tuple(x)), raw=True)

## Select your flavour

Switch to the framework you're most comfortable with. 

In [4]:
# select one from 'chainer', 'gluon', 'tensorflow', 'pytorch' 
flavour = 'pytorch'

In [5]:
if flavour == 'tensorflow':
    import tensorflow as tf
    tf.enable_eager_execution()
    tape = tf.GradientTape(persistent=True)
    tape.__enter__()
    x = tf.contrib.eager.Variable(x) + 0
elif flavour == 'pytorch':
    import torch
    x = torch.from_numpy(x)
    x.requires_grad = True
elif flavour == 'chainer':
    import chainer
    x = chainer.Variable(x)
else:
    assert flavour == 'gluon'
    import mxnet as mx
    mx.autograd.set_recording(True)
    x = mx.nd.array(x, dtype=x.dtype)
    x.attach_grad()
print('selected {} backend'.format(flavour))

selected pytorch backend


In [6]:
type(x), x.shape

(torch.Tensor, torch.Size([10, 32, 100, 200]))

## Simple computations 

- simple computations are no different from those in numpy
- the same code works for tensors from different libraries

In [7]:
# let's start with a simple example
# converting bchw to bhwc format and back is a common operation.
# try to predict output shape and then check your guess
y = rearrange(x, 'b c h w -> b h w c')
guess(y.shape)

## Backpropagation

- Gradients are a cornerstone of deep learning
- You can backpropagate through einops operations just as if they were native operations

In [8]:
y0 = x
y1 = reduce(y0, 'b c h w -> b c', 'max')
y2 = rearrange(y1, 'b c -> c b')
y3 = reduce(y2, 'c b -> ', 'sum')

if flavour == 'tensorflow':
    print(reduce(tape.gradient(y3, x), 'b c h w -> ', 'sum'))
else:
    y3.backward()
    print(reduce(x.grad, 'b c h w -> ', 'sum'))

tensor(320., dtype=torch.float64)


In [9]:
# flattening is another common operation, 
# which happens on a boundary between convolutional layers and fully connected layers
y = rearrange(x, 'b c h w -> b (c h w)')
guess(y.shape)

In [10]:
# space-to-depth operation
y = rearrange(x, 'b c (h h1) (w w1) -> b (h1 w1 c) h w', h1=2, w1=2)
guess(y.shape)

In [11]:
# depth-to-space operation
y = rearrange(x, 'b (c h1 w1) h w -> b c (h h1) (w w1)', h1=2, w1=2)
guess(y.shape)

## Reductions

In [12]:
# simple global average pooling.
y = reduce(x, 'b c h w -> b c', reduction='mean')
guess(y.shape)

In [13]:
# max-pooling with a kernel 2x2
y = reduce(x, 'b c (h h1) (w w1) -> b c h w', reduction='max', h1=2, w1=2)
guess(y.shape)

## 1d, 2d and 3d pooling are defined in a similar way

For sequential 1-d models, you'll probably want pooling over time:
```python
reduce(x, '(t t2) b c -> t b c', reduction='max', t2=2)
```

For volumetric models, all three dimensions are pooled:
```python
reduce(x, 'b c (x x2) (y y2) (z z2) -> b c x y z', reduction='max', x2=2, y2=2, z2=2)
```

Uniformity is a strong point of `einops`: you don't need a specific operation for each particular case.


### Good exercises 

- write a version of space-to-depth for 1d and 3d (2d is provided above)
- write an average / max pooling for 1d models. These are frequently used in NLP to work with sequences of arbitrary length

## Squeeze and unsqueeze (expand_dims)

In [14]:
# models typically work only with batches, 
# so to predict a single image ...
image = rearrange(x[0, :3], 'c h w -> h w c')
# ... create a dummy 1-element axis ...
y = rearrange(image, 'h w c -> () c h w')
# ... imagine you predicted this with a convolutional network for classification, we'll just flatten axes ...
predictions = rearrange(y, 'b c h w -> b (c h w)')
# ... finally, decompose (remove) dummy axis
predictions = rearrange(predictions, '() class -> class')

## keepdims-like behavior for reductions

In [15]:
# empty composition () provides dimensions of length 1, which are broadcastable, 

# per-channel mean-normalization for each image:
y = x - reduce(x, 'b c h w -> b c () ()', 'mean')
guess(y.shape)

# per-channel mean-normalization for whole batch:
y = x - reduce(x, 'b c h w -> () c () ()', 'mean')
guess(y.shape)

## Stacking

In [16]:
# stack along first dimension
list_of_tensors = list(x)
tensors = rearrange(list_of_tensors, 'b c h w -> b h w c')
guess(tensors.shape)
# or maybe stack along last dimension?
tensors = rearrange(list_of_tensors, 'b c h w -> h w c b')
guess(tensors.shape)

## Concatenation

In [17]:
# concatenate over first dimension?
tensors = rearrange(list_of_tensors, 'b c h w -> (b h) w c')
guess(tensors.shape)

# or maybe concatenate along last dimension?
tensors = rearrange(list_of_tensors, 'b c h w -> h w (b c)')
guess(tensors.shape)

## Shuffling within a dimension

In [18]:
# channel shuffle (as it is drawn in shufflenet paper)
y = rearrange(x, 'b (g1 g2 c) h w -> b (g2 g1 c) h w', g1=4, g2=4)
guess(y.shape)

# simpler version of channel shuffle
y = rearrange(x, 'b (g c) h w -> b (c g) h w', g=4)
guess(y.shape)

## Split a dimension

In [19]:
# NB: some symbolic backends don't support simply iterating over the first dimension
# when network predicts several bboxes for each position, here's a convenient way to work with it
# 8 bboxes, 4 coordinates each
bbox_x, bbox_y, bbox_w, bbox_h = rearrange(x, 'b (coord bbox) h w -> coord b bbox h w', coord=4, bbox=8)
max_bbox_area = reduce(bbox_w * bbox_h, 'b bbox h w -> b h w', 'max')
guess(bbox_x.shape)
guess(max_bbox_area.shape)

In [20]:
# when implementing custom gated activation (GLU), split is needed
y1, y2 = rearrange(x, 'b (split c) h w -> split b c h w', split=2)
# typically result = y1 * sigmoid(y2) or something very similar

# ... but we could split differently
y1, y2 = rearrange(x, 'b (c split) h w -> split b c h w', split=2)

# the first one splits channels into consecutive groups: y1 = x[:, :x.shape[1] // 2, :, :]
# while the second takes channels with a step: y1 = x[:, 0::2, :, :]

# this makes a big difference when the input is
# - a result of group convolution
# - a result of bidirectional LSTM/RNN
# Let's focus on the second case, since it is less obvious.
# For instance, in cuDNN the LSTM output is a concatenation of forward-in-time and backward-in-time outputs
# Also, GLU in pytorch splits channels into consecutive groups (the first way)
# So when the LSTM's output comes to GLU, ...
# ... forward-in-time produces the linear part, and backward-in-time produces the activation ...
# ... so the roles of the two directions differ, and the gradients coming to the two parts differ

# einops notation helps detect such inconsistencies when packing several things into a single dimension

## Shape parsing

In [21]:
def convolve_2d(x):
    # imagine we have a simple 2d convolution with padding, so output has same shape as input
    return x

In [22]:
from einops import parse_shape

In [23]:
# imagine we are working with 3d data
x_5d = rearrange(x, 'b c x (y z) -> b c x y z', z=20)
# but we have only 2d convolutions. 
# That's not a problem, since we can apply
y = rearrange(x_5d, 'b c x y z -> (b z) c x y')
y = convolve_2d(y)
# parse_shape not only supplies the axis lengths, but also lets rearrange verify that all dimensions match
y = rearrange(y, '(b z) c x y -> b c x y z', **parse_shape(x_5d, 'b c x y z'))

In [24]:
parse_shape(x_5d, 'b c x y z')

{'b': 10, 'c': 32, 'x': 100, 'y': 10, 'z': 20}

In [25]:
# we can skip some dimensions by writing underscore
parse_shape(x_5d, 'batch c _ _ _')

{'batch': 10, 'c': 32}

## Striding

In [26]:
# finally, how to convert any operation into a strided operation
# (for example, a convolution applied this way acts like a dilated/atrous convolution)

# each image is split into subgrids, each is now a separate "image"
y = rearrange(x, 'b c (h hs) (w ws) -> (hs ws b) c h w', hs=2, ws=2)
y = convolve_2d(y)
y = rearrange(y, '(hs ws b) c h w -> b c (h hs) (w ws)', hs=2, ws=2)

assert y.shape == x.shape

## Layers

For frameworks that prefer operating with layers, layers are available.

You'll need to import a proper one depending on your backend:

```python
from einops.layers.chainer import Rearrange, Reduce
from einops.layers.gluon import Rearrange, Reduce
from einops.layers.keras import Rearrange, Reduce
from einops.layers.torch import Rearrange, Reduce
```

`einops` layers behave in the same way as operations and have the same parameters
(with the exception of the first argument, the tensor, which is passed when the layer is called)

```python
layer = Rearrange(pattern, **axes_lengths)
layer = Reduce(pattern, reduction, **axes_lengths)

# apply layer to tensor
x = layer(x)
```

Usually it is more convenient to use layers, rather than operations, to build models:
```python
# example given for pytorch, but code in other frameworks is almost identical
from torch.nn import Sequential, Conv2d, MaxPool2d, Linear, ReLU
from einops.layers.torch import Reduce

model = Sequential(
    Conv2d(3, 6, kernel_size=5),
    MaxPool2d(kernel_size=2),
    Conv2d(6, 16, kernel_size=5),
    # combined pooling and flattening
    Reduce('b c (h h2) (w w2) -> b (c h w)', 'max', h2=2, w2=2), 
    Linear(16*5*5, 120), 
    ReLU(),
    Linear(120, 10), 
)
```

## Summary

- `einops` operates with different deep learning frameworks, accepts various tensors and allows automatic gradient computation
- interface is uniform: same code works for different frameworks
- code for different dimensionality is written in a uniform way
- layers are provided for easier integration of `einops` into models