# Building the Transformer from Scratch

In this notebook, we'll be implementing the famous Transformer architecture from scratch.

The code is based on the following repos and blog posts:

- [attention-is-all-you-need-pytorch](https://github.com/jadore801120/attention-is-all-you-need-pytorch)
- [pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT)
- [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html) 

Thanks so much to their authors!

In [1]:
import torch
import torch.nn as nn
import numpy as np

One of the keys to understanding how any model works is understanding how the shapes of the tensors change during the processing of each part. We'll be using the logging module to output debugging information to help our understanding.

In [66]:
import logging
logger = logging.getLogger("tensor_shapes")
handler = logging.StreamHandler()
formatter = logging.Formatter(
        '%(message)s')
handler.setFormatter(formatter)
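# Guard against re-running this cell: attaching the handler again
# would print every message multiple times
if not logger.handlers:
    logger.addHandler(handler)
# If you want the model to continuously print tensor shapes, keep this at 1,
# the lowest level used in this notebook (logging.DEBUG is 10 and would hide them)
logger.setLevel(1)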

In [60]:
import inspect
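def getclass():
    # stack[0] is getclass, stack[1] is log_size and stack[2] is the forward()
    # method that called log_size, whose local `self` is the module being logged
    stack = inspect.stack()
    return stack[2][0].f_locals["self"].__class__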

# A helper function to check how tensor sizes change
def log_size(tsr: torch.Tensor, name: str):
    cls = getclass()
    logger.log(level=cls.level, msg=f"[{cls.__name__}] {name} size={tsr.shape}")

We'll use logging levels to control the modules we receive output from. The lower the logging level, the more tensor information you'll get. Feel free to play around!

In [61]:
from enum import IntEnum
# Control how much debugging output we want
class TensorLoggingLevels(IntEnum):
    attention = 1
    attention_head = 2
    multihead_attention_block = 3
    enc_dec_block = 4
    enc_dec = 5
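
As a rough illustration of how these levels act as a threshold, the sketch below uses a separate logger (the name `demo_levels` and the two constants are made up for this example) and asks it which messages it would let through. It only mirrors the first two values of `TensorLoggingLevels`; it is not part of the model code.

```python
import logging

# Hypothetical logger, kept separate from the notebook's "tensor_shapes" logger
demo_logger = logging.getLogger("demo_levels")

# Numeric levels mirroring the first two entries of TensorLoggingLevels
ATTENTION, ATTENTION_HEAD = 1, 2

demo_logger.setLevel(ATTENTION_HEAD)
# Messages logged below the logger's level are dropped
attention_visible = demo_logger.isEnabledFor(ATTENTION)
attention_head_visible = demo_logger.isEnabledFor(ATTENTION_HEAD)

demo_logger.setLevel(ATTENTION)
# Lowering the threshold lets the most detailed messages through again
attention_visible_after = demo_logger.isEnabledFor(ATTENTION)
```

Raising the logger's level therefore silences the innermost modules first, which is how the later `logger.setLevel(...)` cells trim the output.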

We'll be using an enum to refer to dimensions whenever possible to improve readability.

In [5]:
class Dim(IntEnum):
    batch = 0
    seq = 1
    feature = 2

# Components

### Scaled dot product attention

The Transformer is an attention-based architecture. The attention used in the Transformer is scaled dot-product attention. Its formula appears below, after a few notes on the PyTorch functions used in the implementation.

##### torch.bmm: https://pytorch.org/docs/stable/torch.html#torch.bmm
Performs a batch matrix-matrix product of the matrices stored in input and mat2.

*input and mat2 must be 3-D tensors, each containing the same number of matrices (i.e. the same batch size).*

If input is a (b×n×m) tensor and mat2 is a (b×m×p) tensor, out will be a (b×n×p) tensor.

##### torch.transpose: https://pytorch.org/docs/stable/torch.html#torch.transpose
torch.transpose(input, dim0, dim1) → Tensor

Returns a tensor that is a transposed version of input. The given dimensions dim0 and dim1 are swapped.

##### torch.sum: https://pytorch.org/docs/stable/torch.html#torch.sum
Returns the sum of each row of the input tensor in the given dimension dim. If dim is a list of dimensions, reduce over all of them.

If keepdim is True, the output tensor is of the same size as input, except in the dimension(s) dim, where it is of size 1.
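
To make the shape rules above concrete, here is a minimal sketch that applies the three functions to small random tensors. The tensor names and sizes are arbitrary choices for illustration.

```python
import torch

a = torch.rand(5, 10, 20)   # (b, n, m)
b = torch.rand(5, 20, 7)    # (b, m, p)

# Batch matrix product: one (n x m) @ (m x p) product per batch element
bmm_shape = tuple(torch.bmm(a, b).shape)

# Swapping the last two dimensions turns (b, n, m) into (b, m, n)
transposed_shape = tuple(a.transpose(1, 2).shape)

# keepdim=True keeps the reduced dimension with size 1, so the result
# can be broadcast against the original tensor
row_sums = a.sum(dim=-1, keepdim=True)
keepdim_shape = tuple(row_sums.shape)
normalized = a / row_sums   # every row along the last dimension now sums to 1
```

The `keepdim=True` pattern in the last step is the same one used to normalize attention weights across the sequence dimension.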

$$ \textrm{Attention}(Q, K, V) = \textrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V $$

![images](./assets/scaled-dot-product-attention-steps.png)

In [85]:
import math

class ScaledDotProductAttention(nn.Module):
    level = TensorLoggingLevels.attention # logging level read by log_size
    def __init__(self, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        # q, k, v size = (5, 10, 20), i.e. (Batch, Seq, Feature)

        # Check that the key and query feature sizes match
        d_k = k.size(-1) # feature size of the keys
        assert q.size(-1) == d_k

        # STEP 1: Calculate scores
        # Compute the dot product between every query and every key within each batch element
        k = k.transpose(Dim.seq, Dim.feature) # swap dims 1 and 2, k.size: (5, 20, 10)
        score_attn = torch.bmm(q, k) # (Batch, Seq, Seq)
        # We now have one score per pair of positions in the sequence, for each batch element

        # STEP 2: Divide by sqrt(d_k)
        # Scale the dot products so their magnitude does not grow with the feature size
        score_attn = score_attn / math.sqrt(d_k)

        # STEP 3: Mask (optional)
        # mask is a boolean tensor that is True at positions to ignore (e.g. padding).
        # Fill those scores with -inf so they receive zero weight after the softmax
        if mask is not None:
            score_attn = score_attn.masked_fill(mask, float("-inf"))

        # STEP 4: Softmax over the key positions (last dimension)
        log_size(score_attn, "attention weight") # (Batch, Seq, Seq)
        score_attn = torch.softmax(score_attn, dim=-1)

        score_attn = self.dropout(score_attn)

        # STEP 5: Matmul with value matrix
        output = torch.bmm(score_attn, v) # (Batch, Seq, Feature)

        log_size(output, "attention output size") # (Batch, Seq, Feature)
        return output

In [86]:
attn = ScaledDotProductAttention()

In [87]:
q = torch.rand(5, 10, 20)
k = torch.rand(5, 10, 20)
v = torch.rand(5, 10, 20)

In [113]:
result = attn(q, k, v)
result.size()

[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 10, 20])


torch.Size([5, 10, 20])
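
The formula can also be checked in isolation. The following is a minimal sketch, not the notebook's module: `attention_sketch`, the `*_demo` tensors, and the padding mask are all made up for illustration, and dropout is left out. It assumes the convention that the mask is `True` wherever attention should be blocked, and shows that filling those scores with `-inf` before the softmax gives them exactly zero weight.

```python
import math
import torch

def attention_sketch(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V; `mask` is True where attention is NOT allowed."""
    d_k = q.size(-1)
    scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(d_k)  # (Batch, Seq_q, Seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                    # rows sum to 1
    return torch.bmm(weights, v), weights

q_demo = torch.rand(2, 4, 8)
k_demo = torch.rand(2, 4, 8)
v_demo = torch.rand(2, 4, 8)

# Hypothetical padding mask: the last key position of every sequence is padding
pad_mask = torch.zeros(2, 4, 4, dtype=torch.bool)
pad_mask[:, :, -1] = True

out_demo, weights_demo = attention_sketch(q_demo, k_demo, v_demo, mask=pad_mask)
```

Filling masked scores with 0 instead would not work, because `exp(0) = 1` still contributes to the softmax.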

### Multi-Head Attention

Now, we turn to the core component in the Transformer architecture: the multi-head attention block. This block applies linear transformations to the input, then applies scaled dot product attention.

![image](https://i2.wp.com/mlexplained.com/wp-content/uploads/2017/12/multi_head_attention.png?zoom=2&resize=224%2C293)

In [10]:
class AttentionHead(nn.Module):
    level = TensorLoggingLevels.attention_head
    def __init__(self, d_model, d_feature, dropout=0.1):
        super().__init__()
        # We will assume the queries, keys, and values all have the same feature size
        self.attn = ScaledDotProductAttention(dropout)
        self.query_tfm = nn.Linear(d_model, d_feature)
        self.key_tfm = nn.Linear(d_model, d_feature)
        self.value_tfm = nn.Linear(d_model, d_feature)

    def forward(self, queries, keys, values, mask=None):
        Q = self.query_tfm(queries) # (Batch, Seq, Feature)
        K = self.key_tfm(keys) # (Batch, Seq, Feature)
        V = self.value_tfm(values) # (Batch, Seq, Feature)
        log_size(Q, "queries, keys, vals")
        # apply scaled dot product attention to the projected inputs,
        # forwarding the mask so that masked positions are actually ignored
        x = self.attn(Q, K, V, mask=mask)
        return x

In [114]:
attn_head = AttentionHead(20, 20)
result = attn_head(q, k, v)
result.size()

[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 10, 20])


torch.Size([5, 10, 20])

The multi-head attention block simply applies multiple attention heads, then concatenates the outputs and applies a single linear projection.

In [12]:
# We'll suppress logging from the scaled dot product attention now
logger.setLevel(TensorLoggingLevels.attention_head)

##### torch.Tensor.repeat: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.repeat
Repeats this tensor along the specified dimensions.

In [95]:
class MultiHeadAttention(nn.Module):
    level = TensorLoggingLevels.multihead_attention_block
    def __init__(self, d_model, d_feature, n_heads, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.d_feature = d_feature
        self.n_heads = n_heads
        # in practice, d_model == d_feature * n_heads
        assert d_model == d_feature * n_heads

        # Note that this is very inefficient:
        # I am merely implementing the heads separately because it is 
        # easier to understand this way
        self.attn_heads = nn.ModuleList([
            AttentionHead(d_model, d_feature, dropout) for _ in range(n_heads)
        ])
        self.projection = nn.Linear(d_feature * n_heads, d_model) 
    
    def forward(self, queries, keys, values, mask=None):
        log_size(queries, "Input queries")
        x = [attn(queries, keys, values, mask=mask) # (Batch, Seq, Feature)
             for attn in self.attn_heads]
        log_size(x[0], "output of single head")
        
        # concatenate the head outputs along the feature dimension
        x = torch.cat(x, dim=Dim.feature) # (Batch, Seq, D_Feature * n_heads)
        log_size(x, "concatenated output")
        
        # Final linear operation
        x = self.projection(x) # (Batch, Seq, D_Model)
        log_size(x, "projected output")
        return x

In [115]:
# d_model and d_feature are the input and output sizes of each head's linear layers, i.e. nn.Linear(d_model, d_feature)
d_model = 20 * 8
d_feature = 20
n_heads = 8

heads = MultiHeadAttention(d_model, d_feature, n_heads)
result = heads(q.repeat(1, 1, 8), 
      k.repeat(1, 1, 8), 
      v.repeat(1, 1, 8))

result.size()
#q.repeat(1,1,8).size() => torch.Size([5, 10, 160])
#k.repeat(1,1,8).size() => torch.Size([5, 10, 160])
#v.repeat(1,1,8).size() => torch.Size([5, 10, 160])

[MultiHeadAttention] Input queries size=torch.Size([5, 10, 160])
[MultiHeadAttention] Input queries size=torch.Size([5, 10, 160])
[MultiHeadAttention] Input queries size=torch.Size([5, 10, 160])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 10, 20])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 10, 20])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 10, 20])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[AttentionHead] queri

torch.Size([5, 10, 160])
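
The code comment in `MultiHeadAttention` notes that running the heads one by one is inefficient. A common alternative is to project once and reshape so that all heads are computed in a single batched matrix product. The sketch below is one way to do that under simplifying assumptions: the class name `BatchedMultiHeadAttention` and its attribute names are hypothetical, and masking, dropout, and logging are omitted.

```python
import math
import torch
import torch.nn as nn

class BatchedMultiHeadAttention(nn.Module):
    """Hypothetical variant that computes every head in one batched product."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One projection per input covers all heads at once
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def _split(self, x):
        # (Batch, Seq, d_model) -> (Batch, n_heads, Seq, d_head)
        b, s, _ = x.shape
        return x.view(b, s, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, queries, keys, values):
        q = self._split(self.q_proj(queries))
        k = self._split(self.k_proj(keys))
        v = self._split(self.v_proj(values))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # (Batch, n_heads, Seq_q, Seq_k)
        weights = torch.softmax(scores, dim=-1)
        x = weights @ v                                            # (Batch, n_heads, Seq_q, d_head)
        # Merge the heads back into the feature dimension
        b, _, s, _ = x.shape
        x = x.transpose(1, 2).contiguous().view(b, s, self.n_heads * self.d_head)
        return self.out_proj(x)                                    # (Batch, Seq_q, d_model)

batched_heads = BatchedMultiHeadAttention(d_model=160, n_heads=8)
batched_out = batched_heads(torch.rand(5, 10, 160),
                            torch.rand(5, 10, 160),
                            torch.rand(5, 10, 160))
```

The input and output shapes match the per-head version above; only the internal bookkeeping differs.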

### The Encoder

With these core components in place, implementing the encoder is pretty easy.

![image](./assets/encoder-steps.png)

The encoder consists of the following components:
- A multi-head attention block
- A simple feedforward neural network

These components are connected using residual connections and layer normalization.

In [15]:
# We'll suppress logging from the individual attention heads
logger.setLevel(TensorLoggingLevels.multihead_attention_block)

Layer normalization is similar to batch normalization, but normalizes across the feature dimension instead of the batch dimension.

![image](https://i1.wp.com/mlexplained.com/wp-content/uploads/2018/01/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88-2018-01-11-11.48.12.png?w=1500)

In [104]:
class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-8):
        super(LayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta
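
As a quick sanity check of what "normalizing across the feature dimension" means, the sketch below repeats the arithmetic of `LayerNorm.forward` by hand, assuming `gamma` is all ones and `beta` is all zeros (their initial values). The tensor name `x_demo` and its scale and offset are arbitrary.

```python
import torch

# (Batch, Seq, Feature), deliberately not zero-mean
x_demo = torch.rand(5, 10, 512) * 3 + 2

# Normalize each position's feature vector independently of the rest of the batch
mean = x_demo.mean(-1, keepdim=True)
std = x_demo.std(-1, keepdim=True)
normed = (x_demo - mean) / (std + 1e-8)

per_position_mean = normed.mean(-1)   # one value per (batch, position), close to 0
per_position_std = normed.std(-1)     # one value per (batch, position), close to 1
```

Because the statistics are computed per position, the result does not depend on the batch size or on the other sequences in the batch.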

An encoder block just stacks these together.

In [105]:
class EncoderBlock(nn.Module):
    level = TensorLoggingLevels.enc_dec_block
    def __init__(self, d_model=512, d_feature=64,
                 d_ff=2048, n_heads=8, dropout=0.1):
        super().__init__()
        
        self.attn_head = MultiHeadAttention(d_model, d_feature, n_heads, dropout)
        
        self.layer_norm1 = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
        self.position_wise_feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        
        self.layer_norm2 = LayerNorm(d_model)
        
    def forward(self, x, mask=None):
        log_size(x, "Encoder block input")
        
        # STEP 1
        att = self.attn_head(x, x, x, mask=mask)
        log_size(att, "Attention output")
        
        # STEP 2
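        # Normalize the attention output, apply dropout, then add the residual.
        # (The original paper orders this differently: LayerNorm(x + Dropout(Sublayer(x))).)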
        x = x + self.dropout(self.layer_norm1(att))
        
        # STEP 3
        # Apply position-wise feedforward network
        pos = self.position_wise_feed_forward(x)
        log_size(pos, "Feedforward output")
        
        # STEP 4
        # Apply normalization and residual connection
        x = x + self.dropout(self.layer_norm2(pos))
        log_size(x, "Encoder size output")
        return x

In [106]:
enc = EncoderBlock()

In [116]:
result = enc(torch.rand(5, 10, 512))
result.size()

[EncoderBlock] Encoder block input size=torch.Size([5, 10, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 10, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 10, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 10, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 10, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 10, 512])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 64])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 64])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 64])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 10, 64])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 10, 64])
[ScaledDotProductAttention] attenti

torch.Size([5, 10, 512])

The encoder consists of 6 consecutive encoder blocks, so it can be implemented simply as follows.

In [108]:
class TransformerEncoder(nn.Module):
    level = TensorLoggingLevels.enc_dec
    def __init__(self, n_blocks=6, d_model=512,
                 n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.encoders = nn.ModuleList([
            EncoderBlock(d_model=d_model, d_feature=d_model // n_heads,
                         d_ff=d_ff, n_heads=n_heads, dropout=dropout)
            for _ in range(n_blocks)
        ])
    
    def forward(self, x: torch.FloatTensor, mask=None):
        for encoder in self.encoders:
            x = encoder(x, mask=mask)
        return x

### The Decoder

The decoder is mostly the same as the encoder. There's just one additional multi-head attention block that takes the target sentence as input.

![image](./assets/decoder-steps.png)

The keys and values are the outputs of the encoder, and the queries are the outputs of the masked multi-head attention over the target sentence embeddings.

In [109]:
class DecoderBlock(nn.Module):
    level = TensorLoggingLevels.enc_dec_block
    def __init__(self, d_model=512, d_feature=64,
                 d_ff=2048, n_heads=8, dropout=0.1):
        super().__init__()
        self.masked_attn_head = MultiHeadAttention(d_model, d_feature, n_heads, dropout)
        
        self.attn_head = MultiHeadAttention(d_model, d_feature, n_heads, dropout)
        
        self.position_wise_feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

        self.layer_norm1 = LayerNorm(d_model)
        self.layer_norm2 = LayerNorm(d_model)
        self.layer_norm3 = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_out, 
                src_mask=None, tgt_mask=None):
        # Step 1
        # Apply masked self-attention to the target inputs
        att = self.masked_attn_head(x, x, x, mask=tgt_mask)
        x = x + self.dropout(self.layer_norm1(att))
        
        # Step 2 
        # Attend over the encoder outputs, using the previous step's output as queries
        att = self.attn_head(queries=x, keys=enc_out, values=enc_out, mask=src_mask)
        x = x + self.dropout(self.layer_norm2(att))
        
        # Step 3
        # Apply position-wise feedforward network
        pos = self.position_wise_feed_forward(x)
        x = x + self.dropout(self.layer_norm3(pos))
        return x

In [117]:
dec = DecoderBlock()
result = dec(torch.rand(5, 10, 512), enc(torch.rand(5, 10, 512)))
result.size()

[EncoderBlock] Encoder block input size=torch.Size([5, 10, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 10, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 10, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 10, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 10, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 10, 512])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 64])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 64])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 64])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 10, 64])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 10, 64])
[ScaledDotProductAttention] attenti

torch.Size([5, 10, 512])
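
The call above passes no masks. During training, the decoder's first attention block is normally given a causal mask so that each target position can only attend to itself and earlier positions. The sketch below builds such a mask under the assumption that `True` marks positions to hide, which is the convention `masked_fill` expects. The helper `causal_mask` and the `*_demo` names are hypothetical.

```python
import torch

def causal_mask(batch_size, seq_len):
    """Boolean mask of shape (Batch, Seq, Seq); True where key position > query position."""
    upper = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    return upper.unsqueeze(0).expand(batch_size, seq_len, seq_len)

tgt_mask_demo = causal_mask(batch_size=5, seq_len=4)

# With uniform scores, each query spreads its weight evenly over the
# positions it is allowed to see: itself and everything before it
scores_demo = torch.zeros(5, 4, 4).masked_fill(tgt_mask_demo, float("-inf"))
causal_weights = torch.softmax(scores_demo, dim=-1)
```

The first query row puts all of its weight on position 0, while the last row spreads it evenly over all four positions.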

Again, the decoder is just a stack of the underlying block, so it is simple to implement.

In [110]:
class TransformerDecoder(nn.Module):
    level = TensorLoggingLevels.enc_dec
    def __init__(self, n_blocks=6, d_model=512, d_feature=64,
                 d_ff=2048, n_heads=8, dropout=0.1):
        super().__init__()
        self.decoders = nn.ModuleList([
            DecoderBlock(d_model=d_model, d_feature=d_model // n_heads,
                         d_ff=d_ff, n_heads=n_heads, dropout=dropout)
            for _ in range(n_blocks)
        ])
        
    def forward(self, x: torch.FloatTensor, 
                enc_out: torch.FloatTensor, 
                src_mask=None, tgt_mask=None):
        for decoder in self.decoders:
            x = decoder(x, enc_out, src_mask=src_mask, tgt_mask=tgt_mask)
        return x

### Positional Embeddings

Attention blocks treat their input as an unordered set of vectors: permuting the input positions simply permutes the outputs in the same way, so on their own they have no notion of order. The Transformer therefore adds positional information explicitly via positional embeddings.
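
To see the lack of order concretely, here is a small sketch using unparameterized self-attention (queries, keys and values are all the same tensor, with no learned projections). The function `self_attention_sketch` and the variable names are made up for this check.

```python
import math
import torch

def self_attention_sketch(x):
    """Unparameterized self-attention: queries, keys and values are all x."""
    scores = torch.bmm(x, x.transpose(1, 2)) / math.sqrt(x.size(-1))
    return torch.bmm(torch.softmax(scores, dim=-1), x)

x_seq = torch.rand(2, 6, 16)   # (Batch, Seq, Feature)
perm = torch.randperm(6)       # a random reordering of the 6 positions

out_then_permute = self_attention_sketch(x_seq)[:, perm, :]
permute_then_out = self_attention_sketch(x_seq[:, perm, :])

# Shuffling the inputs only shuffles the outputs in the same way
order_is_invisible = torch.allclose(out_then_permute, permute_then_out, atol=1e-5)
```

Since the two results agree, nothing in the attention output reveals where a token sat in the sequence, which is the gap the positional embeddings fill.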

In [24]:
class PositionalEmbedding(nn.Module):
    level = 1
    def __init__(self, d_model, max_len=512):
        super().__init__()        
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.weight = nn.Parameter(pe, requires_grad=False)
        
    def forward(self, x):
        return self.weight[:, :x.size(1), :] # (1, Seq, Feature)
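
For reference, the table built in `__init__` corresponds to the sinusoidal encoding from the original Transformer paper, where $pos$ is the position and $i$ indexes pairs of feature dimensions:

$$ PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$

The `div_term` in the code is the same denominator written with an exponential. `torch.arange(0, d_model, 2)` yields the values $2i$, so

$$ \exp\left(-\frac{2i \ln 10000}{d_{model}}\right) = 10000^{-2i/d_{model}} $$

and multiplying it by `position` gives the argument of the sine and cosine above.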

In [25]:
class WordPositionEmbedding(nn.Module):
    level = 1
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = PositionalEmbedding(d_model)
        
    def forward(self, x: torch.LongTensor, mask=None) -> torch.FloatTensor:
        return self.word_embedding(x) + self.position_embedding(x)

In [26]:
emb = WordPositionEmbedding(1000)
encoder = TransformerEncoder()

In [118]:
result = encoder(emb(torch.randint(1000, (5, 30))))
result.size()

[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 30, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 30, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 30, 512])
[AttentionHead] queries, keys, vals size=torch.Size([5, 30, 64])
[AttentionHead] queries, keys, vals size=torch.Size([5, 30, 64])
[AttentionHead] queries, keys, vals size=torch.Size([5, 30, 64])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 30, 30])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 30, 30])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 30, 30])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 30, 64])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 30, 64])
[ScaledDotProductAttention] attenti

torch.Size([5, 30, 512])

### Putting it All Together

Let's put everything together now.

![image](https://camo.githubusercontent.com/88e8f36ce61dedfd2491885b8df2f68c4d1f92f5/687474703a2f2f696d6775722e636f6d2f316b72463252362e706e67)

In [28]:
# We'll suppress logging from everything below the encoder/decoder blocks now
logger.setLevel(TensorLoggingLevels.enc_dec_block)

In [29]:
emb = WordPositionEmbedding(1000)
encoder = TransformerEncoder()
decoder = TransformerDecoder()

In [30]:
src_ids = torch.randint(1000, (5, 30))
tgt_ids = torch.randint(1000, (5, 30))
x = encoder(emb(src_ids))
decoder(emb(tgt_ids), x)

[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[EncoderBlock] Attention output size=torch.Size([5, 30, 512])
[EncoderBlock] Feedforward output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder size output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[EncoderBlock] Attention output size=torch.Size([5, 30, 512])
[EncoderBlock] Feedforward output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder size output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[EncoderBlock] Attention output size=torch.Size([5, 30, 512])
[EncoderBlock] Feedforward output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder size output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[EncoderBlock] Attention output size=torch.Size([5, 30, 512])
[EncoderBlock] Feedforward output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder size output size=t

tensor([[[  2.2477,   3.3096,  -2.8686,  ...,  -3.2644,  -2.4669,   6.1730],
         [ -1.8250,  -4.9711,  -2.4306,  ...,  -2.9770,  -3.0167,   1.3833],
         [  3.5309,   2.0060,  -3.2075,  ...,   0.1359,  -4.5663,  -1.5324],
         ...,
         [  7.0745,  -1.9728,   1.4077,  ...,   1.3778,  -0.3159,   1.5097],
         [  5.8074,  -4.9769,   1.1956,  ...,  -0.4884,   0.6054,   2.2616],
         [ -0.5769,   3.2016,  -0.5948,  ...,  -1.7086,  -3.5718,   3.0980]],

        [[  4.1633,   1.9668,  -0.4448,  ...,  -0.5528,  -7.7785,  -0.5628],
         [  6.3603,   4.5621,   1.4731,  ...,   2.1059,  -9.3620,  -4.9059],
         [  1.9835,  -2.7217,  -1.0474,  ...,   1.4569,  -7.1355,   0.1433],
         ...,
         [  2.3934,   1.2819,   1.9249,  ...,  -3.7213,  -9.2985,  -5.8557],
         [ -0.0274,   0.3156,  -0.6774,  ...,   1.5031, -10.2572,  -4.0842],
         [ -2.2030,   4.4538,   0.1307,  ...,  -1.5143, -10.7648,  -1.3035]],

        [[  3.4523,   8.5233,  -1.9825,  ...