# Attention Model Documentation

This notebook serves as a documentation hub for understanding each part of the code presented by Dmitry Eremeev and Alexey Pustynnikov.

## Table of Contents (Files)

1. [`train_model.ipynb`](#train_model)
2. [`train.py`](#train)
3. [`attention_graph_encoder.py`](#graph_encoder)
4. [`attention_dynamic_model.py`](#dynamic_model)
5. [`reinforce_baseline.py`](#baseline)
6. [`environment.py`](#environment)
7. [`layers.py`](#layers)
8. [`utils.py`](#utils)

# `train_model.ipynb`

The notebook contains the code for:

1. Setting up parameters
2. Loading or creating data
3. Running the experiment
4. Showing the learning loss

## Parameter set up

The notebook is divided into two cells: one runs the experiment without saving the resulting models, and the other saves the models and loads them back. In either cell, the first block of code sets the parameters of the model. The parameters are:

```python
from time import gmtime, strftime  # needed for FILENAME below

SAMPLES = 512
BATCH = 128
START_EPOCH = 0
END_EPOCH = 5
FROM_CHECKPOINT = False
embedding_dim = 128
LEARNING_RATE = 0.0001
ROLLOUT_SAMPLES = 10000
NUMBER_OF_WP_EPOCHS = 1 
GRAD_NORM_CLIPPING = 1.0 
BATCH_VERBOSE = 1000
VAL_BATCH_SIZE = 1000
VALIDATE_SET_SIZE = 10000
SEED = 1234
GRAPH_SIZE = 50
FILENAME = 'VRP_{}_{}'.format(GRAPH_SIZE, strftime("%Y-%m-%d", gmtime()))
```

Some extra notes on these parameters:

```python 

```
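
As a quick sanity check on the values above, the sketch below derives two quantities from them. It assumes (the notebook does not state this) that `SAMPLES` is the number of training instances per epoch and `BATCH` is the batch size; `steps_per_epoch` is a hypothetical name used only for illustration.

```python
from time import gmtime, strftime

SAMPLES = 512
BATCH = 128
GRAPH_SIZE = 50

# Assumption: SAMPLES instances are consumed in batches of BATCH per epoch.
steps_per_epoch = SAMPLES // BATCH

# Same naming pattern as in the notebook: problem name, graph size, UTC date.
FILENAME = 'VRP_{}_{}'.format(GRAPH_SIZE, strftime("%Y-%m-%d", gmtime()))

print(steps_per_epoch)  # 4
print(FILENAME)         # 'VRP_50_' followed by the current date
```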


<a id='train_model'></a>

# `train.py`



<a id='train'></a>

# `attention_graph_encoder.py`

<a id='graph_encoder'></a>

# `attention_dynamic_model.py`

The dynamic model's main code is hosted here. The lower-level parts of the model can be found in [`attention_graph_encoder.py`](#graph_encoder) and [`layers.py`](#layers). The model class also makes direct use of the problem environment it is trained on. These definitions can be found in [`environment.py`](#environment).

![Encoder Decoder Architecture](pictures/EncoderDecoderAM.png)

## Constructor

The main thing to keep in mind is that the constructor does the following:

- Sets up the basic parameters of the Attention Model. (embedding dimensions etc.)
- Creates the Encoder modules.
- Sets up the projections as defined in the paper as Dense layers.

### Input Parameters

`embedding_dim`

`n_encode_layers`

`n_heads`

`tanh_clipping`

### Variables to keep in mind

`embedding_dim`
* Embedding dimension of the MHA. Note that this is not the per-head dimension used when attending over the input; it equals number of heads × depth of head. Instead of keeping a list of separate per-head projections that are each applied to the same input, a single projection is applied once to the whole input. Its output has size `embedding_dim` and is then "split" into heads for multi-headed attention (like folding down the last dimension).

`n_encode_layers`
* Number of encoder layers to use.

`decode_type`
* Style of decoding. Either 'sampling' or 'greedy'.

`problem`
* Defines the problem type. Helpful in obtaining the mask and knowing when the problem is solved.

`n_heads`
* Number of heads in multi-headed attention.

`embedder`
* Embedder module of type `GraphAttentionEncoder`. It encapsulates all 'Nx' encoder layers.

`output_dim`
* Output dimension of the encoder, used by the decoder. Same as `embedding_dim`.

`num_heads`
* Number of heads. Related to `embedding_dim` by `embedding_dim = num_heads * head_depth`.

`head_depth` **?**
* "Actual" per-head embedding dimension as described in the original Transformer. Same as the dimension of the key and value vectors of each head.

`dk_mha_decoder`
* 


`dk_get_loc_p`


`tanh_clipping`
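
The relation between `embedding_dim`, `num_heads` and `head_depth` described above can be sketched with NumPy. This is a minimal illustration, not the repository's code: the value `num_heads = 8`, the random weights, and the names `x`, `w`, `heads` and `per_head` are assumptions made for the example. It checks that one wide projection followed by a reshape gives the same result as a list of separate per-head projections.

```python
import numpy as np

rng = np.random.default_rng(0)

embedding_dim = 128                      # value used in the notebook
num_heads = 8                            # assumed for illustration
head_depth = embedding_dim // num_heads  # 16

batch, n_nodes = 2, 5
x = rng.normal(size=(batch, n_nodes, embedding_dim))

# A single projection that serves all heads at once.
w = rng.normal(size=(embedding_dim, embedding_dim))
projected = x @ w  # (batch, n_nodes, embedding_dim)

# "Fold" the last dimension into (num_heads, head_depth).
heads = projected.reshape(batch, n_nodes, num_heads, head_depth)
heads = heads.transpose(0, 2, 1, 3)  # (batch, num_heads, n_nodes, head_depth)

# Equivalent formulation: one projection per head, each using a column block of w.
per_head = np.stack(
    [x @ w[:, i * head_depth:(i + 1) * head_depth] for i in range(num_heads)],
    axis=1,
)

assert np.allclose(heads, per_head)
print(heads.shape)  # (2, 8, 5, 16)
```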

#### Linear Transformation

In the literature, the queries, keys, and values are generated by applying a linear transformation to the embedded input nodes. Each of these linear projections is parametrized by its own weight matrix:

$$
W^Q, \; W^K, \; W^V
$$
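
A minimal NumPy sketch of these three projections follows. The shapes, the random weights, and the row-vector convention (`h @ W` rather than `W h`) are assumptions made for illustration; the model itself uses Dense layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, embedding_dim = 5, 128

# Embedded input nodes, one row per node.
h = rng.normal(size=(n_nodes, embedding_dim))

# One weight matrix per projection (no bias term in this sketch).
W_Q = rng.normal(size=(embedding_dim, embedding_dim))
W_K = rng.normal(size=(embedding_dim, embedding_dim))
W_V = rng.normal(size=(embedding_dim, embedding_dim))

# With nodes stored as rows, each projection is a single matrix product.
q = h @ W_Q
k = h @ W_K
v = h @ W_V

print(q.shape, k.shape, v.shape)  # (5, 128) (5, 128) (5, 128)
```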

In the decoder, the query is computed not from the embedded input but from a context vector that concatenates the graph embedding with the embeddings of the last and first nodes of the tour produced so far:

$$
h^{(N)}_{(c)} = [\bar{h}^{(N)},\;\; \text{Last},\;\; \text{First}]
$$
$$
q_{(c)} = W^{Q}h^{(N)}_{(c)}
$$

Because the graph embedding does not change during decoding while the last and first nodes do, it is beneficial to compute the graph-embedding part of the query once and recompute only the first and last parts at each step. To do this, $W^Q$ is split into two matrices, $W^Q_{\text{context}}$ and $W^Q_{\text{step\_context}}$.
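
The split can be checked numerically. The sketch below is an illustration under assumed shapes and random weights, not the repository's implementation: it treats $W^Q$ as a block matrix whose first block multiplies the graph embedding and whose remaining columns multiply the step-dependent part, and confirms that both ways of computing $q_{(c)}$ agree. The names `W_context` and `W_step_context` mirror the symbols in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # embedding dimension

h_graph = rng.normal(size=d)  # graph embedding, fixed during decoding
h_last = rng.normal(size=d)   # embedding of the last visited node
h_first = rng.normal(size=d)  # embedding of the first visited node

# Full query projection acting on the concatenated context vector.
W_Q = rng.normal(size=(d, 3 * d))
h_c = np.concatenate([h_graph, h_last, h_first])
q_full = W_Q @ h_c

# Block split: first d columns for the graph embedding, the rest for the step part.
W_context = W_Q[:, :d]
W_step_context = W_Q[:, d:]

q_context = W_context @ h_graph                              # computed once
q_step = W_step_context @ np.concatenate([h_last, h_first])  # recomputed every step
q_split = q_context + q_step

assert np.allclose(q_full, q_split)
```

Only `q_step` has to be recomputed as decoding proceeds, which is the saving the split is meant to provide.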

`wq_context`
* Dense layer used for computing the context part of the query matrix.

`wq_step_context`
* Dense layer used for computing the first and last parts of the query. In the case of VRP, these are the current node and the remaining capacity, $[\text{last}, \;D_t]$.

`wk`
* Matrix used for calculating the key in the decoder.

`wk_tanh`
* Matrix used for calculating the key in the final single-head attention (1HA) of the decoder.

`wv`
* Matrix used for calculating the value in the MHA of the decoder.

`w_out`
* 




<a id='dynamic_model'></a>

# `reinforce_baseline.py`

<a id='baseline'></a>

# `environment.py`

<a id='environment'></a>

# `layers.py`

<a id='layers'></a>

# `utils.py`

<a id='utils'></a>