# Dynamic Attention Model (AM-D) Custom Execution

This notebook contains code as I explore and test the implementation created by Eremeev and Pustynnikov.

In [1]:
from attention_dynamic_model import AttentionDynamicModel
import tensorflow as tf
import time



In [2]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
%env TF_GPU_ALLOCATOR=cuda_malloc_async

Num GPUs Available:  1
env: TF_GPU_ALLOCATOR=cuda_malloc_async




## Documentation on the AttentionDynamicModel class

The AttentionDynamicModel class is the main model class that implements the AM-D model.

### The constructor

The constructor sets up the basic attributes of the model as well as all of its layers. The encoder is implemented as a separate class, while the decoder is implemented inside this model class.
```python
def __init__(self,
             embedding_dim, 
             n_encode_layers=2, 
             n_heads=8, 
             tanh_clipping=10): ...
```
| Parameter | Description |
|:---:|:---|
| embedding_dim | Dimensionality of the output produced by the embedding projection. It defines the input size of both the encoder and the decoder modules. |
| n_encode_layers | Number of stacked encoder layers. |
| n_heads | Number of heads used by both the encoder and decoder attention modules. |
| tanh_clipping | Clipping value applied (through `tanh`) to the attention scores. |
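
As a rough illustration of what `tanh_clipping` controls: in the attention-model literature, the decoder's raw scores (logits) are commonly squashed into the range $[-C, C]$ by computing $C \cdot \tanh(u)$ before the softmax. The sketch below assumes that convention. The helper names `clip_logits` and `softmax` are hypothetical and are not taken from the repository.

```python
import math

def clip_logits(logits, tanh_clipping=10.0):
    """Squash raw scores into [-tanh_clipping, tanh_clipping] (assumed convention)."""
    return [tanh_clipping * math.tanh(u) for u in logits]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_logits = [-50.0, 0.0, 0.5, 50.0]  # made-up scores for four candidate nodes
clipped = clip_logits(raw_logits)      # every value now lies within [-10, 10]
probs = softmax(clipped)               # probabilities over the candidate nodes
```

Clipping bounds the gap between the largest and smallest score, so no single node can reach a probability of exactly 1.0 and sampling keeps some room for exploration.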

In [7]:
# AM-D Model Parameters
embedding_dim   = 128
n_encode_layers = 2
n_heads         = 8
tanh_clipping   = 10

model_amd = AttentionDynamicModel(
    embedding_dim  =embedding_dim,
    n_encode_layers=n_encode_layers,
    n_heads        =n_heads,
    tanh_clipping  =tanh_clipping
)

#### Setting the Decode Style

There are two types of decoding for AM-D:

1. Greedy
  - Greedy decoding returns the node with the highest probability in the decoder output.
2. Sampling
  - Sampling decoding draws a node at random according to the probability distribution produced by the decoder.
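
The difference between the two decode types can be sketched on a single decoding step. This is a minimal illustration in plain Python, not the repository's implementation; `decode_step` and the probability vector are made up for the example.

```python
import random

def decode_step(probs, decode_type, rng=random):
    """Pick the next node index from a probability vector."""
    if decode_type == 'greedy':
        # Always take the most likely node.
        return max(range(len(probs)), key=probs.__getitem__)
    if decode_type == 'sampling':
        # Draw one node index, weighted by its probability.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    raise ValueError(f"unknown decode type: {decode_type!r}")

probs = [0.1, 0.6, 0.3]                 # hypothetical decoder output for 3 nodes
greedy_choice = decode_step(probs, 'greedy')

rng = random.Random(0)                  # seeded so the draws are reproducible
sampled = [decode_step(probs, 'sampling', rng) for _ in range(1000)]
```

Greedy decoding is deterministic, which suits evaluation, while sampling explores different routes, which is why it is the usual choice during training.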

In [8]:
model_amd.set_decode_type('sampling')

## Selecting Optimizer

We have to define an optimizer for the model. Keras offers many optimizers to choose from; this notebook uses Adam.

### Adam
```python
tf.keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    amsgrad=False,
    name="Adam",
    **kwargs
)
```
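
To make the role of these hyperparameters concrete, here is a sketch of one textbook Adam update for a single scalar parameter. It follows the update rule from the original Adam paper; the Keras implementation may differ in small details (for example, where `epsilon` is applied), and `adam_step` is a hypothetical helper, not a Keras function.

```python
import math

def adam_step(theta, grad, m, v, t, learning_rate=0.001,
              beta_1=0.9, beta_2=0.999, epsilon=1e-07):
    """One textbook Adam update for a single scalar parameter."""
    m = beta_1 * m + (1.0 - beta_1) * grad          # first-moment estimate
    v = beta_2 * v + (1.0 - beta_2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta_1 ** t)                 # bias correction, step t >= 1
    v_hat = v / (1.0 - beta_2 ** t)
    theta = theta - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return theta, m, v

theta, m, v = 0.5, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1, learning_rate=0.0001)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `learning_rate` regardless of the gradient's magnitude.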

In [9]:
from tensorflow.keras.optimizers import Adam

# Optimizer Parameters
learning_rate = 0.0001
beta_1        = 0.9
beta_2        = 0.999
epsilon       = 1e-07
amsgrad       = False
name          = "Adam"


optimizer = Adam(
    learning_rate=learning_rate,
    beta_1=beta_1,
    beta_2=beta_2,
    epsilon=epsilon,
    amsgrad=amsgrad,
    name=name,
)

## Creating a Baseline for More Efficient Reinforcement Learning

The AM-D model uses a baseline to improve the stability of learning. In plain reinforcement learning, it can be hard to judge how good an action is, because the reward obtained depends on the state the agent was in. For example, a sequence of actions may yield a different reward than another sequence that starts in a different state but leads to the same end state. The reward variance is therefore high, and it is hard to compare high-reward actions when the initial state varies.

To solve this problem, a baseline can be used. Early baselines in RL were often a constant value that acts as a "neutral" reference, indicating whether an action is better or worse than expected. Because the variance in RL tends to depend on the state the agent is in, we need a dynamic baseline that provides a state-dependent "neutral" reference, which makes it easier to decide which actions are better or worse. There are many kinds of baselines; AM-D uses the best AM-D policy trained so far. If the training model finds actions that produce a better reward, comparing them against the best actions learned so far makes the improvement visible.
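
The effect of a baseline can be sketched with the standard REINFORCE-with-baseline loss, in which each sample's cost is compared against the baseline's cost on the same instance before it weights the log-likelihood. This is an assumed, simplified form with made-up numbers, not code from the repository.

```python
def reinforce_loss(costs, baseline_costs, log_likelihoods):
    """Mean of (cost - baseline cost) * log-likelihood over a batch."""
    terms = [(c - b) * ll
             for c, b, ll in zip(costs, baseline_costs, log_likelihoods)]
    return sum(terms) / len(terms)

costs = [6.2, 5.1, 7.4]                 # tour costs of the training policy
baseline_costs = [6.0, 5.5, 7.0]        # baseline costs on the same instances
log_likelihoods = [-12.0, -9.5, -14.1]  # log-probabilities of the sampled tours

# Positive advantage: worse than the baseline. Negative: better than the baseline.
advantages = [c - b for c, b in zip(costs, baseline_costs)]
loss = reinforce_loss(costs, baseline_costs, log_likelihoods)
```

Only the difference to the baseline drives the gradient, so an instance that is simply hard (high cost for both policies) no longer looks like a bad action.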

### Rollout Baseline

The rollout baseline uses an ML model that produces the best actions learned so far. Whenever the learned model outperforms the baseline by a statistically significant margin (checked with a t-test), the baseline is replaced with the learned model.
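
The replacement decision can be sketched as a one-sided paired test on the costs both models obtain on the same evaluation instances. The sketch below uses only the standard library and a normal approximation to the t distribution, which is reasonable for large sample counts such as the 10,000 used in this notebook. The 5% significance level, the function name `candidate_is_better`, and the synthetic costs are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def candidate_is_better(candidate_costs, baseline_costs, significance=0.05):
    """One-sided paired test: is the candidate's mean cost lower?"""
    diffs = [c - b for c, b in zip(candidate_costs, baseline_costs)]
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    # Normal approximation to the t distribution (large-sample assumption).
    p_value = NormalDist().cdf(t_stat)
    return mean(diffs) < 0 and p_value < significance

rng = random.Random(0)
# Synthetic costs: the candidate is about 0.05 cheaper on average.
baseline_costs = [rng.gauss(6.0, 0.5) for _ in range(10_000)]
candidate_costs = [b - 0.05 + rng.gauss(0.0, 0.1) for b in baseline_costs]

replace_baseline = candidate_is_better(candidate_costs, baseline_costs)
```

Pairing the costs per instance removes the variation between easy and hard instances, so even a small but consistent improvement can be detected.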

__For the AM-D model, a warm-up stage can be used for more stable convergence.__ During warm-up, an exponential moving average baseline is combined with the rollout baseline. At the very beginning of learning, the rollout baseline's policy may be too poor to give meaningful baseline costs. To solve this, we first rely on an exponential moving average of the mean cost obtained by the training model, which is updated over time and weights recently obtained costs higher than older ones. This also helps in getting through the initial run of bad costs caused by exploration. The Exponential Moving Average (EMA) update is the following:

$M \leftarrow \beta M + (1-\beta) L(\pi)$

where $M$ is the moving average, $L(\pi)$ is the mean cost obtained by the training model, and $\beta$ weights previously obtained costs against recent ones. As $\beta$ approaches 0, recent costs dominate and previous costs are forgotten faster. As $\beta$ approaches 1.0, previous costs are forgotten more slowly and keep influencing the current average. Finding a good balance is important: a low $\beta$ makes the average cost too unstable, while a high $\beta$ makes it depend too much on early outlier costs, which may also destabilize it.
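
A small worked example shows how $\beta$ changes the behaviour of the update. The batch costs are made up, and seeding $M$ with the first cost is an assumption made for this sketch.

```python
def update_ema(moving_avg, batch_cost, beta=0.8):
    """One EMA step: M <- beta * M + (1 - beta) * L(pi)."""
    return beta * moving_avg + (1.0 - beta) * batch_cost

batch_costs = [10.0, 8.0, 6.0, 6.0]  # made-up mean costs of successive batches

history = {}
for beta in (0.2, 0.8):
    m = batch_costs[0]               # assumption: seed M with the first cost
    trace = [m]
    for cost in batch_costs[1:]:
        m = update_ema(m, cost, beta)
        trace.append(m)
    history[beta] = trace

# history[0.2] is approximately [10.0, 8.4, 6.48, 6.096]
# history[0.8] is approximately [10.0, 9.6, 8.88, 8.304]
```

With $\beta = 0.2$ the average follows the drop in cost almost immediately, while with $\beta = 0.8$ it still sits well above the latest costs after three updates.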

The output cost used for the baseline learning will be a weighted combination of both EMA and rollout baseline. More precisely:

$L(baseline) \leftarrow \alpha L(\pi_g) + (1 - \alpha) M$

where $\pi_g$ is the rollout baseline policy used by the algorithm and $\alpha = \frac{epoch + 1}{wp\_n\_epochs}$ is a coefficient that grows with the epoch number toward $1.0$. $\alpha$ is not allowed to exceed $1.0$; once it reaches that value, warm-up is deemed complete and only the rollout baseline cost is used.
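
The warm-up schedule can be sketched directly from the two formulas above. The function names and the cost values are hypothetical; only the formulas come from the text.

```python
def warmup_alpha(epoch, wp_n_epochs):
    """alpha = (epoch + 1) / wp_n_epochs, capped at 1.0."""
    return min(1.0, (epoch + 1) / wp_n_epochs)

def blended_baseline(rollout_cost, moving_avg, alpha):
    """L(baseline) = alpha * L(pi_g) + (1 - alpha) * M."""
    return alpha * rollout_cost + (1.0 - alpha) * moving_avg

wp_n_epochs = 5
alphas = [warmup_alpha(epoch, wp_n_epochs) for epoch in range(7)]
# alphas == [0.2, 0.4, 0.6, 0.8, 1.0, 1.0, 1.0]

# Epoch 0: the EMA still carries most of the weight.
blend = blended_baseline(rollout_cost=6.0, moving_avg=8.0, alpha=alphas[0])
# blend is approximately 0.2 * 6.0 + 0.8 * 8.0 = 7.6
```

With `wp_n_epochs = 5`, the rollout baseline takes over completely from the fifth epoch (index 4) onward.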

```python
def __init__(self, 
             model, 
             filename,
             from_checkpoint=False,
             path_to_checkpoint=None,
             wp_n_epochs=1,
             epoch=0,
             num_samples=10000,
             warmup_exp_beta=0.8,
             embedding_dim=128,
             graph_size=20
             ): ...
```

| Parameter           | Description |
| :---:               | :--- |
| model               | Initial ML model to use as baseline |
| filename            | Suffix for checkpoint name for the model (Keras). Model name template is `{path_to_checkpoint}/baseline_checkpoint_epoch_{epoch}_{filename}.h5`|
| from_checkpoint     | Flag to use a saved checkpoint following the suffix provided. |
| path_to_checkpoint  | Directory where the baseline is saved |
| wp_n_epochs         | Number of warm-up epochs |
| epoch               | Starting epoch number |
| num_samples         | Size of dataset generated for baseline. Used when deciding whether current model is statistically better than the baseline. |
| warmup_exp_beta     | Weight $\beta$ of the Exponential Moving Average used during warm-up |
| embedding_dim       | Used for loading the model. |
| graph_size          | Used for loading the model. |

In [10]:
from reinforce_baseline import RolloutBaseline
from time import strftime, gmtime

model              = model_amd
graph_size         = 10
filename           = 'VRP_{}_{}'.format(graph_size, strftime("%Y-%m-%d", gmtime()))
from_checkpoint    = False
path_to_checkpoint = None
wp_n_epochs        = 5
epoch              = 0
num_samples        = 10_000
warmup_exp_beta    = 0.8
embedding_dim      = embedding_dim


baseline = RolloutBaseline(model             = model,
                           filename          = filename,
                           from_checkpoint   = from_checkpoint,
                           path_to_checkpoint= path_to_checkpoint,
                           wp_n_epochs       = wp_n_epochs,
                           epoch             = epoch,
                           num_samples       = num_samples,
                           warmup_exp_beta   = warmup_exp_beta,  # defined above; was not passed before
                           embedding_dim     = embedding_dim,
                           graph_size        = graph_size
                           )


## Creating Problem Dataset

Dataset generation is provided by the `utils.py` module. The recommended function to generate data is `create_data_on_disk`:

```python
def create_data_on_disk(
    graph_size, 
    num_samples, 
    is_save=True, 
    filename=None, 
    is_return=False, 
    seed=1234): ...

```

| Parameter | Description|
| :---:     | :---       |
| graph_size  | Number of nodes to generate |
| num_samples | Size of dataset |
| is_save     | Flag for saving to disk |
| filename    | Suffix of dataset: `Validation_dataset_{filename}.pkl` |
| is_return   | Whether to return dataset or not |
| seed        | Seed for generation |

Note that the data generated will use the TensorFlow API for datasets.

In [None]:
from utils import create_data_on_disk

graph_size  = graph_size
num_samples = 10_000
is_save     = True
filename    = filename
is_return   = True
seed        = 42

validation_dataset = create_data_on_disk(graph_size =graph_size,
                                         num_samples=num_samples,
                                         is_save    =is_save,
                                         filename   =filename,
                                         is_return  =is_return,
                                         seed       =seed)

### Visualization of output data

Below we can inspect the data generated with the TF dataset API, followed by a scatter plot of one sample graph.

#### Format of Data Generated

The generated data uses TensorFlow's dataset API, which means the data is produced batch by batch. To access the data directly, use `next(data.batch(n).as_numpy_iterator())`. The generated data has the following structure:

```
batched_data[data type][batch's sample][xy data]
```

Here, `data type` selects the kind of node. In the case of the VRP data generator, index 1 holds the requesting/space nodes and index 0 holds the initial depot where the agent starts. `batch's sample` is the index of a sample within the generated batch. Finally, `xy data` selects the coordinate: 0 for x, 1 for y. For the node entry, each sample holds one row per node, and each row contains the x and y values.
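
The indexing can be tried on a stand-in batch built with NumPy. The arrays below only imitate the two entries described above (depot and nodes) with random coordinates; they are not produced by `create_data_on_disk`, and the shapes are assumptions based on that description.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, graph_size = 4, 10

# Stand-in for one batch: (depot coordinates, node coordinates).
batched_data = (
    rng.random((batch_size, 2)),              # [0] depot: one (x, y) per sample
    rng.random((batch_size, graph_size, 2)),  # [1] nodes: graph_size rows of (x, y)
)

sample = 0
depot_x, depot_y = batched_data[0][sample]    # depot of the first sample
nodes_xy = batched_data[1][sample]            # all nodes of the first sample
first_node_y = batched_data[1][sample][0][1]  # y value of its first node
```

The node entry carries one extra index (the node within the graph) between the sample index and the coordinate index.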

In [None]:
import pandas as pd
import seaborn as sns

# Grab the first 6,000 datapoints. The number 6,000 is arbitrary.
gen_data = next(validation_dataset.batch(6_000).as_numpy_iterator())

# We select the first sample from the generated batch (first out of 6,000)
sample   = 0

# We collect all other nodes. Note: first index of gen_data [0] -> depot node, [1] -> other nodes
graph    = pd.DataFrame(gen_data[1][sample], columns=['x', 'y'])

# We collect the depot node. Add a label to identify it in table.
depot = {
    'x': gen_data[0][sample][0], 
    'y': gen_data[0][sample][1], 
    'type': 'depot'
}
# Label all other datapoints as nodes and add the already-labeled depot node.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
graph = pd.concat([graph.assign(type='node'), pd.DataFrame([depot])], ignore_index=True)

In [None]:
graph

In [None]:
sns.scatterplot(data=graph, x='x', y='y', hue='type', style='type')

### Training the Model

The developers provide a training function that orchestrates the training and baseline rollout process.

```python
def train_model(optimizer,
                model_tf,
                baseline,
                validation_dataset,
                samples = 1280000,
                batch = 128,
                val_batch_size = 1000,
                start_epoch = 0,
                end_epoch = 5,
                from_checkpoint = False,
                grad_norm_clipping = 1.0,
                batch_verbose = 1000,
                graph_size = 20,
                filename = None
                ): ...
```

| Parameter          | Description |
| :----------------: | :---------- |
| optimizer          | Optimizer to be used for training |
| model_tf           | Training model to use             |
| baseline           | Initial baseline                  |
| validation_dataset | Generated dataset for validation (used at the end) |
| samples            | Number of samples to use for each learning epoch |
| batch              | Size of batches for learning      |
| val_batch_size     | Batch size used for validation (batch results are averaged at the end for the final score) |
| start_epoch        | Initial epoch configuration       |
| end_epoch          | Last epoch                        |
| from_checkpoint    | Flag for loading checkpoint       |
| grad_norm_clipping | Clipping/rescaling of gradients   |
| batch_verbose      | Verbosity of output               |
| graph_size         | Number of nodes for graph generated |
| filename           | Suffix of saved model: `VRP_{graph_size}_{date}` |
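
To get a feel for the `samples` and `batch` parameters, the number of optimization steps per epoch can be estimated with simple arithmetic. The sketch assumes every sample is used once per epoch and that a final partial batch, if any, still counts as a step; how `train_model` handles this internally is not verified here.

```python
import math

notebook_samples, notebook_batch = 52_000, 32     # values used in this notebook
default_samples, default_batch = 1_280_000, 128   # defaults in the signature

steps_per_epoch = math.ceil(notebook_samples / notebook_batch)
default_steps_per_epoch = math.ceil(default_samples / default_batch)

# steps_per_epoch == 1625, default_steps_per_epoch == 10000
```

The reduced settings in this notebook therefore run roughly six times fewer steps per epoch than the defaults, with batches a quarter of the size.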

In [None]:
from train import train_model

# Parameters
optimizer          = optimizer
model_tf           = model
baseline           = baseline
validation_dataset = validation_dataset
samples            =  52_000 # Paper: 1_280_000
batch              = 32
val_batch_size     = 1_000 
start_epoch        = 0
end_epoch          = 50
from_checkpoint    = False
grad_norm_clipping = 1.0
batch_verbose      = 1_000
graph_size         = graph_size
filename           = filename

# Used for timing
start_time = time.time()
print(start_time)

# Train...
train_model(optimizer =optimizer,
            model_tf = model_tf,
            baseline = baseline,
            validation_dataset = validation_dataset,
            samples = samples,
            batch = batch,
            val_batch_size = val_batch_size,
            start_epoch = start_epoch,
            end_epoch = end_epoch,
            from_checkpoint = from_checkpoint,
            grad_norm_clipping = grad_norm_clipping,
            batch_verbose = batch_verbose,
            graph_size = graph_size,
            filename = filename
            )
print(time.time() - start_time )

## Saving a Model

Since AM-D is a custom model, we can only save its weights. Loading them requires creating a new AM-D model with the same configuration, loading the weights, and setting the decode type:

```python
loaded_model = AttentionDynamicModel(
    embedding_dim  =embedding_dim,
    n_encode_layers=n_encode_layers,
    n_heads        =n_heads,
    tanh_clipping  =tanh_clipping
)

# See the Keras API. The *.ckp suffix is arbitrary.
loaded_model.load_weights('some_folder/some_checkpoint.ckp')

# Don't forget to set the decode type! Otherwise you get an error when trying to do inference.
loaded_model.set_decode_type('greedy')
```

In [11]:
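nodes     = graph_size        # keep the file name consistent with the trained graph size
date      = 'oct_19'
_iter     = end_epoch - start_epoch
n_batches = samples // batch  # batches per epoch (samples is the number of samples, not batches)
batch_size = batch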

model_tf.save_weights(f"checkpoints/AM-D_{date}_{nodes}_nodes_{_iter}_iter_{n_batches}_batches_{batch_size}_batch_size.ckp")