This notebook is heavily based on the rl4co tutorial!

### Installation

Starting instructions:

1.   If you are using Google Colab, select a T4 (or better) GPU runtime in the upper right.
2.   Run the following cell and take a short break.
3.   Restart the runtime (Runtime -> Restart runtime). If desired, clear the output of the following cell.
4.   Jump to Imports and begin running the cells of the notebook.

In [None]:
## You may need to restart the runtime in Colab after this
## Remember to choose a GPU runtime for faster training!
## NOTE: These two lines may take a little while to run. Grab a drink in the meantime.

!pip install rl4co
!pip install -e git+https://github.com/jvkersch/pyconcorde#egg=pyconcorde

### Imports

In [None]:
import torch
from lightning.pytorch.callbacks import ModelCheckpoint, RichModelSummary

from rl4co.envs import TSPEnv
from rl4co.models.zoo.am import AttentionModel
from rl4co.utils.trainer import RL4COTrainer

#from concorde.tsp import TSPSolver

import warnings
warnings.filterwarnings('ignore')

### Environment, Model and LitModule

In [None]:
# RL4CO env based on TorchRL
env = TSPEnv(num_loc=20)

# Model: default is AM with REINFORCE and greedy rollout baseline
model = AttentionModel(env,
                       baseline='rollout',
                       train_data_size=100_000,
                       val_data_size=10_000)
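
The comment in the cell above mentions REINFORCE with a greedy rollout baseline. As a rough illustration of that idea only (this is not RL4CO's implementation, and the function name `reinforce_loss` and the example numbers are made up for this sketch), the loss weights each sampled tour's log-likelihood by how much better its reward was than the baseline's reward on the same instance:

```python
def reinforce_loss(rewards, baseline_rewards, log_likelihoods):
    """Batch mean of -(reward - baseline) * log_likelihood.

    rewards:          reward of each sampled tour (negative tour length)
    baseline_rewards: reward of the baseline on the same instances,
                      e.g. from a greedy rollout of a frozen policy copy
    log_likelihoods:  log-probability the policy assigned to each sampled tour
    """
    advantages = [r - b for r, b in zip(rewards, baseline_rewards)]
    terms = [-(adv * ll) for adv, ll in zip(advantages, log_likelihoods)]
    return sum(terms) / len(terms)


# Hypothetical numbers for two instances
example_rewards = [-4.0, -5.0]    # sampled tours of length 4.0 and 5.0
example_baseline = [-4.5, -4.5]   # baseline tours of length 4.5 on both
example_log_likelihoods = [-2.0, -3.0]

example_loss = reinforce_loss(example_rewards, example_baseline, example_log_likelihoods)
print(example_loss)
```

Minimizing this loss raises the probability of tours that beat the baseline (positive advantage) and lowers it for tours that do worse.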

### Test greedy rollout with untrained model and plot

In [None]:
# Greedy rollouts over untrained model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_sz = 3
td_init = env.reset(batch_size=[batch_sz]).to(device)
model = model.to(device)
out = model(td_init, phase="test", decode_type="greedy", return_actions=True)

# Plotting
print(f"Tour lengths: {[f'{-r.item():.2f}' for r in out['reward']]}")
for td, actions in zip(td_init, out['actions'].cpu()):
    env.render(td, actions)
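
The cell above prints `-reward` as the tour length, since the TSP reward is the negative length of the closed tour. A minimal, library-free sketch of that quantity (the helper name `tour_length` and the toy coordinates are assumptions for illustration, not part of RL4CO):

```python
import math


def tour_length(locs, actions):
    """Length of the closed tour that visits `locs` in the order given by `actions`."""
    total = 0.0
    n = len(actions)
    for i in range(n):
        x1, y1 = locs[actions[i]]
        x2, y2 = locs[actions[(i + 1) % n]]  # wrap around to return to the start
        total += math.hypot(x2 - x1, y2 - y1)
    return total


# Four cities on the corners of the unit square
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

print(tour_length(square, [0, 1, 2, 3]))  # 4.0 (walks around the perimeter)
print(tour_length(square, [0, 2, 1, 3]))  # longer: crosses both diagonals
```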

### Trainer

The RL4CO trainer is a wrapper around PyTorch Lightning's `Trainer` class that adds some functionality and more efficient defaults.

In [None]:
from rl4co.utils.trainer import RL4COTrainer

trainer = RL4COTrainer(
    max_epochs=3,
    accelerator="gpu",
    logger=None
)

INFO:pytorch_lightning.utilities.rank_zero:Using 16bit Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


### Fit the model

In [None]:
trainer.fit(model)

INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name     | Type                 | Params
--------------------------------------------------
0 | env      | TSPEnv               | 0     
1 | policy   | AttentionModelPolicy | 710 K 
2 | baseline | WarmupBaseline       | 710 K 
--------------------------------------------------
1.4 M     Trainable params
0         Non-trainable params
1.4 M     Total params
5.681     Total estimated model params size (MB)

Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


### Testing

In [None]:
# Greedy rollouts over trained model (same states as previous plot)
model = model.to(device)
out = model(td_init, phase="test", decode_type="greedy", return_actions=True)

# Plotting
for td, actions, rr in zip(td_init, out['actions'].cpu(), out['reward'].cpu()):
    # The reward is the negative tour length, so flip the sign for display
    print(f"Tour length: {-rr.item():.2f}")
    env.render(td, actions)

We can see that even after just 3 epochs, our trained AM is able to find much better solutions than the untrained policy! 🎉

### Comparison to optimal solutions

Now, let us compare against the optimal solutions as computed by Concorde.

In [None]:
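# Imported here so the earlier cells still run if the pyconcorde install failed
from concorde.tsp import TSPSolver

tdi_cpu = td_init.cpu()
opt_sols = []
for ii in range(batch_sz):
    # EUC_2D rounds every distance to the nearest integer, so scale the unit-square
    # coordinates by 1000; otherwise all distances would collapse to 0 or 1.
    locs = tdi_cpu['locs'][ii]
    solver = TSPSolver.from_data(locs[:, 0] * 1000, locs[:, 1] * 1000, norm="EUC_2D")
    opt_sols.append(solver.solve())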
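
To see why the coordinates are scaled by 1000, here is a small sketch of the TSPLIB `EUC_2D` rule, which rounds each Euclidean distance to the nearest integer. The helper `euc_2d` and the two sample points are assumptions made for this illustration; it is presumed here that the solver follows the same rounding rule.

```python
import math


def euc_2d(p, q):
    """TSPLIB EUC_2D distance: Euclidean distance rounded to the nearest integer."""
    return int(math.hypot(p[0] - q[0], p[1] - q[1]) + 0.5)


a, b = (0.1, 0.2), (0.4, 0.9)           # two points in the unit square
true_dist = math.hypot(0.3, 0.7)        # about 0.7616

raw = euc_2d(a, b)                      # rounding on the raw coordinates
scale = 1000
scaled = euc_2d((a[0] * scale, a[1] * scale),
                (b[0] * scale, b[1] * scale)) / scale

print(f"true={true_dist:.4f}  raw={raw}  scaled={scaled}")
```

Without scaling, the distance rounds to 1; with scaling it keeps about three decimal places, so the reported optimal value is still only an approximation of the exact Euclidean tour length.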

In [None]:
# Convert the optimal solutions into a tensor that rl4co understands for visualization
opt_actions = torch.tensor([sol.tour for sol in opt_sols], dtype=torch.int64)
# Undo the 1000x coordinate scaling to get tour lengths in the original units
opt_vals = [sol.optimal_value / 1000. for sol in opt_sols]

for ii in range(batch_sz):
    # The reward is the negative tour length; .item() gives a plain Python float
    policy_len = -out['reward'][ii].item()
    gap = (policy_len - opt_vals[ii]) / opt_vals[ii]
    print(f'Instance {ii}: Gap of policy to optimal: {gap * 100:.2f}%')
    env.render(tdi_cpu[ii], opt_actions[ii])
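
The gap printed above is the relative excess of the policy's tour length over the optimal one. A standalone sketch with made-up numbers (the helper `optimality_gap` and the example lengths are hypothetical, not results from this notebook):

```python
def optimality_gap(policy_length, optimal_length):
    """Relative gap of a tour to the optimal tour, in percent."""
    if optimal_length <= 0:
        raise ValueError("optimal_length must be positive")
    return 100.0 * (policy_length - optimal_length) / optimal_length


# Hypothetical example: policy tour of length 4.20 vs. optimal tour of length 4.00
example_gap = optimality_gap(4.20, 4.00)
print(f"{example_gap:.2f}%")
```

Because the solver works on integer-rounded distances, the reference value is itself approximate, so a near-optimal policy tour could in principle show a gap very slightly below zero.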