# PARCO for the FFSP


Learning a Parallel AutoRegressive policy for a Combinatorial Optimization problem: the Flexible Flow Shop Scheduling Problem (FFSP).

<a href="https://colab.research.google.com/github/ai4co/parco/blob/main/examples/3.quickstart-ffsp.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>  <a href="https://arxiv.org/abs/2409.03811"><img src="https://img.shields.io/badge/arXiv-2409.03811-b31b1b.svg" alt="Open In ArXiv"></a>


In [None]:
%load_ext autoreload
%autoreload 2

import torch
from rl4co.utils.trainer import RL4COTrainer
from rl4co.models import POMO
from parco.envs import FFSPEnv
from parco.models import PARCOMultiStagePolicy

# Use the first GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## Environment

In [None]:
env = FFSPEnv(generator_params=dict(num_job=20, num_machine=4),
              data_dir="",
              val_file="../data/omdcpdp/n50_m10_seed3333.npz",
              test_file="../data/omdcpdp/n50_m10_seed3333.npz",
              )            
td_test_data = env.generator(batch_size=[3])
td_init = env.reset(td_test_data.clone()).to(device)
td_init_test = td_init.clone()

## Model

Here we declare the policy and the PARCO model (policy + environment + RL algorithm).

In [None]:
emb_dim = 128

# Policy is the neural network
policy = PARCOMultiStagePolicy(num_stages=env.num_stage,
                               env_name=env.name,
                               embed_dim=emb_dim,
                               num_heads=8,
                               normalization="instance",
                               init_embedding_kwargs={"one_hot_seed_cnt": env.num_machine})

# We refer to the model as the policy + the environment + training data (i.e. full RL algorithm)
model = POMO(
    env,
    policy=policy,
    train_data_size=1000,
    val_data_size=100,
    test_data_size=1000,
    batch_size=50,
    val_batch_size=100,
    test_batch_size=100,
    num_starts=24,
    num_augment=0,
    optimizer_kwargs={'lr': 1e-4, 'weight_decay': 0},
)
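
POMO trains on several rollouts per instance (`num_starts=24` in the cell above) and uses their mean reward as a shared baseline. The snippet below is a minimal pure-Python sketch of that idea, not RL4CO's actual implementation; the function name `shared_baseline_advantages` and the reward values are made up for illustration.

```python
def shared_baseline_advantages(rewards):
    """Subtract the mean reward of all rollouts of one instance.

    rewards: one reward per rollout (start) of the same instance.
    Hypothetical helper for illustration only.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Four made-up rollout rewards (negative makespans) for a single instance
rewards = [-30.0, -28.0, -32.0, -26.0]
advantages = shared_baseline_advantages(rewards)
print(advantages)
```

Rollouts that beat the average get a positive advantage and are reinforced; the others are discouraged.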

### Test untrained model

In [None]:
td_pre = td_init_test.clone()

policy = model.policy.to(device)
out = policy(td_pre, env, decode_type="greedy", return_actions=True)

print("Average makespan: {:.2f}".format(-out['reward'].mean().item()))
for i in range(3):
    print(f"Schedule {i} makespan: {-out['reward'][i].item():.2f}")
    env.render(td_pre, idx=i)
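
The makespan printed above is the negated reward: the time at which the last job leaves the last stage. As a rough illustration of what that number measures, here is a small pure-Python sketch that computes the makespan of a hand-written schedule. It does not reflect how `FFSPEnv` computes rewards internally; `proc_times`, `assignment` and the toy numbers are assumptions made for this example.

```python
def makespan(proc_times, assignment):
    """Makespan of a flexible flow shop schedule (illustrative only).

    proc_times[s][m][j]: processing time of job j on machine m of stage s.
    assignment[s]: (job, machine) pairs in dispatch order for stage s;
        every job appears exactly once per stage.
    """
    num_jobs = len(proc_times[0][0])
    job_ready = [0] * num_jobs  # time at which each job left the previous stage
    for stage_times, stage_plan in zip(proc_times, assignment):
        machine_free = [0] * len(stage_times)  # time at which each machine is idle
        for job, machine in stage_plan:
            # A job starts once it has arrived and its machine is free
            start = max(job_ready[job], machine_free[machine])
            finish = start + stage_times[machine][job]
            machine_free[machine] = finish
            job_ready[job] = finish
    return max(job_ready)


# Toy instance: 2 stages, 2 machines per stage, 3 jobs
proc_times = [
    [[2, 3, 4], [3, 2, 5]],  # stage 0: machine 0, machine 1
    [[4, 1, 2], [2, 2, 3]],  # stage 1: machine 0, machine 1
]
assignment = [
    [(0, 0), (1, 1), (2, 0)],  # stage 0
    [(1, 0), (0, 1), (2, 0)],  # stage 1
]
toy_makespan = makespan(proc_times, assignment)
print(toy_makespan)  # 8
```

With this convention the reward would be `-toy_makespan`, which is why the cell above negates `out['reward']` before printing.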

## Training

Here we create the trainer and then fit the model.

In [None]:
trainer = RL4COTrainer(
    max_epochs=5, # few epochs for demo
    accelerator="gpu" if torch.cuda.is_available() else "cpu",
    devices=1, # number of devices (e.g. GPUs) to use
    logger=None,
)

In [None]:
trainer.fit(model)

## Evaluating the trained model

Now we take the test instances from above and evaluate the trained model on them. A trained policy can be evaluated with different techniques:
- Greedy: at each step, take the action with the highest probability
- Sampling: sample N solutions from the policy's probability distribution and keep the best one
- Augmentation: create N augmented copies of each instance, solve each of them, and keep the best solution
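
The difference between greedy and sampling evaluation can be sketched on a toy one-step problem. This is an illustrative pure-Python example, not the decoding code used by the policy; the helper names and the `probs` and `cost` values are hypothetical. In the real setting a whole schedule is sampled rather than a single action.

```python
import random


def greedy_decode(probs):
    """Return the index of the most probable action."""
    return max(range(len(probs)), key=lambda a: probs[a])


def sample_decode(probs, rng):
    """Draw one action index from the probability distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def best_of_n(probs, cost, n, rng):
    """Sample n actions and keep the one with the lowest cost."""
    candidates = [sample_decode(probs, rng) for _ in range(n)]
    return min(candidates, key=lambda a: cost[a])


# Toy setup: the policy prefers action 0, but action 2 is actually cheapest
probs = [0.5, 0.3, 0.2]
cost = [10.0, 12.0, 8.0]
rng = random.Random(0)  # fixed seed for reproducibility

greedy_action = greedy_decode(probs)
sampled_action = best_of_n(probs, cost, n=64, rng=rng)
print(greedy_action, sampled_action)
```

Greedy decoding is deterministic and cheap, while sampling spends N times the compute for a chance to find a better solution than the most likely one.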

### Greedy evaluation

Here we simply decode greedily, i.e. with `decode_type="greedy"`.

In [None]:
td_post = td_init_test.clone()

policy = model.policy.to(device)
out = policy(td_post, env, decode_type="greedy", return_actions=True)

print("Average makespan: {:.2f}".format(-out['reward'].mean().item()))
for i in range(3):
    print(f"Schedule {i} makespan: {-out['reward'][i].item():.2f}")
    env.render(td_post, idx=i)