- CNN policy ?
- grid search for HP tuning (OK)
- Increasingly difficult Environment
- Positive reward for populating increasingly "deep" blending tanks ?
- RL for chem sched paper (https://arxiv.org/pdf/2203.00636)
- Masking (https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html, https://arxiv.org/pdf/2006.14171)
    - Adding binary decision variables ?
    - Requires discrete action space (only integer flows -> treated as categories ?)
    - masking: disable incoming flows (resp. outgoing flows) for tanks at UB inv limit (resp. LB inv. limit), disable selling/buying when available = 0
    - multiple envs with multiple agents ? (MARL, https://arxiv.org/pdf/2103.01955)
        - Predict successive pipelines ("source > blend" then "blend > blend" (as many as required) then "blend > demand")
        - Each agent has access to the whole state
        - Action mask is derived from the previous agent's actions (0 if inventory at bounds or incoming flow already reserved, else 1)
        - https://github.com/Rohan138/marl-baselines3/blob/main/marl_baselines3/independent_ppo.py
- Safe RL: (https://proceedings.mlr.press/v119/wachi20a/wachi20a.pdf)
    - "Unsafe state" ? > Do not enforce constraints strictly, instead opt for early episode termination to show which states are unsafe ? 
    - Implementations:
        - https://pypi.org/project/fast-safe-rl/#description (Policy optimizers)
        - https://github.com/PKU-Alignment/safety-gymnasium/tree/main/safety_gymnasium (environments; "cost" ?)
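
The masking rule noted above (disable incoming flows for tanks at their upper inventory bound, outgoing flows for tanks at their lower bound) could be sketched roughly as follows. This is only an illustration: `flow_mask` and its arguments are hypothetical names, only blending-tank inventories are considered, and the buy/sell availability check is left out.

```python
def flow_mask(connections, inventory, inv_lb, inv_ub, tol=1e-9):
    """Return {(origin, destination): 0 or 1} for every pipeline.

    A pipeline is masked (0) if its origin tank sits at its lower inventory
    bound (nothing to send) or its destination tank sits at its upper bound
    (no room to receive). Nodes without an entry in `inventory` (sources and
    demands in this sketch) never cause masking.
    """
    mask = {}
    for group in connections.values():
        for origin, destinations in group.items():
            for dest in destinations:
                origin_empty = origin in inventory and inventory[origin] <= inv_lb[origin] + tol
                dest_full = dest in inventory and inventory[dest] >= inv_ub[dest] - tol
                mask[(origin, dest)] = 0 if (origin_empty or dest_full) else 1
    return mask


# Same single-tank topology as the `connections` dict used in this notebook.
connections = {
    "source_blend": {"s1": ["j1"]},
    "blend_blend": {"j1": []},
    "blend_demand": {"j1": ["p1"]},
}
mask_empty = flow_mask(connections, {"j1": 0.0}, {"j1": 0}, {"j1": 30})
mask_full = flow_mask(connections, {"j1": 30.0}, {"j1": 0}, {"j1": 30})
print(mask_empty)
print(mask_full)
```

With a discrete action space, a mask of this kind is what a maskable policy would consume; for continuous flows it only says which components should be forced to zero.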


1. Try other learning rates/CNN policies
2. Implement Masking with single agent
3. Try other ways to tell the model what are illegal/unsafe states (safe RL)
4. Try multiple agents
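
For step 3, the "early episode termination" idea from the Safe RL notes could look roughly like the wrapper below. It is a sketch under assumptions: the wrapped object is assumed to follow the five-value `step` API, and `is_unsafe`, `penalty` and the toy environment are hypothetical names introduced only for illustration.

```python
class EarlyTerminationWrapper:
    """Ends the episode as soon as a step lands in an unsafe state."""

    def __init__(self, env, is_unsafe, penalty=0.0):
        self.env = env
        self.is_unsafe = is_unsafe  # hypothetical predicate: (obs, info) -> bool
        self.penalty = penalty      # extra cost charged on the terminating step

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.is_unsafe(obs, info):
            # Flag the state and cut the episode short instead of clipping the action.
            info = dict(info, unsafe=True)
            return obs, reward - self.penalty, True, truncated, info
        return obs, reward, terminated, truncated, info


class _ToyTank:
    """Toy stand-in for an environment: one inventory that the action adds to."""

    def reset(self, **kwargs):
        self.inv = 0.0
        return self.inv, {}

    def step(self, action):
        self.inv += action
        return self.inv, 1.0, False, False, {"inv": self.inv}


toy = EarlyTerminationWrapper(
    _ToyTank(),
    is_unsafe=lambda obs, info: not 0.0 <= info["inv"] <= 30.0,
    penalty=10.0,
)
toy.reset()
safe_step = toy.step(20.0)    # inventory 20, within [0, 30]
unsafe_step = toy.step(20.0)  # inventory 40, above the bound: episode ends
print(safe_step, unsafe_step)
```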

-----------------------

- Masking: Discretization of action space is too slow/might not work -> Need to implement masking for continuous action space
- Recurrent policy makes the most sense ? (window of demand forecasts)
- https://www.reddit.com/r/reinforcementlearning/comments/17l5b47/invalid_action_masking_when_action_space_is/
    - Suggestion of autoregressive model for having constraints respected: one predicted action is input to a second model
    - Suggestion of editing the distribution in such a way that the constraint is respected
- https://www.sciencedirect.com/science/article/pii/S0098135420301599
    - Choice of ELU activation ?
    - Choice of NN size ?
    - "The feature engineering in the net inventory means the network does not have to learn these relationships itself, which did help speed training." ?
- Simplify the problem (remove tanks 5 to 8), find the optimal solution with Gurobi
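
One possible reading of "masking for a continuous action space" is to rescale the proposed flows so a tank cannot leave its inventory bounds. The sketch below is not the environment's actual logic: `clip_flows_to_bounds` is a hypothetical helper, and it assumes that outflows can only draw on the inventory held at the start of the step.

```python
def clip_flows_to_bounds(inv, inflows, outflows, lb, ub):
    """Rescale proposed flows so the tank inventory stays within [lb, ub].

    Outflows are scaled down first so they never draw more than `inv - lb`;
    inflows are then scaled down so the resulting inventory never exceeds `ub`.
    Negative proposals are treated as zero flow.
    """
    inflows = [max(0.0, f) for f in inflows]
    outflows = [max(0.0, f) for f in outflows]

    total_out = sum(outflows)
    available = max(0.0, inv - lb)
    if total_out > available:
        scale = available / total_out
        outflows = [f * scale for f in outflows]
        total_out = available

    total_in = sum(inflows)
    room = max(0.0, ub - (inv - total_out))
    if total_in > room:
        scale = room / total_in
        inflows = [f * scale for f in inflows]

    return inflows, outflows


# Empty tank: the outflow is removed, the inflow is kept.
case_empty = clip_flows_to_bounds(0.0, [5.0], [0.1333], lb=0.0, ub=30.0)
# Nearly full tank: two inflows of 4 share the 2 units of free room.
case_full = clip_flows_to_bounds(28.0, [4.0, 4.0], [], lb=0.0, ub=30.0)
print(case_empty, case_full)
```

Proportional scaling keeps the ratio between competing pipelines, which is one design choice among several (a priority order would be another).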

- remove all constraints except in/out
- https://arxiv.org/pdf/1711.11157
- https://arxiv.org/pdf/2111.01564
- Softmax with large coef to produce action mask
- Graph convolution NN instead of RNN ?
    - https://pytorch-geometric.readthedocs.io/en/latest/
    - Graph rep. learning - William L Hamilton
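
The "softmax with large coef" idea can be illustrated with a two-class softmax over `(coef * slack, 0)`, which is a steep sigmoid. The names `soft_mask`, `slack` and `coef` are assumptions made for this sketch, with `slack` standing for something like the remaining room in a tank.

```python
from math import exp


def soft_mask(slack, coef=50.0):
    """Two-class softmax over (coef * slack, 0), i.e. a steep sigmoid.

    Approaches 1 when slack > 0 (flow allowed) and 0 when slack < 0.
    """
    z = coef * slack
    # Numerically stable form: exp is only ever called on a non-positive number.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def soft_mask_grad(slack, coef=50.0):
    """Derivative of soft_mask with respect to slack: coef * m * (1 - m)."""
    m = soft_mask(slack, coef)
    return coef * m * (1.0 - m)


for s in (-1.0, 0.0, 1.0):
    print(s, soft_mask(s), soft_mask_grad(s))
```

The derivative is large only in a narrow band around `slack = 0` and is essentially zero elsewhere, so very little gradient passes through such a mask away from the boundary. That may be worth keeping in mind when a large coefficient is used inside the policy network.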

- DDPG
- Softmax
- ~~Remove non-selling rewards~~
- MultiplexNet
- Why doesn't softmax work ? -> gradient doesn't compute properly

- Finalize adjustment of flows
- Add more difficulty (bigger env)

In [67]:
# import gymnasium as gym
import json
import numpy as np
import torch as th
from stable_baselines3 import PPO, DDPG
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.callbacks import *
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize, VecCheckNan

from envs import BlendEnv, flatten_and_track_mappings, reconstruct_dict
from models import *
from math import exp, log
import yaml

import warnings
warnings.filterwarnings("ignore")

( Regexp for Tensorboard coloring )

(1\\|2\\|3\\|4\\|5\\|6\\|7\\|8\\|9\\|10\\|11\\|12\\|13\\)

In [68]:
with open("configs/12.yaml", "r") as f:
    # safe_load parses plain YAML without constructing arbitrary Python objects
    cfg = yaml.safe_load(f)
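
The cells below read a fixed set of keys from `cfg`. A small check like the following could catch a missing key before training starts. The key names are taken from how `cfg` is used in this notebook; `missing_keys` itself and the example config values are hypothetical.

```python
# Keys read from `cfg` elsewhere in this notebook.
REQUIRED_TOP = ["id", "clipped_std", "custom_softmax", "policytype", "optimizer",
                "obs_normalizer", "reward_normalizer", "starting_point"]
REQUIRED_NESTED = {
    "model": ["act_fn", "arch_layersize", "arch_n", "log_std_init",
              "lr", "clip_range", "ent_coef", "use_sde"],
    "env": ["D", "Q", "P", "B", "Z", "M", "reg", "reg_lambda",
            "maxflow", "alpha", "beta", "product_cost"],
}


def missing_keys(cfg):
    """Return the dotted names of expected keys that are absent from cfg."""
    missing = [k for k in REQUIRED_TOP if k not in cfg]
    for section, keys in REQUIRED_NESTED.items():
        sub = cfg.get(section) or {}
        missing += [f"{section}.{k}" for k in keys if k not in sub]
    return missing


# Hypothetical partial config, only to show the output format.
partial = {"id": 12, "optimizer": "PPO", "model": {"lr": 3e-4}}
print(missing_keys(partial))
```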

![image info](simplest.png)

In [69]:
# th.autograd.set_detect_anomaly(True)

In [70]:
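# Policy class: the two boolean flags take precedence over cfg["policytype"]
if cfg["clipped_std"]:
    policytype = CustomMLP_ACP_simplest_std
elif cfg["custom_softmax"]:
    policytype = CustomMLP_ACP_simplest_softmax
elif cfg["policytype"] == "MLP":
    policytype = "MlpPolicy"
elif cfg["policytype"] == "MLPtanh":
    policytype = CustomMLP_ACP_simplest_tanh
else:
    raise ValueError(f"Unknown policytype: {cfg['policytype']!r}")

if cfg["optimizer"] == "PPO":
    optimizer_cls = PPO
elif cfg["optimizer"] == "DDPG":
    optimizer_cls = DDPG
else:
    raise ValueError(f"Unknown optimizer: {cfg['optimizer']!r}")

if cfg["model"]["act_fn"] == "ReLU":
    act_cls = th.nn.ReLU
elif cfg["model"]["act_fn"] == "tanh":
    act_cls = th.nn.Tanh
elif cfg["model"]["act_fn"] == "sigmoid":
    act_cls = th.nn.Sigmoid
else:
    raise ValueError(f"Unknown activation: {cfg['model']['act_fn']!r}")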

In [71]:
connections = {
    "source_blend": {"s1": ["j1"]},
    "blend_blend": {"j1": []},
    "blend_demand": {"j1": ["p1"]}
}

In [72]:
action_sample = {
    'source_blend':{'s1': {'j1':1}},
    'blend_blend':{},
    'blend_demand':{'j1': {'p1':1}},
    "tau": {"s1": 10},
    "delta": {"p1": 0}
}
action_sample_flat, mapp = flatten_and_track_mappings(action_sample)
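
`flatten_and_track_mappings` and `reconstruct_dict` live in the `envs` module, which is not shown here. The sketch below is a guess at the general idea (flatten a nested action dict into a vector and remember each entry's key path), not the actual implementation. The helper names are hypothetical.

```python
def flatten_with_paths(nested, prefix=()):
    """Depth-first flatten of a nested dict into (values, key paths)."""
    values, paths = [], []
    for key, item in nested.items():
        if isinstance(item, dict):
            sub_values, sub_paths = flatten_with_paths(item, prefix + (key,))
            values += sub_values
            paths += sub_paths
        else:
            values.append(item)
            paths.append(prefix + (key,))
    return values, paths


def rebuild_from_paths(values, paths):
    """Inverse of flatten_with_paths (empty sub-dicts are not restored)."""
    out = {}
    for value, path in zip(values, paths):
        node = out
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return out


sample = {
    "source_blend": {"s1": {"j1": 1}},
    "blend_blend": {},
    "blend_demand": {"j1": {"p1": 1}},
    "tau": {"s1": 10},
    "delta": {"p1": 0},
}
flat, paths = flatten_with_paths(sample)
rebuilt = rebuild_from_paths(flat, paths)
print(flat)
print(paths)
print(rebuilt)
```

In this sketch the empty `blend_blend` entry contributes no vector component, so it is absent from the rebuilt dict.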

In [73]:
tau0   = {'s1': [10, 10, 10, 0, 0, 0]}
delta0 = {'p1': [0, 0, 0, 10, 10, 10]}
sigma = {"s1":{"q1": 0.06}} # Source concentrations
sigma_ub = {"p1":{"q1": 0.16}} # Demand concentrations UBs/LBs
sigma_lb = {"p1":{"q1": 0}}
s_inv_lb = {'s1': 0}
s_inv_ub = {'s1': 999}
d_inv_lb = {'p1': 0}
d_inv_ub = {'p1': 999}
betaT_d = {'p1': 1} # Price of sold products
betaT_s = {'s1': cfg["env"]["product_cost"]} # Cost of bought products
b_inv_ub = {"j1": 30} 
b_inv_lb = {j:0 for j in b_inv_ub.keys()}
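
A quick aggregate sanity check on this instance, independent of the environment's step timing. It assumes `tau0` is the amount offered by the source in each period and `delta0` the amount demanded in each period (my reading of the comments above), and that everything offered is bought.

```python
from itertools import accumulate

tau0 = {"s1": [10, 10, 10, 0, 0, 0]}
delta0 = {"p1": [0, 0, 0, 10, 10, 10]}
sigma, sigma_lb, sigma_ub = 0.06, 0.0, 0.16  # q1 at the source vs. demand bounds

cum_supply = list(accumulate(tau0["s1"]))
cum_demand = list(accumulate(delta0["p1"]))
# Stock that has to be held somewhere in the network at the end of each period.
stock = [s - d for s, d in zip(cum_supply, cum_demand)]

demand_coverable = all(x >= 0 for x in stock)
peak_stock = max(stock)
quality_ok = sigma_lb <= sigma <= sigma_ub
print(stock, peak_stock, demand_coverable, quality_ok)
```

The peak stock equals the upper bound of tank `j1` (30), so a plan that holds everything in `j1` would sit exactly at that bound. The source inventory bound (999) leaves room to hold stock upstream instead.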

In [74]:
env = BlendEnv(v = False, 
               D = cfg["env"]["D"], 
               Q = cfg["env"]["Q"], 
               P = cfg["env"]["P"], 
               B = cfg["env"]["B"], 
               Z = cfg["env"]["Z"], 
               M = cfg["env"]["M"],
               reg = cfg["env"]["reg"],
               reg_lambda = cfg["env"]["reg_lambda"],
               MAXFLOW = cfg["env"]["maxflow"],
               alpha = cfg["env"]["alpha"],
               beta = cfg["env"]["beta"],
               connections = connections,
               action_sample = action_sample,
               tau0 = tau0,delta0 = delta0,
               sigma = sigma,
               sigma_ub = sigma_ub, sigma_lb = sigma_lb,
               s_inv_lb = s_inv_lb, s_inv_ub = s_inv_ub,
               d_inv_lb = d_inv_lb, d_inv_ub = d_inv_ub,
               betaT_d = betaT_d, betaT_s = betaT_s,
               b_inv_ub = b_inv_ub,
               b_inv_lb = b_inv_lb)

env = Monitor(env)
env = DummyVecEnv([lambda: env])
env = VecNormalize(env, 
                   norm_obs=cfg["obs_normalizer"], 
                   norm_reward=cfg["reward_normalizer"])
env = VecCheckNan(env, raise_exception=True)

In [75]:
policy_kwargs = dict(
    # Recent SB3 versions expect a plain dict here; the list-of-dict form is deprecated
    net_arch=dict(pi = [cfg["model"]["arch_layersize"]] * cfg["model"]["arch_n"], 
                  vf = [cfg["model"]["arch_layersize"]] * cfg["model"]["arch_n"]),
    activation_fn = act_cls,
    log_std_init = cfg["model"]["log_std_init"]
)

In [76]:
policytype

'MlpPolicy'

In [77]:
print(policytype)

if optimizer_cls == DDPG:
    kwa = dict(policy = policytype, 
                env = env,
                tensorboard_log = "./logs",
                learning_rate = cfg["model"]["lr"])

else:
    kwa = dict(policy = policytype, 
                env = env,
                tensorboard_log = "./logs",
                clip_range = cfg["model"]["clip_range"],
                learning_rate = cfg["model"]["lr"],
                ent_coef = cfg["model"]["ent_coef"],
                use_sde = cfg["model"]["use_sde"],
                policy_kwargs = policy_kwargs)

model = optimizer_cls(**kwa)

if cfg["starting_point"]:
    model.set_parameters(cfg["starting_point"])

MlpPolicy


In [78]:
import datetime

if type(model.policy) == CustomRNN_ACP:
    policytype = "CRNN"
elif type(model.policy) == CustomMLP_ACP_simplest_std:
    policytype = "CMLP"
else:
    policytype = "MLP"
    
entcoef = str(model.ent_coef) if type(model) == PPO else ""
cliprange = str(model.clip_range(0)) if type(model) == PPO else ""
model_name = f"models/simplest/{cfg['id']}/{cfg['id']}_{datetime.datetime.now().strftime('%m%d-%H%M')}"

In [79]:
class CustomLoggingCallbackPPO(BaseCallback):
    def __init__(self, verbose=0):
        super().__init__(verbose)
        self.log_stds = []
        self.total_rewards = []
        self.update1 = True
        self.print_flag = False
        
        self.pen_M, self.pen_B, self.pen_P, self.pen_reg = [], [], [], []
        
    def _on_rollout_end(self) -> None:
        self.logger.record('train/learning_rate', self.model.learning_rate)
        self.logger.record('train/clip_range', self.model.clip_range(0))
        
        self.stds = th.exp(self.model.policy.log_std).mean().item()
        
        if self.stds > 50:
            print("clipping log-stds")
            # Reset in place: assigning a new Parameter would leave the optimizer pointing at the old tensor
            with th.no_grad():
                self.model.policy.log_std.fill_(2.0)
        
        self.logger.record("train/std", th.exp(self.model.policy.log_std).mean().item())
        self.logger.record("penalties/in_out", sum(self.pen_M)/len(self.pen_M))
        self.logger.record("penalties/buysell_bounds", sum(self.pen_B)/len(self.pen_B))
        self.logger.record("penalties/tank_bounds", sum(self.pen_P)/len(self.pen_P))
        self.logger.record("penalties/regterm", sum(self.pen_reg)/len(self.pen_reg))
        
        self.pen_M, self.pen_B, self.pen_P, self.pen_reg = [], [], [], []
        
    def _on_step(self) -> bool:
        log_std: th.Tensor = self.model.policy.log_std
        # print(self.locals)
        t = self.locals["infos"][0]['dict_state']['t']
        
        if self.locals["dones"][0]: # record info at each episode end
            self.pen_M.append(self.locals["infos"][0]["pen_tracker"]["M"])
            self.pen_B.append(self.locals["infos"][0]["pen_tracker"]["B"])
            self.pen_P.append(self.locals["infos"][0]["pen_tracker"]["P"])
            self.pen_reg.append(self.locals["infos"][0]["pen_tracker"]["reg"])
            
            self.log_stds.append(log_std.mean().item())
            self.total_rewards.append(self.locals['rewards'][0])
            
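            if self.locals['rewards'][0] > 200 and self.update1:
                # NOTE: SB3 takes the optimizer step size from model.lr_schedule,
                # so this assignment only changes the logged value
                self.model.learning_rate = 1e2
                # clip_range must stay callable: PPO calls it with the remaining progress
                self.model.clip_range = lambda _: 5e-2
                self.update1 = False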
        
        if self.num_timesteps%2048 < 6 and t == 1: # start printing
            self.print_flag = True
            
        if self.print_flag:
            print("\nt:", t)
            if np.isnan(self.locals['rewards'][0]) or np.isinf(self.locals['rewards'][0]):
                print(f"is invalid reward {self.locals['rewards'][0]}")
            for i in ['obs_tensor', 'actions', 'values', 'clipped_actions', 'new_obs', 'rewards']:
                if i in self.locals:
                    print(f"{i}: " + str(self.locals[i]))
            if t == 6:
                self.print_flag = False
                print(f"\n\nLog-Std at step {self.num_timesteps}: {log_std.detach().numpy()}")
                # print(f"\nAvg rewards over the last 100 episodes:{sum(self.total_rewards[-100:])/100} ; last reward: {self.total_rewards[-1]}")
                print("\n\n\n\n\n")
                
        return True
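
The rollout-end averages above divide by `len(...)`, which raises `ZeroDivisionError` if no episode finished during a rollout. A guarded mean is one way around that. `mean_or_nan` is a hypothetical helper. SB3 also ships a `safe_mean` in `stable_baselines3.common.utils` that, as far as I know, behaves similarly.

```python
def mean_or_nan(values):
    """Average of `values`, or NaN when the list is empty."""
    return sum(values) / len(values) if values else float("nan")


print(mean_or_nan([-5, 0, -1]))
print(mean_or_nan([]))
```

It would be used as `self.logger.record("penalties/in_out", mean_or_nan(self.pen_M))` and likewise for the other penalty lists.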

In [80]:
class CustomLoggingCallbackDDPG(BaseCallback):
    def __init__(self, verbose=0):
        super().__init__(verbose)
        self.total_rewards = []
        self.update1 = True
        self.print_flag = False
        
        self.pen_M, self.pen_B, self.pen_P, self.pen_reg = [], [], [], []
        
    def _on_rollout_end(self) -> None: ...
        
    def _on_step(self) -> bool:
        # print(self.locals)
        t = self.locals["infos"][0]['dict_state']['t']
        # print(self.locals["infos"][0]["pen_tracker"])
        
        if self.locals["dones"][0]: # record info at each episode end
            self.pen_M.append(self.locals["infos"][0]["pen_tracker"]["M"])
            self.pen_B.append(self.locals["infos"][0]["pen_tracker"]["B"])
            self.pen_P.append(self.locals["infos"][0]["pen_tracker"]["P"])
            self.pen_reg.append(self.locals["infos"][0]["pen_tracker"]["reg"])
            
            self.total_rewards.append(self.locals['rewards'][0])
            
            # if self.locals['rewards'][0] > 200 and self.update1:
            #     self.model.learning_rate = 1e2
            #     self.update1 = False
        
        if self.num_timesteps%2048 < 6 and t == 1: # start printing
            self.print_flag = True
            
        if self.print_flag:
            print("\nt:", t)
            if np.isnan(self.locals['rewards'][0]) or np.isinf(self.locals['rewards'][0]):
                print(f"is invalid reward {self.locals['rewards'][0]}")
            for i in ['obs_tensor', 'actions', 'values', 'new_obs', 'rewards']:
                if i in self.locals:
                    print(f"{i}: " + str(self.locals[i]))
            if t == 6:
                self.print_flag = False
                # print(f"\nAvg rewards over the last 100 episodes:{sum(self.total_rewards[-100:])/100} ; last reward: {self.total_rewards[-1]}")
                
                self.logger.record('train/learning_rate', self.model.learning_rate)
                self.logger.record("penalties/in_out", sum(self.pen_M)/len(self.pen_M))
                self.logger.record("penalties/buysell_bounds", sum(self.pen_B)/len(self.pen_B))
                self.logger.record("penalties/tank_bounds", sum(self.pen_P)/len(self.pen_P))
                self.logger.record("penalties/regterm", sum(self.pen_reg)/len(self.pen_reg))
        
                self.pen_M, self.pen_B, self.pen_P, self.pen_reg = [], [], [], []   
                
                print("\n\n\n\n\n")
                
        return True

In [81]:
log_callback = CustomLoggingCallbackPPO() if optimizer_cls == PPO else CustomLoggingCallbackDDPG()
callback = CallbackList([log_callback])
model_name

'models/simplest/12/12_0716-1703'

In [82]:
logpath = model_name[len("models/"):]
print(f"logging at {logpath}")
model.learn(total_timesteps = 100000, 
            progress_bar = False, 
            tb_log_name = logpath, 
            callback = callback,
            reset_num_timesteps = False
            )

logging at simplest/12/12_0716-1703

t: 1
obs_tensor: tensor([[ 0.,  0.,  0.,  0., 10.,  0., 10.,  0., 10.,  0.,  0., 10.,  0., 10.,
          0., 10.,  0.]])
actions: [[ 0.1070335  -0.22346415  0.10184398 -0.16763408]]
values: tensor([[-4.2488]])
clipped_actions: [[0.1070335  0.         0.10184398 0.        ]]
new_obs: [[ 0.          0.10184398  0.          0.06       10.          0.
  10.          0.          0.         10.          0.         10.
   0.         10.          0.          0.          1.        ]]
rewards: [-10.]

t: 2
obs_tensor: tensor([[ 0.0000,  0.1018,  0.0000,  0.0600, 10.0000,  0.0000, 10.0000,  0.0000,
          0.0000, 10.0000,  0.0000, 10.0000,  0.0000, 10.0000,  0.0000,  0.0000,
          1.0000]])
actions: [[ 0.15069139 -0.07826404  0.03532644 -0.01408055]]
values: tensor([[-2.8299]])
clipped_actions: [[0.15069139 0.         0.03532644 0.        ]]
new_obs: [[ 0.          0.13717043  0.          0.07545222 10.          0.
   0.         10.          0.        

<stable_baselines3.ppo.ppo.PPO at 0x19574011520>

In [15]:
model.save(model_name)

In [15]:
# M,Q,P,B,Z,D = 10, 0, 5, 5, 1, 0
M, Q, P, B, Z, D  = cfg["env"]["M"], cfg["env"]["Q"], cfg["env"]["P"], cfg["env"]["B"], cfg["env"]["Z"], 0
# M,Q,P,B,Z,D = 0, 0, 0, 0, 1, 0

In [16]:
env = BlendEnv(v = True, 
               D = cfg["env"]["D"], 
               Q = cfg["env"]["Q"], 
               P = cfg["env"]["P"], 
               B = cfg["env"]["B"], 
               Z = cfg["env"]["Z"], 
               M = cfg["env"]["M"],
               reg = cfg["env"]["reg"],
               reg_lambda = cfg["env"]["reg_lambda"],
               MAXFLOW = cfg["env"]["maxflow"],
               alpha = cfg["env"]["alpha"],
               beta = cfg["env"]["beta"],
               connections = connections, 
               action_sample = action_sample,
               tau0 = tau0,delta0 = delta0,
               sigma = sigma,
               sigma_ub = sigma_ub, sigma_lb = sigma_lb,
               s_inv_lb = s_inv_lb, s_inv_ub = s_inv_ub,
               d_inv_lb = d_inv_lb, d_inv_ub = d_inv_ub,
               betaT_d = betaT_d, betaT_s = betaT_s,
               b_inv_ub = b_inv_ub,
               b_inv_lb = b_inv_lb)
env = Monitor(env)
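
This evaluation environment is wrapped in `Monitor` only, while training used `VecNormalize`. If `cfg["obs_normalizer"]` is true, the policy was trained on normalized observations, so raw observations would need the same transform before `model.predict`. The sketch below shows the transform as I understand `VecNormalize` to apply it. The function and the example statistics are illustrative, and in practice the running statistics from the training wrapper would be reused (for instance through `model.get_vec_normalize_env()` and its `normalize_obs` method).

```python
from math import sqrt


def normalize_obs(obs, mean, var, clip_obs=10.0, epsilon=1e-8):
    """Element-wise (obs - mean) / sqrt(var + epsilon), clipped to +/- clip_obs."""
    return [
        max(-clip_obs, min(clip_obs, (o - m) / sqrt(v + epsilon)))
        for o, m, v in zip(obs, mean, var)
    ]


# Hypothetical running statistics, only to show the effect.
example = normalize_obs([10.0, 0.0], mean=[5.0, 0.0], var=[25.0, 1.0])
clipped = normalize_obs([1000.0], mean=[0.0], var=[1.0])
print(example, clipped)
```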

In [17]:
with th.autograd.set_detect_anomaly(True):
    obs, obs_dict = env.reset()  # reset returns (obs, info)
    for k in range(env.T):
        action, _ = model.predict(obs, deterministic=False)
        print(env.pen_tracker)
        print("\n\n   ",reconstruct_dict(action, env.mapping_act))
        # step returns (obs, reward, terminated, truncated, info)
        obs, reward, terminated, truncated, _ = env.step(action)
        dobs = reconstruct_dict(obs, env.mapping_obs)
        print("\n    >>     ",dobs["sources"], dobs["blenders"], dobs["demands"])
        print("   " ,reward)
        if terminated or truncated:  # Monitor refuses further steps until reset
            break
        

{'M': 0, 'B': 0, 'P': 0, 'reg': 0}


    {'source_blend': {'s1': {'j1': 0.0}}, 'blend_demand': {'j1': {'p1': 0.13335086}}, 'tau': {'s1': 0.106610715}, 'delta': {'p1': 0.022796683}}
[PEN] t1; p1:			sold too much (more than demand)
Increased reward by 0.10661071538925171 through tank population in s1
j1: inv: 0, in_flow_sources: 0.0, in_flow_blend: 0, out_flow_blend: 0, out_flow_demands: 0.13335086405277252
j1: b: 0.0
[PEN] t1; j1:			inventory OOB (resulting amount less than blending tank LB)
Increased reward by 0 through tank population in j1
Increased reward by 0 through tank population in p1

    >>      {'s1': 0.106610715} {'j1': 0.0} {'p1': 0.0}
    -10.04953682422638
{'M': 0, 'B': -5, 'P': -5, 'reg': -0.2627582550048828}


    {'source_blend': {'s1': {'j1': 0.0}}, 'blend_demand': {'j1': {'p1': 0.10351893}}, 'tau': {'s1': 0.0}, 'delta': {'p1': 0.0}}
Increased reward by 0 through tank population in s1
j1: inv: 0.0, in_flow_sources: 0.0, in_flow_blend: 0, out_flow_blend: 0, out_flow_d

In [24]:
th.Tensor(action)

tensor([0.0336, 0.0893, 0.0000, 0.0000])

In [253]:
# 0 (only once per episode)
episode_rewards = []
obs = env.reset()
obs, obs_dict = obs

In [262]:
# 1 Get first action
print(env.t)
action, _ = model.predict(obs, deterministic=True)

2


In [263]:
print(env.t)
d = reconstruct_dict(obs, env.mapping_obs)
print(d["sources"])
print(d["blenders"])
print(d["demands"])
print(d["properties"])

2
{'s1': 17.46205}
{'j1': 0.0}
{'p1': 0.0}
{'j1': {'q1': 0.0}}


In [264]:
# 2 Visualize action
print(env.t)
reconstruct_dict(action, env.mapping_act)

2


{'source_blend': {'s1': {'j1': 0.0}},
 'blend_demand': {'j1': {'p1': 30.307917}},
 'tau': {'s1': 8.731916},
 'delta': {'p1': 17.08481}}

In [265]:
# 3
# Step once: get 2nd action
print(env.t)
obs, reward, terminated, truncated, _ = env.step(action)  # (obs, reward, terminated, truncated, info)

2


In [207]:
# 4 Visualize new state
print(env.t)
d = reconstruct_dict(obs, env.mapping_obs)
print(d["sources"])
print(d["blenders"])
print(d["demands"])
print(d["properties"])

3
{'s1': 26.193966}
{'j1': 0.0}
{'p1': 0.0}
{'j1': {'q1': 0.0}}
