- Increasingly difficult Environment
- Positive reward for populating increasingly "deep" blending tanks ?
- RL for chemical scheduling paper (https://arxiv.org/pdf/2203.00636)
- Masking (https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html, https://arxiv.org/pdf/2006.14171)
    - Adding binary decision variables ?
    - Requires a discrete action space (only integer flows -> treated as categories ?)
    - Masking: disable incoming flows for tanks at their upper inventory bound and outgoing flows for tanks at their lower inventory bound; disable selling/buying when the available quantity is 0
    - multiple envs with multiple agents ? (MARL, https://arxiv.org/pdf/2103.01955)
        - Predict successive pipelines ("source > blend" then "blend > blend" (as many as required) then "blend > demand")
        - Each agent has access to the whole state
        - Action mask is derived from the previous agent's actions (0 if inventory at bounds or incoming flow already reserved, else 1)
        - https://github.com/Rohan138/marl-baselines3/blob/main/marl_baselines3/independent_ppo.py
- Safe RL: (https://proceedings.mlr.press/v119/wachi20a/wachi20a.pdf)
    - "Unsafe state" ? > Do not enforce constraints strictly, instead opt for early episode termination to show which states are unsafe ? 
    - Implementations:
        - https://pypi.org/project/fast-safe-rl/#description (Policy optimizers)
        - https://github.com/PKU-Alignment/safety-gymnasium/tree/main/safety_gymnasium (environments; "cost" ?)
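
The masking rule noted above (no incoming flow for a tank at its upper bound, no outgoing flow for a tank at its lower bound) can be sketched as follows. This is only an illustration under assumptions: `flow_mask`, the `connections` layout and the inventory dicts are hypothetical and do not reflect the actual structures used by `BlendEnv`.

```python
def flow_mask(inv, inv_lb, inv_ub, connections, tol=1e-9):
    """Return {(origin, destination): 0 or 1} for every pipeline.

    A flow is disabled (0) when its origin tank is at its lower inventory
    bound (nothing left to send) or its destination tank is at its upper
    inventory bound (no room left); otherwise it is enabled (1).
    """
    mask = {}
    for origin, destinations in connections.items():
        for dest in destinations:
            origin_empty = inv[origin] <= inv_lb[origin] + tol
            dest_full = inv[dest] >= inv_ub[dest] - tol
            mask[(origin, dest)] = 0 if (origin_empty or dest_full) else 1
    return mask


# Hypothetical layout: one source, two blenders, one demand tank
connections = {"s1": ["j1", "j2"], "j1": ["p1"], "j2": ["p1"]}
inv = {"s1": 10.0, "j1": 30.0, "j2": 0.0, "p1": 0.0}
inv_lb = {k: 0.0 for k in inv}
inv_ub = {"s1": 999.0, "j1": 30.0, "j2": 30.0, "p1": 999.0}

mask = flow_mask(inv, inv_lb, inv_ub, connections)
# j1 is full, so s1 -> j1 is masked; j2 is empty, so j2 -> p1 is masked
print(mask)
```

With a discrete action space, a mask of this kind is what a maskable policy would consume; for continuous flows it could only serve as a gate on each flow variable.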


- Masking: Discretization of action space is too slow/might not work -> Need to implement masking for continuous action space
- Recurrent policy makes the most sense ? (window of demand forecasts)
- https://www.reddit.com/r/reinforcementlearning/comments/17l5b47/invalid_action_masking_when_action_space_is/
    - Suggestion of autoregressive model for having constraints respected: one predicted action is input to a second model
    - Suggestion of editing the distribution in such a way that the constraint is respected
- https://www.sciencedirect.com/science/article/pii/S0098135420301599
    - Choice of ELU activation ?
    - Choice of NN size ?
    - "The feature engineering in the net inventory means the network does not have to learn these relationships itself, which did help speed training." ?
- Simplify the problem (remove tanks 5 to 8), find the optimal solution with Gurobi
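
For the continuous case, one simple stand-in for masking is to rescale the proposed flows after sampling so the tank constraint holds. This is a deterministic projection of the action, not the distribution editing or the autoregressive model suggested in the Reddit thread. A minimal sketch, assuming non-negative flows out of a single tank; `rescale_outflows` and `available` are hypothetical names:

```python
import numpy as np


def rescale_outflows(flows, available):
    """Scale the outflows of one tank so that their sum does not exceed `available`.

    Negative proposals are clipped to 0 first. If the total already fits,
    the flows are returned unchanged; otherwise all flows are shrunk by the
    same factor so the total equals `available`.
    """
    flows = np.clip(np.asarray(flows, dtype=float), 0.0, None)
    available = max(float(available), 0.0)
    total = flows.sum()
    if total <= available:
        return flows
    return flows * (available / total)


# Tank holds 15 units but the policy asks for 20 + 10: both flows are halved
print(rescale_outflows([20.0, 10.0], available=15.0))
# Request already feasible: returned unchanged
print(rescale_outflows([3.0, 4.0], available=15.0))
```

Proportional scaling keeps the ratio between flows chosen by the policy, which is why it is used here rather than clipping each flow independently.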

- Proportional penalty instead of flat
- Solution pool from Gurobi for data generation
- Uniformize the distribution profile
    - Idea is to remove start/end of episode effects to make the distribution simpler (see photo)
    - -> Simulate an infinite supply/demand profile
    - -> Simulate the env with 12 time periods, only use the first 6 for the data, then do the same by shifting by 12.
        - -> Need to implement non-zero initial inv states for both gym and gurobi
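
The windowing idea above (solve over 12 periods, keep only the first 6, then shift) could look like the sketch below. The function name and its `horizon`, `keep` and `shift` parameters are hypothetical; the defaults simply copy the numbers from the note, and carrying over the inventory between windows is not handled here.

```python
import numpy as np


def windowed_profiles(profile, horizon=12, keep=6, shift=12):
    """Cut a long supply or demand profile into solve windows.

    Returns a list of (start, window, kept) tuples, where `window` holds the
    `horizon` periods given to the solver and `kept` holds the first `keep`
    periods that would actually be stored as data.
    """
    profile = np.asarray(profile, dtype=float)
    chunks = []
    for start in range(0, len(profile) - horizon + 1, shift):
        window = profile[start:start + horizon]
        chunks.append((start, window, window[:keep]))
    return chunks


rng = np.random.default_rng(0)
long_supply = rng.normal(20, 3, size=36)  # stand-in for an "infinite" profile

chunks = windowed_profiles(long_supply)
print([(start, len(window), len(kept)) for start, window, kept in chunks])
```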

- Clarify how DT works at inference time

In [1]:
import sys, os
sys.path.append(os.path.dirname(os.path.abspath(os.getcwd())))
try:
    print(curr_dir)
except NameError:
    # First run of this cell: move to the parent of the current working directory
    curr_dir = os.path.dirname(os.path.abspath(os.getcwd()))
    os.chdir(curr_dir)
    print(curr_dir)

c:\Users\adame\OneDrive\Bureau\CODE\BlendingRL


In [2]:
import json
import numpy as np
import torch as th
from stable_baselines3 import PPO, DDPG, SAC, TD3
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.callbacks import *
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize, VecCheckNan
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.utils import safe_mean

from envs import BlendEnv, flatten_and_track_mappings, reconstruct_dict
from models import *
from utils import *
from math import exp, log
import yaml
from datetime import datetime
from PIL import Image

import warnings
warnings.filterwarnings("ignore")

In [3]:
# Regex for tensorboard
# Gives the current day's runs of the given config list
from datetime import datetime
L = []
cfg_list = range(48, 55)
for cfg in cfg_list:
    L.append("(" + str(cfg) + "_" + datetime.now().strftime('%m%d') + ")")
"|".join(L)

'(48_0114)|(49_0114)|(50_0114)|(51_0114)|(52_0114)|(53_0114)|(54_0114)'

In [7]:
L = []
cfg_list = range(41, 48)
for cfg in cfg_list:
    L.append("(/" + str(cfg) + "/)")
"|".join(L)

'(/41/)|(/42/)|(/43/)|(/44/)|(/45/)|(/46/)|(/47/)'

In [5]:
############## ENV CONFIGURATION ##############
CONFIG = 65         # See /configs
layout = "simple" # See /img
###############################################

In [6]:
with open(f"./configs/{CONFIG}.yaml", "r") as f:
    s = "".join(f.readlines())
    cfg = yaml.load(s, Loader=yaml.FullLoader)

In [7]:
if cfg["custom_softmax"]:
    policytype = CustomMLP_ACP_simplest_softmax
elif cfg["policytype"] == "MLP":
    policytype = "MlpPolicy"
elif cfg["policytype"] == "MLPtanh":
    policytype = CustomMLP_ACP_simplest_tanh
else:
    raise ValueError(f"Unknown policytype: {cfg['policytype']}")

# Look the algorithm up by name instead of eval()-ing the config string
optimizer_cls = {"PPO": PPO, "DDPG": DDPG, "SAC": SAC, "TD3": TD3}[cfg["optimizer"]]

# Unknown activation names now fail here with a KeyError instead of leaving act_cls undefined
act_cls = {"ReLU": th.nn.ReLU, "tanh": th.nn.Tanh, "sigmoid": th.nn.Sigmoid}[cfg["model"]["act_fn"]]

In [8]:
connections, action_sample = get_jsons(layout)
sources, blenders, demands = get_sbp(connections)

In [9]:
T = 6
if layout == "base":
    sigma = {"s1":{"q1": 0.06}, "s2":{"q1": 0.26}}
    sigma_ub = {"p1":{"q1": 0.16}, "p2":{"q1": 1}}
    sigma_lb = {"p1":{"q1": 0}, "p2":{"q1": 0}}
else:
    sigma = {s:{"q1": 0.06} for s in sources}
    sigma_ub = {d:{"q1": 0.16} for d in demands}
    sigma_lb = {d:{"q1": 0} for d in demands}
    
s_inv_lb = {s: 0 for s in sources}
s_inv_ub = {s: 999 for s in sources}
d_inv_lb = {d: 0 for d in demands}
d_inv_ub = {d: 999 for d in demands}
betaT_d = {d: 1 for d in demands} # Price of sold products
b_inv_ub = {j: 30 for j in blenders} 
b_inv_lb = {j: 0 for j in blenders}
betaT_s = {s: cfg["env"]["product_cost"]  for s in sources} # Cost of bought products

if cfg["env"]["uniform_data"]:
    length = 13
    if cfg["env"]["max_pen_violations"] < 999:
        length = 50
        T = length
        
    tau0   = {s: [np.random.normal(20, 3) for _ in range(length)] for s in sources}
    delta0 = {d: [np.random.normal(20, 3) for _ in range(length)] for d in demands}
    
else:
    if cfg["env"]["shrink_data"]:
        tau0   = {s: [1, 1, 1, 0, 0, 0] for s in sources}
        delta0 = {d: [0, 0, 0, 1, 1, 1] for d in demands}
    else:
        tau0   = {s: [10, 10, 10, 0, 0, 0] for s in sources}
        delta0 = {d: [0, 0, 0, 10, 10, 10] for d in demands}

In [10]:
env = BlendEnv(v = False, T = T, layout = layout,
               D = cfg["env"]["D"], Q = cfg["env"]["Q"], P = cfg["env"]["P"], B = cfg["env"]["B"], Z = cfg["env"]["Z"], M = cfg["env"]["M"],
               reg = cfg["env"]["reg"], reg_lambda = cfg["env"]["reg_lambda"], L0_pen = cfg["env"]["L0_pen"],
               MAXFLOW = cfg["env"]["maxflow"], alpha = cfg["env"]["alpha"], beta = cfg["env"]["beta"], 
               max_pen_violations = cfg["env"]["max_pen_violations"], illeg_act_handling = cfg["env"]["illeg_act_handling"],
               connections = connections, action_sample = action_sample, 
               tau0 = tau0, delta0 = delta0, sigma = sigma,
               sigma_ub = sigma_ub, sigma_lb = sigma_lb,
               s_inv_lb = s_inv_lb, s_inv_ub = s_inv_ub,
               d_inv_lb = d_inv_lb, d_inv_ub = d_inv_ub,
               betaT_d = betaT_d, betaT_s = betaT_s,
               b_inv_ub = b_inv_ub,
               b_inv_lb = b_inv_lb)

In [11]:
env = Monitor(env)
env = DummyVecEnv([lambda: env])
env = VecNormalize(env, 
                   norm_obs=cfg["obs_normalizer"], 
                   norm_reward=cfg["reward_normalizer"])
# env = VecCheckNan(env, raise_exception=True)

In [12]:
policy_kwargs = dict(
    # SB3 >= 1.8 expects a dict here (a list of dicts is deprecated)
    net_arch=dict(pi = [cfg["model"]["arch_layersize"]] * cfg["model"]["arch_n"], 
                  vf = [cfg["model"]["arch_layersize"]] * cfg["model"]["arch_n"]),
    activation_fn = act_cls,
    # log_std_init = cfg["model"]["log_std_init"]
)

In [13]:
print(policytype)

if optimizer_cls == PPO:
    kwa = dict(policy = policytype, 
                env = env,
                tensorboard_log = "./logs2",
                clip_range = cfg["model"]["clip_range"],
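                # SB3 passes the remaining progress p, which goes from 1 (start) to 0 (end): lr at the start, lr_end at the end
                learning_rate = cfg["model"]["lr"] if not cfg["model"]["lr_sched"] else (lambda p: cfg["model"]["lr_end"] + (cfg["model"]["lr"] - cfg["model"]["lr_end"]) * p),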
                ent_coef = cfg["model"]["ent_coef"],
                use_sde = cfg["model"]["use_sde"],
                batch_size = cfg["model"]["batch_size"],
                policy_kwargs = policy_kwargs)
    
else:
    kwa = dict(policy = policytype, 
                env = env,
                tensorboard_log = "./logs2",
                batch_size = cfg["model"]["batch_size"],
                learning_rate = cfg["model"]["lr"])

model = optimizer_cls(**kwa)

MlpPolicy
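
When `learning_rate` is a callable, SB3 calls it with the remaining progress, which goes from 1 at the start of training to 0 at the end. A minimal sketch of a linear schedule under that convention; `linear_schedule`, `lr_start` and `lr_end` are illustrative names, not part of SB3 or of this notebook's config:

```python
def linear_schedule(lr_start, lr_end):
    """Return a schedule that starts at lr_start and ends at lr_end.

    The returned function takes progress_remaining, which is 1.0 at the
    first update and decreases to 0.0 at the last one.
    """
    def schedule(progress_remaining):
        return lr_end + (lr_start - lr_end) * progress_remaining

    return schedule


sched = linear_schedule(lr_start=1e-3, lr_end=1e-4)
for p in (1.0, 0.5, 0.0):
    print(p, sched(p))  # the value decreases from lr_start towards lr_end
```

Such a function can be passed directly as `learning_rate=` to the algorithm constructor.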


In [14]:

if cfg["starting_point"]:
    try:
        cfg_start = int(cfg["starting_point"])
        bin_ = get_bin(cfg_start)
        directory = f"C:\\Users\\adame\\OneDrive\\Bureau\\CODE\\BlendingRL\\models\\{layout}\\{bin_}\\{cfg_start}"
        # Pick the most recently modified checkpoint of the starting config
        chosen, mod_chosen = "", 0
        for f in os.listdir(directory):
            mod_time = os.path.getmtime(os.path.join(directory, f))
            if mod_time > mod_chosen:
                mod_chosen = mod_time
                chosen = os.path.join(f"models\\{layout}\\{bin_}\\{cfg_start}", f)
        model.set_parameters(chosen)
        
    except ValueError:
        # starting_point is a path to a saved model rather than a config id
        model.set_parameters(cfg["starting_point"])

In [15]:
# PPO(learning_rate=lambda p: a + (b-a)* p)  # p = progress remaining (1 -> 0), so this anneals from b to a

If batch_size = 64 and n_steps = 2048 (with a single environment), then 1 epoch = 2048/64 = 32 minibatches
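
The same arithmetic as a quick check. The values below are the SB3 PPO defaults for `n_steps`, `batch_size` and `n_epochs`; `n_envs = 1` is an assumption matching the single `DummyVecEnv` used here.

```python
n_steps = 2048     # steps collected per environment before each update
n_envs = 1         # assumption: a single environment
batch_size = 64    # minibatch size
n_epochs = 10      # passes over the rollout buffer per update

rollout_size = n_steps * n_envs
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_update = minibatches_per_epoch * n_epochs

print(minibatches_per_epoch)       # 32
print(gradient_steps_per_update)   # 320
```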

In [16]:
bin_ = get_bin(cfg['id'])
entcoef = str(model.ent_coef) if type(model) == PPO else ""
cliprange = str(model.clip_range(0)) if type(model) == PPO else ""
model_name = f"models/{layout}/{bin_}/{cfg['id']}/{cfg['id']}_{datetime.now().strftime('%m%d-%H%M')}"
model_name

'models/simple/61-72/65/65_1225-1535'

In [17]:
class CustomLoggingCallbackPPO(BaseCallback):
    def __init__(self, schedule_timesteps, start_log_std=2, end_log_std=-1, std_control = None, model_name = None, v = None):
        super().__init__(verbose = 0)
        self.std_control = std_control
        
        self.start_log_std = start_log_std
        self.end_log_std = end_log_std
        self.schedule_timesteps = schedule_timesteps
        self.current_step = 0
        # self.perfs = []
        self.print_flag = False
        self.model_name = model_name
        
        self.pen_M, self.pen_B, self.pen_P, self.pen_reg = [], [], [], []
        self.n_pen_M, self.n_pen_B, self.n_pen_P, self.n_pen_Q, self.pen_nv, self.pen_nv_counted = [], [], [], [], [], []
        self.units_sold, self.units_bought, self.rew_sold, self.rew_depth = [], [], [], []
        self.v = v
        
    def _on_rollout_end(self) -> None:
        self.logger.record("penalties/in_out",              sum(self.pen_M)/len(self.pen_M))
        self.logger.record("penalties/buysell_bounds",      sum(self.pen_B)/len(self.pen_B))
        self.logger.record("penalties/tank_bounds",         sum(self.pen_P)/len(self.pen_P))
        
        self.logger.record("penalties/n_in_out",            sum(self.n_pen_M)/len(self.n_pen_M))
        self.logger.record("penalties/n_buysell_bounds",    sum(self.n_pen_B)/len(self.n_pen_B))
        self.logger.record("penalties/n_tank_bounds",       sum(self.n_pen_P)/len(self.n_pen_P))
        self.logger.record("penalties/n_concentration",     sum(self.n_pen_Q)/len(self.n_pen_Q))
        self.logger.record("penalties/n_vltn",              sum(self.pen_nv)/len(self.pen_nv))
        self.logger.record("penalties/n_vltn_counted",      sum(self.pen_nv_counted)/len(self.pen_nv_counted))
        
        self.logger.record("penalties/units_sold",          sum(self.units_sold)/len(self.units_sold))
        self.logger.record("penalties/units_bought",        sum(self.units_bought)/len(self.units_bought))
        self.logger.record("penalties/rew_sold",            sum(self.rew_sold)/len(self.rew_sold))
        self.logger.record("penalties/rew_depth",           sum(self.rew_depth)/len(self.rew_depth))
        
        self.pen_M, self.pen_B, self.pen_P, self.pen_reg = [], [], [], []
        self.n_pen_M, self.n_pen_B, self.n_pen_P, self.n_pen_Q, self.pen_nv, self.pen_nv_counted = [], [], [], [], [], []
        self.units_sold, self.units_bought, self.rew_sold, self.rew_depth = [], [], [], []
        
        # this_model_name = self.model_name + datetime.now().strftime('%m%d-%H%M%S')
        # self.perfs[this_model_name] = safe_mean([ep_info["r"] for ep_info in model.ep_info_buffer])
        # self.model.save(this_model_name)
        
        
    def _on_step(self) -> bool:
        log_std: th.Tensor = self.model.policy.log_std
        t = self.locals["infos"][0]['dict_state']['t']
        
        if self.locals["infos"][0]["terminated"] or self.locals["infos"][0]["truncated"]: # record info at each episode end
            self.pen_M.append(self.locals["infos"][0]["pen_tracker"]["M"])
            self.pen_B.append(self.locals["infos"][0]["pen_tracker"]["B"])
            self.pen_P.append(self.locals["infos"][0]["pen_tracker"]["P"])
            
            self.n_pen_M.append(self.locals["infos"][0]["pen_tracker"]["n_M"])
            self.n_pen_B.append(self.locals["infos"][0]["pen_tracker"]["n_B"])
            self.n_pen_P.append(self.locals["infos"][0]["pen_tracker"]["n_P"])
            self.n_pen_Q.append(self.locals["infos"][0]["pen_tracker"]["n_Q"])
            self.pen_nv.append(self.locals["infos"][0]["pen_tracker"]["n_M"] +
                               self.locals["infos"][0]["pen_tracker"]["n_B"] +
                               self.locals["infos"][0]["pen_tracker"]["n_Q"] +
                               self.locals["infos"][0]["pen_tracker"]["n_P"])
            # Parenthesize each term: a conditional expression binds more loosely than "+"
            self.pen_nv_counted.append((self.locals["infos"][0]["pen_tracker"]["n_M"] if cfg["env"]["M"] > 0 else 0) +
                                       (self.locals["infos"][0]["pen_tracker"]["n_B"] if cfg["env"]["B"] > 0 else 0) +
                                       (self.locals["infos"][0]["pen_tracker"]["n_Q"] if cfg["env"]["Q"] > 0 else 0) +
                                       (self.locals["infos"][0]["pen_tracker"]["n_P"] if cfg["env"]["P"] > 0 else 0))
            # print(self.pen_nv, self.pen_nv_counted, self.n_pen_M, self.n_pen_B, self.n_pen_P)
            self.units_sold.append(self.locals["infos"][0]["pen_tracker"]["units_sold"])
            self.units_bought.append(self.locals["infos"][0]["pen_tracker"]["units_bought"])
            self.rew_sold.append(self.locals["infos"][0]["pen_tracker"]["rew_sold"])
            self.rew_depth.append(self.locals["infos"][0]["pen_tracker"]["rew_depth"])
        
        if self.v:
            if self.num_timesteps%5000 < 6 and t == 1: # start printing
                self.print_flag = True
                
            if self.print_flag:
                if self.v == "text":
                    print("\nt:", t)
                    if np.isnan(self.locals['rewards'][0]) or np.isinf(self.locals['rewards'][0]):
                        print(f"is invalid reward {self.locals['rewards'][0]}")
                    for i in ['obs_tensor', 'clipped_actions', 'rewards']:
                        if i in self.locals:
                            print(f"{i}: " + str(self.locals[i]))
                    

                elif self.v == "img":
                    img_dir = f"{model_name.replace('models', 'logs2')}_0/img"
                    # Create the folder up front so the first frame is saved too
                    os.makedirs(img_dir, exist_ok=True)
                    img = self.training_env.get_attr("render")[0]()
                    pil_image = Image.fromarray(img)
                    pil_image.save(f"{img_dir}/{self.num_timesteps}_{t}.png")
                
                if t == 6:
                    self.print_flag = False
                    print(f"\n\nLog-Std at step {self.num_timesteps}: {log_std.detach().cpu().numpy()}")
                    print(self.locals["infos"][0]["pen_tracker"])
                    print("\n\n\n\n\n")
        
        if self.std_control:
            progress = self.current_step / self.schedule_timesteps
            new_log_std = self.start_log_std + progress * (self.end_log_std - self.start_log_std)
            self.model.policy.log_std.data.fill_(new_log_std)
            self.current_step += 1
        
        return True

In [18]:
os.getcwd()

'c:\\Users\\adame\\OneDrive\\Bureau\\CODE\\BlendingRL'

In [19]:
total_timesteps = int(1e5)  # learn() expects an integer number of steps
log_callback = CustomLoggingCallbackPPO(schedule_timesteps=total_timesteps, 
                                        std_control = cfg["clipped_std"],
                                        model_name = model_name, 
                                        v = "img") if optimizer_cls == PPO else CustomLoggingCallbackDDPG()
callback = CallbackList([log_callback])
model_name

'models/simple/61-72/65/65_1225-1535'

In [20]:
logpath = model_name[len("models/"):]
print(f"logging at {logpath}")
model.learn(total_timesteps = total_timesteps,
            progress_bar = False,
            tb_log_name = logpath,
            callback = callback,
            reset_num_timesteps = False)

logging at simple/61-72/65/65_1225-1535


Log-Std at step 6: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
{'M': -120, 'B': -70.69430910050869, 'P': -9.59325310921998, 'Q': 0, 'n_M': 12, 'n_B': 17, 'n_P': 3, 'n_Q': 0, 'units_sold': 0.0, 'units_bought': 6.6315078139305115, 'rew_sold': 0.0, 'rew_depth': 7.935308179591971, 'rew_bought': 0.0}








Log-Std at step 5010: [ 0.01100536  0.00716582  0.01837818  0.01416336 -0.0116841   0.01679907
  0.00834745  0.01354424 -0.00936338  0.01154996  0.01022923  0.01095804
  0.01747564  0.01549258  0.02031877 -0.00291336  0.01864661  0.00682647
  0.01520657  0.018809  ]
{'M': -10, 'B': -57.29678297042847, 'P': -99.55651724338531, 'Q': 0, 'n_M': 1, 'n_B': 6, 'n_P': 13, 'n_Q': 0, 'units_sold': 0.0, 'units_bought': 36.54814672470093, 'rew_sold': 0.0, 'rew_depth': 35.355886340141296, 'rew_bought': 0.0}








Log-Std at step 10008: [ 0.01831699  0.0123056   0.0318554   0.01898001 -0.00406772  0.02410472
  0.00346603  0.00282144  0.000

KeyboardInterrupt: 

In [30]:
import re

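def save_next_file(directory, model_name):
    """Save the global `model` as <model_name>_<k>, with k one above the highest index found in `directory`."""
    # Escape the name so characters such as "-" or "." are matched literally
    base_pattern = re.compile(re.escape(model_name) + r"_(\d+)\.zip")
    
    # Create the directory (and any missing parents) if needed
    os.makedirs(directory, exist_ok=True)
    files = os.listdir(directory)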
        
    max_number = 0
    for file in files:
        match = base_pattern.match(file)
        if match:
            number = int(match.group(1))
            max_number = max(max_number, number)
    
    # Generate the next filename
    next_file_number = max_number + 1
    next_file_name = f"{model_name}_{next_file_number}"
    next_file_path = os.path.join(directory, next_file_name)
    
    model.save(next_file_path)
    
save_next_file(os.path.dirname(model_name), os.path.basename(model_name) )

# Testing

In [72]:
model = PPO.load("models\\simplest\\25-36\\30\\30_0930-1359_1.zip")

In [73]:
# M,Q,P,B,Z,D = 10, 0, 5, 5, 1, 0
M, Q, P, B, Z, D  = cfg["env"]["M"], cfg["env"]["Q"], cfg["env"]["P"], cfg["env"]["B"], cfg["env"]["Z"], 0
# M,Q,P,B,Z,D = 0, 0, 0, 0, 1, 0

In [75]:
if cfg["env"]["uniform_data"]:
    tau0   = {s: [np.random.binomial(1, 0.7) * np.random.normal(15, 2) for _ in range(20)] for s in sources}
    delta0 = {d: [np.random.binomial(1, 0.7) * np.random.normal(15, 2) for _ in range(20)] for d in demands}
else:
    tau0   = {s: [10, 10, 10, 0, 0, 0] for s in sources}
    delta0 = {d: [0, 0, 0, 10, 10, 10] for d in demands}

In [76]:
env = BlendEnv(v = True, 
               D = cfg["env"]["D"], 
               Q = cfg["env"]["Q"], 
               P = cfg["env"]["P"], 
               B = cfg["env"]["B"], 
               Z = cfg["env"]["Z"], 
               M = cfg["env"]["M"],
               reg = cfg["env"]["reg"],
               reg_lambda = cfg["env"]["reg_lambda"],
               MAXFLOW = cfg["env"]["maxflow"],
               alpha = cfg["env"]["alpha"],
               beta = cfg["env"]["beta"],
               connections = connections, 
               action_sample = action_sample,
               tau0 = tau0,delta0 = delta0,
               sigma = sigma,
               sigma_ub = sigma_ub, sigma_lb = sigma_lb,
               s_inv_lb = s_inv_lb, s_inv_ub = s_inv_ub,
               d_inv_lb = d_inv_lb, d_inv_ub = d_inv_ub,
               betaT_d = betaT_d, betaT_s = betaT_s,
               b_inv_ub = b_inv_ub,
               b_inv_lb = b_inv_lb)
env = Monitor(env)

In [77]:
with th.autograd.set_detect_anomaly(True):
    obs = env.reset()
    obs, obs_dict = obs
    for k in range(env.T):
        action, _ = model.predict(obs, deterministic=False)
        print("\n\n",env.pen_tracker)
        print(action)
        print("\n\n   ",reconstruct_dict(action, env.mapping_act))
        obs, reward, done, term, _ = env.step(action)
        print(obs)
        dobs = reconstruct_dict(obs, env.mapping_obs)
        print("\n    >>     ",dobs["sources"], dobs["blenders"], dobs["demands"])
        print("   " ,reward)
        



 {'M': 0, 'B': 0, 'P': 0, 'Q': 0, 'reg': 0, 'n_violations': 0}
[ 2.2490938  0.        10.549076   0.       ]


    {'source_blend': {'s1': {'j1': 2.2490938}}, 'blend_demand': {'j1': {'p1': 0.0}}, 'tau': {'s1': 10.549076}, 'delta': {'p1': 0.0}}
Increased reward by 8.29998230934143 through tank population in s1
j1: inv: 0, in_flow_sources: 2.249093770980835, in_flow_blend: 0, out_flow_blend: 0, out_flow_demands: 0.0
Increased reward by 4.49818754196167 through tank population in j1
Increased reward by 0 through tank population in p1
[ 8.299982   2.2490938  0.         0.06      19.031122  17.232286
 12.631584  20.010368  13.486182  11.07766   16.463406   0.
 17.160105   0.        16.847889   0.         1.       ]

    >>      {'s1': 8.299982} {'j1': 2.2490938} {'p1': 0.0}
    10.449076080322266


 {'M': 0, 'B': 0, 'P': 0, 'Q': 0, 'reg': 0, 'n_violations': 0}
[5.444898  0.        3.9590578 0.       ]


    {'source_blend': {'s1': {'j1': 5.444898}}, 'blend_demand': {'j1': {'p1': 0.0}}, 't

In [253]:
# 0 (only once per episode)
episode_rewards = []
obs = env.reset()
obs, obs_dict = obs

In [262]:
# 1 Get first action
print(env.t)
action, _ = model.predict(obs, deterministic=True)

2


In [263]:
print(env.t)
d = reconstruct_dict(obs, env.mapping_obs)
print(d["sources"])
print(d["blenders"])
print(d["demands"])
print(d["properties"])

2
{'s1': 17.46205}
{'j1': 0.0}
{'p1': 0.0}
{'j1': {'q1': 0.0}}


In [264]:
# 2 Visualize action
print(env.t)
reconstruct_dict(action, env.mapping_act)

2


{'source_blend': {'s1': {'j1': 0.0}},
 'blend_demand': {'j1': {'p1': 30.307917}},
 'tau': {'s1': 8.731916},
 'delta': {'p1': 17.08481}}

In [265]:
# 3
# Step once: get 2nd action
print(env.t)
obs, reward, done, term, _ = env.step(action)

2


In [207]:
# 4 Visualize new state
print(env.t)
d = reconstruct_dict(obs, env.mapping_obs)
print(d["sources"])
print(d["blenders"])
print(d["demands"])
print(d["properties"])

3
{'s1': 26.193966}
{'j1': 0.0}
{'p1': 0.0}
{'j1': {'q1': 0.0}}
