- CNN policy ?
- grid search for HP tuning (OK)
- Increasingly difficult Environment
- Positive reward for populating increasingly "deep" blending tanks ?
- RL for chem sched paper (https://arxiv.org/pdf/2203.00636)
- Masking (https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html, https://arxiv.org/pdf/2006.14171)
    - Adding binary decision variables ?
    - Requires discrete action space (only integer flows -> treated as categories ?)
    - Masking: disable incoming (resp. outgoing) flows for tanks at their upper (resp. lower) inventory bound; disable selling/buying when the available amount is 0
    - multiple envs with multiple agents ? (MARL, https://arxiv.org/pdf/2103.01955)
        - Predict successive pipelines ("source > blend" then "blend > blend" (as many as required) then "blend > demand")
        - Each agent has access to the whole state
        - Action mask is derived from the previous agent's actions (0 if inventory at bounds or incoming flow already reserved, else 1)
        - https://github.com/Rohan138/marl-baselines3/blob/main/marl_baselines3/independent_ppo.py
- Safe RL: (https://proceedings.mlr.press/v119/wachi20a/wachi20a.pdf)
    - "Unsafe state" ? > Do not enforce constraints strictly, instead opt for early episode termination to show which states are unsafe ? 
    - Implementations:
        - https://pypi.org/project/fast-safe-rl/#description (Policy optimizers)
        - https://github.com/PKU-Alignment/safety-gymnasium/tree/main/safety_gymnasium (environments; "cost" ?)
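
The masking rule from the notes above (no incoming flow into a tank at its upper inventory bound, no outgoing flow from a tank at its lower bound) could look roughly like the sketch below. It assumes integer flow levels treated as categories; `flow_mask` and its arguments are hypothetical names for illustration, not part of `BlendEnv` or sb3-contrib.

```python
def flow_mask(inv_src, inv_dst, lb_src, ub_dst, n_levels):
    """Boolean mask over integer flow levels 0..n_levels-1 for one pipeline.

    Level k is allowed only if the origin tank can release k units without
    dropping below its lower bound and the destination tank can accept k
    units without exceeding its upper bound. Level 0 (no flow) is always valid.
    """
    max_out = inv_src - lb_src   # units the origin can still release
    max_in = ub_dst - inv_dst    # units the destination can still accept
    limit = max(0, min(max_out, max_in))
    return [k <= limit for k in range(n_levels)]


# Destination tank is full (30/30): only the "no flow" level stays valid.
print(flow_mask(inv_src=10, inv_dst=30, lb_src=0, ub_dst=30, n_levels=4))
# → [True, False, False, False]

# Destination has room for 2 units, origin holds 10: levels 0..2 are valid.
print(flow_mask(inv_src=10, inv_dst=28, lb_src=0, ub_dst=30, n_levels=4))
# → [True, True, True, False]
```

With `MaskablePPO` from sb3-contrib, a mask like this would typically be exposed through an `action_masks()` method on the environment or through the `ActionMasker` wrapper.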


1. Try other learning rates/CNN policies
2. Implement Masking with single agent
3. Try other ways to tell the model what are illegal/unsafe states (safe RL)
4. Try multiple agents

-----------------------

- Masking: Discretization of action space is too slow/might not work -> Need to implement masking for continuous action space
- Recurrent policy makes the most sense ? (window of demand forecasts)
- https://www.reddit.com/r/reinforcementlearning/comments/17l5b47/invalid_action_masking_when_action_space_is/
    - Suggestion of autoregressive model for having constraints respected: one predicted action is input to a second model
    - Suggestion of editing the distribution in such a way that the constraint is respected
- https://www.sciencedirect.com/science/article/pii/S0098135420301599
    - Choice of ELU activation ?
    - Choice of NN size ?
    - "The feature engineering in the net inventory means the network does not have to learn these relationships itself, which did help speed training." ?
- Simplify the problem (remove tanks 5 to 8), find the optimal solution with Gurobi
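
One possible reading of the two Reddit suggestions above (autoregressive prediction, and reshaping the action so the constraint holds by construction) is sketched here for a single tank. The function name, its arguments, and the interpretation of the in/out rule (a tank that receives material in a period cannot discharge in the same period) are assumptions made for illustration only.

```python
def feasible_flows(u_in, u_out, inv, inv_lb, inv_ub, supply):
    """Map two raw fractions in [0, 1] to flows that respect tank bounds.

    The out-flow is decided after the in-flow (autoregressive order), so its
    feasible range can depend on the in-flow that was just chosen.
    """
    # Clamp the raw network outputs so out-of-range values cannot leak through.
    u_in = min(1.0, max(0.0, u_in))
    u_out = min(1.0, max(0.0, u_out))

    # Step 1: in-flow limited by the upstream supply and the free tank capacity.
    in_flow = u_in * max(0.0, min(supply, inv_ub - inv))

    # Step 2: out-flow conditioned on step 1 (assumed in/out rule).
    if in_flow > 0.0:
        out_flow = 0.0
    else:
        out_flow = u_out * max(0.0, inv - inv_lb)
    return in_flow, out_flow


print(feasible_flows(0.5, 0.8, inv=20.0, inv_lb=0.0, inv_ub=30.0, supply=50.0))
print(feasible_flows(0.0, 0.8, inv=20.0, inv_lb=0.0, inv_ub=30.0, supply=50.0))
```

Because the bounds are applied inside the mapping, no penalty term is needed for these particular constraints; the policy only learns the fractions.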

- remove all constraints except in/out
- https://arxiv.org/pdf/1711.11157
- https://arxiv.org/pdf/2111.01564
- Softmax with large coef to produce action mask
- Graph convolution NN instead of RNN ?
    - https://pytorch-geometric.readthedocs.io/en/latest/
    - Graph rep. learning - William L Hamilton
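
The "softmax with large coef" idea in the list above can be illustrated with a few lines: multiplying the scores by a large coefficient before the softmax pushes the output towards a one-hot vector, which behaves like a soft, differentiable mask. This is a minimal sketch in plain Python; `sharp_softmax` is a hypothetical helper, and in the actual model the same operation would be applied to tensors.

```python
import math


def sharp_softmax(scores, coef):
    """Softmax of coef * scores, computed stably by subtracting the max."""
    top = max(scores)
    exps = [math.exp(coef * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.2, 0.5, 0.1]
print(sharp_softmax(scores, coef=1.0))    # soft weights, all clearly non-zero
print(sharp_softmax(scores, coef=100.0))  # close to one-hot on the largest score
```

Multiplying candidate flows by such weights would leave essentially one of them active, at the cost of very small gradients for the suppressed entries when the coefficient is large.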

- Latest model learned the in/out rule
- Watch out: bounds aren't properly respected (negative flows sometimes)
- Fix it properly, without adding a penalty
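
One penalty-free way to rule out the negative flows noted above is to map the squashed policy output affinely onto the allowed flow range. The sketch below assumes the raw action lies in [-1, 1] (as a tanh output does) and that `ub` is the largest admissible flow; `to_flow` is a hypothetical helper, not something `BlendEnv` currently provides.

```python
def to_flow(a, ub):
    """Affinely map a squashed action a in [-1, 1] to a flow in [0, ub]."""
    a = max(-1.0, min(1.0, a))  # guard against values outside the tanh range
    return 0.5 * (a + 1.0) * ub


print(to_flow(-1.0, 50.0))  # → 0.0
print(to_flow(0.0, 50.0))   # → 25.0
print(to_flow(1.0, 50.0))   # → 50.0
```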

In [114]:
# import gymnasium as gym
import json
import numpy as np
import torch as th
from stable_baselines3 import PPO, DDPG
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.monitor import Monitor
from envs import BlendEnv, flatten_and_track_mappings, reconstruct_dict

In [115]:
from stable_baselines3.common.distributions import DiagGaussianDistribution
x = DiagGaussianDistribution(2)

- DDPG
- Softmax
- Remove non-selling rewards
- MultiplexNet

In [116]:
import warnings
warnings.filterwarnings("ignore")

![image info](simplest.png)

In [117]:
connections = {
    "source_blend": {
        "s1": [
            "j1"
        ]
    },
    "blend_blend": {
        "j1": [],
    },
    "blend_demand": {
        "j1": [
            "p1"
        ]
    }
}

In [118]:
action_sample = {
    'source_blend':{
        's1': {'j1':1}
    },
    
    # 'source_demand':{
    #     's1': {},
    #     's2': {}
    # },
    
    'blend_blend':{
    },
    
    'blend_demand':{
        'j1': {'p1':1}
    },
    
    "tau": {"s1": 10},
    
    "delta": {"p1": 0}
}
action_sample_flat, _ = flatten_and_track_mappings(action_sample)

In [119]:
tau0   = {'s1': [10, 10, 10, 0, 0, 0]}
delta0 = {'p1': [0, 0, 0, 10, 10, 10]}
sigma = {"s1":{"q1": 0.06}} # Source concentrations
sigma_ub = {"p1":{"q1": 0.16}} # Demand concentrations UBs/LBs
sigma_lb = {"p1":{"q1": 0}}
s_inv_lb = {'s1': 0}
s_inv_ub = {'s1': 999}
d_inv_lb = {'p1': 0}
d_inv_ub = {'p1': 999}
betaT_d = {'p1': 1} # Price of sold products
betaT_s = {'s1': 0} # Cost of bought products
b_inv_ub = {"j1": 30} 
b_inv_lb = {j:0 for j in b_inv_ub.keys()} 

In [120]:
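def lr_scheduler(p):
    # SB3 passes p = progress remaining: 1.0 at the start of training, 0.0 at the end,
    # so the learning rate decays step-wise from 4e-2 down to 1e-3.
    if p > 0.9:
        return 4e-2
    if p > 0.75:
        return 2e-2
    if p > 0.4:
        return 5e-3
    else:
        return 1e-3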

In [121]:
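def lr_scheduler_mult(p):
    # Same step-wise decay on progress remaining (1.0 -> 0.0), with lower rates
    # after the first 10% of training.
    if p > 0.9:
        return 4e-2
    if p > 0.75:
        return 5e-3
    if p > 0.4:
        return 1e-3
    else:
        return 5e-4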
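
Since Stable-Baselines3 calls the schedule with the remaining progress (1.0 at the start, 0.0 at the end), it can help to evaluate a schedule at a few points before training. The sketch below builds the same breakpoints as `lr_scheduler_mult` from a table; `piecewise_schedule` is a hypothetical helper, not an SB3 function.

```python
def piecewise_schedule(steps, final_lr):
    """Build a schedule from (threshold, lr) pairs sorted by decreasing threshold.

    The returned function takes progress_remaining, which goes from 1.0 at the
    start of training to 0.0 at the end.
    """
    def schedule(progress_remaining):
        for threshold, lr in steps:
            if progress_remaining > threshold:
                return lr
        return final_lr
    return schedule


# Same breakpoints as lr_scheduler_mult above.
sched = piecewise_schedule([(0.9, 4e-2), (0.75, 5e-3), (0.4, 1e-3)], final_lr=5e-4)
for p in (1.0, 0.8, 0.5, 0.1):
    print(p, sched(p))
```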

In [122]:
env = BlendEnv(v = False, 
               D=50, Q = 0, P = 0, B = 0, Z = 0, M = 0,
               connections = connections, 
               action_sample = action_sample,
               tau0 = tau0,
               delta0 = delta0,
               sigma = sigma,
               sigma_ub = sigma_ub,
               sigma_lb = sigma_lb,
               s_inv_lb = s_inv_lb,
               s_inv_ub = s_inv_ub,
               d_inv_lb = d_inv_lb,
               d_inv_ub = d_inv_ub,
               betaT_d = betaT_d,
               betaT_s = betaT_s,
               b_inv_ub = b_inv_ub,
               b_inv_lb = b_inv_lb)
env = Monitor(env)

In [123]:
policy_kwargs = dict(
    # net_arch=[dict(pi=[128]*6, vf=[128]*6)],
    activation_fn = th.nn.ReLU
)

In [124]:
# model = PPO("MlpPolicy", env, tensorboard_log="./logs", clip_range=0.3, learning_rate=lr_scheduler_mult, policy_kwargs=policy_kwargs, ent_coef=0.0025)

In [125]:
model = DDPG("MlpPolicy", env, tensorboard_log="./logs", learning_rate=lr_scheduler_mult, policy_kwargs=policy_kwargs)

In [126]:
# model.set_parameters("models\\simplest_model_0606-1629_ent_0.001_gam_0.99_clip_0.3_1000_1000_0")

In [127]:
env.observation_space

Box(0.0, 50.0, (17,), float32)

In [128]:
# model = PPO(CustomRNNPolicy, env, tensorboard_log="./logs", clip_range=0.4, learning_rate=lr_scheduler, policy_kwargs=policy_kwargs, ent_coef=0)

In [129]:
import datetime
# isinstance is the idiomatic type check; ent_coef and clip_range only exist on PPO
is_ppo = isinstance(model, PPO)
modeltype = "PPO" if is_ppo else "DDPG"
entcoef = str(model.ent_coef) if is_ppo else ""
cliprange = str(model.clip_range(0)) if is_ppo else ""
stamp = datetime.datetime.now().strftime('%m%d-%H%M')
model_name = (
    f"models/simplest_model_{modeltype}_{stamp}"
    f"_ent_{entcoef}_gam_{model.gamma}_clip_{cliprange}"
    f"_{int(env.M)}_{int(env.Z)}_{int(env.P)}_{int(env.D)}_ReLU"
)
model_name

'models/simplest_model_DDPG_0609-2146_ent__gam_0.99_clip__0_0_0_50_ReLU'

In [130]:
model.policy

TD3Policy(
  (actor): Actor(
    (features_extractor): FlattenExtractor(
      (flatten): Flatten(start_dim=1, end_dim=-1)
    )
    (mu): Sequential(
      (0): Linear(in_features=17, out_features=400, bias=True)
      (1): ReLU()
      (2): Linear(in_features=400, out_features=300, bias=True)
      (3): ReLU()
      (4): Linear(in_features=300, out_features=4, bias=True)
      (5): Tanh()
    )
  )
  (actor_target): Actor(
    (features_extractor): FlattenExtractor(
      (flatten): Flatten(start_dim=1, end_dim=-1)
    )
    (mu): Sequential(
      (0): Linear(in_features=17, out_features=400, bias=True)
      (1): ReLU()
      (2): Linear(in_features=400, out_features=300, bias=True)
      (3): ReLU()
      (4): Linear(in_features=300, out_features=4, bias=True)
      (5): Tanh()
    )
  )
  (critic): ContinuousCritic(
    (features_extractor): FlattenExtractor(
      (flatten): Flatten(start_dim=1, end_dim=-1)
    )
    (qf0): Sequential(
      (0): Linear(in_features=21, out_featu

In [131]:
for k in range(5):
    model.learn(total_timesteps=150, progress_bar=False, tb_log_name=model_name + "_" + str(k), reset_num_timesteps=False)
    model.save(model_name + "_" + str(k))

In [132]:
model_name

'models/simplest_model_DDPG_0609-2146_ent__gam_0.99_clip__0_0_0_50_ReLU'

In [28]:
model.save(model_name)

In [335]:
# model.save(f"./models/test_newmodel")

In [16]:
# model.set_parameters('model_0530-2312_ent_0.5_gam_0.99_clip_0.5_1000_10000_100')
model.set_parameters("models\\simplest_model_DDPG_0609-2115_ent__gam_0.99_clip__1000_1000_0_0_ReLU")

In [17]:
env.v = True

In [145]:
env.reset()
x = th.Tensor(env.flatt_state).unsqueeze(0)
y = model.policy.actor.forward(x)
y

tensor([[-1., -1.,  1., -1.]], grad_fn=<TanhBackward0>)

In [19]:
# action_net only exists on actor-critic policies (e.g. PPO); DDPG's TD3Policy has none
z = model.policy.action_net.forward(y)
z

AttributeError: 'TD3Policy' object has no attribute 'action_net'

In [20]:
# log_std belongs to PPO's Gaussian policy; the deterministic TD3Policy has none
model.policy.log_std.exp()

AttributeError: 'TD3Policy' object has no attribute 'log_std'

In [133]:
env.D, env.Q,  env.P, env.B, env.Z, env.M

(50, 0, 0, 0, 0, 0)

In [134]:
env = BlendEnv(v = True, 
               D=env.D, Q = env.Q, P = env.P, B = env.B, Z = env.Z, M = env.M,
               action_sample = action_sample, 
               connections = connections, 
               tau0 = tau0,
               delta0 = delta0,
               sigma = sigma,
               sigma_ub = sigma_ub,
               sigma_lb = sigma_lb,
               s_inv_lb = s_inv_lb,
               s_inv_ub = s_inv_ub,
               d_inv_lb = d_inv_lb,
               d_inv_ub = d_inv_ub,
               betaT_d = betaT_d,
               betaT_s = betaT_s,
               b_inv_ub = b_inv_ub,
               b_inv_lb = b_inv_lb)
env = Monitor(env)

In [135]:
env.mapping_act

[(0, ['source_blend', 's1', 'j1']),
 (1, ['blend_demand', 'j1', 'p1']),
 (2, ['tau', 's1']),
 (3, ['delta', 'p1'])]
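
The mapping printed above pairs each index of the flat action vector with a key path in the nested action dict. A function in the spirit of `reconstruct_dict` could be sketched as follows; this is an illustration of the idea under that assumption, not the actual implementation in `envs`.

```python
def rebuild_nested(flat, mapping):
    """Rebuild a nested dict from a flat vector and (index, key_path) pairs."""
    out = {}
    for idx, path in mapping:
        node = out
        # Walk (and create) the intermediate levels of the key path.
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = float(flat[idx])
    return out


mapping = [
    (0, ["source_blend", "s1", "j1"]),
    (1, ["blend_demand", "j1", "p1"]),
    (2, ["tau", "s1"]),
    (3, ["delta", "p1"]),
]
print(rebuild_nested([0.0, 0.0, 50.0, 0.0], mapping))
```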

In [136]:
obs = env.reset()
obs, obs_dict = obs
for k in range(env.T):
    action, _ = model.predict(obs, deterministic=False)
    obs, reward, done, term, _ = env.step(action)
    print(reconstruct_dict(action, env.mapping_act))
    print(reward)
    

[PEN] t1; s1:			bought too much (more than supply)
Increased reward by 500.0 through tank population in s1
j1: inv: 0, in_flow_sources: 0.0, in_flow_blend: 0, out_flow_blend: 0, out_flow_demands: 0.0
Increased reward by 0 through tank population in j1
Increased reward by 0 through tank population in p1
{'source_blend': {'s1': {'j1': 0.0}}, 'blend_demand': {'j1': {'p1': 0.0}}, 'tau': {'s1': 50.0}, 'delta': {'p1': 0.0}}
500.0
[PEN] t2; s1:			bought too much (more than supply)
Increased reward by 500.0 through tank population in s1
j1: inv: 0.0, in_flow_sources: 0.0, in_flow_blend: 0, out_flow_blend: 0, out_flow_demands: 0.0
Increased reward by 0 through tank population in j1
Increased reward by 0 through tank population in p1
{'source_blend': {'s1': {'j1': 0.0}}, 'blend_demand': {'j1': {'p1': 0.0}}, 'tau': {'s1': 50.0}, 'delta': {'p1': 0.0}}
1000.0
[PEN] t3; s1:			bought too much (more than supply)
Increased reward by 500.0 through tank population in s1
j1: inv: 0.0, in_flow_sources: 0.0

In [253]:
# 0 (only once per episode)
episode_rewards = []
obs = env.reset()
obs, obs_dict = obs

In [262]:
# 1 Get first action
print(env.t)
action, _ = model.predict(obs, deterministic=True)

2


In [263]:
print(env.t)
d = reconstruct_dict(obs, env.mapping_obs)
print(d["sources"])
print(d["blenders"])
print(d["demands"])
print(d["properties"])

2
{'s1': 17.46205}
{'j1': 0.0}
{'p1': 0.0}
{'j1': {'q1': 0.0}}


In [264]:
# 2 Visualize action
print(env.t)
reconstruct_dict(action, env.mapping_act)

2


{'source_blend': {'s1': {'j1': 0.0}},
 'blend_demand': {'j1': {'p1': 30.307917}},
 'tau': {'s1': 8.731916},
 'delta': {'p1': 17.08481}}

In [265]:
# 3
# Step once: get 2nd action
print(env.t)
obs, reward, done, term, _ = env.step(action)

2


In [207]:
# 4 Visualize new state
print(env.t)
d = reconstruct_dict(obs, env.mapping_obs)
print(d["sources"])
print(d["blenders"])
print(d["demands"])
print(d["properties"])

3
{'s1': 26.193966}
{'j1': 0.0}
{'p1': 0.0}
{'j1': {'q1': 0.0}}


In [172]:
reward

200.0

In [None]:
# End of episode
episode_rewards.append(reward)

In [16]:
# json.load parses the whole file, so it also works if the JSON spans several lines
with open("./connections_sample.json", "r") as f:
    connections = json.load(f)
connections

{'source_blend': {'s1': ['j1', 'j2', 'j3', 'j4'],
  's2': ['j1', 'j2', 'j3', 'j4']},
 'blend_blend': {'j1': ['j5', 'j6', 'j7', 'j8'],
  'j2': ['j5', 'j6', 'j7', 'j8'],
  'j3': ['j5', 'j6', 'j7', 'j8'],
  'j4': ['j5', 'j6', 'j7', 'j8'],
  'j5': [],
  'j6': [],
  'j7': [],
  'j8': []},
 'blend_demand': {'j1': [],
  'j2': [],
  'j3': [],
  'j4': [],
  'j5': ['p1', 'p2'],
  'j6': ['p1', 'p2'],
  'j7': ['p1', 'p2'],
  'j8': ['p1', 'p2']}}

In [14]:
def eval_policy(model=model, env=env, n_eval_episodes=10):
    """Run deterministic rollouts and return (mean, std) of the episode returns."""
    episode_rewards = []
    for _ in range(n_eval_episodes):
        # Gymnasium API: reset() returns (obs, info)
        obs, _ = env.reset()
        episode_reward = 0
        done = False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            # step() returns (obs, reward, terminated, truncated, info)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_reward += reward
        episode_rewards.append(episode_reward)
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)
    return mean_reward, std_reward

In [None]:
eval_policy()