- CNN policy ?
- grid search for HP tuning (OK)
- Increasingly difficult Environment
- Positive reward for populating increasingly "deep" blending tanks ?
- RL for chem sched paper (https://arxiv.org/pdf/2203.00636)
- Masking (https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html, https://arxiv.org/pdf/2006.14171)
    - Adding binary decision variables ?
    - Requires discrete action space (only integer flows -> treated as categories ?)
    - Masking: disable incoming flows for tanks at their inventory upper bound and outgoing flows for tanks at their inventory lower bound; disable selling/buying when the available amount is 0
    - multiple envs with multiple agents ? (MARL, https://arxiv.org/pdf/2103.01955)
        - Predict successive pipelines ("source > blend" then "blend > blend" (as many as required) then "blend > demand")
        - Each agent has access to the whole state
        - Action mask is derived from the previous agent's actions (0 if inventory at bounds or incoming flow already reserved, else 1)
        - https://github.com/Rohan138/marl-baselines3/blob/main/marl_baselines3/independent_ppo.py
- Safe RL: (https://proceedings.mlr.press/v119/wachi20a/wachi20a.pdf)
    - "Unsafe state" ? > Do not enforce constraints strictly; instead terminate the episode early to signal which states are unsafe ?
    - Implementations:
        - https://pypi.org/project/fast-safe-rl/#description (Policy optimizers)
        - https://github.com/PKU-Alignment/safety-gymnasium/tree/main/safety_gymnasium (environments; "cost" ?)
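
The masking rule noted above (no inflow into a tank at its upper inventory bound, no outflow from a tank at its lower bound) can be sketched as a function that returns one 0/1 entry per pipeline. This is a minimal illustration, not the `BlendEnv` implementation: `flow_mask`, `inv`, `lb`, `ub`, and the two-tank example are hypothetical, and the connection layout only mirrors the `connections` dict used further down.

```python
def flow_mask(connections, inv, lb, ub):
    """Return {(src, dst): 0 or 1} for every pipeline in `connections`.

    A pipeline is disabled (0) when its origin tank sits at its lower
    inventory bound or its destination tank sits at its upper bound.
    Nodes missing from `inv` (sources, demands) are treated as unconstrained.
    """
    mask = {}
    for group in connections.values():
        for src, dsts in group.items():
            for dst in dsts:
                src_empty = src in inv and inv[src] <= lb[src]
                dst_full = dst in inv and inv[dst] >= ub[dst]
                mask[(src, dst)] = 0 if (src_empty or dst_full) else 1
    return mask


# Hypothetical two-tank example: j1 is full, j2 is empty.
connections_demo = {
    "source_blend": {"s1": ["j1", "j2"]},
    "blend_demand": {"j1": ["p1"], "j2": ["p1"]},
}
inv = {"j1": 50.0, "j2": 0.0}
lb = {"j1": 0.0, "j2": 0.0}
ub = {"j1": 50.0, "j2": 50.0}
mask_demo = flow_mask(connections_demo, inv, lb, ub)
print(mask_demo)
```

With a discrete action space, a per-pipeline mask like this would still need to be expanded to one entry per discrete flow level before a masked policy could use it.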


1. Try other learning rates/CNN policies
2. Implement Masking with single agent
3. Try other ways to tell the model what are illegal/unsafe states (safe RL)
4. Try multiple agents
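
Step 3 could start from the early-termination idea in the safe RL notes: rather than enforcing a constraint, end the episode as soon as an unsafe state is reached. The sketch below is an assumption about how that might look, written against the 5-tuple `step` convention used later in this notebook. `TerminateOnUnsafe`, `is_unsafe`, `unsafe_reward`, and `ToyTank` are hypothetical names, and no safe RL library is involved.

```python
class TerminateOnUnsafe:
    """Wrap an env-like object and end the episode when a state is unsafe.

    `is_unsafe(obs)` is a user-supplied predicate (hypothetical). The wrapped
    object only needs `reset()` and a `step()` returning the 5-tuple
    (obs, reward, terminated, truncated, info).
    """

    def __init__(self, env, is_unsafe, unsafe_reward=0.0):
        self.env = env
        self.is_unsafe = is_unsafe
        self.unsafe_reward = unsafe_reward

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.is_unsafe(obs):
            # Flag the state, replace the step reward, and cut the episode short.
            info = dict(info, unsafe=True)
            return obs, self.unsafe_reward, True, truncated, info
        return obs, reward, terminated, truncated, info


class ToyTank:
    """Toy single-tank env (illustration only): the action is a net inflow."""

    def reset(self):
        self.level = 10.0
        return self.level, {}

    def step(self, action):
        self.level += action
        return self.level, 1.0, False, False, {}


# A level outside [0, 50] counts as unsafe in this toy example.
safe_env = TerminateOnUnsafe(ToyTank(), lambda level: not 0.0 <= level <= 50.0)
obs, _ = safe_env.reset()
first = safe_env.step(20.0)    # level 30: still safe
second = safe_env.step(-40.0)  # level -10: unsafe, episode ends
print(first, second)
```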

-----------------------

- Masking: discretizing the action space is too slow and might not work -> need to implement masking for a continuous action space
- Recurrent policy makes the most sense ? (window of demand forecasts)
- https://www.reddit.com/r/reinforcementlearning/comments/17l5b47/invalid_action_masking_when_action_space_is/
    - Suggestion of autoregressive model for having constraints respected: one predicted action is input to a second model
    - Suggestion of editing the distribution in such a way that the constraint is respected
- https://www.sciencedirect.com/science/article/pii/S0098135420301599
    - Choice of ELU activation ?
    - Choice of NN size ?
    - "The feature engineering in the net inventory means the network does not have to learn these relationships itself, which did help speed training." ?
- Simplify the problem (remove tanks 5 to 8), find the optimal solution with Gurobi
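
One reading of the "edit the distribution so the constraint is respected" suggestion is to transform a sampled action before it reaches the environment. The sketch below uses a deliberately simpler stand-in: proportional rescaling of a tank's outgoing flows so that their sum never exceeds the tank's inventory. It is not the autoregressive scheme from the thread, and `rescale_outflows` is a hypothetical helper.

```python
def rescale_outflows(outflows, inventory):
    """Scale non-negative outflows down so their sum is at most `inventory`.

    `outflows` maps destination name -> requested flow. Negative requests
    are clipped to 0 first; flows are left untouched when they already fit.
    """
    clipped = {dst: max(0.0, f) for dst, f in outflows.items()}
    total = sum(clipped.values())
    if total <= inventory or total == 0.0:
        return clipped
    scale = inventory / total
    return {dst: f * scale for dst, f in clipped.items()}


# A tank holds 10 units but the policy asks for 50 + 50.
feasible = rescale_outflows({"p1": 50.0, "p2": 50.0}, inventory=10.0)
print(feasible)
```

Because the rescaling is applied after sampling, the policy's log-probabilities still refer to the unscaled action; whether that mismatch matters in practice would need to be checked.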

- remove all constraints except in/out
- https://arxiv.org/pdf/1711.11157
- https://arxiv.org/pdf/2111.01564
- Softmax with large coef to produce action mask
- Graph convolution NN instead of RNN ?
    - https://pytorch-geometric.readthedocs.io/en/latest/
    - Graph rep. learning - William L Hamilton
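
The "softmax with large coef" idea can be illustrated on the in/out rule: given two scores for a tank (one for "receive", one for "send"), a softmax with a large coefficient pushes the weights towards a one-hot vector, which then acts as a smooth approximation of a mask. This is a sketch under that interpretation; `sharp_softmax`, the scores, and the coefficient value are assumptions.

```python
import math


def sharp_softmax(scores, coef=50.0):
    """Softmax of `coef * scores`; a large `coef` approaches a one-hot vector."""
    top = max(scores)
    shifted = [coef * (s - top) for s in scores]  # subtract the max for stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Scores for (inflow, outflow) of one tank; inflow is slightly preferred.
soft = sharp_softmax([0.6, 0.4], coef=1.0)
hard = sharp_softmax([0.6, 0.4], coef=50.0)

# Multiplying the requested flows by the weights nearly zeroes the losing side.
requested_in, requested_out = 17.5, 50.0
masked_in, masked_out = hard[0] * requested_in, hard[1] * requested_out
print(soft, hard, masked_in, masked_out)
```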

Latest model learned the in/out rule.
Watch out: bounds aren't properly respected (flows are sometimes negative).
Fix this properly, without adding a penalty.
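
One way to keep flows inside their bounds without a penalty term is to let the policy output unbounded values and squash them into `[0, MAXFLOW]` before they reach the environment. The sketch below uses a logistic squash; it is an assumption about a possible fix, not what `BlendEnv` currently does, and `squash_flow` is a hypothetical name.

```python
import math


def squash_flow(raw, max_flow=30.0):
    """Map an unbounded raw action to a flow in [0, max_flow] via a sigmoid."""
    # Numerically stable logistic: avoids exp() overflow for large |raw|.
    if raw >= 0:
        sig = 1.0 / (1.0 + math.exp(-raw))
    else:
        e = math.exp(raw)
        sig = e / (1.0 + e)
    return max_flow * sig


raw_actions = [-1000.0, -2.0, 0.0, 2.0, 1000.0]
flows = [squash_flow(a) for a in raw_actions]
print(flows)
```

Negative flows cannot occur by construction, at the cost of vanishing gradients when the raw output saturates.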

In [1]:
# import gymnasium as gym
import json
import numpy as np
import torch as th
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.monitor import Monitor
from envs import BlendEnv, flatten_and_track_mappings, reconstruct_dict
# from models import CustomRNNPolicy

In [2]:
import warnings
warnings.filterwarnings("ignore")

![image info](simple.png)

In [3]:
connections = {
    "source_blend": {
        "s1": [
            "j1",
            "j2",
            "j3",
            "j4"
        ],
        "s2": [
            "j1",
            "j2",
            "j3",
            "j4"
        ]
    },
    "blend_blend": {
        "j1": [],
        "j2": [],
        "j3": [],
        "j4": []
    },
    "blend_demand": {
        "j1": [
            "p1",
            "p2"
        ],
        "j2": [
            "p1",
            "p2"
        ],
        "j3": [
            "p1",
            "p2"
        ],
        "j4": [
            "p1",
            "p2"
        ]
    }
}

In [4]:
action_sample = {
    'source_blend':{
        's1': {'j1':1, 'j2':1, 'j3':1, 'j4':0}, # From s1 to b1, from s1 to b2 etc...
        's2': {'j1':1, 'j2':1, 'j3':0, 'j4':1},
    },
    
    # 'source_demand':{
    #     's1': {},
    #     's2': {}
    # },
    
    'blend_blend':{
        # 'j1': {'j5':1, 'j6':0, 'j7':0, 'j8':0},
        # 'j2': {'j5':0, 'j6':0, 'j7':0, 'j8':0},
        # 'j3': {'j5':0, 'j6':0, 'j7':0, 'j8':0},
        # 'j4': {'j5':0, 'j6':0, 'j7':0, 'j8':0},
        # 'j5': {},
        # 'j6': {},
        # 'j7': {},
        # 'j8': {}
    },
    
    'blend_demand':{
        'j1': {'p1':1, 'p2':0},
        'j2': {'p1':1, 'p2':2},
        'j3': {'p1':1, 'p2':2},
        'j4': {'p1':1, 'p2':2}
    },
    
    "tau": {"s1": 10, "s2": 10},
    
    "delta": {"p1": 0, "p2": 0}
}
action_sample_flat, _ = flatten_and_track_mappings(action_sample)

In [5]:
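def lr_scheduler(p):
    # SB3 calls the schedule with the remaining progress p, which goes from 1 (start) to 0 (end),
    # so the learning rate steps down from 4e-2 to 1e-3 as training advances.
    if p > 0.9:
        return 4e-2
    if p > 0.75:
        return 2e-2
    if p > 0.4:
        return 5e-3
    return 1e-3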

In [6]:
env = BlendEnv(v=False, D=0, connections=connections, Q=0, P=0, B=0, Z=1e3, action_sample=action_sample, MAXFLOW=30)
env = Monitor(env)

In [7]:
policy_kwargs = dict(
    # Recent SB3 versions expect a dict here; the list-of-dict form is deprecated.
    net_arch=dict(pi=[128]*6, vf=[128]*6)
)

In [8]:
model = PPO("MlpPolicy", env, tensorboard_log="./logs", clip_range=0.3, learning_rate=lr_scheduler, policy_kwargs=policy_kwargs, ent_coef=0.001)

In [9]:
# model = PPO(CustomRNNPolicy, env, tensorboard_log="./logs", clip_range=0.4, learning_rate=lr_scheduler, policy_kwargs=policy_kwargs, ent_coef=0)

In [16]:
import datetime
model_name = f"models/model_{datetime.datetime.now().strftime('%m%d-%H%M')}_ent_{model.ent_coef}_gam_{model.gamma}_clip_{model.clip_range(0)}_{int(env.M)}_{int(env.Z)}_{int(env.P)}"
model_name

'models/model_0605-1449_ent_0.02_gam_0.99_clip_0.3_1000_1000_0'

In [17]:
model.learn(total_timesteps=500000, progress_bar=False, tb_log_name=model_name)

<stable_baselines3.ppo.ppo.PPO at 0x1ae572278b0>

In [18]:
model.save(model_name)

In [None]:
# model.save(f"./models/test_newmodel")

In [11]:
# model.set_parameters('model_0530-2312_ent_0.5_gam_0.99_clip_0.5_1000_10000_100')
model.set_parameters("models/model_0605-1449_ent_0.02_gam_0.99_clip_0.3_1000_1000_0")

In [30]:
env = BlendEnv(v=True, Z=env.Z, D=env.D, connections=env.connections, Q=env.Q, P=env.P, B=env.B, action_sample=env.action_sample)

In [48]:
obs, obs_dict = env.reset()
for k in range(env.T):
    action, _ = model.predict(obs, deterministic=False)
    # step returns (obs, reward, terminated, truncated, info)
    obs, reward, terminated, truncated, _ = env.step(action)
    print("\t\t\t\t\t\t\t\t\t\t\t\t", reward)

[PEN] t1; s1:			bought too much (more than supply)
[PEN] t1; s2:			bought too much (more than supply)
[PEN] t1; p1:			sold too much (more than demand)
[PEN] t1; p2:			sold too much (more than demand)
s1: b: 0.1165213291140816
[PEN] t1; s1:			bought too little (resulting amount less than source tank LB)
Increased reward by 0 through tank population in s1
s2: b: 0.23443328967499014
[PEN] t1; s2:			bought too little (resulting amount less than source tank LB)
Increased reward by 0 through tank population in s2
j1: inv: 0, in_flow_sources: 17.547730939453587, in_flow_blend: 0, out_flow_blend: 0, out_flow_demands: 50.0
[PEN] t1; j1:			In and out flow both non-zero (in: 17.55, out: 50.0)
j2: inv: 0, in_flow_sources: 11.721664483749507, in_flow_blend: 0, out_flow_blend: 0, out_flow_demands: 0.0
Increased reward by 0.0 through tank population in j2
j3: inv: 0, in_flow_sources: 4.17393354429592, in_flow_blend: 0, out_flow_blend: 0, out_flow_demands: 50.0
[PEN] t1; j3:			In and out flow both non

In [49]:
# 0 (only once per episode)
episode_rewards = []
obs = env.reset()
obs, obs_dict = obs

In [62]:
# 1 Get first action
print(env.t)
action, _ = model.predict(obs, deterministic=False)

2


In [63]:
print(env.t)
d = reconstruct_dict(obs, env.mapping_obs)
print(d["sources"])
print(d["blenders"])
print(d["demands"])
print(d["properties"])

2
{'s1': 0.0, 's2': 0.0}
{'j1': 0.0, 'j2': 0.0, 'j3': 10.0, 'j4': 0.0}
{'p1': 0.0, 'p2': 0.0}
{'j1': {'q1': 0.0}, 'j2': {'q1': 0.0}, 'j3': {'q1': 0.06}, 'j4': {'q1': 0.0}}


In [64]:
# 2 Visualize action
print(env.t)
reconstruct_dict(action, env.mapping_act)

2


{'source_blend': {'s1': {'j1': 0.0, 'j2': 50.0, 'j3': 50.0, 'j4': 0.0},
  's2': {'j1': 0.0, 'j2': 50.0, 'j3': 0.0, 'j4': 0.0}},
 'blend_demand': {'j1': {'p1': 0.0, 'p2': 50.0},
  'j2': {'p1': 0.0, 'p2': 0.0},
  'j3': {'p1': 50.0, 'p2': 50.0},
  'j4': {'p1': 50.0, 'p2': 0.0}},
 'tau': {'s1': 0.0, 's2': 50.0},
 'delta': {'p1': 50.0, 'p2': 0.0}}

In [65]:
# 3
# Step once: get 2nd action
print(env.t)
obs, reward, terminated, truncated, _ = env.step(action)

2
[PEN] t3; s2:			bought too much (more than supply)
[PEN] t3; p1:			sold too much (more than demand)
s1: b: 0.0
[PEN] t3; s1:			bought too little (resulting amount less than source tank LB)
Increased reward by 0 through tank population in s1
s2: b: 0.6
[PEN] t3; s2:			bought too little (resulting amount less than source tank LB)
Increased reward by 0 through tank population in s2
j1: inv: 0.0, in_flow_sources: 0.0, in_flow_blend: 0, out_flow_blend: 0, out_flow_demands: 50.0
j1: b: 0.0
[PEN] t3; j1:			inventory OOB (resulting amount less than blending tank LB)
Increased reward by 0 through tank population in j1
j2: inv: 0.0, in_flow_sources: 30.0, in_flow_blend: 0, out_flow_blend: 0, out_flow_demands: 0.0
Increased reward by 0.0 through tank population in j2
j3: inv: 10.0, in_flow_sources: 0.0, in_flow_blend: 0, out_flow_blend: 0, out_flow_demands: 100.0
j3: b: 0.1
[PEN] t3; j3:			inventory OOB (resulting amount less than blending tank LB)
Increased reward by 0 through tank population 

In [66]:
# 4 Visualize new state
print(env.t)
d = reconstruct_dict(obs, env.mapping_obs)
print(d["sources"])
print(d["blenders"])
print(d["demands"])
print(d["properties"])

3
{'s1': 0.0, 's2': 0.0}
{'j1': 0.0, 'j2': 30.0, 'j3': 0.0, 'j4': 0.0}
{'p1': 5.0, 'p2': 10.0}
{'j1': {'q1': 0.0}, 'j2': {'q1': 0.26}, 'j3': {'q1': 0.0}, 'j4': {'q1': 0.0}}


In [67]:
reward

7999.6

In [None]:
# End of episode
episode_rewards.append(reward)

In [16]:
with open("./connections_sample.json", "r") as f:
    # json.load parses the whole file, so it also works if the JSON spans several lines
    connections = json.load(f)
connections

{'source_blend': {'s1': ['j1', 'j2', 'j3', 'j4'],
  's2': ['j1', 'j2', 'j3', 'j4']},
 'blend_blend': {'j1': ['j5', 'j6', 'j7', 'j8'],
  'j2': ['j5', 'j6', 'j7', 'j8'],
  'j3': ['j5', 'j6', 'j7', 'j8'],
  'j4': ['j5', 'j6', 'j7', 'j8'],
  'j5': [],
  'j6': [],
  'j7': [],
  'j8': []},
 'blend_demand': {'j1': [],
  'j2': [],
  'j3': [],
  'j4': [],
  'j5': ['p1', 'p2'],
  'j6': ['p1', 'p2'],
  'j7': ['p1', 'p2'],
  'j8': ['p1', 'p2']}}

In [14]:
def eval_policy(model=model, env=env, n_eval_episodes=10):
    """Run deterministic episodes and return (mean, std) of the episode returns."""
    episode_rewards = []
    for _ in range(n_eval_episodes):
        obs, _ = env.reset()  # reset returns a tuple, as in the cells above
        episode_reward = 0
        done = False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            # step returns (obs, reward, terminated, truncated, info)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_reward += reward
        episode_rewards.append(episode_reward)
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)
    return mean_reward, std_reward

In [None]:
eval_policy()