# Simulator

In this section we will show how to use the `a2rl.Simulator` to get a recommendation.

The Simulator provides recommendations differently from a typical Reinforcement Learning approach, where you must first train an RL agent (e.g. SAC, PPO) against a simulator, and only then can the agent recommend an action.

First, a Q-value is calculated internally when you call `wi_df.add_value()` on the loaded data. The Simulator is then trained on sequences of states, actions, rewards, and Q-values. To choose an action, you only need to sample multiple trajectories based on the current context.
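
The idea of choosing an action by sampling several trajectories can be sketched with a toy example. This is only an illustration and does not use `a2rl`: `sample_trajectory_cost`, the action names, and the cost numbers are all hypothetical stand-ins for a trained simulator, and lower cost is assumed to be better.

```python
import random
from statistics import mean


def sample_trajectory_cost(action, rng, horizon=5):
    """Hypothetical stand-in for a learned simulator: noisy cumulative cost."""
    base_cost = {"low": 3.0, "mid": 2.0, "high": 2.5}[action]
    return sum(base_cost + rng.gauss(0.0, 0.1) for _ in range(horizon))


def choose_action(actions, n_trajectories=20, seed=0):
    """Average the cost of several sampled trajectories per action, keep the cheapest."""
    rng = random.Random(seed)
    estimates = {
        a: mean(sample_trajectory_cost(a, rng) for _ in range(n_trajectories))
        for a in actions
    }
    best = min(estimates, key=estimates.get)
    return best, estimates


best_action, estimates = choose_action(["low", "mid", "high"])
print(best_action, estimates)
```

No separate agent is trained here: the choice comes directly from comparing sampled outcomes.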


In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import my_nb_path  # isort: skip
import os
from pathlib import Path

import numpy as np

import a2rl as wi
from a2rl.nbtools import pprint, print  # Enable color outputs when rich is installed.

## Load Dataset

Instantiate a tokenizer given the selected dataset.

In [None]:
wi_df = wi.read_csv_dataset(wi.sample_dataset_path("chiller"))
wi_df.add_value()

# Speed up training for demo purpose
wi_df = wi_df.iloc[:1000]
tokenizer = wi.AutoTokenizer(wi_df, block_size_row=2)

tokenizer.df.head(2)

In [None]:
tokenizer.df_tokenized.head(2)

## Train a model

The default hyperparameters are located at `src/a2rl/config.yaml`. Alternatively, you can (1) specify your own configuration file using `config_dir` and `config_name`, or (2) pass in the configuration as the `config` parameter. Refer to `GPTBuilder` for more info.

In [None]:
model_dir = "model-simulator"
config = None  # Default training configuration

################################################################################
# To run in fast mode, set env var NOTEBOOK_FAST_RUN=1 prior to starting Jupyter
################################################################################
if os.environ.get("NOTEBOOK_FAST_RUN", "0") != "0":
    config = {
        "train_config": {
            "epochs": 1,
            "batch_size": 512,
            "embedding_dim": 512,
            "gpt_n_layer": 1,
            "gpt_n_head": 1,
            "learning_rate": 6e-4,
            "num_workers": 0,
            "lr_decay": True,
        }
    }

    from IPython.display import Markdown

    display(
        Markdown(
            '<p style="color:firebrick; background-color:yellow; font-weight:bold">'
            "NOTE: notebook runs in fast mode. Use only 1 epoch. Results may differ."
        )
    )
################################################################################
builder = wi.GPTBuilder(tokenizer, model_dir, config)

Start GPT model training.

In [None]:
%%time
builder.fit()

Plot the original GPT tokens against the tokens predicted over the horizon, given the initial context window.

In [None]:
builder.evaluate(context_len=5, sample=False, horizon=50);

The graph above is similar to behaviour cloning: the model acts according to the historical pattern. In the next graph, setting `sample=True` lets you sample a different trajectory.

In [None]:
builder.evaluate(context_len=5, sample=True, horizon=50);
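
The difference between the two modes can be illustrated with a small stand-alone sketch. It assumes, as is common for GPT-style models, that `sample=False` corresponds to always taking the most likely next token and `sample=True` to drawing from the predicted distribution; the `next_token` helper and the logits are hypothetical and not part of `a2rl`.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(logits, sample, rng=None):
    """Hypothetical helper: greedy pick when sample=False, random draw otherwise."""
    probs = softmax(logits)
    if not sample:
        return max(range(len(probs)), key=probs.__getitem__)
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]  # made-up scores for three tokens
greedy = [next_token(logits, sample=False) for _ in range(5)]
rng = random.Random(0)
sampled = [next_token(logits, sample=True, rng=rng) for _ in range(200)]
print(greedy, sorted(set(sampled)))
```

Greedy decoding repeats the same choice every time, while sampling explores less likely tokens as well, which is why repeated runs can produce different trajectories.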

## Get Recommendation



In [None]:
simulator = wi.Simulator(tokenizer, builder.model)
simulator.tokenizer.df_tokenized.head(2)

Get a custom context sequence.

**Note:** The sequence should end with a state, i.e. (s, a, r, ..., s).

In [None]:
custom_context = tokenizer.df_tokenized.sequence[:7]
custom_context
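
One way to check that a context ends with a state is to compare its length against the row layout. The sketch below assumes each row is flattened as states, then actions, then rewards (including the value column); `ends_with_state` and the column counts are hypothetical and should be replaced with the layout of your own dataset.

```python
def ends_with_state(context_len, n_states, n_actions, n_rewards):
    """Return True if a flattened (s, a, r, ..., s) sequence stops right after the states."""
    row_len = n_states + n_actions + n_rewards
    return context_len > 0 and context_len % row_len == n_states


# Hypothetical layout: 2 state columns, 1 action column, 2 reward columns (reward + value).
print(ends_with_state(7, n_states=2, n_actions=1, n_rewards=2))  # one full row + states
print(ends_with_state(5, n_states=2, n_actions=1, n_rewards=2))  # ends after the rewards
```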

### One step sample

`sample` returns a dataframe whose columns are (actions, reward, value, next states) given the
context. The contents of the dataframe are in the original space (approximated).

In [None]:
recommendation_df = simulator.sample(custom_context, max_size=10, as_token=False)
recommendation_df

## Build Your Own Planner

If you want to build your own planner, `a2rl` provides a few lower-level APIs.

### Get valid actions

`get_valid_actions` returns a dataframe of potential actions (in tokenized form) given the context.

Let's take a custom context, assuming it always ends at the current states, and find the next top_k actions.

In [None]:
valid_actions = simulator.get_valid_actions(custom_context, max_size=2)
valid_actions
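
As a rough illustration of what "top_k actions" means, the sketch below ranks action tokens by probability and keeps the first `k`. The `top_k_actions` helper, the token ids, and the probabilities are hypothetical; `get_valid_actions` may select its candidates differently.

```python
def top_k_actions(action_probs, k):
    """Return the k action tokens with the highest probability, best first."""
    ranked = sorted(action_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [token for token, _ in ranked[:k]]


# Made-up action tokens and probabilities for a single context.
action_probs = {210: 0.1, 211: 0.6, 212: 0.3}
print(top_k_actions(action_probs, k=2))
```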

### One step lookahead

`lookahead` returns the reward and next states, given the context and action.

Let's pick an action and simulate the reward and next states. This API does not change the simulator's internal counter and states.

In [None]:
custom_context = np.array([0, 100])
action_seq = [valid_actions.loc[0, :]]
print(f"Given the context: {custom_context} and action: {action_seq}\n")

reward, next_states = simulator.lookahead(custom_context, action_seq)
print(f"{reward=}")
print(f"{next_states=}")
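
A planner built on a one-step lookahead could, for example, greedily pick the action with the best simulated reward at every step. The sketch below is self-contained and does not call `a2rl`: `toy_lookahead` is a hypothetical stand-in with made-up dynamics (the state moves by the action, and the reward is highest when the state reaches 10).

```python
def toy_lookahead(context, action):
    """Hypothetical stand-in for a simulator lookahead: returns (reward, next_states)."""
    state = context[-1]
    next_state = state + action
    reward = -abs(next_state - 10)
    return reward, [next_state]


def greedy_plan(context, valid_actions, steps):
    """At each step, try every valid action and keep the one with the best reward."""
    context = list(context)
    plan = []
    for _ in range(steps):
        best_action, (reward, next_states) = max(
            ((a, toy_lookahead(context, a)) for a in valid_actions),
            key=lambda item: item[1][0],
        )
        plan.append(best_action)
        # Extend the context so that it ends with the new state again.
        context = context + [best_action, reward] + next_states
    return plan, context[-1]


plan, final_state = greedy_plan([4], valid_actions=[-1, 0, 2], steps=4)
print(plan, final_state)
```

Because the lookahead itself has no side effects, the planner is free to evaluate every candidate action before committing to one.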

## Gym

Get a gym compatible simulator using `SimulatorWrapper`.

In [None]:
sim_wrapper = wi.SimulatorWrapper(env=simulator)

Get the mapping from gym actions to action encodings. Gym expects actions to be a list of consecutive integers.

In [None]:
sim_wrapper.gym_action_to_enc
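
Such a mapping can be pictured as an enumeration of the distinct action tokens. The sketch below is only an assumption about the general idea, not how `SimulatorWrapper` builds `gym_action_to_enc`; `build_action_mapping` and the token ids are hypothetical.

```python
def build_action_mapping(action_tokens):
    """Map consecutive gym action indices (0, 1, 2, ...) to action tokens and back."""
    unique_tokens = sorted(set(action_tokens))
    gym_to_enc = dict(enumerate(unique_tokens))
    enc_to_gym = {tok: idx for idx, tok in gym_to_enc.items()}
    return gym_to_enc, enc_to_gym


# Made-up action tokens as they might appear in a tokenized dataset.
gym_to_enc, enc_to_gym = build_action_mapping([212, 210, 211, 210])
print(gym_to_enc)
print(enc_to_gym)
```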

In [None]:
sim_wrapper.reset()

In [None]:
obs, reward, done, info = sim_wrapper.step([0])
obs, reward

In [None]:
sim_wrapper.observation_space

In [None]:
sim_wrapper.action_space

## 3rd Party Tools

Use with 3rd party packages like `stable_baselines3`.

Because PPO requires observations as an array of `np.float32`, use OpenAI Gym's observation wrapper to perform the transformation needed by your training agent.

In [None]:
%%time

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.ppo import MlpPolicy


class CustomObservation(gym.ObservationWrapper):
    def __init__(self, env: gym.Env):
        super().__init__(env)
        self.observation_space = gym.spaces.Box(
            low=-np.inf,
            high=np.inf,
            shape=(len(self.tokenizer.state_indices),),
            dtype=np.float32,
        )

    def observation(self, observation):
        new_obs = observation.astype(np.float32)
        return new_obs


new_sim = CustomObservation(sim_wrapper)
model = PPO(MlpPolicy, new_sim, verbose=0)
model.learn(total_timesteps=2)

obs = new_sim.reset()
for i in range(2):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = new_sim.step(action)
    if done:
        obs = new_sim.reset()

mean_reward, std_reward = evaluate_policy(model, new_sim, n_eval_episodes=1)
print(f"Mean reward:{mean_reward:.2f} +/- {std_reward:.2f}")