In [1]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl
%set_random_seed 12

In [2]:
%presentation_style

In [3]:
%load_latex_macros


$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$
$\newcommand{\amax}{{\text{argmax}}}$
$\newcommand{\P}{{\mathbb{P}}}$
$\newcommand{\E}{{\mathbb{E}}}$
$\newcommand{\R}{{\mathbb{R}}}$
$\newcommand{\Z}{{\mathbb{Z}}}$
$\newcommand{\N}{{\mathbb{N}}}$
$\newcommand{\C}{{\mathbb{C}}}$
$\newcommand{\abs}[1]{{ \left| #1 \right| }}$
$\newcommand{\simpl}[1]{{\Delta^{#1} }}$


In [None]:
%autoreload

In [None]:
from training_rl.offline_rl.load_env_variables import load_env_variables
load_env_variables()

import os
import gymnasium as gym
import torch
import warnings
import matplotlib.pyplot as plt
import numpy as np

from training_rl.offline_rl.behavior_policies.behavior_policy_registry import BehaviorPolicyType
from training_rl.offline_rl.custom_envs.custom_2d_grid_env.obstacles_2D_grid_register import ObstacleTypes
from training_rl.offline_rl.custom_envs.custom_envs_registration import CustomEnv, RenderMode, register_grid_envs, EnvFactory
from training_rl.offline_rl.custom_envs.utils import Grid2DInitialConfig, InitialConfigCustom2DGridEnvWrapper
from training_rl.offline_rl.generate_custom_minari_datasets.generate_minari_dataset_grid_envs import \
    create_combined_minari_dataset 
from training_rl.offline_rl.offline_policies.offpolicy_rendering import offpolicy_rendering
from training_rl.offline_rl.offline_policies.policy_registry import PolicyName
from training_rl.offline_rl.offline_trainings.offline_training import offline_training
from training_rl.offline_rl.offline_trainings.policy_config_data_class import TrainedPolicyConfig, get_trained_policy_path
from training_rl.offline_rl.offline_trainings.restore_policy_model import restore_trained_offline_policy
from training_rl.offline_rl.utils import widget_list
from training_rl.offline_rl.visualizations.utils import (
    get_state_action_data_and_policy_grid_distributions, snapshot_env)
from training_rl.offline_rl.utils import load_buffer_minari
from training_rl.offline_rl.generate_custom_minari_datasets.generate_minari_dataset_grid_envs import MinariDatasetConfig
from training_rl.offline_rl.visualizations.utils import policy_rollout_torcs_env, compare_policy_decisions_vs_expert_suggestions
from training_rl.offline_rl.generate_custom_minari_datasets.generate_minari_dataset_grid_envs import create_minari_datasets

warnings.filterwarnings("ignore")
register_grid_envs()
render_mode = RenderMode.RGB_ARRAY_LIST

<img src="_static/images/aai-institute-cover.svg" alt="Snow" style="width:100%;">
<div class="md-slide title"> Offline RL algorithms exercises </div>

Offline RL pipeline:

<img src="_static/images/93_offline_RL_pipeline.png" alt="Snow" style="width:50%;">

# Offline RL algorithms exercises

Previously, we discussed that **off-policy methods cannot learn from data efficiently unless a significant amount of data covering a large portion of the environment states is available**. Only in such cases can the agent explore the environment and get feedback similar to what's done in an online approach. However, this scenario is rare and challenging to achieve in realistic applications, which is one of the reasons why we turn to offline RL, where only a small amount of data is available.

We also discussed one of the major issues when applying off-policy methods to collected data: the agent's tendency to go out-of-distribution (o.o.d.). More importantly, once it goes o.o.d., the policy becomes unpredictable, making it impossible to return to the in-distribution region. This unpredictability propagates errors in the policy evaluation process (i.e., the dynamic programming equations), destroying the algorithm's learning capabilities.
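To make the error-propagation argument concrete, here is a minimal tabular sketch. It assumes a hypothetical 4-state chain and an artificially overestimated value for the action that never appears in the data; none of these names or numbers come from the notebook's library. It compares a naive backup that maximizes over all actions with one that only maximizes over actions seen in the data.

```python
import numpy as np

GAMMA = 0.9
N_STATES, N_ACTIONS = 4, 2  # hypothetical chain; actions: 0 = left, 1 = right

# Offline dataset: the behavior policy only ever moves right.
# Each tuple is (state, action, reward, next_state, done).
dataset = [(0, 1, 0.0, 1, False), (1, 1, 0.0, 2, False), (2, 1, 1.0, 3, True)]


def fitted_q(dataset, q_init, restrict_to_data, n_sweeps=50):
    """Repeated Bellman backups using only transitions from the dataset."""
    q = q_init.copy()
    seen = np.zeros_like(q, dtype=bool)
    for s, a, *_ in dataset:
        seen[s, a] = True
    for _ in range(n_sweeps):
        for s, a, r, s_next, done in dataset:
            if done:
                target = r
            else:
                next_q = q[s_next]
                if restrict_to_data:
                    # Only bootstrap from actions that appear in the data at s_next.
                    next_q = next_q[seen[s_next]]
                target = r + GAMMA * next_q.max()
            q[s, a] = target
    return q


q_init = np.zeros((N_STATES, N_ACTIONS))
q_init[:, 0] = 5.0  # assumed approximation error on the unseen action "left"

q_naive = fitted_q(dataset, q_init, restrict_to_data=False)
q_constrained = fitted_q(dataset, q_init, restrict_to_data=True)

print("naive       Q(0, right):", q_naive[0, 1])
print("constrained Q(0, right):", q_constrained[0, 1])
```

In the naive version the wrong value of the unseen action leaks into the values of the actions that do appear in the data, because no transition ever corrects it. The constrained version recovers the discounted return of the data trajectory. Note that the unseen entries stay wrong in both tables, so the greedy action choice has to be restricted to data-supported actions as well.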

## Exercise I

**Similar to our approach in the off-policy notebook (nb_95), we will collect a small amount of expert data and a larger amount of suboptimal data. We will then observe how two offline RL algorithms introduced earlier, BCQ and CQL, can recover the expert policy without going o.o.d. We will compare our results with the imitation learning approach, specifically the BC algorithm, which, as we discussed in the imitation learning section (nb_93), is another viable option when expert data is available.**


In this exercise, we will collect two datasets: one with expert and another with suboptimal data. The goal of the agent will be to get as close as possible to the target.

I - **Expert policy**: collect ~1000 steps

II - **Suboptimal policy**: collect ~2000 steps

### STEP 1: Create the environment

In [None]:
obstacle_selected = widget_list([ObstacleTypes.obstacle_8x8_wall_with_door])

In [None]:
ENV_NAME = CustomEnv.Grid_2D_8x8_discrete

# Grid configuration
OBSTACLE = obstacle_selected.value
INITIAL_STATE = (7, 7)
FINAL_STATE = (0, 7)

env_2D_grid_initial_config = Grid2DInitialConfig(
    obstacles=OBSTACLE,
    initial_state=INITIAL_STATE,
    target_state=FINAL_STATE,
)

env = InitialConfigCustom2DGridEnvWrapper(gym.make(ENV_NAME, render_mode=render_mode), env_config=env_2D_grid_initial_config)
snapshot_env(env)

### STEP 2: Create Minari datasets

**Behavior policies and datasets configurations**

In [None]:
BEHAVIOR_POLICY_I = BehaviorPolicyType.behavior_move_up_from_bottom_5_steps
DATA_SET_IDENTIFIER_I = "_expert_"
NUM_STEPS_I = 1000

BEHAVIOR_POLICY_II = BehaviorPolicyType.behavior_8x8_move_left_with_noise
DATA_SET_IDENTIFIER_II = "_suboptimal_"
NUM_STEPS_II = 2000

In [None]:
policy_selected = widget_list([BEHAVIOR_POLICY_I, BEHAVIOR_POLICY_II])

In [None]:
offpolicy_rendering(
    env_or_env_name=ENV_NAME,
    render_mode=render_mode,
    behavior_policy_name=policy_selected.value,
    env_2d_grid_initial_config=env_2D_grid_initial_config,
    num_frames=100,
)

**Collect data**

In [None]:
config_combined_data = create_combined_minari_dataset(
        env_name=ENV_NAME,
        dataset_identifiers = (DATA_SET_IDENTIFIER_I, DATA_SET_IDENTIFIER_II),
        num_collected_points = (NUM_STEPS_I, NUM_STEPS_II),
        behavior_policy_names = (BEHAVIOR_POLICY_I, BEHAVIOR_POLICY_II),
        combined_dataset_identifier = "combined_data_sets_offline_rl",
        env_2d_grid_initial_config = env_2D_grid_initial_config,
)

dataset_availables = [config_combined_data.data_set_name] + config_combined_data.children_dataset_names
selected_data_set = widget_list(dataset_availables)

### STEP 3: Feed data into replay buffer

In [None]:
buffer_data = load_buffer_minari(selected_data_set.value)
len_buffer = len(buffer_data)

# Compute state-action data distribution
state_action_count_data, _ = get_state_action_data_and_policy_grid_distributions(buffer_data, env)
snapshot_env(env)

#### Data analysis

Note that we have four peaks. The ones at (2,7) and (3,7) come from policy-I, which goes towards the target but stops before reaching it. The other two peaks at (6,0) and (7,0) are produced by policy-II, which drifts the agent to the left with noise. **It is important to notice that the amount of collected data at state (5,7) is very little, but this state is crucial if we want to approach the target.**

What do you think a BC algorithm would do? What about an offline one?

<div style="margin-top: 20px;">
    <div style="display: flex; justify-content: space-between;">
        <div style="width: 100%;">
            <img src="_static/images/nb_96_critical_state.png" alt="Snow" style="width:100%;">
        </div>
        <div style="width: 100%;">
            <img src="_static/images/96_critical_action_states.png" alt="KL divergence" width=80%>
        </div>
    </div>
</div>
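The peaks discussed above are state visitation counts. As a rough sketch of how such counts can be computed and how sparsely covered states can be flagged, assume states are stored as integer `(row, col)` pairs. The helper `state_visit_counts` and the toy data are hypothetical, and the notebook's own `get_state_action_data_and_policy_grid_distributions` may work differently.

```python
import numpy as np


def state_visit_counts(states, grid_shape=(8, 8)):
    """Count how often each grid cell appears in a list of (row, col) states."""
    counts = np.zeros(grid_shape, dtype=int)
    states = np.asarray(states)
    # np.add.at accumulates correctly even when the same cell appears many times.
    np.add.at(counts, (states[:, 0], states[:, 1]), 1)
    return counts


# Toy data (not the notebook's dataset): one pass along column 7 plus extra visits at (2, 7).
toy_states = [(7, 7), (6, 7), (5, 7), (4, 7), (3, 7), (2, 7), (2, 7), (2, 7)]
counts = state_visit_counts(toy_states)

# Cells that were visited, but only once: candidates for "critical" low-coverage states.
rarely_visited = np.argwhere(counts == 1)
print(counts[:, 7])
print(rarely_visited.tolist())
```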

### STEP 4-5: Select offline policies and training

In this part of the exercise you need to: 

a) Restore the policy configurations (through TrainedPolicyConfig) for three offline RL policies, namely **BCQ, CQL and BC**, i.e.:

offline_policy_config = TrainedPolicyConfig( ... )

Take a look at the policy parameters in offline_rl/offline_policies.


b) Train the policies on the **expert data**:

NUM_EPOCHS = 10

BATCH_SIZE = 128

STEP_PER_EPOCH = len_buffer


offline_training( ... )


c) Visualize the policies:

offpolicy_rendering( ... )


**SOLUTION:**

In [None]:
offline_rl_policies = [PolicyName.bcq_discrete, PolicyName.cql_discrete, PolicyName.imitation_learning]
selected_offline_rl_policy = widget_list(offline_rl_policies)

**Training**

In [None]:
NUM_EPOCHS = 5
BATCH_SIZE = 128
STEP_PER_EPOCH = 1.0*len_buffer
NUMBER_TEST_ENVS = 1


offline_policy_config = TrainedPolicyConfig(
    name_expert_data=selected_data_set.value,
    policy_name=selected_offline_rl_policy.value,
    render_mode=render_mode,
    device="cpu"
)

In [None]:
offline_training(
    offline_policy_config=offline_policy_config,
    num_epochs=NUM_EPOCHS,
    number_test_envs=NUMBER_TEST_ENVS,
    step_per_epoch=STEP_PER_EPOCH,
    restore_training=False,
    batch_size=BATCH_SIZE,  # previously defined but never passed to the training call
)

**Restore and visualize trained policy**

In [None]:
available_obstacles = [ ObstacleTypes.obstacle_8x8_wall_with_door]
selected_obstacle = widget_list(available_obstacles)

In [None]:
#SAVED_POLICY_NAME = "policy_best_reward.pth"
SAVED_POLICY_NAME = "policy.pth"
INITIAL_STATE = (7, 7)
FINAL_STATE = (0, 7)

offline_policy_config = TrainedPolicyConfig(
    name_expert_data=selected_data_set.value,
    policy_name=selected_offline_rl_policy.value,
    render_mode=render_mode,
    device="cpu"
)

policy = restore_trained_offline_policy(offline_policy_config)
log_name = os.path.join(selected_data_set.value, selected_offline_rl_policy.value)
log_path = get_trained_policy_path(log_name)
policy.load_state_dict(torch.load(os.path.join(log_path, SAVED_POLICY_NAME), map_location="cpu"))

env.set_new_obstacle_map(selected_obstacle.value.value)
env.set_starting_point(INITIAL_STATE)
env.set_goal_point(FINAL_STATE)
#snapshot_env(env)

offpolicy_rendering(
    env_or_env_name=env,
    render_mode=render_mode,
    policy_model=policy,
    env_2d_grid_initial_config=env_2D_grid_initial_config,
    num_frames=100,
    imitation_policy_sampling=False,
    inline=True
)

### Summary and conclusions

**1 - Are the BCQ and CQL policies able to learn from the expert data?**

**2 - As we saw before, imitation learning is a good option when you have expert data. How does it compare with the offline algorithms?**

## Exercise II 

**Now, we'll explore how BCQ and CQL address the problem of connecting suboptimal trajectories to obtain new ones with higher rewards (the stitching property). We will see how they compare with imitation learning.**

We start again from the previous setup. As before, we create two datasets: one from a policy moving suboptimally from (0,0) to (2,4), and the other from a second policy moving from (4,0) to (7,7). The goal is to find an agent capable of connecting trajectories from both datasets in order to find the optimal path between (2,0) and (2,4).
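The stitching idea can be illustrated with a small tabular sketch. Everything here is a hypothetical stand-in, not the notebook's behavior policies: a long detour trajectory and a short one that share a single state, a reward of 1 on reaching the goal, and Bellman backups restricted to actions that appear in the data.

```python
GAMMA = 0.9
MOVES = {"up": (-1, 0), "down": (1, 0), "right": (0, 1)}
GOAL = (2, 4)


def rollout(start, actions):
    """Turn an action sequence into (state, action, reward, next_state, done) tuples."""
    transitions, state = [], start
    for action in actions:
        d_row, d_col = MOVES[action]
        next_state = (state[0] + d_row, state[1] + d_col)
        done = next_state == GOAL
        transitions.append((state, action, 1.0 if done else 0.0, next_state, done))
        state = next_state
    return transitions


# Long path: down to row 3, right along row 3, then up to the goal (8 steps).
long_path = rollout((0, 0), ["down"] * 3 + ["right"] * 4 + ["up"])
# Short path: straight right along row 2 (4 steps). It shares state (2, 0) with the long path.
short_path = rollout((2, 0), ["right"] * 4)
dataset = long_path + short_path

# Q-table over the (state, action) pairs present in the data only.
q = {(s, a): 0.0 for s, a, *_ in dataset}
for _ in range(20):
    for s, a, r, s_next, done in dataset:
        next_values = [v for (s2, _), v in q.items() if s2 == s_next]
        q[(s, a)] = r if done else r + GAMMA * max(next_values)


def greedy_path(start):
    """Follow the greedy policy, choosing only among actions seen in the data."""
    path, state = [start], start
    while state != GOAL:
        action = max((a for (s, a) in q if s == state), key=lambda a: q[(state, a)])
        state = (state[0] + MOVES[action][0], state[1] + MOVES[action][1])
        path.append(state)
    return path


print(greedy_path((0, 0)))
```

The greedy path starts on the long trajectory and switches to the short one at the shared state, giving a 6-step route that appears in neither dataset on its own. A pure imitation approach has no value estimate to prefer "right" over "down" at the shared state; it can only reproduce the action frequencies found there.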

### STEP 1: Create the environment

**Create the environment**

In [None]:
ENV_NAME = CustomEnv.Grid_2D_8x8_discrete

OBSTACLE = ObstacleTypes.obst_free_8x8
INITIAL_STATE_POLICY_I = (0,0)
INITIAL_STATE_POLICY_II = (2,0)
FINAL_STATE_POLICY = (2, 4)


env_2D_grid_initial_config_I = Grid2DInitialConfig(
    obstacles=OBSTACLE,
    initial_state=INITIAL_STATE_POLICY_I,
    target_state=FINAL_STATE_POLICY,
)

env_2D_grid_initial_config_II = Grid2DInitialConfig(
    obstacles=OBSTACLE,
    initial_state=INITIAL_STATE_POLICY_II,
    target_state=FINAL_STATE_POLICY,
)


env = InitialConfigCustom2DGridEnvWrapper(gym.make(ENV_NAME, render_mode=render_mode),
                                          env_config=env_2D_grid_initial_config_I)
snapshot_env(env)

env = InitialConfigCustom2DGridEnvWrapper(gym.make(ENV_NAME, render_mode=render_mode),
                                          env_config=env_2D_grid_initial_config_II)
snapshot_env(env)

### STEP 2: Create Minari datasets

**Let's study how well offline RL algorithms can deal with the stitching property. We will examine some edge cases to compare them with some of the algorithms we have already studied.**

In [None]:
IDENTIFIER_COMBINED_DATASETS = "_stiching_property_I"

# Dataset I with 2000 collected points
BEHAVIOR_POLICY_I = BehaviorPolicyType.behavior_8x8_grid_deterministic_0_0_to_4_7
DATA_SET_IDENTIFIER_I = "_longer_path"
NUM_STEPS_I = 2000

# Dataset II with 1000 points
#BEHAVIOR_POLICY_II = BehaviorPolicyType.behavior_8x8_eps_greedy_4_0_to_7_7
BEHAVIOR_POLICY_II = BehaviorPolicyType.behavior_move_right
DATA_SET_IDENTIFIER_II = "_short_path"
NUM_STEPS_II = 1000

In [None]:
select_policy_to_render = widget_list([BEHAVIOR_POLICY_I, BEHAVIOR_POLICY_II])

In [None]:
env_2D_grid_initial_config = env_2D_grid_initial_config_I if select_policy_to_render.value == BEHAVIOR_POLICY_I else env_2D_grid_initial_config_II

offpolicy_rendering(
    env_or_env_name=ENV_NAME,
    render_mode=render_mode,
    behavior_policy_name=select_policy_to_render.value,
    env_2d_grid_initial_config=env_2D_grid_initial_config,
    num_frames=100,
)

**Create datasets**

In [None]:
config_combined_data = create_combined_minari_dataset(
        env_name=ENV_NAME,
        dataset_identifiers = (DATA_SET_IDENTIFIER_I, DATA_SET_IDENTIFIER_II),
        num_collected_points = (NUM_STEPS_I, NUM_STEPS_II),
        behavior_policy_names = (BEHAVIOR_POLICY_I, BEHAVIOR_POLICY_II),
        combined_dataset_identifier = "combined_dataset",
        env_2d_grid_initial_config = (env_2D_grid_initial_config_I, env_2D_grid_initial_config_II),
)
buffer_data = load_buffer_minari(config_combined_data.data_set_name)
data_size = len(buffer_data)

In [None]:
dataset_availables = [config_combined_data.data_set_name] + config_combined_data.children_dataset_names
selected_data_set = widget_list(dataset_availables)

### STEP 3: Feed data into replay buffer

In [None]:
buffer_data = load_buffer_minari(selected_data_set.value)
len_buffer = len(buffer_data)

# Compute state-action data distribution
state_action_count_data, _ = get_state_action_data_and_policy_grid_distributions(buffer_data, env, normalized=False)

if "start_0_0" in selected_data_set.value:
    env.set_starting_point((0,0))
    snapshot_env(env)
elif "start_2_0" in selected_data_set.value:
    env.set_starting_point((2,0))
    snapshot_env(env)
    

### STEP 4: Select offline policies and training

In [None]:
offline_rl_policies = [PolicyName.bcq_discrete, PolicyName.cql_discrete]
selected_offline_rl_policy = widget_list(offline_rl_policies)

In [None]:
# Offline training

NUM_EPOCHS = 5
BATCH_SIZE = 128
STEP_PER_EPOCH = 1.0*len_buffer
NUMBER_TEST_ENVS = 1


offline_policy_config = TrainedPolicyConfig(
    name_expert_data=selected_data_set.value,
    policy_name=selected_offline_rl_policy.value,
    render_mode=render_mode,
    device="cpu"
)

In [None]:
offline_training(
    offline_policy_config=offline_policy_config,
    num_epochs=NUM_EPOCHS,
    number_test_envs=NUMBER_TEST_ENVS,
    step_per_epoch=STEP_PER_EPOCH,
    restore_training=False,
    batch_size=BATCH_SIZE,  # previously defined but never passed to the training call
)

##### **Restore and visualize trained policy**

In [None]:
#SAVED_POLICY_NAME = "policy_best_reward.pth"
SAVED_POLICY_NAME = "policy.pth"
INITIAL_STATE = (0, 0)
FINAL_STATE = (2, 4)

offline_policy_config = TrainedPolicyConfig(
    name_expert_data=selected_data_set.value,
    policy_name=selected_offline_rl_policy.value,
    render_mode=render_mode,
    device="cpu"
)

policy = restore_trained_offline_policy(offline_policy_config)
log_name = os.path.join(selected_data_set.value, selected_offline_rl_policy.value)
log_path = get_trained_policy_path(log_name)
policy.load_state_dict(torch.load(os.path.join(log_path, SAVED_POLICY_NAME), map_location="cpu"))

env.set_starting_point(INITIAL_STATE)
env.set_goal_point(FINAL_STATE)
#snapshot_env(env)

offpolicy_rendering(
    env_or_env_name=env,
    render_mode=render_mode,
    policy_model=policy,
    env_2d_grid_initial_config=env_2D_grid_initial_config,
    num_frames=100,
    imitation_policy_sampling=False
)

**Question**: Let's now change the dataset distribution. We'll collect 600 points with the first behavior policy and 100 with the second one. In this case, the probability of taking the suboptimal path will be higher. What paths are chosen by the algorithms?
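When thinking about this question, it may help to recall how the data distribution enters each method. The sketch below assumes the standard discrete BCQ rule: actions whose estimated behavior probability is below a fraction `threshold` of the most likely action's probability are masked before the greedy step. Whether `PolicyName.bcq_discrete` uses exactly this rule and this threshold is an assumption. The counts and Q-values are made up for illustration.

```python
import numpy as np


def bc_action(action_counts):
    """Greedy behavior cloning: pick the action seen most often in this state."""
    return int(np.argmax(action_counts))


def bcq_discrete_action(q_values, action_counts, threshold=0.3):
    """Discrete-BCQ-style choice: argmax of Q over sufficiently likely actions."""
    probs = action_counts / action_counts.sum()
    allowed = probs / probs.max() >= threshold
    masked_q = np.where(allowed, q_values, -np.inf)
    return int(np.argmax(masked_q))


# Hypothetical shared state: action 0 follows the long path, action 1 the short one.
q_values = np.array([0.59, 0.73])

balanced_counts = np.array([75.0, 25.0])  # short-path action is rarer but still supported
skewed_counts = np.array([95.0, 5.0])     # short-path action is almost absent

print(bc_action(balanced_counts), bcq_discrete_action(q_values, balanced_counts))
print(bc_action(skewed_counts), bcq_discrete_action(q_values, skewed_counts))
```

In the first case the rarer action passes the filter, so the value estimate decides. In the second case it is masked, and the constrained policy falls back to the majority action even though its estimated value is lower.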

### Summary

ToDo together.

## Exercise III

Now, let's train a BCQ policy using the TORCS data from our previous DAGGER exercise.

Imitation learning failed to achieve autonomous driving, as it imitated the behavior policy that caused the car to crash after a few meters. Let's test whether the BCQ algorithm is able to avoid the crash.

Note that we won't be able to get results as good as DAGGER's, as it introduced not only expert knowledge but also new states, which makes the problem much easier to solve. However, access to human experts is quite rare in real problems.

In [None]:
os.system("torcs")

#### STEP 1: Create TORCS Environment

In [None]:
ENV_NAME = EnvFactory.torcs

#### STEP 2: Create Minari dataset

**Select TORCS behavior policies and visualize them**

In [None]:
policy_selected_to_visualize = widget_list([
    BehaviorPolicyType.torcs_drunk_driver_policy
])

In [None]:
offpolicy_rendering(
    env_or_env_name=ENV_NAME,
    render_mode=None,
    behavior_policy_name=policy_selected_to_visualize.value,
    num_frames=4000,
)

**Collect dataset**

In [None]:
# Configure the policy
BEHAVIOR_POLICY = policy_selected_to_visualize.value
DATA_SET_IDENTIFIER = "torcs_crash"
NUM_STEPS = 4000

config_torcs_data_set = create_minari_datasets(
    env_name=ENV_NAME,
    dataset_identifier=DATA_SET_IDENTIFIER,
    num_colected_points=NUM_STEPS,
    behavior_policy_name=BEHAVIOR_POLICY,
)

_ = os.system("pkill torcs")

#### STEP 3: Feed dataset to replay buffer 

In [None]:
DATA_SET_NAME = config_torcs_data_set.data_set_name
buffer_data = load_buffer_minari(DATA_SET_NAME)

#### STEP 4-5: Select offline policy and training

**Select the policy**

In [None]:
NUM_EPOCHS = 2
BATCH_SIZE = 128
NUMBER_TEST_ENVS = 1
PERCENTAGE_DATA_PER_EPOCH = 1.0
DEVICE = "cuda"

OFFLINE_POLICY_NAME = PolicyName.bcq_continuous
TRAINED_POLICY_NAME = "policy_bcq.pt"

**Offline RL policy metadata config**

In [None]:
offline_policy_config = TrainedPolicyConfig(
    name_expert_data=DATA_SET_NAME,
    policy_name= OFFLINE_POLICY_NAME,
    device=DEVICE
)

**Training**

In [None]:
offline_training(
    offline_policy_config=offline_policy_config,
    num_epochs=NUM_EPOCHS,
    number_test_envs=NUMBER_TEST_ENVS,
    step_per_epoch=PERCENTAGE_DATA_PER_EPOCH * len(buffer_data),
    restore_training=False,
    batch_size=BATCH_SIZE,
    policy_name=TRAINED_POLICY_NAME,
)

**Visualize trained policy**

In [None]:
trained_policy_selected = widget_list([TRAINED_POLICY_NAME, "policy_best_reward.pth"])

In [None]:
trained_bcq_policy = restore_trained_offline_policy(offline_policy_config)
log_name = os.path.join(DATA_SET_NAME, OFFLINE_POLICY_NAME)
log_path = get_trained_policy_path(log_name)
trained_bcq_policy.load_state_dict(torch.load(str(os.path.join(log_path, trained_policy_selected.value)), map_location="cpu"))
trained_bcq_policy

offpolicy_rendering(
    env_or_env_name="torcs",
    render_mode=None,
    policy_model=trained_bcq_policy,
    num_frames=4000,
)

**Analysis of results: Let's compare the actions taken by the BCQ policy against the ones that would have been taken by the expert.** 

In [None]:
NUM_STEPS_ROLLOUT = 4500 

output_initial_phase = policy_rollout_torcs_env(
    driver_policy=trained_bcq_policy,
    advisor_policy=BehaviorPolicyType.torcs_expert_policy,
    num_steps=NUM_STEPS_ROLLOUT,
)

compare_policy_decisions_vs_expert_suggestions(
    policy_actions=output_initial_phase["actions_driver"],
    expert_suggestions=output_initial_phase["actions_advisor"],
)

As we can see, the BCQ decisions are very close to the expert ones!
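To put a number on "very close", one option is a simple agreement summary between the two action sequences. This is a sketch under the assumption that both sequences can be flattened to equal-length arrays of scalar actions. The helper `action_agreement` and the tolerance are hypothetical and not part of the notebook's library.

```python
import numpy as np


def action_agreement(policy_actions, expert_actions, tolerance=0.05):
    """Mean absolute error and fraction of steps within `tolerance` of the expert."""
    policy_actions = np.asarray(policy_actions, dtype=float).ravel()
    expert_actions = np.asarray(expert_actions, dtype=float).ravel()
    abs_err = np.abs(policy_actions - expert_actions)
    return {
        "mae": float(abs_err.mean()),
        "within_tol": float((abs_err <= tolerance).mean()),
    }


# Toy example with made-up steering values.
summary = action_agreement([0.0, 0.1, -0.2], [0.0, 0.0, -0.2])
print(summary)
```

With the rollout output above, the call would look like `action_agreement(output_initial_phase["actions_driver"], output_initial_phase["actions_advisor"])`, provided the two entries have matching shapes.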

**Let's take a look at the BCQ vs. data action distributions.**

In [None]:
num_bins = 60
actions_bcq_policy = output_initial_phase["actions_driver"]
_ = plt.hist(actions_bcq_policy, bins=num_bins, alpha=0.5, label='bcq actions')
_ = plt.hist(buffer_data.act, bins=num_bins, alpha=0.5, label='data actions')
plt.legend()
plt.show()

**Note that BCQ goes slightly o.o.d. to find the optimal path but, as we discussed before, it does not produce totally different actions. It essentially checks whether actions similar to the collected ones can bring the agent towards higher rewards.**
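The "similar actions" mechanism can be sketched as follows, assuming the usual continuous BCQ action-selection step: candidate actions come from a generative model of the data, each is shifted by a bounded perturbation, and the critic picks the best one. The candidates, the raw perturbation outputs and the toy critic below are hypothetical placeholders for the trained networks.

```python
import numpy as np


def bcq_select_action(candidates, perturbation, q_function, phi=0.05, max_action=1.0):
    """Pick the best candidate after a perturbation bounded by phi * max_action."""
    # tanh keeps each shift inside [-phi * max_action, phi * max_action].
    bounded_shift = phi * max_action * np.tanh(perturbation)
    perturbed = np.clip(candidates + bounded_shift, -max_action, max_action)
    best = int(np.argmax(q_function(perturbed)))
    return float(perturbed[best])


# Hypothetical steering candidates, standing in for samples from the generative model.
candidates = np.array([-0.02, 0.0, 0.03])
# Hypothetical raw outputs of the perturbation network for each candidate.
perturbation = np.array([3.0, 0.0, 3.0])


def toy_critic(actions):
    """Toy critic that prefers a steering value of 0.08, outside the candidate range."""
    return -(actions - 0.08) ** 2


action = bcq_select_action(candidates, perturbation, toy_critic)
print(action)
```

The selected action lies outside the range of the candidates, but by at most `phi * max_action`. That bound is what keeps the policy close to the data while still allowing small improvements.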

**Finally, let's see which states correspond to the o.o.d. actions.**

In [None]:
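actions_bcq = np.array(actions_bcq_policy)
# Two action bands treated as o.o.d. here (compare with the histogram above)
mask = ((actions_bcq > -0.1) & (actions_bcq < -0.05)) | ((actions_bcq > 0.05) & (actions_bcq < 0.1))
# Rollout step indices at which BCQ took an action inside those bands
indx_ood_states = np.where(mask)[0]
_ = plt.hist(indx_ood_states, bins=100)
plt.title("Histogram of states where BCQ decides to go o.o.d.")
plt.show()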

**You can check that these states are the ones corresponding to the curve where the behavior policy crashed.**
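The o.o.d. bands in the cell above are hard-coded. As an alternative sketch, they could be derived from the data itself by flagging every policy action that falls into a histogram bin containing no data action. The helper `ood_action_mask` is hypothetical, it assumes scalar actions, and the bin count is an arbitrary choice.

```python
import numpy as np


def ood_action_mask(policy_actions, data_actions, num_bins=60):
    """True where a policy action falls in a histogram bin with zero data actions."""
    policy_actions = np.asarray(policy_actions, dtype=float).ravel()
    data_actions = np.asarray(data_actions, dtype=float).ravel()
    lo = min(policy_actions.min(), data_actions.min())
    hi = max(policy_actions.max(), data_actions.max())
    edges = np.linspace(lo, hi, num_bins + 1)
    data_counts, _ = np.histogram(data_actions, bins=edges)
    # Map each policy action to its bin; the right-most edge belongs to the last bin.
    bin_idx = np.clip(np.digitize(policy_actions, edges) - 1, 0, num_bins - 1)
    return data_counts[bin_idx] == 0


# Toy values (not the TORCS data).
data_actions = [-0.02, 0.0, 0.01, 0.03]
policy_actions = [0.0, 0.08, -0.02, 0.09]
mask = ood_action_mask(policy_actions, data_actions, num_bins=10)
print(np.where(mask)[0])
```

Applied to the rollout, `np.where(ood_action_mask(actions_bcq, buffer_data.act))[0]` would give the step indices of the flagged actions, assuming `buffer_data.act` holds one scalar action per step.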

## Exercise IV 

[AdroitHandPen-v1](https://minari.farama.org/datasets/pen/expert/)

## Final remarks

Offline RL proves valuable in various scenarios, for example:

a. Robots that need intelligent behavior in complex open-world environments, which demand extensive training data because robust visual perception is required (complex environment modeling and extensive data collection).

b. Robot grasping tasks, which involve expert data that cannot be accurately simulated, providing an opportunity to assess our BCQ algorithm.

c. Robotic navigation tasks, where offline RL aids in crafting effective navigation policies using real-world data.

d. Autonomous driving, where ample expert data and an offline approach enhance safety.

e. Healthcare applications, where safety is paramount due to the potential serious consequences of inaccurate forecasts.

... and many more.

However, if you have access to an environment with abundant data, online Reinforcement Learning (RL) can be a powerful choice due to its potential for exploration and real-time feedback. Nevertheless, the landscape of RL is evolving, and a data-centric approach is gaining prominence, exemplified by vast datasets like X-Embodiment. It's becoming evident that robots trained with diverse data across various scenarios tend to outperform those solely focused on specific tasks. Furthermore, leveraging multitask trained agents for transfer learning can be a valuable strategy for addressing your specific task at hand.