In [1]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl
%set_random_seed 12

In [2]:
%presentation_style

In [3]:
%load_latex_macros


$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$
$\newcommand{\amax}{{\text{argmax}}}$
$\newcommand{\P}{{\mathbb{P}}}$
$\newcommand{\E}{{\mathbb{E}}}$
$\newcommand{\R}{{\mathbb{R}}}$
$\newcommand{\Z}{{\mathbb{Z}}}$
$\newcommand{\N}{{\mathbb{N}}}$
$\newcommand{\C}{{\mathbb{C}}}$
$\newcommand{\abs}[1]{{ \left| #1 \right| }}$
$\newcommand{\simpl}[1]{{\Delta^{#1} }}$


In [4]:
%autoreload

In [5]:
from training_rl.offline_rl.load_env_variables import load_env_variables
load_env_variables()

import os
import gymnasium as gym
import torch
import warnings

from training_rl.offline_rl.behavior_policies.behavior_policy_registry import BehaviorPolicyType
from training_rl.offline_rl.custom_envs.custom_2d_grid_env.obstacles_2D_grid_register import ObstacleTypes
from training_rl.offline_rl.custom_envs.custom_envs_registration import CustomEnv, RenderMode, register_grid_envs
from training_rl.offline_rl.custom_envs.utils import Grid2DInitialConfig, InitialConfigCustom2DGridEnvWrapper
from training_rl.offline_rl.generate_custom_minari_datasets.generate_minari_dataset_grid_envs import \
    create_combined_minari_dataset 
from training_rl.offline_rl.offline_policies.offpolicy_rendering import offpolicy_rendering
from training_rl.offline_rl.offline_policies.policy_registry import PolicyName
from training_rl.offline_rl.offline_trainings.offline_training import offline_training
from training_rl.offline_rl.offline_trainings.policy_config_data_class import TrainedPolicyConfig, get_trained_policy_path
from training_rl.offline_rl.offline_trainings.restore_policy_model import restore_trained_offline_policy
from training_rl.offline_rl.utils import widget_list
from training_rl.offline_rl.visualizations.utils import (
    get_state_action_data_and_policy_grid_distributions, snapshot_env)
from training_rl.offline_rl.utils import load_buffer_minari, state_action_histogram
from training_rl.offline_rl.generate_custom_minari_datasets.generate_minari_dataset_grid_envs import MinariDatasetConfig

warnings.filterwarnings("ignore")
register_grid_envs()
render_mode = RenderMode.RGB_ARRAY_LIST if os.environ.get("DISPLAY") else None

<img src="_static/images/aai-institute-cover.svg" alt="Snow" style="width:100%;">
<div class="md-slide title"> Offline RL algorithms exercises </div>

Offline RL pipeline:

<img src="_static/images/93_offline_RL_pipeline.png" alt="Snow" style="width:50%;">

# Offline RL algorithms exercises

Previously, we discussed that **off-policy methods cannot learn efficiently from collected data unless the dataset is large and covers a large portion of the environment's states**. Only then does the data give the agent feedback comparable to what it would get by exploring the environment online. Such coverage is rare and hard to achieve in realistic applications, which is one of the reasons we turn to offline RL: it targets the case where only a limited amount of data is available.

We also discussed one of the major issues when applying off-policy methods to collected data: the agent's tendency to go out-of-distribution (o.o.d.). More importantly, once it goes o.o.d., the policy becomes unpredictable, and the agent may be unable to return to the in-distribution region. This unpredictability propagates errors through the policy evaluation step (i.e., the dynamic programming equations), hindering the algorithm's ability to learn.
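
To make the error propagation concrete, here is a minimal sketch on a toy problem that is not part of the course code. It assumes a single state with two actions, a dataset that contains only action 0 (reward 1, self-loop), and a spuriously high initial Q-value for the unseen action 1, as a function approximator might produce by extrapolation. The function name `fitted_q` and all numbers are illustrative assumptions.

```python
import numpy as np

GAMMA = 0.9


def fitted_q(q_init, restrict_to_data_actions, n_sweeps=200):
    """Repeat the Bellman backup for the single logged transition (s, a=0, r=1, s'=s)."""
    q = np.array(q_init, dtype=float)
    seen_actions = [0]  # the dataset never contains action 1
    for _ in range(n_sweeps):
        # Naive backup: max over all actions, including the unseen one.
        # Constrained backup: max only over actions present in the data.
        candidates = q[seen_actions] if restrict_to_data_actions else q
        q[0] = 1.0 + GAMMA * candidates.max()
    return q


# Q(s, 1) = 50 is an arbitrary extrapolation error; no data ever corrects it.
q_init = [0.0, 50.0]
naive = fitted_q(q_init, restrict_to_data_actions=False)
constrained = fitted_q(q_init, restrict_to_data_actions=True)

print("naive backup       Q(s, 0):", naive[0])
print("constrained backup Q(s, 0):", constrained[0])
```

In the naive case the wrong value of the unseen action leaks into the value of the logged action through the max operator, while the constrained backup converges to the discounted return 1 / (1 - GAMMA) of the behavior actually observed.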

## Exercise I

**Similar to our approach in the off-policy notebook (nb_95), we will collect a small amount of expert data and a larger amount of suboptimal data. We will then observe how two offline RL algorithms introduced earlier, BCQ and CQL, can recover the expert policy without going o.o.d. We will compare our results with the imitation learning approach, specifically the BC algorithm, which, as we discussed in the imitation learning section (nb_93), is another viable option when expert data is available.**
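
As a reminder of how a BCQ-style policy can stay in distribution, the sketch below shows the action filtering idea for discrete actions: the greedy action is chosen only among actions that the behavior policy takes with a probability that is not too small relative to its most likely action. The function `bcq_greedy_action`, the `threshold` argument and the numbers are assumptions made for illustration; the actual implementation used in this notebook may differ in its details.

```python
import numpy as np


def bcq_greedy_action(q_values, behavior_probs, threshold=0.3):
    """Pick the best action among those the behavior policy is likely to take."""
    q_values = np.asarray(q_values, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    # Keep actions whose probability relative to the most likely one exceeds the threshold.
    allowed = behavior_probs / behavior_probs.max() > threshold
    # Actions that are filtered out can never win the argmax.
    masked_q = np.where(allowed, q_values, -np.inf)
    return int(masked_q.argmax())


q = [1.0, 5.0, 2.0, 0.5]          # action 1 has a high (possibly wrong) Q-value
probs = [0.6, 0.02, 0.35, 0.03]   # but the data almost never contains it

print("unconstrained greedy action:", int(np.argmax(q)))
print("BCQ-style greedy action:    ", bcq_greedy_action(q, probs))
```

With `threshold=0.0` every action with non-zero probability is allowed and the rule reduces to the usual greedy choice; larger thresholds push the policy toward behavior cloning.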


In this exercise we will collect two datasets, one with expert data and one with suboptimal data, from policies that try to bring the agent from (3,0) to (0,7).

I  - **Suboptimal policy**: collect ~500 steps

II - **Expert policy**: collect ~100 steps

### STEP 1: Create the environment

In [None]:
obstacle_selected = widget_list([ObstacleTypes.obstacle_8x8_top_right, ObstacleTypes.obst_free_8x8])

In [None]:
ENV_NAME = CustomEnv.Grid_2D_8x8_discrete

# Grid configuration
OBSTACLE = obstacle_selected.value
INITIAL_STATE = (7, 0)
FINAL_STATE = (0, 7)

env_2D_grid_initial_config = Grid2DInitialConfig(
    obstacles=OBSTACLE,
    initial_state=INITIAL_STATE,
    target_state=FINAL_STATE,
)

env = InitialConfigCustom2DGridEnvWrapper(gym.make(ENV_NAME, render_mode=render_mode), env_config=env_2D_grid_initial_config)
snapshot_env(env)

### STEP 2: Create Minari datasets

**Behavior policies and datasets configurations**

In [None]:
BEHAVIOR_POLICY_I = BehaviorPolicyType.behavior_move_up_from_bottom_twice
DATA_SET_IDENTIFIER_I = "_suboptimal_for_offline_rl"
NUM_STEPS_I = 500

BEHAVIOR_POLICY_II = BehaviorPolicyType.behavior_8x8_suboptimal_determ_initial_3_0_final_3_7
DATA_SET_IDENTIFIER_II = "_expert_for_offline_rl"
NUM_STEPS_II = 100

In [None]:
policy_selected = widget_list([BEHAVIOR_POLICY_I, BEHAVIOR_POLICY_II])

In [None]:
offpolicy_rendering(
    env_or_env_name=ENV_NAME,
    render_mode=render_mode,
    behavior_policy_name=policy_selected.value,
    env_2d_grid_initial_config=env_2D_grid_initial_config,
    num_frames=100,
)

**Collect data**

In [None]:
config_combined_data = create_combined_minari_dataset(
        env_name=ENV_NAME,
        dataset_identifiers = (DATA_SET_IDENTIFIER_I, DATA_SET_IDENTIFIER_II),
        num_collected_points = (NUM_STEPS_I, NUM_STEPS_II),
        behavior_policy_names = (BEHAVIOR_POLICY_I, BEHAVIOR_POLICY_II),
        combined_dataset_identifier = "combined_data_sets_offline_rl",
        env_2d_grid_initial_config = env_2D_grid_initial_config,
)

dataset_availables = [config_combined_data.data_set_name] + config_combined_data.children_dataset_names
selected_data_set = widget_list(dataset_availables)

### STEP 3: Feed data into replay buffer

In [None]:
buffer_data = load_buffer_minari(selected_data_set.value)
len_buffer = len(buffer_data)

# Compute state-action data distribution
state_action_count_data, _ = get_state_action_data_and_policy_grid_distributions(buffer_data, env)
state_action_histogram(state_action_count_data, title="State-Action data distribution", inset_pos_xy=(-0.1, -0.01))

snapshot_env(env)
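
For intuition about what the state-action histogram measures, the following sketch counts how often each (state, action) pair occurs in a list of transitions. It does not reproduce `get_state_action_data_and_policy_grid_distributions`; the helper `count_state_actions`, the choice of four actions and the toy transitions are assumptions for illustration only.

```python
import numpy as np


def count_state_actions(states, actions, grid_shape=(8, 8), n_actions=4):
    """Return an array counts[row, col, action] with the number of visits."""
    counts = np.zeros((*grid_shape, n_actions), dtype=int)
    rows, cols = np.asarray(states).T
    # np.add.at accumulates correctly even when the same index appears several times.
    np.add.at(counts, (rows, cols, np.asarray(actions)), 1)
    return counts


# Toy transitions: (row, col) states and integer actions.
states = [(7, 0), (6, 0), (6, 0), (5, 0)]
actions = [0, 0, 0, 1]
counts = count_state_actions(states, actions)

coverage = (counts > 0).mean()  # fraction of state-action pairs seen at least once
print("visits of ((6, 0), action 0):", counts[6, 0, 0])
print("state-action coverage:", coverage)
```

A low coverage value is the typical offline RL situation: most state-action pairs are never observed, so any value estimate for them is an extrapolation.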

### STEP 4-5: Select offline policies and train them

In this part of the exercise you need to: 

a) Restore the policy configurations (through TrainedPolicyConfig) for three offline RL policies, namely **BCQ, CQL and BC**, i.e.:

offline_policy_config = TrainedPolicyConfig( ... )

Take a look at the policy parameters in offline_rl/offline_policies.


b) Train the policies on the **expert data**:

NUM_EPOCHS = 10

BATCH_SIZE = 128

STEP_PER_EPOCH = len_buffer


offline_training( ... )


c) Visualize the policies:

offpolicy_rendering( ... )


**SOLUTION:**

In [None]:
offline_rl_policies = [PolicyName.bcq_discrete, PolicyName.cql_discrete, PolicyName.imitation_learning]
selected_offline_rl_policy = widget_list(offline_rl_policies)

**Training**

In [None]:
NUM_EPOCHS = 10
BATCH_SIZE = 128
STEP_PER_EPOCH = 1.0*len_buffer
NUMBER_TEST_ENVS = 1


offline_policy_config = TrainedPolicyConfig(
    name_expert_data=selected_data_set.value,
    policy_name=selected_offline_rl_policy.value,
    render_mode=render_mode,
    device="cpu"
)

offline_training(
    offline_policy_config=offline_policy_config,
    num_epochs=NUM_EPOCHS,
    number_test_envs=NUMBER_TEST_ENVS,
    step_per_epoch=STEP_PER_EPOCH,
    restore_training=False,
    batch_size=BATCH_SIZE,  # use the batch size defined above
)

**Restore and visualize trained policy**

In [None]:
available_obstacles = [ObstacleTypes.obstacle_8x8_top_right, ObstacleTypes.obst_free_8x8]
selected_obstacle = widget_list(available_obstacles)

In [None]:
#SAVED_POLICY_NAME = "policy_best_reward.pth"
SAVED_POLICY_NAME = "policy.pth"
INITIAL_STATE = (7, 0)
FINAL_STATE = (0, 7)

offline_policy_config = TrainedPolicyConfig(
    name_expert_data=selected_data_set.value,
    policy_name=selected_offline_rl_policy.value,
    render_mode=render_mode,
    device="cpu"
)

policy = restore_trained_offline_policy(offline_policy_config)
log_name = os.path.join(selected_data_set.value, selected_offline_rl_policy.value)
log_path = get_trained_policy_path(log_name)
policy.load_state_dict(torch.load(os.path.join(log_path, SAVED_POLICY_NAME), map_location="cpu"))

env.set_new_obstacle_map(selected_obstacle.value.value)
env.set_starting_point(INITIAL_STATE)
env.set_goal_point(FINAL_STATE)
#snapshot_env(env)

offpolicy_rendering(
    env_or_env_name=env,
    render_mode=render_mode,
    policy_model=policy,
    env_2d_grid_initial_config=env_2D_grid_initial_config,
    num_frames=100,
    imitation_policy_sampling=False
)

### Summary and conclusions

**1 - Are BCQ and the CQL policies able to learn the expert data?**

**2 - As we saw before, imitation learning is a good option when you have expert data. How does it compare with the offline algorithms?**

**3 - Now roll out the three policies starting from o.o.d. states. What do you observe?**

**4 - Remove the obstacle and roll out the three policies. What do you observe?**

**5 - Now remove the obstacle and use the combined dataset, which includes a fair amount of suboptimal data. What do you notice?**

## Exercise II 

**Now, we'll explore how BCQ and CQL address the issue of connecting suboptimal trajectories to obtain new ones with higher rewards (the stitching property). We will see how they compare with imitation learning.**

We will start again from the previous setup. As before, we will create two datasets: one from a policy moving suboptimally from (0,0) to (7,0), and the other from a policy moving from (4,0) to (7,7). The goal is to find an agent capable of connecting trajectories from both datasets in order to find the optimal path between (0,0) and (7,7).
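
The stitching idea can be illustrated with a tabular toy problem that is independent of the grid environment. In this sketch, trajectory A starts at state 0 and ends in a dead end, trajectory B starts at state 3 and reaches the goal, and both pass through state 2. Dynamic programming restricted to the logged transitions propagates the goal reward back through the shared state. The states, actions, rewards and function names are all assumptions chosen for illustration.

```python
GAMMA = 0.9

# Each transition is (state, action, reward, next_state, done).
trajectory_a = [(0, 0, 0.0, 1, False), (1, 0, 0.0, 2, False), (2, 2, 0.0, 5, True)]
trajectory_b = [(3, 0, 0.0, 2, False), (2, 1, 1.0, 4, True)]  # state 4 is the goal
dataset = trajectory_a + trajectory_b


def q_iteration(dataset, n_sweeps=50):
    """Q-iteration where the max only runs over actions seen in the data."""
    q = {(s, a): 0.0 for s, a, *_ in dataset}
    for _ in range(n_sweeps):
        for s, a, r, s_next, done in dataset:
            next_values = [v for (s2, _), v in q.items() if s2 == s_next]
            bootstrap = 0.0 if done or not next_values else max(next_values)
            q[(s, a)] = r + GAMMA * bootstrap
    return q


def greedy_rollout(q, dataset, start, max_steps=10):
    """Follow the greedy policy, using the logged transitions as a model."""
    model = {(s, a): (s_next, done) for s, a, _, s_next, done in dataset}
    path, state = [start], start
    for _ in range(max_steps):
        options = {a: v for (s, a), v in q.items() if s == state}
        if not options:
            break
        action = max(options, key=options.get)
        state, done = model[(state, action)]
        path.append(state)
        if done:
            break
    return path


q = q_iteration(dataset)
print("logged trajectory A:", [0, 1, 2, 5])
print("greedy path from 0: ", greedy_rollout(q, dataset, start=0))
```

No logged trajectory goes from state 0 to the goal, yet the greedy policy does, because the value of state 2 learned from trajectory B is reused when evaluating the transitions of trajectory A. A pure imitation approach has no such mechanism, since it never compares the outcomes of the two actions logged at state 2.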

### STEP 1: Create the environment

In [None]:
ENV_NAME = CustomEnv.Grid_2D_8x8_discrete

OBSTACLE = ObstacleTypes.obst_free_8x8
INITIAL_STATE_POLICY_I = (0,0)
INITIAL_STATE_POLICY_II = (4,0)
FINAL_STATE = (7, 7)

env_2D_grid_initial_config_I = Grid2DInitialConfig(
    obstacles=OBSTACLE,
    initial_state=INITIAL_STATE_POLICY_I,
    target_state=FINAL_STATE,
)

env_2D_grid_initial_config_II = Grid2DInitialConfig(
    obstacles=OBSTACLE,
    initial_state=INITIAL_STATE_POLICY_II,
    target_state=FINAL_STATE,
)


env = InitialConfigCustom2DGridEnvWrapper(gym.make(ENV_NAME, render_mode=render_mode),
                                          env_config=env_2D_grid_initial_config_I)
snapshot_env(env)

env.set_starting_point(INITIAL_STATE_POLICY_II)

snapshot_env(env)

### STEP 2: Create Minari datasets

In [None]:
IDENTIFIER_COMBINED_DATASETS = "_stiching_property_I"

# Dataset I
BEHAVIOR_POLICY_I = BehaviorPolicyType.behavior_8x8_grid_deterministic_0_0_to_4_7
DATA_SET_IDENTIFIER_I = "_move_downwards"
NUM_STEPS_I = 200

# Dataset II
#BEHAVIOR_POLICY_II = BehaviorPolicyType.behavior_8x8_eps_greedy_4_0_to_7_7
BEHAVIOR_POLICY_II = BehaviorPolicyType.behavior_8x8_deterministic_4_0_to_7_7
DATA_SET_IDENTIFIER_II = "_move_deterministic"
NUM_STEPS_II = 200

In [None]:
select_policy_to_render = widget_list([BEHAVIOR_POLICY_I, BEHAVIOR_POLICY_II])

In [None]:
env_2D_grid_initial_config = env_2D_grid_initial_config_I if select_policy_to_render.value == BEHAVIOR_POLICY_I else env_2D_grid_initial_config_II

offpolicy_rendering(
    env_or_env_name=ENV_NAME,
    render_mode=render_mode,
    behavior_policy_name=select_policy_to_render.value,
    env_2d_grid_initial_config=env_2D_grid_initial_config,
    num_frames=100,
)

**Create datasets**

In [None]:
config_combined_data = create_combined_minari_dataset(
        env_name=ENV_NAME,
        dataset_identifiers = (DATA_SET_IDENTIFIER_I, DATA_SET_IDENTIFIER_II),
        num_collected_points = (NUM_STEPS_I, NUM_STEPS_II),
        behavior_policy_names = (BEHAVIOR_POLICY_I, BEHAVIOR_POLICY_II),
        combined_dataset_identifier = "combined_dataset",
        env_2d_grid_initial_config = (env_2D_grid_initial_config_I, env_2D_grid_initial_config_II),
)
buffer_data = load_buffer_minari(config_combined_data.data_set_name)
data_size = len(buffer_data)

dataset_availables = [config_combined_data.data_set_name] + config_combined_data.children_dataset_names
selected_data_set = widget_list(dataset_availables)

### STEP 3: Feed data into replay buffer

In [None]:
buffer_data = load_buffer_minari(selected_data_set.value)
len_buffer = len(buffer_data)

# Compute state-action data distribution
state_action_count_data, _ = get_state_action_data_and_policy_grid_distributions(buffer_data, env)
state_action_histogram(state_action_count_data, title="State-Action data distribution", inset_pos_xy=(-0.1, -0.03))

if "start_0_0" in selected_data_set.value:
    env.set_starting_point((0,0))
    snapshot_env(env)
elif "start_4_0" in selected_data_set.value:
    env.set_starting_point((4,0))
    snapshot_env(env)
    

### STEP 4: Select offline policies and training

In [None]:
offline_rl_policies = [PolicyName.bcq_discrete, PolicyName.cql_discrete, PolicyName.imitation_learning]
selected_offline_rl_policy = widget_list(offline_rl_policies)

In [None]:
# Offline training

NUM_EPOCHS = 10
BATCH_SIZE = 128
STEP_PER_EPOCH = 1.0*len_buffer
NUMBER_TEST_ENVS = 1


offline_policy_config = TrainedPolicyConfig(
    name_expert_data=selected_data_set.value,
    policy_name=selected_offline_rl_policy.value,
    render_mode=render_mode,
    device="cpu"
)

#offline_policy_config.policy_config["unlikely_action_threshold"]=0.6
#offline_policy_config.policy_config["min_q_weight"]=15.0
#offline_policy_config.policy_config["num_quantiles"]=5

offline_training(
    offline_policy_config=offline_policy_config,
    num_epochs=NUM_EPOCHS,
    number_test_envs=NUMBER_TEST_ENVS,
    step_per_epoch=STEP_PER_EPOCH,
    restore_training=False,
    batch_size=BATCH_SIZE,  # use the batch size defined above
)

**Restore and visualize trained policy**

In [None]:
#SAVED_POLICY_NAME = "policy_best_reward.pth"
SAVED_POLICY_NAME = "policy.pth"
INITIAL_STATE = (0, 0)
FINAL_STATE = (7, 7)

offline_policy_config = TrainedPolicyConfig(
    name_expert_data=selected_data_set.value,
    policy_name=selected_offline_rl_policy.value,
    render_mode=render_mode,
    device="cpu"
)

policy = restore_trained_offline_policy(offline_policy_config)
log_name = os.path.join(selected_data_set.value, selected_offline_rl_policy.value)
log_path = get_trained_policy_path(log_name)
policy.load_state_dict(torch.load(os.path.join(log_path, SAVED_POLICY_NAME), map_location="cpu"))

env.set_starting_point(INITIAL_STATE)
env.set_goal_point(FINAL_STATE)
#snapshot_env(env)

offpolicy_rendering(
    env_or_env_name=env,
    render_mode=render_mode,
    policy_model=policy,
    env_2d_grid_initial_config=env_2D_grid_initial_config,
    num_frames=100,
    imitation_policy_sampling=False
)

**What do you observe? Try increasing the number of expert samples and run it again. What happens now?**

**As we can see, the BCQ algorithm is able to stitch two trajectories together to create an optimal one.**

**Now try to do the same with CQL and compare the results.**
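
When interpreting the CQL results, it can help to recall the shape of its conservative regularizer for discrete actions: it pushes down the Q-values of all actions (through a log-sum-exp) and pushes up the Q-value of the action found in the data. The sketch below computes only this penalty term with NumPy; the function name `cql_penalty` and the example batch are assumptions, and a weight such as the commented-out `min_q_weight` above presumably scales a term of this kind in the full loss.

```python
import numpy as np


def cql_penalty(q_values, data_actions):
    """Mean over the batch of logsumexp_a Q(s, a) - Q(s, a_data)."""
    q_values = np.asarray(q_values, dtype=float)
    data_actions = np.asarray(data_actions)
    # Numerically stable log-sum-exp over the action dimension.
    q_max = q_values.max(axis=1, keepdims=True)
    logsumexp = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))
    # Q-value of the action that was actually logged for each state.
    q_data = q_values[np.arange(len(q_values)), data_actions]
    return float((logsumexp - q_data).mean())


q_batch = [[1.0, 1.0, 1.0, 1.0],
           [0.0, 0.0, 8.0, 0.0]]

# The high Q-value in the second row belongs to the logged action: small penalty.
in_distribution = cql_penalty(q_batch, [0, 2])
# The high Q-value belongs to an action that is not in the data: large penalty.
out_of_distribution = cql_penalty(q_batch, [0, 0])
print(in_distribution, out_of_distribution)
```

The penalty is never negative and grows when large Q-values sit on actions that the dataset does not support, which is what keeps the learned policy close to the data.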

## Exercise III

Now, let's train a BCQ policy using the TORCS data from our previous DAGGER exercise.

Imitation learning failed to achieve autonomous driving, as it imitated the behavior policy that caused the car to crash after a few meters. Let's test whether the BCQ algorithm can successfully navigate the entire race track.

Note that we won't be able to get results as good as with DAGGER, which introduced not only expert knowledge but also new states, making the problem much easier to solve. However, access to human experts is quite rare in real problems.

In [None]:
os.system("torcs")

**Select the dataset**

In [7]:
collected_datasets_names = ["torcs-data-v0"]
collected_datasets = widget_list(collected_datasets_names)

Dropdown(options=('torcs-data-v0',), value='torcs-data-v0')

**Select the policy**

In [10]:
NUM_EPOCHS = 13
BATCH_SIZE = 128
NUMBER_TEST_ENVS = 1
EXPLORATION_NOISE = True
SEED = None  # 1626
PERCENTAGE_DATA_PER_EPOCH = 1.0
DEVICE = "cuda"

OFFLINE_POLICY_NAME = PolicyName.bcq_continuous
DATA_SET_NOISY_NAME = collected_datasets.value
TRAINED_POLICY_NAME = "policy_bcq.pt"

**Feed data into replay buffer**

In [11]:
buffer_data = load_buffer_minari(DATA_SET_NOISY_NAME)
data_config = MinariDatasetConfig.load_from_file(DATA_SET_NOISY_NAME)

offline_policy_config = TrainedPolicyConfig(
    name_expert_data=DATA_SET_NOISY_NAME,
    policy_name= OFFLINE_POLICY_NAME,
    device=DEVICE
)

Dataset /home/ivan/Documents/GIT_PROJECTS/tfl-training-rl/src/training_rl/offline_rl/data/offline_data/torcs-data-v0 downloaded. number of episodes: 2


  gym.logger.warn(f"Box bound precision lowered by casting to {self.dtype}")


**Training**

In [13]:
offline_training(
    offline_policy_config=offline_policy_config,
    num_epochs=NUM_EPOCHS,
    number_test_envs=NUMBER_TEST_ENVS,
    step_per_epoch=PERCENTAGE_DATA_PER_EPOCH * len(buffer_data),
    restore_training=False,
    batch_size=BATCH_SIZE,
    policy_name=TRAINED_POLICY_NAME,
)

**Visualize trained policy**

In [18]:
trained_policy_selected = widget_list(["policy_best_reward.pth", TRAINED_POLICY_NAME])

In [19]:
trained_bcq_policy = restore_trained_offline_policy(offline_policy_config)
log_name = os.path.join(DATA_SET_NOISY_NAME, OFFLINE_POLICY_NAME)
log_path = get_trained_policy_path(log_name)
trained_bcq_policy.load_state_dict(torch.load(str(os.path.join(log_path, trained_policy_selected.value)), map_location="cpu"))
trained_bcq_policy

offpolicy_rendering(
    env_or_env_name="torcs",
    render_mode=None,
    policy_model=trained_bcq_policy,
    num_frames=10000,
)

**Analysis of results: Let's compare the expert, suboptimal and offline-trained policies.**

In [20]:
from training_rl.offline_rl.visualizations.utils import collect_data_from_rollout_torcs_policy
NUM_STEPS_ROLLOUT = 5000 
bcq_driver = collect_data_from_rollout_torcs_policy(
    env_collected_quantity='angle',
    driver_policy=trained_bcq_policy,
    num_steps=NUM_STEPS_ROLLOUT
)

drunk_driver = collect_data_from_rollout_torcs_policy(
    env_collected_quantity='angle',
    driver_policy=BehaviorPolicyType.torcs_drunk_driver_policy,
    num_steps=NUM_STEPS_ROLLOUT
)

expert_driver = collect_data_from_rollout_torcs_policy(
    env_collected_quantity='angle',
    driver_policy=BehaviorPolicyType.torcs_expert_policy,
    num_steps=NUM_STEPS_ROLLOUT
)

In [21]:
import matplotlib.pyplot as plt

def plot_multiple_lines(expert_data, bcq_data, drunk_data, obs_name="actions_driver"):
    x = range(len(expert_data[obs_name]))
    plt.plot(x, expert_data[obs_name], label='Expert Driver')
    plt.plot(x, bcq_data[obs_name], label='BCQ Driver')
    plt.plot(x, drunk_data[obs_name], label='Drunk Driver')
    plt.xlabel('Time Steps')
    plt.ylabel(obs_name)
    plt.title(f'Comparison of Drivers {obs_name}')
    plt.legend()
    plt.show()
    
plot_multiple_lines(expert_driver, bcq_driver, drunk_driver)

In [22]:
lidar_value = 1

expert_lidars_left = [lidar[lidar_value] for lidar in expert_driver["observations"]]
bcq_lidars_left = [lidar[lidar_value] for lidar in bcq_driver["observations"]]
drunk_lidars_left = [lidar[lidar_value] for lidar in drunk_driver["observations"]]

def hist_comparison(data_list, labels, x_title, bins=20, alpha=0.4, colors=None, y_range=None):
    """Overlay normalized histograms of several samples on one set of axes."""
    if colors is None:
        colors = ['blue', 'orange', 'green']

    for i, data in enumerate(data_list):
        plt.hist(data, bins=bins, density=True, alpha=alpha, color=colors[i], label=labels[i])

    # Axis decorations only need to be set once, after all histograms are drawn.
    plt.xlabel(x_title)
    plt.ylabel('Density')
    plt.title('Histogram Comparison')
    plt.legend()
    if y_range:
        plt.ylim(y_range[0], y_range[1])
    plt.show()


# Labels follow the order of the data: expert first, then BCQ.
data = [expert_lidars_left, bcq_lidars_left]
labels = ["expert", "bcq"]
hist_comparison(data, labels, y_range=[0.0, 10.5], x_title="observations")

data = [expert_driver["actions_driver"], bcq_driver["actions_driver"]]
labels = ["expert", "bcq"]
hist_comparison(data, labels, y_range=[0.0, 10.5], x_title="actions")
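
If a single number is preferred over a visual comparison, one option is the total variation distance between the two normalized histograms: 0 means identical histograms, 1 means no overlap. The sketch below is an optional addition that is not part of the original analysis; the helper `histogram_tv_distance` is hypothetical and the samples are synthetic, not TORCS data.

```python
import numpy as np


def histogram_tv_distance(sample_a, sample_b, bins=20):
    """Total variation distance between histograms built on a shared range."""
    sample_a = np.asarray(sample_a, dtype=float).ravel()
    sample_b = np.asarray(sample_b, dtype=float).ravel()
    # Both histograms must use the same bin edges to be comparable.
    lo = min(sample_a.min(), sample_b.min())
    hi = max(sample_a.max(), sample_b.max())
    hist_a, _ = np.histogram(sample_a, bins=bins, range=(lo, hi))
    hist_b, _ = np.histogram(sample_b, bins=bins, range=(lo, hi))
    p = hist_a / hist_a.sum()
    q = hist_b / hist_b.sum()
    return 0.5 * float(np.abs(p - q).sum())


# Synthetic stand-ins for two drivers' steering actions.
rng = np.random.default_rng(0)
synthetic_expert = rng.normal(loc=0.0, scale=0.1, size=1000)
synthetic_other = rng.normal(loc=0.2, scale=0.1, size=1000)
print(histogram_tv_distance(synthetic_expert, synthetic_other))
```

With the rollouts collected above, the same function could be applied to `expert_driver["actions_driver"]` and `bcq_driver["actions_driver"]`, assuming these are array-like sequences of numbers.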

## Exercise IV 

Robotic hand: Homework.

## Final remarks

Offline RL proves valuable in many scenarios, for example:

a. Robots that must behave intelligently in complex open-world environments, where robust visual perception demands extensive training data (complex environment modeling and extensive data collection).

b. Robot grasping tasks, where expert data cannot be accurately simulated; these also offer an opportunity to assess our BCQ algorithm.

c. Robotic navigation tasks, where offline RL aids in crafting effective navigation policies using real-world data.

d. Autonomous driving, where ample expert data is available and an offline approach enhances safety.

e. Healthcare applications, where safety is paramount because inaccurate predictions can have serious consequences.

... and many more.

However, if you have access to an environment with abundant data, online Reinforcement Learning (RL) can be a powerful choice thanks to its capacity for exploration and real-time feedback. At the same time, the RL landscape is evolving and data-centric approaches are gaining prominence, as exemplified by vast datasets like X-Embodiment. It is becoming evident that robots trained on diverse data across many scenarios tend to outperform those trained solely on a specific task. Furthermore, using multitask-trained agents as a starting point for transfer learning can be a valuable strategy for your own task.