In [None]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl
%set_random_seed 12

In [None]:
%presentation_style

In [None]:
%load_latex_macros

<img src="_static/images/aai-institute-cover.svg" alt="Snow" style="width:100%;">
<div class="md-slide title"> Offline-RL open source datasets </div>

# Open-source dataset libraries for offline RL

**The goal of offline RL or imitation learning is to learn a policy from a fixed dataset. This approach has gained significant attention because it allows RL methods to utilize vast, pre-collected datasets, somewhat similar to how large datasets have propelled advances in supervised learning.**


These days, gathering data from sensors or cameras has become remarkably accessible in many contexts, including robots, cars, and manufacturing processes. It is therefore crucial to have a standardized way to organize and process this data, and to generate custom data, whether from real machines or simulations, that follows the same standard. Previously, offline RL lacked such a standardized interface: whenever a new algorithm emerged, datasets had to be preprocessed substantially, which was costly for large data collections.

As we will see next, the Minari library was created to fill this gap and to provide a streamlined solution for offline RL data management and preprocessing. It also includes a very interesting collection of datasets.

## Minari Dataset
(successor to D4RL from UC Berkeley/Google Brain)

This library is emerging as the standard in the field. The community previously relied on D4RL and is currently transitioning to Minari. **One of its goals, as mentioned above, is to create a standardized approach for organizing and processing data**.

**However, another goal of the library is to address the gap in representative datasets for offline RL by introducing open-source datasets explicitly crafted for offline scenarios**. These datasets are tailored to meet the critical properties demanded by real-world applications of offline RL and are crucial for benchmarking offline algorithms accurately to measure progress in the field. As we will see, existing benchmarks designed for online RL are not well-suited for the offline RL setting.


Minari provides **datasets collected with random, medium, and expert policies in different environments** (we will explore them in a moment), allowing us to evaluate whether an algorithm can extract meaning from noise.

In particular, the provided datasets focus mainly on the following properties, which often appear in realistic situations:


1 - **Narrow and biased data distributions**: For example, data collected with deterministic policies. Narrow datasets may arise in human demonstrations or in hand-crafted policies. **(not an issue in online RL)**

2 - **Undirected and multitask data**: "Undirected" here means that the data collection wasn't aimed at a specific task, as when recording user interactions on the internet or capturing videos for autonomous driving. The data was gathered without a specific goal in mind, but we aim to use it to solve particular tasks. Offline RL should be able to extract the highest-reward trajectories from this data.

The main purpose is to test how well the offline agent can be used for "trajectory stitching," which involves combining trajectories from different tasks to achieve new objectives, rather than searching for out-of-distribution trajectories.

<img src="_static/images/stiching.png" alt="stich_traj" style="width:200px;">

As seen in the figure, suppose we have collected data from our car only for paths 1-2 (green) and 2-3 (yellow); we should be able to use this data to teach our car to go from 1 to 3. **(not an issue in online RL)**

3 - **Sparse rewards**: Sparse rewards are challenging in online settings due to their close correlation with exploration. In offline RL, we exclusively explore within the dataset, making it an ideal framework to study the algorithm's response to sparse rewards. Note that crafting effective rewards can be challenging, and overly complex rewards may inadvertently push solutions towards suboptimal outcomes. In contrast, designing sparse rewards is often more straightforward as it merely involves specifying the task's success criteria, making it an attractive property to work with.

4 - **Suboptimal data**: Given a clear task, the data may not contain any optimal trajectory. This is a realistic scenario, and the offline agent should still be able to find the best trajectory within the data.

5 - **Non-representable behavior policies**: Non-Markovian behavior policies and partial observability. This happens, for instance, if the data is collected with a classical control algorithm that has access to a window of previous states. **(not an issue in online RL)**


6 - **Realistic domains**: Different MuJoCo tasks, such as robot manipulation or multi-tasking.
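The trajectory-stitching idea from item 2 can be illustrated with a minimal sketch. It assumes a hypothetical three-state toy problem (states 1, 2, 3 as in the figure) and uses plain tabular Q-iteration restricted to the transitions in the dataset. The action labels and the helpers `fit_q_from_dataset` and `greedy_path` are made up for illustration and are not part of Minari or any offline RL library.

```python
# Fixed offline dataset of (state, action, reward, next_state, terminated).
# Episode A only covers 1 -> 2, episode B only covers 2 -> 3.
# No recorded episode ever goes from 1 all the way to 3.
dataset = [
    (1, "stay", 0.0, 1, False),     # suboptimal behavior that is also in the data
    (1, "go_to_2", 0.0, 2, False),  # episode A (green): cut off at 2, not terminated
    (2, "go_to_3", 1.0, 3, True),   # episode B (yellow): reaching 3 is the goal
]


def fit_q_from_dataset(dataset, gamma=0.9, sweeps=50):
    """Tabular Q-iteration that only backs up (state, action) pairs seen in the data."""
    q = {(s, a): 0.0 for s, a, _, _, _ in dataset}
    for _ in range(sweeps):
        for s, a, r, s_next, terminated in dataset:
            # Bootstrap only from actions that the dataset contains in s_next.
            next_values = [v for (state, _), v in q.items() if state == s_next]
            bootstrap = 0.0 if terminated or not next_values else max(next_values)
            q[(s, a)] = r + gamma * bootstrap
    return q


def greedy_path(q, dataset, start, max_steps=10):
    """Follow the greedy in-dataset action, using recorded transitions as the model."""
    model = {(s, a): (s_next, terminated) for s, a, _, s_next, terminated in dataset}
    path, state = [start], start
    for _ in range(max_steps):
        candidates = {a: v for (s, a), v in q.items() if s == state}
        if not candidates:
            break
        action = max(candidates, key=candidates.get)
        state, terminated = model[(state, action)]
        path.append(state)
        if terminated:
            break
    return path


q_values = fit_q_from_dataset(dataset)
stitched_path = greedy_path(q_values, dataset, start=1)
print(stitched_path)
```

Although no single episode connects 1 to 3, the value of reaching 3 propagates back through state 2, so the greedy path visits 1, 2, and 3 using only actions that appear in the data.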

### Quick overview of Minari main functionality

#### Collect Data

<img src="_static/images/nb_91_minari_dataset_collection.png" alt="minari_folder_structure" style="width:100%;">


As shown in the figure above, you have access to a Gymnasium environment interface for your hardware or simulator, which you can utilize with Minari's **DataCollector (gymnasium.Wrapper) class**. This class allows you to collect data in a manner similar to online RL, where episodes are gathered with data points such as observations (obs), rewards (rew), termination flags (terminated), truncation indicators (truncated), and additional information (info) for each time step. The wrapper will also include the actions generated by the behavior policy used during data collection.
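To make the recorded quantities concrete, here is a minimal, standard-library-only sketch of what a data-collecting wrapper does. `ToyLineEnv` and `RecordingWrapper` are hypothetical stand-ins invented for illustration: they only mimic the Gymnasium `reset`/`step` signatures and are not Minari's `DataCollector`, whose actual interface should be taken from the Minari documentation.

```python
import random


class ToyLineEnv:
    """Hypothetical 1-D environment with a Gymnasium-like reset/step interface."""

    def __init__(self, goal=3, max_steps=10):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self, seed=None):
        self.position, self.steps = 0, 0
        return self.position, {}

    def step(self, action):
        self.position += action  # action is -1 (left) or +1 (right)
        self.steps += 1
        terminated = self.position == self.goal
        truncated = not terminated and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}


class RecordingWrapper:
    """Illustrative wrapper that stores every step, one buffer per episode."""

    def __init__(self, env):
        self.env = env
        self.episodes = []

    def reset(self, seed=None):
        obs, info = self.env.reset(seed=seed)
        # A new episode starts with its initial observation.
        self.episodes.append({"observations": [obs], "actions": [], "rewards": [],
                              "terminations": [], "truncations": []})
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        episode = self.episodes[-1]
        episode["observations"].append(obs)
        episode["actions"].append(action)  # action chosen by the behavior policy
        episode["rewards"].append(reward)
        episode["terminations"].append(terminated)
        episode["truncations"].append(truncated)
        return obs, reward, terminated, truncated, info


rng = random.Random(12)
env = RecordingWrapper(ToyLineEnv())
for _ in range(5):
    obs, info = env.reset()
    done = False
    while not done:
        action = rng.choice([-1, 1, 1])  # noisy behavior policy biased towards the goal
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

print(len(env.episodes), "episodes recorded")
```

The collection loop looks exactly like an online RL rollout; the only difference is that the wrapper keeps what it sees. Note that each episode holds one more observation than actions, because the initial observation from `reset` is stored as well.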


Naturally, we are assuming here that in a real application the simulation or hardware is too slow to be practical for online RL; otherwise, online RL would be the way to go. It will, however, be more than fast enough for data collection.

#### Save Datasets

The main methods are:

**DataCollector.create_dataset(env, record_infos=True, max_buffer_steps=100000)**

If you already have a dataset collected somehow, or you want more flexibility, you can use:

**minari.create_dataset_from_buffers(...)**

However, in that case, you will need to preprocess your data as explained in the Minari documentation.

In this part of the workshop, we will always use the DataCollector wrapper as we will be collecting data from our custom environments. However, in real applications where you already have historical data, minari.create_dataset_from_buffers will be the way to go.
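As a rough idea of the preprocessing involved in the buffer-based route, the sketch below groups a flat log of historical transitions into per-episode dictionaries. The log layout and the helper `log_to_episode_buffers` are hypothetical. The key names (`observations`, `actions`, `rewards`, `terminations`, `truncations`) are an assumption about the episode layout Minari expects, and the exact buffer type differs between Minari versions, so check the documentation of your installed version before passing such data to the library.

```python
# Hypothetical flat log of historical data: one row per time step.
# Columns: (episode_id, obs, action, reward, next_obs, terminated, truncated)
log = [
    (0, 0.0, 1, 0.0, 1.0, False, False),
    (0, 1.0, 1, 1.0, 2.0, True, False),    # episode 0 reaches the goal
    (1, 0.0, -1, 0.0, -1.0, False, True),  # episode 1 is cut off by a time limit
]


def log_to_episode_buffers(log):
    """Group flat transition rows into one dictionary of lists per episode."""
    episodes = {}
    for ep_id, obs, action, reward, next_obs, terminated, truncated in log:
        if ep_id not in episodes:
            # The first row of an episode provides the initial observation.
            episodes[ep_id] = {"observations": [obs], "actions": [], "rewards": [],
                               "terminations": [], "truncations": []}
        episode = episodes[ep_id]
        episode["observations"].append(next_obs)
        episode["actions"].append(action)
        episode["rewards"].append(reward)
        episode["terminations"].append(terminated)
        episode["truncations"].append(truncated)
    # Return the episodes ordered by their id.
    return [episodes[ep_id] for ep_id in sorted(episodes)]


buffers = log_to_episode_buffers(log)
for index, episode in enumerate(buffers):
    print(index, len(episode["actions"]), "steps")
```

Keeping terminations and truncations as separate flags matters later on: a truncated episode did not reach a terminal state, so value-based methods should still bootstrap from its last observation.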

#### Combine Datasets

**import minari \
human_dataset = minari.load_dataset('door-human-v0') \
expert_dataset = minari.load_dataset('door-expert-v0') \
combine_dataset = minari.combine_datasets(datasets_to_combine=\[human_dataset, expert_dataset\], new_dataset_id="door-all-v0")**

**combine_dataset.name**

'door-all-v0'

**minari.list_local_datasets()**

dict_keys(['door-all-v0', 'door-human-v0', 'door-expert-v0'])

**We will use this functionality extensively in our exercises, as we will typically combine expert, suboptimal, and noisy data.**

#### Download Minari Datasets

The most important methods are **minari.list_remote_datasets()** and **minari.download_dataset(dataset_id="name_of_minari_dataset")**. These list the publicly available Minari datasets on their GCP storage and download them to your local machine.

**Minari contains other functionalities as well, such as splitting datasets or saving useful metadata, but we will primarily focus on the previous methods.**

**To manipulate the Minari datasets, we will feed them to the ReplayBuffer in the Tianshou library and handle data manipulation using the ReplayBuffer.**
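Conceptually, feeding episodes to a replay buffer means flattening them into individual transitions. The following standard-library sketch shows that conversion under stated assumptions: the episode dictionaries and the helper `episodes_to_transitions` are hypothetical, and the field names (`obs`, `act`, `rew`, `obs_next`, `terminated`, `truncated`) only loosely mirror the naming used by Tianshou. It does not call Tianshou's `ReplayBuffer` API.

```python
def episodes_to_transitions(episodes):
    """Flatten per-episode lists into a list of single-step transition records."""
    transitions = []
    for episode in episodes:
        observations = episode["observations"]
        for t, action in enumerate(episode["actions"]):
            transitions.append({
                "obs": observations[t],
                "act": action,
                "rew": episode["rewards"][t],
                "obs_next": observations[t + 1],  # one more observation than actions
                "terminated": episode["terminations"][t],
                "truncated": episode["truncations"][t],
            })
    return transitions


# Two small hypothetical episodes in a dictionary-of-lists layout.
episodes = [
    {"observations": [0, 1, 2], "actions": [1, 1], "rewards": [0.0, 1.0],
     "terminations": [False, True], "truncations": [False, False]},
    {"observations": [0, -1], "actions": [-1], "rewards": [0.0],
     "terminations": [False], "truncations": [True]},
]

transitions = episodes_to_transitions(episodes)
print(len(transitions), "transitions")
```

Once the data is in this per-transition form, sampling random minibatches for training is straightforward, which is the main service a replay buffer provides.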

#### Dataset structure

Here is the typical Minari dataset folder structure:

<img src="_static/images/nb_91_minari_dataset.png" alt="minari_folder_structure" style="width:300px;">

And this is how the data is saved in the .hdf5 file:

<img src="_static/images/nb_91_minari_dataset_2.png" alt="minari_folder_structure" style="width:300px;">

The Hierarchical Data Format version 5 (HDF5) is an open-source file format that supports large, complex, heterogeneous data. HDF5 uses a "file directory" like structure that allows you to organize data within the file in many different structured ways, similar to organizing files on your computer.

HDF5 also supports optional compression of the stored datasets (for example with gzip), which can result in a smaller overall file size.

A powerful attribute of HDF5 is data slicing, which allows you to extract particular subsets of a dataset for processing. This means that the entire dataset doesn't need to be read into memory (RAM) at once.

Minari datasets are stored in the HDF5 file format using the h5py Python interface. HDF5 structures the data into groups and dataset elements, dividing the recorded step data into episode groups. Additionally, custom metadata can be added to the whole dataset, to each episode group, or to the individual HDF5 datasets comprising each episode group.

#### Minari datasets 

Minari offers a range of datasets, among which are the ADROIT datasets classified into **expert/human/clone** categories:

**Expert data**: Trajectories from a fine-tuned RL policy.

**Human data**: A small number of human demonstrations.

**Clone data**: Obtained by training an imitation policy on the expert and human demonstrations, then running that policy and mixing the resulting data with the demonstrations at a 50-50 ratio.

Let's take a look at [Minari](https://minari.farama.org/main/content/basic_usage/) and, in particular, at the provided datasets.

## RL Unplugged dataset

(DeepMind - Google Brain) [github](https://github.com/google-deepmind/deepmind-research/tree/master/rl_unplugged) and [blog](https://www.deepmind.com/blog/rl-unplugged-benchmarks-for-offline-reinforcement-learning)

It includes a nice set of tasks, but the crucial point is that, in general, all datasets come from behavior policies trained online. The collected data may therefore not be representative of realistic situations, where human experts and non-RL policies are typically used to collect data. Additionally, most of the data comes from medium to expert policies. In summary, while these datasets may not fully reflect reality, they are still valuable for algorithm benchmarking purposes.

## Open X-Embodiment Repository
October 2023 - Partners from 33 academic labs.

This [library](https://robotics-transformer-x.github.io/) introduced the **Open X-Embodiment Repository** that includes a dataset with 22 different robot types for **X-embodiment learning**, i.e. to learn from diverse and large-scale datasets from multiple robots for better transfer learning and improved generalization.

[Let's take a look](https://www.deepmind.com/blog/scaling-up-learning-across-many-different-robot-types)

## References

[J. Fu et al. '2021, D4RL: Datasets for Deep Data-Driven Reinforcement Learning](https://arxiv.org/abs/2004.07219)

[Minari: A dataset API for Offline Reinforcement Learning](https://minari.farama.org/main/content/basic_usage/)

[C. Gulcehre et al. '2021, RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning](https://arxiv.org/abs/2006.13888)

[A. Padalkar et al. '2023, Open X-Embodiment: Robotic Learning Datasets and RT-X Models](https://robotics-transformer-x.github.io/)
