In [None]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl
%set_random_seed 12

In [None]:
%presentation_style

In [None]:
%load_latex_macros

<img src="_static/images/aai-institute-cover.png" alt="Snow" style="width:100%;">
<div class="md-slide title"> Offline-RL open source datasets </div>

# Open Source Datasets for offline RL

**The goal of offline RL or imitation learning is to learn a policy from a fixed dataset. This approach has gained significant attention because it allows RL methods to utilize vast, pre-collected datasets, somewhat similar to how large datasets have propelled advances in supervised learning.**

Challenges:

- Existing benchmarks designed for online RL are not well suited to the offline RL 
  setting.
- RL benchmarks that rely on data generated by partially trained agents, or that are
  environment-dependent, are not robust.

These challenges make it difficult to accurately measure progress in offline RL.

## Minari Dataset
(successor to D4RL, which came from UC Berkeley/Google Brain)

Minari is emerging as the standard in the field: the community, which previously relied on D4RL, is currently transitioning to it. Its aim is to address the lack of representative datasets in offline RL by introducing open-source datasets explicitly crafted for the offline setting. These datasets are shaped by the critical properties that real-world applications of offline RL demand. Minari provides datasets collected with random, medium, and expert policies in some environments, allowing us to evaluate whether an algorithm can extract meaning from noise.

In particular, the provided datasets focus mainly on the following properties, which often appear in realistic situations:


1 - **Narrow and biased data distributions**, e.g. from deterministic policies: narrow datasets may arise from human demonstrations or hand-crafted policies.

2 - **Undirected and multitask data**: Undirected in the sense that the data is not directed towards the specific task one is trying to accomplish, e.g. recordings of user interactions on the internet or videos of a car for autonomous driving. The main purpose is to test how well the offline agent performs "trajectory stitching", i.e. combining trajectories from different tasks to achieve new objectives, instead of searching for out-of-distribution trajectories.

<img src="_static/images/stiching.png" alt="stich_traj" style="width:200px;">

3 - **Sparse rewards**: Sparse rewards are challenging in online settings due to their close correlation with exploration. In offline RL, we exclusively explore within the dataset, making it an ideal framework to study the algorithm's response to sparse rewards.
Note that crafting effective rewards can be challenging, and overly complex rewards may inadvertently push solutions towards suboptimal outcomes. In contrast, designing sparse rewards is often more straightforward as it merely involves specifying the task's success criteria, making it an attractive property to work with.

4 - **Suboptimal data**: Given a clear task, the data may not contain any optimal trajectory. This is a realistic scenario in general, and the offline agent should still be able to find a (possibly suboptimal) solution.

5 - **Non-representable behavior policies**: non-Markovian behavior policies and partial observability, for instance when the data is collected with a classical control algorithm that has access to a window of previous states.


6 - **Realistic domains**: Different MuJoCo tasks, such as robot manipulation or multi-tasking.
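
To make properties 2 to 4 more concrete, here is a minimal sketch of trajectory stitching on a toy problem. Everything in it is hypothetical and chosen for illustration only: the states `A`, `B`, `C`, `D`, `G`, the action names, and the helpers `fitted_q_from_dataset` and `greedy_rollout` are not part of Minari or D4RL. The dataset contains two trajectories, neither of which goes from `A` to the goal `G`, and the reward is sparse (1 only when entering `G`). A tabular Q-iteration restricted to the state-action pairs present in the data can still recover the path `A -> B -> G` by combining pieces of both trajectories.

```python
from collections import defaultdict

# Hypothetical toy dataset: each transition is (state, action, reward, next_state, done).
# Trajectory 1 starts at "A", passes through "B", and ends at "C" without any reward.
# Trajectory 2 starts at "D", passes through "B", and reaches the goal "G".
# The reward is sparse: 1.0 only on the transition that enters the goal.
trajectory_1 = [
    ("A", "right", 0.0, "B", False),
    ("B", "down", 0.0, "C", True),
]
trajectory_2 = [
    ("D", "up", 0.0, "B", False),
    ("B", "right", 1.0, "G", True),
]
dataset = trajectory_1 + trajectory_2


def fitted_q_from_dataset(transitions, gamma=0.9, n_sweeps=50):
    """Tabular Q-iteration that only backs up (state, action) pairs seen in the data."""
    q = defaultdict(float)
    # Actions observed in each state, so the max never queries unseen actions.
    seen_actions = defaultdict(set)
    for state, action, _, _, _ in transitions:
        seen_actions[state].add(action)

    for _ in range(n_sweeps):
        for state, action, reward, next_state, done in transitions:
            if done or not seen_actions[next_state]:
                target = reward
            else:
                target = reward + gamma * max(
                    q[(next_state, a)] for a in seen_actions[next_state]
                )
            q[(state, action)] = target
    return q, seen_actions


def greedy_rollout(q, seen_actions, transitions, start, max_steps=10):
    """Follow the greedy in-dataset policy, replaying the recorded deterministic transitions."""
    model = {(s, a): (s_next, done) for s, a, _, s_next, done in transitions}
    path, state = [start], start
    for _ in range(max_steps):
        if not seen_actions[state]:
            break
        action = max(seen_actions[state], key=lambda a: q[(state, a)])
        state, done = model[(state, action)]
        path.append(state)
        if done:
            break
    return path


q_values, actions_per_state = fitted_q_from_dataset(dataset)
stitched_path = greedy_rollout(q_values, actions_per_state, dataset, start="A")
print(stitched_path)  # ['A', 'B', 'G']
```

The maximum in the backup only ranges over actions that were actually recorded in the next state, which is a crude stand-in for the constraints that practical offline RL algorithms impose to avoid out-of-distribution actions.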

Let's take a look at [Minari](https://minari.farama.org/main/content/basic_usage/)
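
As a rough sketch of what working with a Minari dataset can look like, the snippet below computes a few summary statistics over episodes. The helper names `summarize_episodes` and `summarize_minari_dataset` are hypothetical and introduced here for illustration. The sketch assumes that Minari is installed, that `minari.load_dataset` and `dataset.iterate_episodes()` behave as described in the basic usage page linked above, and that each episode exposes a `rewards` sequence. The dataset ID is deliberately left as a parameter, since the available IDs depend on the Minari version. The summary logic is exercised on stand-in episodes so that it runs without downloading anything.

```python
from types import SimpleNamespace


def summarize_episodes(episodes):
    """Return (number of episodes, total steps, mean undiscounted return).

    Makes a single pass, so it also works when `episodes` is a generator.
    Only assumes that each episode exposes a `rewards` sequence.
    """
    n_episodes, n_steps, total_return = 0, 0, 0.0
    for episode in episodes:
        n_episodes += 1
        n_steps += len(episode.rewards)
        total_return += float(sum(episode.rewards))
    mean_return = total_return / n_episodes if n_episodes else 0.0
    return n_episodes, n_steps, mean_return


def summarize_minari_dataset(dataset_id):
    """Load a Minari dataset (downloading it if necessary) and summarize it.

    Assumption: requires `pip install minari` and network access on first use;
    `dataset_id` must be an ID taken from Minari's dataset listing.
    """
    import minari  # imported lazily so the helper above works without Minari

    dataset = minari.load_dataset(dataset_id, download=True)
    return summarize_episodes(dataset.iterate_episodes())


# Stand-in episodes so the summary logic can be checked without downloading data.
fake_episodes = [
    SimpleNamespace(rewards=[0.0, 0.0, 1.0]),
    SimpleNamespace(rewards=[0.0, 1.0]),
]
summary = summarize_episodes(fake_episodes)
print(summary)  # (2, 5, 1.0)
```

Comparing the mean return across the random, medium, and expert variants of the same environment is one simple way to see how much the quality of the behavior policy differs between datasets.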



## RL Unplugged dataset

(DeepMind - Google Brain) [website](https://www.deepmind.com/blog/rl-unplugged-benchmarks-for-offline-reinforcement-learning) and [blog](https://www.deepmind.com/blog/rl-unplugged-benchmarks-for-offline-reinforcement-learning)

It includes a nice set of tasks, but the crucial point is that all datasets come from behavior policies trained online, so the collected data may not be representative of realistic situations where human experts and non-RL policies are typically used to collect data. Additionally, most of the data comes from medium to expert policies.

## Open X-Embodiment Repository
October 2023 - Partners from 33 academic labs.

This [library](https://robotics-transformer-x.github.io/) introduced the **Open X-Embodiment Repository**, which includes a dataset covering 22 different robot types for **X-embodiment learning**, i.e. learning from diverse, large-scale datasets from multiple robots for better transfer learning and improved generalization.

[Let's take a look](https://www.deepmind.com/blog/scaling-up-learning-across-many-different-robot-types)

### References

[J. Fu et al. 2021, "D4RL: Datasets for Deep Data-Driven Reinforcement Learning"](https://arxiv.org/abs/2004.07219)

[Minari: A dataset API for Offline Reinforcement Learning](https://minari.farama.org/main/content/basic_usage/)

[C. Gulcehre et al. 2021, "RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning"](https://arxiv.org/abs/2006.13888)

[A. Padalkar et al. 2023, "Open X-Embodiment: Robotic Learning Datasets and RT-X Models"](https://robotics-transformer-x.github.io/)
