In [None]:
%%capture

%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl

In [None]:
%presentation_style

In [None]:
%%capture

%set_random_seed 12

In [None]:
%load_latex_macros


$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$
$\newcommand{\amax}{{\text{argmax}}}$
$\newcommand{\P}{{\mathbb{P}}}$
$\newcommand{\E}{{\mathbb{E}}}$
$\newcommand{\R}{{\mathbb{R}}}$
$\newcommand{\Z}{{\mathbb{Z}}}$
$\newcommand{\N}{{\mathbb{N}}}$
$\newcommand{\C}{{\mathbb{C}}}$
$\newcommand{\abs}[1]{{ \left| #1 \right| }}$
$\newcommand{\simpl}[1]{{\Delta^{#1} }}$


Introduction to Offline Reinforcement Learning
---

## What kind of data is used in offline RL?

- only actions: BC, inverse RL
- perfect trajectories and rewards: modified BC, critic regularized regression, CQL
- imperfect trajectories: want to improve on behavior policy. BQL, IQL

## Typical problems in offline RL

- Distribution shift -> Show example
- Argmax in learned critic is problematic (like adversarial attack) -> Show example
- Poor behavior policy
- If only actions are collected - what is the reward? Inverse RL is hard and ambiguous
- Poor examples are not collected at all by good policies (e.g., driving a car off the road), which causes major distribution shift. Possible solution: something like DAgger, if access to the env is somehow possible (show a DAgger example?)
- Standard off-policy algorithms don't work well since errors in critic are not corrected by collecting more samples (compare SAC and AWAC)
- ...

## Why offline RL then?
- Might be no other choice (no access to env)
- Way easier to implement (supervised learning, no sampling loop)
- When no clear reward is given (navigating drones, self-driving cars, robot arms, etc.)
- As a pre-training step before an actual RL algo (AWAC and follow-up papers) -> Show example

## Software for offline RL
- d4rl -> minari
- supervised libraries, but also tianshou

## Examples:
- how to collect data
- mujoco, data collected with experts - CRR, BC.
- mujoco, suboptimal policy - BQL, IQL?
- using d4rl/minari datasets

Most reinforcement learning algorithms learn online, which is a significant obstacle to their widespread use. RL typically alternates between gathering experience by interacting with the environment under the latest policy and using that experience to improve the policy. **In many situations, this online interaction is neither practical nor safe because data collection is costly or risky, as in robotics or healthcare.**

**Even in cases where online interaction is viable, there is often a preference to leverage existing data**. This is particularly true in complex domains where substantial datasets are essential for effective generalization.

**Offline Reinforcement Learning (RL), also known as Batch Reinforcement Learning, is a variant of RL that learns from large, previously collected datasets, which makes it attractive for large-scale real-world applications**. Because the dataset is static, the agent performs no online interaction or exploration during training. This is the most significant difference from online reinforcement learning methods.


### What is the problem to be solved?

The offline RL problem can be defined as a data-driven formulation of the online RL problem seen previously. As before, the goal is to maximize the expected discounted reward:

$$
\begin{equation}
J (\pi) = \mathbb{E}_{\tau \sim D}  \left[ \sum_{t = 0}^{\infty} \gamma^t r (s_t, a_t) \right]
\tag{1}
\label{eq:discounted_rew_offline}
\end{equation}
$$

But this is done directly on the dataset $D$:

$$
D = \{(s_0, a_0, r_0), (s_1, a_1, r_1), \ldots, (s_T, a_T, r_T)\} \quad \tag{Dataset}
$$

comprising state/action/reward values. This is in contrast to online RL, where the trajectories $\tau$ are collected through interaction with the environment. **Note that $D$ doesn't necessarily need to be related to the specific task at hand, but it should be sufficiently representative (i.e., containing high-reward regions of the problem). If $D$ is derived from data collected from unrelated tasks, we will need to design our own reward function, just as we did in the online RL setting.**
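To make the objective concrete, here is a minimal sketch of how the discounted return could be estimated from logged data. It assumes, for illustration only, that $D$ is stored as a list of trajectories and that states and actions are plain integers; the names `discounted_return`, `empirical_objective`, and `toy_dataset` are hypothetical and not part of the workshop library.

```python
from typing import List, Tuple

# One logged transition: (state, action, reward). Integers stand in for states/actions.
Transition = Tuple[int, int, float]


def discounted_return(trajectory: List[Transition], gamma: float = 0.99) -> float:
    """Sum of gamma**t * r_t over a single logged trajectory."""
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(trajectory))


def empirical_objective(dataset: List[List[Transition]], gamma: float = 0.99) -> float:
    """Average discounted return over all trajectories stored in the dataset."""
    return sum(discounted_return(tau, gamma) for tau in dataset) / len(dataset)


# Two short, made-up trajectories standing in for a dataset D.
toy_dataset = [
    [(0, 1, 0.0), (1, 1, 0.0), (2, 0, 1.0)],
    [(0, 0, 0.0), (3, 1, 1.0)],
]
value = empirical_objective(toy_dataset, gamma=0.5)
print(value)
```

Note that this number measures the return of the policy that generated the data. Estimating the return of a different policy from the same data is the hard part of offline RL.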

The dataset can be collected by a suboptimal policy, a random policy, a human expert, or a mixture of them. The policy that generated the data is called the behavior policy (or expert policy) and is denoted $\pi_b$.

As an example of a behavior policy ($\pi_b$), when training a robot to navigate a room, $\pi_b$ might involve rules like "avoid obstacles," "move forward," or "stop." It could be operated manually by a human who controls the robot based on sensor data, such as lidar for collision detection, or it could be a combination of suboptimal policies.
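A rule-based behavior policy of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: a hypothetical 5x5 grid with a goal in one corner, a hand-written rule ("move right, then up") mixed with random actions, and a sparse reward for reaching the goal. None of these names come from the workshop library.

```python
import random
from typing import List, Tuple

State = Tuple[int, int]
Transition = Tuple[State, int, float]

GRID_SIZE = 5                 # hypothetical 5x5 grid
GOAL: State = (4, 4)          # hypothetical goal cell
# Action id -> (dx, dy): up, down, right, left
ACTIONS = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}


def step(state: State, action: int) -> State:
    """Move one cell and clip the position to stay inside the grid."""
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), GRID_SIZE - 1)
    y = min(max(state[1] + dy, 0), GRID_SIZE - 1)
    return (x, y)


def behavior_policy(state: State, rng: random.Random, epsilon: float = 0.3) -> int:
    """With probability epsilon act randomly, otherwise follow a simple rule."""
    if rng.random() < epsilon:
        return rng.choice(list(ACTIONS))
    return 2 if state[0] < GOAL[0] else 0  # go right first, then up


def collect(num_episodes: int, max_steps: int, seed: int = 0) -> List[Transition]:
    """Roll out the behavior policy and log (state, action, reward) tuples."""
    rng = random.Random(seed)
    data: List[Transition] = []
    for _ in range(num_episodes):
        state: State = (0, 0)
        for _ in range(max_steps):
            action = behavior_policy(state, rng)
            next_state = step(state, action)
            reward = 1.0 if next_state == GOAL else 0.0
            data.append((state, action, reward))
            state = next_state
            if state == GOAL:
                break
    return data


dataset = collect(num_episodes=20, max_steps=30, seed=12)
print(len(dataset), dataset[:3])
```

Raising `epsilon` makes the behavior policy more random, which widens state-action coverage at the cost of lower-reward trajectories.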

**The goal of offline RL is to derive an optimal or near-optimal policy directly from $D$ without requiring further interactions with the environment**.
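As a rough illustration of what "deriving a policy directly from $D$" can look like, the sketch below runs tabular Q-learning repeatedly over a fixed list of transitions and never calls an environment. It is a toy example, not one of the algorithms covered later. It assumes each logged transition also stores the next state and a termination flag, and the 4-state chain and all names are made up for the example.

```python
import numpy as np


def offline_q_learning(transitions, num_states, num_actions,
                       gamma=0.9, lr=0.5, sweeps=200):
    """Tabular Q-learning that only replays a fixed dataset of transitions."""
    q = np.zeros((num_states, num_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            # Bootstrapped target; the max is taken over actions in the next state.
            target = r if done else r + gamma * q[s_next].max()
            q[s, a] += lr * (target - q[s, a])
    return q


# Hypothetical chain 0 - 1 - 2 - 3: action 1 moves right, action 0 moves left.
# Reaching state 3 yields reward 1 and ends the episode.
transitions = [
    (0, 0, 0.0, 0, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 0, False),
    (1, 1, 0.0, 2, False),
    (2, 0, 0.0, 1, False),
    (2, 1, 1.0, 3, True),
]
q = offline_q_learning(transitions, num_states=4, num_actions=2)
greedy_policy = q.argmax(axis=1)
print(greedy_policy[:3])
```

This works here only because the toy dataset contains every state-action pair of the non-terminal states. When coverage has gaps, the `max` in the target queries Q-values that were never fitted to data, which is one of the core difficulties of offline RL.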

Other important points:

- Offline RL differs from "imitation learning" (discussed later), which is its supervised learning counterpart. **In imitation learning, the policy closely mirrors the behavior policy, while offline RL aims for a superior policy, ideally near the optimal one.**

- Offline RL is harder than it may look because the learned policy induces a state-action distribution that differs from the one in the data generated by the behavior policy, leading to **distributional shift challenges**. Similar challenges may arise in behavioral cloning, discussed later.
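One simple way to get a feeling for distributional shift in a small discrete problem is to count how often each state-action pair appears in the data and then check whether a candidate policy picks actions that were never logged. The sketch below assumes integer states and actions; `visitation_counts`, `out_of_distribution_fraction`, and the toy data are hypothetical names and values chosen for illustration.

```python
import numpy as np


def visitation_counts(pairs, num_states, num_actions):
    """Count how often each (state, action) pair occurs in the logged data."""
    counts = np.zeros((num_states, num_actions), dtype=int)
    for s, a in pairs:
        counts[s, a] += 1
    return counts


def out_of_distribution_fraction(policy_actions, states, counts):
    """Fraction of the given states where the policy picks a never-logged action."""
    unseen = [counts[s, policy_actions[s]] == 0 for s in states]
    return float(np.mean(unseen))


# Logged (state, action) pairs from a made-up behavior policy.
logged = [(0, 1), (0, 1), (1, 1), (1, 1), (2, 1), (2, 0)]
counts = visitation_counts(logged, num_states=3, num_actions=2)

# Hypothetical learned policy: one greedy action per state.
learned_policy = np.array([0, 1, 0])
ood = out_of_distribution_fraction(learned_policy, states=[0, 1, 2], counts=counts)
print(counts)
print(ood)
```

In state 0 the learned policy chooses action 0, which the behavior policy never took there, so any value estimate for that pair is an extrapolation rather than something supported by data.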

Some examples where offline RL could be highly beneficial:
    
**Decision Making in Healthcare**: In healthcare, we can use Markov decision processes to model the diagnosis and treatment of patients. Actions are interventions like tests and treatments, while observations are patient symptoms and test results. Offline RL is safer and more practical than active RL, as treating patients directly with partially trained policies is risky.

**Learning Robotic Manipulation Skills**: In robotics, we can use active RL for skill learning, but generalizing skills across different environments is challenging. Offline RL allows us to reuse previously collected data from various skills to accelerate learning of new skills. For example, making soup with onions and carrots can build on experiences with onions and meat or carrots and cucumbers, reducing the need for new data collection.

**Learning Goal-Directed Dialogue Policies**: Dialogue systems, like chatbots helping users make purchases, can be modeled as MDPs. Collecting data by interacting with real users can be costly. Offline data collection from humans is a more practical approach for training effective conversational agents.

**Autonomous Driving**: Training autonomous vehicles in real-world environments can be dangerous and costly. Offline RL can use data from past driving experiences to improve vehicle control and decision-making, making it safer and more efficient.

**Energy Management**: Optimizing energy consumption in buildings or industrial processes can be a critical task. Offline RL can analyze historical energy usage data to develop efficient control strategies, reducing energy costs and environmental impact.

**Finance**: Portfolio management and trading strategies often require learning from historical financial data. Offline RL can help develop and refine investment policies by utilizing past market data.

In summary, the process is clear:

**Phase A**: Collect a dataset, $D$, of state-action pairs with a behavior policy $\pi_b$: e.g. a robot moving randomly (or controlled by a human) in a given space, data collected from an autonomous vehicle, etc. The data doesn't need to come from an expert (in real situations it typically doesn't), and during this phase we are generally not concerned with a specific task (i.e. with rewards). In fact, the data could come from a robot performing a different task than the one we are interested in. We just want a set of allowed state-action pairs that is usable and representative for the task in mind.

**Phase B**: In this phase we want to solve a given task (so we need to design rewards) using only the data collected in Phase A, without any interaction with the environment, and still find an optimal or near-optimal policy.
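The split between the two phases can be sketched as reward relabeling: the same reward-free state-action pairs from Phase A are annotated with a task-specific reward in Phase B. The snippet below is a minimal illustration, assuming grid positions as states and a reward that depends only on the state. The function names and goal cells are hypothetical.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int]
Pair = Tuple[State, int]
RewardFn = Callable[[State, int], float]


def make_goal_reward(goal: State) -> RewardFn:
    """Phase B: design a reward for one task (here: being at a goal cell)."""
    def reward_fn(state: State, action: int) -> float:
        return 1.0 if state == goal else 0.0
    return reward_fn


def relabel(pairs: List[Pair], reward_fn: RewardFn) -> List[Tuple[State, int, float]]:
    """Attach task rewards to reward-free (state, action) pairs from Phase A."""
    return [(s, a, reward_fn(s, a)) for s, a in pairs]


# Phase A: reward-free pairs logged by some behavior policy (made-up values).
logged_pairs: List[Pair] = [((0, 0), 2), ((1, 0), 0), ((1, 1), 3), ((0, 1), 0)]

# Phase B: two different tasks reuse exactly the same logged data.
task_a = relabel(logged_pairs, make_goal_reward(goal=(1, 1)))
task_b = relabel(logged_pairs, make_goal_reward(goal=(0, 1)))
print(task_a)
print(task_b)
```

Relabeling only helps if the logged pairs actually visit regions that are rewarding under the new task, which is why the data should be representative of the task in mind.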

**Today, we will explore different points in offline Reinforcement Learning and in particular**:

    1. The technical challenges in offline RL.
    
    2. Differences between approaches like imitation learning and online vs. offline RL, 
       and what we can adapt from online methods for the offline setting.

       
    3. Standard data collection approaches and libraries used in the offline RL 
       community.

### For our exercises, we'll primarily use a 2D grid environment for two important reasons:


    1. Training and data collection are fast, which gives you the possibility to 
       play around during the workshop!
       
    
    2. The environment is easy to simplify and customize, so complexity can be introduced in a controlled 
       manner, and you can create your own straightforward behavior policies. This allows a clearer 
       exploration of the core concepts and advantages of offline Reinforcement Learning (RL), which 
       can be quite challenging or almost impossible in high-dimensional spaces.

**Please note that both the provided library (/offline_rl) and the exercises (notebooks and /offline_rl/scripts) can be adapted to more intricate environments and tasks. Don't hesitate to experiment with them on your own. Be patient, though: training in RL can be time-consuming when dealing with complex problems!**

### References

[Levine et al. 2021 - Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems](https://arxiv.org/pdf/2005.01643.pdf).

[Prudencio et al. 2023 - A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems](https://arxiv.org/pdf/2203.01387.pdf).