## Using Implicit Q-learning (an offline RL algorithm) in Pearl

Here is a [better rendering](https://nbviewer.org/github/facebookresearch/Pearl/blob/main/tutorials/sequential_decision_making/Implicit_Q_learning.ipynb) of this notebook on [nbviewer](https://nbviewer.org/).

- The purpose of this tutorial is to illustrate how users can use Pearl's implementation of Implicit Q-learning (IQL), an offline RL algorithm.

- This example illustrates offline learning for continuous control using offline data collected from OpenAI Gym's HalfCheetah environment.


In [1]:
%load_ext autoreload
%autoreload 2

# Pearl Installation

If you haven't installed Pearl, run the following cell to install it. Otherwise, you can skip it.

In [2]:
# Pearl installation from GitHub. This install also includes PyTorch, Gym and Matplotlib.

%pip uninstall Pearl -y
%rm -rf Pearl
!git clone https://github.com/facebookresearch/Pearl.git
%cd Pearl
%pip install .
%cd ..


In [3]:
import os
import pickle
import torch


from pearl.pearl_agent import PearlAgent
from pearl.utils.functional_utils.experimentation.set_seed import set_seed
from pearl.utils.instantiations.environments.gym_environment import GymEnvironment
from pearl.neural_networks.sequential_decision_making.actor_networks import VanillaContinuousActorNetwork
from pearl.policy_learners.sequential_decision_making.implicit_q_learning import ImplicitQLearning

from pearl.utils.functional_utils.experimentation.create_offline_data import (
    get_data_collection_agent_returns,
)

from pearl.utils.functional_utils.train_and_eval.offline_learning_and_evaluation import (
    get_offline_data_in_buffer,
    offline_evaluation,
    offline_learning,
)

set_seed(0)

# Specify some environment details

In [4]:
experiment_seed = 100

# We choose a continuous control environment from MuJoCo called 'Half-Cheetah'.
env_name = "HalfCheetah-v4"
env = GymEnvironment(env_name)
action_space = env.action_space
is_action_continuous = True



# Download offline data

- We have offline data at https://github.com/facebookresearch/Pearl/tree/gh-pages/data which will be used for this tutorial.

- This dataset was collected using the replay buffer of an experiment in which the soft actor-critic (SAC) algorithm was used for policy learning, along with a large entropy parameter for exploration. Users can think of it as analogous to a 'medium replay' dataset in D4RL (https://github.com/Farama-Foundation/D4RL).

- Note: We have not integrated Pearl with D4RL as it is being deprecated and a new library called Minari is being developed. We do plan to integrate Pearl with Minari at a later time.

In [5]:
import requests

# Download the dataset of transition tuples from the GitHub URL and store it in a local file
url = "https://github.com/facebookresearch/Pearl/raw/gh-pages/data/offline_rl_data/HalfCheetah/offline_raw_transitions_dict_small_2.pt"
filename = "offline_raw_transitions_dict_small_2.pt"    # local file with the dataset of transition tuples
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
with open(filename, "wb") as f:
    f.write(response.content)

In [6]:
cwd = os.getcwd()   # current working directory
data_path = cwd + '/' + filename    # change this in case you have a file with the dataset of transition tuples already stored in a local path.

# By default, the offline data replay buffer is stored on the CPU; see `get_offline_data_in_buffer` for device management.
offline_data_replay_buffer = get_offline_data_in_buffer(
    is_action_continuous=is_action_continuous,
    url=None,
    data_path=data_path,  # path to local file which contains the offline dataset
)

# Set up an IQL agent

- Note that here `env` and `action_space` are for the HalfCheetah environment as set above.


In [7]:
offline_agent = PearlAgent(
    policy_learner=ImplicitQLearning(
        state_dim=env.observation_space.shape[0],
        action_space=action_space,
        actor_hidden_dims=[256, 256],
        critic_hidden_dims=[256, 256],
        value_critic_hidden_dims=[256, 256],
        actor_network_type=VanillaContinuousActorNetwork,
        value_critic_learning_rate=1e-3,
        actor_learning_rate=3e-4,
        critic_learning_rate=1e-4,
        critic_soft_update_tau=0.05,
        training_rounds=2,
        batch_size=256,
        expectile=0.75,
        temperature_advantage_weighted_regression=3,
    ),
)
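
The `expectile` and `temperature_advantage_weighted_regression` arguments above correspond to two quantities in the IQL algorithm: the asymmetry of the value loss and the scale of the advantage weights used for policy extraction. The following is a minimal plain-Python sketch of both, written from the general description of IQL rather than taken from Pearl's code. The function names `expectile_loss` and `awr_weight` are hypothetical, and Pearl's implementation may differ in details such as clamping the weights.

```python
import math


def expectile_loss(diffs, expectile=0.75):
    """Mean asymmetric squared loss over diffs = Q(s, a) - V(s).

    Positive differences are weighted by `expectile` and negative ones by
    `1 - expectile`, so with expectile > 0.5 the value estimate is pushed
    toward the upper range of the Q-values seen in the data.
    """
    total = 0.0
    for u in diffs:
        weight = expectile if u > 0 else 1.0 - expectile
        total += weight * u * u
    return total / len(diffs)


def awr_weight(advantage, temperature=3.0):
    """Unclamped weight exp(temperature * advantage) for a dataset action (assumed form)."""
    return math.exp(temperature * advantage)


# With expectile=0.5 the loss reduces to half of the ordinary mean squared error.
symmetric = expectile_loss([1.0, -1.0], expectile=0.5)
asymmetric = expectile_loss([2.0, -2.0], expectile=0.75)
```

A larger expectile makes the value function track the better actions in the dataset more closely, and a larger temperature concentrates the policy on actions with high estimated advantage.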

# Offline learning

In [8]:
# Number of training epochs
training_epochs = 20000

# Use the `offline_learning` utility function in Pearl to train an offline RL agent using offline data
offline_learning(
    offline_agent=offline_agent,
    data_buffer=offline_data_replay_buffer, # replay buffer created using the offline data
    training_epochs=training_epochs,
    seed=experiment_seed,
)

training epoch 0 training loss {'value_loss': 0.44493526220321655, 'actor_loss': 1.1561717987060547, 'critic_loss': 5.130976676940918}
training epoch 500 training loss {'value_loss': 0.1409292221069336, 'actor_loss': 0.5989271998405457, 'critic_loss': 10.62370491027832}
training epoch 1000 training loss {'value_loss': 0.5230052471160889, 'actor_loss': 0.9807262420654297, 'critic_loss': 24.416900634765625}
training epoch 1500 training loss {'value_loss': 0.9530619382858276, 'actor_loss': 3.278916597366333, 'critic_loss': 29.79534149169922}
training epoch 2000 training loss {'value_loss': 1.614856243133545, 'actor_loss': 6.346199035644531, 'critic_loss': 43.3765869140625}
training epoch 2500 training loss {'value_loss': 2.005828619003296, 'actor_loss': 7.089072227478027, 'critic_loss': 59.126861572265625}
training epoch 3000 training loss {'value_loss': 2.0187668800354004, 'actor_loss': 3.204172134399414, 'critic_loss': 48.85820770263672}
training epoch 3500 training loss {'value_loss': 

# Offline evaluation

In [9]:
# Use the `offline_evaluation` utility function in Pearl to evaluate the trained agent by interacting with the environment.

offline_evaluation_returns = offline_evaluation(
    offline_agent=offline_agent,
    env=env,
    number_of_episodes=50,
    seed=experiment_seed,
)

# mean evaluation returns of the offline agent
avg_offline_agent_returns = torch.mean(torch.tensor(offline_evaluation_returns))
print(f"average returns of the offline agent {avg_offline_agent_returns}")

episode 49, return=244.23391946865013
average returns of the offline agent 230.69460202311308


# Getting normalized scores (typical for benchmarking offline RL algorithms)


In [10]:
# Step 1: get episodic returns of a random agent using the 'returns_random_agent.pickle' file on GitHub.
# You can also do this by initializing an offline agent randomly and interacting with the environment.


# URL of the file on GitHub
random_returns_url = "https://github.com/facebookresearch/Pearl/raw/gh-pages/data/offline_rl_data/HalfCheetah/returns_random_agent.pickle"

# Download the file
response = requests.get(random_returns_url)
response.raise_for_status()  # fail early if the download did not succeed
returns = response.content

# Save the downloaded bytes to a local file, then load the returns from it
with open('random_agent_returns.pkl', 'wb') as f:
    f.write(returns)
with open('random_agent_returns.pkl', 'rb') as f:
    random_agent_returns = pickle.load(f)


avg_return_random_agent = torch.mean(torch.tensor(random_agent_returns))    # mean returns of a random agent
print(f"average returns of a random agent {avg_return_random_agent}")

average returns of a random agent -426.930167355092
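
As an alternative to downloading precomputed returns, a random baseline can be estimated by rolling out uniformly random actions. The sketch below is one way to do that. It assumes an environment that follows the Gymnasium API (`reset(seed=...)` returning `(observation, info)` and `step(action)` returning a 5-tuple), not Pearl's `GymEnvironment` wrapper, and the helper name `random_agent_rollout_returns` is hypothetical.

```python
def random_agent_rollout_returns(env, number_of_episodes=10, seed=0):
    """Roll out uniformly random actions and return one total reward per episode.

    Assumes `env` follows the Gymnasium API:
      reset(seed=...) -> (observation, info)
      step(action)    -> (observation, reward, terminated, truncated, info)
    """
    env.action_space.seed(seed)  # make action sampling reproducible
    returns = []
    for episode in range(number_of_episodes):
        env.reset(seed=seed + episode)
        episode_return, done = 0.0, False
        while not done:
            action = env.action_space.sample()
            _, reward, terminated, truncated, _ = env.step(action)
            episode_return += float(reward)
            done = terminated or truncated
        returns.append(episode_return)
    return returns


# Example usage (requires gymnasium with MuJoCo installed):
# import gymnasium as gym
# rollout_returns = random_agent_rollout_returns(gym.make("HalfCheetah-v4"), number_of_episodes=50)
```

Returns estimated this way depend on the seed and the number of episodes, so they will not match the downloaded file exactly.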


In [11]:
# Step 2: get training returns in the data (i.e. episodic returns of the data collection agent).

# The `get_data_collection_agent_returns` function computes the episodic returns from the offline data of transition tuples
# Note 1: We implicitly assume that the offline data was collected in an episodic manner
# Note 2: The `data_path` points to the local file with offline data. Recall that we set data_path = cwd + '/' + filename, where filename = "offline_raw_transitions_dict_small_2.pt"
data_collection_agent_returns = get_data_collection_agent_returns(
    data_path=data_path
)

# average trajectory returns in the dataset
avg_return_data_collection_agent = torch.mean(
    torch.tensor(data_collection_agent_returns)
)
print(
    f"average returns of the data collection agent {avg_return_data_collection_agent}"
)


max_return_data_collection_agent = torch.max(torch.tensor(data_collection_agent_returns))   # maximum trajectory returns in the dataset
print(f"maximum returns of the data collection agent {max_return_data_collection_agent}")

getting returns of the data collection agent agent
using offline training data in /content/offline_raw_transitions_dict_small_2.pt to stitch trajectories and compute returns
average returns of the data collection agent 490.7219223530043
maximum returns of the data collection agent 1878.2492669075727
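
The idea of stitching a flat list of transitions into trajectories and summing rewards per episode can be sketched as follows. This is an illustration under stated assumptions, not Pearl's implementation of `get_data_collection_agent_returns`: it assumes the transitions are stored in collection order and that each one carries a reward and a flag marking the end of an episode. The helper name `episodic_returns` is hypothetical.

```python
def episodic_returns(rewards, episode_ends):
    """Sum rewards between consecutive end-of-episode flags.

    `rewards` and `episode_ends` are parallel sequences in collection order.
    A trailing, unfinished episode (no final end flag) is dropped.
    """
    returns, running = [], 0.0
    for reward, ended in zip(rewards, episode_ends):
        running += float(reward)
        if ended:
            returns.append(running)
            running = 0.0
    return returns


# Two complete episodes followed by one unfinished transition.
example = episodic_returns(
    rewards=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    episode_ends=[False, True, False, False, True, False],
)
print(example)  # [3.0, 12.0]
```

If the data was not collected episodically, or was shuffled before saving, returns computed this way are not meaningful, which is why Note 1 in the cell above matters.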


In [12]:
# The following is one way to define the normalized score, which is a proxy for
# how good the offline learning algorithm is as compared to the policy that was used to collect data.

normalized_score = (avg_offline_agent_returns - avg_return_random_agent) / (
    avg_return_data_collection_agent - avg_return_random_agent
)

print(f"normalized score {normalized_score}")

normalized score 0.716638448006362


Note 1: A normalized score close to or greater than 1 indicates good performance, i.e., the offline agent matches or exceeds the average return in the dataset. However, the offline dataset used here is very small (100,000 transition tuples), so we do not expect a high normalized score.

Note 2: We have used the average episodic return in the offline data as a proxy for the performance of the data collection agent (which is used when computing the normalized score).

- An ideal way to do this would be to run the data collection agent/policy, at the end of the data collection phase, for a few episodes and take the average episodic returns.
- This would approximate the 'best' policy used to collect data. The real test for offline learning algorithms is to be able to beat this 'best' policy.
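
The normalized score computed above can also be written as a small helper that works on plain floats. The sketch below plugs in the averages printed earlier in this notebook. The function name `compute_normalized_score` is hypothetical, and the zero-denominator check is an added safeguard that is not part of the original cell.

```python
def compute_normalized_score(agent_return, random_return, reference_return):
    """Rescale a return so the random agent maps to 0 and the reference to 1."""
    denominator = reference_return - random_return
    if denominator == 0:
        raise ValueError("reference and random returns must differ")
    return (agent_return - random_return) / denominator


# Averages printed earlier in this notebook.
score = compute_normalized_score(
    agent_return=230.69460202311308,     # offline (IQL) agent
    random_return=-426.930167355092,     # random agent
    reference_return=490.7219223530043,  # data collection agent (dataset average)
)
print(round(score, 4))  # 0.7166
```

Passing a different `reference_return`, such as the return of the final data collection policy described in Note 2, gives a stricter benchmark with the same formula.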