# Introduction

Goal of this notebook: test the training of the Implicit Q-Learning, Conservative Q-Learning, Behavior Cloning, and TD3 with Behavior Cloning algorithms, followed by BCQ and AWAC, on the `D4RL/pen/expert-v2` dataset.

The "pen" task requires a robotic hand (Adroit Hand) to manipulate a pen until it reaches a given target orientation.

# Loading the dataset

In [1]:
import minari
import time
import numpy as np
from d3rlpy.algos import IQLConfig, CQLConfig, BCConfig, TD3PlusBCConfig, BCQConfig, AWACConfig
from d3rlpy.datasets import MDPDataset
from d3rlpy.constants import ActionSpace
from d3rlpy.metrics import EnvironmentEvaluator

In [14]:
dataset = minari.load_dataset("D4RL/hammer/expert-v2")

In [15]:
print("Total episodes:", dataset.total_episodes)
print("Observation space:", dataset.observation_space)
print("Action space:", dataset.action_space)

Total episodes: 5000
Observation space: Box(-inf, inf, (46,), float64)
Action space: Box(-1.0, 1.0, (26,), float32)


In [16]:
episode = next(dataset.iterate_episodes())
print(episode)

# print(f"Observations: \n{episode.observations[0]}")
# print(f"Actions: \n{episode.actions[0]}")
# print(f"Rewards: \n{episode.rewards[0]}")
# print(f"Terminations: \n{episode.terminations[0]}")

EpisodeData(id=0, total_steps=200, observations=ndarray of shape (201, 46) and dtype float64, actions=ndarray of shape (200, 26) and dtype float32, rewards=ndarray of 200 floats, terminations=ndarray of 200 bools, truncations=ndarray of 200 bools, infos=dict with the following keys: ['success'])


The task consists of repositioning the blue pen to match the orientation of the green target. The base of the hand is fixed, and the target orientation is randomized to cover all configurations. The task is considered successful when the two orientations match within a tolerance.
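
One way to picture "match within a tolerance" is as a threshold on the alignment between two direction vectors. The snippet below is only a sketch of that idea: the function `orientations_match` and the threshold `min_cosine=0.95` are hypothetical choices made for illustration and are not taken from the environment's source code.

```python
import numpy as np


def orientations_match(pen_dir, target_dir, min_cosine=0.95):
    """Check whether two direction vectors are aligned within a tolerance.

    min_cosine is a hypothetical threshold chosen for illustration.
    """
    pen_dir = np.asarray(pen_dir, dtype=float)
    target_dir = np.asarray(target_dir, dtype=float)
    # cosine of the angle between the two directions (1.0 means identical)
    cosine = np.dot(pen_dir, target_dir) / (
        np.linalg.norm(pen_dir) * np.linalg.norm(target_dir)
    )
    return bool(cosine >= min_cosine)


nearly_aligned = orientations_match([0.0, 0.0, 1.0], [0.0, 0.1, 1.0])
perpendicular = orientations_match([0.0, 0.0, 1.0], [1.0, 0.0, 0.0])
print(nearly_aligned, perpendicular)
```

A cosine threshold is used here because it is independent of vector length; the real environment may define its tolerance differently.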

# Preparing the dataset

d3rlpy expects the dataset to consist of transitions, where each element contains a state, an action, a reward, the next state, and a terminal flag, all aligned so that the state and action at index i correspond to the transition to the state at index i+1. To this end, the library provides the `MDPDataset` class, which makes it easy to build a dataset object in the required format. The input arrays are not split by episode: all steps are concatenated into a single array, and episode boundaries are marked by the terminal flags.
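
To make the alignment concrete before applying it to the real dataset, here is a minimal sketch on synthetic data. The helper `flatten_episodes`, the generator `make_toy_episode`, and the dictionary layout of an episode are all hypothetical. Unlike the cell that follows, the sketch keeps terminations and truncations in two separate arrays, on the assumption that time-limit truncations are better handed to `MDPDataset` separately than merged into `terminals`.

```python
import numpy as np


def flatten_episodes(episodes):
    """Concatenate per-episode arrays into step-aligned flat arrays.

    Each episode is a dict (hypothetical layout) holding T+1 observations
    and T actions, rewards, terminations and truncations.
    """
    obs, act, rew, term, trunc = [], [], [], [], []
    for ep in episodes:
        # drop the final observation: no action was taken from it
        obs.append(ep["observations"][:-1])
        act.append(ep["actions"])
        rew.append(ep["rewards"])
        term.append(ep["terminations"])
        trunc.append(ep["truncations"])
    return (
        np.concatenate(obs),
        np.concatenate(act),
        np.concatenate(rew),
        np.concatenate(term),
        np.concatenate(trunc),
    )


def make_toy_episode(n_steps, obs_dim=45, act_dim=24, seed=0):
    """Build a random episode that ends on a time limit (illustration only)."""
    rng = np.random.default_rng(seed)
    truncations = np.zeros(n_steps, dtype=bool)
    truncations[-1] = True
    return {
        "observations": rng.normal(size=(n_steps + 1, obs_dim)),
        "actions": rng.uniform(-1.0, 1.0, size=(n_steps, act_dim)).astype(np.float32),
        "rewards": rng.normal(size=n_steps),
        "terminations": np.zeros(n_steps, dtype=bool),
        "truncations": truncations,
    }


episodes = [make_toy_episode(100, seed=0), make_toy_episode(80, seed=1)]
observations, actions, rewards, terminals, timeouts = flatten_episodes(episodes)
print(observations.shape, actions.shape, rewards.shape)
# → (180, 45) (180, 24) (180,)
```

With d3rlpy installed, these arrays could then be passed as `MDPDataset(observations, actions, rewards, terminals, timeouts=timeouts)`. The `timeouts` keyword is assumed from the d3rlpy 2.x API and should be checked against the installed version.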

In [5]:
observations = []
actions = []
rewards = []
terminals = []

for episode in dataset.iterate_episodes():
    # drop the last observation, since it has no associated action
    obs = episode.observations[:-1]
    actions_ep = episode.actions
    rewards_ep = episode.rewards
    dones = np.array(episode.terminations) | np.array(episode.truncations)

    observations.append(obs)
    actions.append(actions_ep)
    rewards.append(rewards_ep)
    terminals.append(dones)

# observations is now a list of 4958 arrays (episodes), each holding about 100 steps of observation vectors; the same holds for the other lists

# concatenate the per-episode arrays so that every step in the dataset has its observation, action, reward, and terminal flag
observations = np.concatenate(observations)
actions = np.concatenate(actions)
rewards = np.concatenate(rewards)
terminals = np.concatenate(terminals)

# observations is now an array of 499206 rows (every step in the dataset), one observation vector per row; the same holds for the other arrays
print(observations.shape)
print(actions.shape)
print(rewards.shape)
print(terminals.shape)

d3_dataset = MDPDataset(observations, actions, rewards, terminals, action_space=ActionSpace.CONTINUOUS)

(499206, 45)
(499206, 24)
(499206,)
(499206,)
2025-04-29 11:48.12 [info     ] Signatures have been automatically determined. action_signature=Signature(dtype=[dtype('float32')], shape=[(24,)]) observation_signature=Signature(dtype=[dtype('float64')], shape=[(45,)]) reward_signature=Signature(dtype=[dtype('float64')], shape=[(1,)])
2025-04-29 11:48.12 [info     ] Action size has been automatically determined. action_size=24


# Implicit Q-Learning

In [None]:
iql = IQLConfig().create(device="cpu")

In [None]:
iql.build_with_dataset(d3_dataset)

In [None]:
env = dataset.recover_environment()

iql.fit(
    dataset=d3_dataset,
    n_steps=10000,
    n_steps_per_epoch=1000,
    evaluators={"env": EnvironmentEvaluator(env)},
)

In [None]:
env = dataset.recover_environment(render_mode="human", camera_id=2)
obs, _ = env.reset()
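total_reward = 0

# roll out the trained policy for at most 1000 steps
for _ in range(1000):
    action = iql.predict(obs[None])[0]
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    # stop when the episode ends, by termination or by time-limit truncation
    if terminated or truncated:
        break

env.close()
print(f"Total reward: {total_reward}")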
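
The same rollout loop is repeated for every algorithm in this notebook, so it could be factored into a helper. The following is a sketch under stated assumptions: `rollout` is a hypothetical helper, while `ToyEnv` and `zero_policy` are stand-ins invented here so that the snippet runs without MuJoCo or a trained model. They only mimic the `reset`/`step` return values and the batched `predict` call used above.

```python
import numpy as np


class ToyEnv:
    """Stand-in for a Gymnasium environment (hypothetical, for illustration).

    Mimics the reset/step return values used above and truncates
    every episode after `horizon` steps.
    """

    def __init__(self, horizon=5, obs_dim=3):
        self.horizon = horizon
        self.obs_dim = obs_dim
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(self.obs_dim), {}

    def step(self, action):
        self.t += 1
        obs = np.full(self.obs_dim, float(self.t))
        reward = 1.0
        terminated = False
        truncated = self.t >= self.horizon
        return obs, reward, terminated, truncated, {}


def rollout(env, predict, max_steps=1000):
    """Run one episode and return (total_reward, n_steps)."""
    obs, _ = env.reset()
    total_reward = 0.0
    n_steps = 0
    for _ in range(max_steps):
        # predict expects a batch, so add a leading axis and take row 0
        action = predict(obs[None])[0]
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        n_steps += 1
        # an episode can end by termination or by time-limit truncation
        if terminated or truncated:
            break
    return total_reward, n_steps


def zero_policy(batch_obs):
    # hypothetical policy: one all-zero action per observation in the batch
    return np.zeros((batch_obs.shape[0], 2))


total_reward, n_steps = rollout(ToyEnv(horizon=5), zero_policy)
print(total_reward, n_steps)
# → 5.0 5
```

With a trained model, the call would presumably look like `rollout(env, iql.predict)`, with `env` obtained from `dataset.recover_environment(...)` as in the cells above.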

# Conservative Q-Learning

In [None]:
cql = CQLConfig().create(device="cpu")

In [None]:
cql.build_with_dataset(d3_dataset)

In [None]:
env = dataset.recover_environment()

cql.fit(
    dataset=d3_dataset,
    n_steps=10000,
    n_steps_per_epoch=1000,
    evaluators={"env": EnvironmentEvaluator(env)},
)

In [None]:
env = dataset.recover_environment(render_mode="human", camera_id=2)
obs, _ = env.reset()
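total_reward = 0

# roll out the trained policy for at most 1000 steps
for _ in range(1000):
    action = cql.predict(obs[None])[0]
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    # stop when the episode ends, by termination or by time-limit truncation
    if terminated or truncated:
        break

env.close()
print(f"Total reward: {total_reward}")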

# Behavior Cloning

In [None]:
bc = BCConfig().create(device="cpu")

In [None]:
bc.build_with_dataset(d3_dataset)

In [None]:
env = dataset.recover_environment()

bc.fit(
    dataset=d3_dataset,
    n_steps=10000,
    n_steps_per_epoch=1000,
    evaluators={"env": EnvironmentEvaluator(env)},
)

In [None]:
env = dataset.recover_environment(render_mode="human", camera_id=2)
obs, _ = env.reset()
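total_reward = 0

# roll out the trained policy for at most 1000 steps
for _ in range(1000):
    action = bc.predict(obs[None])[0]
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    # stop when the episode ends, by termination or by time-limit truncation
    if terminated or truncated:
        break

env.close()
print(f"Total reward: {total_reward}")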

# TD3 + BC

In [None]:
td3bc = TD3PlusBCConfig().create(device="cpu")

In [None]:
td3bc.build_with_dataset(d3_dataset)

In [None]:
env = dataset.recover_environment()

td3bc.fit(
    dataset=d3_dataset,
    n_steps=10000,
    n_steps_per_epoch=1000,
    evaluators={"env": EnvironmentEvaluator(env)},
)

In [None]:
env = dataset.recover_environment(render_mode="human", camera_id=2)
obs, _ = env.reset()
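total_reward = 0

# roll out the trained policy for at most 1000 steps
for _ in range(1000):
    action = td3bc.predict(obs[None])[0]
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    # stop when the episode ends, by termination or by time-limit truncation
    if terminated or truncated:
        break

env.close()
print(f"Total reward: {total_reward}")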

# BCQ

In [None]:
bcq = BCQConfig().create(device="cpu")

In [None]:
bcq.build_with_dataset(d3_dataset)

In [None]:
env = dataset.recover_environment()

bcq.fit(
    dataset=d3_dataset,
    n_steps=10000,
    n_steps_per_epoch=1000,
    evaluators={"env": EnvironmentEvaluator(env)},
)

In [None]:
env = dataset.recover_environment(render_mode="human", camera_id=2)
obs, _ = env.reset()
total_reward = 0

# roll out the trained policy for at most 1000 steps
for _ in range(1000):
    action = bcq.predict(obs[None])[0]
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    # stop when the episode ends, by termination or by time-limit truncation
    if terminated or truncated:
        break

env.close()
print(f"Total reward: {total_reward}")

# AWAC

In [6]:
awac = AWACConfig().create(device="cpu")

In [7]:
awac.build_with_dataset(d3_dataset)

In [8]:
env = dataset.recover_environment()

awac.fit(
    dataset=d3_dataset,
    n_steps=10000,
    n_steps_per_epoch=1000,
    evaluators={"env": EnvironmentEvaluator(env)},
)

2025-04-29 11:48.25 [info     ] dataset info dataset_info=DatasetInfo(observation_signature=Signature(dtype=[dtype('float64')], shape=[(45,)]), action_signature=Signature(dtype=[dtype('float32')], shape=[(24,)]), reward_signature=Signature(dtype=[dtype('float64')], shape=[(1,)]), action_space=<ActionSpace.CONTINUOUS: 1>, action_size=24)
2025-04-29 11:48.25 [info     ] Directory is created at d3rlpy_logs/AWAC_20250429114825
2025-04-29 11:48.25 [info     ] Parameters params={'observation_shape': [45], 'action_size': 24, 'config': {'type': 'awac', 'params': {'batch_size': 1024, 'gamma': 0.99, 'observation_scaler': {'type': 'none', 'params': {}}, 'action_scaler': {'type': 'none', 'params': {}}, 'reward_scaler': {'type': 'none', 'params': {}}, 'compile_graph': False, 'actor_learning_rate': 0.0003, 'critic_learning_rate': 0.0003, 'actor_

2025-04-29 11:48.43 [info     ] AWAC_20250429114825: epoch=1 step=1000 epoch=1 metrics={'time_sample_batch': 0.0064957036972045894, 'time_algorithm_update': 0.010676538467407227, 'critic_loss': 554.7457459411621, 'actor_loss': 425110.322421875, 'temp': 0.0, 'temp_loss': 0.0, 'time_step': 0.017215315580368044, 'env': 562.5663216534314} step=1000
2025-04-29 11:48.43 [info     ] Model parameters are saved to d3rlpy_logs/AWAC_20250429114825/model_1000.d3

2025-04-29 11:49.00 [info     ] AWAC_20250429114825: epoch=2 step=2000 epoch=2 metrics={'time_sample_batch': 0.006325634002685547, 'time_algorithm_update': 0.010734166860580444, 'critic_loss': 546.8411540832519, 'actor_loss': 207324.41248828126, 'temp': 0.0, 'temp_loss': 0.0, 'time_step': 0.017101824045181273, 'env': 540.1014425113003} step=2000
2025-04-29 11:49.00 [info     ] Model parameters are saved to d3rlpy_logs/AWAC_20250429114825/model_2000.d3

2025-04-29 11:49.18 [info     ] AWAC_20250429114825: epoch=3 step=3000 epoch=3 metrics={'time_sample_batch': 0.006434033870697022, 'time_algorithm_update': 0.010788160800933838, 'critic_loss': 1128.686857849121, 'actor_loss': 115807.14467773438, 'temp': 0.0, 'temp_loss': 0.0, 'time_step': 0.017263113498687744, 'env': 85.39976027247272} step=3000
2025-04-29 11:49.18 [info     ] Model parameters are saved to d3rlpy_logs/AWAC_20250429114825/model_3000.d3

2025-04-29 11:49.35 [info     ] AWAC_20250429114825: epoch=4 step=4000 epoch=4 metrics={'time_sample_batch': 0.006201712608337402, 'time_algorithm_update': 0.010133864879608155, 'critic_loss': 2037.0081600952149, 'actor_loss': 75400.00033557128, 'temp': 0.0, 'temp_loss': 0.0, 'time_step': 0.016375247716903688, 'env': 1626.3771405587713} step=4000
2025-04-29 11:49.35 [info     ] Model parameters are saved to d3rlpy_logs/AWAC_20250429114825/model_4000.d3

2025-04-29 11:49.52 [info     ] AWAC_20250429114825: epoch=5 step=5000 epoch=5 metrics={'time_sample_batch': 0.006261154413223267, 'time_algorithm_update': 0.010505888223648072, 'critic_loss': 3351.3476982421876, 'actor_loss': 51374.26865582275, 'temp': 0.0, 'temp_loss': 0.0, 'time_step': 0.01680708694458008, 'env': 552.7175959139385} step=5000
2025-04-29 11:49.52 [info     ] Model parameters are saved to d3rlpy_logs/AWAC_20250429114825/model_5000.d3

2025-04-29 11:50.09 [info     ] AWAC_20250429114825: epoch=6 step=6000 epoch=6 metrics={'time_sample_batch': 0.00640866470336914, 'time_algorithm_update': 0.0100319242477417, 'critic_loss': 5012.912577270507, 'actor_loss': 36368.00639971924, 'temp': 0.0, 'temp_loss': 0.0, 'time_step': 0.016481981754302977, 'env': 623.78805076086} step=6000
2025-04-29 11:50.09 [info     ] Model parameters are saved to d3rlpy_logs/AWAC_20250429114825/model_6000.d3

2025-04-29 11:50.27 [info     ] AWAC_20250429114825: epoch=7 step=7000 epoch=7 metrics={'time_sample_batch': 0.006795634508132934, 'time_algorithm_update': 0.011142736196517945, 'critic_loss': 7242.70580847168, 'actor_loss': 24817.51032544708, 'temp': 0.0, 'temp_loss': 0.0, 'time_step': 0.017982701301574706, 'env': 1404.1743881699217} step=7000
2025-04-29 11:50.27 [info     ] Model parameters are saved to d3rlpy_logs/AWAC_20250429114825/model_7000.d3

2025-04-29 11:50.45 [info     ] AWAC_20250429114825: epoch=8 step=8000 epoch=8 metrics={'time_sample_batch': 0.006568410873413086, 'time_algorithm_update': 0.01032735800743103, 'critic_loss': 9595.429397705078, 'actor_loss': 19035.566874320983, 'temp': 0.0, 'temp_loss': 0.0, 'time_step': 0.016937438011169433, 'env': 463.13818769172406} step=8000
2025-04-29 11:50.45 [info     ] Model parameters are saved to d3rlpy_logs/AWAC_20250429114825/model_8000.d3

2025-04-29 11:51.03 [info     ] AWAC_20250429114825: epoch=9 step=9000 epoch=9 metrics={'time_sample_batch': 0.006766695022583008, 'time_algorithm_update': 0.011246136665344238, 'critic_loss': 12170.222025878906, 'actor_loss': 13423.196507347107, 'temp': 0.0, 'temp_loss': 0.0, 'time_step': 0.018054604053497316, 'env': 1530.4305660160392} step=9000
2025-04-29 11:51.03 [info     ] Model parameters are saved to d3rlpy_logs/AWAC_20250429114825/model_9000.d3

2025-04-29 11:51.21 [info     ] AWAC_20250429114825: epoch=10 step=10000 epoch=10 metrics={'time_sample_batch': 0.006686381101608276, 'time_algorithm_update': 0.010797379970550537, 'critic_loss': 15307.128662597655, 'actor_loss': 9125.657841182709, 'temp': 0.0, 'temp_loss': 0.0, 'time_step': 0.017527610540390013, 'env': 1781.371999825909} step=10000
2025-04-29 11:51.21 [info     ] Model parameters are saved to d3rlpy_logs/AWAC_20250429114825/model_10000.d3


[(1,
  {'time_sample_batch': 0.0064957036972045894,
   'time_algorithm_update': 0.010676538467407227,
   'critic_loss': 554.7457459411621,
   'actor_loss': 425110.322421875,
   'temp': 0.0,
   'temp_loss': 0.0,
   'time_step': 0.017215315580368044,
   'env': 562.5663216534314}),
 (2,
  {'time_sample_batch': 0.006325634002685547,
   'time_algorithm_update': 0.010734166860580444,
   'critic_loss': 546.8411540832519,
   'actor_loss': 207324.41248828126,
   'temp': 0.0,
   'temp_loss': 0.0,
   'time_step': 0.017101824045181273,
   'env': 540.1014425113003}),
 (3,
  {'time_sample_batch': 0.006434033870697022,
   'time_algorithm_update': 0.010788160800933838,
   'critic_loss': 1128.686857849121,
   'actor_loss': 115807.14467773438,
   'temp': 0.0,
   'temp_loss': 0.0,
   'time_step': 0.017263113498687744,
   'env': 85.39976027247272}),
 (4,
  {'time_sample_batch': 0.006201712608337402,
   'time_algorithm_update': 0.010133864879608155,
   'critic_loss': 2037.0081600952149,
   'actor_loss': 75
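
The value returned by `fit` is a list of `(epoch, metrics)` pairs, which makes it easy to pick the checkpoint with the best evaluation score. The sketch below copies only the `env` values from the AWAC log above; `best_epoch` is a hypothetical helper, and the checkpoint file name is inferred from the log lines, assuming 1000 steps per epoch as configured in this run.

```python
# (epoch, metrics) pairs in the shape returned by fit(); only the "env"
# values from the AWAC log are reproduced here
history = [
    (1, {"env": 562.5663216534314}),
    (2, {"env": 540.1014425113003}),
    (3, {"env": 85.39976027247272}),
    (4, {"env": 1626.3771405587713}),
    (5, {"env": 552.7175959139385}),
    (6, {"env": 623.78805076086}),
    (7, {"env": 1404.1743881699217}),
    (8, {"env": 463.13818769172406}),
    (9, {"env": 1530.4305660160392}),
    (10, {"env": 1781.371999825909}),
]


def best_epoch(history, key="env"):
    """Return the (epoch, value) pair with the highest metric value."""
    epoch, metrics = max(history, key=lambda item: item[1][key])
    return epoch, metrics[key]


epoch, score = best_epoch(history)
print(f"best epoch: {epoch}, env score: {score:.1f}")
# → best epoch: 10, env score: 1781.4

# checkpoint name inferred from the log, with 1000 steps per epoch
checkpoint = f"model_{epoch * 1000}.d3"
print(checkpoint)
# → model_10000.d3
```

The evaluation score swings widely between epochs (from about 85 at epoch 3 to about 1626 at epoch 4), so selecting a checkpoint this way is sensitive to evaluation noise.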

In [13]:
env = dataset.recover_environment(render_mode="human", camera_id=2)
obs, _ = env.reset()
total_reward = 0

# roll out the trained policy for at most 1000 steps
for _ in range(1000):
    action = awac.predict(obs[None])[0]
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    # stop when the episode ends, by termination or by time-limit truncation
    if terminated or truncated:
        break

env.close()
print(f"Total reward: {total_reward}")

Total reward: 3628.426871523513
