# DDPG + Hindsight Experience Replay (HER)

Implementation of **Deep Deterministic Policy Gradient (DDPG)** with
**Hindsight Experience Replay (HER)** for goal-conditioned robotic
manipulation tasks from [Gymnasium-Robotics](https://robotics.farama.org/).

**References:**
- Lillicrap et al., [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971), 2015
- Andrychowicz et al., [Hindsight Experience Replay](https://arxiv.org/abs/1707.01495), 2017

**Experiments** (all on `FetchReach-v4`):

| # | Run name | HER | Goal strategy |
|---|----------|-----|---------------|
| 1 | `DDPG` | ✗ | — |
| 2 | `DDPG_HER_final` | ✓ | `final` |
| 3 | `DDPG_HER_episode` | ✓ | `episode` |
| 4 | `DDPG_HER_future` | ✓ | `future` |

## 1. Install Dependencies

In [1]:
!pip install -q \
    gymnasium gymnasium-robotics "gymnasium[mujoco]" \
    torch torchvision \
    matplotlib numpy wandb tqdm imageio moviepy

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
stable-baselines3 2.1.0 requires gymnasium<0.30,>=0.28.1, but you have gymnasium 1.2.3 which is incompatible.
kaggle-environments 1.18.0 requires gymnasium==0.29.0, but you have gymnasium 1.2.3 which is incompa

## 2. Weights & Biases Setup (Optional)

Logging to [W&B](https://wandb.ai) is **optional**. If `WANDB_API_KEY` is
set in the environment, metrics and evaluation videos will be logged
automatically. Otherwise, training still works — results are printed to
stdout.

In [2]:
import wandb
import os

USE_WANDB = os.environ.get("WANDB_API_KEY") is not None

if USE_WANDB:
    wandb.login()
else:
    print("WANDB_API_KEY not set — logging disabled. "
          "Set the env var or run `wandb login` to enable.")



WANDB_API_KEY not set — logging disabled. Set the env var or run `wandb login` to enable.


## 3. Imports

In [3]:
import os
import numpy as np
import torch
from torch import nn

import gymnasium as gym
import gymnasium_robotics
from gymnasium import Wrapper
import matplotlib.pyplot as plt

gym.register_envs(gymnasium_robotics)

## 4. Neural Network Building Blocks

`ConvertedSigmoid` applies a sigmoid and rescales its $(0, 1)$ output to an
arbitrary $(\text{low}, \text{high})$ range. It is used as the actor's output
activation so that actions stay within the environment's bounds.

In [4]:
class ConvertedSigmoid(nn.Module):
    """Sigmoid rescaled to [low_bound, high_bound]."""

    def __init__(self, low_bound, high_bound):
        super().__init__()
        self.register_buffer("low_bound", torch.tensor(low_bound))
        self.register_buffer("scale", torch.tensor(high_bound - low_bound))

    def forward(self, x):
        return torch.sigmoid(x) * self.scale + self.low_bound


class MLP(nn.Module):
    """Simple multi-layer perceptron."""

    def __init__(self, input_dim, hidden_dims, output_dim,
                 activation=nn.ReLU, output_activation=nn.Identity()):
        super().__init__()
        if isinstance(hidden_dims, int):
            hidden_dims = [hidden_dims]

        layers = [nn.Linear(input_dim, hidden_dims[0]), activation()]
        for d_in, d_out in zip(hidden_dims[:-1], hidden_dims[1:]):
            layers += [nn.Linear(d_in, d_out), activation()]
        layers += [nn.Linear(hidden_dims[-1], output_dim), output_activation]

        self.mlp = nn.Sequential(*layers)

    def forward(self, x):
        return self.mlp(x)
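
The rescaling can be checked without any network. The following is a minimal scalar sketch of the same formula, `sigmoid(x) * (high - low) + low`; the function name and the bounds `-1.0` and `1.0` are illustrative assumptions, not values taken from the environment.

```python
import math


def rescaled_sigmoid(x, low, high):
    """Map any real x into the open interval (low, high)."""
    s = 1.0 / (1.0 + math.exp(-x))  # standard sigmoid, always in (0, 1)
    return s * (high - low) + low


low, high = -1.0, 1.0  # illustrative bounds (assumption)
mid = rescaled_sigmoid(0.0, low, high)        # sigmoid(0) = 0.5 -> midpoint
near_low = rescaled_sigmoid(-10.0, low, high)
near_high = rescaled_sigmoid(10.0, low, high)
print(mid, near_low, near_high)
```

Because the sigmoid never reaches 0 or 1 exactly, the output approaches the bounds but stays strictly inside them, so the actor's raw output needs no clipping.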

## 5. DDPG Agent

`DDPG` implements the vanilla algorithm; `GoalDDPG` extends it by
concatenating the desired goal to both actor and critic inputs,
making the policy goal-conditioned.
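
As a compact summary of what the loss methods and the target update in the cell below compute, the quantities can be written as follows. The symbols are notation introduced here, not names from the code: $Q$ and $\mu$ are the online critic and actor, $Q'$ and $\mu'$ their target copies with parameters $\theta'$, and $d$ is the `terminated` flag.

$$
y = r + \gamma\,(1 - d)\,Q'\big(s', \mu'(s')\big), \qquad
L_{\text{critic}} = \mathbb{E}\big[(Q(s, a) - y)^2\big], \qquad
L_{\text{actor}} = -\mathbb{E}\big[Q(s, \mu(s))\big]
$$

$$
\theta' \leftarrow \tau\,\theta' + (1 - \tau)\,\theta
$$

In this notebook $\tau$ multiplies the *target* parameters, which is the reverse of the convention in Lillicrap et al., where $\tau \ll 1$ multiplies the online parameters. In `GoalDDPG`, every network additionally receives the desired goal $g$ as input.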

In [5]:
class DDPG(nn.Module):
    """Deep Deterministic Policy Gradient (Lillicrap et al., 2015)."""

    def __init__(self, obs_dim, min_action_values, max_action_values,
                 hidden_dims, exploration_std, gamma,
                 target_exponential_averaging, device):
        super().__init__()
        self.device = device
        self.gamma = gamma
        self.tau = target_exponential_averaging
        self.exploration_std = exploration_std

        action_dim = min_action_values.shape
        self.min_action_values = torch.tensor(min_action_values, device=device)
        self.max_action_values = torch.tensor(max_action_values, device=device)

        assert len(obs_dim) == 1, "obs_dim must be 1-D"
        assert len(action_dim) == 1, "action_dim must be 1-D"

        # Online networks
        self.critic = MLP(obs_dim[0] + action_dim[0], hidden_dims, 1).to(device)
        self.actor = MLP(
            obs_dim[0], hidden_dims, action_dim[0],
            output_activation=ConvertedSigmoid(min_action_values, max_action_values),
        ).to(device)

        # Target networks (initialised as copies)
        self._target_critic = MLP(obs_dim[0] + action_dim[0], hidden_dims, 1).to(device)
        self._target_actor = MLP(
            obs_dim[0], hidden_dims, action_dim[0],
            output_activation=ConvertedSigmoid(min_action_values, max_action_values),
        ).to(device)
        self._target_critic.load_state_dict(self.critic.state_dict())
        self._target_actor.load_state_dict(self.actor.state_dict())

    # ---- action selection ----

    def _apply_exploration_noise(self, action):
        noise = torch.randn_like(action, device=self.device) * self.exploration_std
        return torch.clamp(action + noise,
                           min=self.min_action_values,
                           max=self.max_action_values)

    @torch.no_grad()
    def act(self, obs, training=True):
        obs = torch.from_numpy(obs).to(self.device)
        action = self.actor(obs)
        if training:
            action = self._apply_exploration_noise(action)
        return action.cpu().numpy()

    # ---- batch decomposition helpers ----

    def _split_batch(self, batch):
        cur = {"state": batch["state"], "action": batch["action"]}
        nxt = {"state": batch["next_state"], "action": None}
        return cur, nxt

    def _actor_input(self, d):
        return d["state"]

    def _critic_input(self, d):
        return torch.cat([d["state"], d["action"]], dim=-1)

    # ---- losses ----

    def get_critic_loss(self, batch):
        reward, terminated = batch["reward"], batch["terminated"]
        cur, nxt = self._split_batch(batch)

        with torch.no_grad():
            nxt["action"] = self._target_actor(self._actor_input(nxt))
            target = reward + self.gamma * (1 - terminated.int()) \
                     * self._target_critic(self._critic_input(nxt))

        q_values = self.critic(self._critic_input(cur))
        return torch.mean((q_values - target) ** 2)

    def get_actor_loss(self, batch):
        cur, _ = self._split_batch(batch)
        cur["action"] = self.actor(self._actor_input(cur))
        return -self.critic(self._critic_input(cur)).mean()

    # ---- target update ----

    def update_target_networks(self):
        for target, online in [(self._target_critic, self.critic),
                               (self._target_actor, self.actor)]:
            for tp, op in zip(target.parameters(), online.parameters()):
                tp.data.mul_(self.tau).add_(op.data, alpha=1 - self.tau)

    # ---- parameter groups (for separate optimisers) ----

    def critic_parameters(self):
        return list(self.critic.parameters())

    def actor_parameters(self):
        return list(self.actor.parameters())


class GoalDDPG(DDPG):
    """Goal-conditioned DDPG: observation and goal are concatenated."""

    def __init__(self, obs_dim, goal_dim, min_action_values, max_action_values,
                 hidden_dims, exploration_std, gamma,
                 target_exponential_averaging, device, seed):
        assert len(obs_dim) == 1 and len(goal_dim) == 1
        super().__init__(
            obs_dim=(obs_dim[0] + goal_dim[0],),
            min_action_values=min_action_values,
            max_action_values=max_action_values,
            hidden_dims=hidden_dims,
            exploration_std=exploration_std,
            gamma=gamma,
            target_exponential_averaging=target_exponential_averaging,
            device=device,
        )
        self.np_rng = np.random.default_rng(seed)

    def _split_batch(self, batch):
        cur = {"state": batch["state"], "action": batch["action"],
               "goal": batch["desired_goal"]}
        nxt = {"state": batch["next_state"], "action": None,
               "goal": batch["desired_goal"]}
        return cur, nxt

    def _actor_input(self, d):
        return torch.cat([d["state"], d["goal"]], dim=-1)

    def _critic_input(self, d):
        return torch.cat([d["state"], d["action"], d["goal"]], dim=-1)
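
To make the `tau` convention in `update_target_networks` concrete, here is a small NumPy sketch of the same update rule applied to toy parameter vectors. The function name and the toy values are assumptions for illustration; only the formula `tau * target + (1 - tau) * online` comes from the class above.

```python
import numpy as np


def soft_update(target, online, tau):
    """Return tau * target + (1 - tau) * online (tau weights the OLD target)."""
    return tau * target + (1.0 - tau) * online


target = np.zeros(3)  # toy target-network parameters
online = np.ones(3)   # toy online-network parameters
tau = 0.95

history = []
for _ in range(3):
    target = soft_update(target, online, tau)
    history.append(float(target[0]))

print(history)  # the gap to the online value shrinks by a factor tau per update
```

After `k` updates the target sits at `1 - tau**k` of the way towards a fixed online value, so a `tau` close to 1 means slow tracking and a `tau` close to 0 means the target is nearly a copy of the online network.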

## 6. Environment Wrapper

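Casts observations to `float32` and wraps the scalar reward and the
`terminated` / `truncated` flags into 1-D arrays for uniform downstream
handling. Note that `reset` returns only the observation dict and drops
`info`, unlike the standard Gymnasium API.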

In [6]:
class GoalEnvWrapper(Wrapper):
    """Thin wrapper for Gymnasium goal-conditioned environments.

    - Casts observations to float32.
    - Wraps scalar reward / terminated / truncated into 1-D arrays.
    """

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        obs = {k: v.astype(np.float32) for k, v in obs.items()}
        reward = np.array([reward], dtype=np.float32)
        terminated = np.array([terminated])
        truncated = np.array([truncated])
        return obs, reward, terminated, truncated, info

    def reset(self, *, seed=None, options=None):
        obs, info = self.env.reset(seed=seed, options=options)
        obs = {k: v.astype(np.float32) for k, v in obs.items()}
        return obs
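
The effect of these conversions is easiest to see on stand-in data. The sketch below applies the same casts to hypothetical raw step outputs (the array sizes and values are assumptions, not read from `FetchReach-v4`) and shows the shapes that result when several steps are stacked into a batch.

```python
import numpy as np

# Hypothetical raw step outputs, shaped like a goal-conditioned env's return values.
raw_obs = {
    "observation": np.zeros(10),   # NumPy defaults to float64
    "achieved_goal": np.zeros(3),
    "desired_goal": np.ones(3),
}
raw_reward, raw_terminated = -1.0, False

# Same conversions as in GoalEnvWrapper.step
obs = {k: v.astype(np.float32) for k, v in raw_obs.items()}
reward = np.array([raw_reward], dtype=np.float32)
terminated = np.array([raw_terminated])

# Stacking per-step values now yields column-shaped batches of shape (B, 1).
reward_batch = np.stack([reward, reward, reward])
print(obs["observation"].dtype, reward.shape, reward_batch.shape)
```

A `(B, 1)` reward batch lines up with the critic's `(B, 1)` output; a flat `(B,)` reward would instead broadcast against it to `(B, B)` in the TD target.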

## 7. Replay Buffers

- `ReplayBuffer`: standard replay buffer that stores whole trajectories and
  evicts the oldest ones once `max_size` transitions are exceeded.
- `HERReplayBuffer`: extends the base buffer with **Hindsight Experience
  Replay**. At sampling time each transition is relabelled with probability
  `relabel_prob` (0.5 by default): it gets a new (hindsight) goal, and its
  reward and termination flag are recomputed for that goal.

In [7]:
REPLAY_KEYS = [
    "state", "next_state", "action",
    "terminated", "truncated", "reward",
    "achieved_goal", "desired_goal",
]


class ReplayBuffer:
    """Trajectory-level replay buffer stored in RAM."""

    def __init__(self, batch_size, max_size, keys, seed, device):
        self.memory: list[list[dict]] = []
        self.size = 0
        self.max_size = max_size
        self.batch_size = batch_size
        self.device = device
        self.rng = np.random.default_rng(seed)
        self.keys = keys

        # Maps global index → (trajectory_idx, timestep_idx)
        self._index_map: dict[int, tuple[int, int]] = {}

    # ---- storage ----

    def add_trajectory(self, trajectory):
        # Evict oldest trajectories if necessary
        evict_len, evict_count = 0, 0
        while self.size + len(trajectory) - evict_len > self.max_size:
            evict_len += len(self.memory[evict_count])
            evict_count += 1

        if evict_count > 0:
            self.memory = self.memory[evict_count:]
            new_map = {}
            for g_idx, (t_idx, e_idx) in self._index_map.items():
                if t_idx >= evict_count:
                    new_map[g_idx - evict_len] = (t_idx - evict_count, e_idx)
            self._index_map = new_map
            self.size -= evict_len

        # The new trajectory is appended at the end of `self.memory`.
        next_traj_idx = len(self.memory)
        for i in range(len(trajectory)):
            self._index_map[self.size + i] = (next_traj_idx, i)
        self.size += len(trajectory)
        self.memory.append(trajectory)

    # ---- sampling ----

    def sample_batch(self, batch_size=None):
        bs = batch_size or self.batch_size
        idxs = self.rng.choice(self.size, size=bs)
        return self._fetch(idxs)

    def _fetch(self, idxs):
        out = {k: [] for k in self.keys}
        for i in idxs:
            t_idx, ts_idx = self._index_map[i]
            step = self.memory[t_idx][ts_idx]
            for k in self.keys:
                out[k].append(torch.from_numpy(step[k]))
        return {k: torch.stack(out[k]).to(self.device) for k in self.keys}

    def get_trajectory(self, global_idx):
        t_idx, local_idx = self._index_map[global_idx]
        return self.memory[t_idx], local_idx


class HERReplayBuffer(ReplayBuffer):
    """Replay buffer with Hindsight Experience Replay.

    Supports three goal-relabelling strategies:
    - ``future``: sample from future timesteps in the same episode.
    - ``final``:  always use the last achieved goal of the episode.
    - ``episode``: sample uniformly from the entire episode.
    """

    def __init__(self, batch_size, max_size, keys, seed, device,
                 reward_fn, terminated_fn,
                 goal_strategy="future", relabel_prob=0.5):
        super().__init__(batch_size, max_size, keys, seed, device)
        self.reward_fn = reward_fn
        self.terminated_fn = terminated_fn
        self.goal_strategy = goal_strategy
        self.relabel_prob = relabel_prob

    def _sample_goal(self, trajectory, local_idx):
        T = len(trajectory)
        if self.goal_strategy == "future":
            t = self.rng.integers(local_idx + 1, T) if local_idx < T - 1 else T - 1
        elif self.goal_strategy == "final":
            t = T - 1
        elif self.goal_strategy == "episode":
            t = self.rng.integers(0, T)
        else:
            raise ValueError(f"Unknown goal strategy: {self.goal_strategy}")
        return trajectory[t]["achieved_goal"]

    def sample_batch(self, batch_size=None):
        bs = batch_size or self.batch_size
        idxs = self.rng.choice(self.size, size=bs, replace=False)
        batch = self._fetch(idxs)

        mask = self.rng.random(bs) < self.relabel_prob
        for i in np.where(mask)[0]:
            traj, local = self.get_trajectory(idxs[i])
            if len(traj) == 0:
                continue

            new_goal = self._sample_goal(traj, local)
            batch["desired_goal"][i] = torch.from_numpy(new_goal).to(self.device)

            achieved = batch["achieved_goal"][i].cpu().numpy()
            batch["reward"][i] = torch.tensor(
                [self.reward_fn(achieved, new_goal, {})],
                dtype=torch.float32, device=self.device,
            )
            batch["terminated"][i] = torch.tensor(
                [self.terminated_fn(achieved, new_goal, {})],
                dtype=torch.bool, device=self.device,
            )
        return batch
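
The three relabelling strategies differ only in which timestep supplies the hindsight goal. The sketch below mirrors the index logic of `_sample_goal` on a toy episode. The `sparse_reward` function, its `0.05` threshold, and the toy trajectory are assumptions for illustration; in the notebook the reward comes from the environment's `compute_reward`.

```python
import numpy as np


def sample_goal_index(strategy, t, T, rng):
    """Pick the timestep whose achieved goal becomes the hindsight goal."""
    if strategy == "future":
        # A strictly later step if one exists, otherwise the last step.
        return int(rng.integers(t + 1, T)) if t < T - 1 else T - 1
    if strategy == "final":
        return T - 1
    if strategy == "episode":
        return int(rng.integers(0, T))
    raise ValueError(f"Unknown goal strategy: {strategy}")


def sparse_reward(achieved, goal, threshold=0.05):
    """Hypothetical sparse reward: 0 within `threshold` of the goal, else -1."""
    return 0.0 if np.linalg.norm(achieved - goal) < threshold else -1.0


rng = np.random.default_rng(0)
T = 5
# Toy episode: the achieved goal moves 0.1 along x at every step.
achieved_goals = [np.array([0.1 * i, 0.0, 0.0]) for i in range(T)]

t = 1
picks = {s: sample_goal_index(s, t, T, rng) for s in ("future", "final", "episode")}

# Relabel with the "final" goal: the last step now counts as a success.
new_goal = achieved_goals[picks["final"]]
relabelled = sparse_reward(achieved_goals[T - 1], new_goal)
print(picks, relabelled)
```

Even an episode that never reached its original goal yields non-negative-reward transitions after relabelling, which is what gives the critic a learning signal under sparse rewards.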

## 8. Hyperparameters

In [8]:
# Environment
EPISODE_LENGTH = 50
SEED = 42

# Validation
VAL_EPISODES = 10
WINDOW_SIZE = 20

# Replay buffer
BATCH_SIZE = 256
MAX_BUFFER_SIZE = 1_000_000

# Exploration
PREHEAT_STEPS = 10_000
EXPLORATION_STD = 0.2
RANDOM_ACTION_PROB = 0.3

# Network / optimisation
HIDDEN_DIMS = [256, 256, 256]
GAMMA = 0.99
TAU = 0.95  # soft-update weight kept on the OLD target parameters (online weight is 1 - TAU)
UPDATE_FREQUENCY = 2
PATIENCE = 10  # early stopping: max evals without improvement
LR_ACTOR = 1e-4
LR_CRITIC = 1e-3

# Device
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

# Reproducibility
np.random.seed(SEED)
torch.manual_seed(SEED)

Using device: cuda



## 9. Utility Functions

In [9]:
def uniform_random_action(env):
    """Sample a uniform random action within the environment bounds."""
    low, high = env.action_space.low, env.action_space.high
    return np.random.rand(*low.shape).astype(np.float32) * (high - low) + low


def make_transition(obs, action, reward, terminated, truncated, next_obs):
    """Pack a single transition into a dict for the replay buffer."""
    return {
        "state": obs["observation"],
        "action": action,
        "reward": reward,
        "terminated": terminated,
        "truncated": truncated,
        "next_state": next_obs["observation"],
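        # Goal achieved *after* the action, matching how the env computes `reward`;
        # HER recomputes rewards from this value when relabelling.
        "achieved_goal": next_obs["achieved_goal"],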
        "desired_goal": obs["desired_goal"],
    }


def fill_buffer_random(env, buffer, n_steps):
    """Pre-fill the replay buffer with random-policy transitions."""
    obs = env.reset()
    trajectory = []
    for _ in range(n_steps):
        action = uniform_random_action(env)
        next_obs, reward, terminated, truncated, info = env.step(action)
        trajectory.append(make_transition(obs, action, reward, terminated, truncated, next_obs))
        obs = next_obs
        if terminated[0] or truncated[0]:
            buffer.add_trajectory(trajectory)
            trajectory = []
            obs = env.reset()
    if trajectory:
        buffer.add_trajectory(trajectory)


def make_buffer(env, batch_size, max_size, seed, device, use_her=False, goal_strategy="future"):
    """Create a replay buffer (optionally with HER)."""
    if not use_her:
        return ReplayBuffer(batch_size, max_size, REPLAY_KEYS, seed, device)

    reward_fn = lambda ag, g, _: env.unwrapped.compute_reward(ag, g, None)
    term_fn = lambda ag, g, _: env.unwrapped.compute_terminated(ag, g, None)
    return HERReplayBuffer(
        batch_size, max_size, REPLAY_KEYS, seed, device,
        reward_fn=reward_fn, terminated_fn=term_fn,
        goal_strategy=goal_strategy,
    )


class RunningMean:
    """Sliding-window mean calculator."""

    def __init__(self, window_size):
        self._buf = []
        self._max = window_size

    def update(self, value):
        self._buf.append(value)
        if len(self._buf) > self._max:
            self._buf.pop(0)

    @property
    def mean(self):
        return float(np.mean(self._buf)) if self._buf else 0.0
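
For comparison, the same sliding-window behaviour can be obtained from `collections.deque` with `maxlen`, which drops the oldest value automatically. This is an alternative sketch, not the class used in the notebook; the name `DequeRunningMean` is hypothetical.

```python
from collections import deque


class DequeRunningMean:
    """Sliding-window mean backed by a fixed-length deque (illustrative name)."""

    def __init__(self, window_size):
        self._buf = deque(maxlen=window_size)  # oldest values fall off automatically

    def update(self, value):
        self._buf.append(value)

    @property
    def mean(self):
        return sum(self._buf) / len(self._buf) if self._buf else 0.0


tracker = DequeRunningMean(window_size=3)
for success in [False, False, True, True, True]:
    tracker.update(success)

window_mean = tracker.mean  # only the last three values are in the window
print(window_mean)
```

Booleans are summed as 0 and 1, so feeding per-episode success flags gives the success rate over the most recent `window_size` episodes.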

## 10. Training Loop

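A single function trains a goal-conditioned DDPG agent with either a standard
or a HER replay buffer. The critic takes one gradient step per environment
step; the actor and the target networks are updated every `UPDATE_FREQUENCY`
steps. The agent is evaluated every `eval_every` episodes, and training stops
early after `patience` evaluations without improvement. When W&B is enabled,
metrics and evaluation videos are logged there.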

In [10]:
def train(agent, optim_actor, optim_critic, replay_buffer, env, *,
          run_name, n_episodes=4_000, eval_every=100, patience=10):
    """Train a GoalDDPG agent and log results to Weights & Biases.

    Training stops early if eval success rate does not improve
    for ``patience`` consecutive evaluation cycles.
    """

    if USE_WANDB:
        wandb.init(project="DDPG-HER", name=run_name, config={
            "episodes": n_episodes,
            "batch_size": BATCH_SIZE,
            "gamma": GAMMA,
            "tau": TAU,
            "lr_actor": LR_ACTOR,
            "lr_critic": LR_CRITIC,
            "exploration_std": EXPLORATION_STD,
            "hidden_dims": HIDDEN_DIMS,
            "preheat_steps": PREHEAT_STEPS,
            "buffer_type": type(replay_buffer).__name__,
        })

    success_tracker = RunningMean(WINDOW_SIZE)
    step = 0
    best_eval_sr = -1.0
    evals_without_improvement = 0

    try:
        for episode in range(n_episodes):
            agent.train()
            critic_losses, actor_losses = [], []
            obs = env.reset()
            trajectory = []
            done = False
            episode_success = False

            # ---- Collect & learn ----
            while not done:
                state_goal = np.concatenate([obs["observation"], obs["desired_goal"]])
                action = agent.act(state_goal, training=True)
                if np.random.random() < RANDOM_ACTION_PROB:
                    action = uniform_random_action(env)

                next_obs, reward, terminated, truncated, info = env.step(action)
                done = terminated[0] or truncated[0]
                if info.get("is_success", False):
                    episode_success = True

                trajectory.append(
                    make_transition(obs, action, reward, terminated, truncated, next_obs)
                )

                # Gradient step
                batch = replay_buffer.sample_batch()

                optim_critic.zero_grad()
                critic_loss = agent.get_critic_loss(batch)
                critic_loss.backward()
                optim_critic.step()
                critic_losses.append(critic_loss.item())

                if step % UPDATE_FREQUENCY == 0:
                    optim_actor.zero_grad()
                    actor_loss = agent.get_actor_loss(batch)
                    actor_loss.backward()
                    optim_actor.step()
                    agent.update_target_networks()
                    actor_losses.append(actor_loss.item())

                obs = next_obs
                step += 1

            replay_buffer.add_trajectory(trajectory)
            success_tracker.update(episode_success)

            metrics = {
                "actor_loss": np.mean(actor_losses) if actor_losses else 0,
                "critic_loss": np.mean(critic_losses) if critic_losses else 0,
                "train_success_rate": success_tracker.mean,
            }

            # ---- Evaluation ----
            if episode % eval_every == 0:
                agent.eval()
                eval_successes = []
                video_frames = []

                for val_ep in range(VAL_EPISODES):
                    obs_val = env.reset()
                    done_val = False
                    success = False

                    while not done_val:
                        if val_ep == 0:  # record video for the first eval episode
                            video_frames.append(env.render().astype(np.uint8))

                        sg = np.concatenate([obs_val["observation"],
                                             obs_val["desired_goal"]])
                        action = agent.act(sg, training=False)
                        obs_val, _, term, trunc, info_val = env.step(action)
                        done_val = term[0] or trunc[0]
                        if info_val.get("is_success", False):
                            success = True

                    eval_successes.append(success)

                eval_sr = np.mean(eval_successes)
                metrics["eval_success_rate"] = eval_sr

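                if video_frames and USE_WANDB:
                    # wandb.Video expects (time, channel, height, width), while
                    # env.render() returns (height, width, channel) frames.
                    video_np = np.array(video_frames, dtype=np.uint8).transpose(0, 3, 1, 2)
                    wandb.log({"eval_video": wandb.Video(video_np, fps=15, format="mp4")},
                              step=episode)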

                # Early stopping check
                if eval_sr > best_eval_sr:
                    best_eval_sr = eval_sr
                    evals_without_improvement = 0
                else:
                    evals_without_improvement += 1

                print(f"Episode {episode:>5d} | "
                      f"Train SR {success_tracker.mean:.3f} | "
                      f"Eval SR {eval_sr:.3f} | "
                      f"Best {best_eval_sr:.3f} | "
                      f"Patience {evals_without_improvement}/{patience}")

                if evals_without_improvement >= patience:
                    print(f"Early stopping at episode {episode} "
                          f"(no improvement for {patience} evals)")
                    break

            if USE_WANDB:
                wandb.log(metrics, step=episode)

    except Exception:
        import traceback
        traceback.print_exc()
    finally:
        if USE_WANDB:
            wandb.finish()
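
The early-stopping rule inside `train` can be isolated into a few lines. The sketch below reimplements the same counter logic as a standalone function (the function name is hypothetical) and traces it on the evaluation success rates printed by the `DDPG_HER_episode` run.

```python
def early_stop_index(eval_success_rates, patience):
    """Return the index of the evaluation that triggers early stopping, or None."""
    best = -1.0
    stale = 0
    for i, sr in enumerate(eval_success_rates):
        if sr > best:  # strict improvement resets the counter
            best, stale = sr, 0
        else:
            stale += 1
        if stale >= patience:
            return i
    return None


# Eval success rates of the DDPG_HER_episode run: 0.0, then 1.0 from episode 100 on.
rates = [0.0, 1.0] + [1.0] * 10
stop_at = early_stop_index(rates, patience=10)
print(stop_at)  # index of the evaluation at which training stops
```

Because the comparison is strict, a run that reaches a success rate of 1.0 can never improve again and stops exactly `patience` evaluations later. With `eval_every=100`, evaluation index 11 corresponds to episode 1100.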

## 11. MuJoCo Rendering Backend

In [11]:
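# Render off-screen via EGL so that `render_mode="rgb_array"` works on a
# headless machine (no display attached).
os.environ["MUJOCO_GL"] = "egl"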

## 12. Experiment Runner

Helper that creates the environment, agent, optimisers, and replay buffer,
fills the buffer with random transitions, then runs training.

In [12]:
def run_experiment(run_name, use_her=False, goal_strategy="future"):
    """Initialise everything and run a single training experiment."""

    env = gym.make("FetchReach-v4", render_mode="rgb_array")
    env = GoalEnvWrapper(env)
    env.reset(seed=SEED)

    obs_shape = env.unwrapped.observation_space["observation"].shape
    goal_shape = env.unwrapped.observation_space["desired_goal"].shape

    agent = GoalDDPG(
        obs_dim=obs_shape,
        goal_dim=goal_shape,
        min_action_values=env.action_space.low,
        max_action_values=env.action_space.high,
        hidden_dims=HIDDEN_DIMS,
        exploration_std=EXPLORATION_STD,
        gamma=GAMMA,
        target_exponential_averaging=TAU,
        device=DEVICE,
        seed=SEED,
    )

    optim_actor = torch.optim.Adam(agent.actor_parameters(), lr=LR_ACTOR)
    optim_critic = torch.optim.Adam(agent.critic_parameters(), lr=LR_CRITIC)

    buffer = make_buffer(
        env, BATCH_SIZE, MAX_BUFFER_SIZE, SEED, DEVICE,
        use_her=use_her, goal_strategy=goal_strategy,
    )
    fill_buffer_random(env, buffer, PREHEAT_STEPS)

    train(agent, optim_actor, optim_critic, buffer, env,
          run_name=run_name, patience=PATIENCE)
    env.close()
    return agent

## 13. Experiments

### 13.1 Baseline — DDPG without HER

In [13]:
agent_ddpg = run_experiment("DDPG", use_her=False)

Episode     0 | Train SR 0.000 | Eval SR 0.100 | Best 0.100 | Patience 0/10
Episode   100 | Train SR 0.000 | Eval SR 0.100 | Best 0.100 | Patience 1/10
Episode   200 | Train SR 0.100 | Eval SR 0.200 | Best 0.200 | Patience 0/10
Episode   300 | Train SR 0.200 | Eval SR 0.000 | Best 0.200 | Patience 1/10
Episode   400 | Train SR 0.150 | Eval SR 0.100 | Best 0.200 | Patience 2/10
Episode   500 | Train SR 0.000 | Eval SR 0.000 | Best 0.200 | Patience 3/10
Episode   600 | Train SR 0.100 | Eval SR 0.100 | Best 0.200 | Patience 4/10
Episode   700 | Train SR 0.100 | Eval SR 0.000 | Best 0.200 | Patience 5/10
Episode   800 | Train SR 0.250 | Eval SR 0.300 | Best 0.300 | Patience 0/10
Episode   900 | Train SR 0.350 | Eval SR 0.100 | Best 0.300 | Patience 1/10
Episode  1000 | Train SR 0.300 | Eval SR 0.100 | Best 0.300 | Patience 2/10
Episode  1100 | Train SR 0.100 | Eval SR 0.100 | Best 0.300 | Patience 3/10
Episode  1200 | Train SR 0.150 | Eval SR 0.000 | Best 0.300 | Patience 4/10
Episode  130

### 13.2 DDPG + HER (goal strategy: `final`)

In [14]:
agent_her_final = run_experiment("DDPG_HER_final", use_her=True, goal_strategy="final")

Episode     0 | Train SR 0.000 | Eval SR 0.100 | Best 0.100 | Patience 0/10
Episode   100 | Train SR 0.000 | Eval SR 0.100 | Best 0.100 | Patience 1/10
Episode   200 | Train SR 0.650 | Eval SR 1.000 | Best 1.000 | Patience 0/10
Episode   300 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 1/10
Episode   400 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 2/10
Episode   500 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 3/10
Episode   600 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 4/10
Episode   700 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 5/10
Episode   800 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 6/10
Episode   900 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 7/10
Episode  1000 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 8/10
Episode  1100 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 9/10
Episode  1200 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 10/10
Early stopp

### 13.3 DDPG + HER (goal strategy: `episode`)

In [15]:
agent_her_episode = run_experiment("DDPG_HER_episode", use_her=True, goal_strategy="episode")

Episode     0 | Train SR 0.000 | Eval SR 0.000 | Best 0.000 | Patience 0/10
Episode   100 | Train SR 0.750 | Eval SR 1.000 | Best 1.000 | Patience 0/10
Episode   200 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 1/10
Episode   300 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 2/10
Episode   400 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 3/10
Episode   500 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 4/10
Episode   600 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 5/10
Episode   700 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 6/10
Episode   800 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 7/10
Episode   900 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 8/10
Episode  1000 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 9/10
Episode  1100 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 10/10
Early stopping at episode 1100 (no improvement for 10 evals)


### 13.4 DDPG + HER (goal strategy: `future`)

In [16]:
agent_her_future = run_experiment("DDPG_HER_future", use_her=True, goal_strategy="future")

Episode     0 | Train SR 0.000 | Eval SR 0.200 | Best 0.200 | Patience 0/10
Episode   100 | Train SR 0.550 | Eval SR 0.800 | Best 0.800 | Patience 0/10
Episode   200 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 0/10
Episode   300 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 1/10
Episode   400 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 2/10
Episode   500 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 3/10
Episode   600 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 4/10
Episode   700 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 5/10
Episode   800 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 6/10
Episode   900 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 7/10
Episode  1000 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 8/10
Episode  1100 | Train SR 1.000 | Eval SR 1.000 | Best 1.000 | Patience 9/10
Episode  1200 | Train SR 1.000 | Eval SR 0.800 | Best 1.000 | Patience 10/10
Early stopp