# DDPG on Half Cheetah

In this example we will see how to train a Deep deterministic policy gradiant (DDPG) agent using `torchrl`. See [Documentation](https://torchrl.sanyamkapoor.com/) for an introduction to *TorchRL* and installation instructions.

This notebook requires mujoco-py. See [this](https://github.com/openai/mujoco-py) for installation instructions.


In [1]:
import argparse
import torch
import numpy as np
from torchrl import registry
from torchrl import utils
from torchrl.problems import base_hparams, DDPGProblem
from torchrl.agents import BaseDDPGAgent

We use a pre-built version of the DDPG agent from *TorchRL* library to initialize a `Problem`. This `Problem` class is also based on a pre-built version from the library.

In [2]:
class DDPGHalfCheetah(DDPGProblem):
  def init_agent(self):
    observation_space, action_space = utils.get_gym_spaces(self.runner.make_env)

    agent = BaseDDPGAgent(
        observation_space,
        action_space,
        actor_lr=self.hparams.actor_lr,
        critic_lr=self.hparams.critic_lr,
        gamma=self.hparams.gamma,
        tau=self.hparams.tau)

    return agent

This class requires us to extend the init_agent method. There is no restriction on the contents as long as it returns a valid BaseAgent.

## Hyperparameter Specification

We use the `HParams` object from the library to add custom properties. Again, arbitrary properties can be provided to such objects as long as they are consistently used within the previously specified `Problem` class (e.g. within the `init_agent` routine).


In [3]:
def hparams_ddpg_half_cheetah():
    params = base_hparams.base_ddpg()

    params.env_id = 'HalfCheetah-v2'

    params.num_processes = 16

    params.rollout_steps = 1
    params.max_episode_steps = 500
    params.num_total_steps = int(2e6)

    params.gamma = 0.99
    params.buffer_size = int(1e6)

    params.batch_size = 128
    params.tau = 1e-2
    params.actor_lr = 1e-4
    params.critic_lr = 1e-3

    return params

## Initialize Problem Instance

We use GPUs if available and some basic arguments, most importantly the seed. Make sure to run using different seeds.

**NOTE**: We use `argparse.Namespace` class as the argument to the `Problem` class which explains the type cast. If interested, track this issue [here](https://github.com/activatedgeek/torchrl/issues/61).

In [4]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

args=dict(
    seed=1,
    log_interval=1000,
    eval_interval=1000,
    num_eval=1,
)

ddpg_half_cheetah = DDPGHalfCheetah(
    hparams_ddpg_half_cheetah(),
    argparse.Namespace(**args),
    None, # Disable logging
    device=device,
    show_progress=True,
)

## Training the DDPG Agent

Calling the `run()` routine allows us to execute training. Note that for now we have disabled logging by keeping `log_dir=None` in the above instatiation.

In [5]:
ddpg_half_cheetah.run()

100%|██████████| 125000/125000 [30:41<00:00, 67.86epochs/s] 


## Evaluate Training


In [12]:
%%time

ddpg_half_cheetah.agent.train(False)

eval_runner = ddpg_half_cheetah.make_runner(n_envs=10)
eval_rewards = []
for _ in range(100 // ddpg_half_cheetah.runner.n_envs):
  eval_history = eval_runner.rollout(ddpg_half_cheetah.agent)
  for i in range(ddpg_half_cheetah.runner.n_envs):
    _, _, reward_history, _, _ = eval_history[0]
    eval_rewards.append(np.sum(reward_history, axis=0))
eval_runner.close()

CPU times: user 6.08 s, sys: 1.29 s, total: 7.36 s
Wall time: 8.3 s


In [13]:
avg_reward, std_reward = np.average(eval_rewards), np.std(eval_rewards)
print('Reward: {} +/- {}'.format(avg_reward, std_reward))


Reward: 672.4418830774515 +/- 36.763124177817204
