# DDPG on Pendulum

In this example, we train a Deep Deterministic Policy Gradient (DDPG) agent using `torchrl`. See the [documentation](https://torchrl.sanyamkapoor.com/) for an introduction to *TorchRL* and installation instructions.

## Problem Specification

The full problem can be specified in **fewer than 50 lines of code**!

In [None]:
import argparse
import torch
import numpy as np
from torchrl import registry
from torchrl import utils
from torchrl.problems import base_hparams, DDPGProblem
from torchrl.agents import BaseDDPGAgent

We use the pre-built DDPG agent from the *TorchRL* library to initialize a `Problem`. The `Problem` class itself extends the library's pre-built `DDPGProblem`.

In [None]:
class DDPGPendulum(DDPGProblem):
  def init_agent(self):
    observation_space, action_space = utils.get_gym_spaces(self.runner.make_env)

    agent = BaseDDPGAgent(
        observation_space,
        action_space,
        actor_lr=self.hparams.actor_lr,
        critic_lr=self.hparams.critic_lr,
        gamma=self.hparams.gamma,
        tau=self.hparams.tau)

    return agent

This class requires us to override the `init_agent` method. There is no restriction on its contents as long as it returns a valid `BaseAgent`.

## Hyperparameter Specification

We use the `HParams` object from the library to set custom properties. Arbitrary properties can be added to such an object as long as they are used consistently within the previously specified `Problem` class (e.g. within the `init_agent` routine).

In [None]:
def hparams_ddpg_pendulum():
    params = base_hparams.base_ddpg()

    params.env_id = 'Pendulum-v0'

    params.num_processes = 1

    params.rollout_steps = 1
    params.max_episode_steps = 500
    params.num_total_steps = 20000

    params.gamma = 0.99
    params.buffer_size = int(1e6)

    params.batch_size = 128
    params.tau = 1e-2
    params.actor_lr = 1e-4
    params.critic_lr = 1e-3

    return params
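
For orientation, `gamma` and `tau` play the following roles in standard DDPG: `gamma` discounts the bootstrapped critic target, and `tau` sets how quickly the target networks track the learned ones (Polyak averaging). The framework-free sketch below illustrates both on plain Python floats. The helpers `soft_update` and `td_target` are hypothetical names used only for illustration; they are not part of the *TorchRL* API, whose internals may differ.

```python
def soft_update(target_params, source_params, tau):
    """Polyak averaging: move each target parameter a fraction tau toward the source."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target_params, source_params)]


def td_target(reward, next_q, done, gamma):
    """One-step critic target; no bootstrapping past a terminal state."""
    return reward + gamma * (1.0 - float(done)) * next_q


# Hypothetical scalar "weights" standing in for network parameters.
target = [0.0, 1.0]
source = [1.0, 3.0]
target = soft_update(target, source, tau=1e-2)
print(target)  # each target value moved 1% of the way toward its source value

# Discounted target for a non-terminal transition.
print(td_target(reward=-1.0, next_q=-10.0, done=False, gamma=0.99))
```

With `tau = 1e-2`, the target parameters move only 1% of the way toward the learned parameters per update, which is what keeps the critic's regression target slowly varying.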

## Initialize Problem Instance

We use a GPU if one is available and pass a few basic arguments, most importantly the seed. Make sure to repeat the run with different seeds.

**NOTE**: The `Problem` class takes an `argparse.Namespace` as its argument, which explains the type cast. If interested, track this issue [here](https://github.com/activatedgeek/torchrl/issues/61).

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

args = dict(
    seed=1,
    log_interval=1000,
    eval_interval=1000,
    num_eval=1,
)

ddpg_pendulum = DDPGPendulum(
    hparams_ddpg_pendulum(),
    argparse.Namespace(**args),
    None, # Disable logging
    device=device,
    show_progress=True,
)
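
The type cast used here is plain standard-library behaviour: `argparse.Namespace(**args)` exposes each dictionary key as an attribute, which is presumably how the `Problem` class reads values such as the seed. The following self-contained illustration is independent of *TorchRL*; the variable names `namespace` and `other_run` are only for this sketch.

```python
import argparse

args = dict(seed=1, log_interval=1000, eval_interval=1000, num_eval=1)
namespace = argparse.Namespace(**args)

# Keys become attributes, so downstream code can write namespace.seed.
print(namespace.seed)

# vars() recovers the dictionary, e.g. to build the arguments for a run with another seed.
other_run = argparse.Namespace(**{**vars(namespace), 'seed': 2})
print(other_run.seed, other_run.log_interval)
```

Building a fresh `Namespace` per seed like this is one simple way to repeat the experiment with different seeds without mutating the original arguments.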

## Training the DDPG Agent

Calling the `run()` routine executes training. Note that for now we have disabled logging by passing `None` for `log_dir` in the above instantiation.

In [None]:
ddpg_pendulum.run()

## Evaluate Training

The pendulum starts in a random position, and the goal is to swing it up so it stays upright.
Quoting the documentation:

> Pendulum-v0 is an unsolved environment, which means it does not have a specified reward threshold at which it's considered solved

The best 100-episode performance according to the leaderboard is -123.11 ± 6.86.
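
To put such numbers in context: Gym's classic Pendulum implementation gives a per-step reward of `-(theta^2 + 0.1 * theta_dot^2 + 0.001 * torque^2)`, with the angle `theta` measured from upright and wrapped into `[-pi, pi)`, so episode returns are never positive. The sketch below re-implements that formula in plain Python for illustration only. `angle_normalize` and `pendulum_reward` here are stand-alone helpers written for this example, not *TorchRL* functions, and the formula is worth checking against the Gym version you have installed.

```python
import math


def angle_normalize(theta):
    """Wrap an angle in radians into [-pi, pi)."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


def pendulum_reward(theta, theta_dot, torque):
    """Per-step reward: zero only when upright, motionless, and untorqued."""
    cost = angle_normalize(theta) ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2
    return -cost


# Upright and still: the best achievable step reward.
print(pendulum_reward(0.0, 0.0, 0.0))

# Hanging straight down at rest costs about pi^2 per step.
print(pendulum_reward(math.pi, 0.0, 0.0))
```

Because every step away from the upright position accumulates cost, a good policy is one that swings up quickly and then holds the pendulum still with little torque.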


In [None]:
%%time

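# Switch the agent to evaluation mode.
ddpg_pendulum.agent.train(False)

# Evaluate with a separate runner over 10 parallel environments.
# Assumes the evaluation runner exposes `n_envs` like the training runner does.
eval_runner = ddpg_pendulum.make_runner(n_envs=10)
eval_rewards = []
for _ in range(100 // eval_runner.n_envs):
  eval_history = eval_runner.rollout(ddpg_pendulum.agent)
  for i in range(eval_runner.n_envs):
    # Sum the rewards of environment i to get its episode return.
    _, _, reward_history, _, _ = eval_history[i]
    eval_rewards.append(np.sum(reward_history, axis=0))
eval_runner.close()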

In [None]:
avg_reward, std_reward = np.average(eval_rewards), np.std(eval_rewards)

print('Reward: {} +/- {}'.format(avg_reward, std_reward))


## Visualization

In [None]:
vis_runner = ddpg_pendulum.make_runner(n_envs=1)
vis_runner.rollout(ddpg_pendulum.agent, render=True)
vis_runner.close()