# Exercise 9.2 Deep Deterministic Policy Gradient with Continuous CartPole

## Goal

- understand DDPG and implement it
- understand how each hyperparameter contributes to the learning process

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import gym
import numpy as np
import chula_rl as rl
from chula_rl.env.cartpolecont import ContinuousCartPoleEnv

# Step 1: Env

In [None]:
def make_env():
    env = ContinuousCartPoleEnv()
    env = rl.env.wrapper.EpisodeSummary(env)
    return env

In [None]:
env = rl.env.DummyVecEnv([make_env])

# Step 2: Vec one-step explorer

In a parallel environment setting, we also need a compatible parallel explorer. The code is straightforward enough that we have already implemented it for you, but you are welcome to read it.

Go see `chula_rl.explorer.vec_one_step_explorer`

In DDPG, we use a one-step explorer in the same fashion as DQN.

In [None]:
exp = rl.explorer.VecOneStepExplorer(n_step, n_max_interaction, env)

## Step 2.1 Vec one-step replay

DDPG uses a replay buffer. We have implemented a vec version of the one-step replay. It is a trivial extension of the non-vec one.

Go see `chula_rl.explorer.vec_one_step_uniform_replay`

A replay is implemented as a "wrapper" around the original explorer. It acts as a middle-person: it calls the explorer but returns samples from the replay instead!

In [None]:
exp = rl.explorer.VecOneStepUniformReplay(exp, n_sample, n_max_size, n_env, obs_space, act_space)
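
To make the wrapper idea concrete, here is a minimal sketch in plain Python. It is not the `chula_rl` implementation: `CountingExplorer` and `UniformReplay` are hypothetical stand-ins, and the only assumption carried over from this notebook is that an explorer exposes `step(policy)`.

```python
import random
from collections import deque


class CountingExplorer:
    """Hypothetical stand-in explorer: each step yields one fake transition."""

    def __init__(self):
        self.t = 0

    def step(self, policy):
        self.t += 1
        # (obs, action, reward, next_obs, done)
        return (self.t, policy(self.t), 1.0, self.t + 1, False)


class UniformReplay:
    """Hypothetical wrapper: stores what the explorer returns, hands back random samples."""

    def __init__(self, explorer, n_sample, n_max_size, seed=0):
        self.explorer = explorer
        self.n_sample = n_sample
        self.buffer = deque(maxlen=n_max_size)  # oldest transitions fall out first
        self.rng = random.Random(seed)

    def step(self, policy):
        # 1) forward the call to the wrapped explorer and store the result
        self.buffer.append(self.explorer.step(policy))
        # 2) return a uniform sample (without replacement) from the stored data
        k = min(self.n_sample, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)


replay = UniformReplay(CountingExplorer(), n_sample=4, n_max_size=10)
for _ in range(25):
    batch = replay.step(lambda obs: 0.0)
```

Because the wrapper has the same `step(policy)` interface as the explorer it wraps, the training loop does not need to know whether it is talking to the explorer directly or to the replay.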

# Step 3: Deep Deterministic Policy Gradient (DDPG)

DDPG requires 2 components: 
- Deterministic policy
- Critic (Q function)

Both are implemented as neural nets. We leave this section to you. 

Your DDPG should subclass `chula_rl.policy.base_policy.BasePolicy`. 
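
As a starting point, two small pieces of arithmetic that standard DDPG relies on can be sketched without any neural net library. This is only a sketch under assumptions: the function names, `gamma`, and `tau` are illustrative and not part of `chula_rl`, and the use of slowly updated target copies of both networks comes from the standard DDPG formulation rather than from this notebook. The critic regresses toward a one-step TD target, and the target copies track the online networks by Polyak averaging.

```python
def td_target(reward, done, next_q, gamma=0.99):
    """Critic target: y = r + gamma * (1 - done) * Q_target(s', mu_target(s')).

    `next_q` is the target critic's value at the next state, evaluated at the
    action chosen by the target policy.
    """
    return reward + gamma * (1.0 - float(done)) * next_q


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# Non-terminal transition: the bootstrap term is kept.
y = td_target(reward=1.0, done=False, next_q=2.0, gamma=0.5)
# Terminal transition: the bootstrap term is dropped.
y_terminal = td_target(reward=1.0, done=True, next_q=2.0, gamma=0.5)
# The target parameters move a fraction tau toward the online parameters.
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

The policy itself is then updated to increase the critic's value at the policy's own action, i.e. by ascending the gradient of Q(s, mu(s)) with respect to the policy parameters.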

## Words of advice: 

- Your code will surely contain bugs! Developing in a Jupyter notebook might not be a good idea.
- There are a ton of hyperparameters, and finding the right values is no easy task.
- Finding the right values may require analyzing how the code behaves, which is hard if you don't "log" enough.
- So, log EVERYTHING, and use TensorBoard to your advantage.
- For example, log the mean action of the policy and the current value of the value function. These will be invaluable in debugging.
- "Why is it so terribly fragile ~" is a sentence that describes this section.
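
The logging advice above can be sketched with a tiny scalar accumulator. `ScalarLog` and the tag names are hypothetical, not part of `chula_rl`; the idea is just to collect values during training and periodically flush their means to whichever TensorBoard writer you use.

```python
from collections import defaultdict


class ScalarLog:
    """Hypothetical helper: accumulates scalars per tag and reports their means."""

    def __init__(self):
        self.values = defaultdict(list)

    def add(self, tag, value):
        self.values[tag].append(float(value))

    def flush(self):
        # Average everything collected since the last flush, then reset.
        means = {tag: sum(v) / len(v) for tag, v in self.values.items()}
        self.values.clear()
        return means


log = ScalarLog()
# Pretend these are (action, Q-value) pairs observed in two optimization steps.
for action, q in [(0.2, 1.0), (-0.4, 3.0)]:
    log.add("policy/mean_action", action)
    log.add("critic/q_value", q)
summary = log.flush()  # dict of tag -> mean, ready to be written out
```

A mean action stuck at one of the action bounds, or Q-values that grow without limit, are the kind of symptoms that such logs make visible early.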

## Run it

In case you forgot how to run it, here is how:

```
while True:
    data = exp.step(policy)
    policy.optimize_step(data)
```
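
The loop above has no explicit exit, so something has to stop it. A self-contained sketch of one way this could work is shown below, assuming the explorer raises an exception once its interaction budget is used up. That behavior, the exception name, and both stub classes are assumptions for illustration; check `chula_rl.explorer` for what the real explorer does.

```python
class InteractionBudgetExceeded(Exception):
    """Hypothetical signal that the explorer has used up its interaction budget."""


class BudgetedExplorer:
    """Hypothetical explorer stub that allows at most n_max_interaction steps."""

    def __init__(self, n_max_interaction):
        self.n_max_interaction = n_max_interaction
        self.n_interaction = 0

    def step(self, policy):
        if self.n_interaction >= self.n_max_interaction:
            raise InteractionBudgetExceeded()
        self.n_interaction += 1
        return {"reward": 1.0}  # placeholder for a batch of transitions


class CountingPolicy:
    """Hypothetical policy stub that only counts its optimization steps."""

    def __init__(self):
        self.n_update = 0

    def optimize_step(self, data):
        self.n_update += 1


exp = BudgetedExplorer(n_max_interaction=100)
policy = CountingPolicy()
try:
    while True:
        data = exp.step(policy)
        policy.optimize_step(data)
except InteractionBudgetExceeded:
    print("training finished after", policy.n_update, "updates")
```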