# Control in a continuous action space with DDPG
_Authors:_ Aristotelis Dimitriou, Konstantinos Spinakis

---

### Introduction

In this reinforcement learning project, we implement the Deep Deterministic Policy Gradient (DDPG) algorithm to handle continuous action spaces while maintaining the benefits of Deep Q-learning (DQN). The objective is to stabilize an inverted pendulum in the Pendulum-v1 environment from OpenAI Gym.

DDPG is an actor-critic algorithm that uses one neural network (the critic) to estimate the Q function and another (the actor) to select the action. It is based on the deterministic policy gradient theorem, which allows both the actor and the critic to be trained off-policy from a replay buffer. The policy network outputs a specific action rather than a probability distribution, so exploration is handled separately, typically by adding noise to the selected action.
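
For reference, the update rules behind this description can be written as follows. This is a sketch of the standard DDPG formulation rather than the exact variant implemented in this project: the symbols $\phi$ (critic parameters), $\theta$ (actor parameters), $\gamma$ (discount factor), the primed target-network parameters, and the noise term $\epsilon_t$ are notational assumptions that do not appear elsewhere in this section.

$$
\begin{aligned}
a_t &= \mu_\theta(s_t) + \epsilon_t && \text{(deterministic action plus exploration noise)} \\
y_t &= r_t + \gamma\, Q_{\phi'}\big(s_{t+1}, \mu_{\theta'}(s_{t+1})\big) && \text{(critic target)} \\
L(\phi) &= \mathbb{E}\Big[\big(Q_\phi(s_t, a_t) - y_t\big)^2\Big] && \text{(critic loss)} \\
\nabla_\theta J &\approx \mathbb{E}\Big[\nabla_a Q_\phi(s, a)\big|_{a=\mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)\Big] && \text{(actor gradient)}
\end{aligned}
$$

The expectations are taken over transitions $(s_t, a_t, r_t, s_{t+1})$ sampled from the replay buffer, which is what makes the training off-policy.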

* The `Pendulum-v1` environment provides a three-dimensional observation vector $(\cos(\alpha), \sin(\alpha), \dot{\alpha})$ where $\alpha$ represents the angle between the pendulum and the vertical line. 

* The action is a scalar value between -2 and 2, representing the torque applied to the pendulum's single joint. 

* The control policy must learn to swing the pendulum to gain momentum before stabilizing it in a vertical position with minimal torque. 

* The reward function is defined as $-(\alpha^2 + 0.1\cdot\dot{\alpha}^2 + 0.001\cdot\tau^2)$, with the maximum reward of 0 achieved when the pendulum is vertically positioned, motionless, and with no torque applied.
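
To make the reward formula above concrete, here is a minimal sketch that evaluates it for two states. The function names are hypothetical, and wrapping the angle to $[-\pi, \pi)$ before squaring is an assumption about how the environment treats $\alpha$. The speed of 8 used in the second example is likewise taken as the environment's speed limit, which is not stated in this section.

```python
import math


def normalize_angle(alpha):
    """Wrap an angle in radians to the interval [-pi, pi) (assumed convention)."""
    return ((alpha + math.pi) % (2 * math.pi)) - math.pi


def pendulum_reward(alpha, alpha_dot, torque):
    """Reward -(alpha^2 + 0.1 * alpha_dot^2 + 0.001 * torque^2) from the text."""
    a = normalize_angle(alpha)
    return -(a ** 2 + 0.1 * alpha_dot ** 2 + 0.001 * torque ** 2)


# Upright, motionless, no torque: the maximum reward of 0.
best = pendulum_reward(0.0, 0.0, 0.0)

# Hanging straight down at speed 8 with torque 2 (both assumed to be the limits).
worst = pendulum_reward(math.pi, 8.0, 2.0)

print(best, worst)
```

The angle term dominates the penalty, which is why a policy must first swing the pendulum up before it can collect rewards close to zero.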


In [30]:
import gym
import numpy as np
from helpers import NormalizedEnv, RandomAgent

___
### Heuristic Policy

In this section, we familiarize ourselves with the `Pendulum-v1` environment by implementing a simple heuristic policy that attempts to stabilize the pendulum. We then compare it with a random policy to verify that the heuristic achieves a higher average reward.

_**Tasks:**_


1. Create an instance of the `Pendulum-v1` environment and wrap it in a `NormalizedEnv` class.

In [31]:
env = NormalizedEnv(gym.make('Pendulum-v1'))
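
`NormalizedEnv` comes from the project's `helpers` module, whose source is not shown here. Wrappers of this kind commonly let the agent work with actions in $[-1, 1]$ and rescale them to the environment's torque range. The sketch below illustrates that idea under this assumption; `rescale_action` is a hypothetical name and is not part of `helpers` or `gym`.

```python
def rescale_action(action, low=-2.0, high=2.0):
    """Map a normalized action in [-1, 1] to the torque range [low, high].

    Hypothetical illustration of what an action-normalizing wrapper may do.
    """
    # Clip first so out-of-range inputs cannot exceed the torque limits.
    clipped = max(-1.0, min(1.0, action))
    # Affine map: -1 -> low, 0 -> midpoint, +1 -> high.
    return low + (clipped + 1.0) * 0.5 * (high - low)


for a in (-1.0, 0.0, 0.5, 1.0, 3.0):
    print(a, rescale_action(a))
```

If the wrapper works this way, a torque such as `fixed_torque=0.5` passed by an agent would correspond to a physical torque of 1.0 in the underlying environment.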

2. Implement a function that simulates the interaction between the environment and an agent and returns the average cumulative reward.

In [32]:
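def run_agent(agent, env, episodes=10, verbose=False):
    """Run `agent` in `env` for `episodes` episodes and return the mean episode return."""
    rewards = []
    for i in range(episodes):
        # gym >= 0.26: reset() returns (observation, info)
        state, _ = env.reset()
        total_reward = 0
        done = False

        while not done:
            action = agent.compute_action(state)
            # gym >= 0.26: step() returns (obs, reward, terminated, truncated, info)
            next_state, reward, terminated, truncated, _ = env.step(action)
            # An episode ends on termination or on the time limit (truncation)
            done = terminated or truncated
            total_reward += reward
            state = next_state
            if verbose:
                print(f'Episode {i+1}/{episodes}')
                print(f'State: {state}')
                print(f'Action: {action}')
                print(f'Reward: {reward}')
                print(f'Done: {done}')
                print('------------------')
        rewards.append(total_reward)
    return np.mean(rewards)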
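
The interaction loop depends on the `reset`/`step` signatures introduced in `gym` 0.26, where `reset()` returns `(observation, info)` and `step()` returns `(observation, reward, terminated, truncated, info)`. The sketch below exercises that contract without installing `gym`. `StubEnv` and `rollout` are hypothetical names used only for illustration; the stub returns a constant reward of -1 and ends each episode by truncation after a fixed number of steps.

```python
class StubEnv:
    """Hypothetical stand-in that mimics the gym >= 0.26 reset/step signatures."""

    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        # (observation, info)
        return [1.0, 0.0, 0.0], {}

    def step(self, action):
        self.t += 1
        observation = [1.0, 0.0, 0.0]
        reward = -1.0
        terminated = False                    # this task has no terminal state
        truncated = self.t >= self.max_steps  # the time limit ends the episode
        return observation, reward, terminated, truncated, {}


def rollout(env, policy):
    """Run one episode and return its cumulative reward."""
    state, _ = env.reset()
    total, done = 0.0, False
    while not done:
        state, reward, terminated, truncated, _ = env.step(policy(state))
        total += reward
        done = terminated or truncated
    return total


episode_return = rollout(StubEnv(max_steps=3), lambda state: 0.0)
print(episode_return)
```

Combining `terminated` and `truncated` keeps the loop correct for environments that end only through a time limit as well as for those with genuine terminal states.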

3. Implement a heuristic policy for the pendulum (`HeuristicPendulumAgent`).

In [33]:
class HeuristicPendulumAgent:
    """Push with the motion in the lower half of the domain, against it in the upper half."""

    def __init__(self, env, fixed_torque=0.5, verbose=False):
        self.state_size = env.observation_space.shape[0]
        self.action_size = env.action_space.shape[0]
        self.fixed_torque = fixed_torque
        self.verbose = verbose

    def compute_action(self, state):
        if self.verbose:
            print(f'State: {state}')
        # Observation layout: (cos(alpha), sin(alpha), angular velocity), alpha = 0 when upright
        cos_alpha, _, angular_velocity = state
        if cos_alpha < 0:   # Lower half of the domain: push with the motion to gain momentum
            action = np.sign(angular_velocity) * self.fixed_torque
        else:               # Upper half of the domain: push against the motion to slow down
            action = -np.sign(angular_velocity) * self.fixed_torque
        return np.array([action])

4. Compare the average cumulative reward obtained by the heuristic policy with that of the random agent.

In [34]:
heuristic_agent = HeuristicPendulumAgent(env)
random_agent = RandomAgent(env)

heuristic_agent_avg_reward = run_agent(heuristic_agent, env, verbose=False)
random_agent_avg_reward = run_agent(random_agent, env, verbose=False)
print(f'Random agent average reward: {random_agent_avg_reward:.2f}')
print(f'Heuristic agent average reward: {heuristic_agent_avg_reward:.2f}')

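Note: under `gym` 0.26 or later, `env.reset()` returns an `(observation, info)` tuple rather than a bare observation. Passing that tuple directly to `HeuristicPendulumAgent.compute_action` raises `ValueError: not enough values to unpack (expected 3, got 2)`, so the observation must first be unpacked from the return value of `reset()`.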