# Implementation of DDPG in a Custom Environment

- In this notebook I go over how I implemented the deep deterministic policy gradient (DDPG) algorithm and then trained it in a custom environment I call Target Practice.

- I implemented the algorithm using PyTorch and trained it on a GPU instance in Colab.

- The entire training script and the environment code are located in `DDPG_training.ipynb`.

## Environment: Target Practice

<img src="images/episodes_0_49.gif" width="250" align="center">

- The goal in this environment is simple: given an input array that contains a target, choose the point closest to the center of the target.
- The closer the point is to the `middle of the target`, the higher the reward for the agent.
- Choosing a point that is not within the target results in a reward of `-1`.
- Each episode lasts a single step: the agent chooses a point in the environment and receives the corresponding reward.


- Observation Space
    - The observation space is a single channel array `(X, X, 1)`
- Action Space
    - The action space is an array with shape `(2, )`, whose values are continuous and in `(0, X-1]`, where X is the chosen dimension of the array
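
The environment described above can be sketched in a few lines. This is a minimal illustration and not the code from `DDPG_training.ipynb`: the class name `TargetPracticeSketch`, the circular target, the `radius` parameter, and the linear reward falloff are all assumptions made for the example. The observation is returned as a nested list without the trailing channel axis.

```python
import math
import random


class TargetPracticeSketch:
    """Hypothetical single-step environment: one circular target on a grid."""

    def __init__(self, size=250, radius=20, seed=None):
        self.size = size
        self.radius = radius          # assumed target shape: a filled circle
        self.rng = random.Random(seed)
        self.center = None

    def reset(self):
        # Place the target fully inside the grid and return the observation.
        r = self.radius
        self.center = (self.rng.randint(r, self.size - 1 - r),
                       self.rng.randint(r, self.size - 1 - r))
        cx, cy = self.center
        # Observation: 1.0 inside the target, 0.0 elsewhere.
        return [[1.0 if math.hypot(x - cx, y - cy) <= r else 0.0
                 for y in range(self.size)]
                for x in range(self.size)]

    def step(self, action):
        # action is an (x, y) pair of continuous coordinates.
        dist = math.hypot(action[0] - self.center[0],
                          action[1] - self.center[1])
        if dist > self.radius:
            reward = -1.0                       # missed the target
        else:
            reward = 1.0 - dist / self.radius   # assumed: 1 at center, 0 at edge
        done = True                             # every episode is one step
        return reward, done
```

Calling `reset()` and then `step((x, y))` once makes up a full episode, so a training loop only ever sees one transition per episode.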

## Agent: DDPG

- Since the action space is continuous, we need to choose an agent that can produce continuous actions. DDPG does that.

- I used several resources to help me implement DDPG, which I link here:
    - https://keras.io/examples/rl/ddpg_pendulum/
    - https://arxiv.org/pdf/1509.02971
    - https://towardsdatascience.com/deep-deterministic-policy-gradients-explained-2d94655a9b7b

- To summarize the algorithm
    - DDPG uses an actor and critic model
    - The actor chooses the action to take (XY coordinates in our case) given the observation (an array containing a target)
    - The critic produces a Q-value which reflects how good an action is given a current state
    - To train, the actor takes actions, then is updated by maximizing the Q-value returned by the critic
    - The critic is improved by being trained on a buffer of experiences that are collected as the agent is being trained
    - The algorithm also calls for a random process used for action exploration. 
    - In my case, I add uniformly random values to the XY coordinates chosen by the actor
    - The random values are phased out over the course of training to reduce their influence, which produces explore-exploit behavior
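
Two of the pieces summarized above, the experience buffer and the decaying uniform exploration noise, can be sketched without any deep learning library. This is an illustration under stated assumptions, not the notebook's code: the names `ReplayBuffer`, `noise_scale`, and `explore`, the linear decay schedule, and the `max_noise` value are all hypothetical. Because each episode lasts one step, the sketch stores only `(state, action, reward)`; an implementation that follows the general algorithm would typically also store the next state.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward) experiences."""

    def __init__(self, capacity=10_000):
        self.items = deque(maxlen=capacity)   # oldest entries drop out first

    def add(self, state, action, reward):
        self.items.append((state, action, reward))

    def sample(self, batch_size):
        # Uniform random minibatch, capped at the current buffer size.
        k = min(batch_size, len(self.items))
        return random.sample(list(self.items), k)


def noise_scale(episode, total_episodes, start=1.0, end=0.0):
    """Assumed schedule: linear decay of the noise scale from start to end."""
    frac = min(max(episode / total_episodes, 0.0), 1.0)
    return start + (end - start) * frac


def explore(action, episode, total_episodes, size=250, max_noise=125.0):
    """Add uniform noise to an (x, y) action, then clip it to the grid."""
    scale = noise_scale(episode, total_episodes) * max_noise
    noisy = [a + random.uniform(-scale, scale) for a in action]
    return [min(max(v, 0.0), size - 1.0) for v in noisy]
```

Early in training `explore` scatters actions widely around the actor's choice, which fills the buffer with varied examples for the critic. By the final episode the scale reaches zero and the actor's action passes through unchanged.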


## Results

`Episodes 0 - 49`

<img src="images/episodes_0_49.gif" width="250" align="center">

`Episodes 3000 - 3049`

<img src="images/episodes_3000_3049.gif" width="250" align="center">

`Episodes 7000 - 7049`

<img src="images/episodes_7000_7049.gif" width="250" align="center">



## Learnings
- Explore vs Exploit
    - I had to create a random process that would fit well for the environment that I had created
    - I found myself thinking about the exploration as building a good training set for the agent, to show what actions yielded the best reward
    - Creating this was an unlock for getting the agent to train
- Actor Output Layer
    - Although the actor's output is continuous, as in a regression neural net, I found myself using the tanh activation function on the output layer and then scaling the -1 to 1 output to the environment dimensions (0-249). This also seemed to be an unlock for training
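
The tanh-then-scale step can be shown on its own. This is a minimal sketch, assuming a 250-wide grid and a plain linear rescale; the function name `scale_action` is hypothetical, and the notebook applies the equivalent operation to the actor's output tensor in PyTorch.

```python
import math


def scale_action(raw_outputs, size=250):
    """Map unbounded network outputs to coordinates in [0, size - 1]."""
    scaled = []
    for z in raw_outputs:
        t = math.tanh(z)                             # squash to [-1, 1]
        scaled.append((t + 1.0) / 2.0 * (size - 1))  # shift, then stretch
    return scaled


print(scale_action([0.0, 0.0]))  # [124.5, 124.5], the middle of the grid
```

Bounding the output this way means the actor can never propose a point outside the array, so no gradient signal is spent learning the valid range.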

## Next Steps

- I created the environment so it can be made more complex
- Additional targets can be added, in which case the reward system works as follows:
    - A reward for hitting the target
    - A reward multiplier for hitting the current biggest target
    - As targets are hit they are removed from the environment
- I would like to try to train an agent on a more complex version of the environment