# Description of Implementation
## Learning Algorithm
* ### Algorithm - Deep Deterministic Policy Gradients (DDPG) Implementation
This project implements [Deep Deterministic Policy Gradients](https://arxiv.org/abs/1509.02971) (DDPG), a model-free, actor-critic algorithm based on the deterministic policy gradient that can operate over continuous action spaces.

![DDPG algorithm](images/DDPG.png)
* ### Hyperparameters
    * Replay buffer size
      ```python
      BUFFER_SIZE = int(1e5)
      ```
    * Minibatch size
      ```python
      BATCH_SIZE = 128
      ```
    * Discount factor
      ```python
      GAMMA = 0.99
      ```
    * Interpolation factor for the soft update of the target parameters
      ```python
      TAU = 1e-3
      ```
    * Learning rate of the actor
      ```python
      LR_ACTOR = 2e-4
      ```
    * Learning rate of the critic
      ```python
      LR_CRITIC = 2e-4
      ```
    * L2 weight decay
      ```python
      WEIGHT_DECAY = 0
      ```
    * Learning timestep interval
      ```python
      LEARN_EVERY = 8
      ```
    * Number of learning passes
      ```python
      LEARN_NUM = 4
      ```
    * Ornstein-Uhlenbeck (OU) noise process
      ```python
      mu = 0.
      theta = 0.15
      sigma = 0.08
      ```
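Two of the settings above, `TAU` and the OU parameters, are easier to read next to the formulas they feed. The following is a minimal NumPy sketch, not the project's code: plain arrays stand in for network weights, and the OU process is discretised with a unit time step and Gaussian increments, which is an assumption. The names `soft_update` and `OUNoise` are illustrative.

```python
import numpy as np

TAU = 1e-3                           # soft-update factor from the list above
MU, THETA, SIGMA = 0.0, 0.15, 0.08   # OU settings from the list above


def soft_update(local_params, target_params, tau=TAU):
    """Blend local weights into the target: tau * local + (1 - tau) * target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


class OUNoise:
    """Discretised Ornstein-Uhlenbeck process (unit time step, Gaussian increments assumed)."""

    def __init__(self, size, mu=MU, theta=THETA, sigma=SIGMA, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean."""
        self.state = self.mu.copy()

    def sample(self):
        """Pull the state toward mu, add a random increment, and return the new state."""
        drift = self.theta * (self.mu - self.state)
        diffusion = self.sigma * self.rng.standard_normal(self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state
```

With `TAU = 1e-3`, each update moves the target weights only 0.1% of the way toward the local weights, so the targets change slowly. `theta` sets how strongly the noise is pulled back toward `mu`, and `sigma` sets the size of the random increments.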
* ### Accelerating training
    * A check in the step function makes the agent learn only once every 5-10 timesteps; when it is time to learn, the learn function is called several times in a row (say, 4-8 times). This project uses `LEARN_EVERY = 8` and `LEARN_NUM = 4`.
    
    * In `ddpg_agent.py`, learn only if enough samples are available in memory and the timestep is a multiple of `LEARN_EVERY`:
      ```python
      if len(self.memory) > BATCH_SIZE and timestep % LEARN_EVERY == 0:
          for _ in range(LEARN_NUM):
              experiences = self.memory.sample()
              self.learn(experiences, GAMMA)
      ```
    * In `agent.step`, add a `timestep` parameter:
      ```python
      def step(self, state, action, reward, next_state, done, timestep):
          ...
      ```
    * In the main `.ipynb` notebook, pass the timestep counter `t` and increment it:
      ```python
      agent.step(state, action, reward, next_state, done, t)
      t = t + 1
      ```
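To see what this schedule does to the number of updates, here is a small stand-alone sketch. It assumes a hypothetical run in which exactly one experience is stored per timestep; `count_learn_calls` is an illustrative helper, not part of `ddpg_agent.py`.

```python
BATCH_SIZE = 128
LEARN_EVERY = 8
LEARN_NUM = 4


def count_learn_calls(num_timesteps, batch_size=BATCH_SIZE):
    """Count learn() calls, assuming one experience is stored per timestep."""
    memory_len = 0
    calls = 0
    for t in range(num_timesteps):
        memory_len += 1  # the experience is stored before the check
        if memory_len > batch_size and t % LEARN_EVERY == 0:
            calls += LEARN_NUM  # several passes on the same learning step
    return calls


print(count_learn_calls(1000))
```

Under these assumptions, the updates are grouped into bursts of `LEARN_NUM` every `LEARN_EVERY` timesteps, which averages `LEARN_NUM / LEARN_EVERY = 0.5` updates per timestep once the buffer holds more than `BATCH_SIZE` samples.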
   
* ### Neural Network Architecture
    The actor network takes states as input and outputs actions; the critic takes a state-action pair and outputs its estimated Q-value.
    * Actor
      ```
      fc1_units = 256, fc2_units = 128
      BatchNorm1d and ReLU are applied
      The final output is generated through Tanh
      ```
    * Critic
      ```
      fc1_units = 256, fc2_units = 128
      BatchNorm1d and ReLU are applied
      ```
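One way these two networks could look in PyTorch is sketched below. The layer sizes, BatchNorm1d, ReLU and Tanh come from the description above; everything else is an assumption, in particular placing BatchNorm1d after the first layer and feeding the action into the critic's second layer. The state and action sizes depend on the environment and are left as constructor arguments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps a batch of states to actions in [-1, 1]."""

    def __init__(self, state_size, action_size, fc1_units=256, fc2_units=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)  # placement after fc1 is an assumption
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.bn1(self.fc1(state)))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # Tanh bounds each action component


class Critic(nn.Module):
    """Maps a batch of (state, action) pairs to scalar Q-values."""

    def __init__(self, state_size, action_size, fc1_units=256, fc2_units=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)
        # Assumption: the action is concatenated with the first hidden layer.
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        x = F.relu(self.bn1(self.fc1(state)))
        x = torch.cat((x, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # unbounded Q-value estimate
```

For example, with hypothetical sizes `Actor(8, 2)` and `Critic(8, 2)`, a batch of 5 states yields actions of shape `(5, 2)` and Q-values of shape `(5, 1)`.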
* ### Clip the action between -1 and 1
    `return np.clip(action, -1, 1)`
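The clip matters when exploration noise is added on top of the actor's Tanh output, because the sum can leave the valid range. A minimal sketch, assuming the noise is added before clipping (`noisy_action` and the sample values are illustrative):

```python
import numpy as np


def noisy_action(actor_output, noise):
    """Add exploration noise to the actor output, then clip to [-1, 1]."""
    return np.clip(actor_output + noise, -1, 1)


# Made-up values: the first and last components overshoot the range.
action = noisy_action(np.array([0.95, -0.20, -0.99]),
                      np.array([0.10, 0.05, -0.08]))
print(action)
```

Components that would have ended up at 1.05 and -1.07 are brought back to the boundary values 1 and -1, while the in-range component is left unchanged.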

## Plot of Rewards

With the hyperparameters and neural network architecture above, the DDPG agent reaches an average reward (over 100 episodes) of at least `+30` after `187` episodes.

![Episode Solution](images/episode_solution.png)
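The "average over 100 episodes" criterion can be checked with a rolling window over the per-episode scores. The sketch below is illustrative: `first_solved_episode` is a hypothetical helper, and whether the reported `187` counts the episodes inside the window is not stated here, so the function simply returns the episode at which the window average first reaches the target.

```python
from collections import deque


def first_solved_episode(scores, target=30.0, window=100):
    """Return the 1-based episode where the mean of the last `window`
    scores first reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Synthetic scores for illustration only, not the project's results.
synthetic_scores = [0.0] * 50 + [40.0] * 200
print(first_solved_episode(synthetic_scores))
```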

## Ideas for Future Work
* ### Read the benchmark paper comparing deep RL algorithms on continuous control tasks
    * Paper: [Benchmarking Deep Reinforcement Learning for Continuous Control](https://arxiv.org/abs/1604.06778).
    * Implement and compare the algorithms it evaluates: REINFORCE, TNPG, RWR, REPS, TRPO, CEM, CMA-ES and DDPG.