Introduction
--------------------

This report walks through three topics in detail: *Learning Algorithm*, *Plot of Rewards*, and *Ideas for Future Work*. It describes the learning algorithm, the model architecture, and the chosen hyperparameters. It then discusses the plot of rewards, which illustrates the agent's performance as the average reward received (over 100 episodes). The challenge is to solve the environment by achieving an average reward of +30.

Content
---------------
1. Learning Algorithm
    - 1.1 DDPG Agent
    - 1.2 Model Architecture
    - 1.3 Learning Parameters
    - 1.4 Algorithm to train DDPG
2. Plot of Rewards
3. Ideas for Future Work
    - 3.1 Model Architecture
    - 3.2 Modify the Number of Agents
    - 3.3 Modify the Learning Agent


1 Learning Algorithm
-------------------------------
This section first discusses the agent, then the model architecture and the training algorithm.

### 1.1 DDPG Agent

This section describes the Deep Deterministic Policy Gradient (DDPG) agent. The key feature of the algorithm is its use of an actor network and a critic network, which together play a role similar to the Q-network in DQN. We first initialise a local and a target network for both the actor and the critic. As in DQN, the target networks serve as a stable anchor when evaluating the `t+1` states.

DDPG is built around the actor-critic idea, in which the critic's value estimates serve as a baseline that gives feedback on the performance of the policy. Each time the actor network is queried, it is expected to return the best action, so it can be viewed as learning $\arg\max_a Q(s,a)$. The critic network, in turn, estimates $Q(s, a^*(s, \theta_{a^*}); \theta_Q)$, the value of that best action. The best action and the value for the next state are generated by the target actor and target critic networks respectively. This lets us compute the critic loss by comparing that target against the value the `local critic` network produces for the current state. Unlike most actor-critic algorithms, where the actor's loss is computed from the log probability of the policy, the actor loss in DDPG is computed from the output of the local critic network. For instance:
```python
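# Actions the local actor proposes for the sampled states
action_expected = self.actor_local(states)
# Actor loss: negated mean Q-value from the local critic, so that
# minimising the loss maximises the critic's estimate
actor_loss = -self.critic_local(states, action_expected).mean()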
```
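
The critic loss described above can also be written out explicitly. This is a sketch that assumes the standard DDPG formulation with a mean-squared error over a minibatch of $N$ transitions $(s_i, a_i, r_i, s'_i, d_i)$. The superscript $-$ marking target-network parameters and the episode-end flag $d_i$ are notation introduced here for illustration; they do not appear in the report.

$$
y_i = r_i + \gamma\,(1 - d_i)\,Q\big(s'_i,\ a^*(s'_i, \theta_{a^*}^{-});\ \theta_Q^{-}\big)
$$

$$
L(\theta_Q) = \frac{1}{N}\sum_{i=1}^{N}\big(Q(s_i, a_i; \theta_Q) - y_i\big)^2
$$

Here $y_i$ is the target built from the target actor and target critic, and the second term inside the square is the local critic's estimate for the current state and the action actually taken.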

As in DQN, the target networks for both the actor and the critic are updated via a soft update. In this work, the update is applied directly at every time step.
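
The soft update can be illustrated without any deep-learning framework. The sketch below uses plain Python lists as hypothetical stand-ins for network weights, and it assumes the target starts as a copy of the local weights. The function name `soft_update` and the example values are illustrative, not taken from the project code.

```python
import copy


def soft_update(local_params, target_params, tau):
    """Blend the local weights into the target weights in place.

    target <- tau * local + (1 - tau) * target
    """
    for i, (local_w, target_w) in enumerate(zip(local_params, target_params)):
        target_params[i] = tau * local_w + (1.0 - tau) * target_w


# Hypothetical flat weight vectors standing in for network parameters.
local = [0.5, -1.0, 2.0]
target = copy.copy(local)   # assumption: target starts equal to local

local = [1.5, 0.0, 2.0]     # pretend one gradient step changed the local net
soft_update(local, target, tau=1e-3)
print(target)               # target has moved only slightly towards local
```

With a small `tau`, the target weights trail the local weights slowly, which is what keeps the learning target stable.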



### 1.2 Model Architecture

The model is built from fully connected (FC) linear layers, three for each of the actor and critic networks. In the critic network, however, the second FC layer also takes in the action matrix by concatenation, so the critic is not a plain stack of FC layers like the actor. The final layer of the actor network outputs one value per action dimension, while the final layer of the critic network outputs a single value for the given state and action. Both networks are initialised with the same seed, which yields the same random weights.
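
The layer shapes implied by this description can be tabulated with a short framework-free sketch. It assumes that the action is concatenated onto the output of the critic's first layer, and it uses the hidden sizes `(400, 300)` from the hyperparameter list. The state and action sizes (33 and 4) are placeholder values chosen for illustration; the report does not state them.

```python
def actor_shapes(state_size, action_size, hidden=(400, 300)):
    """(in_features, out_features) for each of the three actor FC layers."""
    h1, h2 = hidden
    return [(state_size, h1), (h1, h2), (h2, action_size)]


def critic_shapes(state_size, action_size, hidden=(400, 300)):
    """Same for the critic; the action is concatenated at the second layer."""
    h1, h2 = hidden
    return [(state_size, h1), (h1 + action_size, h2), (h2, 1)]


def count_parameters(shapes):
    """Weights plus one bias per output unit, summed over all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


# Assumed sizes for illustration only; the report does not state them.
STATE_SIZE, ACTION_SIZE = 33, 4
print(actor_shapes(STATE_SIZE, ACTION_SIZE))
print(critic_shapes(STATE_SIZE, ACTION_SIZE))
print(count_parameters(critic_shapes(STATE_SIZE, ACTION_SIZE)))
```

The only structural difference between the two networks is the widened input of the critic's second layer and its single output unit.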

The swish activation function is used here. Swish is a smooth, non-monotonic function defined simply as $\mathrm{swish}(x) = x\,\sigma(x)$, where $\sigma$ is the logistic sigmoid. `ReLU` has the drawback that every negative input $x$, roughly half of all inputs, results in a gradient of 0. Earlier alternatives such as LeakyReLU and SELU do pass a gradient for negative inputs, but neither is smooth at zero. Since the actions are designed to be clipped between -1 and 1, it makes sense to use an activation that responds to negative inputs. Hence, swish is used in the model instead.
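
To make the gradient argument concrete, the following sketch evaluates swish and its derivative next to the ReLU derivative at a few points. It uses only the standard library; the helper names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    """swish(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


def swish_grad(x):
    """Derivative of swish: sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0


# Columns: x, swish(x), swish'(x), relu'(x)
for x in (-2.0, -0.5, 0.0, 1.0):
    print(x, round(swish(x), 4), round(swish_grad(x), 4), relu_grad(x))
```

For the negative inputs shown, the ReLU gradient is exactly zero while the swish gradient is not, so negative pre-activations still contribute to learning.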

### 1.3 Learning Parameters

Hyperparameters play a crucial role in optimising the network. With a good set of hyperparameters, the model trains more easily and converges to a high reward more quickly. Many variables could be tuned in this work; the values listed below are the defaults used to generate the reward curve shown in the *Plot of Rewards* section.

_*List of Hyperparameters*_

1. Seed of model: 0 (initialisation value for the weights)
2. Hidden layer size: (400, 300) (this can directly affect the type of representations learned)
3. Batch size: 128 (the number of experiences sampled in one learning update)
4. Buffer size: 1e5 (storage size for the latest experiences)
5. Gamma: 0.99 (discount applied to the next-state contribution to the target Q-value)
6. Tau: 1e-3 (interpolation factor for the soft update of the target network)
7. Learning rate for actor: 1e-3 (tunes the sensitivity and impact of backpropagation)
8. Learning rate for critic: 1e-3
9. Maximum timesteps per episode: 1000 (the amount of experience in one episode)
10. Decay noise: (0.99, 0.9999) (the decay factor of the OU noise added to the action, and the per-timestep rate at which that decay factor changes)
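
One convenient way to keep these defaults in a single place is a small configuration object. The sketch below is an assumption about how they could be organised; the class name `DDPGConfig` and its field names are hypothetical and are not taken from the project code.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class DDPGConfig:
    """Default hyperparameters from the list above (field names are illustrative)."""
    seed: int = 0
    hidden_sizes: Tuple[int, int] = (400, 300)
    batch_size: int = 128
    buffer_size: int = int(1e5)
    gamma: float = 0.99
    tau: float = 1e-3
    lr_actor: float = 1e-3
    lr_critic: float = 1e-3
    max_timesteps: int = 1000
    noise_decay: Tuple[float, float] = (0.99, 0.9999)


config = DDPGConfig()

# Override a single value for an experiment without touching the defaults.
small_batch = replace(config, batch_size=64)
print(config.batch_size, small_batch.batch_size)
```

Freezing the dataclass prevents accidental in-place changes, so every experiment is defined by an explicit override.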


### 1.4 Algorithm to train DDPG
Given the deep model discussed above, we need an algorithm that repeatedly (1) observes the current environment state and then, for a range of time steps, (2) has the agent act on the given state, (3) receives feedback in the form of new states and rewards, (4) updates the agent, and (5) checks whether the environment is solved. These steps are a fairly general approach to training a DDPG agent.
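
The five steps can be sketched as a generic training loop. The `env` and `agent` interfaces below (`reset`, `step`, `act`) are hypothetical and chosen for illustration, and `ToyEnv` and `ToyAgent` are stand-ins that only make the sketch runnable; none of these names come from the project code.

```python
from collections import deque


def train(env, agent, n_episodes=500, max_t=1000, target_score=30.0, window=100):
    """Run episodes until the mean score over `window` episodes reaches the target."""
    scores = []
    recent = deque(maxlen=window)
    for _ in range(n_episodes):
        state = env.reset()                                      # (1) observe
        score = 0.0
        for _ in range(max_t):
            action = agent.act(state)                            # (2) act
            next_state, reward, done = env.step(action)          # (3) feedback
            agent.step(state, action, reward, next_state, done)  # (4) update
            state = next_state
            score += reward
            if done:
                break
        scores.append(score)
        recent.append(score)
        # (5) solved once a full window averages at least the target score
        if len(recent) == window and sum(recent) / window >= target_score:
            break
    return scores


class ToyEnv:
    """Stand-in environment: reward 1.0 per step, episode ends after 40 steps."""

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, self.t >= 40


class ToyAgent:
    """Stand-in agent that always outputs 0.0 and does not learn."""

    def act(self, state):
        return 0.0

    def step(self, state, action, reward, next_state, done):
        pass


scores = train(ToyEnv(), ToyAgent())
print(len(scores))
```

Because the toy environment always yields a score of 40, the loop stops as soon as the first full 100-episode window is available.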

Overall, the main bottlenecks in optimising this algorithm to reach an average score of ~30 lie in (i) the number of timesteps, (ii) the model architecture, and (iii) the balance between exploration and exploitation, i.e. tuning the OU noise. Other parameters also affect learning, but these three have the strongest effect on the mean score.
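
For readers unfamiliar with OU noise, the following is a minimal sketch of an Ornstein-Uhlenbeck process whose output scale shrinks over time. The parameters `theta`, `sigma`, and the multiplicative `decay` rule are common illustrative choices and are assumptions here; the sketch does not reproduce the project's own noise schedule.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process with a decaying output scale (illustrative)."""

    def __init__(self, size, seed=0, mu=0.0, theta=0.15, sigma=0.2, decay=0.99):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.decay = decay
        self.scale = 1.0
        self.rng = random.Random(seed)
        self.state = [mu] * size

    def sample(self):
        """Advance the process one step and return the scaled noise."""
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        noise = [self.scale * x for x in self.state]
        self.scale *= self.decay  # less exploration as training proceeds
        return noise


def noisy_action(action, noise):
    """Add exploration noise, then clip to the valid action range [-1, 1]."""
    return [max(-1.0, min(1.0, a + n)) for a, n in zip(action, noise)]


ou = OUNoise(size=4)
first = ou.sample()
print(noisy_action([0.9, -0.9, 0.0, 0.0], first))
```

A slower decay keeps the agent exploring for longer, while a faster decay commits to the learned policy earlier; this is the trade-off referred to in point (iii).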



2 Plot of Rewards
------------------------
![Image](https://github.com/Wachn/Continuous-control_/blob/main/plots/1agent-DDPG_reacher.png?raw=true)

In the plot shown above, the average reward varies greatly, but towards 150 episodes it approaches ~30.
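
The averaged curve in such a plot can be computed from the per-episode scores with a trailing window. The sketch below assumes a 100-episode window, matching the solve criterion; the function name is illustrative.

```python
def moving_average(scores, window=100):
    """Trailing mean of `scores`; early entries average over fewer episodes."""
    averages = []
    running = 0.0
    for i, score in enumerate(scores):
        running += score
        if i >= window:
            running -= scores[i - window]  # drop the score that left the window
        averages.append(running / min(i + 1, window))
    return averages


# Small example with a window of 2 to keep the numbers easy to check by hand.
print(moving_average([1, 2, 3, 4], window=2))
```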


3 Ideas for Future Work
------------------------
### 3.1 Model Architecture
Although no graphs are included, this work found that tuning the model changes its performance. For instance, tuning the feature dimension of the FC layers in the deep model gives varying results. A better architecture could therefore likely be found to improve the network as a whole.

### 3.2 Modify the Number of Agents
By modifying the number of agents we can achieve better performance, as seen in the figure below. Compared with the previous result, where one agent is used, the average over 20 agents produces a smoother curve with smaller variance. Furthermore, it learns quickly in the early exploration stage before moving out of a "local maximum" and suffering a dip in performance. Nonetheless, towards 100 episodes the model becomes better optimised and recovers to a score of ~35.
![Image](https://github.com/Wachn/Continuous-control_/blob/main/plots/20agents-DDPG_reacher.png?raw=true)

### 3.3 Modify the Learning Agent
In the subdirectory `/RL/` there is an A2C algorithm. While it might not be the best choice, the idea is that by varying the learning algorithm we may achieve even better performance.
