## Project 2: Continuous Control

### Training 
For the training session, we construct the **agent** with the parameters listed below
and run the *Deep Deterministic Policy Gradient* procedure **ddpg** as follows:

  agent = **Agent**(state_size=state_size, action_size=action_size, random_seed=8)<br>
  scores = **ddpg**()

### Parameters
We experiment with the following parameters: **_n\_fc1_** (number of neurons in the 1st fully-connected layer) and **_n\_fc2_** (number of neurons in the 2nd fully-connected layer):
 * _n\_fc1_ is set to 64,
 * _n\_fc2_ is set to 64.
 
The other hyperparameters used for solving the environment:
 * GAMMA = 0.99 (discount factor applied to expected future rewards)
 * TAU = 1e-3, i.e. 0.001 (interpolation factor for updating the target model from the local model)
 * BATCH_SIZE = 64 (number of `(state, action, reward, next_state, done)` tuples fed to the networks per learning step)
 * BUFFER_SIZE = 100000 (capacity of the replay buffer)
 * LR_ACTOR = 1e-3 (learning rate for updating the weights of the Actor network during back-propagation)
 * LR_CRITIC = 1e-3 (learning rate for updating the weights of the Critic network during back-propagation)
 * WEIGHT_DECAY = 0 (weight decay for the optimizer)
 * EPSILON = 1.0 (starting value of epsilon for the epsilon-greedy algorithm)
 * EPSILON_DECAY = 1e-6 (decay applied to epsilon)
 * LEARNING_PERIOD = 20 (learning frequency)
 * UPDATE_FACTOR = 10 (how much to learn)
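
As a rough illustration of how LEARNING_PERIOD, UPDATE_FACTOR and EPSILON_DECAY might interact, the sketch below counts learning passes over a run of environment steps. It assumes that the agent learns every LEARNING_PERIOD steps, performs UPDATE_FACTOR passes each time, and subtracts EPSILON_DECAY from epsilon once per pass. The helper name `simulate_schedule` and the linear decay rule are assumptions made for illustration; they are not taken from the project code.

```python
LEARNING_PERIOD = 20   # assumed meaning: learn once every 20 environment steps
UPDATE_FACTOR = 10     # assumed meaning: number of learning passes per learning step
EPSILON = 1.0          # starting value of epsilon
EPSILON_DECAY = 1e-6   # assumed to be subtracted once per learning pass


def simulate_schedule(total_steps):
    """Return (learning_passes, final_epsilon) after `total_steps` steps."""
    passes = 0
    epsilon = EPSILON
    for t in range(1, total_steps + 1):
        if t % LEARNING_PERIOD == 0:
            for _ in range(UPDATE_FACTOR):
                passes += 1
                # Hypothetical linear decay, clipped at zero.
                epsilon = max(0.0, epsilon - EPSILON_DECAY)
    return passes, epsilon


passes, eps = simulate_schedule(1000)
print(passes, round(eps, 6))
```

Under these assumptions, 1000 environment steps trigger 500 learning passes while epsilon moves only from 1.0 to about 0.9995.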


### Model Networks (Policy and Value)

Both the policy and the value network are simple
neural networks with 3 fully-connected layers and 2
rectified linear (ReLU) activations, implemented in
**PyTorch**. The input and output sizes of the fully-connected layers are
as follows:

For the policy network:
 * Fully-Connected Layer 1, size (input x output): _state_size_ x _n\_fc1_,
 * Fully-Connected Layer 2, size (input x output): _n\_fc1_ x _n\_fc2_,
 * Fully-Connected Layer 3, size (input x output): _n\_fc2_ x _action_size_.

For the value network:
 * Fully-Connected Layer 1, size (input x output): _state_size_ x _n\_fc1_,
 * Fully-Connected Layer 2, size (input x output): _(n\_fc1+action\_size)_ x _n\_fc2_,
 * Fully-Connected Layer 3, size (input x output): _n\_fc2_ x 1.
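
To make the layer sizes concrete, the following framework-free sketch lists the (input, output) shape of each layer and counts the trainable parameters. The values `state_size=33` and `action_size=4` are placeholders chosen for illustration, since the source does not state them, and the count assumes every layer has a bias term. The helper names are hypothetical.

```python
def layer_shapes(state_size, action_size, n_fc1=64, n_fc2=64):
    """Return (input, output) sizes of the three layers of each network."""
    actor = [(state_size, n_fc1), (n_fc1, n_fc2), (n_fc2, action_size)]
    # The critic's second layer also receives the action as input.
    critic = [(state_size, n_fc1), (n_fc1 + action_size, n_fc2), (n_fc2, 1)]
    return actor, critic


def count_parameters(shapes):
    """Weights plus biases, assuming each layer has one bias per output."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


# Placeholder sizes; replace with the values reported by the environment.
actor, critic = layer_shapes(state_size=33, action_size=4)
print("actor :", actor, count_parameters(actor))
print("critic:", critic, count_parameters(critic))
```

With these placeholder sizes the policy network has 6596 parameters and the value network has 6657.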

### Learning Algorithm

#### Deep Deterministic Policy Gradient algorithm

DDPG is an algorithm which concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy. This dual mechanism is the actor-critic method. The DDPG algorithm uses two additional mechanisms: Replay Buffer and Soft Updates.
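
The soft update can be sketched without any framework. The example below assumes the usual DDPG rule, in which each target parameter is moved a small fraction TAU towards the corresponding local parameter. Plain Python lists stand in for network weights, and the function name `soft_update` is an assumption rather than a quote from the project code.

```python
TAU = 1e-3  # interpolation factor from the hyperparameter list


def soft_update(local_params, target_params, tau=TAU):
    """Blend parameters: target <- tau * local + (1 - tau) * target."""
    return [tau * l + (1.0 - tau) * t
            for l, t in zip(local_params, target_params)]


local = [1.0, -2.0]    # stand-in for the local network weights
target = [0.0, 0.0]    # stand-in for the target network weights
target = soft_update(local, target)
print(target)
```

With a small TAU the target network changes slowly, which is commonly cited as a way to keep the learning targets stable.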

#### Actor-Critic dual mechanism

For each timestep _t,_ we do the following operations:

Let ***S*** be the current state. It is the input to the _Actor NN_, whose output is the policy

![](images/policy_pi.png)

where _π_ is the policy function, i.e., the distribution over actions. The _Critic NN_ gets the state ***S*** as input and outputs
the state-value function ***v(S, w)***, that is, the _expected total reward_ for the agent starting from state ***S***. Here, _θ_ is
the parameter vector of the _Actor NN_ and _w_ is the parameter vector of the _Critic NN_. The task is to train both networks, i.e.,
to find the optimal values for _θ_ and _w_. Following the policy _π_ we get the action _A_; from the environment we get the reward _R_
and the next state ***S'***. Then we compute the _TD estimate_:
 
![](images/TD_estimate.png)
		 
Next, we use the _Critic_ to calculate the _advantage function_ _A(s, a)_:

![](images/calc_advantage.png)
				 
Here, _γ_ is the _discount factor_. The parameter _θ_ is updated by gradient ascent as follows:

![](images/update_theta.png)

The parameter _w_ is updated as follows:

![](images/update_w.png)
		
Here, _α_ (resp. _β_) is the learning rate for the _Actor NN_ (resp. _Critic NN_). Before moving to the next timestep, we update the state _S_ and scale the operator _I_ by the _discount factor_ _γ_:

![](images/next_state.png)

At the start of the algorithm the operator _I_ should be initialized to the identity operator.

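
The per-timestep quantities can be traced with a small numeric sketch. It assumes a linear critic, v(S, w) = w · x(S), so that the gradient with respect to _w_ is simply the feature vector x(S). It also treats _I_ as a scalar that starts at 1 and applies the critic update without the factor _I_. These choices, the feature vectors, and the learning rate `BETA` are assumptions made for illustration and may differ from the formulas in the images. The actor update is omitted because it needs the policy gradient.

```python
GAMMA = 0.99  # discount factor from the hyperparameter list
BETA = 0.1    # hypothetical critic learning rate, chosen for readability


def value(features, w):
    """Linear state-value estimate v(S, w) = w . x(S)."""
    return sum(f * wi for f, wi in zip(features, w))


def critic_step(w, x_s, x_next, reward, done, discount_i):
    """One timestep: TD estimate, advantage, critic update, and I <- gamma * I."""
    bootstrap = 0.0 if done else GAMMA * value(x_next, w)
    td_target = reward + bootstrap
    advantage = td_target - value(x_s, w)
    # For a linear critic the gradient of v(S, w) w.r.t. w is x(S).
    new_w = [wi + BETA * advantage * f for wi, f in zip(w, x_s)]
    return new_w, td_target, advantage, GAMMA * discount_i


w, td_target, advantage, discount_i = critic_step(
    w=[0.5, 1.0], x_s=[1.0, 0.0], x_next=[0.0, 1.0],
    reward=0.1, done=False, discount_i=1.0)
print(w, td_target, advantage, discount_i)
```

Here the TD estimate is about 1.09 and the advantage about 0.59, so the first weight moves from 0.5 to roughly 0.559 while _I_ shrinks from 1 to 0.99.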
#### Goal of Agent
The environment for this project involves controlling a double-jointed arm to reach target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of the agent is to maintain its position at the target location for as many time steps as possible.


#### Mechanism of Agent
* The agent uses 4 networks: the Actor and the Critic each have a local and a target network, as follows.
    * Actor
```python
self.actor_local = Actor(state_size, action_size, random_seed).to(device)
self.actor_target = Actor(state_size, action_size, random_seed).to(device)
```
    * Critic
```python
self.critic_local = Critic(state_size, action_size, random_seed).to(device)
self.critic_target = Critic(state_size, action_size, random_seed).to(device)
```
* Replay memory (storing `Experience` named tuples)
```python
self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
e = self.experience(state, action, reward, next_state, done)
self.memory.append(e)
```
* Update Critic Network
```python
# Get predicted next-state actions and Q values from target models
actions_next = self.actor_target(next_states)
Q_targets_next = self.critic_target(next_states, actions_next)
# Compute Q targets for current states (y_i)
Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))
# Compute critic loss
Q_expected = self.critic_local(states, actions)
critic_loss = F.mse_loss(Q_expected, Q_targets)
# Minimize the loss
self.critic_optimizer.zero_grad()
critic_loss.backward()
torch.nn.utils.clip_grad_norm_(self.critic_local.parameters(), 1)
self.critic_optimizer.step()
```
* Update Actor
```python
# Compute actor loss
actions_pred = self.actor_local(states)
actor_loss = -self.critic_local(states, actions_pred).mean()
# Minimize the loss
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()
```
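
The replay memory snippet above can be wrapped into a small self-contained class. The following is a minimal sketch using only the standard library. The class name `ReplayBuffer`, its method names, and the use of a seeded `random.Random` instance are assumptions, and the conversion of sampled tuples to tensors is left out.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped when full."""

    def __init__(self, buffer_size=100000, batch_size=64, seed=8):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a batch uniformly at random, without replacement."""
        return self.rng.sample(list(self.memory), k=self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(buffer_size=5, batch_size=3)
for step in range(8):
    buffer.add(state=step, action=0.0, reward=0.1, next_state=step + 1, done=False)
print(len(buffer), [e.state for e in buffer.sample()])
```

Sampling uniformly from a large buffer breaks the correlation between consecutive timesteps, which is the usual motivation for a replay buffer in off-policy methods such as DDPG.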

### Result Graph
```python
agent = Agent(state_size=state_size, action_size=action_size, random_seed=8)
scores = ddpg()
```
\*** Episode 261	Average Score: 30.11, Time: 02:19:15 *** <br>
Environment solved !
![Result Graph](plot.png)

With the parameters given above, the environment was solved in 261 episodes.<br>
This result could be improved by applying the future ideas below.

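
The "Environment solved" message can be understood as a check on a moving average of episode scores. The sketch below assumes a target average of 30 over a window of the last 100 episodes, and requires the window to be full before declaring success. The window length, the full-window requirement, and the function name `first_solved_episode` are assumptions, since the source only shows the final log line.

```python
from collections import deque


def first_solved_episode(scores, target=30.0, window=100):
    """Return the 1-based episode at which the moving average first
    reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Synthetic scores for illustration only: 50 poor episodes, then 200 good ones.
synthetic_scores = [0] * 50 + [40] * 200
print(first_solved_episode(synthetic_scores))
```

For these synthetic scores the moving average first reaches 30 at episode 125, when 75 of the last 100 episodes scored 40.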
### Future Ideas
1. Adding one or more layers to the networks may improve performance.
2. Changing the number of neurons in each layer may also improve how well the networks learn the mapping from states to actions.
3. Using a different epsilon decay or a different starting epsilon may be an improvement.