## Project 1 Navigation Report

### Training 
For the training session, we construct the **agent** with the parameters below
and run the *Deep Q-Network* procedure **dqn** as follows:

    agent = Agent(state_size=37, action_size=4, seed=1, n_fc1=64, n_fc2=64)
    scores, episodes = dqn(n_episodes=1000, eps_start=epsilon_start)

### Parameters
We experimented with the following parameters: **_n\_fc1_** (number of neurons in the 1st fully-connected layer), **_n\_fc2_** (number of neurons in the 2nd fully-connected layer), and **_eps\_start_** (starting value of epsilon for the epsilon-greedy algorithm).
For the training session,
 * _eps\_start_ is set to 0.99, with step 0.001,
 * _n\_fc1_ is set to 64,
 * _n\_fc2_ is set to 64.
 
The other hyperparameters used for solving the environment:
 * GAMMA -> 0.99 (discount factor applied to expected future rewards)
 * TAU -> 1e-3, i.e. 0.001 (factor used when updating the target model, called store here, with the local model, called learn here)
 * LR -> 5e-4, i.e. 0.0005 (learning rate for updating the network weights during back-propagation)
 * BATCH_SIZE -> 64 (number of experience tuples (state, action, reward, next_state, done) fed to the network in each learning step)
 * BUFFER_SIZE -> 100000 (capacity of the replay buffer)
 * UPDATE_EVERY -> 4 (number of steps between updates of the target model with the local model)
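
To illustrate the role of TAU, here is a minimal sketch that assumes the target (store) model is moved toward the local (learn) model with the common soft-update rule `target = TAU * local + (1 - TAU) * target`. The report does not show the update code, so the rule itself, the function name `soft_update`, and the plain-list weights are assumptions made for illustration only.

```python
TAU = 1e-3  # value from the hyperparameter list above

def soft_update(local_params, target_params, tau=TAU):
    """Return target parameters moved a fraction tau toward the local ones."""
    return [tau * l + (1.0 - tau) * t
            for l, t in zip(local_params, target_params)]

# Hypothetical weights, stored as plain lists to keep the sketch dependency-free
local = [1.0, -2.0, 0.5]   # local (learn) network
target = [0.0, 0.0, 0.0]   # target (store) network

target = soft_update(local, target)
print(target)  # each target weight moved 0.1% of the way toward the local weight
```

With a small TAU the target model changes slowly, which is the usual motivation for keeping two networks.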


### Model Q-Network

Both Q-Networks (learn and store) are implemented by the class
**QNetwork**. This class implements a simple
neural network with 3 fully-connected layers and 2
rectified linear (ReLU) activation layers. **QNetwork** is built with
the **PyTorch** package. The input and output sizes of the fully-connected layers are
as follows:

 * Fully-Connected Layer 1, inputs x outputs: _state\_size_ x _n\_fc1_,
 * Fully-Connected Layer 2, inputs x outputs: _n\_fc1_ x _n\_fc2_,
 * Fully-Connected Layer 3, inputs x outputs: _n\_fc2_ x _action\_size_,
 
where _state\_size_ = 37, _action\_size_ = 4, and _n\_fc1_ and _n\_fc2_
are the input parameters given above.
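
As a quick sanity check on the layer sizes, the sketch below counts the trainable parameters of such a network. It takes `action_size = 4` from the `Agent` constructor call and assumes every fully-connected layer has a bias term (the PyTorch default for `nn.Linear`); the helper `count_params` is a hypothetical name, not part of the project code.

```python
state_size, action_size = 37, 4   # action_size as passed to Agent(...)
n_fc1, n_fc2 = 64, 64

# (inputs, outputs) of each fully-connected layer
layer_shapes = [(state_size, n_fc1), (n_fc1, n_fc2), (n_fc2, action_size)]

def count_params(shapes):
    """Weights plus one bias per output unit, summed over all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)

for i, (n_in, n_out) in enumerate(layer_shapes, start=1):
    print(f"fc{i}: {n_in} x {n_out} weights + {n_out} biases")

total_params = count_params(layer_shapes)
print("total trainable parameters:", total_params)  # 2432 + 4160 + 260 = 6852
```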

### Learning Algorithm

#### Deep-Q-Network algorithm

The _Deep-Q-Network_ procedure **dqn** runs a **double loop**.
The external loop (over _episodes_) runs until the number of episodes reaches the maximum **number
of episodes** = _1000_ or the _completion criterion_ is met.
For the completion criterion, we check

  _np.mean(scores_window) >= 13_, where scores_window is a queue holding the scores of the last 100 episodes.
  
In the internal loop, **dqn** gets the current _action_ from the **agent**.
With this _action_, **dqn** gets the next _state_ and the _reward_ from the Unity environment.
Then the **agent** receives the parameters _state, action, reward, next_state, done_
for the next training step. The variable _score_ accumulates the obtained rewards.
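
The double loop described above can be sketched as follows. `ToyEnv` and `ToyAgent` are hypothetical stand-ins for the Unity environment and the real agent (the toy agent learns nothing and simply treats action 0 as the "greedy" choice), so only the loop structure, the score window, and the completion check mirror the report.

```python
import random
from collections import deque

class ToyEnv:
    """Hypothetical stand-in for the Unity environment: fixed-length episodes."""
    def __init__(self, episode_len=20):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy state

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 0 else 0.0  # only action 0 is rewarded
        return self.t, reward, self.t >= self.episode_len  # next_state, reward, done

class ToyAgent:
    """Hypothetical agent: epsilon-greedy around a fixed 'greedy' action 0."""
    def __init__(self, action_size=4, seed=0):
        self.action_size = action_size
        self.rng = random.Random(seed)

    def act(self, state, eps):
        if self.rng.random() > eps:
            return 0
        return self.rng.randrange(self.action_size)

    def step(self, state, action, reward, next_state, done):
        pass  # a real agent would store the experience and learn here

def dqn(env, agent, n_episodes=1000, eps_start=0.99, eps_end=0.01,
        eps_decay=0.996, target_score=13.0):
    scores, scores_window = [], deque(maxlen=100)
    eps = eps_start
    for episode in range(1, n_episodes + 1):       # external loop over episodes
        state, score, done = env.reset(), 0.0, False
        while not done:                             # internal loop over steps
            action = agent.act(state, eps)
            next_state, reward, done = env.step(action)
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
        scores.append(score)
        scores_window.append(score)
        eps = max(eps_end, eps_decay * eps)
        if sum(scores_window) / len(scores_window) >= target_score:
            break                                   # completion criterion met
    return scores, episode

scores, episodes = dqn(ToyEnv(), ToyAgent())
print(f"stopped after {episodes} episodes")
```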

#### Mechanism of Agent
* Two Q-Networks (learn and store), each using the simple neural network described in the Model Q-Network section above.
```python
self.qnet_learn = QNetwork(state_size, action_size, seed, n_fc1, n_fc2).to(device)
self.qnet_store = QNetwork(state_size, action_size, seed, n_fc1, n_fc2).to(device)
```
* Replay memory (using the class Experience)
```python
self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
e = self.experience(state, action, reward, next_state, done)
self.memory.append(e)
```
* Epsilon-greedy mechanism that selects a random action with probability eps.
```python
if random.random() > eps:
    return np.argmax(action_values.cpu().data.numpy())
else:
    return random.choice(np.arange(self.action_size))
```
* Epsilon decays after each episode, but never drops below eps_end.
```python
eps = max(eps_end,eps_decay*eps)
```
* Q-learning, i.e., using the maximum Q-value over all possible actions
* Computing the loss with the MSE loss function
```python
loss = F.mse_loss(Q_expected, Q_targets)
```
* Minimizing the loss by gradient descent using the Adam optimizer.
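
The Q-learning target and the MSE loss from the list above can be illustrated with a small dependency-free sketch. It assumes the standard target `r + GAMMA * max_a Q_store(s', a)` with no bootstrapping on terminal transitions; the mini-batch values and the helper names `td_targets` and `mse_loss` are hypothetical, and the real code works on PyTorch tensors instead of lists.

```python
GAMMA = 0.99  # discount factor from the hyperparameter list

def td_targets(rewards, next_q_values, dones, gamma=GAMMA):
    """r + gamma * max_a Q(s', a); the bootstrap term is dropped when done == 1."""
    return [r + gamma * max(q_next) * (1.0 - d)
            for r, q_next, d in zip(rewards, next_q_values, dones)]

def mse_loss(expected, targets):
    """Mean of squared differences, matching the default 'mean' reduction."""
    return sum((e - t) ** 2 for e, t in zip(expected, targets)) / len(expected)

# Hypothetical mini-batch of two transitions
rewards = [1.0, 0.0]
next_q_values = [[0.5, 2.0, 1.0, 0.0],   # store-network Q-values for the 4 actions
                 [0.3, 0.1, 0.2, 0.4]]
dones = [0.0, 1.0]                        # the second transition ends the episode
q_expected = [2.5, 0.5]                   # learn-network Q-values of the taken actions

q_targets = td_targets(rewards, next_q_values, dones)
loss = mse_loss(q_expected, q_targets)
print(q_targets, loss)
```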

### Result Graph
```python
agent = Agent(state_size=37, action_size=4, seed=1)
scores, ep = dqn(n_episodes=1000, eps_start=.99, eps_end=0.01, eps_decay = .996)
```
Episode: 587, elapsed: 0:07:44.751907, Avg.Score: 13.02,  score 15.0, How many scores >= 13: 60, eps.: 0.09<br>
 terminating at episode : 587 ave reward reached +13 over 100 episodes
![Result Graph](plot.png)
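
The epsilon value in the log line can be cross-checked against the decay schedule. The sketch below assumes epsilon is decayed exactly once per episode with `eps = max(eps_end, eps_decay * eps)`, as in the agent mechanism; `eps_after` is a hypothetical helper written for this check.

```python
def eps_after(n_episodes, eps_start=0.99, eps_end=0.01, eps_decay=0.996):
    """Epsilon after n_episodes applications of eps = max(eps_end, eps_decay * eps)."""
    eps = eps_start
    for _ in range(n_episodes):
        eps = max(eps_end, eps_decay * eps)
    return eps

eps_587 = eps_after(587)
print(f"{eps_587:.2f}")  # 0.09, in line with the logged value at episode 587
```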

With the parameters given above, the environment was solved in 587 episodes, i.e. in fewer than 600.<br>
This may be improved further by applying the Future Ideas given below.
### Future Ideas
1. Adding one or more layers to the network may improve performance.
2. Changing the number of neurons in each layer may also improve how well the network learns the mapping between states and actions.
3. Using a different epsilon decay rate or a different starting epsilon may be an improvement.