
# Deep Q-Networks (DQN)


In this project, an agent is trained with DQN to navigate a large, square world and collect bananas.

A DQN is a deep neural network that acts as a function approximator: it maps an input state to a vector of action values, one per action. An action is then selected either stochastically or by choosing the one with the maximum value. In our experiment we do the latter.
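The two ways of turning action values into an action can be sketched in a few lines. The example below uses epsilon-greedy selection as one common form of stochastic selection; the function name `select_action`, the parameter `eps`, and the sample values are assumptions made for illustration, not the project's code. With `eps=0` it reduces to always choosing the action with the maximum value.

```python
import random

def select_action(action_values, eps=0.0, rng=random):
    """Pick an action index from a list of action values.

    With probability eps a uniformly random action is returned (exploration);
    otherwise the action with the largest value is returned (exploitation).
    """
    if rng.random() < eps:
        return rng.randrange(len(action_values))
    # Greedy choice: index of the maximum action value
    return max(range(len(action_values)), key=lambda a: action_values[a])

q_values = [0.1, 0.7, -0.2, 0.4]         # hypothetical network output for 4 actions
greedy_action = select_action(q_values)  # eps=0, so this is always the greedy action
```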

When the agent interacts with the environment, the sequence of experience tuples can be highly correlated. The naive Q-learning algorithm that learns from each of these experience tuples in sequential order runs the risk of getting swayed by the effects of this correlation. By instead keeping track of a replay buffer and using experience replay to sample from the buffer at random, we can prevent action values from oscillating or diverging catastrophically. The replay buffer contains a collection of experience tuples *(S, A, R, S')*. The tuples are gradually added to the buffer as we are interacting with the environment.

The act of sampling a small batch of tuples from the replay buffer in order to learn is known as experience replay. In addition to breaking harmful correlations, experience replay allows us to learn from individual tuples multiple times, recall rare occurrences, and in general make better use of our experience.
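A minimal sketch of such a replay buffer, using only the Python standard library, might look as follows. The class and method names are illustrative assumptions rather than the project's actual code, and the tuple holds only *(S, A, R, S')* as described above.

```python
import random
from collections import deque, namedtuple

# One experience tuple (S, A, R, S')
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest tuples are dropped when full
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self):
        # Uniform sampling without replacement breaks the temporal ordering
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(buffer_size=1000, batch_size=4)
for t in range(10):
    buffer.add(state=t, action=t % 4, reward=0.0, next_state=t + 1)
batch = buffer.sample()  # 4 tuples drawn at random from the 10 stored
```

Learning would normally start only once the buffer holds at least `batch_size` tuples, since sampling without replacement needs that many entries.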

In Q-Learning, we update a guess with a guess, and this can potentially lead to harmful correlations. To avoid this, we use two separate networks with identical architectures: a primary Q-Network and a target Q-Network whose weights are updated less often. This approach is called fixed Q-targets.
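In symbols, the idea can be sketched as follows. The parameter names $\theta$ (primary network) and $\theta^-$ (target network) are notation assumed here for illustration, and $\gamma$ is the discount factor. For a non-terminal experience tuple *(S, A, R, S')*, the TD target is computed with the target network, and only the primary network is trained towards it:

$$
y = R + \gamma \max_{a'} Q(S', a'; \theta^-), \qquad L(\theta) = \big(y - Q(S, A; \theta)\big)^2
$$

Because $\theta^-$ changes slowly, the target $y$ stays nearly fixed while $\theta$ is being updated.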


# Model architecture

The network consists of two fully connected hidden layers followed by a fully connected output layer. It maps a state to a vector of action values and uses ReLU as the activation function.

state_size = 37 <br />
action_size = 4<br />

input = 37 # input = state size<br />
fc1 = 64   # number of nodes in first hidden layer <br />
fc2 = 64   # number of nodes in second hidden layer <br />
output = 4 # output = action size
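To make the layer shapes concrete, here is a dependency-free sketch of a forward pass through a 37 → 64 → 64 → 4 network. It is only an illustration: the project itself presumably uses a deep learning framework, the helper names and the random weight initialisation are assumptions, and leaving the output layer without an activation (so that it returns raw action values) is also an assumption.

```python
import random

STATE_SIZE, FC1, FC2, ACTION_SIZE = 37, 64, 64, 4

def make_layer(n_in, n_out, rng):
    """Return (weights, biases) of a fully connected layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def linear(layer, x):
    """Compute W @ x + b for one layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def forward(layers, state):
    h = relu(linear(layers[0], state))  # first hidden layer, 64 units
    h = relu(linear(layers[1], h))      # second hidden layer, 64 units
    return linear(layers[2], h)         # output layer: one value per action

rng = random.Random(0)
layers = [
    make_layer(STATE_SIZE, FC1, rng),
    make_layer(FC1, FC2, rng),
    make_layer(FC2, ACTION_SIZE, rng),
]
state = [rng.random() for _ in range(STATE_SIZE)]  # hypothetical 37-dimensional state
action_values = forward(layers, state)             # list of 4 action values

# Trainable parameters: (37*64 + 64) + (64*64 + 64) + (64*4 + 4) = 6852
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
```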


Hyperparameters <br />

BUFFER_SIZE = int(1e5)  # replay buffer size <br />
BATCH_SIZE = 64         # minibatch size (Initially 64) <br />
GAMMA = 0.995           # discount factor (Initially 0.99) <br />
TAU = 1e-3              # for soft update of target parameters <br />
LR = 5e-4               # learning rate (Initially 5e-4) <br />
UPDATE_EVERY = 4        # how often to update the network <br />
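As a rough illustration of how `TAU` and `UPDATE_EVERY` could interact, the sketch below performs a soft update of the target weights on every fourth environment step. It assumes that `UPDATE_EVERY` counts environment steps between learning steps and that the soft update follows the usual rule `target = TAU * local + (1 - TAU) * target`. The weight lists and loop are hypothetical stand-ins for real network parameters and a real training loop.

```python
TAU = 1e-3
UPDATE_EVERY = 4

def soft_update(local_params, target_params, tau):
    """Blend local weights into target weights: tau * local + (1 - tau) * target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]

local = [1.0, -2.0]   # hypothetical weights of the primary network
target = [0.0, 0.0]   # hypothetical weights of the target network

learn_steps = 0
for t in range(1, 13):         # 12 environment steps
    if t % UPDATE_EVERY == 0:  # learn only on every 4th step
        # (a gradient step on the primary network would happen here)
        target = soft_update(local, target, TAU)
        learn_steps += 1
```

With a small `TAU`, the target weights drift only slightly towards the primary weights at each learning step, which is what keeps the targets nearly fixed.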

# Plot of Rewards

The environment was solved in 374 episodes, reaching an average score of 13.00.
![download.png](attachment:download.png)


# Ideas for Future Work

* Select actions stochastically instead of choosing one with the maximum value.
* Experiment with Double DQN and Dueling DQN.
* Implement DQN with prioritized experience replay.


