# Deep Q-Networks
---

- In RL an agent outputs an action, and the environment returns an observation (state of the system) and a reward
- Through interaction the agent learns the best action to take in each state (Q-table) based on long-term expected reward
- A Q-table is not feasible for large, continuous, or non-linear environments
- Deep RL uses non-linear function approximators to calculate action values directly from environment feedback
- These approximators are represented as Deep Neural Networks (DNNs)
- Deep Learning is used to find the optimal parameters
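
For reference, the tabular case that a DQN replaces can be sketched as a single Q-learning update. This is a minimal sketch, not taken from the linked paper: the function name `q_learning_update`, the `(state, action)` dictionary keys, and the `alpha`/`gamma` defaults are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, n_actions,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Best value currently estimated for the next state.
    best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    # Move the old estimate a fraction alpha towards the TD target.
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


q_table = defaultdict(float)  # unseen (state, action) pairs default to 0.0
new_value = q_learning_update(q_table, state=0, action=1, reward=1.0,
                              next_state=1, n_actions=2)
```

The table needs one entry per (state, action) pair, which is what becomes infeasible once states are images or continuous vectors.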

[Issues in Using Function Approximation for Reinforcement Learning](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.3097&rep=rep1&type=pdf)   

## DeepMind's DQN
[Human-level control through deep reinforcement learning](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf)   
- pass in video game images
- outputs a vector of action values with max value being the action to take
- game score at each time step was the reinforcement signal 
- converted frames to grayscale and downscaled them to 84 x 84
- stacked 4 frames together so the state space size was 84 x 84 x 4 (gives the network access to sequential information such as motion)
- produces a Q-value for each action in a single forward pass; to exploit, take the action with the max value
- neural network architecture
    - convolutional layers exploit spatial relationships in the image and, because frames are stacked, temporal properties across frames
    - 3 convolution layers with ReLU
    - 1 fully connected hidden layer with ReLU
    - 1 fully connected linear output layer that produces the action value vector
- not guaranteed to converge on the optimal value function
- network weights can oscillate or diverge due to the high correlation between actions and states
- RL is notoriously unstable when neural networks are used to represent action values
- ways to overcome this
    - Experience Replay
    - Fixed Q Targets
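
The frame stacking and greedy action selection described above can be sketched as follows. This is a minimal sketch under stated assumptions: `FrameStack` and `greedy_action` are hypothetical helper names, frames are stand-in values rather than real 84 x 84 arrays, and repeating the first frame to fill the stack at the start of an episode is an assumption, not something these notes specify.

```python
from collections import deque


class FrameStack:
    """Keep the most recent `size` preprocessed frames as one state."""

    def __init__(self, size=4):
        self.frames = deque(maxlen=size)

    def reset(self, first_frame):
        # Assumption: fill the stack by repeating the first frame of an episode.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        self.frames.append(frame)  # the oldest frame is dropped automatically
        return self.state()

    def state(self):
        return tuple(self.frames)


def greedy_action(q_values):
    """Index of the largest action value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


stack = FrameStack(size=4)
stack.reset("f0")                         # strings stand in for image frames
state = stack.push("f1")
action = greedy_action([0.1, 0.7, 0.2])   # Q-values from one forward pass
```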

## Experience Replay
- rolling history of past data - replay pool
- behavior distribution is averaged over many previous states smoothing out learning and avoiding oscillations
- experience tuple (S<sub>t</sub>, A<sub>t</sub>, R<sub>t+1</sub>, S<sub>t+1</sub>)
- typical online Q-learning learns from the experience tuple then discards it and moves on
- instead store these experiences
- some states are rare to come by and some actions can be very costly, so it is useful to be able to recall those experiences
- Replay Buffer - storage for each experience tuple
- sample small batch from replay buffer to learn
- can learn from individual tuple multiple times, recall rare experiences and make better use of experience
- a sequence of experience tuples can be highly correlated, so learning from them in sequential order can be swayed by the effects of this correlation
- sampling from the replay buffer at random breaks this correlation and helps prevent action values from oscillating or diverging
- makes RL look more like a supervised learning (SL) scenario: the buffer acts as a dataset of samples to learn from
- experience replay is sampling a small batch of experience tuples from the replay buffer to learn from
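
A replay buffer along these lines can be sketched with the standard library. This is a minimal sketch: the class name `ReplayBuffer`, the extra `done` field (not part of the tuple listed above), and the `seed` parameter are assumptions made for illustration.

```python
import random
from collections import deque, namedtuple

# `done` is an added assumption; the tuple in the notes has four fields.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size rolling history of experience tuples."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest experiences fall off
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal ordering.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=2)
```

With a capacity of 3, only the three most recent experiences remain after five insertions, and each sampled batch mixes them regardless of the order in which they were collected.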

## Fixed Q-Targets
- a target network represents an older copy of the Q-function, which is used to compute the targets for the loss during training
- at each step of training the Q-function values change and the value estimates can spiral out of control
- allows more reliable convergence
- Q-learning is a form of Temporal-Difference (TD) learning
- update a guess with a guess
- when updating weights there is a correlation between the target and the parameters being changed, like trying to hit a moving target
- fix the function parameters used to generate the target: w<sup>-</sup> is a copy of w that isn't changed during the learning step
![weight update](images/drl/fixed_Q_target.png)
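
The idea of computing the target from frozen weights w<sup>-</sup> can be sketched with a small linear approximator in place of a neural network. This is a minimal sketch under stated assumptions: the linear form of `q_value`, the function names, the `sync_every` interval, and the hyperparameter values are all illustrative and not taken from the paper.

```python
import copy


def q_value(weights, features, action):
    """Linear action value: dot product of the action's weight row and the features."""
    return sum(w * x for w, x in zip(weights[action], features))


def fixed_target_update(w, w_target, features, action, reward, next_features,
                        done, alpha=0.1, gamma=0.9):
    """One gradient step on w; the TD target is computed from w_target only."""
    next_max = 0.0 if done else max(
        q_value(w_target, next_features, a) for a in range(len(w_target)))
    td_error = reward + gamma * next_max - q_value(w, features, action)
    # For a linear q, the gradient with respect to w[action] is the feature vector.
    for i, x in enumerate(features):
        w[action][i] += alpha * td_error * x
    return td_error


w = [[0.0, 0.0], [0.0, 0.0]]   # online weights, one row per action
w_target = copy.deepcopy(w)    # w^-: frozen copy used to generate targets
sync_every = 2                 # hypothetical number of steps between copies

errors = []
for step in range(1, 5):
    errors.append(fixed_target_update(
        w, w_target, [1.0, 0.0], 0, 1.0, [1.0, 0.0], False))
    if step % sync_every == 0:
        w_target = copy.deepcopy(w)  # refresh the frozen copy
```

Between syncs the target stays put while w moves, so the update is no longer chasing a value that shifts with every step.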

## Deep Q-Learning Algorithm
![deep Q learning](images/drl/deep_Q_learning.png)

## Double DQN
[Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/pdf/1509.06461.pdf)   

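- Deep Q-Learning tends to overestimate action values because the max in the TD target uses the same noisy estimates to both select and evaluate an action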
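
The linked paper's remedy is to select the next action with the online network and evaluate it with the target network. A minimal sketch of the two targets, assuming the Q-values for the next state are already available as plain lists (the function names and example numbers are illustrative):

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Standard DQN target: the target network both selects and evaluates."""
    return reward if done else reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target,
                      gamma=0.99, done=False):
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


next_q_online = [1.0, 2.0, 1.5]   # online network's estimates for S_{t+1}
next_q_target = [1.2, 0.8, 3.0]   # target network's estimates for S_{t+1}

standard = dqn_target(0.0, next_q_target, gamma=1.0)
double = double_dqn_target(0.0, next_q_online, next_q_target, gamma=1.0)
```

Here the standard target picks the target network's largest (possibly inflated) value of 3.0, while the Double DQN target uses 0.8, the target network's value for the action the online network prefers.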

## Prioritized Experience Replay  
[Prioritized Experience Replay](https://arxiv.org/abs/1511.05952)   

- perhaps an agent can learn more effectively from certain transitions than from others
- important transitions should be sampled with higher probability
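
One way to sketch this is proportional prioritization as described in the linked paper, where a transition's priority is its absolute TD error plus a small constant. This is a minimal sketch with assumptions: the function names, the `alpha`/`beta` defaults, sampling with replacement via `random.choices`, and normalizing the importance weights by the batch maximum are illustrative choices.

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha=0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(priorities, batch_size, alpha=0.6, rng=random):
    """Draw buffer indices (with replacement) according to their priority."""
    probs = sampling_probabilities(priorities, alpha)
    return rng.choices(range(len(priorities)), weights=probs, k=batch_size)


def importance_weights(probs, indices, beta=0.4):
    """Weights (N * P(i))^-beta that offset the non-uniform sampling bias."""
    n = len(probs)
    weights = [(n * probs[i]) ** -beta for i in indices]
    max_w = max(weights)
    return [w / max_w for w in weights]  # assumption: normalize by batch max


td_errors = [0.1, 2.0, 0.5]
priorities = [abs(td) + 1e-3 for td in td_errors]  # small constant avoids zero
probs = sampling_probabilities(priorities, alpha=1.0)
indices = sample_indices(priorities, batch_size=4, rng=random.Random(0))
weights = importance_weights(probs, [0, 1, 2])
```

Transitions with a large TD error are drawn more often, and the importance weights scale their updates down so that frequently sampled transitions do not dominate learning.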

## Dueling DQN
[Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581)

## Other Improvements
[Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783)  
[A Distributional Perspective on Reinforcement Learning](https://arxiv.org/abs/1707.06887)  
[Noisy Networks for Exploration](https://arxiv.org/abs/1706.10295)   
[Rainbow: Combining Improvements in Deep Reinforcement Learning](https://arxiv.org/abs/1710.02298)  