# Reinforcement Learning: DQNs

source: https://towardsdatascience.com/reinforcement-learning-w-keras-openai-dqns-1eed3a5338c


## Setup

In [1]:
%matplotlib inline

import seaborn as sns
import matplotlib.pyplot as plt
import gym
import numpy as np

sns.set_style("darkgrid")

# Introduction
***

This project solves a reinforcement learning problem with an algorithm called deep Q-learning.

The approach was first demonstrated by researchers from Google DeepMind. This tutorial follows the main ideas of their early research papers, especially the two papers named in the next section.

## Deep Q-Network
DQN was introduced in two papers: *Playing Atari with Deep Reinforcement Learning*, presented at NIPS in 2013, and *Human-level control through deep reinforcement learning*, published in Nature in 2015. Interestingly, only a few papers about DQN appeared between 2013 and 2015. I guess the reason was that people could not reproduce the DQN implementation without the details given in the Nature version.

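More formally, this project uses an artificial neural network with weights $\theta$ as a non-linear function approximator for the action-value function:

$$
\mathcal{N}_{\theta}(i_t)_{a_t} \approx q^*(i_t, a_t) = \mathbb{E} \left[ r_{t+1} + \gamma \max_{a'} q^*(i_{t+1}, a') \mid i_t, a_t \right]
$$
where $i_t$ is the observed state, $a_t$ the chosen action, $r_{t+1}$ the reward received afterwards, and $\gamma$ the discount factor. The network outputs one value per action, and the subscript $a_t$ selects the entry for the chosen action.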
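
As a small worked example of the right-hand side, assume (purely for illustration) a single deterministic transition with reward $r_{t+1} = -1$, discount factor $\gamma = 0.99$, and a best next-state value of $-10$. These numbers are made up and not taken from the experiment:

$$
r_{t+1} + \gamma \max_{a} q^*(i_{t+1}, a) = -1 + 0.99 \cdot (-10) = -10.9
$$

The network output for the chosen action would then be trained towards $-10.9$.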


**Q Learning**
  

The agent contains two separate models. Both are neural networks with the same hyperparameters, but their parameter vectors differ temporarily.

$\mathcal{N}_{\theta_1}(i)$, $\mathcal{N}_{\theta_2}(i)$

The first network, the action model, estimates the action values during the experimental phase. It outputs the q values on which the agent actually bases its choice of action in a given state.
The second model, on the other hand, provides the target q values for training the first network.
It is therefore called the target model, and it is used together with the replay memory during the learning phase.
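
The text does not say how or when the two parameter vectors are brought back in line. A common choice, assumed here, is to copy the action model's weights into the target model every few updates. The sketch below is a hypothetical illustration, not the code in `deepQLearningSimple`: the class `TwoModelAgent`, the method `apply_update`, and the parameter `sync_every` are made-up names, and plain lists stand in for the real network weights.

```python
import copy


class TwoModelAgent:
    """Hypothetical holder for the action model and the target model.

    Plain lists stand in for the parameter vectors theta_1 and theta_2.
    """

    def __init__(self, initial_weights, sync_every=100):
        # Both models start from the same parameters.
        self.action_weights = copy.deepcopy(initial_weights)
        self.target_weights = copy.deepcopy(initial_weights)
        self.sync_every = sync_every  # assumed sync interval, in updates
        self.num_updates = 0

    def apply_update(self, new_action_weights):
        """Store newly trained action weights; sync the target model periodically."""
        self.action_weights = copy.deepcopy(new_action_weights)
        self.num_updates += 1
        if self.num_updates % self.sync_every == 0:
            self.target_weights = copy.deepcopy(self.action_weights)


agent_sketch = TwoModelAgent(initial_weights=[0.0, 0.0], sync_every=2)
agent_sketch.apply_update([0.1, 0.2])
differs_after_one = agent_sketch.action_weights != agent_sketch.target_weights
agent_sketch.apply_update([0.3, 0.4])
equal_after_two = agent_sketch.action_weights == agent_sketch.target_weights
```

Between two syncs the target model lags behind the action model, which is one way to read "differ temporarily" above.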

The idea behind experience replay is:

- The basic Q-learning algorithm needs many thousands of states from the game environment to learn the important features from which the Q-values can be estimated.
- Experience replay was originally proposed in *Reinforcement Learning for Robots Using Neural Networks* in 1993. A DNN easily overfits the current episodes, and once it is overfitted it hardly produces varied experiences. To solve this problem, experience replay stores experiences, including state transitions, rewards and actions, which are the data needed for Q-learning, and draws mini-batches from them to update the neural network. This technique is expected to bring the following benefits:

    - reduces correlation between the experiences used to update the DNN
    - increases learning speed through mini-batches
    - reuses past transitions to avoid catastrophic forgetting
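
A replay memory of this kind can be sketched with a bounded queue and uniform random sampling. This is a minimal illustration under assumptions, not the author's implementation: the names `Transition`, `ReplayMemory`, `push`, `is_full`, and `sample` are hypothetical, and the stored field layout is only one plausible choice.

```python
import random
from collections import deque, namedtuple

# One stored experience; the field names are an assumption for this sketch.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayMemory:
    """Fixed-size buffer that forgets the oldest transition when full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def is_full(self):
        return len(self.memory) == self.memory.maxlen

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of consecutive steps.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


memory_sketch = ReplayMemory(capacity=3)
for step_index in range(4):
    memory_sketch.push((-0.5, 0.0), step_index, -1.0, (-0.5, 0.0), False)
mini_batch = memory_sketch.sample(batch_size=2)
```

After four pushes into a buffer of capacity three, the first transition has been dropped and the mini-batch is drawn from the remaining three.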




Learning, or replay, is invoked once the memory is full of recordings from the experiment. The target model then takes each recorded transition and performs the Q-learning update.
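
The update step described above can be sketched as follows, assuming the standard Q-learning target: the reward plus the discounted maximum of the target model's q values for the next state, or the reward alone for a terminal transition. The function `q_learning_targets` and the callable `target_q` are hypothetical names, and the default `gamma` is an arbitrary illustrative value.

```python
def q_learning_targets(batch, target_q, gamma=0.99):
    """Compute one training target per transition.

    batch    -- iterable of (state, action, reward, next_state, done) tuples
    target_q -- callable mapping a state to a list of q values, one per
                action (stands in for the target model)
    gamma    -- discount factor (illustrative default)
    """
    targets = []
    for state, action, reward, next_state, done in batch:
        if done:
            # No future reward after a terminal state.
            target = reward
        else:
            target = reward + gamma * max(target_q(next_state))
        targets.append(target)
    return targets


def constant_target_q(state):
    # Toy stand-in for the target model: same q values for every state.
    return [0.0, 1.0, 4.0]


toy_batch = [
    ((-0.5, 0.0), 2, -1.0, (-0.49, 0.01), False),
    ((0.49, 0.02), 2, 0.0, (0.5, 0.02), True),
]
toy_targets = q_learning_targets(toy_batch, constant_target_q, gamma=0.5)
```

The action model would then be fitted so that its output for the taken action moves towards the corresponding target.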


## Game Environment

The toolkit from `OpenAI` for developing and comparing reinforcement learning algorithms is the library `gym`.
This library includes several environments, or test problems, that can be solved with reinforcement learning algorithms.
It provides a simple shared interface, which makes it possible to skip complex manual feature engineering.


This project tackles the learning problem `MountainCar`.
The challenge is that a car, stuck between two hills, needs to climb the right hill, but a single push produces too little momentum. The only way to solve the problem is for the agent to drive back and forth to build up more momentum.
Initially, the agent knows neither this strategy nor how to apply it efficiently.
The problem was first described in A. Moore, Efficient Memory-Based Learning for Robot Control, PhD thesis, University of Cambridge, 1990.
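
Why a single push is not enough can be checked with a small stand-alone simulation. The sketch below does not call `gym`. It re-implements the car dynamics with constants as I recall them from gym's `MountainCar` implementation, so treat the constants and the fixed start position of `-0.5` as assumptions. The two policies `always_right` and `back_and_forth` are hypothetical hand-written rules, not learned agents.

```python
import math

# Constants as recalled from gym's MountainCar implementation (assumptions).
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POSITION = 0.5
FORCE, GRAVITY = 0.001, 0.0025


def step(position, velocity, action):
    """Advance one time step; action is 0 (left), 1 (none) or 2 (right)."""
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POSITION, min(MAX_POSITION, position))
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0  # the car stops at the left wall
    return position, velocity


def steps_to_goal(policy, max_steps=200, start_position=-0.5):
    """Return the number of steps until the goal is reached, or None on timeout."""
    position, velocity = start_position, 0.0
    for t in range(1, max_steps + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        if position >= GOAL_POSITION:
            return t
    return None


def always_right(position, velocity):
    return 2


def back_and_forth(position, velocity):
    # Push in the direction the car is already moving.
    return 2 if velocity >= 0 else 0


steps_always_right = steps_to_goal(always_right)
steps_back_and_forth = steps_to_goal(back_and_forth)
```

Under these assumptions, pushing right all the time leaves the car oscillating in the valley, while pushing in the direction of travel builds up enough momentum to reach the goal within the step limit.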

![](mountainCar.png)

This is the `MountainCar` environment from gym.

In [2]:
env = gym.make("MountainCar-v0")

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


The action space is discrete, and there are 3 possible actions available.
$$
a \in \mathcal{A} = \{0, 1, 2\}
$$


number | action  
-------|-------  
0      | push left
1      | no operation
2      | push right

In [3]:
env.action_space

Discrete(3)

An observation $\boldsymbol{i}$ is a `2`-dimensional vector. The first dimension gives the position of the car and the second its velocity. Both values are continuous and fall into two intervals:

$$
\boldsymbol{i} = (i_1, i_2)' \in \mathcal{S} = [-1.2, 0.6] \times [-0.07, 0.07]
$$

component | meaning  
-------|-------  
$i_1$  | position
$i_2$  | velocity

In [4]:
print(env.observation_space)
print("Lower bound is :: {}".format(env.observation_space.low))
print("Upper bound is :: {}".format(env.observation_space.high))

Box(2,)
Lower bound is :: [-1.2  -0.07]
Upper bound is :: [0.6  0.07]


## Reward 

The reward is $-1$ for each time step until the goal position of $0.5$ is reached.
$$
r \in \mathcal{R} = \{-1, 0\}
$$

## Terminal State
The terminal state determines the end of an episode. It is reached either when the car is in state $\boldsymbol{i}_{200}$ or when it is in a state $\boldsymbol{i}_t = (0.5, i_{t2})$ with $t \leq 200$.


# Training
***


In [None]:
import deepQLearningSimple as dql
import gym

Using TensorFlow backend.


In [None]:
# Close any environment left over from an earlier cell
try:
    env.close()
except Exception:
    pass

env = gym.make("MountainCar-v0")
agent = dql.agent(env=env, training=True, render=False)


# Training
agent.run(num_episode=10000, num_steps=500)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
 Game :: 1 Wins :: 0 Mean Q Value :: 0.024898752570152283  Median Q Values :: 0.025068996474146843 
 Game :: 2 Wins :: 0 Mean Q Value :: 0.025488073006272316  Median Q Values :: 0.02617967128753662 
 Game :: 3 Wins :: 0 Mean Q Value :: 0.025484977290034294  Median Q Values :: 0.026033785194158554 
 Game :: 4 Wins :: 0 Mean Q Value :: 0.02517753280699253  Median Q Values :: 0.02536056563258171 
 Game :: 5 Wins :: 0 Mean Q Value :: 0.025322481989860535  Median Q Values :: 0.025494081899523735 
 Game :: 6 Wins :: 0 Mean Q Value :: 0.024952402338385582  Median Q Values :: 0.024924002587795258 
 Game :: 7 Wins :: 0 Mean Q Value :: 0.024831285700201988  Median Q Values :: 0.024779269471764565 
 Game :: 8 Wins :: 0 Mean Q Value :: 0.02498335763812065  Median Q Values :: 0.025037799030542374 
 Game :: 9 Wins :: 0 Mean Q Value :: 0.02494696155190468  Median Q Values :: 0.02491395175457000

# Results
***

In [None]:
def mean_changing_rate_layer(model, layer):
    """Mean element-wise ratio between consecutive weight snapshots of one layer."""
    # Small offset that keeps the ratio defined when a weight is exactly zero
    smoothing_factor = 0.00001

    # Get the layer's weights for all experiments from the history
    snapshots = [weights[layer] for weights in model.weights_history]

    # Calculate the mean ratio of each snapshot to the previous weights
    changing_rate = [np.mean((new + smoothing_factor) / (old + smoothing_factor))
                     for old, new in zip(snapshots, snapshots[1:])]

    return changing_rate

changing_l1 = mean_changing_rate_layer(agent.action_dqn, 0)
changing_l2 = mean_changing_rate_layer(agent.action_dqn, 1)
changing_l3 = mean_changing_rate_layer(agent.action_dqn, 2)
changing_l4 = mean_changing_rate_layer(agent.action_dqn, 3)

changing_l8 = mean_changing_rate_layer(agent.action_dqn, 7)

In [None]:
plt.plot(changing_l1, color = "blue")

In [None]:
plt.plot(changing_l3, color = "blue")

In [None]:
agent.replay_memory.memory[0]

In [None]:
# Run the trained agent in test mode with rendering switched on
agent.training = False
agent.render = True

agent.run(num_episode=10, num_steps=500)
env.close()