# Q-learning

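The __action-value function__ (or __"Q-function"__) takes two inputs, a state and an action. It returns the expected discounted future reward of taking that action in that state and following policy $\pi$ afterwards:
$$Q^{\pi}(s_t, a_t) = \mathbb{E} [R_{t + 1} + \lambda R_{t + 2} + \lambda^{2} R_{t + 3} + \ldots \mid s_t, a_t]$$
where $\lambda \in [0, 1]$ is the discount factor (called `gamma` in the code below).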
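
To make the discounted sum inside the expectation concrete, here is a minimal sketch that evaluates it for one fixed reward sequence. The function name `discounted_return` and the sample values are illustrative assumptions, not part of the notebook.

```python
def discounted_return(rewards, discount):
    """Sum rewards[k] * discount**k, accumulating from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount * total
    return total

# Three rewards of 1 with discount 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

The Q-value is the average of this quantity over all trajectories that start with the given state and action.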

## Q-table
We will use a Q-table ("Q" for the "quality" of an action), which stores a Q-value for every state-action pair. The rows are the states and the columns are the actions. Each cell holds the current estimate of the maximum expected future reward for that state and action.<br>
To learn Q(s, a) we use an update rule derived from the Bellman equation, which adjusts the Q-value after each new observation:
$$NewQ(s, a) = Q(s, a) + \alpha(R(s, a) + \lambda \max_{a'} Q(s', a') - Q(s, a))$$
where $s'$ is the next state, $a'$ ranges over the actions available in $s'$, and $\alpha$ is the learning rate.
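
As a worked example of a single update, the sketch below applies the rule to one observed transition. The table size, the pre-filled estimates, and the values of `alpha` and `discount` are assumptions chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

q_table = np.zeros((3, 2))       # 3 states x 2 actions (hypothetical sizes)
q_table[1] = [0.5, 2.0]          # pretend state 1 already has some estimates

alpha, discount = 0.1, 0.9       # learning rate and discount factor (assumed values)
s, a, r, s_next = 0, 1, 1.0, 1   # one observed transition (s, a, r, s')

td_target = r + discount * np.max(q_table[s_next])    # 1.0 + 0.9 * 2.0 = 2.8
q_table[s, a] += alpha * (td_target - q_table[s, a])  # 0 + 0.1 * (2.8 - 0), about 0.28
print(q_table[s, a])
```

Only the cell for the visited pair `(s, a)` changes; every other entry of the table keeps its previous value.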

Algorithm:
```
Initialize the Q-table with size n (number of states) x m (number of actions), filled with zeros
repeat until learning is stopped:
    Choose an action (a) in the current state (s) based on the current Q-value estimates (e.g. argmax(Q(s, :)), with occasional random actions for exploration)
    Take the action (a) and observe the outcome state (s') and reward (r)
    Update Q(s, a) using the Bellman equation
```
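
The steps above can be sketched as a small runnable program. The five-state chain environment, the `step` and `train` helpers, and all hyperparameter values are hypothetical and exist only for illustration; exploration is done with an epsilon-greedy rule, which is one common choice rather than the only one.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2   # hypothetical chain: action 0 = left, 1 = right

def step(state, action):
    """Toy deterministic environment: reward 1 for reaching the right end."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.1, discount=0.9, epsilon=0.5, max_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if rng.random() < epsilon:
                a = int(rng.integers(N_ACTIONS))
            else:
                a = int(np.argmax(q[s]))
            s_next, r, done = step(s, a)
            # Terminal transitions contribute no bootstrapped value.
            target = r if done else r + discount * np.max(q[s_next])
            q[s, a] += alpha * (target - q[s, a])
            s = s_next
            if done:
                break
    return q

q_table = train()
print(np.argmax(q_table[:-1], axis=1))  # greedy action for each non-terminal state
```

In this toy problem the only reward sits at the right end of the chain, so a well-trained table should prefer action 1 in every non-terminal state.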

## Deep Q-Learning
With a large number of states, building a Q-table is no longer efficient, so we can instead use a neural network that takes a state as input and predicts the Q-values of all actions.<br>
In this case we minimize the difference between the predicted Q-value and the target Q-value, which is $R(s, a) + \lambda \max_{a'} Q(s', a')$.
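
The target and the resulting loss can be sketched for a whole batch with NumPy. The helper `td_targets`, the sample arrays, and the discount of 0.5 are assumptions for illustration; dropping the bootstrap term for terminal transitions matches what the training loop later in this notebook does with its `done` flag.

```python
import numpy as np

def td_targets(rewards, q_next, dones, discount):
    """Return r + discount * max_a' Q(s', a'), dropping the bootstrap term when done."""
    rewards = np.asarray(rewards, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    return rewards + discount * not_done * np.max(q_next, axis=1)

# Hypothetical batch of three transitions with two actions each.
q_next = np.array([[1.0, 3.0], [0.5, 0.2], [2.0, 2.0]])
targets = td_targets([1.0, 1.0, 1.0], q_next, [False, True, False], discount=0.5)
print(targets)  # values: 2.5, 1.0, 2.0

q_pred = np.array([2.0, 1.0, 2.0])        # assumed network outputs Q(s, a)
loss = np.mean((targets - q_pred) ** 2)   # mean squared difference to the targets
print(loss)
```

The second transition is terminal, so its target is just the reward.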

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from collections import deque
import random

import gym

In [2]:
RANDOM_SEED = 40

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
tf.set_random_seed(RANDOM_SEED)

In [3]:
env = gym.make("CartPole-v0")

a_size = env.action_space.n
s_size = env.observation_space.shape[0]
print("Action space size: {}".format(a_size))
print("State space size: {}".format(s_size))

possible_actions = np.identity(a_size)  # row i is the one-hot encoding of action i

Action space size: 2
State space size: 4


In [4]:
class DQNetwork(object):
    def __init__(self, s_size, a_size):
        self.s_size = s_size
        self.a_size = a_size

        # Forward pass: state -> hidden layer (20 tanh units) -> one Q-value per action.
        self.states = tf.placeholder(shape=[None, self.s_size], dtype=tf.float32)
        self.dense = tf.layers.dense(inputs=self.states, units=20, activation=tf.nn.tanh)
        self.Qout = tf.layers.dense(inputs=self.dense, units=self.a_size)
        self.predict = tf.argmax(self.Qout, 1)  # index of the greedy action

        # Training inputs: target Q-values and one-hot encodings of the actions taken.
        self.Qtarget = tf.placeholder(shape=[None], dtype=tf.float32)
        self.action = tf.placeholder(shape=[None, self.a_size], dtype=tf.float32)
        # The one-hot mask selects the predicted Q-value of the action that was taken.
        Q = tf.reduce_sum(tf.multiply(self.Qout, self.action), axis=1)
        self.loss = tf.reduce_mean(tf.square(self.Qtarget - Q))
        trainer = tf.train.AdamOptimizer(learning_rate=0.01)
        self.optimize = trainer.minimize(self.loss)

In [16]:
tf.reset_default_graph()

network = DQNetwork(s_size, a_size)
init = tf.global_variables_initializer()

In [17]:
sess = tf.Session()
sess.run(init)

In [18]:
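gamma = 0.99        # discount factor (written as lambda in the formulas above)
e = 1               # initial exploration rate (epsilon)
e_decay = 0.995     # epsilon is multiplied by this at the end of each episode
e_min = 0.1         # stop decaying once epsilon reaches this value
num_episodes = 200
batch_size = 40     # transitions sampled from the replay buffer per training step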

rlist = []
experience = deque(maxlen=2000)

for episode in range(num_episodes):
    s = env.reset()
    r_total = 0
    done = False
    
    while not done:
        # Epsilon-greedy: explore with probability e, otherwise act greedily.
        if np.random.rand() < e:
            a_ind = env.action_space.sample()
        else:
            a_ind = sess.run(network.predict, feed_dict={network.states: [s]})[0]
        s1, r, done, _ = env.step(a_ind)
        
        experience.append((s, possible_actions[a_ind], r, s1, done))
        
        r_total += r
        s = s1
        if done:
            if e > e_min:
                e *= e_decay
            if episode % 10 == 0:
                # rlist[-10:-1] averages nine earlier episodes; it is empty at episode 0, hence the nan.
                print("EPISODE {:0>5}: {}".format(episode, np.mean(rlist[-10:-1])))

        # Train on a random minibatch once the replay buffer holds enough transitions.
        if len(experience) > batch_size:
            minibatch = random.sample(experience, batch_size)
            states_mb = np.array([i[0] for i in minibatch])
            actions_mb = np.array([i[1] for i in minibatch])
            rewards_mb = np.array([i[2] for i in minibatch])
            next_states_mb = np.array([i[3] for i in minibatch])
            dones_mb = np.array([i[4] for i in minibatch])

            Qtarget = []
            Qnext_state = sess.run(network.Qout, feed_dict={network.states: next_states_mb})
            for i in range(batch_size):
                # Target is r for terminal transitions, r + gamma * max Q(s', a') otherwise.
                target = rewards_mb[i]
                if not dones_mb[i]:
                    target += gamma * np.max(Qnext_state[i])
                Qtarget.append(target)
            loss, _ = sess.run([network.loss, network.optimize], feed_dict={network.states: states_mb,
                                                                           network.Qtarget: Qtarget,
                                                                           network.action: actions_mb})

    rlist.append(r_total)

EPISODE 00000: nan
EPISODE 00010: 25.444444444444443
EPISODE 00020: 24.333333333333332
EPISODE 00030: 31.444444444444443
EPISODE 00040: 31.22222222222222
EPISODE 00050: 37.111111111111114
EPISODE 00060: 54.888888888888886
EPISODE 00070: 41.111111111111114
EPISODE 00080: 51.888888888888886
EPISODE 00090: 77.77777777777777
EPISODE 00100: 61.111111111111114
EPISODE 00110: 82.77777777777777
EPISODE 00120: 95.77777777777777
EPISODE 00130: 49.111111111111114
EPISODE 00140: 79.77777777777777
EPISODE 00150: 59.22222222222222
EPISODE 00160: 65.0
EPISODE 00170: 85.88888888888889
EPISODE 00180: 57.111111111111114
EPISODE 00190: 112.44444444444444


In [24]:
s = env.reset()
# Push the cart right three times before handing control to the agent;
# the rewards of these steps are not added to r_total.
for _ in range(3):
    s, _, _, _ = env.step(1)
r_total = 0
done = False
while not done:
    env.render()
    a = sess.run(network.predict, feed_dict={network.states: [s]})[0]
    s, r, done, _ = env.step(a)
    r_total += r
print(r_total)

197.0


In [25]:
env.close()