# Approximate q-learning

In this notebook you will teach a lasagne neural network to do Q-learning.

__Frameworks__ - we'll accept this homework in any deep learning framework. For example, it translates to TensorFlow almost line-to-line. However, we recommend you to stick to theano/lasagne unless you're certain about your skills in the framework of your choice.

In [91]:
%env THEANO_FLAGS='floatX=float32'
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY"))==0:
    !bash ../xvfb start
    %env DISPLAY=:1

env: THEANO_FLAGS='floatX=float32'


In [92]:
import gym
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#### Попробуем применить qlearning для среды LunarLander

In [93]:
env = gym.make("LunarLander-v2").env
env.reset()
n_actions = env.action_space.n
state_dim = env.observation_space.shape

# Approximate (deep) Q-learning: building the network

In this section we will build and train naive Q-learning with theano/lasagne

First step is initializing input variables

In [94]:
import theano
import theano.tensor as T

current_states = T.matrix("states[batch,units]")
actions = T.ivector("action_ids[batch]")
rewards = T.vector("rewards[batch]")
next_states = T.matrix("next states[batch,units]")
is_end = T.ivector("vector[batch] where 1 means that session just ended")

In [95]:
import lasagne
from lasagne.layers import *

#input layer
l_states = InputLayer((None,)+state_dim)

nn = DenseLayer(l_states, 200)
nn = DenseLayer(nn, 200)

#output layer
l_qvalues = DenseLayer(nn, num_units=n_actions, nonlinearity=None)

#### Predicting Q-values for `current_states`

In [97]:
#get q-values for ALL actions in current_states
predicted_qvalues = get_output(l_qvalues,{l_states:current_states})

In [98]:
#compiling agent's "GetQValues" function
get_qvalues = theano.function([current_states], T.argmax(predicted_qvalues, axis=1), allow_input_downcast=True)

In [99]:
#select q-values for chosen actions
predicted_qvalues_for_actions = predicted_qvalues[T.arange(actions.shape[0]),actions]

#### Loss function and `update`
Here we write a function similar to `agent.update`.

In [100]:
#predict q-values for next states
predicted_next_qvalues = get_output(l_qvalues,{l_states:next_states})


#Computing target q-values under 
gamma = 0.99
target_qvalues_for_actions = rewards + gamma * T.max(predicted_next_qvalues, axis=1)

#zero-out q-values at the end
target_qvalues_for_actions = (1-is_end)*target_qvalues_for_actions

#don't compute gradient over target q-values (consider constant)
target_qvalues_for_actions = theano.gradient.disconnected_grad(target_qvalues_for_actions)

#### Добавим регуляризацию, чтобы сеть не переобучалась

In [101]:
from lasagne.regularization import regularize_network_params, l2
l2_penalty = regularize_network_params(l_qvalues, l2)

In [102]:
#mean squared error loss function
loss = lasagne.objectives.squared_error(predicted_qvalues_for_actions, target_qvalues_for_actions) + 0.001*l2_penalty

In [103]:
#all network weights
all_weights = get_all_params(l_qvalues,trainable=True)

#network updates. Note the small learning rate (for stability)
updates = lasagne.updates.adam(loss.mean(),all_weights,learning_rate=1e-4)

In [104]:
#Training function that resembles agent.update(state,action,reward,next_state) 
#with 1 more argument meaning is_end
train_step = theano.function([current_states,actions,rewards,next_states,is_end],
                             updates=updates, allow_input_downcast=True)

### Playing the game

In [112]:
epsilon = 0.7 #initial epsilon

def generate_session(t_max=800):
    """play env with approximate q-learning agent and train it at the same time"""
    
    total_reward = 0
    s = env.reset()
    for t in range(t_max):
        
        #get action q-values from the network
        q_values = get_qvalues(np.array([s],dtype=np.float32))[0] 
        rnd = np.random.uniform()
        if rnd < epsilon:
            a = np.random.choice(np.arange(n_actions))
        else:
            a = q_values
        
        new_s,r,done,info = env.step(a)
        
        #train agent one step. Note that we use one-element arrays instead of scalars 
        #because that's what function accepts.
        train_step(np.array([s],dtype=np.float32),[a],[r],
                   np.array([new_s],dtype=np.float32),[done])
        
        total_reward+=r
        
        s = new_s
        if done: break
            
    return total_reward       

In [113]:
import tqdm

In [114]:
import imp
imp.reload(tqdm)

<module 'tqdm' from '/home/egdeliya_nurgaliyeva/anaconda3/envs/py34/lib/python3.4/site-packages/tqdm/__init__.py'>

In [116]:
t = tqdm.trange(10)
for i in t:
    
    rewards = [generate_session() for _ in range(100)] #generate new sessions
    
    epsilon*=0.995
        
    t.set_postfix(mean_reward=np.mean(rewards), epsilon=epsilon)

    if np.mean(rewards) > 0:
        print ("You Win!")
        print(np.mean(rewards))
#         break
        
    assert epsilon!=0, "Please explore environment"


  0%|          | 0/10 [00:00<?, ?it/s][A
100%|██████████| 10/10 [23:33<00:00, 141.38s/it, epsilon=0.386, mean_reward=11.1]

You Win!





## Получили reward > 0 (reward = 11.1) !!!