# Deep Reinforcement Learning Tutorial with TensorFlow

This tutorial provides introductory hands-on material for Deep Reinforcement Learning in a TensorFlow environment. The first part covers basic TensorFlow practice and an MLP example; the second part analyzes the Deep Reinforcement Learning example that Karpathy released as open source and modifies it directly.


The code and comments are written by Dong-Hyun Kwak (imcomking@gmail.com)


<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.


## Part 1. TensorFlow

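TensorFlow (hereafter tf) is a library for distributed machine learning (deep learning) that was initially developed under Google's lead and is now open source and widely used. Like Theano, it builds on a computational graph, which makes automatic differentiation possible, and, like Spark, its architecture is designed to run in distributed cloud computing environments.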
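
To make the link between a computational graph and automatic differentiation concrete, here is a minimal sketch in plain Python rather than tf. The `Node` class and `backward` function are hypothetical names made up for this illustration; they are not part of the TensorFlow API, and tf's own implementation is far more general.

```python
class Node(object):
    """A value in a computation graph that remembers how it was produced."""

    def __init__(self, value, parents=()):
        self.value = value
        # parents: (parent_node, local_gradient) pairs
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Node(self.value * other.value,
                    [(self, other.value), (other, self.value)])


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node in the graph."""
    order, seen = [], set()

    def visit(node):
        # depth-first post-order: a node is appended after all of its parents
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # walk from the output back to the inputs, applying the chain rule
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad


x = Node(3.0)
w = Node(2.0)
b = Node(1.0)
y = w * x + b          # building the expression also builds the graph
backward(y)
print(y.value, w.grad, x.grad, b.grad)   # 7.0 3.0 2.0 1.0
```

Because every operation records its local gradients when the graph is built, the derivative of the output with respect to any input follows mechanically from the chain rule, which is the idea tf relies on when an optimizer calls `minimize`.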

First, let's look at how tf works through an MLP example.

### Multi-Layer Perceptron
A Multi-Layer Perceptron (hereafter MLP) is a model with the structure shown below. Unlike in Convolutional Neural Networks, adjacent layers are densely connected, so such a layer is also called a dense layer or a fully connected layer.



(Figure: MLP)
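
As a rough illustration of what "fully connected" means, the sketch below computes a tiny MLP forward pass in plain Python. The layer sizes and weight values are arbitrary assumptions chosen for readability, and `dense` and `sigmoid` are helper names made up for this example.

```python
import math

def dense(inputs, weights, biases):
    """Fully connected layer: every input contributes to every output unit."""
    outputs = []
    for j in range(len(biases)):
        total = biases[j]
        for i in range(len(inputs)):
            total += inputs[i] * weights[i][j]
        outputs.append(total)
    return outputs

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

# 3 inputs -> 2 hidden units -> 2 outputs (illustrative sizes and weights)
x = [1.0, 0.0, -1.0]
W1 = [[0.5, -0.5], [0.1, 0.2], [0.5, 0.5]]   # shape (3, 2)
b1 = [0.0, 0.0]
W2 = [[1.0, -1.0], [1.0, 1.0]]               # shape (2, 2)
b2 = [0.0, 0.5]

h = sigmoid(dense(x, W1, b1))   # hidden activations
y = dense(h, W2, b2)            # output scores
num_weights = len(W1) * len(W1[0]) + len(W2) * len(W2[0])
print(h, y, num_weights)
```

Each weight matrix has one entry per (input, output) pair, which is why the number of weights grows with the product of the layer sizes.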

Let's write code that classifies the MNIST data with an MLP using tf.

In [2]:
%matplotlib inline
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

class MLP():
    def __init__(self):
        # download mnist data from internet
        mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
        # open a session, which executes the operations of the computation graph
        sess = tf.InteractiveSession()

        # placeholders are used for feeding data into the graph
        x = tf.placeholder("float", shape=[None, 784]) # None represents a variable-length dimension
        y_target = tf.placeholder("float", shape=[None, 10]) # the shape argument is optional, but useful for debugging

        # Variables hold the trainable parameters (placed on the GPU when one is used)
        W1 = tf.Variable(tf.zeros([784, 256]))
        b1 = tf.Variable(tf.zeros([256]))
        h1 = tf.sigmoid(tf.matmul(x, W1) + b1)
        
        W2 = tf.Variable(tf.zeros([256, 10]))
        b2 = tf.Variable(tf.zeros([10]))
        y = tf.nn.softmax(tf.matmul(h1, W2) + b2)
        # softmax classification
        
        # run the initializer op with sess.run; this is when the variables actually receive their initial values
        sess.run(tf.initialize_all_variables())

        # define the Loss function
        cross_entropy = -tf.reduce_sum(y_target*tf.log(y))

        # define optimization algorithm
        train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
        
        # list of boolean which is result of comparing the training prediction & real data
        correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_target, 1))
        
        # change true -> 1 false -> 0 and calc mean.
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

        # training step
        for i in range(5001): # 5001 minibatch steps
            batch = mnist.train.next_batch(150) # 150 is the minibatch size
            train_step.run(feed_dict={x: batch[0], y_target: batch[1]}) # the None dimension of each placeholder takes the minibatch size
            
            if i%500 == 0:
                train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_target: batch[1]})
                print("step %d, training accuracy: %.3f" % (i, train_accuracy))
        
        # evaluate on the whole test set
        print("test accuracy: %g" % accuracy.eval(feed_dict={x: mnist.test.images, y_target: mnist.test.labels}))

        
MLP()

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting MNIST_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
step 0, training accuracy: 0.133
step 500, training accuracy: 0.267
step 1000, training accuracy: 0.887
step 1500, training accuracy: 0.880
step 2000, training accuracy: 0.967
step 2500, training accuracy: 0.960
step 3000, training accuracy: 0.953
step 3500, training accuracy: 0.967
step 4000, training accuracy: 0.960
step 4500, training accuracy: 0.953
step 5000, training accuracy: 0.973
test accuracy: 0.9412


<__main__.MLP instance at 0x7faedd629098>
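
The loss and accuracy defined in the code above can be checked by hand on a tiny batch. The following is a plain Python sketch, not the tf implementation: `softmax`, `cross_entropy`, and `accuracy` are helper names made up for this example, and the two-sample batch is arbitrary.

```python
import math

def softmax(logits):
    # subtracting the maximum improves numerical stability without changing the result
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets, probs):
    # summed over the whole batch, like -reduce_sum(y_target * log(y))
    return -sum(t * math.log(p)
                for t_row, p_row in zip(targets, probs)
                for t, p in zip(t_row, p_row))

def accuracy(targets, probs):
    # compare the argmax of each prediction with the argmax of its one-hot target
    hits = [t_row.index(max(t_row)) == p_row.index(max(p_row))
            for t_row, p_row in zip(targets, probs)]
    return sum(hits) / float(len(hits))

logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # two samples, three classes
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # one-hot labels
probs = [softmax(row) for row in logits]
print(cross_entropy(targets, probs), accuracy(targets, probs))
```

The second sample has all-zero scores, so its softmax output is uniform and the argmax falls on the first class. Zero-initialized weights produce the same situation for every input before training, which is one plausible reading of the near-chance accuracy printed at step 0.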

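### Setting Up TensorBoard
TensorFlow provides a very powerful visualization tool called TensorBoard. It runs in a web browser and lets the user inspect the model structure visually and follow how parameter values change, which makes intuitive analysis possible.

At the time of writing, this is a distinguishing strength of tf that other libraries do not offer.

Let's use it to analyze the MLP we just built. To do so, the code must be modified to reflect the following points.

* **Give the variables names**

* **Create summaries of the variables**

* **Record the summaries of the variables**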


In [1]:
%matplotlib inline
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

class MLP_tf_board():
    def __init__(self):
        # download mnist data from internet
        mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
        # open a session, which executes the operations of the computation graph
        sess = tf.InteractiveSession()

        # placeholders are used for feeding data into the graph
        x = tf.placeholder("float", shape=[None, 784], name = 'x') # None represents a variable-length dimension
        y_target = tf.placeholder("float", shape=[None, 10], name = 'y_target') # the shape argument is optional, but useful for debugging

        # Variables hold the trainable parameters (placed on the GPU when one is used)
        W1 = tf.Variable(tf.zeros([784, 256]), name = 'W1')
        b1 = tf.Variable(tf.zeros([256]), name = 'b1')
        h1 = tf.sigmoid(tf.matmul(x, W1) + b1, name = 'h1')
        
        W2 = tf.Variable(tf.zeros([256, 10]), name = 'W2')
        b2 = tf.Variable(tf.zeros([10]), name = 'b2')
        y = tf.nn.softmax(tf.matmul(h1, W2) + b2, name = 'y')
        # softmax classification
        
        # run the initializer op with sess.run; this is when the variables actually receive their initial values
        sess.run(tf.initialize_all_variables())
        
        
        # define the Loss function
        cross_entropy = -tf.reduce_sum(y_target*tf.log(y), name = 'cross_entropy')

        # define optimization algorithm
        train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
        
        # list of boolean which is result of comparing the training prediction & real data
        correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_target, 1))
        
        # change true -> 1 false -> 0 and calc mean.
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

        
        
        
        
        # create summary of parameters
        tf.histogram_summary('weights_1', W1)
        tf.histogram_summary('weights_2', W2)
        tf.histogram_summary('y', y)
        tf.scalar_summary('cross_entropy', cross_entropy)
        merged = tf.merge_all_summaries()
        summary_writer = tf.train.SummaryWriter("/tmp/mlp" , sess.graph_def)
        
        
        
        # training step
        for i in range(5001): # 5001 minibatch steps
            batch = mnist.train.next_batch(150) # 150 is the minibatch size
            train_step.run(feed_dict={x: batch[0], y_target: batch[1]}) # the None dimension of each placeholder takes the minibatch size
            
            if i%500 == 0:
                train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_target: batch[1]})
                print("step %d, training accuracy: %.3f" % (i, train_accuracy))
                
                # calculate the summary and write it to the log directory
                summary = sess.run(merged, feed_dict={x:batch[0], y_target: batch[1]})
                summary_writer.add_summary(summary , i)
        
        # evaluate on the whole test set
        print("test accuracy: %g" % accuracy.eval(feed_dict={x: mnist.test.images, y_target: mnist.test.labels}))

        
        
        
        
MLP_tf_board()

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
step 0, training accuracy: 0.133
step 500, training accuracy: 0.420
step 1000, training accuracy: 0.947
step 1500, training accuracy: 0.967
step 2000, training accuracy: 0.960
step 2500, training accuracy: 0.953
step 3000, training accuracy: 0.960
step 3500, training accuracy: 0.987
step 4000, training accuracy: 0.953
step 4500, training accuracy: 0.987
step 5000, training accuracy: 0.973
test accuracy: 0.9474


<__main__.MLP_tf_board instance at 0x7fc1244fdea8>

### Running TensorBoard
Once the code has been modified as above, open a Linux shell, move to the directory that contains tensorboard.py, and run the following.


cd tensorflow/tensorflow/tensorboard

python tensorboard.py --logdir=/tmp/mlp


Then open 0.0.0.0:6006/#graphs in a web browser to see a figure like the one below.


<img src = "tensorboard_mlp.png">

## Part 2. Deep Reinforcement Learning

Deep Reinforcement Learning refers to models in which the Q function used in conventional reinforcement learning is approximated by deep learning. Representative applications include Google DeepMind's Atari agent and AlphaGo, which recently attracted wide attention.

In this part, we work through Karpathy's open-source example, in which an agent plays a simple 2D game using Deep Reinforcement Learning and learns on its own from rewards.

<img alt="Deep RL" style="border-width:0" width="600" src="example.gif?raw=true" />


(Source: https://github.com/nivwusquorum/tensorflow-deepq)

### Reinforcement Learning

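Reinforcement Learning (hereafter RL) differs from supervised learning in that the learner is not given the correct answer for each piece of data; it learns only from reward feedback on the actions it takes. The most representative algorithm for this is Q-Learning.

First, let's understand the terms and concepts used in RL.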
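
The core terms (state, action, reward, Q function, discount rate) can be seen together in a small tabular Q-Learning sketch. The four-state corridor environment, the `step` function, and all constants below are assumptions invented for this illustration; they have nothing to do with the game used later in this part.

```python
import random

N_STATES, GOAL = 4, 3          # states 0..3; reaching state 3 ends the episode
ACTIONS = [0, 1]               # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.5, 0.9        # learning rate and discount rate (illustrative)

def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[state][action]

for episode in range(500):
    state, done = 0, False
    while not done:
        action = rng.choice(ACTIONS)        # behave randomly; Q-Learning is off-policy
        next_state, reward, done = step(state, action)
        # the target assumes the best action is taken in the next state
        target = reward if done else reward + GAMMA * max(Q[next_state])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state

# the greedy policy picks the action with the highest Q value in each state
greedy_policy = [row.index(max(row)) for row in Q[:GOAL]]
print(greedy_policy)
```

No state is ever labelled with a correct action; the agent only observes a reward of 1 on reaching the goal, and the Q values propagate that reward backwards, discounted by `GAMMA` per step.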


#### Import Libraries
We are going to need the following libraries:
* **numpy**: scientific computing (matrix, algebra, random, etc.)
* **matplotlib**: plotting
* **h5py, cPickle**: efficient data saving and loading
* **tensorflow**: GPU and symbolic computing
* **IPython interact**: scrolled view

In [2]:
%matplotlib inline
import numpy as np
import tensorflow as tf

from tf_rl.controller import DiscreteDeepQ, HumanController
from tf_rl.simulation import KarpathyGame
from tf_rl import simulate

#from tf_rl.models import MLP

from __future__ import print_function

#### Environment Settings

Now we configure the game environment we want, adjusting the rewards, the number of objects, and the observations.

In [3]:
current_settings = {
    'objects': [
        'friend',
        'enemy',
    ],
    'colors': {
        'hero':   'yellow',
        'friend': 'green',
        'enemy':  'red',
    },
    'object_reward': {
        'friend': 0.1,
        'enemy': -0.1,
    },
    'hero_bounces_off_walls': False,
    'world_size': (700,500),
    'hero_initial_position': [400, 300],
    'hero_initial_speed':    [0,   0],
    "maximum_speed":         [50, 50],
    "object_radius": 10.0,
    "num_objects": {
        "friend" : 25,
        "enemy" :  25,
    },
    "num_observation_lines" : 32, # the number of antennas
    "observation_line_length": 240., # the length of antennas
    "tolerable_distance_to_wall": 50, 
    "wall_distance_penalty":  -0.0, # penalty the hero receives when it is close to a wall
    "delta_v": 50 # speed value
}

# create the game simulator
g = KarpathyGame(current_settings)

# this LOG_DIR is used for tensorboard
LOG_DIR = "/tmp/nacsi"
print(LOG_DIR)

/tmp/nacsi


#### Deep Learning Architecture

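Now let's build the deep learning model that approximates the Q function. This example uses a four-layer MLP (input, two hidden layers, and output) of the same kind as the one seen above.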

In [4]:
import math
from tf_rl.utils import base_name

class Layer(object):
    def __init__(self, input_sizes, output_size, scope):
        """Creates a neural network layer."""
        if type(input_sizes) != list:
            input_sizes = [input_sizes]

        self.input_sizes = input_sizes
        self.output_size = output_size
        self.scope       = scope or "Layer"

        with tf.variable_scope(self.scope):
            self.Ws = []
            for input_idx, input_size in enumerate(input_sizes):
                W_name = "W_%d" % (input_idx,)
                W_initializer =  tf.random_uniform_initializer(
                        -1.0 / math.sqrt(input_size), 1.0 / math.sqrt(input_size))
                W_var = tf.get_variable(W_name, (input_size, output_size), initializer=W_initializer)
                self.Ws.append(W_var)
            self.b = tf.get_variable("b", (output_size,), initializer=tf.constant_initializer(0))

    def __call__(self, xs):
        if type(xs) != list:
            xs = [xs]
        assert len(xs) == len(self.Ws), \
                "Expected %d input vectors, got %d" % (len(self.Ws), len(xs))
        with tf.variable_scope(self.scope):
            return sum([tf.matmul(x, W) for x, W in zip(xs, self.Ws)]) + self.b

    def variables(self):
        return [self.b] + self.Ws

    def copy(self, scope=None):
        scope = scope or self.scope + "_copy"

        with tf.variable_scope(scope) as sc:
            for v in self.variables():
                tf.get_variable(base_name(v), v.get_shape(),
                        initializer=lambda x,dtype=tf.float32: v.initialized_value())
            sc.reuse_variables()
            return Layer(self.input_sizes, self.output_size, scope=sc)

class MLP(object):
    def __init__(self, input_sizes, hiddens, nonlinearities, scope=None, given_layers=None):
        self.input_sizes = input_sizes
        # the observation has 5 features (distances to the objects and X,Y speed) for each of the 32 observation lines,
        # covering what is closest to the hero (friend, enemy, wall), plus 2 for the hero's own X,Y speed
        # ==> 5*32 + 2 = 162 features about the game
        self.hiddens = hiddens
        self.input_nonlinearity, self.layer_nonlinearities = nonlinearities[0], nonlinearities[1:]
        self.scope = scope or "MLP"

        assert len(hiddens) == len(nonlinearities), \
                "Number of hiddens must be equal to number of nonlinearities"

        with tf.variable_scope(self.scope):
            if given_layers is not None:
                self.input_layer = given_layers[0]
                self.layers      = given_layers[1:]
            else:
                self.input_layer = Layer(input_sizes, hiddens[0], scope="input_layer") # observation size -> 200 (162 -> 200 by the feature count noted above)
                self.layers = []

                for l_idx, (h_from, h_to) in enumerate(zip(hiddens[:-1], hiddens[1:])): # hiddens == [200, 200, 4], so this pairs each size with the next one (the index shifted by 1)
                    # (200, 200), (200, 4)
                    self.layers.append(Layer(h_from, h_to, scope="hidden_layer_%d" % (l_idx,)))
                    # counting the input, the network has 4 layers

    def __call__(self, xs):
        if type(xs) != list:
            xs = [xs]
        with tf.variable_scope(self.scope):
            hidden = self.input_nonlinearity(self.input_layer(xs))
            for layer, nonlinearity in zip(self.layers, self.layer_nonlinearities):
                hidden = nonlinearity(layer(hidden))
            return hidden

    def variables(self):
        res = self.input_layer.variables()
        for layer in self.layers:
            res.extend(layer.variables())
        return res

    def copy(self, scope=None):
        scope = scope or self.scope + "_copy"
        nonlinearities = [self.input_nonlinearity] + self.layer_nonlinearities
        given_layers = [self.input_layer.copy()] + [layer.copy() for layer in self.layers]
        return MLP(self.input_sizes, self.hiddens, nonlinearities, scope=scope,
                given_layers=given_layers)
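
The way `MLP.__init__` derives its layer shapes from `hiddens` can be reproduced without tf. This is a small sketch: `layer_shapes` is a helper name made up for this example, and the observation size of 162 is taken from the comment in the code above rather than computed from the simulator.

```python
def layer_shapes(input_size, hiddens):
    """Return the (rows, cols) weight shape of every layer, input layer first."""
    shapes = [(input_size, hiddens[0])]
    # pairing the list with itself shifted by one gives consecutive layer sizes
    shapes.extend(zip(hiddens[:-1], hiddens[1:]))
    return shapes

# 5 features per observation line, 32 lines, plus the hero's own X,Y speed
# (the count stated in the comment inside MLP.__init__)
observation_size = 5 * 32 + 2
shapes = layer_shapes(observation_size, [200, 200, 4])

# each layer has rows*cols weights plus one bias per output unit
num_params = sum(rows * cols + cols for rows, cols in shapes)
print(shapes, num_params)   # [(162, 200), (200, 200), (200, 4)] 73604
```

The last entry of `hiddens` is the number of actions, so the final layer outputs one Q value per action.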

#### Make an Agent

Now we set up the Discrete Deep Q learning algorithm as the agent so that it learns while playing this game.

In [5]:
# Tensorflow business - it is always good to reset a graph before creating a new controller.
tf.reset_default_graph()
session = tf.InteractiveSession()

# This little guy will let us run tensorboard
#      tensorboard --logdir [LOG_DIR]
journalist = tf.train.SummaryWriter(LOG_DIR)

# Brain maps from observation to Q values for different actions.
# Here it is done using a multi-layer perceptron with 2 hidden
# layers
brain = MLP([g.observation_size,], [200, 200, g.num_actions], 
            [tf.tanh, tf.tanh, tf.identity])

# The optimizer to use. Here we use RMSProp as recommended
# by the publication
optimizer = tf.train.RMSPropOptimizer(learning_rate= 0.001, decay=0.9)

# DiscreteDeepQ object
current_controller = DiscreteDeepQ(g.observation_size, g.num_actions, brain, optimizer, session,
                                   discount_rate=0.99, exploration_period=5000, max_experience=10000, 
                                   store_every_nth=4, train_every_nth=4,
                                   summary_writer=journalist)

session.run(tf.initialize_all_variables())
session.run(current_controller.target_network_update)
# graph was not available when journalist was created  
journalist.add_graph(session.graph_def)
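
The arguments `discount_rate`, `exploration_period`, and `max_experience` correspond to ideas that are common in deep Q learning: a discounted one-step target, an exploration rate that decays over time, and a bounded replay memory. The sketch below shows how such parameters are typically used. It is an assumption-laden illustration, not the actual code of `DiscreteDeepQ`: the function names, the linear schedule, and the start and end exploration rates are all hypothetical.

```python
import random
from collections import deque

def linear_epsilon(step, exploration_period, start=1.0, end=0.05):
    """Probability of a random action, annealed linearly over exploration_period steps."""
    fraction = min(1.0, step / float(exploration_period))
    return start + fraction * (end - start)

def td_target(reward, next_q_values, discount_rate):
    """One-step Q-learning target: r + gamma * max over a' of Q(s', a')."""
    return reward + discount_rate * max(next_q_values)

# replay memory: once full, the oldest experience is dropped automatically
experience = deque(maxlen=10000)
for t in range(12000):
    # (observation, action, reward, next_observation); dummy values here
    experience.append((t, 0, 0.0, t + 1))

rng = random.Random(0)
minibatch = rng.sample(list(experience), 32)   # random draws break up temporal correlation

print(len(experience), experience[0][0])
print(linear_epsilon(0, 5000), linear_epsilon(2500, 5000), linear_epsilon(9999, 5000))
print(td_target(0.1, [0.0, 0.5, -0.2, 0.3], 0.99))
```

Under these assumptions, a longer exploration period keeps the agent acting randomly for more steps, and a larger memory lets training draw on older experience.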

#### Play the Game

Watch the reinforcement learning process as the agent actually plays the game.

In [6]:
FPS          = 30
ACTION_EVERY = 3
    
fast_mode = True
if fast_mode:
    WAIT, VISUALIZE_EVERY = False, 20
else:
    WAIT, VISUALIZE_EVERY = True, 1

    
try:
    with tf.device("/gpu:0"):
        simulate(simulation=g,
                 controller=current_controller,
                 fps=FPS,
                 visualize_every=VISUALIZE_EVERY,
                 action_every=ACTION_EVERY,
                 wait=WAIT,
                 disable_training=False,
                 simulation_resolution=0.001,
                 save_path=None)
except KeyboardInterrupt:
    print("Interrupted")

Interrupted


#### Contest!
Now change the environment settings and the model architecture above, and train your own agent so that the difference between enemy and friend becomes as large as possible.