In [6]:
import gym

env = gym.make('CartPole-v0')
print(env.action_space)
print(env.observation_space)

Discrete(2)
Box(4,)


CartPole is a problem consisting of a cart that can move horizontally and a pole (a pendulum) attached to the cart. The goal is to keep the pole balanced upright above the cart. The observation consists of the position and velocity of the cart, together with the angle and angular velocity of the pole. The two actions are a push to the left or a push to the right.

To learn a strategy for this problem we can use Deep Q-learning. In this method we set up a deep neural network whose input is the observation and whose output is an estimated Q-value for each possible action; the agent then chooses its action based on these values. The Bellman equation gives the update

$Q(S_t, A_t,\theta) \leftarrow Q(S_t, A_t,\theta) + \alpha [R_t + \gamma \max_a Q(S_{t+1}, a,\theta) - Q(S_t,A_t, \theta)]$

where $\theta$ denotes the weights of the Q-network. This can be recast as the problem of minimizing the loss function

$L(\theta) = |\underline{R_t + \gamma \underset{a}{\max} Q(S_{t+1}, a,\theta)} - Q(S_t,A_t,\theta)|^2$
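To make the loss concrete, here is a minimal sketch in plain Python that evaluates the target and the squared error for a single transition. The Q-values are made-up numbers standing in for network outputs, the function names `td_target` and `squared_td_error` are hypothetical, and the handling of the `done` flag (dropping the bootstrap term at the end of an episode) is a common convention assumed here rather than something stated in the text.

```python
def td_target(reward, next_q_values, gamma, done):
    """Return R_t + gamma * max_a Q(S_{t+1}, a), or just R_t at episode end."""
    if done:
        # Assumption: no future reward is bootstrapped from a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(q_value, target):
    """Loss for one transition: |target - Q(S_t, A_t)|^2."""
    return (target - q_value) ** 2


gamma = 0.99
q_sa = 0.5             # made-up current estimate Q(S_t, A_t)
next_q = [0.2, 1.0]    # made-up estimates Q(S_{t+1}, a) for a = 0, 1

target = td_target(reward=1.0, next_q_values=next_q, gamma=gamma, done=False)
loss = squared_td_error(q_sa, target)
print("target:", target, "loss:", loss)
```

In a real DQN the same quantities are computed for a whole batch of transitions, and the gradient of the loss with respect to $\theta$ drives the weight update.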

Because the target we optimize towards (the underlined part of the loss function) is not stationary, that is, it changes as the network is trained, we cannot guarantee that training converges. To make training more stable, it is common to use two networks, a target network and a training network, where the weights of the target network are only updated (copied from the training network) after a given number of iterations.
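The two-network idea can be sketched without any deep learning library. In the toy example below, `TinyQNetwork`, `train_step` and `sync_every` are hypothetical names, and the "training step" merely nudges the weights instead of following a gradient; the only point is to show that the target weights stay frozen between periodic copies.

```python
import copy


class TinyQNetwork:
    """Stand-in for a neural network: just a list of weights (hypothetical)."""

    def __init__(self, weights):
        self.weights = list(weights)


def train_step(network, step_size=0.1):
    # Placeholder for a gradient step: nudge every weight by a fixed amount.
    network.weights = [w + step_size for w in network.weights]


online = TinyQNetwork([0.0, 0.0])
target = copy.deepcopy(online)   # the target network starts as an exact copy
sync_every = 4                   # hypothetical number of steps between copies

history = []
for step in range(1, 9):
    train_step(online)
    if step % sync_every == 0:
        # Hard update: overwrite the target weights with the online weights.
        target.weights = list(online.weights)
    history.append((step, online.weights[0], target.weights[0]))

for row in history:
    print(row)
```

In keras-rl, the interval between such copies is, to my knowledge, set through the `target_model_update` argument of `DQNAgent`; check the library documentation before relying on it.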

In [8]:
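from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam

# keras-rl feeds observations with a leading window dimension (window_length=1)
input_shape = (1,) + env.observation_space.shape
num_outputs = env.action_space.n  # one Q-value per action

model = Sequential()
model.add(Flatten(input_shape=input_shape))
# Define the model: the hidden layers below are only one example choice
model.add(Dense(16, activation='relu'))
model.add(Dense(16, activation='relu'))
# Linear output, since Q-values are not probabilities
model.add(Dense(num_outputs, activation='linear'))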



The library <code>keras-rl</code> comes with built-in support for deep Q-networks. Here we set up an agent from a model, a memory of past transitions and a policy (the method used to select an action), and then train it on the environment. We use an epsilon-greedy policy, which usually picks the action with the highest Q-value but occasionally picks a random action in order to explore.
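As an illustration of what an epsilon-greedy policy does, here is a minimal standard-library sketch. It is not keras-rl's implementation; the function name `eps_greedy_action` and the Q-values are made up for the example.

```python
import random


def eps_greedy_action(q_values, eps, rng=random):
    """With probability eps pick a uniformly random action, else the argmax."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)   # seeded generator so the run is repeatable
q_values = [0.1, 0.9]    # made-up Q-values for the actions "left" and "right"

actions = [eps_greedy_action(q_values, eps=0.1, rng=rng) for _ in range(1000)]
greedy_share = actions.count(1) / len(actions)
print("share of greedy actions:", greedy_share)
```

With `eps=0.1` and two actions, roughly 95% of the choices should be the greedy one, since half of the random picks also land on it.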

In [9]:
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

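# Epsilon-greedy exploration: mostly the best action, sometimes a random one
policy = EpsGreedyQPolicy()
# Replay memory holding up to 50000 past transitions
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=env.action_space.n, memory=memory, policy=policy)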

dqn.compile(Adam(), metrics=['mse'])

dqn.fit(env, nb_steps=5000, verbose=2)


Training for 5000 steps ...
   35/5000: episode: 1, duration: 0.059s, episode steps: 35, steps per second: 590, episode reward: 35.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.486 [0.000, 1.000], mean observation: -0.089 [-0.846, 0.325], loss: --, mean_squared_error: --, mean_q: --
   85/5000: episode: 2, duration: 0.015s, episode steps: 50, steps per second: 3280, episode reward: 50.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.520 [0.000, 1.000], mean observation: 0.115 [-0.286, 0.821], loss: --, mean_squared_error: --, mean_q: --
  128/5000: episode: 3, duration: 0.010s, episode steps: 43, steps per second: 4391, episode reward: 43.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.488 [0.000, 1.000], mean observation: -0.080 [-0.882, 0.308], loss: --, mean_squared_error: --, mean_q: --
  243/5000: episode: 4, duration: 0.025s, episode steps: 115, steps per second: 4603, episode reward: 115.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.496 [0.000, 1.0

 1783/5000: episode: 31, duration: 0.129s, episode steps: 66, steps per second: 513, episode reward: 66.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.515 [0.000, 1.000], mean observation: 0.098 [-0.342, 0.801], loss: 0.276818, mean_squared_error: 0.325292, mean_q: 0.874250
 1825/5000: episode: 32, duration: 0.162s, episode steps: 42, steps per second: 259, episode reward: 42.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.476 [0.000, 1.000], mean observation: -0.096 [-0.694, 0.174], loss: 0.265511, mean_squared_error: 0.309228, mean_q: 0.880464
 1892/5000: episode: 33, duration: 0.200s, episode steps: 67, steps per second: 334, episode reward: 67.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.493 [0.000, 1.000], mean observation: -0.074 [-0.851, 0.600], loss: 0.264466, mean_squared_error: 0.308852, mean_q: 0.886799
 1934/5000: episode: 34, duration: 0.103s, episode steps: 42, steps per second: 409, episode reward: 42.000, mean reward: 1.000 [1.000, 1.000], mean 

 3266/5000: episode: 61, duration: 0.123s, episode steps: 49, steps per second: 399, episode reward: 49.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.490 [0.000, 1.000], mean observation: -0.086 [-0.770, 0.264], loss: 0.221331, mean_squared_error: 0.256678, mean_q: 0.948590
 3316/5000: episode: 62, duration: 0.113s, episode steps: 50, steps per second: 444, episode reward: 50.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.520 [0.000, 1.000], mean observation: 0.125 [-0.290, 0.814], loss: 0.233150, mean_squared_error: 0.273275, mean_q: 0.941777
 3428/5000: episode: 63, duration: 0.277s, episode steps: 112, steps per second: 405, episode reward: 112.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.500 [0.000, 1.000], mean observation: -0.002 [-1.105, 0.424], loss: 0.223950, mean_squared_error: 0.259859, mean_q: 0.947021
 3483/5000: episode: 64, duration: 0.154s, episode steps: 55, steps per second: 357, episode reward: 55.000, mean reward: 1.000 [1.000, 1.000], mea

<keras.callbacks.History at 0x12db367b8>

Finally, we can test our agent on the problem by running <code>dqn.test()</code>.

In [None]:
dqn.test(env, nb_episodes=5, visualize=True)
env.close()

If you finish early, you can try training an agent on a more complicated environment and network, for example a CNN on the <code>"Breakout-v0"</code> environment.