# Action Space

The action is an ndarray with shape (1,) that takes a value in {0, 1}, indicating the direction of the fixed force applied to the cart.

0: Push cart to the left

1: Push cart to the right

Note: The change in velocity produced by the applied force is not fixed; it depends on the angle of the pole. The pole's center of gravity changes the amount of energy needed to move the cart underneath it.

# Observation Space

The observation is an ndarray with shape (4,) whose values are, in order: cart position, cart velocity, pole angle, and pole angular velocity.

We will use the CartPole environment provided by gym, an open-source Python library that provides many environments for reinforcement learning, such as Atari games. In CartPole, a pole stands on a cart that can move left or right. The goal of the agent is to keep the pole upright by applying a force to the cart at every time step. The agent receives a reward of 1 for every time step the episode continues. An episode ends when the pole is more than 12° from the vertical or when the cart is more than 2.4 units from the centre.
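
The end-of-episode rule described above can be written as a small check. This is a minimal sketch, not gym's own code: `episode_should_end` is a hypothetical helper, and the limits are arguments so that a different threshold can be passed in. The 12° and 2.4 values in the usage lines are what we believe Gym's CartPole-v1 source uses.

```python
import math

def episode_should_end(cart_position, pole_angle_rad, *, position_limit, angle_limit_deg):
    """Return True when the cart or the pole has left the allowed range."""
    angle_limit_rad = math.radians(angle_limit_deg)
    return abs(cart_position) > position_limit or abs(pole_angle_rad) > angle_limit_rad

# Assumed limits (12 degrees, 2.4 units); pass other values if your version differs.
limits = dict(position_limit=2.4, angle_limit_deg=12.0)
print(episode_should_end(0.0, math.radians(5.0), **limits))   # False: pole nearly upright
print(episode_should_end(0.0, math.radians(13.0), **limits))  # True: pole tilted too far
print(episode_should_end(2.5, 0.0, **limits))                 # True: cart out of bounds
```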

In [1]:
 conda install -c conda-forge tensorflow 

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [8]:
import numpy as np
import gym
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import mean_squared_error
from matplotlib import pyplot as plt

The agent has to:

1. compute the action to choose for a given state
2. store its experiences in a memory buffer
3. train the DNN by sampling a batch of experiences from the memory buffer
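
Steps 2 and 3 can be sketched independently of the network. The example below is one possible way to cap the memory, using `collections.deque` from the standard library. `ReplayBuffer` and its method names are illustrative, not part of gym or Keras, and the capacity and batch size are arbitrary small values chosen for the demonstration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory: the oldest experience is dropped once capacity is reached."""

    def __init__(self, capacity):
        self.experiences = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.experiences.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample without replacement; never ask for more than is stored.
        k = min(batch_size, len(self.experiences))
        return random.sample(list(self.experiences), k)

    def __len__(self):
        return len(self.experiences)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.store(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

print(len(buffer))                              # 3: only the last three steps remain
print([exp[0] for exp in buffer.experiences])   # [2, 3, 4]
batch = buffer.sample(batch_size=2)             # two distinct stored experiences
```

Sampling by copy leaves the stored order untouched, so the oldest experience is always the one that gets evicted.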

The agent is more likely to explore the environment at the beginning by choosing random actions, because it has no idea how the environment works. As time steps accumulate, the agent gains knowledge, so it becomes more likely to exploit that knowledge than to pick random actions. For that purpose, we will use the epsilon-greedy algorithm.
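
As a minimal sketch of the idea, the two functions below separate the action choice from the decay of the exploration probability. `choose_action` and `decayed_epsilon` are hypothetical names used only for this illustration; the exponential decay and the rate 0.005 mirror what the agent's code uses, while the Q-values are made up.

```python
import math
import random

def choose_action(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the highest Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(n_episodes, decay_rate, initial=1.0):
    """Epsilon after n_episodes multiplicative updates by exp(-decay_rate)."""
    return initial * math.exp(-decay_rate * n_episodes)

print(choose_action([0.2, 0.7], epsilon=0.0))   # 1: always greedy when epsilon is 0
print(round(decayed_epsilon(100, 0.005), 4))    # 0.6065
```

With this schedule the agent still explores about 61% of the time after 100 episodes, so the decay rate directly controls how long the exploration phase lasts.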



In [26]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.n_actions = action_size
        # we define some parameters and hyperparameters:
        # "lr" : learning rate
        # "gamma": discount factor
        # "exploration_proba_decay": decay rate of the exploration probability
        # "batch_size": number of experiences we sample to train the DNN
        self.lr = 0.001
        self.gamma = 0.99
        self.exploration_proba = 1.0
        self.exploration_proba_decay = 0.005
        self.batch_size = 32
        
        # We define our memory buffer where we will store our experiences
        # We store only the last 2000 time steps
        self.memory_buffer= list()
        self.max_memory_buffer = 2000
        
        # We create our model with two hidden layers of 24 units (neurons)
        # The input size of the first layer equals the state size
        # The last layer has one unit per action
        self.model = Sequential([
            Dense(units=24,input_dim=state_size, activation = 'relu'),
            Dense(units=24,activation = 'relu'),
            Dense(units=action_size, activation = 'linear')
        ])
        # "learning_rate" replaces the deprecated "lr" argument of Adam
        self.model.compile(loss="mse",
                      optimizer = Adam(learning_rate=self.lr))
        
    # The agent computes the action to perform given a state 
    def compute_action(self, current_state):
        # We sample a variable uniformly over [0,1]
        # if the variable is less than the exploration probability
        #     we choose an action randomly
        # else
        #     we forward the state through the DNN and choose the action 
        #     with the highest Q-value.
        if np.random.uniform(0,1) < self.exploration_proba:
            return np.random.choice(range(self.n_actions))
        q_values = self.model.predict(current_state, verbose=0)[0]
        # print("This is qvalues:",q_values)
        return np.argmax(q_values)

    # When an episode is finished, we decay the exploration probability
    # (the epsilon of the epsilon-greedy algorithm) exponentially
    def update_exploration_probability(self):
        self.exploration_proba = self.exploration_proba * np.exp(-self.exploration_proba_decay)
        print(self.exploration_proba)
    
    # At each time step, we store the corresponding experience
    def store_episode(self,current_state, action, reward, next_state, done):
        # We use a dictionary to store them
        self.memory_buffer.append({
            "current_state":current_state,
            "action":action,
            "reward":reward,
            "next_state":next_state,
            "done" :done
        })
        # If the size of memory buffer exceeds its maximum, we remove the oldest experience
        if len(self.memory_buffer) > self.max_memory_buffer:
            self.memory_buffer.pop(0)
    

    # At the end of each episode, we train our model
    def train(self):
        # We sample a batch of experiences without reordering the buffer,
        # so that pop(0) in store_episode still removes the oldest experience
        n_samples = min(self.batch_size, len(self.memory_buffer))
        indices = np.random.choice(len(self.memory_buffer), size=n_samples, replace=False)
        batch_sample = [self.memory_buffer[i] for i in indices]
        
        # We iterate over the selected experiences
        for experience in batch_sample:
            # We compute the Q-values of S_t
            q_current_state = self.model.predict(experience["current_state"], verbose=0)
           
            # We compute the Q-target using the Bellman optimality equation
            q_target = experience["reward"]
            
            if not experience["done"]:
                q_target = q_target + self.gamma*np.max(self.model.predict(experience["next_state"], verbose=0)[0])
            # Only the Q-value of the action actually taken is replaced by the target
            q_current_state[0][experience["action"]] = q_target
            # train the model
            self.model.fit(experience["current_state"], q_current_state, verbose=0)
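
The target computed inside the train method above can be checked by hand without a network. The sketch below uses plain lists in place of model predictions; `bellman_target` and `updated_q_row` are hypothetical helpers written for this illustration, and the Q-values are made up.

```python
def bellman_target(reward, next_q_values, done, gamma):
    """Q-target: the reward alone at a terminal step, otherwise the reward
    plus the discounted best Q-value of the next state."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def updated_q_row(current_q_values, action, target):
    """Copy the predicted Q-values and overwrite only the taken action."""
    row = list(current_q_values)
    row[action] = target
    return row

target = bellman_target(reward=1.0, next_q_values=[0.5, 2.0], done=False, gamma=0.99)
print(round(target, 2))                          # 2.98
print(bellman_target(1.0, [0.5, 2.0], True, 0.99))  # 1.0: no bootstrap at episode end
print(updated_q_row([0.3, 0.4], action=1, target=5.0))  # [0.3, 5.0]
```

Because the other entries of the row keep their predicted values, the squared error is zero for the actions that were not taken, and only the chosen action contributes to the gradient.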
            


In [27]:
# We create our gym environment 
env = gym.make("CartPole-v1")
# We get the shape of a state and the actions space size
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
# Number of episodes to run
n_episodes = 500
# Max iterations per episode
max_iteration_ep = 1000
# We define our agent
agent = DQNAgent(state_size, action_size)
total_steps = 0

In [28]:
env.observation_space

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)

In [29]:
env.action_space

Discrete(2)

In [32]:
# We iterate over episodes
for e in range(n_episodes):
    # We initialize the first state and reshape it to fit
    # the input layer of the DNN
    current_state = env.reset()
    current_state = np.array([current_state])
    for step in range(max_iteration_ep):
        total_steps = total_steps + 1
        # the agent computes the action to perform
        action = agent.compute_action(current_state)
        # the environment runs the action and returns
        # the next state, a reward and whether the episode is done
        next_state, reward, done, _ = env.step(action)
        next_state = np.array([next_state])
        
        # We store each experience in the memory buffer
        agent.store_episode(current_state, action, reward, next_state, done)
        
        # if the episode has ended, we leave the loop after
        # updating the exploration probability
        if done:
            agent.update_exploration_probability()
            break
        current_state = next_state
    print("episodes: %d" % e)
    # if we have at least batch_size experiences in the memory buffer
    # then we train our model at the end of the episode
    if total_steps >= agent.batch_size:
        agent.train()
print("----------", total_steps, agent.batch_size)

0.10228420671553794
episodes: 0
0.1017740621062842
episodes: 1
0.10126646185388387
episodes: 2
0.10076139326830418
episodes: 3
0.10025884372280419
episodes: 4
0.0997588006536191
episodes: 5
0.09926125155964612
episodes: 6
0.09876618400213198
episodes: 7
0.09827358560436199
episodes: 8
0.09778344405135052
episodes: 9
0.09729574708953323
episodes: 10
0.09681048252646067
episodes: 11
0.09632763823049348
episodes: 12
0.09584720213049912
episodes: 13
0.09536916221555007
episodes: 14
0.09489350653462356
episodes: 15
0.09442022319630279
episodes: 16
0.09394930036847965
episodes: 17
0.09348072627805892
episodes: 18
0.09301448921066394
episodes: 19
0.09255057751034372
episodes: 20
0.09208897957928161
episodes: 21
0.0916296838775053
episodes: 22
0.0911726789225983
episodes: 23
0.09071795328941294
episodes: 24
0.09026549560978472
episodes: 25
0.08981529457224809
episodes: 26
0.08936733892175364
episodes: 27
0.08892161745938677
episodes: 28
0.08847811904208774
episodes: 29
0.088036832582373
episod

0.030197383422318678
episodes: 244
0.030046773344173314
episodes: 245
0.02989691443692649
episodes: 246
0.029747802954097725
episodes: 247
0.029599435167892176
episodes: 248
0.029451807369107463
episodes: 249
0.029304915867040926
episodes: 250
0.029158756989397364
episodes: 251
0.029013327082197223
episodes: 252
0.028868622509685252
episodes: 253
0.028724639654239596
episodes: 254
0.028581374916281373
episodes: 255
0.02843882471418467
episodes: 256
0.028296985484187014
episodes: 257
0.028155853680300266
episodes: 258
0.028015425774221975
episodes: 259
0.02787569825524718
episodes: 260
0.027736667630180623
episodes: 261
0.027598330423249443
episodes: 262
0.027460683176016257
episodes: 263
0.02732372244729272
episodes: 264
0.027187444813053473
episodes: 265
0.02705184686635057
episodes: 266
0.026916925217228275
episodes: 267
0.026782676492638335
episodes: 268
0.02664909733635564
episodes: 269
0.026516184408894333
episodes: 270
0.0263839343874243
episodes: 271
0.026252343965688117
episode

0.009279013887064798
episodes: 480
0.009232734612231673
episodes: 481
0.009186686156244725
episodes: 482
0.009140867367890156
episodes: 483
0.009095277101695873
episodes: 484
0.009049914217902844
episodes: 485
0.009004777582436613
episodes: 486
0.008959866066878942
episodes: 487
0.008915178548439604
episodes: 488
0.008870713909928309
episodes: 489
0.008826471039726778
episodes: 490
0.008782448831760954
episodes: 491
0.008738646185473343
episodes: 492
0.008695062005795506
episodes: 493
0.008651695203120684
episodes: 494
0.00860854469327655
episodes: 495
0.00856560939749811
episodes: 496
0.00852288824240073
episodes: 497
0.008480380159953314
episodes: 498
0.008438084087451583
episodes: 499
---------- 10616 32

In [70]:
from gym import wrappers
def make_video():
#     env_to_wrap = gym.make('CartPole-v1')
#     env = wrappers.Monitor(env_to_wrap, 'videos', force = True)
    rewards = 0
    steps = 0
    done = False
    state = env.reset()
    state = np.array([state])
    while not done:
        action = agent.compute_action(state)
        state, reward, done, _ = env.step(action)
        state = np.array([state])            
        steps += 1
        print("%3d " % steps, end="")
        rewards += reward
    print(rewards)
    #env.close()
    #env_to_wrap.close()
make_video()


  1   2   3   4   5   6   7   8   9  10 10.0


In [25]:
pip install pyglet

Collecting pyglet
  Downloading pyglet-2.0.4-py3-none-any.whl (831 kB)
Installing collected packages: pyglet
Successfully installed pyglet-2.0.4
Note: you may need to restart the kernel to use updated packages.
