# Action Space

The action is a ndarray with shape (1,) which can take values {0, 1} indicating the direction of the fixed force the cart is pushed with.

0: Push cart to the left

1: Push cart to the right

Note: The velocity that is reduced or increased by the applied force is not fixed and it depends on the angle the pole is pointing. The center of gravity of the pole varies the amount of energy needed to move the cart underneath it

Observation Space
The observation is a ndarray with shape (4,) with the values corresponding to the following positions and velocities

# We will use CartPole environment provided by gym, an opensource python library which provides many environments for Reinforcement Learning such as Atari Games. In CartPole, we have a pole standing on a cart which can move. The goal of the agent is to keep the pole up by applying some force on it every time step. When the pole is less than 15° from the vertical, the agent receives a reward of 1. An episode is ended when the pole is more than 15° far from the vertical or when the cart position exceeds 2.4 units from the centre.

In [1]:
 conda install -c conda-forge tensorflow 

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install numpy==1.21

Note: you may need to restart the kernel to use updated packages.


In [1]:
import numpy as np
import gym
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import mean_squared_error
from matplotlib import pyplot as plt

2023-02-22 09:46:04.927963: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


The agent has to:

1.compute the action to choose for a given state
2.store its experiences in a memory buffer
3.train the DNN by sampling a batch of experiences from the memory buffer

The agent is more likely to explore the environment in the beginning by choosing random actions because he has no idea about how the environment works. Through time steps, the agent gets more and more knowledge, so he is more likely to exploit his knowledge rather than picking random actions. For that purpose, we will use the epsilon greedy algorithm.



In [10]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.n_actions = action_size
        # we define some parameters and hyperparameters:
        # "lr" : learning rate
        # "gamma": discounted factor
        # "exploration_proba_decay": decay of the exploration probability
        # "batch_size": size of experiences we sample to train the DNN
        self.lr = 0.001
        self.gamma = 0.99
        self.exploration_proba = 1.0
        self.exploration_proba_decay = 0.005
        self.batch_size = 32
        
        # We define our memory buffer where we will store our experiences
        # We stores only the 2000 last time steps
        self.memory_buffer= list()
        self.max_memory_buffer = 2000
        
        # We creaate our model having to hidden layers of 24 units (neurones)
        # The first layer has the same size as a state size
        # The last layer has the size of actions space
        self.model = Sequential([
            Dense(units=24,input_dim=state_size, activation = 'relu'),
            Dense(units=24,activation = 'relu'),
            Dense(units=action_size, activation = 'linear')
        ])
        self.model.compile(loss="mse",
                      optimizer = Adam(lr=self.lr))
        
    # The agent computes the action to perform given a state 
    def compute_action(self, current_state):
        # We sample a variable uniformly over [0,1]
        # if the variable is less than the exploration probability
        #     we choose an action randomly
        # else
        #     we forward the state through the DNN and choose the action 
        #     with the highest Q-value.
        if np.random.uniform(0,1) < self.exploration_proba:
            return np.random.choice(range(self.n_actions))
        q_values = self.model.predict(current_state)[0]
        print("This is qvalues:",q_values)
        return np.argmax(q_values)

    # when an episode is finished, we update the exploration probability using 
    # espilon greedy algorithm
    def update_exploration_probability(self):
        self.exploration_proba = self.exploration_proba * np.exp(-self.exploration_proba_decay)
        print(self.exploration_proba)
    
    # At each time step, we store the corresponding experience
    def store_episode(self,current_state, action, reward, next_state, done):
        #We use a dictionnary to store them
        self.memory_buffer.append({
            "current_state":current_state,
            "action":action,
            "reward":reward,
            "next_state":next_state,
            "done" :done
        })
        # If the size of memory buffer exceeds its maximum, we remove the oldest experience
        if len(self.memory_buffer) > self.max_memory_buffer:
            self.memory_buffer.pop(0)
    

    # At the end of each episode, we train our model
    def train(self):
        # We shuffle the memory buffer and select a batch size of experiences
        np.random.shuffle(self.memory_buffer)
        batch_sample = self.memory_buffer[0:self.batch_size]
        
        # We iterate over the selected experiences
        for experience in batch_sample:
            # We compute the Q-values of S_t
            q_current_state = self.model.predict(experience["current_state"])
           
            # We compute the Q-target using Bellman optimality equation
            q_target = experience["reward"]
            
            if not experience["done"]:
                q_target = q_target + self.gamma*np.max(self.model.predict(experience["next_state"])[0])
            q_current_state[0][experience["action"]] = q_target
            # train the model
            self.model.fit(experience["current_state"], q_current_state, verbose=0)
            


In [13]:
# We create our gym environment 
env = gym.make("CartPole-v1")
# We get the shape of a state and the actions space size
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
# Number of episodes to run
n_episodes = 100
# Max iterations per epiode
max_iteration_ep = 100
# We define our agent
agent = DQNAgent(state_size, action_size)
total_steps = 0

In [20]:
env.observation_space

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)

In [16]:
env.action_space

Discrete(2)

In [9]:
# We iterate over episodes
for e in range(n_episodes):
    # We initialize the first state and reshape it to fit 
    #  with the input layer of the DNN
    current_state = env.reset()
    current_state = np.array([current_state])
    for step in range(max_iteration_ep):
        total_steps = total_steps + 1
        # the agent computes the action to perform
        action = agent.compute_action(current_state)
        # the envrionment runs the action and returns
        # the next state, a reward and whether the agent is done
        next_state, reward, done, _ = env.step(action)
        next_state = np.array([next_state])
        
        # We sotre each experience in the memory buffer
        agent.store_episode(current_state, action, reward, next_state, done)
        
        # if the episode is ended, we leave the loop after
        # updating the exploration probability
        if done:
            agent.update_exploration_probability()
            break
        current_state = next_state
    # if the have at least batch_size experiences in the memory buffer
    # than we tain our model
if total_steps >= agent.batch_size:
    agent.train()

0.9950124791926823
0.9900498337491681
0.9851119396030628
0.9801986733067554
This is qvalues: [-0.03394986  0.04676377]
0.9753099120283327
0.9704455335485083
This is qvalues: [-0.00942048  0.00686342]
0.9656054162575666
0.9607894391523233
0.9559974818331
This is qvalues: [-0.2033608   0.04340545]
0.951229424500714
This is qvalues: [-0.00841642  0.00019725]
This is qvalues: [-0.16224572  0.02028988]
This is qvalues: [-0.23589367  0.02649503]
This is qvalues: [-0.4270761   0.06641441]
0.9464851479534839
0.9417645335842488
This is qvalues: [-0.17770234 -0.0128769 ]
This is qvalues: [-0.01876265 -0.00442043]
This is qvalues: [-0.2544458  0.0325833]
This is qvalues: [-0.33091837  0.03977161]
0.9370674633774035
This is qvalues: [-0.09428612  0.12547003]
0.9323938199059484
This is qvalues: [-0.02460052  0.05056434]
0.9277434863285531
This is qvalues: [-0.15779871  0.01826962]
0.923116346386636
This is qvalues: [-0.14985436  0.01078222]
This is qvalues: [-0.16561212  0.04216303]
0.9185122844014

This is qvalues: [-0.3612565   0.05811743]
This is qvalues: [-0.4439076   0.06851766]
0.8228346580560187
This is qvalues: [-0.05633728 -0.00298397]
This is qvalues: [-0.31606147  0.05032094]
0.8187307530779823
This is qvalues: [-0.04433158  0.07162546]
This is qvalues: [-0.13664299 -0.00498997]
This is qvalues: [-0.1581685   0.01915083]
0.814647316411415
This is qvalues: [-0.04061197  0.06242791]
This is qvalues: [-0.04571178  0.0590633 ]
This is qvalues: [-0.0219635   0.02410556]
This is qvalues: [-0.23308504  0.03404928]
0.8105842459701875
This is qvalues: [-0.01074314  0.00258449]
This is qvalues: [-0.15065585  0.04423893]
This is qvalues: [-0.223471    0.04838858]
This is qvalues: [-0.29963058  0.05477029]
0.8065414401773273
This is qvalues: [-0.00821258  0.0235431 ]
This is qvalues: [-0.00451325  0.00581584]
This is qvalues: [-0.00780102  0.00452966]
This is qvalues: [-0.00819981  0.00459666]
This is qvalues: [-0.00748028  0.00468128]
This is qvalues: [-0.06242305  0.01240518]
Thi

This is qvalues: [-0.07876107  0.01309354]
This is qvalues: [-0.24212332  0.03551676]
This is qvalues: [-0.3990477   0.05332916]
0.74453158746591
This is qvalues: [-0.01516629  0.01080819]
This is qvalues: [-0.07064234  0.00533037]
This is qvalues: [-0.20835625  0.01362715]
This is qvalues: [-0.25355726  0.04407074]
This is qvalues: [-0.36882147  0.06870224]
0.7408182206817184
This is qvalues: [-0.01268918  0.01439272]
This is qvalues: [-0.12884769  0.01646636]
This is qvalues: [-0.21207035  0.02931551]
This is qvalues: [-0.311028    0.05275404]
0.7371233743916283
This is qvalues: [-0.03788179  0.04457316]
This is qvalues: [-0.17842253 -0.00019844]
This is qvalues: [-0.27545562  0.06682114]
0.7334469562242899
This is qvalues: [-0.0001368  0.0015094]
This is qvalues: [-0.05731825  0.00131532]
0.7297888742690575
This is qvalues: [-0.211812    0.01526487]
This is qvalues: [-0.43953893  0.04465508]
This is qvalues: [-0.5227098   0.05844929]
0.7261490370736916
This is qvalues: [-0.04216451 

This is qvalues: [-0.08425769  0.01444852]
This is qvalues: [-0.03328871  0.0173042 ]
This is qvalues: [-0.10305667  0.02116887]
This is qvalues: [-0.25007883  0.03162582]
This is qvalues: [-0.3280346   0.03999334]
0.6804506362045882
This is qvalues: [-0.0571592   0.09022731]
This is qvalues: [-0.04743806  0.06990758]
This is qvalues: [-0.05699448  0.07998758]
This is qvalues: [-0.04503801  0.05678665]
This is qvalues: [-0.02087423  0.0224275 ]
This is qvalues: [-0.0416894  -0.01172743]
This is qvalues: [-0.1727561  -0.00168429]
This is qvalues: [-0.24070948  0.00466057]
This is qvalues: [-0.24685976  0.01947521]
This is qvalues: [-0.31822067  0.02684866]
This is qvalues: [-0.39539236  0.03789607]
This is qvalues: [-0.47570723  0.05076232]
This is qvalues: [-0.6465596   0.08194171]
0.6770568744981653
This is qvalues: [-0.03505785  0.06112056]
This is qvalues: [-0.02383338  0.03799243]
This is qvalues: [-0.04363234 -0.01074832]
This is qvalues: [-0.10712811 -0.00169772]
This is qvalues:

This is qvalues: [-0.6217134   0.08462347]
0.6505090947233172
This is qvalues: [-0.00547955  0.0051594 ]
This is qvalues: [-0.06357381  0.09737349]
This is qvalues: [-0.07839029  0.11224216]
This is qvalues: [-0.09465713  0.1288977 ]
This is qvalues: [-0.08725823  0.10999274]
This is qvalues: [-0.06137855  0.07794021]
This is qvalues: [-0.04576014  0.05029885]
This is qvalues: [-0.02361836 -0.0028032 ]
This is qvalues: [-0.20132521 -0.0012432 ]
This is qvalues: [-0.12839456  0.01684655]
This is qvalues: [-0.19219252  0.02142124]
This is qvalues: [-0.26035446  0.03157733]
This is qvalues: [-0.20267281  0.06536151]
This is qvalues: [-0.21684754  0.07505934]
This is qvalues: [-0.29446352  0.0682743 ]
This is qvalues: [-0.48466873  0.09680365]
This is qvalues: [-0.5690737   0.10912605]
0.6472646670780353
This is qvalues: [-0.00789172  0.00596023]
This is qvalues: [0.0005601  0.00882696]
This is qvalues: [-0.09441406  0.01455037]
This is qvalues: [-0.17835443  0.02518955]
This is qvalues: [

This is qvalues: [-0.03039035  0.04760367]
This is qvalues: [-0.03961652  0.05470322]
This is qvalues: [-0.03191113  0.0344607 ]
This is qvalues: [-0.05017669  0.06420045]
This is qvalues: [-0.04293305  0.04295756]
This is qvalues: [-0.02447348  0.00951862]
This is qvalues: [-0.1257727  -0.00887544]
This is qvalues: [-0.12999457  0.00099909]
This is qvalues: [-0.20151548  0.00747988]
This is qvalues: [-0.27616856  0.01586407]
This is qvalues: [-0.24618796  0.03884418]
This is qvalues: [-0.43883723  0.06989885]
0.6126263941844169
This is qvalues: [-0.12209231 -0.00421521]
This is qvalues: [-0.33285266  0.0193119 ]
This is qvalues: [-0.44307998  0.05604906]
This is qvalues: [-0.5270047   0.06933205]
0.6095709072963101
This is qvalues: [-0.01254062  0.00090435]
This is qvalues: [-0.08030534  0.00723905]
This is qvalues: [-0.23528992  0.02818516]
This is qvalues: [-0.39115244  0.04627333]
This is qvalues: [-0.55978906  0.07228104]
0.6065306597126343


In [26]:
from gym import wrappers
def make_video():
    env_to_wrap = gym.make('CartPole-v1')
    env = wrappers.Monitor(env_to_wrap, 'videos', force = True)
    rewards = 0
    steps = 0
    done = False
    state = env.reset()
    state = np.array([state])
    while not done:
        action = agent.compute_action(state)
        state, reward, done, _ = env.step(action)
        state = np.array([state])            
        steps += 1
        rewards += reward
    print(rewards)
    #env.close()
    #env_to_wrap.close()
make_video()




NameError: name 'glPushMatrix' is not defined

In [25]:
pip install pyglet

Collecting pyglet
  Downloading pyglet-2.0.4-py3-none-any.whl (831 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m831.0/831.0 kB[0m [31m9.5 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hInstalling collected packages: pyglet
Successfully installed pyglet-2.0.4
Note: you may need to restart the kernel to use updated packages.
