simple 3x3 grid env, the agent starts a (1,1) and 
goal is to reach (3,3,) set of actions could be left, right, up, down 
depending on which state agent is in. rewards = -1 for all states 
except terminal, for which it is 103 optimal policy reward = 100.

In [1]:
import os 
import numpy as np
import random

import gym 
from gym import Env
from gym.spaces import Discrete

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

Matplotlib created a temporary cache directory at /var/folders/4s/rj_sy56d06508gj9z8g3lwhw0000gn/T/matplotlib-j0v0bx7w because the default path (/Users/yousuf/.matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.


In [None]:
'''
action space
0 - left
1 - up
2 - right
3 - down
'''

In [2]:
class Maze_3x3_Grid(Env):
    def __init__(self):
        self.action_space = Discrete(4)
        self.observation_space = Discrete(9)
        self.state = (0, 0)
        self.episode_length = 10

        
    def step(self,action):
        
        if self.state==8:
            return self.state, 0, True, {}
        
        self.episode_length -= 1
        
        # Converting the number to (row, col) to check the logic
        row, col = divmod(self.state, 3)
        
        if action == 0 and col > 0:  # Left
            col -= 1
        elif action == 1 and col < 2:  # Right
            col += 1
        elif action == 2 and row > 0:  # Up
            row -= 1
        elif action == 3 and row < 2:  # Down
            row += 1
        
        # Updating state by converting (row, col) back to number state
        self.state = row * 3 + col
        
        if self.state == 8:
            reward = 103
            done = True
        else:
            reward = -1
            done = False
            
        if self.episode_length<=0:
            done = True
        else:
            done = False
            

        info = {}
        return self.state, reward, done, info
        
        
    def render(self):
        '''
        grid = np.zeros((3, 3), dtype=str)
        grid.fill('-')
        row, col = divmod(self.state, 3)
        grid[row, col] = 'A'  # This will mark the agents posiition in the grid
        print(grid)
        print("\n")
        '''
        pass
        
        
    def reset(self):
        self.state = 0  # resetting to start position
        self.episode_length = 10
        return self.state

In [3]:
env = Maze_3x3_Grid()

### Testing Env

In [4]:
episodes = 5
for episode in range(1, episodes+1):
    obs = env.reset()
    done = False
    score = 0
    # step = 1
    while not done:
        
        env.render()
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        score+=reward
        ## print(step)
        ## step+=1
        
    print('Episode:{} Score:{}'.format(episode, score))
    
env.close()

Episode:1 Score:-10
Episode:2 Score:94
Episode:3 Score:-10
Episode:4 Score:96
Episode:5 Score:95


### Training Model

In [5]:
log_path = os.path.join('Training', 'Logs')
model = PPO('MlpPolicy', env, verbose = 1, tensorboard_log=log_path)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




In [58]:
model.learn(total_timesteps=50000)

Logging to Training/Logs/PPO_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 9.64     |
|    ep_rew_mean     | 12.4     |
| time/              |          |
|    fps             | 6565     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 9.07        |
|    ep_rew_mean          | 36          |
| time/                   |             |
|    fps                  | 4468        |
|    iterations           | 2           |
|    time_elapsed         | 0           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.020304106 |
|    clip_fraction        | 0.241       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.37       |
|    explained_variance   | -0.000976   |

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5.26        |
|    ep_rew_mean          | 99.7        |
| time/                   |             |
|    fps                  | 3765        |
|    iterations           | 11          |
|    time_elapsed         | 5           |
|    total_timesteps      | 22528       |
| train/                  |             |
|    approx_kl            | 0.016240537 |
|    clip_fraction        | 0.0672      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.748      |
|    explained_variance   | 0.88        |
|    learning_rate        | 0.0003      |
|    loss                 | 314         |
|    n_updates            | 100         |
|    policy_gradient_loss | -0.00887    |
|    value_loss           | 687         |
-----------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 5.18

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5           |
|    ep_rew_mean          | 100         |
| time/                   |             |
|    fps                  | 3718        |
|    iterations           | 21          |
|    time_elapsed         | 11          |
|    total_timesteps      | 43008       |
| train/                  |             |
|    approx_kl            | 0.009650457 |
|    clip_fraction        | 0.0525      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.532      |
|    explained_variance   | 1           |
|    learning_rate        | 0.0003      |
|    loss                 | 0.121       |
|    n_updates            | 200         |
|    policy_gradient_loss | -0.00404    |
|    value_loss           | 0.0747      |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 5     

<stable_baselines3.ppo.ppo.PPO at 0x292ec78d0>

### Saving Model

In [8]:
Maze_Path = os.path.join('Training', 'Saved Models')

In [None]:
model.save(Maze_Path)

### Evaluating model

In [10]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, render=True)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

Mean reward: 100.00 +/- 0.00




### Let's see the agent in action

In [12]:
## Loading the class Again cuz I didnt want to make changed to the original class 
## i.e uncommenting the render function cuz If I did that when U re-train the model the whole rendering would just 
## slow down the training


class Maze_3x3_Grid(Env):
    def __init__(self):
        self.action_space = Discrete(4)
        self.observation_space = Discrete(9)
        self.state = (0, 0)
        self.episode_length = 10

        
    def step(self,action):
        
        if self.state==8:
            return self.state, 0, True, {}
        
        self.episode_length -= 1
        
        # Converting the number to (row, col) to check the logic
        row, col = divmod(self.state, 3)
        
        if action == 0 and col > 0:  # Left
            col -= 1
        elif action == 1 and col < 2:  # Right
            col += 1
        elif action == 2 and row > 0:  # Up
            row -= 1
        elif action == 3 and row < 2:  # Down
            row += 1
        
        # Updating state by converting (row, col) back to number state
        self.state = row * 3 + col
        
        if self.state == 8:
            reward = 103
            done = True
        else:
            reward = -1
            done = False
            
        if self.episode_length<=0:
            done = True
        else:
            done = False
            

        info = {}
        return self.state, reward, done, info
        
        
    def render(self):
        
        grid = np.zeros((3, 3), dtype=str)
        grid.fill('-')
        row, col = divmod(self.state, 3)
        grid[row, col] = 'A'  # This will mark the agents posiition in the grid
        print(grid)
        print("\n")
        
        
    def reset(self):
        self.state = 0  # resetting to start position
        self.episode_length = 10
        return self.state
        

In [13]:
env = Maze_3x3_Grid()

In [14]:
model = PPO.load(Maze_Path, env=env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [15]:
env = Maze_3x3_Grid()  
obs = env.reset()

for step in range(100):  
    
    env.render()
    action, _states = model.predict(obs)
    obs, reward, done, info = env.step(action)
    score+=reward
    if done:
        print(f"Episode finished after {step+1} steps with {score} Reward")
        obs = env.reset()
        score = 0

[['A' '-' '-']
 ['-' '-' '-']
 ['-' '-' '-']]


[['-' '-' '-']
 ['A' '-' '-']
 ['-' '-' '-']]


[['-' '-' '-']
 ['-' 'A' '-']
 ['-' '-' '-']]


[['-' '-' '-']
 ['-' '-' 'A']
 ['-' '-' '-']]


[['-' '-' '-']
 ['-' '-' '-']
 ['-' '-' 'A']]


Episode finished after 5 steps with 100 Reward
[['A' '-' '-']
 ['-' '-' '-']
 ['-' '-' '-']]


[['-' 'A' '-']
 ['-' '-' '-']
 ['-' '-' '-']]


[['-' '-' '-']
 ['-' 'A' '-']
 ['-' '-' '-']]


[['-' '-' '-']
 ['-' '-' 'A']
 ['-' '-' '-']]


[['-' '-' '-']
 ['-' '-' '-']
 ['-' '-' 'A']]


Episode finished after 10 steps with 100 Reward
[['A' '-' '-']
 ['-' '-' '-']
 ['-' '-' '-']]


[['-' 'A' '-']
 ['-' '-' '-']
 ['-' '-' '-']]


[['-' '-' '-']
 ['-' 'A' '-']
 ['-' '-' '-']]


[['-' '-' '-']
 ['-' '-' '-']
 ['-' 'A' '-']]


[['-' '-' '-']
 ['-' '-' '-']
 ['-' '-' 'A']]


Episode finished after 15 steps with 100 Reward
[['A' '-' '-']
 ['-' '-' '-']
 ['-' '-' '-']]


[['-' '-' '-']
 ['A' '-' '-']
 ['-' '-' '-']]


[['-' '-' '-']
 ['-' 'A' '-']
 ['-' '-' '

## We can see that the agent has learned the optimal Policy and is executing the same everytime

## Let's Check the Logs

In [69]:
!pip install tensorboard

Collecting tensorboard
  Downloading tensorboard-2.17.1-py3-none-any.whl.metadata (1.6 kB)
Collecting absl-py>=0.4 (from tensorboard)
  Using cached absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting grpcio>=1.48.2 (from tensorboard)
  Downloading grpcio-1.66.0-cp312-cp312-macosx_10_9_universal2.whl.metadata (3.9 kB)
Collecting tensorboard-data-server<0.8.0,>=0.7.0 (from tensorboard)
  Using cached tensorboard_data_server-0.7.2-py3-none-any.whl.metadata (1.1 kB)
Downloading tensorboard-2.17.1-py3-none-any.whl (5.5 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.5/5.5 MB[0m [31m21.6 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hUsing cached absl_py-2.1.0-py3-none-any.whl (133 kB)
Downloading grpcio-1.66.0-cp312-cp312-macosx_10_9_universal2.whl (10.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.6/10.6 MB[0m [31m8.4 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hUsing cached tensorboard_data_server-0.7.2-py3-none-any.

In [70]:
%load_ext tensorboard

In [71]:
%tensorboard --logdir ./Training/Logs

WE can see that at around 40,000 Steps we achieved the optimal policy