# Reinforcement Learning (RL) with Gymnasium and Stable Baselines3 Tutorial (Part 2)
Source: 
- https://www.youtube.com/watch?v=Mut_u40Sqz4&t=6144s (Nicholas Renotte) (YouTube video by Nicholas Renotte titled, 
'Reinforcement Learning in 3 Hours | Full Course using Python')

Documentations:
- Gymnasium: https://gymnasium.farama.org/ (This library provides standardized environments for developing and testing RL algorithms)
- Stable Baselines3: https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html (This library provides a suite of pre-implemented RL algorithms based on PyTorch)

In [None]:
import os
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

In [39]:
environment_name = "CartPole-v1"
env = gym.make(environment_name, render_mode="human")

## 5. Reload PPO DRL model

In [None]:
PPO_Model_CartPole_v1_20k = os.path.join('Training', 'Saved RL Models', 'PPO_Model_CartPole_v1_20k')
PPO_DRL_model = PPO.load(PPO_Model_CartPole_v1_20k, env=env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


## 6. Evaluating the PPO DRL model in a RL Environment

Now we can evaluate the performance of the PPO DRL model in the Gymnasium's 'CartPole-v1' RL Environment

In [None]:
eval_env = gym.make("CartPole-v1", render_mode="human")

# The 'evaluate_policy()' function returns a tuple,
#       (mean_reward, std_reward)
# - 'mean_reward' - refers to the mean reward throughout the episodes
# - 'std_reward' - refers to the standard deviation of the reward throughout the episodes
print(evaluate_policy(PPO_DRL_model, eval_env, n_eval_episodes=1, render=True))
env.close()



(np.float64(500.0), np.float64(0.0))


## 7. Test the PPO DRL model in a RL Environment

To test the PPO DRL model in the Gymnasium's 'CartPole-v1' RL Environment, we can use the same code from the earlier  section '2. Load RL Environment and testing if it works with a baseline algorithm that takes random actions' with some minor changes

But here, instead of taking a random action at each time step in an episode, we are using the PPO DRL model to predict that action at each time step in an episode instead

In [None]:
# We simply need to change a few things here from the earlier section '2. Load RL Environment and testing if it works 
# with a baseline algorithm that takes random actions':
# 1. Change the line 'state = env.reset()' and 'print(f"Initial State: {state}")' -> 'obs = env.reset()' and 'print(f"Initial State: {obs}")'

# 2. Change the line 'action = env.action_space.sample()' -> 'action, _ = PPO_RL_model.predict(obs)'

# 3. Change the line 'n_state, reward, done, truncated, info = env.step(action)' -> 'obs, reward, done, truncated, info = env.step(action)'

environment_name = "CartPole-v1"
env = gym.make(environment_name, render_mode="human")

episodes = 5
for episode in range(0, episodes+1):
    # Initialise starting state of the RL agent in the RL Environment before an episode, done to false, and starting 
    # episode score to 0
    obs, _ = env.reset()
    print(f"Initial State: {obs}")
    done = False
    episode_score = 0

    # During an episode:
    while not done:
        env.render()
        # RL agent determines action to take
        # - Now, we are no longer randomly sampling an action to take by our RL agent in the RL Environment, but
        #   instead we are using the PPO DRL model to predict the action at each time step in an episode instead based
        #   on the current observations/states in the RL Environment
        action, _ = PPO_DRL_model.predict(obs)
        # RL Environment generates the next state and reward gained upon taking the action in the current state
        obs, reward, done, truncated, info = env.step(action)
        # Append the reward gained upon taking the action in the current state to the cumulative episode date
        episode_score += reward

    print(f"Episode: {episode} Score: {episode_score}")

env.close()

Initial State: [ 0.01689493 -0.04276498  0.03179216  0.0038821 ]
Episode: 0 Score: 376.0
Initial State: [ 0.04157403  0.03359015  0.01806775 -0.02898854]
Episode: 1 Score: 378.0
Initial State: [ 0.00292343 -0.01035426 -0.03587482  0.04994534]
Episode: 2 Score: 366.0
Initial State: [ 0.0451983  -0.02441233  0.04365429 -0.03893877]
Episode: 3 Score: 237.0
Initial State: [ 0.03165805 -0.02900014 -0.00288191 -0.01140878]
Episode: 4 Score: 358.0
Initial State: [0.04184088 0.04438623 0.02803899 0.00377379]
Episode: 5 Score: 254.0


## 8. Logging the training process of the PPO DRL model in a RL Environment in TensorBoard

(Copy-pasted from the '2. training_a_Feedforward_Neural_Network_with_digits_image_dataset_with_TensorBoard.ipynb' file from the 'Tutorial 8 - TensorBoard Introduction' folder from the 'Deep Learning' tutorials)

You need to:
1. Navigate to the folder storing the 'logs' folder

2. Run the following command to open the TensorBoard tool:
    ```
    tensorboard --logdir logs
    ```

Expected output:
```
2025-06-12 18:53:36.667376: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-12 18:53:38.357404: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.19.0 at http://localhost:6006/ (Press CTRL+C to quit)
```

Open the URL in your browser to view the TensorBoard tool:
```
http://localhost:6006/
```


## 9. Adding a Callback during training of the PPO DRL model in a RL Environment

### What is a Callback?
A callback is a custom piece of code that runs at specific points during training of the RL algorithm — like at the end of an episode, after a certain number of steps, or when saving models. It gives you ability to extend/add more functionalities to help you track, control, or modify the training process of the RL algorithms while it’s running.

**Here are some existing built-in callback classes in Stable Baselines3 you can use, which supports different functionalities:**
| Callback                           | Purpose                                                                                         |
| ---------------------------------- | ----------------------------------------------------------------------------------------------- |
| `BaseCallback`                     | Abstract base class. All callbacks must inherit from this.                                      |
| `EventCallback`                    | Base class for callbacks that trigger on a specific event (e.g. best model saving, evaluation). |
| `CallbackList`                     | Combines multiple callbacks into one.                                                           |
| `CheckpointCallback`               | Saves the model periodically during training.                                                   |
| `EvalCallback`                     | Evaluates the agent periodically and optionally saves the best-performing model.                |
| `StopTrainingOnRewardThreshold`    | Stops training once a certain reward threshold is reached.                                      |
| `StopTrainingOnMaxEpisodes`        | Stops training after a set number of episodes.                                                  |
| `StopTrainingOnNoModelImprovement` | Stops training if the model performance doesn't improve for a certain number of evaluations.    |
| `ProgressBarCallback`              | Shows a tqdm progress bar for training.                                                         |
| `TensorboardCallback`              | Custom callback for advanced Tensorboard logging (often user-defined).                          |

In this tutorial hoowever, we will only be looking at the 'EvalCallback' and 'StopTrainingOnRewadThreshold' callback classes, which, used together, allows us to stop training the RL algorithm after it has reached a certain reward threshold in a certain episode. Then, it will save the RL algorithm

In [43]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

In [None]:
save_path = os.path.join('Training_Tutorial', 'Saved RL Models')

In [45]:
environment_name = "CartPole-v1"
env = gym.make(environment_name, render_mode="human")
env = DummyVecEnv([lambda: env])

In [None]:
# The 'StopTrainingOnRewardThreshold' callback class defines which reward threshold to stop the training of the RL algorithm
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=200, verbose=1)

# The 'EvalCallback' callback class defines how often to check the defined callback, which is the 
# 'StopTrainingOnRewardThreshold' callback class in this case.

# The 5 more important parameters of the 'EvalCallback' callback class are:
# - 'env'                   - the RL environment
# - 'callback_on_new_best'  - the callback class to execute
# - 'eval_freq'             - how often to execute the callback class in 'callback_on_new_best'
# - 'best_model_save_path'  - which path to save the RL model whenever it achieves a new best mean reward during evaluation
# - 'verbose'               - controls how much information is printed to the console/log during training
#                             -> 'verbose=0' means 'Silent', no output at all
#                             -> 'verbose=1' means 'Info', shows key training events: episode rewards, updates, losses, etc.
#                             -> 'verbose=2' means 'Debug' shows more detailed info like hyperparameters, rollout steps, and internal logs

# The best RL model will be saved automatically in a file named 'best_model.zip'
eval_callback = EvalCallback(env,
                             callback_on_new_best=stop_callback,
                             eval_freq=10000,
                             best_model_save_path=save_path,
                             verbose=1)

In [None]:
log_path = os.path.join('Training_Tutorial', 'logs')
another_PPO_DRL_model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

Using cpu device


In [None]:
another_PPO_DRL_model.learn(total_timesteps=20000, callback=eval_callback)

Logging to Training\logs\PPO_3
-----------------------------
| time/              |      |
|    fps             | 47   |
|    iterations      | 1    |
|    time_elapsed    | 43   |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 46          |
|    iterations           | 2           |
|    time_elapsed         | 87          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008842143 |
|    clip_fraction        | 0.068       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.668      |
|    explained_variance   | 0.109       |
|    learning_rate        | 0.0003      |
|    loss                 | 15.4        |
|    n_updates            | 20          |
|    policy_gradient_loss | -0.0201     |
|    value_loss           | 41.3        |
-----------------------------------------
Eva

<stable_baselines3.ppo.ppo.PPO at 0x1beb74365d0>

## 10. Changing PPO DRL model policies

**Policy** refers to the 'learning' architecture of RL algorithms defines how the agent chooses actions based on its current state.

For PPO DRL algorithm, which is a Deep RL (DRL) algorithm, it uses neural networks or its variants as its 'learning' architecture.

In Stable Baseliens3, you can change the underlying policy/neural network variant used in the RL algorithms.

### What does this mean?
```python
[dict(pi=[128,128,128,128], vf=[128,128,128,128])]
```

Within Deep RL (DRL) algorithms, they have their own subcategory of 'learning' architectures/policy, despite all of them using neural networks, some using multiple neural networks:
| **DRL Algorithm**    | **Networks Used**                             | **# of Neural Networks**                         | **Purpose of Each Network**                   |
| ---------------- | --------------------------------------------- | ------------------------------------ | --------------------------------------------- |
| **DQN**          | Q-Network (+ Target Q-Network)                | 1 (conceptually 1; target is a copy) | Q(s, a) for all actions                       |                  |
| **HER**          | Q-Network (with goal relabeling)              | 1                                    | Goal-conditioned Q(s, a)                      |                  |
| **PPO**          | Actor + Critic networks                       | 2                                    | Actor: π(a \| s), Critic: V(s) |
| **A2C / A3C**    | Actor + Critic networks                       | 2                                    | Actor: π(a \| s), Critic: V(s) |
| **TRPO**         | Actor + Critic networks                       | 2                                    | Trust-region actor + baseline value function  |
| **DDPG**         | Actor + Critic (+ Target copies)              | 2                                    | Actor: μ(s), Critic: Q(s, a)                  |
| **TD3**          | Actor + 2 Critics (+ Targets)                 | 3                                    | Double critics to reduce overestimation       |
| **SAC**          | Stochastic Actor + 2 Critics + Entropy Critic | 3+                                   | Soft policy + value critics                   |

Hence, for PPO DRL algorithm, the specified new policy neural network 'learning' architecture,
```python
[dict(pi=[128,128,128,128], vf=[128,128,128,128])]
```
means that:
- 'pi' defines the architecture for the policy network (the actor that outputs action probabilities).
- 'vf' defines the architecture for the value function network (the critic that estimates the value of a state). 

| Network       | Layers                                | Description                                                          |
| ------------- | ------------------------------------- | -------------------------------------------------------------------- |
| Policy (`pi`) | 4 hidden layers with 128 neurons each | Used to decide what action to take                                   |
| Value (`vf`)  | 4 hidden layers with 128 neurons each | Used to estimate how good a state is (used in advantage calculation) |


In [None]:
new_policy_neural_network_learning_architecture = [dict(pi=[128,128,128,128], vf=[128,128,128,128])]
another_another_PPO_DRL_model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path, policy_kwargs={'net_arch' : new_policy_neural_network_learning_architecture})

Using cpu device


In [None]:
another_another_PPO_DRL_model.learn(total_timesteps=20000, callback=eval_callback)

Logging to Training\logs\PPO_4
-----------------------------
| time/              |      |
|    fps             | 46   |
|    iterations      | 1    |
|    time_elapsed    | 44   |
|    total_timesteps | 2048 |
-----------------------------
--------------------------------------
| time/                   |          |
|    fps                  | 45       |
|    iterations           | 2        |
|    time_elapsed         | 89       |
|    total_timesteps      | 4096     |
| train/                  |          |
|    approx_kl            | 0.014807 |
|    clip_fraction        | 0.196    |
|    clip_range           | 0.2      |
|    entropy_loss         | -0.682   |
|    explained_variance   | 6.68e-06 |
|    learning_rate        | 0.0003   |
|    loss                 | 4.04     |
|    n_updates            | 10       |
|    policy_gradient_loss | -0.0227  |
|    value_loss           | 21.9     |
--------------------------------------
-----------------------------------------
| time/        



Eval num_timesteps=10000, episode_reward=349.60 +/- 84.60
Episode length: 349.60 +/- 84.60
------------------------------------------
| eval/                   |              |
|    mean_ep_length       | 350          |
|    mean_reward          | 350          |
| time/                   |              |
|    total_timesteps      | 10000        |
| train/                  |              |
|    approx_kl            | 0.0113417385 |
|    clip_fraction        | 0.126        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.577       |
|    explained_variance   | 0.596        |
|    learning_rate        | 0.0003       |
|    loss                 | 10.1         |
|    n_updates            | 40           |
|    policy_gradient_loss | -0.0189      |
|    value_loss           | 39.4         |
------------------------------------------
------------------------------
| time/              |       |
|    fps             | 39    |
|    iterations      | 5     |
|    time_e

<stable_baselines3.ppo.ppo.PPO at 0x1beb74c3010>

## 11. Using an alternate DRL algorithm

In Stable Baselines3, it comes with pre-packaged a number of different RL algorithms that you can use.

From the Stable Baselines3 documentation: https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html

These are the pre-packaged RL algorithms in the Stable Baselines3 library:
| **RL Algorithm** | **Full Name**                                                          | **Category**                                                   |
| ---------------- | ---------------------------------------------------------------------- | -------------------------------------------------------------- |
| **A2C**          | Advantage Actor-Critic                                                 | On-policy, Actor-Critic                                        |
| **ACER**         | Actor-Critic with Experience Replay                                    | Off-policy, Actor-Critic                                       |
| **ACKTR**        | Actor-Critic using Kronecker-Factored Trust Region                     | On-policy, Actor-Critic                                        |
| **DDPG**         | Deep Deterministic Policy Gradient                                     | Off-policy, Actor-Critic (deterministic)                       |
| **DQN**          | Deep Q-Network                                                         | Off-policy, Value-based                                        |
| **GAIL**         | Generative Adversarial Imitation Learning                              | Imitation Learning                                             |
| **HER**          | Hindsight Experience Replay                                            | Experience replay strategy for goal-based learning             |
| **PPO1**         | Proximal Policy Optimization (Original/OpenAI Baseline implementation) | On-policy, Actor-Critic                                        |
| **PPO2**         | Proximal Policy Optimization (Improved version from OpenAI Baselines)  | On-policy, Actor-Critic                                        |
| **SAC**          | Soft Actor-Critic                                                      | Off-policy, Actor-Critic (stochastic + entropy regularization) |
| **TD3**          | Twin Delayed Deep Deterministic Policy Gradient                        | Off-policy, Actor-Critic (improvement over DDPG)               |
| **TRPO**         | Trust Region Policy Optimization                                       | On-policy, Actor-Critic                                        |


Just for demonstration purposes, the alternate RL algorithm that we will be using here is the Deep Q-Network (DQN) DRL algorithm

In [58]:
from stable_baselines3 import DQN

In [None]:
DQN_DRL_model = DQN('MlpPolicy', env, verbose=20000, tensorboard_log=log_path)

Using cpu device


In [None]:
DQN_DRL_model.learn(total_timesteps=1)

Logging to Training\logs\DQN_1


<stable_baselines3.dqn.dqn.DQN at 0x1beb7462010>

In [None]:
DQN_Model_CartPole_v1_20k = os.path.join('Training_Tutorial', 'Saved RL Models', 'DQN_Model_CartPole_v1_20k')
DQN_DRL_model.save(DQN_Model_CartPole_v1_20k)