# Reinforcement Learning (RL) with Gymnasium and Stable Baselines3 Tutorial
Source:
- https://www.youtube.com/watch?v=Mut_u40Sqz4&t=6144s (YouTube video by Nicholas Renotte titled 'Reinforcement Learning in 3 Hours | Full Course using Python')

Documentation:
- Gymnasium: https://gymnasium.farama.org/ (This library provides standardized environments for developing and testing RL algorithms)
- Stable Baselines3: https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html (This library provides a suite of pre-implemented RL algorithms based on PyTorch)

# Project 3: Custom RL Environment

### What are Atari Games RL Environments?
In Gymnasium, there is a class of RL Environments called Atari Games, which refers to the classic video games from the Atari 2600 console, such as:
- Breakout
- Pong
- Space Invaders
- Q*Bert
- Seaquest
- Montezuma's Revenge

and many more...

These games are used as benchmark RL Environments for evaluating and comparing the performance of RL algorithms.

## How is an RL Environment defined?
An RL Environment is typically modeled as a Markov Decision Process (MDP), defined by the 5-tuple:
```text
M = (S, A, P, R, γ)
```

Each component of the 5-tuple is described below:

| Symbol              | Name                       | Description                                                                               |
| ------------------- | -------------------------- | ----------------------------------------------------------------------------------------- |
| $S$                 | **States**                 | The set of all possible states the agent can be in                                        |
| $A$                 | **Actions**                | The set of all possible actions the agent can take                                        |
| $P(s' \mid s, a)$   | **Transition Probability** | The probability of moving to state $s'$ after taking action $a$ in state $s$              |
| $R(s, a)$           | **Reward Function**        | The expected reward received after taking action $a$ in state $s$                         |
| $\gamma \in [0, 1]$ | **Discount Factor**        | The factor by which future rewards are discounted (controls how far-sighted the agent is) |
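
To make the role of the discount factor concrete, here is a minimal sketch that computes the discounted return (the sum of `gamma**t * r_t`) for a short list of rewards. The function name `discounted_return` and the reward values are made up for illustration; they are not part of Gymnasium or Stable Baselines3.

```python
def discounted_return(rewards, gamma):
    """Return the sum of gamma**t * r_t over a list of rewards."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


rewards = [1, 1, 1]  # hypothetical rewards received at time steps 0, 1, 2

# gamma = 1.0: future rewards count fully (far-sighted agent)
far_sighted = discounted_return(rewards, gamma=1.0)    # 1 + 1 + 1 = 3.0
# gamma = 0.5: each step into the future halves the reward's weight
short_sighted = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75

print(far_sighted, short_sighted)
```

A smaller `gamma` makes the same future rewards worth less, which is what "short-sighted" means in the table above.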

## How does Gymnasium represent each of these components of the RL Environment?
**States**/**Observations** and  **Actions**  
- Box – n-dimensional tensor, range of values (continuous values)
    ```
    E.g. Box(0, 1, shape=(3,3))
    ```
- Discrete – Set of items (discrete values)
    ```
    E.g. Discrete(3)
    ```
- Tuple – Tuple of other spaces (e.g., Box or Discrete)
    ```
    E.g. Tuple((Discrete(2), Box(0, 100, shape=(1,))))
    ```
- Dict – Dictionary of spaces (e.g., Box or Discrete)
    ```
    E.g. Dict({"height": Discrete(2), "speed": Box(0, 100, shape=(1,))})
    ```
- MultiBinary – Fixed-length vector of independent binary (0 or 1) values
    ```
    E.g. MultiBinary(4)
    ```
- MultiDiscrete – Multiple discrete values
    ```
    E.g. MultiDiscrete([5, 2, 2])
    ```

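**Transition Probability**  
- not declared as a space; it is implemented implicitly inside the RL Environment's `step()` method, which produces the next observation

**Reward Function**  
- not declared as a space; it is computed inside the RL Environment's `step()` method, which returns the reward

**Discount Factor**
- not part of the Gymnasium RL Environment; it is set as a hyperparameter of the RL algorithm (e.g., `gamma` in Stable Baselines3's `PPO`)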

### What is the difference between States and Observations?
RL agents only act on observations, not states. Optimal behavior of RL agents assumes knowledge of the underlying state (or estimates of it).

| **Aspect**          | **State**                                                    | **Observation**                                                 |
| ------------------- | ------------------------------------------------------------ | --------------------------------------------------------------- |
| **Definition**      | The **true internal configuration** of the environment       | The **information** the agent **receives** from the environment |
| **Completeness**    | Often assumed to be **complete** (Markov property holds)     | May be **partial**, noisy, or incomplete view of the state      |
| **Markov Property** | A true state satisfies: future depends only on current state | Observations may not satisfy the Markov property                |
| **Agent’s View**    | Agent may not have access to the full state                  | Agent always uses observations to decide actions                |
| **Example**         | All object positions, velocities, and environment internals  | Camera image, radar scan, or any sensor reading                 |

**MDP vs POMDP**
- In fully observable environments (e.g., many standard RL benchmarks), the observation is equivalent to the state. This is assumed in Markov Decision Processes (MDPs).
- In Partially Observable MDPs (POMDPs), the agent sees only observations and must infer the state using memory or belief models.
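
The difference between a state and an observation can be illustrated with a toy example. This is a sketch under assumed names (`observe`, `step`, and the position/velocity state are hypothetical, not taken from Gymnasium): the full state is `(position, velocity)`, but the agent only observes the position.

```python
def observe(state):
    """The agent only sees the position, not the velocity."""
    position, velocity = state
    return position


def step(state):
    """Move the object by its velocity for one time step."""
    position, velocity = state
    return (position + velocity, velocity)


state_a = (0.0, 1.0)   # at position 0, moving right
state_b = (0.0, -1.0)  # at position 0, moving left

# Two different states produce the same observation (partial observability)
same_observation = observe(state_a) == observe(state_b)

# Remembering the previous observation lets the agent estimate the hidden velocity
previous_obs = observe(state_a)
current_obs = observe(step(state_a))
estimated_velocity = current_obs - previous_obs

print(same_observation, estimated_velocity)
```

This is why agents in POMDPs often use memory (for example stacked frames or recurrent networks): a single observation may not be enough to tell states apart.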

## 1. Import Dependencies

**To run the Gymnasium and Stable Baselines3 libraries, it is HIGHLY recommended to create a virtual environment and install the dependencies/requirements inside it, to prevent conflicts between libraries!**

### How to set up a virtual environment in VS Code?
1. **Create a virtual environment**
    ```bash
    python -m venv venv
    ```
    This creates a folder named venv/ containing the isolated environment.

2. **Activate the virtual environment**

    For Windows:
    ```bash
    .\venv\Scripts\activate
    ```
    For macOS/Linux:
    ```bash
    source venv/bin/activate
    ```
    You’ll know it’s activated when your terminal prompt changes to show (venv).

3. **Now you can install dependencies inside the virtual environment!**

### What dependencies/requirements to download? 

**For Gymnasium library**
```bash
pip install gymnasium
```

**For Stable Baselines3 library**
```bash
pip install stable-baselines3[extra]
```

**For ALE (Arcade Learning Environment) package**  
Newer versions of the Gymnasium library no longer include the Atari Games RL Environments by default. To use these Atari Games RL Environments with Gymnasium, you need to install a separate package, the ALE (Arcade Learning Environment) package.
```bash
pip install autorom[accept-rom-license]
pip install ale-py
```

Source(s):
- https://github.com/AndreM96/Stable_Baseline3_Gymnasium_Tutorial (AndreM96 on Github)
- https://www.youtube.com/watch?v=Mut_u40Sqz4&t=6144s (one of the comments under the YouTube video by Nicholas Renotte titled, 'Reinforcement Learning in 3 Hours | Full Course using Python')

Just for demonstration purposes, the RL algorithm that we will be using here is the Proximal Policy Optimization (PPO) DRL algorithm.

In [None]:
# Import Gymnasium-related dependencies
import gymnasium as gym
from gymnasium import Env
from gymnasium.spaces import Discrete, Box, Dict, Tuple, MultiBinary, MultiDiscrete

# Import Stable Baselines3-related dependencies
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Import helper dependencies
import numpy as np
import random
import os

## 2. Types of Gymnasium's Spaces
A Gymnasium Space describes the structure and boundaries of one of the following:
- Observation space: What the agent sees (e.g. images, coordinates, sensors)
- State space: All possible configurations the environment can be in (often implied)
- Action space: What actions the agent can take (e.g. move left/right, accelerate)

As seen from the section above, 'How does Gymnasium represent each of these components of the RL Environment?', this tutorial covers 6 types of Gymnasium Spaces:
1. Box – n-dimensional tensor, range of values (continuous values)
    ```
    E.g. Box(0, 1, shape=(3,3))
    ```
2. Discrete – Set of items (discrete values)
    ```
    E.g. Discrete(3)
    ```
3. Tuple – Tuple of other spaces (e.g., Box or Discrete)
    ```
    E.g. Tuple((Discrete(2), Box(0, 100, shape=(1,))))
    ```
4. Dict – Dictionary of spaces (e.g., Box or Discrete)
    ```
    E.g. Dict({"height": Discrete(2), "speed": Box(0, 100, shape=(1,))})
    ```
5. MultiBinary – Fixed-length vector of independent binary (0 or 1) values
    ```
    E.g. MultiBinary(4)
    ```
6. MultiDiscrete – Multiple discrete values
    ```
    E.g. MultiDiscrete([5, 2, 2])
    ```

### 1. Box Space
```
Box(low, high, shape, dtype=np.float32)
E.g. Box(0, 1, shape=(3,3))
```
| **Parameter**      | **Value**              | **Meaning**                                                                 |
| ------------------ | ---------------------- | --------------------------------------------------------------------------- |
| `low`              | `0`                    | Minimum possible value for each element in the box (can be scalar or array) |
| `high`             | `1`                    | Maximum possible value for each element (same shape as `low` or scalar)     |
| `shape`            | `(3, 3)`               | Shape of the space: this creates a **3×3 matrix** space                     |
| `dtype` (optional) | `np.float32` (default) | Data type for the elements (e.g., `np.float32`, `np.int32`)                 |

Best for: Continuous control tasks, like robotics, self-driving cars, or any numeric sensor input

In [139]:
print(Box(0, 1, shape=(3,3)))

Box(0.0, 1.0, (3, 3), float32)


In [140]:
print(Box(0, 1, shape=(3,3)).sample())

[[0.38299504 0.3242162  0.0487468 ]
 [0.83433515 0.63927984 0.12781315]
 [0.95965    0.796533   0.12002907]]


### 2. Discrete Space
```
Discrete(n)
E.g. Discrete(3)
```
| **Parameter** | **Value** | **Meaning**                                          |
| ------------- | --------- | ---------------------------------------------------- |
| `n`           | `3`       | The number of **distinct values**, from `0` to `n-1` |

Best for: Single categorical choices (e.g., move directions, menu options)

In [141]:
print(Discrete(3))

Discrete(3)


In [142]:
print(Discrete(3).sample())    # note the parentheses: without them, the bound method itself is printed

(a random integer from {0, 1, 2}; the value differs between runs)


### 3. Tuple Space
```
E.g. Tuple((Discrete(2), Box(0, 100, shape=(1,))))
```
| **Parameter** | **Value**                              | **Meaning**                                                     |
| ------------- | -------------------------------------- | --------------------------------------------------------------- |
| `spaces`      | `(Discrete(2), Box(0, 100, shape=(1,)))` | A tuple of **independent spaces** (can mix Discrete, Box, etc.) |

Best for: Environments with multi-part observations or actions where each part is different in type or range

Example: (button_pressed, sensor_readings)

IMPORTANT NOTE:  
Tuple Space is not supported by Stable Baselines3! (Of the six space types covered here, it's the only one that Stable Baselines3 does not support at all. Dict is supported for observations only, via 'MultiInputPolicy'.)

In [143]:
print(Tuple((Discrete(3), Box(0,1, shape=(3,3)))))

Tuple(Discrete(3), Box(0.0, 1.0, (3, 3), float32))


In [144]:
print(Tuple((Discrete(3), Box(0,1, shape=(3,3)))).sample())

(np.int64(2), array([[0.57508886, 0.3206607 , 0.20824908],
       [0.13850348, 0.6429767 , 0.4782226 ],
       [0.08177763, 0.7019443 , 0.59392303]], dtype=float32))


### 4. Dict Space
```
E.g. Dict({"height": Discrete(2), "speed": Box(0, 100, shape=(1,))})
```
| **Parameter** | **Value**                                                   | **Meaning**                                                 |
| ------------- | ----------------------------------------------------------- | ----------------------------------------------------------- |
| `spaces`      | `{"height": Discrete(2), "speed": Box(0, 100, shape=(1,))}` | A **dictionary of named spaces**, for structured data input |

Best for: Named observation components (e.g., image + metadata, or lidar + speed + GPS)

Example: structured state inputs like {"camera": image, "position": vector}

In [145]:
print(Dict({"height" : Discrete(2), "speed" : Box(0, 100, shape=(1,))}))

Dict('height': Discrete(2), 'speed': Box(0.0, 100.0, (1,), float32))


In [146]:
print(Dict({"height" : Discrete(2), "speed" : Box(0, 100, shape=(1,))}).sample())

{'height': np.int64(0), 'speed': array([51.066807], dtype=float32)}


### 5. MultiBinary Space
```
MultiBinary(n)
E.g. MultiBinary(4)
```
| **Parameter** | **Value** | **Meaning**                                 |
| ------------- | --------- | ------------------------------------------- |
| `n`           | `4`       | Number of binary variables (each is 0 or 1) |

Best for: Multiple binary options (e.g., toggles, flags, presence/absence of features)

In [147]:
print(MultiBinary(4))

MultiBinary(4)


In [148]:
print(MultiBinary(4).sample())

[0 0 0 1]


### 6. MultiDiscrete Space
```
MultiDiscrete([n₁, n₂, ..., nₖ]), where 'nvec = [n₁, n₂, ..., nₖ]'
E.g. MultiDiscrete([5, 2, 2])
```
| **Parameter** | **Value**   | **Meaning**                                                                    |
| ------------- | ----------- | ------------------------------------------------------------------------------ |
| `nvec`        | `[5, 2, 2]` | List of number of categories per variable. Each variable ranges from 0 to nᵢ−1 |

Best for: Multi-dimensional categorical actions or observations, where each slot is an independent discrete choice

In [149]:
print(MultiDiscrete([5,2,2]))

MultiDiscrete([5 2 2])


In [150]:
print(MultiDiscrete([5,2,2]).sample())

[2 1 1]


## 3. Building a Custom RL Environment and testing if it works with a baseline algorithm that takes random actions

Just for demonstration purposes, the custom RL Environment that we will build is a Shower RL Environment. At a high level:
- We want to build an RL agent that gives us the best shower possible
- At each time step, the RL agent decreases, maintains, or increases the shower temperature by 1 degree
- The ideal shower temperature is between 37 and 39 degrees (we know this detail, but our RL agent doesn't)

### What are the 5-tuple MDP components of our Custom RL Environment?

**States**/**Observations**
- a continuous (set shower temperature) value between 0 to 100 - represented by a Box Space

Type: Box(0, 100)
| Num | Observation         | Min | Max | Description                           |
| --- | ------------------- | --- | --- | ------------------------------------- |
| 0   | Shower temperature  | 0   | 100 | Range of possible shower temperatures |

**Actions**  
- 3 discrete actions - represented by a Discrete Space
1. Increase shower temperature by 1 - represented by a value of 2
2. Maintain shower temperature - represented by a value of 1
3. Decrease shower temperature by 1 - represented by a value of 0

Type: Discrete(3)
| **Index** | **Action Name**               | **Meaning**                      |
| --------- | ----------------------------- | -------------------------------- |
| 0         | Decrease Shower Temperature   | Decrease Shower Temperature by 1 |
| 1         | Maintain shower temperature   | No change to shower temperature  |
| 2         | Increase shower temperature   | Increase shower temperature by 1 |

**Transition Probability**  
- deterministic: the chosen action changes the shower temperature by -1, 0, or +1 (implemented in the RL Environment's `step()` method)

**Reward Function**  
- if shower temperature is not between 37 to 39 degrees inclusive, reward -1
- else, reward +1

**Discount Factor**
- not part of the RL Environment; it is set by the RL algorithm (PPO's `gamma` parameter, which defaults to 0.99)
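
The action meanings and reward rule described in this section can be written as two plain Python functions. This is an illustrative sketch (the names `shower_transition` and `shower_reward` are hypothetical); the RL Environment built below implements the same logic inside its `step()` method.

```python
def shower_transition(temperature, action):
    """Action 0, 1, 2 changes the temperature by -1, 0, +1 respectively."""
    return temperature + (action - 1)


def shower_reward(temperature):
    """+1 if the temperature is between 37 and 39 degrees inclusive, else -1."""
    return 1 if 37 <= temperature <= 39 else -1


temperature = 36
trajectory = []
for action in [2, 2, 1, 2, 2]:  # increase, increase, maintain, increase, increase
    temperature = shower_transition(temperature, action)
    trajectory.append((temperature, shower_reward(temperature)))

print(trajectory)
# [(37, 1), (38, 1), (38, 1), (39, 1), (40, -1)]
```

The last step overshoots the ideal range, so the reward flips from +1 to -1.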

### Building the Custom RL Environment

In [151]:
class ShowerEnv(Env):
    def __init__(self):
        super().__init__()
        # 3 discrete actions: 0 = decrease, 1 = maintain, 2 = increase the shower temperature
        self.action_space = Discrete(3)
        # A single continuous value (the shower temperature) between 0 and 100
        self.observation_space = Box(low=0.0, high=100.0, shape=(1,), dtype=np.float32)
        # Initialising the initial state of the RL Environment (same shape and dtype as in reset())
        self.state = np.array([38 + random.randint(-3, 3)], dtype=np.float32)
        self.shower_length = 60

    def step(self, action):
        # Apply action/set shower temperature (action 0, 1, 2 changes the temperature by -1, 0, +1)
        self.state += action - 1

        # Decrease 'shower_length' time
        self.shower_length -= 1

        # Calculate Reward with Reward Function
        if 37 <= self.state[0] <= 39:
            reward = 1
        else:
            reward = -1

        # The episode ends once the shower time runs out
        done = self.shower_length <= 0

        truncated = False
        info = {}

        return self.state, reward, done, truncated, info

    # Optional feature to visualise the RL agent in the RL Environment. Can be done using pygame. Won't be covered in this
    # tutorial
    def render(self):
        pass

    def reset(self, *, seed=None, options=None):
        # Let Gymnasium seed its own random number generator (note: 'random.randint' below is not affected by this seed)
        super().reset(seed=seed)

        # Re-initialising the initial state of the RL Environment
        self.state = np.array([38 + random.randint(-3, 3)], dtype=np.float32)
        self.shower_length = 60

        info = {}

        return self.state, info

In [152]:
# Understanding the state and action spaces used in the Custom Shower RL Environment
env = ShowerEnv()
print(env.observation_space)
print(env.action_space)
print(env.reset())

Box(0.0, 100.0, (1,), float32)
Discrete(3)
(array([39.], dtype=float32), {})


### Testing the Custom RL Environment if it works with a baseline algorithm that takes random actions

In [153]:
env = ShowerEnv()

episodes = 6
for episode in range(episodes):
    # Initialise starting state of the RL agent in the RL Environment before an episode, done to false, and starting 
    # episode score to 0
    obs, _ = env.reset()
    print(f"Initial State: {obs}")
    done = False
    episode_score = 0

    # During an episode:
    while not done:
        env.render()
        # RL agent determines action to take
        # - In this case, we are randomly sampling an action to take by our RL agent in the RL Environment (this line of
        #   code defines that baseline algorithm that takes random actions (instead of an RL algorithm))
        action = env.action_space.sample()
        # RL Environment generates the next state and reward gained upon taking the action in the current state
        obs, reward, terminated, truncated, info = env.step(action)
        # An episode ends when it is either terminated or truncated
        done = terminated or truncated
        # Add the reward gained upon taking the action in the current state to the cumulative episode score
        episode_score += reward

    print(f"Episode: {episode} Score: {episode_score}")

env.close()

Initial State: [35.]
Episode: 0 Score: -10
Initial State: [39.]
Episode: 1 Score: -6
Initial State: [40.]
Episode: 2 Score: -50
Initial State: [41.]
Episode: 3 Score: -60
Initial State: [37.]
Episode: 4 Score: -48
Initial State: [41.]
Episode: 5 Score: 0


## 3. Vectorise RL Environment and Train a PPO DRL algorithm in an RL Environment

### What is a Reinforcement Learning (RL) algorithm?

An RL algorithm involves an agent performing actions in an RL environment, receiving rewards or penalties based on those actions, and adjusting its behavior accordingly. This loop helps the agent improve its decision-making over time to maximize the cumulative reward.

### How does a Reinforcement Learning (RL) algorithm 'learn'?

In ML and DL, we learnt that ML/DL algorithms 'learn' by updating their weights and biases as more data is fed in, and after many iterations of training, they make accurate predictions.

**This is no different in RL.**

In RL, RL algorithms use various architectures to 'learn' by updating their parameters (e.g., table entries, or weights and biases) as they interact more with the RL Environment (via the reward mechanism). The 'learning' architecture used also defines whether an RL algorithm is a **Classical RL algorithm** or a **Deep RL (DRL) algorithm**.

**Classical RL algorithm learning architectures**  
Uses tables or simple functions:
| Type                          | Description                                                                      | Example             |
| ----------------------------- | -------------------------------------------------------------------------------- | ------------------- |
| **Tabular policy**            | Table stores the best action for each discrete state                             | `π[s] = a`          |
| **Tabular stochastic policy** | Table of probabilities for each action in each state                             | `π[a][s] = P(a \| s)` |
| **Value-based methods**       | Use a value table (e.g., Q-table) and derive policy as `π(s) = argmax Q(s,a)`    | Q-Learning          |
| **Policy iteration**          | Alternates between evaluating a policy and improving it based on value estimates | Dynamic Programming |
| **Function approximation**    | Uses linear models or tile coding to generalize across large state spaces        | `π(s) = θᵀφ(s)`     |
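
As a small illustration of the value-based row in the table above, the sketch below derives a greedy policy `π(s) = argmax Q(s, a)` from a Q-table. The state names and Q-values are hypothetical and chosen only to make the example readable.

```python
# Hypothetical Q-table: for each state, one estimated value per action
# (action 0 = decrease, 1 = maintain, 2 = increase)
q_table = {
    "too_cold": [0.1, 0.4, 0.9],
    "ideal": [0.2, 0.8, 0.3],
}


def greedy_policy(state):
    """Return the index of the action with the highest Q-value in this state."""
    action_values = q_table[state]
    return max(range(len(action_values)), key=lambda a: action_values[a])


print(greedy_policy("too_cold"), greedy_policy("ideal"))
# 2 1
```

When the water is too cold the greedy action is to increase the temperature, and when it is ideal the greedy action is to maintain it.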

**Deep RL (DRL) algorithm learning architectures**  
Uses neural networks or their variants:
- FNN/MLP
- CNN
- RNN
- LSTM
- GRU

In RL, after many iterations of training, the RL algorithm behaves better, that is, it takes better actions.

This 'learning' architecture is also called the **Policy**, which defines how the agent chooses actions based on its current state.

### What does a Vectorised RL Environment mean?
Vectorized RL Environments stack multiple independent copies of an RL Environment into a single RL Environment, so that the RL algorithm can step through several simulations at once (either one after another in a single process, or in parallel across processes). This can increase the training speed of the RL algorithm.

A non-vectorized RL Environment runs only one simulation at a time.

In Gymnasium, `gym.make()` returns a single, non-vectorized RL Environment. Stable Baselines3 trains on vectorized RL Environments, so when it is given a non-vectorized RL Environment, it automatically wraps it in a `DummyVecEnv` (a vectorized RL Environment containing a single copy), even if you don't intend to run multiple copies.

In [154]:
# When the Custom RL Environment is passed to a Stable Baselines3 algorithm, Stable Baselines3 automatically
# wraps it in a dummy vectorised RL Environment ('DummyVecEnv'), hence there is no need to vectorise it
# ourselves

### For logging purposes of the training process of the PPO DRL algorithm

In [155]:
# Stating the path where we want to store our training logs files in the local folder './Training_Project_3_Custom/logs'
log_path = os.path.join('Training_Project_3_Custom', 'logs')
print(log_path)

Training_Project_3_Custom\logs


### Creating the PPO DRL algorithm in the RL Environment

In [156]:
# What does each of the parameters in the 'PPO' DRL algorithm class mean?
# - 'policy' (e.g. 'MlpPolicy'  - refers to the learning architecture used as the policy of the RL algorithm, which in
#               or 'CnnPolicy')   this case is FNN/MLP
# - 'env'                       - refers to the RL environment to train the RL algorithm in
# - 'verbose'                   - controls how much information is printed to the console/log during training
#                                 -> 'verbose=0' means 'Silent', no output at all
#                                 -> 'verbose=1' means 'Info', shows key training events: episode rewards, updates, losses, etc.
#                                 -> 'verbose=2' means 'Debug' shows more detailed info like hyperparameters, rollout steps, and internal logs
# - 'tensorboard_log'           - the folder in which the TensorBoard training logs are written
PPO_DRL_model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [157]:
PPO?

Init signature:
PPO(
    policy: Union[str, type[stable_baselines3.common.policies.ActorCriticPolicy]],
    env: Union[gymnasium.core.Env, ForwardRef('VecEnv'), str],
    learning_rate: Union[float, Callable[[float], float]] = 0.0003,
    n_steps: int = 2048,
    batch_size: int = 64,
    n_epochs: int = 10,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
    clip_range: Union[float, Callable[[float], float]] = 0.2,
    clip_range_vf: Union[NoneType, float, Callable[[float], float]] = None,
    normalize_advantage: bool = True,
    ent_coef: float = 0.0,
    vf_coef: float = 0.5,
    max_grad_norm: float = 0.5,
    use_sde: bool = False,
    sde_sample_freq: int = -1,
    rollout_buffer_class: Optional[type[stable_baselines3.common.buffers.RolloutBuffer]] = None,
    r

### Training the PPO DRL algorithm in the RL Environment to become a PPO DRL model

Note that the number of timesteps needed to train an RL algorithm varies depending on the complexity of the RL Environment. Here, 'total_timesteps' counts individual environment steps, not episodes or training iterations.

As a rough guide, a moderately complex RL Environment such as 'Breakout-v0' should take about 100 000 to 200 000 timesteps, the simpler 'CartPole-v1' RL Environment should only take about 20 000 timesteps, and more complex RL Environments may take up to 500 000 timesteps. This tutorial's Custom Shower RL Environment is trained for 100 000 timesteps below.
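
To relate timesteps to episodes and to PPO training iterations, here is a back-of-the-envelope calculation. It assumes a single RL Environment, the fixed episode length of 60 steps from `ShowerEnv`, and PPO's default `n_steps` of 2048 shown in the signature above.

```python
import math

total_timesteps = 100_000  # value passed to learn() in this tutorial
episode_length = 60        # 'shower_length' in ShowerEnv
n_steps = 2048             # PPO's default number of steps collected per iteration

# Number of complete episodes that fit into the training budget
approx_episodes = total_timesteps // episode_length

# PPO collects n_steps timesteps per iteration, so the last iteration overshoots slightly
ppo_iterations = math.ceil(total_timesteps / n_steps)
timesteps_collected = ppo_iterations * n_steps

print(approx_episodes, ppo_iterations, timesteps_collected)
# 1666 49 100352
```

So 100 000 timesteps correspond to roughly 1 666 shower episodes and 49 PPO iterations, which is why the three terms should not be used interchangeably.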

In [165]:
PPO_DRL_model.learn(total_timesteps=100000)

Logging to Training_Project_3_Custom\logs\PPO_2
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 60       |
|    ep_rew_mean     | 53.6     |
| time/              |          |
|    fps             | 2467     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 60          |
|    ep_rew_mean          | 53.5        |
| time/                   |             |
|    fps                  | 1494        |
|    iterations           | 2           |
|    time_elapsed         | 2           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.023102708 |
|    clip_fraction        | 0.166       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.815      |
|    explained_variance 

<stable_baselines3.ppo.ppo.PPO at 0x1836d2f9890>

## 4. Save PPO DRL model

In [159]:
PPO_Model_Custom_100k = os.path.join('Training_Project_3_Custom', 'Saved RL Models', 'PPO_Model_Custom_100k')
PPO_DRL_model.save(PPO_Model_Custom_100k)



## 5. Reload PPO DRL model

In [160]:
PPO_Model_Custom_100k = os.path.join('Training_Project_3_Custom', 'Saved RL Models', 'PPO_Model_Custom_100k')
reloaded_PPO_DRL_model = PPO.load(PPO_Model_Custom_100k, env=env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


## 6. Evaluating the PPO DRL model in an RL Environment

In [161]:
# The 'evaluate_policy()' function returns a tuple,
#       (mean_reward, std_reward)
# - 'mean_reward' - refers to the mean total reward per episode across the evaluation episodes
# - 'std_reward' - refers to the standard deviation of the total reward per episode across the evaluation episodes
#                  (always 0 here, since only one episode is evaluated)
print(evaluate_policy(reloaded_PPO_DRL_model, env, n_eval_episodes=1, render=True))
env.close()

(np.float64(-60.0), np.float64(0.0))




## 7. Test the PPO DRL model in an RL Environment

To test the PPO DRL model in the Custom RL Environment, we can use the same code from the earlier section '3. Building a Custom RL Environment and testing if it works with a baseline algorithm that takes random actions' with some minor changes.

Here, instead of taking a random action at each time step in an episode, we use the PPO DRL model to predict the action at each time step.

In [164]:
env = ShowerEnv()

episodes = 6
for episode in range(episodes):
    # Initialise starting state of the RL agent in the RL Environment before an episode, done to false, and starting 
    # episode score to 0
    obs, _ = env.reset()
    print(f"Initial State: {obs}")
    done = False
    episode_score = 0

    # During an episode:
    while not done:
        env.render()
        # RL agent determines action to take
        # - Now, we are no longer randomly sampling an action to take by our RL agent in the RL Environment, but
        #   instead we are using the PPO DRL model to predict the action at each time step in an episode instead based
        #   on the current observations/states in the RL Environment
        action, _ = reloaded_PPO_DRL_model.predict(obs)
        # RL Environment generates the next state and reward gained upon taking the action in the current state
        obs, reward, terminated, truncated, info = env.step(action)
        # An episode ends when it is either terminated or truncated
        done = terminated or truncated
        # Add the reward gained upon taking the action in the current state to the cumulative episode score
        episode_score += reward

    print(f"Episode: {episode} Score: {episode_score}")

env.close()

Initial State: [38.]
Episode: 0 Score: 54
Initial State: [36.]
Episode: 1 Score: 56
Initial State: [38.]
Episode: 2 Score: 58
Initial State: [38.]
Episode: 3 Score: 56
Initial State: [39.]
Episode: 4 Score: 58
Initial State: [39.]
Episode: 5 Score: 52
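
The six episode scores printed above can be summarised as a mean and a standard deviation, which is the same kind of summary that `evaluate_policy()` returns. This sketch uses Python's `statistics` module and assumes the population standard deviation (the convention NumPy's `np.std` uses by default).

```python
import statistics

# Episode scores copied from the test run above
episode_scores = [54, 56, 58, 56, 58, 52]

mean_score = statistics.fmean(episode_scores)
std_score = statistics.pstdev(episode_scores)  # population standard deviation

print(round(mean_score, 2), round(std_score, 2))
# 55.67 2.13
```

Since the maximum possible score per episode is 60 (a reward of +1 at each of the 60 time steps), a mean of about 55.67 indicates the shower temperature was inside the ideal range for most time steps.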
