# Reinforcement Learning (RL) with Gymnasium and Stable Baselines3 Tutorial (Part 1)
Source:
- https://www.youtube.com/watch?v=Mut_u40Sqz4&t=6144s (YouTube video by Nicholas Renotte titled 'Reinforcement Learning in 3 Hours | Full Course using Python')

Documentation:
- Gymnasium: https://gymnasium.farama.org/ (This library provides standardized environments for developing and testing RL algorithms)
- Stable Baselines3: https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html (This library provides a suite of pre-implemented RL algorithms based on PyTorch)

## How is an RL Environment defined?
An RL Environment is typically modeled as the 5-tuple:
```text
𝑀=(𝑆,𝐴,𝑃,𝑅,𝛾)
```

Within the framework of a Markov Decision Process (MDP), each component of this 5-tuple is defined as follows:

| Symbol              | Name                       | Description                                                                               |
| ------------------- | -------------------------- | ----------------------------------------------------------------------------------------- |
| $S$                 | **States**                 | The set of all possible states the agent can be in                                        |
| $A$                 | **Actions**                | The set of all possible actions the agent can take                                        |
| $P(s' \mid s, a)$   | **Transition Probability** | The probability of moving to state $s'$ after taking action $a$ in state $s$              |
| $R(s, a)$           | **Reward Function**        | The expected reward received after taking action $a$ in state $s$                         |
| $\gamma \in [0, 1]$ | **Discount Factor**        | The factor by which future rewards are discounted (controls how far-sighted the agent is) |
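
To make the role of the discount factor concrete, here is a minimal sketch that computes the discounted return (each reward at step `t` weighted by `gamma**t`) for a short reward sequence. The function name `discounted_return` and the sample rewards are made up for illustration and are not part of Gymnasium or Stable Baselines3.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, with the reward at step t weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


rewards = [1.0, 1.0, 1.0, 1.0]  # illustrative rewards for four consecutive steps

far_sighted = discounted_return(rewards, gamma=1.0)    # all rewards count equally
short_sighted = discounted_return(rewards, gamma=0.5)  # later rewards count less
print(far_sighted, short_sighted)
```

A smaller `gamma` shrinks the contribution of later rewards, which is what "less far-sighted" means in the table above.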

## How does Gymnasium represent each of these components of the RL Environment?
**States**/**Observations** and  **Actions**  
- Box – n-dimensional tensor, range of values (continuous values)
    ```
    E.g. Box(0, 1, shape=(3,3))
    ```
- Discrete – Set of items (discrete values)
    ```
    E.g. Discrete(3)
    ```
- Tuple – Tuple of other spaces (e.g., Box or Discrete)
    ```
    E.g. Tuple((Discrete(2), Box(0, 100, shape=(1,))))
    ```
- Dict – Dictionary of spaces (e.g., Box or Discrete)
    ```
    E.g. Dict({"height": Discrete(2), "speed": Box(0, 100, shape=(1,))})
    ```
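- MultiBinary – Fixed-length vector of independent binary (0/1) values; several entries can be 1 at once, so it is not limited to one-hot encodings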
    ```
    E.g. MultiBinary(4)
    ```
- MultiDiscrete – Multiple discrete values
    ```
    E.g. MultiDiscrete([5, 2, 2])
    ```

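**Transition Probability**  
- Not exposed directly; it is implemented inside the environment and applied each time `env.step(action)` is called

**Reward Function**  
- Not exposed directly; it is implemented inside the environment, and `env.step(action)` returns the reward for each transition

**Discount Factor**  
- Not part of a Gymnasium environment; it is set as a hyperparameter of the RL algorithm (e.g., the `gamma` argument of Stable Baselines3 algorithms)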

### What is the difference between States and Observations?
RL agents act on observations, not on the underlying states. Acting optimally, however, generally requires knowing the underlying state or maintaining an estimate of it.

| **Aspect**          | **State**                                                    | **Observation**                                                 |
| ------------------- | ------------------------------------------------------------ | --------------------------------------------------------------- |
| **Definition**      | The **true internal configuration** of the environment       | The **information** the agent **receives** from the environment |
| **Completeness**    | Often assumed to be **complete** (Markov property holds)     | May be **partial**, noisy, or incomplete view of the state      |
| **Markov Property** | A true state satisfies: future depends only on current state | Observations may not satisfy the Markov property                |
| **Agent’s View**    | Agent may not have access to the full state                  | Agent always uses observations to decide actions                |
| **Example**         | All object positions, velocities, and environment internals  | Camera image, radar scan, or any sensor reading                 |

**MDP vs POMDP**
- In fully observable environments (e.g., many standard RL benchmarks), the observation is equivalent to the state. This is assumed in Markov Decision Processes (MDPs).
- In Partially Observable MDPs (POMDPs), the agent sees only observations and must infer the state using memory or belief models.

## 1. Import Dependencies

**Before installing Gymnasium and Stable Baselines3, it is HIGHLY recommended to create a virtual environment and install the dependencies/requirements inside it, to prevent conflicts with other installed libraries!**

### How to set up a virtual environment in VS Code?
1. **Create a virtual environment**
    ```bash
    python -m venv venv
    ```
    This creates a folder named venv/ containing the isolated environment.

2. **Activate the virtual environment**

    For Windows:
    ```bash
    .\venv\Scripts\activate
    ```
    For macOS/Linux:
    ```bash
    source venv/bin/activate
    ```
    You’ll know it’s activated when your terminal prompt changes to show (venv).

3. **Now you can install dependencies inside the virtual environment!**

### What dependencies/requirements to download? 

**For Gymnasium library**
```bash
pip install gymnasium
```

**For Stable Baselines3 library**
```bash
pip install "stable-baselines3[extra]"
```
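
After installing, a quick sanity check can confirm that Python is running inside the virtual environment and that both packages are visible to it. This is a small sketch using only the standard library; the helper names `in_virtual_environment` and `installed_version` are made up for this example, and the prefix check assumes the environment was created with `python -m venv`.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def in_virtual_environment():
    """True when Python runs inside a venv (its prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


print("Inside a virtual environment:", in_virtual_environment())
for package in ("gymnasium", "stable-baselines3"):
    print(package, installed_version(package))
```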

Source(s):
- https://github.com/AndreM96/Stable_Baseline3_Gymnasium_Tutorial (AndreM96 on Github)

For demonstration purposes, the RL algorithm used here is Proximal Policy Optimization (PPO), a DRL algorithm.

In [11]:
import os
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

## 2. Load the RL Environment and test it with a baseline that takes random actions

For demonstration purposes, the RL Environment used here is "CartPole-v1".

In [12]:
environment_name = "CartPole-v1"
env = gym.make(environment_name, render_mode="human")

episodes = 5
# Note: range(0, episodes+1) runs episodes + 1 = 6 episodes (numbered 0 to 5)
for episode in range(0, episodes+1):
    # Reset the RL Environment before each episode, and initialise 'done' to False and the starting episode score to 0
    # - env.reset() returns an (observation, info) tuple
    state = env.reset()
    print(f"Initial State: {state}")
    done = False
    episode_score = 0

    # During an episode:
    while not done:
        # With render_mode="human", Gymnasium renders automatically on each step, so this call is optional
        env.render()
        # RL agent determines action to take
        # - Here we randomly sample an action from the action space; this line defines the baseline that takes
        #   random actions (instead of an RL algorithm)
        action = env.action_space.sample()
        # RL Environment generates the next state and the reward gained upon taking the action in the current state
        # - 'terminated' means the episode ended in a terminal state; 'truncated' means a time limit was reached
        n_state, reward, terminated, truncated, info = env.step(action)
        # The episode is over when either flag is set
        done = terminated or truncated
        # Add the reward gained upon taking the action in the current state to the cumulative episode score
        episode_score += reward

    print(f"Episode: {episode} Score: {episode_score}")

env.close()

Initial State: (array([ 0.0164126 ,  0.02874528, -0.01213397, -0.00579226], dtype=float32), {})
Episode: 0 Score: 30.0
Initial State: (array([ 0.00957305,  0.00781111, -0.02901817, -0.03707879], dtype=float32), {})
Episode: 1 Score: 30.0
Initial State: (array([ 0.0052368 ,  0.03448069, -0.00121296, -0.03387132], dtype=float32), {})
Episode: 2 Score: 12.0
Initial State: (array([ 0.00475164, -0.01795405,  0.0135004 , -0.03343001], dtype=float32), {})
Episode: 3 Score: 12.0
Initial State: (array([-0.00023687,  0.01651358, -0.01636876,  0.00142939], dtype=float32), {})
Episode: 4 Score: 25.0
Initial State: (array([0.01125989, 0.01252995, 0.01655364, 0.00256611], dtype=float32), {})
Episode: 5 Score: 19.0


### Understanding the RL Environment

From Gymnasium's "CartPole-v1" RL Environment documentation: https://gymnasium.farama.org/environments/classic_control/cart_pole/

**States**  
Type: Box(4)
| Num | Observation           | Min               | Max             |
| --- | --------------------- | ----------------- | --------------- |
| 0   | Cart Position         | -4.8              | 4.8             |
| 1   | Cart Velocity         | -Inf              | Inf             |
| 2   | Pole Angle            | -0.418 rad (-24°) | 0.418 rad (24°) |
| 3   | Pole Angular Velocity | -Inf              | Inf             |

**Actions**  
Type: Discrete(2)
| Num | Action                 |
| --- | ---------------------- |
| 0   | Push cart to the left  |
| 1   | Push cart to the right |

Note:
The amount by which the applied force reduces or increases the cart's velocity is not fixed; it depends on the angle at which the pole is pointing. This is because the pole's center of gravity changes the amount of energy needed to move the cart underneath it.

In [13]:
# Understanding the state and action spaces used in the Gymnasium's "CartPole-v1" RL Environment
print(env.observation_space)
print(env.action_space)

Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)
Discrete(2)


## 3. Vectorise the RL Environment and Train a PPO DRL algorithm in an RL Environment

### What is a Reinforcement Learning (RL) algorithm?

An RL algorithm involves an agent performing actions in an RL environment, receiving rewards or penalties based on those actions, and adjusting its behavior accordingly. This loop helps the agent improve its decision-making over time to maximize the cumulative reward.

### How does a Reinforcement Learning (RL) algorithm 'learn'?

In ML and DL, we learnt that ML/DL algorithms 'learn' by updating their weights and biases as more data is fed into them, so that after many iterations of training they make accurate predictions.

**This is no different in RL.**

In RL, algorithms use various architectures to 'learn' by updating their parameters (table entries, or weights and biases) as they interact more with the RL Environment (via the reward mechanism). The 'learning' architecture used also determines whether an RL algorithm is a **Classical RL algorithm** or a **Deep RL (DRL) algorithm**.

**Classical RL algorithm learning architectures**  
Uses tables or simple functions:
| Type                          | Description                                                                      | Example             |
| ----------------------------- | -------------------------------------------------------------------------------- | ------------------- |
| **Tabular policy**            | Table stores the best action for each discrete state                             | `π[s] = a`          |
| **Tabular stochastic policy** | Table of probabilities for each action in each state                             | `π[a][s] = P(a \| s)` |
| **Value-based methods**       | Use a value table (e.g., Q-table) and derive policy as `π(s) = argmax Q(s,a)`    | Q-Learning          |
| **Policy iteration**          | Alternates between evaluating a policy and improving it based on value estimates | Dynamic Programming |
| **Function approximation**    | Uses linear models or tile coding to generalize across large state spaces        | `π(s) = θᵀφ(s)`     |
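
To illustrate the value-based row of the table, here is a minimal tabular Q-learning sketch on a made-up 5-state chain (the agent starts at state 0 and receives a reward of 1 on reaching state 4). The toy environment, the `step` function, and the hyperparameter values are assumptions chosen for illustration; actions are picked uniformly at random during learning, and the greedy policy is then read off the Q-table as `argmax Q(s, a)`.

```python
import numpy as np

n_states, n_actions = 5, 2  # actions: 0 = move left, 1 = move right
goal = n_states - 1
alpha, gamma = 0.5, 0.9     # learning rate and discount factor (illustrative values)
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))  # the Q-table: one value per (state, action)


def step(state, action):
    """Deterministic chain dynamics with a reward of 1 for reaching the goal."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if next_state == goal else 0.0
    return next_state, reward


for _ in range(200):  # episodes
    state = 0
    while state != goal:
        action = int(rng.integers(n_actions))  # random exploration
        next_state, reward = step(state, action)
        # Q-learning update; Q[goal] stays 0, so no bootstrap beyond the terminal state
        td_target = reward + gamma * Q[next_state].max()
        Q[state, action] += alpha * (td_target - Q[state, action])
        state = next_state

greedy_policy = Q.argmax(axis=1)  # pi(s) = argmax_a Q(s, a)
print(greedy_policy[:goal])
```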

**Deep RL (DRL) algorithm learning architectures**  
Uses neural networks or their variants:
- FNN/MLP
- CNN
- RNN
- LSTM
- GRU

As in ML/DL, an RL algorithm improves after many iterations of training; rather than making more accurate predictions, it behaves better, i.e. it takes better actions.

These RL algorithm 'learning' architectures are also called the **Policy**, which defines how the agent chooses actions based on its current state.
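
As a rough picture of what an MLP policy computes, the sketch below maps a 4-dimensional observation to a probability for each of 2 actions using a tiny, randomly initialised network written in NumPy. It is not how Stable Baselines3 implements `MlpPolicy` (which is built on PyTorch); the layer sizes and names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden_dim, n_actions = 4, 16, 2  # sizes chosen for illustration

# Randomly initialised weights and biases: the parameters that training would adjust
W1, b1 = rng.normal(scale=0.1, size=(obs_dim, hidden_dim)), np.zeros(hidden_dim)
W2, b2 = rng.normal(scale=0.1, size=(hidden_dim, n_actions)), np.zeros(n_actions)


def policy(observation):
    """Map an observation to a probability for each action."""
    hidden = np.tanh(observation @ W1 + b1)
    logits = hidden @ W2 + b2
    exp_logits = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp_logits / exp_logits.sum()


observation = np.array([0.01, 0.02, -0.01, 0.0])
action_probabilities = policy(observation)
action = rng.choice(n_actions, p=action_probabilities)  # sample an action from the policy
print(action_probabilities, action)
```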

### What does a Vectorised RL Environment mean?
Vectorized RL Environments bundle multiple copies of an RL Environment behind a single interface, so that several simulations can be stepped together (and, depending on the wrapper, in parallel) to increase the training speed of the RL algorithm.

A non-vectorized RL Environment runs only one simulation at a time.

Environments created with `gym.make` (such as "CartPole-v1") are single, non-vectorized RL Environments. Stable Baselines3 algorithms work with vectorized RL Environments internally, even if you don't intend to run several copies; if a non-vectorized RL Environment is passed in, Stable Baselines3 wraps it in a `DummyVecEnv` automatically.

To make this step explicit, "CartPole-v1" is vectorized manually here, as shown below.

In [None]:
environment_name = "CartPole-v1"
env = gym.make(environment_name, render_mode="human")

# "CartPole-v1" is a non-vectorized RL Environment, so we wrap it in a 'DummyVecEnv' object, which converts it into a
# vectorized RL Environment
# - 'DummyVecEnv' takes a list of functions that each return an RL Environment, and steps them one after another in
#   the current process
env = DummyVecEnv([lambda: env])

### Setting up logging for the training process of the PPO DRL algorithm

In [None]:
# Path of the local folder where the training log files will be stored: './Training_Tutorial/logs'
log_path = os.path.join('Training_Tutorial', 'logs')
print(log_path)

Training\logs


### Creating the PPO DRL algorithm in the RL Environment

In [None]:
# What does each of the parameters in the 'PPO' DRL algorithm class mean?
# - 'policy' (e.g. 'MlpPolicy' or 'CnnPolicy')
#                               - refers to the learning architecture used as the policy of the RL algorithm, which in
#                                 this case is an FNN/MLP
# - 'env'                       - refers to the RL environment to train the RL algorithm in
# - 'verbose'                   - controls how much information is printed to the console during training
#                                 -> 'verbose=0' means no output at all
#                                 -> 'verbose=1' means info messages (e.g. the device used and the training statistics tables)
#                                 -> 'verbose=2' means debug messages
# - 'tensorboard_log'           - the folder in which the TensorBoard training logs are saved
PPO_DRL_model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)

Using cpu device


In [16]:
PPO?

Init signature:
PPO(
    policy: Union[str, type[stable_baselines3.common.policies.ActorCriticPolicy]],
    env: Union[gymnasium.core.Env, ForwardRef('VecEnv'), str],
    learning_rate: Union[float, Callable[[float], float]] = 0.0003,
    n_steps: int = 2048,
    batch_size: int =

### Training the PPO DRL algorithm in the RL Environment to become a PPO DRL model

Note that the number of timesteps (`total_timesteps`, i.e. the total number of environment steps, which is not the same as the number of episodes) needed to train an RL algorithm varies depending on the complexity of the RL Environment.

For this tutorial's RL Environment, 'CartPole-v1', it takes about 20 000 timesteps, but for more complex RL Environments it may take up to 500 000 timesteps.
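
The relationship between timesteps and the `iterations` counter in the training log can be worked out with simple arithmetic. This sketch assumes, based on the log format shown in this tutorial (2048 timesteps after iteration 1, 4096 after iteration 2), that each PPO iteration collects `n_steps` steps from each environment copy and that training stops after the first full rollout that reaches `total_timesteps`.

```python
import math

total_timesteps = 20_000  # value passed to learn()
n_steps = 2048            # PPO default shown in the signature above
n_envs = 1                # a single CartPole copy inside the DummyVecEnv

steps_per_iteration = n_steps * n_envs
iterations = math.ceil(total_timesteps / steps_per_iteration)
timesteps_collected = iterations * steps_per_iteration
print(iterations, timesteps_collected)
```

Under these assumptions, the run would finish after 10 iterations with slightly more than the requested 20 000 timesteps.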

In [None]:
PPO_DRL_model.learn(total_timesteps=20000)

Logging to Training\logs\PPO_1
-----------------------------
| time/              |      |
|    fps             | 46   |
|    iterations      | 1    |
|    time_elapsed    | 44   |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 44          |
|    iterations           | 2           |
|    time_elapsed         | 91          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008763621 |
|    clip_fraction        | 0.094       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.687      |
|    explained_variance   | -0.0146     |
|    learning_rate        | 0.0003      |
|    loss                 | 5.65        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0143     |
|    value_loss           | 47          |
-----------------------------------------
---

<stable_baselines3.ppo.ppo.PPO at 0x231edf9fd10>

## 4. Save PPO DRL model

In [None]:
PPO_Model_CartPole_v1_20k = os.path.join('Training_Tutorial', 'Saved RL Models', 'PPO_Model_CartPole_v1_20k')
PPO_DRL_model.save(PPO_Model_CartPole_v1_20k)