# Reinforcement Learning (RL) with Gymnasium and Stable Baselines3 Tutorial
Source:
- https://www.youtube.com/watch?v=Mut_u40Sqz4&t=6144s (YouTube video by Nicholas Renotte titled 'Reinforcement Learning in 3 Hours | Full Course using Python')

Documentation:
- Gymnasium: https://gymnasium.farama.org/ (This library provides standardized environments for developing and testing RL algorithms)
- Stable Baselines3: https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html (This library provides a suite of pre-implemented RL algorithms based on PyTorch)

# Project 1: Breakout RL Environment

### What are Atari Games RL Environments?
In Gymnasium, there is a class of RL Environments called Atari Games, which refers to the classic video games from the Atari 2600 console, such as:
- Breakout
- Pong
- Space Invaders
- Q*Bert
- Seaquest
- Montezuma's Revenge

and many more...

These games are used as benchmark RL Environments for evaluating and comparing the performance of RL algorithms.

## How is an RL Environment defined?
An RL Environment is typically modeled as a Markov Decision Process (MDP), which is defined by the 5-tuple:
```text
M = (S, A, P, R, γ)
```

Each component of the tuple is described below:

| Symbol              | Name                       | Description                                                                               |
| ------------------- | -------------------------- | ----------------------------------------------------------------------------------------- |
| $S$                 | **States**                 | The set of all possible states the agent can be in                                        |
| $A$                 | **Actions**                | The set of all possible actions the agent can take                                        |
| $P(s' \mid s, a)$   | **Transition Probability** | The probability of moving to state $s'$ after taking action $a$ in state $s$              |
| $R(s, a)$           | **Reward Function**        | The expected reward received after taking action $a$ in state $s$                         |
| $\gamma \in [0, 1]$ | **Discount Factor**        | The factor by which future rewards are discounted (controls how far-sighted the agent is) |
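
To make the tuple concrete, here is a minimal sketch of a two-state MDP in plain Python, together with the discounted return that $\gamma$ controls. The state names, action names, probabilities, and rewards are invented for illustration only; they do not come from Gymnasium or from the tutorial's source.

```python
# A toy MDP (hypothetical values, for illustration only).
S = ["cold", "hot"]
A = ["wait", "heat"]

# P[(s, a)] maps each next state s' to its probability P(s' | s, a).
P = {
    ("cold", "wait"): {"cold": 1.0},
    ("cold", "heat"): {"cold": 0.2, "hot": 0.8},
    ("hot", "wait"): {"cold": 0.5, "hot": 0.5},
    ("hot", "heat"): {"hot": 1.0},
}

# R[(s, a)] is the expected reward for taking action a in state s.
R = {
    ("cold", "wait"): 0.0,
    ("cold", "heat"): -1.0,
    ("hot", "wait"): 1.0,
    ("hot", "heat"): 0.0,
}

gamma = 0.9


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of rewards r_0, r_1, ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Every transition distribution must sum to 1.
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma))  # 1 + 0.9 + 0.81
print(discounted_return(rewards, 0.0))    # only the first reward counts
```

With $\gamma$ close to 1 the later rewards keep most of their weight (a far-sighted agent); with $\gamma = 0$ only the immediate reward matters.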

## How does Gymnasium represent each of these components of the RL Environment?
**States**/**Observations** and  **Actions**  
- Box – n-dimensional tensor, range of values (continuous values)
    ```
    E.g. Box(0, 1, shape=(3,3))
    ```
- Discrete – Set of items (discrete values)
    ```
    E.g. Discrete(3)
    ```
- Tuple – Tuple of other spaces (e.g., Box or Discrete)
    ```
    E.g. Tuple((Discrete(2), Box(0, 100, shape=(1,))))
    ```
- Dict – Dictionary of spaces (e.g., Box or Discrete)
    ```
    E.g. Dict({"height": Discrete(2), "speed": Box(0, 100, shape=(1,))})
    ```
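- MultiBinary – Fixed-length vector of independent binary (0/1) values (several entries can be 1 at the same time, so it is not limited to one-hot encodings)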
    ```
    E.g. MultiBinary(4)
    ```
- MultiDiscrete – Multiple discrete values
    ```
    E.g. MultiDiscrete([5, 2, 2])
    ```

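**Transition Probability**  
- Not exposed directly; it is encapsulated in the RL Environment's `step()` function, which returns the next observation

**Reward Function**  
- Not exposed directly; it is encapsulated in the RL Environment's `step()` function, which returns the reward

**Discount Factor**  
- Not part of the Gymnasium RL Environment; it is set on the RL algorithm instead (e.g. the `gamma` parameter of the Stable Baselines3 algorithms)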

### What is the difference between States and Observations?
RL agents only act on observations, not states. Optimal behavior of RL agents assumes knowledge of the underlying state (or estimates of it).

| **Aspect**          | **State**                                                    | **Observation**                                                 |
| ------------------- | ------------------------------------------------------------ | --------------------------------------------------------------- |
| **Definition**      | The **true internal configuration** of the environment       | The **information** the agent **receives** from the environment |
| **Completeness**    | Often assumed to be **complete** (Markov property holds)     | May be **partial**, noisy, or incomplete view of the state      |
| **Markov Property** | A true state satisfies: future depends only on current state | Observations may not satisfy the Markov property                |
| **Agent’s View**    | Agent may not have access to the full state                  | Agent always uses observations to decide actions                |
| **Example**         | All object positions, velocities, and environment internals  | Camera image, radar scan, or any sensor reading                 |

**MDP vs POMDP**
- In fully observable environments (e.g., many standard RL benchmarks), the observation is equivalent to the state. This is assumed in Markov Decision Processes (MDPs).
- In Partially Observable MDPs (POMDPs), the agent sees only observations and must infer the state using memory or belief models.
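
The difference can be sketched with a hypothetical one-dimensional ball whose full state is its position and velocity, while the agent only observes the position (much like a single video frame). The `State`, `step`, and `observe` names below are illustrative assumptions, not part of Gymnasium.

```python
from collections import namedtuple

# Hypothetical 1-D ball: the full state is (position, velocity).
State = namedtuple("State", ["position", "velocity"])


def step(state):
    """Deterministic dynamics: the ball moves by its velocity."""
    return State(state.position + state.velocity, state.velocity)


def observe(state):
    """The agent only sees the position, like a single video frame."""
    return state.position


moving_right = State(position=5, velocity=+1)
moving_left = State(position=5, velocity=-1)

# Two different states produce the same observation...
print(observe(moving_right) == observe(moving_left))  # True

# ...but lead to different futures, so a single observation is not Markov.
print(observe(step(moving_right)), observe(step(moving_left)))  # 6 4

# Two consecutive observations are enough to recover the velocity here.
previous, current = observe(moving_right), observe(step(moving_right))
print(current - previous)  # 1
```

This is the "memory" mentioned above in its simplest form: keeping a short history of observations lets the agent estimate the part of the state it cannot see directly.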

## 1. Import Dependencies

**To run the Gymnasium and Stable Baselines3 libraries, it is HIGHLY recommended to create a virtual environment and install the dependencies/requirements inside it, separately from your other projects, to prevent conflicts between libraries!**

### How to set up a virtual environment in VS Code?
1. **Create a virtual environment**
    ```bash
    python -m venv venv
    ```
    This creates a folder named venv/ containing the isolated environment.

2. **Activate the virtual environment**

    For Windows:
    ```bash
    .\venv\Scripts\activate
    ```
    For macOS/Linux:
    ```bash
    source venv/bin/activate
    ```
    You’ll know it’s activated when your terminal prompt changes to show (venv).

3. **Now you can install dependencies inside the virtual environment!**

### What dependencies/requirements to download? 

**For Gymnasium library**
```bash
pip install gymnasium
```

**For Stable Baselines3 library**
```bash
pip install stable-baselines3[extra]
```

**For ALE (Arcade Learning Environment) package**  
Newer versions of the Gymnasium library no longer include the Atari Games RL Environments by default. To use these Atari Games RL Environments with Gymnasium, you need to install a separate dependency/package, the ALE (Arcade Learning Environment) package.
```bash
pip install autorom[accept-rom-license]
pip install ale-py
```

Source(s):
- https://github.com/AndreM96/Stable_Baseline3_Gymnasium_Tutorial (AndreM96 on Github)
- https://www.youtube.com/watch?v=Mut_u40Sqz4&t=6144s (one of the comments under the YouTube video by Nicholas Renotte titled, 'Reinforcement Learning in 3 Hours | Full Course using Python')

Just for demonstration purposes, the RL algorithm that we will be using here is the Advantage Actor-Critic (A2C) DRL algorithm

In [42]:
import os
import gymnasium as gym
from ale_py import ALEInterface
from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_atari_env

## 2. Load RL Environment and testing if it works with a baseline algorithm that takes random actions

Just for demonstration purposes, the RL Environment that we will be using here is the "Breakout-v0"

In [43]:
environment_name = "Breakout-v0"
env = gym.make(environment_name, render_mode="human")

episodes = 1
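# Note: range(0, episodes+1) runs episodes+1 episodes, numbered from 0 to 'episodes' (here: episode 0 and episode 1)
for episode in range(0, episodes+1):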
    # Initialise starting state of the RL agent in the RL Environment before an episode, done to false, and starting 
    # episode score to 0
    obs, _ = env.reset()
    print(f"Initial State: {obs}")
    done = False
    episode_score = 0

    # During an episode:
    while not done:
        env.render()
        # RL agent determines action to take
        # - In this case, we are randomly sampling an action to take by our RL agent in the RL Environment (this line of
        #   code defines the baseline algorithm that takes random actions (instead of an RL algorithm))
        action = env.action_space.sample()
        # RL Environment generates the next state and reward gained upon taking the action in the current state
        obs, reward, terminated, truncated, info = env.step(action)
        # The episode is over when the game ends ('terminated') or when it is cut short, e.g. by a time limit ('truncated')
        done = terminated or truncated
        # Add the reward gained upon taking the action in the current state to the cumulative episode score
        episode_score += reward

    print(f"Episode: {episode} Score: {episode_score}")

env.close()

Initial State: [[[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 ...

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]]
Episode: 0 Score: 2.0
Initial State: [[[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 ...

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  ...
  [0 0 0]
  [0 0 0]
  [0 0 0]]]
Episode: 1 Score: 1.0
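
The loop above relies on Gymnasium's `reset()`/`step()` interface, in which `step()` reports two separate end-of-episode flags: `terminated` (the task itself ended) and `truncated` (the episode was cut short, for example by a time limit). As a sketch that runs without any Atari dependencies, the stand-in class below mimics those return shapes. `CountdownEnv` and its rules are hypothetical and only illustrate the control flow; they are not part of Gymnasium.

```python
import random


class CountdownEnv:
    """Hypothetical stand-in that mimics Gymnasium's reset()/step() return shapes."""

    def __init__(self, max_steps=10):
        self.max_steps = max_steps

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.lives = 3
        self.steps = 0
        return self.lives, {}  # (observation, info)

    def step(self, action):
        self.steps += 1
        # Action 1 is "risky": it earns a reward but may cost a life.
        reward = 1.0 if action == 1 else 0.0
        if action == 1 and self.rng.random() < 0.5:
            self.lives -= 1
        terminated = self.lives == 0              # the task itself ended
        truncated = self.steps >= self.max_steps  # a time limit cut it short
        return self.lives, reward, terminated, truncated, {}


env = CountdownEnv(max_steps=10)
obs, info = env.reset(seed=0)
policy_rng = random.Random(1)  # stands in for a random "agent"
episode_score = 0.0
done = False
while not done:
    action = policy_rng.choice([0, 1])
    obs, reward, terminated, truncated, info = env.step(action)
    episode_score += reward
    # Stop on either flag; checking only one of them can loop past the end.
    done = terminated or truncated

print(f"steps={env.steps} score={episode_score} "
      f"terminated={terminated} truncated={truncated}")
```

Combining both flags into `done` keeps the loop correct whichever way the episode ends.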


### Understanding the RL Environment

From the documentation of Gymnasium's "Breakout-v0" RL Environment (this is the image version; there is also a RAM version): https://ale.farama.org/environments/breakout/#

**States**  
Type: Box(0, 255, (210, 160, 3), uint8)

| Num | Observation      | Min | Max | Description                        |
| --- | ---------------- | --- | --- | ---------------------------------- |
| 0   | RGB image frame  | 0   | 255 | Raw screen image (pixel intensity) |


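**Actions**  
Type: Discrete(4) by default (this matches the `env.action_space` output printed below)

| **Index** | **Action Name** | **Meaning**            |
| --------- | --------------- | ---------------------- |
| 0         | NOOP            | Do nothing             |
| 1         | FIRE            | Press fire button only |
| 2         | RIGHT           | Move right             |
| 3         | LEFT            | Move left              |

Passing `full_action_space=True` to `gym.make()` enables the full Atari 2600 action set instead, in which case the type is Discrete(18):

| **Index** | **Action Name** | **Meaning**            |
| --------- | --------------- | ---------------------- |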
| 0         | NOOP            | Do nothing             |
| 1         | FIRE            | Press fire button only |
| 2         | UP              | Move joystick up       |
| 3         | RIGHT           | Move right             |
| 4         | LEFT            | Move left              |
| 5         | DOWN            | Move down              |
| 6         | UPRIGHT         | UP + RIGHT             |
| 7         | UPLEFT          | UP + LEFT              |
| 8         | DOWNRIGHT       | DOWN + RIGHT           |
| 9         | DOWNLEFT        | DOWN + LEFT            |
| 10        | UPFIRE          | UP + FIRE              |
| 11        | RIGHTFIRE       | RIGHT + FIRE           |
| 12        | LEFTFIRE        | LEFT + FIRE            |
| 13        | DOWNFIRE        | DOWN + FIRE            |
| 14        | UPRIGHTFIRE     | UP + RIGHT + FIRE      |
| 15        | UPLEFTFIRE      | UP + LEFT + FIRE       |
| 16        | DOWNRIGHTFIRE   | DOWN + RIGHT + FIRE    |
| 17        | DOWNLEFTFIRE    | DOWN + LEFT + FIRE     |

In [44]:
# Understanding the state and action spaces used in the Gymnasium's "Breakout-v0" RL Environment
print(env.observation_space)
print(env.action_space)

Box(0, 255, (210, 160, 3), uint8)
Discrete(4)


## 3. Vectorise RL Environment and Train an A2C DRL algorithm in an RL Environment

### What is a Reinforcement Learning (RL) algorithm?

An RL algorithm involves an agent performing actions in an RL environment, receiving rewards or penalties based on those actions, and adjusting its behavior accordingly. This loop helps the agent improve its decision-making over time to maximize the cumulative reward.

### How does a Reinforcement Learning (RL) algorithm 'learn'?

In ML and DL, algorithms 'learn' by updating their weights and biases as more data is fed into them, so that after many iterations of training they make accurate predictions.

**This is no different in RL.**

In RL, the algorithm 'learns' by updating its parameters (table entries, or weights and biases) as it interacts more with the RL Environment, guided by the reward mechanism. The 'learning' architecture used also defines whether an RL algorithm is a **Classical RL algorithm** or a **Deep RL (DRL) algorithm**.

**Classical RL algorithm learning architectures**  
Uses tables or simple functions:
| Type                          | Description                                                                      | Example             |
| ----------------------------- | -------------------------------------------------------------------------------- | ------------------- |
| **Tabular policy**            | Table stores the best action for each discrete state                             | `π[s] = a`          |
| **Tabular stochastic policy** | Table of probabilities for each action in each state                             | `π[a][s] = P(a \| s)` |
| **Value-based methods**       | Use a value table (e.g., Q-table) and derive policy as `π(s) = argmax Q(s,a)`    | Q-Learning          |
| **Policy iteration**          | Alternates between evaluating a policy and improving it based on value estimates | Dynamic Programming |
| **Function approximation**    | Uses linear models or tile coding to generalize across large state spaces        | `π(s) = θᵀφ(s)`     |
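
As an illustration of the value-based row above, here is a minimal sketch of tabular Q-Learning on a made-up five-state chain, where the agent starts on the left and is rewarded for reaching the rightmost state. The chain environment and the hyperparameter values (`alpha`, `gamma`, `epsilon`, number of episodes) are assumptions chosen for the example, not values from the tutorial's source.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal
ACTIONS = [0, 1]      # 0 = left, 1 = right


def step(state, action):
    """Move along the chain; reaching the last state gives reward 1 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


rng = random.Random(0)
alpha, gamma, epsilon = 0.5, 0.9, 0.2
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # the Q-table: Q[state][action]

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the table, sometimes explore.
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            best = max(Q[state])
            action = rng.choice([a for a in ACTIONS if Q[state][a] == best])
        next_state, reward, done = step(state, action)
        # Q-Learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        target = reward + (0.0 if done else gamma * max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

# Derive the greedy tabular policy: pi[s] = argmax_a Q(s, a).
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)  # [1, 1, 1, 1]
```

The learned table says "go right" in every non-terminal state. A table like this only works while the state space is small; for image observations such as Atari frames, a neural network takes the table's place.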

**Deep RL (DRL) algorithm learning architectures**  
Uses neural networks or their variants, such as:
- FNN/MLP
- CNN
- RNN
- LSTM
- GRU

In RL, the analogue of making accurate predictions is behaving better: after many iterations of training, the RL agent takes better actions.

The 'learning' architecture of an RL algorithm is also called its **Policy**, which defines how the agent chooses actions based on its current state.

### What does a Vectorised RL Environment mean?
A Vectorized RL Environment bundles several copies of an RL Environment behind a single interface, so that multiple simulations are stepped together (optionally in parallel) to increase the training speed of the RL algorithm.

A non-vectorized RL Environment is a single simulation: only one copy runs at a time.

In Gymnasium, an RL Environment created with `gym.make()` (e.g. Breakout or CartPole) is a single, non-vectorized RL Environment. Stable Baselines3 algorithms train on vectorized RL Environments, so if you pass a single RL Environment, Stable Baselines3 wraps it in a vectorized one automatically (even if you don't intend to run several simulations at once).

Since this tutorial creates the "Breakout-v0" RL Environment with the Stable Baselines3 `make_atari_env()` helper function, which returns an already vectorized RL Environment, you don't need to vectorize it manually.
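
To show what "vectorized" means mechanically, the sketch below steps several toy environments in lockstep and resets each one automatically when its episode ends, which is also how Stable Baselines3 vectorized environments behave. `CounterEnv` and `SimpleVecEnv` are hypothetical stand-ins written for this example; they run the copies one after another in a single process and are not the Stable Baselines3 implementation.

```python
class CounterEnv:
    """Hypothetical single environment: counts up and ends after `length` steps."""

    def __init__(self, length):
        self.length = length

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return self.t, float(action), done


class SimpleVecEnv:
    """Steps several environments in lockstep and auto-resets finished ones."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:
                obs = env.reset()  # start the next episode immediately
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return observations, rewards, dones


vec_env = SimpleVecEnv([CounterEnv(length=2), CounterEnv(length=3)])
print(vec_env.reset())       # [0, 0]
print(vec_env.step([1, 1]))  # ([1, 1], [1.0, 1.0], [False, False])
print(vec_env.step([0, 1]))  # ([0, 2], [0.0, 1.0], [True, False])
```

Note that everything is batched: one action per copy goes in, and one observation, reward, and done flag per copy comes out. In the last call the first copy finished its episode, so its returned observation is already the first observation of the next episode.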

To allow the running of multiple simulations at once to increase training speed of the RL algorithm, you can do so as shown below.

In [45]:
# The Stable Baselines3 'make_atari_env()' helper function helps create wrapped Atari Game RL Environments
# The 3 more important parameters are:
# - env_id (where 'environment_name' is at) - stores the RL Environment to be used
# - n_envs                                  - specifies the number of simulations of the RL Environment to run at once
# - seed                                    - controls the randomness of the RL Environments and ensures that experiments 
#                                             are reproducible by keeping the same seed
env = make_atari_env(environment_name, n_envs=4, seed=0)
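# The Stable Baselines3 'VecFrameStack' class stacks the last 'n_stack' frames (observations) of each simulation, so
# that the RL agent can infer motion (e.g. the direction of the ball) from a single stacked observation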
env = VecFrameStack(env, n_stack=4)
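
The idea behind frame stacking can be sketched with a fixed-length queue that always holds the most recent frames. The `FrameStack` class below is a plain-Python stand-in written for illustration (frames are just labels, and the stack starts from blank frames); it is not the `VecFrameStack` implementation.

```python
from collections import deque


class FrameStack:
    """Plain-Python stand-in: keeps only the most recent `n_stack` frames."""

    def __init__(self, n_stack, blank_frame):
        self.n_stack = n_stack
        self.blank_frame = blank_frame
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame):
        # Start from blank frames so the stack always has a fixed length.
        self.frames.clear()
        self.frames.extend([self.blank_frame] * (self.n_stack - 1))
        self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(frame)
        return list(self.frames)


# Frames are just labels here; real Atari frames would be image arrays.
stack = FrameStack(n_stack=4, blank_frame="blank")
print(stack.reset("f0"))  # ['blank', 'blank', 'blank', 'f0']
for name in ["f1", "f2", "f3", "f4"]:
    stacked = stack.push(name)
print(stacked)            # ['f1', 'f2', 'f3', 'f4']
```

Because the policy receives the last four frames at once, it can tell which way the ball is moving, something a single frame cannot show.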

After vectorising and increasing the number of simulations run at once, calling the 'reset()' and 'render()' functions of the RL Environment returns a single image in which the simulations are tiled next to each other. Here the 4 simulations of 210x160 pixels form a 2x2 grid, hence the shape (420, 320, 3) in the output below.

In [46]:
env.reset()
env.render()

array([[[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       ...,

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]], shape=(420, 320, 3), dtype=uint8)

### For logging purposes of the training process of the A2C DRL algorithm

In [47]:
# Stating the path where we want to store our training logs files in the local folder './Training_Project_1_Breakout/logs'
log_path = os.path.join('Training_Project_1_Breakout', 'logs')
print(log_path)

Training_Project_1_Breakout\logs


### Creating the A2C DRL algorithm in the RL Environment

In [48]:
# What does each of the parameters in the 'A2C' DRL algorithm class mean?
# - 'policy' (e.g. 'MlpPolicy'  - refers to the learning architecture used as the policy of the RL algorithm, which in this
#               or 'CnnPolicy')   case is a CNN ('CnnPolicy'), since the observations are images
# - 'env'                       - refers to the RL environment to train the RL algorithm in
# - 'verbose'                   - controls how much information is printed to the console/log during training
#                                 -> 'verbose=0' means 'Silent', no output at all
#                                 -> 'verbose=1' means 'Info', shows key training events: episode rewards, updates, losses, etc.
#                                 -> 'verbose=2' means 'Debug' shows more detailed info like hyperparameters, rollout steps, and internal logs
# - 'tensorboard_log'           - the folder where the training logs are written for TensorBoard
A2C_DRL_model = A2C('CnnPolicy', env, verbose=1, tensorboard_log=log_path)

Using cpu device
Wrapping the env in a VecTransposeImage.


In [49]:
A2C?

Init signature:
A2C(
    policy: Union[str, type[stable_baselines3.common.policies.ActorCriticPolicy]],
    env: Union[gymnasium.core.Env, ForwardRef('VecEnv'), str],
    learning_rate: Union[float, Callable[[float], float]] = 0.0007,
    n_steps: int = 5,
    gamma: float = 0.99,
    gae_lambda: float = 1.0,
    ent_coef: float = 0.0,
    vf_coef: float = 0.5,
    max_grad_norm: float = 0.5,
    rms_prop_eps: float = 1e-05,
    use_rms_prop: bool = True,
    use_sde: bool = False,
    sde_sample_freq: int = -1,
    rollout_buffer_class: Optional[type[stable_baselines3.common.buffers.RolloutBuffer]] = None,
    rollout_buffer_kwargs: Optional[dict[str, Any]] = None,
    normalize_advantage: bool = False,
    stats_window_size: int = 100,
    tensorboard_log: Option

### Training the A2C DRL algorithm in the RL Environment to become an A2C DRL model

Note that the number of timesteps needed to train an RL algorithm varies depending on the complexity of the RL Environment. A timestep here is one environment step (counted across all simulations run at once), not an episode; a single episode usually spans many timesteps.

As a rough guide, the moderately complex 'Breakout-v0' RL Environment used in this tutorial should take about 100 000 to 200 000 timesteps, compared to the simpler 'CartPole-v1' RL Environment which should only take about 20 000 timesteps, while more complex RL Environments may take up to 500 000 timesteps.
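
In Stable Baselines3, `total_timesteps` counts environment steps summed over all simulations run at once. As a back-of-the-envelope sketch, the arithmetic below shows what 100 000 timesteps amounts to in this tutorial's setup. It assumes the 4 simulations created earlier, the A2C default `n_steps=5` shown in the signature above, and one policy update per collected rollout.

```python
total_timesteps = 100_000  # value passed to learn() in this tutorial
n_envs = 4                 # simulations created by make_atari_env() earlier
n_steps = 5                # A2C default from the signature shown above

# Each rollout collects n_steps transitions from every simulation.
transitions_per_update = n_envs * n_steps
n_updates = total_timesteps // transitions_per_update
steps_per_env = total_timesteps // n_envs

print(transitions_per_update)  # 20
print(n_updates)               # 5000
print(steps_per_env)           # 25000
```

So each of the 4 simulations only advances 25 000 steps, and the policy is updated roughly 5 000 times under these assumptions.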

In [None]:
A2C_DRL_model.learn(total_timesteps=100000)

## 4. Save A2C DRL model

In [None]:
A2C_Model_Breakout_v0_100k = os.path.join('Training_Project_1_Breakout', 'Saved RL Models', 'A2C_Model_Breakout_v0_100k')
A2C_DRL_model.save(A2C_Model_Breakout_v0_100k)

## 5. Reload A2C DRL model

In [52]:
A2C_Model_Breakout_v0_100k = os.path.join('Training_Project_1_Breakout', 'Saved RL Models', 'A2C_Model_Breakout_v0_100k')
reloaded_A2C_DRL_model = A2C.load(A2C_Model_Breakout_v0_100k, env=env)

Wrapping the env in a VecTransposeImage.


## 6. Evaluating the A2C DRL model in an RL Environment

In [53]:
# Recall that previously you vectorised the 'Breakout-v0' RL Environment to run 4 simulations at once. For evaluation,
# create a separate RL Environment with only 1 simulation, wrapped the same way (frame stacking included), and pass it
# to the Stable Baselines3 'evaluate_policy()' function.
eval_env = make_atari_env('Breakout-v0', n_envs=1, seed=0)
eval_env = VecFrameStack(eval_env, n_stack=4)

# The 'evaluate_policy()' function returns a tuple,
#       (mean_reward, std_reward)
# - 'mean_reward' - refers to the mean reward throughout the episodes
# - 'std_reward' - refers to the standard deviation of the reward throughout the episodes
print(evaluate_policy(reloaded_A2C_DRL_model, eval_env, n_eval_episodes=1, render=True))
eval_env.close()

(np.float64(2.0), np.float64(0.0))
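
The standard deviation of 0.0 above is a direct consequence of evaluating a single episode: with one sample there is no spread to measure. The sketch below reproduces that summary with the standard library. The `summarize` helper is hypothetical, the second list of rewards is made up, and the use of the population standard deviation is an assumption about how the summary is computed.

```python
import statistics


def summarize(episode_rewards):
    """Mean and population standard deviation of per-episode rewards."""
    mean_reward = statistics.fmean(episode_rewards)
    std_reward = statistics.pstdev(episode_rewards)
    return mean_reward, std_reward


print(summarize([2.0]))       # (2.0, 0.0)  one episode: no spread
print(summarize([1.0, 3.0]))  # (2.0, 1.0)  same mean, but now the spread is visible
```

Increasing `n_eval_episodes` gives a more trustworthy mean and a meaningful standard deviation.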


## 7. Test the A2C DRL model in an RL Environment

To test the A2C DRL model in the Gymnasium's 'Breakout-v0' RL Environment, we can reuse the code from the earlier section '2. Load RL Environment and testing if it works with a baseline algorithm that takes random actions' with some minor changes.

Here, instead of taking a random action at each time step in an episode, we use the A2C DRL model to predict the action at each time step.

In [58]:
from stable_baselines3.common.atari_wrappers import AtariWrapper

In [59]:
train_env = make_atari_env('Breakout-v0', n_envs=1, seed=0)
train_env = VecFrameStack(train_env, n_stack=4)

render_env = gym.make('Breakout-v0', render_mode='human')
render_env = AtariWrapper(render_env)

episodes = 5
for episode in range(1, episodes+1):
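    train_obs = train_env.reset()
    render_obs, _ = render_env.reset()
    done = False
    episode_score = 0

    while not done:
        render_env.render()

        # The A2C DRL model predicts the action from the frame-stacked observation of 'train_env'
        action, _ = reloaded_A2C_DRL_model.predict(train_obs)

        # Caveat: 'train_env' and 'render_env' are two separate simulations that merely receive the same actions, so
        # they can drift apart; the score below comes from 'render_env', the simulation shown on screen
        train_obs, _, _, _ = train_env.step(action)
        render_obs, reward, terminated, truncated, info = render_env.step(action[0])
        # The episode is over when 'render_env' reports either end-of-episode flag
        done = terminated or truncated

        episode_score += reward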

    print(f"Episode: {episode} Score: {episode_score}")

# Close environments
train_env.close()
render_env.close()

Episode: 1 Score: 0.0
Episode: 2 Score: 0.0
Episode: 3 Score: 2.0
Episode: 4 Score: 0.0
Episode: 5 Score: 0.0
