# **Lab assignment 1: Introduction to Gym Library**

In this notebook, you will:

- Understand Reinforcement Learning (RL) and its core concepts.
- Learn the Gymnasium library for developing and benchmarking RL algorithms.
- Install and explore Gymnasium environments on your system.
- Run your first Gymnasium environment and interact with it using Python.

Gymnasium (previously OpenAI Gym) is a library designed for developing and assessing reinforcement learning algorithms. It offers a set of standard environments to test RL models, such as classic control problems, Atari games, and robotics simulations.

Key features of Gymnasium:
- A standardized API for RL environments.
- A wide range of built-in environments for testing algorithms.
- Seamless integration with deep learning frameworks like TensorFlow and PyTorch.

## **Section 1: Introduction to Reinforcement Learning**

Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with an environment. 
The agent takes actions in an environment to maximize cumulative rewards. The key components of an RL system include:

- **Agent**: The entity that learns and makes decisions.
- **Environment**: The system in which the agent operates.
- **State**: A representation of the environment at a given time.
- **Action**: A move the agent can make; the set of all possible moves is the action space.
- **Reward**: Feedback from the environment that guides the learning process.
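
To make these five terms concrete, here is a minimal sketch of the agent-environment loop in plain Python. The corridor world, its reward values, and the names `step` and `run_episode` are all invented for illustration; none of them are part of Gymnasium.

```python
import random

GOAL = 4  # right-most cell of a hypothetical 5-cell corridor (cells 0..4)


def step(state, action):
    """Environment: apply an action (-1 = left, +1 = right) to the state.

    Returns the next state, the reward, and whether the episode ended.
    """
    next_state = min(max(state + action, 0), GOAL)  # stay inside the corridor
    reached_goal = next_state == GOAL
    reward = 10 if reached_goal else -1  # small penalty for every extra move
    return next_state, reward, reached_goal


def run_episode(rng, max_steps=100):
    """Agent: choose random actions and accumulate the reward it receives."""
    state = 0  # the agent starts in the left-most cell
    total_reward = 0
    for _ in range(max_steps):
        action = rng.choice([-1, 1])  # no learning yet: pick uniformly at random
        state, reward, done = step(state, action)
        total_reward += reward
        if done:
            break
    return state, total_reward


final_state, total = run_episode(random.Random(0))
print("Final state:", final_state, "| cumulative reward:", total)
```

The best possible episode here walks straight to the goal and collects -1 -1 -1 +10 = 7; a random agent will usually do worse, which is the gap that learning is meant to close.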

RL is used in many real-world applications, including robotics, game playing (like AlphaGo and Deep Q-Networks for Atari), and self-driving cars.

## **Section 2: Gymnasium: The RL Benchmarking Library**

Gymnasium is a project that offers an API (application programming interface) for single-agent reinforcement learning environments, featuring implementations of popular environments such as cartpole, pendulum, mountain-car, mujoco, atari, and more. This Jupyter Notebook Lab will cover the basics of using Gymnasium, including its four main functions: `make()`, `Env.reset()`, `Env.step()`, and `Env.render()`.

At the core of Gymnasium is the `Env` class, a high-level Python class that represents a Markov Decision Process (MDP) based on reinforcement learning theory (note: it is not a complete reconstruction and omits several MDP components). This class allows users to generate an initial state, transition to new states based on actions, and visualize the environment. In addition to Env, Wrapper classes are provided to help modify or augment the environment, particularly the agent's observations, rewards, and actions.

Gymnasium offers an interface through which an agent interacts with the environment using the `step()` function and observes the resulting changes.
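
The calling convention can be sketched without installing anything. The classes below are hypothetical stand-ins, not Gymnasium's `Env` or `Wrapper`: they only imitate the shape of the return values used in this notebook's code cells, where `reset()` returns `(observation, info)` and `step()` returns `(observation, reward, terminated, truncated, info)`. `CountdownEnv` and `ScaledReward` are made-up names; the second one imitates what a reward-modifying wrapper does.

```python
class CountdownEnv:
    """Toy environment: the observation counts down from `start` to 0."""

    def __init__(self, start=3):
        self.start = start
        self.counter = start

    def reset(self):
        """Generate the initial state; returns (observation, info)."""
        self.counter = self.start
        return self.counter, {}

    def step(self, action):
        """Advance one timestep; the action is ignored in this toy example."""
        self.counter -= 1
        terminated = self.counter == 0  # the task itself is finished
        truncated = False               # no external step limit in this sketch
        reward = 1.0 if terminated else 0.0
        return self.counter, reward, terminated, truncated, {}

    def render(self):
        """Return a text picture of the current state."""
        return "#" * self.counter


class ScaledReward:
    """Wrapper sketch: forwards every call but rescales the reward."""

    def __init__(self, env, scale):
        self.env = env
        self.scale = scale

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return obs, reward * self.scale, terminated, truncated, info

    def render(self):
        return self.env.render()


env = ScaledReward(CountdownEnv(start=3), scale=10)
observation, info = env.reset()
done = False
history = []
while not done:
    observation, reward, terminated, truncated, info = env.step(action=0)
    history.append((observation, reward))
    done = terminated or truncated
print(history)  # [(2, 0.0), (1, 0.0), (0, 10.0)]
```

Because the wrapper exposes the same three methods as the environment it wraps, the interaction loop does not need to know whether it is talking to the raw environment or a modified one.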

## **Section 3: Install and explore Gym environments**

To install Gymnasium on a server or local machine, run:

In [1]:
# YOUR CODE HERE


To install using a Notebook like Google’s Colab, use:

In [2]:
# YOUR CODE HERE
!pip install gymnasium






The command above installs Gymnasium along with the appropriate versions of its dependencies.

Note: The Gymnasium RL library is officially supported on Linux and macOS, but it can also be installed on Windows. Follow this video for a step-by-step guide: [Install Gymnasium (OpenAI Gym) on Windows](https://www.youtube.com/watch?v=gMgj4pSHLww&list=PL58zEckBH8fCt_lYkmayZoR9XfDCW9hte)

Once the installation is complete, verify the setup by importing the Gymnasium library and displaying its version using:

In [3]:
# YOUR CODE HERE
import gymnasium as gym
print(gym.__version__)



1.2.3


### **Exploring Gymnasium environments**

As of November 2024, Gymnasium offers more than 60 built-in environments. To explore the available environments, you can use the `gym.envs.registry.keys()` function, as demonstrated in the example below:

In [4]:
# YOUR CODE HERE

for env in gym.envs.registry.keys():
    print(env)

CartPole-v0
CartPole-v1
MountainCar-v0
MountainCarContinuous-v0
Pendulum-v1
Acrobot-v1
phys2d/CartPole-v0
phys2d/CartPole-v1
phys2d/Pendulum-v0
LunarLander-v3
LunarLanderContinuous-v3
BipedalWalker-v3
BipedalWalkerHardcore-v3
CarRacing-v3
Blackjack-v1
FrozenLake-v1
FrozenLake8x8-v1
CliffWalking-v1
CliffWalkingSlippery-v1
Taxi-v3
tabular/Blackjack-v0
tabular/CliffWalking-v0
Reacher-v2
Reacher-v4
Reacher-v5
Pusher-v2
Pusher-v4
Pusher-v5
InvertedPendulum-v2
InvertedPendulum-v4
InvertedPendulum-v5
InvertedDoublePendulum-v2
InvertedDoublePendulum-v4
InvertedDoublePendulum-v5
HalfCheetah-v2
HalfCheetah-v3
HalfCheetah-v4
HalfCheetah-v5
Hopper-v2
Hopper-v3
Hopper-v4
Hopper-v5
Swimmer-v2
Swimmer-v3
Swimmer-v4
Swimmer-v5
Walker2d-v2
Walker2d-v3
Walker2d-v4
Walker2d-v5
Ant-v2
Ant-v3
Ant-v4
Ant-v5
Humanoid-v2
Humanoid-v3
Humanoid-v4
Humanoid-v5
HumanoidStandup-v2
HumanoidStandup-v4
HumanoidStandup-v5
GymV21Environment-v0
GymV26Environment-v0


You can also visit the [Gymnasium homepage](https://gymnasium.farama.org/), where the left-hand column contains links to all the available environments. Each environment’s webpage provides details, such as its actions, states, and other information.

The environments are grouped into categories like Classic Control, Box2D, and more. Below are some common environments in each category:

* **Classic Control**: These are standard environments commonly used in RL research, providing a balance of complexity and simplicity to test and benchmark RL algorithms. Some classic control environments in Gymnasium include:
    * Acrobot
    * Cart Pole
    * Mountain Car Discrete
    * Mountain Car Continuous
    * Pendulum

* **Box2D**: Box2D is a 2D physics engine used in games. Environments based on this engine include simple games like:
    * Lunar Lander
    * Car Racing

* **ToyText**: These small and simple environments are typically used to debug RL algorithms. Many of them are based on grid world models or basic card games. Examples include:
    * Blackjack
    * Taxi
    * Frozen Lake

* **MuJoCo**: Multi-Joint dynamics with Contact (MuJoCo) is an open-source physics engine that simulates environments for applications such as robotics, biomechanics, and machine learning. MuJoCo environments in Gymnasium include:
    * Ant
    * Hopper
    * Humanoid
    * Swimmer
    * And more

In addition to the built-in environments, Gymnasium supports integration with numerous external environments using the same API.

## **Section 4: First Gymnasium environment**

In this section, we will explore the `Taxi` environment from the Gymnasium library. Taxi is one of its many built-in environments; the goal is to pick up a passenger and drop them off at their destination in the fewest possible moves. In this lab, you will start by working with a taxi agent that takes random actions until the episode ends.

<p align="center">
    <img src="images/taxi_env.png" align="center" width="300">
</p>

**Description**

There are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger’s location, picks up the passenger, drives to the passenger’s destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.

**Actions**

There are 6 discrete deterministic actions:

0: move south

1: move north

2: move east

3: move west

4: pickup passenger

5: drop off passenger

**Observations**

There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations.

Note that only 400 of these states can actually be reached during an episode. The missing states correspond to situations in which the passenger is at the same location as their destination, as this typically signals the end of an episode. Four additional states can be observed right after a successful episode, when both the passenger and the taxi are at the destination. This gives a total of 404 reachable discrete states.
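
The counts above can be checked by brute-force enumeration. This is a sketch of the counting argument only; it does not query Gymnasium, and the variable names are our own.

```python
from itertools import product

ROWS, COLS = 5, 5
PASSENGER_LOCATIONS = range(5)  # 0-3 are R, G, Y, B; 4 means "in taxi"
DESTINATIONS = range(4)         # R, G, Y, B

# Every combination of taxi position, passenger location and destination.
all_states = list(product(range(ROWS), range(COLS), PASSENGER_LOCATIONS, DESTINATIONS))

# States where the passenger already waits at the destination do not occur
# while an episode is still running.
during_episode = [s for s in all_states if s[2] != s[3]]

# Right after a successful drop-off, taxi and passenger are both at the
# destination: one extra state per destination.
after_success = len(DESTINATIONS)

print(len(all_states))                      # 500
print(len(during_episode))                  # 400
print(len(during_episode) + after_success)  # 404
```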

Each state is represented by the tuple: (taxi_row, taxi_col, passenger_location, destination)

An observation is an integer that encodes the corresponding state. The state tuple can then be decoded with the “decode” method.
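
The integer can be read as a mixed-radix number with bases 5, 5, 5, and 4, one digit per tuple component. The sketch below reproduces that arithmetic in plain Python so the mapping can be inspected by hand; it assumes the encoding documented for Taxi, and `encode_state` and `decode_state` are illustrative names rather than the environment's own methods.

```python
def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into one integer in [0, 499]."""
    index = taxi_row
    index = index * 5 + taxi_col
    index = index * 5 + passenger_location
    index = index * 4 + destination
    return index


def decode_state(index):
    """Invert encode_state: return (taxi_row, taxi_col, passenger_location, destination)."""
    index, destination = divmod(index, 4)
    index, passenger_location = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_location, destination


state = encode_state(3, 1, 2, 0)
print(state)                # 328
print(decode_state(state))  # (3, 1, 2, 0)
```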

**Passenger locations**:

0: R(ed)

1: G(reen)

2: Y(ellow)

3: B(lue)

4: in taxi

**Destinations**:

0: R(ed)

1: G(reen)

2: Y(ellow)

3: B(lue)

For more information about the Taxi environment, visit the [Gym documentation](https://www.gymlibrary.dev/environments/toy_text/taxi/).

After installation, we can load the Taxi environment and display its appearance using the 'ansi' render mode:

In [5]:
# YOUR CODE HERE

env = gym.make("Taxi-v3", render_mode="ansi")
state, info = env.reset()
print(env.render())


+---------+
|[34;1mR[0m: | : :G|
| : | : : |
| : : : :[43m [0m|
| | : | : |
|Y| : |[35mB[0m: |
+---------+




* The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
* The pipe ("|") represents a wall which the taxi cannot cross.
* R, G, Y, B are the possible pickup and destination locations. The blue letter represents the current passenger pick-up location, and the purple letter is the current destination.

Now, visualize the Taxi environment using the 'human' render mode:

In [6]:
import sys
!{sys.executable} -m pip install pygame







In [7]:
# YOUR CODE HERE
#env = gym.make("Taxi-v3", render_mode="human")
#state, info = env.reset()
#env.render()


The core interface in Gymnasium is the `Env` class, which provides a unified way to interact with every environment. The following `env` methods will be particularly useful:

* **env.reset()**: Resets the environment and returns the initial observation (chosen at random in Taxi) together with an `info` dictionary.
* **env.step(action)**: Advances the environment by one timestep. It returns:
    * **observation**: The state of the environment after the action.
    * **reward**: A number indicating how beneficial the action was.
    * **terminated**: Signals that the episode has reached a terminal state (e.g., the passenger was dropped off at the destination).
    * **truncated**: Signals that the episode was cut short by an external condition, such as a step limit.
    * **info**: A dictionary with additional diagnostic information, useful for debugging.
* **env.render()**: Renders a single frame of the environment, useful for visualization.

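**Note**: `gym.make("Taxi-v3")` applies a default limit of 200 steps per episode, after which `step()` reports `truncated=True`. The examples in this lab keep that limit. When direct access to the underlying environment is needed (for example, to call `encode` or to set the state by hand), they use `env.unwrapped`.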
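
The step limit mentioned in the note can be pictured as a counter wrapped around the environment. The sketch below is a simplified stand-in written for this lab, not Gymnasium's own time-limit wrapper: `NeverEndingEnv` and `StepLimit` are hypothetical names, and the five-value return of `step()` follows the convention used in this notebook's code cells.

```python
class NeverEndingEnv:
    """Toy environment that never finishes on its own."""

    def reset(self):
        return 0, {}

    def step(self, action):
        # observation, reward, terminated, truncated, info
        return 0, -1, False, False, {}


class StepLimit:
    """Report truncated=True once `max_steps` steps have been taken."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_steps:
            truncated = True  # stopped by the limit, not by solving the task
        return obs, reward, terminated, truncated, info


env = StepLimit(NeverEndingEnv(), max_steps=200)
obs, info = env.reset()
steps, done = 0, False
while not done:
    obs, reward, terminated, truncated, info = env.step(0)
    steps += 1
    done = terminated or truncated
print(steps, terminated, truncated)  # 200 False True
```

This is why the interaction loops in this lab test `terminated or truncated`: an episode can stop either because the task ended or because the limit was hit.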

### **The Taxi Environment Problem**

Let’s explore the environment in more detail. Reset the Taxi environment, render its initial state, and display its action space and state space.

In [None]:
# YOUR CODE HERE

env = gym.make("Taxi-v3", render_mode="ansi")
state, info = env.reset()
print("Initial Environment:")
print(env.render())

# Action space
print("Action Space:", env.action_space)
print("Number of actions:", env.action_space.n)

# State space
print("State Space:", env.observation_space)
print("Number of states:", env.observation_space.n)


Initial Environment:
+---------+
|[35mR[0m: | : :[34;1mG[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B:[43m [0m|
+---------+


Action Space: Discrete(6)
Number of actions: 6
State Space: Discrete(500)
Number of states: 500


Based on the printed output, we can confirm that the action space has 6 possible actions and the state space consists of 500 states. Each state is identified by a unique integer from 0 to 499, and the agent selects an action from 0 to 5, as described earlier.

The 500 states represent a combination of the taxi's location, the passenger's location, and the destination.

Reinforcement learning works by mapping states to optimal actions through exploration. The agent interacts with the environment, takes actions, and learns based on the rewards assigned by the environment.

The best action for each state is the one that maximizes the cumulative long-term reward.
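
As a small worked example of cumulative reward, assume Taxi's reward scheme of -1 per step, -10 for an illegal pickup or drop-off, and +20 for a successful delivery. The two reward sequences below are invented for illustration; they were not recorded from the environment.

```python
def cumulative_reward(rewards):
    """Sum the rewards collected over one episode."""
    return sum(rewards)


# Direct route: 8 moves, a legal pickup (-1 like any other step),
# 6 more moves, then the delivery.
efficient = [-1] * 8 + [-1] + [-1] * 6 + [20]

# Wandering route: the same delivery, but with 30 extra steps
# and 2 illegal drop-offs along the way.
wasteful = [-1] * 38 + [-10, -10] + [-1] * 7 + [20]

print(cumulative_reward(efficient))  # 5
print(cumulative_reward(wasteful))   # -45
```

Both episodes end with the same +20, so the difference in return comes entirely from how many penalties were collected on the way.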

### **Illustration**

Take the scenario from the illustration, encode its state, and pass it to the environment for rendering. The taxi is located at row 3, column 1, with the passenger at location 2 and the destination at location 0. Using the Taxi environment's state encoding method, `env.unwrapped.encode`, we can do this as follows:

In [None]:
# YOUR CODE HERE
env = gym.make("Taxi-v3", render_mode="ansi")
env.reset()

# encode: (taxi_row, taxi_col, passenger_location, destination)
state = env.unwrapped.encode(3, 1, 2, 0)
env.unwrapped.s = state

print("Encoded state:", state)
print(env.render())

Encoded state: 328
+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| |[43m [0m: | : |
|[34;1mY[0m| : |B: |
+---------+




We are using our illustration's coordinates to generate a number corresponding to a state between 0 and 499, which turns out to be 328 for our illustration's state.

Using this encoded number, we can manually set the environment’s state with `env.unwrapped.s`. You can experiment with different numbers and observe how the taxi, passenger, and destination change positions.

### **The Reward Table**

When the Taxi environment is initialized, a reward table called `P` is automatically generated. This table can be viewed as a **states × actions** matrix, with one row per state and one column per action.

Since every state is included in this matrix, we can examine the default reward values assigned to the state in our illustration.

The dictionary in the reward table follows the structure `{action: [(probability, next_state, reward, done)]}`.  

Key points to note:  

- The numbers 0-5 correspond to the possible actions the taxi can take in the current state: **south, north, east, west, pickup, and dropoff**.  
- In this environment, the probability value is always **1.0**.  
- `next_state` represents the state the agent will transition to after taking the specified action.  
- Movement actions (south, north, east, west) result in a **-1 reward**, while pickup and dropoff actions give a **-10 reward** in this particular state. However, if the taxi has a passenger and is at the correct drop-off location, the dropoff action (5) would yield a **+20 reward**.  
- `done` indicates whether the episode has ended, meaning the passenger has been successfully dropped off at the correct location.  

**Note:** If the agent chooses action **3 (west)** in this state, it will attempt to move into a wall. The environment prevents the taxi from crossing walls, so it remains in the same position while still receiving the **-1 penalty**; repeating the move only lowers its long-term reward.
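
To see where such entries come from, the sketch below rebuilds the transitions for state 328 from the grid map in plain Python. It is a simplified re-derivation under the rules described in this lab (deterministic moves, walls drawn as `|`), not the environment's own code, and `transitions_for` and the other names are illustrative.

```python
MAP = [
    "+---------+",
    "|R: | : :G|",
    "| : | : : |",
    "| : : : : |",
    "| | : | : |",
    "|Y| : |B: |",
    "+---------+",
]
LOCATIONS = [(0, 0), (0, 4), (4, 0), (4, 3)]  # R, G, Y, B as (row, col)
IN_TAXI = 4


def encode(row, col, passenger, destination):
    return ((row * 5 + col) * 5 + passenger) * 4 + destination


def transitions_for(row, col, passenger, destination):
    """Return {action: [(probability, next_state, reward, done)]} for one state."""
    table = {}
    for action in range(6):
        new_row, new_col, new_passenger = row, col, passenger
        reward, done = -1, False  # default: every step costs 1
        if action == 0:    # south
            new_row = min(row + 1, 4)
        elif action == 1:  # north
            new_row = max(row - 1, 0)
        elif action == 2 and MAP[1 + row][2 * col + 2] == ":":  # east, no wall
            new_col = min(col + 1, 4)
        elif action == 3 and MAP[1 + row][2 * col] == ":":      # west, no wall
            new_col = max(col - 1, 0)
        elif action == 4:  # pickup
            if passenger < IN_TAXI and (row, col) == LOCATIONS[passenger]:
                new_passenger = IN_TAXI
            else:
                reward = -10  # illegal pickup
        elif action == 5:  # drop-off
            if passenger == IN_TAXI and (row, col) == LOCATIONS[destination]:
                new_passenger, reward, done = destination, 20, True
            elif passenger == IN_TAXI and (row, col) in LOCATIONS:
                new_passenger = LOCATIONS.index((row, col))
            else:
                reward = -10  # illegal drop-off
        next_state = encode(new_row, new_col, new_passenger, destination)
        table[action] = [(1.0, next_state, reward, done)]
    return table


for action, outcomes in transitions_for(3, 1, 2, 0).items():
    print(action, outcomes)
```

In this sketch, the only movement that leaves the taxi in state 328 is the one blocked by the `|` character next to it in the map; pickup and drop-off also stay in 328 but cost -10 because the passenger is elsewhere.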

### **Random Agent**

We will begin by implementing an agent that takes random actions without any learning. This will serve as our baseline.

The first step is to provide our agent with an initial state of the environment. A state represents how the agent perceives its surroundings. In the Taxi environment, a state includes the current positions of the taxi, the passenger, and the designated pick-up and drop-off locations. Below are examples of three different states in the Taxi environment:

<p align="center">
    <img src="images/taxi-states.png" align="center" width=450>
</p>

**Note:**  
- **Yellow** represents the taxi.  
- **Blue letters** indicate pickup locations.  
- **Purple letters** mark drop-off destinations.  

Next, we will run a loop to simulate the game. In each iteration, our agent will:  

1. Select a random action from the action space:  
   - **0** → Move south  
   - **1** → Move north  
   - **2** → Move east  
   - **3** → Move west  
   - **4** → Pick up the passenger  
   - **5** → Drop off the passenger  
2. Receive the updated state after performing the action.  

Below is the script for our random agent:

In [16]:
# YOUR CODE HERE
import time

env = gym.make("Taxi-v3", render_mode="ansi")

state, info = env.reset()
print(env.render())

done = False

while not done:
    action = env.action_space.sample()  # choose random action
    state, reward, terminated, truncated, info = env.step(action)
    
    print("Action:", action)
    print(env.render())
    print("Reward:", reward)
    print("-" * 30)
    
    done = terminated or truncated

env.close()


+---------+
|R: | : :G|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+


Action: 4
+---------+
|R: | : :G|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+
  (Pickup)

Reward: -10
------------------------------
Action: 0
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : |[43m [0m: |
|[34;1mY[0m| : |[35mB[0m: |
+---------+
  (South)

Reward: -1
------------------------------
Action: 1
+---------+
|R: | : :G|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+
  (North)

Reward: -1
------------------------------
Action: 1
+---------+
|R: | : :G|
| : | :[43m [0m: |
| : : : : |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+
  (North)

Reward: -1
------------------------------
Action: 1
+---------+
|R: | :[43m [0m:G|
| : | : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+
  (North)

Reward: -1
------------------------------
Act

You can run this script to observe your agent taking random actions. This serves as a great introduction to Gymnasium and helps you understand how the environment works.

To visualize the animation of your random agent's actions, execute the following code:

In [31]:
# YOUR CODE HERE
env = gym.make("Taxi-v3", render_mode="human")

state, info = env.reset()

done = False

while not done:
    action = env.action_space.sample()  # random action
    state, reward, terminated, truncated, info = env.step(action)
    
    time.sleep(0.3)  # slow down animation
    
    done = terminated or truncated

env.close()


The results are not good. The agent wanders at random and makes numerous incorrect pickups and drop-offs, and with the default 200-step limit an episode is usually cut off before the passenger reaches the destination.

This happens because the agent is not learning from its past actions. We can run this process repeatedly, but it will never improve. The agent has no memory of which actions work best for each state, which is where Reinforcement Learning comes in.  

In the next labs, we will implement reinforcement learning algorithms that will allow our agent to learn from rewards and improve its performance over time.

#### **LunarLander environment:**
Explore another example, such as LunarLander, to enhance your understanding.

In [21]:
# YOUR CODE HERE
env = gym.make("LunarLander-v3", render_mode="human")
state, info = env.reset()


In [28]:
print("Action Space:", env.action_space)
print("Number of actions:", env.action_space.n)

print("Example state (observation):", state)
print("Length of state:", len(state))


Action Space: Discrete(4)
Number of actions: 4
Example state (observation): [-0.43026924  0.06845564 -0.8290989  -0.72003347  0.33612123 -5.0903864
  0.          1.        ]
Length of state: 8



- In **Taxi**, the state is **discrete** and represented by a single number from 0 to 499.
- In **LunarLander**, the state is **continuous** and represented by a vector of 8 real numbers.
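
For reference, the eight numbers are usually documented as position, velocity, angle, angular velocity, and two leg-contact flags. The sketch below labels the observation printed above with those names; the field names are our own shorthand, and the ordering is an assumption taken from the LunarLander documentation rather than something verified in this notebook.

```python
from typing import NamedTuple


class LanderObservation(NamedTuple):
    x: float
    y: float
    x_velocity: float
    y_velocity: float
    angle: float
    angular_velocity: float
    left_leg_contact: float   # 1.0 if the leg touches the ground, else 0.0
    right_leg_contact: float


# Values copied from the example observation printed above.
raw = [-0.43026924, 0.06845564, -0.8290989, -0.72003347,
       0.33612123, -5.0903864, 0.0, 1.0]

obs = LanderObservation(*raw)
for name, value in obs._asdict().items():
    print(f"{name:>18}: {value: .4f}")
```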


In [38]:
import gymnasium as gym
import time

env = gym.make("LunarLander-v3", render_mode="human")

state, info = env.reset()
done = False

while not done:
    action = env.action_space.sample()
    state, reward, terminated, truncated, info = env.step(action)

    print("Action:", action)
    print("Reward:", reward)
    print("-" * 30)

    time.sleep(0.02)   # slow down so you can see the animation

    done = terminated or truncated

time.sleep(2)  # keep the window open for 2 seconds at the end
env.close()


Action: 2
Reward: 0.062010558261062
------------------------------
Action: 3
Reward: 1.040967910592799
------------------------------
Action: 3
Reward: 0.437337993669787
------------------------------
Action: 2
Reward: -2.245930009906277
------------------------------
Action: 3
Reward: 0.08318863030686544
------------------------------
Action: 0
Reward: 0.1767305364187166
------------------------------
Action: 0
Reward: -0.5233549690566122
------------------------------
Action: 2
Reward: -1.4629915049941815
------------------------------
Action: 0
Reward: -1.302402713238365
------------------------------
Action: 2
Reward: -0.36010396352652946
------------------------------
Action: 2
Reward: -2.2869961262418483
------------------------------
Action: 1
Reward: 0.5369178858931332
------------------------------
Action: 0
Reward: -1.1129374870640447
------------------------------
Action: 2
Reward: -1.994478511024471
------------------------------
Action: 2
Reward: -2.386753494253912
-------

## **Section 5: Conclusion**

Great work! You have:
- Understood Reinforcement Learning (RL) and its core concepts.
- Learned the Gymnasium library for developing and benchmarking RL algorithms.
- Installed and explored Gymnasium environments on your system.
- Run your first Gymnasium environment and interacted with it using Python.