
## Reinforcement Learning Lab: Navigating the Frozen Lake with Gymnasium

## Visual Example of the Frozen Lake Environment
To get a sense of the environment, see the animation on the Gymnasium documentation page, which shows the elf character navigating the grid:
https://gymnasium.farama.org/environments/toy_text/frozen_lake/

## Introduction to Reinforcement Learning (RL)

**Reinforcement Learning (RL)** is a type of machine learning where an **agent** learns to make decisions by interacting with an **environment**. The agent takes **actions**, receives **rewards** (positive or negative feedback), and aims to **maximize cumulative reward** over time.

### Key Concepts

| Concept | Description |
|---------|-------------|
| **State (s)** | Current situation of the environment |
| **Action (a)** | Choice made by the agent |
| **Reward (r)** | Immediate feedback after an action |
| **Policy (π)** | Strategy that maps states to actions |
| **Q-function (Q(s,a))** | Expected cumulative reward starting from state `s`, taking action `a`, and following policy `π` |
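
For reference, the Q-function in the last row is commonly written as an expected discounted return. The time index $t$ and the discount factor $\gamma$ are notation assumed here for illustration; $\gamma$ is the same discount factor that appears with the hyperparameters later in this lab.

$$
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\Big|\, s_0 = s,\ a_0 = a \Big]
$$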

---

### RL Environment: FrozenLake-v1 (Gymnasium)

- **Description:** Navigate a frozen lake from **Start (S)** to **Goal (G)**, avoiding holes (**H**). The ice is frozen (**F**) but slippery by default.
- **Grid Size:** 4x4 (16 states) or 8x8 (64 states). We'll use **4x4** for simplicity.
- **States:** Discrete positions on the grid (0 to 15 for 4x4).
- **Actions:**
  - `0` = left
  - `1` = down
  - `2` = right
  - `3` = up
- **Rewards:** 
  - `+1` for reaching goal
  - `0` otherwise
  - Episode ends on **goal** or **hole**
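- **Slippery Mode:** When enabled (the default), the agent moves in the intended direction with probability 1/3 and slides in each of the two perpendicular directions with probability 1/3. We disable it at first for deterministic learning and revisit it later as an exercise.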
- **Goal:** Learn a **policy** to reach `G` safely using **Q-Learning**.
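
To make the state numbering concrete, here is a minimal sketch that maps state indices to grid positions. It assumes the default 4x4 layout listed in the Gymnasium documentation; the names `GRID`, `state_to_pos`, `pos_to_state`, and `tile` are hypothetical helpers, not part of Gymnasium.

```python
# Assumed default 4x4 layout (S = start, F = frozen, H = hole, G = goal).
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
N_COLS = len(GRID[0])


def state_to_pos(state):
    """Convert a state index (0..15) into a (row, col) pair."""
    return divmod(state, N_COLS)


def pos_to_state(row, col):
    """Convert a (row, col) pair back into a state index."""
    return row * N_COLS + col


def tile(state):
    """Return the map letter at the given state index."""
    row, col = state_to_pos(state)
    return GRID[row][col]


holes = [s for s in range(16) if tile(s) == "H"]
goal = next(s for s in range(16) if tile(s) == "G")
print("Holes:", holes)  # Holes: [5, 7, 11, 12]
print("Goal:", goal)    # Goal: 15
```

States are numbered row by row, so state 9 sits at row 2, column 1 (zero-based).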

---


## Installation of Required Packages
Before starting, we need to install the necessary libraries. Gymnasium provides the environment, and Pygame is required for graphical rendering.

In [1]:
!pip install gymnasium pygame




## Imports and Environment Setup

Before we start, we need to **import the necessary packages** and **set up the FrozenLake environment**.  
The cells in this lab do not fix **random seeds**, so exact Q-values will vary from run to run; seed the random number generators if you need **reproducible** results.

### Environment Details
- **Grid:** 4x4 (16 states)  
- **Slippery:** Disabled (`is_slippery=False`) for deterministic learning  
- **Goal:** Make learning simpler for initial experiments  




In [31]:
import gymnasium as gym
import numpy as np
import random
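
If reproducible runs are wanted, one possible approach is sketched below. The helper name `seed_everything` and the value `SEED = 42` are assumptions made for illustration; the calls themselves (`random.seed`, `np.random.seed`, `env.action_space.seed`, `env.reset(seed=...)`) are standard Python, NumPy, and Gymnasium APIs.

```python
import random

import numpy as np

SEED = 42  # arbitrary value chosen for illustration


def seed_everything(seed, env=None):
    """Seed Python's and NumPy's global RNGs and, optionally, a Gymnasium env."""
    random.seed(seed)
    np.random.seed(seed)
    if env is not None:
        env.action_space.seed(seed)   # makes action_space.sample() repeatable
        return env.reset(seed=seed)   # seeds the environment's own RNG
    return None


# Quick check: reseeding reproduces the same random draw.
seed_everything(SEED)
first = random.random()
seed_everything(SEED)
second = random.random()
print(first == second)  # True
```

Once the environment has been created, calling `seed_everything(SEED, env)` before training would also seed the action sampler used for exploration.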


## Setting Up the FrozenLake Environment

In this step, we create the **FrozenLake environment** using a **4x4 grid** and **disable slippery mode**. This ensures that the agent's actions are **deterministic**, making it easier to understand the learning process.

We also check the **number of states and actions** to verify the environment's configuration.  
This setup is important for understanding the **discrete state and action space** that Q-Learning will operate on.

> ✅ Using a deterministic environment allows us to **focus on learning the Q-Learning algorithm** without additional randomness.


In [32]:
env = gym.make(
    "FrozenLake-v1",
    map_name="4x4",
    is_slippery=False
)

print("Number of states:", env.observation_space.n)
print("Number of actions:", env.action_space.n)


Number of states: 16
Number of actions: 4


## Defining Action Mappings

In this step, we define a **dictionary called `actions`** that maps the **numerical action indices** (`0` to `3`) from the FrozenLake environment to **human-readable strings**:  

- `0` → "LEFT"  
- `1` → "DOWN"  
- `2` → "RIGHT"  
- `3` → "UP"  

This mapping makes it easier to **interpret and debug** the agent's actions during **rendering or evaluation**.  
At the end, we can display the dictionary contents to **verify the mappings**.

> ✅ This step ensures that when the agent moves, we can **quickly understand what each action means**.


In [33]:
actions = {
    0: "LEFT",
    1: "DOWN",
    2: "RIGHT",
    3: "UP"
}

actions


{0: 'LEFT', 1: 'DOWN', 2: 'RIGHT', 3: 'UP'}

## Initializing the Q-Table

In this step, we create the **Q-table** as a **2D NumPy array filled with zeros**.  

- **Rows:** correspond to the **states** from the environment's observation space  
- **Columns:** correspond to the **actions** from the action space  

The Q-table will **store the expected rewards** for each **state-action pair**, which the agent will **update during Q-Learning training**.  

Printing the Q-table at this stage allows us to **verify that it has been initialized correctly**.

> ✅ At this point, the Q-table contains only zeros and is ready to be **populated as the agent learns**.


In [34]:
Q = np.zeros((env.observation_space.n, env.action_space.n))
Q


array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

## Epsilon-Greedy Action Selection Function

In this step, we define the **`choose_action` function**, which implements an **epsilon-greedy policy** for Q-Learning.  

- With probability **ε** (default 0.2), the agent **explores** by selecting a **random action**.  
- Otherwise, the agent **exploits** its current knowledge by choosing the action with the **highest Q-value** for the given state.  

This function is used during training to **decide the agent's action at each step**, balancing **exploration** of new actions and **exploitation** of learned rewards.

> ✅ Epsilon-greedy ensures that the agent does not get stuck in suboptimal actions and continues to **learn from the environment**.


In [35]:
def choose_action(state, epsilon=0.2):
    if random.random() < epsilon:
        return env.action_space.sample()  # explore
    else:
        return np.argmax(Q[state])        # exploit
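
One detail worth knowing: `np.argmax` returns the first index when several actions share the highest value, so with an all-zero Q-table the greedy branch always picks `LEFT`. The sketch below is a variant, not the function used in this lab, that breaks ties at random. The name `choose_action_tiebreak`, the explicit `q_table` argument, and the seeded generator `rng` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for the sketch


def choose_action_tiebreak(q_table, state, epsilon=0.2):
    """Epsilon-greedy selection that breaks ties among best actions at random."""
    n_actions = q_table.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: uniform random action
    row = q_table[state]
    best = np.flatnonzero(row == row.max())  # indices of all maximal actions
    return int(rng.choice(best))             # exploit, with random tie-break


demo_q = np.zeros((16, 4))
demo_q[0] = [0.0, 1.0, 0.0, 1.0]  # actions 1 and 3 tie for the best value
picks = {choose_action_tiebreak(demo_q, 0, epsilon=0.0) for _ in range(50)}
print(picks)
```

With `epsilon=0.0` the function never explores, yet both tied actions show up across repeated calls.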


## Defining Hyperparameters for Q-Learning

In this step, we set the **key hyperparameters** for the Q-Learning algorithm:

- **α (alpha)** = 0.1 → the **learning rate**, controlling how much Q-values are updated based on new information.  
- **γ (gamma)** = 0.95 → the **discount factor**, determining the importance of **future rewards** compared to immediate ones.  
- **episodes** = 5000 → the number of **training episodes**, specifying how many times the agent interacts with the environment to learn.  

These hyperparameters can be **tuned** to improve convergence and performance:  
- Lower **α** → slower learning  
- Higher **γ** → emphasizes **long-term rewards**  
- More **episodes** → allows for better **training and policy learning**

> ✅ Proper hyperparameter tuning is essential for the agent to **learn an effective policy efficiently**.


In [36]:
alpha = 0.1    # learning rate
gamma = 0.95   # future importance
episodes = 5000
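
To get a feel for what the discount factor does, the short sketch below computes how much a reward of `+1` is worth when it lies a given number of steps in the future. The helper `discounted_value` and the comparison value `0.5` are assumptions chosen for illustration.

```python
def discounted_value(reward, steps_away, gamma):
    """Value today of a reward received `steps_away` steps in the future."""
    return reward * gamma ** steps_away


for g in (0.95, 0.5):
    values = [round(discounted_value(1.0, k, g), 3) for k in range(6)]
    print(f"gamma={g}: {values}")
```

With `gamma = 0.95`, a reward five steps away still retains about 77% of its value (0.95^5 ≈ 0.774), whereas with `gamma = 0.5` it shrinks to about 3%.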


## Training the Q-Learning Agent

In this step, we run the **core Q-Learning training loop** over the specified number of episodes (e.g., 5000).  

### Training Process

For each episode:

1. **Reset the environment** to start a new episode and get the **initial state**.  
2. While the episode is **not done** (the agent hasn't reached the goal, fallen into a hole, or been cut off by the environment's step limit, reported as `truncated`):  
   - **Choose an action** using the **epsilon-greedy policy** (`choose_action`).  
   - **Step the environment** with the chosen action to obtain the **next state**, **reward**, and **termination flags**.  
   - **Update the Q-table** for the current state-action pair using the **Q-Learning update rule**:

   $$
   Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]
   $$

   Where:  
   - $\alpha$ = learning rate  
   - $\gamma$ = discount factor  
3. **Set the current state** to the **next state**.

### Outcome

By iteratively updating the Q-table based on received rewards and expected future rewards, the agent's greedy policy moves toward an **optimal policy**.  
After training, the **Q-table guides the agent** to make better decisions in the environment.

> ✅ This loop is the core of Q-Learning, allowing the agent to **improve its behavior over time** through trial and error.
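
The update rule above can be traced by hand on a single transition. The sketch below wraps the rule in a hypothetical helper, `q_update`, and applies it three times to the move that reaches the goal (state 14, action `RIGHT`, reward 1, next state 15), starting from an all-zero table.

```python
import numpy as np


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """Apply one Q-Learning update in place and return the new Q(s, a)."""
    td_target = reward + gamma * np.max(q_table[next_state])
    td_error = td_target - q_table[state, action]
    q_table[state, action] += alpha * td_error
    return q_table[state, action]


demo_q = np.zeros((16, 4))
# Stepping RIGHT (2) from state 14 reaches the goal (state 15) with reward 1.
history = [q_update(demo_q, 14, 2, 1.0, 15) for _ in range(3)]
print(np.round(history, 3))
```

The value climbs as 0.1, 0.19, 0.271: each update closes 10% (`alpha = 0.1`) of the remaining gap to the target of 1, which is why a lower learning rate means slower convergence.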


In [37]:
for episode in range(episodes):
    state, info = env.reset()
    done = False

    while not done:
        action = choose_action(state)
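        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  # goal/hole reached, or step limit hit

        # Q-Learning update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )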

        state = next_state


## Inspecting the Trained Q-Table

After training, we can **examine the learned Q-table** to understand the agent's behavior.

- Use **NumPy's `round` function** to display Q-values rounded to **2 decimal places** for better readability.  
- Each **row** represents a **state** (0 to 15) and each **column** corresponds to an **action**:  
  - `0` → LEFT  
  - `1` → DOWN  
  - `2` → RIGHT  
  - `3` → UP  
- **Higher values** in a row indicate the **preferred action(s)** for that state.  
- **Non-zero values** reflect the agent's learned preferences to **reach the goal efficiently**.  
- **All-zero rows** belong to terminal states (the holes and the goal), whose values are never updated, or to states the agent rarely visited.  

> ✅ Inspecting the Q-table helps us **visualize what the agent has learned** and verify that the policy is improving toward the goal.


In [38]:
np.round(Q, 2)


array([[0.74, 0.77, 0.7 , 0.74],
       [0.74, 0.  , 0.14, 0.42],
       [0.44, 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  ],
       [0.77, 0.81, 0.  , 0.74],
       [0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.9 , 0.  , 0.04],
       [0.  , 0.  , 0.  , 0.  ],
       [0.81, 0.  , 0.86, 0.77],
       [0.81, 0.9 , 0.9 , 0.  ],
       [0.86, 0.95, 0.  , 0.86],
       [0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.62, 0.95, 0.71],
       [0.9 , 0.95, 1.  , 0.9 ],
       [0.  , 0.  , 0.  , 0.  ]])
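
One way to read the table at a glance is to draw the greedy action for every state on the grid. The sketch below copies the rounded values from the output above and assumes the default 4x4 layout; the names `Q_trained`, `ARROWS`, and `GRID` are introduced here for illustration.

```python
import numpy as np

# Q-values copied from the output above (rounded to 2 decimals).
Q_trained = np.array([
    [0.74, 0.77, 0.70, 0.74], [0.74, 0.00, 0.14, 0.42],
    [0.44, 0.00, 0.00, 0.00], [0.00, 0.00, 0.00, 0.00],
    [0.77, 0.81, 0.00, 0.74], [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.90, 0.00, 0.04], [0.00, 0.00, 0.00, 0.00],
    [0.81, 0.00, 0.86, 0.77], [0.81, 0.90, 0.90, 0.00],
    [0.86, 0.95, 0.00, 0.86], [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00], [0.00, 0.62, 0.95, 0.71],
    [0.90, 0.95, 1.00, 0.90], [0.00, 0.00, 0.00, 0.00],
])
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}        # LEFT, DOWN, RIGHT, UP
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]          # assumed default layout

greedy = np.argmax(Q_trained, axis=1)            # best action per state
lines = []
for row in range(4):
    line = ""
    for col in range(4):
        state = row * 4 + col
        cell = GRID[row][col]
        # Show holes and the goal as letters, everything else as an arrow.
        line += cell if cell in "HG" else ARROWS[int(greedy[state])]
    lines.append(line)
print("\n".join(lines))
# v<<<
# vHvH
# >vvH
# H>>G
```

Two caveats apply. In state 9, `DOWN` and `RIGHT` tie at 0.90 after rounding, and `np.argmax` returns the first of them, so the arrow may differ from what the unrounded table prefers. State 3 has no learned values at all, so its arrow is only the `argmax` default.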

## Setting Up a Renderable FrozenLake Environment

In this step, we create a **new instance** of the FrozenLake environment called `env_render` for **visual demonstration**.

### Key Details

- **Grid:** 4x4 map  
- **Slippery Mode:** Disabled (`is_slippery=False`) for **deterministic movements**  
- **Render Mode:** `"human"` → enables **graphical rendering** in a popup window (using Pygame)  

### Purpose

This setup allows us to **visually observe the agent** as it moves on the ice grid, represented as an **elf character**.  
It is particularly useful for **demonstrating the trained policy** in action.  

> ✅ Unlike the previous environment (`env`), `env_render` is specifically configured for **human-visible rendering**, making it ideal for demonstrations.


In [39]:
env_render = gym.make(
    "FrozenLake-v1",
    map_name="4x4",
    is_slippery=False,
    render_mode="human"
)


## Running a Trained Episode with Graphical Rendering and a Random First Action

In this step, we **demonstrate the trained agent's policy** by running a **single episode** in the **graphical rendering environment** (`env_render`).

### How It Works

1. **Reset the environment** to obtain the **initial state**.  
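2. **Starting state (state 0):**  
   - Select a **random action** from `DOWN (1)`, `RIGHT (2)`, or `UP (3)`  
   - Adds **variety** to the opening move. This is not a random start *position*: the episode still begins in state 0, and `UP` bumps into the wall, so the loop simply draws again  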
3. **Other states:**  
   - Choose the **greedy action** (highest Q-value) from the **trained Q-table**  
4. **Print** the current state and chosen action using the **`actions` dictionary** for readability  
5. **Step the environment** to get the **next state**, **reward**, and **termination flags**  
   - Render the agent's movement as an **elf** in a popup window  
6. **Loop ends** when the agent reaches the **goal**, falls into a **hole**, or the episode is truncated by the step limit (`done = True`)  

### Outcome

- Visual inspection of the agent's path on the frozen lake grid  
- Confirms that the **trained Q-table policy** guides the agent effectively toward the goal  

> ✅ This method allows us to **see the agent in action** and understand its learned behavior in a human-readable, graphical format.


In [40]:
state, info = env_render.reset()
done = False

while not done:
    # take a random first move in the start state; act greedily everywhere else
    if state == 0:
        action = random.choice([1,2,3])  # DOWN, RIGHT, UP
    else:
        action = np.argmax(Q[state])
    
    print(f"State: {state}, Action: {actions[action]}")
    state, reward, terminated, truncated, info = env_render.step(action)
    done = terminated or truncated

env_render.close()


State: 0, Action: DOWN
State: 4, Action: DOWN
State: 8, Action: RIGHT
State: 9, Action: RIGHT
State: 10, Action: DOWN
State: 14, Action: RIGHT
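
For evaluation without the random first move, a purely greedy rollout can be wrapped in a small function. The sketch below is one possible version, not the lab's own code: the name `run_greedy_episode` and the `max_steps` safety cap are assumptions, and it relies only on the `reset()` and `step()` return values already used above.

```python
import numpy as np


def run_greedy_episode(env, q_table, max_steps=100):
    """Follow the greedy policy for one episode.

    Returns the list of visited states and the total reward collected.
    """
    state, _ = env.reset()
    path, total_reward = [state], 0.0
    for _ in range(max_steps):  # cap guards against endless loops
        action = int(np.argmax(q_table[state]))
        state, reward, terminated, truncated, _ = env.step(action)
        path.append(state)
        total_reward += reward
        if terminated or truncated:
            break
    return path, total_reward
```

Calling `run_greedy_episode(env, Q)` on the non-rendering environment returns the visited states without opening a window, which is convenient for checking many episodes quickly.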


## Try It Yourself: Experiment and Observe

Now that the agent has been trained and we can run episodes with graphical rendering, it's time to **explore and experiment**! Here are some ideas:

- **Change the starting state:**  
  Try starting from different states instead of always using state `0`. How does it affect the path the agent takes?  

- **Modify epsilon or hyperparameters:**  
  Adjust **epsilon**, **alpha**, **gamma**, or **number of episodes**, retrain the Q-table, and observe how the agent's behavior changes.  
  - Lower epsilon → less exploration, agent may get stuck in suboptimal paths  
  - Higher gamma → agent values long-term rewards more  
  - More episodes → Q-values converge more accurately  

- **Enable slippery mode:**  
  Set `is_slippery=True` and see how the **randomness in actions** affects the learned policy. How does the agent adapt to uncertainty?  

- **Inspect the Q-table:**  
  Look at the rounded Q-values for different states. Which actions are preferred? How do changes in hyperparameters affect these preferences?  

- **Visual observation:**  
  Watch the agent move on the frozen lake. Try different paths and starting points. Can you predict its actions before they happen?  
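
Before enabling slippery mode, it can help to see which moves an intended action may turn into. The sketch below assumes the default slippery behaviour described in the Gymnasium documentation (intended direction and each perpendicular direction with probability 1/3 each); `slippery_outcomes` and `ACTION_NAMES` are hypothetical names used only for this illustration.

```python
ACTION_NAMES = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}


def slippery_outcomes(action):
    """Return (probability, actual_action) pairs for an intended action."""
    # The neighbours of an action index (mod 4) are its perpendicular moves.
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return [(1 / 3, a) for a in candidates]


for prob, actual in slippery_outcomes(2):  # intended action: RIGHT
    print(f"{prob:.2f} -> {ACTION_NAMES[actual]}")
# 0.33 -> DOWN
# 0.33 -> RIGHT
# 0.33 -> UP
```

Because the agent never slides backwards, an action that points away from a hole cannot land in it, which is one reason policies learned on slippery ice often look different from the deterministic one.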

> 💡 **Tip:** Experimenting and observing the outputs will help you **understand the effects of hyperparameters, randomness, and learning in Q-Learning**. Take notes on how the policy changes and what it tells you about the agent's decision-making process.
