# Day-78 Introduction to Reinforcement Learning (RL)

Today, we're diving into one of the most fascinating and, I'd say, almost futuristic branches of ML: Reinforcement Learning (RL). Get ready to train machines to make smart decisions through trial and error, just like we learn in the real world!

So far, we've explored Supervised Learning (learning from labeled data, like predicting house prices) and Unsupervised Learning (finding patterns in unlabeled data, like clustering customers).

Now, think about teaching a dog a new trick, or a baby learning to walk. There are no labeled datasets! They learn by interacting with the environment, performing an action, and getting a reward (a treat, a pat, or successfully standing up) or a penalty.

This "learning by interaction" is the core of Reinforcement Learning. It powers game-playing AIs like AlphaGo and is widely used in robotics and autonomous-driving research. It's truly a game-changer!

## Topics Covered

- What is Reinforcement Learning (RL)?
- How do agents make decisions?
- Code Implementation

## What is Reinforcement Learning (RL)

Reinforcement Learning (RL) is a type of Machine Learning where an agent learns to make decisions by interacting with an environment to achieve a goal.

Instead of learning from labeled data (like in supervised learning), RL learns from trial and error — by performing actions and receiving rewards or penalties as feedback.

### Core components of RL

| Component       | Description                        | Example                                   |
| --------------- | ---------------------------------- | ----------------------------------------- |
| **Agent**       | The learner or decision-maker      | A robot, game player, or self-driving car |
| **Environment** | The world the agent interacts with | The road, the game, the room              |
| **State (S)**   | Current situation of the agent     | Car’s speed and position                  |
| **Action (A)**  | Decision taken by the agent        | Accelerate, turn left, brake              |
| **Reward (R)**  | Feedback from environment          | +1 for staying on road, -1 for crash      |


### Key Concept:

RL is built on this feedback loop:

`Agent → takes an action → Environment → gives a reward → Agent learns → repeat!`

The goal of the agent is to maximize cumulative reward over time.
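The loop above can be sketched in a few lines of plain Python. This is only an illustration: `CoinFlipEnv` and `run_loop` are made-up names, the payout probabilities are arbitrary, and the "learning" is nothing more than tracking the average reward of each action.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: action 1 pays +1 with 80% chance, action 0 with 20%."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        win_chance = 0.8 if action == 1 else 0.2
        return 1 if self.rng.random() < win_chance else 0


def run_loop(steps=1000, seed=0):
    env = CoinFlipEnv(seed)
    agent_rng = random.Random(seed + 1)
    totals = [0.0, 0.0]   # summed reward per action
    counts = [0, 0]       # how often each action was tried
    cumulative = 0
    for _ in range(steps):
        action = agent_rng.randrange(2)   # agent takes an action (randomly here)
        reward = env.step(action)         # environment gives a reward
        counts[action] += 1               # agent "learns" by updating its statistics
        totals[action] += reward
        cumulative += reward
    averages = [totals[a] / max(counts[a], 1) for a in range(2)]
    return averages, cumulative


averages, cumulative = run_loop()
print("average reward per action:", averages)
print("cumulative reward:", cumulative)
```

After enough steps the average for action 1 should sit well above the average for action 0, which is exactly the kind of signal a real RL agent uses to prefer one action over another.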

Let’s imagine you’re teaching your dog to fetch a ball:

- You say “Fetch!” (the situation, or state, the dog observes).

- The dog runs and grabs the ball (the dog, our agent, takes an action).

- You give a treat (reward).

Over time, the dog learns: **“If I fetch the ball, I get a treat — so I’ll do it again!”**

That’s Reinforcement Learning — learning by reward-based feedback.

Similarly, in a video game, an RL agent learns to maximize its score by exploring actions and seeing what gives the best long-term reward.

### Types of Reinforcement Learning


#### Based on Reward Type:
| Type                              | Description                                                                       | Key Idea                                         | Example                                                    |
| --------------------------------- | --------------------------------------------------------------------------------- | ------------------------------------------------ | ---------------------------------------------------------- |
| **1. Positive Reinforcement**     | Increases the likelihood of repeating a behavior by giving a **positive reward**. | Encourages good behavior.                        | Dog gets a treat after sitting when told.             |
| **2. Negative Reinforcement**     | Increases behavior by **removing a negative condition** after the desired action. | Removes discomfort when correct action is taken. | Seatbelt alarm stops once you wear the belt.            |
| **3. Punishment (Penalty-Based)** | **Discourages** a behavior by applying a penalty or negative reward.              | Reduces bad behavior.                            | Car gets -10 reward for hitting a wall.               |
| **4. Extinction**                 | Previously learned behavior is **forgotten** when it no longer gives rewards.     | Removes behavior with no feedback.               | Agent stops pressing a button that no longer gives reward. |

#### Algorithmic Classification (Technical View)

From an algorithmic perspective, RL can also be categorized into three major types:

| Type                   | Description                                                                              | Example                            |
| ---------------------- | ---------------------------------------------------------------------------------------- | ---------------------------------- |
| **1. Value-Based RL**  | Learns a **value function (Q-value)** that estimates how good each action is in a state. | **Q-Learning**, **SARSA**          |
| **2. Policy-Based RL** | Learns the **policy directly** — doesn’t rely on Q-tables.                               | **REINFORCE**, **Policy Gradient** |
| **3. Actor–Critic RL** | Combines both — **actor** (policy) + **critic** (value function).                        | **A3C**, **PPO**, **DDPG**         |


#### Based on Knowledge Type

Based on how the agent understands the environment, RL can be classified into two types:

| Type               | What it Means                                                                       | The Agent Knows / Learns                                                        | Analogy                                                                                    |
| ------------------ | ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| **Model-Based RL** | The agent **builds or uses a model** of the environment to plan actions.            | It knows the **state transitions** → “If I do A in state S, I’ll end up in S′.” | Like **playing chess** — you can “think ahead” and simulate future moves before acting.  |
| **Model-Free RL**  | The agent **learns purely from experience** — no explicit model of the environment. | It doesn’t know what will happen next until it tries.                           | Like **learning to ride a bike** — you learn by falling and improving each time.        |


### RL vs Supervised vs Unsupervised Learning

| Type              | Learns From               | Example                |
| ----------------- | ------------------------- | ---------------------- |
| **Supervised**    | Labeled data              | Email spam detection   |
| **Unsupervised**  | Hidden patterns           | Customer segmentation  |
| **Reinforcement** | Feedback from environment | Game playing, robotics |


## How do agents make decisions?

In Reinforcement Learning, the Agent is the decision-maker.
To decide what to do next, it uses two key tools:

-  `Policy` ( $\pi$ ) → its strategy
-  `Q-Table` ($Q$) → its memory of experiences

| Concept         | Role                                                                       | Relationship                              |
| --------------- | -------------------------------------------------------------------------- | ----------------------------------------- |
| **Agent**       | Learner / decision maker                                                   | Uses both **Policy** and **Q-Table**      |
| **Policy (π)**  | Strategy that decides which action to take                                 | Derived from the **Q-table**              |
| **Q-Table (Q)** | Memory that stores the quality (expected reward) of each action-state pair | Helps to **improve** the policy over time |


`Analogy`:

Think of the agent as a student learning to play chess:

- **The Q-table** is its *notebook*, where it writes down which moves worked well.

- **The policy** is its playing *strategy*: how it decides what move to play next.

The agent reads from its notebook (Q-table) to guide its strategy (policy).

`flow diagram`

Agent → Policy → Action → Environment → Reward → Q-Table update → Policy improvement

### What is Policy ($\pi$)? The Agent's Strategy

In the context of Reinforcement Learning, the Policy ($\pi$) is the Agent's strategy or master plan for making decisions. It is the definitive rule that the Agent uses to determine what Action ($a$) to take when it is in a particular State ($s$).

`Goal`: The Agent's entire training process is about finding the optimal policy ($\pi^*$) that maximizes the expected cumulative reward over time.

Think about a self-driving car (the Agent). The Policy is the set of rules it follows at every moment:

| State (s) (Observation) |Policy (π) (Rule/Strategy) |Action (a) (Decision)                        |
|-------------------------|---------------------------|---------------------------------------------|
|State: Red traffic light |Rule: Stop and wait.       |Action: Apply brakes.                        |
|State: Clear road ahead  |Rule: Maintain speed limit.|Action: Press gas pedal lightly.             |
|State: Car merging left  |Rule: Slightly steer right.|Action: Turn steering wheel 2 degrees right. |

In this analogy:

The Policy is the entire, comprehensive driving manual that maps every possible road situation (state) to the correct driving response (action).

#### Types of policies

| Type                        | Description                                                              | Behavior / Formula                         | Example                                                   |
| --------------------------- | ------------------------------------------------------------------------ | ------------------------------------------ | --------------------------------------------------------- |
| **1. Deterministic Policy** | Always picks **one fixed action** for each state                         | $a = \pi(s)$                               | If traffic light = red → always stop                      |
| **2. Stochastic Policy**    | Picks **actions based on probabilities**                                 | $\pi(a \mid s) = P(a \mid s)$              | 80% go straight, 20% turn right                           |
| **3. Random Policy**        | Chooses **actions completely at random**, regardless of state            | $\pi(a \mid s) = \frac{1}{\lvert A \rvert}$, where $\lvert A \rvert$ is the number of actions | Agent explores blindly in the environment |
| **4. Greedy Policy**        | Always picks the **action with the highest Q-value** (best known so far) | $a = \arg\max_a Q(s,a)$                    | Always chooses what seems best now                        |
| **5. ε-Greedy Policy**      | Chooses the **best action most of the time**, but **explores sometimes** | With probability ε → explore, else exploit | 90% of the time take the best move, 10% of the time a random one |
| **6. Softmax Policy**       | Selects actions with probability **proportional to $e^{Q(s,a)/\tau}$**, so higher Q-values are favored | $P(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}}$ | High-value actions = higher chance, not guaranteed |
| **7. Optimal Policy**       | The **final goal**: the policy that gives **maximum cumulative reward**  | $\pi^*(a \mid s)$                          | What your agent aims to learn through training: the “best” strategy |
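To make the greedy and softmax rows of the table concrete, here is a small sketch in plain Python. The Q-values are made up for illustration, and `greedy_action` and `softmax_probs` are hypothetical helper names, not part of any library.

```python
import math


def greedy_action(q_row):
    """Index of the action with the highest Q-value (first one wins ties)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def softmax_probs(q_row, tau=1.0):
    """Action probabilities proportional to exp(Q / tau)."""
    top = max(q_row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp((q - top) / tau) for q in q_row]
    total = sum(exps)
    return [e / total for e in exps]


q_row = [1.0, 2.0, 0.5]  # made-up Q-values for a single state

print("greedy action:", greedy_action(q_row))
print("softmax, tau=1.0:", [round(p, 3) for p in softmax_probs(q_row, tau=1.0)])
print("softmax, tau=0.1:", [round(p, 3) for p in softmax_probs(q_row, tau=0.1)])
```

A lower temperature $\tau$ pushes the softmax policy toward the greedy choice, while a higher one flattens the probabilities toward a random policy.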


### The Exploration vs. Exploitation Trade-off

This trade-off is the core dilemma in Reinforcement Learning, and the $\epsilon$-greedy strategy is designed to manage it. We will talk about the Epsilon-greedy policy in a bit, but first let's discuss what exploration and exploitation are.

| **Term**         | **Definition**                                                                          | **Goal**                                                                      | **Analogy**                                                                           |
| ---------------- | --------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| **Exploitation** | Choosing the best known action based on current experience and knowledge (the Q-table). | To maximize reward immediately by using the current optimal strategy.         | Always eating at your favorite restaurant because you know it's good.               |
| **Exploration**  | Choosing a random or sub-optimal action that hasn't been tried frequently.              | To discover new information that could lead to a higher reward in the future. | Trying a new, unknown restaurant to see if it's better than your current favorite.  |


The Exploration-Exploitation Trade-off is the core challenge. An Agent must balance these two activities:

- If the Agent only exploits (acts Greedy), it risks getting stuck in a local optimum—a good path, but potentially missing out on a much better, undiscovered path.

- If the Agent only explores (acts Random), it may spend too much time on poor actions and never consistently achieve a high reward, thus learning an inefficient policy.

The **Epsilon-Greedy Policy** is the most common solution. Combined with a decaying $\epsilon$, it lets the Agent explore frequently at the beginning of training (when knowledge is low) and exploit more frequently toward the end (when knowledge is high).

### Epsilon-greedy policy

The Epsilon-Greedy Policy is a controlled mix of these two strategies:
1. With a small probability $\mathbf{\epsilon}$ (epsilon): You Explore. You pick a restaurant at random, whether or not you've tried it before.
2. With a large probability $\mathbf{1 - \epsilon}$: You Exploit. You go to your current favorite restaurant (the one you believe is the best based on your experience).

![image.png](attachment:image.png)

Example: If $\epsilon = 0.1$ (10%):
- $10\%$ of the time, you try a random new place (Exploration).
- $90\%$ of the time, you go to your known best place (Exploitation).
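The rule above fits in one small function. The sketch below uses only the standard library; `epsilon_greedy` is a hypothetical helper and the Q-values are made up.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                    # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])   # exploit


rng = random.Random(42)       # seeded so the run is repeatable
q_row = [0.2, 1.5, -0.3]      # made-up Q-values for one state; action 1 is best
picks = [epsilon_greedy(q_row, 0.1, rng) for _ in range(10_000)]
share_best = picks.count(1) / len(picks)
print(f"best action chosen {share_best:.1%} of the time")
```

Note that the best action ends up being chosen slightly more than 90% of the time, because a random exploration step can also land on it: with 3 actions the expected share is $1 - \epsilon + \epsilon/3 \approx 93.3\%$.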

### What is a Q-Table?

A Q-table (Quality Table) is a lookup table that helps the agent decide what action to take in a given state.

It stores the expected future reward (Q-value) for each state–action pair.
Formally:

$Q(s,a)$ = expected total (discounted) reward if the agent takes action $a$ in state $s$ and then keeps choosing the best available actions


The agent keeps updating these values during training using the Q-learning formula:

$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$

Where

- $s$: current state

- $a$: action taken

- $r$: reward

- $\alpha$: learning rate

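- $\gamma$: discount factor

- $s'$: next state reached after taking action $a$

- $a'$: an action available in the next state $s'$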
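A single update can be worked through by hand. The sketch below applies the formula twice to one state-action pair, assuming $\alpha = 0.1$, $\gamma = 0.6$, a reward of $-1$, and a next state whose Q-values are still all zero; `q_update` is a hypothetical helper name.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.6):
    """One Q-learning step: move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    td_target = reward + gamma * max_next_q
    return q_sa + alpha * (td_target - q_sa)


q = 0.0                                        # Q(s, a) starts at zero
q = q_update(q, reward=-1, max_next_q=0.0)     # first visit
print(round(q, 4))                             # -0.1
q = q_update(q, reward=-1, max_next_q=0.0)     # second visit
print(round(q, 4))                             # -0.19
```

Each visit moves the estimate only a fraction $\alpha$ of the way toward the target, which is why many episodes are needed before the values settle.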

## Code implementation:

We will use Gymnasium (the maintained successor to OpenAI Gym) to implement Model-Free Reinforcement Learning using the Q-Learning algorithm. In RL, an agent needs an environment that provides a simulation in which it can learn optimal actions through trial and error.

In [None]:
! pip install gymnasium





### 1. Understand the Environment

We will use the Taxi-v3 environment from Gymnasium, a classic RL problem in which an agent learns to navigate a grid world and transport passengers efficiently.

![image.png](attachment:image.png)

https://gymnasium.farama.org/environments/toy_text/taxi/

In [1]:
import gymnasium as gym
env = gym.make('Taxi-v3', render_mode='ansi')
env.reset()

print(env.render())

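+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

(Color codes were lost in this text export: the taxi is the highlighted cell in the fourth row, just above B; the passenger waits at B; the destination is G.)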




- The code snippet above initializes the Gymnasium `Taxi-v3` environment and renders its initial state to the console as ANSI text.
- The output is a textual representation of the 5x5 grid world. Since `env.reset()` picks a random starting state, the exact placement of the taxi, passenger, and destination will vary, but the structure remains the same.

| **Symbol(s)**                            | **Description**                                               | **Role in Environment**                                 |
| ---------------------------------------- | ------------------------------------------------------------- | ------------------------------------------------------- |
| **+---+**                                | The boundary of the 5×5 grid world.                           | Defines the limits of the environment.                  |
| **`\|`**                                 | Walls that block the Taxi’s movement.                         | Restrict where the Taxi can move.                       |
| **`:`**                                  | Open borders between neighboring cells.                       | The Taxi can cross these freely.                        |
| **R, G, Y, B**                           | The four designated landmarks.                                | Possible pickup and drop-off locations.                 |
| **Blue Letter** (e.g., Blue **R**)       | The **current location** of the passenger.                    | Represents the passenger’s **state before pickup**.     |
| **Magenta Letter** (e.g., Magenta **G**) | The **destination** for the passenger.                        | Represents the **target state for drop-off**.           |
| **Yellow Taxi**                          | Taxi Agent is present and currently **empty** (no passenger). | Represents the agent’s **state (location and status)**. |
| **Green Taxi**                           | Taxi Agent is present and currently **carrying a passenger**. | Represents the agent’s **state (location and status)**. |


### 2. Getting state, action and Initializing Q-table

In [2]:
import numpy as np

# 1. Get State/Action Space Sizes
state_space_size = env.observation_space.n
action_space_size = env.action_space.n

# 2. Initialize the Q-table (500 rows x 6 columns)
q_table = np.zeros((state_space_size, action_space_size))

print(f"Total States: {state_space_size}")
print(f"Total Actions: {action_space_size}")
print(f"Q-Table Shape: {q_table.shape}")

Total States: 500
Total Actions: 6
Q-Table Shape: (500, 6)


There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations.
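To see how 25 × 5 × 4 collapses into a single state number, here is a sketch of the encoding as described in the Gymnasium Taxi documentation. The `encode` and `decode` functions below are written out by hand for illustration, assuming the documented ordering (passenger locations 0-3 for R, G, Y, B and 4 for "in taxi"; destinations 0-3 for R, G, Y, B).

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one integer in [0, 499]."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode(state):
    """Unpack a state integer into (taxi_row, taxi_col, passenger_loc, destination)."""
    destination = state % 4
    state //= 4
    passenger_loc = state % 5
    state //= 5
    taxi_col = state % 5
    taxi_row = state // 5
    return taxi_row, taxi_col, passenger_loc, destination


print(encode(0, 0, 0, 2))   # taxi in the top-left cell, passenger at R, destination Y
print(decode(4))            # which situation does state 4 describe?
print(encode(4, 4, 4, 3))   # the largest possible state index
```

Decoding a row index this way is handy later, when you want to know which situation a particular Q-table row actually describes.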

Each action is a single integer from 0 to 5 that tells the taxi which direction to move or whether to pick up or drop off a passenger.
So we have the following 6 actions:

- 0: Move south (down)

- 1: Move north (up)

- 2: Move east (right)

- 3: Move west (left)

- 4: Pickup passenger

- 5: Drop off passenger

### 3. Training

Before we start training, let's clarify some RL terminology.

- `Episode`: A sequence of interactions between an agent and its environment, starting from an initial state and ending at a terminal state.
- `Episode Step`: A single interaction within an episode: the agent takes one action, and the environment returns the next state and a reward.
- `The Epsilon Solution (Decay)`: In Q-Learning, we start with a high $\epsilon$ (e.g., $\epsilon=1.0$) and gradually decay it over thousands of episodes.
    - **Start ($\epsilon \approx 1.0$)**: The Agent is mostly exploring (random movement) to gather initial knowledge and fill the $\mathbf{Q}$-table.
    - **Middle**: The Agent uses a good mix of both to refine the Q-values.
    - **End ($\epsilon \approx 0.01$)**: The Agent is mostly exploiting the optimal path it has learned, only exploring occasionally to ensure it hasn't missed any last-minute improvements.

By decaying $\epsilon$, we ensure the Agent first gains comprehensive knowledge and then utilizes that knowledge effectively to find the ultimate optimal policy.
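To get a feel for how quickly exploration fades, the sketch below evaluates an exponential decay schedule at a few episodes. It assumes the same settings used in the training cell that follows (`min_epsilon = 0.01`, `max_epsilon = 1.0`, `decay_rate = 0.0001`); `epsilon_at` is a hypothetical helper name.

```python
import math


def epsilon_at(episode, min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.0001):
    """Exponentially decayed exploration rate after the given number of episodes."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for episode in (0, 10_000, 25_000, 50_000):
    print(f"episode {episode:>6}: epsilon = {epsilon_at(episode):.4f}")
```

With these settings the agent still explores on roughly 37% of its steps after 10,000 episodes and on under 2% after 50,000, so epsilon approaches `min_epsilon` without ever quite reaching it.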
 

In [None]:
# 1. Define Hyperparameters
alpha = 0.1     # Learning Rate
gamma = 0.6     # Discount Factor
epsilon = 1.0   # Initial Exploration Rate
max_epsilon = 1.0   # The maximum value for epsilon 
min_epsilon = 0.01  # The minimum value for epsilon
decay_rate = 0.0001
total_episodes = 50000

# Start Training
for episode in range(total_episodes):
    # Reset environment for a new episode
    state, info = env.reset()
    done = False
    
    while not done:
        # 2. Epsilon-Greedy Strategy (Action Selection)
        if np.random.random() < epsilon:
            # Exploration: Choose a random action
            action = env.action_space.sample() 
        else:
            # Exploitation: Choose the best action based on the Q-table
            action = np.argmax(q_table[state, :]) 

        # 3. Take Action and Observe
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # 4. Apply the Q-Learning Update Rule
        # The key step! Update the Q-value based on the Bellman equation
        old_value = q_table[state, action]
        max_future_q = np.max(q_table[new_state, :])
        
        new_q_value = old_value + alpha * (reward + gamma * max_future_q - old_value)
        q_table[state, action] = new_q_value

        # Transition to the new state
        state = new_state

    # 5. Decay Epsilon (Reduce exploration over time)
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

# Note: the environment is left open here because the evaluation cell below reuses it

print("Training finished. The Q-table now holds the optimal policy (estimated Q-values).")
# The trained q_table can now be used to run the agent optimally.
print("Sample Q-table slice:\n", q_table[:5])

Training finished. The Q-table now holds the optimal policy (estimated Q-values).
Sample Q-table slice:
 [[  0.           0.           0.           0.           0.
    0.        ]
 [ -2.41837066  -2.3639511   -2.41837066  -2.3639511   -2.27325184
  -11.3639511 ]
 [ -1.870144    -1.45024     -1.870144    -1.45024     -0.7504
  -10.45024   ]
 [ -2.3639511   -2.27325184  -2.3639511   -2.27325184  -2.1220864
  -11.27325184]
 [ -2.49619087  -2.49625368  -2.49619092  -2.49650594 -11.49599254
  -11.49514162]]


### Interpreting the Trained Q-Table

The Q-table slice above contains $Q(s, a)$ values, where each row is a State ($s$) and each column is an Action ($a$).

The Agent's Policy is to simply choose the action with the highest Q-value for its current state.
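As a quick illustration, the sketch below reads the greedy action off each of the five rows printed above. The numbers are copied from that output; `greedy_policy` and `ACTION_NAMES` are hypothetical names introduced here.

```python
ACTION_NAMES = ["South", "North", "East", "West", "Pickup", "Drop off"]

# Rows 0-4 of the trained Q-table, copied from the printed slice
q_slice = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [-2.41837066, -2.3639511, -2.41837066, -2.3639511, -2.27325184, -11.3639511],
    [-1.870144, -1.45024, -1.870144, -1.45024, -0.7504, -10.45024],
    [-2.3639511, -2.27325184, -2.3639511, -2.27325184, -2.1220864, -11.27325184],
    [-2.49619087, -2.49625368, -2.49619092, -2.49650594, -11.49599254, -11.49514162],
]


def greedy_policy(q_rows):
    """Index of the highest-valued action for each row (first index wins ties)."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q_rows]


for state, action in enumerate(greedy_policy(q_slice)):
    print(f"state {state}: {ACTION_NAMES[action]}")
```

Row 0 is all zeros, which most likely means that state was never visited during training, so its "South" is only a tie-break and not a learned choice.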

#### Case-1: Row 2 (index 2, counting from 0)

| **State**    | **South (0)** | **North (1)** | **East (2)** | **West (3)** | **Pickup (4)** | **Drop off (5)** |
| ------------ | ------------: | ------------: | -----------: | -----------: | -------------: | ---------------: |
| **Q-Values** |     −1.870144 |     −1.450240 |    −1.870144 |    −1.450240 |      −0.750400 |       −10.450240 |


**Decision**: The largest (least negative) Q-value is $\mathbf{-0.7504}$, which corresponds to Action 4 (Pickup passenger).

**Interpretation**: In this specific state, the Agent has figured out that the best move to maximize future reward is to pick up the passenger. This suggests the taxi is currently at the passenger's location.

#### Case-2: Row 4 (index 4, counting from 0)

| **State**    | **South (0)** | **North (1)** | **East (2)** | **West (3)** | **Pickup (4)** | **Drop off (5)** |
| ------------ | ------------: | ------------: | -----------: | -----------: | -------------: | ---------------: |
| **Q-Values** |     −2.49619  |   −2.49625    |    −2.49619  |     −2.49650 |     −11.49599  |       −11.49514  |

**Decision**: All movement values (Actions 0-3) are very close, but Pickup and Drop off (Actions 4 and 5) have much more negative values (around -11.5). The largest value is $-2.49619087$ for Action 0 (South), a hair above East (Action 2) at $-2.49619092$, so `np.argmax` picks South.

**Interpretation**: The large penalty for Pickup/Drop off suggests the taxi is neither at the passenger's location nor carrying a passenger to the destination, so both would be illegal actions that the environment penalizes heavily. The near-tie between South and East suggests both moves bring the taxi about equally close to its target, though we can't confirm that without seeing the grid.
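A rough sanity check on these numbers, assuming Taxi-v3's usual rewards of -1 per step and -10 for an illegal pickup or drop-off, and assuming the taxi still has a long trip ahead so that the final +20 is discounted to almost nothing: with $\gamma = 0.6$, an endless run of -1 rewards sums to $-1/(1-\gamma)$.

```python
gamma = 0.6  # discount factor used during training

# Value of paying -1 on every step forever: -1 - gamma - gamma**2 - ... = -1 / (1 - gamma)
long_trip_value = -1 / (1 - gamma)

# An illegal pickup/drop-off costs -10 now and leaves the taxi in the same situation
illegal_action_value = -10 + gamma * long_trip_value

print(round(long_trip_value, 4))        # -2.5
print(round(illegal_action_value, 4))   # -11.5
```

Both figures land close to the values in Row 4 (about -2.496 for moving and about -11.496 for the illegal actions), which is consistent with the interpretation above.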

### Running Trained Agent 

In [None]:
from time import sleep

from IPython.display import clear_output  # clears the cell output for a smoother animation

total_test_episodes = 10
total_penalties = 0

print("\n--- Running the Trained Agent (Exploitation Only) ---")

for episode in range(total_test_episodes):
    state, info = env.reset()
    done = False
    reward_sum = 0

    while not done:
        # **Policy: Always choose the action with the highest Q-value**
        action = np.argmax(q_table[state, :]) 
        
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        
        reward_sum += reward
        if reward == -10:
            total_penalties += 1

        # Render the step 
        clear_output(wait=True)
        print(env.render())
        sleep(0.05)  # pause between frames; adjust to taste

        state = new_state
    
    print(f"Episode {episode + 1}: Total Reward = {reward_sum}")

print(f"\nAverage penalties per run: {total_penalties / total_test_episodes}")

# Release the environment now that evaluation is finished
env.close()


+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[35m[34;1m[43mB[0m[0m[0m: |
+---------+
  (Dropoff)

Episode 10: Total Reward = 7

Average penalties per run: 0.0


## Day 78 Summary:

On Day 78 we were introduced to Reinforcement Learning (RL), where an Agent learns to maximize Reward through trial and error using a Policy ($\pi$).

The core achievement was implementing Tabular Q-Learning on the Taxi-v3 environment:
- **Q-Table**: A memory table used to store the estimated quality $Q(s, a)$ for every State-Action pair.
- **$\epsilon$-Greedy Policy**: The Agent balances Exploration (random action with probability $\epsilon$) and Exploitation (best-known action with probability $1-\epsilon$). $\epsilon$ decays over time.
- **Q-Learning Formula**: The Agent uses this formula to update the Q-table based on immediate reward ($r$) and discounted future reward:

  $$Q(s, a) \leftarrow Q(s, a) + \alpha \cdot [r + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a)]$$

The training was successful: after 50,000 episodes, the greedy policy read from the Q-table completed the test episodes without a single penalty, which suggests it is at least close to the optimal policy for the Taxi Agent.

## What's Next: Day-79 - Deep Q-Networks (DQN)


Day 79 will address the major limitation of Q-Learning: the Curse of Dimensionality.

- Problem: The Q-table cannot handle environments with huge, continuous state spaces (like video games or robotics) because the table would be too large to store.

- Solution (DQN): You will replace the lookup Q-table with a Neural Network to approximate the Q-function.

The network will take the State as input and output the Q-value for all possible actions, allowing the Agent to learn complex policies without explicitly storing every single state in a massive table.