**Q-learning** is a **reinforcement learning algorithm** that teaches an agent to choose the best action in each situation by learning a table of values (Q-table) through trial and error.
Each value in the table estimates **how good** a specific action is in a given state: the immediate reward plus the discounted rewards the agent expects to collect afterwards.

**Why we use it:**

* It finds the **optimal policy** (best way to act) without knowing the environment’s rules in advance.
* It works well in problems where the agent learns by **interacting** with the environment and receiving feedback.
* It is useful for games, robot navigation, decision-making systems, and other settings where you want to **maximize long-term rewards**.

## **Mathematical Definition**

We want to learn the **optimal action-value function**:

$$
Q^*(s,a) = \mathbb{E} \left[ r_t + \gamma \max_{a'} Q^*(s_{t+1},a') \,\middle|\, s_t = s,\ a_t = a \right]
$$

We use the **update rule**:

$$
Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big]
$$

Where:

* $s$ = current state
* $a$ = chosen action
* $r$ = immediate reward
* $s'$ = next state
* $a'$ = any action available in the next state (the max picks the best one)
* $\alpha$ = learning rate
* $\gamma$ = discount factor
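
The update rule can be sketched as a small helper. This is a minimal illustration rather than a reference implementation: the function name `q_update`, the `terminal` flag, and the example numbers are assumptions made for this sketch.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error.

    `terminal` is a hypothetical flag for this sketch: when True, the next
    state has no future value, so the max term is treated as 0.
    """
    best_next = 0.0 if terminal else np.max(Q[s_next])
    td_error = r + gamma * best_next - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Illustrative numbers only: 2 states, 2 actions, state 1 already has estimates.
Q = np.zeros((2, 2))
Q[1] = [2.0, 4.0]
td = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
print(Q[0, 1], td)  # approximately 0.46 and 4.6
```

The bracketed term in the update rule is the temporal-difference (TD) error: the gap between the new target $r + \gamma \max_{a'} Q(s',a')$ and the current estimate.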

---

## **Tiny Example Dataset**

We have 2 non-terminal states (S1, S2), 2 actions (Left, Right), and a terminal state that ends the episode:

| State | Action | Next State | Reward |
| ----- | ------ | ---------- | ------ |
| S1    | Right  | S2         | 1      |
| S2    | Right  | Terminal   | 5      |
| S1    | Left   | S1         | 0      |
| S2    | Left   | S1         | 0      |

Parameters:

* $\alpha = 0.5$
* $\gamma = 0.9$
* Q-table starts at 0
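
As a reference point, the optimal values $Q^*$ for this table can be computed directly by iterating the Bellman optimality equation. Note the assumption: this sketch reads the transition table as a known model, which Q-learning itself does not need. The names `transitions`, `best_value`, and `Q_star` are illustrative.

```python
# Transition table from the example: (state, action) -> (next_state, reward).
# "T" marks the terminal state, whose value is 0 by definition.
transitions = {
    ("S1", "Right"): ("S2", 1.0),
    ("S2", "Right"): ("T", 5.0),
    ("S1", "Left"): ("S1", 0.0),
    ("S2", "Left"): ("S1", 0.0),
}
gamma = 0.9
actions = ("Left", "Right")

def best_value(Q, state):
    """Return max over actions of Q(state, action); 0 for the terminal state."""
    if state == "T":
        return 0.0
    return max(Q[(state, a)] for a in actions)

# Repeatedly apply Q(s,a) <- r + gamma * max_a' Q(s',a') until it settles.
Q_star = {key: 0.0 for key in transitions}
for _ in range(200):
    Q_star = {
        (s, a): r + gamma * best_value(Q_star, s_next)
        for (s, a), (s_next, r) in transitions.items()
    }

for key in sorted(Q_star):
    print(key, round(Q_star[key], 2))
# ('S1', 'Left') 4.95
# ('S1', 'Right') 5.5
# ('S2', 'Left') 4.95
# ('S2', 'Right') 5.0
```

These are the values the Q-table should approach if every state-action pair keeps being visited.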

---

### **First Update**

From S1, take **Right** → S2, reward = 1

$$
Q(S1,\text{Right}) = 0 + 0.5 \times [1 + 0.9 \times \max_{a'}Q(S2,a') - 0]
$$

At the start, $Q(S2,*) = 0$:

$$
Q(S1,\text{Right}) = 0.5 \times [1 + 0 - 0] = 0.5
$$

---

### **Second Update**

From S2, take **Right** → Terminal, reward = 5. The terminal state has no further actions, so its best future value is 0.

$$
Q(S2,\text{Right}) = 0 + 0.5 \times [5 + 0.9 \times 0 - 0]
$$

$$
Q(S2,\text{Right}) = 2.5
$$
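
The two hand calculations can be replayed in a few lines. The third update, taking Right from S1 again, is a hypothetical extra step added here to show the reward from S2 flowing back into S1. The index constants are assumptions for this sketch.

```python
import numpy as np

S1, S2 = 0, 1          # row indices (assumed encoding for this sketch)
LEFT, RIGHT = 0, 1     # column indices
alpha, gamma = 0.5, 0.9
Q = np.zeros((2, 2))

# First update: S1 --Right--> S2, reward 1
Q[S1, RIGHT] += alpha * (1 + gamma * Q[S2].max() - Q[S1, RIGHT])
first = Q[S1, RIGHT]

# Second update: S2 --Right--> Terminal, reward 5 (terminal value is 0)
Q[S2, RIGHT] += alpha * (5 + gamma * 0.0 - Q[S2, RIGHT])
second = Q[S2, RIGHT]

# Hypothetical third update: S1 --Right--> S2 again, now that Q(S2, Right) = 2.5
Q[S1, RIGHT] += alpha * (1 + gamma * Q[S2].max() - Q[S1, RIGHT])
third = Q[S1, RIGHT]

print(round(first, 3), round(second, 3), round(third, 3))  # 0.5 2.5 1.875
```

The third value works out as $0.5 + 0.5 \times [1 + 0.9 \times 2.5 - 0.5] = 1.875$, so Q(S1, Right) rises once S2 has a non-zero estimate.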



## **Example: Warehouse Robot Navigation**

**Scenario:**
A company has a robot that moves in a warehouse to pick up packages.
The robot must learn the **best path** from the charging station to the package location while **avoiding obstacles**.

---

### **How an ML Engineer Applies Q-Learning**

1. **Define the states (S)**
   Each possible position in the warehouse grid is a state.

2. **Define the actions (A)**

   * Move Up, Move Down, Move Left, Move Right

3. **Define the rewards (R)**

   * **+10** for reaching the package
   * **−10** for hitting an obstacle
   * **−1** for each move (to encourage shorter paths)

4. **Initialize the Q-table**
   Rows = positions, columns = possible moves.
   Initially, all values = 0.

5. **Run training episodes**

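   * The robot starts in a random spot.
   * Chooses actions with an **ε-greedy** policy: a random action with probability ε, otherwise the action with the highest current Q-value.
   * Moves, receives a reward, and updates the Q-table with the **Q-learning update rule** (derived from the Bellman equation).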

6. **Convergence**
   With enough episodes and enough exploration, the Q-values stabilize so that **the best move from each position is clear**.

7. **Deployment**

   * Once trained, the robot just **looks up the best action in the Q-table** for its current position.
   * No human control is needed: the robot follows the learned greedy policy, which is optimal once the Q-values have converged.
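
The seven steps above can be sketched on a small grid. Everything specific here is an assumption for illustration: the 4x4 layout, the obstacle cells, the choice that bumping a wall or obstacle leaves the robot in place, that reaching the package yields +10 with no extra step cost, and the hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical warehouse layout: charging station at (0, 0), package at (3, 3).
ROWS, COLS = 4, 4
START, GOAL = (0, 0), (3, 3)
OBSTACLES = {(1, 1), (2, 1), (1, 3)}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
FREE_CELLS = [(r, c) for r in range(ROWS) for c in range(COLS)
              if (r, c) not in OBSTACLES and (r, c) != GOAL]

def step(pos, action):
    """Return (next_position, reward, done) for one move."""
    dr, dc = MOVES[action]
    nxt = (pos[0] + dr, pos[1] + dc)
    if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return pos, -1.0, False    # outer wall: stay put, pay the move cost
    if nxt in OBSTACLES:
        return pos, -10.0, False   # obstacle: stay put, pay the penalty
    if nxt == GOAL:
        return nxt, 10.0, True     # package reached, episode ends
    return nxt, -1.0, False        # ordinary move

def epsilon_greedy(Q, pos, epsilon):
    """Random action with probability epsilon, else the current best action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(MOVES)))
    return int(np.argmax(Q[pos]))

# Step 4: Q-table with one row per grid cell and one column per move.
Q = np.zeros((ROWS, COLS, len(MOVES)))
alpha, gamma, epsilon = 0.5, 0.9, 0.2

# Step 5: training episodes from random free cells.
for _ in range(2000):
    pos = FREE_CELLS[rng.integers(len(FREE_CELLS))]
    done = False
    while not done:
        a = epsilon_greedy(Q, pos, epsilon)
        nxt, r, done = step(pos, a)
        best_next = 0.0 if done else Q[nxt].max()
        Q[pos][a] += alpha * (r + gamma * best_next - Q[pos][a])
        pos = nxt

# Step 7: deployment is a table lookup at each position.
pos, path = START, [START]
for _ in range(ROWS * COLS):       # safety cap on rollout length
    pos, _, done = step(pos, int(np.argmax(Q[pos])))
    path.append(pos)
    if done:
        break
print(path)
```

In this layout the shortest route from the charging station to the package takes 6 moves, so a converged table should produce a path of 7 positions ending at `GOAL`.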

---

💡 **Why Q-Learning here?**

* The engineer doesn’t need to know the exact warehouse map in advance.
* The robot learns from **trial and error** by exploring and adjusting.
* It can cope with obstacles that change over time, because the robot can keep learning after deployment.

```python
import numpy as np
rng = np.random.default_rng(0)

# --- Tiny 1D world: states 0..4 (4 is terminal/goal) ---
n_states = 5
actions = {0:"left", 1:"right"}
n_actions = len(actions)

# Rewards: -1 per step, +10 on reaching goal
def step(state, action):
    if state == 4:  # terminal
        return state, 0, True
    if action == 0:  # left
        next_state = max(0, state-1)
    else:            # right
        next_state = min(4, state+1)
    reward = 10 if next_state == 4 else -1
    done = (next_state == 4)
    return next_state, reward, done

# --- Q-learning hyperparams ---
alpha = 0.5       # learning rate
gamma = 0.9       # discount
epsilon = 0.2     # ε-greedy exploration
episodes = 200

Q = np.zeros((n_states, n_actions))  # Q-table

for _ in range(episodes):
    s = 0  # start
    done = False
    while not done:
        # ε-greedy action
        if rng.random() < epsilon:
            a = rng.integers(n_actions)
        else:
            a = np.argmax(Q[s])

        s_next, r, done = step(s, a)

        # Bellman update
        best_next = 0 if done else np.max(Q[s_next])
        td_target = r + gamma * best_next
        Q[s, a] += alpha * (td_target - Q[s, a])

        s = s_next

# Show learned Q and greedy policy
np.set_printoptions(precision=2, suppress=True)
print("Q-table:\n", Q)

greedy_policy = [actions[np.argmax(Q[s])] if s != 4 else "terminal" for s in range(n_states)]
print("Greedy policy by state:", greedy_policy)
```


Output:

```text
Q-table:
 [[ 3.12  4.58]
 [ 3.12  6.2 ]
 [ 4.58  8.  ]
 [ 6.2  10.  ]
 [ 0.    0.  ]]
Greedy policy by state: ['right', 'right', 'right', 'right', 'terminal']
```
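
The "right" column of the printed Q-table can be checked by hand: each entry should equal the discounted return from always moving right. A short sketch of that check, assuming the same rewards (−1 per step, +10 at the goal) and $\gamma = 0.9$ as the code above:

```python
gamma = 0.9

# Rewards collected when moving right from state 0:
# three ordinary steps at -1 each, then +10 on entering the goal.
rewards = [-1, -1, -1, 10]

returns = []
for start in range(4):
    # Discounted return from `start`: sum of gamma^k * reward_k over remaining steps.
    g = sum(gamma**k * r for k, r in enumerate(rewards[start:]))
    returns.append(round(g, 2))

print(returns)  # [4.58, 6.2, 8.0, 10.0]
```

These match the right-hand column of the learned table. The left-hand entries follow the same logic with one wasted step: for example, going left from state 1 costs −1 and lands in state 0, giving $-1 + 0.9 \times 4.58 \approx 3.12$.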
