# 🤖 Agent Object for Grid World

This documentation describes the `Agent` class implemented in the `agent.ipynb` notebook. The agent is designed to solve the **Grid World** reinforcement learning problem with tabular **Q-learning**, improving its policy through trial-and-error interaction with its environment.

---

## 🛠️ 1. Agent Initialization
The agent is configured with several hyperparameters that define its learning behavior and how it manages the balance between exploring new actions and exploiting known rewards.

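* **Environment Connection**: Stores a reference to the environment (`env`, for example Grid World). The agent reads the number of available actions from `env.nA`.
* **Hyperparameters**:
    * **Learning Rate (`learning_rate`, stored as `lr`)**: Sets the step size of each Q-value update, i.e. how far an estimate moves toward its new target (default `0.01`).
    * **Discount Factor (`discount_factor`, stored as `gamma`)**: Weights future rewards relative to immediate ones; values closer to 1 encourage long-term planning (default `0.9`).
    * **Epsilon ($\epsilon$) Strategy**: Controls exploration via `epsilon_greedy` (initial exploration probability, default `0.9`), `epsilon_decay` (multiplicative factor applied to $\epsilon$ after a learning step, default `0.95`), and `epsilon_min` (the threshold at which decay stops, default `0.1`).
* **Q-Table**: A `defaultdict` that maps each state to a NumPy array of `env.nA` zeros, created the first time that state is looked up. Each array entry holds the estimated Q-value (expected discounted return) of one state-action pair.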
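
The lazy behavior of the Q-table can be seen in a short sketch. The action count of 4 and the `(row, col)` state tuples below are assumptions for illustration; in the class, the count comes from `env.nA` and the keys are whatever state encoding the environment uses.

```python
from collections import defaultdict

import numpy as np

N_ACTIONS = 4  # assumed action count; the Agent reads this from env.nA

# Each missing key is filled with a fresh zero vector on first access.
q_table = defaultdict(lambda: np.zeros(N_ACTIONS))

n_before = len(q_table)      # no states stored yet
row = q_table[(0, 0)]        # the lookup itself creates the entry
n_after = len(q_table)

row[2] = 0.5                 # writes through to the stored array
other = q_table[(0, 1)]      # a different state gets its own array
```

One side effect worth knowing: merely reading `q_table[state]` for an unseen state adds it to the table, so `len(q_table)` counts states that have been looked up, not only states that have been updated.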


---

## 🧠 2. Action Selection (Policy)
The agent employs an **epsilon-greedy policy** to navigate the environment and discover the most rewarding paths.

* **Exploration**: With a probability of $\epsilon$, the agent selects a random action to explore the environment.
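* **Exploitation**: Otherwise, it chooses the action with the highest Q-value stored for the current state.
* **Tie-Breaking**: If multiple actions share the same maximum Q-value, the agent takes the argmax over a random permutation of the actions, so the winner is picked at random among the tied actions. Plain `np.argmax` always returns the first maximal index, which would bias the agent toward the lowest-numbered action, most visibly early in training when all Q-values are still zero.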
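
To see why the shuffle matters, the sketch below compares plain `np.argmax` with the permutation trick on a row of tied Q-values. The helper name `greedy_with_random_ties`, the seeded generator, and the four-action row are assumptions for illustration; the class itself uses the global `np.random` functions.

```python
import numpy as np


def greedy_with_random_ties(q_vals, rng):
    """Return an argmax of q_vals, choosing uniformly among tied maxima."""
    perm = rng.permutation(len(q_vals))      # random visiting order of actions
    shuffled = np.asarray(q_vals)[perm]      # Q-values in that order
    return int(perm[np.argmax(shuffled)])    # map back to the original index


rng = np.random.default_rng(0)
tied = np.zeros(4)                           # untrained state: all actions tie

plain_choice = int(np.argmax(tied))          # always the first index
picked = {greedy_with_random_ties(tied, rng) for _ in range(1000)}

clear_winner = greedy_with_random_ties([0.1, 0.7, 0.3, 0.2], rng)
```

Here `plain_choice` is always 0, while `picked` ends up containing every action. When one action is strictly best, as in `clear_winner`, the shuffle changes nothing.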


---

## 📈 3. The Q-Learning Update Rule
Learning occurs through the `_learn` method, which takes one transition `(s, a, r, next_s, done)` and applies the standard Q-learning update, derived from the **Bellman optimality equation**. It is meant to be called after every move.

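* **Target Calculation**:
    * If the transition ends the episode (`done` is true, i.e. the agent reached a terminal state), the target is the immediate reward `r`.
    * Otherwise, the target is the reward plus the discounted maximum Q-value of the next state: `r + gamma * max(Q[next_s])`.
* **Value Adjustment**: The agent moves the stored value toward the target by a fraction `lr` of the difference: `Q[s][a] += lr * (target - Q[s][a])`.
* **Exploration Decay**: After each learning step, the agent multiplies `epsilon` by `epsilon_decay`, but only while `epsilon` is still above `epsilon_min`. The agent thus shifts from mostly random exploration toward mostly greedy decisions, while always keeping a small exploration probability. Because the check happens before the multiplication, the final value can end slightly below `epsilon_min`.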
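
In symbols, with learning rate $\alpha$ (`lr`) and discount factor $\gamma$ (`gamma`), the update for a non-terminal transition is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

and the $\gamma \max_{a'} Q(s', a')$ term is dropped when the transition is terminal. The sketch below replays this arithmetic and the decay schedule with the class defaults. The function names `q_update` and `decay_steps` are hypothetical helpers for illustration, not part of the notebook.

```python
def q_update(q_sa, reward, max_next_q, done, lr=0.01, gamma=0.9):
    """One tabular Q-learning step for a single (state, action) value."""
    target = reward if done else reward + gamma * max_next_q
    return q_sa + lr * (target - q_sa)


def decay_steps(epsilon=0.9, decay=0.95, epsilon_min=0.1):
    """Count the decays applied before epsilon stops shrinking."""
    steps = 0
    while epsilon > epsilon_min:   # same check as _adjust_epsilon
        epsilon *= decay
        steps += 1
    return steps, epsilon


# Terminal transition with reward 1: target = 1, so 0 moves 1% of the way.
q_terminal = q_update(q_sa=0.0, reward=1.0, max_next_q=0.0, done=True)

# Non-terminal transition: target = 0 + 0.9 * 0.5 = 0.45.
q_bootstrap = q_update(q_sa=0.0, reward=0.0, max_next_q=0.5, done=False)

steps, final_epsilon = decay_steps()
```

With the defaults, `decay_steps()` applies 43 decays and leaves $\epsilon$ just under 0.1, so exploration reaches its floor within the first few dozen learning steps rather than after many episodes.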


---

## 🏁 4. Application Context
In the context of Sebastian Raschka's **Chapter 19**, this agent is used as the first hands-on RL implementation. It demonstrates how to solve a discrete **Grid World** problem with a Q-table before moving on to environments with continuous state spaces, such as **CartPole**, using Deep Q-Networks (DQN).

```python
import numpy as np
from collections import defaultdict
```

```python
class Agent:
    """Tabular Q-learning agent with an epsilon-greedy policy."""

    def __init__(
        self, env,
        learning_rate=0.01,
        discount_factor=0.9,
        epsilon_greedy=0.9,
        epsilon_min=0.1,
        epsilon_decay=0.95
    ):
        self.env = env
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon_greedy
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        # Q-table: maps a state to an array of env.nA action values,
        # created as zeros the first time the state is looked up
        self.q_table = defaultdict(lambda: np.zeros(self.env.nA))

    def choose_action(self, state):
        if np.random.uniform() < self.epsilon:
            # Explore: pick any action uniformly at random
            action = np.random.choice(self.env.nA)
        else:
            # Exploit: argmax over a shuffled action order, so ties are
            # broken at random instead of always favoring the lowest index
            q_vals = self.q_table[state]
            perm_actions = np.random.permutation(self.env.nA)
            q_vals = [q_vals[a] for a in perm_actions]
            perm_argmax = np.argmax(q_vals)
            action = perm_actions[perm_argmax]

        return action

    def _learn(self, transition):
        s, a, r, next_s, done = transition
        q_val = self.q_table[s][a]
        if done:
            # Terminal transition: no future value to bootstrap from
            q_target = r
        else:
            q_target = r + self.gamma * np.max(self.q_table[next_s])
        # Move the stored value toward the target by a fraction lr
        self.q_table[s][a] += self.lr * (q_target - q_val)
        # Adjust epsilon with epsilon decay
        self._adjust_epsilon()

    def _adjust_epsilon(self):
        # Decay only while epsilon is still above the floor
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
```
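
A minimal sketch of how such an agent could be driven, written so that it runs on its own. Everything environment-related here is hypothetical: `CorridorEnv` is a five-cell stand-in for Grid World, and its `reset` and `step` methods are assumed names, since the code above only shows that the environment exposes `nA`. The loop inlines the same action selection, update, and decay logic as the class rather than importing it, uses a seeded generator for repeatability, and raises `lr` to 0.1 so that a short run shows an effect.

```python
from collections import defaultdict

import numpy as np


class CorridorEnv:
    """Hypothetical 1-D world: start in cell 0, reward 1 for reaching cell 4."""

    nA = 2  # action 0 = left, action 1 = right

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), 4)
        done = self.state == 4
        return self.state, (1.0 if done else 0.0), done


rng = np.random.default_rng(0)
env = CorridorEnv()
q_table = defaultdict(lambda: np.zeros(env.nA))
lr, gamma = 0.1, 0.9
epsilon, epsilon_min, epsilon_decay = 0.9, 0.1, 0.95

for episode in range(200):
    s, done = env.reset(), False
    for _ in range(1000):  # step cap so an episode cannot run forever
        if rng.uniform() < epsilon:
            a = int(rng.integers(env.nA))              # explore
        else:
            perm = rng.permutation(env.nA)             # exploit, random ties
            a = int(perm[np.argmax(q_table[s][perm])])
        next_s, r, done = env.step(a)
        target = r if done else r + gamma * np.max(q_table[next_s])
        q_table[s][a] += lr * (target - q_table[s][a])
        if epsilon > epsilon_min:
            epsilon *= epsilon_decay
        s = next_s
        if done:
            break

greedy_policy = [int(np.argmax(q_table[s])) for s in range(4)]
```

After the run, `greedy_policy` lists the highest-valued action for cells 0 to 3. In this corridor the learned values favor moving right (action 1) in every cell, since that is the only way to reach the reward.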
