# Sequential Decision Making Under Uncertainty

In reinforcement learning, an agent generates its own training data by interacting with the world. Unlike supervised learning, where correct actions are provided, the agent must discover the consequences of its actions through trial and error. We introduce the fundamentals of reinforcement learning (RL) by focusing on the evaluative aspect of decision-making under uncertainty. We explore how agents learn from interactions with their environment, emphasizing key concepts such as rewards, time steps, and values. The core framework is the k-armed bandit problem, a simplified setting that captures essential RL ideas.
___

In [5]:
import os
from typing import Optional

import numpy as np

In [2]:
if str(os.getcwd()).endswith("notebooks"):
    os.chdir("../")

print("Current working directory:", os.getcwd())

Current working directory: /Users/dksifoua/Developer/learning/Reinforcement-Learning


## 1. The K-Armed Bandit Problem

In a k-armed bandit problem, an **agent** is faced repeatedly with a choice among k different **actions**. After taking each action, it receives a numerical **reward** chosen from a **stationary probability distribution** that depends on the selected action. The objective is **to maximize the expected total reward over some time period**, for example, over 1000 action selections, or **time steps**.

Each of the k actions has an expected (mean) reward given that the action is selected; we call this the **value** of that action. We denote the action selected on time step $t$ as $A_t$, and the corresponding reward as $R_t$. The value of an arbitrary action $a$, denoted $q_*(a)$, is the expected reward given that $a$ is selected:

$$ q_*(a) \; \dot{=} \; \mathbb{E}[R_t|A_t = a] $$

In the k-armed bandit setting, we do not know the action values with certainty, so we have to estimate them. We denote the estimated value of action $a$ at time step $t$ as $Q_t(a)$. We would like $Q_t(a)$ to be close to $q_*(a)$.
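
To make the relationship between $Q_t(a)$ and $q_*(a)$ concrete, here is a minimal sketch for a single action. It assumes one simple estimator, the average of the rewards observed so far, together with made-up values for the true action value (`q_star`) and the noise level. The names `q_star` and `sample_average` are illustrative and do not come from the notebook.

```python
import numpy as np

# Hypothetical setup: one action whose true value q_*(a) is 1.5, with
# unit-variance Gaussian reward noise. Both numbers are illustrative.
q_star = 1.5
rng = np.random.default_rng(0)


def sample_average(n_pulls: int) -> float:
    """Estimate q_*(a) as the mean of ``n_pulls`` sampled rewards."""
    rewards = rng.normal(loc=q_star, scale=1.0, size=n_pulls)
    return float(rewards.mean())


# The estimate tends to get closer to q_star as more rewards are observed.
for n in (10, 100, 10_000):
    estimate = sample_average(n)
    error = abs(estimate - q_star)
    print(f"n={n:>6}  Q(a)={estimate:.3f}  |Q(a) - q*(a)|={error:.3f}")
```

With only a handful of rewards the estimate can be far from the true value; averaging over more rewards typically shrinks the gap.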
___

In [10]:
class KArmedBanditEnvironment:
    """K-armed bandit environment with stationary action means.

    Each action ``a`` has a fixed (hidden) mean reward ``Q*(a)`` sampled once at
    reset from ``Normal(initial_reward, 1)``. Calling :meth:`step` returns a noisy
    reward drawn from ``Normal(Q*(a), reward_noise_std^2)``.

    Args:
        n_actions (int): Number of available actions (arms). Must be > 0.
        initial_reward (float): Mean around which action means are initialized.
        reward_noise_std (float, optional): Standard deviation of the observation
            noise added on each :meth:`step`. Defaults to ``1.0``. Must be >= 0.
        seed (int | None, optional): Random seed for reproducibility. Ignored if
            ``rng`` is provided. Defaults to ``None``.
        rng (np.random.Generator | None, optional): Custom NumPy random generator.
            If ``None``, a new one is created (optionally seeded). Defaults to ``None``.

    Attributes:
        n_actions (int): Number of actions.
        initial_reward (float): Initialization center for action means.
        reward_noise_std (float): Observation noise standard deviation.
        rng (np.random.Generator): Random generator used by the environment.
        rewards (np.ndarray): Per-action means ``Q*(a)`` sampled at last reset.
            Shape: ``(n_actions,)``.
        best_action (int): Index of the action with the highest mean at last reset.

    Examples:
        >>> env = KArmedBanditEnvironment(n_actions=10, initial_reward=0.0, seed=123)
        >>> env.reset() # resample action means
        >>> r = env.step(0) # take action 0, observe reward
        >>> env.best_action in range(env.n_actions)
        True
    """

    __slots__ = (
        "n_actions",
        "initial_reward",
        "reward_noise_std",
        "rng",
        "rewards",
        "best_action",
    )

    def __init__(
        self,
        n_actions: int,
        initial_reward: float,
        reward_noise_std: float = 1.0,
        seed: Optional[int] = None,
        rng: Optional[np.random.Generator] = None,
    ) -> None:
        if n_actions <= 0:
            raise ValueError("n_actions must be a positive integer.")
        if reward_noise_std < 0:
            raise ValueError("reward_noise_std must be non-negative.")

        self.n_actions = n_actions
        self.initial_reward = initial_reward
        self.reward_noise_std = reward_noise_std
        self.rng = rng if rng is not None else np.random.default_rng(seed)

        self.rewards = np.empty(self.n_actions, dtype=float)
        self.best_action: Optional[int] = None

    def reset(self) -> None:
        """Resample per-action mean rewards.

        The per-action means ``Q*(a)`` are drawn from ``Normal(initial_reward, 1)``.
        """
        self.rewards = self.rng.standard_normal(self.n_actions) + self.initial_reward
        # Cast NumPy's integer index to a plain int to match the annotation.
        self.best_action = int(np.argmax(self.rewards))

    def step(self, action: int) -> float:
        """Take an action and observe a noisy reward.

        Args:
            action (int): Index of the chosen action in ``[0, n_actions)``.

        Returns:
            float: Observed reward drawn from
                ``Normal(Q*(action), reward_noise_std^2)``.

        Raises:
            RuntimeError: If called before :meth:`reset`.
            IndexError: If ``action`` is out of bounds.
        """
        if self.best_action is None:
            raise RuntimeError("Environment not ready. Call `reset()` before the first `step()`.")

        if not (0 <= action < self.n_actions):
            raise IndexError(f"action must be in [0, {self.n_actions - 1}], got {action}")

        # Skip the random draw entirely when the environment is noise-free.
        noise = 0.0 if self.reward_noise_std == 0.0 else self.rng.normal(0.0, self.reward_noise_std)
        # Return a plain Python float rather than a NumPy scalar.
        return float(self.rewards[action] + noise)
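
As a rough illustration of how an agent might interact with such an environment, the sketch below runs a uniformly random policy and records what it observes. `run_random_policy` and `TinyBandit` are hypothetical names introduced here. `TinyBandit` is only a stripped-down stand-in exposing the same `reset()`, `step()`, `n_actions`, and `best_action` interface so that the snippet runs on its own; in the notebook, a `KArmedBanditEnvironment` instance could be passed instead.

```python
import numpy as np


class TinyBandit:
    """Hypothetical stand-in mirroring the interface of the environment above."""

    def __init__(self, n_actions: int, seed: int = 0) -> None:
        self.n_actions = n_actions
        self.rng = np.random.default_rng(seed)
        self.rewards = np.zeros(n_actions)
        self.best_action = None

    def reset(self) -> None:
        # Draw the hidden action means from Normal(0, 1).
        self.rewards = self.rng.standard_normal(self.n_actions)
        self.best_action = int(np.argmax(self.rewards))

    def step(self, action: int) -> float:
        # Observed reward: hidden mean plus unit-variance Gaussian noise.
        return float(self.rewards[action] + self.rng.normal(0.0, 1.0))


def run_random_policy(env, n_steps: int, seed: int = 0):
    """Pick actions uniformly at random; return rewards and optimal-action flags."""
    policy_rng = np.random.default_rng(seed)
    env.reset()
    rewards = np.empty(n_steps)
    picked_best = np.zeros(n_steps, dtype=bool)
    for t in range(n_steps):
        action = int(policy_rng.integers(env.n_actions))
        rewards[t] = env.step(action)
        picked_best[t] = action == env.best_action
    return rewards, picked_best


env = TinyBandit(n_actions=10, seed=123)
rewards, picked_best = run_random_policy(env, n_steps=1000)
print(f"average reward:       {rewards.mean():.3f}")
print(f"optimal-action share: {picked_best.mean():.3f}")
```

A random policy ignores the rewards it observes, so it selects the best action only about one time in k. That makes it a natural baseline for methods that do learn from rewards.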

## 2. Action-Value Methods


___