# Chapter 17 - Making Complex Decisions

*In which we examine methods for deciding what to do today, given that we may face another
decision tomorrow* - Stuart Russell and Peter Norvig in Artificial Intelligence: A Modern Approach

<img src="https://raw.githubusercontent.com/ValRCS/RBS_PBM773_Introduction_to_AI/main/img/ch17_making_complex_decisions/DALL%C2%B7E%202024-03-06%2015.35.21%20-%20Illustrate%20an%20agent%20navigating%20through%20an%20unreliable%20environment%2C%20characterized%20by%20unpredictable%2C%20random%20events.%20The%20scene%20is%20a%20chaotic%2C%20ever-changing.webp" width="500">

## **Introduction**
- Focus on methods for making decisions today that account for potential future decisions.
- Addresses computational issues in decision-making within stochastic environments, contrasting with one-shot decision problems discussed in Chapter 16.
- Emphasizes sequential decision problems where an agent's utility depends on a series of decisions, integrating utilities, uncertainty, sensing, and encompassing search and planning problems.
- Outlines the structure of the chapter:
  - Section 17.1: Definition of sequential decision problems.
  - Section 17.2: Methods for solving sequential decision problems suitable for stochastic environments.
  - Section 17.3: Discussion of multi-armed bandit problems, highlighting their significance and prevalence.
  - Section 17.4: Examination of decision problems in partially observable environments.
  - Section 17.5: Strategies for solving problems in partially observable contexts.

## 17.1 Sequential Decision Problems

- Introduces a 4×3 grid environment as an example to explain sequential decision-making in fully observable, stochastic settings.
- Actions (Up, Down, Left, Right) have an intended effect with a probability of 0.8, but with a probability of 0.2, the action outcome deviates at right angles, introducing stochasticity.
- Utilizes a Markovian transition model, where the probability of reaching s′ from s depends only on the current state s and the chosen action, not on the history of earlier states.
- Introduces a reward function, R(s,a,s′), where the utility for the agent is the sum of rewards received over its actions, encouraging efficient goal achievement.
- Defines a Markov Decision Process (MDP) as consisting of a set of states, a set of actions per state, a transition model (P(s′|s,a)), and a reward function (R(s,a,s′)).
- Solution to an MDP is not a fixed sequence of actions but a policy (π), which specifies the action to take in any state. The quality of a policy is measured by the expected utility of environment histories it generates.
- An optimal policy (π*) is one that yields the highest expected utility. There can be multiple optimal policies depending on the balance of risk and reward, influenced by the value of rewards for nonterminal transitions.
- Highlights how the optimal policy changes with different reward values, showing agents may exhibit different behaviors under varying conditions (e.g., avoiding negative states, preferring certain paths).
- MDPs represent a more realistic approach to decision-making than deterministic models by incorporating uncertainty, making them relevant in various fields like AI, operations research, and economics.
- Sets the stage for discussing utilities, optimal policies, and MDP solution models in more detail, along with solution algorithms in the following sections.


<img src="https://github.com/ValRCS/RBS_PBM773_Introduction_to_AI/blob/main/img/ch17_making_complex_decisions/fig17_1.jpg?raw=true" width="400">

In [10]:
# let's make a class to represent a board location
class Cell:
    def __init__(self, x, y,
                    reward=-0.04,
                    transition_prob=0.8,
                    transition_left=0.1,
                    transition_right=0.1, # TODO: could use a class for these transitions
                    is_terminal=False,
                    is_wall=False,
                    is_start=False):
        self.x = x
        self.y = y
        self.reward = reward
        self.state = 0 # we assume all starting positions have zero starting reward
        self.transition_prob = transition_prob
        self.transition_left = transition_left
        self.transition_right = transition_right
        self.is_terminal = is_terminal
        self.is_wall = is_wall
        self.is_start = is_start

In [11]:
# let's encode our board as digits then create a dictionary mapping locations to cells
board = [[0, 0, 0, 1],
         [0, 9, 0, -1],
         [3, 0, 0, 0]]
print(f"We have a {len(board[0])}x{len(board)} board")
# now map the board to cells
# remember board starts at bottom left so (1,1) is bottom left
# 3 represents start, 9 represents wall
# 0 represents default empty cell
# 1 represents the +1 terminal cell
# -1 represents the -1 terminal cell

cell_map = {}
# so we have different mappings for x and y and row and col
for row, y in enumerate(range(len(board)-1, 0-1, -1), start=1):
    for col, x in enumerate(range(len(board[0])), start=1):
        if board[y][x] == 0:
            cell_map[(col, row)] = Cell(col, row)
        elif board[y][x] == 1:
            cell_map[(col, row)] = Cell(col, row, reward=1, is_terminal=True)
        elif board[y][x] == -1:
            cell_map[(col, row)] = Cell(col, row, reward=-1, is_terminal=True)
        elif board[y][x] == 3:
            cell_map[(col, row)] = Cell(col, row, is_start=True)
        elif board[y][x] == 9:
            cell_map[(col, row)] = Cell(col, row, is_wall=True)

# print our cells
for k, v in cell_map.items():
    print(k, v.x, v.y, v.reward, v.is_terminal, v.is_start, v.state)
# board_string = """
# 0001
# 0X0i
# S000
# """


We have a 4x3 board
(1, 1) 1 1 -0.04 False True 0
(2, 1) 2 1 -0.04 False False 0
(3, 1) 3 1 -0.04 False False 0
(4, 1) 4 1 -0.04 False False 0
(1, 2) 1 2 -0.04 False False 0
(2, 2) 2 2 -0.04 False False 0
(3, 2) 3 2 -0.04 False False 0
(4, 2) 4 2 -1 True False 0
(1, 3) 1 3 -0.04 False False 0
(2, 3) 2 3 -0.04 False False 0
(3, 3) 3 3 -0.04 False False 0
(4, 3) 4 3 1 True False 0


In [None]:
# TODO create agent that moves around the board
# It should have a policy and a value function

In [7]:
# let's create a function simple policy for moving
import random
def random_policy(col, row):
    choices = ["NORTH","SOUTH","WEST","EAST"]
    return random.choice(choices)
# let's test it
random_policy(2,3)

'EAST'

In [9]:
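# Given a command and the current (col, row), sample the actual move:
# the intended direction with probability transition_prob, a slip to the left with
# probability transition_left, and a slip to the right with whatever probability remains.
# NOTE: rows here grow towards SOUTH (NORTH is row - 1), whereas cell_map numbers rows
# from the bottom of the board. That is harmless for a random policy, but a directed
# policy would need the two conventions aligned.
def move(command, col, row, transition_prob = 0.8, transition_left = 0.1):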
    transition = random.random()
    if command == "NORTH":
        if transition < transition_prob:
            return col, row - 1
        elif transition < transition_prob + transition_left:
            return col - 1, row
        else: # WHATEVER remains IS RIGHT
            return col + 1, row
    elif command == "SOUTH":
        if transition < transition_prob:
            return col, row + 1
        elif transition < transition_prob + transition_left:
            return col + 1, row
        else:
            return col - 1, row
    elif command == "WEST":
        if transition < transition_prob:
            return col - 1, row
        elif transition < transition_prob + transition_left:
            return col, row + 1
        else:
            return col, row - 1
    elif command == "EAST":
        if transition < transition_prob:
            return col + 1, row
        elif transition < transition_prob + transition_left:
            return col, row - 1
        else:
            return col, row + 1


# let's test it
move("NORTH", 2, 3), move("SOUTH", 2, 3), move("WEST", 2, 3), move("EAST", 2, 3)

((2, 2), (2, 4), (2, 2), (3, 3))

In [14]:
# let's update position and reward on board
# if we hit a wall or a nonexistent position (outside our board) we add the current cell's reward (-0.04 by default) to its state
# and return the current (unchanged) position
# if we make a move successfully we update the new cell's state with its reward and return the new position
def update_position(col, row, command):
    new_col, new_row = move(command, col, row)
    if (new_col, new_row) not in cell_map:
        cell_map[(col, row)].state += cell_map[(col, row)].reward
        return col, row
    # Walls have value of 9
    if cell_map[(new_col, new_row)].is_wall:
        cell_map[(col, row)].state += cell_map[(col, row)].reward
        return col, row
    # Terminal cells have value of 1 or -1
    if cell_map[(new_col, new_row)].is_terminal:
        cell_map[(new_col, new_row)].state += cell_map[(new_col, new_row)].reward
        return new_col, new_row
    # normal cell
    cell_map[(new_col, new_row)].state += cell_map[(new_col, new_row)].reward
    return new_col, new_row

In [17]:
# now let's initialize our starting position and run our random algorithm until we reach terminal state
# note: cell_map flags (1, 1) as the start cell, while this run begins from (1, 3)
col, row = 1, 3
while not cell_map[(col, row)].is_terminal:
    command = random_policy(col, row)
    col, row = update_position(col, row, command)
print(f"We reached a terminal state at {col},{row}")
# print final cell values
for k, v in cell_map.items():
    print(k, v.x, v.y, v.reward, v.is_terminal, v.is_start, v.state)

We reached a terminal state at 4,3
(1, 1) 1 1 -0.04 False True -0.4799999999999999
(2, 1) 2 1 -0.04 False False -0.2
(3, 1) 3 1 -0.04 False False -0.2
(4, 1) 4 1 -0.04 False False -0.36
(1, 2) 1 2 -0.04 False False -0.68
(2, 2) 2 2 -0.04 False False 0
(3, 2) 3 2 -0.04 False False -0.28
(4, 2) 4 2 -1 True False -2
(1, 3) 1 3 -0.04 False False -0.8800000000000002
(2, 3) 2 3 -0.04 False False -0.5199999999999999
(3, 3) 3 3 -0.04 False False -0.28
(4, 3) 4 3 1 True False 2


In [None]:
## TODO think of a better policy building approach
# could be value iteration or policy iteration

### 17.1.1 Utilities Over Time
- Differentiates between **finite horizon** and **infinite horizon** decision-making in MDPs:
  - **Finite Horizon**: There is a fixed time after which no further decisions matter, so the optimal action in a state depends on the time remaining, leading to nonstationary policies.
  - **Infinite Horizon**: No fixed deadline for decision-making, leading to simpler, stationary policies where optimal actions depend only on the current state.
- Introduces **additive discounted rewards** as a method for calculating the utility of state sequences, where future rewards are discounted by a factor γ (between 0 and 1), reflecting the agent's preference for sooner over later rewards.
- Justifies additive discounted rewards through several lenses:
  - **Empirical Observation**: Humans and animals value immediate rewards more.
  - **Economic Rationale**: Immediate rewards can be invested to yield further gains.
  - **Uncertainty Avoidance**: Future rewards are uncertain; discounting mimics the risk of not receiving a future reward.
  - **Preference-Independence Assumption**: Preferences between state sequences are stationary, making additive discounting a natural fit.
  - **Infinite Sequences Management**: Discounts help avoid the issue of infinite utilities in endless sequences by ensuring the total utility remains finite.
- Discusses the utility of infinite sequences and the concept of a **proper policy**, one that is guaranteed to reach a terminal state, which allows undiscounted rewards (γ = 1) to be used under certain conditions.
- Mentions **average reward per time step** as an alternative for evaluating policies, especially when dealing with infinite sequences, but notes its complexity.
- Concludes with the preference for using additive discounted rewards for simplicity and ease of analysis in evaluating environment histories and policy effectiveness.
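
As a minimal sketch of additive discounted rewards, the utility of a finite reward sequence is the sum of γ^t · R_t over its steps. The helper name `discounted_return` and the sample reward sequence are illustrative assumptions, not something defined in the chapter.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence (t starts at 0)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical history in the 4x3 world: four -0.04 steps, then +1 on reaching the goal
rewards = [-0.04, -0.04, -0.04, -0.04, 1.0]

print(discounted_return(rewards, gamma=1.0))  # undiscounted sum of rewards
print(discounted_return(rewards, gamma=0.9))  # later rewards count for less
```

With γ = 1 this reduces to the plain sum of rewards; any γ below 1 shrinks the contribution of the final +1 because it arrives four steps later.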

<img src="https://github.com/ValRCS/RBS_PBM773_Introduction_to_AI/blob/main/img/ch17_making_complex_decisions/fig17_2.jpg?raw=true" width="400">

### Discounted Rewards

The idea is that a bird in the hand is worth two in the bush. Given a choice between $1 today and $1 tomorrow, you would choose $1 today, because you can invest it and have more than $1 tomorrow.

In [3]:
## let's calculate how much 1000 Euros would be worth in 10 years given 7.2% interest rate

# define the variables
principal = 1000
rate = 0.072
time = 10

# calculate the amount
amount = principal * (1 + rate) ** time
print(f"After {time} years, {principal} Euros would be worth {amount:.2f} Euros")

After 10 years, 1000 Euros would be worth 2004.23 Euros


In [None]:
## The 7.2% rate above was picked with the rule of 72 in mind. The rule of 72 is a quick formula for estimating the number of years needed to double invested money at a given annual rate of return: divide 72 by the rate of return in percent.
## In this case, 72 / 7.2 = 10, so the amount roughly doubles in 10 years (the exact calculation above gives 2004.23).

## Technically the formula for compound interest is A = P(1 + r/n)^(nt), where:
## A = the amount of money accumulated after t years, including interest.
## P = the principal amount (the initial amount of money).
## r = the annual interest rate (decimal).
## n = the number of times that interest is compounded per year.
## t = the number of years the money is invested (the code above uses n = 1).

In [4]:
## So Fry from Futurama had
principal = 0.93 # 93 cents in 1999
interest = 0.0225 # 2.25% interest rate
years = 1000 # 1000 years in the future
amount = principal * (1 + interest) ** years
print(f"After {years} years, {principal:.2f} USD  would be worth {amount:.2f} USD for Fry in 2999.")

After 1000 years, 0.93 USD  would be worth 4283508449.71 USD for Fry in 2999.
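
The link to discounting can be made explicit: an annual interest rate r corresponds to a discount factor γ = 1 / (1 + r), so a reward t years away is worth γ^t of its face value today. The sketch below reuses the 7.2% example; the variable names are illustrative.

```python
rate = 0.072
gamma = 1 / (1 + rate)  # discount factor equivalent to a 7.2% interest rate

future_amount = 1000 * (1 + rate) ** 10      # what 1000 Euros grow to in 10 years
present_value = future_amount * gamma ** 10  # discounting undoes the compounding

print(f"gamma = {gamma:.4f}")
print(f"{future_amount:.2f} Euros in 10 years is worth {present_value:.2f} Euros today")
```

Read in the other direction, a discount factor γ corresponds to an interest rate of 1/γ − 1.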


### 17.1.2 Optimal Policies and the Utilities of States
- Establishes that the utility of a history is the sum of discounted rewards, allowing for comparison of policies based on expected utilities.
- Defines the expected utility of executing a policy π from an initial state s, factoring in the policy, initial state, and the environment's transition model.
- Identifies an optimal policy π* that maximizes expected utility regardless of the initial state, leading to the conclusion that the optimal policy is independent of the starting state in infinite-horizon scenarios.
- Introduces the **Bellman equation**  as central to understanding the utility of states, expressing a state's utility as the maximum expected reward for any action taken from that state plus the expected discounted utility of subsequent states.
- Explains the **action-utility function (Q-function)** , which calculates the expected utility of taking a specific action in a given state, facilitating the derivation of optimal policies.
- Demonstrates the Bellman equation for both the utility of states (U) and the Q-function (Q), illustrating how solving these equations enables the determination of optimal policies.
- Emphasizes the Q-function's significance in MDP solution algorithms, underlining its role in determining the utility of actions and thereby guiding optimal decision-making.

This subchapter delves into the theoretical foundations necessary for understanding how optimal policies are formulated within the context of MDPs, highlighting the mathematical framework (Bellman equations) used to evaluate and determine the best course of action given the utility of states and transitions.
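
For reference, the relationships described in the bullets can be written out in the usual notation, with transition model $P(s' \mid s, a)$, reward $R(s, a, s')$, and discount factor $\gamma$. This is the standard textbook form rather than anything specific to these notes:

$$
\begin{aligned}
U^{\pi}(s) &= E\left[\sum_{t=0}^{\infty} \gamma^{t} R(S_t, \pi(S_t), S_{t+1})\right], \quad S_0 = s \\
U(s) &= \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma U(s')] \\
Q(s, a) &= \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma \max_{a'} Q(s', a')] \\
U(s) &= \max_{a} Q(s, a), \qquad \pi^{*}(s) = \operatorname*{argmax}_{a} Q(s, a)
\end{aligned}
$$

The second line is the Bellman equation for state utilities, the third is its counterpart for the Q-function, and the last line shows how either one yields the optimal policy.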

### 17.1.3 Reward Scales

- Discusses the impact of reward scaling on the optimal policy: under additive discounted rewards over an infinite horizon, a positive affine transformation of the rewards, R′(s,a,s′) = m·R(s,a,s′) + b with m > 0, changes the magnitude of the utility values but not the optimal policy.

- Shaping Theorem: The optimal policy also remains unchanged under potential-based reward shaping, R′(s,a,s′) = R(s,a,s′) + γΦ(s′) − Φ(s), where Φ is an arbitrary function of the state.
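
A hedged sketch of why potential-based shaping, the form in which the shaping theorem is usually stated, leaves policy rankings alone: with R′(s,a,s′) = R(s,a,s′) + γΦ(s′) − Φ(s), the potential terms telescope along any trajectory, so the shaped return differs from the original only by terms that depend on the endpoints. The potential `phi` and the trajectory below are made-up illustrations.

```python
def plain_return(rewards, gamma):
    """Discounted return of the original rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def shaped_return(states, rewards, phi, gamma):
    """Discounted return using R' = R + gamma * phi(s') - phi(s) on each step."""
    total = 0.0
    for t, r in enumerate(rewards):
        s, s_next = states[t], states[t + 1]
        total += gamma ** t * (r + gamma * phi(s_next) - phi(s))
    return total

def phi(s):
    """Illustrative potential: negative Manhattan distance to the goal at (4, 3)."""
    return -(abs(4 - s[0]) + abs(3 - s[1]))

# Hypothetical trajectory through the 4x3 grid, in (col, row) coordinates
states = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
rewards = [-0.04, -0.04, -0.04, -0.04, 1.0]
gamma = 0.9

T = len(rewards)
lhs = shaped_return(states, rewards, phi, gamma)
rhs = plain_return(rewards, gamma) + gamma ** T * phi(states[-1]) - phi(states[0])
print(abs(lhs - rhs) < 1e-12)  # the two sides agree up to floating-point error
```

Because the correction depends only on the start state and the (discounted) end state, every policy started from the same state is shifted in the same way.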



### 17.1.4 Representing MDPs
- MDPs can be represented with large, three-dimensional tables for transition probabilities $P(s' \mid s, a)$ and rewards $R(s, a, s')$, suitable for small problems but impractical for larger ones due to their size.
- For sparser cases, the size reduces to $O(|S||A|)$, but this is still too large for complex problems.
- Dynamic Decision Networks (DDNs) offer a more efficient representation by extending dynamic Bayesian networks (DBNs) with decision, reward, and utility nodes, providing an exponential complexity advantage over atomic (table-based) representations.
- DDNs allow for the decomposition of the state into several variables, demonstrating this with a mobile robot example. The state variables include the robot's location and orientation ($X_t$), rate of change ($\dot{X}_t$), charging status ($Charging_t$), and battery level ($Battery_t$). The action set is comprised of variables for plugging/unplugging, and power sent to each wheel.
- Transition models in DDNs are computed as products of conditional probabilities, focusing on the interactions between specific actions and state variables.
- Rewards in DDNs can depend on various factors like location or charging status but might not directly depend on actions or the outcome state.
- The utility at a future time $t+3$ in the network accounts for all rewards from that point onwards, and heuristic approximations to utility can be included to avoid further expansion, similar to bounded-depth search and heuristic evaluations in games.
- Tetris is provided as an example of an MDP with a massive state space ($7 \times 7 \times 2^{200} \approx 10^{62}$ states), showing that even games with simple rules can have complex state spaces, necessitating efficient representations like DDNs for practical problem-solving. Every policy in Tetris is proper, as the game always reaches a terminal state with the board filling up.
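
The size quoted for Tetris is easy to check with exact integer arithmetic. The reading of the three factors in the comment (current piece, next piece, one bit per cell of a 10×20 board) is an assumption about how the figure is composed.

```python
import math

# Assumed breakdown: 7 current pieces x 7 next pieces x 2**200 board fillings
n_states = 7 * 7 * 2 ** 200

print(f"{len(str(n_states))} decimal digits")
print(f"about 10^{math.log10(n_states):.1f} states")
```

A flat table with one entry per state is clearly out of reach at this scale, which is the motivation for factored representations such as DDNs.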

This subchapter highlights the challenges of representing MDPs for complex or large-scale problems and presents DDNs as a powerful tool for efficiently modeling such problems, making it feasible to handle the vast state spaces and intricate dynamics involved.

<img src="https://github.com/ValRCS/RBS_PBM773_Introduction_to_AI/blob/main/img/ch17_making_complex_decisions/fig17_5.jpg?raw=true" width="400">

## 17.2 Algorithms for MDPs
- **Value Iteration**: An iterative algorithm that repeatedly updates the utility of each state until the change in utility is below a small threshold (indicating convergence). It directly applies the Bellman equation to update state utilities, making it straightforward and widely applicable. Value iteration is efficient for problems where the utility values converge quickly, but it may require many iterations for precise convergence, especially in environments with long sequences of decisions.
- **Policy Iteration**: Begins with an arbitrary policy and iteratively improves it. The process involves two main steps: policy evaluation, where the utility of following the current policy is calculated for each state, and policy improvement, where the policy is updated by choosing actions that lead to the highest utility based on current estimates. Policy iteration typically converges in fewer iterations than value iteration because each step improves the policy directly, although each iteration is more expensive since it includes a full policy evaluation.
- **Linear Programming**: Solves MDPs by framing the problem as a linear objective with linear inequality constraints derived from the Bellman equations. This approach finds the optimal utility values for all states simultaneously by solving the linear program. While linear programming provides an exact solution and can be efficient for certain problem sizes, it may not scale well to very large MDPs due to the computational complexity of linear programming solvers.
- **Online Approximate Algorithms (Monte Carlo Planning)**: Instead of computing or approximating the utility of all states in advance (offline), these algorithms focus on the current decision by simulating outcomes of possible actions and using the results to estimate the utilities of actions. Monte Carlo planning and other online methods are particularly useful when it's impractical to solve the entire MDP upfront due to the size or complexity of the state space. These algorithms prioritize computational effort on parts of the state space that are relevant to the current decision, making them efficient for large or complex problems.

Each algorithm offers a different trade-off between computational complexity, scalability, and the need for complete problem knowledge upfront. Value and policy iteration are classic methods that provide exact solutions but may struggle with very large state spaces. Linear programming offers a powerful alternative for exact solutions but can be computationally intensive. Online approximate algorithms, including Monte Carlo planning, offer practical solutions for large or complex MDPs by focusing computational resources on the most relevant decisions.

### 17.2.1 Value Iteration

Value Iteration is a method for solving Markov Decision Processes (MDPs) that leverages the Bellman equation to iteratively compute the utility of each state until convergence. It's a foundational approach due to its generality and simplicity, applicable across various decision-making problems where outcomes are partly uncertain and partly under the control of a decision maker.
- **Principle**: The algorithm starts with arbitrary initial utility values for all states and repeatedly applies the Bellman update across all states. This update refines utility estimates based on the utilities of neighboring states, converging to the true utilities over iterations.
- **Bellman Update**: Each iteration updates the utility of a state $s$ to the maximum expected utility attainable from $s$, considering all possible actions $a$ and subsequent states $s'$. The formula used is: $U_{i+1}(s) = \max_a \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma U_i(s')]$, where $U_i(s)$ is the estimated utility of state $s$ at iteration $i$, $P$ is the transition probability, $R$ is the reward, and $\gamma$ is the discount factor.
- **Convergence**: The algorithm is guaranteed to converge to the true utilities, providing a unique solution to the Bellman equations as long as the discount factor $\gamma < 1$. This convergence is due to the contraction property of the Bellman update, meaning with each iteration, the utility estimates get "closer" to the true utilities, reducing the maximum error exponentially fast.
- **Termination Condition**: In practice, the algorithm stops when the change in utility values between iterations is sufficiently small, indicating that further iterations won't significantly change the outcome. This condition is typically set as a small $\epsilon$, ensuring that the estimated utilities are within an acceptable error margin of the true utilities.
- **Policy Derivation**: Once convergence is achieved or the termination condition is met, the optimal policy can be derived from the final utility values using a one-step look-ahead, selecting actions that maximize the expected utility.
- **Error Analysis and Policy Loss**: The algorithm includes ways to assess the error in utility estimates and the potential loss in policy effectiveness if iterations are stopped early. These analyses provide guarantees on the quality of the resulting policy, even if the iterations are halted before absolute convergence.
- **Practical Implications**: Value iteration is a robust and widely used method in decision-making models, especially in fields like robotics, automated planning, and AI game strategies. It balances computational efficiency with accuracy, allowing for practical application in complex environments.

Value iteration's strength lies in its simplicity and the strong theoretical underpinnings that guarantee convergence and optimality, making it a staple method in the toolbox for solving MDPs.
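
The steps above can be sketched on the 4×3 world from Section 17.1. This is a minimal illustration under stated assumptions, not a reference implementation: the layout (wall at (2, 2), terminals at (4, 3) and (4, 2)), the reward convention (+1 / −1 on entering a terminal state, −0.04 otherwise), γ = 0.9, and the simple stopping threshold are all choices made for the example, and every name in the block is hypothetical.

```python
# Grid world model: (col, row) coordinates with (1, 1) at the bottom left
ACTIONS = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
LEFT_OF = {"Up": "Left", "Left": "Down", "Down": "Right", "Right": "Up"}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

WALL = (2, 2)
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]

def step(s, direction):
    """Deterministic move; bumping into the wall or the edge leaves s unchanged."""
    dc, dr = ACTIONS[direction]
    nxt = (s[0] + dc, s[1] + dr)
    return nxt if nxt in STATES else s

def transitions(s, a):
    """P(s' | s, a) as a dict: 0.8 intended direction, 0.1 for each right-angle slip."""
    dist = {}
    for direction, p in ((a, 0.8), (LEFT_OF[a], 0.1), (RIGHT_OF[a], 0.1)):
        s2 = step(s, direction)
        dist[s2] = dist.get(s2, 0.0) + p
    return dist

def reward(s, a, s2):
    """R(s, a, s'): +1 / -1 on entering a terminal state, otherwise -0.04 per step."""
    return TERMINALS.get(s2, -0.04)

def q_value(s, a, U, gamma):
    """Expected utility of taking action a in state s, given utility estimates U."""
    return sum(p * (reward(s, a, s2) + gamma * U[s2])
               for s2, p in transitions(s, a).items())

def value_iteration(gamma=0.9, epsilon=1e-6):
    """Apply the Bellman update to every state until the largest change is below epsilon."""
    U = {s: 0.0 for s in STATES}  # terminal states keep utility 0
    while True:
        U_next = dict(U)
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                continue
            U_next[s] = max(q_value(s, a, U, gamma) for a in ACTIONS)
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < epsilon:
            return U

def greedy_policy(U, gamma=0.9):
    """One-step look-ahead: pick the action with the highest Q-value in each state."""
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, U, gamma))
            for s in STATES if s not in TERMINALS}

U = value_iteration()
pi = greedy_policy(U)
for r in (3, 2, 1):  # print the top row first
    print([f"{U[(c, r)]:.3f}" if (c, r) in U else " wall" for c in range(1, 5)])
print(pi)
```

Unlike the sampling-based `move` function earlier in the notebook, `transitions` returns the full outcome distribution, which is what the Bellman update needs.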

### 17.2.2 Policy Iteration

Policy iteration is an algorithm that efficiently finds optimal policies for MDPs through an iterative process that alternates between policy evaluation and policy improvement steps, starting from an arbitrary initial policy.
- **Policy Evaluation**: For a given policy $\pi_i$, calculate $U_i = U^{\pi_i}$, the utility of each state if $\pi_i$ were executed indefinitely. This step involves solving a set of linear equations, one for each state, based on the simplified Bellman equation. The equations are linear because the action in each state is predetermined by the policy, removing the need for the "max" operator. This can be solved exactly using linear algebra techniques in $O(n^3)$ time for $n$ states.
- **Policy Improvement**: Generate a new policy $\pi_{i+1}$ by choosing for each state the action that maximizes the expected utility using one-step look-ahead based on $U_i$. This step updates the policy by selecting the action in each state that results in the highest utility, according to the current estimate of state utilities.

The algorithm terminates when a policy improvement step yields no change in the policy, indicating that the current policy is optimal. Since the number of possible policies is finite and each iteration of policy improvement yields a strictly better policy, convergence is guaranteed.

For large state spaces, exact policy evaluation might be computationally expensive. In such cases, approximate methods like simplified value iteration steps (with the policy fixed) can be used to estimate utilities, a technique known as modified policy iteration.

Additionally, asynchronous policy iteration, a variant of the algorithm, allows for updating the utility or policy for any subset of states at each iteration, rather than all states simultaneously. This can lead to more efficient algorithms, especially when updates focus on states likely to be reached under a good policy.

Overall, policy iteration is a powerful method for determining optimal policies in MDPs, offering a balance between computational efficiency and accuracy through its two-step iterative approach.

### Policy Iteration Algorithm
```python
def policy_iteration(mdp, epsilon=0.001, gamma=0.99):
    """
    Perform policy iteration on a given MDP.

    Args:
    - mdp: An MDP exposing states, actions(s), P(s_prime, s, a) (transition model),
           and R(s, a, s_prime) (reward).
    - epsilon: Small threshold for the stopping criterion in policy evaluation.
    - gamma: Discount factor for future rewards.

    Returns:
    - pi: The resulting policy, a mapping from non-terminal states to actions.
    """
    # Initialize utilities to zero and the policy to the first action available in each state.
    # States without actions (terminal states) get no policy entry.
    U = {s: 0.0 for s in mdp.states}
    pi = {s: mdp.actions(s)[0] for s in mdp.states if mdp.actions(s)}

    while True:
        U = policy_evaluation(pi, U, mdp, epsilon, gamma)
        unchanged = True
        for s in pi:
            # Policy improvement step: one-step look-ahead on the current utilities
            a_star = max(mdp.actions(s), key=lambda a: q_value(mdp, s, a, U, gamma))
            if q_value(mdp, s, a_star, U, gamma) > q_value(mdp, s, pi[s], U, gamma):
                pi[s] = a_star
                unchanged = False

        if unchanged:
            break

    return pi

def policy_evaluation(pi, U, mdp, epsilon, gamma):
    """
    Evaluate a fixed policy by repeated Bellman updates without the max.
    This is iterative (approximate) evaluation; solving the linear system
    exactly is the alternative described in the text.
    """
    U = dict(U)  # work on a copy so the caller's estimates are not mutated
    while True:
        delta = 0.0
        for s in pi:
            new_u = q_value(mdp, s, pi[s], U, gamma)
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < epsilon:
            return U

def q_value(mdp, s, a, U, gamma):
    """
    Compute the Q-value for a state-action pair given the utility of states.
    """
    return sum(mdp.P(s_prime, s, a) * (mdp.R(s, a, s_prime) + gamma * U[s_prime])
               for s_prime in mdp.states)

# Minimal table-based MDP container; fill the tables in for your own problem
class MDP:
    def __init__(self, states, actions, transitions, rewards):
        self.states = states              # list of states
        self._actions = actions           # dict: state -> list of actions ([] if terminal)
        self._transitions = transitions   # dict: (s, a) -> {s_prime: probability}
        self._rewards = rewards           # dict: (s, a, s_prime) -> reward

    def actions(self, s):
        # Return a list of actions available in state s
        return self._actions.get(s, [])

    def P(self, s_prime, s, a):
        # Transition probability P(s_prime | s, a); missing entries count as 0
        return self._transitions.get((s, a), {}).get(s_prime, 0.0)

    def R(self, s, a, s_prime):
        # Reward R(s, a, s_prime); missing entries count as 0
        return self._rewards.get((s, a, s_prime), 0.0)

# Example of how to run:
# mdp = MDP(states, actions, transitions, rewards)  # Assume these tables are defined elsewhere
# policy = policy_iteration(mdp)
```



This code is a framework and will need to be filled in with the specifics of your MDP, including how states, actions, transitions, and rewards are defined and managed. The critical part is the iteration between policy evaluation and improvement, which is central to the policy iteration algorithm.

### 17.2.3 Linear Programming

Linear programming (LP) offers a different approach to solving Markov Decision Processes (MDPs) by framing the problem as a constrained optimization task. The goal here is to identify the utilities of each state in a way that these utilities adhere to the Bellman equations, which describe the relationship between the utility of a state and the utilities of adjacent states, factoring in the rewards and transition probabilities. This section underscores how solving MDPs can transition from the realm of dynamic programming to linear programming, providing an overview of the process and its implications.
### Key Points of Linear Programming in MDPs
- **LP Formulation**: To solve an MDP using linear programming, one defines a linear program where the utilities of states, $U(s)$, are the variables. The objective is to minimize the sum of these utilities subject to a set of constraints derived from the Bellman equations. These constraints ensure that the utility of any given state $s$ is at least as large as the expected return from any action $a$ taken in that state, considering both immediate rewards and discounted future utilities.
- **Constraints**: The constraints are formulated as $U(s) \geq \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma U(s')]$ for every state $s$ and action $a$. These constraints essentially encapsulate the Bellman equations within the LP framework, ensuring the solution adheres to the dynamics defined by the MDP.
- **Optimal Policy Utilities**: Because the objective minimizes the utilities, the solution is the smallest set of values that satisfies every constraint, which makes the constraint for the best action in each state hold with equality. These values are exactly the utilities under an optimal policy, demonstrating the connection between LP solutions and MDP solutions.
- **Efficiency and Practicality**: While linear programming is known to be solvable in polynomial time, and thus theoretically offers a polynomial-time solution to MDPs, in practice, LP solvers may not be as efficient as dynamic programming methods specifically tailored for MDPs. The polynomial time complexity sounds appealing, but given the often large number of states in practical MDPs, the computational load can still be substantial.
- **Comparison with Dynamic Programming**: The discussion hints at the broader computational landscape, comparing the efficiency of LP solvers with dynamic programming approaches. While LP provides a theoretically solid framework for solving MDPs, dynamic programming methods, being more specialized, often outperform LP in practice. Furthermore, the text contrasts this with the notion that even basic search algorithms, despite their simplicity, run in linear time relative to the number of states and actions, highlighting the trade-offs involved in choosing an approach.
### Conclusion

The linear programming approach to solving MDPs illuminates the versatility and breadth of methods available for dealing with decision-making under uncertainty. It showcases how MDPs can be tackled from a mathematical optimization standpoint, emphasizing the theoretical underpinnings that allow these problems to be solved in polynomial time. However, the practical considerations and computational efficiency of dynamic programming often make it the preferred method for solving MDPs in real-world scenarios.
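
As a rough sketch of the LP formulation, the block below solves a made-up two-state deterministic MDP with `scipy.optimize.linprog`. The MDP, its rewards, and γ = 0.5 are assumptions chosen so that the answer can be checked by hand against the Bellman equations (U(1) = 2 + 0.5·U(1) gives 4, and U(0) = 1 + 0.5·4 gives 3).

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.5
states = [0, 1]
# Hypothetical deterministic MDP: (state, action) -> (next_state, reward)
model = {
    (0, "stay"): (0, 0.0),
    (0, "go"):   (1, 1.0),
    (1, "stay"): (1, 2.0),
    (1, "go"):   (0, 0.0),
}

# One inequality per (s, a):  U(s) >= R + gamma * U(s')
# rewritten in linprog's  A_ub @ U <= b_ub  form as:  gamma * U(s') - U(s) <= -R
A_ub, b_ub = [], []
for (s, a), (s2, r) in model.items():
    row = np.zeros(len(states))
    row[s2] += gamma
    row[s] -= 1.0
    A_ub.append(row)
    b_ub.append(-r)

# Minimize the sum of utilities. linprog bounds variables to be >= 0 by default,
# so the bounds are lifted explicitly because utilities may be negative in general.
result = linprog(c=np.ones(len(states)),
                 A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                 bounds=[(None, None)] * len(states))
print(result.x)
```

For a stochastic MDP each constraint row would carry γ·P(s′ | s, a) in every successor column, and the right-hand side would be the negated expected immediate reward.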

### 17.2.4 Online algorithms for MDPs

This section focuses on online algorithms for solving Markov Decision Processes (MDPs), particularly useful when dealing with very large MDPs where exact offline solutions are impractical or impossible. Unlike offline algorithms like value iteration and policy iteration, which compute an optimal policy beforehand, online algorithms compute or refine policies on-the-fly as decisions need to be made.

**Expectimax Approach for MDPs:**
- A straightforward online method for MDPs is the Expectimax algorithm, which builds a decision tree with alternating decision (max) nodes and chance nodes.
- The tree's depth is limited by the ε-horizon concept, which bounds the absolute error in computed utilities based on a tree of finite depth, effectively ignoring distant future rewards that have minimal impact on current decisions.
- However, practical use of Expectimax in MDPs may be hampered by large branching factors at chance nodes, making full expansion computationally expensive.
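
The expectimax procedure described above can be sketched for a toy problem. The two-state model, the discount factor, and the function names below are hypothetical; the depth bound is the smallest `H` for which the discounted tail `gamma**H * R_max / (1 - gamma)` is at most `epsilon`, assuming rewards lie between 0 and `R_MAX`.

```python
import math

GAMMA = 0.5
R_MAX = 1.0  # assumed upper bound on any single reward

# Hypothetical model: MODEL[state][action] = list of (probability, next_state, reward)
MODEL = {
    "s0": {"safe": [(1.0, "s0", 0.5)],
           "risky": [(0.5, "s1", 1.0), (0.5, "s0", 0.0)]},
    "s1": {"safe": [(1.0, "s1", 1.0)],
           "risky": [(1.0, "s0", 0.0)]},
}


def epsilon_horizon(epsilon, gamma=GAMMA, r_max=R_MAX):
    """Smallest depth H with gamma**H * r_max / (1 - gamma) <= epsilon."""
    return max(0, math.ceil(math.log(epsilon * (1 - gamma) / r_max, gamma)))


def expectimax(state, depth):
    """Return (value, best_action) from a depth-limited expectimax tree."""
    if depth == 0:
        return 0.0, None  # rewards beyond the horizon are ignored
    best_value, best_action = -math.inf, None
    for action, outcomes in MODEL[state].items():  # decision (max) node
        # Chance node: expectation over the outcomes of this action
        value = sum(p * (r + GAMMA * expectimax(s2, depth - 1)[0])
                    for p, s2, r in outcomes)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


depth = epsilon_horizon(0.01)
value, action = expectimax("s0", depth)
print(depth, action, round(value, 4))
```

Every chance node here is expanded in full, which is exactly what becomes too expensive when an action has many possible outcomes.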

**Sampling to Handle Large Branching Factors:**
- To manage large branching factors, sampling a limited number of outcomes from the action's possible results can approximate the value of chance nodes. This approach focuses on the most likely outcomes, thus efficiently estimating the node value without exhaustive enumeration.

**Real-Time Dynamic Programming (RTDP):**
- RTDP treats the explored states as a sub-MDP and solves it using available algorithms, akin to the LRTA* algorithm for heuristic search.
- RTDP can be effective in moderate-sized domains but faces challenges in larger domains like Tetris, where sparse rewards and minimal state repetitions diminish its effectiveness.

**Monte Carlo Tree Search (UCT) for MDPs:**
- The UCT algorithm was originally developed for MDPs, so it applies here with only minor changes: the "opponent" is the stochastic environment, and the search must keep track of rewards rather than only wins and losses.
- While UCT may struggle in small, loop-rich environments like the 4×3 world due to inefficient exploration, it's potentially more suitable for complex problems like Tetris. Here, the algorithm's ability to look further ahead can provide valuable insights into the consequences of risky moves.

**Overall, online algorithms for MDPs:**
- Offer a practical alternative for handling large state spaces where offline solution methods are infeasible.
- Utilize techniques like sampling and tree-based exploration to make real-time decisions with bounded error.
- Can be augmented with reinforcement learning for improved heuristic estimation or adapted for specific domains to enhance performance.


<img src="https://raw.githubusercontent.com/ValRCS/RBS_PBM773_Introduction_to_AI/main/img/ch17_making_complex_decisions/DALL%C2%B7E%202024-03-06%2015.33.10%20-%20Create%20an%20illustration%20representing%20the%20multi-armed%20bandit%20problem.%20Picture%20a%20row%20of%20colorful%2C%20vintage%20slot%20machines%2C%20each%20with%20unique%20designs%2C%20placed.webp" width="500">

## 17.3 Bandit Problems
- **Introduction to N-armed Bandit Problem:**
- The n-armed bandit problem models a scenario where a gambler must choose between multiple slot machines (arms), each with an unknown reward distribution.
- It represents a fundamental dilemma in decision theory: the tradeoff between exploiting known resources (exploitation) and exploring unknown options (exploration) for potentially greater rewards.
- **Real-world Applications:**
- This problem mirrors real-life decision-making situations such as selecting medical treatments, making investment choices, funding research projects, or choosing advertisements to display on web pages.
- **Historical Context:**
- Initially considered during World War II, the n-armed bandit problem was once suggested as a form of intellectual sabotage due to its complexity.
- Early researchers mistakenly believed that optimal strategies would always converge on the best arm, a belief later debunked by further studies.
- **Optimal Strategy Insights:**
- Contrary to initial assumptions, optimal strategies may not always favor the arm with the highest expected reward, highlighting the nuanced nature of decision-making under uncertainty.
- **Gittins Indices:**
- Gittins indices are introduced as a method for solving the n-armed bandit problem. Each arm is assigned an index that depends only on that arm's own state, so arms can be evaluated independently of one another.
- An optimal policy involves pulling the arm with the highest Gittins index, simplifying the decision-making process to linear time for the initial decision and constant time for subsequent decisions.
- **Conclusion:**
- The n-armed bandit problem, along with its solutions like Gittins indices, underscores the importance of balancing exploration and exploitation. It provides a framework for understanding and navigating decisions in uncertain environments across various domains.

### 17.3.1 Calculating the Gittins Index
- **Understanding the Gittins Index:**
- The Gittins index is calculated for an arm in a multi-armed bandit problem to determine the optimal point to switch from one arm (decision option) to another, based on a balance of expected rewards and time discounting.
- **Example Calculation:**
- An example sequence of deterministic rewards (0, 2, 0, 7.2, ...) is analyzed across different stopping times (T) to illustrate how the Gittins index is calculated. The process involves comparing the accumulated discounted rewards (∑γ^t Rt) to the discounted time (∑γ^t), leading to a ratio that represents the value of pulling the arm up to that point.
- The Gittins index is identified as the maximum value of this ratio, indicating the optimal stopping time. In the given example, the index is highest at T=4, suggesting the first four rewards should be collected before potentially switching to another option.
- **General Calculation Approach:**
- To calculate the Gittins index for any arm M in a state, an MDP augmentation approach is used. This involves considering the option to either continue with the current strategy or switch to receiving a guaranteed sequence of rewards, λ, indefinitely.
- A new MDP, termed a "restart MDP," models this decision by allowing a return to the initial state as an alternative to continuing with the current strategy. The optimal policy within this augmented MDP corresponds to the optimal stopping rule for the original problem.
- **Solving for the Gittins Index:**
- The Gittins index for an arm in state s is determined as 1−γ times the value of the optimal policy for the restart MDP M_s.
- Through value iteration or other MDP solving techniques, the value of the start state in the restart MDP provides the quantity needed to compute the Gittins index.
- **Key Takeaways:**
- The calculation of the Gittins index provides a quantitative basis for strategic decision-making in the face of uncertainty, allowing for the optimization of reward collection over time.
- This method highlights the nuanced trade-off between continuing with a known strategy and exploring new options for potentially greater long-term benefits.
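
The ratio calculation in the example above can be reproduced in a few lines. This is a minimal sketch that assumes a discount factor of 0.5 and that all rewards after 7.2 are zero; neither value is stated in the summary above, and the function name is hypothetical.

```python
def gittins_index(rewards, gamma):
    """Return (index, best_T) for a finite deterministic reward sequence.

    For each stopping time T, the ratio is
    sum_{t<T} gamma**t * R_t  /  sum_{t<T} gamma**t,
    and the index is the largest ratio over all T.
    """
    best_ratio, best_T = float("-inf"), None
    discounted_reward = 0.0
    discounted_time = 0.0
    for t, r in enumerate(rewards):
        discounted_reward += gamma ** t * r
        discounted_time += gamma ** t
        ratio = discounted_reward / discounted_time
        if ratio > best_ratio:
            best_ratio, best_T = ratio, t + 1
    return best_ratio, best_T


# Assumed continuation: zero rewards after the fourth pull
index, T = gittins_index([0, 2, 0, 7.2, 0, 0], gamma=0.5)
print(T, round(index, 4))
```

Under these assumptions the ratio peaks at the fourth pull, which matches the stopping time quoted in the example.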

### 17.3.2 The Bernoulli Bandit
- **Basics of the Bernoulli Bandit:**
- In the Bernoulli bandit problem, each arm $M_i$ yields a binary outcome (reward of 0 or 1) with a fixed, but initially unknown, success probability $\mu_i$.
- The state of an arm $M_i$ is described by two counts: $s_i$ (successes) and $f_i$ (failures), which are used to estimate the arm's current success probability as $\frac{s_i}{s_i + f_i}$. To avoid division by zero, both counts start at 1, setting the initial success probability at $0.5$.
- **Calculating the Gittins Index for Bernoulli Arms:**
- Direct application of the Gittins index calculation to Bernoulli arms is complicated by their potential to have infinitely many states due to the continuous update of success and failure counts.
- A practical approximation can be achieved by considering a truncated MDP with a state space limited to $s_i + f_i = 100$ and a discount factor $\gamma = 0.9$. This approximation yields results that balance the trade-off between exploiting arms with known high success rates and exploring less-tried arms for potential better outcomes.
- **Exploration Bonus:**
- The calculation reveals an "exploration bonus" that favors trying arms that have been sampled less frequently. This effect can lead to a situation where an arm with a lower estimated success probability but fewer trials (e.g., state (3,2)) has a higher Gittins index than an arm with a higher estimated success rate but more trials (e.g., state (7,4)). This illustrates the intrinsic value placed on gathering more information to reduce uncertainty.
- **Insights from the Bernoulli Bandit:**
- The Bernoulli bandit model captures the fundamental dilemma in decision-making under uncertainty: the balance between exploiting known options for immediate rewards and exploring unknown options for potentially greater long-term benefits.
- The approach to calculating the Gittins index for Bernoulli bandits provides a methodical way to quantify this trade-off, guiding decisions in a wide range of practical situations where outcomes are binary and probabilities are initially unknown.
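
The truncated restart-MDP calculation can be sketched as follows. This is one possible implementation under stated assumptions, not the chapter's own procedure: states with `s + f == horizon` are treated as if the success probability were known exactly from then on, and instead of generic value iteration the code exploits the layered structure of the counts (a backward pass for a guessed restart value, repeated until the guess stops changing). The function name and parameters are hypothetical.

```python
def bernoulli_gittins(s0, f0, gamma=0.9, horizon=100, tol=1e-9):
    """Approximate Gittins index of a Bernoulli arm in state (s0, f0).

    Assumes horizon > s0 + f0. States with s + f == horizon are treated as if
    the success probability s / horizon were known exactly from then on.
    """
    start = s0 + f0

    def continue_value(restart_value):
        """Value of pulling the arm in the start state, given the restart value."""
        # values[s] is the value of state (s, n - s) in the layer n = s + f.
        values = {s: max(s / horizon / (1 - gamma), restart_value)
                  for s in range(s0, horizon - f0 + 1)}
        for n in range(horizon - 1, start - 1, -1):
            q = {}
            for s in range(s0, n - f0 + 1):
                p = s / n  # estimated success probability in state (s, n - s)
                q[s] = p * (1 + gamma * values[s + 1]) + (1 - p) * gamma * values[s]
            if n == start:
                return q[s0]
            # In every other state the agent may continue or restart.
            values = {s: max(v, restart_value) for s, v in q.items()}

    restart_value = 0.0
    while True:
        new_value = continue_value(restart_value)
        if abs(new_value - restart_value) < tol:
            # Index = (1 - gamma) times the value of the restart MDP's start state
            return (1 - gamma) * new_value
        restart_value = new_value


index = bernoulli_gittins(3, 2)
print(f"estimated success probability: {3 / 5:.3f}, approximate index: {index:.4f}")
```

The index always comes out at least as large as the estimated success probability; the gap between the two is the exploration bonus discussed above.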


### 17.3.3 Approximately Optimal Bandit Policies

This section delves into strategies for managing bandit problems when calculating Gittins indices is impractical due to complexity. It highlights the effectiveness of simpler, nearly optimal policies that leverage the combination of estimated value and uncertainty. Two notable methods are discussed:
- **Upper Confidence Bound (UCB):**  
- The UCB approach establishes a confidence interval for each arm's value from the samples collected so far and selects the arm whose interval has the highest upper limit. That upper limit is the current mean estimate $\hat{\mu}_i$ plus a term that grows with the uncertainty of the estimate. This uncertainty term is inversely proportional to the square root of the number of times $N_i$ the arm has been sampled, scaled by a function $g(N)$ of the total number of samples $N$ from all arms. The UCB value for an arm $M_i$ is given by:
$$UCB(M_i) = \hat{\mu}_i + \frac{g(N)}{\sqrt{N_i}}$$
- The choice of $g(N)$ determines the regret relative to an optimal policy that always picks the best arm. By the Lai and Robbins theorem, regret cannot grow more slowly than logarithmically in $N$, and suitable choices of $g(N)$ achieve this minimal growth rate.
- **Thompson Sampling:**  
- An alternative strategy, Thompson sampling, selects an arm based on the probability that it is the optimal choice, using the current belief about each arm's value distribution. For each arm $M_i$, a value sample is drawn from its value distribution $P_i(\mu_i)$, and the arm with the highest sample is chosen.
- This method's efficacy is also reflected in its regret, which grows logarithmically with the total number of samples $N$, similar to UCB policies.
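
The UCB rule can be sketched for Bernoulli arms as follows. The choice $g(N) = \sqrt{2 \ln N}$ is the classic UCB1 form and is an assumption here, as are the function names, the simulated success probabilities, and the convention of trying every arm once before applying the formula.

```python
import math
import random


def ucb_choose(successes, pulls, total):
    """Index of the arm with the highest upper confidence bound.

    Arms that have never been pulled are tried first, so N_i > 0 in the formula.
    """
    for i, n in enumerate(pulls):
        if n == 0:
            return i
    g = math.sqrt(2 * math.log(total))  # assumed g(N) = sqrt(2 ln N)
    scores = [successes[i] / pulls[i] + g / math.sqrt(pulls[i])
              for i in range(len(pulls))]
    return max(range(len(pulls)), key=scores.__getitem__)


def run_ucb(true_probs, steps, seed=0):
    """Simulate UCB on Bernoulli arms and return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(true_probs)
    successes, pulls = [0] * k, [0] * k
    for t in range(steps):
        i = ucb_choose(successes, pulls, t)
        pulls[i] += 1
        successes[i] += rng.random() < true_probs[i]
    return pulls


print(run_ucb([0.2, 0.5, 0.8], steps=2000))
```

With equal mean estimates, the rarely sampled arm gets the larger bonus and is chosen, which is the exploration effect the formula is designed to produce.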

**Key Insights:**
- Both UCB and Thompson sampling methods apply the principle of balancing exploitation (using arms known to offer high rewards) with exploration (testing less certain arms for potentially higher rewards).
- The UCB method quantitatively balances this trade-off by adjusting the exploration factor based on the total experience, while Thompson sampling probabilistically balances exploration and exploitation by modeling the current uncertainty of each arm's true value.
- These strategies offer a practical and computationally feasible approach to managing bandit problems, providing performance close to the theoretically optimal but computationally intractable solutions like calculating the Gittins index for complex problems.
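
Thompson sampling for Bernoulli arms is short enough to sketch in full. The sketch assumes each arm's belief is a Beta distribution parameterized by the success and failure counts, both starting at 1 as in the Bernoulli bandit section; the function names and simulated probabilities are hypothetical.

```python
import random


def thompson_choose(s, f, rng):
    """Draw one sample per arm from Beta(s_i, f_i) and pick the largest."""
    samples = [rng.betavariate(s[i], f[i]) for i in range(len(s))]
    return max(range(len(s)), key=samples.__getitem__)


def run_thompson(true_probs, steps, seed=0):
    """Simulate Thompson sampling and return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(true_probs)
    s, f = [1] * k, [1] * k  # counts start at 1, giving a uniform prior
    pulls = [0] * k
    for _ in range(steps):
        i = thompson_choose(s, f, rng)
        pulls[i] += 1
        if rng.random() < true_probs[i]:
            s[i] += 1
        else:
            f[i] += 1
    return pulls


print(run_thompson([0.2, 0.5, 0.8], steps=2000))
```

No explicit exploration term appears: arms with few samples have wide distributions, so they occasionally produce the highest sample and get pulled.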

### 17.3.4 Non-indexable Variants

This section discusses scenarios where the classic bandit problem approach, which relies on index functions for decision-making, does not apply, presenting the concept of selection problems and bandit superprocesses (BSPs) as significant variations.
- **Selection Problems:**
- Selection problems arise when the objective is to choose the best option from a set and the cost of testing an option does not depend on the test's outcome. Unlike bandit problems, which balance exploring unknown options against exploiting known ones to maximize the rewards collected along the way, selection problems aim only at reaching a good final choice as quickly as possible. An example is choosing the most effective drug from a set of candidates based on trials.
- A key characteristic of selection problems is the absence of an index function, meaning that no simple numerical value can represent the desirability of each option. The introduction of additional options can change preferences in non-trivial ways, complicating the decision-making process.
- **Bandit Superprocess (BSP):**
- BSPs generalize bandit problems by allowing each option, or "arm," to be a full Markov Decision Process (MDP) instead of a simpler Markov Reward Process. This setup reflects real-world situations where decisions involve complex, interdependent processes, such as managing multiple projects or tasks simultaneously (multitasking).
- A common misconception is that the optimal policy for a BSP can be constructed from the optimal policies of its constituent MDPs. However, this is incorrect because the presence of multiple MDPs alters the balance between short-term and long-term rewards, potentially making locally suboptimal actions globally optimal.
- The globally optimal solution for a BSP might involve a more aggressive pursuit of short-term rewards in individual MDPs, due to the "opportunity cost" of not engaging with other profitable MDPs.
- **Solving BSPs:**
- Directly solving a BSP by considering it as a global MDP is impractical due to the exponential increase in state space. Instead, solutions involve understanding the "opportunity cost" of not acting on other MDPs and seeking policies that either dominate in terms of early rewards or effectively balance the exploration and exploitation across the BSP.
- By establishing bounds and utilizing look-ahead search, it is possible to identify optimal or near-optimal policies for BSPs, allowing for efficient decision-making even in complex, multitasking scenarios.

**Key Takeaways:**  
- **Selection Problems**  challenge the applicability of index-based solutions because only the quality and speed of the final choice matter, not the rewards collected while testing.
- **Bandit Superprocesses (BSPs)**  introduce a nuanced, multitasking-oriented perspective to decision-making in complex environments, where local suboptimality can contribute to global optimality.
- Efficiently solving BSPs requires innovative strategies that consider opportunity costs and the dynamic interplay between multiple decision processes, moving beyond the simple exploration-exploitation balance of classical bandit problems.

## 17.4 Partially Observable MDPs (POMDPs)

This section introduces the concept of Partially Observable Markov Decision Processes (POMDPs), which address scenarios where an agent cannot fully observe the environment, making decision-making significantly more complex compared to fully observable MDPs.
- **Background:**  In standard MDPs, the assumption is that the agent has complete visibility of its environment, allowing it to know exactly which state it's in. This full observability, coupled with the Markov property (the future is independent of the past, given the present), ensures that the optimal decision at any moment depends solely on the current state.
- **Challenges with Partial Observability:**  In contrast, POMDPs deal with situations where the agent has limited information about the state of the environment. This partial observability introduces uncertainty regarding the agent's exact state, complicating the task of selecting the best action:
- The agent can't directly apply an action decision rule, $\pi(s)$, because it isn't certain about its current state, $s$.
- The value of being in a particular state, and the optimal action to take when in that state, now depend on the agent's knowledge or beliefs about the state, rather than the state itself.
- **Increased Complexity:**  Due to these uncertainties, POMDPs are significantly more challenging to solve than fully observable MDPs. The complexity arises because the agent must maintain and update a belief state—a representation of its uncertainty about the environment's actual state—and determine its actions based on this belief state rather than direct state observations.
- **Relevance to the Real World:**  Despite their complexity, POMDPs are crucial for understanding and designing decision-making systems in real-world applications. The real world is inherently partially observable; agents often have incomplete information and must make decisions based on uncertain or indirect observations. Thus, POMDPs offer a more realistic framework for modeling and solving decision-making problems under uncertainty.

In summary, POMDPs extend the MDP framework to handle situations where the agent has limited information about the environment, introducing the concept of belief states to navigate the uncertainty. This makes POMDPs a powerful tool for modeling and solving complex decision-making problems in partially observable contexts.

### 17.4.1 Definition of POMDPs

Partially Observable Markov Decision Processes (POMDPs) extend MDPs to scenarios where the agent does not have full knowledge of its environment. A POMDP is defined by the same elements as an MDP (transition model, actions, and reward function) plus a sensor model that specifies the probability of perceiving certain evidence given the state of the environment. This makes POMDPs suitable for representing decision-making tasks in environments with uncertain or incomplete information.

Key Components of POMDPs:
- **Transition Model (P(s' | s, a))** : Describes the probability of transitioning from state s to state s' given action a.
- **Actions (A(s))** : The set of actions available to the agent.
- **Reward Function (R(s, a, s'))** : The reward received for transitioning from state s to state s' by action a.
- **Sensor Model (P(e | s))** : Specifies the likelihood of observing evidence e in state s, reflecting the partial observability of the environment.

In POMDPs, the concept of a **belief state**  is central. A belief state represents a probability distribution over all possible states, encapsulating the agent's current knowledge or belief about the actual state of the environment. The agent updates its belief state based on its actions and the observations (evidence) it receives, using a recursive filtering equation. This process essentially tracks how likely each state is given the history of actions and observations.

Optimal actions in a POMDP depend solely on the agent's current belief state rather than its actual state. The optimal policy, therefore, maps belief states to actions. This requires the agent to navigate a belief-state space, which is continuous and can be significantly more complex than dealing with discrete state spaces in traditional MDPs.

The approach to solving POMDPs involves:
1. **Determining the Optimal Action** : Given the current belief state, execute the action determined by the optimal policy for that belief state.
2. **Observing New Evidence** : Update the belief state based on the action taken and new evidence received.
3. **Repeating the Process** : Continue updating the belief state and choosing actions based on the updated belief state.

POMDPs integrate the value of information into the decision-making process, considering not only the physical effects of actions but also their informational effects. This makes POMDPs particularly powerful for modeling decision-making tasks in environments where information is incomplete or uncertain, such as robotics, automated navigation, and various real-world strategic situations.
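
The belief update mentioned above can be sketched as a prediction step through the transition model followed by weighting with the sensor model and normalizing. The two-state world below, with actions that succeed with probability 0.9 and a sensor that is right with probability 0.6, is a hypothetical example, and the dictionary layout and function name are assumptions.

```python
def update_belief(belief, action, evidence, transition, sensor):
    """Return b'(s') proportional to P(e | s') * sum_s P(s' | s, a) * b(s)."""
    unnormalized = {}
    for s2 in belief:
        # Prediction: probability of reaching s2 after the action
        predicted = sum(transition[(s, action)].get(s2, 0.0) * b
                        for s, b in belief.items())
        # Correction: weight by how well s2 explains the evidence
        unnormalized[s2] = sensor[s2].get(evidence, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence has zero probability under the current belief")
    return {s: v / total for s, v in unnormalized.items()}


# Hypothetical two-state model
TRANSITION = {
    ("A", "stay"): {"A": 0.9, "B": 0.1},
    ("A", "go"): {"B": 0.9, "A": 0.1},
    ("B", "stay"): {"B": 0.9, "A": 0.1},
    ("B", "go"): {"A": 0.9, "B": 0.1},
}
SENSOR = {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.4, "B": 0.6}}

belief = {"A": 0.5, "B": 0.5}
belief = update_belief(belief, "stay", "B", TRANSITION, SENSOR)
print(belief)
```

Repeating this call after every action and percept gives the act, observe, update cycle listed above.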

## 17.5 Algorithms for Solving POMDPs

Solving Partially Observable Markov Decision Processes (POMDPs) presents unique challenges due to their continuous and often high-dimensional belief-state spaces. While POMDPs can be reduced to MDPs in belief-state space, traditional dynamic programming algorithms designed for finite state spaces cannot be directly applied. Two notable approaches to tackling POMDPs are a value iteration algorithm tailored for POMDPs and an online decision-making algorithm.
### Value Iteration for POMDPs

This algorithm extends the concept of value iteration, a dynamic programming technique used for solving MDPs, to the domain of POMDPs. The key difference lies in operating over a continuous belief-state space rather than a discrete set of states. The algorithm iteratively updates the value of each belief state based on the expected rewards for taking various actions and then transitioning to new belief states according to the observation model. This process continues until the value function converges to a fixed point, representing the maximum expected utility of belief states under an optimal policy.

The primary challenge with value iteration in POMDPs is the representation and computation of the value function over a continuous space. Various approximation methods are employed to manage this complexity, such as using a finite set of representative belief states or employing function approximation techniques to generalize across the belief space.
### Online Decision-Making for POMDPs

Online decision-making algorithms focus on determining the best action to take from the current belief state without attempting to solve the entire POMDP upfront. These algorithms typically simulate future outcomes based on possible actions and observations to evaluate the expected utility of different decisions. One common approach is the use of Monte Carlo simulations to sample future belief states and use these samples to estimate the value of taking different actions.

Online decision-making is particularly useful in scenarios where solving the entire POMDP is computationally infeasible due to the dimensionality of the belief space or when the environment's dynamics are not fully known in advance. By focusing computation on the current decision point, these algorithms can provide near-optimal actions with significantly reduced computational requirements.

Both value iteration and online decision-making algorithms for POMDPs leverage the structure of the belief-state space to navigate the challenges of partial observability. While these methods cannot fully eliminate the computational complexity inherent in POMDPs, they offer practical approaches for a wide range of applications, from robotics and autonomous vehicles to strategic decision-making in uncertain environments.

### 17.5.1 Value Iteration for POMDPs

Value iteration for Partially Observable Markov Decision Processes (POMDPs) adapts the value iteration algorithm for the complex scenario where the agent does not have full knowledge of its state. The infinite belief states in POMDPs demand innovative approaches, and this algorithm involves considering conditional plans and their expected utilities across different belief states.
#### Key Concepts and Approach:
- **Conditional Plans and Belief States:**
- In POMDPs, the agent's policy generates actions based on its current belief state, a probability distribution over all possible states reflecting the agent's knowledge about the environment. This is analogous to conditional plans in nondeterministic and partially observable problems, where actions depend on subsequent perceptions.
- **Expected Utility of Conditional Plans:**
- The expected utility of executing a fixed conditional plan varies linearly with the belief state and can be represented by a hyperplane in the belief space. The utility function for a belief state under an optimal policy is the maximum of these hyperplane utilities, making it piecewise linear and convex.
- **Hyperplane Representation:**
- Each conditional plan corresponds to a hyperplane in the belief space, and the optimal policy at any belief state chooses the plan with the highest expected utility. The belief space is divided into regions, each associated with an optimal conditional plan.
- **Algorithm Process:**
- The algorithm recursively computes the utilities of conditional plans of increasing depth by considering each possible action, subsequent percept, and the utility of executing subsequent depth-1 plans. Dominated plans, which are suboptimal across the belief space, are identified and eliminated to manage the algorithm's complexity.
- **Example and Visualization:**
- An illustrative two-state example demonstrates how the belief space can be divided into regions, each indicating an optimal conditional plan. This example shows how the agent's lack of state knowledge leads to decision-making based on belief states rather than concrete states.
- **Complexity and Practicality:**
- Despite theoretical advancements, the practical application of value iteration for POMDPs is limited by its computational complexity. The number of conditional plans grows exponentially with the depth of consideration, making the algorithm inefficient for larger problems. Dominated plan elimination is crucial but not sufficient to overcome this challenge.
#### Conclusion:

Value iteration for POMDPs provides a conceptual framework for understanding optimal decision-making under uncertainty. It highlights the importance of belief states and the piecewise linear, convex nature of the utility function in such environments. However, the computational challenges associated with the approach limit its practicality for large-scale problems, prompting the development of more efficient algorithms and approximate methods for solving POMDPs.
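
The hyperplane view can be illustrated with a few lines of code. Each plan is represented by a utility vector giving its utility in every physical state; the expected utility of a plan at a belief state is the dot product of the two, and the utility of the belief state is the maximum over plans. The three plans and their numbers below are made up for illustration.

```python
def plan_utility(belief, alpha):
    """Expected utility of a fixed plan: linear in the belief state."""
    return sum(belief[s] * alpha[s] for s in belief)


def best_plan(belief, plans):
    """Return (plan name, utility) of the plan that is best at this belief."""
    name = max(plans, key=lambda p: plan_utility(belief, plans[p]))
    return name, plan_utility(belief, plans[name])


# Hypothetical utility vectors for three conditional plans in a two-state world
PLANS = {
    "stay": {"A": 0.0, "B": 1.0},
    "go": {"A": 1.0, "B": 0.0},
    "hedge": {"A": 0.4, "B": 0.4},
}

for p_b in (0.0, 0.3, 0.5, 0.7, 1.0):
    belief = {"A": 1.0 - p_b, "B": p_b}
    print(p_b, best_plan(belief, PLANS))
```

In this example the plan `hedge` is never the best choice at any belief even though no single other plan beats it in both states, so it can be discarded without changing the utility function; the remaining maximum of two lines is piecewise linear and convex.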

### POMDP Value Iteration Algorithm

To implement the POMDP-VALUE-ITERATION function in Python, we need to make some assumptions about how the POMDP model is defined and how utility vectors and plans are represented. The sketch below assumes a `pomdp` object exposing `S`, `actions()`, `percepts()`, `R(s)`, and `gamma`, plus two hypothetical methods: `P(s2, s, a)` for the transition model and `sensor(e, s)` for the sensor model. A plan is a nested tuple `(action, ((percept, subplan), ...))`, and its utility vector is a dict from states to utilities. The key parts include maintaining a set of plans with associated utility vectors, generating new plans from every action combined with one subplan per percept, and pruning dominated plans to improve efficiency.

```python
from itertools import product


def remove_dominated_plans(U_prime):
    """
    Remove plans whose utility vector is pointwise dominated by another plan's.

    Simplification: only state-by-state domination by a single plan is checked.
    A plan can also be dominated by a combination of other plans; detecting
    that requires a linear program and is not attempted here.
    """
    kept = {}
    for plan, alpha in U_prime.items():
        if any(all(beta[s] >= alpha[s] for s in alpha) for beta in kept.values()):
            continue  # alpha is never better than an already kept vector
        # Drop previously kept plans that the new vector dominates
        kept = {q: beta for q, beta in kept.items()
                if not all(alpha[s] >= beta[s] for s in alpha)}
        kept[plan] = alpha
    return kept


def max_difference(U, U_prime):
    """
    Approximate the largest change in utility between two plan sets.

    Simplification: the two utility functions are compared only at the corners
    of the belief simplex (beliefs that put probability 1 on a single state).
    """
    states = next(iter(U.values()))
    return max(
        abs(max(a[s] for a in U_prime.values()) - max(a[s] for a in U.values()))
        for s in states
    )


def pomdp_value_iteration(pomdp, epsilon):
    """
    Performs value iteration for a partially observable Markov decision process (POMDP).

    Args:
    - pomdp: A POMDP model with states S, actions(), percepts(), transition model
             P(s2, s, a), sensor model sensor(e, s), rewards R(s), and discount
             factor gamma. P and sensor are assumed method names.
    - epsilon: Maximum error allowed in the utility of any state.

    Returns:
    - U: A dict mapping each retained plan to its utility vector.
    """
    # Start with the empty plan; plans are tuples so they can serve as dict keys
    U_prime = {(): {s: pomdp.R(s) for s in pomdp.S}}
    percepts = list(pomdp.percepts())

    while True:
        U = U_prime
        U_prime = {}
        for action in pomdp.actions():
            # A new plan is an action plus one existing subplan for every percept
            for subplans in product(U, repeat=len(percepts)):
                alpha = {}
                for s in pomdp.S:
                    # Utility vector of the new plan (the role played by
                    # Equation 17.18), written for state-based rewards R(s)
                    future = sum(
                        pomdp.P(s2, s, action) * pomdp.sensor(e, s2) * U[sub][s2]
                        for s2 in pomdp.S
                        for e, sub in zip(percepts, subplans)
                    )
                    alpha[s] = pomdp.R(s) + pomdp.gamma * future
                U_prime[(action, tuple(zip(percepts, subplans)))] = alpha

        U_prime = remove_dominated_plans(U_prime)

        if max_difference(U, U_prime) <= epsilon * (1 - pomdp.gamma) / pomdp.gamma:
            break

    return U
```

This implementation follows the structure of the POMDP value iteration algorithm but simplifies two steps. `remove_dominated_plans` only detects plans that are dominated state by state by a single other plan, and `max_difference` compares the two utility functions only at the corners of the belief space rather than over all belief states. Because each iteration builds one candidate plan for every action and every assignment of existing plans to percepts, the sketch is practical only for very small models.

A fuller implementation would need mechanisms for:
- **Representing belief states**  and evaluating the utility vectors at arbitrary belief states within the Python program.
- **Identifying and removing all dominated plans** , including those dominated only by a combination of other plans, to maintain an efficient set of candidate policies.
- **Calculating the maximum difference**  between the utility functions of consecutive iterations over the whole belief space to determine convergence.


### 17.5.2 Online algorithms for POMDPs

Here we focus on making real-time decisions by selecting actions based on current belief states and updating these beliefs as new observations are made. This approach contrasts with pre-computing an entire policy, as is done in offline methods. Here's a summary of the key points:
1. **Basic Process** : The online POMDP algorithm iterates through a cycle where it starts with a prior belief state, deliberates to choose an action, performs the action, receives an observation, and updates the belief state based on this observation.
2. **Decision Making** : The deliberation process can use an expectimax algorithm adapted for POMDPs. In this context, decision nodes represent belief states, and chance nodes represent potential observations that lead to updated belief states. Transition probabilities between belief states are determined according to the formula given in Equation (17.17).
3. **Complexity and Approximation** : The complexity of exhaustive search in the belief space grows exponentially with the depth of search, depending on the number of actions (|A|) and possible observations (|E|). To manage complexity, sampling methods (like those used in MDPs) can be applied to reduce the branching factor in the decision tree without significantly compromising decision accuracy. This makes online decision-making in POMDPs feasible even for large state spaces.
4. **Approximate Filtering** : For large state spaces where exact belief state updates are computationally infeasible, approximate methods like particle filtering are used. Here, belief states are represented as collections of particles rather than precise probability distributions.
5. **Long-range Planning** : For problems requiring decisions over long horizons, long-range playouts, similar to those in the UCT algorithm, can be employed. The combination of particle filtering and UCT, known as Partially Observable Monte Carlo Planning (POMCP), allows for competent decision-making in large and complex POMDPs.
6. **Advantages** : POMDP-based agents can adeptly handle partially observable, stochastic environments. They can adapt to unexpected evidence, plan for information gathering, and exhibit graceful degradation under time pressure or in complex scenarios through approximation techniques.
7. **Limitations and Future Directions** : Despite their advantages, a significant challenge for real-world deployment of online POMDP agents is their performance over long time scales. Simple strategies may not achieve meaningful outcomes in complex tasks requiring many actions. Incorporating hierarchical planning ideas could offer a solution, but efficient and effective methods for applying these in stochastic, partially observable contexts are still under development.

In summary, online algorithms for POMDPs offer a promising approach for dealing with uncertainty and partial observability in decision-making processes, balancing the need for real-time action with the computational complexity of maintaining and updating belief states.
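
The approximate filtering step can be sketched as a single particle-filter update: propagate each particle through the transition model, weight it by the observation likelihood, and resample. The corridor world below, its 0.8 success probability, the 90% accurate wall detector, and all function names are hypothetical choices for illustration, not part of the chapter.

```python
import random
from collections import Counter

N_CELLS = 5  # hypothetical corridor: cells 0..4, with a wall at cell 4


def sample_transition(state, action, rng):
    """Sample s' from an assumed P(s' | s, a): 'right' succeeds with probability 0.8."""
    if action == "right" and rng.random() < 0.8:
        return min(state + 1, N_CELLS - 1)
    return state


def obs_likelihood(observation, state):
    """Assumed sensor model: a wall detector that is correct 90% of the time."""
    at_wall = state == N_CELLS - 1
    return 0.9 if observation == at_wall else 0.1


def particle_filter_update(particles, action, observation, rng):
    """One approximate belief update: propagate, weight, resample."""
    # Propagate every particle through the sampled transition model.
    moved = [sample_transition(s, action, rng) for s in particles]
    # Weight each particle by how well it explains the observation.
    weights = [obs_likelihood(observation, s) for s in moved]
    if sum(weights) == 0:
        # No particle explains the evidence; fall back to the propagated set.
        return moved
    # Resample with replacement in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(particles))


def belief_from_particles(particles):
    """Turn a particle set into an estimated probability for each state."""
    counts = Counter(particles)
    return {s: n / len(particles) for s, n in sorted(counts.items())}


if __name__ == "__main__":
    rng = random.Random(0)
    particles = [i % N_CELLS for i in range(5000)]  # uniform initial belief
    particles = particle_filter_update(particles, "right", True, rng)
    print(belief_from_particles(particles))
```

For this toy model the exact Bayesian update puts a probability of about 0.835 on the wall cell after moving right and sensing a wall, and the particle estimate should land close to that. The number of particles trades accuracy against computation time.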

## Chapter 17 Summary

This chapter provides a comprehensive exploration of decision-making in uncertain environments, focusing on Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs). Here's a summary of the key points covered:
- **MDPs and Their Components** : MDPs are a formal framework for sequential decision-making in environments where outcomes are uncertain. They are characterized by a transition model that describes the probabilistic outcomes of actions and a reward function that specifies the rewards for being in different states. The goal is to solve the MDP by finding an optimal policy that specifies the best action to take in every state, thereby maximizing the expected utility over time.
- **Utility and Optimal Policies** : The utility of a state sequence is defined as the sum of the rewards obtained over the sequence, with the option of discounting rewards received in the future. An optimal policy is one that maximizes the expected sum of rewards from any given state. The chapter introduces algorithms like value iteration and policy iteration for finding optimal policies by computing the utility of each state.
- **Value Iteration** : This algorithm repeatedly applies the Bellman update, recomputing each state's utility from the utilities of its successor states until the values converge. The optimal policy is then read off by choosing, in every state, the action with the highest expected utility.
- **Policy Iteration** : This method alternates between evaluating the current policy to determine the utility of each state under that policy and improving the policy based on those utilities. It often needs fewer iterations than value iteration, although each iteration includes a policy evaluation step and is therefore more expensive.
- **POMDPs** : Solving POMDPs, where the agent cannot observe the current state directly, is significantly more challenging. The chapter describes how POMDPs can be approached by transforming them into MDPs in the space of belief states, which are probability distributions over possible states. Optimal solutions in POMDPs involve not just making decisions based on the current belief but also gathering information to reduce uncertainty and make better future decisions.
- **Decision-Theoretic Agents** : The construction of agents capable of operating in POMDP environments involves using dynamic decision networks (DDNs) to represent transition and sensor models, update belief states, and project forward potential action sequences. This enables the agent to make informed decisions based on its current understanding of the world and its predictions of future states.
- **Future Directions** : The chapter hints at revisiting MDPs and POMDPs in the context of reinforcement learning, where agents learn optimal behaviors through experience rather than being provided with a model of the environment upfront. This approach allows for continuous improvement and adaptation to changing environments.
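
The two MDP algorithms summarized above can be compared on a tiny example. The sketch below is a minimal illustration: the two-state MDP, its rewards, and the helper names are made up, and policy evaluation is done by repeated sweeps rather than by solving the linear system exactly.

```python
# Hypothetical MDP: MDP[s][a] is a list of (probability, next_state, reward) triples.
MDP = {
    "A": {"stay": [(1.0, "A", 1.0)], "go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}


def q_value(mdp, U, s, a, gamma):
    """Expected utility of taking action a in state s, given utility estimates U."""
    return sum(p * (r + gamma * U[s2]) for p, s2, r in mdp[s][a])


def value_iteration(mdp, gamma, eps=1e-9):
    """Apply the Bellman update to every state until the largest change is below eps."""
    U = {s: 0.0 for s in mdp}
    while True:
        U_new = {s: max(q_value(mdp, U, s, a, gamma) for a in mdp[s]) for s in mdp}
        delta = max(abs(U_new[s] - U[s]) for s in mdp)
        U = U_new
        if delta < eps:
            return U


def greedy_policy(mdp, U, gamma):
    """Pick, in each state, the action with the highest expected utility under U."""
    return {s: max(mdp[s], key=lambda a: q_value(mdp, U, s, a, gamma)) for s in mdp}


def evaluate_policy(mdp, pi, gamma, eps=1e-9):
    """Iteratively compute the utility of each state when following policy pi."""
    U = {s: 0.0 for s in mdp}
    while True:
        U_new = {s: q_value(mdp, U, s, pi[s], gamma) for s in mdp}
        delta = max(abs(U_new[s] - U[s]) for s in mdp)
        U = U_new
        if delta < eps:
            return U


def policy_iteration(mdp, gamma, margin=1e-6):
    """Alternate evaluation and improvement until no action is clearly better."""
    pi = {s: next(iter(mdp[s])) for s in mdp}  # arbitrary initial policy
    while True:
        U = evaluate_policy(mdp, pi, gamma)
        changed = False
        for s in mdp:
            best = max(mdp[s], key=lambda a: q_value(mdp, U, s, a, gamma))
            # Switch only on a clear improvement, so ties cannot cause cycling.
            if q_value(mdp, U, s, best, gamma) > q_value(mdp, U, s, pi[s], gamma) + margin:
                pi[s] = best
                changed = True
        if not changed:
            return pi, U


if __name__ == "__main__":
    U = value_iteration(MDP, gamma=0.9)
    pi, U_pi = policy_iteration(MDP, gamma=0.9)
    print(greedy_policy(MDP, U, 0.9), pi)
```

With a discount factor of 0.9, both methods agree that the agent should leave A and then stay in B. Policy iteration reaches this policy after two evaluation rounds here, while value iteration needs a couple of hundred cheap sweeps to meet the same tolerance.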

In summary, this chapter lays the foundational concepts for understanding how agents can make rational decisions in environments with uncertain outcomes and varying rewards, setting the stage for further exploration into how agents can learn from their interactions with the world.

## Historical and Bibliographical Notes

This section summarizes the development of, and key contributions to, Markov Decision Processes (MDPs), Partially Observable Markov Decision Processes (POMDPs), and related areas. Here are the key points, including important dates and authors:

- **Richard Bellman (1957)** : Developed the ideas behind the modern approach to sequential decision problems while working at the RAND Corporation from 1949, coined the term “dynamic programming”, and laid out the foundations in his book *Dynamic Programming* (1957).
- **Shapley (1953)** : Independently described the value iteration algorithm.
- **Ron Howard (1960)** : Introduced policy iteration and the idea of average reward for solving infinite-horizon problems.
- **Bellman and Dreyfus (1962)** : Introduced several additional results to the field.
- **Denardo (1967)** : Used contraction mappings in dynamic programming algorithms' analysis.
- **van Nunen (1976) and Puterman and Shin (1978)** : Developed modified policy iteration.
- **Williams and Baird (1993)** : Analyzed asynchronous policy iteration.
- **Moore and Atkeson (1993), Andre et al. (1998), Wingate and Seppi (2005)** : Contributed to prioritized sweeping algorithms for MDPs.
- **de Ghellinck (1960), Manne (1960), D’Épenoux (1963)** : Formulated MDP-solving as a linear program.
- **de Farias and Van Roy (2003)** : Demonstrated the efficacy of linear programming for approximating solutions to large MDPs.
- **Papadimitriou and Tsitsiklis (1987), Littman et al. (1995)** : Provided results on MDPs' computational complexity.
- **Yinyu Ye (2011)** : Analyzed policy iteration's runtime, proving it polynomial for fixed γ.
- **Sutton (1988), Watkins (1989)** : Seminal work on reinforcement learning methods for solving MDPs.
- **Dean and Kanazawa (1989), Tatman and Shachter (1990)** : Proposed using dynamic decision networks for agent architecture.
- **Wellman (1990), Koenig (1991), Dean and Wellman (1991)** : Connected MDPs with AI planning.
- **Boutilier et al. (2000, 2001), Koller and Parr (2000), Guestrin et al. (2003)** : Advanced work on factored and relational MDPs.
- **Srivastava et al. (2014)** : Discussed open-universe MDPs and POMDPs.
- **Barto et al. (1995), Kearns et al. (2002), Kocsis and Szepesvári (2006)** : Developed and analyzed algorithms for real-time dynamic programming and online decision making.
- **Thompson (1933), Robbins (1952), Bradt et al. (1956), Gittins and Jones (1974), Gittins (1989)** : Contributed significantly to bandit problems and optimal policies.
- **Åström (1965), Aoki (1965)** : Noted for transforming partially observable MDPs into regular MDPs over belief states.
- **Edward Sondik (1971)** : Proposed the first complete algorithm for the exact solution of POMDPs.
- **Cassandra et al. (1994), Kaelbling et al. (1998), Hansen (1998)** : Made significant contributions within AI to POMDP solution methods.
- **Pineau et al. (2003), Spaan and Vlassis (2005), Shani et al. (2013)** : Developed point-based value iteration methods for POMDPs.
- **Silver and Veness (2011)** : Introduced the POMCP algorithm for solving POMDPs online.
- **Rafferty et al. (2016), Young et al. (2013), Hsiao et al. (2007), Huynh and Roy (2009), Forbes et al. (1995), Bai et al. (2015)** : Demonstrated the use of POMDP models in various real-world applications.

## Ideas for Further Research

This section lists a few open research questions and potential areas for further exploration in the fields of Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs). Here are some of the key areas and questions:

- **Reinforcement Learning and MDPs** : Investigate the application of reinforcement learning methods to MDPs, particularly in the context of large state spaces and continuous action spaces.
- **Factored and Relational MDPs** : Explore the use of factored and relational MDPs to model complex decision-making problems, such as those involving multiple agents or complex interactions between entities.
- **Open-Universe MDPs and POMDPs** : Investigate the use of open-universe MDPs and POMDPs to model dynamic environments with evolving state spaces and uncertain dynamics.

## MDP vs POMDP

- Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs) are both frameworks used for modeling decision-making in environments where outcomes are partly random and partly under the control of a decision-maker. However, they differ significantly in terms of the observability of the environment's state. Below is a comparison of MDPs and POMDPs across various dimensions:
### State Observability
- **MDPs** : Assume that the agent can observe the environment's state fully and accurately at all times. The current state provides all the information needed for decision-making.
- **POMDPs** : Assume that the agent cannot directly observe the environment's true state. Instead, the agent receives observations that provide partial information about the state. Decision-making must account for this uncertainty.
### Components
- **MDPs** : Defined by a set of states (S), a set of actions (A), a transition model (P(s'|s,a)), and a reward function (R(s,a,s')).
- **POMDPs** : Extend MDPs by adding an observation space (O) and an observation model (Z(o|s',a)), which corresponds to the sensor model P(e|s) in AIMA's notation. This accounts for the uncertainty in perceiving the true state of the environment.
### Decision Process
- **MDPs** : The decision process involves selecting actions based on the current state to maximize the cumulative reward. Policies map states to actions.
- **POMDPs** : The decision process involves selecting actions based on a belief state (a probability distribution over states) since the agent cannot observe the true state. Policies map belief states to actions.
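
Using the symbols from the Components list above, the belief-state update that a POMDP policy relies on can be written as follows. This is a sketch in that notation, and the numbers in the worked instance are made up for illustration.

$$
b'(s') = \alpha \, Z(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s),
\qquad
\alpha = \frac{1}{\sum_{s'} Z(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s)}
$$

For example, suppose there are two states with $b(s_1) = b(s_2) = 0.5$, an action that leaves the state unchanged, and an observation $o$ with $Z(o \mid s_1, a) = 0.85$ and $Z(o \mid s_2, a) = 0.15$. Then

$$
b'(s_1) = \frac{0.85 \cdot 0.5}{0.85 \cdot 0.5 + 0.15 \cdot 0.5} = 0.85,
\qquad
b'(s_2) = 0.15 .
$$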
### Complexity
- **MDPs** : Finite MDPs can be solved exactly in time polynomial in the number of states and actions (for example via linear programming), so they are tractable for many problems, although the state space itself can be very large.
- **POMDPs** : Solving a POMDP is generally far more computationally demanding than solving an MDP because the belief space is continuous and every step requires a belief update. Finding optimal policies for POMDPs is PSPACE-hard.
### Solution Methods
- **MDPs** : Common solution methods include Value Iteration, Policy Iteration, and Linear Programming.
- **POMDPs** : Solution methods include Value Iteration for POMDPs, Point-Based Value Iteration (PBVI), and online planning methods like POMCP (Partially Observable Monte Carlo Planning). Approximate methods are often necessary due to the high computational complexity.
### Applications
- **MDPs** : Suitable for decision-making problems where the state of the environment can be fully observed, such as navigation and inventory management.
- **POMDPs** : Applicable to scenarios where the agent must make decisions under uncertainty about the state, such as robot navigation in unknown environments, medical diagnosis, and strategic planning under incomplete information.
### Summary

While both MDPs and POMDPs are frameworks for decision-making under uncertainty, the key distinction lies in the observability of the environment's state. MDPs assume full observability, simplifying the decision-making process to state-based policies. In contrast, POMDPs deal with partial observability, requiring policies based on belief states and making the solution process more complex. Despite the increased complexity, POMDPs offer a more realistic model for many real-world problems where state uncertainty cannot be ignored.