***Pattern Description: Missing Recursive Bellman Backup in POMDP-like Problems***

Overview

This pattern arises in LLM-generated code for decision-making problems modeled as Partially Observable Markov Decision Processes (POMDPs) or similar belief-state frameworks. The core issue is that the LLM fails to implement the recursive Bellman backup operation over belief states, which is essential for correct value iteration or policy computation in these domains.

What is the Recursive Bellman Backup?
The Bellman backup is a fundamental dynamic programming step that computes the value function for a belief state by:

    Considering all possible actions,

    For each action, considering all possible observations,

    Recursively backing up the value of the next belief state after the action-observation pair,

    Weighting by probabilities to compute expected future value,

    Adding immediate expected rewards,

    And taking the maximum over actions.

This process is inherently recursive and expectation-based over the continuous belief space, which must be discretized or approximated for computation.
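
Written out, the steps above correspond to the standard belief-space Bellman equation. The notation is an assumption, since the text does not fix symbols: b is the belief, T and O are the transition and observation models, R is the reward, \gamma is the discount factor, and b^{a,o} is the belief after taking action a and observing o.

$$
V(b) = \max_{a} \Big[ \sum_{s} b(s)\, R(s,a) \;+\; \gamma \sum_{o} P(o \mid b, a)\, V\big(b^{a,o}\big) \Big]
$$

$$
P(o \mid b, a) = \sum_{s'} O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s),
\qquad
b^{a,o}(s') = \frac{O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}{P(o \mid b, a)}
$$

Each observation contributes exactly one term, P(o | b, a) times the value of the updated belief. The problems below differ mainly in how that one term goes wrong.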

Pattern Characteristics
1. Superficial Correctness:
The generated code often includes:

State, action, observation, and reward definitions,

Functions for belief updates,

Loops iterating over discretized beliefs and actions.

2. Missing Core Recursion:
However, the code:

Does not recursively back up future values by iterating over possible observations and updating beliefs accordingly,

Instead, it often applies a single-step or simplified update without expectation over observations,

Or incorrectly computes expected value without propagating the value function to next belief states.

3. Incorrect or Incomplete Value Iteration:

Value updates lack the nested summations over observations and next beliefs,

Policies derived from such code are suboptimal or invalid,

Resulting test cases fail to reflect correct POMDP behavior.
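
For contrast with the characteristics above, here is a minimal sketch of a single backup that does contain the nested loop over actions and observations. The function names (predict, backup), the value_of callback, and the two-state toy model are hypothetical and exist only for illustration; they are not taken from any of the problems below.

```python
def predict(b, T_a):
    """Push belief b through one action's transition rows T_a[s][s']."""
    n = len(b)
    return [sum(b[s] * T_a[s][sp] for s in range(n)) for sp in range(n)]


def backup(b, T, O, R, gamma, value_of):
    """One Bellman backup at belief b.

    T[a][s][s'], O[a][s'][o], R[s][a]; value_of(belief) returns the
    current value estimate (grid lookup, interpolation, ...).
    """
    n = len(b)
    best_val, best_act = float("-inf"), None
    for a in range(len(T)):
        q = sum(b[s] * R[s][a] for s in range(n))      # immediate expected reward
        pred = predict(b, T[a])
        for o in range(len(O[a][0])):
            joint = [pred[sp] * O[a][sp][o] for sp in range(n)]
            p_o = sum(joint)                           # P(o | b, a)
            if p_o <= 0.0:
                continue                               # impossible observation
            b_next = tuple(x / p_o for x in joint)     # Bayes update
            q += gamma * p_o * value_of(b_next)        # one weighted term per o
        if q > best_val:
            best_val, best_act = q, a
    return best_val, best_act


# Toy two-state, two-action, two-observation model (invented numbers).
T = [[[0.9, 0.1], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]]
O = [[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.5, 0.5]]]
R = [[1.0, -1.0], [-2.0, 0.5]]
b = (0.5, 0.5)

print(backup(b, T, O, R, 0.9, lambda nb: 0.0))
```

A quick sanity check for any implementation of this loop: with a constant value estimate c, the result must equal the best immediate reward plus gamma * c, because the observation probabilities sum to one.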

Problem 1:

Problem Description
This problem models a simplified medical diagnosis scenario using a POMDP framework. The system:

Maintains a belief state, representing the probability distribution over three possible hidden health states:
['Healthy', 'Sick', 'Critical'].

Allows three possible actions:
['Wait', 'Test', 'Treat'].

Receives noisy observations about the health state, such as:
['Positive', 'Negative', 'Uncertain'].

Has transition probabilities, observation models, and a reward structure based on medical outcomes and actions.

The objective is to implement a value iteration algorithm over discretized belief states to compute:

✅ The optimal value function for each belief.
✅ The corresponding optimal policy, mapping beliefs to actions.

The correct solution requires a recursive Bellman backup over belief states, considering all possible actions and observations, updating beliefs accordingly, and propagating future value estimates.

Prompt:

Task:
Implement value iteration for a simple POMDP modeling medical decision-making.
There are 3 health states: 'Healthy', 'Sick', 'Critical'.
The agent can 'Wait', 'Test', or 'Treat'.
Observations can be 'Positive', 'Negative', 'Uncertain', based on a noisy observation model.

Requirements:

Define transition probabilities and observation models.

Implement belief updates based on actions and observations.

Discretize the belief space.

Run value iteration to compute the optimal policy and value function over beliefs.

The code should:

Track value estimates for each belief.

Update beliefs based on observations.

Iterate to convergence.

Generate an example at the end.

In [4]:
import numpy as np
import itertools

# State, Action, Observation spaces
states = ['Healthy', 'Sick', 'Critical']
actions = ['Wait', 'Test', 'Treat']
observations = ['Positive', 'Negative', 'Uncertain']

num_states = len(states)

# Transition model: T[s][a][s']
T = {
    'Healthy': {
        'Wait': {'Healthy': 0.9, 'Sick': 0.1, 'Critical': 0.0},
        'Test': {'Healthy': 1.0},
        'Treat': {'Healthy': 1.0}
    },
    'Sick': {
        'Wait': {'Sick': 0.6, 'Critical': 0.3, 'Healthy': 0.1},
        'Test': {'Sick': 1.0},
        'Treat': {'Healthy': 0.6, 'Sick': 0.3, 'Critical': 0.1}
    },
    'Critical': {
        'Wait': {'Critical': 0.9, 'Sick': 0.1},
        'Test': {'Critical': 1.0},
        'Treat': {'Sick': 0.2, 'Critical': 0.8}
    }
}

# Observation model: O[a][s'][o]
O = {
    'Test': {
        'Healthy': {'Negative': 0.9, 'Positive': 0.05, 'Uncertain': 0.05},
        'Sick':    {'Negative': 0.2, 'Positive': 0.7,  'Uncertain': 0.1},
        'Critical':{'Negative': 0.05, 'Positive': 0.9, 'Uncertain': 0.05},
    },
    'Wait': {
        s: {'Uncertain': 1.0} for s in states
    },
    'Treat': {
        s: {'Uncertain': 1.0} for s in states
    }
}

# Rewards: R[s][a]
R = {
    'Healthy':  {'Wait': 1, 'Test': -1, 'Treat': -2},
    'Sick':     {'Wait': -1, 'Test': -1, 'Treat': 2},
    'Critical': {'Wait': -3, 'Test': -2, 'Treat': 1}
}

# Discretize belief space
def generate_beliefs(num_points=10):
    grid = np.linspace(0, 1, num_points)
    beliefs = [b for b in itertools.product(grid, repeat=num_states)
               if np.isclose(sum(b), 1)]
    return [np.array(b) for b in beliefs]

# Belief update
def update_belief(b, a, o):
    new_b = np.zeros(num_states)
    for i, s_prime in enumerate(states):
        total = 0
        for j, s in enumerate(states):
            trans = T[s][a].get(s_prime, 0)
            total += b[j] * trans
        obs_prob = O[a][s_prime].get(o, 0)
        new_b[i] = total * obs_prob
    new_b /= np.sum(new_b) + 1e-8
    return new_b

# Value iteration
def value_iteration(beliefs, gamma=0.95, max_iter=100, tol=1e-3):
    V = {tuple(b): 0 for b in beliefs}
    policy = {}

    for _ in range(max_iter):
        new_V = {}
        delta = 0

        for b in beliefs:
            best_val = -np.inf
            best_act = None

            for a in actions:
                val = 0
                for s_idx, s in enumerate(states):
                    val += b[s_idx] * R[s][a]

                exp_future_val = 0
                for o in observations:
                    b_next = update_belief(b, a, o)
                    b_next_key = tuple(np.round(b_next, 3))
                    if b_next_key in V:
                        exp_future_val += np.sum([
                            b[s_idx] * T[states[s_idx]][a].get(states[i], 0) *
                            O[a][states[i]].get(o, 0) * V[b_next_key]
                            for i in range(num_states)
                            for s_idx in range(num_states)
                        ])

                val += gamma * exp_future_val

                if val > best_val:
                    best_val = val
                    best_act = a

            new_V[tuple(b)] = best_val
            policy[tuple(b)] = best_act
            delta = max(delta, abs(V[tuple(b)] - best_val))

        V = new_V
        if delta < tol:
            break

    return V, policy

# 🔁 Run everything
beliefs = generate_beliefs()
V, policy = value_iteration(beliefs)

# Sample output
for b in beliefs[:10]:
    print(f"Belief: {np.round(b,2)}, Action: {policy[tuple(b)]}, Value: {round(V[tuple(b)],2)}")


Belief: [0. 0. 1.], Action: Treat, Value: 1.0
Belief: [0.   0.11 0.89], Action: Treat, Value: 1.11
Belief: [0.   0.22 0.78], Action: Treat, Value: 1.22
Belief: [0.   0.33 0.67], Action: Treat, Value: 1.33
Belief: [0.   0.44 0.56], Action: Treat, Value: 1.44
Belief: [0.   0.56 0.44], Action: Treat, Value: 1.56
Belief: [0.   0.67 0.33], Action: Treat, Value: 1.67
Belief: [0.   0.78 0.22], Action: Treat, Value: 1.78
Belief: [0.   0.89 0.11], Action: Treat, Value: 1.89
Belief: [0. 1. 0.], Action: Treat, Value: 2.0


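The printed output exposes the failure. Every value equals the immediate expected reward of 'Treat' at that belief: 1.0 at [0, 0, 1] is exactly R['Critical']['Treat'], and 2.0 at [0, 1, 0] is exactly R['Sick']['Treat']. With gamma = 0.95 and up to 100 iterations, a working backup would have added discounted future reward on top of these numbers, so no future value is being propagated at all.

The double sum inside exp_future_val is not the cause. V[b_next_key] is constant inside that sum, so the expression reduces to P(o | b, a) * V[b_next_key], which is a valid, if roundabout, observation weighting.

The defect is the guard "if b_next_key in V". The updated belief is rounded to three decimals and used as an exact dictionary key, but the grid from generate_beliefs consists of multiples of 1/9, so the rounded belief matches a key only at the corners of the simplex. Whenever the lookup misses, that observation branch is silently skipped, and the backup collapses to a one-step reward maximization.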
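
A minimal sketch of how the backup for this problem could weight each successor belief by P(o | b, a) once and map it to the nearest grid point instead of requiring an exact key match. The model numbers are copied from the cell above; the helper names (make_grid, nearest, successors), the integer-lattice grid, and nearest-point projection are assumptions chosen for brevity, and interpolation between grid points would be a more accurate alternative.

```python
ACTIONS = ["Wait", "Test", "Treat"]
# T[a][s][s'] with states ordered (Healthy, Sick, Critical)
T = [
    [[0.9, 0.1, 0.0], [0.1, 0.6, 0.3], [0.0, 0.1, 0.9]],   # Wait
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],   # Test
    [[1.0, 0.0, 0.0], [0.6, 0.3, 0.1], [0.0, 0.2, 0.8]],   # Treat
]
# O[a][s'][o] with observations ordered (Positive, Negative, Uncertain)
BLIND = [[0.0, 0.0, 1.0]] * 3
O = [BLIND, [[0.05, 0.9, 0.05], [0.7, 0.2, 0.1], [0.9, 0.05, 0.05]], BLIND]
R = [[1, -1, -2], [-1, -1, 2], [-3, -2, 1]]                # R[s][a]


def make_grid(n=9):
    """All beliefs (i/n, j/n, k/n) with i + j + k = n."""
    return [(i / n, j / n, (n - i - j) / n)
            for i in range(n + 1) for j in range(n + 1 - i)]


def nearest(b, grid):
    """Index of the grid belief closest to b (squared Euclidean distance)."""
    return min(range(len(grid)),
               key=lambda k: sum((x - y) ** 2 for x, y in zip(b, grid[k])))


def successors(b, a, grid):
    """(P(o | b, a), nearest grid index) for every possible observation."""
    pred = [sum(b[s] * T[a][s][sp] for s in range(3)) for sp in range(3)]
    out = []
    for o in range(3):
        joint = [pred[sp] * O[a][sp][o] for sp in range(3)]
        p_o = sum(joint)
        if p_o > 0.0:
            out.append((p_o, nearest([x / p_o for x in joint], grid)))
    return out


def value_iteration(grid, gamma=0.95, max_iter=500, tol=1e-3):
    # Successor beliefs do not depend on V, so compute them once.
    succ = [[successors(b, a, grid) for a in range(3)] for b in grid]
    V, policy = [0.0] * len(grid), [0] * len(grid)
    for _ in range(max_iter):
        new_V = []
        for i, b in enumerate(grid):
            q = [sum(b[s] * R[s][a] for s in range(3))
                 + gamma * sum(p * V[j] for p, j in succ[i][a])
                 for a in range(3)]
            policy[i] = max(range(3), key=q.__getitem__)
            new_V.append(q[policy[i]])
        delta = max(abs(x - y) for x, y in zip(V, new_V))
        V = new_V
        if delta < tol:
            break
    return V, policy


grid = make_grid()
V, policy = value_iteration(grid)
i = grid.index((0.0, 0.0, 1.0))
print(ACTIONS[policy[i]], round(V[i], 2))
```

Because every successor now resolves to some grid point, the future term can no longer vanish silently; the price is a projection error that shrinks as the grid is refined.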

Problem 2:

Prompt:
Model an inventory management POMDP with partial observations:

- Inventory level: 0 to 5 units.
- Actions: order 0, 1, or 2 units.
- Demand is unknown, either High or Low demand with probabilities.
- Only partial observation of demand via sales data (censored).
- Reward is profit from sales minus ordering cost.
- Belief state is distribution over demand levels.
- Implement belief update from sales observations.
- Implement recursive value iteration for policy optimization with gamma=0.95.
- Run value iteration for 20 iterations.

Include:
- State, action, observation definitions.
- Transition, observation probabilities.
- Belief update.
- Recursive value iteration.

In [None]:
import numpy as np
from scipy.stats import poisson
from collections import defaultdict
import itertools

# Parameters
inventory_levels = range(6)
actions = [0, 1, 2]
observations = range(6)
demand_types = ['Low', 'High']
demand_probs = {'Low': 0.6, 'High': 0.4}
demand_dists = {'Low': poisson(1), 'High': poisson(3)}
gamma = 0.95
reward_per_unit = 10
cost_per_unit = 2

# Belief grid (discretized)
belief_grid = np.linspace(0, 1, 11)  # P(Low), so P(High) = 1 - p

def expected_demand(type_):
    return demand_dists[type_].mean()

def transition(s, a, demand):
    return max(0, min(s + a - demand, 5))

def observation_prob(obs, s, a, demand_type):
    inv = s + a
    return sum(
        demand_dists[demand_type].pmf(d) 
        for d in range(20) if min(d, inv) == obs
    )

def belief_update(belief, obs, s, a):
    inv = s + a
    numerators = {}
    for d_type in demand_types:
        prob = 0
        for d in range(20):
            if min(d, inv) == obs:
                prob += demand_dists[d_type].pmf(d)
        numerators[d_type] = prob * belief[d_type]
    total = sum(numerators.values())
    if total == 0:
        return belief  # No update possible
    return {k: v / total for k, v in numerators.items()}

def expected_reward(s, a, belief):
    inv = s + a
    reward = 0
    for d_type in demand_types:
        prob = belief[d_type]
        for d in range(20):
            p_d = demand_dists[d_type].pmf(d)
            sold = min(d, inv)
            reward += prob * p_d * (sold * reward_per_unit - a * cost_per_unit)
    return reward

# Initialize value function
V = defaultdict(lambda: np.zeros(len(inventory_levels)))
policy = defaultdict(lambda: np.zeros(len(inventory_levels), dtype=int))

# Value Iteration
for it in range(20):
    new_V = defaultdict(lambda: np.zeros(len(inventory_levels)))
    for b_p in belief_grid:
        belief = {'Low': b_p, 'High': 1 - b_p}
        for s in inventory_levels:
            best_val = -np.inf
            best_act = 0
            for a in actions:
                inv = s + a
                exp_val = expected_reward(s, a, belief)
                for obs in observations:
                    # Belief update
                    b_prime = belief_update(belief, obs, s, a)
                    b_prime_key = round(b_prime['Low'], 1)
                    # Transition
                    val = 0
                    for d_type in demand_types:
                        for d in range(20):
                            if min(d, inv) == obs:
                                p_d = demand_dists[d_type].pmf(d)
                                s_prime = transition(s, a, d)
                                val += belief[d_type] * p_d * V[b_prime_key][s_prime]
                    exp_val += gamma * val
                if exp_val > best_val:
                    best_val = exp_val
                    best_act = a
            new_V[round(b_p, 1)][s] = best_val
            policy[round(b_p, 1)][s] = best_act
    V = new_V

# Display optimal policy
print("Optimal Policy (belief in Low demand vs inventory level):")
for b in belief_grid:
    b_key = round(b, 1)
    print(f"Belief Low={b_key:.1f}: ", policy[b_key])


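Summary:
The code comes closer to a Bellman backup than Problem 1, but it never computes the observation probability P(obs | b, s, a) explicitly. The inner sum over d_type and d accumulates the joint terms belief[d_type] * p_d, which act as implicit observation weights only as long as every possible sales outcome is enumerated.

That condition fails here. observations = range(6) stops at 5, while s + a can reach 7, so sales of 6 or 7 units are rewarded in expected_reward but never backed up, and their share of the future value is silently dropped. The successor belief is also snapped to the nearest 0.1 grid point with round(), without interpolation, which adds discretization error on every backup.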
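
One way to make the observation weighting explicit for this problem is sketched below. It is an illustration under stated assumptions, not the original author's method: the Poisson rates and prices are copied from the cell above, the names pmf, sales_prob, q_value and the value_of callback are hypothetical, and the observation range is taken to be 0..s+a so that no sales outcome is left out.

```python
import math

RATES = {"Low": 1.0, "High": 3.0}          # Poisson demand rates from the cell above
PRICE, COST, GAMMA, CAPACITY = 10, 2, 0.95, 5


def pmf(d, rate):
    """Poisson probability mass at d."""
    return math.exp(-rate) * rate ** d / math.factorial(d)


def sales_prob(obs, inv, rate):
    """P(sales = obs | stock inv): exact demand if obs < inv,
    censored tail P(demand >= inv) if obs == inv."""
    if obs < inv:
        return pmf(obs, rate)
    return 1.0 - sum(pmf(d, rate) for d in range(inv))


def q_value(b_low, s, a, value_of):
    """Q(b, s, a) with one explicitly weighted term per sales observation.

    value_of(b_low_next, s_next) returns the current value estimate.
    """
    belief = {"Low": b_low, "High": 1.0 - b_low}
    inv = s + a
    q = -a * COST                                   # ordering cost
    for obs in range(inv + 1):                      # every possible sales count
        like = {t: sales_prob(obs, inv, RATES[t]) for t in RATES}
        p_obs = sum(belief[t] * like[t] for t in RATES)   # P(obs | b, s, a)
        if p_obs <= 0.0:
            continue
        b_next = belief["Low"] * like["Low"] / p_obs      # Bayes update
        s_next = min(inv - obs, CAPACITY)                 # leftover stock, capped
        q += p_obs * (obs * PRICE + GAMMA * value_of(b_next, s_next))
    return q


print(q_value(0.6, 2, 1, lambda b, s: 0.0))
```

Since the weights p_obs sum to one over range(inv + 1), replacing value_of with a constant c shifts q_value by exactly GAMMA * c, which is a convenient check that no probability mass is lost.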

Problem 3:

Prompt:
Implement a model for sensor maintenance with noisy observations.

Requirements:

- **Hidden States**:
    - 0 = Failed
    - 1 = Degraded
    - 2 = Healthy

- **Actions**:
    - 0 = Do nothing
    - 1 = Send maintenance
    - 2 = Replace sensor (restores to Healthy)

- **Observations**:
    - "Bad", "Uncertain", "Good"

- **Transition Model (P(s' | s, a))**:
    - If action = 0 (Do nothing):
        - Healthy → Degraded (10% chance)
        - Degraded → Failed (20% chance)
        - Failed remains Failed
    - If action = 1 (Maintenance):
        - Degraded → Healthy (50% chance), otherwise unchanged
        - Failed → Degraded (30% chance), otherwise unchanged
    - If action = 2 (Replace):
        - Always transitions to Healthy

- **Observation Model (P(o | s))**:
    - If state = Failed:
        - "Bad" = 0.9, "Uncertain" = 0.1, "Good" = 0.0
    - If state = Degraded:
        - "Bad" = 0.3, "Uncertain" = 0.5, "Good" = 0.2
    - If state = Healthy:
        - "Bad" = 0.0, "Uncertain" = 0.2, "Good" = 0.8

- **Rewards**:
    - Large penalty for being in Failed state
    - Cost for maintenance and replacement
    - Bonus for Healthy operation

- **Belief Update**:
    - Maintain belief vector over hidden states
    - After each observation, update belief using Bayes' rule

- **Value Iteration**:
    - For each belief state and action:
            - Expectation over hidden states and observations
            - Update belief based on noisy observation
            - Recursively compute future value with discount factor (e.g., 0.9)
    - Run for fixed number of iterations (e.g., 20)
    - Store optimal policy mapping from belief to action


Goal:
- Find optimal sensor maintenance policy under partial observability and observation noise


In [9]:
import numpy as np
from itertools import product

# 📌 Model Definitions
states = ['Failed', 'Degraded', 'Healthy']
actions = ['DoNothing', 'Maintenance', 'Replace']
observations = ['Bad', 'Uncertain', 'Good']
discount = 0.9

# Transition model P(s' | s, a)
T = {
    0: np.array([  # Do nothing
        [1.0,   0.0,   0.0],
        [0.2,   0.8,   0.0],
        [0.0,   0.1,   0.9]
    ]),
    1: np.array([  # Maintenance
        [0.7,   0.3,   0.0],
        [0.0,   0.5,   0.5],
        [0.0,   0.0,   1.0]
    ]),
    2: np.array([  # Replace
        [0.0,   0.0,   1.0],
        [0.0,   0.0,   1.0],
        [0.0,   0.0,   1.0]
    ])
}

# Observation model P(o | s)
O = np.array([
    [0.9, 0.1, 0.0],  # "Bad" for Failed, Degraded, Healthy
    [0.3, 0.5, 0.2],  # "Uncertain"
    [0.0, 0.2, 0.8]   # "Good"
])

# Reward matrix R[s, a]
R = np.array([
    [-100, -10, -50],  # Failed
    [ -20, -10, -50],  # Degraded
    [ +10, -10, -50]   # Healthy
])

# 🔁 Belief Update Function
def update_belief(belief, action, observation):
    new_belief = np.zeros(3)
    for s_prime in range(3):
        obs_prob = O[observation, s_prime]
        trans_prob = sum(T[action][s_prime, s] * belief[s] for s in range(3))
        new_belief[s_prime] = obs_prob * trans_prob
    return new_belief / np.sum(new_belief)

# 💰 Value Iteration
def value_iteration(num_iterations=20, resolution=11):
    belief_points = [
        np.array([b1, b2, 1.0 - b1 - b2])
        for b1 in np.linspace(0, 1, resolution)
        for b2 in np.linspace(0, 1 - b1, resolution)
    ]

    V = {tuple(b): 0 for b in belief_points}
    policy = {}

    for _ in range(num_iterations):
        new_V = {}
        for b in belief_points:
            b = np.array(b)
            best_val = -np.inf
            best_act = None

            for a in range(3):  # All actions
                reward = sum(b[s] * R[s, a] for s in range(3))
                future = 0
                for o in range(3):
                    b_new = update_belief(b, a, o)
                    b_key = tuple(np.round(b_new, 4))
                    future += np.dot(O[o], T[a] @ b) * V.get(b_key, 0)
                value = reward + discount * future

                if value > best_val:
                    best_val = value
                    best_act = a

            new_V[tuple(b)] = best_val
            policy[tuple(b)] = best_act
        V = new_V
    return policy

# 🏁 Example Run
if __name__ == "__main__":
    belief = np.array([0.0, 0.2, 0.8])  # Most likely Healthy
    policy = value_iteration()
    action_index = policy[tuple(np.round(belief, 4))]
    print(f"Recommended action: {actions[action_index]}")


Recommended action: DoNothing


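The expected future value is incorrect because the observation probability is miscalculated. Three separate problems contribute:

O is laid out with one row per state (each row sums to 1 and matches P(o | Failed), P(o | Degraded), P(o | Healthy) from the prompt), but it is commented and indexed as O[observation, state], so every lookup reads the transposed entry.

T[a] stores P(s' | s, a) with rows indexed by the current state s, yet update_belief reads T[action][s_prime, s] and the backup uses T[a] @ b. Both propagate the belief through the transpose of the intended matrix; the predicted belief should be b @ T[a].

update_belief divides by np.sum(new_belief) with no guard, so an observation with zero probability produces 0/0 (NaN), and V.get(b_key, 0) silently substitutes 0 for any successor belief that is not an exact grid key, which discards the future value for those branches.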
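
A minimal sketch of the observation probability and belief update with indices that match how T and O are actually laid out in the cell above. Plain lists are used to keep it self-contained; the helper names predict and obs_prob are hypothetical, and returning None for an impossible observation is one possible convention, not the only one.

```python
# T[a][s][s'] = P(s' | s, a); rows are the current state
# in the order (Failed, Degraded, Healthy).
T = [
    [[1.0, 0.0, 0.0], [0.2, 0.8, 0.0], [0.0, 0.1, 0.9]],   # Do nothing
    [[0.7, 0.3, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # Maintenance
    [[0.0, 0.0, 1.0]] * 3,                                   # Replace
]
# O[s][o] = P(o | s); columns are (Bad, Uncertain, Good).
O = [[0.9, 0.1, 0.0], [0.3, 0.5, 0.2], [0.0, 0.2, 0.8]]


def predict(b, a):
    """Predicted belief over s' after action a: sum_s b[s] * T[a][s][s']."""
    return [sum(b[s] * T[a][s][sp] for s in range(3)) for sp in range(3)]


def obs_prob(b, a, o):
    """P(o | b, a) = sum_s' P(o | s') * predicted belief at s'."""
    return sum(p * O[sp][o] for sp, p in enumerate(predict(b, a)))


def update_belief(b, a, o):
    """Bayes update; returns None when the observation has zero probability."""
    pred = predict(b, a)
    joint = [pred[sp] * O[sp][o] for sp in range(3)]
    total = sum(joint)
    if total == 0.0:
        return None
    return [x / total for x in joint]


belief = [0.0, 0.2, 0.8]
for o, name in enumerate(["Bad", "Uncertain", "Good"]):
    print(name, obs_prob(belief, 0, o), update_belief(belief, 0, o))
```

In the backup, each observation then contributes obs_prob(b, a, o) times the value at update_belief(b, a, o), and branches where the update returns None are skipped because their weight is zero.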