# Partially Observable Markov decision processes (POMDPs)

This Jupyter notebook acts as supporting material for POMDPs, covered in **Chapter 17 Making Complex Decisions** of the book *Artificial Intelligence: A Modern Approach*. We make use of the implementations of POMDPs in the `mdp.py` module. This notebook has been separated from the notebook `mdp.ipynb` as the topics are considerably more advanced.

**Note that it is essential to work through and understand the mdp.ipynb notebook before diving into this one.**

Let us import everything from the mdp module to get started.

In [1]:
from mdp import *
from notebook import psource, pseudocode

## CONTENTS

1. Overview of MDPs
2. POMDPs - a conceptual outline
3. POMDPs - a rigorous outline
4. Value Iteration
    - Value Iteration Visualization

## 1. OVERVIEW

We first review the Markov property and MDPs, as in [Section 17.1] of the book.

- A stochastic process is said to have the **Markov property**, or to have a **Markovian transition model** if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only on the present state, not on the sequence of events that preceded it.

 -- (Source: [Wikipedia](https://en.wikipedia.org/wiki/Markov_property))

A Markov decision process or MDP is defined as:
- a sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards.

An MDP consists of a set of states (with an initial state $s_0$); a set $A(s)$ of actions
in each state; a transition model $P(s' | s, a)$; and a reward function $R(s)$.

The MDP seeks to make sequential decisions to occupy states so as to maximise some combination of the reward function $R(s)$.

The characteristic problem of the MDP is hence to identify the optimal policy function $\pi^*(s)$ that provides the _utility-maximising_ action $a$ to be taken when the current state is $s$.

### Belief vector

**Note**: The book refers to the _belief vector_ as the _belief state_. We use the former terminology here to retain our ability to refer to the belief vector as a _probability distribution over states_.

The solution of an MDP is subject to certain properties of the problem which are assumed and justified in [Section 17.1]. One critical assumption is that the agent is **fully aware of its current state at all times**.

A tedious (but rewarding, as we will see) way of expressing this is in terms of the **belief vector** $b$ of the agent. The belief vector is a function mapping states to probabilities or certainties of being in those states.

Consider an agent that is fully aware that it is in state $s_i$ in the state space $(s_1, s_2, ... s_n)$ at the current time.

Its belief vector is the vector $(b(s_1), b(s_2), ... b(s_n))$ given by the function $b(s)$:
\begin{align*}
b(s) &= 0 \quad \text{if }s \neq s_i \\ &= 1 \quad \text{if } s = s_i
\end{align*}

Note that $b(s)$ is a probability distribution that necessarily sums to $1$ over all $s$.
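
As a minimal sketch of this definition, the belief vector of a fully certain agent is a one-hot list. The state labels and the helper name `certain_belief` are hypothetical and are not part of `mdp.py`.

```python
def certain_belief(states, current):
    """Return a belief vector with all probability mass on `current`."""
    return [1.0 if s == current else 0.0 for s in states]

states = ["s1", "s2", "s3", "s4"]   # hypothetical state labels
b = certain_belief(states, "s3")

print(b)        # [0.0, 0.0, 1.0, 0.0]
print(sum(b))   # 1.0
```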



## 2. POMDPs - a conceptual outline

The POMDP really has only two modifications to the **problem formulation** compared to the MDP.

- **Belief state** - In the real world, the current state of an agent is often not known with complete certainty. This makes the concept of a belief vector extremely relevant. It allows the agent to represent different degrees of certainty with which it _believes_ it is in each state.

- **Evidence percepts** - In the real world, agents often have certain kinds of evidence, collected from sensors. They can use the probability distribution of observed evidence, conditional on state, to consolidate their information. This is a known distribution $P(e\ |\ s)$, where $e$ is a piece of evidence and $s$ is the state it is conditional on.

Consider the world we used for the MDP. 

![title](images/grid_mdp.jpg)

#### Using the belief vector
An agent beginning at $(1, 1)$ may not be certain that it is indeed in $(1, 1)$. Consider a belief vector $b$ such that:
\begin{align*}
    b((1,1)) &= 0.8 \\
    b((2,1)) &= 0.1 \\
    b((1,2)) &= 0.1 \\
    b(s) &= 0 \quad \quad \forall \text{ other } s
\end{align*}

By concatenating the rows $y = 1, 2, 3$ in order, we can represent this as an 11-dimensional vector (omitting $(2, 2)$).

Thus, taking $s_1 = (1, 1)$, $s_2 = (2, 1)$, ... $s_{11} = (4,3)$, we have $b$:

$b = (0.8, 0.1, 0, 0, 0.1, 0, 0, 0, 0, 0, 0)$ 

This vector fully represents the agent's degree of certainty about its state.
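
The same vector can be built programmatically. This is a sketch that assumes the row-by-row ordering of squares (all of $y = 1$, then $y = 2$, then $y = 3$), which is the ordering that reproduces the vector above. The names `states` and `belief` are illustrative only.

```python
# Row-by-row ordering of the 4x3 grid, skipping the square (2, 2).
states = [(x, y) for y in range(1, 4) for x in range(1, 5) if (x, y) != (2, 2)]

# Non-zero beliefs from the example; every other state gets probability 0.
belief = {(1, 1): 0.8, (2, 1): 0.1, (1, 2): 0.1}
b = [belief.get(s, 0.0) for s in states]

print(len(b))   # 11
print(b)        # [0.8, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```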

#### Using evidence
The evidence observed here could be the number of adjacent 'walls' or 'dead ends' observed by the agent. We assume that the agent cannot 'orient' the walls - only count them.

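In this case, $e$ can take only two values, 1 and 2. Assuming the sensor counts walls without error, $P(e\ |\ s)$ is:
\begin{align*}
    P(e=2\ |\ s) &= 1 \quad \forall \quad s \in \{s_1, s_2, s_4, s_5, s_8, s_9, s_{11}\}\\
    P(e=1\ |\ s) &= 1 \quad \forall \quad s \in \{s_3, s_6, s_7, s_{10}\} \\
    P(e\ |\ s) &= 0 \quad \forall \quad \text{ other } s, e
\end{align*}

For each state, these probabilities sum to $1$ over $e$, as a conditional distribution must. Under a uniform prior over the 11 states, Bayes' rule then gives the posteriors $P(s\ |\ e=2) = \frac{1}{7}$ and $P(s\ |\ e=1) = \frac{1}{4}$ for the states in the respective sets.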

Note that the implications of the evidence on the state must be known **a priori** to the agent. Ways of reliably learning this distribution from percepts are beyond the scope of this notebook.
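
The two groups of states can be checked by counting, for each square, the neighbouring cells that lie off the grid or on the omitted square $(2, 2)$. The sketch below assumes the 4x3 layout shown in the figure and a row-by-row state ordering. The function `wall_count` is a hypothetical helper, not part of `mdp.py`.

```python
OBSTACLE = (2, 2)
states = [(x, y) for y in range(1, 4) for x in range(1, 5) if (x, y) != OBSTACLE]

def wall_count(state):
    """Count the four neighbours that are off the grid or the obstacle."""
    x, y = state
    neighbours = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return sum(1 for n in neighbours if n not in states)

# Group the 1-based state indices s_1 ... s_11 by the evidence they produce.
by_evidence = {}
for index, s in enumerate(states, start=1):
    by_evidence.setdefault(wall_count(s), []).append(index)

print(by_evidence)   # {2: [1, 2, 4, 5, 8, 9, 11], 1: [3, 6, 7, 10]}
```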

## 3. POMDPs - a rigorous outline

A POMDP is thus a sequential decision problem for a *partially* observable, stochastic environment with a Markovian transition model, a known 'sensor model' for inferring state from observation, and additive rewards.

Practically, a POMDP has the following, which an MDP also has:
- a set of states, each denoted by $s$
- a set of actions available in each state, $A(s)$
- a reward accrued on attaining some state, $R(s)$
- a transition probability $P(s'\ |\ s, a)$ of action $a$ changing the state from $s$ to $s'$

And the following, which an MDP does not:
- a sensor model $P(e\ |\ s)$ on evidence conditional on states

Additionally, because the agent is now uncertain of its current state, it maintains:
- a belief vector $b$ representing the certainty of being in each state (as a probability distribution)


#### New uncertainties

It is useful to intuitively appreciate the new uncertainties that have arisen in the agent's awareness of its own state.

- At any point, the agent has belief vector $b$, the distribution of its believed likelihood of being in each state $s$.
- For each of these states $s$ that the agent may **actually** be in, it has some set of actions given by $A(s)$.
- Each of these actions may transport it to some other state $s'$, assuming an initial state $s$, with probability $P(s'\ |\ s, a)$
- Once the action is performed, the agent receives a percept $e$. $P(e\ |\ s)$ now tells it the chances of having perceived $e$ for each state $s$. The agent must use this information to update its new belief state appropriately.

#### Evolution of the belief vector - the `FORWARD` function

The new belief vector $b'(s')$, after taking action $a$ from belief vector $b(s)$ and observing evidence $e$, is:
$$ b'(s') = \alpha P(e\ |\ s') \sum_s P(s'\ | s, a) b(s)$$ 

where $\alpha$ is a normalising constant (to retain the interpretation of $b$ as a probability distribution).

This equation sums, over every possible state $s$, the likelihood of moving from $s$ to $s'$ weighted by the initial likelihood of being in $s$. The sum is then multiplied by the likelihood of observing the evidence $e$ in the new state $s'$.

This function is represented as `b' = FORWARD(b, a, e)`
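
The update can be sketched in a few lines of plain Python. This is an illustrative version and not the implementation in `mdp.py`. The nested-list layout of `transition` and `sensor`, and the two-state numbers used to exercise it, are assumptions made for the example.

```python
def forward(b, a, e, transition, sensor):
    """Compute b'(s') = alpha * P(e | s') * sum_s P(s' | s, a) * b(s).

    b          : list of probabilities, one per state
    transition : transition[a][s][s2] = P(s2 | s, a)
    sensor     : sensor[e][s2] = P(e | s2)
    """
    n = len(b)
    unnormalised = [
        sensor[e][s2] * sum(transition[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalised)          # equals 1 / alpha
    if total == 0:
        raise ValueError("evidence is impossible under this belief and action")
    return [p / total for p in unnormalised]

# Hypothetical two-state problem; the numbers are for illustration only.
transition = {"stay": [[0.9, 0.1], [0.1, 0.9]],
              "go":   [[0.1, 0.9], [0.9, 0.1]]}
sensor = {"e0": [0.6, 0.4], "e1": [0.4, 0.6]}

b_new = forward([0.5, 0.5], "stay", "e0", transition, sensor)
print(b_new)   # approximately [0.6, 0.4]
```

Dividing by `total` plays the role of $\alpha$, so the returned vector always sums to $1$.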

#### Probability distribution of the evolving belief vector

The goal here is to find $P(b'\ |\ b, a)$, the probability that action $a$ transforms belief vector $b$ into belief vector $b'$. The following steps derive it.

The probability of observing evidence $e$ when action $a$ is enacted on belief vector $b$ can be distributed over each possible new state $s'$ resulting from it:
\begin{align*}
    P(e\ |\ b, a) &= \sum_{s'} P(e\ |\ b, a, s') P(s'\ |\ b, a) \\
                  &= \sum_{s'} P(e\ |\ s') P(s'\ |\ b, a) \\
                  &= \sum_{s'} P(e\ |\ s') \sum_s P(s'\ |\ s, a) b(s)
\end{align*}

The probability of getting belief vector $b'$ from $b$ by application of action $a$ can thus be summed over all possible evidences $e$:
\begin{align*}
    P(b'\ |\ b, a) &= \sum_{e} P(b'\ |\ b, a, e) P(e\ |\ b, a) \\
                  &= \sum_{e} P(b'\ |\ b, a, e) \sum_{s'} P(e\ |\ s') \sum_s P(s'\ |\ s, a) b(s)
\end{align*}

where $P(b'\ |\ b, a, e) = 1$ if $b' = $ `FORWARD(b, a, e)` and $= 0$ otherwise.

Given initial and final belief states $b$ and $b'$, the transition probabilities still depend on the action $a$ and observed evidence $e$. Some belief states may be achievable by certain actions, but have non-zero probabilities for states prohibited by the evidence $e$. The above condition ensures that only valid combinations of $(b', b, a, e)$ are considered.
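
The derivation above can be sketched as follows. Because $P(b'\ |\ b, a, e)$ is $1$ only for $b' = $ `FORWARD(b, a, e)`, it is enough to enumerate the evidence values, compute the belief each one leads to, and weight it by $P(e\ |\ b, a)$. The data layout and the two-state numbers are assumptions for illustration, and the helper names are hypothetical.

```python
def predict(b, a, transition):
    """P(s' | b, a) = sum_s P(s' | s, a) * b(s)."""
    n = len(b)
    return [sum(transition[a][s][s2] * b[s] for s in range(n)) for s2 in range(n)]

def evidence_probability(b, a, e, transition, sensor):
    """P(e | b, a) = sum_s' P(e | s') * P(s' | b, a)."""
    predicted = predict(b, a, transition)
    return sum(sensor[e][s2] * p for s2, p in enumerate(predicted))

def belief_transitions(b, a, transition, sensor):
    """Return (b', P(e | b, a)) for every evidence e with non-zero probability.

    If two evidence values lead to the same b', their probabilities
    should be added to obtain P(b' | b, a); this sketch leaves them separate.
    """
    predicted = predict(b, a, transition)
    outcomes = []
    for e in sensor:
        p_e = evidence_probability(b, a, e, transition, sensor)
        if p_e == 0:
            continue   # this evidence cannot be observed from (b, a)
        b_next = [sensor[e][s2] * predicted[s2] / p_e for s2 in range(len(b))]
        outcomes.append((b_next, p_e))
    return outcomes

# Hypothetical two-state problem; the numbers are for illustration only.
transition = {"stay": [[0.9, 0.1], [0.1, 0.9]]}
sensor = {"e0": [0.6, 0.4], "e1": [0.4, 0.6]}

outcomes = belief_transitions([0.8, 0.2], "stay", transition, sensor)
for b_next, prob in outcomes:
    print(b_next, prob)
```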

#### A modified reward space

For MDPs, the reward space was simple - one reward per available state. However, for a belief vector $b(s)$, the expected reward is now:
$$\rho(b) = \sum_s b(s) R(s)$$

Thus, just as the belief vector can take infinitely many values (any probability distribution over the states), the expected reward $\rho(b)$ varies continuously with it. Because $\rho(b)$ is a linear combination of the coordinates $b(s)$, it defines a hyperplane over the $N$-dimensional belief space.
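
A short sketch makes the linearity concrete. The rewards used here are an assumption, since this section does not state them: $+1$ and $-1$ at the terminal squares $(4, 3)$ and $(4, 2)$, and $-0.04$ elsewhere. The helper names are hypothetical.

```python
states = [(x, y) for y in range(1, 4) for x in range(1, 5) if (x, y) != (2, 2)]

# Assumed rewards (not given in this section): +1 at (4, 3), -1 at (4, 2),
# and -0.04 for every other square.
rewards = [1.0 if s == (4, 3) else -1.0 if s == (4, 2) else -0.04 for s in states]

def expected_reward(b, rewards):
    """rho(b) = sum_s b(s) * R(s)."""
    return sum(p * r for p, r in zip(b, rewards))

def point_belief(state):
    """Belief vector of an agent that is certain it is in `state`."""
    return [1.0 if s == state else 0.0 for s in states]

b1 = point_belief((1, 1))
b2 = point_belief((4, 3))
lam = 0.25
mix = [lam * p + (1 - lam) * q for p, q in zip(b1, b2)]

# The reward of a mixture of beliefs is the same mixture of their rewards.
print(expected_reward(mix, rewards))
print(lam * expected_reward(b1, rewards) + (1 - lam) * expected_reward(b2, rewards))
```

Both printed values agree up to floating-point rounding, which is what it means for $\rho$ to be linear in $b$.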