# The Bellman Equation

The Bellman equation is the basis for solving a Markov Decision Process (MDP) in reinforcement learning. It defines the optimal value function and the optimal Q function recursively, and once either of them is known, the optimal policy can be derived from it.

## Deterministic Policy
Under a deterministic policy, an agent always performs one particular action in a given state. The policy is denoted by $\mu$ and is given as:

\begin{equation}
a_t = \mu(s_t)
\end{equation}

where $t$ is the time step, $s_t$ is the state at that time step, and $a_t$ is the action the agent takes in that state.
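As a minimal sketch, a deterministic policy over a small discrete state space can be stored as a lookup table. The state names, action names, and the helper `act` below are hypothetical and only illustrate the mapping $a_t = \mu(s_t)$.

```python
# Hypothetical states and actions, chosen for illustration only.
mu = {"A": "right", "B": "down", "C": "right"}


def act(policy, state):
    """Return the single action the deterministic policy assigns to `state`."""
    return policy[state]


# The same state always yields the same action.
visited_states = ["A", "B", "C"]
actions = [act(mu, s) for s in visited_states]
```

Calling `act(mu, "A")` repeatedly always returns the same action, which is what distinguishes this from a policy that samples its actions.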

### The Bellman equation in Deterministic Environment
In a deterministic environment, the Bellman equation states that the value of the current state is the sum of the immediate reward and the discounted value of the next state.

\begin{equation}
V(s) = R(s,a,s') + \gamma V(s')
\end{equation}

where
* $R(s,a,s')$ is the immediate reward obtained by taking action $a$ in state $s$ and moving to the next state $s'$,
* $\gamma$ is the discount factor, and
* $V(s')$ is the value of the next state.

Thus, when following a policy $\pi$, the Bellman equation for the deterministic environment is:
\begin{equation}
V^\pi(s) = R(s,a,s') + \gamma V^\pi(s')
\end{equation}

Note: the right-hand side (RHS) of the Bellman equation is also known as the **Bellman backup**.
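The deterministic form can be illustrated with a short sketch that applies the Bellman backup repeatedly until the values settle. The three-state chain, its rewards, and the function name `evaluate_deterministic` are assumptions made up for this example; the terminal state is assumed to have value zero.

```python
# Hypothetical chain s0 -> s1 -> s2, where s2 is terminal.
GAMMA = 0.9

# Next state and immediate reward when following the fixed policy in each state.
next_state = {"s0": "s1", "s1": "s2"}
reward = {"s0": 1.0, "s1": 2.0}


def evaluate_deterministic(next_state, reward, gamma, terminal, sweeps=100):
    """Repeatedly apply V(s) = R(s, a, s') + gamma * V(s') to every non-terminal state."""
    values = {s: 0.0 for s in list(next_state) + [terminal]}
    for _ in range(sweeps):
        for s in next_state:
            values[s] = reward[s] + gamma * values[next_state[s]]
    return values


values = evaluate_deterministic(next_state, reward, GAMMA, terminal="s2")
```

Under these assumptions, `values["s1"]` is $2.0 + 0.9 \cdot 0 = 2.0$ and `values["s0"]` is $1.0 + 0.9 \cdot 2.0 = 2.8$.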

### The Bellman equation in Stochastic Environment
If the environment is stochastic rather than deterministic, performing an action $a$ in state $s$ can lead to different next states. For example, taking action $a_1$ in state $s_1$ might lead to $s_2$ with probability 0.1 and to $s_3$ with probability 0.9. In this case, the Bellman equation weights each possible outcome by its transition probability:

\begin{equation}
V^\pi(s) = \sum_{s'}{P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')]}
\end{equation}

Here, $P(s'|s,a)$ is the transition probability of reaching next state $s'$ by taking action $a$ in state $s$, and the sum runs over all possible next states.
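A single stochastic backup for the $s_1$, $a_1$ example can be sketched as follows. The transition probabilities (0.1 and 0.9) come from the text; the rewards, the next-state values, the discount factor, and the function name `bellman_backup` are assumed for illustration.

```python
def bellman_backup(transitions, values, gamma):
    """Expected one-step return: sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s')).

    `transitions` is a list of (probability, reward, next_state) tuples.
    """
    return sum(p * (r + gamma * values[s_next]) for p, r, s_next in transitions)


# Taking a1 in s1: 10% chance of reaching s2, 90% chance of reaching s3.
# Rewards and next-state values below are assumed numbers.
transitions_s1_a1 = [(0.1, 1.0, "s2"), (0.9, 0.0, "s3")]
values = {"s2": 10.0, "s3": 5.0}

v_s1 = bellman_backup(transitions_s1_a1, values, gamma=0.9)
```

With these numbers, the backup evaluates to $0.1 \cdot (1.0 + 0.9 \cdot 10.0) + 0.9 \cdot (0.0 + 0.9 \cdot 5.0) = 1.0 + 4.05 = 5.05$.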

## Stochastic Policy
Unlike a deterministic policy, a stochastic policy does not map a state to a single action. Instead, for each state $s$ in the state space $S$, it returns a probability distribution over the action space, commonly written $\pi(a|s)$. This means an agent does not necessarily take the same action every time it visits a particular state, because the action is sampled from that distribution.
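One way to picture a stochastic policy is as a table of per-state action distributions from which the agent samples. The sketch below uses Python's standard `random` module; the states, actions, probabilities, and the helper `sample_action` are hypothetical.

```python
import random

# Hypothetical policy: each state maps to a probability distribution over actions.
policy = {
    "s1": {"left": 0.2, "right": 0.8},
    "s2": {"left": 0.5, "right": 0.5},
}


def sample_action(policy, state, rng):
    """Draw one action for `state` according to the policy's probabilities."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs give the same draws
samples = [sample_action(policy, "s1", rng) for _ in range(1000)]
```

Over many draws in state `s1`, roughly 80% of the sampled actions should be `right`, although any individual visit may yield either action.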
