Okay! So we now know how to frame any reinforcement learning problem as a Markov Decision Process (MDP), defined by:
- A set of states _S_
- A set of actions _A_
- A set of rewards _R_
- The one-step dynamics of the environment, p(s',r | s,a) = P(S(t+1) = s', R(t+1) = r | S(t) = s, A(t) = a)
- A discount rate γ∈[0,1]
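
As a rough illustration, the components listed above can be stored in plain Python containers. Everything in this sketch is an assumption made for the example: the state and action names, the probabilities, the rewards, and the helper `check_dynamics`.

```python
# Hypothetical two-state MDP. All names and numbers are illustrative assumptions.
STATES = ["high", "low"]
ACTIONS = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}
GAMMA = 0.9  # discount rate, must lie in [0, 1]

# Tabular encoding of p(s', r | s, a):
# DYNAMICS[(s, a)] is a list of (probability, next_state, reward) triples.
DYNAMICS = {
    ("high", "search"): [(0.7, "high", 4.0), (0.3, "low", 4.0)],
    ("high", "wait"): [(1.0, "high", 1.0)],
    ("low", "search"): [(0.6, "low", 4.0), (0.4, "high", -3.0)],
    ("low", "wait"): [(1.0, "low", 1.0)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}


def check_dynamics(dynamics, tol=1e-9):
    """Return True if the outcome probabilities of every (s, a) pair sum to 1."""
    return all(
        abs(sum(p for p, _, _ in outcomes) - 1.0) < tol
        for outcomes in dynamics.values()
    )


print(check_dynamics(DYNAMICS))
```

The set of rewards _R_ is implicit here: it is simply the set of reward values that appear in the triples.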

So what counts as a solution? The agent needs to learn which action to take in each state it may encounter, taking into account both its current state and the states that follow, so that it makes progress toward its goal.

Therefore, as long as the agent can learn an appropriate action for every environment state it can observe, we have a solution to our problem. This leads to the idea of a policy, which defines these actions. A policy can be:
- Deterministic
- Stochastic

**Deterministic Policy:** A mapping from each state to the single action that should be taken in it, represented as __π:S→A__. Example: π(low) = recharge, π(high) = explore

**Stochastic Policy:** A stochastic policy gives, for each state, the probability of the agent taking each possible action in the action space, and is represented as __π:S×A→[0,1]__. The probabilities over all actions in a given state sum to 1. Example: π(recharge ∣ low) = 0.5, π(wait ∣ low) = 0.4, π(search ∣ low) = 0.1


**Note:** Any deterministic policy can be represented as a stochastic policy: the probability of the chosen action in each state is 1, and the probability of every other action is 0.

Now that we know how to define a policy, another question arises: how do we find the best policy?
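
The two kinds of policy, and the conversion mentioned in the note, could be sketched with dictionaries as follows. The policy values are the examples from the text; the helper names `to_stochastic` and `sample_action` are hypothetical and chosen only for this illustration.

```python
import random

# Deterministic policy: state -> action (the example values from the text).
deterministic_policy = {"low": "recharge", "high": "explore"}

# Stochastic policy: state -> {action: probability}.
stochastic_policy = {
    "low": {"recharge": 0.5, "wait": 0.4, "search": 0.1},
}


def to_stochastic(policy):
    """Express a deterministic policy as a stochastic one (chosen action gets probability 1)."""
    return {state: {action: 1.0} for state, action in policy.items()}


def sample_action(policy, state, rng=random):
    """Draw one action for `state` according to the probabilities π(a | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


print(to_stochastic(deterministic_policy))
print(sample_action(stochastic_policy, "low"))
```

Actions that are missing from a state's dictionary are treated as having probability 0 in this representation.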

### State-Value Functions

To get the best policy, we first need to define something called a **state-value function**.

For each state in the state space, the state-value function yields the expected return if the agent starts in that state and then follows the specified policy for all subsequent timesteps until the episode terminates. It is denoted as Vπ.

**Note:** The state-value function corresponds to a particular policy and changes if there is a change in the policy.

It can be written down in notation form as:

<img src='Images\State-Value-Function.png'>
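
To make "expected return" concrete, here is a small sketch that computes the discounted return of a finite episode and averages it over several episodes that start in the same state. The reward sequences are made-up numbers, and averaging sampled returns is only one possible way to approximate the expectation, used here as an assumption for illustration.

```python
def discounted_return(rewards, gamma):
    """G(t) = R(t+1) + γR(t+2) + γ²R(t+3) + ... for one finite episode."""
    g = 0.0
    # Work backwards so that each step applies G(t) = R(t+1) + γ·G(t+1).
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


def estimate_state_value(episodes, gamma):
    """Average the returns of episodes that all start in the same state."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Hypothetical reward sequences observed from one start state under one policy.
episodes = [[1.0, 1.0, 1.0], [0.0, 2.0], [4.0]]
print(estimate_state_value(episodes, gamma=0.5))
```

Changing the policy would change which reward sequences are observed, which is why the state-value function is tied to a particular policy.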

### Bellman Expectation Equation

Building on the definition of the state-value function, the Bellman expectation equation states that the value of any state equals the expected immediate reward plus the discounted value of the state that follows. (We take an expectation because, in general, both the next reward and the next state are random.)

Vπ(s) = Eπ [G(t) | S(t) = s]
      
      = Eπ [R(t+1) + γR(t+2) + .... | S(t) = s]
      
      = Eπ [R(t+1) + γVπ(S(t+1)) | S(t) = s]

<img src = 'Images\Bellman-Equation.png'>

But how do we calculate these expected values? This is where our policy comes into play.

For a *Deterministic policy*:

Vπ(s) = Eπ [R(t+1) + γVπ(S(t+1)) | S(t) = s]

can be re-written as:

Vπ(s) = ∑ p(s',r | s, π(s)) . (r + γVπ(s')) ,    summed over all s' ∈ S+ , r ∈ R

Here the action is fixed to a = π(s), the one action the policy prescribes in state s. We multiply the sum of the reward and the discounted value of the next state by the probability of that outcome, and add up over all possible outcomes to get the expected value.
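
A single application of this formula, for a deterministic policy where the action in each state is fixed to π(s), might look like the sketch below. The dynamics table, the current value estimates in `values`, and the function name `bellman_backup` are all assumptions invented for the example.

```python
GAMMA = 0.9

# Deterministic policy from the text: state -> action.
policy = {"low": "recharge", "high": "explore"}

# Hypothetical dynamics: (s, a) -> list of (probability, next_state, reward).
dynamics = {
    ("low", "recharge"): [(1.0, "high", 0.0)],
    ("high", "explore"): [(0.7, "high", 4.0), (0.3, "low", 4.0)],
}

# Assumed current estimates of Vπ, used on the right-hand side of the equation.
values = {"high": 10.0, "low": 5.0}


def bellman_backup(state, policy, dynamics, values, gamma):
    """Right-hand side of the Bellman expectation equation with a = π(state)."""
    action = policy[state]
    return sum(
        prob * (reward + gamma * values[next_state])
        for prob, next_state, reward in dynamics[(state, action)]
    )


print(bellman_backup("low", policy, dynamics, values, GAMMA))
print(bellman_backup("high", policy, dynamics, values, GAMMA))
```

For "low" the only outcome is a move to "high" with reward 0, so the result is 0 + 0.9 × 10 = 9. For "high" the two outcomes contribute 0.7 × 13 and 0.3 × 8.5, giving roughly 11.65.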

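Similarly, for a *Stochastic policy*, each action is additionally weighted by the probability π(a|s) of the policy choosing it, so the equation can be re-written as:

Vπ(s) = ∑ π(a|s) . p(s',r | s, a) . (r + γVπ(s')) ,    summed over all s' ∈ S+ , r ∈ R, a ∈ A(s) 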
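
One common way to turn this equation into numbers is to apply it repeatedly as an update until the values stop changing (iterative policy evaluation). The sketch below assumes a small tabular MDP with made-up dynamics. The function name `evaluate_policy`, the tolerance, and the convention that any state without a policy entry is terminal with value 0 are choices made for this illustration.

```python
def evaluate_policy(policy, dynamics, gamma, tol=1e-10, max_sweeps=10_000):
    """Repeat the Bellman expectation backup for every state until convergence.

    policy:   state -> {action: probability}
    dynamics: (state, action) -> list of (probability, next_state, reward)
    """
    values = {state: 0.0 for state in policy}
    for _ in range(max_sweeps):
        delta = 0.0
        for state in policy:
            new_value = sum(
                pi_a * prob * (reward + gamma * values.get(next_state, 0.0))
                for action, pi_a in policy[state].items()
                for prob, next_state, reward in dynamics[(state, action)]
            )
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < tol:  # largest change in this sweep is negligible
            break
    return values


# Stochastic policy for "low" from the text; the rest is hypothetical.
policy = {
    "high": {"search": 1.0},
    "low": {"recharge": 0.5, "wait": 0.4, "search": 0.1},
}
dynamics = {
    ("high", "search"): [(0.7, "high", 4.0), (0.3, "low", 4.0)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
    ("low", "wait"): [(1.0, "low", 1.0)],
    ("low", "search"): [(0.6, "low", 4.0), (0.4, "high", -3.0)],
}
print(evaluate_policy(policy, dynamics, gamma=0.9))
```

With γ < 1 each sweep shrinks the error, so the loop settles on values that satisfy the equation above for every state at once.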