# Stanford CME 241 (Winter 2021) - Assignment 3

## 1.

Let $\pi_D$ be a deterministic policy. The 4 Bellman equations become:

* $$V^{\pi_D}(s) = Q^{\pi_D}(s, \pi_D(s))$$
* $$Q^{\pi_D}(s,a) = \mathcal{R}(s,a) + \gamma \cdot \sum_{s' \in \mathcal{N}}\mathcal{P}(s,a,s') \cdot V^{\pi_D}(s')$$
* $$V^{\pi_D}(s) = \mathcal{R}(s,\pi_D(s)) + \gamma \cdot \sum_{s' \in \mathcal{N}}\mathcal{P}(s, \pi_D (s), s') \cdot V^{\pi_D}(s') $$
* $$Q^{\pi_D}(s,a) = \mathcal{R}(s,a) + \gamma \cdot \sum_{s' \in \mathcal{N}}\mathcal{P}(s,a,s')Q^{\pi_D}(s', \pi_D (s')) $$
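
A quick numerical sanity check of these four equations, as a minimal sketch: the two-state MDP, its rewards, its transition probabilities and the policy below are all hypothetical values chosen for illustration, and every state is treated as non-terminal.

```python
# Toy MDP: all numbers are hypothetical, chosen only for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["x", "y"]
# R[s][a] is the expected reward, P[s][a][s2] the transition probability.
R = {0: {"x": 1.0, "y": 0.0}, 1: {"x": 0.5, "y": 2.0}}
P = {
    0: {"x": {0: 0.8, 1: 0.2}, "y": {0: 0.1, 1: 0.9}},
    1: {"x": {0: 0.5, 1: 0.5}, "y": {0: 0.3, 1: 0.7}},
}
PI_D = {0: "x", 1: "y"}  # deterministic policy: state -> action


def evaluate_policy(tol=1e-12, max_iter=100_000):
    """Iterate V(s) = R(s, pi(s)) + gamma * sum_s' P(s, pi(s), s') * V(s')."""
    v = {s: 0.0 for s in STATES}
    for _ in range(max_iter):
        new_v = {
            s: R[s][PI_D[s]]
            + GAMMA * sum(p * v[s2] for s2, p in P[s][PI_D[s]].items())
            for s in STATES
        }
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v
    return v


def q_from_v(v):
    """Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') * V(s')."""
    return {
        (s, a): R[s][a] + GAMMA * sum(p * v[s2] for s2, p in P[s][a].items())
        for s in STATES
        for a in ACTIONS
    }


V = evaluate_policy()
Q = q_from_v(V)
for s in STATES:
    # First equation: V(s) should coincide with Q(s, pi(s)).
    print(s, V[s], Q[(s, PI_D[s])])
```

The two printed columns should agree up to the iteration tolerance, which is the first equation; the remaining ones can be checked the same way by substituting `Q[(s2, PI_D[s2])]` for `V[s2]`.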

## 2. 

The transition probabilities and rewards of the given MDP are independent of the state $s$. Therefore, we get $V^*(s) = V^*(s') ~~ \forall s,s'\in \mathcal{S}$.
Let us derive $V^*$ using this identity:

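$$V^*(s) = \max_a \left[ \mathcal{R}(s,a) + \frac{1}{2}\left((1-a)+a\right)\cdot V^*(s) \right]$$
$$V^*(s) = \max_a \left[ 2a(1-a) \right] + \frac{1}{2}\cdot V^*(s)$$
The maximum of $2a(1-a)$ is $\frac{1}{2}$, attained at $a = \frac{1}{2}$, so $V^*(s) = \frac{1}{2} + \frac{1}{2}\cdot V^*(s)$.
### Which yields $V^*(s) = 1$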

Similarly, we compute $$Q^*(s,a) = 2a(1-a) + \frac{1}{2}$$
We then get the optimal policy $$\pi_D^* (s) = \arg\max_a [Q^*(s,a)] = \frac{1}{2}$$
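
The result can be checked numerically with a coarse grid search. This is only a sketch: it takes the expected reward $2a(1-a)$ and the factor $\frac{1}{2}$ from the derivation above, and it assumes that actions lie in $[0, 1]$ (the grid and its resolution are arbitrary choices).

```python
GAMMA = 0.5  # the factor 1/2 multiplying V* in the derivation


def reward(a):
    """Expected one-step reward 2a(1 - a), identical in every state."""
    return 2.0 * a * (1.0 - a)


# Assumed action range [0, 1], discretized on a uniform grid.
grid = [i / 1000 for i in range(1001)]
best_a = max(grid, key=reward)

# V* solves V = max_a reward(a) + GAMMA * V.
v_star = reward(best_a) / (1.0 - GAMMA)


def q_star(a):
    """Q*(s, a) = reward(a) + GAMMA * V*, again independent of s."""
    return reward(a) + GAMMA * v_star


print(best_a, v_star, q_star(best_a))
```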

## 3. A patient frog vs. a busy frog 🐸
#### Bonus assignment: what would the frog do if she were in a rush to escape the pond?

* State spaces: $\mathcal{S} = \{0, \dots, n\}$, $\mathcal{N} = \{1, \dots, n-1\}$, $\mathcal{T} = \{0,n\}$
* Action space: $\mathcal{A} = \{A,B\}$

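As we are only interested in the survival of the frog, the rewards are sparse (non-zero only on transitions into a terminal state) and $\gamma = 1$. We will therefore define the MDP with $\mathcal{P}$ and $\mathcal{R_T}$:

* $\mathcal{R_T}(s,a,s') = \mathbb{1}\{s'=n\} - \mathbb{1}\{s'=0\}$, i.e. $+1$ for reaching $n$ and $-1$ for reaching $0$ (this is the reward implemented in the `Frog` class below and the one behind the printed values)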
* $\mathcal{P}(s,A,s-1) = 1 - \mathcal{P}(s,A,s+1) = \frac{s}{n}$
* $\mathcal{P}(s,B,s') = \frac{1}{n} ~~ \forall s' \neq s$
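
The two transition rules can be tabulated and checked without any library, as a small sketch. The function name `frog_transitions` and the nested-dict layout are illustrative choices, not part of the `rl` package.

```python
from fractions import Fraction


def frog_transitions(n):
    """Return {state: {action: {next_state: probability}}} for states 1..n-1."""
    table = {}
    for s in range(1, n):
        table[s] = {
            # Croak A: one step back w.p. s/n, one step forward w.p. (n-s)/n.
            "A": {s - 1: Fraction(s, n), s + 1: Fraction(n - s, n)},
            # Croak B: uniform over the n positions other than s.
            "B": {j: Fraction(1, n) for j in range(n + 1) if j != s},
        }
    return table


transitions = frog_transitions(12)
for s, by_action in transitions.items():
    for action, dist in by_action.items():
        # Each row must be a probability distribution.
        assert sum(dist.values()) == 1, (s, action)
```

Exact fractions avoid any rounding question when checking that each row sums to one.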

In [93]:
from rl.markov_decision_process import FiniteMarkovDecisionProcess, FinitePolicy
from rl.distribution import Categorical, Constant

In [94]:
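class Frog(FiniteMarkovDecisionProcess):
    """Frog MDP on positions 0..n; positions 0 and n are terminal."""

    def __init__(self, n):
        def reward(i):
            # +1 for reaching n, -1 for reaching 0, 0 otherwise.
            return float(i == n) - float(i == 0)

        # Terminal states map to None; each non-terminal state maps every
        # action to a distribution over (next_state, reward) pairs.
        mapping = {0: None, n: None}
        for i in range(1, n):
            mapping[i] = {
                # Croak A: one step back w.p. i/n, one step forward w.p. (n-i)/n.
                'A': Categorical({(i - 1, reward(i - 1)): i / n,
                                  (i + 1, reward(i + 1)): (n - i) / n}),
                # Croak B: jump uniformly to any of the n other positions.
                'B': Categorical({(j, reward(j)): 1 / n
                                  for j in range(n + 1) if j != i}),
            }
        super().__init__(mapping)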
        

In [95]:
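def enumerate_all_deterministic_policies(k):
    """Return every deterministic policy on states 0..k (0 and k terminal).

    For k >= 1 there are 2**(k-1) of them.
    """
    def enumerate_all_combinations(n):
        # Build the policies for states 0..n from those for states 0..n-1.
        assert n >= 0
        if n == 0:
            return [{0: None}]
        elif n == 1:
            # Only the two terminal states: a single (empty) policy.
            return [{0: None, 1: None}]
        else:
            res = []
            policies_previous = enumerate_all_combinations(n - 1)
            for policy in policies_previous:
                policy[n] = None  # n is the new terminal state
                # State n-1 becomes non-terminal: branch on its action.
                p1, p2 = policy.copy(), policy.copy()
                p1[n - 1] = Constant('A')
                p2[n - 1] = Constant('B')
                res.append(p1)
                res.append(p2)
            return res
    policies = enumerate_all_combinations(k)
    return [FinitePolicy(policy) for policy in policies]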
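
The same enumeration can be sketched non-recursively with `itertools.product`. To stay self-contained, this version yields plain `{state: action}` dicts over the non-terminal states; the function name is a hypothetical one, and wrapping each dict into a `FinitePolicy` (with `Constant` actions and `None` for the terminal states, as in the cell above) is left out.

```python
from itertools import product


def all_deterministic_action_maps(n, actions=("A", "B")):
    """Yield every map {state: action} over the non-terminal states 1..n-1."""
    states = range(1, n)
    for choice in product(actions, repeat=n - 1):
        yield dict(zip(states, choice))


maps = list(all_deterministic_action_maps(4))
print(len(maps))  # one map per combination of actions over states 1, 2, 3
for m in maps:
    print(m)
```

The number of policies doubles with every additional non-terminal state, which is why the brute-force search is only practical for small `n`.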

In [101]:
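import numpy as np

def brute_force_optimal_V(n, gamma=1.):
    """Return (V*, pi*) by evaluating every deterministic policy of Frog(n)."""
    mdp = Frog(n)  # build the MDP once instead of once per policy
    V_star, Pi_star = None, None
    for policy in enumerate_all_deterministic_policies(n):
        V = mdp.apply_finite_policy(policy).get_value_function_vec(gamma)
        # Keep a policy whose value function is at least as good in every state.
        if V_star is None or np.all(V >= V_star):
            V_star, Pi_star = V, policy
    return V_star, Pi_star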

In [97]:
V_star, Pi_star  = brute_force_optimal_V(12)
print(Pi_star)

0 is a Terminal State
For State 1:
  Do Action B with Probability 1.000
For State 2:
  Do Action A with Probability 1.000
For State 3:
  Do Action A with Probability 1.000
For State 4:
  Do Action A with Probability 1.000
For State 5:
  Do Action A with Probability 1.000
For State 6:
  Do Action A with Probability 1.000
For State 7:
  Do Action A with Probability 1.000
For State 8:
  Do Action A with Probability 1.000
For State 9:
  Do Action A with Probability 1.000
For State 10:
  Do Action A with Probability 1.000
For State 11:
  Do Action A with Probability 1.000
12 is a Terminal State



In [100]:
print(V_star)

[0.34789534 0.39567691 0.40523322 0.40841866 0.41001138 0.41114903
 0.41228669 0.41387941 0.41706485 0.42662116 0.47440273]


The optimal strategy is to always croak A, except in state 1. This makes sense: in every state but state 1, a step backward carries no immediate risk of reaching state 0. Let's see what happens if we give the frog an incentive to escape the pond quickly: let's set $\gamma = 0.99$ for a patient frog, and then $\gamma = 0.1$ for a frog with no time to waste.

In [107]:
V_star, Pi_star  = brute_force_optimal_V(12, gamma = 0.99)
print(Pi_star)

0 is a Terminal State
For State 1:
  Do Action B with Probability 1.000
For State 2:
  Do Action B with Probability 1.000
For State 3:
  Do Action B with Probability 1.000
For State 4:
  Do Action B with Probability 1.000
For State 5:
  Do Action B with Probability 1.000
For State 6:
  Do Action B with Probability 1.000
For State 7:
  Do Action B with Probability 1.000
For State 8:
  Do Action A with Probability 1.000
For State 9:
  Do Action A with Probability 1.000
For State 10:
  Do Action A with Probability 1.000
For State 11:
  Do Action A with Probability 1.000
12 is a Terminal State



In [108]:
V_star, Pi_star  = brute_force_optimal_V(12, gamma = 0.1)
print(Pi_star)

0 is a Terminal State
For State 1:
  Do Action B with Probability 1.000
For State 2:
  Do Action B with Probability 1.000
For State 3:
  Do Action B with Probability 1.000
For State 4:
  Do Action B with Probability 1.000
For State 5:
  Do Action B with Probability 1.000
For State 6:
  Do Action B with Probability 1.000
For State 7:
  Do Action B with Probability 1.000
For State 8:
  Do Action B with Probability 1.000
For State 9:
  Do Action B with Probability 1.000
For State 10:
  Do Action A with Probability 1.000
For State 11:
  Do Action A with Probability 1.000
12 is a Terminal State



### As expected, the frog is now willing to croak B to move faster (which is riskier than the conservative $\gamma = 1$ strategy). The smaller $\gamma$ is, the more states in which our frog croaks B (positions $1, 2, \dots, k$, with $k = 7$ for $\gamma = 0.99$ and $k = 9$ for $\gamma = 0.1$) to try to teleport quickly to position $n$.
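
The brute-force search evaluates $2^{n-1}$ policies, so a cheaper cross-check is useful. The sketch below runs plain value iteration on the same MDP in pure Python, using the reward implemented in the `Frog` class ($+1$ for reaching $n$, $-1$ for reaching $0$). The function name, the tolerance and the iteration cap are assumptions made for this example, and it does not use the `rl` package.

```python
def frog_value_iteration(n, gamma=1.0, tol=1e-12, max_iter=1_000_000):
    """Value iteration for the frog MDP; returns (V on states 1..n-1, policy)."""

    def terminal_reward(j):
        # +1 for reaching n, -1 for reaching 0, 0 otherwise.
        return float(j == n) - float(j == 0)

    def backup_total(v):
        # Sum of (reward + discounted value) over all n + 1 positions.
        return sum(terminal_reward(j) + gamma * v[j] for j in range(n + 1))

    def q_values(s, v, total):
        # A: one step back w.p. s/n, one step forward w.p. (n - s)/n.
        q_a = (s / n) * (terminal_reward(s - 1) + gamma * v[s - 1]) \
            + ((n - s) / n) * (terminal_reward(s + 1) + gamma * v[s + 1])
        # B: uniform over the n positions other than s. Since s is
        # non-terminal, its own term in `total` is just gamma * v[s].
        q_b = (total - gamma * v[s]) / n
        return q_a, q_b

    v = [0.0] * (n + 1)  # v[0] and v[n] stay 0 (terminal states)
    for _ in range(max_iter):
        total = backup_total(v)
        new_v = [0.0] + [max(q_values(s, v, total)) for s in range(1, n)] + [0.0]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break

    # Greedy policy with respect to the converged values.
    total = backup_total(v)
    policy = {}
    for s in range(1, n):
        q_a, q_b = q_values(s, v, total)
        policy[s] = "A" if q_a >= q_b else "B"
    return v[1:n], policy


results = {g: frog_value_iteration(12, g) for g in (1.0, 0.99, 0.1)}
for g, (values, policy) in results.items():
    # One letter per state 1..11, to compare with the policies printed above.
    print(g, "".join(policy[s] for s in range(1, 12)))
```

With $\gamma = 1$ the iteration converges slowly, because the frog can wander in the middle of the pond for a long time; with $\gamma < 1$ it converges in a handful of sweeps.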

## 4. 

$$Q^*(s,a) = \mathcal{R}(s,a) = \frac{1}{\sqrt{2\pi}\sigma}\cdot \int_{s'\in \mathbb{R}}~e^{as'}\cdot e^{-\frac{(s'-s)^2}{2 \sigma^2}}\,ds'  $$
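
Reading the integrand as $e^{as'}$ weighted by the $\mathcal{N}(s, \sigma^2)$ density, the integral is the moment generating function of a normal variable and has a closed form. The short derivation below completes the square in the exponent. The final remark about the minimizing action assumes that $e^{as'}$ is a cost to be minimized, which this section does not state explicitly.

$$as' - \frac{(s'-s)^2}{2\sigma^2} = -\frac{\left(s' - (s + a\sigma^2)\right)^2}{2\sigma^2} + as + \frac{a^2\sigma^2}{2}$$

The first term integrates to $\sqrt{2\pi}\sigma$ against $ds'$, since it is an unnormalized Gaussian centered at $s + a\sigma^2$, so

$$\mathcal{R}(s,a) = e^{as + \frac{a^2\sigma^2}{2}}$$

Under the cost assumption, the exponent $as + \frac{a^2\sigma^2}{2}$ is a convex quadratic in $a$ with derivative $s + a\sigma^2$, which gives

$$a^* = -\frac{s}{\sigma^2}, \qquad \mathcal{R}(s,a^*) = e^{-\frac{s^2}{2\sigma^2}}$$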