<h1>Dynamic Programming</h1>

# 0. Policy Evaluation (Prediction)

## Environment
The environment is a 4×4 gridworld whose cells are numbered as follows:

> 0, 1, 2, 3

> 4, 5, 6, 7

> 8, 9, 10, 11

> 12, 13, 14, 15

## State
The states are $s \in \{0, 1, ..., 15\}$; states 0 and 15 are terminal.

## Action

The agent takes one of four actions: $a \in \{up, down, left, right\}$.

## Transition Probability

Transitions are deterministic. The probability of moving from $s$ to $s'$ under action $a$ is 1 if $s'$ is the cell adjacent to $s$ in the direction of $a$, or if $s' = s$ and the move would take the agent off the edge of the grid (the agent stays in place); otherwise it is 0.
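The rule above can be sketched as a small function. This is an illustrative helper, not part of the notebook: the name `next_state` and the constant `N` are hypothetical, and the action encoding (0: up, 1: down, 2: left, 3: right) is assumed to match the one used in the code cells. It does not treat terminal states specially.

```python
# Hypothetical helper: deterministic next state on the 4x4 grid.
# Actions are assumed to be encoded as 0: up, 1: down, 2: left, 3: right.
N = 4  # grid side length


def next_state(s, a):
    row, col = divmod(s, N)
    if a == 0:      # up
        row = max(row - 1, 0)
    elif a == 1:    # down
        row = min(row + 1, N - 1)
    elif a == 2:    # left
        col = max(col - 1, 0)
    else:           # right
        col = min(col + 1, N - 1)
    # Clamping keeps the agent in place when a move would leave the grid.
    return row * N + col


print(next_state(5, 0))  # 1: one cell up from state 5
print(next_state(3, 3))  # 3: moving right from the right edge stays in place
```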

## Reward

If the next state $s'$ is a terminal state, the reward is 0; otherwise it is -1.

## Policy Evaluation, $V_{\pi}(s)$

+ Initialize

> $V(s) = 0, \forall s \in \mathcal{S}^{+}$

+ Repeat

> $\Delta \leftarrow 0$

> For each $s \in \mathcal{S}$:

> $v \leftarrow V(s)$

> $V(s) \leftarrow \sum_{a}\pi(a|s) \sum_{s', r}p(s', r|s, a)[r + \gamma V(s')]$

> $\Delta \leftarrow \max(\Delta, |v - V(s)|)$

+ Until $\Delta < \theta$ (a small positive threshold)
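
As a worked instance of the update above, consider the first sweep at state 1 on this grid, assuming a uniform random policy $\pi(a|s) = 1/4$ and $\gamma = 1$ (the values the code in this notebook uses), with every $V(s)$ still at its initial value of 0. The four actions lead to states 1 (up, blocked by the edge), 5 (down), 0 (left), and 2 (right), and only the move into terminal state 0 earns reward 0:

$$V(1) \leftarrow \tfrac{1}{4}\big[(-1 + V(1)) + (-1 + V(5)) + (0 + V(0)) + (-1 + V(2))\big] = \tfrac{1}{4}(-3) = -0.75$$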

In [1]:
import numpy as np 

# environment
env = np.array([[0, 1, 2, 3],
                [4, 5, 6, 7],
                [8, 9, 10, 11],
                [12, 13, 14, 15]])

In [2]:
# state 
states = np.arange(16)  # named `states` so the loop variable `s` below does not overwrite it

In [3]:
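# transition probability
n_s = 16  # number of states
n_a = 4   # number of actions; 0: up, 1: down, 2: left, 3: right
p = np.zeros((n_s, n_a, n_s))  # indexed [s, a, s']

# The uniform random policy pi(a|s) = 1/4 is folded into p,
# so p[s][a][s'] stores pi(a|s) * p(s'|s, a).
prob = 1. / 4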

#up 
for s in range(4):
    p[s][0][s] = prob
for s in range(4, 16):
    p[s][0][s-4]= prob
    
#down
for s in range(12, 16):
    p[s][1][s] = prob
for s in range(0, 12):
    p[s][1][s+4] = prob
    
# left 
for s in range(0, 16, 4):
    p[s][2][s] = prob
for s in range(1,16):
    if s % 4 != 0:
        p[s][2][s-1] = prob

# right
for s in range(3, 16, 4):
    p[s][3][s] = prob
for s in range(0, 16):
    if s % 4 != 3:
        p[s][3][s+1] = prob
        
# terminal states: no transitions leave states 0 and 15, so V stays 0 there
for terminal in (0, 15):
    p[terminal, :, :] = 0

In [4]:
# reward function
def reward_func(s, a, s1):
    if s1 == 0 or s1 == 15:
        return 0
    return -1

In [5]:
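def expect_value(s, gamma, V, p):
    """
    Expected one-step return from state s under the uniform random policy.

    s: current state
    gamma: discount factor
    V: current value function estimate
    p: transition array indexed [s, a, s1]; the policy probability
       pi(a|s) = 1/4 is already folded into its entries
    """
    e_v = 0.0
    for s1 in range(16):
        for a in range(4):
            # p[s][a][s1] = pi(a|s) * p(s1|s, a), so no separate policy factor
            p_s_a_s1 = p[s][a][s1]
            r = reward_func(s, a, s1)
            e_v += p_s_a_s1 * (r + gamma * V[s1])

    return e_v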

In [8]:
# value function
V = np.zeros(16)
gamma = 1.0

def evaluate_value_func(V, gamma):
    """Run one in-place sweep over all states; return V and the largest change."""
    delta = 0
    for s in range(16):
        v = V[s]
        V[s] = expect_value(s, gamma, V, p)
        delta = max(delta, abs(v - V[s]))
    return V, delta

In [10]:
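delta = 1e10     # start above the threshold so the loop runs at least once
epsilon = 1e-3   # convergence threshold (theta in the pseudocode)
while delta > epsilon:
    V, delta = evaluate_value_func(V, gamma)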

In [11]:
V

array([  0.        , -12.99419182, -18.99164997, -20.99080924,
       -12.99419182, -16.9928726 , -18.99226129, -18.99234977,
       -18.99164997, -18.99226129, -16.99346994, -12.99512458,
       -20.99080924, -18.99234977, -12.99512458,   0.        ])

In [12]:
delta

0.0008407261041938341
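
As a cross-check on the iterative result, the same values can be obtained by solving the Bellman equations directly as a linear system. This is a different technique from the sweep used above, shown only as a sketch. It assumes the same setup (uniform random policy, $\gamma = 1$, reward 0 on entering a terminal state and -1 otherwise), and the names `next_state`, `P`, `r`, and `V_exact` are hypothetical, not part of the notebook.

```python
import numpy as np

N = 4                      # grid side length
n_s = N * N                # number of states
terminal = (0, n_s - 1)    # terminal states 0 and 15


def next_state(s, a):
    """Deterministic move; actions 0: up, 1: down, 2: left, 3: right."""
    row, col = divmod(s, N)
    if a == 0:
        row = max(row - 1, 0)
    elif a == 1:
        row = min(row + 1, N - 1)
    elif a == 2:
        col = max(col - 1, 0)
    else:
        col = min(col + 1, N - 1)
    return row * N + col


P = np.zeros((n_s, n_s))   # state-to-state matrix under the uniform policy
r = np.zeros(n_s)          # expected one-step reward per state
for s in range(n_s):
    if s in terminal:
        continue           # terminal rows stay zero, so V = 0 there
    for a in range(4):
        s1 = next_state(s, a)
        P[s, s1] += 0.25
        r[s] += 0.25 * (0.0 if s1 in terminal else -1.0)

# With gamma = 1 the Bellman equation V = r + P V becomes (I - P) V = r.
V_exact = np.linalg.solve(np.eye(n_s) - P, r)
print(V_exact.reshape(N, N).round(2))
```

Under these assumptions the solve gives -13 for state 1 and -21 for state 3. The iterative estimates above (about -12.994 and -20.991) have not quite reached these values because the loop stops once a sweep changes no value by more than the tolerance.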