In [1]:
from lib.util import *
from lib.mdp import *
from lib.policy import *

# Markov Decision Processes

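A Markov decision process (MDP) is a tuple $(S, A, P, R, \gamma)$: a set of states $S$, a finite set of actions $A$, a state transition probability function $P$ with $P(s, s', a) = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$, a reward function $R$ s.t. $R(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$, and a discount factor $\gamma \in [0, 1]$. An MDP is a Markov reward process (MRP) whose transitions and rewards also depend on the chosen action.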

A policy $\pi$ is a probability distribution over actions given a state: $\pi(a\mid s) = \mathbb{P}[A_t = a \mid S_t = s]$

The value function $v_\pi$ of a policy $\pi$ gives the expected return $G_t = \sum_{k \ge 0} \gamma^k R_{t+k+1}$ obtained by starting in state $s$ and following $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

Bellman Expectation Equations: 
$$v_\pi(s) = \sum_{a\in A} \pi(a\mid s) q_\pi(s,a) $$
$$q_\pi(s,a) = R(s, a) + \gamma \sum_{s' \in S} P(s, s', a) v_\pi(s') $$
$$v_\pi(s) = \sum_{a\in A} \pi(a\mid s) [R(s, a) + \gamma \sum_{s' \in S} P(s, s', a) v_\pi(s')] $$
$$q_\pi(s,a) = R(s, a) + \gamma \sum_{s' \in S} P(s, s', a) [\sum_{a'\in A} \pi(a'\mid s') q_\pi(s',a')] $$
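
The Bellman expectation equations can be turned into a fixed-point iteration. The sketch below is a minimal NumPy version, not the implementation in `lib.mdp`. The array layout (`P[a, s, s2]` for $P(s, s', a)$, `R[s, a]`, `pi[s, a]`) and the function name `evaluate_policy` are assumptions made for illustration.

```python
import numpy as np


def evaluate_policy(P, R, pi, gamma, tol=1e-10, max_iter=10_000):
    """Iterative policy evaluation (illustrative sketch).

    Assumed layout (hypothetical, not taken from lib.mdp):
      P[a, s, s2] = probability of moving from s to s2 under action a
      R[s, a]     = expected immediate reward for action a in state s
      pi[s, a]    = probability that the policy picks action a in state s
    """
    v = np.zeros(R.shape[0])
    for _ in range(max_iter):
        # q(s, a) = R(s, a) + gamma * sum_{s2} P(s, s2, a) * v(s2)
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        # v(s) = sum_a pi(a | s) * q(s, a)
        v_new = np.sum(pi * q, axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```

For $\gamma < 1$ the update is a contraction, so the iterates converge to $v_\pi$ from any starting vector.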

Matrix Form of the Bellman Expectation Equation:
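$$v_\pi = R^\pi + \gamma P^\pi v_\pi $$
where $R^\pi(s) = \sum_{a\in A} \pi(a\mid s) R(s, a)$ and $P^\pi(s, s') = \sum_{a\in A} \pi(a\mid s) P(s, s', a)$.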
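
Because the matrix form is linear in $v_\pi$, it can be solved directly as $(I - \gamma P^\pi) v_\pi = R^\pi$, with $P^\pi$ and $R^\pi$ the policy-averaged transition matrix and reward vector. The following is a sketch under the same assumed array layout as elsewhere in these notes (`P[a, s, s2]`, `R[s, a]`, `pi[s, a]`). The name `solve_policy_value` is hypothetical.

```python
import numpy as np


def solve_policy_value(P, R, pi, gamma):
    """Closed-form policy evaluation via a linear solve (illustrative sketch).

    Assumed layout (hypothetical): P[a, s, s2], R[s, a], pi[s, a].
    """
    # R_pi(s) = sum_a pi(a | s) * R(s, a)
    R_pi = np.sum(pi * R, axis=1)
    # P_pi(s, s2) = sum_a pi(a | s) * P(s, s2, a)
    P_pi = np.einsum("sa,ast->st", pi, P)
    n_states = R_pi.shape[0]
    # Solve (I - gamma * P_pi) v = R_pi instead of forming the inverse.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
```

The system is guaranteed to be non-singular for $\gamma < 1$. The direct solve costs roughly $O(|S|^3)$, so it is practical only for small state spaces such as the 7-state example used here.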

Bellman Optimality Equations: 
$$v_*(s) = \max_a{q_*(s,a)} $$
$$q_*(s,a) = R(s, a)  + \gamma \sum_{s' \in S} P(s, s', a) v_*(s') $$
$$v_*(s) = \max_a [R(s, a)  + \gamma \sum_{s' \in S} P(s, s', a) v_*(s')] $$
$$q_*(s,a) = R(s, a)  + \gamma \sum_{s' \in S} P(s, s', a) \max_{a'}{q_*(s',a')} $$
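
The optimality equations are nonlinear because of the $\max$, so they are usually solved iteratively. Below is a minimal value-iteration sketch that applies the third equation as an update rule and then reads off a greedy policy. It is not the code behind `mdp.value_iteration()`. The array layout (`P[a, s, s2]`, `R[s, a]`) and the function name are assumptions for illustration.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Value iteration (illustrative sketch).

    Assumed layout (hypothetical): P[a, s, s2], R[s, a].
    Returns the estimated optimal values and one greedy action per state.
    """
    v = np.zeros(R.shape[0])
    for _ in range(max_iter):
        # q(s, a) = R(s, a) + gamma * sum_{s2} P(s, s2, a) * v(s2)
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        # v(s) = max_a q(s, a)
        v_new = q.max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    # Greedy policy with respect to the final value estimate.
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    return v, q.argmax(axis=1)


# Tiny example: action 0 stays put, action 1 switches state;
# only state 1 pays a reward of 1.
P_demo = np.array([[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]])
R_demo = np.array([[0.0, 0.0], [1.0, 1.0]])
v_demo, greedy_demo = value_iteration(P_demo, R_demo, gamma=0.5)
```

In the small example the optimal behavior is to move to state 1 and stay there, so the values come out close to $v_*(0) = 1$ and $v_*(1) = 2$.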

In [2]:
n = 7
gamma = 0.8

In [3]:
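# Random transition matrix and reward vector for a single MRP
P = generate_stochastic_matrix(n)
R = generate_reward_vector(n)
mrp = MRP(P, R, gamma)
# The same MRP is passed once per action, so all n actions appear to share identical dynamics
mdp = MDP(gamma, [mrp]*n)
# Random stochastic matrix used as a policy (presumably row s holds pi(. | s))
Q = generate_stochastic_matrix(n)
policy = Policy(Q)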

print(mdp.policy_evaluation(policy))

defaultdict(<class 'float'>, {0: 2.647308912662474, 1: 1.8782294989002593, 2: 2.079261911183475, 3: 2.5849648134043486, 4: 2.671004722344322, 5: 2.0978024808783515, 6: 2.5936520291608796})


In [4]:
print(mdp.policy_iteration())

[[1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]] defaultdict(<class 'float'>, {0: 1.2512319020532376, 1: 0.622665096427804, 2: 0.7973113677967644, 3: 1.396782264669143, 4: 1.5580610022340229, 5: 1.040508183309656, 6: 1.5943621080375014})
(<lib.policy.DeterministicPolicy object at 0x7f2a680ed278>, defaultdict(<class 'float'>, {0: 1.2512319020532376, 1: 0.622665096427804, 2: 0.7973113677967644, 3: 1.396782264669143, 4: 1.5580610022340229, 5: 1.040508183309656, 6: 1.5943621080375014}))


In [5]:
print(mdp.value_iteration())

[2.30089561 2.12821118 2.26553762 2.24268159 2.23553333 2.18966034
 2.2508692 ]
