## Homework 2: Markov Decision Processes (MDP), Value Function, and Bellman Equations

### Bellman equation for MRP Value Function
**1. Bellman equation:** $v = R + \gamma P v$

**2. Solve by Matrix Inversion:** $v = (I - \gamma P)^{-1}R$, which requires $I - \gamma P$ to be invertible.

In [6]:
from src.mrp import MRP

transitions = {
        1: {1: 0.6, 2: 0.3, 3: 0.1}, 
        2: {1: 0.1, 2: 0.2, 3: 0.7},
        3: {3: 1.0}
    }
reward = {1: 7.0, 2:10.0, 3:0.0}
gamma = 1.0
mrp_obj = MRP(transitions, reward, gamma)
print("Non-terminal states:")
print(mrp_obj.get_states())
print("The transition matrix without terminal states:")
print(mrp_obj.get_trans_matrix())
print("Bellman equation solution for MRP value function: ")
print(mrp_obj.valueFun())

Non-terminal states:
{1, 2, 3}
The transition matrix without terminal states:
[[0.6 0.3]
 [0.1 0.2]]
Bellman equation solution for MRP value function: 
[29.65517241 16.20689655]
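
The value vector above can be cross-checked with plain NumPy. The sketch below is independent of the `MRP` class and copies the restricted transition matrix and the rewards of states 1 and 2 from the cell above. State 3 is left out because it is absorbing with zero reward: with $\gamma = 1$ its row would make $I - \gamma P$ singular.

```python
import numpy as np

# Transition matrix restricted to the non-terminal states 1 and 2,
# copied from the printed output above.
P = np.array([[0.6, 0.3],
              [0.1, 0.2]])
R = np.array([7.0, 10.0])
gamma = 1.0

# Solve (I - gamma * P) v = R directly instead of forming the inverse.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```

Using `np.linalg.solve` rather than an explicit inverse is a design choice made here for numerical stability; both give the same vector for this small system.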


### Markov Decision Process
**1. Definition of MDP:** A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov. It is represented as a tuple $\langle \mathcal { S } , \mathcal { A } , \mathcal { P } , \mathcal { R } , \gamma \rangle$. Compared with an MRP, the tuple adds $\mathcal { A }$, a finite set of actions, and the transitions and rewards now depend on the action taken: $\mathcal { P } _ { s s ^ { \prime } } ^ { a } = \mathbb { P } \left[ S _ { t + 1 } = s ^ { \prime } | S _ { t } = s , A _ { t } = a \right]$ is the state transition probability matrix and $\mathcal { R } _ { s } ^ { a } = \mathbb { E } \left[ R _ { t + 1 } | S _ { t } = s , A _ { t } = a \right]$ is the reward function.

**2. Policy:** A policy $\pi$ is a distribution over actions given states: $\pi ( a | s ) = \mathbb { P } \left[ A _ { t } = a | S _ { t } = s \right]$.

**3. Value function:**

   (1)  The state-value function $v _ { \pi } ( s ) = \mathbb { E } _ { \pi } \left[ G _ { t } | S _ { t } = s \right]$ is the expected return starting from state $s$ and then following policy $\pi$.
   
   (2)  The action-value function $q _ { \pi } ( s , a ) = \mathbb { E } _ { \pi } \left[ G _ { t } | S _ { t } = s , A _ { t } = a \right]$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$.
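
Both value functions are expectations of the return $G_t$. As a minimal sketch, assuming the usual discounted definition $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ (not spelled out in this notebook), the return of one finite episode can be computed as follows. The helper name `discounted_return` and the sample rewards are illustrative only.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite episode."""
    g = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# 7 + 0.5 * 10 + 0.25 * 0
print(discounted_return([7.0, 10.0, 0.0], gamma=0.5))
```

Averaging this quantity over many episodes sampled under $\pi$ from a fixed start state would give an estimate of $v_\pi(s)$.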

**4. Bellman equation:**

   (1) $v _ { \pi } ( s ) = \sum _ { a \in \mathcal { A } } \pi ( a | s ) q _ { \pi } ( s , a )$
   
   (2) $q _ { \pi } ( s , a ) = \mathcal { R } _ { s } ^ { a } + \gamma \sum _ { s ^ { \prime } \in \mathcal { S } } \mathcal { P } _ { s s ^ { \prime } } ^ { a } v _ { \pi } \left( s ^ { \prime } \right)$
   
   (3) $v _ { \pi } ( s ) = \sum _ { a \in \mathcal { A } } \pi ( a | s ) \left( \mathcal { R } _ { s } ^ { a } + \gamma \sum _ { s ^ { \prime } \in \mathcal { S } } \mathcal { P } _ { s s ^ { \prime } } ^ { a } v _ { \pi } \left( s ^ { \prime } \right) \right)$
   
   (4) $q _ { \pi } ( s , a ) = \mathcal { R } _ { s } ^ { a } + \gamma \sum _ { s ^ { \prime } \in \mathcal { S } } \mathcal { P } _ { s s ^ { \prime } } ^ { a } \sum _ { a ^ { \prime } \in \mathcal { A } } \pi \left( a ^ { \prime } | s ^ { \prime } \right) q _ { \pi } \left( s ^ { \prime } , a ^ { \prime } \right)$
   
   (5) $v _ { * } ( s ) = \max _ { a } q _ { * } ( s , a )$
   
   (6) $q _ { * } ( s , a ) = \mathcal { R } _ { s } ^ { a } + \gamma \sum _ { s ^ { \prime } \in \mathcal { S } } \mathcal { P } _ { s s ^ { \prime } } ^ { a } v _ { * } \left( s ^ { \prime } \right)$
   
   (7) $v _ { * } ( s ) = \max _ { a } \left( \mathcal { R } _ { s } ^ { a } + \gamma \sum _ { s ^ { \prime } \in \mathcal { S } } \mathcal { P } _ { s s ^ { \prime } } ^ { a } v _ { * } \left( s ^ { \prime } \right) \right)$
   
   (8) $q _ { * } ( s , a ) = \mathcal { R } _ { s } ^ { a } + \gamma \sum _ { s ^ { \prime } \in \mathcal { S } } \mathcal { P } _ { s s ^ { \prime } } ^ { a } \max _ { a ^ { \prime } } q _ { * } \left( s ^ { \prime } , a ^ { \prime } \right)$
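
The optimality equations (5) to (7) have no closed-form solution because of the max, but they can be solved by fixed-point iteration. The following is a minimal value-iteration sketch, not part of the `src` package: the function names `q_from_v` and `value_iteration` are hypothetical, the nested dictionaries use the same layout as the MDP cells in this notebook, and the discount `GAMMA = 0.9` is an assumption chosen below 1 so that the update is a contraction.

```python
# P[s][a][s'] is a transition probability, R[s][a] an expected reward.
P = {
    1: {'a': {1: 0.2, 2: 0.6, 3: 0.2},
        'b': {1: 0.6, 2: 0.3, 3: 0.1},
        'c': {1: 0.1, 2: 0.2, 3: 0.7}},
    2: {'a': {1: 0.1, 2: 0.6, 3: 0.3},
        'c': {1: 0.6, 2: 0.2, 3: 0.2}},
    3: {'b': {3: 1.0}},
}
R = {
    1: {'a': 7.0, 'b': -2.0, 'c': 10.0},
    2: {'a': 1.0, 'c': -1.2},
    3: {'b': 0.0},
}
GAMMA = 0.9  # assumed discount factor for this sketch


def q_from_v(v, s, a):
    """Equation (6): one-step lookahead from a state-value estimate."""
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in P[s][a].items())


def value_iteration(tol=1e-10, max_iter=10_000):
    """Apply equation (7) repeatedly until the largest update is below tol."""
    v = {s: 0.0 for s in P}
    for _ in range(max_iter):
        new_v = {s: max(q_from_v(v, s, a) for a in P[s]) for s in P}
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            break
    return v


v_star = value_iteration()
# Equation (5): an optimal action maximises q_*(s, a) in each state.
greedy = {s: max(P[s], key=lambda a: q_from_v(v_star, s, a)) for s in P}
print(v_star)
print(greedy)
```

Replacing the `max` over actions with a $\pi(a|s)$-weighted sum turns the same loop into iterative policy evaluation for equation (3).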
   
   
### Class Design for MDP

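**1. Transform to MRP:** Fixing a policy $\pi$ turns the MDP into an MRP whose transitions and rewards are averaged over the policy's action probabilities:

$\mathcal{P}^{\pi}_{ss'} = \sum_{a} \pi(a|s) \mathcal{P}_{ss'}^a, \quad \mathcal{R}^{\pi}_{s} = \sum_{a} \pi(a|s) \mathcal{R}_{s}^a$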

In [8]:
from src.mdp import MDP
mdp_data = {
        1: {
            'a': {1: 0.2, 2: 0.6, 3: 0.2},
            'b': {1: 0.6, 2: 0.3, 3: 0.1},
            'c': {1: 0.1, 2: 0.2, 3: 0.7}
        },
        2: {
            'a': {1: 0.1, 2: 0.6, 3: 0.3},
            'c': {1: 0.6, 2: 0.2, 3: 0.2}
        },
        3: {
            'b': {3: 1.0}
        }
    }

reward = {
        1: {
            'a': 7.0,
            'b': -2.0,
            'c': 10.0
        },
        2: {
            'a': 1.0,
            'c': -1.2
        },
        3: {
            'b':  0.0
        }
    }

policy_data = {
        1: {'a': 0.4, 'b': 0.6},
        2: {'a': 0.7, 'c': 0.3},
        3: {'b': 1.0}
    }

gamma = 1
mdp = MDP(mdp_data, reward, gamma)
print("The MDP process and reward: ")
print(mdp.process)
print(mdp.state_reward)
mrp = mdp.get_mrp(policy_data)
print("The MRP process and reward given a policy: ")
print(mrp.process)
print(mrp.state_reward)

The MDP process and reward: 
{1: {'a': {1: 0.2, 2: 0.6, 3: 0.2}, 'b': {1: 0.6, 2: 0.3, 3: 0.1}, 'c': {1: 0.1, 2: 0.2, 3: 0.7}}, 2: {'a': {1: 0.1, 2: 0.6, 3: 0.3}, 'c': {1: 0.6, 2: 0.2, 3: 0.2}}, 3: {'b': {3: 1.0}}}
{1: {'a': 7.0, 'b': -2.0, 'c': 10.0}, 2: {'a': 1.0, 'c': -1.2}, 3: {'b': 0.0}}
The MRP process and reward given a policy: 
{1: {1: 0.44, 2: 0.42, 3: 0.14}, 2: {1: 0.25, 2: 0.48, 3: 0.27}, 3: {3: 1.0}}
{1: 1.6000000000000003, 2: 0.33999999999999997, 3: 0.0}
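
The MRP printed above can be reproduced without the `MDP` class. The sketch below is one possible way to do the policy-weighted averaging; the function name `induced_mrp` is hypothetical and is not claimed to match how `get_mrp` is implemented in `src.mdp`. The data is copied from the cell above.

```python
def induced_mrp(process, reward, policy):
    """Average transitions and rewards over the policy's action probabilities."""
    mrp_process, mrp_reward = {}, {}
    for s, action_probs in policy.items():
        mrp_process[s] = {}
        mrp_reward[s] = 0.0
        for a, pi_sa in action_probs.items():
            # Policy-weighted expected reward in state s.
            mrp_reward[s] += pi_sa * reward[s][a]
            # Policy-weighted probability of each successor state.
            for s2, p in process[s][a].items():
                mrp_process[s][s2] = mrp_process[s].get(s2, 0.0) + pi_sa * p
    return mrp_process, mrp_reward


mdp_data = {
    1: {'a': {1: 0.2, 2: 0.6, 3: 0.2},
        'b': {1: 0.6, 2: 0.3, 3: 0.1},
        'c': {1: 0.1, 2: 0.2, 3: 0.7}},
    2: {'a': {1: 0.1, 2: 0.6, 3: 0.3},
        'c': {1: 0.6, 2: 0.2, 3: 0.2}},
    3: {'b': {3: 1.0}},
}
reward = {
    1: {'a': 7.0, 'b': -2.0, 'c': 10.0},
    2: {'a': 1.0, 'c': -1.2},
    3: {'b': 0.0},
}
policy_data = {
    1: {'a': 0.4, 'b': 0.6},
    2: {'a': 0.7, 'c': 0.3},
    3: {'b': 1.0},
}

mrp_process, mrp_reward = induced_mrp(mdp_data, reward, policy_data)
print(mrp_process)
print(mrp_reward)
```

Actions that the policy never selects, such as `c` in state 1, simply do not contribute to the averages.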


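**2. Convert from a different definition:** When rewards are specified per transition as $r(s,s',a)$, the expected reward $R(s,a)$ used above is recovered by averaging over the next state:

$R(s,a) = \sum_{s'} \mathcal{P}_{ss'}^a r(s,s',a)$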

In [9]:
from src.my_funcs import SASf, SAf

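def cast(rss: SASf, pr: SASf) -> SAf:
    """Convert rewards r(s, s', a) into expected rewards R(s, a).

    rss[s][a][s'] is the reward for the transition s -> s' under action a,
    and pr[s][a][s'] is the probability of that transition.
    """
    rs = {}
    for state, action_probs in pr.items():
        rs[state] = {}
        for action, next_probs in action_probs.items():
            # Expected reward: sum over s' of Pr(s' | s, a) * r(s, s', a).
            rs[state][action] = sum(
                p * rss[state][action][s_next]
                for s_next, p in next_probs.items()
            )
    return rs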

mdp_data = {
        1: {
            'a': {1: 0.2, 2: 0.6, 3: 0.2},
            'b': {1: 0.6, 2: 0.3, 3: 0.1},
            'c': {1: 0.1, 2: 0.2, 3: 0.7}
        },
        2: {
            'a': {1: 0.1, 2: 0.6, 3: 0.3},
            'c': {1: 0.6, 2: 0.2, 3: 0.2}
        },
        3: {
            'b': {3: 1.0}
        }
    }

reward1 = {
        1: {
            'a': 7.0,
            'b': -2.0,
            'c': 10.0
        },
        2: {
            'a': 1.0,
            'c': -1.2
        },
        3: {
            'b':  0.0
        }
    }

reward2 = {
        1: {
            'a': {1: 10, 2: 10, 3: -5},
            'b': {1: -10, 2: 10, 3: 10},
            'c': {1: 10, 2: 0, 3: 0}
        },
        2: {
            'a': {1: 10, 2: 0, 3: 0},
            'c': {1: 0, 2: -6, 3: 0}
        },
        3: {
            'b': {3: 0}
        }
    }

# Rewards first, probabilities second, matching the signature cast(rss, pr).
reward_cast = cast(reward2, mdp_data)
print(reward_cast)

{1: {'a': 7.0, 'b': -2.0, 'c': 1.0}, 2: {'a': 1.0, 'c': -1.2000000000000002}, 3: {'b': 0.0}}
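
As a hand check of the conversion formula, the entry for state 1 and action `a` is

$R(1, a) = 0.2 \cdot 10 + 0.6 \cdot 10 + 0.2 \cdot (-5) = 7.0$

which matches the printed value. Up to floating-point rounding, the printed dictionary agrees with `reward1` in every entry except state 1, action `c`, where `reward2` gives $0.1 \cdot 10 + 0.2 \cdot 0 + 0.7 \cdot 0 = 1.0$ rather than $10.0$.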
