A Markov Process is defined by:

   - A set of states $\mathcal S$ 

   - A transition probability matrix $\mathcal P_{ss'}$, which gives the probability of moving from state $\mathcal s$ to state $\mathcal s'$

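The pair $\langle \mathcal S, \mathcal P \rangle$ fully specifies the MP because of the Markov property, which says that the next state depends only on the current state and not on the earlier history: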

<center>
<br>
$\mathbb P[S_{t+1}=\mathcal{s}' \mid S_1,\ldots,S_{t}] = \mathbb P[S_{t+1}=\mathcal{s}' \mid S_{t}]$
</center>
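
To make the Markov property concrete, here is a minimal sketch of sampling a state sequence from a transition matrix: each step looks only at the current state's row. The helper `sample_trajectory`, the `seed` parameter and the matrix `P_example` are illustrative assumptions, not part of the implementation given later in this notebook.

```python
import numpy as np


def sample_trajectory(P, start, n_steps, seed=0):
    """Sample a state sequence from a Markov Process (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    states = [start]
    for _ in range(n_steps):
        current = states[-1]
        # Markov property: the next state is drawn from row `current` only.
        next_state = rng.choice(P.shape[0], p=P[current])
        states.append(int(next_state))
    return states


# Illustrative 3-state transition matrix; every row sums to 1.
P_example = np.array([[0.0, 0.0, 1.0],
                      [0.5, 0.5, 0.0],
                      [0.3, 0.3, 0.4]])

path = sample_trajectory(P_example, start=0, n_steps=10)
print(path)
```

Since row 0 puts all its mass on state 2, any path starting in state 0 moves to state 2 at the first step; the later steps depend on the random seed.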

A Markov Reward Process is defined by:

   - A Markov Process $\langle \mathcal S, \mathcal P \rangle$

   - A discount factor $\gamma$
   
   - A reward function $\mathcal R(s) = \mathbb E[R_{t+1}|S_{t}=\mathcal s]$

The value function of an MRP is defined as the expected sum of future rewards starting from state $\mathcal s$, with the reward received $k$ steps ahead discounted by $\gamma^k$:

<center>
<br>
$V(\mathcal{s}) = \mathbb E[\sum_{k=0}^{\infty} {\gamma^k R_{t+k+1}} | S_t=\mathcal s]$
</center>
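
For a finite state space, this definition leads to the standard Bellman equation $V = \mathcal R + \gamma \mathcal P V$, which can be solved as a linear system when $\gamma < 1$. The sketch below is one way to do that with NumPy; the function `mrp_value` and the two-state example values are assumptions chosen for illustration and do not appear in the notebook's own classes.

```python
import numpy as np


def mrp_value(P, R, gamma):
    """Solve (I - gamma * P) V = R for V (hypothetical helper, assumes gamma < 1)."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)


# Illustrative two-state MRP: transition matrix, expected rewards, discount.
P_example = np.array([[0.9, 0.1],
                      [0.5, 0.5]])
R_example = np.array([1.0, 0.0])
gamma_example = 0.9

V = mrp_value(P_example, R_example, gamma_example)

# The solution should satisfy the Bellman equation V = R + gamma * P V.
print(np.allclose(V, R_example + gamma_example * P_example @ V))
```

With $\gamma = 0$ the system reduces to $V = \mathcal R$, which is a quick sanity check on the helper.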

In [77]:
import numpy as np

In [83]:
"""
Implementation of Markov Process

Attributes:
    transition(2d array): Transition matrix

Functions:
    computeStationnary(): Returns the stationary distribution of the MP as a 1d array
"""


class MP:
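    def __init__(self, transition):
        self.transition = transition

    def computeStationnary(self):
        # A stationary distribution pi satisfies pi P = pi, i.e. (P^T - I) pi^T = 0,
        # so it is an eigenvector of P^T - I for the eigenvalue 0.
        Ker = self.transition.transpose() - np.identity(self.transition.shape[0])
        values, vectors = np.linalg.eig(Ker)
        zeroIndex = np.argmin(abs(values))
        # eig returns a complex array if any eigenvalue is complex; keep the real part.
        zeroVect = np.real(vectors[:, zeroIndex])
        # Normalise so that the entries sum to 1.
        return zeroVect / np.sum(zeroVect)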

"""
Implementation of Markov Reward Process (Inherits from MP)

Attributes:
    transition(2d array): Transition matrix
    gamma(float): Discount factor
    reward(1d array): Expected next-step reward for each state

"""
    
    
class MRP_1(MP):
    def __init__(self, transition, reward, gamma):
        super().__init__(transition)
        self.reward = reward
        self.gamma = gamma

"""
Alternate Implementation of Markov Reward Process (Inherits from MP)

Modified Attributes:
    reward(2d array): Reward matrix for transition between each pair of states
    
Functions:
    cast(MRP_1 object): Cast the object into the previous implementation of MRP by computing
                        the expected reward using the transition matrix

"""
        
class MRP_2(MP):
    def __init__(self, transition, reward, gamma):
        super().__init__(transition)
        self.reward = reward
        self.gamma = gamma
    def cast(self):
        # Expected reward from state i: sum over j of P[i, j] * R[i, j].
        reward = np.sum(self.reward * self.transition, axis=1)
        return MRP_1(self.transition, reward, self.gamma)
        

In [97]:
R = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
P = np.array([[0, 0, 1], [0.5, 0.5, 0], [0.3, 0.3, 0.4]])
gamma = 0.9

In [98]:
m = MRP_2(P, R, gamma)
m.cast().reward

array([3. , 4.5, 8.1])

In [99]:
m.computeStationnary()

array([0.27272727, 0.27272727, 0.45454545])
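
As a sanity check on this result, the sketch below verifies that the printed vector is left unchanged by one transition step and compares it with a row of a high matrix power of `P`. The values of `pi` are copied from the output above; the choice of 50 steps and the tolerance `1e-6` are assumptions made for illustration.

```python
import numpy as np

P = np.array([[0, 0, 1], [0.5, 0.5, 0], [0.3, 0.3, 0.4]])
# Values copied from the output of computeStationnary() above.
pi = np.array([0.27272727, 0.27272727, 0.45454545])

# A stationary distribution is a probability vector with pi P = pi.
is_stationary = np.allclose(pi @ P, pi, atol=1e-6)
sums_to_one = np.isclose(pi.sum(), 1.0, atol=1e-6)

# Independent cross-check: for this chain the rows of P^n approach pi as n grows.
long_run = np.linalg.matrix_power(P, 50)[0]

print(is_stationary, sums_to_one, np.allclose(long_run, pi, atol=1e-6))
```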