## Homework 1: Markov Processes (MP) and Markov Reward Processes (MRP)

### MP definition
**Markov Process**: An MP is a tuple ($S$, $P$), where $S$ is a finite set of states and $P$ is the state transition probability matrix with entries $P_{ss'} = Pr[S_{t+1} = s' \mid S_t = s]$. Each row of $P$ sums to 1.

**Example: Student Markov Chain**: 

The state space is $S = \{Class1, Class2, Class3, Facebook, Pub, Pass, Sleep\}$. The rows and columns of the transition matrix follow this order.

\begin{equation*}
P_{ss'} = 
\begin{bmatrix}
0 & 0.5 & 0 & 0.5 & 0 & 0 & 0 \\
0 & 0 & 0.8 & 0 & 0 & 0 & 0.2 \\
0 & 0 & 0 & 0 & 0.4 & 0.6 & 0 \\
0.1 & 0 & 0 & 0.9 & 0 & 0 & 0 \\
0.2 & 0.4 & 0.4 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
\end{bmatrix}
\end{equation*}
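As a quick sanity check, the matrix above can be written as a NumPy array and tested for row sums of 1. The sketch below also samples one episode from the chain. The helper `sample_episode`, its arguments, and the seed are illustrative assumptions and are not part of the homework code.

```python
import numpy as np

# State order follows the set S given above.
states = ["Class1", "Class2", "Class3", "Facebook", "Pub", "Pass", "Sleep"]

P = np.array([
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0],  # Class1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # Class2
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.6, 0.0],  # Class3
    [0.1, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0],  # Facebook
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Pub
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep (absorbing)
])

# Every row of a transition matrix must be a probability distribution.
assert np.allclose(P.sum(axis=1), 1.0)


def sample_episode(P, states, start="Class1", terminal="Sleep",
                   seed=0, max_steps=1000):
    """Sample one trajectory until the terminal state (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = states.index(start)
    episode = [start]
    for _ in range(max_steps):
        if states[idx] == terminal:
            break
        # Draw the next state index from row idx of P.
        idx = int(rng.choice(len(states), p=P[idx]))
        episode.append(states[idx])
    return episode


episode = sample_episode(P, states)
print(" -> ".join(episode))
```

Because Sleep only transitions to itself, it acts as the terminal state, so the sampler stops there.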

### Class design for MP
**Design:**
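1. Data structure: An MP is stored as a dictionary of dictionaries. Each key is a state, and its value is a dictionary that maps each successor state to the corresponding transition probability. Together these describe both the state set and the transition matrix; zero-probability transitions can be omitted.
2. Stationary distribution: Calculate $\pi$ by solving $\pi = \pi P$ subject to $\sum_s \pi_s = 1$.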
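To illustrate the data structure, here is a minimal sketch of how a dictionary of dictionaries could be turned into a dense transition matrix. The function `dict_to_matrix` and the two-state toy chain are hypothetical; the actual `MP` class may do this differently. The sketch assumes every successor state also appears as a top-level key.

```python
import numpy as np


def dict_to_matrix(transitions):
    """Convert {state: {next_state: prob}} into (states, matrix).

    Hypothetical helper; missing entries are treated as probability 0.
    """
    states = sorted(transitions)
    index = {s: i for i, s in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    for s, row in transitions.items():
        for s_next, prob in row.items():
            P[index[s], index[s_next]] = prob
    return states, P


# Toy chain used only for illustration.
toy = {
    "A": {"A": 0.5, "B": 0.5},
    "B": {"A": 1.0},  # the B -> B entry is omitted, so it is 0
}
toy_states, toy_P = dict_to_matrix(toy)
print(toy_states)
print(toy_P)
```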

**Example:**
\begin{equation*}
P_{ss'} = 
\begin{bmatrix}
0.1 & 0.6 & 0.1 & 0.2 \\
0.25 & 0.22 & 0.24 & 0.29 \\
0.7 & 0.3 & 0 & 0 \\
0.3 & 0.5 & 0.2 & 0  \\
\end{bmatrix}
\end{equation*}

In [16]:
import numpy as np
from src.mp import MP

transitions = {
        1: {1: 0.1, 2: 0.6, 3: 0.1, 4: 0.2},
        2: {1: 0.25, 2: 0.22, 3: 0.24, 4: 0.29},
        3: {1: 0.7, 2: 0.3},
        4: {1: 0.3, 2: 0.5, 3: 0.2}
    }

mp_obj = MP(transitions)
print(mp_obj.all_state_list)
print(mp_obj.get_tran_mat())
stationary = mp_obj.stationary_distribution()
print(stationary)

[1, 2, 3, 4]
[[0.1  0.6  0.1  0.2 ]
 [0.25 0.22 0.24 0.29]
 [0.7  0.3  0.   0.  ]
 [0.3  0.5  0.2  0.  ]]
{1: 0.28574421284173046, 2: 0.38860374986906865, 3: 0.15580810725882485, 4: 0.16984393003037593}
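The printed stationary distribution can be cross-checked independently of the `MP` class. One common approach, sketched below, takes the left eigenvector of $P$ for eigenvalue 1 and normalizes it to sum to 1. This is not necessarily how `stationary_distribution()` is implemented.

```python
import numpy as np

# Transition matrix of the four-state example above.
P = np.array([
    [0.10, 0.60, 0.10, 0.20],
    [0.25, 0.22, 0.24, 0.29],
    [0.70, 0.30, 0.00, 0.00],
    [0.30, 0.50, 0.20, 0.00],
])

# pi = pi P means pi is a left eigenvector of P, i.e. an eigenvector of P^T,
# with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))   # pick the eigenvalue closest to 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                     # normalize to a probability vector

print(pi)
# Verify the fixed-point property numerically.
print(np.allclose(pi @ P, pi))
```

The result should agree with the dictionary printed above up to floating-point error.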




### MRP definition
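**Markov Reward Process**: An MRP is a Markov chain with values, which can be represented as a tuple ($S$, $P$, $R$, $\gamma$). In addition to the components of an MP, $R$ is a reward function, $R_s = \mathbb{E}[R_{t+1} \mid S_t = s]$, and $\gamma$ is a discount factor, $\gamma \in [0,1]$.

For the Student Markov Chain above, with the states in the same order as in $S$, the reward vector is: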

\begin{equation*}
R_{s} = 
\begin{bmatrix}
-2\\
-2\\
-2\\
-1\\
+1\\
+10\\
0\\
\end{bmatrix}
\end{equation*}
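Given $P$, $R$, and $\gamma$, the state values satisfy the Bellman equation $v = R + \gamma P v$, which is a linear system $(I - \gamma P) v = R$. The sketch below solves it for the Student MRP. The choice $\gamma = 0.9$ is an assumption made for illustration; any $\gamma < 1$ keeps the system non-singular here.

```python
import numpy as np

states = ["Class1", "Class2", "Class3", "Facebook", "Pub", "Pass", "Sleep"]

P = np.array([
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.6, 0.0],
    [0.1, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0],
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
R = np.array([-2.0, -2.0, -2.0, -1.0, 1.0, 10.0, 0.0])
gamma = 0.9  # assumed discount factor, not specified in the text above

# Bellman equation in matrix form: v = R + gamma * P v
# Rearranged: (I - gamma * P) v = R
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

for name, value in zip(states, v):
    print(f"{name:>8}: {value: .2f}")
```

Using `np.linalg.solve` avoids forming the matrix inverse explicitly, which is usually the more numerically stable choice.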

### Class design for MRP
**Design:**
1. Data structure: MRP is a subclass of MP with two additional inputs: the rewards and the discount factor. The rewards are given as a dictionary that maps each state to its reward value.
2. Converting between the two definitions of reward: the transition-based reward $r(s,s')$ and the state-based reward $R(s) = \sum_{s'} p(s,s') \, r(s,s')$.
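The conversion in item 2 is an expectation over successor states. A minimal sketch, assuming both the transition probabilities and the transition rewards are stored as dictionaries of dictionaries, could look like this. The function name `expected_rewards` and the toy numbers are hypothetical.

```python
def expected_rewards(transitions, transition_rewards):
    """Compute R(s) = sum_{s'} p(s, s') * r(s, s') for every state s.

    Hypothetical helper: both arguments map state -> {next_state: value}.
    """
    return {
        s: sum(p * transition_rewards[s][s_next] for s_next, p in row.items())
        for s, row in transitions.items()
    }


# Toy two-state example used only for illustration.
p = {1: {1: 0.6, 2: 0.4}, 2: {2: 1.0}}
r = {1: {1: 5.0, 2: 10.0}, 2: {2: 0.0}}

R = expected_rewards(p, r)
print(R)  # R(1) = 0.6 * 5 + 0.4 * 10 = 7, R(2) = 0
```

Going the other way, from $R(s)$ back to $r(s,s')$, is not unique in general, because many transition reward functions share the same expectation.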

Code Design for First Definition of MRP:

In [2]:
from src.mrp import MRP

transitions = {
        1: {1: 0.6, 2: 0.3, 3: 0.1}, 
        2: {1: 0.1, 2: 0.2, 3: 0.7},
        3: {3: 1.0}
    }
reward = {1: 7.0, 2: 10.0, 3: 0.0}
gamma = 1.0
mrp_obj = MRP(transitions, reward, gamma)
print(mrp_obj.get_states())
print(mrp_obj.get_trans_matrix())
print(mrp_obj.valueFun())


{1, 2, 3}
[[0.6 0.3]
 [0.1 0.2]]
[29.65517241 16.20689655]
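The printed values can be reproduced with plain NumPy. With $\gamma = 1$ the full system $(I - P) v = R$ is singular, because state 3 is absorbing. The sketch below therefore keeps only the non-terminal states 1 and 2 and treats the value of state 3 as 0, which matches the $2 \times 2$ matrix printed above. Whether `valueFun()` works exactly this way is an assumption.

```python
import numpy as np

# Transition probabilities among the non-terminal states 1 and 2 only.
P_nt = np.array([
    [0.6, 0.3],
    [0.1, 0.2],
])
R_nt = np.array([7.0, 10.0])  # rewards of states 1 and 2
gamma = 1.0

# Solve (I - gamma * P_nt) v = R_nt; the terminal state contributes value 0.
v = np.linalg.solve(np.eye(2) - gamma * P_nt, R_nt)
print(v)
```

By hand, $\det(I - P) = 0.4 \cdot 0.8 - 0.3 \cdot 0.1 = 0.29$, so $v_1 = 8.6 / 0.29 \approx 29.655$ and $v_2 = 4.7 / 0.29 \approx 16.207$, in agreement with the output above.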


Code Design for Second Definition of MRP: