***
# Markov Processes

A **Markov Process** is a tuple $ \langle \mathcal{S}, \mathcal{P} \rangle $ where $\mathcal{S}$ is the set of **states** that the process (or, later, an agent) can be in, called the **state space** or **observation space**, and $ \mathcal{P} : \mathcal{S}^2 \to \left[ 0, 1 \right]$ is a function giving the probability of transitioning from one state to another:
$$
\mathcal{P}(s_t, s_{t+1}) = \mathbb{P} \left[s_{t+1} \vert s_t \right]
$$
Markov Processes are used to model stochastic sequences of states $s_0, s_1, \dots$ satisfying the **Markov property**:
$$
\mathbb{P} \left[ s_{t+1} \vert s_t \right] = \mathbb{P} \left[ s_{t+1} \vert s_0, s_1, \dots, s_t \right]
$$
that is, the probability of transitioning from state $s_t$ to state $s_{t+1}$ depends only on the current state $s_t$, not on the states visited before it.

In [1]:
import numpy as np # library for fast array, matrix, and tensor-based operations

In [2]:
def generate_state_transition_matrix(num_states):
    P = np.random.rand(num_states, num_states)
    for i in range(num_states): # convert each row to a probability distribution by dividing by the total
        P[i] /= sum(P[i])
    return P

In [3]:
num_states = 4 # the number of states in our Markov Process. Our current state is just an integer from 0 to 3.

P = generate_state_transition_matrix(num_states)

In [4]:
print(P)

[[0.23266937 0.23323859 0.22250382 0.31158822]
 [0.12545893 0.3674417  0.43148244 0.07561693]
 [0.24065458 0.23092811 0.2597049  0.26871241]
 [0.71127352 0.08507722 0.00308815 0.2005611 ]]
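
As a quick sanity check, every row of a transition matrix should sum to 1, and a probability distribution over states can be pushed forward in time with a vector-matrix product. The following is a minimal sketch; the names `P_demo`, `mu_0`, and the seeded generator `rng` are illustrative assumptions and are not part of the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (an assumption, for reproducibility)

# Build a random row-stochastic matrix the same way as above, but vectorized.
P_demo = rng.random((4, 4))
P_demo /= P_demo.sum(axis=1, keepdims=True)

# Every row should be a probability distribution.
assert np.allclose(P_demo.sum(axis=1), 1.0)

# Start in state 0 with certainty and propagate the distribution over states.
mu_0 = np.array([1.0, 0.0, 0.0, 0.0])
mu_1 = mu_0 @ P_demo                               # distribution after one step
mu_10 = mu_0 @ np.linalg.matrix_power(P_demo, 10)  # distribution after ten steps

print(mu_1)
print(mu_10)
```

After one step the distribution is simply row 0 of the matrix, since we started in state 0.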


#### Trajectories

A **trajectory** (denoted $\tau$) for a Markov process is a (potentially infinite) sequence of states
$$
\tau = \langle s_0, s_1, \dots, s_T \rangle
$$
The probability of generating particular trajectories depends on the underlying dynamics of the process, which is determined by $\mathcal{P}$.
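
By the Markov property, the probability of a finite trajectory given its starting state factorizes into a product of one-step transition probabilities, $\prod_{t=0}^{T-1} \mathcal{P}(s_t, s_{t+1})$. The sketch below computes this in log space to avoid underflow on long trajectories; `trajectory_log_probability`, `P_small`, and `tau_small` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def trajectory_log_probability(P, tau):
    """Log-probability of the transitions in tau, conditioned on the first state."""
    tau = np.asarray(tau)
    step_probs = P[tau[:-1], tau[1:]]  # P(s_t, s_{t+1}) for each consecutive pair
    return np.log(step_probs).sum()

# A hand-written two-state process so the arithmetic is easy to follow.
P_small = np.array([[0.9, 0.1],
                    [0.5, 0.5]])
tau_small = [0, 0, 1, 1]

log_p = trajectory_log_probability(P_small, tau_small)
print(np.exp(log_p))  # the product 0.9 * 0.1 * 0.5
```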

In [5]:
def generate_trajectory(P, T=100, s_0 = 0):
    num_states = len(P) # the number of next states we may go to
    s_t = s_0 # the current state is the starting state
    tau = np.zeros(T, dtype=np.int32) # the trajectory, a fixed-length array of length T
    tau[0] = s_t # the first state is the starting state
    
    for t in range(1, T): # the length of the rest of the trajectory
        s_t = np.random.choice(np.arange(num_states), p=P[s_t]) # randomly select the next state using P, where entry j of P[s_t] is the probability of transitioning from state s_t to state j
        tau[t] = s_t # add this state to our trajectory.
        
    return tau

In [6]:
T = 100
s_0 = 0
tau = generate_trajectory(P, T, s_0)

In [7]:
print(tau)

[0 0 1 1 2 2 3 0 1 3 0 1 1 2 0 3 0 1 2 3 3 3 0 1 2 3 3 0 3 0 0 2 1 2 2 2 1
 0 3 1 1 2 1 1 1 2 2 3 0 1 0 1 1 1 1 3 0 2 3 0 3 0 3 3 0 0 0 2 0 0 0 3 0 0
 1 1 2 0 1 1 2 3 0 0 2 2 0 1 2 3 0 2 2 2 2 2 1 1 1 2]
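
Because trajectories are generated by $\mathcal{P}$, a long enough trajectory can be used to estimate $\mathcal{P}$ by counting observed transitions. This is a rough sketch under that assumption; `estimate_transition_matrix` and `tau_demo` are illustrative names, and rows for states that were never visited are left as zeros.

```python
import numpy as np

def estimate_transition_matrix(tau, num_states):
    """Estimate P from one trajectory by normalizing transition counts per row."""
    tau = np.asarray(tau)
    counts = np.zeros((num_states, num_states))
    np.add.at(counts, (tau[:-1], tau[1:]), 1)  # count each (s_t, s_{t+1}) pair
    row_totals = counts.sum(axis=1, keepdims=True)
    # Divide only where a state was actually visited; unvisited rows stay zero.
    return np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)

tau_demo = np.array([0, 1, 1, 0, 1, 0, 0])
P_hat = estimate_transition_matrix(tau_demo, num_states=2)
print(P_hat)
```

With the 100-step trajectory above the estimate would still be noisy; it approaches the true matrix only as the trajectory grows.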


***
# Markov Reward Processes

A **Markov Reward Process**  is an extension of a Markov Process that allows us to associate rewards with states. Formally, it is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R} \rangle$ that allows us to associate with each state transition $\langle s_t, s_{t+1} \rangle$ some reward
$$
\mathcal{R}(s_t, s_{t+1}) = \mathbb{E}\left[r_t \vert s_t, s_{t+1} \right]
$$
which is often simplified to $\mathcal{R}(s_t)$, the reward of being in a particular state $s_t$. For the purpose of this lesson, as in many implementations, we make this simplification.

In [8]:
def generate_reward_matrix(num_states, mu=0, sigma=1):
    return mu + np.random.randn(num_states)*sigma

In [9]:
num_states = 4
mu = 0
sigma = 4
P = generate_state_transition_matrix(num_states)
R = generate_reward_matrix(num_states, mu, sigma)

In [10]:
print(P)

[[0.36834299 0.05543805 0.18762746 0.38859149]
 [0.43803116 0.19719147 0.33558899 0.02918838]
 [0.30655611 0.22577853 0.13269723 0.33496813]
 [0.16759977 0.41435695 0.33891393 0.07912935]]


In [11]:
print(R)

[-0.25883375 -4.96230641  0.65818934  8.97092499]


In [12]:
def generate_trajectory(P, R, T=100, s_0 = 0):
    num_states = len(P) # the number of next states we may go to
    s_t = s_0 # the current state is the starting state
    tau = np.zeros(T, dtype=np.int32) # the trajectory, a fixed-length array of length T
    tau[0] = s_t # the first state is the starting state
    rewards = np.zeros(T) # the rewards experienced, a fixed-length array of length T
    rewards[0] = R[s_0] # the first reward is the reward for being in the first state
    
    for t in range(1, T): # the length of the rest of the trajectory
        s_t = np.random.choice(np.arange(num_states), p=P[s_t]) # randomly select the next state using P, where entry j of P[s_t] is the probability of transitioning from state s_t to state j
        tau[t] = s_t # add this state to our trajectory.
        rewards[t] = R[s_t]
        
    return tau, rewards

In [13]:
tau, rewards = generate_trajectory(P, R, 100, 0)

In [14]:
print(tau)

[0 0 0 3 2 3 2 0 0 2 0 3 0 2 3 1 2 0 3 0 2 3 1 0 0 0 3 2 3 2 1 3 2 0 2 0 2
 0 0 0 0 0 1 0 3 1 0 3 1 0 0 0 3 0 2 0 3 0 0 1 0 3 1 0 0 0 3 2 0 0 3 1 1 3
 2 0 0 3 0 0 0 2 2 0 0 3 2 0 3 0 3 0 0 3 1 1 2 3 2 3]


In [15]:
print(rewards)

[-0.25883375 -0.25883375 -0.25883375  8.97092499  0.65818934  8.97092499
  0.65818934 -0.25883375 -0.25883375  0.65818934 -0.25883375  8.97092499
 -0.25883375  0.65818934  8.97092499 -4.96230641  0.65818934 -0.25883375
  8.97092499 -0.25883375  0.65818934  8.97092499 -4.96230641 -0.25883375
 -0.25883375 -0.25883375  8.97092499  0.65818934  8.97092499  0.65818934
 -4.96230641  8.97092499  0.65818934 -0.25883375  0.65818934 -0.25883375
  0.65818934 -0.25883375 -0.25883375 -0.25883375 -0.25883375 -0.25883375
 -4.96230641 -0.25883375  8.97092499 -4.96230641 -0.25883375  8.97092499
 -4.96230641 -0.25883375 -0.25883375 -0.25883375  8.97092499 -0.25883375
  0.65818934 -0.25883375  8.97092499 -0.25883375 -0.25883375 -4.96230641
 -0.25883375  8.97092499 -4.96230641 -0.25883375 -0.25883375 -0.25883375
  8.97092499  0.65818934 -0.25883375 -0.25883375  8.97092499 -4.96230641
 -4.96230641  8.97092499  0.65818934 -0.25883375 -0.25883375  8.97092499
 -0.25883375 -0.25883375 -0.25883375  0.65818934  0

***
#### Return and Discounted Return

We are interested in trajectories that maximize the **return** $R_t$:
$$
\begin{align}
    R_t &= r_t + r_{t+1} +  r_{t+2}+  \dots + r_T  \\
    &= \sum_{k=t}^T r_k
\end{align}
$$

When $T$ is finite, we say that the trajectory has a **finite time horizon** and that the environment is **episodic** (happens in episodes).

For infinite time horizons, we cannot guarantee that $R_t$ converges. As a result, we might consider discounting rewards exponentially over time in order to guarantee convergence. This line of reasoning leads us to the **discounted return** $G_t$:

$$
\begin{align}
    G_t &= r_{t} + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \\
    &= \sum_{k=t}^\infty \gamma^{k-t} r_k 
\end{align}
$$
where $\gamma$ is a discount factor between $0$ and $1$ (often close to $1$). When $\gamma < 1$ and the rewards are bounded, this sum is guaranteed to converge.

We sometimes refer to both the discounted and undiscounted return as just "return" for brevity, and write $G_t$ where for some episodic environments it may be more appropriate to use $R_t$. In fact, it should not be hard to see that $R_t$ is just $G_t$ with $r_t = 0$ for $t > T$ and $\gamma = 1$.
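
For a finite list of rewards, the discounted return from every time step can be computed in a single backward pass, since $G_t = r_t + \gamma G_{t+1}$. The sketch below assumes rewards after the end of the list are zero; `discounted_returns` and `rewards_demo` are names made up for this example.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return the array [G_0, G_1, ..., G_{T-1}] for a finite reward sequence."""
    rewards = np.asarray(rewards, dtype=float)
    G = np.zeros(len(rewards))
    running = 0.0  # G_{t+1}, taken to be zero past the end of the sequence
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rewards_demo = [1.0, 2.0, 3.0]
G_all = discounted_returns(rewards_demo, gamma=0.5)  # discounted return G_t
R_all = discounted_returns(rewards_demo, gamma=1.0)  # undiscounted return R_t
print(G_all)
print(R_all)
```

Setting `gamma=1.0` recovers the undiscounted return, matching the remark above.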

***
#### Value Function

We can use the expected value of $G_t$ to determine the **value** of being in a certain state $s_t$:

$$
V(s_t) = \mathbb{E}\left[ G_t \big\vert s_t \right]
$$

We can decompose $V(s_t)$ into two parts: the immediate reward $r_t$ and the discounted value of being in the next state $s_{t+1}$:


$$
\begin{align}
    V(s_t) &= \mathbb{E} \left[ G_t \big\vert s_t \right] \\
           &= \mathbb{E} \left[ r_{t} + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \big\vert s_t \right] \\
           &= \mathbb{E}\left[r_{t} + \gamma (r_{t+1} + \gamma r_{t+2} + \dots) \big\vert s_t \right] \\
           &= \mathbb{E}\left[ r_{t} + \gamma G_{t+1} \big\vert s_t \right] \\
           &= \mathbb{E} \left[ r_{t} + \gamma V(s_{t+1}) \big\vert s_t \right]
\end{align}
$$

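This last form of $V(s_t)$ is known as the **Bellman Equation**. The final step replaces $G_{t+1}$ with its conditional expectation $V(s_{t+1})$, which is justified by the law of total expectation together with the Markov property.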
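
When the state space is small and the reward depends only on the state, the Bellman Equation is a linear system: in vector form $V = R + \gamma P V$, so $V = (I - \gamma P)^{-1} R$ for $\gamma < 1$. The following sketch solves that system directly. The function name `solve_value_function` and the two-state example are assumptions made for illustration.

```python
import numpy as np

def solve_value_function(P, R, gamma=0.99):
    """Solve V = R + gamma * P @ V exactly for a finite Markov reward process."""
    num_states = len(P)
    return np.linalg.solve(np.eye(num_states) - gamma * P, R)

# State 1 is absorbing with zero reward, so its value should be 0.
P_small = np.array([[0.5, 0.5],
                    [0.0, 1.0]])
R_small = np.array([1.0, 0.0])

V_exact = solve_value_function(P_small, R_small, gamma=0.9)
print(V_exact)
```

An exact solution like this is a useful reference when checking estimates produced by sampling-based methods.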

***
#### TD(0)

Say we are interested in learning to predict $V(s_t)$. We can use a simple method called **TD(0)** to approximate $V(s_t)$ when our observation space $\mathcal{S}$ is **discrete** and finite. We begin by initializing a vector $V$ with random values, which will serve as our initial predictions for what $V(s_t)$ is.

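According to the Bellman Equation, we can use $r_t + \gamma V(s_{t+1})$ (which we call the **TD-target**) as an estimate of $V(s_t)$. The target would be unbiased if $V(s_{t+1})$ were the true value; because it uses our current estimate instead, it is a bootstrapped target that is generally biased while $V$ is still inaccurate. We then modify our estimate of $V(s_t)$ to more closely match $r_t + \gamma V(s_{t+1})$ according to a learning rate $\alpha$ as follows: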
$$
V(s_t) \gets V(s_t) + \alpha \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right)
$$
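
To make the update rule concrete, here is a single TD(0) step worked through with hand-picked numbers. All values (`V_demo`, the observed transition, `gamma`, `alpha`) are made up for illustration.

```python
import numpy as np

V_demo = np.array([0.0, 2.0])  # current value estimates for states 0 and 1
s_t, s_t_next = 0, 1           # we observe a transition from state 0 to state 1
r_t = 1.0                      # reward received in state 0
gamma, alpha = 0.9, 0.1

td_target = r_t + gamma * V_demo[s_t_next]  # 1.0 + 0.9 * 2.0
td_error = td_target - V_demo[s_t]          # how far the current estimate is from the target
V_demo[s_t] += alpha * td_error             # move a fraction alpha toward the target

print(td_target, td_error, V_demo)
```

Only the entry for the state we just left changes; the estimate for state 1 is untouched until we transition out of it.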

Let's use the TD(0) algorithm to attempt to learn the value function of this Markov reward process.

In [16]:
def generate_value_predictions(num_states):
    return np.random.rand(num_states)

In [17]:
def TD_0(V, P, R, T=100, s_0=0, gamma=0.99, alpha=0.1):
    num_states = len(P) # the number of next states we may go to
    s_t = s_0 # the current state is the starting state
    r_t = R[s_t]
    
    for t in range(1, T): # the length of the rest of the trajectory
        s_t_next = np.random.choice(np.arange(num_states), p=P[s_t]) # randomly select the next state using P, where entry j of P[s_t] is the probability of transitioning from state s_t to state j
#         print('current state: {}\t next state: {}'.format(s_t, s_t_next))
#         print('current V: {} \t better V: {}'.format( V[s_t], (r_t + gamma*V[s_t_next])))
        V[s_t] = V[s_t] + alpha*(r_t + gamma*V[s_t_next] - V[s_t]) # move V[s_t] toward the TD-target suggested by the Bellman equation
        s_t = s_t_next # update our reference to s_t to point to the new state
        r_t = R[s_t] # update our reference to r_t to point to the new reward
    return V

In [18]:
V = generate_value_predictions(num_states)
V = TD_0(V, P, R, 10000, 0, 0.99, 0.1)

In [19]:
print(V)

[105.94867585  97.62996413 105.66175568 111.73113174]


***
# Markov Decision Processes
A **Markov Decision Process** (MDP) is an extension of a Markov Reward Process that allows state transitions to be conditional upon some action. Formally, it is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle $ where $\mathcal{A}$ is the set of actions available to an agent. Our reward is now conditional on both the state $s_t$ we are in and the action $a_t$ that we take:

$$
\mathcal{R}(s_t, a_t) = \mathbb{E} \left[ r_t \vert s_t, a_t \right]
$$

Also, $\mathcal{P}$ is the probability of transitioning to state $s_{t+1}$ given that the current state is $s_t$ and the current action is $a_t$:

$$
\mathcal{P}(s_t, a_t, s_{t+1}) = \mathbb{P} \left[ s_{t+1} \vert s_t, a_t \right]
$$

Whereas in an MRP the probability of generating trajectories is dependent only upon the dynamics of the underlying Markov process, in an MDP trajectories also depend on the actions of an agent.

This is an MDP with four states and two actions.

![MDP](../images/mdp.png)

In this case, $\mathcal{P}$ is a **tensor** of shape $4 \times 2 \times 4$ where $\mathcal{P}(i, j, k)$ is the probability of transitioning from state $i$ to state $k$ given action $j$. Furthermore, $\mathcal{R}$ is a matrix of shape $4 \times 2$ where $\mathcal{R}(i, j)$ is the reward associated with choosing action $j$ in state $i$.

***
### Policies

Our goal is to define a **policy** $\pi$, which can be thought of as a set of rules for choosing actions based on the state. Typically, we represent $\pi$ as a probability distribution:
$$
\pi: \mathcal{S} \times \mathcal{A} \to \left[ 0, 1 \right]
$$

where we sample $a_t$ from the distribution
$$
a_t \sim \pi(\cdot \vert s_t)
$$
For discrete action spaces and observation spaces, we can think of this as a table of size $\lvert \mathcal{S} \rvert \times \lvert \mathcal{A} \rvert$ where $\pi_{i,j}$ is the probability of choosing action $j$ in state $i$.
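
Once a policy is fixed, an MDP behaves like a Markov reward process: averaging the transition tensor and the reward matrix over the policy's action probabilities gives $\mathcal{P}_\pi(i, k) = \sum_j \pi_{i,j} \mathcal{P}(i, j, k)$ and $\mathcal{R}_\pi(i) = \sum_j \pi_{i,j} \mathcal{R}(i, j)$. The sketch below builds random tables of the shapes described above and computes these averages. The `_demo` names and the seeded generator are assumptions for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (an assumption, for reproducibility)
num_states, num_actions = 4, 2

# Random transition tensor (state, action, next state), normalized over next states.
P_demo = rng.random((num_states, num_actions, num_states))
P_demo /= P_demo.sum(axis=2, keepdims=True)

# Random reward matrix (state, action) and random policy table (state, action).
R_demo = rng.standard_normal((num_states, num_actions))
pi_demo = rng.random((num_states, num_actions))
pi_demo /= pi_demo.sum(axis=1, keepdims=True)

# Sample an action a_t ~ pi(. | s_t) for s_t = 2.
a_t = rng.choice(num_actions, p=pi_demo[2])

# Average over actions to get the Markov reward process induced by the policy.
P_pi = np.einsum('sa,sak->sk', pi_demo, P_demo)  # shape (num_states, num_states)
R_pi = np.einsum('sa,sa->s', pi_demo, R_demo)    # shape (num_states,)

print(a_t)
print(P_pi.sum(axis=1))  # each row should still sum to 1
```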

In [20]:
def generate_state_transition_matrix(num_states, num_actions):
    P = np.random.rand(num_states, num_actions, num_states)
    for i in range(num_states):
        for j in range(num_actions):
            P[i, j] /= sum(P[i, j]) # convert to a probability distribution
    return P

In [21]:
def generate_reward_matrix(num_states, num_actions, mu=0, sigma=1):
    return mu + np.random.randn(num_states, num_actions)*sigma

In [22]:
def generate_policy(num_states, num_actions):
    pi = np.random.rand(num_states, num_actions)
    for i in range(num_states):
        pi[i] /= sum(pi[i])
    return pi

In [23]:
num_states = 4 # the number of states in our Markov Decision Process. Our current state is just an integer from 0 to 3.
num_actions = 2 

P = generate_state_transition_matrix(num_states, num_actions)
R = generate_reward_matrix(num_states, num_actions)
pi = generate_policy(num_states, num_actions)

In [24]:
print(P)

[[[0.16873509 0.23952693 0.09195878 0.49977921]
  [0.07553146 0.36523116 0.27211665 0.28712073]]

 [[0.31499871 0.19002019 0.22049889 0.2744822 ]
  [0.18279282 0.2703649  0.23947173 0.30737055]]

 [[0.37361335 0.01630832 0.33579444 0.27428389]
  [0.41613642 0.23877548 0.2533055  0.09178259]]

 [[0.39109849 0.10451307 0.33508499 0.16930345]
  [0.26504203 0.13289914 0.42689106 0.17516777]]]


In [25]:
print(R)

[[ 0.02808433  0.9436435 ]
 [ 0.15751751 -0.09610482]
 [ 0.56692432  0.01549822]
 [ 1.99221181 -0.41711537]]


In [26]:
print(pi)

[[0.3348768  0.6651232 ]
 [0.5920647  0.4079353 ]
 [0.80058436 0.19941564]
 [0.1887536  0.8112464 ]]


In [27]:
def generate_trajectory(P, R, pi, T=100, s_0 = 0, a_0 = 0): # note: a_0 is unused; the first action is sampled from pi[s_0] below
    num_states = len(P) # the number of next states we may go to
    num_actions = len(P[0]) # the number of actions we may take
    
    s_t = s_0 # the current state is the starting state
    tau = np.zeros(T, dtype=np.int32) # the trajectory, a fixed-length array of length T
    tau[0] = s_t # the first state is the starting state
    
    a_t = np.random.choice(np.arange(num_actions), p=pi[s_t]) # select our first action
    actions = np.zeros(T, dtype=np.int32) # the actions we choose, a fixed-length array of length T
    actions[0] = a_t 
    
    rewards = np.zeros(T) # the rewards experienced, a fixed-length array of length T
    rewards[0] = R[s_t, a_t] # the first reward is the reward for being in the first state and choosing the first action
    
    for t in range(1, T): # the length of the rest of the trajectory
        s_t = np.random.choice(np.arange(num_states), p=P[s_t, a_t]) # randomly select the next state using P, where entry k of P[s_t, a_t] is the probability of transitioning from state s_t to state k given action a_t
        a_t = np.random.choice(np.arange(num_actions), p=pi[s_t]) # randomly select an action given the state.
        tau[t] = s_t # add this state to our trajectory.
        actions[t] = a_t
        rewards[t] = R[s_t, a_t]

    return tau, rewards

In [28]:
tau, rewards = generate_trajectory(P, R, pi)

In [29]:
print(tau)

[0 1 0 3 2 3 0 3 1 1 2 3 0 3 1 1 0 0 3 0 1 0 3 1 3 0 3 0 0 1 0 3 0 2 2 0 2
 0 1 2 3 2 3 2 3 2 2 0 1 2 0 3 0 1 0 2 2 0 3 2 1 2 1 3 1 2 0 0 1 0 1 2 0 0
 0 1 3 2 2 0 2 2 3 3 0 3 1 2 3 0 2 0 1 2 0 2 0 2 0 1]


In [30]:
print(rewards)

[ 0.9436435   0.15751751  0.9436435  -0.41711537  0.56692432  1.99221181
  0.02808433 -0.41711537  0.15751751 -0.09610482  0.56692432  1.99221181
  0.02808433 -0.41711537  0.15751751  0.15751751  0.9436435   0.9436435
 -0.41711537  0.9436435  -0.09610482  0.9436435  -0.41711537  0.15751751
 -0.41711537  0.02808433  1.99221181  0.02808433  0.9436435   0.15751751
  0.02808433  1.99221181  0.9436435   0.56692432  0.01549822  0.9436435
  0.56692432  0.9436435   0.15751751  0.56692432 -0.41711537  0.56692432
 -0.41711537  0.56692432 -0.41711537  0.56692432  0.01549822  0.02808433
  0.15751751  0.01549822  0.9436435  -0.41711537  0.02808433  0.15751751
  0.02808433  0.56692432  0.56692432  0.02808433 -0.41711537  0.01549822
  0.15751751  0.01549822 -0.09610482 -0.41711537  0.15751751  0.56692432
  0.02808433  0.9436435   0.15751751  0.9436435  -0.09610482  0.01549822
  0.9436435   0.9436435   0.02808433  0.15751751 -0.41711537  0.56692432
  0.01549822  0.9436435   0.56692432  0.56692432  1.9

In this notebook, we learned about:

- Markov processes and how trajectories $\tau$ depend on the underlying state dynamics  $\mathcal{P}$. 
- Markov reward processes and how we can generate a stream of rewards given our trajectory.
- Returns, the value function $V(s_t)$ and how we can learn to approximate it using the TD(0) algorithm.
- Markov decision processes and how we can use a policy $\pi$ to guide our trajectories by choosing actions $a_t$ conditioned on the state $s_t$.