# Reinforcement Learning I

<hr>

**Gentle intro to RL**

RL algorithms learn to pick *good* actions based on the rewards they receive during training, with little or no supervision.

The algorithm learns to take actions that maximize some notion of *cumulative reward* rather than the immediate reward of the next step, so it can take *good* actions even when there are no intermediate rewards.

Some terminology:

- States, $s \in S$ (observed)
- Actions, $a \in A$ (intended)
- Transitions, $T(s, a, s') = p(s' | s, a)$
    - A function that takes the current state, the intended action and a candidate next state, and outputs the probability of reaching that next state,
    - i.e. action-dependent transition probabilities, such that for each state $s$ and action $a$, $\sum_{s' \in S} T(s, a, s') = 1$
    
    
- Reward, $R(s, a, s')$, representing the reward for starting in state $s$, taking action $a$ (cost) and ending up in state $s'$ after one step

These four components characterize the *Markov Decision Process*:

$MDP = \langle S, A, T, R \rangle$
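
As a rough illustration, the tuple can be stored as plain NumPy arrays. The two-state, two-action numbers below are made up for the example; only the array shapes and the normalization constraint $\sum_{s'} T(s, a, s') = 1$ come from the definitions above.

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions (numbers are illustrative only).
num_states, num_actions = 2, 2

# T[s, a, s2] = p(s2 | s, a)
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions out of state 0
    [[1.0, 0.0], [0.5, 0.5]],   # transitions out of state 1
])

# R[s, a, s2] = reward for the transition s --a--> s2
R = np.zeros((num_states, num_actions, num_states))
R[:, :, 1] = 1.0  # assumed: any transition that lands in state 1 pays a reward of 1

# Each (s, a) pair must define a probability distribution over next states.
row_sums = T.sum(axis=2)
print(row_sums)
```

Checking `row_sums` before running any algorithm is a cheap way to catch a mistyped transition table.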

****

**Markov Decision Processes (MDP)**<br>

MDPs satisfy the Markov property in that the transition probabilities and rewards depend only on the current state and action, and remain unchanged regardless of the history that leads to the current state.

1. **Rewards**

    One way to look at rewards is to define a bounded number of actions and states and aggregate all intermediate rewards, such that:

    $\text{Finite Horizon} = U([s_0, \dots, s_{N+K}]) = U([s_0, \dots, s_N]) \forall K$

    In this definition, the utility function $U$ only counts rewards up to step $N$; any rewards collected in the $K$ steps after that are ignored. This definition can be problematic because the best action then depends not only on the current state but also on the time step, e.g. an agent with only one step left might take a highly risky move.

    Consider instead a **discounted reward** utility function, which places the highest value on the immediate reward and lets the value decay the further away a reward is. This allows us to look at an infinite horizon, and the utility no longer depends on how many steps have been taken.

    $U([s_0, \dots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots = \sum_{t=0}^{\infty} \gamma^t R(s_t)$

    where 

    - $0 \leq \gamma < 1$; for $\gamma = 0$ this boils down to greedily maximizing the immediate reward
    - $U([s_0, \dots]) \leq R_{max} \sum_{t=0}^{\infty} \gamma^t = \frac{R_{max}}{1 - \gamma}$, i.e. if the maximum reward $R_{max}$ is finite then the utility is bounded, because for $\gamma < 1$ the geometric series $\sum_{t=0}^{\infty} \gamma^t$ converges to $\frac{1}{1-\gamma}$


2. **Optimal Policy**

    A policy is a function $\pi : S \rightarrow A$ that assigns an action $\pi(s)$ to every state $s$. The optimal policy, denoted $\pi^*$, maximizes the expected utility, even if that means taking actions with lower immediate rewards in a few states. Finding $\pi^*$ is exactly what solving an MDP means.
    
    A value function $V^*(s)$ of a given state, $s$, is the expected reward (i.e. the expectation of the utility function) if the agent acts optimally starting at state $s$.
    
    A Q-function $Q^*(s,a)$ is the expected reward of starting in state $s$, taking action $a$, and then acting optimally afterwards.
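
Returning to the discounted utility from item 1, the sketch below evaluates the sum for a finite reward sequence and compares it with the bound $\frac{R_{max}}{1 - \gamma}$. The helper name `discounted_utility` and the 50-step sequence of maximal rewards are assumptions chosen for illustration.

```python
def discounted_utility(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

gamma = 0.5
r_max = 1.0

# Assumed reward sequence: the maximum reward at every one of 50 steps.
rewards = [r_max] * 50
utility = discounted_utility(rewards, gamma)
bound = r_max / (1 - gamma)

print(utility, bound)
```

Even with the maximum reward at every step, the truncated sum stays below the bound and approaches it as the sequence gets longer.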

<hr>

**Bellman Equations**

The Bellman equations connect $\pi^*(s)$, $V^*(s)$ and $Q^*(s,a)$ together, such that:

$V^*(s) = \max_a Q^*(s, a) = Q^*(s, \pi^*(s))$

$Q^*(s,a) = \sum_{s'} T(s, a, s') \cdot [R(s,a,s') + \gamma V^*(s')]$

$V^*(s) = \max_a \sum_{s'} T(s, a, s') \cdot [R(s,a,s') + \gamma V^*(s')]$

where

- $T(s, a, s')$ is the transition probability of being in state $s$, taking an action $a$ and entering a specific state $s'$
- $R(s,a,s')$ is the reward for starting in state $s$, taking action $a$ (cost) and ending up in state $s'$ after one step

These equations will help us to solve for the optimal MDP policy.

Plugging the first equation into the second, we get:

$Q^*(s, a) = \sum_{s'} T(s, a, s') [R(s,a,s') + \gamma \max_{a'}Q^* (s', a')]$
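
A minimal sketch of how these equations translate to array operations, assuming $T$ and $R$ are stored as arrays of shape (states, actions, next states). The helper names `q_from_v` and `greedy_policy` and the two-state MDP are hypothetical and chosen only for illustration.

```python
import numpy as np

def q_from_v(T, R, V, gamma):
    """Q[s, a] = sum_{s'} T[s, a, s'] * (R[s, a, s'] + gamma * V[s'])."""
    return np.einsum("ijk,ijk->ij", T, R + gamma * V[None, None, :])

def greedy_policy(Q):
    """pi(s) = argmax_a Q[s, a]."""
    return np.argmax(Q, axis=1)

# Hypothetical 2-state, 2-action MDP (illustrative numbers).
T = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [0.0, 1.0]],   # state 1: both actions stay in state 1
])
R = np.zeros((2, 2, 2))
R[0, 1, 1] = 1.0                # reward 1 for moving from state 0 to state 1
gamma = 0.5
V = np.array([1.0, 0.0])        # candidate value function

Q = q_from_v(T, R, V, gamma)
print(Q)
print(greedy_policy(Q))
```

For this toy MDP the candidate `V` satisfies $V(s) = \max_a Q(s, a)$, so it is a fixed point of the Bellman equation, and the greedy policy picks action 1 in state 0.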

****

**Value Iteration Algorithm**

Let $V^*_K (s)$ denote the expected reward from state $s$ when the agent acts optimally for $K$ steps. As $K \rightarrow \infty$, $V^*_K (s) \rightarrow V^*(s)$.

Algorithm:

1. Initialization

    $V^*_0 (s) = 0$


2. Iterate and update until $V^*_K (s) \simeq V^*_{K+1} (s)$ $\forall s$

    $V^*_{K+1} (s) = \max_a \sum_{s'} T(s, a, s') \cdot [R(s,a,s') + \gamma V^*_K(s')]$
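
The two steps above can be sketched as a generic function. The stopping tolerance `tol`, the cap `max_iter`, and the single-state test MDP are assumptions added for illustration; the notes only ask for the estimates to be approximately equal.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8, max_iter=10_000):
    """Iterate the Bellman update from V_0 = 0 until successive estimates agree within tol.

    T and R are arrays of shape (num_states, num_actions, num_states).
    Returns the final estimate and the number of updates performed.
    """
    V = np.zeros(T.shape[0])
    for k in range(1, max_iter + 1):
        Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)
        V_next = Q.max(axis=1)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next, k
        V = V_next
    return V, max_iter

# Hypothetical check: one state, one action, reward 1 at every step.
T = np.ones((1, 1, 1))
R = np.ones((1, 1, 1))
V, num_updates = value_iteration(T, R, gamma=0.5)
print(V, num_updates)
```

With a reward of 1 at every step and $\gamma = 0.5$, the estimate should approach $\frac{1}{1 - \gamma} = 2$, which matches the geometric series bound on the discounted utility.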
    

Suppose we have an agent trying to navigate a one-dimensional grid consisting of 5 cells. At each step, the agent has only one action to choose from, i.e. it moves to the cell on the immediate right.

<img alt="One-dimensional Grid" src="assets/one_dimensional_grid.png" width="500">

In this example the reward depends only on the current state, $R(s, a, s') = R(s)$, with $R(5) = 1$ and $R(s) = 0$ otherwise.

Let $V^*(i)$ denote the value function of state $i$, the $i^{th}$ cell starting from the left.

Let $V^*_k (i)$ denote the value function estimate at state $i$ at the $k$th step of the value iteration algorithm. 

Let $V^*_0 (i)$ denote the initialization of this estimate.

Suppose we use a discount factor $\gamma = 0.5$, then we will write the functions $V^*_k$ as arrays below:

$\begin{bmatrix} V^*_k (1) & V^*_k (2) & V^*_k (3) & V^*_k (4) & V^*_k (5) \end{bmatrix}$

At step (1), we initialize by setting $V^*_0 (i) = 0$ for all $i$, such that:

$V^*_0 = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \end{bmatrix}$

Then, using the value iteration update rule, we get:

If we walk 1 step further from a given state, then the value we will get is represented by $V^*_1 = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \end{bmatrix}$, since the only reward collected within 1 step is $R(5) = 1$ in the fifth cell.

If we walk 2 steps further from a given state, then the value is represented as the following:

$V^*_2 = \begin{bmatrix} 0 & 0 & 0 & 0.5 & 1 \end{bmatrix}$

where $R(s,a,s') = R(4) = 0$ but $\gamma V^*_1(5) = 0.5 \times 1$, so $V^*_2(4) = 0 + 0.5 = 0.5$

Continue the iterations, updating $V^*_K (s)$ until it converges.
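
The walk-through can be reproduced with a few lines of NumPy. This sketch assumes that cell 5 is terminal (the agent collects $R(5) = 1$ and the episode ends), which is what the values of $V^*_1$ and $V^*_2$ above imply; the helper name `update` is hypothetical.

```python
import numpy as np

gamma = 0.5
R = np.array([0.0, 0.0, 0.0, 0.0, 1.0])  # R(s) for cells 1..5 (index 0 is cell 1)

def update(V):
    """One value iteration step for the move-right-only chain.

    Assumption: cell 5 is terminal, so its value is just R(5).
    """
    V_next = R.copy()
    V_next[:-1] += gamma * V[1:]  # cell i moves deterministically to cell i + 1
    return V_next

V = np.zeros(5)
history = []
for k in range(1, 6):
    V = update(V)
    history.append(V)
    print(k, V)
```

Under this assumption the estimate stops changing after five updates, at $\begin{bmatrix} 0.0625 & 0.125 & 0.25 & 0.5 & 1 \end{bmatrix}$: each cell's value is the reward of cell 5 discounted once per step needed to reach it.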

<hr>

# Basic code
A `minimal, reproducible example`

*Question*

Consider the same one-dimensional grid, with changes to transition probabilities.

<img alt="One-dimensional Grid" src="assets/one_dimensional_grid.png" width="300">

At any given grid location, the agent can choose to either stay at the location or move to an adjacent location. If the agent chooses to stay at the location, the action is successful with probability 0.5; otherwise,

- if the agent is at the corner grid locations then it ends up at the neighboring grid location with probability 0.5,
- if the agent is at any of the inner grid locations it has a probability of 0.25 each of ending up at each of the neighboring locations

If the agent chooses to move (*either left or right*) at any of the inner grid locations, the action is successful with probability 1/3, and with probability 2/3 the agent fails to move. At the edges of the grid:

- if the agent chooses to move left at the leftmost grid location, then the action ends up exactly the same as choosing to stay, i.e., staying at the leftmost grid location with probability 0.5, and ends up at its neighboring grid location with probability 0.5,
- if the agent chooses to move right at the rightmost grid location, then the action ends up exactly the same as choosing to stay, i.e. staying at the rightmost grid location with probability 0.5, and ends up at its neighboring grid location with probability 0.5
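
The rules above can be encoded programmatically rather than typed by hand. This is a sketch under one assumption the text does not spell out: a move from a corner toward the interior succeeds with probability 1/3, the same as a move from an inner cell. The function name `build_transitions` and the action constants are hypothetical.

```python
import numpy as np

LEFT, STAY, RIGHT = 0, 1, 2

def build_transitions(n=5):
    """T[s, a, s2] for the n-cell grid described above.

    Assumption: a move from a corner toward the interior succeeds with
    probability 1/3, the same as a move from an inner cell.
    """
    T = np.zeros((n, 3, n))
    for s in range(n):
        neighbors = [c for c in (s - 1, s + 1) if 0 <= c < n]
        # Stay: succeeds with probability 1/2, otherwise drifts to a neighbor.
        T[s, STAY, s] = 0.5
        for c in neighbors:
            T[s, STAY, c] = 0.5 / len(neighbors)
        # Move: succeeds with probability 1/3; moving into a wall acts like staying.
        for action, target in ((LEFT, s - 1), (RIGHT, s + 1)):
            if 0 <= target < n:
                T[s, action, target] = 1 / 3
                T[s, action, s] = 2 / 3
            else:
                T[s, action] = T[s, STAY]
    return T

T = build_transitions()
print(T.sum(axis=2))  # each (state, action) row should sum to 1 up to rounding
```

Generating the table from the rules makes it easy to change the grid size and to verify that every (state, action) pair defines a valid probability distribution.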

Let $\gamma = 0.5$

Run the value iteration algorithm for 10, 100 and 200 iterations.

```python
# Represent the transition probabilities and rewards as arrays
import numpy as np

# States: the five grid cells, indexed 0..4 from left to right
states = [0, 1, 2, 3, 4]
# Actions: 0 = move left, 1 = stay, 2 = move right
actions = [0, 1, 2]

# Transition probabilities, indexed as T[state][action][next_state]
T = np.array([
    [[1/2,1/2,0,0,0], [1/2,1/2,0,0,0], [2/3,1/3,0,0,0]],
    [[1/3,2/3,0,0,0], [1/4,1/2,1/4,0,0], [0,2/3,1/3,0,0]],
    [[0,1/3,2/3,0,0], [0,1/4,1/2,1/4,0], [0,0,2/3,1/3,0]],
    [[0,0,1/3,2/3,0], [0,0,1/4,1/2,1/4], [0,0,0,2/3,1/3]],
    [[0,0,0,1/3,2/3], [0,0,0,1/2,1/2], [0,0,0,1/2,1/2]]
])
num_state = len(states)
num_action = len(actions)
gamma = 1/2

# Initialization: V_0(s) = 0 for every state
V = np.zeros(num_state)

# Reward depends only on the current state: 1 in the rightmost cell, 0 elsewhere
R = np.zeros(num_state)
R[4] = 1
num_iter = 200

# Run the value iteration algorithm
for i in range(num_iter):
    # Q[s][a] = sum over next states t of T[s][a][t] * (R[s] + gamma * V[t])
    Q = [[sum(T[s][a][t] * (R[s] + gamma * V[t]) for t in range(num_state))
          for a in range(num_action)]
         for s in range(num_state)]
    # V_{k+1}(s) = max_a Q[s][a]
    V = np.max(Q, axis=1)
    print(i + 1, "\t", V)
```

1 	 [0. 0. 0. 0. 1.]
2 	 [0.         0.         0.         0.16666667 1.33333333]
3 	 [0.         0.         0.02777778 0.27777778 1.47222222]
4 	 [0.         0.00462963 0.05555556 0.33796296 1.53703704]
5 	 [1.15740741e-03 1.08024691e-02 7.48456790e-02 3.68827160e-01
 1.56867284e+00]
6 	 [0.00298997 0.0160751  0.08641975 0.38438786 1.58436214]
7 	 [0.00476627 0.01976166 0.09287123 0.39218964 1.59218536]
8 	 [0.00613198 0.02206576 0.09632202 0.39609411 1.59609339]
9 	 [0.00704943 0.02340892 0.09812302 0.39804693 1.59804682]
10 	 [0.00761459 0.02415681 0.09904883 0.39902345 1.59902343]
11 	 [0.00794285 0.02456041 0.09952018 0.39951172 1.59951172]
12 	 [0.00812581 0.0247735  0.09975868 0.39975586 1.59975586]
13 	 [0.00822483 0.02488428 0.09987887 0.39987793 1.59987793]
14 	 [0.00827728 0.02494124 0.09993928 0.39993896 1.59993896]
15 	 [0.00830463 0.02497029 0.09996959 0.39996948 1.59996948]
16 	 [0.00831873 0.02498503 0.09998478 0.39998474 1.59998474]
17 	 [0.00832594 0.02499247 0.099992