# Reinforcement Learning

<hr>

**Gentle intro to RL**

RL algorithms learn to pick *good* actions based on the rewards that they receive during training, with little or no supervision.

The algorithm learns to take actions that maximize some notion of *cumulative reward* rather than the immediate reward of the next step, so it can choose *good* actions even when there are no intermediate rewards.

Some terminology:

- States, $s \in S$ (observed)
- Actions, $a \in A$ (intended)
- Transitions, $T(s, a, s') = p(s' | s, a)$
    - A function that takes the current state, the intended action and a candidate next state, and outputs the probability of reaching that next state
    - i.e. action-dependent transition probabilities, such that for each state $s$ and action $a$, $\sum_{s' \in S} T(s, a, s') = 1$
    
    
- Reward, $R(s, a, s')$, representing the reward for starting in state $s$, taking action $a$ (cost) and ending up in state $s'$ after one step

These four components characterize the *Markov Decision Process*:

$MDP = \langle S, A, T, R \rangle$
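
To make the tuple concrete, here is a minimal sketch of how the four components could be written down in plain Python. The two states, two actions, probabilities and reward values are all hypothetical and chosen only for illustration; the helper `is_valid_transition_model` is likewise an assumed name, and it checks the normalization condition on $T$ stated above.

```python
# A hypothetical two-state, two-action MDP; every number here is made up.
states = ["cool", "warm"]
actions = ["slow", "fast"]

# T[(s, a)] maps each next state s' to p(s' | s, a).
T = {
    ("cool", "slow"): {"cool": 1.0, "warm": 0.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"cool": 0.0, "warm": 1.0},
}


def R(s, a, s_next):
    """Reward for starting in s, taking a, and landing in s_next."""
    base = 2.0 if a == "fast" else 1.0          # faster actions pay more
    penalty = 3.0 if s_next == "warm" else 0.0  # ending up warm is penalized
    return base - penalty


def is_valid_transition_model(T, states, actions, tol=1e-9):
    """Check that the probabilities over s' sum to 1 for every (s, a)."""
    return all(
        abs(sum(T[(s, a)][s2] for s2 in states) - 1.0) <= tol
        for s in states
        for a in actions
    )


mdp = (states, actions, T, R)
print(is_valid_transition_model(T, states, actions))  # prints True
```

A dictionary keyed by `(s, a)` keeps the sketch readable; for larger state spaces an array indexed by state and action would be the more usual choice.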

****

**Markov Decision Processes (MDP)**<br>

MDPs satisfy the Markov property in that the transition probabilities and rewards depend only on the current state and action, and remain unchanged regardless of the history that leads to the current state.

1. **Rewards**

    One way to look at rewards is to bound the number of steps the agent takes and aggregate only the rewards collected within that horizon, such that:

    $\text{Finite horizon: } U([s_0, \dots, s_{N+K}]) = U([s_0, \dots, s_N]) \quad \forall K \geq 0$

    In this definition, the utility function $U$ only counts rewards up to step $N$; any rewards in the $K$ steps past that point are ignored. This definition can be problematic because the best action then depends not only on the current state but also on the time step, e.g. an agent with only one step left might take a highly risky move.

    Consider instead a **discounted reward** utility function, which places the highest value on the immediate reward and decays the value of rewards that lie further in the future. This lets us consider an infinite horizon, and the utility no longer depends on how many steps have been taken.

    $U([s_0, \dots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots = \sum_{t=0}^{\infty} \gamma^t R(s_t)$

    where 

    - $0 \leq \gamma < 1$ is the discount factor; for $\gamma = 0$ the utility reduces to greedily maximizing the immediate reward
    - $U([s_0, \dots]) \leq R_{max} \sum_{t=0}^{\infty} \gamma^t = \frac{R_{max}}{1 - \gamma}$, i.e. if the maximum reward $R_{max}$ is finite, the utility is bounded because the geometric series $\sum_{t=0}^{\infty} \gamma^t$ converges to $\frac{1}{1-\gamma}$ for $\gamma < 1$


2. **Optimal Policy**

    A policy is a function $\pi : S \rightarrow A$ that assigns an action $\pi(s)$ to every state $s$. The optimal policy, denoted $\pi^*$, maximizes the expected utility, even if that means taking actions with lower immediate rewards in some states. Finding $\pi^*$ is exactly what solving an MDP means.
    
    A value function $V(s)$ gives, for a state $s$, the expected utility (i.e. the expectation of the utility function) if the agent acts optimally starting from state $s$.
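
As a small numeric check of the discounted utility from the Rewards item, the sketch below sums $\gamma^t R(s_t)$ over a short, hypothetical reward sequence and compares the result with the bound $\frac{R_{max}}{1 - \gamma}$. The function name `discounted_utility` and the reward values are assumptions made for illustration, and the sequence is truncated to four steps rather than infinite.

```python
def discounted_utility(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


rewards = [1.0, 1.0, 1.0, 1.0]  # hypothetical R(s_0), ..., R(s_3)
gamma = 0.5
r_max = max(rewards)

u = discounted_utility(rewards, gamma)  # 1 + 0.5 + 0.25 + 0.125
bound = r_max / (1.0 - gamma)           # geometric-series bound
print(u, bound)  # prints 1.875 2.0
```

With `gamma = 0.0` the same function returns only the first reward, which matches the greedy case described above.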

<hr>

**Bellman Equations**
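
A standard way to write the Bellman optimality equations, using the symbols $T$, $R$ and $\gamma$ defined earlier, is sketched here. The names $V^*$ and $Q^*$ (the optimal value of a state and of a state-action pair) are notation assumed for this sketch.

$Q^*(s, a) = \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$

$V^*(s) = \max_{a \in A} Q^*(s, a)$

$\pi^*(s) = \arg\max_{a \in A} Q^*(s, a)$

Read together, these say that the value of a state under optimal behavior is the best achievable expected one-step reward plus the discounted value of wherever the agent lands next.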



****

**Value Iteration Algorithm**
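
One common formulation of value iteration starts from $V(s) = 0$ and repeatedly replaces $V(s)$ with $\max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V(s') \right]$ until the values stop changing. The sketch below follows that recipe on a hypothetical two-state MDP; the states, actions, rewards, tolerance `tol` and iteration cap `max_iters` are all assumptions made for illustration, not values from these notes.

```python
def value_iteration(states, actions, T, R, gamma=0.5, tol=1e-8, max_iters=10_000):
    """Repeat the Bellman backup until the largest change in V is below tol."""
    V = {s: 0.0 for s in states}

    def q_value(s, a, V):
        # Expected one-step reward plus discounted value of the next state.
        return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)].items())

    for _ in range(max_iters):
        V_new = {s: max(q_value(s, a, V) for a in actions) for s in states}
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < tol:
            break

    # Greedy policy with respect to the final value estimates.
    policy = {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states}
    return V, policy


# Hypothetical deterministic MDP: "stay" keeps the state, "move" switches it.
states = ["A", "B"]
actions = ["stay", "move"]
T = {
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 1.0},
}


def R(s, a, s_next):
    """Reward of 1 for landing in state B, 0 otherwise."""
    return 1.0 if s_next == "B" else 0.0


V, policy = value_iteration(states, actions, T, R, gamma=0.5)
print(policy)  # prints {'A': 'move', 'B': 'stay'}
```

For this toy problem both values approach 2, since staying in B earns $1 + 0.5 + 0.25 + \dots$, and moving from A lands in B while collecting the same reward stream.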


<hr>

# Basic code
A `minimal, reproducible example`