# Reinforcement Learning

## Markov Decision Process
---

Defining the problem:  
- States: S
- Model: T(s,a,s') ~ Pr(s' | s,a)
    - transition model - rules of game you are playing
    - probability you will go to state s' given you are in state s and take action a
- Actions: A(s), A
    - all actions agent can take
- Reward: R(s), R(s,a), R(s,a,s')
    - scalar value for being in a state
    - domain knowledge
    - reward you get for a state tells you the usefulness of that state
    
Solution:  
- Policy: $\pi(s) \rightarrow a$
    - takes a state and tells you the action you should take
    - solution to MDP
    - $\pi^*$: optimal policy - maximizes long term reward
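
The pieces above can be written down as a small data structure. This is only a sketch: the `MDP` class, the two-state example `toy`, and the helper `is_valid_model` are hypothetical names made up for illustration, and the reward is assumed to be the state-only form R(s).

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list        # S
    actions: list       # A (assumed identical in every state)
    T: dict             # T[(s, a)] -> {s_next: Pr(s_next | s, a)}
    R: dict             # R[s] -> scalar reward for being in s
    gamma: float = 0.9  # discount factor (introduced under Rewards)


# Hypothetical two-state problem: state "B" is the rewarding one.
toy = MDP(
    states=["A", "B"],
    actions=["stay", "go"],
    T={
        ("A", "stay"): {"A": 1.0},
        ("A", "go"): {"B": 0.8, "A": 0.2},
        ("B", "stay"): {"B": 1.0},
        ("B", "go"): {"A": 0.8, "B": 0.2},
    },
    R={"A": 0.0, "B": 1.0},
)

# A policy is just a mapping from state to action.
policy = {"A": "go", "B": "stay"}


def is_valid_model(mdp):
    """Each T(s, a, .) must be a probability distribution over next states."""
    return all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in mdp.T.values())
```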

Markov property   
- only the present matters (can make sure current state remembers everything you need to remember from the past)
- system is stationary - rules don't change - transition model doesn't change

## Rewards
----

- delayed rewards
- minor changes matter
    - small negative reward for each step encourages you to end the game
    - too large a negative reward and ending the game quickly, even in a bad state, becomes the best option
    - too large a positive reward and you may be rewarded for never moving or never ending the game

Temporal Credit Assignment - given a sequence of <s,a,r> tuples, determine which actions were good or bad   

Sequence of Rewards: Assumptions:  
- infinite horizons (stationary)
    - policy may change if finite horizon (only a few timesteps left)
    - $\pi(s, t) \rightarrow a$
- utility of sequences (stationarity of preferences)
    - if $U(s_0, s_1, s_2 ...) > U(s_0, s_1^{\prime}, s_2^{\prime} ...)$ then $U(s_1, s_2 ...) > U(s_1^{\prime}, s_2^{\prime} ...)$

What doesn't work:  
$U(s_0, s_1, s_2 ...) = \sum_{t=0}^{\infty}R(s_t)$ - Doesn't work - with positive rewards every infinite sequence sums to infinity, so sequences can't be compared 

What does work (discounted rewards):   
$U(s_0, s_1, s_2 ...) = \sum_{t=0}^{\infty}\gamma^t R(s_t)$ where $0 \leq \gamma < 1$   
bounded from above by the largest reward: $\leq \sum_{t=0}^{\infty} \gamma^t R_{max} = \dfrac{R_{max}}{1-\gamma}$ (geometric series)   
lets us add an infinite number of terms and still get a finite sum   
like having a finite horizon but it's always the same distance away (effectively infinite or unbounded)   
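
A quick numerical check of the geometric-series bound. This is a sketch: `discounted_return` is a hypothetical helper, and the values $\gamma = 0.9$ and $R_{max} = 1$ are arbitrary choices.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite prefix of a reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


gamma = 0.9
r_max = 1.0

# Closed-form bound R_max / (1 - gamma).
bound = r_max / (1 - gamma)

# A long run of maximal rewards approaches the bound but never exceeds it.
partial = discounted_return([r_max] * 1000, gamma)
```

With these values the bound is 10, so even an endless stream of reward 1 is worth no more than about 10.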

## Policies
---

$\pi^* = argmax_{\pi} E[\sum_{t=0}^{\infty} \gamma^t R(s_t) | \pi]$   
$U^{\pi}(s) = E[\sum_{t=0}^{\infty} \gamma^t R(s_t) | \pi, s_0=s]$   
utility is long term feedback (accounts for delayed rewards) - does not equal R(s), which is the immediate reward   
$\pi^*(s) = argmax_a \sum_{s^{\prime}} T(s,a,s^{\prime}) U(s^{\prime})$, where $U(s) = U^{\pi^*}(s)$ is the utility from following the optimal policy   

## Bellman Equation   
---

$U(s) = R(s) + \gamma * max_a \sum_{s^{\prime}} T(s,a,s^{\prime}) U(s^{\prime})$    
n equations in n unknowns, but the max makes them non-linear   

Finding policies:  
- start with arbitrary utilities
- update utilities based on neighbors
- repeat until convergence  

## Value Iteration
---

$U_{t+1}(s) = R(s) + \gamma * max_a \sum_{s^{\prime}} T(s,a,s^{\prime}) U_t(s^{\prime})$    
R(s) is the truth that keeps getting added in at every update  
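
The update above can be sketched in a few lines of Python. The two-state MDP, the names `STATES`, `ACTIONS`, `T`, `R`, and the stopping tolerance are all assumptions made for illustration, not part of the notes.

```python
# Hypothetical two-state MDP, used only for illustration.
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
T = {  # T[(s, a)] maps each next state s' to Pr(s' | s, a)
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"B": 0.8, "A": 0.2},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 0.8, "B": 0.2},
}
R = {"A": 0.0, "B": 1.0}


def expected_utility(s, a, U):
    """sum over s' of T(s, a, s') * U(s')."""
    return sum(p * U[s2] for s2, p in T[(s, a)].items())


def value_iteration(gamma=0.9, tol=1e-8, max_iters=10_000):
    U = {s: 0.0 for s in STATES}  # start with arbitrary utilities
    for _ in range(max_iters):
        # Bellman update: U_{t+1}(s) = R(s) + gamma * max_a sum_s' T * U_t
        U_next = {
            s: R[s] + gamma * max(expected_utility(s, a, U) for a in ACTIONS)
            for s in STATES
        }
        delta = max(abs(U_next[s] - U[s]) for s in STATES)
        U = U_next
        if delta < tol:  # repeat until (approximate) convergence
            break
    return U


def greedy_policy(U):
    """pi(s) = argmax_a sum_s' T(s, a, s') * U(s')."""
    return {s: max(ACTIONS, key=lambda a: expected_utility(s, a, U)) for s in STATES}


U = value_iteration()
policy = greedy_policy(U)
```

In this toy problem staying in B forever is worth $1/(1-\gamma) = 10$, and the greedy policy heads for B and stays there.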

## Policy Iteration
---

- start with $\pi_0$ guess
- evaluate: given $\pi_t$ calculate $U_t = U^{\pi_t}$
- improve: $\pi_{t+1}(s) = argmax_a \sum_{s'} T(s,a,s') U_t(s')$  

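$U_t(s) = R(s) + \gamma \sum_{s'} T(s,\pi_t(s),s') U_t(s')$  
(evaluation step: the action is fixed by $\pi_t$, so there is no max and this is n linear equations in n unknowns)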
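
A minimal sketch of the evaluate/improve loop on a hypothetical two-state MDP. Assumptions: the evaluation step is done by repeated substitution rather than by solving the linear system directly, and `max_rounds` is a safety cap added for the example.

```python
# Hypothetical two-state MDP, used only for illustration.
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
T = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"B": 0.8, "A": 0.2},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 0.8, "B": 0.2},
}
R = {"A": 0.0, "B": 1.0}


def expected_utility(s, a, U):
    return sum(p * U[s2] for s2, p in T[(s, a)].items())


def evaluate_policy(policy, gamma, tol=1e-10, max_iters=100_000):
    """Evaluate: U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s')."""
    U = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        U_next = {s: R[s] + gamma * expected_utility(s, policy[s], U) for s in STATES}
        delta = max(abs(U_next[s] - U[s]) for s in STATES)
        U = U_next
        if delta < tol:
            break
    return U


def policy_iteration(gamma=0.9, max_rounds=100):
    policy = {s: ACTIONS[0] for s in STATES}  # pi_0: arbitrary initial guess
    U = evaluate_policy(policy, gamma)
    for _ in range(max_rounds):
        # Improve: act greedily with respect to the current utilities.
        improved = {
            s: max(ACTIONS, key=lambda a: expected_utility(s, a, U)) for s in STATES
        }
        if improved == policy:  # policy is stable, so stop
            break
        policy = improved
        U = evaluate_policy(policy, gamma)
    return policy, U


policy, U = policy_iteration()
```

Here the loop stabilizes after one improvement: the initial all-"stay" policy is replaced by "go" in A and "stay" in B.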

## RL 'API'
---

- planning
    - model (T,R) -> Planner -> policy ($\pi$)
- reinforcement learning
    - transitions <s,a,r,s'> -> Learner -> policy
- modeling
    - transitions -> Modeler -> model
- simulating
    - model -> simulator -> transitions
    
reward maximization  

RL-based planner:
- model -> simulator -> transitions -> learner -> policy

model-based RL:
- transitions -> modeler -> planner -> policy

## Three Approaches to RL
---

- policy search
    - s -> $\pi$ -> a
    - direct use
    - indirect learning
- value-function based
    - s -> U -> v
- model-based
    - <s,a> -> [T,R] -> <s',r>
    - direct learning
    - indirect use
    - can use supervised learning
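
As one possible illustration of the model-based route, T and R can be estimated from observed transitions by simple counting. This is a sketch under stated assumptions: `estimate_model` is a hypothetical helper, the reward is assumed to depend only on the state being left, and the sample transitions are made up.

```python
from collections import Counter, defaultdict


def estimate_model(transitions):
    """Estimate [T, R] from <s, a, r, s'> tuples by counting and averaging."""
    next_counts = defaultdict(Counter)  # (s, a) -> counts of observed s'
    reward_sum = defaultdict(float)     # s -> total reward observed in s
    visits = Counter()                  # s -> number of times s was left

    for s, a, r, s_next in transitions:
        next_counts[(s, a)][s_next] += 1
        reward_sum[s] += r
        visits[s] += 1

    # Relative frequencies as the estimate of Pr(s' | s, a).
    T_hat = {
        sa: {s2: n / sum(counts.values()) for s2, n in counts.items()}
        for sa, counts in next_counts.items()
    }
    # Mean observed reward as the estimate of R(s).
    R_hat = {s: reward_sum[s] / visits[s] for s in visits}
    return T_hat, R_hat


# Made-up experience for illustration.
transitions = [
    ("A", "go", 0.0, "B"),
    ("A", "go", 0.0, "B"),
    ("A", "go", 0.0, "A"),
    ("B", "stay", 1.0, "B"),
]
T_hat, R_hat = estimate_model(transitions)
```

The estimated model can then be handed to a planner such as value iteration, which is the "indirect use" noted above.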
    
<img src="images/approaches_to_rl.png" width=600 align="left"/>  

## Q Function (new kind of value function)
----

$Q(s,a) = R(s) + \gamma \sum_{s'} T(s,a,s') * max_{a'} Q(s',a')$  
value for arriving in s, leaving via a, proceeding optimally thereafter   
evaluating the Bellman equations from data:   <s,a,r,s'> -> Q   

$U(s) = max_a (Q(s,a))$  
$\pi(s) = argmax_a (Q(s,a))$  
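
Reading U and $\pi$ off a Q table takes one line each. The Q values below are made-up numbers for illustration, and `utility` and `greedy_action` are hypothetical helper names.

```python
ACTIONS = ["stay", "go"]

# Made-up Q table: Q[(s, a)].
Q = {
    ("A", "stay"): 7.9, ("A", "go"): 8.8,
    ("B", "stay"): 10.0, ("B", "go"): 9.1,
}


def utility(Q, s):
    """U(s) = max_a Q(s, a)."""
    return max(Q[(s, a)] for a in ACTIONS)


def greedy_action(Q, s):
    """pi(s) = argmax_a Q(s, a)."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])
```

Note that neither function needs T or R, which is what makes Q convenient when only transitions are available.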

## Estimating Q from transitions
----

$Q(s,a) = R(s) + \gamma \sum_{s'} T(s,a,s') * max_{a'} Q(s',a')$  

but don't have [T, R], have transitions  
learn to solve an MDP, without [T,R], interact <s,a,r,s'>   

<s,a,r,s'>:   
- $\hat{Q}(s,a) \leftarrow^{\alpha_t} r + \gamma * max_{a'} \hat{Q}(s',a')$  
- learning rate $\alpha$: $V \leftarrow^{\alpha} X$ means $V \leftarrow (1-\alpha) V + \alpha X$  
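
A minimal sketch of the update rule driven by sampled transitions. Assumptions, none of which come from the notes: the two-state MDP hidden inside `step`, a constant learning rate instead of a decaying $\alpha_t$, uniformly random action selection, and a fixed seed.

```python
import random

STATES = ["A", "B"]
ACTIONS = ["stay", "go"]

# Hidden from the learner: it only sees samples produced by step().
_T = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"B": 0.8, "A": 0.2},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 0.8, "B": 0.2},
}
_R = {"A": 0.0, "B": 1.0}


def step(s, a, rng):
    """Simulator: sample one <r, s'> for taking action a in state s."""
    dist = _T[(s, a)]
    s_next = rng.choices(list(dist), weights=list(dist.values()))[0]
    return _R[s], s_next


def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Move Q(s, a) a fraction alpha toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target


def q_learning(n_steps=20_000, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = "A"
    for _ in range(n_steps):
        a = rng.choice(ACTIONS)  # pure exploration, for simplicity
        r, s_next = step(s, a, rng)
        q_update(Q, s, a, r, s_next, alpha, gamma)
        s = s_next
    return Q


Q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
```

The learner never touches `_T` or `_R` directly; everything it knows comes from the sampled `<s,a,r,s'>` tuples.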

<img src="images/q_learning_convergence.png" width=600 align="left"/>  

## Choosing Actions
---

Q-learning is a family of algorithms
- how to initialize Q
- how to decay alpha
- how to choose actions
    - always choose same action (won't learn)
    - choose randomly (won't use it)
    - use Q (will use it) - greedy - local min
    - random restarts (slow)
    - epsilon greedy - "simulated annealing" like approach

## $\epsilon$-Greedy Exploration
----

If GLIE (greedy in the limit with infinite exploration), e.g. with a decayed $\epsilon$:   
$\hat{Q} \rightarrow Q$ and $\hat{\pi} \rightarrow \pi^*$    
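
A sketch of $\epsilon$-greedy action selection with a decaying $\epsilon$. The schedule $\epsilon_t = 1/(t+1)$ is just one common choice and is an assumption here, as are the helper names and the made-up Q values.

```python
import random


def epsilon_greedy(Q, s, actions, epsilon, rng):
    """With probability epsilon explore at random, otherwise exploit Q."""
    if rng.random() < epsilon:
        return rng.choice(actions)                # explore (learn)
    return max(actions, key=lambda a: Q[(s, a)])  # exploit (use)


def decayed_epsilon(t):
    """Assumed schedule: starts fully random, becomes greedy in the limit."""
    return 1.0 / (t + 1)


# Made-up Q values for a single state.
ACTIONS = ["stay", "go"]
Q = {("A", "stay"): 7.9, ("A", "go"): 8.8}
rng = random.Random(0)
action = epsilon_greedy(Q, "A", ACTIONS, decayed_epsilon(9), rng)
```

Early on most actions are random; as t grows, the agent relies more and more on its current Q estimate.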

learn - use  
Exploration-Exploitation dilemma  (only one of you) (fundamental tradeoff in RL)  

RL is glue between model learning (ML) and planning (automated planning and scheduling)   

<img src="images/exploration.png" width=600 align="left"/> 