# Markov Decision Processes

> In this post, we will learn about Markov decision processes (MDPs), the mathematical framework for solving RL problems. This is a summary of the lecture "Fundamentals of Reinforcement Learning" from Coursera.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Coursera, Reinforcement_Learning]
- image: 

## Markov Decision Processes

![mdp](image/mdp.png)

MDPs provide a general framework for sequential decision-making.

### Dynamics of an MDP

<img src='image/mdp2.png' align='center' />

$$ p(s', r \vert s, a) $$

This is the joint probability of the next state $s'$ and reward $r$, given the current state $s$ and action $a$.

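The dynamics function $p$ assigns a probability to every combination of next state, reward, current state, and action. For each state-action pair, these probabilities sum to one:

$$ \begin{aligned} & p: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \rightarrow [0, 1] \\
 & \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \vert s, a) = 1, \quad \forall s \in \mathcal{S}, a \in \mathcal{A}(s) \end{aligned} $$

- Markov property

The present state contains all the information necessary to predict the future: the next state and reward depend only on the current state and action, not on the earlier history.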
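
To make the dynamics function concrete, here is a minimal sketch of $p(s', r \vert s, a)$ stored as a lookup table. The two-state MDP, its state and action names, and all probabilities are made up for illustration and do not come from the lecture.

```python
# Hypothetical two-state MDP; every name and number here is an assumption.
# dynamics[(s, a)] maps (next_state, reward) -> probability.
dynamics = {
    ("low", "search"): {("low", 1.0): 0.6, ("high", -3.0): 0.4},
    ("low", "wait"): {("low", 0.5): 1.0},
    ("high", "search"): {("high", 1.0): 0.7, ("low", 1.0): 0.3},
    ("high", "wait"): {("high", 0.5): 1.0},
}


def p(s_next, r, s, a):
    """Return p(s', r | s, a); outcomes not listed have probability 0."""
    return dynamics[(s, a)].get((s_next, r), 0.0)


def is_normalized(table, tol=1e-9):
    """Check that the outcome probabilities sum to 1 for every (s, a) pair."""
    return all(
        abs(sum(outcomes.values()) - 1.0) <= tol
        for outcomes in table.values()
    )
```

Calling `is_normalized(dynamics)` checks the summation constraint above for every state-action pair in the table.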

## The Goal of Reinforcement Learning

### Formal definition

$$ \text{Return } G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots $$

The **return** is a random variable, since the dynamics of the MDP can be stochastic. A random variable cannot be maximized directly, so the agent maximizes its expected value instead, the **expected return**.

$$ \mathbb{E}[{G_t}] = \mathbb{E}[R_{t+1} + R_{t+2} + R_{t+3} + \dots ] $$

For the expected return to be well defined, the sum of rewards must be finite.
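
As a small illustration of the expected return, the sketch below averages the undiscounted return over a handful of possible reward sequences, weighted by their probabilities. The trajectory list is a hypothetical example, not data from the course.

```python
# Hypothetical distribution over reward sequences (probability, rewards).
# The probabilities are assumed to sum to 1.
trajectories = [
    (0.5, [1.0, 0.0, 2.0]),
    (0.3, [0.0, 0.0, 1.0]),
    (0.2, [-1.0, 1.0]),
]


def expected_return(weighted_trajectories):
    """Probability-weighted average of the undiscounted return G_t."""
    return sum(prob * sum(rewards) for prob, rewards in weighted_trajectories)
```

Each individual return `sum(rewards)` is one sample of the random variable $G_t$; `expected_return` collapses them into the single number the agent tries to maximize.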

### Episodic task

![episodic](image/episodic_task.png)

The interaction naturally breaks into chunks called episodes, and each episode begins independently of how the previous one ended. Each episode ends in a terminal state, and at termination the agent is reset to a start state.

## Reward Hypothesis

### Reinforcement-learning hypothesis

*Intelligent behavior arises from the actions of an individual seeking to maximize its received reward signals in a complex and changing world.*

- Research program
    - identify where reward signals come from,
    - develop algorithms that search the space of behaviors to maximize reward signals.
    
### Goals as Rewards

- 1 for goal, 0 otherwise
    - goal-reward representation
- -1 for not goal, 0 once goal reached
    - action-penalty representation
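
The two representations can be compared with a short sketch. It assumes a hypothetical task in which the agent reaches the goal on step `steps_to_goal`, and it assumes that the step that reaches the goal is the one rewarded with 1 (goal-reward) or 0 (action-penalty); the function names are illustrative.

```python
def goal_reward(reached_goal):
    """Goal-reward representation: 1 for the goal, 0 otherwise."""
    return 1.0 if reached_goal else 0.0


def action_penalty(reached_goal):
    """Action-penalty representation: -1 until the goal, 0 once it is reached."""
    return 0.0 if reached_goal else -1.0


def undiscounted_return(reward_fn, steps_to_goal):
    """Sum of rewards when the goal is reached on step `steps_to_goal`."""
    return sum(
        reward_fn(step == steps_to_goal)
        for step in range(1, steps_to_goal + 1)
    )
```

Under these assumptions, the goal-reward return is the same no matter how long the agent takes, while the action-penalty return gets worse with every extra step, so without discounting only the second one pushes the agent to reach the goal quickly.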
    
### Reward Expression

- Programming
    - coding
    - Human-in-the-loop
- Examples
    - Mimic reward
    - Inverse reinforcement learning
- Optimization
    - Evolutionary optimization
    - meta RL
    
### Challenges to the Hypothesis

- Target is something other than expected cumulative reward:
    - How to represent risk-sensitive behavior?
    - How to capture diversity in behavior?
- Good match for high-level human behavior?
    - Blind reward pursuers aren't good people.
    - We create our "purpose" over years, lifetimes.

## Continuing tasks

### Difference between episodic task and continuing task

- Episodic task
    - Interaction breaks naturally into **episodes**
    - Each episode ends in a **terminal state**
    - Episodes are **independent**
    
$$ G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T $$

- Continuing task
    - Interaction goes on **continually**
    - No terminal state
    
$$ G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots = \infty (\text{?}) $$

### Discounting

So how can we make $G_t$ finite? One solution is to **discount** future rewards by a factor $\gamma$, where $0 \le \gamma < 1$.

$$ \begin{aligned} G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{k-1} R_{t+k} + \dots \\ &= \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \end{aligned} $$
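
A minimal sketch of this sum for a finite list of rewards, where `rewards[k]` is assumed to hold $R_{t+k+1}$ (the function name and the list layout are illustrative choices):

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k], with rewards[k] = R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))
```

For example, `discounted_return([1, 1, 1], 0.5)` adds up $1 + 0.5 + 0.25$.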

$G_t$ is finite as long as $0 \le \gamma < 1$ and the rewards are bounded.

Assume $R_{max}$ is the maximum reward the agent can receive. Then:

$$ \begin{aligned} G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \le \sum_{k=0}^{\infty} \gamma^k R_{max} &= R_{max} \sum_{k=0}^{\infty} \gamma^k \\ &= R_{max} \times \frac{1}{1 - \gamma} \end{aligned} $$

The last step holds because the geometric series $\sum_{k=0}^{\infty} \gamma^k$ converges to $\frac{1}{1 - \gamma}$ when $\gamma \lt 1$.
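
The bound can be checked numerically. The sketch below compares truncated sums of $R_{max} \gamma^k$ with the closed form $R_{max} / (1 - \gamma)$; the function names are illustrative.

```python
def return_upper_bound(r_max, gamma):
    """Closed-form bound R_max / (1 - gamma), valid for 0 <= gamma < 1."""
    if not 0 <= gamma < 1:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    return r_max / (1 - gamma)


def partial_sum(r_max, gamma, n_terms):
    """First n_terms of the series: sum of r_max * gamma**k for k < n_terms."""
    return sum(r_max * gamma**k for k in range(n_terms))
```

As `n_terms` grows, `partial_sum` approaches `return_upper_bound` from below, which is the convergence the derivation relies on.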

### Effect of discount factor on agent behavior

$$ G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{k-1} R_{t+k} + \dots $$

If $\gamma = 0$, 

$$ \begin{aligned} G_t &\doteq R_{t+1} + 0 R_{t+2} + 0^2 R_{t+3} + \dots + 0^{k-1} R_{t+k} + \dots \\
&= R_{t+1} \end{aligned}$$

In this case, the agent only cares about the immediate reward: it is a **short-sighted agent!**

If $\gamma \rightarrow 1$, the agent takes future rewards into account more strongly: it is a **far-sighted agent!**
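
To see the effect on behavior, the sketch below lets an agent choose between two hypothetical reward sequences: a small reward now or a larger reward two steps later. The option names and reward values are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k]."""
    return sum(gamma**k * r for k, r in enumerate(rewards))


def preferred_option(options, gamma):
    """Return the name of the reward sequence with the largest discounted return."""
    return max(options, key=lambda name: discounted_return(options[name], gamma))


# Hypothetical choices: a reward of 1 immediately, or a reward of 5 after a delay.
options = {
    "now": [1.0, 0.0, 0.0],
    "later": [0.0, 0.0, 5.0],
}
```

With `gamma = 0.0` the delayed reward is worth nothing, so `preferred_option` picks `"now"`; with `gamma = 0.9` the delayed reward still counts for most of its value, so it picks `"later"`.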

### Recursive nature of returns

$$ \begin{aligned} G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots \\
&= R_{t+1} + \gamma ( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots) \\
&= R_{t+1} + \gamma G_{t+1} \end{aligned} $$
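
The recursion $G_t = R_{t+1} + \gamma G_{t+1}$ suggests computing all returns of a finished episode in one backward pass. This is a minimal sketch, assuming `rewards[t]` holds $R_{t+1}$ and that the return after the final reward is zero.

```python
def returns_from_rewards(rewards, gamma):
    """Compute G_t for every step t using G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g_next = 0.0  # return after the last reward is assumed to be 0
    for t in reversed(range(len(rewards))):
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns
```

For example, `returns_from_rewards([1, 1, 1], 0.5)` walks backward from the last step: $G_2 = 1$, $G_1 = 1 + 0.5 \cdot 1$, and $G_0 = 1 + 0.5 \cdot 1.5$.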

## Examples of Epi