# Elements of RL

ref: http://www.incompleteideas.net/book/bookdraft2017nov5.pdf, Chapter 1.3

1. a policy
1. a reward signal
1. a value function
1. a model of environment (optional)

* Model-based methods: solve RL problems using a model of the environment together with **planning**. Planning is possible because the model predicts how the environment will respond to actions.
* Model-free methods: no model of the environment and no planning. They learn explicitly by trial and error, which makes them almost the opposite of **planning**.
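
The contrast can be sketched on a toy problem. The block below is only an illustration under stated assumptions: the two-state MDP `P`, the helper names, and the hyperparameters are all hypothetical, and value iteration and tabular Q-learning are picked here as representative planning and trial-and-error methods, not taken from the reference.

```python
import random

# Hypothetical 2-state, 2-action MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9


def plan_with_model(P, gamma, sweeps=200):
    """Model-based: value iteration reads P directly, no interaction needed."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            for s in P
        }
    return V


def step(s, a, rng):
    """The environment as a black box: one sampled (next_state, reward)."""
    outcomes = P[s][a]
    weights = [p for p, _, _ in outcomes]
    _, s2, r = rng.choices(outcomes, weights=weights)[0]
    return s2, r


def learn_without_model(gamma, steps=20000, alpha=0.1, epsilon=0.2, seed=0):
    """Model-free: tabular Q-learning that only ever calls step()."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    s = 0
    for _ in range(steps):
        # Epsilon-greedy action selection: explore with probability epsilon.
        if rng.random() < epsilon:
            a = rng.choice(list(Q[s]))
        else:
            a = max(Q[s], key=Q[s].get)
        s2, r = step(s, a, rng)
        # Move Q(s, a) toward the sampled one-step target.
        Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
        s = s2
    return Q
```

`plan_with_model` never interacts with the environment, while `learn_without_model` never looks inside `P`; it only observes sampled transitions through `step`.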

# Elements of an MDP

ref: http://proceedings.mlr.press/v32/silver14.pdf


1. $\mathcal{S}$, state space
1. $\mathcal{A}$, action space
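1. $\pi_{\theta}$: $\mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$, policy, used to select actions in the MDP. $\mathcal{P}(\mathcal{A})$ is the set of probability measures on $\mathcal{A}$, and $\theta$ is the vector of parameters characterizing the policy, e.g. the weights when the policy is modeled by a neural network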
1. $p(s_{t+1}|s_t, a_t)$, stationary transition dynamics distribution, satisfying the Markov property: $p(s_{t+1}|s_1, a_1, \cdots, s_t, a_t) = p(s_{t+1}|s_t, a_t)$
1. $r$: $\mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, reward function 
1. $p_1(s_1)$, initial state distribution
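
For a finite state and action space, these elements map naturally onto a small data structure. The following is a minimal sketch, assuming tabular (dictionary-based) distributions; the class name `TabularMDP`, its field names, and the helpers `sample` and `rollout` are hypothetical and chosen only for illustration. It also shows how a policy and the transition dynamics together generate a sequence of (state, action, reward) triples.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
Action = int


@dataclass
class TabularMDP:
    """Finite MDP; field names are illustrative, not from the cited paper."""
    states: List[State]                                         # S
    actions: List[Action]                                       # A
    transition: Dict[Tuple[State, Action], Dict[State, float]]  # p(s' | s, a)
    reward: Callable[[State, Action], float]                    # r(s, a)
    initial: Dict[State, float]                                 # p_1(s_1)


def sample(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def rollout(mdp, policy, T, rng):
    """Sample T steps of (state, action, reward) triples.

    policy(s) returns a {action: probability} dictionary, i.e. pi(. | s).
    """
    trajectory = []
    s = sample(mdp.initial, rng)
    for _ in range(T):
        a = sample(policy(s), rng)
        r = mdp.reward(s, a)
        trajectory.append((s, a, r))
        # Markov property: the next state depends only on (s, a).
        s = sample(mdp.transition[(s, a)], rng)
    return trajectory


# Hypothetical two-state example with deterministic dynamics.
mdp = TabularMDP(
    states=[0, 1],
    actions=[0, 1],
    transition={
        (0, 0): {0: 1.0},
        (0, 1): {1: 1.0},
        (1, 0): {0: 1.0},
        (1, 1): {1: 1.0},
    },
    reward=lambda s, a: float(s == 1 and a == 1),
    initial={0: 1.0},
)
always_one = lambda s: {1: 1.0}  # deterministic policy: always take action 1
h = rollout(mdp, always_one, T=3, rng=random.Random(0))
# h == [(0, 1, 0.0), (1, 1, 1.0), (1, 1, 1.0)]
```

A parameterized policy $\pi_{\theta}$ would replace `always_one` with a function whose output probabilities depend on $\theta$.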

More concepts:

1. $h_{1:T} = (s_1, a_1, r_1), \cdots, (s_T, a_T, r_T)$, trajectory, where $r_t = r(s_t, a_t)$
1. $\gamma^{t-1} r(s_t, a_t)$, discounted reward at time step $t$
1. $r_t^{\gamma} = \sum_{k=t}^{\infty} \gamma^{k - t} r(s_k, a_k)$, total discounted reward (the return) from time step $t$ onwards, with discount factor $\gamma \in [0, 1)$. Note that $r$ is a function, whereas $r_t^{\gamma}$ is not: it is a random variable whose value depends on the sampled trajectory
1. $V^{\pi}(s) = \mathbb{E}[r_1^{\gamma}|S_1=s; \pi]$, state value function under policy $\pi$
1. $Q^{\pi}(s, a) = \mathbb{E}[r_1^{\gamma}|S_1=s, A_1=a; \pi]$, state-action value function under policy $\pi$
1. $J(\pi) = \mathbb{E}[r_1^{\gamma}|\pi]$, performance objective. The goal of RL is to find a policy that maximizes $J$, i.e. the expected total discounted reward from the start state
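
The return $r_t^{\gamma}$ and the expectation in $V^{\pi}$ can be made concrete with a short sketch. It assumes finite reward sequences, so the infinite sum is truncated at the episode length $T$; the function names are hypothetical. The backward pass relies on the recursion $r_t^{\gamma} = r(s_t, a_t) + \gamma r_{t+1}^{\gamma}$, which follows directly from the definition of the sum.

```python
def discounted_return(rewards, gamma, t=1):
    """r_t^gamma for a finite reward list [r_1, ..., r_T] (1-indexed t).

    Truncating the infinite sum at T is an assumption made for illustration.
    """
    return sum(
        gamma ** (k - t) * rewards[k - 1]
        for k in range(t, len(rewards) + 1)
    )


def all_returns(rewards, gamma):
    """Compute r_t^gamma for every t in one backward pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


def mc_value_estimate(reward_sequences, gamma):
    """Monte Carlo estimate of V^pi(s): average r_1^gamma over rollouts
    that all start in s and follow pi."""
    totals = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(totals) / len(totals)


print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(all_returns([1.0, 1.0, 1.0], 0.5))        # [1.75, 1.5, 1.0]
```

Averaging the same quantity over rollouts whose start state is drawn from $p_1$ gives a sample-based estimate of $J(\pi)$.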

Another formulation (with slightly different naming conventions), from

https://arxiv.org/pdf/1706.05374.pdf

An MDP is characterized by the tuple $(S, A, R, p, p_0, \gamma)$

# Concepts

1. Episode
1. Terminal state
1. Episodic task
1. Continuing task
1. Discounting
1. Absorbing state
1. State-value function for policy $\pi$: $v_{\pi}$
1. Action-value function for policy $\pi$: $q_{\pi}$