# Control with Approximation

> This is a summary of the lecture "Prediction and Control with Function Approximation" from Coursera.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Coursera, Reinforcement_Learning]
- image: 

## Episodic SARSA with Function Approximation

### State-values to action-values

$$ v_{\pi}(s) \approx \hat{v}(s, w) \doteq w^Tx(s) \\
q_{\pi}(s, a) \approx \hat{q}(s, a, w) \doteq w^Tx(s, a) $$

### Representing actions

$x(s) = \begin{bmatrix} x_0(s) \\ x_1(s) \\ x_2(s) \\ x_3(s) \end{bmatrix} \\ \mathcal{A}(s) = \{a_0, a_1, a_2\}$

$x(s, a) = \begin{bmatrix} x_0(s) \\ x_1(s) \\ x_2(s) \\ x_3(s) \\ x_0(s) \\ x_1(s) \\ x_2(s) \\ x_3(s) \\ x_0(s) \\ x_1(s) \\ x_2(s) \\ x_3(s) \end{bmatrix}$

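This representation is called **stacked features**: $x(s, a)$ holds one copy of the state features per action. For a particular action $a_i$, only the $i$-th block contains $x(s)$ and the other blocks are set to zero, so each action gets its own segment of the weight vector $w$.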
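As a rough illustration of stacked features, the sketch below builds $x(s, a)$ for one action and evaluates $\hat{q}(s, a, w) = w^T x(s, a)$. The helper name `stacked_features` and the numeric values are made up for this example.

```python
import numpy as np

def stacked_features(x_s, action, num_actions):
    """Return x(s, a): x(s) copied into the block for `action`, zeros elsewhere."""
    x_s = np.asarray(x_s, dtype=float)
    d = x_s.shape[0]
    x_sa = np.zeros(d * num_actions)
    x_sa[action * d:(action + 1) * d] = x_s
    return x_sa

x_s = np.array([1.0, 0.0, 2.0, 3.0])   # made-up state features x(s)
w = np.arange(12, dtype=float)          # made-up weights: one block of 4 per action

x_sa = stacked_features(x_s, action=1, num_actions=3)
q_a1 = w @ x_sa                         # only the weights in block 1 contribute
print(x_sa, q_a1)
```

Because the inactive blocks are zero, updating $w$ for one action leaves the weights of the other actions untouched.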

### Episodic Semi-gradient SARSA for Estimating $\hat{q} \approx q_{*}$
$\begin{aligned}
&\text{Input: a differentiable action-value function parameterization } \hat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R} \\
&\text{Algorithm parameters: step size } \alpha > 0, \text{ small } \epsilon > 0 \\
&\text{Initialize value-function weights } w \in \mathbb{R}^d \text{ arbitrarily (e.g., } w = 0 \text{)} \\
\newline
&\text{Loop for each episode:} \\
&\quad S, A \leftarrow \text{ initial state and action of episode (e.g., } \epsilon \text{-greedy)} \\
&\quad \text{Loop for each step of episode:} \\
&\qquad \text{Take action } A, \text{ observe } R, S' \\
&\qquad \text{If } S' \text{ is terminal:} \\
&\qquad \quad w \leftarrow w + \alpha[R - \hat{q}(S, A, w)] \nabla \hat{q}(S, A, w) \\
&\qquad \quad \text{Go to next episode} \\
&\qquad \text{Choose } A' \text{ as a function of } \hat{q}(S', \cdot, w) \text{ (e.g., } \epsilon \text{-greedy)} \\
&\qquad w \leftarrow w + \alpha[R + \gamma \hat{q}(S', A', w) - \hat{q}(S, A, w)] \nabla \hat{q}(S, A, w) \\
&\qquad S \leftarrow S' \\
&\qquad A \leftarrow A' \\
\end{aligned}$
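A minimal sketch of the algorithm above with a linear $\hat{q}$, for which $\nabla \hat{q}(S, A, w) = x(S, A)$. The corridor environment (`step`), the one-hot stacked features, and the hyperparameter values are assumptions made for illustration; they are not part of the lecture.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2   # hypothetical corridor: states 0..4, actions 0=left, 1=right

def step(s, a):
    """Toy dynamics: reward -1 per step, the episode ends on reaching state 4."""
    s2 = max(s - 1, 0) if a == 0 else s + 1
    return -1.0, s2, s2 == N_STATES - 1

def x(s, a):
    """Stacked one-hot features for the pair (s, a)."""
    v = np.zeros(N_STATES * N_ACTIONS)
    v[a * N_STATES + s] = 1.0
    return v

def q_hat(s, a, w):
    return w @ x(s, a)

def eps_greedy(s, w, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    q = np.array([q_hat(s, a, w) for a in range(N_ACTIONS)])
    return int(rng.choice(np.flatnonzero(q == q.max())))  # break ties at random

def semi_gradient_sarsa(episodes=200, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(episodes):
        s = 0
        a = eps_greedy(s, w, eps, rng)
        while True:
            r, s2, done = step(s, a)
            grad = x(s, a)   # gradient of a linear q_hat is the feature vector
            if done:
                # Terminal transition: the target is just R
                w += alpha * (r - q_hat(s, a, w)) * grad
                break
            a2 = eps_greedy(s2, w, eps, rng)
            w += alpha * (r + gamma * q_hat(s2, a2, w) - q_hat(s, a, w)) * grad
            s, a = s2, a2
    return w

w = semi_gradient_sarsa()
print(np.round(w.reshape(N_ACTIONS, N_STATES), 2))  # rows: actions, columns: states
```

The update is called semi-gradient because the target $R + \gamma \hat{q}(S', A', w)$ is treated as a constant: only $\hat{q}(S, A, w)$ is differentiated.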

## Expected SARSA with Function Approximation

### From SARSA to Expected SARSA

SARSA:

$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big) $

Expected SARSA:

$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big(R_{t+1} + \gamma \sum\limits_{a'} \pi(a' \vert S_{t+1} ) Q(S_{t+1}, a') - Q(S_t, A_t) \big) $

### Expected SARSA with Function Approximation

SARSA:

$ w \leftarrow w + \alpha \big(R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w) - \hat{q}(S_t, A_t, w)\big) \nabla \hat{q}(S_t, A_t, w) $

Expected SARSA:

$ w \leftarrow w + \alpha \big(R_{t+1} + \gamma \sum\limits_{a'} \pi(a' \vert S_{t+1}) \hat{q}(S_{t+1}, a', w) - \hat{q}(S_t, A_t, w) \big) \nabla \hat{q}(S_t, A_t, w) $

### Expected SARSA to Q-learning

$ w \leftarrow w + \alpha \big( R_{t+1} + \gamma \max\limits_{a'} \hat{q}(S_{t+1}, a', w) - \hat{q}(S_t, A_t, w)\big) \nabla \hat{q}(S_t, A_t, w) $
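The three updates differ only in the target that replaces the return. The sketch below computes each target from a vector of next-state action values; the function names and numbers are made up for illustration. With a greedy target policy ($\epsilon = 0$) the Expected SARSA target reduces to the Q-learning target.

```python
import numpy as np

def eps_greedy_probs(q_next, eps):
    """pi(a' | S_{t+1}) for an epsilon-greedy policy over the values q_next."""
    n = len(q_next)
    probs = np.full(n, eps / n)
    probs[int(np.argmax(q_next))] += 1.0 - eps
    return probs

def sarsa_target(r, gamma, q_next, a_next):
    # Uses the value of the action actually sampled in the next state
    return r + gamma * q_next[a_next]

def expected_sarsa_target(r, gamma, q_next, probs):
    # Averages the next-state values under the policy
    return r + gamma * np.dot(probs, q_next)

def q_learning_target(r, gamma, q_next):
    # Maximizes over next actions (greedy target policy)
    return r + gamma * np.max(q_next)

q_next = np.array([1.0, 3.0, 2.0])   # made-up values of q_hat(S_{t+1}, a', w)
r, gamma = 0.5, 0.9

print(sarsa_target(r, gamma, q_next, a_next=2))
print(expected_sarsa_target(r, gamma, q_next, eps_greedy_probs(q_next, eps=0.3)))
print(q_learning_target(r, gamma, q_next))
```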

## Exploration under Function Approximation

### Epsilon-Greedy

![eg](image/epsilon_greedy.png)

## Average Reward - A New Way of Formulating Control Problems

### Simple example

![](image/continuing_task.png)

There are two deterministic policies: traverse the left ring or traverse the right ring. The reward is zero everywhere except for one transition in each ring. In the left ring, the reward is +1 immediately after state $S$. In the right ring, the reward is +2 immediately before returning to state $S$.

For the left policy, summing the discounted rewards as a geometric series gives

$v_{L}(S) = \frac{1}{1 - \gamma^5}$

For the right policy,

$v_{R}(S) = \frac{2 \gamma^4}{1 - \gamma^5}$

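The two values are equal when

$\frac{1}{1 - \gamma^5} = \frac{2 \gamma^4}{1 - \gamma^5} \iff 2 \gamma^4 = 1$

$v_{R}(S) > v_{L}(S) \text{ when } \gamma > 2^{-\frac{1}{4}} \approx 0.841$

The right policy collects twice as much reward per loop, yet the discounted objective prefers it only when $\gamma$ is large enough. In general, the discount factor needed to find the better policy depends on the problem and may have to be quite large.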
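A quick numerical check of the two discounted values, assuming rings of five steps as in the formulas above. The crossover is the $\gamma$ that solves $2\gamma^4 = 1$, which is about 0.841.

```python
def v_left(gamma):
    """Discounted value of S under the left policy: +1 right after S, every 5 steps."""
    return 1.0 / (1.0 - gamma**5)

def v_right(gamma):
    """Discounted value of S under the right policy: +2 delayed by 4 steps, every 5 steps."""
    return 2.0 * gamma**4 / (1.0 - gamma**5)

threshold = 0.5 ** 0.25   # solves 2 * gamma**4 = 1

for gamma in (0.5, 0.8, 0.9, 0.99):
    better = "right" if v_right(gamma) > v_left(gamma) else "left"
    print(f"gamma={gamma}: v_L={v_left(gamma):.3f}, v_R={v_right(gamma):.3f}, prefers {better}")
print(f"crossover at gamma = {threshold:.3f}")
```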

### The Average Reward objective

$$ r(\pi) \doteq \lim\limits_{h \to \infty} \frac{1}{h} \sum\limits_{t=1}^h \mathbb{E}[R_t \vert S_0, A_{0:t-1} \sim \pi] $$

Using the steady-state distribution over states $\mu_{\pi}$, we can rewrite this as

$$ r(\pi) = \sum\limits_s \mu_{\pi}(s) \sum\limits_{a} \pi(a \vert s) \sum\limits_{s', r} p(s', r \vert s, a) r $$
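Applied to the two-ring example, the formula gives the average reward of each policy directly. The sketch assumes each ring is a deterministic cycle of five states, so the steady-state distribution is uniform over the ring; the array layout is made up for this illustration.

```python
import numpy as np

def average_reward(mu, expected_reward):
    """r(pi) = sum_s mu_pi(s) * E[R | s], with the policy and dynamics folded into E[R | s]."""
    return float(np.dot(mu, expected_reward))

mu = np.full(5, 1 / 5)                           # uniform visitation over a 5-state ring

r_left = average_reward(mu, [1, 0, 0, 0, 0])     # +1 on the transition leaving S
r_right = average_reward(mu, [0, 0, 0, 0, 2])    # +2 on the transition returning to S
print(r_left, r_right)
```

Under this objective the right policy is preferred without having to choose a discount factor.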

### Returns for Average Reward

$$G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \dots $$

This $G_t$ is called the **differential return**: each term $R_{t+k} - r(\pi)$ is a reward measured relative to the average reward.

### Value Functions for Average Reward

$q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \vert S_t = s, A_t = a] $

$q_{\pi}(s, a) = \sum\limits_{s', r}p(s', r \vert s, a) ( r - r(\pi) + \sum\limits_{a'} \pi(a' \vert s') q_{\pi}(s', a'))$

### Differential semi-gradient SARSA for estimating $\hat{q} \approx q_{*}$

$\begin{aligned}
&\text{Input: a differentiable action-value function parameterization } \hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R} \\
&\text{Algorithm parameters: step sizes } \alpha, \beta > 0 \\
&\text{Initialize value-function weights } w \in \mathbb{R}^d \text{ arbitrarily (e.g., } w=0 \text{)} \\
&\text{Initialize average reward estimate } \bar{R} \in \mathbb{R} \text{ arbitrarily (e.g., } \bar{R} = 0 \text{)} \\
\newline
&\text{Initialize state } S, \text{ and action } A \\
&\text{Loop for each step: } \\
&\quad \text{Take action } A, \text{ observe } R, S' \\
&\quad \text{Choose } A' \text{ as a function of } \hat{q}(S', \cdot, w) \text{ (e.g., } \epsilon\text{-greedy)} \\
&\quad \delta \leftarrow R - \bar{R} + \hat{q}(S', A', w) - \hat{q}(S, A, w) \\
&\quad \bar{R} \leftarrow \bar{R} + \beta \delta \\
&\quad w \leftarrow w + \alpha \delta \nabla \hat{q}(S, A, w) \\
&\quad S \leftarrow S' \\
&\quad A \leftarrow A' \\
\end{aligned}$
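A minimal sketch of the loop above with a linear $\hat{q}$ on a toy version of the two-ring task. The state numbering, the `step` function, the one-hot stacked features, and the step sizes are assumptions made for illustration.

```python
import numpy as np

N_STATES, N_ACTIONS = 9, 2   # hypothetical layout: 0 is S, 1-4 left ring, 5-8 right ring

def step(s, a):
    """Toy continuing dynamics; the action only matters in state 0."""
    if s == 0:
        return (1.0, 1) if a == 0 else (0.0, 5)   # +1 right after S on the left ring
    if s == 4:
        return 0.0, 0
    if s == 8:
        return 2.0, 0                              # +2 right before S on the right ring
    return 0.0, s + 1

def x(s, a):
    """Stacked one-hot features for the pair (s, a)."""
    v = np.zeros(N_STATES * N_ACTIONS)
    v[a * N_STATES + s] = 1.0
    return v

def eps_greedy(s, w, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    q = np.array([w @ x(s, a) for a in range(N_ACTIONS)])
    return int(rng.choice(np.flatnonzero(q == q.max())))  # break ties at random

def differential_sarsa_update(w, r_bar, x_sa, x_next, r, alpha, beta):
    """One step of the update; returns the new w, the new r_bar, and delta."""
    delta = r - r_bar + w @ x_next - w @ x_sa     # no discount factor
    return w + alpha * delta * x_sa, r_bar + beta * delta, delta

def run(steps=20000, alpha=0.1, beta=0.01, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w, r_bar = np.zeros(N_STATES * N_ACTIONS), 0.0
    s = 0
    a = eps_greedy(s, w, eps, rng)
    for _ in range(steps):
        r, s2 = step(s, a)
        a2 = eps_greedy(s2, w, eps, rng)
        w, r_bar, _ = differential_sarsa_update(w, r_bar, x(s, a), x(s2, a2), r, alpha, beta)
        s, a = s2, a2
    return w, r_bar

w, r_bar = run()
print(f"average reward estimate: {r_bar:.3f}")
```

Compared with the episodic version, there is no terminal case and no $\gamma$; the running estimate $\bar{R}$ takes the place of discounting.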

## Intrinsic Rewards (Satinder Singh)

### Preferences-Parameters Confound

- The (most common) starting point is an **agent-designer** who has an objective reward function that specifies preferences over agent behavior (it is often far too sparse and delayed)

- What should the agent's reward function be?

- A single reward function **confounds two** roles (from the designer's point of view) simultaneously in RL agents

    1. (Preferences) It expresses the agent-designer's preferences over behaviors
    2. (Parameters) Through the reward hypothesis, it expresses the RL agent's goals/purposes and becomes parameters of actual agent behavior
    
These roles seem distinct; should they be confounded?

### Revised Autonomous Agent

![](image/autonomous_agent.png)

Agent reward is internal to the agent.

Parameters to be designed by agent-designer!

### Approaches to designing reward

- Inverse Reinforcement Learning (Ng et al.)
    
    - Designer/operator demonstrates optimal behavior
    - Clever algorithms for automatically determining the **set of** reward functions such that the observed behavior is optimal (e.g., Bayesian IRL; Ramachandran & Amir)
    - Ideal: Set agent reward = objective reward (i.e., preserve the preferences-parameters confound)
    
- Reward shaping 
    - (Ng et al.) agent reward = objective reward + potential-based reward (breaks PP confound)
    - Objective: To achieve agent with objective reward's asymptotic behavior faster! [Also Bayesian Reward Shaping by Ramachandran et al.]
    
- Preference Elicitation (Ideal: preserves PP confound)
- Mechanism Design (in Economics)
- Other heuristic Approaches

### Optimal Reward Problem

- There are two reward functions

    1. Agent-designer's: objective reward $R_o$ (given)
    2. Agent's: internal reward $R_i$
    
$\begin{aligned}
&\text{Agent } G(R_i; \Theta) \text{ in Environment (Env) produces (random) interaction }\\ &h \sim \langle Env, G(R_i; \Theta) \rangle \\
&\text{Utility of interaction } h \text{ to agent is } U_i(h) = \sum_t R_i(h_t) \\
&\text{Utility to agent designer is } U_o(h) = \sum_t R_o(h_t) \\
&\text{Optimal Reward } R^{*}_i = \arg\max_{R_i \in \{R_i\}} \mathbb{E}_{Env}\{\mathbb{E}_{h \sim \langle Env, G(R_i; \Theta) \rangle} \{U_o(h)\}\} 
\end{aligned}$

This is a nested optimization: an outer reward optimization wrapped around an inner policy optimization.

### Policy Gradient for Reward Design (PGRD) (Sorg, Singh, Lewis; NIPS 2010)

- Insight: In planning agents, the reward function parameterizes the agent's policy

![](image/pgrd.png)

PGRD optimizes the reward function via a standard policy gradient approach (OLPOMDP)

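- PGRD approximates the gradient in two parts. By the chain rule, the gradient of the objective return w.r.t. the reward parameters,

$ \nabla_{\theta} \mathbb{E}[R_o(\tau) \vert Agent(R(\cdot; \theta))] $

combines the following two terms: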

- $\nabla_{\mu} \mathbb{E}[R_o (\tau) \vert \mu]$
    
    - Gradient of performance w.r.t. the policy
    - Approximated by OLPOMDP
        
- $ \nabla_{\theta} \mu(s, a; \theta)$
    
    - Gradient of policy w.r.t. the reward parameters
    - Accounts for the planning procedure
    - The sub-gradient of $Q^D$ w.r.t. $\theta$:
        
$\nabla_{\theta} Q^D (s, a) = \nabla_{\theta} R(s, a; \theta) + \sum\limits_{s', a'} T(s' \vert s, a) \pi(a' \vert s') \nabla_{\theta} Q^{D-1}(s', a')$
        