# 5. Reinforcement Learning

![model based and model free](https://raw.githubusercontent.com/Weizhuo-Zhang/CS188_AI/master/cs188_notes/pics/model_based_model_free.png?token=AHRFC2JG6OIGIPI6DKNKIJC5MDW6E)

## Model-Based Learning
### Model-Based Idea:
- Learn an approximate model based on experiences
- Solve for values as if the learned model were correct

### Step 1: Learn empirical MDP model
- Count outcomes $s'$ for each $s,a$
- Normalize to give an estimate of $\hat{T}(s,a,s')$
- Discover each $\hat{R}(s,a,s')$ when we experience $(s,a,s')$

### Step 2: Solve the learned MDP
- For example, use value iteration, as before
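
The counting and normalizing in Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `learn_empirical_mdp`, the tuple layout `(s, a, s', r)`, and the toy transitions are all assumptions made for the example, and it assumes the reward for a given $(s,a,s')$ is deterministic.

```python
from collections import defaultdict

def learn_empirical_mdp(transitions):
    """Estimate T-hat and R-hat from observed (s, a, s_next, r) tuples."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s_next]
    R_hat = {}                                      # R_hat[(s, a, s_next)]
    for s, a, s_next, r in transitions:
        counts[(s, a)][s_next] += 1
        # Assumption: rewards are deterministic, so the last observation suffices.
        R_hat[(s, a, s_next)] = r

    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T_hat[(s, a, s_next)] = n / total       # normalize the counts
    return T_hat, R_hat

# Hypothetical experience: "right" from A reached B twice and C once.
transitions = [
    ("A", "right", "B", 0.0),
    ("A", "right", "B", 0.0),
    ("A", "right", "C", -1.0),
    ("B", "right", "exit", 10.0),
]
T_hat, R_hat = learn_empirical_mdp(transitions)
```

The resulting `T_hat` and `R_hat` can then be handed to any MDP solver for Step 2.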

## Passive Reinforcement Learning
### Simplified Task: policy evaluation
- Input: a fixed policy $\pi(s)$
- You don't know the transitions $T(s,a,s')$
- You don't know the rewards $R(s,a,s')$
- **Goal: learn the state values**

### In this case
- Learner is "along for the ride"
- No choice about what actions to take
- Just execute the policy and learn from experience
- This is NOT offline planning! You actually take actions in the world

---

## Temporal Difference Learning
### Big idea: Learn from every experience
- Update $V(s)$ each time we experience a transition $(s,a,s',r)$
- Likely outcomes $s'$ will contribute updates more often

### Temporal Difference Learning
- Policy still fixed, still doing evaluation!
- Move values toward value of whatever successor occurs: running average

**Sample of V(s):** $$sample = R(s,\pi(s),s')+\gamma V^\pi(s')$$

**Update to V(s):** $$V^\pi(s) \gets (1-\alpha)V^\pi(s)+\alpha\cdot sample$$

**Same update:** $$V^\pi(s)\gets V^\pi(s)+\alpha(sample-V^\pi(s))$$
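
As a rough sketch, a single TD update might look like the following in Python. The function name `td_update`, the dictionary representation of $V^\pi$, and the example numbers are assumptions for illustration only.

```python
def td_update(V, s, s_next, reward, alpha=0.5, gamma=1.0):
    """Apply one TD update to V[s] after observing (s, pi(s), s_next, reward)."""
    sample = reward + gamma * V.get(s_next, 0.0)   # sample of V(s)
    old = V.get(s, 0.0)
    V[s] = old + alpha * (sample - old)            # move V(s) toward the sample
    return V[s]

# Hypothetical values: V(A) = 0, V(B) = 8, and a transition A -> B with reward -2.
V = {"A": 0.0, "B": 8.0}
td_update(V, "A", "B", reward=-2.0, alpha=0.5, gamma=1.0)
# sample = -2 + 1.0 * 8 = 6, so V["A"] becomes 0 + 0.5 * (6 - 0) = 3.0
```

Note that the update needs neither $T$ nor $R$ as functions: it only uses the one transition that was actually experienced.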

## Exponential Moving Average
- The running interpolation update: $\overline{x}_n=(1-\alpha)\cdot\overline{x}_{n-1}+\alpha\cdot x_n$
- Makes recent samples more important
$$\overline{x}_n=\frac{x_n+(1-\alpha)\cdot x_{n-1}+(1-\alpha)^2\cdot x_{n-2}+...}{1+(1-\alpha)+(1-\alpha)^2+...}$$
- Forgets about the past (distant past values were wrong anyway)
- **Decreasing learning rate ($\alpha$) can give converging averages**

The TD update is this running average applied to samples of $V^\pi(s)$:

$$V^\pi(s) \gets (1-\alpha)V^\pi(s)+\alpha\left[R(s,\pi(s),s')+\gamma V^\pi(s')\right]$$
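
A small Python sketch can make the effect of $\alpha$ concrete. Both function names and the sample list are made up for this example; the second function uses $\alpha_n = 1/n$ as one possible decreasing schedule, which reduces to the ordinary mean.

```python
def exponential_moving_average(samples, alpha, initial=0.0):
    """Return the running average after each sample, using a fixed alpha."""
    mean = initial
    history = []
    for x in samples:
        mean = (1 - alpha) * mean + alpha * x   # recent samples weigh more
        history.append(mean)
    return history

def decaying_average(samples):
    """Running average with alpha_n = 1/n, which equals the plain mean."""
    mean = 0.0
    for n, x in enumerate(samples, start=1):
        alpha = 1.0 / n
        mean = (1 - alpha) * mean + alpha * x
    return mean

samples = [10.0, 10.0, 0.0]
fixed = exponential_moving_average(samples, alpha=0.5)   # [5.0, 7.5, 3.75]
decayed = decaying_average(samples)                      # 20/3, the plain mean
```

With a fixed $\alpha$ the last sample drags the average down sharply, while the decaying schedule weights all three samples equally.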

---

## Active Reinforcement Learning

In active reinforcement learning, the agent chooses its own actions, collecting experience while it learns Q-values.

Full reinforcement learning: optimal policies (like value iteration)
- You don't know the transitions T(s,a,s')
- You don't know the rewards R(s,a,s')
- You choose the actions now
- <font color="red">Goal: learn the optimal policy / values</font>

In this case:
- Learner makes choices!
- Fundamental tradeoff: exploration vs. exploitation
- This is NOT offline planning! You actually take actions in the world and find out what happens...


## Detour: Q-Value Iteration
Value iteration: find successive (depth-limited) values
- Start with V<sub>0</sub>(s)=0, which we know is right
- Given V<sub>k</sub>, calculate the depth k+1 values for all states:

$$V_{k+1}(s)\leftarrow\max_a\sum_{s'} T(s,a,s')[R(s,a,s')+\gamma V_k(s')]$$
    
But Q-Values are more useful, so compute them instead
- Start with Q<sub>0</sub>(s,a)=0, which we know is right
- Given Q<sub>k</sub>, calculate the depth k+1 q-values for all q-states:

$$Q_{k+1}(s,a)\leftarrow\sum_{s'} T(s,a,s')[R(s,a,s')+\gamma \max_{a'}Q_k(s',a')]$$
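
The Q-value iteration update above could be sketched as follows, assuming the model is known. The data layout is an assumption for this example: `T[(s, a)]` is a list of `(s_next, prob)` pairs, `R` is a plain function, and a state with no entries in `T` is treated as terminal. The two-state MDP is hypothetical.

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    """Repeatedly apply the Q_{k+1} update, starting from Q_0 = 0."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iterations):
        new_Q = {}
        for s in states:
            for a in actions:
                # Expectation over s' of reward plus discounted best next Q-value.
                new_Q[(s, a)] = sum(
                    prob * (R(s, a, s_next)
                            + gamma * max(Q[(s_next, a2)] for a2 in actions))
                    for s_next, prob in T.get((s, a), [])
                )
        Q = new_Q
    return Q

# Hypothetical MDP: from A, "go" ends the episode (+10), "stay" loops back (+1).
states = ["A", "done"]
actions = ["stay", "go"]
T = {("A", "go"): [("done", 1.0)], ("A", "stay"): [("A", 1.0)]}
R = lambda s, a, s_next: 10.0 if a == "go" else 1.0

Q = q_value_iteration(states, actions, T, R, gamma=0.5, iterations=20)
# Q[("A", "go")] = 10, and Q[("A", "stay")] = 1 + 0.5 * 10 = 6
```

The greedy policy can be read off directly as the action with the largest `Q[(s, a)]`, with no further lookahead through `T`.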
    
----

## Exploration vs. Exploitation

- [**Exploration**](#Exploration): you have to try unknown actions to get information
- **Exploitation**: eventually, you have to use what you know

## Exploration

### $\varepsilon$-greedy random actions (simplest)
- **Steps:**
  - Every time step, flip a coin
  - With (small) probability $\varepsilon$, act randomly
  - With (large) probability $1-\varepsilon$, act on current policy

- **Problems with random actions**
  - You do eventually explore the space, but keep thrashing around once learning is done
  - One solution: lower $\varepsilon$ over time
  - Another solution: [exploration functions](#Exploration-Functions)
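
The coin flip described above, plus one way of lowering $\varepsilon$ over time, might be sketched like this. The function names, the Q-table keyed by `(state, action)`, and the exponential decay schedule with its constants are all assumptions chosen for illustration.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit

def decayed_epsilon(step, start=1.0, decay=0.99, floor=0.01):
    """One possible schedule: exponential decay down to a small floor."""
    return max(floor, start * decay ** step)

# Hypothetical Q-values for a single state.
Q = {("A", "left"): 0.2, ("A", "right"): 0.7}
actions = ["left", "right"]
rng = random.Random(0)

greedy_choice = epsilon_greedy(Q, "A", actions, epsilon=0.0, rng=rng)
random_choice = epsilon_greedy(Q, "A", actions, epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the agent always exploits; with `epsilon=1.0` every action is random.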

### Exploration Functions
- Random actions: explore a fixed amount
- Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
- An exploration function takes a value estimate $u$ and a visit count $n$ and returns an optimistic utility, e.g., $f(u,n)=u+k/n$
  - Regular Q-update: $Q(s,a)\leftarrow_\alpha R(s,a,s')+\gamma\max_{a'}Q(s',a')$
  - Modified Q-update: $Q(s,a)\leftarrow_\alpha R(s,a,s')+\gamma\max_{a'}f(Q(s',a'),N(s',a'))$
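
Both updates can be sketched with one Python function, where setting `k=0` recovers the regular Q-update. The names `exploration_value` and `q_update` are hypothetical, and the sketch uses $k/(n+1)$ rather than $k/n$ purely to avoid dividing by zero for unvisited q-states; that substitution is an assumption of this example, not part of the notes.

```python
from collections import defaultdict

def exploration_value(u, n, k):
    """Optimistic utility; uses k / (n + 1) so that n = 0 is well defined."""
    return u + k / (n + 1)

def q_update(Q, N, s, a, s_next, reward, actions, alpha=0.5, gamma=0.9, k=0.0):
    """One Q-update from (s, a, s_next, reward); k > 0 adds the exploration bonus."""
    N[(s, a)] += 1                                   # count this visit
    best_next = max(
        exploration_value(Q[(s_next, a2)], N[(s_next, a2)], k) for a2 in actions
    )
    sample = reward + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q[(s, a)]

actions = ["left", "right"]

# Regular update (k = 0): unvisited successors contribute nothing extra.
Q1, N1 = defaultdict(float), defaultdict(int)
regular = q_update(Q1, N1, "A", "right", "B", 1.0, actions, alpha=0.5, gamma=1.0)

# Modified update (k = 2): unvisited successors look optimistic.
Q2, N2 = defaultdict(float), defaultdict(int)
modified = q_update(Q2, N2, "A", "right", "B", 1.0, actions,
                    alpha=0.5, gamma=1.0, k=2.0)
```

Here `regular` is 0.5 and `modified` is 1.5: the bonus from the unexplored state B propagates back to A, which draws the agent toward B.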

## Regret
- **Regret measures how quickly your learner reaches the optimal solution; a faster learner accumulates less regret, which is better.**
- Even if you learn the optimal policy, you still make mistakes along the way!
- Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
- Minimizing regret goes beyond learning to be optimal - it requires optimally learning to be optimal
- Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret
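
As a toy illustration, regret can be computed as the summed gap between the optimal expected reward and what the learner actually earned in expectation at each step. The per-step framing, the function name `cumulative_regret`, and the reward sequences are invented for this sketch and do not come from any experiment.

```python
def cumulative_regret(optimal_expected_reward, expected_rewards):
    """Sum of per-step gaps between optimal and achieved expected reward."""
    return sum(optimal_expected_reward - r for r in expected_rewards)

# Hypothetical learners: both end up optimal (1.0 per step), at different speeds.
fast_learner = [0.2, 0.5, 1.0, 1.0]
slow_learner = [0.2, 0.2, 0.5, 1.0]

fast_regret = cumulative_regret(1.0, fast_learner)   # 0.8 + 0.5
slow_regret = cumulative_regret(1.0, slow_learner)   # 0.8 + 0.8 + 0.5
```

Both learners finish with the optimal behavior, yet the slower one has paid more along the way.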

# Q&A
- What is the complexity of Q-learning?
- Does Q-learning have to explore as much of the state space as solving the MDP does? (online vs. offline)