### Model Learning
- Goal: estimate model $M_\eta$ from experience $\{ S_1,A_1,R_2, \dots, S_T \}$
- This is a supervised learning problem
$$
S_1,A_1 \rightarrow R_2, S_2 \\
S_2,A_2 \rightarrow R_3, S_3 \\
\vdots \\
S_{T-1},A_{T-1} \rightarrow R_T, S_T
$$
- Learning $s,a \rightarrow r$ is a regression problem
- Learning $s,a \rightarrow s^\prime$ is a density estimation problem
- Pick a loss function, e.g. mean-squared error, KL divergence
- Find parameters $\eta$ that minimise the empirical loss
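
As a minimal sketch of the regression case, the snippet below fits a linear reward model $r \approx w x + b$ by minimising mean-squared error in closed form. The scalar feature $x$ standing in for a state-action pair and the function name `fit_reward_model` are assumptions made for illustration; the notes do not prescribe a model class.

```python
def fit_reward_model(samples):
    """Least-squares fit of r ~ w * x + b.

    `samples` is a list of (x, r) pairs, where x is a hypothetical scalar
    feature of a state-action pair and r is the observed reward.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    # Closed-form minimiser of the empirical mean-squared error.
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    cov_xr = sum((x - mean_x) * (r - mean_r) for x, r in samples)
    w = cov_xr / var_x
    b = mean_r - w * mean_x
    return w, b


def predict_reward(params, x):
    """Predicted reward for feature x under the fitted parameters."""
    w, b = params
    return w * x + b
```

Here the pair `(w, b)` plays the role of $\eta$. A transition model would need a density estimator instead, which this sketch does not cover.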

### Table Lookup Model
- Model is an explicit MDP, $\hat{P}$, $\hat{R}$
- Count visits $N(s,a)$ to each state-action pair
$$
\begin{align}
\hat{P}^a_{s,s^\prime} & = \frac{1}{N(s,a)}\sum^{T-1}_{t=1} \boldsymbol{1}(S_t,A_t,S_{t+1}=s,a,s^\prime) \\
\hat{R}^a_s & = \frac{1}{N(s,a)}\sum^{T-1}_{t=1}\boldsymbol{1}(S_t,A_t=s,a)R_{t+1}
\end{align}
$$
- Alternatively
  - At each time-step $t$, record the experience tuple $\langle S_t,A_t,R_{t+1},S_{t+1} \rangle$
  - To sample from the model, randomly pick a tuple matching $\langle s, a, \cdot, \cdot \rangle$
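
Both variants can be sketched in a few lines. The class below keeps counts for the empirical $\hat{P}$ and $\hat{R}$, and also stores the raw tuples so the model can be sampled directly. The class and method names are hypothetical, chosen only for this illustration.

```python
import random
from collections import defaultdict


class TableLookupModel:
    """Count-based model: empirical transition probabilities and mean rewards."""

    def __init__(self):
        self.visits = defaultdict(int)                            # N(s, a)
        self.next_counts = defaultdict(lambda: defaultdict(int))  # count of s' per (s, a)
        self.reward_sum = defaultdict(float)                      # summed reward per (s, a)
        self.tuples = defaultdict(list)                           # stored (r, s') per (s, a)

    def update(self, s, a, r, s_next):
        """Record one experience tuple <s, a, r, s_next>."""
        key = (s, a)
        self.visits[key] += 1
        self.next_counts[key][s_next] += 1
        self.reward_sum[key] += r
        self.tuples[key].append((r, s_next))

    def transition_prob(self, s, a, s_next):
        """Empirical P(s_next | s, a); 0.0 if (s, a) was never visited."""
        n = self.visits.get((s, a), 0)
        if n == 0:
            return 0.0
        return self.next_counts[(s, a)].get(s_next, 0) / n

    def expected_reward(self, s, a):
        """Empirical mean reward for (s, a); 0.0 if never visited."""
        n = self.visits.get((s, a), 0)
        if n == 0:
            return 0.0
        return self.reward_sum[(s, a)] / n

    def sample(self, s, a, rng=random):
        """Sample (r, s_next) by picking a stored tuple matching <s, a, ., .>."""
        return rng.choice(self.tuples[(s, a)])
```

Sampling stored tuples reproduces the empirical distribution without ever normalising counts, which is why the two variants are interchangeable for planning by sampling.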

### Dyna
- Learn a model from real experience
- Learn and plan value function (and/or policy) from real and simulated experience

### Dyna-Q Algorithm
$$
\begin{align}
& \text{Initialize }Q(s,a)\text{ and }Model(s,a)\text{ for all }s \in S\text{ and }a \in A(s) \\
& \text{Do forever:} \\
& \hspace{10mm} \text{(a) }S \leftarrow\text{ current (nonterminal) state} \\
& \hspace{10mm} \text{(b) }A \leftarrow \epsilon\text{-greedy}(S,Q) \\
& \hspace{10mm} \text{(c) Execute action }A\text{; observe resultant reward, }R\text{, and state, }S^\prime \\
& \hspace{10mm} \text{(d) }Q(S,A) \leftarrow Q(S,A) + \alpha [R + \gamma \max_a Q(S^\prime,a) - Q(S,A)] \\
& \hspace{10mm} \text{(e) }Model(S,A) \leftarrow R,S^\prime\text{ (assuming deterministic environment)} \\
& \hspace{10mm} \text{(f) Repeat }n\text{ times} \\
& \hspace{20mm} S \leftarrow \text{random previously observed state} \\
& \hspace{20mm} A \leftarrow \text{random action previously taken in }S \\
& \hspace{20mm} R,S^\prime \leftarrow Model(S,A) \\
& \hspace{20mm} Q(S,A) \leftarrow Q(S,A) + \alpha [R + \gamma \max_a Q(S^\prime,a) - Q(S,A)]
\end{align}
$$
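
The pseudocode above translates fairly directly into Python. The sketch below is one possible rendering, not a reference implementation: `chain_step` is a hypothetical deterministic environment invented for the example, the loop is episodic rather than "forever", and the model also stores a `done` flag so that terminal transitions bootstrap from zero.

```python
import random
from collections import defaultdict


def chain_step(state, action, n_states=5):
    """Hypothetical deterministic chain: action 1 moves right, action 0 moves left.
    Reaching the last state gives reward 1 and ends the episode."""
    if action == 1:
        next_state = min(state + 1, n_states - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == n_states - 1
    return (1.0 if done else 0.0), next_state, done


def dyna_q(step, actions, start, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q(s, a), initialised to 0
    model = {}               # Model(s, a) -> (r, s', done)

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return rng.choice(actions)
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    def q_update(s, a, r, s_next, done):
        bootstrap = 0.0 if done else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = epsilon_greedy(s)                   # (b)
            r, s_next, done = step(s, a)            # (c) real experience
            q_update(s, a, r, s_next, done)         # (d) direct RL update
            model[(s, a)] = (r, s_next, done)       # (e) deterministic model
            for _ in range(n_planning):             # (f) planning from the model
                ps, pa = rng.choice(list(model))    # previously observed (s, a)
                pr, ps_next, pdone = model[(ps, pa)]
                q_update(ps, pa, pr, ps_next, pdone)
            s = s_next
    return Q


Q = dyna_q(chain_step, actions=(0, 1), start=0)
```

Sampling a stored `(s, a)` key uniformly is a small simplification of step (f), which samples a state first and then an action taken in it.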

### Simple Monte-Carlo Search
- Given a model $M_v$ and a simulation policy $\pi$
- For each action $a \in A$
  - Simulate $K$ episodes from current (real) state $s_t$
  $$
  \{s_t,a,R^k_{t+1},S^k_{t+1},A^k_{t+1},\dots,S^k_T\}^K_{k=1} \sim M_v,\pi
  $$
  - Evaluate actions by mean return (Monte-Carlo evaluation)
  $$
  Q(s_t,a) = \frac{1}{K} \sum^K_{k=1} G^k_t \stackrel{P}\rightarrow q_\pi(s_t,a)
  $$
- Select current (real) action with maximum value
$$
\DeclareMathOperator*{\argmax}{arg\,max}
a_t = \argmax\limits_{a \in A} Q(s_t,a)
$$
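
A minimal sketch of this procedure follows. The interfaces are assumptions for the example: `model_step(state, action, rng)` returns `(reward, next_state, done)`, `rollout_policy(state, rng)` returns an action, and `toy_model` is a hypothetical random-walk model with a reward of +1 at position 2 and -1 at position -2. Rollouts are capped at `max_depth` steps, which the notes do not mention.

```python
import random


def toy_model(state, action, rng):
    """Hypothetical model: walk on a line; +1 reward at +2, -1 reward at -2."""
    nxt = state + action
    if nxt == 2:
        return 1.0, nxt, True
    if nxt == -2:
        return -1.0, nxt, True
    return 0.0, nxt, False


def random_policy(state, rng):
    """Fixed simulation policy: step left or right uniformly at random."""
    return rng.choice((-1, 1))


def simple_mc_search(state, actions, model_step, rollout_policy,
                     n_episodes=100, gamma=1.0, max_depth=50, rng=None):
    rng = rng or random.Random(0)
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(n_episodes):
            # The first action is fixed to a; the simulation policy picks the rest.
            s, act, ret, discount = state, a, 0.0, 1.0
            for _ in range(max_depth):
                r, s, done = model_step(s, act, rng)
                ret += discount * r
                discount *= gamma
                if done:
                    break
                act = rollout_policy(s, rng)
            total += ret
        q[a] = total / n_episodes   # mean return over K simulated episodes
    best = max(actions, key=lambda a: q[a])
    return best, q


best_action, q_values = simple_mc_search(0, (-1, 1), toy_model, random_policy,
                                         n_episodes=200)
```

Because the simulation policy never changes, the estimates approach $q_\pi$ for the rollout policy, not $q_\ast$.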

### Monte-Carlo Tree Search (Evaluation)
- Given a model $M_v$
- Simulate $K$ episodes from current state $s_t$ using current simulation policy $\pi$
$$
\{ s_t, A^k_t, R^k_{t+1}, S^k_{t+1},\dots,S^k_T\}^K_{k=1} \sim M_v,\pi
$$
- Build a search tree containing visited states and actions
- Evaluate state-action values $Q(s,a)$ by the mean return of episodes from $s, a$
$$
Q(s,a) = \frac{1}{N(s,a)}\sum^K_{k=1}\sum^T_{u=t}\boldsymbol{1}(S_u,A_u=s,a)G_u\stackrel{P}\rightarrow q_\pi(s,a)
$$
- After search is finished, select current (real) action with maximum value in search tree
$$
\DeclareMathOperator*{\argmax}{arg\,max}
a_t = \argmax\limits_{a \in A} Q(s_t,a)
$$

### Monte-Carlo Tree Search (Simulation)
- In MCTS, the simulation policy $\pi$ improves
- Each simulation consists of two phases (in-tree, out-of-tree)
  - Tree policy (improves): pick action to maximise $Q(s, a)$
  - Default policy (fixed): pick action randomly
- Repeat (each simulation)
  - Evaluate state-action values $Q(S,A)$ by Monte-Carlo evaluation
  - Improve tree policy, e.g. by $\epsilon\text{-greedy}(Q)$
- Monte-Carlo control applied to simulated experience
- Converges on the optimal search tree, $Q(S,A) \rightarrow q_\ast(S,A)$
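
The two-phase simulation can be sketched as below, using an $\epsilon$-greedy tree policy as in the notes (practical implementations often use UCB-style selection instead, which is not shown here). The model `toy_step`, the rule of adding one new state to the tree per simulation, and the `max_depth` cap are assumptions made for this example.

```python
import random
from collections import defaultdict


def toy_step(state, action):
    """Hypothetical deterministic model on a line: stepping to -1 ends the
    episode with reward -1, stepping to +2 ends it with reward +1."""
    nxt = state + action
    if nxt == -1:
        return -1.0, nxt, True
    if nxt == 2:
        return 1.0, nxt, True
    return 0.0, nxt, False


def mcts(root, actions, model_step, n_simulations=500, epsilon=0.1,
         gamma=0.9, max_depth=50, seed=0):
    rng = random.Random(seed)
    N = defaultdict(int)     # visit counts N(s, a)
    Q = defaultdict(float)   # mean return Q(s, a)
    tree = {root}            # states currently in the search tree

    for _ in range(n_simulations):
        s, in_tree = root, True
        trajectory = []      # (state, action, reward, counted) per step
        for _ in range(max_depth):
            if in_tree and s in tree:
                # Tree policy (improves): epsilon-greedy on the current Q.
                if rng.random() < epsilon:
                    a = rng.choice(actions)
                else:
                    a = max(actions, key=lambda b: Q[(s, b)])
                counted = True
            else:
                # Default policy (fixed): uniform random action.
                a = rng.choice(actions)
                counted = in_tree        # only the newly expanded state is counted
                if in_tree:
                    tree.add(s)          # expand the tree by one state
                    in_tree = False
            r, s_next, done = model_step(s, a)
            trajectory.append((s, a, r, counted))
            s = s_next
            if done:
                break

        # Monte-Carlo evaluation: back up returns along the counted steps.
        G = 0.0
        for s_u, a_u, r_u, counted in reversed(trajectory):
            G = r_u + gamma * G
            if counted:
                N[(s_u, a_u)] += 1
                Q[(s_u, a_u)] += (G - Q[(s_u, a_u)]) / N[(s_u, a_u)]

    best = max(actions, key=lambda b: Q[(root, b)])
    return best, Q


best_action, Q = mcts(root=0, actions=(1, -1), model_step=toy_step)
```

With a constant $\epsilon$ the tree policy keeps exploring, so the estimates track the $\epsilon$-greedy policy's values; the convergence to $q_\ast$ stated above assumes exploration is reduced appropriately over time.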