#### **Access to MDP Dynamics**
* **Reversible vs. Irreversible Access:**

  * **Model (Reversible Access):** In a model, we can query the MDP dynamics at any time and get the probability distribution $p(s' | s, a)$, i.e., the probability of transitioning to state $s'$ given action $a$ in state $s$.
  * **Environment (Irreversible Access):** In an environment, after taking an action $a$, we cannot undo it. We must move forward and observe the resulting state $s'$, making it irreversible.
* **Distribution vs. Sample Models:**

  * **Distribution Models:** You get the full probability distribution over next states. This is more complete, but requires more knowledge or computation.
  * **Sample Models:** You get only one sample of the next state, i.e., one possible outcome drawn from the distribution.
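
The difference between the two model types can be sketched in a few lines of Python. This is only an illustration: the two-state MDP, its probabilities, and the function names `distribution_model` and `sample_model` are assumptions made for the example.

```python
import random

# Hypothetical MDP with one state-action pair; the numbers are illustrative only.
P = {("s0", "a"): {"s0": 0.25, "s1": 0.75}}

def distribution_model(s, a):
    """Distribution model: return the full distribution p(s' | s, a)."""
    return P[(s, a)]

def sample_model(s, a, rng):
    """Sample model: return a single next state drawn from p(s' | s, a)."""
    dist = P[(s, a)]
    states = list(dist)
    weights = [dist[s2] for s2 in states]
    return rng.choices(states, weights=weights, k=1)[0]

rng = random.Random(0)
full = distribution_model("s0", "a")                 # every outcome with its probability
draws = [sample_model("s0", "a", rng) for _ in range(1000)]
freq_s1 = draws.count("s1") / len(draws)             # only approximates p(s1 | s0, a)
```

A distribution model can always be turned into a sample model by drawing from it, while the reverse direction only gives an approximation from many samples.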


#### **Planning vs. Learning**

The distinction between **planning** and **learning** lies in two factors:

<img src="../../Files/fourth-semester/rl/10.png" alt="Planning vs. Learning" width="400">

* **Access to MDP Dynamics:**

  * **Planning:** Uses a model with reversible access, allowing for queries on any state-action pair.
  * **Reinforcement Learning (RL):** Uses an environment with irreversible access, where you must proceed through the environment step-by-step.
* **Solution Representation:**

  * **Planning:** Stores a **local solution**, i.e., a solution only for the current state (or a small subset of states around it).
  * **RL:** Stores a **global solution**, e.g., a value function or policy defined for all states.

The boundary between planning and learning can be summarized as:

* **Planning**: Uses a local solution with reversible access to the MDP.
* **Model-based RL**: Uses a global solution with reversible access (such as Dynamic Programming).
* **Model-free RL**: Uses a global solution with irreversible access (e.g., Q-learning).

|                            | Local solution               | Global solution              |
|----------------------------|------------------------------|------------------------------|
| **Reversible MDP access**   | Planning (e.g., MCTS)        | Borderline/Model-based RL (e.g., Dynamic Programming) |
| **Irreversible MDP access** | (impossible)                 | Model-free RL (e.g., Q-learning) |

---


### **Back-ups:**

In reinforcement learning, a **back-up** refers to how we update the value function or action-value function when estimating the value of a state or action.

* **Expected Back-ups:**

  * These are primarily used in **planning**. In planning, we have a model (reversible access to MDP dynamics), so we can fully calculate the expected value of the next state.
  * In the diagram, the **expected back-up** computes the **Bellman equation** using probabilities (e.g., the transition probabilities and rewards).
  * For instance, when calculating the value $V(s)$, you sum over all possible actions $a$ and all possible next states $s'$, weighted by the transition probabilities $p(s'|s,a)$ and rewards $r(s,a,s')$:

    $$
    V(s) = \sum_{a} \pi(a|s) \sum_{s'} p(s'|s,a) \left[r(s,a,s') + \gamma V(s')\right]
    $$
  * This is done with **expectation**, meaning you are considering all possible transitions and rewards.

* **Sample Back-ups:**

  * These are mostly used in **reinforcement learning** (RL), where you often do not have a model of the environment. Instead, you experience one transition at a time and use that experience to update your estimates.
  * In the diagram, the **sample back-up** for the **TD(0)** or **Sarsa** algorithm, for example, involves using a single sample (one observed next state $s'$ and reward $r$) to update the estimate for $V(s)$ or $Q(s,a)$.
  * In **TD(0)** or **Sarsa**, the update is done based on actual experiences rather than expectations. The update rule in Sarsa (for state-action values) is:

    $$
    Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma Q(s',a') - Q(s,a) \right)
    $$
  * The difference here is that you only get a sample of what actually happened, rather than considering the entire distribution of possible outcomes.
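
To make the contrast concrete, here is a minimal sketch of both back-ups for a single state-action pair. The transition probabilities, rewards, and value estimates are made-up numbers, and `expected_backup` and `sample_backup` are names chosen for this example. The expected back-up shown is the inner sum of the Bellman equation above for one fixed action.

```python
gamma = 0.9

# Hypothetical model for one fixed pair (s, a): two possible next states.
p = {"x": 0.5, "y": 0.5}   # p(s' | s, a)
r = {"x": 1.0, "y": 0.0}   # r(s, a, s')
V = {"x": 2.0, "y": 4.0}   # current value estimates V(s')

def expected_backup(p, r, V, gamma):
    """Sum over all next states, weighted by their transition probabilities."""
    return sum(p[s2] * (r[s2] + gamma * V[s2]) for s2 in p)

def sample_backup(q_sa, reward, q_next, alpha, gamma):
    """Sarsa-style update from one observed transition (s, a, r, s', a')."""
    return q_sa + alpha * (reward + gamma * q_next - q_sa)

# Expected back-up: needs the model, uses every outcome.
q_expected = expected_backup(p, r, V, gamma)   # 0.5*(1 + 1.8) + 0.5*(0 + 3.6)

# Sample back-up: suppose we observed s' = "x" with reward 1 and Q(s', a') = 2.
q_sampled = sample_backup(q_sa=0.0, reward=1.0, q_next=2.0, alpha=0.5, gamma=gamma)
```

The expected back-up gives the exact one-step target in a single computation, whereas the sample back-up moves the estimate a fraction $\alpha$ towards one noisy target and needs many samples to average out the randomness.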

### **Back-up Diagrams:**

The back-up diagrams visually show how updates are made in the **state values** $V(s)$ and **state-action values** $Q(s,a)$. The diagrams correspond to different types of updates.

* **On-policy vs Off-policy:**

  * **On-policy**: The values are updated towards the policy that is actually being followed, i.e., the behaviour policy and the target policy are the same (e.g., **Sarsa**, whose target uses the action $a'$ that the agent really takes next).
  * **Off-policy**: The values are updated towards a different policy than the one being followed (e.g., **Q-learning**, whose target uses the greedy value $\max_{a'} Q(s',a')$ even while the agent explores with an $\epsilon$-greedy policy).
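
The two kinds of targets differ only in which next action they evaluate. The sketch below uses hypothetical Q-values and helper names (`sarsa_target`, `q_learning_target`) to show this for one transition in which the agent's next action is exploratory.

```python
gamma = 0.9
actions = ["left", "right"]

# Hypothetical current estimates for the next state s1.
Q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}

def sarsa_target(reward, s_next, a_next):
    """On-policy: bootstrap from the action the agent actually takes next."""
    return reward + gamma * Q[(s_next, a_next)]

def q_learning_target(reward, s_next):
    """Off-policy: bootstrap from the greedy action, whatever the agent does next."""
    return reward + gamma * max(Q[(s_next, a)] for a in actions)

# The agent explores and picks "left" in s1, although "right" looks better.
t_sarsa = sarsa_target(0.0, "s1", "left")   # uses Q(s1, left)
t_q = q_learning_target(0.0, "s1")          # uses max over actions, i.e. Q(s1, right)
```

When the next action happens to be the greedy one, both targets coincide.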


### **1-Step vs. Multi-Step Back-ups:**

* **1-Step Back-ups (Shallow Updates):**

  * These updates involve only a single step (transition from one state to the next) and are **shallow** in terms of the depth of the update.
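  * **TD(0)**, **Sarsa**, and **Q-learning** are examples of shallow updates: they use the immediate reward and then bootstrap from the current value estimate of the next state.
  * These updates are cheap and can be applied after every transition, but the target depends on the current (possibly inaccurate) estimate of the next state's value, so information about long-term effects propagates only one step per update.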

* **Multi-Step Back-ups (Deep Updates):**

  * **Multi-step updates** involve looking further into the future, considering multiple steps in the environment (depth).
  * For example, **n-step TD** and **TD(λ)** use several observed rewards before bootstrapping, while **Monte Carlo** learning goes to the end of the episode and uses the complete return without bootstrapping.
  * These updates are **deeper**: they rely less on possibly inaccurate value estimates (lower bias), but their targets have higher variance and can only be computed after more steps have been observed.
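
The depth of a back-up can be seen in how the update target is built. The helper `n_step_return` below is a sketch written for these notes (the reward sequence and bootstrap values are made up): it sums the discounted observed rewards and then bootstraps from a value estimate.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """G = r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * V(s_n)."""
    g = bootstrap_value
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]   # hypothetical rewards observed along one trajectory

# 1-step (TD(0)-style): one reward, then bootstrap from V(s_1) = 5.
one_step = n_step_return(rewards[:1], bootstrap_value=5.0, gamma=0.5)

# 3-step: three rewards, then bootstrap from V(s_3) = 4.
three_step = n_step_return(rewards, bootstrap_value=4.0, gamma=0.5)

# Monte Carlo: the episode ends after three steps, so there is nothing to bootstrap from.
monte_carlo = n_step_return(rewards, bootstrap_value=0.0, gamma=0.5)
```

The only difference between the three targets is how many real rewards are used before the value estimate takes over.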

<img src="../../Files/fourth-semester/rl/11.png" alt="Planning vs. Learning" width="600">
<img src="../../Files/fourth-semester/rl/12.png" alt="Planning vs. Learning" width="600">

### **Relation to Algorithms:**

* **TD(0)** is a shallow update that uses **sample back-ups** (one-step sample updates).
* **Monte Carlo** is a **multi-step** (deep) sample back-up method: it updates towards the return of an entire episode.
* **Dynamic Programming (DP)** uses one-step **expected back-ups**, swept over the entire state space, and is typically used in planning where the model is fully known.

### **Notes:**
* **Expected back-ups** are used in **planning** where you have a model and can compute the full expected value over the next state.
* **Sample back-ups** are used in **RL** when you only have access to sampled experiences.
* **Shallow updates** look at immediate outcomes (1-step), while **deep updates** consider longer-term effects (multi-step).


---

# **To be reviewed again from Slides**

## Points from Lecture Notes

#### **Tabular Model Learning**

In a tabular model, we learn both the **transition dynamics** and the **reward function** by collecting samples of transitions:

* **Transition counts:** $n(s, a, s')$ is the number of times we transition from state $s$ to $s'$ using action $a$.
* **Reward sums:** $R_{\text{sum}}(s, a, s')$ is the sum of rewards obtained when transitioning from $s$ to $s'$ using action $a$.

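Using these counts, and writing $n(s, a) = \sum_{s'} n(s, a, s')$ for the total number of times action $a$ was taken in state $s$, we can estimate: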

* **Transition model:**

  $$
  \hat{p}(s'|s, a) = \frac{n(s, a, s')}{n(s, a)}
  $$
* **Reward model:**

  $$
  \hat{r}(s, a, s') = \frac{R_{\text{sum}}(s, a, s')}{n(s, a, s')}
  $$
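
A minimal sketch of such a tabular model in Python is shown below. The class name `TabularModel` and its method names are chosen for this example, and returning 0 for unseen transitions is an assumption (the estimates are undefined before any data has been collected).

```python
from collections import defaultdict

class TabularModel:
    """Count-based estimates of p(s'|s,a) and r(s,a,s')."""

    def __init__(self):
        self.n = defaultdict(int)         # n(s, a, s')
        self.n_sa = defaultdict(int)      # n(s, a) = sum over s' of n(s, a, s')
        self.r_sum = defaultdict(float)   # R_sum(s, a, s')

    def update(self, s, a, r, s_next):
        """Record one observed transition (s, a, r, s')."""
        self.n[(s, a, s_next)] += 1
        self.n_sa[(s, a)] += 1
        self.r_sum[(s, a, s_next)] += r

    def p_hat(self, s, a, s_next):
        """Estimated transition probability; 0.0 if (s, a) was never tried."""
        total = self.n_sa.get((s, a), 0)
        return self.n.get((s, a, s_next), 0) / total if total else 0.0

    def r_hat(self, s, a, s_next):
        """Estimated mean reward; 0.0 if the transition was never observed."""
        count = self.n.get((s, a, s_next), 0)
        return self.r_sum.get((s, a, s_next), 0.0) / count if count else 0.0

model = TabularModel()
for r, s_next in [(1.0, 1), (3.0, 1), (0.0, 2)]:   # three made-up transitions from (0, "a")
    model.update(0, "a", r, s_next)
```

After these three updates, `model.p_hat(0, "a", 1)` is 2/3 and `model.r_hat(0, "a", 1)` is the average of the two observed rewards.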

#### **Dyna Algorithm**

Dyna is a model-based RL algorithm that combines learning and planning:

* First, we **learn a model** of the environment.
* Then, we use the model to simulate **one-step planning updates** to our value function.

The algorithm updates the Q-values using both actual environment samples and simulated samples from the learned model. This helps improve **data efficiency** by leveraging the model.

Algorithm details for Dyna Q-learning:

1. Initialize Q-values, transition counts, and reward sums.
2. For each timestep:

   * Take an action using an $\epsilon$-greedy policy.
   * Observe the transition and update the model.
   * Perform Q-learning updates using real experiences.
   * Perform planning updates using simulated transitions from the model.
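
The steps above can be sketched as follows. This is one possible reading of the algorithm, not a reference implementation: the four-state corridor environment `step`, the hyperparameter values, and the choice to sample planning state-action pairs uniformly from those already visited are all assumptions made for the example.

```python
import random
from collections import defaultdict

ACTIONS = [0, 1]  # 0 = left, 1 = right

def step(s, a):
    """Hypothetical corridor with states 0..3; reaching state 3 gives reward 1 and ends the episode."""
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    done = s_next == 3
    return (1.0 if done else 0.0), s_next, done

def dyna_q(episodes=50, planning_steps=10, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                       # 1. initialize Q-values,
    n = defaultdict(lambda: defaultdict(int))    #    transition counts n[(s, a)][s'],
    r_sum = defaultdict(float)                   #    and reward sums R_sum(s, a, s')
    terminal = set()                             # terminal states seen so far

    def q_update(s, a, r, s2):
        bootstrap = 0.0 if s2 in terminal else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            r, s2, done = step(s, a)
            # Update the learned model with the real transition.
            n[(s, a)][s2] += 1
            r_sum[(s, a, s2)] += r
            if done:
                terminal.add(s2)
            # Q-learning update from real experience.
            q_update(s, a, r, s2)
            # Planning: Q-learning updates on transitions simulated from the model.
            for _ in range(planning_steps):
                ps, pa = rng.choice(list(n))
                next_counts = n[(ps, pa)]
                ps2 = rng.choices(list(next_counts), weights=list(next_counts.values()), k=1)[0]
                pr = r_sum[(ps, pa, ps2)] / next_counts[ps2]
                q_update(ps, pa, pr, ps2)
            s = s2
    return Q

Q = dyna_q()
```

Setting `planning_steps=0` turns this sketch into plain Q-learning, which makes it easy to compare how many real environment steps each variant needs.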

#### **Prioritized Sweeping**

Prioritized sweeping spends its planning updates where they matter most, by identifying the **state-action pairs whose values are expected to change the most**:

* When the Q-value estimate for a state-action pair changes significantly, the state-action pairs that lead to that state (its predecessors) should also be updated.
* This prioritization concentrates computation on the updates with the largest effect, instead of sweeping uniformly over all state-action pairs.

The algorithm maintains a **priority queue** of state-action pairs, ordered by the magnitude of their TD error (the difference between the back-up target $r + \gamma \max_{a'} Q(s',a')$ and the current estimate $Q(s,a)$). Pairs with larger errors are updated first. After each update, it performs a **backward search** through a reverse model to find the predecessors whose values may now change, and adds them to the queue.
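
The queue-and-predecessor mechanism can be sketched on a tiny example. The version below is deliberately simplified and makes several assumptions: the model is a hypothetical deterministic three-step chain that is already known, updates use a full step size ($\alpha = 1$), the threshold `theta` is an arbitrary choice, and duplicate queue entries are not handled.

```python
import heapq

gamma, theta = 0.9, 1e-3

# Hypothetical deterministic model: model[(s, a)] = (reward, next_state).
model = {(0, "go"): (0.0, 1), (1, "go"): (0.0, 2), (2, "go"): (1.0, 3)}
terminal = {3}
Q = {sa: 0.0 for sa in model}

# Reverse model: predecessors[s'] lists the (s, a) pairs that lead to s'.
predecessors = {}
for (s, a), (_, s2) in model.items():
    predecessors.setdefault(s2, []).append((s, a))

def state_value(s):
    """max_a Q(s, a), or 0 for terminal states."""
    if s in terminal:
        return 0.0
    return max(q for (s_, _), q in Q.items() if s_ == s)

def td_error(s, a):
    r, s2 = model[(s, a)]
    return r + gamma * state_value(s2) - Q[(s, a)]

queue = []  # heapq is a min-heap, so priorities are stored negated

def push(s, a):
    """Queue (s, a) if its TD error is large enough to be worth an update."""
    priority = abs(td_error(s, a))
    if priority > theta:
        heapq.heappush(queue, (-priority, s, a))

push(2, "go")            # the rewarding transition has just been observed
order = []
while queue:
    _, s, a = heapq.heappop(queue)
    Q[(s, a)] += td_error(s, a)              # update the highest-priority pair
    order.append((s, a))
    for ps, pa in predecessors.get(s, []):   # backward search
        push(ps, pa)
```

In this chain the reward information travels backwards from state 2 to state 0 in exactly three updates, whereas uniformly sampled planning updates would waste many steps on pairs whose values cannot change yet.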

---


#### **Model-based RL Algorithms**

Model-based RL methods combine planning and learning by using a learned model to simulate the environment:

* **Real-time Dynamic Programming (RTDP):** A modification of dynamic programming that focuses on **reachable states** instead of all states, making it more efficient for large state spaces.
* **Dyna:** Learns a model from real transitions and uses that model to simulate additional transitions for planning.
* **Prioritized Sweeping:** Uses both forward and backward models to prioritize states for updating, spreading information faster.

#### **Learning a Model from Data**

Given a dataset of observed transitions, we can estimate:

* **Dynamics model:** The probability distribution $p(s'|s,a)$.
* **Reward function:** The expected reward $r(s,a,s')$.

To do this:

1. **Track counts** $n(s,a,s')$ and reward sums $R_{\text{sum}}(s,a,s')$.
2. Estimate the transition model, with $n(s,a) = \sum_{s'} n(s,a,s')$, as:

   $$
   \hat{p}(s'|s,a) = \frac{n(s,a,s')}{n(s,a)}
   $$
3. Estimate the reward model as:

   $$
   \hat{r}(s,a,s') = \frac{R_{\text{sum}}(s,a,s')}{n(s,a,s')}
   $$
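
As a small worked example with made-up counts: suppose action $a$ was taken in state $s$ four times, leading to $s_1$ three times (with rewards summing to 6) and to $s_2$ once (with reward 0). Then:

$$
\hat{p}(s_1|s,a) = \frac{3}{4}, \qquad \hat{p}(s_2|s,a) = \frac{1}{4}, \qquad \hat{r}(s,a,s_1) = \frac{6}{3} = 2, \qquad \hat{r}(s,a,s_2) = \frac{0}{1} = 0
$$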
