<div align="center" style="line-height: 1.7;">
    <h2 style="font-weight: 600;"><strong>Results & Conclusion</strong></h2>
</div> 

&nbsp;


## Problem Definition

The goal of this project is to understand how an agent can learn effective decision-making through **trial and error** in a stochastic GridWorld environment. Unlike deep RL benchmarks, GridWorld isolates core reinforcement learning phenomena (exploration, temporal difference learning, planning, bootstrapping) in a compact, fully observable domain.

We compare:

* **Q-Learning** (model-free, off-policy)
* **SARSA** (model-free, on-policy)
* **Dyna-Q** (model-based + planning via simulated updates)
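
The difference between the three methods comes down to the update target. The following is a minimal tabular sketch, not the project's actual implementation: the function names, the default `alpha` and `gamma`, and the deterministic "last seen transition" model are all assumptions made for illustration.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the greedy (max) action value in s_next."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action a_next actually chosen in s_next."""
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def dyna_q_update(Q, model, s, a, r, s_next, done, rng, k=10, alpha=0.1, gamma=0.99):
    """One real Q-learning update, then k simulated updates replayed from the model."""
    q_learning_update(Q, s, a, r, s_next, done, alpha, gamma)
    # Assumption: the model simply remembers the last outcome seen for (s, a).
    model[(s, a)] = (r, s_next, done)
    seen = list(model)
    for _ in range(k):
        ps, pa = seen[rng.integers(len(seen))]  # sample a previously visited pair
        pr, pn, pdone = model[(ps, pa)]
        q_learning_update(Q, ps, pa, pr, pn, pdone, alpha, gamma)
```

With `k=0`, `dyna_q_update` reduces to plain Q-Learning, which is why $K=0$ serves as the model-free baseline in the planning sweep.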

The environment includes **walls**, **pits**, and **wind** (transition noise). The agent must learn to reach the goal while avoiding hazards and minimizing step penalties.

---

## Motivation & Significance

Understanding planning and robustness in tabular RL is foundational for real-world RL systems:

* **GridWorld** lets us visualize value propagation and policy formation clearly.

* **Model-based RL (Dyna-Q)** is one of the first examples of how simulated experience can vastly improve sample efficiency.

* **Robustness testing** (layout shift, dynamics shift, seed stability) reflects real challenges in robotics and general-purpose RL:
  
    * Environments change
    
    * Dynamics are not perfectly known
    
    * Algorithms may be brittle across random seeds

Studying these effects in a simple, controlled setting provides insight applicable to larger-scale RL systems.

---

## Research Questions & Answers

**How do planning steps ($K$) in Dyna-Q affect sample efficiency?**

Answer:  
Planning dramatically accelerates learning **up to a point**. The sweep revealed:

* **$K=0$** (pure model-free) -> slowest learning (~190 episodes to reach a return of at least 0.6).

* **$K=5$ to $K=20$** -> fastest learning (~90-110 episodes).

* **$K=50$** -> worse than $K=10$ or $K=20$: diminishing returns and slight instability.

However, all values of $K$ converged to the **same final optimal policy**, so planning controls learning speed, not optimality.
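
An "episodes to threshold" figure like the ones quoted here can be computed from a per-episode return curve. This is a sketch under assumptions: the smoothing `window` of 10 episodes and the helper name are hypothetical, and the original analysis may have used a different smoothing rule.

```python
import numpy as np

def episodes_to_threshold(returns, threshold=0.6, window=10):
    """Number of episodes until the moving-average return first reaches threshold.

    Returns None if the smoothed curve never reaches the threshold.
    """
    returns = np.asarray(returns, dtype=float)
    if len(returns) < window:
        return None
    # Moving average: smoothed[i] is the mean of returns[i : i + window].
    smoothed = np.convolve(returns, np.ones(window) / window, mode="valid")
    hits = np.flatnonzero(smoothed >= threshold)
    if hits.size == 0:
        return None
    # The window starting at hits[0] ends after hits[0] + window episodes.
    return int(hits[0]) + window
```

Smoothing matters in a stochastic environment: a single lucky episode can cross the threshold long before the policy is reliably good.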

---

**How sensitive are the algorithms to ε-greedy schedules and learning rates?**

Answer:
All algorithms converge robustly under reasonable $\alpha$ and $\epsilon$ schedules, but:

* Q-Learning is the most sensitive to high $\epsilon$ early in training (off-policy bootstrapping amplifies exploratory randomness).

* SARSA is the most stable across schedules (its on-policy target accounts for the exploratory actions the agent actually takes).

* Dyna-Q is also sensitive to both $\alpha$ and $\epsilon$, because planning replays and amplifies whatever trajectories are experienced early.

Overall, **SARSA is consistently the least sensitive** to hyperparameter changes.
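
For context, an ε-greedy schedule of the kind being varied can be sketched as below. The linear decay shape and the default values (`eps_start`, `eps_end`, `decay_episodes`) are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def linear_epsilon(episode, eps_start=1.0, eps_end=0.05, decay_episodes=200):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it constant."""
    frac = min(episode / decay_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))
```

A slower decay keeps ε high for longer, which is the regime where the differences between the on-policy and off-policy targets are most visible.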

---

**How robust are policies to layout changes and transition noise?**

To answer this question, we ran two kinds of tests and tracked seed stability in each:

**1. Dynamics Shift (Windy Environment)**

* All algorithms adapt well; final returns are similar (~0.63-0.70).
  
* Dyna-Q learns **faster** but final performance is slightly worse.

* Seed stability is extremely high across all three.
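
One common way to model wind as transition noise is to replace the chosen action with a random one with some probability. The sketch below assumes that noise model; the action encoding, the `wind_prob` parameter, and the "stay in place when blocked" rule are hypothetical and may differ from the project's environment.

```python
import numpy as np

# Hypothetical action encoding: up, down, left, right as (row, col) offsets.
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def windy_step(pos, action, walls, shape, wind_prob, rng):
    """Move on the grid; with probability wind_prob a random action is taken instead."""
    if rng.random() < wind_prob:
        action = int(rng.integers(len(ACTIONS)))
    d_row, d_col = ACTIONS[action]
    row, col = pos[0] + d_row, pos[1] + d_col
    out_of_bounds = not (0 <= row < shape[0] and 0 <= col < shape[1])
    if out_of_bounds or (row, col) in walls:
        return pos  # blocked moves leave the agent where it is
    return (row, col)
```

Setting `wind_prob=0.0` recovers a deterministic grid, so the same function can serve both the baseline and the shifted dynamics.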

**2. Layout Shift (New walls/pits)**  
Two tests:

  **Cross-evaluation (no retraining)**:  
Policies learned on the baseline grid, evaluated on the shifted one.

   * **SARSA**: remains positive (+0.52)

   * **Q-Learning & Dyna-Q**: collapse (-0.96 to -1.02)

SARSA clearly generalizes better across layouts.
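
Cross-evaluation amounts to freezing the learned Q-table and rolling out its greedy policy on the other layout. A minimal sketch follows; the environment interface (`reset()` returning an integer state, `step(action)` returning `(next_state, reward, done)`) and the `max_steps` cap are assumptions, not the project's actual API.

```python
import numpy as np

def evaluate_greedy(Q, env, episodes=100, max_steps=200):
    """Average undiscounted return of the greedy policy for Q on env, with no learning."""
    totals = []
    for _ in range(episodes):
        state = env.reset()
        total = 0.0
        for _ in range(max_steps):
            action = int(np.argmax(Q[state]))  # greedy, no exploration
            state, reward, done = env.step(action)
            total += reward
            if done:
                break
        totals.append(total)
    return float(np.mean(totals))
```

Calling this with a Q-table trained on the baseline grid and an `env` built from the shifted layout gives a no-retraining score of the kind reported above.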

**Retraining on the shifted layout:**  
All three converge to similar final performance (~0.71 return), but:

   * **Dyna-Q shows very high variance across seeds** (std ~ 0.33)
   
   * Q-learning and SARSA are far more consistent
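
Seed variance of this kind can be summarized by aggregating one final return per seed. The helper below is a sketch: its name is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the write-up does not say which estimator was used.

```python
import numpy as np

def summarize_across_seeds(final_returns):
    """Mean, sample std, and range of final returns, one value per random seed."""
    x = np.asarray(final_returns, dtype=float)
    if x.size < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return {
        "mean": float(x.mean()),
        "std": float(x.std(ddof=1)),  # sample std (assumption)
        "min": float(x.min()),
        "max": float(x.max()),
    }
```

Reporting the minimum alongside the standard deviation helps show whether high variance comes from a few failed seeds or from a broad spread.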


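**Conclusion:** Model-based planning makes Dyna-Q less robust to structural changes: it collapses in cross-evaluation (as does Q-Learning) and shows the highest seed variance after retraining.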

---

**Does Dyna-Q's planning provide lasting benefits?**

Answer:
Only during early learning.  
Once the agent gathers enough real experience, all algorithms converge to the same final policy.

However:  

* Under layout shift, Dyna-Q is **least stable across seeds**

* Under dynamics shift, planning helps early but does not yield a better final policy.

**Overall**:
Planning accelerates learning but reduces robustness.

---

## Summary of Key Findings

**Learning Efficiency**  

* Dyna-Q with moderate $K$ (10-20) is the fastest learner.
  
* Too much planning ($K=50$) can make learning worse.


**Final Performance**

* All algorithms reach the same optimal performance on the baseline grid, and similar final returns when retrained under wind or layout shift.


**Robustness**

* **SARSA** is most robust to layout shift and hyperparameter sensitivity.

* **Q-Learning** is moderately stable.

* **Dyna-Q** shows the worst generalization under structural changes.


**Seed Stability**

* Windy env -> very stable for all.

* Layout shift -> Dyna-Q highly unstable across seeds.

---

## Overall Conclusion

This project highlights the classic Dyna-Q trade-off:

   **Planning accelerates learning but amplifies early mistakes, reducing robustness.**

In simple, stationary environments, model-based RL is a major win: fast learning and optimal asymptotic performance.

But under realistic conditions (changing layouts, noisy transitions), **model-free methods, especially SARSA, are significantly more robust** and consistent.

These results reinforce a theme seen in modern RL research:

**better sample efficiency does not guarantee better generalization.**

---

<style>
    .button {
        background-color: #3b3b3b;
        color: white;
        padding: 25px 60px;
        border: none;
        border-radius: 12px;
        cursor: pointer;
        font-size: 30px;
        transition: background-color 0.3s ease;
    }

    .button:hover {
        background-color: #45a049;
        transform: scale(1.05);
    }
    
</style>

<div style=" text-align: center; margin-top:20px;">
    
  <a href="06_robustness.ipynb">
    <button class="button">
      ⬅️ Prev: Robustness & Generalization
    </button>
  </a>
  <span style="display:inline-block; width:200px;"></span>
  <a href="../Start_Here.ipynb">
    <button class="button">
      Next: Start Here ➡️
    </button>
  </a>
  
</div>
