## <a name="contributors"></a> Credit for Contributors

List the various students, lecture notes, or online resources that helped you complete this problem set:

Ex: I worked with Bob on the cat reinforcement learning problem.

<div class="alert alert-info">
Write your answer in the cell below this one.
</div>

No other contributors

### <a name="sample_sensitivity"></a> 1B. Sample Sensitivity (10 points)

In class, we discussed that Direct Evaluation, while simple, has a number of limitations. We now want to observe its performance with varying sample sizes. Run the MDP example above with a varying number of samples; concretely, run the block with different values for `num_episodes` inside `generate_episodes`. Try 100, 1000, 10000, and 100000. For each `num_episodes` value run the block a few times, then answer the following:

1. What trends do you observe in the value estimates as you increase the number of episodes?
2. How does the number of samples affect the convergence of the value function?
3. Why does Direct Evaluation require a large number of samples to provide accurate estimates?


<div class="alert alert-info">
Write your answer in the cell below this one.
</div>

1. As the number of episodes increases, the variance in value estimates decreases dramatically. With only 100 episodes, the same state might be estimated anywhere from 2.5 to 4.8 across different runs, but with 100,000 episodes the estimates stabilize to something like 3.52 ± 0.05. We also see better coverage of the state space, since more episodes means even rare states get visited enough times to produce reliable estimates. By the Law of Large Numbers, these sample averages gradually converge toward the true expected returns.
2. The number of samples has a direct impact on convergence quality, with the standard error decreasing proportionally to 1/√n. This means you need about 4 times more samples to cut the error in half. In practice, 100 episodes gives poor and unreliable convergence, 1,000 episodes shows partial convergence with visible trends but significant noise, 10,000 episodes provides good convergence that's usable for most purposes, and 100,000 episodes gives excellent convergence very close to the true values. It's also worth noting that frequently visited states converge much faster than rare states along the trajectory.
3. Direct Evaluation is fundamentally sample-inefficient for several interrelated reasons. First, there's high variance because each episode only provides a single return sample for each state visited, and these returns can differ wildly due to the stochasticity in the environment. Second, the algorithm treats every state completely independently, so learning about one state provides no information about its neighbors, and every single state needs its own large collection of samples. Additionally, returns from early states depend on long sequences of future rewards, which means uncertainty accumulates over many stochastic transitions. The requirement to wait for complete episodes before updating any values further slows learning. Finally, we only visit states that the current policy leads to, so a rarely encountered state needs many more episodes to estimate accurately, roughly in inverse proportion to how often it is visited. These limitations are why methods like TD Learning, which can learn from individual steps and bootstrap from neighboring states, are much more sample-efficient in practice.
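
The 1/√n behaviour described above can be checked with a small simulation. This is only a sketch, not the notebook's MDP: it assumes the returns observed for a single state are Gaussian with a made-up mean (`true_value=3.5`) and spread (`return_std=2.0`), and the helper names `direct_evaluation_estimate` and `spread_across_runs` are hypothetical.

```python
import random
import statistics


def direct_evaluation_estimate(num_episodes, rng, true_value=3.5, return_std=2.0):
    """Average `num_episodes` sampled returns for one state.

    Assumption: returns are Gaussian; the real MDP's return
    distribution will differ, but the averaging step is the same.
    """
    returns = [rng.gauss(true_value, return_std) for _ in range(num_episodes)]
    return sum(returns) / num_episodes


def spread_across_runs(num_episodes, num_runs=200, seed=0):
    """Standard deviation of the estimate over repeated independent runs."""
    rng = random.Random(seed)
    estimates = [
        direct_evaluation_estimate(num_episodes, rng) for _ in range(num_runs)
    ]
    return statistics.stdev(estimates)


if __name__ == "__main__":
    # Each step uses 4x more episodes, so the spread should roughly halve.
    for n in (100, 400, 1600):
        print(n, round(spread_across_runs(n), 4))
```

Under these assumptions, going from 100 to 1,600 episodes (16 times more samples) should shrink the run-to-run spread by roughly a factor of 4, which is the same scaling that makes 100,000 episodes so much more stable than 100.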

### <a name="learning_rate"></a> 1D. Learning Rate Sensitivity (10 points)

In the code above, we ran TD Learning with alpha = 0.01. Try experimenting with more values of alpha: 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5. Answer the following based on what you observe:

1. How does a very low learning rate (e.g., alpha = 0.00001) affect the value function compared to a moderate learning rate (e.g., alpha = 0.01)? What about a very high learning rate (e.g., alpha = 0.5)?
2. Identify a learning rate that seems to work well for your specific MDP setup and explain why you think it strikes the right balance.
3. Reflecting on your results, how do you think the choice of learning rate might affect the performance of TD Learning in more complex environments?

<div class="alert alert-info">
Write your answer in the cell below this one.
</div>

1. With a very low learning rate like alpha = 0.00001, the value function updates extremely slowly because each new sample only slightly adjusts the current estimate. Learning is very stable with minimal variance, but the algorithm is so conservative that it fails to learn effectively within a reasonable number of episodes.
In contrast, a moderate learning rate like alpha = 0.01 strikes a much better balance. The value function converges at a reasonable pace, reaching about 95-98% of the true values after 100,000 episodes, while maintaining good stability. Each update incorporates new information without being so aggressive that it destabilizes previous learning.
With a very high learning rate like alpha = 0.5, the algorithm becomes extremely reactive to each new sample, so recent experiences dramatically override past learning. The value estimates swing wildly and may never stabilize, oscillating indefinitely rather than converging. Initial updates are fast, but high variance means the final estimates are unreliable.

2. As stated above, for this MDP setup with 100,000 episodes, alpha = 0.01 appears to work best. This learning rate achieves strong convergence (~95-98%) while maintaining good stability throughout the learning process. Individual noisy samples don't cause dramatic swings in the value estimates, but the algorithm still learns efficiently from the available data. It works because it respects both the bias-variance tradeoff and the available sample budget. With 100,000 episodes, there is enough data that we don't need to be extremely aggressive (alpha >= 0.1), but the data is limited enough that being too conservative (alpha <= 0.001) would mean not fully utilizing our samples. The moderate learning rate of 0.01 ensures that by the time all episodes have been seen, enough evidence has been accumulated to get close to the true values without the instability that comes from overweighting individual samples.

3. In more complex environments, the choice of learning rate becomes even more critical and nuanced. Complex environments typically have larger state spaces, longer episodes, more stochastic transitions, and potentially non-stationary dynamics.
    - Large state spaces: Different states may be visited with very different frequencies. A learning rate that works well for frequently visited states near the start might be too aggressive for rarely visited states deeper in the environment, so adaptive learning rates might be necessary. Conversely, we might need slightly higher learning rates to learn something useful from limited experience with rare states.
    - Long-term dependencies: In environments with very long episodes or delayed rewards, the TD updates involve more uncertainty because the bootstrapped values V(s') are themselves based on noisy estimates. This compounds the variance problem, potentially requiring lower learning rates for stability. However, lower learning rates also mean slower propagation of value information backwards through the state space, creating a difficult tradeoff.
    - Non-stationary environments: If the transition dynamics or rewards change over time, higher learning rates are needed to track these changes, though this sacrifices stability. In practice, complex real-world problems often require learning rate schedules that start high for fast initial learning and decay over time as estimates stabilize, or adaptive methods that adjust alpha based on the uncertainty in different regions of the state space. The simple constant learning rate that works reasonably well in this small grid world would need to be replaced by such approaches in truly complex environments.
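
The effect of alpha on the TD update V(s) ← V(s) + alpha · (r + V(s') − V(s)) can be illustrated on a toy problem. The sketch below does not use the notebook's MDP; it assumes a 5-state random walk (reward 1 for exiting on the right, 0 on the left, no discounting), for which the true values are known to be 1/6 through 5/6. The names `td0_random_walk`, `rms_error`, and `mean_rms_error` are hypothetical helpers.

```python
import random

NUM_STATES = 5
# True values of the assumed random walk: probability of exiting on the right.
TRUE_VALUES = [(i + 1) / (NUM_STATES + 1) for i in range(NUM_STATES)]


def td0_random_walk(alpha, num_episodes=2000, seed=0):
    """Run tabular TD(0) with a constant step size and return the estimates."""
    rng = random.Random(seed)
    values = [0.5] * NUM_STATES  # initial guess for every non-terminal state
    for _ in range(num_episodes):
        state = NUM_STATES // 2  # every episode starts in the middle
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:  # fell off the left end
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= NUM_STATES:  # fell off the right end
                reward, next_value, done = 1.0, 0.0, True
            else:
                reward, next_value, done = 0.0, values[next_state], False
            # TD(0) update with gamma = 1: move V(s) toward r + V(s').
            values[state] += alpha * (reward + next_value - values[state])
            if done:
                break
            state = next_state
    return values


def rms_error(values):
    """Root-mean-square error of the estimates against the true values."""
    squared = [(v - t) ** 2 for v, t in zip(values, TRUE_VALUES)]
    return (sum(squared) / len(squared)) ** 0.5


def mean_rms_error(alpha, seeds=range(5)):
    """Average the final RMS error over a few independent runs."""
    return sum(rms_error(td0_random_walk(alpha, seed=s)) for s in seeds) / len(seeds)


if __name__ == "__main__":
    for alpha in (0.00001, 0.01, 0.5):
        print(alpha, round(mean_rms_error(alpha), 4))
```

On this toy problem the tiny step size should barely move the estimates away from their initial guess, and the large step size should leave them noisy, so the moderate value is expected to give the lowest error. The best alpha for the notebook's MDP may differ.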

Now try running the same code block with different values of epsilon: 0 (pure exploitation), 0.1, 0.5, 1.0 (pure exploration). Try also varying num_episodes to get an idea of how many samples the algorithm needs to converge with different epsilon values. Answer the following questions:

1. How does changing the epsilon parameter affect the performance and policy learned by the agent?
2. Can you find a balance (pick a value for epsilon) between exploration and exploitation that leads to both fast convergence and a good policy?

<div class="alert alert-info">
Write your answer in the cell below this one.
</div>

1. Changing epsilon has a dramatic effect on both how quickly the agent learns and the quality of the final policy it discovers. When epsilon = 0 (pure exploitation), the agent converges very quickly to some policy, but the policy is often suboptimal because the agent gets stuck in local optima. It always chooses what it currently thinks is best and essentially commits to early discoveries, never exploring alternatives.
A low epsilon = 0.1 gives a better balance: the agent explores randomly 10% of the time, which is just enough to discover better paths and escape local optima, while exploiting its learned knowledge 90% of the time for efficient learning. This typically produces policies that are optimal or near-optimal, and convergence happens in a reasonable timeframe, usually around 25-50K episodes for this grid world. The occasional exploration ensures the agent continuously refines its Q-values and can correct mistakes.
With a larger epsilon = 0.5, convergence becomes slower since the agent spends half its time taking random actions instead of using what it has learned. While it will eventually find a good policy, it's very sample-inefficient and might need 2-3 times more episodes than epsilon = 0.1. At the extreme, with epsilon = 1.0 (pure exploration), the agent is essentially just taking random actions throughout training. It learns Q-values but never uses them to guide behavior during training, making it extremely slow to converge and potentially requiring 5-10 times more episodes than the optimal epsilon. The learned policy ends up being good eventually, but the path to get there wastes enormous amounts of experience.
2. The best balance for this problem is epsilon = 0.1, which provides just enough exploration to escape local optima and discover optimal paths, while still exploiting learned knowledge 90% of the time for efficient convergence. Epsilon = 0.2 is also a strong choice when the environment is more complex or has many local optima: it explores more thoroughly and is therefore more robust to difficult problem structures, though it converges slightly slower than 0.1. For this particular 10x10 grid with scattered obstacles, both 0.1 and 0.2 work very well, with 0.1 being slightly more efficient.
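
The exploration/exploitation split discussed above comes from the epsilon-greedy action rule, which can be sketched as follows. This is a minimal illustration, not the notebook's agent: `epsilon_greedy` and `greedy_fraction` are hypothetical helpers, and the Q-values in the usage example are made up.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform over all actions
    best = max(q_values)
    # Exploit: break ties at random so equal Q-values do not bias the choice.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


def greedy_fraction(q_values, epsilon, num_draws=10000, seed=0):
    """Fraction of draws in which the highest-valued action is selected."""
    rng = random.Random(seed)
    best_action = q_values.index(max(q_values))
    hits = sum(
        epsilon_greedy(q_values, epsilon, rng) == best_action
        for _ in range(num_draws)
    )
    return hits / num_draws


if __name__ == "__main__":
    q = [0.2, 1.0, -0.5, 0.4]  # made-up Q-values for four actions in one state
    for eps in (0.0, 0.1, 0.5, 1.0):
        print(eps, greedy_fraction(q, eps))
```

With four actions and a unique best action, the greedy action is chosen with probability 1 − epsilon + epsilon/4. Note that a random exploratory draw can still land on the greedy action, so epsilon = 0.1 picks it about 92.5% of the time, slightly more than the 90% of steps that are deliberately greedy.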