## Week 03: Graded Lab

### Lab session

Welcome to the third week of our Reinforcement Learning course! In this lab session, we will delve into more advanced reinforcement learning techniques and challenge you to implement and extend what you've learned in the previous two weeks. This graded lab is designed to be more time-consuming and challenging, but it will be a rewarding experience as you tackle complex problems in reinforcement learning.

### Educational Objectives

* Be able to extend the Monte Carlo algorithm to work with a state value function while maintaining a balance between exploration and exploitation.
* Be able to implement the incremental Monte Carlo algorithm.
* Be able to apply the Q-Learning algorithm from scratch.


### Getting Started

Please group up in pairs and open the notebook from week 01 in either Google Colab or on your local machine.

### Tasks

This graded lab is designed for you to prove your understanding of the learned methods. The tasks are threefold:


	
* **1. Monte Carlo with State Value Function:** For the first task, take the **Monte Carlo algorithm** from the first week's lab (Grid World) and **modify it to choose actions from a state-value function instead of at random**. Instead of sampling uniformly from all possible actions, the agent selects actions according to the estimated state-value function. You should also **incorporate ε-greedy action selection to balance exploration and exploitation**.
	
* **2. Incremental Monte Carlo:** In the second task, extend the **Monte Carlo algorithm** from the first task **to an incremental Monte Carlo method**. Incremental Monte Carlo folds each return into a running mean, $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} (G_t - V(S_t))$, so the sum of returns no longer has to be stored and divided afterwards. This should lead to more efficient learning.
	
* **3. Q-Learning Integration:** For the final task, you should **replace the Monte Carlo method** used and extended in the first two tasks **with the Q-Learning algorithm**, as introduced last week. This will allow you to compare the performance of Q-Learning with the Monte Carlo approach and observe how the learning process differs.
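
The ε-greedy selection asked for in the first task can be sketched as follows. This is only an illustration under stated assumptions, not the lab's reference solution: the function name `epsilon_greedy_action`, the `next_state_fn` callback, and the dictionary of state values are hypothetical and do not come from `simple_grid_world` or the week 01 notebook. The sketch also assumes that the successor state of each action is known and deterministic.

```python
import random


def epsilon_greedy_action(state, actions, next_state_fn, values, epsilon=0.1, rng=random):
    """Pick an action for `state` with epsilon-greedy selection over state values.

    actions:       list of available actions (hypothetical representation)
    next_state_fn: callable (state, action) -> successor state; assumes a
                   known, deterministic transition model
    values:        dict mapping state -> estimated value V(s); states that
                   were never visited default to 0.0
    """
    # Explore: with probability epsilon, ignore the estimates entirely.
    if rng.random() < epsilon:
        return rng.choice(actions)

    # Exploit: score every action by the estimated value of its successor.
    scored = [(values.get(next_state_fn(state, a), 0.0), a) for a in actions]
    best_value = max(value for value, _ in scored)

    # Break ties randomly so that no action is systematically favoured.
    return rng.choice([a for value, a in scored if value == best_value])


# Usage on a toy two-action example (coordinates are (row, column)).
moves = {"right": (0, 1), "down": (1, 0)}


def step(state, action):
    return (state[0] + moves[action][0], state[1] + moves[action][1])


toy_values = {(0, 1): -1.0, (1, 0): -5.0}
greedy_choice = epsilon_greedy_action((0, 0), ["right", "down"], step, toy_values, epsilon=0.0)
```

With `epsilon=0.0` the choice is purely greedy, and with `epsilon=1.0` it is purely random; values in between trade exploration against exploitation.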


### Submission

* The deadline for completing this graded lab is the start of the lab session in week 7 (31st of October, 16:00). If you are attending the lecture in person, please be prepared to show your results upfront. If you are attending online, you should send an email to embe@zhaw.ch with your lab results attached. Make sure to include the names of both students if you are working in a pair.

### Key Takeaways
	
* **Advanced Techniques:** You will be applying advanced reinforcement learning techniques, including state value functions, incremental Monte Carlo, and Q-Learning.
	
* **Efficient Learning:** Incremental Monte Carlo is designed to make learning more efficient by updating value estimates incrementally.
	
* **Comparative Analysis:** By implementing Q-Learning for the same GridWorld environment, you will gain insights into the performance differences between the Monte Carlo method and Q-Learning in action.

## A1 Monte Carlo with State Value Function:

In [5]:
from simple_grid_world import SimpleGridWorld
from monte_carlo_generation_a1 import MonteCarloGeneration
from monte_carlo_experiment_a1 import MonteCarloExperiment
from visualize import state_value_2d, next_best_value_2d
from IPython.display import clear_output
from update_functions import update_basic

env = SimpleGridWorld() # Instantiate the environment
agent = MonteCarloExperiment(env=env)
generator = MonteCarloGeneration(env=env, agent=agent) # Instantiate the trajectory generator
for _ in range(10000):  # run 10,000 episodes
    clear_output(wait=True)
    generator.run_episode(update_basic)
print(state_value_2d(env, agent))
print(next_best_value_2d(env, agent), flush=True)
print(agent.random_counter)

-13.30 | -14.83 | -12.72 | -36.76 | 
-11.87 |  -9.59 |  -9.03 |  -7.92 | 
 -9.81 |  -7.47 |  -5.43 |  -4.35 | 
-11.43 |  -6.29 |  -4.00 |    @   | 

⬇ | ⬇ | ⬇ | ⬇ | 
⮕ | ⬇ | ⬇ | ⬇ | 
⮕ | ⮕ | ⬇ | ⬇ | 
⮕ | ⮕ | ⮕ | @ | 

12379


### Chapter 2.2 of Sutton

![image.png](attachment:image.png)

In [1]:
from simple_grid_world import SimpleGridWorld
from monte_carlo_generation_a1 import MonteCarloGeneration
from monte_carlo_experiment_a1 import MonteCarloExperiment
from visualize import state_value_2d, next_best_value_2d
from IPython.display import clear_output
from update_functions import update_sutton_2_2

env = SimpleGridWorld() # Instantiate the environment
agent = MonteCarloExperiment(env=env)
generator = MonteCarloGeneration(env=env, agent=agent) # Instantiate the trajectory generator
for _ in range(10000):  # run 10,000 episodes
    clear_output(wait=True)
    generator.run_episode_sutton_2_2(update_sutton_2_2)
print(state_value_2d(env, agent))
print(next_best_value_2d(env, agent), flush=True)
print(agent.random_counter)

-10.45 |  -8.76 |  -7.97 |  -6.02 | 
-11.02 |  -8.47 |  -6.37 |  -5.05 | 
-13.26 |  -6.80 |  -4.88 |  -3.28 | 
 -7.76 |  -5.90 |  -3.41 |    @   | 

⮕ | ⬇ | ⬇ | ⬇ | 
⮕ | ⮕ | ⬇ | ⬇ | 
⮕ | ⮕ | ⮕ | ⬇ | 
⮕ | ⮕ | ⮕ | @ | 

11311


## A2 Incremental Monte Carlo:

### Chapter 2.3 of Sutton


![image.png](attachment:image.png)
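
The incremental update can be checked in isolation with a small sketch. It assumes the running-mean form $V \leftarrow V + \frac{1}{N}(G - V)$; the helper name `incremental_update` and the sample returns are made up for illustration and are not the notebook's `update_sutton_2_3`.

```python
def incremental_update(value, count, g):
    """Fold one return `g` into a running mean and return (new_value, new_count)."""
    count += 1
    value += (g - value) / count  # V <- V + (1/N) * (G - V)
    return value, count


# Hypothetical returns observed for a single state over three episodes.
sample_returns = [-10.0, -6.0, -8.0]

v, n = 0.0, 0
for g in sample_returns:
    v, n = incremental_update(v, n, g)

# The running mean agrees with the batch mean computed from the stored returns.
batch_mean = sum(sample_returns) / len(sample_returns)
```

Only the current estimate and the visit counter are kept per state, so the list of past returns does not need to be stored.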

In [2]:
from simple_grid_world import SimpleGridWorld
from monte_carlo_generation_a1 import MonteCarloGeneration
from monte_carlo_experiment_a1 import MonteCarloExperiment
from visualize import state_value_2d, next_best_value_2d
from IPython.display import clear_output
from update_functions import update_sutton_2_3

env = SimpleGridWorld() # Instantiate the environment
agent = MonteCarloExperiment(env=env)
generator = MonteCarloGeneration(env=env, agent=agent) # Instantiate the trajectory generator
for _ in range(10000):  # run 10,000 episodes
    clear_output(wait=True)
    generator.run_episode(update_sutton_2_3)
print(state_value_2d(env, agent))
print(next_best_value_2d(env, agent), flush=True)
print(agent.random_counter)

 -9.25 |  -7.95 |  -7.27 |  -5.90 | 
 -7.99 |  -7.29 |  -6.09 |  -4.86 | 
 -7.02 |  -5.67 |  -4.18 |  -3.09 | 
 -5.57 |  -4.80 |  -2.95 |    @   | 

⬇ | ⬇ | ⬇ | ⬇ | 
⮕ | ⬇ | ⬇ | ⬇ | 
⮕ | ⮕ | ⮕ | ⬇ | 
⮕ | ⮕ | ⮕ | @ | 

8847


## A3 Q-Learning Integration:
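
For orientation, the tabular Q-Learning update rule, $Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$, can be sketched on its own. The function `q_learning_update`, the dictionary-based Q-table, and the values chosen for `alpha` and `gamma` are assumptions made for this illustration; they are not taken from `monte_carlo_experiment_a3`.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions, done,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-Learning step to `q` and return the updated entry.

    q:       mapping (state, action) -> estimated action value
    actions: actions available in `next_state`
    done:    True if `next_state` is terminal (no bootstrapping then)
    """
    # Bootstrap from the best action value of the successor state.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Usage: one transition with reward -1 in an otherwise empty Q-table.
q_table = defaultdict(float)
first = q_learning_update(q_table, "s0", "right", -1.0, "s1", ["right", "down"], done=False)
```

Unlike the Monte Carlo variants, this update is applied after every single step, because it bootstraps from the current estimate of the next state instead of waiting for the full return.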

In [3]:
from simple_grid_world import SimpleGridWorld
from monte_carlo_generation_a3 import MonteCarloGeneration
from monte_carlo_experiment_a3 import MonteCarloExperiment
from visualize import state_value_2d, next_best_value_2d
from IPython.display import clear_output

env = SimpleGridWorld() # Instantiate the environment
agent = MonteCarloExperiment(env=env)
generator = MonteCarloGeneration(env=env, agent=agent) # Instantiate the trajectory generator
for _ in range(10000):  # run 10,000 episodes
    clear_output(wait=True)
    generator.run_episode()
print(state_value_2d(env, agent))
print(next_best_value_2d(env, agent), flush=True)

 -6.32 |  -5.61 |  -4.66 |  -3.94 | 
 -5.57 |  -4.87 |  -3.92 |  -3.21 | 
 -4.46 |  -3.77 |  -2.92 |  -2.23 | 
 -3.54 |  -3.07 |  -2.16 |    @   | 

⮕ | ⮕ | ⮕ | ⬇ | 
⮕ | ⮕ | ⮕ | ⬇ | 
⮕ | ⮕ | ⮕ | ⬇ | 
⮕ | ⮕ | ⮕ | @ | 



### Monte Carlo Methods

|  | Monte Carlo | Incremental Monte Carlo |
| --| --| --|
| **First-visit**  | **1.** Run through an episode <br> **2.** Propagate backward and, for every state, accumulate $S(s) \leftarrow S(s) + G_t$ and increment the counter <br> **3.** Starting from the front: take the first state value and divide it by the count $N(s)$ | **1.** Run through an episode <br> **2.** Propagate backward and, for every state, compute $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} (G_t - V(S_t))$ and increment the counter <br> **3.** Starting from the front: take the first state value |
| **Every-visit**  | **1.** Run through an episode; propagate backward and, for every state, accumulate $S(s) \leftarrow S(s) + G_t$ and increment the counter <br> **2.** Starting from the front: add up the state values divided by the count $N(s)$ | **1.** Run through an episode <br> **2.** Propagate backward and, for every state, compute $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} (G_t - V(S_t))$, accumulate, and increment the counter |
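
The difference between the first-visit and every-visit rows can be made concrete with a short sketch. It is a simplified illustration under stated assumptions: an episode is represented as a list of `(state, reward)` pairs, the function name `mc_update` and the toy episode are hypothetical, and the non-incremental form with a return sum $S(s)$ and a counter $N(s)$ is used.

```python
from collections import defaultdict


def mc_update(episode, sums, counts, gamma=1.0, first_visit=True):
    """Update return sums and visit counts from one finished episode.

    episode: list of (state, reward) pairs in time order, where `reward`
             is received after leaving `state`
    Returns a dict state -> S(s) / N(s).
    """
    # Walk backward through the episode to compute the return G_t of every step.
    g = 0.0
    returns = []
    for state, reward in reversed(episode):
        g = reward + gamma * g
        returns.append((state, g))
    returns.reverse()  # back to time order

    # Walk forward and accumulate; first-visit skips repeated states.
    seen = set()
    for state, g in returns:
        if first_visit and state in seen:
            continue
        seen.add(state)
        sums[state] += g
        counts[state] += 1

    return {s: sums[s] / counts[s] for s in sums}


# Toy episode in which state "A" is visited twice.
toy_episode = [("A", -1.0), ("B", -1.0), ("A", -1.0)]

v_first = mc_update(toy_episode, defaultdict(float), defaultdict(int), first_visit=True)
v_every = mc_update(toy_episode, defaultdict(float), defaultdict(int), first_visit=False)
```

For this episode the returns are -3, -2 and -1 in time order, so first-visit uses only the return -3 for "A", while every-visit averages -3 and -1.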