## Sub-assignment 4: TD Learning *(3 points)*



### Introduction

This Jupyter Notebook explores the implementation of TD learning. TD learning, short for Temporal Difference learning, is a family of reinforcement learning methods that combines elements of both Monte Carlo methods (learning from sampled experience) and dynamic programming (bootstrapping from current value estimates).

In this notebook, we will focus on three different TD methods: TD prediction, on-policy TD control, and off-policy TD control. We will implement and run these algorithms on a specific environment called CircleWorld.

The CircleWorld environment is a grid-like world where an agent can move in four directions: up, down, left, and right. The goal of the agent is to navigate through the world and reach a specific target state. The environment is episodic, meaning that each episode starts from a specific initial state and ends when the agent reaches the target state.

We will start by implementing TD prediction, which aims to estimate the value function under the optimal policy. We will run the TD(0) algorithm for a fixed number of sample episodes and observe how the accuracy of the value function estimate evolves across episodes.
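For reference, the tabular TD(0) update is usually written as follows. The discount factor $\gamma$ and the time-indexed symbols $S_t$, $R_{t+1}$ are notation assumed here; they are not defined elsewhere in this notebook.

$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]
$$

The bracketed term is the TD error, and the value of a terminal state is taken to be $0$.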

Next, we will move on to on-policy TD control, where we will implement and run the SARSA algorithm. SARSA is an on-policy control algorithm that estimates the optimal policy by updating the action-value function toward a target built from the action the agent actually selects in the next state. We will start with a uniform stochastic policy and gradually improve it through the SARSA algorithm.
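As a reminder, the SARSA update in its common textbook form is shown below. The discount factor $\gamma$ and the symbols $S_t$, $A_t$, $R_{t+1}$ are assumed notation, not taken from this notebook.

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
$$

Here $A_{t+1}$ is the action that the current policy selects in $S_{t+1}$, which is what makes the method on-policy.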

Finally, we will explore off-policy TD control using the Q-learning algorithm. Q-learning is an off-policy control algorithm that estimates the optimal policy by updating the action-value function based on the maximum Q-value of the next state. We will start with a uniform stochastic policy and observe how the estimated Q-values evolve across episodes.
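In its common textbook form, the Q-learning update reads as follows. The discount factor $\gamma$ and the symbols $S_t$, $A_t$, $R_{t+1}$ are assumed notation, not taken from this notebook.

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]
$$

The target uses the greedy value of the next state regardless of which action the behavior policy takes next, which is what makes the method off-policy.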

Throughout this notebook, we will plot the results of each algorithm to visualize the learning progress and compare the performance of different TD methods.



Let's get started!

### Prerequisite

This sub-assignment builds on the previous sub-assignments, so complete them first; the components defined there are reused here.

Note that although TD learning can handle continuing tasks, we restrict the study to episodic tasks, because in this environment the continuing setting would lead to infinite loops in which no stable solution can be found.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import copy
import time

**a**) **TD Prediction**: Implement and run TD(0) to estimate the value function under the optimal policy. Run the algorithm for 500 sample episodes with a fixed $\alpha=0.01$. Plot the MSE between the state-value estimates $V$ and the ground truth $v_{\pi}$ as a function of the number of episodes, i.e. how the accuracy of the estimate evolves across episodes.
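To make the expected shape of the result concrete, here is a minimal sketch of TD(0) prediction with a per-episode MSE curve. It deliberately does not use the notebook's `mdp` object: the environment is a hypothetical 5-state random walk (not CircleWorld), the evaluated policy is a uniform left/right choice, and `gamma` is an assumed discount factor. The true values of that toy problem are known in closed form, which stands in for $v_{\pi}$.

```python
import random

def td0_random_walk(num_episodes=500, alpha=0.01, gamma=1.0, seed=0):
    """TD(0) prediction on a hypothetical 5-state random walk (toy stand-in)."""
    rng = random.Random(seed)
    n = 5                                            # non-terminal states 0..4
    true_v = [(i + 1) / (n + 1) for i in range(n)]   # exact values for gamma = 1
    V = [0.0] * n
    mse_per_episode = []
    for _ in range(num_episodes):
        s = n // 2                                   # every episode starts in the middle
        while True:
            s1 = s + rng.choice((-1, 1))             # evaluated policy: uniform left/right
            terminal = s1 < 0 or s1 >= n
            r = 1.0 if s1 >= n else 0.0              # reward only when exiting on the right
            v_next = 0.0 if terminal else V[s1]      # terminal states have value 0
            V[s] += alpha * (r + gamma * v_next - V[s])   # TD(0) update
            if terminal:
                break
            s = s1
        # Snapshot the accuracy of the estimate after each episode
        mse_per_episode.append(sum((v - t) ** 2 for v, t in zip(V, true_v)) / n)
    return V, mse_per_episode

V, mse = td0_random_walk()
print(f"MSE after first episode: {mse[0]:.4f}, after last episode: {mse[-1]:.4f}")
```

Plotting `mse` against the episode index gives the kind of curve this task asks for; in the notebook the same per-episode bookkeeping would be done with the `Vs` list.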

In [None]:
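def td0_prediction(mdp, policy, num_simulations=30, alpha=0.01):
	"""
	TD(0) prediction algorithm
	:param mdp: episodic MDP to sample episodes from
	:param policy: policy to evaluate
	:param num_simulations: number of episodes to sample
	:param alpha: fixed learning rate
	:return: final value estimate V and the list Vs of intermediate estimates
	"""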
	# Conditions for convergence; note that we can also run on continuing problems
	assert(mdp.task == 'episodic')
	V = np.zeros(mdp.n_states)
	Vs = []

	# TODO

	return V, Vs

In [None]:
td_v, td_vs = td0_prediction(mdpe, mdpe.optimal_policy(), num_simulations=500, alpha=0.01)

In [None]:
# plots

**b**) **On-policy TD control**: Implement and run the SARSA algorithm to estimate the optimal policy. Remember that policy updates are $\epsilon$-greedy. Start from a uniform stochastic policy and run the algorithm for 10000 episodes, with a fixed learning rate $\alpha=0.01$ and $\epsilon=0.1$. Plot the MSE between the estimated q-value function and the true optimal $q_{*}$ as a function of the number of episodes. Note: you can obtain $q_{*}$ from $v_{*}$ using the method `mdpe.v_to_q()`.
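The $\epsilon$-greedy policy update mentioned above can be sketched for a single state as follows. The helper name `epsilon_greedy_row` is hypothetical, and the sketch assumes a policy is stored as one row of action probabilities per state; ties are broken by taking the first maximiser, which is only one of several reasonable choices.

```python
def epsilon_greedy_row(q_row, epsilon=0.1):
    """Return action probabilities that are epsilon-greedy w.r.t. one row of Q."""
    n_actions = len(q_row)
    # Index of the greedy action (first maximiser if several actions tie)
    best = max(range(n_actions), key=lambda a: q_row[a])
    # Exploration mass epsilon is spread uniformly over all actions
    probs = [epsilon / n_actions] * n_actions
    # The remaining mass 1 - epsilon goes to the greedy action
    probs[best] += 1.0 - epsilon
    return probs

print(epsilon_greedy_row([0.0, 2.0, 1.0, -1.0], epsilon=0.1))
```

With this construction every action keeps probability at least $\epsilon / |A|$, so the policy continues to explore while it improves.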

In [None]:
def sarsa(mdp, num_simulations=30, alpha=0.01, epsilon=0.1):
	"""
	SARSA on-policy control
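	:param mdp: episodic MDP to sample episodes from
	:param num_simulations: number of sample episodes
	:param alpha: learning rate
	:param epsilon: exploration rate of the epsilon-greedy policy update
	:return: learned policy, final action-value estimate Q and the list Qs of intermediate estimates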
	"""
	# Conditions for convergence; note that we can also run on continuing problems
	assert (mdp.task == 'episodic')
	Q = np.zeros([mdp.n_states, mdp.n_actions])
	policy = uniform_stochastic_policy(mdp.n_states, mdp.n_actions)
	Qs = []

	# TODO

	return policy, Q, Qs

In [None]:
policy_sarsa, Q, Qs = sarsa(mdpe, num_simulations=10000, alpha=0.01, epsilon=0.1)

In [None]:
# plots

**c**) **Off-policy TD control**: Implement and run the Q-learning algorithm to estimate the optimal policy. Start from a uniform stochastic policy and run the algorithm for 10000 episodes, with a fixed learning rate $\alpha=0.01$ and $\epsilon=0.1$. Plot the MSE between the estimated q-value function and the true optimal $q_{*}$ as a function of the number of episodes.
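A single Q-learning backup can be sketched in isolation as shown below. The function name `q_learning_update`, the `terminal` flag, and the discount factor `gamma` are assumptions made for illustration; how the notebook's `mdp` object exposes the discount factor is not shown here. The sketch uses nested lists, and the same indexing should also work on a 2-D NumPy array.

```python
def q_learning_update(Q, s, a, r, s1, terminal, alpha=0.01, gamma=1.0):
    """Apply one Q-learning backup to Q in place and return the TD error."""
    # Off-policy target: greedy value of the next state, or just r at episode end
    target = r if terminal else r + gamma * max(Q[s1])
    td_error = target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

# Tiny usage example with two states and two actions
Q = [[0.0, 0.0], [1.0, 3.0]]
err = q_learning_update(Q, s=0, a=1, r=-1.0, s1=1, terminal=False, alpha=0.5)
print(err, Q[0][1])
```

Inside an episode loop, the backup would be followed by moving to the next state (`s = s1`); without that step the loop never reaches a terminal state.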

In [None]:
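def qlearning(mdp, behavioral_policy, num_simulations=30, alpha=0.01, epsilon=0.1):
	"""
	Q-learning off-policy control
	:param mdp: episodic MDP to sample episodes from
	:param behavioral_policy: policy used to select actions while sampling
	:param num_simulations: number of sample episodes
	:param alpha: learning rate
	:param epsilon: exploration rate (not used in the skeleton below)
	:return: greedy target policy, final action-value estimate Q and the list Qs of per-episode estimates
	"""
	# Conditions for convergence; note that we can also run on continuing problems
	assert(mdp.task == 'episodic')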

	Q = np.zeros([mdp.n_states, mdp.n_actions])
	Qs = []

	for t in range(num_simulations):
		s = mdp.reset()
		while not mdp.is_terminal(s):
			a = mdp.sample_action(s, behavioral_policy)
			(s1, r) = mdp.step(s, a)

			# TODO: complete

		Qs.append(copy.copy(Q))
	# determine policy from Q function
	target_policy = np.zeros([mdp.n_states, mdp.n_actions])
	for state in mdp.terminal_states():
		target_policy[state,:] = 1.0 / mdp.n_actions
	for state in mdp.nonterminal_states():
		a_max = np.random.choice(np.flatnonzero(Q[state] == np.max(Q[state])))
		target_policy[state, a_max] = 1.0

	return target_policy, Q, Qs

In [None]:
b_policy = uniform_stochastic_policy(mdpe.n_states, mdpe.n_actions)
policy_Q, Q, Qs = qlearning(mdpe, behavioral_policy=b_policy, num_simulations=10000, alpha=0.01)

In [None]:
# plots