# Homework 10: Model-free Control (RL for Optimal Value Function/Policy)
## 1. Prove the Epsilon-Greedy Policy Improvement Theorem 
**Theorem:** For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi^{\prime}$ with respect to $q_{\pi}$ is an improvement: $v_{\pi^{\prime}}(s) \geq v_{\pi}(s)$ for all states $s$.

**Proof:** For any state $s$,
$$
\begin{aligned} q_{\pi}\left(s, \pi^{\prime}(s)\right) &=\sum_{a \in \mathcal{A}} \pi^{\prime}(a | s) q_{\pi}(s, a) \\ &=\epsilon / m \sum_{a \in \mathcal{A}} q_{\pi}(s, a)+(1-\epsilon) \max _{a \in \mathcal{A}} q_{\pi}(s, a) \\ & \geq \epsilon / m \sum_{a \in \mathcal{A}} q_{\pi}(s, a)+(1-\epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(a | s)-\epsilon / m}{1-\epsilon} q_{\pi}(s, a) \\ &=\sum_{a \in \mathcal{A}} \pi(a | s) q_{\pi}(s, a)=v_{\pi}(s) \end{aligned}
$$
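Here $m=|\mathcal{A}|$ is the number of actions. The inequality holds because the weights $\frac{\pi(a | s)-\epsilon / m}{1-\epsilon}$ are non-negative (an $\epsilon$-greedy $\pi$ satisfies $\pi(a | s) \geq \epsilon / m$) and sum to 1, so the weighted average of $q_{\pi}(s, a)$ cannot exceed the maximum. Therefore, by the policy improvement theorem, $v_{\pi^{\prime}}(s) \geq v_{\pi}(s)$ for all $s$.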
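The key inequality in the proof can also be checked numerically. The following is a minimal sketch, not part of the homework code: the helper names `epsilon_greedy` and `check_improvement` are illustrative, and random numbers stand in for $q_{\pi}(s, \cdot)$ at a single state.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Return epsilon-greedy action probabilities for one state."""
    m = len(q_values)
    greedy = max(range(m), key=lambda a: q_values[a])
    probs = [epsilon / m] * m
    probs[greedy] += 1.0 - epsilon
    return probs

def check_improvement(trials=1000, m=4, epsilon=0.1, seed=0):
    """Check sum_a pi'(a|s) q(s,a) >= sum_a pi(a|s) q(s,a) on random cases."""
    rng = random.Random(seed)
    for _ in range(trials):
        q_pi = [rng.uniform(-10, 10) for _ in range(m)]   # stands in for q_pi(s, .)
        q_old = [rng.uniform(-10, 10) for _ in range(m)]  # arbitrary values defining pi
        pi = epsilon_greedy(q_old, epsilon)      # some epsilon-greedy policy
        pi_new = epsilon_greedy(q_pi, epsilon)   # epsilon-greedy w.r.t. q_pi
        lhs = sum(p * q for p, q in zip(pi_new, q_pi))
        rhs = sum(p * q for p, q in zip(pi, q_pi))
        if lhs < rhs - 1e-12:  # small tolerance for floating-point rounding
            return False
    return True

print(check_improvement())
```

A passing check is only a sanity test of the one-step inequality; the full statement still relies on the policy improvement theorem.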

## 2. Provide the definition of GLIE (Greedy in the Limit with Infinite Exploration)
**Definition:**

   (1) All state-action pairs are explored infinitely many times, where $N_{k}(s, a)$ counts the visits to $(s, a)$ up to iteration $k$:

$$
\lim _{k \rightarrow \infty} N_{k}(s, a)=\infty
$$
   (2) The policy converges to a policy that is greedy with respect to $Q_{k}$, where $\mathbf{1}(\cdot)$ is the indicator function:

$$
\lim _{k \rightarrow \infty} \pi_{k}(a | s)=\mathbf{1}\left(a=\underset{a^{\prime} \in \mathcal{A}}{\operatorname{argmax}} Q_{k}\left(s, a^{\prime}\right)\right)
$$
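A commonly cited way to obtain a GLIE policy is $\epsilon$-greedy exploration with $\epsilon_{k}=1/k$. The sketch below only illustrates how the action probabilities change under that schedule; the function names and the sample action values are illustrative assumptions, not part of the homework code.

```python
def glie_epsilon(k):
    """Exploration rate for iteration k = 1, 2, ... (epsilon_k = 1/k)."""
    return 1.0 / k

def epsilon_greedy_probs(q_row, epsilon):
    """Epsilon-greedy probabilities for one state, given {action: value}."""
    m = len(q_row)
    best = max(q_row, key=q_row.get)
    return {a: epsilon / m + (1.0 - epsilon if a == best else 0.0)
            for a in q_row}

# Hypothetical action values for a single state.
q_row = {'a': 2.0, 'b': 1.0, 'c': 0.5}
for k in (1, 10, 1000):
    print(k, epsilon_greedy_probs(q_row, glie_epsilon(k)))
```

Every action keeps a positive probability at each finite $k$, while the probability of the greedy action tends to 1 as $k$ grows.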

## 3. Implement the tabular SARSA and tabular SARSA(Lambda) algorithms

In [2]:
from src.td0_control import TD0_control
from src.mdp_refined import MDPRefined

# MDP specification, read as {state: {action: {next_state: (probability, reward)}}}.
# State 3 only loops back to itself with zero reward.
mdp_refined_data = {
    1: {
        'a': {1: (0.3, 9.2), 2: (0.6, 4.5), 3: (0.1, 5.0)},
        'b': {2: (0.3, -0.5), 3: (0.7, 2.6)},
        'c': {1: (0.2, 4.8), 2: (0.4, -4.9), 3: (0.4, 0.0)}
    },
    2: {
        'a': {1: (0.3, 9.8), 2: (0.6, 6.7), 3: (0.1, 1.8)},
        'c': {1: (0.2, 4.8), 2: (0.4, 9.2), 3: (0.4, -8.2)}
    },
    3: {
        'a': {3: (1.0, 0.0)},
        'b': {3: (1.0, 0.0)}
    }
}
gamma_val = 0.9  # discount factor
mdp_ref_obj1 = MDPRefined(mdp_refined_data, gamma_val)
mdp_rep_obj = mdp_ref_obj1.get_mdp_rep_for_rl_tabular()
epsilon_val = 0.1
epsilon_half_life_val = 100
learning_rate_val = 0.1
learning_rate_decay_val = 1e6
episodes_limit = 5000
max_steps_val = 1000

sarsa = TD0_control(
    mdp_rep_obj,
    epsilon_val,
    epsilon_half_life_val,
    learning_rate_val,
    learning_rate_decay_val,
    episodes_limit,
    max_steps_val,
    'Sarsa'
)

print("Value-action function estimates with Sarsa")
qv_sarsa = sarsa.get_qv_func_dict()
print(qv_sarsa)

Value-action function estimates with Sarsa
{1: {'b': 10.883485571927277, 'a': 29.01505207860538, 'c': 13.090263933704076}, 2: {'a': 29.128953572925912, 'c': 10.200608175810016}, 3: {'b': 0.0, 'a': 0.0}}
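The cell above covers tabular SARSA through `TD0_control`. For the SARSA($\lambda$) part, the following is a self-contained sketch with accumulating eligibility traces. It does not use the `src` package: the function names are hypothetical, $\epsilon$ and the learning rate are held constant for brevity, and state 3 is treated as terminal (an assumption based on its zero-reward self-loop). The MDP data is copied from the cell above.

```python
import random

def eps_greedy_action(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    actions = sorted(q_row)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=q_row.get)

def sample_step(mdp, s, a, rng):
    """Sample (next_state, reward) from {next_state: (probability, reward)}."""
    next_states = sorted(mdp[s][a])
    weights = [mdp[s][a][s1][0] for s1 in next_states]
    s1 = rng.choices(next_states, weights=weights)[0]
    return s1, mdp[s][a][s1][1]

def sarsa_lambda(mdp, terminal, gamma, lam, alpha, epsilon,
                 episodes, max_steps, seed=0):
    """Tabular SARSA(lambda) with accumulating traces (illustrative sketch)."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in acts} for s, acts in mdp.items()}
    starts = sorted(s for s in mdp if s not in terminal)
    for _ in range(episodes):
        # Eligibility traces are reset at the start of every episode.
        trace = {s: {a: 0.0 for a in acts} for s, acts in mdp.items()}
        s = rng.choice(starts)
        a = eps_greedy_action(q[s], epsilon, rng)
        for _ in range(max_steps):
            s1, r = sample_step(mdp, s, a, rng)
            a1 = eps_greedy_action(q[s1], epsilon, rng)
            # On-policy TD error; a terminal state contributes no future value.
            target = r if s1 in terminal else r + gamma * q[s1][a1]
            delta = target - q[s][a]
            trace[s][a] += 1.0  # accumulating trace for the visited pair
            for x in q:
                for u in q[x]:
                    q[x][u] += alpha * delta * trace[x][u]
                    trace[x][u] *= gamma * lam  # decay every trace
            if s1 in terminal:
                break
            s, a = s1, a1
    return q

mdp_data = {
    1: {'a': {1: (0.3, 9.2), 2: (0.6, 4.5), 3: (0.1, 5.0)},
        'b': {2: (0.3, -0.5), 3: (0.7, 2.6)},
        'c': {1: (0.2, 4.8), 2: (0.4, -4.9), 3: (0.4, 0.0)}},
    2: {'a': {1: (0.3, 9.8), 2: (0.6, 6.7), 3: (0.1, 1.8)},
        'c': {1: (0.2, 4.8), 2: (0.4, 9.2), 3: (0.4, -8.2)}},
    3: {'a': {3: (1.0, 0.0)}, 'b': {3: (1.0, 0.0)}},
}
q_sl = sarsa_lambda(mdp_data, terminal={3}, gamma=0.9, lam=0.8,
                    alpha=0.1, epsilon=0.1, episodes=2000, max_steps=1000)
print(q_sl)
```

With `lam=0` the trace of every pair other than the current one is zero, so the update reduces to one-step SARSA. Because the step size and $\epsilon$ are constant here, the estimates keep fluctuating rather than converging exactly.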


## 4. Implement the tabular Q-Learning algorithm

In [3]:
qlearning = TD0_control(
    mdp_rep_obj,
    epsilon_val,
    epsilon_half_life_val,
    learning_rate_val,
    learning_rate_decay_val,
    episodes_limit,
    max_steps_val,
    'Q-learning'
)

print("Value-action function estimates with Q-learning")
qv_qlearn = qlearning.get_qv_func_dict()
print(qv_qlearn)


Value-action function estimates with Q-learning
{1: {'b': 11.606916520154726, 'a': 35.42292975972827, 'c': 17.902709119148522}, 2: {'a': 34.068557603582164, 'c': 18.34462477171524}, 3: {'b': 0.0, 'a': 0.0}}
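The difference between the two methods lies in the bootstrap target. The sketch below contrasts a single Q-learning update with a single SARSA update on the same transition; the function names and the small Q-table are hypothetical and are not taken from `TD0_control`.

```python
def q_learning_update(q, s, a, r, s1, alpha, gamma):
    """Off-policy target: bootstrap from the best action in the next state."""
    target = r + gamma * max(q[s1].values())
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]

def sarsa_update(q, s, a, r, s1, a1, alpha, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    target = r + gamma * q[s1][a1]
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]

def fresh_table():
    # Hypothetical Q-table used only for this illustration.
    return {1: {'a': 0.0, 'b': 0.0}, 2: {'a': 10.0, 'c': -5.0}}

# Same transition (s=1, a='a', r=4.5, s'=2) for both methods.
q_ql = fresh_table()
q_sa = fresh_table()
new_ql = q_learning_update(q_ql, 1, 'a', 4.5, 2, alpha=0.1, gamma=0.9)
# SARSA happens to follow up with the exploratory action 'c'.
new_sa = sarsa_update(q_sa, 1, 'a', 4.5, 2, 'c', alpha=0.1, gamma=0.9)
print(new_ql)  # about 1.35, from target 4.5 + 0.9 * 10
print(new_sa)  # 0.0, from target 4.5 + 0.9 * (-5)
```

Q-learning ignores which action the behaviour policy takes next, whereas SARSA's target depends on it.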


## 5. Test the above algorithms on some example MDPs by using DP Policy Iteration/Value Iteration solutions as a benchmark

In [6]:
# Policy passed to policy iteration, as {state: {action: probability}}
policy_data = {
    1: {'a': 0.4, 'b': 0.6},
    2: {'a': 0.7, 'c': 0.3},
    3: {'b': 1.0}
}
pol1, val1 = mdp_ref_obj1.policy_iteration(policy_data)
pol2, val2 = mdp_ref_obj1.value_iteration()
print("Value function estimates with policy iteration")
print(val1)
print("Value function estimates with value iteration")
print(val2)
# Greedy state values from the learned action values: V(s) = max_a Q(s, a)
val_sarsa = {s: max(v.values()) for s, v in qv_sarsa.items()}
print("Value function estimates with Sarsa")
print(val_sarsa)
val_qlearn = {s: max(v.values()) for s, v in qv_qlearn.items()}
print("Value function estimates with Q-learning")
print(val_qlearn)

Number of iterations: 2.
Number of iterations: 98.
Value function estimates with policy iteration
{1: 34.7221052631579, 2: 35.90210526315789, 3: 0.0}
Value function estimates with value iteration
{1: 34.72210522497511, 2: 35.90210522497511, 3: 0.0}
Value function estimates with Sarsa
{1: 29.01505207860538, 2: 29.128953572925912, 3: 0.0}
Value function estimates with Q-learning
{1: 35.42292975972827, 2: 34.068557603582164, 3: 0.0}
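To quantify the comparison, the largest absolute deviation from the value iteration benchmark can be computed from the numbers printed above. This is a small sketch; the variable names and the helper `max_abs_error` are illustrative, and the values are copied from the output of this run.

```python
# Values copied from the printed output above.
bench_vi = {1: 34.72210522497511, 2: 35.90210522497511, 3: 0.0}
printed_sarsa = {1: 29.01505207860538, 2: 29.128953572925912, 3: 0.0}
printed_qlearn = {1: 35.42292975972827, 2: 34.068557603582164, 3: 0.0}

def max_abs_error(estimate, benchmark):
    """Largest absolute difference over all states."""
    return max(abs(estimate[s] - benchmark[s]) for s in benchmark)

err_sarsa = max_abs_error(printed_sarsa, bench_vi)
err_qlearn = max_abs_error(printed_qlearn, bench_vi)
print("Sarsa max abs error:", err_sarsa)        # about 6.77
print("Q-learning max abs error:", err_qlearn)  # about 1.83
```

In this run the Q-learning estimates are within about 1.8 of the benchmark, while the Sarsa estimates are lower by up to about 6.8.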
