## Homework 8: Model-free (RL) Prediction With Monte Carlo and Temporal Difference
### 1. Monte-Carlo
**Class design: My Monte Carlo class implements an interface for tabular RL algorithms. It includes both the first-visit and the every-visit Monte Carlo algorithms.**

(1) First-visit Monte-Carlo

<img src="first_visit_montecarlo.png">

(2) Every-visit Monte-Carlo

<img src="every_visit_montecarlo.png">
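As a minimal sketch of how the two variants differ (this is not the `MonteCarlo` class from `src.monte_carlo`, whose internals are not shown here), the hypothetical function `mc_prediction` below averages sampled returns per state. It assumes each episode is given as a list of `(state, reward)` pairs, a format chosen only for this illustration.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) as the average of sampled returns.

    Assumed input format (illustration only): each episode is a list of
    (state, reward) pairs, where reward is received on leaving that state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Scan backwards to get the return G_t for every time step.
        g = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][1] + gamma * g
            returns[t] = g
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # first-visit: later occurrences are ignored
            seen.add(state)
            returns_sum[state] += returns[t]
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# One episode that visits state 1 twice before terminating.
episode = [(1, 4.5), (2, 9.8), (1, 2.6)]
first = mc_prediction([episode], gamma=1.0, first_visit=True)
every = mc_prediction([episode], gamma=1.0, first_visit=False)
print(first)  # V(1) uses only the return from t=0 (about 16.9)
print(every)  # V(1) averages the returns from t=0 and t=2 (about 9.75)
```

The two estimates differ only for states that occur more than once in an episode, here state 1.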


In [2]:
from src.monte_carlo import MonteCarlo
from src.mdp_refined import MDPRefined

mdp_refined_data = {
        1: {
            'a': {1: (0.3, 9.2), 2: (0.6, 4.5), 3: (0.1, 5.0)},
            'b': {2: (0.3, -0.5), 3: (0.7, 2.6)},
            'c': {1: (0.2, 4.8), 2: (0.4, -4.9), 3: (0.4, 0.0)}
        },
        2: {
            'a': {1: (0.3, 9.8), 2: (0.6, 6.7), 3: (0.1, 1.8)},
            'c': {1: (0.2, 4.8), 2: (0.4, 9.2), 3: (0.4, -8.2)}
        },
        3: {
            'a': {3: (1.0, 0.0)},
            'b': {3: (1.0, 0.0)}
        }
    }

gamma_val = 1.0
mdp_ref_obj1 = MDPRefined(mdp_refined_data, gamma_val)
mdp_rep_obj = mdp_ref_obj1.get_mdp_rep_for_rl_tabular()

exploring_start_val = False
first_visit_flag = True
episodes_limit = 1000
max_steps_val = 1000
mc_obj = MonteCarlo(mdp_rep_obj, exploring_start_val, first_visit_flag, episodes_limit, max_steps_val)

policy_data = {
        1: {'a': 0.4, 'b': 0.6},
        2: {'a': 0.7, 'c': 0.3},
        3: {'b': 1.0}
    }

this_mc_path = mc_obj.get_mc_path(policy_data, 1)
print("One of the Monte Carlo Paths: ")
print(this_mc_path)

this_vf_dict = mc_obj.get_value_func_dict(policy_data)
print("Estimated Value Function: ")
print(this_vf_dict)

One of the Monte Carlo Paths: 
[(1, 'a', 4.5, True), (2, 'a', 9.8, True), (1, 'b', 2.6, False), (3, 'End', 0, 'End')]
Estimated Value Function: 
{1: 12.188085937500066, 2: 17.262925851703432, 3: 0.0}


### 2. Temporal-Difference Learning

<img src="TD0.png">
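The TD(0) update can be sketched in a few lines. This is an illustration under stated assumptions, not the `TD0` class from `src.td0`: the function name `td0_prediction`, the `(state, reward, next_state)` episode format, and the constant step size are all choices made for this example (the class used below also takes a learning-rate decay parameter).

```python
from collections import defaultdict


def td0_prediction(episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction with a constant step size.

    Assumed input format (illustration only): each episode is a list of
    (state, reward, next_state) transitions. States that never appear as
    the first element (terminal states) keep their initial value of 0.
    """
    v = defaultdict(float)
    for episode in episodes:
        for state, reward, next_state in episode:
            # Bootstrap from the current estimate of the next state.
            td_target = reward + gamma * v[next_state]
            v[state] += alpha * (td_target - v[state])
    return dict(v)


# A single episode: 1 -> 2 -> 1 -> 3, where state 3 is terminal.
episode = [(1, 4.5, 2), (2, 9.8, 1), (1, 2.6, 3)]
v = td0_prediction([episode], alpha=0.1, gamma=1.0)
print(v)
```

Unlike Monte Carlo, each update here uses only one observed reward plus the current estimate of the successor state, so values can be updated before the episode ends.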

In [7]:
from src.td0 import TD0

exploring_start_val = False
epsilon_val = 0.1
epsilon_half_life_val = 1000
learning_rate_val = 0.1
learning_rate_decay_val = 1e6
episodes_limit = 10000
max_steps_val = 1000
td0_obj = TD0(
        mdp_rep_obj,
        exploring_start_val,
        epsilon_val,
        epsilon_half_life_val,
        learning_rate_val,
        learning_rate_decay_val,
        episodes_limit,
        max_steps_val
    )

td0_vf_dict = td0_obj.get_value_func_dict(policy_data)
print("Estimated Value Function by TD0: ")
print(td0_vf_dict)

Estimated Value Function by TD0: 
{1: 10.704335322963122, 2: 13.415799208176635, 3: 0.0}


### 3. Test it against VI and PI

In [5]:
val = mdp_ref_obj1.policy_evaluation(policy_data)
print("Policy Evaluation: ")
print(val)

Policy Evaluation: 
{1: 13.136856554564174, 2: 19.462937542896363}
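As an independent cross-check, the same values can be reproduced without the `MDPRefined` class by iterating the Bellman expectation equation directly on the data above. This is a sketch, assuming each tuple in the MDP dictionary is `(transition probability, reward)`; the function name `evaluate_policy` is hypothetical.

```python
mdp = {
    1: {
        'a': {1: (0.3, 9.2), 2: (0.6, 4.5), 3: (0.1, 5.0)},
        'b': {2: (0.3, -0.5), 3: (0.7, 2.6)},
        'c': {1: (0.2, 4.8), 2: (0.4, -4.9), 3: (0.4, 0.0)},
    },
    2: {
        'a': {1: (0.3, 9.8), 2: (0.6, 6.7), 3: (0.1, 1.8)},
        'c': {1: (0.2, 4.8), 2: (0.4, 9.2), 3: (0.4, -8.2)},
    },
    3: {
        'a': {3: (1.0, 0.0)},
        'b': {3: (1.0, 0.0)},
    },
}
policy = {
    1: {'a': 0.4, 'b': 0.6},
    2: {'a': 0.7, 'c': 0.3},
    3: {'b': 1.0},
}


def evaluate_policy(mdp, policy, gamma=1.0, tol=1e-12, max_sweeps=100_000):
    """Iterative policy evaluation with in-place sweeps.

    Assumes mdp[s][a][s_next] = (probability, reward).
    """
    v = {s: 0.0 for s in mdp}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in mdp:
            new_v = 0.0
            for a, prob_a in policy[s].items():
                for s_next, (p, r) in mdp[s][a].items():
                    new_v += prob_a * p * (r + gamma * v[s_next])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break  # values changed by less than tol in a full sweep
    return v


exact = evaluate_policy(mdp, policy, gamma=1.0)
print(exact)
```

Measured against these values, the Monte Carlo estimates reported above (about 12.19 and 17.26) are closer than the TD(0) estimates (about 10.70 and 13.42) for this particular run.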


### 4. Prove that a fixed learning rate (step size alpha) for MC is equivalent to an exponentially decaying average of episode returns

Denote by $V\left(S_{t}\right)^{k}$ the value of $V\left(S_{t}\right)$ after the $k$th update, and by $G_{t}^{k-1}$ the return used in that update. The constant-step-size update is
$$
V\left(S_{t}\right)^{k}=V\left(S_{t}\right)^{k-1}+\alpha\left(G_{t}^{k-1}-V\left(S_{t}\right)^{k-1}\right)=(1-\alpha) V\left(S_{t}\right)^{k-1}+\alpha G_{t}^{k-1}
$$
Substituting the same expression for $V\left(S_{t}\right)^{k-1}$ gives
$$
\begin{aligned}
V\left(S_{t}\right)^{k} &=(1-\alpha)\left((1-\alpha) V\left(S_{t}\right)^{k-2}+\alpha G_{t}^{k-2}\right)+\alpha G_{t}^{k-1} \\
&=(1-\alpha)^{2} V\left(S_{t}\right)^{k-2}+(1-\alpha) \alpha G_{t}^{k-2}+\alpha G_{t}^{k-1}
\end{aligned}
$$
Repeating the substitution down to the initial estimate $V\left(S_{t}\right)^{0}$ gives
$$
V\left(S_{t}\right)^{k}=(1-\alpha)^{k} V\left(S_{t}\right)^{0}+\alpha \sum_{i=0}^{k-1}(1-\alpha)^{k-1-i} G_{t}^{i}
$$
With the initial estimate $V\left(S_{t}\right)^{0}=0$, this is
$$
V\left(S_{t}\right)^{k}=\alpha\left((1-\alpha)^{k-1} G_{t}^{0}+\ldots+(1-\alpha) G_{t}^{k-2}+G_{t}^{k-1}\right)
$$
The most recent return has weight $\alpha$, and each older return is discounted by a further factor of $(1-\alpha)$, so the estimate is an exponentially decaying average of the episode returns.
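A quick numerical check of this result, as a sketch: the incremental update $V \leftarrow V+\alpha(G-V)$ applied to $k$ returns should equal $(1-\alpha)^{k} V^{0}+\alpha \sum_{i=0}^{k-1}(1-\alpha)^{k-1-i} G^{i}$. The function names and the sample returns below are illustrative choices, not part of the homework code.

```python
def incremental_update(returns, alpha, v0=0.0):
    """Apply V <- V + alpha * (G - V) once per observed return."""
    v = v0
    for g in returns:
        v += alpha * (g - v)
    return v


def exponential_average(returns, alpha, v0=0.0):
    """Closed form: exponentially decaying weights on past returns."""
    k = len(returns)
    total = (1 - alpha) ** k * v0
    for i, g in enumerate(returns):
        # The return observed at update i has been decayed k-1-i times.
        total += alpha * (1 - alpha) ** (k - 1 - i) * g
    return total


sample_returns = [16.9, 2.6, 12.4, 7.0, -1.5]  # arbitrary illustrative values
alpha = 0.1
lhs = incremental_update(sample_returns, alpha)
rhs = exponential_average(sample_returns, alpha)
print(lhs, rhs)

# The weights on v0 and on all returns sum to one, so this is a weighted average.
k = len(sample_returns)
weight_sum = (1 - alpha) ** k + sum(alpha * (1 - alpha) ** (k - 1 - i) for i in range(k))
print(weight_sum)
```

The two computations agree up to floating-point rounding, and the same holds for a nonzero initial estimate `v0`.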