# 04 Homework 🏋️🏋️🏋️

#### 👉A course without homework is not a course!

#### 👉Spend some time thinking and trying to implement the challenges I propose here.

#### 👉They are not so easy, so if you get stuck, drop me an email at `plabartabajo@gmail.com`

-----

## 1. Can you update the function `train` in a way that the input `epsilon` can also be a callable function?

An `epsilon` value that decays after each episode works better than a fixed `epsilon` for most RL problems.

This is a hard exercise, but I want you to give it a try.

If you do not manage it, do not worry. We are going to implement this in an upcoming lesson.

In [20]:
import random
import numpy as np

import gym

from tqdm import tqdm
from typing import Any, Callable, List, Tuple, Union

In [14]:
class QAgent:

    def __init__(self, env, alpha, gamma):
        self.env = env

        # table with q-values: n_states * n_actions
        self.q_table = np.zeros([env.observation_space.n, env.action_space.n])

        # hyper-parameters
        self.alpha = alpha
        self.gamma = gamma

    def get_action(self, state):
        """Returns the greedy action, i.e. the one with the highest q-value for `state`."""
        return np.argmax(self.q_table[state])

    def update_parameters(self, state, action, reward, next_state):
        """Applies the Q-learning update to the q-value of (state, action)."""
        old_value = self.q_table[state, action]
        next_max = np.max(self.q_table[next_state])

        new_value = (1 - self.alpha) * old_value + self.alpha * (reward + self.gamma * next_max)
        self.q_table[state, action] = new_value

    def reset(self):
        """
        Sets q-values to zeros, which essentially means the agent does not know
        anything
        """
        self.q_table = np.zeros([self.env.observation_space.n, self.env.action_space.n])


In [22]:
def train_with_variable_epsilon(
    agent,
    env,
    n_episodes: int,
    epsilon: Union[float, Callable[[int], float]]
) -> Tuple[Any, List, List]:
    """
    Trains an agent and returns 3 things:
    - agent object
    - timesteps_per_episode
    - penalties_per_episode
    """
    # For plotting metrics
    timesteps_per_episode = []
    penalties_per_episode = []

    for i in tqdm(range(0, n_episodes)):

        state = env.reset()

        epochs, penalties, reward = 0, 0, 0
        done = False

        # Epsilon depends only on the episode index, so resolve it once per episode
        eps = epsilon(i) if callable(epsilon) else epsilon

        while not done:

            if random.uniform(0, 1) < eps:
                # Explore action space
                action = env.action_space.sample()
            else:
                # Exploit learned values
                action = agent.get_action(state)

            next_state, reward, done, info = env.step(action)

            agent.update_parameters(state, action, reward, next_state)

            if reward == -10:
                penalties += 1

            state = next_state
            epochs += 1

        timesteps_per_episode.append(epochs)
        penalties_per_episode.append(penalties)

    return agent, timesteps_per_episode, penalties_per_episode

def schedule_epsilon(n_episode: int) -> float:
    """Step schedule: epsilon is 0.1 up to and including episode index 50, then 0.05."""
    if n_episode > 50:
        return 0.05
    else:
        return 0.1
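
The step schedule above lowers `epsilon` only once, after episode 50. A smoother option is to decay it exponentially toward a floor. The following is a minimal sketch of that idea, not part of the course code: the factory name `make_exponential_epsilon` and the defaults for `eps_start`, `eps_min`, and `decay_rate` are illustrative assumptions that you would need to tune.

```python
from typing import Callable


def make_exponential_epsilon(
    eps_start: float = 1.0,
    eps_min: float = 0.05,
    decay_rate: float = 0.99,
) -> Callable[[int], float]:
    """Builds a schedule mapping episode index -> epsilon (hypothetical helper)."""

    def schedule(n_episode: int) -> float:
        # Shrink epsilon by decay_rate once per episode, but never below eps_min
        return max(eps_min, eps_start * decay_rate ** n_episode)

    return schedule


# Assumed default values; any callable taking an int and returning a float works
decaying_epsilon = make_exponential_epsilon()
```

Since the returned function takes the episode index and returns a float, it should be usable as the `epsilon` argument of `train_with_variable_epsilon` in place of `schedule_epsilon`.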

In [24]:
env = gym.make("Taxi-v3").env
alpha, gamma = 0.1, 0.9
agent = QAgent(env, alpha, gamma)

agent, timesteps, penalties = train_with_variable_epsilon(
    agent, env, 100, schedule_epsilon
)

100%|██████████| 100/100 [00:00<00:00, 126.13it/s]


-----

## 2. Can you parallelize the function `train_many_runs` using Python's `multiprocessing` module?

I do not like waiting and staring at one progress bar after another when each run in `train_many_runs` could execute in parallel.

Create a new function called `train_many_runs_in_parallel` that outputs the same results as `train_many_runs` but executes in a fraction of the time.
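
As a starting point, here is one possible shape for the parallel version, using `multiprocessing.Pool`. It is a sketch under assumptions, since `train_many_runs` is not shown in this notebook: the signature below (a per-run function plus a list of seeds) is my own choice, and `example_run` is a hypothetical stand-in that only draws placeholder numbers where a real run would create the env and agent and call the training function.

```python
import multiprocessing as mp
import random
from typing import Callable, Iterable, List, Optional, Tuple

# (timesteps_per_episode, penalties_per_episode) for one run
RunResult = Tuple[List[int], List[int]]


def example_run(seed: int) -> RunResult:
    """Hypothetical stand-in for one training run (placeholder numbers only).

    A real worker would build the env and agent here, train, and return
    the per-episode metrics.
    """
    rng = random.Random(seed)  # per-run seed keeps each run reproducible
    timesteps = [rng.randint(10, 200) for _ in range(5)]
    penalties = [rng.randint(0, 10) for _ in range(5)]
    return timesteps, penalties


def train_many_runs_in_parallel(
    run_fn: Callable[[int], RunResult],
    seeds: Iterable[int],
    n_processes: Optional[int] = None,
) -> List[RunResult]:
    """Calls run_fn(seed) for every seed, spreading the calls over worker processes."""
    # n_processes=None lets Pool pick the number of workers from the CPU count
    with mp.Pool(processes=n_processes) as pool:
        # map blocks until every run has finished and returns results in seed order
        return pool.map(run_fn, list(seeds))
```

A call would look like `train_many_runs_in_parallel(example_run, range(10))`. The function passed as `run_fn` has to be picklable, so it should be defined at module top level. On platforms that start workers with `spawn` (Windows, macOS), the call also needs to sit under an `if __name__ == "__main__":` guard, and inside a notebook you may have to move the worker into a separate `.py` file.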