# Reinforcement Learning
Prof. Milica Gašić


Set up the notebook: Over the course of the exercises, we will need several Python packages. All requirements are listed in the attached requirements.txt file. We suggest creating a virtual environment and installing the requirements there (see https://docs.python.org/3/library/venv.html). A virtual environment can be created and activated using

```console
python3 -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
```

You can then install the requirements using

```console
pip install -r requirements.txt
```

### Three state MDP

In [None]:
import random
import gymnasium as gym

#### Implement the three state MDP shown on the exercise set as a `gym` environment.

For more information on `gym` (the `gymnasium` package, imported above as `gym`), visit https://gymnasium.farama.org

A `gym` environment can be thought of as a **simulator** of the underlying MDP.
This means that we don't need to know the actual dynamics distribution, but we
only need to implement what happens when we take some action in some state. Most
practical applications of Reinforcement Learning only have access to such
simulators, as we will see later in the lecture.

A `gym` environment must provide the following:
- An `observation_space` that describes the set of states.  
  *Note: `gym` supports partially observable Markov decision processes (POMDPs),
  which are more general than MDPs. In the partially observable setting, we
  cannot observe the full state, but only a limited observation of the state. In
  our case the states are fully observable, i.e., the observation space is the
  same as the state space.*
- An `action_space` that describes the set of actions.
- A `reset()` method that resets the state of the environment and returns this
  state together with an `info` dictionary.
- A `step(action)` method that performs a transition from the current state and
  the given action to the next state. It returns the next state, the reward
  that was produced during this transition, a `terminated` flag, a `truncated`
  flag, and an `info` dictionary.

We have already started the implementation below; just fill in the missing code.

In [None]:
class ThreeStateMDP(gym.Env):

    def __init__(self):
        # Define the observation space for the three states.
        # The states are encoded as integers, i.e., X = 0, Y = 1, Z = 2.
        self.observation_space = gym.spaces.Discrete(3)

        # Define the action space for the two actions.
        # The actions are encoded as integers, i.e., left = 0, right = 1.
        self.action_space = gym.spaces.Discrete(2)

        # Create an attribute for the current state of this environment
        # (initially empty).
        self.state = None

    def reset(self):
        #######################################################################
        # TODO: This method resets the internal state of this environment     #
        # according to the initial state distribution. Remember that the      #
        # states are encoded as integers (see __init__).                      #
        #######################################################################
        
        #######################################################################
        # End of your code.                                                   #
        #######################################################################

        # The gym API requires two return values:
        # - The current state of the environment
        # - An `info` dictionary containing additional reset information, which
        #   is empty in our case
        info = {}
        return self.state, info

    def step(self, action):
        #######################################################################
        # TODO: This method transitions from the current state of the         #
        # environment and the provided action to a next state and produces a  #
        # reward. gym also explicitly differentiates between non-terminal and #
        # terminal states since this simplifies a lot of our code. For this   #
        # reason, we also have to return a `terminated` flag that indicates   #
        # whether the new state is terminal (i.e., the episode ended in a     #
        # terminal state). You can generate a random number between 0 and 1   #
        # using random.random()                                               #
        #######################################################################
        
        #######################################################################
        # End of your code.                                                   #
        #######################################################################

        # In addition to the next state, reward, and the terminated flag, the
        # gym API requires two more return values:
        # - A `truncated` flag that indicates whether the episode was truncated,
        #   i.e., whether it ended early for some other reason than reaching a
        #   terminal state (e.g. when a maximum number of steps is reached),
        #   which is always False in our case
        # - An `info` dictionary containing additional step information, which
        #   is empty in our case
        truncated = False
        info = {}
        return self.state, reward, terminated, truncated, info

To test your implementation we can do the following:
1. Sample an episode with some policy (which we call performing a "rollout") and
   compute the sum of all rewards.
2. Repeat this for $n$ episodes and compute the average of these sums.
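
Written out, the two steps above compute a Monte Carlo estimate. The notation here is an assumption and is not taken from the exercise set: $r_t^{(i)}$ denotes the reward received at step $t$ of episode $i$, and $T_i$ denotes the number of steps in that episode.
$$\hat{J}_n = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{T_i} r_t^{(i)}$$
Provided the expected sum of rewards is finite, $\hat{J}_n$ approaches it as $n$ grows (law of large numbers), which is why a large $n$ gives a stable value.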

We already implemented the function `average_reward_sum(env, policy, n)` that
takes a `gym` environment and a policy, and returns the average sum of rewards
by sampling `n` episodes. The `policy` must be a function that returns an action
given a state.

In [None]:
def average_reward_sum(env, policy, n):
    total_reward_sum = 0
    for i in range(n):
        # Start a new episode, reset the environment
        state, info = env.reset()
        reward_sum = 0
        while True:
            # Select an action using the provided policy
            action = policy(state)
            # Perform an environment step
            state, reward, terminated, truncated, info = env.step(action)
            # Add the reward to the reward sum
            reward_sum += reward
            if terminated or truncated:
                # Exit the loop when the episode ended in a terminal state or was
                # truncated early
                break
        total_reward_sum += reward_sum
    # Compute the average over n episodes
    result = total_reward_sum / n
    return result

We already implemented a random policy that selects left or right with a coin flip:

In [None]:
def random_policy(state):
    if random.random() < 0.5:
        return 0  # left with 50% probability
    else:
        return 1  # right with 50% probability

If you implemented the environment correctly, you should get an average sum of
rewards of approximately 10 (for the random policy):

In [None]:
env = ThreeStateMDP()
average_reward_sum(env, random_policy, n=10000)

#### Implement the policies $\pi_1$ and $\pi_3$ from the exercise set.
Remember the definition of the deterministic policy $\pi_1$
$$\begin{aligned}
  \pi_1(X) & = \text{right} \\
  \pi_1(Y) & = \text{right}
\end{aligned}$$
and the stochastic policy $\pi_3$
$$\begin{aligned}
  \pi_3(\text{left}|X) & = 0 & \pi_3(\text{left}|Y) & = 0.9 \\
  \pi_3(\text{right}|X) & = 1 & \pi_3(\text{right}|Y) & = 0.1
\end{aligned}$$
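
A stochastic policy given as a table of probabilities can be turned into a sampling function with a single call to `random.random()`. The following is a minimal sketch under stated assumptions: the table `example_probs` and the helper `sample_action` are hypothetical names, and the probabilities are deliberately not those of $\pi_1$ or $\pi_3$.

```python
import random

# Hypothetical policy table (NOT a policy from the exercise set):
# for each state, the probabilities of the actions left = 0 and right = 1.
example_probs = {
    0: [0.25, 0.75],
    1: [0.5, 0.5],
}


def sample_action(state, probs=example_probs):
    """Draw an action for `state` according to the probabilities in `probs`."""
    u = random.random()  # uniform in [0, 1)
    cumulative = 0.0
    for action, p in enumerate(probs[state]):
        cumulative += p
        if u < cumulative:
            return action
    # Guard against floating-point round-off in the cumulative sum.
    return len(probs[state]) - 1
```

A deterministic policy is the special case in which one action has probability 1, so no random number is needed and the action can be returned directly.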

In [None]:
def pi1(state):
    #######################################################################
    # TODO: Select an action based on the definition of pi1.              #
    #######################################################################
    
    #######################################################################
    # End of your code.                                                   #
    #######################################################################
    return action

Evaluate policy $\pi_1$ by computing the average reward sum. What value do you get?

In [None]:
env = ThreeStateMDP()
average_reward_sum(env, pi1, n=10000)

<details>
<summary>Solution (spoiler)</summary>
The value should be around 6
</details>

In [None]:
def pi3(state):
    #######################################################################
    # TODO: Select an action based on the definition of pi3.              #
    #######################################################################
    
    #######################################################################
    # End of your code.                                                   #
    #######################################################################
    return action

Evaluate policy $\pi_3$ by computing the average reward sum. What value do you get?

In [None]:
env = ThreeStateMDP()
average_reward_sum(env, pi3, n=10000)

<details>
<summary>Solution (spoiler)</summary>
The value should be around 42
</details>