Please fill in your name and that of your teammate.

You: Hans-Andrea Danuser

Teammate:

# Introduction

Welcome to the twelfth lab. It's finally time to dedicate a whole lecture to the foundations of Reinforcement Learning; I hope the course so far has prepared you for this.

There is a lot of coding, leveraging the fact that you covered DL last week; however, much of the code is already provided, to limit the time you need. Nonetheless, anything in the assignment could show up at the exam, so if anything is unclear I suggest you ask on Moodle.

### How to pass the lab?

Below you find the exercise questions. Each question awarding points is numbered and states the number of points like this: **[0pt]**. To answer a question, fill the cell below with your answer (markdown for text, code for implementation). Incorrect or incomplete answers are in principle worth 0 points: partial credit is awarded only at the teacher's discretion. Over-complete answers do not award extra points (though they are appreciated and will be kept under consideration). Save your work frequently! (`ctrl+s`)

**You need at least 14 points (out of 21 available) to pass** (66%).

# 1. Fundamentals

#### 1.1 **[1pt]** Explain the Reinforcement Learning paradigm in English. Use the words Environment, Agent, Action and Feedback.

In the RL paradigm, an agent and an environment are bound in an interaction loop: the agent sends actions to the environment, and the environment sends back feedback (typically a reward).

#### 1.2 **[1pt]** Explain the equation for (pseudo-)Regret in English.

The pseudo-regret is the difference between the best possible reward and the reward actually obtained. It is never negative, and the goal is to keep it as small as possible.
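As a rough illustration, assuming the common multi-armed bandit form $\bar{R}_T = T\mu^* - \sum_{t=1}^{T} \mu_{a_t}$ (the lecture's notation may differ), the pseudo-regret of a sequence of pulls could be computed as sketched here. The arm means and the function name `pseudo_regret` are made up for the example.

```python
def pseudo_regret(arm_means, chosen_arms):
    """Pseudo-regret of a sequence of arm pulls, given the true arm means."""
    best_mean = max(arm_means)
    # Each pull loses (best mean - mean of the arm actually pulled)
    return sum(best_mean - arm_means[a] for a in chosen_arms)

arm_means = [0.2, 0.5, 0.8]                     # hypothetical expected reward per arm
print(pseudo_regret(arm_means, [2, 2, 2, 2]))   # always the best arm: no regret
print(pseudo_regret(arm_means, [0, 1, 2, 2]))   # two exploratory pulls cost about 0.9
```

Note that the true means are needed to compute it, so this is an analysis tool rather than something the agent can observe while learning.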

#### 1.3 **[1pt]** Explain in English the importance of exploration and exploitation in MAB.

The goal of exploration is to find the best arm (or action). It is done by choosing seemingly suboptimal actions (or arms) in order to find better ones in the future. The goal of exploitation is to use the best known action (arm) in order to maximise the reward. This is done by the agent choosing greedily. Their trade-off is at the very heart of Reinforcement Learning.
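A minimal sketch of this trade-off, assuming Bernoulli arms with made-up means and an epsilon-greedy agent that keeps a running mean per arm (the helper `run_bandit` is hypothetical, not lecture code):

```python
import random

def run_bandit(arm_means, epsilon, n_steps, seed=0):
    """Epsilon-greedy on a Bernoulli bandit; returns the pull count per arm."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms         # how often each arm was pulled
    estimates = [0.0] * n_arms    # running mean reward per arm
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts

arm_means = [0.2, 0.5, 0.8]   # hypothetical: the last arm is the best one
print(run_bandit(arm_means, epsilon=0.0, n_steps=1000))   # pure exploitation
print(run_bandit(arm_means, epsilon=0.1, n_steps=1000))   # some exploration
```

With `epsilon=0.0` the agent never leaves the first arm in this setup: its estimate is never negative, the others stay at zero, and ties resolve to the first index. A little exploration is what lets it discover the better arms.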

#### 1.4 **[1pt]** Explain what would happen if you had a discount factor $\gamma \gt 1$ in the Ice Maze example from the lecture.

Future rewards will have more value than immediate ones. Since there are no time constraints, the agent will learn not to reach termination and will try to "wait", aiming to maximise the reward.
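A small worked example, assuming the only reward is +1 on reaching the goal and that a reward received after $t$ steps is weighted by $\gamma^t$ (the lecture may index the discount slightly differently). The step counts are made up.

```python
def discounted_goal_reward(gamma, steps_to_goal, reward=1.0):
    """Discounted value of a single reward received after `steps_to_goal` steps."""
    return (gamma ** steps_to_goal) * reward

for gamma in (0.9, 1.1):
    fast = discounted_goal_reward(gamma, 6)    # going straight to the goal
    slow = discounted_goal_reward(gamma, 20)   # wandering around first
    print(gamma, round(fast, 3), round(slow, 3))
```

With $\gamma = 0.9$ the fast route is worth more; with $\gamma = 1.1$ the slow route is worth more, and delaying further only increases the value.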

#### 1.5 **[2pt]** The Bellman Equation is at the heart of the classical Reinforcement Learning framework. Write it below in Latex, then explain each of the terms as if you were reading it in English.

$\nu_\pi(s)= R(s,\pi(s))+\gamma \sum_{s' \in S} P^{\pi(s)}_{s,s'} \nu_\pi(s')$

The expected future reward of the agent following policy $\pi$, starting from state $s$, is the reward obtained by executing the action $\pi(s)$ chosen by the policy in state $s$, plus the discounted sum, over each state $s'$, of the probability of reaching $s'$ after executing $\pi(s)$ in state $s$, times the expected future reward collected by the agent following policy $\pi$ starting from $s'$.
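To make the terms concrete, here is a minimal sketch that applies the equation repeatedly as an update rule (iterative policy evaluation) on a made-up three-state chain. `R[s]` stands for $R(s,\pi(s))$ and `P[s][s2]` for $P^{\pi(s)}_{s,s'}$; the MDP and the function name `evaluate_policy` are assumptions for illustration only.

```python
def evaluate_policy(R, P, gamma, n_sweeps=200):
    """Iterate v(s) <- R[s] + gamma * sum_s' P[s][s'] * v(s') for a fixed policy."""
    n_states = len(R)
    v = [0.0] * n_states
    for _ in range(n_sweeps):
        v = [R[s] + gamma * sum(P[s][s2] * v[s2] for s2 in range(n_states))
             for s in range(n_states)]
    return v

# Hypothetical chain under a fixed policy: 0 -> 1 -> 2, state 2 is absorbing
R = [0.0, 1.0, 0.0]        # reward for the action the policy takes in each state
P = [[0.0, 1.0, 0.0],      # P[s][s2]: probability of moving from s to s2
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]
print(evaluate_policy(R, P, gamma=0.9))   # approximately [0.9, 1.0, 0.0]
```

State 1 collects the reward directly, while state 0 only sees it one step later, discounted by $\gamma$.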

# 2. Q-Learning from scratch

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import math
import gym
sns.set(rc={'figure.figsize':(8,6)}, style="whitegrid")

Time to get coding!
- The [OpenAI Gym](https://gym.openai.com/) maintains a broad set of Reinforcement Learning benchmarks. It is ready to import on Colab, or you can add it to your local installation using `pipenv install gym`.
- You can find the Ice Maze (named Frozen Lake) [here](https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py).
- The Classic Control games are also relatively easy to solve, and much nicer to render. Feel free to give it a try. You can save a video with `env = gym.wrappers.Monitor(env, 'video', force = True)` right after the `gym.make()` call. More on this topic [here](https://hub.packtpub.com/openai-gym-environments-wrappers-and-monitors-tutorial/).
- To get you started, here is a working example of a random policy on the framework, which I took from their front page and customized.

#### 2.1 **[2pt]** Write a Python function `choose_action` that takes as arguments `Q`, `state`, `epsilon`, `actions` and returns an action chosen according to Q-values using an epsilon-greedy policy.

- `Q` is a dictionary that has states as keys. Each value is a numpy array with the Q-values corresponding to each action from the state used as key.
- You cannot initialize `Q` with all possible states, because they can be infinite or unreachable. Using `Q[state]` with missing key `state` will throw an error; use instead `Q.get(state)` then check if its return `is None`.
- `state` can be simply a number.
- `actions` is a list of possible actions between which to choose.
- Here is some code I used for testing:
```python
Q = {1:[0,0,1], 2:[0,1,0]}
for _ in range(50):
    print(choose_action(Q, 1, 0.0, []), end='') #=> 2
print()
for _ in range(50):
    print(choose_action(Q, 2, 0.0, []), end='') #=> 1
print()
for _ in range(50):
    print(choose_action(Q, 2, 1.0, range(3)), end='') #=> random action
```

In [11]:
def choose_action(Q, state, epsilon, actions):
  q_vals = Q.get(state)
  unexplored_state = q_vals is None
  epsilon_chance = np.random.uniform(0.0,1.0) < epsilon

  # Explore: pick a random action if the state is unseen, or with probability epsilon
  if unexplored_state or epsilon_chance:
    action = np.random.choice(actions)
  # Exploit: pick the action with the highest Q-value
  else:
    action = np.argmax(q_vals)

  return action

Q = {1:[0,0,1], 2:[0,1,0]}
for _ in range(2):
    print(choose_action(Q, 1, 0.0, []), end='') #=> 2
print()
for _ in range(50):
    print(choose_action(Q, 2, 0.0, []), end='') #=> 1
print()
for _ in range(50):
    print(choose_action(Q, 2, 1.0, range(3)), end='') #=> random action


22
11111111111111111111111111111111111111111111111111
02112211201122122222022022111202221001021121211100

#### 2.2 **[2pt]** Write a Python function `update_Q` that takes as arguments `Q`, `state`, `action`, `next_state`, `reward`, `num_actions`, `alpha` and `gamma`, and updates the `Q` dictionary according to Q-Learning.

- Here you will need to initialize `Q[state]` if previously unexplored. That's what `num_actions` is for.
- The future expected reward is the highest Q-value from the next state, or zero if unexplored.
- I'll leave the testing to you. You may feel like just running the next question for testing (called _integration testing_ ), but testing it in isolation ( _unit testing_ ) is a much more controlled (i.e. _easier_ ) verification which should not be skipped.
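For reference, a common way to write the tabular Q-Learning update, assuming a learning rate $\alpha$ and a discount factor $\gamma$ (symbols the question text does not fix), is:

$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') \right)$

As a hand check with made-up numbers: with $Q(s,a)=0$, $r=1$, $\max_{a'} Q(s',a')=1$, $\alpha=0.6$ and $\gamma=0.3$, the new value is $0.4 \cdot 0 + 0.6 \cdot (1 + 0.3 \cdot 1) = 0.78$.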

In [12]:
def update_Q(Q, state, action, next_state, reward, num_actions, alpha, gamma):
  if Q.get(state) is None:
    Q[state] = np.zeros(num_actions)
  
  if Q.get(next_state) is None:
    next_q = 0
  else:
    next_q = np.max(Q[next_state])

  q_update = reward + gamma * next_q
  Q[state][action] = (1-alpha) * Q[state][action] + alpha * q_update


Q[3] = [0,0,0]
update_Q(Q, state=3, action=2, next_state=2, reward=1, num_actions=3, alpha=0.6, gamma=0.3)

print(Q[3])


[0, 0, 0.78]


#### 2.3 **[1pt]** Run the code below to successfully use Q-Learning to solve the OpenAI Gym `FrozenLake` environment.

- If you got both previous answers right, here is a free extra point for you :)
- The parameters should work as they are, but feel free to play with them and make sure by the last epochs the environment is solved more often than not.

In [13]:
# Initialization and settings
env = gym.make("FrozenLake-v0", is_slippery=False)
num_episodes = 1000
max_nsteps = 30
gamma = 0.9
alpha = 0.6
epsilon = 0.3

num_actions = env.action_space.n
Q = {} # hashing states to per-action reward arrays

# Loop for each episode
for ith_episode in range(num_episodes):

    # Reset the environment (and obtain first observation)
    state = env.reset()
    total_reward = 0

    # For each timestep
    for t in range(max_nsteps):
        
        # Choose action according to Q-values using epsilon-greedy policy
        # THIS SHOULD USE YOUR IMPLEMENTATION
        action = choose_action(Q, state, epsilon, range(num_actions))
        
        # Execute action: get reward, move env to next state
        next_state, reward, done, info = env.step(action)
        # Add negative reward for falling in hole
        if done and reward == 0: reward = -1

        # Some useful printing to verify progression - (W)in or (l)ose
        if reward == -1: print('l', end='', flush=True)
        if reward == 1: print('W', end='', flush=True)

        # Accumulate reward. Note this environment only rewards on termination though.
        total_reward += reward
        
        # Update Q function for current state and action
        # THIS SHOULD USE YOUR IMPLEMENTATION
        update_Q(Q, state, action, next_state, reward, num_actions, alpha, gamma)
        
        # Update internal state
        state = next_state
        
        # Terminate if episode ended
        if done: break

llllllllllllllllllllllllllllllllllllllllllllllllllllWlllllllllllllllWllllllllllllllllllllllllllllllllllllllllllllllllWllWWlWlWWWlWlWllWWlWlWWWWlWlWlWWllWWWWlWWWWlWWWWWlWllWlWWWWWWlWWWWlWWlWlWWWlWWWlWlWlWWWlllWllWWWlWllWllWWllWWWlWWWWWWWlllllWWWWWllllWWlWllllWWWlWWlWWWWlWWlWlWWWWlWWWlWWWWWWlWllllWlllWlWWWlWllWWlWWWllWWWWWlWlWlWlWWlWWlWWWWlWWWllWlWllWWWWWWWlWWWWWWWllWlWWlWWlWlWllWlWWlWWWlllWWWllllWllWWWllWllWWWWWWlWllllWlWWWWWWlWWWWllWWlllWlWlWlWWlWlWlWWlWlWllWWlWWWWlWllWWlWWWWWWWlWWllWWlWWWWWllWlWWWWWWWWWllllWlWWlWWlWWWWWllWWWWWWWWWWWlllllWWlWWWllWWWlWWWWWWWWWllWWlllWWWllWlWlWWWWWlllWWllllWWlWWWllWlWllllWWWWlWWllWlWlWlWWlWWWlWWWllWWWWlWWWWlWWWWWWlWlWllWWWlWllWlWWlWWWWlWWllWWWWWlWlWWWWlWWWWlWWWllWWWlWWWlWWWWlWWWWllllWlWWlWlWWlWWWllWllWWWWlWllWWWllWllWWWWWlWWWWWlWlWllWlWWWlWllWWWlWWlWWWWWWlWWWWWWlWWWWWWWlWlWWWlWWlWllWWWlWWWWWlWWWWWWWlWWWWWllWlWWlWWlWWllWWWWlllWllllWlWllWlWlWWWWWWlWlWWWWWWWlWWWWWWWlWllWlWlWWll

#### 2.4 **[1pt]** Write a Python code snippet that satisfies the following requirements: (i) runs a fully-greedy policy, (ii) using the learned Q-values, (iii) rendering the environment in each step, and (iv) in less than 10 lines of code.

- You can extract what you need from the code above -- if you know where to look and understand what you need.

In [14]:
state = env.reset()

for t in range(max_nsteps):
  env.render()
  action = np.argmax(Q[state])
  state, reward, done, info = env.step(action)
  if done:
    env.render()
    break


[S]FFF
FHFH
FFFH
HFFG
  (Down)
SFFF
[F]HFH
FFFH
HFFG
  (Down)
SFFF
FHFH
[F]FFH
HFFG
  (Right)
SFFF
FHFH
F[F]FH
HFFG
  (Down)
SFFF
FHFH
FFFH
H[F]FG
  (Right)
SFFF
FHFH
FFFH
HF[F]G
  (Right)
SFFF
FHFH
FFFH
HFF[G]


#### 2.5 **[1pt]** Modify the code from question 2.3 to record the cumulative reward and number of time-steps for each episode. Then plot them with two line plots using Seaborn.

- Think carefully: you need to augment the code above, by first initializing a proper data structure, then recording the statistics at the end of each episode. Finally plot them below.
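The recording pattern described above can be sketched as follows. This is only an outline of the bookkeeping: `run_episode` is a hypothetical stand-in that returns random numbers instead of running the real time-step loop from question 2.3.

```python
import random

def run_episode(rng, max_nsteps=30):
    """Stand-in for one episode of the 2.3 loop (hypothetical, random outcome).

    Returns (total_reward, n_steps), the two statistics to record.
    """
    n_steps = rng.randint(1, max_nsteps)
    total_reward = rng.choice([-1, 1])
    return total_reward, n_steps

rng = random.Random(0)
# Initialize the data structure once, before the episode loop
stats = {"episode": [], "total_reward": [], "n_steps": []}
for ith_episode in range(100):
    total_reward, n_steps = run_episode(rng)
    # Record once per episode, after the inner time-step loop has finished
    stats["episode"].append(ith_episode)
    stats["total_reward"].append(total_reward)
    stats["n_steps"].append(n_steps)

print(len(stats["episode"]), stats["total_reward"][:5], stats["n_steps"][:5])
```

In the real loop, `n_steps` would be `t + 1` at the point where the episode ends. A dictionary of equal-length lists like this can be turned into a table with `pd.DataFrame(stats)` and plotted with `sns.lineplot(data=df, x="episode", y="total_reward")`, and likewise for `n_steps`.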

# 3. DQN

- DQN basically implements Q-Learning but uses a (deep) neural network to learn the Q-values, i.e. the expected rewards (with a few tricks). The main advantage versus the dictionary-based approach above is *generalization*: the network can output an expected reward for all actions in any state, even one that was not explored yet.
- If this code takes too long to execute, you may want to use [TPUs with Colab](https://www.tensorflow.org/guide/tpu)
  - Though no idea if the `env.render()` will work, comment the line if it gives problems
  - You can always save a video instead in your Google Drive
- This code was originally taken from [this DQN tutorial](https://towardsdatascience.com/reinforcement-learning-w-keras-openai-dqns-1eed3a5338c) (with few edits). Study it to answer the following questions.
- Learn more about the CartPole [here](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py).
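One of the "few tricks" is a second, slowly changing target network. The `target_train` method in the implementation below blends the two networks' weights using `tau`. A minimal sketch of that blending on plain Python lists follows; real Keras weights are lists of arrays, and `soft_update` is a hypothetical helper name.

```python
def soft_update(online_weights, target_weights, tau):
    """Move each target weight a fraction `tau` of the way toward the online weight."""
    return [tau * w + (1 - tau) * tw
            for w, tw in zip(online_weights, target_weights)]

online = [1.0, 2.0]    # hypothetical weights of the network being trained
target = [0.0, 0.0]    # the target network starts somewhere else
for step in range(3):
    target = soft_update(online, target, tau=0.125)
    print(step, target)
```

With a small `tau` the target network lags behind the trained one, which keeps the regression targets used in `replay` from shifting too quickly.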

In [16]:
## IMPLEMENTATION ##

import gym
import numpy as np
import random

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

# Do you know what a deque is? https://en.wikipedia.org/wiki/Deque
# Though here it's used just to cap its number of elements
from collections import deque

class DQN:
    def __init__(self, env):
        self.env     = env
        self.memory  = deque(maxlen=2000) # after 2000 will delete oldest record(s)

        self.gamma = 0.85
        self.epsilon = 1.0
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.01
        self.tau = .125
        self.batch_size = 64 # try different batch sizes
        
        self.model        = self.create_model()
        self.target_model = self.create_model()

    def create_model(self):
        model = Sequential()
        state_shape  = self.env.observation_space.shape
        model.add(Dense(24, input_dim=state_shape[0], activation="relu")) #bugfix
        model.add(Dense(48, activation="relu"))
        model.add(Dense(24, activation="relu"))
        model.add(Dense(self.env.action_space.n))
        model.compile(loss="mean_squared_error",
            optimizer=Adam(lr=self.learning_rate))
        return model

    def act(self, state):
        self.epsilon *= self.epsilon_decay
        self.epsilon = max(self.epsilon_min, self.epsilon)
        if np.random.random() < self.epsilon: #bugfix
            return self.env.action_space.sample()
        return np.argmax(self.model.predict(state).flatten())

    def remember(self, state, action, reward, new_state, done):
        self.memory.append([state, action, reward, new_state, done])

    def replay(self):
        # This trick records past experience and replays it to accelerate training
        if len(self.memory) < self.batch_size: return
        samples = random.sample(self.memory, self.batch_size)
        for sample in samples:
            state, action, reward, new_state, done = sample
            target = self.target_model.predict(state)
            if done:
                target[0][action] = reward
            else:
                Q_future = max(self.target_model.predict(new_state).flatten())
                target[0][action] = reward + Q_future * self.gamma
            self.model.fit(state, target, epochs=1, verbose=0)

    # This part can be confusing: can you explain what is happening line by line?
    def target_train(self):
        weights = self.model.get_weights()
        target_weights = self.target_model.get_weights()
        for i in range(len(target_weights)):
            target_weights[i] = weights[i] * self.tau + target_weights[i] * (1 - self.tau)
        self.target_model.set_weights(target_weights)        

Using TensorFlow backend.


In [17]:
## MAIN ##

try: env.close() # your kernel can crash if you don't close the env properly
except NameError: pass
env     = gym.make("CartPole-v1")
gamma   = 0.9
# Sometimes you will find code that uses (1-epsilon) for epsilon
# Other times you will find a _decaying_ epsilon, lowering over time
epsilon = .5 # it was .9 originally, play with it and understand its role

ntrials   = 20 # I can see the beginning of learning with 20, feel free to raise this
max_nstep = 200 # CartPole-v1 caps episodes at 500 steps (CartPole-v0 at 200); you can try shorter

dqn_agent = DQN(env=env)
for trial in range(ntrials):
    print(f"Trial {trial+1}:")
    cur_state = env.reset().reshape(1,-1) # never forget to reset the env
    tot_reward = 0
    for step in range(max_nstep):
        # print('.', end='', flush=True)
        action = dqn_agent.act(cur_state)
        print(action, end='', flush=True)
        
        new_state, reward, done, _ = env.step(action)
        if done: reward = -100 # It's much easier to learn if you punish failing
        tot_reward += reward

        new_state = new_state.reshape(1,-1)
        dqn_agent.remember(cur_state, action, reward, new_state, done)

        dqn_agent.replay()       # Internally iterates default (prediction) model
        dqn_agent.target_train() # What does this do?

        cur_state = new_state #bugfix
        if done: break
    print(f" Reward: {tot_reward}")

env.close()
# Here is how to save trained Keras models
# dqn_agent.model.save("trained.model")

Trial 1:
011111001100111 Reward: -86.0
Trial 2:
1000011011011010110000011011100111 Reward: -67.0
Trial 3:
0101011111110010110 Reward: -82.0
Trial 4:
0001110000100000 Reward: -85.0
Trial 5:
01010110011010000101010 Reward: -78.0
Trial 6:
0010011101010100110101110001 Reward: -73.0
Trial 7:
00100100111101001011011001 Reward: -75.0
Trial 8:
001010100101100110001 Reward: -80.0
Trial 9:
000110001110011 Reward: -86.0
Trial 10:
011110101000101000011000011110011110010100011001101000110010 Reward: -41.0
Trial 11:
111001100110000 Reward: -86.0
Trial 12:
01100110111110011001000 Reward: -78.0
Trial 13:
10111111000000 Reward: -87.0
Trial 14:
11111100100 Reward: -90.0
Trial 15:
110111111011 Reward: -89.0
Trial 16:
01101111010100 Reward: -87.0
Trial 17:
101010100111100000101010110001110000111010100100000101100110100001001011101101111001 Reward: -17.0
Trial 18:
0110000010111111 Reward: -85.0
Trial 19:
0000001011111 Reward: -88.0
Trial 20:
1111100000000010110001111010100011100000010100001111101011 Reward

#### 3.1 **[3pt]** There are three bugs in the implementation. Fix them until it runs successfully.

- One is in `## IMPLEMENTATION ##`, easy to find because it stops it from running.
- One is harder to find and messes with the policy. It's also in `## IMPLEMENTATION ##` but there's a hint in `## MAIN ##`.
- One is a missing line in `## MAIN ##`: something there makes no sense.
- Of course you can use the original implementation. But you will learn less and cannot do it at the exam, so give it a fair try first, because it's a fair exercise.

#### 3.2 **[1pt]** Compare the shape of the model used with that of an autoencoder. Give one reason as to why they do or do not look alike.

An autoencoder has an hourglass shape, while this model has a diamond shape. The autoencoder narrows in the middle to force a compact encoding, while this model widens in the middle to provide more second-level features to support the decision-making layers.
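The diamond shape can be seen by listing the layer widths from `create_model`, with 4 inputs and 2 outputs for CartPole. The snippet is a small sketch that also counts the trainable parameters, assuming standard fully connected layers with one bias per unit.

```python
layer_sizes = [4, 24, 48, 24, 2]   # input, three hidden layers, output

# A Dense layer has (inputs * units) weights plus one bias per unit
params = [n_in * n_out + n_out
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(params, sum(params))   # [120, 1200, 1176, 50] 2546
```

Most parameters sit around the wide middle layer, the opposite of an autoencoder's bottleneck.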

#### 3.3 **[1pt]** How many input and output neurons does the model have? Write the code that answers the question.

In [18]:
print(env.observation_space.shape)
print(env.action_space.n)

(4,)
2


#### 3.4 **[3pt]** Write Python code to (i) create a new DQN agent, (ii) load a trained model into the agent's `model`, (iii) change the agent's epsilon-related parameters to enforce a greedy policy, and (iv) run the greedy policy rendering each frame.

- Remember to instantiate the environment first, and to close it at the end, or you may have problems.
- Loading the model is easy -- if you saved it to a file in the `main` cell.
- To enforce the greedy policy you need to set three agent variables, to sensible values.
- You saw a similar evaluation loop when implementing Q-Learning. Just double check the differences with the DQN implementation to make sure it runs. You need to use `env.render()` here (if you're local; if you work on Colab just write a comment about it or save a video).
- You can use `time.sleep()` to slow down the loop if the rendering is too fast. Remember also that the env's rendering window will close when you close the environment, so a pause there can also help.
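To see why all three epsilon-related variables matter, here is a small sketch of the decay-then-clip rule that `act` applies on every call, isolated from the agent. The helper name `epsilon_after` is made up for the example.

```python
def epsilon_after(n_calls, epsilon, epsilon_min, epsilon_decay):
    """Epsilon after `n_calls` applications of the decay-then-clip rule in `act`."""
    for _ in range(n_calls):
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return epsilon

# Training defaults from the DQN class: starts at 1.0 and decays toward 0.01
print(epsilon_after(100, 1.0, 0.01, 0.995))
# Only zeroing epsilon is not enough: the floor pulls it back up
print(epsilon_after(1, 0.0, 0.01, 0.995))    # 0.01
# All three set for evaluation: epsilon stays at 0, so `act` is always greedy
print(epsilon_after(100, 0.0, 0.0, 1.0))     # 0.0
```

Since `act` explores only when a random number in [0, 1) is below epsilon, an epsilon fixed at 0 means the random branch is never taken.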

In [0]:
from time import sleep
import gym
from keras.models import load_model

env = gym.make("CartPole-v1")

dqn_agent = DQN(env=env)
dqn_agent.model = load_model("trained.model")

#greedy
dqn_agent.epsilon = 0.0
dqn_agent.epsilon_min = 0.0
dqn_agent.epsilon_decay = 1.0

max_nstep = 500
state = env.reset().reshape(1,-1)

for step in range(max_nstep):
  env.render()
  sleep(0.1)
  action = dqn_agent.act(state)
  print(action, end='', flush=True)
  state, reward, done, _ = env.step(action)
  state = state.reshape(1,-1)
  if done: break

sleep(1)
env.close()




# At the end of the exercise

Bonus question with no points! Answering this will have no influence on your scoring, not at the assignment and not towards the exam score -- really feel free to ignore it with no consequence. But solving it will reward you with skills that will make the next lectures easier, give you real applications, and will be good practice towards the exam.

The solutions for these questions will not be included in the regular lab solutions pdf, but you are welcome to open a discussion on the Moodle: we will support you in addressing them, and you may meet other students who choose to solve them, and find a teammate for the next assignment who is willing to do things for fun and not only for score :)

#### BONUS **[ZERO pt]** Solve another of the [OpenAI Gym environments](https://gym.openai.com/envs/#classic_control) by copying and modifying the above DQN implementation. This is very useful to learn to use different environments, and to see first-hand the limits of DRL.

#### BONUS **[ZERO pt]** Augment the DQN implementation to track episode size and cumulative reward and plot them with lineplots.

### Final considerations

- Reinforcement Learning is both the name of the learning paradigm and of the framework classically used to address such problems.
- The amount of research currently dedicated to solving RL problems by improving both DL and the RL framework is staggering. Results still keep coming, but at a much slower pace than in purer SL applications (despite orders of magnitude more investment).
- Next week we will explore the limitations of the RL framework, and beyond.