# Hands-on Lesson 9 - Q-learning
The goal of this hands-on lesson is to implement and play with the Q-learning algorithm.

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

## Q-learning algorithm
The aim of the Q-learning algorithm is to approximate the Q-function, also called the action-value function, of the optimal policy $\pi_*$:

$$q_{\pi_*}(s,a) = \mathbb{E}_{\pi_*} \left[ G_t | S_t = s, A_t=a \right]$$

where $G_t$ is the return, defined as $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, and $\gamma$ is the discount factor in the interval $[0, 1]$.
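
As a quick illustration of this definition, the sketch below computes the return of a finite reward sequence. The helper name `discounted_return` and the reward values are made up for the example; they are not part of the lesson's code.

```python
def discounted_return(rewards, gamma):
    """Return R_1 + gamma * R_2 + gamma^2 * R_3 + ... for a finite reward list."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode: three steps costing -1 each, then a final reward of 0
rewards = [-1, -1, -1, 0]
g = discounted_return(rewards, gamma=0.9)
print(g)  # -1 - 0.9 - 0.81 + 0, i.e. about -2.71
```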

Q-learning is an off-policy algorithm that learns the Q-function with the following update rule:

$$Q(S,A) \leftarrow Q(S,A)  + \alpha \left(R + \gamma \max_{a'}Q(S',a')- Q(S,A) \right)$$

where $\alpha$ is the learning rate, and $(S,A,R,S')$ is a transition: the current state, the action taken, the reward received, and the next state.
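
To make the update concrete, here is a single step worked out on a toy table. The table size, its values and the transition are invented for the illustration.

```python
import numpy as np

alpha, gamma = 0.1, 0.9

# Toy Q-table: 3 states, 2 actions (values are made up)
Q = np.zeros((3, 2))
Q[1] = [2.0, 5.0]   # pretend state 1 already has some estimates

# One observed transition (S, A, R, S')
s, a, r, s_next = 0, 1, -1, 1

target = r + gamma * np.max(Q[s_next])   # -1 + 0.9 * 5 = 3.5
Q[s, a] += alpha * (target - Q[s, a])    # 0 + 0.1 * (3.5 - 0) = 0.35
print(Q[s, a])
```

Only the entry for the visited pair $(S, A)$ changes; the maximum over $a'$ is taken regardless of which action the agent actually takes next, which is what makes the method off-policy.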

## The cliff walk problem

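The cliff walk problem consists of a grid on which a player moves horizontally or vertically, one cell at a time. The player starts at the bottom-left corner and must reach the bottom-right corner. All cells between these two along the bottom row form the cliff; the rest of the grid is "safe". Moving onto a safe cell yields a reward of -1, stepping onto the cliff yields -10 and ends the episode, and reaching the goal yields 0 and ends the episode.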

We define an environment class **CliffWalk**, with three main methods:
- **render()** that draws the landscape and the position of the agent
- **reset()** that places the agent at the starting position and returns the corresponding state
- **step(action)** that returns the new state of the system, the reward, and a boolean flag indicating whether the episode is over

In [None]:
class CliffWalk:
        
    def __init__(self, Nhoriz=8, Nvert=4):
        self.Nhoriz = Nhoriz
        self.Nvert = Nvert
        self.Nstates = Nhoriz * Nvert
        self.Nactions = 4
        self.player = None
        self._create_grid() 
        print(self.grid)
        self.render()
        
    def _state(self):
        ''' Maps the player's (row, column) position to a unique integer state index '''
        return self.player[0] * self.Nhoriz + self.player[1]

    def _create_grid(self):
        self.grid = - 1 * np.ones((self.Nvert, self.Nhoriz), dtype=int)
        self.grid[-1,-1] = 0
        self.grid[-1,1:-1] = -10
        
    def render(self):
        fig, ax = plt.subplots(num=1)
        im = ax.imshow(self.grid)
        ax.set_xticks(np.arange(self.Nhoriz))
        ax.set_yticks(np.arange(self.Nvert))
        if self.player is not None:
            ax.plot(self.player[1], self.player[0] ,'.r')
        plt.show()

    def reset(self):
        self.player = (self.Nvert-1,0)
        return self._state()
    
    def step(self, action):
        # Possible actions
        if action == 0 and self.player[0] > 0: # UP
            self.player = (self.player[0] - 1, self.player[1])
        if action == 1 and self.player[0] < self.Nvert-1: # DOWN
            self.player = (self.player[0] + 1, self.player[1])
        if action == 2 and self.player[1] < self.Nhoriz-1: # RIGHT
            self.player = (self.player[0], self.player[1] + 1)
        if action == 3 and self.player[1] > 0:  # LEFT
            self.player = (self.player[0], self.player[1] - 1)
            
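        # Reward: -1 on safe cells, -10 on the cliff, 0 at the goal
        reward = self.grid[self.player]
        # The episode ends when the agent steps onto the cliff or reaches the goal
        done = bool(reward != -1)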
            
        return self._state(), reward, done

First, familiarize yourself with the environment.

In [None]:
env = CliffWalk()

Now let's add an agent.

In [None]:
state = env.reset()
env.render()

And let's make it move randomly.

In [None]:
state = env.reset()
for _ in range(10):
    action = np.random.randint(env.Nactions)
    next_state, reward, done = env.step(action)
    print(state, action, reward, next_state, done)
    state = next_state
    env.render()
    # Stop as soon as the agent falls off the cliff or reaches the goal
    if done:
        break

## Q-learning

Let's implement the Q-learning algorithm. 

An important aspect is to choose how the policy will explore the state space. Below, we implement an $e$-greedy policy: with probability $e$, a random action is chosen; with probability $(1-e)$, the greedy action is chosen (the one that maximizes the current estimate $Q(s,a)$ over all possible actions $a$).

In [None]:
# LEARNING PARAMETERS
Nepisodes = 1000
exploration_rate = 0.1  # e
learning_rate = 0.1     # alpha
discount_factor = 0.9   # gamma 
   
# INITIALISATION    
env = CliffWalk()
q_values = np.zeros((env.Nstates, env.Nactions))

# Loop for episodes
for _ in range(Nepisodes):
    state = env.reset()    
    done = False
    
    # Loop for steps within an episode
    while not done:            
        # Choose action (e-greedy policy)  
        if np.random.random() > exploration_rate:
            action = np.argmax(q_values[state])
        else:
            action = np.random.choice(env.Nactions)
        # Do the action
        next_state, reward, done = env.step(action)
        # Update q_values       
        target = reward + discount_factor * np.max(q_values[next_state])
        error = target - q_values[state][action]
        q_values[state][action] += learning_rate * error
        # Update state
        state = next_state

print(q_values)

Once the Q-function is learned, we can observe the resulting greedy policy, which should approach the optimal one:

In [None]:
max_steps = 15
state = env.reset()    
done = False
t = 0

# Loop for steps within an episode
while not done and t < max_steps:
    t += 1
    # Choose action (greedy policy)  
    action = np.argmax(q_values[state])
    # Do the action
    next_state, reward, done = env.step(action)
    print('SARS:', state, action, reward, next_state)
    # Update state
    state = next_state
    # plot
    env.render()

**EXERCISE:** To assess the quality of the learned control, plot the return of each episode as training advances.
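
One possible starting point, not the reference solution, is to wrap the training loop in a function that also records the discounted return of each episode. The function name `q_learning_with_returns` and its parameter names are hypothetical; `env` is assumed to expose `reset()`, `step(action)`, `Nstates` and `Nactions` like the **CliffWalk** class.

```python
import numpy as np

def q_learning_with_returns(env, n_episodes=500, epsilon=0.1, alpha=0.1,
                            gamma=0.9, seed=0):
    """Tabular Q-learning that also records the discounted return per episode."""
    rng = np.random.default_rng(seed)
    q = np.zeros((env.Nstates, env.Nactions))
    returns = []
    for _ in range(n_episodes):
        state, done = env.reset(), False
        g, discount = 0.0, 1.0
        while not done:
            # e-greedy action selection
            if rng.random() > epsilon:
                action = int(np.argmax(q[state]))
            else:
                action = int(rng.integers(env.Nactions))
            next_state, reward, done = env.step(action)
            # Q-learning update
            target = reward + gamma * np.max(q[next_state])
            q[state, action] += alpha * (target - q[state, action])
            # Accumulate the return G = R_1 + gamma * R_2 + ...
            g += discount * reward
            discount *= gamma
            state = next_state
        returns.append(g)
    return q, np.array(returns)
```

Calling `q, returns = q_learning_with_returns(env)` and then `plt.plot(returns)` would give the learning curve. Because single episodes are noisy, a running average over a few tens of episodes is usually easier to read.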

**EXERCISE:** We now want to explore alternative policies. One of them is Boltzmann (or soft-max) exploration: in state $s$, action $a$ is chosen with probability

$$p(a;s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'}\exp(Q(s,a')/\tau) } $$

where $\tau$ is a "temperature". 

Identify the role of $\tau$ in terms of exploration-exploitation. Implement this policy and compare its performance with an $e$-greedy policy.
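
A minimal sketch of the sampling step, assuming `q_row` holds the Q-values of the current state, is given below. The name `boltzmann_action` and the example values are made up; subtracting the maximum before exponentiating is a standard numerical precaution that leaves the probabilities unchanged.

```python
import numpy as np

def boltzmann_action(q_row, tau, rng):
    """Sample an action with probability proportional to exp(Q(s,a) / tau)."""
    # Shift by the max so the largest exponent is 0 (avoids overflow for small tau)
    logits = (q_row - np.max(q_row)) / tau
    probs = np.exp(logits)
    probs /= probs.sum()
    action = int(rng.choice(len(q_row), p=probs))
    return action, probs

rng = np.random.default_rng(0)
q_row = np.array([-1.0, -2.0, -1.5, -3.0])   # hypothetical Q-values for one state

_, p_low = boltzmann_action(q_row, tau=0.05, rng=rng)
_, p_high = boltzmann_action(q_row, tau=50.0, rng=rng)
print(p_low)
print(p_high)
```

Comparing the two printed probability vectors is a good way to start answering the question about $\tau$.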

**EXERCISE:** Usually the control is moved toward more exploitation / less exploration as time advances. This is done by changing the exploration rate (the $e$ in the $e$-greedy policy) or the temperature (for the Boltzmann policy) over time. Show that this helps to converge towards the optimal policy.
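
One common choice, shown here only as a sketch, is an exponential decay with a floor. The function name `decayed_rate` and the default values are assumptions for illustration; the same schedule could drive either the exploration rate or the temperature.

```python
def decayed_rate(episode, start=1.0, end=0.01, decay=0.995):
    """Exponentially decay from `start` towards the floor `end` as episodes go by."""
    return max(end, start * decay ** episode)

# Value of the schedule at a few episode indices
for episode in (0, 100, 500, 1000):
    print(episode, decayed_rate(episode))
```

Inside the training loop, one would recompute the rate at the start of each episode instead of using a fixed constant. The floor keeps a small amount of exploration alive; a linear or $1/t$ schedule would be an equally valid thing to try.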

## Adding some stochasticity 

Now the environment is no longer deterministic. We imagine that wind gusts blow towards the cliff: with probability `Pwind`, the agent is pushed downwards instead of performing the chosen action.

In [None]:
class CliffWalkStochastic:      
    def __init__(self, Nhoriz=8, Nvert=4, Pwind=0.3):
        self.Pwind = Pwind  # New line
        self.Nhoriz = Nhoriz
        self.Nvert = Nvert
        self.Nstates = Nhoriz * Nvert
        self.Nactions = 4
        self.player = None
        self._create_grid() 
        print(self.grid)
        self.render()
        
    def _state(self):
        ''' Maps the player's (row, column) position to a unique integer state index '''
        return self.player[0] * self.Nhoriz + self.player[1]

    def _create_grid(self):
        self.grid = - 1 * np.ones((self.Nvert, self.Nhoriz), dtype=int)
        self.grid[-1,-1] = 0
        self.grid[-1,1:-1] = -10
        
    def render(self):
        fig, ax = plt.subplots(num=1)
        im = ax.imshow(self.grid)
        ax.set_xticks(np.arange(self.Nhoriz))
        ax.set_yticks(np.arange(self.Nvert))
        if self.player is not None:
            ax.plot(self.player[1], self.player[0] ,'.r')
        plt.show()

    def reset(self):
        self.player = (self.Nvert-1,0)
        return self._state()   

    def step(self, action):
        # New lines: with probability Pwind, the wind overrides the chosen action with DOWN
        if np.random.random() < self.Pwind:
            action = 1
        # Possible actions
        if action == 0 and self.player[0] > 0: # UP
            self.player = (self.player[0] - 1, self.player[1])
        if action == 1 and self.player[0] < self.Nvert-1: # DOWN
            self.player = (self.player[0] + 1, self.player[1])
        if action == 2 and self.player[1] < self.Nhoriz-1: # RIGHT
            self.player = (self.player[0], self.player[1] + 1)
        if action == 3 and self.player[1] > 0:  # LEFT
            self.player = (self.player[0], self.player[1] - 1)
            
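        # Reward: -1 on safe cells, -10 on the cliff, 0 at the goal
        reward = self.grid[self.player]
        # The episode ends when the agent steps onto the cliff or reaches the goal
        done = bool(reward != -1)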
            
        return self._state(), reward, done

**EXERCISE:** Investigate how this stochastic wind affects the optimal policy.