### Reinforcement learning (part I)

Recall that reinforcement learning relies on the interaction between the learning algorithm (which we call the agent in this framework) and its environment: the agent takes actions, which affect the environment, and it receives rewards in return for those actions.

<img src="reinforcementImage.png" width="300" height="300">

Image source: [Reinforcement Learning Coach](https://nervanasystems.github.io/coach/)

### Part I: Greedy and $\varepsilon$-greedy approaches

#### Exercise I. Stationary approach

We will start by considering a simple k-armed bandit problem such as the one discussed in class. Here we take $k = 4$ and we take the reward of arm $i$ to follow a Gaussian distribution with mean $\mu_i$ and standard deviation $\sigma = 1$. 
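To make the setup concrete, here is a minimal sketch of the stationary case. It is not the reference solution: the arm means in `mu` are hypothetical values, the number of pulls is arbitrary, and we assume the value of each arm is estimated with the incremental sample-average rule $Q_{n+1} = Q_n + (R_n - Q_n)/n$. Only the standard library is used.

```python
import random

rng = random.Random(0)            # fixed seed so the run is reproducible

mu = [0.2, -0.5, 1.0, 0.4]        # hypothetical arm means, choose your own
sigma = 1.0                       # common standard deviation of the rewards
k = len(mu)

values = [0.0] * k                # running estimate of each arm's mean reward
counts = [0] * k                  # number of times each arm has been pulled

for _ in range(2000):
    action = rng.randrange(k)                  # pure exploration: uniform random arm
    reward = rng.gauss(mu[action], sigma)      # Gaussian reward of the chosen arm
    counts[action] += 1
    # incremental sample average: Q <- Q + (R - Q) / n
    values[action] += (reward - values[action]) / counts[action]

best_arm = max(range(k), key=lambda a: values[a])
print([round(v, 2) for v in values], best_arm)
```

With enough pulls the estimates in `values` approach the means in `mu`, so the arm with the largest estimate is the one a greedy agent would pick.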

In [None]:
mu1 = 0 # put your choice for the value 
# (the mean of the distribution). We assume the distribution of the rewards is Gaussian
mu2 = 0
mu3 = 0
mu4 = 0

sigma1 = 1
sigma2 = 1
sigma3 = 1
sigma4 = 1




maxIter = 100
iter = 0

while iter < maxIter:
    
    
    action = 0 # sample an action at random from 0 to 3
    
    reward = 0 # sample the reward according to the Gaussian distribution
    
    value  = 0 # update the value 
    
    
    iter +=1
    

#### Exercise I.2. Non-stationary version

Now code the non-stationary version of the k-armed bandit algorithm.
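One possible way to read "non-stationary" is sketched below, under two assumptions that are not stated in the exercise: the arm means drift over time through a small Gaussian random walk (parameter `drift`), and the value estimates use a constant step size `alpha` so that recent rewards count more than old ones. The helper name `constant_step_update` is ours.

```python
import random

def constant_step_update(q, reward, alpha):
    """Exponential recency-weighted average: Q <- Q + alpha * (R - Q)."""
    return q + alpha * (reward - q)

rng = random.Random(1)            # fixed seed so the run is reproducible

mu = [0.0, 0.0, 0.0, 0.0]         # hypothetical starting means
sigma = 1.0
drift = 0.01                      # assumed std of the random-walk step of each mean
alpha = 0.1                       # assumed constant step size
k = len(mu)
values = [0.0] * k

for _ in range(5000):
    # the environment changes: every mean takes a small random-walk step
    mu = [m + rng.gauss(0.0, drift) for m in mu]
    action = rng.randrange(k)                  # uniform random arm
    reward = rng.gauss(mu[action], sigma)
    values[action] = constant_step_update(values[action], reward, alpha)

print([round(m, 2) for m in mu])
print([round(v, 2) for v in values])
```

A sample average would weight all past rewards equally and lag behind the drifting means, which is why a constant step size is the usual choice in this setting.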

In [None]:
mu1 = 0 # put your choice for the value 
# (the mean of the distribution). We assume the distribution of the rewards is Gaussian
mu2 = 0
mu3 = 0
mu4 = 0

sigma1 = 1
sigma2 = 1
sigma3 = 1
sigma4 = 1




maxIter = 100
iter = 0

while iter < maxIter:
    
    
    action = 0 # sample an action at random from 0 to 3
    
    reward = 0 # sample the reward according to the Gaussian distribution
    
    value  = 0 # update the value 
    
    
    iter +=1
    

#### Exercise II. Escape room


In this exercise, we will tackle a simple reinforcement learning problem. Consider the map given below. There are 5 rooms plus the garden. We would like to train an agent to get out of the house as quickly as possible. To set up the environment, we will consider 6 possible states (the rooms in which the agent is located) and 6 possible actions (moving from one room to any other room). 

The Q-table can thus be encoded by a $6 \times 6$ matrix. We will consider three types of rewards. Impossible moves (for example 1 to 4) will be penalized by $1$. Possible moves will be associated with a reward of $0$. Finally, any move leading to an escape (e.g. 2 to 6) will be rewarded by $100$. 
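The three reward types can be stored in a $6 \times 6$ reward matrix, as in the sketch below. The door layout in `doors` is hypothetical: it only respects the two examples given in the text (1 to 4 is impossible, 2 to 6 is an escape), and it has to be replaced by the layout of the actual map. We also assume that "penalized by 1" means a reward of $-1$. Rooms 1 to 6 are stored at indices 0 to 5.

```python
# Hypothetical door layout (room -> rooms reachable in one move); read the real one off the map.
doors = {1: [2, 3], 2: [1, 6], 3: [1, 4], 4: [3, 5], 5: [4, 6], 6: [2, 5]}
GOAL = 6                                   # the garden
n_rooms = 6

# Start from "every move is impossible" (assumed reward -1) ...
R = [[-1] * n_rooms for _ in range(n_rooms)]

# ... then overwrite the possible moves: 0 in general, 100 when the move reaches the garden.
for room, neighbours in doors.items():
    for nxt in neighbours:
        R[room - 1][nxt - 1] = 100 if nxt == GOAL else 0

for row in R:
    print(row)
```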


Map
<img src="QLearningImage2.png" width="700" height="600">

#### Question II.1

As a first approach, we will just run a couple of pure exploration iterations. Fill out the loop below and run it a couple of times.

In [None]:
done = False 

while not done: 
    
    
    '''complete the exploration steps by sampling an action at random and updating the state of the environment
    until the variable done is set to True. Set this variable to True when the agent is able to escape the house'''
    
    
    
    
    
    

#### Question II.2

Now that you can solve the problem with pure exploration, we will start to exploit, and we will do that through the use of a $Q$-table. In this case, as indicated in the statement of the exercise, the Q-table is $6 \times 6$. Train the agent by alternating between exploitation and exploration. 

Since we want to update the $Q$-table, we will now add a line of the form 

$$Q[s, a] \leftarrow (1-\alpha)Q[s,a] + \alpha\left(R[s,a] + \gamma\max_{a'}Q[s',a']\right)$$

When exploring, we sample the action at random as in Question II.1. When exploiting, however, we simply choose the action that maximizes the entry in the $Q$-table for the current state. Hence we have $a^* = \underset{a}{\operatorname{argmax}}\, Q[s,a]$. 


Code this $\epsilon$-greedy approach below. You can start with $\epsilon = 0.8$. 
Take a sufficiently small learning rate (you can for example start with $\alpha = 0.5$) and a relatively large discount factor $\gamma = 0.9$. (You can later change those values to see how they affect the learning.)

Once you are done with the algorithm, try a couple of different values for $\epsilon$ and describe the evolution in the learning. 
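The two building blocks described above, the choice between exploration and exploitation and the update of the $Q$-table, can be sketched as two small helpers. The function names `choose_action` and `update_q` are ours, the reward is assumed to be stored in a matrix indexed by state and action, and plain nested lists stand in for the $6 \times 6$ tables. Rooms 1 to 6 are stored at indices 0 to 5.

```python
import random

def choose_action(Q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the Q-table."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[state]))               # exploration: random action
    row = Q[state]
    return max(range(len(row)), key=lambda a: row[a])     # exploitation: argmax over actions

def update_q(Q, R, state, action, next_state, alpha, gamma):
    """Q[s,a] <- (1 - alpha) Q[s,a] + alpha (R[s,a] + gamma max_a' Q[s',a'])."""
    target = R[state][action] + gamma * max(Q[next_state])
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * target

# One hand-checked update: moving from room 2 to the garden (room 6).
Q = [[0.0] * 6 for _ in range(6)]
R = [[0.0] * 6 for _ in range(6)]
R[1][5] = 100.0
update_q(Q, R, state=1, action=5, next_state=5, alpha=0.5, gamma=0.9)
print(Q[1][5])
```

Here the single update gives $(1-0.5)\cdot 0 + 0.5\,(100 + 0.9\cdot 0) = 50$, and a purely greedy agent in room 2 would now choose the move to the garden.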

In [None]:
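done = False 

NumEpisodes = 0 # put your choice for the number of training episodes

epsilon = 0 # exploration probability
gamma = 0 # discount factor
alpha = 0 # learning rate

for episode in range(NumEpisodes):

    done = False 
    
    while not done: 
    
        '''Draw a number at random from the uniform distribution between 0 and 1''' 
    
        randomDraw = 0 # replace with your random draw
    
        '''If the number is less than epsilon, explore; if it is larger, exploit'''
    
        if randomDraw < epsilon:
        
            # exploration
        
            '''update the Q-table'''
        
        else:
        
            # exploitation
        
            '''update the Q-table'''
        
        # remember to set done to True once the agent has escaped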
    

### Part III : Introduction to Q-Learning.


There are several libraries in Python, including RLlib, PyBrain, ..., that can be used to code reinforcement learning approaches. When starting, a good approach is to consider the [Gym toolkit](https://gym.openai.com/) from OpenAI. Gym is compatible with both Theano and TensorFlow and contains a collection of examples that can be used to illustrate most of the reinforcement learning frameworks. Install Gym with "pip install gym" (you may need "!pip install cmake 'gym[atari]'") or use 

"git clone https://github.com/openai/gym"

"cd gym"

"pip install -e ."

if you prefer to clone the git repository. 


#### Exercise III.1. Gym Self Driving cab
(based on the [learndatasci](https://www.learndatasci.com/) tutorials)

As a starting point, we will consider the [self-driving cab](https://gym.openai.com/envs/Taxi-v2/) example. Use the lines below to display the map for this particular example. The objective in this exercise is to train the cab through RL in order to (1) drop off the passenger at the right location, (2) save as much time as possible by taking the shortest path from the pick-up to the drop-off location, and (3) respect traffic rules. 

- The cab is represented by the yellow rectangle. It is free to move on a $5\times 5$ grid and its spatial state can thus be described by a vector of dimension 25. 

- Wherever it is, the cab has four possible destinations: the four positions 'R', 'Y', 'G' and 'B'.  

- We will further assume that the passengers can be picked up in any of the four locations R, G, Y and B. On top of those four locations, we also need to account for the case in which the passenger is inside the cab. Any passenger position can thus be encoded by 5 binary variables.


In this case, the state of the environment can thus be encoded by $5\times 5 \times 4 \times 5 = 500$ binary variables. 


- Finally, we need to encode the possible actions that the cab can take. At each location the cab can move in each of the four directions (east, west, north, south), but it can also pick up or drop off a passenger. We can thus encode the actions of the cab through 6 binary variables. 




In [None]:
import gym

# note: recent Gym releases register this environment as "Taxi-v3" instead of "Taxi-v2"
env = gym.make("Taxi-v2").env

env.render()

__The cab is not supposed to cross the vertical bars, which represent walls, and we will thus enforce this by setting the reward associated with impossible moves to -1.__

Gym lets us access the environment by means of the variable 'env'. The variable comes with three methods:

- env.reset (reset the environment)
- env.step (apply a step)
- env.render (display the current state of the environment)

You can also use env.action_space as well as env.observation_space to respectively access the set of actions and existing states of the environment. 

Use the first and third methods to reset your environment and display its initial state.

In [None]:
# your code 



The point of this first exercise is for the agent to learn a mapping from the existing states to the optimal actions.

__Step I. Interacting with and displaying the environment.__ 

Each state of the environment can either be encoded as a single number (between 0 and 499) or as a 4-tuple of the form (cab row, cab col, passenger index, destination index), whose entries take 5, 5, 5 and 4 values respectively. To move between the two, Gym provides the method 'encode' of the variable 'env'. Using the lines below, together with the render method discussed above, set and display a couple of environment states.
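To see why 500 numbers are enough, the sketch below packs the four components into a single integer with a mixed-radix encoding and unpacks it again. The functions `encode` and `decode` here are standalone illustrations written for this notebook; we believe they follow the same ordering as Gym's 'env.encode', but treat that as an assumption and check it against your installed version.

```python
def encode(cab_row, cab_col, passenger_idx, destination_idx):
    """Pack (row, col, passenger, destination) into one integer in [0, 499]."""
    state = cab_row                        # 5 possible rows
    state = state * 5 + cab_col            # 5 possible columns
    state = state * 5 + passenger_idx      # 4 pick-up locations + "in the cab"
    state = state * 4 + destination_idx    # 4 possible destinations
    return state

def decode(state):
    """Inverse of encode: recover the four components from the integer."""
    state, destination_idx = divmod(state, 4)
    state, passenger_idx = divmod(state, 5)
    cab_row, cab_col = divmod(state, 5)
    return cab_row, cab_col, passenger_idx, destination_idx

print(encode(3, 1, 2, 0))    # cab in row 3, column 1, passenger at location 2, destination 0
print(decode(499))           # the largest state index
```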


In [None]:
state = env.encode(0,0,0,0) # change the 4-tuple to the state you want to encode 
env.s = state

env.render()

__Step II. Taking actions based on rewards__ 

To each state of the environment is associated a Reward table which can be accessed through the line env.P[n] where n is the number encoding a particular state of the environment. Look at the reward tables of the states you rendered above. 

In [None]:
# your code (one line)

The reward table has 6 rows (one per action) and four columns of the form (probability, nextstate, reward, done). In this framework we don't consider any randomness in the transitions, so the probability is always set to $1$. The last column indicates when the cab has dropped off a passenger at the right location. 
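The sketch below shows how such a table can be read in code. The dictionary `table` is a hand-written stand-in with the same shape as 'env.P[n]' (one entry per action, each holding a list with one (probability, nextstate, reward, done) tuple); the numbers in it are illustrative only, so inspect 'env.P[n]' for the real values.

```python
# Illustrative stand-in for env.P[n]: {action: [(probability, nextstate, reward, done)]}
table = {
    0: [(1.0, 428, -1, False)],
    1: [(1.0, 228, -1, False)],
    2: [(1.0, 348, -1, False)],
    3: [(1.0, 328, -1, False)],
    4: [(1.0, 328, -10, False)],
    5: [(1.0, 328, -10, False)],
}

# Print one line per action.
for action, transitions in table.items():
    probability, next_state, reward, done = transitions[0]
    print(f"action {action}: next state {next_state}, reward {reward}, done {done}")

# Actions with the best immediate reward (ignoring what happens afterwards).
best_reward = max(transitions[0][2] for transitions in table.values())
best_actions = [a for a, transitions in table.items() if transitions[0][2] == best_reward]
print(best_actions)
```

Looking only at the immediate reward does not tell the agent which move brings it closer to the passenger, which is what the Q-table is meant to capture.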

Each successful dropoff concludes one episode. 


#### Exercise III.1. 

Implement a full episode. That is, we want a loop that only stops when the passenger has been dropped off. 

(Hint: to sample an action you can use the method 'env.action_space.sample()'. Then note that env.step returns a 4-tuple of the form (state, reward, done, info) where 'done' indicates whether the passenger has been dropped off.)

In [None]:
'''This script should run one episode in which the cab takes random actions 
until the passenger is dropped off at the right location'''

frames = [] # one entry per time step, used to play the movie afterwards
done = False

while not done: # loop until the passenger has been dropped off
    
    
    # put your code here
    
    
    
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

Once you have stored all the frames, use the lines below to play the resulting movie.

In [None]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'].getvalue())
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(frames)

#### Exercise III.2. 

We will now see how one can exploit the previous experience of our agent to increase the rewards over time through Q-learning. In Q-learning the idea is to keep track of the actions that were beneficial by updating a mapping from any pair (environment state, action) to some number encoding the value of the pair. Q-values are updated following the equation 

\begin{align}
Q(\text{state}, \text{action}) &\leftarrow (1-\alpha)\, Q(\text{state}, \text{action}) \\
&\quad + \alpha \left(\text{reward} + \gamma \max_{a'} Q(\text{next state}, a')\right)
\end{align}


That is, we not only try to maximize the immediate reward but we also look for the action that will lead to the highest potential reward one step ahead. In the equation above, $\alpha$ can be interpreted as a __learning rate__. $\gamma$, which is known as the __discount factor__, indicates how much importance we want to give to future rewards.  
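As a quick check of the update rule, here is one step worked out by hand. The numbers are assumptions chosen for illustration: $\alpha = 0.1$, $\gamma = 0.6$, a current value $Q(\text{state}, \text{action}) = 0$, a reward of $-1$, and a best next-state value of $2$.

$$
Q(\text{state}, \text{action}) \leftarrow (1 - 0.1)\cdot 0 + 0.1\,\bigl(-1 + 0.6 \cdot 2\bigr) = 0.1 \cdot 0.2 = 0.02
$$

Even though the immediate reward is negative, the value of the pair increases slightly because the next state looks promising.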

The __Q-table__ is a table with 500 rows corresponding to the 500 states and 6 columns encoding each of the 6 actions. We will use a numpy array of zeros to encode this table. Finally, in order for our learning algorithm to be efficient, we will alternate between exploration (with probability epsilon) and exploitation (with probability 1-epsilon). 


Extend the "random cab" episode from Exercise III.1 in order to account for the Q-table. 

- Use the line 'next_state, reward, done, info = env.step(action)'  to update the environment 
- Select the action either at random or according to the Q-table

(Hint: to decide between exploration and exploitation, split the $[0,1]$ interval into a $[0,\varepsilon]$ subinterval and a $[\varepsilon,1]$ subinterval. Then draw a number uniformly at random from the $[0,1]$ interval. If the number falls in the $[0,\varepsilon]$ interval, pick an action at random. Otherwise, pick the action that maximizes the Q-table for the current state.)
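The structure of the training loop can be tried out without Gym on a toy problem. The sketch below is not the Taxi solution: `CorridorEnv` is a hypothetical stand-in that only mimics the 'reset' and 'step' interface (with 'step' returning the 4-tuple (state, reward, done, info)), and its rewards are made up. Nested lists replace the numpy array to keep the example dependency-free.

```python
import random

class CorridorEnv:
    """Hypothetical stand-in for a Gym environment: 5 cells in a row, exit on the right."""
    n_states, n_actions = 5, 2                 # actions: 0 = left, 1 = right

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 20 if done else -1            # made-up rewards: -1 per step, 20 at the exit
        return self.state, reward, done, {}

rng = random.Random(0)
env = CorridorEnv()
alpha, gamma, epsilon = 0.1, 0.6, 0.1
q_table = [[0.0] * env.n_actions for _ in range(env.n_states)]

def greedy(state):
    """Action with the largest Q-value in the given state."""
    return max(range(env.n_actions), key=lambda a: q_table[state][a])

for episode in range(500):
    state, done = env.reset(), False
    while not done:
        if rng.random() < epsilon:
            action = rng.randrange(env.n_actions)      # explore
        else:
            action = greedy(state)                     # exploit
        next_state, reward, done, info = env.step(action)
        old_value = q_table[state][action]
        next_max = max(q_table[next_state])
        q_table[state][action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        state = next_state

# Evaluation: follow the learned Q-table greedily, without exploration.
state, done, steps = env.reset(), False, 0
while not done and steps < 50:
    state, reward, done, info = env.step(greedy(state))
    steps += 1
print(steps)
```

For the cab, the same loop applies with the Gym environment in place of `CorridorEnv` and a Q-table with 500 rows and 6 columns.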

In [None]:
'''This script should code one episode in which a random action is 
taken with probability epsilon and the action maximizing Q is taken with probability (1-epsilon)'''


import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
    
    while not done:
        

        # complete the episode here
        
        
        
        
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")



#### Exercise III.3. Evaluating the agent 

Once you have learned the Q-table, evaluate the agent's behavior by choosing, at each step and in each state, the action that maximizes the value of the Q-table, and play the resulting movie using the 'print_frames' function defined above.