![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2018 Carlos A. Iglesias

## [Introduction to Machine Learning V](2_6_0_Intro_RL.ipynb)

# Table of Contents

* [Introduction](#Introduction)
* [Getting started with OpenAI Gym](#Getting-started-with-OpenAI-Gym)
* [The Frozen Lake scenario](#The-Frozen-Lake-scenario)
* [Q-Learning with the Frozen Lake scenario](#Q-Learning-with-the-Frozen-Lake-scenario)
* [Exercises](#Exercises)
* [Optional exercises](#Optional-exercises)

# Introduction
The purpose of this exercise is to gain a better understanding of Reinforcement Learning (RL) and, in particular, of Q-Learning.

We are going to use [OpenAI Gym](https://gym.openai.com/). OpenAI Gym is a toolkit for developing and comparing RL algorithms. Take a look at their [website](https://gym.openai.com/).

It provides environments for [algorithm imitation](http://gym.openai.com/envs/#algorithmic), [classic control problems](http://gym.openai.com/envs/#classic_control), [Atari games](http://gym.openai.com/envs/#atari), [Box2D continuous control](http://gym.openai.com/envs/#box2d), [robotics with MuJoCo, Multi-Joint dynamics with Contact](http://gym.openai.com/envs/#mujoco), and [simple text-based environments](http://gym.openai.com/envs/#toy_text).

This notebook is based on [Diving deeper into Reinforcement Learning with Q-Learning](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).

First of all, install the OpenAI Gym library:

```console
foo@bar:~$ pip install gym
```


If you get the error message 'NotImplementedError: abstract', [execute](https://github.com/openai/gym/issues/775):
```console
foo@bar:~$ pip install pyglet==1.2.4
```

If you want to try the Atari environments, it is better to opt for the full installation from source. Follow the instructions at [OpenAI Gym](https://github.com/openai/gym#id15).


# Getting started with OpenAI Gym

First of all, read the [introduction](http://gym.openai.com/docs/#getting-started-with-gym) of OpenAI Gym.

## Environments
OpenAI Gym provides a number of problems called *environments*.

Try 'CartPole-v0' (or 'MountainCar-v0').

In [6]:
import gym

env = gym.make('CartPole-v0')
#env = gym.make('MountainCar-v0')
#env = gym.make('Taxi-v2')

#env = gym.make('Jamesbond-ram-v0')

env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample()) # take a random action

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.


This will launch an external window with the game. If you cannot close that window, just execute in a code cell:

```python
env.close()
```

The full list of available environments can be obtained by printing the environment registry as follows.

In [8]:
from gym import envs
print(envs.registry.all())

dict_values([EnvSpec(Copy-v0), EnvSpec(RepeatCopy-v0), EnvSpec(ReversedAddition-v0), EnvSpec(ReversedAddition3-v0), EnvSpec(DuplicatedInput-v0), EnvSpec(Reverse-v0), EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(MountainCar-v0), EnvSpec(MountainCarContinuous-v0), EnvSpec(Pendulum-v0), EnvSpec(Acrobot-v1), EnvSpec(LunarLander-v2), EnvSpec(LunarLanderContinuous-v2), EnvSpec(BipedalWalker-v2), EnvSpec(BipedalWalkerHardcore-v2), EnvSpec(CarRacing-v0), EnvSpec(Blackjack-v0), EnvSpec(KellyCoinflip-v0), EnvSpec(KellyCoinflipGeneralized-v0), EnvSpec(FrozenLake-v0), EnvSpec(FrozenLake8x8-v0), EnvSpec(CliffWalking-v0), EnvSpec(NChain-v0), EnvSpec(Roulette-v0), EnvSpec(Taxi-v2), EnvSpec(GuessingGame-v0), EnvSpec(HotterColder-v0), EnvSpec(Reacher-v2), EnvSpec(Pusher-v2), EnvSpec(Thrower-v2), EnvSpec(Striker-v2), EnvSpec(InvertedPendulum-v2), EnvSpec(InvertedDoublePendulum-v2), EnvSpec(HalfCheetah-v2), EnvSpec(Hopper-v2), EnvSpec(Swimmer-v2), EnvSpec(Walker2d-v2), EnvSpec(Ant-v2), EnvSpec(Hum

In [10]:
env.close()

The environment’s **step** function returns four values. These are:

* **observation (object):** an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
* **reward (float):** amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
* **done (boolean):** whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated (for example, perhaps the pole tipped too far, or you lost your last life).
* **info (dict):** diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

The typical agent loop first calls the *reset* method, which returns an initial observation. The agent then repeatedly executes an action and receives the reward, the new observation, and a flag indicating whether the episode has finished (*done* is True).

For example, analyze this sample agent loop, which runs 20 episodes of at most 100 timesteps each. The details of the previous variables for this game, as described [here](https://github.com/openai/gym/wiki/CartPole-v0), are:
* **observation**: Cart Position, Cart Velocity, Pole Angle, Pole Velocity.
* **action**: 0 (Push cart to the left), 1 (Push cart to the right).
* **reward**: 1 for every step taken, including the termination step.

In [9]:
import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        print("Action ", action)
        observation, reward, done, info = env.step(action)
        print("Observation ", observation, ", reward ", reward, ", done ", done, ", info " , info)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
[-0.03841348  0.04244381 -0.00503248 -0.01166684]
Action  1
Observation  [-0.0375646   0.23763757 -0.00526582 -0.30593331] , reward  1.0 , done  False , info  {}
[-0.0375646   0.23763757 -0.00526582 -0.30593331]
Action  1
Observation  [-0.03281185  0.43283416 -0.01138448 -0.60027229] , reward  1.0 , done  False , info  {}
[-0.03281185  0.43283416 -0.01138448 -0.60027229]
Action  0
Observation  [-0.02415517  0.23787331 -0.02338993 -0.31119693] , reward  1.0 , done  False , info  {}
[-0.02415517  0.23787331 -0.02338993 -0.31119693]
Action  0
Observation  [-0.0193977   0.04309227 -0.02961387 -0.0259813 ] , reward  1.0 , done  False , info  {}
[-0.0193977   0.04309227 -0.02961387 -0.0259813 ]
Action  0
Observation  [-0.01853586 -0.15159275 -0.03013349  0.25721299] , reward  1.0 , done  False , info  {}
[-0.01853586 -0.15159275 -0.03013349  0.25721299]
Action  1
Observation  [-0.02156

[-0.03439474 -0.54275785  0.04342057  0.87772093]
Action  0
Observation  [-0.0452499  -0.73844213  0.06097499  1.18373233] , reward  1.0 , done  False , info  {}
[-0.0452499  -0.73844213  0.06097499  1.18373233]
Action  1
Observation  [-0.06001874 -0.54416197  0.08464964  0.9107692 ] , reward  1.0 , done  False , info  {}
[-0.06001874 -0.54416197  0.08464964  0.9107692 ]
Action  1
Observation  [-0.07090198 -0.35028121  0.10286502  0.64584668] , reward  1.0 , done  False , info  {}
[-0.07090198 -0.35028121  0.10286502  0.64584668]
Action  0
Observation  [-0.0779076  -0.54667471  0.11578196  0.96906875] , reward  1.0 , done  False , info  {}
[-0.0779076  -0.54667471  0.11578196  0.96906875]
Action  1
Observation  [-0.0888411  -0.35328145  0.13516333  0.71488498] , reward  1.0 , done  False , info  {}
[-0.0888411  -0.35328145  0.13516333  0.71488498]
Action  0
Observation  [-0.09590673 -0.54998977  0.14946103  1.04687343] , reward  1.0 , done  False , info  {}
[-0.09590673 -0.54998977  0.

[-0.04800907  0.045426    0.00653589 -0.00498813]
Action  1
Observation  [-0.04710055  0.24045361  0.00643613 -0.29560176] , reward  1.0 , done  False , info  {}
[-0.04710055  0.24045361  0.00643613 -0.29560176]
Action  1
Observation  [-4.22914735e-02  4.35483215e-01  5.24090835e-04 -5.86247903e-01] , reward  1.0 , done  False , info  {}
[-4.22914735e-02  4.35483215e-01  5.24090835e-04 -5.86247903e-01]
Action  0
Observation  [-0.03358181  0.24035393 -0.01120087 -0.29339993] , reward  1.0 , done  False , info  {}
[-0.03358181  0.24035393 -0.01120087 -0.29339993]
Action  0
Observation  [-0.02877473  0.04539345 -0.01706887 -0.00427054] , reward  1.0 , done  False , info  {}
[-0.02877473  0.04539345 -0.01706887 -0.00427054]
Action  1
Observation  [-0.02786686  0.24075598 -0.01715428 -0.30228965] , reward  1.0 , done  False , info  {}
[-0.02786686  0.24075598 -0.01715428 -0.30228965]
Action  0
Observation  [-0.02305174  0.04588266 -0.02320007 -0.0150658 ] , reward  1.0 , done  False , info 

[-0.00626518 -0.35751812 -0.00758625  0.56903238]
Action  1
Observation  [-0.01341555 -0.16229059  0.0037944   0.27396918] , reward  1.0 , done  False , info  {}
[-0.01341555 -0.16229059  0.0037944   0.27396918]
Action  0
Observation  [-0.01666136 -0.35746647  0.00927378  0.56784645] , reward  1.0 , done  False , info  {}
[-0.01666136 -0.35746647  0.00927378  0.56784645]
Action  0
Observation  [-0.02381069 -0.55271727  0.02063071  0.86343651] , reward  1.0 , done  False , info  {}
[-0.02381069 -0.55271727  0.02063071  0.86343651]
Action  1
Observation  [-0.03486503 -0.35788217  0.03789944  0.57731105] , reward  1.0 , done  False , info  {}
[-0.03486503 -0.35788217  0.03789944  0.57731105]
Action  1
Observation  [-0.04202268 -0.16331135  0.04944566  0.29680417] , reward  1.0 , done  False , info  {}
[-0.04202268 -0.16331135  0.04944566  0.29680417]
Action  0
Observation  [-0.0452889  -0.35910203  0.05538175  0.60466235] , reward  1.0 , done  False , info  {}
[-0.0452889  -0.35910203  0.

[ 0.04359227  0.78950104 -0.06861304 -1.26046025]
Action  0
Observation  [ 0.05938229  0.59532053 -0.09382225 -0.99003125] , reward  1.0 , done  False , info  {}
[ 0.05938229  0.59532053 -0.09382225 -0.99003125]
Action  0
Observation  [ 0.07128871  0.40157108 -0.11362287 -0.72822856] , reward  1.0 , done  False , info  {}
[ 0.07128871  0.40157108 -0.11362287 -0.72822856]
Action  0
Observation  [ 0.07932013  0.2081879  -0.12818744 -0.47335751] , reward  1.0 , done  False , info  {}
[ 0.07932013  0.2081879  -0.12818744 -0.47335751]
Action  0
Observation  [ 0.08348388  0.01508723 -0.13765459 -0.22366701] , reward  1.0 , done  False , info  {}
[ 0.08348388  0.01508723 -0.13765459 -0.22366701]
Action  0
Observation  [ 0.08378563 -0.17782631 -0.14212793  0.02262326] , reward  1.0 , done  False , info  {}
[ 0.08378563 -0.17782631 -0.14212793  0.02262326]
Action  1
Observation  [ 0.0802291   0.01901756 -0.14167547 -0.3113104 ] , reward  1.0 , done  False , info  {}
[ 0.0802291   0.01901756 -0.

[ 0.02538204 -0.03276309 -0.02769695  0.04865145]
Action  0
Observation  [ 0.02472678 -0.22747717 -0.02672392  0.33246869] , reward  1.0 , done  False , info  {}
[ 0.02472678 -0.22747717 -0.02672392  0.33246869]
Action  0
Observation  [ 0.02017724 -0.42220875 -0.02007454  0.61660587] , reward  1.0 , done  False , info  {}
[ 0.02017724 -0.42220875 -0.02007454  0.61660587]
Action  0
Observation  [ 0.01173306 -0.61704458 -0.00774243  0.90289921] , reward  1.0 , done  False , info  {}
[ 0.01173306 -0.61704458 -0.00774243  0.90289921]
Action  0
Observation  [-6.07827806e-04 -8.12060799e-01  1.03155582e-02  1.19313852e+00] , reward  1.0 , done  False , info  {}
[-6.07827806e-04 -8.12060799e-01  1.03155582e-02  1.19313852e+00]
Action  1
Observation  [-0.01684904 -0.61707397  0.03417833  0.90370656] , reward  1.0 , done  False , info  {}
[-0.01684904 -0.61707397  0.03417833  0.90370656]
Action  1
Observation  [-0.02919052 -0.42243121  0.05225246  0.6219594 ] , reward  1.0 , done  False , info 

[ 0.03164849 -0.38815401 -0.01636068  0.55927992]
Action  1
Observation  [ 0.02388541 -0.19280629 -0.00517509  0.26148772] , reward  1.0 , done  False , info  {}
[ 0.02388541 -0.19280629 -0.00517509  0.26148772]
Action  1
Observation  [ 2.00292795e-02  2.38915349e-03  5.46692548e-05 -3.28229912e-02] , reward  1.0 , done  False , info  {}
[ 2.00292795e-02  2.38915349e-03  5.46692548e-05 -3.28229912e-02]
Action  0
Observation  [ 0.02007706 -0.19273358 -0.00060179  0.25987718] , reward  1.0 , done  False , info  {}
[ 0.02007706 -0.19273358 -0.00060179  0.25987718]
Action  1
Observation  [ 0.01622239  0.00239696  0.00459575 -0.0329955 ] , reward  1.0 , done  False , info  {}
[ 0.01622239  0.00239696  0.00459575 -0.0329955 ]
Action  1
Observation  [ 0.01627033  0.1974527   0.00393584 -0.32422488] , reward  1.0 , done  False , info  {}
[ 0.01627033  0.1974527   0.00393584 -0.32422488]
Action  1
Observation  [ 0.02021938  0.39251839 -0.00254865 -0.61566401] , reward  1.0 , done  False , info 

[ 0.11901488  0.96432911 -0.143899   -1.60692947]
Action  0
Observation  [ 0.13830146  0.77117189 -0.17603759 -1.36234868] , reward  1.0 , done  False , info  {}
[ 0.13830146  0.77117189 -0.17603759 -1.36234868]
Action  1
Observation  [ 0.1537249   0.9680078  -0.20328456 -1.70452765] , reward  1.0 , done  False , info  {}
[ 0.1537249   0.9680078  -0.20328456 -1.70452765]
Action  0
Observation  [ 0.17308505  0.77572278 -0.23737511 -1.48139409] , reward  1.0 , done  True , info  {}
Episode finished after 14 timesteps
[-0.03534474 -0.03801097 -0.00149005  0.04285437]
Action  1
Observation  [-0.03610496  0.15713232 -0.00063297 -0.25029831] , reward  1.0 , done  False , info  {}
[-0.03610496  0.15713232 -0.00063297 -0.25029831]
Action  1
Observation  [-0.03296231  0.3522633  -0.00563893 -0.54318082] , reward  1.0 , done  False , info  {}
[-0.03296231  0.3522633  -0.00563893 -0.54318082]
Action  0
Observation  [-0.02591705  0.15722105 -0.01650255 -0.25227994] , reward  1.0 , done  False , in

[-0.03179988  0.4064995  -0.03916855 -1.03885516]
Action  0
Observation  [-0.02366989  0.21191931 -0.05994566 -0.75872135] , reward  1.0 , done  False , info  {}
[-0.02366989  0.21191931 -0.05994566 -0.75872135]
Action  1
Observation  [-0.0194315   0.40781382 -0.07512008 -1.06964878] , reward  1.0 , done  False , info  {}
[-0.0194315   0.40781382 -0.07512008 -1.06964878]
Action  1
Observation  [-0.01127523  0.60384449 -0.09651306 -1.38493007] , reward  1.0 , done  False , info  {}
[-0.01127523  0.60384449 -0.09651306 -1.38493007]
Action  0
Observation  [ 8.01662943e-04  4.10049450e-01 -1.24211661e-01 -1.12392114e+00] , reward  1.0 , done  False , info  {}
[ 8.01662943e-04  4.10049450e-01 -1.24211661e-01 -1.12392114e+00]
Action  1
Observation  [ 0.00900265  0.60656112 -0.14669008 -1.45284205] , reward  1.0 , done  False , info  {}
[ 0.00900265  0.60656112 -0.14669008 -1.45284205]
Action  0
Observation  [ 0.02113387  0.41351366 -0.17574692 -1.20935314] , reward  1.0 , done  False , info 

# The Frozen Lake scenario
We are going to play the [Frozen Lake](http://gym.openai.com/envs/FrozenLake-v0/) game.

The problem is a grid where you should go from the 'start' (S) position to the 'goal' (G) position (the pizza!). You can only walk on the 'frozen' tiles (F). Unfortunately, you can fall into a 'hole' (H).
![](images/frozenlake-problem.png "Frozen lake problem")

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise. The possible actions are going left, right, up or down. However, the ice is slippery, so you won't always move in the direction you intend.
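
The slippery dynamics can be sketched without Gym. The following is a minimal illustration, not Gym's implementation. It assumes the standard 4x4 map, the action encoding 0=Left, 1=Down, 2=Right, 3=Up, and that the agent moves in the intended direction or in one of the two perpendicular directions with equal probability. The names `LAKE`, `move` and `slippery_step` are hypothetical helpers introduced only for this example.

```python
import random

# Standard 4x4 map: S = start, F = frozen, H = hole, G = goal.
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
SIZE = 4
# Assumed action encoding: 0=Left, 1=Down, 2=Right, 3=Up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def move(state, action):
    """Deterministic move; walking into the border leaves the state unchanged."""
    row, col = divmod(state, SIZE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    return row * SIZE + col


def slippery_step(state, action, rng=random):
    """Move in the intended or a perpendicular direction (assumed 1/3 each)."""
    actual = rng.choice([(action - 1) % 4, action, (action + 1) % 4])
    new_state = move(state, actual)
    tile = LAKE[new_state // SIZE][new_state % SIZE]
    reward = 1.0 if tile == "G" else 0.0  # reward only when the goal is reached
    done = tile in "GH"                   # episode ends at the goal or in a hole
    return new_state, reward, done


# Trying to go Right from the tile left of the goal may also slip Down or Up.
print(slippery_step(14, 2, random.Random(0)))
```

Under these assumptions, even a good action only succeeds part of the time, which is why the learned policy later in the notebook does not reach the goal in every episode.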

![](images/frozenlake-world.png "Frozen lake world")


Here you can see several episodes. A full recording is available at [Frozen Lake](http://gym.openai.com/envs/FrozenLake-v0/).

![](images/recording.gif "Example running")


# Q-Learning with the Frozen Lake scenario
We are now going to apply Q-Learning to the Frozen Lake scenario. This part of the notebook is taken from [here](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/Q%20Learning%20with%20FrozenLake.ipynb).

First we create the environment and a Q-table initialized with zeros to store the value of each action in a given state.

In [27]:
import numpy as np
import gym
import random

env = gym.make("FrozenLake-v0")


action_size = env.action_space.n
state_size = env.observation_space.n


qtable = np.zeros((state_size, action_size))
print(qtable)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


Now we define the hyperparameters.

In [28]:
# Q-Learning hyperparameters
total_episodes = 10000        # Total episodes
learning_rate = 0.8           # Learning rate
max_steps = 99                # Max steps per episode
gamma = 0.95                  # Discounting rate

# Exploration hyperparameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.01             # Exponential decay rate for exploration prob
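
To get a feeling for the exploration hyperparameters, the decay schedule used in the training loop of this notebook can be evaluated on its own. This is a small sketch that uses the same values as above; `epsilon_at` is a hypothetical helper name.

```python
import math

min_epsilon, max_epsilon, decay_rate = 0.01, 1.0, 0.01


def epsilon_at(episode):
    """Exploration probability after `episode` completed episodes."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


# Print the exploration probability at a few points of the training run.
for episode in (0, 100, 500, 1000):
    print(episode, round(epsilon_at(episode), 4))
```

With these values, epsilon starts at 1.0, drops to roughly 0.37 after 100 episodes and is below 0.02 after 500 episodes, so most of the 10000 training episodes are spent almost entirely exploiting the Q-table.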

And now we implement the Q-Learning algorithm.

![](images/qlearning-algo.png "Q-Learning algorithm")
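
The core of the algorithm is the update Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]. As a worked example, the sketch below applies this rule to two hand-picked transitions, using the learning rate and discount defined above. The function name `q_update` and the two transitions are illustrative assumptions, not part of the original notebook.

```python
def q_update(q_sa, reward, max_q_next, learning_rate=0.8, gamma=0.95):
    """Return the updated Q(s,a) after observing one transition."""
    td_target = reward + gamma * max_q_next   # estimate of the return from (s,a)
    return q_sa + learning_rate * (td_target - q_sa)


# Transition into the goal: reward 1, and the terminal state has no future value.
q_goal_step = q_update(q_sa=0.0, reward=1.0, max_q_next=0.0)

# Transition one step earlier: no reward, but the next state is now worth q_goal_step.
q_prev_step = q_update(q_sa=0.0, reward=0.0, max_q_next=q_goal_step)

print(q_goal_step, q_prev_step)  # 0.8 and approximately 0.608
```

The second value shows how the reward obtained at the goal is propagated backwards, discounted by gamma, to the states that lead to it.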

In [29]:
# List of rewards
rewards = []

# 2 For life or until learning is stopped
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        # 3. Choose an action a in the current world state (s)
        ## First we randomize a number
        exp_exp_tradeoff = random.uniform(0, 1)
        
        ## If this number is greater than epsilon --> exploitation (taking the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])

        # Else doing a random choice --> exploration
        else:
            action = env.action_space.sample()

        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        
        total_rewards += reward
        
        # The new state becomes the current state
        state = new_state
        
        # If done (we fell into a hole or reached the goal): finish the episode
        if done:
            break
        
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*(episode + 1))
    rewards.append(total_rewards)

print ("Score over time: " +  str(sum(rewards)/total_episodes))
print(qtable)

Score over time: 0.4826
[[1.60518451e-01 8.91387956e-02 2.96601641e-01 9.01854463e-02]
 [1.13175005e-03 1.41005657e-03 3.62002511e-03 1.59960656e-01]
 [5.00399653e-03 1.36336032e-02 1.13519882e-02 1.23768268e-01]
 [8.74284025e-03 1.23643338e-03 7.20367554e-04 3.40941171e-02]
 [2.46871727e-01 4.60946645e-04 1.03481616e-01 7.45689133e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [5.08654555e-04 8.28346456e-10 6.92777320e-03 5.43643846e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [3.56917014e-02 1.46775173e-02 9.21011717e-02 1.26453025e-01]
 [8.03435752e-03 4.00005248e-01 1.88008887e-02 1.44495931e-02]
 [3.80255929e-02 3.62925397e-02 1.02173788e-01 6.43993394e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [8.26085057e-02 3.07854324e-02 1.06771521e-01 8.39566542e-02]
 [6.41377434e-02 3.67876191e-01 2.56672518e-01 1.70758755e-01]
 [0.00000000e+00 0.00000000e+00
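
A Q-table like the one printed above is easier to read as a policy: for each state, the action with the largest Q-value. The sketch below does this for a few rows copied from the output above (rounded), assuming the action encoding 0=Left, 1=Down, 2=Right, 3=Up. `ACTION_NAMES` and `greedy_policy` are hypothetical names for this example, and states whose row is all zeros (never updated, for instance holes) are shown as "-".

```python
# Assumed action encoding for FrozenLake: 0=Left, 1=Down, 2=Right, 3=Up.
ACTION_NAMES = ["Left", "Down", "Right", "Up"]


def greedy_policy(qtable):
    """Return the name of the best action for each state, or '-' for all-zero rows."""
    policy = []
    for row in qtable:
        values = list(row)
        if all(v == 0 for v in values):
            policy.append("-")
        else:
            policy.append(ACTION_NAMES[values.index(max(values))])
    return policy


# Rows for states 0, 1, 2 and 5 of the Q-table printed above (rounded).
sample = [
    [1.605e-01, 8.914e-02, 2.966e-01, 9.019e-02],
    [1.132e-03, 1.410e-03, 3.620e-03, 1.600e-01],
    [5.004e-03, 1.363e-02, 1.135e-02, 1.238e-01],
    [0.0, 0.0, 0.0, 0.0],
]
print(greedy_policy(sample))
```

The function accepts any sequence of rows, so the full NumPy `qtable` can be passed directly.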

Finally, we use the learned Q-table to play the Frozen Lake game.

In [30]:

env.reset()

for episode in range(5):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        env.render()
        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            break
        state = new_state
env.close()

****************************************************
EPISODE  0

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Up)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Up)
SF[41mF[0mF
FHFH
FFFH
HFFG
  (Up)
SF[41mF[0mF
FHFH
FFFH
HFFG
  (Up)
SF[41mF[0mF
FHFH
FFFH
HFFG
  (Up)
SF[41mF[0mF
FHFH
FFFH
HFFG
  (Up)
SF[41mF[0mF
FHFH
FFFH
HFFG
  (Up)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Up)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Up)
SF[41mF[0mF
FHFH
FFFH
HFFG
  (Up)
SFF[41mF[0m
FHFH
FFFH
HFFG
  (Up)
SFF[41mF[0m
FHFH
FFFH
HFFG
  (Up)
SFF[41mF[0m
FHFH
FFFH
HFFG
  (Up)
SF[41mF[0mF
FHFH
FFFH
HFFG
  (Up)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[

# Exercises

## Taxi
Analyze the [Taxi problem](http://gym.openai.com/envs/Taxi-v2/) and solve it by applying Q-Learning. You can find a solution similar to the one presented above [here](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym).

Analyze the impact of keeping the learning rate constant (called alpha or epsilon, depending on the book; in the code above it is `learning_rate`, whereas `epsilon` is the exploration rate) or of changing it in a different way.

In [31]:
# Create the environment and initialize the Q-table
env = gym.make("Taxi-v2")
action_size = env.action_space.n
state_size = env.observation_space.n
qtable = np.zeros((state_size, action_size))
print(qtable)

# Q-Learning hyperparameters
total_episodes = 10000        # Total episodes
learning_rate = 0.8           # Learning rate
max_steps = 99                # Max steps per episode
gamma = 0.95                  # Discounting rate

# Exploration hyperparameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.01             # Exponential decay rate for exploration prob


[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]


In [32]:
#Algorithm
# List of rewards
rewards = []

# 2 For life or until learning is stopped
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        # 3. Choose an action a in the current world state (s)
        ## First we randomize a number
        exp_exp_tradeoff = random.uniform(0, 1)
        
        ## If this number is greater than epsilon --> exploitation (taking the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])

        # Else doing a random choice --> exploration
        else:
            action = env.action_space.sample()

        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        
        total_rewards += reward
        
        # The new state becomes the current state
        state = new_state
        
        # If done (the episode has ended): finish the episode
        if done:
            break
        
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*(episode + 1))
    #epsilon = 1.0
    rewards.append(total_rewards)

print ("Score over time: " +  str(sum(rewards)/total_episodes))
print(qtable)



#Game
env.reset()

for episode in range(5):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        env.render()
        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            break
        state = new_state
env.close()


Score over time: -387.1101
[[  0.           0.           0.           0.           0.
    0.        ]
 [204.74486227 216.81449713 204.96239363 216.81287998 229.49140548
  207.77400707]
 [196.01089049 207.72247246 195.91790223 207.81781805 219.98977698
  198.75950646]
 ...
 [256.86632681 271.81779241 257.20212665 243.37904779 248.14730043
  248.22009062]
 [171.67480096 181.88887893 171.59345337 181.84552996 162.71989687
  162.77521002]
 [215.49649028 224.36265441 222.30866408 257.16550459 212.19436592
  213.33934571]]
****************************************************
EPISODE  0
+---------+
|[35mR[0m: | : :G|
| : : : : |
|[43m [0m: : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+

+---------+
|[35mR[0m: | : :G|
| : : : : |
| : : : : |
|[43m [0m| : | : |
|[34;1mY[0m| : |B: |
+---------+
  (South)
+---------+
|[35mR[0m: | : :G|
| : : : : |
| : : : : |
| | : | : |
|[34;1m[43mY[0m[0m| : |B: |
+---------+
  (South)
+---------+
|[35mR[0m: | : :G|
| : : : : |
| : : : 

Setting the minimum learning rate to 1, so that the agent always tries to explore instead of exploiting the environment, gives me the same result: the outcome does not change.

# Optional exercises

## Doom
Read this [article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) and execute the companion [notebook](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/tree/master/DQN%20Doom). Analyze the results and provide conclusions about DQN.

## References
* [Diving deeper into Reinforcement Learning with Q-Learning, Thomas Simonini](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe)
* Illustrations by [Thomas Simonini](https://github.com/simoninithomas/Deep_reinforcement_learning_Course) and [Sung Kim](https://www.youtube.com/watch?v=xgoO54qN4lY)
* [Frozen Lake solution with TensorFlow](https://analyticsindiamag.com/openai-gym-frozen-lake-beginners-guide-reinforcement-learning/)
* [Deep Q-Learning for Doom](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)
* [Intro OpenAI Gym with Random Search and the Cart Pole scenario](http://www.pinchofintelligence.com/getting-started-openai-gym/)
* [Q-Learning for the Taxi scenario](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym)

## Licence

The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© 2018 Carlos A. Iglesias, Universidad Politécnica de Madrid.