### Reinforcement Learning Tutorial with OpenAI Gym

Jay Urbain, PhD  

OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms that includes many game environments.

This notebook provides a tutorial with example implementations for using the OpenAI Gym environment:
- Interacting with Gym.  
- Value iteration in deterministic environments.  
- Q-learning in deterministic environments.  
- Q-learning in non-deterministic environments.  
- **ON YOUR OWN:** Complete Q-Learning for Gym environment of your choice.   

References:  
https://gym.openai.com/  
https://www.kaggle.com/kernels/scriptcontent/6183449/notebook  


First, review the Gym toolkit and sample environments:   
    
https://gym.openai.com/   
        

In [None]:
import gym # openAi gym
from gym import envs
import numpy as np 
import datetime
import keras 
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from time import sleep

from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, BoltzmannQPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor
from rl.callbacks import FileLogger, ModelIntervalCheckpoint
print("OK")

#### Gym

Many games are available.  

In [None]:
print(envs.registry.all())


We can start with a basic game called Taxi.


In [None]:
env = gym.make('Taxi-v2')
env.reset()
env.render()

#### Taxi-v2

This task was introduced in [Dietterich2000] to illustrate some issues in hierarchical reinforcement learning. There are 4 locations (labeled by different letters) and your job is to pick up the passenger at one location and drop him off in another. You receive +20 points for a successful dropoff, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.

[Dietterich2000] T. G. Dietterich, "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition", Journal of Artificial Intelligence Research, 2000.

Actions: 
There are 6 discrete deterministic actions:
    - 0: move south
    - 1: move north
    - 2: move east 
    - 3: move west 
    - 4: pickup passenger
    - 5: dropoff passenger
    
Rewards: 
There is a reward of -1 for each action and an additional reward of +20 for delivering the passenger. There is a reward of -10 for executing the "pickup" and "dropoff" actions illegally.
    
Rendering:
    - blue: passenger
    - magenta: destination
    - yellow: empty taxi
    - green: full taxi
    - other letters: locations

https://gym.openai.com/envs/Taxi-v2/


#### Interacting with the Gym environment  

The OpenAI Gym toolkit follows the standard RL/Markov Decision Process (MDP) loop for handling interactions with the game.   
<img src="https://cdn-images-1.medium.com/max/800/1*7Ae4mf9gVvpuMgenwtf8wA.png">   
Source: [OpenAI](https://openai.com/)   

At each timestep, the agent chooses an action, and the environment returns an observation and a reward:  

*observation, reward, done, info = env.step(action)*    
* observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game like Taxi.
* reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
* done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
* info (dict): ignore, diagnostic information useful for debugging. Official evaluations of your agent are not allowed to use this for learning.  
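
The four return values are easier to see in an environment small enough to read in full. The sketch below is not part of Gym: `CorridorEnv` and `run_episode` are hypothetical names, and the reward numbers are only chosen to resemble Taxi's. It mimics the `reset`/`step` contract described above.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor (not a Gym class): start in cell 0, goal in the last cell."""

    def __init__(self, length=4):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the initial observation

    def step(self, action):
        # action 0 = left, 1 = right; the walls clamp the position
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 20.0 if done else -1.0
        return self.position, reward, done, {}  # observation, reward, done, info


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the cumulative reward."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done, info = env.step(policy(obs))
        total += reward
        if done:
            break
    return total


# A policy is just a function from observation to action
print(run_episode(CorridorEnv(), lambda obs: 1))  # always move right
```

Always moving right takes three steps (two at -1, then +20), so the episode total is 18.0.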

To illustrate interacting with the environment, we can take some random steps:


In [None]:
# Take a few random steps to see what the game looks like

rew_tot = 0
obs = env.reset()  # always reset the environment before stepping
env.render()  # display the initial environment state
for _ in range(6):
    action = env.action_space.sample()  # sample a random action from the possible actions (action_space)
    obs, rew, done, info = env.step(action)  # execute the action in the environment
    rew_tot = rew_tot + rew  # add to the cumulative reward
    env.render()  # render the updated environment state

# Print the total reward of these random actions
print("Reward: %r" % rew_tot)
    

#### Actions  
Action (a): the action the agent provides to the environment. 

`env.action_space` defines the set of environment actions available to the agent.

Actions available to the Taxi game $[0..5]$:      
* 0: move south
* 1: move north
* 2: move east 
* 3: move west 
* 4: pickup passenger
* 5: dropoff passenger
  

In [None]:
# The action space has 6 possible actions. Their meaning is nice to know for us humans, but the agent will figure it out
print(env.action_space)
NUM_ACTIONS = env.action_space.n
print("Possible actions: [0..%d]" % (NUM_ACTIONS - 1))


#### State   
State (s): Represents the board state of the game and is returned as the observation. 

In the Taxi game, the observation is an integer, one of 500 possible states. Each state can be translated into a graphic with the render function. 

*Note: this is specific for the Taxi game. In an Atari style game the observation is the game screen with many coloured pixels.*
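
To see where the number 500 comes from, here is a small sketch of how a taxi position, passenger location and destination can be packed into a single integer. It assumes the layout 5 rows x 5 columns x 5 passenger locations (four landmarks plus "in the taxi") x 4 destinations, which is how Gym's Taxi environment is believed to encode its state; the `encode` and `decode` functions below are standalone illustrations, not calls into Gym.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    # Mixed-radix packing: each field is shifted by the size of the fields after it
    state = taxi_row
    state = state * 5 + taxi_col
    state = state * 5 + passenger_loc
    state = state * 4 + destination
    return state


def decode(state):
    # Undo the packing in reverse order
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


print(5 * 5 * 5 * 4)  # number of distinct states
print(decode(42))     # (taxi_row, taxi_col, passenger_loc, destination)
```

Under this assumed layout, state 42 decodes to taxi row 0, column 2, passenger at location 0 and destination 2.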

In [None]:
print(env.observation_space)
print()
env.env.s=42 # some random number, you might recognize it
env.render()
env.env.s = 222 # and some other
env.render()

#### Markov decision process (MDP)
The Taxi game is an example of a [Markov decision process](https://en.wikipedia.org/wiki/Markov_decision_process). The game can be described by its states, the possible actions in each state (each leading to a next state with a certain probability) and the rewards associated with each state transition.

The [Markov property](https://en.wikipedia.org/wiki/Markov_property) means that the current state encapsulates all prior information: the next state and reward depend only on the current state and action, not on the history that led there.

The reinforcement learning environment is modeled as an MDP. Given this environment, the agent takes actions to maximize the cumulative reward. Since the internal workings of the environment are essentially a "black box" to the agent, the transition and reward dynamics are not given upfront and must be learned through interaction. (This differs from a hidden Markov model: here the state is fully observed, and only the dynamics are unknown.)

#### Policy   

Policy ($\pi$): The strategy that the agent uses to determine the next action `a` to take in state `s`. 

The optimal policy ($\pi^*$) is the policy that maximizes the expected cumulative reward. 

Our goal is to learn $\pi^*$ by solving the Bellman equation. 


#### Bellman equation  

$V^*(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)\,[R(s,a,s') + \gamma V^*(s')]$

where
* *R(s,a,s')* - Reward for action a in state s, transitioning to s'.
* *P(s'|s,a)* - Probability of transitioning to state s' given action a in state s. The Taxi game's actions are deterministic, so the selected action leads to the expected state with probability 1. 
* $\gamma$ - Discount rate for future rewards, with $0 \le \gamma < 1$. The higher $\gamma$, the more weight is placed on long-term rewards. The iteration may not converge if $\gamma = 1$.
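
The effect of $\gamma$ is easy to check numerically. The sketch below uses two made-up reward sequences (the numbers are illustrative, not taken from any Gym environment) and compares their discounted sums for a small and a large discount rate.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


quick_win = [5.0, 0.0, 0.0, 0.0]   # small reward right away
slow_win = [0.0, 0.0, 0.0, 10.0]   # larger reward three steps later

for gamma in (0.5, 0.95):
    print(gamma, discounted_return(quick_win, gamma), discounted_return(slow_win, gamma))

# With a constant reward of 1 per step, the return converges toward 1 / (1 - gamma)
print(discounted_return([1.0] * 1000, 0.9), 1 / (1 - 0.9))
```

With $\gamma = 0.5$ the immediate reward is worth more (5.0 versus 1.25); with $\gamma = 0.95$ the delayed reward wins (about 8.57 versus 5.0). The last line hints at why $\gamma < 1$ matters for convergence: the discounted sum of bounded rewards stays finite.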

The value iteration algorithm:  
* $V(s)$ represents the cumulative reward for state $s$. $V_{\pi}(s)$ is the expected cumulative reward of the current state $s$ under policy $\pi$.  

The Q learning algorithm:   
* The action-value $Q(s,a)$ function represents the cumulative reward of the current state $s$ and taking action $a$ under policy $\pi$.

#### Value iteration algorithm   

The idea is to iteratively calculate the value (expected long-term cumulative reward) for each state. The algorithm iterates over all states $s$ and possible actions $a$ to explore the value (cumulative discounted rewards) $V[s]$ for a given state $s$. 

The algorithm iterates until $V[s]$ converges. The optimal policy $\pi^*$ is the action taken at each state $s$ that maximizes the value. This value iteration algorithm is an example of [dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming) (DP). 
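
As a self-contained illustration of the idea, here is a minimal sketch of value iteration on an explicit transition table, written so that the sum over $s'$ in the Bellman equation is visible. The corridor model, the function names and the `(probability, next_state, reward, done)` tuple layout are assumptions made for this example; it does not use Gym.

```python
def corridor_model(n_states=4):
    """Hypothetical corridor: states 0..n-2 are playable, state n-1 is the goal."""
    P = {}
    for s in range(n_states - 1):
        P[s] = {}
        for name, step in (("left", -1), ("right", 1)):
            s_next = min(max(s + step, 0), n_states - 1)
            done = s_next == n_states - 1
            reward = 20.0 if done else -1.0
            # Deterministic here, but the list could hold several weighted outcomes
            P[s][name] = [(1.0, s_next, reward, done)]
    return P


def expected_value(outcomes, V, gamma):
    # Sum over s' of P(s'|s,a) * [R(s,a,s') + gamma * V(s')]; terminal states are worth 0
    return sum(p * (r + (0.0 if done else gamma * V[s_next]))
               for p, s_next, r, done in outcomes)


def value_iteration(P, gamma=0.9, tolerance=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(expected_value(o, V, gamma) for o in P[s].values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tolerance:
            break
    # The greedy policy picks, in each state, the action with the best expected value
    policy = {s: max(P[s], key=lambda a: expected_value(P[s][a], V, gamma)) for s in P}
    return V, policy


V, policy = value_iteration(corridor_model())
print(V)
print(policy)
```

In this toy corridor the values grow toward the goal (roughly 14.3, 17 and 20) and the greedy policy is "right" everywhere.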


In [None]:
# Value iteration algorithm

NUM_ACTIONS = env.action_space.n
NUM_STATES = env.observation_space.n
V = np.zeros([NUM_STATES]) # Value for each state
Pi = np.zeros([NUM_STATES], dtype=int)  # policy, iteratively updated, to get the optimal policy
gamma = 0.9 # discount factor
significant_improvement = 0.01

def best_action_value(s):
    # finds the highest value action (max_a) in state s
    best_a = None
    best_value = float('-inf')

    # iterate through all possible actions to find the best current action
    for a in range (0, NUM_ACTIONS):
        env.env.s = s
        s_new, rew, done, info = env.step(a) #take the action
        v = rew + gamma * V[s_new]
        if v > best_value:
            best_value = v
            best_a = a
    return best_a

iteration = 0
while True:
    # biggest_change - delta
    delta = 0
    for s in range (0, NUM_STATES):
        old_v = V[s]
        action = best_action_value(s) # choose an action with the highest future reward
        env.env.s = s # goto the state
        s_new, reward, done, info = env.step(action) #take the action
        V[s] = reward + gamma * V[s_new] # update Value for the state using Bellman equation
        Pi[s] = action
        delta = max(delta, np.abs(old_v - V[s]))
    iteration += 1
    if delta < significant_improvement:
        print (iteration,' iterations done')
        break

In [None]:
# Review how the algorithm solves the taxi game
rew_tot=0
obs= env.reset()
env.render()
done=False
while done != True: 
    action = Pi[obs]
    obs, rew, done, info = env.step(action) # take step using selected action
    rew_tot = rew_tot + rew
    env.render()
# Print the reward of these actions
print("Reward: %r" % rew_tot)  

#### Model-based vs. model-free methods  

Value iteration solves the Taxi game. However, the algorithm only works if we know all environment states and transitions upfront. In reinforcement learning, this is referred to as a model-based method.   

If the states are not all known upfront, we can discover states and actions during learning. This is referred to as a model-free method.
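
One simple way to avoid enumerating every state in advance is to let the value table grow as states are encountered. The sketch below is only an illustration of that idea: the state labels are made up, and `NUM_ACTIONS = 6` is assumed to match Taxi.

```python
from collections import defaultdict

NUM_ACTIONS = 6  # assumed action count (Taxi has 6)

# Unseen states get a fresh row of zeros the first time they are looked up
Q = defaultdict(lambda: [0.0] * NUM_ACTIONS)


def greedy_action(state):
    row = Q[state]
    return row.index(max(row))  # index of the first maximum


Q["state-A"][3] = 1.5               # states can be any hashable object
print(greedy_action("state-A"))     # the action with the highest stored value
print(greedy_action("never-seen"))  # a new state: all zeros, so the first action
print(len(Q))                       # the table now holds two states
```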
 

#### Basic Q-learning algorithm    
In the [Q-learning](https://en.wikipedia.org/wiki/Q-learning) algorithm, the agent (Taxi) interacts with its environment to update its estimates of how valuable each action is, so it can learn an optimal policy.

The Q-matrix $Q(s,a)$ stores the current estimate of the maximum discounted future reward when the agent performs action $a$ in state $s$. $Q(s, a)$ therefore tells the agent how good each action $a$ is in state $s$. Upon convergence, the optimal policy $\pi^*$ can be read from the Q-matrix by picking the highest-valued action in each state.
 
After every step we update $Q(s,a)$ using the reward and the maximum Q-value of the new state resulting from the action. This update is done using the action-value form of the Bellman equation.   

$Q_{t+1}(s_t,a_t) = Q_{t}(s_t,a_t) + \alpha_t(s_t,a_t) \, [R_{t+1} + \gamma \max_a Q_t(s_{t+1},a) - Q_t(s_t,a_t)]$

Notes: 
- Q-learning is the basis for Deep Q-learning ("deep" referring to the use of a deep neural network to approximate $Q$).  
- [Temporal difference learning](https://en.wikipedia.org/wiki/Temporal_difference_learning) and [Sarsa](https://en.wikipedia.org/wiki/State%E2%80%93action%E2%80%93reward%E2%80%93state%E2%80%93action) algorithms use similar value expressions. 
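
To make the update rule concrete, here is a worked single-step sketch on a tiny hypothetical table (2 states x 2 actions). The numbers are illustrative; only the formula follows the equation above.

```python
def q_update(Q, s, a, reward, s_next, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q[s][a]."""
    td_target = reward + gamma * max(Q[s_next])   # R + gamma * max_a Q(s', a)
    Q[s][a] += alpha * (td_target - Q[s][a])      # move the estimate toward the target
    return Q[s][a]


Q = [[0.0, 0.0], [0.0, 0.0]]  # hypothetical table: 2 states x 2 actions

# First visit: a step penalty and nothing known about the next state
first = q_update(Q, s=0, a=1, reward=-1.0, s_next=1, alpha=0.9, gamma=0.9)
print(first)

# Later the next state turns out to be valuable, and the estimate follows
Q[1][1] = 10.0
second = q_update(Q, s=0, a=1, reward=-1.0, s_next=1, alpha=0.9, gamma=0.9)
print(second)
```

The first update gives $0 + 0.9 \cdot (-1 + 0.9 \cdot 0 - 0) = -0.9$. The second gives $-0.9 + 0.9 \cdot (-1 + 0.9 \cdot 10 + 0.9) \approx 7.11$.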


In [None]:
NUM_ACTIONS = env.action_space.n
NUM_STATES = env.observation_space.n
Q = np.zeros([NUM_STATES, NUM_ACTIONS])  # could be made dynamic if the game states are not all known upfront
gamma = 0.9  # discount factor
alpha = 0.9  # learning rate
for episode in range(1, 1001):
    done = False
    reward_total = 0
    obs = env.reset()
    while not done:
        action = np.argmax(Q[obs])  # choose the action with the highest Q-value
        obs2, reward, done, info = env.step(action)  # take the action
        # update the Q-matrix using the Bellman equation
        Q[obs, action] += alpha * (reward + gamma * np.max(Q[obs2]) - Q[obs, action])
        # with learning rate 1 this reduces to: Q[obs, action] = reward + gamma * np.max(Q[obs2])
        reward_total = reward_total + reward
        obs = obs2
    if episode % 50 == 0:
        print('Episode {} Total Reward: {}'.format(episode, reward_total))

So what is the magic? How does the algorithm solve the game? 

The Q-matrix is initialized with zeros, so the agent's first moves are uninformed: it wanders until it hits a state/action pair with a reward or a penalty. To build intuition, simplify the problem so that the agent only needs to reach a certain drop-off position to get a reward. Uninformed moves earn nothing, but after enough tries (brute force) the agent finds the state/action pair that pays the reward. In the next game, the Q-matrix steers the actions immediately preceding that state/action pair toward it; in the game after that, the actions one step earlier; and so on. In other words, the algorithm solves "the puzzle" backwards, iteratively, from the end result (dropping off the passenger) toward the steps needed to get there.  

Note that the Taxi game gives a reward of -1 for each action. If the algorithm tries an action in some state, e.g. south, and it leads to nothing of value, that Q-matrix entry becomes negative (-0.9 with the learning rate of 0.9 used here). Because all entries were initialized to 0, on the next visit the greedy choice is an action that has not been tried yet and is still at 0. So, by design, the reward structure encourages systematic exploration of states and actions. 

If you set the learning rate to 1, the game is also solved. The reason is that the Taxi environment is deterministic, so every observed reward and transition is exact and can safely overwrite the old estimate. The learning rate controls how strongly each new sample replaces the old estimate, which matters when rewards or transitions are noisy; the balance between shorter-term and longer-term rewards is set by the discount factor $\gamma$.
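
The backward propagation described above can be watched directly in a stripped-down setting. The sketch below assumes a hypothetical five-cell corridor with a single "forward" action, a reward of 1 on reaching the last cell and no step penalty; it is a simplification for illustration, not the Taxi game.

```python
def run_corridor_episodes(n_states=5, episodes=4, alpha=1.0, gamma=0.9):
    """Walk the corridor front to back each episode and record Q after every episode."""
    Q = [0.0] * (n_states - 1)  # one action per non-terminal state
    history = []
    for _ in range(episodes):
        for s in range(n_states - 1):
            s_next = s + 1
            terminal = s_next == n_states - 1
            reward = 1.0 if terminal else 0.0
            future = 0.0 if terminal else Q[s_next]
            Q[s] += alpha * (reward + gamma * future - Q[s])  # Q-learning update
        history.append(list(Q))
    return history


for episode, q in enumerate(run_corridor_episodes(), start=1):
    print(episode, [round(v, 3) for v in q])
```

After the first episode only the state next to the goal has a non-zero value; each further episode pushes the information one state further back, discounted by $\gamma$ each time.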


In [None]:
# Let's see how the algorithm solves the taxi game by following the policy to take actions delivering max value

rew_tot=0
obs= env.reset()
env.render()
done=False
while done != True: 
    action = np.argmax(Q[obs])
    obs, rew, done, info = env.step(action) #take step using selected action
    rew_tot = rew_tot + rew
    env.render()
#Print the reward of these actions
print("Reward: %r" % rew_tot)  

#### Exploration vs. exploitation

The Taxi game operates in a deterministic environment with one rewarding terminal state: drop off the passenger, receive +20. 

So far, our algorithm *exploits* 100% of the time: `action = np.argmax(Q[obs])`. To deal with more complex environments, we need to update our algorithm to also explore. This is called the tradeoff between "exploitation" and "exploration".
* Exploitation: Make the best decision given current information (Go to the restaurant you know you like)
* Exploration: Gather more information (Try a new restaurant)

Approaches:  

Epsilon Greedy  
* Exploit with probability $(1 - \epsilon)$ and explore with probability $\epsilon$; the rates of exploration and exploitation are fixed.

Epsilon-Decreasing  
* Epsilon Greedy with epsilon decreasing over time. 

Thompson sampling  
* The rates of exploration and exploitation are dynamically updated with respect to the entire probability distribution.   

Epsilon-Decreasing with Softmax  
* Epsilon-Decreasing, except that when exploring we don't pick an option uniformly at random; instead we estimate the outcome of each option and then pick based on those estimates (this is the softmax part).
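
Three of these selection rules can be sketched in a few lines of plain Python. The function names, the exponential decay schedule and its constants are illustrative assumptions, not a prescribed recipe; Thompson sampling is left out because it needs a probability model of the rewards.

```python
import math
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the best-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                  # explore
    return max(range(len(q_row)), key=q_row.__getitem__)  # exploit


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.001):
    """Epsilon-decreasing: an exponential schedule from start toward end."""
    return end + (start - end) * math.exp(-decay * episode)


def softmax_probabilities(q_row, temperature=1.0):
    """Turn value estimates into selection probabilities."""
    top = max(q_row)  # subtract the maximum for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_action(q_row, temperature=1.0, rng=random):
    """Sample an action in proportion to its softmax probability."""
    probs = softmax_probabilities(q_row, temperature)
    return rng.choices(range(len(q_row)), weights=probs)[0]


q_row = [0.1, 0.7, 0.2]
print(epsilon_greedy(q_row, epsilon=0.0))  # pure exploitation picks index 1
print(decayed_epsilon(0), decayed_epsilon(5000))
print(softmax_probabilities(q_row))
```

A lower `temperature` makes the softmax choice greedier, while a higher one makes it closer to uniform random exploration.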


#### [Frozen lakes](https://gym.openai.com/envs/FrozenLake-v0/) of OpenAI/Gym.  

Frozen Lake provides a simple non-deterministic environment.

Description: "Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend."  

Notice that the game is no longer deterministic: you "won't always move in the direction you intend". The ice is really slippery, so the chance that you move in the direction you want is relatively small.

S- Start  
G - Goal  
F- Frozen (safe)  
H- Hole (dead)  

Game layout:

In [None]:
env = gym.make('FrozenLake-v0')
rew_tot=0
obs= env.reset()
env.render()


In [None]:
env = gym.make('FrozenLake-v0')
env.reset()
NUM_ACTIONS = env.action_space.n
NUM_STATES = env.observation_space.n
Q = np.zeros([NUM_STATES, NUM_ACTIONS])  # could be made dynamic if the game states are not all known upfront
gamma = 0.95  # discount factor
alpha = 0.01  # learning rate
epsilon = 0.1  # exploration probability
for episode in range(1, 500001):
    done = False
    obs = env.reset()
    while not done:
        if np.random.rand() < epsilon:
            # exploration: try a random action with probability epsilon (the epsilon-greedy approach)
            action = env.action_space.sample()
        else:
            # exploitation
            action = np.argmax(Q[obs])
        obs2, reward, done, info = env.step(action)  # take the action
        # update the Q-matrix using the Bellman equation
        Q[obs, action] += alpha * (reward + gamma * np.max(Q[obs2]) - Q[obs, action])
        obs = obs2

    if episode % 5000 == 0:
        # report every 5000 episodes: play 100 greedy test games and average the score to check whether the game is solved
        rew_average = 0.
        for i in range(100):
            obs = env.reset()
            done = False
            while not done:
                action = np.argmax(Q[obs])
                obs, rew, done, info = env.step(action)  # take a step using the selected action
                rew_average += rew
        rew_average = rew_average / 100
        print('Episode {} average reward: {}'.format(episode, rew_average))

        if rew_average > 0.8:
            # FrozenLake-v0 defines "solving" as getting average reward of 0.78 over 100 consecutive trials.
            # Test against 0.8 so that it is not a one-off lucky shot
            print("Frozen lake solved")
            break
 

In [None]:
# Let's see how the algorithm solves the frozen-lakes game

rew_tot=0.
obs= env.reset()
done=False
while done != True: 
    action = np.argmax(Q[obs])
    obs, rew, done, info = env.step(action) #take step using selected action
    rew_tot += rew
    env.render()

print("Reward:", rew_tot)  

It appears that if you try to move right there is a significant chance you move up or down instead, and if you try to move up there is a significant chance you move left or right, etc. So the algorithm learned that if you are on the frozen tile in the left column, second row, and you want to move down, it is risky to give the down command because you could slip to the right into the hole. Instead it gives the left command, because that will keep you on the tile or move you up or down, but not to the right.  
In other words, the algorithm learned to take the actions with the least risk of accidentally slipping into a hole. It is also interesting to see that it learned to go left as its first move, which avoids moving right onto the more dangerous path.  

Note: a 100% score is not possible. By consistently moving away from holes you can safely traverse all tiles except one (second row, third column), from which you can slide into a hole because of the slippery ice.  
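
These observations can be checked with a few lines of arithmetic. The sketch below hard-codes the default 4x4 layout and assumes a slip model in which the agent moves in the intended direction or in either perpendicular direction, each with probability 1/3; both the layout and the slip model are assumptions about FrozenLake-v0 restated here, not values read from Gym at run time.

```python
from fractions import Fraction

LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]  # assumed default 4x4 layout
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def move(row, col, direction):
    """Move one tile; the edges of the lake clamp the position."""
    d_row, d_col = MOVES[direction]
    return min(max(row + d_row, 0), 3), min(max(col + d_col, 0), 3)


def hole_probability(row, col, action):
    """Chance of landing in a hole on the next step under the assumed slip model."""
    risk = Fraction(0)
    for actual in ((action - 1) % 4, action, (action + 1) % 4):
        r, c = move(row, col, actual)
        if LAKE[r][c] == "H":
            risk += Fraction(1, 3)
    return risk


# Second row, left column: only the "left" command carries no risk
print({name: hole_probability(1, 0, a)
       for name, a in (("left", LEFT), ("down", DOWN), ("right", RIGHT), ("up", UP))})

# Second row, third column: every command carries some risk
print(min(hole_probability(1, 2, a) for a in (LEFT, DOWN, RIGHT, UP)))
```

Under these assumptions, "left" is the only risk-free command on the second-row, left-column tile, and the tile in the second row, third column has no risk-free command at all, which is consistent with a perfect score being out of reach.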

It is also worth noting that the algorithm needs tens of thousands of iterations to find the optimal policy, even though this is only a 4 by 4 playing field.

#### **ON YOUR OWN:**
    
Explore the OpenAI Gym environments: https://gym.openai.com/envs
        
Implement Q-Learning for at least one environment of your choice. It can be a simple environment. Avoid the highly graphical Atari environments; we will cover those in the next tutorial on Deep Q-Learning.