<a href="https://colab.research.google.com/github/elsa9421/Interactive-IPython-Demos/blob/main/Reinforcement_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates Reinforcement Learning by solving a Markov Decision Process (MDP) using the Bellman equation and dynamic programming.

<br>The Bellman equation is the basic building block of Reinforcement Learning. We can solve it using a technique called dynamic programming.
<br>Two widely used dynamic programming algorithms are:
- Value Iteration
- Policy Iteration

<br>This notebook solves the Taxi game and the Frozen Lake problem using Value Iteration.





## References/ Links

- [Learn by Example Reinforcement Learning Taxi Game](https://www.kaggle.com/charel/learn-by-example-reinforcement-learning-with-gym#The-Taxi-game)
- [Introduction to Reinforcement Learning and OpenAI Gym](https://www.oreilly.com/radar/introduction-to-reinforcement-learning-and-openai-gym/)
- [Bellman Equation and Dynamic Programming](https://medium.com/analytics-vidhya/bellman-equation-and-dynamic-programming-773ce67fc6a7)
- [Solving FrozenLake](https://medium.com/analytics-vidhya/solving-the-frozenlake-environment-from-openai-gym-using-value-iteration-5a078dffe438)
- [FrozenLake8x8-v0](https://gym.openai.com/envs/FrozenLake8x8-v0/)

## Basics 

### Gym
Gym is a toolkit for developing and comparing reinforcement learning algorithms, released by OpenAI in 2016. [Read more](http://gym.openai.com/docs/).

In [None]:
## Import

import gym # openAi gym
from gym import envs
import numpy as np 
import matplotlib.pyplot as plt


### Example 1: Taxi Game

<br> Problem Description :


1. **Rules**: 
* There are four designated locations in the grid world, indicated by R(ed), B(lue), G(reen) and Y(ellow).
* When the episode starts, the taxi starts off at a random square and the passenger is at a random location.
* The taxi drives to the passenger's location, picks up the passenger, drives to the passenger's destination (another one of the four specified locations), and then drops off the passenger. 
* Once the passenger is dropped off, the episode ends. 
* The taxi cannot pass through a wall.

2. **Actions** `a`: 
There are 6 discrete deterministic actions:
>>- 0: move south
  - 1: move north
  - 2: move east 
  - 3: move west 
  - 4: pickup passenger
  - 5: dropoff passenger
3. **Rewards** : 
>>- There is a reward of -1 for each action 
  - An additional reward of +20 for delivering the passenger. 
  - There is a reward of -10 for executing actions "pickup" and "dropoff" illegally.

4. **Illustration description/Rendering**:
>>- blue: passenger
  - magenta: destination
  - yellow: empty taxi
  - green: full taxi
  - other letters: locations




In [None]:
env = gym.make('Taxi-v3')
env.reset()
env.render()

+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+



## Interacting with the Gym Environment


<br> At each timestep, the agent chooses an action, and the environment returns an observation and a reward.

<br> `observation, reward, done, info = env.step(action)`
where, 
<br>`observation` (object): an environment-specific object representing your observation of the environment. 
<br>`reward` (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
<br>`done` (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. 
<br>`info` (dict): diagnostic information useful for debugging; it can be ignored here. Official evaluations of your agent are not allowed to use this for learning.
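To make the `step` contract concrete without depending on Gym, here is a minimal sketch of a toy environment that follows the same four-value return convention. `CorridorEnv`, `run_episode` and the reward values are hypothetical names and numbers chosen for illustration; they are not part of Gym. Newer Gym releases and Gymnasium return five values from `step` (with `done` split into `terminated` and `truncated`); this sketch follows the four-value form used in this notebook.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Return the initial observation, as env.reset() does in this notebook.
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clamp the move.
        delta = 1 if action == 1 else -1
        self.position = min(max(self.position + delta, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 10.0 if done else -1.0  # illustrative reward values
        return self.position, reward, done, {}


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the accumulated (undiscounted) reward."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done, info = env.step(policy(obs))
        total += reward
        if done:
            break
    return total


def always_right(obs):
    return 1


total_reward = run_episode(CorridorEnv(length=5), always_right)
print("Total reward:", total_reward)
```

The loop has the same shape as the Gym interaction loop: choose an action from the observation, call `step`, accumulate the reward, and stop when `done` is true.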


In [None]:
# Take a few random steps first to see what the game looks like

reward_tot=0
obs= env.reset()
env.render()
for _ in range(3):
    action = env.action_space.sample() #take step using random action from possible actions (action_space)
    obs, rew, done, info = env.step(action) 
    reward_tot = reward_tot + rew
    env.render()
# Print the total reward of these random actions
print("Reward: %r" % reward_tot)

+---------+
|[34;1mR[0m: | : :[35mG[0m|
|[43m [0m: | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

+---------+
|[34;1mR[0m: | : :[35mG[0m|
| : | : : |
|[43m [0m: : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|[34;1mR[0m: | : :[35mG[0m|
| : | : : |
| : : : : |
|[43m [0m| : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|[34;1mR[0m: | : :[35mG[0m|
| : | : : |
| : : : : |
|[43m [0m| : | : |
|Y| : |B: |
+---------+
  (Pickup)
Reward: -12


## Action Space for the Taxi Game :
The action space has 6 possible actions. The meaning of each action is useful for us humans to know, but the agent does not need it: it only works with action indices and the rewards they produce.

In [None]:
print(env.action_space)
NUM_ACTIONS = env.action_space.n
print("Possible actions: [0..%a]" % (NUM_ACTIONS-1))

Discrete(6)
Possible actions: [0..5]


## State :

The state represents the board state of the game; Gym returns it as the `observation`. It is a numeric representation of what the agent observes in the environment at a particular moment.
<br> In Taxi the observation is an integer. There are 500 possible states, and the `render` function translates each one into a readable graphic.

In [None]:
print(env.observation_space)
print()
env.env.s=42 # some random number, you might recognize it
env.render()
env.env.s = 222 # and some other
env.render()

Discrete(500)

+---------+
|[34;1mR[0m: |[43m [0m: :G|
| : | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (Pickup)
+---------+
|[34;1mR[0m: | : :G|
| : | : : |
| :[43m [0m: : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (Pickup)
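The integer observation packs four pieces of information: taxi row, taxi column, passenger location and destination. The sketch below decodes it under the assumption that Taxi-v3 uses a mixed-radix layout with sizes 5 x 5 x 5 x 4 = 500, where location indices 0..3 stand for R, G, Y, B and passenger index 4 means "in the taxi". The helper names `encode_state` and `decode_state` are ours, not Gym's.

```python
LOCATIONS = ["R", "G", "Y", "B"]  # assumed index order of the four landmarks


def encode_state(taxi_row, taxi_col, passenger, destination):
    """Pack the four components into one integer in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination


def decode_state(state):
    """Inverse of encode_state: peel off the components from the right."""
    state, destination = divmod(state, 4)
    state, passenger = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger, destination


decoded_42 = decode_state(42)
decoded_222 = decode_state(222)
print(decoded_42, decoded_222)
```

Under this assumption, state 42 decodes to taxi row 0, column 2, passenger at R and destination Y, and state 222 to taxi row 2, column 1 with the same passenger and destination, which agrees with the two renderings above.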


## Policy(π): 

A policy is the strategy that the agent employs to determine the next action `a` in state `s`. In general it is a probability distribution over actions given the state, written $\pi(a|s)$.

In this case, it is a deterministic policy:
$a=\pi(s)$

Note that the definition does not say whether a policy is good or bad; it is just a policy. The policy is normally denoted by the Greek letter π. The optimal policy (π*) is the policy that maximizes the expected reward.

<br> We use the Bellman Equation to find the optimal policy.
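The difference between a stochastic policy $\pi(a|s)$ and a deterministic policy $a=\pi(s)$ can be sketched with two small arrays. The three-state, two-action numbers below are made up for illustration, and `sample_action` is a hypothetical helper.

```python
import numpy as np

# Stochastic policy: row s holds the distribution pi(a|s) over 2 actions.
pi_stochastic = np.array([
    [0.5, 0.5],
    [0.9, 0.1],
    [0.0, 1.0],
])

# Deterministic policy: entry s holds the single action a = pi(s).
pi_deterministic = np.array([0, 0, 1])


def sample_action(pi, s, rng):
    """Draw an action index from the distribution pi(.|s)."""
    return int(rng.choice(len(pi[s]), p=pi[s]))


rng = np.random.default_rng(0)
sampled = [sample_action(pi_stochastic, 1, rng) for _ in range(1000)]

# Taking the most probable action in each state yields a deterministic policy.
greedy = pi_stochastic.argmax(axis=1)
print(greedy, sampled.count(0) / len(sampled))
```

A deterministic policy is the special case in which each row of the table puts all of its probability on one action, which is why the notebook can store `Pi` as a plain integer array.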

## Bellman Equation 

<br> The Bellman equation is the basic building block of reinforcement learning and is omnipresent in RL. It helps us solve an MDP, where solving means finding the optimal policy and value function.

> A) For Deterministic Environment :
  <br> $V^*(s)= \underset{a}{\operatorname{max}} \{ R(s,a) + \gamma V^*(s') \}$


> B) For Stochastic Environment :
   $V^*(s)= \underset{a}{\operatorname{max}} \sum_{s'} P(s'|s,a) \{ R(s,a,s') + \gamma V^*(s') \}$

Note: the optimal policy is
   $\pi^*(s)= \underset{a}{\operatorname{argmax}} \sum_{s'} P(s'|s,a) \{ R(s,a,s') + \gamma V^*(s') \}$


where,
<br> $V(s)$ : the value of being in state `s`; $V^*(s)$ is the optimal value function, the one that yields the maximum value.
<br> $V^*(s')$ : the optimal value of being in the next state `s'` after taking action `a`.
<br> $R(s,a)$ : the reward obtained on taking action `a` in state `s`; $R(s,a,s')$ is the reward obtained on taking action `a` in state `s` and reaching state `s'`.
<br> $P(s'|s,a)$ : the probability of going to state `s'` given that action `a` is performed in state `s`.
<br> The Taxi game actions are deterministic (there is no such thing as "if I want to go north, there is an 80% chance of going north, a 10% chance of going west and a 10% chance of going east"), so the probability that the selected action leads to the expected state is 100%. For this game $P$ can be ignored, since it is always 1.
<br> $\gamma$ : the discount factor, between 0 and 1. It indicates the importance given to future rewards: the higher gamma is, the greater the focus on long-term rewards.
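To see how the discount factor weighs future rewards, here is a small sketch that computes the discounted return $r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$ of a reward sequence. The function name `discounted_return` and the three-step reward sequence are illustrative choices, not part of the notebook's code.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Two ordinary steps followed by a large final reward (made-up numbers).
rewards = [-1, -1, 20]
for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With `gamma = 0` only the immediate reward counts, while with `gamma = 1` all rewards count equally; the notebook's choice of 0.9 sits in between and still values the final reward highly.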






### Value Iteration Algorithm 

1. Start with $V_0^*(s) = 0$ for all $s$.
2. While not converged:
> For each state $s$, given $V_i^*$, calculate $V_{i+1}^*$:
$V_{i+1}^*(s)= \underset{a}{\operatorname{max}} \sum_{s'} P(s'|s,a) \{ R(s,a,s') + \gamma V_i^*(s') \}$
<br> This is called a value update, or a Bellman update/back-up.
3. After value iteration, we still need to extract the optimal policy to obtain the optimal action in each state:
$\pi^*(s)= \underset{a}{\operatorname{argmax}} \sum_{s'} P(s'|s,a) \{ R(s,a,s') + \gamma V^*(s') \}$
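The three steps above can be sketched on a tiny hand-built MDP before applying them to a Gym environment. Everything here is an assumption made for illustration: the function name `value_iteration`, the dense arrays `P[s, a, s']` and `R[s, a, s']`, and the three-state chain in which "advance" succeeds with probability 0.8. Unlike the loop used in this notebook, which overwrites `V` state by state, this sketch updates all states at once from the previous iterate.

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """Return (V, policy) for transition array P and reward array R."""
    n_states = P.shape[0]
    V = np.zeros(n_states)  # step 1: V_0(s) = 0 for all s
    while True:
        # Q[s, a] = sum_s' P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V)
        V_new = Q.max(axis=1)  # step 2: Bellman update
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # step 3: greedy policy
        V = V_new


# Toy chain 0 -> 1 -> 2, where state 2 is absorbing.
STAY, ADVANCE = 0, 1
P = np.zeros((3, 2, 3))
R = np.zeros((3, 2, 3))
for s in (0, 1):
    P[s, STAY, s] = 1.0
    P[s, ADVANCE, s + 1] = 0.8  # advance succeeds
    P[s, ADVANCE, s] = 0.2      # advance fails, stay put
P[2, :, 2] = 1.0
R[1, ADVANCE, 2] = 1.0          # reward only for reaching state 2

V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```

For this chain the fixed point can be checked by hand: $V^*(1) = 0.8 / 0.82$ and $V^*(0) = 0.72\,V^*(1) / 0.82$, with "advance" optimal in both non-terminal states.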


In [None]:
# Value iteration algorithm
NUM_ACTIONS = env.action_space.n
NUM_STATES = env.observation_space.n
V = np.zeros([NUM_STATES]) # The Value for each state
Pi = np.zeros([NUM_STATES], dtype=int)  # Our policy, which we keep updating to get the optimal policy
gamma = 0.9 # discount factor
significant_improvement = 0.01 # convergence threshold: stop when no value changes by more than this

def best_action_value(s):
    # finds the highest value action (max_a) in state s
    best_a = None
    best_value = float('-inf')

    # loop through all possible actions to find the best current action
    for a in range (0, NUM_ACTIONS):
        env.env.s = s
        s_new, rew, done, info = env.step(a) #take the action
        v = rew + gamma * V[s_new]
        if v > best_value:
            best_value = v
            best_a = a
    return best_a

iteration = 0
while True:

    biggest_change = 0
    for s in range (0, NUM_STATES):
        old_v = V[s]
        action = best_action_value(s) #choosing an action with the highest future reward
        env.env.s = s # goto the state
        s_new, rew, done, info = env.step(action) #take the action
        V[s] = rew + gamma * V[s_new] #Update Value for the state using Bellman equation
        Pi[s] = action
        biggest_change = max(biggest_change, np.abs(old_v - V[s]))
    iteration += 1
    if biggest_change < significant_improvement:
        print (iteration,' iterations done')
        break

41  iterations done


## Solution to Taxi Game :

In [None]:

rew_tot=0
obs= env.reset()
env.render()
done=False
while not done:
    action = Pi[obs]
    obs, rew, done, info = env.step(action) #take step using selected action
    rew_tot = rew_tot + rew
    env.render()
#Print the reward of these actions
print("Reward: %r" % rew_tot)  

+---------+
|R: | : :[34;1mG[0m|
| : | : : |
| : : : : |
| | :[43m [0m| : |
|[35mY[0m| : |B: |
+---------+

+---------+
|R: | : :[34;1mG[0m|
| : | : : |
| : :[43m [0m: : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (North)
+---------+
|R: | : :[34;1mG[0m|
| : |[43m [0m: : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (North)
+---------+
|R: |[43m [0m: :[34;1mG[0m|
| : | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (North)
+---------+
|R: | :[43m [0m:[34;1mG[0m|
| : | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (East)
+---------+
|R: | : :[34;1m[43mG[0m[0m|
| : | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (East)
+---------+
|R: | : :[42mG[0m|
| : | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :G|
| : | : :[42m_[0m|
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : | : : |
| : : : :[4

## Example 2: FrozenLake4x4

* There are 16 states in the game. 
* The agent starts from S (S for Start) and our goal is to get to G (Goal)
* F means Frozen Surface. You can walk on them. 
* But H means Hole. If you fall into a hole, the episode ends and you start from S again.
* Since this is a “Frozen” Lake, the surface is slippery: if you choose a certain direction, there is only a 1/3 (about 33.3%) chance that the agent will really go in that direction. The movement of the agent is uncertain and only partially depends on the chosen direction. 

Use the stochastic form of the Bellman equation to solve this.
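The slippery movement can be sketched as a small transition model. This is a minimal sketch under stated assumptions: actions are numbered 0 = Left, 1 = Down, 2 = Right, 3 = Up, states are numbered row by row on the 4x4 grid, and a chosen action results in the intended direction or one of the two perpendicular directions with probability 1/3 each. The helpers `move` and `slippery_transitions` are hypothetical, and holes, the goal and rewards are left out.

```python
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # assumed action numbering
GRID_SIZE = 4


def move(state, action, n=GRID_SIZE):
    """Deterministic move on an n x n grid; walls keep the agent in place."""
    row, col = divmod(state, n)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, n - 1)
    elif action == RIGHT:
        col = min(col + 1, n - 1)
    elif action == UP:
        row = max(row - 1, 0)
    return row * n + col


def slippery_transitions(state, action):
    """List of (probability, next_state) for a non-terminal state."""
    # The agent ends up going in the intended direction or a perpendicular one.
    outcomes = [(action - 1) % 4, action, (action + 1) % 4]
    return [(1.0 / 3.0, move(state, direction)) for direction in outcomes]


for a in (LEFT, DOWN, RIGHT, UP):
    print(a, slippery_transitions(0, a))
```

From the start state in the top-left corner, several outcomes bump into a wall and leave the agent in state 0, which is why the same next state can appear more than once in a transition list.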

In [None]:
env=gym.make('FrozenLake-v0')
env.render()
print("Number of Actions in Action Space A : ",env.action_space)
print("Number of States in States Space S :",env.observation_space)
#OR
print(env.nS)
print(env.nA)

len(env.P[0][1])



[41mS[0mFFF
FHFH
FFFH
HFFG
Number of Actions in Action Space A :  Discrete(4)
Number of States in States Space S : Discrete(16)
16
4


3

`env.P` holds the transition model of the environment.
<br> `env.P[state]` : for example, `env.P[0]` outputs a dictionary, as shown in the code below.
<br>Here 0 in `env.P[0]` is the first state of the environment.<br> The keys of the dictionary, 0, 1, 2 and 3, are the actions we can take from state 0. Each action maps to a list in which every element is a tuple containing the `probability` of the transition, the `next state`, the `reward`, and a `done` flag (`done`=True if the next state is a Hole or the Goal, otherwise `done`=False).


In [None]:
print(env.P[0])

{0: [(0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 4, 0.0, False)], 1: [(0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False)], 2: [(0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False), (0.3333333333333333, 0, 0.0, False)], 3: [(0.3333333333333333, 1, 0.0, False), (0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 0, 0.0, False)]}
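Given a transition list in this format, the expected value of an action is the probability-weighted sum used in the stochastic Bellman equation. The sketch below copies the transitions printed above for state 0 (writing `1/3` for `0.3333333333333333`) and evaluates each action against a made-up value table `V_guess`; the helper name `action_value` and the numbers in `V_guess` are assumptions for illustration only.

```python
def action_value(transitions, V, gamma=0.9):
    """sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    return sum(
        prob * (reward + gamma * V[next_state])
        for prob, next_state, reward, done in transitions
    )


# Transitions for state 0, as printed by env.P[0].
P0 = {
    0: [(1 / 3, 0, 0.0, False), (1 / 3, 0, 0.0, False), (1 / 3, 4, 0.0, False)],
    1: [(1 / 3, 0, 0.0, False), (1 / 3, 4, 0.0, False), (1 / 3, 1, 0.0, False)],
    2: [(1 / 3, 4, 0.0, False), (1 / 3, 1, 0.0, False), (1 / 3, 0, 0.0, False)],
    3: [(1 / 3, 1, 0.0, False), (1 / 3, 0, 0.0, False), (1 / 3, 0, 0.0, False)],
}

# Hypothetical current value estimates for the 16 states.
V_guess = [0.0] * 16
V_guess[1] = 0.6
V_guess[4] = 0.3

q_values = {a: action_value(P0[a], V_guess) for a in P0}
print(q_values)
```

Actions 1 and 2 reach the same set of next states from state 0, so they always receive the same value there, whatever the value table is.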


In [None]:
env = gym.make('FrozenLake-v0')
env.render()



[41mS[0mFFF
FHFH
FFFH
HFFG


In [None]:
# Value iteration algorithm
NUM_ACTIONS = env.nA
NUM_STATES = env.nS
V = np.zeros([NUM_STATES]) # The Value for each state
Pi = np.zeros([NUM_STATES], dtype=int)  # Our policy, which we keep updating to get the optimal policy
gamma = 0.9 # discount factor
significant_improvement = 0.01 # convergence threshold: stop when no value changes by more than this

def best_action_value(s):
    # finds the highest value action (max_a) in state s
    action_values = []

    # loop through all possible actions and compute the expected value of each
    for a in range(NUM_ACTIONS):
        state_value = 0
        for prob, s_new, rew, done in env.P[s][a]:
            # probability-weighted reward plus discounted value of the next state
            state_value += prob * (rew + gamma * V[s_new])
        action_values.append(state_value)  # the value of each action

    # pick the best action once all action values are known
    best_action = np.argmax(np.asarray(action_values))
    return best_action

iteration = 0
while True:

    biggest_change = 0
    for s in range (0, NUM_STATES):
      old_v = V[s]
      action = best_action_value(s) #choosing an action with the highest future reward
      action_value = 0
      for prob, s_new, rew, done in env.P[s][action]:
        action_value += prob * (rew + gamma * V[s_new]) # Update value for the state using the Bellman equation
      V[s] = action_value
      Pi[s] = action
      biggest_change = max(biggest_change, np.abs(old_v - V[s]))
    iteration += 1
    if biggest_change < significant_improvement:
        print (iteration,' iterations done')
        break

10  iterations done


In [None]:

rew_tot=0
obs= env.reset()
env.render()
done=False
while not done:
    action = Pi[obs]
    obs, rew, done, info = env.step(action) #take step using selected action
    rew_tot = rew_tot + rew
    env.render()
#Print the reward of these actions
print("Reward: %r" % rew_tot)  


[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
Reward: 1.0
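Because the environment is stochastic, a single rollout says little about how good a policy is; averaging over many episodes gives a steadier estimate. The sketch below samples episodes directly from a transition table in the same `(probability, next_state, reward, done)` format as `env.P`. The helpers `rollout` and `average_return` and the two-outcome table `toy_P` are hypothetical stand-ins, not the FrozenLake model.

```python
import random


def rollout(P, policy, start=0, max_steps=100, rng=random):
    """Sample one episode from table P and return its total reward."""
    state, total = start, 0.0
    for _ in range(max_steps):
        transitions = P[state][policy[state]]
        weights = [t[0] for t in transitions]
        _, state, reward, done = rng.choices(transitions, weights=weights, k=1)[0]
        total += reward
        if done:
            break
    return total


def average_return(P, policy, episodes=1000, seed=0):
    """Monte Carlo estimate of the expected total reward of a policy."""
    rng = random.Random(seed)
    return sum(rollout(P, policy, rng=rng) for _ in range(episodes)) / episodes


# Toy table: from state 0 the only action reaches the goal (state 1, reward 1)
# or a hole (state 2, reward 0) with equal probability.
toy_P = {
    0: {0: [(0.5, 1, 1.0, True), (0.5, 2, 0.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)]},
}
toy_policy = [0, 0, 0]

avg = average_return(toy_P, toy_policy)
print("Average return:", avg)
```

With a reward of 1 only at the goal, the average return is also the fraction of episodes that reach the goal, so the same loop applied to a real transition table would estimate a policy's success rate.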
