<a href="https://colab.research.google.com/github/asokraju/Rienforcement-learning/blob/master/Expected-SARSA/FW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# *On policy* and *Off policy* Learning


In this document we implement the following algorithms and distinguish between their on-policy and off-policy variants.

1.   SARSA: an *on-policy* method. The $Q(s,a)$ function is learned from actions $a$ and $a'$ that are both sampled from the current policy $\pi$. The update rule is:

$$Q(s,a) \leftarrow Q(s,a)+\alpha (r+\gamma Q(s',a')-Q(s,a))$$ 

2.   Q-Learning: an *off-policy* method. The $Q(s,a)$ function can be learned from actions generated by a different policy (for example, random actions). Here $a$ is sampled from $\pi$, while $a'$ is chosen greedily with respect to $Q$.
$$Q(s,a) \leftarrow Q(s,a)+\alpha (r+\gamma \max_{a'}Q(s',a')-Q(s,a))$$ 
*Note*: The distinction disappears when $\pi$ is a greedy policy. See [this](https://stats.stackexchange.com/questions/184657/what-is-the-difference-between-off-policy-and-on-policy-learning). Off-policy methods do not need to follow the policy they learn about: the optimal action-value function can still be found from data generated by another policy, provided that policy explores sufficiently. However, it is common to generate the data with a *greedy* (deterministic) or $\epsilon$-*greedy* (stochastic) policy $\pi$, which gives rise to confusion.

3. Expected SARSA: Both earlier methods have high variance because neither $Q(s',a')$ nor $\max_{a'}Q(s',a')$ is an accurate estimate of the future action value. Expected SARSA marginalises out $a'$, which reduces the variance. 
$$Q(s,a) \leftarrow Q(s,a)+\alpha (r+\gamma\int_{a'} Q(s',a')d\tilde\pi(a'|s')-Q(s,a))$$ 

where $a$ is sampled from the policy $\pi$. See [this](https://ai.stackexchange.com/questions/10798/expected-sarsa-vs-sarsa-in-rl-an-introduction#). Note that $\tilde \pi$ can be any known stochastic policy. When $\tilde \pi = \pi$, Expected SARSA is an on-policy method; otherwise it is an off-policy method.
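
For a finite action set, as in the environment used later, the integral in the Expected SARSA update is a sum. As a worked case, assume $\tilde\pi$ is $\epsilon$-greedy with respect to $Q$ and that the greedy action in $s'$ is unique. Both are assumptions made for illustration, and $|\mathcal{A}|$ (the number of actions) is notation introduced here. Every action receives probability $\epsilon/|\mathcal{A}|$ and the greedy action receives an additional $1-\epsilon$, so

$$\sum_{a'}\tilde\pi(a'|s')\,Q(s',a') = \frac{\epsilon}{|\mathcal{A}|}\sum_{a'}Q(s',a') + (1-\epsilon)\max_{a'}Q(s',a')$$

Setting $\epsilon = 0$ leaves only the $\max$ term, which is the Q-learning target; setting $\epsilon = 1$ gives the plain average of $Q(s',\cdot)$ over all actions.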

In summary, the difference between *on-policy* and *off-policy* methods lies in how the TD error $\delta$ is computed:
$$\delta = r+\gamma Q(s',a')-Q(s,a)$$
where $a'$ sampled from $\pi$ $\implies$ *on-policy*, and $a'$ not sampled from $\pi$ $\implies$ *off-policy*.
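
The three TD errors can be compared side by side on a small table. The following is a minimal tabular sketch, not the network-based code used later in this notebook. The helper `epsilon_greedy_probs`, the 2-state table `Q`, and the sample transition are all hypothetical values chosen for illustration.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy for a single state."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

# Hypothetical action-value table: 2 states x 3 actions.
Q = np.array([[0.0, 0.5, 0.2],
              [1.0, 0.0, 4.0]])

# One hypothetical transition (s, a, r, s').
s, a, r, s_next = 0, 1, 1.0, 1
gamma, epsilon = 0.9, 0.3
rng = np.random.default_rng(0)

# Policy pi at the next state, and one action sampled from it.
pi_next = epsilon_greedy_probs(Q[s_next], epsilon)
a_next = rng.choice(len(pi_next), p=pi_next)

# SARSA: bootstrap from the sampled a' (on-policy).
delta_sarsa = r + gamma * Q[s_next, a_next] - Q[s, a]
# Q-learning: bootstrap from the greedy a' (off-policy).
delta_q = r + gamma * np.max(Q[s_next]) - Q[s, a]
# Expected SARSA: bootstrap from the expectation over a' under pi.
delta_expected = r + gamma * np.dot(pi_next, Q[s_next]) - Q[s, a]

print(delta_sarsa, delta_q, delta_expected)
```

Only the bootstrap term differs between the three lines. With `epsilon = 0` the probability vector becomes one-hot on the greedy action, and all three errors coincide.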







# Q learning

In [1]:
%tensorflow_version 2.x
import tensorflow as tf
print(" we are currently using Tensorflow version {}".format(tf.__version__))
import numpy as np
import random
from IPython.display import clear_output
from collections import deque
import progressbar

import gym

from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Dense, Reshape
from tensorflow.keras.optimizers import Adam
from tensorflow import keras


TensorFlow 2.x selected.
 we are currently using Tensorflow version 2.1.0


In [2]:
#env = gym.make("FrozenLake8x8-v0")
env = gym.make("FrozenLake-v0")
env.render()
print('Number of states: {}'.format(env.observation_space.n))
print("number of actions: {}".format(env.action_space.n))


SFFF
FHFH
FFFH
HFFG
Number of states: 16
number of actions: 4


In [0]:
# make some changes for git
class Q_learn:
  def __init__(self, env, optimizer, episodes, explore):
    self._state_size = env.observation_space.n
    self._action_size = env.action_space.n
    self._optimizer = optimizer
    self.gamma = 0.9
    self.episodes = episodes
    self.epsilon = explore
    # We store successful and failed experiences separately.
    # Failures are far more common: the only reward is given on reaching the goal,
    # and there are no intermediate rewards, so the data are heavily imbalanced.
    # Training on failures is only useful once we also have some successful data.
    self.experience_replay_s = deque(maxlen = 2000) 
    self.experience_replay_f = deque(maxlen = 2000)

    self.q_network = self.network_model()
    #self.target_network = self.network_model()
    #self.copy_weights()

  def store(self, state, action, reward, next_state, terminated):
    """
    Store successful and failed experiences in separate buffers.
    """
    if reward>0.0:
      self.experience_replay_s.append((state, action, reward, next_state, terminated))
    else:
      self.experience_replay_f.append((state, action, reward, next_state, terminated))

  def eps_policy(self, state):
    """
    Epsilon-greedy policy
    """
    if np.random.rand() <= self.epsilon:
      # Uniform random action; avoids relying on the global `env`
      return np.random.randint(self._action_size)
    else:
      q_values = self.q_network.predict(np.array([state]))
      return np.argmax(q_values[0])

  def network_model(self):
    """
    NN model for Q value function
    """
    model = Sequential()
    model.add(Dense(50, activation='relu', input_shape=[1]))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(self._action_size, activation='relu'))
    model.compile(loss = 'mse', optimizer = self._optimizer)
    return model
  

  def test_fun(self, state, action, reward, next_state, terminated):
    """
    Build the training target for one transition (Expected SARSA style):
    the bootstrap term is the expectation of Q(s', .) under the epsilon-greedy policy.
    """
    target = self.q_network.predict(np.array([state]))
    if terminated:
      target[0][int(action)] = reward
    else:
      q_val_next_state = self.q_network.predict(np.array([next_state]))
      # Epsilon-greedy probabilities at next_state: eps/n for every action, plus (1 - eps) on the greedy one
      A = np.ones(self._action_size, dtype=float)*self.epsilon/self._action_size
      best_action = np.argmax(q_val_next_state[0])
      A[best_action] += (1.0 - self.epsilon)
      Expected_next_q_value = np.sum(q_val_next_state[0]*A)
      target[0][int(action)] = reward + self.gamma*Expected_next_q_value
    return list(target[0])

  def train(self, batch_size):
    minibatch_s = np.array(random.choices(self.experience_replay_s, k = batch_size))
    minibatch_f = np.array(random.choices(self.experience_replay_f, k = batch_size))
    minibatch = np.concatenate((minibatch_s,minibatch_f), axis =0)
    batch_state = minibatch[:,0]
    batch_target = np.array([self.test_fun(state, action, reward, next_state, terminated) for state, action, reward, next_state, terminated in minibatch])
      
    self.q_network.fit(batch_state, batch_target, epochs = 3, verbose = 1)
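
The idea behind the two replay buffers can be shown in isolation. The sketch below uses only the standard library and draws the same number of transitions from a success buffer and a failure buffer, so that rare successes make up half of every minibatch. The function name `sample_balanced` and the example transitions are hypothetical; the class above performs the equivalent sampling inline in `train`.

```python
import random
from collections import deque

def sample_balanced(successes, failures, per_buffer, rng=random):
    """Draw `per_buffer` transitions (with replacement) from each buffer."""
    # Copy to lists first: indexing into the middle of a deque is slow.
    batch = (rng.choices(list(successes), k=per_buffer)
             + rng.choices(list(failures), k=per_buffer))
    rng.shuffle(batch)
    return batch

# Transitions are (state, action, reward, next_state, terminated) tuples.
successes = deque(maxlen=2000)
failures = deque(maxlen=2000)

# One hypothetical success and several hypothetical failures.
successes.append((14, 2, 1.0, 15, True))
for transition in [(0, 1, 0.0, 4, False),
                   (4, 1, 0.0, 8, False),
                   (4, 2, 0.0, 5, True)]:
    failures.append(transition)

batch = sample_balanced(successes, failures, per_buffer=4)
n_success = sum(1 for t in batch if t[2] > 0.0)
print(len(batch), n_success)
```

Sampling with replacement means a single stored success can be repeated many times in a batch. That mirrors the use of `random.choices` above and is worth keeping in mind while the success buffer is still small.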
    
  

In [32]:
optimizer = Adam(learning_rate=0.01)
test_agent = Q_learn(env, optimizer, episodes= 100, explore =0.9)
 
print("Testing the model:")
test_agent.q_network.build()
print(test_agent.q_network.summary())
test_predicted = test_agent.q_network.predict(np.array([env.observation_space.sample()]))
print("predict the q_values for a random sample is {}".format(test_predicted))
test_history = test_agent.q_network.fit(np.array([env.observation_space.sample()]), test_predicted, epochs=1)

Testing the model:
Model: "sequential_19"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_57 (Dense)             (None, 50)                100       
_________________________________________________________________
dense_58 (Dense)             (None, 50)                2550      
_________________________________________________________________
dense_59 (Dense)             (None, 4)                 204       
Total params: 2,854
Trainable params: 2,854
Non-trainable params: 0
_________________________________________________________________
None
predict the q_values for a random sample is [[0.         0.89512193 0.         0.        ]]
Train on 1 samples


In [0]:
optimizer = Adam(learning_rate=0.01)

agent = Q_learn(env, optimizer,episodes=10000, explore = 0.9)

batch_size = 500
num_of_episodes = 10000
timesteps_per_episode = 128
Fail = 0
S = 0
for epi in range(0,num_of_episodes):
  state = env.reset()
  #state = np.array([state])

  reward = 0
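  # Exploration rate: 0.05 until 50 successes have been seen, then 0.1.
  # Note that this overrides the `explore` value passed to the constructor.
  if S>=50:
    agent.epsilon = 0.1
  else:
    agent.epsilon = 0.05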

  terminated  = False
  #bar = progressbar.ProgressBar(maxval=timesteps_per_episode/10, widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
  #bar.start()

  for steps in range(timesteps_per_episode):
    action = agent.eps_policy(state)
    #print(action)
    next_state, reward, terminated, _  = env.step(action)
    agent.store(state, action, reward, next_state, terminated)
    state = next_state
    #print(agent.experience_replay)

    
    #bar.finish()
    if (epi)%100 == 0:
      if S>50:
        print("**********************************")
        env.render()
        print("**********************************")
    if terminated:
      if reward > 0.0:
        S=S+1
      else:
        Fail = Fail +1
      break
  if S>=50:
    agent.train(batch_size)
    if epi%100 ==0:
      print("Episode: {}, S = {}, Fail = {}, exploration ={}".format(epi + 1, S, Fail, agent.epsilon))


Train on 1000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Train on 1000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Train on 1000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
