<a href="https://colab.research.google.com/github/dave20874/rl-exercises/blob/main/ex_2_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from matplotlib import pyplot as plt 
from numpy import random

This notebook implements exercise 2.5 from Sutton and Barto's Reinforcement Learning.

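Our first class, Bandit, represents the non-stationary 10-armed bandit.  All the q_star(a) start out equal (at 0.0) and then take independent random walks: on each time step, every q_star(a) changes by a normally distributed increment with mean 0 and standard deviation 0.01.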
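
As a rough illustration of that random walk, separate from the Bandit class itself, the sketch below drifts ten mean rewards for a number of steps.  The names rng, WALK_STD_DEV and STEPS, the seed, and the choice of 1000 steps are assumptions made for this example only.

```python
import numpy as np

# Seeded generator so the sketch is repeatable (the seed is arbitrary).
rng = np.random.default_rng(seed=0)

NUM_ACTIONS = 10
WALK_STD_DEV = 0.01   # std dev of each random-walk increment
STEPS = 1000          # illustrative only; shorter than a full run

# All true action values start out equal, at 0.0.
q_star = np.zeros(NUM_ACTIONS)

for _ in range(STEPS):
    # Each action's mean reward takes an independent Gaussian step.
    q_star += rng.normal(0.0, WALK_STD_DEV, size=NUM_ACTIONS)

# How far apart the best and worst actions have drifted.
spread = q_star.max() - q_star.min()
print(q_star.round(3), spread)
```

Because the increments are independent, the action values slowly separate, so the best action can change over the course of a run.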

The update method advances the random walk of the mean rewards by one time step.  The play method generates a random reward based on the player's choice of action.  (play and update are independent, so several agents with different settings can play against the same bandit on the same time step.)

In [None]:
class Bandit:
  REWARD_STD_DEV = 1.0    # reward variance is 1.0, so std dev = sqrt(1.0) = 1.0
  UPDATE_STD_DEV = 0.01   # std dev of the random-walk increment

  def __init__(self, num_actions=10):
    # number of actions for this bandit
    self.num_actions = num_actions

    # Mean reward for each action.
    # self.mean_r[N] -> mean reward for action N.
    # (Variance is 1.0 for all actions.  Std Dev = sqrt(variance) is also 1.)
    self.mean_r = [0.0]*self.num_actions

  # Return the ideal Q, q_star, for this bandit at this time.
  def get_q_star(self):
    return max(self.mean_r)

  # return the best action for this bandit at this time.
  def get_best_action(self):
    return self.mean_r.index(max(self.mean_r))

  # update distribution on each time step
  def update(self):
    for a in range(self.num_actions):
      self.mean_r[a] += random.normal(0.0, self.UPDATE_STD_DEV)

  # Play the game!  Get a reward!
  def play(self, action):
    # random.normal takes a standard deviation as its second argument
    r = random.normal(self.mean_r[action], self.REWARD_STD_DEV)
    return r

The next class defines the Agent.  It takes two parameters: alpha sets the step size and epsilon sets the exploration rate (the probability of choosing a random action instead of the greedy one).  An alpha value of 0.0 tells the agent to use sample averages instead of a constant step size.
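
To make the difference between the two step-size choices concrete, here is a small standalone sketch of the incremental update q += step_size*(reward - q).  The helper name update_estimate and the reward list are made up for this example and are not part of the notebook's classes.

```python
def update_estimate(q, reward, step_size):
    """Move the estimate q a fraction step_size of the way toward reward."""
    return q + step_size * (reward - q)

rewards = [1.0, 2.0, 3.0, 4.0]   # hypothetical rewards for one action

# Step size 1/n: the estimate equals the plain average of the rewards so far.
q_avg = 0.0
for n, r in enumerate(rewards, start=1):
    q_avg = update_estimate(q_avg, r, 1.0 / n)

# Constant step size: recent rewards count more than old ones.
q_const = 0.0
for r in rewards:
    q_const = update_estimate(q_const, r, 0.1)

print(q_avg, q_const)
```

With a 1/n step size every reward is weighted equally, which suits a stationary problem.  With a constant step size, older rewards fade geometrically, which is why it is the candidate for tracking a bandit whose means keep drifting.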

In [None]:
class Agent:
  def __init__(self, alpha, epsilon, num_actions=10):
    self.alpha = alpha      # step size.  0.0 means use 1.0/self.n[a]
    self.epsilon = epsilon  # Exploration rate
    self.num_actions = num_actions

    # estimated Q values
    self.q = [0.0]*self.num_actions
    self.n = [0]*self.num_actions       # number of steps for this action

  # Return the action with the highest estimated Q.
  # (Ties go to the lowest-numbered action; that choice is an assumption.)
  def get_greedy_action(self):
    return self.q.index(max(self.q))

  # Play one round with the bandit.
  def play(self, bandit):
    # Decide random or greedy action
    action = self.get_greedy_action()
    if random.uniform(0.0, 1.0) < self.epsilon:
      # override greedy with random action
      # (numpy's randint excludes the upper bound)
      action = random.randint(0, self.num_actions)

    # Get reward for this action
    reward = bandit.play(action)

    # Update q, n for this action
    self.n[action] += 1
    if self.alpha == 0.0:
      # compute step size to give sample average
      step_size = 1.0/self.n[action]
    else:
      # use alpha as step size
      step_size = self.alpha

    self.q[action] += step_size*(reward-self.q[action])

    # Report what happened so the caller can record it
    return action, reward

Now we can use our Bandit and Agent to set up an experiment.


In [None]:
class Experiment:
  RUNS = 10000
  STEPS = 10000     # steps per run; exercise 2.5 suggests runs of 10,000 steps
  NUM_ACTIONS = 10

  def __init__(self):
    pass

  def do_runs(self):
    for n in range(self.RUNS):
      self.run()

  def run(self):
    # create bandit
    bandit = Bandit(self.NUM_ACTIONS)

    # create agents with the same exploration rate but different step sizes
    epsilon = 0.1
    avg_agent = Agent(0.0, epsilon, self.NUM_ACTIONS)       # sample averages
    const_sz_agent = Agent(0.1, epsilon, self.NUM_ACTIONS)  # constant alpha = 0.1

    # go through the steps
    for step in range(self.STEPS):
      bandit.update()
      avg_agent.play(bandit)
      const_sz_agent.play(bandit)
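
The Experiment class steps the agents but does not yet record anything to plot.  One possible way to collect per-step averages across runs is sketched below.  The function average_over_runs and its run_once callback are hypothetical and not part of the notebook; run_once stands in for a single run that returns one reward per step, and fake_run is a placeholder used only to exercise the function.

```python
import numpy as np

def average_over_runs(run_once, num_runs, num_steps):
    """Average the per-step rewards over several independent runs."""
    totals = np.zeros(num_steps)
    for _ in range(num_runs):
        # run_once must return a sequence with one reward per step.
        rewards = np.asarray(run_once(num_steps), dtype=float)
        totals += rewards
    return totals / num_runs

# Placeholder for a real run: the "reward" at step t is just t.
def fake_run(num_steps):
    return [float(t) for t in range(num_steps)]

avg_reward = average_over_runs(fake_run, num_runs=3, num_steps=5)
print(avg_reward)
```

The resulting array has one entry per time step, which is the shape needed for an average-reward-versus-steps curve drawn with plt.plot.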
