<a href="https://colab.research.google.com/github/dion-dodgen/active-inference/blob/main/Contextual_Multi_Armed_Bandit_in_Pymdp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install inferactively-pymdp

## **Import some helpful libraries, most importantly `numpy` and `pymdp`**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pymdp
from pymdp import utils

### Let's set up the dimensionalities of the hidden state factors and the control states

In [None]:
""" Define dimensionalities of the hidden state factors and control state factors """
num_states = [3, 3] # a list of dimensionalities of each hidden state factor
num_factors = len(num_states) # the total number of hidden state factors

num_controls = [3, 3] # a list of the dimensionalities of each control state factor
num_control_factors = len(num_controls) # the total number of control state factors

In [None]:
""" Build an object array for storing the factor-specific B matrices """
# B =

### *Solution*

In [None]:
B = utils.initialize_empty_B(num_states, num_controls)

## Let's build the $\mathbf{B}$ array, i.e. one $\mathbf{B}$ matrix for each of the two hidden state factors of our new Grid World

<img src="https://drive.google.com/uc?export=view&id=1aABVu57DxisLi9iTzcF0rngm8LQ1B5Uk"/>


In [None]:
for f, ns in enumerate(num_states):

  """ Initialize the B matrix for this factor """
  B[f] = np.zeros( (ns, ns, num_controls[f]))

  # MOVE LEFT (or UP)
  B[f][0,0:2,0] = 1.0
  B[f][1,2,0] = 1.0

  # MOVE RIGHT (or DOWN)
  B[f][1,0,1] = 1.0
  B[f][2,1:,1] = 1.0

  # STAY
  B[f][:,:,2] = np.eye(ns)
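
A quick sanity check can help here: every column of a transition array should sum to one over next states, for each action. The following is a minimal sketch using only `numpy`; the helper names `build_grid_B` and `columns_are_normalized` are hypothetical and not part of `pymdp`.

```python
import numpy as np

def build_grid_B(ns=3, n_actions=3):
    """Rebuild one factor's transition array with the same indexing as the loop above."""
    B_f = np.zeros((ns, ns, n_actions))
    B_f[0, 0:2, 0] = 1.0       # action 0: states 0 and 1 both end up in state 0
    B_f[1, 2, 0] = 1.0         # action 0: state 2 moves to state 1
    B_f[1, 0, 1] = 1.0         # action 1: state 0 moves to state 1
    B_f[2, 1:, 1] = 1.0        # action 1: states 1 and 2 both end up in state 2
    B_f[:, :, 2] = np.eye(ns)  # action 2: stay put
    return B_f

def columns_are_normalized(B_f, tol=1e-8):
    """True if, for every action, each column (current state) sums to one over next states."""
    return bool(np.allclose(B_f.sum(axis=0), 1.0, atol=tol))

B_f = build_grid_B()
print(columns_are_normalized(B_f))
```

A column that sums to zero usually means a (state, action) pair was forgotten when filling in the array.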


### Plot the B matrix

In [None]:
utils.plot_likelihood(B[0][:,:,0], title = 'Move LEFT')

## Now for the $\mathbf{A}$ array...

### Let's start by specifying a single observation modality - the observation of one's own location, or $o^{Loc}$.

<img src="https://drive.google.com/uc?export=view&id=1cUkRY22yeIZdA6kyBO4mKB6jaXQeVpNo"/>


In [None]:
num_obs = [9]
num_modalities = len(num_obs)

A = utils.initialize_empty_A(num_obs, num_states)
A_location_dims = num_obs + num_states

In [None]:
A[0] = np.zeros(A_location_dims)
A[0].shape

### Now fill out the entries of the $\mathbf{A}$ matrix

In [None]:
""" filling out the mapping to 9-dimensional observations for the three possible settings of X (0, 1, 2), in the case we're in Y = 0 """
A[0][0:3,:,0] = np.eye(3)

""" filling out the mapping to 9-dimensional observations for the three possible settings of X (0, 1, 2), in the case we're in Y = 1 """
A[0][3:6,:,1] = np.eye(3)

""" filling out the mapping to 9-dimensional observations for the three possible settings of X (0, 1, 2), in the case we're in Y = 2 """
A[0][6:9,:,2] = np.eye(3)
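
The three assignments above amount to a row-major flattening of the grid: the state `(x, y)` maps deterministically to observation index `3 * y + x`. The sketch below rebuilds the same array with plain `numpy` and reads the mapping back; `A_loc` and `location_obs` are illustrative names, not `pymdp` functions.

```python
import numpy as np

# Same layout as A[0] above: axes are (observation, X, Y)
A_loc = np.zeros((9, 3, 3))
for y in range(3):
    A_loc[3 * y:3 * (y + 1), :, y] = np.eye(3)

def location_obs(x, y):
    """Index of the observation that A_loc assigns probability one to for state (x, y)."""
    return int(np.argmax(A_loc[:, x, y]))

for y in range(3):
    print([location_obs(x, y) for x in range(3)])
```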

### Now display the $\mathbf{A}$ matrices we just made

In [None]:
utils.plot_likelihood(A[0][:,:,0], title = "P(o | x, y == 0)" )

In [None]:
utils.plot_likelihood(A[0][:,:,1], title = "P(o | x, y == 1)" )

In [None]:
utils.plot_likelihood(A[0][:,:,2], title = "P(o | x, y == 2)" )

# **Explore/exploit task with a contextual two-armed bandit**

## Now we're going to build a generative model for an active inference agent playing a two-armed bandit task. The [multi-armed bandit](https://en.wikipedia.org/wiki/Multi-armed_bandit) is a classic decision-making task that captures the core features of the "explore/exploit tradeoff". This problem formulation is ubiquitous across various disciplines that study decision-making under uncertainty, including economics, neuroscience, machine learning, and engineering.

## The agent has to make choices among mutually exclusive alternatives or 'arms' in order to maximize rewards, which issue probabilistically from each arm. However, the reward statistics of each arm are in general unknown or only partially known. The agent must therefore _infer_ the reward statistics.

## The inherent partial observability of the task creates a conflict between **exploitation**, or choosing the arm that is _currently believed_ to be most rewarding, and **exploration**, or gathering information about the remaining arms in the hope of discovering a potentially more rewarding option.

## The fact that expected reward or utility is contextualized by _beliefs_ -- i.e. which arm is currently thought to be the most rewarding -- motivates the use of active inference in this context. This is because the key objective function for action-selection, the expected free energy $\mathbf{G}$, depends on the agent's beliefs about the world. And not only that, but expected free energy balances the desire to maximize rewards with the drive to resolve uncertainty about unknown parts of the agent's model. The more accurate the agent's beliefs are, the more faithfully decision-making can be guided by maximizing expected utility or rewards.
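
To make the two ingredients concrete, here is a rough sketch of one common decomposition for a single action and a single observation modality: an expected-utility term plus an expected-information-gain term. It is an illustration only; the function names are hypothetical, the numbers (reward probability 0.8, preferences `[0, -4, 2]`) are assumed for the example, and the computation inside `pymdp`'s `Agent` may differ in detail.

```python
import numpy as np

EPS = 1e-16  # guards against log(0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def entropy(p):
    return float(-np.sum(p * np.log(p + EPS)))

def utility_and_info_gain(A_m, qs, C_m):
    """
    A_m : (num_obs, num_states) likelihood of observations under one action
    qs  : (num_states,) beliefs about the hidden state
    C_m : (num_obs,) unnormalized log-preferences over observations
    """
    qo = A_m @ qs                                     # predicted observations
    utility = float(qo @ np.log(softmax(C_m) + EPS))  # expected log-preference
    ambiguity = sum(qs[s] * entropy(A_m[:, s]) for s in range(len(qs)))
    info_gain = entropy(qo) - ambiguity               # expected reduction in uncertainty
    return utility, info_gain

# Assumed example: rows are (Null, Loss, Reward), columns are (Left-Better, Right-Better)
A_play_left = np.array([[0.0, 0.0],
                        [0.2, 0.8],
                        [0.8, 0.2]])
C_reward = np.array([0.0, -4.0, 2.0])

u_flat, ig_flat = utility_and_info_gain(A_play_left, np.array([0.5, 0.5]), C_reward)
u_conf, ig_conf = utility_and_info_gain(A_play_left, np.array([0.9, 0.1]), C_reward)
print(u_flat, ig_flat)
print(u_conf, ig_conf)
```

Under the confident belief the utility term is larger and the information-gain term is smaller than under the flat belief, which is the sense in which sharper beliefs shift behavior from exploration toward exploitation.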


### Specify the dimensionalities of the hidden state factors, the control factors, and the observation modalities

In [None]:
context_names = ['Left-Better', 'Right-Better']
choice_names = ['Start', 'Hint', 'Left Arm', 'Right Arm']

""" Define `num_states` and `num_factors` below """
num_states = [len(context_names), len(choice_names)]
num_factors = len(num_states)

context_action_names = ['Do-nothing']
choice_action_names = ['Move-start', 'Get-hint', 'Play-left', 'Play-right']

""" Define `num_controls` below """
num_controls = [len(context_action_names), len(choice_action_names)]

hint_obs_names = ['Null', 'Hint-left', 'Hint-right']
reward_obs_names = ['Null', 'Loss', 'Reward']
choice_obs_names = ['Start', 'Hint', 'Left Arm', 'Right Arm']

""" Define `num_obs` and `num_modalities` below """
num_obs = [len(hint_obs_names), len(reward_obs_names), len(choice_obs_names)]
num_modalities = len(num_obs)

## Create the $\mathbf{A}$ arrays first

In [None]:
""" Generate the A array """
A = utils.initialize_empty_A(num_obs, num_states)

## Fill out the hint modality, a sub-array of `A` which we'll call `A_hint`

<img src="https://drive.google.com/uc?export=view&id=1SqMp77NAmUa_oh925VURJ1Hyp8v_fXOj"/>



In [None]:
p_hint = 0.7 # accuracy of the hint, according to the agent's generative model (how much does the agent trust the hint?)

A_hint = np.zeros( (len(hint_obs_names), len(context_names), len(choice_names)) )

for choice_id, choice_name in enumerate(choice_names):

  if choice_name == 'Start':

    A_hint[0,:,choice_id] = 1.0

  elif choice_name == 'Hint':

    A_hint[1:,:,choice_id] = np.array([[p_hint,       1.0 - p_hint],
                                      [1.0 - p_hint,  p_hint]])
  elif choice_name == 'Left Arm':

    A_hint[0,:,choice_id] = 1.0

  elif choice_name == 'Right Arm':

    A_hint[0,:,choice_id] = 1.0

A[0] = A_hint

In [None]:
utils.plot_likelihood(A[0][:,:,1], title = "Probability of the two hint types, for the two game states")
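
As a worked example of what this modality implies for inference, the sketch below applies Bayes' rule to the context belief after the agent observes `Hint-left` at the `Hint` location. It uses plain `numpy`; `posterior_over_context` is a hypothetical helper, not the routine `pymdp` uses internally.

```python
import numpy as np

def posterior_over_context(prior, likelihood_row):
    """Bayes' rule: posterior is proportional to likelihood times prior."""
    unnorm = likelihood_row * prior
    return unnorm / unnorm.sum()

p_hint = 0.7
# P(hint observation | context) at the 'Hint' location
# rows: Null, Hint-left, Hint-right; columns: Left-Better, Right-Better
A_hint_at_hint = np.array([[0.0,          0.0],
                           [p_hint,       1.0 - p_hint],
                           [1.0 - p_hint, p_hint]])

prior = np.array([0.5, 0.5])
post = posterior_over_context(prior, A_hint_at_hint[1])   # one 'Hint-left'
post2 = posterior_over_context(post, A_hint_at_hint[1])   # a second 'Hint-left'
print(post, post2)
```

With a flat prior, a single hint moves the belief in `Left-Better` to exactly `p_hint`; repeated consistent hints push it further.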

## Fill out the reward modality, a sub-array of `A` which we'll call `A_rew`

<img src="https://drive.google.com/uc?export=view&id=155LAPZ9_aulJ3YYZwwlOWrEa6unabHht"/>

In [None]:
p_reward = 0.8 # probability of getting a rewarding outcome, if you are sampling the more rewarding bandit

A_reward = np.zeros((len(reward_obs_names), len(context_names), len(choice_names)))

for choice_id, choice_name in enumerate(choice_names):

  if choice_name == 'Start':

    A_reward[0,:,choice_id] = 1.0

  elif choice_name == 'Hint':

    A_reward[0,:,choice_id] = 1.0

  elif choice_name == 'Left Arm':

    A_reward[1:,:,choice_id] = np.array([ [1.0-p_reward, p_reward],
                                        [p_reward, 1.0-p_reward]])
  elif choice_name == 'Right Arm':

    A_reward[1:, :, choice_id] = np.array([[ p_reward, 1.0- p_reward],
                                         [1- p_reward, p_reward]])

A[1] = A_reward

In [None]:
utils.plot_likelihood(A[1][:,:,2], title='Payoff structure if playing the Left Arm, for the two contexts')
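
Given a belief about the context, the reward row of this array yields the predicted probability of a `Reward` for each arm. The following is a small sketch with assumed values; `predicted_reward_prob` is an illustrative name.

```python
import numpy as np

p_reward = 0.8
# P(Reward | context) for each arm, i.e. the 'Reward' row of A_reward
# columns: Left-Better, Right-Better
p_rew_given_ctx = {
    'Left Arm':  np.array([p_reward, 1.0 - p_reward]),
    'Right Arm': np.array([1.0 - p_reward, p_reward]),
}

def predicted_reward_prob(qs_context, arm):
    """Marginalize the reward likelihood over the belief about the context."""
    return float(p_rew_given_ctx[arm] @ qs_context)

qs = np.array([0.7, 0.3])  # e.g. the belief after one 'Hint-left'
print(predicted_reward_prob(qs, 'Left Arm'), predicted_reward_prob(qs, 'Right Arm'))
```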

## Fill out the choice observation modality, a sub-array of `A` which we'll call `A_choice`

<img src="https://drive.google.com/uc?export=view&id=1LGdGX0TgesvQ2HDnHMg42XHh0n6ZKnHw"/>


In [None]:
A_choice = np.zeros((len(choice_obs_names), len(context_names), len(choice_names)))

for choice_id in range(len(choice_names)):

  A_choice[choice_id, :, choice_id] = 1.0

A[2] = A_choice

In [None]:
""" Condition on context (first hidden state factor) and display the remaining indices (outcome and choice state) """

utils.plot_likelihood(A[2][:,0,:], title="Mapping between sensed states and true states")

### Let's wrap that all into a single function so we can quickly re-set or re-parameterize the A array as needed

In [None]:
def create_A(p_hint=0.7, p_reward=0.8):
  """
  Function for creating the observation (sensory likelihood) model for the contextual MAB task, parameterized by
  two probabilities: `p_hint` and `p_reward`.
  Parameters:
  ----------
  `p_hint`: float (default 0.7)
    accuracy of the hint, according to the agent's generative model (how much does the agent trust the hint?)
  `p_reward`: float (default 0.8)
    probability of getting a rewarding outcome when sampling the more rewarding bandit, according to the agent's generative model
  Returns:
  ---------
  `A`: numpy object array
    The full observation likelihood model
  """
  A = utils.initialize_empty_A(num_obs, num_states)

  """ Fill out the hint modality """
  A_hint = np.zeros( (len(hint_obs_names), len(context_names), len(choice_names)) )
  for choice_id, choice_name in enumerate(choice_names):
    if choice_name == 'Start':
      A_hint[0,:,choice_id] = 1.0
    elif choice_name == 'Hint':
      A_hint[1:,:,choice_id] = np.array([[p_hint,       1.0 - p_hint],
                                        [1.0 - p_hint,  p_hint]])
    elif choice_name == 'Left Arm':
      A_hint[0,:,choice_id] = 1.0
    elif choice_name == 'Right Arm':
      A_hint[0,:,choice_id] = 1.0

  """ Fill out the reward modality """
  A_reward = np.zeros((len(reward_obs_names), len(context_names), len(choice_names)))
  for choice_id, choice_name in enumerate(choice_names):
    if choice_name == 'Start':
      A_reward[0,:,choice_id] = 1.0
    elif choice_name == 'Hint':
      A_reward[0,:,choice_id] = 1.0
    elif choice_name == 'Left Arm':
      A_reward[1:,:,choice_id] = np.array([ [1.0-p_reward, p_reward],
                                          [p_reward, 1.0-p_reward]])
    elif choice_name == 'Right Arm':
      A_reward[1:, :, choice_id] = np.array([[ p_reward, 1.0- p_reward],
                                          [1- p_reward, p_reward]])
  """ Fill out the choice sensation modality """
  A_choice = np.zeros((len(choice_obs_names), len(context_names), len(choice_names)))
  for choice_id in range(len(choice_names)):
      A_choice[choice_id, :, choice_id] = 1.0

  A[0], A[1], A[2] = A_hint, A_reward, A_choice

  return A

## Now let's move onto the $\mathbf{B}$ arrays

In [None]:
B = utils.initialize_empty_B(num_states, num_controls) # second argument is the number of control states per factor

### Fill out the context state factor dynamics, a sub-array of `B` which we'll call `B_context`

<img src="https://drive.google.com/uc?export=view&id=1_VvkCpRu1wWwEFiAJKnOGAGikd5KeiiE" width="600" height="300" />


In [None]:
B_context = np.zeros( (len(context_names), len(context_names), len(context_action_names)) )

B_context[:,:,0] = np.eye(len(context_names))

B[0] = B_context

### Fill out the choice factor dynamics, a sub-array of `B` which we'll call `B_choice`

<img src="https://drive.google.com/uc?export=view&id=1qeuFvNIrJR7ldjpkrB6_jAp6JM3UhMw0"/>


In [None]:
B_choice = np.zeros( (len(choice_names), len(choice_names), len(choice_action_names)) )

for choice_i in range(len(choice_names)):

  B_choice[choice_i, :, choice_i] = 1.0

B[1] = B_choice

### Once again, let's wrap that into a quick `create_B()` function so we can re-set it whenever we want.

Let's add in an optional `p_change` parameter to allow the agent to believe in a stochastic environment (the "better arm" may change identity over time)

In [None]:
def create_B(p_change=0.0):
  """
  Function for creating the transition (dynamics or transition likelihood) model for the contextual MAB task, parameterized by
  a context-change probability `p_change`.
  Parameters:
  ----------
  `p_change`: float (default 0.0)
    probability of the context (which bandit is more rewarding) changing
  Returns:
  ---------
  `B`: numpy object array
    The full transition likelihood model
  """
  B = utils.initialize_empty_B(num_states, num_controls)

  """ Context transitions (uncontrollable) """
  B_context = np.zeros( (len(context_names), len(context_names), len(context_action_names)) )
  B_context[:,:,0] = np.array([[1.-p_change,    p_change],
                               [p_change, 1.-p_change]]
                              )

  """ Choice transitions (controllable) """
  B_choice = np.zeros( (len(choice_names), len(choice_names), len(choice_action_names)) )
  for choice_i in range(len(choice_names)):
    B_choice[choice_i, :, choice_i] = 1.0

  B[0], B[1] = B_context, B_choice

  return B
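
To see what `p_change` does to the agent's beliefs, the sketch below pushes a context belief one step forward through the same 2x2 transition matrix. It uses only `numpy`, and `context_transition` and `predict_next` are hypothetical helper names.

```python
import numpy as np

def context_transition(p_change):
    """Same 2x2 context transition matrix as in create_B()."""
    return np.array([[1.0 - p_change, p_change],
                     [p_change,       1.0 - p_change]])

def predict_next(qs, p_change):
    """Prior over the context at the next timestep, before any new observation."""
    return context_transition(p_change) @ qs

qs = np.array([0.9, 0.1])
print(predict_next(qs, 0.0))  # static context: belief is carried over unchanged
print(predict_next(qs, 0.2))  # volatile context: belief relaxes toward uniform
```

With `p_change > 0`, confidence decays between observations, so an agent that believes in a volatile context has a reason to revisit the hint.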


## The $\mathbf{C}$ vectors

In [None]:
from pymdp.maths import softmax

def create_C(reward=2., pun=-4.):
  """ Creates the C array, AKA the observation prior for the MAB task, parameterized by a `reward` and `pun` (punishment) parameter """

  C = utils.obj_array_zeros(num_obs)
  C[1] = np.array([0., pun, reward])
  return C

C = create_C(reward=2.0, pun=-4.0)

utils.plot_beliefs(softmax(C[1]), title = "Prior preferences")
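
The entries of `C[1]` act as unnormalized log-preferences, so only their differences matter once they are passed through a softmax. The sketch below re-implements a softmax in `numpy` (rather than importing `pymdp.maths.softmax`) to show the implied preference ratios; `softmax_1d` is an illustrative name.

```python
import numpy as np

def softmax_1d(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

prefs = softmax_1d([0.0, -4.0, 2.0])  # Null, Loss, Reward
print(prefs)
print(prefs[2] / prefs[1])            # how strongly Reward is preferred over Loss
```

The ratio between `Reward` and `Loss` is `exp(2 - (-4)) = exp(6)`, roughly 400, and adding the same constant to all three entries leaves the result unchanged.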

## The $\mathbf{D}$ vectors

In [None]:
def create_D(p_context=0.5):
  """
  Creates the D array AKA the hidden state prior at the first timestep for the MAB task, parameterized by a `p_context` parameter that
  parameterizes the agent's prior beliefs about whether the context is "Left-Better" at the first timestep of a given trial
  """

  D = utils.obj_array(num_factors)

  """ Context prior """
  D_context = np.array([p_context,1.-p_context])
  D[0] = D_context


  """ Choice-state prior """
  D_choice = np.zeros(len(choice_names))
  D_choice[choice_names.index("Start")] = 1.0
  D[1] = D_choice

  return D

D = create_D()

In [None]:
utils.plot_beliefs(D[0], title = "Prior beliefs about probability of the two contexts") # D[0] is already a probability distribution, so no softmax is needed

## Now let's take advantage of the `Agent` class in `pymdp` to wrap this all into an Agent instance that we can use to do active inference in a few lines.

In [None]:
from pymdp.agent import Agent

A = create_A(p_hint=0.7, p_reward=0.8)
B = create_B(p_change=0.0)
C = create_C(reward=2.0, pun=-4.0)
D = create_D(p_context=0.5)
my_agent = Agent(A=A, B=B, C=C, D=D)

## Define a class for the 2-armed bandit environment (AKA the _generative process_)

In [None]:
class TwoArmedBandit(object):

  def __init__(self, context = None, p_hint = 1.0, p_reward = 0.8):

    self.context_names = ["Left-Better", "Right-Better"]

    if context is None:
      self.context = self.context_names[utils.sample(np.array([0.5, 0.5]))] # randomly sample which bandit arm is better (Left or Right)
    else:
      self.context = context

    self.p_hint = p_hint
    self.p_reward = p_reward

    self.reward_obs_names = ['Null', 'Loss', 'Reward']
    self.hint_obs_names = ['Null', 'Hint-left', 'Hint-right']

  def step(self, action):

    if action == "Move-start":
      observed_hint = "Null"
      observed_reward = "Null"
      observed_choice = "Start"
    elif action == "Get-hint":
      if self.context == "Left-Better":
        observed_hint = self.hint_obs_names[utils.sample(np.array([0.0, self.p_hint, 1.0 - self.p_hint]))]
      elif self.context == "Right-Better":
        observed_hint = self.hint_obs_names[utils.sample(np.array([0.0, 1.0 - self.p_hint, self.p_hint]))]
      observed_reward = "Null"
      observed_choice = "Hint"
    elif action == "Play-left":
      observed_hint = "Null"
      observed_choice = "Left Arm"
      if self.context == "Left-Better":
        observed_reward = self.reward_obs_names[utils.sample(np.array([0.0, 1.0 - self.p_reward, self.p_reward]))]
      elif self.context == "Right-Better":
        observed_reward = self.reward_obs_names[utils.sample(np.array([0.0, self.p_reward, 1.0 - self.p_reward]))]
    elif action == "Play-right":
      observed_hint = "Null"
      observed_choice = "Right Arm"
      if self.context == "Right-Better":
        observed_reward = self.reward_obs_names[utils.sample(np.array([0.0, 1.0 - self.p_reward, self.p_reward]))]
      elif self.context == "Left-Better":
        observed_reward = self.reward_obs_names[utils.sample(np.array([0.0, self.p_reward, 1.0 - self.p_reward]))]

    obs = [observed_hint, observed_reward, observed_choice]

    return obs
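
To check that reward sampling in a generative process like this behaves as intended, one can draw many outcomes and compare the empirical reward rate with `p_reward`. The sketch below mirrors only the reward branch of `step()` and uses `numpy`'s random generator instead of `utils.sample`; `sample_reward` is a hypothetical stand-in, not a method of the class above.

```python
import numpy as np

reward_obs_names = ['Null', 'Loss', 'Reward']

def sample_reward(context, arm, p_reward, rng):
    """Sample a reward observation for playing `arm` in the given context."""
    better_arm = 'Left Arm' if context == 'Left-Better' else 'Right Arm'
    p_win = p_reward if arm == better_arm else 1.0 - p_reward
    idx = rng.choice(3, p=[0.0, 1.0 - p_win, p_win])  # 'Null' is never sampled here
    return reward_obs_names[idx]

rng = np.random.default_rng(0)  # seeded for reproducibility
draws = [sample_reward('Left-Better', 'Left Arm', 0.8, rng) for _ in range(5000)]
freq = draws.count('Reward') / len(draws)
print(freq)
```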


### Write a function that will take the agent, the environment, and a time length and run the active inference loop

In [None]:
def run_active_inference_loop(my_agent, my_env, T = 5, verbose = False):
  """
  Function that wraps together and runs a full active inference loop using the pymdp.agent.Agent class functionality
  """

  """ Initialize the first observation """
  obs_label = ["Null", "Null", "Start"]  # agent observes itself seeing a `Null` hint, getting a `Null` reward, and seeing itself in the `Start` location
  obs = [hint_obs_names.index(obs_label[0]), reward_obs_names.index(obs_label[1]), choice_obs_names.index(obs_label[2])]

  first_choice = choice_obs_names.index(obs_label[2])
  choice_hist = np.zeros((4,T+1))
  choice_hist[first_choice,0] = 1.0

  belief_hist = np.zeros((2, T))
  context_hist = np.zeros(T)

  for t in range(T):
    context_hist[t] = my_env.context_names.index(my_env.context) # use the environment passed in, not the global `env`
    qs = my_agent.infer_states(obs)

    belief_hist[:,t] = qs[0]

    if verbose:
      utils.plot_beliefs(qs[0], title = f"Beliefs about the context at time {t}")

    q_pi, efe = my_agent.infer_policies()
    chosen_action_id = my_agent.sample_action()

    movement_id = int(chosen_action_id[1])
    choice_hist[movement_id,t+1]= 1.0

    choice_action = choice_action_names[movement_id]

    obs_label = my_env.step(choice_action)

    obs = [hint_obs_names.index(obs_label[0]), reward_obs_names.index(obs_label[1]), choice_obs_names.index(obs_label[2])]

    if verbose:
      print(f'Action at time {t}: {choice_action}')
      print(f'Reward at time {t}: {obs_label[1]}')

  return choice_hist, belief_hist, context_hist

In [None]:
def plot_choices_beliefs(choice_hist, belief_hist, context_hist, pad_val=5.0):
  """ Helper function for plotting outcome of simulation.
  The first subplot shows the agent's choices (actions) over time; the second subplot shows the agent's beliefs about the game context (which arm is better) over time.
  """

  T = choice_hist.shape[1]
  fig, axes = plt.subplots(nrows = 2, ncols = 1, figsize = (14,11))
  axes[0].imshow(choice_hist[:,:-1], cmap = 'gray') # only plot up until the second to last timestep, since we don't update beliefs after the last choice
  axes[0].set_xlabel('Timesteps')
  axes[0].set_yticks(ticks = range(4))
  axes[0].set_yticklabels(labels = choice_action_names)
  axes[0].set_title('Choices over time')

  axes[1].imshow(belief_hist, cmap = 'gray')
  axes[1].set_xlabel('Timesteps')
  axes[1].set_yticks(ticks = range(2))
  axes[1].set_yticklabels(labels = ['Left-Better', 'Right-Better'])
  axes[1].set_title('Beliefs over time')
  axes[1].scatter(np.arange(T-1), context_hist, c = 'r', s = 50)

  fig.tight_layout(pad=pad_val)
  plt.show()

### Now all we have to do is define the two-armed bandit environment, choose the length of the simulation, and run the function we wrote above.


*   Try playing with the hint accuracy and/or reward statistics of the environment - remember this is _different_ than the agent's representation of the reward statistics (i.e. the agent's generative model, e.g. the A or B matrices).




In [None]:
p_hint_env = 1.0 # this is the "true" accuracy of the hint - i.e. how often does the hint actually signal which arm is better. REMEMBER: THIS IS INDEPENDENT OF HOW YOU PARAMETERIZE THE A MATRIX FOR THE HINT MODALITY
p_reward_env = 0.7 # this is the "true" reward probability - i.e. how often does the better arm actually return a reward, as opposed to a loss. REMEMBER: THIS IS INDEPENDENT OF HOW YOU PARAMETERIZE THE A MATRIX FOR THE REWARD MODALITY
env = TwoArmedBandit(p_hint=p_hint_env, p_reward=p_reward_env)

T = 15

A = create_A(p_hint=0.7, p_reward=0.8)
B = create_B(p_change=0.0)
C = create_C(reward=2.0, pun=-4.0)
D = create_D(p_context=0.5)
my_agent = Agent(A = A, B = B, C = C, D = D) # in case you want to re-define the agent, you can run this again
choice_hist, belief_hist, context_hist = run_active_inference_loop(my_agent, env, T = T, verbose = False)
plot_choices_beliefs(choice_hist, belief_hist, context_hist)

### Let's manipulate the agent's prior preferences over reward observations ($\mathbf{C}[1]$) in order to examine the tension between exploration and exploitation.

In [None]:
# manipulate the agent's sensitivity to punishment
C = create_C(pun=-4.0)
my_agent = Agent(A = A, B = B, C = C, D = D) # redefine the agent with the new preferences
env = TwoArmedBandit(p_hint = 1.0, p_reward = 0.7) # re-initialize the environment

choice_hist, belief_hist, context_hist = run_active_inference_loop(my_agent, env, T = T, verbose = False)
plot_choices_beliefs(choice_hist, belief_hist, context_hist)

## Learning the reward contingencies

#### Defining priors over the (Categorical) parameters of the `A` array. A natural choice of the prior is the Dirichlet distribution, which is conjugate to the Categorical likelihood.

In [None]:
A = create_A(p_hint=1.0, p_reward=0.51) # let's assume the agent doesn't know the reward contingencies
pA = utils.dirichlet_like(A, scale = 0.1)

#### Often we want certain contingencies to be unavailable to learning -- these are contingencies that we assume are "baked-in" to the agent's beliefs about the world and not adaptable.

In [None]:
pA[1][0,:,:] *= 10e5 # make the null observation contingencies 'un-learnable'

#### Let's make that into a function to enable quick re-parameterization of `pA`

In [None]:
def parameterize_pA(A_base, scale=1e-16, prior_count=10e5):
  pA = utils.dirichlet_like(A_base, scale = scale)
  pA[1][0,:,:] *= prior_count # make the null observation contingencies 'un-learnable'
  return pA
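
The intuition behind learning with a Dirichlet prior is count accumulation: each observation adds to the counts of the (observation, state) pairs the agent believes it was in, weighted by those beliefs. The following is a simplified two-dimensional sketch of that idea with assumed starting counts; it is not `pymdp`'s `update_A`, which operates on the full multi-factor arrays, and the helper names are hypothetical.

```python
import numpy as np

def dirichlet_update(pA_m, obs_idx, qs, lr=1.0):
    """Add lr * outer(one-hot observation, state beliefs) to the Dirichlet counts."""
    onehot = np.zeros(pA_m.shape[0])
    onehot[obs_idx] = 1.0
    return pA_m + lr * np.outer(onehot, qs)

def expected_likelihood(pA_m):
    """Posterior-mean likelihood: normalize counts over observations, per state."""
    return pA_m / pA_m.sum(axis=0, keepdims=True)

# Counts for playing the left arm; rows: Loss, Reward; columns: Left-Better, Right-Better
pA_left = np.array([[0.49, 0.51],
                    [0.51, 0.49]])

# The agent is certain the context is Left-Better and observes a Reward (row 1)
pA_new = dirichlet_update(pA_left, obs_idx=1, qs=np.array([1.0, 0.0]))
print(expected_likelihood(pA_new))
```

Small initial counts (a small `scale`) make each new observation move the expected likelihood a lot, whereas very large counts, as used for the `Null` row, make a contingency effectively fixed.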

### Define new active inference loop where you update beliefs about A (`qA`) online and save the results

In [None]:
def run_active_inference_with_learning(my_agent, my_env, T = 5):
  """
  Function that wraps together and runs a full active inference loop using the pymdp.agent.Agent class functionality.
  Also includes learning and outputs the history of the agent's beliefs about the reward
  """

  """ Initialize the first observation """
  obs_label = ["Null", "Null", "Start"]  # agent observes itself seeing a `Null` hint, getting a `Null` reward, and seeing itself in the `Start` location
  obs = [hint_obs_names.index(obs_label[0]), reward_obs_names.index(obs_label[1]), choice_obs_names.index(obs_label[2])]

  belief_hist = np.zeros((2, T))
  context_hist = np.zeros(T)

  first_choice = choice_obs_names.index(obs_label[2])
  choice_hist = np.zeros((4,T+1))
  choice_hist[first_choice,0] = 1.0

  dim_qA = (T,) + my_agent.A[my_agent.modalities_to_learn[0]].shape

  qA_hist = np.zeros(dim_qA)

  for t in range(T):
    context_hist[t] = my_env.context_names.index(my_env.context) # use the environment passed in, not the global `env`
    qs = my_agent.infer_states(obs)
    belief_hist[:,t] = qs[0].copy()

    q_pi, _ = my_agent.infer_policies()
    chosen_action_id = my_agent.sample_action()

    movement_id = int(chosen_action_id[1])
    choice_hist[movement_id,t+1]= 1.0

    qA_t = my_agent.update_A(obs)
    qA_hist[t] = qA_t[my_agent.modalities_to_learn[0]]

    choice_action = choice_action_names[movement_id]
    obs_label = my_env.step(choice_action)

    # print(f'Observation : Hint: {obs_label[0]}, Reward: {obs_label[1]}, Choice Sense: {obs_label[2]}')
    obs = [hint_obs_names.index(obs_label[0]), reward_obs_names.index(obs_label[1]), choice_obs_names.index(obs_label[2])]

  return choice_hist, belief_hist, qA_hist, context_hist

In [None]:
A = create_A(p_hint=1.0, p_reward=0.51) # let's assume the agent doesn't know the reward contingencies
pA = parameterize_pA(A_base=A, scale = 1.0, prior_count=10e5)

B, C, D = create_B(), create_C(reward=2., pun=-4.), create_D() # the rest of the generative model
env = TwoArmedBandit(p_hint = 1.0, p_reward = 0.8) # initialize the environment with p_reward = 0.8

agent_with_learning = Agent(A=A, pA=pA, B=B, C=C, D=D, modalities_to_learn=[1], lr_pA = 1.0, use_param_info_gain=True, action_selection='deterministic')
T = 25
choice_hist, belief_hist, qA_hist, context_hist = run_active_inference_with_learning(agent_with_learning, env, T = T)
plot_choices_beliefs(choice_hist, belief_hist, context_hist, pad_val = 0.1)

In [None]:
p_reward_beliefs_left = [utils.norm_dist(qa_t)[2,0,2] for qa_t in qA_hist]
p_reward_beliefs_right = [utils.norm_dist(qa_t)[2,1,3] for qa_t in qA_hist]

print(f'True context is: {env.context}')
fig, ax = plt.subplots(figsize=(10,6))
ax.plot(p_reward_beliefs_left, label = 'Beliefs about $p_{reward}$ when Left Arm is better', lw = 2.0)
ax.plot(p_reward_beliefs_right, label = 'Beliefs about $p_{reward}$ when Right Arm is better', lw = 2.0)
ax.set_ylim(-0.05, 1.05)
ax.set_xlim(0, T)
ax.legend()

### In order to allow the agent to sample both bandit arms more equally (so it can learn the reward probabilities in the _other_ arm, not just the best arm given the context at hand), we will introduce a switching probability into the bandit by augmenting the bandit class so that the context sometimes stochastically switches.

In [None]:
class TwoArmedBanditStochastic(object):

  def __init__(self, context = None, p_hint = 1.0, p_reward = 0.8, p_change = 0.3):

    self.context_names = ["Left-Better", "Right-Better"]

    if context is None:
      self.context = self.context_names[utils.sample(np.array([0.5, 0.5]))] # randomly sample which bandit arm is better (Left or Right)
    else:
      self.context = context

    self.p_hint = p_hint
    self.p_reward = p_reward

    self.reward_obs_names = ['Null', 'Loss', 'Reward']
    self.hint_obs_names = ['Null', 'Hint-left', 'Hint-right']

    self.p_change=p_change

  def step(self, action):

    # change the context stochastically at each timestep
    change_or_stay = utils.sample(np.array([self.p_change, 1. - self.p_change]))
    if change_or_stay == 0:
      if self.context == 'Left-Better':
        self.context = 'Right-Better'
      elif self.context == 'Right-Better':
        self.context = 'Left-Better'

    if action == "Move-start":
      observed_hint = "Null"
      observed_reward = "Null"
      observed_choice = "Start"
    elif action == "Get-hint":
      if self.context == "Left-Better":
        observed_hint = self.hint_obs_names[utils.sample(np.array([0.0, self.p_hint, 1.0 - self.p_hint]))]
      elif self.context == "Right-Better":
        observed_hint = self.hint_obs_names[utils.sample(np.array([0.0, 1.0 - self.p_hint, self.p_hint]))]
      observed_reward = "Null"
      observed_choice = "Hint"
    elif action == "Play-left":
      observed_hint = "Null"
      observed_choice = "Left Arm"
      if self.context == "Left-Better":
        observed_reward = self.reward_obs_names[utils.sample(np.array([0.0, 1.0 - self.p_reward, self.p_reward]))]
      elif self.context == "Right-Better":
        observed_reward = self.reward_obs_names[utils.sample(np.array([0.0, self.p_reward, 1.0 - self.p_reward]))]
    elif action == "Play-right":
      observed_hint = "Null"
      observed_choice = "Right Arm"
      if self.context == "Right-Better":
        observed_reward = self.reward_obs_names[utils.sample(np.array([0.0, 1.0 - self.p_reward, self.p_reward]))]
      elif self.context == "Left-Better":
        observed_reward = self.reward_obs_names[utils.sample(np.array([0.0, self.p_reward, 1.0 - self.p_reward]))]

    obs = [observed_hint, observed_reward, observed_choice]

    return obs


In [None]:
env = TwoArmedBanditStochastic(p_hint = 1.0, p_reward = 0.8, p_change=0.2) # initialize the environment with p_reward = 0.8
print(f'Starting context is :{env.context}')

A = create_A(p_hint=1.0, p_reward=0.51)
pA = parameterize_pA(A_base=A, scale=0.05, prior_count=10e5)

B = create_B(p_change=0.2)
C = create_C(reward=2.0, pun=-2.0)
D = create_D()

T = 20
agent_with_learning = Agent(A=A, pA=pA, B=B, C=C, D=D, modalities_to_learn=[1], lr_pA = 1.25, use_param_info_gain = True, action_selection='stochastic')
choice_hist, belief_hist, qA_hist, context_hist = run_active_inference_with_learning(agent_with_learning, env, T = T)
plot_choices_beliefs(choice_hist, belief_hist, context_hist, pad_val = 1.)

In [None]:
p_reward_beliefs_left = [utils.norm_dist(qa_t)[2,0,2] for qa_t in qA_hist]
p_reward_beliefs_right = [utils.norm_dist(qa_t)[2,1,3] for qa_t in qA_hist]

fig, ax = plt.subplots(figsize=(10,6))
ax.plot(p_reward_beliefs_left, label = 'Beliefs about $p_{reward}$ when Left Arm is better', lw = 2.0)
ax.plot(p_reward_beliefs_right, label = 'Beliefs about $p_{reward}$ when Right Arm is better', lw = 2.0)
ax.set_ylim(-0.05, 1.05)
ax.set_xlim(0, T)
ax.legend()