In [2]:
%matplotlib notebook
%load_ext autoreload
%autoreload 2
from supplementary.simple_choice_model import hits_gen as hits
from supplementary.simple_choice_model import sim_tools
import ipywidgets as wid

# Introduction

This notebook will introduce an artificial agent that uses a simple decision-making routine in order to choose the next task to interact with. The agent starts in some initial state (some past history of interacting with all tasks) to make the initial decision. Upon choosing a task, the agent "plays" it, gets some feedback (hit or miss), updates its state accordingly and repeats. This closed-loop process involves multiple components that can produce different behaviors over time.

One interesting property of the tasks the agent chooses from is that the probability of positive feedback can change over time, depending on the agent's engagement with the task. In other words, the agent can learn to perform better on the tasks, provided they are learnable. To fully model the closed-loop process, we would need a model of learning. However, we can simulate this part of the process with any function that takes time (or experience) on a given task as an argument and returns the probabilities of particular outcomes (e.g. correct / incorrect). Below, I will use a sigmoid of a linear function with a single predictor, the number of trials on the task so far:

$$ P(\text{hit} \mid \text{trial}) = \sigma(b_0 + b_1 \text{trial}) = \frac{1}{1 + e^{-(b_0 + b_1 \text{trial})}} $$

where $b_0, b_1$ are free parameters that can differ between tasks. Parameter $b_1$ controls how rapidly the function increases, and $b_0$ determines how likely a hit is without prior experience with the task (when $\text{trial} = 0$, the hit probability is $\sigma(b_0)$). Below is an interface to either fit these parameters to our data (or aspects of the data) or tweak them as you like. Select the next `code cell` and run it (try `Shift + Enter`, or the Run button in the menu on the top left).
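
Independently of the interface in the next code cell, the curve itself can be sketched in a few lines of plain Python. This is only an illustration of the formula above, not the code inside `hits_gen`; the function name `hit_probability` and the parameter values are hypothetical.

```python
import math

def hit_probability(trial, b0, b1):
    """P(hit | trial) = sigmoid(b0 + b1 * trial)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * trial)))

# Hypothetical parameter values, chosen only for illustration.
learnable = dict(b0=0.0, b1=0.05)    # starts at chance and improves with practice
unlearnable = dict(b0=0.0, b1=0.0)   # stays at chance regardless of practice

for trial in (0, 50, 100):
    p_learn = hit_probability(trial, **learnable)
    p_flat = hit_probability(trial, **unlearnable)
    print(trial, round(p_learn, 3), round(p_flat, 3))
```

With $b_1 = 0$ the hit probability never moves from $\sigma(b_0)$, which is one simple way to represent a task that cannot be learned.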

# Generating hits

In [3]:
# %matplotlib notebook
# from supplementary.simple_choice_model import hits_gen as hits
hits_generator = hits.HitsGeneratorGUI(bandits=['1D','I1D','2D','R'])


# Decision-making process

The probability of choosing a task increases with the **utility** of that task. The probabilities are given by the _softmax_ function, which exponentiates the utilities and normalizes the results so that they sum to one:

$$ p(\text{task}_i) = \frac{e^{u_i/\tau}} {\sum_j e^{u_j/\tau}} $$

Above, $u_i$ is the utility of task $i$, and the sum in the denominator runs over all tasks $j$ (including task $i$); $\tau$ is the temperature parameter of the softmax function, which controls the stochasticity of the choice. As $\tau$ approaches 0, softmax approaches the argmax function, so the agent almost always picks the highest-utility task. As $\tau$ grows, the choice probabilities approach a uniform distribution.
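
The effect of the temperature can be checked with a minimal sketch of the softmax rule. The utilities below are hypothetical, and subtracting the maximum before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged; it is not necessarily how `sim_tools` implements the rule.

```python
import math

def softmax(utilities, tau):
    """Choice probabilities from utilities; tau > 0 is the temperature."""
    scaled = [u / tau for u in utilities]
    shift = max(scaled)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

utilities = [1.0, 2.0, 0.5, 0.0]  # hypothetical utilities for four tasks
for tau in (5.0, 0.6, 0.05):
    print(tau, [round(p, 3) for p in softmax(utilities, tau)])
```

At the lowest temperature almost all probability mass sits on the second task, which has the highest utility; at the highest temperature the four probabilities are close to each other.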

The agent uses a linear utility function to evaluate each task:

$$ u_{i,t} = \alpha \text{LP}_{i,t} + \beta \text{PC}_{i,t} + \gamma I_{i,t} $$

where $\alpha, \beta, \gamma$ are free parameters, $\text{LP}$ is the learning progress evaluated for task $i$ at time $t$, $\text{PC}$ is its positive feedback expectation (modeled as Percent Correct), and $I$ is the "inertia" variable that equals 1 if task $i$ was played on the previous trial and 0 otherwise. As long as the agent has these 3 quantities for any task, it can meaningfully assign utility to it. Note that each term is itself a model with its own parameters and structure.
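
As a rough sketch, the utility of every task can be computed in one pass once LP and PC are known. The LP and PC values below are hypothetical; the parameter values match the defaults passed to the simulator later in this notebook.

```python
def utilities(lp, pc, previous_task, alpha, beta, gamma):
    """Linear utility of each task: alpha * LP + beta * PC + gamma * I."""
    return [
        alpha * lp_i + beta * pc_i + gamma * (1.0 if i == previous_task else 0.0)
        for i, (lp_i, pc_i) in enumerate(zip(lp, pc))
    ]

lp = [0.2, 0.0, 0.1, 0.0]  # hypothetical learning progress per task
pc = [0.6, 0.9, 0.5, 0.5]  # hypothetical percent correct per task

# The task with index 1 was played on the previous trial, so only it gets the inertia bonus.
u = utilities(lp, pc, previous_task=1, alpha=1.0, beta=1.0, gamma=3.0)
print(u)
```

With $\gamma = 3$ the inertia bonus alone exceeds the largest value that $\alpha \text{LP} + \beta \text{PC}$ can reach when $\alpha = \beta = 1$, so the previously played task always has the highest utility under these settings.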

## Learning progress
LP is a quantity that reflects _change_ in learning. In our case, change in either direction counts, whether performance is improving or declining. There are many ways to model LP, but we will use a formulation that does not assume the use of any learning model by the agent, since we have not introduced one. Instead, the agent relies on a finite, perfect memory of hits and misses that extends some time into the past. The memory stores the record of outcomes for each task, and LP is evaluated based on these records. Specifically, the agent takes the absolute difference between the hit rate over some number $m_0$ of most recent trials on a task and the hit rate over the $m_1$ trials right before that (on the same task). In other words, it compares the newer performance history to the older performance history to see whether performance is changing or staying the same.
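
A minimal sketch of this computation for a single task follows. The function name, the window sizes, and the example record are hypothetical; the notebook does not state which $m_0$ and $m_1$ the simulator uses.

```python
def learning_progress(outcomes, m0, m1):
    """Absolute difference between the recent and the older hit rate on one task.

    outcomes: list of 0/1 outcomes on the task, oldest first.
    m0: size of the most recent window (positive integer).
    m1: size of the window right before it (positive integer).
    """
    if len(outcomes) < m0 + m1:
        raise ValueError("need at least m0 + m1 recorded outcomes")
    recent = outcomes[-m0:]
    older = outcomes[-(m0 + m1):-m0]
    return abs(sum(recent) / m0 - sum(older) / m1)

# Hypothetical record: 2 hits in the older 5 trials, 4 hits in the recent 5.
improving = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
declining = list(reversed(improving))

print(learning_progress(improving, m0=5, m1=5))
print(learning_progress(declining, m0=5, m1=5))
```

Both records yield the same LP, because the hit rates 0.4 and 0.8 differ by the same amount whichever window comes first. That is the sense in which improvement and decline both count as progress here.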

> Closely related to Reward Prediction Error and Temporal Difference Error in simple Q-learning.

## Expected positive feedback
As mentioned before, the expected positive feedback is modeled as the percentage of hits in the record of all trials stored in memory. If the memory size is $M$, then $\text{PC}_{i,t}$ is the proportion of hits in the $M$ most recent past trials on task $i$. This variable is an operationalization of the subjective evaluation of one's mastery of a task. The plain hit rate across $M$ trials is a very simple operationalization, and one could consider various models of the judgment of learning (JOL) to describe the process more faithfully. Also note that LP relies on the computation of hit rates and thus could be closely related to the JOL mechanism.
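
In code, PC reduces to a hit rate over the tail of a task's record. The sketch below assumes that when fewer than $M$ outcomes are available the agent uses whatever it has; that fallback, the function name, and the example record are assumptions made for illustration.

```python
def percent_correct(outcomes, memory_size):
    """Proportion of hits in the memory_size most recent outcomes on one task.

    outcomes: non-empty list of 0/1 outcomes, oldest first.
    If fewer than memory_size outcomes exist, all of them are used (an assumption).
    """
    recent = outcomes[-memory_size:]
    return sum(recent) / len(recent)

record = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]  # hypothetical record for one task
print(percent_correct(record, memory_size=10))  # hit rate over the whole record
print(percent_correct(record, memory_size=5))   # hit rate over the last 5 trials only
```

A shorter memory tracks recent mastery more closely: here the last five trials are all hits, while the full record still includes the early misses.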

> Closely related to confidence judgments and JOLs.

## Inertia
Finally, the inertia term is there to make sure that the agent's behavior qualitatively resembles that of a human being. Specifically, it boosts the utility of the task that was selected on the previous time step, which increases the probability of repeating the same task. When humans explore freely, they tend to stay on the same task for some extended period of time before switching to another one (some even played the same task for 250 trials). Without an incentive to repeat its selection, the agent would jump across tasks (seemingly) erratically due to the stochasticity of the decision-making mechanism.

> [Inertia is equivalent to switching cost, which can be implemented similarly. Instead of having a vector `[0,1,0,0]` that encodes whether task $i$ at time $t$ was selected at time $t-1$, we could have a vector `[1,0,1,1]` that represents whether task $i$ at time $t$ is a new one (i.e. requires switching to). Then, a negative parameter would correspond to the size of the switching cost and "discourage" the agent from switching. Additionally, I think, inertia is related to boredom. If there is only inertia, there is nothing to prevent an agent from sticking to the same task forever, if the parameter is strong enough (LP and PC cannot easily overcome it, because they are bounded between 0 and 1). A boredom variable could be such that it increases with time and can eventually tip over the fixed inertia quantity. The relative sensitivities to inertia and boredom can determine how readily the agent switches to something else. Boredom, however, is not simply a function of time spent on the same task and probably interacts with LP, PC and perhaps other factors.]
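
The equivalence between inertia and switching cost can be verified numerically. Since the switch vector is one minus the inertia vector, $\gamma I_i = \gamma - \gamma S_i$, and a constant added to every utility cancels in the softmax. The sketch below uses hypothetical base utilities and a small softmax helper written for this example.

```python
import math

def softmax(utilities, tau):
    """Choice probabilities from utilities; tau > 0 is the temperature."""
    shift = max(utilities)  # subtracting the max leaves the probabilities unchanged
    exps = [math.exp((u - shift) / tau) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

base = [0.8, 0.9, 0.6, 0.5]          # hypothetical alpha * LP + beta * PC per task
inertia = [0, 1, 0, 0]               # task 1 was played on the previous trial
switch = [1 - x for x in inertia]    # 1 where choosing the task requires a switch
gamma, tau = 3.0, 0.6

# Positive weight on inertia versus negative weight of the same size on switching.
p_inertia = softmax([b + gamma * i for b, i in zip(base, inertia)], tau)
p_switch = softmax([b - gamma * s for b, s in zip(base, switch)], tau)

print([round(p, 4) for p in p_inertia])
print([round(p, 4) for p in p_switch])
```

The two parameterizations give the same choice probabilities up to floating-point rounding, so they cannot be distinguished from choice behavior alone.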

# Simulation

Below, you can simulate and view one or several rounds of free play by running the following code cell. To start a simulation, we need to initialize the starting state. We do this by randomly sampling a subject from our data (or from a subset of the data). Click the "Update initial state" button to sample a new subject. You will see their training trials appear on the top left (black = 1 / hit, white = 0 / miss). This initial state will be used to compute the utilities and start the free-play closed loop. To simulate 250 trials of free play, click the "Simulate" button. You will see the proportion of time spent on each task on the top right, as well as the agent's actual choices and outcomes across time on the bottom.

You can interactively change the utility function's free parameters to see how the behavior changes. You can also change the parameters of the `hits_generator` and/or change the initial state. Be sure to update the initial state by clicking the button, then click the "Simulate" button to see the latest changes.

Finally, the simulation can include multiple runs (up to 30) on the same input state and parameter values, each resulting in its own trajectory and learning outcomes (self-challenge and test performance, calculated identically to the SC and test scores from our data analyses). You can set the number of runs (N) and view different runs individually by setting the 'Run #' slider to a particular run ID. You can also view the learning outcomes of all runs (gray circles) and their average (solid black circle) in the figure below the controls called **sim^2** (the square marker shows which run you are currently viewing in the top panels).
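
For readers without the widgets, the closed loop can be approximated in plain Python. The sketch below is not the `sim_tools.Simulator` implementation. It assumes window sizes `m0 = m1 = 5`, a memory of $M = m_0 + m_1$ trials for PC, training trials that count toward experience, a randomly generated initial state (the notebook samples a real subject instead), and hypothetical $b_0, b_1$ values for each task.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(utilities, tau):
    shift = max(utilities)  # subtracting the max leaves the probabilities unchanged
    exps = [math.exp((u - shift) / tau) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def hit_rate(outcomes):
    return sum(outcomes) / len(outcomes)

def simulate(initial_memory, task_params, nb_trials=250, alpha=1.0, beta=1.0,
             gamma=3.0, tau=0.6, m0=5, m1=5, seed=0):
    """Run one round of free play and return a list of (task, hit) pairs.

    initial_memory: one 0/1 record per task, each with at least m0 + m1 outcomes.
    task_params: one (b0, b1) pair per task for the sigmoid hit generator.
    """
    rng = random.Random(seed)
    memory = [list(record) for record in initial_memory]  # copy the initial state
    experience = [len(record) for record in memory]       # trials played per task
    previous = None
    history = []
    for _ in range(nb_trials):
        utilities = []
        for i, record in enumerate(memory):
            lp = abs(hit_rate(record[-m0:]) - hit_rate(record[-(m0 + m1):-m0]))
            pc = hit_rate(record[-(m0 + m1):])  # assumes memory size M = m0 + m1
            inertia = 1.0 if i == previous else 0.0
            utilities.append(alpha * lp + beta * pc + gamma * inertia)
        probs = softmax(utilities, tau)
        task = rng.choices(range(len(memory)), weights=probs)[0]
        b0, b1 = task_params[task]
        hit = int(rng.random() < sigmoid(b0 + b1 * experience[task]))
        memory[task].append(hit)   # the agent updates its state with the outcome
        experience[task] += 1
        previous = task
        history.append((task, hit))
    return history

labels = ['1D', 'I1D', '2D', 'R']
params = [(0.0, 0.05), (0.0, 0.03), (-0.5, 0.02), (0.0, 0.0)]  # hypothetical values
init_rng = random.Random(1)
initial = [[init_rng.randint(0, 1) for _ in range(10)] for _ in labels]

history = simulate(initial, params)
for i, label in enumerate(labels):
    share = sum(1 for task, _ in history if task == i) / len(history)
    print(label, round(share, 3))
```

Changing `gamma` or `tau` in the call to `simulate` is a quick way to see how inertia and temperature shape the proportion of time spent on each task.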

In [4]:
from supplementary.simple_choice_model import sim_tools
simulator = sim_tools.Simulator(nb_trials=250, hits_generator=hits_generator, controls=True, live=False,
                                alpha=1.0, beta=1.0, gamma=3.0, tau=0.6)


# Model parameters

The free parameters of the model can be fit to different kinds of data. 