
# Box3 - contending with multiple local maxima

This box was confusing to me. The task and model are different from those used in Box2 and Boxes 4-7. The main text of the paper says there are two parameters: the learning rate $\alpha$ and $\rho$, which controls the influence of working memory. However, Appendix 4 mentions four parameters and refers to a paywalled paper (Collins & Frank, 2012) for a complete description of the task. Going through the Matlab code, we see there are indeed four parameters, but one of them, $K$, is held constant for both simulation and recovery. So in practice there are three parameters that we simulate and recover here.

I mention this because I *think* I've gotten the task and model right, but there may be some discrepancies left. However, this shouldn't matter for the principles shown.

## Experimental task

Our simulated participants take part in a task where they see a picture (the stimulus) and have to learn, through trial and error, which of three buttons (our actions or choices) to push to receive a reward. The experiment has multiple blocks, each with a different number of stimuli ranging from 2 to 6. Each block size is repeated three times, so we have a total of $5 * 3 = 15$ blocks. Within a block, each stimulus is presented 15 times, so one pass through the five block sizes gives $(2 + 3 + 4 + 5 + 6) * 15 = 300$ trials, and with the three repeats we have $300 * 3 = 900$ trials in total.
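The trial counts above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are my own and are not taken from the original Matlab code.

```python
# Number of stimuli per block, as described in the text.
set_sizes = [2, 3, 4, 5, 6]
n_repeats = 3          # each block size is used in three blocks
n_presentations = 15   # each stimulus is shown 15 times within a block

n_blocks = len(set_sizes) * n_repeats
trials_per_pass = sum(set_sizes) * n_presentations  # one block of each size
n_trials = trials_per_pass * n_repeats

print(n_blocks, trials_per_pass, n_trials)  # 15 300 900
```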

## Learning model

This model uses a mix of reinforcement learning (RL) and working memory (WM). On the RL side we have the standard value for each combination of stimulus (state) and action, updated on every trial as:

$$Q(state, action) = Q(state, action) + \alpha * (reward - Q(state, action))$$

and on the WM side we have analogous state-action values that are simply set to the reward received:

$$WM(state, action) = reward$$
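As a minimal sketch, the two update rules could look like this in Python with NumPy. The initial values of `1 / n_actions`, the value of `alpha`, and the function name `update_values` are assumptions made for illustration; they are not taken from the paper or its Matlab code.

```python
import numpy as np

n_stimuli, n_actions = 3, 3
alpha = 0.1  # learning rate (illustrative value)

# Assumption: both tables start at a uniform 1 / n_actions.
Q = np.full((n_stimuli, n_actions), 1 / n_actions)
WM = np.full((n_stimuli, n_actions), 1 / n_actions)


def update_values(Q, WM, state, action, reward, alpha):
    """Apply the RL and WM updates in place for a single trial."""
    # RL: move the value a fraction alpha towards the reward.
    Q[state, action] += alpha * (reward - Q[state, action])
    # WM: store the most recent reward directly.
    WM[state, action] = reward


# One rewarded trial: stimulus 0, action 2.
update_values(Q, WM, state=0, action=2, reward=1, alpha=alpha)
print(Q[0], WM[0])
```

After a single rewarded trial the WM value jumps straight to the reward, while the RL value only moves a small step towards it.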

Choices are made using softmax with inverse temperature $\beta$ for RL:

$$p(action) = \frac{e^{\beta * Q(state, action)}}{\sum_{a'}{e^{\beta * Q(state, a')}}}$$

For WM the same softmax is applied to the WM values, but with a constant of $50$ in place of $\beta$. This makes the WM policy essentially "greedy": it almost always picks the highest-valued action.
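To see how much a constant of 50 sharpens the choice probabilities, here is a small sketch of the softmax applied to one row of values. The function name `softmax`, the example values, and the RL inverse temperature of 5 are my own choices for illustration; subtracting the maximum before exponentiating is a standard numerical safeguard and does not change the result.

```python
import numpy as np


def softmax(values, beta):
    """Softmax over the action values of one stimulus."""
    z = beta * np.asarray(values, dtype=float)
    z -= z.max()  # avoids overflow; the probabilities are unchanged
    expz = np.exp(z)
    return expz / expz.sum()


values = [0.2, 0.5, 0.3]           # example values for one stimulus
p_rl = softmax(values, beta=5.0)   # moderate inverse temperature
p_wm = softmax(values, beta=50.0)  # near-greedy WM policy

print(np.round(p_rl, 3))
print(np.round(p_wm, 3))
```

With the same underlying values, the inverse temperature of 50 puts almost all of the probability mass on the best action, whereas the lower one still leaves room for exploration.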

The final choice probabilities are calculated with a mixture policy:

$$p(action) = (1 - w) * p_{RL} + w * p_{WM}$$

Here $w = \rho * \min(1, \frac{K}{n_s})$, where $n_s$ is the number of stimuli in the block. $K$ can be read as working memory capacity: as long as $n_s \le K$ the WM weight is simply $\rho$, and once the number of stimuli exceeds $K$ the weight is scaled down by $\frac{K}{n_s}$.
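Putting the pieces together, the mixture policy can be sketched as below. The function names `wm_weight` and `mixture_policy` and the parameter values in the example are assumptions for illustration, not the names or values used in the original Matlab code.

```python
import numpy as np


def softmax(values, beta):
    """Softmax over the action values of one stimulus."""
    z = beta * np.asarray(values, dtype=float)
    z -= z.max()  # numerical safeguard, does not change the result
    expz = np.exp(z)
    return expz / expz.sum()


def wm_weight(rho, K, n_s):
    """Mixture weight w = rho * min(1, K / n_s)."""
    return rho * min(1.0, K / n_s)


def mixture_policy(q_row, wm_row, beta, rho, K, n_s, beta_wm=50.0):
    """Combine the RL and WM choice probabilities for one stimulus."""
    w = wm_weight(rho, K, n_s)
    return (1 - w) * softmax(q_row, beta) + w * softmax(wm_row, beta_wm)


# The WM weight shrinks as the number of stimuli grows beyond K.
weights = {n_s: wm_weight(rho=0.8, K=3, n_s=n_s) for n_s in [2, 3, 4, 5, 6]}
print(weights)

# Example: RL values are still flat, WM already holds a rewarded action.
p = mixture_policy(
    q_row=[0.3, 0.3, 0.4], wm_row=[0.0, 0.0, 1.0],
    beta=5.0, rho=0.8, K=3, n_s=6,
)
print(np.round(p, 3))
```

In small blocks the policy leans on WM almost as much as $\rho$ allows, while in the 6-stimulus blocks the WM contribution is halved when $K = 3$.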