# Simulating Rodent Learning Curves with Reinforcement Learning Agents

- Simulating mouse choice behaviour with Q-learning models using data from a bandit task
- Specifying 1000 simulations (to reach the sample size in Beron et al. (2022))
- Using random search within parameter ranges to return the best-fitting RMSE (comparing real mouse behaviour to model predictions) - 50 random search iterations specified
- Compared across three reward conditions (70%-30%, 80%-20% and 90%-10%), where one arm delivers a reward with the higher probability (e.g., on 80% of trials) and the other arm delivers a reward with the lower probability (e.g., on 20% of trials).
- Different number of simulations and iterations can be specified!
- The RL agent is then compared to the mouse, averaging across trial blocks. 
- In this configuration, the rewarding arm switches every few trials, so averaging across blocks allows an average learning curve to be plotted, looking at learning post-switch. 
- However, to simulate mice having to re-learn which arm is good, we also need to simulate this 'confusion' in the RL agent. We do this by first training it on the wrong arm (with the same reward probability) for a certain number of trials, before it learns about the new arm. The aim is for the RL agent to mimic the experience of the (confused) mouse when the rewarding arm suddenly switches.
- In other words, we simulate the agent having transitioned from a block where the reward probabilities were entirely reversed, so the agent is convinced that the wrong arm is rewarding - the same experience the mouse would have.
- Since each trial only has one action step, this RL model does not include the discount factor $\gamma$ in the Q-value update.
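The reversed pre-training block and the block-averaged learning curve described above can be sketched as follows. This is a minimal illustration, not the code in `fitRLmodel_rmse.py`: the function names, the `n_pretrain` parameter and the convention that arm 0 is the good arm after the switch are all assumptions made for the example.

```python
import numpy as np

def reward_probabilities(trial, n_pretrain, p_good):
    """Return (p_arm0, p_arm1) for a given trial.

    Assumption: arm 0 is the 'good' arm after the switch, and the first
    n_pretrain trials belong to the previous block, where the reward
    probabilities are exactly reversed.
    """
    if trial < n_pretrain:
        return (1.0 - p_good, p_good)   # reversed block: arm 1 looks good
    return (p_good, 1.0 - p_good)       # post-switch block: arm 0 is good

def average_learning_curve(chose_good_by_block):
    """Average 0/1 'chose the good arm' indicators across blocks.

    Each row is one block, aligned so that column 0 is the first trial
    after the switch.
    """
    return np.asarray(chose_good_by_block, dtype=float).mean(axis=0)

# Two hypothetical blocks of three post-switch trials each.
curve = average_learning_curve([[0, 1, 1],
                                [0, 0, 1]])
```

Here `curve` is the proportion of blocks in which the good arm was chosen on each post-switch trial, which is the quantity plotted against the mouse data.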



## 1) Epsilon Greedy Q-learning agent:

- This simple Q-learning agent learns which of two arms is more rewarding in three different conditions.
- In this experimental setup, the initial state s simply corresponds to the mouse in the state before choosing to pursue the right or left arm of the two-armed bandit task. Other possible states are being in the 'good' arm - with a high reward probability, or being in the 'bad' arm, with a low reward probability.
- It uses an $\epsilon$-greedy approach: with probability $\epsilon$, take a random action (determined by a 50-50 coin toss); with probability $1-\epsilon$, exploit the best known policy, i.e. the action that maximises the Q-value.
- $\epsilon$ is first initialized at 1 to promote exploratory behaviour, and is then gradually decreased by a decay-rate hyperparameter that scales epsilon down, to encourage the simulated agent to exploit the best policy once it is reasonably clear.
- Because mice either receive a reward (1) or do not (0), the Q-learning equation computes Q-values from these binary outcomes. The estimate ends up weighted by how often each outcome is experienced, so an optimal agent's estimate should approach the reward probability of the condition, i.e. in the 70% reward condition, the estimated Q-value should end up around 0.7 (not accounting for any biases/bonuses that other models may include).

### General Q-learning formula:
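$$Q_{t+1}(a_t) = Q_t(a_t) + \alpha \left[ r_t - Q_t(a_t) \right]$$

- Here $a_t$ is the arm chosen on trial $t$, $r_t \in \{0, 1\}$ is the reward received, and $\alpha$ is the learning rate.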
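As a rough check on the claim that the Q-value of an arm should settle near its reward probability, the update can be run on simulated binary rewards. This is a sketch under stated assumptions: the function name `q_update`, the seed and the small learning rate of 0.05 are chosen only for illustration and are not the fitted values.

```python
import numpy as np

def q_update(q_value, reward, alpha):
    """One Q-learning step for the chosen arm (no discount term)."""
    return q_value + alpha * (reward - q_value)

rng = np.random.default_rng(0)
p_reward, alpha = 0.7, 0.05   # illustrative values, not fitted parameters
q = 0.0
history = []
for _ in range(5000):
    reward = float(rng.random() < p_reward)   # 1 with probability 0.7, else 0
    q = q_update(q, reward, alpha)
    history.append(q)

# Average estimate after an initial burn-in period.
mean_late_q = float(np.mean(history[1000:]))
```

In expectation the update has its fixed point at the reward probability, so `mean_late_q` should sit close to 0.7. A larger learning rate tracks recent outcomes more closely, which makes the trial-by-trial estimate fluctuate more around that value.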


### Epsilon greedy action selection:
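$$a_t = \begin{cases} \text{random arm (each with probability } 0.5\text{)} & \text{with probability } \epsilon \\ \arg\max_a Q_t(a) & \text{with probability } 1 - \epsilon \end{cases}$$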
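A minimal sketch of this selection rule is given below. The function name and the tie-breaking behaviour are assumptions for the example (`np.argmax` returns the first arm when Q-values are tied), and the real script may handle ties differently.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """Pick an arm index using epsilon-greedy selection."""
    if rng.random() < epsilon:
        # Explore: 50-50 coin toss between the two arms.
        return int(rng.integers(len(q_values)))
    # Exploit: arm with the highest current Q-value (first arm on ties).
    return int(np.argmax(q_values))

rng = np.random.default_rng(1)
action = epsilon_greedy_action([0.2, 0.9], epsilon=0.1, rng=rng)
```

Because the coin toss can also land on the greedy arm, the overall probability of choosing the currently best arm is $1 - \epsilon/2$ rather than $1 - \epsilon$.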

### Epsilon decay:
$$\epsilon_{t+1} = \max(\epsilon_{min}, \epsilon_t \cdot \text{decayRate})$$

- The minimum epsilon allows us to specify a model that will never fully exploit the optimal policy (if needed). 
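The decay rule can be sketched as below, using the decay rate (0.983) and minimum epsilon (0.136) reported for the 70-30 fit later in this document. The function name and the 300-trial horizon are assumptions for the example.

```python
def decay_epsilon(epsilon, decay_rate, epsilon_min):
    """Multiply epsilon by the decay rate, but never go below the floor."""
    return max(epsilon_min, epsilon * decay_rate)

epsilon = 1.0          # start fully exploratory
schedule = []
for _ in range(300):   # illustrative number of trials
    schedule.append(epsilon)
    epsilon = decay_epsilon(epsilon, decay_rate=0.983, epsilon_min=0.136)
```

With these values the schedule falls from 1 to the floor in roughly 120 trials and then stays at 0.136, so the agent keeps exploring at that rate for the rest of the block.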



### Average Epsilon Greedy Q-Learning Agent vs Average Mouse (70% reward, 30% no reward in 'good' arm)

- The following terminal command runs the Q-learning model for the 70-30 condition. Changing the `-c` field changes the condition modelled, changing `--sims_per_set` allows the user to specify a different number of simulations, and changing `--search_iterations` sets the number of random search iterations.

`python fitRLmodel_rmse.py -m epsilon_greedy -c 70 --sims_per_set 1000 --search_iterations 50`
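The internals of `fitRLmodel_rmse.py` are not reproduced here, but the general idea of scoring a simulated learning curve by RMSE and keeping the best of several randomly sampled parameter sets could look something like the sketch below. Everything in it is hypothetical: the function names, the uniform sampling, and the `toy_curve` stand-in, which is neither the real agent nor mouse data.

```python
import numpy as np

def rmse(model_curve, mouse_curve):
    """Root-mean-square error between two learning curves of equal length."""
    diff = np.asarray(model_curve, dtype=float) - np.asarray(mouse_curve, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def random_search(simulate, target_curve, param_ranges, n_iterations, rng):
    """Sample parameters uniformly within their ranges; keep the lowest-RMSE set."""
    best_params, best_score = None, float("inf")
    for _ in range(n_iterations):
        params = {name: float(rng.uniform(low, high))
                  for name, (low, high) in param_ranges.items()}
        score = rmse(simulate(**params), target_curve)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_curve(rate, n_trials=100):
    """Toy stand-in for a simulated learning curve (NOT the real agent)."""
    trials = np.arange(n_trials)
    return 0.5 + 0.4 * (1.0 - np.exp(-rate * trials))

rng = np.random.default_rng(0)
target = toy_curve(rate=0.05)   # toy target, not mouse data
best_params, best_score = random_search(
    toy_curve, target, {"rate": (0.01, 0.2)}, n_iterations=50, rng=rng
)
```

In the real pipeline `simulate` would average many simulated agents per parameter set, which is why the number of simulations and the number of search iterations both drive the run time.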

![Best fit for 70-30 condition](plot_epsilon_greedy_70-30_fit_RMSE.png)

- RMSE = 0.0776, indicating a very good fit - the RL model's learning curve is typically off by about 7.76 percentage points (in the root-mean-square sense).
- $\alpha$ (learning rate) = 0.413, $\epsilon_{min}$ = 0.136, decay = 0.983
- This model suggests a moderate learning rate, as well as continued stochastic behaviour even after the optimal policy is learnt: even after decay, the minimum epsilon keeps the exploration probability at about 0.14 (search range provided: 0.01 - 0.5).
- This is a good strategy overall - with mice picking the 'correct arm' almost 90% of the time by trial 100. The simulated RL mouse learns a little quicker based on the % of good options chosen, but is actually a bit more conservative than the mouse.

![Q-Value History](plot_epsilon_greedy_70-30_Q_Values.png)

- From the Q-value estimate history above, we can see that the agent is somewhat underestimating the value of the highly rewarding arm - likely representing a more conservative set of internal beliefs about the 'goodness' of the rewarding arm - while also overestimating how bad the 'bad' arm is.

### 80% reward vs 20% no reward condition

![Best fit for 80-20 condition](plot_epsilon_greedy_80-20_fit_RMSE.png)


- RMSE = 0.071 - slightly better than in the 70-30 condition, but overall quite similar.
- $\alpha$ (learning rate) = 0.406, $\epsilon_{min}$ = 0.090, decay = 0.978
- The learning rate is very similar, but the agent generally exploits more once the optimal policy is learned, as shown by the lower minimum epsilon.
- The simulated mouse peaks at choosing the "good" option more than 90% of the time, and visually we can see that real-life mice are more stochastic trial-by-trial.

![Agent's Q-Learning Q-Estimates](plot_epsilon_greedy_80-20_Q_Values.png)

- From the Q-estimate history curve, we can see that the agent reaches fairly steady estimates by around trial 100, but these estimates slightly underestimate the value of the rewarding arm.
- It also does not accurately estimate the Q-value of the bad arm, estimating the value to be significantly worse than it is.

### 90-10 condition

![Best fit for 90-10 condition](plot_epsilon_greedy_90-10_fit_RMSE.png)


- RMSE = 0.139 (poor fit) - the epsilon greedy model does not do well at capturing average mouse learning in this high-reward probability condition. Graphically, mice appear to act a lot more stochastically, even though the reward probability is very high. 

- $\alpha$ (learning rate) = 0.497, $\epsilon_{min}$ = 0.109, decay = 0.985 

- The learning rate is the highest out of the three probability conditions, which is in line with the greater reward probability. But the agent that behaves most like the average mouse is surprisingly stochastic, with a minimum exploration probability ($\epsilon_{min}$) greater than in the 80-20 condition.

![Q-value estimates by Epsilon Greedy Q-learner](plot_epsilon_greedy_90-10_Q_Values.png)

- From the Q-estimates, we can see that by trial 100, the agent has pretty much learnt the true value of the good arm, and its estimate is stable. In contrast, the agent's Q-value estimate for the 'bad' arm reaches an accurate level but wavers a lot - likely because the agent rarely chooses that bad option.

- While the model performs well for the 70-30 and 80-20 conditions, alternative models may better capture mouse stochasticity in the later trials, particularly in the 90-10 condition.


## 2) Q-Learning Agent with 'Forgetting'

- To account for mice seeming to choose the good option less over time, especially in the 90-10 condition, we can include a forgetting factor, which scales down the Q-values of all actions, as if the mice have forgotten the value associated with them.
- We can model this as a global decay parameter $\omega$ applied to all Q-values over time.
- Mathematically:

$$Q(a) \leftarrow \omega \, Q(a) \quad \text{for every arm } a$$
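One way the forgetting step might be combined with the usual Q-learning update is sketched below. The order of operations (update the chosen arm first, then decay every Q-value) is an assumption, since only the decay itself is stated above, and the function name is hypothetical.

```python
import numpy as np

def forgetting_step(q_values, action, reward, alpha, omega):
    """Update the chosen arm, then decay all Q-values by omega.

    Assumption: the decay is applied after the learning update on every trial.
    """
    q = np.array(q_values, dtype=float)          # copy, so the input is untouched
    q[action] += alpha * (reward - q[action])    # standard Q-learning update
    return omega * q                             # global forgetting

q_next = forgetting_step([0.5, 0.2], action=0, reward=1.0, alpha=0.5, omega=0.9)
```

With $\omega < 1$ every Q-value is pulled towards zero on each trial, including the arm that was not chosen, so the long-run estimates sit below what the plain update alone would give.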

In terminal: `python fitRLmodel_rmse.py -m forgetting -c 70 --sims_per_set 1000 --search_iterations 50`

- Note that this will take longer to run locally, given the increased number of parameters with $\omega$ now included in parameter combination iterations. 

### 70-30 'Forgetting' Agent
![Best fit for 70-30 condition](plot_forgetting_70-30_fit_RMSE.png)



- RMSE = 0.103 (poorer fit than model without forgetting)
- $\alpha$ (learning rate) = 0.345, $\epsilon_{min}$ = 0.01, Decay = 0.990, $\omega$ (Forgetting) = 0.978 (small amount of forgetting)
- The learning rate is lower, which makes sense because the agent is essentially forgetting some of what it has learned, though only by a little bit. This model does not improve upon the model without forgetting, suggesting we should pursue a different model. 

![70-30 Q-Values](plot_forgetting_70-30_Q_Values.png)

- By examining the estimated Q-values, we can also see that this model just isn't as good at estimating the true Q-value, and underestimates the true value more than the previous model did.

### 80-20 'Forgetting' Agent

![Best fit for 80-20 condition](plot_forgetting_80-20_fit_RMSE.png)


- RMSE = 0.093 (better than for the 70-30 condition, but still worse than the original model without forgetting)
- $\alpha$ (Learning Rate) = 0.335, $\epsilon_{min}$ = 0.01, Decay = 0.952, Forgetting = 0.997
- These parameter estimates suggest a slower learning rate, but also less stochasticity.

![Q-learning 80-20](plot_forgetting_80-20_Q_Values.png)

- The plot of Q-values over time also suggests slower learning overall, especially about the rewarding arm.

### 90-10 'Forgetting' Agent

![Best fit for 90-10 condition](plot_forgetting_90-10_fit_RMSE.png)

- RMSE = 0.149 (poor fit)

![Q-learning](plot_forgetting_90-10_Q_Values.png)

- Because of the forgetting parameter, the Q-values are consistently being decayed, leading to underestimation of the value of both the rewarding arm and the poor arm.

- we were still not able to capture the U-shaped, stochastic learning curve shown by the average learning mouse - an additional consideration is to model anticipation - where mice may learn, overall, that a switch in the rewarding arm occurs at some point throughout the blocks of trials. 


## 3) Q-Learning with Anticipation of Switch
- The decay of Q-values (forgetting) is not really improving the models.
- It could be that, because of the switching, mice are learning that the conditions may soon change, so they change their behaviour in anticipation of this, no longer just picking the optimal option (because it may change).
- We can incorporate this with Boltzmann (softmax) exploration to account for some of the additional mouse stochasticity. Rather than always picking the arm with the highest Q-value, the agent chooses each arm with a probability that depends on its Q-value, so the lower-valued arm is still sampled some of the time. Since the 'bad' arm has been neglected so much by previous agents, we can create a new agent that is incentivised to still explore it even though there is good evidence that it is not rewarding.
- We will remove the forgetting factor (which was not useful) and instead include an anticipation rate $\psi$, which becomes active after a specified anticipation trial.
- Visually, it looks like the anticipation begins around trial 200-250.
- Mathematically:

$$P(a_i, t) = \frac{e^{Q_t(a_i) / \tau_t}}{e^{Q_t(a_1) / \tau_t} + e^{Q_t(a_2) / \tau_t}}$$

- where $\tau_t$ corresponds to the 'temperature' on trial $t$. A high temperature encourages exploration (dividing by a large $\tau$ shrinks the differences between Q-values, so the choice probabilities move towards 50-50), while a low temperature makes the agent prefer exploitation (greedy).
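The effect of the temperature on the choice probabilities can be illustrated with a short sketch. The function name is an assumption, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the resulting probabilities.

```python
import numpy as np

def boltzmann_probabilities(q_values, tau):
    """Softmax choice probabilities over arms at temperature tau."""
    scaled = np.asarray(q_values, dtype=float) / tau
    scaled -= scaled.max()        # stability shift; probabilities are unchanged
    weights = np.exp(scaled)
    return weights / weights.sum()

q_values = [0.8, 0.2]
p_cold = boltzmann_probabilities(q_values, tau=0.05)   # low temperature
p_hot = boltzmann_probabilities(q_values, tau=10.0)    # high temperature
```

At the low temperature almost all of the probability goes to the higher-valued arm, while at the high temperature the two arms are chosen at close to 50-50.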

- $\tau$, however, depends on the trial number, and in particular on the point at which the agent/mouse starts to anticipate that a switch is coming soon. We record this as the anticipation_trial. From this point, the agent should increase exploratory behaviour.

- Mathematically:

$$\tau_{t+1} = \begin{cases} \max(decay \cdot \tau_t , \tau_{min}) & \text{if } t < anticipateTrial \text{ and } \tau_t > \tau_{min} \\ \min(\psi \cdot  \tau_t , \tau_{max}) & \text{if } t \ge anticipateTrial \text{ or } \tau_t \le \tau_{min} \end{cases}$$

- where $\psi$ is the anticipation rate, which scales up stochastic, exploratory behaviour by increasing the temperature parameter $\tau$.
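The piecewise temperature rule can be written directly as a small function. The sketch below uses the decay rate, anticipation trial, anticipation rate and maximum temperature reported for the 70-30 fit later in this document; the function name, the starting temperature (set to $\tau_{max}$) and $\tau_{min} = 0.05$ are assumptions, as they are not reported.

```python
def next_temperature(tau, trial, anticipate_trial, decay, psi, tau_min, tau_max):
    """Apply the two-case temperature rule for one trial."""
    if trial < anticipate_trial and tau > tau_min:
        # Before anticipation: cool down towards the floor.
        return max(decay * tau, tau_min)
    # From the anticipation trial onwards (or at the floor): heat up to the cap.
    return min(psi * tau, tau_max)

tau = 0.544            # assumed starting temperature (tau_max from the 70-30 fit)
trajectory = []
for trial in range(300):
    trajectory.append(tau)
    tau = next_temperature(tau, trial, anticipate_trial=169, decay=0.991,
                           psi=1.002, tau_min=0.05, tau_max=0.544)
```

Under these values the temperature falls until trial 169 and then slowly rises again, which is what lets the choice curve bend back towards stochastic behaviour late in the block. Read literally, the second case also applies whenever $\tau_t \le \tau_{min}$ before the anticipation trial, so a temperature that reaches the floor early would be nudged up by $\psi$ and then decayed back.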

- This model is specified using the terminal command below, with fewer iterations because of computational constraints: 

`python fitRLmodel_rmse.py -m boltzmann_anticipation -c 70 --sims_per_set 500 --search_iterations 30`

### 70-30 Boltzmann Anticipatory Agent (30 parameter iterations)

![Best fit for 70-30 condition](plot_boltzmann_anticipation_70-30_fit_RMSE.png)

- RMSE = 0.096. Again, this is worse than the original model's RMSE, but visually we can see it captures mouse stochasticity in later trials (t=169 onwards) a bit better.
- $\alpha$ (learning rate) = 0.389, max temperature $\tau$ = 0.544, anticipation_trial = 169, anticipation rate $\psi$ = 1.002, decay rate = 0.991
- This suggests a similar, moderate learning rate in the 70-30 condition, with very slight anticipation. Visually, the U-shape is more apparent than in the previous models.


![Q-learning graph](plot_boltzmann_anticipation_70-30_Q_Values.png)

- This model, again, slightly underestimates the true Q-values, suggesting that mice are acting a little conservatively. It also greatly overestimates how poor the 'bad' arm is, even though that arm still provides rewards 30% of the time.

### 80-20 Boltzmann-Anticipatory Agent

![Best fit for 80-20 condition](plot_boltzmann_anticipation_80-20_fit_RMSE.png)


- RMSE = 0.083 (again, poorer performance than just the simplest epsilon-greedy decay model, but still a good fit)
- $\alpha$ (learning Rate) = 0.27, Max $\tau$ (Temperature) = 1.611, Anticipation Trial = 155,  Anticipation Rate = 1.002 
- This has a lower learning rate, and a high max temperature - promoting more stochasticity.

![Q-learning plot](plot_boltzmann_anticipation_80-20_Q_Values.png)

- The Q-values suggest fairly fast learning (within roughly the first 25 trials), and a consistent Q-value estimate hovering close to the true values.

### 90-10 Boltzmann Anticipation Agent

![Best fit for 90-10 condition](plot_boltzmann_anticipation_90-10_fit_RMSE.png)

- RMSE = 0.140 (poor fit as we've seen with all models for this condition, suggesting overall it is not a good condition to have mice take part in if we want to model behaviour)
- $\alpha$ (learning rate) = 0.665, Max $\tau$ (temp) = 0.693, Anticipation Trial = 245, Anticipation Rate = 1.004
- The RL model learns a lot faster that the 90% rewarding arm is better, but the mice seem to still show some stochastic behaviour, more than in the objectively more uncertain trials

![Q-learning](plot_boltzmann_anticipation_90-10_Q_Values.png)

## 4) Q-Learning Agent with Anticipation, Perseverance and Boltzmann Exploration
- Boltzmann exploration may be better suited than the epsilon-greedy approach to capturing the stochastic mouse behaviour, even though it has not yet improved the RMSE.
- Here we use $\tau$ to scale down the value of actions we know a lot about, which promotes exploratory behaviour.

### 80-20 Condition

![Best fit for 80-20 condition](plot_boltzmann_anticipation_80-20.png)


- This model appears to capture stochasticity a bit better, but the RMSE is not really better.