Hippocampus and striatum show distinct contributions to longitudinal changes in value-based learning in middle childhood

The hippocampal-dependent memory system and striatal-dependent memory system modulate reinforcement learning depending on feedback timing in adults, but their contributions during development remain unclear. In a 2-year longitudinal study, 6-to-7-year-old children performed a reinforcement learning task in which they received feedback immediately or with a short delay following their response. Children's learning was sensitive to feedback timing, as reflected in their reaction times and in the inverse temperature parameter, which quantifies value-guided decision-making. They showed longitudinal improvements towards more optimal value-based learning, and their hippocampal volume showed protracted maturation. Better delayed model-derived learning covaried with larger hippocampal volume longitudinally, in line with the adult literature. In contrast, larger striatal volume in children was associated with both better immediate and better delayed model-derived learning longitudinally. These findings show, for the first time, an early hippocampal contribution to the dynamic development of reinforcement learning in middle childhood, with neurally less differentiated and more cooperative memory systems than in adults.


Introduction
As children enter school during middle childhood, they must learn to act appropriately in new situations through feedback. For example, children must learn to raise their hand before speaking during class. The teacher may reinforce this behavior immediately or with a delay, which raises the question of whether feedback timing modulates their learning. Here, reinforcement learning (RL; Sutton and Barto, 2018) provides a useful mechanistic framework to describe such feedback-driven value-based learning and decision-making. RL models allow us to explicitly test for the influence of separate components during value-based learning, such as model-free and model-based learning (Gläscher et al., 2010), social and non-social learning (Bolenz et al., 2017; Zhang and Gläscher, 2020), or the contribution of different memory systems (Foerde and Shohamy, 2011; Packard and Goodman, 2013; Goodman and Packard, 2016).
The role of feedback timing has previously been studied in relation to memory systems. The memory systems account is a theoretical framework proposing that different types of memory are supported by distinct neural systems in the brain. Specifically, this account distinguishes two memory systems: a hippocampal-dependent system and a striatal-dependent system. These systems modulate memory and value-based learning, and their interactive development has been of particular interest to developmental research (Davidow et al., 2016; Hartley et al., 2021). In adults, the hippocampal-dependent memory system has been shown to contribute to episodic memory during reinforcement learning and is more engaged when feedback is presented with a delay (Packard and Goodman, 2013; Packard et al., 2018; Schwabe and Wolf, 2013), as opposed to the striatal-dependent memory system, which is more engaged after immediate feedback and supports habitual memory (Foerde and Shohamy, 2011; Foerde et al., 2013; Höltje and Mecklinger, 2020; Lighthall et al., 2018). Specifically, hippocampal activation was greater during delayed than during immediate feedback, whereas striatal activation was greater during immediate than during delayed feedback (Foerde and Shohamy, 2011). The engagement of the hippocampus during delayed feedback was further supported by enhanced episodic memory for incidentally presented objects compared to objects presented with immediate feedback. Taken together, findings from adult studies suggest that feedback timing modulates the engagement of the hippocampal and striatal memory systems during value-based learning. Given the differential developmental trajectories of these systems and their impact on reinforcement learning and memory, it is important to understand whether children show feedback timing modulations similar to those previously found in adults. In addition, whether such feedback timing modulation changes over time remains
largely unexplored. To this end, in this study, we examined the contributions of hippocampal and striatal structural volumes to the longitudinal development of reinforcement learning across two years in 6-to-7-year-old children. We first introduce the key parameters in reinforcement learning and then review the existing literature on developmental trajectories in reinforcement learning as well as on the hippocampus and striatum, our two brain regions of interest.
Reinforcement learning behavior modulated by feedback timing can be modeled computationally using at least three parameters that reflect feedback-based learning and decision-making. For feedback-based learning, a learning rate parameter determines the extent to which the reward prediction error, defined as the difference between the received reward and the expected reward, influences the update of future choice values. A higher learning rate emphasizes recent outcomes, whereas a lower learning rate reflects learning integrated over a longer outcome history (Zhang et al., 2020). Value updates may further depend on an outcome sensitivity parameter that scales the individual magnitude of received rewards. Finally, in decision-making, the inverse temperature parameter plays a key role in determining the tendency to select the more valuable choice and quantifies choice stochasticity. A higher inverse temperature reflects more value-guided, deterministic choice behavior, whereas a lower inverse temperature reflects more random choices. Learning rates and inverse temperature have been studied extensively across development, mainly in cross-sectional studies showing mixed findings regarding their age gradients (Nussenbaum and Hartley, 2019). One study reported lower learning rates in children compared to adolescents (Decker et al., 2015), while other studies found no differences (Javadi et al., 2014; Palminteri et al., 2016) or even higher learning rates in children (Davidow et al., 2016; Master et al., 2020). Developmental differences regarding the inverse temperature parameter are slightly more consistent, with studies reporting either no differences (Davidow et al., 2016; Hauser et al., 2015; Moutoussis et al., 2018; van den Bos et al., 2012) or a higher inverse temperature with age, suggesting that behavior becomes increasingly value-guided and less explorative (Decker et al., 2015; Javadi et al., 2014; Palminteri et al., 2016; Rodriguez Buritica et al., 2019). To the best of our
knowledge, outcome sensitivity has not been modeled computationally across development. However, studies that linked striatal reward activation to self-reported reward sensitivity showed increasing sensitivity from childhood to adolescence (Galván, 2013; van Duijvenvoorde et al., 2014).
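To make these three parameters concrete, the value update and choice rule they govern can be sketched as a generic delta-rule and softmax formulation (an illustrative sketch, not the authors' exact model specification; the names for learning rate, outcome sensitivity, and inverse temperature follow the text):

```python
import math

def update_value(value, reward, alpha, rho):
    """Delta-rule update: scale the outcome by sensitivity rho, then move
    the value toward it in proportion to the learning rate alpha."""
    prediction_error = rho * reward - value
    return value + alpha * prediction_error

def softmax_choice_prob(values, tau):
    """Probability of choosing each option given inverse temperature tau.
    Higher tau -> more deterministic, value-guided choices."""
    exps = [math.exp(tau * v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# A higher learning rate weights the most recent outcome more strongly
v_slow = update_value(0.5, 1.0, alpha=0.05, rho=1.0)  # -> 0.525
v_fast = update_value(0.5, 1.0, alpha=0.50, rho=1.0)  # -> 0.75

# A higher inverse temperature sharpens choice toward the better option
p_random = softmax_choice_prob([0.6, 0.4], tau=1.0)
p_guided = softmax_choice_prob([0.6, 0.4], tau=15.0)
```

With a small value difference (0.6 vs. 0.4), a low inverse temperature yields near-random choice probabilities, whereas a high inverse temperature makes the better option chosen almost deterministically.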
In general, the inconsistencies regarding developmental differences in parameters may be due to their dependency on model and task properties (Eckstein et al., 2021), which could be reconciled by comparing developmental changes to simulation-based optimal learning (Zhang et al., 2020). Such comparisons acknowledge that optimal parameter values vary depending on the context, and it has been suggested that humans develop towards more optimal parameter values from childhood into adulthood (Nussenbaum and Hartley, 2019). Importantly, to our knowledge, previous reinforcement learning studies with children were cross-sectional, and only two studies investigated children under 8 years of age (Decker et al., 2015; Cohen et al., 2020). Cross-sectional studies, in which developmental change is inferred as a between-subject factor, do not capture the dynamics of middle childhood if individual differences are large, whereas longitudinal studies test development as a within-subject factor, which is crucial for uncovering change across time. Thus, longitudinal changes in reinforcement learning in middle childhood, as well as their putative striatal and hippocampal associations, remain unknown. To this end, learning rates, outcome sensitivity, and inverse temperature are relevant computational parameters for studying longitudinal changes in striatal and hippocampal systems during value-based learning.
Striatal and hippocampal contributions to reinforcement learning during middle childhood may differ as these brain regions undergo major developmental changes. Although earlier structural studies with relatively small sample sizes showed large developmental variability and a tendency for an earlier volume peak in the striatum than in the hippocampus (Raznahan et al., 2014; Wierenga et al., 2014; Giedd, 2004; Uematsu et al., 2012; Giedd et al., 2015; Goodman et al., 2014; Goddings et al., 2014), a recent cross-sectional large-scale study was able to contrast striatal and hippocampal trajectories with greater granularity (Dima et al., 2022). These data showed striatal volume peaks in the first decade, which then declined throughout later developmental periods, whereas hippocampal volume showed a more protracted inverted-U-shaped trajectory that peaked in adolescence. Based on these structural findings, striatal and hippocampal systems are expected to develop functionally at different rates (Lavenex and Banta Lavenex, 2013), with habit memory depending on the earlier-developing striatum and episodic memory depending on the later-developing hippocampus (Mandolesi et al., 2009). A direct investigation of the longitudinal development of both memory systems in childhood would shed light on whether the memory systems show a differential engagement similar to that of adults (Foerde and Shohamy, 2011). Such knowledge could be useful to structure learning processes according to developmental status. For example, children's ability to learn from delayed feedback may depend on how well their hippocampus has developed. In the same study sample, we previously reported that children's hippocampal volume was related to their family's income level (Raffington et al., 2019). Additionally, previous research has shown that stress can reduce the effectiveness of the hippocampal-dependent memory system (Schwabe and Wolf, 2013). This suggests that environmental factors such as income and stress may play a role in shaping how well children learn from delayed feedback, particularly through their impact on hippocampal development. By identifying the specific environmental factors that impact children's learning and brain development, we can identify risk groups and tailor interventions to ameliorate adverse effects.

[Figure 1. (A) For immediate feedback (top panel), between choice response and feedback, cue and choice were presented for 1 s. At feedback, a green frame around the incidentally encoded object indicated a positive outcome, which appeared in 87.5% of the trials when selecting the square-shaped lolli for this example cue. For delayed feedback (bottom panel), the delay phase between choice response and feedback lasted for 5 s. A red frame around the object indicated a negative outcome and appeared in 87.5% of the trials when selecting the square-shaped lolli for this example cue. (B) For each feedback condition, two action-outcome contingencies were learned to balance a potential choice bias. With the four task versions, the cues and outcome contingencies were counterbalanced across participants.]
This study aimed to explore the development of value-based learning in children and its relationship with structural brain development over time. We hypothesized that the timing of feedback would modulate children's learning in a commonly used reinforcement learning task (see Figure 1), and that such modulation can be captured by reinforcement learning (RL) model parameters. Additionally, we predicted that children's longitudinal development would shift towards more optimal value-based learning behavior. Regarding structural brain development, we expected the striatum to be relatively mature by middle childhood compared to the protracted hippocampal maturation. Our second objective was to investigate the relationship between value-based learning and structural brain development using longitudinal structural equation modeling. We anticipated differentiated brain-cognition links between brain volume and value-based learning. Specifically, we predicted that immediate feedback learning would be more strongly associated with striatal volume, whereas hippocampal volume would be more closely linked to delayed feedback learning and the facilitation of episodic memory encoding. Finally, we examined how these brain-cognition dynamics would change over time by analyzing their longitudinal changes.

Behavioral results
First, we were interested in whether children showed behavioral differences between waves and feedback timing conditions. A descriptive overview is provided in Table 1 and Figure 2. The details of the reported GLMM models, including the random effects structure and the effects of age and sex, are described in Appendix 2. Since some children were poor learners who failed to reach 50% average accuracy in their last 20 trials (13 children at wave 1 and 6 children at wave 2), we also performed behavioral analyses with a reduced dataset, in which results remained unchanged (Appendix 6).
To summarize, children's average accuracy improved over the 2 years, while their win-stay probability increased and their lose-shift probability decreased between waves. Children responded faster to cues paired with delayed feedback compared to cues paired with immediate feedback, and they became faster in their decision-making across waves (see mixed model effects overview in Table 1). Of note, reaction times were largely uncorrelated with accuracy and switching behavior (win-stay, lose-shift), while accuracy and switching behavior showed significant correlations at both waves (Figure 2D).

Modeling results
Children's behavior was best described by value-based learning

We conducted a 2-step sequential procedure for model development and model selection. Model comparison using leave-one-out cross-validation showed evidence in favor of the value-based learning model, reflected in the highest expected log pointwise predictive density and highest model weights, confirming that children's learning behavior in the longitudinal data can generally be better described by a value-based model than by a heuristic strategy model (elpd_loo = -15154.9, Pseudo-BMA+ = 1, Table 2). Children whose individual fit was better for a heuristic model (wsls) than for the value-based model (vbm1) were more likely at both waves to be poor learners (defined as an accuracy below 50% in the last 20 trials). Taken together, children's learning behavior was best described by a value-based model, and the heuristic strategy model captured more poor learners than the value-based model.

[Table 2 note: elpd_loo = sum of the expected log pointwise predictive density over all 33,460 trials, including all participants and waves, and trial mean. Pseudo-BMA+ = model weight for relative model evidence using Bayesian model averaging, stabilized by Bayesian bootstrap with 100,000 iterations. α = learning rate across feedback timing; τ/ls = inverse temperature and learning score separated by feedback timing condition.]
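As a contrast to value-based learning, the heuristic comparison strategy can be illustrated with a probabilistic win-stay/lose-shift rule for a two-option task (a generic sketch; the stay/shift probabilities below are illustrative and not taken from the paper's wsls model):

```python
def wsls_stay_prob(prev_reward, p_stay_win=0.9, p_shift_lose=0.9):
    """Probability of repeating the previous choice under a probabilistic
    win-stay/lose-shift rule: stay after wins, shift after losses."""
    if prev_reward > 0:
        return p_stay_win           # win -> mostly stay
    return 1.0 - p_shift_lose       # loss -> mostly shift

# Unlike a value-based learner, this rule reacts only to the most recent
# outcome and carries no value estimate integrated across trials.
```

Because the rule conditions only on the last outcome, a single unrewarded trial can flip its choice, which is one way such a heuristic can resemble the behavior of poor learners.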

Feedback timing modulated choice stochasticity
Model vbm3 (1α2τ) showed the largest model evidence, reflected in the highest expected log pointwise predictive density and highest model weights, suggesting that feedback timing affected the inverse temperature, but not the learning rate or outcome sensitivity (elpd_loo = -15045.3, Pseudo-BMA+ = 0.73, Table 2). Table 3 and Figure 3A provide a descriptive overview of the winning model parameters. Of note, there were only small differences in model fit (elpd_loo) to the second-best model (vbm7, 1α2ρ, ∆elpd_loo = -2.93, SE = 2.92, Pseudo-BMA+ = 0.24), which suggests a potentially separable feedback timing effect on outcome sensitivity. We also performed the model comparison with a reduced dataset, in which the winning model remained the same (Appendix 6). The average inverse temperature did not differ by feedback condition but showed large within-person condition differences at both waves, indicating individual differences in feedback timing modulation (wave 1: ∆τ(delayed−immediate) Mean = 0.22, SD = 3.80, Range = 21.74; wave 2: ∆τ(delayed−immediate) Mean = 0.35, SD = 3.70, Range = 24.03). The correlations between the parameters are reported in Appendix 3.
Since reaction times were predicted by feedback timing behaviorally, and the inverse temperature is assumed to reflect decision-making, we were interested in whether differences in reaction time were related to differences in inverse temperature. Indeed, at both waves, children who responded faster during delayed compared to immediate feedback had a higher inverse temperature for delayed compared to immediate feedback (wave 1: r = -0.261, t(138) = -3.18, p = 0.002; wave 2: r = -0.345, t(124) = -4.10, p < 0.001; Figure 3B). Taken together, children's learning behavior was best described by a value-based model in which feedback timing modulated individual differences in the choice rule during value-based learning. Interestingly, the differences in the choice rule and in reaction time were correlated. Specifically, more value-guided choice behavior (i.e., higher inverse temperature) was related to faster responses during delayed relative to immediate feedback, suggesting a link between model parameter and behavior in relation to feedback timing.

[Figure 3. (A) The inverse temperature τ, but not the learning rate α, was separated by feedback timing, and both increased in value between waves (top panel). The condition difference in the inverse temperature did not differ on average but showed individual differences (bottom left panel). (B) The condition difference in the inverse temperature correlated with reaction time; that is, a higher delayed compared to immediate inverse temperature was related to faster delayed compared to immediate reaction times.]

Children's value-based learning became more optimal
Next, we compared the parameter space according to model simulation (Figure 4A) with the empirical posterior parameters fitted by the winning model (Table 3, Figure 4B) to determine whether children's value-based learning moved towards more optimal parameter combinations. Both fitted and simulated parameter combinations allowed us to derive a learning score that captured learning performance according to the winning value-based model. Note that the learning score was defined as the average choice probability for the more rewarded choice option. We refer to these model-derived choice probabilities as a learning score, since they reflect value-based learning and combine information about learned values, which depend on the learning rate, and about values translated into choice probabilities, which depend on the inverse temperature. Thus, a higher learning score reflects more optimal value-based learning. We simulated 10,000 parameter combinations and created a learning score map for each parameter combination (Figure 4A). The optimal parameter combination was at a learning rate of α = 0.29 and an inverse temperature of τ = 19.8, with an average learning score of 96.5% (Figure 4A). Children's fitted learning rates ranged from 0.01 to 0.22 and inverse temperatures from 6.73 to 18.70, outside the parameter space of a learning score above 96% (Table 3 and Figure 4A). The average longitudinal increases in learning rate and inverse temperature were mirrored by average increases in the learning scores, confirming our prediction that these parameters developed towards optimal value-based learning (arrow in Figure 4B). We further found that the average longitudinal change in win-stay and lose-shift proportions also developed towards more optimal value-based learning (Appendix 4).
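The grid-search logic behind such a learning score map can be sketched as follows (a simplified simulation under assumed task settings, e.g. 32 trials per cue and an 87.5% reward contingency as described for the task; not the authors' exact simulation code):

```python
import math
import random

def simulate_learning_score(alpha, tau, n_trials=32, p_reward=0.875, seed=0):
    """Simulate a two-choice probabilistic task and return the mean
    probability of choosing the more rewarded option (the learning score)."""
    rng = random.Random(seed)
    values = [0.0, 0.0]                 # option 0 is the better option
    probs_better = []
    for _ in range(n_trials):
        # softmax choice probability for the better option
        p0 = 1.0 / (1.0 + math.exp(-tau * (values[0] - values[1])))
        probs_better.append(p0)
        choice = 0 if rng.random() < p0 else 1
        p = p_reward if choice == 0 else 1.0 - p_reward
        reward = 1.0 if rng.random() < p else 0.0
        # delta-rule value update
        values[choice] += alpha * (reward - values[choice])
    return sum(probs_better) / n_trials

# Sweep a small parameter grid and keep the best-scoring combination,
# mirroring the construction of a learning score map
grid = [(a / 100, t) for a in range(5, 55, 5) for t in (1, 5, 10, 20)]
best_alpha, best_tau = max(grid, key=lambda p: simulate_learning_score(*p))
```

Averaging each grid cell over many random seeds (rather than the single seed used here for brevity) would smooth the map and is closer to how a stable optimum over 10,000 combinations would be located.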

Model validation
To validate our winning model vbm3, we estimated its predictive accuracy by comparing one-step-ahead model predictions with the choice data. The one-step-ahead predictions of the winning model captured children's choices well overall, with predictive accuracies of 65.3% at wave 1 and 75.7% at wave 2 (Figure 4C). Further, our winning model showed good parameter recovery for the learning rate (r = 0.85) and the inverse temperature (r = 0.75-0.77). Our winning model showed excellent model recovery on the group level (100%) when compared to a set of models used during model comparison (vbm1, vbm7, wsls). The individual model recovery was lower (58%), with 35% of the simulations of the winning model being best fit by our baseline model vbm1 with a single inverse temperature, which likely reflects the noisy property of the inverse temperature (Appendix 1).
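Parameter recovery of the kind reported here follows a simulate-then-refit logic, which can be sketched for one synthetic subject with a simple maximum-likelihood grid search (an illustrative sketch; the paper used hierarchical Bayesian estimation, and the parameter values and grid below are assumptions):

```python
import math
import random

def simulate_choices(alpha, tau, n_trials=200, p_reward=0.875, seed=1):
    """Generate synthetic choices and rewards from a Q-learning agent
    with known (alpha, tau)."""
    rng = random.Random(seed)
    q = [0.0, 0.0]
    data = []
    for _ in range(n_trials):
        p0 = 1.0 / (1.0 + math.exp(-tau * (q[0] - q[1])))
        c = 0 if rng.random() < p0 else 1
        p = p_reward if c == 0 else 1.0 - p_reward
        r = 1.0 if rng.random() < p else 0.0
        data.append((c, r))
        q[c] += alpha * (r - q[c])
    return data

def neg_log_lik(alpha, tau, data):
    """Negative log-likelihood of the observed choices under (alpha, tau)."""
    q = [0.0, 0.0]
    nll = 0.0
    for c, r in data:
        p0 = 1.0 / (1.0 + math.exp(-tau * (q[0] - q[1])))
        p_choice = p0 if c == 0 else 1.0 - p0
        nll -= math.log(max(p_choice, 1e-12))
        q[c] += alpha * (r - q[c])
    return nll

# Simulate with known parameters, then refit by grid search; the
# correlation of true and refitted parameters over many simulated
# subjects is the parameter recovery reported in such analyses.
true_alpha, true_tau = 0.3, 8.0
data = simulate_choices(true_alpha, true_tau)
grid = [(a / 20, t) for a in range(1, 20) for t in range(1, 21)]
fit_alpha, fit_tau = min(grid, key=lambda p: neg_log_lik(p[0], p[1], data))
```

Repeating this for many simulated subjects and correlating true with refitted values yields recovery coefficients analogous to the r = 0.75-0.85 reported above; model recovery additionally refits each simulated dataset with every candidate model and checks which one wins.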

Longitudinal brain-cognition links

Significant longitudinal change in brain and cognition
We first performed univariate LCS model analyses to estimate a latent change score of immediate and delayed learning scores as well as striatal and hippocampal volumes (see descriptive changes in Figure 5B, C). All four variables of interest showed significant positive mean changes and variances, and all univariate models provided a good fit to the data (see Appendix 5). This allowed us to further relate the differences in structural brain changes to changes in learning.

Hippocampal volume exhibited more protracted development during middle childhood
We next fitted a bivariate LCS model to compare striatal and hippocampal change scores. We theorized that by middle childhood, the striatum would be relatively mature, whereas the hippocampus continues to develop. We progressively constructed multiple LCS models to test this idea. First, the bivariate LCS model provided a good data fit (χ²(14) = 10.09, CFI = 1.00, RMSEA (CI) = 0 (0-0.06), SRMR = 0.04). We then fitted two constrained models to see whether setting the mean striatal change or the mean hippocampal change to 0 would lead to a drop in model fit. Compared to the unrestricted model, the constrained model that assumed no striatal change did not lead to a drop in model fit (Δχ²(1) = 2.74, p = 0.098), whereas the model that assumed no hippocampal change dropped in model fit (Δχ²(1) = 12.69, p < 0.001). Finally, we tested the more stringent assumption of equal change for striatal and hippocampal volumes; this model dropped in model fit compared to the unrestricted model (Δχ²(1) = 18.04, p < 0.001), suggesting that striatal and hippocampal change differed. Together, these results support our postulation of separable maturational brain trajectories in our study sample, suggesting that the hippocampus continued to grow in middle childhood, whereas striatal volume increased less.

Hippocampal and striatal volume showed distinct associations to learning
We fitted a four-variate LCS model to test our prediction of selective brain-cognition links. Specifically, we assumed a larger contribution of striatal volume to immediate learning and a larger contribution of hippocampal volume to delayed learning. The LCS model provided a good data fit (χ²(27) = 15.4, CFI = 1.00, RMSEA (CI) = 0 (0-0.010), SRMR = 0.045), and all relevant paths are shown in Figure 5D (see Table 4 for a detailed model overview). For the striatal associations with cognition, we found that wave 1 striatal volume covaried with both the immediate learning score and the delayed learning score (ϕ STRw1,LSi,w1 = 0.19, z = 2.52, SE = 0.07, p = 0.012; ϕ STRw1,LSd,w1 = 0.18, z = 2.37, SE = 0.07, p = 0.018). Constraining the striatal association with immediate learning to 0 worsened the model fit relative to the unrestricted model (Δχ²(1) = 5.66, p = 0.017), as did constraining the striatal association with delayed learning to 0 (Δχ²(1) = 5.14, p = 0.023). In summary, larger striatal volume was associated with better learning scores for both immediate and delayed feedback. This pattern remained the same in the results of the reduced dataset (Appendix 6).
Hippocampal volume, on the other hand, only covaried with delayed learning at wave 1 (ϕ HPCw1,LSd,w1 = 0.14, z = 2.05, SE = 0.07, p = 0.041), not with the immediate learning score (ϕ HPCw1,LSi,w1 = 0.12, z = 1.68, SE = 0.07, p = 0.092). Fixing the path between hippocampal volume and delayed learning to 0 worsened the model fit relative to the unrestricted model (Δχ²(1) = 4.19, p = 0.041), but constraining its path to immediate learning to 0 did not (Δχ²(1) = 2.94, p = 0.086). This suggests that larger hippocampal volume was specifically associated with better delayed learning. In the results of the reduced dataset, the hippocampal association with the delayed learning score was no longer significant, suggesting a weakened pattern when excluding poor learners (Appendix 6). It is likely that the exclusion reduced the group variance in hippocampal volume and delayed learning score in the model.
As a next step, the associations of striatum and hippocampus with immediate or delayed learning were directly compared against each other. A model equal-constraining the striatal and hippocampal paths to immediate learning (Δχ²(1) = 0.41, p = 0.521) and another model equal-constraining these paths to delayed learning (Δχ²(1) = 0.14, p = 0.707) did not lead to a worse model fit compared to the unrestricted model, which suggests that the brain-cognition links overlap considerably. This is in line with the high wave 1 covariance and change-change covariance within the brain and cognition domains (see Table 4). We found no longitudinal links between the brain and cognition domains, which suggests that the brain-cognition links found at wave 1 remained longitudinally stable (see Appendix 5 for an exploratory LCS model that related the model parameters to striatal and hippocampal volume). Taken together, the confirmatory LCS model results were in line with our prediction of a relatively larger involvement of the hippocampus during delayed feedback learning, but the findings on striatal volume disconfirmed a selective association with immediate feedback learning and suggest a more general role of the striatum in both learning conditions.
No evidence for enhanced episodic memory during delayed feedback

Finally, we investigated whether a hippocampal contribution during delayed feedback would selectively enhance episodic memory. Episodic memory, as measured by individual corrected object recognition memory (hits - false alarms) for confident ('sure') ratings, showed a trend towards better memory for items shown in the delayed feedback condition (β feedback=delayed = 0.009, SE = 0.005, t(137) = 1.80, p = 0.074; see Figure 5A). Note that in the reduced dataset, delayed feedback predicted enhanced item memory significantly (Appendix 6). The inclusion of poor learners in the complete dataset may have weakened this effect, because their hippocampal function was worse and not involved in learning (or encoding), regardless of feedback timing. To summarize, there was inconclusive support for enhanced episodic memory during delayed compared to immediate feedback, calling for future studies to test the postulation of a selective association between hippocampal volume and delayed feedback learning.

Discussion
In this study, we examined the longitudinal development of value-based learning in middle childhood and its associations with striatal and hippocampal volumes, which were predicted to differ by feedback timing. Children improved their learning over the 2-year study period. Behaviorally, learning improved through an increase in accuracy and a reduction in reaction time (i.e., faster responses). Further, children's switching behavior improved through an increase in win-stay and a decrease in lose-shift behavior. Computationally, learning was enhanced by an increase in learning rate and inverse temperature, which together constituted more optimal value-based learning. Further, feedback timing specifically modulated the inverse temperature. In terms of brain structure, we found that longitudinal changes in hippocampal volume were larger than those in striatal volume, suggesting more protracted hippocampal maturation. The brain-cognition links were longitudinally stable and partially confirmed our hypotheses. In line with the adult literature and our assumption, hippocampal volume was more strongly associated with delayed feedback learning. Contrary to our expectations, episodic memory performance was not enhanced under delayed compared to immediate feedback. Furthermore, striatal volume was unexpectedly associated with both immediate and delayed feedback learning, suggesting a common involvement of the striatum in value-based learning across timescales in middle childhood.
Children's learning improvement between waves was described behaviorally by increased win-stay and decreased lose-shift behavior. Our finding is in line with cross-sectional studies in the developmental literature that reported increased learning accuracy and win-stay behavior (Chierchia et al., 2023; Habicht et al., 2022). Our longitudinal dataset with younger children further suggests that learning change is accompanied not only by increased win-stay but also by decreased lose-shift behavior. We found lower learning performance and less optimal switching behavior in girls compared to boys, which could point to sex differences in reinforcement learning during middle childhood (Appendix 2). Previous studies have found both male and female advantages depending on age and the type of learning task (Mandolesi et al., 2009; Overman, 2004; Evans and Hampson, 2015). Alternatively, the sex differences may have been driven by confounding variables not included in the analysis.
Computationally, we found a longitudinally increased and more optimal learning rate and inverse temperature, as shown by simulation data, adding to the growing literature on developmental reinforcement learning (Nussenbaum and Hartley, 2019). Adult studies that examined feedback timing during reinforcement learning reported average learning rates ranging from 0.12 to 0.34 (Foerde and Shohamy, 2011; Höltje and Mecklinger, 2020; Lighthall et al., 2018), which are much closer to the simulated optimal learning rate of 0.29 than children's average learning rates of 0.02 and 0.05 at waves 1 and 2 in our study. Therefore, it is likely that individuals approach adult-like optimal learning rates later, during adolescence. However, differences in learning rates across studies have to be interpreted with caution, as differences in task and analysis approach may limit their comparability (Zhang et al., 2020; Eckstein et al., 2021). Task properties such as the trial number per condition differed across studies. Our study included 32 trials per cue in each condition, while in adult studies, the trials per condition ranged from 28 to 100 (Foerde and Shohamy, 2011; Höltje and Mecklinger, 2020; Lighthall et al., 2018). Optimal learning rates in a stable learning environment were around 0.25 for 10-30 trials (Zhang et al., 2020), while another study reported a lower optimal learning rate of around 0.08 for 120 trials (Behrens et al., 2007). This may partly explain why, in our case of 32 trials per condition and cue, simulations called for a relatively high optimal learning rate of 0.29, while in other studies optimal learning rates may be lower. Regarding differences in the analysis approach, the hierarchical Bayesian estimation used in our study produces more reliable results than maximum likelihood estimation (Brown et al., 2020), which had been used in some of the previous adult studies and may have biased results towards extreme values. Taken
together, our study underscores the importance of using longitudinal data to examine developmental change as well as the importance of simulation-based optimal parameters to interpret the direction of developmental change.
Despite a relatively immature hippocampal structure in middle childhood, our results confirmed a longitudinally stable association between hippocampal volume and delayed feedback learning. However, episodic memory in this learning condition was not enhanced. This suggests a developmentally early hippocampal contribution to value-based learning during delayed feedback, which does not modulate episodic memory as much as in adults. Therefore, our study partially extends the findings from the adult literature to middle childhood (Foerde and Shohamy, 2011; Foerde et al., 2013; Höltje and Mecklinger, 2020; Lighthall et al., 2018). The reduced effect of delayed feedback on episodic memory may be due to the protracted course of hippocampal maturation. In an aging study with a similar task, older adults failed to exhibit enhanced episodic memory for objects presented during delayed feedback trials, and they showed no enhanced hippocampal activation during delayed feedback (Lighthall et al., 2018). Therefore, the findings converge nicely for both childhood and older adulthood, periods during which the structural and functional integrity of the hippocampus is known to be less optimal than in younger adulthood (Shing et al., 2010; Keresztes et al., 2017; Ghetti and Bunge, 2012).
Our brain-cognition links were only partially confirmed, as striatal volumes exhibited associations not just with immediate learning scores, as we predicted, but also with delayed learning scores. This result suggests that the striatum may be important for value-based learning in general rather than exhibiting a selective association with immediate feedback learning. This is also what we found in an exploratory analysis that related the striatum to learning rate in general and further predicted longitudinal change in learning rate (Appendix 5). This overall reduced brain-behavior specificity could reflect less differentiated memory systems during development, similar to findings from aging research, in which older adults exhibited stronger striatal and hippocampal co-activation during both implicit and explicit learning, compared to more dissociable brain-behavior relationships in younger adults (Dennis and Cabeza, 2011). Interestingly, even in young adults, clear dissociations between memory systems such as those found in non-human lesion studies are uncommon, and factors like stress modulate their cooperative interaction (Packard and Goodman, 2013; Packard et al., 2018; Schwabe and Wolf, 2013; Ferbinteanu, 2016; White and McDonald, 2002). Further, there are methodological differences from previous studies that could explain why striatal volumes were not uniquely associated with immediate learning in our study. For example, previous studies related reward prediction errors to striatal and hippocampal activation (Foerde and Shohamy, 2011; Höltje and Mecklinger, 2020; Lighthall et al., 2018), whereas we examined individual differences in brain structure and model-derived learning scores. Future functional neuroimaging studies with children could further clarify whether children's memory systems are indeed less differentiated and explain the attenuated modulation by feedback timing. Taken together, compared to the adult literature, our results with children showed that hippocampal structure was associated with delayed feedback learning, but without enhanced episodic memory encoding, while the striatum supported value-based learning in general. These findings point towards a developmental effect of less differentiated and more cooperative memory systems in middle childhood.
Our computational modeling results revealed a separable effect of feedback timing on inverse temperature, which suggests that the memory systems modulated learning during decision-making. The reported behavioral differences in reaction time and their correlation with the inverse temperature further support the idea of a decision-related mechanism, as we found children to respond faster during delayed feedback trials, and faster-responding children also exhibited more value-guided choice behavior (i.e. higher inverse temperature) during delayed compared to immediate feedback. The hippocampus may contribute to a decision-related effect in the delayed feedback condition by facilitating the encoding and retrieval of learned values (Shadlen and Shohamy, 2016). This contrasts with previous event-related fMRI and EEG studies reporting feedback timing modulations at value update (Foerde and Shohamy, 2011; Höltje and Mecklinger, 2020; Lighthall et al., 2018), which may be due to at least two reasons. First, we did not include a functional brain measure to examine differential engagement during the choice and feedback phases. Second, in such a reinforcement learning task, disentangling model parameters of the choice and feedback phases can be challenging, as for the inverse temperature and outcome sensitivity (Browning et al., 2023). Taken together, hippocampal engagement at delayed feedback may enhance outcome sensitivity as well as facilitate choice behavior through improved retrieval of action-outcome associations. A mechanism facilitating retrieval seems especially relevant in our paradigm, in which multiple cues were learned and presented in a mixed order, creating a high memory load. To summarize, our results suggest that feedback timing could modulate decision-making in addition to, or as an alternative to, a mechanism at value update. However, disentangling the effects of inverse temperature and outcome sensitivity is challenging and warrants careful interpretation. Future studies might shed new light by examining neural activations at both task phases, by additionally modeling reaction times using a drift-diffusion approach, or by choosing a task design that allows independent manipulation of these phases and the associated model parameters, for example by using different reward magnitudes during reinforcement learning, or by studying outcome sensitivity without decision-making.
One aim of developmental investigations is to identify the emergence of brain-cognition dynamics, such as the hippocampal-dependent and striatal-dependent memory systems, which have been shown to engage during reinforcement learning depending on the delay in feedback delivery. Our longitudinal study partially confirmed these brain-cognition links in middle childhood, but with less specificity than previously found in adults.
An early existing memory system dynamic, similar to that of adults, is relevant for applying reinforcement learning principles at different timescales. In scenarios such as the classroom, a teacher may comment on a child's behavior immediately after the action or some moments later, on par with our experimental manipulation of 1 s versus 5 s. Within such a short range of delay in teachers' feedback, children's learning ability during the first years of schooling may function equally well and depend on the striatal-dependent memory system. However, we anticipate that reliance on the hippocampus will become even more pronounced when feedback is delayed for longer. Children's capacity for learning over longer timescales relies on the hippocampal-dependent memory system, which is still under development. This knowledge could help to better structure learning according to children's development. Furthermore, probabilistic learning from delayed feedback may be a potential diagnostic tool to examine the hippocampal-dependent memory system during learning in children at risk. Environmental factors such as stress (Schwabe and Wolf, 2013) and socioeconomic status (Raffington et al., 2019; Hackman et al., 2010) have been shown to affect hippocampal structure and function and may contribute to a heightened risk for psychopathology in the long term (Frodl et al., 2010; Lucassen et al., 2017; Rahman et al., 2016). Deficits in hippocampal-dependent learning may be particularly relevant to psychopathology, since dysfunctional behavior may arise from a tendency to prioritize short-term consequences over long-term ones (Levin et al., 2018; Von Siebenthal et al., 2017) and from the maladaptive application of previously learned behavior in inappropriate contexts (Maren et al., 2013). Interestingly, poor learners showed relatively less value-based learning in favor of stronger simple heuristic strategies, and excluding them modulated the hippocampal-dependent associations with learning and memory in our results. More studies are needed to further clarify the relationship between the hippocampus and psychopathology during cognitive and brain development.
Another key question is whether developmental trajectories observed cross-sectionally are also confirmed by longitudinal results, such as for the learning rate and inverse temperature. Our results show developmental improvements in these learning parameters within only 2 years. This suggests that the initial 2 years of schooling constitute a dynamic period for feedback-based learning, in which contingent feedback is important in shaping behavior and development.

Participants
Children and their parents took part in 2 waves of data collection with an interval of about 2 years (Mean = 2.07, SD = 0.17, Range = 1.69-2.68). The inclusion criteria for wave 1 were: attending first or second grade, no psychiatric or physical health disorders, at least one parent speaking fluent German, and born full-term (≥37 weeks of gestation). At wave 1, 142 children (46% female, age Mean = 7.19, SD = 0.46, Range = 6.07-7.98) and their parents or caregivers participated in the study. 140 children were included in the analysis (one child did not complete the probabilistic learning task, and another child was later excluded due to technical problems during the task). A randomly selected subgroup of 90 children (49% female, 100% right-handed) completed magnetic resonance imaging (MRI) scanning at wave 1, and 82 of them contributed structural data after removing scans with excessive movement. At wave 2, 127 children (46% female, age Mean = 9.25, SD = 0.45, Range = 8.30-10.2) continued taking part in the study, while the families of the remaining children could not be contacted or decided not to return to the study. A total of 126 children at wave 2 completed the reinforcement learning task and were included in the analysis. All children at wave 2 were invited for MRI scanning, and 104 of them completed scanning (45% female, 92% right-handed). Ninety-nine children contributed structural data after removing scans with excessive movement. In total, 73 children contributed longitudinal MRI data and 126 children contributed longitudinal learning data. As previously reported for this study sample, we found no systematic bias due to wave 2 dropout (Raffington et al., 2019).

Procedure
The study consisted of a series of cognitive tasks tested during two behavioral sessions, including a reinforcement learning task, and one MRI session at wave 1 (Raffington et al., 2019; Raffington et al., 2020). Two years later, the children underwent one behavioral and one MRI session. MRI scanning was performed within 3 weeks of the behavioral task session. Each session lasted between 150 and 180 min and was scheduled either on weekdays between 2 p.m. and 6 p.m. or during weekends. Before participation, parents provided written informed consent and children gave verbal assent at both waves. All children were compensated with an honorarium of 8 euros per hour.

Reinforcement learning task
Children completed an adapted reinforcement learning task (Foerde and Shohamy, 2011) in which they learned the preferred associations between four cues (cartoon characters) and two choices (round-shaped or square-shaped lolli) through probabilistic feedback (87.5% contingent and 12.5% non-contingent reward probability). In each trial, after an initial inter-trial interval of 0.5 s, a cue and its choice options were presented for up to 7 s until the child made a choice (Figure 1, choice phase). In the delay phase, we manipulated feedback timing: for two cues, the selected choice remained visible for 1 s (immediate feedback condition), whereas for the other two cues, it remained visible for 5 s before feedback was given (delayed feedback condition). A final feedback phase of 2 s indicated a reward by a green frame and a punishment by a red frame. Inside each frame, a unique object picture was shown, which was incidentally encoded and irrelevant to the task. The child was instructed to pay attention to the feedback indicated by the frame color. In an initial practice phase of 32 trials, the child practiced the task with a fifth cartoon character not included in the actual task to avoid practice effects. The experimenter instructed the child to select the choice that was most likely to result in a reward, checked whether the child learned the more rewarded choice during practice, and had the child repeat the practice otherwise to ensure understanding of the task. In the actual task, 128 trials were presented in four blocks with small breaks in between. Cues were presented in a mixed, pseudo-randomized order. A total of 64 unique objects were shown in the feedback phase, each one twice within the same feedback condition. In both feedback conditions, the contingent choice and choice location remained the same for each cue within the task, but were balanced across participants by using four different task versions. At wave 2, four new cues replaced the previous ones to rule out memory effects.
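As an illustrative sketch (in Python, not part of the original study code), the task's probabilistic reward schedule can be expressed as follows; the function name and arguments are our own, chosen for illustration:

```python
import random

def feedback(contingent_choice, choice, p_contingent=0.875):
    """Return 1 (reward, green frame) or 0 (punishment, red frame).

    The contingent choice is rewarded on 87.5% of trials and the
    non-contingent choice on the remaining 12.5%, mirroring the
    task's reward schedule."""
    p_reward = p_contingent if choice == contingent_choice else 1 - p_contingent
    return 1 if random.random() < p_reward else 0
```

Because feedback is probabilistic, even the contingent choice is occasionally punished, which is what makes trial-by-trial value learning necessary.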

Object recognition test
At wave 1, children were additionally tested for recognition memory on the object pictures that were incidentally encoded during reinforcement learning. A total of 80 objects (48 old and 32 new) were presented in randomized order. The 48 old objects (24 per feedback condition) were selected from the 64 objects shown during learning based on two lists, to balance the shown and omitted old objects across task versions. Each old object was shown twice during learning, but if the child failed to respond during learning, no feedback or object was shown in that trial, so some objects only appeared once. These objects were excluded at the individual level (Mean = 2.71 missing objects per child). At recognition, children had 4 response options ('old sure', 'old unsure', 'new unsure', 'new sure') and up to 7 s to respond. The children answered verbally, and the experimenter entered their response. At wave 2, this test was excluded due to time constraints.

Brain volume
We extracted the bilateral brain volumes for our regions of interest, the striatum and hippocampus. The striatal regions included the nucleus accumbens, caudate, and putamen. Structural MRI images were acquired on a Siemens Magnetom TrioTim syngo 3 Tesla scanner with a 12-channel head coil (Siemens Medical AG, Erlangen, Germany) using a 3D T1-weighted Magnetization Prepared Rapid Gradient Echo (MPRAGE) sequence with the following parameters: 192 slices; field of view = 256 mm; voxel size = 1 mm³; TR = 2500 ms; TE = 3.69 ms; flip angle = 7°; TI = 1100 ms. Volumetric segmentation was performed using the FreeSurfer 6.0.0 image analysis suite (Fischl, 2012). Previous studies suggested that software tools based on adult brain templates provide inaccurate segmentation for pediatric samples, which can be improved through the use of study-specific template brains (Phan et al., 2018; Schoemaker et al., 2016). Thus, we created two study-specific template brains (one for each wave) using FreeSurfer's 'make_average_subject' command. This pipeline utilized the default adult template brain registrations of the 'recon-all -all' command to average surfaces, curvatures, and volumes from all subjects into a study-specific template brain. All subjects were then re-registered to this study-specific template brain to improve segmentation accuracy. Segmented images were manually inspected for accuracy, and 8 cases at wave 1 and 5 cases at wave 2 were excluded for inaccurate or failed registration due to excessive motion.

Behavioral learning performance
As a first step, we calculated learning outcomes directly from the raw data: learning accuracy, win-stay and lose-shift behavior, and reaction time. Learning accuracy was defined as the proportion of choices of the more rewarding option, while win-stay and lose-shift refer to the proportions of staying with the previously chosen option after a reward and of switching to the alternative option after a punishment, respectively. We used these outcomes as dependent variables to examine the effects of the predictors feedback timing (immediate, delayed), wave (1, 2), wave 1 age, and sex (girls, boys) in generalized linear mixed models (GLMM) with the R package lme4 (Bates et al., 2015). All reported models included random slopes for the within-subject factors feedback timing and wave (see Appendix 2 for the model structure). We systematically tested main effects and interactions between the predictors; an interaction had to statistically improve the predictive ability of the model to be included in the final reported model. All predictor variables were grand-mean-centered to interpret interaction effects independently of other predictors.
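The three raw-data outcomes can be computed as in the following Python sketch (our own illustration; the variable names are hypothetical, not from the study's analysis code):

```python
def behavioral_outcomes(choices, rewards, correct_choice):
    """Accuracy, win-stay, and lose-shift proportions from raw trial data.

    choices: chosen option per trial; rewards: 0/1 outcome per trial;
    correct_choice: the more rewarded option for this cue."""
    acc = sum(c == correct_choice for c in choices) / len(choices)
    ws_num = ws_den = ls_num = ls_den = 0
    for t in range(1, len(choices)):
        if rewards[t - 1] == 1:          # previous trial rewarded
            ws_den += 1
            ws_num += choices[t] == choices[t - 1]   # stayed -> win-stay
        else:                            # previous trial punished
            ls_den += 1
            ls_num += choices[t] != choices[t - 1]   # switched -> lose-shift
    win_stay = ws_num / ws_den if ws_den else float("nan")
    lose_shift = ls_num / ls_den if ls_den else float("nan")
    return acc, win_stay, lose_shift
```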

Reinforcement learning models
As a next step, we used computational modeling to compare learning models of basic heuristic strategies and value-based learning and to determine the model that could best capture children's trial-by-trial learning behavior. For heuristic strategies, we considered models that reflected a Win-stay-lose-shift (wsls) or a Win-stay (ws) strategy. Win-stay is a heuristic strategy in which the same action is repeated if it led to a positive outcome in the previous trial; Win-stay-lose-shift additionally switches to a different action if the previous outcome was negative. Note that these model-based outcomes are not identical to the win-stay and lose-shift behavior calculated from the raw data. Such model-based measures offer the advantage of discerning the underlying hidden cognitive process with greater nuance, in contrast to classical approaches that directly use raw behavioral data. The models quantified the learning behavior for each individual i, cue c, and trial t. The heuristic models consisted of a weight w that reflected the degree of strategy use. In the case of reward r = 1, w was equal to 1 for the chosen option (e.g. choice A) and 0 for the unchosen option (e.g. choice B), thus maximizing win-stay, that is, choosing A at the subsequent trial t + 1:

w_i,c,t+1,A|r=1 = 1 and w_i,c,t+1,B|r=1 = 0 (1)

For trials with r = 0 (applicable only to the wsls model), the model weights were the opposite, maximizing lose-shift:

w_i,c,t+1,A|r=0 = 0 and w_i,c,t+1,B|r=0 = 1 (2)

The initial weights for both choices were set to w_i,c,t=1 = 0.5. The weight w then scaled the parameter τ_wsls or τ_ws to estimate individual strategy use during decision-making. Choice probabilities were calculated using the softmax function, for example for the chosen option A:

p_i,c,t,A = exp(w_i,c,t,A * τ) / (exp(w_i,c,t,A * τ) + exp(w_i,c,t,B * τ)) (3)

Thus, a higher probability of strategy use was reflected by a larger value of τ_wsls or τ_ws. For value-based learning, we considered a Rescorla-Wagner model and several variants based on our theoretical conceptions. The baseline value-based model vbm_1 updated the value v of the selected choice (A or B) for the next trial t. This value update was determined by the difference between the received reward r and the expected value v of the selected choice, that is, the reward prediction error. The value update was further scaled by a learning rate α (0 < α < 1):

v_i,c,t+1 = v_i,c,t + α * (r_i,c,t − v_i,c,t) (4)

When the outcome sensitivity parameter ρ (0 < ρ < 20) was included, the reward was additionally scaled at the value update:

v_i,c,t+1 = v_i,c,t + α * (ρ * r_i,c,t − v_i,c,t) (5)

The inverse temperature parameter τ (0 < τ < 20) was included in the softmax function to compute choice probabilities:

p_i,c,t,A = exp(v_i,c,t,A * τ_i) / (exp(v_i,c,t,A * τ_i) + exp(v_i,c,t,B * τ_i)) (6)

Note, however, that outcome sensitivity and inverse temperature are difficult to fit simultaneously due to non-identifiability issues (Brown et al., 2021). Therefore, models including the inverse temperature fixed outcome sensitivity at 1 (inverse temperature model family), assuming no individual differences in outcome sensitivity. For the outcome sensitivity model family, outcome sensitivity was freely estimated and the inverse temperature was fixed at 1, assuming the same degree of value-based decision behavior across individuals. Even though outcome sensitivity is usually restricted to an upper bound of 2 so as not to inflate outcomes at value update, this configuration led to ceiling effects in outcome sensitivity and non-converging model results. Further, this issue was not resolved when we fixed the inverse temperature at the group mean of 15.47 from the winning inverse temperature family model. It may be that in children, individual differences in outcome sensitivity are more pronounced, leading to more extreme values. Therefore, we decided to extend the upper bound to 20, parallel to the inverse temperature, and all our models converged with Rhat < 1.1. Each model family consisted of 4 model variants, vbm_1−4 (1α1τ, 2α1τ, 1α2τ, 2α2τ) and vbm_5−8 (1α1ρ, 2α1ρ, 1α2ρ, 2α2ρ), in which each parameter was either separated by feedback timing or kept as a single parameter across feedback conditions. Our baseline value-based model vbm_1 included a single learning rate and a single inverse temperature (1α1τ).
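A minimal Python sketch of one Rescorla-Wagner value update and the softmax choice rule described above (function names are our own illustration, not the study's Stan code):

```python
import math

def rw_update(v, choice, r, alpha, rho=1.0):
    """Rescorla-Wagner update for the chosen option:
    v[choice] <- v[choice] + alpha * (rho * r - v[choice]).
    With rho = 1 this is the baseline update; a freely estimated
    rho gives the outcome-sensitivity variant."""
    v = dict(v)  # copy so the caller's values are untouched
    v[choice] += alpha * (rho * r - v[choice])
    return v

def p_choose_a(v, tau):
    """Softmax probability of choosing option A with inverse temperature tau."""
    ea, eb = math.exp(v["A"] * tau), math.exp(v["B"] * tau)
    return ea / (ea + eb)
```

A higher tau makes choices more deterministic with respect to the learned values, which is why the inverse temperature indexes value-guided decision-making in this framework.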

Parameter estimation
All choice data were fitted in a hierarchical Bayesian analysis using the Stan language in R (Stan Development Team, 2021; R Development Core Team, 2021), adapted from the hBayesDM package (Ahn et al., 2017). Posterior parameter distributions were estimated using Markov chain Monte Carlo (MCMC) sampling, running four chains with 3000 iterations each and using the first half of each chain as warmup; group-level and individual-level parameters were estimated simultaneously. The hierarchical Bayesian approach provides more stable and reliable parameter estimates than point-estimation approaches like maximum likelihood estimation (Brown et al., 2020). Each model fit both wave 1 and wave 2 data at once, considering the correlation structure of the same parameter across waves to account for within-subject dependency using the Cholesky decomposition. The Cholesky decomposition used a Lewandowski-Kurowicka-Joe prior of 2, and all other group-level parameters had a normal prior distribution, Normal(0, 0.5). Non-response trials (wave 1 = 2.41%, wave 2 = 0.97% on average) were excluded in advance.
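For contrast with the hierarchical Bayesian fit (which requires Stan), a non-hierarchical maximum-likelihood fit of the baseline model can be sketched as a simple grid search in Python. This is purely illustrative of the point-estimation alternative mentioned above, not the estimation approach used in the study:

```python
import math

def neg_log_lik(choices, rewards, alpha, tau):
    """Negative log-likelihood of one cue's choice sequence under the
    baseline Rescorla-Wagner model with softmax choice."""
    v = {"A": 0.5, "B": 0.5}
    nll = 0.0
    for c, r in zip(choices, rewards):
        ea, eb = math.exp(v["A"] * tau), math.exp(v["B"] * tau)
        p = (ea if c == "A" else eb) / (ea + eb)
        nll -= math.log(p)
        v[c] += alpha * (r - v[c])  # update only the chosen option
    return nll

def grid_mle(choices, rewards):
    """Return the (alpha, tau) pair minimizing the negative log-likelihood
    over a coarse grid (alpha in 0.05..0.95, tau in 1..20)."""
    grid = [(a / 20, t) for a in range(1, 20) for t in range(1, 21)]
    return min(grid, key=lambda p: neg_log_lik(choices, rewards, *p))
```

Because each child is fit independently here, noisy data can push estimates to the grid boundary, illustrating why such point estimates tend toward extreme values compared to hierarchical pooling.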

Model simulation and model-derived learning score
To appropriately interpret the parameter results with respect to the optimal parameter combination of the winning model selected by model comparison, we simulated 5,000,000 individual datasets using 10,000 different parameter combinations (covering the whole range of each parameter). In addition, we computed the model-derived mean choice probability of the contingent, that is, the more rewarded option, and we refer to it as the model-derived learning score. This model-derived choice probability differs from the observed empirical choice probability (i.e. the accuracy of selecting the more rewarded option), because the model-derived learning score combines the model with the data by incorporating latent information carried by the key learning parameters. Thus, the learning score captures observed behavior based on trial-by-trial latent processes predicted by the value-based models. We used this as a metric to interpret the fitted posterior parameters in relation to the optimal parameter combination for our probabilistic learning task.
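A scaled-down sketch of this simulation logic in Python (our own illustration, with far fewer simulated datasets than the 5,000,000 used in the study):

```python
import math
import random

def simulate_learning_score(alpha, tau, n_trials=32, n_sims=200, seed=0):
    """Mean model probability of choosing the contingent option (the
    'learning score') for one simulated cue with an 87.5%/12.5%
    reward contingency, under given alpha and tau."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        v = [0.5, 0.5]  # index 0 = contingent option
        for _ in range(n_trials):
            e0, e1 = math.exp(v[0] * tau), math.exp(v[1] * tau)
            p0 = e0 / (e0 + e1)
            total += p0                       # accumulate model-derived p(contingent)
            choice = 0 if rng.random() < p0 else 1
            p_rew = 0.875 if choice == 0 else 0.125
            r = 1 if rng.random() < p_rew else 0
            v[choice] += alpha * (r - v[choice])
    return total / (n_sims * n_trials)
```

Sweeping such a simulation over a grid of (alpha, tau) values and taking the argmax of the learning score is the logic behind identifying the optimal parameter combination.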

Model selection and validation
We conducted a two-step sequential procedure for model development and selection. As a first step, we compared model evidence for the baseline value-based model that does not separate learning rate and inverse temperature by feedback timing (vbm_1: 1α, 1τ) to the non-value-based, heuristic strategy models that reflect Win-stay or Win-stay-lose-shift behavior (ws, wsls). As a second step, we compared model evidence for 8 value-based model variants: 4 of the model family with learning rate and inverse temperature (1α1τ, 2α1τ, 1α2τ, 2α2τ) and 4 of the model family with learning rate and outcome sensitivity (1α1ρ, 2α1ρ, 1α2ρ, 2α2ρ). This allowed us to test whether children showed separable effects of feedback timing on one of the model parameters. We compared model fit using Bayesian leave-one-out cross-validation and obtained the expected log pointwise predictive density (elpd_loo) using the R package loo (Vehtari et al., 2017). We further computed model weights (Pseudo-BMA+) using pseudo Bayesian model averaging stabilized by Bayesian bootstrap with 100,000 iterations (Yao et al., 2018). To validate our models, we estimated predictive accuracy by comparing one-step-ahead model predictions with the choice data (Zhang et al., 2020; Crawley et al., 2020). We performed parameter recovery for the winning model and model recovery by comparing it to the set of models used during model comparison (Appendix 1; Wilson and Collins, 2019).
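As a simplified illustration of the pseudo-BMA weighting (omitting the Bayesian-bootstrap stabilization that distinguishes Pseudo-BMA+), model weights are proportional to exp(elpd_loo):

```python
import math

def pseudo_bma_weights(elpds):
    """Relative model weights proportional to exp(elpd_loo).
    Note: the Bayesian-bootstrap stabilization used for Pseudo-BMA+
    in the study is omitted in this sketch."""
    m = max(elpds)                      # subtract max for numerical stability
    w = [math.exp(e - m) for e in elpds]
    s = sum(w)
    return [x / s for x in w]
```

Because the weights are exponential in elpd_loo, even a difference of a few elpd units concentrates nearly all weight on the better-fitting model.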

Episodic memory at wave 1
We predicted individual corrected recognition memory (hits − false alarms) by feedback condition in a linear mixed effects model using the R package lme4 (Bates et al., 2015). Only confident ('sure') ratings were included in the analysis, which comprised 98.1% of all given responses. A total of 140 children completed the recognition memory test and 138 were included in the analysis, with two excluded due to negative corrected recognition memory values (i.e. poor recognition memory). Age and sex were controlled for as covariates.

Longitudinal brain-cognition links
We used latent change score (LCS) models to examine the longitudinal relationships between brain and learning score measures. LCS models are longitudinal structural equation models that have been widely applied to estimate developmental changes and coupling effects across domains such as brain and cognition (Kievit et al., 2018; Ferrer and McArdle, 2010). LCS models allow the definition of specific paths between multiple variables to test explicit hypotheses, and they estimate latent change from the observed variables, which accounts for measurement error and increases testing power (van der Sluis et al., 2010). We compiled univariate LCS models for each variable separately (learning scores and brain volumes) to examine whether there was significant individual variance and change, which could then be related within a multivariate LCS model. Model fit had to be at least acceptable, with a comparative fit index (CFI) > 0.95, standardized root mean square residual (SRMR) < 0.08, and root mean square error of approximation (RMSEA) < 0.08 (Little, 2013). Age and sex were included as covariates at wave 1, as well as the estimated total intracranial volume (eTIV) when brain volume was included in the model. Multivariate LCS models allow the estimation of meaningful brain-cognition relationships: a wave 1 covariance between brain and cognition, brain predicting change in cognition or vice versa, and a covariance between brain and cognition change scores (wave 1 to wave 2). Before compiling the variables into an LCS model, they were checked for outliers ± 4 SD around the mean. We identified one outlier for the learning rate at wave 2, which was removed for the explorative LCS model that included model parameters. There were no further outliers in other cognitive variables or brain volumes. Continuous variables were standardized to the wave 1 measure so that wave 2 values represent the change from wave 1; sex was contrast-coded (girls = 1, boys = -1). Lose-shift probability was lower at
wave 2 compared to wave 1 (β_wave=2 = -0.586, SE = 0.071, z = -8.22, p < 0.001) and with higher age at wave 1 (β_wave1age = -0.177, SE = 0.078, z = 2.27, p = 0.024), but did not differ by feedback type (β_feedback=delayed = 0.036, SE = 0.020, z = 1.74, p = 0.081) or sex (β_sex=girls = 0.063, SE = 0.036, z = 1.76, p = 0.079). Taken together, children on average improved their accuracy, while win-stay probability increased and lose-shift probability decreased between waves. Girls were on average less accurate, showed reduced win-stay behavior, and showed a smaller decrease in lose-shift probability between waves (Appendix 2-table 1 and Appendix 2-figure 1). Reaction times were faster at wave 2 compared to wave 1 (β_wave=2 = -218, SE = 22.7, t(126) = -9.61, p < 0.001), did not differ by wave 1 age (β_age wave1 = -42.5, SE = 25.7, t = -1.66, p = 0.100), and were faster for delayed compared to immediate feedback trials (β_feedback=delayed = -14.0, SE = 6.61, t = -2.12, p = 0.036). Girls did not differ from boys (β_sex=girls = 23.5, SE = 25.7, t = 0.91, p = 0.362). To summarize the reaction time results, children responded faster to cues paired with delayed feedback than to cues paired with immediate feedback, and they became faster in their decision-making across waves.
For hippocampal volume, we found a positive covariance with delayed inverse temperature at wave 1 (ϕ_HCw1,τdel,w1 = 0.13, z = 2.30, SE = 0.06, p = 0.021), whereas striatal volume positively covaried with learning rate at wave 1 (ϕ_STRw1,αw1 = 0.15, z = 2.05, SE = 0.08, p = 0.041). The striatal link to learning rate, however, was diminished when excluding children below the learning criterion. Longitudinally, striatal volume at wave 1 further predicted positive gains in learning rate (β_STRw1,∆α = 0.44, z = 2.25, SE = 0.20, p = 0.024). Changes in learning rate covaried positively with changes in immediate inverse temperature (ϕ_∆α,∆τi = 0.35, z = 2.46, SE = 0.14, p = 0.014), while changes in immediate inverse temperature covaried negatively with changes in delayed inverse temperature (ϕ_∆τi,∆τd = -0.28, z = -3.60, SE = 0.08, p < 0.001). Immediate inverse temperature at wave 1 predicted negative striatal volume change (β_τi,w1,∆STR = -0.09, z = -2.38, SE = 0.04, p = 0.017), while delayed inverse temperature at wave 1 predicted negative change in hippocampal volume (β_τd,w1,∆HPC = -0.08, z = -2.06, SE = 0.04, p = 0.039) in the reduced sample, but not in the full sample. Taken together, while hippocampal volume was only linked to delayed inverse temperature at wave 1, striatal volume was linked to learning rate at wave 1 and was predictive of learning rate development. Further, there was evidence that inverse temperature was predictive of brain volume change, in line with the hypothesized brain-cognition links. The inverse temperatures for delayed and immediate feedback showed diverging changes, in which the change in immediate inverse temperature was similar to that of learning rate, but dissimilar to that of delayed inverse temperature. This suggests that the hippocampus might be uniquely associated with inverse temperature during delayed learning, whereas the striatum was linked to learning rates and inverse temperature, suggesting a stronger contribution to the longitudinal change of learning function in general.

Model results
We repeated the model comparison with the reduced dataset by excluding the elpd_loo (expected log pointwise predictive density) of the poor learners (Appendix 6-table 2). One may argue that this procedure is suboptimal, as the model parameters were fitted using the complete dataset, so that poor learners impacted the parameters of the remaining participants in the hierarchical model estimation. However, fitting the reduced dataset only would have required a different model structure, as the number of longitudinal datasets would have been much smaller and some participants would only have wave 2 data. Since we used a wide prior for model estimation, the impact of poor learners on the group level is reduced. Note. Model = heuristic (ws, wsls) and value-based models (vbm_1−8) that were compared against each other. Parameters = corresponding model parameters: learning rate (α), inverse temperature (τ), and outcome sensitivity (ρ). ∆elpd_loo = difference in the Bayesian leave-one-out cross-validation estimate of the expected log pointwise predictive density relative to the winning model and its standard error. mean elpd_loo = mean expected log pointwise predictive density across all trials. Pseudo-BMA+ = model weight for relative model evidence using Bayesian model averaging stabilized by Bayesian bootstrap with 100,000 iterations.
The model comparison for the reduced dataset did not differ from the results for the complete dataset. At the first step, children's learning behavior in the longitudinal data was again better described by a value-based model than by a heuristic strategy model. At the second step, comparing different value-based models, the winning model again suggested that feedback timing affected the inverse temperature, but not the learning rate or outcome sensitivity. We did not find any deviations from the findings of the winning model when using the reduced dataset. The mean model fit (mean elpd_loo) was slightly worse in the reduced dataset, which suggests that the additional poor learners in the complete dataset did not fit the model worse than the other children, despite their low accuracy. The correlations between condition differences in inverse temperature and reaction times remained (r = -0.288, t(125) = -3.36, p = 0.001 at wave 1 and r = -0.352, t(118) = -4.09, p < 0.001 at wave 2). To conclude, the same winning model from the computational analysis remained and was therefore used for further analyses.
The results obtained from the reduced dataset suggest that the striatal associations with learning remained unchanged, while the results for the hippocampus differed. Hippocampal volume was no longer associated with the delayed learning condition. Furthermore, hippocampal-dependent episodic recognition memory was enhanced for items encoded during delayed compared to immediate feedback, which was not the case in the results obtained from the complete dataset.

Figure 1. Reinforcement learning task. (A) Depiction of two example trials of the immediate and delayed feedback conditions presented at wave 1. For immediate feedback (top panel), cue and choice were presented for 1 s between the choice response and feedback. At feedback, a green frame around the incidentally encoded object indicated a positive outcome, which appeared in 87.5% of the trials when selecting the square-shaped lolly for this example cue. For delayed feedback (bottom panel), the delay phase between choice response and feedback lasted 5 s. The red frame around the object indicated a negative outcome and appeared in 87.5% of the trials when selecting the square-shaped lolly for this example cue. (B) For each feedback condition, two action-outcome contingencies were learned to balance a potential choice bias. Across the four task versions, the cues and outcome contingencies were counterbalanced across participants.

Figure 2. Individual differences in the behavioral learning outcomes and their longitudinal change. (A) Accuracy did not differ by feedback timing and increased between waves. (B) Win-stay and lose-shift proportions did not differ by feedback timing; the win-stay proportion increased and the lose-shift proportion decreased between waves. (C) Reaction time (in ms) differed by feedback timing: decisions for cues learned with delayed feedback were faster, and reaction times were faster at wave 2 compared to wave 1. (D) Correlations between behavioral outcomes reveal that learning accuracy was primarily correlated with the win-stay and lose-shift probabilities both within and between waves, but was uncorrelated with reaction time. Significant correlations are circled; p-values were adjusted for multiple comparisons using Bonferroni correction.
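As a concrete illustration of the win-stay and lose-shift proportions in panel (B), these can be computed from a choice and feedback sequence as follows. This is a minimal sketch with toy data, not the study's analysis code.

```python
import numpy as np

def win_stay_lose_shift(choices, rewards):
    """Win-stay: P(repeat choice | previous trial rewarded).
    Lose-shift: P(switch choice | previous trial not rewarded)."""
    choices = np.asarray(choices)
    rewards = np.asarray(rewards, dtype=bool)
    stay = choices[1:] == choices[:-1]   # did the choice repeat?
    win = rewards[:-1]                   # was the previous trial rewarded?
    ws = stay[win].mean() if win.any() else np.nan
    ls = (~stay)[~win].mean() if (~win).any() else np.nan
    return ws, ls

# toy sequence: five trials, binary feedback (1 = win, 0 = loss)
ws, ls = win_stay_lose_shift([0, 0, 1, 1, 0], [1, 0, 1, 1, 0])
```

In the task itself, choices are cue-specific, so these proportions would be computed over each cue's trial sequence and then aggregated; the sketch above treats a single sequence for clarity.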

Figure 3. Overview of the computational model parameters. (A) Individual differences in the learning rate and inverse temperature of the winning model and their longitudinal change. The inverse temperature τ, but not the learning rate α, differed by feedback timing, and both increased

Figure 4. Model simulation and validation. (A) The model simulation depicts parameter combinations and simulation-based average learning scores. The cyan 'X' at the top middle marks the optimal parameter combination, where the average learning score was 96.5%, and the cyan rectangle marks the space of the fitted parameter combinations. (B) Enlarged view of the space of fitted parameter combinations. The colored arrows depict the mean change (bold arrow) and individual changes (transparent arrows) of the fitted parameters. The greyscale gradient-filled dots connected by the arrows depict the individual learning scores, while the greyscale gradient in the background depicts the simulated average learning score. The mean change reveals an overall shift towards higher, that is, more optimal, learning scores. (C) One-step-ahead posterior predictions of the winning model for each wave. The colored lines depict averaged trial-by-trial task behavior for each feedback condition, and the cyan ribbon indicates the 95% highest density interval of the one-step-ahead prediction using the entire posterior distribution, which included 6000 iterations for each of the 33,460 trials.

Figure 5. Cognitive and brain measures with cross-sectional and longitudinal links. (A) Recognition memory (corrected recognition = hits - false alarms) for objects presented during delayed feedback was enhanced only at trend level. (B) The learning scores depicted here were used in the LCS analyses. Learning scores were the model-derived choice probabilities of the contingent choice using the fitted posterior parameters. (C) Hippocampal and striatal volumes increased between waves, with hippocampal volume increasing most. (D) A four-variate latent change score (LCS) model that included striatal and hippocampal volumes as well as immediate and delayed learning scores. Depicted are significant cross-domain paths (brain-cognition, dashed lines) and within-domain paths (brain or cognition, solid lines); other paths are omitted for visual clarity and are summarized in Table 4. Depicted brain-cognition links included ϕ STRw1,LSime,w1 (covariance between striatal volume and the immediate learning score at wave 1), as well as ϕ HPCw1,LSdel,w1 and ϕ STRw1,LSdel,w1 (covariances between hippocampal and striatal volumes and the delayed learning score at wave 1). Brain links included ϕ STRw1,HPCw1 and ρ ∆STR,∆HPC (wave 1 covariance and change-change covariance); similarly, cognition links included ϕ LSime,w1,LSdel,w1 and ρ ∆LSime,∆LSdel. Covariates included age, sex, and estimated total intracranial volume. ** denotes significance at α < 0.001, * at α < 0.05.

Appendix 1-figure 1. Parameter recovery of the winning model. The black line represents the identity line, whereas the blue line is the loess regression line. Correlations were calculated using Pearson's r.

Table 1. Behavioral learning outcomes and mixed model fixed effects that predicted the outcomes.
Note. Mean (standard deviation) of accuracy (ACC, probability correct), win-stay probability (WS), lose-shift probability (LS), and reaction time (RT, in seconds), split by wave and feedback timing. Mixed model effects and their direction of effect (increasing ↑ or decreasing ↓). W2 = Wave 2.

Table 2. Model comparison results.

Table 3. Description of computational model parameters from the winning value-based model vbm 3.
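To make the roles of the parameters concrete, a minimal value-based learner can be sketched as a Rescorla-Wagner update with outcome sensitivity ρ and a softmax choice rule with inverse temperature τ. The exact parameterization of vbm 3 is an assumption here; this sketch is for illustration only, not the authors' implementation.

```python
import numpy as np

def simulate_vbm(p_reward, n_trials, alpha, tau, rho, seed=0):
    """Illustrative value-based learner (assumed parameterization):
    alpha = learning rate, tau = inverse temperature,
    rho = outcome sensitivity. Action 0 is the contingent choice,
    rewarded with probability p_reward."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)                                  # action values
    n_correct = 0
    for _ in range(n_trials):
        p = np.exp(tau * q) / np.exp(tau * q).sum()  # softmax choice rule
        a = rng.choice(2, p=p)
        r = rng.random() < (p_reward if a == 0 else 1.0 - p_reward)
        q[a] += alpha * (rho * r - q[a])             # prediction-error update
        n_correct += (a == 0)
    return n_correct / n_trials

# e.g., the task's 87.5% contingency with mid-range parameter values
score = simulate_vbm(0.875, n_trials=200, alpha=0.3, tau=3.0, rho=1.0)
```

A higher τ sharpens the softmax, so choices track the learned values more deterministically; this is the sense in which the fitted inverse temperature quantifies value-guided decision-making in the winning model.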

Table 4. Parameter estimates of a four-variate latent change score model that includes brain (striatal and hippocampal volume) and cognition domains (immediate and delayed learning score).
Note. ** denotes significance at α < 0.001, * at α < 0.05. X indicates which random effects were included in the final model. ICC = intraclass correlation. Marginal R² = variance explained by fixed effects; conditional R² = variance explained by both fixed and random effects.
Appendix 6-table 2. Model comparison results obtained with the reduced dataset and the complete dataset.