## RL
Continuous-time RL is modelled as a continuous-time Markov decision process. 
There is a set of environment **states**, $\mathcal{S}$, and a set of agent **actions**, $\mathcal{A}$. 
At any time $t$, the environment will be in some state, $s(t) \in \mathcal{S}$. 
The agent will choose when to act and what action to take based on a stochastic **policy**, $a(t) \sim \pi(s(t))$. 
These actions will affect the state of the environment and the reward rate function, $R(t) = R(s(t), a(t))$. 
The task in RL is to learn a policy to maximize the expected discounted integral of future rewards:
\begin{align}
    \max_{\pi} \, \mathbb{E}_{\pi} \left [ \int_{t=0}^{\infty} \gamma^t R(t) dt \right ], 
\end{align}
where $\gamma \in [0,1]$ is the discount factor. 
The value function is this expected discounted return, conditioned on being in a particular state $s$ at time $t$:
\begin{align}
    V(s) = \mathbb{E}_{\pi} \left [ \int_{k=0}^{\infty} \gamma^k R(t+k) dk \Big | s(t) = s  \right ]
\end{align}
One can also define the Q function, $Q(s,a)$, in which the above expectation is also conditioned on the action taken at time $t$.

A value (or Q) function can be learned by TD algorithms that take advantage of the recursive relationship between successive values. 
\begin{align}
V(s(t)) 
&\approx \int_{k=0}^{\theta} \gamma^k R(t+k) dk + \gamma^{\theta} V(s(t+\theta)).
\end{align}
This expression can be used to update the value function.
\begin{align}
V(s(t)) \leftarrow V(s(t)) + \lambda \left [\int_{k=0}^{\theta} \gamma^k R(t+k) dk + \gamma^{\theta} V(s(t+\theta)) - V(s(t)) \right ],
\end{align}
where $\lambda$ is the learning rate, and the term in the square brackets is the TD error. 
This update, as written, is for tabular RL, in which the values of all states are stored in a table. 
To generalize to a continuous state space, you can approximate $V(s)$ with a neural network trained using the TD error. 
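
The tabular update can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text above: a hypothetical two-state cycle in which the agent spends $\theta$ seconds in each state, a reward rate that is constant within each window, and a midpoint Riemann sum for the reward integral. The names `td_update`, `rates`, and `lr` (playing the role of $\lambda$) are made up for this example.

```python
import math

def td_update(V, s, s_next, reward_rate, gamma, theta, lr, dt=1e-2):
    """One theta-second TD update of a tabular value function V (a dict).

    Assumes the reward rate is constant over the window [t, t + theta].
    """
    n_steps = int(round(theta / dt))
    # Midpoint Riemann sum for the discounted reward integral over the window.
    reward_integral = sum(
        gamma ** ((i + 0.5) * dt) * reward_rate * dt for i in range(n_steps)
    )
    # TD error: bootstrapped target minus the current estimate.
    td_error = reward_integral + gamma ** theta * V[s_next] - V[s]
    V[s] += lr * td_error
    return td_error

# Hypothetical two-state cycle: theta seconds in state 0 (reward rate 1),
# then theta seconds in state 1 (reward rate 0), repeating forever.
gamma, theta = 0.9, 0.5
rates = {0: 1.0, 1: 0.0}
V = {0: 0.0, 1: 0.0}
s = 0
for _ in range(2000):
    s_next = 1 - s
    td_update(V, s, s_next, rates[s], gamma, theta, lr=0.5)
    s = s_next

# Closed-form fixed point of the recursion, for comparison.
window_integral = (gamma ** theta - 1.0) / math.log(gamma)
v0_exact = window_integral / (1.0 - gamma ** (2 * theta))
v1_exact = gamma ** theta * v0_exact
```

Because the transitions here are deterministic, the table should settle close to `v0_exact` and `v1_exact`; with stochastic transitions a smaller learning rate would be needed.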

A popular architecture in RL is the Advantage Actor-Critic (A2C) model. 
In this setup, one learns both a value function (the **critic**) and a policy (the **actor**). 
The actor is used to select actions while the critic is used to train the actor using the advantage function,
\begin{equation}
    A(s,a) = Q(s,a) - V(s).
\end{equation}
This advantage function can be approximated with the TD error signal. 
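
A short sketch of why this works, assuming action $a(t)$ is taken at time $t$ and the policy $\pi$ is followed afterwards. The Q function then satisfies the same recursion as the value function, so for a single sampled trajectory
\begin{align}
Q(s(t),a(t)) &\approx \int_{k=0}^{\theta} \gamma^k R(t+k) dk + \gamma^{\theta} V(s(t+\theta)), \\
A(s(t),a(t)) &= Q(s(t),a(t)) - V(s(t)) \approx \int_{k=0}^{\theta} \gamma^k R(t+k) dk + \gamma^{\theta} V(s(t+\theta)) - V(s(t)),
\end{align}
and the right-hand side has the form of a TD error.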

## Memory in RL

Often in RL, a replay memory is used to store samples of past transitions, $(s,a,r,s')$: the state, action, reward, and next state. This memory is repeatedly sampled to compute TD errors (e.g., TD(0), TD($n$), or forward-view/offline TD($\lambda$)) and train the model. The number of samples stored exactly in the replay memory may become quite large. This approach is not suitable for biologically plausible RL models. Additionally, it is not suitable when dealing with RL in continuous time. 

Eligibility traces are a method for online TD learning. Instead of updating value estimates based on future information, current information is used to update past value estimates. This is done by maintaining a vector, the eligibility trace, that holds a decaying record of recently visited states (or, with function approximation, of recent gradients of $V(s)$).

Here I instead use Legendre Memory Units (LMUs) to maintain a vector from which past values can be obtained. Why use LMUs instead of eligibility traces? Standard eligibility traces are used when dealing with a discrete state space. The vector dimension is the number of states and each component of the trace is something like a leaky integrator. These methods do not easily translate to continuous state spaces. In contrast, LMUs provide a vector memory representation for continuous signals. In addition, with an LMU memory, many different TD errors can be computed -- ones with different discount factors or over different time scales. Finally, with LMUs we can recall more than just decaying values. For example, we can recall past neural activity, which is needed for training spiking neural networks.

There has been past work on implementing RL algorithms in continuous time with spiking neural networks. However, obtaining the past activity of spiking neurons for such updates is a challenge.
The Actor-Critic (AC) model in fremaux2013 does not actually compute TD signals with spiking neurons to avoid this.
The hierarchical reinforcement learning (HRL) model in rasmussen2017 addresses the challenge by using two identical neural populations to represent current and delayed Q functions, with mechanisms for copying learned weights from one population to the other.  This is similar to a TD(0) algorithm with a target network. The delay used is fixed in advance. Different tasks may require credit assignment over time windows of different lengths and, in many cases, better performance can be achieved by using reward information over many time steps for updates.

## Nengo
To move towards biologically constrained/plausible RL, we shall build neural networks using the Nengo package. Nengo models can be simulated with a small timestep, so we can model continuous-time RL. Nengo networks consist of:
* Node: This stores/outputs a vector over the simulation run time. Often used to provide input to the ensembles.
* Ensemble: A group of (spiking) neurons that collectively represent a vector (the latent representation).
* Connection(pre, post): Used to connect nodes, ensembles, and other network components. If the pre object in the connection is an ensemble Nengo will create a decoded connection. When the pre object is anything else, Nengo will create a direct connection. The post object is only used to determine which signal will receive the data produced by the connection.

Check out https://www.nengo.ai to learn more.

### Decoded connection
The important thing about decoded connections is that they do not directly compute the function defined for that connection (keeping in mind that passing in no function is equivalent to passing in the identity function). Instead, the function is approximated by solving for a set of decoding weights. The output of a decoded connection is the sum of the pre ensemble’s neural activity weighted by the decoding weights solved for in the build process.

Mathematically, you can think of a decoded connection as implementing the following equation:
$$ y(t) = \sum_{i=1}^{N} \mathbf{d}^f_i a_i(x(t))$$
where
* $y(t)$ is the output of the connection at time $t$
* $N$ is the number of neurons in the pre ensemble
* $\mathbf{d}^f_i$ is the decoding weight associated with neuron $i$, solved for the function $f$ set on the connection
* $a_i(x(t))$ is the activity of neuron $i$ given $x(t)$, the input at time $t$.

Decoders will be automatically solved for in the Nengo build process. They are optimized to approximate the function or transformation or data points supplied when defining the Connection object. But decoders can also be learned online...
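
To make the idea of solving for decoders concrete, here is a rough NumPy sketch. It is not Nengo's solver: the rectified-linear tuning curves, the random gains and intercepts, the target function $f(x) = x^2$, and the regularization constant `reg` are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 50

# Hypothetical rectified-linear tuning curves (not Nengo's LIF model):
# a_i(x) = max(0, gain_i * (e_i * x - intercept_i)).
encoders = rng.choice([-1.0, 1.0], size=n_neurons)
gains = rng.uniform(0.5, 2.0, size=n_neurons)
intercepts = rng.uniform(-1.0, 1.0, size=n_neurons)

def activities(x):
    """Return firing rates with shape (len(x), n_neurons)."""
    x = np.atleast_1d(x)[:, None]
    return np.maximum(0.0, gains * (encoders * x - intercepts))

# Evaluation points spanning the represented range, and the target function.
x_eval = np.linspace(-1.0, 1.0, 200)
target = x_eval ** 2
acts = activities(x_eval)

# Regularized least squares: d = (A^T A + reg * I)^{-1} A^T f(x).
reg = 1e-3
gram = acts.T @ acts + reg * np.eye(n_neurons)
decoders = np.linalg.solve(gram, acts.T @ target)

# Decoded output y = sum_i d_i a_i(x), and its error against the target.
decoded = acts @ decoders
rmse = np.sqrt(np.mean((decoded - target) ** 2))
```

The decoded output only approximates $f$; the error depends on the number of neurons and on how well their tuning curves tile the input range.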

## PES learning rule
The Prescribed Error Sensitivity (PES) rule is a biologically plausible supervised learning rule.
To learn a connection between a pre-population of $N$ neurons and a post-population of neurons encoding an $n$-dimensional vector, this rule modifies the pre-population's decoders in response to an error signal:
\begin{align}
    \Delta D = \kappa \mathbf{\delta}(t) \mathbf{a}(t)^T
\end{align}
where $\kappa$ is a learning rate, $D \in \mathbb{R}^{n \times N}$ is the matrix of decoders, $\mathbf{a}(t) \in \mathbb{R}^{N \times 1}$ is the vector of (filtered) neural activities, and $\mathbf{\delta}(t) \in \mathbb{R}^{n \times 1}$ is the error. For value learning we'll have $n=1$.


The error signal may be computed by other neural populations in a model. Biologically, we can think of those populations as dopaminergic neurons that can modify weights in this way via dopamine levels.

See https://www.nengo.ai/nengo/examples/learning/learn-communication-channel.html for an example using PES
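
The update $\Delta D = \kappa \mathbf{\delta}(t) \mathbf{a}(t)^T$ can also be tried outside of Nengo. The sketch below learns an identity mapping ($n = 1$) from scratch. It assumes hypothetical rectified-linear rate neurons, an error defined as target minus decoded output, and a hand-picked learning rate; Nengo's built-in PES implementation may use different sign and scaling conventions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons = 50

# Hypothetical rectified-linear rate neurons standing in for the pre-population.
encoders = rng.choice([-1.0, 1.0], size=n_neurons)
gains = rng.uniform(0.5, 2.0, size=n_neurons)
intercepts = rng.uniform(-1.0, 1.0, size=n_neurons)

def activities(x):
    """Firing rates a(x) of the pre-population for a scalar input x."""
    return np.maximum(0.0, gains * (encoders * x - intercepts))

def rmse(decoders, xs):
    """Root-mean-square error of the decoded output against the target x."""
    outputs = np.array([decoders @ activities(x) for x in xs])
    return np.sqrt(np.mean((outputs - xs) ** 2))

kappa = 0.002                    # learning rate (assumed value)
decoders = np.zeros(n_neurons)   # D with n = 1, i.e. a single row
x_test = np.linspace(-1.0, 1.0, 101)
rmse_before = rmse(decoders, x_test)

for _ in range(20000):
    x = rng.uniform(-1.0, 1.0)
    a = activities(x)
    delta = x - decoders @ a       # error: target minus current decoded output
    decoders += kappa * delta * a  # PES step: Delta D = kappa * delta * a^T

rmse_after = rmse(decoders, x_test)
```

Starting from zero decoders, the decoded output should move from a constant zero towards the input, so `rmse_after` ends up well below `rmse_before`.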

## LMUs
An LMU memory state, $\mathbf{m} \in \mathbb{R}^{q}$, for an order $q$ Legendre representation, is updated according to 
\begin{equation}
    \dot{\mathbf{m}}(t) =  A\mathbf{m}(t) +  Bu(t),
\end{equation}

where $u(t)$ is the input signal and $\theta$ is the time window of the memory. The matrices $A \in \mathbb{R}^{q \times q}$ and $B \in \mathbb{R}^{q \times 1}$ are given by 
\begin{equation}
    A_{ij} =\frac{(2i + 1)}{\theta} \begin{cases}
        -1 & i < j \\
        (-1)^{i-j+1} & i \geq j
    \end{cases}, \quad B_{i} = \frac{(2i + 1)(-1)^{i}}{\theta},
\end{equation}
with indices $i, j \in \{0, \dots, q-1\}$.
    
An approximation of the input at $\tau\theta$ seconds in the past ($0\leq \tau \leq 1$) is given by

$$u(t-\tau\theta) \approx \mathbf{P}^q(\tau)\mathbf{m}(t)$$

where $\mathbf{P}^{q}(\tau) \in \mathbb{R}^{1 \times q}$ is the vector of the shifted Legendre polynomials (of degree $0$ to $q-1$), evaluated at $\tau$.
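
The dynamics and the decoding step can be checked numerically without any neurons. The sketch below builds $A$ and $B$ with zero-based indices, integrates the memory with a trapezoidal (bilinear) rule, and recalls $u(t-\theta)$ for a 1 Hz sine input. The integrator, the test signal, and the values of $q$, $\theta$, and `dt` are assumptions made for this example.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

def lmu_matrices(q, theta):
    """Build the LMU (A, B) matrices for order q and window theta."""
    A = np.zeros((q, q))
    B = np.zeros(q)
    for i in range(q):
        B[i] = (2 * i + 1) * (-1.0) ** i / theta
        for j in range(q):
            sign = -1.0 if i < j else (-1.0) ** (i - j + 1)
            A[i, j] = (2 * i + 1) * sign / theta
    return A, B

def shifted_legendre(q, tau):
    """Shifted Legendre polynomials of degree 0..q-1 at tau in [0, 1]."""
    return legvander(np.atleast_1d(2.0 * tau - 1.0), q - 1)[0]

q, theta, dt = 6, 0.5, 1e-3
A, B = lmu_matrices(q, theta)

# Trapezoidal (bilinear) discretization; an assumed choice of integrator.
eye = np.eye(q)
lhs = eye - 0.5 * dt * A
Ad = np.linalg.solve(lhs, eye + 0.5 * dt * A)
Bd = np.linalg.solve(lhs, dt * B)

t = np.arange(0.0, 3.0, dt)
u = np.sin(2.0 * np.pi * t)           # 1 Hz test input
m = np.zeros(q)
recalled = np.zeros_like(t)
P_full = shifted_legendre(q, 1.0)     # tau = 1 recalls u(t - theta)
for k in range(len(t)):
    recalled[k] = P_full @ m          # decode from the memory at time t[k]
    if k + 1 < len(t):
        m = Ad @ m + Bd * 0.5 * (u[k] + u[k + 1])

# Compare with the true delayed signal once the start-up transient has passed.
delayed = np.sin(2.0 * np.pi * (t - theta))
late = t >= 2.0
max_err = np.max(np.abs(recalled[late] - delayed[late]))
```

Evaluating `shifted_legendre(q, tau)` at other values of `tau` recalls intermediate points of the window from the same memory state.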
   
When implementing the LMU dynamics using the NEF, a neural population represents $\mathbf{m}(t)$. It receives the input $u(t)$ through incoming weights that implement the transform $\tau_{syn} B$, and has recurrent connections that implement the transform $I + \tau_{syn} A$, where $\tau_{syn}$ is the time constant of the synaptic filter (not to be confused with the relative delay $\tau$ above). When the memory needs to be reset (e.g., when an episode ends), the population representing $\mathbf{m}(t)$ can be inhibited.
   
If we want to remember a vector signal $\mathbf{u}(t) \in \mathbb{R}^n$ instead of a scalar signal, we can use multiple LMUs (one per dimension), all stacked together in a matrix, $\mathbf{m} \in \mathbb{R}^{q \times n}$:
\begin{equation}
    \dot{\mathbf{m}}(t) =  A\mathbf{m}(t) +  B \mathbf{u}(t)^T,
\end{equation}
where $B \in \mathbb{R}^{q \times 1}$, so that column $j$ of $\mathbf{m}$ is the memory of $u_j(t)$.
To obtain past values:
$$\mathbf{u}(t-\tau\theta) \approx (\mathbf{P}^q(\tau)\mathbf{m}(t))^T$$
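
A compact sketch of the stacked memory, under the same assumptions as a scalar simulation would need (trapezoidal integration, hand-picked $q$, $\theta$, and `dt`) plus a made-up two-dimensional test signal. Each column of `m` is the memory of one input dimension, and a single matrix product recalls the whole vector.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

q, n, theta, dt = 6, 2, 0.5, 1e-3

# LMU matrices with zero-based indices i, j.
idx = np.arange(q)
i, j = np.meshgrid(idx, idx, indexing="ij")
A = (2 * i + 1) / theta * np.where(i < j, -1.0, (-1.0) ** (i - j + 1))
B = ((2 * idx + 1) * (-1.0) ** idx / theta)[:, None]    # shape (q, 1)

# Trapezoidal (bilinear) discretization; an assumed choice of integrator.
eye = np.eye(q)
lhs = eye - 0.5 * dt * A
Ad = np.linalg.solve(lhs, eye + 0.5 * dt * A)
Bd = np.linalg.solve(lhs, dt * B)                       # shape (q, 1)

t = np.arange(0.0, 3.0, dt)
# Two-dimensional test signal, one row per time step.
u = np.stack([np.sin(2 * np.pi * t), np.cos(np.pi * t)], axis=1)

m = np.zeros((q, n))                                    # one column per dimension
for k in range(len(t) - 1):
    u_mid = 0.5 * (u[k] + u[k + 1])                     # shape (n,)
    m = Ad @ m + Bd @ u_mid[None, :]                    # B u^T has shape (q, n)

# Recall the whole vector from tau * theta seconds before the final time.
tau = 0.5
P = legvander(np.atleast_1d(2 * tau - 1), q - 1)        # shape (1, q)
recalled = (P @ m).T[:, 0]                              # shape (n,)
t_past = t[-1] - tau * theta
expected = np.array([np.sin(2 * np.pi * t_past), np.cos(np.pi * t_past)])
```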

See the notebook "LMU with changing theta" for LMU examples

# RL with LMUs
For now, let's just consider the problem of training a critic network. Actions will come from a fixed pre-defined schedule. We have a neural population representing the state $s(t) \in \mathbb{R}^d$. A value neural population receives the state as input and (after training) encodes the value function, $V(s(t))\in \mathbb{R}$. 

We will use the PES learning rule, with a TD error as the error signal, to train the decoders on the connection between the state population and the value population. This will require memory (via LMUs) -- not just to compute the TD error, but also to recall state activities for the $\mathbf{a}(t)$ term in the PES rule.

Let $\mathbf{m}_V(t) \in \mathbb{R}^{q_v }$ be an LMU memory of the value output, and $\mathbf{m}_R(t) \in \mathbb{R}^{q_r }$ be an LMU memory of the reward signal.
Let $\mathbf{m}_a(t) \in \mathbb{R}^{q_a \times N}$ be an LMU stack remembering the state population's activities. Note that unlike the last two, this is a memory of $N$ signals, so $\mathbf{m}_a(t)$ is a matrix. Assume all LMUs have a time window of $\theta$.


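The PES rule to update a past value estimate (the one from $\theta$ seconds ago), given TD error $\delta(t)$, is 
\begin{align} 
    \Delta d_j = \kappa \delta(t)\mathbf{P}^{q_a}(1) \mathbf{m}_{a_j}(t),
\end{align}
where $d_j$ is the decoder of neuron $j$ and $\mathbf{m}_{a_j}(t) \in \mathbb{R}^{q_a}$ is the $j$-th column of $\mathbf{m}_a(t)$, so that $\mathbf{P}^{q_a}(1) \mathbf{m}_{a_j}(t) \approx a_j(t-\theta)$.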


#### TD(0)

The simplest RL learning rule that can be implemented in this way is the TD(0) rule -- an update of the value at just a short time in the past ($t-\Delta t$, so $\theta = \Delta t $) using only the current reward rate.
The standard TD(0) error is
$$\delta_t^{(0)} = R_t + \gamma V(s_{t+1}) - V(s_t) $$

Since we're going to learn online, instead of updating the current output $V(s_t)$ with error $\delta_t^{(0)}$ computed with output from the future ($V(s_{t+1})$), we'll update the past output using the current output:
$$\delta_t^{(0)} = R_{t-\Delta t} + \gamma V(s_{t}) - V(s_{t-\Delta t}) $$

In continuous-time RL, TD(0) is a bit different. An update with only the instantaneous reward rate should be
\begin{align} 
\delta_t^{(0)} &= R(t) + \gamma \frac{dV(t)}{dt} - V(t) \\
&\approx  R(t) + \gamma \frac{V(t+\Delta t) - V(t)}{\Delta t} - V(t) \\
&=  R(t) +  \frac{\gamma}{\Delta t}V(t+\Delta t) - (1 + \frac{\gamma}{\Delta t})V(t)
\end{align}

The TD(0) error and PES update computed with LMUs (with  $\theta = \Delta t $) are given by
\begin{align} 
    \delta^{(0)}(t) &=  \mathbf{P}^{q_r}(1)\mathbf{m}_{R}(t) + \frac{\gamma}{\theta} V(t) - \left(1 + \frac{\gamma}{\theta}\right)\mathbf{P}^{q_v}(1)\mathbf{m}_{V}(t) \\
    \Delta d_j &= \kappa  \mathbf{P}^{q_a}(1) \mathbf{m}_{a_j}(t) \left ( \mathbf{P}^{q_r}(1)\mathbf{m}_{R}(t) + \frac{\gamma}{\theta} V(t) - \left(1 + \frac{\gamma}{\theta}\right)\mathbf{P}^{q_v}(1)\mathbf{m}_{V}(t) \right )
\end{align}
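
As a plain NumPy sketch (no neurons involved), the two expressions above amount to three Legendre decodes followed by a scaled outer product. The function names and the example memory states are hypothetical. The example uses the fact that an LMU memory of a constant signal $c$ settles at $[c, 0, \dots, 0]$, which follows from $A$ and $B$ because the first column of $A$ equals $-B$.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

def legendre_at(q, tau):
    """Vector P^q(tau) of shifted Legendre polynomials, shape (q,)."""
    return legvander(np.atleast_1d(2.0 * tau - 1.0), q - 1)[0]

def td0_update(m_R, m_V, m_a, V_now, gamma, theta, kappa):
    """Return (delta, delta_d) for the LMU-based TD(0) rule.

    m_R: (q_r,) reward memory, m_V: (q_v,) value memory,
    m_a: (q_a, N) activity memory, V_now: current value output V(t).
    """
    R_past = legendre_at(len(m_R), 1.0) @ m_R         # approx. R(t - theta)
    V_past = legendre_at(len(m_V), 1.0) @ m_V         # approx. V(t - theta)
    a_past = legendre_at(m_a.shape[0], 1.0) @ m_a     # approx. a(t - theta), shape (N,)
    delta = R_past + gamma / theta * V_now - (1.0 + gamma / theta) * V_past
    return delta, kappa * delta * a_past

# Hypothetical steady-state memories of constant signals.
q_r, q_v, q_a, N = 4, 4, 6, 3
m_R = np.zeros(q_r)
m_R[0] = 0.2                       # reward rate held at 0.2
m_V = np.zeros(q_v)
m_V[0] = 0.5                       # value output held at 0.5
m_a = np.zeros((q_a, N))
m_a[0] = [1.0, 0.0, 2.0]           # constant activities of N = 3 neurons

delta, delta_d = td0_update(
    m_R, m_V, m_a, V_now=0.5, gamma=0.9, theta=0.01, kappa=1e-3
)
```

In this constant example the two value terms nearly cancel, leaving `delta` equal to the recalled reward minus the recalled value, and only neurons that were active $\theta$ seconds ago receive a decoder change.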

See the TD(0) notebook for an example.




### Reward normalization
A possible issue with this is the magnitude of the value function. A neural population represents the value function. With the default radius, it can only represent variables between -1 and 1, so if the value function exceeds that range, the population will 'saturate'. To avoid this, assume that the reward is bounded, $|R(s)| \leq r_{max}$ for all states $s$. Then, since $V(s) = \mathbb{E} [ \int_{0}^{T} \gamma^t R(s(t)) dt | s(0) = s]$, we have
$$|V| \leq r_{max} \int_{0}^{T} \gamma^t dt = r_{max} \frac{\gamma^T - 1}{\log(\gamma)}.$$
If we want $|V| \leq 1$, then normalize rewards so that $r_{max} \leq \frac{\log(\gamma)}{\gamma^T - 1}$. Note that there may be better ways to normalize/modify either rewards, discount rates, or error signals to prevent saturation. In fact, if the reward is often less than $r_{max}$ (e.g., a sparse reward problem), then this normalization may result in a value function with very low magnitude. This may cause problems when using spiking neural networks to represent and compare values.
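
As a quick numerical check of this bound, the sketch below computes the largest admissible $r_{max}$ for an assumed $\gamma = 0.9$ and horizon $T = 10$, then integrates $\gamma^t r_{max}$ over $[0, T]$ with a midpoint rule. The function name and the parameter values are chosen only for illustration.

```python
import math

def max_reward_for_unit_value(gamma, T):
    """Largest r_max such that |V| <= 1, for gamma in (0, 1) and horizon T."""
    return math.log(gamma) / (gamma ** T - 1.0)

gamma, T = 0.9, 10.0
r_max = max_reward_for_unit_value(gamma, T)

# Midpoint-rule integral of gamma^t * r_max over [0, T]: the value obtained
# when the reward rate stays at r_max for the whole episode.
n_steps = 10000
dt = T / n_steps
value_bound = sum(gamma ** ((k + 0.5) * dt) * r_max * dt for k in range(n_steps))
```

Here `value_bound` should come out at 1 up to integration error, confirming that a reward rate held at `r_max` just reaches the edge of the representable range.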