Peri-Stimulus Time Histograms Estimation Through Poisson Regression Without Generalized Linear Models
Viewed “from the outside”, neurons generate brief electrical pulses: the action potentials.
Left: the brain of an insect with the recording probe on which 16 electrodes (the bright spots) have been etched. Each probe branch has a 80
After a “rather heavy” pre-processing called spike sorting, the raster plot representing the spike trains can be built:
- A key working hypothesis in neuroscience states that the spikes' occurrence times, as opposed to their waveforms, are the only information carriers between brain regions (Adrian and Zotterman, 1926).
- This hypothesis encourages the development of models whose goal is to predict the probability of occurrence of a spike at a given time, without necessarily considering the biophysical spike generation mechanisms.
- In the sequel we will identify spike trains with point process / counting process realizations.
The expected counting process of a homogeneous Poisson process—with the same mean frequency—is shown in red.
A renewal process is inadequate here: the ranks of successive inter-spike intervals are correlated. We split the ranks into 5 categories of equal size.
ECDF of the rank of interval k+1 conditioned on the rank class of interval k. 95% confidence bands would here have a width of 0.14.
Even if we focus on “isolated” neurons in the stationary / homogeneous regime, renewal processes won’t be adequate in general as models of our observed spike trains.
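A sketch of the rank-based dependence check just described; spike_times would normally come from the spike-sorting step, so a surrogate train with correlated inter-spike intervals is simulated here purely to make the snippet run on its own.

import numpy as np
import matplotlib.pyplot as plt

## Surrogate spike train with AR(1)-correlated log inter-spike intervals.
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
log_isi = np.empty(2000)
log_isi[0] = z[0]
for i in range(1, 2000):
    log_isi[i] = 0.6 * log_isi[i - 1] + z[i]
spike_times = np.cumsum(np.exp(log_isi))

## Ranks of successive intervals; split interval-k ranks into 5 equal classes
## and plot the ECDF of the interval-(k+1) rank within each class.
isi = np.diff(spike_times)
ranks = np.argsort(np.argsort(isi)) / (isi.size - 1)
prev, nxt = ranks[:-1], ranks[1:]
cls = np.minimum((prev * 5).astype(int), 4)
for c in range(5):
    x = np.sort(nxt[cls == c])
    plt.step(x, np.arange(1, x.size + 1) / x.size, where='post',
             label=f'interval k rank class {c + 1}')
plt.xlabel('rank of interval k+1')
plt.ylabel('ECDF')
plt.legend()
plt.show()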
Neuron 3's spike times relative to neuron 2's (reference neuron) spike times.
\alert{Consequence}: our model should be able to handle interactions.
20 stimulations with citronellal. Stimulations are delivered for 500 ms (gray background). Is neuron 2 responding to the stimulation? Cockroach (Periplaneta americana) recordings and spike sorting by Antoine Chaffiol.
Neuron 1: 20 stimulations with citronellal, terpineol and a mixture of the two. \alert{Are the responses any different?}
Our model should make room for:
- The elapsed time since the last spike of the neuron (enough for homogeneous renewal processes).
- Variables related to the discharge history, like the duration of the last inter-spike interval.
- The elapsed time since the last spike of a “functionally coupled” neuron.
- The elapsed time since the beginning of an applied stimulation.
- We want to estimate the peri-stimulus time histogram (PSTH) considered as an observation from an inhomogeneous Poisson process.
- In addition to estimation we want to:
- Test if a neuron is responding to a given stimulation.
- Test if the responses of a given neuron to two different stimulations are different.
- This implies building some sort of confidence bands around our best estimation.
We go from the raw data to a histogram built with a tiny time step (25 ms), leading to an estimator with little bias and large variance.
- We model this “averaged process” as an inhomogeneous Poisson process with intensity λ(t).
- The histogram we just built can then be seen as the observation of a collection of Poisson random variables, $\{Y_1,\ldots,Y_k\}$, with parameters: $$n \, ∫_{t_i-δ/2}^{t_i+δ/2} λ(u) \, du \; ≈ \; n \, λ(t_i) \, δ \; , \quad i = 1,\ldots,k \; ,$$ where $t_i$ is the center of a class (bin), $δ$ is the bin width, $n$ is the number of stimulations and $k$ is the number of bins.
- A piecewise constant estimator of $λ(t)$ is then obtained with: $$\hat{λ}(t) = y_i/(n δ)\, , \quad \textrm{if} \quad t ∈ [t_i-δ/2,t_i+δ/2) \; .$$ This is the “classical” PSTH (a minimal construction is sketched below).
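The sketch below builds such a classical PSTH with numpy; the function and variable names (classical_psth, spike_times_per_trial) are illustrative, and the simulated trials only stand in for real stimulus-aligned spike trains.

import numpy as np

def classical_psth(spike_times_per_trial, t_min, t_max, delta):
    ## Piecewise constant estimate of lambda(t): counts pooled over the n
    ## trials in bins of width delta, divided by n * delta (spikes / s).
    n = len(spike_times_per_trial)
    edges = np.linspace(t_min, t_max, int(round((t_max - t_min) / delta)) + 1)
    counts = np.zeros(edges.size - 1)
    for spikes in spike_times_per_trial:
        counts += np.histogram(spikes, bins=edges)[0]
    centers = edges[:-1] + delta / 2
    return centers, counts, counts / (n * delta)

## Illustrative use: 20 simulated trials with a constant 10 Hz rate on [0, 10] s.
rng = np.random.default_rng(1)
trials = [np.sort(rng.uniform(0, 10, rng.poisson(100))) for _ in range(20)]
t, y, lam_hat = classical_psth(trials, 0.0, 10.0, 0.025)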
- We are going to assume that λ(t) is \alert{smooth}—this is a very reasonable assumption given what we know about the insect olfactory system.
- We can then attempt to improve on the “classical” PSTH by trading a little bias increase for (a hopefully large) variance decrease.
- Many nonparametric methods are available to do that: kernel regression, local polynomials, smoothing splines, wavelets, etc.
- A problem in the case of the PSTH is that the observed counts, $\{y_1,\ldots,y_k\}$, follow Poisson distributions with different parameters, implying that they have different variances.
- We then have at least two possibilities: i) use a generalized linear model (GLM); ii) transform the data to stabilize the variance.
- We are going to use the second approach.
- Following Brown, Cai and Zhou (2010), let's consider $X_1,\ldots,X_n$ IID from a Poisson distribution with parameter $ν$.
- Define $X = ∑_{j=1}^n X_j$; the CLT gives us: $$\sqrt{n}\left(X/n-ν\right) \stackrel{L}{→} \mathcal{N}(0,ν) \quad \textrm{as} \; n → ∞ \, .$$
- A variance stabilizing transformation is a function $G : \mathbb{R} → \mathbb{R}$ such that: $$G'(x) = 1/\sqrt{x}\, .$$
- The delta method (or error propagation method; a first-order Taylor expansion) then yields: $$\sqrt{n}\left(G(X/n)-G(ν)\right) \stackrel{L}{→} \mathcal{N}(0,1)\, .$$
- It is known (Anscombe, 1948) that the variance stabilizing properties can be further improved by using transformations of the form: $$H_n(X) = G\left(\frac{X+a}{n+b}\right)$$ for suitable choices of $a$ and $b$.
- In nonparametric regression we want to set $a$ and $b$ such that $\mathrm{E}\left(H_n(X)\right)$ optimally matches $G(ν)$.
- Brown, Cai and Zhou (2010) show that in all relevant PSTH estimation problems we have: $$\mathrm{Var}\left(2 \sqrt{(X+1/4)/n}\right) = \frac{1}{n} + O(n^{-2}) \, .$$
- They also show that: $$\mathrm{E}\left(2 \sqrt{(X+1/4)/n}\right) - 2 \sqrt{ν} = O(n^{-2}) \, .$$
- They get similar transformations for binomial and negative binomial random variables.
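A quick Monte Carlo check of this stabilization (a sketch only; the grid of ν values, n = 20 and the number of replicates are arbitrary choices): the variance of $2\sqrt{(X+1/4)/n}$ stays close to $1/n$ while the variance of the raw $X/n$ grows with ν.

import numpy as np

rng = np.random.default_rng(2)
n = 20          ## number of trials summed in each count
reps = 100000   ## Monte Carlo replicates
for nu in [0.5, 1.0, 2.0, 5.0, 10.0]:
    X = rng.poisson(n * nu, size=reps)     ## X = sum of n Poisson(nu) counts
    raw = X / n
    stab = 2 * np.sqrt((X + 0.25) / n)
    print(f"nu={nu:5.1f}  Var(X/n)={raw.var():.4f}  "
          f"Var(stabilized)={stab.var():.4f}  target 1/n={1/n:.4f}")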
- Since our knowledge of the biophysics of these neurons and of the network they form is still in its infancy, we can hardly propose a reasonable parametric form for our PSTHs (or their variance stabilized versions).
- We therefore model our stabilized PSTH by:
$$Z_i \doteq 2 \sqrt{(Y_i+1/4)/n} = r(t_i) + ε_i σ \, ,$$ where the $ε_i \stackrel{\textrm{IID}}{∼} \mathcal{N}(0,1)$, $r$ is assumed “smooth” and is estimated with a linear smoother (kernel regression, local polynomials, smoothing splines) or with wavelets (or with any nonparametric method you like).
- Following Larry Wasserman (All of Nonparametric Statistics, 2006) we define a linear smoother by a collection of functions $l(t) = \left(l_1(t),\ldots,l_k(t)\right)^T$ such that: $$\hat{r}(t) = ∑_{i=1}^k l_i(t) Z_i\, .$$
- The simplest smoother we are going to use is built from the tricube kernel: $$K(t) = \frac{70}{81}\left(1 - \left|t\right|^3\right)^3 I(t) \, ,$$ where $I(t)$ is the indicator function of $[-1,1]$.
- The functions $l_i$ are then defined by: $$l_i(t) = \frac{K\left(\frac{t-t_i}{h}\right)}{∑_{j=1}^k K\left(\frac{t-t_j}{h}\right)}\, .$$ A minimal implementation is sketched below.
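A sketch of this tricube Nadaraya-Watson smoother, assuming t_grid holds the bin centers and z the variance-stabilized counts (the names are illustrative):

import numpy as np

def tricube(u):
    ## Tricube kernel K(u) = 70/81 * (1 - |u|^3)^3 on [-1, 1], 0 outside.
    u = np.abs(u)
    return np.where(u <= 1, (70 / 81) * (1 - u**3)**3, 0.0)

def smoother_matrix(t_grid, h):
    ## k x k matrix whose row i holds the weights l_j(t_i), so that r_hat = L @ z.
    W = tricube((t_grid[:, None] - t_grid[None, :]) / h)
    return W / W.sum(axis=1, keepdims=True)

def nw_estimate(t_grid, z, h):
    return smoother_matrix(t_grid, h) @ z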
- When using this kind of approach the choice of the bandwidth $h$ is clearly critical.
- Since after variance stabilization the variance is known, we can set our bandwidth by minimizing Mallows' $C_p$ criterion instead of using cross-validation. For (soft) wavelet thresholding we use the universal threshold, which also requires knowing (or estimating) the variance.
- More explicitly, with linear smoothers our estimates $\left(\widehat{r}(t_1),\ldots,\widehat{r}(t_k)\right)^T$ can be written in matrix form as: $$\widehat{\mathbf{r}} = L(h) \, \mathbf{Z} \, ,$$ where $L(h)$ is the $k × k$ symmetric matrix whose element $(i,j)$ is given by $l_i(t_j)$.
- Ideally we would like to set $\widehat{h}$ as: $$\arg\min_h \; \frac{1}{k} ∑_{i=1}^k \left(r(t_i) - \hat{r}(t_i)\right)^2 \, .$$
- But we don't know $r$ (that's what we want to estimate!), so we minimize Mallows' $C_p$ criterion: $$\frac{1}{k} ∑_{i=1}^k \left(Z_i - \hat{r}(t_i)\right)^2 + \frac{2 σ^2 \mathrm{tr}\left(L(h)\right)}{k} \, ,$$ where $\mathrm{tr}\left(L(h)\right)$ stands for the trace of $L(h)$.
- If we don't know $σ^2$, we minimize the cross-validation criterion: $$\frac{1}{k} ∑_{i=1}^k \frac{\left(Z_i - \hat{r}(t_i)\right)^2}{\left(1-L_{ii}(h)\right)^2} \, .$$ Both criteria are sketched in code after this list.
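A sketch of both criteria, reusing smoother_matrix from the previous sketch; t_grid, z and the bandwidth grid h_grid are placeholders for the bin centers, the stabilized counts and a user-chosen set of candidate bandwidths.

import numpy as np

def cp_score(t_grid, z, h, sigma2):
    ## Mallows' C_p: mean squared residual plus 2 * sigma^2 * tr(L(h)) / k.
    L = smoother_matrix(t_grid, h)
    resid = z - L @ z
    return np.mean(resid**2) + 2 * sigma2 * np.trace(L) / z.size

def cv_score(t_grid, z, h):
    ## Leave-one-out cross-validation via the usual linear-smoother shortcut.
    L = smoother_matrix(t_grid, h)
    resid = z - L @ z
    return np.mean((resid / (1 - np.diag(L)))**2)

def best_bandwidth(t_grid, z, sigma2, h_grid):
    scores = np.array([cp_score(t_grid, z, h, sigma2) for h in h_grid])
    return h_grid[int(np.argmin(scores))], scores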
Left: CV score in black, $C_p$ score in red. Right: variance-stabilized data (black) with the Nadaraya-Watson estimator (red) obtained with the “best” bandwidth.
Residuals obtained with the Nadaraya-Watson estimator. The red dashed lines correspond to
Nadaraya-Watson estimator (red), smoothing splines estimator (blue) and wavelet estimator (black; Haar wavelets, soft thresholding, universal threshold).
- Keeping in line with Wasserman (2006), we consider that providing an estimate $\hat{r}$ of a curve $r$ is not sufficient for drawing scientific conclusions.
- We would like to provide a \alert{confidence set} for $r$ in the form of a band: $$\mathcal{B}=\left\{s : l(t) ≤ s(t) ≤ u(t), \; ∀ t ∈ [a,b]\right\}\, ,$$ based on a pair of functions $\left(l(t),u(t)\right)$.
- We would like to have: $$\mathrm{Pr}\left\{r ∈ \mathcal{B} \right\} ≥ 1 - α$$ for all $r ∈ \mathcal{R}$, where $\mathcal{R}$ is a large class of functions.
- When working with smoothers, our estimators exhibit a bias that does not disappear even with large sample sizes.
- We will therefore try to build sets around $\overline{r} = \mathrm{E}(\hat{r})$; that will be sufficient to address some of the questions we started with.
- For a linear smoother, $\hat{r}(t) = ∑_{i=1}^k l_i(t) Z_i$, we have: $$\overline{r}(t) = \mathrm{E}\left(\hat{r}(t)\right) = ∑_{i=1}^k l_i(t) r(t_i)$$ and $$\mathrm{Var}\left(\hat{r}(t)\right) = σ^2 \, ∑_{i=1}^k l_i(t)^2 = (1/n) \|l(t)\|^2\, .$$ Remember that we stabilized the variance at $1/n$.
- We will consider a confidence band for $\overline{r}(t)$ of the form: $$I(t) = \left(\hat{r}(t) - c \|l(t)\|/\sqrt{n},\; \hat{r}(t) + c \|l(t)\|/\sqrt{n}\right) \, ,$$ for some $c > 0$ and $a ≤ t ≤ b$ (a minimal band computation is sketched below).
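A sketch of this band evaluated at the bin centers, again reusing smoother_matrix; the constant c is left to the caller (a pointwise Gaussian quantile such as 1.96 is only a simple placeholder choice, simultaneous bands need a larger value, as discussed in Wasserman 2006).

import numpy as np

def confidence_band(t_grid, z, h, n, c):
    ## Band  r_hat(t_i) +/- c * ||l(t_i)|| / sqrt(n)  at each bin center.
    L = smoother_matrix(t_grid, h)             ## rows are l(t_i)^T
    r_hat = L @ z
    half_width = c * np.sqrt((L**2).sum(axis=1)) / np.sqrt(n)
    return r_hat - half_width, r_hat + half_width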
Variance-stabilized data (black), Nadaraya-Watson estimator (blue) and 0.95 confidence band (red).
20 stimulations with citronellal. Stimulations are delivered for 500 ms (gray background). Is neuron 2 responding to the stimulation?
Since under the null hypothesis the rate is constant, there is no bias and we can increase the bandwidth (right side) if necessary.
Neuron 1: 20 stimulations with citronellal, terpineol and a mixture of the two. \alert{Are the responses any different?}
- We start as previously by building a “classical” PSTH with very fine bins (25 ms) with the citronellal and terpineol trials to get: $\{y_1^{\textrm{citron}},\ldots,y_k^{\textrm{citron}}\}$ and $\{y_1^{\textrm{terpi}},\ldots,y_k^{\textrm{terpi}}\}$.
- We stabilize the variance as we did before ($z_i = 2 \sqrt{(y_i+0.25)/n}$) to get: $\{z_1^{\textrm{citron}},\ldots,z_k^{\textrm{citron}}\}$ and $\{z_1^{\textrm{terpi}},\ldots,z_k^{\textrm{terpi}}\}$.
- Our null hypothesis is that the two underlying inhomogeneous Poisson processes are the same, therefore: $$z_i^{\textrm{citron}} = r(t_i) + ε_i^{\textrm{citron}} σ \quad \textrm{and} \quad z_i^{\textrm{terpi}} = r(t_i) + ε_i^{\textrm{terpi}} σ \, ,$$ which gives $$z_i^{\textrm{terpi}} - z_i^{\textrm{citron}} = \sqrt{2}\, ε_i σ \, .$$
- We then want to test if our collection of observed differences $\{z_1^{\textrm{terpi}} - z_1^{\textrm{citron}},\ldots,z_k^{\textrm{terpi}} - z_k^{\textrm{citron}}\}$ is compatible with $k$ IID draws from $\mathcal{N}(0,2σ^2)$.
- R. Durrett (2009) Probability: Theory and Examples. CUP. Sec. 7.6, pp. 323-329.
- P. Billingsley (1999) Convergence of Probability Measures. Wiley. p. 121.
- Under our null hypothesis (same inhomogeneous Poisson process for citronellal and terpineol), the random variables $$X_i = \frac{Z_i^{\textrm{terpi}} - Z_i^{\textrm{citron}}}{\sqrt{2}\, σ} \, ,$$ should correspond to the $X_i$ of Donsker's theorem.
- We can then construct the partial-sum process $$S_k(t) = \frac{1}{\sqrt{k}} ∑_{i=1}^{\lfloor k t \rfloor} X_i \, , \quad t ∈ [0,1] \, ,$$ and check if the observed trajectory looks Brownian or not (a sketch of this construction is given below).
- Ideally, we would like to define a domain in $[0,1] × \mathbb{R}$ containing the realizations of a canonical Brownian motion with a given probability.
- To have a reasonable power, we would like the surface of this domain to be minimal.
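A sketch of this construction with hypothetical names: z_citron and z_terpi are the two variance-stabilized PSTHs on the same k bins and n is the number of trials per odor.

import numpy as np

def donsker_path(z_citron, z_terpi, n):
    ## Under H0 each bin-wise difference is N(0, 2/n), so x_i = diff * sqrt(n/2)
    ## is standard normal and the partial sums S_k(t) should look Brownian.
    x = (z_terpi - z_citron) * np.sqrt(n / 2.0)
    k = x.size
    t = np.arange(1, k + 1) / k
    return t, np.cumsum(x) / np.sqrt(k)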
Does this look like the realization of a canonical Brownian motion?
- In a (non-trivial) paper, Kendall, Marin and Robert (2007) showed that the upper boundary of this minimal surface domain is given by: $$u^*(t) ≡ \sqrt{-W_{-1}\left(-(κ t)^2\right)} \, \sqrt{t}, \quad \mathrm{for} \quad κ \, t ≤ 1/\sqrt{e} \, ,$$ where $W_{-1}$ is the secondary real branch of the Lambert W function (defined as the solution of $W(z) \exp W(z) = z$); $κ$ is adjusted to get the desired probability.
- They also showed that a domain whose upper boundary is given by $u(t) = a + b \sqrt{t}$ is almost of minimal surface ($a > 0$ and $b > 0$ being adjusted to get the correct probability).
- Loader and Deely (1987) give a very efficient algorithm to adjust $a$ and $b$ or $κ$.
- The R package STAR (Spike Train Analysis with R) provides all that (and much more) out of the box; a small sketch of evaluating $u^*$ follows.
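For reference, the $u^*$ boundary is easy to evaluate with scipy's Lambert W implementation; this is only a sketch, and κ still has to be calibrated (with the Loader and Deely algorithm, or the values shipped with STAR) to reach the desired coverage probability.

import numpy as np
from scipy.special import lambertw

def u_star(t, kappa):
    ## u*(t) = sqrt(-W_{-1}(-(kappa*t)^2)) * sqrt(t), for kappa*t <= 1/sqrt(e).
    t = np.atleast_1d(np.asarray(t, dtype=float))
    out = np.full(t.shape, np.nan)
    ok = (t > 0) & (kappa * t <= 1 / np.sqrt(np.e))
    w = np.real(lambertw(-(kappa * t[ok])**2, k=-1))   ## secondary real branch
    out[ok] = np.sqrt(-w) * np.sqrt(t[ok])
    return out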
Almost minimal surface domains with probabilities 0.95 (dashed red) and 0.99 (red) of containing an observed canonical Brownian motion. Black: terpineol - citronellal; blue: odd terpineol trials - even terpineol trials.
## n2citron_x (bin centers, in s), n2citron_y (variance-stabilized counts) and
## the boundary functions c95 / c99 (the 0.95 and 0.99 almost-minimal-surface
## boundaries) are defined earlier in the analysis.
import numpy as np
import matplotlib.pyplot as plt
## Get the part of n2citron preceding the stimulation; the count of bins with
## centers <= 6 s gives the length (500) used in the slices below.
np.sum(n2citron_x <= 6)
n2citron_y_b = n2citron_y[:500]
## Get the part of the SAME length coming just after
n2citron_y_r = n2citron_y[500:1000]
## Get the normalized partial sum of the difference process: with 20 trials each
## stabilized bin has variance 1/20, so each difference has variance 1/10 and
## the partial sums are divided by sqrt(500).
n2citron_y_d = np.cumsum(n2citron_y_r - n2citron_y_b) * np.sqrt(10 / 500)
## Do the plot for the test
yy = np.linspace(0, 1, 500)
plt.plot(yy, n2citron_y_d)
plt.plot(yy, c95(yy), color='red', lw=2, linestyle='dashed')
plt.plot(yy, -c95(yy), color='red', lw=2, linestyle='dashed')
plt.plot(yy, c99(yy), color='red', lw=2)
plt.plot(yy, -c99(yy), color='red', lw=2)
- Rune, Susanne, Henrik and Massimiliano for inviting me to this wonderful workshop.
- The SynchNeuro ANR project for paying for my flight.
- Antoine Chaffiol for the data.
- Chong Gu for his R package gss.
- My colleagues Yves Rozenholc and Avner Bar-Hen for discussions.
- Vilmos Prokaj, Olivier Faugeras and Jonathan Touboul for pointing me to Donsker's theorem.
- You, for listening.
Confidence bands computed on [6,14]. 0.78 was chosen because
- Probabilists working on processes use the filtration or history: a family of increasing sigma algebras, $\left(\mathcal{F}_t\right)_{0\leq t \leq ∞}$, such that all the information related to the process at time $t$ can be represented by an element of $\mathcal{F}_t$.
- The conditional intensity of a counting process $N(t)$ is then defined by: $$λ(t \mid \mathcal{F}_t) ≡ \lim_{h ↓ 0} \frac{\mathrm{Prob}\{N(t+h)-N(t)=1 \mid \mathcal{F}_t\}}{h} \; .$$
- $λ$ constitutes an exhaustive description of the process / spike train.
As soon as we adopt a conditional intensity based formalism, we must:
- Find an estimator $\hat{λ}$ of $λ$.
- Find goodness-of-fit tests.
- We start by associating to each spike time $t_i$ its transformed time: $$Λ_i = Λ(t_i) = ∫_0^{t_i} λ(u \mid \mathcal{F}_u) \, du \, .$$
- If our model is correct ($\hat{λ} ≈ λ$), the intervals between successive spikes after the time transformation $$\{t_1,\ldots,t_n\} → \{Λ(t_1) = Λ_1,\ldots,Λ(t_n) = Λ_n\}$$ are IID exponential with parameter 1.
- Stated differently, the point process $\{Λ_1,\ldots,Λ_n\}$ is a homogeneous Poisson process with parameter 1.
The next slides illustrate this result.
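A self-contained sketch of the transformation and of Ogata's residual check; the intensity, its closed-form integral and the simulated spikes are purely illustrative, only the procedure matters.

import numpy as np
from scipy import stats

def lam(t):
    ## illustrative intensity (Hz)
    return 20.0 * (1.0 + 0.8 * np.sin(2 * np.pi * t))

def Lam(t):
    ## its integral from 0 to t (closed form for this example)
    return 20.0 * t + (16.0 / (2 * np.pi)) * (1.0 - np.cos(2 * np.pi * t))

## Simulate spikes from lam by thinning a rate-40 homogeneous Poisson process.
rng = np.random.default_rng(3)
cand = np.cumsum(rng.exponential(1 / 40.0, size=5000))
spikes = cand[rng.uniform(size=cand.size) < lam(cand) / 40.0]

## Time transformation t_i -> Lambda(t_i); if the model is right the increments
## are IID Exp(1), i.e. 1 - exp(-increment) is uniform on [0, 1] (Ogata, 1988).
u = 1.0 - np.exp(-np.diff(Lam(spikes)))
print(stats.kstest(u, 'uniform'))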
- If, for a good model, the transformed sequence of spike times, $\{\hat{Λ}_1,\ldots,\hat{Λ}_n\}$, is the realization of a homogeneous Poisson process with rate 1, we should test $\{\hat{Λ}_1,\ldots,\hat{Λ}_n\}$ against such a process.
- This is what Yosihiko Ogata proposed in 1988 (Statistical models for earthquake occurrences and residual analysis for point processes, Journal of the American Statistical Association, 83: 9-27).
- But an observation nevertheless suggests that another type of test could also be used…
- The intuition that a properly normalized version of the process $N(Λ) - Λ$ converges towards a Brownian motion is correct.
- This is an easy consequence of Donsker's theorem, as Vilmos Prokaj explained to me on the R mailing list and as Olivier Faugeras and Jonathan Touboul explained to me directly.
- It is moreover possible to find regions of minimal area having a given probability of containing the whole trajectory of a canonical Brownian motion (Kendall, Marin and Robert, 2007; Loader and Deely, 1987).
- We thereby get a new goodness-of-fit test (a minimal sketch follows).
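A sketch of the corresponding statistic: on the rescaled axis the counting process has unit rate, so $(N(s) - s)/\sqrt{T}$, with $T$ the total transformed duration and plotted against $s/T$, should stay inside a Brownian-motion domain such as the $a + b\sqrt{t}$ one. Here Lam_spikes stands for the transformed spike times and the boundary coefficients are placeholders to be calibrated with the Loader and Deely algorithm.

import numpy as np

def brownian_gof_path(Lam_spikes, n_grid=512):
    ## Path t -> (N(t*T) - t*T) / sqrt(T) on the rescaled time axis, with T the
    ## total transformed duration; approximately a standard Brownian motion on
    ## [0, 1] when the transformed train is a unit-rate Poisson process.
    Lam_spikes = np.sort(np.asarray(Lam_spikes, dtype=float))
    T = Lam_spikes[-1]
    s = np.linspace(0.0, T, n_grid)
    N = np.searchsorted(Lam_spikes, s, side='right')
    return s / T, (N - s) / np.sqrt(T)

def boundary(t, a, b):
    ## a + b * sqrt(t) boundary; a and b are placeholders to be calibrated.
    return a + b * np.sqrt(t)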