Uniformity in Histology Histograms Hypothesis
The field of connectomics suffers from increasingly large datasets without proportionally improving algorithms with which to process them. In order to keep up with the pace of embiggening data, it is commonly useful to reduce the size or complexity of a dataset by either downsampling or reducing the question of what is being inspected. For instance, the 2011 M. musculus V1 dataset from Network anatomy and in vivo physiology of visual cortical neurons (Bock et al)1 is over 20TB at native resolution. This volume of data is unmanageable on a conventional personal computer, and so an informed "downsampling" was performed wherein computer vision algorithms computationally detected synapses, and the volume was then spatially downsampled, with each supervoxel being replaced by the number of subvoxels within it that were taken up by synapse volume.
During this semester, we developed strategies to illustrate how meaningful neuroscientific analysis can still be performed on this reduced dataset. We believe that this method of reduction (from EM spatial data to synaptic volume data) may not be suitable for all types of inquiry, but for certain investigations, such as determining the dataset's orientation and location in cortical space, this comparatively tiny (MB instead of TB) dataset is, we posit, sufficient.
Our data are taken from the 2011 M. musculus V1 dataset from Network anatomy and in vivo physiology of visual cortical neurons (Bock et al)1. The data are simplified by downsampling, where the value of a downsampled voxel is the sum of all of its constituent voxels.
We posit that synaptic density may serve as an indicator of computationally important volumes of brain tissue: When generating a map of connectivity, it is these areas of high synaptic density that are most influential to the resultant graph.
Synaptic density may also be an indicator of a greater overarching pattern or layout to areas of highly structured brain tissue (namely, cortex). The concept of cortical motifs — recurrent cortical machinery — is a highly contested concept in modern computational neuroscience. The neuroscience community widely agrees that such motifs exist in V1 (hypercolumns) where they represent retinotopic layout, and PAC (cochlear tonotopy). However, the existence of these columns in other, non-sensory cortex remains up for debate. If patterns can be found in synaptic density across layers of cortex (a simpler task than inspecting the tissue or connectome itself), then this lends credence to the theory that cortical columns extend beyond primary sensory areas such as V1 and PAC.
Such information might be useful for establishing a baseline understanding of synaptic density that might, in future studies, be used to understand diseases thought to arise in the cortex. Discrepencies between the model described by this analysis and models describing a clinical population could yield understanding of a diseased state (i.e. Alzheimers, Dementia).
We have a dataset with 64x64x12
voxels, or 64^3
voxels (see readme in data source repo). Each
- Three coordinates
$x_i$ ,$y_i$ ,$z_i$ labelled cx, cy, and cz. - Some
$S_i$ , where${S_i ∈ ℝ : S_i ≥ 0, ∀ i}$ , labelled "synapses". This represents the number of synapses in that voxel, and was generated automatically computer vision. - Some
$U_i$ , where${Ui ∈ ℝ : Ui ≥ 0, ∀ i}$ , labelled "unmasked". See Definingunmasked
There are 108 unique 108x52x11
. Many analyses were performed on data reformatted into this real volume - a 3D array with synapse densities at each location, with the index representing relative location. Since the supervoxels are uniformly spaced, this works well.
Our first inquiry was to formally define the meaning of the unmasked
keyword in our input data. Before we embarked on any meaningful analyses, we needed to understand the fundamental definition of our descriptor. First, we discovered that
We also inspected the distribution of the unmasked
values (Fig. 1.)
Figure 1. Histogram of the distribution of the
unmasked
descriptor. From the figure above, it is clear that there are points of high intensity at around the 150,000 point and higher. There is also an immense peak at$0$ , suggesting that a large portion of the computer-recognized data were considered insignificant after human intervention.
Our next goal was to understand the meaning of the metric. Through personal communication, we established that unmasked
describes the number of voxels within the supervoxel that fell in an area that should be considered. That is, if unmasked
is zero, then the cell should be ignored entirely. We found that the values of all cells with an unmasked
value of
Our first forays into the data were to characterize the different columns of our raw csv. We began this process by determining that the
We first plotted our volume by binning nearest synapse-groups, and scaling by count (and by proxy, density). These plots are provided below (Figure 2).
Figure 2. 3D visualization of synapse density. In this figure it is clear that there are zones of high density (see the
$xy$ plot, third of the three subplots) and zones of low density. It is very clear from these alone that, qualitatively, synaptic distribution in cortex is non-uniform.
We then established other characterizations of our data: We calculated the mean and variance of the
After taking the unmasked value into account (dividing synapse count by unmasked) we plotted a histogram of the quantities of synapse densities. The resulting histogram looks somewhat like a gaussian distribution. The gaussian curve, however, does not quite fit. We next attempted to fit a skewed gaussian to the distribution using the lmfit library's SkewedGaussianModel. The skewed gaussian fits much better to the data with the model and statistics listed below.
ModelFit Statistics | ||||
---|---|---|---|---|
# function evals | = | 367 | ||
# data points | = | 60 | ||
# variables | = | 4 | ||
chi-square | = | 102305.249 | ||
reduced chi-square | = | 1826.879 | ||
Akaike info crit | = | 458.622 | ||
Bayesian info crit | = | 466.999 | ||
Variables | ||||
amplitude | 0.98654103 +/- 0.015933 (1.62%) (init= 10) | |||
sigma | 0.00064406 +/- 1.96e-05 (3.04%) (init= 1) | |||
center | 0.00157955 +/- 1.22e-05 (0.77%) (init= 0) | |||
gamma | -2.75584380 +/- 0.262922 (9.54%) (init= 0) | |||
Correlations (unreported correlations are <,0.100) | ||||
C(sigma, center) | = | 0.809, | ||
C(sigma, gamma) | = | -0.777 | ||
C(center, gamma) | = | -0.721 | ||
C(amplitude, sigma) | = | 0.449 | ||
C(amplitude, center) | = | 0.303 |
The following analysis is a visualization of each layer going through all three axes. This was to see if each layer in each direction could be fit well with a gaussian or skewed gaussian. In most cases this was possible. Below are 3 Gifs, one for each axis. Each frame is a different layer in that axis, depicting the distribution of synapse density in that layer with an attempt at fitting a gaussian and skewed gaussian
Breakdown of Distribution of Synapse Densities Through the Layers of the Y-axis
Breakdown of Distribution of Synapse Densities Through the Layers of the Z-axis
Breakdown of Distribution of Synapse Densities Through the Layers of the X-axis
We performed maz intensity projections down the x y and z axes. We can see a cluster in the lower left portion of the xy plane. While it isn't entirely evident here, the projection down the y axis became of most interest to us, as this orientation was in the direction of changes in cortical layers. One can see this in the first and third projections ( where y varies). See in the MIP down the X axis how there is an increase in density as you move from y=52 to y=0.
We also performed luster analysis on the max intensity projections through the z axis. This wasn't incredibly indicative of anything.
Figure: Above, the cluster analysis of max intensity projection through the z axis with y axis across the bottom and x axis on the right.
In order to formally reject the null hypothesis that synapses are distributed evenly through cortex, we performed a simple random sampling experiment that selected subvolumes from throughout the dataset and compared their summed synapse count. As is depicted clearly in Fig. 3,
Figure 3A. Above, the color of the circle marker is indicative of the synapse count in the data volume. The lighter circles are indicative of a higher synaptic density. As is obvious from the above diagram, synapses are not distributed evenly through cortex.Figure 3B. Here, a histogram of the above results clearly show that the distribution of synapses in randomly selected supervoxels are not equal. That is, areas of medium- to low-density (≈$1.7\times 10^5$) are far more common than areas of very low (1.5e5) or very high (2.0e5) counts. If synapses were equally distributed through cortex, this would not be the case, and this histogram's distribution would be far narrower.
Note: In the subsequent sections, "cortical depth" is used to refer to the distance from the pial surface, orthogonal to the pial surface.
When plotted, our volume had approximately ten voxels of margin ("edge-effect") data on each side, where synapse detection was either less effective, or signal was reduced (or perhaps artificial margins were added after processing the raw data). In our earlier calculations, we were able to ignore these margins in order to improve the fidelity of our calculations. But now matter how our data were generated, we could not establish the orientation of our volume in 3D space using these margins (clearly there would be no synapses exterior to the pial surface) or using volume shape (our data were not downsampled isotropically, so the shape of the resultant volume is not necessarily proportional to the original bock11
dataset).
In order to determine our volume's orientation in 3D cortical-space, we used a neuroscience prior that Layer I of V1 — the area of cortex from which we beleive this dataset to have been taken — has one of the highest cell densities in the mammalian brain, and furthermore, as a result, one of the highest synaptic densities.
Because the bock11
dataset was taken from at least layers I and II, possibly III, and potentially parts of IV, we know that there should be a clear demarcation between Layer I and Layer II, where synaptic density plummets as deeper cortex is reached.
To determine which of our cardinal axes were the axis representing cortical depth, we empirically tested all three axes. (For these investigations, see analyses here). The result was that the
Figure 4. In the image at left, the raw
bock11
dataset is shown. (The second layer is used for visualization here.) At right, the blue bars represent relative synaptic density as a binned histogram. The top third has the highest density overall, and we posit that it represents V1 Layer I. The local minima that surpass the threshold provided in the above iPython notebook are depicted as straight black lines in the image-data, where we claim the boundaries between cortical layers exist. Half-bar-width "error bars" exist above and below the layer boundary lines. 4B. The high density that we see at the top of thebock11:mp4merged
dataset makes the delineation between Layer I and Layer II very clear. Where large cell bodies start to appear, density decreases dramatically (as is shown in the top figure).
Figure 5 Fig 5. PCA on the 2D MIP. The first 5 Principal Components accounted for 71.8% of the variance, with the first PC explaining 54.6% of the variance.
Spatial autocorrelation measures the correlation of a variable with itself through space. This can be thought of as sliding the each plane over itself, looking for the correaltion of each point to each other point. From the graphs below we can infer that from any synapse there is a higher probability of finding another by moving in either the horizontal or vertical directions. This is inferred by the fact that the heatmap is most intense in the vertical and horizontal directions. However, this result is probably due to binning of the original data. In order to validate this result we must run the 2d autocorrelation on the original data. Until then this result in unsubstantiated.
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. Here we attempt to partition the data into clusters based on synapse density.
First, we needed to calculate the optimum number of clusters to run the k means algorithm. In figure 7 we have plotted the BIC (Bayesian information criterion) vs number of clusters. We plotted all the way to 26 clusters. Using the elbow method we were able to determine that the optimum number of clusters to run the k means algorithm is 7.
In figure 8 we have another Gif, where each frame depicts each plotted cluster group with a different color. From this visualization we can see the different qualities of the clusters. Some are very dense and grouped to a certain area in the volume, whereas several clusters are sparse.
Through our analyses mentioned above and available in our research code repository, available here, we were able to draw meaningful conclusions about the nature of the synaptic density in mouse V1 cortical tissue. Among these conclusions, we verified that synapse density in Layer I of mammalian V1 cortex is higher than that in neighboring tissue, and after passing through Layer I, synapse occurence declines as deeper sections of brain are reached.
While density variance down the y axes seems to show layers, the distinction between cortical layers is more easily seen visually. There might be other metrics besides just density histogram. It might be worth further exploration (in progress by David) to find other features that vary along y. Namely, constructing graphs and measuring properties of those graph might highlight new variance in the data that can separate layers more clearly than finding minima on the histogram.
It is worth considering the effects of the computer-vision error rate on our data. Though we performed many of these analyses under the assumption that our data were as pure as possible, close inspection of the mp4merged
dataset seems to indicate that there was a large error component to the 'recognized' synapses (that is, tissue that was inaccurately called a synapse when it was in fact mitochondrial boundaries, cell membrane, etc). With this in mind, it is important to acknowledge that our conclusions based exclusively on synapse density may be in some way flawed.
If this is the case, then it is important that we try to find a characterization of this noise factor: Is our analysis of the boundary of layers I and II flawed because noise is proportional to actual synaptic density? Or is synaptic density indeed in accordance with some distribution — perhaps even uniform — and our noise is non-uniform? This inspection must be carried out at the imagery-level or computer-vision level, as this information cannot be derived from our dataset alone.
We would also like to run these same analyses again on the pure synapse count data, rather than this derived value. This would not only allow us to draw more valuable insight into the actual, granular location of synapses in 3D space, but it would also give us some strategy for determining a "background noise" threshold.
- Bock et al. Nature (2011) ↩
- The closest power of two (without going over) is
$2^{18}=262,144$ which is wildly higher than the$165,789$ value. ↩