# neuro boot camp day 1
## wrangling ephys data

## contents
* [1. types of data neuroscientists acquire](#data)
* [2. voltage trace as a vector (everything is a vector)](#vect)
* [3. peak detection (from scratch)](#peak)
* [4. simple spike sorting](#sort)
* [5. filtering: low, high pass; notch; baseline](#filt)
* [6. spike trains, raster plots](#plot)



## 0. preliminaries

(FYI: if you are an advanced student and you breeze through these exercises, I would point you at [Neural Data Science](https://www.sciencedirect.com/book/9780128040430/neural-data-science) by Nylen and Wallisch.  You can push yourself to work through some of the more advanced examples there.  Everyone else may find it a useful set of prompts to consider down the road.)
<br><br>
<span style="color:dodgerblue">Commentary added after the workshop appears in blue.</span>

In [1]:
import numpy as np
import matplotlib
%matplotlib tk

import matplotlib.pyplot as pl

import scipy.io

<a id="data"></a>
## 1. types of data neuroscientists acquire

**Exercise 1**: Load dataset.npy, and tell me what's in it.
<br><br>
<font color="dodgerblue">The purpose of this little puzzle was to demonstrate that it can be tough for the untrained eye to tell the difference between neuroscience data from different sources.  Ephys traces are vectors, but so are fMRI voxels over time, calcium imaging ROIs, etc... The idea I was trying to sell is that, since all sorts of data come in this way, your approach to analysis is generalizable at the early stages.</font>

In [2]:
# this was a suggestion for how to look at our dataset

dataset = np.load('dataset.npy')

fig1, ax1 = pl.subplots(num=1) # If you are explicit about giving each figure a different handle,
                               # you can use that handle to edit the figure/axes later.  Passing
                               # num sets the figure number, which also shows up in the window title.
ax1.plot(dataset)

# but it isn't plotting the way we expect:
# we want our 1600 samples on the x axis

print(np.shape(dataset))

(8, 1600)
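
Why the first attempt looks odd: when `plot` is handed a 2-D array, matplotlib draws one line per *column*, so an (8, 1600) array turns into 1600 short lines of 8 points each. Here is a minimal sketch of that behavior using a stand-in random array (`fake_dataset` is hypothetical, not the workshop file) and the non-interactive Agg backend, chosen only so the snippet runs without a display:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # assumption: headless backend, just for this sketch
import matplotlib.pyplot as pl

rng = np.random.default_rng(0)
fake_dataset = rng.standard_normal((8, 1600))   # same shape as dataset.npy

# as-is: one line per column -> 1600 lines, each 8 points long
fig_a, ax_a = pl.subplots()
lines_as_is = ax_a.plot(fake_dataset)

# transposed: one line per trace -> 8 lines, each 1600 points long
fig_b, ax_b = pl.subplots()
lines_transposed = ax_b.plot(fake_dataset.T)

print(len(lines_as_is), len(lines_transposed))
```

Transposing is one alternative to looping over the rows, which is the route taken in the next cell.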


In [3]:
# this one successfully plots the data

dataset = np.load('dataset.npy')

print(dataset)

fig, ax = pl.subplots()

for x in dataset[:,0::]:
    ax.plot(x)

[[ 3.72529030e+01  3.72529030e+01  9.31322575e+00 ...  8.38190308e+01
   9.31322632e+01  1.11758713e+02]
 [ 1.31100680e+02  1.21829932e+02  1.12443537e+02 ...  1.20889796e+02
   1.09044898e+02  1.20687075e+02]
 [ 7.01904249e+00  7.93456984e+00  6.10351515e+00 ...  1.37329092e+01
   1.40380850e+01  2.74658179e+00]
 ...
 [ 2.29492188e+00  2.83203125e+00 -1.26953125e+00 ...  1.16699219e+01
   1.13281250e+01  6.39648438e+00]
 [ 1.07969521e+04  1.11958330e+04  1.08577852e+04 ...             nan
              nan             nan]
 [ 1.00722529e+04  9.98243359e+03  1.03290059e+04 ...             nan
              nan             nan]]


In [4]:
# the above example was explicit about indexing dataset for our loop.
# we could have done it implicitly:

fig, ax = pl.subplots()

for x in dataset:
    ax.plot(x)

In [5]:
# It's kind of tough to see the individual plots here.  One solution is to plot them one at 
# a time.  Here's one plot, that Carlos thought was an ephys trace.  (He was right!)

fig, ax = pl.subplots()
ax.plot(dataset[2])

[<matplotlib.lines.Line2D at 0xb1e07f128>]

<font color=dodgerblue>I wanted to put all the traces in one plot that accomplished two things: the traces would be scaled so that you could pick out individual features, and the traces would be separated in vertical space so you could better see each trace.
<br><br>
    To scale the traces, I chose to <b>z-score</b> each one, which re-expresses the y values in units of standard deviations from the mean.  To formalize it, the z-score is defined as 
    </font><center>$\Large z = \frac{\color{red} x - \color{blue} \mu}{\color{violet} \sigma}$</center>

where $\color{red} x$ is our sample value, $\color{blue} \mu$ is the population mean, and $\color{violet} \sigma$ is the population standard deviation.  Ben will discuss z-scoring more in the statistics module.
<br><br><font color=dodgerblue> Z-scoring was one possible approach.  I could instead have chosen (e.g.) to scale each trace so that its min and max values were mapped to -1 and +1.
    <br><br>
    To visually separate our z-scored traces (where each is now centered on zero), I added an offset based on the index <i>i</i> as I went through the loop that did the plotting.</font>
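
The two scaling options described above can be sketched as NaN-aware helper functions. This is a minimal illustration on a small made-up array; `zscore_rows`, `minmax_rows`, and `demo` are hypothetical names, not part of the workshop code.

```python
import numpy as np

def zscore_rows(traces):
    """Return a z-scored copy of a 2-D array (one trace per row), ignoring NaNs."""
    traces = np.asarray(traces, dtype=float)
    mu = np.nanmean(traces, axis=1, keepdims=True)
    sigma = np.nanstd(traces, axis=1, keepdims=True)
    return (traces - mu) / sigma          # builds a new array; the input is untouched

def minmax_rows(traces, lo=-1.0, hi=1.0):
    """Return a copy where each row's min maps to lo and its max maps to hi."""
    traces = np.asarray(traces, dtype=float)
    tmin = np.nanmin(traces, axis=1, keepdims=True)
    tmax = np.nanmax(traces, axis=1, keepdims=True)
    return lo + (traces - tmin) * (hi - lo) / (tmax - tmin)

# made-up data: two short "traces", the first with a trailing NaN
demo = np.array([[1.0, 2.0, 3.0, np.nan],
                 [10.0, 30.0, 20.0, 40.0]])

z_demo = zscore_rows(demo)
mm_demo = minmax_rows(demo)
print(z_demo)
print(mm_demo)
```

Either result can be offset by `i*3` (or any other spacing) for plotting, exactly as in the loop below.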

In [6]:
# goal: scale to comparable y ranges.  method: z score

z_data = dataset.copy() # initializing z_data as a copy (plain assignment would let the loop overwrite dataset)

fig, ax = pl.subplots()

for i in range(len(dataset)):
    z_data[i] = (z_data[i] - np.nanmean(z_data[i])) / np.nanstd(z_data[i])
    
# I had originally written the above line with np.mean() and np.std() but this failed to
# give any numbers for my last two traces (which have NaNs inside) 
    
print(z_data)

# the function enumerate allows us to use a for loop where we have both an index(i) and
# the values of the list or array being iterated (here x).  If you find using enumerate 
# confusing, I would encourage you to write your for loops as I did above:
# for i in range(len(...)):

# I also strongly suggest that you ALWAYS use `i` to refer to an index, not the iterated value

for i,x in enumerate(z_data):
    ax.plot(x+i*3, lw=1)

[[ 0.28727752  0.28727752  0.09853784 ...  0.60184365  0.66475692
   0.79058336]
 [-0.04173179 -0.32455262 -0.61090144 ... -0.35323316 -0.71458304
  -0.35941753]
 [-0.16231574 -0.12129023 -0.20334126 ...  0.13853802  0.1522132
  -0.35376814]
 ...
 [ 0.19735814  0.23970503 -0.08367119 ...  0.93650379  0.90955577
   0.52073436]
 [-1.02605356  0.7207951  -0.75964304 ...         nan         nan
          nan]
 [-0.37808492 -0.82812243  0.90836914 ...         nan         nan
          nan]]


In [7]:
# enumerate command above is equivalent to:

fig, ax = pl.subplots()

i=0
for x in z_data:
    ax.plot(x+i*3)
    i = i + 1

In [8]:
# and also (I'd encourage you to start out writing your loops this way):

fig, ax = pl.subplots()

for i in range(len(z_data)):     # note I did not need to initialize (or increment) i this time.
    ax.plot(z_data[i] + i*3)

<span style="color:dodgerblue">For your amusement/enrichment, these were the sources of the traces: </span>
    

| source   | y value/units   |  sample interval |
|:-:|:-:|:-:|
|0: drosophila courtship song| digitized speaker signal (µV)|  100 µsec |
|1: calcium imaging ROI| raw fluorescence in arbitrary units| 40 msec |
|2: crayfish extracellular nerve recording|voltage (µV)| 100 µsec|
|3: intracellular Aplysia neuron recording|membrane potential (mV)|100 µsec|
|4: calcium imaging ROI|raw fluorescence (arbitrary units)|40 msec|
|5: [EEG data](http://bnci-horizon-2020.eu/database/data-sets)|µV| 4 msec|
|6: fMRI voxels|BOLD signal (arbitrary units)| 1.5 seconds|
|7: fMRI voxels|BOLD signal|1.5 seconds|

<a id="vect"></a>
## 2. voltage trace as a vector (everything is a vector)

<span style="color:dodgerblue">I didn't elaborate any examples here, but I restated the punchline I was setting up in part 1.</span>

<a id="peak"></a>
## 3. peak detection (from scratch)

Python has some libraries that allow for peak detection, but I think doing some manual peak finding is a useful way to hone your python skills as well as think about ephys traces.

For the early parts of this exercise, we'll start with a simple sine wave.  But even this step requires a little bit of thought.  We're not going to find peaks on an abstract or analog sine wave, but rather on one that is explicitly sampled over time.  Play with <span style="color:dodgerblue"> different sampling frequencies (**Fs**) / sample intervals</span> and sine wave frequencies (**f**) to see if you can build some intuition about how *Fs* needs to relate to *f* in order to be able to pick out individual peaks in our sine wave.  <span style="color:dodgerblue">This is an exploration of the **Nyquist limit** (mentioned by Sue Ann in this morning's lecture).</span>
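
One way to check your intuition numerically: sample an 8 Hz sine at different rates and ask which frequency dominates the sampled signal. The sketch below assumes a 1 s recording and uses the FFT to read off the strongest frequency; `dominant_frequency` is a hypothetical helper, not something we wrote in the workshop.

```python
import numpy as np

def dominant_frequency(f, Fs, total_time=1.0):
    """Sample a sine of frequency f (Hz) at rate Fs (Hz); return the strongest frequency seen."""
    n = int(round(total_time * Fs))             # number of samples
    t = np.arange(n) / Fs                       # sample times in seconds
    x = np.sin(2 * np.pi * f * t)
    spectrum = np.abs(np.fft.rfft(x))           # magnitude spectrum of the sampled signal
    freqs = np.fft.rfftfreq(n, d=1.0 / Fs)      # frequency (Hz) of each spectral bin
    return freqs[np.argmax(spectrum)]

well_sampled = dominant_frequency(f=8, Fs=200)  # Fs is far above 2*f
undersampled = dominant_frequency(f=8, Fs=10)   # Fs is below 2*f
print(well_sampled, undersampled)
```

With Fs = 200 Hz the sampled signal still looks like 8 Hz. With Fs = 10 Hz (below 2 × 8 Hz) the same sine shows up as a 2 Hz wave, so any peaks you detect would belong to a signal that was never there.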

**Exercise 3:** write a function that will find the local maxima in a sine wave.

In [9]:
# interlude
#
# what should the input sinewave look like??

# we did this definition with a sample interval -- 
# we could have equivalently used a sampling *frequency* instead.

# I've added in an option here to pass a reference to an axis to plot onto.
#    (I'll use it below when I do several plots)
#
# I also added the option to suppress plotting.
#
# Finally, I believe this is also the first you're seeing inline if statements.
# Recall my discussion of my foible of wanting concise code...

def give_me_a_sine(freq, si=.001, total_time=1, ax=None, plot=True):    
    t = np.arange(0,total_time,si)
    x = np.sin(2*np.pi*freq*t)
    if plot and ax is None: fig,ax=pl.subplots() # only open a new figure if we're plotting and no axis was passed
    if plot: ax.plot(t,x,'.-',lw=1) 
    return x,t # originally we only returned x, but I think you will see t is useful as well.

In [10]:
sinewave,_ = give_me_a_sine(8, 0.001)  # if we don't want `t` right now, no problem!

In [11]:
# look at the effect of different sampling rates on our generated sine curve

fig_sines, ax_sines = pl.subplots()

for interval in [0.005, 0.05, 0.1]:
    give_me_a_sine(freq=8, si=interval, total_time=1, ax=ax_sines)

<span style="color:dodgerblue">Next, before I cut you loose on developing this algorithm, we talked through the approach at the board.  There are quite a few ways to approach this problem.  </span>
    
<span style="color:dodgerblue">1. One of the first suggestions was to detect all local maxima -- loop over all values in your data array, and annotate it as a peak if it is larger than the surrounding two.  This will certainly work for our sine wave, but it isn't immediately generalizable to recorded data because such data are seldom smooth -- let's look a bit more closely at one of our ephys traces from above:</span>

In [12]:
fig, ax = pl.subplots()
ax.plot(dataset[2,1140:1400])

[<matplotlib.lines.Line2D at 0xb21d6dd30>]

<span style="color:dodgerblue"> For that approach to work on this trace, you'd need to either pre-process the trace (e.g., smooth the curve) or post-process the detected peaks.</span>

<span style="color:dodgerblue">2. A second approach that we didn't get into is to take the derivative of the curve and look for the points where it crosses zero from positive to negative (local maxima).  This approach would have similar constraints to approach \#1.</span>
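
For comparison, approaches 1 and 2 can each be sketched in a few lines of NumPy. The noisy trace here is a made-up stand-in for recorded data (a sine plus random noise), and `local_maxima` and `derivative_maxima` are hypothetical names.

```python
import numpy as np

def local_maxima(x):
    """Approach 1: indices of samples strictly larger than both neighbours."""
    x = np.asarray(x)
    is_peak = (x[1:-1] > x[:-2]) & (x[1:-1] > x[2:])
    return np.flatnonzero(is_peak) + 1    # +1 because the comparison starts at sample 1

def derivative_maxima(x):
    """Approach 2: indices where the first difference flips from positive to negative."""
    slope_sign = np.sign(np.diff(x))
    return np.flatnonzero(np.diff(slope_sign) < 0) + 1

t = np.arange(250) * 0.004                          # 1 s sampled every 4 ms
clean = np.sin(2 * np.pi * 8 * t)                   # 8 Hz sine
rng = np.random.default_rng(0)
noisy = clean + 0.2 * rng.standard_normal(len(t))   # same sine plus made-up noise

clean_peaks = local_maxima(clean)
noisy_peaks = local_maxima(noisy)
print(len(clean_peaks), len(noisy_peaks))
```

On the clean sine both functions pick out the eight true peaks. On the noisy version they flag many more, which is the constraint described above: every little wiggle counts as a peak unless you smooth first or filter the detections afterwards.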

<span style="color:dodgerblue">3. The approach we settled on implementing was to use a simple threshold to define our peaks.  If we define a threshold, then we can find a _global_ maximum in each segment we isolate.</span>
    
<span style="color:dodgerblue">To break this into manageable chunks, our steps are:</span>
* define a threshold (black line)
* iterating over values in data trace (blue dots/line):
    * find (and record) the places where the trace crosses the threshold (red circles)
* between each set of threshold traversals, find the maximum value
![alt_text](peak_detect.png)

(This figure is generated below (with the addition of the peaks detected).)

<span style="color:dodgerblue">I will implement this with two approaches: loop based and vectorized.</span>
    

In [13]:
fig_sines, ax_sines = pl.subplots()
sinewave, sine_times = give_me_a_sine(8, 0.004, ax=ax_sines)

In [14]:
# loop method:  iterate through all values; make notes on threshold crossings; find maxima

def find_peaks_loop(sinewave, sine_times, thresh=None, plot=False):
    
    # 1: if no threshold is specified, set a default of 3 SDs above the mean of our trace.
    if thresh is None:
        thresh = np.mean(sinewave) + 3 * np.std(sinewave)
    
    # 2: find threshold crossings
    up_bounds = []
    down_bounds = []
    peaks = []
    
    for i in range(len(sinewave)-1):  # why minus 1?  What happens if you don't have that?
        if sinewave[i]<thresh and sinewave[i+1]>thresh:
            up_bounds.append(i)
        if sinewave[i]>thresh and sinewave[i+1]<thresh:
            down_bounds.append(i+1)
        
    # 3: find maximum between threshold crossings
    for i in range(len(up_bounds)):   # what if there were none?  should we build in an error?
        chunk = sinewave[up_bounds[i]:down_bounds[i]]
        peaks.append(chunk.argmax()+up_bounds[i]) #add up_bounds[i] because this is in coord sys of chunk
        
    # 4: if not suppressing, plot the bounds and the peaks
    if plot:
        fig_sines, ax_sines = pl.subplots()
        ax_sines.hlines(y=thresh,xmin=sine_times[0],xmax=sine_times[-1])  # span the whole trace, not just 0-1 s
        ax_sines.plot(sine_times, sinewave, 'o-', color="cornflowerblue",  linewidth=1, markersize=2)
        ax_sines.plot(sine_times[up_bounds],sinewave[up_bounds], 
              'o', markeredgecolor="red", markerfacecolor="None", markersize=8, label="up_bounds")
        ax_sines.plot(sine_times[down_bounds],sinewave[down_bounds], 
              'o', markeredgecolor="maroon", markerfacecolor="None", markersize=8, label="down_bounds")
        ax_sines.plot(sine_times[peaks],sinewave[peaks], 
              'o', markeredgecolor="black", markerfacecolor="None", markersize=8, label="peaks")
        ax_sines.legend(loc=(0.7,1.05))
        pl.tight_layout()
        
    return peaks  # what are we returning here?  peak indices? peak heights?

find_peaks_loop(sinewave, sine_times, 0.5, True)

[8, 39, 70, 102, 133, 164, 195, 227]

<ul><li> <span style="color:dodgerblue">Zoom in on the plot to evaluate whether we've captured the peaks correctly. (good to get in this habit -- even at this low sampling rate there are a lot of data points squeezed inside that circle!)</span></li>

<li> <span style="color:dodgerblue">In part 3, I could have gone through with a loop instead of using argmax(), but then: how do I deal with the possibility that two successive points are equal?  (This does *not* imply a peak!)  You'd have to do some kind of recursion or record keeping of sequences like that.</span></li>

<li> <span style="color:dodgerblue">Is there a circumstance where I would observe two peaks without dropping to the threshold between them?  (Not in this sinewave, but maybe in data we care about.)</span></li></ul>

In [15]:
# vectorized method: use boolean indexing to pull out values above threshold, np.diff() to
# find discontinuities between the indexes of these chunks, and then find peaks in the chunks.

def find_peaks_vect(sinewave,sine_times,thresh=None, plot=False):

    peaks = []
    
    # 1: if no threshold is specified, set a default of 3 SDs above the mean of our trace.
    if thresh is None:
        thresh = np.mean(sinewave) + 3 * np.std(sinewave)
        
    # 2: find threshold crossings
    
    # note: here we are only finding one boundary of the thresh traversal.
    # we'll still find the max if we go from one bound to the next.

    greater_than_threshold = sinewave > thresh

    above_thresh_indices, = np.where(greater_than_threshold)  
    # comma here is because np.where returns a tuple. equivalent:
    # above_thresh_indices = np.where(greater_than_threshold)[0]
    
    discontinuities = np.diff(above_thresh_indices)

    discont_indices = above_thresh_indices[np.where(discontinuities!=1)[0]]
    
    discont_indices = np.concatenate([[0], discont_indices, [above_thresh_indices[-1]]])
        # we need to add 0 and the last value to capture all of the chunks.
    
    
    # 3: find peaks

    #we tried but were unable to come up with a straightforward way to do a fully vectorized implementation
    
    for i in range(len(discont_indices)-1):
        chunk = sinewave[discont_indices[i]:discont_indices[i+1]]
        peaks.append(chunk.argmax()+discont_indices[i]) 
    
        
    # 4: if not suppressing, plot the bounds and the peaks
    if plot:
        fig_sines, ax_sines = pl.subplots()

        ax_sines.plot(sine_times[above_thresh_indices],sinewave[above_thresh_indices], 
              'o', color="yellow", markersize=8, label='above_thresh')  #we want this one in the back
        
        ax_sines.hlines(y=thresh,xmin=sine_times[0],xmax=sine_times[-1])  # span the whole trace
        ax_sines.plot(sine_times, sinewave, 'o-', color="steelblue", lw=1, markersize=3)

        ax_sines.plot(sine_times[discont_indices],sinewave[discont_indices], 
              'o', markeredgecolor="red", markerfacecolor="None", markersize=8, label='discontinuities')
        ax_sines.plot(sine_times[peaks],sinewave[peaks], 
              'o', markeredgecolor="black", markerfacecolor="None", markersize=8, label='peaks')
        ax_sines.legend(loc=(0.7,1.05))
        pl.tight_layout()

    return peaks
        
find_peaks_vect(sinewave, sine_times, 0.5, True)

[8, 39, 70, 102, 133, 164, 195, 227]
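
One possible way around the boundary bookkeeping (a sketch, not the version we wrote in class): pad the boolean mask with `False` on both ends and use `np.diff` to find where each above-threshold run starts and stops. Runs that touch the very start or end of the trace are then handled automatically. The peak inside each run is still found with a short comprehension rather than a fully vectorized call. `threshold_segments` and `find_peaks_segments` are hypothetical names.

```python
import numpy as np

def threshold_segments(x, thresh):
    """Return (starts, stops) such that x[starts[k]:stops[k]] is the k-th run above thresh."""
    above = np.asarray(x) > thresh
    padded = np.concatenate([[False], above, [False]])  # guarantees every run has both edges
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)    # first sample above threshold in each run
    stops = np.flatnonzero(edges == -1)    # one past the last sample above threshold
    return starts, stops

def find_peaks_segments(x, thresh):
    """Return the index of the maximum within each above-threshold run."""
    x = np.asarray(x)
    starts, stops = threshold_segments(x, thresh)
    return [int(a + np.argmax(x[a:b])) for a, b in zip(starts, stops)]

t = np.arange(250) * 0.004
sine = np.sin(2 * np.pi * 8 * t)
sine_peaks = find_peaks_segments(sine, 0.5)

# a toy trace that ends above threshold, which would trip up unpaired boundaries
edge_peaks = find_peaks_segments([0, 1, 2, 1, 0, 3, 4], 0.5)
print(sine_peaks, edge_peaks)
```

Because each run is bounded on both sides, the maximum is only ever searched for among samples that are actually above threshold.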

In [16]:
# third implementation: this was the one that I deployed quickly when the peakutils package
# choked on our actual neural data.  It is adapted from code written by Ben.

def find_peaks_ben(data, times, thresh=None, plot=False, direction=1):
    
    #new argument: `direction` = 1 for upward going; -1 for downward going
    
    from scipy.ndimage import label
    
    peaks = []
    chunks = []
    
    # 1: if no threshold is specified, set a default of 3 SDs above the mean of our trace
    #    (the trace is flipped first when direction=-1, to match the comparison below).
    if thresh is None:
        thresh = (direction * data).mean() + data.std() * 3
        
    # 2: find the peaks
    
    above_thresh = direction * data > thresh

    # split the above_thresh signal into connected segments
    # this is a new function -- if you're curious about what it returns, print it!
    labs,nlabs = label(above_thresh)
    
    for chunk in range(1,nlabs+1):        # iterate through each segment          
        idxs = np.where(labs==chunk)[0]   # find the indices of this segment            
        vals = direction * data[idxs]     # extract the (sign-corrected) signal values at these idxs
        peak_idx = idxs[np.argmax(vals)]  # select the index of peak signal value    
        peaks.append(peak_idx)
        chunks.append(idxs)

    # 3: if not suppressing, plot the bounds and the peaks
    if plot:
        fig_sines, ax_sines = pl.subplots()

        ax_sines.plot(times, data, 'o-', color="steelblue", lw=1, markersize=3)  # plot the data passed in, not the global sinewave
        ax_sines.hlines(y=thresh,xmin=times[0],xmax=times[-1])

        cplot=[]
        for chunk in chunks:
            handle, = ax_sines.plot(times[chunk], data[chunk], "o-", lw=2, markersize=4)
            cplot.append(handle)
            
        pplot, = ax_sines.plot(times[peaks],data[peaks], 
              'o', markeredgecolor="black", markerfacecolor="None", markersize=8)
          
        ax_sines.legend([cplot[0],pplot],["chunks","peaks"],loc=(0.7,1.05))
        pl.tight_layout()

    return peaks

find_peaks_ben(sinewave, sine_times, 0.5, plot=True)

[8, 39, 70, 102, 133, 164, 195, 227]

<span style="color:dodgerblue">The above three functions were all implementations of a threshold-based peak detection.  I mentioned a few other approaches above, and each of those can be expanded by consideration of other parameters, like the allowable distance between peaks, prominence, ... For reference, see the [matlab findpeaks documentation](http://www.mathworks.com/help/signal/ref/findpeaks.html) and their [peak finding tutorial](https://www.mathworks.com/help/signal/examples/peak-analysis.html).</span>
![alt text](http://www.mathworks.com/help/examples/signal/win64/DeterminePeakWidthsExample_02.png "peak features")
    
<span style="color:dodgerblue">Next I proposed to extend our use of this function to actual ephys data.  Will these functions work on one of the traces in _dataset.npy_?  What additional concerns crop up?</span>

<span style="color:dodgerblue">To expedite matters, I attempted to do this using the [peakutils](https://bitbucket.org/lucashnegri/peakutils) package, but this crashed when I tried to feed it actual ephys data.  Here are a few more resources in that direction:</span>
* a [blog post](https://blog.ytotech.com/2015/11/01/findpeaks-in-python/) discussing various packages containing a peak finding function;
* a [jupyter notebook](http://nbviewer.jupyter.org/github/demotu/BMC/blob/master/notebooks/DetectPeaks.ipynb) that was the basis for the peakutils package.  The peak detection function is listed inside the notebook if you want to see how the approach compares to what we came up with.
* there's also a [built-in scipy function](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.signal.find_peaks_cwt.html) based on wavelets, a different approach that lacks the features we were seeking to build out from the matlab model.
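
SciPy 1.1.0 and later also include `scipy.signal.find_peaks`, which finds local maxima and can filter them by `height`, `distance`, and `prominence`, much like the MATLAB function. Here is a minimal sketch on the same kind of sine wave we used above; the parameter values are illustrative assumptions and are not tuned for ephys data.

```python
import numpy as np
from scipy.signal import find_peaks

t = np.arange(250) * 0.004                 # 1 s sampled every 4 ms
sinewave = np.sin(2 * np.pi * 8 * t)       # 8 Hz sine

# keep only local maxima that reach at least 0.5
peak_idxs, props = find_peaks(sinewave, height=0.5)

# additionally require at least 40 samples between neighbouring peaks
sparse_idxs, _ = find_peaks(sinewave, height=0.5, distance=40)

print(peak_idxs)
print(props["peak_heights"])
print(sparse_idxs)
```

The first call returns peak indices plus a dictionary of properties (here the peak heights). The second shows how a minimum spacing thins out the detections, which is one way to reject double-counted spikes.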

<span style="color:dodgerblue">To load in ephys data from its native binary file, I installed the package [neo](https://neo.readthedocs.io/en/0.6/), by typing <code>pip install neo</code> from a terminal window.  Neo has functions well beyond file IO if you are interested.</span>

In [17]:
import neo



In [18]:
# load data using package: neo

r = neo.AxonIO(filename='sample3.abf') # this block is just stepping into the data structure
block = r.read_block()
segment = block.segments[0]
data = segment.analogsignals[0]
# There are actually two channels in this data. Can you pull out the second one?

# `data` will be most useful to us as a 1-d array.
data = np.asarray(data).flatten()

# generate time array.  this can alternatively be pulled out of the neo data struct.
# 1e-4 is our sample interval expressed in seconds.
times = np.arange(len(data))*1e-4

In [19]:
#run our collection of peak finders

loop_peaks = find_peaks_loop(data, times)
vect_peaks = find_peaks_vect(data, times)
ben_peaks = find_peaks_ben(data, times)

In [20]:
#let's take a look.

ephys_plot, ephys_ax = pl.subplots()
ephys_ax.plot(times, data, '-', color="cornflowerblue",  lw=1, markersize=2)
ephys_ax.plot(times[loop_peaks], data[loop_peaks], 
              'o', markeredgecolor="black", markerfacecolor="None", markersize=8, label='loop')
ephys_ax.plot(times[vect_peaks], data[vect_peaks], 
              'd', markeredgecolor="chartreuse", markerfacecolor="None", markersize=8, label='vect')
ephys_ax.plot(times[ben_peaks], data[ben_peaks], 
              's', markeredgecolor="magenta", markerfacecolor="None", markersize=8, label='ben')
ephys_ax.legend(loc=(0.7,1.05))
pl.tight_layout()

<span style="color:dodgerblue">**How do our functions do with real data?**  Zoom in and take a close look.  The vector-based peak detection seems to be detecting extra peaks.  Can you figure out why?  (Hint/aside: I built out the remainder of your solutions using the built-in plot functions to have a visual record of what was being detected at each step, like a print debug statement.  You can likewise use these debugging plots to help troubleshoot here.)  Did we choose a good default threshold?</span>

<span style="color:magenta">**Disclaimer!** This code, as implemented and called, is not fully up to the task of handling (even our sample) ephys data.  Part of your pipeline for any analysis would be to tune your functions to handle the particular data you recorded.</span>

<span style="color:dodgerblue">Next I ran through some examples of what you can start to do with your detected peaks.</span>

**Exercise:** show me the average waveform for one class of detected spikes.

In [21]:
# step 0: let's pick one set of peaks to use for further analysis

peak_idxs = loop_peaks
peak_heights = data[peak_idxs]

In [22]:
# step 1: plot a histogram

fig_hist, ax_hist = pl.subplots()
ax_hist.hist(peak_heights)

(array([ 85.,  68., 458.,   3.,   8., 255.,  26.,   0.,   0.,   1.]),
 array([ 69.58007812, 101.53197937, 133.48388062, 165.43578186,
        197.38768311, 229.33958435, 261.2914856 , 293.24338684,
        325.19528809, 357.14718933, 389.09909058]),
 <a list of 10 Patch objects>)

In [23]:
# step 2: pull out 1 discrete class of AP that you like from your histogram
# (make a list of the peaks from that class)

peak_idxs_ary = np.array(peak_idxs)
peak_heights_ary = np.array(peak_heights).flatten()


selected_peak_indices = peak_idxs_ary[peak_heights_ary>200]
ax_hist.hist(data[selected_peak_indices])
#print(selected_peak_indices)

(array([ 68., 178.,  42.,   1.,   0.,   0.,   0.,   0.,   0.,   1.]),
 array([223.6938324 , 240.23435822, 256.77488403, 273.31540985,
        289.85593567, 306.39646149, 322.9369873 , 339.47751312,
        356.01803894, 372.55856476, 389.09909058]),
 <a list of 10 Patch objects>)

In [24]:
# step 3: take a window surrounding each detected peak and store it in a list
# step 4: plot the average waveform of this class of peaks

#data=data.flatten()
waveforms = []
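for peak in selected_peak_indices:
    # keep only peaks whose full window (20 samples before, 30 after) fits inside the recording
    if peak >= 20 and peak + 30 <= len(data):
        waveforms.append(data[peak-20:peak+30])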
    
waveform_ary = np.array(waveforms)

fig_wave, ax_wave = pl.subplots()
ax_wave.plot(waveform_ary.mean(axis=0))

[<matplotlib.lines.Line2D at 0x1a2c67eef0>]

In [25]:
# step 5: plot a measure of spread around the mean waveform (min/max, SEM, or std dev; here std dev)
ax_wave.fill_between(range(waveform_ary.shape[1]),
                  waveform_ary.mean(axis=0)-waveform_ary.std(axis=0),
                  waveform_ary.mean(axis=0)+waveform_ary.std(axis=0),
                  color='teal', alpha=0.2)

<matplotlib.collections.PolyCollection at 0x1a2b91a358>
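
The other kinds of band mentioned in step 5 can all be computed from the same stacked waveform array. Here is a small sketch on a made-up array; `waveform_stats` is a hypothetical helper, and it uses the sample standard deviation (`ddof=1`), which is an assumption and differs slightly from the plain `std()` call above.

```python
import numpy as np

def waveform_stats(waveform_ary):
    """Per-sample mean, SD, SEM, min and max across waveforms (one waveform per row)."""
    w = np.asarray(waveform_ary, dtype=float)
    n = w.shape[0]                          # number of waveforms
    mean = w.mean(axis=0)
    sd = w.std(axis=0, ddof=1)              # sample standard deviation
    return {"mean": mean,
            "sd": sd,
            "sem": sd / np.sqrt(n),         # standard error of the mean
            "min": w.min(axis=0),
            "max": w.max(axis=0)}

# made-up example: two 3-sample "waveforms"
demo_waveforms = np.array([[1.0, 2.0, 3.0],
                           [3.0, 4.0, 5.0]])
stats = waveform_stats(demo_waveforms)
print(stats["mean"], stats["sem"])
```

To draw an SEM band instead of an SD band, pass `stats["mean"] - stats["sem"]` and `stats["mean"] + stats["sem"]` to `fill_between`.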

<a id="sort"></a>
## 4. simple spike sorting

<span style="color:dodgerblue">Steps 1 and 2 illustrate a rudimentary form of spike sorting.  Illustrated below is a slightly more systematic approach, using k-means clustering to separate spikes based on amplitude.  This has a number of limitations, including that it will dutifully generate as many clusters as we ask for (which we may not have an *a priori* estimate of) and that it will generate clusters with as much separation as possible (not necessarily a safe assumption that your spike classes should be separable in this way).</span>
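
To see the "dutifully generates as many clusters as we ask for" behavior without any real data, here is a bare-bones 1-D k-means (Lloyd's algorithm) in NumPy, run on made-up amplitudes drawn from two groups. It is a teaching sketch under those assumptions, not a replacement for scikit-learn's `KMeans`; the quantile-based initialization and the name `kmeans_1d` are illustrative choices.

```python
import numpy as np

def kmeans_1d(values, n_clusters, n_iter=50):
    """Cluster 1-D values into n_clusters groups; return (labels, centers)."""
    values = np.asarray(values, dtype=float)
    # illustrative, deterministic start: spread the centers over the data quantiles
    centers = np.quantile(values, (np.arange(n_clusters) + 0.5) / n_clusters)
    for _ in range(n_iter):
        # assign every value to its nearest center
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # move each center to the mean of its members (leave empty clusters where they are)
        new_centers = centers.copy()
        for k in range(n_clusters):
            if np.any(labels == k):
                new_centers[k] = values[labels == k].mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    return labels, centers

# made-up spike amplitudes: 200 "small" spikes and 100 "large" spikes
rng = np.random.default_rng(1)
amplitudes = np.concatenate([rng.normal(100, 5, 200), rng.normal(250, 5, 100)])

labels2, centers2 = kmeans_1d(amplitudes, 2)   # matches the true number of groups
labels5, centers5 = kmeans_1d(amplitudes, 5)   # asks for more groups than exist
print(np.sort(centers2))
print(np.sort(centers5))
```

Asking for two clusters recovers the two groups. Asking for five still returns five centers, which means the algorithm has subdivided groups that really belong together.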

<span style="color:dodgerblue">There's a rich literature on spike sorting with different approaches, including PCA, which we'll touch on briefly in Friday's class.</span>
    


In [26]:
from sklearn.cluster import KMeans

n_clusters = 5

# reshape the data to the shape (n_samples, n_features) -- required for scikit-learn
X = peak_heights.reshape([-1,1])
# run k-means clustering
km = KMeans(n_clusters=n_clusters).fit(X)

In [27]:
# display the nerve and the peaks colored by cluster
fig_clus, ax_clus = pl.subplots()
ax_clus.plot(data, color='gray', lw=1)
ax_clus.scatter(peak_idxs, peak_heights, c=km.labels_, s=20, zorder=10)

<matplotlib.collections.PathCollection at 0x1a2de39128>

**For future reference: play with https://github.com/tridesclous/tridesclous.**

<a id="filt"></a>
## 5. filtering: low, high pass; notch; baseline

<span style="color:dodgerblue">We did not make it to discussion of filtering, but there are many ways that filtering is relevant to processing ephys data.  Analog traces are often prefiltered -- high and low pass filters can separate an electrode recording into a spiketrain (from high pass filtering) and a local field potential (low pass filtering).  Sometimes it's necessary to apply a notch filter to remove AC noise (e.g., filtering out signal at 60 ± 0.5 Hz).  Drifty data can be baseline filtered to make things like peak detection much easier.  The figure below is from <u>NDS</u> (Nylen and Wallisch).</span>

![alt_text](filtering.png)
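
Since we did not get to this in class, here is a hedged sketch of how the low pass, high pass, and notch filters described above might look with `scipy.signal`. The test signal is made up (a slow 5 Hz wave, 60 Hz hum, and a fast 1000 Hz component), and the 300 Hz cutoff, the filter order, and the notch `Q` are illustrative assumptions, not recommendations for your data. `butter_filter` and `amplitude_at` are hypothetical helpers.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

fs = 10000.0                        # sampling frequency in Hz (100 µs sample interval)
t = np.arange(int(2 * fs)) / fs     # 2 s of synthetic data
# made-up test signal: slow 5 Hz wave + 60 Hz hum + fast 1000 Hz component
trace = (np.sin(2 * np.pi * 5 * t)
         + 0.5 * np.sin(2 * np.pi * 60 * t)
         + 0.2 * np.sin(2 * np.pi * 1000 * t))

def butter_filter(x, cutoff_hz, fs, btype, order=2):
    """Zero-phase Butterworth filter; btype is 'low' or 'high'."""
    b, a = butter(order, cutoff_hz / (fs / 2), btype=btype)  # cutoff as a fraction of Nyquist
    return filtfilt(b, a, x)        # forward-backward filtering adds no phase shift

def amplitude_at(x, freq, t):
    """Estimate the amplitude of the sine component of x at freq (Hz)."""
    return 2 * np.mean(x * np.sin(2 * np.pi * freq * t))

lfp_like = butter_filter(trace, 300.0, fs, 'low')     # keeps the slow components
spike_band = butter_filter(trace, 300.0, fs, 'high')  # keeps the fast component

# notch centered on 60 Hz; Q = 30 gives a notch roughly 2 Hz wide
b_notch, a_notch = iirnotch(60.0 / (fs / 2), Q=30.0)
no_hum = filtfilt(b_notch, a_notch, trace)

print(amplitude_at(lfp_like, 1000, t), amplitude_at(spike_band, 5, t))
```

A narrow notch rings for a while at the edges of the recording, so judge how well it worked from the middle of the trace rather than the first or last few hundred milliseconds.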



<a id="plot"></a>
## 6. spike trains, raster plots

<span style="color:dodgerblue">One way to represent the information encoded by neurons is to record just a binary stream of spike times (ignoring spike waveforms and other features).  Spiketrains are useful in comparisons of neuronal activity at the network level (cf. information theory and population analysis discussion on Friday), and are often displayed as raster plots.</span>
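
Going from detected peaks to a spike train takes very little code. This sketch assumes you already have a list of peak indices and know the sample interval (100 µs here, as in our ephys file). The indices are made up, and `indices_to_spiketrain` is a hypothetical helper.

```python
import numpy as np

def indices_to_spiketrain(peak_idxs, n_samples, si=1e-4):
    """Turn peak indices into a binary spike train and a list of spike times (seconds)."""
    peak_idxs = np.asarray(peak_idxs, dtype=int)
    train = np.zeros(n_samples, dtype=int)
    train[peak_idxs] = 1                 # 1 where a spike was detected, 0 elsewhere
    spike_times = peak_idxs * si         # index times sample interval gives time in seconds
    return train, spike_times

# made-up detections in a 50-sample recording
train, spike_times = indices_to_spiketrain([3, 10, 11, 42], n_samples=50)
print(train)
print(spike_times)
```

Either representation can be handed to `vlines`: the spike times directly, or the indices where the binary train equals 1.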

In [28]:
# a simulation of a raster plot

A = np.random.choice([0,1], 1000, p=[0.9,0.1]).reshape(10,100)

fig6,ax6 = pl.subplots()

spiketimes = [i for i,x in enumerate(A[0]) if x==1]
ax6.vlines(spiketimes,0,1)  # draw on the axis we created, rather than overwriting the ax6 handle

print(spiketimes)

[0, 20, 24, 25, 41, 74, 84, 85, 97]


**exercise 6**: show me a plot with 10 spiketrain rasters in it, that looks like this:

![alt_text](Figure_1.png)

**exercise 6 part 2**: repeat exercise 6, starting with an ephys trace.

**sidequest**:  Can you use vlines to put color-coded y scales correlated with each trace in that first plot?