Plotting large numbers of sequences/time series together: dealing with fixed length numpy arrays #512

Merged
merged 3 commits into from
Oct 26, 2017

Conversation

@narendramukherjee
Contributor

narendramukherjee commented Oct 26, 2017

Added a function in ds.utils to convert time series/sequences stored as 2D numpy arrays to a dataframe with NaN separators between individual sequences. Also added an example in tseries.ipynb showing the use of this function. This is in response to issue #286.
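For readers unfamiliar with the NaN-separator idea, here is a minimal sketch of the conversion described above. It is not the code from this PR: the helper name `sequences_to_nan_separated_df`, its signature, and the column names `x` and `y` are assumptions made for illustration. It assumes all sequences share one 1D array of x values and that the y values are a 2D array with one sequence per row.

```python
import numpy as np
import pandas as pd


def sequences_to_nan_separated_df(x_values, y_values):
    """Flatten equal-length sequences into one frame, with a NaN row after each.

    Hypothetical helper for illustration only.
    x_values: 1D array of shape (n_points,), shared by all sequences.
    y_values: 2D array of shape (n_sequences, n_points).
    """
    x_values = np.asarray(x_values, dtype=float)
    y_values = np.asarray(y_values, dtype=float)
    n_seq, n_pts = y_values.shape

    # Allocate one extra column per sequence; it stays NaN and acts as the
    # separator that tells a line renderer to lift the pen between sequences.
    x = np.full((n_seq, n_pts + 1), np.nan)
    y = np.full((n_seq, n_pts + 1), np.nan)
    x[:, :n_pts] = x_values  # the 1D x array broadcasts across all rows
    y[:, :n_pts] = y_values

    # Row-major flattening keeps each sequence contiguous, followed by its NaN.
    return pd.DataFrame({"x": x.ravel(), "y": y.ravel()})


# Example: 1000 random sequences of 5 points each.
rng = np.random.default_rng(0)
df = sequences_to_nan_separated_df(np.arange(5), rng.normal(size=(1000, 5)))
```

The resulting frame has `n_sequences * (n_points + 1)` rows, so a single line-drawing call can render every sequence without connecting the end of one to the start of the next.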

…mpy arrays

to a pandas dataframe with NaNs separating individual sequences
2.) An example in tseries.ipynb showing the use of this function while plotting
thousands of sequences together
@narendramukherjee
Contributor Author

@jbednar @philippjfr Can you take a look and let me know what you think?

@jbednar
Member

jbednar commented Oct 26, 2017

Thanks for the PR!

In most of our use cases for datashading curves, we want to be able to distinguish between the curves, which is only feasible for up to a few dozen curves if we use count_cat to colorize them. Here, there doesn't seem to be a way to convey the identity of each curve, but it seems like you have an application in mind where that doesn't matter? E.g. maybe you could talk about how this approach lets you discover underlying periodicities in pseudorandom number generators? That's what it looks like your example is showing:

[two images]

@jbednar
Member

jbednar commented Oct 26, 2017

Oops; the periodicities are just due to having many fewer items in your sequence than the number of points in the plot, which again is unusual. The bumps are just for each number involved:

[four images]

Still trying to think of a way to motivate what an example like this will be used for...

@narendramukherjee
Contributor Author

My specific use case involves looking at 1-2 ms long voltage traces from neurons (action potentials) and determining if they are coming from the same neuron. Each neuron produces stereotypical action potentials, and plotting all the action potentials recorded on an electrode on top of each other lets us know if they are coming from one or multiple neurons. So, yes, in my case, the exact identity of each curve doesn't really matter. Check out our SciPy paper again for severely overplotted examples of this kind: http://conference.scipy.org/proceedings/scipy2017/narendra_mukherjee.html

I think that this sort of use case isn't that uncommon - the original question in #286 was trying to achieve exactly this sort of thing. I just used a pseudorandom number generator as an easy way to generate 'dummy' data of the kind I am plotting. I could just as well put in my specific use case, with action potentials from a neuron as an example, but that would mean I would also have to include some actual data that I have recorded to make those plots work. I didn't know how to do that with an IPython notebook.

Let me know what you think!

@jbednar
Member

jbednar commented Oct 26, 2017

I had forgotten that you were the one with the SciPy paper, which I do remember now!

A use case something like that was what I was imagining, but in that case, won't you want to know the identity of the inappropriately sorted curves, the ones with shapes that suggest that they are not action potentials for this neuron, so that you can exclude them from the group? I agree that a visualization like this is a good first step, to at least be able to see them, but then if it were my data I'd immediately want to start pulling out the outlier curves and see why they ended up in this bucket inappropriately, which is difficult if I can't identify them.

Maybe in practice what you do is just adjust some threshold, never dealing with individual curves by name or id? In that case I guess a good visualization would be to overlay a datashaded plot of the traces included by the threshold in one color, over a datashaded plot of the ones excluded in another color, adjusting the threshold until those two groups were quite visibly distinct. Doing that shouldn't require anything further from datashader, but it sure seems like it would be helpful to have an example that shows a workflow like this.
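The threshold-based split suggested here could be sketched roughly as follows. The thread does not say which quantity the threshold applies to, so using the peak absolute amplitude of each trace is purely an assumption, as is the helper name `split_by_peak_threshold`.

```python
import numpy as np


def split_by_peak_threshold(traces, threshold):
    """Split traces into (included, excluded) groups by peak amplitude.

    Hypothetical helper: the peak-amplitude criterion is an assumption,
    not something specified in this discussion.
    traces: 2D array of shape (n_traces, n_samples).
    """
    traces = np.asarray(traces, dtype=float)
    peaks = np.abs(traces).max(axis=1)  # one peak value per trace
    keep = peaks >= threshold           # boolean mask over traces
    return traces[keep], traces[~keep]


# Example: three short traces, two of which exceed the threshold.
demo = np.array([[0.0, 1.0, 0.0],
                 [0.0, 5.0, 0.0],
                 [0.0, -6.0, 0.0]])
included, excluded = split_by_peak_threshold(demo, threshold=3.0)
```

Each group can then be rendered on its own and the two renderings overlaid in different colors, with the threshold adjusted until the groups look visibly distinct.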

I wonder if there's a good way to do that with synthetic data, synthesizing a bunch of curves from different categories, pooling them all together, and then showing how to use datashader to see visually that there are these categories and then adjusting thresholds until a clustering algorithm correctly sorts out each category. Hmm; probably too ambitious, so I guess I should just merge this utility as-is and think about that later!
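As a rough illustration of the synthetic-data idea, the sketch below pools noisy curves drawn from two made-up templates and shuffles them so that category membership is hidden. The template shapes, noise level, and the function name `make_pooled_curves` are all assumptions chosen for illustration; none of them come from the PR or from real recordings.

```python
import numpy as np


def make_pooled_curves(n_per_category=500, n_samples=50, noise=0.1, seed=0):
    """Return (t, curves, labels) for a shuffled pool of synthetic curves.

    Hypothetical generator: two Gaussian-bump templates stand in for
    two categories of curve shapes.
    """
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_samples)

    templates = np.stack([
        np.exp(-((t - 0.3) ** 2) / 0.005),       # narrow, early peak
        0.6 * np.exp(-((t - 0.5) ** 2) / 0.02),  # broader, smaller, later peak
    ])

    # One label per curve, then template plus independent Gaussian noise.
    labels = np.repeat(np.arange(len(templates)), n_per_category)
    curves = templates[labels] + rng.normal(scale=noise,
                                            size=(labels.size, n_samples))

    # Shuffle so the pooled array carries no ordering information.
    order = rng.permutation(labels.size)
    return t, curves[order], labels[order]


t, curves, labels = make_pooled_curves()
```

The returned `labels` array is kept only so that a later clustering or thresholding step can be checked against the known categories.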

jbednar merged commit 0bacd83 into holoviz:master on Oct 26, 2017
@jbednar
Member

jbednar commented Oct 26, 2017

Ok, I tidied up the example notebook a bit to remove extraneous changes and to use an example where each datapoint was countable for clarity, and merged it. Thanks for your contribution!

jbednar pushed a commit that referenced this pull request Oct 30, 2017
…h fixed length numpy arrays (#512)

* Added a function to ds.utils to convert sequences stored as 2D numpy arrays to a pandas dataframe with NaNs separating individual sequences
* Added an example in tseries.ipynb showing the use of this function while plotting thousands of sequences together