
Example of plotting points with associated probabilities #102

Open
jbednar opened this issue Mar 7, 2016 · 12 comments

@jbednar
Member

jbednar commented Mar 7, 2016

Currently, datashader's scatterplot/heatmap approach for points data partitions the set of points, allocating each one into non-overlapping pixel-shaped bins. Some types of data come with associated probabilities, such as a known measurement error bound or an estimated uncertainty per point.

It would be good to have an example of how to aggregate such data, such that the value of each datapoint is assigned to multiple bins in the aggregate array, according to some kernel function (e.g. a 2D Gaussian, where errors are specified as stddevs).

For the special case of a square error kernel, this approach is equivalent to implementing support for raster data (see #86), where each raster datapoint represents a specified area of the X,Y plane with equal probability or weighting within that square.

We'll need a suitable dataset of this type, preferably one with widely varying error estimates across the datapoints, such that some points have tight bounds and others are less constrained.
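
For concreteness, here is a minimal numpy sketch of this kind of kernel aggregation. The gaussian_aggregate function and its signature are purely hypothetical, not datashader API; errors are treated as per-point standard deviations in data units, and x/y scaling is assumed to be roughly isotropic:

import numpy as np

def gaussian_aggregate(xs, ys, sigmas, weights, width, height, x_range, y_range):
    """Spread each point's weight over nearby bins with a truncated 2D Gaussian."""
    agg = np.zeros((height, width))
    xscale = width / (x_range[1] - x_range[0])
    yscale = height / (y_range[1] - y_range[0])
    for x, y, sigma, w in zip(xs, ys, sigmas, weights):
        # Point center and stddev, converted from data units to pixels
        px, py = (x - x_range[0]) * xscale, (y - y_range[0]) * yscale
        ps = sigma * xscale                   # assumes isotropic x/y scaling
        r = max(1, int(np.ceil(3 * ps)))      # truncate the kernel at 3 stddevs
        x0, x1 = max(0, int(px) - r), min(width, int(px) + r + 1)
        y0, y1 = max(0, int(py) - r), min(height, int(py) + r + 1)
        gx, gy = np.meshgrid(np.arange(x0, x1), np.arange(y0, y1))
        kernel = np.exp(-((gx - px) ** 2 + (gy - py) ** 2) / (2 * ps ** 2))
        agg[y0:y1, x0:x1] += w * kernel / kernel.sum()  # each point contributes w in total
    return agg

# Synthetic example: points with widely varying per-point uncertainties
rng = np.random.default_rng(0)
n = 1000
agg = gaussian_aggregate(rng.uniform(0, 100, n), rng.uniform(0, 100, n),
                         sigmas=rng.uniform(0.5, 5.0, n), weights=np.ones(n),
                         width=200, height=200, x_range=(0, 100), y_range=(0, 100))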

@thoth291

Thank you, @jbednar.
Two questions.

First:
Will this feature help to crossplot data like this:
X Y VAL
1 1 0.2
2 1 0.3
...
1 2 0.3
2 2 0.4
...
5 5 1.0

where for each pair (X, Y) there is a unique value VAL,
and the result is a scatter plot of these points colored by some mapping of VAL to RGB?

Basically equivalent of

df.plot(kind='scatter', x='X', y='Y', c='VAL', s=50);

Second:
Is there (or will there be) any way to define the size of points in datashader?

Thanks!

@jbednar
Member Author

jbednar commented Mar 10, 2016

We're working on making point sizing more flexible and automatic, and on properly documenting how to do it, but in the meantime you can apply the tf.spread function to your final image, as shown in this notebook:
https://gist.github.com/jcrist/62b366727886561356d8

The code is already available for the application you describe above; just pass the field you want to the appropriate aggregation function:

cvs = ds.Canvas(plot_width=800, plot_height=500, x_range=x_range, y_range=y_range)
agg = cvs.points(df, 'X', 'Y', ds.mean('VAL'))
img = tf.interpolate(agg, low="white", high='darkblue', how='linear')

where mean tells datashader that you want to average the VAL of all points falling into that pixel; you could instead take the max, median, etc.
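
For the spread step, a minimal sketch continuing from the snippet above (the px and shape values here are just illustrative):

import datashader.transfer_functions as tf

# Grow each rendered point by 2 pixels using a circular footprint,
# so that isolated points stay visible.
img = tf.spread(img, px=2, shape='circle')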

@thoth291

Thanks, @jbednar.
I was able to colorize my plot with your example; it was quite easy, and my understanding of datashader is more solid now!
But it looks like tf.spread is not available in the version from conda; I guess I need to use the GitHub version instead...

@jbednar
Member Author

jbednar commented Mar 10, 2016

Oops, yes -- spread requires the GitHub master version.

@thoth291

Thanks,

When I run

import datashader as ds

I get this error:

OSError: [Errno 13] Permission denied: '/opt/dist/anaconda/lib/python2.7/site-packages/datashader-0.1.0-py2.7.egg/datashader/__pycache__'

(attachment: DatashaderImportError.txt)
The reason is that I installed this package as a system admin, but I run it as my regular user.
Is there any way to prevent file creation like that in your library? Or at least isolate it, so that one user doesn't affect another?

The version from conda never had this problem.

For now, I've given all users rwx permissions on the datashader directory, and it seems to work.
Other than that, all the features are perfect! Thank you!

P.S. I'm curious: by the design of the spread API, shape + px = mask. So why wouldn't you just generalize the shape parameter to accept numpy masks and ignore px in that case? Or even better, somehow scale the mask based on px? Just curious, not demanding :-)

@jbednar
Member Author

jbednar commented Mar 10, 2016

I don't think that issues with __pycache__ would be due to datashader per se, as we don't access that directly ourselves (though it looks like the separate numba library that we use does access it). So I'd assume that there's a different way to install it that would avoid permissions errors, but I don't know how you originally installed it, and thus what change to suggest.

For the shape, we often want to specify a circular mask at different radius values, which the px argument makes easy; it would be painful to make a new mask for every px value we wanted to try. Yes, scaling the mask based on the px value would be handy, but there are lots of ways to scale matrices, so we'd rather leave that up to the user, using any of the many libraries available for that.
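
(For reference, later datashader releases do accept a custom footprint via tf.spread's mask argument, which may postdate this discussion; a minimal sketch, continuing from the earlier snippet:)

import numpy as np
import datashader.transfer_functions as tf

# A plus-shaped footprint; when mask= is given, px and shape are not needed.
# Masks are expected to be square 2D arrays with odd dimensions.
mask = np.array([[0, 1, 0],
                 [1, 1, 1],
                 [0, 1, 0]], dtype=bool)
img = tf.spread(img, mask=mask)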

@jcrist
Collaborator

jcrist commented Mar 10, 2016

The reason is that I installed this package as a system admin, but I run it as my regular user.
Is there any way to prevent file creation like that in your library? Or at least isolate it, so that one user doesn't affect another?

We started caching code compilation in numba, which writes a cache file on first import. I've filed an issue, see numba/numba#1771.

For now, try running python -c "import datashader" with admin privileges after install. This should cause the compilation to happen once (while you have permission to write those files). Subsequent imports should only read the cache, which should be fine.

@thoth291

That all makes sense!
Thank you for the ticket at numba - I'll watch it.

@Nithanaroy

tf.interpolate, used in the comments above, is now deprecated. The updated code would be:

cvs = ds.Canvas(plot_width=800, plot_height=500, x_range=x_range, y_range=y_range)
agg = cvs.points(df, 'X', 'Y', ds.mean('VAL'))
img = tf.shade(agg, cmap=["white", 'darkblue'], how='linear')

jbednar added this to the wishlist milestone Jun 7, 2021
jbednar removed the wishlist label Jun 7, 2021
@naavis

naavis commented Jul 28, 2022

Hi! I have been trying to use this method for plotting data points with associated probabilities/weights, but I bumped into something I do not understand. If I pass all zero values in the column used as the weighting factor, I expect the image to become empty. Yet it does not! Is this a bug, or am I misunderstanding something?

Below is minimal code to reproduce it with datashader 0.13.0:

import datashader as ds
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

num_datapoints = 1000
xs = 200 * np.random.rand(num_datapoints)
ys = 200 * np.random.rand(num_datapoints)
weights = np.random.rand(num_datapoints)
# Uncommenting the line below should probably
# result in a black image, yet it doesn't?
# weights = np.zeros((num_datapoints,))

df = pd.DataFrame(np.array([xs, ys, weights]).T, columns=['x', 'y', 'weight'])
cvs = ds.Canvas(plot_width=200, plot_height=200, x_range=(0, 200), y_range=(0, 200))
agg = cvs.points(df, 'x', 'y', ds.sum('weight'))
img = ds.tf.shade(agg, cmap='white')

plt.imshow(img, origin='lower', cmap='gray')
plt.show()

And below is what I see if I uncomment the line that sets all the weights to zero.

[Figure_1: screenshot of the resulting plot, which is not empty]

In my other work the outputs of cvs.points(df, 'x', 'y', ds.sum('weight')) and a Matplotlib scatter plot with the weights used as colors or sizes look very different at the moment, so maybe I'm misunderstanding how it is supposed to work in Datashader. I assume using the ds.sum('weight') aggregator would make the brightness of each bin/pixel equal to the sum of the weights for data points that land in that bin.

@ianthomas23
Member

@naavis If you look at the contents of agg when you are using your zero weights, you will see that it contains two values, 0 and np.nan: zeros where you have data points with a weight of zero, np.nan where there are no data points. If there is only a single finite data value in agg, it is mapped to the top end of the cmap, hence white.

Secondly, your combination of ds.tf.shade() and plt.imshow() is almost certainly not doing what you want. ds.tf.shade() outputs a 200x200 array of RGBA values encoded into uint32, and if you pass an MxN array to imshow it will treat it as scalar data and apply a colormap, so you are applying a colormap twice. For debugging purposes, I recommend replacing your matplotlib code with a call to ds.utils.export_image(); it should all be easier to understand.
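
For example, a minimal sketch continuing from the reproduction code above (export_image lives in datashader.utils in recent releases):

import numpy as np
from datashader.utils import export_image

# With all-zero weights, the aggregate holds only 0 (pixels containing
# points) and NaN (empty pixels):
print(np.unique(agg.data[np.isfinite(agg.data)]))

# Write the shaded image straight to a PNG instead of passing it back
# through imshow's colormapping:
export_image(img, "weights_debug", background="black")  # writes weights_debug.png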

Anyway, this is really a usage question and should have been posted to https://discourse.holoviz.org/ rather than appended to a 6-year-old GitHub issue. If you have further questions about this, could you please ask on the Discourse instead? Thanks!

@naavis

naavis commented Jul 29, 2022

Thanks, and sorry. This GitHub issue was the only place I found that mentions using per-datapoint weights/probabilities with Datashader. The documentation isn't exactly abundant on this:
https://datashader.org/user_guide/Points.html
https://datashader.org/api.html#definitions

I was not aware of the Discourse page. I'll post any further thoughts there.
