New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sparse map prototype #1170

Merged
merged 20 commits into from Oct 20, 2017

Conversation

Projects
None yet
3 participants
@woodmd
Member

woodmd commented Oct 17, 2017

This PR provides a proof of concept for implementing sparse maps with a custom sparse array class. SparseArray is a sparse data structure for ND arrays that partially mimics the interface of numpy.ndarray. It uses cython functions in gammapy.maps._sparse to efficiently lookup and update array values. Note that the current implementation is not efficient for certain use cases for instance setting an element will always trigger a reallocation of the full array in memory. However I think those use cases are probably not so important for the usage patterns we'll typically have for gamma-ray analysis where maps are normally allocated in a single pass at the start of the analysis. To demonstrate how this could be used for the sparse map classes I've updated HpxMapSparse to use SparseArray in lieu of scipy.sparse and added some tests for the get/set methods.

Currently SparseArray only supports a small subset of the functionality of numpy.ndarray and does not support the numpy ufunc interface. I also considered the option of making SparseArray a sub-class of numpy.ndarray which would probably be the cleaner approach for making something that can be used interchangeably with numpy.ndarray. However it looked like this would require a much deeper understanding of the numpy internals and didn't seem to be critical for getting a working prototype.

To the extent that we could eventually fully reproduce the numpy.ndarray interface we could consider doing away with the current class hierarchy with separate classes for sparse and non-sparse maps. In this case we would just have a switch statement in the map constructor that would allocate the data vector as a numpy.ndarray or SparseArray. From the standpoint of long-term code maintenance this would be my preferred option but while we're still prototyping different solutions we should probably keep the current design.

@cdeil cdeil self-assigned this Oct 17, 2017

@cdeil cdeil added the feature label Oct 17, 2017

@cdeil cdeil added this to the 0.7 milestone Oct 17, 2017

@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil Oct 17, 2017

Member

I just had a quick look.

From what I've heard, the vast majority of Numpy developers considers Numpy array subclassing to be always a mistake and frown upon the Astropy Quantity. What would the benefit be to subclassing the Numpy array? Is there some functionality or extra interoperability you get, that you don't already have now with your sparse duck?

I think there's two things to consider here to try and improve a bit before merging.

One is high-level use case and docs. At the moment it's all very abstract and hard to review, if the main use case isn't visible yet, i.e. if I can't make a counts cube from an event list yet and possibly even do something with it. So for review it would be helpful, but OK to add this in a follow-up PR if you prefer.

The other thing is that there's this error on Windows:

ValueError: Buffer dtype mismatch, expected 'int_t' but got 'long long'
gammapy\maps\_sparse.pyx:67: ValueError

https://ci.appveyor.com/project/cdeil/gammapy/build/1.0.2754/job/vop5ndc90lbhdlsd#L1548
I think the solution is to change the Cython and calling Python code to force 64-bit integers like via int64_t. See https://github.com/astropy/astropy-healpix/tree/master/astropy_healpix as a working example on Windows. @astrofrog - Is that the right approach and solution to getting rid of this error on Windows?

Member

cdeil commented Oct 17, 2017

I just had a quick look.

From what I've heard, the vast majority of Numpy developers considers Numpy array subclassing to be always a mistake and frown upon the Astropy Quantity. What would the benefit be to subclassing the Numpy array? Is there some functionality or extra interoperability you get, that you don't already have now with your sparse duck?

I think there's two things to consider here to try and improve a bit before merging.

One is high-level use case and docs. At the moment it's all very abstract and hard to review, if the main use case isn't visible yet, i.e. if I can't make a counts cube from an event list yet and possibly even do something with it. So for review it would be helpful, but OK to add this in a follow-up PR if you prefer.

The other thing is that there's this error on Windows:

ValueError: Buffer dtype mismatch, expected 'int_t' but got 'long long'
gammapy\maps\_sparse.pyx:67: ValueError

https://ci.appveyor.com/project/cdeil/gammapy/build/1.0.2754/job/vop5ndc90lbhdlsd#L1548
I think the solution is to change the Cython and calling Python code to force 64-bit integers like via int64_t. See https://github.com/astropy/astropy-healpix/tree/master/astropy_healpix as a working example on Windows. @astrofrog - Is that the right approach and solution to getting rid of this error on Windows?

class SparseArray(object):
"""Sparse N-dimensional array object. This class implements a data

This comment has been minimized.

@cdeil

cdeil Oct 17, 2017

Member

Is SparseArray a class end-users will interact with, or is it supposed to be something internal that doesn't need to be fully documented?

If it's something end-users see, please put an empty line after the first summary line of the docstring.
Also, please add a Parameters section documenting what the parameters of __init__ are.
Maybe even throw in an Examples section with a few lines of example code showing how to make one of those and do something with it?

@cdeil

cdeil Oct 17, 2017

Member

Is SparseArray a class end-users will interact with, or is it supposed to be something internal that doesn't need to be fully documented?

If it's something end-users see, please put an empty line after the first summary line of the docstring.
Also, please add a Parameters section documenting what the parameters of __init__ are.
Maybe even throw in an Examples section with a few lines of example code showing how to make one of those and do something with it?

This comment has been minimized.

@woodmd

woodmd Oct 17, 2017

Member

In principle users shouldn't need to interact with SparseArray directly -- the recommended interface for manipulating maps is via get/set/fill methods of the map classes. However it's probably good to have it well documented in any case since developers will want to know how to use this class when working with the maps classes. It's also possible that there may be use cases for it outside of gammapy.maps.

@woodmd

woodmd Oct 17, 2017

Member

In principle users shouldn't need to interact with SparseArray directly -- the recommended interface for manipulating maps is via get/set/fill methods of the map classes. However it's probably good to have it well documented in any case since developers will want to know how to use this class when working with the maps classes. It's also possible that there may be use cases for it outside of gammapy.maps.

Show outdated Hide outdated gammapy/maps/sparse.py Outdated
Show outdated Hide outdated gammapy/maps/sparse.py Outdated
@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil Oct 17, 2017

Member

So does read already work for HpxMapSparse?
If yes, could you add an example how to read a file to it's docstring or here?
http://docs.gammapy.org/en/latest/maps/index.html#fits-i-o

The main point is to show one example to users how they can get an example instance of each class in Gammapy, so that they can start playing around with it in IPython / Jupyter and start to learn how it works.

Member

cdeil commented Oct 17, 2017

So does read already work for HpxMapSparse?
If yes, could you add an example how to read a file to it's docstring or here?
http://docs.gammapy.org/en/latest/maps/index.html#fits-i-o

The main point is to show one example to users how they can get an example instance of each class in Gammapy, so that they can start playing around with it in IPython / Jupyter and start to learn how it works.

@woodmd

This comment has been minimized.

Show comment
Hide comment
@woodmd

woodmd Oct 17, 2017

Member

Thanks for the review. No read isn't working yet so perhaps it's worth adding that before merging. I can add this along with the fill method so that we have the basic functionality for working with counts maps.

Member

woodmd commented Oct 17, 2017

Thanks for the review. No read isn't working yet so perhaps it's worth adding that before merging. I can add this along with the fill method so that we have the basic functionality for working with counts maps.

@astrofrog

This comment has been minimized.

Show comment
Hide comment
@astrofrog

astrofrog Oct 17, 2017

Contributor

From what I've heard, the vast majority of Numpy developers considers Numpy array subclassing to be always a mistake and frown upon the Astropy Quantity

Huh, really? What should we have done instead though? 😕

Regarding the sparse array, I just realized that this might work as-is when used for interpolation in astropy-healpix, because the array of values to interpolate from is never passed to the Cython level and we just use standard Numpy indexing. Worth a try!

Contributor

astrofrog commented Oct 17, 2017

From what I've heard, the vast majority of Numpy developers considers Numpy array subclassing to be always a mistake and frown upon the Astropy Quantity

Huh, really? What should we have done instead though? 😕

Regarding the sparse array, I just realized that this might work as-is when used for interpolation in astropy-healpix, because the array of values to interpolate from is never passed to the Cython level and we just use standard Numpy indexing. Worth a try!

@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil Oct 18, 2017

Member

What should we have done instead though?

From what I gather, pint Quantity doesn't sub-class ndarray, and pandas Series was changed to remove the subclassing from ndarray:

I think basically they achieve some interoperability with Numpy and other functions that were initially written with Numpy arrays in mind via duck typing?

The problem with ndarray subclassing like Quantity does it is that it works most of the time, but then from time to time one runs into cases that don't work as expected:
http://astropy.readthedocs.io/en/stable/known_issues.html#quantity-issues

My understanding is that such cases also exist for pandas Series and pint Quantity (different issues, occuring when calling different functions), and they probably had to write more code to make it work as well as it does.


The SparseArray class added here is very similar to pandas.Series: it has an index and a value. I don't understand how Series is implemented internally, but I think what @woodmd did here is exactly the right approach: do it as simple as possible, using composition with two Numpy array data members of index and value.

@woodmd - What operations do we need for SparseArray objects for gamma-ray data analysis?

If I understand correctly, we can just write the analysis methods such that they use the data and index attributes, and we don't need a Numpy array interface for the SparseArray object, no?

PS: we do have one ndarray / Quantity subclass in Gammapy at the moment:
https://gammapy.readthedocs.io/en/latest/api/gammapy.utils.energy.Energy.html
I would like to get rid of it, we can just use Quantity objects directly, and move the few useful convenience methods on it to helper functions or methods on other classes.
E.g. there could be a log_energy_space function like
http://docs.gammapy.org/en/latest/api/gammapy.utils.nddata.sqrt_space.html
instead of the current
https://gammapy.readthedocs.io/en/latest/api/gammapy.utils.energy.Energy.html#gammapy.utils.energy.Energy.equal_log_spacing

Member

cdeil commented Oct 18, 2017

What should we have done instead though?

From what I gather, pint Quantity doesn't sub-class ndarray, and pandas Series was changed to remove the subclassing from ndarray:

I think basically they achieve some interoperability with Numpy and other functions that were initially written with Numpy arrays in mind via duck typing?

The problem with ndarray subclassing like Quantity does it is that it works most of the time, but then from time to time one runs into cases that don't work as expected:
http://astropy.readthedocs.io/en/stable/known_issues.html#quantity-issues

My understanding is that such cases also exist for pandas Series and pint Quantity (different issues, occuring when calling different functions), and they probably had to write more code to make it work as well as it does.


The SparseArray class added here is very similar to pandas.Series: it has an index and a value. I don't understand how Series is implemented internally, but I think what @woodmd did here is exactly the right approach: do it as simple as possible, using composition with two Numpy array data members of index and value.

@woodmd - What operations do we need for SparseArray objects for gamma-ray data analysis?

If I understand correctly, we can just write the analysis methods such that they use the data and index attributes, and we don't need a Numpy array interface for the SparseArray object, no?

PS: we do have one ndarray / Quantity subclass in Gammapy at the moment:
https://gammapy.readthedocs.io/en/latest/api/gammapy.utils.energy.Energy.html
I would like to get rid of it, we can just use Quantity objects directly, and move the few useful convenience methods on it to helper functions or methods on other classes.
E.g. there could be a log_energy_space function like
http://docs.gammapy.org/en/latest/api/gammapy.utils.nddata.sqrt_space.html
instead of the current
https://gammapy.readthedocs.io/en/latest/api/gammapy.utils.energy.Energy.html#gammapy.utils.energy.Energy.equal_log_spacing

Show outdated Hide outdated gammapy/maps/sparse.py Outdated
@woodmd

This comment has been minimized.

Show comment
Hide comment
@woodmd

woodmd Oct 18, 2017

Member

What operations do we need for SparseArray objects for gamma-ray data analysis?

We probably want to support indexing, arithmetic, and at least some of the numpy mathematical functions (e.g. np.sum). For the latter I was planning to follow the approach described here wherein you create a class method that overrides the given numpy function.

I do think having an interface that supports at least a subset of numpy array functionality is valuable. Most users are already familiar with this style of interface and it also allows the class to be more easily used with external functions and classes that accept numpy arrays. Of course expert users can always work the data and index members directly.

Member

woodmd commented Oct 18, 2017

What operations do we need for SparseArray objects for gamma-ray data analysis?

We probably want to support indexing, arithmetic, and at least some of the numpy mathematical functions (e.g. np.sum). For the latter I was planning to follow the approach described here wherein you create a class method that overrides the given numpy function.

I do think having an interface that supports at least a subset of numpy array functionality is valuable. Most users are already familiar with this style of interface and it also allows the class to be more easily used with external functions and classes that accept numpy arrays. Of course expert users can always work the data and index members directly.

@woodmd

This comment has been minimized.

Show comment
Hide comment
@woodmd

woodmd Oct 19, 2017

Member

Ok read and write are now implemented for HpxMapSparse. The syntax for the I/O methods is identical to that for non-sparse maps, e.g.

m = HpxMapSparse.read('map.fits')
m.write('out.fits')

The only difference is that sparse maps have sparse=True by default when writing.

I'd propose that we go ahead and merge this now and defer writing additional documentation and examples for a future PR.

Member

woodmd commented Oct 19, 2017

Ok read and write are now implemented for HpxMapSparse. The syntax for the I/O methods is identical to that for non-sparse maps, e.g.

m = HpxMapSparse.read('map.fits')
m.write('out.fits')

The only difference is that sparse maps have sparse=True by default when writing.

I'd propose that we go ahead and merge this now and defer writing additional documentation and examples for a future PR.

@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil Oct 19, 2017

Member

@woodmd - Nice!
Is this PR ready to merge? Or did you want to do something else here?

Tests now pass on Windows: https://ci.appveyor.com/project/cdeil/gammapy/build/1.0.2764/job/8q2f7nbf3spk3lnk#L1281

The compiler still shows a few cases where int and float precision isn't matched: "possible precision loss" on Windows:
https://ci.appveyor.com/project/cdeil/gammapy/build/1.0.2764/job/8q2f7nbf3spk3lnk#L620

On Mac / Linux, everything looks OK.

I'll have another look at the code now.

Member

cdeil commented Oct 19, 2017

@woodmd - Nice!
Is this PR ready to merge? Or did you want to do something else here?

Tests now pass on Windows: https://ci.appveyor.com/project/cdeil/gammapy/build/1.0.2764/job/8q2f7nbf3spk3lnk#L1281

The compiler still shows a few cases where int and float precision isn't matched: "possible precision loss" on Windows:
https://ci.appveyor.com/project/cdeil/gammapy/build/1.0.2764/job/8q2f7nbf3spk3lnk#L620

On Mac / Linux, everything looks OK.

I'll have another look at the code now.

@cdeil

I've read over the diff briefly and left some inline comments with suggestions.

Show outdated Hide outdated gammapy/maps/_sparse.pyx Outdated
Show outdated Hide outdated gammapy/maps/_sparse.pyx Outdated
Show outdated Hide outdated gammapy/maps/_sparse.pyx Outdated
Show outdated Hide outdated gammapy/maps/hpxcube.py Outdated
Show outdated Hide outdated gammapy/maps/hpxcube.py Outdated
Show outdated Hide outdated gammapy/maps/hpxsparse.py Outdated

@cdeil cdeil changed the title from Sparse Map Prototyping to Add sparse map prototype Oct 19, 2017

@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil Oct 19, 2017

Member

I'm not sure how to make the remaining int / float type casting warnings from the Windows build log show up on Linux / Mac, so that we can fix them. If you want to get rid of them, then probably changing the int you define in your Cython functions just as "cdef int" to something more explicit might help. Here's an example, but to be honest, I don't understand the logic of how to choose int types in Cython yet:
https://github.com/astropy/astropy-healpix/blob/fc320714a285c62795b2b8d4c65f4eefc90d6f2a/astropy_healpix/core_cython.pyx#L60

So maybe just leave as-is, no?

Do you rely on having 64 bit ints in those functions for large maps or resolutions? Do you have any tests that exercise them with integer inputs that don't fit in 32 bits?

Member

cdeil commented Oct 19, 2017

I'm not sure how to make the remaining int / float type casting warnings from the Windows build log show up on Linux / Mac, so that we can fix them. If you want to get rid of them, then probably changing the int you define in your Cython functions just as "cdef int" to something more explicit might help. Here's an example, but to be honest, I don't understand the logic of how to choose int types in Cython yet:
https://github.com/astropy/astropy-healpix/blob/fc320714a285c62795b2b8d4c65f4eefc90d6f2a/astropy_healpix/core_cython.pyx#L60

So maybe just leave as-is, no?

Do you rely on having 64 bit ints in those functions for large maps or resolutions? Do you have any tests that exercise them with integer inputs that don't fit in 32 bits?

@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil Oct 19, 2017

Member

@woodmd - Which of these files can gammapy.map read by now?
http://gamma-astro-data-formats.readthedocs.io/en/latest/skymaps/healpix/index.html#sample-files

Do you think it would be useful to add tests for that?
If yes, the way to do it would be to put a copy in gammapy-extra/test_datasets/maps and then access those files, like we do from many other tests in Gammapy.
Of course, if you are already generating such test files and have an I/O test at the moment without relying on external files, that's even better!

I just tried one and it didn't work:
https://gist.github.com/cdeil/557cf27eea3750a2636de7337db2b4a8

Probably I'm using the wrong class? Or is this format not supported yet?

@woodmd - Maybe an overview table on this page, or some kind of overview description mapping formats to gammapy.maps classes would help?
http://docs.gammapy.org/en/latest/maps/index.html
Or even a factory function read_map that does the sniffing around for what HDUs are there and then tries to load the data in an appropriate format and instantiate the right class?

Member

cdeil commented Oct 19, 2017

@woodmd - Which of these files can gammapy.map read by now?
http://gamma-astro-data-formats.readthedocs.io/en/latest/skymaps/healpix/index.html#sample-files

Do you think it would be useful to add tests for that?
If yes, the way to do it would be to put a copy in gammapy-extra/test_datasets/maps and then access those files, like we do from many other tests in Gammapy.
Of course, if you are already generating such test files and have an I/O test at the moment without relying on external files, that's even better!

I just tried one and it didn't work:
https://gist.github.com/cdeil/557cf27eea3750a2636de7337db2b4a8

Probably I'm using the wrong class? Or is this format not supported yet?

@woodmd - Maybe an overview table on this page, or some kind of overview description mapping formats to gammapy.maps classes would help?
http://docs.gammapy.org/en/latest/maps/index.html
Or even a factory function read_map that does the sniffing around for what HDUs are there and then tries to load the data in an appropriate format and instantiate the right class?

@woodmd

This comment has been minimized.

Show comment
Hide comment
@woodmd

woodmd Oct 19, 2017

Member

Do you rely on having 64 bit ints in those functions for large maps or resolutions? Do you have any tests that exercise them with integer inputs that don't fit in 32 bits?

I made some improvements to how types are handled in the cython methods. Apparently there is a "fused" type mechanism in cython which lets you create methods that accept multiple types (i.e. like a template method in C++). I've used this to make all cython methods compatible with both 32- and 64-bit integers and floats. This should avoid unnecessary overhead from type casting.

Of course, if you are already generating such test files and have an I/O test at the moment without relying on external files, that's even better!

There are already pretty extensive I/O tests in the unit test suite so I'm not sure we really need to test on external files. The one exception might be verifying that we can read files generated by other software like the Fermi STs.

Probably I'm using the wrong class? Or is this format not supported yet?

This one should have worked. I'll have a closer look now to track it down.

Or even a factory function read_map that does the sniffing around for what HDUs are there and then tries to load the data in an appropriate format and instantiate the right class?

I'll have to think a bit about this one. In principle the classes and data formats are independent so for instance it's completely valid to write a sparse map and read it back as a non-sparse map (and vice versa). What we could do is create a factory method with some default class mapping that depends on the data format. I'd propose something like the following:

# Auto-selects appropriate class based on FITS format
m = MapBase.read('file.fits')
# Explicitly instantiate with a given class
m = HpxMapSparse.read('file.fits')
Member

woodmd commented Oct 19, 2017

Do you rely on having 64 bit ints in those functions for large maps or resolutions? Do you have any tests that exercise them with integer inputs that don't fit in 32 bits?

I made some improvements to how types are handled in the cython methods. Apparently there is a "fused" type mechanism in cython which lets you create methods that accept multiple types (i.e. like a template method in C++). I've used this to make all cython methods compatible with both 32- and 64-bit integers and floats. This should avoid unnecessary overhead from type casting.

Of course, if you are already generating such test files and have an I/O test at the moment without relying on external files, that's even better!

There are already pretty extensive I/O tests in the unit test suite so I'm not sure we really need to test on external files. The one exception might be verifying that we can read files generated by other software like the Fermi STs.

Probably I'm using the wrong class? Or is this format not supported yet?

This one should have worked. I'll have a closer look now to track it down.

Or even a factory function read_map that does the sniffing around for what HDUs are there and then tries to load the data in an appropriate format and instantiate the right class?

I'll have to think a bit about this one. In principle the classes and data formats are independent so for instance it's completely valid to write a sparse map and read it back as a non-sparse map (and vice versa). What we could do is create a factory method with some default class mapping that depends on the data format. I'd propose something like the following:

# Auto-selects appropriate class based on FITS format
m = MapBase.read('file.fits')
# Explicitly instantiate with a given class
m = HpxMapSparse.read('file.fits')
@woodmd

This comment has been minimized.

Show comment
Hide comment
@woodmd

woodmd Oct 19, 2017

Member

Looks like we still have a number of warnings coming from the Windows build. I don't have time to track these down so why don't we leave as-is for now.

Regarding the failure on the test file this reflects some changes to the format I implemented on the gammapy side but didn't yet propagate to the format specification page. The issue arises because we need a mechanism to discover the column names in the BANDS table for a given axis. The convention now implemented in gammapy is to look for a header keyword AXCOLS{X} where {X} is the dimension index. I can make a PR to the GADF repo with these changes in the next few days.

Member

woodmd commented Oct 19, 2017

Looks like we still have a number of warnings coming from the Windows build. I don't have time to track these down so why don't we leave as-is for now.

Regarding the failure on the test file this reflects some changes to the format I implemented on the gammapy side but didn't yet propagate to the format specification page. The issue arises because we need a mechanism to discover the column names in the BANDS table for a given axis. The convention now implemented in gammapy is to look for a header keyword AXCOLS{X} where {X} is the dimension index. I can make a PR to the GADF repo with these changes in the next few days.

@cdeil

cdeil approved these changes Oct 20, 2017

@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil Oct 20, 2017

Member

Can't wait to see what gamma-ray analysis with sparse maps looks like.
@woodmd - Thanks for making a bit step in this direction here!

Member

cdeil commented Oct 20, 2017

Can't wait to see what gamma-ray analysis with sparse maps looks like.
@woodmd - Thanks for making a bit step in this direction here!

@cdeil cdeil merged commit 98c62a2 into gammapy:master Oct 20, 2017

1 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build could not complete due to an error
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment