WIP: Implement CatalogMesh as a view of CatalogSource #393

Merged
merged 45 commits into from Sep 18, 2017

2 participants
@nickhand
Member

nickhand commented Sep 14, 2017

The old CatalogMesh was unclear about which object owned the columns, with unusual side effects when adding columns directly to a mesh. This PR implements CatalogMesh as a view of a CatalogSource object, similar to a numpy "view", and also implements slices of CatalogSource objects as views.
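The view semantics described above can be sketched with toy stand-in classes (hypothetical names, not the actual nbodykit implementation): column access on the mesh is forwarded to the source, so both objects always see the same columns.

```python
class ToySource:
    """Stand-in for CatalogSource: owns the column data."""
    def __init__(self):
        self._columns = {}

    def __getitem__(self, name):
        return self._columns[name]

    def __setitem__(self, name, value):
        self._columns[name] = value


class ToyMeshView:
    """Stand-in for CatalogMesh: forwards column access to its source,
    so a column set on the view also appears on the underlying source."""
    def __init__(self, source):
        self.source = source

    def __getitem__(self, name):
        return self.source[name]

    def __setitem__(self, name, value):
        self.source[name] = value


cat = ToySource()
mesh = ToyMeshView(cat)
mesh['Weight'] = [1.0, 2.0]          # set through the view...
assert cat['Weight'] == [1.0, 2.0]   # ...and the source sees it too
```

This mirrors the numpy idea that a view holds no data of its own, only a reference to the object that does.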

Also:

  • added functionality to ConvolvedFFTPower for cross-correlations
  • includes #392, because some of the new functionality depends on it

nickhand added some commits Sep 8, 2017

add a set_options context manager, similar to dask's;
- the current global variables are the dask cache size and dask chunk size
- increase the dask chunk size to 5e6 from 1e5; this appears to help with overhead for large data sets and should still be small enough to fit into memory
add optimized selection function for dask arrays with test
This builds on the idea that the size of catalog is fixed after a certain point in the dask task graph.
So when we perform a selection, we can insert the selection task directly after the point in the
graph when the size last changed. The subsequent operations are only performed on the
selected data.

This saves a good amount of time -- for example, when computing "Position" on the randoms catalog
after slicing to a single redshift bin.
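The payoff of inserting the selection early can be illustrated with plain Python (a toy stand-in, not a real dask task graph): the downstream, expensive per-row work only runs on rows that survive the selection.

```python
# toy illustration of applying the selection before later operations,
# so the expensive step touches only the selected rows
calls = {'n': 0}


def compute_position(row):
    # stand-in for an expensive per-row "Position" calculation
    calls['n'] += 1
    return row * 2.0


rows = range(10_000)
in_zbin = lambda row: row % 100 == 0   # hypothetical redshift-bin selection

# selection applied first: compute_position runs on 100 rows, not 10,000
positions = [compute_position(row) for row in rows if in_zbin(row)]
assert calls['n'] == 100
```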
Better treatment of a CatalogMesh as a view of a Catalog
``__getitem__``, ``__setitem__``, and ``compute`` call the underlying catalog (stored as the
``source`` attribute).

Now, when columns are added to a mesh, the mesh behaves like a view, and
the column is in both the mesh and the catalog source.
better syntax for setting columns in a MultipleSpeciesCatalog
set using ``catalog[species_name][column_name] = column``, rather than
``catalog[species_name/column_name]``

Internally, we store a copy of the original sources, and indexing the
catalog with the species name returns these copied sources.
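The two lookup styles can be sketched with a toy stand-in class (hypothetical, not the real MultipleSpeciesCatalog): indexing with the species name returns that species' catalog, a second index selects a column, and the older 'species/column' slash lookup resolves to the same data.

```python
class ToyMultiSpecies:
    def __init__(self, **species):
        # store a copy of each input source, as the commit message describes
        self._species = {name: dict(cols) for name, cols in species.items()}

    def __getitem__(self, key):
        if '/' in key:                       # old slash syntax: 'data/NZ'
            name, column = key.split('/', 1)
            return self._species[name][column]
        return self._species[key]            # new style: cat['data']['NZ']


cat = ToyMultiSpecies(data={'NZ': [0.1]}, randoms={'NZ': [0.2]})
cat['data']['Weight'] = [1.0]                # set via bracket chaining
assert cat['data/NZ'] == [0.1]               # slash lookup still works
assert cat['data']['Weight'] == [1.0]
```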
re-working FKP and ConvolvedPower to support cross-correlations
Currently, only cross-correlations with different weights are supported
(same data/randoms).

Things are a lot trickier if we want to support cross-correlations for
overlapping tracers. It seems like the thing to do in that case is to compute the
density field of a weighted combination of tracers, e.g., arXiv:1606.03435.

@nickhand nickhand requested a review from rainwoodman Sep 14, 2017

@nickhand


Member

nickhand commented Sep 14, 2017

There's still a lot to be cleaned up here... in particular, I wasn't sure how best to initialize the MeshSource when creating a new CatalogMesh via __new__. Thoughts, @rainwoodman?

self._actions = []
self.base = None
self.actions.extend(getattr(obj, 'actions', []))


@rainwoodman

rainwoodman Sep 14, 2017

Member

This is heading in the direction of having 'actions' as part of the dask graph...


@nickhand

nickhand Sep 14, 2017

Member

How so? You can't have actions applied after the r2c that way?


@rainwoodman

rainwoodman Sep 14, 2017

Member

More or less like dask.delayed?
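The deferred-"actions" idea being discussed can be sketched like this (hypothetical class, not the nbodykit API): record operations lazily and only apply them at compute time, in the spirit of dask.delayed.

```python
class LazyMesh:
    def __init__(self):
        self.actions = []    # functions applied, in order, at compute time

    def apply(self, func):
        self.actions.append(func)
        return self          # allow chaining

    def compute(self, value):
        # replay the recorded actions only when a result is requested
        for func in self.actions:
            value = func(value)
        return value


mesh = LazyMesh().apply(lambda v: v + 1).apply(lambda v: v * 2)
assert mesh.compute(3) == 8  # (3 + 1) * 2
```

Whether such actions could also run after an r2c transform, as questioned above, depends on where they sit in the graph; this sketch only captures the lazy-recording part.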

@@ -359,3 +371,28 @@ def save(self, output, dataset='Field', mode='real'):
bb.attrs[key] = json_str
except:
warnings.warn("attribute %s of type %s is unsupported and lost while saving MeshSource" % (key, type(value)))
def initialize_pm(self):


@rainwoodman

rainwoodman Sep 14, 2017

Member

This is not a member of the class?


@nickhand

nickhand Sep 14, 2017

Member

It's not right now, but it could be. This is part of the issue I was having with MeshSource. Ideally, we should probably have some sort of mix-in MeshSource class for CatalogMesh that doesn't need to be initialized, but we need to work around initializing the pm and using the vector forms of Nmesh and BoxSize from the pm attribute.

first = self.first[name]
second = self.second[name]
# the selection (same for first/second)
sel = self.first[name][self.first.selection]


@rainwoodman

rainwoodman Sep 14, 2017

Member

Is self.first.selection array-like?


@nickhand

nickhand Sep 14, 2017

Member

sel is a dask array. self.first.selection is the name of the selection column, and self.first[name] is a CatalogMesh holding the data or randoms species

Return a hard-coded column
"""
return self.source.get_hardcolumn(col)
def __finalize__(self, obj):


@rainwoodman

rainwoodman Sep 14, 2017

Member

Why not use __init__? Another possibility is to use __setstate__.

I think if __new__ exists you have to explicitly trigger __init__ anyway... (I may be wrong.)


@nickhand

nickhand Sep 14, 2017

Member

obj here can be a CatalogSource or a CatalogMesh, so I'm not sure __init__ would work. I guess this is sort of a __setstate__.

CatalogMesh.__finalize__ will get called in different situations:

  1. when slicing a CatalogMesh to get a subset

return get_catalog_subset(self, sel)._view(self.__class__)

  2. when initializing a CatalogMesh from a CatalogSource

obj = source._view(cls)


@rainwoodman

rainwoodman Sep 14, 2017

Member

I guess what confuses me is that the function is defined by the occasions on which it is called, rather than by its purpose -- thus it starts to feel like we are building a framework.

If I read the function, its purpose is to copy Python attributes. Would it work if we simply copied the __dict__ of the object instead?


@nickhand

nickhand Sep 14, 2017

Member

The logic is the same as numpy's __array_finalize__. In this case, copying __dict__ won't work because when you are going from CatalogSource to CatalogMesh, self in __finalize__(self, obj) is a bare object, and you need to add the default CatalogMesh attributes to it.
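The pattern being described can be sketched as follows (hypothetical classes and attribute names, not the nbodykit code): the target may be a bare instance from __new__, so class defaults are set first, and only then is the source object's state copied over.

```python
class ToyCatalog:
    _defaults = {'attrs': None}

    def __finalize__(self, obj):
        # self may be a bare instance from __new__, so fill in the class
        # defaults first, then copy over whatever obj carries
        for name, value in self._defaults.items():
            setattr(self, name, value)
        self.__dict__.update(getattr(obj, '__dict__', {}))
        return self

    def _view(self, cls):
        # bypass cls.__init__ entirely, then attach state via __finalize__
        return cls.__finalize__(cls.__new__(cls), self)


class ToyMesh(ToyCatalog):
    # extra default a plain ToyCatalog does not carry
    _defaults = dict(ToyCatalog._defaults, actions=())


src = ToyCatalog()
src.attrs = {'BoxSize': 100.0}
mesh = src._view(ToyMesh)
assert mesh.attrs == {'BoxSize': 100.0}   # copied from the source
assert mesh.actions == ()                 # default filled in by __finalize__
```

This shows why copying __dict__ alone is not enough: the source has no ``actions`` attribute, so the view must supply its own defaults before the copy.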

@nickhand


Member

nickhand commented Sep 17, 2017

Okay, I re-worked some of the logic so it makes more sense now. __finalize__ just attaches attributes from one object to another, which allows us to return slices of the same type and to implement CatalogMesh as a view of CatalogSource.

Let me know what you think, @rainwoodman -- I think this should be ready to merge soon.

@nickhand nickhand referenced this pull request Sep 17, 2017

Open

Slicing mesh objects. #367

@@ -201,13 +313,13 @@ def test_with_zhist(comm):
# add n(z) from randoms to the FKP source
nofz = InterpolatedUnivariateSpline(zhist.bin_centers, zhist.nbar)
fkp['randoms/NZ'] = nofz(randoms['z'])
fkp['data/NZ'] = nofz(data['z'])
fkp['randoms']['NZ'] = nofz(randoms['z'])


@rainwoodman

rainwoodman Sep 17, 2017

Member

Does this mean the slash syntax is replaced by more brackets?


@nickhand

nickhand Sep 17, 2017

Member

Both are supported now and give identical results -- sometimes it feels easier to index with the species name first, when the species name is a variable.

@rainwoodman


Member

rainwoodman commented Sep 17, 2017

Yes, I think this is a very big PR and headed in a good direction. Do we still support the slash syntax for looking up sub-catalogs in a multi-species catalog? I noticed you seem to have changed the syntax to [][].

@nickhand


Member

nickhand commented Sep 18, 2017

Okay I am going to go ahead and merge this

@nickhand nickhand merged commit 43867d5 into master Sep 18, 2017

3 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed
coverage/coveralls Coverage decreased (-0.1%) to 95.267%

@nickhand nickhand deleted the catalogmesh-views branch Sep 18, 2017

@rainwoodman


Member

rainwoodman commented on nbodykit/base/catalogmesh.py in 4e41aab Nov 10, 2017

This and line 211 are causing errors when the data set doesn't fit into memory -- do you remember anything about them?


Member

nickhand replied Nov 10, 2017

I think this is a remnant of when the dask optimizations were added. I suspect there are issues with the dask engine not doing what we want in terms of selection... I think we can revert to the old logic if we need to.


Member

rainwoodman replied Nov 10, 2017

OK I'll file a PR.
