-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Implement CatalogMesh as a view of CatalogSource #393
Conversation
- current global variables are the dask cache size and dask chunk size - increase dask chunk size to 5e6 from 1e5. This appears to help with overhead for large data sets and should still be small enough to fit into memory
This builds on the idea that the size of catalog is fixed after a certain point in the dask task graph. So when we perform a selection, we can insert the selection task directly after the point in the graph when the size last changed. The subsequent operations are only performed on the selected data. This saves a good amount of time -- for example, computing "Position" on the randoms catalog if slicing only to a single redshift bin
This reverts commit 17795fe.
getitem, setitem, and compute call the underlying catalog (stored as the ``source`` attribute). Now, when columns are added to a mesh, the mesh behaves like a view, and the column is in both the mesh and the catalog source.
set using ``catalog[species_name][column_name] = column``, rather than ``catalog[species_name/column_name]`` Internally, we store a copy of the original sources, and indexing the catalog with the species name, returns these copied sources
Currently, only cross correlations with different weights are supported (same data/randoms). Things are a lot trickier if we want to support cross-correlations for overlapping tracers. Seems like the thing to do in that case is compute density field of weighted combination of tracers, e.g., 1606.03435
There's still a lot to be cleaned up here...in particular, I wasn't sure how to best init the MeshSource when initializing a new CatalogMesh from new. Thoughts, @rainwoodman ? |
FYI, from 57ed564 onwards are the new commits implementing the view... |
nbodykit/base/mesh.py
Outdated
|
||
self._actions = [] | ||
self.base = None | ||
self.actions.extend(getattr(obj, 'actions', [])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is heading towards the direction of having 'actions' as a part of dask diagram..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How so? You can't have actions applied after the r2c
that way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More or less like dask.delayed?
nbodykit/base/mesh.py
Outdated
@@ -359,3 +371,28 @@ def save(self, output, dataset='Field', mode='real'): | |||
bb.attrs[key] = json_str | |||
except: | |||
warnings.warn("attribute %s of type %s is unsupported and lost while saving MeshSource" % (key, type(value))) | |||
|
|||
|
|||
def initialize_pm(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a member of the class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not right now but it could be. But this is part of the issues I was having with MeshSource. Ideally, we should probably have some sort of a mix-in MeshSource class to CatalogMesh that doesn't need to be initialized, but we need to work around the initializing pm
and using the vector forms of Nmesh
and BoxSize
from the pm
attribute
nbodykit/algorithms/convpower.py
Outdated
first = self.first[name] | ||
second = self.second[name] | ||
# the selection (same for first/second) | ||
sel = self.first[name][self.first.selection] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is self.first.selection an array like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sel
is a dask array. self.first.selection
is the name of the selection column, and self.first[name]
is a CatalogMesh holding the data or randoms species
nbodykit/base/catalogmesh.py
Outdated
Return a hard-coded column | ||
""" | ||
return self.source.get_hardcolumn(col) | ||
def __finalize__(self, obj): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using init? Another possibilty is to use setstate
I think if new exists you have to explicitly trigger init anyways.. (I may be wrong).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obj
here can be a CatalogSource or a CatalogMesh, so I'm not sure __init__
would work. I guess this is sort of a setstate
.
CatalogMesh.__finalize__
will get called in different situations:
- when slicing a CatalogMesh to get a subset
nbodykit/nbodykit/base/catalog.py
Line 434 in 884f57c
return get_catalog_subset(self, sel)._view(self.__class__) |
- when initializing a CatalogMesh from a CatalogSource
nbodykit/nbodykit/base/catalogmesh.py
Line 53 in 884f57c
obj = source._view(cls) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess what makes me confused is the function is defined by the occasion it is called, but not by its purpose -- thus it starts to feel like we are building a framework.
If I read the function, the purpose of the function is to copy Python attributes? would it work if we simply copy the __dict__
of the object instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is the same as numpy's __array_finalize__
. In this case, copying __dict__
won't work because when you are going from CatalogSource to CatalogMesh, self
in __finalize__(self, obj)
is a bare object, and you need to add the default CatalogMesh attributes to it.
Okay I re-worked some of the logic to make more sense now. Let me know what you think @rainwoodman -- I think this is ready for merging soon |
@@ -201,13 +313,13 @@ def test_with_zhist(comm): | |||
|
|||
# add n(z) from randoms to the FKP source | |||
nofz = InterpolatedUnivariateSpline(zhist.bin_centers, zhist.nbar) | |||
fkp['randoms/NZ'] = nofz(randoms['z']) | |||
fkp['data/NZ'] = nofz(data['z']) | |||
fkp['randoms']['NZ'] = nofz(randoms['z']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean the slash syntax is replaced by more brakets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both are supported now and give identical results -- sometimes it feels easier to slice with the species name first when the species name is a variable
Yes I think this is a very big PR and in a good direction. Do we still support the slash syntax looking up sub-catalogs in a multispecies catalog? I noticed you seem to have changed the syntax to |
Okay I am going to go ahead and merge this |
The old CatalogMesh was unclear about who owned the columns, with unusual side-effects when adding columns directly to a mesh. This PR aims to implement CatalogMesh as a view of a CatalogSource object, similar to the numpy "view". This also implements slices of CatalogSource objects as views as well.
Also: