
WIP: Implement CatalogMesh as a view of CatalogSource #393

Merged
merged 45 commits into master from catalogmesh-views on Sep 18, 2017

Conversation

nickhand
Member

@nickhand nickhand commented Sep 14, 2017

The old CatalogMesh was unclear about who owned the columns, with unusual side effects when adding columns directly to a mesh. This PR implements CatalogMesh as a view of a CatalogSource object, similar to a numpy "view". It also implements slices of CatalogSource objects as views.

Also:

  • added functionality to ConvolvedFFTPower for cross-correlations
  • includes the Dask optimizations from #392 because some of the new functionality depends on them

- the current global variables are the dask cache size and the dask chunk size
- increase the dask chunk size to 5e6 from 1e5; this appears to help with overhead for large data sets and should still be small enough to fit into memory (a rough usage sketch is below)
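A hedged sketch of how these globals might be adjusted. It assumes a ``set_options``-style helper at the top level of nbodykit; the exact option names are an assumption, not something stated in this PR.

```python
# Illustrative only: the option names below are assumptions.
import nbodykit

# apply the larger chunk size globally
nbodykit.set_options(dask_chunk_size=5000000)

# or override it temporarily as a context manager
with nbodykit.set_options(dask_chunk_size=100000):
    pass  # operations here see the smaller chunk size
```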
This builds on the idea that the size of the catalog is fixed after a certain point in the dask task graph. So when we perform a selection, we can insert the selection task directly after the point in the graph where the size last changed; the subsequent operations are then only performed on the selected data.

This saves a good amount of time -- for example, when computing "Position" on the randoms catalog after slicing to a single redshift bin.
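A minimal, pure-dask sketch of the idea (not nbodykit code; the array, the redshift cut, and the factor-of-1000 "transform" are placeholders): selecting before the expensive operation gives the same result, but the task graph only evaluates the transform on the surviving rows.

```python
import numpy as np
import dask.array as da

# fake (ra, dec, z) catalog as a dask array
radecz = da.from_array(np.random.uniform(size=(100000, 3)), chunks=10000)

# selection to a single redshift bin (a boolean dask array)
zsel = (radecz[:, 2] > 0.4) & (radecz[:, 2] < 0.6)

# naive order: transform everything, then select
position_naive = (radecz * 1000.0)[zsel]   # stand-in for an expensive sky -> Cartesian transform

# optimized order: select first, transform only the selected rows
position_opt = radecz[zsel] * 1000.0

# identical results, but the second graph does far less work
assert np.allclose(position_naive.compute(), position_opt.compute())
```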
``__getitem__``, ``__setitem__``, and ``compute`` call through to the underlying catalog (stored as the ``source`` attribute).

Now, when columns are added to a mesh, the mesh behaves like a view, and the column ends up in both the mesh and the catalog source (see the usage sketch below). Columns of a multi-species catalog are set using ``catalog[species_name][column_name] = column``, rather than ``catalog['species_name/column_name']``.
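A hedged usage sketch of the view behavior (it assumes the ``UniformCatalog``/``to_mesh`` API and uses an illustrative column name; it is not code from this PR):

```python
from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=1e-3, BoxSize=512., seed=42)
mesh = cat.to_mesh(Nmesh=64)

# adding a column through the mesh view...
mesh['NewWeight'] = cat['Position'][:, 0] / 512.

# ...makes it visible on the underlying CatalogSource as well
assert 'NewWeight' in cat.columns
```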

Internally, we store a copy of the original sources, and indexing the catalog with the species name returns these copied sources. Currently, only cross-correlations with different weights are supported (same data/randoms).

Things are a lot trickier if we want to support cross-correlations of overlapping tracers. It seems like the right approach in that case is to compute the density field of a weighted combination of the tracers, e.g., arXiv:1606.03435 (rough sketch below).
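A rough sketch of the "weighted combination of tracers" idea (following the spirit of arXiv:1606.03435; the fields and weights below are placeholders, not this PR's implementation):

```python
import numpy as np

# stand-ins for the painted overdensity fields of two overlapping tracers
delta1 = np.random.normal(size=(64, 64, 64))
delta2 = np.random.normal(size=(64, 64, 64))

# combine the tracers into a single density field with (placeholder) weights
w1, w2 = 0.7, 0.3
delta_comb = w1 * delta1 + w2 * delta2
```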
@nickhand
Member Author

There's still a lot to be cleaned up here... in particular, I wasn't sure how best to initialize the MeshSource when creating a new CatalogMesh via __new__. Thoughts, @rainwoodman?

@nickhand
Member Author

FYI, from 57ed564 onwards are the new commits implementing the view...


self._actions = []
self.base = None
self.actions.extend(getattr(obj, 'actions', []))
Member

This is heading in the direction of having 'actions' be part of the dask graph...

Member Author

How so? You can't have actions applied after the r2c that way?

Member

More or less like dask.delayed?
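A rough illustration of what that might look like, chaining deferred mesh actions with dask.delayed (purely illustrative; this is not the actual MeshSource machinery):

```python
import numpy as np
import dask

@dask.delayed
def paint(positions):
    # stand-in for painting particles onto a mesh
    return np.histogramdd(positions, bins=8)[0]

@dask.delayed
def apply_action(field, func):
    # a deferred "action" applied to the painted field
    return func(field)

positions = np.random.uniform(0., 1., size=(1000, 3))
field = paint(positions)
field = apply_action(field, lambda x: x / x.mean() - 1.)  # e.g. convert counts to overdensity

result = field.compute()
```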

@@ -359,3 +371,28 @@ def save(self, output, dataset='Field', mode='real'):
            bb.attrs[key] = json_str
        except:
            warnings.warn("attribute %s of type %s is unsupported and lost while saving MeshSource" % (key, type(value)))


def initialize_pm(self):
Member

This is not a member of the class?

Member Author

It's not right now, but it could be. This is part of the issue I was having with MeshSource. Ideally, we should probably have some sort of MeshSource mix-in class for CatalogMesh that doesn't need to be initialized, but we need to work around initializing the pm and using the vector forms of Nmesh and BoxSize from the pm attribute.

first = self.first[name]
second = self.second[name]
# the selection (same for first/second)
sel = self.first[name][self.first.selection]
Member

Is self.first.selection array-like?

Member Author

sel is a dask array. self.first.selection is the name of the selection column, and self.first[name] is a CatalogMesh holding the data or randoms species

Return a hard-coded column
"""
return self.source.get_hardcolumn(col)
def __finalize__(self, obj):
Member

Why not use __init__? Another possibility is to use __setstate__.

I think if __new__ exists you have to explicitly trigger __init__ anyway... (I may be wrong).

Member Author

obj here can be a CatalogSource or a CatalogMesh, so I'm not sure __init__ would work. I guess this is sort of a __setstate__.

CatalogMesh.__finalize__ will get called in different situations:

  1. when slicing a CatalogMesh to get a subset:

     return get_catalog_subset(self, sel)._view(self.__class__)

  2. when initializing a CatalogMesh from a CatalogSource:

     obj = source._view(cls)

Member

I guess what confuses me is that the function is defined by the occasions it is called on, rather than by its purpose -- it starts to feel like we are building a framework.

Reading the function, its purpose seems to be to copy Python attributes. Would it work if we simply copied the __dict__ of the object instead?

Member Author

The logic is the same as numpy's __array_finalize__. In this case, copying __dict__ won't work because when you are going from CatalogSource to CatalogMesh, self in __finalize__(self, obj) is a bare object, and you need to add the default CatalogMesh attributes to it.
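A minimal sketch of the __finalize__ / _view pattern being described, mirroring numpy's __array_finalize__ (the class names and default attributes are simplified stand-ins, not the actual nbodykit implementation):

```python
class Source(object):
    def __init__(self, size):
        self.size = size
        self.attrs = {'seed': 42}

    def _view(self, cls):
        # make a bare instance of ``cls`` (no __init__) and let it
        # finish constructing itself from ``self``
        obj = object.__new__(cls)
        return obj.__finalize__(self)


class Mesh(Source):
    def __finalize__(self, obj):
        # copy attributes over from the parent object...
        self.__dict__.update(getattr(obj, '__dict__', {}))
        self.source = obj
        # ...then add the defaults that a bare object is missing
        if not hasattr(self, 'actions'):
            self.actions = []
        return self


cat = Source(size=100)
mesh = cat._view(Mesh)  # a Mesh view sharing the attributes of ``cat``
assert mesh.size == 100 and mesh.source is cat
```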

@nickhand
Member Author

Okay, I reworked some of the logic to make more sense now. __finalize__ just attaches attributes from one object to another, which lets us return slices of the same type and implement CatalogMesh as a view of CatalogSource.

Let me know what you think @rainwoodman -- I think this is ready for merging soon

@@ -201,13 +313,13 @@ def test_with_zhist(comm):

     # add n(z) from randoms to the FKP source
     nofz = InterpolatedUnivariateSpline(zhist.bin_centers, zhist.nbar)
-    fkp['randoms/NZ'] = nofz(randoms['z'])
-    fkp['data/NZ'] = nofz(data['z'])
+    fkp['randoms']['NZ'] = nofz(randoms['z'])
Member

Does this mean the slash syntax is replaced by more brackets?

Member Author

Both are supported now and give identical results -- sometimes it feels easier to index with the species name first, when the species name is a variable (see the sketch below).
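A hedged usage sketch of the two equivalent syntaxes (it assumes the FKPCatalog/UniformCatalog API; the 'NZ' values here are illustrative constants rather than a real n(z)):

```python
import numpy
from nbodykit.lab import UniformCatalog, FKPCatalog

data = UniformCatalog(nbar=1e-4, BoxSize=1000., seed=42)
randoms = UniformCatalog(nbar=1e-3, BoxSize=1000., seed=84)
fkp = FKPCatalog(data, randoms)

# slash syntax (still supported)...
fkp['data/NZ'] = numpy.full(data.size, 1e-4)

# ...is equivalent to indexing the species first, which is handy
# when the species name lives in a variable
for species in ['data', 'randoms']:
    fkp[species]['NZ'] = numpy.full(fkp[species].size, 1e-4)
```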

@rainwoodman
Member

Yes, I think this is a very big PR and in a good direction. Do we still support the slash syntax for looking up sub-catalogs in a multi-species catalog? I noticed you seem to have changed the syntax to [][].

@nickhand
Member Author

Okay I am going to go ahead and merge this

@nickhand nickhand merged commit 43867d5 into master Sep 18, 2017
@nickhand nickhand deleted the catalogmesh-views branch September 18, 2017 16:09