Pruning computed dendrograms. #111
Conversation
```python
tests = [pruning.min_delta(min_delta),
         pruning.min_npix(min_npix)]
if is_independent is not None:
    if hasattr(is_independent, '__iter__'):
```
I think it's more idiomatic to test isinstance(foo, collections.Iterable): http://stackoverflow.com/a/4668647
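A minimal sketch of the suggested check (the helper name `as_test_list` is hypothetical, not from the PR; on modern Python the ABC lives in `collections.abc`, while `collections.Iterable` was the spelling at the time):

```python
from collections.abc import Iterable  # `collections.Iterable` before Python 3.10

def as_test_list(is_independent):
    # Hypothetical helper: accept either a single callable or an
    # iterable of callables, like the hasattr(..., '__iter__') check above.
    if isinstance(is_independent, Iterable):
        return list(is_independent)
    return [is_independent]
```

The `isinstance` form reads as a type question rather than an attribute probe, which is the idiom the linked Stack Overflow answer recommends.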
"I'm running into issues with how to ensure merged features meet the pruning requirement, while keeping the tree connected." My approach in the past has been something like:

This always keeps the tree connected (and looks similar to what you've done here). Can you elaborate on what particular problem you're running into?
Ah, ok. I've been merging the child into the parent instead of the sibling. If a leaf has no siblings, would it then be merged into its parent branch? The issue I thought I had was redefining the values where the branches split; after looking at it again, I don't think this is a problem. I think this pruning just isn't doing the same thing as the pruning in compute, which is leading to the discrepancies. Also, the pruning needs to be iterated to remove branches that are turned into leaves in the first passes. This gives something that looks like this: it's similar to the recomputed one, but missing portions of the branches.
This is the scheme I'm currently using:

For the leaves:
For the branches:
@ChrisBeaumont, do you see any glaring omissions?
There are some parts of your description that I either don't understand or think should be different:

- if a leaf fails, merge it into its sibling. If there are no siblings, merge it into the parent
- if the branch has only one child, merge it into its parent

More generally, I don't think you should ever prune a branch directly -- I think you should iteratively prune a leaf (which often turns a branch into a leaf, which might itself be pruned in a future iteration) until there are no more leaves to prune. Does that make sense?

Once the basic approach stabilizes, we will also need to add some tests. Those tests will be straightforward -- I think it's a reasonable requirement that any pruning parameters passed to post_prune have the same effect as if those parameters were passed during initial dendrogram construction. Thus, the tests will build two dendrograms, one pruned at compute time and one post_pruned, and then assert that they have the same structure.

Thanks for tackling this, by the way. I think this will be a useful addition.
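The iterate-until-stable scheme described above can be sketched on a toy tree. Nodes here are plain dicts and the merge rule (heights add) is a stand-in for the real pixel-level merging in astrodendro; none of these names come from the PR:

```python
def is_leaf(node):
    return not node["children"]

def find_failing_leaf(root, min_height):
    """Return a (parent, leaf) pair for some leaf failing the height
    test, or None if every leaf passes."""
    stack = [(None, root)]
    while stack:
        parent, node = stack.pop()
        if parent is not None and is_leaf(node) and node["height"] < min_height:
            return parent, node
        stack.extend((node, child) for child in node["children"])
    return None

def prune(root, min_height):
    """Iteratively prune failing leaves; a branch reduced to a single
    child absorbs it, and may itself be pruned on a later pass."""
    while True:
        hit = find_failing_leaf(root, min_height)
        if hit is None:
            return root
        parent, leaf = hit
        parent["children"].remove(leaf)
        if len(parent["children"]) == 1:
            # toy merge rule: heights add (real dendrograms merge pixels)
            only = parent["children"][0]
            parent["height"] += only["height"]
            parent["children"] = only["children"]
```

Because only leaves are ever removed, the tree stays connected, and each removal can expose a new prunable leaf on the next pass.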
I've made the changes and things are looking fairly good. I removed re-assigning the idx of the structures, as this was leading to inconsistencies when comparing to the index map. I kind of like the idea of keeping the same idx for structures after pruning; it would make it easier to keep track of individual structures. Let me know if you think they should be re-labelled, though.

I think there's an issue in how compute does the pruning, though. There are leaves in the dendrogram computed with the pruning conditions enabled (I'm using min_delta=1.0) that don't satisfy the conditions.

```python
d = Dendrogram.compute(array)
d.post_pruning(min_delta=1.0)
```

does not give the same output as

```python
d = Dendrogram.compute(array, min_delta=1.0)
```

BUT, then running `d.post_pruning(min_delta=1.0)` on top of that does.
Yeah, this is an interesting question. On the one hand, it's appealing to retain index numbers for comparing pre- and post-pruned dendrograms. On the other hand, it's tempting for client code to do things like allocate arrays with length equal to the number of structures and use a structure index number to address locations in such an array; that breaks when index numbers aren't a tightly-packed sequence, and could lead to subtle bugs. Any thoughts, @astrofrog / @keflavich / others?

We should also track down why compute and post_pruning aren't giving the same answer. Is it easy for you to add a (failing) test that exposes the inconsistency? Then I can try to dig into that.
Would it be overly complicated to include both a "permanent" ID and a "tight"/index ID, so that users can choose their use case? All pruning processes would alter the "tight" ID but would leave the "permanent" ID intact.

When I previously ran into similar problems with the connection between catalog indexing and `_idx` IDs, one workaround I had was a piece of code that looked like this:

```python
idx_index_map = {}
# assumes catalog data and the list of structures have the same ordering
for i, struct in enumerate(structures):
    idx_index_map[struct.idx] = i

for struct in structures:
    child_index = idx_index_map[struct.idx]
    parent_index = idx_index_map[struct.parent.idx]
    # do something like plot lines between child & parent catalog properties...
```
I'll give writing the test a crack. This appears only to be an issue with min_delta.
I added the test using the testing data set in the package, but I'm having trouble reproducing the min_delta problem I'm encountering. The testing data passes the pruning tests, while the data I've been using doesn't. I've tried this with sim data (a PPV cube and an integrated intensity image). For the comparison, I'm checking that the dendrograms have the same number of features and that the features have the same vmax and vmin (sorted by max). Without redefining the idx's, direct comparison of the trees is a bit cumbersome.
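The comparison described above (same number of features, matching vmax/vmin after sorting by peak) can be written as a small helper. The dict-based structures here are stand-ins for astrodendro structures, since idx values are not preserved across pruning:

```python
def same_features(structs_a, structs_b):
    # Compare two sets of dendrogram features by count and by their
    # (vmax, vmin) pairs after sorting on the peak value.
    if len(structs_a) != len(structs_b):
        return False
    key = lambda s: s["vmax"]
    return all(a["vmax"] == b["vmax"] and a["vmin"] == b["vmin"]
               for a, b in zip(sorted(structs_a, key=key),
                               sorted(structs_b, key=key)))
```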
Hmm, would it be easy for you to email or post the data and script you are using that triggers the issue? (cbeaumont [at] cfa [dot] harvard [dot] edu)
Thanks, I got your email. This was an issue with how min_delta was computed. I'll push a fix.
Tweak post prune logic, min_delta criteria
```python
# Merge structures into the parent
for m in merge:
    # XXX extract into a helper function?
```
To simplify post_pruning a bit, I think the inside of this loop should be replaced with:

```python
_merge_with_parent(m, parent, self.index_map)
del keep_structures[m.idx]
```

where `_merge_with_parent` is a helper function containing the rest of this code.
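A minimal sketch of what such a helper might do, using a NumPy index map and a plain dict in place of the real bookkeeping (the names and merge mechanics here are assumptions, not the PR's actual implementation):

```python
import numpy as np

def merge_with_parent(m_idx, parent_idx, index_map, keep_structures):
    # Relabel the pruned structure's pixels as the parent's, then
    # drop it from the dict of surviving structures.
    index_map[index_map == m_idx] = parent_idx
    del keep_structures[m_idx]

# Toy 2x3 index map: structure 2 is a leaf being merged into branch 0.
index_map = np.array([[0, 0, 2],
                      [1, 2, 2]])
keep = {0: "branch", 1: "leaf-a", 2: "leaf-b"}
merge_with_parent(2, 0, index_map, keep)
```

Keeping the relabelling and the bookkeeping update in one function is exactly the simplification the review comment is after.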
```python
keep_structures = self._structures_dict.copy()
```

```python
# Continue until there are no more leaves to prune.
while True:
```
On second thought, this while loop should be replaced with a for loop and a helper generator:

```python
for struct in _to_prune(self, keep_structures, is_independent):
    <lines 553-580>
```

```python
def _to_prune(dendrogram, keep_structures, is_independent):
    while True:
        for struct in dendrogram.all_structures:
            <lines 537-553>
            yield struct
            break
        else:
            return
```
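The while/for/else shape above is a general "restart after each yield" pattern. A self-contained sketch, with a toy predicate standing in for the real independence tests:

```python
def to_prune(structures, fails):
    """Yield structures failing the test, restarting the scan after
    each yield because pruning can reshape the collection."""
    while True:
        for struct in structures:
            if fails(struct):
                yield struct
                break  # collection may have changed; rescan from the top
        else:
            return  # a full pass found nothing left to prune

# Toy usage: remove "failing" items while iterating.
items = [4, 1, 3, 0, 2]
pruned = []
for s in to_prune(items, lambda x: x < 2):
    items.remove(s)
    pruned.append(s)
# pruned == [1, 0]; items == [4, 3, 2]
```

The `break`/`else` pair is what makes mutation during iteration safe: the inner loop never continues over a collection that has just changed underneath it.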
This is looking good to me. I've made a bunch of style suggestions to split up the post_pruning method into simpler pieces. If you could make one more pass at addressing some of these, I'm happy to merge.
One other thought -- should
I've made your suggested style changes and changed the name to

Two more thoughts:
I wouldn't bother with min_value, since it's not so much a pruning parameter as a preprocessing step. Adding a warnings.warn() message if the pruning parameters are less restrictive than the original inputs sounds like a good idea.
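Such a check might look like the following sketch (the function name and the 0-means-unspecified convention are assumptions; the convention matches the later comment about skipping the warning when a parameter isn't specified):

```python
import warnings

def check_prune_param(name, new_value, original_value):
    # Hypothetical check: warn if a post-pruning parameter is less
    # restrictive than the value the dendrogram was computed with,
    # since already-pruned structures cannot be restored. A value of
    # 0 means "not specified" and is skipped.
    if new_value != 0 and new_value < original_value:
        warnings.warn("%s=%g is less restrictive than the value used "
                      "at compute time (%g); pruned structures cannot "
                      "be recovered" % (name, new_value, original_value))
```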
```python
def _to_prune(dendrogram, keep_structures, is_independent):
    '''
    Yields an iterator which returns leaves which need to be pruned.
```
"Yields a sequence of leaves which need to be pruned" is slightly better terminology.
I made those small changes and added warnings for the pruning parameters. The parameters also now get updated in
```python
# Merge structures into the parent
for m in merge:
    _merge_with_parent(m, parent, self.index_map)
```
The parent needs to be removed here.
I made a couple of small changes to avoid the warnings if the parameter isn't specified (i.e., equal to 0).
Looks great to me! Assuming the Travis tests pass, I'm happy to merge this.
Great! Thanks for all the help!
Thanks again!
Hi all,
I need to run large numbers of dendrograms, varying min_delta for each (Figure 4 in Burkhart et al. 2013; http://iopscience.iop.org/0004-637X/770/2/141/pdf/apj_770_2_141.pdf). It would be nice to be able to prune out features without having to recompute the whole dendrogram each time.
I've attempted to code this up, re-using the end stages of the compute function. Right now, it prunes off leaves and merges features together. I'm running into issues with how to ensure merged features meet the pruning requirement while keeping the tree connected. Can this be done without re-computing?
Here's a comparison of recomputing (top) to the output of post_pruning (bottom).