
Pruning computed dendrograms. #111

Merged · 23 commits · Jun 14, 2014

Conversation

@e-koch (Contributor) commented Jun 3, 2014

Hi all,
I have the need to run large numbers of dendrograms, varying min_delta for each (Figure 4 in Burkhart et al. 2013; http://iopscience.iop.org/0004-637X/770/2/141/pdf/apj_770_2_141.pdf). It would be nice to be able to prune out features without having to recompute the whole dendrogram each time.

I've attempted to code this up, re-using the end stages of the compute function. Right now, it prunes off leaves and merges features together. I'm running into issues with how to ensure merged features meet the pruning requirement, while keeping the tree connected. Can this be done without re-computing?

Here's a comparison of recomputing (top) to the output of post_pruning (bottom).

[Image: hd22 13co, recomputed]

[Image: hd22 13co, post-pruned]

tests = [pruning.min_delta(min_delta),
         pruning.min_npix(min_npix)]
if is_independent is not None:
    if hasattr(is_independent, '__iter__'):

Inline review comment (Contributor):

I think it's more idiomatic to test isinstance(foo, collections.Iterable): http://stackoverflow.com/a/4668647
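
As a small illustration of the suggested check (the helper name below is hypothetical, and on current Python the ABC lives in collections.abc rather than collections):

from collections.abc import Iterable  # was collections.Iterable when this PR was written

def _as_test_list(is_independent):
    # Hypothetical helper: normalize a single callable or an iterable of
    # callables into a list, using isinstance rather than hasattr.
    if is_independent is None:
        return []
    if isinstance(is_independent, Iterable):
        return list(is_independent)
    return [is_independent]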

@ChrisBeaumont (Contributor):

"I'm running into issues with how to ensure merged features meet the pruning requirement, while keeping the tree connected. "

My approach in the past has been something like:

  • Start with a leaf
  • While the selected structure fails some independence test, merge it with its sibling. Mark the merged structure as the "selected structure"
  • Repeat for all leaves

This always keeps the tree connected (and looks similar to what you've done here). Can you elaborate about what particular problem you're running into?
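
A rough sketch of that loop for the simple case where each branch has exactly two children (the Node class, the height attribute, and the is_independent predicate below are illustrative stand-ins, not astrodendro's actual structures):

class Node:
    # Minimal stand-in for a dendrogram structure.
    def __init__(self, height, parent=None):
        self.height = height
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def merge_into_sibling(node):
    # Fold a failing leaf and its sibling back into the parent,
    # which then becomes the new "selected" (leaf-like) structure.
    parent = node.parent
    parent.height = max(c.height for c in parent.children)
    parent.children = []
    return parent

def prune_leaf(leaf, is_independent):
    selected = leaf
    # While the selected structure fails the independence test, merge upward.
    while selected.parent is not None and not is_independent(selected):
        selected = merge_into_sibling(selected)
    return selected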

@e-koch (Contributor, author) commented Jun 4, 2014

Ah ok. I've been merging the child into the parent instead of the sibling. If a leaf has no siblings, would it then be merged into its parent branch?

The issue I thought I had was redefining the values where the branches split. After looking at it again, I don't think this is a problem. I think this pruning just isn't doing the same as the pruning in compute, which is leading to the discrepancies.

Also, the pruning needs to be iterated to remove branches that are turned into leaves in the first passes. This gives something that looks like this:

[Image: hd22 13co, post-pruned and iterated]

It's similar to the recomputed one, but missing portions of the branches.

@e-koch (Contributor, author) commented Jun 4, 2014

This is the scheme I'm currently using:

For the leaves:

  • check independence tests for all leaves
    
  • if leaf fails, merge into sibling. If there are no siblings, merge into parent.
    

For the branches:

  • if the branch has only one child, merge into its parent
  • if it has more than one, it's left alone
  • continue until all branches satisfy the second point.

@ChrisBeaumont , do you see any glaring omissions?

@ChrisBeaumont (Contributor):

There are some parts of your description that I either don't understand or think should be different:

if leaf fails, merge into sibling. If there are no siblings, merge into parent

  • For the case of a single sibling, if a leaf is pruned then you should merge both the leaf and its sibling into the parent (which becomes a new leaf).
  • Leaves with >1 sibling are trickier. I would vote for merging the pruned leaf's pixels into the parent structure. So, say, a branch that splits into 3 leaves becomes a branch that splits into 2 leaves

if the branch has only one child, merge into its parent
A branch should never have only one child (though I can see how your scheme as it stands can create some).

More generally, I don't think you should ever prune a branch -- I think you should iteratively prune a leaf (which often turns a branch into a leaf, that might be pruned in a future iteration), until there are no more leaves to prune. Does that make sense?

Once the basic approach stabilizes, we will also need to add some tests. Those tests will be straightforward -- I think it's a reasonable requirement that any pruning parameters passed to post_prune will have the same effect as if those parameters were passed during initial dendrogram construction. Thus, the tests will build two dendrograms, one pruned at compute time and one post_pruned, and then assert that they have the same structure.
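
To make the planned test concrete, a minimal sketch along those lines, assuming the prune method name this PR settles on and the per-structure vmin/vmax values mentioned later in the thread (the _signature helper is purely illustrative):

from astrodendro import Dendrogram

def _signature(d):
    # Compare trees by per-structure extrema instead of idx, since the idx
    # numbering is not guaranteed to match between the two dendrograms.
    return sorted((s.vmin, s.vmax) for s in d.all_structures)

def test_prune_matches_compute(data):
    d_compute = Dendrogram.compute(data, min_delta=1.0)
    d_pruned = Dendrogram.compute(data)
    d_pruned.prune(min_delta=1.0)
    assert _signature(d_pruned) == _signature(d_compute)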

Thanks for tackling this, by the way. I think this will be a useful addition.

@e-koch (Contributor, author) commented Jun 9, 2014

I've made the changes and things are looking fairly good. I removed re-assigning the idx of the structures as this was leading to inconsistencies when comparing to the index map. I kind of like the idea of keeping the same idx for structures after pruning. It would make it easier to keep track of individual structures. Let me know if you think they should be re-labelled though.

I think there's an issue in how compute does the pruning though. There are leaves in the dendrogram computed with the pruning conditions enabled (I'm using min_delta=1.0) that don't satisfy the conditions.
So,

d = Dendrogram.compute(array)
d.post_pruning(min_delta=1.0)

does not give the same output as

d = Dendrogram.compute(array, min_delta=1.0)

BUT, then running

d.post_pruning(min_delta=1.0)

does.

@ChrisBeaumont (Contributor):

I removed re-assigning the idx of the structures as this was leading to inconsistencies when comparing to the index map. I kind of like the idea of keeping the same idx for structures after pruning

Yeah, this is an interesting question. On the one hand, it's appealing to retain index numbers for comparing pre- and post-pruned dendrograms. On the other hand, it's tempting for client code to do things like allocate arrays with length equal to the number of structures and use a structure's index number to address locations in that array, which breaks (and can lead to subtle bugs) when the index numbers aren't a tightly packed sequence. Any thoughts, @astrofrog / @keflavich / others?

We should also track down why compute and post_pruning aren't giving the same answer. Is it easy for you to add a (failing) test that exposes the inconsistency? Then I can try to dig into that.

@tomr-stargazer (Contributor):

Would it be overly complicated to include both a "permanent" ID and a "tight/index" ID, so that users can choose their use-case? Such that all pruning processes would alter the "tight" ID but would leave the "permanent" ID intact.

When I previously ran into similar problems with the connection between catalog indexing and _idx IDs, one workaround I had was a piece of code that looked like this:

idx_index_map = {}
for i, struct in enumerate(structures):
    # assumes catalog data and the list of structures have the same ordering
    idx_index_map[struct.idx] = i

for struct in structures:
    child_index = idx_index_map[struct.idx]
    parent_index = idx_index_map[struct.parent.idx]
    # do something like plot lines between child & parent catalog properties...

@e-koch (Contributor, author) commented Jun 9, 2014

I'll give writing the test a crack. This appears to only be an issue with min_delta.

@e-koch (Contributor, author) commented Jun 10, 2014

I added the test using the testing data set in the package, but I'm having trouble reproducing the min_delta problem I'm encountering. The testing data passes the pruning tests, while the data I've been using doesn't. I've tried this with sim data (a PPV cube and an integrated intensity image).

For the comparison, I'm checking that the dendrograms have the same number of features and that the features have the same vmax and vmin (sorted by max). Without redefining the idx's, direct comparison of the trees is a bit cumbersome.

@ChrisBeaumont (Contributor):

hmm, would it be easy for you to email or post the data and script you are using that triggers the issue? (cbeaumont [at] cfa [dot] harvard [dot] edu).

@ChrisBeaumont (Contributor):

Thanks, I got your email.

This was an issue with how min_delta was computed. I'll push a fix.

Chris Beaumont and others added 2 commits June 10, 2014 17:15

# Merge structures into the parent
for m in merge:
    # XXX extract into a helper function?

Inline review comment (Contributor):

To simplify post_pruning a bit, I think the inside of this loop should be replaced with:

_merge_with_parent(m, parent, self.index_map)
del keep_structures[m.idx]

Where _merge_with_parent is a helper function containing the rest of this code
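
As a toy illustration of part of what such a helper does (relabeling the pruned structure's pixels to its parent in the index map; the real helper would also fold the structure's values into the parent), using plain numpy rather than the package's internals:

import numpy as np

def _merge_with_parent_toy(struct_idx, parent_idx, index_map):
    # Reassign every pixel labeled with the pruned structure's idx
    # to the parent's idx.
    index_map[index_map == struct_idx] = parent_idx

index_map = np.array([0, 0, 3, 3, 5, 5, 5])
_merge_with_parent_toy(3, 5, index_map)   # structure 3 is absorbed by parent 5
print(index_map)                          # [0 0 5 5 5 5 5]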

keep_structures = self._structures_dict.copy()

# Continue until there are no more leaves to prune.
while True:

Inline review comment (Contributor):

On second thought, this while loop should be replaced with a for loop and a helper generator:

for struct in _to_prune(self, keep_structures, is_independent):
    <lines 553-580>

def _to_prune(dendrogram, keep_structures, is_independent):
    while True:
        for struct in dendrogram.all_structures:
            <lines 537-553>
            yield struct
            break
        else:
            return
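
To make the suggested shape concrete, here is a self-contained toy version of the same while/for/else generator pattern: rescan the collection from the top after every change, yield the first item that fails a test, and stop once a full pass finds nothing left to prune.

def _failing_items(items, is_ok):
    while True:
        for item in items:
            if not is_ok(item):
                yield item   # the caller mutates `items` before the next scan
                break        # restart the scan, since the collection changed
        else:
            return           # a clean pass: nothing left to prune

values = [3, 1, 4, 1, 5]
for bad in _failing_items(values, lambda x: x >= 2):
    values.remove(bad)       # analogous to pruning one leaf per iteration
print(values)                # -> [3, 4, 5]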

@ChrisBeaumont (Contributor):

This is looking good to me. I've made a bunch of style suggestions, to split up the post_pruning method into simpler pieces. If you could make one more pass at addressing some of these, I'm happy to merge.

@ChrisBeaumont (Contributor):

One other thought -- should post_pruning be renamed prune?

@e-koch (Contributor, author) commented Jun 11, 2014

I've made your suggested style changes and changed the name to prune. It sounds cleaner. @ChrisBeaumont thanks for all the help getting this together!

Two more thoughts:

  • Should we add the ability to change min_value? I think it would need its own "pruning" method, separate from other parameter changes.
  • Is it worth adding warnings if the supplied pruning parameters are less than those used when computing?

@ChrisBeaumont (Contributor):

I wouldn't bother with min_value, since it's not so much a pruning parameter as a preprocessing step.

Adding a warnings.warn() message if the pruning parameters are less restrictive than the original inputs sounds like a good idea.
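
A minimal sketch of that warning, with hypothetical helper and argument names (the PR's actual check also skips parameters that were never specified, as noted below):

import warnings

def _warn_if_less_restrictive(name, new_value, computed_value):
    # Pruning can only remove structures, so a looser parameter than the one
    # used at compute time has no effect.
    if new_value < computed_value:
        warnings.warn("{0}={1} is less restrictive than the value used when "
                      "computing the dendrogram ({2}); structures removed at "
                      "compute time cannot be restored."
                      .format(name, new_value, computed_value))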


def _to_prune(dendrogram, keep_structures, is_independent):
    '''
    Yields an iterator which returns leaves which need to be pruned.

Inline review comment (Contributor):

"Yields a sequence of leaves which need to be pruned" is slightly better terminology

@e-koch (Contributor, author) commented Jun 12, 2014

I made those small changes and added warnings for the pruning parameters. The parameters also now get updated in self.params.


# Merge structures into the parent
for m in merge:
    _merge_with_parent(m, parent, self.index_map)

Inline review comment (Contributor):

Parent needs to be removed here

@e-koch (Contributor, author) commented Jun 13, 2014

I made a couple of small changes to avoid the warnings if the parameter isn't specified (i.e. equal to 0).
@ChrisBeaumont does anything else need to be addressed?

@ChrisBeaumont (Contributor):

Looks great to me! Assuming the Travis tests pass, I'm happy to merge this.

@e-koch (Contributor, author) commented Jun 13, 2014

Great! Thanks for all the help!

ChrisBeaumont pushed a commit that referenced this pull request Jun 14, 2014
Pruning computed dendrograms.
@ChrisBeaumont merged commit 3d636bf into dendrograms:master on Jun 14, 2014
@ChrisBeaumont (Contributor):

Thanks again!
