
Pruning computed dendrograms. #111

Merged · 23 commits · Jun 14, 2014

Conversation

@e-koch (Contributor) commented Jun 3, 2014

Hi all,
I have the need to run large numbers of dendrograms, varying min_delta for each (Figure 4 in Burkhart et al. 2013; http://iopscience.iop.org/0004-637X/770/2/141/pdf/apj_770_2_141.pdf). It would be nice to be able to prune out features without having to recompute the whole dendrogram each time.

I've attempted to code this up, re-using the end stages of the compute function. Right now, it prunes off leaves and merges features together. I'm running into issues with how to ensure merged features meet the pruning requirement, while keeping the tree connected. Can this be done without re-computing?

Here's a comparison of recomputing (top) to the output of post_pruning (bottom).

[Image: hd22 13co, recomputed]

[Image: hd22 13co, post-pruned]

tests = [pruning.min_delta(min_delta),
         pruning.min_npix(min_npix)]
if is_independent is not None:
    if hasattr(is_independent, '__iter__'):

Inline review comment (Contributor):

I think it's more idiomatic to test isinstance(foo, collections.Iterable): http://stackoverflow.com/a/4668647
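
As a small illustration of the suggested check (the helper name below is hypothetical, and on current Python the ABC lives in collections.abc rather than collections):

from collections.abc import Iterable  # was collections.Iterable when this PR was written

def _as_test_list(is_independent):
    # Hypothetical helper: normalize a single callable or an iterable of
    # callables into a list, using isinstance rather than hasattr.
    if is_independent is None:
        return []
    if isinstance(is_independent, Iterable):
        return list(is_independent)
    return [is_independent]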

@ChrisBeaumont (Contributor):

"I'm running into issues with how to ensure merged features meet the pruning requirement, while keeping the tree connected. "

My approach in the past has been something like:

  • Start with a leaf
  • While the selected structure fails some independence test, merge it with its sibling. Mark the merged structure as the "selected structure"
  • Repeat for all leaves

This always keeps the tree connected (and looks similar to what you've done here). Can you elaborate about what particular problem you're running into?
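
A rough sketch of that loop for the simple case where each branch has exactly two children (the Node class, the height attribute, and the is_independent predicate below are illustrative stand-ins, not astrodendro's actual structures):

class Node:
    # Minimal stand-in for a dendrogram structure.
    def __init__(self, height, parent=None):
        self.height = height
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def merge_into_sibling(node):
    # Fold a failing leaf and its sibling back into the parent,
    # which then becomes the new "selected" (leaf-like) structure.
    parent = node.parent
    parent.height = max(c.height for c in parent.children)
    parent.children = []
    return parent

def prune_leaf(leaf, is_independent):
    selected = leaf
    # While the selected structure fails the independence test, merge upward.
    while selected.parent is not None and not is_independent(selected):
        selected = merge_into_sibling(selected)
    return selected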

@e-koch (Contributor, author) commented Jun 4, 2014

Ah ok. I've been merging the child into the parent instead of the sibling. If a leaf has no siblings, would it then be merged into its parent branch?

The issue I thought I had was redefining the values where the branches split. After looking at it again, I don't think this is a problem. I think this pruning just isn't doing the same as the pruning in compute, which is leading to the discrepancies.

Also, the pruning needs to be iterated to remove branches that are turned into leaves in the first passes. This gives something that looks like this:

[Image: hd22 13co, post-pruned and iterated]

It's similar to the recomputed one, but missing portions of the branches.

@e-koch (Contributor, author) commented Jun 4, 2014

This is the scheme I'm currently using:

For the leaves:

  • check independence tests for all leaves
    
  • if leaf fails, merge into sibling. If there are no siblings, merge into parent.
    

For the branches:

  • if the branch has only one child, merge into its parent
  • if it has more than one, it's left alone
  • continue until all branches satisfy the second point.

@ChrisBeaumont , do you see any glaring omissions?

@ChrisBeaumont (Contributor):

There are some parts of your description that I either don't understand or think should be different:

if leaf fails, merge into sibling. If there are no siblings, merge into parent

  • For the case of a single sibling, if a leaf is pruned then you should merge both the leaf and its sibling into the parent (which becomes a new leaf).
  • Leaves with >1 sibling are trickier. I would vote for merging the pruned leaf's pixels into the parent structure. So, say, a branch that splits into 3 leaves becomes a branch that splits into 2 leaves

if the branch has only one child, merge into its parent
A branch should never have only one child (though I can see how your scheme as it stands can create some).

More generally, I don't think you should ever prune a branch -- I think you should iteratively prune a leaf (which often turns a branch into a leaf, that might be pruned in a future iteration), until there are no more leaves to prune. Does that make sense?

Once the basic approach stabilizes, we will also need to add some tests. Those tests will be straightforward -- I think it's a reasonable requirement that any pruning parameters passed to post_prune will have the same effect as if those parameters were passed during initial dendrogram construction. Thus, the tests will build two dendrograms, one pruned at compute time and one post_pruned, and then assert that they have the same structure.
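
To make the planned test concrete, a minimal sketch along those lines, assuming the prune method name this PR settles on and the per-structure vmin/vmax values mentioned later in the thread (the _signature helper is purely illustrative):

from astrodendro import Dendrogram

def _signature(d):
    # Compare trees by per-structure extrema instead of idx, since the idx
    # numbering is not guaranteed to match between the two dendrograms.
    return sorted((s.vmin, s.vmax) for s in d.all_structures)

def test_prune_matches_compute(data):
    d_compute = Dendrogram.compute(data, min_delta=1.0)
    d_pruned = Dendrogram.compute(data)
    d_pruned.prune(min_delta=1.0)
    assert _signature(d_pruned) == _signature(d_compute)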

Thanks for tackling this, by the way. I think this will be a useful addition.

@e-koch (Contributor, author) commented Jun 9, 2014

I've made the changes and things are looking fairly good. I removed re-assigning the idx of the structures as this was leading to inconsistencies when comparing to the index map. I kind of like the idea of keeping the same idx for structures after pruning. It would make it easier to keep track of individual structures. Let me know if you think they should be re-labelled though.

I think there's an issue in how compute does the pruning though. There are leaves in the dendrogram computed with the pruning conditions enabled (I'm using min_delta=1.0) that don't satisfy the conditions.
So,

d = Dendrogram.compute(array)
d.post_pruning(min_delta=1.0)

does not give the same output as

d = Dendrogram.compute(array, min_delta=1.0)

BUT, then running

d.post_pruning(min_delta=1.0)

does.

@ChrisBeaumont (Contributor):

I removed re-assigning the idx of the structures as this was leading to inconsistencies when comparing to the index map. I kind of like the idea of keeping the same idx for structures after pruning

Yeah, this is an interesting question. On the one hand, it's appealing to retain index numbers for comparing pre- and post-pruned dendrograms. On the other hand, it's tempting for client code to do things like allocate arrays with length equal to the number of structures and use a structure's index number to address locations in that array, which breaks (and can lead to subtle bugs) when the index numbers aren't a tightly packed sequence. Any thoughts, @astrofrog / @keflavich / others?

We should also track down why compute and post_pruning aren't giving the same answer. Is it easy for you to add a (failing) test that exposes the inconsistency? Then I can try to dig into that.

@tomr-stargazer (Contributor):

Would it be overly complicated to include both a "permanent" ID and a "tight/index" ID, so that users can choose their use-case? Such that all pruning processes would alter the "tight" ID but would leave the "permanent" ID intact.

When I previously ran into similar problems with the connection between catalog indexing and _idx IDs, one workaround I had was a piece of code that looked like this:

idx_index_map = {}
for i, struct in enumerate(structures):
    # assumes catalog data and the list of structures have the same ordering
    idx_index_map[struct.idx] = i

for struct in structures:
    child_index = idx_index_map[struct.idx]
    parent_index = idx_index_map[struct.parent.idx]
    # do something like plot lines between child & parent catalog properties...

@e-koch (Contributor, author) commented Jun 9, 2014

I'll give writing the test a crack. This appears to only be an issue with min_delta.

@e-koch (Contributor, author) commented Jun 10, 2014

I added the test using the testing data set in the package, but I'm having trouble reproducing the min_delta problem I'm encountering. The testing data passes the pruning tests, while the data I've been using doesn't. I've tried this with sim data (a PPV cube and an integrated intensity image).

For the comparison, I'm checking that the dendrograms have the same number of features and that the features have the same vmax and vmin (sorted by max). Without redefining the idx's, direct comparison of the trees is a bit cumbersome.

@ChrisBeaumont (Contributor):

hmm, would it be easy for you to email or post the data and script you are using that triggers the issue? (cbeaumont [at] cfa [dot] harvard [dot] edu).

@ChrisBeaumont (Contributor):

Thanks, I got your email.

This was an issue with how min_delta was computed. I'll push a fix.

Chris Beaumont and others added 2 commits June 10, 2014 17:15

# Merge structures into the parent
for m in merge:
    # XXX extract into a helper function?

Inline review comment (Contributor):

To simplify post_pruning a bit, I think the inside of this loop should be replaced with:

_merge_with_parent(m, parent, self.index_map)
del keep_structures[m.idx]

Where _merge_with_parent is a helper function containing the rest of this code
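
As a toy illustration of part of what such a helper does (relabeling the pruned structure's pixels to its parent in the index map; the real helper would also fold the structure's values into the parent), using plain numpy rather than the package's internals:

import numpy as np

def _merge_with_parent_toy(struct_idx, parent_idx, index_map):
    # Reassign every pixel labeled with the pruned structure's idx
    # to the parent's idx.
    index_map[index_map == struct_idx] = parent_idx

index_map = np.array([0, 0, 3, 3, 5, 5, 5])
_merge_with_parent_toy(3, 5, index_map)   # structure 3 is absorbed by parent 5
print(index_map)                          # [0 0 5 5 5 5 5]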

keep_structures = self._structures_dict.copy()

# Continue until there are no more leaves to prune.
while True:

Inline review comment (Contributor):

On second thought, this while loop should be replaced with a for loop and a helper generator:

for struct in _to_prune(self, keep_structures, is_independent):
    <lines 553-580>

def _to_prune(dendrogram, keep_structures, is_independent):
    while True:
        for struct in dendrogram.all_structures:
            <lines 537-553>
            yield struct
            break
        else:
            return
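
To make the suggested shape concrete, here is a self-contained toy version of the same while/for/else generator pattern: rescan the collection from the top after every change, yield the first item that fails a test, and stop once a full pass finds nothing left to prune.

def _failing_items(items, is_ok):
    while True:
        for item in items:
            if not is_ok(item):
                yield item   # the caller mutates `items` before the next scan
                break        # restart the scan, since the collection changed
        else:
            return           # a clean pass: nothing left to prune

values = [3, 1, 4, 1, 5]
for bad in _failing_items(values, lambda x: x >= 2):
    values.remove(bad)       # analogous to pruning one leaf per iteration
print(values)                # -> [3, 4, 5]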

@ChrisBeaumont (Contributor):

This is looking good to me. I've made a bunch of style suggestions, to split up the post_pruning method into simpler pieces. If you could make one more pass at addressing some of these, I'm happy to merge.

@ChrisBeaumont (Contributor):

One other thought -- should post_pruning be renamed prune?

@e-koch (Contributor, author) commented Jun 11, 2014

I've made your suggested style changes and changed the name to prune. It sounds cleaner. @ChrisBeaumont thanks for all the help getting this together!

Two more thoughts:

  • Should we add the ability to change min_value? I think it would need its own "pruning" method, separate from other parameter changes.
  • Is it worth adding warnings if the supplied pruning parameters are less than those used when computing?

@ChrisBeaumont (Contributor):

I wouldn't bother with min_value, since it's not so much a pruning parameter as a preprocessing step.

Adding a warnings.warn() message if the pruning parameters are less restrictive than the original inputs sounds like a good idea.
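
A minimal sketch of that warning, with hypothetical helper and argument names (the PR's actual check also skips parameters that were never specified, as noted below):

import warnings

def _warn_if_less_restrictive(name, new_value, computed_value):
    # Pruning can only remove structures, so a looser parameter than the one
    # used at compute time has no effect.
    if new_value < computed_value:
        warnings.warn("{0}={1} is less restrictive than the value used when "
                      "computing the dendrogram ({2}); structures removed at "
                      "compute time cannot be restored."
                      .format(name, new_value, computed_value))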


def _to_prune(dendrogram, keep_structures, is_independent):
    '''
    Yields an iterator which returns leaves which need to be pruned.

Inline review comment (Contributor):

"Yields a sequence of leaves which need to be pruned" is slightly better terminology

@e-koch (Contributor, author) commented Jun 12, 2014

I made those small changes and added warnings for the pruning parameters. The parameters also now get updated in self.params.


# Merge structures into the parent
for m in merge:
    _merge_with_parent(m, parent, self.index_map)

Inline review comment (Contributor):

Parent needs to be removed here

@e-koch (Contributor, author) commented Jun 13, 2014

I made a couple of small changes to avoid the warnings if the parameter isn't specified (i.e. equal to 0).
@ChrisBeaumont does anything else need to be addressed?

@ChrisBeaumont (Contributor):

Looks great to me! Assuming the Travis tests pass, I'm happy to merge this.

@e-koch (Contributor, author) commented Jun 13, 2014

Great! Thanks for all the help!

ChrisBeaumont pushed a commit that referenced this pull request Jun 14, 2014
Pruning computed dendrograms.
@ChrisBeaumont merged commit 3d636bf into dendrograms:master on Jun 14, 2014
@ChrisBeaumont (Contributor):

Thanks again!
