HierarchicalClustering Losing Values #11

Closed
Telofy opened this Issue Dec 6, 2013 · 7 comments


Telofy commented Dec 6, 2013

Hi, and thanks for the library!

I’m losing values though:

In [1]: from cluster import HierarchicalClustering
In [2]: HierarchicalClustering([10, 30, 10, 35], lambda x,y: abs(x-y)).getlevel(10)
Out[2]: [[35], [10, 10, 10]]

I’ll try to understand the algorithm to see where the 30 went, but maybe you’re quicker.

In fluffiness,
Telofy

Telofy commented Dec 6, 2013

The bug occurs when there are two equal values in the list (unless they’re consecutive, so that the second one gets the index of the first when the first is removed).

Instead of this (cluster.py around line 637):

self._data.remove(self._data[max(smallestpair[0],
    smallestpair[1])])  # remove item 1
self._data.remove(self._data[min(smallestpair[0],
    smallestpair[1])])  # remove item 2

Try this:

del self._data[max(smallestpair[0], smallestpair[1])]
del self._data[min(smallestpair[0], smallestpair[1])]

You could also obtain both the values first and only then delete them, but the above seems simpler.
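For illustration (plain Python, independent of cluster.py): `list.remove(x)` deletes the *first* element equal to `x`, so removing by value hits the wrong element whenever the list contains duplicates, while `del` by index is unambiguous:

```python
# list.remove(value) deletes the FIRST element equal to value,
# so "remove the item at index 2" silently removes index 0 instead
# when the two values are equal.
data = [10, 30, 10, 35]
data.remove(data[2])      # intends index 2, removes the 10 at index 0
print(data)               # [30, 10, 35]

# Deleting by index never suffers from this ambiguity.
data = [10, 30, 10, 35]
del data[2]               # removes the 10 at index 2
del data[0]               # removes the 10 at index 0
print(data)               # [30, 35]
```

Note that the `del` statements must run highest index first, as in the proposed fix, so the second deletion's index is still valid.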


exhuma (Owner) commented Dec 6, 2013

Thanks for the update. I'll see to it in the near future.

Have you tested your proposed solution? I'm not on my dev-box right now. Also... this is a fairly old piece of library. I'll have to dig back in and understand my own code again... with any luck it should not take all too long ^_^

Also, you could propose a pull-request to speed things up ;)


Telofy commented Dec 6, 2013

It’s working for me, but I have not thoroughly tested it, which is also the reason why I chose to propose it here rather than filing a pull request right away.


Telofy commented Dec 9, 2013

I think you could also improve performance by finding a way not to recalculate the parts of the distance matrix that cannot have changed.
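To sketch the idea (this is hypothetical code, not the library's; `merge_update` and `linkage` are names I made up): after merging clusters `i` and `j`, the distance between any two surviving clusters is unchanged, so only the merged cluster's row of the matrix needs recomputation.

```python
def merge_update(matrix, clusters, i, j, linkage):
    """matrix[a][b] is the current distance between clusters a and b;
    linkage(c1, c2) computes the distance between two clusters."""
    merged = clusters[i] + clusters[j]
    keep = [k for k in range(len(clusters)) if k not in (i, j)]
    survivors = [clusters[k] for k in keep]
    # Distances between two surviving clusters cannot have changed:
    # copy them over instead of recomputing.
    new = [[matrix[a][b] for b in keep] for a in keep]
    # Only the merged cluster's distances need to be computed.
    row = [linkage(merged, c) for c in survivors]
    for r, d in zip(new, row):
        r.append(d)
    new.append(row + [0.0])
    return new, survivors + [merged]
```

With the issue's data `[10, 30, 10, 35]` and a single-linkage `min` distance, merging the two 10s leaves the 30–35 distance untouched, so it is simply copied rather than recomputed.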


@exhuma exhuma added the bug label Aug 28, 2014

exhuma (Owner) commented Aug 28, 2014

I tried your proposed fix, but it caused another test to fail. Looking into this, I find that my initial implementation of the algorithm could be revised and, more importantly, should be better tested. I am on it.


exhuma added a commit that referenced this issue Aug 29, 2014

exhuma (Owner) commented Sep 8, 2014

Sorry for the delay... We had a death in the family, and I still am a bit shaken.

But this problem keeps me awake at night... I have been working on a better unit test (see http://michel.albert.lu). Alas, implementing a test with that data set did not reproduce the error!

I have now taken up the original values from the unit test, and while writing up I realised that one value is duplicated (365), which might explain the error. I'm on it...


Telofy commented Sep 8, 2014

Oh, I’m sorry to hear that, my condolences.


exhuma added a commit that referenced this issue Sep 10, 2014

Test data fixed!
References #11

@exhuma exhuma closed this in dc9f779 Sep 10, 2014
