
meaning of tree failure. #23

Closed
sparktsao opened this issue Aug 28, 2016 · 15 comments
@sparktsao

Thank you for providing this great work.
I have a question: some datasets lead to a "tree failure" exception in the function "copyHeapToMatrix".
What does it mean, and how can I avoid it when preparing the dataset?

**********************************************terminate called after throwing an instance of 'Rcpp::exception'
  what():  Tree failure.
Aborted
@elbamos
Owner

elbamos commented Aug 28, 2016

That's extremely odd; that error-check code is there to test the internal consistency of the implementation. Can you provide your data so I can take a look?

@elbamos
Owner

elbamos commented Aug 29, 2016

What it means, essentially, is that during the tree-search part of the neighbor search algorithm, it found zero neighbors for a point. It should not be possible for that to happen. I would really appreciate knowing the details of your dataset and the parameters you were using. I'm guessing you found some sort of edge case, and I should add a check for it.

@sparktsao
Author

Thank you for the explanation.

I ran largeVis on my dataset (dim: 1600, number of records: 900K) successfully. To understand our data better at low dimensionality, I used feature reduction to shrink the original 1600 dimensions.
I hit the tree failure issue when the data dimension is below about 30 (e.g., it failed at 2, 13, and 30).
I am wondering if this is a corner case when the dimension is too small: maybe too many data points fall into the same leaves, which makes the random projection fail?
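
For reference, a rough sketch of the kind of reduction step I mean. prcomp here is just a stand-in for the feature-reduction method actually used, and `dat` is assumed to be the 900K x 1600 numeric matrix with one record per row:

# Sketch only: prcomp stands in for the actual feature-reduction method.
# `dat` is assumed to be a 900K x 1600 numeric matrix, one record per row.
k   <- 30                          # target dimensionality (2, 13, 30, ...)
pca <- prcomp(dat, center = TRUE)
low <- pca$x[, seq_len(k)]         # 900K x k matrix of reduced features
# `low` (transposed if needed, depending on the orientation largeVis expects)
# is the kind of low-dimensional input that triggered the tree failure here.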

Another point: my working machine is not on the latest version.
I stayed on the commit from Aug 4, 2016 (654da27), because it seems to handle more data than the commit from Aug 18, 2016 (580b2d2).

Version 580b2d2 runs out of memory at about 70% of randomProjectionTreeSearch; it seems to use more memory than 654da27, which handles all the data (1600 x 900K) smoothly. Although you advised me to use gcc 4.9.3 to build 580b2d2 successfully, I returned to the old version because of the memory issue.

I will try updating to the latest version to see whether I still hit the exception.
Thank you so much again!

@elbamos
Owner

elbamos commented Aug 29, 2016

Can you elaborate on the memory issue, and is it possible to see this data?

The relevant code in the neighbor search hasn't changed in quite some time, so memory usage in that phase should be constant. And reducing dims to ~30 shouldn't affect the tree search at all. (What might affect it are NAs and NaNs, though.)
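
A quick way to check for those, assuming `dat` is the numeric matrix you are passing in:

# Sanity checks on the input matrix before running largeVis.
# `dat` is assumed to be the numeric matrix passed to the package.
any(is.na(dat))       # TRUE if any NA values are present
any(is.nan(dat))      # TRUE if any NaN values are present
any(!is.finite(dat))  # TRUE if any NA, NaN, Inf or -Inf values are present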

Thank you for reporting this! I'd really appreciate your help nailing it down.

@sparktsao
Author

https://github.com/sparktsao/casetreefail
I can reproduce the error message on 2 different AWS EC2 instances,
but strangely I cannot reproduce it on every run.
Maybe there is some random behavior in the function?

@elbamos
Owner

elbamos commented Aug 30, 2016

The function does have random behavior as part of the algorithm, but that error should never occur. Thank you for posting the data - I will take a look tonight.

@elbamos
Owner

elbamos commented Aug 30, 2016

Wait a sec... Your log seems to show that the current version performs properly; you're only getting the error on the old release 0.1.5, is that right?

@sparktsao
Author

sparktsao commented Aug 30, 2016

[screenshot]
0.1.6?

@elbamos
Owner

elbamos commented Aug 30, 2016

But why not use the current version?

@sparktsao
Author

sparktsao commented Aug 30, 2016

And yes, the latest version only outputs a warning message, without the 'tree failure'.
I chose to stay on 0.1.6 here because it can handle the large dataset (1600 x 900K) smoothly; the program got "killed" when running that same dataset with the latest version.
It might have run out of memory. That may not be an issue, since it could probably be solved by adding memory.
Sorry, I haven't repeated that yet.

@elbamos
Owner

elbamos commented Aug 30, 2016

Can you show me the data where the current version died? It should not be less memory efficient at all.

@elbamos
Owner

elbamos commented Aug 30, 2016

Actually, one thing that did change after 0.1.6 was the default parameters. So what may be happening is that it's trying to use default settings, probably for tree_threshold, that use more RAM.

The reason for the change is to emulate the settings of the paper authors' reference code.

Try tamping down the tree threshold. They set it far too large for high-dimensional data.
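
Something like the following sketch; the argument names (K, n_trees, tree_threshold) and their defaults depend on the installed version, so check args(largeVis) before copying it:

library(largeVis)

# Sketch: pass an explicit, smaller tree threshold instead of relying on the
# post-0.1.6 defaults described above. Argument names are assumptions; verify
# them against your installed version with args(largeVis).
vis <- largeVis(dat,                  # your numeric input matrix
                K = 50,               # number of nearest neighbors to find
                n_trees = 50,         # number of random projection trees
                tree_threshold = 128) # smaller threshold should reduce RAM use in the tree search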

@elbamos
Owner

elbamos commented Aug 31, 2016

@sparktsao I just tried it, and with the default settings, it ran and completed on my machine in less than 3 seconds. It did not take long enough for me to even measure how much RAM was being used. I tried it up to K = 100.

(I do need to adjust that progress bar a bit...)

The reason you're finding fewer neighbors than you asked for, by the way, is that approximately one third of your data points are duplicates:

> str(test)
 int [1:2, 1:25000] 28538 303513 174704 343275 52760 269921 183379 112205 52388 277515 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "x" "y"
  ..$ : NULL
> test <- data.frame(t(test))
> bob <- duplicated(test)
> sum(bob)
[1] 8295

Is there anything else you can do to help me reproduce the issue you're having?

@sparktsao
Author

The data I prepared is the minimal set with which I can reproduce the tree failure case in build 654da27; it is not meant for the memory issue.
If you are using the latest version of the code, it should be fine, with neither the tree failure exception nor the memory issue.

The default setting change might explain why I hit the memory issue. I will now try the tree threshold parameter to find a good configuration for my large dataset, and I will report back if I hit the memory issue again.

Thanks so much for helping again.

@elbamos
Owner

elbamos commented Aug 31, 2016

OK, I'm going to close this issue.

Regarding the tree threshold, I suggest you look at the benchmarks vignette. It includes a detailed discussion of how changing the threshold, the number of trees, and the number of exploration iterations affects performance, memory usage, and accuracy. It is intended to be helpful to folks dealing with issues like yours; if it doesn't get you where you need to go, let me know and I'll try to improve it.
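
If it helps, a rough way to compare a few of the settings the vignette discusses is a sketch like the following; the argument names (K, n_trees, tree_threshold, max_iter) are assumptions to be checked against the installed version, and memory usage still needs to be watched externally (e.g. with top):

# Rough timing sweep over the knobs discussed in the benchmarks vignette:
# tree threshold, number of trees, and exploration iterations.
# Argument names are assumptions; adjust them to your installed version.
library(largeVis)
configs <- list(
  list(tree_threshold = 64,  n_trees = 50, max_iter = 1),
  list(tree_threshold = 128, n_trees = 50, max_iter = 1),
  list(tree_threshold = 128, n_trees = 20, max_iter = 2)
)
for (cfg in configs) {
  elapsed <- system.time(
    do.call(largeVis, c(list(dat, K = 50), cfg))
  )[["elapsed"]]
  cat(sprintf("threshold=%d trees=%d iters=%d -> %.1f s\n",
              cfg$tree_threshold, cfg$n_trees, cfg$max_iter, elapsed))
}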

@elbamos elbamos closed this as completed Aug 31, 2016