Question about data size limits #3

Open
cristinaluengoagullo opened this issue Oct 5, 2017 · 4 comments

@cristinaluengoagullo

Hi, first of all thanks for the application! I've been trying it out with different datasets and it works great with the smaller ones! But the application stalls with bigger datasets. My particular case is a dataset of 120 GB with 2000 million records, and I want to run DBSCAN with an eps of 0.0001. I don't know if I'm configuring the ppd parameter badly (with a value of 100 it stalls indefinitely, but with smaller values there seems to be some progress... even though it still hangs), or if it simply won't work with such a large dataset and such a small eps.
Is there any chance that I'm configuring it wrong? Thanks in advance!
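
For reference, the kind of job described here is sketched below. Apart from eps, ppd, and the run call mentioned later in this thread, everything (input path, parsing, class shape) is an assumption made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DBScanJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dbscan-large"))

    // Hypothetical input path and parsing for the 120 GB / 2000 million record dataset.
    val points = sc.textFile("hdfs:///data/points.csv")
      .map(_.split(',').map(_.toDouble))

    // The two knobs discussed in this issue.
    val eps = 0.0001 // neighbourhood radius
    val ppd = 100    // partitions per dimension; 100 stalls, smaller values crawl

    // Hypothetical API shape for the run call referred to below:
    // val clusters = new DBScan(eps, ppd).run(points)
  }
}
```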

@hag0r hag0r self-assigned this Oct 18, 2017
@hag0r (Member) commented Oct 18, 2017

Hi,
No, you're not configuring it wrong; DBSCAN just seems to take a very long time for such a large dataset. It has to compute many distances and perform lots of comparisons.

If it works for smaller data sets I would guess that:

  1. The operation just takes a long time (maybe also with some interruptions by other processes?).
  2. The garbage collector may have to run too often to free up space, which causes your program to stall frequently (a sketch for surfacing GC activity follows this list).
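
To check point 2, GC activity can be surfaced in the executor logs with standard Spark/JVM options; a minimal sketch (the app name is illustrative, the flags are standard Java 8 GC-logging options):

```scala
import org.apache.spark.SparkConf

// Make GC pauses visible in the executor logs so frequent or long collections show up.
val conf = new SparkConf()
  .setAppName("dbscan-gc-check")
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
```

The per-task GC time shown in the Spark web UI's stage view is also a quick way to confirm this.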

Can you tell us about the setup you are running your program on? Number of nodes, CPUs, etc.

@cristinaluengoagullo (Author)

Hi,
Thanks! I'm running it on a 6-node cluster with YARN, using Zeppelin to run the Spark jobs. Each node has 50 GB of memory and 16 cores available, and I used executors with 12 GB of memory and 4 cores each. When I use a ppd of 10, the execution finishes with YARN out-of-memory errors on different containers and huge garbage collection times for each task. But when I use a value of 20, for example, the garbage collection time is small and the program just stalls at line 149 of DBScan.scala. There seems to be progress in the map function (line 149), but it is very, very slow (for example, I left the program running for 2 days straight and it had not reached 10% of the map function's progress when I checked).
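
For reference, the executor sizing described above corresponds to roughly the following settings (a sketch; only the 12 GB / 4 cores values come from this comment, the rest is illustrative):

```scala
import org.apache.spark.SparkConf

// 6-node YARN cluster, 50 GB / 16 cores per node, executors sized as described above.
val conf = new SparkConf()
  .setAppName("dbscan-run")
  .set("spark.executor.memory", "12g")
  .set("spark.executor.cores", "4")
```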

I also tried using the BSPartitioner, but it crashed earlier than with the GridPartitioner. I also tried repartitioning the data before calling the dbscan run function, but it still stalls :(

I also noticed that there's a comment in DBScan.scala (line 148) suggesting to repartition the data after the groupBy. Maybe I'll try to modify that part and see if there's any improvement. Do you think it could work?
Thanks again!
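
A repartition after the groupBy, as the DBScan.scala (148) comment suggests, might look roughly like the sketch below (a generic pair RDD with a placeholder point type and grid-cell key; the actual names in DBScan.scala will differ):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical shape: points keyed by grid-cell id before the per-partition DBSCAN step.
def regroup(points: RDD[(Int, Array[Double])],
            numPartitions: Int): RDD[(Int, Iterable[Array[Double]])] =
  points
    .groupByKey()               // the groupBy around DBScan.scala line 148/149
    .repartition(numPartitions) // spread the grouped cells across more tasks
```

One caveat: groupByKey still gathers all points of a key into a single element, so repartitioning afterwards spreads the cells across tasks but cannot split one very large cell.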

@hag0r (Member) commented Oct 20, 2017

It may help. Though, I also found a place where we collected the cluster content in a List object instead of reusing the Iterator. This will lead to performance issues for large partitions.

I will try to solve this.
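
For illustration, the difference looks roughly like this inside a mapPartitions-style step (a sketch with a hypothetical assignCluster stub, not the actual DBScan.scala code):

```scala
// Hypothetical cluster-assignment stub, only here to make the sketch self-contained.
def assignCluster(p: Array[Double]): Int = 0

// Materialising the partition eagerly holds every element in memory at once:
def labelEager(iter: Iterator[Array[Double]]): Iterator[(Array[Double], Int)] =
  iter.toList.map(p => (p, assignCluster(p))).iterator

// Reusing the Iterator streams through the partition and keeps memory bounded:
def labelLazy(iter: Iterator[Array[Double]]): Iterator[(Array[Double], Int)] =
  iter.map(p => (p, assignCluster(p)))
```

Used as rdd.mapPartitions(labelLazy), the second version avoids building the intermediate List for large partitions.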

@cristinaluengoagullo (Author)

Ok thank you very much!!
