Question about data size limits #3

Open
cristinaluengoagullo opened this issue Oct 5, 2017 · 4 comments

@cristinaluengoagullo

Hi, first of all thanks for the application! I've been trying it out with different datasets and it works great with the smaller ones! But the application stalls with bigger datasets. My particular case is a dataset of 120 GB with 2000 million records, and I want to run DBSCAN with an eps of 0.0001. I don't know if I'm configuring the ppd parameter badly (with a value of 100 it stalls indefinitely, but with smaller values there seems to be some progress... even though it still hangs), or if it simply won't work with such a large dataset and such a small eps.
Is there any chance that I'm configuring it wrong? Thanks in advance!
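
For reference, the kind of job described here is sketched below. Apart from eps, ppd, and the run call mentioned later in this thread, everything (input path, parsing, class shape) is an assumption made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DBScanJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dbscan-large"))

    // Hypothetical input path and parsing for the 120 GB / 2000 million record dataset.
    val points = sc.textFile("hdfs:///data/points.csv")
      .map(_.split(',').map(_.toDouble))

    // The two knobs discussed in this issue.
    val eps = 0.0001 // neighbourhood radius
    val ppd = 100    // partitions per dimension; 100 stalls, smaller values crawl

    // Hypothetical API shape for the run call referred to below:
    // val clusters = new DBScan(eps, ppd).run(points)
  }
}
```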

@hag0r hag0r self-assigned this Oct 18, 2017
@hag0r (Member) commented Oct 18, 2017

Hi,
No, you're not configuring it wrong; DBSCAN just seems to take a very long time for such a large dataset. It has to compute many distances and perform lots of comparisons.

If it works for smaller data sets I would guess that:

  1. The operation just takes a long time (maybe also with some interruptions by other processes?).
  2. The garbage collector may have to run too often to free up space, which causes your program to stall frequently (a sketch for surfacing GC activity follows this list).
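
To check point 2, GC activity can be surfaced in the executor logs with standard Spark/JVM options; a minimal sketch (the app name is illustrative, the flags are standard Java 8 GC-logging options):

```scala
import org.apache.spark.SparkConf

// Make GC pauses visible in the executor logs so frequent or long collections show up.
val conf = new SparkConf()
  .setAppName("dbscan-gc-check")
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
```

The per-task GC time shown in the Spark web UI's stage view is also a quick way to confirm this.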

Can you tell us about the setup you are running your program on? Number of nodes, CPUs, etc.

@cristinaluengoagullo (Author)

Hi,
Thanks! I'm running it on a 6-node cluster with YARN, using Zeppelin to run the Spark jobs. Each node has 50 GB of memory and 16 cores available, and I used executors with 12 GB of memory and 4 cores each. When I use a ppd of 10, the execution finishes with YARN out-of-memory errors on different containers and huge garbage collection times for each task. But when I use a value of 20, for example, the garbage collection time is small and the program just stalls at line 149 of DBScan.scala. There seems to be progress in the map function (line 149), but it is very, very slow (for example, I left the program running for 2 days straight and it had not reached 10% of the map function's progress when I checked).
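
For reference, the executor sizing described above corresponds to roughly the following settings (a sketch; only the 12 GB / 4 cores values come from this comment, the rest is illustrative):

```scala
import org.apache.spark.SparkConf

// 6-node YARN cluster, 50 GB / 16 cores per node, executors sized as described above.
val conf = new SparkConf()
  .setAppName("dbscan-run")
  .set("spark.executor.memory", "12g")
  .set("spark.executor.cores", "4")
```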

I also tried using the BSPartitioner, but it crashed earlier than with the GridPartitioner. I also tried repartitioning the data before calling the dbscan run function, but it still stalls :(

I also noticed that there's a comment in DBScan.scala (line 148) suggesting to repartition the data after the groupBy. Maybe I'll try to modify that part and see if there's any improvement. Do you think it could work?
Thanks again!
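
A repartition after the groupBy, as the DBScan.scala (148) comment suggests, might look roughly like the sketch below (a generic pair RDD with a placeholder point type and grid-cell key; the actual names in DBScan.scala will differ):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical shape: points keyed by grid-cell id before the per-partition DBSCAN step.
def regroup(points: RDD[(Int, Array[Double])],
            numPartitions: Int): RDD[(Int, Iterable[Array[Double]])] =
  points
    .groupByKey()               // the groupBy around DBScan.scala line 148/149
    .repartition(numPartitions) // spread the grouped cells across more tasks
```

One caveat: groupByKey still gathers all points of a key into a single element, so repartitioning afterwards spreads the cells across tasks but cannot split one very large cell.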

@hag0r (Member) commented Oct 20, 2017

It may help. Though, I also found a place where we collected the cluster content in a List object instead of reusing the Iterator. This will lead to performance issues for large partitions.

I will try to solve this.
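
For illustration, the difference looks roughly like this inside a mapPartitions-style step (a sketch with a hypothetical assignCluster stub, not the actual DBScan.scala code):

```scala
// Hypothetical cluster-assignment stub, only here to make the sketch self-contained.
def assignCluster(p: Array[Double]): Int = 0

// Materialising the partition eagerly holds every element in memory at once:
def labelEager(iter: Iterator[Array[Double]]): Iterator[(Array[Double], Int)] =
  iter.toList.map(p => (p, assignCluster(p))).iterator

// Reusing the Iterator streams through the partition and keeps memory bounded:
def labelLazy(iter: Iterator[Array[Double]]): Iterator[(Array[Double], Int)] =
  iter.map(p => (p, assignCluster(p)))
```

Used as rdd.mapPartitions(labelLazy), the second version avoids building the intermediate List for large partitions.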

@cristinaluengoagullo (Author)

Ok thank you very much!!
