Friends-of-Friends Query #161

Closed
sslattery opened this issue Oct 24, 2019 · 10 comments
Labels
application, enhancement (New feature or request)

Comments

@sslattery
Contributor

sslattery commented Oct 24, 2019

In many cosmology applications a Friends-of-Friends (FOF) query is used to identify clustering in point clouds. In general, the algorithm is as follows:

  1. Build a tree from a set of input points
  2. Establish a fixed neighborhood radius r
  3. For every point, locate the other points in the tree that are within distance r
  4. For every neighboring point within distance r, find its neighboring points that are within distance r excluding any neighbors already found previously in the query
  5. For each neighbor-of-neighbor, repeat step 4 until no new points are found within distance r

The end result of each query should be the list of points that are within distance r of the query point, or are connected to it through a chain of such neighbors (neighbors-of-neighbors, and so on).
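The steps above can be sketched as follows. This is a minimal illustration, not ArborX code: a brute-force distance scan stands in for the tree-based radius query of steps 1 and 3, and a breadth-first search over the frontier of newly found neighbors replaces the recursion of steps 4-5.

```python
# FOF cluster query sketch: brute force stands in for the tree search.
import math
from collections import deque

def fof_cluster(points, query_idx, r):
    """Return indices of all points reachable from points[query_idx]
    through chains of neighbors within distance r."""
    visited = {query_idx}          # neighbors already found (step 4's exclusion)
    frontier = deque([query_idx])
    while frontier:
        i = frontier.popleft()
        for j, q in enumerate(points):
            # In practice this loop is a radius query against the tree.
            if j not in visited and math.dist(points[i], q) <= r:
                visited.add(j)
                frontier.append(j)
    return sorted(visited)

points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0)]
print(fof_cluster(points, 0, 1.0))  # [0, 1, 2]: (5, 5) is unreachable
```

The breadth-first frontier guarantees each point is visited at most once, so the traversal terminates even for dense clusters.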

Some questions:

  1. It was mentioned that we could possibly cap the recursion in the algorithm at a fixed neighbor depth. Does this provide a benefit? If so, what are reasonable values?
  2. The output of the query could use our standard CSR-like format, where each query returns the set of object ids that satisfied the query predicate. However, many particles will belong to the same cluster, and that cluster would be repeated in the results for each of its points, potentially requiring a large amount of memory depending on the cluster structure. What is the most useful output format for this type of query? Should we return clusters rather than per-point results? Or return the clusters as well as, for each point, the cluster in which it is located?
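For reference, the CSR-like format mentioned in question 2 can be illustrated as two flat arrays (the values below are made up for the example):

```python
# CSR-like query output: offsets[i]..offsets[i+1] delimits the ids
# returned for query i. All values here are hypothetical.
offsets = [0, 3, 5, 6]          # n_queries + 1 entries
ids     = [4, 7, 9, 1, 2, 8]    # concatenated result ids

def results_for(i):
    return ids[offsets[i]:offsets[i+1]]

print(results_for(0))  # [4, 7, 9]
```

The duplication concern is visible here: if queries 0 and 1 belong to one cluster, the same cluster members appear in both slices.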
@aprokop
Contributor

aprokop commented Oct 24, 2019

If I understand the problem right, it should be equivalent to finding connected components in an undirected (in this case) graph, a well-studied problem in the graph community with several parallel methods.

I don't see, however, an easy way to avoid constructing the full graph of neighbors within the given distance at the moment.

@sslattery
Contributor Author

Yes, I think that is a good generalization if one considers the results of a preliminary radius query within r to define the graph.

@steverangel

steverangel commented Oct 24, 2019

Agreed. In discussing this with Damian, we also concluded that after steps 1-3 the remaining task is finding the connected components of a graph. Each component is then the desired output.

To add, a matrix representing the graph would be very sparse, so a representation that can take advantage of that property is needed. There are roughly 100 million points (vertices) in the graph.

@aprokop
Contributor

aprokop commented Oct 24, 2019

> To add, a matrix representing the graph would be very sparse, so a representation that can take advantage of that property is needed

If every query returns m results, our overhead is 1 + 1/m (assuming int for both results and offsets). I think this is OK for now. 100 million int is around 400MB, which is tolerable for GPUs.
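The overhead estimate can be checked with back-of-the-envelope arithmetic (the value of m below is a made-up example; the point counts are from the discussion above):

```python
# Check of the CSR overhead estimate: offsets add ~1/m on top of results.
n = 100_000_000                    # points / queries (from the thread)
m = 10                             # assumed average results per query
bytes_per_int = 4

values_bytes  = n * m * bytes_per_int    # result ids
offsets_bytes = (n + 1) * bytes_per_int  # CSR offsets array
overhead = (values_bytes + offsets_bytes) / values_bytes  # ~ 1 + 1/m

print(overhead)                   # ~1.1 for m = 10
print(offsets_bytes / 1e6, "MB")  # ~400 MB for 100 million ints
```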

@aprokop aprokop added the enhancement New feature or request label Oct 25, 2019
@aprokop
Contributor

aprokop commented Oct 27, 2019

Just to note: the original multistep method Kokkos implementation is here. It is unknown whether it still works.

@aprokop
Contributor

aprokop commented Dec 20, 2019

I think our plan of attack for the problem should be the following:

  1. Version 1
    • Construct the tree using ArborX for the test problem
    • Run the ECL-CC algorithm on the resulting graph to produce connected components
  2. Version 2
    • Implement Union-Find algorithm as part of the ArborX callback
    • Compare with version 1

Version 1 seems easy to implement; we just need to integrate ECL-CC (which already takes a CSR graph as input). Version 2 should be possible but needs some further thinking. I think the main question is whether we can combine the init and compute1 stages of the ECL-CC algorithm.
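The union-find idea behind version 2 can be sketched as follows. This is an illustration, not the ArborX callback itself: the pair list is hypothetical and stands in for the neighbor pairs the radius query would emit, and components are merged inline instead of building the CSR graph first.

```python
# Union-find with path halving: merge components as neighbor pairs arrive.
def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving keeps trees shallow
        i = parent[i]
    return i

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)  # smaller root wins

n = 5
parent = list(range(n))                # init: each point its own component
for a, b in [(0, 1), (1, 2), (3, 4)]:  # hypothetical pairs from radius queries
    union(parent, a, b)

labels = [find(parent, i) for i in range(n)]
print(labels)  # [0, 0, 0, 3, 3]: two clusters
```

A GPU version would need atomic updates to `parent`, which is essentially what ECL-CC's compute stages do.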

There are also further optimizations that could be considered. For example, if the diameter of a bounding box is less than the specified radius, we can guarantee that all of its leaves will end up in the same connected component. We could probably do some pre-processing to partition the domain into boxes prior to the construction. It remains to be seen.
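The bounding-box observation reduces to a simple diameter test (box coordinates below are hypothetical): any two points inside a box are at most the box diameter apart, so a diameter below r makes them mutual neighbors.

```python
# If a box's diameter is below r, all points inside it are pairwise
# within r of each other and share one component with no queries needed.
import math

def box_diameter(lo, hi):
    """Diameter of an axis-aligned box given its min/max corners."""
    return math.dist(lo, hi)

r = 1.0
box = ((0.0, 0.0, 0.0), (0.5, 0.5, 0.5))  # hypothetical box
if box_diameter(*box) < r:
    print("all leaves in this box share one component")
```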

@aprokop aprokop added this to In progress in Developer: aprokop Feb 21, 2020
@aprokop aprokop mentioned this issue Mar 5, 2020
@aprokop aprokop moved this from In progress to To do in Developer: aprokop Mar 25, 2020
@aprokop
Contributor

aprokop commented Apr 27, 2020

Another issue to consider is whether we want to use something like density clustering for HACC, as the data is very unevenly distributed. For example, this paper.

@sslattery
Contributor Author

Good idea - this seems like it would be a good solution for at least some cases of unbalanced trees. We will also likely have a use case for something like this in the raytracing problem on facet geometries where the triangle distributions are irregular.

@aprokop
Contributor

aprokop commented May 13, 2020

The driver was implemented and merged into the examples. The algorithm uses version 2 from the previous comment. I'm closing this issue until there is an external request to improve it or to address any of its limitations.

@aprokop aprokop closed this as completed May 13, 2020
Developer: aprokop automation moved this from To do to Done May 13, 2020
@aprokop aprokop mentioned this issue Jun 2, 2020
@aprokop
Contributor

aprokop commented Oct 1, 2020

Timeline of the algorithm improvements:

| date | time (s) | hash | description |
| --- | --- | --- | --- |
| 2020/03/09 | 8.33 | 147f3c3 (part of #237) | Version 1 (using ECL-CC for post-processing) |
| 2020/05/03 | 3.52 | 281b77d (#237) | Version 2 (inline processing using union-find) |
| 2020/05/20 | 2.74 | c3893a6 (#306) | Traversal optimization (Karras-style stack) |
| 2020/08/07 | 2.58 | 779f43d (#329) | New query overload (without offset or indices) |
| 2020/08/18 | 0.86 | 2754cf1 (#364) | Stackless traversal using escape index (rope) |
| 2020/09/26 | 0.68 | 2754cf1 with 1fcd0d7 merged in | Kokkos occupancy control (50%) |

  • "date" is the date the feature was originally implemented, not when it was merged
  • "hash" is the hash of the merge commit of the feature branch (unless stated otherwise)

(Figure 1: timing plot; image not included)
The baseline is 76 sec on a single Power9 core on Summit.
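The speedups implied by the table and the quoted baseline can be computed directly (the short names below are paraphrases of the table's descriptions):

```python
# Speedups relative to the first GPU version (8.33 s) and to the
# 76 s single-core Power9 baseline, using the times from the table.
times = {
    "v1 (ECL-CC)": 8.33,
    "v2 (union-find)": 3.52,
    "stack traversal": 2.74,
    "new query overload": 2.58,
    "stackless (ropes)": 0.86,
    "occupancy control": 0.68,
}
baseline = 76.0
v1 = times["v1 (ECL-CC)"]
for name, t in times.items():
    print(f"{name}: {v1 / t:.1f}x vs v1, {baseline / t:.0f}x vs baseline")
```

The final version is therefore over 100x faster than the serial baseline and roughly 12x faster than the first GPU version.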
