Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update DBSCAN runner to use multi-dimensional geometries and migrate it to benchmarks #736

Merged
merged 9 commits into from
Sep 13, 2022

Conversation

aprokop
Copy link
Contributor

@aprokop aprokop commented Sep 3, 2022

  • Move DBSCAN driver from examples to benchmark
  • Allow to run the driver for dimension 2-6 (with Kokkos 3.7+)
  • Update the test file input.txt to make sure it's read as 3D data

@aprokop aprokop added the enhancement New feature or request label Sep 3, 2022
@aprokop aprokop force-pushed the dbscan_example branch 2 times, most recently from f7f798a to 8a646a7 Compare September 3, 2022 04:11
@aprokop aprokop changed the title Updated DBSCAN runner to use multi-dimensional geometries and migrate to benchmarks/ Updated DBSCAN runner to use multi-dimensional geometries and migrated it to benchmarks Sep 3, 2022
@aprokop aprokop added the refactoring Code reorganization label Sep 3, 2022
@aprokop aprokop changed the title Updated DBSCAN runner to use multi-dimensional geometries and migrated it to benchmarks Update DBSCAN runner to use multi-dimensional geometries and migrate it to benchmarks Sep 3, 2022
@aprokop aprokop force-pushed the dbscan_example branch 2 times, most recently from 3ed17b1 to afd98ca Compare September 3, 2022 15:38
Copy link
Contributor

@dalg24 dalg24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan on re-introducing a clustering example?

@aprokop
Copy link
Contributor Author

aprokop commented Sep 8, 2022

Do you plan on re-introducing a clustering example?

Yes. It will be a simple point cloud and calling the DBSCAN interface, with printing the number of found clusters with the number of points in them.

benchmarks/dbscan/input.txt Show resolved Hide resolved
test/CMakeLists.txt Show resolved Hide resolved
examples/dbscan/dbscan.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@dalg24 dalg24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What benefit from the explicit instantiation (as implemented here) did you see?

benchmarks/dbscan/CMakeLists.txt Show resolved Hide resolved
benchmarks/dbscan/dbscan.cpp Outdated Show resolved Hide resolved
benchmarks/dbscan/dbscan.cpp Outdated Show resolved Hide resolved
benchmarks/dbscan/dbscan.hpp Outdated Show resolved Hide resolved
Comment on lines +25 to +44
template <int DIM>
std::vector<Point<DIM>> sampleData(std::vector<Point<DIM>> const &data,
int num_samples)
{
std::vector<Point<DIM>> sampled_data(num_samples);

std::srand(1337);

// Knuth algorithm
auto const N = (int)data.size();
auto const M = num_samples;
for (int in = 0, im = 0; in < N && im < M; ++in)
{
int rn = N - in;
int rm = M - im;
if (std::rand() % rn < rm)
sampled_data[im++] = data[in];
}
return sampled_data;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not having this defined next to getDataDimensions hurts readability

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you are talking about. This function is completely independent from anything else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both function implementation are looking into the same input data and you move them away from each other

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But functionally they have nothing to do with each other. Sampling data can be done with any input, independent of how you construct it. For example, with Gan-Tao generator you would generate the data without reading it in, and would still be able to sample it. So for me I see zero reason to keep them in the same place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I was commenting about loadData. Where did that function go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed to me that you were commenting about getDataDimensions. Nevertheless, I stand by my point that sampling has nothing to do with how the data is constructed, and thus is unnecessary to be kept together. However, I'll move the sampleData after loadData if it makes you happier.

@aprokop
Copy link
Contributor Author

aprokop commented Sep 10, 2022

What benefit from the explicit instantiation (as implemented here) did you see?

Times are for the full dbscan directory (including conveter):

Single dimension: 11.1s
Five dimensions no ETI: 36s
Five dimension with ETI (make -j8): 12s

@aprokop
Copy link
Contributor Author

aprokop commented Sep 10, 2022

Something wrong with the HIP tester

terminate called after throwing an instance of 'std::runtime_error'
  what():  hipGetDeviceCount(&m_hipDevCount) error( hipErrorNoDevice): hipErrorNoDevice /scratch/kokkos/core/src/HIP/Kokkos_HIP_Instance.cpp:80

but everything else passes

benchmarks/dbscan/CMakeLists.txt Show resolved Hide resolved
benchmarks/dbscan/dbscan.cpp Outdated Show resolved Hide resolved
Comment on lines +25 to +44
template <int DIM>
std::vector<Point<DIM>> sampleData(std::vector<Point<DIM>> const &data,
int num_samples)
{
std::vector<Point<DIM>> sampled_data(num_samples);

std::srand(1337);

// Knuth algorithm
auto const N = (int)data.size();
auto const M = num_samples;
for (int in = 0, im = 0; in < N && im < M; ++in)
{
int rn = N - in;
int rm = M - im;
if (std::rand() % rn < rm)
sampled_data[im++] = data[in];
}
return sampled_data;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I was commenting about loadData. Where did that function go?

Comment on lines +57 to +49
template <int DIM>
std::vector<Point<DIM>> loadData(std::string const &filename,
bool binary = true, int max_num_points = -1,
int num_samples = -1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it is

@aprokop aprokop merged commit f9728b5 into arborx:master Sep 13, 2022
@aprokop aprokop deleted the dbscan_example branch September 13, 2022 02:45
@aprokop aprokop mentioned this pull request Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request refactoring Code reorganization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants