Update DBSCAN runner to use multi-dimensional geometries and migrate it to benchmarks #736
f7f798a to 8a646a7
3ed17b1 to afd98ca
Do you plan on re-introducing a clustering example?
Yes. It will build a simple point cloud, call the DBSCAN interface, and print the number of clusters found along with the number of points in each.
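For illustration, the reporting half of such an example could look roughly like the sketch below. It does not call the DBSCAN interface itself; it only assumes that clustering yields one integer label per point and that a negative label marks noise. The names `clusterSizes` and `clusterSummary` are hypothetical and not taken from this PR.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: count how many points carry each cluster label.
// Assumption: labels[i] is the cluster id of point i, negative means noise.
std::map<int, int> clusterSizes(std::vector<int> const &labels)
{
  std::map<int, int> sizes;
  for (int label : labels)
    if (label >= 0)
      ++sizes[label];
  return sizes;
}

// Hypothetical helper: format the number of clusters and their sizes.
std::string clusterSummary(std::vector<int> const &labels)
{
  auto const sizes = clusterSizes(labels);
  std::ostringstream os;
  os << "Found " << sizes.size() << " clusters\n";
  for (auto const &entry : sizes)
  {
    // Every stored cluster was incremented at least once.
    assert(entry.second > 0);
    os << "  cluster " << entry.first << ": " << entry.second << " points\n";
  }
  return os.str();
}
```

Passing the label vector produced by the clustering call to `clusterSummary` and streaming the result to `std::cout` would give the output described above.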
afd98ca to cf15c2e
What benefit from the explicit instantiation (as implemented here) did you see?
template <int DIM>
std::vector<Point<DIM>> sampleData(std::vector<Point<DIM>> const &data,
                                   int num_samples)
{
  std::vector<Point<DIM>> sampled_data(num_samples);

  std::srand(1337);

  // Knuth algorithm
  auto const N = (int)data.size();
  auto const M = num_samples;
  for (int in = 0, im = 0; in < N && im < M; ++in)
  {
    int rn = N - in;
    int rm = M - im;
    if (std::rand() % rn < rm)
      sampled_data[im++] = data[in];
  }
  return sampled_data;
}
Not having this defined next to getDataDimensions hurts readability
Not sure what you are talking about. This function is completely independent from anything else.
Both function implementations look at the same input data, and you moved them away from each other
But functionally they have nothing to do with each other. Sampling can be done on any input, independent of how that input was constructed. For example, with the Gan-Tao generator you would generate the data without reading it in, and would still be able to sample it. So I see zero reason to keep them in the same place.
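To make that independence concrete, here is a minimal sketch of the same selection-sampling idea written against an arbitrary element type. It is not the code from this PR: the name `sampleAny` and the `seed` parameter are hypothetical, and it swaps `std::rand` for `std::mt19937` with `std::uniform_int_distribution` so the draw is reproducible across standard libraries and free of modulo bias.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical generic variant of selection sampling (Knuth's Algorithm S).
// Picks num_samples elements uniformly without replacement, keeping the
// input order. Works on any vector, however it was produced.
template <typename T>
std::vector<T> sampleAny(std::vector<T> const &data, int num_samples,
                         unsigned seed = 1337)
{
  int const n = static_cast<int>(data.size());
  assert(num_samples >= 0 && num_samples <= n);

  std::vector<T> sampled;
  sampled.reserve(num_samples);

  std::mt19937 gen(seed);
  for (int i = 0; i < n && static_cast<int>(sampled.size()) < num_samples; ++i)
  {
    int const remaining = n - i; // elements not yet examined
    int const needed = num_samples - static_cast<int>(sampled.size());
    // Select the current element with probability needed / remaining.
    std::uniform_int_distribution<int> dist(0, remaining - 1);
    if (dist(gen) < needed)
      sampled.push_back(data[i]);
  }
  return sampled;
}
```

The same call works whether the vector was read from a file or filled by a generator, for example `sampleAny(generated_points, 1000)`.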
I thought I was commenting about loadData. Where did that function go?
It seemed to me that you were commenting about getDataDimensions. Nevertheless, I stand by my point that sampling has nothing to do with how the data is constructed, so the two do not need to be kept together. However, I'll move sampleData after loadData if it makes you happier.
Times are for the full dbscan directory (including converter): Single dimension: 11.1s
Something is wrong with the HIP tester, but everything else passes.
template <int DIM>
std::vector<Point<DIM>> loadData(std::string const &filename,
                                 bool binary = true, int max_num_points = -1,
                                 int num_samples = -1)
Here it is
e785a98 to 7bac1ed
7bac1ed to 0f2eaab
examples to benchmark
input.txt to make sure it's read as 3D data