[FLINK-1731] [ml] Implementation of Feature K-Means and Test Suite #700
Conversation
Can anybody with more Apache insight answer @peedeeX21's concerns? Otherwise I suggest merging this and opening a follow-up issue that extends the current implementation to KMeans++.
Hey guys. You might want to look at the initialization schemes here: #757
Force-pushed from 85066d9 to e1ec4bc
The Travis build is failing on Oracle JDK 8. Maven or Flink is hanging according to the build log. Can anyone help or at least restart the build?
@tillrohrmann
Will do @peedeeX21. Currently I'm busy with the upcoming release, but once we're done with it, I'll work on this PR.
@tillrohrmann great, no worries. I was just not sure what is going on. :) Good luck with the new release!
Hello @peedeeX21, most of the failing Travis tests have been fixed in the current master. Could you try rebasing this PR and making a forced push to this branch?
Force-pushed from e1ec4bc to 292ec0b
Just rebased and force pushed -> hoping for good Travis results 😃
Thanks, seems like all is fine now. We will start reviewing this in the next few days.
instance.centroids match {
  case Some(centroids) => {
    input.map(new SelectNearestCenterMapper).withBroadcastSet(centroids, CENTROIDS)
This mapping operation can be replaced by using the new mapWithBcVariable function. You can check out the SGDStep function in flink.ml.optimization.GradientDescent to see how to use it.
It should make the code more concise and readable.
@thvasilo the mapWithBcVariable function only supports single-element variables, as defined in org.apache.flink.BroadcastSingleElementMapper. In our case we broadcast a whole DataSet of centroids (a list of vectors) to the mapper. Shall we extend the ML lib with a function like mapWithBcVariableList?
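Setting Flink's broadcast machinery aside, the per-record work the mapper does with the broadcast centroid list is just a nearest-centroid lookup. A minimal plain-Scala sketch of that logic (the names `Centroid` and `assign` are illustrative, not the PR's actual API):

```scala
// Illustrative model: a labeled centroid, as a label plus a coordinate vector.
case class Centroid(label: Double, vector: Array[Double])

// Plain Euclidean distance between two points of equal dimension.
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// For each point, take the label of the closest centroid from the
// (broadcast) centroid list -- the core of the SelectNearestCenterMapper.
def assign(points: Seq[Array[Double]], centroids: Seq[Centroid]): Seq[Double] =
  points.map(p => centroids.minBy(c => euclidean(p, c.vector)).label)
```

In the Flink version this `assign` step runs inside a rich map function, with `centroids` obtained from the broadcast set rather than passed as a plain argument.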
done
Hello, I've left some initial comments. Once those have been addressed, I'll try to do some more integration testing and then pass the review over to a committer.
Another note: it should not be necessary for the user to provide the initial centroids; it should be possible to generate those from the algorithm itself, ideally with a scheme like k-means++.
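For reference, the k-means++ seeding mentioned above picks the first centroid uniformly at random and each subsequent one with probability proportional to its squared distance from the nearest centroid chosen so far. A hedged plain-Scala sketch (not the implementation in #757):

```scala
import scala.util.Random

// Squared Euclidean distance, used as the k-means++ sampling weight.
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// k-means++ seeding sketch: first centroid uniform at random, then each
// next centroid sampled proportional to its squared distance to the
// nearest already-chosen centroid.
def kMeansPlusPlusInit(points: Seq[Array[Double]], k: Int, rng: Random): Seq[Array[Double]] = {
  val first = points(rng.nextInt(points.length))
  (1 until k).foldLeft(Seq(first)) { (centroids, _) =>
    // Weight of each point = squared distance to its nearest centroid.
    val weights = points.map(p => centroids.map(c => sqDist(p, c)).min)
    // Sample one point with probability proportional to its weight.
    var r = rng.nextDouble() * weights.sum
    val next = points.zip(weights)
      .find { case (_, w) => r -= w; r <= 0 }
      .map(_._1).getOrElse(points.last)
    centroids :+ next
  }
}
```

In a Flink setting, the distance-to-nearest-centroid computation would itself be a distributed pass over the input per chosen centroid, which is why k-means|| (fewer, heavier sampling rounds) is also among the schemes in #757.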
var closestCentroidLabel: Double = -1
centroids.foreach(centroid => {
  val distance = EuclideanDistanceMetric().distance(v, centroid.vector)
  if (distance < minDistance) {
You can also make use of the triangle inequality property, and avoid doing checks for points where the minimum possible distance is larger than the current closest distance.
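Concretely, the pruning the reviewer suggests (in the style of Elkan's algorithm) follows from the triangle inequality: if d(best, c) >= 2 * d(x, best), then d(x, c) >= d(best, c) - d(x, best) >= d(x, best), so centroid c cannot beat the current best. A sketch under that assumption (not the PR's code):

```scala
// Euclidean distance.
def dist(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Nearest-centroid search with triangle-inequality pruning: a centroid is
// skipped when the inter-centroid distance alone proves it cannot be
// closer than the current best.
def nearestWithPruning(x: Array[Double], centroids: Seq[Array[Double]]): Int = {
  var best = 0
  var bestDist = dist(x, centroids(0))
  for (i <- 1 until centroids.length) {
    // If d(best, c_i) >= 2 * d(x, best), c_i cannot win: skip the full check.
    if (dist(centroids(best), centroids(i)) < 2 * bestDist) {
      val d = dist(x, centroids(i))
      if (d < bestDist) { best = i; bestDist = d }
    }
  }
  best
}
```

In practice the inter-centroid distances would be precomputed once per iteration (they do not depend on the data points); computing them inline as above keeps the sketch short but would otherwise defeat the purpose of the pruning.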
I've been following this PR since my PR on initialization schemes can't be merged before this one. I already have three initialization mechanisms (namely random, k-means++, and k-means||). I referenced that PR on this thread earlier.
I am having some trouble fitting our predictor into the new API. In our case our model is a DataSet. For me the API feels a bit inconsistent in that case, or am I missing something? For now I implemented only one of the operations.
Hello @peedeeX21. The API does not deal with distributed models at the moment. In the K-means case, having the model distributed is overkill: it is highly unlikely that you will have >1000 centroids, so the model is tiny, and distributing it actually creates unnecessary overhead. We can keep the current implementation, but in the future we should really test against a non-distributed model, which can be broadcast in a DataSet[Seq[LabeledVector]], and compare performance. Also, could you add an evaluate operation (EvaluateDataSetOperation) for Kmeans (and a corresponding test)? It would be parametrized as EvaluateDataSetOperation[Kmeans, Vector, Double]. EDIT: For the predict operation, you could also combine the
Hi. IMO, the purpose of learning is to develop a model which compactly represents the data somehow; thus, having a distributed model doesn't make sense. Besides, the user might just want to take the model and use it somewhere else, in which case it makes sense to have it available not as distributed data, but just as a plain Java/Scala object the user can easily operate on.
I totally agree with you guys' points. We have a small number of centroids, and the model is not supposed to be distributed in the end. The question now is: should the resulting
Note: Unfortunately I am quite busy right now with other projects, so I won't have time to make a lot of changes at the moment. Either the people from my group (who might actually have the same workload right now) or @sachingoel0101 can work on that if it's really urgent.
Modified KMeans to use a Seq for model
What we would actually like to see is this PR and #757 merged into one, so that we can review them as a whole. @sachingoel0101, do you think you will be able to do that?
@thvasilo, how do I merge this PR into mine? Maybe @peedeeX21 can create a pull request against my branch at https://github.com/sachingoel0101/flink/tree/clustering_initializations, or is there a better option?
@sachingoel0101 me creating a pull request for your repo would be the best option. But for some reason I can't choose your repo as the base fork.
@peedeeX21, try this link: sachingoel0101/flink@clustering_initializations...peedeeX21:feature_kmeans
Hello @peedeeX21, one thing you could try is to rebase this branch on @sachingoel0101's branch, and then do a forced push to this one.
@thvasilo I actually could create a pull request for @sachingoel0101. So everything should be fine now. We can even close this PR.
Sure, feel free to close this, and link to the new one.
Within the IMPRO-3 warm-up task, K-Means and a corresponding test suite were implemented.