
[FLINK-1731] [ml] Implementation of Feature K-Means and Test Suite #700

Closed · wants to merge 10 commits

Conversation

peterschrott (Contributor)

As part of the IMPRO-3 warm-up task, we implemented K-Means together with a corresponding test suite.

@aalexandrov (Contributor)

Can anybody with more Apache insight address @peedeeX21's concerns? Otherwise I suggest merging this and opening a follow-up issue to extend the current implementation to KMeans++.

@sachingoel0101 (Contributor)

Hey guys. You might wanna look at the initialization schemes here: #757

@FGoessler FGoessler force-pushed the feature_kmeans branch 4 times, most recently from 85066d9 to e1ec4bc Compare June 3, 2015 09:42
@FGoessler

The Travis build is failing on Oracle JDK 8. According to the build log, Maven or Flink is hanging. Can anyone help, or at least restart the build?
Are there any known flaky tests? IMO the failure isn't related to our changes.

@peterschrott (Contributor, Author)

@tillrohrmann
Would you please help me out with that pending pull request?

@tillrohrmann (Contributor)

Will do @peedeeX21. Currently I'm busy with the upcoming release, but once we're done with it, I'll work on this PR.

@peterschrott (Contributor, Author)

@tillrohrmann Great, no worries. I just wasn't sure what was going on. :) Good luck with the new release!

@thvasilo

Hello @peedeeX21, most of the failing Travis tests have been fixed in the current master. Could you try rebasing this PR and force-pushing to this branch?

@FGoessler

Just rebased and force-pushed -> hoping for good Travis results 😃

@thvasilo

Thanks, seems like all is fine now. We will start reviewing this in the next few days.


```scala
instance.centroids match {
  case Some(centroids) => {
    input.map(new SelectNearestCenterMapper).withBroadcastSet(centroids, CENTROIDS)
```


This mapping operation can be replaced with the new mapWithBcVariable function. You can check out the SGDStep function in flink.ml.optimization.GradientDescent to see how it is used.

It should make the code more concise and readable.
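
For context, a minimal sketch of the suggested pattern, assuming a single-element broadcast variable; the names `score`, `data`, `weights`, and the computation are placeholders, not code from this PR:

```scala
// Sketch only: mapWithBcVariable is provided by the org.apache.flink.ml
// package object and broadcasts a single-element DataSet into the mapper.
import org.apache.flink.api.scala._
import org.apache.flink.ml._
import org.apache.flink.ml.common.WeightVector

// `weights` is assumed to contain exactly one WeightVector.
def score(data: DataSet[Double], weights: DataSet[WeightVector]): DataSet[Double] =
  data.mapWithBcVariable(weights) { (x, w) =>
    // The broadcast value is available directly as `w`.
    x * w.intercept // placeholder computation
  }
```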

peterschrott (Contributor, Author)

@thvasilo the mapWithBcVariable function only supports single-element broadcast variables, as defined in org.apache.flink.BroadcastSingleElementMapper.
In our case we broadcast a whole DataSet of centroids (a list of vectors) to the mapper. Shall we extend the ML lib with a function like mapWithBcVariableList?
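
A helper along those lines could look roughly like this; `mapWithBcList` is a hypothetical sketch, not existing Flink ML API:

```scala
// Hypothetical helper in the spirit of the proposed mapWithBcVariableList:
// broadcast an entire DataSet and hand the mapper all of its elements.
import scala.collection.JavaConverters._
import scala.reflect.ClassTag
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

def mapWithBcList[T, B, O: TypeInformation: ClassTag](
    input: DataSet[T],
    broadcast: DataSet[B])(fun: (T, Seq[B]) => O): DataSet[O] = {
  input.map(new RichMapFunction[T, O] {
    @transient private var elements: Seq[B] = _

    // Materialize the broadcast DataSet once per task instance.
    override def open(parameters: Configuration): Unit = {
      elements = getRuntimeContext.getBroadcastVariable[B]("bcList").asScala
    }

    override def map(value: T): O = fun(value, elements)
  }).withBroadcastSet(broadcast, "bcList")
}
```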

peterschrott (Contributor, Author)

done

@thvasilo

Hello, I've left some initial comments. Once those have been addressed, I'll try to do some more integration testing and then pass the review over to a committer.

@thvasilo

Another note: it should not be necessary for the user to provide the initial centroids; the algorithm itself should be able to generate them, ideally with a scheme like k-means++.
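
For illustration, a sketch of k-means++ seeding on a local collection; initialization schemes are actually implemented on DataSets in #757, and `kMeansPlusPlus` and `squaredDist` are placeholder names, not code from either PR:

```scala
// Sketch of k-means++ seeding: each new center is sampled with probability
// proportional to the squared distance to the nearest center chosen so far.
import scala.collection.mutable.ArrayBuffer
import scala.util.Random
import org.apache.flink.ml.math.Vector

def squaredDist(a: Vector, b: Vector): Double =
  (0 until a.size).map { i => val d = a(i) - b(i); d * d }.sum

def kMeansPlusPlus(points: IndexedSeq[Vector], k: Int, rnd: Random): Seq[Vector] = {
  val centers = ArrayBuffer(points(rnd.nextInt(points.size)))
  while (centers.size < k) {
    val d2 = points.map(p => centers.map(c => squaredDist(p, c)).min)
    // Weighted sampling by walking through the cumulative distances.
    var r = rnd.nextDouble() * d2.sum
    val idx = d2.indexWhere { w => r -= w; r <= 0 }
    centers += points(if (idx >= 0) idx else points.size - 1)
  }
  centers.toSeq
}
```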

```scala
var closestCentroidLabel: Double = -1
centroids.foreach(centroid => {
  val distance = EuclideanDistanceMetric().distance(v, centroid.vector)
  if (distance < minDistance) {
```


You can also make use of the triangle inequality and avoid computing distances where the minimum possible distance is already larger than the current closest distance.
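
A sketch of what that pruning could look like, assuming the pairwise centroid distances have been precomputed; `centroidDist` and `nearestWithPruning` are placeholders, not code from this PR:

```scala
// Triangle inequality: d(v, c) >= d(b, c) - d(v, b) for the current best
// centroid b, so if d(b, c) >= 2 * d(v, b) then c cannot be closer than b
// and the exact distance computation can be skipped.
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric

def nearestWithPruning(
    v: Vector,
    centroids: Seq[LabeledVector],
    centroidDist: (Double, Double) => Double): Double = {
  var bestLabel: Double = -1
  var bestDist = Double.MaxValue
  for (c <- centroids) {
    // Only compute the exact distance when the bound cannot rule c out.
    if (bestLabel == -1 || centroidDist(bestLabel, c.label) < 2 * bestDist) {
      val d = EuclideanDistanceMetric().distance(v, c.vector)
      if (d < bestDist) { bestDist = d; bestLabel = c.label }
    }
  }
  bestLabel
}
```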

@sachingoel0101 (Contributor)

I've been following this PR, since my PR on initialization schemes can't be merged before this one. I already have three initialization mechanisms (namely random, k-means++, and k-means||). I've referenced the PR on this thread earlier.

@peterschrott (Contributor, Author)

I am having some trouble fitting our predictor into the new API.
The problem is that with PredictOperation the type of the model has to be defined: a DataSet of this type is the output of getModel, while the input to the predict method is a single object of this type.

In our case the model is a DataSet of LabeledVectors (the centroids), which means I cannot implement a PredictOperation under that restriction.

To me the API feels a bit inconsistent here, or am I missing something?

For now I have implemented only a PredictDataSetOperation.

@thvasilo

Hello @peedeeX21. The API does not deal with distributed models at the moment. In the K-means case, having the model distributed is overkill: it is highly unlikely that you will have >1000 centroids, so the model is tiny, and distributing it actually creates unnecessary overhead.

We can keep the current implementation, but in the future we should really test against a non-distributed model, which can be broadcast as a DataSet[Seq[LabeledVector]], and compare performance.

Also, could you add an evaluate operation (EvaluateDataSetOperation) for Kmeans (and a corresponding test)? It would be parametrized as EvaluateDataSetOperation[Kmeans, Vector, Double].

EDIT: For the predict operation, you could also combine the DataSet[LabeledVector] in getModel to return a DataSet[List[LabeledVector]], by doing a reduceGroup(elements => elements.toList) on the model DataSet. Then you would be able to make predictions as normal.
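
A sketch of that suggestion, assuming a KMeans predictor that stores its centroids as `centroids: Option[DataSet[LabeledVector]]`; the class and member names are placeholders for the PR's actual code:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.common.{LabeledVector, ParameterMap}
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
import org.apache.flink.ml.pipeline.PredictOperation

class KMeansPredictOperation
    extends PredictOperation[KMeans, List[LabeledVector], Vector, Double] {

  // Collapse the distributed centroids into a single-element model DataSet.
  override def getModel(instance: KMeans, parameters: ParameterMap): DataSet[List[LabeledVector]] =
    instance.centroids.get.reduceGroup(elements => elements.toList)

  // Predict the label of the nearest centroid for a single input vector.
  override def predict(value: Vector, model: List[LabeledVector]): Double =
    model.minBy(c => EuclideanDistanceMetric().distance(value, c.vector)).label
}
```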

@sachingoel0101 (Contributor)

Hi. IMO, the purpose of learning is to develop a model that compactly represents the data. Thus, having a distributed model doesn't make sense. Besides, the user might just want to take the model and use it somewhere else, in which case it makes sense to have it available not as a distributed DataSet but as a plain Java/Scala object that the user can easily operate on.

@peterschrott (Contributor, Author)

I totally agree with your points. We have a small number of centroids, and the model is not supposed to be distributed in the end.

The question now is: should the resulting DataSet of centroids just be collected, or should the whole iteration be rewritten to work on a non-distributed collection?

Note: unfortunately I am quite busy with other projects right now, so I won't have time to make many changes. Either the people from my group (who might actually have the same workload right now) or @sachingoel0101 can work on that if it's really urgent.

@thvasilo

What we would actually like to see is this PR and #757 merged into one, so that we can review them as a whole. @sachingoel0101, do you think you will be able to do that?

@sachingoel0101 (Contributor)

@thvasilo, how do I merge this PR into mine? Maybe @peedeeX21 can create a pull request against my branch at https://github.com/sachingoel0101/flink/tree/clustering_initializations, or is there a better option?

@peterschrott (Contributor, Author)

@sachingoel0101 creating a pull request against your repo would be best, but for some reason I can't choose your repo as the base fork.

@sachingoel0101 (Contributor)

@peedeeX21, try this link: sachingoel0101/flink@clustering_initializations...peedeeX21:feature_kmeans
I had a lot of trouble creating a PR against your repo yesterday, too.

@thvasilo (commented Jul 2, 2015)

Hello @peedeeX21, one thing you could try is to rebase this branch on @sachingoel0101's branch and then force-push to this one.

@peterschrott (Contributor, Author)

@thvasilo I was actually able to create a pull request against @sachingoel0101's branch, so everything should be fine now. We can even close this PR.

@thvasilo (commented Jul 2, 2015)

Sure, feel free to close this, and link to the new one.
