
[FLINK-1731] [ml] Implementation of Feature K-Means and Test Suite #700

Closed · wants to merge 10 commits

Conversation

peterschrott (Contributor)

As part of the IMPRO-3 warm-up task, we implemented K-Means together with a corresponding test suite.

@aalexandrov (Contributor)

Can anybody with more Apache insight address @peedeeX21's concerns? Otherwise I suggest merging this and opening a follow-up issue to extend the current implementation to KMeans++.

@sachingoel0101 (Contributor)

Hey guys. You might wanna look at the initialization schemes here: #757

@FGoessler FGoessler force-pushed the feature_kmeans branch 4 times, most recently from 85066d9 to e1ec4bc Compare June 3, 2015 09:42
@FGoessler

The Travis build is failing on Oracle JDK 8. According to the build log, Maven or Flink is hanging. Can anyone help, or at least restart the build?
Are there any known flaky tests? IMO the failure isn't related to our changes.

@peterschrott (Contributor, Author)

@tillrohrmann
Would you please help me out with that pending pull request?

@tillrohrmann (Contributor)

Will do @peedeeX21. Currently I'm busy with the upcoming release, but once we're done with it, I'll work on this PR.

@peterschrott (Contributor, Author)

@tillrohrmann Great, no worries. I just wasn't sure what was going on. :) Good luck with the new release!

@thvasilo

Hello @peedeeX21, most of the failing Travis tests have been fixed in the current master. Could you try rebasing this PR and force-pushing to this branch?

@FGoessler

Just rebased and force-pushed -> hoping for good Travis results 😃

@thvasilo

Thanks, seems like all is fine now. We will start reviewing this in the next few days.


```scala
instance.centroids match {
  case Some(centroids) => {
    input.map(new SelectNearestCenterMapper).withBroadcastSet(centroids, CENTROIDS)
```


This mapping operation can be replaced with the new mapWithBcVariable function. You can check out the SGDStep function in flink.ml.optimization.GradientDescent to see how it is used.

It should make the code more concise and readable.
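
For context, a minimal sketch of the suggested pattern, assuming a single-element broadcast variable; the names `score`, `data`, `weights`, and the computation are placeholders, not code from this PR:

```scala
// Sketch only: mapWithBcVariable is provided by the org.apache.flink.ml
// package object and broadcasts a single-element DataSet into the mapper.
import org.apache.flink.api.scala._
import org.apache.flink.ml._
import org.apache.flink.ml.common.WeightVector

// `weights` is assumed to contain exactly one WeightVector.
def score(data: DataSet[Double], weights: DataSet[WeightVector]): DataSet[Double] =
  data.mapWithBcVariable(weights) { (x, w) =>
    // The broadcast value is available directly as `w`.
    x * w.intercept // placeholder computation
  }
```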

peterschrott (Contributor, Author)

@thvasilo the mapWithBcVariable function only supports single-element broadcast variables, as defined in org.apache.flink.BroadcastSingleElementMapper.
In our case we broadcast a whole DataSet of centroids (a list of vectors) to the mapper. Shall we extend the ML lib with a function like mapWithBcVariableList?
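
A helper along those lines could look roughly like this; `mapWithBcList` is a hypothetical sketch, not existing Flink ML API:

```scala
// Hypothetical helper in the spirit of the proposed mapWithBcVariableList:
// broadcast an entire DataSet and hand the mapper all of its elements.
import scala.collection.JavaConverters._
import scala.reflect.ClassTag
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

def mapWithBcList[T, B, O: TypeInformation: ClassTag](
    input: DataSet[T],
    broadcast: DataSet[B])(fun: (T, Seq[B]) => O): DataSet[O] = {
  input.map(new RichMapFunction[T, O] {
    @transient private var elements: Seq[B] = _

    // Materialize the broadcast DataSet once per task instance.
    override def open(parameters: Configuration): Unit = {
      elements = getRuntimeContext.getBroadcastVariable[B]("bcList").asScala
    }

    override def map(value: T): O = fun(value, elements)
  }).withBroadcastSet(broadcast, "bcList")
}
```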

peterschrott (Contributor, Author)

done

@thvasilo

Hello, I've left some initial comments. Once those have been addressed, I'll try to do some more integration testing and then pass the review over to a committer.

@thvasilo

Another note: it should not be necessary for the user to provide the initial centroids; the algorithm itself should be able to generate them, ideally with a scheme like k-means++.
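
For illustration, a sketch of k-means++ seeding on a local collection; initialization schemes are actually implemented on DataSets in #757, and `kMeansPlusPlus` and `squaredDist` are placeholder names, not code from either PR:

```scala
// Sketch of k-means++ seeding: each new center is sampled with probability
// proportional to the squared distance to the nearest center chosen so far.
import scala.collection.mutable.ArrayBuffer
import scala.util.Random
import org.apache.flink.ml.math.Vector

def squaredDist(a: Vector, b: Vector): Double =
  (0 until a.size).map { i => val d = a(i) - b(i); d * d }.sum

def kMeansPlusPlus(points: IndexedSeq[Vector], k: Int, rnd: Random): Seq[Vector] = {
  val centers = ArrayBuffer(points(rnd.nextInt(points.size)))
  while (centers.size < k) {
    val d2 = points.map(p => centers.map(c => squaredDist(p, c)).min)
    // Weighted sampling by walking through the cumulative distances.
    var r = rnd.nextDouble() * d2.sum
    val idx = d2.indexWhere { w => r -= w; r <= 0 }
    centers += points(if (idx >= 0) idx else points.size - 1)
  }
  centers.toSeq
}
```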

```scala
var closestCentroidLabel: Double = -1
centroids.foreach(centroid => {
  val distance = EuclideanDistanceMetric().distance(v, centroid.vector)
  if (distance < minDistance) {
```


You can also make use of the triangle inequality and avoid computing distances where the minimum possible distance is already larger than the current closest distance.
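
A sketch of what that pruning could look like, assuming the pairwise centroid distances have been precomputed; `centroidDist` and `nearestWithPruning` are placeholders, not code from this PR:

```scala
// Triangle inequality: d(v, c) >= d(b, c) - d(v, b) for the current best
// centroid b, so if d(b, c) >= 2 * d(v, b) then c cannot be closer than b
// and the exact distance computation can be skipped.
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric

def nearestWithPruning(
    v: Vector,
    centroids: Seq[LabeledVector],
    centroidDist: (Double, Double) => Double): Double = {
  var bestLabel: Double = -1
  var bestDist = Double.MaxValue
  for (c <- centroids) {
    // Only compute the exact distance when the bound cannot rule c out.
    if (bestLabel == -1 || centroidDist(bestLabel, c.label) < 2 * bestDist) {
      val d = EuclideanDistanceMetric().distance(v, c.vector)
      if (d < bestDist) { bestDist = d; bestLabel = c.label }
    }
  }
  bestLabel
}
```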

@sachingoel0101 (Contributor)

I've been following this PR, since my PR on initialization schemes can't be merged before this one. I already have three initialization mechanisms (namely random, k-means++, and k-means||). I've referenced the PR on this thread earlier.

@peterschrott (Contributor, Author)

I am having some trouble fitting our predictor into the new API.
The problem is that with PredictOperation the type of the model has to be defined: a DataSet of this type is the output of getModel, while the input to the predict method is a single object of this type.

In our case the model is a DataSet of LabeledVectors (the centroids), which means I cannot implement a PredictOperation under that restriction.

To me the API feels a bit inconsistent here, or am I missing something?

For now I have implemented only a PredictDataSetOperation.

@thvasilo

Hello @peedeeX21. The API does not deal with distributed models at the moment. In the K-means case, having the model distributed is overkill: it is highly unlikely that you will have >1000 centroids, so the model is tiny, and distributing it actually creates unnecessary overhead.

We can keep the current implementation, but in the future we should really test against a non-distributed model, which can be broadcast as a DataSet[Seq[LabeledVector]], and compare performance.

Also, could you add an evaluate operation (EvaluateDataSetOperation) for Kmeans (and a corresponding test)? It would be parametrized as EvaluateDataSetOperation[Kmeans, Vector, Double].

EDIT: For the predict operation, you could also combine the DataSet[LabeledVector] in getModel to return a DataSet[List[LabeledVector]], by doing a reduceGroup(elements => elements.toList) on the model DataSet. Then you would be able to make predictions as normal.
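
A sketch of that suggestion, assuming a KMeans predictor that stores its centroids as `centroids: Option[DataSet[LabeledVector]]`; the class and member names are placeholders for the PR's actual code:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.common.{LabeledVector, ParameterMap}
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
import org.apache.flink.ml.pipeline.PredictOperation

class KMeansPredictOperation
    extends PredictOperation[KMeans, List[LabeledVector], Vector, Double] {

  // Collapse the distributed centroids into a single-element model DataSet.
  override def getModel(instance: KMeans, parameters: ParameterMap): DataSet[List[LabeledVector]] =
    instance.centroids.get.reduceGroup(elements => elements.toList)

  // Predict the label of the nearest centroid for a single input vector.
  override def predict(value: Vector, model: List[LabeledVector]): Double =
    model.minBy(c => EuclideanDistanceMetric().distance(value, c.vector)).label
}
```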

@sachingoel0101 (Contributor)

Hi. IMO, the purpose of learning is to develop a model that compactly represents the data. Thus, having a distributed model doesn't make sense. Besides, the user might just want to take the model and use it somewhere else, in which case it makes sense to have it available not as a distributed DataSet but as a plain Java/Scala object that the user can easily operate on.

@peterschrott (Contributor, Author)

I totally agree with your points. We have a small number of centroids, and the model is not supposed to be distributed in the end.

The question now is: should the resulting DataSet of centroids just be collected, or should the whole iteration be rewritten to work on a non-distributed collection?

Note: unfortunately I am quite busy with other projects right now, so I won't have time to make many changes. Either the people from my group (who might actually have the same workload right now) or @sachingoel0101 can work on that if it's really urgent.

@thvasilo

What we would actually like to see is this PR and #757 merged into one, so that we can review them as a whole. @sachingoel0101, do you think you will be able to do that?

@sachingoel0101 (Contributor)

@thvasilo, how do I merge this PR into mine? Maybe @peedeeX21 can create a pull request against my branch at https://github.com/sachingoel0101/flink/tree/clustering_initializations, or is there a better option?

@peterschrott (Contributor, Author)

@sachingoel0101 creating a pull request against your repo would be best, but for some reason I can't choose your repo as the base fork.

@sachingoel0101 (Contributor)

@peedeeX21, try this link: sachingoel0101/flink@clustering_initializations...peedeeX21:feature_kmeans
I had a lot of trouble creating a PR against your repo yesterday, too.

@thvasilo (commented Jul 2, 2015)

Hello @peedeeX21, one thing you could try is to rebase this branch on @sachingoel0101's branch and then force-push to this one.

@peterschrott (Contributor, Author)

@thvasilo I was actually able to create a pull request against @sachingoel0101's branch, so everything should be fine now. We can even close this PR.

@thvasilo (commented Jul 2, 2015)

Sure, feel free to close this, and link to the new one.
