
[FLINK-3128] [flink-ml] Add Isotonic Regression To ML Library #1565

Closed
wants to merge 22 commits

Conversation

f-sander

Adds isotonic regression to the ml library.
It's a port of the implementation in Apache Spark.

@f-sander
Author

f-sander commented Feb 2, 2016

Are the build failures related to us? I don't really understand how...

The first failure happens in oraclejdk8 with hadoop 2.7.1:

Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 351.368 sec <<< FAILURE! - in org.apache.flink.test.recovery.TaskManagerProcessFailureBatchRecoveryITCase
testTaskManagerProcessFailure[0](org.apache.flink.test.recovery.TaskManagerProcessFailureBatchRecoveryITCase)  Time elapsed: 318.792 sec  <<< FAILURE!
java.lang.AssertionError: The program did not finish in time
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.junit.Assert.assertFalse(Assert.java:64)
    at org.apache.flink.test.recovery.AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure(AbstractTaskManagerProcessFailureRecoveryTest.java:212)

The second one, on openjdk 7 with hadoop 1, appears to experience a deadlock (?):

==============================================================================
Maven produced no output for 300 seconds.
==============================================================================
==============================================================================
The following Java processes are running (JPS)
==============================================================================
2286 Launcher
77113 Jps
76276 surefirebooter4006285424712115006.jar
==============================================================================
Printing stack trace of Java process 2286
==============================================================================

After that, only lots and lots of process stack traces.


def linearInterpolation(x1: Double, y1: Double, x2: Double, y2: Double, x: Double): Double = {
Contributor


Formatting off
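For context, the hunk above shows only the method signature. A standard two-point linear interpolation matching that signature would look like the following; the body here is my assumption, not necessarily the PR's code:

```scala
// Hypothetical body for illustration; the PR's actual implementation may differ.
// Interpolates y at position x on the line through (x1, y1) and (x2, y2).
def linearInterpolation(x1: Double, y1: Double, x2: Double, y2: Double, x: Double): Double =
  if (x1 == x2) y1 // degenerate segment: fall back to the left point
  else y1 + (y2 - y1) * (x - x1) / (x2 - x1)
```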

@tillrohrmann
Contributor

Really good work @f-sander. Good test coverage and good code documentation.

It would be good to add some online documentation for this algorithm (see flink/docs/libraries/ml).

I had a comment concerning scalability. I fear that with the current implementation, the algorithm is effectively bound by the capacities of a single machine. Especially sorting the data on the heap is destined to quickly crash the system. I'm not an expert on isotonic regression but it would be nice to get rid of the operator which collects all the input data in a single task to sort them.

I also haven't gone through the math details yet. Will do, once the scalability issue is fixed.

@f-sander
Author

f-sander commented Feb 3, 2016

Thanks for your feedback!

Yes, scalability is the main issue for us too. We are not aware of any other parallel implementation. We also talked to the original author of Spark's IR implementation (which is equivalent to ours) about this, with the same result. However, we think we have a theoretical approach to solving this, but it depends on a self join without duplicates. Remember our discussion on the user mailing list with the subject "join with no element appearing in multiple join-pairs"? I need that for this.

I will link a sketch of our algorithm design here in a few days, if we haven't found a way to solve this. I guess IR won't make it into Flink without a fully parallelized version?

@tillrohrmann
Contributor

Yeah, I thought that your question on the mailing list was related to this PR. It would be great to have a fully parallelized version of the algorithm, because if it only runs on a single machine then you could directly use sklearn or another ML library to solve the problem.

You can also share the sketch of your algorithm right now, if you want to. That way, others could directly chime in and maybe someone knows how to do the alternating pair join operation.

Cheers,
Till


@f-sander
Author

f-sander commented Feb 3, 2016

There is one advantage of this over using a single-node ML library: this implementation contains the compression procedure used in Spark that combines data points with equal labels. The hope of this parallelization strategy is that in each partition enough points are compressed so that the combined dataset in the last step fits on one node.

I will try to outline our algorithm tonight, but I'm very busy right now and can't promise. But I'll try.

@f-sander
Author

f-sander commented Feb 6, 2016

Sorry for the long delay. I still don't really have time for this, but I want to describe it anyway. That's why the writing and formatting are pretty sloppy. Sorry for that, I hope you bear with me:

We only consider isotonic regression on weighted, two-dimensional data. Thus, data points are tuples of three doubles: (y, x, w).

PAV assumes the data to be sorted by x. It starts on the left and goes to the right. Whenever two points (or more) are found that are descending in order of x, it "pools" them, which means all y values (multiplied by their weights) in that pool are averaged over the sum of all weights. Any point in the pool then looks like this: (y_weighted_pool_avg, x, w). Because the y values were changed, we have to look back in x order and check whether the new pool average is lower than the value before the pool. If that's the case, we have to pool again until no higher y value is present before the pool.
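The sequential pooling described above can be sketched in plain Scala (no Flink); the names and the (pooledAverage, totalWeight) pool representation are mine, not the PR's:

```scala
// A data point: label y, position x, weight w (input assumed sorted by x).
case class Point(y: Double, x: Double, w: Double)

// Sequential PAV sketch: scan left to right; on a violation, pool backwards
// into a weighted average. Returns one (pooledAverage, totalWeight) per pool.
def pav(points: Seq[Point]): List[(Double, Double)] = {
  var pools = List.empty[(Double, Double)] // head = rightmost pool so far
  for (p <- points) {
    var y = p.y
    var w = p.w
    // Look back: keep pooling while the previous average exceeds the new one.
    while (pools.nonEmpty && pools.head._1 > y) {
      val (py, pw) = pools.head
      pools = pools.tail
      y = (py * pw + y * w) / (pw + w) // weighted average of the merged pool
      w = pw + w
    }
    pools = (y, w) :: pools
  }
  pools.reverse // left-to-right order
}
```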

Any sequence of data points from i to j sharing the same y value is compressed in the following way: (y, x_i, sum_of_weights), (y, x_j, 0). The hope of Spark's implementation is that enough data gets compressed this way that all remaining data fits on one node in the last step. However, there are of course cases where this simply doesn't work.
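The compression rule can be illustrated like this (a plain-Scala sketch of the rule as described, not the PR's code):

```scala
// Compress runs of equal-y points (y, x, w): a run from i to j becomes
// (y, x_i, sumOfWeights) followed by a zero-weight marker (y, x_j, 0).
def compress(points: List[(Double, Double, Double)]): List[(Double, Double, Double)] = {
  // Group consecutive points that share the same y value.
  val runs = points.foldRight(List.empty[List[(Double, Double, Double)]]) { (p, acc) =>
    acc match {
      case (run @ (h :: _)) :: rest if h._1 == p._1 => (p :: run) :: rest
      case _ => List(p) :: acc
    }
  }
  runs.flatMap { run =>
    val (y, xi, _) = run.head
    val totalW = run.map(_._3).sum
    if (run.size == 1) run // a single point: nothing to compress
    else List((y, xi, totalW), (y, run.last._2, 0.0))
  }
}
```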

Our approach (not implemented in this PR) works like this:

Compare two consecutive data points i and j:
- if y_i < y_j, leave them untouched
- if y_i > y_j, replace both with ((y_i * w_i + y_j * w_j) / (w_i + w_j), x_i, w_i + w_j). Also remember x_j
- if y_i = y_j, replace both with (y_i, x_i, w_i + w_j). Also remember x_j

Repeat until no pairs are combined.

After the iteration terminates: for each point that has a "remembered" x_j, add another (y, x_j, 0) directly behind it.
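The three merge rules above can be sketched as a single pair function; the names and the Option-based "remembered x_j" encoding are my assumptions, not the authors' code:

```scala
// A point with an optional "remembered" right boundary x_j.
case class Pt(y: Double, x: Double, w: Double, remembered: Option[Double] = None)

// One application of the merge rules to a consecutive pair (i, j);
// None means the pair stays untouched (already isotonic).
def mergePair(i: Pt, j: Pt): Option[Pt] =
  if (i.y < j.y) None
  else if (i.y > j.y) // violation: pool into the weighted average, remember x_j
    Some(Pt((i.y * i.w + j.y * j.w) / (i.w + j.w), i.x, i.w + j.w, Some(j.x)))
  else // equal labels: compress into one point, remember x_j
    Some(Pt(i.y, i.x, i.w + j.w, Some(j.x)))
```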

We are able to compare each point with its successor by attaching an index (zipWithIndex) and a "next" pointer (index + 1) to each point and then doing a:
set.join(set).where(next).equalTo(index)
However, because of the weight summation, we must ensure that no point appears in multiple join pairs. Otherwise a point's weight might be summed into multiple combined points.

We worked around that by doing two joins in each iteration step:

- step 1: the left join side has only points with even indices, the right side only points with odd indices
- step 2: the left join side has only points with odd indices, the right side only points with even indices

If nothing happened during these two runs, we are done.
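The effect of the two sub-steps can be illustrated without Flink: restricting the left join side to one index parity guarantees that each index appears in at most one pair per sub-step (a plain-Scala sketch, not the PR's join code):

```scala
// Indices paired in one sub-step: the left side only contributes indices of
// the given parity; the right side is reached via the next pointer (i + 1).
def pairs(indices: Seq[Long], parity: Long): Seq[(Long, Long)] = {
  val present = indices.toSet
  indices.filter(i => i % 2 == parity && present.contains(i + 1)).map(i => (i, i + 1))
}
```

Running the even sub-step over indices 0..5 pairs (0,1), (2,3), (4,5); the odd sub-step then pairs (1,2), (3,4). Within each sub-step no index appears twice, so no weight is summed into multiple combined points.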

Unfortunately, because of the merging, the indices no longer increment by 1. That's why we wanted to apply another zipWithIndex after the two joins, but the join repartitions the data, so we lose our range partitioning. This is required to get indices representing the total order of the data.

I hope you can understand the problem. Again, sorry for the sloppy writing.

@zentol
Contributor

zentol commented Feb 28, 2019

Closing since flink-ml is effectively frozen.
