[FLINK-3128] [flink-ml] Add Isotonic Regression To ML Library #1565
Are the build failures related to us? I don't really understand how... The first failure happens in oraclejdk8 with hadoop 2.7.1:
The second one, in openjdk 7 with hadoop 1, appears to experience a deadlock (?):
After that, there are only lots and lots of process stack traces.
def linearInterpolation(x1: Double, y1: Double, x2: Double, y2: Double, x:
Double):
Double = {
Formatting off
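For comparison, here is the same signature laid out with one parameter per line, a common Scala convention for long parameter lists. This is only a sketch: the excerpt above shows the signature alone, so the body (the standard two-point interpolation formula) and the enclosing object name `Interpolation` are assumptions, not the PR's actual code.

```scala
object Interpolation {
  // Value at x on the straight line through (x1, y1) and (x2, y2).
  // Assumes x1 != x2; equal boundaries must be handled by the caller.
  def linearInterpolation(
      x1: Double,
      y1: Double,
      x2: Double,
      y2: Double,
      x: Double): Double = {
    y1 + (y2 - y1) * (x - x1) / (x2 - x1)
  }
}
```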
Really good work @f-sander. Good test coverage and good code documentation. It would be good to add some online documentation for this algorithm (see flink/docs/libraries/ml).
I had a comment concerning scalability. I fear that with the current implementation, the algorithm is effectively bound by the capacity of a single machine. In particular, sorting the data on the heap is bound to crash the system quickly. I'm not an expert on isotonic regression, but it would be nice to get rid of the operator that collects all the input data in a single task to sort it.
I also haven't gone through the math details yet. Will do once the scalability issue is fixed.
Thanks for your feedback! Yes, scalability is the main issue for us too. We are not aware of any other parallel implementation. We also talked to the original author of Spark's IR implementation (which is equivalent to ours) about this, with the same result.
However, we think we have a theoretical approach to solving this, but it depends on the self join without duplicates. Remember our discussion on the user mailing list with subject [...]
I will link a sketch of our algorithm design here in a few days if we haven't found a way to solve this. I guess IR won't make it into Flink without a fully parallelized approach?
Yeah, I thought that your question on the mailing list was related to [...]
You can also share the sketch of your algorithm right now, if you want to.
Cheers,
On Wed, Feb 3, 2016 at 12:32 PM, Fridtjof notifications@github.com wrote:
There is one advantage of this over using a single-node ML-Lib: this implementation contains the compression procedure used in Spark, which combines data points with equal label. The hope of this parallelization strategy is that enough points are compressed in each partition for the combined dataset in the last step to fit on one node.
I will try to outline our algorithm tonight, but I'm very busy right now and can't promise anything.
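The strategy described above (compress each partition, then finish on one node) can be sketched on plain Scala collections as follows. This is a minimal illustration, not the PR's code: the `Point` type, its field names, and the helpers `pav` and `twoStage` are hypothetical, the compression step is assumed to be a weighted pool-adjacent-violators pass that also pools equal labels, and weights are assumed to be positive.

```scala
object PartitionedPav {
  // Hypothetical point type; the field names are assumptions for illustration.
  final case class Point(label: Double, feature: Double, weight: Double)

  // Sequential weighted pool-adjacent-violators on points sorted by feature.
  // Each output point stands for one pooled block: weighted mean label,
  // feature of the block's first point, and summed weight.
  def pav(sorted: Seq[Point]): Vector[Point] =
    sorted.foldLeft(Vector.empty[Point]) { (blocks, p) =>
      var acc = blocks
      var cur = p
      // Pool backwards while the previous block is not strictly below the
      // current one; ">=" also merges blocks with equal labels.
      while (acc.nonEmpty && acc.last.label >= cur.label) {
        val prev = acc.last
        val w = prev.weight + cur.weight
        val mean = (prev.label * prev.weight + cur.label * cur.weight) / w
        cur = Point(mean, prev.feature, w)
        acc = acc.init
      }
      acc :+ cur
    }

  // Partitions are assumed to be range-partitioned by feature: every feature
  // in partition i precedes every feature in partition i + 1.
  def twoStage(partitions: Seq[Seq[Point]]): Vector[Point] = {
    val compressed = partitions.map(part => pav(part.sortBy(_.feature)))
    // Final pass over the (hopefully much smaller) concatenated blocks.
    pav(compressed.flatten)
  }
}
```

The sketch keeps only the first feature of each block for brevity; an implementation that needs to predict between blocks would have to retain both ends. The final `pav` call is still a single-node step, which is exactly the limitation discussed in this thread.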
Sorry for the long delay. I still don't really have time for this, but I want to describe it anyway. That's why the writing and formatting are pretty sloppy here. Sorry for that, I hope you bear with me:
We only consider isotonic regression on weighted, two-dimensional data. Thus, data points are tuples of three doubles: [...]
PAV assumes the data to be sorted by [...]
Any sequence of data points from [...]
Our approach (not implemented in this PR) works like this: [...]
After the iteration has terminated: for each point that has a "remembered" [...]
We are able to compare each point with its successor by attaching an index (zipWithIndex) and a "next-pointer" (index+1) to each point and then doing a: [...]
We worked around that by doing two joins in each iteration step: [...]
Unfortunately, because of the merging, the indices no longer increment by 1. That's why we wanted to apply another zipWithIndex after the two joins, but the join repartitions the data, so we lose our range partitioning. That partitioning, however, is required to get indices that represent the total order of the data. I hope you can understand the problem. Again, sorry for the sloppy writing.
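To make the index and "next-pointer" idea concrete, here is a rough sketch on local Scala collections. It is an illustration under assumptions, not the approach's actual Flink code: the `Indexed` type and the helpers `index`, `violatingPairs`, and `reindex` are hypothetical, and the self-join is emulated with a lookup map. Locally, re-indexing is trivial because sorting restores the total order; the problem described above is that a distributed join does not preserve the range partitioning that zipWithIndex would need for this.

```scala
object SuccessorJoin {
  // Hypothetical record: position, pointer to the successor's position, label.
  final case class Indexed(index: Long, next: Long, label: Double)

  // Attach a running index and a "next-pointer" (index + 1) to each label.
  def index(labels: Seq[Double]): Seq[Indexed] =
    labels.zipWithIndex.map { case (l, i) => Indexed(i.toLong, i.toLong + 1, l) }

  // Emulates the self-join "left.next == right.index" and keeps only the
  // pairs in which a point's label exceeds its successor's label.
  def violatingPairs(points: Seq[Indexed]): Seq[(Indexed, Indexed)] = {
    val byIndex = points.map(p => p.index -> p).toMap
    points
      .flatMap(p => byIndex.get(p.next).map(succ => (p, succ)))
      .filter { case (p, succ) => p.label > succ.label }
  }

  // After merging, the surviving indices have gaps, so the next-pointers no
  // longer match any index. Re-index in total order to close the gaps.
  def reindex(points: Seq[Indexed]): Seq[Indexed] =
    index(points.sortBy(_.index).map(_.label))
}
```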
Closing since flink-ml is effectively frozen.
Adds isotonic regression to the ml library.
It's a port of the implementation in Apache Spark.