[ML] SPARK-2426: Integrate Breeze NNLS with ML ALS #5005
Conversation
…hich is based upon breeze.optimize.proximal.QuadraticMinimizer; made sure the tests are clean; it depends on the next snapshot of Breeze
Test build #28541 has finished for PR 5005 at commit
Test build #28543 has finished for PR 5005 at commit
@mengxr Spark does not build with SNAPSHOT dependencies? David already pushed a 0.12-SNAPSHOT two days back...
No, the build does not enable snapshot repos. I think that's probably for the best: depending on someone's snapshot build can make the Spark build break unpredictably. If Spark is going to depend on something, it needs to have been released.
Got it... I will wait for the next Breeze release.
@debasish83 Let's first implement the breeze-based solvers as new solvers instead of replacing the old ones, so we can easily compare performance and accuracy. For example, you are using…
@mengxr The breeze NNLS solver is exactly the same as the mllib optimization NNLS. The breeze QuadraticMinimizer defaults to Cholesky and supports all the constraints we have discussed in the past for sparse coding and LSA with least-squares loss. You asked me to move the local solvers to breeze on this JIRA: https://issues.apache.org/jira/browse/SPARK-2426. I did exactly that, cleaned up all the copyright from Spark, and moved the code to Breeze. I will run datasets over this PR and compare the runtime with the default.
Also, QuadraticMinimizer keeps its own workspace. The idea is to construct ALS.QuadraticSolver once and keep re-using it; this is especially useful for LSA constraints. For ALS.NNLSSolver the workspace is still maintained by ALS. Let me do the comparisons with CholeskySolver first and report the results. About the breeze iterator pattern versus a plain while loop: I benchmarked it before adding the solver to Breeze and they were at par (I was surprised). A rough sketch of the reuse pattern is below.
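(A minimal sketch of the "construct once, re-use the workspace" idea; the class and buffer names below are hypothetical illustrations, not the actual ALS.QuadraticSolver code, and the solve body is a placeholder.)

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Hypothetical solver wrapper: all scratch buffers are allocated once at construction
// time and re-used for every per-user/per-item subproblem of the same rank.
class ReusableQuadraticSolver(rank: Int) {
  private val ata = DenseMatrix.zeros[Double](rank, rank) // gram matrix workspace
  private val atb = DenseVector.zeros[Double](rank)       // right-hand side workspace
  private val x   = DenseVector.zeros[Double](rank)       // solution workspace

  /** Solve one normal-equation subproblem in place, allocating nothing per call. */
  def solve(fillNormalEquation: (DenseMatrix[Double], DenseVector[Double]) => Unit): DenseVector[Double] = {
    ata := 0.0
    atb := 0.0
    fillNormalEquation(ata, atb) // caller accumulates A^T A and A^T b into the workspace
    x := atb                     // placeholder for the actual Cholesky/NNLS/QP solve
    x
  }
}

// Constructed once per partition and shared across all subproblems:
//   val solver = new ReusableQuadraticSolver(rank)
//   blocks.foreach(block => solver.solve(accumulate(block)))
```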
@mengxr Thanks, I got the idea. updateGram should always keep lower/upper triangular memory, and we directly drop down to LAPACK to do the solve. It will improve the runtime of the QuadraticMinimizer default as well as all the other formulations; let me add it. For NNLS I am not sure this optimization holds, since it is doing gradient-based CG calls. Let's think about it.
For NNLS it is also applicable. Let me use lapack ssbmv, basically to do the symmetric matrix-vector multiply for generating the gradients; a sketch of the idea is below.
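(For illustration only: a minimal sketch of a symmetric matrix-vector multiply through netlib-java. The packed-storage routine dspmv is used here as an assumed stand-in for the banded ssbmv mentioned above; the helper object is hypothetical, not code from this PR.)

```scala
import com.github.fommil.netlib.BLAS

// The gradient of 0.5 * x^T Q x + c^T x is Q x + c. With Q kept as a packed upper
// triangle (n*(n+1)/2 entries), a single BLAS call produces the gradient in place.
object PackedGradient {
  private val blas = BLAS.getInstance()

  def gradient(n: Int, qPacked: Array[Double], c: Array[Double], x: Array[Double],
               grad: Array[Double]): Array[Double] = {
    System.arraycopy(c, 0, grad, 0, n)                   // grad = c
    blas.dspmv("U", n, 1.0, qPacked, x, 1, 1.0, grad, 1) // grad = Q x + grad
    grad
  }
}
```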
Test build #28623 has finished for PR 5005 at commit
Test build #28635 has finished for PR 5005 at commit
I compared Breeze NNLS and mllib NNLS first, as it is simpler. The NNLS algorithm is similar to what is implemented by @coderxiang. I did not try Breeze CG yet, but later I will merge the Breeze CG that's used in TRON with NNLS. For now I migrated NNLS to Breeze, since it is a local solver, and used the breeze optimization pattern. The breeze.optimize.linear and breeze.optimize.proximal packages will be cleaned up once we are done with the stress test. I tried to make all the seeds 0L so that both runs look at the same data (the train set and test set have the same number of records, and the ALS seed is at 0L anyway).

Breeze NNLS:

export solver=breeze; ./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --total-executor-cores 2 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar ~/datasets/ml-1m/ratings.dat --nonNegative --numIterations 2

Got 1000209 ratings from 6040 users on 3706 movies.

TUSCA09LMLVT00C:spark-brznnls v606014$ grep solveTime ./work/breeze-nnls/0/stderr

mllib NNLS:

unset solver; ./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --total-executor-cores 2 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar ~/datasets/ml-1m/ratings.dat --nonNegative --numIterations 2

TUSCA09LMLVT00C:spark-brznnls v606014$ grep solveTime ./work/mllib-nnls/0/stderr

Breeze NNLS is slower and I am not sure of the exact cause. I made sure the linear algebra is clean (basically no memory allocation inside the solver loop, re-using the old state memory), but I will look into it more closely. gemv and axpy both use BLAS from netlib-java. Breeze NNLS uses the iterator pattern, but I doubt that alone would show so much difference; any pointers would be great. The code is updated here. Next I will compare CholeskySolver vs the QuadraticMinimizer default. The memory optimization for triangular storage will be a common optimization for mllib/breeze NNLS and breeze QuadraticMinimizer; I will take that as an enhancement PR for breeze. It is a bit tricky for QuadraticMinimizer, especially since it supports affine constraints of the form Aeq x = beq and inequalities A x <= b or lb <= x <= ub, but it can be done. First I want to see how big a difference it makes.
Test build #28637 has finished for PR 5005 at commit
@mengxr Alternatively I can use the ALS structure and add an ml.factorization package with a ConstrainedALS; QuadraticMinimizer can drive all the formulations in it through --userConstraint and --productConstraint.
I also printed how much time is taken in the inner solve of Breeze NNLS versus the total solve (iterator pattern + other overhead). It looks like there is some overhead:

15/03/15 22:15:54 INFO ALS: inner solveTime 92.635 ms

But interestingly the inner solveTime is still ~5x that of mllib NNLS, which does not make sense. I will take a closer look tomorrow. (The timing split I am measuring is sketched below.)
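(A minimal, hypothetical sketch of the inner-versus-outer timing split described above; the names are made up and this is not the instrumentation used in the PR.)

```scala
// Two accumulators: "outer" wraps the whole solver call (state construction plus
// iterator machinery), "inner" wraps only the core solver iterations.
object SolveTimers {
  var innerMillis = 0.0
  var outerMillis = 0.0

  def timed[A](setup: () => Unit)(innerSolve: () => A): A = {
    val outerStart = System.nanoTime()
    setup()                   // workspace / State / iterator construction
    val innerStart = System.nanoTime()
    val result = innerSolve() // the actual NNLS iterations
    val end = System.nanoTime()
    innerMillis += (end - innerStart) / 1e6
    outerMillis += (end - outerStart) / 1e6
    result
  }
}
```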
@dlwh Could you take a look at the breeze.optimize.linear.NNLS code? I have made sure no objects are allocated inside the solver loop and I re-use everything from the previous state. It is strange that the runtime is slower than mllib NNLS, which uses jblas...
@debasish83 Thanks for testing the performance! Let's try to keep this PR minimal. For example, we can make a separate PR for replacing MLlib's NNLS implementation with breeze's. I like this change because then we only need to maintain it in breeze, but we need to make sure the performance and accuracy are about the same. Do you see a clear way of splitting this PR? If breeze uses an iterator to access elements, it will be much slower than array lookups.
@mengxr Agreed, let's focus on NNLS in this PR, since all the learning will apply to QuadraticMinimizer as well, for which I can open a separate PR. I will clean up accordingly.

The iterator pattern is used in all breeze optimizers: in place of a while loop over the inner optimization iterations, breeze exposes an iterator so that users have control over the whole optimization path and not only the end result (see the sketch after this comment). It has an overhead, as shown below, but it gives more control to the user, and I doubt @dlwh will agree to replace the iterator with a while loop :-)

Breeze NNLS outer solveTime (includes iterator overhead):
15/03/16 12:26:42 INFO ALS: solveTime 149.791 ms
Inner solveTime:
15/03/16 12:26:42 INFO ALS: innerTime 70.256 ms

mllib NNLS:
15/03/16 12:28:03 INFO ALS: solveTime 39.141 ms

So Breeze NNLS is still 2x slower. I had to use f2jBLAS for level 1 BLAS, and level 2 BLAS for dgemv, to bring the runtime within 2x. Some further optimizations I can do are replacing cforRange with while (although we thought cforRange was faster!) and accessing a vector v through v.data(i) instead of v(i). I checked in this version of the code in breeze.optimize.linear.NNLS; please take a look and see if you can find any other issues.
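(To make the iterator pattern concrete, here is a minimal example using breeze's LBFGS, which follows the same State/Iterator convention; the toy quadratic objective is made up for illustration.)

```scala
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

// Toy objective: f(x) = ||x - 3||^2 with gradient 2 (x - 3).
val f = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val diff = x - 3.0
    (diff dot diff, diff * 2.0)
  }
}

val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 4)

// Iterator pattern: every intermediate State is exposed, so the caller can log,
// inspect, or stop on the whole optimization path, not just the final point.
val states = lbfgs.iterations(f, DenseVector.zeros[Double](5))
val last = states.reduceLeft((_, next) => next)
println(s"converged at ${last.x} with value ${last.value}")

// Equivalent "end result only" call, which is all a while-loop style API would return:
// val xStar = lbfgs.minimize(f, DenseVector.zeros[Double](5))
```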
…ption added to breeze.optimize.linear.NNLS for debug
Test build #28668 has finished for PR 5005 at commit
@tmyklebu These least-squares problems need not necessarily be small, but for mllib ALS they are. Think about TRON (breeze.optimize.TruncatedNewtonMinimizer) and the underlying CG solver in TRON, which is very similar to NNLS: there we also use a projected conjugate gradient solver and solve large problems. The direct/interior-point based solvers are also more robust to the condition number, as long as you can represent the gram matrix as a sparse matrix and use sparse algebra. I think for mllib ALS it is just a design decision whether we want to give all the intermediate state to the user or just the last state. To optimize the runtime of the direct solvers in breeze, the main change is to keep the gram matrix in lower/upper triangular (packed) storage and drop down to LAPACK directly, as discussed above; a sketch is below.

If you guys agree I can make this change to Breeze NNLS and QuadraticMinimizer. That way both of them should be able to replace ml.ALS.CholeskySolver and ml.ALS.NNLSSolver.
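(A minimal sketch of the packed-storage direct solve, assuming the standard netlib-java routines dspr and dppsv; the helper object is hypothetical and not the PR's code.)

```scala
import com.github.fommil.netlib.{BLAS, LAPACK}
import org.netlib.util.intW

// Accumulate A^T A in upper-triangular packed storage and solve A^T A x = A^T b with
// the packed Cholesky routine, so only n*(n+1)/2 doubles are kept for the gram matrix.
object PackedNormalEquation {
  private val blas = BLAS.getInstance()
  private val lapack = LAPACK.getInstance()

  def solve(rows: Array[Array[Double]], b: Array[Double], rank: Int): Array[Double] = {
    val ata = new Array[Double](rank * (rank + 1) / 2) // packed upper triangle of A^T A
    val atb = new Array[Double](rank)
    var i = 0
    while (i < rows.length) {
      blas.dspr("U", rank, 1.0, rows(i), 1, ata)       // ata += a_i a_i^T (packed update)
      blas.daxpy(rank, b(i), rows(i), 1, atb, 1)       // atb += b_i * a_i
      i += 1
    }
    val info = new intW(0)
    lapack.dppsv("U", rank, 1, ata, atb, rank, info)   // Cholesky solve in packed storage
    require(info.`val` == 0, s"dppsv returned ${info.`val`}")
    atb                                                // the solution overwrites atb
  }
}
```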
Sure.
Could you submit a PR for the changes soon? I want to get the fix out for…
Yeah, I will push it over the weekend... I am almost done with the changes.
Even after cleaning up the iterator, adding in-place gemv, and creating the state once and re-using the memory, the first iteration of Breeze NNLS is still slower than mllib NNLS; the rest of the iterations are fine:

Breeze NNLS:

./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 1 ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --rank 50 --numIterations 2 --nonNegative ~/datasets/ml-1m/ratings.dat

Got 1000209 ratings from 6040 users on 3706 movies.

mllib NNLS:

export solver=mllib; ./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 1 ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --rank 50 --numIterations 2 --nonNegative ~/datasets/ml-1m/ratings.dat

Got 1000209 ratings from 6040 users on 3706 movies.

This is the version I will push to Breeze. It would be great if you could take a look at the breeze NNLS and give some pointers on the first iteration. By the way, the ~10-20% overhead in the remaining iterations comes from breeze dot and axpy versus directly calling f2jblas dot and axpy; I verified that. But the first-iteration slowdown is still not clear to me.
It's probably just HotSpot warming up. I wouldn't worry about it.
I am confused why mllib NNLS does not show it: we allocate exactly the same memory in both Breeze and mllib NNLS. In Breeze we call it State and in mllib NNLS it is called a workspace. Maybe there is something I am missing here. The same issue shows up when replacing the Cholesky solver with the QuadraticMinimizer default as well; opening that up in a bit.
Test build #28953 has finished for PR 5005 at commit
The test-case failure is due to changing the ALS seed to 0L to get repeatable results over multiple runs...
All the runtime enhancements are being added to Breeze in this PR: scalanlp/breeze#386
@mengxr Any updates on this? Breeze 0.11.2 is now integrated with Spark... I can clean up the PR for reviews.
Updated the PR with breeze 0.11.2. Except for the first iteration, the rest are at par:

Breeze NNLS:
TUSCA09LMLVT00C:spark-brznnls v606014$ grep solveTime ./work/app-20150328110507-0003/0/stderr

mllib NNLS:
TUSCA09LMLVT00C:spark-brznnls v606014$ grep solveTime ./work/app-20150328110532-0004/0/stderr

export solver=mllib runs the mllib NNLS... I will wait for the feedback.
Test build #29352 has finished for PR 5005 at commit
@mengxr Any insight on this? The runtime issue is only in the first iteration, and I think you can point out if there is any obvious issue in the way I call the solver; it looks like something to do with initialization...
We should do a micro-benchmark instead of comparing the running times in ALS. Could you create a repo, copy the implementation over, and put your benchmark code there? I can take a look.
Sure, let me do that and point you to the repo. Most likely it will be a breeze-based branch and I will copy the mllib implementation over there. I am also curious why the first-iteration difference shows up in both NNLS and QuadraticMinimizer... (A rough harness sketch is below.)
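(A minimal, hypothetical sketch of the kind of standalone harness such a micro-benchmark could use: random dense normal equations, a JIT warm-up phase, and a pluggable solve function. This is not the actual benchmark code.)

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import scala.util.Random

object NNLSBench {
  // Random rank-n normal equation: ata = A^T A is positive semi-definite, atb is dense.
  def randomProblem(n: Int, rng: Random): (DenseMatrix[Double], DenseVector[Double]) = {
    val a = DenseMatrix.fill(n, n)(rng.nextGaussian())
    (a.t * a, DenseVector.fill(n)(rng.nextGaussian()))
  }

  // Time `solve` over `trials` problems, after `warmup` unmeasured calls so that
  // JIT compilation does not dominate the first measured iterations.
  def bench(name: String, warmup: Int, trials: Int)
           (solve: (DenseMatrix[Double], DenseVector[Double]) => DenseVector[Double]): Unit = {
    val rng = new Random(0L)
    val problems = Array.fill(warmup + trials)(randomProblem(50, rng))
    problems.take(warmup).foreach { case (q, c) => solve(q, c) }
    val start = System.nanoTime()
    problems.drop(warmup).foreach { case (q, c) => solve(q, c) }
    println(s"$name: ${(System.nanoTime() - start) / 1e6} ms over $trials solves")
  }
}

// Usage, with the two implementations plugged in by the caller:
//   NNLSBench.bench("breeze", warmup = 50, trials = 1000)(breezeSolve)
//   NNLSBench.bench("mllib",  warmup = 50, trials = 1000)(mllibSolve)
```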
@tmyklebu Do you have the original NNLS paper in English? Breeze also has a linear CG, and I am wondering whether it is possible to merge simple projections like positivity and bounds with the linear CG. CG-based linear solves can be extended to handle projection, similar to SPG, but NNLS looks like it does some optimization specific to x >= 0. Can NNLS be extended to other projection/proximal operators?
Not at home right now, so I don't have everything in front of me. If you have a "projection onto tangent cone" operator and you keep explicit track of the active set, you can generalise Polyak's method here to quadratic minimisation over any polyhedral set. The trouble is that projection onto the tangent cone requires solving a linear system for general polyhedral sets. Do you have a specific application in mind?
If you look into breeze.optimize.proximal.Proximal, I added a library of projection/proximal operators. In my experiments, projection-based algorithms (SPG, for example) do not work that well for L1 and sparsity constraints, but they work well for positivity and bounds, for example. I am thinking of extending the breeze linear CG / NNLS to handle simple projections and hopefully consolidating both into one linear CG with projection. I currently support these constraints through a Cholesky/LDL-based ADMM solver, but I wanted to write an iterative version using linear CG to see if the ADMM performance can be improved. For well-conditioned QPs, papers have found ADMM faster than FISTA, but I did not see comparisons with a linear CG variant.
The application is topic modeling / genre finding using sparsity constraints like L1 and the probability simplex on items, and supporting bounds in ALS. Equality is difficult in projection due to the linear-system issue you mentioned above, so we can skip that; inequality again should be fine but is not that useful in ALS applications. (Sketches of the simple projections I have in mind are below.)
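(For illustration, standalone sketches of the simple Euclidean projections discussed here; these are not the breeze.optimize.proximal operators themselves. The simplex projection follows the standard sort-based algorithm of Duchi et al., 2008.)

```scala
object Projections {
  // Projection onto the nonnegative orthant {x : x >= 0}.
  def projectPos(v: Array[Double]): Array[Double] = v.map(math.max(_, 0.0))

  // Projection onto the box {x : lb <= x <= ub}.
  def projectBox(v: Array[Double], lb: Double, ub: Double): Array[Double] =
    v.map(x => math.min(math.max(x, lb), ub))

  // Projection onto the probability simplex {x : x >= 0, sum(x) = 1}.
  def projectSimplex(v: Array[Double]): Array[Double] = {
    val u = v.sorted(Ordering[Double].reverse)    // sort descending
    var cumSum = 0.0
    var theta = 0.0
    var j = 0
    while (j < u.length) {
      cumSum += u(j)
      val candidate = (cumSum - 1.0) / (j + 1)
      if (u(j) - candidate > 0) theta = candidate // threshold for the largest feasible prefix
      j += 1
    }
    v.map(x => math.max(x - theta, 0.0))
  }
}
```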
OK. I haven't made a serious attempt to write a solver for general L1-constrained least squares problems. I don't see anything wrong with implementing a generalisation of Polyak's method for more general constrained least squares problems, but I'm not too sure it'll go fast. (It probably flies once you're close to the optimal face, but that isn't where you start.) With nonnegativity-constrained least-squares, the active set usually doesn't change very much.
Test build #31054 timed out for PR 5005 at commit
Test build #43582 has finished for PR 5005 at commit
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!
This PR has the following changes:
@mengxr @coderxiang @dlwh I opened it up for early reviews. If you guys are good with the basic change we can merge it, and in the next PR I will bring in the other constraints in ALS.