Documentation of rho_x parameter #150
Comments
Thanks for the question Thomas. You're right that this isn't documented anywhere; I'll leave this issue open so I remember to document all the parameters carefully.

The rho_x term enters in the linear system solve (basically as the ADMM step-size parameter, as you say), but it weights only the x component, which is why it actually makes a difference. Instead of solving z = (I+Q)^{-1} r we solve with the matrix D + Q, where D = diag(rho_x * I_n, I_{m+1}), I_n and I_{m+1} are identity matrices of dimension n and m+1 respectively, and Q is of shape (n+m+1) x (n+m+1).

The reason for using a small rho_x, like 1e-6, is that unlike the 'y' variable the 'x' variable does not have to lie in a cone (other than the 'free' cone), so the output of the projection step matters less for x, and we should down-weight its contribution from the cone projection step. Note that the algorithm could be adapted to allow 'x' to lie in a cone too and include it in the projection, in which case a small rho_x would no longer be appropriate.

In fact the algorithm has all the same guarantees with any generic diagonal, positive definite matrix in place of I. Let's call such a matrix D. When we use (D+Q)^{-1} instead of (I+Q)^{-1} we also need to scale the input vector by the same quantity, so we actually solve z = (D+Q)^{-1} D * r. The same D is used in the projection step, but it makes no difference there; and in the dual variable update step, in the special case D = diag(rho_x * I_n, I_{m+1}) with x in the free cone, it makes no difference either.

I'm a bit rusty on this part, but I think it can be justified as follows: in ADMM we're solving problems of the form min. f(x) + g(z) subject to x = z, but this is equivalent to the problem min. f(x) + g(z) subject to D x = D z for any positive definite matrix D (diagonal being a special case). If you derive the algorithm using that weighting on the equality constraint, you should get what we have in the code.
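The weighted solve described above can be sketched numerically. This is an illustrative toy (dense numpy, made-up names), not the actual SCS internals, but it shows how `rho_x` down-weights the x block in `z = (D+Q)^{-1} D r`:

```python
# Hedged sketch of the rho_x-weighted subproblem solve described above.
# Q is the (n+m+1) x (n+m+1) data matrix; names are illustrative only.
import numpy as np

def weighted_solve(Q, r, n, rho_x):
    """Solve z = (D + Q)^{-1} D r with D = diag(rho_x * I_n, I_{m+1})."""
    N = Q.shape[0]
    d = np.ones(N)
    d[:n] = rho_x                      # down-weight only the x block
    return np.linalg.solve(np.diag(d) + Q, d * r)
```

With `rho_x = 1` this reduces to the unweighted solve `z = (I + Q)^{-1} r`, so the weighting is a strict generalization.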
Thanks for the prompt response! It's a great help, and I'll work through the details. The reason I started digging into this is that I have a hard time getting dual residuals below 5e-2 on a fairly large-scale problem (SDP cone of dim ~1k, 60k linear constraints). It's workable, but I'm hoping to gain one more order of magnitude. Based on your info, it seems like a comparatively small rho_x should help the dual residual.

I'm also curious whether the Sherman-Morrison-Woodbury step could be an issue. I'm using the Python linsys with a direct solver that benefits from symbolic elimination. The residuals from the solver benefit from iterative refinement (typical unrefined residuals are around 1e-4, but 1e-10 after refinement). Refining at the level of the linsys API doesn't seem to improve the dual residual, but I think I need to do refinement after the SMW step. I'm not using equilibration, but my A matrix is already fairly well infinity-norm equilibrated, though its 2-norm conditioning is more severe. Let me know if you have any thoughts.
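The iterative refinement mentioned above is the classical residual-correction loop. A minimal sketch, assuming `factor_solve` stands in for whatever inexact direct solve the Python linsys callback provides (not any actual SCS API):

```python
# Hedged sketch of classical iterative refinement around an inexact solve.
# `factor_solve` is a hypothetical stand-in for the linsys solve callback.
import numpy as np

def refine(M, b, factor_solve, iters=2):
    """Improve x ~ M^{-1} b by re-solving against the running residual."""
    x = factor_solve(b)
    for _ in range(iters):
        r = b - M @ x                  # residual of the current solution
        x = x + factor_solve(r)        # correct x with the inexact solver
    return x
```

The point made above is that refining only the inner factorized solve is not enough: if the SMW combination step reintroduces error, the refinement residual has to be computed against the full system that SMW is implicitly solving.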
I'm working on SCS v3 right now, which makes a lot of improvements, including automatically adapting the scale parameter and making the acceleration more stable overall.
I'll look at the branch, but I think the Python linsys is a must-have. The problem involves a large number of low-rank constraints, so the A and A^T operators make heavy use of converting outer products to inner products, as in Tr(U U^T X) = Tr(U^T X U). The theoretical computational complexity is the same, but reducing the memory requirements allows better locality and caching. The qdldl/colamd solver seems to suffer from disastrous fill-in as well. If MA86 or PARDISO have somehow become more available, they would be worth trying out, but we don't have any good licensing options at the moment.
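The outer-to-inner-product trick above is just cyclicity of the trace, but it matters for memory: the constraint value <U U^T, X> can be evaluated without ever materializing the dense n x n matrix U U^T. A minimal sketch (illustrative names, not SCS code):

```python
# Hedged sketch of the identity Tr(U U^T X) = Tr(U^T X U) used above to
# avoid forming the dense rank-k outer product U U^T.
import numpy as np

def low_rank_inner(U, X):
    """<U U^T, X> via Tr(U^T X U): O(n^2 k) time, only O(n k) extra memory."""
    return np.trace(U.T @ X @ U)
```

For n ~ 1000 and k small, this replaces an n x n temporary with an n x k one, which is where the locality and caching benefit comes from.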
I figured it out, have updated my notes, and attached them for the benefit of anyone who finds them. It is interesting that by reducing the weight of x in the projection, we also unintentionally make the projection onto the zero cone (the r component) less tethered. Another interesting possibility for weighting would be diag(D, D^{-1}), so that the projection would be strongly tethered to the zero cone. This would cost two linear solves at each step and require computation of the dual variables, but I'm curious whether it was ever tried.
Thanks for posting this. Has this improved the performance sufficiently? If not, you should try modifying the scale parameter as I suggested. I realize you are not using the built-in normalization (equilibration) routine, so the scale parameter you pass to SCS does nothing.
I haven't had a chance to look at this for a few days. Will keep you posted.
The rho_x parameter isn't documented in either the JOTA paper or its longer version. Judging by its name, this variable seems like it should be the ADMM step-size parameter of traditional presentations. However, it cannot be, because the alternating minimization steps are homogeneous in rho, which implies that the dual variable update step is also independent of rho (v^k = \lambda^k, etc.).
From a reverse-engineering perspective, I can't figure out what projection is implemented when rho_x != 1. Moreover, I can't see how to derive the almost-symmetric indefinite linear system I + Q, as in the paper, with anything other than I as the left term. (The derivation uses a Schur complement of a bigger KKT system and a cancellation; the cancellation doesn't seem to work for anything other than I.)
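For what it's worth, the Schur-complement elimination itself goes through with a weight on the upper-left block: eliminating y from [[rho_x*I, A^T], [A, -I]] [x; y] = [b; c] gives (rho_x*I + A^T A) x = b + A^T c. A small numeric check of that identity (illustrative names and shapes, not the paper's exact system):

```python
# Hedged numeric check: the KKT elimination works for a weighted (1,1) block.
# Eliminating y from [[rho_x*I, A^T], [A, -I]] [x; y] = [b; c]
# yields (rho_x*I + A^T A) x = b + A^T c.
import numpy as np

def kkt_solve(A, b, c, rho_x):
    m, n = A.shape
    K = np.block([[rho_x * np.eye(n), A.T],
                  [A, -np.eye(m)]])
    xy = np.linalg.solve(K, np.concatenate([b, c]))
    return xy[:n]                       # return the x component

def schur_solve(A, b, c, rho_x):
    n = A.shape[1]
    return np.linalg.solve(rho_x * np.eye(n) + A.T @ A, b + A.T @ c)
```

Both routines agree for any rho_x > 0, since y = A x - c can always be substituted back regardless of the weight on the x block.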
Any pointers or hints would be greatly appreciated.