Documentation of rho_x parameter #150
Comments
Thanks for the question Thomas. You're right that this isn't documented anywhere; I'll leave this issue open so I remember to document all the parameters carefully.

The rho_x term enters in the linear system solve (basically as the ADMM step-size parameter, as you say), but it weights only the x component, which is why it actually makes a difference. Instead of solving z = (I+Q)^{-1} r we solve with the matrix D + Q, where D = diag(rho_x * I_n, I_{m+1}), I_n and I_{m+1} are identity matrices of dimension n and m+1 respectively, and Q is of shape (n+m+1) x (n+m+1).

The reason for using a small rho_x, like 1e-6, is that unlike the 'y' variable the 'x' variable does not have to lie in a cone (other than the 'free' cone), so the output of the projection step matters less for x, and we should down-weight its contribution from the cone projection step. Note that the algorithm could be adapted to allow 'x' to lie in a cone too and include it in the projection, in which case a small rho_x would no longer be appropriate.

In fact the algorithm has all the same guarantees with any generic diagonal, positive definite matrix in place of I. Let's call such a matrix D. When we use (D+Q)^{-1} instead of (I+Q)^{-1} we also need to scale the input vector by the same quantity, so we actually solve z = (D+Q)^{-1} D * r. The same D is used in the projection step, but it makes no difference there; and in the dual variable update step, in the special case D = diag(rho_x * I_n, I_{m+1}) with x in the free cone, it makes no difference either.

I'm a bit rusty on this part, but I think it can be justified as follows: in ADMM we're solving problems of the form min. f(x) + g(z) subject to x = z, but this is equivalent to the problem min. f(x) + g(z) subject to D x = D z for any positive definite matrix D (diagonal being a special case). If you derive the algorithm using that weighting on the equality constraint, you should get what we have in the code.
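The weighted solve described above can be sketched numerically. This is an illustrative toy (dense numpy, made-up names), not the actual SCS internals, but it shows how `rho_x` down-weights the x block in `z = (D+Q)^{-1} D r`:

```python
# Hedged sketch of the rho_x-weighted subproblem solve described above.
# Q is the (n+m+1) x (n+m+1) data matrix; names are illustrative only.
import numpy as np

def weighted_solve(Q, r, n, rho_x):
    """Solve z = (D + Q)^{-1} D r with D = diag(rho_x * I_n, I_{m+1})."""
    N = Q.shape[0]
    d = np.ones(N)
    d[:n] = rho_x                      # down-weight only the x block
    return np.linalg.solve(np.diag(d) + Q, d * r)
```

With `rho_x = 1` this reduces to the unweighted solve `z = (I + Q)^{-1} r`, so the weighting is a strict generalization.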
Thanks for the prompt response! It's a great help, and I'll work through the details. The reason I started digging into this is that I have a hard time getting dual residuals below 5e-2 on a fairly large-scale problem (SDP cone of dim ~1k, 60k linear constraints). It's workable, but I'm hoping to gain one more order of magnitude. Based on your info, it seems like a comparatively small rho_x should help the dual residual.

I'm also curious whether the Sherman-Morrison-Woodbury step could be an issue. I'm using the Python linsys with a direct solver that benefits from symbolic elimination. The residuals from the solver benefit from iterative refinement (typical unrefined residuals are around 1e-4, but 1e-10 after refinement). Refining at the level of the linsys API doesn't seem to improve the dual residual, but I think I need to do refinement after the SMW step. I'm not using equilibration, but my A matrix is already fairly well infinity-norm equilibrated, though its 2-norm conditioning is more severe. Let me know if you have any thoughts.
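The iterative refinement mentioned above is the classical residual-correction loop. A minimal sketch, assuming `factor_solve` stands in for whatever inexact direct solve the Python linsys callback provides (not any actual SCS API):

```python
# Hedged sketch of classical iterative refinement around an inexact solve.
# `factor_solve` is a hypothetical stand-in for the linsys solve callback.
import numpy as np

def refine(M, b, factor_solve, iters=2):
    """Improve x ~ M^{-1} b by re-solving against the running residual."""
    x = factor_solve(b)
    for _ in range(iters):
        r = b - M @ x                  # residual of the current solution
        x = x + factor_solve(r)        # correct x with the inexact solver
    return x
```

The point made above is that refining only the inner factorized solve is not enough: if the SMW combination step reintroduces error, the refinement residual has to be computed against the full system that SMW is implicitly solving.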
I'm working on SCS v3 right now, which makes a lot of improvements, including automatically adapting the scale parameter and making the acceleration more stable overall.
I'll look at the branch, but I think the Python linsys is a must-have. The problem involves a large number of low-rank constraints, so the A and A^T operators make heavy use of converting outer products to inner products, as in Tr(U U^T X) = Tr(U^T X U). The theoretical computational complexity is the same, but reducing the memory requirements allows better locality and caching. The qdldl/colamd solver seems to suffer from disastrous fill-in as well. If MA86 or PARDISO have somehow become more available, they would be worth trying out, but we don't have any good licensing options at the moment.
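The outer-to-inner-product trick above is just cyclicity of the trace, but it matters for memory: the constraint value <U U^T, X> can be evaluated without ever materializing the dense n x n matrix U U^T. A minimal sketch (illustrative names, not SCS code):

```python
# Hedged sketch of the identity Tr(U U^T X) = Tr(U^T X U) used above to
# avoid forming the dense rank-k outer product U U^T.
import numpy as np

def low_rank_inner(U, X):
    """<U U^T, X> via Tr(U^T X U): O(n^2 k) time, only O(n k) extra memory."""
    return np.trace(U.T @ X @ U)
```

For n ~ 1000 and k small, this replaces an n x n temporary with an n x k one, which is where the locality and caching benefit comes from.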
I figured it out, have updated my notes, and attached them for the benefit of anyone who finds them. It is interesting that by reducing the weight of x in the projection, we also unintentionally make the projection onto the zero cone (the r component) less tethered. Another interesting possibility for weighting would be diag(D, D^{-1}), so that the projection would be strongly tethered to the zero cone. This would cost two linear solves at each step and require computation of the dual variables, but I'm curious whether it was ever tried.
Thanks for posting this. Has this improved the performance sufficiently? If not, you should try modifying the scale parameter as I suggested. I realize you are not using the built-in normalization (equilibration) routine, so the scale parameter you pass to SCS does nothing.
I haven't had a chance to look at this for a few days. Will keep you posted.
The rho_x parameter isn't documented in either the JOTA paper or its longer version. Judging by its name, this variable seems like it should be the ADMM step-size parameter of traditional presentations. However, it cannot be, because the alternating minimization steps are homogeneous in rho, which implies that the dual variable update step is also independent of rho (v^k = \lambda^k, etc.).
From a reverse-engineering perspective, I can't figure out what projection is implemented when rho_x != 1. Moreover, I can't see how to derive the almost-symmetric indefinite linear system I + Q, as in the paper, with anything other than I as the left term. (The derivation uses a Schur complement of a bigger KKT system and a cancellation; the cancellation doesn't seem to work for anything other than I.)
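For what it's worth, the Schur-complement elimination itself goes through with a weight on the upper-left block: eliminating y from [[rho_x*I, A^T], [A, -I]] [x; y] = [b; c] gives (rho_x*I + A^T A) x = b + A^T c. A small numeric check of that identity (illustrative names and shapes, not the paper's exact system):

```python
# Hedged numeric check: the KKT elimination works for a weighted (1,1) block.
# Eliminating y from [[rho_x*I, A^T], [A, -I]] [x; y] = [b; c]
# yields (rho_x*I + A^T A) x = b + A^T c.
import numpy as np

def kkt_solve(A, b, c, rho_x):
    m, n = A.shape
    K = np.block([[rho_x * np.eye(n), A.T],
                  [A, -np.eye(m)]])
    xy = np.linalg.solve(K, np.concatenate([b, c]))
    return xy[:n]                       # return the x component

def schur_solve(A, b, c, rho_x):
    n = A.shape[1]
    return np.linalg.solve(rho_x * np.eye(n) + A.T @ A, b + A.T @ c)
```

Both routines agree for any rho_x > 0, since y = A x - c can always be substituted back regardless of the weight on the x block.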
Any pointers or hints would be greatly appreciated.