
cvxcore optimization #1255

Merged: 15 commits, Mar 2, 2021

Conversation

akshayka
Collaborator

Adds several optimizations to cvxcore, and adds parallelism.

This change makes the processing of LinOps in cvxcore's build_matrix function happen in parallel, provided that cvxcore was compiled with OpenMP. The user can choose the number of threads to be used via cvxpy.set_num_threads().

By default, cvxcore is not built with OpenMP, since some systems might not have it. To build with OpenMP, specify the compiler/linker flags at the command line, e.g.,

CFLAGS="-fopenmp" LDFLAGS="-lgomp" python setup.py install

That said, we should make sure that our wheels are built with OpenMP by default. I'd appreciate it if someone could help with that, in a future PR.

Additionally, this change gets rid of some unnecessary copies in cvxcore.

And revert to conservative sparse*sparse products in cvxcore ...
pruned products intermittently crash.
You can manually supply CFLAGS and LDFLAGS to enable OpenMP

TODO: wheels should be built with openmp
@SteveDiamond
Collaborator

Looks reasonable to me! I'd appreciate some comments explaining how you're doing things in a more copy-free/efficient way. Also looks like there are compile issues on Travis.

@akshayka
Collaborator Author

> Looks reasonable to me! I'd appreciate some comments explaining how you're doing things in a more copy-free/efficient way. Also looks like there are compile issues on Travis.

Thanks for the review! Previously, the copy constructor of SparseMatrix was invoked implicitly in many places --- in the creation of temporaries, when pushing back an existing DictMat or Matrix into an STL container, when passing arguments by value, iterating over arguments by value, and so on. The change fixes many instances of things like that.

I'll fix the Travis CI.

@rileyjmurray
Collaborator

Looks mostly good to me! I did ask one question in cvxcore.cpp about gracefully handling potential OOM errors. Other than that, I'd say we need to make sure to update the web documentation.

Edit: @SteveDiamond and @akshayka what do you think about adding in some tests (maybe with a flag on a certain TravisCI or AppVeyor build) to ensure multi-threaded cvxcore behaves as expected?

@akshayka
Collaborator Author

To add tests, the build would need to be configured to compile cvxcore with openmp. Do you have thoughts on how to do that cleanly?

@rileyjmurray
Collaborator

rileyjmurray commented Feb 28, 2021

The Travis config file can define an environment variable tailored to each build configuration. We'd need to edit the Travis installation script to check the value of that environment variable, and then proceed with or without OpenMP (as appropriate). We can make sure OpenMP is available on these systems by installing it from conda-forge. The one weird thing we'd need to do is explicitly set the number of OpenMP threads to 2 or 4 (https://docs.travis-ci.com/user/languages/c/#openmp-projects).

Edit: we use this kind of environment-variable build-flag logic when selecting the python version https://github.com/cvxgrp/cvxpy/blob/master/continuous_integration/Travis/install_dependencies.sh#L29.

Configurable via OMP_NUM_THREADS or cvxpy.set_num_threads()
@akshayka
Collaborator Author

akshayka commented Mar 1, 2021

Some timings: master, this branch with 1 thread, and this branch with 8 threads (test_benchmarks.py).

This branch, single-threaded, is somewhat faster than master.

Most of these problems don't have many expression trees that need to be canonicalized, so you don't see a big difference in timings. However, diffcp_sdp is 3x faster when using multiple threads.

EDIT 1: Note that most of these benchmarks time the get_problem_data method, which runs an entire reduction chain. So a (say) 2x speed up in build_matrix won't necessarily make the benchmarks run 2x faster.

EDIT 2: I found and eliminated additional copies, yielding significant savings. The updated numbers are below.

master (total time 15.7 seconds)

cone_matrix_stuffing_with_many_constraints: avg=1.117e+00 s , std=0.000e+00 s (1 iterations)
.diffcp_sdp: avg=7.330e+00 s , std=0.000e+00 s (1 iterations)
.least_squares: avg=4.506e-03 s , std=0.000e+00 s (1 iterations)
.sConic canonicalization
(ECOS) solver time:  0.038621366
cvxpy time:  0.5479763150455322
Quadratic canonicalization
(OSQP) solver time:  0.020118558
cvxpy time:  0.48471370976123046
.qp: avg=5.823e-03 s , std=0.000e+00 s (1 iterations)
.small_cone_matrix_stuffing: avg=3.141e-02 s , std=5.636e-03 s (10 iterations)
.small_lp: avg=1.476e-02 s , std=0.000e+00 s (1 iterations)
small_lp_second_time: avg=1.024e-03 s , std=0.000e+00 s (1 iterations)
.small_parameterized_cone_matrix_stuffing: avg=1.550e+00 s , std=0.000e+00 s (1 iterations)
.small_parameterized_lp: avg=1.503e+00 s , std=0.000e+00 s (1 iterations)
small_parameterized_lp_second_time: avg=2.394e-03 s , std=0.000e+00 s (1 iterations)
.tv_inpainting: avg=2.106e+00 s , std=0.000e+00 s (1 iterations)

1 thread (total time 11.7 seconds)

cone_matrix_stuffing_with_many_constraints: avg=1.043e+00 s , std=0.000e+00 s (1 iterations)
.diffcp_sdp: avg=4.350e+00 s , std=0.000e+00 s (1 iterations)
.least_squares: avg=4.452e-03 s , std=0.000e+00 s (1 iterations)
.sConic canonicalization
(ECOS) solver time:  0.038705834
cvxpy time:  0.44830249920617676
Quadratic canonicalization
(OSQP) solver time:  0.024713480000000003
cvxpy time:  0.4430909519152832
.qp: avg=1.032e-02 s , std=0.000e+00 s (1 iterations)
.small_cone_matrix_stuffing: avg=2.774e-02 s , std=5.836e-03 s (10 iterations)
.small_lp: avg=1.426e-02 s , std=0.000e+00 s (1 iterations)
small_lp_second_time: avg=1.017e-03 s , std=0.000e+00 s (1 iterations)
.small_parameterized_cone_matrix_stuffing: avg=1.292e+00 s , std=0.000e+00 s (1 iterations)
.small_parameterized_lp: avg=1.202e+00 s , std=0.000e+00 s (1 iterations)
small_parameterized_lp_second_time: avg=2.382e-03 s , std=0.000e+00 s (1 iterations)
.tv_inpainting: avg=1.906e+00 s , std=0.000e+00 s (1 iterations)

8 threads (total time 9.1 seconds)

cone_matrix_stuffing_with_many_constraints: avg=9.255e-01 s , std=0.000e+00 s (1 iterations)
.diffcp_sdp: avg=1.768e+00 s , std=0.000e+00 s (1 iterations)
.least_squares: avg=5.306e-03 s , std=0.000e+00 s (1 iterations)
.sConic canonicalization
(ECOS) solver time:  0.038623892
cvxpy time:  0.47473422414990235
Quadratic canonicalization
(OSQP) solver time:  0.020168179
cvxpy time:  0.46717441705407714
.qp: avg=6.624e-03 s , std=0.000e+00 s (1 iterations)
.small_cone_matrix_stuffing: avg=3.212e-02 s , std=8.298e-03 s (10 iterations)
.small_lp: avg=1.694e-02 s , std=0.000e+00 s (1 iterations)
small_lp_second_time: avg=1.043e-03 s , std=0.000e+00 s (1 iterations)
.small_parameterized_cone_matrix_stuffing: avg=1.181e+00 s , std=0.000e+00 s (1 iterations)
.small_parameterized_lp: avg=1.307e+00 s , std=0.000e+00 s (1 iterations)
small_parameterized_lp_second_time: avg=2.442e-03 s , std=0.000e+00 s (1 iterations)
.tv_inpainting: avg=1.830e+00 s , std=0.000e+00 s (1 iterations)

@akshayka
Collaborator Author

akshayka commented Mar 1, 2021

I made a few more small optimizations (eliminating copies), upgraded Eigen to 3.3.9, and switched to pruned sparse-sparse multiplication (which works in 3.3.9, but was broken in the version we were using).

There are still some copies that I wasn't able to get rid of. Most notably, get_constant_data copies out numerical data instead of mapping it. But there are other copies as well. We can address those in a later change.

The improvements are fairly significant. See below (test_benchmarks.py, all caveats from the above comment apply).

master (total time 15.6s)

cone_matrix_stuffing_with_many_constraints: avg=1.122e+00 s , std=0.000e+00 s (1 iterations)
.diffcp_sdp: avg=7.410e+00 s , std=0.000e+00 s (1 iterations)
.least_squares: avg=4.661e-03 s , std=0.000e+00 s (1 iterations)
.sConic canonicalization
(ECOS) solver time:  0.039003118
cvxpy time:  0.5048685231590576
Quadratic canonicalization
(OSQP) solver time:  0.020439915
cvxpy time:  0.4658264595880127
.qp: avg=5.785e-03 s , std=0.000e+00 s (1 iterations)
.small_cone_matrix_stuffing: avg=2.912e-02 s , std=5.820e-03 s (10 iterations)
.small_lp: avg=1.465e-02 s , std=0.000e+00 s (1 iterations)
small_lp_second_time: avg=1.029e-03 s , std=0.000e+00 s (1 iterations)
.small_parameterized_cone_matrix_stuffing: avg=1.470e+00 s , std=0.000e+00 s (1 iterations)
.small_parameterized_lp: avg=1.450e+00 s , std=0.000e+00 s (1 iterations)
small_parameterized_lp_second_time: avg=2.410e-03 s , std=0.000e+00 s (1 iterations)
.tv_inpainting: avg=2.087e+00 s , std=0.000e+00 s (1 iterations)

this PR, 1 thread (total time 10.5s)

cone_matrix_stuffing_with_many_constraints: avg=8.530e-01 s , std=0.000e+00 s (1 iterations)
.diffcp_sdp: avg=3.764e+00 s , std=0.000e+00 s (1 iterations)
.least_squares: avg=4.784e-03 s , std=0.000e+00 s (1 iterations)
.sConic canonicalization
(ECOS) solver time:  0.039389905
cvxpy time:  0.392903271651001
Quadratic canonicalization
(OSQP) solver time:  0.020359027999999998
cvxpy time:  0.42363215623461914
.qp: avg=5.808e-03 s , std=0.000e+00 s (1 iterations)
.small_cone_matrix_stuffing: avg=2.845e-02 s , std=6.201e-03 s (10 iterations)
.small_lp: avg=1.412e-02 s , std=0.000e+00 s (1 iterations)
small_lp_second_time: avg=1.026e-03 s , std=0.000e+00 s (1 iterations)
.small_parameterized_cone_matrix_stuffing: avg=1.211e+00 s , std=0.000e+00 s (1 iterations)
.small_parameterized_lp: avg=1.122e+00 s , std=0.000e+00 s (1 iterations)
small_parameterized_lp_second_time: avg=2.473e-03 s , std=0.000e+00 s (1 iterations)
.tv_inpainting: avg=1.724e+00 s , std=0.000e+00 s (1 iterations)

this PR, 8 threads (total time 8.1s)

cone_matrix_stuffing_with_many_constraints: avg=7.508e-01 s , std=0.000e+00 s (1 iterations)
.diffcp_sdp: avg=1.510e+00 s , std=0.000e+00 s (1 iterations)
.least_squares: avg=5.197e-03 s , std=0.000e+00 s (1 iterations)
.sConic canonicalization
(ECOS) solver time:  0.038531578
cvxpy time:  0.4179091091795654
Quadratic canonicalization
(OSQP) solver time:  0.02024423
cvxpy time:  0.4240571368060303
.qp: avg=6.654e-03 s , std=0.000e+00 s (1 iterations)
.small_cone_matrix_stuffing: avg=3.045e-02 s , std=6.115e-03 s (10 iterations)
.small_lp: avg=1.897e-02 s , std=0.000e+00 s (1 iterations)
small_lp_second_time: avg=1.156e-03 s , std=0.000e+00 s (1 iterations)
.small_parameterized_cone_matrix_stuffing: avg=1.092e+00 s , std=0.000e+00 s (1 iterations)
.small_parameterized_lp: avg=1.097e+00 s , std=0.000e+00 s (1 iterations)
small_parameterized_lp_second_time: avg=2.428e-03 s , std=0.000e+00 s (1 iterations)
.tv_inpainting: avg=1.741e+00 s , std=0.000e+00 s (1 iterations)

@akshayka
Collaborator Author

akshayka commented Mar 2, 2021

I believe this PR is almost done. There are two outstanding things.

  1. Testing multithreaded compilation on Travis.

I've added an entry to the build matrix that builds and tests with OpenMP enabled. This works on linux, but is currently failing on macOS. @rileyjmurray or @SteveDiamond , can you help me get the macOS build passing? I'm not exactly sure what's wrong, because Travis is truncating the log. My guess is that I'm passing the wrong compiler/linker flags to gcc. I don't own a mac, so I can't test this locally.

EDIT: Per offline discussion, I've disabled testing with OpenMP on macOS.

  2. Web documentation.

My vote is to add the web docs in a future PR. Right now we don't really say anything about performance on cvxpy.org. I am of the opinion that we should add a new section on performance. This section will emphasize the fact that CVXPY is a compiler, and will have tips on how to get better compilation performance, including compiling with OpenMP.

@rileyjmurray
Collaborator

The number of lines changed jumped massively, but I trust this is just because of updating Eigen. Approved!
