
cvxcore optimization #1255

Merged: 15 commits, Mar 2, 2021

Conversation

akshayka
Collaborator

Adds several optimizations to cvxcore, and adds parallelism.

This change makes the processing of LinOps in cvxcore's build_matrix function happen in parallel, provided that cvxcore was compiled with OpenMP. The user can choose the number of threads to be used via cvxpy.set_num_threads().

By default, cvxcore is not built with OpenMP, since some systems might not have it. To build with OpenMP, specify the compiler/linker flags at the command line, e.g.,

CFLAGS="-fopenmp" LDFLAGS="-lgomp" python setup.py install

That said, we should make sure that our wheels are built with OpenMP by default. I'd appreciate it if someone could help with that, in a future PR.

Additionally, this change gets rid of some unnecessary copies in cvxcore.

And revert to conservative sparse*sparse products in cvxcore ...
pruned products intermittently crash.
You can manually supply CFLAGS and LDFLAGS to enable OpenMP

TODO: wheels should be built with openmp
@SteveDiamond
Collaborator

Looks reasonable to me! I'd appreciate some comments explaining how you're doing things in a more copy-free/efficient way. Also looks like there are compile issues on Travis.

@akshayka
Collaborator Author

> Looks reasonable to me! I'd appreciate some comments explaining how you're doing things in a more copy-free/efficient way. Also looks like there are compile issues on Travis.

Thanks for the review! Previously, the copy constructor of SparseMatrix was invoked implicitly in many places --- in the creation of temporaries, when pushing back an existing DictMat or Matrix into an STL container, when passing arguments by value, iterating over arguments by value, and so on. The change fixes many instances of things like that.

I'll fix the Travis CI.

@rileyjmurray
Collaborator

Looks mostly good to me! I did ask one question in cvxcore.cpp about gracefully handling potential OOM errors. Other than that, I'd say we need to make sure to update the web documentation.

Edit: @SteveDiamond and @akshayka what do you think about adding in some tests (maybe with a flag on a certain TravisCI or AppVeyor build) to ensure multi-threaded cvxcore behaves as expected?

@akshayka
Collaborator Author

To add tests, the build would need to be configured to compile cvxcore with openmp. Do you have thoughts on how to do that cleanly?

@rileyjmurray
Collaborator

rileyjmurray commented Feb 28, 2021

The Travis config file can define an environment variable tailored to each build configuration. We'd need to edit the Travis installation script to check the value of that environment variable, and then proceed with or without OpenMP (as appropriate). We can make sure OpenMP is available on these systems by installing it from conda-forge. The one weird thing we'd need to do is explicitly set the number of OpenMP threads to 2 or 4 (https://docs.travis-ci.com/user/languages/c/#openmp-projects).

Edit: we use this kind of environment-variable build-flag logic when selecting the python version https://github.com/cvxgrp/cvxpy/blob/master/continuous_integration/Travis/install_dependencies.sh#L29.

Configurable via OMP_NUM_THREADS or cvxpy.set_num_threads()
@akshayka
Collaborator Author

akshayka commented Mar 1, 2021

Some timings: master, this branch with 1 thread, and this branch with 8 threads (test_benchmarks.py).

This branch, single-threaded, is somewhat faster than master.

Most of these problems don't have many expression trees that need to be canonicalized, so you don't see a big difference in timings. However, diffcp_sdp is 3x faster when using multiple threads.

EDIT 1: Note that most of these benchmarks time the get_problem_data method, which runs an entire reduction chain. So a (say) 2x speed up in build_matrix won't necessarily make the benchmarks run 2x faster.

EDIT 2: I found and eliminated additional copies, yielding significant savings. The updated numbers are below.

master (total time 15.7 seconds)

cone_matrix_stuffing_with_many_constraints: avg=1.117e+00 s , std=0.000e+00 s (1 iterations)
.diffcp_sdp: avg=7.330e+00 s , std=0.000e+00 s (1 iterations)
.least_squares: avg=4.506e-03 s , std=0.000e+00 s (1 iterations)
.sConic canonicalization
(ECOS) solver time:  0.038621366
cvxpy time:  0.5479763150455322
Quadratic canonicalization
(OSQP) solver time:  0.020118558
cvxpy time:  0.48471370976123046
.qp: avg=5.823e-03 s , std=0.000e+00 s (1 iterations)
.small_cone_matrix_stuffing: avg=3.141e-02 s , std=5.636e-03 s (10 iterations)
.small_lp: avg=1.476e-02 s , std=0.000e+00 s (1 iterations)
small_lp_second_time: avg=1.024e-03 s , std=0.000e+00 s (1 iterations)
.small_parameterized_cone_matrix_stuffing: avg=1.550e+00 s , std=0.000e+00 s (1 iterations)
.small_parameterized_lp: avg=1.503e+00 s , std=0.000e+00 s (1 iterations)
small_parameterized_lp_second_time: avg=2.394e-03 s , std=0.000e+00 s (1 iterations)
.tv_inpainting: avg=2.106e+00 s , std=0.000e+00 s (1 iterations)

1 thread (total time 11.7 seconds)

cone_matrix_stuffing_with_many_constraints: avg=1.043e+00 s , std=0.000e+00 s (1 iterations)
.diffcp_sdp: avg=4.350e+00 s , std=0.000e+00 s (1 iterations)
.least_squares: avg=4.452e-03 s , std=0.000e+00 s (1 iterations)
.sConic canonicalization
(ECOS) solver time:  0.038705834
cvxpy time:  0.44830249920617676
Quadratic canonicalization
(OSQP) solver time:  0.024713480000000003
cvxpy time:  0.4430909519152832
.qp: avg=1.032e-02 s , std=0.000e+00 s (1 iterations)
.small_cone_matrix_stuffing: avg=2.774e-02 s , std=5.836e-03 s (10 iterations)
.small_lp: avg=1.426e-02 s , std=0.000e+00 s (1 iterations)
small_lp_second_time: avg=1.017e-03 s , std=0.000e+00 s (1 iterations)
.small_parameterized_cone_matrix_stuffing: avg=1.292e+00 s , std=0.000e+00 s (1 iterations)
.small_parameterized_lp: avg=1.202e+00 s , std=0.000e+00 s (1 iterations)
small_parameterized_lp_second_time: avg=2.382e-03 s , std=0.000e+00 s (1 iterations)
.tv_inpainting: avg=1.906e+00 s , std=0.000e+00 s (1 iterations)

8 threads (total time 9.1 seconds)

cone_matrix_stuffing_with_many_constraints: avg=9.255e-01 s , std=0.000e+00 s (1 iterations)
.diffcp_sdp: avg=1.768e+00 s , std=0.000e+00 s (1 iterations)
.least_squares: avg=5.306e-03 s , std=0.000e+00 s (1 iterations)
.sConic canonicalization
(ECOS) solver time:  0.038623892
cvxpy time:  0.47473422414990235
Quadratic canonicalization
(OSQP) solver time:  0.020168179
cvxpy time:  0.46717441705407714
.qp: avg=6.624e-03 s , std=0.000e+00 s (1 iterations)
.small_cone_matrix_stuffing: avg=3.212e-02 s , std=8.298e-03 s (10 iterations)
.small_lp: avg=1.694e-02 s , std=0.000e+00 s (1 iterations)
small_lp_second_time: avg=1.043e-03 s , std=0.000e+00 s (1 iterations)
.small_parameterized_cone_matrix_stuffing: avg=1.181e+00 s , std=0.000e+00 s (1 iterations)
.small_parameterized_lp: avg=1.307e+00 s , std=0.000e+00 s (1 iterations)
small_parameterized_lp_second_time: avg=2.442e-03 s , std=0.000e+00 s (1 iterations)
.tv_inpainting: avg=1.830e+00 s , std=0.000e+00 s (1 iterations)

@akshayka
Collaborator Author

akshayka commented Mar 1, 2021

I made a few more small optimizations (eliminating copies), upgraded Eigen to 3.3.9, and switched to pruned sparse-sparse multiplication (which works in 3.3.9, but was broken in the version we were using).

There are still some copies that I wasn't able to get rid of. Most notably, get_constant_data copies out numerical data instead of mapping it. But there are other copies as well. We can address those in a later change.

The improvements are fairly significant. See below (test_benchmarks.py, all caveats from the above comment apply).

master (total time 15.6s)

cone_matrix_stuffing_with_many_constraints: avg=1.122e+00 s , std=0.000e+00 s (1 iterations)
.diffcp_sdp: avg=7.410e+00 s , std=0.000e+00 s (1 iterations)
.least_squares: avg=4.661e-03 s , std=0.000e+00 s (1 iterations)
.sConic canonicalization
(ECOS) solver time:  0.039003118
cvxpy time:  0.5048685231590576
Quadratic canonicalization
(OSQP) solver time:  0.020439915
cvxpy time:  0.4658264595880127
.qp: avg=5.785e-03 s , std=0.000e+00 s (1 iterations)
.small_cone_matrix_stuffing: avg=2.912e-02 s , std=5.820e-03 s (10 iterations)
.small_lp: avg=1.465e-02 s , std=0.000e+00 s (1 iterations)
small_lp_second_time: avg=1.029e-03 s , std=0.000e+00 s (1 iterations)
.small_parameterized_cone_matrix_stuffing: avg=1.470e+00 s , std=0.000e+00 s (1 iterations)
.small_parameterized_lp: avg=1.450e+00 s , std=0.000e+00 s (1 iterations)
small_parameterized_lp_second_time: avg=2.410e-03 s , std=0.000e+00 s (1 iterations)
.tv_inpainting: avg=2.087e+00 s , std=0.000e+00 s (1 iterations)

this PR, 1 thread (total time 10.5s)

cone_matrix_stuffing_with_many_constraints: avg=8.530e-01 s , std=0.000e+00 s (1 iterations)
.diffcp_sdp: avg=3.764e+00 s , std=0.000e+00 s (1 iterations)
.least_squares: avg=4.784e-03 s , std=0.000e+00 s (1 iterations)
.sConic canonicalization
(ECOS) solver time:  0.039389905
cvxpy time:  0.392903271651001
Quadratic canonicalization
(OSQP) solver time:  0.020359027999999998
cvxpy time:  0.42363215623461914
.qp: avg=5.808e-03 s , std=0.000e+00 s (1 iterations)
.small_cone_matrix_stuffing: avg=2.845e-02 s , std=6.201e-03 s (10 iterations)
.small_lp: avg=1.412e-02 s , std=0.000e+00 s (1 iterations)
small_lp_second_time: avg=1.026e-03 s , std=0.000e+00 s (1 iterations)
.small_parameterized_cone_matrix_stuffing: avg=1.211e+00 s , std=0.000e+00 s (1 iterations)
.small_parameterized_lp: avg=1.122e+00 s , std=0.000e+00 s (1 iterations)
small_parameterized_lp_second_time: avg=2.473e-03 s , std=0.000e+00 s (1 iterations)
.tv_inpainting: avg=1.724e+00 s , std=0.000e+00 s (1 iterations)

this PR, 8 threads (total time 8.1s)

cone_matrix_stuffing_with_many_constraints: avg=7.508e-01 s , std=0.000e+00 s (1 iterations)
.diffcp_sdp: avg=1.510e+00 s , std=0.000e+00 s (1 iterations)
.least_squares: avg=5.197e-03 s , std=0.000e+00 s (1 iterations)
.sConic canonicalization
(ECOS) solver time:  0.038531578
cvxpy time:  0.4179091091795654
Quadratic canonicalization
(OSQP) solver time:  0.02024423
cvxpy time:  0.4240571368060303
.qp: avg=6.654e-03 s , std=0.000e+00 s (1 iterations)
.small_cone_matrix_stuffing: avg=3.045e-02 s , std=6.115e-03 s (10 iterations)
.small_lp: avg=1.897e-02 s , std=0.000e+00 s (1 iterations)
small_lp_second_time: avg=1.156e-03 s , std=0.000e+00 s (1 iterations)
.small_parameterized_cone_matrix_stuffing: avg=1.092e+00 s , std=0.000e+00 s (1 iterations)
.small_parameterized_lp: avg=1.097e+00 s , std=0.000e+00 s (1 iterations)
small_parameterized_lp_second_time: avg=2.428e-03 s , std=0.000e+00 s (1 iterations)
.tv_inpainting: avg=1.741e+00 s , std=0.000e+00 s (1 iterations)

@akshayka
Collaborator Author

akshayka commented Mar 2, 2021

I believe this PR is almost done. There are two outstanding things.

  1. Testing multithreaded compilation on Travis.

I've added an entry to the build matrix that builds and tests with OpenMP enabled. This works on linux, but is currently failing on macOS. @rileyjmurray or @SteveDiamond , can you help me get the macOS build passing? I'm not exactly sure what's wrong, because Travis is truncating the log. My guess is that I'm passing the wrong compiler/linker flags to gcc. I don't own a mac, so I can't test this locally.

EDIT: Per offline discussion, I've disabled testing with OpenMP on macOS.

  2. Web documentation.

My vote is to add the web docs in a future PR. Right now we don't really say anything about performance on cvxpy.org. I am of the opinion that we should add a new section on performance. This section will emphasize the fact that CVXPY is a compiler, and will have tips on how to get better compilation performance, including compiling with OpenMP.

@rileyjmurray
Collaborator

The number of lines changed jumped massively, but I trust this is just because of updating Eigen. Approved!
