Tips and tricks

Mario Zanon edited this page Feb 23, 2016 · 12 revisions

This page lists some tips and tricks that can speed up your algorithms written with CasADi or improve their convergence.

Matrix multiplication

The current implementation of general sparse matrix multiplication is rather slow. The bottleneck is the product of the sparsity patterns. Instead of using DMatrix multiplication inside a loop, construct an MXFunction outside the loop and evaluate it inside the loop. This way, the sparsity pattern only has to be computed once.

Before:

A = huge sparse DMatrix
B = huge sparse DMatrix


loop: 
  C = mul(A,B)

After:

A = huge sparse DMatrix
B = huge sparse DMatrix

f = MXFunction([A,B],[mul(A,B)])
f.init()
loop: 
  f.evaluate()
  C = f.output()

It is not unusual to see a speed-up of a factor of 1000. For even faster evaluation of the matrix product, convert the MXFunction to an SXFunction and possibly even to C code (see below).

Automatic typecasting

Numbers, numpy arrays/matrices and Octave arrays all get automatically converted to numeric SX and MX nodes when necessary. Each conversion creates unique nodes in the trees. To avoid excessive growth of the tree when looping, pre-convert the numerics to SX or MX.

Before:

A = numpy array
B = SXMatrix

C = 0
loop:
  C += mul(A,B)

After:

A = numpy array
B = SXMatrix

A = SXMatrix(A)

C = 0
loop:
  C += mul(A,B)

You can inspect the size of trees with the countNodes method.
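
For example, a quick check of how much the pre-conversion helps (a sketch, assuming countNodes is available as a free function; in some CasADi versions it is a method of the expression instead):

C = mul(A,B)
print countNodes(C)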

Avoiding DMatrix for numerics

DMatrix is not intended for computationally heavy operations, such as inside time-critical loops. Inside CasADi, it is mostly used to store matrices rather than to perform operations on them, and its design decisions are largely dictated by the use of the same template (Matrix<>) for symbolic operations. That said, it uses a standard general sparse matrix format (compressed column storage), so if a particular operation is slow, it should be possible to improve it by implementing standard textbook algorithms or by interfacing external code. What has been, and continues to be, optimized in CasADi are the SX and MX virtual machines (implemented as SXFunction and MXFunction), which perform a sequence of scalar- or matrix-valued operations.

For time-critical calculations in Python, use the scipy datatypes instead. Conversion from a DMatrix A to a scipy sparse matrix B is easy with B = A.toCsc_matrix(). Conversion in the other direction is done with the DMatrix constructor, A = DMatrix(B), or implicitly.
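
Putting the two conversions together, a typical round trip looks like this (a sketch; the middle step stands in for whatever scipy routines you need):

A = some DMatrix
B = A.toCsc_matrix()   # DMatrix -> scipy sparse csc_matrix
B = heavy scipy computations on B
A = DMatrix(B)         # back to DMatrix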

In Octave, use the native sparse matrix format.

In C++, use a package such as Eigen (or uBLAS/MTL).

Note: if you are using DMatrix just to store numerical data, or to perform operations on it outside critical loops, we recommend making use of DMatrix in your script files. It hosts a variety of indexing/slicing operations designed to make your life easier.

Convert MXFunction to SXFunction

MXFunction is not yet nearly as efficient as SXFunction, so for small- to medium-size problems you can achieve speedups of several orders of magnitude by converting an MXFunction to an SXFunction:

fcn_mx = MXFunction(...,...)
fcn_mx.init()
fcn_sx = SXFunction(fcn_mx)

C code generation - Just-in-time compilation

For even faster evaluation, C code can be generated from an SXFunction. This can increase evaluation speed by a factor of 10, but comes at the expense of long compilation times and limits on how large the expressions can be. The C code can be compiled "on the fly" if desired, by calling the system compiler to build a dynamically linked library that can be loaded into a running program. Have a look at examples/python/c_code_generation.py for a demonstration of how to use this feature. If you notice that compilation takes a long time, turn off compiler code optimization. Using GCC's "-O3" might generate 30% faster code, but can at the same time slow down compilation by a factor of 10 or more.
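
A rough sketch of the workflow, using the generateCode and ExternalFunction names from CasADi 2.x (check the example above for the exact calls in your version):

fcn_sx.generateCode("fcn.c")
compile: gcc -fPIC -shared fcn.c -o fcn.so   # consider omitting -O3
fcn_ext = ExternalFunction("./fcn.so")
fcn_ext.init()
loop:
  fcn_ext.evaluate()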

Sundials suite

The default options in Sundials do not necessarily give you good performance. The "print_stats" option may reveal that a lot of time is spent in the finite-difference calculation of the right-hand side needed for the linear solver. There is a fair chance that the options {"exact_jacobian": True} and/or {"linear_solver": "iterative"} will boost performance.
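
With the pre-3.0 option interface, setting these could look like the following sketch (setOption is assumed; integrator stands for whichever Sundials integrator object you created):

integrator.setOption("print_stats", True)          # see where time is spent
integrator.setOption("exact_jacobian", True)       # avoid finite differences
integrator.setOption("linear_solver", "iterative")
integrator.init()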

Substitute

A pattern that may arise in CasADi code is:

f = substitute(f,a,A)
f = substitute(f,b,B)

This will, in general, create a larger-than-necessary expression graph when A and B share common subexpressions.

If this is the case, try:

f = substitute(f,vertcat([a, b]),vertcat([A, B]))