
Switch to autodiff? #65

Closed
ianhbell opened this issue Oct 30, 2022 · 23 comments

@ianhbell

In implementing teqp (https://github.com/usnistgov/teqp), I found that automatic differentiation was MUCH faster than complex-step derivatives and multicomplex numbers. Since you wrote the library in Rust, have you benchmarked your hyper-dual derivatives against autodiff? I used this package in C++: https://github.com/autodiff/autodiff. I think you could also use it in Rust, as Rust is C++-derived?

@prehner
Contributor

prehner commented Oct 30, 2022

Thank you for your insights! We use the num-dual package (https://github.com/itt-ustutt/num-dual) which uses forward mode automatic differentiation to calculate first and higher order derivatives similar to what autodiff does.

While Rust fills a somewhat similar niche to C++ and uses a comparable syntax, there is no further connection. Rust was originally bootstrapped from an OCaml compiler and compiles via LLVM. External C or C++ libraries can be linked, but in our case we would lose a lot of flexibility.

Have you specifically compared the performance of forward- and reverse-mode AD? The Helmholtz energy, as a single-output function, appears to be the ideal case for reverse mode, but I always wonder whether the overhead of maintaining an expression tree (as autodiff calls it) is worth it.
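For readers unfamiliar with forward-mode AD: the core idea behind dual numbers is operator overloading that propagates derivatives alongside values. The following is a minimal illustrative sketch, not num-dual's actual API:

```rust
// Minimal sketch of forward-mode AD with a scalar dual number.
// The struct and field names are illustrative, not num-dual's API.
#[derive(Clone, Copy, Debug)]
struct Dual {
    re: f64,  // function value
    eps: f64, // first derivative
}

impl std::ops::Add for Dual {
    type Output = Dual;
    fn add(self, rhs: Dual) -> Dual {
        Dual { re: self.re + rhs.re, eps: self.eps + rhs.eps }
    }
}

impl std::ops::Mul for Dual {
    type Output = Dual;
    fn mul(self, rhs: Dual) -> Dual {
        // product rule: (uv)' = u'v + uv'
        Dual { re: self.re * rhs.re, eps: self.eps * rhs.re + self.re * rhs.eps }
    }
}

// f(x) = x^2 + x, so f'(x) = 2x + 1
fn f(x: Dual) -> Dual {
    x * x + x
}

fn main() {
    // seed eps = 1 to differentiate with respect to x
    let x = Dual { re: 3.0, eps: 1.0 };
    let y = f(x);
    println!("f(3) = {}, f'(3) = {}", y.re, y.eps); // f(3) = 12, f'(3) = 7
}
```

Higher-order derivatives work the same way with hyper-dual numbers that carry additional derivative components.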

@longemen3000
Contributor

I did some tests on that in Julia (https://github.com/ypaul21/Clapeyron.jl), and reverse mode is not worth it for anything other than the chemical potential; even there, forward mode has less overhead.

But then again, I haven't checked which AD mode from autodiff teqp uses.
From the autodiff FAQ:

Is one algorithm always faster than another?

Even though the reverse mode algorithm requires a single function evaluation and all derivatives are subsequently
computed in a single pass, as the constructed expression tree is traversed, the forward algorithm can still be a much more efficient choice for your application, despite its multiple, repeated function evaluations, one for each input variable. This is because the implementation of the forward mode algorithm in autodiff uses template meta-programming techniques to avoid as much as possible temporary memory allocations and to optimize the calculations in certain cases. The reverse mode algorithm, on the other hand, requires the construction of an expression tree at runtime, and, for this, several dynamic memory allocations are needed, which can be costly. We plan to implement alternative versions of this algorithm, in which memory allocation could be done in advance to decrease the number of subsequent allocations. This, however, will require a slightly more complicated usage than it is already provided by the reverse mode algorithm implemented in autodiff.

So autodiff's forward mode relies on C++ expression templates (compile-time code rewriting), whereas num-dual uses plain operator-overloading AD. Either way, it implies that for autodiff to work, it requires full access to the C++ source. For Rust, there is Enzyme, which performs code rewriting at the LLVM IR level and has Rust bindings.

@ianhbell
Author

@prehner I guess I misunderstood. Didn't you previously use some complex math derivatives in FEOS? I recall citing something along those lines in our paper about teqp.

More generally, I would be curious to see a speed comparison between FEOS and teqp. I think the derivatives in Clapeyron were a fair bit slower than in C++, which is no great surprise. At least that was the case when @pw0908 and I looked at it.

In my experience, the forward mode (what I use) seems to be the "right" answer.

@prehner
Contributor

prehner commented Oct 31, 2022

@longemen3000 Good to know that forward AD worked well for you; that implies we're on a good path.

A comment on the operator overloading in Rust: Rust uses compile-time polymorphism, i.e., the different variants (floats, first derivatives, higher derivatives, ...) are all compiled separately, similar to templates in C++. Therefore, the generated code can leverage all low-level compiler optimizations and becomes very efficient.

[...] the forward algorithm can still be a much more efficient choice for your application, despite its multiple, repeated function evaluations, one for each input variable.

Currently we do multiple evaluations for multiple derivatives (e.g. chemical potentials) because it is more efficient than having dynamically sized types and heap allocations in every operation. In general, however, it is also possible to calculate gradients in a single pass using forward mode AD.
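The single-pass gradient idea can be sketched with a dual number that carries a statically sized gradient array (so no heap allocations per operation). This is an illustrative re-implementation of the concept, not num-dual's actual vector-dual types:

```rust
// Sketch: a dual number carrying a statically sized gradient, so all N
// partial derivatives come out of one forward pass without heap
// allocations. Illustrative only, not num-dual's actual API.
#[derive(Clone, Copy, Debug)]
struct DualVec<const N: usize> {
    re: f64,
    grad: [f64; N],
}

impl<const N: usize> std::ops::Mul for DualVec<N> {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        let mut grad = [0.0; N];
        for i in 0..N {
            // product rule applied component-wise to the gradient
            grad[i] = self.grad[i] * rhs.re + self.re * rhs.grad[i];
        }
        DualVec { re: self.re * rhs.re, grad }
    }
}

// Create the i-th input variable: seed a 1 in the i-th gradient slot.
fn var<const N: usize>(value: f64, i: usize) -> DualVec<N> {
    let mut grad = [0.0; N];
    grad[i] = 1.0;
    DualVec { re: value, grad }
}

fn main() {
    // f(x, y) = x * y; the gradient is (y, x)
    let x = var::<2>(3.0, 0);
    let y = var::<2>(4.0, 1);
    let f = x * y;
    println!("f = {}, grad = {:?}", f.re, f.grad); // f = 12, grad = [4.0, 3.0]
}
```

The trade-off mentioned above is visible here: the gradient size N must be known at compile time to avoid allocations, which is why multiple evaluations can win over dynamically sized duals.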

Then there is also this: https://github.com/feos-org/feos/blob/main/examples/core_dual_numbers.ipynb

This example indeed shows how (generalized) dual numbers are used in feos. The notation and the algebra are similar to complex numbers, but we never used (multi)complex numbers in FeOs. Note that in that example the EoS is defined in Python, which is useful for demonstration and for prototyping but certainly not optimal for performance. The derivatives, however, are calculated using the same dual number implementation as for an EoS implemented in Rust.

A comparison between FeOs and teqp is certainly interesting. I would expect comparable performance but we should certainly confirm that.

@pw0908

pw0908 commented Oct 31, 2022

Hi All,

My understanding of feos was essentially as Ian put it. I suppose the mix-up originated from your paper on dual numbers in EoS modelling.

As for benchmarks, a while ago I did some very rough comparisons between all the 'generalised' equation-of-state modelling packages. All comparisons were run in Python with 10,000 repeats; admittedly, that's not exactly the fairest comparison, as only phasepy is written natively in Python, but my rationale was that this is the speed of these packages from a novice user's point of view. I compared various solvers (not the derivatives directly) for PR and PC-SAFT (where possible). Probably a more rigorous comparison needs to be done, but for the time being, these are the results I found:
[image: benchmark table]

*The method used to solve saturation pressures for cubics in teqp uses Chebyshev polynomials, which are much faster than any numerical method. We've implemented this method in Clapeyron as well and have it highlighted in brackets.

All the best!

Pierre

@g-bauer
Contributor

g-bauer commented Oct 31, 2022

Hi Pierre,

This is unfortunate and I am not sure where the confusion stems from. We thought it was very clearly presented in our readme and the documentation that we use generalized hyper-dual numbers via num-dual, which is the implementation accompanying our publication.

Regarding the benchmarks: if I recall correctly, these are the same results you showed in your presentation in June this year? As you mentioned, a fair comparison/benchmark between packages is very difficult because there are differences due to the programming languages, the algorithms (and their implementations), and the starting values used.

I did some comparisons between Clapeyron and FeOs (because I was intrigued by this table) and got these results:

                           Clapeyron                 FeOs
PC-SAFT water
  volume                   29.828 μs ± 27.517 μs     89.7 µs ± 793 ns
  vapor pressure           31.527 μs ± 69.143 μs     180 µs ± 893 ns
  critical point           6.564 ms ± 1.258 ms       227 µs ± 105 ns
PC-SAFT water + co2
  bubble_pressure          12.368 ms ± 2.100 ms      1.55 ms ± 10.6 µs
  tp_flash with RRTPFlash  12.963 ms ± 2.135 ms      1.27 ms ± 957 ns

Note, though, that my experience with Julia is limited, so I might have done something wrong or non-optimal. The results also depend on the thermodynamic conditions. I used BenchmarkTools for Clapeyron and IPython's timeit for FeOs, both run from Jupyter notebooks on an Apple M1 Pro. The reported times are means and standard deviations over 10,000 runs.

I also played around with the parameters for water (I created a new substance called my_water in a csv file, since we have different parameter sets) and got drastically different results for the vapor pressure timings (4.908 ms ± 1.413 ms instead of the value reported above), which I didn't expect.

As stated before, however, I am not sure how helpful these benchmarks actually are. thermopack, for example, places great value on robustness, and therefore a lot of additional properties (such as critical points) are calculated where you might not expect it.

All the work on equations of state and making them publicly available is exciting and I really appreciate the discussions. As far as comparisons or benchmarks are concerned, I am personally somewhat reluctant because they are difficult to do in a fair manner, and it is very difficult to draw insights for projects as complex as ours.

@ianhbell
Author

  1. I have developed Superancillary equations for PC-SAFT (vanilla homosegmented model, no association) that are accurate enough to make the VLE calculations obsolete (within the scope of the approach) and are quite fast (about 0.6 microseconds). I put up the package on pypi: https://teqp.readthedocs.io/en/latest/models/PCSAFT.html#Superancillary . It's in C++, but not so convenient... Paper is under review.
  2. When it comes to benchmarking, I think there are two layers that are interesting in their own right:
    a) How fast is an evaluation of non-iterative properties from the EOS? For instance, evaluation of the chemical potential for a mixture formed of the first 10 normal alkanes. This is a test of the differentiation approach and can probe both accuracy and speed. This test is relatively straightforward: we agree on models (I think probably non-associating homosegmented PC-SAFT) and calculate the same derivatives. I think the best test is the best that each library can do, so in the native language of the library.
    b) All the other things. This test is much more complicated because there are innumerable details that matter (convergence criteria, stability checks, starting values, etc.). For instance, for PC-SAFT there can be multiple critical points; how do you know you are getting the one you want? Doing that comparison in a fair way is highly non-trivial.

@prehner
Contributor

prehner commented Oct 31, 2022

I agree with the assessment that we need to be careful with benchmarks. Nevertheless, I did another one comparing FeOs and teqp. To assess the performance of the AD instead of algorithms, I specifically looked at the calculation of properties and not phase equilibria. This is for the three component example that is used in the teqp introduction.

                         FeOs                  teqp
fugacity coefficients    17.4 µs ± 13.8 ns     8.33 µs ± 22 ns
compressibility          6.78 µs ± 5.19 ns     2.37 µs ± 3.99 ns
c_v                      6.25 µs ± 23.6 ns     4.5 µs ± 3.42 ns
partial molar volumes    22.7 µs ± 19.1 ns     19.3 µs ± 32.8 ns

So the speed is comparable, with teqp winning out in general. For these extremely fast calculations, though, the overhead from the Python calls can be meaningful.

@ianhbell
Author

That's about what I would have expected; good check. Call overhead should be roughly 1 µs, I think; at least it's about that in my testing, though it's going to depend on precisely what calculations you do.

@ianhbell
Author

I did, I think, the same test in C++ with teqp, and I get:

-------------------------------------------------------------------------------
PCSAFT more derivatives
-------------------------------------------------------------------------------
/Users/ihb/Documents/Code/teqp/src/bench.cpp:94
...............................................................................

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
alphar                                         100            38     1.6454 ms 
                                        431.636 ns    431.032 ns    432.492 ns 
                                        3.63456 ns     2.7682 ns    5.82952 ns 
                                                                               
fugacity_coefficients w/ autodiff              100             7     1.7227 ms 
                                        2.45453 us    2.44984 us     2.4665 us 
                                        35.7453 ns    17.6353 ns    75.1923 ns 
                                                                               
compressibility w/ autodiff                    100            27      1.674 ms 
                                        614.164 ns    613.607 ns     614.95 ns 
                                        3.33357 ns    2.58453 ns    5.22147 ns 
                                                                               
c_vr/R w/ autodiff                             100            21      1.659 ms 
                                        800.338 ns    783.042 ns    854.188 ns 
                                        131.009 ns    3.81014 ns    288.971 ns 
                                                                               
partial_molar_volumes w/ autodiff              100             7     1.7395 ms 
                                        2.47196 us     2.4491 us    2.54458 us 
                                        178.447 ns      20.96 ns    390.367 ns 

@ianhbell
Author

I think I need to revisit how the array passing is done in the C++ interface; there are opportunities to avoid copies and to bring the Python and C++ values closer.

@ianhbell
Author

I added some tests to the Jupyter notebook of docs: https://teqp.readthedocs.io/en/latest/models/PCSAFT.html#PC-SAFT

And in Python on my MacBook Pro machine, I have:
[image: Python timing results]

@prehner
Contributor

prehner commented Nov 13, 2022

We conclude that the compiled languages produce comparably efficient code, with differences that arise mainly from usability aspects (caching, contributions, units, Python).

As the original topic of this issue arose from a misunderstanding about the nature of dual numbers and automatic differentiation, I'm closing this issue.

@prehner prehner closed this as completed Nov 13, 2022
@ianhbell
Author

For my curiosity, how long do these calls take in rust?

@g-bauer
Contributor

g-bauer commented Nov 13, 2022

I can push benchmarks if you want to run them on your machine and compare. It is difficult to compare numbers across different machines.

A call to a property in our code involves some steps (given that a thermodynamic state, i.e. a State object, has been created):

  • check which contributions to compute (residual (NVT or NPT)? including ideal gas?)
  • construct a closure (anonymous function) with the computation instructions
  • call the cache, check if derivatives exist and if not, evaluate the closure
  • fill cache and return result

The Helmholtz energy itself contains loops over trait objects (dynamic dispatch) to evaluate the contributions. That's what the code does when a method is called on State (both in Rust and Python). Now, I locally added benchmarks to compare that with direct calls to the residual Helmholtz energy function, given a StateHD with generic (hyper) dual numbers.
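The cache-and-closure pattern in the steps above can be sketched as follows; the type and variant names here are illustrative, not feos's actual API:

```rust
use std::collections::HashMap;

// Sketch of the evaluate-through-cache flow described above:
// look up a derivative, and only evaluate the closure on a cache miss.
// Names (Cache, Derivative) are illustrative, not feos's actual API.
#[derive(PartialEq, Eq, Hash, Clone, Copy)]
#[allow(dead_code)]
enum Derivative {
    Helmholtz,
    DaDv,
    DaDt,
}

struct Cache {
    values: HashMap<Derivative, f64>,
}

impl Cache {
    fn new() -> Self {
        Cache { values: HashMap::new() }
    }

    // Return the cached value if present; otherwise run the closure,
    // store the result, and return it.
    fn get_or_insert_with<F: FnOnce() -> f64>(&mut self, d: Derivative, f: F) -> f64 {
        *self.values.entry(d).or_insert_with(f)
    }
}

fn main() {
    let mut cache = Cache::new();
    // first call evaluates the closure...
    let a = cache.get_or_insert_with(Derivative::Helmholtz, || {
        println!("evaluating Helmholtz energy");
        -1.23
    });
    // ...the second call hits the cache and skips the evaluation
    let b = cache.get_or_insert_with(Derivative::Helmholtz, || unreachable!());
    assert_eq!(a, b);
}
```

The point of constructing the computation as a closure is that the (possibly expensive) Helmholtz energy evaluation is only triggered on a cache miss.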

Taking Python as baseline (e.g. State(T, rho, x).method(Contributions.ResidualNvt), where method is compressibility, c_v, etc.):

  • calling the equivalent method in Rust is faster by a factor ~2x to 2.5x (when returning numpy arrays)
  • calling directly the function for the residual Helmholtz energy with hyper duals as input is faster by a factor of ~3x

Regarding the actual issue raised: I compared the ratio of the time of the Helmholtz energy evaluation (alphar, I assume) vs. first- and second-order derivatives from your timings (C++ benchmark) above with mine (Rust benchmark).

If we attribute the difference to the implementation of dual numbers alone, autodiff is approx 5 to 10% faster (I expected worse to be honest, so I think our implementation is quite good here).

To summarize:

  • We currently perform a lot of copying between Rust and Python and transform the results into dimensioned quantities. We should look into that in the future.
  • Also, pyo3 may not be as efficient as pybind11?
  • It would be nice to have better benchmarks in the code base, maybe even as part of the CI.
  • There are better options to improve current performance before touching the dual number implementation.

Thanks for raising the issue. It's an interesting topic to look into.

@ianhbell
Author

Thanks for digging into this. I spent a lot of time thinking about optimizing these calls since they are at the heart of everything. I agree that 10-15% worse in speed is pretty impressive and you should be proud of that. My experience was that even complex-step derivatives were much slower than autodiff, and anyway they are only useful for first derivatives. Our implementation of multicomplex doesn't scale well at all, so the only option remaining is autodiff. But I didn't have your hyper-dual implementation to compare against; that would have been very interesting. There are some rather severe downsides to autodiff, especially that the compile time is long and the amount of memory needed to compile the library is quite crazy (12 GB!). So if we can find a method that is only 10% slower but reduces the compile time and memory use, I'm interested.

I noticed your wheels are pretty huge (at least compared with teqp). Anything you can do to reduce that? Not super crucial though as computers have lots of memory nowadays. Just a pain if you need to package into larger programs as I have learned the hard way.

@g-bauer mentioned this issue Nov 15, 2022
@g-bauer
Contributor

g-bauer commented Nov 15, 2022

The wheel size can be reduced with a compiler flag, i.e. lto = true, which on my machine reduced the wheel from 12 MB to about 8 MB. To my surprise, it also massively improved performance, because, apparently, the default release profile does not optimize the code across different Rust crates (packages). We suspect that the improvement is so large because, e.g., our dual numbers and quantities live in their own packages.
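For reference, the flag mentioned above is standard Cargo syntax and goes in the release profile of Cargo.toml (the exact profile feos ships may differ):

```toml
# Cargo.toml: enable link-time optimization for release builds, letting
# the compiler optimize across crate boundaries (e.g. into num-dual).
[profile.release]
lto = true
```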

Just rerunning my benchmarks showed a performance improvement of about 50% across all functions that use dual numbers (the Helmholtz energy was about 20% faster). The ratios of "derivatives vs. residual Helmholtz energy evaluations" decreased drastically. For example, time(da_dv) / time(a_res) was around 1.7 and now is around 1.12. For the second derivative (c_v) it went from 2.0 to 1.81.

I think we can let this issue rest now as the actual topic is resolved. I opened #77 to track implementation of proper benchmarks that should make it easier in the future to do these kinds of investigations.

@ianhbell
Author

That's great news; LTO is also key in C++ for reducing wheel sizes.

It's surprising (in a nice way) that the speed increased so much, but if you were moving across shared library boundaries, I can imagine that might prevent further optimizations.

It would be best to build a docker container for benchmarking so we can run everything with the same setup. I can start that on my side.

@g-bauer
Contributor

g-bauer commented Dec 15, 2022

FYI, the current main branch now contains benchmarks for the system we discussed in this thread (methane, ethane, propane) and others.

We have a benchmark (dual_numbers) in which direct calls to the Helmholtz energy functions are performed with different dual number types. This should give a good idea of the slowdown of the evaluation due to the increasing number of arithmetic operations as one goes to more complex dual numbers for higher derivatives.

The state_properties benchmark contains calls to properties such as compressibility, molar volume, fugacity, etc. and includes generating the State object which has some overhead (constructing the cache, validating input).

You can find an overview and information about how to run the benchmarks here.

@ianhbell
Author

Cool, that's great. I'll have a go. Do you have a docker image for the benchmarks by any chance? If not, I can put one together as I'll be testing in a docker container.

Maybe not the right place for this question, but do you have any interest in releasing a port of your dual library for C++? I would like to have a go with that and see if it can really compete with autodiff. If so, I'll move to it because autodiff makes compilation very slow due to all the super-nested template evaluations.

@g-bauer
Contributor

g-bauer commented Dec 15, 2022

We have very limited experience with Docker, so no, we don't have an image.

I am not sure if a port of our library to C++ would be useful. In Rust, we are able to define functions that are generic over the dual number types because we defined and implemented a trait (shared behavior or interface in Rust). The compiler, as Philipp mentioned above, creates at compile time a type-specific version of each function so there is no cost (other than more complex arithmetic operations) in using dual numbers at run time. That's what makes it very efficient.
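The trait-based design described above can be sketched as follows. This is an illustrative example of the mechanism (the trait name and methods are made up, not num-dual's actual DualNum trait): the generic function is monomorphized into a separate, fully optimized version for each concrete number type.

```rust
// Sketch of generics over "dual-like" number types: one generic model
// function, monomorphized per concrete type at compile time.
// Trait and method names are illustrative, not num-dual's actual API.
use std::ops::{Add, Mul};

trait DualLike: Copy + Add<Output = Self> + Mul<Output = Self> {
    fn from_f64(v: f64) -> Self;
}

// Plain floats implement the trait, as would first- and higher-order
// dual number types in a full implementation.
impl DualLike for f64 {
    fn from_f64(v: f64) -> f64 {
        v
    }
}

// One generic implementation of a model, f(x) = x^2 + 2x; the compiler
// emits a specialized version for every type it is called with.
fn model<D: DualLike>(x: D) -> D {
    x * x + D::from_f64(2.0) * x
}

fn main() {
    println!("{}", model(3.0_f64)); // prints 15
}
```

Because each specialization is plain, inlinable code, there is no runtime dispatch cost, which is the efficiency argument made above.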

I have no experience with templates in C++, but I don't think that there would be an advantage of our implementation over using dual numbers defined in C++ libraries such as cppduals or autodiff.

@ianhbell
Author

Ok, I'll look into a docker image on my side.

That sounds very similar to how autodiff works in C++ with templates, and I think the computational penalty would therefore be roughly similar in an apples-to-apples comparison. I wonder if both the Rust and C++ approaches would compile to the same assembly for a simple derivative (of x+1 w.r.t. x, for instance).

cppduals is new to me, thanks very much for making me aware of it!
