
# Flaws of floating-point computing

Floating-point numbers can represent a very wide range of magnitudes, from very small to very large, much like scientific notation. They are the preferred types for scientific computing. Yet one must be aware of the many **rounding errors** they entail.

First, to check the accuracy of some calculations visually, let's increase the precision of the output stream to 18 digits (the default is 6).

In [None]:
#include <iostream>
std::cout.precision(18) ;
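As an alternative to a hard-coded precision, the standard library exposes `std::numeric_limits<double>::max_digits10`, the number of significant decimal digits needed to print a `double` and read it back unchanged. The helper below is only a sketch: the name `to_full_precision` is hypothetical and is not used elsewhere in this notebook.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical helper: format a double with max_digits10 significant digits
// (17 for an IEEE 754 double), which is enough to read it back unchanged.
std::string to_full_precision( double value )
 {
  std::ostringstream oss ;
  oss << std::setprecision(std::numeric_limits<double>::max_digits10) << value ;
  return oss.str() ;
 }
```

Converting the resulting string back with `std::stod` should give exactly the original value, which is a convenient way to check that no digit was lost on output.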

## Binary is not decimal

In base two, all the numbers `2^(-n)`, with `n` not too big, and all sums of them that fit in the available significand bits, can be represented exactly by a floating-point number:

In [None]:
std::cout << (1./2.) << " " << (1./4.) << " " << (1./2.+1./4.) << std::endl ;

Apart from these rare special cases, given the limited number of digits available for the internal representation, most numbers cannot be represented exactly:

In [None]:
std::cout << (1./3.) << std::endl ;

Less intuitively, some numbers that look very simple to humans do not have an exact base-two representation:

In [None]:
std::cout << 0.1 << std::endl ;
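Because `0.1` is stored with a tiny error, that error accumulates when the value is added repeatedly. The following sketch (the helper name `repeated_sum` is hypothetical) makes this visible by contrast with `0.125`, which is `2^(-3)` and therefore exact.

```cpp
// Hypothetical helper: add `step` to zero, `count` times.
double repeated_sum( double step, int count )
 {
  double sum = 0. ;
  for ( int i = 0 ; i < count ; ++i )
   { sum += step ; }
  return sum ;
 }
```

With IEEE 754 doubles, `repeated_sum(0.1,10)` is not equal to `1.0`, whereas `repeated_sum(0.125,8)` is exactly `1.0`.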

Some simple operations may add rounding errors, which complicates comparison of floating-point numbers:

In [None]:
double d1 = .3 ;
double d2 = .1+.2 ;
std::cout << d1 << std::endl ;
std::cout << d2 << std::endl ;
if (d1==d2)
 { std::cout<<"numbers are the same"<<std::endl ; }
else
 { std::cout<<"numbers differ !"<<std::endl ; }

## Good old-fashioned practice: epsilon

When comparing floating-point numbers, always allow for an epsilon difference, and scale it with the absolute values being compared.

In [None]:
#include <algorithm>
#include <cmath>
#include <limits>

In [None]:
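// Returns true when val1 and val2 differ by at most one machine epsilon,
// scaled by the larger of the two magnitudes.
bool compare( double val1, double val2 )
 {
  constexpr double eps = std::numeric_limits<double>::epsilon() ;
  // "<=" rather than "<", so that two exactly equal values (e.g. both 0) compare equal
  return (std::abs(val1-val2)<=(eps*std::max(std::abs(val1),std::abs(val2)))) ;
 }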

In [None]:
if (compare(.3,.1+.2 ))
 { std::cout<<"numbers are the same"<<std::endl ; }
else
 { std::cout<<"numbers differ !"<<std::endl ; }

During long arithmetic computations and/or calls to mathematical functions (exp, log, trigonometric functions...), rounding errors accumulate, so it is usual to multiply the epsilon by a small factor such as 3.
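A possible way to expose that factor is sketched below. The name `nearly_equal` and its parameters `factor` and `abs_tol` are hypothetical. The absolute floor `abs_tol` is an addition not discussed above: a purely relative tolerance shrinks to nothing when both values are close to zero, so an absolute threshold is sometimes wanted there.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical variant of the comparison: `factor` widens the relative tolerance,
// `abs_tol` is an optional absolute floor, useful when both values are close to zero.
bool nearly_equal( double val1, double val2, double factor = 3., double abs_tol = 0. )
 {
  constexpr double eps = std::numeric_limits<double>::epsilon() ;
  const double diff = std::abs(val1-val2) ;
  const double scale = std::max(std::abs(val1),std::abs(val2)) ;
  return diff <= std::max(factor*eps*scale,abs_tol) ;
 }
```

For instance, `nearly_equal(.3,.1+.2)` holds, while `nearly_equal(0.,1e-20)` only holds if an absolute tolerance such as `1e-12` is passed as the fourth argument. The right values for `factor` and `abs_tol` depend on the computation at hand.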

## Absorption

Adding a very small number to a very big one has no effect on the big one. There is nothing you can do about it, except use a larger floating-point type (which only helps up to a point) and, more importantly, modify your algorithms so as to avoid this situation. Worse still, such a pitfall is really hard to detect.

In [None]:
%%file tmp.absorption.cpp

#include <iostream>
#include <stdfloat>

int main( int argc, char * argv[] )
 {
  auto v1 { 128.0f16 } ;
  auto v2 { 1.f16/16 } ;
  std::cout << v1 << std::endl ;
  std::cout << v2 << std::endl ;
  std::cout << (v1+v2) << std::endl ;
 }

In [None]:
!rm -f tmp.absorption.exe && g++ -O2 -std=c++23 tmp.absorption.cpp -o tmp.absorption.exe && ./tmp.absorption.exe
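One classic example of modifying the algorithm is compensated (Kahan) summation, which keeps a running correction for the low-order bits lost at each addition. The sketch below uses `float` and two hypothetical helpers, `naive_sum` and `kahan_sum`. It assumes the compiler does not reorder floating-point operations (no `-ffast-math`).

```cpp
#include <cstddef>

// Naive accumulation: once `sum` is large compared to `term`,
// each addition is absorbed and `sum` stops growing.
float naive_sum( float start, float term, std::size_t count )
 {
  float sum = start ;
  for ( std::size_t i = 0 ; i < count ; ++i )
   { sum += term ; }
  return sum ;
 }

// Compensated (Kahan) accumulation: `comp` records the part of each
// addition that was lost to rounding and feeds it back into the next one.
// Assumes strict floating-point semantics (no -ffast-math).
float kahan_sum( float start, float term, std::size_t count )
 {
  float sum = start ;
  float comp = 0.f ;
  for ( std::size_t i = 0 ; i < count ; ++i )
   {
    float y = term - comp ;    // next term, corrected by what was lost so far
    float t = sum + y ;        // low-order bits of y may be lost here
    comp = (t - sum) - y ;     // recover what was lost (with opposite sign)
    sum = t ;
   }
  return sum ;
 }
```

Starting from `1.f` and adding `1e-8f` ten million times, the naive version stays at `1.f` because every single addition is absorbed, whereas the compensated version ends up close to the expected `1.1`.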

## Cancellation

Somewhat similar to the previous problem: if you subtract two numbers which are very close, the result will have very few significant digits. In the example below, where we consider the `long double` result as the "truth", after only a few operations the relative error of the `float` result is far worse than the roughly 7 significant digits one expects from a `float`.

In [None]:
%%file tmp.cancellation.cpp
    
#include <iostream>
#include <iomanip>
#include <tuple>

template< typename R >
std::tuple<R,R> main_impl()
 {
  R v1 { static_cast<R>(3.333) + static_cast<R>(3.0e-4) } ;
  R v2 { static_cast<R>(3.333) + static_cast<R>(2.0e-4) } ;
  R res1 = (v1*v1-v2*v2) ;
  R res2 = (v1+v2)*(v1-v2) ;
  return { res1, res2 } ;
 }

int main( int argc, char * argv[] ) {
  auto [ res1l, res2l ] = main_impl<long double>() ;
  auto [ res1f, res2f ] = main_impl<float>() ;

  std::cout << std::fixed << std::setprecision(18) ;
  std::cout << "(v1*v1-v2*v2)     float result: " << res1f << std::endl ;
  std::cout << "(v1+v2)*(v1-v2)   float result: " << res2f << std::endl ;
  std::cout << "(v1*v1-v2*v2)   relative error: " << (res1l-res1f)/res1l << std::endl ;
  std::cout << "(v1+v2)*(v1-v2) relative error: " << (res2l-res2f)/res2l << std::endl ;
 }


In [None]:
!rm -f tmp.cancellation.exe && g++ -O2 -std=c++23 tmp.cancellation.cpp -o tmp.cancellation.exe

In [None]:
!./tmp.cancellation.exe
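The same idea, rewriting an expression so that no two nearly equal values are subtracted, applies beyond this example. As a further sketch, here is `1 - cos(x)` for small `x`, rewritten with the identity `1 - cos(x) = 2 sin²(x/2)`. The function names are hypothetical.

```cpp
#include <cmath>

// Direct form: cos(x) is very close to 1 for small x,
// so the subtraction cancels almost all significant digits.
double one_minus_cos_naive( double x )
 { return 1. - std::cos(x) ; }

// Rewritten with 1 - cos(x) = 2 sin^2(x/2): no subtraction of close values.
double one_minus_cos_stable( double x )
 {
  const double s = std::sin(x/2.) ;
  return 2.*s*s ;
 }
```

For `x = 1e-9`, the mathematical result is about `5e-19`. The rewritten form gets very close to it, while the direct form loses essentially all significant digits.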

# Take Away

Modern C++ does not bring any silver bullet for the rounding problems of floating-point computing. You still have to rely on good old-fashioned practices, although external tools can help to locate the largest errors ([CADNA](https://www-pequan.lip6.fr/cadna/), [verificarlo](https://github.com/verificarlo/verificarlo), [verrou](https://github.com/edf-hpc/verrou)).

# Questions ?

© *CNRS 2024*  
*This document was created by David Chamont. It is available under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license](http://creativecommons.org/licenses/by-nc-sa/4.0/)*