Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speeding up write_significand() #2499

Merged
merged 1 commit into from
Oct 9, 2021
Merged

Conversation

Roman-Koshelev
Copy link
Contributor

@Roman-Koshelev Roman-Koshelev commented Sep 12, 2021

The new version gives an increase of about 2 times, except for the case when the size of the integer part is 1 (since in the original code this case is specially optimized)

Tests
integral_size, floating_size) Speed
1, 1) 0.967493
1, 2) 1.366119
1, 3) 1.016240
1, 4) 1.194619
2, 1) 2.453043
2, 2) 2.560216
2, 3) 2.554420
2, 4) 1.944176
3, 1) 2.198698
3, 2) 2.430826
3, 3) 2.032839
3, 4) 2.088472
4, 1) 2.366449
4, 2) 2.460539
4, 3) 2.084301
4, 4) 2.077537

@vitaut
Copy link
Contributor

vitaut commented Sep 14, 2021

Could you post here the benchmark code that you used to obtained the timings?

@Roman-Koshelev
Copy link
Contributor Author

https://ideone.com/lTcMzk Here are tests of several options

@Roman-Koshelev
Copy link
Contributor Author

Execution result on my machine
is, fs) speed
01, 01) 1.621 1.568 1.858 1.440
01, 02) 1.500 1.917 1.887 1.342
01, 03) 1.632 1.329 1.489 1.487
01, 04) 1.405 1.577 1.633 1.335
02, 01) 1.360 2.491 3.068 2.454
02, 02) 2.087 2.364 2.350 1.901
02, 03) 1.455 2.251 2.727 2.264
02, 04) 1.873 1.927 1.806 1.476
03, 01) 1.249 2.157 2.414 1.877
03, 02) 2.483 2.619 2.603 1.746
03, 03) 1.277 1.964 2.091 1.571
03, 04) 2.124 2.220 2.267 1.671
04, 01) 1.290 2.222 2.599 1.621
04, 02) 2.350 2.551 2.550 1.344
04, 03) 1.378 1.846 2.222 1.603
04, 04) 2.082 2.199 2.197 1.313

@vitaut
Copy link
Contributor

vitaut commented Sep 15, 2021

Copying here for future reference:

#include <array>
#include <memory>
#include <cstdio>
#include <vector>
#include <random>
#include <chrono>
#include <cstring>
 
#define FMT_CONSTEXPR constexpr
 
template <typename T> constexpr auto make_checked(T* p, size_t) -> T* {
  return p;
}
 
// Converts value in the range [0, 100) to a string.
constexpr const char* digits2(size_t value) {
  // GCC generates slightly better code when value is pointer-size.
  return &"0001020304050607080910111213141516171819"
         "2021222324252627282930313233343536373839"
         "4041424344454647484950515253545556575859"
         "6061626364656667686970717273747576777879"
         "8081828384858687888990919293949596979899"[value * 2];
}
 
// Casts a nonnegative integer to unsigned.
template <typename Int>
FMT_CONSTEXPR auto to_unsigned(Int value) ->
    typename std::make_unsigned<Int>::type {
  //FMT_ASSERT(value >= 0, "negative value");
  return static_cast<typename std::make_unsigned<Int>::type>(value);
}
 
// Copies two characters from src to dst.
template <typename Char>
void copy2(Char* dst, const char* src) {
  if (/*!is_constant_evaluated() && */sizeof(Char) == sizeof(char)) {
    memcpy(dst, src, 2);
    return;
  }
  *dst++ = static_cast<Char>(*src++);
  *dst = static_cast<Char>(*src);
}
 
template <typename Iterator> struct format_decimal_result {
  Iterator begin;
  Iterator end;
};
 
// Formats a decimal unsigned integer value writing into out pointing to a
// buffer of specified size. The caller must ensure that the buffer is large
// enough.
template <typename Char, typename UInt>
FMT_CONSTEXPR auto format_decimal(Char* out, UInt value, int size)
    -> format_decimal_result<Char*> {
  //FMT_ASSERT(size >= count_digits(value), "invalid digit count");
  out += size;
  Char* end = out;
  while (value >= 100) {
    // Integer division is slow so do it for a group of two digits instead
    // of for every digit. The idea comes from the talk by Alexandrescu
    // "Three Optimization Tips for C++". See speed-test for a comparison.
    out -= 2;
    copy2(out, digits2(static_cast<size_t>(value % 100)));
    value /= 100;
  }
  if (value < 10) {
    *--out = static_cast<Char>('0' + value);
    return {out, end};
  }
  out -= 2;
  copy2(out, digits2(static_cast<size_t>(value)));
  return {out, end};
}
 
template <typename Char, typename UInt/*,
          FMT_ENABLE_IF(std::is_integral<UInt>::value)*/>
inline auto write_significand(Char* out, UInt significand, int significand_size,
                              int integral_size, Char decimal_point) -> Char* {
  if (!decimal_point)
    return format_decimal(out, significand, significand_size).end;
  auto end = format_decimal(out + 1, significand, significand_size).end;
  if (integral_size == 1) {
    out[0] = out[1];
  } else {
    std::uninitialized_copy_n(out + 1, integral_size,
                              make_checked(out, to_unsigned(integral_size)));
  }
  out[integral_size] = decimal_point;
  return end;
}
 
template <typename Char, typename UInt/*,
          FMT_ENABLE_IF(std::is_integral<UInt>::value)*/>
inline auto write_significand2(Char* out, UInt significand, int significand_size,
                              int integral_size, Char decimal_point) -> Char* {
  Char* ccc = out;
  if (!decimal_point)
    return format_decimal(out, significand, significand_size).end;
  out += significand_size + 1;
  Char* end = out;
  int floating_size = significand_size - integral_size;
  for(int i = floating_size / 2; i > 0; --i) {
    out -= 2;
    copy2(out, digits2(significand % 100));
    significand /= 100;
  }
  if (floating_size % 2 != 0) {
    const char* tmp = digits2(significand % 100);
    significand /= 100;
    *out-- =  static_cast<Char>(*tmp++);
    *out-- = decimal_point;
    *out =  static_cast<Char>(*tmp);
    --integral_size;
  } else {
    *--out = decimal_point;
  }
  if (integral_size > 0)
    format_decimal(out - integral_size, significand, integral_size);
 
  return end;
}
 
template <typename Char, typename UInt/*,
          FMT_ENABLE_IF(std::is_integral<UInt>::value)*/>
inline auto write_significand3(Char* out, UInt significand, int significand_size,
                              int integral_size, Char decimal_point) -> Char* {
  if (!decimal_point)
    return format_decimal(out, significand, significand_size).end;
  out += significand_size + 1;
  Char* end = out;
  int floating_size = significand_size - integral_size;
  for(int i = (floating_size + 1) / 2; i > 0; --i) {
    out -= 2;
    copy2(out, digits2(significand % 100));
    significand /= 100;
  }
  if (floating_size % 2 != 0) {
    Char tmp = *out;
    *out = decimal_point;
    *--out = tmp;
    --integral_size;
  } else {
    *--out = decimal_point;
  }
  if (integral_size > 0)
    format_decimal(out - integral_size, significand, integral_size);
 
  return end;
}
 
template <typename Char, typename UInt/*,
          FMT_ENABLE_IF(std::is_integral<UInt>::value)*/>
inline auto write_significand4(Char* out, UInt significand, int significand_size,
                              int integral_size, Char decimal_point) -> Char* {
  if (!decimal_point)
    return format_decimal(out, significand, significand_size).end;
  out += significand_size + 1;
  Char* end = out;
  int floating_size = significand_size - integral_size;
  for(int i = floating_size / 2; i > 0; --i) {
    out -= 2;
    copy2(out, digits2(significand % 100));
    significand /= 100;
  }
  if (floating_size % 2 != 0) {
    *--out = static_cast<Char>('0' + significand % 10);
    significand /= 10;
  }
  *--out = decimal_point;
  format_decimal(out - integral_size, significand, integral_size);
 
  return end;
}
 
template <typename Char, typename UInt/*,
          FMT_ENABLE_IF(std::is_integral<UInt>::value)*/>
inline auto write_significand5(Char* out, UInt significand, int significand_size,
                              int integral_size, Char decimal_point) -> Char* {
  if (!decimal_point)
    return format_decimal(out, significand, significand_size).end;
  auto end = format_decimal(out + 1, significand, significand_size).end;
 
  Char tmp = decimal_point;
  for(Char* i = out + integral_size; i != out; --i) {
    std::swap(tmp, *i);
  }
  *out = tmp;
  /*for (int i=0; i< integral_size; ++i) {
    out[i] = out[i + 1];
  }*/
  out[integral_size] = decimal_point;
  return end;
}
 
std::vector<unsigned> genTestVector(int integral_size, int floating_size) {
  std::uniform_int_distribution<unsigned> distrib1(
      std::pow(10, integral_size - 1), std::pow(10, integral_size) - 1);
  std::uniform_int_distribution<unsigned> distrib2(
      std::pow(10, floating_size - 1), std::pow(10, floating_size) - 1);
  std::random_device rd;
  std::mt19937 gen(rd());
  std::vector<unsigned> res;
  const unsigned tmp = std::pow(10, floating_size);
  for (int i = 0; i < 10000; ++i) {
    res.push_back(distrib1(gen) * tmp + distrib2(gen));
  }
  return res;
}
 
template <class F>
std::chrono::steady_clock::duration test(const std::vector<unsigned> &vec,
                                         int significand_size,
                                         int integral_size, F &&f)
{
  char buf[32];
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < 1000; ++i) {
    for (unsigned tmp : vec) {
      volatile char *x = f(buf, tmp, significand_size, integral_size);
    }
  }
  auto end = std::chrono::steady_clock::now();
  return end - start;
}
 
int main() {
 
  printf("is, fs) speed\n");
 
  for (int integral_size = 1; integral_size <= 4; ++integral_size) {
    for (int floating_size = 1; floating_size <= 4; ++floating_size) {
      auto vec = genTestVector(integral_size, floating_size);
      int significand_size = integral_size + floating_size;
      std::array<std::chrono::duration<double, std::milli>, 5> time;
 
      time[0] = test(
          vec, significand_size, integral_size,
          [](char *buf, unsigned val, int significand_size, int integral_size) {
            return write_significand(buf, val, significand_size, integral_size,
                                     '.');
          });
 
      time[1] = test(
          vec, significand_size, integral_size,
          [](char *buf, unsigned val, int significand_size, int integral_size) {
            return write_significand2(buf, val, significand_size, integral_size,
                                      '.');
          });
 
      time[2] = test(
          vec, significand_size, integral_size,
          [](char *buf, unsigned val, int significand_size, int integral_size) {
            return write_significand3(buf, val, significand_size, integral_size,
                                      '.');
          });
 
      time[3] = test(
          vec, significand_size, integral_size,
          [](char *buf, unsigned val, int significand_size, int integral_size) {
            return write_significand4(buf, val, significand_size, integral_size,
                                      '.');
          });
 
      time[4] = test(
          vec, significand_size, integral_size,
          [](char *buf, unsigned val, int significand_size, int integral_size) {
            return write_significand5(buf, val, significand_size, integral_size,
                                      '.');
          });
 
      printf("%.2d, %.2d) %.3lf %.3lf %.3lf %.3lf\n", integral_size, floating_size,
             time[0] / time[1], time[0] / time[2], time[0] / time[3],
             time[0] / time[4]);
    }
  }
}

@vitaut
Copy link
Contributor

vitaut commented Sep 19, 2021

Thanks for the benchmark, it looks like you are using write_significand4 in the PR.

I tested it on clang with -std=c++17 -O3 -DNDEBUG and while it does show improvement in many cases, there is a regression for the most common case of a single-digit integral part:

01, 01) 0.916 0.773 0.667 0.965
01, 02) 0.982 1.000 1.001 0.842
01, 03) 0.868 0.862 0.843 1.084
01, 04) 0.975 0.976 0.987 1.015
02, 01) 1.430 1.038 1.239 0.988
02, 02) 1.069 1.250 1.288 0.915
02, 03) 1.069 0.987 1.190 0.987
02, 04) 1.151 1.155 1.166 0.970
03, 01) 1.631 1.221 0.871 0.942
03, 02) 1.229 1.248 1.257 0.965
03, 03) 1.155 1.265 1.018 1.057
03, 04) 1.266 1.212 1.288 0.917
04, 01) 1.764 1.575 1.647 1.069
04, 02) 1.238 1.228 1.250 0.970
04, 03) 1.252 1.257 1.392 0.982
04, 04) 1.263 1.278 1.271 0.964
                    ^ write_significand4 proposed in this PR

Is it possible to avoid the regression?

Copy link
Contributor

@vitaut vitaut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should go with write_significand2 which has smaller regression for the single-digit integral case.

@Roman-Koshelev
Copy link
Contributor Author

Optimization can be added, but is it needed? Why is the single-valued whole part a frequent occurrence?

  if (integral_size == 1) {
    auto end = format_decimal(out + 1, significand, significand_size).end;
    out[0] = out[1];
    out[1] = decimal_point;
    return end;
  }

@vitaut
Copy link
Contributor

vitaut commented Oct 3, 2021

Optimization can be added, but is it needed? Why is the single-valued whole part a frequent occurrence?

Single-digit integral part is very common because it's used in the exponential, general and default formats except for small(-ish) exponent. We shouldn't regress it too much.

@vitaut vitaut merged commit 7e4bc94 into fmtlib:master Oct 9, 2021
@vitaut
Copy link
Contributor

vitaut commented Oct 9, 2021

Merged, thanks. The single-digit integral part can be optimized later.

Iniesta8 pushed a commit to jhnc-oss/fmt that referenced this pull request Oct 28, 2021
PoetaKodu pushed a commit to pacc-repo/fmt that referenced this pull request Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants