
Compile String.valid_utf8?/1 inline #12354

Closed
wants to merge 1 commit

Conversation

@mtrudel (Contributor) commented Jan 19, 2023

A quick win to get about a 15% performance boost on UTF-8 validation. This function is consistently at the top of Bandit's WebSocket profile traces, so any wins we can get here would be great.

Mix.install([:benchee])

defmodule New do
  def valid?(<<string::binary>>), do: valid_utf8?(string)
  def valid?(_), do: false

  @compile {:inline, valid_utf8?: 1}
  defp valid_utf8?(<<_::utf8, rest::bits>>), do: valid_utf8?(rest)
  defp valid_utf8?(<<>>), do: true
  defp valid_utf8?(_), do: false
end

Benchee.run(
  %{
    "old" => fn input -> String.valid?(input) end,
    "new" => fn input -> New.valid?(input) end
  },
  time: 10,
  memory_time: 2,
  inputs: %{
    "micro" => String.duplicate("a", 10),
    "medium" => String.duplicate("a", 10_002)
  }
)

yields

Operating System: macOS
CPU Information: Apple M1
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.14.1
Erlang 25.1

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: medium, micro
Estimated total run time: 56 s

Benchmarking new with input medium ...
Benchmarking new with input micro ...
Benchmarking old with input medium ...
Benchmarking old with input micro ...

##### With input medium #####
Name           ips        average  deviation         median         99th %
new        38.39 K       26.05 μs    ±21.92%       25.83 μs       29.92 μs
old        34.39 K       29.08 μs    ±19.84%       28.79 μs       29.92 μs

Comparison:
new        38.39 K
old        34.39 K - 1.12x slower +3.03 μs

Memory usage statistics:

Name    Memory usage
new            656 B
old            656 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

##### With input micro #####
Name           ips        average  deviation         median         99th %
new         2.94 M      340.37 ns ±11621.35%         250 ns         375 ns
old         2.89 M      345.75 ns ±11544.85%         250 ns         416 ns

Comparison:
new         2.94 M
old         2.89 M - 1.02x slower +5.38 ns

Memory usage statistics:

Name    Memory usage
new            656 B
old            656 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

(FWIW @moogle19 and I have spent a fair bit of time trying to improve this implementation (longer strides, clever Bitwise tricks, Cowboy's implementation, and others) and haven't been able to get anything better that doesn't have awful memory characteristics. See notes at mtrudel/bandit#73)


mtrudel commented Jan 19, 2023

Upon further review, this seems to actually be slower on x86. Rescinding this PR.

@mtrudel mtrudel closed this Jan 19, 2023
@mtrudel mtrudel deleted the inline_valid branch January 19, 2023 16:10
@josevalim (Member) commented:

Hi @mtrudel and @moogle19! Also, be sure to check OTP 26, it has some improvements around this area.

But I would also send a proposal to the Erlang/OTP team to add a unicode:validate_utf8/1 function that is backed by a NIF. This is a very important operation and it can be highly optimised using techniques such as simdutf8. I would make it return something like :ok or {:error, byte_pos_of_the_first_invalid_codepoint}. I would love to see something like that get in, so let me know if you need help with any step of the process.


mtrudel commented Jan 19, 2023

We're actually looking at taking an (optional) 'sprinkling of NIFs' approach with a couple of hot paths in Bandit, including this one. Around half of the improvements are probably of general interest, while the other half are pretty Bandit-specific.

@moogle19 has already worked up some prototypes using rustler, so integrating simdutf8 would be really straightforward. If it bears fruit (as I suspect it will) I'll happily take care of pushing the work / proposals upstream (when you said 'send a proposal', I assume you mean via the EEP process?)

Also those changes to OTP26 look great, and will likely be a significant improvement here regardless. Going to benchmark them now!

@josevalim (Member) commented:

I don't think you need an EEP for this. I would start a discussion on the forum or potentially just submit a PR depending on your appetite. :)

Pro-tip: in order to add NIFs to Erlang/OTP:

  1. add the NIF stub to unicode.erl
  2. then implement the NIF

If you implement the NIF before the stub, you can't start the runtime, because the stub the NIF is supposed to fill in is missing. It may also be necessary to put the NIF in a private module, such as erts_internal. I am not sure all modules from OTP are NIF-able. :)


mtrudel commented Jan 19, 2023

Tips from the pro himself! Thanks so much!

@josevalim (Member) commented:

Oh you! ☺️


mtrudel commented Jan 19, 2023

OK, this is really interesting.

Given this implementation:

Mix.install([:benchee])

defmodule New do
  @compile {:inline, valid?: 1, valid_utf8?: 1}
  def valid?(<<string::binary>>), do: valid_utf8?(string)
  def valid?(_), do: false

  defp valid_utf8?(<<_::utf8, rest::bits>>), do: valid_utf8?(rest)
  defp valid_utf8?(<<>>), do: true
  defp valid_utf8?(_), do: false
end

defmodule New2 do
  @compile {:inline, valid?: 1, valid_utf8?: 1}
  def valid?(<<string::binary>>), do: valid_utf8?(string)
  def valid?(_), do: false

  defp valid_utf8?(<<a::8, b::8, c::8, d::8, rest::bits>>)
       when a < 128 and b < 128 and c < 128 and
              d < 128,
       do: valid_utf8?(rest)

  defp valid_utf8?(<<_::utf8, rest::bits>>), do: valid_utf8?(rest)
  defp valid_utf8?(<<>>), do: true
  defp valid_utf8?(_), do: false
end

Benchee.run(
  %{
    "old" => fn input -> String.valid?(input) end,
    "new" => fn input -> New.valid?(input) end,
    "new2" => fn input -> New2.valid?(input) end
  },
  time: 10,
  memory_time: 2,
  inputs: %{
    "micro" => String.duplicate("a", 10),
    "medium" => String.duplicate("a", 10_002)
  }
)

Here's what I get on OTP 25:

OTP 25 Benchmark
Operating System: macOS
CPU Information: Apple M1
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.14.1
Erlang 25.1

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: medium, micro
Estimated total run time: 1.40 min

Benchmarking new with input medium ...
Benchmarking new with input micro ...
Benchmarking new2 with input medium ...
Benchmarking new2 with input micro ...
Benchmarking old with input medium ...
Benchmarking old with input micro ...

##### With input medium #####
Name           ips        average  deviation         median         99th %
new        37.94 K       26.36 μs    ±22.40%       25.92 μs          33 μs
old        33.67 K       29.70 μs    ±22.99%       29.29 μs       35.33 μs
new2       29.13 K       34.32 μs    ±30.66%       33.75 μs       41.58 μs

Comparison:
new        37.94 K
old        33.67 K - 1.13x slower +3.34 μs
new2       29.13 K - 1.30x slower +7.97 μs

Memory usage statistics:

Name    Memory usage
new            656 B
old            656 B - 1.00x memory usage +0 B
new2           656 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

##### With input micro #####
Name           ips        average  deviation         median         99th %
new         2.91 M      343.35 ns ±11272.78%         250 ns         375 ns
old         2.81 M      355.54 ns ±11417.85%         250 ns         417 ns
new2        2.63 M      380.11 ns ±10750.79%         291 ns         458 ns

Comparison:
new         2.91 M
old         2.81 M - 1.04x slower +12.19 ns
new2        2.63 M - 1.11x slower +36.77 ns

Memory usage statistics:

Name    Memory usage
new            656 B
old            656 B - 1.00x memory usage +0 B
new2           656 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

(tl;dr: inlining makes a little bit of difference, and fast-pathing ASCII strings with a 4-byte match actually slows things down)

Here's the exact same thing on OTP 26 (ref:master):

OTP 26 Benchmark
Operating System: macOS
CPU Information: Apple M1
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.14.1
Erlang 26.0-rc0

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: medium, micro
Estimated total run time: 1.40 min

Benchmarking new with input medium ...
Benchmarking new with input micro ...
Benchmarking new2 with input medium ...
Benchmarking new2 with input micro ...
Benchmarking old with input medium ...
Benchmarking old with input micro ...

##### With input medium #####
Name           ips        average  deviation         median         99th %
new2      222.95 K        4.49 μs   ±232.18%        4.38 μs        4.67 μs
old        39.31 K       25.44 μs    ±25.66%       25.21 μs       26.21 μs
new        39.05 K       25.61 μs    ±30.18%       25.21 μs       32.63 μs

Comparison:
new2      222.95 K
old        39.31 K - 5.67x slower +20.95 μs
new        39.05 K - 5.71x slower +21.12 μs

Memory usage statistics:

Name    Memory usage
new2           656 B
old            656 B - 1.00x memory usage +0 B
new            656 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

##### With input micro #####
Name           ips        average  deviation         median         99th %
new2        2.98 M      335.78 ns ±11879.12%         250 ns         416 ns
new         2.94 M      340.57 ns ±11380.92%         250 ns         375 ns
old         2.89 M      346.60 ns ±12061.87%         250 ns         416 ns

Comparison:
new2        2.98 M
new         2.94 M - 1.01x slower +4.79 ns
old         2.89 M - 1.03x slower +10.82 ns

Memory usage statistics:

Name    Memory usage
new2           656 B
new            656 B - 1.00x memory usage +0 B
old            656 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

This time, we see basically no difference from inlining, but a 5.6x difference on large ASCII strings by matching them 4 bytes at a time. WOW.

On x86 the difference is smaller, at about 2x, but that's still twice as fast for a pretty small change.

The downside here is that this exact same change ends up 25% slower on OTP 25 and earlier.

I'm going to spend a bit more time characterizing this but assuming it checks out, @josevalim is there precedent / best practices (or even an appetite) for per-OTP optimizations such as this in the standard library? Happy to work up a PR if so.


mtrudel commented Jan 19, 2023

Of note, the improvement doesn't seem to come from the upstream work on UTF-8 validation, but rather from the efficiency of the '4 bytes at a time' pattern match itself.

@josevalim (Member) commented:

How does it scale? Comparing to reading two/four/eight bytes at a time?

@josevalim (Member) commented:

Also please benchmark non-ASCII strings to see the additional cost there. The linked paper on simdutf8 also has references on useful non-SIMD techniques to validate them.
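A sketch of how such extra inputs might look (the names and sizes are illustrative, chosen so each input is the same byte length as the all-ASCII "medium" input from the earlier benchmarks):

```elixir
# Illustrative benchmark inputs covering non-ASCII and mixed content.
# "é" encodes to 2 bytes in UTF-8 and "aé" to 3 bytes, so each input
# below is exactly 10_002 bytes, matching the all-ASCII case.
ascii = String.duplicate("a", 10_002)
non_ascii = String.duplicate("é", 5_001)
mixed = String.duplicate("aé", 3_334)

inputs = %{
  "ascii" => ascii,
  "non-ascii" => non_ascii,
  "mixed" => mixed
}

# These could then be passed to Benchee.run/2 via the :inputs option.
IO.inspect(Enum.map(inputs, fn {name, s} -> {name, byte_size(s)} end))
```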


mtrudel commented Jan 19, 2023

How does it scale? Comparing to reading two/four/eight bytes at a time?

I'm looking at this now. I spent the better part of an afternoon recently looking at a bunch of variations (different stride lengths, bitwise vs numeric matches, a whole bunch of different things) and they all came up lacking in some way or another on OTP 25, but literally the first one I tried on OTP 26 knocked me right over. We'll likely get even bigger wins here by the time I'm done golfing.


mtrudel commented Jan 19, 2023

Also please benchmark non-ASCII strings to see the additional cost there. The linked paper on simdutf8 also has references on useful non-SIMD techniques to validate them.

100%. This is nowhere near ready for consideration yet, and validating over all types and sizes of inputs is absolutely a necessary part of that!


mtrudel commented Jan 20, 2023

I've finished characterizing these changes and have some results.

Approach & Raw Data

I took the following approach (all steps carried out on ARM & x86):

  1. I baked off a number of approaches against one another on all-ascii inputs. Approaches taken included:

    • Pattern matching in the style of <<0::1, _::7, ...>> in 8, 32, 64 and 128 bit strides
    • Guard matching in the style of <<a::8, ...>> when a < 128 in 8, 32, 64 and 128 bit strides
    • Pattern matching against all possible UTF-8 sizes
    • Cowboy's adaptation of Hoehrmann's DFA-based approach

    All tests were also run on OTP 25, which did not exhibit any improvement; whatever change enables this is only present in OTP 26.

    ARM results (some algorithms weren't at all competitive and have been elided here):
    [image: benchmark chart]

    x86 results (some algorithms weren't at all competitive and have been elided here):
    [image: benchmark chart]

  2. I verified that the algorithms correctly accepted valid, and rejected invalid, UTF-8 input of a variety of types / compositions.

  3. I took the best performing algorithms (pattern128 and guard128) and characterized their behaviour by size on all-ascii inputs:

    ARM results:
    [image: benchmark chart]

    x86 results:
    [image: benchmark chart]

  4. I again took the best performing algorithms (pattern128 and guard128) and characterized their behaviour by composition and size: all-ascii, all-non-ascii & mixed, each at 9, 72 and 1026 bytes (oddball sizes chosen to accommodate non-ascii characters):

    ARM results:
    [image: benchmark chart]

    x86 results:
    [image: benchmark chart]

Summary of Results

Under the right conditions (all-ascii binaries larger than ~128 bits, on OTP 26), we see significant improvements: 2.5x for 4k inputs on x86, and 5x on ARM. These numbers seem to increase quadratically with input size (0.02/0.07 coefficient (ARM/x86), R^2=0.98). This is nothing to sneeze at, though the necessary conditions are somewhat constraining.

Suggestion

Based on the above, it would seem that a winning approach here would be to:

  • Add an optional extra argument to String.valid?/1, in the same manner that String.downcase/2 has an optional :ascii parameter, which allows the user to opt into the following implementation:
    • Continue to use the existing approach for binaries less than 128 bits in length (we can do this with an extra clause of String.valid?/1 with a bit_size/1 guard).
    • Use the pattern128 approach for larger binaries. This ends up being a single extra pattern on String.valid_utf8?/1.

WDYT?


mtrudel commented Jan 20, 2023

If it helps to clarify, a rough sketch would look like

  @spec valid?(t, atom) :: boolean
  def valid?(string, variant \\ :default)

  def valid?(<<string::binary>>, :fast_ascii) when bit_size(string) >= 128, do: valid_utf8_fast_ascii?(string)
  def valid?(<<string::binary>>, _), do: valid_utf8?(string)

  defp valid_utf8?(<<_::utf8, rest::bits>>), do: valid_utf8?(rest)
  defp valid_utf8?(<<>>), do: true
  defp valid_utf8?(_), do: false

  defp valid_utf8_fast_ascii?(
         <<0::1, _a::7, 0::1, _b::7, 0::1, _c::7, 0::1, _d::7, 0::1, _e::7, 0::1, _f::7, 0::1,
           _g::7, 0::1, _h::7, 0::1, _i::7, 0::1, _j::7, 0::1, _k::7, 0::1, _l::7, 0::1, _m::7,
           0::1, _n::7, 0::1, _o::7, 0::1, _p::7, rest::bits>>
       ),
       do: valid_utf8_fast_ascii?(rest)
  defp valid_utf8_fast_ascii?(<<_::utf8, rest::bits>>), do: valid_utf8_fast_ascii?(rest)
  defp valid_utf8_fast_ascii?(<<>>), do: true
  defp valid_utf8_fast_ascii?(_), do: false


josevalim commented Jan 20, 2023

Awesome work @mtrudel! Did you try a bitmask version? 32-bit values are efficient in the Erlang VM, so you can read 32 bits at a time and do a bit mask against 0x80808080. If the masked value is zero, then it is valid.

You could try with 48 bits and even 56 bits, but I am almost sure 64-bit values are bignums in Erlang and may be slower.
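A minimal sketch of the 32-bit bitmask idea (the module name here is hypothetical): read four bytes at a time and AND them against 0x80808080. The mask has the high bit of each byte set, so a zero result means every byte is below 128, i.e. all four are ASCII.

```elixir
defmodule Bit32 do
  import Bitwise

  def valid?(<<string::binary>>), do: valid_utf8?(string)
  def valid?(_), do: false

  # Fast path: consume four ASCII bytes at once. A zero AND result
  # against 0x80808080 means all four bytes have their high bit clear.
  defp valid_utf8?(<<chunk::32, rest::bits>>) when band(chunk, 0x80808080) == 0,
    do: valid_utf8?(rest)

  # Fall back to codepoint-at-a-time matching for non-ASCII (or short) tails
  defp valid_utf8?(<<_::utf8, rest::bits>>), do: valid_utf8?(rest)
  defp valid_utf8?(<<>>), do: true
  defp valid_utf8?(_), do: false
end
```

Note that the fast path can never split a multi-byte sequence, since it only fires when all four bytes are below 128 and so cannot be part of one.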


mtrudel commented Jan 20, 2023

I'd tried bitwise checks (at 8, 32, 64 and 128 bit strides) before and dropped them from competition as they weren't very remarkable. However, going back and trying different stride lengths is a very different story (especially on x86):

I tried 8, 32, 48, 56, 64 and 128 bit bitwise matches this time around, on both platforms. On both, 56 bit strides were the best performer (with 32 bit coming in second). This makes sense, as the size of a small int is 60 bits, so 56 bits is the largest whole-byte chunk we can read without straying into bignum territory.
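The small-int boundary is easy to sanity-check from a shell. A sketch using :erts_debug.flat_size/1 (an internal BEAM debugging helper), assuming a 64-bit VM:

```elixir
import Bitwise

# On a 64-bit VM, small integers are immediates and occupy no heap words,
# so :erts_debug.flat_size/1 returns 0 for them, while bignums occupy one
# or more heap words.
small = bsl(1, 56) - 1  # largest value a 56-bit binary read can produce
big = bsl(1, 63)        # past the ~60-bit immediate range: a bignum

IO.inspect({:erts_debug.flat_size(small), :erts_debug.flat_size(big)})
```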

Recasting the previous winner (pattern128) against bit32 and bit56, here's what we get:

ARM performance increases to 8.3x (!) @ 4k:
[image: benchmark chart]

x86 goes up to 4.6x @ 4k:
[image: benchmark chart]

It also improves on some of the previous shortcomings:

  • Performance for short binaries isn't impacted as much, so we can forgo the > 128 bit check on input size
  • Performance doesn't fall off nearly as much for non-ascii binaries (though it still degrades somewhat)
  • Unfortunately this is all still OTP 26+

I think the overall approach of adding an optional fast_ascii (or similar) flag remains the best option here, though I'll obviously be basing it on the (even simpler to implement!) bit56 approach.

Unless you have any objections, I'll get this cooked up into a PR for 1.15.


mtrudel commented Jan 20, 2023

Revised input type performance on ARM:
[image: benchmark chart]

Revised input type performance on x86:
[image: benchmark chart]

ARM performance is still pretty degraded for non-ascii strings, but x86 performance is improved for all cases except large non-ascii strings.


mtrudel commented Jan 20, 2023

Again, for clarity, the changes would look something like:

  @spec valid?(t, atom) :: boolean
  def valid?(string, variant \\ :default)

  def valid?(<<string::binary>>, :fast_ascii), do: valid_utf8_fast_ascii?(string)
  def valid?(<<string::binary>>, _), do: valid_utf8?(string)

  defp valid_utf8?(<<_::utf8, rest::bits>>), do: valid_utf8?(rest)
  defp valid_utf8?(<<>>), do: true
  defp valid_utf8?(_), do: false

  defp valid_utf8_fast_ascii?(<<a::56, rest::bits>>)
       when Bitwise.band(0x80808080808080, a) == 0,
       do: valid_utf8_fast_ascii?(rest)

  defp valid_utf8_fast_ascii?(<<_::utf8, rest::bits>>), do: valid_utf8_fast_ascii?(rest)
  defp valid_utf8_fast_ascii?(<<>>), do: true
  defp valid_utf8_fast_ascii?(_), do: false

@josevalim (Member) commented:

Yes, I think those improvements are great, awesome work. PR welcome.
