
Conversation

@ypconstante (Contributor) commented Aug 7, 2025

Today LazyHTML.Tree.to_html is really slow on large trees. Replacing the binary accumulator with a list makes calls with large trees significantly faster, and also speeds up calls with smaller trees a little.
This does increase memory usage slightly, but given the performance improvement it seems worth it.
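The two accumulator styles being compared can be sketched roughly like this (a simplified, hypothetical example for illustration, not the actual LazyHTML.Tree code): the binary version appends to a growing binary on every step, while the iodata version conses chunks onto nested lists and concatenates once at the end.

```elixir
defmodule TreeSketch do
  # Binary accumulator: each <> extends the accumulator binary;
  # when the runtime cannot append in place, it copies.
  def to_html_binary(tree), do: node_to_binary(tree, "")

  defp node_to_binary({tag, children}, html) do
    html = html <> "<" <> tag <> ">"
    html = Enum.reduce(children, html, &node_to_binary/2)
    html <> "</" <> tag <> ">"
  end

  defp node_to_binary(text, html) when is_binary(text), do: html <> text

  # Iodata accumulator: build a deep list of chunks,
  # concatenate a single time at the end.
  def to_html_iodata(tree) do
    tree |> node_to_iodata() |> IO.iodata_to_binary()
  end

  defp node_to_iodata({tag, children}) do
    ["<", tag, ">", Enum.map(children, &node_to_iodata/1), "</", tag, ">"]
  end

  defp node_to_iodata(text) when is_binary(text), do: text
end
```

Both produce the same binary; the difference is only in how the intermediate result is accumulated.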

Elixir 1.18.4
Erlang 28.0.1
JIT enabled: true

...

##### With input big #####
Name                                ips        average  deviation         median         99th %
to_html (io data)                 83.02       12.05 ms     ±7.11%       11.88 ms       16.49 ms
to_html (io data reverse)         76.58       13.06 ms    ±14.34%       12.11 ms       17.62 ms
to_html (main)                    14.72       67.95 ms    ±30.57%       75.58 ms       90.30 ms

Comparison: 
to_html (io data)                 83.02
to_html (io data reverse)         76.58 - 1.08x slower +1.01 ms
to_html (main)                    14.72 - 5.64x slower +55.90 ms

Memory usage statistics:

Name                         Memory usage
to_html (io data)                 5.96 MB
to_html (io data reverse)         8.55 MB - 1.43x memory usage +2.58 MB
to_html (main)                    6.73 MB - 1.13x memory usage +0.77 MB

**All measurements for memory usage were the same**

##### With input medium #####
Name                                ips        average  deviation         median         99th %
to_html (io data)                283.62        3.53 ms     ±2.98%        3.50 ms        3.86 ms
to_html (io data reverse)        264.60        3.78 ms    ±15.05%        3.69 ms        7.21 ms
to_html (main)                   232.41        4.30 ms    ±39.28%        4.02 ms       14.17 ms

Comparison: 
to_html (io data)                283.62
to_html (io data reverse)        264.60 - 1.07x slower +0.25 ms
to_html (main)                   232.41 - 1.22x slower +0.78 ms

Memory usage statistics:

Name                         Memory usage
to_html (io data)                 2.05 MB
to_html (io data reverse)         2.63 MB - 1.28x memory usage +0.57 MB
to_html (main)                    2.31 MB - 1.12x memory usage +0.25 MB

**All measurements for memory usage were the same**

##### With input small #####
Name                                ips        average  deviation         median         99th %
to_html (io data reverse)        1.35 K      740.29 μs     ±5.85%      726.38 μs      917.88 μs
to_html (io data)                1.32 K      758.74 μs     ±7.34%      736.79 μs      971.52 μs
to_html (main)                   1.25 K      800.30 μs    ±17.72%      738.42 μs     1414.52 μs

Comparison: 
to_html (io data reverse)        1.35 K
to_html (io data)                1.32 K - 1.02x slower +18.45 μs
to_html (main)                   1.25 K - 1.08x slower +60.01 μs

Memory usage statistics:

Name                         Memory usage
to_html (io data reverse)       524.79 KB
to_html (io data)               456.34 KB - 0.87x memory usage -68.45313 KB
to_html (main)                  517.45 KB - 0.99x memory usage -7.34375 KB
Benchmark script:

tag = "main"

read_file = fn name ->
  __ENV__.file
  |> Path.dirname()
  |> Path.join(name)
  |> File.read!()
  |> LazyHTML.from_document()
  |> LazyHTML.to_tree()
end

inputs = %{
  "big" => read_file.("big.html"),
  "medium" => read_file.("medium.html"),
  "small" => read_file.("small.html")
}

Benchee.run(
  %{
    "to_html" => &LazyHTML.Tree.to_html/1
  },
  inputs: inputs,
  pre_check: true,
  time: 10,
  memory_time: 2,
  save: [path: "benchs/results/to-html-#{tag}", tag: tag]
)

Benchee.report(load: "benchs/results/to-html-*")

@josevalim (Member)

In theory the binary should be faster and use less memory, since we are reusing the match context:

ERL_COMPILER_OPTIONS=bin_opt_info mix compile --force

Can you please share the benchmark script you are using? You can also try this diff and see if it changes anything.

diff --git a/lib/lazy_html.ex b/lib/lazy_html.ex
index 358b37e..9e5f8c3 100644
--- a/lib/lazy_html.ex
+++ b/lib/lazy_html.ex
@@ -498,7 +498,7 @@ defmodule LazyHTML do
   """
   @spec html_escape(String.t()) :: String.t()
   def html_escape(string) when is_binary(string) do
-    LazyHTML.Tree.append_escaped(string, "")
+    LazyHTML.Tree.append_escaped(string)
   end

   # Access
diff --git a/lib/lazy_html/tree.ex b/lib/lazy_html/tree.ex
index 29edc32..66e9e6a 100644
--- a/lib/lazy_html/tree.ex
+++ b/lib/lazy_html/tree.ex
@@ -134,7 +134,7 @@ defmodule LazyHTML.Tree do
   # [1]: https://github.com/phoenixframework/phoenix_html/blob/v4.2.1/lib/phoenix_html/engine.ex#L29-L35

   @doc false
-  def append_escaped(text, html) do
+  def append_escaped(text, html \\ "") do
     append_escaped(text, text, 0, 0, html)
   end

@josevalim (Member)

Also, if you are using iodata, you don't need Enum.reverse; you can write operations like [html, " ", attrs, "/>"]. It should reduce your memory usage and improve performance.
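The suggestion above can be sketched as follows (the html/attrs chunks are made-up values, not LazyHTML's actual ones): since iodata nests to any depth, chunks can be wrapped in order inside a new list, instead of prepending to an accumulator and reversing at the end.

```elixir
html = ["<img"]
attrs = [["src=\"a.png\""]]

# Prepend-and-reverse style: chunks are consed onto the front of the
# accumulator, so the list must be reversed once at the end.
with_reverse = ["/>", attrs, " ", html] |> Enum.reverse() |> IO.iodata_to_binary()

# Nested style: wrap the existing iodata in a new list, in order.
# No reverse pass, and the old list is shared rather than rebuilt.
nested = [html, " ", attrs, "/>"] |> IO.iodata_to_binary()
```

Both yield the same binary, but the nested form skips the O(n) reverse and the extra intermediate list it allocates.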

@ypconstante (Contributor, Author) commented Aug 7, 2025

The Benchee script is in the PR description; the HTML files are the ones used by the Floki benchmark.

On the main branch:

ERL_COMPILER_OPTIONS=bin_opt_info mix compile --force
Compiling 3 files (.ex)
     warning: OPTIMIZED: match context reused
     │
 111 │        do: append_text(rest, text, whitespace_size + 1, ctx, html)
     │        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/lazy_html/tree.ex:111

     warning: OPTIMIZED: match context reused
     │
 123 │        do: append_escaped(rest, text, 0, whitespace_size, html)
     │        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/lazy_html/tree.ex:123

     warning: OPTIMIZED: match context reused
     │
 164 │       append_escaped(rest, text, offset + size + 1, 0, html)
     │       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/lazy_html/tree.ex:164

     warning: OPTIMIZED: match context reused
     │
 164 │       append_escaped(rest, text, offset + size + 1, 0, html)
     │       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/lazy_html/tree.ex:164

     warning: OPTIMIZED: match context reused
     │
 164 │       append_escaped(rest, text, offset + size + 1, 0, html)
     │       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/lazy_html/tree.ex:164

     warning: OPTIMIZED: match context reused
     │
 164 │       append_escaped(rest, text, offset + size + 1, 0, html)
     │       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/lazy_html/tree.ex:164

     warning: OPTIMIZED: match context reused
     │
 164 │       append_escaped(rest, text, offset + size + 1, 0, html)
     │       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/lazy_html/tree.ex:164

     warning: OPTIMIZED: match context reused
     │
 169 │     append_escaped(rest, text, offset, size + 1, html)
     │     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/lazy_html/tree.ex:169

Same output with the patch

@jonatanklosko (Member) commented Aug 7, 2025

This is actually crazy.

When adding the original implementation I benchmarked a few versions, and the current one was clearly superior, with runtime close or better and five times lower memory usage.

However, this is no longer the case on main (as indicated by @ypconstante's results). I tracked it down and it's a regression from #14. If I make this change, it restores the previous behaviour:

-  def append_escaped(text, html) do
+  defp append_escaped(text, html) do

@jonatanklosko (Member)

In theory the binary should be faster and use less memory since we are reusing the match context

Just to clarify, it's not about the match context (which has to do with the recursive traversal, and that's the same in both implementations); it's about the runtime optimising binary appends. I don't think bin_opt_info will tell us any difference, because in the iodata implementation we are simply not building a binary :D
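The runtime optimisation in question can be sketched outside LazyHTML (an illustrative example, not the library's code): when a binary accumulator is only ever extended with `<<acc::binary, ...>>` in a tail-recursive loop, the VM can grow the binary in place with spare capacity instead of copying it on every iteration.

```elixir
defmodule AppendSketch do
  # Uppercases ASCII letters by appending one byte at a time to a
  # binary accumulator -- the pattern the runtime's binary-append
  # optimisation targets.
  def upcase_ascii(text) when is_binary(text), do: upcase_ascii(text, "")

  defp upcase_ascii(<<>>, acc), do: acc

  defp upcase_ascii(<<c, rest::binary>>, acc) when c in ?a..?z do
    # <<acc::binary, ...>> can extend acc in place when the VM
    # knows no other reference to the binary exists.
    upcase_ascii(rest, <<acc::binary, c - 32>>)
  end

  defp upcase_ascii(<<c, rest::binary>>, acc) do
    upcase_ascii(rest, <<acc::binary, c>>)
  end
end
```

This is distinct from the match-context optimisation that bin_opt_info reports, which concerns how `rest` is traversed without re-creating a sub-binary on each clause.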

@josevalim (Member)

Just to clarify, it's not about match context (which has to do with recursive traversal, and that's the same in both implementations), it's about the runtime optimising binary appends

In my mind, "runtime optimizing binary appends" is the match context handling. If the match context was not optimized, then we would not see this optimization. But I found it weird that nothing warned about a match context not being created, and nothing about the append_attrs or to_html functions.

@josevalim (Member)

Nah, you are right, I am getting two different optimizations confused. Apologies. Ship it.

@jonatanklosko (Member)

Closing in favour of #19.

@ypconstante thank you very much for the PR, otherwise we wouldn't spot the regression!

