Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MapSet Optimizations #4415

Merged
merged 1 commit into from Mar 23, 2016
Merged

Conversation

benwilson512
Copy link
Contributor

Overview

Hey folks!

Based upon discoveries made during my little foray into map performance characteristics a few weeks ago I thought I'd give MapSet a work over. I was also very curious about the circumstances under which MapSet out performs :lists.subtract

Warning: Be careful when benchmarking -- or :lists.subtract. The compiler (I think it's the compiler) does some crazy optimizations when operating on list literals. Always create a module / function like MyList.subtract/2 and call that instead, otherwise the result will be pre-computed at compile time.

General Changes

Many of the current functions use :maps.fold. Under the covers :maps.fold calls :maps.to_list first, and then :lists.foldl. In many cases :maps.fold uses Map.put to accumulate desired results in a new map.

This gives us three general opportunities for improvement.

  • By far the largest improvement comes from first accumulating {key, value} pairs into a list, and then using :maps.from_list instead of consecutive calls to Map.put/2. This is not only faster, but (from what I've been told) produces less garbage for the GC. If anyone knows of a good way to quantify the amount of garbage produced I'd love to hear it.
  • :maps.to_list can be improved upon by simply using Map.keys. Not only does Map.keys happen to be faster, it also improves code clarity because we only care about the keys anyway.
  • :lists.fold can be improved upon by manually recursing through lists produced by Map.keys and calling other functions as needed. While this does make the code more verbose, it tends to get us about a 20% improvement.

Functions

Benchmarking notes: Y axis is in micro seconds.

MapSet.difference/2

Benchmark: https://github.com/benwilson512/elixir-profiling/blob/master/bench/mapset_difference_bench.exs

image

MapSet.difference(set1, set2) is the only function that had any algorithmic changes. The current approach iterates through set2 and deletes each item found from set1.

I realized that when set2 is larger than set1 this no longer makes any sense. Instead it is faster to iterate through each item in set1 and only accumulate those items which do not also exist in set2.

After testing, I determined that it becomes faster to iterate through set1 when set2 is at least half the size of set1.

MapSet.intersection/2

Benchmark: https://github.com/benwilson512/elixir-profiling/blob/master/bench/mapset_intersection_bench.exs

image

Accumulate in a list, manual recursion, Map.keys. These small changes produce a surprisngly substantial improvement.

MapSet.disjoint/2

Benchmark: https://github.com/benwilson512/elixir-profiling/blob/master/bench/mapset_disjoint_bench.exs

image

The previous version iterated through the smallest set and would throw in the event that a given key was also found in the other set.

This version uses Map.keys and manual recursion to achieve functional flow control and superior performance.

MapSet.new

Benchmark: https://github.com/benwilson512/elixir-profiling/blob/master/bench/mapset_new_bench.exs

image
image

Manual recursion, list accumulation.

While I was at it, I took a similar approach with Map.new/1,2.

MapSet.disjoint?

image

Comparison to --

I'm still working on this, I figured I should get some feedback on my changes here before going further on this because any major changes we make have the potential to change my analysis here.

Enum.reduce(enumerable, %{}, fn {k, v}, acc -> put(acc, k, v) end)
enumerable
|> Enum.reduce([], &[&1 | &2])
|> :lists.reverse
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary because even though Maps are unordered, if we do Map.new([a: 1, a: 2]) the current behaviour is that we get %{a: 2}. Without reversing we would get %{a: 1}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just Enum.to_list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah whoops I just edited this from the MapSet version, that would indeed make sense.

enumerable
|> Enum.reduce([], &[transform.(&1) | &2])
|> :lists.reverse
|> :maps.from_list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could compare this with Enum.into(enumerable, %{}, transform).

If one is faster than the other, both should be probably changed to be the same implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting a fast track for Enum.into/2 and maps? The regular collectable implementation for maps can't benefit from this, because it only has one item at a time. :maps.from_list only works when you know all the items you want to put into the map ahead of time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collectable for maps could accumulate a list and call :maps.from_list/1 at the end instead of accumulating a map if that's faster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was my understanding that each iteration in collectable had to return a map, but if not then yes we should do that to improve Enum.into. However, I don't think we should use Enum.into here. The protocol overhead would be definitely noticeable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, Enum.into/2 optimizes a bunch of cases. So we should very likely call Enum.into/2 instead and make sure we optimize those cases there. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we should optimize the collectable to. To be clear and avoid confusion:

  1. Enum.into/2 inlines some code when it notices the second argument is a map (and not a struct). We can rely on it.
  2. Regardless of Enum.into/2, we can also optimize collectable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In particular, Enum.into/2 should optimize into a single :maps.from_list call if the first argument is a list and the second is a map (not a struct). I believe we already do it today but, if we don't, we certainly should.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I've seen we don't do any optimizations for Enum.into an empty (or very small) map. I do believe that the best place for optimization is the Enum.into module itself, rather than the collectable protocol implementation for maps or mapsets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I think we have another reason to avoid Enum.into in map: It's nicer to optimize Enum.into in terms of Map.new, rather than the other way around. For example. here's my proposed change to Enum.into for empty maps:

def into(enumerable, %{} = map) when map_size(map) == 0 do
  Map.new(enumerable)
end

Alternatively, we could have Enum.into do what my Map.new does now, and have Map.new use Enum.into. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josevalim
Copy link
Member

Please go ahead with those changes. Using Enum.to_list would clean things up a bit but this is definitely the way to go. ❤️ 💚 💙 💛 💜

@benwilson512
Copy link
Contributor Author

General note: I realize that some of these changes are a bit more verbose, and I promise that I'm not a "faster at all costs" kinda guy. However when it comes to core language data structures I really do think that every bit counts, which is why you see me using a lot of manual recursion over Enum. I've tested the Enum versions locally and in most cases we gain at least 20% by recursing manually.

Part of this is the manual recursion, but I'm convinced a bigger part is that we get to avoid anonymous functions in the hot path. If it was only 2% I wouldn't care, but at 20% it really matters.

@josevalim
Copy link
Member

Yup, it is ok for us to have "verbose" code if it means everyone else does not have to.

@benwilson512
Copy link
Contributor Author

Preliminary Map module improvements:
Done by making a map of 10,000 items, and then dropping / taking / splitting 5,000 items

                          Before   After
Map.drop           1000   1948.48  1833.82
Map.take           1000   2181.71  1108.00
Map.split           500   5095.14  1122.09

@benwilson512
Copy link
Contributor Author

I'm gonna keep making regular commits and will squash when we're done.

%MapSet{map: Map.drop(map1, Map.keys(map2))}
end

def filter_not_in(keys, map2) do

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be private.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, also it'd better to define it as bodiless clause def filter_not_in(keys, map2, acc \\ []).

@benwilson512
Copy link
Contributor Author

Anyone know why the commit that added Enum.into specialization broke compilation? I didn't notice locally at first because the regular make test worked, it's only when I ran make clean test that it blew up. It seems like there's some kind of compile time dependency on Map.new in the enum module?

@lexmag
Copy link
Member

lexmag commented Mar 20, 2016

Anyone know why the commit that added Enum.into specialization broke compilation?

Compilation fails because of %MapSet{} usage, it expects MapSet module to be available during compilation but we haven't compiled it yet.
To fix that we should use %{__struct__: MapSet}.

@benwilson512
Copy link
Contributor Author

Aha, thanks!

@benwilson512
Copy link
Contributor Author

Commits squashed. I'm pretty happy with where this is at. I think there are some things to think about as far as Enum.into and the collectable protocol ar concerned but those are definitely orthogonal, and since they may have implications beyond MapSet, I think deserve their own PR / discussion.

Any outstanding comments / critiques?

@@ -77,7 +77,9 @@ defmodule Map do
"""
@spec new(Enum.t) :: map
def new(enumerable) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a map is given as argument?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah we should just return the same map in that case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just be careful because the map must not be a struct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

@josevalim
Copy link
Member

I have dropped some final comments and we are ready to go!

@@ -52,7 +57,30 @@ defmodule MapSet do
"""
@spec new(Enum.t, (term -> term)) :: t
def new(enumerable, transform) do
Enum.reduce(enumerable, %MapSet{}, &put(&2, transform.(&1)))
map =
enumerable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misindentation here, it needs to be:

map =
  enumerable
  |> Enum.to_list


defp filter_not_in(keys, map2, acc \\ [])
defp filter_not_in([], _map2, acc), do: :maps.from_list(acc)
defp filter_not_in([k | rest], map2, acc) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd use key here instead of k to match existing style, also feels more readable.

@@ -425,7 +441,12 @@ defmodule Map do
"""
@spec drop(map, [key]) :: map
def drop(map, keys) do
Enum.reduce(keys, map, &delete(&2, &1))
do_drop(keys, map)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the @spec for this function is wrong as well, we do need to handle any enumerable, right?
Same with split/2.

@benwilson512
Copy link
Contributor Author

One thing that is conspicuously absent here are tests. split, take and drop only have doc tests, and those cases don't cover enumerable input. Should I fill out the map_test.exs with relevant cases or is relying on doc tests and fleshing out the examples the way we should go?

@josevalim
Copy link
Member

Both approaches are fine. Feel free to add more tests or add doctests or
both. :)

José Valimwww.plataformatec.com.br
http://www.plataformatec.com.br/Founder and Director of R&D

On Tue, Mar 22, 2016 at 1:12 AM, Ben Wilson notifications@github.com
wrote:

One thing that is conspicuously absent here are tests. split, take and
drop only have doc tests, and those cases don't cover enumerable input.
Should I fill out the map_test.exs with relevant cases or is relying on
doc tests and fleshing out the examples the way we should go?


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#4415 (comment)

@benwilson512
Copy link
Contributor Author

Tests are in.

@josevalim josevalim merged commit f4f2688 into elixir-lang:master Mar 23, 2016
@josevalim
Copy link
Member

❤️ 💚 💙 💛 💜

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants