Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add List.intersection/2 #10408

Closed
wants to merge 1 commit into from
Closed

Conversation

prodis
Copy link

@prodis prodis commented Oct 7, 2020

Add List.intersection/2 that returns a list containing only common elements in two lists.

Inspired by https://hexdocs.pm/miss/Miss.List.html#intersection/2.

@eksperimental
Copy link
Contributor

I was about to propose this to the mailing list as I found myself having to implement this

Copy link
Contributor

@eksperimental eksperimental left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be simplified like this


"""
@spec intersection(list(), list()) :: list()
def intersection(list1, list2) do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def intersection(list1, list2) do
def intersection(list1, list2) do
list1 -- list1 -- list2
end

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I used the temp variable for the sake of readability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I used the temp variable for the sake of readability.

I guess parentheses could be used, to save the creation of a variable

@eksperimental
Copy link
Contributor

There is one concern though. If you read the documentation for --/2

Before Erlang/OTP 22, the complexity of a -- b was proportional to length(a) * length(b), meaning that it would be very slow if both a and b were long lists. In such cases, consider converting each list to a MapSet and using MapSet.difference/2.

As of Erlang/OTP 22, this operation is significantly faster even if both lists are very long, and using --/2 is usually faster and uses less memory than using the MapSet-based alternative mentioned above. See also the Erlang efficiency guide.

@whatyouhide
Copy link
Member

Hi @prodis! Usually we discuss features in the elixir-lang-core mailing list, as described in the Development page on our website. Has this been discussed there?

I'm particularly concerned about the fact that this can be implemented with MapSet.intersection/2 after converting the lists to sets.

@josevalim
Copy link
Member

As @whatyouhide suggested, the mailing list would be preferred to discuss this. Although I would say I am a strong 👎 on this feature. This operation is at best O(m x n), which is linear to the list sizes, and some algorithms may be more efficient if rewritten to use maps or mapsets to at least make part of the operation based on log.

It can also be implemented more efficiently than in this PR using Enum.filter(list1, & &1 in list2) which at least makes the linear characteristics more apparently. So in my mind there isn't a strong need to make this part of the language.

Thank you!

@josevalim josevalim closed this Oct 7, 2020
@prodis
Copy link
Author

prodis commented Oct 7, 2020

Thanks for sharing the Enum.filter(list1, & &1 in list2) option. 😉

@Eiji7
Copy link
Contributor

Eiji7 commented Oct 7, 2020

@prodis For all interested I wrote a simple benchmark using Benchee:

example.exs
list1 = Enum.to_list(1..1000)
list2 = Enum.to_list(250..750)
mapset1 = MapSet.new(list1)
mapset2 = MapSet.new(list2)

Benchee.run(
  %{
    "filter" => fn -> Enum.filter(list1, & &1 in list2) end,
    "mapset" => fn -> MapSet.intersection(mapset1, mapset2) end,
    "pr" => fn -> list1 -- list1 -- list2 end
  }
)
Benchmark
$ mix run example.exs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Number of Available Cores: 8
Available memory: 15.53 GB
Elixir 1.11.0
Erlang 23.1.1

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 21 s

Benchmarking filter...
Benchmarking mapset...
Benchmarking pr...

Name             ips        average  deviation         median         99th %
mapset       12.86 K       77.77 μs    ±23.77%       73.31 μs      140.21 μs
pr            7.15 K      139.95 μs    ±10.03%      134.89 μs      188.00 μs
filter        1.55 K      645.54 μs    ±20.33%      634.85 μs     1011.32 μs

Comparison: 
mapset       12.86 K
pr            7.15 K - 1.80x slower +62.19 μs
filter        1.55 K - 8.30x slower +567.77 μs

@josevalim
Copy link
Member

Awesome job @Eiji7!

It is worth noting that the -- operator has been optimized to build a set internally in Erlang/OTP 23, which makes it more performant, but the algorithmic complexity is still there, so for some data I expect it to perform worse than the filter approach.

There is also room for optimization for MapSet.intersection which should make it considerably more efficient, which I proposed on EEP 50.

@eksperimental
Copy link
Contributor

eksperimental commented Oct 7, 2020

It is worth noting that when the lists are about the same size, the implementation in this PR is even faster.

list1 = Enum.to_list(1..1000)
list2 = Enum.to_list(250..1250)

Comparison: 
pr            5.15 K
mapset        4.16 K - 1.24x slower +46.41 μs
filter        0.62 K - 8.27x slower +1409.86 μs

I would say is it ok to use it for small lists. When converting two lists to MapSets would be overkill.

@prodis
Copy link
Author

prodis commented Oct 7, 2020

The point of List.intersection/2 is to use lists and not MapSets. It is not fair to compare performance of MapSets with lists when intersecting. 😄

Including the cost to transform the list in MapSets and the way back, take a look in the results:

example1.exs

list1 = Enum.to_list(1..1000)
list2 = Enum.to_list(250..750)

Benchee.run(%{
  "filter" => fn -> Enum.filter(list1, &(&1 in list2)) end,
  "mapset" => fn ->
    mapset1 = MapSet.new(list1)
    mapset2 = MapSet.new(list2)

    MapSet.intersection(mapset1, mapset2)
    |> MapSet.to_list()
  end,
  "pr" => fn -> list1 -- list1 -- list2 end
})

Benchmark

mix run example1.exs
Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Number of Available Cores: 12
Available memory: 16 GB
Elixir 1.10.4
Erlang 23.0.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 21 s

Benchmarking filter...
Benchmarking mapset...
Benchmarking pr...

Name             ips        average  deviation         median         99th %
pr           10.69 K       93.56 μs    ±15.80%          90 μs         170 μs
mapset        4.12 K      242.62 μs    ±13.48%         239 μs      378.04 μs
filter        1.67 K      598.27 μs    ±13.78%         578 μs         860 μs

Comparison:
pr           10.69 K
mapset        4.12 K - 2.59x slower +149.05 μs
filter        1.67 K - 6.39x slower +504.71 μs

@josevalim
Copy link
Member

@prodis ideally you would build the mapset along the way instead of a pass at the end. The point is that you need to ask yourself if you have the best data type for the job, especially because List may have duplicates.

@vans163
Copy link
Contributor

vans163 commented Oct 7, 2020

I use this pattern occasionally and mapsets do not fit the bill. Main reasons for not using mapsets are they are annoying to use, cannot be fit into guards nor can you use List module with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants