
Array#shift is slow #5126

Closed
benoist opened this issue Oct 15, 2017 · 59 comments · Fixed by #10081

Comments

@benoist
Contributor

benoist commented Oct 15, 2017

I was using Array#shift a lot, but found that it's quite slow compared to an index lookup.

This is the current implementation

  def shift
    if @size == 0
      yield
    else
      value = @buffer[0]
      @size -= 1
      @buffer.move_from(@buffer + 1, @size)
      (@buffer + @size).clear
      value
    end
  end

If I change this function by

  def shift
    if @size == 0
      yield
    else
      value   = @buffer[0]
      @size   -= 1
      @buffer += 1
      value
    end
  end

It is a lot faster
Shifting Array(String).new(100_000, "a") shows the following speed improvement

                 user     system      total        real
shift        1.600000   0.000000   1.600000 (  1.602299)
each         0.000000   0.000000   0.000000 (  0.000487)
fast shift   0.000000   0.000000   0.000000 (  0.000530)

The question is: can it really be implemented like this, or is there an important reason the buffer move and clear are required for this operation? I've made the change on the current master branch and all tests pass.

@benoist benoist changed the title Array.shift is slow Array#shift is slow Oct 15, 2017
@jhass
Member

jhass commented Oct 15, 2017

Well, doesn't your implementation leak memory? That is, the array will never shrink again?

However, I agree we should only move after some threshold: that is, move if there's X free space in front of the buffer, and also check that when pushing.

In any case, check whether Deque doesn't fit your usage pattern better.
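For context, here is a minimal Ruby sketch of the ring-buffer idea behind a deque (the class name RingDeque and the fixed capacity are invented for illustration; Crystal's actual Deque grows dynamically): shift only advances a head index, so it is O(1).

```ruby
# Illustrative fixed-capacity ring buffer: shift advances a head index
# instead of moving every remaining element.
class RingDeque
  attr_reader :size

  def initialize(capacity)
    @buf = Array.new(capacity)
    @head = 0       # index of the first live element
    @size = 0
    @capacity = capacity
  end

  def push(value)
    raise "full" if @size == @capacity
    @buf[(@head + @size) % @capacity] = value  # wrap past the end
    @size += 1
    self
  end

  def shift
    raise "empty" if @size == 0
    value = @buf[@head]
    @buf[@head] = nil                # drop the reference
    @head = (@head + 1) % @capacity  # O(1): just advance the head
    @size -= 1
    value
  end
end
```

Both push and shift touch a single slot and do constant arithmetic, which is why Deque is the natural fit for shift-heavy workloads.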

@asterite
Member

Duplicate of #573

@asterite asterite marked this as a duplicate of #573 Oct 15, 2017
@asterite
Member

We should probably document that Array#shift is O(n)

@benoist
Contributor Author

benoist commented Oct 15, 2017

Well Deque has the fast shift implementation, but I need other functions from array first before shifting.

@asterite
Member

@benoist As to why your implementation doesn't work: later, when you need to grow the array, you've lost the reference to the original pointer to invoke realloc on.

@asterite
Member

Alternatively we could store an offset inside the array, but maybe that's a big penalty for just this use case.
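A hedged sketch of that offset idea in Ruby (OffsetArray, compact!, the moves counter, and the compaction threshold are all invented for illustration, not Crystal's implementation): shift bumps an offset, and the O(n) move is only paid when the dead space in front outgrows the live data.

```ruby
# Keep an offset into the backing buffer; shift is O(1) in the common
# case, and compaction amortizes the occasional O(n) move.
class OffsetArray
  attr_reader :moves  # counts compactions, for inspection only

  def initialize(elements)
    @buf = elements.dup
    @offset = 0
    @moves = 0
  end

  def size
    @buf.size - @offset
  end

  def shift
    return nil if size == 0
    value = @buf[@offset]
    @offset += 1                 # O(1): just grow the front gap
    compact! if @offset > size   # dead space exceeds live data: move
    value
  end

  private

  def compact!
    @buf = @buf[@offset..-1]     # slide live elements back to the front
    @offset = 0
    @moves += 1
  end
end
```

Draining n elements this way performs only a handful of compactions rather than n element moves per shift.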

@benoist
Contributor Author

benoist commented Oct 15, 2017

Converting to a Deque before shifting has some overhead, of course, but it's already a 160x speedup.

                       user     system      total        real
shift              1.600000   0.000000   1.600000 (  1.606989)
each               0.000000   0.000000   0.000000 (  0.000606)
fast shift         0.000000   0.000000   0.000000 (  0.000649)
convert to deque   0.010000   0.000000   0.010000 (  0.001799)

@asterite
Member

In Ruby they do increase the base pointer... I don't know how they can later reallocate. It's worth investigating how they do it; it might be worth doing the same in our case.

@benoist
Contributor Author

benoist commented Oct 15, 2017

If I'm reading the Ruby implementation correctly, it seems they hold off the memmove until inserts, and they double the capacity upon insert. That would give you the move penalty only once, when you actually need to reallocate. If the array is shifted until empty, it would never need the reallocation. We just need to make sure it doesn't leak memory, as @jhass pointed out.

@asterite
Member

But how do they retain the original pointer and the current pointer?

@benoist
Contributor Author

benoist commented Oct 15, 2017

If the array is not shared when shift is called, it will call the ary_make_shared function. I think this makes a copy of the original pointer and freezes it. This pointer is used in ary_modify to determine the shift.

https://github.com/ruby/ruby/blob/trunk/array.c#L590

https://github.com/ruby/ruby/blob/trunk/array.c#L322

@RX14
Contributor

RX14 commented Oct 15, 2017

Well, what functions do you need that Deque doesn't have? I would have thought that all Array functions could work on Deque once implemented. If Deque is the right data structure to use here, then you should use it.

Another option is reverse! then pop, to keep using just Array.

@benoist
Contributor Author

benoist commented Oct 15, 2017

I'm currently using uniq and sort from Array, which might be a lot slower on a Deque, as those functions are not just the pushes and shifts that Deque is optimized for. That's just an assumption though; I might be wrong here.

@RX14
Contributor

RX14 commented Oct 15, 2017

Have you tried it? Benchmarked it? Don't work on assumptions.

You can always convert the array into a deque rather easily.

Or use reverse! and pop to get O(1) element removal from the end in Array. It just matters which end you remove from.
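The reverse!-then-pop workaround, shown here in Ruby (Array#pop is O(1) at the tail in both languages): one O(n) reversal up front replaces n O(n) shifts.

```ruby
# Drain a queue front-to-back without ever calling shift: reverse once,
# then pop from the tail, which is constant time per element.
queue = (1..5).to_a

queue.reverse!                           # one O(n) pass up front
drained = []
drained << queue.pop until queue.empty?  # each pop is O(1)

drained  # elements come out in the original front-to-back order
```

Total work is O(n) for the whole drain instead of O(n^2) with repeated shift.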

@oprypin
Member

oprypin commented Oct 15, 2017

There's a bit of a problem: Deque#sort! does not exist, because the sorting implementation is hardcoded in Array.

@RX14
Contributor

RX14 commented Oct 15, 2017

@oprypin which is exactly why I asked @benoist what functions he needed from array that were not on deque...

@benoist
Contributor Author

benoist commented Oct 15, 2017

@RX14 Yes, I know I can convert it to a Deque or use reverse and pop, but that's not really the issue. If we can make Array#shift a lot faster, isn't that worth investigating? It's not very intuitive to need all these workarounds because Array#shift is slow. The reason I'm making assumptions for the other parts now is that putting the effort into testing and verifying still leaves this problem unsolved.

If the general conclusion is that Array#shift is as fast as it's going to get for now, then I have no problem with this issue being closed. :-)

@oprypin
Member

oprypin commented Oct 15, 2017

It's not very intuitive to do all these workarounds because Array#shift is slow.

Yeah, exactly. The non-workaround is Deque.

@benoist
Contributor Author

benoist commented Oct 15, 2017

I would still consider that a workaround, but if that's just me, I can live with that :-)

@RX14
Contributor

RX14 commented Oct 15, 2017

@benoist Using the correct data structure for the job, with the correct algorithmic complexity, is a workaround?

@benoist
Contributor Author

benoist commented Oct 15, 2017

If Deque is the only data structure that should do shifts, then shift should be removed from Array. But I don't think that should be the case. If the overall operations on the same data are faster within one data structure, then it makes no sense to change. I think Array#shift can be made faster.

start = Time.now
a = Array.new(100_000, "a")
a.size.times do
  a.shift
end
puts Time.now - start
Crystal: 1.6069080
Ruby: 0.006131

@lbguilherme
Contributor

lbguilherme commented Oct 15, 2017

Ruby implements an optionally shared buffer for Array. Some Array instances own the memory they use, and will themselves reallocate and free it when needed. But some Array instances are shared, in which case they keep a reference to an external reference-counted buffer and maintain an offset into it. A shared Array will never touch the buffer unless it is the only array referencing it. This also gives copy-on-write semantics: when a shared Array tries to modify some data, it first allocates its own buffer and copies everything into it. This is all done without any external impact, so the user of an Array cannot tell the difference, except by timing.

Here is some proof of it, taking @benoist's sample:

start = Time.now
a = Array.new(100_000, "a")
a.size.times do
  a[0] = "b"  # Modifying the array will force the array to own its own memory.
  a.shift
end
puts Time.now - start

Running on my computer: (I did not run crystal with release optimizations!)

Ruby without changing array:     0.007283187
Ruby changing array:             3.364405281
Crystal without changing array:  2.3464860
Crystal changing array:          2.3259410

This kind of optimization is nice because it makes some things faster, and you only pay the costs when you actually need to. But it also brings inexplicable slowdowns in functions that shouldn't ever be slow. Who could tell that a[0] = "b" would take time proportional to the length of the array?


Refs:
rb_ary_shift calls ary_make_shared if the Array is not yet shared.
Everything that may modify the array calls rb_ary_modify, which will make shared Arrays unshared (by allocating memory and copying/moving).
It all happens here: https://github.com/ruby/ruby/blob/trunk/array.c

@benoist
Contributor Author

benoist commented Oct 15, 2017

@lbguilherme thank you for this explanation! Just to be sure: what would be your suggestion for the current Array#shift implementation, leave it as is or change it into something similar to Ruby's?

@lbguilherme
Contributor

I'm not sure in which direction Crystal wants to go here. We could:

  1. Leave it as it is now. shift pops from the front of a container, and it is slow. Document it, so that the burden is on the user to adapt to another structure (like copying the data into a Deque first) or to pay the cost if they know the array is small. The nice thing about this is that each operation has a clear cost that can be explained in documentation.

  2. Apply extensive optimizations in all sorts of places. This would have an impact on how data is stored and on pretty much all functions. The end result would be that everything is faster for the average user, but the code becomes more complicated to maintain and the behavior less predictable. Why did writing to a single byte of memory cause my program to slow down 462 times? This is hard to explain.

Still on point 2, there are simpler optimizations, like keeping a buffer and an offset and only reallocating when they differ too much. This would have much less impact than shared buffers, but still.

Just to be clear, there are possible optimizations for many other functions as well, not just shift. Taking a slice could benefit too, for example.

I don't dislike either solution. Maybe a speed-focused container could come as a shard, so not everybody would have to fear unpredictable performance. Or maybe it should be in the standard Array itself, so that everybody can benefit from the performance. I particularly like optimizing for the average user, even with surprising behavior.

I would like to hear from @asterite on this.

@asterite
Member

From a user perspective, I always liked how Array, Hash and String are super generally optimized data structures in Ruby. They can share data with other instances, they are mutable and can be made immutable, they have generally fast operations, etc. Of course that comes at the cost of implementing all of that. Maybe in Ruby it makes more sense because it's a dynamic language and implementing other data structures is inefficient unless done in C. In most (all?) compiled languages you have different data structures; in Java, for example, you have ArrayList, LinkedList, Deque, etc. That's nice, but it's more cumbersome for the user, because she has to pick a data structure.

So... I don't know. If you need to shift a lot, maybe just use another data structure? If sort! is missing in Deque, maybe we should implement that? Maybe you can just use reverse! and then pop afterwards? It's hard to know without knowing what problem you are trying to solve, @benoist

Maybe adding an offset to Array is acceptable, I don't know. Maybe String could also share memory with parent strings to form views. I think all of that falls in the field of optimization, and right now that's not important, because it can always be done later without changing the external API.

@RX14
Contributor

RX14 commented Oct 16, 2017

This discussion should happen later once optimizations like this come to the top of our agenda, instead of simply features and stability. At that time we'll have a lot more data on array performance in practice.

@benoist
Contributor Author

benoist commented Oct 16, 2017

@asterite I've written a simple column-storage database with encoding, and to convert the columns back into rows I was shifting the values to form the rows. I'm using iterators because the column storage is not always aligned to values that belong to the same row, due to compression.

I'm keeping an offset now and iterating through the array instead of shifting. This allows me to use the sort function without an extra conversion to Deque.

@ghost

ghost commented Oct 16, 2017

What I find interesting is that I can read more about how ruby implements arrays here on crystal, than I can find on the ruby issue tracker or what not. Sorry for the distraction. ;-)

@funny-falcon
Contributor

@asterite there is no need for a circular buffer. Just move the array's contents to the beginning of the allocation when the end of the allocation is reached. That is actually the way Ruby's Array works; the "shared array" is just an implementation trick.
Crystal can use just a pointer to the beginning of the allocation, the size of the allocation, a pointer to the first element, and the size. In other words, only a pointer to the first element needs to be added (or a pointer to the beginning of the allocation).

@asterite
Member

@funny-falcon Yes, that's a possibility. But if you do shift + push in a loop, doesn't the array grow infinitely? The current pointer would be increased but never decreased.

@funny-falcon
Contributor

If Array is to be used as a Deque, then there should be amortization room, so the move doesn't happen too often.

I made a fix for Ruby's Array exactly for this scenario (i.e. Ruby's Array as a Deque).

@funny-falcon
Contributor

@asterite

But if you do shift + push in a loop doesn't it always grow the array infinitely?

As I've said, when push reaches the bound of the capacity and there is room freed by shifts, the elements are moved to the beginning of the allocation.
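A sketch of that scheme in Ruby (the name FrontGapArray, the fixed capacity, and the moves counter are invented for illustration): shift grows a gap at the front, and push moves the live elements back to index 0 only when it hits the end of the allocation.

```ruby
# Front gap scheme: shift is O(1); the O(n) move is deferred until a
# push actually runs out of room at the end of the allocation.
class FrontGapArray
  attr_reader :moves  # counts deferred moves, for inspection only

  def initialize(capacity)
    @buf = Array.new(capacity)
    @start = 0   # index of the first live element
    @count = 0
    @capacity = capacity
    @moves = 0
  end

  def size
    @count
  end

  def shift
    return nil if @count == 0
    value = @buf[@start]
    @start += 1                  # O(1): just widen the front gap
    @count -= 1
    value
  end

  def push(value)
    if @start + @count == @capacity && @start > 0
      # Hit the end of the allocation, but shifts freed room up front:
      # move the live elements back to index 0 instead of reallocating.
      @count.times { |i| @buf[i] = @buf[@start + i] }
      @start = 0
      @moves += 1
    end
    raise "full" if @start + @count == @capacity
    @buf[@start + @count] = value
    @count += 1
    self
  end
end
```

A shift + push loop therefore pays for one move of the live elements only once per batch of shifts, not on every shift.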

@funny-falcon
Contributor

If I'm not persona non grata here, I can make a PR for this.

@asterite
Member

@funny-falcon Sure! No one is non-grata here.

I still have my doubts about this, though: adding four bytes to all arrays for just one method that's maybe not used all that often. And, for example, in the OP's use case there was really no need for a shift.

@oprypin
Member

oprypin commented Oct 18, 2017

Haven't we already mentioned that it's too early for this kind of optimization? Besides, this is nothing more than a workaround.

@oprypin
Member

oprypin commented Oct 18, 2017

Bleh nevermind, not having to use Deque would be tempting.
Uh nevermind on the nevermind. See my next comment

@RX14
Contributor

RX14 commented Oct 18, 2017

Just a quick question: what downsides does a deque have over an array with a start offset from the allocation start? Apart from the obvious to_unsafe one, which I don't think is a big deal. I wouldn't have thought array accesses would be slowed down much by the tiny bit of extra math.

@funny-falcon
Contributor

@asterite, the single benefit of implementing this is getting rid of Deque and being closer to Ruby.

All other issues could be solved with programmer discipline.

@RX14 the simplest way is to not change the usage of @buffer, just move it on shift (and possibly on unshift). We only need to save a pointer to the allocation, or the number of elements @buffer was shifted from the allocation (i.e. the offset). In the latter case the allocation start is calculated as @buffer - @offset.

@funny-falcon
Contributor

I'd prefer to store a pointer to the allocation.

@oprypin
Member

oprypin commented Oct 18, 2017

First, about the upsides of Deque and the downsides of this suggested approach (which, frankly, has NOT been clearly described so far).
This kind of workaround cannot achieve Deque's asymptotic complexity. Deque guarantees O(1) complexity for a balanced number of push+shift or unshift+pop. This kind of shifted array would instead occasionally (with a constant factor!) incur an O(n) operation. So the improvement is by a constant factor (or, as some would say, no improvement) over the basic Array.
Sure, it improves the situation of many consecutive shift calls (which Deque also covers very well), but why all the complexity just for this case? And now I'm sure that it definitely cannot replace Deque.

The downside of a Deque is, hmm, I don't know... that it probably destroys various kinds of caches and so is unforgivably slower at the typical operations you do with it?

@RX14
Contributor

RX14 commented Oct 18, 2017

I really don't think Deque will be any slower than Array in typical operations. The only downsides I can think of are that the prefetcher will get confused when wrapping, and that branch prediction might fail too. Hopefully LLVM can compile it to a conditional move to avoid stalling the pipeline.

In actual fact, I think 95% of arrays will never wrap. Adding elements to the end is by far the most common array mutation op, so I think most arrays won't ever wrap around.

@oprypin
Member

oprypin commented Oct 18, 2017

In fact, instead of this suggestion, why not actually mix the implementations of Array and Deque? Have a shortcut path for when it's aligned, and re-align it to zero when required (to_unsafe) and before expensive operations (sort!).

@RX14
Contributor

RX14 commented Oct 18, 2017

@oprypin what do you mean by a "shortcut scenario"? That the check offset + i < size will always be true and it'll just act like an array?

@oprypin
Member

oprypin commented Oct 18, 2017

That or just offset == 0

@RX14
Contributor

RX14 commented Oct 18, 2017

@oprypin no, why? That means there are more branches in the non-zero case, and for what gain? Avoiding a single one-cycle addition instruction in a piece of code that is memory-bound, not IPC-bound anyway?

@funny-falcon
Contributor

@oprypin, you are not fully correct about complexity. Yes, an occasional single operation is O(n). But the amortized complexity is still O(1), because the big move happens once every n/c operations, so its amortized cost is O(n / (n/c)) = O(c) = O(1) (IIRC c = 8 in Ruby). I wouldn't be surprised if Array's amortized performance were exactly the same as Deque's for the push+shift pattern.
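The amortized claim can be checked with a small counting simulation in Ruby (moved_per_op and the parameter choices are invented; this counts hypothetical element moves rather than timing real arrays): with slack proportional to the size, the moved-elements-per-operation ratio stays constant no matter how many operations run.

```ruby
# Count how many element copies a "move to front on reaching the slack
# bound" scheme performs per shift. No real buffer; pure bookkeeping.
def moved_per_op(ops, size, slack)
  head = 0
  moved = 0
  ops.times do
    head += 1          # each shift advances the head by one slot
    if head == slack   # hit the amortization bound: compact
      moved += size    # compaction copies all live elements back to 0
      head = 0
    end
  end
  moved.to_f / ops
end

# With slack = size / 8 (c = 8, as recalled above for Ruby), the cost
# settles near size / slack = 8 moves per operation, independent of ops.
```

moved_per_op(100_000, 1024, 128) and moved_per_op(10_000, 1024, 128) both come out near 8.0, matching the O(n / (n/c)) = O(c) argument.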

@akzhan
Contributor

akzhan commented Oct 19, 2017

@funny-falcon what about simple Hash implementation w/o normalization etc.?

Now I think that small PRs will be reviewed quickly.

@akzhan
Contributor

akzhan commented Oct 19, 2017

LBTM, there are no reasons to mix Array with Deque.

@funny-falcon
Contributor

there are no reasons to mix Array with Deque.

If I were designing the language, I'd repeat Ruby's trick, just "because I can".

But yeah, given Deque is already implemented, there is not much point.

@RX14
Contributor

RX14 commented Oct 19, 2017

@akzhan what are the reasons not to? Deque appears to be a superset of Array's functionality, with the cost being implementation complexity.

@akzhan
Contributor

akzhan commented Oct 19, 2017

That's just my point: I rarely use Arrays in a Deque manner.

funny-falcon added a commit to funny-falcon/crystal that referenced this issue Oct 19, 2017
To achieve it, pointer to allocation is preserved, and @buffer is moved
instead of moving elements on `#shift`.

For crystal-lang#5126
@funny-falcon
Contributor

I've created #5148. It now contains the smallest change in this direction.
To match Deque, #insert, #delete_at and #unshift should also be improved.
If you strongly think this is the way to go, then I will work in that direction.
Otherwise, let's close this and #5148.
