
Channel refactor and optimize #8322

Merged
merged 2 commits into from Nov 20, 2019

Conversation

@firejox
Contributor

firejox commented Oct 13, 2019

Several changes are applied:

  1. use an internal storage list to reduce heap allocations
  2. move cleanup operations out of the locked section during close
  3. remove a redundant lock by making the internal send/receive return a state
  4. because of 3, a non-blocking select can be determined from the state
  5. the wait/unwait of a select action can be done in O(1)
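The "internal storage list" in point 1 refers to an intrusive (in-place) linked list: each waiter embeds its own link fields, so enqueueing a sender or receiver on a channel needs no per-node heap allocation. A minimal sketch of the idea (names hypothetical, not the PR's actual code):

```crystal
# Hypothetical sketch: the link fields live inside the waiter itself,
# so pushing it onto the channel's wait list allocates nothing on the heap.
struct Waiter
  property next_ptr : Pointer(Waiter) = Pointer(Waiter).null
  property prev_ptr : Pointer(Waiter) = Pointer(Waiter).null
end

# The waiter is stack-allocated in the blocking fiber; the channel's
# list stores only a pointer to it.
waiter = Waiter.new
waiter_ptr = pointerof(waiter)
```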
@lribeiro

lribeiro commented Oct 13, 2019

Hi @firejox, how does this affect performance? Do you have any benchmarks you can share?

@asterite

Member

asterite commented Oct 13, 2019

I don't know if anyone has told you, but the PRs you're sending related to concurrency are great!

src/channel.cr Outdated (review thread resolved)
@firejox

Contributor Author

firejox commented Oct 14, 2019

@lribeiro Here are the benchmarks.

channel close benchmark

before

[developer@crystal-dev tmp]$ ../bin/crystal channel-close-bench.cr -Dpreview_mt --release -- 10000
Using compiled compiler at `.build/crystal'
00:54:15.125022741

after

[developer@crystal-dev tmp]$ ../bin/crystal channel-close-bench.cr -Dpreview_mt --release -- 10000
Using compiled compiler at `.build/crystal'
00:40:58.541846658

channel select unwait benchmark

before

[developer@crystal-dev tmp]$ ../bin/crystal channel-select-unwait-bench.cr --release -- 100000
Using compiled compiler at `.build/crystal'
00:00:00.450562818

after

[developer@crystal-dev tmp]$ ../bin/crystal channel-select-unwait-bench.cr --release -- 100000
Using compiled compiler at `.build/crystal'
00:00:00.210562496

channel ping pong benchmark

before

[developer@crystal-dev tmp]$ ../bin/crystal channel-ping-pong-bench.cr -Dpreview_mt --release -- 500000
Using compiled compiler at `.build/crystal'
00:00:20.045008184

after

[developer@crystal-dev tmp]$ ../bin/crystal channel-ping-pong-bench.cr -Dpreview_mt --release -- 500000
Using compiled compiler at `.build/crystal'
00:00:16.457026952
@firejox firejox force-pushed the firejox:channel-refactor-and-optimize branch from d5a39ab to 5747de8 Oct 14, 2019
@wontruefree

Contributor

wontruefree commented Oct 14, 2019

Thank you @firejox for all the work on concurrency!!

@@ -0,0 +1,98 @@
# :nodoc:
macro container_of(ptr, type, member)


asterite Oct 14, 2019

Member

Even though this is :nodoc: it leaks to the top-level namespace. So in my code I can call container_of and it will call this macro. This should probably be moved inside Crystal::StaticList and called like Crystal::StaticList.container_of. Same for the other macro.
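A sketch of the suggested namespacing, assuming the macro recovers the containing object from a pointer to one of its members (the body here is illustrative, not the PR's exact code):

```crystal
# :nodoc:
struct Crystal::StaticList
  # Recover a Pointer(type) to the object that contains `member`,
  # given a pointer to that member (illustrative implementation).
  macro container_of(ptr, type, member)
    Pointer({{type}}).new({{ptr}}.address - offsetof({{type}}, {{member}}))
  end
end

# Called as Crystal::StaticList.container_of(node_ptr, Sender, @list_node),
# so nothing leaks into the top-level namespace.
```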


firejox Oct 14, 2019

Author Contributor

OK, I will move these two macros inside the struct. Thank you.


firejox Oct 15, 2019

Author Contributor

@asterite Sorry to bother you. I ran into a problem when putting them into Crystal::StaticList: the macros seem unable to see Sender and Receiver because of their visibility. I found three ways to resolve it:

  1. Remove the private from Sender and Receiver so that Crystal::StaticList.container_of can be called. The problem is that this leaks Sender and Receiver into the public API.

  2. Pack the macros into a module like Crystal::StaticList::Reflect and include that module into Channel, so it calls container_of instead of Crystal::StaticList.container_of. This way is ugly and hard to understand.

  3. Use an annotation and a macro to generate the corresponding method, which would then invoke something like Receiver(T).from_{{ member_name }}(ptr). This could increase compile time because of the extra computation.

Here are the code pieces for these changes:
https://gist.github.com/firejox/a6f2193a5e0820eef7e7e36c937c2330

Maybe there are other, cleverer ways that I haven't come up with. What do you think?


asterite Oct 15, 2019

Member

Instead of private visibility you can just put :nodoc: in a doc comment.


firejox Oct 15, 2019

Author Contributor

OK, I will use :nodoc:. Thank you.

@firejox firejox force-pushed the firejox:channel-refactor-and-optimize branch 2 times, most recently from c6ae669 to b5fffa6 Oct 15, 2019
src/channel.cr Outdated (review thread resolved)
@@ -352,19 +405,18 @@ class Channel(T)

ops_locks.each &.lock

# Check that no channel is closed
unless ops.all?(&.available?)


bcardiff Oct 17, 2019

Member

If this is removed, then available? is no longer used.

But I don't follow where the check that none of the channels is initially closed is done. If one channel is ready and another is closed, I think the current code in this PR will yield and not raise.


firejox Oct 18, 2019

Author Contributor

Because op.execute will return the proper state, I think there is no need to check the channel twice. The case you mention is a new one, and the current spec doesn't cover it. Consider the following code:

# ch1 is ready and ch2 is closed

# a.cr
select
when m1 = ch1.receive?
when m2 = ch2.receive
end

# b.cr
select
when m1 = ch1.receive
when m2 = ch2.receive?
end

What are the expected behaviors of these two code pieces?


bcardiff Oct 18, 2019

Member

Ok. We can avoid the double check and accept that there is no guarantee regarding which channel, the closed or the ready one, will fulfill the select.

  • a.cr will either execute ch1's when with m1 = nil, or ch2's when with data.
  • b.cr will either raise due to ch1 being closed, or execute ch2's when with data.

There are no specs mixing receive and receive? though.

If ch2 were not ready, I would expect b.cr to always raise, both on a blocking and a non-blocking select. That is a spec that could be added. But I don't expect receive and receive? to be mixed.

So, this change is good.
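The spec suggested above could look roughly like this (a sketch, assuming Channel::ClosedError is what a receive on a closed channel raises):

```crystal
# Sketch: with ch2 closed and ch1 never ready, a select using bare
# `receive` on the closed channel is expected to raise.
it "raises when selecting receive on a closed channel" do
  ch1 = Channel(Int32).new
  ch2 = Channel(Int32).new
  ch2.close

  expect_raises(Channel::ClosedError) do
    select
    when m1 = ch1.receive
    when m2 = ch2.receive
    end
  end
end
```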


firejox Oct 18, 2019

Author Contributor

I see. Although your precondition is different from mine, I will add the spec about mixing receive and receive?.

src/channel.cr Outdated (review thread resolved)
@@ -0,0 +1,97 @@
# :nodoc:
struct Crystal::StaticList


bcardiff Oct 17, 2019

Member

I would like to find a more Crystal-like way to build this static list.
I think it should be possible to use a generic struct that enforces typing, and maybe some macros to hide the boilerplate, as done here. But all the pointers, casting, and the final interface make me doubt it.

Also, although it is :nodoc:, I think there could be some specs on the implementation.

The main problem is to build a linked list whose data acts as the nodes and can be stored on the stack, right?


firejox Oct 18, 2019

Author Contributor

I think this is the difference between internal storage and external storage. External storage enforces the type explicitly, while internal storage requires some way to know which type stores the list node; that's why Crystal::StaticList.container_of is needed. Because we store the data on the stack, we cannot avoid handling pointers.

And I will add some specs for StaticList.


bcardiff Oct 18, 2019

Member

What do you think about something like this https://gist.github.com/bcardiff/4327823bb43cbb93734fbb6240315dac ?

  • there are fewer casts
  • the pointer trickery is hidden in the module
  • the API is a bit cleaner: sometimes as methods, sometimes as macro calls.

There might be some subtle differences regarding the representation of the list, and this solution can only be used with structs. I didn't bother applying it to classes since I think it is only used with structs.

In case we need to extend it, in the worst case macros can be injected into the type that includes the module, to specialize whether next/prev need to be Pointer(T) or T.
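The module-based shape under discussion might look like this (a sketch with hypothetical names; see the linked gist for the real proposal):

```crystal
# Hypothetical sketch: an includable module that injects the link fields,
# so a list type can be generic over any struct that includes it.
module ListNode(T)
  property next_node : Pointer(T) = Pointer(T).null
  property prev_node : Pointer(T) = Pointer(T).null
end

struct Sender
  include ListNode(Sender)
end
```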


firejox Oct 19, 2019

Author Contributor

Sorry for the late reply; I was busy today. In my opinion, this change is a bit worse than mine. Listify.prepend(list, node) makes the code read like C, not Crystal. Although I currently use node.append_to pointerof(list) in the code, the implementation can easily be replaced by list.append node.self_pointer. list.append and list.shift would be better than Listify.append and Listify.shift.

Besides, the each operation should not hide the pointer. If you hide the pointer, the object you get is a copy of what the pointer points to, and any changes will not be applied to the original object. This would be a problem.

Including a module is good, but I don't want to restrict the name of the member variable in the struct/class that uses the list. (If needed, they may want lists with different names for future refactoring/optimization.)


bcardiff Oct 21, 2019

Member

I acknowledge that the loop in my gist should return pointers.

But I would argue that the API you are proposing:

  1. seems more similar to what I would find/write in C.
  2. has more features than needed, considering it will be hidden from the std-lib:
    a. it is only used for structs, but classes are supported
    b. it allows elements to exist in multiple lists at the same time (by passing the member), but at the cost of adding repetition in callers.

I still think an API like the one proposed in my gist is simpler and can do the job.

I would let others from the core team weigh in here.


RX14 Oct 31, 2019

Member

Look at the way Thread::LinkedList does it; I'm pretty sure you can generalise that to work on raw pointers. Then you don't need init_empty_list either: just store the head and tail, and use null pointers to signify the start/end of the list.
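The suggestion above, sketched under the assumption that nodes are structs exposing next_node/prev_node Pointer(T) properties (hypothetical names, not the PR's code):

```crystal
# Head/tail pointers with null pointers marking the list ends, so no
# init_empty_list is needed.
struct PointerList(T)
  @head = Pointer(T).null
  @tail = Pointer(T).null

  def push(node : Pointer(T)) : Nil
    # Pointer#value copies a struct, so modify a copy and write it back
    # through the pointer (the struct-copy gotcha discussed above).
    v = node.value
    v.prev_node = @tail
    v.next_node = Pointer(T).null
    node.value = v

    if @tail.null?
      @head = node
    else
      t = @tail.value
      t.next_node = node
      @tail.value = t
    end
    @tail = node
  end
end
```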

@firejox firejox force-pushed the firejox:channel-refactor-and-optimize branch from b5fffa6 to 288a200 Oct 20, 2019
@firejox

Contributor Author

firejox commented Oct 20, 2019

Updates:

  1. add specs for Crystal::StaticList
  2. add specs for a select mixing receive and receive?
Member

bcardiff left a comment

I left a couple of comments more.

FYI, I am waiting on this PR before asking for a rebase of #8006.

Thanks for all the effort @firejox

src/channel.cr Outdated (review thread resolved)

@firejox firejox force-pushed the firejox:channel-refactor-and-optimize branch from 288a200 to 031c4ca Oct 22, 2019
@bcardiff bcardiff requested review from ysbaddaden and asterite Oct 23, 2019
Member

asterite left a comment

I'd like to review this, but I currently don't have much time, and concurrency internals are not my strong point (at the moment).

@bcardiff We can wait for a review from at least one other core-team member, but if you think this is good (code is readable, does the same thing, doesn't have hacks that later on we won't understand unless we add more comments or docs, etc.) then I'd merge this.

Member

RX14 left a comment

The lists are going to need a better API. I can barely follow the specs, and it seems quite easy to slip up when using them.

end
end

describe "Crysta::StaticList" do


RX14 Oct 31, 2019

Member

Crystal.


straight-shoota Oct 31, 2019

Member

This typo can be avoided by removing the quotes.

# Because channel#close may clean up a long list, `select_context.try_trigger` may
# be called after the select returns. To prevent invalid address access,
# the state is allocated on the heap.
state_ptr = Pointer(Atomic(SelectState)).malloc(1, Atomic(SelectState).new(SelectState::Active))


RX14 Oct 31, 2019

Member

Can probably just do Pointer.malloc here - the compiler will guess the type from the Atomic(SelectState).new(SelectState::Active).
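In other words, the shortened call suggested here would be (behavior identical, since Pointer.malloc infers the type argument from the initial value):

```crystal
# Type inferred from the initial value, so the explicit
# Pointer(Atomic(SelectState)) annotation can be dropped.
state_ptr = Pointer.malloc(1, Atomic(SelectState).new(SelectState::Active))
```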


@firejox firejox force-pushed the firejox:channel-refactor-and-optimize branch 2 times, most recently from 432003f to 8916ea6 Oct 31, 2019
@firejox firejox requested a review from RX14 Nov 1, 2019
@RX14

Member

RX14 commented Nov 1, 2019

I still don't understand the reason why you need container_of. Could you explain it for me?

@RX14

Member

RX14 commented Nov 1, 2019

Like, why not just make your list generic and assume that T already has the next and prev pointers, just like Thread::LinkedList does? You don't have to do all this awful macro offsetof casting; the types are just... correct.

Thread::LinkedList already works with Reference types; you could rename Thread::LinkedList and specialize that type using macros to work with Value. Just look at the way Atomic(T) changes its implementation based on T.
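The Atomic(T)-style specialization referred to here uses compile-time macro branches on the generic argument; a minimal illustration (hypothetical, not proposed code):

```crystal
struct Node(T)
  # Compile-time branch on the generic argument, in the style of
  # Atomic(T): references are stored directly, values via a pointer.
  def storage_kind : String
    {% if T < Reference %}
      "reference (already pointer-sized)"
    {% else %}
      "value (store a Pointer(T) instead)"
    {% end %}
  end
end
```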

@firejox firejox force-pushed the firejox:channel-refactor-and-optimize branch from 8916ea6 to b540ce5 Nov 1, 2019
@firejox

Contributor Author

firejox commented Nov 1, 2019

@RX14 I want to keep some flexibility, so I use container_of and do not assume that T has next and prev pointers. It is all for flexibility.

You cannot do it the way Atomic(T) does; I tried before I made this PR. The main problem occurs in the design of the push operation: it should accept a Pointer type, not a Value type.

Now I have made the code generic. It seems that my implementation is not widely accepted by the core team.

@asterite

Member

asterite commented Nov 1, 2019

It seems that my implementation is not widely accepted by the core team

I personally like the performance improvements but I think the code becomes much harder to follow, with many low-level tricks. My fear is that in a couple of months we won't be able to understand how it all works.

I wish the performance improvements could happen without introducing that much of a code change.

@RX14

This comment has been minimized.

Copy link
Member

RX14 commented Nov 8, 2019

I want to keep some flexibility, so I use container_of and do not assume that T has next and prev pointers. It is all for flexibility.

This flexibility - when the type will never be used outside the stdlib - is not worth the drop in readability and the complication of the list API.

I personally like the performance improvements but I think the code becomes much harder to follow, with many low-level tricks. My fear is that in a couple of months we won't be able to understand how it all works.

I think the only really complicated tricks are in the list implementation. If that gets fixed, then I'm perfectly happy to merge this, pending @bcardiff's input.

@bcardiff

Member

bcardiff commented Nov 11, 2019

@firejox I made another round on this PR.

I applied a couple of refactors to use a linked list built specifically for this use case, and moved the shared state to a class instead of a manual Pointer.malloc. These changes are applied in their own commits at https://github.com/bcardiff/crystal/tree/refactor/linked-list

The underlying behavior is the same.

Then I ran the benchmarks you sent at #8322 (comment), with -Dpreview_mt and without (note that not all of them in your original post use -Dpreview_mt), and I am not able to reproduce the improvements locally for channel-close-bench.cr.

bench                              base                firejox             bcardiff
MT channel-close-bench.cr          00:10:42.358743895  00:13:56.305350595  00:14:03.869554707
MT channel-select-unwait-bench.cr  00:00:00.686825065  00:00:00.635335868  00:00:00.684841458
MT channel-ping-pong-bench.cr      00:00:07.174520691  00:00:05.834123856  00:00:05.903892141
ST channel-close-bench.cr          00:03:55.545640555  00:03:26.212166600  00:03:20.613478228
ST channel-select-unwait-bench.cr  00:00:00.216716945  00:00:00.048762337  00:00:00.051219858
ST channel-ping-pong-bench.cr      00:00:02.652404693  00:00:02.439848184  00:00:02.458077994

Which platform/hardware are you using?
Mine is Darwin Kernel Version 18.6.0 MacBookPro14,3 Intel Core i7 2.9 GHz

Between the refactor and the fix for the potential invalid memory access, I don't mind the slowdown in channel-close-bench.cr.

Do you want to add those commits here? Or we can open another PR based on this one and pick it up from there. As you prefer. Again, thanks for all the effort.

@firejox

Contributor Author

firejox commented Nov 12, 2019

@bcardiff Interesting. I don't know why it doesn't show the improvement for you; it actually shows a difference on my computer.

Which platform/hardware are you using?
Mine is Darwin Kernel Version 18.6.0 MacBookPro14,3 Intel Core i7 2.9 GHz

My platform is Linux 5.1.7-arch1-1, x86_64, AMD Athlon II X2 270 3.4 GHz

Do you want to add those commits here? Or we can open another PR based on this one and pick it up from there. As you prefer. Again, thanks for all the effort.

The changes are great, and more readable. I will merge them later. Thank you.

EDIT: Sorry, I didn't notice that there are two commits; I only looked at the last one before. The other commit can be improved: because the implementation has changed, there is no need to use append_to. A simple swap will achieve the same result.

I will pick the last one and make some changes to the other one.

@bcardiff

Member

bcardiff commented Nov 12, 2019

Sure thing @firejox, thanks for continuing to iterate.

@firejox firejox force-pushed the firejox:channel-refactor-and-optimize branch from b540ce5 to ba92974 Nov 13, 2019
src/channel.cr Outdated (review thread resolved)
@firejox firejox force-pushed the firejox:channel-refactor-and-optimize branch from ba92974 to fdb3167 Nov 13, 2019
src/channel.cr Outdated (review thread resolved)
firejox and others added 2 commits Oct 13, 2019
Several changes are applied:
1. use an internal storage list to reduce heap allocations
2. move cleanup operations out of the locked section during close
3. remove a redundant lock by making the internal send/receive return a state
4. because of 3, a non-blocking select can be determined from the state
5. the wait/unwait of a select action can be done in O(1)
@firejox firejox force-pushed the firejox:channel-refactor-and-optimize branch from fdb3167 to ba33bee Nov 14, 2019
Member

bcardiff left a comment

🎉 I have no more feedback. LGTM!

@bcardiff bcardiff added this to the 0.32.0 milestone Nov 14, 2019
@bcardiff bcardiff merged commit 7b46a78 into crystal-lang:master Nov 20, 2019
5 checks passed
ci/circleci: check_format Your tests passed on CircleCI!
ci/circleci: test_linux Your tests passed on CircleCI!
ci/circleci: test_linux32_std Your tests passed on CircleCI!
ci/circleci: test_preview_mt Your tests passed on CircleCI!
continuous-integration/travis-ci/pr The Travis CI build passed
def compare_and_set(cmp : SelectState, new : SelectState) : {SelectState, Bool}
@state.compare_and_set(SelectState::Active, SelectState::Done)
end
Comment on lines +79 to +81

@Sija

Sija Nov 21, 2019

Contributor

Hmm, it seems these arguments (cmp and new) aren't used anywhere... 🤔
Shouldn't the method body be @state.compare_and_set(cmp, new)?
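The fix described here is a one-line change so the method actually honors its parameters:

```crystal
# Corrected body: forward the arguments instead of the hard-coded values.
def compare_and_set(cmp : SelectState, new : SelectState) : {SelectState, Bool}
  @state.compare_and_set(cmp, new)
end
```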

@bcardiff

bcardiff Nov 21, 2019

Member

You're right. Fortunately, it is only ever called with those values 🙈.

Sija added a commit to Sija/crystal that referenced this pull request Nov 21, 2019
asterite added a commit that referenced this pull request Nov 21, 2019