Switched from Broadcast to unagi-chan, which doesn't drop messages #3

Merged
merged 1 commit into bitemyapp:master, Sep 3, 2016


sgraf812 commented Sep 3, 2016

I'm not sure how this performs now; I haven't been able to get the benchmarks to run reliably on my machine.

I decided to use unagi-chan instead of broadcast-chan because of its focus on performance, but for no other particularly good reason.

Also, this doesn't yet contain the fixes discussed in hashrocket#14 (comment) and implemented in #2.

bitemyapp commented Sep 3, 2016

@sgraf812 I'll amend mine to use unagi-chan like yours (that was my backup plan if Broadcast fell apart anyway) and rerun the benchmark for comparison.

bitemyapp merged commit c7c0da0 into bitemyapp:master Sep 3, 2016

bitemyapp commented Sep 3, 2016

@sgraf812 I merged and combined your changes here with @zyla's - thank you!

How would y'all like to be credited in the edits to the blog post? Name / GitHub?

bitemyapp commented Sep 3, 2016

Currently this is what I'm getting when I run the bench suite with this unagi-chan version:

$ ./bin/hs-websocket-server 
hs-websocket-server: writev: resource vanished (Connection reset by peer)
hs-websocket-server: ConnectionClosed
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: hs-websocket-server: writev: resource vanished (Broken pipe)
writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: writev: resource vanished (Connection reset by peer)hs-websocket-server: writev: resource vanished (Broken pipe)

hs-websocket-server: writev: resource vanished (Broken pipe)
hs-websocket-server: ConnectionClosed
bitemyapp commented Sep 3, 2016

I don't know exactly what causes the error, but it happened because the step value was too high.

Dropping it to a "gentle bench" has it running:

clients:   530    95per-rtt: 105ms    min-rtt:   0ms    median-rtt:  32ms    max-rtt: 140ms
clients:   540    95per-rtt: 100ms    min-rtt:   0ms    median-rtt:  38ms    max-rtt: 144ms
clients:   550    95per-rtt: 109ms    min-rtt:   1ms    median-rtt:  39ms    max-rtt: 174ms
clients:   560    95per-rtt:  98ms    min-rtt:   0ms    median-rtt:  38ms    max-rtt: 139ms
clients:   570    95per-rtt: 103ms    min-rtt:   1ms    median-rtt:  48ms    max-rtt: 166ms
clients:   580    95per-rtt:  97ms    min-rtt:   0ms    median-rtt:  44ms    max-rtt: 148ms
clients:   590    95per-rtt: 115ms    min-rtt:   0ms    median-rtt:  42ms    max-rtt: 162ms
clients:   600    95per-rtt: 156ms    min-rtt:   0ms    median-rtt:  42ms    max-rtt: 254ms
clients:   610    95per-rtt: 152ms    min-rtt:   0ms    median-rtt:  37ms    max-rtt: 242ms
clients:   620    95per-rtt: 100ms    min-rtt:   1ms    median-rtt:  38ms    max-rtt: 122ms
clients:   630    95per-rtt: 111ms    min-rtt:   1ms    median-rtt:  43ms    max-rtt: 155ms

Currently it dies if the step size is even 250. No bueno.

bitemyapp commented Sep 3, 2016

Tried https://github.com/merijn/broadcast-chan - died after 500:

bin/websocket-bench broadcast ws://127.0.0.1:3000/ws --concurrent 10 --sample-size 100 --step-size 250 --limit-percentile 95 --limit-rtt 250ms
clients:   250    95per-rtt: 181ms    min-rtt:   5ms    median-rtt:  79ms    max-rtt: 246ms
clients:   500    95per-rtt: 244ms    min-rtt:   5ms    median-rtt: 143ms    max-rtt: 281ms
codygman commented Sep 3, 2016

@bitemyapp Did you see the open file limits section of the hashrocket repo? Did you increase the number of open files you could have before running this?

bitemyapp commented Sep 3, 2016

@codygman yes; I wouldn't have hit 45k with the Broadcast variant without having done so, and these results are much worse than Rust, which currently reaches 9-10k clients before failing the SLA.

There's something else going on here, common to the broadcast-chan/unagi-chan implementations, that is causing the server to fail the benchmark.

codygman commented Sep 3, 2016

@bitemyapp Yeah, just thought I'd double-check.

Hm, so if we take a look at the simple-broadcast example that avoids space leaks, it only has two imports:

import Control.Concurrent.MVar
import Control.Exception (mask_)

The existence of mask_ there immediately reminds me of https://www.schoolofhaskell.com/user/snoyberg/general-haskell/exceptions/catching-all-exceptions and all of the things that can go wrong with exceptions.

Perhaps we could amend merijn/simple-broadcast to use safe-exceptions to eliminate any concurrent exceptions that might not be handled.

After doing that, then we only have:

import Control.Concurrent.MVar

I'm guessing we could do the math on the load this benchmark would put on an MVar, make a minimal example that reproduces that load, and see if anything breaks?

I think the above would be enough to rule those two out as issues, or, if the issue lies there, lead us towards a solution. I'm just now becoming familiar with these libraries and with this issue as a whole, though, so please let me know what you think.
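For concreteness, a rough sketch of the minimal MVar load test described above; the client count, message rate, and structure here are illustrative assumptions, not anything from the benchmark harness:

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar
import Control.Monad (forM_, forever, replicateM)

-- One empty MVar per simulated client: the broadcaster fills each box,
-- and every client loops draining its own. This isolates the MVar
-- wake-up cost from all the websocket and socket machinery.
main :: IO ()
main = do
  boxes <- replicateM 10000 newEmptyMVar
  forM_ boxes $ \box ->
    forkIO $ forever $ do
      msg <- takeMVar box
      msg `seq` return ()          -- force the message, then discard it
  forM_ [1 :: Int ..] $ \i -> do
    forM_ boxes (`putMVar` i)      -- "broadcast" one message to every box
    threadDelay 1000               -- roughly 1 ms between broadcasts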

bitemyapp commented Sep 3, 2016

You can pursue that angle if you want; I'm focused on getting a profile of the run-up at the moment. I will certainly post it when I have one.

codygman commented Sep 3, 2016

Cool. Perhaps EKG would be useful as well?


bitemyapp commented Sep 3, 2016

I got a profile.

I would particularly highlight:

throwSocketErrorIfMinus1RetryMayBlock Network.Socket.Internal               54.5    0.4
encodeMessages.\                      Network.WebSockets.Hybi13             10.3   94.0

We're eating it on the websockets implementation, for some reason, not the synchronization. The unagi-chan stuff is all 0.0.

GitHub won't let me post the full profile. I've attached it as a file here: hs-websocket-server.prof.txt

codygman commented Sep 3, 2016

So it looks like throwSocketErrorIfMinus1RetryMayBlock is throwing an error. Perhaps we can print out some of those errors and see what they are?

EDIT: maybe throwSocketErrorIfMinus1RetryMayBlock is being used when throwSocketErrorWaitRead should be used?

Just spitballing.

lpeterse commented Sep 3, 2016

I investigated the networking code. throwSocketErrorIfMinus1RetryMayBlock is just an alias (not really, but essentially, through three levels of indirection) for throwErrnoIfRetryMayBlock:

-- | as 'throwErrnoIfRetry', but additionally if the operation
-- yields the error code 'eAGAIN' or 'eWOULDBLOCK', an alternative
-- action is executed before retrying.
--
throwErrnoIfRetryMayBlock
                :: (a -> Bool)  -- ^ predicate to apply to the result value
                                -- of the 'IO' operation
                -> String       -- ^ textual description of the location
                -> IO a         -- ^ the 'IO' operation to be executed
                -> IO b         -- ^ action to execute before retrying if
                                -- an immediate retry would block
                -> IO a
throwErrnoIfRetryMayBlock pred loc f on_block  =
  do
    res <- f
    if pred res
      then do
        err <- getErrno
        if err == eINTR
          then throwErrnoIfRetryMayBlock pred loc f on_block
          else if err == eWOULDBLOCK || err == eAGAIN
                 then do _ <- on_block
                         throwErrnoIfRetryMayBlock pred loc f on_block
                 else throwErrno loc
      else return res

I would investigate whether the accept call returns EAGAIN or EWOULDBLOCK. In that case the operation might be retried in a loop (if something is wrong with the IO manager) and burn a lot of CPU time.

bitemyapp commented Sep 3, 2016

@lpeterse makes sense. I've been able to reproduce the same hotspots on websockets 0.9.7.0 and 0.9.0.0, though 0.9.0.0 got about 25-30% further.

lpeterse commented Sep 3, 2016

I would try an strace (unless the speed decrease would make the problem disappear). See what the accept calls return.

bitemyapp commented Sep 3, 2016

@lpeterse

$ sudo strace -c -p 9238
[sudo] password for callen: 
strace: Process 9238 attached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.98    1.750553          50     35313     23489 futex
  0.02    0.000368           0     19546        26 rt_sigreturn
  0.00    0.000021           0      3755      1855 accept4
  0.00    0.000000           0         4           write
  0.00    0.000000           0         1           close
  0.00    0.000000           0       158           mmap
  0.00    0.000000           0       158           munmap
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0        31           rt_sigprocmask
  0.00    0.000000           0         4           sched_yield
  0.00    0.000000           0         1           kill
  0.00    0.000000           0        16           getrusage
  0.00    0.000000           0         1           timer_settime
  0.00    0.000000           0         1           timer_delete
  0.00    0.000000           0      1856           epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    1.750942                 60850     25370 total
lpeterse commented Sep 3, 2016

Okay, there's nothing wrong there. The accept is tried once and fails, then tried a second time after the IO manager signals socket readability. This is why half of all accept calls fail.

# ifdef HAVE_ACCEPT4
     new_sock <- throwSocketErrorIfMinus1RetryMayBlock "accept"
                        (threadWaitRead (fromIntegral s))
                        (c_accept4 s sockaddr ptr_len (#const SOCK_NONBLOCK))
lpeterse commented Sep 3, 2016

The time measured in this function is probably time spent in kernel space.

lpeterse commented Sep 3, 2016

What was the code you straced supposed to do? It neither reads, writes, nor closes a socket, does it?

bitemyapp commented Sep 3, 2016

I attached strace to the websocket server. I'm as confused as you are. Going to try to run it differently.

agocorona commented Sep 3, 2016

Have you tried Control.Concurrent.STM.TChan instead of Control.Concurrent.Broadcast?

It also broadcasts: every readTChan receives the message from a single writeTChan.

It should not lose messages, since it uses queues instead of TVars.

Alternatively: Control.Concurrent.Chan, which is deprecated in favour of TChan but may be faster.
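For readers unfamiliar with the pattern, a hedged sketch of the TChan broadcast being suggested; the client count and names are illustrative, and this is not code from either repository:

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM
import Control.Monad (forM_, forever)

-- Each client dups the broadcast channel, so a single writeTChan is seen
-- by every client's readTChan.
main :: IO ()
main = do
  chan <- newBroadcastTChanIO              -- write end; readable only via dups
  forM_ [1 .. 4 :: Int] $ \i -> do
    mine <- atomically (dupTChan chan)     -- per-client read end
    forkIO $ forever $ do
      msg <- atomically (readTChan mine)
      putStrLn ("client " ++ show i ++ " got " ++ msg)
  forM_ ["ping", "pong"] $ \msg -> do
    atomically (writeTChan chan msg)
    threadDelay 100000                     -- give the clients time to print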

bitemyapp commented Sep 3, 2016

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.96    1.161443          49     23768     14597 futex
  0.03    0.000396           0     11872        17 rt_sigreturn
  0.00    0.000031           0       683       671 open
  0.00    0.000013           0      2968      1468 accept4
  0.00    0.000010           0      1471         1 epoll_ctl
  0.00    0.000000           0         8           read
  0.00    0.000000           0         4           write
  0.00    0.000000           0        12           close
  0.00    0.000000           0       244       183 stat
  0.00    0.000000           0        10           fstat
  0.00    0.000000           0       100           mmap
  0.00    0.000000           0        19           mprotect
  0.00    0.000000           0        67           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         9           rt_sigaction
  0.00    0.000000           0        23           rt_sigprocmask
  0.00    0.000000           0        10        10 access
  0.00    0.000000           0         2           pipe
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         2           setsockopt
  0.00    0.000000           0         2           clone
  0.00    0.000000           0         1           execve
  0.00    0.000000           0        17           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0        12           getrusage
  0.00    0.000000           0         1           sysinfo
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           epoll_create
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           timer_create
  0.00    0.000000           0         1           timer_settime
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         2           eventfd2
------ ----------- ----------- --------- --------- ----------------
100.00    1.161893                 41320     16947 total

Here's a different strace run @lpeterse

bitemyapp commented Sep 3, 2016

@agocorona unagi-chan is pretty fast and isn't really showing up in the current profiling. The problem is not that we still have a broadcast method that drops messages. That's no longer true. Something weirder is going on that I do not yet understand.

I'll try TChan for giggles.

bitemyapp commented Sep 3, 2016

@agocorona Same results as the other implementation, but worse than unagi-chan (it died at 300 instead of ~1000-2000).

And the sync overhead actually showed up in the profiling.

readTChan                             Control.Concurrent.STM.TChan           2.4    0.0

Profile still dominated by sockets/serialization stuff.

lpeterse commented Sep 3, 2016

I suspect the threads doing the reading/writing have other PIDs. Does the problem persist with the single-threaded runtime? That would make debugging far easier, I guess.

bitemyapp commented Sep 3, 2016

Single-threaded:

clients:   500    95per-rtt:  30ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt:  31ms
clients:   600    95per-rtt:  32ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt: 2499ms
clients:   700    95per-rtt:  34ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt: 3341ms
clients:   800    95per-rtt:  26ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt: 3232ms
clients:   900    95per-rtt:  28ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt: 4428ms
clients:  1000    95per-rtt:  28ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt: 4428ms
2016/09/03 15:57:47 read tcp 127.0.0.1:51862->127.0.0.1:3000: read: connection reset by peer
Makefile:35: recipe for target 'gentle-bench' failed
$ strace -c hs-websocket-server
hs-websocket-server: file descriptor 1024 out of range for select (0--1024).
Recompile with -threaded to work around this.
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 73.19    0.054626           1     56925      5416 select
 25.34    0.018910           0    552021           writev
  0.66    0.000493           0     74039           rt_sigprocmask
  0.38    0.000280           0     30168      5487 rt_sigreturn
  0.31    0.000235           0     37021           getrusage
  0.05    0.000040           0      4059      2038 recvfrom
  0.03    0.000023           0      1873       852 accept4
  0.02    0.000016           0       681       671 open
  0.02    0.000015           0       244       183 stat
  0.00    0.000000           0         8           read
  0.00    0.000000           0         3           write
  0.00    0.000000           0        10           close
  0.00    0.000000           0        10           fstat
  0.00    0.000000           0       300           mmap
  0.00    0.000000           0        17           mprotect
  0.00    0.000000           0       269           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         9           rt_sigaction
  0.00    0.000000           0        10        10 access
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         2           setsockopt
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           sysinfo
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           timer_create
  0.00    0.000000           0         1           timer_settime
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.074638                757685     14657 total
agocorona commented Sep 3, 2016

I'm by no means an expert in performance, but 73% of the time in select or futex may not be good...

bitemyapp commented Sep 3, 2016

The chain that leads to it is:

makeSocketStream.send                         Network.WebSockets.Stream               16106           0    0.1    0.0    61.8    1.8
  sendAll                                      Network.Socket.ByteString.Lazy.Posix    16107           0    0.7    0.1    61.7    1.8
    sendAll.bs'                                 Network.Socket.ByteString.Lazy.Posix    16154     2087102    0.3    0.0     0.3    0.0
       send                                        Network.Socket.ByteString.Lazy.Posix    16108     2088945    1.9    0.3    60.6    1.8
          send.\                                     Network.Socket.ByteString.Lazy.Posix    16142     2088945    0.1    0.1    57.5    1.1
            send.withPokes                            Network.Socket.ByteString.Lazy.Posix    16143     2088945    0.8    0.2    57.4    1.0
              send.withPokes.loop                      Network.Socket.ByteString.Lazy.Posix    16144     4177890    1.1    0.0    56.7    0.8
                send.\.\                                Network.Socket.ByteString.Lazy.Posix    16151     2088945    0.3    0.1    54.1    0.6
                  throwSocketErrorWaitWrite              Network.Socket.Internal                 16152     2088945    0.2    0.0    53.8    0.5
                    throwSocketErrorIfMinus1RetryMayBlock Network.Socket.Internal                 16153     2088945   53.5    0.4    53.5    0.4
lpeterse commented Sep 3, 2016

@agocorona Why is this a bad thing? select is equivalent to "I've done all my homework and am waiting for instructions".

See here; they show what an idle Postgres postmaster process looks like:

root@dev:~# strace -c -p 11084
Process 11084 attached - interrupt to quit
Process 11084 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.59    0.001014          48        21           select
  2.89    0.000031           1        21           getppid
  2.52    0.000027           1        21           time
------ ----------- ----------- --------- --------- ----------------
100.00    0.001072                    63           total
root@dev:~# 
bitemyapp commented Sep 3, 2016

I made another broadcast version just to cross-check:

clients:  1000    95per-rtt:  47ms    min-rtt:   1ms    median-rtt:  22ms    max-rtt:  70ms
clients:  2000    95per-rtt:  53ms    min-rtt:   0ms    median-rtt:  27ms    max-rtt:  59ms
clients:  3000    95per-rtt:  67ms    min-rtt:   1ms    median-rtt:  35ms    max-rtt:  96ms
clients:  4000    95per-rtt:  93ms    min-rtt:   1ms    median-rtt:  47ms    max-rtt:  98ms
clients:  5000    95per-rtt: 117ms    min-rtt:   2ms    median-rtt:  60ms    max-rtt: 164ms
clients:  6000    95per-rtt: 149ms    min-rtt:   1ms    median-rtt:  72ms    max-rtt: 207ms
clients:  7000    95per-rtt: 167ms    min-rtt:  80ms    median-rtt:  85ms    max-rtt: 251ms
clients:  8000    95per-rtt: 192ms    min-rtt:   2ms    median-rtt:  97ms    max-rtt: 211ms
clients:  9000    95per-rtt: 228ms    min-rtt:   3ms    median-rtt: 111ms    max-rtt: 321ms
clients: 10000    95per-rtt: 240ms    min-rtt:   3ms    median-rtt: 119ms    max-rtt: 352ms

This is comparable with Rust; I don't know how to reproduce what I got last night.

Here's the profile:

throwSocketErrorIfMinus1RetryMayBlock Network.Socket.Internal        49.5    0.5
getSHA1Sched                          Data.Digest.Pure.SHA            7.9    0.2
encodeMessages.\                      Network.WebSockets.Hybi13       7.2   91.5
encodeFrame                           Network.WebSockets.Hybi13       5.3    1.6
broadcastThen.\                       Control.Concurrent.Broadcast    2.0    0.0
makeStream.send'                      Network.WebSockets.Stream       2.0    0.2
listen                                Control.Concurrent.Broadcast    1.8    0.3

Memory usage peaked at 35 MB.

bitemyapp commented Sep 3, 2016

If I use a strict ByteString, it looks like websockets just makes it a lazy one anyway.

agocorona commented Sep 3, 2016

@lpeterse yeah, I was taking it for Haskell profiling output. Anyway, that's right.

bitemyapp commented Sep 3, 2016

Using wai-websockets doesn't change anything.

throwSocketErrorIfMinus1RetryMayBlock Network.Socket.Internal         53.2    0.4
encodeMessages.\                      Network.WebSockets.Hybi13       11.5   94.2
encodeFrame                           Network.WebSockets.Hybi13        8.8    1.6
agocorona commented Sep 3, 2016

Might this (old) thread be relevant?

https://groups.google.com/forum/#!topic/yesodweb/rvR4uLUMi3k

In 2014 the websockets load was in attoparsec. Maybe this problem persists, since it still uses the same library for parsing.

I don't know whether the authors of websockets can shed light on the issue. I think there has been no serious benchmarking/optimization of this critical library, or at least I haven't heard about it... until now.

codygman commented Sep 3, 2016

It looks like using Michael's yesod-native-ws branch solved the performance problems there?

So that points another finger towards the websockets library?


bitemyapp commented Sep 4, 2016

https://github.com/bitemyapp/broadcast-bench is a broadcast bench harness.

I may have done it wrong, but based on what I know of unagi-chan's usual perf and what I saw in the websocket benchmark, I don't think UC is responsible for the slowness. Please feel free to validate.

A minimal-dependency implementation sounds like a good way to isolate the problem to me as well, although it might be some legwork to make it play nice with the Go client. Also, the one linked is going to be extremely slow: it's String-based.

Edit: And I have to make it concurrent and thread-safe. I'm going to try stuffing the socket connection into an MVar.
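As a rough illustration of the "connection in an MVar" idea (the names and helpers below are assumptions, not the repo's code), the shape would be to serialise all writers on a socket through withMVar:

import Control.Concurrent.MVar
import qualified Data.ByteString as BS
import Network.Socket (Socket)
import qualified Network.Socket.ByteString as NSB

-- Hold the socket behind an MVar so concurrent broadcasters take turns;
-- withMVar guarantees only one sendAll is in flight per connection.
type LockedSock = MVar Socket

newLockedSock :: Socket -> IO LockedSock
newLockedSock = newMVar

sendLocked :: LockedSock -> BS.ByteString -> IO ()
sendLocked lock bs = withMVar lock $ \sock -> NSB.sendAll sock bs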

bitemyapp commented Sep 4, 2016

That example isn't up to date with the WebSocket RFC; there's an upgrade request you have to reply to with the correct challenge.

You can get an idea of what the protocol is like here: https://github.com/jaspervdj/websockets/blob/6a70dbad8f5efd88c72f287d7a81a0a63cda7c0f/src/Network/WebSockets/Hybi13.hs

Note that implementing a complete, thread-safe websocket server is tantamount to implementing 2/3 of websockets 😛

bitemyapp commented Sep 4, 2016

I've dumped about 6 or 7 hours into this today, and a couple last night. Someone else will need to pick up the ball for this to go much further, unless the websockets or unagi-chan maintainers give me a lead to follow. I don't have time to find out why networking/websockets is so slow.

I will say that a lot of the other language runtimes compared are using epoll or similar, not select. I don't know if that would explain the difference or not. Branches are up and I can answer questions if anyone pursuing it has any.

DaveCTurner commented Sep 4, 2016

I'm trying to reproduce the output from #3 (comment) and in particular the output that reads:

hs-websocket-server: writev: resource vanished (Broken pipe)

I don't get these messages when the client is running, only once it's decided to bail out. Is that what you're seeing too? If so, I think that's to be expected since the client does not exit very gracefully: if the server continues to send data to the client after the client has exited, it gets an EPIPE.
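A hedged sketch of how a server can tolerate that EPIPE per client; safeSend is a made-up helper for illustration, not something in the repo:

import Control.Exception (IOException, try)
import qualified Data.Text as T
import qualified Network.WebSockets as WS

-- Catch the IOException raised when the peer has already gone away, so a
-- dead client doesn't take down the broadcasting thread.
safeSend :: WS.Connection -> T.Text -> IO ()
safeSend conn msg = do
  result <- try (WS.sendTextData conn msg) :: IO (Either IOException ())
  case result of
    Left _   -> return ()  -- EPIPE / connection reset: drop it silently
    Right () -> return ()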

DaveCTurner commented Sep 4, 2016

Also, re. epoll vs select, according to strace we're using epoll. I'm using commit cac9235.

bitemyapp commented Sep 4, 2016

@DaveCTurner no real mystery there; it's from running the non-gentle bench. It immediately hits a client count that breaks the SLA, then bails.

winterland1989 commented Sep 4, 2016

We may not need a chan at all; the other languages' code seems to just save all the connections into a linked list, map, queue, or whatever, so I think we can just:

  1. create an IORef [Connection] before we start the server, and pass it to the handler.
  2. every time we accept a new Connection c, run atomicModifyIORef csRef (\cs -> (c:cs, ())).
  3. a broadcast is then simply a forM_ over the list (see the sketch below).

I haven't set up the benchmark environment yet, but from my experience, Haskell's IO manager should be able to achieve at least half the performance of manually written epoll code.
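A minimal sketch of that shape, assuming Connection and sendTextData from the websockets package; the Clients alias, addClient, and broadcast names are illustrative, not the repo's API:

import Control.Monad (forM_)
import Data.IORef
import qualified Data.Text as T
import qualified Network.WebSockets as WS

-- All live connections in a single IORef'd list.
type Clients = IORef [WS.Connection]

-- Called once per accepted connection.
addClient :: Clients -> WS.Connection -> IO ()
addClient ref conn = atomicModifyIORef' ref (\cs -> (conn : cs, ()))

-- Broadcast is then just a traversal of the current list.
broadcast :: Clients -> T.Text -> IO ()
broadcast ref msg = do
  cs <- readIORef ref
  forM_ cs (`WS.sendTextData` msg)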

bitemyapp commented Sep 4, 2016

@winterland1989 the synchronization is not the slow part and what you're suggesting would be much slower.

@agocorona

agocorona commented Sep 4, 2016

I would like to ask @gdoteof about his experience with websockets in Haskell after:

https://groups.google.com/forum/#!topic/yesodweb/rvR4uLUMi3k

Was there some improvement that fixed the problem?

@winterland1989

winterland1989 commented Sep 4, 2016

No matter whether it's Chan or unagi-chan, they both use MVar to achieve blocking. Since an MVar is essentially a linked list of TSOs, and registering/waking up these TSOs in FIFO order takes time, I certainly think it will be no quicker than a manual traversal. But since synchronization is not the concern now, let's just keep the chan here.

@agocorona

agocorona commented Sep 4, 2016

Oops: looking at the comments on the issue I mentioned, as @codygman said, there is a Yesod native-ws branch that does not use the websockets library, but it has not been updated since 2014:

https://github.com/yesodweb/yesod/tree/native-ws

Mentioned here:
https://groups.google.com/d/msg/yesodweb/rvR4uLUMi3k/fO_umCe7kekJ

@DaveCTurner

DaveCTurner commented Sep 4, 2016

The references to throwSocketErrorIfMinus1RetryMayBlock above are a bit of a red herring as this function is just a simple wrapper that runs something else.

I added an SCC to the call to c_writev here [1] and it seems that this is where the time is being spent.

[1] https://github.com/haskell/network/blob/v2.6.3.1/Network/Socket/ByteString/Lazy/Posix.hs#L36

@bitemyapp

bitemyapp commented Sep 4, 2016

@DaveCTurner I thought about splitting out some functions for cost centers, thank you for finding this!

Any idea why c_writev is slow?

@DaveCTurner

DaveCTurner commented Sep 4, 2016

Not yet, sadly. Some notes on investigations...

Here's its declaration:

foreign import ccall unsafe "writev"
  c_writev :: CInt -> Ptr IOVec -> CInt -> IO CSsize

Here's the man page: http://linux.die.net/man/2/writev

In particular, unsafe means that when you call it, the capability running the thread just waits for the call to return; it doesn't shift its other threads elsewhere. But it should be nonblocking: the errors are as given for read(2) and write(2), which includes EAGAIN. I don't see any calls to writev returning EAGAIN in the strace output, which indicates it's fine.
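For illustration only (the network library does not actually define this): a hypothetical safe variant of the same import would release the capability for the duration of the call, so other Haskell threads could keep running, at the cost of a more expensive FFI transition per call. The iovec argument is left as an untyped pointer here because IOVec is internal to the network package.

import Foreign.C.Types (CInt, CSsize)
import Foreign.Ptr (Ptr)

-- Hypothetical 'safe' counterpart of the declaration above: the RTS may
-- schedule other work on this capability while the syscall is in flight.
foreign import ccall safe "writev"
  c_writevSafe :: CInt -> Ptr () -> CInt -> IO CSsize

Whether that trade-off would help here is an open question; unsafe avoids the FFI transition overhead but ties up the capability for the full duration of each call.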

@lpeterse

lpeterse commented Sep 4, 2016

I'm currently analyzing the websockets library. Here are some intermediate findings I want to share:

  • The single threaded RTS uses select while the multi threaded one uses epoll.
  • The rust program uses sendto while the websockets lib uses writev (as seen in strace).
  • The closures and abstractions in websockets hit performance (writing directly to the socket instead through Connection and Stream increased benchmark results from 2000 to 2300 clients).
  • Most memory allocation takes place when building the lazy ByteString here.
  • The flush here causes unnecessary allocation. Removing it reduced memory allocations spent on encodeMessages from 95% to 80% and further increases benchmark results to 2700 clients.
  • For comparison: The rust implementation handles 5300 clients on my machine.

I'll check in the code in a minute. You may try my changes to the websocket library, but be aware that it is far from beautiful. I broke nearly everything that is not relevant for the benchmark.

@DaveCTurner

DaveCTurner commented Sep 4, 2016

Also looking at strace output, I see 2868 calls to writev within the space of a second and the typical duration of each call is about 80us (eyeballed estimate) with a max at 260us. That's not 75% of the time but it's still quite a chunk - about 22%.

@DaveCTurner

DaveCTurner commented Sep 4, 2016

Also the size written each time is <80 bytes so that's a lot of bitty syscalls. Batching the low-level writes up would certainly help here.
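A minimal sketch of one way to batch, assuming the strict sendAll from Network.Socket.ByteString: flatten the lazy ByteString into a single strict chunk so each message costs one syscall rather than one per small chunk. This is illustrative only (and it does copy the data), not what the websockets library currently does:

import qualified Data.ByteString.Lazy as BL
import Network.Socket (Socket)
import qualified Network.Socket.ByteString as SB (sendAll)

-- Coalesce all chunks of a lazy ByteString and write them in one go.
sendCoalesced :: Socket -> BL.ByteString -> IO ()
sendCoalesced sock = SB.sendAll sock . BL.toStrict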

@lpeterse

lpeterse commented Sep 4, 2016

writev might be the culprit. It's not really necessary here, and assembling the necessary data structures might be far more expensive than a simple send. I'll try this next.

@codygman

codygman commented Sep 4, 2016

Re: Rust uses sendto instead of writev

If I recall correctly, sendto actually gives useful error messages if something goes wrong; maybe Rust takes advantage of that?

@lpeterse

lpeterse commented Sep 4, 2016

@codygman: I don't see anything failing. It's just slow.

@lpeterse

lpeterse commented Sep 4, 2016

Here is the fork of the benchmark I used: https://github.com/lpeterse/websocket-shootout/...
And here is my websocket fork: https://github.com/lpeterse/websockets.

@lpeterse

lpeterse commented Sep 4, 2016

Edit: Sorry, I mixed up Network.Socket.ByteString.Lazy with Network.Socket.ByteString.

Again, I'm really puzzled. I see in the strace that the system call is definitely writev (ignore the EPIPE):

writev(29, [{iov_base="\201A{\"payload\":{\"sendTime\":\"147301"..., iov_len=67}], 1) = 67
writev(29, [{iov_base="\201A{\"payload\":{\"sendTime\":\"147301"..., iov_len=67}], 1) = -1 EPIPE (Broken pipe)
writev(70, [{iov_base="\201A{\"payload\":{\"sendTime\":\"147301"..., iov_len=67}], 1) = 67
writev(70, [{iov_base="\201A{\"payload\":{\"sendTime\":\"147301"..., iov_len=67}], 1) = -1 EPIPE (Broken pipe)
writev(19, [{iov_base="\201A{\"payload\":{\"sendTime\":\"147301"..., iov_len=67}], 1) = 67
writev(19, [{iov_base="\201A{\"payload\":{\"sendTime\":\"147301"..., iov_len=67}], 1) = -1 EPIPE (Broken pipe)
writev(34, [{iov_base="\201A{\"payload\":{\"sendTime\":\"147301"..., iov_len=67}], 1) = 67
writev(34, [{iov_base="\201A{\"payload\":{\"sendTime\":\"147301"..., iov_len=67}], 1) = -1 EPIPE (Broken pipe)

In the current socket implementation, sendAll is used to write to the socket:

import qualified Network.Socket.ByteString.Lazy as SBL (sendAll)

In the network library, sendAll calls send, which calls sendBuf, which looks like this:

sendBuf :: Socket     -- Bound/Connected Socket
        -> Ptr Word8  -- Pointer to the data to send
        -> Int        -- Length of the buffer
        -> IO Int     -- Number of Bytes sent
sendBuf sock@(MkSocket s _family _stype _protocol _status) str len = do
   liftM fromIntegral $
#if defined(mingw32_HOST_OS)
...
#else
     throwSocketErrorWaitWrite sock "sendBuf" $
        c_send s str (fromIntegral len) 0{-flags-}
#endif

and

foreign import CALLCONV unsafe "send"
  c_send :: CInt -> Ptr a -> CSize -> CInt -> IO CInt

If I'm interpreting the network library correctly, vectored IO can only be done through sendMany and sendManyTo. So where does the writev come from?
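One possible answer, based only on the Lazy/Posix module @DaveCTurner linked earlier: the lazy interface is the one that goes through writev (it hands the chunk list to the kernel as an iovec), while the strict interface quoted above bottoms out in c_send. So which of the two sendAll functions the code imports decides whether strace shows writev or send/sendto. A side-by-side sketch with made-up helper names:

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Network.Socket (Socket)
import qualified Network.Socket.ByteString as SB (sendAll)        -- strict: c_send
import qualified Network.Socket.ByteString.Lazy as SBL (sendAll)  -- lazy: c_writev over the chunks

sendStrict :: Socket -> B.ByteString -> IO ()
sendStrict = SB.sendAll   -- shows up as send/sendto in strace

sendLazy :: Socket -> BL.ByteString -> IO ()
sendLazy = SBL.sendAll    -- shows up as writev in strace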

@codygman

codygman commented Sep 4, 2016

@lpeterse Not sure if related, but you should probably be aware of this other conversation, namely bug 367:

it's possible you have non-allocating loops: https://ghc.haskell.org/trac/ghc/ticket/367 (e.g. a reader that always lags a writer and never blocks; maybe that "extraneous reader").

It looks like we could quickly see if this issue is affecting us by using -fno-omit-yields.
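A quick way to try it, assuming the flag is worth testing at all: set it per module (or add -fno-omit-yields to ghc-options) and see whether the stalls change. A minimal sketch, with a hypothetical module name:

-- Keep heap-check yield points even in loops that never allocate, so
-- tight non-allocating loops can still be preempted by the scheduler.
{-# OPTIONS_GHC -fno-omit-yields #-}
module Server where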

@bitemyapp

bitemyapp commented Sep 4, 2016

@codygman having seen the gamut of source code in what we're analyzing, I would be hard pressed to find any non-allocating loops. Almost everything is in *. I didn't see anything to indicate much time was spent blocking on unagi-chan.

I think @lpeterse and @DaveCTurner are on to something with the networking and websockets code.

@lpeterse

lpeterse commented Sep 4, 2016

Now this is interesting (I just want to share it; please give me a reason not to believe it):

Out of curiosity, I added a print statement at the end of each message send (I assume this affects the RTS scheduling by keeping it busy):

send :: Connection -> Message -> IO ()
send conn msg = do
    case msg of
        (ControlMessage (Close _ _)) ->
            writeIORef (connectionSentClose conn) True
        _ -> return ()
    forM_ (BL.toChunks $ Hybi13.encodeMessage' msg) (\a-> SB.sendAll (connectionSocket conn) a)
    print "1"

Then I started the server and piped the result to lines.txt:

./dist/build/hs-websocket-server/hs-websocket-server > lines.txt

I started the benchmark:

% make bench                                                                                                                                                                                                 
bin/websocket-bench broadcast ws://127.0.0.1:3000/ws --concurrent 10 --sample-size 100 --step-size 1000 --limit-percentile 95 --limit-rtt 250ms
clients:  1000    95per-rtt:  56ms    min-rtt:   1ms    median-rtt:   1ms    max-rtt:  57ms
clients:  2000    95per-rtt:  39ms    min-rtt:   1ms    median-rtt:   1ms    max-rtt:  39ms
clients:  3000    95per-rtt:  33ms    min-rtt:   1ms    median-rtt:   1ms    max-rtt:  33ms
clients:  4000    95per-rtt:  26ms    min-rtt:   1ms    median-rtt:   1ms    max-rtt:  26ms
clients:  5000    95per-rtt:  35ms    min-rtt:   1ms    median-rtt:   1ms    max-rtt:  36ms
clients:  6000    95per-rtt:  39ms    min-rtt:   1ms    median-rtt:   1ms    max-rtt:  40ms
clients:  7000    95per-rtt:  41ms    min-rtt:   1ms    median-rtt:   1ms    max-rtt:  41ms
clients:  8000    95per-rtt:  35ms    min-rtt:   1ms    median-rtt:   1ms    max-rtt:  36ms
clients:  9000    95per-rtt:  33ms    min-rtt:   1ms    median-rtt:   1ms    max-rtt:  34ms
clients: 10000    95per-rtt:  16ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt:  16ms
clients: 11000    95per-rtt:  15ms    min-rtt:   0ms    median-rtt:   1ms    max-rtt:  15ms
clients: 12000    95per-rtt:  58ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt:  68ms
clients: 13000    95per-rtt:  16ms    min-rtt:   0ms    median-rtt:   1ms    max-rtt:  16ms
clients: 14000    95per-rtt:  16ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt:  16ms
clients: 15000    95per-rtt:  12ms    min-rtt:   0ms    median-rtt:   1ms    max-rtt:  12ms
clients: 16000    95per-rtt:  49ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt:  49ms
clients: 17000    95per-rtt:  15ms    min-rtt:   0ms    median-rtt:   1ms    max-rtt:  15ms
clients: 18000    95per-rtt:  15ms    min-rtt:   0ms    median-rtt:   1ms    max-rtt:  16ms
clients: 19000    95per-rtt:  13ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt:  14ms
clients: 20000    95per-rtt:   9ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt:   9ms
clients: 21000    95per-rtt:  14ms    min-rtt:   0ms    median-rtt:   1ms    max-rtt:  14ms
clients: 22000    95per-rtt:  15ms    min-rtt:   0ms    median-rtt:   1ms    max-rtt:  15ms
clients: 23000    95per-rtt:  14ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt:  14ms
clients: 24000    95per-rtt:  13ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt:  13ms
clients: 25000    95per-rtt:   9ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt:   9ms
clients: 26000    95per-rtt:  16ms    min-rtt:   0ms    median-rtt:   1ms    max-rtt:  16ms
clients: 27000    95per-rtt:  10ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt:  11ms
clients: 28000    95per-rtt:  10ms    min-rtt:   0ms    median-rtt:   0ms    max-rtt:  10ms
2016/09/04 21:44:40 dial tcp 127.0.0.1:3000: connect: cannot assign requested address
make: *** [Makefile:32: bench] Error 1
make bench  56.30s user 46.48s system 205% cpu 50.069 total

I attached to one of the servers processes to see if it actually sends something (lines truncated):

% sudo strace -p 15493 2>&1 | grep send
sendto(1144, "\201A{\"payload\":{\"sendTime\":\"147301"..., 67, 0, NULL, 0) = 67
sendto(1145, "\201A{\"payload\":{\"sendTime\":\"147301"..., 67, 0, NULL, 0) = 67
sendto(1141, "\201A{\"payload\":{\"sendTime\":\"147301"..., 67, 0, NULL, 0) = 67
sendto(1142, "\201A{\"payload\":{\"sendTime\":\"147301"..., 67, 0, NULL, 0) = 67
sendto(1143, "\201A{\"payload\":{\"sendTime\":\"147301"..., 67, 0, NULL, 0) = 67
sendto(1140, "\201A{\"payload\":{\"sendTime\":\"147301"..., 67, 0, NULL, 0) = 67
sendto(1137, "\201A{\"payload\":{\"sendTime\":\"147301"..., 67, 0, NULL, 0) = 67
sendto(1139, "\201A{\"payload\":{\"sendTime\":\"147301"..., 67, 0, NULL, 0) = 67
sendto(1138, "\201A{\"payload\":{\"sendTime\":\"147301"..., 67, 0, NULL, 0) = 67
sendto(1135, "\201A{\"payload\":{\"sendTime\":\"147301"..., 67, 0, NULL, 0) = 67
sendto(1136, "\201A{\"payload\":{\"sendTime\":\"147301"..., 67, 0, NULL, 0) = 67
sendto(1134, "\201A{\"payload\":{\"sendTime\":\"147301"..., 67, 0, NULL, 0) = 67
^C

Now let's see lines.txt:

% head lines.txt
"1"
"1"
"1"
"1"
"1"
"1"
"1"
"1"
"1"
"1"
% wc -l lines.txt
3200897 lines.txt

How many messages should have been sent? Is 3.2 M correct? What is the benchmark waiting for before it increases the number of clients? I don't see where I could be dropping messages. It may just be that the benchmark proceeds before my program has had a chance to work off all the queues. Memory consumption is indeed high:

            3858 MB total memory in use (0 MB lost due to fragmentation)

I'll investigate how many messages are pending per socket when the sockets get closed (not before tomorrow).

@DaveCTurner

DaveCTurner commented Sep 4, 2016

From my (admittedly limited) understanding of Go, it prints each line after performing sample-size (i.e. 100) broadcasts. So I think the first line is printed after 100001 messages (100 samples × 1000 clients, plus the single success message) and the second after a further 200001, etc. Thus the last line gets printed after about 28 × 29 / 2 × 100000 messages, which is about 40M. So 3.2M seems very low.
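Spelling that estimate out (a quick GHCi check, using 100 samples per step and n × 1000 clients at step n):

ghci> sum [ 100 * (n * 1000) | n <- [1..28] ]
40600000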

@agocorona

agocorona commented Sep 4, 2016

@lpeterse That result is with your fork of websockets without writev?

@gdoteof

gdoteof commented Sep 4, 2016

@agocorona

@snoyberg pushed some changes that did improve things quite a bit, but at the time it was still faster (for my purposes) to just use GET requests. This was probably because there are a ton of optimizations in Warp that were not included at all in the websocket code I was using.

It's been a couple years now, and I haven't used websockets in yesod since--though I would honestly be very surprised if things have not improved dramatically.

I would reach out to Greg Weber or Michael Snoyman directly for their comments on it.

@lpeterse

lpeterse commented Sep 4, 2016

@DaveCTurner Thanks for the insight. That helps. I think the success message might be the key. I should probably not send it until I have broadcast the messages to all clients.
@agocorona Yes, it is. I haven't pushed the one without writev yet. Sorry. Will do that tomorrow.

@DaveCTurner

DaveCTurner commented Sep 5, 2016

@lpeterse the constraints on the order in which you send messages seem to be quite weak. You have to send the broadcast back to the client that requested it before you report success, but I think you can send the success message before you send all the other messages if you want and still satisfy the spec. Calling sendTo or writev or similar shuffles the data into the kernel but there's all sorts of other queues and delays before it finally gets all the way to the client. Concurrency is hard!

@bartavelle

bartavelle commented Sep 5, 2016

Just a data point: I built the @bitemyapp fork and ran it locally. It's only twice as slow as Go, so it's not that slow!

@bartavelle

bartavelle commented Sep 5, 2016

As always with Haskell, you can't just use -N if you have many cores:

-N2 clients:  9000    95per-rtt: 395ms    min-rtt:   7ms    median-rtt: 219ms    max-rtt: 455ms
-N4 clients:  9000    95per-rtt: 468ms    min-rtt:   0ms    median-rtt: 157ms    max-rtt: 520ms
-N6 clients:  9000    95per-rtt: 468ms    min-rtt:   0ms    median-rtt: 152ms    max-rtt: 505ms
-N8 clients:  9000    95per-rtt: 410ms    min-rtt:  41ms    median-rtt: 173ms    max-rtt: 519ms
-N12 (too slow)

Go version for comparison:

clients:  9000    95per-rtt: 215ms    min-rtt: 112ms    median-rtt: 165ms    max-rtt: 221ms

The Haskell version seems to have a lot more variance than the Go version, which isn't a surprise to me.

@bartavelle

bartavelle commented Sep 5, 2016

Also, while I'm thinking about it: the strace bench, like all Haskell benches, distorts reality. I just tried benchmarking with perf, but for some reason it doesn't work too well with the DWARF support. Sigh.

@bartavelle

bartavelle commented Sep 5, 2016

For me, when testing with the client and server on the same computer, your version works better than the Go version with -N6 -A2G -M5G. It's still jittery, because there is no fine-grained control over the GC behavior. I'd suggest increasing performance by reducing the load on the GC ... but that's nothing new.

@codygman

codygman commented Sep 5, 2016

@bartavelle The Go version also doesn't have fine-grained control over the GC, so why isn't it jittery?


@bartavelle

bartavelle commented Sep 5, 2016

I don't know much about Go, so I can't tell. I might be full of shit, as those are basically coffee break observations. I am however used to tuning the JVM, and miss that kind of control with GHC.

@codygman

codygman commented Sep 5, 2016

@bartavelle

Tactically, we will use a hybrid stop the world (STW) / concurrent garbage collector (CGC). The STW piece will limit the amount of time goroutines are stopped to less than 10 milliseconds out of every 50 milliseconds. If the GC completes a cycle in this time frame, great. If not, the GC will transition into a concurrent GC for the remainder of the 50 millisecond block. This process will repeat until the GC cycle completes. As a practical matter, if one has a 50 millisecond response quality of service (QOS) requirement, one should expect to have 40 milliseconds in which to do mutator tasks. These numbers assume hardware equivalent to a generic $1000 desktop box running Linux.


@bartavelle

bartavelle commented Sep 6, 2016

Here are the flamegraphs for the Go and Haskell versions. I can't build the C++ or Rust versions on my computer. Also, on my computer the Haskell version is faster than the Go one, even though it sometimes stutters:

clients:  1000  expected: 40000  rcvd: 40000  95per-rtt:  93ms  min-rtt:   0ms  median-rtt:  61ms  max-rtt: 220ms
clients:  2000  expected: 80000  rcvd: 80000  95per-rtt: 222ms  min-rtt:   0ms  median-rtt:  75ms  max-rtt: 223ms
clients:  3000  expected: 120000  rcvd: 120000  95per-rtt:  72ms  min-rtt:   0ms  median-rtt:  71ms  max-rtt:  72ms
clients:  4000  expected: 160000  rcvd: 160000  95per-rtt: 151ms  min-rtt:   2ms  median-rtt: 149ms  max-rtt: 592ms
clients:  5000  expected: 200000  rcvd: 200000  95per-rtt:  97ms  min-rtt:   0ms  median-rtt:  96ms  max-rtt:  97ms
clients:  6000  expected: 240000  rcvd: 240000  95per-rtt: 284ms  min-rtt:   0ms  median-rtt: 277ms  max-rtt: 284ms
clients:  7000  expected: 280000  rcvd: 280000  95per-rtt: 354ms  min-rtt:   0ms  median-rtt: 103ms  max-rtt: 354ms
clients:  8000  expected: 320000  rcvd: 320000  95per-rtt: 208ms  min-rtt:   0ms  median-rtt: 206ms  max-rtt: 1127ms
clients:  9000  expected: 360000  rcvd: 360000  95per-rtt: 401ms  min-rtt:   0ms  median-rtt: 395ms  max-rtt: 402ms
clients: 10000  expected: 400000  rcvd: 400000  95per-rtt: 520ms  min-rtt:   0ms  median-rtt: 519ms  max-rtt: 520ms
clients: 11000  expected: 440000  rcvd: 440000  95per-rtt: 248ms  min-rtt:   0ms  median-rtt: 209ms  max-rtt: 248ms

They are not terribly dissimilar, so I suspect the Haskell code performance is alright.

flamegraphs.zip
