
CRDT Slower as workers are added. #5

Closed · gatoWololo opened this issue Jun 19, 2020 · 4 comments

@gatoWololo

Following up on TimelyDataflow/differential-dataflow#273 with a more concrete example. CRDT seems to exhibit particularly poor scaling; adding workers results in worse run times:

fish> for i in (seq 8)
           echo time ./target/release/crdt ../problems/crdt/input/ -w$i
           time ./target/release/crdt ../problems/crdt/input/ -w$i
      end
time ./target/release/crdt ../problems/crdt/input/ -w1
result: ((), 0, 104851)
2.655789383s	complete

________________________________________________________
Executed in    2.66 secs   fish           external 
   usr time    2.60 secs  246.00 micros    2.60 secs 
   sys time    0.06 secs  100.00 micros    0.06 secs 

time ./target/release/crdt ../problems/crdt/input/ -w2
result: ((), 0, 104851)
2.949718895s	complete
2.949774511s	complete

________________________________________________________
Executed in    2.96 secs   fish           external 
   usr time    5.67 secs  237.00 micros    5.67 secs 
   sys time    0.16 secs   96.00 micros    0.16 secs 

time ./target/release/crdt ../problems/crdt/input/ -w3
result: ((), 0, 104851)
3.068952142s	complete
3.069093637s	complete
3.069253361s	complete

________________________________________________________
Executed in    3.08 secs   fish           external 
   usr time    8.74 secs  223.00 micros    8.73 secs 
   sys time    0.28 secs   91.00 micros    0.28 secs 

time ./target/release/crdt ../problems/crdt/input/ -w4
result: ((), 0, 104851)
3.104697076s	complete
3.104720353s	complete
3.104779249s	complete
3.104817709s	complete

________________________________________________________
Executed in    3.11 secs   fish           external 
   usr time   11.73 secs  217.00 micros   11.73 secs 
   sys time    0.45 secs   87.00 micros    0.45 secs 

time ./target/release/crdt ../problems/crdt/input/ -w5
result: ((), 0, 104851)
3.337570417s	complete
3.337584961s	complete
3.337652864s	complete
3.337643363s	complete
3.337674563s	complete

________________________________________________________
Executed in    3.35 secs   fish           external 
   usr time   15.77 secs  214.00 micros   15.77 secs 
   sys time    0.52 secs   87.00 micros    0.52 secs 

time ./target/release/crdt ../problems/crdt/input/ -w6
result: ((), 0, 104851)
3.468785548s	complete
3.468819104s	complete
3.468858908s	complete
3.468944907s	complete
3.469012034s	complete
3.469172189s	complete

________________________________________________________
Executed in    3.48 secs   fish           external 
   usr time   19.68 secs  217.00 micros   19.68 secs 
   sys time    0.62 secs   88.00 micros    0.62 secs 

time ./target/release/crdt ../problems/crdt/input/ -w7
result: ((), 0, 104851)
3.685714758s	complete
3.685744579s	complete
3.685856582s	complete
3.685880433s	complete
3.685882302s	complete
3.685883169s	complete
3.686006777s	complete

________________________________________________________
Executed in    3.70 secs   fish           external 
   usr time   24.45 secs  222.00 micros   24.45 secs 
   sys time    0.70 secs   89.00 micros    0.70 secs 

time ./target/release/crdt ../problems/crdt/input/ -w8
result: ((), 0, 104851)
3.846975408s	complete
3.846992356s	complete
3.847061956s	complete
3.847267346s	complete
3.847285809s	complete
3.847296428s	complete
3.847290407s	complete
3.847402402s	complete

________________________________________________________
Executed in    3.86 secs   fish           external 
   usr time   28.90 secs  216.00 micros   28.90 secs 
   sys time    1.04 secs   88.00 micros    1.04 secs 

Looking at some perf flame graphs of one worker versus eight workers:

One worker: [flame graph image]

Eight workers: [flame graph image]

For eight workers, it seems a lot of time is spent in step_or_park without actually stepping. Instrumenting the advance function with an atomic counter:

  > time cargo run --release --bin crdt -- ../problems/crdt/input/ -w1
  3.023245458s    complete
  Calls to advance: 65270
  > time cargo run --release --bin crdt -- ../problems/crdt/input/ -w8
  225.56567529s   complete
  Calls to advance: 4043775664

I'll try to figure out why advance is being called so much with eight workers.
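
(For reference, a minimal sketch of this kind of counter instrumentation, using only the standard library; the actual patch may differ:)

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide call counter; Relaxed ordering suffices since we only
// need a final total, not synchronization.
static ADVANCE_CALLS: AtomicU64 = AtomicU64::new(0);

fn advance() {
    // Bump the counter on every call, then run the original body.
    ADVANCE_CALLS.fetch_add(1, Ordering::Relaxed);
    // ... original body of `advance` ...
}

fn main() {
    for _ in 0..65270 {
        advance();
    }
    println!("Calls to advance: {}", ADVANCE_CALLS.load(Ordering::Relaxed));
}
```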

@frankmcsherry (Owner)

That sounds like a great plan! If it is helpful, the timely logging infrastructure produces streams of scheduling events, message send and receipt events, and the like; these can help determine what is on the critical path. For inspiration, maybe check out

https://github.com/MaterializeInc/materialize/tree/master/src/dataflow/src/logging
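
As a sketch of what hooking into that infrastructure can look like (modeled on timely's own logging examples; the registration call and event type may differ across versions):

```rust
use timely::logging::TimelyEvent;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // Route timely's scheduling and communication events to stderr.
        worker.log_register().insert::<TimelyEvent, _>(
            "timely",
            |_time, data| {
                for (ts, worker_index, event) in data.drain(..) {
                    eprintln!("{:?}\tworker {}\t{:?}", ts, worker_index, event);
                }
            },
        );
        // ... build and run the dataflow as usual ...
    })
    .unwrap();
}
```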

The crdt example is definitely a bit mysterious: if you slowly play out the updates, rather than in one batch, it takes quite a bit longer. My guess is that there is a nice big global aggregation that ends up doing and re-doing a bunch of work.
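
To make "play out slowly" concrete, here is a hypothetical sketch (not the crdt program itself) contrasting one big batch with per-update timestamps; the latter forces a global aggregate to be revised after every single change:

```rust
use differential_dataflow::input::Input;
use differential_dataflow::operators::Count;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut input = worker.dataflow::<u32, _, _>(|scope| {
            let (handle, data) = scope.new_collection::<u64, isize>();
            // A global aggregation whose output changes with every update.
            data.map(|_| ()).count().inspect(|x| println!("{:?}", x));
            handle
        });

        // One batch: all updates at a single timestamp; the count is
        // revised once.
        //   for x in 0..1000u64 { input.insert(x); }
        //   input.advance_to(1);

        // Played out slowly: one timestamp per update; the count is
        // revised a thousand times.
        for x in 0..1000u64 {
            input.insert(x);
            input.advance_to(x as u32 + 1);
            input.flush();
        }
    })
    .unwrap();
}
```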

@gatoWololo (Author)

Thanks for the tips!

I'll look into it.

@frankmcsherry (Owner) commented Aug 4, 2021

The reason for this seems to be that the program as written contains two iterative scopes that need to perform many thousands of iterations. The ancestor collapsing takes over 7000 iterations, and the "blank star" collapsing takes over 13000 iterations. The control flow aspects of these iterations take some time (tens of microseconds each, it seems) and they are inherently sequential rather than parallel.

The problem can be fixed by using a different algorithm for these stages. If you collapse these paths using an iterated contraction algorithm, as in the crdt_improvements branch, they take tens of iterations and the running times look like (on my desktop):

threads    time
1          965.782789ms
2          539.08139ms
4          302.270496ms
8          221.848203ms

So, certainly some better scaling here, and just generally better performance as well (the overhead of the loops is large, even ignoring the lack of scaling).
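
As a rough illustration of the contraction idea (hypothetical; not the code on the crdt_improvements branch): instead of chasing one parent pointer per iteration, join the pointer relation with itself so each node's parent is replaced by its grandparent, roughly halving path lengths each round and converging in logarithmically many iterations:

```rust
use differential_dataflow::input::Input;
use differential_dataflow::operators::{Iterate, Join, Threshold};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut input = worker.dataflow::<u32, _, _>(|scope| {
            // (child, parent) pointers; roots point at themselves.
            let (handle, parents) = scope.new_collection::<(u64, u64), isize>();

            // Pointer doubling: replace each parent with the grandparent
            // until a fixed point, so paths collapse in O(log depth)
            // iterations instead of O(depth).
            parents
                .iterate(|inner| {
                    inner
                        .map(|(child, parent)| (parent, child))
                        .join(inner)
                        .map(|(_p, (child, grandparent))| (child, grandparent))
                        .distinct()
                })
                .inspect(|x| println!("collapsed: {:?}", x));

            handle
        });

        // A toy chain: 1 -> 2 -> 3, with 3 as the root.
        for (c, p) in [(1u64, 2u64), (2, 3), (3, 3)] {
            input.insert((c, p));
        }
    })
    .unwrap();
}
```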

@frankmcsherry (Owner)

I'm going to close this out, as I believe the mystery has been resolved!
