Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang during multi-worker crash + recovery #3109

Open
slfritchie opened this issue Jan 30, 2020 · 0 comments
Open

Hang during multi-worker crash + recovery #3109

slfritchie opened this issue Jan 30, 2020 · 0 comments

Comments

@slfritchie
Copy link
Contributor

Is this a bug, feature request, or feedback?

Bug

What is the current behavior?

Hang during multi-worker crash + recovery. See master-crasher.sh command below to run.

Last messages from each worker that match the regexp --|~~|RECOVERY|Recovery|recovery:

initializer:
1580343036.556997,UNEXPECTED CALL to add_expected_boundary_count on recovery reconnector phase Not Recovery Reconnecting Phase. Ignoring!

worker1:
1580343036.555562,UNEXPECTED CALL to add_expected_boundary_count on recovery reconnector phase Not Recovery Reconnecting Phase. Ignoring!

worker2:
1580343036.768260,|~~ - Recovery initiated at worker5. Ceding control. - ~~|
1580343036.768268,_RecoveryPhase transition to _NotRecovering
1580343036.768335,Sent control message to worker5: AckRecoveryInitiatedMsg
1580343036.768343,Recovering worker: Skipping Phase III
1580343036.768347,|~~ INIT PHASE IV: Cluster is ready to work! ~~|

worker3:
1580343036.558269,UNEXPECTED CALL to add_expected_boundary_count on recovery reconnector phase Not Recovery Reconnecting Phase. Ignoring!

worker4:
1580343036.571309,UNEXPECTED CALL to add_expected_boundary_count on recovery reconnector phase Not Recovery Reconnecting Phase. Ignoring!

worker5:
1580343036.771454,Received msg on Control Channel: AckRecoveryInitiatedMsg

Full logs are available at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1580344129.tar.gz

What is the expected behavior?

Successful restart after multi-worker crash & restart

What OS and version of Wallaroo are you using?

Ubuntu 16.04 LTS/Xenial and master branch @ commit master

Steps to reproduce?

cd /path/to/wallaroo/testing/correctness/scripts/effectively-once
make -C ../../../.. resilience=on  PONYCFLAGS="--verbose=1 -d -Dtrace -Dcheckpoint_trace -Didentify_routing_ids"     build-examples-pony-passthrough build-testing-tools-external_sender    
./master-crasher.sh 6 crash5 crash{0,1,2,3,4}.slow crash-sink no-ack-progress

I see this error happening every 1-3 hours, so it's rare. If you're short on disk space, then I recommend running the following to reduce the # of log files that consume space in /tmp.

while [ 1 ]; do echo -n `date` "" ; df -h /tmp; ls -t /tmp/wallaroo.*gz | sed 1,20d | xargs rm -f ; sleep 15; done
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant