Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Geth won't shutdown properly when started from a systemd file. #24416

Closed
Fillicia opened this issue Feb 17, 2022 · 6 comments
Closed

Geth won't shutdown properly when started from a systemd file. #24416

Fillicia opened this issue Feb 17, 2022 · 6 comments

Comments

@Fillicia
Copy link

System information

Geth version: 1.10.15-stable
OS & Version: Ubuntu 21.10 server
Commit hash : latest

Expected behaviour

Clean shutdown with SIGTERM or SIGINT while using systemd

Actual behaviour

After using systemctl stop geth the service start stopping and then search for peers instead of fully shutting down. It doesn't show as an unclean shutdown when started again but it'll resync from ~2 days prior with the message Head state missing, repairing

Backtrace

Feb 17 02:43:52 web3 geth[1527916]: INFO [02-17|02:43:52.785] Got interrupt, shutting down...
Feb 17 02:43:52 web3 systemd[1]: Stopping Ethereum go client...
Feb 17 02:43:52 web3 geth[1527916]: INFO [02-17|02:43:52.787] HTTP server stopped                      endpoint=[::]:8745
Feb 17 02:43:52 web3 geth[1527916]: INFO [02-17|02:43:52.789] HTTP server stopped                      endpoint=[::]:8746
Feb 17 02:43:52 web3 geth[1527916]: INFO [02-17|02:43:52.789] IPC endpoint closed                      url=/eth/geth.ipc
Feb 17 02:43:53 web3 geth[1527916]: INFO [02-17|02:43:53.237] Ethereum protocol stopped
Feb 17 02:43:53 web3 geth[1527916]: INFO [02-17|02:43:53.237] Transaction pool stopped
Feb 17 02:43:53 web3 geth[1527916]: ERROR[02-17|02:43:53.417] Snapshot extension registration failed   peer=b2dad9fe err="peer connected on snap without compatible eth support"
Feb 17 02:44:02 web3 geth[1527916]: INFO [02-17|02:44:02.817] Looking for peers                        peercount=0 tried=63 static=0
Feb 17 02:44:12 web3 geth[1527916]: ERROR[02-17|02:44:12.644] Snapshot extension registration failed   peer=76785d7b err="peer connected on snap without compatible eth support"
Feb 17 02:44:12 web3 geth[1527916]: INFO [02-17|02:44:12.929] Looking for peers                        peercount=0 tried=131 static=0
Feb 17 02:44:18 web3 geth[1527916]: INFO [02-17|02:44:18.043] Deep froze chain segment                 blocks=2     elapsed=41.774ms    number=14,131,004 hash=c02ead..90c11c
Feb 17 02:44:22 web3 geth[1527916]: INFO [02-17|02:44:22.943] Looking for peers                        peercount=0 tried=82  static=0
Feb 17 02:44:33 web3 geth[1527916]: INFO [02-17|02:44:33.042] Looking for peers                        peercount=0 tried=131 static=0
Feb 17 02:44:41 web3 geth[1527916]: ERROR[02-17|02:44:41.058] Snapshot extension registration failed   peer=8f3a7071 err="peer connected on snap without compatible eth support"
Feb 17 02:44:43 web3 geth[1527916]: INFO [02-17|02:44:43.280] Looking for peers                        peercount=0 tried=95  static=0
Feb 17 02:44:43 web3 geth[1527916]: ERROR[02-17|02:44:43.480] Snapshot extension registration failed   peer=b9d0af37 err="peer connected on snap without compatible eth support"
Feb 17 02:44:53 web3 geth[1527916]: INFO [02-17|02:44:53.314] Looking for peers                        peercount=0 tried=100 static=0
Feb 17 02:45:03 web3 geth[1527916]: INFO [02-17|02:45:03.838] Looking for peers                        peercount=1 tried=54  static=0
Feb 17 02:45:11 web3 geth[1527916]: ERROR[02-17|02:45:11.876] Snapshot extension registration failed   peer=635d0ba6 err="peer connected on snap without compatible eth support"

and it goes on like this until a timeout is reached

systemd file:

[Unit]
Description=Ethereum go client
After=network.target

[Service]
User=web3
Group=web3
Type=simple
Restart=always
RestartSec=30s
ExecStart=/home/web3/geth/build/bin/geth \
           --cache 51200 \
           --metrics \
           --metrics.influxdbv2 \
           --metrics.influxdb.token "xxxxxxxxxxxxxxxxxxxxxxxxxx" \
           --metrics.influxdb.bucket geth \
           --config /home/web3/geth/build/bin/config.toml
KillSignal=SIGTERM 
TimeoutStopSec=600


[Install]
WantedBy=default.target

@holiman
Copy link
Contributor

holiman commented Feb 17, 2022

Geth might need quite a bit of time to shut down, but looking at the docs here, I think TimeoutStopSec=600 should give it ample time (10 minutes) to shut down.

If this is a shutdown-hang which eventually is hard-killed, there should be a stacktrace. I don't suppose you have any such stacktrace somewhere?

@holiman
Copy link
Contributor

holiman commented Feb 17, 2022

and it goes on like this until a timeout is

Please post the logs for what happens around that time

@holiman
Copy link
Contributor

holiman commented Feb 17, 2022

If terminate is set the service will be gracefully terminated by sending the signal specified in KillSignal= (defaults to SIGTERM, see systemd.kill(5)). If the service does not terminate the FinalKillSignal= is sent after TimeoutStopSec=. If abort is set, WatchdogSignal= is sent instead and TimeoutAbortSec= applies before sending FinalKillSignal=. This setting may be used to analyze services that fail to start-up or shut-down intermittently. By using kill the service is immediately terminated by sending FinalKillSignal= without any further timeout. This setting can be used to expedite the shutdown of failing services.

Please try with FinalKillSignal=SIGABRT

@holiman
Copy link
Contributor

holiman commented Feb 17, 2022

Alternatively, it might suffice to set TimeoutStopFailureMode=abort. What we want to achieve is that after the graceful timeout ( a couple of minutes) for stop, we send it a sigabort, which causes a stack dump.

@karalabe
Copy link
Member

But please make sure to set the timeout to at least 3 minutes. Geth might want to write 1+GB on shutdown into leveldb, which can take a while.

@Fillicia
Copy link
Author

and it goes on like this until a timeout is

Please post the logs for what happens around that time

After the timeout it was just sending the SIGKILL signal to all process

Feb 17 02:53:53 web3 systemd[1]: geth.service: State 'stop-sigterm' timed out. Killing.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1527916 (geth) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1527917 (geth) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1527922 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1527924 (geth) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1528107 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1528108 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1528120 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1528127 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1528128 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1528129 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1528130 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1528131 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1528133 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1528135 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1528138 (geth) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1532347 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1532352 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1532353 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1532354 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1532355 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1532358 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1532359 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1572932 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1572935 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1572938 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1704733 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1913040 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 2056219 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459172 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459174 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459175 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459177 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459178 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459180 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459181 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459182 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459185 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459186 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459189 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459190 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459191 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459192 (geth) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459193 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459196 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459197 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459198 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459199 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459203 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459204 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459205 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459206 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459208 (n/a) with signal SIGKILL.
Feb 17 02:53:53 web3 systemd[1]: geth.service: Killing process 1459210 (n/a) with signal SIGKILL.
Feb 17 02:53:58 web3 systemd[1]: geth.service: Main process exited, code=killed, status=9/KILL
Feb 17 02:53:58 web3 systemd[1]: geth.service: Failed with result 'timeout'.
Feb 17 02:53:58 web3 systemd[1]: Stopped Ethereum go client.
Feb 17 02:53:58 web3 systemd[1]: geth.service: Consumed 4w 2d 7h 23min 47.648s CPU time.

Weirdly I rebuilt Geth after that from the latest commit and it didn't happened again for the first time in 5-6 months.

I'll close the issue for now but I've added the FinalKillSignal=SIGABRT and will post a tracelog if it happens again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants