Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apiserver, sundog: random failures after reboot #1650

Closed
etungsten opened this issue Jul 8, 2021 · 3 comments
Closed

apiserver, sundog: random failures after reboot #1650

etungsten opened this issue Jul 8, 2021 · 3 comments
Assignees
Labels
area/core Issues core to the OS (variant independent) type/bug Something isn't working
Milestone

Comments

@etungsten
Copy link
Contributor

etungsten commented Jul 8, 2021

Image I'm using:
Bottlerocket v1.1.2
Seen this in both aarch64 aws-k8s-1.17 and aarch64 aws-k8s-1.20 so far

What I expected to happen:
Host to start-up successfully after a reboot

What actually happened:
2/400ish hosts failed to reboot due to the following error:

[  OK  ] Finished Datastore creator.

         Starting Bottlerocket API server...

[  OK  ] Started Bottlerocket API server.

         Starting User-specified setting generators...

[    4.676152] settings-committer[366]: Error sending POST to /tx/commit?tx=bottlerocket-launch: Failed to send request: error trying to connect: No such file or directory (os error 2)

[FAILED] Failed to start User-specified setting generators.

See 'systemctl status sundog.service' for details.

[DEPEND] Dependency failed for Applies settings to create config files.

[DEPEND] Dependency failed for Bottlerocket initial configuration complete.

[DEPEND] Dependency failed for Isolates configured.target.

sundog fails to commit settings to the apiserver due to the socket somehow missing.

apiserver logs:

Started Bottlerocket API server.
06:32:50 [INFO] Starting 1 workers
06:32:50 [INFO] Starting "actix-web-service-"/run/api.sock"" service on "/run/api.sock" (pathname)
06:33:18 [INFO] thar-be-settings started
06:33:18 [INFO] Parsing stdin for updated settings
06:33:18 [INFO] Requesting affected services for settings: {"settings.updates.metadata-base-url", "settings.updates.targets-base-
06:33:18 [INFO] Requesting configuration file data for affected services
06:33:18 [INFO] Rendering config files...
06:33:18 [INFO] Writing config files to disk...
06:33:18 [INFO] Restarting affected services...
06:33:18 [INFO] restart commands []
06:33:21 [INFO] thar-be-settings started
06:33:21 [INFO] Parsing stdin for updated settings
06:33:21 [INFO] Requesting affected services for settings: {"settings.updates.ignore-waves", "settings.updates.version-lock"}
06:33:21 [INFO] Requesting configuration file data for affected services
06:33:21 [INFO] Rendering config files...
06:33:21 [INFO] Writing config files to disk...
06:33:21 [INFO] Restarting affected services...
06:33:21 [INFO] restart commands []
Stopping Bottlerocket API server...
06:33:43 [INFO] SIGTERM received, stopping
06:33:43 [INFO] Paused accepting connections on "/run/api.sock" (pathname)
06:33:43 [INFO] Shutting down worker, 0 connections
apiserver.service: Succeeded.
Stopped Bottlerocket API server.
--reboot--
Starting Bottlerocket API server...
06:34:04 [INFO] Starting server at /run/api.sock with 1 thread and datastore at /var/lib/bottlerocket/datastore/current
Started Bottlerocket API server.
06:34:04 [INFO] Starting 1 workers
06:34:04 [INFO] Starting "actix-web-service-"/run/api.sock"" service on "/run/api.sock" (pathname)
06:34:04 [INFO] Paused accepting connections on "/run/api.sock" (pathname)
06:34:04 [INFO] Resume accepting connections on "/run/api.sock" (pathname)

It seems like at some point apiserver pauses connections on /run/api.sock. The timing coincides with sundog trying to commit settings to the apiserver.

More detailed logs:

    [root@ip-172-31-47-164 debug]# journalctl -D ./var/log/journal/ -u sundog -u apiserver -u migrator -u storewolf
     
    -- Logs begin at Thu 2021-07-08 06:32:46 UTC, end at Thu 2021-07-08 17:04:06 UTC. --
    Jul 08 06:32:48 localhost systemd[1]: Starting Bottlerocket data store migrator...
    Jul 08 06:32:48 localhost migrator[344]: Data store does not exist at given path, exiting (/var/lib/bottlerocket/datastore/current)
    Jul 08 06:32:48 localhost systemd[1]: Finished Bottlerocket data store migrator.
    Jul 08 06:32:48 localhost systemd[1]: Starting Datastore creator...
    Jul 08 06:32:48 localhost storewolf[353]: 06:32:48 [INFO] Storewolf started
    Jul 08 06:32:48 localhost storewolf[353]: 06:32:48 [INFO] Deleting pending transactions
    Jul 08 06:32:48 localhost storewolf[353]: 06:32:48 [INFO] Populating datastore at: /var/lib/bottlerocket/datastore
    Jul 08 06:32:48 localhost storewolf[353]: 06:32:48 [INFO] Creating datastore at: /var/lib/bottlerocket/datastore/current/live
    Jul 08 06:32:50 ip-192-168-144-159.us-west-2.compute.internal storewolf[353]: 06:32:50 [INFO] Datastore populated
    Jul 08 06:32:50 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Finished Datastore creator.
    Jul 08 06:32:50 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Starting Bottlerocket API server...
    Jul 08 06:32:50 ip-192-168-144-159.us-west-2.compute.internal apiserver[362]: 06:32:50 [INFO] Starting server at /run/api.sock with 1 thread and datastore at /var/lib/bottlerocket/datastore/current
    Jul 08 06:32:50 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Started Bottlerocket API server.
    Jul 08 06:32:50 ip-192-168-144-159.us-west-2.compute.internal apiserver[362]: 06:32:50 [INFO] Starting 1 workers
    Jul 08 06:32:50 ip-192-168-144-159.us-west-2.compute.internal apiserver[362]: 06:32:50 [INFO] Starting "actix-web-service-"/run/api.sock"" service on "/run/api.sock" (pathname)
    Jul 08 06:32:50 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Starting User-specified setting generators...
    Jul 08 06:32:50 ip-192-168-144-159.us-west-2.compute.internal settings-committer[370]: 06:32:50 [INFO] Checking pending settings.
    Jul 08 06:32:50 ip-192-168-144-159.us-west-2.compute.internal settings-committer[370]: 06:32:50 [INFO] Committing settings.
    Jul 08 06:32:51 ip-192-168-144-159.us-west-2.compute.internal sundog[373]: 06:32:51 [INFO] Sundog started
    Jul 08 06:32:51 ip-192-168-144-159.us-west-2.compute.internal sundog[373]: 06:32:51 [INFO] Retrieving setting generators
    Jul 08 06:32:51 ip-192-168-144-159.us-west-2.compute.internal sundog[373]: 06:32:51 [INFO] Retrieving settings values
    Jul 08 06:32:51 ip-192-168-144-159.us-west-2.compute.internal sundog[373]: 06:32:51 [INFO] Sending settings values to the API
    Jul 08 06:32:51 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Finished User-specified setting generators.
    Jul 08 06:33:18 ip-192-168-144-159.us-west-2.compute.internal apiserver[1040]: 06:33:18 [INFO] thar-be-settings started
    Jul 08 06:33:18 ip-192-168-144-159.us-west-2.compute.internal apiserver[1040]: 06:33:18 [INFO] Parsing stdin for updated settings
    Jul 08 06:33:18 ip-192-168-144-159.us-west-2.compute.internal apiserver[1040]: 06:33:18 [INFO] Requesting affected services for settings: {"settings.updates.metadata-base-url", "settings.updates.tar
    Jul 08 06:33:18 ip-192-168-144-159.us-west-2.compute.internal apiserver[1040]: 06:33:18 [INFO] Requesting configuration file data for affected services
    Jul 08 06:33:18 ip-192-168-144-159.us-west-2.compute.internal apiserver[1040]: 06:33:18 [INFO] Rendering config files...
    Jul 08 06:33:18 ip-192-168-144-159.us-west-2.compute.internal apiserver[1040]: 06:33:18 [INFO] Writing config files to disk...
    Jul 08 06:33:18 ip-192-168-144-159.us-west-2.compute.internal apiserver[1040]: 06:33:18 [INFO] Restarting affected services...
    Jul 08 06:33:18 ip-192-168-144-159.us-west-2.compute.internal apiserver[1040]: 06:33:18 [INFO] restart commands []
    Jul 08 06:33:21 ip-192-168-144-159.us-west-2.compute.internal apiserver[1081]: 06:33:21 [INFO] thar-be-settings started
    Jul 08 06:33:21 ip-192-168-144-159.us-west-2.compute.internal apiserver[1081]: 06:33:21 [INFO] Parsing stdin for updated settings
    Jul 08 06:33:21 ip-192-168-144-159.us-west-2.compute.internal apiserver[1081]: 06:33:21 [INFO] Requesting affected services for settings: {"settings.updates.ignore-waves", "settings.updates.version-
    Jul 08 06:33:21 ip-192-168-144-159.us-west-2.compute.internal apiserver[1081]: 06:33:21 [INFO] Requesting configuration file data for affected services
    Jul 08 06:33:21 ip-192-168-144-159.us-west-2.compute.internal apiserver[1081]: 06:33:21 [INFO] Rendering config files...
    Jul 08 06:33:21 ip-192-168-144-159.us-west-2.compute.internal apiserver[1081]: 06:33:21 [INFO] Writing config files to disk...
    Jul 08 06:33:21 ip-192-168-144-159.us-west-2.compute.internal apiserver[1081]: 06:33:21 [INFO] Restarting affected services...
    Jul 08 06:33:21 ip-192-168-144-159.us-west-2.compute.internal apiserver[1081]: 06:33:21 [INFO] restart commands []
    Jul 08 06:33:43 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: sundog.service: Succeeded.
    Jul 08 06:33:43 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Stopped User-specified setting generators.
    Jul 08 06:33:43 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Stopping Bottlerocket API server...
    Jul 08 06:33:43 ip-192-168-144-159.us-west-2.compute.internal apiserver[362]: 06:33:43 [INFO] SIGTERM received, stopping
    Jul 08 06:33:43 ip-192-168-144-159.us-west-2.compute.internal apiserver[362]: 06:33:43 [INFO] Paused accepting connections on "/run/api.sock" (pathname)
    Jul 08 06:33:43 ip-192-168-144-159.us-west-2.compute.internal apiserver[362]: 06:33:43 [INFO] Shutting down worker, 0 connections
    Jul 08 06:33:43 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: apiserver.service: Succeeded.
    Jul 08 06:33:43 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Stopped Bottlerocket API server.
    Jul 08 06:33:43 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: storewolf.service: Succeeded.
    Jul 08 06:33:43 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Stopped Datastore creator.
    Jul 08 06:33:43 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: migrator.service: Succeeded.
    Jul 08 06:33:43 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Stopped Bottlerocket data store migrator.
    -- Reboot --
    Jul 08 06:34:02 localhost systemd[1]: Starting Bottlerocket data store migrator...
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal migrator[343]: 06:34:04 [INFO] New data store is being built at work location /var/lib/bottlerocket/datastore/v20.0.0_aDrOR2T3GwWYrcVP
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal migrator[343]: 06:34:04 [INFO] Running migration command: "/proc/self/fd/4" "--forward" "--source-datastore" "/var/lib/bottlerocket/data
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal migrator[343]: 06:34:04 [INFO] New data store is being built at work location /var/lib/bottlerocket/datastore/v20.0.0_2VvDttWQprRLYDo4
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal migrator[343]: 06:34:04 [INFO] Running migration command: "/proc/self/fd/4" "--forward" "--source-datastore" "/var/lib/bottlerocket/data
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal migrator[343]: 06:34:04 [INFO] Flipping /var/lib/bottlerocket/datastore/v20.0.0 to point to v20.0.0_2VvDttWQprRLYDo4
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal migrator[343]: 06:34:04 [INFO] Flipping /var/lib/bottlerocket/datastore/v20.0 to point to v20.0.0
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal migrator[343]: 06:34:04 [INFO] Flipping /var/lib/bottlerocket/datastore/v20 to point to v20.0
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal migrator[343]: 06:34:04 [INFO] Flipping /var/lib/bottlerocket/datastore/current to point to v20
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Finished Bottlerocket data store migrator.
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Starting Datastore creator...
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal storewolf[363]: 06:34:04 [INFO] Storewolf started
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal storewolf[363]: 06:34:04 [INFO] Deleting pending transactions
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal storewolf[363]: 06:34:04 [INFO] Populating datastore at: /var/lib/bottlerocket/datastore
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal storewolf[363]: 06:34:04 [INFO] Datastore populated
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Finished Datastore creator.
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Starting Bottlerocket API server...
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal apiserver[364]: 06:34:04 [INFO] Starting server at /run/api.sock with 1 thread and datastore at /var/lib/bottlerocket/datastore/current
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Started Bottlerocket API server.
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal apiserver[364]: 06:34:04 [INFO] Starting 1 workers
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Starting User-specified setting generators...
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal apiserver[364]: 06:34:04 [INFO] Starting "actix-web-service-"/run/api.sock"" service on "/run/api.sock" (pathname)
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal settings-committer[366]: 06:34:04 [INFO] Checking pending settings.
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal apiserver[364]: 06:34:04 [INFO] Paused accepting connections on "/run/api.sock" (pathname)
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal apiserver[364]: 06:34:04 [INFO] Resume accepting connections on "/run/api.sock" (pathname)
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal settings-committer[366]: 06:34:04 [INFO] Committing settings.
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal settings-committer[366]: Error sending POST to /tx/commit?tx=bottlerocket-launch: Failed to send request: error trying to connect: No su
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: sundog.service: Control process exited, code=exited, status=1/FAILURE
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: sundog.service: Failed with result 'exit-code'.
    Jul 08 06:34:04 ip-192-168-144-159.us-west-2.compute.internal systemd[1]: Failed to start User-specified setting generators.

How to reproduce the problem:
Launch a bunch of bottlerocket hosts and restart them.

@etungsten etungsten added the status/needs-triage Pending triage or re-evaluation label Jul 8, 2021
@jpculp jpculp added area/core Issues core to the OS (variant independent) priority/p0 type/bug Something isn't working and removed status/needs-triage Pending triage or re-evaluation labels Jul 8, 2021
@jpculp jpculp added this to the next milestone Jul 8, 2021
@etungsten
Copy link
Contributor Author

I'm having trouble reproducing this issue. It might be rarer than expected or maybe there are some other factors I haven't considered. In any case, I'm gonna put this aside for now and come back to it once it happens again or resolve this if it hasn't happened again after a while.

@tjkirch
Copy link
Contributor

tjkirch commented Jul 29, 2021

Note: we've still seen this after #1655, but it's seemingly become even more rare.

@jhaynes jhaynes modified the milestones: next, oncall Jul 29, 2021
@etungsten
Copy link
Contributor Author

etungsten commented Sep 29, 2021

This hasn't happened for a while now (two weeks+). Not sure what fixed it, maybe the rust package upgrades. I'll go ahead and close the issue for now and re-open it if it ever resurfaces again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/core Issues core to the OS (variant independent) type/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants