Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running update-state on a pending ReplicaUpdateFailed causes looping miner crash #9329

Closed
9 of 18 tasks
TippyFlitsUK opened this issue Sep 19, 2022 · 1 comment · Fixed by #9331
Closed
9 of 18 tasks
Labels
area/sealing kind/bug Kind: Bug need/analysis Hint: Needs Analysis P2 P2: Should be resolved SnapDeals

Comments

@TippyFlitsUK
Copy link
Contributor

TippyFlitsUK commented Sep 19, 2022

Checklist

  • This is not a security-related bug/issue. If it is, please follow please follow the security policy.
  • This is not a question or a support request. If you have any lotus related questions, please ask in the lotus forum.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not an enhancement request. If it is, please file a improvement suggestion instead.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the Latest release, or the most recent RC(release canadiate) for the upcoming release or the dev branch(master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

Daemon:  1.17.1+mainnet+git.9ec2ec08f+api1.5.0
Local: lotus-miner version 1.17.1+mainnet+git.9ec2ec08f

Describe the Bug

We are beginning to see reports of a critical looping miner crash after running the update-state command on a ReplicaUpdateFailed sector.

Currently, if the FSM finishes a replica update within the deadline of the sector or the one before it, the finalisation of the replica update will hang for a while in the ReplicaUpdateFailed state until the deadline has passed. This is intended behaviour and doesn't represent a problem in itself.

{"level":"error","ts":"2022-09-15T07:59:30.604-0700","logger":"sectors","caller":"pipeline/states_replica_update.go:178","msg":"handleSubmitReplicaUpdate: error sending message: GasEstimateMessageGas error: estimating gas used: message execution failed: exit 16, reason: message failed with backtrace:\n00: f01886797 (method 27) -- no valid updates (16)\n (RetCode=16)"}

Some users who are new to snapping sectors and are not aware of this behaviour are attempting to force a state update to UpdateActivating using the lotus-miner sectors update-state XXX UpdateActivating. This command is being issued prior to the natural expiry of the active sector deadline.

The result of taking this action is a looping critical of the miner instance which is proving very complicated to resolve. If the miner is able to stay online long enough between crashes, a lotus-miner sector remove command can be issued which will resolve the issue. However, in some cases, the miner will not stay online long enough to allow the running of any commands and the only resolution is a code level change.

It is understandable that running an update-state command at this juncture would produce an error but a total miner crash resulting in a completely unusable miner instance is not an acceptable outcome and needs to be resolved.

journalctl crash output is shown below.

https://filecoinproject.slack.com/archives/CPFTWMY7N/p1663342132529389
https://filecoinproject.slack.com/archives/CEGN061C5/p1657690542697849

Logging Information

Sep 16 12:01:19 V32-VAN1-MN-01 systemd[1]: f01886797_lotus-miner.service: Scheduled restart job, restart counter is at 10.
Sep 16 12:01:19 V32-VAN1-MN-01 systemd[1]: Stopped Lotus Miner.
Sep 16 12:01:19 V32-VAN1-MN-01 systemd[1]: f01886797_lotus-miner.service: Consumed 5min 29.514s CPU time.
Sep 16 12:01:19 V32-VAN1-MN-01 systemd[1]: Started Lotus Miner.
Sep 16 12:01:19 V32-VAN1-MN-01 lotus-miner[854432]: Worker: 262; Base: 0; Target: 0 (diff: 0)
Sep 16 12:01:19 V32-VAN1-MN-01 lotus-miner[854432]: State: idle; Current Epoch: 0; Todo: 0
Sep 16 12:01:19 V32-VAN1-MN-01 lotus-miner[854432]: Done!
Sep 16 12:02:55 V32-VAN1-MN-01 lotus-miner[854432]: [flexi_logger][ERRCODE::Time] flexi_logger has to work with UTC rather than with local time, caused by IndeterminateOffset
Sep 16 12:02:55 V32-VAN1-MN-01 lotus-miner[854432]:     See https://docs.rs/flexi_logger/latest/flexi_logger/error_info/index.html#time
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]: panic: runtime error: invalid memory address or nil pointer dereference
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a77e64]
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]: goroutine 21295 [running]:
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]: github.com/filecoin-project/lotus/storage/pipeline.(*Sealing).handleUpdateActivating.func1()
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]:         /root/lotus/storage/pipeline/states_replica_update.go:239 +0x44
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]: github.com/filecoin-project/lotus/storage/pipeline.(*Sealing).handleUpdateActivating(_, {{_, _}, _}, {{0xc014e3e000, 0x10}, 0x2672, 0x8, 0x63248a6c, {0xc02a2bdea0, ...}, ...})
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]:         /root/lotus/storage/pipeline/states_replica_update.go:267 +0x142
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]: github.com/filecoin-project/lotus/storage/pipeline.(*Sealing).Plan.func1({{_, _}, _}, {{0xc014e3e000, 0x10}, 0x2672, 0x8, 0x63248a6c, {0xc02a2bdea0, 0x1, ...}, ...})
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]:         /root/lotus/storage/pipeline/fsm.go:32 +0x76
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]: reflect.Value.call({0x387aae0?, 0xc03f3cecf0?, 0x13?}, {0x3c93789, 0x4}, {0xc023bcef98, 0x2, 0x2?})
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]:         /usr/lib/go-1.18/src/reflect/value.go:556 +0x845
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]: reflect.Value.Call({0x387aae0?, 0xc03f3cecf0?, 0x0?}, {0xc000624f98, 0x2, 0x2})
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]:         /usr/lib/go-1.18/src/reflect/value.go:339 +0xbf
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]: github.com/filecoin-project/go-statemachine.(*StateMachine).run.func3()
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]:         /usr/local/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.2/machine.go:113 +0x329
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]: created by github.com/filecoin-project/go-statemachine.(*StateMachine).run
Sep 16 12:03:05 V32-VAN1-MN-01 lotus-miner[854432]:         /usr/local/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.2/machine.go:109 +0x6aa
Sep 16 12:03:05 V32-VAN1-MN-01 systemd[1]: f01886797_lotus-miner.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 16 12:03:05 V32-VAN1-MN-01 systemd[1]: f01886797_lotus-miner.service: Failed with result 'exit-code'.
Sep 16 12:03:05 V32-VAN1-MN-01 systemd[1]: f01886797_lotus-miner.service: Consumed 5min 27.214s CPU time.
Sep 16 12:03:15 V32-VAN1-MN-01 systemd[1]: f01886797_lotus-miner.service: Scheduled restart job, restart counter is at 11.
Sep 16 12:03:15 V32-VAN1-MN-01 systemd[1]: Stopped Lotus Miner.
Sep 16 12:03:15 V32-VAN1-MN-01 systemd[1]: f01886797_lotus-miner.service: Consumed 5min 27.214s CPU time.
Sep 16 12:03:15 V32-VAN1-MN-01 systemd[1]: Started Lotus Miner.
Sep 16 12:03:15 V32-VAN1-MN-01 lotus-miner[860461]: Worker: 288; Base: 2168286; Target: 2168286 (diff: 0)
Sep 16 12:03:15 V32-VAN1-MN-01 lotus-miner[860461]: State: complete; Current Epoch: 2168286; Todo: 0
Sep 16 12:03:15 V32-VAN1-MN-01 lotus-miner[860461]: Done!

Repo Steps

As above

@TippyFlitsUK TippyFlitsUK added need/triage kind/bug Kind: Bug P2 P2: Should be resolved need/analysis Hint: Needs Analysis area/sealing SnapDeals P1 P1: Must be resolved and removed need/triage P2 P2: Should be resolved labels Sep 19, 2022
@jennijuju jennijuju added P2 P2: Should be resolved and removed P1 P1: Must be resolved labels Sep 19, 2022
@TippyFlitsUK
Copy link
Contributor Author

#8496 Would help to mitigate this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/sealing kind/bug Kind: Bug need/analysis Hint: Needs Analysis P2 P2: Should be resolved SnapDeals
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants