(SnapDeals) ERROR messages during open deadline. GasEstimateMessageGas error #8496

Closed
9 of 18 tasks
Tracked by #543
Reiers opened this issue Apr 15, 2022 · 4 comments
Labels
area/sealing kind/bug Kind: Bug need/analysis Hint: Needs Analysis P2 P2: Should be resolved SnapDeals

Comments

@Reiers

Reiers commented Apr 15, 2022

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • This is not a question or a support request. If you have any lotus related questions, please ask in the lotus forum.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not an enhancement request. If it is, please file an improvement suggestion instead.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, or the most recent RC (release candidate) for the upcoming release or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

Daemon:  1.15.2-rc1+mainnet+git.dcf6f6414+api1.5.0
Local: lotus version 1.15.2-rc1+mainnet+git.dcf6f6414

Describe the Bug

See log:

  • These messages loop if the sectors you upgraded (snap-up) are in an open deadline.
    They don't stop until the deadline closes (sending wdPost does not stop them).
    Once the deadline closes, the messages are sent and sealing continues; no restart is needed.

It's not pretty, and that many errors make the miner log quite noisy, especially around wdPost.

What is the expected behaviour?

  • Either a WARN instead of an ERROR, with a reason for the failure, for example:
    reason: upgrade stalled, deadline is open

Or change it so that the sectors wait until the deadline is closed, though that might be hard to implement.
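
A minimal sketch of the first option, assuming the sealer already knows whether the sector's deadline is open. The helper name classifySubmitErr, the string match on the error text, and the reason wording are all hypothetical; none of this is existing lotus code.

```go
package main

import (
	"fmt"
	"strings"
)

// classifySubmitErr is a hypothetical helper, not a lotus function. Given the
// error text from a failed replica-update submission and whether the sector's
// deadline is currently open, it picks a log level and a readable reason.
func classifySubmitErr(errMsg string, deadlineOpen bool) (level, reason string) {
	// Assumption: a gas-estimation failure while the deadline is open is the
	// expected, self-resolving case described in this issue.
	if deadlineOpen && strings.Contains(errMsg, "GasEstimateMessageGas error") {
		return "WARN", "upgrade stalled: deadline is open"
	}
	// Anything else keeps its original severity and message.
	return "ERROR", errMsg
}

func main() {
	msg := "GasEstimateMessageGas error: estimating gas used: message execution failed: exit 16, reason: "
	level, reason := classifySubmitErr(msg, true)
	fmt.Printf("%s\thandleSubmitReplicaUpdate: %s\n", level, reason)
}
```

Matching on the error string is fragile; it is used here only because the log above shows an empty reason field, so there is nothing more specific to key on.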

Logging Information

2022-04-12T12:32:49.670+0200    INFO    sectors storage-sealing/states_failed.go:28 ReplicaUpdateFailed(3164), waiting 59.329163098s before retrying
2022-04-12T12:32:49.687+0200    ERROR   sectors storage-sealing/states_replica_update.go:173    handleSubmitReplicaUpdate: error sending message: GasEstimateMessageGas error: estimating gas used: message execution failed: exit 16, reason: 
2022-04-12T12:32:49.690+0200    INFO    sectors storage-sealing/states_failed.go:28 ReplicaUpdateFailed(3216), waiting 59.309057881s before retrying
2022-04-12T12:32:49.708+0200    ERROR   sectors storage-sealing/states_replica_update.go:173    handleSubmitReplicaUpdate: error sending message: GasEstimateMessageGas error: estimating gas used: message execution failed: exit 16, reason: 
2022-04-12T12:32:49.712+0200    INFO    sectors storage-sealing/states_failed.go:28 ReplicaUpdateFailed(3217), waiting 59.287675255s before retrying

Repro Steps

  1. Run lotus-miner sectors snap-up <sectornumb> on, say, 100 CC sectors in deadline 1
  2. Have the sectors sealing while deadline 1 is open
  3. See the ERROR above
@ZenGround0
Contributor

The easiest approach I can think of is:
a) Explicitly check the deadline's mutability, either before gas estimation or after an estimation failure, to detect this case.
b) Go to a sealer state, WaitMutable, which waits for the deadline to close.
c) In a loop, this state sleeps for roughly the remainder of the deadline. That way we don't need to get the time exactly right for correctness, and in practice it only wakes up 1-2 times before submitting or hitting the final failure.

One downside is that other problems might be causing a legitimate failure. Because every failure in an immutable deadline is sent to the WaitMutable state, we would keep those problem replicas around for an extra deadline. Parsing actors log output and getting it to the sealer is a bigger change that we might want to consider. Then we could do all the above in response to particular logs in gas estimation.
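
Steps a) to c) could look roughly like the sketch below. It is a simplified model, not lotus code: every function name is hypothetical, the constants (30-second epochs, 60 epochs per deadline) are assumptions, and a deadline is treated as immutable only while its own window is open. The on-chain mutability rule may be stricter than that.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative assumptions, not values read from lotus.
const (
	epochDuration   = 30 * time.Second // assumed epoch length
	challengeWindow = 60               // assumed epochs per deadline
)

// deadlineOpen reports whether deadline dlIdx is open at epoch cur, given the
// start epoch of the current proving period.
func deadlineOpen(periodStart, cur, dlIdx int64) bool {
	open := periodStart + dlIdx*challengeWindow
	return cur >= open && cur < open+challengeWindow
}

// waitUntilClosed returns roughly how long is left until the deadline closes,
// or zero if the deadline is not open.
func waitUntilClosed(periodStart, cur, dlIdx int64) time.Duration {
	if !deadlineOpen(periodStart, cur, dlIdx) {
		return 0
	}
	closeEpoch := periodStart + (dlIdx+1)*challengeWindow
	return time.Duration(closeEpoch-cur) * epochDuration
}

// waitMutable sketches the WaitMutable state: it re-checks after every sleep,
// so the sleep length does not need to be exact. It returns the sleep count.
func waitMutable(periodStart, dlIdx int64, now func() int64, sleep func(time.Duration)) int {
	n := 0
	for {
		d := waitUntilClosed(periodStart, now(), dlIdx)
		if d == 0 {
			return n
		}
		sleep(d)
		n++
	}
}

func main() {
	cur := int64(70) // fake clock: epoch 70 falls inside deadline 1
	now := func() int64 { return cur }
	sleep := func(d time.Duration) {
		fmt.Println("deadline open, sleeping", d)
		cur += int64(d / epochDuration) // advance the fake clock instead of sleeping
	}
	wakeups := waitMutable(0, 1, now, sleep)
	fmt.Println("deadline closed after", wakeups, "sleep(s); submit now")
}
```

The clock and sleep functions are passed in so the loop can be exercised without waiting; a real state handler would read the chain head instead.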

@Reiers
Author

Reiers commented Jul 21, 2022

#8888

I suspect these issues are tied together somehow.

@Reiers
Author

Reiers commented Jul 21, 2022

deadline  partition  sector  status
8         0          19577   bad (generating vanilla proof: could not read from path="/nfs/TechHedgeUnit6/update-cache/s-t022352-19577/p_aux"
Caused by:
    No such file or directory (os error 2))
8  0  19671  bad (generating vanilla proof: could not read from path="/nfs/TechHedgeUnit6/update-cache/s-t022352-19671/p_aux"
Caused by:
    No such file or directory (os error 2))
8  0  19830  bad (generating vanilla proof: could not read from path="/nfs/TechHedgeUnit1/update-cache/s-t022352-19830/p_aux"
Caused by:
    No such file or directory (os error 2))
8  0  19649  bad (generating vanilla proof: could not read from path="/nfs/TechHedgeUnit7/update-cache/s-t022352-19649/p_aux"
Caused by:
    No such file or directory (os error 2))
8  0  19676  bad (generating vanilla proof: could not read from path="/nfs/TechHedgeUnit2/update-cache/s-t022352-19676/p_aux"
Caused by:
    No such file or directory (os error 2))
8  0  19684  bad (generating vanilla proof: could not read from path="/nfs/TechHedgeUnit7/update-cache/s-t022352-19684/p_aux"
Caused by:
    No such file or directory (os error 2))
8  0  19573  bad (generating vanilla proof: could not read from path="/nfs/TechHedgeUnit1/update-cache/s-t022352-19573/p_aux"
Caused by:
    No such file or directory (os error 2))
8  0  19672  bad (generating vanilla proof: could not read from path="/nfs/TechHedgeUnit6/update-cache/s-t022352-19672/p_aux"
Caused by:
    No such file or directory (os error 2))
8  0  19607  bad (generating vanilla proof: could not read from path="/nfs/TechHedgeUnit6/update-cache/s-t022352-19607/p_aux"
Caused by:
    No such file or directory (os error 2))
8  0  19606  bad (generating vanilla proof: could not read from path="/nfs/TechHedgeUnit7/update-cache/s-t022352-19606/p_aux"
Caused by:
    No such file or directory (os error 2))
8  0  19310  bad (can't acquire read lock)
8  0  19651  bad (generating vanilla proof: could not read from path="/nfs/TechHedgeUnit2/update-cache/s-t022352-19651/p_aux"
Caused by:
    No such file or directory (os error 2))

So when deadline 8 is open and it tries to finalize but can't, I have seen that it tends to break the sector. Not fully, but files go missing. See the logs above.
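
To check which sectors are affected, something like the sketch below could be run against each storage path. The update-cache/<sector dir>/p_aux layout is inferred from the log paths above, and missingPAux is a hypothetical helper rather than a lotus command.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// missingPAux returns the expected p_aux paths that do not exist under
// <storageRoot>/update-cache/<sectorDir>/. The layout is an assumption taken
// from the paths in the logs; this is not part of lotus.
func missingPAux(storageRoot string, sectorDirs []string) []string {
	var missing []string
	for _, dir := range sectorDirs {
		p := filepath.Join(storageRoot, "update-cache", dir, "p_aux")
		// Only "does not exist" counts as missing; other stat errors
		// (permissions, stale NFS handles) are left for manual inspection.
		if _, err := os.Stat(p); os.IsNotExist(err) {
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	dirs := []string{"s-t022352-19577", "s-t022352-19671"}
	for _, p := range missingPAux("/nfs/TechHedgeUnit6", dirs) {
		fmt.Println("missing:", p)
	}
}
```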

@ZenGround0
Contributor

Parsing actors log output and getting it to the sealer is a bigger change that we might want to consider. Then we could do all the above in response to particular logs in gas estimation.

Random thought: receipt event reporting targeted for nv18 will probably turn this from a janky solution to a good solution
