stores: Deduplicate parallel stat requests #8589

Merged: 3 commits into master on May 18, 2022
Conversation

@magik6k (Contributor) commented May 3, 2022

No longer a work in progress; this tries to fix scheduler hangs when getting storage stats.
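
For context, the dedup idea itself is simple: if several callers ask for a stat of the same path while one is already in flight, they should share that result instead of each hitting the (possibly slow, NFS-backed) filesystem. Below is a minimal, self-contained Go sketch of the pattern, using golang.org/x/sync/singleflight for illustration; all names here are hypothetical and the PR's actual implementation in stores/localstorage_cached.go may differ.

package main

import (
	"fmt"
	"syscall"

	"golang.org/x/sync/singleflight"
)

// FsStat is a stand-in for the stat result the scheduler needs.
type FsStat struct {
	Capacity  int64
	Available int64
}

// statGroup collapses concurrent stat requests for the same path into a
// single underlying statfs syscall; every waiter gets the shared result.
var statGroup singleflight.Group

func dedupedStat(path string) (FsStat, error) {
	v, err, _ := statGroup.Do(path, func() (interface{}, error) {
		var st syscall.Statfs_t
		if err := syscall.Statfs(path, &st); err != nil {
			return nil, err
		}
		return FsStat{
			Capacity:  int64(st.Blocks) * int64(st.Bsize),
			Available: int64(st.Bavail) * int64(st.Bsize),
		}, nil
	})
	if err != nil {
		return FsStat{}, err
	}
	return v.(FsStat), nil
}

func main() {
	// Many goroutines asking about the same path while a stat is in flight
	// trigger only one statfs call; all of them get the shared result.
	st, err := dedupedStat("/")
	fmt.Println(st, err)
}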

@codecov codecov bot commented May 3, 2022

Codecov Report

Merging #8589 (086908e) into master (e793d7e) will decrease coverage by 0.04%.
The diff coverage is 85.00%.


@@            Coverage Diff             @@
##           master    #8589      +/-   ##
==========================================
- Coverage   40.91%   40.86%   -0.05%     
==========================================
  Files         686      687       +1     
  Lines       75710    75769      +59     
==========================================
- Hits        30974    30964      -10     
- Misses      39404    39458      +54     
- Partials     5332     5347      +15     
Impacted Files Coverage Δ
extern/sector-storage/stores/remote.go 57.76% <ø> (ø)
extern/sector-storage/fsutil/filesize_unix.go 73.68% <33.33%> (-7.57%) ⬇️
...xtern/sector-storage/stores/localstorage_cached.go 87.03% <87.03%> (ø)
extern/sector-storage/stores/local.go 60.81% <100.00%> (-1.08%) ⬇️
chain/events/message_cache.go 87.50% <0.00%> (-12.50%) ⬇️
extern/storage-sealing/currentdealinfo.go 70.17% <0.00%> (-5.27%) ⬇️
chain/events/events_called.go 83.90% <0.00%> (-4.40%) ⬇️
node/hello/hello.go 63.63% <0.00%> (-3.41%) ⬇️
chain/vm/mkactor.go 48.48% <0.00%> (-3.04%) ⬇️
extern/sector-storage/worker_tracked.go 80.50% <0.00%> (-2.55%) ⬇️
... and 18 more

@magik6k magik6k changed the title from "wip: stores: Deduplicate parallel stat requests" to "stores: Deduplicate parallel stat requests" on May 10, 2022
@magik6k magik6k marked this pull request as ready for review May 10, 2022 22:50
@magik6k magik6k requested a review from a team as a code owner May 10, 2022 22:50
@magik6k magik6k force-pushed the fix/sched-hang branch 2 times, most recently from eeccdb6 to 057e2e4 on May 10, 2022 23:03
@RobQuistNL (Contributor):

This doesn't seem to resolve the issues. I don't know exactly what's going on, but sectors remain stuck in CommitFinalize.

Scheduler log for sector 915:

{"level":"info","ts":"2022-05-15T19:44:01.371Z","logger":"sectors","caller":"storage-sealing/garbage.go:38","msg":"Creating CC sector 915"}
{"level":"warn","ts":"2022-05-15T19:44:01.372Z","logger":"sectors","caller":"storage-sealing/states_sealing.go:72","msg":"Creating 1 filler pieces for sector 915"}
{"level":"debug","ts":"2022-05-15T19:44:04.166Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 432 (awi:0)"}
{"level":"debug","ts":"2022-05-15T19:44:04.166Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 5 (awi:1)"}
{"level":"debug","ts":"2022-05-15T19:44:04.166Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 915 to pc1-32-ap-1"}
{"level":"debug","ts":"2022-05-15T19:45:04.000Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 243 (awi:0)"}
{"level":"debug","ts":"2022-05-15T19:45:04.000Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 244 (awi:1)"}
{"level":"debug","ts":"2022-05-15T19:45:04.000Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 915 to pc1-15-main"}
{"level":"debug","ts":"2022-05-15T22:20:07.348Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 180 (awi:0)"}
{"level":"debug","ts":"2022-05-15T22:20:07.348Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 181 (awi:1)"}
{"level":"debug","ts":"2022-05-15T22:20:25.568Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 915 to pc2-04-pc2-2"}
{"level":"debug","ts":"2022-05-15T23:58:08.175Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 318 (awi:0)"}
{"level":"debug","ts":"2022-05-15T23:58:08.175Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 343 (awi:1)"}
{"level":"debug","ts":"2022-05-15T23:58:08.572Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:429","msg":"assign worker sector 915 (ready)"}
{"level":"debug","ts":"2022-05-15T23:58:09.980Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 187 (awi:0)"}
{"level":"debug","ts":"2022-05-15T23:58:09.980Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 188 (awi:1)"}
{"level":"debug","ts":"2022-05-15T23:58:09.980Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:429","msg":"assign worker sector 915 (ready)"}
{"level":"debug","ts":"2022-05-16T00:27:30.048Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 356 (awi:0)"}
{"level":"debug","ts":"2022-05-16T00:27:30.048Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 412 (awi:1)"}
{"level":"debug","ts":"2022-05-16T00:27:30.257Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 915 to pc1-15-main"}
{"level":"debug","ts":"2022-05-16T00:28:32.107Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 350 (awi:0)"}
{"level":"debug","ts":"2022-05-16T00:28:32.107Z","logger":"advmgr","caller":"sector-storage/sched.go:480","msg":"SCHED try assign sqi:0 sector 915 to window 382 (awi:1)"}
{"level":"debug","ts":"2022-05-16T00:29:17.238Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 915 to pc1-30-main"}

pc1-30 log (I've removed the "time" debug lines):

{"level":"error","ts":"2022-05-16T00:29:17.468Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-14/storage3/sealed/s-t036146-916","error":"file does not exist"}
{"level":"error","ts":"2022-05-16T00:29:17.469Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-14/storage1/cache/s-t036146-916","error":"file does not exist"}
{"level":"info","ts":"2022-05-16T00:29:17.469Z","logger":"stores","caller":"stores/remote.go:260","msg":"Fetch http://10.224.0.68:2345/remote/sealed/s-t036146-916 -> /mnt/stor-14/storage3/sealed/fetching/s-t036146-916"}
{"level":"info","ts":"2022-05-16T00:29:17.470Z","logger":"stores","caller":"stores/remote.go:260","msg":"Fetch http://10.224.0.68:2345/remote/sealed/s-t036146-915 -> /mnt/stor-14/storage3/sealed/fetching/s-t036146-915"}
{"level":"error","ts":"2022-05-16T00:29:18.260Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-14/storage3/sealed/s-t036146-915","error":"file does not exist"}
{"level":"error","ts":"2022-05-16T00:29:18.281Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-14/storage1/cache/s-t036146-915","error":"file does not exist"}
{"level":"error","ts":"2022-05-16T00:29:28.312Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-14/storage3/sealed/s-t036146-916","error":"file does not exist"}
{"level":"error","ts":"2022-05-16T00:29:28.312Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-14/storage3/sealed/s-t036146-915","error":"file does not exist"}
{"level":"error","ts":"2022-05-16T00:29:28.334Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-14/storage1/cache/s-t036146-916","error":"file does not exist"}
{"level":"error","ts":"2022-05-16T00:29:28.334Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-14/storage1/cache/s-t036146-915","error":"file does not exist"}
{"level":"error","ts":"2022-05-16T00:29:38.370Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-14/storage1/cache/s-t036146-916","error":"file does not exist"}
(.....)

pc1-15:

{"level":"debug","ts":"2022-05-15T23:58:08.575Z","logger":"advmgr","caller":"sector-storage/worker_local.go:158","msg":"acquired sector {{36146 915} 8} (e:6; a:0): {{0 0}  /storage2/sealed/s-t036146-915 /storage2/cache/s-t036146-915  }"}
{"level":"info","ts":"2022-05-15T23:58:08.575+0000","logger":"filcrypto::proofs::api","caller":"src/proofs/api.rs:329","msg":"seal_commit_phase1: start"}
{"level":"info","ts":"2022-05-15T23:58:08.575+0000","logger":"filecoin_proofs::api","caller":"/home/lotus/.cargo/registry/src/github.com-1ecc6299db9ec823/filecoin-proofs-11.0.2/src/api/mod.rs:678","msg":"validate_cache_for_commit:start"}
{"level":"info","ts":"2022-05-15T23:58:08.575+0000","logger":"filecoin_proofs::api","caller":"/home/lotus/.cargo/registry/src/github.com-1ecc6299db9ec823/filecoin-proofs-11.0.2/src/api/mod.rs:734","msg":"validate_cache_for_commit:finish"}
{"level":"info","ts":"2022-05-15T23:58:08.575+0000","logger":"filecoin_proofs::api::seal","caller":"/home/lotus/.cargo/registry/src/github.com-1ecc6299db9ec823/filecoin-proofs-11.0.2/src/api/seal.rs:336","msg":"seal_commit_phase1:start: SectorId(915)"}
{"level":"info","ts":"2022-05-15T23:58:08.776+0000","logger":"filecoin_proofs::api::seal","caller":"/home/lotus/.cargo/registry/src/github.com-1ecc6299db9ec823/filecoin-proofs-11.0.2/src/api/seal.rs:448","msg":"seal_commit_phase1:finish: SectorId(915)"}
{"level":"info","ts":"2022-05-15T23:58:08.869+0000","logger":"filcrypto::proofs::api","caller":"src/proofs/api.rs:367","msg":"seal_commit_phase1: finish"}

@magik6k (Contributor, Author) commented May 18, 2022

@rjan90 This won't fix windowPoSt failing; it just makes the scheduler deal better with very overloaded NFS storage. If your NFS mount is so overloaded that you can't read windowPoSt challenges in 40 min, it's probably necessary to do some of:

  • Limit parallel fetches (see the sketch after this list)
  • Mount long-term storage on all PC2 workers so that CommitFinalize doesn't try to push all data through lotus-miner
    • Sharing storage like that used not to work very well and had issues with data loss, but we fixed those a few versions ago
    • The idea is that this gives you multiple paths to your NFS servers, so reads hopefully aren't blocked by writes
  • Tune NFS
    • I'm not sure how much can be done here, but some things to try would be enabling async modes, more aggressive caching, or maybe trying UDP
  • Use another remote storage solution that supports multi-path reads better
    • I'm actually not sure which remote storage systems support this well
    • One option is to just run lotus-worker on the storage boxes with all compute tasks disabled; this way all storage accesses go through independent TCP streams, straight to whatever local filesystem
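
For the "limit parallel fetches" bullet above, here is a rough sketch of one way to cap fetch concurrency with a weighted semaphore. The wrapper and its parameters are hypothetical; this is not the lotus API, just the general shape of the idea.

package main

import (
	"context"

	"golang.org/x/sync/semaphore"
)

// fetchLimiter caps how many remote fetches run at once so the NFS backend
// isn't saturated by a burst of large sector pulls.
var fetchLimiter = semaphore.NewWeighted(4) // e.g. at most 4 parallel fetches

// limitedFetch wraps whatever actually streams a sector file (passed in as
// fetch); callers beyond the limit block until a slot frees up.
func limitedFetch(ctx context.Context, url, dest string,
	fetch func(ctx context.Context, url, dest string) error) error {
	if err := fetchLimiter.Acquire(ctx, 1); err != nil {
		return err // context cancelled while waiting for a slot
	}
	defer fetchLimiter.Release(1)
	return fetch(ctx, url, dest)
}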

@RobQuistNL If that persists for a long time, it may be caused by a stuck NFS mount, and isn't really related to this PR / isn't exactly fixable (though we could add alerts using some of the code in this PR). I had this exact thing happen on my setup, and in that case the debug process was:

  • Run df on the box with the mounted long-term storage
    • it hung, indicating that some NFS mount broke
  • Run strace -ff df and see
[...]
newfstatat(AT_FDCWD, "/data/nfs0/2", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
newfstatat(AT_FDCWD, "/data/nfs0/1", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
newfstatat(AT_FDCWD, "/data/nfs1/0",
<command hung>
  • Go look at the nfs1 box, see kernel traces
  • "oh yeah, that box has no ECC memory and months of uptime"
  • Reboot the box

@RobQuistNL (Contributor):

Thanks @magik6k. I'm re-running it all with this patch applied, and the hangs seem to be caused by a ton of CommitFinalize tasks (which just take long, because it's hundreds of 32GB files being moved to slow disks).

I've tried running it with a simple worker on the disk machines (no tasks, so only finalize / fetch etc.) and that kind of works, except for the fact that all the fetches are thrown onto 1 machine with no regard to available space (#8566).

Then there's the fact that 1 worker might get 40 fetches at once, and that's because the "resource" allocation on the "fetch" task is pretty much 0. I've changed the maxparallelism on that task from 1 to 0; this fixes the bulk allocation, but if there are no fetches running it still trickles down ordered by connection moment / available resources (instead of checking the available storage space).

I'm going to wait until all the hundreds of CommitFinalizes have resolved and then see how it's performing.

@magik6k magik6k merged commit effee8c into master May 18, 2022
@magik6k magik6k deleted the fix/sched-hang branch May 18, 2022 12:35
@RobQuistNL (Contributor) left a comment:

Logging :)

@@ -155,6 +157,8 @@ func (p *path) stat(ls LocalStorage) (fsutil.FsStat, error) {
}
}

log.Infow("storage stat", "took", time.Now().Sub(start), "reservations", len(p.reservations))
Contributor reply:

You might want to remove this one, because with a lot of storages connected this blows up the logs everywhere.
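
One possible middle ground, sketched here only as a suggestion (a fragment meant to replace the added line, assuming the same start variable and a logger that also exposes Warnw), would be to log only stats that were actually slow:

// Hypothetical variant of the added line: stay quiet for healthy paths and
// only surface stats that took suspiciously long.
if took := time.Since(start); took > 5*time.Second {
	log.Warnw("slow storage stat", "took", took, "reservations", len(p.reservations))
}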


runPoStCycle failed: running window post failed

why?

@RobQuistNL (Contributor):

Related, btw: #8665

@RobQuistNL (Contributor):

I'm not sure if this is an actual error we should log, but:

stor-01 finalized + stores sectors 0, 1, 2
stor-02 finalized + stores sectors 3, 4

stor-01 now fills with logs:

{"level":"error","ts":"2022-05-19T21:45:14.637Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/storage3/cache/s-tredacted-4","error":"file does not exist"}
{"level":"error","ts":"2022-05-19T21:45:14.637Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/storage3/sealed/s-tredacted-3","error":"file does not exist"}
{"level":"error","ts":"2022-05-19T21:45:14.637Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/storage3/cache/s-tredacted-3","error":"file does not exist"}

I think this is harmless, but it doesn't look good in the logs.
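
As a hedged sketch only: a disk-usage check could treat a missing file as "0 bytes used" instead of an error, so a sector that lives elsewhere doesn't flood the logs on every poll. The helper below is hypothetical and not necessarily how localstorage_cached.go or fsutil handle it.

package main

import (
	"errors"
	"fmt"
	"syscall"
)

// diskUsage is a stand-in for the cached disk-usage lookup; ENOENT is
// reported as zero usage rather than as an error.
func diskUsage(path string) (int64, error) {
	var st syscall.Stat_t
	if err := syscall.Stat(path, &st); err != nil {
		if errors.Is(err, syscall.ENOENT) {
			return 0, nil
		}
		return 0, err
	}
	return st.Blocks * 512, nil // st_blocks is counted in 512-byte units
}

func main() {
	n, err := diskUsage("/tmp/does-not-exist")
	fmt.Println(n, err) // prints: 0 <nil>
}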

@RobQuistNL (Contributor) commented May 20, 2022

The weird thing is that after this, the sectors moved from one storage machine to another (this is sector no. 4, the trumpets 🎺):

(screenshot omitted)

@RobQuistNL (Contributor):

Please note that all sectors got the FIN and GET on stor-07, but that is because of #8566.

@RobQuistNL (Contributor):

The scheduler is always quick: lotus-miner sealing jobs always returns quickly, UNTIL CommitFinalize and Finalize jobs start appearing.

Then lotus-miner sealing jobs hangs for a couple of minutes, and the scheduler does not distribute work.

Then, after a couple of minutes, the miner spews out a quick burst of logs:

{"level":"debug","ts":"2022-05-20T12:41:43.834Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 31 to pc1-26-main"}
{"level":"debug","ts":"2022-05-20T12:41:47.415Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 28 to pc1-10-main"}
{"level":"debug","ts":"2022-05-20T12:41:47.416Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 24 to pc1-11-main"}
{"level":"debug","ts":"2022-05-20T12:41:47.416Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 25 to pc1-15-main"}
{"level":"debug","ts":"2022-05-20T12:41:47.416Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"c8e44eb9-eafb-4a69-99a1-bb355397663f"}
{"level":"debug","ts":"2022-05-20T12:41:47.417Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"ff83212e-d833-4f12-b555-325ac5f86100"}
{"level":"debug","ts":"2022-05-20T12:41:47.418Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 26 to pc1-21-main"}
{"level":"debug","ts":"2022-05-20T12:41:47.418Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"e96be29f-ebc2-47a8-8e94-e5b5b00ae5ec"}
{"level":"debug","ts":"2022-05-20T12:41:47.419Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 29 to pc1-23-main"}
{"level":"debug","ts":"2022-05-20T12:41:47.420Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"ff9ca422-fbbd-4586-bb82-2b47b9d3b584"}
{"level":"debug","ts":"2022-05-20T12:41:47.420Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 30 to pc1-19-main"}
{"level":"debug","ts":"2022-05-20T12:41:47.471Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"c145210c-9678-41a1-9307-338a80c3626b"}
{"level":"debug","ts":"2022-05-20T12:41:47.493Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"7a21cb5a-54fe-46fd-9641-98a0621076b1"}
{"level":"debug","ts":"2022-05-20T12:41:47.585Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"9f476aa8-a1e8-400e-821a-331804e9a9bc"}
{"level":"debug","ts":"2022-05-20T12:41:47.586Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"cbb7ea0c-fcdf-402e-b280-9eef3bfcc590"}
{"level":"debug","ts":"2022-05-20T12:41:47.588Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"f7258d60-5de7-4320-b0bc-dbb39f943e23"}
{"level":"debug","ts":"2022-05-20T12:41:47.588Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"bb280c2c-e167-4fe4-9bcc-b05161a86b00"}
{"level":"debug","ts":"2022-05-20T12:41:47.588Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 14 to pc1-15-main"}
{"level":"debug","ts":"2022-05-20T12:41:47.588Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 16 to pc1-15-main"}
{"level":"debug","ts":"2022-05-20T12:41:47.588Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 17 to pc1-15-main"}
{"level":"debug","ts":"2022-05-20T12:41:47.588Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:369","msg":"assign worker sector 27 to pc1-15-main"}
{"level":"debug","ts":"2022-05-20T12:41:47.593Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"1c54ec0f-e50b-4485-814d-02ca82e4a9a8"}
{"level":"debug","ts":"2022-05-20T12:41:47.596Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"9577d5e9-90f6-4131-90ee-080da73066da"}
{"level":"debug","ts":"2022-05-20T12:41:47.719Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"df3dcf25-0983-4bd7-ba61-f3303d4b88de"}
{"level":"debug","ts":"2022-05-20T12:41:47.729Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"df3dcf25-0983-4bd7-ba61-f3303d4b88de"}
{"level":"debug","ts":"2022-05-20T12:41:47.729Z","logger":"advmgr","caller":"sector-storage/sched_worker.go:274","msg":"task done","workerid":"df3dcf25-0983-4bd7-ba61-f3303d4b88de"}

and then lotus-miner sealing jobs returns its info.

@RobQuistNL (Contributor):

(this is with this patch running)

@RobQuistNL (Contributor):

NFS is not slow / stalling. I can tree -ah through all the NFS-connected storages in < 200 ms.

@RobQuistNL (Contributor):

I'm now also seeing some of these:

{"level":"error","ts":"2022-05-20T12:50:36.877Z","logger":"advmgr","caller":"sector-storage/sched.go:412","msg":"trySched(1) req.sel.Ok error: getting worker paths:\n    github.com/filecoin-project/lotus/extern/sector-storage.(*allocSelector).Ok\n        /home/lotus/lotus-source/extern/sector-storage/selector_alloc.go:40\n  - RPC client error: sendRequest failed: Post \"http://10.224.2.35:4560/rpc/v0\": context deadline exceeded"}
{"level":"info","ts":"2022-05-20T12:51:06.008Z","logger":"miner","caller":"miner/miner.go:479","msg":"completed mineOne","tookMilliseconds":2,"forRound":1824823,"baseEpoch":1824822,"baseDeltaSeconds":6,"nullRounds":0,"lateStart":false,"beaconEpoch":0,"lookbackEpochs":900,"networkPowerAtLookback":"19383485704203403264","minerPowerAtLookback":"377957122048","isEligible":false,"isWinner":false,"error":null}
{"level":"error","ts":"2022-05-20T12:51:14.058Z","logger":"advmgr","caller":"sector-storage/sched.go:412","msg":"trySched(1) req.sel.Ok error: getting worker paths:\n    github.com/filecoin-project/lotus/extern/sector-storage.(*allocSelector).Ok\n        /home/lotus/lotus-source/extern/sector-storage/selector_alloc.go:40\n  - RPC client error: sendRequest failed: Post \"http://10.224.2.83:4563/rpc/v0\": context deadline exceeded"}
{"level":"info","ts":"2022-05-20T12:51:36.409Z","logger":"miner","caller":"miner/miner.go:479","msg":"completed mineOne","tookMilliseconds":3,"forRound":1824824,"baseEpoch":1824823,"baseDeltaSeconds":6,"nullRounds":0,"lateStart":false,"beaconEpoch":0,"lookbackEpochs":900,"networkPowerAtLookback":"19383485704203403264","minerPowerAtLookback":"377957122048","isEligible":false,"isWinner":false,"error":null}
{"level":"info","ts":"2022-05-20T12:52:06.532Z","logger":"miner","caller":"miner/miner.go:479","msg":"completed mineOne","tookMilliseconds":216,"forRound":1824825,"baseEpoch":1824824,"baseDeltaSeconds":6,"nullRounds":0,"lateStart":false,"beaconEpoch":0,"lookbackEpochs":900,"networkPowerAtLookback":"19383485566764449792","minerPowerAtLookback":"377957122048","isEligible":false,"isWinner":false,"error":null}
{"level":"error","ts":"2022-05-20T12:52:18.667Z","logger":"advmgr","caller":"sector-storage/sched.go:412","msg":"trySched(1) req.sel.Ok error: getting worker paths:\n    github.com/filecoin-project/lotus/extern/sector-storage.(*allocSelector).Ok\n        /home/lotus/lotus-source/extern/sector-storage/selector_alloc.go:40\n  - RPC client error: sendRequest failed: Post \"http://10.224.2.18:4562/rpc/v0\": context deadline exceeded"}
{"level":"warn","ts":"2022-05-20T12:52:37.752Z","logger":"miner","caller":"miner/miner.go:477","msg":"completed mineOne","tookMilliseconds":4,"forRound":1824826,"baseEpoch":1824825,"baseDeltaSeconds":7,"nullRounds":0,"lateStart":true,"beaconEpoch":0,"lookbackEpochs":900,"networkPowerAtLookback":"19383485566764449792","minerPowerAtLookback":"377957122048","isEligible":false,"isWinner":false,"error":null}
{"level":"error","ts":"2022-05-20T12:53:05.000Z","logger":"advmgr","caller":"sector-storage/sched.go:412","msg":"trySched(1) req.sel.Ok error: getting worker paths:\n    github.com/filecoin-project/lotus/extern/sector-storage.(*allocSelector).Ok\n        /home/lotus/lotus-source/extern/sector-storage/selector_alloc.go:40\n  - RPC client error: sendRequest failed: Post \"http://10.224.2.2:4567/rpc/v0\": context deadline exceeded"}
{"level":"info","ts":"2022-05-20T12:53:06.187Z","logger":"miner","caller":"miner/miner.go:479","msg":"completed mineOne","tookMilliseconds":2,"forRound":1824827,"baseEpoch":1824826,"baseDeltaSeconds":6,"nullRounds":0,"lateStart":false,"beaconEpoch":0,"lookbackEpochs":900,"networkPowerAtLookback":"19383486902990012416","minerPowerAtLookback":"377957122048","isEligible":false,"isWinner":false,"error":null}
{"level":"info","ts":"2022-05-20T12:53:36.007Z","logger":"miner","caller":"miner/miner.go:479","msg":"completed mineOne","tookMilliseconds":1,"forRound":1824828,"baseEpoch":1824827,"baseDeltaSeconds":6,"nullRounds":0,"lateStart":false,"beaconEpoch":0,"lookbackEpochs":900,"networkPowerAtLookback":"19383564899596107776","minerPowerAtLookback":"377957122048","isEligible":false,"isWinner":false,"error":null}
{"level":"error","ts":"2022-05-20T12:53:49.705Z","logger":"advmgr","caller":"sector-storage/sched.go:412","msg":"trySched(1) req.sel.Ok error: getting worker paths:\n    github.com/filecoin-project/lotus/extern/sector-storage.(*allocSelector).Ok\n        /home/lotus/lotus-source/extern/sector-storage/selector_alloc.go:40\n  - RPC client error: sendRequest failed: Post \"http://10.224.2.35:4567/rpc/v0\": context deadline exceeded"}
{"level":"error","ts":"2022-05-20T12:53:57.809Z","logger":"advmgr","caller":"sector-storage/sched.go:412","msg":"trySched(1) req.sel.Ok error: getting worker paths:\n    github.com/filecoin-project/lotus/extern/sector-storage.(*allocSelector).Ok\n        /home/lotus/lotus-source/extern/sector-storage/selector_alloc.go:40\n  - RPC client error: sendRequest failed: Post \"http://10.224.2.66:4564/rpc/v0\": context deadline exceeded"}

@RobQuistNL (Contributor):

Please also note that there are a lot of storage connections (4 disks × pc1 workers × pc2 machines) + (6 pools × 20 storages × pc1 workers).
This used to be twice as big, but I've removed some mounts I deemed unnecessary.

pc1-16-storage3:
	[##                                                ] 348.7 GiB/5.728 TiB 5%
	Unsealed: 0; Sealed: 0; Caches: 0; Reserved: 0 B
	Weight: 2; Use: Seal 
	Local: /mnt/pc1-16/storage3
	URL: http://10.224.0.68:2345/remote
	URL: http://10.224.2.67:4560/remote
	URL: http://10.224.2.99:4560/remote
	URL: http://10.224.2.18:4560/remote
	URL: http://10.224.2.98:4560/remote
	URL: http://10.224.2.100:4561/remote
	URL: http://10.224.2.83:4560/remote
	URL: http://10.224.2.130:4560/remote
	URL: http://10.224.2.2:4565/remote
	URL: http://10.224.2.18:4561/remote
	URL: http://10.224.2.3:4560/remote
	URL: http://10.224.2.18:4566/remote
	URL: http://10.224.2.19:4560/remote
	URL: http://10.224.2.66:4560/remote
	URL: http://10.224.2.2:4562/remote
	URL: http://10.224.2.3:4562/remote
	URL: http://10.224.2.67:4561/remote
	URL: http://10.224.2.82:4560/remote
	URL: http://10.224.2.114:4560/remote
	URL: http://10.224.2.18:4565/remote
	URL: http://10.224.2.115:4560/remote
	URL: http://10.224.2.2:4561/remote
	URL: http://10.224.2.98:4561/remote
	URL: http://10.224.2.34:4560/remote
	URL: http://10.224.2.3:4563/remote
	URL: http://10.224.2.67:4566/remote
	URL: http://10.224.2.35:4560/remote
	URL: http://10.224.2.2:4564/remote
	URL: http://10.224.2.68:4560/remote
	URL: http://10.224.2.83:4564/remote
	URL: http://10.224.2.67:4562/remote
	URL: http://10.224.2.67:4564/remote
	URL: http://10.224.2.3:4564/remote
	URL: http://10.224.2.3:4561/remote
	URL: http://10.224.2.3:4565/remote
	URL: http://10.224.2.2:4563/remote
	URL: http://10.224.2.83:4565/remote
	URL: http://10.224.2.3:4567/remote
	URL: http://10.224.2.83:4563/remote
	URL: http://10.224.2.3:4566/remote
	URL: http://10.224.2.35:4561/remote
	URL: http://10.224.2.100:4562/remote
	URL: http://10.224.2.83:4567/remote
	URL: http://10.224.2.83:4562/remote
	URL: http://10.224.2.2:4566/remote
	URL: http://10.224.2.83:4566/remote
	URL: http://10.224.2.82:4564/remote
	URL: http://10.224.2.100:4560/remote
	URL: http://10.224.2.82:4561/remote
	URL: http://10.224.2.82:4563/remote
	URL: http://10.224.2.99:4564/remote
	URL: http://10.224.2.18:4562/remote
	URL: http://10.224.2.100:4563/remote
	URL: http://10.224.2.67:4563/remote
	URL: http://10.224.2.100:4564/remote
	URL: http://10.224.2.2:4560/remote
	URL: http://10.224.2.98:4562/remote
	URL: http://10.224.2.83:4561/remote
	URL: http://10.224.2.67:4565/remote
	URL: http://10.224.2.18:4563/remote
	URL: http://10.224.2.98:4563/remote
	URL: http://10.224.2.67:4567/remote
	URL: http://10.224.2.66:4561/remote
	URL: http://10.224.2.66:4563/remote
	URL: http://10.224.2.35:4563/remote
	URL: http://10.224.2.98:4567/remote
	URL: http://10.224.2.35:4562/remote
	URL: http://10.224.2.2:4567/remote
	URL: http://10.224.2.100:4566/remote
	URL: http://10.224.2.99:4566/remote
	URL: http://10.224.2.98:4564/remote
	URL: http://10.224.2.99:4562/remote
	URL: http://10.224.2.66:4562/remote
	URL: http://10.224.2.66:4567/remote
	URL: http://10.224.2.115:4565/remote
	URL: http://10.224.2.35:4566/remote
	URL: http://10.224.2.82:4562/remote
	URL: http://10.224.2.82:4566/remote
	URL: http://10.224.2.115:4561/remote
	URL: http://10.224.2.66:4566/remote
	URL: http://10.224.2.66:4565/remote
	URL: http://10.224.2.98:4565/remote
	URL: http://10.224.2.98:4566/remote
	URL: http://10.224.2.35:4567/remote
	URL: http://10.224.2.115:4563/remote
	URL: http://10.224.2.66:4564/remote
	URL: http://10.224.2.82:4565/remote
	URL: http://10.224.2.115:4567/remote
	URL: http://10.224.2.35:4564/remote
	URL: http://10.224.2.35:4565/remote
	URL: http://10.224.2.34:4563/remote
	URL: http://10.224.2.34:4561/remote
	URL: http://10.224.2.99:4565/remote
	URL: http://10.224.2.100:4565/remote
	URL: http://10.224.2.34:4567/remote
	URL: http://10.224.2.99:4567/remote
	URL: http://10.224.2.100:4567/remote
	URL: http://10.224.2.82:4567/remote
	URL: http://10.224.2.99:4563/remote
	URL: http://10.224.2.115:4566/remote
	URL: http://10.224.2.34:4562/remote
	URL: http://10.224.2.115:4564/remote
	URL: http://10.224.2.68:4565/remote
	URL: http://10.224.2.18:4564/remote
	URL: http://10.224.2.99:4561/remote
	URL: http://10.224.2.18:4567/remote
	URL: http://10.224.2.115:4562/remote
	URL: http://10.224.2.68:4566/remote
	URL: http://10.224.2.34:4565/remote
	URL: http://10.224.2.68:4563/remote
	URL: http://10.224.2.34:4564/remote
	URL: http://10.224.2.68:4564/remote
	URL: http://10.224.2.34:4566/remote
	URL: http://10.224.2.68:4562/remote
	URL: http://10.224.2.68:4567/remote
	URL: http://10.224.2.68:4561/remote
	URL: http://10.224.2.114:4562/remote
	URL: http://10.224.2.130:4566/remote
	URL: http://10.224.2.130:4564/remote
	URL: http://10.224.2.114:4567/remote
	URL: http://10.224.2.114:4565/remote
	URL: http://10.224.2.130:4565/remote
	URL: http://10.224.2.114:4561/remote
	URL: http://10.224.2.114:4564/remote
	URL: http://10.224.2.130:4562/remote
	URL: http://10.224.2.130:4567/remote
	URL: http://10.224.2.130:4563/remote
	URL: http://10.224.2.114:4563/remote
	URL: http://10.224.2.114:4566/remote
	URL: http://10.224.2.130:4561/remote
	URL: http://10.224.2.19:4566/remote
	URL: http://10.224.2.19:4565/remote
	URL: http://10.224.2.19:4562/remote
	URL: http://10.224.2.19:4561/remote
	URL: http://10.224.2.19:4563/remote
	URL: http://10.224.2.19:4564/remote
	URL: http://10.224.2.19:4567/remote
	URL: http://10.224.1.66:3456/remote
$ lotus-miner storage list | wc -l
18480

@RobQuistNL (Contributor) commented May 20, 2022

While my PC1 machine is doing a GET for a CommitFinalize sector, I'm seeing these in the logs too:

{"level":"error","ts":"2022-05-20T13:14:07.218Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-08/storage1/sealed/s-tredacted-18","error":"file does not exist"}
{"level":"error","ts":"2022-05-20T13:14:07.221Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-18/storage3/cache/s-tredacted-18","error":"file does not exist"}
{"level":"error","ts":"2022-05-20T13:14:17.293Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-08/storage1/sealed/s-tredacted-18","error":"file does not exist"}
{"level":"error","ts":"2022-05-20T13:14:17.296Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-18/storage3/cache/s-tredacted-18","error":"file does not exist"}
{"level":"error","ts":"2022-05-20T13:14:27.574Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-08/storage1/sealed/s-tredacted-18","error":"file does not exist"}
{"level":"error","ts":"2022-05-20T13:14:27.577Z","logger":"stores","caller":"stores/localstorage_cached.go:105","msg":"error getting disk usage","path":"/mnt/stor-18/storage3/cache/s-tredacted-18","error":"file does not exist"}

@RobQuistNL (Contributor):

Another issue causing fetches to become very slow: a PC1 machine fetches the sealed sector from another PC1 machine through the miner (because the miner has it mounted locally over NFS).

pc1-07:

{"level":"info","ts":"2022-05-20T12:23:41.725Z","logger":"stores","caller":"stores/remote.go:260","msg":"Fetch http://miner-machine:2345/remote/sealed/s-tredacted-18 -> /mnt/stor-08/storage1/sealed/fetching/s-tredacted-18"}

miner:

$ ls -lah /mnt/pc1-29/storage2/sealed/s-tredacted-18 
-rw-r--r-- 1 lotus lotus 32G May 20 10:33 /mnt/pc1-29/storage2/sealed/s-tredacted-18

pc1-29:

$ ls -lah /storage2/sealed/s-tredacted-18 
-rw-r--r-- 1 lotus lotus 32G May 20 10:33 /storage2/sealed/s-tredacted-18

What I'd prefer is stor-08 doing a local move from /mnt/pc1-29/sealed/s-tredacted-18 to /storage1/sealed/

That doesn't work, because running a fetch worker on stor-08 causes this issue: #8566

And right now it's doing:

pc1-07 -> writing over NFS to stor-08 -> fetches over HTTP from miner -> which reads from NFS on pc-27 

@wangmeidong:

runPoStCycle failed: running window post failed

What caused this error?

@RobQuistNL (Contributor), replying to the question above:

read the logs around it
