SQLite WAL File for Events Growing Until Lotus Restart #12089

Closed
snissn opened this issue Jun 13, 2024 · 2 comments

@snissn
Contributor

snissn commented Jun 13, 2024

Description:
The SQLite write-ahead log (WAL) file for the events database in the Lotus client grows without bound until the client is restarted. This can lead to excessive disk usage and potential performance degradation. The file in question is not the state file but the write-ahead log, which should be reset periodically. Restarting Lotus clears it: events.db-wal reached 14 GB in under 24 hours of running, was deleted and reset when Lotus restarted, and then started growing again from 0 bytes.

Steps to Reproduce:

  1. Run the Lotus client.
  2. Monitor the size of the SQLite WAL file over time.
  3. Observe that the WAL file grows continuously without being truncated.
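
Step 2 can be scripted. The following is a minimal sketch in Python rather than anything taken from Lotus; the helper names `wal_size_bytes` and `watch_wal` are made up for illustration, and the only assumption is SQLite's convention of naming the WAL file after the database file with a `-wal` suffix.

```python
import os
import time


def wal_size_bytes(db_path: str) -> int:
    """Return the size of the -wal file next to db_path, or 0 if it is absent."""
    try:
        return os.path.getsize(db_path + "-wal")
    except FileNotFoundError:
        return 0


def watch_wal(db_path: str, interval_seconds: float, samples: int) -> list:
    """Sample the WAL size a fixed number of times and return the readings."""
    readings = []
    for _ in range(samples):
        size = wal_size_bytes(db_path)
        print(f"{time.strftime('%H:%M:%S')} {db_path}-wal: {size} bytes")
        readings.append(size)
        time.sleep(interval_seconds)
    return readings
```

Calling `watch_wal(path_to_events_db, 60, 10)` would print one reading per minute for ten minutes; the path to `events.db` depends on the node's repo location and is left to the operator.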

Expected Behavior:
The SQLite WAL file should be periodically checkpointed and truncated to prevent uncontrolled growth.

Environment:

  • Lotus version: master
  • OS: ubuntu

Additional Information:
Implementing a mechanism to trigger checkpoints manually, or reviewing the configuration settings related to SQLite checkpointing, may help mitigate this issue. The following config parameter should cause the WAL to be checkpointed automatically: https://sqlite.org/pragma.html#pragma_wal_autocheckpoint. It may also help to know that a checkpoint can be requested by sending a query directly to SQLite: PRAGMA wal_checkpoint(FULL);
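
To make the two pragmas concrete, here is a small sketch using Python's built-in `sqlite3` module (not Lotus code; the `event` table and the `checkpoint` and `demo` helpers are hypothetical). It disables auto-checkpointing so the effect of a manual checkpoint is visible. One detail worth keeping in mind: per the SQLite documentation, a `FULL` checkpoint copies frames back into the database but leaves the WAL file on disk at its current size, while `TRUNCATE` also shrinks the file to zero bytes.

```python
import os
import sqlite3

VALID_MODES = {"PASSIVE", "FULL", "RESTART", "TRUNCATE"}


def checkpoint(conn: sqlite3.Connection, mode: str = "FULL") -> tuple:
    """Run PRAGMA wal_checkpoint(mode); returns (busy, wal_frames, moved_frames)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown checkpoint mode: {mode}")
    return conn.execute(f"PRAGMA wal_checkpoint({mode})").fetchone()


def demo(db_path: str) -> tuple:
    """Grow a WAL, checkpoint it twice, and return the three observed WAL sizes."""
    conn = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA wal_autocheckpoint=0")  # no automatic checkpoints
    conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, data TEXT)")
    conn.execute("BEGIN")
    for _ in range(200):
        conn.execute("INSERT INTO event (data) VALUES (?)", ("x" * 500,))
    conn.execute("COMMIT")

    wal_path = db_path + "-wal"
    before = os.path.getsize(wal_path)
    checkpoint(conn, "FULL")       # frames are copied, file is not shrunk
    after_full = os.path.getsize(wal_path)
    checkpoint(conn, "TRUNCATE")   # file is truncated on success
    after_truncate = os.path.getsize(wal_path)
    conn.close()
    return before, after_full, after_truncate
```

If the on-disk size is the main concern, `TRUNCATE` (or a `journal_size_limit` setting) may be a better fit than `FULL`; which one suits Lotus is a judgment call this sketch does not settle.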

Possible Solutions:

  1. Investigate and fix any bugs preventing automatic checkpointing related to configuration of pragma_wal_autocheckpoint.
  2. As a temporary workaround, implement automatic checkpointing that executes PRAGMA wal_checkpoint(FULL); every N minutes.
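
A rough sketch of what option 2 could look like, written in Python for brevity and not reflecting how Lotus is structured. The function name `start_periodic_checkpoint` and its parameters are invented here. The worker opens its own connection because Python's `sqlite3` connections are bound to the thread that created them by default.

```python
import sqlite3
import threading

VALID_MODES = {"PASSIVE", "FULL", "RESTART", "TRUNCATE"}


def start_periodic_checkpoint(db_path: str, interval_seconds: float,
                              stop_event: threading.Event,
                              mode: str = "FULL") -> threading.Thread:
    """Start a background thread that checkpoints db_path every interval_seconds."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown checkpoint mode: {mode}")

    def run() -> None:
        # A dedicated connection for the checkpointing thread.
        conn = sqlite3.connect(db_path, isolation_level=None)
        try:
            # Event.wait returns False on timeout, True once stop is requested.
            while not stop_event.wait(interval_seconds):
                conn.execute(f"PRAGMA wal_checkpoint({mode})").fetchone()
        finally:
            conn.close()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

For "every N minutes", `interval_seconds` would be `N * 60`; setting `stop_event` and joining the thread shuts the loop down cleanly.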
@rvagg
Member

rvagg commented Jun 14, 2024

I spent some time tinkering with this because I've been running the db since nv22 and my WAL is now over 200 GB, which is definitely not normal. Unfortunately I couldn't find a straightforward solution and get it solved! But I do think this is urgent and we should spend some time figuring this out because it likely impacts performance, aside from just being a pain in the backside for node operators.

My understanding and some of the things I've tried:

  • Our current settings should trigger automatic checkpoints, which should in turn compact the WAL periodically, but that doesn't seem to happen.
  • It won't checkpoint and compact if there are open read operations. I did find one potentially unclosed transaction in index.go but dealing with it didn't seem to make a difference. Perhaps there are more subtle bugs with our transactions in there.
  • Fiddling with other WAL settings didn't seem to make much difference, including putting a hard max on it!
  • I haven't tried turning off auto-checkpointing. I didn't really want to come up with a checkpoint frequency algorithm, but it could be something as straightforward as "every X epochs".
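
The second point, that an open read can stop a checkpoint from completing, can be reproduced outside Lotus. The sketch below uses Python's `sqlite3` with a made-up `event` table and a hypothetical `blocked_checkpoint_demo` helper; it is only meant to illustrate the SQLite behaviour, not the actual code path in index.go. A reader leaves a multi-row query unfinished, the writer adds more rows, and a `TRUNCATE` checkpoint is attempted before and after the query is closed.

```python
import os
import sqlite3


def blocked_checkpoint_demo(db_path: str) -> tuple:
    """Return (busy_with_open_read, busy_after_close, wal_bytes_after_close)."""
    # timeout=0 makes a blocked checkpoint report busy at once instead of waiting.
    writer = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    writer.execute("PRAGMA journal_mode=WAL")
    writer.execute("PRAGMA wal_autocheckpoint=0")  # checkpoint only when asked
    writer.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, data TEXT)")

    def write_batch(count: int) -> None:
        writer.execute("BEGIN")
        for _ in range(count):
            writer.execute("INSERT INTO event (data) VALUES (?)", ("x" * 500,))
        writer.execute("COMMIT")

    write_batch(100)

    reader = sqlite3.connect(db_path, isolation_level=None)
    rows = reader.execute("SELECT id FROM event")
    rows.fetchone()  # stop after one row: the read transaction stays open

    write_batch(100)  # these frames land after the reader's snapshot

    # First column of the result is 1 when the checkpoint could not complete.
    busy_open = writer.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()[0]

    rows.close()  # finishing the query ends the read transaction
    busy_closed = writer.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()[0]
    wal_bytes = os.path.getsize(db_path + "-wal")

    reader.close()
    writer.close()
    return busy_open, busy_closed, wal_bytes
```

In this setup the first checkpoint should report busy while the query is still open, and the second should complete and leave a zero-byte WAL. A long-lived unclosed query or statement in the application would have the same effect for as long as it stays open.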

Possibly there's some weird interaction with one of our other PRAGMAs in there. Debugging this is obviously quite difficult: trial and error, and time. Unless someone with more insight sees something obvious there.
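
One way to narrow down PRAGMA interactions is to dump the values a connection actually ends up with. This is a small Python sketch, not part of Lotus; the `effective_pragmas` helper and the particular list of pragmas are my own choice of ones that plausibly affect WAL size.

```python
import sqlite3

# Pragmas that plausibly influence WAL growth; the selection is an assumption.
PRAGMAS = (
    "journal_mode",
    "wal_autocheckpoint",
    "journal_size_limit",
    "synchronous",
    "busy_timeout",
)


def effective_pragmas(conn: sqlite3.Connection) -> dict:
    """Return the current value of each pragma in PRAGMAS for this connection."""
    return {
        name: conn.execute(f"PRAGMA {name}").fetchone()[0]
        for name in PRAGMAS
    }
```

Apart from `journal_mode`, these settings are per connection, so the dump is only meaningful when taken on a connection configured the same way the application configures its own.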

I'll assign this to myself since I have a good working setup and have started looking into this. But @snissn, if you have a brilliant idea then I'm all ears!

rvagg added a commit that referenced this issue Jun 14, 2024
* fix unclosed multi-row query
* tune options to limit wal growth

Ref: #12089
@rvagg
Member

rvagg commented Jun 14, 2024

Tinkering here #12090

Stebalien pushed a commit that referenced this issue Jun 14, 2024
* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: #12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures
ribasushi pushed a commit to ribasushi/ci-abusing-lotus-fork that referenced this issue Jun 15, 2024
* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: filecoin-project#12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures
rjan90 pushed a commit that referenced this issue Jun 17, 2024
* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: #12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures
dumikau pushed a commit to protofire/lotus that referenced this issue Jun 18, 2024
* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: filecoin-project#12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures
jennijuju pushed a commit that referenced this issue Jun 19, 2024
* fix: ci: do not use deprecated --debug goreleaser flag (#12086)

* chore: deals: remove forgotten graphsync references (#12084)

* chore: types: remove more items forgotten after markets (#12095)

* chore: cleanup: remove more items forgotten after markets

* .gz somehow reappeared after #11625

* fix: ETH RPC API: ETH Call should use the parent state root of the subsequent tipset (#11905)

* fix eth call

* tests

* changes as per review

* changes as per review

* Update node/impl/full/eth.go

Co-authored-by: Rod Vagg <rod@vagg.org>

* fix as per review

---------

Co-authored-by: Rod Vagg <rod@vagg.org>

* Update changelog to RC2

Update changelog to RC2

* Make gen / make docsgen-cli

Make gen / make docsgen-cli

* chore: api: the Net API/CLI now remains only on daemon

The only part of this repository that does lp2p is now lotus-daemon

Remove the CommonNet type, used exclusively by the CLI stack

Adjust the rest of struct-membership to match what went where

End result best seen in diff of `documentation/en/api-v0-methods-miner.md`

* Update changelog

Update changelog

* fix: events: sqlite db improvements (#12090)

* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: #12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures

* Update CHANGELOG.md

---------

Co-authored-by: Piotr Galar <piotr.galar@gmail.com>
Co-authored-by: Peter Rabbitson <ribasushi@protocol.ai>
Co-authored-by: Aarsh Shah <aarshkshah1992@gmail.com>
Co-authored-by: Rod Vagg <rod@vagg.org>
Co-authored-by: Peter Rabbitson <ribasushi@leporine.io>
jennijuju added a commit that referenced this issue Jun 25, 2024
* release: v1.26.3 (#11908) (#11915)

* deps: update dependencies to address migration memory bloat

to address memory concerns during a heavy migration

Ref: filecoin-project/go-state-types#260
Ref: whyrusleeping/cbor-gen#96
Ref: filecoin-project/go-amt-ipld#90

* release: prep v1.26.3 patch

Prep v1.26.3 patch release:
- Update changelog, version and make gen + make docsgen-cli

* deps: update cbor-gen to tagged version

deps: update cbor-gen to tagged version

* deps: update go-state-types to tagged version

deps: update go-state-types to tagged version v0.13.2

* chore: deps: update go-state-types to v0.13.3

Fixes a panic when we have fewer than 1k proposals.

---------

Co-authored-by: Rod Vagg <rod@vagg.org>
Co-authored-by: Steven Allen <steven@stebalien.com>

* build: release: v1.27.0-rc1 (#11947)

* chore: Set version as v1.27.0-rc1

Set version as v1.27.0-rc1, run make gen & make docsgen-cli

* Update changelog

Update changelog

* Update changelog

Update changelog based on feedback

* Bump pubsub-dep

Bump pubsub-dep

* Prep v1.27.0-rc2

Prep v1.27.0-rc2

* Typo fixes, and more changelog updates

Typo fixes, and more changelog updates

* chore: remove unmaintained bootstrappers (#11983)

* chore: remove unmaintained bootstrappers

chore: remove unmaintained bootstrappers

* Update mainnet.pi fixing typoed domain

fixing typo for 1475.io 'bootstarp' -> 'bootstrap'

* Update mainnet.pi

apparently the actual hostname is typoed. so bootstarp it is.

---------

Co-authored-by: smagdali <stefan@fil.org>

* chore: update go-data-transfer and go-graphsync

* add ETH addrs API to Gateway (#11979)

* fix: copy Flags field from SectorOnChainInfo

Fixes: #11962

* feat: libp2p: Lotus stream cleanup (#11993)

* set stream deadlines in Lotus

* reduce timeout

* whitelist bootstrappers

* fix tests

* Update changelog and version

Update changelog and version

* ci: deprecate circle ci in favour of github actions (#11786)

* Update changelog

Update changelog with the deprecate circle-ci

* chore: update drand (#12021)

* Update changelog / make docsgen

Update changelog / make docsgen

* chore: lint: update golangci lint config

* remove and replace some linters
* remove some exclusions
* make all exclusions more explicit matches

* chore: lint: fix lint errors with new linting config

Ref: #11967

* chore: lint: address feedback from reviews

* doc: eth: restore comment lost in linter cleanup

Ref: #11968

* chore: libp2p: update to v0.34.1 (#12027)

* update libp2p to v0.34.0

* fix libp2p err

* fix imports

* update go mod

* update go mod

* Update changelog

Update changelog

* go mod tidy

go mod tidy

* revert go version change (#12050)

* Update changelog

Update changelog

* chore: backport #12054 to release/v1.27.0 branch (#12056)

* chore: pin golanglint-ci to v1.58.2 (#12054)

Fixes: #12044

* Add backport to changelog

Add backport to changelog

---------

Co-authored-by: Rod Vagg <rod@vagg.org>

* Bump version - make gen/make docsgen

Bump version - make gen/make docsgen

* Update changelog

Update changelog

* Bump NodeBuildVersion to v1.27.1-rc1

Bump NodeBuildVersion to v1.27.1-rc1

* Add Lotus-Miner / Curio related changes

Add Lotus-Miner / Curio related changes

* Update date and upgrade warnings

Update date and upgrade warnings

* fix: ci: do not use deprecated --debug goreleaser flag (#12086)

* chore: deals: remove forgotten graphsync references (#12084)

* chore: types: remove more items forgotten after markets (#12095)

* chore: cleanup: remove more items forgotten after markets

* .gz somehow reappeared after #11625

* fix: ETH RPC API: ETH Call should use the parent state root of the subsequent tipset (#11905)

* fix eth call

* tests

* changes as per review

* changes as per review

* Update node/impl/full/eth.go

Co-authored-by: Rod Vagg <rod@vagg.org>

* fix as per review

---------

Co-authored-by: Rod Vagg <rod@vagg.org>

* Update changelog to RC2

Update changelog to RC2

* Make gen / make docsgen-cli

Make gen / make docsgen-cli

* chore: api: the Net API/CLI now remains only on daemon

The only part of this repository that does lp2p is now lotus-daemon

Remove the CommonNet type, used exclusively by the CLI stack

Adjust the rest of struct-membership to match what went where

End result best seen in diff of `documentation/en/api-v0-methods-miner.md`

* Update changelog

Update changelog

* fix: events: sqlite db improvements (#12090)

* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: #12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures

* Update CHANGELOG.md

* build: release: v1.27.1-rc2 (#12101)

* fix: ci: do not use deprecated --debug goreleaser flag (#12086)

* chore: deals: remove forgotten graphsync references (#12084)

* chore: types: remove more items forgotten after markets (#12095)

* chore: cleanup: remove more items forgotten after markets

* .gz somehow reappeared after #11625

* fix: ETH RPC API: ETH Call should use the parent state root of the subsequent tipset (#11905)

* fix eth call

* tests

* changes as per review

* changes as per review

* Update node/impl/full/eth.go

Co-authored-by: Rod Vagg <rod@vagg.org>

* fix as per review

---------

Co-authored-by: Rod Vagg <rod@vagg.org>

* Update changelog to RC2

Update changelog to RC2

* Make gen / make docsgen-cli

Make gen / make docsgen-cli

* chore: api: the Net API/CLI now remains only on daemon

The only part of this repository that does lp2p is now lotus-daemon

Remove the CommonNet type, used exclusively by the CLI stack

Adjust the rest of struct-membership to match what went where

End result best seen in diff of `documentation/en/api-v0-methods-miner.md`

* Update changelog

Update changelog

* fix: events: sqlite db improvements (#12090)

* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: #12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures

* Update CHANGELOG.md

---------

Co-authored-by: Piotr Galar <piotr.galar@gmail.com>
Co-authored-by: Peter Rabbitson <ribasushi@protocol.ai>
Co-authored-by: Aarsh Shah <aarshkshah1992@gmail.com>
Co-authored-by: Rod Vagg <rod@vagg.org>
Co-authored-by: Peter Rabbitson <ribasushi@leporine.io>

* small fix in changelog

* fix: release: update goreleaser config file

Fixes: #12120

* fix go releaser and test with rc3

* Update CHANGELOG.md

* lotus v1.27.1 prep

* address review
- resolve one more conflicts
- revert 2 new line added

* doc: events: note events db migration impact

---------

Co-authored-by: Phi-rjan <orjan.roren@gmail.com>
Co-authored-by: Rod Vagg <rod@vagg.org>
Co-authored-by: Steven Allen <steven@stebalien.com>
Co-authored-by: smagdali <stefan@fil.org>
Co-authored-by: Aarsh Shah <aarshkshah1992@gmail.com>
Co-authored-by: Piotr Galar <piotr.galar@gmail.com>
Co-authored-by: Peter Rabbitson <ribasushi@protocol.ai>
Co-authored-by: Peter Rabbitson <ribasushi@leporine.io>
Projects
Status: ☑️Done (Archive)