
Migration optimization #10048

Open
5 of 15 tasks
jennijuju opened this issue Jan 18, 2023 · 2 comments
Assignees
Labels
kind/enhancement Kind: Enhancement P2 P2: Should be resolved

Comments

@jennijuju (Member)

Checklist

  • This is not a new feature or an enhancement to the Filecoin protocol. If it is, please open an FIP issue.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not brainstorming ideas. If you have an idea you'd like to discuss, please open a new discussion on the lotus forum and select the category as Ideas.
  • I have a specific, actionable, and well motivated improvement to propose.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Improvement Suggestion

During the Shark upgrade, full nodes with larger chain state and less RAM on the machine had trouble with the migration.

A couple of optimizations were proposed by the PL NetOps team:

  • implement memory maps
  • be able to opt out of the (pre)migration cache
  • opt in to keep the cache without batch flush

Ask:
🤔 Decide which optimization works best and implement it.

@jennijuju jennijuju added P2 P2: Should be resolved and removed need/triage labels Jan 18, 2023
@travisperson (Contributor)

I went back and listened to our discussion to flesh out the proposed optimizations above, and I think there is really only one concrete suggestion.

  • implement memory maps
  • opt-in to keep the cache without batch flush

These two points boil down to the same thing, as far as I was able to derive from the video notes.

We basically want to cache the intermediate statetree root after the migration completes, so that future calls to HandleStateForks can look up the migrated statetree root and return immediately. It would probably be best to avoid using the migration cache itself for this, as that is a very large cache and ideally would be cleaned up fairly quickly after a successful migration to reduce memory usage.

Right now we could fairly easily store the migrated state root on the migration structure directly. However, I think we should also consider persisting this value in the chain store, to avoid having to redo the migration work after a restart occurs.

type migration struct {
	upgrade       MigrationFunc
	preMigrations []PreMigration
	cache         *nv16.MemMigrationCache
}

  • be able to opt out the (pre)migration cache

This just sounds like opting out of the premigration entirely (there is no point running the premigration without a cache). I thought this used to be a feature, but I guess it has since been removed (though I can't even find when it existed). I believe there was an env var DISABLE_PRE_MIGRATIONS that we should bring back.


Additionally, to make future improvements to migrations easier there are a few extra things I think we should be doing

  • Setup a process during releases to record and store the most recent snapshot right before all network upgrades so we can easily rerun migrations in the future.
  • Migration guide for node operators (Add migration guide for node operators lotus-docs#483)
    • Reset the datastore from a snapshot to reduce the size of the datastore prior to migration
    • Do not restart the lotus node after pre-migration starts until after the migration is completed.
    • Identify the log lines operators should pay attention to in order to understand their progress/performance during pre-migration & migration
    • Document important configuration / env variables for migrations
  • Additional metrics for migrations
    • cache size
    • cache hit/miss
  • Identify important metrics that we would like to record from all nodes and pass this information off to the NetOps team so they can collect and share it with us.

Additionally, an env var has been added (#9784), LOTUS_MIGRATION_MAX_WORKER_COUNT, which lets operators set a max worker count so that lotus does not use a number of workers equal to the number of CPUs.

@arajasek (Contributor)

@travisperson Thanks a lot for this detailed synthesis, and for the summary in standup today. Based on feedback we received from the various users of the nv17 migration, I think a list of changes to make in order of impact might look something like this:

  1. Storing the migration result in memory. This is easy to do, as you described above, and addresses a need of a high-priority integration partner.
  2. Provide the option to disable pre-migrations entirely, for node operators who are okay with extended out-of-sync time.
  3. Provide the option to have the pre-migration result persisted, so that it (a) becomes restart-resistant, and (b) reduces memory consumption. This is a bit more work than the previous items, but likely the most impactful item. We'll need to test that the performance is acceptable when doing this (the pre-migration is actually useful), and that splitstore doesn't interfere with this.
  4. The migration guide.
  5. Metrics for future innovation.
