
Sort L0 files by newly introduced epoch_num #10922

Closed
wants to merge 2 commits

Conversation

Contributor

@hx235 hx235 commented Nov 5, 2022

Context:
Sorting L0 files by largest_seqno has at least two inconveniences:

  • File ingestion, and compaction involving ingested files, can create files whose seqno ranges overlap with those of existing files. force_consistency_check=true will flag such overlapping seqno ranges even when the overlap is harmless.
    • For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
      • insert k1@1 to memtable m1
      • ingest file s1 with k2@2, ingest file s2 with k3@3
      • insert k4@4 to m1
      • compact files s1 and s2, resulting in new file s3 of seqno range [2, 3]
      • flush m1, resulting in new file s4 of seqno range [1, 4]. force_consistency_check=true will then think s4 and s3 have a file reordering corruption that might cause returning an old value of k1
    • However, such a caught corruption is a false positive: by the requirement of file ingestion, s1 and s2 cannot have keys overlapping with k1 or anything else inserted into m1 before s1 was ingested (otherwise m1 would have been flushed before either ingestion completed). Therefore there is in fact no file reordering corruption.
  • Single delete can decrease a file's largest seqno, so ordering by largest_seqno can produce a wrong ordering and hence file reordering corruption
    • For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to @ajkr for this example)
      • an existing SST s1 contains only k1@1
      • insert k1@2 to memtable m1
      • ingest file s2 with k3@3, ingest file s3 with k4@4
      • insert single delete k5@5 in m1
      • flush m1, resulting in new file s4 of seqno range [2, 5]
      • compact s1, s2, s3, resulting in new file s5 of seqno range [1, 4]
      • compact s4, resulting in new file s6 of seqno range [2, 2] because the single delete entry is dropped
    • After the last step, ordering by largest seqno (">" means "newer") gives s5 > s6, yet s6 contains a newer version of k1's value (i.e., k1@2) than s5 does, which is a real reordering corruption. While this can be caught by force_consistency_check=true, there is no good way to prevent it from happening when ordering by largest_seqno; the sketch below makes this concrete.
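
To make the misordering in this second example concrete, below is a minimal, self-contained C++ sketch (illustrative only, not RocksDB code; the L0File struct and file contents are hypothetical stand-ins for the files above) that sorts the two post-compaction files by largest_seqno and shows s5 being treated as newer than s6:

```
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for an L0 file's metadata in the example above.
struct L0File {
  std::string name;
  uint64_t smallest_seqno;
  uint64_t largest_seqno;
};

int main() {
  // Post-compaction state from the single-delete example:
  //   s5 = compact(s1, s2, s3) -> seqno range [1, 4]
  //   s6 = compact(s4)         -> seqno range [2, 2]
  std::vector<L0File> l0 = {{"s6", 2, 2}, {"s5", 1, 4}};

  // Old criterion: newest first by largest_seqno.
  std::sort(l0.begin(), l0.end(), [](const L0File& a, const L0File& b) {
    return a.largest_seqno > b.largest_seqno;
  });

  // Prints "s5 s6": s5 is treated as newer even though s6 holds k1@2,
  // a newer version of k1 than the k1@1 in s5 -- the reordering corruption.
  for (const auto& f : l0) {
    std::cout << f.name << " ";
  }
  std::cout << "\n";
  return 0;
}
```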

Therefore, we are redesigning the sorting criteria of L0 files to avoid the above inconveniences. Credit to @ajkr, we now introduce epoch_num, which describes the order in which a file is flushed or ingested/imported (a compaction output file gets the minimum epoch_num among its input files'). This avoids the above inconveniences in the following ways (see the comparator sketch after this list):

  • In the first case above, force_consistency_check=true no longer performs an overlapping-seqno-range check but an epoch_number ordering check. This results in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction), which won't trigger a false positive corruption. See test class DBCompactionTestL0FilesMisorderCorruption* for more.
  • In the second case above, this results in file orderings s1 < s2 < s3 < s4 (before compacting s1, s2, s3), s5 < s4 (after compacting s1, s2, s3), and s5 < s6 (after compacting s4), which are correct orderings that cause no corruption.
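
For illustration, here is a minimal sketch of the new ordering idea, assuming each L0 file carries the epoch_num described above (the struct, the function name, and the epoch values are hypothetical and simply follow the second example; this is not the actual RocksDB comparator):

```
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical file metadata carrying the newly introduced epoch number.
struct L0File {
  std::string name;
  uint64_t epoch_number;   // order of flush/ingestion; compaction output = min of inputs
  uint64_t largest_seqno;  // kept only to contrast with the old criterion
};

// New criterion: newest first by epoch number.
void SortNewestFirstByEpochNumber(std::vector<L0File>& files) {
  std::sort(files.begin(), files.end(), [](const L0File& a, const L0File& b) {
    return a.epoch_number > b.epoch_number;
  });
}

int main() {
  // Second example after both compactions (epoch values assumed from the ordering above):
  //   s5 = compact(s1, s2, s3) -> epoch = min(1, 2, 3) = 1, largest_seqno = 4
  //   s6 = compact(s4)         -> epoch = 4,               largest_seqno = 2
  std::vector<L0File> l0 = {{"s5", 1, 4}, {"s6", 4, 2}};
  SortNewestFirstByEpochNumber(l0);

  // Prints "s6 s5": s6 (holding the newer k1@2) correctly sorts as newer than s5.
  for (const auto& f : l0) {
    std::cout << f.name << " ";
  }
  std::cout << "\n";
  return 0;
}
```

Compared with the largest_seqno-based sketch above, only the sort key changes, but the resulting order now matches the true recency of the data.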

Summary:

  • Introduce epoch_number stored per ColumnFamilyData and sort CF's L0 files by their assigned epoch_number instead of largest_seqno.
    • epoch_number is incremented and assigned upon VersionEdit::AddFile() for flush (similarly for WriteLevel0TableForRecovery) and for file ingestion (except for ingestion behind, i.e. allow_behind_true, which is always assigned the reserved kReservedEpochNumberForFileIngestedBehind)
    • A compaction output file is assigned the minimum epoch_number among its input files'
      • Refit level: reuse refitted file's epoch_number
    • Other paths needing epoch_number treatment:
      • Import column families: reuse the file's epoch_number if it exists; if not, assign one based on NewestFirstBySeqNo.
      • Repair: reuse the file's epoch_number if it exists; if not, assign one based on NewestFirstBySeqno.
    • Assigning a new epoch_number to a file and adding that file to the LSM tree should be atomic. This is guaranteed by assigning the epoch_number right upon VersionEdit::AddFile(), where the version edit is applied to the LSM tree shape right afterwards, either while holding the db mutex (e.g., flush, file ingestion, import column family) or because there is only 1 ongoing edit per CF (e.g., WriteLevel0TableForRecovery, Repair).
    • Assigning the minimum input epoch number to a compaction output file won't misorder L0 files (even through a later Refit(target_level=0)). That's because, for every key "k" in the input range, a legit compaction covers a continuous epoch number range of that key; as long as we assign key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction, hence no misorder (see the assignment sketch after this list).
  • Persist epoch_number of each file in manifest and recover epoch_number on db recovery
    • Backward compatibility with an old db without epoch_number support is guaranteed by assigning epoch_number to recovered files in NewestFirstBySeqno order. See VersionStorageInfo::RecoverEpochNumbers() for more
    • Forward compatibility of the manifest is guaranteed by the flexibility of NewFileCustomTag
  • Replace the force_consistency_check on L0 with an epoch_number ordering check and remove false-positive checks like case 1 with largest_seqno above
    • Due to backward compatibility, we might encounter files with a missing epoch number at the beginning of db recovery. We still use the old L0 sorting mechanism (NewestFirstBySeqno) to check/sort them until we infer their epoch number. See usages of EpochNumberRequirement.
  • Remove the fix from Fix corruption with intra-L0 on ingested files #5958 (comment) and its now-outdated tests for file reordering corruption, because that fix is superseded by this PR.
  • Misc:
    • update existing tests with epoch_number so make check will pass
    • update Fix corruption with intra-L0 on ingested files #5958 (comment) tests to verify corruption is fixed using epoch_number and cover universal/fifo compaction/CompactRange/CompactFile cases
    • assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
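
As a rough illustration of the assignment rules above, here is a simplified, hypothetical C++ sketch (SimpleCFD, FileMeta, and the concrete reserved value are made up for this example; the real logic lives in ColumnFamilyData, VersionEdit, and VersionStorageInfo and runs under the db mutex as described):

```
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for kReservedEpochNumberForFileIngestedBehind; the concrete value
// here is an assumption made for this sketch.
constexpr uint64_t kReservedEpochForIngestBehind = 0;

struct FileMeta {
  uint64_t epoch_number = 0;  // 0 means "unknown, needs recovery" in this sketch
  uint64_t largest_seqno = 0;
};

// Hypothetical stand-in for the per-CF state kept in ColumnFamilyData.
class SimpleCFD {
 public:
  // Flush / WriteLevel0TableForRecovery / regular ingestion: assign the next
  // epoch number right when the file is added via VersionEdit::AddFile().
  uint64_t NewEpochNumber() { return next_epoch_number_++; }

  // Files ingested behind always get the reserved (oldest) epoch number.
  uint64_t EpochForIngestBehind() const { return kReservedEpochForIngestBehind; }

 private:
  uint64_t next_epoch_number_ = 1;
};

// Compaction output (and Refit back to L0): minimum epoch number among inputs,
// so the output never jumps ahead of or behind versions of the same key that
// were left outside this compaction.
uint64_t CompactionOutputEpochNumber(const std::vector<FileMeta>& inputs) {
  uint64_t min_epoch = UINT64_MAX;
  for (const auto& f : inputs) {
    min_epoch = std::min(min_epoch, f.epoch_number);
  }
  return min_epoch;
}

// Recovering an old db without epoch numbers: infer them from the legacy
// NewestFirstBySeqno order (mirroring what VersionStorageInfo::RecoverEpochNumbers()
// is described to do above), giving the oldest file the smallest epoch number.
void RecoverEpochNumbers(std::vector<FileMeta*>& l0_newest_first, SimpleCFD& cfd) {
  for (auto it = l0_newest_first.rbegin(); it != l0_newest_first.rend(); ++it) {
    (*it)->epoch_number = cfd.NewEpochNumber();
  }
}
```

In this sketch a plain counter suffices because, as noted above, assigning the epoch number and applying the edit to the LSM tree happen atomically (under the db mutex, or with only one ongoing edit per CF).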

Test:

  • make check
  • New unit tests under db/db_compaction_test.cc, db/db_test2.cc, db/version_builder_test.cc, db/repair_test.cc
  • Updated tests (i.e., DBCompactionTestL0FilesMisorderCorruption*) under Fix corruption with intra-L0 on ingested files #5958 (comment)
  • [Ongoing] Compatibility test: manually run ajkr@36a5686 (with file ingestion off when running the .orig binary, to prevent this bug from affecting upgrade/downgrade formality checking) for 1 hour on simple black/white box, cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox
  • [Ongoing] normal db stress test
  • [Ongoing] db stress test with aggressive values ([CI only] Aggressive crash test value #10761)

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 added the WIP Work in progress label Nov 6, 2022

@hx235 hx235 changed the title from "Sort L0 files by newly introduced l0_epoch_num + misc" to "Sort L0 files by newly introduced l0_epoch_num" on Nov 8, 2022

Contributor

@ajkr ajkr left a comment

Finally finished thinking about all the parts I hoped to think about. Great work on this very difficult problem

Review comments (now outdated and resolved) on db/column_family.cc, db/external_sst_file_ingestion_job.cc, db/import_column_family_job.cc, and db/version_builder.cc.

@hx235
Contributor Author

hx235 commented Dec 13, 2022

Update: Addressed the comments and am going through compatibility + stress tests again. Will notify here when it's ready for review again.

Contributor

@ajkr ajkr left a comment

Thanks for addressing all the comments. LGTM!!

@hx235
Contributor Author

hx235 commented Dec 13, 2022

> Thanks for addressing all the comments. LGTM!!

Thanks for the review!!! The compatibility and stress tests finished cleanly, so I will land this once signals are clear.

facebook-github-bot pushed a commit that referenced this pull request Jan 3, 2023
…rash test (#11063)

Summary:
**Context/Summary:**
#10777 was reverted (#10999) due to an internal blocker and replaced with a better fix, #10922. However, the revert also reverted the `Options::CompactionOptionsFIFO::allow_compaction` stress/crash coverage added by that PR.

It's useful coverage because setting `Options::CompactionOptionsFIFO::allow_compaction=true` will [increase](https://github.com/facebook/rocksdb/blob/7.8.fb/db/version_set.cc#L3255) the compaction score of L0 files for FIFO and then trigger more FIFO compaction. This speeds up discovery of bugs related to FIFO compaction like #10955. To see the speedup, compare the failure occurrence in the following command with `Options::CompactionOptionsFIFO::allow_compaction=true/false`

```
--fifo_allow_compaction=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=8.869062094789008 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=100 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=15 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0  --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 
--use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=65
```

Therefore this PR adds it back to the stress/crash test.

Pull Request resolved: #11063

Test Plan: Rehearsal stress test to make sure stress/crash test is stable

Reviewed By: ajkr

Differential Revision: D42283650

Pulled By: hx235

fbshipit-source-id: 132e6396ab6e24d8dcb8fe51c62dd5211cdf53ef