
Sort L0 files by newly introduced epoch_num #10922

Closed
wants to merge 2 commits

Conversation

Contributor

@hx235 hx235 commented Nov 5, 2022

Context:
Sorting L0 files by largest_seqno has at least two inconveniences:

  • File ingestion, and compaction involving ingested files, can create files whose seqno ranges overlap with those of existing files. force_consistency_check=true will flag such overlapping seqno ranges even when the overlap is harmless.
    • For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
      • insert k1@1 to memtable m1
      • ingest file s1 with k2@2, ingest file s2 with k3@3
      • insert k4@4 to m1
      • compact files s1 and s2, resulting in new file s3 of seqno range [2, 3]
      • flush m1, resulting in new file s4 of seqno range [1, 4]. force_consistency_check=true will then think s4 and s3 have a file reordering corruption that might cause returning an old value of k1
    • However, such a caught corruption is a false positive: by the requirement of file ingestion, s1 and s2 cannot have keys overlapping with k1 or anything else inserted into m1 before s1 was ingested (otherwise m1 would have been flushed before either ingestion completed). Therefore there is in fact no file reordering corruption.
  • Single delete can decrease a file's largest seqno, so ordering by largest_seqno can produce a wrong ordering and hence file reordering corruption
    • For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to @ajkr for this example)
      • an existing SST s1 contains only k1@1
      • insert k1@2 to memtable m1
      • ingest file s2 with k3@3, ingest file s3 with k4@4
      • insert single delete k5@5 in m1
      • flush m1, resulting in new file s4 of seqno range [2, 5]
      • compact s1, s2, s3, resulting in new file s5 of seqno range [1, 4]
      • compact s4, resulting in new file s6 of seqno range [2, 2] because the single delete entry is dropped
    • After the last step, ordering by largest seqno (">" means "newer") gives s5 > s6, yet s6 contains a newer version of k1's value (i.e., k1@2) than s5 does, which is a real reordering corruption. While this can be caught by force_consistency_check=true, there is no good way to prevent it from happening when ordering by largest_seqno; the sketch below makes this concrete.
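
To make the misordering in this second example concrete, below is a minimal, self-contained C++ sketch (illustrative only, not RocksDB code; the L0File struct and file contents are hypothetical stand-ins for the files above) that sorts the two post-compaction files by largest_seqno and shows s5 being treated as newer than s6:

```
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for an L0 file's metadata in the example above.
struct L0File {
  std::string name;
  uint64_t smallest_seqno;
  uint64_t largest_seqno;
};

int main() {
  // Post-compaction state from the single-delete example:
  //   s5 = compact(s1, s2, s3) -> seqno range [1, 4]
  //   s6 = compact(s4)         -> seqno range [2, 2]
  std::vector<L0File> l0 = {{"s6", 2, 2}, {"s5", 1, 4}};

  // Old criterion: newest first by largest_seqno.
  std::sort(l0.begin(), l0.end(), [](const L0File& a, const L0File& b) {
    return a.largest_seqno > b.largest_seqno;
  });

  // Prints "s5 s6": s5 is treated as newer even though s6 holds k1@2,
  // a newer version of k1 than the k1@1 in s5 -- the reordering corruption.
  for (const auto& f : l0) {
    std::cout << f.name << " ";
  }
  std::cout << "\n";
  return 0;
}
```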

Therefore, we are redesigning the sorting criteria of L0 files to avoid the above inconveniences. Credit to @ajkr, we now introduce epoch_num, which describes the order in which a file is flushed or ingested/imported (a compaction output file gets the minimum epoch_num among its input files'). This avoids the above inconveniences in the following ways (see the comparator sketch after this list):

  • In the first case above, force_consistency_check=true no longer performs an overlapping-seqno-range check but an epoch_number ordering check. This results in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction), which won't trigger a false positive corruption. See test class DBCompactionTestL0FilesMisorderCorruption* for more.
  • In the second case above, this results in file orderings s1 < s2 < s3 < s4 (before compacting s1, s2, s3), s5 < s4 (after compacting s1, s2, s3), and s5 < s6 (after compacting s4), which are correct orderings that cause no corruption.
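
For illustration, here is a minimal sketch of the new ordering idea, assuming each L0 file carries the epoch_num described above (the struct, the function name, and the epoch values are hypothetical and simply follow the second example; this is not the actual RocksDB comparator):

```
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical file metadata carrying the newly introduced epoch number.
struct L0File {
  std::string name;
  uint64_t epoch_number;   // order of flush/ingestion; compaction output = min of inputs
  uint64_t largest_seqno;  // kept only to contrast with the old criterion
};

// New criterion: newest first by epoch number.
void SortNewestFirstByEpochNumber(std::vector<L0File>& files) {
  std::sort(files.begin(), files.end(), [](const L0File& a, const L0File& b) {
    return a.epoch_number > b.epoch_number;
  });
}

int main() {
  // Second example after both compactions (epoch values assumed from the ordering above):
  //   s5 = compact(s1, s2, s3) -> epoch = min(1, 2, 3) = 1, largest_seqno = 4
  //   s6 = compact(s4)         -> epoch = 4,               largest_seqno = 2
  std::vector<L0File> l0 = {{"s5", 1, 4}, {"s6", 4, 2}};
  SortNewestFirstByEpochNumber(l0);

  // Prints "s6 s5": s6 (holding the newer k1@2) correctly sorts as newer than s5.
  for (const auto& f : l0) {
    std::cout << f.name << " ";
  }
  std::cout << "\n";
  return 0;
}
```

Compared with the largest_seqno-based sketch above, only the sort key changes, but the resulting order now matches the true recency of the data.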

Summary:

  • Introduce epoch_number stored per ColumnFamilyData and sort CF's L0 files by their assigned epoch_number instead of largest_seqno.
    • epoch_number is incremented and assigned upon VersionEdit::AddFile() for flush (similarly for WriteLevel0TableForRecovery) and for file ingestion (except for ingestion behind, i.e. allow_behind_true, which is always assigned the reserved kReservedEpochNumberForFileIngestedBehind)
    • A compaction output file is assigned the minimum epoch_number among its input files'
      • Refit level: reuse refitted file's epoch_number
    • Other paths needing epoch_number treatment:
      • Import column families: reuse the file's epoch_number if it exists; if not, assign one based on NewestFirstBySeqNo.
      • Repair: reuse the file's epoch_number if it exists; if not, assign one based on NewestFirstBySeqno.
    • Assigning a new epoch_number to a file and adding that file to the LSM tree should be atomic. This is guaranteed by assigning the epoch_number right upon VersionEdit::AddFile(), where the version edit is applied to the LSM tree shape right afterwards, either while holding the db mutex (e.g., flush, file ingestion, import column family) or because there is only 1 ongoing edit per CF (e.g., WriteLevel0TableForRecovery, Repair).
    • Assigning the minimum input epoch number to a compaction output file won't misorder L0 files (even through a later Refit(target_level=0)). That's because, for every key "k" in the input range, a legit compaction covers a continuous epoch number range of that key; as long as we assign key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction, hence no misorder (see the assignment sketch after this list).
  • Persist epoch_number of each file in manifest and recover epoch_number on db recovery
    • Backward compatibility with an old db without epoch_number support is guaranteed by assigning epoch_number to recovered files in NewestFirstBySeqno order. See VersionStorageInfo::RecoverEpochNumbers() for more
    • Forward compatibility of the manifest is guaranteed by the flexibility of NewFileCustomTag
  • Replace the force_consistency_check on L0 with an epoch_number ordering check and remove false-positive checks like case 1 with largest_seqno above
    • Due to backward compatibility, we might encounter files with a missing epoch number at the beginning of db recovery. We still use the old L0 sorting mechanism (NewestFirstBySeqno) to check/sort them until we infer their epoch number. See usages of EpochNumberRequirement.
  • Remove the fix from Fix corruption with intra-L0 on ingested files #5958 (comment) and its now-outdated tests for file reordering corruption, because that fix is superseded by this PR.
  • Misc:
    • update existing tests with epoch_number so make check will pass
    • update Fix corruption with intra-L0 on ingested files #5958 (comment) tests to verify corruption is fixed using epoch_number and cover universal/fifo compaction/CompactRange/CompactFile cases
    • assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
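
As a rough illustration of the assignment rules above, here is a simplified, hypothetical C++ sketch (SimpleCFD, FileMeta, and the concrete reserved value are made up for this example; the real logic lives in ColumnFamilyData, VersionEdit, and VersionStorageInfo and runs under the db mutex as described):

```
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for kReservedEpochNumberForFileIngestedBehind; the concrete value
// here is an assumption made for this sketch.
constexpr uint64_t kReservedEpochForIngestBehind = 0;

struct FileMeta {
  uint64_t epoch_number = 0;  // 0 means "unknown, needs recovery" in this sketch
  uint64_t largest_seqno = 0;
};

// Hypothetical stand-in for the per-CF state kept in ColumnFamilyData.
class SimpleCFD {
 public:
  // Flush / WriteLevel0TableForRecovery / regular ingestion: assign the next
  // epoch number right when the file is added via VersionEdit::AddFile().
  uint64_t NewEpochNumber() { return next_epoch_number_++; }

  // Files ingested behind always get the reserved (oldest) epoch number.
  uint64_t EpochForIngestBehind() const { return kReservedEpochForIngestBehind; }

 private:
  uint64_t next_epoch_number_ = 1;
};

// Compaction output (and Refit back to L0): minimum epoch number among inputs,
// so the output never jumps ahead of or behind versions of the same key that
// were left outside this compaction.
uint64_t CompactionOutputEpochNumber(const std::vector<FileMeta>& inputs) {
  uint64_t min_epoch = UINT64_MAX;
  for (const auto& f : inputs) {
    min_epoch = std::min(min_epoch, f.epoch_number);
  }
  return min_epoch;
}

// Recovering an old db without epoch numbers: infer them from the legacy
// NewestFirstBySeqno order (mirroring what VersionStorageInfo::RecoverEpochNumbers()
// is described to do above), giving the oldest file the smallest epoch number.
void RecoverEpochNumbers(std::vector<FileMeta*>& l0_newest_first, SimpleCFD& cfd) {
  for (auto it = l0_newest_first.rbegin(); it != l0_newest_first.rend(); ++it) {
    (*it)->epoch_number = cfd.NewEpochNumber();
  }
}
```

In this sketch a plain counter suffices because, as noted above, assigning the epoch number and applying the edit to the LSM tree happen atomically (under the db mutex, or with only one ongoing edit per CF).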

Test:

  • make check
  • New unit tests under db/db_compaction_test.cc, db/db_test2.cc, db/version_builder_test.cc, db/repair_test.cc
  • Updated tests (i.e., DBCompactionTestL0FilesMisorderCorruption*) under Fix corruption with intra-L0 on ingested files #5958 (comment)
  • [Ongoing] Compatibility test: manually run ajkr@36a5686 (with file ingestion off when running the .orig binary, to prevent this bug from affecting upgrade/downgrade formality checking) for 1 hour on simple black/white box, cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox
  • [Ongoing] normal db stress test
  • [Ongoing] db stress test with aggressive values ([CI only] Aggressive crash test value #10761)

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 added the WIP Work in progress label Nov 6, 2022

@hx235 hx235 changed the title from "Sort L0 files by newly introduced l0_epoch_num + misc" to "Sort L0 files by newly introduced l0_epoch_num" on Nov 8, 2022

Contributor

@ajkr ajkr left a comment

Finally finished thinking about all the parts I hoped to think about. Great work on this very difficult problem

Review comments (now outdated and resolved) on db/column_family.cc, db/external_sst_file_ingestion_job.cc, db/import_column_family_job.cc, and db/version_builder.cc.

@hx235
Contributor Author

hx235 commented Dec 13, 2022

Update: Addressed the comments and am going through compatibility + stress tests again. Will notify here when it's ready for review again.

Contributor

@ajkr ajkr left a comment

Thanks for addressing all the comments. LGTM!!

@hx235
Contributor Author

hx235 commented Dec 13, 2022

> Thanks for addressing all the comments. LGTM!!

Thanks for the review!!! The compatibility and stress tests finished cleanly, so I will land this once signals are clear.

facebook-github-bot pushed a commit that referenced this pull request Jan 3, 2023
…rash test (#11063)

Summary:
**Context/Summary:**
#10777 was reverted (#10999) due to an internal blocker and replaced with a better fix, #10922. However, the revert also reverted the `Options::CompactionOptionsFIFO::allow_compaction` stress/crash coverage added by that PR.

It's useful coverage because setting `Options::CompactionOptionsFIFO::allow_compaction=true` will [increase](https://github.com/facebook/rocksdb/blob/7.8.fb/db/version_set.cc#L3255) the compaction score of L0 files for FIFO and then trigger more FIFO compaction. This speeds up discovery of bugs related to FIFO compaction like #10955. To see the speedup, compare the failure occurrence in the following command with `Options::CompactionOptionsFIFO::allow_compaction=true/false`

```
--fifo_allow_compaction=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=8.869062094789008 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=100 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=15 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0  --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 
--use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=65
```

Therefore this PR adds it back to the stress/crash test.

Pull Request resolved: #11063

Test Plan: Rehearsal stress test to make sure stress/crash test is stable

Reviewed By: ajkr

Differential Revision: D42283650

Pulled By: hx235

fbshipit-source-id: 132e6396ab6e24d8dcb8fe51c62dd5211cdf53ef