-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sync dir containing CURRENT after RenameFile on CURRENT as much as possible #10573
Sync dir containing CURRENT after RenameFile on CURRENT as much as possible #10573
Conversation
af33dd4
to
829f905
Compare
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
829f905
to
e53ed6d
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
Update:
|
@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't mind if the scope of this PR narrowed to eliminating the default value (nullptr) for CURRENT directory argument and making sure that non-test code always provides something reasonable for that argument (e.g., directories_.GetDbDir()
). Disallowing nullptr in SetCurrentFile()
seems like extra work (lower value, IMO) but fine if you want to do it.
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
b63503f
to
6f51040
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
6f51040
to
cc9004c
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Narrowed the scope of the PR for quicker turn-around. Also renamed the current_file_dir to be dir_contains_current_file to be more self-explanatory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!!
@hx235 has updated the pull request. You must reimport the pull request before landing. |
…ble value to it in non-test path
3129ce5
to
59e1084
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
59e1084
to
28b92fc
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
…12828) Summary: **Context/Summary:** The assertion `tlist.find(tdn.second) == tlist.end()` https://github.com/facebook/rocksdb/blame/9eebaf11cbd875435b572f05f0378ecdb761cc74/utilities/fault_injection_fs.cc#L1003 can catch us false positive. Some context (1) When fault injection is enabled and db open fails because of that, crash test will retry open without injected error in order to proceed with a clean open: https://github.com/facebook/rocksdb/blob/9eebaf11cbd875435b572f05f0378ecdb761cc74/db_stress_tool/db_stress_test_base.cc#L3559 https://github.com/facebook/rocksdb/blob/9eebaf11cbd875435b572f05f0378ecdb761cc74/db_stress_tool/db_stress_test_base.cc#L3586-L3639 (2) a. `FaultInjectionTestFS::dir_to_new_files_since_last_sync` records files that are created but not yet synced. b. When we create CURRENT, we will first create a temp file and rename it as "CURRENT". As part of the renaming, we will [assert](https://github.com/facebook/rocksdb/blame/9eebaf11cbd875435b572f05f0378ecdb761cc74/utilities/fault_injection_fs.cc#L1003) `FaultInjectionTestFS::dir_to_new_files_since_last_sync ` doesn't already have a file named `CURRENT`. Suppose the following sequence of events happened: (1) 1st open, with metadata write error 1. As part of creating CURRENT file, added "CURRENT" to `FaultInjectionTestFS::dir_to_new_files_since_last_sync_` https://github.com/facebook/rocksdb/blob/9eebaf11cbd875435b572f05f0378ecdb761cc74/utilities/fault_injection_fs.cc#L735 2. `SyncDir()` here https://github.com/facebook/rocksdb/blob/9eebaf11cbd875435b572f05f0378ecdb761cc74/file/filename.cc#L412 failed with injected metadata write error. Therefore, "CURRENT" file didn't get removed from `FaultInjectionTestFS::dir_to_new_files_since_last_sync_` as it would if `SyncDir()` succeeded https://github.com/facebook/rocksdb/blob/9eebaf11cbd875435b572f05f0378ecdb761cc74/utilities/fault_injection_fs.h#L344 (2) 2st open 1. Attempted to create a CURRENT file and failed during renaming since `FaultInjectionTestFS::dir_to_new_files_since_last_sync_` already had a file called CURRENT. So will fail ``` assertion failed - tlist.find(tdn.second) == tlist.end() ``` This PR fixed this by removing the assertion. It used to catch us some missing sync of some directory (e.,g #10573) so we will keep thinking about a better way to catch that. Pull Request resolved: #12828 Test Plan: Command constantly failed before the fix but passed after the PR running for 10 minutes ``` python3 tools/db_crashtest.py --simple blackbox --interval=10 --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=100 --adaptive_readahead=1 --adm_policy=2 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=1 --bloom_bits=10 --bottommost_compression_type=lz4hc --bottommost_file_compaction_delay=86400 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=tiered_auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=1 --compress_format_version=1 --compressed_secondary_cache_ratio=0.5 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=15 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=1 --db_write_buffer_size=0 --default_temperature=kHot --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=2097152 --high_pri_pool_ratio=0 --index_block_restart_interval=2 --index_shortening=0 --index_type=2 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=1 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=1000000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --partition_filters=1 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=1000 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=32 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --set_options_one_in=0 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=1582 --universal_max_read_amp=4 --unpartitioned_pinning=0 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=1 --use_multi_get_entity=1 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=1 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --write_fault_one_in=8 --writepercent=35 ``` Reviewed By: cbi42 Differential Revision: D59241548 Pulled By: hx235 fbshipit-source-id: 5bb49e6a94943273f47578a2caf3d08ca5b67e5f
Context:
Below crash test revealed a bug that directory containing CURRENT file (short for
dir_contains_current_file
below) was not always get synced after a new CURRENT is created and being called withRenameFile
as part of the creation.This bug exposes a risk that such un-synced directory containing the updated CURRENT can’t survive a host crash (e.g, power loss) hence get corrupted. This then will be followed by a recovery from a corrupted CURRENT that we don't want.
The root-cause is that a nullptr
FSDirectory* dir_contains_current_file
sometimes gets passed-down toSetCurrentFile()
hence in those casedir_contains_current_file->FSDirectory::FsyncWithDirOptions()
will be skipped (which otherwise will internally callEnv/FS::SyncDic()
)Summary:
The PR ensured the non-test path pass down a non-null dir containing CURRENT (which is by current RocksDB assumption just db_dir) by doing the following:
directory_to_fsync
asdir_contains_current_file
inSetCurrentFile()
to tighten the association between this directory and CURRENT fileSetCurrentFile()
API to requiredir_contains_current_file
being passed-in, instead of making it by default nullptr.SetCurrentFile()
'sdir_contains_current_file
is passed down fromVersionSet::LogAndApply()
thenVersionSet::ProcessManifestWrites()
(i.e, think about this as a chain of 3 functions related to MANIFEST update), these 2 functions also got refactored to requiredir_contains_current_file
dir_contains_current_file
, which by current assumption of RocksDB, is theFSDirectory* db_dir
.db_impl
path will obtainDBImpl::directories_.getDbDir()
while others with no access to suchdirectories_
are obtained on the fly by creating such objectFileSystem::NewDirectory(..)
and manage it by unique pointers to ensure short life time.Test:
make check