-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove random writes from SST file ingestion #4172
Conversation
RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Test plan: ``` $make clean && make -j16 all check $./external_sst_file_test $./external_sst_file_basic_test ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@siying Could you PTAL? Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way we write a special flag in ExternalSstFilePropertyNames::kGlobalSeqno to indicate that this is not a valid seqno? Or maybe bump up "rocksdb.external_sst_file.version" and remove "rocksdb.external_sst_file.global_seqno" as a whole.
db/version_edit.h
Outdated
FileDescriptor(uint64_t number, uint32_t path_id, uint64_t _file_size) | ||
FileDescriptor(uint64_t number, uint32_t path_id, uint64_t _file_size, | ||
SequenceNumber _smallest_seqno = kMaxSequenceNumber, | ||
SequenceNumber _largest_seqno = 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How hard it is to avoid defaults?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
table/block_based_table_reader.cc
Outdated
@@ -696,7 +707,7 @@ SequenceNumber GetGlobalSequenceNumber(const TableProperties& table_properties, | |||
return kDisableGlobalSequenceNumber; | |||
} | |||
|
|||
SequenceNumber global_seqno = DecodeFixed64(seqno_pos->second.c_str()); | |||
SequenceNumber global_seqno = largest_seqno; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we still read from DecodeFixed64(seqno_pos->second.c_str())
and make sure it is either the same as largest_seqno
or it is not modified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by 'not modified', do you mean verify the checksum?
@riversand963 has updated the pull request. Re-import the pull request |
Closing this PR, moving to #4188 to avoid unrelated file changes. |
@riversand963 has updated the pull request. Re-import the pull request |
@siying can you PTAL? |
@riversand963 has updated the pull request. Re-import the pull request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job!
include/rocksdb/options.h
Outdated
@@ -1265,6 +1265,9 @@ struct IngestExternalFileOptions { | |||
// with allow_ingest_behind=true since the dawn of time. | |||
// All files will be ingested at the bottommost level with seqno=0. | |||
bool ingest_behind = false; | |||
// Set to true if you would like to write global_seqno to a given offset in | |||
// the external SST file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mention the backward compatibility scenario that this option is tries to address and recommend to set it false to new services or the compatibility is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
table/block_based_table_reader.cc
Outdated
@@ -697,6 +697,8 @@ SequenceNumber GetGlobalSequenceNumber(const TableProperties& table_properties, | |||
} | |||
|
|||
SequenceNumber global_seqno = DecodeFixed64(seqno_pos->second.c_str()); | |||
assert(global_seqno == 0 || global_seqno == largest_seqno); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than assert, I suggest explicitly fail the open and return corrupted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
table/table_builder.h
Outdated
@@ -31,14 +32,15 @@ struct TableReaderOptions { | |||
const EnvOptions& _env_options, | |||
const InternalKeyComparator& _internal_comparator, | |||
bool _skip_filters = false, bool _immortal = false, | |||
int _level = -1) | |||
int _level = -1, SequenceNumber _largest_seqno = 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the defaults? I don't see a scenario we should do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean defaults of all the parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't care level, but I don't think it makes sense to have a default for largest_seqno. I don't care whether you reorder to two so you can keep the level default, or remove defaults for both of level and largest_seqno.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Adding an overloaded constructor. Another option is to provide a setter (not implemented).
table/block_based_table_reader.cc
Outdated
@@ -697,6 +697,8 @@ SequenceNumber GetGlobalSequenceNumber(const TableProperties& table_properties, | |||
} | |||
|
|||
SequenceNumber global_seqno = DecodeFixed64(seqno_pos->second.c_str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised that the existing code has no verification that seqno_pos is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@riversand963 has updated the pull request. Re-import the pull request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@riversand963 has updated the pull request. Re-import the pull request |
@riversand963 has updated the pull request. Re-import the pull request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@riversand963 has updated the pull request. Re-import the pull request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@riversand963 has updated the pull request. Re-import the pull request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: facebook#4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
This uses he new flag added in facebook/rocksdb#4172 to avoid the global_seq_no edits and the SST copying they forced us to do to preserve the raft log. Passing this flag has to be gated on a cluster version that is defined well after the upgrade to rocks 5.17 -- using the flag requires that older versions of RocksDB (< 5.16) will never be used to read, as they will still be looking for and using the seq_no in the file. Release note (performance improvement): reduce write-amplification during bulk-loading (IMPORT and RESTORE)
This uses he new flag added in facebook/rocksdb#4172 to avoid the global_seq_no edits and the SST copying they forced us to do to preserve the raft log. Passing this flag has to be gated on a cluster version that is defined well after the upgrade to rocks 5.17 -- using the flag requires that older versions of RocksDB (< 5.16) will never be used to read, as they will still be looking for and using the seq_no in the file. Release note (performance improvement): reduce write-amplification during bulk-loading (IMPORT and RESTORE)
This uses he new flag added in facebook/rocksdb#4172 to avoid the global_seq_no edits and the SST copying they forced us to do to preserve the raft log. Passing this flag has to be gated on a cluster version that is defined well after the upgrade to rocks 5.17 -- using the flag requires that older versions of RocksDB (< 5.16) will never be used to read, as they will still be looking for and using the seq_no in the file. Release note (performance improvement): reduce write-amplification during bulk-loading (IMPORT and RESTORE)
34806: util/mon: Downgrade "bytes usage increases" log message r=knz a=bdarnell This message can be logged frequently in certain workloads, and its appearance in the info log is not actionable since there is no way to tie it back to the query that caused it. I see no reason to have these messages enabled by default. Release note: None 34886: storage/engine,libroach: 1x write amplification on ingest sst r=dt a=dt This uses the new flag added in facebook/rocksdb#4172 to avoid needing to make global_seq_no-related edits to SSTs, and thus avoid the SST copying those edits forced us to do to preserve the raft log immutability when hard-linking directly to side load SSTs. This is gated on a cluster version that is defined well after the upgrade to rocks 5.17, since using the flag requires that no older versions of RocksDB (<5.16), will ever be used to read these SSTs, as they will lack the seq_no that the older versions need for correctness. Release note (performance improvement): reduce write-amplification during bulk-loading (IMPORT and RESTORE) 34902: sql: ensure spans get anonymized in plan collection r=knz a=knz Fixes #34889. Release note (bug fix): The logical plans collected for display in the web UI now also hide the details of which key ranges are scanned in table lookups. Co-authored-by: Ben Darnell <ben@bendarnell.com> Co-authored-by: David Taylor <tinystatemachine@gmail.com> Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
Summary: It it not safe to assume application had sync the SST file before ingest it into DB. Also the directory to put the ingested file needs to be fsync, otherwise the file can be lost. For integrity of RocksDB we need to sync the ingested file and directory before apply the change to manifest. Also syncing after writing global sequence when write_global_seqno=true was removed in #4172. Adding it back. Fixes #5287. Pull Request resolved: #5435 Test Plan: Test ingest file with ldb command and observe fsync/fdatasync in strace output. Tried both move_files=true and move_files=false. https://gist.github.com/yiwu-arbug/650a4023f57979056d83485fa863bef9 More test suggestions are welcome. Differential Revision: D15941675 Pulled By: riversand963 fbshipit-source-id: 389533f3923065a96df2cdde23ff4724a1810d78
Summary: It it not safe to assume application had sync the SST file before ingest it into DB. Also the directory to put the ingested file needs to be fsync, otherwise the file can be lost. For integrity of RocksDB we need to sync the ingested file and directory before apply the change to manifest. Also syncing after writing global sequence when write_global_seqno=true was removed in facebook#4172. Adding it back. Fixes facebook#5287. Pull Request resolved: facebook#5435 Test Plan: Test ingest file with ldb command and observe fsync/fdatasync in strace output. Tried both move_files=true and move_files=false. https://gist.github.com/yiwu-arbug/650a4023f57979056d83485fa863bef9 More test suggestions are welcome. Differential Revision: D15941675 Pulled By: riversand963 fbshipit-source-id: 389533f3923065a96df2cdde23ff4724a1810d78
RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses
pwrite
to update theglobal_seqno
. Since random write is not supported in some non-POSIX compliantfile systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update
global_seqno
duringfile ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.
Test plan: