Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove random writes from SST file ingestion #4172

Closed

Conversation

riversand963
Copy link
Contributor

RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses pwrite to update the
global_seqno. Since random write is not supported in some non-POSIX compliant
file systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update global_seqno during
file ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.

Test plan:

$make clean && make -j16 all check
$./external_sst_file_test
$./external_sst_file_basic_test

RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the
`global_seqno`. Since random write is not supported in some non-POSIX compliant
file systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update `global_seqno` during
file ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.

Test plan:
```
$make clean && make -j16 all check
$./external_sst_file_test
$./external_sst_file_basic_test
```
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@riversand963
Copy link
Contributor Author

@siying Could you PTAL? Thanks

Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way we write a special flag in ExternalSstFilePropertyNames::kGlobalSeqno to indicate that this is not a valid seqno? Or maybe bump up "rocksdb.external_sst_file.version" and remove "rocksdb.external_sst_file.global_seqno" as a whole.

FileDescriptor(uint64_t number, uint32_t path_id, uint64_t _file_size)
FileDescriptor(uint64_t number, uint32_t path_id, uint64_t _file_size,
SequenceNumber _smallest_seqno = kMaxSequenceNumber,
SequenceNumber _largest_seqno = 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How hard it is to avoid defaults?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -696,7 +707,7 @@ SequenceNumber GetGlobalSequenceNumber(const TableProperties& table_properties,
return kDisableGlobalSequenceNumber;
}

SequenceNumber global_seqno = DecodeFixed64(seqno_pos->second.c_str());
SequenceNumber global_seqno = largest_seqno;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we still read from DecodeFixed64(seqno_pos->second.c_str()) and make sure it is either the same as largest_seqno or it is not modified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by 'not modified', do you mean verify the checksum?

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

@riversand963 riversand963 changed the title Remove random writes from SST file ingestion [DO NOT REVIEW] Remove random writes from SST file ingestion Jul 25, 2018
@riversand963 riversand963 changed the title [DO NOT REVIEW] Remove random writes from SST file ingestion Remove random writes from SST file ingestion Jul 25, 2018
@riversand963
Copy link
Contributor Author

riversand963 commented Jul 25, 2018

Closing this PR, moving to #4188 to avoid unrelated file changes.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

@riversand963
Copy link
Contributor Author

@siying can you PTAL?

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!

@@ -1265,6 +1265,9 @@ struct IngestExternalFileOptions {
// with allow_ingest_behind=true since the dawn of time.
// All files will be ingested at the bottommost level with seqno=0.
bool ingest_behind = false;
// Set to true if you would like to write global_seqno to a given offset in
// the external SST file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention the backward compatibility scenario that this option is tries to address and recommend to set it false to new services or the compatibility is not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -697,6 +697,8 @@ SequenceNumber GetGlobalSequenceNumber(const TableProperties& table_properties,
}

SequenceNumber global_seqno = DecodeFixed64(seqno_pos->second.c_str());
assert(global_seqno == 0 || global_seqno == largest_seqno);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than assert, I suggest explicitly fail the open and return corrupted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -31,14 +32,15 @@ struct TableReaderOptions {
const EnvOptions& _env_options,
const InternalKeyComparator& _internal_comparator,
bool _skip_filters = false, bool _immortal = false,
int _level = -1)
int _level = -1, SequenceNumber _largest_seqno = 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the defaults? I don't see a scenario we should do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean defaults of all the parameters?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't care level, but I don't think it makes sense to have a default for largest_seqno. I don't care whether you reorder to two so you can keep the level default, or remove defaults for both of level and largest_seqno.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Adding an overloaded constructor. Another option is to provide a setter (not implemented).

@@ -697,6 +697,8 @@ SequenceNumber GetGlobalSequenceNumber(const TableProperties& table_properties,
}

SequenceNumber global_seqno = DecodeFixed64(seqno_pos->second.c_str());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that the existing code has no verification that seqno_pos is valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

rcane pushed a commit to rcane/rocksdb that referenced this pull request Sep 13, 2018
Summary:
RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the
`global_seqno`. Since random write is not supported in some non-POSIX compliant
file systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update `global_seqno` during
file ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.
Pull Request resolved: facebook#4172

Differential Revision: D8961465

Pulled By: riversand963

fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
dt added a commit to dt/cockroach that referenced this pull request Feb 14, 2019
This uses he new flag added in
facebook/rocksdb#4172 to avoid the global_seq_no
edits and the SST copying they forced us to do to preserve the raft log.

Passing this flag has to be gated on a cluster version that is defined
well after the upgrade to rocks 5.17 -- using the flag requires that
older versions of RocksDB (< 5.16) will never be used to read, as they
will still be looking for and using the seq_no in the file.

Release note (performance improvement): reduce write-amplification during bulk-loading (IMPORT and RESTORE)
dt added a commit to dt/cockroach that referenced this pull request Feb 21, 2019
This uses he new flag added in
facebook/rocksdb#4172 to avoid the global_seq_no
edits and the SST copying they forced us to do to preserve the raft log.

Passing this flag has to be gated on a cluster version that is defined
well after the upgrade to rocks 5.17 -- using the flag requires that
older versions of RocksDB (< 5.16) will never be used to read, as they
will still be looking for and using the seq_no in the file.

Release note (performance improvement): reduce write-amplification during bulk-loading (IMPORT and RESTORE)
dt added a commit to dt/cockroach that referenced this pull request Feb 22, 2019
This uses he new flag added in
facebook/rocksdb#4172 to avoid the global_seq_no
edits and the SST copying they forced us to do to preserve the raft log.

Passing this flag has to be gated on a cluster version that is defined
well after the upgrade to rocks 5.17 -- using the flag requires that
older versions of RocksDB (< 5.16) will never be used to read, as they
will still be looking for and using the seq_no in the file.

Release note (performance improvement): reduce write-amplification during bulk-loading (IMPORT and RESTORE)
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Feb 25, 2019
34806: util/mon: Downgrade "bytes usage increases" log message r=knz a=bdarnell

This message can be logged frequently in certain workloads, and its
appearance in the info log is not actionable since there is no way to
tie it back to the query that caused it. I see no reason to have
these messages enabled by default.

Release note: None

34886: storage/engine,libroach: 1x write amplification on ingest sst r=dt a=dt

This uses the new flag added in facebook/rocksdb#4172 to avoid needing to make global_seq_no-related edits to SSTs, and thus avoid the SST copying those edits forced us to do to preserve the raft log immutability when hard-linking directly to side load SSTs.

This is gated on a cluster version that is defined well after the upgrade to rocks 5.17, since using the flag requires that no older versions of RocksDB (<5.16),
will ever be used to read these SSTs, as they will lack the seq_no that the older versions need for correctness.

Release note (performance improvement): reduce write-amplification during bulk-loading (IMPORT and RESTORE)

34902: sql: ensure spans get anonymized in plan collection r=knz a=knz

Fixes #34889.

Release note (bug fix): The logical plans collected for display in the
web UI now also hide the details of which key ranges are scanned in
table lookups.

Co-authored-by: Ben Darnell <ben@bendarnell.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
facebook-github-bot pushed a commit that referenced this pull request Jun 21, 2019
Summary:
It it not safe to assume application had sync the SST file before ingest it into DB. Also the directory to put the ingested file needs to be fsync, otherwise the file can be lost. For integrity of RocksDB we need to sync the ingested file and directory before apply the change to manifest.

Also syncing after writing global sequence when write_global_seqno=true was removed in #4172. Adding it back.

Fixes #5287.
Pull Request resolved: #5435

Test Plan:
Test ingest file with ldb command and observe fsync/fdatasync in strace output. Tried both move_files=true and move_files=false.
https://gist.github.com/yiwu-arbug/650a4023f57979056d83485fa863bef9

More test suggestions are welcome.

Differential Revision: D15941675

Pulled By: riversand963

fbshipit-source-id: 389533f3923065a96df2cdde23ff4724a1810d78
merryChris pushed a commit to merryChris/rocksdb that referenced this pull request Nov 18, 2019
Summary:
It it not safe to assume application had sync the SST file before ingest it into DB. Also the directory to put the ingested file needs to be fsync, otherwise the file can be lost. For integrity of RocksDB we need to sync the ingested file and directory before apply the change to manifest.

Also syncing after writing global sequence when write_global_seqno=true was removed in facebook#4172. Adding it back.

Fixes facebook#5287.
Pull Request resolved: facebook#5435

Test Plan:
Test ingest file with ldb command and observe fsync/fdatasync in strace output. Tried both move_files=true and move_files=false.
https://gist.github.com/yiwu-arbug/650a4023f57979056d83485fa863bef9

More test suggestions are welcome.

Differential Revision: D15941675

Pulled By: riversand963

fbshipit-source-id: 389533f3923065a96df2cdde23ff4724a1810d78
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants