PGLog: store extra duplicate ops beyond the normal log entries #16172

Merged
merged 2 commits into ceph:master from ivancich:wip-dup-ops on Jul 27, 2017

Conversation

4 participants
@ivancich
Member

ivancich commented Jul 6, 2017

This helps us avoid replaying non-idempotent client operations when the pg log is very short, e.g. in an effort to force OSDs to use backfill rather than regular recovery. This can be advantageous to avoid blocking i/o to objects, at the cost of longer total time to become clean (since backfill requires scanning the objects to see what is missing).
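
For orientation, a rough sketch of the data-structure idea, assuming a reqid field for matching resent client ops (the actual fields appear in the diff excerpts further down):

// compact per-op record kept beyond the trimmed pg log, used only for
// duplicate-op detection; much smaller than a full pg_log_entry_t
struct pg_log_dup_t_sketch {
  osd_reqid_t reqid;        // client request id a resent op is matched against (assumed)
  eversion_t version;
  version_t user_version;   // the user version for this entry
  int32_t return_code;      // only stored for ERRORs, for dup detection
};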

@ivancich ivancich added the core label Jul 6, 2017

@ivancich ivancich requested review from gregsfortytwo and jdurgin Jul 6, 2017

@ivancich


Member

ivancich commented Jul 6, 2017

@gregsfortytwo is thinking about a better way to test the dup-ops functionality. Shouldn't merge until better testing is complete.

dout(10) << "merge_log result " << log << " " << missing << " changed=" << changed << dendl;
// now handle dups
merge_log_dups(olog, changed);


@jdurgin

jdurgin Jul 7, 2017

Member

should set dirty_dups here if (changed), so they are persisted when write_log_and_missing() is called
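
A minimal sketch of that suggestion, assuming dirty_dups is a plain bool member:

// now handle dups
merge_log_dups(olog, changed);
if (changed) {
  dirty_dups = true;  // assumed flag; persisted later by write_log_and_missing()
}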

@@ -974,7 +1042,12 @@ struct PGLog : DoutPrefixProvider {
rollbacker,
this);
}
void merge_log_dups(pg_log_t& olog,
bool& changed);


@jdurgin

jdurgin Jul 7, 2017

Member

minor: olog could be const, and convention is to pass out parameters as pointers to make them explicit - since there's just one you could make changed the return value instead
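
Schematically, the suggested signature change would look like:

// current form (from the diff above):
void merge_log_dups(pg_log_t& olog, bool& changed);

// suggested form: const input, with "did anything change" as the return value
bool merge_log_dups(const pg_log_t& olog);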

assert(olog.dups.front().version <= log_tail_version);
#endif
auto insert_cursor = log.dups.end();


@jdurgin

jdurgin Jul 7, 2017

Member

could use splice() to simplify these two loops combining the lists (I guess that's a reason to keep olog non-const)


@jdurgin

jdurgin Jul 7, 2017

Member

nvm, probably not worth it since you need to index them anyway
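
For reference, the splice()-based combination alluded to above would move the list nodes in O(1) rather than copying them; a schematic only, not the merged code:

// move every dup entry from olog.dups into log.dups ahead of insert_cursor in a
// single constant-time operation; the moved entries would still need to be
// indexed afterwards, which is why it was judged not worth it here
log.dups.splice(insert_cursor, olog.dups);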

@@ -517,6 +581,7 @@ struct PGLog : DoutPrefixProvider {
(dirty_from != eversion_t::max()) ||
(writeout_from != eversion_t::max()) ||
!(trimmed.empty()) ||
!(trimmed_dups.empty()) ||


@jdurgin

jdurgin Jul 7, 2017

Member

should check dirty_dups
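
Concretely, that would mean adding the flag to the same disjunction, roughly:

!(trimmed.empty()) ||
!(trimmed_dups.empty()) ||
dirty_dups ||               // suggested addition; flag name taken from the review comments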

@@ -2064,6 +2066,269 @@ TEST_F(PGLogTest, ErrorNotIndexedByObject) {
EXPECT_EQ(del.reqid, entry->reqid);
}
class PGLogMergeDupsTest : public ::testing::Test, protected PGLog {


@jdurgin

jdurgin Jul 7, 2017

Member

looks like you've got extensive tests of merging - it'd be good to have some for the trimming behavior as well


@gregsfortytwo

gregsfortytwo Jul 11, 2017

Member

👍 on unit tests for trimming.

There also aren't any explicit tests that it actually reports duplicate ops. :o Is that somehow covered by existing unit tests?

@jdurgin


Member

jdurgin commented Jul 7, 2017

looks quite good overall!

@jdurgin


Member

jdurgin commented Jul 8, 2017

Looking at the cache tiering test failures with high rates of messenger fault injection, the ones on master are due to having short pg logs and easily exceeding the bounds of the log for dup detection (two writes are proxied to the base tier, they end up being resent and only the 2nd is still in the log).

This particular failure (dup detection within a single tier) is fixed by the longer dup ops in this branch.
The reason it still fails with your branch is the way dup detection information is transferred between tiers. When promoting an object via the copy-get op, the response includes limited dup detection information by reading up to the last 10 entries and including those version numbers in the 'extra_reqids' field - git grep get_object_reqids to see this. This does not include anything from the log_dup_entries, so when enough retries occur, a later op still in the pg log is included in the promotion and thus detected as a dup, while an earlier one is not included and is performed again, out of order.

To fix this I think we'd need to add the hobject_t to each dup entry as well, so we can include the dup entries in the extra_reqids during promotion. This isn't ideal, since it bloats the dup information more, but I don't see a better way to fix it. Tiers are entirely different pools, so there is no one-to-one relationship between pgs and their dup logs.
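
Schematically, the fix being described here: when building extra_reqids for a copy-get response, also fold in the dup entries whose object matches (the soid_ref field appears later in this diff, though the commit adding it was eventually dropped). The container and member names below are assumptions, not the merged code:

for (const auto& d : log.dups) {
  if (d.soid_ref && *d.soid_ref == soid)                       // dup entry for this object
    reqids.push_back(std::make_pair(d.reqid, d.user_version)); // same shape as extra_reqids
}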

@gregsfortytwo

Sorry for the delay in doing this officially. A few nits, and a few things to think about. Overall looks great!

#if 0
// make sure the logs overlap
assert(olog.dups.front().version <= log_tail_version);
#endif


@gregsfortytwo

gregsfortytwo Jul 11, 2017

Member

Going to remove this? :)

log.dups.push_back(create_dup_entry(a, b));
}
void add_dups(std::vector<pg_log_dup_t> l) {


@gregsfortytwo

gregsfortytwo Jul 11, 2017

Member

I realize it's just test code, but these should probably all be const refs instead of pass-by-value?

@@ -899,6 +899,7 @@ OPTION(osd_min_pg_log_entries, OPT_U32, 3000) // number of entries to keep in t
OPTION(osd_max_pg_log_entries, OPT_U32, 10000) // max entries, say when degraded, before we trim
OPTION(osd_force_recovery_pg_log_entries_factor, OPT_FLOAT, 1.3) // max entries factor before force recovery
OPTION(osd_pg_log_trim_min, OPT_U32, 100)
OPTION(osd_pg_log_dup_track_versions, OPT_U32, 1000) // how many versions back to track operations that could be duplicated in the pg log


@gregsfortytwo

gregsfortytwo Jul 11, 2017

Member

We should think through how this interacts with the full pg log entries a little more. With the defaults here, the 1000 dup ops are just pointless work since the pg log entries are always longer. I see a few choices:

  1. Turn the pg log entries way down when the PG is clean; only keep them around for log-based recovery when the PG is degraded. Increase dup op size accordingly.
  2. Run without dup ops by default, and expect people to turn them up when turning down the pg log sizes.
  3. Switch the config options around so that we have a "minimum tail of dup op detection" and auto-tune it so that we keep around as many dup ops as needed for that (which might be none, if the log is long enough, or if we want to "remember" the last 2000 ops and are only maintaining the last 1000 log entries, we switch them when evicting or something).


@jdurgin

jdurgin Jul 11, 2017

Member

As implemented this is a trailing number of dups stored past the end of the pg log - I like the idea of (3) though - just making this new option control total dup storage (pg log + dup log), along with decreasing the default pg log length. This is effectively (1) and (3), since there's already dynamic adjustment of how much we trim (whether to the min or max) based on being degraded or in recovery/backfill. See PrimaryLogPG::calc_trim_to().


@jdurgin

jdurgin Jul 11, 2017

Member

This is currently trailing dup entries past the end of the pg log. I like the idea of (3) - changing this option to be total dup + pg log entries, so it adjusts the dup size based on the pg log size. Then we could make it 3000 (the current min pg log length) and lower the default min pg log length to e.g. 1000 to save memory when not recovering/backfilling/degraded.

(1) is taken care of already by calc_trim_to() - the pg log is trimmed less (only to the max, 10000 by default) during recovery/degraded/backfill.
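
A minimal sketch of that accounting, under the assumption that the option becomes the total detection window (not the merged code):

// option (3): the config value covers pg log entries plus dup entries, so the
// dup tail only holds whatever the pg log does not already cover
uint32_t dups_to_keep(uint32_t dups_tracked_total, uint32_t pg_log_len) {
  return dups_tracked_total > pg_log_len ? dups_tracked_total - pg_log_len : 0;
}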


@ivancich

ivancich Jul 14, 2017

Member

I've implemented and am testing option (3). My first test had PrimaryLogPG catching an out-of-order execution, so I've added more logging to trim and will re-test.

eversion_t version;
version_t user_version; // the user version for this entry
int32_t return_code; // only stored for ERRORs for dup detection
std::shared_ptr<hobject_t> soid_ref; // since hobject_t's are big, share them


@gregsfortytwo

gregsfortytwo Jul 11, 2017

Member

I'm not so sure about storing these. I know you only did it because of the failing cache-tier tests, but it really inflates the pg_log_dup_t size (I suspect there aren't going to be many repeats in most cache-tier cases?).

Given the use cases we're targeting, I don't think it's reasonable to expect to run base or cache pools without a real pg log. How about we enforce that via reasonable minimums, instead of contorting our cool new memory-slimming data structures to support them? :)


@ivancich

ivancich Jul 14, 2017

Member

I've removed the commit that adds and manages the hobject_t's in the dup log.

@gregsfortytwo


Member

gregsfortytwo commented Jul 19, 2017

@ivancich, this is failing tests. Not sure if they're valid failures or not, but there are also a bunch of build warnings in the unit tests that make me wary.

@jdurgin


Member

jdurgin commented Jul 19, 2017

afraid I caused some conflicts with my peering-deletes branch that just merged as well

@jdurgin jdurgin closed this Jul 19, 2017

@jdurgin jdurgin reopened this Jul 19, 2017

PGLog: store extra duplicate ops beyond the normal log entries
This helps us avoid replaying non-idempotent client operations when
the pg log is very short, e.g. in an effort to force OSDs to use
backfill rather than regular recovery. This can be advantageous to
avoid blocking i/o to objects, at the cost of longer total time to
become clean (since backfill requires scanning the objects to see what
is missing).

Signed-off-by: Josh Durgin <jdurgin@redhat.com>

@ivancich ivancich changed the title from DNM PGLog: store extra duplicate ops beyond the normal log entries to PGLog: store extra duplicate ops beyond the normal log entries Jul 27, 2017

@ivancich


Member

ivancich commented Jul 27, 2017

Perhaps one final issue is the default values of the three key config options. Previously osd_min_pg_log_entries was 3000 and osd_max_pg_log_entries was 10000.

The PR in its current form sets osd_min_pg_log_entries to 1500, osd_max_pg_log_entries to 5000, and osd_pg_log_dups_tracked to 10000. This means we'll be able to detect duplicate ops going back 10000 ops, since that value counts pg log entries plus dup entries.

PGLog: continuation, store extra duplicate ops beyond the normal log entries

This helps us avoid replaying non-idempotent client operations when
the pg log is very short, e.g. in an effort to force OSDs to use
backfill rather than regular recovery. This can be advantageous to
avoid blocking i/o to objects, at the cost of longer total time to
become clean (since backfill requires scanning the objects to see what
is missing).

Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
@ivancich


Member

ivancich commented Jul 27, 2017

Following discussion in today's core stand-up, this PR now sets osd_min_pg_log_entries to 1500 (reduced), osd_max_pg_log_entries to 10000 (unchanged), and osd_pg_log_dups_tracked to 3000 (so dups are detected up to 3000 ops back, as the code did before this PR).
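
In the OPTION style used earlier in this diff, those defaults would read roughly as follows (a summary only, with values taken from the comment above):

OPTION(osd_min_pg_log_entries, OPT_U32, 1500)   // reduced from 3000
OPTION(osd_max_pg_log_entries, OPT_U32, 10000)  // unchanged
OPTION(osd_pg_log_dups_tracked, OPT_U32, 3000)  // total window: pg log entries + dup entries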

@jdurgin jdurgin merged commit c1a4877 into ceph:master Jul 27, 2017

3 of 4 checks passed

make check (arm64): running make check
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded
@kungf


Contributor

kungf commented Aug 23, 2017

Hi ivancich, what does "avoid replaying non-idempotent client operations when the pg log is very short" mean?

@ivancich


Member

ivancich commented Aug 28, 2017

@kungf An idempotent operation is one that you can apply multiple times and end up with the same underlying data as if you had applied it once. For example, you can write the byte 0xFF at offset 20 once, twice, or more times and the result is the same.

A non-idempotent operation is one where the underlying data would be different if you applied the operation a second, third, or more times. An example would be the operation to append the byte 0xFF to the end of the data. The data is different if you do it once or twice.

In Ceph we do not want to apply the same (i.e., duplicate) operation more than once in case the op is non-idempotent. Ceph has used the pg log to detect duplicate operations. But there's a use case for one customer where we want to shorten the pg log, and so we might not detect a duplicate operation. This PR tracks ops beyond the pg log in a separate structure, so duplicate ops can still be detected.
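
Schematically (the member names here are assumptions, not the actual PGLog API), a resent op is checked against both structures:

// an incoming op is a duplicate if its reqid appears either in the indexed pg
// log entries or in the extra dup-entry index this PR adds
bool is_duplicate(const osd_reqid_t& reqid) const {
  return indexed_log_entries.count(reqid) > 0 ||   // normal pg log index (assumed name)
         indexed_dup_entries.count(reqid) > 0;     // new dup-op index (assumed name)
}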

@ivancich ivancich deleted the ivancich:wip-dup-ops branch Sep 8, 2017

@kungf


Contributor

kungf commented Dec 12, 2017

@ivancich thank you very much!
