PGLog: store extra duplicate ops beyond the normal log entries #16172
This helps us avoid replaying non-idempotent client operations when the pg log is very short, e.g. when it has been shortened deliberately to force OSDs to use backfill rather than regular recovery. That can be advantageous because it avoids blocking I/O to objects, at the cost of a longer total time to become clean (since backfill requires scanning the objects to see what is missing).
Looking at the cache tiering test failures with high rates of messenger fault injection, the ones on master are due to having short pg logs and easily exceeding the bounds of the log for dup detection (two writes are proxied to the base tier, both end up being resent, and only the second is still in the log).
This particular failure (dup detection within a single tier) is fixed by the longer dup ops in this branch.
To fix this I think we'd need to add the hobject_t to each dup entry as well, so we can include the dup entries in the extra_reqids during promotion. This isn't ideal, since it bloats the dup information further, but I don't see a better way to fix it. Tiers are entirely different pools, so there is no one-to-one relationship between pgs and their dup logs.
Sorry for the delay in doing this officially. A few nits, and a few things to think about. Overall looks great!
Perhaps one final issue is the default values of the three key config options. Previously osd_min_pg_log_entries was 3000 and osd_max_pg_log_entries was 10000.
The PR in its current form sets osd_min_pg_log_entries to 1500, osd_max_pg_log_entries to 5000, and osd_pg_log_dups_tracked to 10000. So this means we'll be able to detect duplicate ops going back 10000 ops, as that's the count of pg log entries PLUS dup entries.
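For reference, these proposed defaults would correspond to something like the following in ceph.conf (values as stated above; operators tuning for backfill-over-recovery would lower the first two further):

```ini
[osd]
# Proposed defaults in this PR (previously 3000 / 10000, no dups option)
osd_min_pg_log_entries = 1500
osd_max_pg_log_entries = 5000
osd_pg_log_dups_tracked = 10000
```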
@kungf An idempotent operation is one that you can apply multiple times without altering the underlying data. For example, you can write the byte 0xFF at offset 20 once, twice, or more times and the result is the same.
A non-idempotent operation is one where the underlying data would be different if you applied the operation a second, third, or more times. An example would be the operation to append the byte 0xFF to the end of the data. The data is different if you do it once or twice.
In Ceph we do not want to apply the same (i.e., duplicate) operation more than once in case the op is non-idempotent. Ceph has used the pg log to detect duplicate operations, but there's a use case for one customer where we want to shorten the pg log, and then we might not detect a duplicate operation. This PR tracks ops beyond the pg log in a separate structure, so duplicate ops can still be detected.