Permalink
Browse files

Log major key events to be able to quickly recover HT state.

This change introduces a new on-disk binary format for logging major
key events only.  These events are limited to the creation and
destruction of keys only -- modification of keys is out of scope, as
is storing values.

* Flow

The logging hooks in to the persistence flow and does the following:

After writing any new key or deleting any existing key from disk, a
new log entry is buffered.

The buffer is written when a block is full or on a flush event
(configurable, usually after a commit1 or commit2 event).

Before sending a commit to the underlying store, a commit1 event is
logged.

After the underlying store completes its commit, a commit2 is logged.

* Configuration

There are four new engine parameters that come with this feature:

** klog_path

Where the events should be logged.  An empty string (default) disables
logging.

** klog_block_size

The buffer/block size for log entries.  The number should line up with
the underlying filesystem block size.  Multiples may increase
throughput.

** klog_flush

Configures when the buffer should be force-flushed.  There are four
possible values:

   off: never force a flush
   commit1: force a flush after commit1 only
   commit2: force a flush after commit2 only
   full: force a flush after both commit1 and commit2

** klog_sync

Configures when the file should be fsynced.  There are four
possible values:

   off: never fsync
   commit1: fsync after commit1 only
   commit2: fsync after commit2 only
   full: fsync after both commit1 and commit2

Change-Id: Icdfcb4ae2da129144390e199ef794d85cd8246da
Reviewed-on: http://review.couchbase.org/12494
Reviewed-by: Chiyoung Seo <chiyoung.seo@gmail.com>
Tested-by: Chiyoung Seo <chiyoung.seo@gmail.com>
Reviewed-by: Trond Norbye <trond.norbye@gmail.com>
  • Loading branch information...
1 parent 4cbba4a commit b7075ce6da292eb8969ce2765e656e37c4e3f249 @dustin dustin committed Dec 19, 2011
Showing with 1,694 additions and 13 deletions.
  1. +19 −0 Makefile.am
  2. +32 −0 configuration.json
  3. +124 −0 crc32.c
  4. +8 −0 crc32.h
  5. +6 −0 docs/engine-params.org
  6. +99 −0 docs/klog.org
  7. +14 −0 docs/stats.org
  8. +93 −8 ep.cc
  9. +15 −2 ep.hh
  10. +30 −1 ep_engine.cc
  11. +1 −0 ep_engine.h
  12. +1 −0 ep_testsuite.cc
  13. +2 −1 flusher.cc
  14. +1 −0 flusher.hh
  15. +36 −1 management/cbstats
  16. +510 −0 mutation_log.cc
  17. +379 −0 mutation_log.hh
  18. +322 −0 t/mutation_log_test.cc
  19. +2 −0 win32/Makefile.mingw
View
@@ -66,6 +66,7 @@ ep_la_SOURCES = \
command_ids.h \
common.hh \
config_static.h \
+ crc32.c crc32.h \
dispatcher.cc dispatcher.hh \
ep.cc ep.hh \
ep_engine.cc ep_engine.h \
@@ -79,6 +80,7 @@ ep_la_SOURCES = \
item_pager.cc item_pager.hh \
kvstore.hh \
locks.hh \
+ mutation_log.cc mutation_log.hh \
mutex.cc mutex.hh \
priority.cc priority.hh \
queueditem.cc queueditem.hh \
@@ -201,6 +203,7 @@ check_PROGRAMS=\
histo_test \
hrtime_test \
misc_test \
+ mutation_log_test \
mutex_test \
pathexpand_test \
priority_test \
@@ -335,6 +338,14 @@ checkpoint_test_DEPENDENCIES = checkpoint.hh vbucket.hh \
libobjectregistry.la libconfiguration.la
checkpoint_test_LDADD = libobjectregistry.la libconfiguration.la
+mutation_log_test_CXXFLAGS = $(AM_CXXFLAGS) -I$(top_srcdir) ${NO_WERROR}
+mutation_log_test_SOURCES = t/mutation_log_test.cc mutation_log.hh \
+ testlogger.cc mutation_log.cc \
+ byteorder.h byteorder.c \
+ crc32.h crc32.c
+mutation_log_test_DEPENDENCIES = mutation_log.hh
+mutation_log_test_LDADD =
+
hrtime_test_CXXFLAGS = $(AM_CXXFLAGS) -I$(top_srcdir) ${NO_WERROR}
hrtime_test_SOURCES = t/hrtime_test.cc common.hh
@@ -368,6 +379,7 @@ checkpoint_test_SOURCES += gethrtime.c
management_cbdbconvert_SOURCES += gethrtime.c
ep_testsuite_la_SOURCES += gethrtime.c
hash_table_test_SOURCES += gethrtime.c
+mutation_log_test_SOURCES += gethrtime.c
endif
if BUILD_BYTEORDER
@@ -474,6 +486,7 @@ CLEANFILES += ep_la-probes.o ep_la-probes.lo \
.libs/cddbconvert-probes.o .libs/cddbconvert-probes.o \
.libs/atomic_ptr_test-probes.o \
.libs/checkpoint_test-probes.o \
+ .libs/mutation_test-probes.o \
.libs/dispatcher_test-probes.o \
.libs/hash_table_test-probes.o \
.libs/vbucket_test-probes.o \
@@ -528,6 +541,12 @@ ep_testsuite_la-probes.lo: $(ep_testsuite_la_OBJECTS) dtrace/probes.h
-s ${srcdir}/dtrace/probes.d \
$(checkpoint_test_OBJECTS)
+.libs/mutation_test-probes.o: $(mutation_test_OBJECTS) dtrace/probes.h
+ $(DTRACE) $(DTRACEFLAGS) -G \
+ -o .libs/mutation_test-probes.o \
+ -s ${srcdir}/dtrace/probes.d \
+ $(mutation_test_OBJECTS)
+
.libs/dispatcher_test-probes.o: $(dispatcher_test_OBJECTS) dtrace/probes.h
$(DTRACE) $(DTRACEFLAGS) -G \
-o .libs/dispatcher_test-probes.o \
View
@@ -162,6 +162,38 @@
"descr": "True if we want to keep the closed checkpoints for each vbucket unless the memory usage is above high water mark",
"type": "bool"
},
+ "klog_block_size": {
+ "default": "4096",
+ "descr": "Logging block size.",
+ "type": "size_t"
+ },
+ "klog_flush": {
+ "default": "commit2",
+ "descr": "When to flush the log (complete current block).",
+ "enum": [
+ "off",
+ "commit1",
+ "commit2",
+ "full"
+ ],
+ "type": "std::string"
+ },
+ "klog_path": {
+ "default": "",
+ "descr": "Path to the mutation key log.",
+ "type": "std::string"
+ },
+ "klog_sync": {
+ "default": "commit2",
+ "descr": "When to sync the log.",
+ "enum": [
+ "off",
+ "commit1",
+ "commit2",
+ "full"
+ ],
+ "type": "std::string"
+ },
"max_checkpoints": {
"default": "2",
"type": "size_t"
View
@@ -0,0 +1,124 @@
+/* Crc - 32 BIT ANSI X3.66 CRC checksum files */
+
+#include <stdio.h>
+#include "crc32.h"
+
+/**********************************************************************\
+|* Demonstration program to compute the 32-bit CRC used as the frame *|
+|* check sequence in ADCCP (ANSI X3.66, also known as FIPS PUB 71 *|
+|* and FED-STD-1003, the U.S. versions of CCITT's X.25 link-level *|
+|* protocol). The 32-bit FCS was added via the Federal Register, *|
+|* 1 June 1982, p.23798. I presume but don't know for certain that *|
+|* this polynomial is or will be included in CCITT V.41, which *|
+|* defines the 16-bit CRC (often called CRC-CCITT) polynomial. FIPS *|
+|* PUB 78 says that the 32-bit FCS reduces otherwise undetected *|
+|* errors by a factor of 10^-5 over 16-bit FCS. *|
+\**********************************************************************/
+
+/* Need an unsigned type capable of holding 32 bits; */
+
+/* Copyright (C) 1986 Gary S. Brown. You may use this program, or
+ code or tables extracted from it, as desired without restriction.*/
+
+/* First, the polynomial itself and its table of feedback terms. The */
+/* polynomial is */
+/* X^32+X^26+X^23+X^22+X^16+X^12+X^11+X^10+X^8+X^7+X^5+X^4+X^2+X^1+X^0 */
+/* Note that we take it "backwards" and put the highest-order term in */
+/* the lowest-order bit. The X^32 term is "implied"; the LSB is the */
+/* X^31 term, etc. The X^0 term (usually shown as "+1") results in */
+/* the MSB being 1. */
+
+/* Note that the usual hardware shift register implementation, which */
+/* is what we're using (we're merely optimizing it by doing eight-bit */
+/* chunks at a time) shifts bits into the lowest-order term. In our */
+/* implementation, that means shifting towards the right. Why do we */
+/* do it this way? Because the calculated CRC must be transmitted in */
+/* order from highest-order term to lowest-order term. UARTs transmit */
+/* characters in order from LSB to MSB. By storing the CRC this way, */
+/* we hand it to the UART in the order low-byte to high-byte; the UART */
+/* sends each low-bit to hight-bit; and the result is transmission bit */
+/* by bit from highest- to lowest-order term without requiring any bit */
+/* shuffling on our part. Reception works similarly. */
+
+/* The feedback terms table consists of 256, 32-bit entries. Notes: */
+/* */
+/* 1. The table can be generated at runtime if desired; code to do so */
+/* is shown later. It might not be obvious, but the feedback */
+/* terms simply represent the results of eight shift/xor opera- */
+/* tions for all combinations of data and CRC register values. */
+/* */
+/* 2. The CRC accumulation logic is the same for all CRC polynomials, */
+/* be they sixteen or thirty-two bits wide. You simply choose the */
+/* appropriate table. Alternatively, because the table can be */
+/* generated at runtime, you can start by generating the table for */
+/* the polynomial in question and use exactly the same "updcrc", */
+/* if your application needn't simultaneously handle two CRC */
+/* polynomials. (Note, however, that XMODEM is strange.) */
+/* */
+/* 3. For 16-bit CRCs, the table entries need be only 16 bits wide; */
+/* of course, 32-bit entries work OK if the high 16 bits are zero. */
+/* */
+/* 4. The values must be right-shifted by eight bits by the "updcrc" */
+/* logic; the shift must be unsigned (bring in zeroes). On some */
+/* hardware you could probably optimize the shift in assembler by */
+/* using byte-swap instructions. */
+
+static uint32_t crc_32_tab[] = { /* CRC polynomial 0xedb88320 */
+ 0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
+ 0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
+ 0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
+ 0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
+ 0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,
+ 0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
+ 0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,
+ 0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
+ 0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
+ 0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
+ 0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,
+ 0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
+ 0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
+ 0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
+ 0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
+ 0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
+ 0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
+ 0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
+ 0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
+ 0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
+ 0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
+ 0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
+ 0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
+ 0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
+ 0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
+ 0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
+ 0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
+ 0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
+ 0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
+ 0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
+ 0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
+ 0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
+ 0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
+ 0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
+ 0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
+ 0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
+ 0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
+ 0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
+ 0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
+ 0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
+ 0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
+ 0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
+ 0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
+};
+
+#define UPDC32(octet, crc) (crc_32_tab[((crc) ^ (octet)) & 0xff] ^ ((crc) >> 8))
+
+uint32_t crc32buf(uint8_t *buf, size_t len) {
+ register uint32_t oldcrc32;
+
+ oldcrc32 = 0xFFFFFFFF;
+
+ for ( ; len; --len, ++buf) {
+ oldcrc32 = UPDC32(*buf, oldcrc32);
+ }
+
+ return ~oldcrc32;
+}
View
@@ -0,0 +1,8 @@
+#ifndef CRC32_H
+#define CRC32_H 1
+
+#include <stdint.h>
+
+uint32_t crc32buf(uint8_t *buf, size_t len);
+
+#endif /* CRC32_H */
@@ -108,6 +108,12 @@ memcached like this:
| | | means don't throttle. |
| tap_throttle_threshold | float | Percentage of memory in use before we |
| | | throttle tap streams |
+| klog_path | string | Path to the mutation key log. |
+| klog_block_size | int | Mutation key log block size. |
+| klog_flush | string | When to force buffer flushes during |
+| | | klog (off, commit1, commit2, full) |
+| klog_sync | string | When to fsync during klog. |
+
** Shard Patterns
View
@@ -0,0 +1,99 @@
+* Overview of the Key Mutation Log
+
+This change introduces a new on-disk binary format for logging major
+key events only. These events are limited to the creation and
+destruction of keys only -- modification of keys is out of scope, as
+is storing values.
+
+* Flow
+
+The logging hooks in to the persistence flow and does the following:
+
+After writing any new key or deleting any existing key from disk, a
+new log entry is buffered.
+
+The buffer is written when a block is full or on a flush event
+(configurable, usually after a commit1 or commit2 event).
+
+Before sending a commit to the underlying store, a commit1 event is
+logged.
+
+After the underlying store completes its commit, a commit2 is logged.
+
+* Configuration
+
+There are four new engine parameters that come with this feature:
+
+** klog_path
+
+Where the events should be logged. An empty string (default) disables
+logging.
+
+** klog_block_size
+
+The buffer/block size for log entries. The number should line up with
+the underlying filesystem block size. Multiples may increase
+throughput.
+
+** klog_flush
+
+Configures when the buffer should be force-flushed. There are four
+possible values:
+
+ off: never force a flush
+ commit1: force a flush after commit1 only
+ commit2: force a flush after commit2 only
+ full: force a flush after both commit1 and commit2
+
+** klog_sync
+
+Configures when the file should be fsynced. There are four
+possible values:
+
+ off: never fsync
+ commit1: fsync after commit1 only
+ commit2: fsync after commit2 only
+ full: fsync after both commit1 and commit2
+
+* Data Format
+
+Each file consists of a header and then an arbitrary number of blocks
+which each contain an arbitrary number of records. All fields are
+big-endian byte encoded.
+
+** Header
+
+The file begins with a header of at least 4,096 bytes long. The
+header defines some basic info about the file.
+
+- 32-bit version number (this document describes version 1)
+- 32-bit block size
+- 32-bit block count
+- k/v properties to store additional tagged config
+ - 8-bit key len
+ - 8-bit value len
+ - key bytes
+ - value bytes
+- terminated by zero-length key
+
+If the block size listed in the header is larger 4,096 bytes, the
+first block itself will be this length (assume anything extra is
+zero-padded).
+
+** Block
+
+- checksum (16-bits, IEEE crc32 & 0xffff)
+- record count (16-bits)
+- []record
+
+Block size is variable. I've been using 4k, for now. Blocks are 0
+padded at the end when we need to sync or we can't fit more entries.
+
+** Record
+
+- rowid (64-bit)
+- vbucket (16-bit)
+- magic (0x45)
+- type (8-bit)
+- key len (8-bit)
+- key ([]byte)
View
@@ -473,6 +473,9 @@ form to describe when time was spent doing various things:
| disk_commit | waiting for a commit after a batch of updates |
| disk_invalid_item_del | Waiting for disk to delete a chunk of invalid |
| | items with the old vbucket version |
+| klogPadding | Amount of wasted "padding" space in the klog. |
+| klogFlushTime | Time spent flushing the klog. |
+| klogSyncTime | Time spent syncing the klog. |
** Hash Stats
@@ -560,6 +563,17 @@ Note that tcmalloc stats are not available on some operating systems
| tcmalloc_current_thread_cache_bytes | A measure of some of the memory |
| | TCMalloc is using for small objects. |
+** Key Log
+
+Stats =klog= shows counts what's going on with the key mutation log.
+
+| size | The size of the logfile. |
+| count_new | Number of "new key" events in the log. |
+| count_del | Number of "deleted key" events in the log. |
+| count_del_all | Number of "delete all" events in the log. |
+| count_commit1 | Number of "commit1" events in the log. |
+| count_commit2 | Number of "commit2" events in the log. |
+
* Details
** Ages
Oops, something went wrong.

0 comments on commit b7075ce

Please sign in to comment.