Skip to content

Commit

Permalink
Log major key events to be able to quickly recover HT state.
Browse files Browse the repository at this point in the history
This change introduces a new on-disk binary format for logging major
key events only.  These events are limited to the creation and
destruction of keys only -- modification of keys is out of scope, as
is storing values.

* Flow

The logging hooks in to the persistence flow and does the following:

After writing any new key or deleting any existing key from disk, a
new log entry is buffered.

The buffer is written when a block is full or on a flush event
(configurable, usually after a commit1 or commit2 event).

Before sending a commit to the underlying store, a commit1 event is
logged.

After the underlying store completes its commit, a commit2 is logged.

* Configuration

There are four new engine parameters that come with this feature:

** klog_path

Where the events should be logged.  An empty string (default) disables
logging.

** klog_block_size

The buffer/block size for log entries.  The number should line up with
the underlying filesystem block size.  Multiples may increase
throughput.

** klog_flush

Configures when the buffer should be force-flushed.  There are four
possible values:

   off: never force a flush
   commit1: force a flush after commit1 only
   commit2: force a flush after commit2 only
   full: force a flush after both commit1 and commit2

** klog_sync

Configures when the file should be fsynced.  There are four
possible values:

   off: never fsync
   commit1: fsync after commit1 only
   commit2: fsync after commit2 only
   full: fsync after both commit1 and commit2

Change-Id: Icdfcb4ae2da129144390e199ef794d85cd8246da
Reviewed-on: http://review.couchbase.org/12494
Reviewed-by: Chiyoung Seo <chiyoung.seo@gmail.com>
Tested-by: Chiyoung Seo <chiyoung.seo@gmail.com>
Reviewed-by: Trond Norbye <trond.norbye@gmail.com>
  • Loading branch information
dustin committed Jan 23, 2012
1 parent 4cbba4a commit b7075ce
Show file tree
Hide file tree
Showing 19 changed files with 1,694 additions and 13 deletions.
19 changes: 19 additions & 0 deletions Makefile.am
Expand Up @@ -66,6 +66,7 @@ ep_la_SOURCES = \
command_ids.h \
common.hh \
config_static.h \
crc32.c crc32.h \
dispatcher.cc dispatcher.hh \
ep.cc ep.hh \
ep_engine.cc ep_engine.h \
Expand All @@ -79,6 +80,7 @@ ep_la_SOURCES = \
item_pager.cc item_pager.hh \
kvstore.hh \
locks.hh \
mutation_log.cc mutation_log.hh \
mutex.cc mutex.hh \
priority.cc priority.hh \
queueditem.cc queueditem.hh \
Expand Down Expand Up @@ -201,6 +203,7 @@ check_PROGRAMS=\
histo_test \
hrtime_test \
misc_test \
mutation_log_test \
mutex_test \
pathexpand_test \
priority_test \
Expand Down Expand Up @@ -335,6 +338,14 @@ checkpoint_test_DEPENDENCIES = checkpoint.hh vbucket.hh \
libobjectregistry.la libconfiguration.la
checkpoint_test_LDADD = libobjectregistry.la libconfiguration.la

mutation_log_test_CXXFLAGS = $(AM_CXXFLAGS) -I$(top_srcdir) ${NO_WERROR}
mutation_log_test_SOURCES = t/mutation_log_test.cc mutation_log.hh \
testlogger.cc mutation_log.cc \
byteorder.h byteorder.c \
crc32.h crc32.c
mutation_log_test_DEPENDENCIES = mutation_log.hh
mutation_log_test_LDADD =

hrtime_test_CXXFLAGS = $(AM_CXXFLAGS) -I$(top_srcdir) ${NO_WERROR}
hrtime_test_SOURCES = t/hrtime_test.cc common.hh

Expand Down Expand Up @@ -368,6 +379,7 @@ checkpoint_test_SOURCES += gethrtime.c
management_cbdbconvert_SOURCES += gethrtime.c
ep_testsuite_la_SOURCES += gethrtime.c
hash_table_test_SOURCES += gethrtime.c
mutation_log_test_SOURCES += gethrtime.c
endif

if BUILD_BYTEORDER
Expand Down Expand Up @@ -474,6 +486,7 @@ CLEANFILES += ep_la-probes.o ep_la-probes.lo \
.libs/cddbconvert-probes.o .libs/cddbconvert-probes.o \
.libs/atomic_ptr_test-probes.o \
.libs/checkpoint_test-probes.o \
.libs/mutation_test-probes.o \
.libs/dispatcher_test-probes.o \
.libs/hash_table_test-probes.o \
.libs/vbucket_test-probes.o \
Expand Down Expand Up @@ -528,6 +541,12 @@ ep_testsuite_la-probes.lo: $(ep_testsuite_la_OBJECTS) dtrace/probes.h
-s ${srcdir}/dtrace/probes.d \
$(checkpoint_test_OBJECTS)

.libs/mutation_test-probes.o: $(mutation_test_OBJECTS) dtrace/probes.h
$(DTRACE) $(DTRACEFLAGS) -G \
-o .libs/mutation_test-probes.o \
-s ${srcdir}/dtrace/probes.d \
$(mutation_test_OBJECTS)

.libs/dispatcher_test-probes.o: $(dispatcher_test_OBJECTS) dtrace/probes.h
$(DTRACE) $(DTRACEFLAGS) -G \
-o .libs/dispatcher_test-probes.o \
Expand Down
32 changes: 32 additions & 0 deletions configuration.json
Expand Up @@ -162,6 +162,38 @@
"descr": "True if we want to keep the closed checkpoints for each vbucket unless the memory usage is above high water mark",
"type": "bool"
},
"klog_block_size": {
"default": "4096",
"descr": "Logging block size.",
"type": "size_t"
},
"klog_flush": {
"default": "commit2",
"descr": "When to flush the log (complete current block).",
"enum": [
"off",
"commit1",
"commit2",
"full"
],
"type": "std::string"
},
"klog_path": {
"default": "",
"descr": "Path to the mutation key log.",
"type": "std::string"
},
"klog_sync": {
"default": "commit2",
"descr": "When to sync the log.",
"enum": [
"off",
"commit1",
"commit2",
"full"
],
"type": "std::string"
},
"max_checkpoints": {
"default": "2",
"type": "size_t"
Expand Down
124 changes: 124 additions & 0 deletions crc32.c
@@ -0,0 +1,124 @@
/* Crc - 32 BIT ANSI X3.66 CRC checksum files */

#include <stdio.h>
#include "crc32.h"

/**********************************************************************\
|* Demonstration program to compute the 32-bit CRC used as the frame *|
|* check sequence in ADCCP (ANSI X3.66, also known as FIPS PUB 71 *|
|* and FED-STD-1003, the U.S. versions of CCITT's X.25 link-level *|
|* protocol). The 32-bit FCS was added via the Federal Register, *|
|* 1 June 1982, p.23798. I presume but don't know for certain that *|
|* this polynomial is or will be included in CCITT V.41, which *|
|* defines the 16-bit CRC (often called CRC-CCITT) polynomial. FIPS *|
|* PUB 78 says that the 32-bit FCS reduces otherwise undetected *|
|* errors by a factor of 10^-5 over 16-bit FCS. *|
\**********************************************************************/

/* Need an unsigned type capable of holding 32 bits; */

/* Copyright (C) 1986 Gary S. Brown. You may use this program, or
code or tables extracted from it, as desired without restriction.*/

/* First, the polynomial itself and its table of feedback terms. The */
/* polynomial is */
/* X^32+X^26+X^23+X^22+X^16+X^12+X^11+X^10+X^8+X^7+X^5+X^4+X^2+X^1+X^0 */
/* Note that we take it "backwards" and put the highest-order term in */
/* the lowest-order bit. The X^32 term is "implied"; the LSB is the */
/* X^31 term, etc. The X^0 term (usually shown as "+1") results in */
/* the MSB being 1. */

/* Note that the usual hardware shift register implementation, which */
/* is what we're using (we're merely optimizing it by doing eight-bit */
/* chunks at a time) shifts bits into the lowest-order term. In our */
/* implementation, that means shifting towards the right. Why do we */
/* do it this way? Because the calculated CRC must be transmitted in */
/* order from highest-order term to lowest-order term. UARTs transmit */
/* characters in order from LSB to MSB. By storing the CRC this way, */
/* we hand it to the UART in the order low-byte to high-byte; the UART */
/* sends each low-bit to hight-bit; and the result is transmission bit */
/* by bit from highest- to lowest-order term without requiring any bit */
/* shuffling on our part. Reception works similarly. */

/* The feedback terms table consists of 256, 32-bit entries. Notes: */
/* */
/* 1. The table can be generated at runtime if desired; code to do so */
/* is shown later. It might not be obvious, but the feedback */
/* terms simply represent the results of eight shift/xor opera- */
/* tions for all combinations of data and CRC register values. */
/* */
/* 2. The CRC accumulation logic is the same for all CRC polynomials, */
/* be they sixteen or thirty-two bits wide. You simply choose the */
/* appropriate table. Alternatively, because the table can be */
/* generated at runtime, you can start by generating the table for */
/* the polynomial in question and use exactly the same "updcrc", */
/* if your application needn't simultaneously handle two CRC */
/* polynomials. (Note, however, that XMODEM is strange.) */
/* */
/* 3. For 16-bit CRCs, the table entries need be only 16 bits wide; */
/* of course, 32-bit entries work OK if the high 16 bits are zero. */
/* */
/* 4. The values must be right-shifted by eight bits by the "updcrc" */
/* logic; the shift must be unsigned (bring in zeroes). On some */
/* hardware you could probably optimize the shift in assembler by */
/* using byte-swap instructions. */

static uint32_t crc_32_tab[] = { /* CRC polynomial 0xedb88320 */
0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,
0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,
0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,
0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
};

#define UPDC32(octet, crc) (crc_32_tab[((crc) ^ (octet)) & 0xff] ^ ((crc) >> 8))

uint32_t crc32buf(uint8_t *buf, size_t len) {
register uint32_t oldcrc32;

oldcrc32 = 0xFFFFFFFF;

for ( ; len; --len, ++buf) {
oldcrc32 = UPDC32(*buf, oldcrc32);
}

return ~oldcrc32;
}
8 changes: 8 additions & 0 deletions crc32.h
@@ -0,0 +1,8 @@
#ifndef CRC32_H
#define CRC32_H 1

#include <stdint.h>

uint32_t crc32buf(uint8_t *buf, size_t len);

#endif /* CRC32_H */
6 changes: 6 additions & 0 deletions docs/engine-params.org
Expand Up @@ -108,6 +108,12 @@ memcached like this:
| | | means don't throttle. |
| tap_throttle_threshold | float | Percentage of memory in use before we |
| | | throttle tap streams |
| klog_path | string | Path to the mutation key log. |
| klog_block_size | int | Mutation key log block size. |
| klog_flush | string | When to force buffer flushes during |
| | | klog (off, commit1, commit2, full) |
| klog_sync | string | When to fsync during klog. |


** Shard Patterns

Expand Down
99 changes: 99 additions & 0 deletions docs/klog.org
@@ -0,0 +1,99 @@
* Overview of the Key Mutation Log

This change introduces a new on-disk binary format for logging major
key events only. These events are limited to the creation and
destruction of keys only -- modification of keys is out of scope, as
is storing values.

* Flow

The logging hooks in to the persistence flow and does the following:

After writing any new key or deleting any existing key from disk, a
new log entry is buffered.

The buffer is written when a block is full or on a flush event
(configurable, usually after a commit1 or commit2 event).

Before sending a commit to the underlying store, a commit1 event is
logged.

After the underlying store completes its commit, a commit2 is logged.

* Configuration

There are four new engine parameters that come with this feature:

** klog_path

Where the events should be logged. An empty string (default) disables
logging.

** klog_block_size

The buffer/block size for log entries. The number should line up with
the underlying filesystem block size. Multiples may increase
throughput.

** klog_flush

Configures when the buffer should be force-flushed. There are four
possible values:

off: never force a flush
commit1: force a flush after commit1 only
commit2: force a flush after commit2 only
full: force a flush after both commit1 and commit2

** klog_sync

Configures when the file should be fsynced. There are four
possible values:

off: never fsync
commit1: fsync after commit1 only
commit2: fsync after commit2 only
full: fsync after both commit1 and commit2

* Data Format

Each file consists of a header and then an arbitrary number of blocks
which each contain an arbitrary number of records. All fields are
big-endian byte encoded.

** Header

The file begins with a header of at least 4,096 bytes long. The
header defines some basic info about the file.

- 32-bit version number (this document describes version 1)
- 32-bit block size
- 32-bit block count
- k/v properties to store additional tagged config
- 8-bit key len
- 8-bit value len
- key bytes
- value bytes
- terminated by zero-length key

If the block size listed in the header is larger 4,096 bytes, the
first block itself will be this length (assume anything extra is
zero-padded).

** Block

- checksum (16-bits, IEEE crc32 & 0xffff)
- record count (16-bits)
- []record

Block size is variable. I've been using 4k, for now. Blocks are 0
padded at the end when we need to sync or we can't fit more entries.

** Record

- rowid (64-bit)
- vbucket (16-bit)
- magic (0x45)
- type (8-bit)
- key len (8-bit)
- key ([]byte)
14 changes: 14 additions & 0 deletions docs/stats.org
Expand Up @@ -473,6 +473,9 @@ form to describe when time was spent doing various things:
| disk_commit | waiting for a commit after a batch of updates |
| disk_invalid_item_del | Waiting for disk to delete a chunk of invalid |
| | items with the old vbucket version |
| klogPadding | Amount of wasted "padding" space in the klog. |
| klogFlushTime | Time spent flushing the klog. |
| klogSyncTime | Time spent syncing the klog. |

** Hash Stats

Expand Down Expand Up @@ -560,6 +563,17 @@ Note that tcmalloc stats are not available on some operating systems
| tcmalloc_current_thread_cache_bytes | A measure of some of the memory |
| | TCMalloc is using for small objects. |

** Key Log

Stats =klog= shows counts what's going on with the key mutation log.

| size | The size of the logfile. |
| count_new | Number of "new key" events in the log. |
| count_del | Number of "deleted key" events in the log. |
| count_del_all | Number of "delete all" events in the log. |
| count_commit1 | Number of "commit1" events in the log. |
| count_commit2 | Number of "commit2" events in the log. |

* Details

** Ages
Expand Down

0 comments on commit b7075ce

Please sign in to comment.