Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

engine: TestConcurrentBatch is flaky (locally) #15604

Closed
bdarnell opened this issue May 2, 2017 · 24 comments
Closed

engine: TestConcurrentBatch is flaky (locally) #15604

bdarnell opened this issue May 2, 2017 · 24 comments
Milestone

Comments

@bdarnell
Copy link
Contributor

bdarnell commented May 2, 2017

TestConcurrentBatch has recently (in the last couple of weeks, I think) become flaky on my laptop. There are a bunch of slow batches, each taking longer than the last (this is a partial log):

W170502 12:50:40.391933 293 storage/engine/rocksdb.go:1220  batch [355017/4194314/0] commit took 12.411354136s (>500ms):
goroutine 293 [running]:
runtime/debug.Stack(0x4abc360, 0x4ff5a40, 0xc420014e80)
	/opt/local/lib/go/src/runtime/debug/stack.go:24 +0x79
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).commitInternal(0xc4203e0480, 0x4030300, 0x4be0868, 0xc42002cc00)
	/Users/bdarnell/work/cockroach/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1220 +0x14d
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).Commit(0xc4203e0480, 0xc406591100, 0x4670001, 0xc42041bfc8)
	/Users/bdarnell/work/cockroach/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1189 +0x704
github.com/cockroachdb/cockroach/pkg/storage/engine.TestConcurrentBatch.func2(0xc420346300, 0x4ffa7a0, 0xc4203e0480)
	/Users/bdarnell/work/cockroach/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb_test.go:391 +0x36
created by github.com/cockroachdb/cockroach/pkg/storage/engine.TestConcurrentBatch
	/Users/bdarnell/work/cockroach/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb_test.go:392 +0x75c
W170502 12:50:41.449260 305 storage/engine/rocksdb.go:1220  batch [355017/4194314/0] commit took 13.480834217s (>500ms):
goroutine 305 [running]:
runtime/debug.Stack(0x4abc360, 0x4ff5a40, 0xc420014e80)
	/opt/local/lib/go/src/runtime/debug/stack.go:24 +0x79
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).commitInternal(0xc4200b8900, 0x0, 0x0, 0x0)
	/Users/bdarnell/work/cockroach/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1220 +0x14d
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).Commit(0xc4200b8900, 0x0, 0x0, 0x0)
	/Users/bdarnell/work/cockroach/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1189 +0x704
github.com/cockroachdb/cockroach/pkg/storage/engine.TestConcurrentBatch.func2(0xc420346300, 0x4ffa7a0, 0xc4200b8900)
	/Users/bdarnell/work/cockroach/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb_test.go:391 +0x36
created by github.com/cockroachdb/cockroach/pkg/storage/engine.TestConcurrentBatch
	/Users/bdarnell/work/cockroach/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb_test.go:392 +0x75c
I170502 12:50:41.449954 211 storage/engine/rocksdb.go:447  closing rocksdb instance at "/var/folders/xw/y575gcfd7wx30kskwzd6p2qr0000gn/T/TestConcurrentBatch351385987"
--- FAIL: TestConcurrentBatch (16.33s)
	rocksdb_test.go:416: write took 13.4s

When the test passes, it takes about 4 seconds on this laptop (a macbook pro, with filevault enabled).

I think this is probably a matter of other stuff going on on the machine. It fails a large fraction of the time when I run make test and there is stuff to rebuild, but it passes reliably when run in isolation or when there is no rebuilding going on. So it's not really a problem with the test per se, but I'd like to see it be more robust against concurrent activity (or for us to figure out a way for make test to not cause so much concurrent I/O)

@bdarnell bdarnell added this to the 1.1 milestone May 2, 2017
@a-robinson
Copy link
Contributor

Were you seeing this even before disk syncing got turned on by default in #15366?

@petermattis
Copy link
Collaborator

TestConcurrentBatch does not perform sync commits. And most tests use in-memory stores.

@andreimatei
Copy link
Contributor

FWIW, this has been happening to me too!

@tbg
Copy link
Member

tbg commented May 6, 2017

Happened to me too:

--- FAIL: TestConcurrentBatch (22.18s)
	rocksdb_test.go:416: write took 17.7s

That's... a very long write. I wasn't doing anything noteworthy other than make test. Maybe there's something in the weeds here?

@petermattis
Copy link
Collaborator

Hmm, I'm seeing the slow batches as well. I don't recall seeing these in the past. Time to bisect.

@petermattis
Copy link
Collaborator

The finger is pointing at 4d72e12:

Author: Nikhil Benesch <nikhil.benesch@gmail.com>
Date:   Fri Apr 7 16:39:06 2017 -0400

    build: minimize cgo's role in the build

    This commit excises cgo role in building our dependencies, limiting it
    to the link phase only. The c-protobuf, c-jemalloc, c-snappy, and
    c-rocksdb packages are removed in favor of Make rules that use these
    packages' native build systems. This has a few implications:

      1. We no longer need to maintain platform-specific headers for each
         platform we support; we can rely on our dependency's build systems
         to do feature detection as necessary.

      2. Windows needs no special treatment.

      3. We no longer require the parallel cgo patch for fast compliation
         of C/C++ dependencies.

      4. We, unfortunately, now have a build dependency on CMake. We could
         instead vendor CMake itself, but building CMake from source takes
         about five minutes on my machine and will have serious
         implications for CI times. Binary distributions of CMake are
         readily available, and the specific version of CMake used is not
         particularly critical, as it is with the other C dependencies we
         vendor.

      5. We're no longer compatible with `go get`; you *must* bootstrap
         using Make.

    This also adds Windows binaries to the usual deployment scripts.

    Closes https://github.com/cockroachlabs/support/issues/33.
    Closes https://github.com/cockroachdb/cockroach/issues/14406.
~/Development/go/src/github.com/cockroachdb/cockroach/pkg (4d72e12...)|BISECTING make test PKG=./storage/engine/ TESTS=TestConcurrentBatch TESTFLAGS='-v -count 10' 2>&1 | grep '^--- PASS'
--- PASS: TestConcurrentBatch (3.12s)
--- PASS: TestConcurrentBatch (3.11s)
--- PASS: TestConcurrentBatch (3.08s)
--- PASS: TestConcurrentBatch (3.11s)
--- PASS: TestConcurrentBatch (3.06s)
--- PASS: TestConcurrentBatch (3.05s)
--- PASS: TestConcurrentBatch (3.05s)
--- PASS: TestConcurrentBatch (3.09s)
--- PASS: TestConcurrentBatch (3.10s)
--- PASS: TestConcurrentBatch (3.11s)

~/Development/go/src/github.com/cockroachdb/cockroach/pkg (ef10b4d...)|BISECTING make test PKG=./storage/engine/ TESTS=TestConcurrentBatch TESTFLAGS='-v -count 10' 2>&1 | grep '^--- PASS'
--- PASS: TestConcurrentBatch (1.50s)
--- PASS: TestConcurrentBatch (1.49s)
--- PASS: TestConcurrentBatch (1.46s)
--- PASS: TestConcurrentBatch (1.28s)
--- PASS: TestConcurrentBatch (1.45s)
--- PASS: TestConcurrentBatch (1.45s)
--- PASS: TestConcurrentBatch (1.45s)
--- PASS: TestConcurrentBatch (2.01s)
--- PASS: TestConcurrentBatch (1.31s)
--- PASS: TestConcurrentBatch (1.45s)

@benesch Did we change compiler flags in the move away from the c-* repos?

@petermattis
Copy link
Collaborator

First suspicion was that we're now passing -DWITH_SSE42=OFF for non-release builds. That is not the problem.

@tamird
Copy link
Contributor

tamird commented May 15, 2017 via email

@benesch
Copy link
Contributor

benesch commented May 15, 2017

Yeah, that commit set WITH_SSE42=ON.

@petermattis
Copy link
Collaborator

@tamird See my message (which crossed yours in the ether). I tried -DWITH_SSE42=ON and just removing that line completely. No love.

@benesch
Copy link
Contributor

benesch commented May 15, 2017

I'd be terrified if this were the problem but... what if you pass -DCMAKE_BUILD_TYPE=Release to RocksDB? @tamird is the default for that Debug or Release?

@tamird
Copy link
Contributor

tamird commented May 15, 2017

I don't think that's it, CMAKE_BUILD_TYPE is empty unless set, and RocksDB's CMakeLists checks for if(NOT CMAKE_BUILD_TYPE STREQUAL "Debug").

@benesch
Copy link
Contributor

benesch commented May 15, 2017

Here's a sample compiler invocation while compiling RocksDB on 4d72e12.

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++  -DJEMALLOC_NO_DEMANGLE -DOS_MACOSX -DROCKSDB_JEMALLOC -DROCKSDB_LIB_IO_POSIX -DROCKSDB_PLATFORM_POSIX -DSNAPPY -I/Users/benesch/go/native/x86_64-apple-darwin16.5.0/jemalloc/include -I/Users/benesch/go/src/github.com/cockroachdb/cockroach/c-deps/snappy.src -I/Users/benesch/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src -I/Users/benesch/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/include -isystem /Users/benesch/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/third-party/gtest-1.7.0/fused-src  -msse4.2 -W -Wextra -Wall -Wsign-compare -Wshadow -Wno-unused-parameter -Wno-unused-variable -Woverloaded-virtual -Wnon-virtual-dtor -Wno-missing-field-initializers -std=c++11 -O2 -fno-omit-frame-pointer -momit-leaf-frame-pointer -Werror -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk -mmacosx-version-min=10.12.4   -o CMakeFiles/rocksdb.dir/utilities/compaction_filters/remove_emptyvalue_compactionfilter.cc.o -c /Users/benesch/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/utilities/compaction_filters/remove_emptyvalue_compactionfilter.cc

...it's missing -DNDEBUG. Could that be it?

@petermattis
Copy link
Collaborator

petermattis commented May 15, 2017 via email

@benesch
Copy link
Contributor

benesch commented May 15, 2017

Oops, yes it is.

Latest master:

$ make test PKG=./pkg/storage/engine/ TESTS=TestConcurrentBatch TESTFLAGS='-v -count 10' 2>&1 | grep '^--- PASS'
--- PASS: TestConcurrentBatch (3.53s)
--- PASS: TestConcurrentBatch (3.62s)
--- PASS: TestConcurrentBatch (3.60s)
--- PASS: TestConcurrentBatch (3.66s)
--- PASS: TestConcurrentBatch (3.65s)
--- PASS: TestConcurrentBatch (3.69s)
--- PASS: TestConcurrentBatch (3.61s)
--- PASS: TestConcurrentBatch (3.63s)
--- PASS: TestConcurrentBatch (3.58s)
--- PASS: TestConcurrentBatch (3.65s)

With the following patch:

diff --git a/build/common.mk b/build/common.mk
index 71c7d8b56..6b6da7ef5 100644
--- a/build/common.mk
+++ b/build/common.mk
@@ -380,7 +380,8 @@ $(ROCKSDB_DIR)/Makefile: $(C_DEPS_DIR)/rocksdb-rebuild $(ROCKSDB_SRC_DIR)/.extra
 	cd $(ROCKSDB_DIR) && cmake $(CMAKE_FLAGS) $(ROCKSDB_SRC_DIR) \
 	  $(if $(findstring release,$(TYPE)),,-DWITH_$(if $(findstring mingw,$(TARGET_TRIPLE)),AVX2,SSE42)=OFF) \
 	  -DSNAPPY_LIBRARIES=$(SNAPPY_DIR)/.libs/libsnappy.a -DSNAPPY_INCLUDE_DIR=$(SNAPPY_SRC_DIR) -DWITH_SNAPPY=ON \
-	  $(if $(USE_STDMALLOC),,-DJEMALLOC_LIBRARIES=$(JEMALLOC_DIR)/lib/libjemalloc.a -DJEMALLOC_INCLUDE_DIR=$(JEMALLOC_DIR)/include -DWITH_JEMALLOC=ON)
+	  $(if $(USE_STDMALLOC),,-DJEMALLOC_LIBRARIES=$(JEMALLOC_DIR)/lib/libjemalloc.a -DJEMALLOC_INCLUDE_DIR=$(JEMALLOC_DIR)/include -DWITH_JEMALLOC=ON) \
+		-DCMAKE_CXX_FLAGS=-DNDEBUG
 
 $(SNAPPY_DIR)/Makefile: $(C_DEPS_DIR)/snappy-rebuild $(SNAPPY_SRC_DIR)/.extracted
 	mkdir -p $(SNAPPY_DIR)
diff --git a/c-deps/rocksdb-rebuild b/c-deps/rocksdb-rebuild
index 81dc26025..621a785ba 100644
--- a/c-deps/rocksdb-rebuild
+++ b/c-deps/rocksdb-rebuild
@@ -1,4 +1,4 @@
 Bump the version below when changing rocksdb CMake flags. Search for "BUILD
 ARTIFACT CACHING" in build/common.mk for rationale.
 
-1
+2
$ make test PKG=./pkg/storage/engine/ TESTS=TestConcurrentBatch TESTFLAGS='-v -count 10' 2>&1 | grep '^--- PASS'
--- PASS: TestConcurrentBatch (1.55s)
--- PASS: TestConcurrentBatch (2.30s)
--- PASS: TestConcurrentBatch (1.53s)
--- PASS: TestConcurrentBatch (1.55s)
--- PASS: TestConcurrentBatch (1.56s)
--- PASS: TestConcurrentBatch (1.54s)
--- PASS: TestConcurrentBatch (1.56s)
--- PASS: TestConcurrentBatch (1.52s)
--- PASS: TestConcurrentBatch (1.54s)
--- PASS: TestConcurrentBatch (1.54s)

Welp. I guess that'll make for a nice performance boost in 1.0.1.

@benesch
Copy link
Contributor

benesch commented May 15, 2017

@petermattis want me to prepare a PR with that patch?

@petermattis
Copy link
Collaborator

@benesch Yes.

@petermattis
Copy link
Collaborator

I'm curious why that isn't the default when using cmake.

@benesch
Copy link
Contributor

benesch commented May 15, 2017

The CMake build system was historically Windows-only, so it's rather undertested on not-Windows. Have you seen https://github.com/cockroachdb/cockroach/blob/master/c-deps/rocksdb-0014-cmake-o2.patch?

The good news is the CMake build is now a first-class citizen on all platforms. I'll be sending a patch upstream, too, to fix this for everyone.

@petermattis
Copy link
Collaborator

This also explains why we've seen a handful of assertion failures from RocksDB. Guess I should have paid more attention to that.

@benesch
Copy link
Contributor

benesch commented May 15, 2017

Yup, it would. That's worrying for a rather different reason, though, no?

@petermattis
Copy link
Collaborator

Depends on the assertion. The ones that fired could be invalid. I seem to recall one in the skiplist code that I convinced myself wasn't a problem.

@petermattis
Copy link
Collaborator

Also:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine master benchstat out.old out.new
name                                   old time/op    new time/op    delta
MVCCBatchPut_RocksDB/1x10Bytes-8         7.32µs ± 6%    5.46µs ± 2%  -25.43%  (p=0.000 n=10+10)
MVCCBatchPut_RocksDB/100x10Bytes-8       3.82µs ± 4%    2.33µs ± 2%  -39.09%  (p=0.000 n=10+10)
MVCCBatchPut_RocksDB/10000x10Bytes-8     4.35µs ± 5%    2.55µs ± 3%  -41.47%  (p=0.000 n=10+10)
MVCCBatchPut_RocksDB/100000x10Bytes-8    4.96µs ±15%    2.89µs ± 7%  -41.75%  (p=0.000 n=10+10)

@benesch
Copy link
Contributor

benesch commented May 15, 2017

🤦‍♂️

benesch added a commit to benesch/cockroach that referenced this issue May 16, 2017
In the transition to the new Make-based build system for C dependencies
(4d72e12), RocksDB assertions were inadvertently enabled for all builds.
Restore the old behavior, which only enabled assertions for race builds
(cockroachdb/c-rocksdb#27).

Fixes cockroachdb#15604.
benesch added a commit to benesch/cockroach that referenced this issue May 16, 2017
In the transition to the new Make-based build system for C dependencies
(4d72e12), RocksDB assertions were inadvertently enabled for all builds.
Restore the old behavior, which only enabled assertions for race builds
(cockroachdb/c-rocksdb#27).

Fixes cockroachdb#15604.
mberhault pushed a commit that referenced this issue May 16, 2017
In the transition to the new Make-based build system for C dependencies
(4d72e12), RocksDB assertions were inadvertently enabled for all builds.
Restore the old behavior, which only enabled assertions for race builds
(cockroachdb/c-rocksdb#27).

Fixes #15604.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants