Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Native maps crash (segfault, SIGSEGV) on empty std::vector<BigBlock>.back() in LinkedBlockAllocator.deleteLast #767

Closed
ctubbsii opened this issue Nov 14, 2018 · 2 comments

Comments

@ctubbsii
Copy link
Member

commented Nov 14, 2018

On Fedora 28 and 29, with latest OpenJDK 8, TServer processes crash due to a segmentation fault in the native libs.

In BlockAllocator.h, bigBlocks.back() fails to check for an empty vector first. This method is undefined for an empty vector, and a behavior change seems to cause this to fail with a segmentation fault.

Native stack trace from gdb:

(gdb) bt
#0  0x00007f73e095753f in raise () from /lib64/libc.so.6
#1  0x00007f73e0941895 in abort () from /lib64/libc.so.6
#2  0x00007f73dfd6ef6f in os::abort(bool) [clone .cold.54] () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-8.fc29.x86_64/jre/lib/amd64/server/libjvm.so
#3  0x00007f73e064a6b3 in VMError::report_and_die() () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-8.fc29.x86_64/jre/lib/amd64/server/libjvm.so
#4  0x00007f73e043f4c6 in JVM_handle_linux_signal () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-8.fc29.x86_64/jre/lib/amd64/server/libjvm.so
#5  0x00007f73e04324ec in signalHandler(int, siginfo_t*, void*) () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-8.fc29.x86_64/jre/lib/amd64/server/libjvm.so
#6  <signal handler called>
#7  LinkedBlockAllocator::deleteLast (p=0x7f72f0037b61, this=0x7f72e8064e50) at /usr/include/c++/8/bits/stl_iterator.h:844
#8  Field::clear (this=<synthetic pointer>, lba=0x7f72e8064e50) at nativeMap/Field.h:127
#9  NativeMap::startUpdate (row=<synthetic pointer>..., this=0x7f72e8064e00) at nativeMap/NativeMap.h:144
#10 NativeMap::startUpdate (r=<optimized out>, env=<optimized out>, this=0x7f72e8064e00) at nativeMap/NativeMap.h:129
#11 Java_org_apache_accumulo_tserver_NativeMap_startUpdate (env=<optimized out>, cls=<optimized out>, nm=140131495726592, r=<optimized out>) at nativeMap/org_apache_accumulo_tserver_NativeMap.cc:51
#12 0x00007f73998b6e3a in Java_org_apache_accumulo_tserver_NativeMap_singleUpdate (env=0x7f730c00e260, cls=0x7f733a488650, nm=140131495726592, r=<optimized out>, cf=0x7f733a488698, cq=0x7f733a488690, cv=0x7f733a488688, ts=3, del=1 '\001', val=0x7f733a488668, mutationCount=6) at nativeMap/org_apache_accumulo_tserver_NativeMap.cc:45

Java stack trace for the relevant thread:

Stack: [0x00007f733a38a000,0x00007f733a48b000],  sp=0x00007f733a488450,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libaccumulo.so+0x3173]  Java_org_apache_accumulo_tserver_NativeMap_startUpdate+0x1f3

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  org.apache.accumulo.tserver.NativeMap.singleUpdate(J[B[B[B[BJZ[BI)V+0
j  org.apache.accumulo.tserver.NativeMap._mutate(Lorg/apache/accumulo/core/data/Mutation;I)I+69
j  org.apache.accumulo.tserver.NativeMap.mutate(Ljava/util/List;I)V+88
j  org.apache.accumulo.tserver.InMemoryMap$NativeMapWrapper.mutate(Ljava/util/List;I)V+6
j  org.apache.accumulo.tserver.InMemoryMap$LocalityGroupMap.mutate(Ljava/util/List;I)V+62
j  org.apache.accumulo.tserver.InMemoryMap$SampleMap.mutate(Ljava/util/List;I)V+6
j  org.apache.accumulo.tserver.InMemoryMap.mutate(Ljava/util/List;)V+64
j  org.apache.accumulo.tserver.tablet.CommitSession.mutate(Ljava/util/List;)V+5
j  org.apache.accumulo.tserver.tablet.TabletMemory.mutate(Lorg/apache/accumulo/tserver/tablet/CommitSession;Ljava/util/List;)V+2
j  org.apache.accumulo.tserver.tablet.Tablet.commit(Lorg/apache/accumulo/tserver/tablet/CommitSession;Ljava/util/List;)V+62
j  org.apache.accumulo.tserver.tablet.CommitSession.commit(Ljava/util/List;)V+6
j  org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.update(Lorg/apache/accumulo/core/trace/thrift/TInfo;Lorg/apache/accumulo/core/security/thrift/TCredentials;Lorg/apache/accumulo/core/data/thrift/TKeyExtent;Lorg/apache/accumulo/core/data/thrift/TMutation;Lorg/apache/accumulo/core/tabletserver/thrift/TDurability;)V+369
v  ~StubRoutines::call_stub
j  sun.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+0
j  sun.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+100
j  sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+6
j  java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+56
j  org.apache.accumulo.core.trace.wrappers.RpcServerInvocationHandler.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object;+64
j  com.sun.proxy.$Proxy19.update(Lorg/apache/accumulo/core/trace/thrift/TInfo;Lorg/apache/accumulo/core/security/thrift/TCredentials;Lorg/apache/accumulo/core/data/thrift/TKeyExtent;Lorg/apache/accumulo/core/data/thrift/TMutation;Lorg/apache/accumulo/core/tabletserver/thrift/TDurability;)V+34
j  org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$update.getResult(Lorg/apache/accumulo/core/tabletserver/thrift/TabletClientService$Iface;Lorg/apache/accumulo/core/tabletserver/thrift/TabletClientService$update_args;)Lorg/apache/accumulo/core/tabletserver/thrift/TabletClientService$update_result;+29
j  org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$update.getResult(Ljava/lang/Object;Lorg/apache/thrift/TBase;)Lorg/apache/thrift/TBase;+9
j  org.apache.thrift.ProcessFunction.process(ILorg/apache/thrift/protocol/TProtocol;Lorg/apache/thrift/protocol/TProtocol;Ljava/lang/Object;)V+89
j  org.apache.thrift.TBaseProcessor.process(Lorg/apache/thrift/protocol/TProtocol;Lorg/apache/thrift/protocol/TProtocol;)Z+126
j  org.apache.accumulo.server.rpc.TimedProcessor.process(Lorg/apache/thrift/protocol/TProtocol;Lorg/apache/thrift/protocol/TProtocol;)Z+45
j  org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke()V+77
j  org.apache.accumulo.server.rpc.CustomNonBlockingServer$CustomFrameBuffer.invoke()V+65
j  org.apache.thrift.server.Invocation.run()V+4
j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j  org.apache.accumulo.fate.util.LoggingRunnable.run()V+4
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub

The following diff adds the explicit protection for the empty vector, and retains side-effects, but is probably not correct in context:

diff --git a/server/native/src/main/c++/nativeMap/BlockAllocator.h b/server/native/src/main/c++/nativeMap/BlockAllocator.h
index 81c14d86ee..768e19fa71 100644
--- a/server/native/src/main/c++/nativeMap/BlockAllocator.h
+++ b/server/native/src/main/c++/nativeMap/BlockAllocator.h
@@ -126,11 +126,14 @@ struct LinkedBlockAllocator {
                                blocks.back().rollback(p);
                                lastAlloc = NULL;
                                return;
-                       }else if(bigBlocks.back().ptr == p){
+                       }else if(!bigBlocks.empty() && bigBlocks.back().ptr == p){
                                memused -= (sizeof(BigBlock) + bigBlocks.back().length);
                                bigBlocks.pop_back();
                                delete((unsigned char *)p);
                                return;
+                       }else if(bigBlocks.empty()){
+                               delete((unsigned char *)p);
+                               return;
                        }
                }
@ctubbsii

This comment has been minimized.

Copy link
Member Author

commented Nov 14, 2018

FWIW, I think the code that calls this cleanup is making too many assumptions about how the allocator is used. Specifically, it assumes that we can know, for a fact, that we were the last ones to use it, even though we share it with C library code, whose implementations we cannot know for certain. Refactoring to avoid this assumption would be good, I think.

@ctubbsii

This comment has been minimized.

Copy link
Member Author

commented Nov 15, 2018

I applied the following patch to try to get a sense of what was going on:

diff --git a/server/native/src/main/c++/nativeMap/BlockAllocator.h b/server/native/src/main/c++/nativeMap/BlockAllocator.h
index 81c14d86ee..98772aebd3 100644
--- a/server/native/src/main/c++/nativeMap/BlockAllocator.h
+++ b/server/native/src/main/c++/nativeMap/BlockAllocator.h
@@ -43,6 +43,7 @@ struct Block {
 	}
 
 	void *allocate(size_t amount){
+		std::cerr<< "Block.allocate(" << amount << ")" << std::endl;
 		unsigned char *nextPos = currentPos + amount;
 		
 		if(nextPos > end){
@@ -96,6 +97,7 @@ struct LinkedBlockAllocator {
 	}
 	
 	void *allocate(size_t amount){
+      std::cerr<< "LinkedBlockAllocator.allocate(" << amount << ")" << std::endl;
 
 	        if(amount > (size_t)bigBlockSize){
 			unsigned char *p = new unsigned char[amount];
@@ -121,12 +123,13 @@ struct LinkedBlockAllocator {
 	}
 
 	void deleteLast(void *p){
+    std::cerr<< "LinkedBlockAllocator.deleteLast(" << p << ")" << std::endl;
 		if(p != NULL){
 			if(p == lastAlloc){
 				blocks.back().rollback(p);
 				lastAlloc = NULL;
 				return;
-			}else if(bigBlocks.back().ptr == p){
+			}else if(!bigBlocks.empty() && bigBlocks.back().ptr == p){
 				memused -= (sizeof(BigBlock) + bigBlocks.back().length);
 				bigBlocks.pop_back();
 				delete((unsigned char *)p);
@@ -135,7 +138,7 @@ struct LinkedBlockAllocator {
 		}
 
 		std::cerr << "Tried to delete something that was not last allocation " << p << " " << lastAlloc << std::endl;
-		exit(-1);
+		//exit(-1);
 	}	
 
 	size_t getMemoryUsed(){
@@ -146,7 +149,7 @@ struct LinkedBlockAllocator {
 	}
 	
 	~LinkedBlockAllocator(){
-		//std::cout << "Deleting " << blocks.size() << " blocks, memused : " << memused << std::endl;
+		std::cerr << "(~LinkedBlockAllocator) Deleting " << blocks.size() << " blocks, memused : " << memused << std::endl;
 		std::vector<Block>::iterator iter = blocks.begin();
 		while(iter != blocks.end()){
 		  	delete [] (iter->data);

This just does the protective check for the empty vector, stops exiting if the last allocation wasn't what was expected, and adds a tiny bit of debugging output.

The output shows a few interesting things. First, for some reason, there's allocation requests for blocks of size 0, which is weird. Second, all of the times that the deleteLast would have otherwise failed (and now prints Tried to delete something ...), are off by either 1, 2, 3, or 4 bytes, with the following distribution:

1: 8
2: 90
3: 14
4: 24

To me, this looks like some alignment issue with the allocator.

To reproduce, apply the patch above and run:

mvn clean verify -Dtest=blah -Dit.test=ShellIT -Dspotbugs.skip -Dcheckstyle.skip
for line in $(cat test/target/mini-tests/org.apache.accumulo.harness.SharedMiniClusterBase_*/logs/TabletServer_* | grep '^Tried to delete something' | cut -f10-11 -d' ' | tr ' ' '-'); do x=$(echo $line | cut -f1 -d-); y=$(echo $line | cut -f2 -d-); echo "$line: $y - $x = $(( $y - $x ))"; done

keith-turner added a commit to keith-turner/accumulo that referenced this issue Nov 15, 2018

keith-turner added a commit to keith-turner/accumulo that referenced this issue Nov 15, 2018

fixes apache#767 correct allocation issue in native maps
This fixes a bug that was found when building the native maps against
Fedora 29.  The cause of the bug was that an empty C++ map seemed to
allocate a small amount of memory, which was a new behavior.  This new
behavior violated an assumption made by the custom alloctor used by the
native map code.  The code was restructured to avoid this issue.

@asfgit asfgit closed this in df1061b Nov 15, 2018

@ctubbsii ctubbsii added this to Done in 1.9.3 Jun 14, 2019

@ctubbsii ctubbsii added this to Done in 2.0.0 Jun 14, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
2 participants
You can’t perform that action at this time.