
Segfault in BlockBasedTableIterator::CheckDataBlockWithinUpperBound() using JNI #9378

Closed
dbw9580 opened this issue Jan 12, 2022 · 18 comments · Fixed by Alluxio/alluxio#14856


dbw9580 commented Jan 12, 2022

Hi, we are using RocksDB to store file system metadata in Alluxio. When retrieving this metadata from RocksDB, it occasionally segfaults with the following stack trace:

Stack: [0x00007f7aa60e1000,0x00007f7aa61e2000],  sp=0x00007f7aa61df7d8,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libc.so.6+0x16ad90]  __memcmp_sse4_1+0xcc0
C  [librocksdbjni6036254170959785193.so+0x587ff2]  rocksdb::BlockBasedTableIterator::CheckDataBlockWithinUpperBound()+0x92
C  [librocksdbjni6036254170959785193.so+0x588277]  rocksdb::BlockBasedTableIterator::InitDataBlock()+0x267
C  [librocksdbjni6036254170959785193.so+0x588d28]  rocksdb::BlockBasedTableIterator::FindBlockForward()+0x348
C  [librocksdbjni6036254170959785193.so+0x589155]  rocksdb::BlockBasedTableIterator::Next()+0xf5
C  [librocksdbjni6036254170959785193.so+0x5891d6]  rocksdb::BlockBasedTableIterator::NextAndGetResult(rocksdb::IterateResult*)+0x16
C  [librocksdbjni6036254170959785193.so+0x474371]
C  [librocksdbjni6036254170959785193.so+0x5e1a38]  rocksdb::MergingIterator::Next()+0x38
C  [librocksdbjni6036254170959785193.so+0x5ddb13]  rocksdb::MergingIterator::NextAndGetResult(rocksdb::IterateResult*)+0x13
C  [librocksdbjni6036254170959785193.so+0x3cbb74]  rocksdb::DBIter::Next()+0x284
J 5585  org.rocksdb.RocksIterator.next0(J)V (0 bytes) @ 0x00007f9735f054f0 [0x00007f9735f05440+0xb0]
J 18344 C2 alluxio.master.metastore.rocks.RocksBlockStore.getLocations(J)Ljava/util/List; (213 bytes) @ 0x00007f973729b918 [0x00007f973729aec0+0xa58]
J 19543 C2 alluxio.master.block.DefaultBlockMaster.generateBlockInfo(J)Ljava/util/Optional; (377 bytes) @ 0x00007f9737e01b40 [0x00007f9737e01580+0x5c0]
J 18270 C2 alluxio.master.block.DefaultBlockMaster.getBlockInfoList(Ljava/util/List;)Ljava/util/List; (64 bytes) @ 0x00007f9737082378 [0x00007f9737082100+0x278]
J 18317 C2 alluxio.master.file.DefaultFileSystemMaster.getFileInfoInternal(Lalluxio/master/file/meta/LockedInodePath;Lcom/codahale/metrics/Counter;)Lalluxio/wire/FileInfo; (399 bytes) @ 0x00007f9737e33940 [0x00007f9737e31160+0x27e0]
J 19536 C1 alluxio.master.file.DefaultFileSystemMaster.listStatusInternal(Lalluxio/master/file/contexts/ListStatusContext;Lalluxio/master/file/RpcContext;Lalluxio/master/file/meta/LockedInodePath;Lalluxio/master/audit/AuditContext;Lalluxio/file/options/DescendantType;Lalluxio/master/file/ResultStream;ILcom/codahale/metrics/Counter;)V (375 bytes) @ 0x00007f9737d8701c [0x00007f9737d84280+0x2d9c]
j  alluxio.master.file.DefaultFileSystemMaster.listStatusInternal(Lalluxio/master/file/contexts/ListStatusContext;Lalluxio/master/file/RpcContext;Lalluxio/master/file/meta/LockedInodePath;Lalluxio/master/audit/AuditContext;Lalluxio/file/options/DescendantType;Lalluxio/master/file/ResultStream;ILcom/codahale/metrics/Counter;)V+231
J 17052 C1 alluxio.master.file.DefaultFileSystemMaster.listStatus(Lalluxio/AlluxioURI;Lalluxio/master/file/contexts/ListStatusContext;Lalluxio/master/file/ResultStream;)V (936 bytes) @ 0x00007f9737a4c934 [0x00007f9737a440e0+0x8854]

We are using RocksDB 6.25.3.

Steps to reproduce the behavior

Currently we don't have a reproduction that doesn't involve Alluxio, so the steps to reproduce are:

  1. Start an Alluxio cluster with rocksdb as the metastore;

  2. Use an internal benchmark tool to create a heavy load on the Master, which creates 1 million blocks under one path. Then read the metadata of the blocks concurrently from 1500 client threads, each of which lists all 1 million blocks. (more details in this comment)

  3. Observe that the JVM gets SIGSEGV and killed with a core dump.

We have observed that under light load the segfault is unlikely to occur. Only under heavy load (say > 1 million blocks in one path during one benchmark run) does this segfault occur quite frequently.

Full core dumps: https://drive.google.com/drive/folders/1bcP4bMyktLURFLPGoonIrbUL-hrVxLKQ?usp=sharing


jiacheliu3 commented Jan 12, 2022

Adding to the above context:

How does Alluxio use RocksDB

From the crash stacktrace, the code throwing the segfault is from
https://github.com/Alluxio/alluxio/blob/b65655e9ba7c3b7fed1ea77e6454174921c53965/core/server/master/src/main/java/alluxio/master/metastore/rocks/RocksBlockStore.java#L160

Pasting the exact code here, hoping it helps further discussion and reference:

  @Override
  public List<BlockLocation> getLocations(long id) {
    byte[] startKey = RocksUtils.toByteArray(id, 0);
    byte[] endKey = RocksUtils.toByteArray(id, Long.MAX_VALUE);

    // Explicitly hold a reference to the ReadOptions object from the discussion in
    // https://groups.google.com/g/rocksdb/c/PwapmWwyBbc/m/ecl7oW3AAgAJ
    final ReadOptions readOptions = new ReadOptions().setIterateUpperBound(new Slice(endKey));
    try (RocksIterator iter = db().newIterator(mBlockLocationsColumn.get(), readOptions)) {
      iter.seek(startKey);
      List<BlockLocation> locations = new ArrayList<>();
      for (; iter.isValid(); iter.next()) {
        try {
          locations.add(BlockLocation.parseFrom(iter.value()));
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
      return locations;
    }
  }

Alluxio uses RocksDB to store the block metadata and associated locations. For example, "metadata" means the block length etc., and "locations" means which workers the block replicas reside on. For locations, we use RocksDB as a KV store, where the key is simply a byte[] key = concat(long blockID, long workerID).

For completeness, the code for adding and removing block locations is below, from the same RocksBlockStore. Honestly, we are not too sure about the thread safety of add/remove (the "writer" methods per se) either. We would hugely appreciate it if you happen to spot something wrong.

  @Override
  public void addLocation(long id, BlockLocation location) {
    byte[] key = RocksUtils.toByteArray(id, location.getWorkerId());
    try {
      db().put(mBlockLocationsColumn.get(), mDisableWAL, key, location.toByteArray());
    } catch (RocksDBException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void removeLocation(long blockId, long workerId) {
    byte[] key = RocksUtils.toByteArray(blockId, workerId);
    try {
      db().delete(mBlockLocationsColumn.get(), mDisableWAL, key);
    } catch (RocksDBException e) {
      throw new RuntimeException(e);
    }
  }

RocksUtils.toByteArray(long, long) just combines the two longs, namely the block ID and the corresponding worker ID (for locations, of course). https://github.com/Alluxio/alluxio/blob/b65655e9ba7c3b7fed1ea77e6454174921c53965/core/server/master/src/main/java/alluxio/master/metastore/rocks/RocksUtils.java#L45
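
For readers who don't want to chase the link, a minimal sketch of what a key encoding along these lines could look like (this is an illustration, not the actual Alluxio implementation, and the class name is hypothetical):

  import java.nio.ByteBuffer;

  public final class KeyEncoding {
    /**
     * Combines two longs into one 16-byte key: 8 bytes of block ID followed by 8 bytes
     * of worker ID. ByteBuffer writes big-endian by default, so every key for a given
     * block ID shares the same 8-byte prefix, and all of a block's locations sort
     * contiguously between toByteArray(id, 0) and toByteArray(id, Long.MAX_VALUE).
     */
    public static byte[] toByteArray(long blockId, long workerId) {
      return ByteBuffer.allocate(2 * Long.BYTES)
          .putLong(blockId)
          .putLong(workerId)
          .array();
    }
  }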

@apc999 Please correct me if I'm wrong about how Alluxio metadata is structured in RocksDB.

FYI we had a previous discussion on a similar segfault issue here. We can't tell whether our previous fix attempt didn't work or whether this is a different issue. Much appreciated if anybody could shed some light on that too!

@jiacheliu3

It would be great if you could let us know what we can usefully collect from the somewhat-frequent reproductions (anything like RocksDB logs?), or how we can debug this further (e.g. how to find the on-disk file causing the segfault).


siying commented Jan 12, 2022

This line:

final ReadOptions readOptions = new ReadOptions().setIterateUpperBound(new Slice(endKey))

might be problematic. I don't know whether it is the root of the problem, but the Slice probably needs to be held for the whole process. This is very similar to your previous report: https://groups.google.com/g/rocksdb/c/PwapmWwyBbc/m/Wh4N89tjBgAJ

Can you give a similar fix a try?


yuzhu commented Jan 12, 2022

Thanks for the quick response! In this case, we do hold a reference to the readOptions object, which holds a reference to the object created by new Slice(endKey), so I don't see how this object could be garbage collected?


siying commented Jan 12, 2022

> Thanks for the quick response! In this case, we do hold a reference to the readOptions object, which holds a reference to the object created by new Slice(endKey), so I don't see how this object could be garbage collected?

ReadOptions actually doesn't hold a reference to the Slice. I think this is an implementation we should probably improve, but that is how it is now.


yuzhu commented Jan 12, 2022

Thanks, we will give this a try.

@jiacheliu3

> Thanks for the quick response! In this case, we do hold a reference to the readOptions object, which holds a reference to the object created by new Slice(endKey), so I don't see how this object could be garbage collected?

> ReadOptions actually doesn't hold a reference to the Slice. I think this is an implementation we should probably improve, but that is how it is now.

Thanks a lot for the replies @siying! Sure, we will try the suggested fix and update ASAP.

But sorry, I still fail to understand why the ReadOptions does not hold a ref to the Slice. I'm looking at:

iterateUpperBoundSlice_ = iterateUpperBound;

  public ReadOptions setIterateUpperBound(final AbstractSlice<?> iterateUpperBound) {
    assert(isOwningHandle());
    setIterateUpperBound(
        nativeHandle_, iterateUpperBound == null ? 0 : iterateUpperBound.getNativeHandle());
    // Hold onto a reference so it doesn't get garbage collected out from under us.
    iterateUpperBoundSlice_ = iterateUpperBound;
    return this;
  }

The method argument is a reference to the anonymous new Slice() we created, and later it's assigned to the member variable iterateUpperBoundSlice_ as the comment says. The lines I'm quoting are from v6.25.3, which is the version we are using.

Am I reading it wrong? Or how can something held in a member variable not stop the underlying object from being GC-ed?


ghost commented Jan 13, 2022

@jiacheliu3 ReadOptions does hold a reference to the Slice, but after the ReadOptions is created in `final ReadOptions readOptions = new ReadOptions().setIterateUpperBound(new Slice(endKey));`, it is only referenced once more, in `try (RocksIterator iter = db().newIterator(mBlockLocationsColumn.get(), readOptions)) {`. If you look into newIterator (https://github.com/facebook/rocksdb/blob/main/java/src/main/java/org/rocksdb/RocksDB.java#L3150), only the native handle is retrieved and passed to RocksIterator; the ReadOptions object itself is not. So although the RocksIterator is alive in the try block, the ReadOptions could be GCed, which in turn may cause the Slice to be GCed.
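
To restate that in terms of the getLocations snippet above, here is the same pattern annotated with where, under this theory, collection could bite (the comments illustrate the suspicion, not a confirmed diagnosis):

  final ReadOptions readOptions = new ReadOptions().setIterateUpperBound(new Slice(endKey));
  try (RocksIterator iter = db().newIterator(mBlockLocationsColumn.get(), readOptions)) {
    // newIterator() only copied readOptions.nativeHandle_ into the native iterator.
    // As far as the JVM can tell, that was the last use of the readOptions object,
    // so the GC may collect it at any point below; once it is gone, nothing keeps the
    // Slice alive either, and its finalizer frees the native upper-bound memory.
    iter.seek(startKey);
    for (; iter.isValid(); iter.next()) {
      // The native iterator still compares data-block keys against the upper bound here;
      // if the Slice's native buffer has already been freed, the memcmp inside
      // CheckDataBlockWithinUpperBound() reads freed memory and can segfault.
    }
  }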


siying commented Jan 13, 2022

> Thanks for the quick response! In this case, we do hold a reference to the readOptions object, which holds a reference to the object created by new Slice(endKey), so I don't see how this object could be garbage collected?
>
> ReadOptions actually doesn't hold a reference to the Slice. I think this is an implementation we should probably improve, but that is how it is now.
>
> Thanks a lot for the replies @siying! Sure, we will try the suggested fix and update ASAP.
>
> But sorry, I still fail to understand why the ReadOptions does not hold a ref to the Slice. I'm looking at:
>
> iterateUpperBoundSlice_ = iterateUpperBound;
>
>   public ReadOptions setIterateUpperBound(final AbstractSlice<?> iterateUpperBound) {
>     assert(isOwningHandle());
>     setIterateUpperBound(
>         nativeHandle_, iterateUpperBound == null ? 0 : iterateUpperBound.getNativeHandle());
>     // Hold onto a reference so it doesn't get garbage collected out from under us.
>     iterateUpperBoundSlice_ = iterateUpperBound;
>     return this;
>   }
>
> The method argument is a reference to the anonymous new Slice() we created, and later it's assigned to the member variable iterateUpperBoundSlice_ as the comment says. The lines I'm quoting are from v6.25.3, which is the version we are using.
>
> Am I reading it wrong? Or how can something held in a member variable not stop the underlying object from being GC-ed?

You are correct. Sorry for my previous wrong answer. I think it might have something to do with how RocksDB Java deals with native handles. As @Cheng-Chang said, it is possible that Java might think some objects are of no use anymore, while RocksDB still keeps their native handles. I suspect this might have happened for endKey, and perhaps readOptions too. It's just a suspicion, and I don't know how to test the theory or even how to fix it. I'm not sure whether Java has some way to guarantee that endKey and readOptions are not garbage collected until the end of the function.


siying commented Jan 13, 2022

@adamretter do you have any thoughts on how we can test and/or fix the problem?


siying commented Jan 13, 2022

I guess one way to hack around it is to create a class that contains endKey, readOptions and iter, so that as long as the class is kept around, they are not GCed.
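
For illustration, a holder along those lines might look like the sketch below (the class name and shape are hypothetical, not an existing API). As long as the holder is reachable, the Slice and ReadOptions stay reachable too, and close() releases everything in one place:

  import org.rocksdb.ColumnFamilyHandle;
  import org.rocksdb.ReadOptions;
  import org.rocksdb.RocksDB;
  import org.rocksdb.RocksIterator;
  import org.rocksdb.Slice;

  /** Hypothetical holder keeping the upper-bound Slice and ReadOptions alive with the iterator. */
  final class BoundedIterator implements AutoCloseable {
    private final Slice mUpperBound;
    private final ReadOptions mReadOptions;
    private final RocksIterator mIterator;

    BoundedIterator(RocksDB db, ColumnFamilyHandle column, byte[] endKey) {
      mUpperBound = new Slice(endKey);
      mReadOptions = new ReadOptions().setIterateUpperBound(mUpperBound);
      mIterator = db.newIterator(column, mReadOptions);
    }

    RocksIterator iterator() {
      return mIterator;
    }

    @Override
    public void close() {
      // Close in reverse order of creation: the iterator first, then the options and the slice it points at.
      mIterator.close();
      mReadOptions.close();
      mUpperBound.close();
    }
  }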

@jiacheliu3

> @jiacheliu3 ReadOptions does hold a reference to the Slice, but after the ReadOptions is created in `final ReadOptions readOptions = new ReadOptions().setIterateUpperBound(new Slice(endKey));`, it is only referenced once more, in `try (RocksIterator iter = db().newIterator(mBlockLocationsColumn.get(), readOptions)) {`. If you look into newIterator (https://github.com/facebook/rocksdb/blob/main/java/src/main/java/org/rocksdb/RocksDB.java#L3150), only the native handle is retrieved and passed to RocksIterator; the ReadOptions object itself is not. So although the RocksIterator is alive in the try block, the ReadOptions could be GCed, which in turn may cause the Slice to be GCed.

I see a lot of patterns like

  readOptions.nativeHandle_));

where the nativeHandle_ is passed while the original holder object is untouched (and may subsequently get GC-ed). I guess if this is a problem, then it is quite general and applies to many places in the RocksDB code?

I'm not sure if I understand correctly, but it seems there are two potential problems when the Java object and native object lifecycles are not matched perfectly:

  1. If the Java obj dies but the native one does not, then there can be a segfault.
  2. If the native obj dies but the Java one does not, then the Java obj (and its associated references) may not be GC-ed, creating a memory leak.

I believe these problems must have been thought through in the RocksDB JNI design; I must have missed something like written-up anti-pattern docs? I was looking at memory mgmt, but it only mentions how the Java objects should be closed properly. Much appreciated if someone could point us in the right direction! @Cheng-Chang @siying @yuzhu

@alanpaxton

Hi, just dropping in to introduce myself. I work for @adamretter and I have been looking at some changes and improvements in RocksJava. First, to confirm: memory mgmt is the guidance for how to stay safe while using the JNI interface to RocksDB.

Also @jiacheliu3 you are right that we can potentially hold multiple Java references to a nativeHandle_, only one of which is the "owning" reference that will cause the underlying C++ RocksDB object to close() when it in turn is closed (often as an AutoCloseable in a try block). So it is the responsibility of the Java code to ensure that other references do not outlive that reference, in order to avoid a potential SEGFAULT. But looking at the code, it is not obvious to me that this is what is causing the problem here. I don't believe

It's relevant to note that I'm doing some experiments with a new mechanism for native handles from Java which uses std::shared_ptrs to ensure that all Java references count in the ownership of C++ objects. That's highly experimental and may not amount to anything, but it addresses a common problem, because Java developers have expectations of a Java API which aren't currently met by RocksJava.

I suspect, without any proof, that the problem may instead be a concurrency issue, although given that it is all just readers, it would need to be some kind of erroneous state sharing between the read threads. In Alluxio or in RocksDB? I don't know.

You could try @siying's idea re the Slice object, as it should be easy to test? But I don't think it's the reason, as nothing is going out of scope and becoming eligible for GC until the try has finished.

@jiacheliu3

Thanks for the reply @alanpaxton !

We are currently trying siying's idea and scrutinizing all our past usages of RocksObject, closing them more carefully. However, do you know how to squeeze more info out of a RocksDB segfault, like seeing which exact object caused it and what that object looks like? As you can see, we are using JNI and don't have much expertise in core dump analysis.

On your extra info about a better Java-C++ mechanism, that sounds great! If there's some ongoing work we can subscribe to that'd be very intriguing :)

I share the suspicion that the segfault on the read path could be a writer's sin, but as previously mentioned, we don't have any evidence at the moment. Would it be possible to tell by examining the embedded RocksDB log or the on-disk storage, so we can find more information for you?

Coming back to the point that the existing docs might not be intuitive or comprehensive enough, we are more than happy to contribute something like code examples that people can copy-paste and follow without needing to worry about segfaults or leaks. Alluxio relies heavily on RocksDB to store metadata off-heap when dealing with hundreds of millions of files, so we need to consolidate this and make sure everything is correct anyway. It would be awesome if we could contribute the doc somewhere in the RocksDB repo, with discussion in your reviews. Where is the best place to contribute that to?

@alanpaxton

Hi @jiacheliu3 I think for documentation you just want to add to the wiki where the info is missing or outdated.

I'll gladly share what I'm doing in a draft PR as soon as I make any progress. I generally work in public in my fork (github.com/alanpaxton/rocksdb) and I'm alan at evolvedbinary dot com.

I'm not aware how to get anything more out of the stack dump. That shows it's falling over when comparing a key against the upper bound, but you pretty much inferred that already. I'll have more of a dig around (it's useful to me to see what kind of real client errors happen) and let you know if I find anything.


adamretter commented Jan 18, 2022

@jiacheliu3 @dbw9580 I have seen some examples where Java objects are GC'd but are still needed in subsequent API calls - this can cause such a segfault, although I can't be certain that this is your issue.

In the early design of the RocksJava API (before I was involved), it was decided that the finalize method of each Java object should free any underlying C++ objects owned by that Java object. The idea was that the Java user would then not have to manage any memory deallocation manually and could just leave it up to the Java garbage collector. The downside of this is trying to ensure that the RocksJava API correctly holds references to every object that it may need; this is complex, and probably sometimes impossible depending on the user's intentions. Also, the user probably has a better idea of when their resources are no longer needed, and could make this explicit.

So... some time ago I introduced a mechanism to manually control the memory release in RocksJava by adding a close() to all sub-classes of RocksObject. This allows the user to manually manage their deallocation. However, we have not yet removed the auto-deallocation via the GC, as we consider this to be a breaking change. We are likely to do this as part of the major upcoming 7.0.0 release.

At the moment @alanpaxton is investigating and prototyping a possible alternative strategy which may provide an automated but efficient approach that doesn't involve the GC, and instead uses reference counting - but that is still a work in progress at the moment.

If we do remove the auto-deallocation from RocksJava in 7.0.0, then of course the user must free their RocksObjects explicitly by calling close, otherwise they will leak memory. However, the problem of the GC disposing objects that you still need will be removed.

If you believe the GC is causing your problems, then for the time being your best approach is to keep a reference to the Java object that you need, and then free that reference once you are certain you are finished with it.
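
Applied to the getLocations method from the top of this thread, that advice amounts to something like the sketch below (not the actual Alluxio patch). Each RocksObject is declared as its own try-with-resources resource, so all of them stay strongly reachable until the iteration finishes and are then closed deterministically in reverse order:

  @Override
  public List<BlockLocation> getLocations(long id) {
    byte[] startKey = RocksUtils.toByteArray(id, 0);
    byte[] endKey = RocksUtils.toByteArray(id, Long.MAX_VALUE);
    // Hold the Slice, ReadOptions and iterator as named resources so none of them can be
    // garbage collected (and their native memory freed) while the iterator is still in use.
    try (Slice upperBound = new Slice(endKey);
         ReadOptions readOptions = new ReadOptions().setIterateUpperBound(upperBound);
         RocksIterator iter = db().newIterator(mBlockLocationsColumn.get(), readOptions)) {
      List<BlockLocation> locations = new ArrayList<>();
      for (iter.seek(startKey); iter.isValid(); iter.next()) {
        try {
          locations.add(BlockLocation.parseFrom(iter.value()));
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
      return locations;
    }
  }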

alluxio-bot pushed a commit to Alluxio/alluxio that referenced this issue Jan 21, 2022
### What changes are proposed in this pull request?

Fixes facebook/rocksdb#9378
`org.rocksdb.RocksObject` extends `AutoCloseable`, so these objects should be
closed properly to avoid leaks.

### Why are the changes needed?

According to the [RocksDB
documentation](https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#memory-management)
and a sample [PR
discussion](facebook/rocksdb#7608 (comment)),
the objects need to be closed or wrapped in a try-with-resources block.

### Does this PR introduce any user facing changes?

NA

pr-link: #14856
change-id: cid-69a164c547a14e0bb7710fc7a48731aecba4db88
@jiacheliu3

@adamretter Thanks for your explanation!

> However, we have not yet removed the auto-deallocation via the GC, as we consider this to be a breaking change. We are likely to do this as part of the major upcoming 7.0.0 release.

On this, is there a GitHub issue or something equivalent that we can subscribe to? According to a quick Google search, 7.0.0 is due in Feb/Mar 2022? We would like to actively follow and try out the new mechanism.


jiacheliu3 commented Jan 25, 2022

The segfault no longer occurs after the fix Alluxio/alluxio#14856
where we follow the rules defined in memory mgmt and close all RocksObject instances with try-with-resources blocks. @dbw9580 ran the reproduction code about a dozen times and did not see a reproduction.

The issue can be closed; many thanks for your help!

Xenorith pushed a commit to Alluxio/alluxio that referenced this issue Feb 7, 2022
flaming-archer pushed a commit to flaming-archer/alluxio that referenced this issue Sep 1, 2022
Close RocksDB objects that are AutoCloseable