Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RocksDB Iterator running out of bounds #12036

Open
Tracked by #12033
Zelldon opened this issue Mar 15, 2023 · 4 comments
Open
Tracked by #12033

RocksDB Iterator running out of bounds #12036

Zelldon opened this issue Mar 15, 2023 · 4 comments
Labels
area/performance Marks an issue as performance related component/engine component/zeebe Related to the Zeebe component/team kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc.

Comments

@Zelldon
Copy link
Member

Zelldon commented Mar 15, 2023

Description

Related to #12033
Came up in #11813

When iterating within a column family by a prefix, Zeebe controls when the iterator is out of bounds or not. This always results in x + 1 (whereby x >= 0) iterations. Meaning, the +1 iteration always happens, and doing a check if the iterator is within the expected prefix or not:

https://github.com/camunda/zeebe/blob/28fe054b79fcc7398a086a1576199efef01a122c/zb-db/src/main/java/io/camunda/zeebe/db/impl/rocksdb/transaction/TransactionalColumnFamily.java#L359-L361

This may cause reading the first key-value pair of the next logical column family (that contains data).

Potential improvements:

@Zelldon Zelldon added kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. area/performance Marks an issue as performance related component/engine labels Mar 15, 2023
@Zelldon
Copy link
Member Author

Zelldon commented Apr 28, 2023

I investigated that further.

The following statement:

This may cause reading the first key-value pair of the next logical column family (that contains data).

is not true.

I did a test:

  @Test
  public void shouldLoopWhileEqualPrefixEndOfCF() {
    // given

    upsertInDefaultCF(10, 1);
    upsertInDefaultCF(10, 2);
    upsertInCompositeCF(11, "foo", "baring");
    upsertInCompositeCF(11, "foo2", "different value");
    upsertInCompositeCF(12, "hello", "world");
    upsertInCompositeCF(12, "this is the one", "as you know");
    upsertInCompositeCF(13, "another", "string");
    upsertInCompositeCF(14, "might", "be good");
    upsertInOtherCF(10, 1);
    upsertInOtherCF(10, 2);
    firstKey.wrapLong(14);

    // when
    final List<String> values = new ArrayList<>();
    compositeCF.whileEqualPrefix(
        firstKey,
        (KeyValuePairVisitor<DbCompositeKey<DbLong, DbString>, DbString>)
            (key, value) -> values.add(value.toString()));

    // then
    assertThat(values).containsExactly("be good");
  }

The debugger shows that we are NOT iterating over the column family bound. This is because we use .useFixedLengthPrefixExtractor(Long.BYTES) and .setPrefixSameAsStart(true).

The combination of these options allows RocksDB to check always the prefix of 8 bytes length, which is our ColumnFamily prefix. If we reach the end, the rocksdb iterator becomes invalid (and or loop ends).

See related comment.
and wiki page.

Still, the prefix check in the loop is necessary. For loops inside the same column family.

\cc @romansmirnov

@Zelldon
Copy link
Member Author

Zelldon commented Apr 28, 2023

I think it makes sense to separate the current iterating method, into two.

  • One with a given prefix, which iterates inside the column family and wants to see only specific prefixes here we need to check the prefix.
  • another one which without a given prefix, here we want to iterate over the whole CF (like foreach), here we don't need to check

@Zelldon
Copy link
Member Author

Zelldon commented May 3, 2023

It looks like there is some bug in the prefix iteration, previously also mentioned by @romansmirnov. He was able to produce this yesterday. Today I had a look at it and I was also able to reproduce it.

I remove the manual prefix check for the normal foreach methods and from time to time engine tests seems to fail. It seems to be not deterministic, which is weird.

I was able to debug a case where the test failed. We can see that the iterating should be inside CF 14 (PENDING DEPLOYMENT)

prefix

When we run out of bounce we are standing at CF 34 (INCIDENTS)

java.lang.IndexOutOfBoundsException: index=8 length=4 capacity=8
	at org.agrona.AbstractMutableDirectBuffer.boundsCheck0(AbstractMutableDirectBuffer.java:1716) ~[agrona-1.18.1.jar:1.18.1]
	at org.agrona.concurrent.UnsafeBuffer.ensureCapacity(UnsafeBuffer.java:646) ~[agrona-1.18.1.jar:1.18.1]
	at org.agrona.AbstractMutableDirectBuffer.getInt(AbstractMutableDirectBuffer.java:185) ~[agrona-1.18.1.jar:1.18.1]
	at io.camunda.zeebe.db.impl.DbInt.wrap(DbInt.java:27) ~[classes/:?]
	at io.camunda.zeebe.db.impl.DbCompositeKey.wrap(DbCompositeKey.java:47) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.visit(TransactionalColumnFamily.java:426) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.lambda$forEachWithoutPrefix$21(TransactionalColumnFamily.java:409) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.ColumnFamilyContext.withPrefixKey(ColumnFamilyContext.java:112) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.forEachWithoutPrefix(TransactionalColumnFamily.java:398) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.lambda$forEach$7(TransactionalColumnFamily.java:163) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.lambda$ensureInOpenTransaction$19(TransactionalColumnFamily.java:316) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.DefaultTransactionContext.runInTransaction(DefaultTransactionContext.java:31) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.ensureInOpenTransaction(TransactionalColumnFamily.java:315) ~[classes/:?]
	at io.camunda.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.forEach(TransactionalColumnFamily.java:161) ~[classes/:?]
	at io.camunda.zeebe.engine.state.deployment.DbDeploymentState.foreachPendingDeploymentDistribution(DbDeploymentState.java:128) ~[classes/:?]
	at io.camunda.zeebe.engine.processing.deployment.distribute.DeploymentRedistributor.runRetryCycle(DeploymentRedistributor.java:56) ~[classes/:?]
	at io.camunda.zeebe.stream.api.scheduling.SimpleProcessingScheduleService.lambda$runAtFixedRate$0(SimpleProcessingScheduleService.java:35) ~[classes/:?]

wrong

Right now it looks like we need the manual prefix check because of this bug/limitation. I would need to investigate further why this happens.

Might be related to transactions and missing prefix extractor support, see facebook/rocksdb#5100

Some other related issues:

@Zelldon
Copy link
Member Author

Zelldon commented May 3, 2023

Investigation 🕵️

Context:

The idea was that we can remove our manual prefix check for some cases, since this should be already guaranteed by Rocksdb since we use prefix extractor etc. But there seems to be some issues as described here #12036 (comment)

Details

On a branch I removed the prefix check and wrote a new test in order to unit test multiple columnFamilies.

Test case
/*
 * Copyright Camunda Services GmbH and/or licensed to Camunda Services GmbH under
 * one or more contributor license agreements. See the NOTICE file distributed
 * with this work for additional information regarding copyright ownership.
 * Licensed under the Zeebe Community License 1.1. You may not use this file
 * except in compliance with the Zeebe Community License 1.1.
 */
package io.camunda.zeebe.db.impl;

import static org.assertj.core.api.Assertions.assertThat;

import io.camunda.zeebe.db.ColumnFamily;
import io.camunda.zeebe.db.TransactionContext;
import io.camunda.zeebe.db.ZeebeDb;
import io.camunda.zeebe.db.ZeebeDbFactory;
import io.camunda.zeebe.db.ZeebeDbTransaction;
import java.io.File;
import java.util.HashMap;
import org.junit.Before;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;

public final class MultiColumnFamilyTest {

@Rule public final TemporaryFolder temporaryFolder = new TemporaryFolder();
private final ZeebeDbFactory dbFactory =
DefaultZeebeDbFactory.getDefaultFactory();
private ZeebeDb zeebeDb;
private ColumnFamily<DbCompositeKey<DbLong, DbString>, DbString> compositeCF;
private DbLong compositeCFFirstKey;
private DbString compositeCFSecondKey;
private DbCompositeKey<DbLong, DbString> compositeKey;
private DbString value;
private DbLong defaultKey;
private DbLong defaultValue;
private ColumnFamily<DbLong, DbLong> defaultCF;
private DbString otherKey;
private DbString otherValue;
private ColumnFamily<DbString, DbString> otherCF;
private TransactionContext transactionContext;

@Before
public void setup() throws Exception {

final File pathName = temporaryFolder.newFolder();
zeebeDb = dbFactory.createDb(pathName);

transactionContext = zeebeDb.createContext();

compositeCFFirstKey = new DbLong();
compositeCFSecondKey = new DbString();
compositeKey = new DbCompositeKey<>(compositeCFFirstKey, compositeCFSecondKey);
value = new DbString();

compositeCF =
    zeebeDb.createColumnFamily(
        ColumnFamilies.COMPOSITE, transactionContext, compositeKey, value);

defaultKey = new DbLong();
defaultValue = new DbLong();
defaultCF =
    zeebeDb.createColumnFamily(
        ColumnFamilies.DEFAULT, transactionContext, defaultKey, defaultValue);

otherKey = new DbString();
otherValue = new DbString();
otherCF =
    zeebeDb.createColumnFamily(ColumnFamilies.OTHER, transactionContext, otherKey, otherValue);

}

@Test
public void shouldLoopWhileEqualPrefixEndOfCF() throws Exception {
// given
upsertInDefaultCF(10, 1);
upsertInDefaultCF(11, 2);
upsertInCompositeCF(11, "foo", "baring");
upsertInCompositeCF(11, "foo2", "different value");
upsertInCompositeCF(12, "hello", "world");
upsertInCompositeCF(12, "this is the one", "as you know");
upsertInCompositeCF(13, "another", "string");
upsertInCompositeCF(14, "might", "be good");
upsertInOtherCF("10", "1");
upsertInOtherCF("10", "2");

// when
defaultCF.forEach((key, value) -> keyValues.put(key.getValue(), value.getValue()));

// then
assertThat(keyValues).containsOnlyKeys(10L, 11L).containsValues(1L, 2L);

}

private void upsertInDefaultCF(final long key, final long value) {
defaultKey.wrapLong(key);
defaultValue.wrapLong(value);
defaultCF.upsert(defaultKey, defaultValue);
}

private void upsertInOtherCF(final String key, final String value) {
otherKey.wrapString(key);
otherValue.wrapString(value);
otherCF.upsert(otherKey, otherValue);
}

private void upsertInCompositeCF(
final long firstKey, final String secondKey, final String value) {
compositeCFFirstKey.wrapLong(firstKey);
compositeCFSecondKey.wrapString(secondKey);

this.value.wrapString(value);
compositeCF.upsert(compositeKey, this.value);

}

enum ColumnFamilies {
DEFAULT,
COMPOSITE,
OTHER,
}
}

The test runs without issues, even in a loop.

I checked again where and when actually the problem occurred. It was for example in the PendingDeploymentCheck. This is scheduled within the ProcessingScheduleService, which is the same actor as the StreamProcessor.

The stream processor creates for the processing a transaction in order to accumulate all state changes, the transaction will be committed later in a different actor job. There is a chance that the pending deployment check is executed interleaving before the transaction commit. This is also the reason why the issue was not deterministically reproducible as described earlier #12036 (comment).

The running between process and committing the transaction might be not an issue IF the pending deployment checker (or redistributor) uses a different transaction, but it uses the same. 🐛 This is quite problematic since it can see intermediate state changes, which might be not committed at all, because the transaction can still be rolled back on an error. (Issue: #12647).

Due to this reusing of the transaction and the issue (which I referenced earlier), e.g facebook/rocksdb#5100 and facebook/rocksdb#2343. It runs into the issue. It seems as soon as the transaction is dirty the prefix iterating doesn't work anymore, and it will fail.

I was able to reproduce this with my previous test:

  @Test
  public void shouldLoopWhileEqualPrefixEndOfCF() throws Exception {
    // given
    upsertInDefaultCF(10, 1);
    upsertInDefaultCF(11, 2);
    upsertInCompositeCF(11, "foo", "baring");
    upsertInCompositeCF(11, "foo2", "different value");
    upsertInCompositeCF(12, "hello", "world");
    upsertInCompositeCF(12, "this is the one", "as you know");
    upsertInCompositeCF(13, "another", "string");
    upsertInCompositeCF(14, "might", "be good");
    upsertInOtherCF("10", "1");
    upsertInOtherCF("10", "2");

    // when
    final var keyValues = new HashMap<Long, Long>();
+    final ZeebeDbTransaction currentTransaction = transactionContext.getCurrentTransaction();
+    currentTransaction.run(
+       () -> {
+         upsertInCompositeCF(114, "12", "be good");
+        upsertInOtherCF("10", "1121");
+     });

+   // we don't commit, so it is still on going
+   //    currentTransaction.commit();
    defaultCF.forEach((key, value) -> keyValues.put(key.getValue(), value.getValue()));

    // then
    assertThat(keyValues).containsOnlyKeys(10L, 11L).containsValues(1L, 2L);
  }

The test will fail with the ongoing transaction and using the same transaction context. If we would commit the transaction the test would be green again, or when using a different transaction context.

Right now it looks like that we will still need the manual prefix check in our foreach implementation, due to the limitation in RocksDB. 🔲

@romansmirnov romansmirnov added the component/zeebe Related to the Zeebe component/team label Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/performance Marks an issue as performance related component/engine component/zeebe Related to the Zeebe component/team kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc.
Projects
None yet
Development

No branches or pull requests

2 participants