
HDDS-4761. Implement increment count optimization in DeletedBlockLog V2 #1852

Closed
wants to merge 2 commits into from

Conversation


@amaliujia amaliujia commented Jan 29, 2021

What changes were proposed in this pull request?

Implement the increment count optimization in DeletedBlockLog V2: we only write the retry count to the DB once every 100 increments.

This optimization is already in the master branch: 39027e4
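
For illustration, here is a minimal sketch of the batching idea. The class, field, and helper names (RetryCountBatcher, persistRetryCount, PERSIST_INTERVAL) are assumptions for the example, not the actual Ozone code: the retry count is tracked in memory and only written through to the DB on every 100th increment, or once the transaction exceeds maxRetry.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RetryCountBatcher {
  private static final int PERSIST_INTERVAL = 100;
  private final int maxRetry;
  // In-memory retry counts, keyed by transaction ID.
  private final Map<Long, Integer> retryCountMap = new ConcurrentHashMap<>();

  RetryCountBatcher(int maxRetry) {
    this.maxRetry = maxRetry;
  }

  void incrementCount(List<Long> txIDs) {
    for (long txID : txIDs) {
      int count = retryCountMap.merge(txID, 1, Integer::sum);
      // Write through to the DB only on every 100th increment, or once the
      // transaction exceeds maxRetry and must be marked as failed (-1).
      if (count % PERSIST_INTERVAL == 0 || count > maxRetry) {
        persistRetryCount(txID, count > maxRetry ? -1 : count);
      }
    }
  }

  private void persistRetryCount(long txID, int count) {
    // Placeholder for the actual RocksDB / state-manager write.
  }
}

In this sketch the trade-off is far fewer DB writes at the cost of possibly losing up to 99 in-memory increments across a restart.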

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4761

How was this patch tested?

UT

@amaliujia
Contributor Author

R: @runzhiwang @GlenGeng


amaliujia commented Jan 29, 2021

Without this optimization, the next time master is synced to HDDS-2823, I think the current V2 implementation won't pass this unit test in TestDeletedBlockLog:

public void testIncrementCountLessFrequentWritingToDB() throws Exception {


GlenGeng-awx commented Feb 1, 2021

@amaliujia Thanks for backporting this fix to HDDS-2823.

I think we have to move transactionRetryCountMap from DeletedBlockLogStateManagerImpl#increaseRetryCountOfTransactionInDB to DeletedBlockLogImplV2#incrementCount, and clear it when SCM becomes leader.

The reason is that the content of DeletedBlockLogImplV2 should be kept consistent and deterministic across SCM instances, which is not necessary for this map and cannot be maintained during SCM failover.

The design decision is to encapsulate only the deletedTable into DeletedBlockLogImplV2.


amaliujia commented Feb 3, 2021

@GlenGeng

I have updated this PR based on the discussion. Basically, we still need an in-memory state in the state manager. I also updated it to clear that state upon leader change.

I also updated the UT: it was changed during the master-to-HDDS-2823 sync, so before this PR it was not functionally correct.

@GlenGeng-awx GlenGeng-awx left a comment

LGTM

*/
public void clearTransactionToDNsCommitMap() {
public void clear() {

Please rename to onBecomeLeader()

@@ -42,4 +42,6 @@ void increaseRetryCountOfTransactionInDB(ArrayList<Long> txIDs)
KeyValue<Long, DeletedBlocksTransaction>> getReadOnlyIterator();

void onFlush();

void clear();

Ditto: please rename to onBecomeLeader().

// then set the retry value to -1, stop retrying, admins can
// analyze those blocks and purge them manually by SCMCli.
builder.setCount(-1);
deletedTable.putWithBatch(

Better to call transactionRetryCountMap.remove(txID) here? Otherwise it might leak.
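
For illustration, a sketch of the suggested cleanup in context. The surrounding lines are adapted from the diff above; the enclosing count > maxRetry guard and the exact putWithBatch arguments are assumptions:

if (count > maxRetry) {
  // then set the retry value to -1, stop retrying; admins can
  // analyze those blocks and purge them manually by SCMCli.
  builder.setCount(-1);
  deletedTable.putWithBatch(batch, txID, builder.build());
  // Suggested cleanup: drop the entry from the in-memory map so it does
  // not leak once the transaction is permanently marked as failed.
  transactionRetryCountMap.remove(txID);
}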

@GlenGeng-awx
Contributor

Hey Rui, on second thought I realized that my suggestion (the second commit) is wrong; sorry about that.
Your first commit is deterministic for increaseRetryCountOfTransactionInDB.

Given that, across different SCMs,

  • the RocksDB state is the same at the same Raft log index,
  • and they have the same Raft log, which means they will apply the same `increaseRetryCountOfTransactionInDB` calls,

the generated transactionRetryCountMap is deterministic under all kinds of failovers, including restart, becoming leader, and stepping down.

The real problem here is that this solution does not decrease the number of fsyncs. In non-HA mode, each write to the DB calls fsync; in HA mode the writes are buffered, but each write still triggers a Raft client request, which calls fsync on the Raft log.

Let's go through this problem again.

@GlenGeng-awx
Contributor

On third thought, I think neither the first commit nor the first plus second commits can ensure that the retry counter map is deterministic, and the optimization in HA mode should be to reduce the number of Raft calls. What do you think?

@amaliujia
Contributor Author

@GlenGeng

Optimizing to reduce the Ratis calls makes sense. In that case, we will need in-memory state in both DeletedBlockLog and the state manager: the former to reduce Ratis calls, and the latter because of the transaction buffer.

What do you think?

@GlenGeng-awx
Contributor

Agreed. The key point of a deterministic solution is to have each Ratis call write to RocksDB.


amaliujia commented Feb 4, 2021

@GlenGeng Unfortunately, with the transaction buffer we cannot flush the retry count into the DB immediately, so we probably have to accept a non-deterministic result for the moment.

First of all, the current implementation is already broken because it reads from the DB and then tries to write back to the DB; in the worst case the DB value never increases (because the previous writes are still in the buffer), so a txID can be retried forever. Hence we have to introduce the in-memory state in the state manager.

Then, I think the non-determinism here might not be a big deal. Even though the DB state could differ,

  1. The count in the DB will not be overwritten back, because we now have in-memory state (or it becomes consistent through installSnapshot).
  2. There is maxRetry as a bound.

At least the current PR avoids the worst case. What do you think?
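
To make the argument above concrete, here is a rough sketch of the two patterns. Names such as deletedTable, batch, and the toBuilder usage are assumptions based on the snippets in this thread, not verified against the actual code:

// Read-modify-write against the DB: with a transaction buffer, the read
// never sees the buffered writes, so the stored count can stay at its old
// value and the transaction may be retried forever.
DeletedBlocksTransaction txn = deletedTable.get(txID);
deletedTable.putWithBatch(batch, txID,
    txn.toBuilder().setCount(txn.getCount() + 1).build());

// With an in-memory map as the source of truth, the increment is visible
// immediately, even before the buffer is flushed to RocksDB.
int newCount = transactionRetryCountMap.merge(txID, 1, Integer::sum);
deletedTable.putWithBatch(batch, txID,
    txn.toBuilder().setCount(newCount).build());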


GlenGeng-awx commented Feb 4, 2021

Unfortunately, with the transaction buffer we cannot flush the retry count into the DB immediately, so we probably have to accept a non-deterministic result for the moment.

Not flushing the retry count into the DB immediately may not lead to non-determinism. Let's try to prove it:
1. If the current SCM (leader or follower) has not touched the txID, it will load the count from the DB.
2. If the current SCM (leader or follower) has touched the txID, it will use the count in transactionRetryCountMap, which is accurate as long as the SCM is not restarted.
3. For a restarted SCM, loading the DB and replaying the corresponding log entries will rebuild a transactionRetryCountMap that is exactly the same as the map in the other SCMs.

What we need to take care of is moving the counter check logic into the ReadOnlyIterator, e.g.:

            if (txn.getCount() > -1 && txn.getCount() <= maxRetry
                && !containerManager.getContainer(id).isOpen()) {
              numBlocksAdded += txn.getLocalIDCount();
              getTransaction(txn, transactions);
              transactionToDNsCommitMap
                  .putIfAbsent(txn.getTxID(), new LinkedHashSet<>());
            }
      private void findNext() {
        while (iter.hasNext()) {
          TypedTable.KeyValue<Long, DeletedBlocksTransaction> next = iter
              .next();
          long txID;
          try {
            txID = next.getKey();
          } catch (IOException e) {
            throw new IllegalStateException("");
          }

          int retryCount = transactionRetryCountMap.getOrDefault(txID, 0);
          if (!deletingTxIDs.contains(txID) && retryCount > -1 && retryCount <= maxRetry) {
            nextTx = next;
            if (LOG.isTraceEnabled()) {
              LOG.trace("DeletedBlocksTransaction matching txID:{}",
                  txID);
            }
            return;
          }
        }
        nextTx = null;
      }

Given transactionRetryCountMap.getOrDefault, we don't need to add new in-memory state, since it is an accurate cache and can be recovered from the Raft log. What do you think?

@GlenGeng-awx
Contributor

Hey Rui, the trxBuffer introduces some issues into the deleted block log; let me draft a design for it and we can discuss there.
Please hold this PR for now, thanks!

@amaliujia
Contributor Author

I will open another PR to apply the retry count optimization that we have discussed.

@amaliujia amaliujia closed this Feb 9, 2021
@amaliujia amaliujia deleted the HDDS-4761 branch November 24, 2021 04:05