
HDDS-4761. Implement increment count optimization in DeletedBlockLog V2 #1852

Closed
wants to merge 2 commits into from

Conversation


@amaliujia amaliujia commented Jan 29, 2021

What changes were proposed in this pull request?

Implement the increment count optimization in DeletedBlockLog V2: we only write the retry count to the DB once every 100 increments.

This optimization is already in the master branch: 39027e4
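
For illustration, here is a minimal sketch of the batching idea. The class, field, and helper names (RetryCountBatcher, persistRetryCount, PERSIST_INTERVAL) are assumptions for the example, not the actual Ozone code: the retry count is tracked in memory and only written through to the DB on every 100th increment, or once the transaction exceeds maxRetry.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RetryCountBatcher {
  private static final int PERSIST_INTERVAL = 100;
  private final int maxRetry;
  // In-memory retry counts, keyed by transaction ID.
  private final Map<Long, Integer> retryCountMap = new ConcurrentHashMap<>();

  RetryCountBatcher(int maxRetry) {
    this.maxRetry = maxRetry;
  }

  void incrementCount(List<Long> txIDs) {
    for (long txID : txIDs) {
      int count = retryCountMap.merge(txID, 1, Integer::sum);
      // Write through to the DB only on every 100th increment, or once the
      // transaction exceeds maxRetry and must be marked as failed (-1).
      if (count % PERSIST_INTERVAL == 0 || count > maxRetry) {
        persistRetryCount(txID, count > maxRetry ? -1 : count);
      }
    }
  }

  private void persistRetryCount(long txID, int count) {
    // Placeholder for the actual RocksDB / state-manager write.
  }
}

In this sketch the trade-off is far fewer DB writes at the cost of possibly losing up to 99 in-memory increments across a restart.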

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4761

How was this patch tested?

UT

@amaliujia
Contributor Author

R: @runzhiwang @GlenGeng


amaliujia commented Jan 29, 2021

Without this optimization, the next time master is synced to HDDS-2823, I think the current V2 implementation won't pass this unit test in TestDeletedBlockLog:

public void testIncrementCountLessFrequentWritingToDB() throws Exception {


GlenGeng-awx commented Feb 1, 2021

@amaliujia Thanks for backporting this fix to HDDS-2823.

I think we have to move transactionRetryCountMap from DeletedBlockLogStateManagerImpl#increaseRetryCountOfTransactionInDB to DeletedBlockLogImplV2#incrementCount, and clear it when SCM becomes leader.

The reason is that the content of DeletedBlockLogImplV2 should be kept consistent and deterministic across SCM instances, which is not necessary for this map and cannot be maintained during SCM failover.

The design decision is to encapsulate only the deletedTable into DeletedBlockLogImplV2.


amaliujia commented Feb 3, 2021

@GlenGeng

I have updated this PR based on the discussion. Basically, we still need an in-memory state in the state manager. I also updated it to clear that state upon leader change.

I also updated the UT: it was changed during the master-to-HDDS-2823 sync, so before this PR it was not functionally correct.

@GlenGeng-awx GlenGeng-awx left a comment

LGTM

*/
public void clearTransactionToDNsCommitMap() {
public void clear() {

Please rename to onBecomeLeader()

@@ -42,4 +42,6 @@ void increaseRetryCountOfTransactionInDB(ArrayList<Long> txIDs)
KeyValue<Long, DeletedBlocksTransaction>> getReadOnlyIterator();

void onFlush();

void clear();

Ditto: please rename to onBecomeLeader().

// then set the retry value to -1, stop retrying, admins can
// analyze those blocks and purge them manually by SCMCli.
builder.setCount(-1);
deletedTable.putWithBatch(

Better to call transactionRetryCountMap.remove(txID) here? Otherwise it might leak.
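
For illustration, a sketch of the suggested cleanup in context. The surrounding lines are adapted from the diff above; the enclosing count > maxRetry guard and the exact putWithBatch arguments are assumptions:

if (count > maxRetry) {
  // then set the retry value to -1, stop retrying; admins can
  // analyze those blocks and purge them manually by SCMCli.
  builder.setCount(-1);
  deletedTable.putWithBatch(batch, txID, builder.build());
  // Suggested cleanup: drop the entry from the in-memory map so it does
  // not leak once the transaction is permanently marked as failed.
  transactionRetryCountMap.remove(txID);
}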

@GlenGeng-awx
Contributor

Hey Rui, on second thought I realized that my suggestion (the second commit) is wrong; sorry about that.
Your first commit is deterministic for increaseRetryCountOfTransactionInDB.

Given that, across different SCMs,

  • the RocksDB state is the same at the same Raft log index,
  • and they have the same Raft log, which means they will apply the same `increaseRetryCountOfTransactionInDB` calls,

the generated transactionRetryCountMap is deterministic under all kinds of failovers, including restart, becoming leader, and stepping down.

The real problem here is that this solution does not decrease the number of fsyncs. In non-HA mode, each write to the DB calls fsync; in HA mode the writes are buffered, but each write still triggers a Raft client request, which calls fsync on the Raft log.

Let's go through this problem again.

@GlenGeng-awx
Contributor

On third thought, I think neither the first commit nor the first plus second commits can ensure that the retry counter map is deterministic, and the optimization in HA mode should be to reduce the number of Raft calls. What do you think?

@amaliujia
Contributor Author

@GlenGeng

Optimizing to reduce the Ratis calls makes sense. In that case, we will need in-memory state in both DeletedBlockLog and the state manager: the former to reduce Ratis calls, and the latter because of the transaction buffer.

What do you think?

@GlenGeng-awx
Contributor

Agreed. The key point of a deterministic solution is to have each Ratis call write to RocksDB.


amaliujia commented Feb 4, 2021

@GlenGeng Unfortunately, with the transaction buffer we cannot flush the retry count into the DB immediately, so we probably have to accept a non-deterministic result for the moment.

First of all, the current implementation is already broken because it reads from the DB and then tries to write back to the DB; in the worst case the DB value never increases (because the previous writes are still in the buffer), so a txID can be retried forever. Hence we have to introduce the in-memory state in the state manager.

Then, I think the non-determinism here might not be a big deal. Even though the DB state could differ,

  1. The count in the DB will not be overwritten back, because we now have in-memory state (or it becomes consistent through installSnapshot).
  2. There is maxRetry as a bound.

At least the current PR avoids the worst case. What do you think?
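
To make the argument above concrete, here is a rough sketch of the two patterns. Names such as deletedTable, batch, and the toBuilder usage are assumptions based on the snippets in this thread, not verified against the actual code:

// Read-modify-write against the DB: with a transaction buffer, the read
// never sees the buffered writes, so the stored count can stay at its old
// value and the transaction may be retried forever.
DeletedBlocksTransaction txn = deletedTable.get(txID);
deletedTable.putWithBatch(batch, txID,
    txn.toBuilder().setCount(txn.getCount() + 1).build());

// With an in-memory map as the source of truth, the increment is visible
// immediately, even before the buffer is flushed to RocksDB.
int newCount = transactionRetryCountMap.merge(txID, 1, Integer::sum);
deletedTable.putWithBatch(batch, txID,
    txn.toBuilder().setCount(newCount).build());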


GlenGeng-awx commented Feb 4, 2021

Unfortunately, with the transaction buffer we cannot flush the retry count into the DB immediately, so we probably have to accept a non-deterministic result for the moment.

Not flushing the retry count into the DB immediately may not lead to non-determinism. Let's try to prove it:
1. If the current SCM (leader or follower) has not touched the txID, it will load the count from the DB.
2. If the current SCM (leader or follower) has touched the txID, it will use the count in transactionRetryCountMap, which is accurate as long as the SCM is not restarted.
3. For a restarted SCM, loading the DB and replaying the corresponding log entries will rebuild a transactionRetryCountMap that is exactly the same as the map in the other SCMs.

What we need to take care of is moving the counter check logic into the ReadOnlyIterator, e.g.:

            if (txn.getCount() > -1 && txn.getCount() <= maxRetry
                && !containerManager.getContainer(id).isOpen()) {
              numBlocksAdded += txn.getLocalIDCount();
              getTransaction(txn, transactions);
              transactionToDNsCommitMap
                  .putIfAbsent(txn.getTxID(), new LinkedHashSet<>());
            }
      private void findNext() {
        while (iter.hasNext()) {
          TypedTable.KeyValue<Long, DeletedBlocksTransaction> next = iter
              .next();
          long txID;
          try {
            txID = next.getKey();
          } catch (IOException e) {
            throw new IllegalStateException("");
          }

          int retryCount = transactionRetryCountMap.getOrDefault(txID, 0);
          if (!deletingTxIDs.contains(txID) && retryCount > -1 && retryCount <= maxRetry) {
            nextTx = next;
            if (LOG.isTraceEnabled()) {
              LOG.trace("DeletedBlocksTransaction matching txID:{}",
                  txID);
            }
            return;
          }
        }
        nextTx = null;
      }

Given transactionRetryCountMap.getOrDefault, we don't need to add new in-memory state, since it is an accurate cache and can be recovered from the Raft log. What do you think?

@GlenGeng-awx
Contributor

Hey Rui, the trxBuffer introduces some issues into the deleted block log; let me draft a design for it and we can discuss there.
Please hold this PR for now, thanks!

@amaliujia
Contributor Author

I will open another PR to apply the retry count optimization that we have discussed.

@amaliujia amaliujia closed this Feb 9, 2021
@amaliujia amaliujia deleted the HDDS-4761 branch November 24, 2021 04:05