Skip to content

Conversation

@milleruntime
Copy link
Contributor

@milleruntime milleruntime commented Jun 15, 2022

Changes taken from #2215 with some improvements.

Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, I like the idea. However, I think the API could be streamlined.

Comment on lines 56 to 57
List<FateTransactionStatus> fateStatus(Set<String> txids, List<String> tStatus)
throws AccumuloException;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you had methods, fail() and delete() on FateTransactionStatus, this list (or set, really.. since the items should be unique) is all you'd really need, and you wouldn't need the separate FateOperations interface. It could also be a stream, so you can do things like:

   client.instanceOperations().fateTransactions()
     .filter(tx -> tx.getCreatedTime().before(threeDaysAgo))
     .peek(FateTransaction::fail)
     .filter(tx -> tx.getCreatedTime().before(fiveDaysAgo))
     .forEach(FateTransaction::delete);

Returning it as a set and calling .stream() on the set would work too. The main point is that FateTransactionImpl should have a reference to the client, so you can act on it with fail() and delete() operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will work. Here is a method in FateCommand that I got to compile:

public void failTx(Shell shellState, List txids) throws AccumuloException {
    shellState.getAccumuloClient().instanceOperations().fateTransactions().stream()
        .filter(fateTransaction -> txids.contains(fateTransaction.getId().canonical().toString()))
        .forEach(FateTransaction::fail);
  }

Some things to note if we go with the fail() method hanging off of FateTransactionImpl. You will only be able to make calls to fail a single TX at a time. And we won't be able to have any checked exceptions thrown from fail(). So the FateTransaction interface method will just be:

/**
   * Fails the fate transaction.
   *
   * @since 2.1.0
   */
  void fail();

I guess we could have another instance method that takes a Set<FateTransaction> if you want to fail a bunch at once.

Copy link
Contributor

@EdColeman EdColeman Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing a bunch is not uncommon - I'm not sure if the shell allows multiples right now, or not, but one way to handle it was to create a command file passed to the shell that had the individual txids / commands:

fate fail txid1
fate fail txid2
...

It would be friendly if this could be fate fail txid1 txid2... in the shell, and if the API supported it.

Comment on lines 32 to 37
String getTxid();

/**
* @return This fate operations transaction id, in its original long form.
*/
long getTxidLong();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these might be a bit backwards. The transaction ID (what should be returned from a getter of the transaction ID) is a long. The String is merely an prettyPrinted encoding. The method that returns the transaction ID should just be called getTxid, and the method that returns the String should be the one that is called something else.

Alternatively, we don't provide any String version, and just add a javadoc that points to Long.toHexString (I think that's what we're using). Or, we could just stop encoding the long in the logs, and just use the number in decimal in the logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but I started working on a FateTxId class, that extends AbstractId. Checkout my comment on #2495 (comment). I like the idea of using a long as the ID but I think we really need a class, like we did with TableId. We could do a cache, which won't be that useful yet since nothing is using it, but may be eventually. The only thing I wasn't sure about was extending AbstractId. If we store the internal long then we may not need to extend AbstractId. If we do, we have to convert it to a String every time its constructed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not so sure about using AbstractId either. That class was mainly for things that had a canonical String representation, and very specifically originated as a concrete Table or Namespace ID type... it was never intended as something for all possible identifiers, certainly not to provide a canonical string representation of a numeric id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about an AbstractNumericId or AbstractLongId? I feel like there has to be others... Only one I could find was flushId:

 // get the flush id before the new memmap is made available for write
    long flushId;
    try {
      flushId = getFlushID();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An abstract numeric ID would certainly be better than trying to reuse the string-based one... but I don't think the need for such a thing is particularly compelling at the moment.

/**
* @return The debug info for the operation on the top of the stack for this Fate operation.
*/
String getDebug();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method name is confusing, because it's not clear what a "debug" is as a noun to "get". Expanding it could help: getDebugInfo or getDebuggingInformation

Copy link
Contributor

@EdColeman EdColeman Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way that this is used - it is the command name (set when the transaction is seeded in Fate.seedTransaction ) something like getCommand that clarified that it is the op that created the fate stack.

Comment on lines 52 to 57
List<String> getHeldLocks();

/**
* @return list of namespace and table ids locked
*/
List<String> getWaitingLocks();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these lists or sets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason the collections for the locks are List types. I am not sure why. The method to retrieve them in AdimUtil creates Maps with Lists:

Map<Long,List<String>> heldLocks,
Map<Long,List<String>> waitingLocks

Comment on lines 59 to 62
/**
* @return The operation on the top of the stack for this Fate operation.
*/
String getTop();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method could be collapsed with getStackInfo into a method that returns an actual stack type (a List or Queue might work in Java), whose top could easily be viewed.

Comment on lines 64 to 72
/**
* @return The timestamp of when the operation was created in ISO format with UTC timezone.
*/
String getTimeCreatedFormatted();

/**
* @return The unformatted form of the timestamp.
*/
long getTimeCreated();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should just return a Date object or similar. The consumer of this API can format it if they like, using the normal APIs for formatting dates. Only one method is needed to return the date type.

@Manno15
Copy link
Contributor

Manno15 commented Jun 25, 2022

I am a little behind on following this PR. What else is remaining to get this ready for review?

@milleruntime
Copy link
Contributor Author

I am a little behind on following this PR. What else is remaining to get this ready for review?

It is still a WIP. Working on the new API design.

@milleruntime milleruntime force-pushed the fate-command-siteConfig branch from 5e08a64 to d743449 Compare June 27, 2022 18:43
@milleruntime
Copy link
Contributor Author

@ctubbsii @EdColeman In my latest changes I added another method to instanceOps and a new enum type to assist with the multiple TXs use case:

void executeFateAction(FateAction action, List<FateTxId> transactions)
      throws AccumuloException, AccumuloSecurityException;

I started writing the javadoc for the API and realized we need to provide some background about how to use some of the methods, such as cancel and fail, that can only be used on FateTransactions with a certain status. I thought the name FateAction is more appropriate for the API, instead of "operation", which refers to the FATE REPO. Then changed the Thrift types to reflect this naming. I also created an enum in FateTransaction for Status that directly mirrors the internal enum TStatus in ReadOnlyTStore. Let me know what you think.

@milleruntime milleruntime force-pushed the fate-command-siteConfig branch from d743449 to 45842aa Compare June 29, 2022 14:30
* @return The debug info for the operation on the top of the stack for this Fate operation.
*/
String getDebug();

Copy link
Contributor

@EdColeman EdColeman Jul 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope we can come up with a better name / description. I think that this entry is used to (or should) describe the command (target op) that creates the stack. That enables determining the overall command that is executing rather than needing to walk the entire fate stack to find the lowest numbered repo (usually repo_000000) but not sure if that must always be the case. When used with getTop() the status of the current op and the target command can be determined.

@milleruntime
Copy link
Contributor Author

There are a lot of conflicts in this PR with recent changes. I don't want to loose the design discussion here so I am going to close this in favor of a branch with all the changes merged together.

@ctubbsii
Copy link
Member

There are a lot of conflicts in this PR with recent changes. I don't want to loose the design discussion here so I am going to close this in favor of a branch with all the changes merged together.

Can you link to the new PR here when it exists, so we can follow the conversation to that PR?

@milleruntime
Copy link
Contributor Author

milleruntime commented Sep 2, 2022

@ctubbsii what do you think of the special methods, fail() and delete() being called on a FateTransaction? These currently require the Manager to be down to free the lock, which I feel is a special circumstance. A user can easily get the Set<FateTransaction> while the Manager is running and call these methods. We could easily throw an error if the Manager is running but it seems like that could trip up users.

@ctubbsii
Copy link
Member

ctubbsii commented Sep 2, 2022

The Manager is supposed to be off in order to prevent FaTE concurrency issues during a delicate manual failure-handling situation. If we're directly manipulating FaTE by executing these special methods, we're already in a "special circumstance". Manual manipulation is not a routine operation. I think it makes sense to continue to leave the requirement to have the Manager be down to prevent issues. That's the current requirement, and changing it could introduce a lot of unforeseen concurrency issues.

Overall, I'm inclined to scrap the idea of adding a FaTE RPC and API so we can access it from the shell. That's the whole basis for these changes... and it's become bloated with all sorts of issues. The FateAdmin command functionality never should have been rolled into the shell to begin with. Instead, I think it should be merged with the more general purpose Admin command, like some of the other recent utilities, and removed from the shell. That will satisfy the initial issue we were trying to fix, which was removing the shell's dependency on SiteConfiguration. And, it has the advantage of not needing any of these proposed RPC/API changes, or changing anything about the requirements for online/offline servers. But, it does make it more accessible, alongside the other admin utilities for server-side maintenance.

@milleruntime
Copy link
Contributor Author

Overall, I'm inclined to scrap the idea of adding a FaTE RPC and API so we can access it from the shell. That's the whole basis for these changes... and it's become bloated with all sorts of issues.

I have been tinkering with new FaTE RPCs for an API and I am starting to feel the same way. It doesn't really make sense to have the tablet server execute RPCs when we are just messing with ZK. And we can't have RPCs run in the manager that require it to be down. Also, FaTE is internal to the Manager and was never really meant to be exposed at such a high level.

I think it should be merged with the more general purpose Admin command.

I like that idea. It looks like FateAdmin has been deprecated for a few versions. Do you think we could drop it for 2.1?

@ctubbsii
Copy link
Member

ctubbsii commented Sep 2, 2022

I like that idea. It looks like FateAdmin has been deprecated for a few versions. Do you think we could drop it for 2.1?

No. We can drop it in 3, along with the fate stuff in the shell. But we should deprecate the fate stuff in the shell.

@milleruntime
Copy link
Contributor Author

But we should deprecate the fate stuff in the shell.

What about the new cancel operation?

@milleruntime
Copy link
Contributor Author

I started work to move the new FateCommand options (cancel and summary) added in 2.1 to the Admin command in #2914

@ctubbsii
Copy link
Member

ctubbsii commented Sep 2, 2022

But we should deprecate the fate stuff in the shell.

What about the new cancel operation?

I seem to remember making the point when that was added, that FaTE wasn't really user-facing. It probably makes more sense to expose that as "cancel scheduled compaction" or cancelling some other scheduled operation. Perhaps a cancel flag can be added to the relevant tasks? Alternatively, just put the new cancel option in the Admin command with fail and delete, and roll back the changes made in the shell.

@EdColeman
Copy link
Contributor

Operators are familiar with the current shell fate operations and may be using them more that expected. I'm not opposed to moving it to the admin command, but the same functionality should be preserved. It would also be nice if the shell could print a message that points to the moved command and at least a hint on how to execute it.

@milleruntime milleruntime deleted the fate-command-siteConfig branch September 21, 2022 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants