-
Notifications
You must be signed in to change notification settings - Fork 476
New FateOperations API #2780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New FateOperations API #2780
Conversation
ctubbsii
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually, I like the idea. However, I think the API could be streamlined.
| List<FateTransactionStatus> fateStatus(Set<String> txids, List<String> tStatus) | ||
| throws AccumuloException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you had methods, fail() and delete() on FateTransactionStatus, this list (or set, really.. since the items should be unique) is all you'd really need, and you wouldn't need the separate FateOperations interface. It could also be a stream, so you can do things like:
client.instanceOperations().fateTransactions()
.filter(tx -> tx.getCreatedTime().before(threeDaysAgo))
.peek(FateTransaction::fail)
.filter(tx -> tx.getCreatedTime().before(fiveDaysAgo))
.forEach(FateTransaction::delete);Returning it as a set and calling .stream() on the set would work too. The main point is that FateTransactionImpl should have a reference to the client, so you can act on it with fail() and delete() operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will work. Here is a method in FateCommand that I got to compile:
public void failTx(Shell shellState, List txids) throws AccumuloException {
shellState.getAccumuloClient().instanceOperations().fateTransactions().stream()
.filter(fateTransaction -> txids.contains(fateTransaction.getId().canonical().toString()))
.forEach(FateTransaction::fail);
}
Some things to note if we go with the fail() method hanging off of FateTransactionImpl. You will only be able to make calls to fail a single TX at a time. And we won't be able to have any checked exceptions thrown from fail(). So the FateTransaction interface method will just be:
/** * Fails the fate transaction. * * @since 2.1.0 */ void fail();
I guess we could have another instance method that takes a Set<FateTransaction> if you want to fail a bunch at once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failing a bunch is not uncommon - I'm not sure if the shell allows multiples right now, or not, but one way to handle it was to create a command file passed to the shell that had the individual txids / commands:
fate fail txid1
fate fail txid2
...
It would be friendly if this could be fate fail txid1 txid2... in the shell, and if the API supported it.
core/src/main/java/org/apache/accumulo/core/client/admin/FateTransactionStatus.java
Outdated
Show resolved
Hide resolved
| String getTxid(); | ||
|
|
||
| /** | ||
| * @return This fate operations transaction id, in its original long form. | ||
| */ | ||
| long getTxidLong(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these might be a bit backwards. The transaction ID (what should be returned from a getter of the transaction ID) is a long. The String is merely an prettyPrinted encoding. The method that returns the transaction ID should just be called getTxid, and the method that returns the String should be the one that is called something else.
Alternatively, we don't provide any String version, and just add a javadoc that points to Long.toHexString (I think that's what we're using). Or, we could just stop encoding the long in the logs, and just use the number in decimal in the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree but I started working on a FateTxId class, that extends AbstractId. Checkout my comment on #2495 (comment). I like the idea of using a long as the ID but I think we really need a class, like we did with TableId. We could do a cache, which won't be that useful yet since nothing is using it, but may be eventually. The only thing I wasn't sure about was extending AbstractId. If we store the internal long then we may not need to extend AbstractId. If we do, we have to convert it to a String every time its constructed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not so sure about using AbstractId either. That class was mainly for things that had a canonical String representation, and very specifically originated as a concrete Table or Namespace ID type... it was never intended as something for all possible identifiers, certainly not to provide a canonical string representation of a numeric id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about an AbstractNumericId or AbstractLongId? I feel like there has to be others... Only one I could find was flushId:
// get the flush id before the new memmap is made available for write
long flushId;
try {
flushId = getFlushID();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An abstract numeric ID would certainly be better than trying to reuse the string-based one... but I don't think the need for such a thing is particularly compelling at the moment.
core/src/main/java/org/apache/accumulo/core/client/admin/FateTransactionStatus.java
Outdated
Show resolved
Hide resolved
| /** | ||
| * @return The debug info for the operation on the top of the stack for this Fate operation. | ||
| */ | ||
| String getDebug(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method name is confusing, because it's not clear what a "debug" is as a noun to "get". Expanding it could help: getDebugInfo or getDebuggingInformation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way that this is used - it is the command name (set when the transaction is seeded in Fate.seedTransaction ) something like getCommand that clarified that it is the op that created the fate stack.
| List<String> getHeldLocks(); | ||
|
|
||
| /** | ||
| * @return list of namespace and table ids locked | ||
| */ | ||
| List<String> getWaitingLocks(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these lists or sets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason the collections for the locks are List types. I am not sure why. The method to retrieve them in AdimUtil creates Maps with Lists:
Map<Long,List<String>> heldLocks,
Map<Long,List<String>> waitingLocks| /** | ||
| * @return The operation on the top of the stack for this Fate operation. | ||
| */ | ||
| String getTop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method could be collapsed with getStackInfo into a method that returns an actual stack type (a List or Queue might work in Java), whose top could easily be viewed.
| /** | ||
| * @return The timestamp of when the operation was created in ISO format with UTC timezone. | ||
| */ | ||
| String getTimeCreatedFormatted(); | ||
|
|
||
| /** | ||
| * @return The unformatted form of the timestamp. | ||
| */ | ||
| long getTimeCreated(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should just return a Date object or similar. The consumer of this API can format it if they like, using the normal APIs for formatting dates. Only one method is needed to return the date type.
29adc49 to
e1b2c4a
Compare
|
I am a little behind on following this PR. What else is remaining to get this ready for review? |
It is still a WIP. Working on the new API design. |
5e08a64 to
d743449
Compare
|
@ctubbsii @EdColeman In my latest changes I added another method to instanceOps and a new enum type to assist with the multiple TXs use case: void executeFateAction(FateAction action, List<FateTxId> transactions)
throws AccumuloException, AccumuloSecurityException;I started writing the javadoc for the API and realized we need to provide some background about how to use some of the methods, such as |
d743449 to
45842aa
Compare
| * @return The debug info for the operation on the top of the stack for this Fate operation. | ||
| */ | ||
| String getDebug(); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope we can come up with a better name / description. I think that this entry is used to (or should) describe the command (target op) that creates the stack. That enables determining the overall command that is executing rather than needing to walk the entire fate stack to find the lowest numbered repo (usually repo_000000) but not sure if that must always be the case. When used with getTop() the status of the current op and the target command can be determined.
|
There are a lot of conflicts in this PR with recent changes. I don't want to loose the design discussion here so I am going to close this in favor of a branch with all the changes merged together. |
Can you link to the new PR here when it exists, so we can follow the conversation to that PR? |
|
@ctubbsii what do you think of the special methods, |
|
The Manager is supposed to be off in order to prevent FaTE concurrency issues during a delicate manual failure-handling situation. If we're directly manipulating FaTE by executing these special methods, we're already in a "special circumstance". Manual manipulation is not a routine operation. I think it makes sense to continue to leave the requirement to have the Manager be down to prevent issues. That's the current requirement, and changing it could introduce a lot of unforeseen concurrency issues. Overall, I'm inclined to scrap the idea of adding a FaTE RPC and API so we can access it from the shell. That's the whole basis for these changes... and it's become bloated with all sorts of issues. The FateAdmin command functionality never should have been rolled into the shell to begin with. Instead, I think it should be merged with the more general purpose Admin command, like some of the other recent utilities, and removed from the shell. That will satisfy the initial issue we were trying to fix, which was removing the shell's dependency on SiteConfiguration. And, it has the advantage of not needing any of these proposed RPC/API changes, or changing anything about the requirements for online/offline servers. But, it does make it more accessible, alongside the other admin utilities for server-side maintenance. |
I have been tinkering with new FaTE RPCs for an API and I am starting to feel the same way. It doesn't really make sense to have the tablet server execute RPCs when we are just messing with ZK. And we can't have RPCs run in the manager that require it to be down. Also, FaTE is internal to the Manager and was never really meant to be exposed at such a high level.
I like that idea. It looks like FateAdmin has been deprecated for a few versions. Do you think we could drop it for 2.1? |
No. We can drop it in 3, along with the fate stuff in the shell. But we should deprecate the fate stuff in the shell. |
What about the new cancel operation? |
|
I started work to move the new FateCommand options (cancel and summary) added in 2.1 to the Admin command in #2914 |
I seem to remember making the point when that was added, that FaTE wasn't really user-facing. It probably makes more sense to expose that as "cancel scheduled compaction" or cancelling some other scheduled operation. Perhaps a cancel flag can be added to the relevant tasks? Alternatively, just put the new cancel option in the Admin command with fail and delete, and roll back the changes made in the shell. |
|
Operators are familiar with the current shell fate operations and may be using them more that expected. I'm not opposed to moving it to the admin command, but the same functionality should be preserved. It would also be nice if the shell could print a message that points to the moved command and at least a hint on how to execute it. |
Changes taken from #2215 with some improvements.