-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Description
Unifying MSQ's exception system and the new DruidException
Motivation
Currently, MSQ uses the MSQFault classes to represent a fault that has occurred while executing an MSQ task. #14004 recently introduced the DruidException class, which is an attempt to re-structuring the user-visible errors. It makes sense that we should either unify the MSQ's exceptions with DruidExceptions or provide an analog in MSQ for the latter since these faults are delivered to the end user. This feature intends to lay out a proposal for unifying MSQ's exceptions with DruidExceptions.
Current issue with the MSQ's exceptions
Currently, many of the MSQ exceptions mention the error appropriately for someone who has knowledge of MSQ or has been using it for a while, however, they fail to provide an alternative to the user-facing the error. Consequently, if someone has seen and resolved an MSQ exception earlier, it becomes easy for them to debug the cause, while for a new user, it becomes difficult to figure out an alternative. There have been some PRs that improve the error messaging around existing faults (#13794, #14475, #14511). This change would ensure that MSQ's faults and their messaging are constrained to the same conventions as we want the DruidException to be.
The aim of the change would be to align the exceptions generated by MSQ to DruidExceptions, while also allowing MSQ to use these faults to perform its functions (the main ones being worker retry and generating error reports).
MSQ's Exception System
MSQ's exception system consists of 2 main classes:
MSQException- This is a wrapper around theMSQFaults. It doesn't contain any logic of its own and works to extract the exception message from theMSQFault. There is however a hidden dependency of the worker retry on the error message - We deduce the fault type from the message inMSQFaultUtils#getErrorCodeFromMessage. Therefore any change should ensure that this function still performs correctly.MSQFault&BaseMSQFault- Parent interface and class from all the individual faults are derived. These contain the relevant information to return the error code and the error message.
Unifying
Unifying the two separate error systems, in my opinion, should involve unifying the two interfaces
ErrorResponsewithMSQFault- Both of these classes can be used to represent the information across the wire, and both act as a precursor to the final exception message that gets thrown.DruidExceptionwithMSQException- These represent the final exceptions that are thrown to the end user. Both of them can't be serded, and both are a representation of the high-level exception message that gets propagated to the user.
1. Making MSQException a subclass of the DruidException
The end goal should be that we can construct a DruidException from MSQException, which would ensure that MSQ's errors conform to the norms of DruidException. Subclassing would ensure that we always have the necessary info errorCode, targetPersona, and category while constructing the MSQException (which are necessary fields in the DruidExceptions. Also, having distinct classes would ensure that
a) User-facing impact is minimum
b) We don't have to tweak MSQ logic relying on fetching the MSQFault from the exception. (eg: MSQErrorReport#getFaultFromException)
2. Creating a way to construct ErrorResponse from the MSQFault
We would need to expose Persona and Category from the fault, which can be done as minor implementation detail below. We would need to maintain compatibility while upgrading/downgrading between workers and the controller
- Extend the MSQFault to expose Persona (
Persona getPersona()and Category (Category getCategory()) - Extend the MSQFault to expose a map containing Persona and Category
Open question: Would the MSQ's error code be the same as ErrorResponse#errorCode?
3. Cleaning up the MSQFault before returning it to the report
Before adding the fault to the report, we should clear the extra information (category, persona) present in the MSQFault that helped us create the MSQException. This isn't exposed to the end user while showing DruidException therefore MSQ should remove it as well.
4. Updating the error message of various classes - There has been some work done already to make the most commonly encountered error messages more informative. The initiative would involve converting the rest of them as well (or probably in follow-up patches)
Summary of above
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "errorCode")
public interface MSQFault
{
....
DruidException.Persona getPersona();
DruidException.Category getCategory();
String getErrorCode();
}
public class MSQException extends DruidException
{
....
public MSQException(
@Nullable final Throwable cause,
final MSQFault fault
)
{
super(cause, fault.getErrorCode(), fault.getPersona(), fault.getCategory(), MSQFaultUtils.generateMessageWithErrorCode(fault));
this.fault = fault;
}
}Dev impact
Any new fault added to MSQ will conform to the same restrictions on the DruidExceptions, and hopefully will lead to better and more actionable error messages.
User-facing impact
The user-facing impact would be kept minimal assuming that someone doesn't pattern match the fault's exception message. We should ensure that the error code that we return doesn't change. From my understanding, we won’t be using the error codes set by the Category anywhere in the MSQ layer therefore we won’t be changing this.
Migration
While upgrading or downgrading a cluster between versions that have and don't have this change, we should ensure that we are maintaining compatibility between the different tasks. Since we won't be changing any on the wire transfer format, there shouldn't be any issue.
1. The controller is on a newer version, while the worker is on an older version
The controller would receive an error report (wrapping an MSQ fault) without the required fields. The controller should make sure to create an MSQException with default parameters of the new fields in that case
2. The controller is on an older version, while the worker is on a newer version
The controller would receive an error report (wrapping an MSQ fault) with extra fields. While serializing it back to the MSQFault, the controller would lose those fields and continue to work on the fault as if nothing has changed.