Skip to content

Unifying MSQ's exception system and DruidExceptions #14871

@LakshSingla

Description

@LakshSingla

Unifying MSQ's exception system and the new DruidException

Motivation

Currently, MSQ uses the MSQFault classes to represent a fault that has occurred while executing an MSQ task. #14004 recently introduced the DruidException class, which is an attempt to re-structuring the user-visible errors. It makes sense that we should either unify the MSQ's exceptions with DruidExceptions or provide an analog in MSQ for the latter since these faults are delivered to the end user. This feature intends to lay out a proposal for unifying MSQ's exceptions with DruidExceptions.

Current issue with the MSQ's exceptions

Currently, many of the MSQ exceptions mention the error appropriately for someone who has knowledge of MSQ or has been using it for a while, however, they fail to provide an alternative to the user-facing the error. Consequently, if someone has seen and resolved an MSQ exception earlier, it becomes easy for them to debug the cause, while for a new user, it becomes difficult to figure out an alternative. There have been some PRs that improve the error messaging around existing faults (#13794, #14475, #14511). This change would ensure that MSQ's faults and their messaging are constrained to the same conventions as we want the DruidException to be.

The aim of the change would be to align the exceptions generated by MSQ to DruidExceptions, while also allowing MSQ to use these faults to perform its functions (the main ones being worker retry and generating error reports).

MSQ's Exception System

MSQ's exception system consists of 2 main classes:

  1. MSQException - This is a wrapper around the MSQFaults. It doesn't contain any logic of its own and works to extract the exception message from the MSQFault. There is however a hidden dependency of the worker retry on the error message - We deduce the fault type from the message in MSQFaultUtils#getErrorCodeFromMessage. Therefore any change should ensure that this function still performs correctly.
  2. MSQFault & BaseMSQFault - Parent interface and class from all the individual faults are derived. These contain the relevant information to return the error code and the error message.

Unifying

Unifying the two separate error systems, in my opinion, should involve unifying the two interfaces

  1. ErrorResponse with MSQFault - Both of these classes can be used to represent the information across the wire, and both act as a precursor to the final exception message that gets thrown.
  2. DruidException with MSQException - These represent the final exceptions that are thrown to the end user. Both of them can't be serded, and both are a representation of the high-level exception message that gets propagated to the user.

1. Making MSQException a subclass of the DruidException

The end goal should be that we can construct a DruidException from MSQException, which would ensure that MSQ's errors conform to the norms of DruidException. Subclassing would ensure that we always have the necessary info errorCode, targetPersona, and category while constructing the MSQException (which are necessary fields in the DruidExceptions. Also, having distinct classes would ensure that
a) User-facing impact is minimum
b) We don't have to tweak MSQ logic relying on fetching the MSQFault from the exception. (eg: MSQErrorReport#getFaultFromException)

2. Creating a way to construct ErrorResponse from the MSQFault

We would need to expose Persona and Category from the fault, which can be done as minor implementation detail below. We would need to maintain compatibility while upgrading/downgrading between workers and the controller

  1. Extend the MSQFault to expose Persona (Persona getPersona() and Category (Category getCategory())
  2. Extend the MSQFault to expose a map containing Persona and Category

Open question: Would the MSQ's error code be the same as ErrorResponse#errorCode?

3. Cleaning up the MSQFault before returning it to the report

Before adding the fault to the report, we should clear the extra information (category, persona) present in the MSQFault that helped us create the MSQException. This isn't exposed to the end user while showing DruidException therefore MSQ should remove it as well.

4. Updating the error message of various classes - There has been some work done already to make the most commonly encountered error messages more informative. The initiative would involve converting the rest of them as well (or probably in follow-up patches)

Summary of above

@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "errorCode")
public interface MSQFault
{
....
  DruidException.Persona getPersona();

  DruidException.Category getCategory();

  String getErrorCode();
}

public class MSQException extends DruidException
{
....
  public MSQException(
      @Nullable final Throwable cause,
      final MSQFault fault
  )
  {
    super(cause, fault.getErrorCode(), fault.getPersona(), fault.getCategory(), MSQFaultUtils.generateMessageWithErrorCode(fault));
    this.fault = fault;
  }


}

Dev impact

Any new fault added to MSQ will conform to the same restrictions on the DruidExceptions, and hopefully will lead to better and more actionable error messages.

User-facing impact

The user-facing impact would be kept minimal assuming that someone doesn't pattern match the fault's exception message. We should ensure that the error code that we return doesn't change. From my understanding, we won’t be using the error codes set by the Category anywhere in the MSQ layer therefore we won’t be changing this.

Migration

While upgrading or downgrading a cluster between versions that have and don't have this change, we should ensure that we are maintaining compatibility between the different tasks. Since we won't be changing any on the wire transfer format, there shouldn't be any issue.

1. The controller is on a newer version, while the worker is on an older version

The controller would receive an error report (wrapping an MSQ fault) without the required fields. The controller should make sure to create an MSQException with default parameters of the new fields in that case

2. The controller is on an older version, while the worker is on a newer version

The controller would receive an error report (wrapping an MSQ fault) with extra fields. While serializing it back to the MSQFault, the controller would lose those fields and continue to work on the fault as if nothing has changed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Area - DevFor items related to the project itself, like dev docs and checklists, but not CIArea - MSQFor multi stage queries - https://github.com/apache/druid/issues/12262Feature/Change Description

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions