Skip to content

[fix](audit) Mark internal query failures as ERR in audit log#62908

Merged
hello-stephen merged 1 commit intoapache:masterfrom
yujun777:fix/audit-log-internal-query-failure
Apr 29, 2026
Merged

[fix](audit) Mark internal query failures as ERR in audit log#62908
hello-stephen merged 1 commit intoapache:masterfrom
yujun777:fix/audit-log-internal-query-failure

Conversation

@yujun777
Copy link
Copy Markdown
Contributor

@yujun777 yujun777 commented Apr 28, 2026

What problem does this PR solve?

Problem Summary:
When an internal query inside StmtExecutor.executeInternalQuery() failed (for example, the column-statistics gather SQL that ANALYZE issues against a user table when the underlying tablet hits a BE IO error), the audit_log entry recorded:

state=OK | error_code=0 | error_message=<empty> | return_rows=0

This is misleading: the gather query actually failed, but the audit log makes it look like it succeeded with zero rows. ANALYZE itself still surfaces the failure to the user, but the per-internal-query audit entries hide the root cause, complicating triage.

Root cause: executeInternalQuery() wraps the inner work in try { ... } finally { AuditLogHelper.logAuditLog(context, ...) }, but the inner catch handlers only re-throw the exception and never set ConnectContext state to ERR. The default OK state is therefore what gets logged.

Fix: add an outer catch (Exception e) around the inner try that, when state has not already been moved to ERR, records ERR_INTERNAL_ERROR together with the message (falling back to root-cause message when the exception message is empty), then re-throws so callers behave as before. The setNereids/setIsQuery/setInternal flags are also moved above the parse step so audit entries for parse/plan failures still carry the right metadata.

Release note

Internal query failures are now correctly recorded as ERR in audit_log instead of misleadingly showing OK with empty error info.

Check List (For Author)

  • Test:
    • Unit Test: StmtExecutorInternalQueryTest#testExecuteInternalQuerySetsErrorStateOnFailure
    • Regression test: fault_injection_p0/test_audit_log_internal_query_failure
    • Manual test: reproduced on a local cluster with the LocalFileReader::read_at_impl.io_error debug point; before the fix audit_log shows state=OK / error_message=, after the fix it shows state=ERR / error_code=1815 / full IO_ERROR description.
  • Behavior changed: Yes (audit_log entries for failed internal queries now show ERR; previously they showed OK).
  • Does this need documentation: No.

### What problem does this PR solve?

Issue Number: https://jira.selectdb-in.cc/browse/CIR-20019

Problem Summary:
When an internal query inside StmtExecutor.executeInternalQuery() failed
(for example, the column-statistics gather SQL that ANALYZE issues
against a user table when the underlying tablet hits a BE IO error),
the audit_log entry recorded:

  state=OK | error_code=0 | error_message=<empty> | return_rows=0

This is misleading: the gather query actually failed, but the audit
log makes it look like it succeeded with zero rows. ANALYZE itself
still surfaces the failure to the user, but the per-internal-query
audit entries hide the root cause, complicating triage.

Root cause: executeInternalQuery() wraps the inner work in
try { ... } finally { AuditLogHelper.logAuditLog(context, ...) }, but
the inner catch handlers only re-throw the exception and never set
ConnectContext state to ERR. The default OK state is therefore what
gets logged.

Fix: add an outer catch (Exception e) around the inner try that, when
state has not already been moved to ERR, records ERR_INTERNAL_ERROR
together with the message (falling back to root-cause message when
the exception message is empty), then re-throws so callers behave as
before. The setNereids/setIsQuery/setInternal flags are also moved
above the parse step so audit entries for parse/plan failures still
carry the right metadata.

### Release note

Internal query failures are now correctly recorded as ERR in
audit_log instead of misleadingly showing OK with empty error info.

### Check List (For Author)

- Test:
    - Unit Test: StmtExecutorInternalQueryTest#testExecuteInternalQuerySetsErrorStateOnFailure
    - Regression test: fault_injection_p0/test_audit_log_internal_query_failure
    - Manual test: reproduced on a local cluster with the
      LocalFileReader::read_at_impl.io_error debug point; before the
      fix audit_log shows state=OK / error_message=, after the fix
      it shows state=ERR / error_code=1815 / full IO_ERROR description.
- Behavior changed: Yes (audit_log entries for failed internal
  queries now show ERR; previously they showed OK).
- Does this need documentation: No.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yujun777
Copy link
Copy Markdown
Contributor Author

run buildall

@yujun777
Copy link
Copy Markdown
Contributor Author

/review

@yujun777 yujun777 marked this pull request as draft April 28, 2026 10:10
@yujun777 yujun777 marked this pull request as ready for review April 28, 2026 10:10
@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Apr 28, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Copy Markdown
Contributor

PR approved by anyone and no changes requested.

@morrySnow
Copy link
Copy Markdown
Contributor

/review

@hello-stephen
Copy link
Copy Markdown
Contributor

skip buildall

1 similar comment
@hello-stephen
Copy link
Copy Markdown
Contributor

skip buildall

@hello-stephen hello-stephen merged commit 53c52f5 into apache:master Apr 29, 2026
39 checks passed
github-actions Bot pushed a commit that referenced this pull request Apr 29, 2026
### What problem does this PR solve?

Problem Summary:
When an internal query inside `StmtExecutor.executeInternalQuery()`
failed (for example, the column-statistics gather SQL that ANALYZE
issues against a user table when the underlying tablet hits a BE IO
error), the audit_log entry recorded:

```
state=OK | error_code=0 | error_message=<empty> | return_rows=0
```

This is misleading: the gather query actually failed, but the audit log
makes it look like it succeeded with zero rows. ANALYZE itself still
surfaces the failure to the user, but the per-internal-query audit
entries hide the root cause, complicating triage.

Root cause: `executeInternalQuery()` wraps the inner work in `try { ...
} finally { AuditLogHelper.logAuditLog(context, ...) }`, but the inner
catch handlers only re-throw the exception and never set
`ConnectContext` state to ERR. The default OK state is therefore what
gets logged.

Fix: add an outer `catch (Exception e)` around the inner try that, when
state has not already been moved to ERR, records `ERR_INTERNAL_ERROR`
together with the message (falling back to root-cause message when the
exception message is empty), then re-throws so callers behave as before.
The `setNereids/setIsQuery/setInternal` flags are also moved above the
parse step so audit entries for parse/plan failures still carry the
right metadata.

### Release note

Internal query failures are now correctly recorded as ERR in audit_log
instead of misleadingly showing OK with empty error info.

### Check List (For Author)

- Test:
- Unit Test:
`StmtExecutorInternalQueryTest#testExecuteInternalQuerySetsErrorStateOnFailure`
- Regression test:
`fault_injection_p0/test_audit_log_internal_query_failure`
- Manual test: reproduced on a local cluster with the
`LocalFileReader::read_at_impl.io_error` debug point; before the fix
audit_log shows `state=OK / error_message=`, after the fix it shows
`state=ERR / error_code=1815 / full IO_ERROR description`.
- Behavior changed: Yes (audit_log entries for failed internal queries
now show ERR; previously they showed OK).
- Does this need documentation: No.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
github-actions Bot pushed a commit that referenced this pull request Apr 29, 2026
### What problem does this PR solve?

Problem Summary:
When an internal query inside `StmtExecutor.executeInternalQuery()`
failed (for example, the column-statistics gather SQL that ANALYZE
issues against a user table when the underlying tablet hits a BE IO
error), the audit_log entry recorded:

```
state=OK | error_code=0 | error_message=<empty> | return_rows=0
```

This is misleading: the gather query actually failed, but the audit log
makes it look like it succeeded with zero rows. ANALYZE itself still
surfaces the failure to the user, but the per-internal-query audit
entries hide the root cause, complicating triage.

Root cause: `executeInternalQuery()` wraps the inner work in `try { ...
} finally { AuditLogHelper.logAuditLog(context, ...) }`, but the inner
catch handlers only re-throw the exception and never set
`ConnectContext` state to ERR. The default OK state is therefore what
gets logged.

Fix: add an outer `catch (Exception e)` around the inner try that, when
state has not already been moved to ERR, records `ERR_INTERNAL_ERROR`
together with the message (falling back to root-cause message when the
exception message is empty), then re-throws so callers behave as before.
The `setNereids/setIsQuery/setInternal` flags are also moved above the
parse step so audit entries for parse/plan failures still carry the
right metadata.

### Release note

Internal query failures are now correctly recorded as ERR in audit_log
instead of misleadingly showing OK with empty error info.

### Check List (For Author)

- Test:
- Unit Test:
`StmtExecutorInternalQueryTest#testExecuteInternalQuerySetsErrorStateOnFailure`
- Regression test:
`fault_injection_p0/test_audit_log_internal_query_failure`
- Manual test: reproduced on a local cluster with the
`LocalFileReader::read_at_impl.io_error` debug point; before the fix
audit_log shows `state=OK / error_message=`, after the fix it shows
`state=ERR / error_code=1815 / full IO_ERROR description`.
- Behavior changed: Yes (audit_log entries for failed internal queries
now show ERR; previously they showed OK).
- Does this need documentation: No.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/4.0.x dev/4.1.x reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants