Skip to content

bug: Incorrect implementation of Hex Literal X'...' (Inconsistent with SQL Standard) #19600

@youngsofun

Description

@youngsofun

Search before asking

  • I had searched in the issues and found no similar issues.

Version

1.2.890

What's Wrong?

1. Description

Currently, Databend treats the hex literal X'...' as a PostgreSQL Hex Integer Literal (PGLiteralHex) in the lexer/parser, which is then interpreted as an Integer.

// Current problematic regex in Databend lexer
#[regex(r"[xX]'[a-fA-F0-9]*'")]
PGLiteralHex,

However, this implementation deviates from the SQL Standard, confuses the behavior of PostgreSQL, and creates functional limitations for binary data.

2. Evidence of Non-Compliance

A. SQL Standard (ISO/IEC 9075)

According to the SQL:1999 (and subsequent versions like SQL:2016) specification, Section 5.3 <literal>, the syntax X'<hexit>...' is explicitly defined as a <binary string literal>.

  • It is intended to represent BINARY or BLOB data types.
  • It is not defined as an integer literal.

B. Misconception of PostgreSQL Behavior

The naming PGLiteralHex suggests alignment with PostgreSQL, but PG's behavior is different:

  • In PostgreSQL, X'0A' results in a Bit String (BIT VARYING), not an Integer.
  • SELECT X'0A' + 1; fails in PG because a Bit String cannot be implicitly used as an Integer.
  • PostgreSQL 16+ introduced 0x... for hex integers, but X'...' remains a Bit String.

3. Comparison of Database Implementations

The following table shows that Databend's current approach is an outlier compared to major SQL engines:

Implementation Group Databases Result Type of X'0A' SQL Standard Compliant?
Standard-Compliant DuckDB, Trino, Snowflake, MySQL, SQLite Binary / BLOB (Byte Stream) Yes
Dialect (Bit-string) PostgreSQL BIT VARYING (Bit Array) No
Non-Standard SQL Server Syntax Error (Only supports 0x) No
Databend (Current) Databend Integer No

4. Impact & Risks

  1. Integer Overflow: Binary literals (like SHA-256 hashes X'df4... ') frequently exceed 64 bits. Forcing them into an Integer type causes overflow errors where they should have been valid BINARY data.
  2. Migration Barrier: Users migrating from Snowflake, MySQL, or ClickHouse expect X'...' to work for binary ingestion.
  3. Inconsistent Logic: Databend lacks a standard-compliant way to input BINARY literals if X'...' is occupied by the Integer type.

5. Suggested Fix

  1. Refactor Lexer: Map X'...' literals to a Binary type instead of PGLiteralHex (Integer).

Reference

How to Reproduce?

root@localhost:8000/default/default> select X'ABCD';

SELECT 43981

╭────────╮
│  43981 │
│ UInt16 │
├────────┤
│  43981 │
╰────────╯

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-bugCategory: something isn't workingagent-issueAgent-created issue marker

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions