[feat](sql-parser) Split SQL grammar into standalone fe-sql-parser #63823
Split SQL syntax parsing out of fe-core into a new fe-sql-parser module that can be packaged as an independent jar for external consumers, with no semantic analysis and no Catalog or LogicalPlan dependencies. This commit covers the new module only; fe-core is wired to depend on it in a follow-up commit and does not compile until then.
- Move DorisLexer.g4 / DorisParser.g4 and 8 supporting Java files (CaseInsensitiveStream, Origin, ParserUtils, ParseErrorListener, PostProcessor, ParseException, SyntaxParseException, QueryParsingErrors). Package names are preserved so downstream imports do not move.
- ParseException now extends RuntimeException directly to break the chain through AnalysisException, which references LogicalPlan.
- ParserUtils drops the MoreFieldsThread fast path; the new module does not pull in fe-core thread-local context.
- Add org.apache.doris.sqlparser.DorisSqlParser facade and 7 unit tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fe-core now consumes DorisLexer / DorisParser and the parser support classes from fe-sql-parser instead of generating them itself. The generated ANTLR classes keep the same fully qualified names, so all existing imports inside fe-core continue to resolve without change. fe-core's own antlr4-maven-plugin configuration is unchanged and now only processes the remaining JavaLexer.g4 / JavaParser.g4 (used by the Nereids pattern-generator annotation processor); SQL grammar generation moved with the .g4 files into fe-sql-parser. Verified: full fe reactor compiles, fe-core's 15 nereids parser test classes (160 cases) pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reintroduce the per-thread field-based fast path that PR apache#52125 added for ParserUtils. To keep fe-sql-parser independent of fe-core, the fast path is now gated by a minimal `OriginAware` interface defined in fe-sql-parser and implemented by fe-core's `MoreFieldsThread`. Threads that don't implement it continue to use the ThreadLocal slow path — correctness is identical. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
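For readers unfamiliar with the pattern, the gating described in this commit can be sketched roughly as follows. This is a minimal, self-contained illustration only: the method names on `OriginAware`, the `FieldThread` class, and the use of a `String` in place of the real `Origin` type are all assumptions, not the actual fe-sql-parser code.

```java
/** Hypothetical sketch of an OriginAware-style fast path; not the real fe-sql-parser API. */
final class OriginFastPathSketch {
    /** Stand-in for the small interface the parser module would own (method names assumed). */
    interface OriginAware {
        String getOrigin();
        void setOrigin(String origin);
    }

    /** Stand-in for a host-side thread type; fe-core's MoreFieldsThread plays this role. */
    static class FieldThread extends Thread implements OriginAware {
        private String origin; // only read and written by this thread itself
        FieldThread(Runnable task) { super(task); }
        @Override public String getOrigin() { return origin; }
        @Override public void setOrigin(String origin) { this.origin = origin; }
    }

    private static final ThreadLocal<String> SLOW_PATH = new ThreadLocal<>();

    static void setOrigin(String origin) {
        Thread t = Thread.currentThread();
        if (t instanceof OriginAware) {
            ((OriginAware) t).setOrigin(origin); // fast path: plain field write
        } else {
            SLOW_PATH.set(origin);               // slow path: ThreadLocal
        }
    }

    static String getOrigin() {
        Thread t = Thread.currentThread();
        return t instanceof OriginAware ? ((OriginAware) t).getOrigin() : SLOW_PATH.get();
    }

    /** Sets and reads an origin on a FieldThread, returning what that thread observed. */
    static String roundTripOnFieldThread(String origin) {
        String[] seen = new String[1];
        Thread t = new FieldThread(() -> { setOrigin(origin); seen[0] = getOrigin(); });
        t.start();
        try {
            t.join(); // join makes the write to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0];
    }

    public static void main(String[] args) {
        setOrigin("line 1, pos 0");
        System.out.println("plain thread: " + getOrigin());
        System.out.println("field thread: " + roundTripOnFieldThread("line 2, pos 5"));
    }
}
```

Both paths store the value per thread, which is why callers see the same result either way; only the lookup cost differs.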
`mvn -pl fe-sql-parser -Pcli package` produces a self-contained `fe-sql-parser-*-cli.jar` (~1.7 MB after minimize-shade) with `org.apache.doris.sqlparser.DorisSqlParserCli` as Main-Class. The CLI reads SQL from a positional argument, `-e`, `-f`, or stdin; parses as single statement (default), `--multi`, or `--expression`; and prints the ANTLR parse tree either LISP-style or `--pretty` indented. Exit codes are 0 success / 1 parse error / 2 usage or I/O error. The CLI is gated behind the `cli` Maven profile so default builds do not pay the shading cost. The thin jar consumed by fe-core is unchanged. Also adds README.md covering both build modes (library jar and CLI), end-to-end CLI examples, library and Visitor usage, configuration flags, the OriginAware fast-path SPI, and current caveats. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
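The exit-code contract above (0 success, 1 parse error, 2 usage or I/O error) could be wired up along these lines. This is a hedged sketch, not `DorisSqlParserCli` itself: the class, the `FakeParseException` type, and the toy parser are hypothetical stand-ins, and only `-e` and a positional argument are handled here.

```java
import java.util.function.Function;

/** Hypothetical sketch of the 0 / 1 / 2 exit-code mapping; not the real CLI. */
final class CliExitCodeSketch {
    static final int OK = 0;
    static final int PARSE_ERROR = 1;
    static final int USAGE_ERROR = 2;

    /** Stand-in for whatever exception the real parser throws on bad SQL. */
    static class FakeParseException extends RuntimeException {
        FakeParseException(String message) { super(message); }
    }

    /** Toy parser (assumption): accepts only input that starts with SELECT. */
    static String toyParse(String sql) {
        String trimmed = sql.trim();
        if (!trimmed.toUpperCase().startsWith("SELECT")) {
            throw new FakeParseException("syntax error near: " + trimmed);
        }
        return "(statement " + trimmed + ")";
    }

    static int run(String[] args, Function<String, String> parse) {
        String sql = null;
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-e":
                    if (++i >= args.length) return USAGE_ERROR; // -e needs a value
                    sql = args[i];
                    break;
                default:
                    if (args[i].startsWith("-") || sql != null) return USAGE_ERROR;
                    sql = args[i]; // positional SQL argument
            }
        }
        if (sql == null) return USAGE_ERROR; // stdin and -f handling omitted in this sketch
        try {
            System.out.println(parse.apply(sql));
            return OK;
        } catch (FakeParseException e) {
            System.err.println(e.getMessage());
            return PARSE_ERROR;
        }
    }

    public static void main(String[] args) {
        System.exit(run(args, CliExitCodeSketch::toyParse));
    }
}
```

Keeping `run` free of `System.exit` makes the mapping testable without spawning a process.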
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an "Extending the Parser" section to fe/fe-sql-parser/README.md covering the four mechanisms downstream projects can use without modifying fe-sql-parser itself: Visitor subclasses, Listener subclasses, live ParseTreeListener hooks via newLexer/newParser, and wrapping the DorisSqlParser facade. Includes worked examples for SQL lineage extraction, policy/audit listeners, hint collection during parsing, and an instrumented facade with caching and metrics, plus tips on locating rule names and debugging visitors with the CLI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
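As a rough illustration of the fourth mechanism (wrapping the facade), an instrumented wrapper with caching and hit/miss counters might look like the sketch below. The wrapper class and its method names are hypothetical, and the delegate is modeled as a plain `Function<String, T>` because the facade's exact signatures are not shown in this conversation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

/** Hypothetical caching/metrics wrapper around any "SQL text -> parse result" function. */
final class InstrumentedParserSketch<T> {
    private final Function<String, T> delegate;
    private final Map<String, T> cache = new ConcurrentHashMap<>(); // unbounded; bound it in real use
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();

    InstrumentedParserSketch(Function<String, T> delegate) {
        this.delegate = delegate;
    }

    T parse(String sql) {
        T cached = cache.get(sql);
        if (cached != null) {
            hits.incrementAndGet();
            return cached;
        }
        misses.incrementAndGet();
        T result = delegate.apply(sql); // exceptions propagate; failures are not cached
        cache.put(sql, result);
        return result;
    }

    long hits() { return hits.get(); }
    long misses() { return misses.get(); }

    public static void main(String[] args) {
        InstrumentedParserSketch<String> parser =
                new InstrumentedParserSketch<>(sql -> "(parsed " + sql + ")");
        parser.parse("SELECT 1");
        parser.parse("SELECT 1");
        System.out.println("hits=" + parser.hits() + " misses=" + parser.misses());
    }
}
```

Two threads that miss on the same text at the same time will both call the delegate, which is harmless when parsing is deterministic. Cached parse trees should be treated as read-only by callers.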
run buildall
/review
Review summary: no blocking issues found in the actual PR file set.
Critical checkpoints:
- Goal/test: The PR extracts the Doris Nereids grammar/support classes into a standalone fe-sql-parser module and wires fe-core to depend on it. The module-level tests cover statement, multi-statement, expression, DDL, and malformed SQL parsing.
- Scope/focus: The change is focused on parser modularization plus the MoreFieldsThread OriginAware bridge. No unrelated code-path changes were present in the GitHub PR file list.
- Concurrency/lifecycle: DorisSqlParser is stateless; shared ParseErrorListener/PostProcessor instances do not keep mutable per-parse state. Origin tracking remains thread-confined through OriginAware/ThreadLocal.
- Config/compatibility/storage: No new configs, storage formats, or FE-BE protocol changes.
- Parallel paths: fe-core now consumes the extracted parser module; parser support classes moved with the grammar, and MoreFieldsThread implements the new OriginAware interface.
- Tests/results: Ran mvn -pl fe-sql-parser -DskipTests validate and mvn -pl fe-sql-parser test successfully. Attempted mvn -pl fe-core -am -DskipTests compile, but the runner lacks thirdparty/installed/bin/thrift and the build failed in fe-thrift before reaching fe-sql-parser/fe-core.
- Observability/transactions/data correctness: Not applicable; this is parser packaging/API work and does not affect data visibility, transactions, MoW delete bitmap, or persistence.
- Performance: No obvious performance regression found; the existing fast Origin path is preserved via OriginAware.
User focus: No additional user-provided review focus was present.
TPC-H: Total hot run time: 31019 ms
TPC-DS: Total hot run time: 173189 ms
FE Regression Coverage Report: Increment line coverage
…ance
Earlier in this PR `ParseException` was changed to extend `RuntimeException`
directly so that fe-sql-parser would not pull in `LogicalPlan` (referenced
by `nereids.exceptions.AnalysisException`). That broke catch sites in
fe-core that rely on `try { ... } catch (AnalysisException) { ... }` to
also catch parse-time failures — observed concretely in
`CreateTableTest#testAbnormal`, which uses
`expectThrowsWithMsg(AnalysisException.class, ...)` against SQL that the
parser rejects.
Fix by moving `AnalysisException` itself into fe-sql-parser. The
`Optional<LogicalPlan> plan` field was dead code (no fe-core caller ever
populated it; only `CastException` passed `Optional.empty()` explicitly),
so it is removed along with the `plan.treeString()` branch in
`getMessage()`. Everything else about `AnalysisException` is unchanged:
the package, the public constructors that fe-core actually calls, the
`ErrorCode` enum, and the line/pos rendering.
With that, `ParseException extends AnalysisException` is restored and
fe-core's catch chains behave as before. fe-core's `CastException` is
updated to drop its `Optional.empty()` plan argument so it matches the
slimmed-down 4-arg constructor.
Verified:
- fe-sql-parser unit tests (7) pass.
- Full fe reactor compiles.
- `CreateTableTest#testAbnormal` passes.
- 15 nereids parser test classes (160 cases) pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
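The catch-site breakage described in this commit can be reproduced in miniature. The classes below are self-contained stand-ins that only share names with the Doris types; `DetachedParseException` is a hypothetical name for the intermediate state in which the parse exception extended `RuntimeException` directly.

```java
/** Stand-in classes illustrating why the exception hierarchy matters to existing catch sites. */
final class CatchChainSketch {
    static class AnalysisException extends RuntimeException {
        AnalysisException(String message) { super(message); }
    }

    /** Restored shape: parse failures are a kind of analysis failure. */
    static class ParseException extends AnalysisException {
        ParseException(String message) { super(message); }
    }

    /** Intermediate shape (hypothetical name): detached from AnalysisException. */
    static class DetachedParseException extends RuntimeException {
        DetachedParseException(String message) { super(message); }
    }

    /** Mimics a caller that only expects AnalysisException from the parse step. */
    static String classify(Runnable parseStep) {
        try {
            parseStep.run();
            return "ok";
        } catch (AnalysisException e) {
            return "handled: " + e.getMessage();
        } catch (RuntimeException e) {
            return "escaped: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(() -> { throw new ParseException("bad SQL"); }));
        System.out.println(classify(() -> { throw new DetachedParseException("bad SQL"); }));
    }
}
```

With the restored hierarchy the first call lands in the `AnalysisException` handler; with the detached variant the same failure skips that handler, which is the behavior that broke `CreateTableTest#testAbnormal`.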
run buildall
/review
Review result: no blocking issues found in the PR scope.
Critical checkpoint conclusions:
- Goal/test proof: The PR splits SQL grammar/support into `fe-sql-parser`, wires `fe-core` to depend on it, restores `ParseException -> AnalysisException`, and adds parser facade/CLI tests. The included unit tests cover basic statement, multi-statement, expression, DDL, and malformed input paths.
- Scope/focus: Changes are focused on parser modularization and build wiring; no unrelated behavioral rewrite was identified in the actual PR diff.
- Concurrency/lifecycle: The new `OriginAware` fast path is thread-owned field storage only; `DorisSqlParser` creates fresh lexer/parser/token streams per call and shared listener instances are stateless.
- Configuration/compatibility: No new Doris runtime config or persisted format/protocol change. Keeping package names and restoring `ParseException` as an `AnalysisException` preserves FE catch-site behavior.
- Parallel code paths: Existing `NereidsParser` continues using the moved grammar/support classes; the standalone facade has matching parser setup knobs for external callers.
- Error handling: Parser syntax errors still go through `ParseErrorListener`/`ParseException`; RPC/persistence boundaries are not involved.
- Test coverage: Parser module tests were added. I could not run FE Maven tests in this runner because `thirdparty/installed/bin/protoc` is missing, and FE build instructions require stopping before build in that case.
- Observability/performance: No new runtime service path requiring metrics/logging. The hot-path origin storage optimization is preserved through `OriginAware` without adding synchronization.
- Transaction/data correctness/security: Not applicable to this parser/module split.
User focus: no additional user-provided review focus was present.
TPC-H: Total hot run time: 31263 ms
TPC-DS: Total hot run time: 172031 ms
FE UT Coverage Report: Increment line coverage
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
FE Regression Coverage Report: Increment line coverage
Summary
Split SQL syntax parsing out of `fe-core` into a new `fe-sql-parser` module that produces an ANTLR parse tree (CST) without semantic analysis. The new module can be packaged as an independent jar for external consumers (third-party tools, linters, format converters, etc.) without dragging in `LogicalPlan`, `Catalog`, `ConnectContext`, or any other fe-core internals.
Module changes
- Move `DorisLexer.g4` / `DorisParser.g4` and 8 supporting Java files (`CaseInsensitiveStream`, `Origin`, `ParserUtils`, `ParseErrorListener`, `PostProcessor`, `ParseException`, `SyntaxParseException`, `QueryParsingErrors`) into `fe-sql-parser`. Package names are preserved so fe-core's hundreds of imports do not move.
- `ParseException` now extends `RuntimeException` directly to break the chain through `nereids.exceptions.AnalysisException`, which references `LogicalPlan`.
- Add an `OriginAware` SPI in `fe-sql-parser` so `ParserUtils` keeps its per-thread field-based fast path (originally added in [refactor](nereids) Support Origin to provide error location #52125). `MoreFieldsThread` in fe-core implements the interface; other threads fall back to ThreadLocal, and correctness is identical either way.
- Add the `org.apache.doris.sqlparser.DorisSqlParser` facade with `parseStatement` / `parseStatements` / `parseExpression`.
- fe-core's `antlr4-maven-plugin` now only processes the Nereids pattern-generator's `JavaLexer.g4` / `JavaParser.g4`.
The new module's only runtime dependency is `org.antlr:antlr4-runtime`.
Standalone CLI
`mvn -pl fe-sql-parser -Pcli package` produces a self-contained executable jar (`fe-sql-parser-*-cli.jar`, ~1.7 MB after minimize-shade) so the parser can be invoked directly from a shell.
The CLI is gated behind the `cli` Maven profile so default Doris builds do not pay the shading cost; the thin jar consumed by fe-core is unchanged. Exit codes: `0` success, `1` parse error, `2` usage or I/O error.
Extension hooks for downstream tools
Downstream projects can plug in custom logic (SQL lineage, policy enforcement, audit, rewriting, metrics) without modifying `fe-sql-parser`. Four mechanisms are available:
- Visitor subclasses of `DorisParserBaseVisitor<T>`
- Listener subclasses of `DorisParserBaseListener` for `enter` / `exit` interception
- Live parse-time hooks with `parser.addParseListener(...)` via `newLexer` / `newParser`
- Wrapping the `DorisSqlParser` facade
`fe/fe-sql-parser/README.md` contains end-to-end examples for SQL lineage extraction, policy/audit listeners, hint collection during parsing, an instrumented facade with caching and metrics, plus tips on locating rule names and debugging visitors with the CLI. It also documents the build modes, library/Visitor usage, configuration flags (`noBackslashEscapes`, `ansiSqlSyntax`), the `OriginAware` fast-path SPI, and current caveats.
Test plan
- `fe-sql-parser` unit tests: 7 new cases in `DorisSqlParserTest` covering `SELECT`, `SELECT FROM WHERE`, multi-statement, expression, DDL (`CREATE TABLE` with `DISTRIBUTED` + `PROPERTIES`), malformed SQL, and trailing-garbage expressions
- `fe` reactor compiles (`mvn -pl fe-core -am compile`)
- `org.apache.doris.nereids.parser.*Test` classes pass (160 cases, 0 failures)
- CLI: `-e` / `-f` / stdin input modes; `--multi` / `--expression` parse modes; `--pretty` output; parse-error exit code