Rework Auto-Complete To Work Based On PEG grammar #15003
Merged
… of string literals
CC @hamilton
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request (Dec 27, 2024):
Rework Auto-Complete To Work Based On PEG grammar (duckdb/duckdb#15003)
Rely on extension-ci-tools workflow to build linux_amd64_gcc4 extensions (duckdb/duckdb#14987)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request (Dec 27, 2024):
Rework Auto-Complete To Work Based On PEG grammar (duckdb/duckdb#15003)
Rely on extension-ci-tools workflow to build linux_amd64_gcc4 extensions (duckdb/duckdb#14987)
Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
This PR reworks the sql_auto_complete function to work based on a PEG grammar. We include a set of PEG grammar files that can parse almost the entirety of DuckDB's SQL dialect (based on @hannes' peg-parser experiments, but greatly extended). Here's a short snippet:

The grammar files are split up by statement type, but are inlined into a C++ header file (extension/autocomplete/include/inlined_grammar.hpp) through the following script:

Auto-Complete
The auto-complete now works as follows:
PEG Grammar Tokenization
First, the PEG grammar is parsed into a set of tokens by the PEGParser::ParseRules function and stored on a per-rule basis, e.g.:

Token -> Matcher Conversion
The tokens are then converted into Matcher objects through the MatcherFactory::CreateMatcher function. This parses the tokens one-by-one and constructs the specified matchers.

Tokenize SQL
The BaseTokenizer tokenizes the SQL string into MatcherToken objects. These tokens are only strings; there are no special token types. The tokenization follows roughly the same rules as the current scanner, but it is a reimplementation, so minor differences will likely be present.

Parse SQL and provide suggestions
The matchers are then used to parse the input string except for the last word (i.e. the word containing the cursor). Any matchers that can accept more input at the time of parsing then provide suggestions.
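A minimal sketch of the split described above, assuming the cursor sits at the end of the input and that the last word is delimited by whitespace. `SplitLastWord` is a hypothetical helper written for illustration; it is not taken from the PR, and the actual implementation presumably works on tokens produced by its own tokenizer.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper (not part of the PR): splits the input into
//   first  = the text handed to the matchers for parsing
//   second = the trailing word under the cursor, kept aside for filtering
// Assumption: words are separated by whitespace and the cursor is at the end.
std::pair<std::string, std::string> SplitLastWord(const std::string &sql) {
    std::size_t pos = sql.size();
    // Walk backwards until the start of the string or a whitespace character.
    while (pos > 0 && !std::isspace(static_cast<unsigned char>(sql[pos - 1]))) {
        pos--;
    }
    return {sql.substr(0, pos), sql.substr(pos)};
}
```

For a hypothetical input such as `SELECT * FROM linei`, the first element is `SELECT * FROM ` and the second is `linei`. If the input ends in whitespace, the last word is empty and every suggestion would pass the later filter.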
For example, if we have the following query:
The matcher parses the subset:
The BaseTableRef parser then provides various suggestions for follow-ups:

Filter Suggestions
Finally, the suggestions are filtered using the last word, i.e. linei in the above example. This filter will lead us to suggest the table name lineitem (given that we are operating on a TPC-H database).

Matcher Infrastructure
The matchers are the (recursive) units that define the parser. At their core, there are two methods that are implemented by the various matchers:
If SUCCESS is returned, the tokens are consumed; if FAIL is returned, the tokens are not consumed.

The following matchers are available, which are automatically constructed in order to parse the PEG grammar:
Leaf Matchers
- … SELECT)
- … lineitem or "Column")
- … 'text')
- … >=)

Recursive Matchers
The following recursive matchers are available:
- … n child matchers, succeeds in parsing only if all child matchers (in-order) succeed
- … n child matchers, succeeds if one child matcher succeeds

These matchers correspond to the PEG rules, i.e. ChoiceMatcher is the OR rule (/). RepeatMatcher corresponds to the * rule. OptionalMatcher corresponds to the ? rule. ListMatcher corresponds to the brackets (and every top-level rule is implicitly a list matcher).

For example:
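As a hedged illustration of how these four recursive matchers could compose, here is a simplified sketch. Only the class names ListMatcher, ChoiceMatcher, OptionalMatcher and RepeatMatcher come from the description; every signature, the `MatchState` struct, the `WordMatcher` leaf, the example rule and the `MatchesFully` helper are assumptions made for this sketch. Suggestion handling is left out entirely.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// All names and signatures below are illustrative assumptions, not the PR's API.
enum class MatchResult { SUCCESS, FAIL };

struct MatchState {
    explicit MatchState(const std::vector<std::string> &tokens_p) : tokens(tokens_p) {}
    const std::vector<std::string> &tokens;
    std::size_t index = 0; // position of the next unconsumed token
};

struct Matcher {
    virtual ~Matcher() = default;
    // SUCCESS consumes tokens (advances state.index); FAIL leaves state.index unchanged.
    virtual MatchResult Match(MatchState &state) const = 0;
};
using MatcherPtr = std::shared_ptr<Matcher>;

// Hypothetical leaf: matches one exact token.
struct WordMatcher : Matcher {
    explicit WordMatcher(std::string word_p) : word(std::move(word_p)) {}
    MatchResult Match(MatchState &state) const override {
        if (state.index < state.tokens.size() && state.tokens[state.index] == word) {
            state.index++;
            return MatchResult::SUCCESS;
        }
        return MatchResult::FAIL;
    }
    std::string word;
};

// Sequence (brackets): succeeds only if all children succeed in order.
struct ListMatcher : Matcher {
    explicit ListMatcher(std::vector<MatcherPtr> children_p) : children(std::move(children_p)) {}
    MatchResult Match(MatchState &state) const override {
        const std::size_t start = state.index;
        for (const auto &child : children) {
            if (child->Match(state) == MatchResult::FAIL) {
                state.index = start; // undo partial consumption
                return MatchResult::FAIL;
            }
        }
        return MatchResult::SUCCESS;
    }
    std::vector<MatcherPtr> children;
};

// Ordered choice (/): succeeds with the first child that succeeds.
struct ChoiceMatcher : Matcher {
    explicit ChoiceMatcher(std::vector<MatcherPtr> children_p) : children(std::move(children_p)) {}
    MatchResult Match(MatchState &state) const override {
        for (const auto &child : children) {
            if (child->Match(state) == MatchResult::SUCCESS) {
                return MatchResult::SUCCESS;
            }
        }
        return MatchResult::FAIL;
    }
    std::vector<MatcherPtr> children;
};

// Optional (?): always succeeds, consuming tokens only if the child matches.
struct OptionalMatcher : Matcher {
    explicit OptionalMatcher(MatcherPtr child_p) : child(std::move(child_p)) {}
    MatchResult Match(MatchState &state) const override {
        child->Match(state);
        return MatchResult::SUCCESS;
    }
    MatcherPtr child;
};

// Repeat (*): applies the child until it fails or stops consuming tokens.
struct RepeatMatcher : Matcher {
    explicit RepeatMatcher(MatcherPtr child_p) : child(std::move(child_p)) {}
    MatchResult Match(MatchState &state) const override {
        std::size_t before;
        do {
            before = state.index;
        } while (child->Match(state) == MatchResult::SUCCESS && state.index > before);
        return MatchResult::SUCCESS;
    }
    MatcherPtr child;
};

MatcherPtr Word(const std::string &w) { return std::make_shared<WordMatcher>(w); }

// Hypothetical rule: 'SELECT' Col (',' Col)* ('FROM' 't')?   with   Col <- 'a' / 'b'
MatcherPtr MakeExampleRule() {
    MatcherPtr col = std::make_shared<ChoiceMatcher>(std::vector<MatcherPtr>{Word("a"), Word("b")});
    MatcherPtr more = std::make_shared<RepeatMatcher>(
        std::make_shared<ListMatcher>(std::vector<MatcherPtr>{Word(","), col}));
    MatcherPtr from = std::make_shared<OptionalMatcher>(
        std::make_shared<ListMatcher>(std::vector<MatcherPtr>{Word("FROM"), Word("t")}));
    return std::make_shared<ListMatcher>(std::vector<MatcherPtr>{Word("SELECT"), col, more, from});
}

// True if the rule consumes every token.
bool MatchesFully(const Matcher &root, const std::vector<std::string> &tokens) {
    MatchState state(tokens);
    return root.Match(state) == MatchResult::SUCCESS && state.index == tokens.size();
}
```

In this sketch a failing ListMatcher restores the token index, which is what lets ChoiceMatcher and OptionalMatcher try an alternative without seeing half-consumed input. RepeatMatcher additionally stops when the child succeeds without consuming anything, so an optional child inside a repeat cannot loop forever.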
Note that matchers can be recursive, i.e. a matcher can refer back to itself. This is rather common (e.g. Expression refers back to Expression eventually). The only restriction is that matchers cannot be left-recursive, which is a restriction that is also present in the PEG grammar itself.

Testing & Supported Grammar
This PR also introduces a new function: check_peg_parser. This function checks whether or not a given SQL statement is entirely consumed/parsed by the PEG parser using the matcher infrastructure. This has been tested against all SQL statements from our test infrastructure using the scripts/test_peg_parser.py script, example usage:

There are roughly 100 tests (out of 3.2K) that contain SQL that is not successfully parsed (yet). This is mostly due to differences in the handling of keywords (to be looked at in the future).
Limitations
Almost all SQL features are supported, but there are a few known limitations still: