-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A very-WIP implementation of the PRQL plugin, for discussion #5982
base: main
Are you sure you want to change the base?
Conversation
1a864fe
to
7c5919d
Compare
It's in a state where it would be useful to:
On balancing elegance with practicality — my biggest concern is making this very generic would involve quite some refactoring & inversion, if I understand the code correctly (very possibly I don't though, please correct me!) Specifically, the
...so it's long, and SQL is quite deep. So hence the heavy refactor that I think would be required to do this in an ideal way. Comparing that to the current python parsing — python parsing is implemented with an additional method on the Is that right? Is there an easier approach here? What would be the best path there, assuming we don't want to wait until it's perfect before merging anything? I'm open minded on the way forward. As discussed with folks, I'm very keen to have people write PRQL in dbt — I think it would be great for PRQL to have the reach of dbt & its users, and it would be great for dbt & its users to have an ergonomic language option. Given PRQL's focus is the data transformation, rather than project management & DAG management, there is a uniquely good fit between the two. So I'm happy to drive this and iterate to get this in. |
(tests still fail b/c typing_extensions is not installed)
Lots of interesting things here! We did look at creating a different node type for Python nodes, but in the end decided not to do it because there a fair amount of complications. Was there some reason you decided to go that direction? It also feels a bit wrong to have references to PrqlNode, if the actual prql code doesn't exist in the repo. So I'm thinking we might need something a bit more pluggable, such that if somebody install a language provider, a new value for the 'language' attribute becomes available, and the code to compile it is found by the plugin architecture. |
Thanks for responding @gshank ! I very much agree with your assessment — there's a tradeoff between the "fair amount of complications" :) vs. having something pluggable. In the currently proposed code, I avoided many of the complications, and so wasn't using the At what level would you like to engage here? I realize these kinds of quite broad proposals can be difficult to connect with the lines of code. I've tried to summarize the current state & salient questions in prose above. Is that a reasonable starting point or do you prefer an alternative approach? Also happy to chat on Slack, or do another Zoom with the broader team if that lets us iterate faster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@max-sixty Thank you so much for the initial work to spec this out! This has prompted a lot of conversation for us over the past few weeks (including several mentions in casual conversation at Coalesce). There's definite interest in a future where dbt can support plugins for modeling languages, and PRQL would be an excellent early entrant in the category.
The question now is around how to get from here to there. Based on the code you've written thus far, the two biggest challenges I see are:
- Decoupling our parsers (
ModelParser
et al) from Jinja-SQL, so that it's easier to provide a new language parser that works in parallel to the Jinja-SQL ones — with the awareness that we'd also be turning a very internal thing into a real public interface - Determining a plugin pattern/framework that we're happy with ([CT-1453] [Spike] A pattern for plugins #6184), since we're feeling the limits of the pattern we've been using for adapter plugins
My goal over the next few weeks is to identify an engineer on the team who's motivated by these challenges — and by what better interfaces here could help us unlock — to assist in the areas of technical design where I should certainly not have final (or even first) say :)
@@ -60,6 +60,9 @@ def setUp(self): | |||
# Create file filesystem searcher | |||
self.filesystem_search = patch('dbt.parser.read_files.filesystem_search') | |||
def mock_filesystem_search(project, relative_dirs, extension, ignore_spec): | |||
# Adding in `and "prql" not in extension` will cause a bunch of tests to | |||
# fail; need to understand more on how these are constructed to debug. | |||
# Possibly `sql not in extension` is a way of having it only run once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a good chance this is some magic from years ago, which we'll need to reconsider/refactor. Good example of a place where we've hardcoded .sql
to date, and need to instead be thinking about using a common provider to tell us the valid model file extensions
# TOOD: Currently `ModelParser` inherits from `SimpleSQLParser`, which inherits from | ||
# SQLParser. So possibly `ModelParser` should instead be `SQLModelParser`, and in | ||
# `ManifestLoader.load`, we should resolve which to use? (I think that's how it works; | ||
# though then why all the inheritance and generics if every model is parsed with `ModelParser`?) | ||
# class PRQLParser( | ||
# ConfiguredParser[FileBlock, IntermediateNode, FinalNode], Generic[IntermediateNode, FinalNode] | ||
# ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like this piece is the biggest current challenge, and the thing that would require the most refactoring to make pluggable.
Currently, SimpleSQLParser
is used in all of these places:
ModelParser
(even for Python models)AnalysisParser
SingularTestParser
SqlBlockParser
(for semantic layer queries)SeedParser
(!), even thoughrender_with_context
is a total no-op
We shouldn't be looking to create new node types for each language — these should still be models, analyses, tests, etc — but it's clear that we need some way to allow a language plugin to define its own parser, and then provide it to resources matching that language input.
It also looks like the real Jinja-SQL entrypoint is ConfiguredParser.render_with_context
. We'd probably want to split that out, to live on a dedicated JinjaSQLParser
instead!
Caveats:
- There's some additional Jinja business in
update_parsed_node_config_dict
andupdate_parsed_node_name
, to respect rules around hooks and database/schema/alias — those have more to do with model configuration than model language, and the fact that we've stretched Jinja into a "rules engine" for configuration that goes beyond its more-straightforwardly understood role as a SQL template. - I'd like to punt on snapshots, which are defined as Jinja blocks (and mostly config, anyway)
- I don't know if we should even think/talk about macros, at this early point, but we know we'll want some way to define and reuse functions
Thanks for the review @jtcohen6, very much agree with your comments. As discussed offline, let's touch base in a few weeks on the best path forward. |
try: | ||
context = self._context_for(node, config) | ||
references = provider.list_references(node.raw_code) | ||
sql = provider.compile(node.raw_code, references) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After taking another look here — with plans to do some hacking on/around this soon! — I think we'd want the actual .compile()
call (PRQL → SQL transpilation) to happen within compilation, rather than during parsing. Here's where we do that for Python.
I think the goal here, within the parser, should just be to provide the set of references
, and then pick up again at compile time. Something to also think about: Certain modeling languages will want a database connection at compile time (e.g. Jinja-SQL, Cody's prototype work on Ibis); some won't need one (e.g. Snowpark/PySpark Python, PRQL), and so shouldn't require one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK great, that seems very reasonable.
Would you have it as an if
in Compiler
like the current python implementation, or inherit a class for each lang; i.e. CompilerPython
/ CompilerPrql
?
It's great to see all the work that already went into making |
Is this dead? |
I would love to revive it! There are 22 👍s, so there is a decent amount of community interest... If there were interest from the dbt team I'd be happy to make a significant time investment to make something that works. My read is that the dbt team has other priorities at the moment — understandable — and so I'm not sure there is a viable path without that changing. But if there is a path forward, let me know! For any future travelers, the precursor to this was https://github.com/PRQL/dbt-prql. But that doesn't work with some dialects, and the implementation is extremely hacky, so we're not supporting it, unfortunately. There's a note stating this on the repo. |
@max-sixty Long time no talk! You're right that we're focused on other priorities at the moment. The primary ones are adding unit testing (for SQL models) and stabilizing I still find this work fun & interesting. A number of the internals have changed since we sketched out the rough initial implementation over a year ago. I don't think we can do anything here before the v1.8 release (planned for April), but I would be up for taking another look after, once the dust has settled. |
I’m here for it! |
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide. |
As discussed with some folks in #5796 & offline
This is a very WIP implementation of the PRQL plugin. I'm not overjoyed about the code I've written, but I figured it was better to get it out in the open.
There are lots of notes in the code; to summarize:
ModelParser
as a SQL file._dbt_prql.py
that would be in dbt-prql, but I'm including here for the PR so we can iterate without coordinating dependencies. (That file also does a hack to emulateprqlc
, the rust compiler that's bundled withprql-python
/dbt-prql
)Thanks in advance for folks' feedback and guidance.
Description
Checklist
changie new
to create a changelog entry