Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A very-WIP implementation of the PRQL plugin, for discussion #5982

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

max-sixty
Copy link
Contributor

@max-sixty max-sixty commented Oct 2, 2022

As discussed with some folks in #5796 & offline

This is a very WIP implementation of the PRQL plugin. I'm not overjoyed about the code I've written, but I figured it was better to get it out in the open.

There are lots of notes in the code; to summarize:

  • I was hoping to write something not overly generic, but at least modular.
  • This gets some of the way there — it takes the "API" rather than "Jinja" path for most of it, as discussed in dbt & Python: language agnosticism #5796.
  • But then eventually passes the compiled model off to ModelParser as a SQL file.
  • I added some notes on a better design, but as discussed, I'd like to socialize this to ensure we're on the same page before investing further. I found the current model parser code quite coupled, and SQL is high in a very long inheritance tree, so making it elegant is not trivial. The python parser is impressive, but def bolted on!
  • This code has unit tests but not an integration test (I also haven't hand-tested it end-to-end; I suspect there may be parts that don't work).
  • Note that it includes a file _dbt_prql.py that would be in dbt-prql, but I'm including here for the PR so we can iterate without coordinating dependencies. (That file also does a hack to emulate prqlc, the rust compiler that's bundled with prql-python / dbt-prql)

Thanks in advance for folks' feedback and guidance.

Description

Checklist

@max-sixty max-sixty requested a review from a team October 2, 2022 08:27
@max-sixty max-sixty requested review from a team as code owners October 2, 2022 08:27
@cla-bot cla-bot bot added the cla:yes label Oct 2, 2022
@max-sixty
Copy link
Contributor Author

It's in a state where it would be useful to:

  • discuss the current proposed implementation
  • discuss the best path between here and a nice generic approach, balancing elegance with practicality
  • confirm we're all aligned on having PRQL including in some form

On balancing elegance with practicality — my biggest concern is making this very generic would involve quite some refactoring & inversion, if I understand the code correctly (very possibly I don't though, please correct me!)

Specifically, the ModelParser seems to be the most congruent place to implement different languages — so each language would define a SQLModelParser / PythonModelParser / etc, with listing references, replacing them, etc (as discussed in #5796). That would likely involve splitting the parts of ModelParser that are SQL-specific from those that are generic. But ModelParser's mro is:

  • dbt.parser.models.ModelParser
  • dbt.parser.base.SimpleSQLParser
  • dbt.parser.base.SQLParser
  • dbt.parser.base.ConfiguredParser
  • dbt.parser.base.Parser
  • dbt.parser.base.BaseParser

...so it's long, and SQL is quite deep. So hence the heavy refactor that I think would be required to do this in an ideal way.

Comparing that to the current python parsing — python parsing is implemented with an additional method on the ModelParser (thanks @jtcohen6 for the original pointer, that was v helpful in getting me started). The current prql implementation takes a slightly more modular approach, still branching within ModelParser, but then calls out to a PrqlLanguageProvider, before passing the SQL back to ModelParser. In the more generic proposal above, the PrqlLanguageProvider would become a PrqlModelParser, and be composed with the parts of ModelParser that were relevant.

Is that right? Is there an easier approach here? What would be the best path there, assuming we don't want to wait until it's perfect before merging anything?


I'm open minded on the way forward. As discussed with folks, I'm very keen to have people write PRQL in dbt — I think it would be great for PRQL to have the reach of dbt & its users, and it would be great for dbt & its users to have an ergonomic language option. Given PRQL's focus is the data transformation, rather than project management & DAG management, there is a uniquely good fit between the two. So I'm happy to drive this and iterate to get this in.

@gshank
Copy link
Contributor

gshank commented Oct 5, 2022

Lots of interesting things here!

We did look at creating a different node type for Python nodes, but in the end decided not to do it because there a fair amount of complications. Was there some reason you decided to go that direction? It also feels a bit wrong to have references to PrqlNode, if the actual prql code doesn't exist in the repo. So I'm thinking we might need something a bit more pluggable, such that if somebody install a language provider, a new value for the 'language' attribute becomes available, and the code to compile it is found by the plugin architecture.

@max-sixty
Copy link
Contributor Author

Thanks for responding @gshank !

I very much agree with your assessment — there's a tradeoff between the "fair amount of complications" :) vs. having something pluggable.

In the currently proposed code, I avoided many of the complications, and so wasn't using the PrqlNode etc. I've just removed those for clarity.

At what level would you like to engage here? I realize these kinds of quite broad proposals can be difficult to connect with the lines of code. I've tried to summarize the current state & salient questions in prose above. Is that a reasonable starting point or do you prefer an alternative approach? Also happy to chat on Slack, or do another Zoom with the broader team if that lets us iterate faster.

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@max-sixty Thank you so much for the initial work to spec this out! This has prompted a lot of conversation for us over the past few weeks (including several mentions in casual conversation at Coalesce). There's definite interest in a future where dbt can support plugins for modeling languages, and PRQL would be an excellent early entrant in the category.

The question now is around how to get from here to there. Based on the code you've written thus far, the two biggest challenges I see are:

  1. Decoupling our parsers (ModelParser et al) from Jinja-SQL, so that it's easier to provide a new language parser that works in parallel to the Jinja-SQL ones — with the awareness that we'd also be turning a very internal thing into a real public interface
  2. Determining a plugin pattern/framework that we're happy with ([CT-1453] [Spike] A pattern for plugins #6184), since we're feeling the limits of the pattern we've been using for adapter plugins

My goal over the next few weeks is to identify an engineer on the team who's motivated by these challenges — and by what better interfaces here could help us unlock — to assist in the areas of technical design where I should certainly not have final (or even first) say :)

core/dbt/parser/_dbt_prql.py Show resolved Hide resolved
@@ -60,6 +60,9 @@ def setUp(self):
# Create file filesystem searcher
self.filesystem_search = patch('dbt.parser.read_files.filesystem_search')
def mock_filesystem_search(project, relative_dirs, extension, ignore_spec):
# Adding in `and "prql" not in extension` will cause a bunch of tests to
# fail; need to understand more on how these are constructed to debug.
# Possibly `sql not in extension` is a way of having it only run once.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a good chance this is some magic from years ago, which we'll need to reconsider/refactor. Good example of a place where we've hardcoded .sql to date, and need to instead be thinking about using a common provider to tell us the valid model file extensions

test/unit/test_graph_selector_methods.py Show resolved Hide resolved
Comment on lines +420 to +426
# TOOD: Currently `ModelParser` inherits from `SimpleSQLParser`, which inherits from
# SQLParser. So possibly `ModelParser` should instead be `SQLModelParser`, and in
# `ManifestLoader.load`, we should resolve which to use? (I think that's how it works;
# though then why all the inheritance and generics if every model is parsed with `ModelParser`?)
# class PRQLParser(
# ConfiguredParser[FileBlock, IntermediateNode, FinalNode], Generic[IntermediateNode, FinalNode]
# ):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like this piece is the biggest current challenge, and the thing that would require the most refactoring to make pluggable.

Currently, SimpleSQLParser is used in all of these places:

  • ModelParser (even for Python models)
  • AnalysisParser
  • SingularTestParser
  • SqlBlockParser (for semantic layer queries)
  • SeedParser (!), even though render_with_context is a total no-op

We shouldn't be looking to create new node types for each language — these should still be models, analyses, tests, etc — but it's clear that we need some way to allow a language plugin to define its own parser, and then provide it to resources matching that language input.

It also looks like the real Jinja-SQL entrypoint is ConfiguredParser.render_with_context. We'd probably want to split that out, to live on a dedicated JinjaSQLParser instead!

Caveats:

  • There's some additional Jinja business in update_parsed_node_config_dict and update_parsed_node_name, to respect rules around hooks and database/schema/alias — those have more to do with model configuration than model language, and the fact that we've stretched Jinja into a "rules engine" for configuration that goes beyond its more-straightforwardly understood role as a SQL template.
  • I'd like to punt on snapshots, which are defined as Jinja blocks (and mostly config, anyway)
  • I don't know if we should even think/talk about macros, at this early point, but we know we'll want some way to define and reuse functions

@max-sixty
Copy link
Contributor Author

Thanks for the review @jtcohen6, very much agree with your comments. As discussed offline, let's touch base in a few weeks on the best path forward.

try:
context = self._context_for(node, config)
references = provider.list_references(node.raw_code)
sql = provider.compile(node.raw_code, references)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After taking another look here — with plans to do some hacking on/around this soon! — I think we'd want the actual .compile() call (PRQL → SQL transpilation) to happen within compilation, rather than during parsing. Here's where we do that for Python.

I think the goal here, within the parser, should just be to provide the set of references, and then pick up again at compile time. Something to also think about: Certain modeling languages will want a database connection at compile time (e.g. Jinja-SQL, Cody's prototype work on Ibis); some won't need one (e.g. Snowpark/PySpark Python, PRQL), and so shouldn't require one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK great, that seems very reasonable.

Would you have it as an if in Compiler like the current python implementation, or inherit a class for each lang; i.e. CompilerPython / CompilerPrql?

@binste
Copy link

binste commented Sep 7, 2023

It's great to see all the work that already went into making prql available in dbt in a convenient way! I'm currently developing dbt-ibis so that users can write dbt models as Ibis expressions and I've learned a lot from looking at dbt-prql and this PR. Thanks! Curious to see how this continues :)

@dataders dataders mentioned this pull request Sep 26, 2023
5 tasks
@wearpants
Copy link

Is this dead?

@max-sixty
Copy link
Contributor Author

Is this dead?

I would love to revive it! There are 22 👍s, so there is a decent amount of community interest... If there were interest from the dbt team I'd be happy to make a significant time investment to make something that works.

My read is that the dbt team has other priorities at the moment — understandable — and so I'm not sure there is a viable path without that changing.

But if there is a path forward, let me know!


For any future travelers, the precursor to this was https://github.com/PRQL/dbt-prql. But that doesn't work with some dialects, and the implementation is extremely hacky, so we're not supporting it, unfortunately. There's a note stating this on the repo.

@jtcohen6
Copy link
Contributor

@max-sixty Long time no talk!

You're right that we're focused on other priorities at the moment. The primary ones are adding unit testing (for SQL models) and stabilizing dbt-core's external interfaces (especially for adapter maintainers). There's more detail about this in our last roadmap post.

I still find this work fun & interesting. A number of the internals have changed since we sketched out the rough initial implementation over a year ago. I don't think we can do anything here before the v1.8 release (planned for April), but I would be up for taking another look after, once the dust has settled.

@randypitcherii
Copy link

I’m here for it!

@dbeatty10 dbeatty10 added the community This PR is from a community member label Mar 22, 2024
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@jtcohen6 jtcohen6 self-assigned this Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes community This PR is from a community member discussion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants