
draft: allow users to augment their jinja context with python objects #5274

Closed
wants to merge 1 commit

Conversation

@mistercrunch commented May 18, 2022

This is a bit of a proposal and proof of concept at the same time.
It gives people a handle on their Jinja context, allowing them
to inject Python objects there under an `extra_jinja_context`
namespace.

how?

export PYTHONPATH=~/.dbt

Assuming a file `~/.dbt/dbt_config.py`:

# anything added into `extra_jinja_context` becomes available
# in the jinja context under a namespace of the same name
extra_jinja_context = {
    'hello': 'world',
    'test': lambda x: x,
}

Assuming a model `test.sql`:

SELECT '{{ extra_jinja_context.hello }}' AS test
UNION ALL
SELECT '{{ extra_jinja_context.test("print me") }}' AS test
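
For reference, a minimal sketch of how dbt could discover this module at startup — the function name and fallback behavior here are assumptions for illustration, not the actual implementation:

```python
import importlib


def load_extra_jinja_context():
    """Import dbt_config from the PYTHONPATH and return its
    extra_jinja_context dict, or an empty dict if it's absent."""
    try:
        dbt_config = importlib.import_module("dbt_config")
    except ImportError:
        return {}
    return getattr(dbt_config, "extra_jinja_context", {})
```

The returned dict would then be merged into the Jinja rendering context under the `extra_jinja_context` key.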

why?

Writing Jinja is fine when you're writing a lot of SQL with a bit of logic
in it, but it's highly suboptimal when writing complex logic with a bit of
string output. Clearly Python is superior to Jinja in many ways:

  • access to the full python standard lib
  • access to external/powerful libs
  • Turing complete, object oriented
  • testable using sane mechanisms

use cases

Use cases are limitless. Bind, hook, trigger, generate SQL.
In order of legitimacy:

  • string processing, what you'd typically use Jinja macros for, but
    maybe you prefer python functions over jinja macros. Our main use
    case is creating a generate_incremental_load_date_bounds() where
    we'll look at vars to offer different loading modes (catchup, date
    range, date list, offsets, ...)
  • custom logging similar to
    [dbt-event-logging](https://github.com/dbt-labs/dbt-event-logging)
    but much more flexible; we happen to use BigQuery and this particular
    approach doesn't work for BigQuery
  • custom logic for hooks
  • trigger external things (webhooks!)
  • ...
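
As an illustration of the first use case, here's a toy sketch of what `generate_incremental_load_date_bounds()` could look like inside `dbt_config.py` — the modes and signature are hypothetical, not the actual helper described above:

```python
from datetime import date, timedelta


def generate_incremental_load_date_bounds(mode, ds, catchup_days=30):
    """Return an inclusive (start, end) date pair for the given loading mode."""
    if mode == "catchup":
        # reload a trailing window ending at the run date
        return ds - timedelta(days=catchup_days), ds
    if mode == "single_day":
        return ds, ds
    raise ValueError(f"unknown loading mode: {mode!r}")


# exposing it under the namespace makes it callable from any model, e.g.
# {% set bounds = extra_jinja_context.generate_incremental_load_date_bounds("catchup", run_date) %}
extra_jinja_context = {
    "generate_incremental_load_date_bounds": generate_incremental_load_date_bounds,
}
```

The point being: branching on loading modes, parsing `vars`, and raising clear errors are all things Python does far more gracefully than Jinja.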

need more thinking / conversation

  • I'm introducing `dbt_config.py`, a new place where Python logic can be
    injected into dbt-core. It can be powerful, but it can also lead to
    environment issues / complexity. People need to not abuse this file.
    No `import pandas as pd` in there please! Where should this live?
    `~/.dbt/`?
  • Seems like it'd be great to do this at the project level too, so a
    project can be packaged with the extra jinja context it needs to
    operate. macros/my_macros.py anyone!? Conceptually that makes a dbt
    project also a python app in some ways and that may not be ideal.
  • interoperability: people sharing projects may need to also share
    `dbt_config.py` and put it in their PYTHONPATH. It's not that outrageous,
    but it raises the complexity of the project / env setup. This kind of
    complexity already exists with stuff like `profiles.yml`

TODO

  • agree this is useful
  • agree on an approach
  • write tests
  • write docs
  • hand off?

@mistercrunch mistercrunch requested review from a team as code owners May 18, 2022 23:17
cla-bot bot commented May 18, 2022

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @mistercrunch

@jtcohen6 (Contributor) commented

@mistercrunch Thanks for the PR, and for the thoughtful accompanying writeup!

I like this idea better than giving people the ability to write custom Jinja filters, or hooking into other Jinja-specific "extensions." There's a very old issue proposing just that: #480. I find myself agreeing with the two most-recent comments there: This feels like a case for plugins!

I'll leave some more detailed thoughts below. I'm indebted to @jwills @gshank @nathaniel-may for talking through this with me yesterday. Maybe this PR actually wants to be a discussion?

Why not?

it can be powerful, but can lead to environment issues / complexity. People need to not abuse this file. No import pandas as pd in there please!

This is the name of the game. The limitation of Jinja is one of the best things about it :)

At definite risk of mixing my mythological metaphors, custom Python code within the dbt project environment opens a Pandora's box of anti-pattern practices, and books us a one-way passage over the river ~~Styx~~ Pipx to dependency hell. That risk is present to some degree in dbt Python models, too, but those dependencies will be installed on the warehouse, in per-model or per-cluster sandboxes. They will not be installed "globally" in the dbt environment, which is the terrifying part. We already fight enough with our dependencies as is. Imagine if we had to know every single PyPI package that everyone depended on at any point in time, before we felt confident upgrading any?

[One other quick note about Python models: It will of course be possible to write and use generate_date_filter_clause() as a Python function, insofar as it's Python code that runs remotely in the warehouse. In some cases, we think it will be desirable for a Python model (?) to actually register its result as a UDF in the database, which can be called from SQL models—but this makes more intuitive sense for ml_predict() or business_minutes_between() than it does for fancier SQL templating. The risk is much lower visibility for the end user, since dbt isn't "compiling" that more-concise function down to some more-verbose "actual" SQL, as it does with macros.]

So what then?

Here's my take, with which you're more than welcome to disagree: I think what you're describing is actually a dbt plugin, written in Python, which has the ability to register and expose some of its methods in the user's Jinja context. That plugin would be installed alongside dbt-core, and discovered by the dbt-core application by namespacing its modules in a well-understood way.

We have a pattern for that today, in the form of adapter plugins. We've understood for a long time that the work of translating between dbt <> Another Analytical Database is a much bigger lift than translating business/transformation logic into Jinja-SQL. Adapter authors need the ability to write and test their functionality in a real programming language, and in many cases to access data platform APIs that simply aren't exposed via SQL. (BigQuery is the most prominent example.)

It's also a recognition that the "adapter maintainer" persona is a very different one from the "dbt user" / project code writer. There are many fewer adapter maintainers; we (as dbt-core maintainers) have a much closer relationship with them; they opt into a higher maintenance burden, to do a more complex thing; and they understand that they're creating a product on top of dbt, with all the thinking about longer-term repercussions that entails, not just hacking a thing that needs to work right now.

Concretely, we allow adapter maintainers to write whatever Python code they need, and to register some of it as class methods and members on the adapter object—which end users call as "macros" from the adapter namespace. To your point about "string processing," for instance: as dbt-bigquery maintainers, we wrote the logic for validating user-provided partition_by config, and rendering it as an appropriate string to BigQuery's specifications, in Python here. We then call it from within the Jinja context here and here.
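
That pattern can be sketched in miniature — every name below (the class, the method, the error type) is illustrative stand-in code, not dbt's actual implementation:

```python
class PartitionConfigError(ValueError):
    """Raised when a user-provided partition_by config is invalid."""


class ToyBigQueryAdapter:
    """Stand-in for an adapter plugin that does real validation in Python."""

    def parse_partition_by(self, raw):
        # validate the user-provided config with proper Python error handling,
        # something that is painful to express in Jinja
        if not isinstance(raw, dict) or "field" not in raw:
            raise PartitionConfigError("partition_by requires a 'field' key")
        return {"field": raw["field"], "data_type": raw.get("data_type", "date")}


def build_jinja_context(adapter):
    # dbt exposes chosen adapter methods under the `adapter` namespace,
    # so model code can call {{ adapter.parse_partition_by(config) }}
    return {"adapter": adapter}
```

The user-facing experience is still "calling a macro," but the logic behind it lives in tested Python inside the adapter package.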

How could this be better?

Down the line, I could see a generalized pattern for supporting dbt plugins that don't actually want to be database adapters. They could register custom namespaces, instead of overloading the adapter namespace. Perhaps there could be a dbt-logging plugin, which makes available methods like logging.fire_event() in the Jinja context—although I'll note that we've meaningfully reinvested in the core eventing and structured logging interface for real-time metadata. (We're still building the companion tooling for it, but we definitely recommend it instead of this package.) Such a plugin almost exists today for dbt + Openlineage, but openlineage-dbt has to be a wrapper around the dbt-core CLI, rather than a plugin registered within it. Not today, but eventually, we'd want to support totally custom plugins in dbt Cloud, whether written by technology partners or community members or in-house dbt pros.
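
One conventional way such non-adapter plugins could be discovered is via Python entry points — the group name `dbt.jinja_namespaces` below is purely hypothetical, a sketch of the registration mechanism rather than anything dbt supports:

```python
import sys
from importlib.metadata import entry_points


def discover_plugin_namespaces(group="dbt.jinja_namespaces"):
    """Collect {plugin_name: exported_object} from installed packages
    that advertise the (hypothetical) entry-point group."""
    if sys.version_info >= (3, 10):
        eps = entry_points(group=group)
    else:
        # older importlib.metadata returns a dict keyed by group name
        eps = entry_points().get(group, [])
    return {ep.name: ep.load() for ep in eps}
```

A `dbt-logging` package would then declare its entry point in its own packaging metadata, and dbt-core would pick it up at startup without any `PYTHONPATH` gymnastics.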

In the meantime: What do you think of forking dbt-bigquery? Or writing an extension off it, in the same way that dbt-materialize is an extension of dbt-postgres? That requires a conscious choice, and a separation between Special Methods and Actual dbt Project Code. If you write methods that solve a common problem, one which you think other folks would benefit from, a PR upstream to the main dbt-bigquery repo is always on the table.

@gshank (Contributor) commented Jun 6, 2022

I'm going to close this PR since we're not going to merge these changes. Please continue the conversation if you're so inclined.

@kdazzle commented Jun 30, 2023

Hey @mistercrunch - did you ever continue with something like this? I just proposed a similar thing (#8000) and agree that it would be very useful to have this sort of flexibility within dbt.
