Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-660] [Feature] Add Grant SQL to Global Project #5263

Closed
1 of 9 tasks
nathaniel-may opened this issue May 17, 2022 · 9 comments · Fixed by #5369
Closed
1 of 9 tasks

[CT-660] [Feature] Add Grant SQL to Global Project #5263

nathaniel-may opened this issue May 17, 2022 · 9 comments · Fixed by #5369
Labels
enhancement New feature or request grants Issues related to dbt's grants functionality
Milestone

Comments

@nathaniel-may
Copy link
Contributor

nathaniel-may commented May 17, 2022

Description

This is the second in a two ticket series. #5189 is the first ticket. They will be merged sequentially, but both are required for this feature to be exposed to users.

Today users often configure a post_hook to grant permissions on models, seeds, and snapshots:

{{ config(post_hook = 'grant select on {{ this }} to role reporter') }}

These two tickets aim to make it easier to specify grants allowing it to be configured directly both in dbt_project.yml as well as in source files:

# dbt_project.yml
models:
  export:
    +grants:
      select: ['reporter', 'bi']
-- SQL
{{ config(grants = {'select': ['other_user']}) }}

These grant configs will not necessarily look the same for each and every warehouse. The logic to generate the sql from these configs can be overriden by adapters.

Some complexity gets added when you consider that you cannot simply revoke all privileges from a table and apply the configured ones. This is because revoking all may also revoke dbt's ownership of the table, and because some warehouses require explicitly naming the grants to be removed. This implementation reads the grants first, then constructs a minimal revoke and grant statement from the resulting diff.

Implementation

  • create a new macro in the global project: get_show_grant_sql(relation: Relation) -> str that retrieves grant information for that model. The default implementation should return a string in the form "show grants on table dbt_jcohen.whatever"
  • create a new macro in the global project: get_grant_sql(relation: Relation, grant_config: dict) -> str that creates the warehouse-specific sql from grant portion of the config. Since it will be common, the default implementation should return grant <privilege> on <relation.type> <relation> to <recipients>. Warehouses that deviate from this can override. Use dispatch pattern (dbt docs) with a default__ to enable this. The macro signature here is designed to accept different shapes of grant configs for each adapter to use whatever best fits the warehouse's permissions system.
  • create a new macro called get_revoke_sql that takes a relation, and a grant config dict and returns a string in the form f"revoke {grant_config.privilege} on {relation} from {grant_config.recipients}". The grant config will contain multiple priv-recipient mappings.
  • create a new macro in the global project: apply_grants which takes 3 parameters: revoke: Bool, relation: Relation, grant_config: dict and returns None. It calls get_show_grant_sql to determine what grants are currently applied, and uses grant config to determine what grants need to be revoked, and which grants need to be granted. If the revoke param is True, get_revoke_sql is called, then get_grant_sql with a new dictionary representing the diff to apply the grants. (see persist_docs for an example of similar implementation). This should be overridable by adapters so use the dispatch pattern (dbt docs) with a default__ to enable this. For a complete example see @jtcohen6's comment below.
  • add a call to apply_grants(..., revoke=True) in all materializations. This includes incremental models, seeds, and snapshots even though it will usually not be necessary. Users can override if they come across a special case where they need to.
  • [CT-808] Add "adapter zone" tests for grants #5437
    • test that projects can define grants in 1) the dbt_project.yml, 2) models, seeds, and snapshots, and 3) both the dbt_project.yml and models, seeds, and snapshots and those permissions can be read back directly from the postgres warehouse after a run.
    • test two subsequent runs (dbt build && dbt build) with: same grants; additional grants in second run; fewer grants in second run. Confirm that the final set of grants matches the final configured set of grants in all cases.
  • test a model that has preexisting grants, and does not have grants configured. dbt should leave it be, and not run any show/revoke/grant statements.

Adapters

Links for your convenience. These do not need to be completed to close out the ticket.

@nathaniel-may nathaniel-may added enhancement New feature or request triage labels May 17, 2022
@github-actions github-actions bot changed the title [Feature] <title> [CT-660] [Feature] <title> May 17, 2022
@nathaniel-may nathaniel-may changed the title [CT-660] [Feature] <title> [CT-660] [Feature] Add Grant SQL to Global Project May 17, 2022
@jtcohen6
Copy link
Contributor

Copying from Slack message sent a few weeks ago:

On some databases, dbt run completely replaces a model (view or table). All grants previously on that view/table are lost, and need to be reapplied from scratch. This is much easier for us to reason about, since the grants configured on the model by the user will be the exact grants that the view/table ends up with.

On some databases, dbt run replaces a view/table in such a way that the grants can be copied from the old view/table to the new view/table. This tends to be true for databases that use create or replace view|table. In these cases, there's a potential delta between the grants the user has configured (possibly none), and the grants on the resulting view/table. I believe BigQuery + Spark/Databricks work like this by default. Snowflake doesn't do this by default, but it can with the COPY GRANTS option (which is supported as a config in dbt-snowflake).

I think there are two approaches we can take:

  • Revoke all old grants, then apply all new ones. Some databases support a revoke all privileges statement. Definitely simpler.
  • Introspect the database to find current grants, calculate diffs, then just the exact revoke + grant statements needed. More complicated + error-prone. Many databases support a show grants statement, but this needs to be an API call on BigQuery.

Why would we prefer 2 over 1? Here's what I've heard from customers:

  • Because that's the "Terraform-y" approach: minimum changes to achieve new configuration
  • Because Security/DevOps/DBAs raise their eyebrows when they see tons of revoke statements in database logs

I think it's acceptable to pursue option 1 for now if it is indeed simpler. I anticipate we'll want to implement methods/macros for show_grants and/or revoke_all_grants within the adapter interface, in addition to get_grant_sql + apply_grants.

@VersusFacit
Copy link
Contributor

VersusFacit commented May 18, 2022

@jtcohen6 Note from convo with Gerda and Matt. We think we want grants after creation -- after the after_run method -- but these grants should run before post hooks are invoked. That sound right?

@jtcohen6
Copy link
Contributor

@VersusFacit I agree that we should run grants right after the model is created/replaced, to minimize downtime for downstream queriers! Important to call out that "post hooks," as users define them, are run within the materialization here:

{{ run_hooks(post_hooks, inside_transaction=True) }}

{{ run_hooks(post_hooks, inside_transaction=False) }}

The inside_transaction=True|False distinction is only relevant on databases that support transactions. On those databases, I believe it's possible to run grant within a transaction, except on external tables (per Redshift docs), which is okay, that's not really relevant to our case.

The after_run and after_execute methods, within the RunTask / ModelRunner, do other unrelated things.

@nathaniel-may
Copy link
Contributor Author

added revocation logic to implementation block

@jtcohen6
Copy link
Contributor

jtcohen6 commented Jun 2, 2022

Summary of conversation just now

"Call stack":

  • materializations (view, table, incremental, seed, snapshot), in dbt-core and adapter overrides
    • calls apply_grants macro, immediately after the "main" statement
      • calls get_revoke_all_sql macro
      • calls get_grant_sql macro

Default behavior

What should the default do? Correctest by default: revoke all grants, then add all the configured ones. This will always return the correct results, even if it requires an extra step where not strictly needed. On databases that allow us to do it, we should "batch" together those two statements so that they share the same connection/session/transaction:

Pseudo code:

{% macro get_revoke_all_sql(relation) %}
  revoke all on {{ relation.type }} {{ relation }}
{% endmacro %}

{% macro get_grant_sql(relation, grant_config) %}
  {% for privilege in grant_config.keys() %}
     {% set recipients = grant_config[privilege] %}
     grant {{ privilege }} on {{ relation.type }} {{ relation }} to {{ recipients | join(', ') }}
  {% endfor %}
{% endmacro %}

{% macro apply_grants(relation, grant_config, revoke=True) %}
  {% if grant_config %}
    {% call statement('grants') %}
      {{ get_revoke_all_sql(relation) if revoke else "" }}
      {{ get_grant_sql(relation, grant_config) }}
    {% endcall %}
  {% endif %}
{% endmacro %}

If the user hasn't configured grants AT ALL, I assume they don't want us messing with them! If they do want dbt to manage grants on that table, they could specify something like:

{{ config(grants = {'select': []}) }}

dbt would still revoke all grants from the table, and apply no new ones.

If users want much more advanced revocation that calculates diffs dynamically, I'd be excited to see what they come up with! It will be totally possible to customize / reimplement get_revoke_all_sql and/or apply_grants. This is tricky to support in the general case, given different metadata supported by different adapters.

Thinking through different materialization types

Incremental models (partial refresh), seeds, snapshots: To revoke or not to revoke?

  1. Do revoke all grants, because the object isn't being replaced, the old grants are still there, what if I removed one? Do I want dbt to handle that for me?
  2. Don't revoke all grants, because I don't want any downtime for downstream queriers of that object. In case I need to remove some grants, I have to --full-refresh.

I lean toward option 1, which has us revoking all before reapplying. Optimize for correct access over risk of milliseconds of downtime. That means, even on adapters which are revoke=False for creating tables/views, they should be revoke=True for incremental/seed/snapshot.

Thinking through different adapters

As a special case, on Postgres/Redshift/Snowflake, IFF we're completely replacing an object, we don't strictly need the revoke all step. We could shave off a few milliseconds and avoid some DBA eyebrows by skipping it where it isn't strictly needed. The simplest way to go here: revoke=False in view + table materializations, revoke=True elsewhere. Most complete would be conditional logic for incremental/seed/snapshot based on --full-refresh and whether the table is preexisting. I don't view that as necessary for the v1.

BigQuery + Spark/Databricks copy over grants during create or replace, so we should always revoke all anyway.

Fuller gloss on Snowflake

  • If copy_grants: False (default), the grants aren't copied over during create or replace, so there's nothing that needs revoking
  • If copy_grants: True, the grants are copied over during create or replace. Let's assume that the user knows what they're doing, and is opting zero downtime. Meaning, let's not revoke the grants that they just asked to be copied over.
  • Anyone who disagrees with this default can implement custom behavior by overriding one macro.

@nathaniel-may
Copy link
Contributor Author

feedback from the above comment is now reflected in the description.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Jun 3, 2022

@nathaniel-may Mea culpa, I clearly didn't do my full research ahead of time. Glad we're figuring it all out now, rather than mid-implementation!

This works (PostgreSQL), where other is another user/group:

revoke all on table dbt_jcohen.whatever from other;

These do not:

revoke all on table dbt_jcohen.whatever;
revoke all on table dbt_jcohen.whatever from all;

The only way to know how to go here is by first running a metadata query to show who else has grants on the table:

select distinct grantee
from information_schema.role_table_grants
where table_schema = 'dbt_jcohen'
    and table_name = 'whatever'
    and grantee != current_role; -- postgres offers us current_role, but we could also use {{ target.user }}

(This is PostgreSQL, which queries the information_schema. On other databases, the appropriate query might be show grants on table dbt_jcohen.whatever. On BigQuery, it's an API call.)


While many databases do support revoke all in the sense of all privileges, they do not support "revoke all privileges from all grantees." The set of recipients needs to be stated explicitly. This makes sense, since there are implicitly grants to the owner of the table (dbt's user), after having just created it, and we don't want to revoke those.

The good/bad news is, it means we do need to take a slightly cleverer approach, where we first inspect the object to see the grants / grantees on it, and then revoke grants if granted to not-the-current-user. There are still two forms this could take:

  • Simpler: Find just the set of users (not current user) who have grants on the table, then run revoke all on {{ relation.type }} {{ relation }} from {{ grantees | join(', ') }}
  • More sophisticated: Find which grantees have which privileges. Calculates diffs between current and configured/desired. At the point where we're already introspecting grants, that might not be a huge lift, and people will really love us for it. The role of each adapter's method could be to return a standardized zipped-dict of {privilege: [recipients]}, to match the structured of the grant_config.

So I think we will need three macros:

  • get_show_grants_sql
  • get_grant_sql
  • get_revoke_sql

get_show_grants_sql runs first, then 2+3 run in a batch together.

The default case of apply_grants should handle all of those. When apply_grants(revoke=False), we only need to get_grant_sql.


Example

Imagine a case where we'd previously over-granted on a model, and have since trimmed down its grant config to just the necessary select for user other. So previously, we ran an incremental model with:

-- models/whatever.sql

{{ config(
    materialized = 'incremental',
    grants = {
       'select': ['other, 'another'],
       'insert': ['other', 'another'],
    }
) }}

select ...

So dbt previously ran (= as mocked by me in psql):

jerco=# grant select on table dbt_jcohen.whatever to other, another; grant insert on table dbt_jcohen.whatever to other, another;
GRANT
GRANT

But now, its config is just:

-- models/whatever.sql

{{ config(materialized = 'incremental', grants = {'select': ['other']}) }}

Putting that flow together in PostgreSQL:

jerco=# select distinct grantee from information_schema.role_table_grants where table_schema = 'dbt_jcohen' and table_name = 'whatever' and grantee != current_role;
 grantee
---------
 another
 other
(2 rows)
jerco=# revoke all on table dbt_jcohen.whatever from other, another;
REVOKE
jerco=# grant select on table dbt_jcohen.whatever to other;
GRANT

Or, better yet ("sophisticated" approach):

jerco=# grant select on table dbt_jcohen.whatever to other, another; grant insert on table dbt_jcohen.whatever to other, another;
GRANT
GRANT
jerco=# select * from information_schema.role_table_grants where table_schema = 'dbt_jcohen' and table_name = 'whatever' and grantee != current_role;
 grantor | grantee | table_catalog | table_schema | table_name | privilege_type | is_grantable | with_hierarchy
---------+---------+---------------+--------------+------------+----------------+--------------+----------------
 jerco   | other   | jerco         | dbt_jcohen   | whatever   | INSERT         | NO           | NO
 jerco   | other   | jerco         | dbt_jcohen   | whatever   | SELECT         | NO           | YES
 jerco   | another | jerco         | dbt_jcohen   | whatever   | INSERT         | NO           | NO
 jerco   | another | jerco         | dbt_jcohen   | whatever   | SELECT         | NO           | YES
(4 rows)

jerco=# revoke select on dbt_jcohen.whatever from another; revoke insert on dbt_jcohen.whatever from other, another;
REVOKE
REVOKE

Caching?

Eventually, for truly optimal performance, we could think about including grant information for each dbt-controlled resource in the caching queries that dbt runs at the start, so that we're not running show grants or information_schema queries for every single model. I think it's fair to consider that a performance improvement down the line. For now, let's aim for correctness, and speedups only if they're obvious (= fully replacing views/tables on Postgres/Redshift/Snowflake).

@nathaniel-may
Copy link
Contributor Author

made minor changes to apply_grants definition and get_revoke_sql. Fleshed out test bullets as well.

@dbeatty10
Copy link
Contributor

@dbeatty10 dbeatty10 added the grants Issues related to dbt's grants functionality label Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request grants Issues related to dbt's grants functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants