Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding dbt parse hook #195

Merged
merged 5 commits into from
Mar 28, 2024
Merged

Conversation

pgoslatara
Copy link
Contributor

Addresses #166.

I really like what dbt-checkpoint offers and want to use it on most of the dbt projects I work with. One downside is that it needs to run dbt compile for a lot of hooks and this requires a connection to the underlying database. This can be rather painful as a lot of my recent clients are using dbt Cloud and when we try to enable dbt-checkpoint through an Azure DevOps pipeline or GitHub workflow then we need to add authentication from the version control provider to the database, this is a barrier for a lot of clients and a hard blocker for some. This PR adds a dbt-parse hook that generates a manifest.json but doesn't require a database connection.

To do:

  • Re-evaluate what hooks can be updated to only require dbt-parse and not dbt-compile.
  • Add tests.

@pgoslatara
Copy link
Contributor Author

@BAntonellini I'd be interested in your opinion on this, especially,

  • Is this a suitable way forward?
  • What work is required to safely merge a PR like this?

@noel
Copy link
Collaborator

noel commented Mar 18, 2024

how are you handling the hooks that require docs generate? In general we are actually thinking of deprecating all dbt commands from the hooks as they can be run as prior steps in the pipeline and doing that you can fix issues such as those when you run slim-ci

@pgoslatara
Copy link
Contributor Author

how are you handling the hooks that require docs generate? In general we are actually thinking of deprecating all dbt commands from the hooks as they can be run as prior steps in the pipeline and doing that you can fix issues such as those when you run slim-ci

That would be interesting! Is there somewhere I can follow this conversation? My concern with this approach is that for pre-commit to work effectively locally it needs the latest version of the dbt artifacts, so my preference is that these are generated locally on every run. Obviously this can be slow, but this is the downside that comes with the upside of these awesome hooks. And in a CI pipeline we can just skip these hooks as the artifacts are already present (I already do this here). Remember as well that Slim CI is a dbt Cloud feature, not every dbt project will have this.

Regarding this PR, my thought process is that I'd need to re-evaluate whether each hook requires a compile, parse or docs generate command to be run beforehand and update the documentation. Bit manual but not impossible to do. At this stage I just wanted validation that such a change would be acceptable, then it makes sense for me to invest this time.

@noel
Copy link
Collaborator

noel commented Mar 20, 2024

@pgoslatara Slim CI is not a dbt Cloud feature, this is in core as well as deferral. We have many clients doing this in Datacoves.
There are a few patterns I have seen:

  1. manifest is in the repo, dont like this, but some people do it
  2. manifest is in S3. your ci/cd process pushes it there when you deploy
  3. manifest is in a snowflake stage similar to Add integration tests #2

I am curious about your CI/CD process. Are you having dbt cloud do the actual run and hence the CI runner does not need the credentials to the db?

Have you looked at the dbt Cloud Admin API? You can pull the manifest from there as well.

@pgoslatara
Copy link
Contributor Author

manifest is in a snowflake stage similar to #2

I don't think this link is what you intended.

Slim CI is not a dbt Cloud feature

True, I've implemented this using dbt Core. I think my process doesn't really matter here, I think dbt-checkpoint should be suitable for use with dbt Core and dbt Cloud. Generating the manifest.json at run time, downloading it via API, saving it in the repo, etc., all of these are valid ways of working and dbt-checkpoint is flexible enough for them all.

My concern is the local development experience. If we are "deprecating all dbt commands from the hooks" then the developer who uses dbt Core suffers as they will now have an additional step to perform to ensure their hooks have the latest artifact available every time they commit. Let's re-focus on the original question, is there value in adding a dbt-parse hook to this repo:

  • If "yes", then I'm happy to continue working on this PR and welcome any guidance.
  • If "no", then I'm happy to close this PR, although I really would like to participate in whatever discussion is ongoing about the future of this repo.

@noel
Copy link
Collaborator

noel commented Mar 20, 2024

Ok. Feel free to add it. Make sure to update any tests and readme.

One thing I want to make sure is that if there are any hooks where this approach doesn’t work then we call it out. Conversely if this works for all hooks then it should be preferred over compile, no?

�[0m18:19:33  Registered adapter: duckdb=1.7.3
�[0m18:19:33  Found 20 models, 53 tests, 0 sources, 0 exposures, 0 metrics, 615 macros, 0 groups, 0 semantic models
�[0m18:19:33
�[0m18:19:33  Concurrency: 8 threads (target='dev')
�[0m18:19:33   to �[0m18:19:37  Running with dbt=1.7.10
�[0m18:19:37  Registered adapter: duckdb=1.7.3
�[0m18:19:37  Performance info: /home/pslattery/repos/medium_scraper/dbt/target/perf_info.json
@codecov-commenter
Copy link

codecov-commenter commented Mar 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.94%. Comparing base (154fee6) to head (28b8fed).

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #195      +/-   ##
==========================================
+ Coverage   96.91%   96.94%   +0.03%     
==========================================
  Files          55       56       +1     
  Lines        2592     2622      +30     
  Branches      349      349              
==========================================
+ Hits         2512     2542      +30     
  Misses         59       59              
  Partials       21       21              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@noel
Copy link
Collaborator

noel commented Mar 24, 2024

@pgoslatara thanks for the PR.
qq, how do you handle hooks that need catalog.json if you dont have access to the db in the CI?

@pgoslatara
Copy link
Contributor Author

@pgoslatara thanks for the PR. qq, how do you handle hooks that need catalog.json if you dont have access to the db in the CI?

@noel You're fast 🚀! In this scenario it's rather simple, those hooks cannot be used. I've sometimes been in this situation when the system where the CI runs is not allowed to access the database, adding dbt-parse to this package will finally allow us to use dbt-checkpoint for all hooks that do not require catalog.json.

@pgoslatara
Copy link
Contributor Author

Ok. Feel free to add it. Make sure to update any tests and readme.

One thing I want to make sure is that if there are any hooks where this approach doesn’t work then we call it out. Conversely if this works for all hooks then it should be preferred over compile, no?

I've updated this PR, primarily adding a test and updating documentation.

Regarding hooks where dbt-parse doesn't work, these are already identified in HOOKS.md by requiring catalog.json.

Regarding what should be the preferred approach, I've updated the documentation for all hooks that previously said:

It means that you need to run `dbt run`, `dbt compile` before run this hook.

To now read:

It means that you need to run `dbt parse` before run this hook.

The difference between the output of dbt parse and dbt compile is that dbt parse creates a manifest.json but without any information about the compiled SQL. But as none of dbt-checkpoint's hooks use this compiled SQL, all of them can instead be run after a dbt parse command. Exception of course being the catalog.json hooks as you point out in your previous comment, but the documentation for these hooks is untouched by this PR.

@BAntonellini
Copy link
Collaborator

BAntonellini commented Mar 25, 2024

Hey @pgoslatara

We found this in dbt parse docs:

Starting in v1.5, dbt parse will write or return a [manifest](https://docs.getdbt.com/reference/artifacts/manifest-json), enabling you to introspect dbt's understanding of all the resources in your project.

Meaning this PR won't work in dbt <1.5

Could you add some kind of warning message (maybe in the README) with this information?

@pgoslatara
Copy link
Contributor Author

Hey @pgoslatara

We found this in dbt parse docs:

Starting in v1.5, dbt parse will write or return a [manifest](https://docs.getdbt.com/reference/artifacts/manifest-json), enabling you to introspect dbt's understanding of all the resources in your project.

Meaning this PR won't work in dbt <1.5

Could you add some kind of warning message (maybe in the README) with this information?

@BAntonellini Good catch! In 123452f I added that dbt >= 1.5 is required when dbt parse is used.

@pgoslatara
Copy link
Contributor Author

@BAntonellini @noel Happy to move forward with this or is there another aspect we should look into?

@BAntonellini BAntonellini merged commit 0b1299b into dbt-checkpoint:main Mar 28, 2024
7 checks passed
@jaklan
Copy link

jaklan commented Apr 14, 2024

Why the hook is configured to run only for SQL files?

types_or: [sql]

Changes in YAML files, e.g. in dbt_project.yml, can affect parsing, so they should be also taken into consideration.

@jaklan
Copy link

jaklan commented Apr 14, 2024

Also, the hook is missing:
pass_filenames: false
instead of:
require_serial: true

add_filenames_args(parser) and add_dbt_cmd_model_args(parser) are also irrelevant.

@pgoslatara pgoslatara mentioned this pull request Apr 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants