[Feature] add new dbt.deps type: url to internally hosted tarball #4205

Closed
1 task done
timle2 opened this issue Nov 4, 2021 · 8 comments · Fixed by #4689
Labels
deps (dbt's package manager) · enhancement (New feature or request) · help_wanted (Trickier changes, with a clear starting point, good for previous/experienced contributors)

Comments

@timle2
Contributor

timle2 commented Nov 4, 2021

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Summary

From experience with dbt at a large org, there is no great way to distribute internal (non-public) dbt modules. The only currently supported option is GitPackages, which means shipping ssh credentials to wherever dbt deps will be run. A second option for private module sources is suggested below: direct links to tarballs, supported as unpinned only. The implication is a source type that does not leverage all of the pinned/unpinned internals, but that trade-off seems reasonable in exchange for a clean, direct install source for private packages that is not git.

Full Description

dbt deps currently supports package installation via three options:

  • LocalPackage: install via symlink. Audience: developers with a local copy of the dbt module to install.
  • GitPackage: install via git checkout. Audience: end users and production systems with modules that can be posted publicly. Also supports private modules for end users/production systems with the appropriate git credentials.
  • RegistryPackage: install by downloading a tarball from https://hub.getdbt.com/ and untarring it to the modules destination. Implemented via a dedicated client due to API interactions with the Registry. Audience: end users and production systems with modules that can be posted publicly. Great for beginners and power users alike.
What's missing?
  • Non-public modules hosted outside of GitHub. GitPackages, in production, require bundling production systems with git credentials.
  • Wouldn't it be nice to be able to directly link to tarballs hosted on Artifactory-like services? That way there is no need to pass around a GitHub ssh key on production systems.
  • Wouldn't it be nice to be able to directly link to tarballs hosted on internal GCS/S3/Blob storage, instead of hitting git for installs?
Solution?
  • Allow the option to link to a tarball url. This lets orgs permission the tarball in a very granular way (whether hosted on an Artifactory-like service or on cloud storage), instead of the broader git + ssh solution for non-public modules.
# example packages.yml
packages:
  - tarball: https://codeload.github.com/dbt-labs/dbt-utils/tar.gz/0.6.6
    # public package referenced here as example
    sha1: 549db5560efccbadc7ec7834c03cf13579199a66
    # optional. if specified, the downloaded tar must match for the install to proceed.
    subdirectory: dbt-utils-0.6.6
    # optional. same behavior as subdirectory arg with git packages
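
For reference, the value for the optional sha1 field could be computed locally before it is added to packages.yml. Below is a minimal sketch using only the Python standard library; the file name is illustrative, not part of the proposal.

# sketch: compute the sha1 of a locally downloaded tarball (file name is illustrative)
import hashlib

def tarball_sha1(path: str) -> str:
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(tarball_sha1("dbt-utils-0.6.6.tar.gz"))  # paste the result into packages.yml
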
Appendix:
Why git is a poor solution / why a second option for private modules would benefit end users
  • Git is not brilliant at scoping access via ssh credentials.
  • From a security standpoint, bundling git credentials on production systems is very poor security practice. Yes, one could dynamically load them in; no, that is not a simple thing to orchestrate.
  • For companies with enterprise GitHub setups, extra git clones on every CI/CD build are a waste of finite resources.
  • Larger companies already have Artifactory-like services set up; being able to reference tarballs directly on these internal platforms (a) solves all scoping issues without the need to pass around an ssh key and (b) leverages the caching on those services for free.
  • Outside of using Artifactory-like services as a source, one could instead imagine a tarball hosted on S3/GCS, where permissioning access is much easier via cloud IAM config. It is easy to see how a setup like this would be preferable to handling access rights through git ssh keys.

How I got here:

From experience with dbt at a large org, there is no great way to distribute internal (non-public) dbt modules. Gatekeeping for non-public repos relies on git credentials, but this is poor security practice (and not supported by our CI/CD infra, for good reason). Simply put, those reliant on existing infra for managing builds may not have the luxury of issuing arbitrary git clone commands to external repos (as dbt deps does for GitPackage sources).

Describe alternatives you've considered

  • Currently, I bundle internal packages as part of our core dbt install image. All installs are then handled by making LocalPackage symlinks to the packages on the core dbt image (along with some docker magic). It's working, but it's a hack, and being able to do a straight install from a url would make me so much happier.
  • Creating an internal alternative to https://hub.getdbt.com/: redirect to an internal service by leveraging `DBT_PACKAGE_HUB_URL`, and emulate the current hub.getdbt.com API behavior in a new internal-only service, where the JSON object returned by the API for a given package has a clear tarball source link. This would be a complicated way to do the same thing as suggested here. A self-hosted hub service remains an option for power users, but it is, to say the least, a complex solution to a simple problem.
  • Submodules. Not bad, but very manual, and does not easily scale for multiple teams creating and providing packages for install.

Who will this benefit?

Any user who wants to run dbt deps against internally hosted modules and does not wish (or is not permitted) to host the packages on git.

Are you interested in contributing this feature?

Yes, PR in progress here

Sketch of planned work:

  1. Add a TarballPackage data class to project.py. No plan to support pinned/unpinned, since handling versioning is all but impossible given an arbitrary link to a tarball. The dataclass would look much like the existing LocalPackage (a rough sketch follows this list).
  2. Add dbt.deps remoteTarball, with TarballUnpinnedPackage as the only class. It would look much like the local package in structure, but recycle most of the registry install design pattern.
  3. Update dbt.deps.resolver to support the new url type, mapping it to the newly created dbt.deps.tarball remoteTarball.
  4. Add unit tests.
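
As a rough illustration of items 1-3 only (this is not the actual dbt-core implementation; the class and field names follow the proposal above, everything else is an assumption):

# illustrative sketch only -- structure and helper behavior are assumptions, not dbt-core's real API
from dataclasses import dataclass
from typing import Optional

@dataclass
class TarballPackage:
    tarball: str                         # direct url to the hosted tarball
    sha1: Optional[str] = None           # optional integrity check
    subdirectory: Optional[str] = None   # optional path inside the tarball

class TarballUnpinnedPackage:
    """Only one class: an arbitrary url carries no version metadata to pin against."""

    def __init__(self, package: TarballPackage) -> None:
        self.package = package

    def install(self, dest_dir: str) -> None:
        # mirror the registry install flow: download the tarball, optionally verify
        # its sha1, untar it, then copy the (sub)directory into dest_dir
        raise NotImplementedError("sketch only")
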

Anything else?

First-time dbt contributor, long-time day-to-day dbt user. Looking forward to packaging this up and making my first contribution!

@timle2 timle2 changed the title [Feature] add new dbt.deps type: absolute links to tarball [Feature] add new dbt.deps type: absolute url to internally hosted tarball Nov 4, 2021
@timle2 timle2 changed the title [Feature] add new dbt.deps type: absolute url to internally hosted tarball [Feature] add new dbt.deps type: url to internally hosted tarball Nov 4, 2021
@jtcohen6 jtcohen6 added enhancement New feature or request triage labels Nov 4, 2021
timle2 pushed a commit to timle2/dbt-core that referenced this issue Nov 6, 2021
@timle2
Contributor Author

timle2 commented Nov 6, 2021

I've started packaging this up on a fork here. Would love to get approval for running workflow related tests ❤️. Thanks!

@jtcohen6
Contributor

jtcohen6 commented Nov 7, 2021

@tskleonard I appreciate the use case you're outlining! It sounds like support for an arbitrary tarball is a reasonable compromise: it would unblock users in your situation, while still putting most of the onus on them to host the right code at the right URLs.

In particular, I like the point about not supporting version/dependency resolution. If that's something you want, then a self-hosted Hub mirror is actually the right-sized solution! I agree that it's decidedly overkill for the simpler case you're outlining.

@leahwicz Let's check in with folks on the dbt Cloud side of the house, to see if there's security concern around support for downloading arbitrary tarballs. Users still wouldn't be able to execute arbitrary commands, but this does feel like a step beyond the current options of (a) downloading a known tarball, registered in the dbt Hub, or (b) git clone + user-provided repository. A user-provided tarball could contain any code.

Just noting, if of interest, that this feels adjacent to some conversations we've had in the past:

  • slack thread from 2 years ago: desire to use standard python/pip tooling for dependencies with dbt packages as well
  • slack thread from 4 months ago: Artifactory allows proxying pip, in a corporate environment that disallows access to arbitrary GitHub resources. It sounds like this might be a legit way to integrate with Artifactory!

@jtcohen6 jtcohen6 added good_first_issue Straightforward + self-contained changes, good for new contributors! packages Functionality for interacting with installed packages and removed triage labels Nov 7, 2021
@timle2
Contributor Author

timle2 commented Nov 8, 2021

Hi @jtcohen6 - thanks for the excellent feedback here.

One day I would love to stand up an internal hub service, with private repos, caching, and some sort of Artifactory-backed proxy/public-package passthrough. Until then, I'm glad you see the value in direct tarball urls as an in-between solution.

Your point about the security concerns of supporting downloads of arbitrary tarballs is well taken. I've updated my proposal (and code) to include an optional sha1 argument for the tarball dep type. This allows users to specify the expected sha1 of the tarball; if specified, the dbt deps install will only proceed if the downloaded tarball matches the sha1 in packages.yml. I'll note that the dbt.hubs repository type install does not currently do this, and it might be worth adding there as well.
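
For illustration only (this is not the code in the linked PR), the check described above amounts to something like the following sketch, where the url and expected sha1 come from packages.yml and the destination directory is hypothetical:

# sketch: download a tarball, verify its sha1 against packages.yml, then unpack it
import hashlib
import tarfile
import tempfile
import urllib.request
from typing import Optional

def fetch_verify_unpack(url: str, expected_sha1: Optional[str], dest_dir: str) -> None:
    with tempfile.NamedTemporaryFile(suffix=".tar.gz") as tmp:
        with urllib.request.urlopen(url) as resp:
            tmp.write(resp.read())  # real code would stream and enforce a max size
        tmp.flush()
        tmp.seek(0)
        if expected_sha1 is not None:
            actual = hashlib.sha1(tmp.read()).hexdigest()
            if actual != expected_sha1:
                raise RuntimeError(f"sha1 mismatch: expected {expected_sha1}, got {actual}")
            tmp.seek(0)
        with tarfile.open(fileobj=tmp, mode="r:gz") as tar:
            tar.extractall(dest_dir)  # real code should validate member paths first
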

I also appreciate the links to previous Slack conversations around this theme; I'm not surprised other users in larger orgs are running into similar problems.
I gave a lot of thought to a pip-style implementation initially, but eventually came around to this solution after realizing that pip is just pulling tar.gz files from a pypi/Artifactory server anyway. So why not point to those files directly, instead of using pip (and a lot of extra setup.py magic) to get it to work?
As this feature (dbt deps tarballs) evolves, I'll solicit feedback from those threads you shared to see whether this solution would have been a reasonable fix in both of those previous circumstances. My gut is that it would have been workable in both cases.

I've got the feature working, and will finish up with docs, tests, and cleanup over the next couple of days. Will update the PR (#4220) accordingly, and reach out once ready for review. Thanks!

@jtcohen6 jtcohen6 added help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors and removed good_first_issue Straightforward + self-contained changes, good for new contributors! labels Feb 2, 2022
@timle2 timle2 mentioned this issue Feb 5, 2022
timle2 added a commit to timle2/dbt-core that referenced this issue Feb 21, 2022
in support of
dbt-labs#4205

flake8 fixes

adding max size tarball condition

clean up imports

typing

adding sha1 and subdirectory options; improve logging feedback

sha1: allow user to specify sha1 in packages.yaml, will only install if package matches
subdirectory: allow user to specify subdirectory of package in tarfile, if the package is a non standard structure (like with git subdirectory option)

simple tests added

flake fixes

cleanup

cleanup comments; remove asserts
timle2 added a commit to timle2/dbt-core that referenced this issue Feb 21, 2022
@jtcohen6 jtcohen6 added deps dbt's package manager and removed packages Functionality for interacting with installed packages labels Mar 30, 2022
@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Sep 28, 2022
@timle2
Contributor Author

timle2 commented Oct 2, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

Will pick this back up in the next several weeks 👍

@github-actions github-actions bot removed the stale Issues that have gone stale label Oct 3, 2022
@the-serious-programmer

@timle2 thanks so much for creating this issue and picking this up.

We're running into a lot of dbt projects that just commit packages directly into their repository because they cannot access the internet when running dbt in Airflow, for instance. The described tarball solution would help us out tremendously!

@timle2
Contributor Author

timle2 commented Nov 6, 2022

@the-serious-programmer (and any others), I've picked this back up (rewrite #3) and am hoping to get it past the finish line this month. Please, everyone, voice your support for this feature in #4689; it's the only way this will make it into a release!

emmyoop added a commit that referenced this issue Dec 7, 2022
* v0 - new dbt deps type: tarball url

in support of
#4205

* flake8 fixes

* adding max size tarball condition

* clean up imports

* typing

* adding sha1 and subdirectory options; improve logging feedback

sha1: allow user to specify sha1 in packages.yaml, will only install if package matches
subdirectory: allow user to specify subdirectory of package in tarfile, if the package is a non standard structure (like with git subdirectory option)

* simple tests added

* flake fixes

* changes to support tests; adding exceptions; fire_event logging

* new logging events

* tarball exceptions added

* build out tests

* removing in memory tarball test

* update type codes to M - Misc

* adding new events to test_events

* fix spacing for flake

* add retry download code - as used in registry calls

* clean

* remove saving tar in memory inside tarfile object

will hit url multiple times instead

* remove duplicative code after refactor

* black updates

* black formatting

* black formatting

* refactor - no more in-memory tarfile - all as file operations now

- remove tarfile passing, always use tempfile instead
- reorganize system.* functions, removing duplicative code
- more notes on current flow and structure - esp need for pattern of 1) unpack 2) scan for package dir 3) copy to destination.
- cleaning

* cleaning and sync to new tarball code

* cleaning and sync to new tarball code

* requested changes from PR

#4689 (comment)

* reversions from revision 2

removing sha1 check to simplify/mirror hub install pattern

* simplify/mirror hub install pattern

to simplify/mirror hub install pattern
- removing sha1 check
- supply name/version to act as our 'metadata' source

* simplify/mirror hub install pattern

simplify with goal of mirroring hub install pattern
- supporting subfolders like git packages, and sha1 checks are removed
- existing code from RegistryPinnedPackage (install() and download_and_untar()) performs the operations
- RegistryPinnedPackage install() and download_and_untar() are not currently set up as functions that can be used across classes - this should be moved to dbt.deps.base, or to a dbt.deps.common file - need dbt labs feedback on how to proceed (or leave as is)

* remove revisions, no longer doing package check

* slim down to basic tests

more complex features have been removed (sha1, subfolder) so testing is much simpler!

* fix naming to match hubs behavior

remove version from package folder name

* refactor install and download to upstream PinnedPackage class

i'm on the fence if this is right approach, but seems like most sensible after some thought

* Create Features-20221107-105018.yaml

* fix flake, black, mypy errors

* additional flake/black fixes

* Update .changes/unreleased/Features-20221107-105018.yaml

fix username on changelog

Co-authored-by: Emily Rockman <ebuschang@gmail.com>

* change to fstring

Co-authored-by: Emily Rockman <ebuschang@gmail.com>

* cleaning - remove comment

* remove comment/question for dbt team

* in support of issuecomment 1334055944

#4689 (comment)

* in support of issuecomment 1334118433

#4689 (comment)

* black fixes; remove debug bits

* remove `.format` & add 'tarball' as version

'tarball' as version so that the temp files format nicely:
[tempfile_location]/dbt_utils_2..tar.gz # old
vs
[tempfile_location]/dbt_utils_1.tarball.tar.gz # current

* port os.path refs in `PinnedPackage._install` to pathlib

* lowercase as per PR feedback

* update tests after removing version arg

goes along with 8787ba4

Co-authored-by: Emily Rockman <ebuschang@gmail.com>
@manugarri

+1 to tarballs hosted on S3/GCS, much easier to interact with than having to set up Artifactory.
