
[CT-313] Package new adapters tests #4812

Closed
leahwicz opened this issue Mar 2, 2022 · 4 comments · Fixed by #4948
Labels: adapter_plugins (Issues relating to third-party adapter plugins), repo ci/cd (Testing and continuous integration for dbt-core + adapter plugins)


leahwicz commented Mar 2, 2022

We want to package the new adapter tests and have one internal adapter able to use them.

@github-actions github-actions bot changed the title Package new adapters tests [CT-313] Package new adapters tests Mar 2, 2022

jtcohen6 commented Mar 21, 2022

@gshank and I just had a chance to talk through this. Notes from the conversation:

Requirements / user stories

  • As a contributor to dbt-core, when I make a change in dbt-core that requires an associated change to the adapter tests (whether underlying framework, reused fixtures/methods, or specific test cases)...
    • Bad: I don’t find out the tests are broken until someone reports an issue — status quo with pytest-dbt-adapter
    • Good: I find out during CI.
    • Better: I can make requisite changes within the same PR (same codebase).
  • As a plugin developer/maintainer, I can install the adapter tests (framework + basic test cases) using pip...
  • As a plugin maintainer, I know which version of the adapter tests to install because...
    • Bad: The versions make no sense and are documented nowhere — status quo with pytest-dbt-adapter
    • Good: Version compatibility matrix is clearly documented.
    • Better: Version compatibility works the same as for all plugins—Major.Minor always match, then use latest patch for each plugin
      • Caveat: The adapter framework would be major-version 1 from the jump. Are we okay with that? (It's also true of brand-new adapter plugins that follow our versioning recommendations)
  • As a plugin maintainer, I need to make changes to tests when...
    • Bad: Anything changes in dbt-core
    • Good: In minor versions, to account for internal changes that are clearly documented / communicated
    • Best: When actual, end-user-facing dbt-core functionality changes
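The "Better" versioning rule above (Major.Minor always match, patch floats freely) can be sketched as a simple compatibility check. This is a toy sketch; the helper name is hypothetical, not part of any dbt package:

```python
# Hypothetical sketch of the proposed rule: an adapter-tests release is
# compatible with a dbt-core release when Major.Minor match exactly;
# the patch component may differ freely.
def is_compatible(core_version: str, tests_version: str) -> bool:
    core_major_minor = core_version.split(".")[:2]
    tests_major_minor = tests_version.split(".")[:2]
    return core_major_minor == tests_major_minor

print(is_compatible("1.1.2", "1.1.0"))  # True: patch may differ
print(is_compatible("1.1.0", "1.2.0"))  # False: minor mismatch
```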

My instincts

  • Codebase: same as dbt-core (this is still an open question)
  • Installation: separate
    • Utilities as extra: pip install dbt-core[tests] — people don’t need test dependencies (pytest) unless they want them
    • Test cases as totally separate python package that requires dbt-core
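The "utilities as extra" idea could look roughly like this in dbt-core's packaging metadata. This is a sketch with illustrative dependency names, not the actual config:

```python
# Sketch of an extras_require entry (hypothetical; pins are illustrative).
# With this, `pip install dbt-core` skips test dependencies, while
# `pip install "dbt-core[tests]"` also pulls in pytest.
extras_require = {
    "tests": [
        "pytest>=7.0",
    ],
}
```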

What's included in the package?

  1. “Baseline” functionality that every plugin must pass (or override/reimplement for a good reason) — parity with status quo pytest-dbt-adapter
  2. Optional test cases for additional functionality that is common across plugins. All of our plugins will need many of these (currently copy-pasted from dbt-core), and I anticipate external plugins may want to opt into some of them as well.
  3. Reusable fixtures + framework. These will enable each plugin to define net-new tests, for behavior that is specific/unique to that database.

@jtcohen6 jtcohen6 added repo ci/cd Testing and continuous integration for dbt-core + adapter plugins adapter_plugins Issues relating to third-party adapter plugins labels Mar 22, 2022

jtcohen6 commented Mar 25, 2022

Thinking through two of the open questions mentioned above:

Interface boundary: test utilities vs. test cases

We feel good about this interface boundary:

  1. Reusable test utilities (dbt.tests module, usable by anyone running pytest + dbt). These are pretty "close to the metal," in that they hook into dbt-core internals and then expose a (relatively) stable interface for testing higher-order functionality. This is theoretically useful to any user of dbt-core, since it allows anyone to test dbt functionality (including their own custom dbt code) via pytest. (Cool!) Even if we don't document it right away, it will form part of the eventual feature-set of dbt-core as a Python library. One thought we had was whether these test utilities want to be an extra requirement, not included by default: pip install dbt-core[tests]. That's just a judgment call on our part, and should be very simple to execute if desired. Let's just make a decision before releasing v1.1.0rc1.
  2. Specific test cases (top-level tests module in this repo) should not live within dbt-core (the Python package), but they are key to the process of developing and testing dbt-core (the Python package). This includes both (a) unit tests that can be run with only dbt-core installed, and (b) functional/integration tests that require an adapter/database.

For many, many tests of standard dbt-core behavior with a database, we use Postgres, and we don't need to replicate these tests on other databases. (Someday, we could aim to use another, even more lightweight database, such as SQLite, if we could mock things out to our satisfaction. That would enable us to move dbt-postgres out to a separate repository.)

A subset of the test cases described in (2) do need to be packaged up, because we want to test the same functionality against many (if not all) of the database adapters that live external to this repository. I've called this the "adapter zone" :) It's conceivable that we'd want to move tests, originally written for one adapter plugin, into the "adapter zone" if we realize that the underlying behavior is similar on another adapter, and we want to guarantee that it works the same, without needing to copy-paste-edit the code for it.

So, we want to bundle up those reusable "adapter zone" test cases, and make them available as a standalone Python package.
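As a sketch of how a plugin might consume those packaged "adapter zone" cases: the class names below are hypothetical, but the shape (inherit the base case, override only what differs on your database) is the pattern under discussion.

```python
# Hypothetical base test case, as it might ship in a packaged test suite.
class BaseIncrementalTest:
    def model_sql(self):
        # Default SQL, expected to work on a Postgres-like database.
        return "select 1 as id"

    def test_runs(self):
        # Stand-in for the real assertion logic (running dbt, checking rows).
        assert "select" in self.model_sql()

# In an adapter plugin's test suite: inherit everything, and override only
# the pieces where this database's SQL dialect differs.
class TestIncrementalMyAdapter(BaseIncrementalTest):
    def model_sql(self):
        return "select 1 as id from dual"  # e.g. an Oracle-style dialect
```

If the default works on a given database, the subclass body can be a bare `pass`; the test cases run unchanged.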

Where should the code for those reused test cases live?

There are two options:

  1. dbt-core repository
  2. Another repository

There are pros and cons to both approaches. We should do our best to list them all out and assess. I think we can also tackle this question from two angles: philosophical and practical.

Philosophically: Do these test cases represent a totally separate concern from dbt-core? Does there exist a solid interface between the two, that we want to formalize with separate repositories? A few thoughts:

  • The "adapter interface" lives within dbt-core (Python package) today — dbt.adapters module — it's tightly coupled code
  • dbt-core and dbt-tests-adapter don't represent separate pieces of software, as much as they represent code needing tests and tests for that code. That's true of all test types today (unit, functional-Postgres-only, and "adapter zone").
  • We do want these test cases to be fairly stable between minor versions, and non-breaking between patch releases, so that adapter plugin maintainers can treat the dbt-tests-adapter dependency much as they treat dbt-core. It can be helpful to enforce that non-breakiness by eating our own dog food. And we will: we'll be reliant on dbt-tests-adapter as "external maintainers" of dbt-redshift, dbt-snowflake, ...

Practically, here are the main pain points to consider:

  • Releasing a Python package that lives in a subdirectory is more difficult than releasing a top-level package in a standalone repository. More importantly, having two different release processes presents difficulty. dbt-postgres makes this a necessity today, but our ultimate aim is to split out dbt-postgres as a top-level package in a standalone repository.
    • Does this get any simpler if we separate out our release processes, so that we separately release dbt-core vs. dbt-postgres vs. dbt-tests-adapter? That way, we can patch each of these independently (separate workflows), and coordinate those workflows for minor version releases. We'd just need GitHub releases/tags to include the package name.
    • The dbt-tests-adapter release process really just needs to build a distribution and upload it to PyPI. No Homebrew, Docker, ...
  • Every time we add dbt-core functionality that warrants test cases on external (non-Postgres) adapters — which isn't every week, but also is common enough, especially for community contributors — it would require coordinating PRs across multiple repositories, rather than an "all-in-one" PR, just to get the initial thing merged. (It will always require a separate PR in each external adapter plugin, to extend the functionality and inherit/run the new test cases.)
    • Coordinating the dbt-core repo PR and the dbt-tests-adapter repo would require something like: point one PR branch to install from the other, get tests passing, merge one, point the other back to main, and merge the other. This is tricky, though not impossible.

For the time being, while we're assessing our options:

  • Let's keep these test cases in the dbt-core repository
  • Let's not push up to PyPI just yet, and ask folks to keep installing as pip install "git+https://github.com/dbt-labs/dbt-core.git#egg=dbt-tests-adapter&subdirectory=tests/adapter"


iknox-fa commented Mar 28, 2022

I largely agree with the characterization you've put forth @jtcohen6 but I'd like to present a slightly different way to think of it.

The current way to describe dbt's SQL support is something like "it works if there's an adapter for it". That's a fine abstraction for end users and compatibility questions, but when it comes to testing it leaves a lot un-DRYed, because each database has a very large amount of overlapping functionality.

The way to DRY these tests up is to determine the largest subset of SQL that is covered by all official adapters. This subset can then be used to create a test suite that any database adapter can use to ensure basic functionality. Functionality beyond this SQL subset would be the responsibility of the adapter itself to test. Most SQL subsets/dialects have names and years to provide versioning, like ANSI SQL 1999, so perhaps we could follow suit with dbt-sql-common-subset 2022?
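The "largest common subset" idea can be pictured as a set intersection over per-adapter capabilities. This is a toy sketch with made-up feature names, purely to illustrate the shape of the computation:

```python
# Toy illustration: the testable common subset is the intersection of
# what every official adapter supports (feature names are made up).
adapter_features = {
    "postgres": {"create_table_as", "merge", "comments"},
    "snowflake": {"create_table_as", "merge", "clustering"},
    "spark": {"create_table_as", "insert_overwrite"},
}

common_subset = set.intersection(*adapter_features.values())
print(sorted(common_subset))  # → ['create_table_as']
```

Everything in `common_subset` would belong in the shared test suite; everything outside it stays each adapter's own responsibility.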

If this approach is taken, I think the structure falls into place pretty easily. dbt-core, dbt-<db-name>, and dbt-tests-adapter can all live in their own repos. Any dbt-<db-name> adapter can import dbt-tests-adapter to test itself against dbt-sql-common-subset 2022, regardless of whether it's internal or external.

Over time the dbt-tests-adapter repo can become a place to house various versions of the tests and associated schemas that define the dbt-core/dbt-<db-name> interface.


jtcohen6 commented Mar 29, 2022

@iknox-fa Appreciate that perspective! I'm aligned with many parts of your vision. At the risk of waxing even more philosophical...

From my point of view, a lot of the "very large amount of overlap in functionality" is already captured and implemented within dbt-core:

  • Adapters interface, which defines constructs like schemas, relations, quoting behavior, etc
  • Global project, which defines the default__ versions of macros (create_schema, create_table_as, etc)

(We want to capture even more of that overlap, by moving some of the common reusable building blocks from dbt-utils. Those aren't strictly necessary for baseline dbt operation, but they're very very useful building blocks, for people writing packages with higher-level functionality: #4813)

We've run into issues before by assuming this common subset is larger than it actually is, and over the past few years we've needed to significantly expand the scope of where an adapter plugin can reasonably intervene / reimplement / override / disable that presumed subset of behavior. There is actually a tremendous amount of difference between the support offered by ANSI SQL, Hive SQL, T-SQL, and other "dialects" (are they really even dialects?). Some analytical data platforms have lacked full SQL support for accessing metadata, managing objects (DDL/DML), and operating permissions (DCL); each has required hitting other APIs.

What do we say if the mountains will not come to the Prophet? In all cases, the right answer has been: Let's find a way to get dbt working on this data platform, by defining a clear-yet-flexible interface for all the things that might differ across databases. It's the role of the adapter creator / data platform expert to match up the square-ish pegs and the round-ish holes as best as they can.

What we've said historically, and continue to say today: The more your database functions like the accepted standard bits of Postgres/PostgreSQL, the easier it will be to write your adapter, and the easier it will be for users of your adapter to get dbt ecosystem packages "for free" (no shimming/reimplementing of low-level macros required). The more your data platform differs from a Postgres-like database, the more your work is cut out for you—but it's still possible.

Counting lines of code is a highly imperfect proxy for complexity. Still, I think these differences can be instructive:

  • dbt-postgres contains 364 lines of Python + 336 lines of Jinja-SQL
  • dbt-redshift contains 298 lines of Python + 534 lines of Jinja-SQL, because it adopts almost all its functionality from the standard behavior (default + Postgres)
  • dbt-snowflake contains 709 lines of Python (mostly dealing with wonky connection options) + 571 lines of Jinja-SQL
  • dbt-spark (full of idiosyncrasies) contains 1306 lines of Python + 721 lines of Jinja-SQL
  • dbt-bigquery (arguably the most different) contains 1760 lines of Python + 780 lines of Jinja-SQL

(A lot of that Jinja-SQL is still repeated unnecessarily, and we should always be on the lookout for ways to move more of that logic into the defaults, and enable more modular overrides when small things differ. As a tiny example: #4488)

Here's where the rubber hits the road, for this issue: We make changes to the "default" interface / SQL support within dbt-core, often by changing macros in the global project. That's what we did in both #4618 + #4858. We need to be able to create new test cases for that functionality:

  • While developing the feature, and as a precondition for merging the dbt-core PR
  • In a way that we can easily extend, to run the same test case in other adapter plugins. If the test passes, we're good! If it fails, the test case may need adjusting, the default implementation could be more generic, or that database's SQL support differs from our "default" SQL, and we'll need to extend the feature accordingly.

To avoid adding stumbling blocks into our development/contribution process, I'm inclined to think that we should be defining the test cases in the same codebase where we define the default implementations/interface that they are testing. Today, that "default adapter" lives in the dbt-core Python package. Could I see us eventually splitting off the entire "default" adapter (pieces from dbt.adapters + include/global_project) into a separate Python package, splitting it off into a separate repository, and moving dbt.tests.adapter there too? Maybe... though even then, they'd still want to be separate packages, right?

4 participants