Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cross db queries in redshift #3236

Closed
1 of 5 tasks
Tracked by #742
sonac opened this issue Apr 8, 2021 · 15 comments · Fixed by #3408
Closed
1 of 5 tasks
Tracked by #742

cross db queries in redshift #3236

sonac opened this issue Apr 8, 2021 · 15 comments · Fixed by #3408
Labels
bug Something isn't working redshift
Milestone

Comments

@sonac
Copy link
Contributor

sonac commented Apr 8, 2021

Describe the bug

When using source database that is different from the target (https://docs.getdbt.com/reference/resource-properties/database), I ran into an issue that adapter doesn't fetch columns, since it seems like it looks only into target db for meta information about them: https://github.com/fishtown-analytics/dbt/blob/develop/plugins/redshift/dbt/include/redshift/macros/adapters.sql#L79

Steps To Reproduce

  1. Have target db different from source db
  2. Run model based on source table that doesn't exist on target using adapter macroses

Expected behavior

Adapter expected to get meta information about source tables from source database

Screenshots and log output

Not applicable

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

0.17.2

The operating system you're using:
OSX

The output of python --version:
3.9.4

Additional context

Not applicable

@sonac sonac added bug Something isn't working triage labels Apr 8, 2021
@jtcohen6
Copy link
Contributor

jtcohen6 commented Apr 8, 2021

Great point @sonac, thanks for opening the issue.

There are other places in the dbt-redshift adapter codebase where it takes a strong stance against the possibility of cross-database querying (#3179). Given that, I'm actually surprised you didn't run into a compilation error earlier.

With the advent of cross-db querying on RA3 nodes, it's clear that we'll need to change more functionality around the runtime cache + catalog queries. Thankfully, we already have a good sense of what this can look like in dbt-snowflake, and we can adopt it piecemeal—to my knowledge, it's still not yet to create tables/views in a different Redshift database.

We started discussion in the other issue about how dbt should determine its behavior. Ideally, there would be an obvious metadata query to run—or better yet, a piece of information to pull off the cursor object—but after checking with the Redshift team, the query to determine node type is a bit more involved. Rather than go that route, I find myself in favor of a new ra3: true | false profile config, false by default, that when true will enable features like cross-database querying. For all other node types, we'd leave the existing restrictions in place.

What do you think? Is the addition of that functionality something you'd be interested in helping to contribute?

@jtcohen6 jtcohen6 added redshift and removed triage labels Apr 8, 2021
@sonac
Copy link
Contributor Author

sonac commented Apr 8, 2021

indeed, I'd be keen helping, since this functionality would ease our testing setup.

@misteliy
Copy link

since we would also like to leverage cross-db queries it would be great to help. Also, since we wanted to re-arrange some of our dbt setup with this ra3 capability it would be great to have that rather sooner than later 😄

@jtcohen6
Copy link
Contributor

@sonac @misteliy Okay, I think there are three major steps involved here:

  1. Add a new credential property, ra3: bool = False.

  2. Reimplement SchemaSearchMap + _get_catalog_schemas for the Redshift adapter. If ra3 == False, the default behavior is right: raise an error if multiple databases are defined. If ra3 == True, then we want to disable those errors and enable querying across databases.

  3. Update the adapter queries to account for objects in other databases: redshift__get_columns_in_relation, redshift__list_schemas, redshift__get_catalog, etc. Do you know how objects from shared databases appear in information_schema/pg_catalog tables? I haven't been able to get a good sense from Redshift's documentation.

@sonac
Copy link
Contributor Author

sonac commented Apr 17, 2021

I've looked into it and it seems like it's actually not that trivial to extract metadata from another database within the redshift, system views are inaccessible in cross-db queries (maybe because of their reliance on inbuilt functions like current_database())
A good solution for column retrieval would've been this view: https://docs.aws.amazon.com/redshift/latest/dg/r_SVV_REDSHIFT_COLUMNS.html
though query from it were failing on our production, claiming that table x is missing, running vacuum fixed it. The only thing that disturbs me a bit in it - is that it might be inconsistent and it uses pg_get_shared_redshift_columns() psql function, for which I didn't found source code (am I missing smth here?)
Also this view is also works on dcX nodes, not only ra3, therefore we might not need the additional property

@misteliy
Copy link

Hi, fantastic to get started. I could also offer to leverage our current redshift expert from AWS which is currently engaged in our project. If you want I could set up a quick call next week to address all necessary questions. Let me know if you think that would be useful.
Furthermore, I‘ll try to (cross-db) query the r_SVV_REDSHIFT_COLUMNS view as suggested by @sonac tomorrow on our cluster. Thus, the way I understand it, this view would get us already quite far, but please correct me if I‘m wrong...

@jtcohen6
Copy link
Contributor

Also this view is also works on dcX nodes, not only ra3, therefore we might not need the additional property

I think the ideal version of this implementation preserves all existing behavior on dc2/ds2 nodes, including the explicit compilation error if a user tries to do something disallowed (querying tables in a different database). Hence the desire for an equally explicit ra3: true config to allow what has previously been disallowed. That's just my thinking now, though. If there's a slick way we can determine what is or isn't possible on a user's cluster, based on the contents of svv_redshift_columns, that would be compelling.

@sonac
Copy link
Contributor Author

sonac commented Apr 18, 2021

Furthermore, I‘ll try to (cross-db) query the r_SVV_REDSHIFT_COLUMNS view as suggested by @sonac tomorrow on our cluster. Thus, the way I understand it, this view would get us already quite far, but please correct me if I‘m wrong...

Cool, it would've been great if somebody else would test it out, since if I'm not using filters on that view - I'm running into weird errors that some relation doesn't exist (and it's hard to debug why it occurs since it using some inbuilt redshift functions)
Though when filtering on the desired table\view - works fine.

@misteliy
Copy link

misteliy commented Apr 26, 2021

Sorry for the late reply. We were able to query the SVV views but also run into some problems currently investigated further. We also can confirm that in dbt specifying source instead of ref works perfectly fine on cross-db runs. But since we want to be able to use cross-db references, first thing to re-write is:
https://github.com/fishtown-analytics/dbt/blob/749f87397ec1e0a270b2e09bd8dbeb71862fdb81/plugins/postgres/dbt/adapters/postgres/impl.py

        if database.startswith('"'):
            database = database.strip('"')
        expected = self.config.credentials.database
        if database.lower() != expected.lower():
            raise dbt.exceptions.NotImplementedException(
                'Cross-db references not allowed in {} ({} vs {})'
                .format(self.type(), database, expected)
            )
        # return an empty string on success so macros can call this
        return ```

@misteliy
Copy link

misteliy commented Apr 26, 2021

However, we might have identified bigger issues with redshift cross db:
this works

select * from "sources"."public"."cross_db_test";

but this not

drop table if exists "sources"."public"."cross_db_test" cascade;

SQL Error [0A000]: ERROR: cross-database reference to database "sources" is not supported

😟

@misteliy
Copy link

@sonac @jtcohen6 we have come up with a fix to use dbt cross db in redshift. However, it’s only a fix to have the source in a different DB but still I think this could be beneficial to a bigger audience. Also, the solution is not narrowing and should be “backwards” compatible.

@chriskl
Copy link

chriskl commented May 21, 2021

This would be great for our project :)

@jtcohen6
Copy link
Contributor

@misteliy That's very exciting! I think sources in "external" databases would still represent a huge value-add for many projects.

Did you get a sense of which macros and methods we need to reimplement to make it possible?

@kostek-pl
Copy link
Contributor

kostek-pl commented May 31, 2021

Hi Guys

To summarize what we discovered:

  • to read metadata across different databases in the same cluster we should use redshift SVV* tables however there is still an issue on AWS side - they should but they don't work.
  • it's not possible to perform DDL statements in other database - as AWS support added more detailed description about cross db access that in fact this access is read only.
  • as you mentioned before we can use cross db objects as a source in models and it works perfect but there is still problem in dbt whilst generating documentation - the exception "Cross-db references not allowed in adapter ..." is thrown. I've adjusted dbt adapter code to take into account the additional configuration key "ra3_node: true" in profiles.yml file and let the dbt generate documentation for cross db sources. We got it working. I'll try to create pull request with those changes for review.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Jun 9, 2021

This was resolved by #3408 — as in, the bit we can do for now: querying sources that live in other databases when using RA3 nodes. When Redshift adds write DDL/DML support, we can consider adding support for other resources types in other databases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working redshift
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants