Can't access BigQuery tables whose original sources are spreadsheets on Google Drive with the oauth method #3040

Closed
yu-iskw opened this issue Jan 29, 2021 · 22 comments · Fixed by #3041
Labels
bigquery, bug, regression

Comments

@yu-iskw
Contributor

yu-iskw commented Jan 29, 2021

Describe the bug

It is impossible to access BigQuery tables whose original sources are spreadsheets on Google Drive when using the oauth method, because dbt doesn't pass any scopes to the method that creates the BigQuery credentials. Accessing such tables requires the https://www.googleapis.com/auth/drive scope. Moreover, there is no way to grant scopes (as opposed to permissions) to an existing Google Cloud authentication; scope information can only be passed when the credentials are created.
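
For illustration, a minimal sketch of the difference (this is not dbt's actual code, just the underlying google-auth call):

import google.auth

# Roughly what happens today with the oauth method: no scopes are requested,
# so the credentials carry only what the application-default login granted
# (Drive is not among them by default).
credentials, project = google.auth.default()

# What is needed to read Drive-backed tables: pass the scope when the
# credentials are created.
credentials, project = google.auth.default(
    scopes=[
        "https://www.googleapis.com/auth/bigquery",
        "https://www.googleapis.com/auth/drive",
    ]
)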

Steps To Reproduce

  1. Create a BigQuery table whose data source is a spreadsheet.
  2. Set a profile with the oauth method for BigQuery (a minimal sketch follows these steps).
  3. Create a dbt model which refers to the BigQuery table with the profile.
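
A minimal profiles.yml sketch for step 2 (profile, project, and dataset names are hypothetical):

my_bigquery_profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project   # hypothetical GCP project
      dataset: my_dataset       # hypothetical dataset
      threads: 1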

Expected behavior

Even when we use the oauth method of the BigQuery profile, we should be able to access BigQuery tables whose original data sources are spreadsheets on Google Drive, via the https://www.googleapis.com/auth/drive scope.

Screenshots and log output

$ dbt run
...
Database Error in model my_model (models/my_model.sql)
  Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.
  compiled SQL at target/run/my_model.sql

System information

Which database are you using dbt with?

  • [ ] postgres
  • [ ] redshift
  • [x] bigquery
  • [ ] snowflake
  • [ ] other (specify: ____________)

The output of dbt --version:

installed version: 0.19.0
   latest version: 0.19.0

Up to date!

Plugins:
  - bigquery: 0.19.0

The operating system you're using:
Mac OS 10.15.7 (19H114)

The output of python --version:
Python 3.8.5

@jtcohen6
Contributor

Thanks for the detailed report @yu-iskw, for finding the relevant diff here, and for the associated PR.

I'm trying to reproduce this error locally. So far I haven't been able to; here's what I tried:

  1. I created a BigQuery table whose data source is a spreadsheet.
  2. I'm using the oauth method with dbt-bigquery==0.19.0. I've tried it with and without specifying the project in profiles.yml.
  3. I've successfully created and tested a model that selects from the GSheet-powered BigQuery table.

Does the change in #3041 resolve this issue for you locally? Even if I can't manage to reproduce the error, that's a helpful indication.

In any case, we definitely need better automated integration testing for external data sources (including GSheets) in BigQuery.

@jtcohen6 jtcohen6 added bigquery and removed triage labels Jan 29, 2021
@johnflux

johnflux commented Jan 30, 2021

I have reproduced this bug myself. In my case, I personally (johnflux@mycompany.com) have access to the spreadsheet, but my company ("mycompany") as a group does not.

If I use bq from the command line, I get a permission-denied error. From the web UI, it works.

If I change the spreadsheet to grant access to "mycompany" then both work.

I'm not a sysadmin for my company, and I have no idea how to test the provided patch, sorry.

@jtcohen6
Contributor

jtcohen6 commented Feb 1, 2021

@johnflux To clarify, were you able to reproduce this error using dbt? Or using the bq CLI, as opposed to the BigQuery UI? I don't follow how adding access to mycompany would resolve this specific issue, given that your oauth identity should be tied to your personal user, but I may be missing a piece of how scopes + oauth interact.

I have no idea how to test the provided patch sorry.

If you're able to reproduce this error with dbt==0.19.0, then you can try installing dbt (in a virtual environment) from the branch with the proposed fix:

python3 -m venv env
source env/bin/activate
pip install git+https://github.com/yu-iskw/dbt.git@issue-3040

Then re-run the model that depends on the GSheet, and see if it works where it previously did not.

@dalebradman

We're using BigQuery with dbt version 0.18.1.

We're getting that same Database Error: Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.

then you can try installing dbt (in a virtual environment) from the branch with the proposed fix

@jtcohen6 we installed that patch branch and got the exact same error. We couldn't locate anything in the logs either.

@walshie4

We also see a failure that may be related to this change. It only occurs on version 0.19.0; the same setup works without problems on 0.18.0 and 0.18.1.

We get the same error as has been previously mentioned:

Database Error in model <...>
  Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.

@jtcohen6
Contributor

jtcohen6 commented Feb 11, 2021

Hey @walshie4! Could you try installing the fix from #3041 in a virtualenv, and see if that removes the error? Something like:

$ git clone https://github.com/yu-iskw/dbt.git
$ cd dbt
$ git checkout issue-3040
$ python3 -m venv env
$ source env/bin/activate
$ pip install -e ./core -e ./plugins/bigquery

I haven't yet been able to replicate this error myself, so hearing that the proposed fix resolves the issue is a good indication that it's the right change.

@heisencoder
Contributor

I've got a project that reproduces the failure message with dbt 0.17, and I confirmed that PR #3041 does not resolve the issue for me.

@heisencoder
Contributor

As another data point, I reproduced the same error by directly running the bq command against an external BigQuery table, but was then able to fix this error by running this command: gcloud auth login --enable-gdrive-access
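
For reference, the shape of the bq invocation I mean (table name hypothetical):

bq query --use_legacy_sql=false \
  'SELECT COUNT(*) FROM my_dataset.some_external_table_linked_to_a_google_sheet'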

However, even after running that gcloud command, dbt is still giving me the same error. The BigQuery documentation does suggest that this is an issue with not providing the appropriate scope, so I don't know why it's still failing even after patching in PR #3041.

https://cloud.google.com/bigquery/external-data-drive#python

Looking at that page for guidance, I'm able to reproduce the Drive error with just a small python snippet:

from google.cloud import bigquery
import google.auth

credentials, project = google.auth.default(
    scopes=[
        "https://www.googleapis.com/auth/drive",
        "https://www.googleapis.com/auth/bigquery",
    ]
)

client = bigquery.Client(credentials=credentials, project=project)
query_job = client.query('SELECT count(some_field) as c FROM my_dataset.some_external_table_linked_to_a_google_sheet')
print(list(query_job))

So this seems to me like an issue with the BigQuery client code itself. Still digging...

@heisencoder
Contributor

Okay, I think I found the fix. It's necessary to run this command to add the Drive scope to the application default credentials:

gcloud auth application-default login --scopes=https://www.googleapis.com/auth/drive,https://www.googleapis.com/auth/bigquery

(Note that you also have to add the bigquery scope and anything else needed by dbt.)

@heisencoder
Contributor

And as expected, this is already documented within the dbt site here:

https://docs.getdbt.com/reference/warehouse-profiles/bigquery-profile#local-oauth-gcloud-setup

That documentation recommends running this command:

gcloud auth application-default login \
  --scopes=https://www.googleapis.com/auth/userinfo.email,\
https://www.googleapis.com/auth/cloud-platform,\
https://www.googleapis.com/auth/drive.readonly

However, when I run that command I get this error:

ERROR: gcloud crashed (Warning): Scope has changed from "https://www.googleapis.com/auth/drive.readonly https://www.googleapis.com/auth/userinfo.email https://www.googleapis.com/auth/cloud-platform https://www.googleapis.com/auth/accounts.reauth" to "https://www.googleapis.com/auth/accounts.reauth https://www.googleapis.com/auth/cloud-platform openid https://www.googleapis.com/auth/drive.readonly https://www.googleapis.com/auth/userinfo.email".

If I add ,openid to the end of the gcloud command above, then it works.
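
For clarity, the full command that works for me is:

gcloud auth application-default login \
  --scopes=https://www.googleapis.com/auth/userinfo.email,\
https://www.googleapis.com/auth/cloud-platform,\
https://www.googleapis.com/auth/drive.readonly,\
openid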

@jtcohen6
Contributor

jtcohen6 commented Mar 3, 2021

@heisencoder Thanks for digging! I wonder if drive.readonly is an insufficient scope, given our findings over in #2953?

@yu-iskw @dalebradman @walshie4 Could you try again after running the gcloud auth command linked above? I'm still not sure why this error only crops up in v0.19.0 when everything works fine in v0.18.x.

@heisencoder
Contributor

@jtcohen6 This is a good point! In my recent test cases I've just been using the Python BigQuery client directly, but as you mention, we may need to expand this to read-write access based on the findings of #2953.

All that said, the dbt documentation has been specifying the drive.readonly scope for the gcloud command, so I'm curious whether the readonly scope actually works now.

@jtcohen6
Contributor

To summarize from the conversation over in #2953:

  • If you're connecting dbt to BigQuery via service-account, service-account-json, or oauth-secrets, dbt will use the scopes defined in its connection code (see the sketch after this list).
  • If you're using the oauth connection method (i.e. the gcloud SDK), dbt will not use those scopes, and will instead rely on the scopes defined in gcloud auth application-default.
  • It's likely that dbt requests more expansive scopes today than it strictly needs. That's the primary question raised by #2953, and we'll continue that conversation over there.
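
For reference, a sketch of those default scopes (the constant name is an assumption, not necessarily dbt's actual code; the three values are the scopes dbt passes):

# Assumed shape of the defaults in the dbt-bigquery plugin.
SCOPE = (
    "https://www.googleapis.com/auth/bigquery",
    "https://www.googleapis.com/auth/cloud-platform",
    "https://www.googleapis.com/auth/drive",
)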

Given the confusion this issue stirred up in the meantime, there's definitely appetite to clarify this behavior, by better documenting it or making it more intuitive. Right now, the Drive scope is "optional" for users who authenticate and develop locally via gcloud/oauth. The original intent was to make it so folks didn't have to give dbt access to all their personal Drive files if they aren't actually using Drive as a data source. (From the docs: "If your dbt project does not transform data in Google Sheets, then you may omit the --scopes flag.") Since gcloud/oauth is likely tied to a personal email address, requesting the Drive scope means giving dbt access to a lot more than just GSheet data sources! That feels different from authentication via a dedicated service account, where it's feasible to strictly control access to only the Drive files needed for a specific purpose, so dbt can bake in the Drive scope request every time.

Is making that distinction still desirable? I think there are two valid answers:

  1. No: It's not worth the confusion. The get_bigquery_defaults() method should try to grab Drive scopes regardless (= what #3041, "Pass the default scopes to the default BigQuery credentials", attempts).
  2. Yes: dbt shouldn't over-eagerly grab Drive scopes without active user input. That said, we should better handle the exception Permission denied while getting Drive credentials. If a user is connecting via the oauth method, the error message should be wrapped in a bit of context and link out to the relevant docs (a rough sketch follows below).

Let's use this issue to discuss which of these two options makes sense, and plan for a code change that will help us avoid this confusion in the future.
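
To make option 2 concrete, here's a rough sketch of the kind of wrapping I mean. This is not dbt's actual error-handling code; the function name and message are hypothetical:

from google.api_core.exceptions import Forbidden

DOCS_URL = (
    "https://docs.getdbt.com/reference/warehouse-profiles/"
    "bigquery-profile#local-oauth-gcloud-setup"
)

def run_query_with_context(client, sql):
    # Run a query; if BigQuery rejects it for missing Drive scopes,
    # re-raise with a pointer to the relevant dbt docs.
    try:
        return list(client.query(sql))
    except Forbidden as exc:
        if "Permission denied while getting Drive credentials" in str(exc):
            raise RuntimeError(
                "Access to a Drive-backed table was denied. If you connect "
                "via the oauth method, re-run gcloud auth "
                "application-default login with the Drive scope. "
                "See: " + DOCS_URL
            ) from exc
        raise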

@heisencoder
Contributor

I'm not optimistic that #3041 will actually change anything, since the oauth method just uses the scopes that were passed to the gcloud auth application-default command. I support adding better error messages, as outlined in option 2 above.

@walshie4

Finally got around to testing #3041, and it appears not to solve the problem in my local reproduction setup. I added a print here to show which scopes are being passed. It still errors out:

(env) adwalsh @ ~/honey/core-analytics {git:(v19*)} dbt run -m test_model                                                    [03/16/21 12:13:38]
Running with dbt=0.19.0
Found 577 models, 445 tests, 2 snapshots, 1 analysis, 540 macros, 0 operations, 10 seed files, 195 sources, 0 exposures

AUTHING WITH SCOPES: ('https://www.googleapis.com/auth/bigquery', 'https://www.googleapis.com/auth/cloud-platform', 'https://www.googleapis.com/auth/drive')
12:14:23 | Concurrency: 100 threads (target='test')
12:14:23 |
12:14:23 | 1 of 1 START table model adw.test_model.............................. [RUN]
12:14:24 | 1 of 1 ERROR creating table model adw.test_model..................... [ERROR in 1.26s]
12:14:24 |
12:14:24 | Finished running 1 table model in 5.29s.

Completed with 1 error and 0 warnings:

Database Error in model test_model (models/test_model.sql)
  Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.
  compiled SQL at target/run/honey_dbt/models/test_model.sql

Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
dbt run -m test_model  45.88s user 1.19s system 93% cpu 50.153 total

After running

gcloud auth application-default login \
  --scopes=https://www.googleapis.com/auth/bigquery,\
https://www.googleapis.com/auth/drive.readonly

This fixes the problem on both version 0.19 and the #3041 branch (which makes me question whether #3041 is actually changing anything).

I'm still looking into why this isn't working when running as a service account that has been given access to the document and BQ. I'll report back if I find anything of value.

@walshie4

Okay, I'm back, and I have good news. After digging through here and trying out some of the changes, I think I understand how this happened and how we can resolve the issue.

To start, the issue was first introduced in PR #2908, where this line, which passed scopes when using OAuth, was replaced with this line, which relies on the defaults configured with gcloud.

This is why, even after this change, you will not see the failure if you authenticate with the scopes added (using gcloud auth application-default login --scopes=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive.readonly), since your default credentials then have the required scopes.

However, if you have a setup like we do over at Honey 😉, you will see the problem: we use a service account in an environment with no default credentials available and authenticate by activating the service account with gcloud auth activate-service-account, which does not allow passing scopes but still sends the dbt auth code down that (OAuth) path, as sketched below. This is a fairly standard auth flow for GCP (and a common approach for production authentication with service accounts).
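
Concretely, our flow looks roughly like this (service-account email and key file are hypothetical); note that activate-service-account has no flag for passing scopes:

gcloud auth activate-service-account dbt-runner@my-project.iam.gserviceaccount.com \
  --key-file=/path/to/key.json
dbt run   # dbt's oauth path picks up these credentials, with no Drive scope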

Looking back now, this path actually isn't mentioned on the BQ profile setup page (I can't remember whether it was a ~year ago when we set this up), which may be why it got hit by the bug.

Tested our setup with #3041 and it does resolve the issue (manually adding the four changed lines turns a previously failing run into a passing one in the same environment, back-to-back).

I think we should merge that change, and consider updating the docs to cover the activate-service-account authentication path.

Let me know if you have any questions. I also support making these optional in some form for those who don't need them.

@jtcohen6
Contributor

@walshie4 Heroic! Thanks for the detailed investigative work.

we use a service account in an environment with no default credentials available and authenticate by activating the service account with gcloud auth activate-service-account which does not allow for passing scopes, but will cause the dbt auth code to still go down that (OAuth) path

It all becomes clear. If you use a service account to authenticate via the gcloud-style oauth method, then the scopes passed by dbt here go unused, no scopes were specified by gcloud auth, and so, voilà, scopes are missing.

Given that, I'm happy merging #3041 and sneaking this into v0.19.1. We'll likely want to cut a second release candidate, and ask you folks to test it out.

I'd also welcome a PR here that updates "Local OAuth gcloud setup" to include mention of Service Account OAuth gcloud setup as well.

@jtcohen6 jtcohen6 added this to the 0.19.1 (performance) milestone Mar 17, 2021
@kconvey
Contributor

kconvey commented Mar 19, 2021

Just in case this helps anyone trying to use the OAuth path within dbt from a GCP context: even without this regression (i.e. the credentials you create request the https://www.googleapis.com/auth/drive scope, as in 0.18 and 0.19.1), if the scopes you enable on your cluster (e.g. GKE) are a subset of https://www.googleapis.com/auth/drive, such as https://www.googleapis.com/auth/spreadsheets.readonly, the request made by dbt may fail with the error described in this bug (Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.), even if you only practically use that subset.

I didn't dig much further than expanding the scope on GKE from https://www.googleapis.com/auth/spreadsheets.readonly to https://www.googleapis.com/auth/drive and getting this to succeed on 0.18, so other configurations may work, but I'm doubtful that anything less than enabling the specific scope you're requesting will allow a request to succeed.
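
For anyone else in this spot, the scope expansion I mean happens at cluster (or node pool) creation time, something like this (cluster name hypothetical; node scopes can't be changed on existing nodes):

gcloud container clusters create my-cluster \
  --scopes=https://www.googleapis.com/auth/drive,https://www.googleapis.com/auth/cloud-platform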

Yes: dbt shouldn't over-eagerly grab Drive scopes without active user input. That said, we should better handle the exception Permission denied while getting Drive credentials. If a user is connecting via the oauth method, the error message should be wrapped in a bit of context and link out to the relevant docs.

Given that, I'm happy merging #3041 and sneaking this into v0.19.1. We'll likely want to cut a second release candidate, and ask you folks to test it out.

I think dbt should consider making scopes a configurable profile / project config option for BigQuery (rough sketch below). If you went down the local gcloud path and set your default scopes to include https://www.googleapis.com/auth/drive, etc., you wouldn't need to set this for local runs, so perhaps it could default to just bigquery. If you're running somewhere remote as a service account, you can't set default scopes, so you may need to set this option; making it configurable would also mean you could specify something narrower like https://www.googleapis.com/auth/spreadsheets.readonly, if that's all you need. And if you were running in that context and your request failed with insufficient scopes, a pointer to the dbt docs describing the config option would be extremely helpful.
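
To sketch what I have in mind (this scopes key does not exist in dbt today; it's purely a proposal, and all names are hypothetical):

my_profile:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: my-gcp-project          # hypothetical
      dataset: analytics               # hypothetical
      scopes:                          # proposed, not an existing option
        - https://www.googleapis.com/auth/bigquery
        - https://www.googleapis.com/auth/spreadsheets.readonly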

@jtcohen6
Contributor

I think dbt should consider making scopes a configurable profile / project config option for BigQuery.

This seems reasonable to me!

@kconvey Could you drop that proposal over in https://github.com/fishtown-analytics/dbt/issues/2953? It sounds like, in addition to tightening the scopes requested by default, we should give users the option to request even tighter ones, e.g. if dbt doesn't need read access to drive/spreadsheets at all.

@jtcohen6
Contributor

jtcohen6 commented Mar 25, 2021

@walshie4 We just merged the fix in #3041 and included it in v0.19.1rc2. When you get a chance, could you try installing the release candidate and authenticating via the magical service account / gcloud oauth combo?

@walshie4

Just tried it out. Can confirm that the v0.19.1rc2 release resolves the issue for our setup. 🙌

@jtcohen6 jtcohen6 mentioned this issue Mar 26, 2021
@dalebradman

Can also confirm that this works once you re-authenticate with the required scopes! 🥳
