
Implement to_gbq with intermediary parquet in gcs #44

Merged
merged 19 commits into main from j-bennet/3-to-bigquery on May 8, 2023

Conversation

@j-bennet (Contributor) commented Apr 26, 2023

Implement to_gbq using parquet in GCS as intermediary storage. It does what people have been doing already, in a more convenient wrapper.

Addresses #3.
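
For context, a minimal sketch of the intended usage (a hedged illustration; the exact parameter names, and whether to_gbq is importable from the package root, are assumptions):

import dask.dataframe as dd
import pandas as pd
from dask_bigquery import to_gbq

df = pd.DataFrame({"name": ["a", "b", "c"], "number": [1, 2, 3]})
ddf = dd.from_pandas(df, npartitions=2)

# Hypothetical call: the DataFrame is written as parquet to a GCS bucket,
# loaded from there into the BigQuery table, and the temporary parquet
# objects are cleaned up afterwards.
to_gbq(
    ddf,
    project_id="my-project",   # assumed parameter names
    dataset_id="my_dataset",
    table_id="my_table",
    bucket="my-gcs-bucket",
)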

@j-bennet j-bennet marked this pull request as draft April 26, 2023 22:49
@j-bennet j-bennet changed the title from "Initial version of writing to_gbq via parquet in gcs." to "Implement to_gbq to write a dask DataFrame to a big query table via intermediary parquet in gcs" on Apr 26, 2023
@ntabris (Member) commented Apr 26, 2023

Nits:

  • I don't think this closes "to_bigquery ideas - no intermediate storage" (#3), since that says "no intermediate storage" and this is a wrapper around using intermediate storage, right?
  • It feels like it would be good to document required permissions, since one might not guess that you'd need GCS permissions to use something called "dask-bigquery".

@dchudz commented Apr 26, 2023

Thanks Irina!

Am I right that as written, this leaves the parquet dataset around forever? It'd be nice to help people with cleanup / bucket lifecycle management. If it's possible to do something automated in code, that could be really nice. Otherwise we should document our recommendations.

(Also just to clarify Nat's nit, I don't think anyone's objecting to this approach, or objecting to closing #3. Just looking to avoid confusion for anyone who comes back here later and would be misled by the description saying this does the "no intermediate storage" version.)

@j-bennet (Contributor, Author) replied:

Nits:

I don't think this closes "to_bigquery ideas - no intermediate storage" (#3), since that says "no intermediate storage" and this is a wrapper around using intermediate storage, right?

I agree, "closes" is the wrong word. "Motivated by" is more like it. It addresses the problem described in the issue, but as of this moment, we don't have an easy way to use the Storage Write API and bypass the intermediate storage, since that API accepts data as protobuf.

It feels like it would be good to document required permissions, since one might not guess that you'd need GCS permissions to use something called "dask-bigquery".

👍 Yes, that will be useful in the docstring. This PR is still WIP. I also plan to add more parameters to the method, so users have more control over how the data is written to intermediary storage and to BQ.

@dchudz commented Apr 26, 2023

Cool, thanks for sharing a draft!

add more parameters to the method,

Probably on your mind already, but it'd be nice for users to be able to control authentication with Google Cloud a bit. When the need for this came up recently, it was in a case with a Dask cluster where setting up Application Default Credentials would be a little awkward.
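
For illustration, a sketch of how explicit credentials could be built and passed through instead of relying on Application Default Credentials (the credentials= parameter on to_gbq is an assumption, not a confirmed part of the API):

from google.oauth2 import service_account

# Build explicit credentials from a service account key file, rather than
# relying on Application Default Credentials being set up on the cluster.
creds = service_account.Credentials.from_service_account_file(
    "/path/to/key.json",  # placeholder path
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# Hypothetical: hand the credentials object to the writer.
# to_gbq(ddf, ..., credentials=creds)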

@j-bennet (Contributor, Author) replied:

Am I right that as written, this leaves the parquet dataset around forever? It'd be nice to help people with cleanup / bucket lifecycle management. If it's possible to do something automated in code, that could be really nice. Otherwise we should document our recommendations.

This piece does the cleanup:

# cleanup temporary parquet
for blob in bucket.list_blobs(prefix=object_prefix):
    blob.delete()

I'm also going to update the docstring to advise configuring a bucket with a retention policy.
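
For illustration, one possible way to configure automatic cleanup on the intermediary bucket with the google-cloud-storage client (a sketch; the bucket name is a placeholder and the exact recommendation may differ):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-gcs-bucket")  # placeholder bucket name

# Lifecycle rule: auto-delete objects older than one day, so stray
# intermediary parquet files don't accumulate if the in-code cleanup
# step is ever skipped (e.g. the job dies before reaching it).
bucket.add_lifecycle_delete_rule(age=1)
bucket.patch()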

(Also just to clarify Nat's nit, I don't think anyone's objecting to this approach, or objecting to closing #3. Just looking to avoid confusion for anyone who comes back here later and would be mislead by the description saying this does the "no intermediate storage" version.)

I wish it did! Until the Storage Write API supports Arrow in addition to protobuf, I think we're out of luck there.

@dchudz commented Apr 27, 2023

This piece does the cleanup:

Oh right, thanks! Sorry!

@j-bennet j-bennet marked this pull request as ready for review April 28, 2023 21:19
@j-bennet (Contributor, Author) commented May 1, 2023

@ncclementi @dchudz @ntabris I addressed the suggestions, please take another look.

j-bennet and others added 3 commits May 2, 2023 21:38
Co-authored-by: Florian Jetter <fjetter@users.noreply.github.com>
@j-bennet (Contributor, Author) commented May 2, 2023

@fjetter @bnaul Updated with your suggestions.

@j-bennet j-bennet requested a review from fjetter May 3, 2023 17:31
@phofl (Contributor) left a comment:

Small comments

@j-bennet j-bennet requested a review from phofl May 4, 2023 22:13
In dask_bigquery/core.py:
parquet_kwargs_used.update(parquet_kwargs)

# override the following kwargs, even if user specified them
parquet_kwargs_used["engine"] = "pyarrow"
Review comment (Contributor):
Raising a warning here might make sense as well?

@j-bennet (Contributor, Author) replied:
I pointed out in the docstring that these will be enforced. A warning feels like overkill.
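
For illustration, a minimal sketch of the merge-and-override behavior under discussion (variable names mirror the snippet above; this is not the actual library code):

# User-supplied kwargs for the intermediary parquet write.
parquet_kwargs = {"engine": "fastparquet", "write_index": False}

parquet_kwargs_used = {}
parquet_kwargs_used.update(parquet_kwargs)
# Enforced regardless of user input, as noted in the docstring.
parquet_kwargs_used["engine"] = "pyarrow"

assert parquet_kwargs_used == {"engine": "pyarrow", "write_index": False}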

In dask_bigquery/tests/test_core.py:

def test_read_gbq(df, dataset, client):
    project_id, dataset_id, table_id = dataset
    ddf = read_gbq(project_id=project_id, dataset_id=dataset_id, table_id=table_id)

    assert list(ddf.columns) == ["name", "number", "timestamp", "idx"]
    assert ddf.npartitions == 2
Review comment (Contributor):
Why are you removing these?

@j-bennet (Contributor, Author) replied May 5, 2023:
It's not 2 anymore, it's 1. I think this changed on the BigQuery side, in the number of streams it gives back by default. Without this change, the test breaks.

Review comment (Contributor):
@bnaul do you know what could be happening here? I would have expected that, since we defined a time_partitioning, we would be reading 2 different partitions. Did something change in the default streams that you are aware of?

Reply (Contributor):
Nope, like @j-bennet said, I guess they must have changed the chunking upstream. I don't think we actually care one way or the other; we didn't end up including any time-partitioning-specific logic (our original implementation does have something along those lines, where we read every partition separately).

@jrbourbeau jrbourbeau changed the title from "Implement to_gbq to write a dask DataFrame to a big query table via intermediary parquet in gcs" to "Implement to_gbq with intermediary parquet in gcs" on May 5, 2023
@jrbourbeau jrbourbeau mentioned this pull request May 5, 2023
@ncclementi (Contributor) left a comment:

Currently, we have a test that does read_gbq from a BigQuery table that was created with the BigQuery API. Can we add a roundtrip test to make sure that we can read the tables we write with to_gbq?
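
For illustration, a rough sketch of what such a roundtrip test could look like (fixture names mirror the existing tests; the bucket fixture and the to_gbq signature are assumptions):

import dask.dataframe as dd
from dask.dataframe.utils import assert_eq

from dask_bigquery import read_gbq, to_gbq


def test_roundtrip(df, dataset, bucket, client):
    # Write with to_gbq, then read back with read_gbq and compare
    # against the original pandas DataFrame.
    project_id, dataset_id, table_id = dataset
    ddf = dd.from_pandas(df, npartitions=2)

    to_gbq(
        ddf,
        project_id=project_id,
        dataset_id=dataset_id,
        table_id=table_id,
        bucket=bucket,
    )

    result = read_gbq(project_id=project_id, dataset_id=dataset_id, table_id=table_id)
    result = result.compute().sort_values("idx").reset_index(drop=True)
    expected = df.sort_values("idx").reset_index(drop=True)
    assert_eq(expected, result[expected.columns])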


@j-bennet (Contributor, Author) commented May 5, 2023

@ncclementi and @phofl, I addressed your comments. Please take a look.

@ncclementi (Contributor) left a comment:

@j-bennet This looks good to me. I left one comment regarding an older pin we had on grpcio. I understand we can leave this to a future PR since it's not necessarily part of the scope, but if you can give it a try and see if CI is green, that would be great.

@@ -9,6 +9,7 @@ dependencies:
- pandas
- pyarrow
- pytest
- gcsfs
- grpcio<1.45
Review comment (Contributor):
We had this pin due to a segfault on Windows, but it's been a year since this happened. Can we try to unpin this and see what happens? I understand it's not part of this PR; we can leave it for a separate PR.

@j-bennet (Contributor, Author) replied:
I'll try in a separate PR. Thanks for the tip!

@phofl (Contributor) commented May 8, 2023

lgtm

@j-bennet j-bennet merged commit 39e397b into main May 8, 2023
@j-bennet j-bennet deleted the j-bennet/3-to-bigquery branch May 8, 2023 20:39
@j-bennet (Contributor, Author) commented May 8, 2023

Thank you @phofl @ncclementi @bnaul.

@bnaul bnaul mentioned this pull request May 8, 2023