
Implement to_gbq with intermediary parquet in gcs #44

Merged
merged 19 commits into main from j-bennet/3-to-bigquery on May 8, 2023

Conversation

@j-bennet (Contributor) commented Apr 26, 2023

Implement to_gbq using parquet in GCS as intermediary storage. It does what people have been doing already, in a more convenient wrapper.

Addresses #3.
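
For context, a minimal sketch of the intended usage (a hedged illustration; the exact parameter names, and whether to_gbq is importable from the package root, are assumptions):

import dask.dataframe as dd
import pandas as pd
from dask_bigquery import to_gbq

df = pd.DataFrame({"name": ["a", "b", "c"], "number": [1, 2, 3]})
ddf = dd.from_pandas(df, npartitions=2)

# Hypothetical call: the DataFrame is written as parquet to a GCS bucket,
# loaded from there into the BigQuery table, and the temporary parquet
# objects are cleaned up afterwards.
to_gbq(
    ddf,
    project_id="my-project",   # assumed parameter names
    dataset_id="my_dataset",
    table_id="my_table",
    bucket="my-gcs-bucket",
)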

@j-bennet j-bennet marked this pull request as draft April 26, 2023 22:49
@j-bennet j-bennet changed the title from "Initial version of writing to_gbq via parquet in gcs." to "Implement to_gbq to write a dask DataFrame to a big query table via intermediary parquet in gcs" on Apr 26, 2023
@ntabris (Member) commented Apr 26, 2023

Nits:

  • I don't think this closes "to_bigquery ideas - no intermediate storage" (#3), since that says "no intermediate storage" and this is a wrapper around using intermediate storage, right?
  • It feels like it would be good to document required permissions, since one might not guess that you'd need GCS permissions to use something called "dask-bigquery".

@dchudz commented Apr 26, 2023

Thanks Irina!

Am I right that as written, this leaves the parquet dataset around forever? It'd be nice to help people with cleanup / bucket lifecycle management. If it's possible to do something automated in code, that could be really nice. Otherwise we should document our recommendations.

(Also just to clarify Nat's nit, I don't think anyone's objecting to this approach, or objecting to closing #3. Just looking to avoid confusion for anyone who comes back here later and would be misled by the description saying this does the "no intermediate storage" version.)

@j-bennet (Contributor, Author) replied:

Nits:

I don't think this closes "to_bigquery ideas - no intermediate storage" (#3), since that says "no intermediate storage" and this is a wrapper around using intermediate storage, right?

I agree, "closes" is the wrong word. "Motivated by" is more like it. It addresses the problem described in the issue, but as of this moment, we don't have an easy way to use the Storage Write API and bypass the intermediate storage, since that API accepts data as protobuf.

It feels like it would be good to document required permissions, since one might not guess that you'd need GCS permissions to use something called "dask-bigquery".

👍 Yes, that will be useful in the docstring. This PR is still WIP. I also plan to add more parameters to the method, so users have more control over how the data is written to intermediary storage and to BQ.

@dchudz commented Apr 26, 2023

Cool, thanks for sharing a draft!

add more parameters to the method,

Probably on your mind already, but it'd be nice for users to be able to control authentication with Google Cloud a bit. When the need for this came up recently, it was in a case with a Dask cluster where setting up Application Default Credentials would be a little awkward.
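
For illustration, a sketch of how explicit credentials could be built and passed through instead of relying on Application Default Credentials (the credentials= parameter on to_gbq is an assumption, not a confirmed part of the API):

from google.oauth2 import service_account

# Build explicit credentials from a service account key file, rather than
# relying on Application Default Credentials being set up on the cluster.
creds = service_account.Credentials.from_service_account_file(
    "/path/to/key.json",  # placeholder path
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# Hypothetical: hand the credentials object to the writer.
# to_gbq(ddf, ..., credentials=creds)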

@j-bennet (Contributor, Author) replied:

Am I right that as written, this leaves the parquet dataset around forever? It'd be nice to help people with cleanup / bucket lifecycle management. If it's possible to do something automated in code, that could be really nice. Otherwise we should document our recommendations.

This piece does the cleanup:

# cleanup temporary parquet
for blob in bucket.list_blobs(prefix=object_prefix):
    blob.delete()

I'm also going to update the docstring to advise configuring a bucket with a retention policy.
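
For illustration, one possible way to configure automatic cleanup on the intermediary bucket with the google-cloud-storage client (a sketch; the bucket name is a placeholder and the exact recommendation may differ):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-gcs-bucket")  # placeholder bucket name

# Lifecycle rule: auto-delete objects older than one day, so stray
# intermediary parquet files don't accumulate if the in-code cleanup
# step is ever skipped (e.g. the job dies before reaching it).
bucket.add_lifecycle_delete_rule(age=1)
bucket.patch()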

(Also just to clarify Nat's nit, I don't think anyone's objecting to this approach, or objecting to closing #3. Just looking to avoid confusion for anyone who comes back here later and would be mislead by the description saying this does the "no intermediate storage" version.)

I wish it did! Until the Storage Write API supports Arrow in addition to protobuf, I think we're out of luck there.

@dchudz commented Apr 27, 2023

This piece does the cleanup:

Oh right, thanks! Sorry!

@j-bennet j-bennet marked this pull request as ready for review April 28, 2023 21:19
@j-bennet (Contributor, Author) commented May 1, 2023

@ncclementi @dchudz @ntabris I addressed the suggestions, please take another look.

j-bennet and others added 3 commits May 2, 2023 21:38
Co-authored-by: Florian Jetter <fjetter@users.noreply.github.com>
@j-bennet (Contributor, Author) commented May 2, 2023

@fjetter @bnaul Updated with your suggestions.

@j-bennet j-bennet requested a review from fjetter May 3, 2023 17:31
@phofl (Contributor) left a comment:

Small comments

@j-bennet j-bennet requested a review from phofl May 4, 2023 22:13
In dask_bigquery/core.py:
parquet_kwargs_used.update(parquet_kwargs)

# override the following kwargs, even if user specified them
parquet_kwargs_used["engine"] = "pyarrow"
Review comment (Contributor):
Raising a warning here might make sense as well?

@j-bennet (Contributor, Author) replied:
I pointed out in the docstring that these will be enforced. A warning feels like overkill.
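
For illustration, a minimal sketch of the merge-and-override behavior under discussion (variable names mirror the snippet above; this is not the actual library code):

# User-supplied kwargs for the intermediary parquet write.
parquet_kwargs = {"engine": "fastparquet", "write_index": False}

parquet_kwargs_used = {}
parquet_kwargs_used.update(parquet_kwargs)
# Enforced regardless of user input, as noted in the docstring.
parquet_kwargs_used["engine"] = "pyarrow"

assert parquet_kwargs_used == {"engine": "pyarrow", "write_index": False}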

In dask_bigquery/tests/test_core.py:

def test_read_gbq(df, dataset, client):
    project_id, dataset_id, table_id = dataset
    ddf = read_gbq(project_id=project_id, dataset_id=dataset_id, table_id=table_id)

    assert list(ddf.columns) == ["name", "number", "timestamp", "idx"]
    assert ddf.npartitions == 2
Review comment (Contributor):
Why are you removing these?

@j-bennet (Contributor, Author) replied May 5, 2023:
It's not 2 anymore, it's 1. I think this changed on the BigQuery side, in the number of streams it gives back by default. Without this change, the test breaks.

Review comment (Contributor):
@bnaul do you know what could be happening here? I would have expected that, since we defined a time_partitioning, we would be reading 2 different partitions. Did something change in the default streams that you are aware of?

Reply (Contributor):
Nope, like @j-bennet said, I guess they must have changed the chunking upstream. I don't think we actually care one way or the other; we didn't end up including any time-partitioning-specific logic (our original implementation does have something along those lines, where we read every partition separately).

@jrbourbeau jrbourbeau changed the title from "Implement to_gbq to write a dask DataFrame to a big query table via intermediary parquet in gcs" to "Implement to_gbq with intermediary parquet in gcs" on May 5, 2023
@jrbourbeau jrbourbeau mentioned this pull request May 5, 2023
@ncclementi (Contributor) left a comment:

Currently, we have a test that does read_gbq from a BigQuery table that was created with the BigQuery API. Can we add a roundtrip test to make sure that we can read the tables we write with to_gbq?
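
For illustration, a rough sketch of what such a roundtrip test could look like (fixture names mirror the existing tests; the bucket fixture and the to_gbq signature are assumptions):

import dask.dataframe as dd
from dask.dataframe.utils import assert_eq

from dask_bigquery import read_gbq, to_gbq


def test_roundtrip(df, dataset, bucket, client):
    # Write with to_gbq, then read back with read_gbq and compare
    # against the original pandas DataFrame.
    project_id, dataset_id, table_id = dataset
    ddf = dd.from_pandas(df, npartitions=2)

    to_gbq(
        ddf,
        project_id=project_id,
        dataset_id=dataset_id,
        table_id=table_id,
        bucket=bucket,
    )

    result = read_gbq(project_id=project_id, dataset_id=dataset_id, table_id=table_id)
    result = result.compute().sort_values("idx").reset_index(drop=True)
    expected = df.sort_values("idx").reset_index(drop=True)
    assert_eq(expected, result[expected.columns])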


@j-bennet (Contributor, Author) commented May 5, 2023

@ncclementi and @phofl, I addressed your comments. Please take a look.

@ncclementi (Contributor) left a comment:

@j-bennet This looks good to me. I left one comment regarding an older pin we had on grpcio. I understand we can leave this to a future PR since it's not necessarily part of the scope, but if you can give it a try and see if CI is green, that would be great.

@@ -9,6 +9,7 @@ dependencies:
- pandas
- pyarrow
- pytest
- gcsfs
- grpcio<1.45
Review comment (Contributor):
We had this pin due to a segfault on Windows, but it's been a year since this happened. Can we try to unpin this and see what happens? I understand it's not part of this PR; we can leave it for a separate PR.

@j-bennet (Contributor, Author) replied:
I'll try in a separate PR. Thanks for the tip!

@phofl (Contributor) commented May 8, 2023

lgtm

@j-bennet j-bennet merged commit 39e397b into main May 8, 2023
@j-bennet j-bennet deleted the j-bennet/3-to-bigquery branch May 8, 2023 20:39
@j-bennet (Contributor, Author) commented May 8, 2023

Thank you @phofl @ncclementi @bnaul.

@bnaul bnaul mentioned this pull request May 8, 2023