Initial read_gbq support #4

Merged · 46 commits · Sep 23, 2021
The diff below shows the changes from 1 commit.

Commits
94c41f6  add precommit config (ncclementi, Aug 5, 2021)
48becdb  add read_gbq (ncclementi, Aug 5, 2021)
a934259  add setup and req (ncclementi, Aug 5, 2021)
04bdd80  modifications suggested by bnaul (ncclementi, Aug 6, 2021)
ab16a32  raise error when table type is VIEW (ncclementi, Aug 6, 2021)
455f749  add linting github actions (ncclementi, Aug 6, 2021)
c417d5f  add comment on context manager related to possible upstram solution (ncclementi, Aug 6, 2021)
4839bbb  avoid scanning table when creating partitions (ncclementi, Aug 11, 2021)
774e79b  add first read_gbq test (ncclementi, Aug 17, 2021)
7bdd66a  add partitioning test (ncclementi, Aug 17, 2021)
31a1253  use pytest fixtures (ncclementi, Aug 18, 2021)
db4edb4  use context manager on test (ncclementi, Aug 18, 2021)
be1efbd  ignore bare except for now (ncclementi, Aug 18, 2021)
35cbdc6  remove prefix from delayed kwargs (ncclementi, Aug 18, 2021)
40de1ea  make dataset name random, remove annotate (ncclementi, Aug 18, 2021)
45e0004  better name for delayed _read_rows_arrow (ncclementi, Aug 18, 2021)
de93e88  implementation of HLG - wip (ncclementi, Aug 19, 2021)
3070ae3  Slight refactor (jrbourbeau, Aug 20, 2021)
b43daf6  Minor test tweaks (jrbourbeau, Aug 20, 2021)
50f3c6a  Update requirements.txt (ncclementi, Sep 16, 2021)
f8a578c  use context manager for bq client (ncclementi, Sep 17, 2021)
a91c73c  remove with_storage_api since it is always true (ncclementi, Sep 17, 2021)
548f2fb  remove partition fields option (ncclementi, Sep 17, 2021)
d3ffa79  add test github actions setup (ncclementi, Sep 17, 2021)
44096a1  add ci environments (ncclementi, Sep 17, 2021)
b19dca4  trigger ci (ncclementi, Sep 17, 2021)
982a5f5  trigger ci again (ncclementi, Sep 17, 2021)
4292ac3  add pytest to envs (ncclementi, Sep 17, 2021)
14ba56c  Only run CI on push events (jrbourbeau, Sep 20, 2021)
32b6686  Minor cleanup (jrbourbeau, Sep 20, 2021)
97b5d21  Use mamba (jrbourbeau, Sep 20, 2021)
e03e731  update docstrings (ncclementi, Sep 21, 2021)
d73b686  missing docstring (ncclementi, Sep 21, 2021)
3f8e397  trigger ci - testing workflow (ncclementi, Sep 21, 2021)
64fe0ec  use env variable for project id (ncclementi, Sep 21, 2021)
6f94825  add test for read with row_filter (ncclementi, Sep 21, 2021)
1a51981  add test for read with kwargs (ncclementi, Sep 21, 2021)
acb404e  Update dask_bigquery/tests/test_core.py (ncclementi, Sep 21, 2021)
d78c2a9  Update dask_bigquery/tests/test_core.py (ncclementi, Sep 21, 2021)
2b46c4f  Update dask_bigquery/tests/test_core.py (ncclementi, Sep 21, 2021)
5ac1358  Update dask_bigquery/tests/test_core.py (ncclementi, Sep 21, 2021)
216a4e7  Update dask_bigquery/tests/test_core.py (ncclementi, Sep 21, 2021)
46e4923  Update dask_bigquery/tests/test_core.py (ncclementi, Sep 21, 2021)
3204bc2  tweak on docstrings (ncclementi, Sep 22, 2021)
f17cfb8  add readme content (ncclementi, Sep 22, 2021)
d1398c2  Minor updates (jrbourbeau, Sep 23, 2021)
dask_bigquery/core.py: 37 changes (15 additions, 22 deletions)
@@ -19,33 +19,29 @@ def bigquery_client(project_id=None):
     See googleapis/google-cloud-python#9457
     and googleapis/gapic-generator-python#575 for reference.
     """
-
-    bq_storage_client = None
     with bigquery.Client(project_id) as bq_client:

    [Review comment on the line above]
    Doesn't have to be this PR, but it would be really helpful if we could attribute these requests to Dask/Dask-BigQuery. #6

-        try:
-            bq_storage_client = bigquery_storage.BigQueryReadClient(
-                credentials=bq_client._credentials
-            )
-            yield bq_client, bq_storage_client
-        finally:
-            bq_storage_client.transport.grpc_channel.close()
+        bq_storage_client = bigquery_storage.BigQueryReadClient(
+            credentials=bq_client._credentials
+        )
+        yield bq_client, bq_storage_client
+        bq_storage_client.transport.grpc_channel.close()
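
The review comment above asks about attributing requests to Dask/Dask-BigQuery. One possible approach, outside the scope of this PR: the google-cloud client constructors accept a client_info argument that can carry a custom user agent. A minimal sketch, with a hypothetical project id and version string:

    # Sketch only: attribute API traffic to dask-bigquery via the user agent.
    # Assumes the `client_info` parameter of google-cloud-bigquery's Client;
    # check the installed library version before relying on this.
    from google.api_core.client_info import ClientInfo
    from google.cloud import bigquery

    client = bigquery.Client(
        project="my-project",  # hypothetical project id
        client_info=ClientInfo(user_agent="dask-bigquery/0.0.1"),
    )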


-def _stream_to_dfs(bqs_client, stream_name, schema, timeout):
+def _stream_to_dfs(bqs_client, stream_name, schema, read_kwargs):
     """Given a Storage API client and a stream name, yield all dataframes."""
     return [
         pyarrow.ipc.read_record_batch(
             pyarrow.py_buffer(message.arrow_record_batch.serialized_record_batch),
             schema,
         ).to_pandas()
-        for message in bqs_client.read_rows(name=stream_name, offset=0, timeout=timeout)
+        for message in bqs_client.read_rows(name=stream_name, offset=0, **read_kwargs)
     ]
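
To make the plumbing concrete: `read_kwargs` arrives here as a plain dict and is splatted into `read_rows`, so the old fixed timeout becomes just one of many possible keyword arguments. A small sketch, assuming `bqs_client` and `stream_name` come from an open BigQuery Storage read session:

    # Hypothetical values for illustration only.
    read_kwargs = {"timeout": 3600}  # what the old `timeout` parameter hard-wired
    rows = bqs_client.read_rows(name=stream_name, offset=0, **read_kwargs)
    # ...which is equivalent to:
    rows = bqs_client.read_rows(name=stream_name, offset=0, timeout=3600)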


 def bigquery_read(
     make_create_read_session_request: callable,
     project_id: str,
-    timeout: int,
+    read_kwargs: int,

    [Comment from the PR author (ncclementi) on the `read_kwargs: int,` line]
    @jrbourbeau I was going over the docs and realized that this still shows as an int. Shouldn't it be a keyword argument at the end, read_kwargs: dict = None? And I wonder where the * should go. Right before it, like this?

        def bigquery_read(
            make_create_read_session_request: callable,
            project_id: str,
            timeout: int,
            stream_name: str,
            *,
            read_kwargs: dict = None)

    If this is correct I can modify it. We should probably also add a test that this works, although I'm not sure what's the easiest way to test these kwargs. Any ideas?

    [Reply from a maintainer]
    Since bigquery_read is only ever called internally in read_gbq, I don't think it matters whether read_kwargs is a positional or keyword argument to bigquery_read. Though you bring up a good point that the type annotation is now incorrect and should be updated to dict instead of int.

    [A sketch of one possible test for read_kwargs follows this hunk.]

     stream_name: str,
 ) -> pd.DataFrame:
     """Read a single batch of rows via BQ Storage API, in Arrow binary format.
@@ -65,7 +61,7 @@ def bigquery_read(
     schema = pyarrow.ipc.read_schema(
         pyarrow.py_buffer(session.arrow_schema.serialized_schema)
     )
-    shards = _stream_to_dfs(bqs_client, stream_name, schema, timeout=timeout)
+    shards = _stream_to_dfs(bqs_client, stream_name, schema, read_kwargs)
     # NOTE: BQ Storage API can return empty streams
     if len(shards) == 0:
         shards = [schema.empty_table().to_pandas()]
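
The empty-stream fallback leans on pyarrow: a schema can materialize a zero-row table whose pandas conversion keeps the column names and dtypes. Roughly, as a sketch with made-up columns:

    import pyarrow as pa

    schema = pa.schema([("name", pa.string()), ("number", pa.int64())])
    df = schema.empty_table().to_pandas()  # zero rows, columns and dtypes intact
    assert len(df) == 0 and list(df.columns) == ["name", "number"]
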
@@ -78,7 +74,7 @@ def read_gbq(
     dataset_id: str,
     table_id: str,
     row_filter="",
-    read_timeout: int = 3600,
+    read_kwargs=None,
 ):
     """Read table as dask dataframe using BigQuery Storage API via Arrow format.
     If `partition_field` and `partitions` are specified, then the resulting dask dataframe
@@ -99,12 +95,9 @@
     dask dataframe
     See https://github.com/dask/dask/issues/3121 for additional context.
     """
-
-    with bigquery_client(project_id) as (
-        bq_client,
-        bqs_client,
-    ):
-        table_ref = bq_client.get_table(".".join((dataset_id, table_id)))
+    read_kwargs = read_kwargs or {}
+    with bigquery_client(project_id) as (bq_client, bqs_client):
+        table_ref = bq_client.get_table(f"{dataset_id}.{table_id}")
         if table_ref.table_type == "VIEW":
             raise TypeError("Table type VIEW not supported")
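
The added `read_kwargs = read_kwargs or {}` line is the usual guard for a dict-valued option: defaulting to `None` in the signature and normalizing inside the function avoids Python's shared-mutable-default pitfall. A self-contained sketch of the difference:

    def bad(read_kwargs={}):  # one dict object shared across every call
        read_kwargs.setdefault("timeout", 3600)
        return read_kwargs

    def good(read_kwargs=None):  # fresh dict on each call
        read_kwargs = read_kwargs or {}
        read_kwargs.setdefault("timeout", 3600)
        return read_kwargs

    assert bad() is bad()        # same object, mutations leak between calls
    assert good() is not good()  # independent objects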

@@ -139,7 +132,7 @@ def make_create_read_session_request(row_filter=""):
         dataset_id,
         table_id,
         row_filter,
-        read_timeout,
+        read_kwargs,
     )

     layer = DataFrameIOLayer(
@@ -150,7 +143,7 @@ def make_create_read_session_request(row_filter=""):
             bigquery_read,
             make_create_read_session_request,
             project_id,
-            read_timeout,
+            read_kwargs,
         ),
         label=label,
     )
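
Putting the new interface together, usage after this change looks roughly like the following; the project, dataset, and table names are placeholders:

    import dask_bigquery

    ddf = dask_bigquery.read_gbq(
        project_id="your_project_id",
        dataset_id="your_dataset_id",
        table_id="your_table_id",
        row_filter="col2 > 3",          # optional server-side filter
        read_kwargs={"timeout": 3600},  # forwarded to the Storage API reads
    )
    ddf.head()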