Implement Async `BigqueryOperator` #31

kaxil · 2021-12-27T20:18:01Z

Build async version of https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/operators/bigquery.py

Acceptance Criteria:

- Unit test coverage in the PR (80% code coverage -- we will need to add CodeCov separately to measure code coverage), with all tests passing
- Example DAG using the async Operator that can be used to run integration tests that are parametrized via environment variables. Example - https://github.com/apache/airflow/blob/8a03a505e1df0f9de276038c5509135ac569a667/airflow/providers/google/cloud/example_dags/example_bigquery_to_gcs.py#L33-L35
- Add proper docstrings for each of the methods and functions, including an Example DAG on how it should be used (populate
- Exception handling in case of errors
- Improve the OSS docs to make sure they cover the following:
  - Has an example DAG for the sync version
  - How to add a connection via an environment variable, with an explanation of each of the fields. Example - https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/connections/postgres.html
  - How-to guide for the Operator. Example - https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/operators/postgres_operator_howto_guide.html

phanikumv · 2022-01-13T11:35:45Z

The classes below need to be converted to their async versions (based on feasibility):

| Operator | API used | Assignee | Async Implementation Needed? | Comments |
| --- | --- | --- | --- | --- |
| BigQueryCheckOperator | | @phanikumv | Yes | Performs checks against BigQuery. The `BigQueryCheckOperator` expects a SQL query that will return a single row. |
| BigQueryValueCheckOperator | | @pankajastro | Yes | Performs a simple value check using SQL code. |
| BigQueryIntervalCheckOperator | | @sunank200 | Yes | Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before. |
| BigQueryGetDataOperator | https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/list | @rajaths010494 | Yes | Would require async, but gcloud-aio is not directly usable. Cannot get a pid or query ID to check the status of long-running tasks. |
| BigQueryExecuteQueryOperator | | | No | NA (deprecated) |
| BigQueryCreateEmptyTableOperator | | | No | |
| BigQueryCreateExternalTableOperator | | | No | |
| BigQueryDeleteDatasetOperator | | | No | |
| BigQueryCreateEmptyDatasetOperator | | | No | |
| BigQueryGetDatasetOperator | | | No | |
| BigQueryGetDatasetTablesOperator | | | No | |
| BigQueryPatchDatasetOperator | | | No | It only replaces fields that are provided in the submitted dataset resource. |
| BigQueryUpdateTableOperator | | @sunank200 | No | Cannot get a pid or query ID to check the status of long-running tasks. Changes some fields in the table as specified. |
| BigQueryUpdateDatasetOperator | | | No | Changes some fields in the dataset as specified. |
| BigQueryDeleteTableOperator | | | No | |
| BigQueryUpsertTableOperator | https://cloud.google.com/bigquery/docs/reference/v2/tables#resource | | No | If the table already exists, updates the existing table; if not, creates a new one. Since BigQuery does not natively allow table upserts, this is not an atomic operation. |
| BigQueryUpdateTableSchemaOperator | https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#TableSchema | | No | Updates fields within a schema for a given dataset and table. |
| BigQueryInsertJobOperator | https://cloud.google.com/bigquery/docs/reference/v2/jobs | @phanikumv | Yes | Implementation completed. |

phanikumv · 2022-01-18T08:32:07Z

I get the API response below from the gcloud-aio library when I try to read a table. The sync version of the BigQueryGetDataOperator returns a list[Row] --> [Row(('100', '200'), {'col1': 0, 'col2': 1}), Row(('300', '400'), {'col1': 0, 'col2': 1})], whereas this returns a list of dicts --> [{'f': [{'v': '300'}, {'v': '400'}]}, {'f': [{'v': '100'}, {'v': '200'}]}]

Do we need to ensure that the async and sync versions of an Operator return the same data types at the end of execution? Thoughts @kaxil

gcloud-aio response

{'kind': 'bigquery#getQueryResultsResponse', 'etag': 'Ycgra5bwYNpTA3Tgy9ICdw==', 'schema': {'fields': [{'name': 'col1', 'type': 'STRING', 'mode': 'NULLABLE'}, {'name': 'col2', 'type': 'STRING', 'mode': 'NULLABLE'}]}, 'jobReference': {'projectId': 'astronomer-airflow-providers', 'jobId': 'job_-2e9ok-6d9gWKmBnX3K92wLpF0km', 'location': 'US'}, 'totalRows': '2', 'rows': [{'f': [{'v': '300'}, {'v': '400'}]}, {'f': [{'v': '100'}, {'v': '200'}]}], 'totalBytesProcessed': '20', 'jobComplete': True, 'cacheHit': False}
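If the two versions should return comparable data, one option is to flatten the `rows` payload shown above into plain Python values. The following is only a minimal sketch: `rows_from_response` and `rows_as_dicts` are hypothetical helper names, not part of gcloud-aio or Airflow, and the sketch assumes every row has the `f`/`v` shape seen in the response above.

```python
from typing import Any, Dict, List


def rows_from_response(response: Dict[str, Any]) -> List[List[Any]]:
    """Flatten the REST-style ``rows`` payload into lists of cell values.

    Hypothetical helper: assumes each row looks like {'f': [{'v': ...}, ...]}.
    """
    return [[cell["v"] for cell in row["f"]] for row in response.get("rows", [])]


def rows_as_dicts(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pair each cell value with its column name taken from the schema."""
    names = [field["name"] for field in response["schema"]["fields"]]
    return [dict(zip(names, values)) for values in rows_from_response(response)]


# Trimmed copy of the response quoted above.
response = {
    "schema": {
        "fields": [
            {"name": "col1", "type": "STRING", "mode": "NULLABLE"},
            {"name": "col2", "type": "STRING", "mode": "NULLABLE"},
        ]
    },
    "totalRows": "2",
    "rows": [
        {"f": [{"v": "300"}, {"v": "400"}]},
        {"f": [{"v": "100"}, {"v": "200"}]},
    ],
}

flat_rows = rows_from_response(response)
named_rows = rows_as_dicts(response)
```

Here `flat_rows` holds one list of values per row, which is close to the value tuples in the sync operator's `Row` objects, while `named_rows` keeps the column names alongside the values.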

phanikumv · 2022-01-18T10:17:44Z

Let's focus on the below 4 operators on priority:
BigQueryGetDataOperator
BigQueryExecuteQueryOperator
BigQueryCheckOperator
BigQueryInsertJobOperator

phanikumv · 2022-01-19T18:07:56Z

Only the jobs API of BigQuery submits a job asynchronously. Refer to https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/insert
It returns a job ID that we can use to track the status of the request from the Triggerer.

This might become an issue when calling the other BigQuery APIs asynchronously, such as https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/list, because that API doesn't return any kind of process ID or job ID once the request is submitted. Hence, we need to find an alternate approach for implementing the asynchronous version of the BigQueryGetDataOperator.
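The job-ID-based tracking described above can be sketched as a small polling loop. This is an illustration under stated assumptions, not the actual trigger code: `wait_for_job` and `FakeJobsClient` are hypothetical names, and the fake client stands in for whatever call fetches the job state (BigQuery reports `PENDING`, `RUNNING`, or `DONE`).

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def wait_for_job(
    get_state: Callable[[str], Awaitable[str]],
    job_id: str,
    poll_interval: float = 1.0,
    max_polls: int = 100,
) -> str:
    """Poll ``get_state(job_id)`` until the job reports DONE.

    Hypothetical helper: ``get_state`` is any coroutine function that
    returns the current job state as a string.
    """
    for _ in range(max_polls):
        state = await get_state(job_id)
        if state == "DONE":
            return state
        # Yield control so the event loop can serve other waiting jobs.
        await asyncio.sleep(poll_interval)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


class FakeJobsClient:
    """Stand-in for a real client; replays a fixed sequence of states."""

    def __init__(self, states: Iterable[str]) -> None:
        self._states = iter(states)

    async def get_state(self, job_id: str) -> str:
        return next(self._states)


if __name__ == "__main__":
    client = FakeJobsClient(["PENDING", "RUNNING", "DONE"])
    print(asyncio.run(wait_for_job(client.get_state, "job_123", poll_interval=0)))
```

An API without a job ID, such as tabledata.list, gives the loop nothing to poll, which is the limitation raised above.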

phanikumv · 2022-01-21T12:03:03Z

When multiple queries are passed within a list, the OSS version of the BigQueryInsertJobOperator doesn't work. It does work when the queries are passed as a single string.

BigQueryInsertJobOperator uses the Google Jobs API. If the user passes the queries within a list, it won't work because the Google Jobs API doesn't allow using an array to pass multiple queries (please refer to the screenshot below).

There used to be an operator called BigQueryExecuteQueryOperator, which supported passing multiple queries through a list, but it is deprecated now. @kaxil @dstandish - please share your opinion on this observation.

This was discussed, and we agreed to add a docstring to the Operator describing how to pass multiple queries: through a single string separated by semicolons rather than a list.
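As a rough illustration of the agreed approach, the sketch below joins a list of statements into one semicolon-separated string and places it in a Jobs API query configuration. `build_query_configuration` is a hypothetical helper, not part of the provider; the `query.query` and `query.useLegacySql` keys follow the Jobs API configuration, and the sketch assumes multi-statement queries are run with standard SQL.

```python
from typing import Any, Dict, List


def build_query_configuration(
    statements: List[str], use_legacy_sql: bool = False
) -> Dict[str, Any]:
    """Join SQL statements into one semicolon-separated query string.

    Hypothetical helper: the Jobs API takes a single query string, so a
    list of statements has to be combined before the job is submitted.
    """
    # Drop empty entries and any trailing semicolons before joining.
    cleaned = [s.strip().rstrip(";").strip() for s in statements if s.strip()]
    if not cleaned:
        raise ValueError("At least one SQL statement is required")
    return {
        "query": {
            "query": "; ".join(cleaned) + ";",
            "useLegacySql": use_legacy_sql,
        }
    }


configuration = build_query_configuration(["SELECT 1;", " SELECT 2 "])
```

The resulting dictionary has the shape that would be passed as the operator's `configuration` argument.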

phanikumv · 2022-02-01T09:27:28Z

BigQueryExecuteQueryOperator is deprecated. Hence no need to create an async version of it.

phanikumv · 2022-02-08T06:19:21Z

Draft #47 for the BigQueryInsertJobOperator

phanikumv · 2022-02-11T11:43:39Z

PR is ready for BigQueryInsertJobOperatorAsync

Implements BigQueryInsertJobOperatorAsync, which asynchronously submits jobs, generates a job ID, and polls for job status using the job ID on the Triggerer. Part of #31

sunank200 · 2022-02-15T14:23:25Z

BigQueryUpdateTableOperator uses the Table object and changes the specified fields of a table. This operator doesn't update the data in the table; it only changes the fields of the Table resource.

I have tested Method: tables.update and have attached the screenshot. In my opinion, an async implementation of BigQueryUpdateTableOperator, or of any other operator that uses Method: tables of Google Cloud, would not improve performance, as it only deals with Table fields rather than TableData.

Also, it doesn't return any job ID or process ID that can be tracked on the Google Cloud side. This is only available for operators that use Method: jobs.

Moreover, the gcloud-aio Table class uses the Table object instead of TableData.

All the operators that use the Google Cloud jobs API would require an async implementation.

@phanikumv , @rajaths010494 , @kaxil

rajaths010494 · 2022-02-15T16:24:52Z

In OSS, BigQueryGetDataOperator uses the Table object and its list_rows method to fetch data.
In gcloud-aio, the Table class uses the Table object instead of TableData and does not implement the TableData list API. Also, all the table calls in gcloud-aio are aiohttp calls to the API that await until the response is returned, rather than getting a pid or job ID (Method: jobs responds with the ID).
So if a long-running get operation is made, it awaits until the API get call is finished rather than returning a query ID that we could use to check whether the query has finished.

All the operators that use the Google Cloud jobs API would require an async implementation.

sunank200 · 2022-02-16T07:55:49Z

I have updated the table above with the fields "Async Implementation Needed" and "Comments", along with the necessary details. Please let me know your views.

phanikumv · 2022-02-16T10:10:34Z

My thinking for BigQueryGetDataOperator is that we'll reuse the BigQueryInsertJobOperatorAsync to form a `select * from table` query at run time, get a job ID, and then poll it on the Triggerer, because only the jobs API gives us a job ID. We can adopt the same strategy for BigQueryCheckOperator, BigQueryValueCheckOperator and BigQueryIntervalCheckOperator.
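A minimal sketch of that idea, assuming the query is built from the operator's dataset, table, selected fields, and row limit. `build_get_data_configuration` is a hypothetical helper and its parameter names are illustrative only; the implementation in the PR may differ.

```python
from typing import Any, Dict, Optional


def build_get_data_configuration(
    dataset_id: str,
    table_id: str,
    selected_fields: Optional[str] = None,
    max_results: int = 100,
) -> Dict[str, Any]:
    """Build a jobs API query configuration that reads rows from a table.

    Hypothetical helper: ``selected_fields`` is a comma-separated column
    list, and all columns are selected when it is omitted.
    """
    columns = selected_fields if selected_fields else "*"
    query = f"SELECT {columns} FROM `{dataset_id}.{table_id}` LIMIT {int(max_results)}"
    return {"query": {"query": query, "useLegacySql": False}}


get_data_configuration = build_get_data_configuration(
    "my_dataset", "my_table", selected_fields="col1,col2", max_results=10
)
```

Submitting this configuration through the jobs API yields a job ID, which the Triggerer can then poll in the same way as any other insert job.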

phanikumv · 2022-02-22T07:11:27Z

PR is ready for BigQueryCheckOperatorAsync

Implement operator, hook and trigger to execute BigQueryCheckOperator in asynchronous mode:
- Add BigQueryCheckOperatorAsync
- Use the get query results API of gcloud-aio to retrieve the results
- Poll the results using the Triggerer and send data back to the Operator execute method

Part of #31

sunank200 · 2022-02-24T10:25:25Z

PR for BigQueryIntervalCheckOperatorAsync

Add BigQueryGetDataOperatorAsync. Use the get query results API of gcloud-aio to retrieve the results. Poll the results using the Triggerer and send data back to the Operator execute method. Part of #31

kaxil · 2022-02-25T12:44:11Z

Can this be closed now, or are there any other pending implementations @phanikumv ?

phanikumv · 2022-02-25T15:46:02Z

Implementation of the below operators is done. Hence closing the story.

BigQueryCheckOperator
BigQueryValueCheckOperator
BigQueryIntervalCheckOperator
BigQueryGetDataOperator
BigQueryInsertJobOperator

Implements BigQueryInsertJobOperatorAsync, which asynchronously submits jobs, generates a job ID, and polls for job status using the job ID on the Triggerer. Part of astronomer/astronomer-providers#31

Implement operator, hook and trigger to execute BigQueryCheckOperator in asynchronous mode:
- Add BigQueryCheckOperatorAsync
- Use the get query results API of gcloud-aio to retrieve the results
- Poll the results using the Triggerer and send data back to the Operator execute method

Part of astronomer/astronomer-providers#31

phanikumv changed the title ~~Async BigqueryOperator~~ Implement Async BigqueryOperator Dec 30, 2021

phanikumv added the area/async Deferrable/async operators label Dec 30, 2021

kaxil assigned phanikumv Dec 30, 2021

kaxil pushed a commit that referenced this issue Feb 14, 2022

Implement BigQueryInsertJobOperatorAsync (#47)

f75762e

Implements BigQueryInsertJobOperatorAsync, which asynchronously submits jobs, generates a job ID, and polls for job status using the job ID on the Triggerer. Part of #31

kaxil mentioned this issue Feb 22, 2022

Implement async for remaining Bigquery sensors #68

Closed

10 tasks

kaxil mentioned this issue Feb 23, 2022

Async BigQueryCheckOperator #61

Merged

rajaths010494 mentioned this issue Feb 23, 2022

Async bigquerygetdataoperator #59

Merged

sunank200 mentioned this issue Feb 24, 2022

Implement BigQueryIntervalCheckOperatorAsync #70

Merged

phanikumv closed this as completed Feb 25, 2022
