Implement Async BigqueryOperator
#31
Below classes need to be converted to their Async versions ( based on feasibility )
Getting the below API response from the gcloud-aio library when I try to read a table. While the sync version of BigQueryGetDataOperator returns a list[Row], e.g. [Row(('100', '200'), {'col1': 0, 'col2': 1}), Row(('300', '400'), {'col1': 0, 'col2': 1})], this returns a plain list of dicts, e.g. [{'f': [{'v': '300'}, {'v': '400'}]}, {'f': [{'v': '100'}, {'v': '200'}]}]. Do we need to ensure that both the async and sync versions of an operator return the same datatypes at the end of execution? Thoughts @kaxil?

gcloud-aio response: {'kind': 'bigquery#getQueryResultsResponse', 'etag': 'Ycgra5bwYNpTA3Tgy9ICdw==', 'schema': {'fields': [{'name': 'col1', 'type': 'STRING', 'mode': 'NULLABLE'}, {'name': 'col2', 'type': 'STRING', 'mode': 'NULLABLE'}]}, 'jobReference': {'projectId': 'astronomer-airflow-providers', 'jobId': 'job_-2e9ok-6d9gWKmBnX3K92wLpF0km', 'location': 'US'}, 'totalRows': '2', 'rows': [{'f': [{'v': '300'}, {'v': '400'}]}, {'f': [{'v': '100'}, {'v': '200'}]}], 'totalBytesProcessed': '20', 'jobComplete': True, 'cacheHit': False}
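If the two versions should match, the raw gcloud-aio response could be normalized on the operator side. Below is a minimal sketch (not the actual implementation in the repo) that maps the {'f': [{'v': ...}]} rows from the response above into (values, field_index) pairs shaped like google-cloud-bigquery Row objects; the function name is hypothetical.

```python
def rows_from_response(response: dict) -> list:
    """Map gcloud-aio getQueryResults rows ({'f': [{'v': ...}]}) to
    (values_tuple, column_index) pairs resembling bigquery Row objects."""
    fields = [f["name"] for f in response["schema"]["fields"]]
    col_index = {name: i for i, name in enumerate(fields)}
    return [
        (tuple(cell["v"] for cell in row["f"]), col_index)
        for row in response.get("rows", [])
    ]

# Trimmed version of the response pasted above
response = {
    "schema": {"fields": [{"name": "col1"}, {"name": "col2"}]},
    "rows": [
        {"f": [{"v": "300"}, {"v": "400"}]},
        {"f": [{"v": "100"}, {"v": "200"}]},
    ],
}
# rows_from_response(response) ->
# [(('300', '400'), {'col1': 0, 'col2': 1}), (('100', '200'), {'col1': 0, 'col2': 1})]
```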
Let's focus on the below 4 operators on priority:
Only the jobs API of BigQuery submits a job asynchronously. Refer to https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/insert. This might become an issue for asynchronously calling the other BigQuery APIs, such as https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/list, because that API doesn't return any kind of process ID or job ID once the request is submitted. Hence, we need to find an alternate approach for implementing the asynchronous version of BigQueryGetDataOperator.
When multiple queries are passed within a list, it doesn't work in the OSS version of BigQueryInsertJobOperator; it does work when the queries are passed as a single string. BigQueryInsertJobOperator uses the Google jobs API, and passing the queries in a list won't work because the jobs API doesn't allow using an array to pass multiple queries (please refer to the screenshot below). There used to be an operator called BigQueryExecuteQueryOperator which supported passing multiple queries through a list, but it is deprecated now. @kaxil @dstandish - please opine on this observation. This was discussed and we agreed to add a docstring to the operator describing how to pass multiple queries: as a single string with statements separated by semicolons, rather than as a list.
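To illustrate the agreed workaround, here is a minimal sketch of building a jobs-API configuration with several statements joined into one semicolon-separated string. The dataset and table names are hypothetical, and the configuration dict mirrors the shape BigQueryInsertJobOperator accepts via its configuration argument.

```python
# Hypothetical statements; the jobs API takes a single "query" string,
# not an array, so multiple statements are joined with semicolons.
queries = [
    "CREATE TABLE IF NOT EXISTS my_dataset.my_table (col1 STRING)",
    "INSERT INTO my_dataset.my_table VALUES ('100')",
]
configuration = {
    "query": {
        "query": "; ".join(queries),  # one string, not a list
        "useLegacySql": False,
    }
}
```

This configuration dict would then be passed as the configuration parameter of BigQueryInsertJobOperator instead of a list of queries.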
BigQueryExecuteQueryOperator is deprecated, hence there is no need to create an async version of it.
Draft #47 for the BigQueryInsertJobOperator
PR is ready for BigQueryInsertJobOperatorAsync
BigQueryUpdateTableOperator uses the Table object and changes specified fields of a table. This operator doesn't update the data in the table; instead, it changes the fields of the Table. I have tested the tables.update method and attached the screenshot. In my opinion, an async implementation for BigQueryUpdateTableOperator, or any other operator which uses the tables methods of Google Cloud, would not improve performance, as it deals with Table fields rather than table data. Also, it doesn't return any job ID or process ID which can be tracked on the Google Cloud side; that is only available for operators which use the jobs methods. Moreover, the gcloud-aio Table class uses the Table object instead of TableData. All the operators which use the Google Cloud jobs API would require an async implementation.
In OSS, BigQueryGetDataOperator uses the Table object, fetching data via the list_rows method. All the operators which use the Google Cloud jobs API would require an async implementation.
I have updated the table above with a field
My thought process for BigQueryGetDataOperator is that we'll re-use BigQueryInsertJobOperatorAsync to form a select * from table query at runtime, get a job ID, and then poll it on the Trigger, because only the jobs API gives us a job ID. We can adopt the same strategy for BigQueryCheckOperator, BigQueryValueCheckOperator and BigQueryIntervalCheckOperator.
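A minimal sketch of the query-building step of that strategy is below. The function name and parameters are illustrative, not the repo's actual implementation: the operator would form a SELECT from its dataset/table arguments, submit it through the jobs API (which returns a job ID), and the trigger would poll that job ID.

```python
from typing import Optional

def build_get_data_query(
    dataset_id: str,
    table_id: str,
    selected_fields: Optional[str] = None,
    max_results: Optional[int] = None,
) -> str:
    """Form the SELECT that an async get-data operator could submit as a
    BigQuery job, so the jobs API hands back a pollable job ID."""
    cols = selected_fields or "*"
    query = f"SELECT {cols} FROM `{dataset_id}.{table_id}`"
    if max_results is not None:
        query += f" LIMIT {max_results}"
    return query

# build_get_data_query("my_dataset", "my_table")
# -> "SELECT * FROM `my_dataset.my_table`"
```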
PR is ready for BigQueryCheckOperatorAsync
Implement operator, hook and trigger to execute BigQueryCheckOperator in asynchronous mode:
- Add BigQueryCheckOperatorAsync
- Use the get query results API of gcloud-aio to retrieve the results
- Poll the results using the Triggerer and send data back to the operator's execute method
Part of #31
PR for BigQueryIntervalCheckOperatorAsync
Add BigQueryGetDataOperatorAsync:
- Use the get query results API of gcloud-aio to retrieve the results
- Poll the results using the Triggerer and send data back to the operator's execute method
Part of #31
Can this be closed now, or are there any other pending implementations, @phanikumv?
Implementation of the below operators is done. Hence closing the story.
Implements BigQueryInsertJobOperatorAsync, which asynchronously submits jobs, generates a job ID, and polls for job status using the job ID on the Triggerer. Part of astronomer/astronomer-providers#31
Implement operator, hook and trigger to execute BigQueryCheckOperator in asynchronous mode:
- Add BigQueryCheckOperatorAsync
- Use the get query results API of gcloud-aio to retrieve the results
- Poll the results using the Triggerer and send data back to the operator's execute method
Part of astronomer/astronomer-providers#31
Build async version of https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/operators/bigquery.py
Acceptance Criteria: