Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove cursor methods calls from GCSToBigQueryOperator #20119

Merged
merged 1 commit into from
Feb 28, 2022

Conversation

kazanzhy
Copy link
Contributor

@kazanzhy kazanzhy commented Dec 7, 2021

closes: #10288

I've tried to remove warnings during GCSToBigQueryOperator as described in issue #10288.
As a result, I copied almost all code from bq_hook.create_external_table and bq_hook.run_load which made the operator's execute method very sophisticated.

That's why I just switched from cursor method calls to BigQueryHook methods. In my opinion, these methods should not be deprecated.

We need someplace for complex dict creation, like configuration for bq_hook.insert_job() or table_recource for bq_hook.create_empty_table(). In my opinion, better to keep this code in BigQueryHook and don't move to Operator.

@boring-cyborg boring-cyborg bot added area:providers provider:google Google (including GCP) related issues labels Dec 7, 2021
@kazanzhy kazanzhy changed the title Remove deprecated method call of BigQueryHook cursor from GCSToBigQueryOperator Remove cursor methods calls from GCSToBigQueryOperator Dec 14, 2021
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 29, 2022
@kazanzhy kazanzhy force-pushed the fix_issue_10288 branch 3 times, most recently from 2889bb4 to c8a5705 Compare January 30, 2022 21:19
@kazanzhy kazanzhy marked this pull request as ready for review January 30, 2022 23:10
@kazanzhy
Copy link
Contributor Author

Warnings in BQ hook was added by this commit
8b54919

@eladkal eladkal removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 31, 2022
@potiuk
Copy link
Member

potiuk commented Feb 6, 2022

WDYT @turbaszek :) ?

@kazanzhy kazanzhy force-pushed the fix_issue_10288 branch 3 times, most recently from e7a31ae to e9ff9db Compare February 16, 2022 09:56
@potiuk
Copy link
Member

potiuk commented Feb 21, 2022

Rebased it to make sure it is still ok.

@turbaszek
Copy link
Member

We need someplace for complex dict creation, like configuration for bq_hook.insert_job() or table_recource for bq_hook.create_empty_table(). In my opinion, better to keep this code in BigQueryHook and don't move to Operator.

Those dicts should be created by users. If we create them then we tend to miss some input arguments. Also in my opinion one input which is well described by Google docs is better than 20 arguments with copy pasted descriptions.

@kazanzhy
Copy link
Contributor Author

@turbaszek
I assume that warning should be when the user uses hook directly but shouldn't be displayed if we use this method from operators.

@turbaszek
Copy link
Member

@turbaszek
I assume that warning should be when the user uses hook directly but shouldn't be displayed if we use this method from operators.

Correct, and I believe we have separated warning in operators. Users can use both hook and operators and in both cases should be informed.

@kazanzhy kazanzhy force-pushed the fix_issue_10288 branch 2 times, most recently from 4201e45 to a4f0923 Compare February 23, 2022 22:43
@kazanzhy
Copy link
Contributor Author

So I returned warnings back to the BQ hook and suppressed them from the operator.
And seems it still closes an issue.

Also, I think there are left no usages of bq_hook.get_conn().cursor().
Probably it might be deleted in future

@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Feb 26, 2022
@github-actions
Copy link

The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

@potiuk potiuk merged commit 0c55ca2 into apache:main Feb 28, 2022
@kazanzhy kazanzhy deleted the fix_issue_10288 branch February 28, 2022 14:38
@eladkal
Copy link
Contributor

eladkal commented Jul 12, 2022

Regarding the create_external_table we already handled it in a diffrent operator - see #24363

@cswpy @kazanzhy maybe we can create a helper function to do this as we have more than 1 operator to use this functionality?

I assume we need something similar for the run_load ?

@kazanzhy
Copy link
Contributor Author

Very good work with the #24363.

As I wrote earlier in this discussion it should be the place with all this routine dict creation.
I tried to do it in my work, but it's awfully inconvenient. It's much easier to use Pycharm and autocomplete than explore Google docs to understand where would be placed exactly this argument.

My previous idea was to keep it in hook, but probably we could think about the handler.
Recently I updated my data pipelines and must say that the combination of ClusterGenerator and DataprocCreateClusterOperator is fantastic.I think it's a good balance between flexibility and code overload.
I'm working on another PR and it's quite big. Maybe @cswpy has time?

@cswpy
Copy link
Contributor

cswpy commented Jul 13, 2022

I'd love to work on this. But require more context on this issue. From my understanding, both run_load and create_external_table suffer from too many arguments. So we want to have a helper function for dictionary creation like configurations and table_resource? I am not following the design of "the combination of ClusterGenerator and DataprocCreateClusterOperator", can you elaborate?

@kazanzhy
Copy link
Contributor Author

kazanzhy commented Jul 13, 2022

here's the part of my DAG

from airflow.providers.google.cloud.operators.dataproc import ClusterGenerator,DataprocCreateClusterOperator

with DAG(**dag_args, default_args=default_task_args) as dag:
    dataproc_cluster = ClusterGenerator(
        project_id=project_id,
        storage_bucket='gs://...',
        num_masters=3,
        master_machine_type='ns-standard-4',
        master_disk_size=1000,
        num_workers=160,
        worker_machine_type='ns-standard-2',
        worker_disk_size=100,
        num_preemptible_workers=0,
        image_version="2.0.28-debian10",
        auto_delete_ttl=3600,
        subnetwork_uri='default',
        properties={
            'dataproc:dataproc.logging.stackdriver.job.driver.enable': "true",
            'dataproc:dataproc.logging.stackdriver.job.yarn.container.enable': "true",
        },
    )
    create_dataproc_cluster = DataprocCreateClusterOperator(
        task_id='create_dataproc_cluster',
        project_id=project_id,
        cluster_name='on_demand_cluster',
        region=region,
        labels=labels,
        gcp_conn_id=gcp_connection_id,
        cluster_config=dataproc_cluster.make()
    )

This pattern allows to create a dict for the DataprocCreateClusterOperator or use some constructor which has a lot of parameters and shows them in IDE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers okay to merge It's ok to merge this PR as it does not require more tests provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gcs_to_bigquery using deprecated methods
5 participants