
[CT-1044] Leverage BigQuery table clones for a Low Cost CI environment #270

Closed
amirbtb opened this issue Aug 16, 2022 · 6 comments
Labels: enhancement (New feature or request), refinement (Product or leadership input needed)



amirbtb commented Aug 16, 2022

Describe the feature

The BigQuery table clone feature has been in Preview since February 2022. It would allow us to set up a CI environment in a dedicated GCP project/BigQuery dataset at zero/low cost (following the zero-copy-clone logic of Snowflake).

Describe alternatives you've considered

Configure CI to build in a separate project.

Additional context

Limitations

You can't create a clone of a view or an external table.

Since views and external tables are just abstractions over data stored elsewhere, we could recreate them in the CI GCP project/BigQuery dataset.

Workflow

GCP describes how to create table clones in SQL (a minimal sketch follows the list below); I just need to figure out a couple of things about the workflow:

  1. Filter only new/modified materializations (Slim CI)
  2. Clone Tables + Create External Tables & Views
  3. Run dbt test on the new/modified materializations (Slim CI)
  4. Destroy Table Clones, External Tables & Views (optional)
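
For steps 2 and 4, here is a minimal sketch of what the clone and cleanup could look like through the google-cloud-bigquery client; the project, dataset, and table names are made up for illustration:

from google.cloud import bigquery

client = bigquery.Client()

# Step 2: zero-copy clone of a production table into a dedicated CI dataset.
# Storage is only billed for data that later diverges from the base table.
client.query(
    "CREATE TABLE `my-project.ci_dataset.orders` "
    "CLONE `my-project.prod_dataset.orders`"
).result()

# Step 3 would run dbt test against ci_dataset here.

# Step 4 (optional): drop the clone once CI has finished.
client.query("DROP TABLE IF EXISTS `my-project.ci_dataset.orders`").result()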

Who will this benefit?

Any user of dbt-bigquery who wishes to set up a secure test environment for CI.

Are you interested in contributing this feature?

Yes.

@amirbtb amirbtb added enhancement New feature or request triage labels Aug 16, 2022
@github-actions github-actions bot changed the title Leverage BigQuery table clones for a Low Cost CI environment [CT-1044] Leverage BigQuery table clones for a Low Cost CI environment Aug 16, 2022

romanoww commented Jan 31, 2023

I'd like to add to this issue/enhancement. I assume it might be fairly easy to add simple table clones within the current copy materialization framework provided by dbt-bigquery.

The CopyJobConfig is already incorporated in the dbt-bigquery framework and used for materialized='copy' configurations; it is implemented in connections.py.

We can pass the operation_type parameter to the CopyJobConfig (see docs and accepted_strings). The accepted values for this parameter include both COPY and CLONE; my guess is that the default is set to COPY.
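
As a standalone illustration (outside dbt), a clone job using that parameter could look like the sketch below; the table names are hypothetical, and it assumes a google-cloud-bigquery version recent enough to expose OperationType:

from google.cloud import bigquery
from google.cloud.bigquery.job import OperationType

client = bigquery.Client()

# Same copy-job API that materialized='copy' uses, but with operation_type=CLONE.
# The destination table should not already exist for a clone job.
job_config = bigquery.CopyJobConfig(operation_type=OperationType.CLONE)

copy_job = client.copy_table(
    "my-project.prod_dataset.orders",  # source table
    "my-project.ci_dataset.orders",    # destination (becomes the clone)
    job_config=job_config,
)
copy_job.result()  # wait for the clone job to complete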

Adding operation_type to the copy_bq_table and copy_and_results functions would reconfigure our copy job.

def copy_bq_table(self, source, destination, write_disposition, operation_type):
        conn = self.get_thread_connection()
        client = conn.handle

        # -------------------------------------------------------------------------------
        #  BigQuery allows to use copy API using two different formats:
        #  1. client.copy_table(source_table_id, destination_table_id)
        #     where source_table_id = "your-project.source_dataset.source_table"
        #  2. client.copy_table(source_table_ids, destination_table_id)
        #     where source_table_ids = ["your-project.your_dataset.your_table_name", ...]
        #  Let's use uniform function call and always pass list there
        # -------------------------------------------------------------------------------
        if type(source) is not list:
            source = [source]

        source_ref_array = [
            self.table_ref(src_table.database, src_table.schema, src_table.table)
            for src_table in source
        ]
        destination_ref = self.table_ref(
            destination.database, destination.schema, destination.table
        )

        logger.debug(
            'Copying table(s) "{}" to "{}" with disposition: "{}"',
            ", ".join(source_ref.path for source_ref in source_ref_array),
            destination_ref.path,
            write_disposition,
        )

        def copy_and_results():
            job_config = google.cloud.bigquery.CopyJobConfig(
                write_disposition=write_disposition,
                operation_type=operation_type,  # new: "COPY" (default) or "CLONE"
            )
            copy_job = client.copy_table(source_ref_array, destination_ref, job_config=job_config)
            timeout = self.get_job_execution_timeout_seconds(conn) or 300
            iterator = copy_job.result(timeout=timeout)
            return copy_job, iterator

        self._retry_and_handle(
            msg='copy table "{}" to "{}"'.format(
                ", ".join(source_ref.path for source_ref in source_ref_array),
                destination_ref.path,
            ),
            conn=conn,
            fn=copy_and_results,
        )

Further down the line, in the BigQuery adapter class, we could redefine the copy_table function by adding an operation_type input parameter. This way, we can call the BigQuery adapter with either copy or clone (these should be the accepted values, with the default set to clone).

def copy_table(self, source, destination, materialization, operation_type):
        if materialization == "incremental":
            write_disposition = WRITE_APPEND
        elif materialization == "table":
            write_disposition = WRITE_TRUNCATE
        else:
            raise dbt.exceptions.CompilationError(
                'Copy table materialization must be "incremental" or "table", but '
                f"config.get('copy_materialization', 'table') was "
                f"{materialization}"
            )

        self.connections.copy_bq_table(source, destination, write_disposition, operation_type)
        
        return "{operation_type} TABLE with materialization: {}".format(materialization)

We can then duplicate the copy materialization into clone.sql, changing the adapter call so that the operation_type config is passed through.

{# Call adapter copy_table function #}
  {%- set result_str = adapter.copy_table(
      source_array,
      destination,
      config.get('copy_materialization', default='table'),
      config.get('operation_type', default='clone')) -%}

This would require the model config to include an operation_type, e.g. config(materialized='copy', operation_type='clone'), and the logging should be made more extensive.

I'm not an expert by any means and have never meddled with the dbt source code, so I'm probably missing something / underestimating things. However, these are my two cents.
This implementation would not tick all the boxes suggested by @amirbtb, but it's a start.


Fleid commented Feb 14, 2023

I'm moving this out of triage and into refinement.
Yes, we want to do this, but we have to think across the board about the other adapters. In the meantime, all contributions are welcome!

@Fleid Fleid added refinement Product or leadership input needed and removed triage labels Feb 14, 2023
Timfrazer commented

This would really help us, as our costs for running dbt on BigQuery are quite high and growing with our data tables. We are going to do something hacky like this.

@Fleid Fleid self-assigned this Mar 24, 2023
jtcohen6 (Contributor) commented

Let's do this for real :) and not just on BQ (but leveraging BQ-specific copying/cloning capabilities where possible!)

Check out:

github-actions bot commented

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 17, 2023
github-actions bot commented

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 25, 2023
@Fleid Fleid removed the Stale label Oct 25, 2023