Skip to content

How to use dbt with Databend Cloud

Jeremy edited this page Nov 30, 2022 · 2 revisions

This is a primer tutorial about how to use dbt-databend-cloud. Here we use the official dbt tutorial as an example to introduce the use of Databend in dbt together. The software used in this example and its version requirements:

  • dbt 1.01 or upper
  • dbt-databend-cloud 0.0.1

Installation

Installing dbt and dbt-databend-cloud requires only one command because dbt is installed as a dependency when we install dbt-databend-cloud.

pip3 install dbt-databend-cloud

dbt can also be installed by yourself. Please refer to the official installation guide.

Creating project: jaffle shop

The jaffle_shop is a project provided by dbt-lab to demonstrate dbt functionality. You can get it directly from GitHub.

$ git clone https://github.com/dbt-labs/jaffle_shop
$ cd jaffle_shop

All files in the jaffle_shop project directory are as follows.

  • dbt_project.yml is the dbt project configuration file, which holds the project name, database configuration file path information, etc.
  • models directory contains the SQL model and table schema for the project. Note that this section was written by the data analyst himself.
  • seed directory stores CSV files. Such CSV files can be dumped from database export tools. In the jaffle shop project, these CSV files are used as raw data to be processed. I'll go into more detail about them later when I use one of the files or directories above.

Configuring project

  1. Global Configuration

dbt has a default global profile: ~/.dbt/profiles.yml, which we first set up in the user directory and configured the connection information for the Databend database. If you do not have ~/.dbt/profiles.yml please create it first.

jaffle_shop_databend:
  target: dev
  outputs:
    dev:
      type: databend
      host: tnc7yee14--query-perf.ch.datafusecloud.com
      port: 443
      schema: sjh_dbt
      user: cloudapp
      pass: ********

You can find your connect host, user, pass information by reading Connecting to a Warehouse | Databend Cloud .

  1. Project configuration

In the jaffle_shop project directory, the project configuration file dbt_project.yml is available. Change the profile configuration item to jaffle_shop_databend, which is the project name in profiles.yml. Then, the project will query the database connection configuration in the ~/.dbt/profiles.yml file.

cat dbt_project.yml
name: 'jaffle_shop_databend'

config-version: 2
version: '0.1'

profile: 'jaffle_shop_databend'

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
analysis-paths: ["analysis"]
macro-paths: ["macros"]

target-path: "target"
clean-targets:
    - "target"
    - "dbt_modules"
    - "logs"

require-dbt-version: [">=1.0.0", "<2.0.0"]

models:
  jaffle_shop:
      materialized: table
      staging:
        materialized: view
  1. Verify the configuration

You can run dbt debug in the project directory to check whether the database and project configuration is correct.

image

Loading CSV

Load the CSV data and materialize the CSV as a table in the target database. Note: In general, dbt projects do not need this step because the data for your pending projects is in the database.

dbt seed
03:04:20  Running with dbt=1.3.0
03:04:20  Unable to do partial parsing because profile has changed
03:04:20  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 2 unused configuration paths:
- models.jaffle_shop
- models.jaffle_shop.staging

03:04:20  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 282 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
03:04:20
03:04:29  Concurrency: 1 threads (target='dev')
03:04:29
03:04:29  1 of 3 START seed file sjh_dbt.raw_customers ................................... [RUN]
sync ec1e4a32-9eb6-4f8c-88d7-35c40cfa89d1.csv duration:9s
03:04:42  1 of 3 OK loaded seed file sjh_dbt.raw_customers ............................... [INSERT 100 in 12.85s]
03:04:42  2 of 3 START seed file sjh_dbt.raw_orders ...................................... [RUN]
sync 3fd97eae-d20b-4d92-9dd7-9926e3ae4c86.csv duration:9s
03:04:55  2 of 3 OK loaded seed file sjh_dbt.raw_orders .................................. [INSERT 99 in 12.54s]
03:04:55  3 of 3 START seed file sjh_dbt.raw_payments .................................... [RUN]
sync 8da2b4b5-78c7-4230-b8c6-c1f030bdcb8f.csv duration:9s
03:05:07  3 of 3 OK loaded seed file sjh_dbt.raw_payments ................................ [INSERT 113 in 12.77s]
03:05:07
03:05:07  Finished running 3 seeds in 0 hours 0 minutes and 47.17 seconds (47.17s).
03:05:07
03:05:07  Completed successfully
03:05:07
03:05:07  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

In the above result, it is clear that three tasks have been performed and three tables ( sjh_dbt.raw_customers, sjh_dbt.raw_orders, sjh_dbt.raw_payments) were loaded. Next, go to the Databend Cloud database to see what happens. We found an extra sjh_dbt database, which is the engineering database that dbt created for us.

image There are three tables in sjh_dbt database, corresponding to the above three task results respectively.

image

model

Before skipping to the next step, it's important to understand what role the model plays in dbt. In dbt, models are used to describe the structure of a set of tables or views, and there are two main types of files: SQL and YML. Also note that in the jaffle_shop project, table structures are held under the models/ directory and view structures are held under the models/staging/ directory according to the materialization configuration. For example, models/orders.sql is an SQL query statement with jinja syntax. We will create a table by this query statement.

$ cat models/orders.sql
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}

with orders as (

    select * from {{ ref('stg_orders') }}

),

payments as (

    select * from {{ ref('stg_payments') }}

),

order_payments as (

    select
        order_id,

        {% for payment_method in payment_methods -%}
        sum(case when payment_method = '{{ payment_method }}' then amount else 0 end) as {{ payment_method }}_amount,
        {% endfor -%}

        sum(amount) as total_amount

    from payments

    group by order_id

),

final as (

    select
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,

        {% for payment_method in payment_methods -%}

        order_payments.{{ payment_method }}_amount,

        {% endfor -%}

        order_payments.total_amount as amount

    from orders


    left join order_payments
        on orders.order_id = order_payments.order_id

)

select * from final

Also, the constraint information that goes with this SQL is in the models/schema.yml file. The schema.yml is a registry of all models in the current directory. All models are organized into a tree structure that describes the description and attributes of each field. The tests items represent constraints for this field, which can be tested using the dbt test command. See the official documentation for more information.

Running

The result shows three views (sjh_dbt.stg_customers, sjh_dbt.stg_orders, sjh_dbt.stg_payments) and two tables (sjh_dbt.customers, sjh_dbt.orders).

dbt seed
03:04:20  Running with dbt=1.3.0
03:04:20  Unable to do partial parsing because profile has changed
03:04:20  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 2 unused configuration paths:
- models.jaffle_shop
- models.jaffle_shop.staging

03:04:20  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 282 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
03:04:20
03:04:29  Concurrency: 1 threads (target='dev')
03:04:29
03:04:29  1 of 3 START seed file sjh_dbt.raw_customers ................................... [RUN]
sync ec1e4a32-9eb6-4f8c-88d7-35c40cfa89d1.csv duration:9s
03:04:42  1 of 3 OK loaded seed file sjh_dbt.raw_customers ............................... [INSERT 100 in 12.85s]
03:04:42  2 of 3 START seed file sjh_dbt.raw_orders ...................................... [RUN]
sync 3fd97eae-d20b-4d92-9dd7-9926e3ae4c86.csv duration:9s
03:04:55  2 of 3 OK loaded seed file sjh_dbt.raw_orders .................................. [INSERT 99 in 12.54s]
03:04:55  3 of 3 START seed file sjh_dbt.raw_payments .................................... [RUN]
sync 8da2b4b5-78c7-4230-b8c6-c1f030bdcb8f.csv duration:9s
03:05:07  3 of 3 OK loaded seed file sjh_dbt.raw_payments ................................ [INSERT 113 in 12.77s]
03:05:07
03:05:07  Finished running 3 seeds in 0 hours 0 minutes and 47.17 seconds (47.17s).
03:05:07
03:05:07  Completed successfully
03:05:07
03:05:07  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
❯ dbt run
03:29:28  Running with dbt=1.3.0
03:29:28  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 2 unused configuration paths:
- models.jaffle_shop.staging
- models.jaffle_shop

03:29:28  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 282 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
03:29:28
03:29:38  Concurrency: 1 threads (target='dev')
03:29:38
03:29:38  1 of 5 START sql view model sjh_dbt.stg_customers .............................. [RUN]
03:29:52  1 of 5 OK created sql view model sjh_dbt.stg_customers ......................... [OK in 13.76s]
03:29:52  2 of 5 START sql view model sjh_dbt.stg_orders ................................. [RUN]
03:30:08  2 of 5 OK created sql view model sjh_dbt.stg_orders ............................ [OK in 16.28s]
03:30:08  3 of 5 START sql view model sjh_dbt.stg_payments ............................... [RUN]
03:30:22  3 of 5 OK created sql view model sjh_dbt.stg_payments .......................... [OK in 13.62s]
03:30:22  4 of 5 START sql view model sjh_dbt.customers .................................. [RUN]
03:30:35  4 of 5 OK created sql view model sjh_dbt.customers ............................. [OK in 13.29s]
03:30:35  5 of 5 START sql view model sjh_dbt.orders ..................................... [RUN]
03:30:48  5 of 5 OK created sql view model sjh_dbt.orders ................................ [OK in 13.32s]
03:30:48
03:30:48  Finished running 5 view models in 0 hours 1 minutes and 19.96 seconds (79.96s).
03:30:48
03:30:48  Completed successfully
03:30:48
03:30:48  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5

Go to the Databend database to verify that the creation is successful. The result illustrates that five more tables or views, such as customers, have been added, and the data in the tables or views have been transformed.

image

Generating doc

dbt also supports the generation of visual documents, using the following command.

  1. Generate
 dbt docs generate
04:01:20  Running with dbt=1.3.0
04:01:20  Unable to do partial parsing because profile has changed
04:01:20  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 2 unused configuration paths:
- models.jaffle_shop
- models.jaffle_shop.staging

04:01:20  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 282 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
04:01:20
04:01:26  Concurrency: 1 threads (target='dev')
04:01:26
04:01:26  Done.
04:01:26  Building catalog
04:01:33  Catalog written to /Users/hanshanjie/git-works/jaffle_shop/target/catalog.json
  1. Start server

Run dbt docs serve. The document, which contains the overall structure of the jaffle_shop project and a description of all the tables and views, can be viewed in a browser.

image

Conclusion

The use of Databend Cloud in dbt includes the following steps:

  1. Install dbt and dbt-databend-cloud
  2. Configuration
  3. Write SQL and YML files
  4. Running