End-to-End dbt + Postgres Tutorial (DDL + Step-by-Step)

This guide walks you through setting up a small warehouse in Postgres using dbt, from database DDL to dbt project initialization, sources, staging, intermediate, and mart models, plus tests and docs. All steps are explicit and copy/paste-ready.

  • Target warehouse: Postgres
  • Example domain: simple e-commerce (customers, orders, payments)
  • Schemas used: raw (source tables) and analytics (dbt models)

1) Prerequisites

  • Postgres 12+ installed and running locally (or accessible remotely).
  • Python 3.9+ with pip.
  • dbt for Postgres: dbt-core and dbt-postgres.
  • Ability to run psql (or a GUI like pgAdmin; commands below use psql).

Install dbt packages in a virtual environment (recommended):

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install dbt-core dbt-postgres

Verify:

dbt --version

2) Create Database, Role, and Raw Tables (DDL)

Connect to your Postgres instance as a superuser (e.g., postgres) and create a dedicated role, database, and schemas:

-- 2.1 Create role and database
CREATE ROLE dbt_user WITH LOGIN PASSWORD 'dbt_password';
CREATE DATABASE dbt_tutorial OWNER dbt_user;

-- Connect to the new database
\c dbt_tutorial

-- 2.2 Create schemas
CREATE SCHEMA IF NOT EXISTS raw AUTHORIZATION dbt_user;
CREATE SCHEMA IF NOT EXISTS analytics AUTHORIZATION dbt_user;

-- 2.3 Grants (simple approach for local dev)
GRANT USAGE ON SCHEMA raw, analytics TO dbt_user;
GRANT CREATE ON SCHEMA analytics TO dbt_user; -- dbt builds models here

Create raw source tables and load small sample data:

-- 2.4 Raw tables
CREATE TABLE IF NOT EXISTS raw.customers (
    id            INTEGER PRIMARY KEY,
    first_name    TEXT,
    last_name     TEXT,
    email         TEXT,
    created_at    TIMESTAMP WITHOUT TIME ZONE
);

CREATE TABLE IF NOT EXISTS raw.orders (
    id            INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL REFERENCES raw.customers(id),
    order_date    TIMESTAMP WITHOUT TIME ZONE,
    status        TEXT, -- e.g., placed, shipped, completed, returned, cancelled
    total_amount  NUMERIC(10,2)
);

CREATE TABLE IF NOT EXISTS raw.payments (
    id            INTEGER PRIMARY KEY,
    order_id      INTEGER NOT NULL REFERENCES raw.orders(id),
    payment_method TEXT, -- e.g., credit_card, bank_transfer, coupon, gift_card
    amount        NUMERIC(10,2) NOT NULL,
    paid_at       TIMESTAMP WITHOUT TIME ZONE
);

-- 2.5 Sample data
INSERT INTO raw.customers (id, first_name, last_name, email, created_at) VALUES
    (1, 'Alice',   'Nguyen',  'alice@example.com',   '2023-01-05 10:00:00'),
    (2, 'Brian',   'Smith',   'brian@example.com',   '2023-02-10 09:13:00'),
    (3, 'Carla',   'Lopez',   'carla@example.com',   '2023-03-21 15:45:00'),
    (4, 'Diego',   'Martins', 'diego@example.com',   '2023-04-12 08:05:00'),
    (5, 'Emily',   'Zhao',    'emily@example.com',   '2023-05-30 20:30:00')
ON CONFLICT DO NOTHING;

INSERT INTO raw.orders (id, user_id, order_date, status, total_amount) VALUES
    (100, 1, '2023-06-01 10:15:00', 'placed',    100.00),
    (101, 1, '2023-06-02 13:00:00', 'completed',  60.00),
    (102, 2, '2023-06-03 16:20:00', 'returned',   35.00),
    (103, 3, '2023-06-04 18:40:00', 'placed',     75.00),
    (104, 5, '2023-06-05 08:05:00', 'cancelled', 150.00)
ON CONFLICT DO NOTHING;

INSERT INTO raw.payments (id, order_id, payment_method, amount, paid_at) VALUES
    (1000, 100, 'credit_card',  60.00, '2023-06-01 10:20:00'),
    (1001, 100, 'gift_card',    40.00, '2023-06-01 10:21:00'),
    (1002, 101, 'credit_card',  60.00, '2023-06-02 13:05:00'),
    (1003, 102, 'credit_card',  35.00, '2023-06-03 16:25:00'),
    (1004, 104, 'credit_card', 150.00, '2023-06-05 08:07:00')
ON CONFLICT DO NOTHING;
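
Because the raw tables above were created by the superuser rather than by dbt_user, grant dbt read access to them; without this, dbt will hit permission errors when selecting from raw:

-- 2.6 Let dbt_user read the raw tables (run as superuser, after the DDL above)
GRANT SELECT ON ALL TABLES IN SCHEMA raw TO dbt_user;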

Notes:

  • We intentionally include varied statuses and multiple payments per order to demonstrate downstream modeling.
  • In real projects, raw tables come from ingestion tools; here we create them directly for learning.
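
Optionally, confirm the loads before moving on:

-- Expect 5 customers, 5 orders, 5 payments
select 'customers' as table_name, count(*) as row_count from raw.customers
union all
select 'orders', count(*) from raw.orders
union all
select 'payments', count(*) from raw.payments;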

3) Initialize a dbt Project

From your chosen workspace directory, initialize a new dbt project (keep your virtualenv active):

mkdir -p ~/workspace && cd ~/workspace
dbt init dbt_pg_tutorial

During dbt init, choose postgres and provide connection details. If you prefer editing manually, configure your profile as below.

3.1 Configure profiles.yml

Edit ~/.dbt/profiles.yml and add a profile named dbt_pg_tutorial to match your project name:

# ~/.dbt/profiles.yml

dbt_pg_tutorial:
  target: dev
  outputs:
    dev:
      type: postgres
      host: 127.0.0.1
      user: dbt_user
      password: dbt_password
      port: 5432
      dbname: dbt_tutorial
      schema: analytics   # dbt creates/uses this schema for models
      threads: 4
      keepalives_idle: 0
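
Since profiles.yml is rendered with Jinja, you can avoid committing credentials by reading the password from an environment variable instead; DBT_PASSWORD below is an arbitrary variable name you export yourself:

# in the dev output above, replace the hardcoded password with:
password: "{{ env_var('DBT_PASSWORD') }}"

# then, in your shell:
export DBT_PASSWORD='dbt_password'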

Verify connectivity:

cd dbt_pg_tutorial
dbt debug

3.2 Project structure

Inside the project folder you’ll have dbt_project.yml and a models/ directory. We’ll organize models by layer:

models/
  ├── sources.yml          # source definitions for raw tables
  ├── staging/             # cleaned replicas of sources
  │   ├── stg_customers.sql
  │   ├── stg_orders.sql
  │   └── stg_payments.sql
  ├── intermediate/        # business logic joins/aggregations
  │   └── int_order_payments.sql
  └── marts/               # final dimensional models
      ├── dim_customers.sql
      └── fct_orders.sql

Ensure dbt_project.yml recognizes these folders (dbt’s default is fine). Example minimal content:

# dbt_project.yml (inside dbt_pg_tutorial)
name: 'dbt_pg_tutorial'
version: '1.0.0'
config-version: 2

profile: 'dbt_pg_tutorial'

models:
  dbt_pg_tutorial:
    +materialized: view    # default; override per-model when needed
    staging:
      +schema: staging     # optional: place staging views in analytics_staging
    intermediate:
      +schema: intermediate
    marts:
      +materialized: table
      +schema: marts

4) Declare Sources (models/sources.yml)

Create models/sources.yml to point dbt at the raw tables:

version: 2

sources:
  - name: raw
    database: dbt_tutorial
    schema: raw
    tables:
      - name: customers
      - name: orders
      - name: payments
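
Optionally, a source table can declare freshness checks. A sketch that treats order_date as the load timestamp (with this static sample data, the check will simply warn once the data is older than the threshold):

      - name: orders
        loaded_at_field: order_date
        freshness:
          warn_after: {count: 7, period: day}

Run dbt source freshness to evaluate it.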

5) Staging Models (models/staging/*.sql)

Staging models standardize column names and types and apply light cleaning. Note that dbt model files must not end with a semicolon: dbt wraps each model's SQL in its own DDL, and a trailing semicolon breaks compilation.

Create models/staging/stg_customers.sql:

{{ config(materialized='view') }}

with source as (
    select * from {{ source('raw', 'customers') }}
),
renamed as (
    select
        id                as customer_id,
        lower(trim(first_name)) as first_name,
        lower(trim(last_name))  as last_name,
        lower(trim(email))      as email,
        created_at
    from source
)
select * from renamed

Create models/staging/stg_orders.sql:

{{ config(materialized='view') }}

with source as (
    select * from {{ source('raw', 'orders') }}
),
casted as (
    select
        id           as order_id,
        user_id      as customer_id,
        order_date::timestamp as order_timestamp,
        lower(status) as status,
        total_amount::numeric(10,2) as total_amount
    from source
)
select * from casted

Create models/staging/stg_payments.sql:

{{ config(materialized='view') }}

with source as (
    select * from {{ source('raw', 'payments') }}
),
casted as (
    select
        id           as payment_id,
        order_id,
        lower(payment_method) as payment_method,
        amount::numeric(10,2) as amount,
        paid_at::timestamp    as paid_at
    from source
)
select * from casted

6) Intermediate Model (models/intermediate/int_order_payments.sql)

Aggregate payments by order and compute totals; this model feeds the fact table. With the sample data, order 103 has no payments, so its total_paid is 0 and its outstanding_amount equals the full 75.00, while order 100's two payments sum to 100.00, leaving nothing outstanding.

{{ config(materialized='view') }}

with payments as (
    select
        order_id,
        sum(amount) as total_paid,
        count(*)    as payment_count
    from {{ ref('stg_payments') }}
    group by 1
)

select
    o.order_id,
    o.customer_id,
    o.order_timestamp,
    o.status,
    o.total_amount,
    coalesce(p.total_paid, 0)   as total_paid,
    coalesce(p.payment_count, 0) as payment_count,
    (o.total_amount - coalesce(p.total_paid, 0)) as outstanding_amount
from {{ ref('stg_orders') }} o
left join payments p on o.order_id = p.order_id

7) Mart Models (models/marts/*.sql)

Create a dimension table for customers and a fact table for orders.

Create models/marts/dim_customers.sql:

{{ config(materialized='table') }}

select
    c.customer_id,
    initcap(c.first_name) as first_name,
    initcap(c.last_name)  as last_name,
    c.email,
    c.created_at
from {{ ref('stg_customers') }} c

Create models/marts/fct_orders.sql:

{{ config(materialized='table') }}

select
    i.order_id,
    i.customer_id,
    i.order_timestamp,
    i.status,
    i.total_amount,
    i.total_paid,
    i.payment_count,
    i.outstanding_amount,
    case
        when i.outstanding_amount = 0 and i.total_amount > 0 then true
        else false
    end as is_fully_paid
from {{ ref('int_order_payments') }} i

8) Add Tests and Documentation (models/schema.yml)

Add basic tests for uniqueness and not-null constraints, plus descriptions for docs:

version: 2

models:
  - name: stg_customers
    description: "Cleaned customers from raw.customers"
    columns:
      - name: customer_id
        tests: [not_null, unique]
      - name: email
        tests: [not_null]

  - name: stg_orders
    description: "Cleaned orders from raw.orders"
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: customer_id
        tests: [not_null]

  - name: stg_payments
    description: "Cleaned payments from raw.payments"
    columns:
      - name: payment_id
        tests: [not_null, unique]
      - name: order_id
        tests: [not_null]

  - name: int_order_payments
    description: "Aggregated payments per order"
    columns:
      - name: order_id
        tests: [not_null, unique]

  - name: dim_customers
    description: "Customer dimension"
    columns:
      - name: customer_id
        tests: [not_null, unique]

  - name: fct_orders
    description: "Order fact table with payments info"
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: customer_id
        tests: [not_null]
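
Beyond these generic tests, any .sql file under tests/ acts as a singular test: it fails if the query returns rows. A minimal sketch (the filename is illustrative) asserting that no order has collected more than its total:

-- tests/assert_total_paid_not_exceeding_total.sql
select *
from {{ ref('int_order_payments') }}
where total_paid > total_amount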

9) Run dbt

From the project root (dbt_pg_tutorial):

# Validate profile/connection
dbt debug

# Build models by layer
dbt run --select staging
dbt run --select intermediate
dbt run --select marts

# Or build everything (models, tests, seeds, and snapshots) in DAG order
dbt build

Run tests and generate docs:

dbt test
dbt docs generate
dbt docs serve   # open the web UI with model DAG + docs

10) Validate Results with Sample Queries

In psql, inspect the built models. With the +schema settings above and dbt's default schema naming, staging views land in analytics_staging, intermediate views in analytics_intermediate, and mart tables in analytics_marts:

-- Top customers by total fully-paid order amount
select d.customer_id, d.first_name, d.last_name, sum(f.total_amount) as total
from analytics_marts.dim_customers d
join analytics_marts.fct_orders f on d.customer_id = f.customer_id
where f.is_fully_paid
group by 1, 2, 3
order by total desc;

-- Orders with outstanding amounts
select * from analytics_marts.fct_orders where outstanding_amount > 0;

Note: dbt's default generate_schema_name macro concatenates the target schema from profiles.yml with any folder-level +schema setting, which is why +schema: marts yields analytics_marts rather than a separate marts schema. Adjust the +schema settings, or override the macro, to match your team's conventions.
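
If you would rather have models land in exactly the schema you configure (marts instead of analytics_marts), you can override the built-in macro by creating macros/generate_schema_name.sql; this is the standard pattern from the dbt documentation:

-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}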

11) Optional: Use Seeds Instead of Manual DDL

If you prefer CSV seeds managed by dbt instead of manual DDL, you can:

  • Place CSV files under seeds/ (e.g., seeds/customers.csv, seeds/orders.csv, seeds/payments.csv).
  • Configure a seeds: block in dbt_project.yml if needed.
  • Run dbt seed to load data into your target schema (commonly not raw, but some teams use a raw schema and set seeds: +schema: raw).

Example dbt_project.yml snippet:

seeds:
  dbt_pg_tutorial:
    +schema: raw

Then run:

dbt seed

And update sources.yml to reference the seeded tables if you load them into raw.
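
A seed file is just a CSV whose header row supplies the column names. A minimal sketch of seeds/customers.csv (dbt infers column types; set +column_types in dbt_project.yml to override them):

id,first_name,last_name,email,created_at
1,Alice,Nguyen,alice@example.com,2023-01-05 10:00:00
2,Brian,Smith,brian@example.com,2023-02-10 09:13:00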

12) Troubleshooting Tips

  • FATAL: password authentication failed → Check profiles.yml credentials and confirm the role can access the database (see the psql check below).
  • permission denied for schema analytics → Ensure dbt_user has USAGE and CREATE on the target schema.
  • permission denied for table customers → Grant dbt_user SELECT on the raw tables (step 2.6 above).
  • database "dbt_tutorial" does not exist → Make sure you created it and are connecting to the correct host and port.
  • Model not found errors → Verify file names and that you’re using ref('model_name') and source('source', 'table') correctly.
  • Docs not opening → If dbt docs serve is blocked (e.g., by a firewall), run dbt docs generate and open target/index.html in a browser manually.
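
If dbt debug fails and you are unsure whether the problem lies with dbt or with Postgres, test the same credentials directly with psql:

psql "postgresql://dbt_user:dbt_password@127.0.0.1:5432/dbt_tutorial" -c "select current_user, current_database();"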

You now have an end-to-end dbt + Postgres setup: raw tables, dbt sources, staging, intermediate, and mart layers, with tests and docs. Customize models, add more sources, and expand marts as your analytics needs grow.
