dataengineerankur/sqlserver-to-snowflake-migration
SQL Server to Snowflake Migration Lab

A complete, runnable migration from SQL Server to Snowflake — covering four source databases (e-commerce, ERP, CRM, inventory), two ingestion paths (AWS DMS and Airbyte), a four-stage Snowflake medallion architecture, Airflow orchestration, dbt transformations, and a live migration dashboard.

This is a training lab, not a production template. The goal is to show you every layer of a real migration so you can understand the decisions, not just copy the configs.


What you're building

SQL Server (Docker or RDS)
        │
        ├── via AWS DMS ────────► S3 (Parquet) ──► Snowpipe ──► RAW_MSSQL.RAW_DMS_VARIANT
        │                                                                │
        └── via Airbyte ─────────────────────────────────────► MSSQL_MIGRATION_LAB.AIRBYTE_RAW
                                                                         │
                                                          BRONZE (typed, MERGE dedup)
                                                                         │
                                                          SILVER (SCD Type-2 via dbt snapshot)
                                                                         │
                                                          GOLD (facts + dims via dbt run)
                                                                         │
                                                          dbt test (bronze + silver + gold)

DMS takes the AWS-native route: it writes Parquet files to S3, Snowpipe picks them up automatically, and Airflow runs a MERGE to promote typed rows into Bronze. Airbyte skips S3 entirely and writes directly to Snowflake using a connector — less infrastructure but no Parquet archive. You can run both side by side and compare counts.


Repository layout

sqlserver-to-snowflake-migration/
├── sqlserver/              Local SQL Server (Docker + DDL + seed data)
│   ├── docker-compose.yml
│   ├── sql/                01_create_database → 22_lab_inventory (all 4 source DBs)
│   └── scripts/            run_lab.sh, run_validation.sh, inspect_counts.sql
│
├── infra/
│   ├── aws-cdk/            CDK stacks: DataLanding (S3+SNS), DmsCdc, TerraformState
│   └── terraform/
│       ├── snowflake/      Warehouse, DB, schemas, Snowpipes, DDL tables, SPs, UDFs, streams, tasks
│       └── aws/            DMS replication instance, endpoint, task
│
├── dms/                    DMS task settings JSON + table mappings
├── airbyte/                Airbyte OSS docker-compose + setup + comparison script
│
├── airflow/
│   ├── docker-compose.yml
│   └── dags/               4 DAGs (bronze → silver → gold → quality)
│
├── dbt/
│   ├── dbt_project.yml
│   ├── profiles.yml.example
│   └── models/             customers / orders / products  (stg → int → gold)
│
├── snowflake_ddl/
│   ├── bronze/             Typed tables for all 4 source DBs
│   ├── silver/             SCD Type-2 surrogate-key tables
│   ├── gold/               Facts + dims + MIGRATION_METADATA audit
│   ├── iceberg/            Iceberg external tables for long-term audit
│   ├── procedures/         Snowflake SPs replacing SQL Server stored procs
│   ├── streams_tasks/      Snowflake Streams + Tasks replacing DML triggers
│   └── udfs/               Scalar UDFs + Snowpark Python for XML/TVP
│
└── .env.example            All required credentials (copy → .env, never commit .env)

Prerequisites

You need these accounts and tools before running anything:

  • Docker Desktop (for local SQL Server and Airflow)
  • AWS account with permissions to create S3, DMS, IAM, VPC resources
  • Snowflake account (trial works — you'll create everything via Terraform)
  • Python 3.10+ with pip
  • AWS CDK v2: npm install -g aws-cdk
  • Terraform 1.5+
  • dbt Core 1.7+: pip install dbt-snowflake

Copy .env.example to .env at the repo root and fill in your values. Every script in this repo reads from that file. Nothing is hardcoded.
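If you also want those variables in your interactive shell (for the terraform and aws commands later in this guide), one common pattern is to export the whole file into the session — a sketch, assuming plain KEY=value lines with no spaces around the equals sign:

```shell
# Export every KEY=value pair from .env into the current shell (if the file exists).
# Assumes plain KEY=value lines; quote values that contain spaces.
if [ -f .env ]; then
  set -a        # auto-export every variable defined while this is on
  . ./.env
  set +a
fi
```

After sourcing, child processes such as terraform and the setup scripts see the same values the repo's scripts read from the file.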


Step 1 — Stand up SQL Server locally

The lab uses four SQL Server databases to simulate a realistic mixed-workload migration: a stress/e-commerce DB (orders, customers, products), an ERP DB (employees, payroll, departments), a CRM DB (accounts, contacts, opportunities), and an inventory DB (warehouses, SKUs, stock movements).

cd sqlserver
cp .env.example .env          # set SA_PASSWORD to something complex (SQL Server requires it)
docker compose up -d

Wait about 30 seconds for SQL Server to initialize, then seed all four databases:

cd scripts
./run_lab.sh

This runs all DDL scripts in order (01_create_database.sql through 22_lab_inventory.sql), creates stored procedures, triggers, views, and inserts seed rows. After it finishes:

./run_validation.sh           # prints row counts per table

You should see roughly 100-500 rows per table depending on the seed script. If any count is 0, check docker logs sqlserver_mssql_1 — the most common issue is the SA password not meeting SQL Server's complexity rules.

CDC note: Some scripts enable Change Data Capture on the stress DB. This requires SQL Server Agent to be running. The Docker image in this lab includes Agent. If you're using your own SQL Server image and Agent isn't running, CDC setup will fail silently — check with SELECT * FROM sys.dm_cdc_log_scan_sessions.


Step 2 — Deploy AWS infrastructure

DMS needs to reach SQL Server. It cannot connect to localhost on your machine — the replication instance runs inside an AWS VPC and needs a resolvable endpoint. You have two options:

Option A (recommended for this lab): Use SQL Server on RDS. Deploy a small db.t3.medium RDS MSSQL instance, restore the seed data there, and point DMS at the RDS endpoint. RDS lives in the same VPC as your DMS replication instance so latency is low and you avoid VPN/firewall issues entirely.

Option B: Run SQL Server on an EC2 instance and expose it via private IP within the VPC. More complex, not covered here.

Deploy with CDK

cd infra/aws-cdk
pip install -r requirements.txt
cp ../../.env .env

cdk bootstrap aws://<your-account-id>/<region>   # once per account/region
cdk deploy DataLandingStack     # creates S3 DMS bucket + SNS notification
cdk deploy TerraformStateStack  # creates S3 bucket for Terraform state

The DataLandingStack output includes the S3 bucket name. Add it to your .env as DMS_BUCKET.

Deploy RDS SQL Server (if using Option A)

cd infra/terraform/aws
terraform init -backend-config="bucket=${TF_STATE_BUCKET}" \
               -backend-config="key=mssql-migration/aws/terraform.tfstate" \
               -backend-config="region=${AWS_REGION}"
terraform apply

This creates a DMS replication instance, source endpoint (pointing at RDS), target endpoint (S3), and a full-load task for all 15 tables across the four databases. After apply:

terraform output dms_task_arn   # save this, you'll start the task next

After RDS is up, load the same seed data you ran locally. The simplest way is to run run_lab.sh against the RDS endpoint:

# install sqlcmd or use the MSSQL Docker container as a client
MSSQL_HOST=<rds-endpoint> MSSQL_USER=admin MSSQL_PASSWORD=<your-rds-pass> ./run_lab.sh

Already have an RDS SQL Server? Just deploy DMS

If you created an RDS SQL Server instance yourself — through the console, CLI, or your own Terraform — you don't need this repo to provision RDS again. You can skip the RDS creation and use Terraform only to set up the DMS replication instance, endpoints, and task.

Here's how to do it.

1. Tell Terraform not to create a new RDS instance

Open (or create) infra/terraform/aws/terraform.tfvars and set use_rds = false. Then provide the connection details for your existing instance:

use_rds        = false
mssql_host     = "your-existing-instance.xxxxxx.us-east-1.rds.amazonaws.com"
mssql_port     = 1433
mssql_db       = "SnowConvertStressDB"   # or whatever your source DB is named
mssql_user     = "sa"
mssql_password = "your-password"

When use_rds = false, Terraform skips creating the RDS subnet group, parameter group, security group, and DB instance entirely. It only creates the DMS replication instance, the source endpoint pointing at your host, the S3 target endpoint, and the replication task.

2. Make sure DMS can actually reach your RDS instance

This is the step people most often forget. DMS runs inside your VPC, so your existing RDS security group needs to allow inbound TCP on port 1433 from the DMS replication instance's security group. The easiest way to do this is to add an inbound rule on your RDS security group that allows traffic from the DMS security group. Terraform outputs the DMS security group ID after apply — you can add the rule manually in the console or reference it in your own Terraform config.
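Adding the rule from the CLI can be sketched as follows — both security group IDs are placeholders you must substitute, and the exact Terraform output name is an assumption:

```shell
# Allow the DMS replication instance to reach RDS SQL Server on port 1433.
#   RDS_SG: the security group attached to your existing RDS instance
#   DMS_SG: the DMS security group ID printed by terraform output
aws ec2 authorize-security-group-ingress \
    --group-id "$RDS_SG" \
    --protocol tcp \
    --port 1433 \
    --source-group "$DMS_SG"
```

Referencing the DMS group as the source (rather than a CIDR) keeps the rule valid even if the replication instance's private IP changes.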

If your RDS instance is in a different VPC than where DMS will run, you'll need VPC peering or a Transit Gateway — that's beyond the scope of this lab. The simplest setup is to let Terraform use your default VPC (leave dms_vpc_id = null and dms_subnet_ids = []) and make sure your RDS instance is also in that default VPC.

3. Run Terraform

cd infra/terraform/aws
terraform init -backend-config="bucket=${TF_STATE_BUCKET}" \
               -backend-config="key=mssql-migration/aws/terraform.tfstate" \
               -backend-config="region=${AWS_REGION}"
terraform plan   # review — you should see DMS resources only, no aws_db_instance
terraform apply

After apply you'll get outputs like dms_replication_instance_arn and dms_replication_task_arn. Save the task ARN — you'll use it to start the migration.

4. Enable CDC on your existing RDS SQL Server

Before starting DMS in CDC mode, enable CDC at the database and table level. On RDS SQL Server you do this through stored procedures rather than directly — RDS provides the wrapper msdb.dbo.rds_cdc_enable_db instead of the native sys.sp_cdc_enable_db:

-- Connect to your RDS instance as the master user
USE SnowConvertStressDB
GO

-- Enable CDC on the database (RDS-specific wrapper)
EXEC msdb.dbo.rds_cdc_enable_db 'SnowConvertStressDB'
GO

-- Enable CDC on each table you want to replicate
EXEC sys.sp_cdc_enable_table
    @source_schema = 'dbo',
    @source_name   = 'Orders',
    @role_name     = NULL
GO
-- Repeat for each table (Customers, Products, OrderItems, etc.)

If you're only doing a one-time full load (no ongoing CDC), you can skip this step — DMS full-load mode doesn't require CDC to be enabled.

5. Start the DMS task

aws dms start-replication-task \
    --replication-task-arn $(terraform output -raw dms_replication_task_arn) \
    --start-replication-task-type start-replication

Watch progress in the AWS console under Database Migration Service → Replication tasks, or poll via CLI:

aws dms describe-replication-tasks \
    --filters Name=replication-task-arn,Values=$(terraform output -raw dms_replication_task_arn) \
    --query 'ReplicationTasks[0].{Status:Status,Progress:ReplicationTaskStats.FullLoadProgressPercent}'

Once the task shows Load complete, your Parquet files are in S3 and Snowpipe will pick them up automatically. Continue with Step 3 (Snowflake setup) from here.


Step 3 — Set up Snowflake

cd infra/terraform/snowflake

Export these environment variables first — the Terraform Snowflake provider reads them automatically:

export SNOWFLAKE_ORGANIZATION_NAME=<your-org>
export SNOWFLAKE_ACCOUNT_NAME=<your-account>
export SNOWFLAKE_USER=<your-user>
export SNOWFLAKE_PASSWORD=<your-password>

Then initialize and apply:

terraform init -backend-config="bucket=${TF_STATE_BUCKET}" \
               -backend-config="key=mssql-migration/snowflake/terraform.tfstate" \
               -backend-config="region=${AWS_REGION}"
terraform apply

Terraform creates:

  • Database MSSQL_MIGRATION_LAB with schemas RAW_MSSQL, BRONZE, SILVER, GOLD, AIRBYTE_RAW
  • Warehouse WH_MSSQL_MIGRATION (X-SMALL, auto-suspend 60s)
  • External stage pointing at your S3 DMS bucket with a storage integration
  • Snowpipe PIPE_RAW_DMS_VARIANT — auto-ingests Parquet files from S3 into RAW_MSSQL.RAW_DMS_VARIANT
  • All Bronze, Silver, and Gold DDL tables
  • All stored procedures, UDFs, streams, and tasks from snowflake_ddl/

After apply, the Snowpipe is live. When DMS writes a Parquet file to S3, the SNS notification triggers Snowpipe and the row lands in RAW_DMS_VARIANT within seconds.
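The pipe that Terraform creates looks roughly like this — a sketch only, since the real definition lives in Terraform; the stage name STG_DMS_LANDING and the target column RAW_RECORD are assumptions:

```sql
-- Illustrative auto-ingest Snowpipe: every new Parquet file on the stage is
-- copied into the VARIANT landing table as soon as the SNS notification fires.
CREATE OR REPLACE PIPE RAW_MSSQL.PIPE_RAW_DMS_VARIANT
  AUTO_INGEST = TRUE
AS
COPY INTO RAW_MSSQL.RAW_DMS_VARIANT (RAW_RECORD)
FROM (SELECT $1 FROM @RAW_MSSQL.STG_DMS_LANDING)
FILE_FORMAT = (TYPE = PARQUET);
```

With AUTO_INGEST = TRUE, no scheduler is involved — S3 event notifications drive the loads, which is why rows land "within seconds" of DMS writing a file.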


Step 4 — Run DMS full-load

Start the DMS task from the console or CLI:

aws dms start-replication-task \
    --replication-task-arn $(terraform -chdir=infra/terraform/aws output -raw dms_task_arn) \
    --start-replication-task-type start-replication

Monitor progress in the AWS console under Database Migration Service → Replication tasks. A full load of ~5,000 seed rows typically completes in under 5 minutes.

When it finishes, check Snowflake:

SELECT COUNT(*) FROM MSSQL_MIGRATION_LAB.RAW_MSSQL.RAW_DMS_VARIANT;

You should see rows. If the count is 0, check:

  1. S3 bucket — do Parquet files exist under the DMS prefix?
  2. Snowpipe history: SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(TABLE_NAME=>'RAW_DMS_VARIANT', START_TIME=>DATEADD('hour',-1,CURRENT_TIMESTAMP())))
  3. Storage integration IAM trust policy — the most common setup mistake is the Snowflake AWS account not having permission to read your S3 bucket
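For the third check, Snowflake tells you exactly what the IAM trust policy must contain. The integration name below is an assumption — use whatever name your Terraform created:

```sql
-- Inspect the storage integration to get the values the IAM trust policy needs.
DESC INTEGRATION S3_DMS_INTEGRATION;
-- Copy STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID from the output
-- into the trust relationship of the IAM role that grants Snowflake S3 read access.
```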

Step 5 — Airbyte alternative (optional)

Airbyte is a good comparison to DMS. It connects directly from Docker to SQL Server, runs discovery, and writes to MSSQL_MIGRATION_LAB.AIRBYTE_RAW without any S3 involvement. The tradeoff: you lose the Parquet archive on S3, but you gain a much simpler setup and no AWS cost.

cd airbyte
docker compose up -d

Wait for Airbyte to initialize (about 2 minutes). Then run the setup script:

pip install requests python-dotenv
python setup_connections.py

This creates a SQL Server source connector and a Snowflake destination connector, links them with a connection, and triggers the first full sync. The script polls until the sync completes and prints a summary.

Important: The Airbyte Docker containers reach SQL Server at host.docker.internal, not localhost. This is already set in the config. If you're on Linux, host.docker.internal requires --add-host=host.docker.internal:host-gateway in your Docker run command — the docker-compose.yml handles this automatically.

After the sync, compare row counts between DMS Bronze and Airbyte RAW:

python compare_dms_vs_airbyte.py

Airbyte row counts should be greater than or equal to Bronze counts. Bronze uses MERGE (dedup by primary key, latest wins). Airbyte in append-dedup mode keeps one row per PK per sync cycle, so it can accumulate more rows if you ran multiple syncs.

Airbyte UI: http://localhost:8000 (user: airbyte, password: password)


Step 6 — Run the Airflow pipeline

Airflow orchestrates four DAGs that move data from RAW through Bronze, Silver, and Gold, then run quality checks.

cd airflow
docker compose up -d

Open http://localhost:8080. Log in with admin / admin. You'll see four DAGs:

DAG 1: mssql_01_ingest_bronze

Reads from RAW_DMS_VARIANT and runs a MERGE into each of the 15 Bronze tables. The MERGE deduplicates by primary key using QUALIFY ROW_NUMBER() OVER (PARTITION BY <pk> ORDER BY <load_ts> DESC) = 1 — this is necessary because SELECT DISTINCT doesn't deduplicate when computed columns like CURRENT_TIMESTAMP() produce different values for otherwise identical rows.
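The pattern can be sketched like this — table, column, and VARIANT-path names here are illustrative, not copied from the repo:

```sql
-- Sketch of the Bronze MERGE pattern. QUALIFY keeps exactly one row per
-- primary key from the raw landing table; the latest extract wins.
MERGE INTO BRONZE.ORDERS AS tgt
USING (
    SELECT
        raw:ORDER_ID::NUMBER       AS order_id,
        raw:STATUS::STRING         AS status
    FROM RAW_MSSQL.RAW_DMS_VARIANT
    QUALIFY ROW_NUMBER()
        OVER (PARTITION BY raw:ORDER_ID ORDER BY raw:_extract_ts DESC) = 1
) AS src
ON tgt.ORDER_ID = src.order_id
WHEN MATCHED THEN UPDATE SET tgt.STATUS = src.status
WHEN NOT MATCHED THEN INSERT (ORDER_ID, STATUS) VALUES (src.order_id, src.status);
```

Because the dedup happens inside the USING subquery, the MERGE itself can never see two source rows with the same key, which is what triggers the "Duplicate row detected" error.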

Before MERGE, this DAG also resumes the Snowflake Streams + Tasks (CDC equivalents for triggers). After MERGE, it suspends them. This prevents tasks from running on partial data.

Tasks: check_raw_counts → resume_sf_tasks → merge_bronze_stress + merge_bronze_erp + merge_bronze_crm + merge_bronze_inv (parallel) → suspend_sf_tasks → report_bronze_counts

DAG 2: mssql_02_silver_snapshots

Waits for DAG 1 to complete, then runs dbt snapshot for all 15 tables. Each snapshot builds an SCD Type-2 history in Silver — adding dbt_scd_id, dbt_valid_from, dbt_valid_to, and dbt_updated_at columns. The current record for any row is where dbt_valid_to IS NULL.
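A dbt snapshot definition follows a fixed shape. The sketch below is illustrative — the file name, source reference, and config values are assumptions, not the repo's actual snapshot:

```sql
-- snapshots/orders_snapshot.sql (illustrative)
{% snapshot orders_snapshot %}
{{
    config(
      target_schema='SILVER',
      unique_key='ORDER_ID',
      strategy='timestamp',       -- new version recorded when updated_at changes
      updated_at='EXTRACT_TS'
    )
}}
SELECT * FROM {{ source('bronze', 'ORDERS') }}
{% endsnapshot %}
```

Each dbt snapshot run compares the Bronze rows against the stored history and closes out changed rows by setting dbt_valid_to, which is why "current" means dbt_valid_to IS NULL.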

DAG 3: mssql_03_gold_transforms

Waits for DAG 2, then runs dbt run for the staging, intermediate, and gold models. Gold produces three fact/dimension tables: CUSTOMERS_DIM, ORDERS_FACT, PRODUCTS_DIM. These are what you'd expose to BI tools.

DAG 4: mssql_04_data_quality

Waits for DAG 3, then runs dbt test across all three layers. Expected result: 27 PASS, 3 WARN. The warnings are for nullable columns where data quality is expected to improve once CDC is fully running.

Running the pipeline

Trigger DAG 1 first. After it completes successfully, trigger DAGs 2, 3, and 4 — they will wait on the sensor and then proceed automatically. If you're running them manually, trigger each one with the same --exec-date as DAG 1:

# get DAG 1's execution date from the Airflow UI, then:
airflow dags trigger mssql_02_silver_snapshots --exec-date "2026-04-19T15:27:15+00:00"
airflow dags trigger mssql_03_gold_transforms  --exec-date "2026-04-19T15:27:15+00:00"
airflow dags trigger mssql_04_data_quality     --exec-date "2026-04-19T15:27:15+00:00"

The ExternalTaskSensor matches by exact execution_date, not by "most recent successful run." If the downstream DAGs are triggered at a different time, the sensor will poll forever. The --exec-date flag is how you tell Airflow these runs belong to the same logical batch.
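In DAG code, that dependency is expressed with an ExternalTaskSensor roughly like this — task ids and intervals here are assumptions, not copied from the repo's DAGs:

```python
# Sketch of the cross-DAG wait inside mssql_02_silver_snapshots.
from airflow.sensors.external_task import ExternalTaskSensor

wait_for_bronze = ExternalTaskSensor(
    task_id="wait_for_bronze",
    external_dag_id="mssql_01_ingest_bronze",
    external_task_id=None,   # wait for the whole upstream DAG run, not one task
    poke_interval=60,
    # With no execution_delta or execution_date_fn set, the sensor looks for an
    # upstream run whose execution_date EXACTLY equals this run's execution_date.
)
```

The two commented parameters (execution_delta / execution_date_fn) are the escape hatches if your DAGs intentionally run on offset schedules.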


Step 7 — Live migration dashboard

cd airflow/dashboard
docker compose up -d

Open http://localhost:8050. The dashboard shows:

  • DAG status cards (running / success / failed) polled from the Airflow REST API
  • Row counts per table across RAW, Bronze, Silver, and Gold
  • Coverage percentages (how many of the 15 tables have at least 1 row in each layer)
  • A CDC events table showing the last 50 rows landed in RAW (useful for watching live DMS streaming)

The dashboard refreshes every 30 seconds in a background thread. You can force a refresh by clicking the "Refresh Now" button or POSTing to http://localhost:8050/api/refresh.


SQL Server object migration map

SQL Server has objects that don't exist in Snowflake. Here's what each became.

Stored procedures

| SQL Server | Snowflake | Notes |
|---|---|---|
| usp_RefreshOrderTotals(@OrderId) | SP_REFRESH_ORDER_TOTALS(order_id) | Direct port |
| usp_ListOpenOrders | SP_LIST_OPEN_ORDERS() | Returns RESULTSET |
| usp_Stress_DynamicSearchOrders | SP_DYNAMIC_SEARCH_ORDERS(...) | sp_executesql → EXECUTE IMMEDIATE |
| usp_Stress_CursorRepriceProductsByCategory | SP_REPRICE_PRODUCTS_BY_CATEGORY(...) | Cursor removed; set-based UPDATE |
| usp_Stress_MergeUpsertCustomers(@json) | SP_MERGE_UPSERT_CUSTOMERS(variant) | OPENJSON → FLATTEN |
| usp_Stress_XmlOrderDocument(@OrderId) | SP_XML_ORDER_DOCUMENT(order_id) | Snowpark Python — no native XML type in Snowflake |
| usp_Stress_JsonOrderLines(@OrderId) | SP_JSON_ORDER_LINES(order_id) | FOR JSON PATH → ARRAY_AGG + OBJECT_CONSTRUCT |
| usp_Stress_ChainedA/B/C | SP_CHAINED_A/B/C | Nested CALL statements |
| usp_Stress_RecursiveCategoryClosure | SP_BUILD_CATEGORY_CLOSURE() | WITH RECURSIVE CTE — Snowflake supports this natively |
| usp_Stress_SavePointPartialRollback | Not ported | Snowflake has no SAVE TRANSACTION |
| usp_Stress_ThrowCatchAndRethrow | SP_THROW_CATCH_RETHROW() | BEGIN TRY/CATCH → EXCEPTION WHEN OTHER THEN |
| usp_Stress_OutputMergePriceHistory | SP_UPDATE_PRICE_WITH_HISTORY(...) | OUTPUT clause → read old value before update |
| usp_Stress_WhileBatchNumbers(@n) | SP_WHILE_BATCH_NUMBERS(n) | Temp table → variables; returns OBJECT |
| usp_Stress_MultiResultSets | SP_MULTI_RESULT_DEMO(customer_id) | Multiple result sets → single VARIANT |
| usp_Stress_WaitForShort | SP_WAITFOR_SHORT() | WAITFOR DELAY → SYSTEM$WAIT |
| usp_Stress_TempTableDynamicPivot | SP_DYNAMIC_PIVOT() | PIVOT + EXECUTE IMMEDIATE |
| usp_Stress_ScopedTempTableCaller/Callee | Not ported | Snowflake temp tables don't cross SP call boundaries |
| usp_Stress_TvpAppendOrderLines(@tvp) | SP_TVP_APPEND_ORDER_LINES(order_id, lines VARIANT) | TVP → VARIANT array + FLATTEN |
| usp_Stress_OpenJsonApplyPatch | SP_PATCH_PRODUCTS_FROM_JSON(variant) | OPENJSON → FLATTEN |
| usp_Erp_DynamicDeptReport | SP_ERP_DYNAMIC_DEPT_REPORT(...) | Dynamic ORDER BY with whitelist |
| usp_Erp_ClosePayrollRun(@RunId) | SP_ERP_CLOSE_PAYROLL_RUN(run_id) | Direct port |
| usp_Crm_MergeAccountsFromJson | SP_CRM_MERGE_ACCOUNTS_FROM_JSON(variant) | OPENJSON → FLATTEN |
| usp_Crm_ListPipeline(@Region) | SP_CRM_LIST_PIPELINE(region) | Direct port |
| usp_Inv_StockXml(@WhId) | SP_INV_STOCK_XML(wh_id) | Snowpark Python |
| usp_Inv_DynamicWhFilter | SP_INV_DYNAMIC_WH_FILTER(wh_code) | EXECUTE IMMEDIATE with quote-escaped param |

Triggers

Snowflake has no triggers. The lab handles this two ways depending on what the trigger was doing.

Audit and side-effect triggers become Streams + Tasks (async, fire within about 1 minute):

| SQL Server | Snowflake stream | Snowflake task | File |
|---|---|---|---|
| tr_Orders_Audit_IU | STREAM_ORDERS_CHANGES | TASK_ORDERS_AUDIT | 01_stressdb_streams.sql |
| tr_OrderItems_RecalcAndQueue | STREAM_ORDER_ITEMS_CHANGES | TASK_ORDER_ITEMS_RECALC | 01_stressdb_streams.sql |
| tr_Products_ListPriceAudit | STREAM_PRODUCTS_CHANGES | TASK_PRODUCTS_PRICE_AUDIT | 01_stressdb_streams.sql |
| tr_Employees_Audit | STREAM_ERP_EMPLOYEES_CHANGES | TASK_ERP_EMPLOYEES_AUDIT | 02_erp_streams.sql |
| tr_PayrollLines_Recalc | STREAM_ERP_PAYROLL_LINES_CHANGES | TASK_ERP_PAYROLL_RECALC | 02_erp_streams.sql |
| tr_Accounts_Activity | STREAM_CRM_ACCOUNTS_CHANGES | TASK_CRM_ACCOUNTS_ACTIVITY | 03_crm_streams.sql |
| tr_Sku_NoNegativeCost | STREAM_INV_SKU_CHANGES | TASK_INV_SKU_COST_GUARD | 04_inventory_streams.sql |

One thing to know: Snowflake Tasks are suspended by default. Airflow DAG 1 resumes them before the Bronze MERGE and suspends them after. This is intentional — you don't want tasks firing while a MERGE is in progress.
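The resume/suspend bracket around the MERGE is plain SQL. A sketch, using one task name from the table above (the BRONZE schema qualifier is an assumption):

```sql
-- What DAG 1 does around the Bronze MERGE:
ALTER TASK BRONZE.TASK_ORDERS_AUDIT RESUME;    -- before the MERGE
-- ... MERGE statements run here ...
ALTER TASK BRONZE.TASK_ORDERS_AUDIT SUSPEND;   -- after the MERGE completes
```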

INSTEAD OF triggers become stored procedures (callers must use the SP, not write directly to the table):

| SQL Server trigger | Snowflake SP | File |
|---|---|---|
| tr_vw_Orders_Dml_IOD (INSTEAD OF DELETE) | SP_SOFT_DELETE_ORDER(order_id) | 06_dml_guard_procs.sql |
| tr_vw_Orders_Dml_IOU (INSTEAD OF UPDATE) | SP_UPDATE_ORDER(order_id, ...) | 06_dml_guard_procs.sql |
| tr_vw_StockOrders_IOI (INSTEAD OF INSERT) | SP_INV_SOFT_INSERT_STOCK_ORDER(...) | 05_inventory_procs.sql |
| tr_Opportunities_StageGuard (AFTER UPDATE) | SP_CRM_UPDATE_OPPORTUNITY_STAGE(opp_id, stage) | 04_crm_procs.sql |

Constraints

| SQL Server | Snowflake | Notes |
|---|---|---|
| OrderItems.Quantity CHECK (Quantity > 0) | NOT ENFORCED constraint | Enforced inside SP_TVP_APPEND_ORDER_LINES |
| Orders.Status DEFAULT 'Open' | Column DEFAULT 'Open' | Supported natively |
| Customers.Country DEFAULT 'US' | Column DEFAULT 'US' | Supported natively |
| OrderItems.LineTotal PERSISTED computed | Stored as LINE_TOTAL NUMBER(18,4) | Computed during MERGE — Snowflake has no computed columns |
| Products.SKU UNIQUE | UNIQUE NOT ENFORCED | Enforced by MERGE dedup logic |
| Orders → Customers FK | FOREIGN KEY NOT ENFORCED | Declared but not enforced at write time |

Functions

| SQL Server | Snowflake | Location |
|---|---|---|
| fn_FormatMoney(@amount) scalar UDF | FN_FORMAT_MONEY(amount) SQL UDF | udfs/01_scalar_udfs.sql |
| fn_OrderLineCount(@orderId) inline TVF | FN_ORDER_LINE_COUNT(order_id) scalar UDF | udfs/01_scalar_udfs.sql |

DMS vs Airbyte — which to use

This lab supports both. Here's an honest comparison:

| | AWS DMS | Airbyte |
|---|---|---|
| Setup complexity | High — IAM, VPC, DMS instance, replication task, Snowpipe, S3 | Low — one docker compose up |
| Data durability | Parquet files in S3 (retain forever) | No intermediate storage |
| Source support | SQL Server, Oracle, MySQL, PostgreSQL, MongoDB, and more | 300+ connectors including SaaS APIs |
| CDC support | Yes, via native CDC | Yes, via CDC or STANDARD mode |
| Cost | DMS replication instance runs 24/7 even when idle | Free OSS; Airbyte Cloud has usage pricing |
| Best for | AWS-native shops, regulated environments needing audit trail | Faster setup, many heterogeneous sources |

If your company is already on AWS and you need an audit trail of raw source data, DMS is the better choice. If you need to pull from Salesforce, HubSpot, and SQL Server into the same warehouse, Airbyte covers all of them without extra plumbing.


Credential security

Your .env file contains real credentials and must never be committed to git. It is in .gitignore — verify with:

git check-ignore -v .env

If it's not listed, add .env to .gitignore immediately and rotate any credentials that were already committed.

The .env.example file shows every variable this lab needs. Copy it to .env and fill in your values:

cp .env.example .env

For production, use a secret manager (AWS Secrets Manager, HashiCorp Vault) instead of .env files. Terraform supports Vault and SSM Parameter Store natively.


Smoke tests after setup

After Terraform apply and a successful DMS load, run these in Snowflake to verify the migration:

-- Basic counts
SELECT COUNT(*) FROM BRONZE.CUSTOMERS;
SELECT COUNT(*) FROM BRONZE.ORDERS;

-- Procedure smoke test
CALL BRONZE.SP_LIST_OPEN_ORDERS();
CALL BRONZE.SP_DYNAMIC_SEARCH_ORDERS('Open', NULL, NULL, 'ORDER_DATE', 'DESC');

-- UDF
SELECT BRONZE.FN_FORMAT_MONEY(12345.678);

-- Chained proc (appends [C][B][A] to notes)
CALL BRONZE.SP_CHAINED_A_OUTER(1);
SELECT NOTES FROM BRONZE.ORDERS WHERE ORDER_ID = 1;

-- Stream check (needs a row in BRONZE.ORDERS first)
SELECT SYSTEM$STREAM_HAS_DATA('BRONZE.STREAM_ORDERS_CHANGES');

-- Task history
SELECT NAME, STATE, COMPLETED_TIME, ERROR_MESSAGE
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
    SCHEDULED_TIME_RANGE_START => DATEADD('hour', -1, CURRENT_TIMESTAMP()),
    TASK_NAME => 'TASK_ORDERS_AUDIT'
));

-- dbt test results (run via Airflow DAG 4 or directly)
-- Should show 27 PASS, 3 WARN

Cost estimate

Running this lab for one day on AWS:

| Component | Approximate cost |
|---|---|
| DMS replication instance (dms.t3.medium) | $0.12/hr ≈ $3/day |
| RDS SQL Server SE (db.t3.medium) | $0.10/hr ≈ $2.40/day |
| S3 storage for Parquet files | < $0.01 |
| Snowflake X-SMALL warehouse (auto-suspend 60s) | ~$1–2/day depending on query volume |

Tear down when done:

terraform -chdir=infra/terraform/aws destroy
terraform -chdir=infra/terraform/snowflake destroy
docker compose down -v   # in sqlserver/, airflow/, airbyte/

Troubleshooting

DMS task shows "Load complete" but RAW_DMS_VARIANT is empty Check Snowpipe: SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(...)). If errors show, it's usually the storage integration — the IAM trust relationship between Snowflake and your S3 bucket needs the Snowflake AWS account ID and external ID that Terraform output after apply.

Airflow Bronze MERGE fails with Duplicate row detected during DML action This happens when the source subquery returns two rows with the same primary key. The root cause is usually SELECT DISTINCT not deduplicating rows with different computed column values (CURRENT_TIMESTAMP() differs per row). The fix is QUALIFY ROW_NUMBER() OVER (PARTITION BY <pk> ORDER BY <load_ts> DESC) = 1 in the MERGE source query.

ExternalTaskSensor polls forever and never succeeds The sensor matches by exact execution_date. Trigger downstream DAGs with --exec-date equal to the upstream DAG's execution_date. You can find it in the Airflow UI under the upstream DAG run's details.

dbt fails with --profiles-dir is not a valid option This is a dbt 1.x breaking change. The flag --profiles-dir is not valid as a global pre-command flag. Set ENV DBT_PROFILES_DIR=/opt/dbt in the Dockerfile and use ENTRYPOINT ["dbt"] without arguments.
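The resulting Dockerfile can be sketched like this — base image, versions, and paths are assumptions to illustrate the two lines that matter (ENV and ENTRYPOINT):

```dockerfile
FROM python:3.10-slim
RUN pip install dbt-snowflake
# Point dbt at the profiles directory via env var instead of the CLI flag.
ENV DBT_PROFILES_DIR=/opt/dbt
COPY profiles.yml /opt/dbt/profiles.yml
# Bare entrypoint: Airflow appends the subcommand ("run", "test", "snapshot").
ENTRYPOINT ["dbt"]
```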

Airbyte sync fails with Cannot connect to host.docker.internal On Linux, host.docker.internal isn't automatically available. Add --add-host=host.docker.internal:host-gateway to your Docker run, or add it under extra_hosts in the docker-compose service definition.
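In docker-compose form, the fix is an extra_hosts entry on the affected service — the service name below is an assumption:

```yaml
# Make host.docker.internal resolve to the Docker host gateway on Linux.
services:
  airbyte-worker:
    extra_hosts:
      - "host.docker.internal:host-gateway"
```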


CI/CD

GitHub Actions at .github/workflows/migration-cicd.yml:

| Trigger | Jobs |
|---|---|
| PR to main | CDK synth + terraform validate + dbt parse |
| Push to main | CDK deploy + terraform apply (Snowflake + AWS DMS) |

Required GitHub Secrets:

| Secret | Description |
|---|---|
| AWS_ROLE_ARN | OIDC IAM role ARN |
| AWS_REGION | e.g. us-east-1 |
| TF_STATE_BUCKET | S3 bucket for Terraform state |
| SNOWFLAKE_ACCOUNT | Snowflake account identifier |
| SNOWFLAKE_USER | Snowflake service user |
| DMS_BUCKET_NAME | DMS landing S3 bucket |
| SNOWFLAKE_S3_ROLE_ARN | IAM role ARN for Snowflake storage integration |
