PoC of an end-to-end ETL pipeline on AWS: Kinesis Data Stream + Kinesis Firehose + Glue (batch and streaming) + Glue Data Catalog + Athena. Python producer simulates stock ticker data; both Iceberg and Parquet+projection storage modes coexist for side-by-side comparison.
The diagram shows the AWS components deployed by Terraform: a single Kinesis Data Stream fans out to two Firehose delivery streams (Iceberg and Parquet+projection) and to a Glue streaming job for anomaly detection. The Glue batch job reads the raw layer and writes the aggregated tables; Athena queries both storage modes through the Glue Data Catalog.
The sequence diagram below traces the order of messages during a typical session: how records reach Kinesis, how Firehose fans out to both storage modes in parallel, when the Glue jobs kick in, and how Athena resolves tables at query time.
```mermaid
sequenceDiagram
actor Producer as Producer (local)
participant Stream as Kinesis Stream
participant FH_I as Firehose iceberg
participant FH_P as Firehose parquet
participant S3
participant GC as Glue Catalog
participant Batch as Glue batch (OHLC)
participant Stream2 as Glue streaming (z-score)
participant Athena
Producer->>Stream: put_records (stock JSON)
Stream->>FH_I: consume
Stream->>FH_P: consume
FH_I->>S3: write iceberg/raw/
FH_I->>GC: update iceberg.raw metadata
FH_P->>S3: write parquet_projection/raw/year=.../
Stream->>Stream2: consume
Stream2->>S3: write anomalies (both storage modes)
Stream2->>GC: write anomalies table rows
Batch->>S3: read raw, write aggregated_1m/5m
Batch->>GC: MERGE INTO iceberg aggregated
Athena->>GC: resolve tables
Athena->>S3: scan files
```
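For reference, the producer's `put_records` loop can be pictured like the sketch below; the stream name, the record fields, and the one-second cadence are assumptions for illustration. The actual CLI lives under `src/producer/` and reads its configuration from `.env`.

```python
# Minimal sketch of the producer loop (illustrative only; the real CLI lives
# in src/producer/ and reads the stream name from .env).
import json
import random
import time

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "etl-prototype-stream"  # assumption: replace with your stream name
SYMBOLS = ["AAPL", "MSFT", "AMZN", "GOOG"]

while True:
    batch = [
        {
            # Partition by symbol so each ticker stays ordered within a shard.
            "Data": json.dumps(
                {
                    "symbol": symbol,
                    "price": round(random.uniform(50, 500), 2),
                    "volume": random.randint(1, 1_000),
                    "timestamp": int(time.time() * 1000),
                }
            ).encode(),
            "PartitionKey": symbol,
        }
        for symbol in SYMBOLS
    ]
    kinesis.put_records(StreamName=STREAM_NAME, Records=batch)
    time.sleep(1)
```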
- IAM scoped policies: the `firehose` and `glue` roles use inline policies limited to the project bucket, the Kinesis stream, and the Glue databases matching the project prefix. No cross-account or cross-service grants.
- No public endpoints: every step of the pipeline (Producer -> Stream -> Firehose -> S3 -> Glue -> Athena) requires AWS credentials; nothing is Internet-facing. The S3 bucket has an `aws_s3_bucket_public_access_block` blocking ACLs, bucket policies, and public buckets.
- CloudWatch Logs retention: Firehose log groups are created with `retention_in_days = 7` to keep log storage cost predictable and avoid indefinite retention.
- Encryption at rest: S3 uses AWS-managed SSE-S3 by default; the Kinesis Stream is left unencrypted at rest in this PoC (KMS can be enabled by adding a `stream_encryption` block in `terraform/kinesis.tf`).
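To spot-check this posture against a live deploy, a boto3 snippet along these lines could be used; the bucket name, stream name, and log group prefix below are placeholders for your deploy's values.

```python
# Sanity check of the posture described above (placeholder resource names).
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")
logs = boto3.client("logs")

# Public access block: all four flags should be enabled on the project bucket.
pab = s3.get_public_access_block(Bucket="etl-prototype-demo-bucket")
assert all(pab["PublicAccessBlockConfiguration"].values())

# Stream encryption: expected to be NONE in this PoC unless KMS was enabled.
summary = kinesis.describe_stream_summary(StreamName="etl-prototype-stream")
print(summary["StreamDescriptionSummary"].get("EncryptionType"))

# Log retention: Firehose log groups should report 7 days.
for group in logs.describe_log_groups(logGroupNamePrefix="/aws/kinesisfirehose/")["logGroups"]:
    print(group["logGroupName"], group.get("retentionInDays"))
```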
Rough figures for a ~3-hour demo session in eu-central-1, pay-per-use pricing.
| Resource | Unit cost | Demo session |
|---|---|---|
| S3 storage | $0.023/GB/month | < $0.01 |
| S3 PUT/GET | $0.005 per 1k requests | < $0.01 |
| Kinesis Data Stream (on-demand) | $0.04/hr per shard + $0.04/GB ingested | ~$0.50 (4 shards, 3h) |
| Kinesis Firehose x2 | $0.029/GB ingested + $0.018/GB format conversion | < $0.05 (MB-scale) |
| Glue Data Catalog | free tier | $0 |
| Glue batch job | $0.44 per DPU-hour (Glue 5.0) | ~$0.05 per run (2 DPU x ~3 min) |
| Glue streaming job | $0.44 per DPU-hour | ~$0.90/hr if kept running |
| Athena | $5 per TB scanned | < $0.01 (MB-scale) |
| CloudWatch Logs | $0.50/GB ingested | < $0.05 |
Total: about $1-3 USD for a full demo (3 hours, includes streaming job running for ~1 hour). Destroy the stack via make tf-destroy to stop Kinesis Stream billing.
- AWS account with access to the target region and a user/role that can create S3, Kinesis, Firehose, Glue, and IAM resources. See AWS account setup
- AWS CLI v2 configured with a named profile in `~/.aws/credentials`. See AWS CLI v2 install and configure
- uv >= 0.4 for dependency management and Python 3.10-3.11. See uv install
- Terraform >= 1.11 to deploy AWS resources. See Terraform install
- Docker and Docker Compose for the local Glue image-based integration tests. See Docker Engine install
- jq to parse the integration test JSON files from `tests/integration/`
Bootstrap the Python environment:

```bash
uv sync
```

Build the glue_common wheel that the Glue jobs consume. This produces `dist/glue_common-<version>-py3-none-any.whl`:

```bash
make build-wheel
```

Deploy the AWS infrastructure (S3, Kinesis, Firehose, Glue Data Catalog, Glue jobs). The first `terraform apply` takes 1-2 minutes:

```bash
export AWS_PROFILE=mine
cd terraform && terraform init && terraform apply
```

Copy `.env.example` to `.env` and match the values to `terraform output` (bucket, Kinesis stream, Glue databases, job names):

```bash
cp .env.example .env
```

Run the producer to send ~10 minutes of simulated stock ticker data to Kinesis:

```bash
make produce SCENARIO=mixed DURATION=600 &
```

After ~2 minutes, raw data lands in both Glue databases. Trigger the batch job on AWS to populate aggregated_1m and aggregated_5m in both databases (takes ~3 minutes):

```bash
make batch-run
```

Optionally start the streaming job to detect anomalies in real time. Remember to stop it when done (it runs indefinitely):

```bash
make stream-run
```

Validate the full stack end-to-end via the pytest evaluation suite. It runs producer + batch job + Athena queries:

```bash
make test-evaluation
```
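The streaming job flags anomalies with a z-score rule. A simplified, standalone sketch of that kind of check follows; the window, threshold, and field semantics are assumptions, and the deployed logic lives in `glue_common` and runs per micro-batch.

```python
# Simplified, standalone z-score anomaly rule (hypothetical threshold and
# window; the real implementation lives in glue_common).
from statistics import mean, stdev


def is_anomaly(prices: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it deviates more than `threshold` standard deviations
    from the rolling window of recent prices for the same symbol."""
    if len(prices) < 2:
        return False
    sigma = stdev(prices)
    if sigma == 0:
        return False
    z = abs(latest - mean(prices)) / sigma
    return z > threshold


# A sudden spike against a stable window is flagged; a normal tick is not.
window = [100.2, 100.4, 99.9, 100.1, 100.3]
print(is_anomaly(window, 112.0))  # True
print(is_anomaly(window, 100.5))  # False
```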
The batch OHLC job supports three alternative strategies to read the raw layer, so you can benchmark them against the same deploy:

```bash
make batch-run LOAD_DATA_MODE=parquet   # glueContext.from_options on S3
make batch-run LOAD_DATA_MODE=spark     # spark.read.parquet
make batch-run LOAD_DATA_MODE=iceberg   # glueContext.from_catalog on iceberg DB
```

Inspect CloudWatch Logs for the job to compare read time, S3 fetch count, and partition pruning behavior between the three modes.
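For orientation, the three read paths correspond roughly to the sketch below; the S3 paths and catalog/database names are assumptions, and the actual mode selection happens inside `glue_common`, not in a single script like this.

```python
# Sketch of the three read strategies (assumed paths and catalog names).
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# LOAD_DATA_MODE=parquet: DynamicFrame straight from S3, no catalog involved.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/parquet_projection/raw/"], "recurse": True},
    format="parquet",
)
df_parquet = dyf.toDF()

# LOAD_DATA_MODE=spark: plain Spark reader, partition discovery on year=/month=/...
df_spark = spark.read.parquet("s3://bucket/parquet_projection/raw/")

# LOAD_DATA_MODE=iceberg: read through the Glue Catalog, letting Iceberg
# metadata drive file pruning.
df_iceberg = glue_context.create_data_frame.from_catalog(
    database="iceberg", table_name="raw"
)
```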
glue_common is the only Python package shipped to Glue workers. The build and upload flow is:
- `make build-wheel` runs `uv build --wheel --out-dir dist`, producing `dist/glue_common-<version>-py3-none-any.whl` (PEP 427 conformant name)
- Terraform reads the version dynamically from `src/glue_common/__init__.py` (regex on `__version__`) and composes the wheel filename
- A `null_resource` with a source-hash trigger rebuilds the wheel when any `src/glue_common/**/*.py` file changes
- `aws_s3_object` uploads the wheel to `s3://bucket/artifacts/wheels/glue_common-<version>-py3-none-any.whl`
- Each Glue job has `--additional-python-modules` in its default arguments pointing to the S3 URI; the Glue worker `pip install`s the wheel at job startup
- The thin wrapper scripts under `src/glue_jobs/` are uploaded separately as Glue `script_location`; all business logic lives in `glue_common` and is testable via pytest (see the sketch after this list)
- Local Docker-based tests install the same package in editable mode (`pip install -e .`) so imports resolve the same way on AWS and on the workstation
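To make the split concrete, a thin wrapper can be as small as the sketch below; the module path and the `glue_common` entry point name are hypothetical, only the `getResolvedOptions` pattern and the wheel import reflect how the jobs are wired.

```python
# Hypothetical thin wrapper, e.g. src/glue_jobs/batch_ohlc.py.
# All business logic is imported from the glue_common wheel that the job
# installs at startup via --additional-python-modules.
import sys

from awsglue.utils import getResolvedOptions  # available on Glue workers

from glue_common.batch import run_batch_ohlc  # hypothetical entry point

# Resolve the parameters declared in the Glue job's default arguments.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "LOAD_DATA_MODE", "TIMEFRAMES"])

run_batch_ohlc(
    load_data_mode=args["LOAD_DATA_MODE"],
    timeframes=args["TIMEFRAMES"].split(","),
)
```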
```text
etl-prototype/
├── pyproject.toml        # uv + ruff strict + pyright strict + pytest + bump-my-version + git-cliff
├── Makefile              # all automation targets
├── README.md             # this file
├── LICENSE               # MIT
├── .gitignore
├── .env.example          # copy to .env and fill after terraform apply
├── POST.md               # decision log (brainstorming output)
├── TODO.md               # deferred items
├── docker-compose.yaml   # profiles: glue4, glue5
├── dist/                 # gitignored, output of make build-wheel
├── src/
│   ├── producer/         # local CLI, not packaged
│   ├── glue_common/      # packaged as wheel
│   └── glue_jobs/        # thin wrappers uploaded as Glue script_location
├── terraform/            # AWS infrastructure
│   ├── main.tf, variables.tf, locals.tf, outputs.tf
│   ├── s3.tf, kinesis.tf, iam.tf
│   ├── firehose.tf, glue_catalog.tf, glue_jobs.tf
│   └── modules/glue-job/ # reusable local module
├── tests/
│   ├── unit/             # pytest, fast
│   ├── integration/      # JSON + local_test.sh via docker compose
│   └── evaluation/       # pytest @evaluation, runs against a live AWS deploy
│       └── queries/      # .sql files driven by test_athena_queries.py
└── images/               # architecture.drawio and exported PNG
```
```bash
uv sync
# optional: source .venv/bin/activate   # if you want python/pytest/ruff in PATH without the uv run prefix
make test              # unit + integration, excluding evaluation
make test-evaluation   # pytest smoke tests against a live AWS deploy
```

Each Glue job variant has a JSON file under `tests/integration/`. To run one of them locally on a Glue image:

```bash
make test-integration-local TEST_JSON=tests/integration/batch_iceberg.json PROFILE=glue5   # or PROFILE=glue4
```

Two profiles in `docker-compose.yaml`:

- `glue4` uses `public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01` (Spark 3.3, Python 3.10)
- `glue5` uses `public.ecr.aws/glue/aws-glue-libs:5` (Spark 3.5, Python 3.11, Iceberg built-in)

Default profile is `glue5`. The mount path differs between the two images (`/home/glue_user/` vs `/home/hadoop/`); `local_test.sh` handles this transparently, so the same JSON works on both.
```bash
make lint        # ruff check, no auto-fix
make format      # ruff format, check-only
make typecheck   # pyright on src + tests
```

Set your AWS profile before every Terraform command:

```bash
export AWS_PROFILE=mine
```

Then:

```bash
make tf-fmt        # formatting check
make tf-validate   # init -backend=false + validate (no AWS call)
make tf-plan       # read-only diff between desired and live state
make tf-apply      # creates or updates AWS resources
make tf-destroy    # tears down AWS resources
```

After `terraform apply`, copy `.env.example` to `.env` and update the values with what `terraform output` reports (bucket, stream, databases, job names). The smoke tests below read these via environment variables.
By default Terraform creates and manages the S3 bucket; `make tf-destroy` removes it (with `force_destroy = true`, even when not empty). To keep the bucket across destroy/apply cycles, use the conditional create pattern:

- After the first `make tf-apply`, detach the bucket and its config from the Terraform state:

```bash
cd terraform
terraform state rm 'aws_s3_bucket.main[0]' 'aws_s3_bucket_versioning.main[0]' 'aws_s3_bucket_public_access_block.main[0]'
```

- Create a `terraform/terraform.tfvars` (gitignored) and set the existing bucket name:

```hcl
s3_bucket_name = "etl-prototype-demo-bucket"
```

From then on, `terraform apply` resolves the bucket via the `data.aws_s3_bucket.main` lookup and skips creation; `terraform destroy` cannot touch it because it is no longer in the state. The S3 objects inside the bucket (wheel under `artifacts/`, scripts under `scripts/`, raw/aggregated/anomalies data) are managed separately by other resources or by the producer and Glue jobs, and follow their own destroy lifecycle.
Evaluation tests require a live AWS deploy and read resource names from the environment (see `.env.example`). Run them once the relevant phase is deployed.

Populate `.env` first (copy from `.env.example` and set `AWS_PROFILE`), then:

```bash
make test-evaluation
```

All evaluation tests share session-scoped fixtures that run the producer once (~90 seconds) and wait for the Firehose flush (~90 seconds); the batch-backed tests additionally start the Glue batch job and wait for completion (~5-10 minutes). Total runtime stays around 10-15 minutes regardless of how many tests depend on the data flow. A shell export still wins over `.env` if both are set.
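The fixture layering can be pictured roughly like this; a hypothetical shape, not the actual `conftest.py`, with timings following the figures above.

```python
# Hypothetical shape of the shared fixture in tests/evaluation/conftest.py:
# run the producer once per session, then give Firehose time to flush.
import subprocess
import time

import pytest


@pytest.fixture(scope="session")
def raw_data_flowed() -> None:
    # Produce ~90 seconds of ticker data into the Kinesis stream.
    subprocess.run(
        ["make", "produce", "SCENARIO=mixed", "DURATION=90"],
        check=True,
    )
    # Firehose buffers records before writing to S3; wait for the flush.
    time.sleep(90)
```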
Current evaluation tests:
tests/evaluation/test_producer_to_kinesis.py verifies that at least one record with the expected JSON shape is readable from the stream after the producer ran.
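A stripped-down version of that check could look like the following; the environment variable name and the record fields are assumptions, the real names come from `.env.example`.

```python
# Illustrative reduction of the producer-to-Kinesis check.
import json
import os

import boto3

kinesis = boto3.client("kinesis")
stream = os.environ["KINESIS_STREAM_NAME"]  # assumed variable name

shard_id = kinesis.list_shards(StreamName=stream)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

records = kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]
assert records, "expected at least one record after the producer run"

sample = json.loads(records[0]["Data"])
assert {"symbol", "price", "timestamp"} <= sample.keys()  # assumed JSON shape
```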
tests/evaluation/test_glue_catalog.py verifies that the 2 databases and 8 tables exist with the expected columns, that iceberg tables report table_type=ICEBERG, and that parquet_projection tables have projection.enabled=true with partition keys year/month/day/hour.
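The same checks, reduced to their boto3 calls, might look like this sketch; the database and table names are assumptions based on the naming used in this README.

```python
# Illustrative reduction of the catalog checks (assumed database/table names).
import boto3

glue = boto3.client("glue")

# Iceberg tables should report the ICEBERG table type in their parameters.
raw_iceberg = glue.get_table(DatabaseName="iceberg", Name="raw")["Table"]
assert raw_iceberg["Parameters"].get("table_type", "").upper() == "ICEBERG"

# Parquet+projection tables rely on Athena partition projection instead of
# crawled partitions: projection.enabled plus year/month/day/hour keys.
raw_proj = glue.get_table(DatabaseName="parquet_projection", Name="raw")["Table"]
assert raw_proj["Parameters"].get("projection.enabled") == "true"
assert [k["Name"] for k in raw_proj["PartitionKeys"]] == ["year", "month", "day", "hour"]
```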
tests/evaluation/test_firehose_to_s3.py verifies that at least one object landed under both s3://bucket/iceberg/raw/ and s3://bucket/parquet_projection/raw/year=.../... after the producer session and Firehose flush.
Note on Firehose schema cache: when the Glue raw table schema changes (e.g. via `terraform apply`), the parquet_projection Firehose keeps the previous `schema_configuration` cached for up to 15 minutes (in practice ~5 minutes were enough). Records ingested in that window land under `s3://bucket/parquet_projection/_firehose_errors/format-conversion-failed/`. Wait, or force a refresh with `terraform apply -replace="aws_kinesis_firehose_delivery_stream.parquet_projection[0]"`.
tests/evaluation/test_glue_jobs.py verifies that both Glue jobs (batch OHLC and streaming anomaly) are deployed with the expected command type (glueetl / gluestreaming) and default arguments (--LOAD_DATA_MODE, --TIMEFRAMES, --WINDOW_MINUTES, ..., --additional-python-modules pointing to a glue_common wheel on S3). These tests read metadata only; they do NOT execute the jobs.
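A reduced sketch of such a metadata-only check, with hypothetical job names:

```python
# Illustrative metadata check (job names are hypothetical).
import boto3

glue = boto3.client("glue")

batch = glue.get_job(JobName="etl-prototype-batch-ohlc")["Job"]
assert batch["Command"]["Name"] == "glueetl"
assert "--additional-python-modules" in batch["DefaultArguments"]

streaming = glue.get_job(JobName="etl-prototype-streaming-anomaly")["Job"]
assert streaming["Command"]["Name"] == "gluestreaming"
```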
tests/evaluation/test_athena_queries.py runs the four .sql files under tests/evaluation/queries/ (latest candles iceberg, latest candles parquet_projection, anomalies today, compare storage modes) via the Athena client. It depends on batch_job_ran, a session-scoped fixture that starts the Glue batch job and waits for completion before the queries run, so aggregated_1m and aggregated_5m are populated in both databases. The compare query asserts that row counts between iceberg and parquet_projection align within a small drift.
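Under the hood the queries go through the standard Athena start/poll/fetch loop; a reduced sketch, where the file name, database, and output location are placeholders:

```python
# Illustrative Athena query execution (placeholder names and paths).
import time
from pathlib import Path

import boto3

athena = boto3.client("athena")

query = Path("tests/evaluation/queries/latest_candles_iceberg.sql").read_text()  # hypothetical file name
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "iceberg"},  # assumed database name
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)

# Poll until Athena finishes, then fetch the rows.
query_id = execution["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

assert state == "SUCCEEDED"
rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```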
make patch (or minor, major) runs the full release flow via the release target: bump-my-version updates pyproject.toml + src/glue_common/__init__.py, git-cliff regenerates CHANGELOG.md, the changelog is amended on the bump commit, a new v<version> tag is created, then branch and tags are pushed. The Terraform null_resource picks up the new __version__ and uploads the new wheel filename on the next terraform apply.
```bash
make patch       # or minor, major: bump + regenerate CHANGELOG + tag + push
make changelog   # regenerate CHANGELOG.md and amend it on the last commit, no bump or tag
```

The release target is internal; call major, minor, or patch and let it orchestrate.
This repo is released under the MIT license. See LICENSE for details.
