PoC of an end-to-end ETL pipeline on AWS: Kinesis Data Stream + Kinesis Firehose + Glue (batch and streaming) + Glue Data Catalog + Athena. Python producer simulates stock ticker data; both Iceberg and Parquet+projection storage modes coexist for side-by-side comparison.
The diagram shows the AWS components deployed by Terraform: a single Kinesis Data Stream fans out to two Firehose delivery streams (Iceberg and Parquet+projection) and to a Glue streaming job for anomaly detection. The Glue batch job reads the raw layer and writes the aggregated tables; Athena queries both storage modes through the Glue Data Catalog.
The sequence diagram below traces the order of messages during a typical session: how records reach Kinesis, how Firehose fans out to both storage modes in parallel, when the Glue jobs kick in, and how Athena resolves tables at query time.
```mermaid
sequenceDiagram
actor Producer as Producer (local)
participant Stream as Kinesis Stream
participant FH_I as Firehose iceberg
participant FH_P as Firehose parquet
participant S3
participant GC as Glue Catalog
participant Batch as Glue batch (OHLC)
participant Stream2 as Glue streaming (z-score)
participant Athena
Producer->>Stream: put_records (stock JSON)
Stream->>FH_I: consume
Stream->>FH_P: consume
FH_I->>S3: write iceberg/raw/
FH_I->>GC: update iceberg.raw metadata
FH_P->>S3: write parquet_projection/raw/year=.../
Stream->>Stream2: consume
Stream2->>S3: write anomalies (both storage modes)
Stream2->>GC: write anomalies table rows
Batch->>S3: read raw, write aggregated_1m/5m
Batch->>GC: MERGE INTO iceberg aggregated
Athena->>GC: resolve tables
Athena->>S3: scan files
```
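For reference, the producer's `put_records` loop can be pictured like the sketch below; the stream name, the record fields, and the one-second cadence are assumptions for illustration. The actual CLI lives under `src/producer/` and reads its configuration from `.env`.

```python
# Minimal sketch of the producer loop (illustrative only; the real CLI lives
# in src/producer/ and reads the stream name from .env).
import json
import random
import time

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "etl-prototype-stream"  # assumption: replace with your stream name
SYMBOLS = ["AAPL", "MSFT", "AMZN", "GOOG"]

while True:
    batch = [
        {
            # Partition by symbol so each ticker stays ordered within a shard.
            "Data": json.dumps(
                {
                    "symbol": symbol,
                    "price": round(random.uniform(50, 500), 2),
                    "volume": random.randint(1, 1_000),
                    "timestamp": int(time.time() * 1000),
                }
            ).encode(),
            "PartitionKey": symbol,
        }
        for symbol in SYMBOLS
    ]
    kinesis.put_records(StreamName=STREAM_NAME, Records=batch)
    time.sleep(1)
```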
- IAM scoped policies: the `firehose` and `glue` roles use inline policies limited to the project bucket, the Kinesis stream, and the Glue databases matching the project prefix. No cross-account or cross-service grants.
- No public endpoints: every step of the pipeline (Producer -> Stream -> Firehose -> S3 -> Glue -> Athena) requires AWS credentials; nothing is Internet-facing. The S3 bucket has an `aws_s3_bucket_public_access_block` blocking ACLs, bucket policies, and public buckets.
- CloudWatch Logs retention: Firehose log groups are created with `retention_in_days = 7` to keep log storage cost predictable and avoid indefinite retention.
- Encryption at rest: S3 uses AWS-managed SSE-S3 by default; the Kinesis Stream is left unencrypted at rest in this PoC (KMS can be enabled by adding a `stream_encryption` block in `terraform/kinesis.tf`).
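To spot-check this posture against a live deploy, a boto3 snippet along these lines could be used; the bucket name, stream name, and log group prefix below are placeholders for your deploy's values.

```python
# Sanity check of the posture described above (placeholder resource names).
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")
logs = boto3.client("logs")

# Public access block: all four flags should be enabled on the project bucket.
pab = s3.get_public_access_block(Bucket="etl-prototype-demo-bucket")
assert all(pab["PublicAccessBlockConfiguration"].values())

# Stream encryption: expected to be NONE in this PoC unless KMS was enabled.
summary = kinesis.describe_stream_summary(StreamName="etl-prototype-stream")
print(summary["StreamDescriptionSummary"].get("EncryptionType"))

# Log retention: Firehose log groups should report 7 days.
for group in logs.describe_log_groups(logGroupNamePrefix="/aws/kinesisfirehose/")["logGroups"]:
    print(group["logGroupName"], group.get("retentionInDays"))
```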
Rough figures for a ~3-hour demo session in eu-central-1, pay-per-use pricing.
| Resource | Unit cost | Demo session |
|---|---|---|
| S3 storage | $0.023/GB/month | < $0.01 |
| S3 PUT/GET | $0.005 per 1k requests | < $0.01 |
| Kinesis Data Stream (on-demand) | $0.04/hr per shard + $0.04/GB ingested | ~$0.50 (4 shards, 3h) |
| Kinesis Firehose x2 | $0.029/GB ingested + $0.018/GB format conversion | < $0.05 (MB-scale) |
| Glue Data Catalog | free tier | $0 |
| Glue batch job | $0.44 per DPU-hour (Glue 5.0) | ~$0.05 per run (2 DPU x ~3 min) |
| Glue streaming job | $0.44 per DPU-hour | ~$0.90/hr if kept running |
| Athena | $5 per TB scanned | < $0.01 (MB-scale) |
| CloudWatch Logs | $0.50/GB ingested | < $0.05 |
Total: about $1-3 USD for a full demo (3 hours, includes streaming job running for ~1 hour). Destroy the stack via make tf-destroy to stop Kinesis Stream billing.
- AWS account with access to the target region and a user/role that can create S3, Kinesis, Firehose, Glue, and IAM resources. See AWS account setup
- AWS CLI v2 configured with a named profile in `~/.aws/credentials`. See AWS CLI v2 install and configure
- uv >= 0.4 for dependency management and Python 3.10-3.11. See uv install
- Terraform >= 1.11 to deploy AWS resources. See Terraform install
- Docker and Docker Compose for the local Glue image-based integration tests. See Docker Engine install
- jq to parse the integration test JSON files from `tests/integration/`
Bootstrap the Python environment:

```bash
uv sync
```

Build the glue_common wheel that the Glue jobs consume. This produces `dist/glue_common-<version>-py3-none-any.whl`:

```bash
make build-wheel
```

Deploy the AWS infrastructure (S3, Kinesis, Firehose, Glue Data Catalog, Glue jobs). The first `terraform apply` takes 1-2 minutes:

```bash
export AWS_PROFILE=mine
cd terraform && terraform init && terraform apply
```

Copy `.env.example` to `.env` and match the values to `terraform output` (bucket, Kinesis stream, Glue databases, job names):

```bash
cp .env.example .env
```

Run the producer to send ~10 minutes of simulated stock ticker data to Kinesis:

```bash
make produce SCENARIO=mixed DURATION=600 &
```

After ~2 minutes, raw data lands in both Glue databases. Trigger the batch job on AWS to populate aggregated_1m and aggregated_5m in both databases (takes ~3 minutes):

```bash
make batch-run
```

Optionally start the streaming job to detect anomalies in real time. Remember to stop it when done (it runs indefinitely):

```bash
make stream-run
```

Validate the full stack end-to-end via the pytest evaluation suite. It runs producer + batch job + Athena queries:

```bash
make test-evaluation
```
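The streaming job flags anomalies with a z-score rule. A simplified, standalone sketch of that kind of check follows; the window, threshold, and field semantics are assumptions, and the deployed logic lives in `glue_common` and runs per micro-batch.

```python
# Simplified, standalone z-score anomaly rule (hypothetical threshold and
# window; the real implementation lives in glue_common).
from statistics import mean, stdev


def is_anomaly(prices: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it deviates more than `threshold` standard deviations
    from the rolling window of recent prices for the same symbol."""
    if len(prices) < 2:
        return False
    sigma = stdev(prices)
    if sigma == 0:
        return False
    z = abs(latest - mean(prices)) / sigma
    return z > threshold


# A sudden spike against a stable window is flagged; a normal tick is not.
window = [100.2, 100.4, 99.9, 100.1, 100.3]
print(is_anomaly(window, 112.0))  # True
print(is_anomaly(window, 100.5))  # False
```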
The batch OHLC job supports three alternative strategies to read the raw layer, so you can benchmark them against the same deploy:

```bash
make batch-run LOAD_DATA_MODE=parquet   # glueContext.from_options on S3
make batch-run LOAD_DATA_MODE=spark     # spark.read.parquet
make batch-run LOAD_DATA_MODE=iceberg   # glueContext.from_catalog on iceberg DB
```

Inspect CloudWatch Logs for the job to compare read time, S3 fetch count, and partition pruning behavior between the three modes.
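For orientation, the three read paths correspond roughly to the sketch below; the S3 paths and catalog/database names are assumptions, and the actual mode selection happens inside `glue_common`, not in a single script like this.

```python
# Sketch of the three read strategies (assumed paths and catalog names).
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# LOAD_DATA_MODE=parquet: DynamicFrame straight from S3, no catalog involved.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/parquet_projection/raw/"], "recurse": True},
    format="parquet",
)
df_parquet = dyf.toDF()

# LOAD_DATA_MODE=spark: plain Spark reader, partition discovery on year=/month=/...
df_spark = spark.read.parquet("s3://bucket/parquet_projection/raw/")

# LOAD_DATA_MODE=iceberg: read through the Glue Catalog, letting Iceberg
# metadata drive file pruning.
df_iceberg = glue_context.create_data_frame.from_catalog(
    database="iceberg", table_name="raw"
)
```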
glue_common is the only Python package shipped to Glue workers. The build and upload flow is:
- `make build-wheel` runs `uv build --wheel --out-dir dist`, producing `dist/glue_common-<version>-py3-none-any.whl` (PEP 427 conformant name)
- Terraform reads the version dynamically from `src/glue_common/__init__.py` (regex on `__version__`) and composes the wheel filename
- A `null_resource` with a source-hash trigger rebuilds the wheel when any `src/glue_common/**/*.py` file changes
- `aws_s3_object` uploads the wheel to `s3://bucket/artifacts/wheels/glue_common-<version>-py3-none-any.whl`
- Each Glue job has `--additional-python-modules` in its default arguments pointing to the S3 URI; the Glue worker `pip install`s the wheel at job startup
- The thin wrapper scripts under `src/glue_jobs/` are uploaded separately as Glue `script_location`; all business logic lives in `glue_common` and is testable via pytest (see the sketch after this list)
- Local Docker-based tests install the same package in editable mode (`pip install -e .`) so imports resolve the same way on AWS and on the workstation
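To make the split concrete, a thin wrapper can be as small as the sketch below; the module path and the `glue_common` entry point name are hypothetical, only the `getResolvedOptions` pattern and the wheel import reflect how the jobs are wired.

```python
# Hypothetical thin wrapper, e.g. src/glue_jobs/batch_ohlc.py.
# All business logic is imported from the glue_common wheel that the job
# installs at startup via --additional-python-modules.
import sys

from awsglue.utils import getResolvedOptions  # available on Glue workers

from glue_common.batch import run_batch_ohlc  # hypothetical entry point

# Resolve the parameters declared in the Glue job's default arguments.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "LOAD_DATA_MODE", "TIMEFRAMES"])

run_batch_ohlc(
    load_data_mode=args["LOAD_DATA_MODE"],
    timeframes=args["TIMEFRAMES"].split(","),
)
```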
```text
etl-prototype/
├── pyproject.toml        # uv + ruff strict + pyright strict + pytest + bump-my-version + git-cliff
├── Makefile              # all automation targets
├── README.md             # this file
├── LICENSE               # MIT
├── .gitignore
├── .env.example          # copy to .env and fill after terraform apply
├── POST.md               # decision log (brainstorming output)
├── TODO.md               # deferred items
├── docker-compose.yaml   # profiles: glue4, glue5
├── dist/                 # gitignored, output of make build-wheel
├── src/
│   ├── producer/         # local CLI, not packaged
│   ├── glue_common/      # packaged as wheel
│   └── glue_jobs/        # thin wrappers uploaded as Glue script_location
├── terraform/            # AWS infrastructure
│   ├── main.tf, variables.tf, locals.tf, outputs.tf
│   ├── s3.tf, kinesis.tf, iam.tf
│   ├── firehose.tf, glue_catalog.tf, glue_jobs.tf
│   └── modules/glue-job/ # reusable local module
├── tests/
│   ├── unit/             # pytest, fast
│   ├── integration/      # JSON + local_test.sh via docker compose
│   └── evaluation/       # pytest @evaluation, runs against a live AWS deploy
│       └── queries/      # .sql files driven by test_athena_queries.py
└── images/               # architecture.drawio and exported PNG
```
```bash
uv sync
# optional: source .venv/bin/activate   # if you want python/pytest/ruff in PATH without the uv run prefix
make test              # unit + integration, excluding evaluation
make test-evaluation   # pytest smoke tests against a live AWS deploy
```

Each Glue job variant has a JSON file under `tests/integration/`. To run one of them locally on a Glue image:

```bash
make test-integration-local TEST_JSON=tests/integration/batch_iceberg.json PROFILE=glue5   # or PROFILE=glue4
```

Two profiles in `docker-compose.yaml`:

- `glue4` uses `public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01` (Spark 3.3, Python 3.10)
- `glue5` uses `public.ecr.aws/glue/aws-glue-libs:5` (Spark 3.5, Python 3.11, Iceberg built-in)

Default profile is `glue5`. The mount path differs between the two images (`/home/glue_user/` vs `/home/hadoop/`); `local_test.sh` handles this transparently, so the same JSON works on both.
```bash
make lint        # ruff check, no auto-fix
make format      # ruff format, check-only
make typecheck   # pyright on src + tests
```

Set your AWS profile before every Terraform command:

```bash
export AWS_PROFILE=mine
```

Then:

```bash
make tf-fmt        # formatting check
make tf-validate   # init -backend=false + validate (no AWS call)
make tf-plan       # read-only diff between desired and live state
make tf-apply      # creates or updates AWS resources
make tf-destroy    # tears down AWS resources
```

After `terraform apply`, copy `.env.example` to `.env` and update the values with what `terraform output` reports (bucket, stream, databases, job names). The smoke tests below read these via environment variables.
By default Terraform creates and manages the S3 bucket; `make tf-destroy` removes it (with `force_destroy = true`, even when not empty). To keep the bucket across destroy/apply cycles, use the conditional create pattern:

- After the first `make tf-apply`, detach the bucket and its config from the Terraform state:

```bash
cd terraform
terraform state rm 'aws_s3_bucket.main[0]' 'aws_s3_bucket_versioning.main[0]' 'aws_s3_bucket_public_access_block.main[0]'
```

- Create a `terraform/terraform.tfvars` (gitignored) and set the existing bucket name:

```hcl
s3_bucket_name = "etl-prototype-demo-bucket"
```

From then on, `terraform apply` resolves the bucket via the `data.aws_s3_bucket.main` lookup and skips creation; `terraform destroy` cannot touch it because it is no longer in the state. The S3 objects inside the bucket (wheel under `artifacts/`, scripts under `scripts/`, raw/aggregated/anomalies data) are managed separately by other resources or by the producer and Glue jobs, and follow their own destroy lifecycle.
Evaluation tests require a live AWS deploy and read resource names from the environment (see `.env.example`). Run them once the relevant phase is deployed.

Populate `.env` first (copy from `.env.example` and set `AWS_PROFILE`), then:

```bash
make test-evaluation
```

All evaluation tests share session-scoped fixtures that run the producer once (~90 seconds) and wait for the Firehose flush (~90 seconds); the batch-backed tests additionally start the Glue batch job and wait for completion (~5-10 minutes). Total runtime stays around 10-15 minutes regardless of how many tests depend on the data flow. A shell export still wins over `.env` if both are set.
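The fixture layering can be pictured roughly like this; a hypothetical shape, not the actual `conftest.py`, with timings following the figures above.

```python
# Hypothetical shape of the shared fixture in tests/evaluation/conftest.py:
# run the producer once per session, then give Firehose time to flush.
import subprocess
import time

import pytest


@pytest.fixture(scope="session")
def raw_data_flowed() -> None:
    # Produce ~90 seconds of ticker data into the Kinesis stream.
    subprocess.run(
        ["make", "produce", "SCENARIO=mixed", "DURATION=90"],
        check=True,
    )
    # Firehose buffers records before writing to S3; wait for the flush.
    time.sleep(90)
```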
Current evaluation tests:
tests/evaluation/test_producer_to_kinesis.py verifies that at least one record with the expected JSON shape is readable from the stream after the producer ran.
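A stripped-down version of that check could look like the following; the environment variable name and the record fields are assumptions, the real names come from `.env.example`.

```python
# Illustrative reduction of the producer-to-Kinesis check.
import json
import os

import boto3

kinesis = boto3.client("kinesis")
stream = os.environ["KINESIS_STREAM_NAME"]  # assumed variable name

shard_id = kinesis.list_shards(StreamName=stream)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

records = kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]
assert records, "expected at least one record after the producer run"

sample = json.loads(records[0]["Data"])
assert {"symbol", "price", "timestamp"} <= sample.keys()  # assumed JSON shape
```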
tests/evaluation/test_glue_catalog.py verifies that the 2 databases and 8 tables exist with the expected columns, that iceberg tables report table_type=ICEBERG, and that parquet_projection tables have projection.enabled=true with partition keys year/month/day/hour.
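The same checks, reduced to their boto3 calls, might look like this sketch; the database and table names are assumptions based on the naming used in this README.

```python
# Illustrative reduction of the catalog checks (assumed database/table names).
import boto3

glue = boto3.client("glue")

# Iceberg tables should report the ICEBERG table type in their parameters.
raw_iceberg = glue.get_table(DatabaseName="iceberg", Name="raw")["Table"]
assert raw_iceberg["Parameters"].get("table_type", "").upper() == "ICEBERG"

# Parquet+projection tables rely on Athena partition projection instead of
# crawled partitions: projection.enabled plus year/month/day/hour keys.
raw_proj = glue.get_table(DatabaseName="parquet_projection", Name="raw")["Table"]
assert raw_proj["Parameters"].get("projection.enabled") == "true"
assert [k["Name"] for k in raw_proj["PartitionKeys"]] == ["year", "month", "day", "hour"]
```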
tests/evaluation/test_firehose_to_s3.py verifies that at least one object landed under both s3://bucket/iceberg/raw/ and s3://bucket/parquet_projection/raw/year=.../... after the producer session and Firehose flush.
Note on Firehose schema cache: when the Glue raw table schema changes (e.g. via `terraform apply`), the parquet_projection Firehose keeps the previous `schema_configuration` cached for up to 15 minutes (in practice ~5 minutes were enough). Records ingested in that window land under `s3://bucket/parquet_projection/_firehose_errors/format-conversion-failed/`. Wait, or force a refresh with `terraform apply -replace="aws_kinesis_firehose_delivery_stream.parquet_projection[0]"`.
tests/evaluation/test_glue_jobs.py verifies that both Glue jobs (batch OHLC and streaming anomaly) are deployed with the expected command type (glueetl / gluestreaming) and default arguments (--LOAD_DATA_MODE, --TIMEFRAMES, --WINDOW_MINUTES, ..., --additional-python-modules pointing to a glue_common wheel on S3). These tests read metadata only; they do NOT execute the jobs.
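A reduced sketch of such a metadata-only check, with hypothetical job names:

```python
# Illustrative metadata check (job names are hypothetical).
import boto3

glue = boto3.client("glue")

batch = glue.get_job(JobName="etl-prototype-batch-ohlc")["Job"]
assert batch["Command"]["Name"] == "glueetl"
assert "--additional-python-modules" in batch["DefaultArguments"]

streaming = glue.get_job(JobName="etl-prototype-streaming-anomaly")["Job"]
assert streaming["Command"]["Name"] == "gluestreaming"
```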
tests/evaluation/test_athena_queries.py runs the four .sql files under tests/evaluation/queries/ (latest candles iceberg, latest candles parquet_projection, anomalies today, compare storage modes) via the Athena client. It depends on batch_job_ran, a session-scoped fixture that starts the Glue batch job and waits for completion before the queries run, so aggregated_1m and aggregated_5m are populated in both databases. The compare query asserts that row counts between iceberg and parquet_projection align within a small drift.
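Under the hood the queries go through the standard Athena start/poll/fetch loop; a reduced sketch, where the file name, database, and output location are placeholders:

```python
# Illustrative Athena query execution (placeholder names and paths).
import time
from pathlib import Path

import boto3

athena = boto3.client("athena")

query = Path("tests/evaluation/queries/latest_candles_iceberg.sql").read_text()  # hypothetical file name
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "iceberg"},  # assumed database name
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)

# Poll until Athena finishes, then fetch the rows.
query_id = execution["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

assert state == "SUCCEEDED"
rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```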
make patch (or minor, major) runs the full release flow via the release target: bump-my-version updates pyproject.toml + src/glue_common/__init__.py, git-cliff regenerates CHANGELOG.md, the changelog is amended on the bump commit, a new v<version> tag is created, then branch and tags are pushed. The Terraform null_resource picks up the new __version__ and uploads the new wheel filename on the next terraform apply.
```bash
make patch       # or minor, major: bump + regenerate CHANGELOG + tag + push
make changelog   # regenerate CHANGELOG.md and amend it on the last commit, no bump or tag
```

The release target is internal; call major, minor, or patch and let it orchestrate.
This repo is released under the MIT license. See LICENSE for details.
