Mini Project: Enhancing Data Workflow with Python and Gemini #50
Step 1 — Choose dataset + API (scoping, contracts, and setup)

1.1 Pick a dataset (small, clean, useful)
Decision checklist
Deliverables
1.2 Pick a public API that complements the dataset
Probe the API
Deliverables
1.3 Define the join story now
Write this down
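As a concrete illustration, here is a minimal sketch of the join once both sources are normalized. The `city_key` and `ts` keys below are the ones used later in Steps 3–4; rename them to fit your dataset/API pair, and the toy frames are placeholders for your real data.

```python
import pandas as pd

# Toy frames standing in for the normalized dataset and API rows
df_ds = pd.DataFrame({"city_key": ["hyd"], "ts": pd.to_datetime(["2025-08-01"], utc=True), "aqi": [92]})
df_api = pd.DataFrame({"city_key": ["hyd"], "ts": pd.to_datetime(["2025-08-01"], utc=True), "temp_c": [31.5]})

# Left join: every dataset row survives; API columns are added when a match exists
merged = df_ds.merge(df_api, on=["city_key", "ts"], how="left", validate="m:1")
print(merged)
```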
1.4 Repo and environment
Checkpoint commit
Step 2 — Ingest (raw → landing with retries, caching, and naming)

2.1 Function contracts (keep them pure)
Code hint

```python
import time
import requests


def get_with_retries(url, params=None, tries=3, timeout=10):
    """GET with a timeout and exponential backoff; re-raise after the last attempt."""
    for i in range(tries):
        try:
            r = requests.get(url, params=params, timeout=timeout)
            r.raise_for_status()
            return r
        except Exception:
            if i == tries - 1:
                raise
            time.sleep(2 ** i)  # back off 1s, 2s, ...
```

2.2 Raw-zone layout and file naming
Why JSONL?
Write helper

```python
import json
from pathlib import Path


def write_jsonl(rows, path):
    """Write an iterable of dicts to a JSON Lines file, creating parent dirs as needed."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```

2.3 Offline fallback (classroom-proof)
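One classroom-proof pattern is to replay the last cached JSONL whenever the live call fails. A sketch that reuses the helpers above; `fetch_api_or_cache` and the cache path are illustrative names, and it assumes the API returns a JSON array of rows:

```python
import json
from pathlib import Path


def fetch_api_or_cache(url, params, cache_path="data/raw/api_cache.jsonl"):
    """Try the live API; on any failure, replay the last cached JSONL instead."""
    try:
        resp = get_with_retries(url, params=params)
        rows = resp.json()                       # assumes a JSON array of row dicts
        write_jsonl(rows, cache_path)            # refresh the cache on success
        return rows
    except Exception:
        cached = Path(cache_path)
        if not cached.exists():
            raise                                # nothing to fall back to
        with open(cached, encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```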
Acceptance
2.4 Minimal logging
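A minimal sketch of logging that is usually enough here; the logger name and messages are only examples:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
log = logging.getLogger("etl.ingest")

# Example calls - emit one line per meaningful event, nothing more
log.info("fetched %d rows from %s", 120, "open-meteo")
log.warning("API failed, using cached file %s", "data/raw/api_cache.jsonl")
```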
Checkpoint commit
Step 3 — Transform (normalize, align, integrate, derive)

3.1 Normalize both sources early
Code hint

```python
import pandas as pd

# Flatten the API JSON and put both sources on the same UTC timestamp column
df_api = pd.json_normalize(api_rows)
df_api["ts"] = pd.to_datetime(df_api["time"], utc=True)
df_ds["ts"] = pd.to_datetime(df_ds["date_local"]).dt.tz_localize("UTC")
```

3.2 Align the time axis
3.3 Define and compute derived fields
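The project brief names temp_bin, aqi_flag, and hour_of_day as candidate derived fields; one possible way to compute them (the thresholds are placeholders, pick and document your own):

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2025-08-01 02:00", "2025-08-01 14:00"], utc=True),
    "temp_c": [22.0, 34.5],
    "aqi": [45, 180],
})

# Placeholder cut-offs; record whatever you actually choose in the transform docs
df["temp_bin"] = pd.cut(df["temp_c"], bins=[-50, 15, 30, 60], labels=["cold", "mild", "hot"])
df["aqi_flag"] = (df["aqi"] > 100).map({True: "unhealthy", False: "ok"})
df["hour_of_day"] = df["ts"].dt.hour
print(df)
```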
3.4 Handle missingness and duplicates
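A sketch of one reasonable policy, assuming `city_key` + `ts` identify a row: drop exact duplicates, keep the last record per key, and leave genuinely missing measurements as NaN for the validate step to judge:

```python
import pandas as pd

merged = pd.DataFrame({
    "city_key": ["hyd", "hyd", "hyd"],
    "ts": pd.to_datetime(["2025-08-01", "2025-08-01", "2025-08-02"], utc=True),
    "aqi": [92, 92, None],   # second row is an exact duplicate; AQI missing on day 2
})

before = len(merged)
merged = merged.drop_duplicates()                                            # exact duplicates
merged = merged.sort_values("ts").drop_duplicates(["city_key", "ts"], keep="last")
print(f"dropped {before - len(merged)} duplicate rows")
# Missing measurements stay as NaN here; required-field gaps are quarantined in Step 4.2
```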
3.5 Document transform rules
Checkpoint commit
Step 4 — Validate (data contract, quarantine, summary)

4.1 Contract rules (start small, be explicit)
Severity levels
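One way to keep the contract explicit is a small list of named rules, each with a severity; the rule names, severities, and the `aqi` range below are illustrative only:

```python
import pandas as pd

# Illustrative contract: each rule returns a boolean mask of violating rows
CONTRACT = [
    ("ts_present",       "error",   lambda df: df["ts"].isna()),
    ("city_key_present", "error",   lambda df: df["city_key"].isna()),
    ("aqi_in_range",     "warning", lambda df: ~df["aqi"].between(0, 500)),
]


def check_contract(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per rule with its severity and violation count."""
    results = []
    for name, severity, rule in CONTRACT:
        bad = df[rule(df)]
        results.append({"rule": name, "severity": severity, "violations": len(bad)})
    return pd.DataFrame(results)
```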
4.2 Quarantine design
Code hint

```python
# Rows missing required keys go to quarantine; everything else continues downstream
bad_required = merged[merged["ts"].isna() | merged["city_key"].isna()]
good = merged.drop(bad_required.index)
bad_required.to_csv("data/processed/quarantine_required.csv", index=False)
```

4.3 Run summary (for governance later)
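A run summary can be a single JSON file per run; a sketch with hypothetical field names and counts:

```python
import json
import datetime as dt
from pathlib import Path

summary = {
    "run_at": dt.datetime.now(dt.timezone.utc).isoformat(),
    "rows_in": 1250,            # illustrative counts - use the real ones from your run
    "rows_good": 1238,
    "rows_quarantined": 12,
    "source_files": ["data/raw/api_cache.jsonl"],
}
Path("data/processed").mkdir(parents=True, exist_ok=True)
with open("data/processed/run_summary.json", "w", encoding="utf-8") as f:
    json.dump(summary, f, indent=2)
```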
Checkpoint commit
Step 5 — Load (formats, partitions, idempotency, and docs)

5.1 Output formats and directory layout
Atomic writes

```python
import os


def atomic_write_csv(df, path):
    """Write to a temp file first, then rename, so readers never see a half-written CSV."""
    tmp = str(path) + ".tmp"
    df.to_csv(tmp, index=False)
    os.replace(tmp, path)
```

5.2 Idempotency on re-run
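Idempotency here mostly means the same partition written twice ends up identical rather than appended. A sketch where the partition date is part of the path and each run overwrites its own partition; it reuses the atomic_write_csv helper from 5.1, and the layout is only a suggestion:

```python
from pathlib import Path


def load_partition(df, partition_dt: str, fmt: str = "csv"):
    """Re-running with the same partition_dt overwrites the same file, never appends."""
    out_dir = Path(f"data/processed/dt={partition_dt}")
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"merged.{fmt}"
    if fmt == "csv":
        atomic_write_csv(df, out_path)      # helper from 5.1
    else:
        df.to_parquet(out_path, index=False)
    return out_path
```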
5.3 Load function and CLI
Code hint

```python
import argparse
import datetime as dt

ap = argparse.ArgumentParser()
ap.add_argument("--fmt", default="csv", choices=["csv", "parquet"])
ap.add_argument("--partition", default="today")
args = ap.parse_args()

# "today" resolves at run time; anything else is taken as an explicit ISO date
partition_dt = dt.date.today().isoformat() if args.partition == "today" else args.partition
```

5.4 Post-load verification
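Post-load verification can be as small as reading the file back and comparing row counts; a sketch assuming CSV output:

```python
import pandas as pd


def verify_load(df, out_path):
    """Read the written file back and assert it matches what we meant to write."""
    reread = pd.read_csv(out_path)
    assert len(reread) == len(df), f"row count mismatch: wrote {len(df)}, read {len(reread)}"
    assert not reread.empty, "loaded file is empty"
    print(f"verified {len(reread)} rows in {out_path}")
```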
5.5 Documentation and housekeeping
Checkpoint commit
Nuances and pro tips across Steps 1–5
Acceptance summary (per step)
Step 6 — GitHub, the right way (simple → powerful)

6.1 Set up once
6.2 Everyday loop (keep it boring = keep it safe)

```mermaid
flowchart LR
    I[Issue or TODO] --> B[Make a small change]
    B --> C[git add/commit]
    C --> P[git pull --rebase]
    P --> U[git push]
    U --> R[Open PR]
    R --> CI[CI runs]
    CI --> RV[Review]
    RV --> M[Merge to main]
```
6.3 Simple mode (single branch: main)

Use this if you have very little time.
6.4 Power mode (tiny PRs + one safety rail)

Use this when you want a touch more rigor while staying simple.
6.5 Write great commits (the two-line rule)
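For example (the wording is illustrative, not prescriptive):

```
transform: derive hour_of_day from ts

Needed for the hourly join with the API data; see Step 3.3.
```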
This pays off when you debug at 2 a.m.

6.6 Pull Requests that reviewers love
6.7 CI in 10 lines (prove your code works)

Add .github/workflows/ci.yml:

```yaml
name: CI
on: [push, pull_request]
jobs:
  lint-smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.10" }
      - run: pip install -r requirements.txt
      - run: pip install black flake8
      - run: black --check .
      - run: flake8 .
```
6.8 Use Issues to plan, not to decorate

Create 3–5 small issues and tag them. Simple Kanban (optional): Projects → Table with columns To do / In progress / Done.

6.9 One conflict on purpose (learn it once)

Now you know the drill. No fear next time.

6.10 Lightweight “release” (freeze a milestone)

Tag the version after your report is ready:

```
git tag -a v0.1.0 -m "First ETL milestone"
git push origin v0.1.0
```

This gives you a clean restore point.

6.11 Proof you used GitHub well (acceptance)
6.12 Tiny extras that feel pro
6.13 Quick checklist (run at the end)
Ask yourself:
Step 7 — Use the Gemini gem to design the Airflow plan

7.1 Set the scene
7.2 Feed the gem a clean requirement
7.3 Enforce a 3-step response from the gem
7.4 Guardrails to reduce hallucinations
7.5 Save the outputs
Acceptance
Step 8 — Validate and harden the gem output

8.1 JSON schema check (optional but recommended)
Run validator (example):

```
pip install jsonschema pyyaml
python validators/validate_plan.py orchestration/plan.json
```

Pass if the plan loads and matches the schema (dag, tasks, and dependencies all present).
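A minimal sketch of what validators/validate_plan.py might contain; the schema below is illustrative only, so extend it to match whatever the gem actually returns:

```python
import json
import sys

from jsonschema import validate

# Illustrative schema - adjust required fields to your plan format
PLAN_SCHEMA = {
    "type": "object",
    "required": ["dag", "tasks", "dependencies"],
    "properties": {
        "dag": {"type": "object", "required": ["id", "schedule"]},
        "tasks": {
            "type": "array",
            "minItems": 1,
            "items": {"type": "object", "required": ["id", "operator", "description"]},
        },
        "dependencies": {"type": "array"},
    },
}

if __name__ == "__main__":
    plan = json.load(open(sys.argv[1], encoding="utf-8"))
    validate(instance=plan, schema=PLAN_SCHEMA)
    print("plan matches the schema")
```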
8.2 Human review
Commit
Step 9 — Generate a TaskFlow scaffold from YAML (design-time)

9.1 Write a tiny generator
Snippet

```python
import yaml
from textwrap import indent

spec = yaml.safe_load(open("orchestration/plan.yaml"))

print("from airflow.decorators import dag, task")
print("from datetime import datetime")
print("@dag(dag_id='{}', start_date=datetime(2025,1,1), schedule='{}', catchup=False, tags={})"
      .format(spec["dag"]["id"], spec["dag"]["schedule"], spec["dag"].get("tags", ["evergent"])))
print("def generated_pipeline():")
for t in spec["tasks"]:
    doc = f'{t["operator"]} — {t["description"]}'
    fn = t["id"]
    print(indent(f'@task()\ndef {fn}():\n    """{doc}"""\n    ...\n', "    "))
print(indent("# wire dependencies per spec['dependencies']", "    "))
print("generated_pipeline()")
```

9.2 Sanity run

```
python dags/generator.py > dags/generated_pipeline.py
```
Commit
Step 10 — Dry-run the plan locally

10.1 Static checks
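A couple of cheap static checks you can run before any Airflow is involved; the field names assume the YAML shape requested in Step 7:

```python
import yaml

plan = yaml.safe_load(open("orchestration/plan.yaml"))

# Task IDs must be unique and the DAG must carry a schedule
ids = [t["id"] for t in plan["tasks"]]
assert len(ids) == len(set(ids)), "task IDs must be unique"
assert plan["dag"].get("schedule"), "dag.schedule is required"
assert plan.get("dependencies") is not None, "dependencies section is missing"
print(f"{len(ids)} tasks, schedule = {plan['dag']['schedule']}")
```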
10.2 YAML → table echo (consistency probe)

Snippet

```python
import yaml
import pandas as pd

y = yaml.safe_load(open("orchestration/plan.yaml"))
tbl = pd.DataFrame([{
    "Task ID": t["id"],
    "Operator": t["operator"],
    "Retry": t.get("retry"),
} for t in y["tasks"]])
print(tbl)
```

Acceptance
Step 11 — Version and review in GitHub

11.1 Open a PR titled:
11.2 PR body template
11.3 CI runs
Merge once green.

Step 12 — Write the project report (2–4 pages)

12.1 Structure
12.2 Export
Commit
Step 13 — Final quality gates

13.1 Functional
13.2 Orchestration design
13.3 Repo hygiene
Step 14 — Stretch (optional, bite-sized)
Step 15 — Hand-off to Airflow (when ready)

15.1 Map design → implementation
15.2 Composer specifics (GCP)
15.3 Test plan
One-glance acceptance for 7–15
Keep it simple. Keep it observable. Let the gem do the planning, and let GitHub keep you honest.
Project: Enhancing Data Workflow with Python and Gemini
Objective
Scope choices (pick one pair)
Tip: prefer APIs that need no key (Open-Meteo), or use a free key if everyone can obtain one quickly.
Deliverables
- GitHub repo with code, README, and CI config
- Processed outputs (CSV/Parquet) in data/processed/
- Gemini outputs:
- Report (PDF or Markdown) covering objectives, steps, code snippets, results, and pipeline improvements
Suggested repo structure
Step-by-step (semi-guided)
Step 1 — Pick data + API (15 min)
- data/raw/

Acceptance
- data/raw/
- docs/report.md
Step 2 — Ingest (30–40 min)
- etl/ingest.py: fetch_api(endpoint, params) with timeout + 3 retries + backoff
- Write API rows to data/raw/ as JSON Lines
- Read the dataset with pandas.read_csv
- Keep functions small and pure
Code hint
Acceptance
- ingest_dataset() returns a DataFrame
- ingest_api() writes a JSONL and returns a file path

Step 3 — Transform (40–50 min)
- etl/transform.py: derived fields (temp_bin, aqi_flag, hour_of_day)

Acceptance
Code hint
Step 4 — Validate (15–20 min)
- etl/validate.py: quarantine failing rows to data/processed/quarantine_*.csv
Acceptance
Code hint
Step 5 — Load (10–15 min)
- etl/load.py: write to *.tmp, then rename; support --fmt csv|parquet
- pipeline.py: python pipeline.py --fmt csv

Acceptance
- outputs land in data/processed/
Step 6 — GitHub usage (ongoing)
- Use main for simplicity in class. Commit small and often.
- .github/workflows/ci.yml
Acceptance
Step 7 — Use the Gemini gem for an Airflow plan (45–60 min)
Example:
“Ingest {dataset} daily at 02:00 IST, enrich with {API} hourly data. Validate required fields. Soft-fail if zero rows after clean. Load to a partitioned table. Send Slack alert on failure. Keep 7-day backfills. GCP Composer, BigQuery, GCS.”
Ask the gem to return (must-haves)
Task ID | Description | Operator | Executor | Sensor/Trigger | XComs (keys) | Retry Policy | Fallback | Error Handling | SLA/Alerts | Idempotency | Data Contract Check | Cost Notes
Save outputs
- orchestration/plan.yaml (keep both in one file or split YAML + MD)
- orchestration/plan.json

Acceptance
- plan.yaml loads with yaml.safe_load and includes dag + tasks + dependencies
Step 8 — Generate a DAG scaffold from YAML (design-time only, 20–30 min)
- dags/generator.py reads orchestration/plan.yaml and prints a minimal TaskFlow skeleton with task IDs and docstrings

Code hint
Acceptance
Step 9 — Report (Markdown or PDF, 2–4 pages)
Quality checks
- black, flake8, pylint
- Run pipeline.py twice → same outputs
- yaml.safe_load works; task IDs match the table

Rubric for auto-evaluation (100 points)
Completion (30)
Code quality (20)
Documentation (20)
GitHub usage (10)
Gemini + Airflow planning (20)
Pass mark: 70.
Stretch ideas
- Partitioned output: data/processed/dt=YYYY-MM-DD/
- CLI filters: python pipeline.py --city hyderabad --start 2025-08-01
- slugify() for keys or joins
- pandas-gbq if credentials exist

Verification steps (fast)
- python pipeline.py --fmt csv → check three files in data/processed/
- orchestration/plan.yaml → load with a 5-line Python snippet
- dags/generator.py → confirm one stub per task ID
- docs/report.md for the questions → table → YAML flow

Keep it crisp. Keep XComs tiny. Treat the Gemini output like a draft RFD and refine it with your judgment.