# Week 2 — Pipelines & ETL with Naming Conventions

**Learning Objectives (Week 2 – Pipelines & ETL)**  
- Build a reproducible ETL pipeline from Redshift to feature tables.  
- Externalize configuration and adopt consistent naming conventions (ZAP).  
- Package pipeline steps for CI and future automation.

## Company Naming Conventions (Proposed, and Why They Matter)
- Pipelines: `zap-ml-etl-<domain>-<purpose>`  
- Jobs: `zap-ci-<pipeline>-<step>`  
- S3 paths: `s3://zap-ml/<env>/<pipeline>/<artifact>`  
- Feature tables: `zap_feature_<entity>_<version>`  
> Consistent naming makes CI/CD traceable and eases auditing.
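
The patterns above can be enforced in code rather than by convention alone. The following is a minimal sketch, not an official ZAP utility: the helper names (`pipeline_name`, `job_name`, `s3_path`, `feature_table`), the allowed environment list, the lowercase-alphanumeric token rule, and the `v<integer>` version format are all assumptions made for illustration.

```python
import re

# Assumption: each placeholder is one lowercase alphanumeric token without separators.
_TOKEN = re.compile(r"[a-z0-9]+")
# Assumption: illustrative environment names, not an official list.
ENVS = ("dev", "staging", "prod")


def _check(label, value):
    """Reject values that would break the naming pattern."""
    if not _TOKEN.fullmatch(value):
        raise ValueError(f"{label} must be lowercase alphanumeric, got {value!r}")
    return value


def pipeline_name(domain, purpose):
    return f"zap-ml-etl-{_check('domain', domain)}-{_check('purpose', purpose)}"


def job_name(pipeline, step):
    # Assumption: <pipeline> is the full pipeline name, prefix included.
    return f"zap-ci-{pipeline}-{_check('step', step)}"


def s3_path(env, pipeline, artifact):
    if env not in ENVS:
        raise ValueError(f"env must be one of {ENVS}, got {env!r}")
    return f"s3://zap-ml/{env}/{pipeline}/{artifact}"


def feature_table(entity, version):
    # Assumption: versions are rendered as v<integer>.
    return f"zap_feature_{_check('entity', entity)}_v{int(version)}"


name = pipeline_name("payments", "features")
print(name)                                # zap-ml-etl-payments-features
print(s3_path("dev", name, "features.csv"))
print(feature_table("customer", 1))        # zap_feature_customer_v1
```

Building names through one module means a typo fails fast with a `ValueError` instead of silently creating a stray S3 prefix or table.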

## Exercises
1. Parameterize a pipeline with `config.yaml` (env, data sources, output path).  
2. Implement steps: extract (Redshift), transform (clean, encode), load (persist to S3/local).  
3. Add simple data validation checks (row counts, schema conformity).  
4. Package as a CLI: `python pipelines/etl.py --config config.yaml`.  
5. Emit artifacts to `artifacts/week2/` and logs to `logs/`.
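
For Exercises 4 and 5, one possible shape for `pipelines/etl.py` is sketched below using only the standard library. The `--log-dir` flag, the `zap.etl` logger name, the `etl.log` file name, and the `run_pipeline` placeholder are assumptions for illustration; in a real script `run_pipeline` would load the YAML config and call the extract, transform, validate, and load steps.

```python
import argparse
import logging
import os


def build_parser():
    parser = argparse.ArgumentParser(description="Run the Week 2 ETL pipeline.")
    parser.add_argument("--config", default="config.yaml",
                        help="Path to the YAML config file.")
    # Hypothetical flag: lets CI redirect logs without editing the script.
    parser.add_argument("--log-dir", default="logs",
                        help="Directory for log files.")
    return parser


def setup_logging(log_dir):
    """Create the log directory and attach a file handler once."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger("zap.etl")
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        handler = logging.FileHandler(os.path.join(log_dir, "etl.log"))
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def run_pipeline(config_path, logger):
    # Placeholder: load the config and run extract/transform/validate/load here.
    logger.info("Running pipeline with config %s", config_path)


def main(argv=None):
    """Return a process exit code: 0 on success, 1 on failure."""
    args = build_parser().parse_args(argv)
    logger = setup_logging(args.log_dir)
    try:
        run_pipeline(args.config, logger)
    except Exception:
        logger.exception("Pipeline failed")
        return 1
    return 0

# In pipelines/etl.py, finish with the usual entry-point guard:
# if __name__ == "__main__":
#     raise SystemExit(main())
```

Returning an exit code from `main` lets a CI job fail when the pipeline fails, and accepting `argv` makes the entry point easy to call from tests.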

## Peer Validation
- **Peer Review Checklist:**  
  - Clear naming / paths.  
  - Configurable environments.  
  - Validation checks implemented.  
  - Pipeline script runnable end-to-end.
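
To make the "validation checks implemented" item concrete, here is a hedged sketch of the row-count and schema-conformity checks from Exercise 3. The function name `validate_frame`, the `EXPECTED_SCHEMA` mapping, and the choice to return a list of problems instead of raising are assumptions, and the expected columns mirror the synthetic data used in this notebook.

```python
import pandas as pd
from pandas.api import types as ptypes

# Assumption: expected columns and dtype checks match the synthetic dataset.
EXPECTED_SCHEMA = {
    "amount": ptypes.is_float_dtype,
    "label": ptypes.is_integer_dtype,
}


def validate_frame(df, expected_schema=EXPECTED_SCHEMA, min_rows=1):
    """Return a list of problems; an empty list means the frame passed."""
    problems = []
    # Row-count check.
    if len(df) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(df)}")
    # Schema check: every expected column must exist with an acceptable dtype.
    for column, dtype_ok in expected_schema.items():
        if column not in df.columns:
            problems.append(f"missing column {column!r}")
        elif not dtype_ok(df[column]):
            problems.append(
                f"column {column!r} has unexpected dtype {df[column].dtype}")
    return problems


sample = pd.DataFrame({"amount": [12.5, 80.0], "label": [0, 1]})
print(validate_frame(sample))  # []
```

Collecting every problem before reporting gives a reviewer the full picture in one run; the caller can still raise if the list is non-empty.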

In [None]:
# %pip install pandas pyyaml boto3 --quiet
import os

import numpy as np
import pandas as pd
import yaml

os.makedirs("artifacts/week2", exist_ok=True)
os.makedirs("logs", exist_ok=True)

CONFIG_PATH = "config.yaml"

default_cfg = {
  "env": "dev",
  "sources": {
    "redshift": {
      "enabled": False,
      "query": "SELECT * FROM zap_sandbox.transactions LIMIT 10000;",
      "host": "YOUR-REDSHIFT-ENDPOINT",
      "port": 5439,
      "database": "dev",
      "user": "YOUR-USER",
      "password": "YOUR-PASSWORD"
    },
    "synthetic": {
      "enabled": True,
      "n_rows": 5000
    }
  },
  "outputs": {
    "local_csv": "artifacts/week2/features.csv",
    "s3_path": "s3://zap-ml/dev/zap-ml-etl-core/features.csv"
  }
}

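if not os.path.exists(CONFIG_PATH):
    # Write the defaults only once so local edits are never overwritten.
    with open(CONFIG_PATH, "w") as f:
        yaml.safe_dump(default_cfg, f, sort_keys=False)
    print(f"Wrote default config to {CONFIG_PATH}. Edit values and re-run.")
else:
    print(f"Using existing config at {CONFIG_PATH}.")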
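
The default config keeps placeholder Redshift credentials in `config.yaml`. One common way to keep real secrets out of that file is to overlay them from environment variables after loading. The sketch below assumes hypothetical variable names (`ZAP_REDSHIFT_HOST`, `ZAP_REDSHIFT_USER`, `ZAP_REDSHIFT_PASSWORD`) and a hypothetical helper `apply_env_overrides`; neither is part of the original pipeline.

```python
import copy
import os

# Hypothetical mapping from config keys to environment variable names.
REDSHIFT_ENV_VARS = {
    "host": "ZAP_REDSHIFT_HOST",
    "user": "ZAP_REDSHIFT_USER",
    "password": "ZAP_REDSHIFT_PASSWORD",
}


def apply_env_overrides(cfg, environ=None):
    """Return a copy of cfg with Redshift settings replaced by env values, if set."""
    environ = os.environ if environ is None else environ
    out = copy.deepcopy(cfg)  # leave the caller's config untouched
    redshift = out["sources"]["redshift"]
    for key, var in REDSHIFT_ENV_VARS.items():
        if var in environ:
            redshift[key] = environ[var]
    return out
```

A typical call would be `cfg = apply_env_overrides(yaml.safe_load(f))`, so the YAML file can be committed with placeholders only.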

In [None]:
# ---- ETL Implementation (extract -> transform -> load) ----
def extract(cfg):
    if cfg["sources"]["redshift"]["enabled"]:
        # %pip install redshift_connector --quiet
        import redshift_connector  # imported lazily; only needed when Redshift is enabled
        rs = cfg["sources"]["redshift"]
        conn = redshift_connector.connect(
            host=rs["host"],
            port=rs["port"],
            database=rs["database"],
            user=rs["user"],
            password=rs["password"],
        )
        try:
            df = pd.read_sql(rs["query"], conn)
        finally:
            # Always release the connection, even if the query fails.
            conn.close()
        return df
    else:
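        synth = cfg["sources"]["synthetic"]
        n = synth["n_rows"]
        # Optional `seed` key (not in the default config) makes runs repeatable;
        # when it is absent the data is randomly seeded, as before.
        rng = np.random.default_rng(synth.get("seed"))
        df = pd.DataFrame({
            "amount": rng.gamma(2.0, 50.0, n),
            "country": rng.choice(["PT","ES","FR","DE"], size=n),
            "channel": rng.choice(["web","store","mobile"], size=n),
            "label": rng.choice([0,1], size=n, p=[0.7,0.3])
        })
        return df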

def transform(df):
    df = df.copy()
    # Simple cleaning
    df = df.dropna()
    # Example encoding
    df = pd.get_dummies(df, columns=["country","channel"], drop_first=True)
    return df

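def validate(df):
    # Raise explicitly: `assert` statements are skipped when Python runs with -O.
    if len(df) == 0:
        raise ValueError("Empty dataframe after transform")
    if "label" not in df.columns:
        raise ValueError("Missing target column 'label'")
    return True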

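def load(df, path):
    # Create the parent directory so the step also works with a custom output path.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    df.to_csv(path, index=False)
    return path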

with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)

df_raw = extract(cfg)
df_feat = transform(df_raw)
validate(df_feat)
out_path = cfg["outputs"]["local_csv"]
load(df_feat, out_path)
print(f"Saved features to: {out_path}")
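
The `load` step above writes only the local CSV, while the config also carries an `s3_path`. A minimal sketch of the S3 half follows, assuming `boto3` is installed and AWS credentials are already configured in the environment. The helper names `split_s3_uri` and `upload_to_s3` are hypothetical; `upload_file` is the standard boto3 S3 client method.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not a valid s3://bucket/key URI: {uri!r}")
    return parsed.netloc, key


def upload_to_s3(local_path, s3_uri, client=None):
    """Upload a local file to the location named by s3_uri."""
    bucket, key = split_s3_uri(s3_uri)
    if client is None:
        # Imported lazily so local-only runs do not need boto3.
        import boto3
        client = boto3.client("s3")
    client.upload_file(local_path, bucket, key)
    return s3_uri
```

After `load(df_feat, out_path)`, a call such as `upload_to_s3(out_path, cfg["outputs"]["s3_path"])` would push the same file to S3. Accepting a `client` argument keeps the function testable with a stub instead of a real AWS connection.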