### 00 – Architecture Overview (RAW NOTEBOOK)

**1. Data sources**
| Source | Format | Frequency | Owner |
|--------|--------|-----------|-------|
| ATAS trade exports | `.xlsx` | Manual (user upload) | Trading desk |
| Screen captures | `.png` | Real‑time (stream) | Pattern‑recog service |
| Economic indicators | REST/CSV | 15 min lag | External APIs |
| News feeds | RSS/JSON | Streaming | News API |
| Options chain | REST/CSV | EOD + intraday | Broker API |

---
**2. Ingestion patterns**
* **Push** – user drops files into `input/` → landing bucket.
* **Push** – the screen-capture microservice pushes PNG frames to the stream.
* **Pull** – cron jobs query APIs for macro, options, news.

We tag each record with `source`, `ingest_ts`, `version`.
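
The tagging step could look roughly like the sketch below. The function name `tag_record`, the source label `atas_xlsx`, and the sample trade are all made up for illustration; only the three field names come from the notes. It assumes `ingest_ts` is stored as a UTC ISO-8601 string.

```python
from datetime import datetime, timezone
from typing import Any


def tag_record(payload: dict[str, Any], source: str, version: int = 1) -> dict[str, Any]:
    """Return a copy of payload with the three lineage fields attached."""
    tagged = dict(payload)  # copy so the caller's record is not mutated
    tagged["source"] = source
    tagged["ingest_ts"] = datetime.now(timezone.utc).isoformat()  # assumed: UTC, ISO-8601
    tagged["version"] = version
    return tagged


# Hypothetical sample record, not real data.
row = tag_record({"symbol": "ES", "qty": 2, "price": 5000.25}, source="atas_xlsx")
```

Returning a copy keeps the raw payload untouched, which fits the append-only intent of the landing layer.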

---
**3. Storage layers**
```
┌────────────┐      ┌────────────┐      ┌────────────┐      ┌────────────┐
│  Landing   │ →    │  Staging   │ →    │  Feature   │ →    │   ML Ops   │
│  (S3 raw)  │      │ (Parquet)  │      │  Store     │      │ Artifacts  │
└────────────┘      └────────────┘      └────────────┘      └────────────┘
```
* **Landing** – append‑only, immutable, same format as received.
* **Staging** – columnar Parquet, partitioned (date/source).
* **Feature Store** – ML-ready features served from a database (e.g. DuckDB or Redshift).
* **Artifacts** – model weights, metrics, notebooks (S3).
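
One way to lay out the date/source partitioning in staging is a Hive-style `key=value` object key. The sketch below is an assumption about naming, not a decided convention: the `staging` prefix, the `staging_key` helper, and the file name are hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def staging_key(source: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned object key for the staging layer."""
    return str(
        PurePosixPath("staging")          # hypothetical top-level prefix
        / f"date={day.isoformat()}"       # first partition level: date
        / f"source={source}"              # second partition level: source
        / filename
    )


key = staging_key("trades", date(2024, 1, 15), "part-0000.parquet")
# key == "staging/date=2024-01-15/source=trades/part-0000.parquet"
```

Putting `date` first favours queries that scan one day across all sources; swapping the order would favour per-source scans.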

---
**4. Processing workflows**
| Step | Tool | Notes |
|------|------|-------|
| Validation | `pandas-schema` | shape, dtypes, nulls |
| Transformation | `pandas` / `polars` | cast, enrich |
| Image vectorisation | `torch` + CNN | converts PNG → embedding |
| Join & dedup | SQL | stage → feature store |

Trigger types:
* Batch (cron, hourly) for files & APIs.
* Stream (on message) for images.
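
To make the validation step concrete, here is a minimal sketch of the three checks from the table (shape, dtypes, nulls). It uses plain Python over a list of dicts rather than the `pandas-schema` API, and `TRADE_SCHEMA` and `validate_rows` are hypothetical names.

```python
from typing import Any

# Hypothetical expected schema: column name -> accepted Python types.
TRADE_SCHEMA: dict[str, tuple[type, ...]] = {
    "trade_id": (int,),
    "symbol": (str,),
    "qty": (int, float),
    "price": (int, float),
}


def validate_rows(rows: list[dict[str, Any]], schema: dict[str, tuple[type, ...]]) -> list[str]:
    """Return human-readable problems; an empty list means the batch passed."""
    problems: list[str] = []
    for i, row in enumerate(rows):
        # Shape: every row must carry exactly the expected columns.
        missing = schema.keys() - row.keys()
        extra = row.keys() - schema.keys()
        if missing or extra:
            problems.append(f"row {i}: missing={sorted(missing)} extra={sorted(extra)}")
        for col, types in schema.items():
            if col not in row:
                continue  # already reported as missing
            value = row[col]
            if value is None:
                problems.append(f"row {i}: {col} is null")
            elif isinstance(value, bool) or not isinstance(value, types):
                # bool is rejected explicitly because it is a subclass of int
                problems.append(f"row {i}: {col} has type {type(value).__name__}")
    return problems


good = {"trade_id": 1, "symbol": "ES", "qty": 2, "price": 5000.25}
bad = {"trade_id": "x", "symbol": None, "qty": 2}
issues = validate_rows([good, bad], TRADE_SCHEMA)
```

Collecting all problems instead of raising on the first one lets a batch job report every bad row in a single alert.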

---
**5. Real‑time pipeline**
```
Screen → WebSocket → Queue → CV service → Feature Store → Alert engine
```
* Latency target: < **2 s**.
* Queue: Kafka/Kinesis.
* CV service publishes each detected pattern as (`signal`, `prob`, `ts`).
* Alert engine pushes to GUI / mobile.
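
The message the CV service publishes might be modelled as below. This is a sketch under assumptions: JSON as the wire format, `ts` as seconds since the Unix epoch, and the class and helper names (`PatternSignal`, `within_budget`) are hypothetical. The pattern label in the example is invented.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional

LATENCY_BUDGET_S = 2.0  # the "< 2 s" target from the notes


@dataclass(frozen=True)
class PatternSignal:
    signal: str  # detected pattern name (label set is hypothetical)
    prob: float  # model confidence, assumed to lie in [0, 1]
    ts: float    # capture time, assumed to be seconds since the Unix epoch

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "PatternSignal":
        return cls(**json.loads(raw))


def within_budget(msg: PatternSignal, now: Optional[float] = None) -> bool:
    """True if the message is younger than the latency budget."""
    now = time.time() if now is None else now
    return (now - msg.ts) < LATENCY_BUDGET_S


msg = PatternSignal(signal="double_top", prob=0.87, ts=1_700_000_000.0)
roundtrip = PatternSignal.from_json(msg.to_json())
```

The alert engine could call `within_budget` on receipt and count misses, which gives a direct measurement against the latency target.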

---
**6. Batch pipeline (trades, options, macro)**
1. **Ingest** landing files.
2. **ETL** (Glue/Airflow) → staging.
3. **Aggregate** daily P&L, Greeks.
4. **Persist** to analytics DB.
5. **Train** nightly models.
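
As a rough illustration of the aggregation step, the sketch below sums signed cash flow per day and symbol. It is only a stand-in for P&L: it assumes `qty > 0` is a buy and `qty < 0` is a sell, ignores contract multipliers and fees, and does not mark open positions. The function name and sample trades are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_net_cash(trades: list[dict]) -> dict[tuple[str, str], float]:
    """Sum signed cash flow per (UTC date, symbol).

    Assumes qty > 0 is a buy and qty < 0 is a sell, so cash flow = -qty * price.
    Timestamps must be timezone-aware ISO-8601 strings.
    """
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for t in trades:
        ts = datetime.fromisoformat(t["timestamp"]).astimezone(timezone.utc)
        totals[(ts.date().isoformat(), t["symbol"])] += -t["qty"] * t["price"]
    return dict(totals)


# Invented sample trades, not real data.
trades = [
    {"symbol": "ES", "timestamp": "2024-01-15T14:30:00+00:00", "qty": 2, "price": 100.0},
    {"symbol": "ES", "timestamp": "2024-01-15T15:00:00+00:00", "qty": -2, "price": 101.5},
    {"symbol": "NQ", "timestamp": "2024-01-16T14:30:00+00:00", "qty": 1, "price": 50.0},
]
result = daily_net_cash(trades)
```

For a position that is opened and closed on the same day, the net cash flow equals the realised P&L before costs; anything held overnight needs a mark price on top of this.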

---
**7. Orchestration & lineage**
* Batch – Apache Airflow DAGs.
* Stream – Managed Kinesis + Lambda.
* Metadata – OpenLineage tags (dataset, job, run_id).

---
**8. Schemas (staging)**
```sql
CREATE TABLE trades_raw(
    trade_id BIGINT,
    symbol TEXT,
    timestamp TIMESTAMPTZ,
    qty NUMERIC,
    price NUMERIC,
    source TEXT,
    ingest_ts TIMESTAMPTZ,
    version INT
);

CREATE TABLE news_raw(
    uuid TEXT,
    headline TEXT,
    body TEXT,
    published TIMESTAMPTZ,
    sentiment NUMERIC,
    source TEXT,
    ingest_ts TIMESTAMPTZ,
    version INT
);
```
(Analogous tables for `options_raw`, `macro_raw`, `img_signals`.)
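
The `version` column makes the "Join & dedup" step straightforward: keep the highest version per `trade_id`. The sketch below tries this with an in-memory SQLite database standing in for the real staging DB; the timestamp columns are left out for brevity and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite is only a stand-in for the staging DB
conn.execute(
    "CREATE TABLE trades_raw (trade_id BIGINT, symbol TEXT, qty NUMERIC, "
    "price NUMERIC, source TEXT, version INT)"
)
conn.executemany(
    "INSERT INTO trades_raw VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "ES", 2, 5000.25, "atas_xlsx", 1),
        (1, "ES", 2, 5000.50, "atas_xlsx", 2),  # corrected re-upload of trade 1
        (2, "NQ", 1, 17000.0, "atas_xlsx", 1),
    ],
)

# Keep only the highest version of each trade_id.
latest = conn.execute(
    """
    SELECT t.trade_id, t.symbol, t.price, t.version
    FROM trades_raw AS t
    JOIN (
        SELECT trade_id, MAX(version) AS version
        FROM trades_raw
        GROUP BY trade_id
    ) AS m ON m.trade_id = t.trade_id AND m.version = t.version
    ORDER BY t.trade_id
    """
).fetchall()
```

The `GROUP BY` plus self-join form is used here because it runs on most SQL engines; a `ROW_NUMBER()` window function would be an equivalent choice where supported.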

---
**9. Model training & online inference**
| Type | Algo | Freshness |
|------|------|-----------|
| Pattern recognition | CNN/ViT | real-time |
| News sentiment | FinBERT | ≤5 min |
| Signal ensemble | Gradient boosting | hourly |

Models are logged to MLflow; when a new model version is published, the inference endpoint hot-swaps to it.
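
A hot-swap can be as simple as replacing a model reference under a lock. The sketch below shows only that mechanism; it makes no MLflow calls, and how a new version is detected (polling, webhook) is left open. `HotSwapEndpoint` and the `Model` callable interface are hypothetical.

```python
import threading
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]  # hypothetical model interface


class HotSwapEndpoint:
    """Serve predictions while allowing the model to be replaced at runtime."""

    def __init__(self, model: Model, version: int) -> None:
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def swap(self, model: Model, version: int) -> bool:
        """Install a newer model; ignore stale or duplicate versions."""
        with self._lock:
            if version <= self._version:
                return False
            self._model, self._version = model, version
            return True

    def predict(self, features: Sequence[float]) -> tuple[float, int]:
        # Take a consistent (model, version) pair, then run inference outside the lock.
        with self._lock:
            model, version = self._model, self._version
        return model(features), version


# Toy "models" for illustration only.
endpoint = HotSwapEndpoint(lambda xs: sum(xs), version=1)
before = endpoint.predict([1.0, 2.0])
endpoint.swap(lambda xs: max(xs), version=2)
after = endpoint.predict([1.0, 2.0])
```

Returning the version alongside each prediction makes it possible to trace an alert back to the model that produced it.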

---
**10. Alerting & monitoring**
* Data quality – Great Expectations, alerts via Slack.
* Pipeline – AWS CloudWatch / Grafana.
* Model drift – statistical tests + alert.
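
The notes do not say which statistical test is meant for drift. One common candidate is the two-sample Kolmogorov-Smirnov statistic, sketched below in plain Python. The threshold value and the sample data are hypothetical, and a production check would normally use a p-value from a statistics library rather than a fixed cut-off.

```python
from bisect import bisect_right


def ks_statistic(reference: list[float], live: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two ECDFs."""
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    for x in ref + cur:  # the ECDFs only change at observed points
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap


DRIFT_THRESHOLD = 0.5  # hypothetical cut-off; tune per feature

reference = [0.1, 0.2, 0.3, 0.4, 0.5]
same = ks_statistic(reference, reference)
shifted = ks_statistic(reference, [x + 1.0 for x in reference])
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all, so the alert rule reduces to comparing the value against the chosen threshold.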

---
**11. Security & access**
* S3 bucket policies: landing is write-only for producers; staging is read-only for analysts.
* IAM roles per microservice.
* Secrets in AWS Secrets Manager.

---
**12. Next tasks**
1. Finalise schema for `img_signals`.
2. Prototype Kinesis producer for screen‑capture.
3. Define the feature-engineering pipeline (statistical features) in the Feature Store.
4. Draft Airflow DAG for trades ETL.