- bin/: Executable examples and demos (e.g., `hello_world.ml`, `moons_benchmark.ml`).
- test/: Expect and inline tests grouped by topic (`einsum/`, `operations/`, `training/`, `ppx/`).
- docs/: Slides and reference docs; datasets/: helper data; build_files/ and log_files/: generated artifacts.
- Global configuration is explained in `ocannl_config.example`.

## Build, Test, and Development Commands
- opam deps: `opam install . --deps-only` (OCaml ≥ 5.3 per `dune-project`).

[…]

- PRs: clear description, linked issues, a reproduction or `dune runtest` output, and a note on which backend(s) were exercised. Include any new example commands.

## Configuration & Backends
- Backend selection and runtime options are read from the file `ocannl_config` in the current directory (or `test/config/ocannl_config` for tests), from environment variables such as `OCANNL_BACKEND=sync_cc`, or from command-line arguments such as `--ocannl_backend=sync_cc`. Environment variables other than `OCANNL_BACKEND` (which has dedicated support) are unreliable for tests, and command-line arguments do not work with `dune test`, which runs multiple test executables; for example, `OCANNL_BACKEND=sync_cc dune runtest test/einsum` is the supported way to pick a backend for a test run. See `ocannl_config.example` for available keys (debug logging, device, precision).

**Developer Cheatsheet**
- **Packages:** `arrayjit` (compiler/backends) and `neural_nets_lib` (DL framework). Build high-level tensors in `lib/`, lower/compile in `arrayjit/`.
- **Execution Model:** Express computations as tensors → derive forward/backprop → infer shapes/projections → lower to `Assignments` → compile/link per backend → run on streams (CPU cores/CUDA streams).
- **Key Types:** `Tensor.t` (value/grad nodes), `Tnode.t` (node-level arrays), `Assignments.comp` (accumulating statements), `Indexing.projections` (loop derivation), `Ndarray.t` (host/device buffers).
- **Backends:** `sync_cc`/`multicore_cc` (C via schedulers), `gccjit`, `cuda`, `metal` (if built). Use `Backends.fresh_backend ()` in examples/tests; see the sketch after this list.

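A minimal end-to-end sketch of this flow. It is illustrative rather than copy-paste ready: it assumes the library is opened as `Ocannl`, and `Train.forward_once` is a hypothetical driver standing in for the derive/lower/compile/link/run steps (see `bin/hello_world.ml` and `lib/train.ml` for the real entry points):

```ocaml
open Ocannl

let () =
  (* Pick whichever backend the config or OCANNL_BACKEND selects, e.g. sync_cc. *)
  let backend = Backends.fresh_backend () in
  (* A tiny differentiable expression; { w; o = [ 3 ] } declares an inline
     parameter with output dimension 3, and the remaining shapes are inferred. *)
  let%op y = relu ({ w; o = [ 3 ] } + { b; o = [ 3 ] }) in
  (* Hypothetical driver: derives the forward code, lowers it to Assignments,
     compiles and links it on a stream of [backend], then runs it. *)
  Train.forward_once backend y
```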
**Syntax Extensions**
- **`%op` (operations):** Builds differentiable tensors using `Operation.TDSL`.
  - **Inline params:** `{ w; o = [ dims ] }` creates parameters; supplying an initialization additionally requires `Operation.PDSL` in scope.
  - **Convenience:** Regular OCaml works for many tensor expressions; `%op` mainly improves labels and inline declarations.
- **`%cd` (code):** Builds `Assignments.comp` for forward/backward code via `Operation.NTDSL` (tensors created inside are non-differentiable); see the sketch after this list.
  - **Accum ops:** Infix assignment operators pick the accumulation: `=+`, `=-`, `=*`, `=/`, `=**`, and variants such as `=:+`.
  - **Projections:** Provide `~projections` or rely on the mnemonics (`v`, `v1`, `v2`, `g`, `g1`, `g2`, `lhs`, `rhs1`, `rhs2`) to select slots.
  - **Array refs:** `.v` for the value node, `.grad` for the gradient node, `.merge` for stream merge buffers.
  - **Embedded tensors:** `%cd` auto-inserts forward code for created tensors and tracks `embedded_nodes` to avoid recomputation.
  - **Pow operator:** Use `**.` for pointwise power with a numeric exponent; gradients are specialized (fast path for p=1,2).
- **Generalized einsum:** Use `~logic:"...=>..."` for concise projections; shapes use `batch|input->output` notation.

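A sketch of how these pieces combine when writing an operation's code. The style follows `lib/operation.ml`, but the argument labels and the surrounding plumbing are simplified and illustrative:

```ocaml
(* Forward and backprop assignments for a multiply-accumulate style binop.
   The mnemonics v/v1/v2 and g/g1/g2 select the value and gradient slots of
   the result and the two arguments; ~projections drives the loop structure. *)
let op_asn ~v ~t1 ~t2 ~projections = [%cd v =:+ v1 * v2]
let grad_asn ~v ~g ~t1 ~t2 ~projections = [%cd g1 =+ g * v2; g2 =+ v1 * g]
```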
**Shape & Projection Inference**
- **Pipeline:** `propagate_shapes` during build; `finish_inference` before jitting closes shapes (LUB, or 1/broadcastable); then `derive_projections` freshens projection ids to avoid cross-op contamination.
- **Monomorphic for now:** Existential `row`/`dim` variables; future polymorphism could reuse `%op ~config` functions with abstract namespaces.
- **Rows:** Three rows per tensor: batch | input -> output; broadcasting can happen "in the middle", with fixed head/tail axes. See the sketch after this list.
- **Indexing:** Projections unify per-assignment instances (union-find) and yield iterators for the product dims; dim=1 maps to `Fixed_idx 0`.
- **Convolutions:** Low-level buffers include padding in `dims`; high-level shapes exclude it, so padding becomes observable after forcing dims.

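A small illustration of how the rows get filled in, reusing the inline-parameter syntax from the cheatsheet (dimension numbers are arbitrary; the comments describe the intended inference, not the output of a tool):

```ocaml
open Ocannl

(* For a linear layer, [w]'s output row is pinned to [ 16 ] by the inline
   spec, its input row is unified with [x]'s output row through the matrix
   multiplication, and [x]'s batch row broadcasts through unchanged.
   propagate_shapes records these constraints as the expression is built;
   finish_inference closes any remaining variables before jitting. *)
let%op linear x = ({ w; o = [ 16 ] } * x) + { b; o = [ 16 ] }
```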
**Backend Anatomy**
- **Frontend modules:** `Task`, `Ops`, `Ndarray`, `Tnode` (per-device arrays, can be virtual), `Indexing`, `Assignments`, `Low_level`.
- **Interfaces:** `Backend_intf` (records parametric in `'buffer_ptr`, `'dev`, `'runner`, `'event`); `Backend_impl` for implementations; `C_syntax` helpers.
- **Implementations:** `Cc_backend`, `Gcc_backend`, `Cuda_backend`, `Metal_backend`, plus `Schedulers` for CPU parallelism.
- **Lifting:** `Backends.Add_device` + `Schedulers` → CPU backends; `Raise_backend` maps `Low_level` to `Assignments` and adds buffer retrieval and syncing.
- **Lifecycle:** Compile routines in batches; link per-stream contexts; free arrays with `Backends.finalize`. See the sketch after this list.

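A sketch of that lifecycle in code. Every function name below is chosen to mirror the device/stream/context vocabulary of `Backend_intf`, but treat the exact names and signatures as assumptions and check `arrayjit/lib/backend_intf.ml` before relying on them:

```ocaml
(* Illustrative lifecycle: obtain a device, open a stream, build a context,
   compile a computation, link it to the context, then schedule its task. *)
let run_once comp =
  let module Backend = (val Backends.fresh_backend ()) in
  let ctx = Backend.make_context (Backend.new_stream (Backend.get_device ~ordinal:0)) in
  let routine = Backend.link ctx (Backend.compile comp) in
  Task.run routine.schedule
```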
**Scheduling, Streams, Transfers**
- **Streams:** A loose notion (CPU core/CUDA stream). Linking binds compiled code to a stream; scheduling preserves write-before-read order via events.
- **Transfers:** `from_host`, `to_host`, `device_to_device` are scheduled like compute; the destination waits (non-blocking) on the source.
- **Merge buffers:** One per stream; use `.merge` in `%cd` (e.g., `[%cd p.grad =+ p.grad.merge]`; see the sketch after this list). Modes: `Streaming_for` (source pointer; may fall back to `Copy` across devices) and `Copy` (a physical buffer grown as needed).
- **Auto host transfers:** If `automatic_host_transfers` is enabled:
  - `Tnode.do_read`/`do_write` perform the synchronization and schedule `to_host` where needed; fields: `prepare_read`, `prepare_write`, `devices_not_lagging_host`.
  - `Raise_backend.sync_routine` pre-schedules `from_host` for untagged inputs; `update_writer_event` tags writers and sets `to_host`.
  - `Raise_backend.alloc_if_needed` schedules `from_host` for constants and tags the device.

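For example, accumulating a parameter's gradient contribution produced by another stream goes through the merge buffer; the `%cd` line is the one quoted above, while the wrapper name and the surrounding `device_to_device` scheduling are illustrative and omitted:

```ocaml
(* Add the merge-buffer copy of [p]'s gradient into [p]'s gradient. The merge
   buffer is filled beforehand by a device_to_device transfer scheduled on
   (or streamed from) the producing stream. *)
let merge_grad (p : Tensor.t) = [%cd p.grad =+ p.grad.merge]
```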
**Debugging & Tracing**
- **Logs:** Enable tracing in the config; `%cd` supports block comments to annotate generated files; debug prints/plots appear in the logs.
- **PPX tips:** Keep `%op` parameters non-nested when labels matter; avoid capturing inner function params for labels.
- **Shape issues:** Inspect `Tensor.shape` after `finish_inference`; watch for padding effects when dims are forced. See the sketch after this list.
- **Streams/merges:** A mismatch between the expected and the scheduled merge node is detected at scheduling time; check `.merge` usage and stream contexts.
- **Backend checks:** Start with `sync_cc` for clarity; move to `multicore_cc`/`cuda` once semantics are validated.

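When chasing shape or value issues, dumping a tensor and its subtree right after running the forward code is often enough. The printing helpers below are the ones the demo programs use; treat the exact labels and arguments as assumptions:

```ocaml
(* Print the tensor with its inferred shape and gradient, then the whole
   subtree of subtensors, to cross-check shapes and values on sync_cc. *)
let debug_dump t =
  Tensor.print ~with_code:false ~with_grad:true `Default t;
  Tensor.print_tree ~with_grad:true ~depth:9 t
```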
**Adding Features (Guidelines)**
- **New op:** Define in `arrayjit/lib/ops.ml` + `Ir.Ops`; add an infix if needed; implement forward/backprop with `%cd` (use `~projections`).
- **Tensor API:** Prefer small composable helpers in `lib/operation.ml`; mirror `%op` conveniences when useful. See the sketch after this list.
- **Shape rules:** Add constraints in `lib/shape.ml` and rows in `lib/row.ml`; ensure `propagate_shapes` derives the intended LUBs; update `derive_projections` if new projection forms are introduced.
- **Backend codegen:** Prefer `Low_level` lowering hooks; reuse `C_syntax`; keep kernel/routine boundaries stable for batching.
- **Docs/tests:** Add `%expect` examples under `test/` showing shapes, projections, and generated code snippets.

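Often a new convenience op needs no new primitive at all and can be layered on existing `%op` operations. A hedged sketch (the helper name is made up; `relu` and pointwise `*.` are existing operations):

```ocaml
open Ocannl

(* A composable helper built purely from existing ops: squared ReLU.
   Being an %op definition, it gets labels and shape inference for free. *)
let%op relu_squared x = relu x *. relu x
```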
**Testing & Validation**
- **Unit slices:** Run subsets like `dune runtest test/einsum` or `dune runtest test/operations` to iterate quickly; a minimal `%expect` skeleton follows this list.
- **Golden files:** Many tests diff the emitted `.ll`/`.c`/`.cu`/`.metal` files; update expected outputs only when the change in semantics is intended.
- **Backends in CI:** Use `OCANNL_BACKEND=sync_cc` locally first; selectively exercise `cuda`/`metal` if available.

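A rough `%expect_test` skeleton in the style of the suites under `test/`; it reuses the hypothetical `Train.forward_once` driver and `relu_squared` helper from the earlier sketches, and the expectation is a placeholder to be filled by `dune promote`:

```ocaml
open Ocannl

let%expect_test "relu_squared forward" =
  let%op y = relu_squared { w; o = [ 4 ] } in
  (* Runs on whichever backend the test config selects (sync_cc in CI). *)
  Train.forward_once (Backends.fresh_backend ()) y;
  Tensor.print ~with_code:false ~with_grad:false `Default y;
  [%expect {| (to be filled in by dune promote) |}]
```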
**Research Tips**
- **Read paths:** `lib/operation.ml` (ops), `lib/tensor.ml` (graph), `lib/shape.ml`/`lib/row.ml` (inference), `arrayjit/lib/*backend*.ml` (runtimes), `arrayjit/lib/indexing.ml` (projections), `arrayjit/lib/low_level.ml` (loops).
- **Compare designs:** Multi-stream + merge buffers vs. typical single-stream AD frameworks; generalized einsum for projections vs. manual loops.
- **Trace small models:** Use `bin/micrograd_demo*.ml` and `bin/moons_demo*.ml` with `%cd` comments and a higher log level to understand the pipeline.
- **Experiment knobs:** Toggle `automatic_host_transfers`, switch backends, vary precision, inspect shapes before/after jitting.