
Commit 07092c0

Some more progress on concatenation-along-axes before we give up
1 parent 4dabf5b commit 07092c0

4 files changed: +31 −39 lines changed

README.md

Lines changed: 7 additions & 7 deletions
@@ -66,24 +66,24 @@ NOTE: debug logging from CUDA in complex settings is a bit tricky, it involves a
 
 This is very tentative.
 
-* 0.6: Replicate the scaffolding from [llm.c](https://github.com/karpathy/llm.c) for training GPT-2.
+* 0.6: Hopefully-efficient expressivity: concatenation and splitting, convolution, maybe block tensors.
+  * Requires extending expressivity of projections and the generalized einsum notation.
+  * Then, we can add convnet building blocks and corresponding examples starting with MNIST.
+  * Verify or rethink usefulness of dimension labels, and whether to introduce axis labels.
+* 0.7: Replicate the scaffolding from [llm.c](https://github.com/karpathy/llm.c) for training GPT-2.
   * Useful building blocks for models in [lib/nn_blocks.ml](lib/nn_blocks.ml).
   * A language model example.
   * Port (translate or bind) the Python files from [llm.c](https://github.com/karpathy/llm.c) to implement tokenization, data loading and saving etc.
   * At the end of 0.6.x, we should have an apples-to-apples benchmark comparing OCANNL to [llm.c](https://github.com/karpathy/llm.c) for both CPU and GPU.
-* 0.7: Optimize performance -- low hanging fruit.
+* 0.8: Optimize performance -- low hanging fruit.
   * First harvested from [Fast Multidimensional Matrix Multiplication on CPU from Scratch](https://siboehm.com/articles/22/Fast-MMM-on-CPU).
   * Then harvested from [How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog](https://siboehm.com/articles/22/CUDA-MMM).
   * Finally from [llm.c](https://github.com/karpathy/llm.c).
   * These will require splitting a routine into multiple CUDA kernels.
-* 0.8: A new abstraction layer automating compilation/linking, execution, and some data transfers.
+* 0.9: A new abstraction layer automating compilation/linking, execution, and some data transfers.
   * E.g. host-device transfers: copy from host if host update is later than the previous device update.
   * Concise syntax for transfers into the merge buffer since we know which tensor node is transferred and where to.
   * At the end of 0.8.x, OCANNL has a REPL.
-* 0.9: Hopefully-efficient expressivity: block tensors, convolution.
-  * Requires extending expressivity of projections and the generalized einsum notation.
-  * Then, we can add convnet building blocks and corresponding examples starting with MNIST.
-  * Verify or rethink usefulness of dimension labels, and whether to introduce axis labels.
 * 0.10: Optimize performance: program search.
   * Instead of dynamic scheduling as in tinygrad, we can schedule statically by program search.
   * We should also reproduce the search that tinygrad is doing.

lib/row.ml

Lines changed: 2 additions & 7 deletions
@@ -875,12 +875,7 @@ let%debug5_sexp solve_dim_ineq ~(stage : stage) ~(cur : dim) ~(subr : dim) (env
       @@ Shape_error
            ( "Cannot compare Prod with unresolved variables in inequality",
              [ Dim_mismatch [ cur; subr ] ] )
-  | Prod _, Var _ | Var _, Prod _ ->
-      (* Similar to above - we need all dimensions resolved to compare *)
-      raise
-      @@ Shape_error
-           ("Cannot compare Prod with variables in inequality", [ Dim_mismatch [ cur; subr ] ])
-  | Var cur_v, Var subr_v -> (
+  | Var cur_v, Var subr_v -> (
       match (Map.find env.dim_env cur_v, Map.find env.dim_env subr_v) with
       | Some (Bounds_dim { cur = cur1; _ }), _ when List.mem ~equal:equal_dim_var cur1 subr_v ->
          ([ Dim_eq { d1 = cur; d2 = subr } ], env)
@@ -1055,7 +1050,7 @@ let%debug5_sexp solve_dim_ineq ~(stage : stage) ~(cur : dim) ~(subr : dim) (env
            Map.set env.dim_env ~key:subr_v
              ~data:(Bounds_dim { lub = Some cur; cur = cur2; subr = subr2; constr = constr2 });
        } ))
-  | Var _, Dim _ (* when d2 > 1 *) -> ([ Dim_eq { d1 = cur; d2 = subr } ], env)
+  | Var _, (Dim _ (* when d2 > 1 *) | Prod _) -> ([ Dim_eq { d1 = cur; d2 = subr } ], env)
   | Dim _, Dim _ ->
      raise
      @@ Shape_error ("dimension comparison for axis: mismatch", [ Dim_mismatch [ cur; subr ] ])
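For intuition, here is a minimal self-contained sketch, using toy stand-in types rather than OCANNL's actual `Row` definitions, of what the changed arm above does: when the supertype side (`cur`) is still a variable and the subtype side (`subr`) is already a concrete `Dim` or `Prod`, the solver now emits an equality constraint instead of raising a shape error.

```ocaml
(* Toy illustration only; OCANNL's real dim and constraint types are richer. *)
type dim =
  | Dim of int       (* a concrete axis size *)
  | Var of string    (* an inference variable *)
  | Prod of dim list (* an axis that is a product of other axes *)

type constraint_result = Dim_eq of { d1 : dim; d2 : dim }

(* cur is the supertype side, subr the subtype side, as in solve_dim_ineq. *)
let solve_var_vs_concrete ~(cur : dim) ~(subr : dim) : constraint_result list =
  match (cur, subr) with
  | Var _, (Dim _ | Prod _) ->
      (* The variable must match the concrete dimension: emit an equation. *)
      [ Dim_eq { d1 = cur; d2 = subr } ]
  | _ -> []

let () =
  match solve_var_vs_concrete ~cur:(Var "i") ~subr:(Prod [ Dim 2; Dim 3 ]) with
  | [ Dim_eq _ ] -> print_endline "Var vs. Prod resolved to an equation"
  | _ -> print_endline "no constraint emitted"
```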

lib/row.mli

Lines changed: 0 additions & 1 deletion
@@ -28,7 +28,6 @@ val get_dim : d:int -> ?label:string -> unit -> dim
 val dim_to_int_exn : dim -> int
 val dim_to_string : [> `Only_labels ] -> dim -> string
 
-
 (** Extracts all dimension variables from a dim, including from nested products. *)
 val dim_vars : dim -> dim_var list

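The doc comment above says `dim_vars` extracts variables "including from nested products", which suggests a recursive traversal. A minimal sketch of that behaviour over toy stand-in types (not the actual signatures from `lib/row.mli`) might look like:

```ocaml
(* Toy types standing in for Row.dim and Row.dim_var. *)
type dim_var = string
type dim = Dim of int | Var of dim_var | Prod of dim list

(* Collect every variable, descending into nested Prod constructors. *)
let rec dim_vars : dim -> dim_var list = function
  | Dim _ -> []
  | Var v -> [ v ]
  | Prod ds -> List.concat_map dim_vars ds

let () =
  dim_vars (Prod [ Var "i"; Prod [ Dim 2; Var "j" ] ])
  |> String.concat ", "
  |> print_endline (* prints: i, j *)
```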
lib/shape_inference.md

Lines changed: 22 additions & 24 deletions
@@ -112,7 +112,7 @@ type logic =
 
 ### Non-tensor-like constraints
 
-The above mechanisms (excluding `dim_constraint` and `row_constraint`) are sufficient to express tensor applications such as inner and outer products, axis permutations. They cannot directly express: size constraints, fixed position indexing (except for the special case of position 0), axis concatenation and "reverse concatenation" / splitting, strides, convolutions. At present, we implement size constraints and fixed position indexing.
+The above mechanisms (excluding `dim_constraint` and `row_constraint`) are sufficient to express tensor applications such as inner and outer products, axis permutations. Axis concatenation and "reverse concatenation" / splitting is handled by the representation above via the "product" `Prod` dimension constructor. The above mechanisms cannot directly express: size constraints, fixed position indexing (except for the special case of position 0), strides, convolutions. At present, we implement size constraints and fixed position indexing.
 
 ```ocaml
 type dim_constraint = Unconstrained_dim | At_least_dim of int
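As a hedged illustration of the `dim_constraint` variant quoted in the hunk above, checking a resolved axis size against such a constraint could look like the following toy function; the `satisfies` helper is hypothetical, since OCANNL propagates these constraints through the inference environment as described below rather than checking them directly.

```ocaml
(* Toy check of the constraint variant quoted above against an already
   resolved axis size. *)
type dim_constraint = Unconstrained_dim | At_least_dim of int

let satisfies (c : dim_constraint) ~(size : int) : bool =
  match c with
  | Unconstrained_dim -> true
  | At_least_dim d -> size >= d

let () =
  assert (satisfies (At_least_dim 3) ~size:4);
  assert (not (satisfies (At_least_dim 3) ~size:2));
  print_endline "dim_constraint sketch OK"
```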
@@ -125,13 +125,33 @@ type row_constraint =
 
 During the solution process, the constraints are incorporated, or propagated, into the environment `constr` entry fields, and into further `constraint_` constraints, as needed. This provides sufficient scaffolding to implement the other complex constraints as the need arises.
 
+### Product dimensions (Prod)
+
+The `Prod` construct represents an axis that is a product of other axes. This can be used to model:
+
+1. **Concatenation and splitting**: Multiple axes concatenated into a single axis or a single axis split into multiple as part of an operation.
+2. **Multi-axis views**: Treating multiple axes as a single flattened axis and vice-versa.
+
+For a `Prod [d1; d2; ...; dn]`:
+
+* The dimension is the product of all constituent dimensions: `dim(d1) × dim(d2) × ... × dim(dn)`
+* The projection respects the order of axes, implementing a row-major indexing scheme
+* During inference, constraints on the product propagate to constraints on the constituents
+* In the einsum notation, product axes will be denoted using `&`, e.g., `i&j` represents a single axis that is the product of axes `i` and `j`
+
+Product dimensions interact with other shape inference features:
+
+* **Broadcasting**: A Prod dimension can be broadcasted if its constituents are dimension-1
+* **Inequalities**: `Prod ds1 ≥ Prod ds2` requires compatible structures and element-wise inequalities
+* **Constraints**: An `At_least_dim` constraint on a Prod propagates to its constituents
+
 ## Solving the constraints
 
 The constraints are solved by: unification of the equation constraints, unification-like simplification of the inequality constraints, propagation of the complex constraints. Simplification of an inequality, and constraint propagation, can generate more constraints, so we need to be careful to keep it terminating. The solution proceeds in stages.
 
 * Stage 1 is online as tensors are composed, and conservatively performs unification and constraint propagation. Stages 2, 3, 4 are only performed once necessary: when projections or dimensions are requested.
 * Stage 2, when solving the constraints, substitutes dim variables in terminal shapes that do not have a LUB or other constraints, by dimension-1. (This is generalized at stage 6 to all variables.) It substitutes row variables in terminal shapes that do not have a LUB by one axis if that's required to satisfy the variable's constraint.
-* Stage 3, when solving the constraints, sets yet-unknown dimension and row variables in terminal shapes to their least upper bounds (if any). It substitutes row variables in terminal shapes that do not have a LUB by no-further-axes. (This is generalized at stage 5 to all variables.)
+* Stage 3, when solving the constraints, sets yet-unknown dimension and row variables in terminal shapes to their least upper bounds (if any). It substitutes row variables in terminal shapes that do not have a LUB by no-further-axes. (This is generalized at stage 6 to all variables.)
 * Stage 4 sets yet-unknown dimensions with >1 lower bounds from direct accesses, to their LUBs if they have any, otherwise to the lower bound.
 * Stage 5 addresses `Total_elems` constraints with yet-unknown row variables. If the constraint can be satisfied by assuming the row variable is no-further-axes, it sets the row variable to `Broadcastable`, otherwise it sets it to one axis of the required dimension.
 * Stage 6 sets row variables in the remaining inequalities to no-further-axes values. This can unlock further between-axis inequalities because of row variables sandwiched between leftmost axes from their side of the inequality and rightmost axes from the other side of the inequality.
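The product-of-constituents size and the row-major projection described in the added section can be made concrete with a small self-contained sketch; the types and the `total_dim`/`unflatten` helpers below are illustrative stand-ins, not OCANNL's implementation.

```ocaml
(* Toy model of a Prod axis: its size is the product of the constituent
   sizes, and a flat position on the axis maps row-major onto the
   constituents (last constituent varies fastest). *)
type dim = Dim of int | Prod of dim list

let rec total_dim = function
  | Dim d -> d
  | Prod ds -> List.fold_left (fun acc d -> acc * total_dim d) 1 ds

(* Split a flat position on a Prod with the given constituent sizes into
   per-constituent positions, row-major. *)
let unflatten (sizes : int list) (pos : int) : int list =
  List.rev sizes
  |> List.fold_left (fun (pos, acc) d -> (pos / d, (pos mod d) :: acc)) (pos, [])
  |> snd

let () =
  assert (total_dim (Prod [ Dim 2; Dim 3 ]) = 6);
  (* Position 4 on an axis written i&j, with i of size 2 and j of size 3,
     corresponds to i = 1, j = 1. *)
  assert (unflatten [ 2; 3 ] 4 = [ 1; 1 ]);
  print_endline "Prod sketch OK"
```

In OCANNL itself the analogous index mapping would come out of the projections constructed by `derive_projections` (see the hunk below), rather than being computed ad hoc like this.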
@@ -187,25 +207,3 @@ Other important functions in the `Shape` module.
 * `finish_inference` is called right before some projections or array dimensions are required (typically, because of jitting). It performs a second round of `propagate_shapes`, and then once again attempts to solve any remaining constraints that `propagate_shapes` didn't solve. Then it "closes the shapes": substitutes out remaining shape variables by their LUBs if any, or dimension-1 / `Broadcastable` (no-more-axes). Then it resets the environment state, since the shapes are now guaranteed to not have variables.
 * `derive_projections` starts by freshening the `proj_id`s in the `update_step`. Then it generates and solves shape inequalities, and then generates and solves projection equations, and constructs the `projections` record.
 * `of_spec` constructs a shape record from an einsum slot spec. If `deduced = Input_equals_output`, it adds the corresponding equation to the global environment.
-
-### Product dimensions (Prod)
-
-The `Prod` construct represents an axis that is a product of other axes. This can be used to model:
-
-1. **Concatenation**: Multiple axes concatenated into a single axis
-2. **Multi-axis views**: Treating multiple axes as a single flattened axis
-
-For a `Prod [d1; d2; ...; dn]`:
-
-* The dimension is the product of all constituent dimensions: `dim(d1) × dim(d2) × ... × dim(dn)`
-* The projection respects the order of axes, implementing a row-major indexing scheme
-* During inference, constraints on the product propagate to constraints on the constituents
-* In the einsum notation, product axes will be denoted using `&`, e.g., `i&j` represents a single axis that is the product of axes `i` and `j`
-
-Product dimensions interact with other shape inference features:
-
-* **Broadcasting**: A Prod dimension can be broadcasted if all its constituents are dimension-1
-* **Inequalities**: `Prod ds1 ≥ Prod ds2` requires compatible structures and element-wise inequalities
-* **Constraints**: An `At_least_dim` constraint on a Prod propagates to its constituents
-
-The actual shape inference combines row polymorphism with (nominal) subtyping, as known in the type inference literature.
