Commit f29d8d7

Fix uint4x32 conversions in the cuda backend and update CLAUDE.md, by Claude Opus
Prompt:

> Can you help me investigate the discrepancy between running `dune exec bin/primitive_ops.exe -- --ocannl_backend=multicore_cc` (or sync_cc) on one hand and `dune exec bin/primitive_ops.exe -- --ocannl_backend=cuda` on the other hand? `dune exec` invocations from the root directory for OCANNL executable / test targets will read the file ocannl_config for configuration (the same functionality is available from the command line, but options need the prefix `--ocannl_`). The configuration sets up writing various debug files into build_files/ and log_files/. The build_files/ contain the generated code: the high-level representation files `.cd`, the intermediate lowered files `-unoptimized.ll` and the optimized `.ll`, and the source files that then get compiled for execution -- `.c` and `.cu` respectively. The log_files/ will in particular contain `-0-0.log` (where 0 identifies the only stream running the code) with runtime logging from the execution. I prepared the corresponding `cuda-0-0.log` and `multicore_cc-0-0.log` in the root directory already in case you want to take a look. This test case is odd and irrelevant -- it's a plot of a pseudo-random function. But the discrepancy is mystifying, because all other tests match very well, up to numerical rounding differences (that should be screened off by test expectations).

Summary by Claude:

I've identified and fixed the discrepancy between the CUDA and multicore_cc backends for the primitive_ops test.

The issue: the CUDA backend was incorrectly converting floating-point values to uint4x32_t by simply casting them to unsigned int, which loses the bit representation. For example, it was generating: {(unsigned int)(-5), 0, 0, 0}

The fix:
1. Added proper conversion functions (double_to_uint4x32, single_to_uint4x32, etc.) to the CUDA builtins in builtins_cuda_small.ml
2. Updated the convert_precision function in cuda_backend.ml to use these conversion functions instead of simple casts

Now both backends correctly preserve the bit representation of floating-point values when converting to uint4x32_t, ensuring consistent behavior for the uint4x32_to_prec_uniform operation that generates pseudo-random values.

Prompt:

> Thank you! Can you update the file CLAUDE.md with concise information about what you learned in this session that is applicable generally to developing the OCANNL project and is not apparent from the current CLAUDE.md or from an unguided straightforward quick exploration. Thank you!

Summary by Claude:

I've updated CLAUDE.md with the key learnings from this debugging session that would be helpful for future development:
1. Backend code generation architecture - documented how c_syntax.ml, cc_backend.ml, and cuda_backend.ml interact, and where backend-specific overrides happen
2. Debugging backend discrepancies - added a new section with concrete steps for debugging when backends produce different outputs, including where to find logs and common issues to check
3. Debug file locations - added information about runtime logs (<backend>-0-0.log) and generated code files in build_files/

These additions capture the non-obvious aspects of the codebase architecture that were crucial for solving this issue and would help with similar debugging tasks in the future.

Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
1 parent 1abac3f commit f29d8d7
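
To see why the old generated code `{(unsigned int)(-5), 0, 0, 0}` breaks the pseudo-random plot: a cast converts the *value*, while the RNG key needs the raw IEEE-754 *bits*. The following minimal host-side C sketch is not part of the commit; `memcpy` stands in for CUDA's `__float_as_uint` intrinsic:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
  float x = -5.0f;

  /* Value conversion: converting a negative float to unsigned int is
     undefined behavior in ISO C, and CUDA device code clamps it to 0,
     so every negative input produced the same RNG key. */
  unsigned int cast_bits = (unsigned int)x;

  /* Bit reinterpretation: memcpy is the portable host-side analogue of
     __float_as_uint; it preserves the IEEE-754 bit pattern. */
  unsigned int real_bits;
  memcpy(&real_bits, &x, sizeof real_bits);

  /* real_bits prints 0xC0A00000, the binary32 encoding of -5.0f. */
  printf("cast: 0x%08X  bits: 0x%08X\n", cast_bits, real_bits);
  return 0;
}
```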

File tree

4 files changed (+98, -5 lines)

CLAUDE.md

Lines changed: 33 additions & 0 deletions
@@ -94,12 +94,33 @@ opam install cudajit # for CUDA backend
 - Key configs: backend selection, debug logging, optimization levels
 - Config is searched in current and ancestor directories
 
+**Configuration Methods** (in order of precedence):
+1. Command-line flags: `--ocannl_<option>=<value>` (e.g., `--ocannl_backend=cuda`)
+2. Environment variables: `OCANNL_<OPTION>=<value>` (e.g., `OCANNL_BACKEND=cuda`)
+3. Config file: `ocannl_config` in current or ancestor directories
+
+**Important Debug Settings**:
+- `output_debug_files_in_build_directory=true` - enables `build_files/` generation
+- `debug_log_from_routines=true` - enables runtime logging
+- `debug_log_to_stream_files=true` - writes logs to `log_files/<backend>-<stream>-<stream>.log`
+- `clean_up_artifacts_on_startup=false` - preserves debug files between runs
+
 ### Backend Development
 
 - Backends must implement stream-based execution with FIFO queuing
 - Support for events and synchronization between streams/devices
 - Code generation through `Low_level.t` to backend-specific representations
 
+**Backend Code Generation Architecture**:
+- `c_syntax.ml` provides a functor with default C code generation patterns
+- `cc_backend.ml` uses defaults from `c_syntax.ml` with minimal overrides
+- `cuda_backend.ml` overrides more functions for CUDA-specific syntax (e.g., `__float2half`)
+- Both backends must provide `convert_precision` for type conversions
+- Builtin functions (e.g., type conversions) must be implemented in:
+  - `builtins.c` for C backends
+  - `builtins_cuda_small.ml` for CUDA backend
+- When adding new precision types, ensure conversion functions exist in all backend builtins
+
 ### Syntax Extensions
 
 - `%cd` requires `NTDSL` module in scope (from `Operation.NTDSL`)
@@ -116,6 +137,16 @@ opam install cudajit # for CUDA backend
 3. Add syntax support in `lib/ppx_*.ml` if needed
 4. Add high-level wrappers in `lib/operation.ml`
 
+### Debugging Backend Discrepancies
+
+When outputs differ between backends:
+1. Compare runtime logs in `<backend>-<stream>-<stream>.log` files
+2. Check generated code in `build_files/*.c` vs `*.cu` for differences
+3. Common issues:
+   - Missing builtin function implementations in one backend
+   - Incorrect type conversion in `convert_precision` overrides
+   - Different numerical precision between CPU and GPU operations
+
 ### Backend Extensions
 
 1. Implement device-specific module following `Backend_impl` signatures
@@ -135,6 +166,8 @@ opam install cudajit # for CUDA backend
 - Use `log_level=2` for verbose ppx_minidebug output
 - CUDA debugging requires `Utils.capture_stdout_logs` wrapper
 - Debug files generated in `log_files/` directory (cleaned on startup by default)
+- Runtime logs from execution are written to `<backend>-<stream>-<stream>.log` (e.g., `cuda-0-0.log`)
+- Generated code files in `build_files/` show high-level `.cd`, intermediate `.ll`, and backend-specific `.c`/`.cu` files
 
 ## Performance Considerations
 
arrayjit/lib/builtins_cuda_small.ml

Lines changed: 54 additions & 0 deletions
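
The diff below adds builtins that construct `uint4x32_t` values, but the commit never shows the type's definition. Judging from the `{{a, b, c, d}}` initializers here and the `.v[0]` accessor in `cuda_backend.ml`, it is presumably a struct wrapping four 32-bit words, roughly as in this inferred sketch (not the project's actual definition):

```cuda
/* Inferred from usage in this commit: a 128-bit value carried as four
   32-bit words; the double braces {{...}} initialize the inner array. */
typedef struct {
  unsigned int v[4];
} uint4x32_t;
```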
@@ -190,4 +190,58 @@ __device__ uint4x32_t int64_to_uint4x32(long long x) {
   uint4x32_t result = {{(unsigned int)(bits & 0xFFFFFFFF), (unsigned int)(bits >> 32), 0, 0}};
   return result;
 }
+
+/* Conversion functions from various precisions to uint4x32_t */
+__device__ uint4x32_t single_to_uint4x32(float x) {
+  unsigned int bits = __float_as_uint(x);
+  uint4x32_t result = {{bits, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t double_to_uint4x32(double x) {
+  unsigned long long bits = __double_as_longlong(x);
+  uint4x32_t result = {{(unsigned int)(bits & 0xFFFFFFFF), (unsigned int)(bits >> 32), 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t int32_to_uint4x32(int x) {
+  uint4x32_t result = {{(unsigned int)x, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t uint32_to_uint4x32(unsigned int x) {
+  uint4x32_t result = {{x, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t uint64_to_uint4x32(unsigned long long x) {
+  uint4x32_t result = {{(unsigned int)(x & 0xFFFFFFFF), (unsigned int)(x >> 32), 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t byte_to_uint4x32(unsigned char x) {
+  uint4x32_t result = {{(unsigned int)x, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t uint16_to_uint4x32(unsigned short x) {
+  uint4x32_t result = {{(unsigned int)x, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t bfloat16_to_uint4x32(unsigned short x) {
+  uint4x32_t result = {{(unsigned int)x, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t half_to_uint4x32(__half x) {
+  unsigned short bits = __half_as_ushort(x);
+  uint4x32_t result = {{(unsigned int)bits, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t fp8_to_uint4x32(unsigned char x) {
+  uint4x32_t result = {{(unsigned int)x, 0, 0, 0}};
+  return result;
+}
 |}
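
The intrinsics `__float_as_uint`, `__double_as_longlong`, and `__half_as_ushort` are device-only, so these builtins cannot be exercised directly on the host. The same bit-splitting logic can be spot-checked in portable C with `memcpy`; the sketch below is illustrative only (the `uint4x32_t` layout and the `_host` helper names are assumptions, not project code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct { unsigned int v[4]; } uint4x32_t;  /* assumed layout */

/* Host-side analogue of single_to_uint4x32: memcpy replaces __float_as_uint. */
static uint4x32_t single_to_uint4x32_host(float x) {
  uint4x32_t result = {{0, 0, 0, 0}};
  memcpy(&result.v[0], &x, sizeof x);
  return result;
}

/* Host-side analogue of double_to_uint4x32: the 64 bits are split across
   the two low words, exactly as in the CUDA version above. */
static uint4x32_t double_to_uint4x32_host(double x) {
  unsigned long long bits;
  memcpy(&bits, &x, sizeof x);
  uint4x32_t result = {{(unsigned int)(bits & 0xFFFFFFFF),
                        (unsigned int)(bits >> 32), 0, 0}};
  return result;
}

int main(void) {
  /* -5.0f encodes as 0xC0A00000 in IEEE-754 binary32; the buggy cast
     would have produced 0 here instead. */
  assert(single_to_uint4x32_host(-5.0f).v[0] == 0xC0A00000u);
  uint4x32_t d = double_to_uint4x32_host(-5.0);
  printf("low=0x%08X high=0x%08X\n", d.v[0], d.v[1]);
  return 0;
}
```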

arrayjit/lib/cuda_backend.ml

Lines changed: 8 additions & 3 deletions
@@ -701,11 +701,16 @@ end) : Ir.Backend_impl.Lowered_backend = struct
     | Double_prec _, Half_prec _ -> ("__double2half(", ")")
     | Single_prec _, Half_prec _ -> ("__float2half(", ")")
     | Byte_prec _, Half_prec _ -> ("__ushort2half_rn((unsigned short int)", ")")
-    | Double_prec _, Uint4x32_prec _ -> ("{(unsigned int)(", "), 0, 0, 0}")
-    | Single_prec _, Uint4x32_prec _ -> ("{(unsigned int)(", "), 0, 0, 0}")
-    | Int32_prec _, Uint4x32_prec _ -> ("{(unsigned int)(", "), 0, 0, 0}")
+    | Double_prec _, Uint4x32_prec _ -> ("double_to_uint4x32(", ")")
+    | Single_prec _, Uint4x32_prec _ -> ("single_to_uint4x32(", ")")
+    | Int32_prec _, Uint4x32_prec _ -> ("int32_to_uint4x32(", ")")
     | Int64_prec _, Uint4x32_prec _ -> ("int64_to_uint4x32(", ")")
     | Uint4x32_prec _, _ -> ("", ".v[0]")
+    | Byte_prec _, Uint4x32_prec _ -> ("byte_to_uint4x32(", ")")
+    | Uint16_prec _, Uint4x32_prec _ -> ("uint16_to_uint4x32(", ")")
+    | Bfloat16_prec _, Uint4x32_prec _ -> ("bfloat16_to_uint4x32(", ")")
+    | Half_prec _, Uint4x32_prec _ -> ("half_to_uint4x32(", ")")
+    | Fp8_prec _, Uint4x32_prec _ -> ("fp8_to_uint4x32(", ")")
     | _, Uint4x32_prec _ -> ("{(unsigned int)(", "), 0, 0, 0}")
     | _ -> ("(" ^ typ_of_prec to_ ^ ")(", ")")
 
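
`convert_precision` returns a (prefix, suffix) pair of strings that the code generator wraps around the converted expression. For a single-precision source, the effect on the emitted device code is roughly the following reconstruction (variable names are hypothetical; the real output is in `build_files/*.cu`):

```cuda
typedef struct { unsigned int v[4]; } uint4x32_t;  /* assumed layout */

__device__ uint4x32_t single_to_uint4x32(float x);  /* from the builtins above */

__device__ uint4x32_t make_key(float v) {
  /* Before the fix: value cast, so e.g. v = -5.0f clamps to 0 and the
     bit pattern is lost. */
  uint4x32_t key_old = {{(unsigned int)(v), 0, 0, 0}};
  (void)key_old;
  /* After the fix: the builtin preserves the IEEE-754 bits of v. */
  uint4x32_t key_new = single_to_uint4x32(v);
  return key_new;
}
```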

bin/primitive_ops.ml

Lines changed: 3 additions & 2 deletions
@@ -20,8 +20,8 @@ let graph_t () : unit =
   let module Backend = (val Backends.fresh_backend ()) in
   let open Operation.At in
   CDSL.virtualize_settings.enable_device_only <- false;
-  let%op f x = where (x < !.0.) (sin x) (cos x) in
-  (* let%op f x = sin x in *)
+  let%op f x = uint4x32_to_prec_uniform x in
+  (* let%op f x = where (x < !.0.) (sin x) (cos x) in *)
   (* let%op f x = sin x in *)
   let size = 10 in
   let x_min = -5. in
@@ -43,6 +43,7 @@ let graph_t () : unit =
   Train.run fx_routine;
   let step_ref = IDX.find_exn fx_routine.bindings step_sym in
   Train.printf_tree ~with_grad:true ~depth:9 xkcd;
+  Utils.capture_stdout_logs @@ fun () ->
   let ys, dys =
     Array.unzip
     @@ Array.mapi xs ~f:(fun i _ ->
