Commit f29d8d7

Fix uint4x32 conversions in the cuda backend and update CLAUDE.md, by Claude Opus
Prompt:

> Can you help me investigate the discrepancy between running `dune exec bin/primitive_ops.exe -- --ocannl_backend=multicore_cc` (or sync_cc) on one hand and `dune exec bin/primitive_ops.exe -- --ocannl_backend=cuda` on the other hand? `dune exec` invocations from the root directory for OCANNL executable / test targets will read the file ocannl_config for configuration (the same functionality is available from the command line, but options need the prefix `--ocannl_`). The configuration sets up writing various debug files into build_files/ and log_files/. The build_files/ contain the generated code: the high-level representation files `.cd`, the intermediate lowered files `-unoptimized.ll` and the optimized `.ll`, and the source files that then get compiled for execution -- `.c` and `.cu` respectively. The log_files/ will in particular contain `-0-0.log` (where 0 identifies the only stream running the code) with runtime logging from the execution. I prepared the corresponding `cuda-0-0.log` and `multicore_cc-0-0.log` in the root directory already in case you want to take a look. This test case is odd and irrelevant -- it's a plot of a pseudo-random function. But the discrepancy is mystifying, because all other tests match very well, up to numerical rounding differences (that should be screened off by test expectations).

Summary by Claude:

I've identified and fixed the discrepancy between the CUDA and multicore_cc backends for the primitive_ops test.

The issue: the CUDA backend was incorrectly converting floating-point values to uint4x32_t by simply casting them to unsigned int, which loses the bit representation. For example, it was generating: {(unsigned int)(-5), 0, 0, 0}

The fix:
1. Added proper conversion functions (double_to_uint4x32, single_to_uint4x32, etc.) to the CUDA builtins in builtins_cuda_small.ml
2. Updated the convert_precision function in cuda_backend.ml to use these conversion functions instead of simple casts

Now both backends correctly preserve the bit representation of floating-point values when converting to uint4x32_t, ensuring consistent behavior for the uint4x32_to_prec_uniform operation that generates pseudo-random values.

Prompt:

> Thank you! Can you update the file CLAUDE.md with concise information about what you learned in this session that is applicable generally to developing the OCANNL project and is not apparent from the current CLAUDE.md or from an unguided straightforward quick exploration. Thank you!

Summary by Claude:

I've updated CLAUDE.md with the key learnings from this debugging session that would be helpful for future development:
1. Backend code generation architecture - documented how c_syntax.ml, cc_backend.ml, and cuda_backend.ml interact, and where backend-specific overrides happen
2. Debugging backend discrepancies - added a new section with concrete steps for debugging when backends produce different outputs, including where to find logs and common issues to check
3. Debug file locations - added information about runtime logs (<backend>-0-0.log) and generated code files in build_files/

These additions capture the non-obvious aspects of the codebase architecture that were crucial for solving this issue and would help with similar debugging tasks in the future.

Signed-off-by: Lukasz Stafiniak <lukstafi@gmail.com>
1 parent 1abac3f commit f29d8d7
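
To see why the old generated code `{(unsigned int)(-5), 0, 0, 0}` breaks the pseudo-random plot: a cast converts the *value*, while the RNG key needs the raw IEEE-754 *bits*. The following minimal host-side C sketch is not part of the commit; `memcpy` stands in for CUDA's `__float_as_uint` intrinsic:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
  float x = -5.0f;

  /* Value conversion: converting a negative float to unsigned int is
     undefined behavior in ISO C, and CUDA device code clamps it to 0,
     so every negative input produced the same RNG key. */
  unsigned int cast_bits = (unsigned int)x;

  /* Bit reinterpretation: memcpy is the portable host-side analogue of
     __float_as_uint; it preserves the IEEE-754 bit pattern. */
  unsigned int real_bits;
  memcpy(&real_bits, &x, sizeof real_bits);

  /* real_bits prints 0xC0A00000, the binary32 encoding of -5.0f. */
  printf("cast: 0x%08X  bits: 0x%08X\n", cast_bits, real_bits);
  return 0;
}
```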

File tree

4 files changed (+98, -5 lines)

CLAUDE.md

Lines changed: 33 additions & 0 deletions
@@ -94,12 +94,33 @@ opam install cudajit # for CUDA backend
 - Key configs: backend selection, debug logging, optimization levels
 - Config is searched in current and ancestor directories
 
+**Configuration Methods** (in order of precedence):
+1. Command-line flags: `--ocannl_<option>=<value>` (e.g., `--ocannl_backend=cuda`)
+2. Environment variables: `OCANNL_<OPTION>=<value>` (e.g., `OCANNL_BACKEND=cuda`)
+3. Config file: `ocannl_config` in current or ancestor directories
+
+**Important Debug Settings**:
+- `output_debug_files_in_build_directory=true` - enables `build_files/` generation
+- `debug_log_from_routines=true` - enables runtime logging
+- `debug_log_to_stream_files=true` - writes logs to `log_files/<backend>-<stream>-<stream>.log`
+- `clean_up_artifacts_on_startup=false` - preserves debug files between runs
+
 ### Backend Development
 
 - Backends must implement stream-based execution with FIFO queuing
 - Support for events and synchronization between streams/devices
 - Code generation through `Low_level.t` to backend-specific representations
 
+**Backend Code Generation Architecture**:
+- `c_syntax.ml` provides a functor with default C code generation patterns
+- `cc_backend.ml` uses defaults from `c_syntax.ml` with minimal overrides
+- `cuda_backend.ml` overrides more functions for CUDA-specific syntax (e.g., `__float2half`)
+- Both backends must provide `convert_precision` for type conversions
+- Builtin functions (e.g., type conversions) must be implemented in:
+  - `builtins.c` for C backends
+  - `builtins_cuda_small.ml` for CUDA backend
+- When adding new precision types, ensure conversion functions exist in all backend builtins
+
 ### Syntax Extensions
 
 - `%cd` requires `NTDSL` module in scope (from `Operation.NTDSL`)
@@ -116,6 +137,16 @@ opam install cudajit # for CUDA backend
 3. Add syntax support in `lib/ppx_*.ml` if needed
 4. Add high-level wrappers in `lib/operation.ml`
 
+### Debugging Backend Discrepancies
+
+When outputs differ between backends:
+1. Compare runtime logs in `<backend>-<stream>-<stream>.log` files
+2. Check generated code in `build_files/*.c` vs `*.cu` for differences
+3. Common issues:
+   - Missing builtin function implementations in one backend
+   - Incorrect type conversion in `convert_precision` overrides
+   - Different numerical precision between CPU and GPU operations
+
 ### Backend Extensions
 
 1. Implement device-specific module following `Backend_impl` signatures
@@ -135,6 +166,8 @@ opam install cudajit # for CUDA backend
 - Use `log_level=2` for verbose ppx_minidebug output
 - CUDA debugging requires `Utils.capture_stdout_logs` wrapper
 - Debug files generated in `log_files/` directory (cleaned on startup by default)
+- Runtime logs from execution are written to `<backend>-<stream>-<stream>.log` (e.g., `cuda-0-0.log`)
+- Generated code files in `build_files/` show high-level `.cd`, intermediate `.ll`, and backend-specific `.c`/`.cu` files
 
 ## Performance Considerations
 
arrayjit/lib/builtins_cuda_small.ml

Lines changed: 54 additions & 0 deletions
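
The diff below adds builtins that construct `uint4x32_t` values, but the commit never shows the type's definition. Judging from the `{{a, b, c, d}}` initializers here and the `.v[0]` accessor in `cuda_backend.ml`, it is presumably a struct wrapping four 32-bit words, roughly as in this inferred sketch (not the project's actual definition):

```cuda
/* Inferred from usage in this commit: a 128-bit value carried as four
   32-bit words; the double braces {{...}} initialize the inner array. */
typedef struct {
  unsigned int v[4];
} uint4x32_t;
```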
@@ -190,4 +190,58 @@ __device__ uint4x32_t int64_to_uint4x32(long long x) {
   uint4x32_t result = {{(unsigned int)(bits & 0xFFFFFFFF), (unsigned int)(bits >> 32), 0, 0}};
   return result;
 }
+
+/* Conversion functions from various precisions to uint4x32_t */
+__device__ uint4x32_t single_to_uint4x32(float x) {
+  unsigned int bits = __float_as_uint(x);
+  uint4x32_t result = {{bits, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t double_to_uint4x32(double x) {
+  unsigned long long bits = __double_as_longlong(x);
+  uint4x32_t result = {{(unsigned int)(bits & 0xFFFFFFFF), (unsigned int)(bits >> 32), 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t int32_to_uint4x32(int x) {
+  uint4x32_t result = {{(unsigned int)x, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t uint32_to_uint4x32(unsigned int x) {
+  uint4x32_t result = {{x, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t uint64_to_uint4x32(unsigned long long x) {
+  uint4x32_t result = {{(unsigned int)(x & 0xFFFFFFFF), (unsigned int)(x >> 32), 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t byte_to_uint4x32(unsigned char x) {
+  uint4x32_t result = {{(unsigned int)x, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t uint16_to_uint4x32(unsigned short x) {
+  uint4x32_t result = {{(unsigned int)x, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t bfloat16_to_uint4x32(unsigned short x) {
+  uint4x32_t result = {{(unsigned int)x, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t half_to_uint4x32(__half x) {
+  unsigned short bits = __half_as_ushort(x);
+  uint4x32_t result = {{(unsigned int)bits, 0, 0, 0}};
+  return result;
+}
+
+__device__ uint4x32_t fp8_to_uint4x32(unsigned char x) {
+  uint4x32_t result = {{(unsigned int)x, 0, 0, 0}};
+  return result;
+}
 |}
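
The intrinsics `__float_as_uint`, `__double_as_longlong`, and `__half_as_ushort` are device-only, so these builtins cannot be exercised directly on the host. The same bit-splitting logic can be spot-checked in portable C with `memcpy`; the sketch below is illustrative only (the `uint4x32_t` layout and the `_host` helper names are assumptions, not project code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct { unsigned int v[4]; } uint4x32_t;  /* assumed layout */

/* Host-side analogue of single_to_uint4x32: memcpy replaces __float_as_uint. */
static uint4x32_t single_to_uint4x32_host(float x) {
  uint4x32_t result = {{0, 0, 0, 0}};
  memcpy(&result.v[0], &x, sizeof x);
  return result;
}

/* Host-side analogue of double_to_uint4x32: the 64 bits are split across
   the two low words, exactly as in the CUDA version above. */
static uint4x32_t double_to_uint4x32_host(double x) {
  unsigned long long bits;
  memcpy(&bits, &x, sizeof x);
  uint4x32_t result = {{(unsigned int)(bits & 0xFFFFFFFF),
                        (unsigned int)(bits >> 32), 0, 0}};
  return result;
}

int main(void) {
  /* -5.0f encodes as 0xC0A00000 in IEEE-754 binary32; the buggy cast
     would have produced 0 here instead. */
  assert(single_to_uint4x32_host(-5.0f).v[0] == 0xC0A00000u);
  uint4x32_t d = double_to_uint4x32_host(-5.0);
  printf("low=0x%08X high=0x%08X\n", d.v[0], d.v[1]);
  return 0;
}
```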

arrayjit/lib/cuda_backend.ml

Lines changed: 8 additions & 3 deletions
@@ -701,11 +701,16 @@ end) : Ir.Backend_impl.Lowered_backend = struct
     | Double_prec _, Half_prec _ -> ("__double2half(", ")")
     | Single_prec _, Half_prec _ -> ("__float2half(", ")")
     | Byte_prec _, Half_prec _ -> ("__ushort2half_rn((unsigned short int)", ")")
-    | Double_prec _, Uint4x32_prec _ -> ("{(unsigned int)(", "), 0, 0, 0}")
-    | Single_prec _, Uint4x32_prec _ -> ("{(unsigned int)(", "), 0, 0, 0}")
-    | Int32_prec _, Uint4x32_prec _ -> ("{(unsigned int)(", "), 0, 0, 0}")
+    | Double_prec _, Uint4x32_prec _ -> ("double_to_uint4x32(", ")")
+    | Single_prec _, Uint4x32_prec _ -> ("single_to_uint4x32(", ")")
+    | Int32_prec _, Uint4x32_prec _ -> ("int32_to_uint4x32(", ")")
     | Int64_prec _, Uint4x32_prec _ -> ("int64_to_uint4x32(", ")")
     | Uint4x32_prec _, _ -> ("", ".v[0]")
+    | Byte_prec _, Uint4x32_prec _ -> ("byte_to_uint4x32(", ")")
+    | Uint16_prec _, Uint4x32_prec _ -> ("uint16_to_uint4x32(", ")")
+    | Bfloat16_prec _, Uint4x32_prec _ -> ("bfloat16_to_uint4x32(", ")")
+    | Half_prec _, Uint4x32_prec _ -> ("half_to_uint4x32(", ")")
+    | Fp8_prec _, Uint4x32_prec _ -> ("fp8_to_uint4x32(", ")")
     | _, Uint4x32_prec _ -> ("{(unsigned int)(", "), 0, 0, 0}")
     | _ -> ("(" ^ typ_of_prec to_ ^ ")(", ")")
 
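
`convert_precision` returns a (prefix, suffix) pair of strings that the code generator wraps around the converted expression. For a single-precision source, the effect on the emitted device code is roughly the following reconstruction (variable names are hypothetical; the real output is in `build_files/*.cu`):

```cuda
typedef struct { unsigned int v[4]; } uint4x32_t;  /* assumed layout */

__device__ uint4x32_t single_to_uint4x32(float x);  /* from the builtins above */

__device__ uint4x32_t make_key(float v) {
  /* Before the fix: value cast, so e.g. v = -5.0f clamps to 0 and the
     bit pattern is lost. */
  uint4x32_t key_old = {{(unsigned int)(v), 0, 0, 0}};
  (void)key_old;
  /* After the fix: the builtin preserves the IEEE-754 bits of v. */
  uint4x32_t key_new = single_to_uint4x32(v);
  return key_new;
}
```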

bin/primitive_ops.ml

Lines changed: 3 additions & 2 deletions
@@ -20,8 +20,8 @@ let graph_t () : unit =
   let module Backend = (val Backends.fresh_backend ()) in
   let open Operation.At in
   CDSL.virtualize_settings.enable_device_only <- false;
-  let%op f x = where (x < !.0.) (sin x) (cos x) in
-  (* let%op f x = sin x in *)
+  let%op f x = uint4x32_to_prec_uniform x in
+  (* let%op f x = where (x < !.0.) (sin x) (cos x) in *)
   (* let%op f x = sin x in *)
   let size = 10 in
   let x_min = -5. in
@@ -43,6 +43,7 @@ let graph_t () : unit =
   Train.run fx_routine;
   let step_ref = IDX.find_exn fx_routine.bindings step_sym in
   Train.printf_tree ~with_grad:true ~depth:9 xkcd;
+  Utils.capture_stdout_logs @@ fun () ->
   let ys, dys =
     Array.unzip
     @@ Array.mapi xs ~f:(fun i _ ->
