
Commit 3cd6ea2

Update priorities: reorder upcoming milestones
1 parent 54645ab commit 3cd6ea2

2 files changed, +10 -9 lines changed

CHANGES.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -1,9 +1,9 @@
-## [0.5.4] -- current
+## [0.6.0] -- current
 
 ### Added
 
 - Support for Brain float aka. bfloat16 aka. BF16, and for FP8.
-- TODO: utilities for using Hugging Face Tokenizers.
+- TODO: support for convolution via affine indexing expressions in: projections, einsum notation, shape inference.
 
 ## [0.5.3] -- 2025-05-24
 
```
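
A note on the formats in the BF16/FP8 entry: BF16 keeps FP32's sign bit and 8 exponent bits but only 7 mantissa bits, so it is essentially FP32 with the low 16 bits rounded away; FP8 commonly refers to the E4M3 and E5M2 layouts. A minimal OCaml sketch of the usual FP32 to BF16 rounding, purely illustrative and not OCANNL's implementation (helper names are hypothetical, NaN handling omitted):

```ocaml
(* Convert by keeping the high 16 bits of the IEEE-754 single-precision
   encoding, with round-to-nearest-even on the discarded low 16 bits. *)
let bf16_of_float (x : float) : int =
  let bits = Int32.bits_of_float x in  (* FP32 bit pattern of x *)
  (* Lowest bit that survives the truncation, used for ties-to-even. *)
  let lsb = Int32.to_int (Int32.shift_right_logical bits 16) land 1 in
  let rounded = Int32.add bits (Int32.of_int (0x7FFF + lsb)) in
  Int32.to_int (Int32.shift_right_logical rounded 16) land 0xFFFF

let float_of_bf16 (b : int) : float =
  (* Widening back is exact: the low 16 mantissa bits are just zeros. *)
  Int32.float_of_bits (Int32.shift_left (Int32.of_int (b land 0xFFFF)) 16)

let () =
  let b = bf16_of_float 3.14159 in
  Printf.printf "3.14159 -> bf16 0x%04X -> %g\n" b (float_of_bf16 b)
```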

README.md

Lines changed: 8 additions & 7 deletions

```diff
@@ -66,24 +66,25 @@ NOTE: debug logging from CUDA in complex settings is a bit tricky, it involves a
 
 This is very tentative.
 
-* 0.6: Replicate the scaffolding from [llm.c](https://github.com/karpathy/llm.c) for training GPT-2.
+* 0.6: more precisions, convolution, block tensors, improvements to dimension labels.
+  * DONE at head: BF16, FP8.
+  * Requires extending expressivity of projections and the generalized einsum notation.
+  * Then, we can add convnet building blocks and corresponding examples starting with MNIST.
+  * Verify or rethink usefulness of dimension labels, and whether to introduce axis labels.
+* 0.7: Replicate the scaffolding from [llm.c](https://github.com/karpathy/llm.c) for training GPT-2.
   * Useful building blocks for models in [lib/nn_blocks.ml](lib/nn_blocks.ml).
   * A language model example.
   * Port (translate or bind) the Python files from [llm.c](https://github.com/karpathy/llm.c) to implement tokenization, data loading and saving etc.
   * At the end of 0.6.x, we should have an apples-to-apples benchmark comparing OCANNL to [llm.c](https://github.com/karpathy/llm.c) for both CPU and GPU.
-* 0.7: Optimize performance -- low hanging fruit.
+* 0.8: Optimize performance -- low hanging fruit.
   * First harvested from [Fast Multidimensional Matrix Multiplication on CPU from Scratch](https://siboehm.com/articles/22/Fast-MMM-on-CPU).
   * Then harvested from [How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog](https://siboehm.com/articles/22/CUDA-MMM).
   * Finally from [llm.c](https://github.com/karpathy/llm.c).
   * These will require splitting a routine into multiple CUDA kernels.
-* 0.8: A new abstraction layer automating compilation/linking, execution, and some data transfers.
+* 0.9: A new abstraction layer automating compilation/linking, execution, and some data transfers.
   * E.g. host-device transfers: copy from host if host update is later than the previous device update.
   * Concise syntax for transfers into the merge buffer since we know which tensor node is transferred and where to.
   * At the end of 0.8.x, OCANNL has a REPL.
-* 0.9: Hopefully-efficient expressivity: block tensors, convolution.
-  * Requires extending expressivity of projections and the generalized einsum notation.
-  * Then, we can add convnet building blocks and corresponding examples starting with MNIST.
-  * Verify or rethink usefulness of dimension labels, and whether to introduce axis labels.
 * 0.10: Optimize performance: program search.
   * Instead of dynamic scheduling as in tinygrad, we can schedule statically by program search.
   * We should also reproduce the search that tinygrad is doing.
```
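
To unpack the convolution item under 0.6: a convolution sums products at indices that are affine in the iterators, e.g. `o * stride + k`, while plain einsum notation only pairs one independent iterator per axis; that is why projections, einsum notation, and shape inference all need extending. A standalone OCaml sketch of the computation such affine projections must capture (hypothetical code, not OCANNL's einsum syntax):

```ocaml
(* 1-D "valid" convolution (strictly, cross-correlation) written with the
   explicit affine indexing expression o * stride + k. Shape inference must
   solve the matching constraint: out_len = (n - m) / stride + 1. *)
let conv1d ~stride (input : float array) (kernel : float array) : float array =
  let n = Array.length input and m = Array.length kernel in
  let out_len = (n - m) / stride + 1 in
  Array.init out_len (fun o ->
      let acc = ref 0.0 in
      for k = 0 to m - 1 do
        (* The affine index: output position o, kernel offset k. *)
        acc := !acc +. (input.(o * stride + k) *. kernel.(k))
      done;
      !acc)

let () =
  let out = conv1d ~stride:1 [| 1.; 2.; 3.; 4.; 5. |] [| 1.; 0.; -1. |] in
  Array.iter (fun v -> Printf.printf "%g " v) out;
  print_newline ()
```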

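Regarding the host-device transfer rule under 0.9 ("copy from host if host update is later than the previous device update"): it is an ordering check between the last host write and the last device write of a tensor node. A minimal sketch using monotonic event counters, with hypothetical names rather than OCANNL's actual scheduler:

```ocaml
(* Per tensor node, track when each copy of the data was last written. *)
type node_state = {
  mutable host_updated_at : int;    (* last event that wrote the host array *)
  mutable device_updated_at : int;  (* last event that wrote the device copy *)
}

let maybe_copy_to_device ~now ~(copy : unit -> unit) (st : node_state) =
  if st.host_updated_at > st.device_updated_at then begin
    copy ();                        (* host is fresher: transfer it *)
    st.device_updated_at <- now
  end                               (* otherwise the device copy is current *)

let () =
  let st = { host_updated_at = 3; device_updated_at = 1 } in
  maybe_copy_to_device ~now:4 ~copy:(fun () -> print_endline "host -> device") st;
  (* Second call is a no-op: the device copy is now up to date. *)
  maybe_copy_to_device ~now:5 ~copy:(fun () -> print_endline "host -> device") st
```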