Commit 88fa04a

Milestones update
1 parent e9afb70 commit 88fa04a

File tree: 1 file changed (+11 −4 lines)

README.md

Lines changed: 11 additions & 4 deletions
@@ -58,18 +58,24 @@ NOTE: debug logging from CUDA in complex settings is a bit tricky, it involves a
 
 This is very tentative.
 
-* **0.6.2: Shape inference improvements, convolution NNs, toy generative transformer.**
+* **0.6.2: "you forgot to specify a hidden dimension".**
+  * Detection of user errors where information about a hidden dimension is missing: disables guessing "no axes" or "dimension 1" for shapes of parameters.
+  * RoPE embeddings.
   * Transformer for the Names dataset.
+* **0.6.3: Padding inference for convolutions; HIP backend.**
   * Padding inference during shape inference.
-  * Add convnet examples starting with MNIST.
+  * A standalone bindings package for HIP (AMD hardware), and a backend using the bindings.
+  * Sokoban RL policy-gradient example with a CNN.
+* **0.6.4: Real-world examples.**
+  * Add convnet examples: MNIST and CIFAR.
+  * Bindings to a tokenizer (e.g. _llama.cpp_).
+  * Transformer inference for a small open-weights model (one of GPT-2, LLaMA, Gemma).
 * **0.7: CPU-style performance and memory efficiency.**
   * Cleanup of deprecated streams functionality.
   * Migrating from the "hosted tensor" idea to always requiring a context when accessing tensors and dealing with devices directly.
   * Optimizations: loop-invariant lifting and common subexpression elimination.
   * Universal Pool Allocator.
   * Milestone phrasing: enhancements for inlining-related and simplification-related optimizations, memory management, and session management.
-* **0.7.1: Real-life transformer inference. HIP backend (AMD hardware) and WebGPU backend.**
-  * Add a GPT-2 or LLaMA-style example. Tokenization using the tokenizer extracted from llama.cpp.
 * **0.8: GPU-style performance -- low-hanging fruit.**
   * First harvested from [Fast Multidimensional Matrix Multiplication on CPU from Scratch](https://siboehm.com/articles/22/Fast-MMM-on-CPU).
   * Then harvested from [How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog](https://siboehm.com/articles/22/CUDA-MMM).
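The 0.6.2 milestone's "RoPE embeddings" item refers to rotary position embeddings. As a reference point (not OCANNL code — the function name `rope`, the half-split pairing, and the base 10000 are assumptions following the common GPT-NeoX-style formulation), a minimal NumPy sketch:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # x: (seq, d) with d even; pos: (seq,) token positions.
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # per-pair rotation frequency
    angles = pos[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    # Pair dimension i with dimension i + half, rotate each pair by its angle.
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

rng = np.random.default_rng(0)
q, k = rng.standard_normal((1, 8)), rng.standard_normal((1, 8))
m, n = np.array([3]), np.array([7])
```

Because each pair undergoes a plain 2-D rotation, norms are preserved and the dot product `rope(q, m) @ rope(k, n).T` depends only on the relative offset `n - m` — the property that makes RoPE attractive for attention.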
@@ -79,6 +85,7 @@ This is very tentative.
 * **0.8.1: Shape understanding and manipulation enhancements.**
   * Verify or rethink the usefulness of dimension labels (a.k.a. dimension units), and whether to introduce axis labels.
   * Add concatenation to the einsum syntax (an axis that is a concatenation of two axes, each from another tensor); it's a generalization of stacking tensors.
+  * An academic-style paper, e.g. for the OCaml Workshop.
 * **0.9: Optimize performance: program search.**
   * Instead of dynamic scheduling as in tinygrad, we can schedule statically by program search.
   * We should also reproduce the search that tinygrad is doing. Inspiration: Halide.
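The 0.8.1 claim that a concatenated einsum axis generalizes stacking can be illustrated with plain NumPy (this uses `np.concatenate` directly, not the proposed einsum syntax): a concatenated axis `c = i ++ j` indexes first over `i`'s extent, then over `j`'s, and stacking is the degenerate case where every input contributes extent 1.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)      # axis i has extent 2
b = np.arange(9).reshape(3, 3)      # axis j has extent 3
c = np.concatenate([a, b], axis=0)  # concatenated axis has extent 2 + 3 = 5

# Stacking as the special case: each input contributes an axis of extent 1.
x, y = np.arange(3), np.arange(3) * 10
stacked = np.stack([x, y])                               # shape (2, 3)
via_concat = np.concatenate([x[None], y[None]], axis=0)  # same result
```

In einsum terms, the proposal would let shape inference treat the output axis as the disjoint union of the two input axes, rather than requiring a separate stacking primitive.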
