This is very tentative.

* 0.6: More numeric precisions, convolution, block tensors, and improvements to dimension labels.
  * DONE at head: BF16, FP8.
  * Requires extending the expressivity of projections and of the generalized einsum notation; see the convolution sketch after this list.
  * Then we can add convnet building blocks and corresponding examples, starting with MNIST.
  * Verify or rethink the usefulness of dimension labels, and whether to introduce axis labels.
* 0.7: Replicate the scaffolding from [llm.c](https://github.com/karpathy/llm.c) for training GPT-2.
  * Useful building blocks for models in [lib/nn_blocks.ml](lib/nn_blocks.ml).
  * A language model example.
  * Port (translate or bind) the Python files from [llm.c](https://github.com/karpathy/llm.c) to implement tokenization, data loading, saving, etc.
  * At the end of 0.7.x, we should have an apples-to-apples benchmark comparing OCANNL to [llm.c](https://github.com/karpathy/llm.c) for both CPU and GPU.
* 0.8: Optimize performance -- low-hanging fruit; see the tiled matmul sketch after this list.
  * First harvested from [Fast Multidimensional Matrix Multiplication on CPU from Scratch](https://siboehm.com/articles/22/Fast-MMM-on-CPU).
  * Then harvested from [How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog](https://siboehm.com/articles/22/CUDA-MMM).
  * Finally from [llm.c](https://github.com/karpathy/llm.c).
  * These will require splitting a routine into multiple CUDA kernels.
* 0.9: A new abstraction layer automating compilation/linking, execution, and some data transfers.
  * E.g. host-device transfers: copy from the host only if the host update is later than the previous device update, as in the sketch after this list.
  * Concise syntax for transfers into the merge buffer, since we know which tensor node is transferred and where to.
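
To make the 0.6 projection requirement concrete, here is a sketch in plain OCaml (illustrative only, not OCANNL's actual notation): convolution needs more than a plain einsum, because the input is indexed by an affine projection of the output and kernel indices rather than by a reused index label.

```ocaml
(* Illustrative only, not OCANNL's API: 1-D convolution as a generalized
   einsum. A plain einsum reuses an index label verbatim on both sides;
   convolution instead indexes the input by the affine projection
   stride * o + k, which is the extra expressivity the 0.6 item asks for. *)
let conv1d ~stride input kernel =
  let n = Array.length input and m = Array.length kernel in
  let out_len = ((n - m) / stride) + 1 in
  Array.init out_len (fun o ->
      let acc = ref 0.0 in
      for k = 0 to m - 1 do
        (* The input index is the projection of output index o and kernel
           index k, not a standalone einsum label. *)
        acc := !acc +. (input.((stride * o) + k) *. kernel.(k))
      done;
      !acc)

let () =
  conv1d ~stride:1 [| 1.; 2.; 3.; 4. |] [| 1.; -1. |]
  |> Array.iter (Printf.printf "%g ") (* prints: -1 -1 -1 *)
```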
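
As a taste of the 0.8 low-hanging fruit, below is one of the first optimizations from the cited CPU article, cache-friendly loop tiling, sketched in plain OCaml over row-major float arrays; OCANNL itself would emit the equivalent C or CUDA, and the tile size `bs` here is an assumed tunable, not a measured value.

```ocaml
(* Sketch of cache-blocked matrix multiplication (c := c + a * b), with
   a of shape m*k, b of shape k*n, c of shape m*n, all row-major.
   Tiling keeps one block of each matrix hot in cache, and the i-l-j
   loop order makes the innermost accesses to b and c sequential. *)
let matmul_tiled ?(bs = 64) ~m ~n ~k a b c =
  for i0 = 0 to (m - 1) / bs do
    for j0 = 0 to (n - 1) / bs do
      for l0 = 0 to (k - 1) / bs do
        for i = i0 * bs to min (m - 1) (((i0 + 1) * bs) - 1) do
          for l = l0 * bs to min (k - 1) (((l0 + 1) * bs) - 1) do
            let a_il = a.((i * k) + l) in
            for j = j0 * bs to min (n - 1) (((j0 + 1) * bs) - 1) do
              c.((i * n) + j) <- c.((i * n) + j) +. (a_il *. b.((l * n) + j))
            done
          done
        done
      done
    done
  done
```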
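
Finally, a minimal sketch of the 0.9 transfer rule, with hypothetical types and names rather than OCANNL's actual API: each tensor node stamps its host and device copies with a global monotonic counter, and the host-to-device copy runs only when the host copy is fresher.

```ocaml
(* Hypothetical sketch of the 0.9 transfer-automation rule; the type and
   function names are illustrative, not OCANNL's actual API. A global
   monotonic counter stands in for the ordering of updates. *)
let clock = ref 0
let tick () = incr clock; !clock

type node = {
  mutable host_updated : int;   (* counter value of the last host write *)
  mutable device_updated : int; (* counter value of the last device write *)
}

(* Copy from host only if the host update is later than the previous
   device update -- the rule stated in the roadmap item above. *)
let ensure_on_device ~copy_host_to_device node =
  if node.host_updated > node.device_updated then begin
    copy_host_to_device ();
    node.device_updated <- tick ()
  end
```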