
sync : ggml (im2col, GPU conv, 32-bit arm compat) #4060

Merged

merged 1 commit into master from sync on Nov 13, 2023
Conversation

@ggerganov (Owner) commented Nov 13, 2023

Main update here is the new implementation of the convolution ops ggml_conv_1d and ggml_conv_2d. They are now represented as a sequence of two other ops, ggml_im2col + ggml_mul_mat, which allows reusing the matrix multiplication code (more info: ggerganov/ggml#564). A sketch of the decomposition is shown below.
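
For illustration, here is a minimal sketch of that decomposition at the graph-building level, assuming the ggml API from this sync. The helper name conv_2d_via_im2col is ours, and the exact signatures should be treated as approximate; shape comments are written as [N, C, H, W]:

```c
// Minimal sketch: conv_2d expressed as im2col + mul_mat, assuming the ggml API
// from this sync (the helper name is illustrative, not part of the library).
// a: kernels [OC, IC, KH, KW], b: input [N, IC, IH, IW]
struct ggml_tensor * conv_2d_via_im2col(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int s0, int s1,    // stride
        int p0, int p1,    // padding
        int d0, int d1) {  // dilation
    // unfold input patches into rows: [N, OH, OW, IC*KH*KW]
    struct ggml_tensor * im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1, d0, d1, /*is_2D=*/true);

    // flatten patches and kernels, then reuse the regular matrix multiplication
    struct ggml_tensor * result = ggml_mul_mat(ctx,
            ggml_reshape_2d(ctx, im2col, im2col->ne[0],
                            im2col->ne[3]*im2col->ne[2]*im2col->ne[1]),     // [N*OH*OW, IC*KH*KW]
            ggml_reshape_2d(ctx, a, a->ne[0]*a->ne[1]*a->ne[2], a->ne[3])); // [OC,      IC*KH*KW]

    // restore the convolution output shape: [N, OC, OH, OW]
    return ggml_reshape_4d(ctx, result,
            im2col->ne[1], im2col->ne[2], a->ne[3], im2col->ne[3]);
}
```

Because the heavy lifting is now a single ggml_mul_mat, any backend that already accelerates matrix multiplication gets fast convolutions almost for free.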

The convolutions can now be offloaded to the GPU (both CUDA and Metal are supported), so with this change we should likely be able to support CLIP running fully on the GPU.


馃 Generated by Copilot at 9f72de7

Summary

🚀🐛🧼

This pull request adds and improves the im2col feature for half-precision tensors in the CUDA and Metal APIs, fixes various bugs and warnings in the code, and updates the public interface and the internal macros of the llama.cpp library. The im2col feature allows faster convolution operations on the GPU. The bug fixes and updates aim to enhance the correctness, efficiency, and compatibility of the library. The files affected by this pull request are ggml-cuda.cu, ggml-metal.m, ggml-metal.metal, ggml-quants.c, ggml.h, ggml-impl.h, and ggml-metal.h.

The pull request has many changes
To improve the ggml ranges
It adds im2col
And fixes some holes
In the macros and the half-precisions

Walkthrough

  • Implement the im2col feature for half-precision tensors using CUDA and Metal (a plain-C sketch of the index mapping these kernels compute follows this list)
  • Support matrix-vector multiplication on half-precision tensors using CUDA and Metal
  • Fix bugs and improve performance in the ggml_vec_dot_q2_K_q8_K, ggml_vec_dot_q3_K_q8_K, ggml_vec_dot_q4_K_q8_K, ggml_vec_dot_q5_K_q8_K, and ggml_vec_dot_q6_K_q8_K functions by using types and functions that remain compatible on 32-bit ARM architectures
  • Fix a bug in the ggml_cuda_op_mul_mat function by checking the batch sizes of the input tensors
  • Fix a bug in the ggml_metal_op_positional_encoding function by using the correct index for the n_orig_ctx parameter
  • Fix a typo in the ggml_cuda_op_mul_mat function by using the correct pointer for the input tensor
  • Fix a compiler warning in the ggml_metal_init function by casting the enum value to int
  • Remove redundant definitions of the MIN and MAX macros from the ggml-impl.h file
  • Remove redundant definitions of inline functions that emulate ARM NEON intrinsics from the ggml-quants.c file
  • Increase the value of the GGML_METAL_MAX_BUFFERS macro from 16 to 64 in the ggml-metal.h file
  • Remove unused enum values from the ggml_op enum in the ggml.h file
  • Add the GGML_ATTRIBUTE_FORMAT macro to the ggml_metal_log function declaration in the ggml-metal.m file
  • Add the function name as the first argument to the GGML_METAL_LOG_WARN macro in the ggml-metal.m file
  • Add a space between the variable and the attribute qualifier in the kernel_mul_mv_f16_f32_1row function in the ggml-metal.metal file
  • Add an empty line to separate the header inclusion section from the rest of the code in the ggml-quants.c file
  • Add more conditions to the #if directive that checks whether the immintrin.h header should be included in the ggml-quants.c file
  • Add a check for the GGML_METAL_PATH_RESOURCES environment variable in the ggml_metal_init function in the ggml-metal.m file
  • Add padding to the threadgroup memory length arguments in the ggml_metal_op_softmax function in the ggml-metal.m file
  • Comment out an unused variable assignment in the ggml-metal.m file
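
As referenced in the first walkthrough item, here is a hypothetical plain-C reference of the 2-D im2col index mapping; the CUDA and Metal kernels compute the same mapping in parallel and store the result in half precision. The function name, argument order, and memory layout below are illustrative assumptions, not the library's API:

```c
#include <stddef.h>

// Hypothetical scalar reference of the 2-D im2col mapping.
// src: input  [N, IC, IH, IW], contiguous in that order
// dst: output [N, OH, OW, IC*KH*KW]
static void im2col_2d_ref(const float * src, float * dst,
                          int N, int IC, int IH, int IW,
                          int KH, int KW, int OH, int OW,
                          int s0, int s1, int p0, int p1, int d0, int d1) {
    for (int n = 0; n < N; n++) {
        for (int oh = 0; oh < OH; oh++) {
            for (int ow = 0; ow < OW; ow++) {
                for (int ic = 0; ic < IC; ic++) {
                    for (int kh = 0; kh < KH; kh++) {
                        for (int kw = 0; kw < KW; kw++) {
                            const int ih = oh*s1 + kh*d1 - p1; // input row (stride/dilation/padding)
                            const int iw = ow*s0 + kw*d0 - p0; // input column

                            const size_t di = (((size_t)n*OH + oh)*OW + ow)*(IC*KH*KW)
                                            + ((size_t)ic*KH + kh)*KW + kw;

                            // zero-pad out-of-bounds positions, copy otherwise
                            dst[di] = (ih < 0 || ih >= IH || iw < 0 || iw >= IW)
                                ? 0.0f
                                : src[(((size_t)n*IC + ic)*IH + ih)*IW + iw];
                        }
                    }
                }
            }
        }
    }
}
```

Each dst row of length IC*KH*KW then lines up with a flattened kernel row, which is what allows a single ggml_mul_mat call to carry the whole convolution.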

@ggerganov changed the title from "ggml : sync (im2col, GPU conv, 32-bit arm compat)" to "sync : ggml (im2col, GPU conv, 32-bit arm compat)" on Nov 13, 2023
@FSSRepo (Collaborator) commented Nov 13, 2023

> support CLIP running fully on the GPU

Only the ggml_acc operation needs a CUDA and Metal kernel 💀 for that. And the ggml_repeat implementation in CUDA is slow, so there is still room for performance improvement.

@ggerganov merged commit 3d68f36 into master Nov 13, 2023
40 checks passed
@ggerganov deleted the sync branch November 13, 2023 14:55
@ggerganov mentioned this pull request Nov 13, 2023
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
@cebtenzzre mentioned this pull request Nov 25, 2023