Roadmap / Manifesto

Tensor library for machine learning

Note that this project is under active development.
Some of the development is currently happening in the llama.cpp and whisper.cpp repos.


  • Written in C
  • 16-bit float support
  • Integer quantization support (4-bit, 5-bit, 8-bit, etc.)
  • Automatic differentiation
  • ADAM and L-BFGS optimizers
  • Optimized for Apple Silicon
  • On x86 architectures utilizes AVX / AVX2 intrinsics
  • On ppc64 architectures utilizes VSX intrinsics
  • No third-party dependencies
  • Zero memory allocations during runtime
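
To make the integer quantization bullet concrete, here is a minimal sketch of symmetric 4-bit block quantization in C. The block size, the `block_q4` struct, and both function names are hypothetical and chosen for illustration; this is not ggml's actual storage format or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; not ggml's on-disk format. */
#define Q4_BLOCK_SIZE 32

typedef struct {
    float   scale;                  /* one scale factor shared by the block */
    uint8_t q[Q4_BLOCK_SIZE / 2];   /* 32 values packed as 4-bit nibbles */
} block_q4;

/* Round to the nearest integer, halves away from zero (avoids needing libm). */
static int round_nearest(float v) {
    return (int)(v + (v >= 0.0f ? 0.5f : -0.5f));
}

/* Quantize Q4_BLOCK_SIZE floats into one block. */
static void quantize_block_q4(const float *x, block_q4 *out) {
    assert(x != NULL && out != NULL);

    /* Find the largest magnitude in the block. */
    float amax = 0.0f;
    for (int i = 0; i < Q4_BLOCK_SIZE; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }

    /* Map [-amax, amax] onto the integer levels [-7, 7]. */
    out->scale = amax / 7.0f;
    float inv = out->scale > 0.0f ? 1.0f / out->scale : 0.0f;

    for (int i = 0; i < Q4_BLOCK_SIZE; i += 2) {
        /* Offset by 8 so each level fits in an unsigned nibble (1..15). */
        int lo = round_nearest(x[i] * inv) + 8;
        int hi = round_nearest(x[i + 1] * inv) + 8;
        out->q[i / 2] = (uint8_t)(lo | (hi << 4));
    }
}

/* Reconstruct approximate floats from one block. */
static void dequantize_block_q4(const block_q4 *in, float *y) {
    assert(in != NULL && y != NULL);
    for (int i = 0; i < Q4_BLOCK_SIZE; i += 2) {
        uint8_t b = in->q[i / 2];
        y[i]     = (float)((int)(b & 0x0F) - 8) * in->scale;
        y[i + 1] = (float)((int)(b >> 4) - 8) * in->scale;
    }
}
```

In this sketch, 32 weights occupy 16 bytes of nibbles plus a 4-byte scale, compared with 128 bytes as 32-bit floats. The reconstruction error per value is bounded by roughly half the block's scale.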


Whisper inference (example)

With ggml you can efficiently run Whisper inference on the CPU.

Memory requirements:

Model    Disk      Mem
tiny     75 MB     ~280 MB
base     142 MB    ~430 MB
small    466 MB    ~1.0 GB
medium   1.5 GB    ~2.6 GB
large    2.9 GB    ~4.7 GB

GPT inference (example)

With ggml you can efficiently run GPT-2 and GPT-J inference on the CPU.

Here is how to run the example programs:

# Build ggml + examples
git clone
cd ggml
mkdir build && cd build
cmake ..
make -j4 gpt-2-backend gpt-j

# Run the GPT-2 small 117M model
../examples/gpt-2/ 117M
./bin/gpt-2-backend -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

# Run the GPT-J 6B model (requires 12GB disk space and 16GB CPU RAM)
../examples/gpt-j/ 6B
./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "This is an example"

# Install Python dependencies
python3 -m pip install -r ../requirements.txt

# Run the Cerebras-GPT 111M model
# Download from:
python3 ../examples/gpt-2/ /path/to/Cerebras-GPT-111M/
./bin/gpt-2 -m /path/to/Cerebras-GPT-111M/ggml-model-f16.bin -p "This is an example"
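
The 12GB disk requirement quoted for GPT-J 6B above can be sanity-checked with simple arithmetic. The sketch below assumes the weights are stored as 16-bit floats, which the text does not state explicitly. `weight_bytes` is a hypothetical helper, not part of ggml.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate bytes needed to store n_params weights at bits_per_weight
 * bits each. Ignores the vocabulary, file metadata, and any tensors that
 * might be kept at a different precision. */
static uint64_t weight_bytes(uint64_t n_params, unsigned bits_per_weight) {
    assert(bits_per_weight > 0);
    /* Round up to a whole number of bytes. */
    return (n_params * bits_per_weight + 7) / 8;
}
```

Under that assumption, `weight_bytes(6000000000ULL, 16)` gives 12,000,000,000 bytes, which is consistent with the 12GB figure. The same weights at 4 bits each would need about 3 GB.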

The inference speeds that I get for the different models on my 32GB MacBook M1 Pro are as follows:

Model   Size    Time / Token
GPT-2   117M    5 ms
GPT-2   345M    12 ms
GPT-2   774M    23 ms
GPT-2   1558M   42 ms
GPT-J   6B      125 ms
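
The per-token times above may be easier to compare as throughput. The conversion is just the reciprocal, sketched here with a hypothetical helper (`tokens_per_second` is not part of ggml):

```c
#include <assert.h>

/* Convert a per-token latency in milliseconds to tokens per second. */
static double tokens_per_second(double ms_per_token) {
    assert(ms_per_token > 0.0);
    return 1000.0 / ms_per_token;
}
```

For the figures in the table, 5 ms per token corresponds to 200 tokens per second for GPT-2 117M, and 125 ms per token corresponds to 8 tokens per second for GPT-J 6B.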

For more information, check out the corresponding programs in the examples folder.

Using Metal (only with GPT-2)

For GPT-2 models, offloading to the GPU is possible. Note that it will not improve inference performance, but it will reduce power consumption and free up the CPU for other tasks.

To enable GPU offloading on macOS:


# add the -ngl option (this example passes -ngl 100)
./bin/gpt-2 -t 4 -ngl 100 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

Using cuBLAS

# fix the path to point to your CUDA compiler
cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc ..

Using CLBlast


Compiling for Android

Download and unzip the NDK from this download page. Set the NDK_ROOT_PATH environment variable or provide the absolute path to the CMAKE_ANDROID_NDK in the command below.

cmake .. \
# Create directories
adb shell 'mkdir /data/local/tmp/bin'
adb shell 'mkdir /data/local/tmp/models'

# Push the compiled binaries to the folder
adb push bin/* /data/local/tmp/bin/

# Push the ggml library
adb push src/ /data/local/tmp/

# Push model files
adb push models/gpt-2-117M/ggml-model.bin /data/local/tmp/models/

# Now let's do some inference ...
adb shell

# Now we are in shell
cd /data/local/tmp
export LD_LIBRARY_PATH=/data/local/tmp
./bin/gpt-2-backend -m models/ggml-model.bin -p "this is an example"

CLBlast for Android

Build CLBlast.

# In CLBlast/build
$ANDROID_SDK_PATH/cmake/3.22.1/bin/cmake .. \
    -DCMAKE_ANDROID_ARCH_ABI=arm64-v8a \
    -DCMAKE_ANDROID_STL_TYPE=c++_static \
    -DOPENCL_ROOT=$(readlink -f ../../OpenCL-Headers) \

# Build
make -j4

Pull to

# In ggml project root.
mkdir arm64-v8a
adb pull /system/vendor/lib64/egl/ arm64-v8a/

Build ggml with CLBlast.

# In ggml/build
cd build
$ANDROID_SDK_PATH/cmake/3.22.1/bin/cmake .. \
    -DCMAKE_ANDROID_ARCH_ABI=arm64-v8a \
    -DCMAKE_ANDROID_STL_TYPE=c++_shared \
    -DCLBLAST_HOME=$(readlink -f ../../CLBlast) \
    -DOPENCL_LIB=$(readlink -f ../arm64-v8a/

# Run make, adb push, etc.

Then in adb shell...

cd /data/local/tmp
export LD_LIBRARY_PATH=/system/vendor/lib64/egl:/data/local/tmp
./bin/gpt-2-backend -m models/ggml-model.bin -n 64 -p "Pepperoni pizza"

OpenCL does not have the same level of support in ggml-backend as CUDA or Metal. In the gpt-2-backend example, OpenCL will only be used for the matrix multiplications when evaluating large prompts.