This notebook shows how to use llama.cpp via the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) package.

llama.cpp enables model inference in C/C++. Its original goal was to "run the LLaMA model using 4-bit integer quantization on a MacBook", but its scope has since expanded considerably.

Here, we will use it with CUDA.

# 1. Install llama.cpp

In [0]:
# Clone the repo
!git clone https://github.com/ggerganov/llama.cpp /databricks/driver/llama.cpp/
%cd /databricks/driver/llama.cpp

In [0]:
# Install the NVIDIA CUDA toolkit
!apt-get install nvidia-cuda-toolkit -y

## Build llama.cpp with cuBLAS support

In [0]:
%%bash
# -p keeps the cell re-runnable if the build directory already exists
mkdir -p build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

Skip to step 5 if you have already converted and quantized the model and saved the quantized file.

# 2. Download or move the model to the ./models directory of llama.cpp

First, download the model from [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-hf). For example:

```
git lfs install
git clone git@hf.co:meta-llama/Llama-2-7b-hf <target_directory>
```

In [0]:
%cp -r /dbfs/daniel.liden/models/llama2/ ./models/llama2/

In [0]:
# install python dependencies
!python3 -m pip install -r requirements.txt

# 3. Convert the model to the gguf format

In [0]:
!python3 convert.py ./models/llama2/

In [0]:
# verify we now have a gguf model
!find ./models/llama2/ -name "*.gguf"
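
The `find` call above only checks file names. As a minimal sketch of a slightly stronger check, the helpers below also confirm that each file begins with the four ASCII bytes `GGUF`, which is the magic number at the start of a GGUF file. The function names `is_gguf` and `find_gguf_files` are hypothetical and not part of llama.cpp.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def is_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


def find_gguf_files(model_dir):
    """List *.gguf files under `model_dir` that also pass the magic-byte check."""
    return sorted(p for p in Path(model_dir).rglob("*.gguf") if is_gguf(p))
```

In this notebook it could be called as `find_gguf_files("./models/llama2/")`. An empty list would suggest that the conversion step did not produce a valid file.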

# 4. Quantize the model

You can see the different quantization options with `quantize --help`. Recommendations can be found in [this discussion](https://github.com/ggerganov/llama.cpp/discussions/2094) (may not be up to date).
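
To get a feel for why quantization matters, here is a rough back-of-the-envelope sketch of weight storage at different bit widths. It assumes about 7 billion parameters for the 7b model and ignores file metadata. Real quantized formats also store per-block scale factors, so actual files come out somewhat larger than these figures. The helper name `approx_model_size_gb` is hypothetical.

```python
def approx_model_size_gb(n_params, bits_per_weight):
    """Rough weight storage in gigabytes (1 GB = 1e9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # assumed parameter count for a "7b" model

for label, bits in [("16-bit float", 16), ("4-bit integer", 4)]:
    print(f"{label}: ~{approx_model_size_gb(n_params, bits):.1f} GB")
# 16-bit float: ~14.0 GB
# 4-bit integer: ~3.5 GB
```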

In [0]:
!./build/bin/quantize --help

In [0]:
!./build/bin/quantize ./models/llama2/ggml-model-f16.gguf ./models/llama2/ggml-model-q5_k_m.gguf q5_k_m

In [0]:
# Optionally, save the quantized model to dbfs
import os
source_path = "./models/llama2/ggml-model-q5_k_m.gguf"
target_path = "/daniel.liden/models/ggml-model-q5_k_m.gguf"

dbutils.fs.cp("file:" + os.path.abspath(source_path), target_path)
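
Because the quantized file is several gigabytes, it may be worth confirming that the copy is intact. The following is a minimal sketch that compares SHA-256 digests of two files. It assumes the DBFS copy is readable through the `/dbfs/...` local mount used elsewhere in this notebook, and the helper names `sha256_of` and `same_contents` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large model never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """Return True if both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_contents("./models/llama2/ggml-model-q5_k_m.gguf", "/dbfs/daniel.liden/models/ggml-model-q5_k_m.gguf")` should return `True` after a successful copy.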

# 5. Run inference

Resume here if you've already downloaded and quantized the model!

In [0]:
# List the DBFS directory the quantized model was saved to (same root as target_path in step 4)
dbutils.fs.ls("/daniel.liden/models/")

In [0]:
%cp /dbfs/daniel.liden/models/ggml-model-q5_k_m.gguf ./models/ggml-model-q5_k_m.gguf

In [0]:
!ls ./models/

In [0]:
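# Run inference: -n is the number of tokens to generate, -ngl the number of layers offloaded to the GPU,
# and -e makes llama.cpp interpret the "\n" escape in the prompt as a newline
!./build/bin/main -m ./models/ggml-model-q5_k_m.gguf -n 128 -ngl 64 -e --prompt "The steps to make a good chemex pour-over coffee are as follows:\n1."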
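
The same command can also be assembled from Python, which makes the flags easier to vary between runs. This is only a sketch: `build_main_command` is a hypothetical helper, and it uses just the flags already shown above (`-m` for the model path, `-n` for the number of tokens to generate, `-ngl` for the number of layers to offload to the GPU).

```python
import shlex


def build_main_command(model_path, prompt, n_predict=128, n_gpu_layers=64,
                       binary="./build/bin/main"):
    """Assemble the argument list for the llama.cpp `main` binary."""
    return [
        binary,
        "-m", str(model_path),
        "-n", str(n_predict),
        "-ngl", str(n_gpu_layers),
        "--prompt", prompt,
    ]


cmd = build_main_command(
    "./models/ggml-model-q5_k_m.gguf",
    "The steps to make a good chemex pour-over coffee are as follows:\n1.",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would run it without going through a shell. Because the prompt is a Python string, the `\n` is already a real newline by the time llama.cpp receives it.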

# 6. Python bindings with [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)

Once you have a gguf-formatted and quantized model, you can use the high-level Python API offered by `llama-cpp-python` to work with it. See notebook 8a for details.
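
As a brief preview, a minimal sketch of the high-level API might look like the following. It assumes `llama-cpp-python` is installed with GPU support (otherwise `n_gpu_layers` has no effect) and that the quantized model sits at the path used in step 5. The wrapper name `complete` is hypothetical.

```python
import os


def complete(model_path, prompt, max_tokens=128, n_gpu_layers=64):
    """Load a GGUF model with llama-cpp-python and return the generated text."""
    if not os.path.isfile(model_path):
        raise FileNotFoundError(f"No model file at {model_path}")
    # Imported lazily so the path check above runs even if the package is missing.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers)
    result = llm(prompt, max_tokens=max_tokens)
    # The completion is returned as a dict with a list of choices.
    return result["choices"][0]["text"]
```

For example, `complete("./models/ggml-model-q5_k_m.gguf", "The steps to make a good chemex pour-over coffee are as follows:\n1.")` would mirror the command-line run in step 5.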