We need to compile llama-cpp-python[server] with the NVIDIA CUDA compiler (nvcc). Make sure you are running a CUDA notebook image!
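
Before starting a long build, it can help to confirm that nvcc is actually reachable. The sketch below is one way to check. The helper name `find_nvcc` is hypothetical, and the fallback path is simply the one passed as CUDACXX in the install command.

```python
import os
import shutil


def find_nvcc(default="/usr/local/cuda/bin/nvcc"):
    """Return a path to nvcc, or None if it cannot be found."""
    # Prefer whatever is on PATH.
    on_path = shutil.which("nvcc")
    if on_path:
        return on_path
    # Fall back to the location used for CUDACXX in the install command.
    if os.path.isfile(default) and os.access(default, os.X_OK):
        return default
    return None


nvcc = find_nvcc()
print(nvcc or "nvcc not found: this does not look like a CUDA notebook image")
```

If this prints the "not found" message, the CUDA build in the next cell is unlikely to succeed.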

In [10]:
!NVCC_APPEND_FLAGS='-allow-unsupported-compiler' CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=all-major" FORCE_CMAKE=1 pip install llama-cpp-python[server] --no-cache-dir --force-reinstall --upgrade

Collecting llama-cpp-python[server]
  Downloading llama_cpp_python-0.2.65.tar.gz (38.0 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1
  Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Collecting typing-extensions>=4.5.0
  Downloading typing_extensions-4.11.0-py3-none-any.whl (34 kB)
Collecting jinja2>=2.11.3
  Downloading Jinja2-3.1.3-py3-none-any.whl (133 kB)
Collecting numpy>=1.20

Next, install the Hugging Face CLI with the torch extra so we can grab a Llama model to chat with.

In [None]:
pip install 'huggingface_hub[cli,torch]'

Log in to Hugging Face. You will need your access token set in the HF_API_KEY environment variable.

In [None]:
import os
from huggingface_hub import login

# Read the access token from the HF_API_KEY environment variable.
login(token=os.getenv("HF_API_KEY"))
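
If the environment variable is missing, `os.getenv` returns None and the failure can be confusing. A minimal sketch of a guard is shown here. The helper name `get_hf_token` is hypothetical, and HF_API_KEY is just the variable name used in this notebook.

```python
import os


def get_hf_token(var_name="HF_API_KEY"):
    """Return the token stored in var_name, or raise a clear error if it is unset."""
    token = os.getenv(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set; export it before starting the notebook"
        )
    return token
```

The result can then be passed to the login call, for example `login(token=get_hf_token())`.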

Download a Llama 3 model. This repository provides several quantized versions of the instruction-tuned 8B model:
https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF

In [None]:
!huggingface-cli download lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF --local-dir . --local-dir-use-symlinks False
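
The command above fetches every file in the repository. If only one quantization is needed, a Python alternative is to call `hf_hub_download` from huggingface_hub for a single file. This is a sketch, assuming the Q8_0 filename that is loaded later in this notebook. The wrapper name `download_gguf` is hypothetical.

```python
REPO_ID = "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF"
FILENAME = "Meta-Llama-3-8B-Instruct-Q8_0.gguf"  # the file loaded later in this notebook


def download_gguf(repo_id=REPO_ID, filename=FILENAME, local_dir="."):
    """Download a single file from a Hugging Face repo and return its local path."""
    # Imported lazily so this helper can be defined before huggingface_hub is installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)


# Uncomment to run the download (several GB):
# path = download_gguf()
```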

Load the model, setting n_gpu_layers=-1 so that all layers are offloaded to the GPU.

In [None]:
from llama_cpp import Llama
llm = Llama(model_path="Meta-Llama-3-8B-Instruct-Q8_0.gguf", n_gpu_layers=-1)
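
If loading fails with an opaque error, one quick sanity check is whether the file exists and really is a GGUF file. GGUF files begin with the four ASCII bytes `GGUF`. The helper below is a hypothetical sketch that tests only for that prefix, not for a complete or valid model.

```python
import os

GGUF_MAGIC = b"GGUF"  # GGUF files start with these four bytes


def looks_like_gguf(path):
    """Return True if path exists and begins with the GGUF magic bytes."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


print(looks_like_gguf("Meta-Llama-3-8B-Instruct-Q8_0.gguf"))
```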

Run a plain text completion by calling the model with a prompt.

In [None]:
output = llm(
  "<PROMPT>", # Prompt
  max_tokens=512,  # Generate up to 512 tokens
  stop=["<|eot_id|>"],   # Llama 3 Instruct end-of-turn token; check the model card if you use a different model.
  echo=True        # Whether to echo the prompt
)
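
For a raw completion, the instruction-tuned Llama 3 models expect the prompt to follow their chat layout, with each turn wrapped in header tokens and closed by `<|eot_id|>`. That is why `<|eot_id|>` is a more natural stop sequence here than `</s>`. The sketch below builds such a prompt by hand. The function name `llama3_prompt` is hypothetical. Leaving out `<|begin_of_text|>` by default assumes that llama.cpp prepends it during tokenization, which is worth checking against your version.

```python
def llama3_prompt(system, user, add_bos=False):
    """Build a single-turn Llama 3 Instruct prompt ending at the assistant header."""

    def turn(role, content):
        # Each turn is: header tokens, blank line, content, end-of-turn token.
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    # Assumption: the tokenizer adds the begin-of-text token itself, so it is optional here.
    prompt = "<|begin_of_text|>" if add_bos else ""
    prompt += turn("system", system) + turn("user", user)
    # Leave the assistant turn open so the model writes the reply.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


print(llama3_prompt("You are a helpful assistant.", "Write a story about llamas."))
```

The returned string can be passed in place of "<PROMPT>" in the completion call.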

In [None]:
llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are a story writing assistant with an Australian style."},
        {
            "role": "user",
            "content": "Write a story about llamas."
        }
    ]
)
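
Both calls return an OpenAI-style dictionary rather than a bare string. A small sketch for pulling out the generated text is shown below. The helper name `reply_text` and the sample responses are illustrative only and are not real model output.

```python
def reply_text(response):
    """Extract the generated text from a completion or chat completion response."""
    choice = response["choices"][0]
    # Chat completions carry a "message"; plain completions carry "text".
    if "message" in choice:
        return choice["message"]["content"]
    return choice["text"]


# Illustrative response shapes (made-up content, not real model output).
sample_chat = {"choices": [{"message": {"role": "assistant", "content": "G'day!"}}]}
sample_completion = {"choices": [{"text": "Once upon a time"}]}

print(reply_text(sample_chat))
print(reply_text(sample_completion))
```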

In [None]:
!python3 -m llama_cpp.server --model Meta-Llama-3-8B-Instruct-Q8_0.gguf --n_gpu_layers=-1 --host=0.0.0.0 --port=8080
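
The server exposes an OpenAI-compatible API under /v1. Because the cell above blocks the kernel while the server runs, a client has to run from a second notebook or a terminal. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and `http://localhost:8080` assumes the client runs in the same pod as the server.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, max_tokens=256):
    """Build a POST request for the /v1/chat/completions endpoint."""
    body = json.dumps({"messages": messages, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, messages, timeout=120):
    """Send the request and return the decoded JSON response."""
    request = build_chat_request(base_url, messages)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (requires the server to be running):
# result = chat("http://localhost:8080", [{"role": "user", "content": "Write a story about llamas."}])
# print(result["choices"][0]["message"]["content"])
```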

You can now browse to the server through the proxy that your notebook exposes. Create these two Routes in OpenShift first, so that requests to /openapi.json and /v1 are rewritten to the proxied server on port 8080.

```bash
oc apply -n rhods-notebooks -f - <<EOF
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: llama-openapi-json
  annotations:
    haproxy.router.openshift.io/rewrite-target: /notebook/rhods-notebooks/jupyter-nb-admin/proxy/8080/openapi.json
spec:
  host: jupyter-nb-admin-rhods-notebooks.apps.sno.sandbox.opentlc.com
  path: /openapi.json
  port:
    targetPort: oauth-proxy
  tls:
    termination: Reencrypt
    insecureEdgeTerminationPolicy: None
  to:
    kind: Service
    name: jupyter-nb-admin-tls
    weight: 100
  wildcardPolicy: None
EOF

oc apply -n rhods-notebooks -f - <<EOF
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: llama-v1
  annotations:
    haproxy.router.openshift.io/rewrite-target: /notebook/rhods-notebooks/jupyter-nb-admin/proxy/8080/v1
spec:
  host: jupyter-nb-admin-rhods-notebooks.apps.sno.sandbox.opentlc.com
  path: /v1
  port:
    targetPort: oauth-proxy
  tls:
    termination: Reencrypt
    insecureEdgeTerminationPolicy: None
  to:
    kind: Service
    name: jupyter-nb-admin-tls
    weight: 100
  wildcardPolicy: None
EOF
```


Open the docs page in a browser tab and try the API out using the Swagger UI.

https://jupyter-nb-admin-rhods-notebooks.apps.sno.sandbox.opentlc.com/notebook/rhods-notebooks/jupyter-nb-admin/proxy/8080/docs
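
The URL follows the notebook proxy pattern that also appears in the Route rewrite targets: host, then /notebook/, namespace, notebook name, /proxy/, port, and finally the path on the server. As a sketch for adapting it to another cluster or user, the hypothetical helper below rebuilds it from its parts. The values shown are the ones from this document.

```python
def proxy_url(host, namespace, notebook, port, path="docs"):
    """Build the notebook proxy URL for a service listening on the given port."""
    return f"https://{host}/notebook/{namespace}/{notebook}/proxy/{port}/{path.lstrip('/')}"


print(
    proxy_url(
        host="jupyter-nb-admin-rhods-notebooks.apps.sno.sandbox.opentlc.com",
        namespace="rhods-notebooks",
        notebook="jupyter-nb-admin",
        port=8080,
    )
)
```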