This code follows the llama-cpp-python instructions.

conda activate conda_tensorflow

In [1]:
import tensorflow


## Install Llama.cpp

In [2]:
%pip install llama-cpp-python
# To reinstall or upgrade:
# %pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir


Note: you may need to restart the kernel to use updated packages.


Note: If you are using an Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports the arm64 architecture. Otherwise, the installation builds the x86 version of llama.cpp, which will be 10x slower on Apple Silicon. For example:

In [3]:
# !wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
# !bash Miniforge3-MacOSX-arm64.sh
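
Before installing, it can help to confirm that the active interpreter is a native arm64 build. The cell below is a minimal sketch, not part of the original instructions: the helper name `is_native_apple_silicon` is hypothetical, and it assumes that an x86 Python running under Rosetta reports `x86_64` from `platform.machine()`.

```python
import platform


def is_native_apple_silicon(system: str, machine: str) -> bool:
    """Return True when the interpreter reports macOS on arm64."""
    return system == "Darwin" and machine == "arm64"


# Inspect the interpreter that will run `pip install`.
system, machine = platform.system(), platform.machine()
print(f"system={system}, machine={machine}")

if system == "Darwin" and not is_native_apple_silicon(system, machine):
    # Assumption: this means an x86 Python, so llama.cpp would be built for x86.
    print("This Python is not an arm64 build; consider installing an arm64 Python first.")
```

On machines other than a Mac, the check only prints the detected values and does nothing else.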

## Installation with Hardware Acceleration

llama.cpp supports multiple BLAS backends for faster processing, such as OpenBLAS, cuBLAS, and CLBlast.

To install with OpenBLAS, pass the LLAMA_BLAS and LLAMA_BLAS_VENDOR CMake flags through the CMAKE_ARGS environment variable when installing:

In [4]:
!CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python


To install with cuBLAS instead, pass the LLAMA_CUBLAS flag:

In [3]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python


## High-level API

The high-level API provides a simple managed interface through the Llama class.

A simple example:

In [5]:
from llama_cpp import Llama
llm = Llama(
    model_path="/Users/astridz/Documents/AI_recipe/llama.cpp/llama-2-7b-chat.Q4_K_M.gguf",
    n_threads=4,
)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /Users/astridz/Documents/AI_recipe/llama.cpp/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.w

In [6]:
output = llm("Q: Who is the first president in U.S? A: ", 
             max_tokens=50, 
             stop=["Q:", "\n"], 
             echo=True)
print(output)

{'id': 'cmpl-6f421698-b0cb-4521-a29a-ef8afdeabab5', 'object': 'text_completion', 'created': 1699153179, 'model': '/Users/astridz/Documents/AI_recipe/llama.cpp/llama-2-7b-chat.Q4_K_M.gguf', 'choices': [{'text': 'Q: Who is the first president in U.S? A:  George Washington', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 16, 'completion_tokens': 3, 'total_tokens': 19}}
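
The returned object is a plain dictionary, so the generated text and token counts can be read with ordinary indexing. The cell below is a small sketch that uses a copy of the response printed above (so it runs without loading the model). Because `echo=True` was set, the text includes the prompt, which is stripped here. The variable names are illustrative.

```python
# Copy of the response printed above, so this cell runs without the model.
output = {
    "id": "cmpl-6f421698-b0cb-4521-a29a-ef8afdeabab5",
    "object": "text_completion",
    "created": 1699153179,
    "model": "/Users/astridz/Documents/AI_recipe/llama.cpp/llama-2-7b-chat.Q4_K_M.gguf",
    "choices": [
        {
            "text": "Q: Who is the first president in U.S? A:  George Washington",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 16, "completion_tokens": 3, "total_tokens": 19},
}

prompt = "Q: Who is the first president in U.S? A: "
choice = output["choices"][0]

# With echo=True the completion starts with the prompt; drop it to keep the answer.
text = choice["text"]
answer = text[len(prompt):].strip() if text.startswith(prompt) else text.strip()

usage = output["usage"]
print(answer)
print(choice["finish_reason"], usage["prompt_tokens"], usage["completion_tokens"])
```

A `finish_reason` of `'stop'` indicates that generation ended at one of the `stop` strings or at the end-of-sequence token rather than at the `max_tokens` limit.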



llama_print_timings:        load time =  8478.47 ms
llama_print_timings:      sample time =     2.14 ms /     3 runs   (    0.71 ms per token,  1402.52 tokens per second)
llama_print_timings: prompt eval time =  8477.33 ms /    16 tokens (  529.83 ms per token,     1.89 tokens per second)
llama_print_timings:        eval time = 15803.96 ms /     2 runs   ( 7901.98 ms per token,     0.13 tokens per second)
llama_print_timings:       total time = 24300.50 ms
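
The per-token figures in the timings above follow directly from the elapsed time and the token count on each line. As a quick check, the sketch below recomputes them. The helper `throughput` is a hypothetical name and is not part of llama-cpp-python.

```python
def throughput(elapsed_ms: float, count: int) -> tuple[float, float]:
    """Return (ms per token, tokens per second) for a timing line."""
    ms_per_token = elapsed_ms / count
    tokens_per_second = 1000.0 * count / elapsed_ms
    return ms_per_token, tokens_per_second


# Values copied from the llama_print_timings lines above.
prompt_ms, prompt_tps = throughput(8477.33, 16)  # prompt eval: 16 tokens
eval_ms, eval_tps = throughput(15803.96, 2)      # eval: 2 runs

print(f"prompt eval: {prompt_ms:.2f} ms per token, {prompt_tps:.2f} tokens per second")
print(f"eval:        {eval_ms:.2f} ms per token, {eval_tps:.2f} tokens per second")
```

The eval line is the one to watch for generation speed. Here it works out to roughly 7.9 seconds per generated token.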
