# Load a Llama model with `llama-cpp-python` (CPU only)

## **Workflow**
1. **Installation**
2. **Fetching the Model**
3. **Loading the Model**
4. **Interacting with Llama**

## **Installation**

In [3]:
#!pip3 install --upgrade pip
!pip3 install llama-cpp-python



## **Fetching the Model**

Please follow these instructions before running the notebook:

- Download the `llama-2-7b-chat.Q2_K.gguf` file from [TheBloke/Llama-2-7B-Chat-GGUF](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main).
- Add the model's `.gguf` file to your `models/` directory.
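
Before loading, it can help to confirm that the file landed where the notebook expects it. The following is a minimal sketch, assuming the `models/` layout described above; the helper name `find_gguf_models` is illustrative and is not part of `llama-cpp-python`.

```python
from pathlib import Path


def find_gguf_models(models_dir="models"):
    """Return the sorted paths of all .gguf files directly inside models_dir."""
    directory = Path(models_dir)
    if not directory.is_dir():
        # A missing directory simply means no models are available yet.
        return []
    return sorted(directory.glob("*.gguf"))


if __name__ == "__main__":
    found = find_gguf_models()
    if found:
        for path in found:
            size_gb = path.stat().st_size / 1e9
            print(f"{path} ({size_gb:.2f} GB)")
    else:
        print("No .gguf file found in models/; download one first.")
```

If the list comes back empty, the `Llama(model_path=...)` call later in the notebook would fail, so this check is a cheap way to catch a misplaced download early.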

In [1]:
!ls
!ls models

llama2_cpu.ipynb                  llama2_langchain_cpu.ipynb
llama2_cpu_huggingface.ipynb      models
llama2_cpu_ollama_langchain.ipynb


hf-git-clone              hf-snapshot-download      llama-2-7b-chat.Q2_K.gguf


## **Loading the Model**

In [9]:
from llama_cpp import Llama

# Path to the GGUF model file downloaded from Hugging Face
model_path = "models/llama-2-7b-chat.Q2_K.gguf"
llm = Llama(model_path=model_path)
print(llm)

<llama_cpp.llama.Llama object at 0x104a6e490>


llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/llama-2-7b-chat.Q2_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32           

## **Interacting with Llama**

In [20]:
llm("What is the capital of France?")

Llama.generate: prefix-match hit

llama_print_timings:        load time =    4400.96 ms
llama_print_timings:      sample time =       1.00 ms /    11 runs   (    0.09 ms per token, 11044.18 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    1132.19 ms /    11 runs   (  102.93 ms per token,     9.72 tokens per second)
llama_print_timings:       total time =    1146.85 ms


{'id': 'cmpl-84d415fb-c1a1-46c0-8f00-57a5633e3f00',
 'object': 'text_completion',
 'created': 1704468211,
 'model': 'models/llama-2-7b-chat.Q2_K.gguf',
 'choices': [{'text': '\n everybody! The capital of France is Paris.',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 8, 'completion_tokens': 10, 'total_tokens': 18}}

In [21]:
output = llm("Tell me a joke about French people?")
print(output["choices"][0]["text"])

Llama.generate: prefix-match hit



 Bedeut the following sentence: "The French are a funny lot."



llama_print_timings:        load time =    4400.96 ms
llama_print_timings:      sample time =       1.55 ms /    16 runs   (    0.10 ms per token, 10329.24 tokens per second)
llama_print_timings: prompt eval time =     627.03 ms /     9 tokens (   69.67 ms per token,    14.35 tokens per second)
llama_print_timings:        eval time =    1362.08 ms /    15 runs   (   90.81 ms per token,    11.01 tokens per second)
llama_print_timings:       total time =    2010.49 ms


In [23]:
llm = Llama(model_path="./models/llama-2-7b-chat.Q2_K.gguf")

# Prompt creation
system_message = "You are a helpful assistant"
user_message = "Q: Name the planets in the solar system? A: "

prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]"""


# Run the model; calling llm(...) generates a completion, as does llm.create_completion(...)
output = llm(
    prompt,             # Prompt text
    max_tokens=32,      # Generate up to 32 tokens
    stop=["Q:", "\n"],  # Stop before a new question or at the first newline
    echo=True,          # Echo the prompt back in the output
)



print(output)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q2_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32         

{'id': 'cmpl-4fa2786c-ddaa-451f-8238-2aeeeb150e39', 'object': 'text_completion', 'created': 1704621144, 'model': './models/llama-2-7b-chat.Q2_K.gguf', 'choices': [{'text': '<s>[INST] <<SYS>>\nYou are a helpful assistant\n<</SYS>>\nQ: Name the planets in the solar system? A:  [/INST]  Of course! Here are the eight planets in our solar system, listed in order from the sun:', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 40, 'completion_tokens': 23, 'total_tokens': 63}}



llama_print_timings:        load time =    5361.82 ms
llama_print_timings:      sample time =       2.20 ms /    23 runs   (    0.10 ms per token, 10435.57 tokens per second)
llama_print_timings: prompt eval time =    5361.78 ms /    40 tokens (  134.04 ms per token,     7.46 tokens per second)
llama_print_timings:        eval time =    1818.44 ms /    22 runs   (   82.66 ms per token,    12.10 tokens per second)
llama_print_timings:       total time =    7210.98 ms


Text completion is available through the `__call__` and [create_completion](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) methods of the `Llama` class. The parameters used above are:
- `prompt`: The prompt to generate text from.
- `max_tokens`: The maximum number of tokens to generate.
- `stop`: A list of strings; generation stops as soon as one of them is encountered.
- `echo`: Whether to include the prompt at the start of the returned text.
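
To make these parameters concrete, here is a small sketch of a wrapper around `create_completion`. The function name `complete_text` is hypothetical. It assumes `llm` is an already loaded `Llama` instance and that the response has the dictionary shape printed earlier in this notebook.

```python
def complete_text(llm, prompt, max_tokens=32, stop=None, echo=False):
    """Run one completion and return only the generated text.

    `llm` is assumed to be a loaded llama_cpp.Llama instance (or any object
    exposing a compatible create_completion method).
    """
    response = llm.create_completion(
        prompt=prompt,
        max_tokens=max_tokens,  # upper bound on the number of generated tokens
        stop=stop or [],        # strings that end generation when encountered
        echo=echo,              # include the prompt in the returned text
    )
    # The generated text lives in the first entry of the "choices" list.
    return response["choices"][0]["text"]


# Usage, assuming `llm` was created as shown in the loading section:
# print(complete_text(llm, "What is the capital of France?"))
```

Returning only the text keeps later cells short; the full response dictionary is still the place to look for `usage` and `finish_reason`.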


The **prompt** combines two inputs:  
- The ***system message***, which can be used to give the LLM specific knowledge or constraints. It can also be omitted, in which case the model follows the system message it was trained on.
- The ***user message***, which is the actual user prompt. This is where you define the concrete task you want the model to perform (e.g. text generation).
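
The template used in the cell above can be wrapped in a small helper. This is a sketch that mirrors that template; `build_prompt` is an illustrative name, and leaving out the `<<SYS>>` block when no system message is given is an assumption about how to omit it.

```python
def build_prompt(user_message, system_message=None):
    """Format a single-turn Llama 2 chat prompt like the one used above."""
    if system_message:
        return (
            f"<s>[INST] <<SYS>>\n"
            f"{system_message}\n"
            f"<</SYS>>\n"
            f"{user_message} [/INST]"
        )
    # Assumption: without a system message, the <<SYS>> block is left out.
    return f"<s>[INST] {user_message} [/INST]"


prompt = build_prompt(
    "Q: Name the planets in the solar system? A: ",
    system_message="You are a helpful assistant",
)
print(prompt)
```

The resulting string can be passed to `llm(...)` in place of the hand-written f-string, which makes it easier to try different system and user messages.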