# How to Run Llama 2 Locally with Python (Quickstart)

This Jupyter notebook is part of a blog post on https://swharden.com

Blog post: https://swharden.com/blog/2023-07-29-ai-chat-locally-with-python/

Model files (GGUF): https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

In [11]:
from llama_cpp import Llama

from IPython.display import display, HTML
import json
import time
import pathlib
import warnings

# Silence all Python warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

Load the model. Other quantizations can be loaded the same way to compare their responses to the same prompt.

Note that `n_ctx` is the size of the context window in tokens. The prompt and the generated response share this window, so increasing `n_ctx` raises the maximum length of a response, at the cost of more memory.
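
The effect of `n_ctx` on response length can be sketched as simple arithmetic. This is a minimal illustration, assuming the prompt and the response share one context window; the helper name `max_response_tokens` and the prompt size of 16 tokens are hypothetical and not part of llama-cpp-python.

```python
def max_response_tokens(n_ctx: int, prompt_tokens: int) -> int:
    """Upper bound on generated tokens when prompt and response share one window.

    Hypothetical helper for illustration only; not a llama-cpp-python function.
    """
    if prompt_tokens >= n_ctx:
        raise ValueError("prompt does not fit in the context window")
    return n_ctx - prompt_tokens


# With n_ctx=2048 and an assumed 16-token prompt, 2032 tokens remain for the answer
print(max_response_tokens(n_ctx=2048, prompt_tokens=16))  # prints 2032
```

Doubling `n_ctx` in this sketch roughly doubles the room left for the answer, which is why a short prompt with a small window can still truncate a long response.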

In [13]:
MODEL_Q4_K_M = Llama(
    model_path="../models/llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32      

In [14]:
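import html

def query(model, question):
    """Ask the model a question, then display the timing, the answer, and the raw output."""
    model_name = pathlib.Path(model.model_path).name
    time_start = time.perf_counter()
    prompt = f"Q: {question} A:"
    # max_tokens <= 0 means no explicit limit: generation is bounded only by n_ctx
    output = model(prompt=prompt, max_tokens=0)
    response = output["choices"][0]["text"]
    time_elapsed = time.perf_counter() - time_start
    display(HTML(f'<code>{model_name} response time: {time_elapsed:.2f} sec</code>'))
    # Escape the text so characters such as "<" and "&" are shown literally, not parsed as HTML
    display(HTML(f'<strong>Question:</strong> {html.escape(question)}'))
    display(HTML(f'<strong>Answer:</strong> {html.escape(response)}'))
    print(json.dumps(output, indent=2))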
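
The dictionary printed by `query()` follows a completion-style layout with `choices` and `usage` keys. As a rough sketch of how it could be summarized, the example below computes a generation speed from it. The helper `summarize_completion` is hypothetical, and `sample_output` is a hand-written placeholder with made-up values, not real model output.

```python
def summarize_completion(output: dict, elapsed_sec: float) -> dict:
    """Extract the answer text and a tokens-per-second figure from a completion dict.

    Hypothetical helper for illustration; assumes `choices` and `usage` keys exist.
    """
    choice = output["choices"][0]
    completion_tokens = output["usage"]["completion_tokens"]
    return {
        "text": choice["text"].strip(),
        "finish_reason": choice["finish_reason"],
        "tokens_per_sec": completion_tokens / elapsed_sec,
    }


# Placeholder dictionary with made-up values (not real model output)
sample_output = {
    "choices": [{"text": " placeholder answer text", "index": 0, "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 16, "completion_tokens": 20, "total_tokens": 36},
}

summary = summarize_completion(sample_output, elapsed_sec=4.0)
print(summary["tokens_per_sec"])  # prints 5.0
```

A `finish_reason` of `"length"` rather than `"stop"` would suggest the answer was cut off by the token limit instead of ending naturally.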

In [None]:
query(MODEL_Q4_K_M, "What is the most common cause of death globally?")

Llama.generate: prefix-match hit
