<a href="https://colab.research.google.com/github/aelkhodary/AI_ML_Coders/blob/main/Working_With_Llama3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
"""
# Link to visit
https://llama-cpp-python.readthedocs.io/en/latest/
"""

In [None]:
"""
Quantization is a technique, and GGUF is a file format, used to optimize large language models (LLMs) for efficient inference, especially on devices with limited computational resources (e.g., CPUs or low-memory GPUs).

1. What is Quantization?
Quantization is a process of reducing the precision of the numbers used to represent a model's weights and activations. In simpler terms, it compresses the model by using fewer bits to store each number, which reduces the model's size and speeds up inference.
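
As a rough illustration of the idea, the sketch below applies symmetric 8-bit quantization to a handful of made-up weights. The function names and the values are hypothetical, and real GGUF quantizers work block-wise with more elaborate schemes; this only shows how precision is traded for size.

```python
def quantize_int8(weights):
    # One shared scale maps the largest magnitude onto the int8 limit 127.
    # (Assumes at least one non-zero weight.)
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    # Recover approximate floats from the stored integers.
    return [q * scale for q in q_weights]


weights = [0.5, -1.27, 0.003, 1.0]  # made-up example values
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

print(q_weights)
# Per-weight rounding error introduced by quantization
print([round(abs(a - b), 4) for a, b in zip(weights, restored)])
```

Each weight is stored as a small integer plus one shared scale, and the rounding error is bounded by half the scale.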

Why Quantize?
Smaller Model Size: Quantized models take up less disk space and memory.

Faster Inference: Lower precision calculations are faster, especially on CPUs.

Lower Hardware Requirements: Quantized models can run on devices with limited resources (e.g., laptops, edge devices).

Common Quantization Levels:
FP32 (32-bit floating point): Original precision, no quantization.

FP16 (16-bit floating point): Half precision, smaller and faster.

INT8 (8-bit integer): Even smaller and faster, with some loss of accuracy.

INT4 (4-bit integer): Highly compressed, fastest, but with more accuracy loss.

For example:

A model originally in FP32 might be quantized to INT8 or INT4, reducing its size by 4x or 8x, respectively.
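
The 4x and 8x figures follow from simple arithmetic on bits per weight. The sketch below assumes a 7-billion-parameter model and ignores metadata and any tensors left unquantized, so real files will differ somewhat; `model_size_gb` is a hypothetical helper.

```python
def model_size_gb(n_params, bits_per_weight):
    # bytes = parameters * bits / 8; metadata and unquantized tensors are ignored
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7_000_000_000  # assumed parameter count for a "7B" model
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{model_size_gb(n_params, bits):.1f} GB")
```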

2. What is GGUF?
GGUF (GPT-Generated Unified Format) is a file format introduced by the llama.cpp project to store quantized models efficiently.
It is the successor to the older GGML format and is designed to be more flexible and future-proof.

Key Features of GGUF:
Supports Multiple Quantization Levels: GGUF files can store models quantized to different levels (e.g., Q4_K_M, Q5_K_M, Q8_0).

Optimized for CPU Inference: GGUF models are designed to run efficiently on CPUs, making them ideal for devices without GPUs.

Self-Contained: GGUF files include all necessary metadata (e.g., model architecture, tokenizer configuration) in a single file.

Cross-Platform: GGUF models can be used across different operating systems (Linux, Windows, macOS).

Why Use GGUF?
Efficiency: GGUF models are highly optimized for inference, making them ideal for resource-constrained environments.

Ease of Use: A single .gguf file contains everything needed to run the model.

Compatibility: GGUF is widely supported by tools like llama.cpp and llama-cpp-python.

3. How Quantization and GGUF Work Together
A model is first trained in high precision (e.g., FP32 or FP16).

After training, the model is quantized to a lower precision (e.g., INT8 or INT4) and saved in GGUF format.

The quantized GGUF model can then be loaded and run efficiently using tools like llama.cpp or llama-cpp-python.

4. Example of Quantized GGUF Models
When you browse a model repository on Hugging Face (e.g., TheBloke/Llama-2-7B-GGUF), you’ll see files like:

llama-2-7b.Q4_K_M.gguf

llama-2-7b.Q5_K_M.gguf

llama-2-7b.Q8_0.gguf

These files represent the same model but quantized to different levels:

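Q4_K_M: 4-bit quantization (K-quant scheme, medium variant); the smallest of the three files.

Q5_K_M: 5-bit quantization (K-quant scheme, medium variant).

Q8_0: 8-bit quantization; the largest of the three files and the closest to the original precision.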
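
To pick a file programmatically, the quantization tag can be read from the filename. The helpers below are hypothetical and assume the `<name>.<tag>.gguf` convention shown in the list above; repositories that join the tag with a hyphen (for example `...-Q3_K_L.gguf`) would need different parsing.

```python
def quant_tag(filename):
    # "llama-2-7b.Q4_K_M.gguf" -> stem "llama-2-7b.Q4_K_M" -> tag "Q4_K_M"
    stem = filename.removesuffix(".gguf")
    return stem.rsplit(".", 1)[-1]


def nominal_bits(tag):
    # The digits right after the leading "Q" give the nominal bit width.
    digits = ""
    for ch in tag[1:]:
        if not ch.isdigit():
            break
        digits += ch
    return int(digits)


files = ["llama-2-7b.Q4_K_M.gguf", "llama-2-7b.Q5_K_M.gguf", "llama-2-7b.Q8_0.gguf"]
for name in files:
    tag = quant_tag(name)
    print(name, tag, nominal_bits(tag))
```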

5. Trade-offs of Quantization
While quantization offers significant benefits, it also comes with some trade-offs:

Accuracy Loss: Lower precision can reduce the model's accuracy, especially for complex tasks.

Quantization Artifacts: Some models may behave differently after quantization.

Not Suitable for Training: Quantization is primarily used for inference; training requires higher precision.

6. How to Use GGUF Models
To use a GGUF model, you typically:

Download the .gguf file from a repository like Hugging Face.

Load the model using a compatible library (e.g., llama-cpp-python).

Run inference on the model.

Example Code:
from llama_cpp import Llama

# Load the GGUF model
model_path = "llama-2-7b.Q4_K_M.gguf"
llm = Llama(model_path=model_path)

# Run inference
prompt = "What is the capital of France?"
output = llm(prompt)
print(output['choices'][0]['text'])


7. Comparison: GGUF vs PyTorch Models
Feature          | GGUF Format                             | PyTorch Format (e.g., .bin)
File Size        | Smaller (due to quantization)           | Larger (full precision)
Inference Speed  | Faster (optimized for CPU)              | Slower (requires GPU for best speed)
Hardware Support | Runs on CPUs and low-resource devices   | Requires GPUs for efficient inference
Use Case         | Inference-only                          | Training and inference
Flexibility      | Limited to inference                    | Full flexibility for training and fine-tuning


Summary
Quantization reduces the precision of a model's weights to make it smaller and faster.

GGUF is a file format designed to store quantized models efficiently, optimized for CPU inference.

Quantized GGUF models are ideal for running large language models on devices with limited resources.


"""

In [None]:
# Download model files
"""
Pulling models from Hugging Face Hub
You can download models in GGUF format directly from Hugging Face using the Llama.from_pretrained method.
You'll need to install the huggingface-hub package to use this feature (pip install huggingface-hub).
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)

By default, from_pretrained downloads the model to the Hugging Face cache directory. You can then manage the downloaded model files with the huggingface-cli tool.


1. What is filename?
The filename is the name of the specific file you want to download from the repository.

In the case of GGUF models, this is typically a .gguf file that contains the quantized version of the model.

With hf_hub_download, you must provide the exact name of the file as it appears in the repository. Llama.from_pretrained also accepts a glob pattern, as in the "*q8_0.gguf" example above.

2. Where to Find the filename?
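Go to the Hugging Face page of a repository that hosts GGUF files (e.g., medmekk/Llama-3.2-1B-Instruct.GGUF). The original meta-llama/Llama-3.2-1B repository holds the PyTorch-format weights rather than .gguf files.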

Navigate to the "Files and versions" tab.

Look for the .gguf file you want to download. For example, you might see files like:

llama-3.2-1b.Q4_K_M.gguf

llama-3.2-1b.Q5_K_M.gguf

llama-3.2-1b.Q8_0.gguf

Copy the exact name of the file you want to download and use it as the filename argument.


3. Why is filename Important?
Hugging Face repositories can contain multiple files (e.g., model weights, configuration files, tokenizer files, etc.).

The filename ensures that you download the specific file you need (in this case, the .gguf file).

If you provide an incorrect or non-existent filename, the download will fail with an error.
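
The from_pretrained example earlier passes "*q8_0.gguf" as filename, which is a glob-style pattern rather than an exact name. The sketch below uses Python's fnmatch on a made-up file listing to show which files such a pattern would select; the listing and the `match_files` helper are hypothetical, and whether the library matches in exactly this way is an assumption.

```python
from fnmatch import fnmatch

repo_files = [  # hypothetical repository listing
    "README.md",
    "qwen2-0_5b-instruct-q4_k_m.gguf",
    "qwen2-0_5b-instruct-q8_0.gguf",
]


def match_files(pattern, files):
    # Keep only the names that match the glob pattern.
    return [name for name in files if fnmatch(name, pattern)]


print(match_files("*q8_0.gguf", repo_files))
print(match_files("*.gguf", repo_files))
```

A pattern that matches more than one file (like "*.gguf" here) is ambiguous, so a narrow pattern or the exact filename is the safer choice.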


######## Loading a model with transformers vs. llama-cpp-python:
The code below mixes two different libraries and approaches:

transformers library: Used for loading models like meta-llama/Llama-3.2-1B in PyTorch format.

llama-cpp-python library: Used for loading GGUF models (quantized models compatible with llama.cpp).

These two libraries are not directly compatible because:

transformers loads model weights stored as .bin or .safetensors files.

llama-cpp-python loads models in GGUF format (e.g., .gguf files).

If you want to use the llama-cpp-python library, you need to download a GGUF version of the model, not the PyTorch version. The first snippet below loads the PyTorch version with transformers; the second loads a GGUF file with llama-cpp-python:

# Option 1: load the PyTorch model with transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")


# Option 2: load a GGUF file with llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path=model_path)  # model_path points to a local .gguf file
# Run inference
prompt = "What is the capital of Egypt?"
output = llm(prompt)
print(output['choices'][0]['text'])


"""


In [1]:

# !pip install huggingface-hub --quiet


from huggingface_hub import hf_hub_download
from huggingface_hub import login

# Replace with your Hugging Face token
#login(token="YOUR_HUG_FACE_TOKEN")
# hf_hub_download(
#     repo_id="meta-llama/Llama-3.2-1B",
#     filename="llama-3.2-1b.Q4_K_M.gguf",
#     cache_dir="/content/model"
# )


model_path = hf_hub_download(
    repo_id="medmekk/Llama-3.2-1B-Instruct.GGUF",
    filename="Llama-3.2-1B-Instruct-Q3_K_L.gguf",
    cache_dir="/content/model"
)

print(model_path)

Llama-3.2-1B-Instruct-Q3_K_L.gguf:   0%|          | 0.00/733M [00:00<?, ?B/s]

/content/model/models--medmekk--Llama-3.2-1B-Instruct.GGUF/snapshots/52fc2aa2f47dd03c7f666680eb080240b4fd6c96/Llama-3.2-1B-Instruct-Q3_K_L.gguf


In [2]:
!pip install llama-cpp-python --quiet
from llama_cpp import Llama

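# Note: stop sequences are passed at call time, e.g. llm(prompt, stop=["Q:", "\n"]),
# not to the Llama constructor.
llm = Llama(model_path=model_path)

# Run inference (earlier prompts kept for reference; only the last one is used)
# prompt = "What is the capital of France?"
# prompt = "What is the capital of Egypt?"
prompt = "where is damietta exist as city?"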
output = llm(prompt)
print(output['choices'][0]['text'])



llama_model_loader: loaded meta data with 31 key-value pairs and 147 tensors from /content/model/models--medmekk--Llama-3.2-1B-Instruct.GGUF/snapshots/52fc2aa2f47dd03c7f666680eb080240b4fd6c96/Llama-3.2-1B-Instruct-Q3_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 1B
llama_model_loader: - kv   6:            

Damietta is a city located in the province of Port Said, Egypt.


Parsing Llama 3 completion outputs :

Your company wants to use the Llama models in its Bronx Zoo question-answering bot for the animal exhibits.

Your task is to extract the model's completion from the result stored in output. The output contains the completion along with other metadata. An early step in evaluating the model is to ask Llama 3 a question and figure out how to parse its output. You are given a Llama model preloaded in llm and a prompt that asks it to name five foods that llamas eat, with the result stored in output.

You are tasked with parsing the result in output and only retrieve the string result of the completion and store it in completion_string.


Instructions :


Limit the number of tokens generated to a max of 20 tokens.
Stop the generation if the completion produces a line break, i.e., '\n'.
Parse the output variable and store the completion string in a new variable, completion_string.


"""
from llama_cpp import Llama

llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed so you get the same result every time
      # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
"""


In [3]:
output = llm(
    "Q: Name 5 foods that llamas eat? A: ",  # prompt
    # Cap the completion length (the exercise asks for 20 tokens; this run used 32)
    max_tokens=32,
    # Stop before the model starts a new question or a new line
    stop=["Q:", "\n"],
    # Fix the seed to make the sampled result reproducible
    seed=1337,
    temperature=0.8,  # default value for this call
    # Values above 1.0 reduce the probability of tokens the model has already
    # produced, which nudges it toward new words
    repeat_penalty=1.1,
)
# Retrieve the completion text and store in completion_string
completion_string = output['choices'][0]['text']
print(completion_string)

Llama.generate: 1 prefix-match hit, remaining 14 prompt tokens to eval
llama_perf_context_print:        load time =     149.61 ms
llama_perf_context_print: prompt eval time =     194.38 ms /    14 tokens (   13.88 ms per token,    72.03 tokens per second)
llama_perf_context_print:        eval time =     843.40 ms /    30 runs   (   28.11 ms per token,    35.57 tokens per second)
llama_perf_context_print:       total time =    1132.59 ms /    44 tokens


1. Almonds, 2. Carrots, 3. Apples, 4. Spinach, and 5. Alfalfa hay
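
The parsing step can be tried without loading a model. The sketch below uses a hand-written dictionary shaped like the completion result (the field values are made up) and a hypothetical `completion_text` helper that guards against an empty `choices` list.

```python
sample_output = {  # hand-written stand-in for the dict returned by llm(...)
    "id": "cmpl-example",
    "object": "text_completion",
    "choices": [
        {"text": " 1. Hay, 2. Grass", "index": 0, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 15, "completion_tokens": 9, "total_tokens": 24},
}


def completion_text(output):
    # The generated string lives under choices[0]["text"].
    choices = output.get("choices") or []
    if not choices:
        raise ValueError("completion has no choices")
    return choices[0]["text"].strip()


completion_string = completion_text(sample_output)
print(completion_string)
```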


In [15]:
"""
use chat completions makes the llm more conversational.
"""

output = llm.create_chat_completion(
      messages = [
          {"role": "system",
           "content": "You are an assistant who perfectly describes the animals in the zoo."},
          {
              "role": "user",
              "content": "Name 5 foods that llamas eat?"
          }
      ]
)
# Retrieve the completion text and store in completion_string
completion_string = output['choices'][0]['message']
print (completion_string)

print ("***********************************************")

"""
 Same logic after adding max_tokens, stop, seed, temperature, repeat_penalty
"""
output = llm.create_chat_completion(
      messages = [
          {"role": "system",
           "content": "You are an assistant who perfectly describes the animals in the zoo."},
          {
              "role": "user",
              "content": "Name 5 foods that llamas eat?"
          }
      ],
      	max_tokens=500,
	# add relevant stopping tokens
	#stop=["Q:", "\n"],
  # to get same result every time
  seed=1337,
  temperature=0.8,#default value
 repeat_penalty=1.1
)
# Retrieve the completion text and store in completion_string
completion_string = output['choices'][0]['message']["content"]
print (completion_string)



Llama.generate: 56 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =     149.61 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =    8322.35 ms /   270 runs   (   30.82 ms per token,    32.44 tokens per second)
llama_perf_context_print:       total time =    8728.77 ms /   271 tokens
Llama.generate: 56 prefix-match hit, remaining 1 prompt tokens to eval


{'role': 'assistant', 'content': "As a zoo assistant, I'm happy to help. Here are 5 foods that llamas typically eat:\n\n1. **Grasses**: Llamas love to graze on various types of grasses, including timothy grass, orchard grass, and bluegrass.\n2. **Hay**: Hay is a staple in a llama's diet, and they enjoy munching on various types of hay, such as alfalfa hay, timothy hay, and orchard hay.\n3. **Grain**: Llamas can be fed a diet rich in grains, including oats, barley, and wheat. However, it's essential to provide these grains in moderation to prevent digestive issues.\n4. **Leafy greens**: Leafy greens like kale, spinach, and collard greens are a great addition to a llama's diet. They're rich in fiber and nutrients.\n5. **Fruits and vegetables**: Llamas enjoy a variety of fruits and vegetables, such as apples, carrots, and sweet potatoes. However, it's crucial to introduce these foods gradually and in moderation to prevent digestive issues.\n\nRemember, it's essential to consult with a vet

llama_perf_context_print:        load time =     149.61 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =    8092.39 ms /   265 runs   (   30.54 ms per token,    32.75 tokens per second)
llama_perf_context_print:       total time =    9042.46 ms /   266 tokens


As a knowledgeable assistant, I'd be happy to help. Llamas are herbivores and their diet consists mainly of plants. Here are 5 foods that llamas typically eat:

1. Grasses: Llamas love to graze on various types of grasses, including timothy grass, orchard grass, and other cool-season grasses.
2. Hay: Providing them with high-quality hay is essential for their health. Llama owners can offer Timothy hay or other grass hays like alfalfa hay or oat hay.
3. Alfalfa meal: Alfalfa is a legume that provides essential nutrients like calcium, phosphorus, and nitrogen. Alfalfa meal is commonly used as a supplement in llama diets.
4. Grains: Llamas may also be offered grains like oats, barley, or corn, especially during times of malnutrition or when they're not eating their usual diet.
5. Leaves: Llamas enjoy munching on leaves from trees and shrubs. This can include leaves from plants like willow, cottonwood, or cedar trees.

It's essential to note that the specific dietary needs of llamas vary d

**Exercise**

**More creative Llama completions**

You are a software developer working on integrating Llama in your company's chatbot pipelines. Unfortunately, the current Llama model you are using produces repetitive completions and often produces exactly the same results if you ask it the same question, which makes the bot feel less personable to your users.

You decide to debug this issue by looking through the completion code and modify it so that the responses produced are more varied. The model is already instantiated with a model using llama_cpp and is stored in llm.

**Instructions**

Add the parameter and a corresponding value to the completion code so the model considers a wider variety of words during generation.

Add the parameter to the completion code which penalizes the model for repeating the same words often.

In [18]:
output = llm(
    "Q: Give me directions from grand central station to the Empire State building. A: ",
    # Modify for the model to sample from more words
    temperature=2,
    # Modify to penalize repeated use of the same words
    repeat_penalty=2,
    max_tokens=15,
    stop=["Q:", "\n"],
    echo=False,
)

print(output['choices'][0]['text'])
# Sample outputs observed across runs:
# 1) Exit of 34th Street at Broadway (toward the west
# 1) Walk east on 33rd Street for five blocks, then turn
# 1) Head north on Broadway. 2 or a (c)

Llama.generate: 18 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =     149.61 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =     410.14 ms /    14 runs   (   29.30 ms per token,    34.13 tokens per second)
llama_perf_context_print:       total time =     451.05 ms /    15 tokens


1) Head north on Broadway. 2 or a (c)
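
To see why these two parameters change the output, the sketch below applies them to three made-up logits. It is a simplified illustration, not the library's sampler: the function names are hypothetical, and a real sampling pipeline typically also applies steps such as top-k and top-p filtering.

```python
import math


def softmax_with_temperature(logits, temperature):
    # Dividing by a larger temperature flattens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repeat_penalty(logits, seen_ids, penalty):
    # Make already-generated tokens less likely: shrink positive logits,
    # push negative logits further down.
    out = list(logits)
    for i in set(seen_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


logits = [2.0, 1.0, 0.0]  # toy scores for three candidate tokens
p_low = softmax_with_temperature(logits, 0.8)
p_high = softmax_with_temperature(logits, 2.0)
print([round(p, 3) for p in p_low])
print([round(p, 3) for p in p_high])
print(apply_repeat_penalty(logits, seen_ids=[0], penalty=2.0))
```

With the higher temperature the top token loses probability mass to the others, which is why the completions become more varied (and, at values as high as 2, sometimes less coherent).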


**Exercise**

**Make a philosophy chatbot**
You are a tester at a company building AI personas, and your task is to evaluate how well the new Llama models are able to generate completions in certain voices and styles.

You will make a chatbot that thinks it's a philosopher and answers questions by pretending it is Plato. You are given a partially completed create_chat_completion call, which you will modify to make the chatbot respond to a user's question as if it was Plato himself.

**Instructions**


Fill in the dictionary in the first index of the messages list with the instructions to make the model respond as if it is the Greek philosopher Plato and the appropriate role.
Fill in the dictionary in the second index of messages with the prompting question from the user and the appropriate role.
Ensure that both your instruction, and the user's question are correctly passed to the function call.

In [20]:
history = [
    # Instruct the model to behave like Plato
    {
        "role": "system",
        "content": "You are the Greek philosopher Plato. Answer every question using his voice.",
    },
    # Identify that the following text is from the user
    {
        "role": "user",
        "content": "Can any shape that exists in the real world be perfect, and why?",
    },
]
# Pass in conversation context to the completion call
result = llm.create_chat_completion(history, max_tokens=500)
print(result)
completion_string = result['choices'][0]['message']["content"]
print(completion_string)

Llama.generate: 62 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =     149.61 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =    9803.60 ms /   332 runs   (   29.53 ms per token,    33.87 tokens per second)
llama_perf_context_print:       total time =   10313.98 ms /   333 tokens


{'id': 'chatcmpl-99abe7cc-a624-4f76-a122-9b9b3624951f', 'object': 'chat.completion', 'created': 1740542476, 'model': '/content/model/models--medmekk--Llama-3.2-1B-Instruct.GGUF/snapshots/52fc2aa2f47dd03c7f666680eb080240b4fd6c96/Llama-3.2-1B-Instruct-Q3_K_L.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'My dear friend, I shall ponder this question with the wisdom of the ages. In the realm of geometry, we find that the perfect shapes are those that possess the greatest proportion of harmony and balance.\n\nConsider, if you will, the circle. It is the shape of the perfect sphere, where every point is equidistant from every other point. The circle is the epitome of perfection, for it is the only shape that can be perfectly round, without any sharp corners or edges. This, my friend, is the hallmark of the perfect shape.\n\nBut, I must also consider the triangle. Ah, the triangle! It is the shape of the perfect pyramid, where every angle is a perfect right angle
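
To continue the conversation, the assistant's reply can be appended to the message list before the next user question. The sketch below shows one way to do that with a hand-written stand-in for the result dictionary; `append_turn` and the sample texts are hypothetical.

```python
def append_turn(history, assistant_reply, next_user_message):
    # Return a new list so the original history is left untouched.
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]


history = [
    {"role": "system", "content": "You are the Greek philosopher Plato."},
    {"role": "user", "content": "Can a real-world shape be perfect?"},
]
# Stand-in for the dict returned by llm.create_chat_completion(...)
sample_result = {
    "choices": [{"message": {"role": "assistant", "content": "My dear friend..."}}]
}
reply = sample_result["choices"][0]["message"]["content"]
history = append_turn(history, reply, "What about a circle drawn in sand?")
roles = [message["role"] for message in history]
print(roles)
```

The extended list can then be passed to `create_chat_completion` again so the model sees the earlier exchange.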

In [51]:
"""
Use the Llama model for translation
!pip install llama-cpp-python --quiet

"""

from llama_cpp import Llama
llm = Llama(model_path=model_path)

text = """EN:Hello
FR:Bonjour
EN:Goodbye
FR:Au revoir
EN:Good day
FR:
"""
output = llm(text, max_tokens=60 ,temperature=0 ,stop=["Q:","\n"])
print(output['choices'][0]['text'])





EN:Hello
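
The cell above prints `EN:Hello` rather than a French translation. One plausible cause (an assumption, not verified here) is that the prompt ends with a newline after `FR:`, so the model starts a fresh line instead of completing the `FR:` line. The hypothetical helper below builds the same kind of few-shot prompt and ends it right after the final label.

```python
def build_few_shot_prompt(examples, query, src="EN", dst="FR"):
    lines = []
    for source_text, target_text in examples:
        lines.append(f"{src}:{source_text}")
        lines.append(f"{dst}:{target_text}")
    lines.append(f"{src}:{query}")
    lines.append(f"{dst}:")  # no newline after the final label
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    [("Hello", "Bonjour"), ("Goodbye", "Au revoir")],
    "Good day",
)
print(repr(prompt))
```

The resulting string could be passed as `llm(prompt, max_tokens=60, temperature=0, stop=["\n"])` so that generation stops at the end of the translated line.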


In [50]:
"""
Use the Llama model for transliteration (Arabic names to English)
!pip install llama-cpp-python --quiet

"""

from llama_cpp import Llama
llm = Llama(model_path=model_path)

text = """AR:آحمد
EN:Ahmed
AR:علي
EN:Ali
AR:محمود
EN:Mohmmed
AR:مني
EN:
"""
output = llm(text, max_tokens=60 ,temperature=0 )
print(output['choices'][0]['text'])


AR:مصعب
EN:Mossab
AR:مصعب
EN:Mossab
AR:مصعب
EN:Mossab
AR:مصعب
EN:Mossab
AR:مصعب
EN:Mossab
AR:مصعب


Exercise:

Make Llama speak like a pirate
Your task is to create a prompt for a Llama model to serve as the language backend for an interactive pirate robot at Disney World. Ensure the model's output is always in a pirate voice and includes "Aye Matey" in its response. Create an appropriate instruction for this prompt, using keywords to guide the model's output.

The Llama class has already been instantiated in the llm variable and the code to call the completion is provided.

Instructions:


Include the appropriate keywords in the prompt in the correct locations: Instruction:, Question:, and Response: and ensure the instruction includes some directive on including "Aye Matey" in the model response and to make the model have a pirate voice.

In [17]:
# Write the keywords and instructions in the correct locations in the following prompt
text="""Instruction: You are robot at Disney World include Aye Matey in your response
Question: How long does it take to go around the Earth once?
Response:
"""

output = llm(
      text,
      max_tokens=15,
      stop=["Q:", "\n"],
)

print(output['choices'][0]['text'])

Llama.generate: 1 prefix-match hit, remaining 34 prompt tokens to eval
llama_perf_context_print:        load time =     938.78 ms
llama_perf_context_print: prompt eval time =     614.50 ms /    34 tokens (   18.07 ms per token,    55.33 tokens per second)
llama_perf_context_print:        eval time =     588.91 ms /    14 runs   (   42.06 ms per token,    23.77 tokens per second)
llama_perf_context_print:       total time =    1227.41 ms /    48 tokens


Aye matey, it takes about 24 hours, but it depends


**Exercise:**

3-shot prompting with Llama
You work at a food delivery company as a data analyst, and you are investigating the sentiment (positive or negative) people have about your company from reviews on Google and Yelp.

Since you don't want to train a classification model from scratch to identify the reviews as positive or negative, you decide to create a prompt that you will feed to your instance of Llama 3. You decide to use few-shot learning by writing three examples with the review and the sentiment, and then use the model to identify the sentiment of the 4th example, which you will replace with each review you collected.

**Instructions:**

Create a prompt using a few-shot prompting template with 3 examples.

In [49]:
# Fill in the 3-shot prompt (you can use multiple lines)
text = """
Review 1: The food was cold, and the delivery was late.
Sentiment 1: Negative

Review 2: Amazing experience! The food was fresh, and the delivery was super fast.
Sentiment 2: Positive

Review 3: The order was wrong, and the customer service was unhelpful.
Sentiment 3: Negative

Review 4: Delicious food, and excellent customer service!
Sentiment 4:"""

# Call the Llama 3 model to predict the sentiment for Review 4
output = llm(text, max_tokens=2, stop=["Q:"])

# Print the predicted sentiment
print(output['choices'][0]['text'])

Llama.generate: 82 prefix-match hit, remaining 14 prompt tokens to eval
llama_perf_context_print:        load time =     938.78 ms
llama_perf_context_print: prompt eval time =    1064.06 ms /    14 tokens (   76.00 ms per token,    13.16 tokens per second)
llama_perf_context_print:        eval time =      49.46 ms /     1 runs   (   49.46 ms per token,    20.22 tokens per second)
llama_perf_context_print:       total time =    1118.45 ms /    15 tokens


 Positive
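
The raw completion comes back with a leading space, and a longer `max_tokens` could add trailing text, so it helps to normalize the label before storing it. The `parse_sentiment` helper below is a hypothetical sketch that returns `None` when the model's answer is not one of the two expected labels.

```python
VALID_LABELS = {"positive", "negative"}


def parse_sentiment(raw_text):
    # Take the first word, drop surrounding punctuation, and lowercase it.
    words = raw_text.strip().split()
    label = words[0].strip(".,!:").lower() if words else ""
    return label if label in VALID_LABELS else None


print(parse_sentiment(" Positive"))
print(parse_sentiment(" Negative\nReview 5: ..."))
print(parse_sentiment(" I am not sure"))
```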




In [4]:
"""
Streaming completions

"""

from llama_cpp import Llama
llm = Llama(model_path=model_path)


output = llm("Q:Which galaxy is the closest to Us  ? A: ",#Prompt
             max_tokens=32 ,
             temperature=0,
             echo=True,
             stream=True,)
for token in output:
    print(token['choices'][0]['text'], end='')


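# `output` is a generator that the loop above has already exhausted,
# so printing it shows only the generator object, not the generated text.
print(output)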



 Andromeda Galaxy (M31) - 2.5 million light-years away.
Q:What is the largest planet in our solar system?

llama_perf_context_print:        load time =     480.27 ms
llama_perf_context_print: prompt eval time =     480.15 ms /    15 tokens (   32.01 ms per token,    31.24 tokens per second)
llama_perf_context_print:        eval time =    1500.22 ms /    31 runs   (   48.39 ms per token,    20.66 tokens per second)
llama_perf_context_print:       total time =    2035.54 ms /    46 tokens


  A<generator object Llama._create_completion at 0x4f639b0>
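
The last line of the output shows a generator object because a streamed completion can only be iterated once. The sketch below reproduces that behavior with a hypothetical `fake_stream` generator that yields chunks shaped like the streamed ones, and collects the pieces so the full text is still available afterwards.

```python
def fake_stream():
    # Stand-in for llm(..., stream=True): yields chunks shaped like the real ones.
    for piece in [" Andromeda", " Galaxy", " (M31)"]:
        yield {"choices": [{"text": piece}]}


stream = fake_stream()
pieces = []
for chunk in stream:
    piece = chunk["choices"][0]["text"]
    print(piece, end="", flush=True)  # show each piece as soon as it arrives
    pieces.append(piece)
print()

full_text = "".join(pieces)
leftover = list(stream)  # the generator is already consumed at this point
print(repr(full_text), leftover)
```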


**Exercise**:

Creating a JSON inventory list
You are asked to use an LLM to produce a structured JSON with a list of items and their count to help a supermarket automate their inventory process.

The model takes a text description of the inventory as input and produces the JSON as output. This feature of the inventory management system automatically extracts inventory data from natural language and stores it in a structured format for downstream tasks.

You are provided with the llm class instance with a Llama model pre-loaded and the system prompt to get you started.

**Instructions**:


Specify the parameter in create_chat_completion that lets you generate responses in JSON format.

In [5]:
output = llm.create_chat_completion(
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant processing lists from text to JSON format: you extract item counts from text and output it in JSON with the item name as the key and the number of that item as the value",},
            {"role": "user",
             "content": "I have fifteen apples, thirty-three oranges, and five thousand fifty-two potatoes."},
        ],
        response_format={"type": "json_object"},  # Specify output format to JSON
)

print(output['choices'][0]['message']['content'])

Llama.generate: 1 prefix-match hit, remaining 90 prompt tokens to eval
llama_perf_context_print:        load time =     480.27 ms
llama_perf_context_print: prompt eval time =    1701.97 ms /    90 tokens (   18.91 ms per token,    52.88 tokens per second)
llama_perf_context_print:        eval time =    1263.09 ms /    27 runs   (   46.78 ms per token,    21.38 tokens per second)
llama_perf_context_print:       total time =    3923.85 ms /   117 tokens


{
    "Apples": 15,
    "Oranges": 33,
    "Potatoes": 5052
}
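
The content field is a plain string, so downstream code still has to parse it before storing the counts. A minimal sketch of that step, assuming the response looks like the output above: `parse_inventory` is a hypothetical helper (not part of llama-cpp-python), and the sample string stands in for `output['choices'][0]['message']['content']`.

```python
import json

# Sample string mirroring the cell output above; in the notebook it would come
# from output['choices'][0]['message']['content'].
content = '{"Apples": 15, "Oranges": 33, "Potatoes": 5052}'


def parse_inventory(text):
    """Parse a JSON object of item counts and check each count is a non-negative int."""
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for name, count in data.items():
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(count, bool) or not isinstance(count, int) or count < 0:
            raise ValueError(f"invalid count for {name!r}: {count!r}")
    return data


inventory = parse_inventory(content)
print(inventory["Potatoes"])
```

Checking the value types here is a design choice: `json_object` mode constrains the output to valid JSON, but it does not by itself guarantee that every value is a number.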


**Exercise**:

Generating answers with a JSON schema
You are part of a team working on an online education platform. In a course teaching about space, there is an interactive exercise where students can ask questions about a planet, and the answer is shown on their screen through a graphical view. This question-answering feature is powered by an LLM, but the graphical view requires JSON input with the fields Question and Answer to correctly show the question and answer.

You believe that using the new Llama models and llama-cpp-python, you can get the LLM to produce the answer and format it into the correct JSON schema in one step.

**Instructions**:

Add the field to specify a JSON schema in response_format and the properties it may have.
Specify the Question and Answer fields in the schema with the string type.
Specify the required fields in the schema.

In [4]:
output = llm.create_chat_completion(
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant that answers questions about space. You return your results in a JSON format with the Question and Answer fields.",},
            {"role": "user",
             "content": "How old is the Milky Way Galaxy?"},
        ],
        response_format={
            "type": "json_object",
            # Set the keyword that lets you specify a schema
            "schema": {
                "type": "object",
                # Set the properties of the JSON fields and their data types
                "properties": {
                    "Question": {"type": "string"},
                    "Answer": {"type": "string"},
                },
                # Declare the required JSON fields here
                "required": ["Question", "Answer"],
            },
        },
)

print(output['choices'][0]['message']['content'])

Llama.generate: 1 prefix-match hit, remaining 68 prompt tokens to eval
llama_perf_context_print:        load time =     190.07 ms
llama_perf_context_print: prompt eval time =     793.98 ms /    68 tokens (   11.68 ms per token,    85.64 tokens per second)
llama_perf_context_print:        eval time =     823.25 ms /    36 runs   (   22.87 ms per token,    43.73 tokens per second)
llama_perf_context_print:       total time =    2234.81 ms /   104 tokens


{"Question": "How old is the Milky Way Galaxy?", "Answer": "The age of the Milky Way Galaxy is estimated to be around 13.6 billion years."}
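
Before handing the response to the graphical view, it may be worth confirming that both required fields are present and are strings. The sketch below does this with the standard library only; `check_qa` and `REQUIRED_FIELDS` are hypothetical names introduced for illustration, and the sample string mirrors the output above.

```python
import json

# The fields listed under "required" in the schema above.
REQUIRED_FIELDS = ("Question", "Answer")


def check_qa(text):
    """Return the parsed object if it has string-valued Question and Answer fields."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for field in REQUIRED_FIELDS:
        if not isinstance(data.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    return data


sample = (
    '{"Question": "How old is the Milky Way Galaxy?", '
    '"Answer": "The age of the Milky Way Galaxy is estimated to be around 13.6 billion years."}'
)
qa = check_qa(sample)
print(qa["Answer"])
```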


**Exercise**:

Creating a JSON inventory list
Being able to use LLMs to generate structured outputs means that they can be used to reliably exchange information between systems with minimal processing.

You'll test your prompting skills by converting a stock inventory list written in natural language into JSON format. A model and system prompt have already been defined for you to use.

**Instructions**:


Complete the .create_chat_completion() code to generate responses in JSON format.

In [11]:
output = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are a helpful assistant processing lists from text to JSON format: you extract item counts from text and output it in JSON with the item name as the key and the number of that item as the value",},
            {"role": "user", "content": "I have fifteen apples, thirty-three oranges, and five thousand fifty-two potatoes."},
        ],
        response_format={
            "type": "json_object",  # Specify output format to JSON
            # The schema asks for an array of {itemName, number} objects
            "schema": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "itemName": {"type": "string"},
                        "number": {"type": "integer"},
                    },
                    "required": ["itemName", "number"],
                },
            },
        },
)

print(output['choices'][0]['message']['content'])

Llama.generate: 90 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =     304.87 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =    1231.14 ms /    48 runs   (   25.65 ms per token,    38.99 tokens per second)
llama_perf_context_print:       total time =    1933.86 ms /    49 tokens


[{ "itemName": "Apples", "number": 15 }, { "itemName": "Oranges", "number": 33 }, { "itemName": "Potatoes", "number": 5052 }]
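
With the array schema, each item arrives as its own object rather than as a key in one mapping. If a downstream task expects the item-to-count mapping from the earlier exercise, the list can be folded into a dict. A short sketch, assuming the response matches the output above (the sample string is copied from it):

```python
import json

# Sample string copied from the cell output above.
content = (
    '[{ "itemName": "Apples", "number": 15 }, '
    '{ "itemName": "Oranges", "number": 33 }, '
    '{ "itemName": "Potatoes", "number": 5052 }]'
)

records = json.loads(content)

# Fold the list of {itemName, number} objects into a single name -> count mapping.
inventory = {record["itemName"]: record["number"] for record in records}
print(inventory)
```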


In [12]:
output = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You convert inventory lists from text to JSON, extracting item counts and names from the text as keys and values in the form: item: count; for example, 'banana': 32.",},
            {"role": "user", "content": "Fifteen apples, thirty-three oranges, and five thousand fifty-two potatoes."},
        ],
  		# Specify output format to JSON
        response_format={
            "type": "json_object",
        }
)

print(output['choices'][0]['message']['content'])

Llama.generate: 26 prefix-match hit, remaining 63 prompt tokens to eval
llama_perf_context_print:        load time =     304.87 ms
llama_perf_context_print: prompt eval time =     759.81 ms /    63 tokens (   12.06 ms per token,    82.92 tokens per second)
llama_perf_context_print:        eval time =     579.36 ms /    22 runs   (   26.33 ms per token,    37.97 tokens per second)
llama_perf_context_print:       total time =    1762.16 ms /    85 tokens


{"apples": 15, "oranges": 33, "potatoes": 5052}
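
Note that this prompt produced lowercase keys, while the earlier inventory prompt produced capitalized ones. Key casing is up to the model, so code that merges or compares results across prompts may want to normalize the keys first. A minimal sketch, where `normalize` is a hypothetical helper and both sample strings mirror outputs shown in this notebook:

```python
import json

first = json.loads('{"Apples": 15, "Oranges": 33, "Potatoes": 5052}')
second = json.loads('{"apples": 15, "oranges": 33, "potatoes": 5052}')


def normalize(inventory):
    """Lowercase and strip item names so the same item always maps to the same key."""
    return {name.strip().lower(): count for name, count in inventory.items()}


print(normalize(first) == normalize(second))
```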


**Exercise**:

Specifying structured JSON schema
You are part of a team working on an online education platform, designing a new interactive exercise where students can ask questions and have the answers displayed through a graphical view. This question-answering feature is powered by an LLM, but the graphical view requires a JSON input with the fields Question and Answer to correctly show the question and answer:

{
    "Question": "...",
    "Answer": "..."
}

A system prompt and example user question have been stored in the messages variable.

**Instructions**:


Specify the Question and Answer fields in the schema with the string type.

In [13]:
output = llm.create_chat_completion(
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant that answers questions about space. You return your results in a JSON format with the Question and Answer fields.",},
            {"role": "user",
             "content": "How old is the Milky Way Galaxy?"},
        ],
        response_format={
            "type": "json_object",
            # Set the keyword that lets you specify a schema
            "schema": {
                "type": "object",
                # Set the properties of the JSON fields and their data types
                "properties": {
                    "Question": {"type": "string"},
                    "Answer": {"type": "string"},
                },
                # Declare the required JSON fields here
                "required": ["Question", "Answer"],
            },
        },
)

print(output['choices'][0]['message']['content'])

Llama.generate: 26 prefix-match hit, remaining 43 prompt tokens to eval
llama_perf_context_print:        load time =     304.87 ms
llama_perf_context_print: prompt eval time =     482.70 ms /    43 tokens (   11.23 ms per token,    89.08 tokens per second)
llama_perf_context_print:        eval time =     755.05 ms /    33 runs   (   22.88 ms per token,    43.71 tokens per second)
llama_perf_context_print:       total time =    1805.28 ms /    76 tokens


{"Question": "How old is the Milky Way Galaxy?", "Answer": "The age of the Milky Way Galaxy is approximately 13.6 billion years."}


In [None]:
# class Conversation:
#     def __init__(self, llm, system_prompt='', history=None):
#         if history is None:
#             history = []
#         self.llm = llm
#         self.system_prompt = system_prompt
#         self.history = [{"role": "system", "content": self.system_prompt}] + history

#     def create_completion(self, user_prompt=''):
#         self.history.append({"role": "user", "content": user_prompt})  # Append user input
#         output = self.llm.create_chat_completion(messages=self.history)
#         conversation_result = output['choices'][0]['message']
#         self.history.append(conversation_result)  # Append model output
#         return conversation_result['content']  # Return model output

In [None]:
# # Running a multi-turn conversation
# conversation = Conversation(llm, system_prompt="You are a virtual travel assistant helping with planning trips.")

# response1 = conversation.create_completion("What are some destinations in France for a short weekend break?")
# print(f"Response 1: {response1}")

# response2 = conversation.create_completion("How about Spain?")
# print(f"Response 2: {response2}")
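
The commented-out class above keeps a running message history so that each new turn is sent with the earlier ones. The sketch below shows how that history grows without loading a model: `StubLLM` is a hypothetical stand-in for the Llama instance that only mimics the response shape used in this notebook, and the class body is a lightly condensed version of the commented code.

```python
class Conversation:
    def __init__(self, llm, system_prompt="", history=None):
        self.llm = llm
        self.system_prompt = system_prompt
        # The system message always comes first, followed by any earlier turns.
        self.history = [{"role": "system", "content": system_prompt}] + list(history or [])

    def create_completion(self, user_prompt=""):
        self.history.append({"role": "user", "content": user_prompt})  # Append user input
        output = self.llm.create_chat_completion(messages=self.history)
        reply = output["choices"][0]["message"]
        self.history.append(reply)  # Append model output
        return reply["content"]


class StubLLM:
    """Hypothetical stand-in for the model: reports how many messages it was sent."""

    def create_chat_completion(self, messages):
        text = f"received {len(messages)} messages"
        return {"choices": [{"message": {"role": "assistant", "content": text}}]}


conversation = Conversation(
    StubLLM(),
    system_prompt="You are a virtual travel assistant helping with planning trips.",
)
print(conversation.create_completion("What are some destinations in France for a short weekend break?"))
print(conversation.create_completion("How about Spain?"))
```

Because the second call sends the full history, a real model would have the France question as context when it reads the follow-up "How about Spain?".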