<a href="https://colab.research.google.com/github/aelkhodary/AI_ML_Coders/blob/main/Working_With_Llama3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
"""
# Link to visit
https://llama-cpp-python.readthedocs.io/en/latest/
"""

In [None]:
"""
Quantized models and the GGUF format are techniques and file formats used to optimize large language models (LLMs) for efficient inference, especially on devices with limited computational resources (e.g., CPUs or low-memory GPUs). Let me break it down for you:

1. What is Quantization?
Quantization is a process of reducing the precision of the numbers used to represent a model's weights and activations. In simpler terms, it compresses the model by using fewer bits to store each number, which reduces the model's size and speeds up inference.

Why Quantize?
Smaller Model Size: Quantized models take up less disk space and memory.

Faster Inference: Lower precision calculations are faster, especially on CPUs.

Lower Hardware Requirements: Quantized models can run on devices with limited resources (e.g., laptops, edge devices).

Common Quantization Levels:
FP32 (32-bit floating point): Original precision, no quantization.

FP16 (16-bit floating point): Half precision, smaller and faster.

INT8 (8-bit integer): Even smaller and faster, with some loss of accuracy.

INT4 (4-bit integer): Highly compressed, fastest, but with more accuracy loss.

For example:

A model originally in FP32 might be quantized to INT8 or INT4, reducing its size by 4x or 8x, respectively.

2. What is GGUF?
GGUF (GPT-Generated Unified Format) is a file format introduced by the llama.cpp project to store quantized models efficiently.
It is the successor to the older GGML format and is designed to be more flexible and future-proof.

Key Features of GGUF:
Supports Multiple Quantization Levels: GGUF files can store models quantized to different levels (e.g., Q4_K_M, Q5_K_M, Q8_0).

Optimized for CPU Inference: GGUF models are designed to run efficiently on CPUs, making them ideal for devices without GPUs.

Self-Contained: GGUF files include all necessary metadata (e.g., model architecture, tokenizer configuration) in a single file.

Cross-Platform: GGUF models can be used across different operating systems (Linux, Windows, macOS).

Why Use GGUF?
Efficiency: GGUF models are highly optimized for inference, making them ideal for resource-constrained environments.

Ease of Use: A single .gguf file contains everything needed to run the model.

Compatibility: GGUF is widely supported by tools like llama.cpp and llama-cpp-python.

3. How Quantization and GGUF Work Together
A model is first trained in high precision (e.g., FP32 or FP16).

After training, the model is quantized to a lower precision (e.g., INT8 or INT4) and saved in GGUF format.

The quantized GGUF model can then be loaded and run efficiently using tools like llama.cpp or llama-cpp-python.

4. Example of Quantized GGUF Models
When you browse a model repository on Hugging Face (e.g., TheBloke/Llama-2-7B-GGUF), you’ll see files like:

llama-2-7b.Q4_K_M.gguf

llama-2-7b.Q5_K_M.gguf

llama-2-7b.Q8_0.gguf

These files represent the same model but quantized to different levels:

Q4_K_M: 4-bit quantization, medium precision.

Q5_K_M: 5-bit quantization, medium precision.

Q8_0: 8-bit quantization, higher precision.

5. Trade-offs of Quantization
While quantization offers significant benefits, it also comes with some trade-offs:

Accuracy Loss: Lower precision can reduce the model's accuracy, especially for complex tasks.

Quantization Artifacts: Some models may behave differently after quantization.

Not Suitable for Training: Quantization is primarily used for inference; training requires higher precision.

6. How to Use GGUF Models
To use a GGUF model, you typically:

Download the .gguf file from a repository like Hugging Face.

Load the model using a compatible library (e.g., llama-cpp-python).

Run inference on the model.

Example Code:
python
Copy
from llama_cpp import Llama

# Load the GGUF model
model_path = "llama-2-7b.Q4_K_M.gguf"
llm = Llama(model_path=model_path)

# Run inference
prompt = "What is the capital of France?"
output = llm(prompt)
print(output['choices'][0]['text'])


7. Comparison: GGUF vs PyTorch Models
Feature	-->GGUF Format	-->PyTorch Format (e.g., .bin)
File Size	-->Smaller (due to quantization)	Larger (full precision)
Inference Speed	-->Faster (optimized for CPU)	Slower (requires GPU for best speed)
Hardware Support	-->Runs on CPUs and low-resource devices	Requires GPUs for efficient inference
Use Case	-->Inference-only	Training and inference
Flexibility	-->Limited to inference	Full flexibility for training and fine-tuning


Summary
Quantization reduces the precision of a model's weights to make it smaller and faster.

GGUF is a file format designed to store quantized models efficiently, optimized for CPU inference.

Quantized GGUF models are ideal for running large language models on devices with limited resources.

Let me know if you need further clarification!

"""

In [None]:
# Install Models file
"""
Pulling models from Hugging Face Hub
You can download Llama models in gguf format directly from Hugging Face using the from_pretrained method.
You'll need to install the huggingface-hub package to use this feature (pip install huggingface-hub).
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)

By default from_pretrained will download the model to the huggingface cache directory, you can then manage installed model files with the huggingface-cli tool.


1. What is filename?
The filename is the name of the specific file you want to download from the repository.

In the case of GGUF models, this is typically a .gguf file that contains the quantized version of the model.

You must provide the exact name of the file as it appears in the repository. Wildcards (e.g., *) are not supported.

2. Where to Find the filename?
Go to the model's Hugging Face repository page (e.g., meta-llama/Llama-3.2-1B).

Navigate to the "Files and versions" tab.

Look for the .gguf file you want to download. For example, you might see files like:

llama-3.2-1b.Q4_K_M.gguf

llama-3.2-1b.Q5_K_M.gguf

llama-3.2-1b.Q8_0.gguf

Copy the exact name of the file you want to download and use it as the filename argument.


3. Why is filename Important?
Hugging Face repositories can contain multiple files (e.g., model weights, configuration files, tokenizer files, etc.).

The filename ensures that you download the specific file you need (in this case, the .gguf file).

If you provide an incorrect or non-existent filename, the download will fail with an error.


######## Load model using transfoemer :
The code you provided mixes two different libraries and approaches:

transformers library: Used for loading models like meta-llama/Llama-3.2-1B in PyTorch format.

llama-cpp-python library: Used for loading GGUF models (quantized models compatible with llama.cpp).

These two libraries are not directly compatible because:

transformers loads models in PyTorch format (e.g., .bin files).

llama-cpp-python loads models in GGUF format (e.g., .gguf files).

If you want to use the llama-cpp-python library, you need to download a GGUF version of the model, not the PyTorch version. Here's how you can modify your code:

#Call another model

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")



from llama_cpp import Llama

model = Llama(model_path=model_path)
# Run inference
prompt = "What is the capital of France?"
prompt = "What is the capital of Egypt?"
output = llm(prompt)
print(output['choices'][0]['text'])


"""


In [1]:

# !pip install huggingface-hub --quiet


from huggingface_hub import hf_hub_download
from huggingface_hub import login

# Replace with your Hugging Face token
#login(token="YOUR_HUG_FACE_TOKEN")
# hf_hub_download(
#     repo_id="meta-llama/Llama-3.2-1B",
#     filename="llama-3.2-1b.Q4_K_M.gguf",
#     cache_dir="/content/model"
# )


model_path =hf_hub_download(
    repo_id="medmekk/Llama-3.2-1B-Instruct.GGUF",
    filename="Llama-3.2-1B-Instruct-Q3_K_L.gguf",
    cache_dir="/content/model"
)

print(model_path)

Llama-3.2-1B-Instruct-Q3_K_L.gguf:   0%|          | 0.00/733M [00:00<?, ?B/s]

/content/model/models--medmekk--Llama-3.2-1B-Instruct.GGUF/snapshots/52fc2aa2f47dd03c7f666680eb080240b4fd6c96/Llama-3.2-1B-Instruct-Q3_K_L.gguf


In [14]:
!pip install llama-cpp-python --quiet
from llama_cpp import Llama

#llm = Llama(model_path=model_path, stop=["Q:", "\n"])
# Initialize the Llama model with a larger context window
llm = Llama(
    model_path=model_path,
    n_ctx=2048,  # Increase context window to 2048 tokens
    verbose=False
)

# Run inference
prompt = "What is the capital of France?"
prompt = "What is the capital of Egypt?"
output = llm(prompt)
print(output['choices'][0]['text'])

llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


 Cairo.

Cairo is the largest city in Egypt and the country's economic and


Parsing Llama 3 completion outputs :

Your company wants to use the Llama models in its Bronx Zoo question-answering bot for the animal exhibits.

Your task is to extract the model's completion from the result stored in output. The output contains the completion and many other metadata. An early step to evaluate the model is to ask Llama 3 a question, and figure out how to parse its output. You are given a Llama model preloaded in llm, and given the prompt which asks it to name five foods that llamas eat, with the result stored in output.

You are tasked with parsing the result in output and only retrieve the string result of the completion and store it in completion_string.


Instructions :


Limit the number of tokens generated to a max of 20 tokens.
Stop the generation if the completion produces a line break, ie '\n'.
Parse the output variable and store the completion string in a new variable, completion_string.


"""
from llama_cpp import Llama

llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed to make sure you will
                     get the same result every time
      # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
"""


In [4]:
output = llm(
	"Q: Name 5 foods that llamas eat? A: ", # prompt
  	# restrict to 20 tokens
	max_tokens=32,
	# add relevant stopping tokens
	stop=["Q:", "\n"],
  # to get same result every time
  seed=1337,
  temperature=0.8,#default value
 repeat_penalty=1.1#default value(make the model produce new words by reducing the probabilities of words that it has already produced)
)
# Retrieve the completion text and store in completion_string
completion_string = output['choices'][0]['text']
print(completion_string)

1. Almonds, 2. Carrots, 3. Apples, 4. Sunflower seeds, 5. Hay.


In [20]:
"""
use chat completions makes the llm more conversational.
"""

output = llm.create_chat_completion(
      messages = [
          {"role": "system",
           "content": "You are an assistant who perfectly describes the animals in the zoo."},
          {
              "role": "user",
              "content": "Name 5 foods that llamas eat?"
          }
      ]
)
# Retrieve the completion text and store in completion_string
completion_string = output['choices'][0]['message']
print (completion_string)

print ("***********************************************")

"""
 Same logic after adding max_tokens, stop, seed, temperature, repeat_penalty
"""
output = llm.create_chat_completion(
      messages = [
          {"role": "system",
           "content": "You are an assistant who perfectly describes the animals in the zoo."},
          {
              "role": "user",
              "content": "Name 5 foods that llamas eat?"
          }
      ],
      	max_tokens=500,
	# add relevant stopping tokens
	#stop=["Q:", "\n"],
  # to get same result every time
  seed=1337,
  temperature=0.8,#default value
 repeat_penalty=1.1
)
# Retrieve the completion text and store in completion_string
completion_string = output['choices'][0]['message']["content"]
print (completion_string)



{'role': 'assistant', 'content': "As a knowledgeable assistant, I'd be happy to help. Llamas are herbivores, which means they primarily eat plants. Here are 5 foods that llamas typically enjoy:\n\n1. Grass: Llamas love to graze on various types of grass, including timothy grass, orchard grass, and bluegrass.\n2. Hay: Hay is a staple in a llama's diet, and they often eat it in large quantities. Timothy hay is a popular choice due to its high nutritional value.\n3. Alfalfa: Alfalfa is a legume that's rich in nutrients, making it a nutritious addition to a llama's diet. It's often fed to them in small amounts.\n4. Silage: Silage is a fermented plant-based food made from grasses, legumes, or other crops. It's a nutritious and energy-rich food source for llamas.\n5. Alfalfa hay and silage: Many llamas are fed a combination of alfalfa hay and silage, which provides a balanced mix of nutrients and fiber.\n\nRemember, it's essential to provide a balanced diet for llamas, and their dietary need

**Exercise**

**More creative Llama completions**

You are a software developer working on integrating Llama in your company's chatbot pipelines. Unfortunately, the current Llama model you are using produces repetitive completions and often produces exactly the same results if you ask it the same question, which makes the bot feel less personable to your users.

You decide to debug this issue by looking through the completion code and modify it so that the responses produced are more varied. The model is already instantiated with a model using llama_cpp and is stored in llm.

**Instructions**

Add the parameter and a corresponding value to the completion code so the model considers a wider variety of words during generation.

Add the parameter to the completion code which penalizes the model for repeating the same words often.

In [18]:
output = llm(
		"Q: Give me directions from grand central station to the Empire State building. A: ",
  		# Modify for the model to sample from more words
		temperature=2,
  		# Modify to penalize repeated use of the same words
		repeat_penalty=2,
        max_tokens=15,
        stop=["Q:", "\n"],
        echo=False
)

print(output['choices'][0]['text'])
#1) Exit of 34th Street at Broadway (toward the west
#1) Walk east on 33rd Street for five blocks, then turn
#1) Head north on Broadway. 2 or a (c)

Llama.generate: 18 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =     149.61 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =     410.14 ms /    14 runs   (   29.30 ms per token,    34.13 tokens per second)
llama_perf_context_print:       total time =     451.05 ms /    15 tokens


1) Head north on Broadway. 2 or a (c)


**Exercise**

**Make a philosophy chatbot**
You are a tester at a company building AI personas, and your task is to evaluate how well the new Llama models are able to generate completions in certain voices and styles.

You will make a chatbot that thinks it's a philosopher and answers questions by pretending it is Plato. You are given a partially completed create_chat_completion call, which you will modify to make the chatbot respond to a user's question as if it was Plato himself.

**Instructions**


Fill in the dictionary in the first index of the messages list with the instructions to make the model respond as if it is the Greek philosopher Plato and the appropriate role.
Fill in the dictionary in the second index of messages with the prompting question from the user and the appropriate role.
Ensure that both your instruction, and the user's question are correctly passed to the function call.

In [21]:
history = [
	# Instruct the model to behave like Plato
	{"role": "system",
	 "content": "You are the Greek philosopher Plato. Answer every question using his voice."
	},
	# Identify that the following text is from the user
	{
      "role": "user",
			"content": "Can any shape that exist in the real world be perfect and why?"
    }
]
# Pass in conversation context to the completion call
result = llm.create_chat_completion(history, max_tokens=500 )
print(result)
completion_string = result['choices'][0]['message']["content"]
print (completion_string)

{'id': 'chatcmpl-63f31f4b-479e-48f7-ab8d-83ef483f9e61', 'object': 'chat.completion', 'created': 1740949180, 'model': '/content/model/models--medmekk--Llama-3.2-1B-Instruct.GGUF/snapshots/52fc2aa2f47dd03c7f666680eb080240b4fd6c96/Llama-3.2-1B-Instruct-Q3_K_L.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'My dear friend, I shall ponder this question with the wisdom of the ages. The concept of perfection is a complex and multifaceted one, and I shall attempt to unravel it for you.\n\nIn the realm of geometry, we find shapes that are indeed perfect, but not necessarily in the classical sense. For instance, the perfect circle, as you may know, is a circle that has no beginning or end, no corners, and no edges. It is a shape that is, by definition, complete and unbroken.\n\nHowever, when we consider the concept of "beauty" or "perfection" in the human experience, we find that it is a much more nuanced and subjective matter. Beauty, my friend, is not a fixed prope

In [14]:
"""
Use Model Llama for translation
!pip install llama-cpp-python --quiet

"""

from llama_cpp import Llama
llm = Llama(model_path=model_path, verbose=False)

text = """EN:Hello
FR:Bonjour
EN:Goodbye
FR:Au revoir
EN:Good day
FR:
"""
output = llm(text, max_tokens=60 ,temperature=0 ,stop=["Q:","\n"])
print(output['choices'][0]['text'])




llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


EN:Hello


In [22]:
"""
Use Model Llama for translation
!pip install llama-cpp-python --quiet

"""

from llama_cpp import Llama
llm = Llama(model_path=model_path , verbose=False)

text = """AR:آحمد
EN:Ahmed
AR:علي
EN:Ali
AR:محمود
EN:Mohmmed
AR:مني
EN:
"""
output = llm(text, max_tokens=60 ,temperature=0 )
print(output['choices'][0]['text'])

llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


AR:مصعب
EN:Mossab
AR:مصعب
EN:Mossab
AR:مصعب
EN:Mossab
AR:مصعب
EN:Mossab
AR:مصعب
EN:Mossab
AR:مصعب


Exercise:

Make Llama speak like a pirate
Your task is to create a prompt for a Llama model to serve as the language backend for an interactive pirate robot at Disney World. Ensure the model's output is always in a pirate voice and includes "Aye Matey" in its response. Create an appropriate instruction for this prompt, using keywords to guide the model's output.

The Llama class has already been instantiated in the llm variable and the code to call the completion is provided.

Instructions:


Include the appropriate keywords in the prompt in the correct locations: Instruction:, Question:, and Response: and ensure the instruction includes some directive on including "Aye Matey" in the model response and to make the model have a pirate voice.

In [23]:
# Write the keywords and instructions in the correct locations in the following prompt
text="""Instruction: You are robot at Disney World include Aye Matey in your response
Question: How long does it take to go around the Earth once?
Response:
"""

output = llm(
      text,
      max_tokens=15,
      stop=["Q:", "\n"],
)

print(output['choices'][0]['text'])

Aye matey, it be a good question! 


**Exercise:**

3-shot prompting with Llama
You work at a food delivery company as a data analyst, and you are investigating the sentiment (positive or negative) people have about your company from reviews on Google and Yelp.

Since you don't want to train a classification model from scratch to identify the reviews as positive or negative, you decide to create a prompt that you will feed to your instance of Llama 3. You decide to use few shot learning by writing three examples with the review and the sentiment, and use the model identify the sentiment on the 4th example, which you will replace with each review you collected.

**Instructions:**

Create a prompt using a few-shot prompting template with 3 examples.

In [24]:
# Fill in the 3-shot prompt (you can use multiple lines)
text = """
Review 1: The food was cold, and the delivery was late.
Sentiment 1: Negative

Review 2: Amazing experience! The food was fresh, and the delivery was super fast.
Sentiment 2: Positive

Review 3: The order was wrong, and the customer service was unhelpful.
Sentiment 3: Negative

Review 4: Delicious food, and excellent customer service!
Sentiment 4:"""

# Call the Llama 3 model to predict the sentiment for Review 4
output = llm(text, max_tokens=2, stop=["Q:"])

# Print the predicted sentiment
print(output['choices'][0]['text'])

 Positive




In [26]:
"""
Streaming completions

"""

from llama_cpp import Llama
llm = Llama(model_path=model_path, verbose=False)


output = llm("Q:Which galaxy is the closest to Us  ? A: ",#Prompt
             max_tokens=32 ,
             temperature=0,
             echo=True,
             stream=True,)
for token in output:
    print(token['choices'][0]['text'], end='')


print(output)


llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


 Andromeda Galaxy (M31) - 2.5 million light-years away.
Q:What is the largest planet in our solar system?  A<generator object Llama._create_completion at 0x3fb5620>


**Exercise**:

Creating a JSON inventory list
You are asked to use an LLM to produce a structured JSON with a list of items and their count to help a supermarket automate their inventory process.

The model takes a text description of the inventory as input and produces the JSON as output. This feature of the inventory management system automatically extracts inventory data from natural language and stores it in a structured format for downstream tasks.

You are provided with the llm class instance with a Llama model pre-loaded and the system prompt to get you started.

**Instructions**:


Specify the parameters in create_chat_completion that lets you generate responses in JSON format.

In [27]:
output = llm.create_chat_completion(
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant processing lists from text to JSON format: you extract item counts from text and output it in JSON with the item name as the key and the number of that item as the value",},
            {"role": "user",
             "content": "I have fifteen apples, thirty-three oranges, and five thousand fifty-two potatoes."},
        ],
		response_format={"type": "json_object"}, # Specify output format to JSON

)

print(output['choices'][0]['message']['content'])

{
    "Apples": 15,
    "Oranges": 33,
    "Potatoes": 5052
}


**Exercise**:

Generating answers with a JSON schema
You are part of a team working on an online education platform. In a course teaching about space, there is an interactive exercise where students are able to ask questions about a planet and the answer is shown on their screen through a graphical view. This question-answering feature is powered by an LLM, but the graphical view requires a JSON as an input with the fields Question and Answer to correctly showing the question and answer.

You believe that using the new Llama models and llama-cpp-python, you can get the LLM to produce the answer and format it into the correct JSON schema in one step.

**Instructions**:

Add the field to specify a JSON schema in response_format and the properties it may have.
Specify the Question and Answer fields in the schema with the string type.
Specify the required fields in the schema.

In [28]:
output = llm.create_chat_completion(
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant that answers questions about space. You return your results in a JSON format with the Question and Answer fields.",},
            {"role": "user",
             "content": "How old is the Milky Way Galaxy?"},
        ],
        response_format={
            "type": "json_object",
          	# Set the keyword that lets you specify a schema
            "schema": {
            "type": "object",
            # Set the properties of the JSON fields and their data types
            "properties": {
            "Question": {"type": "string"},
             "Answer": {"type": "string"}
             },
            # Declare the required JSON fields here
            "required": ["Question", "Answer"],
            },
        },
)

print(output['choices'][0]['message']['content'])

{"Question": "How old is the Milky Way Galaxy?", "Answer": "The age of the Milky Way Galaxy is approximately 13.6 billion years."}


**Exercise**:

Creating a JSON inventory list
Being able to use LLMs to generate structured outputs means that they can be used to reliably exchange information between systems with minimal processing.

You'll test your prompting skills by converting a stock inventory list written in natural language into JSON format. A model and system prompt have already been defined for you to use.

**Instructions**:


Complete the .create_chat_completion() code to generate responses in JSON format.

In [30]:
 output = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are a helpful assistant processing lists from text to JSON format: you extract item counts from text and output it in JSON with the item name as the key and the number of that item as the value",},
            {"role": "user", "content": "I have fifteen apples, thirty-three oranges, and five thousand fifty-two potatoes."},
        ],
		response_format={"type": "json_object",# Specify output format to JSON
            "schema": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "itemName": {"type": "string"},
                    "number": {"type": "integer"}
                },
                "required": ["itemName", "number"]
            }
            },

            },

)

print(output['choices'][0]['message']['content'])

[{ "itemName": "Apples", "number": 15 }, { "itemName": "Oranges", "number": 33 }, { "itemName": "Potatoes", "number": 5052 }]


In [31]:
output = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You convert inventory lists from text to JSON, extracting item counts and names from the text as keys and values in the form: item: count; for example, 'banana': 32.",},
            {"role": "user", "content": "Fifteen apples, thirty-three oranges, and five thousand fifty-two potatoes."},
        ],
  		# Specify output format to JSON
        response_format={
            "type": "json_object",
        }
)

print(output['choices'][0]['message']['content'])

{"apples": 15, "oranges": 33, "potatoes": 5052}


**Exercise**:

Specifying structured JSON schema
You are part of a team working on an online education platform designing new interactive exercise where students are able to ask questions and their answer is displayed through a graphical view. This question-answering feature is powered by an LLM, but the graphical view requires a JSON input with the fields Question and Answer to correctly show the question and answer:

{
    "Question": "...",
    "Answer": "..."
}
A system prompt and example user question have been stored in the messages variable.

**Instructions**:


Specify the Question and Answer fields in the schema with the string type.

In [32]:
output = llm.create_chat_completion(
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant that answers questions about space. You return your results in a JSON format with the Question and Answer fields.",},
            {"role": "user",
             "content": "How old is the Milky Way Galaxy?"},
        ],
        response_format={
            "type": "json_object",
          	# Set the keyword that lets you specify a schema
            "schema": {
            "type": "object",
            # Set the properties of the JSON fields and their data types
            "properties": {
            "Question": {"type": "string"},
             "Answer": {"type": "string"}
             },
            # Declare the required JSON fields here
            "required": ["Question", "Answer"],
            },
        },
)

print(output['choices'][0]['message']['content'])

{"Question": "How old is the Milky Way Galaxy?", "Answer": "The age of the Milky Way Galaxy is estimated to be around 13.6 billion years."}


**Exercise**:

System messages
Your previous messages gave you a valid response, but it also allows users to ask any question they like, even if it doesn't relate to internet service support. In this exercise, you'll utilize a system message to steer the model into only answering customer questions about their internet service.

The Llama model is still available as llm.

**Instructions**:


Add the system message provided to the conv list of dictionaries, which should respond with Sorry, I can't answer that for non-internet related queries.
Extract the model response from the result object.

In [22]:
# Add a system message to the conversation list
conv = [
	{
        "role": "system",
        "content": "You are a helpful and professional customer support assistant for an internet service provider. If the question or instruction doesn't relate to internet service, quote the response: 'Sorry, I can't answer that.'"},
	{
        "role": "user",
	    "content": "Help me decide which stocks to invest in."
    }
]

result = llm.create_chat_completion(messages=conv, max_tokens=15)
# Extract the model response from the result object
assistant_content = result['choices'][0]['message']['content']
print(assistant_content)

I can't answer that.


**Exercise**:

Formatting prompts for Llama
Models can sometimes struggle to separate the task, expected output, and additional context from a long, unstructured prompt. To remedy this, you can insert clear labels to break up and differentiate this information for the model.

The Llama model is available as llm, and will be available for the remainder of the course.

**Instructions**:


Add the labels Instruction, Question, and Answer to the prompt to format it more effectively.

In [23]:
# Add formatting to the prompt
prompt="""
Instruction: Explain the concept of gravity in simple terms.
Question: What is gravity?
Answer:
"""

# Send the prompt to the model
output = llm(prompt, max_tokens=15, stop=["Q:"])
print(output['choices'][0]['text'])

## Step 1: Define Gravity
Gravity is a fundamental force of nature


**Exercise**:

**Few-shot prompting with Llama**:

You're using a Llama model to identify the sentiment of customer reviews from Google and Yelp as Positive or Negative. To ensure these labels are consistent for each review, you'll design a few-shot prompt containing three examples.

Here are the examples you want to provide to the model:

I ordered from this place last night, and I'm impressed! → Positive
My order was delayed by over an hour without any updates. Disappointing! → Negative
The food quality is top-notch. Highly recommend! → Positive

**Instructions**:


Complete the few-shot prompt by assigning Positive or Negative to the reviews provided.
Send the prompt to the model with the "Review:" stop word so the model only responds to one review.

In [45]:
# Complete the few-shot prompt
prompt="""Review 1: I ordered from this place last night, and I'm impressed!
Sentiment 1: Positive,
Review 2: My order was delayed by over an hour without any updates. Disappointing!
Sentiment 2: Negative,
Review 3: The food quality is top-notch. Highly recommend!
Sentiment 3: Positive,
Review 4: Delicious food, and excellent customer service!
Sentiment 4:"""

# Send the prompt to the model with a stop word
output = llm(prompt, max_tokens=2, stop=["Review:"])
print(output['choices'][0]['text'])

 Positive,



**Exercise**:

**Ensuring safe responses**:

You are developing an internal chatbot for a medical team. The chatbot provides the team with help on insurance guidelines, and for compliance reasons, answers must be consistent each time a question is asked.

To meet audit requirements, you need to restrict variability in responses by limiting the response length, and the model's token selection to only the most likely options.

You have been provided the Llama class instance in the llm variable and the code to call the completion. You are also given a sample prompt to test with.

**Instructions**:


Set the maximum number of tokens to limit the length of the response to 10 tokens.
Restrict the top-k parameter so that it only ever chooses between the two most likely tokens at each completion step.


**Explanation of Parameters**
**max_tokens=10:**

This limits the response to a maximum of 10 tokens. Tokens can be words, subwords, or characters, depending on the tokenizer used by the model.

For example, if the model generates a response like "The symptoms of strep throat include...", it will stop after 10 tokens.

**top_k=2:**

This restricts the model to only consider the top 2 most likely tokens at each step of the generation process.

By limiting the model to the top 2 tokens, you reduce variability in the responses, ensuring more consistency.

In [48]:
output = llm(
		"What are the symptoms of strep throat?",
  		# Set the maximum number of tokens
      	max_tokens=20,
		# Restrict decoding to choose between top two tokens
		top_k=5
)

print(output['choices'][0]['text'])

?

Strep throat is an infectious disease caused by the bacteria Streptococcus pyogenes.


**Exercise**:

**Generating creative copy**

You're developing an AI-powered content assistant for a SaaS marketing team. The team needs to automate social media posts about their latest software updates, and you need to adjust response diversity so that multiple calls to the model result in different variations.

You have been provided the Llama class instance in the llm variable and the code to call the completion. You are also given a sample prompt to test with.

**Instructions**:

Adjust the top-p parameter to a value in the upper half of its range so that it generates more varied responses.

**Explanation of Parameters**:

**max_tokens=15:**

Limits the response to a maximum of 15 tokens. This ensures the tweet is concise and fits within the character limit of a typical tweet.

**top_p=0.9:**

The top_p parameter controls nucleus sampling. It restricts the model to only consider the smallest set of tokens whose cumulative probability exceeds top_p.

A higher value (e.g., 0.9) allows the model to choose from a broader set of likely tokens, resulting in more varied and creative responses.

For example, with top_p=0.9, the model might generate:

"Excited to launch our new analytics dashboard for enterprise users! 🚀 Gain deeper insights and make data-driven decisions. #DataAnalytics #Enterprise"

"Introducing our new analytics dashboard for enterprise users! 📊 Unlock powerful insights and streamline your workflows. #BusinessIntelligence"

**Why Use top_p?**

Variety: A higher top_p value allows the model to explore more diverse options, which is useful for creative tasks like writing tweets.

Coherence: Unlike temperature, which can introduce randomness, top_p ensures that the model only considers tokens with high likelihood, maintaining coherence in the response.

In [49]:
output = llm(
      	"Write a tweet announcing a new analytics dashboard feature for enterprise users.",
		max_tokens=15,
		# Set top-p to a value in the upper range for more varied responses
		top_p=0.9
	)

print(output['choices'][0]['text'])

 "Get instant insights into your data with our new analytics dashboard feature. Quickly


**Conversation**:
How to keep conversation History.

In [12]:
class Conversation:
    def __init__(self, llm, system_prompt='', history=None):
        if history is None:
            history = []
        self.llm = llm
        self.system_prompt = system_prompt
        self.history = [{"role": "system", "content": self.system_prompt}] + history

    def create_completion(self, user_prompt=''):
        self.history.append({"role": "user", "content": user_prompt})  # Append user input
        output = self.llm.create_chat_completion(messages=self.history)
        conversation_result = output['choices'][0]['message']
        self.history.append(conversation_result)  # Append model output
        return conversation_result['content']  # Return model output

In [13]:
# Running a multi-turn conversation
conversation = Conversation(llm, system_prompt="You are a virtual travel assistant helping with planning trips.")

response1 = conversation.create_completion("What are some destinations in France for a short weekend break?")
print(f"Response 1: {response1}")

response2 = conversation.create_completion("How about Spain?")
print(f"Response 2: {response2}")

Response 1: France is a wonderful destination for a short weekend break. Here are some popular and charming options:

1. **Paris**: The City of Light is a must-visit destination. Explore iconic landmarks like the Eiffel Tower, Notre-Dame, and the Louvre Museum. Enjoy the city's famous cafes, restaurants, and shopping.
2. **The French Riviera (Nice, Cannes, Antibes)**: Enjoy the stunning beaches, crystal-clear waters, and picturesque towns of the Côte d'Azur. Visit the famous Promenade des Anglais in Nice, and explore the artistic and cultural scene of Cannes and Antibes.
3. **The Loire Valley (Chambord, Blois)**: Visit the stunning châteaux, such as Chambord and Chenonceau, and explore the beautiful countryside. Enjoy the region's famous cuisine, wine, and history.
4. **The Dordogne Region (Périgord)**: Discover the region's rich history, culture, and natural beauty. Visit the famous Prehistoric caves, explore the medieval towns, and enjoy the region's famous cuisine and wine.
5. **The

**Exercise**:

Creating a conversational class
One of the most popular applications for LLMs like Llama is conversational chatbots, which allow users to send messages, get a response, and ask follow-up questions using a message history.

In this exercise, you'll create a class called Conversation that will allow you to create conversation with minimal code.

**Instructions**:

Complete the __init__ method of the Conversation class to define the LLM, system prompt, and history, which is the system message and history added together.

In [17]:
class Conversation:
    # Complete the __init__ method of the Conversation class
    def __init__(self, llm: Llama, system_prompt='', history=[]):
        self.llm = llm
        self.system_prompt = system_prompt
        self.history = [{"role": "system", "content": self.system_prompt}] + history

    def create_completion(self, user_prompt=''):
        # Add the user prompt to the history
        self.history.append({"role": "user", "content": user_prompt})
        # Send the history messages to the LLM
        output = self.llm.create_chat_completion(messages=self.history)
        conversation_result = output['choices'][0]['message']
        # Append the conversation_result to the history
        self.history.append(conversation_result)
        return conversation_result['content']

**Exercise**:

**Single-turn conversation**:

Time to give your Conversation class a go! You'll use this class to create a travel recommendation chatbot, that takes a description or specification of what you're looking for, and the model returns a location.

The Conversation class you created in the previous exercise is still available with the following methods:

__init__(self, llm: Llama, system_prompt='', history=[])
create_completion(self, user_prompt='')

**Instructions**:

Instantiate an Conversation class with the instruction defined and the pre-loaded llm.
Send a prompt to the model to get a travel recommendation (feel free to add your own prompt here).

In [19]:
instruction = "You are a travel expert that recommends a travel destination based on a specification. Return the location name only in City, Country form."

# Define a chatbot using the Conversation class
chatbot = Conversation(llm ,system_prompt=instruction)

# Send a prompt to the model
result = chatbot.create_completion("I'd like to learn about the Aztecs.")
print(result)

response2 = conversation.create_completion("What About Dubai?")
print(f"Response 2: {response2}")

I recommend Mexico City, Mexico.
Response 2: Dubai is a fantastic destination for a short weekend break. Here are some popular and exciting options:

**Must-visit attractions:**

1. **Burj Khalifa**: The tallest building in the world, offering breathtaking views of the city.
2. **Dubai Mall**: A massive shopping and entertainment complex, featuring over 1,200 stores and attractions.
3. **Dubai Fountain**: A stunning fountain show that displays over 1,200 water jets and 5,000 lights.
4. **Palm Jumeirah**: A man-made island with a beautiful beach and stunning views of the city.
5. **Dubai Marina**: A picturesque waterfront area with stunning views of the city and the sea.

**Other attractions:**

1. **Dubai Museum**: A museum showcasing the city's history and culture.
2. **Jumeirah Beach**: A popular beach with stunning views of the city and the sea.
3. **Dubai Desert Park**: A park featuring a desert landscape, camel rides, and a zoo.
4. **Dubai Aquarium and Underwater Zoo**: A massive 

**Exercise**:

**Multi-turn conversations**:

Let's extend the travel chatbot to allow users to respond to the model's initial recommendation. You'll again use the Conversation class, but this time, you'll make repeated calls to the model to see how the model handles previous information.

As a reminder, here are the methods from the Conversation class:

__init__(self, llm: Llama, system_prompt='', history=[])
create_completion(self, user_prompt='')

**Instructions**:

Ask for an initial travel recommendation (feel free to modify this with your own ideas).
Provide a follow-up request to place additional constraints on where you would like to travel.

In [21]:
chatbot = Conversation(llm, system_prompt="You are a travel expert that recommends a travel destination based on a prompt. Return the location name only as 'City, Country'.")

# Ask for the initial travel recommendation
first_recommendation = chatbot.create_completion("Recommend a Spanish-speaking city.")
print(first_recommendation)

# Add an additional request to update the recommendation
second_recommendation = chatbot.create_completion("A different city in the same country")
print(second_recommendation)

Barcelona, Spain
Medellín, Colombia
