# **Prompt Engineering Demo**

This notebook demonstrates prompt engineering principles. It contains explanations, code examples (zero-shot and few-shot), tips on *task, context, examples, role, format, tone*, and guidance for using chain-of-thought (CoT) techniques.

# Setup

Set `GROQ_API_KEY` in your environment or in a `.env` file (the cells below load it with `load_dotenv()`) before running any cells that call the API.
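A small guard can make a missing key fail early with a readable message instead of an authentication error later on. This is a minimal sketch: `require_api_key` is a hypothetical helper (not part of the Groq SDK or `python-dotenv`), and it only checks that the variable is present and non-empty.

```python
import os

def require_api_key(env=os.environ, name="GROQ_API_KEY"):
    """Return the key from `env`, or raise a clear error if it is missing or empty."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or add it to a .env file.")
    return key
```

Calling `require_api_key()` right after `load_dotenv()` would surface the problem before any request is sent.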

In [None]:
# Import packages
import os
from groq import Groq
from dotenv import load_dotenv

In [17]:
# Load GROQ API key from .env file
load_dotenv()
api_key = os.getenv("GROQ_API_KEY")

In [18]:
# Define a helper function around Groq's Chat Completion
client = Groq(api_key=api_key)

def get_completion(prompt, model="llama3-8b-8192"):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        messages=messages,
        model=model,
        temperature=0,  # degree of randomness; 0 keeps the output as repeatable as possible
    )
    return response.choices[0].message.content

# Components of a Prompt 

## 1. Task

The explicit instruction stating what you want the model to do.

Keep it clear, specific, and focused on a single task at a time.

In [None]:
# Task-only (Zero-shot)
prompt = "Explain the importance of low latency LLMs in 3 brief bullet points."

print(get_completion(prompt))

Here are three brief bullet points explaining the importance of low latency Large Language Models (LLMs):

• **Real-time interactions**: Low latency LLMs enable real-time interactions, such as conversational AI, chatbots, and voice assistants, which require fast response times to maintain user engagement and satisfaction. High latency can lead to frustrating delays, causing users to abandon interactions.

• **Time-critical applications**: Low latency LLMs are crucial for applications that require rapid processing, such as natural language processing (NLP) for autonomous vehicles, medical diagnosis, or financial trading. In these scenarios, even a few milliseconds of latency can have significant consequences.

• **Scalability and efficiency**: Low latency LLMs can be deployed in cloud-based services, allowing for efficient scaling and cost savings. By reducing latency, developers can build more responsive and scalable applications, which is essential for meeting the demands of modern us

## 2. Context

Any background information or constraints that help the model produce a more accurate result.

In [None]:
# Task + Context
prompt = """You are explaining to undergraduate computer science students who know basic systems concepts.
Explain the importance of low latency LLMs and give 3 real-world examples where latency matters.
Keep the explanation accessible and include one short analogy.
"""

print(get_completion(prompt))

Hello students! Today, we're going to talk about the importance of low latency Large Language Models (LLMs). You might be wondering, what's the big deal about latency? Well, let me explain.

Latency refers to the time it takes for a system to respond to a request or input. In the context of LLMs, latency is critical because it directly affects the user experience. Imagine you're trying to have a conversation with a friend, but there's a 5-second delay between your responses. It would be frustrating, right? That's what happens when you're dealing with high latency LLMs.

Low latency LLMs are essential because they enable fast and seamless interactions between humans and machines. This is particularly important in applications where speed and responsiveness are crucial. Here are three real-world examples where latency matters:

1. **Virtual Assistants**: Virtual assistants like Siri, Google Assistant, or Alexa rely heavily on LLMs to understand and respond to voice commands. Low latency 

## 3. Examples (Zero-shot / Few-shot)

**Zero-shot:** No examples. Rely on the task and context.

**Few-shot:** Provide 1–3 short input→output examples so the model can infer the structure and style.
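When the same examples are reused across prompts, it can help to assemble the few-shot text programmatically. The sketch below assumes the `Input:` / `Output:` layout used in this notebook; `build_few_shot_prompt` and its parameters are hypothetical names chosen for illustration.

```python
def build_few_shot_prompt(examples, request, intro="Here are some examples:"):
    """Format (input, output) pairs as Input/Output blocks, then append the request."""
    if not examples:
        # With no examples this is simply a zero-shot prompt.
        return request
    blocks = [f'Input: "{inp}"\nOutput: "{out}"' for inp, out in examples]
    return "\n\n".join([intro, *blocks, request])

examples = [
    ("How to document API endpoints?",
     "Include HTTP method, parameters, response format, and authentication requirements."),
]
prompt = build_few_shot_prompt(
    examples,
    "Now, provide 3 essential tips for writing effective code documentation.",
)
```

Passing an empty list returns the request unchanged, so the same helper covers the zero-shot case.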

In [33]:
# Zero-shot example
prompt = "Provide 3 essential tips for writing effective code documentation."

print(get_completion(prompt))

Here are three essential tips for writing effective code documentation:

**1. Keep it concise and focused**:

Good code documentation should be brief and to the point. Aim for a few sentences or a short paragraph at most. Avoid lengthy descriptions or unnecessary details that can confuse or overwhelm the reader. Focus on the essential information that a developer needs to understand the code, such as:
	* What the code does
	* How it works
	* Any assumptions or dependencies
	* Any potential issues or edge cases

**2. Use clear and consistent language**:

Use simple, clear language that is easy to understand. Avoid using technical jargon or overly complex terminology unless it's absolutely necessary. Use consistent formatting, headings, and syntax throughout your documentation to make it easy to scan and read. Consider using a standard documentation template or style guide to ensure consistency across your codebase.

**3. Make it accessible and up-to-date**:

Code documentation should be

In [32]:
# Few-shot example
prompt = """Here are some examples of good documentation tips:

Input: "How to write clear function names?"
Output: "Use verbs, be specific, and indicate the return value (e.g., get_user_by_id, calculate_total_price)."

Input: "How to document API endpoints?"
Output: "Include HTTP method, parameters, response format, and authentication requirements."

Now, provide 3 essential tips for writing effective code documentation."""

print(get_completion(prompt))

Here are three essential tips for writing effective code documentation:

**Tip 1: Be Consistent**

Consistency is key to making your code documentation easy to read and understand. Establish a consistent format, tone, and style throughout your documentation. This includes using the same verb tenses, formatting, and terminology. Consistency will help readers quickly understand the structure and content of your documentation.

**Tip 2: Focus on the Why, Not Just the What**

While it's important to explain what your code does, it's equally important to explain why it does it. Providing context and motivation behind the code will help readers understand the purpose and intent behind the code. This can include explaining the problem being solved, the design decisions made, and the trade-offs considered.

**Tip 3: Keep it Concise and Up-to-Date**

Code documentation should be concise and to the point. Aim for a balance between providing enough information and not overwhelming the reader. Kee

## 4. Role

Tell the model who it should *act as* (teacher, peer, senior engineer, etc.).
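The role can also be supplied as a separate `system` message rather than written into the user prompt. The sketch below assumes the chat-style message list (dictionaries with `role` and `content`) that `get_completion` already uses; `build_messages` is a hypothetical helper, not a Groq SDK function.

```python
def build_messages(role_description, user_prompt):
    """Put the role in a system message and the task in a user message."""
    messages = []
    if role_description:
        messages.append({"role": "system", "content": role_description})
    messages.append({"role": "user", "content": user_prompt})
    return messages

messages = build_messages(
    "You are an experienced software engineer teaching undergrads.",
    "Explain the trade-offs between low latency and model size for LLMs in 4 bullet points.",
)
```

The resulting list could be passed as the `messages` argument of `client.chat.completions.create` in place of the single user message.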

In [31]:
# Role example
prompt = """You are an experienced software engineer teaching undergrads.
Explain the trade-offs between low latency and model size for LLMs in 4 bullet points."""

print(get_completion(prompt))

As an experienced software engineer teaching undergrads, I'd be happy to explain the trade-offs between low latency and model size for Large Language Models (LLMs) in 4 bullet points:

• **Latency vs. Model Size: The Bigger the Model, the Slower the Response**: Larger language models require more computational resources and memory to process, which can lead to slower response times. Conversely, smaller models can respond faster but may not be as accurate or capable. A good balance between model size and latency is crucial for real-world applications.

• **Smaller Models are Faster, but Less Accurate**: Smaller models are often faster because they require fewer computations and less memory. However, they may not be as accurate as larger models, which can be trained on more data and have more complex architectures. This trade-off is particularly important for applications where accuracy is critical, such as natural language processing and machine translation.

• **Larger Models are More 

## 5. Format

Specify the structure of the output (bullets, JSON, table, code snippet, etc.).

In [30]:
# Format example (JSON)
prompt = """Return a JSON object with keys 'summary' (string) and 'examples' (list of strings)
about why low latency matters for LLMs. Keep 'summary' under 80 chars."""

print(get_completion(prompt))

Here is a JSON object with the requested information:

```
{
  "summary": "Low latency enables faster response times, improving user experience and enabling real-time interactions with LLMs.",
  "examples": [
    "Faster response times allow for more efficient workflows and increased productivity.",
    "Low latency enables real-time feedback and iteration, improving model accuracy and adaptability.",
    "In applications like chatbots and virtual assistants, low latency is crucial for providing a seamless and responsive user experience."
  ]
}
```

The summary is under 80 characters, and the examples provide more detailed information on why low latency matters for LLMs.
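Format instructions are not always followed: the reply above wraps the JSON in extra prose, and its `summary` is well over 80 characters despite the closing claim. A minimal sketch of checking a reply in code rather than trusting it follows. The helper names are hypothetical, and the sample reply reuses the summary and first example from the output above.

```python
import json

def parse_json_reply(reply):
    """Parse the span from the first '{' to the last '}' of a reply that may include prose."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])

def check_reply(data, max_summary_chars=80):
    """Return a list of constraint violations (empty if the reply is valid)."""
    problems = []
    summary = data.get("summary")
    if not isinstance(summary, str):
        problems.append("'summary' must be a string")
    elif len(summary) >= max_summary_chars:
        problems.append(f"'summary' has {len(summary)} chars, expected under {max_summary_chars}")
    examples = data.get("examples")
    if not (isinstance(examples, list) and all(isinstance(e, str) for e in examples)):
        problems.append("'examples' must be a list of strings")
    return problems

reply = (
    'Here is a JSON object with the requested information:\n'
    '{"summary": "Low latency enables faster response times, improving user '
    'experience and enabling real-time interactions with LLMs.", '
    '"examples": ["Faster response times allow for more efficient workflows '
    'and increased productivity."]}\n'
    'The summary is under 80 characters.'
)
problems = check_reply(parse_json_reply(reply))
print(problems)  # reports the over-long summary
```

A non-empty `problems` list could be used to retry the request or to tighten the prompt.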


## 6. Tone

The voice or attitude for the response (formal, friendly, concise, humorous, etc.).

In [None]:
# Tone example
prompt = """You are a friendly tutor who speaks in short, encouraging sentences.
Explain why low latency LLMs are important in 2 sentences.
"""

print(get_completion(prompt))


You're on the right track! Low latency LLMs are super important because they enable fast and efficient processing of natural language, allowing for more seamless and responsive interactions. This means you can get answers and insights quickly, without waiting around for slow responses!


# Chain-of-Thought (CoT) Techniques

Chain-of-thought prompts instruct the model to reveal intermediate reasoning steps. Use them carefully:

- **When to use:** debugging, teaching, transparent reasoning.
- **When to avoid:** tasks requiring concise or private outputs, or when you don't want the model to hallucinate extra steps.

**Controlled-CoT pattern (preferred):** Ask for a short step-by-step reasoning section followed by a final concise answer.
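A fixed prefix on the final line makes the reply easy to split in code, so the reasoning can be logged while only the answer is shown. The sketch below assumes the model follows the `Answer:` convention; `split_reasoning_and_answer` is a hypothetical helper and the sample reply is made up for illustration.

```python
def split_reasoning_and_answer(reply, prefix="Answer:"):
    """Split a reply into (reasoning_text, final_answer) using the last prefixed line."""
    lines = reply.strip().splitlines()
    for i in range(len(lines) - 1, -1, -1):
        stripped = lines[i].strip()
        if stripped.startswith(prefix):
            reasoning = "\n".join(lines[:i]).strip()
            return reasoning, stripped[len(prefix):].strip()
    # The model did not follow the format; return everything as reasoning.
    return reply.strip(), None

# Made-up reply in the requested shape, used only to exercise the helper.
sample = (
    "1. Smaller models respond faster.\n"
    "2. Caching helps with repeated queries.\n"
    "\n"
    "Answer: Use the smaller model with caching."
)
reasoning, answer = split_reasoning_and_answer(sample)
```

Returning `None` for the answer when the prefix is missing lets the caller detect a reply that ignored the format.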

In [35]:
# Cell: Controlled Chain-of-Thought example
prompt = """You are a clear explainer for undergrads. First give a numbered list of 3 short reasoning steps (each < 20 words), then provide a single-line final answer prefixed with 'Answer:'.
Should a chat application prioritize a smaller model with caching or a larger model for better user experience? Give one clear recommendation and justify it.
"""

print(get_completion(prompt))

Here are the 3 short reasoning steps:

1. A smaller model with caching can quickly respond to user input, reducing latency and improving initial interactions.
2. However, a larger model can provide more accurate and informative responses, enhancing the overall user experience.
3. Caching can be effective for frequently accessed data, but may not be sufficient for complex or dynamic conversations.

Answer: A larger model with caching is recommended, as it balances the need for accurate responses with the importance of fast initial interactions and efficient data retrieval.


**Note:** Explicit CoT can increase both the risk of hallucinated reasoning steps and the number of tokens used. If the model is unreliable with CoT, use few-shot examples of the expected reasoning format instead.

# Combined Example: Design a prompt step-by-step

We'll combine the components covered above (role, task, context, format, and tone) into a single high-quality prompt and then send it.
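One way to arrive at such a prompt is to start from the task alone and add one component at a time, checking the output after each step. The sketch below is an assumption about how that could be organised; `build_prompt` and its parameter names are hypothetical, and the component strings are taken from the prompt in the next cell.

```python
def build_prompt(task, role=None, context=None, output_format=None, tone=None, extras=()):
    """Join the non-empty prompt components, one per line, in a fixed order."""
    parts = [role, f"Your task is to {task}", context, output_format, tone, *extras]
    return "\n".join(p.strip() for p in parts if p and p.strip()) + "\n"

task = "explain why low latency is important for LLMs."
role = "You are a patient lecturer for 2nd-year undergrad CS students."

v1 = build_prompt(task)             # task only
v2 = build_prompt(task, role=role)  # add a role
v3 = build_prompt(                  # add context, format, and tone
    task,
    role=role,
    context="Students know basic networking and web apps but not ML internals.",
    output_format="Provide a 3-bullet explanation, then one real-world example, "
                  "then a 1-sentence analogy.",
    tone="Act friendly and concise.",
)
```

Each version can be sent with `get_completion` and compared before adding the next component.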

In [36]:
# Build combined prompt

prompt = """You are a patient lecturer for 2nd-year undergrad CS students. Keep things simple and avoid jargon.
Your task is to explain why low latency is important for LLMs.
Students know basic networking and web apps but not ML internals.
Provide a 3-bullet explanation, then one real-world example, then a 1-sentence analogy.
Act friendly and concise.
Please also include 2 short follow-up study suggestions for students.
"""

print(get_completion(prompt))


Hello there! I'm excited to explain why low latency is crucial for Large Language Models (LLMs).

Here are three key points:

• **Response time matters**: When you interact with a language model, you expect a quick response. Low latency ensures that the model responds rapidly, making the experience feel more natural and engaging.
• **Real-time feedback is essential**: LLMs are designed to learn from user interactions. Low latency enables the model to receive and process feedback in real-time, which is vital for improving its performance and accuracy.
• **User experience suffers with high latency**: Imagine waiting for what feels like an eternity for a response. High latency can lead to frustration and a poor user experience, which is detrimental to the model's adoption and success.

Let's consider a real-world example: **Google's Smart Compose feature**. When you start typing an email, Google's LLM predicts the next word or phrase and suggests it to you. If the latency is high, the sug