# Prompt Engineering

This document demonstrates how to load and use a pre-trained conversational language model (`Phi-3-mini-4k-instruct`) with Hugging Face Transformers in Python. We'll walk through model loading, prompt construction, and parameter tuning for text generation.

## Loading the Model

We begin by importing necessary libraries and loading the causal language model and its tokenizer. The pipeline is set up for easy inference using GPU acceleration.


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the pre-trained causal language model (Phi-3) onto GPU
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",           # Place the model on the GPU
    torch_dtype="auto",          # Use the dtype the checkpoint was saved with
    trust_remote_code=False,     # Do not run custom code from the model repository
)

# Load the corresponding tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Create a text-generation pipeline for easy inference
pipe = pipeline(
    "text-generation",           # Specify task type
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,      # Only return generated text, not prompt
    max_new_tokens=500,          # Limit output length
    do_sample=False,             # Use deterministic decoding (no sampling)
)

## Creating and Sending a Prompt

Here's how you can define a user prompt and generate a response using the loaded pipeline.

In [None]:
# Prompt
messages = [
    {"role": "user", "content": "Create a funny joke about tigers."}
]

# Generate the output
output = pipe(messages)
print(output[0]["generated_text"])

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


 Why don't tigers ever play hide and seek? Because good luck hiding when they're that good at it!


## Viewing the Model's Prompt Template

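To see how the prompt is formatted before it reaches the model, apply the tokenizer's chat template. With `tokenize=False`, the result is the formatted string, including the special tokens that mark each turn, rather than token IDs.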

In [None]:
# Apply prompt template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)

<|user|>
Create a funny joke about tigers.<|end|>
<|endoftext|>
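
The layout printed above can be reproduced by hand, which makes it easier to see what the template adds. The following is a minimal sketch, assuming the template simply wraps each message as `<|role|>`, the content, and `<|end|>`; `render_phi3_style` is a hypothetical helper, not part of Transformers, and the tokenizer's own template remains the source of truth. The `add_generation_prompt` flag mirrors the `apply_chat_template` argument of the same name. The sketch assumes that setting it ends the prompt with an `<|assistant|>` header instead of the end-of-text token.

```python
def render_phi3_style(messages, add_generation_prompt=False):
    """Hypothetical re-implementation of the layout printed above (illustration only)."""
    rendered = ""
    for message in messages:
        rendered += f"<|{message['role']}|>\n{message['content']}<|end|>\n"
    # Either cue the assistant's turn or close the text (assumed behavior).
    rendered += "<|assistant|>\n" if add_generation_prompt else "<|endoftext|>"
    return rendered

demo_messages = [{"role": "user", "content": "Create a funny joke about tigers."}]
print(render_phi3_style(demo_messages))
print(render_phi3_style(demo_messages, add_generation_prompt=True))
```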


## Generating Text with Custom Sampling Parameters

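Setting `do_sample=True` switches the pipeline from greedy decoding to sampling. Higher `temperature` and `top_p` values make sampling more exploratory, which can lead to more creative or diverse outputs; lower values keep the output closer to the most likely tokens.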

In [None]:
# Sample with temperature=1 (the model's distribution is used as-is)
output = pipe(messages, do_sample=True, temperature=1)
print(output[0]["generated_text"])

 Why did the tiger sit on the computer? Because he wanted to keep his cool and surf the web!


In [None]:
# Sample with top_p=1 (no tokens are excluded by nucleus filtering)
output = pipe(messages, do_sample=True, top_p=1)
print(output[0]["generated_text"])

 Why don't tigers ever start a fight club? Because they always end up with a stripey mess!
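
To make the effect of these two parameters concrete, here is a small, model-free sketch of temperature scaling and nucleus (top-p) filtering. It follows the textbook definitions rather than the library's internal implementation; the helper names and the toy logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
print(top_p_filter(softmax_with_temperature(toy_logits), top_p=0.9))
```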


# **Intro to Prompt Engineering**

This section introduces the fundamental concepts behind prompt engineering, which is the art and science of crafting instructions to guide large language models (LLMs) in producing desired outputs.

## The Basic Ingredients of a Prompt

Effective prompts often combine a persona, explicit instructions, context, formatting hints, constraints, audience details, and tone guidelines to steer the LLM's responses.

# **Advanced Prompt Engineering**

Here, we progress to more sophisticated prompt construction, which often involves complex tasks, customized formatting, and targeted outcomes.

## Complex Prompt

Let's walk through an advanced example that demonstrates how to distill procedural and benefit-driven information about a technical topic—Retrieval-Augmented Generation (RAG)—for executive decision-makers and technical leads.


In [None]:
# Text to summarize (source: aws.amazon.com/what-is/retrieval-augmented-generation/ and others)
text = """
Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model (LLM), so it references an authoritative knowledge base outside of its training data before generating a response. RAG augments LLMs with up-to-date, external information without retraining the model, improving accuracy, relevance, and trust. Typical LLMs only use static training data that can become outdated and may hallucinate or provide generic, non-authoritative answers.

The RAG process:
- Transforms the user’s query into vector form.
- Matches the query against external data sources (e.g., knowledge bases, APIs, updated files) using semantic or vector search.
- Retrieves and supplies the most relevant content to the LLM as extra context.
- The LLM then generates an answer grounded in current, authoritative information.

Benefits of RAG:
- Cost-effective: No need to retrain models for updated or organizational data.
- Provides current information by connecting to live sources.
- Enhances user trust with source attributions.
- Gives developers more control; allows restricting data sources and troubleshooting.

RAG is increasingly used in enterprise AI, chatbots, search, customer support, and any scenario where accurate, up-to-date, and context-rich answers are critical.
"""

# Prompt components tailored for this task
persona = "You are an expert in natural language processing and AI systems.\n"
instruction = "Summarize the key principles, workflows, and benefits of Retrieval-Augmented Generation as applied to Large Language Models.\n"
context = "The summary should clarify where RAG fits in the AI landscape, what problems it solves, and why it is important for modern applications.\n"
data_format = "Return bullet points on the RAG workflow and benefits, followed by a concise summary paragraph suitable for an executive or technical decision maker.\n"
audience = "This summary is intended for technical leads, AI developers, and decision-makers evaluating modern AI strategies.\n"
tone = "The tone should be precise, authoritative, and accessible to both technical and non-technical stakeholders.\n"
data = f"Text to summarize: {text}"

# Assemble the advanced prompt
query = persona + instruction + context + data_format + audience + tone + data

In the prompt above, each variable defines a different component of the instruction, ensuring the model understands the **role**, **objective**, **context**, **formatting requirements**, **audience**, and **tone**, alongside the actual data to summarize.
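
Keeping the components separate also makes it easy to test what each one contributes. The sketch below is one possible way to do that, assuming the components are held in a dictionary; `assemble_prompt` and the shortened component texts are hypothetical and only illustrate the idea of dropping one piece at a time.

```python
def assemble_prompt(components, skip=()):
    """Join named prompt components in order, leaving out any listed in `skip`."""
    return "".join(text for name, text in components.items() if name not in skip)

# Shortened, illustrative component texts (not the full ones used above)
demo_components = {
    "persona": "You are an expert in natural language processing and AI systems.\n",
    "instruction": "Summarize the key benefits of Retrieval-Augmented Generation.\n",
    "tone": "The tone should be precise and accessible.\n",
    "data": "Text to summarize: <source text goes here>",
}

full_prompt = assemble_prompt(demo_components)
no_persona = assemble_prompt(demo_components, skip={"persona"})
print(no_persona)
```

Comparing the model's output for `full_prompt` and `no_persona` gives a rough sense of how much the persona line matters for a given task.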

Next, wrap the assembled prompt in a user message and apply the tokenizer's chat template to inspect how the query is formatted for the model.

In [None]:
messages = [
    {"role": "user", "content": query}
]
print(tokenizer.apply_chat_template(messages, tokenize=False))

<|user|>
You are an expert in natural language processing and AI systems.
Summarize the key principles, workflows, and benefits of Retrieval-Augmented Generation as applied to Large Language Models.
The summary should clarify where RAG fits in the AI landscape, what problems it solves, and why it is important for modern applications.
Return bullet points on the RAG workflow and benefits, followed by a concise summary paragraph suitable for an executive or technical decision maker.
This summary is intended for technical leads, AI developers, and decision-makers evaluating modern AI strategies.
The tone should be precise, authoritative, and accessible to both technical and non-technical stakeholders.
Text to summarize: 
Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model (LLM), so it references an authoritative knowledge base outside of its training data before generating a response. RAG augments LLMs with up-to-date, external informatio

This step prints the full, formatted prompt that gets sent to the LLM. Reviewing this output can help you verify that all key instructions are present and properly formatted.

Finally, pass the messages to the text-generation pipeline to generate the summary.

In [None]:
# Generate the output
outputs = pipe(messages)
print(outputs[0]["generated_text"])

 - RAG workflow:
  - Vectorize user query.
  - Search external data sources for relevant information.
  - Supply retrieved content to LLM.
  - LLM generates an informed response.

- RAG benefits:
  - Cost savings by avoiding retraining.
  - Access to current, live data.
  - Increased user trust with source attributions.
  - Greater control for developers.

RAG represents a significant advancement in the AI landscape, particularly for Large Language Models (LLMs). It addresses the critical issue of outdated or non-authoritative information that can arise from static training data. By integrating a retrieval component, RAG ensures that LLMs can access and utilize the most current and relevant information available, enhancing the accuracy and relevance of their outputs. This is especially important in applications where trust and precision are paramount, such as enterprise AI, customer support, and search functionalities. The RAG approach not only improves the quality of AI-generated cont

## In-Context Learning: Providing Examples

### Zero-Shot Output Verification

In [None]:
prompt = "Generate a meeting summary in bullet points in JSON format. Meeting notes: The team discussed project goals and next steps."

# Prompt
input_prompt = [{"role": "user", "content": prompt}]

outputs = pipe(input_prompt)
print(outputs[0]["generated_text"])

 ```json
{
  "MeetingSummary": {
    "Date": "2023-04-10",
    "Attendees": ["Alice", "Bob", "Charlie", "Diana"],
    "TopicsDiscussed": [
      {
        "Subject": "Project Goals",
        "Details": "Reviewed the main objectives for the upcoming quarter and aligned on the expected outcomes."
      },
      {
        "Subject": "Next Steps",
        "Details": "Identified key milestones and assigned tasks to team members to ensure timely progress."
      }
    ],
    "DecisionsMade": [
      {
        "Action": "Set a follow-up meeting for next month to review progress.",
        "Responsible": "Alice"
      }
    ],
    "ActionItems": [
      {
        "Task": "Draft a detailed project plan.",
        "AssignedTo": "Bob",
        "DueDate": "2023-04-17"
      },
      {
        "Task": "Compile a list of resources needed.",
        "AssignedTo": "Charlie",
        "DueDate": "2023-04-15"
      }
    ]
  }
}
```


In [None]:
print(tokenizer.apply_chat_template(input_prompt, tokenize=False))

<|user|>
Generate a meeting summary in bullet points in JSON format. Meeting notes: The team discussed project goals and next steps.<|end|>
<|endoftext|>
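
Because a zero-shot prompt leaves the schema up to the model, it helps to check the reply programmatically before using it. The following is a minimal sketch under stated assumptions: `extract_json` and `has_keys` are hypothetical helpers, the required keys are an example schema, and `sample_reply` is a hand-written stand-in for a model reply.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built indirectly so this example can sit inside a fenced block

def extract_json(generated_text):
    """Parse a JSON object from a model reply, whether or not it is wrapped in a code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    fenced = re.search(pattern, generated_text, re.DOTALL)
    candidate = fenced.group(1) if fenced else generated_text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None  # the reply was not valid JSON

def has_keys(parsed, required_keys):
    """Check that the parsed reply is an object containing every required key."""
    return isinstance(parsed, dict) and set(required_keys) <= set(parsed)

# Hand-written stand-in for a model reply
sample_reply = f' {FENCE}json\n{{"summary": "Goals agreed", "action_items": ["Draft plan"]}}\n{FENCE}'
parsed = extract_json(sample_reply)
print(has_keys(parsed, ["summary", "action_items"]))
```

A reply shaped like the zero-shot output above, with a single top-level `MeetingSummary` key, would fail this check, which is the kind of mismatch that one-shot and few-shot examples aim to prevent.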


### One-Shot for Structure Specificity

In [None]:
prompt = (
    "Produce a JSON meeting summary exactly like this example:\n"
    '{ "summary": "Brief high-level summary", "action_items": ["item 1", "item 2"] }\n'
    "Meeting notes: The team established quarterly goals."
)

# Prompt
input_prompt = [{"role": "user", "content": prompt}]

outputs = pipe(input_prompt)
print(outputs[0]["generated_text"])



In [None]:
print(tokenizer.apply_chat_template(input_prompt, tokenize=False))

<|user|>
Produce a JSON meeting summary exactly like this example:
{ "summary": "Brief high-level summary", "action_items": ["item 1", "item 2"] }
Meeting notes: The team established quarterly goals.<|end|>
<|endoftext|>


### Few-Shot for Handling Complexity or Ambiguity

In [None]:
prompt = (
    "Here are example outputs for different meeting types:\n"
    "Meeting notes: Budget discussion.\n"
    '{ "summary": "Reviewed financials", "action_items": ["Audit expenses"] }\n'
    "Meeting notes: Sprint planning.\n"
    '{ "summary": "Outlined sprint stories", "action_items": ["Finalize backlog"] }\n'
    "Meeting notes: Annual strategy review.\n"
)

# Prompt
input_prompt = [{"role": "user", "content": prompt}]

outputs = pipe(input_prompt)
print(outputs[0]["generated_text"])


In [None]:
print(tokenizer.apply_chat_template(input_prompt, tokenize=False))

<|user|>
Here are example outputs for different meeting types:
Meeting notes: Budget discussion.
{ "summary": "Reviewed financials", "action_items": ["Audit expenses"] }
Meeting notes: Sprint planning.
{ "summary": "Outlined sprint stories", "action_items": ["Finalize backlog"] }
Meeting notes: Annual strategy review.
<|end|>
<|endoftext|>
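
An alternative to packing every example into one string is to present each example as a user turn followed by an assistant turn, so the chat template marks where each answer begins and ends. The sketch below shows one way to build such a message list; `build_few_shot_messages` is a hypothetical helper, and whether this layout works better than the single-string version for a given model is something to verify empirically.

```python
def build_few_shot_messages(examples, new_input, prefix="Meeting notes: "):
    """Turn (input, output) example pairs into alternating user/assistant chat turns."""
    messages = []
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": prefix + example_input})
        messages.append({"role": "assistant", "content": example_output})
    # The final user turn is the case the model should actually answer.
    messages.append({"role": "user", "content": prefix + new_input})
    return messages

few_shot_examples = [
    ("Budget discussion.", '{ "summary": "Reviewed financials", "action_items": ["Audit expenses"] }'),
    ("Sprint planning.", '{ "summary": "Outlined sprint stories", "action_items": ["Finalize backlog"] }'),
]
few_shot_messages = build_few_shot_messages(few_shot_examples, "Annual strategy review.")
for turn in few_shot_messages:
    print(turn["role"], "->", turn["content"])
```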


### Principle—Specificity in Prompts

In [None]:
# Not good (expected generic, non-specific output)
bad_prompt = "Write a description for a meeting tool."
print("Bad prompt response:")
outputs = pipe([{"role": "user", "content": bad_prompt}])
print(outputs[0]["generated_text"])

# Good (concise, specific requirements)
good_prompt = (
    "Write a formal, one-sentence description for a meeting scheduling tool, "
    "highlighting its privacy and time-zone features."
)
print("\nGood prompt response:")
outputs = pipe([{"role": "user", "content": good_prompt}])
print(outputs[0]["generated_text"])


Bad prompt response:


 MeetUp is a powerful and user-friendly meeting tool designed to help teams collaborate and communicate more effectively. With its intuitive interface and a range of features, MeetUp makes it easy for team members to schedule, join, and participate in virtual meetings from anywhere in the world.

Key features of MeetUp include:

1. Easy scheduling: MeetUp allows team members to create and schedule meetings with just a few clicks. You can choose from a variety of meeting types, such as video conferencing, audio calls, or text-based chats, and set up recurring meetings to ensure regular communication.

2. Seamless joining: Once a meeting is scheduled, team members can easily join from their preferred device, whether it's a computer, tablet, or smartphone. MeetUp supports multiple video conferencing platforms, including Zoom, Skype, and Google Meet, so you can choose the one that works best for your team.

3. Real-time collaboration: MeetUp provides a range of tools to help teams collabor

### Principle—Preventing Hallucinations

In [None]:
# Primacy effect: context up front
primacy_prompt = (
    "You are a business analyst. Summarize key risks identified in the provided company report.\n"
    "REPORT TEXT: [Insert report text here]"
)
print("Primacy effect:\n")
outputs = pipe([{"role": "user", "content": primacy_prompt}])
print(outputs[0]["generated_text"])

# Recency effect: constraints at the end
recency_prompt = (
    "You are a business analyst. Summarize key risks identified in the provided company report. "
    "REPORT TEXT: [Insert report text here]\n"
    "If analysis is uncertain, reply with 'I don't know.' Results must be in YAML format."
)
print("\nRecency effect:\n")
outputs = pipe([{"role": "user", "content": recency_prompt}])
print(outputs[0]["generated_text"])



# BAD EXAMPLE: Bury requirement mid-prompt (AVOID)
bad_order_prompt = (
    "Summarize key risks, making sure to consider data from all sections. Results must be in YAML format. "
    "If possible use evidence from the attached tables. The report covers financial, strategic, and compliance areas."
)
print("\nBad prompt order (AVOID):\n")
outputs = pipe([{"role": "user", "content": bad_order_prompt}])
print(outputs[0]["generated_text"])

Primacy effect:



 Key risks identified in the company report include:

1. Market volatility: The report highlights the potential impact of fluctuating market conditions on the company's revenue and profitability.
2. Regulatory changes: The company may face challenges due to changes in government regulations, which could affect its operations and compliance requirements.
3. Supply chain disruptions: The report points out the risk of disruptions in the supply chain, which could lead to delays in production and increased costs.
4. Cybersecurity threats: The company is vulnerable to cybersecurity attacks, which could result in data breaches and financial losses.
5. Competitive pressures: The report identifies the risk of increased competition in the market, which could lead to a loss of market share and reduced profitability.

Recency effect:



 ```yaml
risks:
  - name: Market Volatility
    description: The report indicates a high level of market volatility which could impact the company's revenue streams.
  - name: Supply Chain Disruptions
    description: There is a risk of supply chain disruptions due to geopolitical tensions in key supplier regions.
  - name: Regulatory Changes
    description: Potential changes in regulations could impose additional compliance costs and affect market access.
  - name: Technological Obsolescence
    description: The rapid pace of technological advancement poses a risk of the company's products becoming obsolete.
  - name: Competitive Pressure
    description: Increasing competition from both established players and new entrants could erode market share.
```

Bad prompt order (AVOID):

 ```yaml
FinancialRisks:
  - RiskID: "FR-001"
    Description: "Market volatility leading to potential revenue fluctuations."
    Evidence: "Table 1: Historical market trends showing increased v

## Chain Prompting: Breaking up the Problem


In [None]:
# Create name and slogan for a product
product_prompt = [
    {"role": "user", "content": "Create a name and slogan for a chatbot that leverages LLMs."}
]
outputs = pipe(product_prompt)
product_description = outputs[0]["generated_text"]
print(product_description)

 Name: ChatSage
Slogan: "Your AI Companion for Smart Conversations"


In [None]:
# Based on a name and slogan for a product, generate a sales pitch
sales_prompt = [
    {"role": "user", "content": f"Generate a very short sales pitch for the following product: '{product_description}'"}
]
outputs = pipe(sales_prompt)
sales_pitch = outputs[0]["generated_text"]
print(sales_pitch)

 Introducing ChatSage, your AI Companion for Smart Conversations! With ChatSage, you'll have a personalized and intelligent assistant at your fingertips, ready to engage in meaningful dialogue, provide helpful information, and enhance your daily interactions. Experience the future of communication with ChatSage – your smart and reliable conversation partner.
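
The two cells above follow a pattern that generalizes to any number of steps: each prompt is a template that receives the previous step's output. Here is a minimal sketch of that pattern; `run_chain` is a hypothetical helper, the `{previous}` placeholder is a convention chosen for this example, and `fake_generate` stands in for a model call so the code runs without a GPU.

```python
def run_chain(step_templates, generate, initial_input):
    """Feed each step's output into the next step's template via the {previous} placeholder."""
    previous = initial_input
    outputs = []
    for template in step_templates:
        prompt = template.format(previous=previous)
        previous = generate(prompt)
        outputs.append(previous)
    return outputs

def fake_generate(prompt):
    """Stand-in for a model call so the sketch runs without a GPU."""
    return f"<reply to: {prompt}>"

steps = [
    "Create a name and slogan for {previous}.",
    "Generate a very short sales pitch for the following product: '{previous}'",
]
results = run_chain(steps, fake_generate, "a chatbot that leverages LLMs")
print(results[-1])
```

To run the chain against the model loaded earlier, `generate` could be a small function that wraps the prompt in a user message, calls `pipe`, and returns `outputs[0]["generated_text"]`.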


# **Reasoning with Generative Models**

This section explores how large language models can be prompted to explain their reasoning process step-by-step, leading to higher accuracy, transparency, and reliability in their responses.

## Chain-of-Thought: Think Before Answering

Chain-of-thought (CoT) prompting encourages the model to break down a problem and articulate intermediate steps, producing more robust solutions to multi-step reasoning tasks.

In [None]:
# Chain-of-Thought Prompting with Missouri Sports Context
cot_prompt = [
    {
        "role": "user",
        "content": "The Missouri Tigers football team threw 12 touchdown passes in their first game against Arkansas. In their second game against Kentucky, they threw 7 more touchdown passes. How many touchdown passes have they thrown in total across both games?"
    },
    {
        "role": "assistant",
        "content": "Let me work through this step-by-step:\n\nStep 1: Identify the touchdown passes from the first game\n- Against Arkansas: 12 touchdown passes\n\nStep 2: Identify the touchdown passes from the second game\n- Against Kentucky: 7 touchdown passes\n\nStep 3: Add them together\n- Total touchdown passes = 12 + 7 = 19\n\nTherefore, the Missouri Tigers have thrown 19 touchdown passes total across both games."
    },
    {
        "role": "user",
        "content": "At the Columbia Farmers Market, there were 35 baskets of Missouri-grown strawberries on Wednesday morning. The vendors sold 15 baskets to customers and received a delivery of 10 more baskets on Thursday. How many strawberry baskets are available now?"
    }
]

# Generate the response (greedy decoding, since the pipeline was created with do_sample=False)
outputs = pipe(cot_prompt, max_new_tokens=200)
print(outputs[0]["generated_text"])


 Let me work through this step-by-step:

Step 1: Identify the initial number of baskets
- Wednesday morning: 35 baskets

Step 2: Subtract the baskets sold
- Baskets sold: 15
- Remaining baskets: 35 - 15 = 20

Step 3: Add the new delivery
- New delivery: 10 baskets
- Total baskets now: 20 + 10 = 30

Therefore, there are now 30 baskets of Missouri-grown strawberries available at the Columbia Farmers Market.


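In this example, the first user and assistant turns act as a worked demonstration: the assistant's step-by-step answer shows the model the reasoning format to follow when it answers the new question.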

### Zero-Shot Chain-of-Thought

Zero-shot CoT combines a standard question with an explicit "Let's think step-by-step" instruction. This trigger helps the model organize its reasoning, even without prior examples.

In [None]:
# Zero-Shot Chain-of-Thought with Local Business Context
def zero_shot_cot_reasoning(prompt_text, model_pipeline):
    """
    Implement zero-shot chain-of-thought reasoning
    Args:
        prompt_text: The problem to solve
        model_pipeline: Configured language model pipeline
    Returns:
        Generated reasoning and solution
    """

    # Add the magic trigger phrase
    enhanced_prompt = [
        {
            "role": "user",
            "content": f"{prompt_text} Let's think step-by-step."
        }
    ]

    # Configure for consistent reasoning
    generation_config = {
        "max_new_tokens": 300,
        "temperature": 0.1,      # Low temperature for consistent reasoning
        "do_sample": True,
        "pad_token_id": model_pipeline.tokenizer.eos_token_id
    }

    # Generate reasoning
    outputs = model_pipeline(enhanced_prompt, **generation_config)
    return outputs[0]["generated_text"]

# Test with Tiger Café problem
tiger_cafe_problem = """The Tiger Café at the University of Missouri served 48 cups of coffee during the morning rush (7-11 AM). During the afternoon period (1-5 PM), they served 37 additional cups. However, 15 customers returned their cups because the coffee was too cold and received replacements. How many cups of coffee did the Tiger Café actually serve to satisfied customers?"""

result = zero_shot_cot_reasoning(tiger_cafe_problem, pipe)
print(result)


 Step 1: Calculate the total number of cups served during the morning rush.
The Tiger Café served 48 cups of coffee during the morning rush.

Step 2: Calculate the total number of cups served during the afternoon period.
The Tiger Café served 37 additional cups during the afternoon period.

Step 3: Calculate the total number of cups served before considering replacements.
Total cups served = Morning rush cups + Afternoon cups
Total cups served = 48 cups + 37 cups
Total cups served = 85 cups

Step 4: Calculate the number of cups that were returned and replaced.
15 customers returned their cups because the coffee was too cold.

Step 5: Calculate the total number of cups served to satisfied customers.
Total cups served to satisfied customers = Total cups served - Replaced cups
Total cups served to satisfied customers = 85 cups - 15 cups
Total cups served to satisfied customers = 70 cups

The Tiger Café actually served 70 cups of coffee to satisfied customers.


### Self-Consistency Method

Self-consistency sampling generates multiple independent reasoning paths and uses a voting system to select the most likely answer, improving confidence and robustness for complex tasks.


In [None]:
import re
from collections import Counter

def self_consistency_prompting(prompt, model_pipe, num_samples=5):
    """
    Generate multiple reasoning paths and use majority voting.
    Returns (final_answer, responses, vote_counts); final_answer is None
    if no numerical answer could be extracted.
    """
    # Configure for stochastic sampling (NOT greedy decoding)
    sampling_config = {
        "do_sample": True,        # Enable sampling
        "temperature": 0.7,       # Add randomness for diversity
        "top_p": 0.9,             # Nucleus sampling
        "max_new_tokens": 500,
        "return_full_text": False,
    }

    # Matches a number that follows "answer is", "=", or "total ... is"
    answer_pattern = r'(?:answer is|=|total.*?is)\s*(\d+)'

    responses = []
    extracted_answers = []

    # Generate multiple diverse reasoning paths
    for _ in range(num_samples):
        # Pass the sampling settings with the call; attributes set on the
        # pipeline object are not used as generation settings
        output = model_pipe(prompt, **sampling_config)
        response = output[0]["generated_text"]
        responses.append(response)

        # Take the last match, since the final result comes at the end of the reasoning
        matches = re.findall(answer_pattern, response, re.IGNORECASE)
        if matches:
            extracted_answers.append(int(matches[-1]))

    # Majority voting mechanism
    vote_counts = Counter(extracted_answers)
    final_answer = vote_counts.most_common(1)[0][0] if vote_counts else None

    return final_answer, responses, vote_counts


Applying self-consistency with a Missouri context:


In [None]:
# Self-Consistency with Missouri context
self_consistency_prompt = [
    {
        "role": "user",
        "content": "The Tiger Café at the University of Missouri served 48 cups of coffee in the morning. In the afternoon, they served 37 more cups. If 15 cups were returned because they were too cold, how many cups did they serve in total? Let's think step by step."
    }
]

# Execute self-consistency with 5 samples
final_answer, responses, vote_counts = self_consistency_prompting(
    self_consistency_prompt, pipe, num_samples=5
)




In [None]:
final_answer, responses, vote_counts

(48,
 [' Step 1: Determine the total number of cups served in the morning and afternoon before any were returned.\n\nMorning cups served: 48\nAfternoon cups served: 37\n\nTotal cups served before returns = Morning cups + Afternoon cups\nTotal cups served before returns = 48 + 37\nTotal cups served before returns = 85\n\nStep 2: Subtract the number of cups that were returned due to being too cold.\n\nCups returned: 15\n\nTotal cups served after returns = Total cups served before returns - Cups returned\nTotal cups served after returns = 85 - 15\nTotal cups served after returns = 70\n\nSo, the Tiger Café at the University of Missouri served a total of 70 cups of coffee after accounting for the returns.',
  ' Step 1: Determine the total number of cups served in the morning and afternoon before any were returned.\n\nMorning cups served: 48\nAfternoon cups served: 37\n\nTotal cups served before returns = Morning cups + Afternoon cups\nTotal cups served before returns = 48 + 37\nTotal cups s
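
The extraction and voting steps can be checked without a model by running them on canned text. The sketch below is an illustration under stated assumptions: both helper names are hypothetical, and `canned_responses` are hand-written stand-ins for sampled outputs. For reasoning of the shape shown above, the first number after an equals sign is 48, the value reported in the tuple, while the last one is the final result, so this sketch takes the last match.

```python
import re
from collections import Counter

def extract_final_number(response):
    """Return the last integer that follows an equals sign, or None if there is none."""
    matches = re.findall(r"=\s*(\d+)", response)
    return int(matches[-1]) if matches else None

def majority_vote(responses):
    """Pick the most frequent extracted answer and report its share of the votes."""
    answers = [a for a in map(extract_final_number, responses) if a is not None]
    if not answers:
        return None, 0.0
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

canned_responses = [  # hand-written stand-ins for sampled model outputs
    "Total = 48 + 37\nTotal = 85\nAfter returns = 85 - 15\nAfter returns = 70",
    "Served = 48 + 37 = 85\nNet = 85 - 15 = 70",
    "Total = 48 + 37 = 85",
]
print(majority_vote(canned_responses))
```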

## Zero-shot Chain-of-Thought


In [None]:
# Zero-shot Chain-of-Thought with Columbia, MO context
zeroshot_cot_prompt = [
    {
        "role": "user",
        "content": "The Tiger Café at the University of Missouri served 48 cups of coffee in the morning. In the afternoon, they served 37 more cups. If 15 cups were returned because they were too cold, how many cups did they serve in total? Let's think step by step."
    }
]


# Generate the output
outputs = pipe(zeroshot_cot_prompt)
print(outputs[0]["generated_text"])

 Step 1: Determine the total number of cups served in the morning and afternoon before any were returned.

Morning cups served: 48
Afternoon cups served: 37

Total cups served before returns = Morning cups + Afternoon cups
Total cups served before returns = 48 + 37
Total cups served before returns = 85

Step 2: Subtract the number of cups that were returned due to being too cold.

Cups returned: 15

Total cups served after returns = Total cups served before returns - Cups returned
Total cups served after returns = 85 - 15
Total cups served after returns = 70

So, the Tiger Café at the University of Missouri served a total of 70 cups of coffee after accounting for the returns.
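
Because the chain-of-thought response ends with a sentence stating the result, the final number can be extracted and compared with the arithmetic done directly. The following is a minimal sketch, assuming the answer is the last integer in the response; `extract_final_number` is a hypothetical helper introduced here, not part of Transformers.

```python
import re
from typing import Optional

def extract_final_number(text: str) -> Optional[int]:
    """Return the last integer that appears in the text, or None if there is none."""
    numbers = re.findall(r"-?\d+", text)
    return int(numbers[-1]) if numbers else None

# Tail of the response printed above
response = (
    "Total cups served after returns = 85 - 15\n"
    "Total cups served after returns = 70\n\n"
    "So, the Tiger Café at the University of Missouri served a total of "
    "70 cups of coffee after accounting for the returns."
)

expected = 48 + 37 - 15  # morning + afternoon - returned
answer = extract_final_number(response)
print(answer, answer == expected)  # 70 True
```

Taking the last integer is a heuristic: it works when the model closes with its answer, as it does here, but it would pick the wrong number if the response ended with extra commentary containing digits.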


## Tree-of-Thought: Exploring Intermediate Steps


In [None]:
zeroshot_tot_prompt = [
    {
        "role": "user",
        "content": (
            "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. "
            "Then all experts will go on to the next step, etc. If any expert realises they're wrong at any point then they leave. "
            "The question is 'The MU Student Center had 40 leftover pizzas. They gave away 18 pizzas and bought 12 more. How many pizzas do they have now?' "
            "Make sure to discuss the results."
        )
    }
]


In [None]:
# Generate the output
outputs = pipe(zeroshot_tot_prompt)
print(outputs[0]["generated_text"])


 Expert 1:
Step 1: Start with the initial number of pizzas (40).

Expert 2:
Step 1: Start with the initial number of pizzas (40).
Step 2: Subtract the number of pizzas given away (18).

Expert 3:
Step 1: Start with the initial number of pizzas (40).
Step 2: Subtract the number of pizzas given away (18).
Step 3: Add the number of pizzas bought (12).

Expert 1:
Step 1: Start with the initial number of pizzas (40).
Step 2: Subtract the number of pizzas given away (18).
Step 3: Add the number of pizzas bought (12).

Expert 2:
Step 1: Start with the initial number of pizzas (40).
Step 2: Subtract the number of pizzas given away (18).
Step 3: Add the number of pizzas bought (12).

Expert 3:
Step 1: Start with the initial number of pizzas (40).
Step 2: Subtract the number of pizzas given away (18).
Step 3: Add the number of pizzas bought (12).

Final result:
Expert 1: 34 pizzas
Expert 2: 34 pizzas
Expert 3: 34 pizzas

All experts agree that the MU Student Center now has 34 pizzas.
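
The experts' final lines can also be checked programmatically rather than by eye. The sketch below tallies the numeric answer on each `Expert N: <number>` line and compares the most common value with the direct calculation. It assumes the model keeps that line format; `tally_expert_answers` is a hypothetical helper introduced for illustration.

```python
import re
from collections import Counter

def tally_expert_answers(text: str) -> Counter:
    """Count the numeric answer reported on each 'Expert N: <number> ...' line."""
    answers = re.findall(r"^Expert \d+:[ \t]*(\d+)", text, flags=re.MULTILINE)
    return Counter(int(a) for a in answers)

# Final block of the response printed above
response = """Final result:
Expert 1: 34 pizzas
Expert 2: 34 pizzas
Expert 3: 34 pizzas
"""

votes = tally_expert_answers(response)
consensus, count = votes.most_common(1)[0]
print(consensus, count)            # 34 3
print(consensus == 40 - 18 + 12)   # True
```

Lines such as `Expert 1:` followed by a step list on the next line are ignored, because the pattern requires a number on the same line as the expert label.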


# Output Verification

Verifying output matters most when downstream code expects structured results, such as JSON or a specific schema. This section moves from zero-shot prompting, through a one-shot example, to grammar-constrained generation.

## Providing Examples

### Zero-Shot Output: No Example (Starfield-Themed)

In zero-shot learning, you instruct the model without giving it any examples. The model must infer the format and details from the instructions alone:


In [None]:
# Zero-shot learning: Providing no examples (Starfield-themed)
zeroshot_prompt = [
    {
        "role": "user",
        "content": "Create a character profile for the RPG game Starfield in JSON format."
    }
]

# Generate the output
outputs = pipe(zeroshot_prompt)
print(outputs[0]["generated_text"])


 ```json

{

  "name": "Captain Elara Vance",

  "class": "Space Marine",

  "race": "Human",

  "age": 34,

  "skills": {

    "combat": 90,

    "leadership": 85,

    "navigation": 75,

    "engineering": 65

  },

  "equipment": {

    "weapon": "Plasma Rifle",

    "armor": "Titanium Plate Suit",

    "tools": ["Holo-Map Projector", "Repair Kit"]

  },

  "personality": {

    "brave": 85,

    "loyal": 90,

    "curious": 70,

    "strategic": 80

  },

  "backstory": "Captain Elara Vance was once a high-ranking officer in the United Earth Space Force. After a mysterious event led to the loss of her crew, she vowed to uncover the truth behind the incident. Now, she leads a small group of survivors in search of answers and a new home among the stars."
}

```


Results from zero-shot prompts may be inconsistent in structure or level of detail, making downstream automated processing more challenging.
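
One concrete consequence: the zero-shot reply above arrived wrapped in a Markdown code fence, which `json.loads` cannot parse directly. The sketch below strips an optional fence before parsing and returns `None` when the text is still not valid JSON. `parse_json_reply` is a hypothetical helper, and the sample reply is a shortened version of the output above.

```python
import json
import re

# Three backticks, built this way so the example can itself sit inside a fenced block
FENCE = "`" * 3

def parse_json_reply(text: str):
    """Parse a model reply as JSON, tolerating an optional Markdown code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None  # the caller decides whether to retry or discard

reply = f' {FENCE}json\n\n{{\n\n  "name": "Captain Elara Vance",\n\n  "age": 34\n\n}}\n\n{FENCE}\n'
profile = parse_json_reply(reply)
print(profile["name"], profile["age"])  # Captain Elara Vance 34
```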

### One-Shot Output: With Structure Example

Providing a single example (“one-shot learning”) primes the model to produce more predictable results in the desired format:

In [None]:
# One-shot learning: Providing an example of the output structure
one_shot_template = """Create a short character profile for an RPG Starfield game. Make sure to only use this format:

{
  "description": "A SHORT DESCRIPTION",
  "name": "THE CHARACTER'S NAME",
  "armor": "ONE PIECE OF ARMOR",
  "weapon": "ONE OR MORE WEAPONS"
}
"""
one_shot_prompt = [
    {"role": "user", "content": one_shot_template}
]

# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])


 {
  "description": "A seasoned space trader with a mysterious past, known for his cunning and resourcefulness.",
  "name": "Zorak",
  "armor": "Leather Vest",
  "weapon": "Plasma Blaster"
}


With a concrete structure to mimic, the model's output is far more likely to match the requested format. Here it returned exactly the four requested keys, with no surrounding Markdown fence.
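
One way to confirm that a reply actually follows the template is to parse it and compare its keys with the ones requested. The following is a minimal sketch using only the standard library; `follows_template` and `EXPECTED_KEYS` are hypothetical names introduced here.

```python
import json

# Keys requested in the one-shot template
EXPECTED_KEYS = {"description", "name", "armor", "weapon"}

def follows_template(reply: str) -> bool:
    """True if the reply is a JSON object with exactly the template's keys."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == EXPECTED_KEYS

# The reply printed above
reply = """ {
  "description": "A seasoned space trader with a mysterious past, known for his cunning and resourcefulness.",
  "name": "Zorak",
  "armor": "Leather Vest",
  "weapon": "Plasma Blaster"
}"""

print(follows_template(reply))                # True
print(follows_template('{"name": "Zorak"}'))  # False
```

The check is deliberately strict: extra or missing keys both count as failures, which is usually what automated processing needs.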

## Grammar: Constrained Sampling

Advanced output verification involves not just examples, but also explicit constraints enforced via code or model capabilities.

First, release the model, tokenizer, and pipeline so that GPU memory is free before a second copy of the model is loaded:

In [None]:
import gc
import torch
del model, tokenizer, pipe

# Flush memory
gc.collect()
torch.cuda.empty_cache()

In [None]:
!pip install llama-cpp-python


In [None]:
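from llama_cpp.llama import Llama

# Load the GGUF build of Phi-3 from the Hugging Face Hub
llm = Llama.from_pretrained(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="*fp16.gguf",   # Glob pattern selecting the fp16 weights file
    n_gpu_layers=-1,         # Offload all layers to the GPU
    n_ctx=2048,              # Context window in tokens (the model was trained with 4096)
    verbose=False            # Suppress llama.cpp logging
)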



Now prompt the model with an output constraint. Passing `response_format={"type": "json_object"}` restricts the response to JSON:


In [None]:
# Generate output
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Create a warrior for an RPG in JSON format."},
    ],
    response_format={"type": "json_object"},
    temperature=0,
)['choices'][0]['message']["content"]
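
The call above constrains the syntax but leaves the choice of keys to the model. llama-cpp-python also documents a JSON schema mode, in which a `schema` entry is added to `response_format`. The sketch below assumes the installed version supports that key. `WARRIOR_SCHEMA` and `generate_warrior` are hypothetical names, and the schema fields are chosen for illustration only.

```python
import json

# Hypothetical schema for illustration; these field names are not from the notebook
WARRIOR_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "level": {"type": "integer"},
        "weapon": {"type": "string"},
    },
    "required": ["name", "level", "weapon"],
}

def generate_warrior(llm) -> dict:
    """Request a warrior whose JSON is constrained to WARRIOR_SCHEMA.

    `llm` is expected to be a llama_cpp.Llama instance such as the one loaded above.
    """
    completion = llm.create_chat_completion(
        messages=[
            {"role": "user", "content": "Create a warrior for an RPG in JSON format."},
        ],
        response_format={"type": "json_object", "schema": WARRIOR_SCHEMA},
        temperature=0,
    )
    return json.loads(completion["choices"][0]["message"]["content"])
```

With the model loaded, `warrior = generate_warrior(llm)` would return a dictionary, assuming generation completes within the token limit.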


In [None]:
import json

# Format as json
json_output = json.dumps(json.loads(output), indent=4)
print(json_output)

{
    "warrior": {
        "name": "Eldric Stormbringer",
        "class": "Warrior",
        "level": 5,
        "attributes": {
            "strength": 18,
            "dexterity": 10,
            "constitution": 16,
            "intelligence": 8,
            "wisdom": 10,
            "charisma": 12
        },
        "skills": [
            {
                "name": "Martial Arts",
                "proficiency": 20,
                "description": "Expert in hand-to-hand combat and weapon handling."
            },
            {
                "name": "Shield Block",
                "proficiency": 18,
                "description": "Highly skilled at deflecting attacks with a shield."
            },
            {
                "name": "Heavy Armor",
                "proficiency": 16,
                "description": "Expertly equipped with heavy armor for protection."
            },
            {
                "name": "Survival",
                "proficiency": 14,
                "

Because generation is restricted to JSON syntax, this approach minimizes formatting errors and lets downstream code parse the result directly with `json.loads`. The constraint covers syntax only; which keys appear is still determined by the prompt.
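
One remaining failure mode is truncation: if generation stops at the token limit, the JSON is cut off mid-structure and cannot be parsed. The sketch below checks the `finish_reason` field of the OpenAI-style completion dictionary before parsing. `parse_completion` is a hypothetical helper, and the two sample dictionaries are hand-written stand-ins, not real model output.

```python
import json

def parse_completion(completion: dict) -> dict:
    """Parse the JSON content of a chat completion, rejecting truncated replies."""
    choice = completion["choices"][0]
    if choice.get("finish_reason") == "length":
        raise ValueError("generation hit the token limit; the JSON is likely incomplete")
    return json.loads(choice["message"]["content"])

# Hand-written stand-ins for a finished and a truncated completion
finished = {"choices": [{"finish_reason": "stop",
                         "message": {"content": '{"name": "Eldric Stormbringer"}'}}]}
truncated = {"choices": [{"finish_reason": "length",
                          "message": {"content": '{"name": "Eldric Storm'}}]}

print(parse_completion(finished)["name"])  # Eldric Stormbringer
try:
    parse_completion(truncated)
except ValueError as err:
    print("rejected:", err)
```

To use it with the call above, keep the full dictionary returned by `create_chat_completion` instead of indexing straight into the message content.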
