# Part 3: Prompt Engineering Basics

## Introduction

In this part, you'll experiment with different prompting techniques to improve the quality of responses from Large Language Models (LLMs). You'll compare zero-shot, one-shot, and few-shot prompting approaches and document which works best for different types of questions.

## Learning Objectives

- Understand different prompting techniques
- Compare zero-shot, one-shot, and few-shot prompting
- Analyze the impact of prompt design on response quality

## Setup

In [2]:
# Import necessary libraries
import requests
import json
import os

# Set up API key from .env file
from dotenv import load_dotenv
load_dotenv()  # Load environment variables from .env file

# Verify the key is set
api_key = os.environ.get("HUGGINGFACE_API_KEY")
if api_key:
    print("API key successfully loaded from .env file")
else:
    print("Failed to load API key from .env file")

API key successfully loaded from .env file


## 1. Understanding Prompting Techniques

LLMs can be prompted in different ways to get better responses:

1. **Zero-shot prompting**: Asking the model a question directly without examples
2. **One-shot prompting**: Providing one example before asking your question
3. **Few-shot prompting**: Providing multiple examples before asking your question

## 2. Creating Prompting Templates

Your first task is to create templates for different prompting strategies.

In [8]:
# Define a question to experiment with
question = "What foods should be avoided by patients with gout?"

# Example for one-shot and few-shot prompting
example_q = "What are the symptoms of gout?"
example_a = "Gout symptoms include sudden severe pain, swelling, redness, and tenderness in joints, often the big toe."

# Examples for few-shot prompting
examples = [
    ("What are the symptoms of gout?",
     "Gout symptoms include sudden severe pain, swelling, redness, and tenderness in joints, often the big toe."),
    ("How is gout diagnosed?",
     "Gout is diagnosed through physical examination, medical history, blood tests for uric acid levels, and joint fluid analysis to look for urate crystals.")
]

# Prompting templates
# Zero-shot template (just the question)
zero_shot_template = "Question: {question}\nAnswer:"

# One-shot template (one example + the question)
one_shot_template = """Question: {example_q}
Answer: {example_a}

Question: {question}
Answer:"""

# Few-shot template (multiple examples + the question)
few_shot_template = """Question: {examples[0][0]}
Answer: {examples[0][1]}

Question: {examples[1][0]}
Answer: {examples[1][1]}

Question: {question}
Answer:"""

# Format the templates with the question and examples
zero_shot_prompt = zero_shot_template.format(question=question)
one_shot_prompt = one_shot_template.format(example_q=example_q, example_a=example_a, question=question)
# The few-shot template indexes into the list (e.g. {examples[0][0]}), so pass the whole examples list
few_shot_prompt = few_shot_template.format(examples=examples, question=question)

print("Zero-shot prompt:")
print(zero_shot_prompt)
print("\nOne-shot prompt:")
print(one_shot_prompt)
print("\nFew-shot prompt:")
print(few_shot_prompt)

Zero-shot prompt:
Question: What foods should be avoided by patients with gout?
Answer:

One-shot prompt:
Question: What are the symptoms of gout?
Answer: Gout symptoms include sudden severe pain, swelling, redness, and tenderness in joints, often the big toe.

Question: What foods should be avoided by patients with gout?
Answer:

Few-shot prompt:
Question: What are the symptoms of gout?
Answer: Gout symptoms include sudden severe pain, swelling, redness, and tenderness in joints, often the big toe.

Question: How is gout diagnosed?
Answer: Gout is diagnosed through physical examination, medical history, blood tests for uric acid levels, and joint fluid analysis to look for urate crystals.

Question: What foods should be avoided by patients with gout?
Answer:
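
The three templates differ only in how many worked examples precede the final question, so they can be produced by one function. The following is a minimal sketch of that idea, not part of the assignment's required code; `build_prompt` is a hypothetical helper name, and the question and example pairs are copied from the cell above.

```python
def build_prompt(question, examples=()):
    """Build a Question/Answer prompt with zero or more worked examples first."""
    # One "Question/Answer" block per example, in the order given
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # The final block leaves the answer open for the model to complete
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


question = "What foods should be avoided by patients with gout?"
examples = [
    ("What are the symptoms of gout?",
     "Gout symptoms include sudden severe pain, swelling, redness, and tenderness in joints, often the big toe."),
    ("How is gout diagnosed?",
     "Gout is diagnosed through physical examination, medical history, blood tests for uric acid levels, and joint fluid analysis to look for urate crystals."),
]

zero_shot_prompt = build_prompt(question)                # no examples
one_shot_prompt = build_prompt(question, examples[:1])   # first example only
few_shot_prompt = build_prompt(question, examples)       # all examples

print(few_shot_prompt)
```

Slicing the examples list selects how many shots to include, so trying three or more examples does not require writing another template.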


## 3. Connecting to the LLM API

Next, implement a function to send prompts to an LLM API and get responses.

In [9]:
def get_llm_response(prompt, model_name="facebook/bart-large-cnn", api_key=None):
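    """Send a prompt to the Hugging Face Inference API and return the model's text.

    Returns a string starting with "Error:" instead of raising if the request fails.
    """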
    import os
    import requests
    
    # Get API key from environment if not provided
    if api_key is None:
        api_key = os.environ.get("HUGGINGFACE_API_KEY")
        if not api_key:
            return "Error: API key is required but none was provided"
    
    # Set up the API URL and headers
    api_url = f"https://api-inference.huggingface.co/models/{model_name}"
    headers = {"Authorization": f"Bearer {api_key}"}
    
    # Create payload with the prompt
    payload = {"inputs": prompt}
    
    try:
        # Send request to the API; the timeout keeps a stalled request from hanging the notebook
        response = requests.post(api_url, headers=headers, json=payload, timeout=60)
        response.raise_for_status()  # Raise exception for error status codes
        
        # Parse the response
        result = response.json()
        
        # Extract the generated text from the response
        if isinstance(result, list) and len(result) > 0:
            if isinstance(result[0], dict):
                # Handle different model response formats
                if "summary_text" in result[0]:
                    return result[0]["summary_text"]
                elif "generated_text" in result[0]:
                    return result[0]["generated_text"]
            elif isinstance(result[0], str):
                return result[0]
        
        # If we get here, the response format wasn't recognized
        return str(result)
        
    except requests.exceptions.RequestException as e:
        return f"Error: Failed to get response from the model. {str(e)}"
    
    except Exception as e:
        return f"Error: {str(e)}"

# Test the prompting templates with your function
print("Testing zero-shot prompt:")
zero_shot_response = get_llm_response(zero_shot_prompt)
print(zero_shot_response)

print("\nTesting one-shot prompt:")
one_shot_response = get_llm_response(one_shot_prompt)
print(one_shot_response)

print("\nTesting few-shot prompt:")
few_shot_response = get_llm_response(few_shot_prompt)
print(few_shot_response)

Testing zero-shot prompt:
Question: What foods should be avoided by patients with gout? Answer: Anything with a high salt content. Anything with high fat content. Everything with a low sugar content.Anything with a lot of fat. Anything high in sugar. Anything that is high in fat and low in protein.

Testing one-shot prompt:
Gout symptoms include sudden severe pain, swelling, redness, and tenderness in joints, often the big toe. What foods should be avoided by patients with gout? Gout sufferers should avoid foods that contain high levels of sugar, salt, and fat. Gout patients should also avoid foods high in salt and high in fat.

Testing few-shot prompt:
Gout symptoms include sudden severe pain, swelling, redness, and tenderness in joints, often the big toe. Gout is diagnosed through physical examination, medical history, blood tests for uric acid levels, and joint fluid analysis to look for urate crystals.
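
Because `get_llm_response` reports failures as strings beginning with `Error:` rather than raising, a caller that wants to retry transient failures has to inspect the returned text. The sketch below is one hedged way to do that; `get_response_with_retries`, its parameters, and the backoff schedule are illustrative assumptions, not part of the Hugging Face API. The `fetch` argument stands in for `get_llm_response` so the logic can be exercised without network access.

```python
import time


def get_response_with_retries(fetch, prompt, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(prompt) until it returns a non-error string or attempts run out.

    Assumes fetch follows the get_llm_response convention of returning a
    string that starts with "Error:" when the request fails.
    """
    response = "Error: no attempts were made"
    for attempt in range(max_attempts):
        response = fetch(prompt)
        if not response.startswith("Error:"):
            return response
        # Exponential backoff between attempts: base_delay, 2 * base_delay, ...
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return response  # Last error string, so the caller can still log it


# Demonstration with a stand-in for get_llm_response that fails twice, then succeeds
attempts = []


def flaky_fetch(prompt):
    attempts.append(prompt)
    if len(attempts) < 3:
        return "Error: Failed to get response from the model. (simulated)"
    return "simulated model output"


delays = []  # Record the requested delays instead of actually sleeping
result = get_response_with_retries(
    flaky_fetch, "Question: Is gout related to diet?\nAnswer:", sleep=delays.append
)
print(result, delays)
```

In the notebook this would be called as `get_response_with_retries(get_llm_response, zero_shot_prompt)`, leaving the default `sleep` in place.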


## 4. Comparing Prompting Strategies

Now, let's compare the different prompting strategies on a set of healthcare questions.

In [10]:
# List of healthcare questions to test
questions = [
    "What foods should be avoided by patients with gout?",
    "What medications are commonly prescribed for gout?",
    "How can gout flares be prevented?",
    "Is gout related to diet?",
    "Can gout be cured permanently?"
]

# Organize example Q&A pairs for reuse
example_pairs = [
    ("What are the symptoms of gout?",
     "Gout symptoms include sudden severe pain, swelling, redness, and tenderness in joints, often the big toe."),
    ("How is gout diagnosed?",
     "Gout is diagnosed through physical examination, medical history, blood tests for uric acid levels, and joint fluid analysis to look for urate crystals.")
]

# Store the results for each question and strategy
results = {}

# Process each question
for question in questions:
    print(f"\n\n==== Testing question: {question} ====\n")
    
    # Store responses for this question
    question_results = {}
    
    # 1. Zero-shot prompting
    print("Testing zero-shot prompting...")
    zero_shot_template = "Question: {question}\nAnswer:"
    zero_shot_prompt = zero_shot_template.format(question=question)
    zero_shot_response = get_llm_response(zero_shot_prompt)
    question_results["zero_shot"] = zero_shot_response
    print(f"Response: {zero_shot_response[:100]}...")  # Print first 100 chars
    
    # 2. One-shot prompting
    print("\nTesting one-shot prompting...")
    one_shot_template = """Question: {example_q}
Answer: {example_a}

Question: {question}
Answer:"""
    one_shot_prompt = one_shot_template.format(
        example_q=example_pairs[0][0], 
        example_a=example_pairs[0][1], 
        question=question
    )
    one_shot_response = get_llm_response(one_shot_prompt)
    question_results["one_shot"] = one_shot_response
    print(f"Response: {one_shot_response[:100]}...")  # Print first 100 chars
    
    # 3. Few-shot prompting
    print("\nTesting few-shot prompting...")
    few_shot_template = """Question: {example1_q}
Answer: {example1_a}

Question: {example2_q}
Answer: {example2_a}

Question: {question}
Answer:"""
    few_shot_prompt = few_shot_template.format(
        example1_q=example_pairs[0][0],
        example1_a=example_pairs[0][1],
        example2_q=example_pairs[1][0],
        example2_a=example_pairs[1][1],
        question=question
    )
    few_shot_response = get_llm_response(few_shot_prompt)
    question_results["few_shot"] = few_shot_response
    print(f"Response: {few_shot_response[:100]}...")  # Print first 100 chars
    
    # Store results for this question
    results[question] = question_results

print("\n\n===== SUMMARY OF RESULTS =====")
for i, (question, responses) in enumerate(results.items()):
    print(f"\n{i+1}. Question: {question}")
    for strategy, response in responses.items():
        print(f"  {strategy.replace('_', '-')}: {response[:50]}...")  # Print first 50 chars

# These responses are scored in the next section



==== Testing question: What foods should be avoided by patients with gout? ====

Testing zero-shot prompting...
Response: Question: What foods should be avoided by patients with gout? Answer: Anything with a high salt cont...

Testing one-shot prompting...
Response: Gout symptoms include sudden severe pain, swelling, redness, and tenderness in joints, often the big...

Testing few-shot prompting...
Response: Gout symptoms include sudden severe pain, swelling, redness, and tenderness in joints, often the big...


==== Testing question: What medications are commonly prescribed for gout? ====

Testing zero-shot prompting...
Response: Question: What medications are commonly prescribed for gout? Answer: There are a number of medicatio...

Testing one-shot prompting...
Response: Gout symptoms include sudden severe pain, swelling, redness, and tenderness in joints, often the big...

Testing few-shot prompting...
Response: Gout symptoms include sudden severe pain, swelling, redness, and tend

## 5. Evaluating Responses

Create a simple evaluation function to score the responses based on the presence of expected keywords.

In [11]:
def score_response(response, keywords):
    """Return the fraction of expected keywords that appear in the response.

    Matching is a case-insensitive substring test; an empty keyword list scores 0.
    """
    response = response.lower()
    found_keywords = 0
    for keyword in keywords:
        if keyword.lower() in response:
            found_keywords += 1
    return found_keywords / len(keywords) if keywords else 0

# Expected keywords for each question
expected_keywords = {
    "What foods should be avoided by patients with gout?": 
        ["purine", "red meat", "seafood", "alcohol", "beer", "organ meats"],
    "What medications are commonly prescribed for gout?": 
        ["nsaids", "colchicine", "allopurinol", "febuxostat", "probenecid", "corticosteroids"],
    "How can gout flares be prevented?": 
        ["medication", "diet", "weight", "alcohol", "water", "exercise"],
    "Is gout related to diet?": 
        ["yes", "purine", "food", "alcohol", "seafood", "meat"],
    "Can gout be cured permanently?": 
        ["manage", "treatment", "lifestyle", "medication", "chronic"]
}

# Score all responses and store the results
scores = {}
for question, responses in results.items():
    question_scores = {}
    
    # Get the keywords for this question
    keywords = expected_keywords.get(question, [])
    if not keywords:
        print(f"Warning: No keywords defined for question: {question}")
        continue
        
    # Score each strategy's response
    for strategy, response in responses.items():
        # Calculate score based on keyword presence
        score = score_response(response, keywords)
        question_scores[strategy] = score
        
    # Store scores for this question
    scores[question] = question_scores

# Calculate average scores for each strategy
strategy_totals = {"zero_shot": 0, "one_shot": 0, "few_shot": 0}
strategy_counts = {"zero_shot": 0, "one_shot": 0, "few_shot": 0}

for question_scores in scores.values():
    for strategy, score in question_scores.items():
        strategy_totals[strategy] += score
        strategy_counts[strategy] += 1

average_scores = {}
for strategy in strategy_totals:
    if strategy_counts[strategy] > 0:
        average_scores[strategy] = strategy_totals[strategy] / strategy_counts[strategy]
    else:
        average_scores[strategy] = 0

# Find the best strategy
best_strategy = max(average_scores, key=average_scores.get)

# Print the results in a table format
print("\n===== EVALUATION RESULTS =====\n")
print(f"{'Question':<40} {'Zero-Shot':<12} {'One-Shot':<12} {'Few-Shot':<12}")
print("-" * 80)

for question, question_scores in scores.items():
    # Truncate question if too long
    q_display = question[:37] + "..." if len(question) > 40 else question
    q_display = f"{q_display:<40}"
    
    # Format scores
    zero_score = f"{question_scores.get('zero_shot', 0):.2f}"
    one_score = f"{question_scores.get('one_shot', 0):.2f}"
    few_score = f"{question_scores.get('few_shot', 0):.2f}"
    
    print(f"{q_display} {zero_score:<12} {one_score:<12} {few_score:<12}")

print("-" * 80)
print(f"{'Average':<40} {average_scores.get('zero_shot', 0):.2f}         {average_scores.get('one_shot', 0):.2f}         {average_scores.get('few_shot', 0):.2f}")
print(f"\nBest strategy: {best_strategy.replace('_', '-')} prompting (score: {average_scores.get(best_strategy, 0):.2f})")

# Store the full evaluation results for saving later
evaluation_results = {
    "scores": scores,
    "average_scores": average_scores,
    "best_strategy": best_strategy
}


===== EVALUATION RESULTS =====

Question                                 Zero-Shot    One-Shot     Few-Shot    
--------------------------------------------------------------------------------
What foods should be avoided by patie... 0.00         0.00         0.00        
What medications are commonly prescri... 0.00         0.00         0.00        
How can gout flares be prevented?        0.00         0.00         0.00        
Is gout related to diet?                 0.17         0.00         0.00        
Can gout be cured permanently?           0.00         0.40         0.00        
--------------------------------------------------------------------------------
Average                                  0.03         0.08         0.00

Best strategy: one-shot prompting (score: 0.08)
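
Each score in the table is the number of expected keywords found divided by the number of keywords, so a response matching 1 of 6 keywords scores 1/6 ≈ 0.17. Because `score_response` uses substring matching, a short keyword such as "yes" also matches inside longer words like "eyes". The sketch below shows a whole-word variant for comparison; `score_response_whole_words` is a hypothetical name, and switching to it would change the scores, so it is offered as an experiment rather than a replacement for the graded function.

```python
import re


def score_response_whole_words(response, keywords):
    """Fraction of keywords that appear in the response as whole words or phrases."""
    if not keywords:
        return 0.0
    text = response.lower()
    hits = 0
    for keyword in keywords:
        # \b anchors the match at word boundaries; re.escape keeps the keyword literal
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        if re.search(pattern, text):
            hits += 1
    return hits / len(keywords)


# "yes" is a substring of "eyes" but not a whole word there
print(score_response_whole_words("Her eyes were swollen.", ["yes"]))  # 0.0
# Three of the four keywords appear as whole words or phrases
print(score_response_whole_words(
    "Yes, limit red meat and beer.", ["yes", "red meat", "beer", "purine"]
))  # 0.75
```

Neither variant handles plurals or synonyms (for example, "NSAID" does not match the keyword "nsaids"), which is one reason keyword scores understate answer quality.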


## 6. Saving Results

Save your results in a structured format for auto-grading.

In [14]:
import os

# Create directory if it doesn't exist
os.makedirs('results/part_3', exist_ok=True)

# Path to save results
output_file = 'results/part_3/prompting_results.txt'

with open(output_file, 'w') as f:
    f.write("# Prompt Engineering Results\n\n")
    
    # Write raw responses for each question
    for question, responses in results.items():
        f.write(f"## Question: {question}\n\n")
        
        # Write responses for each strategy
        for strategy, response in responses.items():
            strategy_name = strategy.replace('_', '-')
            f.write(f"### {strategy_name} response:\n")
            f.write(f"{response}\n\n")
        
        f.write("--------------------------------------------------\n\n")
    
    # Write scores section
    f.write("## Scores\n\n")
    f.write("```\n")
    
    # Header row
    f.write("question,zero_shot,one_shot,few_shot\n")
    
    # Write scores for each question
    for question, question_scores in scores.items():
        # Create short question name
        q_short = question.lower().replace("?", "").replace(" ", "_")[:15]
        
        # Get scores for each strategy (default to 0 if missing)
        zero_score = question_scores.get('zero_shot', 0)
        one_score = question_scores.get('one_shot', 0)
        few_score = question_scores.get('few_shot', 0)
        
        # Write the scores row
        f.write(f"{q_short},{zero_score:.2f},{one_score:.2f},{few_score:.2f}\n")
    
    # Write average scores
    f.write("\naverage,{:.2f},{:.2f},{:.2f}\n".format(
        average_scores.get('zero_shot', 0),
        average_scores.get('one_shot', 0),
        average_scores.get('few_shot', 0)
    ))
    
    # Write best method
    f.write(f"best_method,{best_strategy}\n")
    f.write("```\n\n")
    
    # Add interpretation and limitations section
    f.write("## Interpretation and Limitations\n\n")
    f.write("### Summary of Results\n\n")
    f.write(f"In this experiment, I compared three prompting techniques (zero-shot, one-shot, and few-shot) ")
    f.write(f"on medical questions related to gout. The best performing technique was **{best_strategy.replace('_', '-')} prompting** ")
    f.write(f"with an average score of {average_scores.get(best_strategy, 0):.2f}.\n\n")
    
    f.write("### Key Findings\n\n")
    f.write("1. Overall performance was poor across all prompting techniques, with very low keyword matching scores.\n")
    f.write("2. One-shot prompting showed slight improvement for certain questions, particularly about whether gout can be cured.\n")
    f.write("3. Few-shot prompting did not improve performance despite providing more examples.\n\n")
    
    f.write("### Limitations\n\n")
    f.write("The experiment has several significant limitations:\n\n")
    f.write("1. **Model Selection**: I was only able to use the facebook/bart-large-cnn model due to limitations of the free Hugging Face API access. ")
    f.write("This model was primarily trained to summarize news articles, not to answer medical questions. ")
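    f.write("A text-generation model adapted to biomedical text would likely produce better results, but none was accessible. ")
    f.write("Encoder-only models such as PubMedBERT or Bio_ClinicalBERT produce embeddings rather than free-text answers, so they would not be drop-in replacements.\n\n")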
    
    f.write("2. **Evaluation Method**: The keyword-based evaluation is simplistic and doesn't capture nuance or alternative ")
    f.write("phrasings that might be medically accurate but use different terminology.\n\n")
    
    f.write("3. **Domain Mismatch**: The poor performance likely reflects the mismatch between the model's training and ")
    f.write("the specialized medical domain rather than the effectiveness of different prompting techniques themselves.\n\n")
    
    f.write("4. **Sample Size**: The experiment only used five questions about a single medical condition (gout), ")
    f.write("which limits the generalizability of the findings.\n\n")
    
    f.write("### Conclusion\n\n")
    f.write("While one-shot prompting performed marginally better in this experiment, the overall low scores suggest that ")
    f.write("the underlying model's capabilities matter more than the prompting technique when dealing with specialized domains like healthcare.\n")

print(f"Results saved to {output_file}")

Results saved to results/part_3/prompting_results.txt
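
Before submitting, it can help to read the file back and confirm that the scores block parses as expected. The sketch below assumes the layout written by the cell above (a `## Scores` heading followed by a fenced block of comma-separated rows); `parse_scores_block` is a hypothetical helper, and the actual auto-grader may parse the file differently.

```python
FENCE = "`" * 3  # Built programmatically so this example contains no literal fence line


def parse_scores_block(text):
    """Extract per-question scores, averages, and the best method from results text."""
    lines = text.splitlines()
    start = lines.index(FENCE, lines.index("## Scores"))  # opening fence after the heading
    end = lines.index(FENCE, start + 1)                   # matching closing fence
    rows = [line.split(",") for line in lines[start + 1:end] if line.strip()]

    header, data = rows[0], rows[1:]
    strategies = header[1:]  # zero_shot, one_shot, few_shot
    parsed = {"questions": {}, "average": {}, "best_method": None}
    for row in data:
        if row[0] == "best_method":
            parsed["best_method"] = row[1]
            continue
        values = dict(zip(strategies, map(float, row[1:])))
        if row[0] == "average":
            parsed["average"] = values
        else:
            parsed["questions"][row[0]] = values
    return parsed


# Abbreviated sample in the same layout: two of the five question rows
sample = "\n".join([
    "## Scores",
    "",
    FENCE,
    "question,zero_shot,one_shot,few_shot",
    "is_gout_related,0.17,0.00,0.00",
    "can_gout_be_cur,0.00,0.40,0.00",
    "",
    "average,0.03,0.08,0.00",
    "best_method,one_shot",
    FENCE,
])

parsed = parse_scores_block(sample)
print(parsed["best_method"], parsed["average"])
```

To check the real output, pass the contents of `results/part_3/prompting_results.txt` to `parse_scores_block` instead of the sample string.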


## Progress Checkpoints

1. **Prompting Templates**:
   - [ ] Create zero-shot template
   - [ ] Create one-shot template
   - [ ] Create few-shot template
   - [ ] Format templates with questions and examples

2. **LLM API Integration**:
   - [ ] Connect to the Hugging Face API
   - [ ] Test with different prompts
   - [ ] Handle API errors

3. **Comparison and Evaluation**:
   - [ ] Compare strategies on multiple questions
   - [ ] Score responses based on keywords
   - [ ] Determine the best strategy

4. **Results and Documentation**:
   - [ ] Save results in the required format
   - [ ] Document your findings

## What to Submit

1. Your implementation in a Python script `utils/prompt_comparison.py` that:
   - Defines the prompting templates
   - Connects to the Hugging Face API
   - Compares different prompting strategies
   - Scores and evaluates the responses

2. The results of your experiments in `results/part_3/prompting_results.txt` with the format shown above

The auto-grader will check:
1. That your results file contains the required sections
2. That your scoring logic correctly identifies keyword presence
3. That you've correctly calculated average scores
4. That you've identified the best performing method