# 👩‍💻 **Prompt and Compare Across LLMs**

**Time Estimate:** 60 minutes

## 🔍 Overview

This activity is designed to give you hands-on experience with comparing outputs across different Large Language Models (LLMs) by crafting innovative prompts and observing how variations in prompting can impact model responses. Understanding these differences will empower you to effectively use LLMs for various applications, tailoring prompts to achieve desired outputs.

In this activity, you'll connect to multiple free LLMs, analyze their outputs, and refine your prompting techniques. This is a vital skill for AI practitioners who need reliable results across creative work, technical communication, and content generation.

## 🎯 Activity Goals

By completing this activity, you will:
- Develop the ability to analyze LLM outputs for tone, clarity, and factual accuracy
- Learn to craft and modify prompts to achieve desired responses from multiple LLMs
- Understand the impact of generation parameters such as temperature on LLM output for both creative and precise tasks

## 🌎 Scenario

Your company, AI Innovations Inc., is evaluating the capabilities of different Large Language Models for a new AI-driven content generation tool. The goal is to determine which model offers the best balance of creativity and accuracy and how well it adapts its responses to user prompts. Your task is to test several free LLMs, comparing their responses to various prompts and adjusting your prompting techniques to measure output variability.

## Task 1: Set Up Free LLM Access [15 minutes]

In this task, you will set up access to multiple free LLMs using Hugging Face's transformers library and free online platforms.

In [None]:
# Install required libraries
!pip install transformers torch accelerate requests

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import requests
import json

In [None]:
# Set up local models - these are free and don't require API keys
# We'll use smaller models that can run locally

# Model 1: DistilGPT-2 (lightweight GPT-2 variant)
model1_name = "distilgpt2"
generator1 = pipeline('text-generation', model=model1_name, max_length=150)

# Model 2: GPT-2 small
model2_name = "gpt2"
generator2 = pipeline('text-generation', model=model2_name, max_length=150)

print("✅ Local models loaded successfully!")
print(f"Model 1: {model1_name}")
print(f"Model 2: {model2_name}")

In [None]:
# Optional: Set up access to a free online API
# Hugging Face Inference API (free tier available)

def query_hf_api(model_id, prompt, max_tokens=100, hf_token=None):
    """Query Hugging Face's free inference API"""
    API_URL = f"https://api-inference.huggingface.co/models/{model_id}"
    # Send the Authorization header only when a real token is supplied;
    # a placeholder string is not a valid credential
    headers = {"Authorization": f"Bearer {hf_token}"} if hf_token else {}

    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_tokens,
            "temperature": 0.7,
            "return_full_text": False
        }
    }

    try:
        # The timeout keeps the notebook from hanging on a slow or unreachable endpoint
        response = requests.post(API_URL, headers=headers, json=payload, timeout=30)
        return response.json()
    except Exception as e:
        return f"Error: {e}"

# Anonymous requests may be rejected or rate-limited; pass hf_token="..." if so
print("🌐 Online API function ready (optional)")

### 🏃 Practice
1. Run the setup code and confirm both local models load successfully
2. Test each model with a simple prompt to ensure they're working
3. (Optional) Create a free Hugging Face account and get a token for expanded access

In [None]:
# your code here to test both models with a simple prompt


### ✅ Success Checklist
- [ ] Both local models loaded without errors
- [ ] Test prompts generate responses from both models
- [ ] Understanding of free vs. paid LLM access options

### 💡 Key Points
- Local models don't require API keys or costs but have limited capabilities
- Free online APIs often have rate limits but provide access to larger models
- Consider model size vs. performance trade-offs for your use case
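
Because free APIs often enforce rate limits, a request that fails once may succeed a moment later. The sketch below shows one common way to handle that: retry with exponentially growing delays. The helper name `call_with_backoff` and its parameters are illustrative assumptions, not part of any library used in this activity.

```python
import random
import time


def call_with_backoff(fn, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter) and retry.

    fn is any zero-argument callable that raises an exception when the
    request is rejected (hypothetical example: a rate-limit error).
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Double the wait on each attempt; small jitter avoids synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

The retry only triggers on exceptions, so to use it with `query_hf_api` the wrapped callable would need to raise when the API reports an error rather than return the error payload.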

## Task 2: Baseline Prompt Evaluation [15 minutes]

Create a baseline prompt and evaluate how different models respond to the same input.

In [None]:
# Define baseline prompt
baseline_prompt = "Explain the concept of sustainability in simple terms."

def compare_models(prompt, temperature=0.7, max_tokens=100):
    """Compare responses from multiple models"""
    results = {}
    
    # Model 1 response
    try:
        # max_new_tokens limits generated tokens directly; a word count of the
        # prompt underestimates its length in tokens
        response1 = generator1(prompt, max_new_tokens=max_tokens,
                              temperature=temperature, do_sample=True, pad_token_id=50256)
        results['DistilGPT-2'] = response1[0]['generated_text'][len(prompt):].strip()
    except Exception as e:
        results['DistilGPT-2'] = f"Error: {e}"
    
    # Model 2 response
    try:
        # max_new_tokens limits generated tokens directly; a word count of the
        # prompt underestimates its length in tokens
        response2 = generator2(prompt, max_new_tokens=max_tokens,
                              temperature=temperature, do_sample=True, pad_token_id=50256)
        results['GPT-2'] = response2[0]['generated_text'][len(prompt):].strip()
    except Exception as e:
        results['GPT-2'] = f"Error: {e}"
    
    return results

# Test the function
print("🔍 Testing baseline prompt comparison...")
baseline_results = compare_models(baseline_prompt)

for model, response in baseline_results.items():
    print(f"\n{model}:")
    print(f"{response}")
    print("-" * 50)

### 🏃 Practice
1. Run the baseline prompt and document the outputs from both models
2. Analyze differences in tone, clarity, and factual accuracy
3. Create your own baseline prompt and test it

In [None]:
# your code here to create and test your own baseline prompt
my_baseline_prompt = ""  # Add your prompt here

# Test your prompt and compare outputs


### ✅ Success Checklist
- [ ] Baseline prompt outputs recorded and analyzed
- [ ] Differences in model responses documented
- [ ] Personal baseline prompt tested successfully

### 💡 Key Points
- Even similar models can produce notably different responses
- Smaller models may be less coherent but still useful for comparison
- Baseline evaluation helps establish model characteristics
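
To make baseline observations easier to compare, it can help to record a few countable properties of each response alongside your notes on tone and accuracy. The following is a minimal sketch; the function name `summarize_response` and the chosen metrics are assumptions for illustration, and they measure surface form only, not quality.

```python
import re


def summarize_response(text):
    """Return simple, countable properties of one model response."""
    # Lowercased alphabetic words (apostrophes kept so "don't" stays one word)
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Split on sentence-ending punctuation and drop empty fragments
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Share of distinct words; a low value suggests heavy repetition
    unique_ratio = len(set(words)) / len(words) if words else 0.0
    return {
        "words": len(words),
        "sentences": len(sentences),
        "unique_word_ratio": round(unique_ratio, 2),
    }
```

Applied to each entry of a results dictionary such as the one returned by `compare_models`, this gives one row of numbers per model that you can track as you change the prompt.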

## Task 3: Prompt Variation and Comparison [15 minutes]

Experiment with different prompting techniques to see how they affect model outputs.

In [None]:
# Different prompting techniques
prompts = {
    "Basic": "Explain sustainability.",
    "Context": "As an environmental expert, explain sustainability to a 10-year-old.",
    "Few-shot": """Q: What is recycling?
A: Recycling is reusing materials to make new products instead of throwing them away.

Q: What is sustainability?
A:""",
    "Detailed": "Provide a comprehensive explanation of sustainability that covers environmental, economic, and social aspects."
}

# Test all prompt variations
all_results = {}
for prompt_type, prompt in prompts.items():
    print(f"\n{'='*20} {prompt_type.upper()} PROMPT {'='*20}")
    print(f"Prompt: {prompt}")
    print("\nResponses:")
    
    results = compare_models(prompt, temperature=0.7)
    all_results[prompt_type] = results
    
    for model, response in results.items():
        print(f"\n{model}:")
        print(response)
        print("-" * 30)

### 🏃 Practice
1. Run the prompt variations and observe how each technique affects responses
2. Create your own prompt variation using a different technique
3. Compare which prompting style works best for different types of tasks

In [None]:
# your code here to create and test your own prompt variation
my_prompt_variation = ""  # Add your creative prompt here

# Test and analyze the results


### ✅ Success Checklist
- [ ] All prompt variations tested successfully
- [ ] Differences in output style and quality documented
- [ ] Best prompting techniques identified for different tasks

### 💡 Key Points
- Context and role-playing can significantly improve response quality
- Few-shot examples help models understand the desired format
- Prompt structure heavily influences LLM output
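
Few-shot prompts follow a regular pattern, so they can be assembled from data instead of typed by hand. This sketch builds the same Q/A layout used in this task; `build_few_shot_prompt` is a hypothetical helper name, not an existing library function.

```python
def build_few_shot_prompt(examples, question):
    """Format (question, answer) pairs followed by one open question."""
    # Each worked example becomes a "Q: ... / A: ..." block
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # The final block leaves the answer empty for the model to complete
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [
    ("What is recycling?",
     "Recycling is reusing materials to make new products instead of throwing them away."),
]
few_shot_prompt = build_few_shot_prompt(examples, "What is sustainability?")
print(few_shot_prompt)
```

Keeping the examples in a list makes it easy to test how the number or choice of examples changes the output while the question stays fixed.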

## Task 4: Creativity vs. Precision Tuning [15 minutes]

Experiment with temperature settings to understand the creativity-precision trade-off.

In [None]:
# Test different temperature settings
creative_prompt = "Write a short story about a world where plants can talk."
factual_prompt = "List the top 5 renewable energy sources and their efficiency ratings."

temperatures = [0.1, 0.5, 0.9]

def test_temperature_effects(prompt, description):
    print(f"\n{'='*15} {description} {'='*15}")
    print(f"Prompt: {prompt}")
    
    for temp in temperatures:
        print(f"\n--- Temperature: {temp} ---")
        results = compare_models(prompt, temperature=temp, max_tokens=80)
        
        for model, response in results.items():
            print(f"{model}: {response[:100]}...")

# Test creative task
test_temperature_effects(creative_prompt, "CREATIVE TASK")

# Test factual task
test_temperature_effects(factual_prompt, "FACTUAL TASK")

### 🏃 Practice
1. Run the temperature experiments and observe output differences
2. Create your own creative and factual prompts to test
3. Determine optimal temperature settings for different task types

In [None]:
# your code here to test your own prompts with different temperatures
my_creative_prompt = ""  # Add your creative prompt
my_factual_prompt = ""   # Add your factual prompt

# Test both with different temperature settings


### ✅ Success Checklist
- [ ] Temperature effects on creativity and precision observed
- [ ] Optimal settings identified for different task types
- [ ] Trade-offs between creativity and accuracy understood

### 💡 Key Points
- Lower temperature (0.1-0.3) = more focused, consistent outputs, which suits factual tasks
- Higher temperature (0.7-0.9) = more creative, diverse outputs that are more likely to drift off topic or contain errors
- Parameter adjustments are crucial for tailoring responses to specific needs
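
The effect of temperature can be seen directly in the math: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The sketch below is a standalone illustration in plain Python with made-up scores, not the internal code of any model in this activity.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for temp in (0.1, 0.5, 0.9):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At a low temperature almost all probability mass lands on the top-scoring token, so sampling becomes nearly deterministic. At a higher temperature the distribution flattens and less likely tokens are sampled more often.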

## ⚠️ Common Mistakes to Avoid

- **Not considering model limitations**: Free/smaller models have constraints compared to larger commercial models
- **Overlooking prompt engineering**: Small changes in wording can dramatically affect outputs
- **Ignoring temperature effects**: Using the wrong temperature setting for your task type
- **Insufficient comparison criteria**: Not establishing clear metrics for evaluation
- **Changing too many variables**: Modify one parameter at a time for clear comparisons
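
One way to avoid changing too many variables is to generate the experiment runs from a fixed baseline so that each run differs in exactly one setting. This is a minimal sketch; `one_factor_variations` and the configuration keys are hypothetical names chosen for illustration.

```python
def one_factor_variations(baseline, options):
    """Yield (changed_key, config) pairs that differ from baseline in one key."""
    for key, values in options.items():
        for value in values:
            if value == baseline.get(key):
                continue  # skip the baseline value itself
            config = dict(baseline)  # copy so the baseline is never mutated
            config[key] = value
            yield key, config


baseline = {"temperature": 0.7, "max_tokens": 100, "prompt_style": "Basic"}
options = {
    "temperature": [0.1, 0.7, 0.9],
    "prompt_style": ["Basic", "Few-shot"],
}
runs = list(one_factor_variations(baseline, options))
for changed, config in runs:
    print(changed, config)
```

Any difference between a run and the baseline run can then be attributed to the single setting that changed.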


<details>
<summary><strong>Click HERE to see an exemplar solution</strong></summary>

```python
# Complete LLM comparison analysis
import torch
from transformers import pipeline
import pandas as pd

# Setup models
models = {
    'DistilGPT-2': pipeline('text-generation', model='distilgpt2'),
    'GPT-2': pipeline('text-generation', model='gpt2')
}

def comprehensive_comparison(prompt, temperatures=(0.1, 0.5, 0.9)):
    """Comprehensive comparison across models and temperatures"""
    results = []
    
    for model_name, model in models.items():
        for temp in temperatures:
            try:
                response = model(prompt, 
                               max_new_tokens=50,  # limit generated tokens directly; words are not tokens
                               temperature=temp,
                               do_sample=True,
                               pad_token_id=50256)
                
                generated_text = response[0]['generated_text'][len(prompt):].strip()
                
                results.append({
                    'Model': model_name,
                    'Temperature': temp,
                    'Response': generated_text[:200],  # First 200 chars
                    'Length': len(generated_text.split()),
                    'Coherence': 'High' if len(generated_text.split()) > 10 else 'Low'
                })
            except Exception as e:
                results.append({
                    'Model': model_name,
                    'Temperature': temp,
                    'Response': f'Error: {e}',
                    'Length': 0,
                    'Coherence': 'Error'
                })
    
    return pd.DataFrame(results)

# Test prompts
sustainability_prompt = "Explain sustainable development in simple terms."
creative_prompt = "Write a haiku about renewable energy."

# Analyze sustainability prompt
print("=== SUSTAINABILITY ANALYSIS ===")
sustainability_df = comprehensive_comparison(sustainability_prompt)
print(sustainability_df.to_string(index=False))

print("\n=== CREATIVE ANALYSIS ===")
creative_df = comprehensive_comparison(creative_prompt)
print(creative_df.to_string(index=False))

# Analysis insights
print("\n=== KEY INSIGHTS ===")
print("1. Temperature Effects:")
print("   - Low temp (0.1): More focused, factual responses")
print("   - High temp (0.9): More creative, varied responses")

print("\n2. Model Differences:")
print("   - DistilGPT-2: Faster but less coherent for complex tasks")
print("   - GPT-2: More coherent but requires more computational resources")

print("\n3. Best Practices:")
print("   - Use low temperature (0.1-0.3) for factual questions")
print("   - Use high temperature (0.7-0.9) for creative tasks")
print("   - Always test multiple models for important applications")
print("   - Consider prompt engineering for better results")

# Evaluation metrics
def evaluate_response_quality(response, task_type):
    """Simple quality evaluation"""
    metrics = {
        'length': len(response.split()),
        'coherence': 'High' if len(response.split()) > 10 and '.' in response else 'Low',
        'relevance': 'High' if any(keyword in response.lower() for keyword in 
                                 ['sustain', 'environment', 'future', 'energy']) else 'Medium'
    }
    
    if task_type == 'creative':
        metrics['creativity'] = 'High' if any(word in response.lower() for word in 
                                             ['beautiful', 'flowing', 'whisper', 'gentle']) else 'Medium'
    
    return metrics

# Example evaluation
sample_response = "Sustainable development means meeting our current needs without compromising the ability of future generations to meet their own needs. It focuses on balancing economic growth, environmental protection, and social equity."

quality_metrics = evaluate_response_quality(sample_response, 'factual')
print(f"\nSample Response Quality: {quality_metrics}")
```
</details>