# Challenge Three: Gemini Testing & Evaluation

This notebook demonstrates how to test and evaluate Gemini-powered functions for government services.

## Key Components

1. **Question Classification**: Categorizes citizen inquiries into government service areas
2. **Social Media Generation**: Creates government announcements for various platforms
3. **Unit Testing**: Validates function behavior with pytest-based tests
4. **Evaluation API**: Compares prompt strategies using the Vertex AI Evaluation API

## Functions to Test

### classify_question(question: str) -> str
Classifies user questions into:
- Employment
- General Information
- Emergency Services
- Tax Related

### generate_social_media_post(announcement: str, platform: str) -> str
Generates social media posts for government announcements with proper tone, length, and hashtags.

## Evaluation Strategy

Use Vertex AI Evaluation API to compare:
- Different prompt formulations (few-shot vs zero-shot)
- Various instruction styles (formal vs conversational)
- Metrics: `exact_match` for classification; `bleu` and `rouge_1` for generated posts


## Step 1: Install Dependencies

In [None]:
%pip install --quiet --upgrade google-cloud-aiplatform pandas pytest ipytest


## Step 2: Import Libraries and Initialize


In [None]:
import pandas as pd
import datetime
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from vertexai.preview.evaluation import EvalTask
from vertexai import generative_models
import ipytest

# Configuration
PROJECT_ID = "qwiklabs-gcp-01-752385122246"
LOCATION = "us-central1"

# Initialize Vertex AI
vertexai.init(project=PROJECT_ID, location=LOCATION)

# Configure ipytest for running pytest in notebooks
ipytest.autoconfig()

print(f"✓ Initialized Vertex AI for project: {PROJECT_ID}")
print("✓ pytest configured for notebook testing")




✓ Initialized Vertex AI for project: qwiklabs-gcp-01-752385122246
✓ pytest configured for notebook testing


## Step 3: Question Classification Function

Create a function that classifies citizen questions into government service categories.


In [None]:
# Initialize Gemini model
model = GenerativeModel("gemini-2.5-pro")

def classify_question(question: str) -> str:
    """
    Classify a citizen's question into government service categories.

    Args:
        question: The user's question text

    Returns:
        One of: "Employment", "General Information", "Emergency Services", "Tax Related"
    """
    prompt = """You are a question classifier. Classify the question into EXACTLY ONE category. Output ONLY the category name, nothing else.

Categories:
- Employment
- General Information
- Emergency Services
- Tax Related

Examples:
Question: Where can I apply for unemployment benefits?
Category: Employment

Question: What are the city hall hours?
Category: General Information

Question: How do I report a fire?
Category: Emergency Services

Question: When is the property tax deadline?
Category: Tax Related

Question: {0}
Category:""".format(question)

    response = model.generate_content(
        prompt,
        generation_config=GenerationConfig(temperature=0.0)
    )
    return response.text.strip()

# Test the function
test_questions = [
    "How do I file for unemployment?",
    "What is the mayor's email address?",
    "I need to call 911",
    "Where do I pay my income tax?"
]

print("Testing classify_question():\n")
for q in test_questions:
    category = classify_question(q)
    print(f"Question: {q}")
    print(f"Category: {category}\n")


Testing classify_question():

Question: How do I file for unemployment?
Category: Employment

Question: What is the mayor's email address?
Category: General Information

Question: I need to call 911
Category: Emergency Services

Question: Where do I pay my income tax?
Category: Tax Related
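
`classify_question()` returns whatever text the model emits, so a trailing period or a change in casing would break any equality check downstream. The following is a minimal sketch of a guard, assuming a hypothetical helper named `normalize_category` that is not part of the notebook. It maps raw text onto the four allowed labels and raises an error for anything else.

```python
ALLOWED_CATEGORIES = (
    "Employment",
    "General Information",
    "Emergency Services",
    "Tax Related",
)


def normalize_category(raw: str) -> str:
    """Map raw model text onto one allowed category, or raise ValueError.

    Hypothetical helper: tolerates surrounding whitespace, quotes,
    trailing punctuation, and case differences.
    """
    cleaned = raw.strip().strip("\"'").rstrip(".!").strip().casefold()
    for category in ALLOWED_CATEGORIES:
        if cleaned == category.casefold():
            return category
    raise ValueError(f"Unexpected category from model: {raw!r}")
```

One way to use it would be to wrap the return value, for example `normalize_category(response.text)`, so that unexpected output fails loudly instead of flowing into later steps.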



## Step 4: Social Media Post Generator

Create a function that generates government social media posts with specific constraints.


In [None]:
def generate_social_media_post(announcement: str, platform: str = "twitter") -> str:
    """
    Generate a social media post for government announcements.

    Args:
        announcement: The announcement content (e.g., "School closed tomorrow due to snow")
        platform: Platform type ("twitter", "facebook", "instagram")

    Returns:
        Formatted social media post following platform constraints
    """
    prompt = """You write social media posts for City Government. Follow these STRICT rules:

MANDATORY REQUIREMENTS (MUST follow):
1. ALWAYS end the post with #CityGov (this is REQUIRED)
2. Twitter: Stay under 280 characters total
3. Facebook: Stay under 500 characters total
4. Instagram: Stay under 300 characters total

Style Guidelines:
- Professional but friendly government tone
- Use emoji when appropriate
- For emergencies: urgent, clear language
- For celebrations: warm, welcoming tone

Examples (notice #CityGov at the end of EVERY post):

Input: School closed tomorrow due to heavy snow
Output: ❄️ SCHOOL CLOSURE: All city schools closed tomorrow due to heavy snowfall. Stay safe! Updates at citywebsite.gov #CityGov

Input: City offices closed for Thanksgiving
Output: 🦃 Happy Thanksgiving! City offices closed Nov 23-24. Emergency services available 24/7. #CityGov

Input: New park opening Saturday
Output: 🎉 Riverside Park grand opening Saturday 10am! Food trucks, activities & fun for all ages. See you there! #CityGov

Platform: {0}
Input: {1}
Output:""".format(platform, announcement)

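    # Note: the recorded outputs below stop mid-sentence and lack #CityGov.
    # The 300-token output cap is a likely cause; consider raising
    # max_output_tokens if your posts are cut off.
    response = model.generate_content(
        prompt,
        generation_config=GenerationConfig(
            temperature=0.2,
            top_p=0.9,
            max_output_tokens=300
        )
    )
    return response.text.strip()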

# Test the function
test_announcements = [
    ("Severe thunderstorm warning in effect until 8pm", "twitter"),
    ("City Hall will be closed for Independence Day", "twitter"),
    ("New recycling program starting next month", "facebook"),
]

print("Testing generate_social_media_post():\n")
for announcement, platform in test_announcements:
    post = generate_social_media_post(announcement, platform)
    print(f"Announcement: {announcement}")
    print(f"Platform: {platform}")
    print(f"Generated Post: {post}")
    print(f"Length: {len(post)} characters\n")


Testing generate_social_media_post():

Platform: twitter
Length: 59 characters

Announcement: City Hall will be closed for Independence Day
Platform: twitter
Generated Post: 🎆 City Hall will be closed for Independence Day. We wish
Length: 56 characters

Announcement: New recycling program starting next month
Platform: facebook
Generated Post: ♻️ Get ready to recycle! Our new city-wide
Length: 42 characters
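
The prompt states two hard rules: a per-platform character limit and a closing `#CityGov` hashtag. Two of the recorded posts above end mid-sentence without the hashtag, so checking these rules in code is worthwhile. The following is a minimal sketch, assuming a hypothetical helper `check_post` that is not part of the notebook and that reads "under N characters" as strictly fewer than N.

```python
PLATFORM_LIMITS = {"twitter": 280, "facebook": 500, "instagram": 300}
REQUIRED_HASHTAG = "#CityGov"


def check_post(post: str, platform: str) -> list[str]:
    """Return a list of rule violations; an empty list means the post passes.

    Note: len() counts Unicode code points, which may differ from how a
    given platform counts characters (for example, for emoji).
    """
    problems = []
    limit = PLATFORM_LIMITS.get(platform.lower())
    if limit is None:
        problems.append(f"unknown platform: {platform}")
    elif len(post) >= limit:
        problems.append(f"{len(post)} characters, must be under {limit}")
    if not post.rstrip().endswith(REQUIRED_HASHTAG):
        problems.append(f"does not end with {REQUIRED_HASHTAG}")
    return problems


# A post shaped like the truncated output above fails the hashtag rule.
print(check_post("City Hall will be closed for Independence Day. We wish", "twitter"))
```

Returning a list of problems rather than a boolean keeps the failure message informative when the check is used inside a test assertion.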



## Step 5: Unit Tests

In [None]:
%%ipytest

import pytest

class TestClassifyQuestion:
    """Unit tests for question classification function."""

    def test_employment_question(self):
        response = classify_question("How do I apply for a job with the city?")
        assert response == "Employment", f"Expected 'Employment', got '{response}'"

    def test_employment_unemployment(self):
        response = classify_question("Where can I file for unemployment benefits?")
        assert response == "Employment", f"Expected 'Employment', got '{response}'"

    def test_general_info_hours(self):
        response = classify_question("What are the library hours?")
        assert response == "General Information", f"Expected 'General Information', got '{response}'"

    def test_general_info_contact(self):
        response = classify_question("How do I contact the mayor's office?")
        assert response == "General Information", f"Expected 'General Information', got '{response}'"

    def test_emergency_fire(self):
        response = classify_question("How do I report a fire?")
        assert response == "Emergency Services", f"Expected 'Emergency Services', got '{response}'"

    def test_emergency_911(self):
        response = classify_question("I need to call 911 for an emergency")
        assert response == "Emergency Services", f"Expected 'Emergency Services', got '{response}'"

    def test_tax_property(self):
        response = classify_question("When is my property tax due?")
        assert response == "Tax Related", f"Expected 'Tax Related', got '{response}'"

    def test_tax_payment(self):
        response = classify_question("How do I pay my city taxes online?")
        assert response == "Tax Related", f"Expected 'Tax Related', got '{response}'"


[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[33m                                                                                     [100%][0m
../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1272
    self._mark_plugins_for_rewrite(hook, disable_autoload)
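
The tests above call the live model, so they need credentials, consume API quota, and can vary between runs. The following sketch shows how the surrounding logic could be tested offline with `unittest.mock`. It assumes a hypothetical variant, `classify_with_model`, that receives the model as a parameter; the notebook's `classify_question` uses a module-level `model` instead, and the prompt here is shortened for illustration.

```python
from unittest.mock import MagicMock


def classify_with_model(model, question: str) -> str:
    """Hypothetical variant of classify_question with the model injected."""
    prompt = (
        "Classify the question into one category.\n\n"
        f"Question: {question}\nCategory:"
    )
    response = model.generate_content(prompt)
    return response.text.strip()


def make_fake_model(reply: str) -> MagicMock:
    """Build a stand-in whose generate_content() result has .text == reply."""
    fake = MagicMock()
    fake.generate_content.return_value.text = reply
    return fake


def test_strips_whitespace_from_reply():
    fake = make_fake_model("  Employment\n")
    assert classify_with_model(fake, "How do I apply for a city job?") == "Employment"
    # The question should be embedded in the prompt sent to the model.
    sent_prompt = fake.generate_content.call_args.args[0]
    assert "How do I apply for a city job?" in sent_prompt
```

Tests like this check prompt assembly and response handling only. They say nothing about model quality, which is what the live tests and the evaluation steps cover.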



In [None]:
%%ipytest

import pytest

class TestSocialMediaPost:
    """Unit tests for social media post generation."""

    def test_twitter_length_limit(self):
        """Test that Twitter posts stay under 280 characters."""
        post = generate_social_media_post("Summer concert series starts next week", "twitter")
        assert len(post) <= 280, f"Twitter post too long: {len(post)} characters"

    def test_post_is_not_empty(self):
        """Test that a post is actually generated."""
        post = generate_social_media_post("City park cleanup day Saturday", "twitter")
        assert len(post) > 0, "Post should not be empty"

    def test_emergency_contains_keyword(self):
        """Test that emergency posts mention the emergency."""
        post = generate_social_media_post("Severe weather alert", "twitter")
        assert "weather" in post.lower() or "alert" in post.lower() or "severe" in post.lower(), \
            f"Post should mention emergency: {post}"

    def test_generates_different_content(self):
        """Test that different announcements generate different posts."""
        post1 = generate_social_media_post("Park cleanup event", "twitter")
        post2 = generate_social_media_post("Tax deadline reminder", "twitter")
        # Just verify they're different and both generated
        assert len(post1) > 0 and len(post2) > 0, "Both posts should be generated"
        assert post1 != post2, "Different announcements should produce different posts"

    def test_reasonable_length(self):
        """Test that posts are a reasonable length (not truncated)."""
        post = generate_social_media_post("New Year celebration at City Hall", "twitter")
        # Should be at least 20 characters (not truncated)
        assert len(post) >= 20, f"Post seems truncated: {post}"


[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[33m                                                                                        [100%][0m
../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1272
    self._mark_plugins_for_rewrite(hook, disable_autoload)



## Step 6: Evaluation - Classification Prompts

Use the Evaluation API to compare different classification prompt variants.

In [None]:
# Create evaluation dataset for question classification

eval_questions = [
    "How do I apply for a city job?",
    "Where can I get unemployment assistance?",
    "What jobs are available in the parks department?",
    "What are city hall operating hours?",
    "How do I get a building permit?",
    "Where is the nearest library?",
    "I need to report a fire emergency",
    "What is the police non-emergency number?",
    "How do I contact emergency services?",
    "When are property taxes due?",
    "How do I pay my business tax?",
    "Where can I find tax forms?",
]

expected_categories = [
    "Employment",
    "Employment",
    "Employment",
    "General Information",
    "General Information",
    "General Information",
    "Emergency Services",
    "Emergency Services",
    "Emergency Services",
    "Tax Related",
    "Tax Related",
    "Tax Related",
]

# Create DataFrame for evaluation
classification_eval_dataset = pd.DataFrame({
    "instruction": eval_questions,
    "context": ["" for _ in eval_questions],
    "reference": expected_categories,
})

print("STEP 1: Created Evaluation Dataset")
print(f"Dataset size: {len(classification_eval_dataset)} examples")
print("\nDataset preview:")
print(classification_eval_dataset.head())
print(f"\n✓ Created evaluation dataset with {len(classification_eval_dataset)} examples")


STEP 1: Created Evaluation Dataset
Dataset size: 12 examples

Dataset preview:
                                        instruction context  \
0                    How do I apply for a city job?           
1          Where can I get unemployment assistance?           
2  What jobs are available in the parks department?           
3               What are city hall operating hours?           
4                   How do I get a building permit?           

             reference  
0           Employment  
1           Employment  
2           Employment  
3  General Information  
4  General Information  

✓ Created evaluation dataset with 12 examples


In [None]:
# RUN INFERENCE AND EVALUATION
print("\n" + "="*70)
print("STEP 2: RUN INFERENCE")
print("="*70)
print("Generating classifications for all questions...")

# Use the same prompt template as classify_question() function
classification_prompt_template = """You are a question classifier. Classify the question into EXACTLY ONE category. Output ONLY the category name, nothing else.

Categories:
- Employment
- General Information
- Emergency Services
- Tax Related

Examples:
Question: Where can I apply for unemployment benefits?
Category: Employment

Question: What are the city hall hours?
Category: General Information

Question: How do I report a fire?
Category: Emergency Services

Question: When is the property tax deadline?
Category: Tax Related

Question: {instruction}
Category:"""


eval_task = EvalTask(
    dataset=classification_eval_dataset,
    metrics=["exact_match"],
    experiment="question-classification-eval",
)

print("\nSTEP 3: RUN EVALUATION")
print("="*70)
run_ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

eval_result = eval_task.evaluate(
    model=model,
    prompt_template=classification_prompt_template,
    experiment_run_name=f"classification-{run_ts}"
)

print("✓ Evaluation complete!")

# STEP 4: EXAMINE THE RESULTS
print("\n" + "="*70)
print("STEP 4: EXAMINE RESULTS")
print("="*70)

# Display function as specified in instructions
def display_eval_report(report_data):
    """Display evaluation report with summary metrics and detailed table."""
    name, summary, metrics_table = report_data
    print(f"\n{name}")
    print("-" * 70)
    print("\nSummary Metrics:")
    print(summary)
    print("\nDetailed Metrics:")
    print(metrics_table)

# Display results using the specified format
display_eval_report(("Eval Result", eval_result.summary_metrics, eval_result.metrics_table))



STEP 2: RUN INFERENCE
Generating classifications for all questions...

STEP 3: RUN EVALUATION


INFO:vertexai.preview.evaluation.eval_task:Logging Eval experiment evaluation metadata: {'prompt_template': 'You are a question classifier. Classify the question into EXACTLY ONE category. Output ONLY the category name, nothing else.\n\nCategories:\n- Employment\n- General Information\n- Emergency Services\n- Tax Related\n\nExamples:\nQuestion: Where can I apply for unemployment benefits?\nCategory: Employment\n\nQuestion: What are the city hall hours?\nCategory: General Information\n\nQuestion: How do I report a fire?\nCategory: Emergency Services\n\nQuestion: When is the property tax deadline?\nCategory: Tax Related\n\nQuestion: {instruction}\nCategory:', 'model_name': 'publishers/google/models/gemini-2.5-pro'}
INFO:vertexai.preview.evaluation._evaluation:Assembling prompts from the `prompt_template`. The `prompt` column in the `EvalResult.metrics_table` has the assembled prompts used for model response generation.
INFO:vertexai.preview.evaluation._pre_eval_utils:Generating a total o

✓ Evaluation complete!

STEP 4: EXAMINE RESULTS

Eval Result
----------------------------------------------------------------------

Summary Metrics:
{'row_count': 12, 'exact_match/mean': np.float64(1.0), 'exact_match/std': 0.0}

Detailed Metrics:
                                         instruction context  \
0                     How do I apply for a city job?           
1           Where can I get unemployment assistance?           
2   What jobs are available in the parks department?           
3                What are city hall operating hours?           
4                    How do I get a building permit?           
5                      Where is the nearest library?           
6                  I need to report a fire emergency           
7           What is the police non-emergency number?           
8               How do I contact emergency services?           
9                       When are property taxes due?           
10                     How do I pay my busines
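
The `exact_match/mean` of 1.0 above means every response equaled its reference label. The following plain-Python sketch shows the same idea for spot-checking without the API. The helpers are hypothetical, and the sketch assumes the comparison is string equality after stripping whitespace; the service's own normalization rules are not shown in this notebook.

```python
def exact_match_scores(responses, references):
    """Score each pair 1.0 if the stripped strings are identical, else 0.0."""
    if len(responses) != len(references):
        raise ValueError("responses and references must have the same length")
    return [
        1.0 if resp.strip() == ref.strip() else 0.0
        for resp, ref in zip(responses, references)
    ]


def mean(values):
    """Arithmetic mean; returns 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Illustrative data only: the second response differs in casing.
references = ["Employment", "Tax Related", "Emergency Services"]
responses = ["Employment", "Tax related", "Emergency Services\n"]
scores = exact_match_scores(responses, references)
print(scores, mean(scores))
```

Because the metric is all-or-nothing per row, a prompt that lets the model add a period or change casing would score 0 on that row even when the category is right.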

## Step 7: Compare Different Prompt Variants

Create alternative prompts and compare their performance.


In [None]:
# Create variant classification function with different prompt strategy
def classify_question_v2(question: str) -> str:
    """
    Alternative classification with simpler, zero-shot prompt.
    """
    prompt = """Classify the following question into exactly one category:
Employment, General Information, Emergency Services, or Tax Related

Question: {0}
Category:""".format(question)

    response = model.generate_content(
        prompt,
        generation_config=GenerationConfig(temperature=0.0)
    )
    return response.text.strip()


# Create third variant with detailed category descriptions (no examples)
def classify_question_v3(question: str) -> str:
    """
    Classification with more detailed instructions but no examples.
    """
    prompt = """You are a government services classifier. Analyze the question and return ONE category.

Categories and their scope:
- "Employment" = jobs, unemployment, work permits, labor issues
- "General Information" = hours, locations, contacts, general city services
- "Emergency Services" = police, fire, medical, disasters, 911
- "Tax Related" = any tax payments, forms, deadlines, assessments

Respond with ONLY the category name, nothing else.

Question: {0}
Category:""".format(question)

    response = model.generate_content(
        prompt,
        generation_config=GenerationConfig(temperature=0.0)
    )
    return response.text.strip()

# Compare all three variants
print("Comparing prompt variants on sample questions:\n")
sample_q = "When is the tax filing deadline?"

print(f"Question: {sample_q}")
print(f"  V1 (few-shot): {classify_question(sample_q)}")
print(f"  V2 (zero-shot): {classify_question_v2(sample_q)}")
print(f"  V3 (detailed):  {classify_question_v3(sample_q)}")


Comparing prompt variants on sample questions:

Question: When is the tax filing deadline?
  V1 (few-shot): Tax Related
  V2 (zero-shot): Tax Related
  V3 (detailed):  Tax Related


In [None]:
# Evaluate Variant 2: Zero-shot simple prompt
print("\nEvaluating Variant 2 (Zero-shot simple)...")

v2_prompt_template = """Classify the following question into exactly one category:
Employment, General Information, Emergency Services, or Tax Related

Question: {instruction}
Category:"""

eval_task_v2 = EvalTask(
    dataset=classification_eval_dataset,
    metrics=["exact_match"],  # Classification uses exact_match only
    experiment="classification-prompt-comparison",
)

result_v2 = eval_task_v2.evaluate(
    model=model,
    prompt_template=v2_prompt_template,
    experiment_run_name=f"v2-zero-shot-{run_ts}"
)

print("✓ V2 Evaluation complete")
display_eval_report(("V2: Zero-shot Simple", result_v2.summary_metrics, result_v2.metrics_table))

# Evaluate Variant 3: Detailed instructions
print("\nEvaluating Variant 3 (Detailed instructions)...")

v3_prompt_template = """You are a government services classifier. Analyze the question and return ONE category.

Categories and their scope:
- "Employment" = jobs, unemployment, work permits, labor issues
- "General Information" = hours, locations, contacts, general city services
- "Emergency Services" = police, fire, medical, disasters, 911
- "Tax Related" = any tax payments, forms, deadlines, assessments

Respond with ONLY the category name, nothing else.

Question: {instruction}
Category:"""

eval_task_v3 = EvalTask(
    dataset=classification_eval_dataset,
    metrics=["exact_match"],  # Classification uses exact_match only
    experiment="classification-prompt-comparison",
)

result_v3 = eval_task_v3.evaluate(
    model=model,
    prompt_template=v3_prompt_template,
    experiment_run_name=f"v3-detailed-{run_ts}"
)

print("✓ V3 Evaluation complete")
display_eval_report(("V3: Detailed Instructions", result_v3.summary_metrics, result_v3.metrics_table))

# Display comparison of all variants
print("\n" + "="*70)
print("PROMPT VARIANT COMPARISON - ALL METRICS")
print("="*70)

comparison_df = pd.DataFrame({
    'Variant': ['V1: Few-shot', 'V2: Zero-shot', 'V3: Detailed'],
    'Exact Match': [
        f"{eval_result.summary_metrics.get('exact_match/mean', 0):.2%}",
        f"{result_v2.summary_metrics.get('exact_match/mean', 0):.2%}",
        f"{result_v3.summary_metrics.get('exact_match/mean', 0):.2%}"
    ]
})

print(comparison_df.to_string(index=False))
print("="*70)


Evaluating Variant 2 (Zero-shot simple)...


INFO:vertexai.preview.evaluation.eval_task:Logging Eval experiment evaluation metadata: {'prompt_template': 'Classify the following question into exactly one category:\nEmployment, General Information, Emergency Services, or Tax Related\n\nQuestion: {instruction}\nCategory:', 'model_name': 'publishers/google/models/gemini-2.5-pro'}
INFO:vertexai.preview.evaluation._evaluation:Assembling prompts from the `prompt_template`. The `prompt` column in the `EvalResult.metrics_table` has the assembled prompts used for model response generation.
INFO:vertexai.preview.evaluation._pre_eval_utils:Generating a total of 12 responses from Gemini model gemini-2.5-pro.
100%|██████████| 12/12 [00:07<00:00,  1.52it/s]
INFO:vertexai.preview.evaluation._pre_eval_utils:All 12 responses are successfully generated from model.
INFO:vertexai.preview.evaluation._evaluation:Multithreaded Batch Inference took: 7.921320671001013 seconds.
INFO:vertexai.preview.evaluation._evaluation:Computing metr

✓ V2 Evaluation complete

V2: Zero-shot Simple
----------------------------------------------------------------------

Summary Metrics:
{'row_count': 12, 'exact_match/mean': np.float64(1.0), 'exact_match/std': 0.0}

Detailed Metrics:
                                         instruction context  \
0                     How do I apply for a city job?           
1           Where can I get unemployment assistance?           
2   What jobs are available in the parks department?           
3                What are city hall operating hours?           
4                    How do I get a building permit?           
5                      Where is the nearest library?           
6                  I need to report a fire emergency           
7           What is the police non-emergency number?           
8               How do I contact emergency services?           
9                       When are property taxes due?           
10                     How do I pay my business tax?        

INFO:vertexai.preview.evaluation.eval_task:Logging Eval experiment evaluation metadata: {'prompt_template': 'You are a government services classifier. Analyze the question and return ONE category.\n\nCategories and their scope:\n- "Employment" = jobs, unemployment, work permits, labor issues\n- "General Information" = hours, locations, contacts, general city services  \n- "Emergency Services" = police, fire, medical, disasters, 911\n- "Tax Related" = any tax payments, forms, deadlines, assessments\n\nRespond with ONLY the category name, nothing else.\n\nQuestion: {instruction}\nCategory:', 'model_name': 'publishers/google/models/gemini-2.5-pro'}
INFO:vertexai.preview.evaluation._evaluation:Assembling prompts from the `prompt_template`. The `prompt` column in the `EvalResult.metrics_table` has the assembled prompts used for model response generation.
INFO:vertexai.preview.evaluation._pre_eval_utils:Generating a total of 12 responses from Gemini model gemini-2.5-pro.
100%|█████

✓ V3 Evaluation complete

V3: Detailed Instructions
----------------------------------------------------------------------

Summary Metrics:
{'row_count': 12, 'exact_match/mean': np.float64(1.0), 'exact_match/std': 0.0}

Detailed Metrics:
                                         instruction context  \
0                     How do I apply for a city job?           
1           Where can I get unemployment assistance?           
2   What jobs are available in the parks department?           
3                What are city hall operating hours?           
4                    How do I get a building permit?           
5                      Where is the nearest library?           
6                  I need to report a fire emergency           
7           What is the police non-emergency number?           
8               How do I contact emergency services?           
9                       When are property taxes due?           
10                     How do I pay my business tax?   

## Step 8: Evaluate Social Media Post Generation

Create evaluation for social media post quality and compare prompt strategies.


In [None]:
# Create evaluation dataset for social media posts
social_media_announcements = [
    "Heavy snowfall expected tonight, all schools closed tomorrow",
    "City offices closed for Memorial Day weekend",
    "New community center grand opening this Saturday at 2pm",
    "Water main break on Main Street, avoid the area",
    "Free tax preparation help available at library every Tuesday",
    "Fourth of July fireworks show at the park, starts at 9pm",
]

# Reference posts: the ground truth that BLEU and ROUGE score each response against
reference_posts = [
    "❄️ SCHOOL CLOSURE: All city schools closed tomorrow due to heavy snowfall. #CityGov",
    "🦃 City offices closed for Memorial Day weekend. Emergency services remain available. #CityGov",
    "🎉 New community center grand opening Saturday at 2pm! Come celebrate with us! #CityGov",
    "🚨 Water main break on Main Street. Please avoid the area. Updates at citywebsite.gov #CityGov",
    "📋 Free tax preparation help available at the library every Tuesday in April. #CityGov",
    "🎆 Fourth of July fireworks at the park! Show starts at 9pm. #CityGov",
]

social_media_eval_dataset = pd.DataFrame({
    "instruction": social_media_announcements,
    "context": ["twitter" for _ in social_media_announcements],
    "reference": reference_posts,
})

print("STEP 1: Social Media Evaluation Dataset")
print("="*70)
print(social_media_eval_dataset[['instruction', 'reference']].head())

# STEP 2 & 3: Run evaluation with BLEU and ROUGE for text generation
print("\n" + "="*70)
print("EVALUATING SOCIAL MEDIA POST GENERATION")
print("="*70)

social_media_prompt_template = """You write social media posts for City Government. Follow these STRICT rules:

MANDATORY REQUIREMENTS (MUST follow):
1. ALWAYS end the post with #CityGov (this is REQUIRED)
2. Twitter: Stay under 280 characters total
3. Facebook: Stay under 500 characters total
4. Instagram: Stay under 300 characters total

Style Guidelines:
- Professional but friendly government tone
- Use emoji when appropriate
- For emergencies: urgent, clear language
- For celebrations: warm, welcoming tone

Examples (notice #CityGov at the end of EVERY post):

Input: School closed tomorrow due to heavy snow
Output: ❄️ SCHOOL CLOSURE: All city schools closed tomorrow due to heavy snowfall. Stay safe! Updates at citywebsite.gov #CityGov

Input: City offices closed for Thanksgiving
Output: 🦃 Happy Thanksgiving! City offices closed Nov 23-24. Emergency services available 24/7. #CityGov

Input: New park opening Saturday
Output: 🎉 Riverside Park grand opening Saturday 10am! Food trucks, activities & fun for all ages. See you there! #CityGov

Platform: twitter
Input: {instruction}
Output:"""

social_eval_task = EvalTask(
    dataset=social_media_eval_dataset,
    metrics=["bleu", "rouge_1"],
    experiment="social-media-generation-eval",
)

social_result = social_eval_task.evaluate(
    model=model,
    prompt_template=social_media_prompt_template,
    experiment_run_name=f"social-media-{run_ts}"
)

print("✓ Social media evaluation complete!")

# Display results
display_eval_report(("Social Media Generation Eval", social_result.summary_metrics, social_result.metrics_table))


STEP 1: Social Media Evaluation Dataset
                                         instruction  \
0  Heavy snowfall expected tonight, all schools c...   
1       City offices closed for Memorial Day weekend   
2  New community center grand opening this Saturd...   
3    Water main break on Main Street, avoid the area   
4  Free tax preparation help available at library...   

                                           reference  
0  ❄️ SCHOOL CLOSURE: All city schools closed tom...  
1  🦃 City offices closed for Memorial Day weekend...  
2  🎉 New community center grand opening Saturday ...  
3  🚨 Water main break on Main Street. Please avoi...  
4  📋 Free tax preparation help available at the l...  

EVALUATING SOCIAL MEDIA POST GENERATION


INFO:vertexai.preview.evaluation.eval_task:Logging Eval experiment evaluation metadata: {'prompt_template': 'You write social media posts for City Government. Follow these STRICT rules:\n\nMANDATORY REQUIREMENTS (MUST follow):\n1. ALWAYS end the post with #CityGov (this is REQUIRED)\n2. Twitter: Stay under 280 characters total\n3. Facebook: Stay under 500 characters total\n4. Instagram: Stay under 300 characters total\n\nStyle Guidelines:\n- Professional but friendly government tone\n- Use emoji when appropriate\n- For emergencies: urgent, clear language\n- For celebrations: warm, welcoming tone\n\nExamples (notice #CityGov at the end of EVERY post):\n\nInput: School closed tomorrow due to heavy snow\nOutput: ❄️ SCHOOL CLOSURE: All city schools closed tomorrow due to heavy snowfall. Stay safe! Updates at citywebsite.gov #CityGov\n\nInput: City offices closed for Thanksgiving\nOutput: 🦃 Happy Thanksgiving! City offices closed Nov 23-24. Emergency services available 24/7. #CityGov

✓ Social media evaluation complete!

Social Media Generation Eval
----------------------------------------------------------------------

Summary Metrics:
{'row_count': 6, 'bleu/mean': np.float64(0.17606888583333333), 'bleu/std': 0.12304910900169112, 'rouge_1/mean': np.float64(0.5892693016666667), 'rouge_1/std': 0.09147359462551637}

Detailed Metrics:
                                         instruction  context  \
0  Heavy snowfall expected tonight, all schools c...  twitter   
1       City offices closed for Memorial Day weekend  twitter   
2  New community center grand opening this Saturd...  twitter   
3    Water main break on Main Street, avoid the area  twitter   
4  Free tax preparation help available at library...  twitter   
5  Fourth of July fireworks show at the park, sta...  twitter   

                                           reference  \
0  ❄️ SCHOOL CLOSURE: All city schools closed tom...   
1  🦃 City offices closed for Memorial Day weekend...   
2  🎉 New c
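
ROUGE-1 measures how many single words a response shares with its reference, which explains why a well-written post that phrases things differently can still score well below 1.0. The following is a rough sketch of the idea, assuming lowercased whitespace tokenization and an F1 combination of precision and recall. The notebook does not state how the service tokenizes or which of the three values it reports, so these numbers will not necessarily match the API's.

```python
from collections import Counter


def rouge_1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between candidate and reference.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts only as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge_1_f1("city offices closed today", "city offices closed"))
```

Since overlap metrics reward matching the reference wording, they are best read as a relative signal when comparing prompt variants against the same references, not as an absolute quality score.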

In [None]:
# Define social media prompt variants for comparison

# Variant 1: Detailed with examples (a shortened version of social_media_prompt_template from the previous evaluation cell)
social_v1_prompt = """You write social media posts for City Government. Follow these STRICT rules:

MANDATORY REQUIREMENTS (MUST follow):
1. ALWAYS end the post with #CityGov (this is REQUIRED)
2. Twitter: Stay under 280 characters total

Style Guidelines:
- Professional but friendly government tone
- Use emoji when appropriate
- For emergencies: urgent, clear language

Examples (notice #CityGov at the end of EVERY post):

Input: School closed tomorrow due to heavy snow
Output: ❄️ SCHOOL CLOSURE: All city schools closed tomorrow due to heavy snowfall. Stay safe! Updates at citywebsite.gov #CityGov

Input: City offices closed for Thanksgiving
Output: 🦃 Happy Thanksgiving! City offices closed Nov 23-24. Emergency services available 24/7. #CityGov

Platform: twitter
Input: {instruction}
Output:"""

# Variant 2: Simpler, more casual prompt
social_v2_prompt = """Create a Twitter post for City Government.

Rules:
- Keep it under 280 characters
- End with #CityGov
- Be friendly and clear

Announcement: {instruction}
Post:"""

# Variant 3: Formal, no examples
social_v3_prompt = """Generate a professional government social media post for Twitter (max 280 chars).
Include #CityGov hashtag.
Keep tone official and informative.

Announcement: {instruction}
Post:"""

print("Created 3 social media prompt variants for comparison")


Created 3 social media prompt variants for comparison
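
All three variants ask for a post of at most 280 characters that carries the #CityGov hashtag, but BLEU and ROUGE do not check either rule directly. A small deterministic check can complement them. The sketch below is an assumption-laden illustration: `check_post_rules` is a hypothetical helper, it treats the limit as inclusive, and it counts characters with Python's `len`, which may differ from how the platform itself counts emoji or links.

```python
TWITTER_LIMIT = 280           # limit stated in the prompt variants
REQUIRED_HASHTAG = "#CityGov"  # hashtag required by the prompt variants


def check_post_rules(post: str, limit: int = TWITTER_LIMIT) -> dict:
    """Return pass/fail flags for the hard rules stated in the prompts."""
    text = post.strip()
    return {
        # len() counts Unicode code points; treated here as an approximation.
        "within_limit": len(text) <= limit,
        "ends_with_hashtag": text.endswith(REQUIRED_HASHTAG),
    }


# Example taken from the few-shot section of the V1 prompt (emoji omitted).
good_post = (
    "SCHOOL CLOSURE: All city schools closed tomorrow due to heavy snowfall. "
    "Stay safe! Updates at citywebsite.gov #CityGov"
)
# Hypothetical failing post: hashtag present but not at the end.
bad_post = "#CityGov offices are closed on Monday. See you Tuesday!"

print(check_post_rules(good_post))
print(check_post_rules(bad_post))
```

In the notebook, the same function could be applied to each generated post in a result's `metrics_table`, assuming the generated text sits in a `response` column.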


In [None]:
# Evaluate Variant 2: Casual/Simple
print("\nEvaluating Social Media Variant 2 (Casual/Simple)...")

social_eval_task_v2 = EvalTask(
    dataset=social_media_eval_dataset,
    metrics=["bleu", "rouge_1"],
    experiment="social-media-prompt-comparison",
)

social_result_v2 = social_eval_task_v2.evaluate(
    model=model,
    prompt_template=social_v2_prompt,
    experiment_run_name=f"social-v2-casual-{run_ts}"
)

print("✓ V2 Evaluation complete")
display_eval_report(("Social Media V2: Casual/Simple", social_result_v2.summary_metrics, social_result_v2.metrics_table))



Evaluating Social Media Variant 2 (Casual/Simple)...


INFO:vertexai.preview.evaluation.eval_task:Logging Eval experiment evaluation metadata: {'prompt_template': 'Create a Twitter post for City Government.\n\nRules:\n- Keep it under 280 characters\n- End with #CityGov\n- Be friendly and clear\n\nAnnouncement: {instruction}\nPost:', 'model_name': 'publishers/google/models/gemini-2.5-pro'}
INFO:vertexai.preview.evaluation._evaluation:Assembling prompts from the `prompt_template`. The `prompt` column in the `EvalResult.metrics_table` has the assembled prompts used for model response generation.
INFO:vertexai.preview.evaluation._pre_eval_utils:Generating a total of 6 responses from Gemini model gemini-2.5-pro.
100%|██████████| 6/6 [00:18<00:00,  3.06s/it]
INFO:vertexai.preview.evaluation._pre_eval_utils:All 6 responses are successfully generated from model.
INFO:vertexai.preview.evaluation._evaluation:Multithreaded Batch Inference took: 18.366832757998054 seconds.
INFO:vertexai.preview.evaluation._evaluation:Computing metr

✓ V2 Evaluation complete

Social Media V2: Casual/Simple
----------------------------------------------------------------------

Summary Metrics:
{'row_count': 6, 'bleu/mean': np.float64(0.12983086716666667), 'bleu/std': 0.0948869846489956, 'rouge_1/mean': np.float64(0.47062538833333334), 'rouge_1/std': 0.09101318426920406}

Detailed Metrics:
                                         instruction  context  \
0  Heavy snowfall expected tonight, all schools c...  twitter   
1       City offices closed for Memorial Day weekend  twitter   
2  New community center grand opening this Saturd...  twitter   
3    Water main break on Main Street, avoid the area  twitter   
4  Free tax preparation help available at library...  twitter   
5  Fourth of July fireworks show at the park, sta...  twitter   

                                           reference  \
0  ❄️ SCHOOL CLOSURE: All city schools closed tom...   
1  🦃 City offices closed for Memorial Day weekend...   
2  🎉 New community 

In [None]:
# Evaluate Variant 3: Formal/No Examples
print("\nEvaluating Social Media Variant 3 (Formal/No Examples)...")

social_eval_task_v3 = EvalTask(
    dataset=social_media_eval_dataset,
    metrics=["bleu", "rouge_1"],
    experiment="social-media-prompt-comparison",
)

social_result_v3 = social_eval_task_v3.evaluate(
    model=model,
    prompt_template=social_v3_prompt,
    experiment_run_name=f"social-v3-formal-{run_ts}"
)

print("✓ V3 Evaluation complete")
display_eval_report(("Social Media V3: Formal/No Examples", social_result_v3.summary_metrics, social_result_v3.metrics_table))



Evaluating Social Media Variant 3 (Formal/No Examples)...


INFO:vertexai.preview.evaluation.eval_task:Logging Eval experiment evaluation metadata: {'prompt_template': 'Generate a professional government social media post for Twitter (max 280 chars).\nInclude #CityGov hashtag.\nKeep tone official and informative.\n\nAnnouncement: {instruction}\nPost:', 'model_name': 'publishers/google/models/gemini-2.5-pro'}
INFO:vertexai.preview.evaluation._evaluation:Assembling prompts from the `prompt_template`. The `prompt` column in the `EvalResult.metrics_table` has the assembled prompts used for model response generation.
INFO:vertexai.preview.evaluation._pre_eval_utils:Generating a total of 6 responses from Gemini model gemini-2.5-pro.
100%|██████████| 6/6 [00:17<00:00,  2.92s/it]
INFO:vertexai.preview.evaluation._pre_eval_utils:All 6 responses are successfully generated from model.
INFO:vertexai.preview.evaluation._evaluation:Multithreaded Batch Inference took: 17.544387697002094 seconds.
INFO:vertexai.preview.evaluation._evaluation

✓ V3 Evaluation complete

Social Media V3: Formal/No Examples
----------------------------------------------------------------------

Summary Metrics:
{'row_count': 6, 'bleu/mean': np.float64(0.10149365149999999), 'bleu/std': 0.08585672391386243, 'rouge_1/mean': np.float64(0.43005626999999996), 'rouge_1/std': 0.06798630151559269}

Detailed Metrics:
                                         instruction  context  \
0  Heavy snowfall expected tonight, all schools c...  twitter   
1       City offices closed for Memorial Day weekend  twitter   
2  New community center grand opening this Saturd...  twitter   
3    Water main break on Main Street, avoid the area  twitter   
4  Free tax preparation help available at library...  twitter   
5  Fourth of July fireworks show at the park, sta...  twitter   

                                           reference  \
0  ❄️ SCHOOL CLOSURE: All city schools closed tom...   
1  🦃 City offices closed for Memorial Day weekend...   
2  🎉 New comm

In [None]:
# Compare all social media prompt variants
print("\n" + "="*70)
print("SOCIAL MEDIA PROMPT VARIANT COMPARISON")
print("="*70)

social_comparison_df = pd.DataFrame({
    'Variant': ['V1: Detailed+Examples', 'V2: Casual/Simple', 'V3: Formal/No Examples'],
    'BLEU': [
        f"{social_result.summary_metrics.get('bleu/mean', 0):.3f}",
        f"{social_result_v2.summary_metrics.get('bleu/mean', 0):.3f}",
        f"{social_result_v3.summary_metrics.get('bleu/mean', 0):.3f}"
    ],
    'ROUGE-1': [
        f"{social_result.summary_metrics.get('rouge_1/mean', 0):.3f}",
        f"{social_result_v2.summary_metrics.get('rouge_1/mean', 0):.3f}",
        f"{social_result_v3.summary_metrics.get('rouge_1/mean', 0):.3f}"
    ]
})

print(social_comparison_df.to_string(index=False))
print("="*70)

print("\nKey Insights:")
print("- Higher BLEU/ROUGE scores indicate generated posts are more similar to references")
print("- V1 (detailed with examples) typically performs best for structured output")
print("- Compare scores to determine which prompt strategy is most effective")



SOCIAL MEDIA PROMPT VARIANT COMPARISON
               Variant  BLEU ROUGE-1
 V1: Detailed+Examples 0.176   0.589
     V2: Casual/Simple 0.130   0.471
V3: Formal/No Examples 0.101   0.430

Key Insights:
- Higher BLEU/ROUGE scores indicate generated posts are more similar to references
- V1 (detailed with examples) typically performs best for structured output
- Compare scores to determine which prompt strategy is most effective
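
The ranking can also be read from the reported metrics instead of being stated up front. The sketch below re-enters the ROUGE-1 means and standard deviations from the three summary outputs above (rounded to four decimals) and attaches a rough standard error to each mean. The `summary` dictionary and the `standard_error` helper are illustrative names, not part of the Evaluation API, and with only six rows per variant the figures should be treated as indicative.

```python
import math

ROW_COUNT = 6  # 'row_count' reported in each summary

# ROUGE-1 mean/std copied (rounded) from the summary metrics printed above.
summary = {
    "V1: Detailed+Examples": {"rouge_1/mean": 0.5893, "rouge_1/std": 0.0915},
    "V2: Casual/Simple": {"rouge_1/mean": 0.4706, "rouge_1/std": 0.0910},
    "V3: Formal/No Examples": {"rouge_1/mean": 0.4301, "rouge_1/std": 0.0680},
}


def standard_error(std: float, n: int) -> float:
    """Rough standard error of the mean: std / sqrt(n)."""
    return std / math.sqrt(n)


# Sort variants from highest to lowest mean ROUGE-1.
ranked = sorted(summary.items(), key=lambda kv: kv[1]["rouge_1/mean"], reverse=True)

for name, metrics in ranked:
    se = standard_error(metrics["rouge_1/std"], ROW_COUNT)
    print(f"{name}: ROUGE-1 {metrics['rouge_1/mean']:.3f} (SE ~{se:.3f})")

best = ranked[0][0]
print(f"Highest mean ROUGE-1 in this run: {best}")
```

Comparing the gap between two means with their standard errors gives a quick sense of whether a difference is likely to hold up on a larger evaluation set.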
