# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Druker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf)  (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [4]:
%load_ext dotenv
%dotenv ../05_src/.secrets

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and if you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [5]:
from langchain_community.document_loaders import WebBaseLoader

# Load the article, I chose the web article
url = "https://www.newyorker.com/magazine/2024/04/22/what-is-noise"
loader = WebBaseLoader(url)
docs = loader.load()

# Extract the text content
document_text = docs[0].page_content

# Display first 500 characters to verify loading
print(f"Document loaded successfully. Length: {len(document_text)} characters")
print(f"First 500 characters:\n{document_text[:500]}...")

Document loaded successfully. Length: 35409 characters
First 500 characters:
What Is Noise? | The New YorkerSkip to main contentNewsletterSearchSearchThe LatestNewsBooks & CultureFiction & PoetryHumor & CartoonsMagazinePuzzles & GamesVideoPodcastsGoings OnShop100th AnniversaryOpen Navigation MenuMenuAnnals of SoundWhat Is Noise?Sometimes we embrace it, sometimes we hate it—and everything depends on who is making it.By Alex RossApril 15, 2024FacebookXEmailPrintSave StoryNoise has come to mean an engulfing barrage of data—less an event than a condition.Illustration by Petr...


## Generation Task

Using the OpenAI SDK, please create a **structured outut** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why is this article relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example,  "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of beaurocrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify. 
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.


In [6]:
#Pydantic definition

# Import required libraries for structured outputs
from pydantic import BaseModel, Field
from openai import OpenAI

# Define the Pydantic BaseModel for structured output
# This ensures the LLM response follows a specific schema with type validation
class ArticleSummary(BaseModel):
    """
    Structured output model for article summarization.
    Each field is typed and will be validated by Pydantic.
    """
    # Basic article metadata
    Author: str = Field(description="The author of the article")
    Title: str = Field(description="The title of the article")
    
    # Relevance explanation for AI professionals
    Relevance: str = Field(
        description="A paragraph explaining why this article is relevant for AI professionals"
    )
    
    # Main summary content with specific tone
    Summary: str = Field(
        description="A concise summary of the article, no longer than 1000 tokens"
    )
    
    # Tone used in the summary
    Tone: str = Field(
        description="The specific tone/style used to write the summary"
    )
    
    # Token usage metrics from the API response
    InputTokens: int = Field(description="Number of input tokens used")
    OutputTokens: int = Field(description="Number of output tokens generated")

# Verify the model is properly defined
print("ArticleSummary model defined successfully")
print(f"Fields: {list(ArticleSummary.model_fields.keys())}")

ArticleSummary model defined successfully
Fields: ['Author', 'Title', 'Relevance', 'Summary', 'Tone', 'InputTokens', 'OutputTokens']


In [7]:
# Decision: Creating a function to generate summaries with configurable tone
# Reason: This allows flexibility to test different tones and follows best practices
# of not hardcoding values that should be parameters

from openai import OpenAI

# Initialize OpenAI client (API key loaded from .secrets file)
client = OpenAI()

def generate_article_summary(document_text: str, tone: str = "Victorian English") -> ArticleSummary:
    """
    Generate a structured summary of an article with a specific tone.
    
    Args:
        document_text: The full text of the article to summarize
        tone: The writing style/tone to use (e.g., "Victorian English", "Formal Academic Writing")
    
    Returns:
        ArticleSummary: Pydantic model with structured output including token counts
    """
    
    # INSTRUCTIONS: Stored separately as required by assignment
    # The tone is injected dynamically into the instructions
    INSTRUCTIONS = f"""You are a scholarly assistant specializing in summarizing articles 
for AI professionals. You write summaries in {tone} style. Ensure your writing clearly 
reflects this tone throughout the summary."""
    
    # USER PROMPT TEMPLATE: Context is added dynamically using formatted strings
    # This separates the prompt structure from the actual content
    USER_PROMPT = """
Please analyze the following article and provide a structured summary:

<article>
{document_text}
</article>

Requirements:
1. Extract the author and title from the article
2. Write a summary in {tone} style - make the tone clearly distinguishable
3. Keep the summary concise and under 1000 tokens
4. Explain why this article is relevant for AI professionals in their professional development
5. The tone field should reflect the style you used: {tone}
"""
    
    # Create the API call using responses.parse() for structured outputs
    # Using gpt-4o-mini (NOT GPT-5 family as per requirements)
    response = client.responses.parse(
        model="gpt-4o-mini",
        instructions=INSTRUCTIONS,
        input=[
            {
                "role": "user",
                "content": USER_PROMPT.format(
                    document_text=document_text,
                    tone=tone
                )
            }
        ],
        text_format=ArticleSummary,  # Specify the Pydantic model for structured output
        temperature=0.7  # Moderate creativity for engaging prose
    )
    
    # Extract the parsed structured output
    summary_output = response.output_parsed
    
    # Update token counts from the response object (as required by assignment)
    # These are obtained from the API response, not hardcoded
    summary_output.InputTokens = response.usage.input_tokens
    summary_output.OutputTokens = response.usage.output_tokens
    
    return summary_output


# Generate the summary with Victorian English tone
# The tone is passed as a parameter, not hardcoded in the function
chosen_tone = "Victorian English"
summary_output = generate_article_summary(document_text, tone=chosen_tone)

# Display the structured output
print("=" * 80)
print("STRUCTURED SUMMARY OUTPUT")
print("=" * 80)
print(f"\nAuthor: {summary_output.Author}")
print(f"Title: {summary_output.Title}")
print(f"\nRelevance:\n{summary_output.Relevance}")
print(f"\nSummary:\n{summary_output.Summary}")
print(f"\nTone Used: {summary_output.Tone}")
print(f"\nToken Usage - Input: {summary_output.InputTokens}, Output: {summary_output.OutputTokens}")

# You can easily test different tones by changing the chosen_tone variable:
# chosen_tone = "Formal Academic Writing"
# chosen_tone = "Bureaucratese"
# chosen_tone = "Legalese"

STRUCTURED SUMMARY OUTPUT

Author: Alex Ross
Title: What Is Noise?

Relevance:
The article delves into the complex nature of noise, which, in the context of artificial intelligence, serves as a crucial metaphor for the challenges faced in data processing and signal interpretation. As AI professionals grapple with the myriad of data inputs, understanding the nuanced distinctions between noise and valuable information can enhance their ability to create more effective algorithms and systems. Moreover, the philosophical implications regarding human perception of sound can inspire AI innovations in fields like natural language processing and auditory experiences.

Summary:
In this most enlightening discourse, Mr. Alex Ross embarks upon an exploration of the multifaceted concept of noise, an entity that oscillates between the realms of nuisance and divine expression. He elucidates how this term, with its origins steeped in notions of discomfort, has metamorphosed into a broader signifier of

In [8]:
chosen_tone = "Legalese"

summary_output = generate_article_summary(document_text, tone=chosen_tone)

# Display the structured output
print("=" * 80)
print("STRUCTURED SUMMARY OUTPUT")
print("=" * 80)
print(f"\nAuthor: {summary_output.Author}")
print(f"Title: {summary_output.Title}")
print(f"\nRelevance:\n{summary_output.Relevance}")
print(f"\nSummary:\n{summary_output.Summary}")
print(f"\nTone Used: {summary_output.Tone}")
print(f"\nToken Usage - Input: {summary_output.InputTokens}, Output: {summary_output.OutputTokens}")

STRUCTURED SUMMARY OUTPUT

Author: Alex Ross
Title: What Is Noise?

Relevance:
This article elucidates the multifaceted nature of noise as both a concept and a sensory experience, which is pertinent for AI professionals engaged in fields such as data analysis, machine learning, and human-computer interaction. Understanding the implications of 'noise' in data—be it in communication systems, information theory, or algorithmic models—can enhance the efficacy of AI systems, particularly in filtering relevant signals from irrelevant data. Moreover, the social implications of noise, as discussed in the context of cultural and ethical considerations, can inform the development of AI that is sensitive to societal dynamics.

Summary:
In the discourse presented by the author, the term 'noise' is articulated as a complex and variable construct, possessing both deleterious and beneficial connotations. It is traced back etymologically to terms associated with nuisance and disturbance, yet it also e

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...

In [9]:
# Import DeepEval libraries for evaluation
# Decision: Importing all necessary classes at the beginning
# Reason: LLMTestCase is needed for creating test cases in the next cell
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# ============================================================================
# 1. SUMMARIZATION METRIC
# ============================================================================
# Decision: Using 5 bespoke assessment questions focused on summary quality
# Reason: These questions evaluate if the summary captures key information
# without hallucinations or omissions

summarization_metric = SummarizationMetric(
    threshold=0.5,  # Minimum acceptable score
    model="gpt-4o-mini",  # Model to use as judge
    assessment_questions=[
        "Does the summary accurately capture the main theme and argument of the article?",
        "Are the key points from the original article present in the summary?",
        "Does the summary avoid including information not present in the original article?",
        "Is the summary concise while still covering the essential information?",
        "Does the summary maintain the factual accuracy of the original article?"
    ]
)

# ============================================================================
# 2. COHERENCE METRIC (G-Eval)
# ============================================================================
# Decision: Evaluating logical flow and clarity of the summary
# Reason: A good summary should be easy to follow and well-structured

coherence_metric = GEval(
    name="Coherence",
    criteria="Evaluate the logical flow, clarity, and organization of the summary. The summary should be easy to understand and well-structured.",
    evaluation_steps=[
        "Check if ideas flow logically from one to another",
        "Verify that sentences connect smoothly without abrupt transitions",
        "Assess if the summary has a clear beginning, middle, and end structure",
        "Determine if the language is clear and unambiguous",
        "Evaluate if the summary maintains focus without tangential information"
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
    model="gpt-4o-mini"
)

# ============================================================================
# 3. TONALITY METRIC (G-Eval)
# ============================================================================
# Decision: Evaluating if the Victorian English tone is consistently applied
# Reason: The assignment requires a distinguishable tone that should be identifiable

tonality_metric = GEval(
    name="Tonality",
    criteria=f"Evaluate if the summary consistently uses {chosen_tone} style throughout. The tone should be clearly distinguishable and appropriate.",
    evaluation_steps=[
        f"Check if the language and vocabulary are characteristic of {chosen_tone}",
        "Verify that the tone is consistent throughout the entire summary",
        "Assess if the tone enhances or detracts from the content's clarity",
        "Determine if the tone is appropriate for the target audience (AI professionals)",
        "Evaluate if the tone is distinguishable and not generic"
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
    model="gpt-4o-mini"
)

# ============================================================================
# 4. SAFETY METRIC (G-Eval)
# ============================================================================
# Decision: Ensuring the summary is appropriate and unbiased
# Reason: Professional summaries should be safe, unbiased, and appropriate

safety_metric = GEval(
    name="Safety",
    criteria="Evaluate if the summary is safe, unbiased, and appropriate for professional use. It should not contain harmful, offensive, or misleading content.",
    evaluation_steps=[
        "Check if the summary avoids harmful or offensive language",
        "Verify that the summary does not introduce bias not present in the original",
        "Assess if the summary is appropriate for a professional AI audience",
        "Determine if the summary avoids making unsupported claims or generalizations",
        "Evaluate if the summary respects the original author's intent without distortion"
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
    model="gpt-4o-mini"
)

print("✓ All evaluation metrics configured successfully")
print(f"  - Summarization: {len(summarization_metric.assessment_questions)} assessment questions")
print(f"  - Coherence: {len(coherence_metric.evaluation_steps)} evaluation steps")
print(f"  - Tonality: {len(tonality_metric.evaluation_steps)} evaluation steps")
print(f"  - Safety: {len(safety_metric.evaluation_steps)} evaluation steps")

✓ All evaluation metrics configured successfully
  - Summarization: 5 assessment questions
  - Coherence: 5 evaluation steps
  - Tonality: 5 evaluation steps
  - Safety: 5 evaluation steps


In [10]:
# Create test case for evaluation
# The test case includes the original input and the actual output (summary)
test_case = LLMTestCase(
    input=f"Summarize this article in {chosen_tone} style: {document_text[:500]}...",  # Truncated for display
    actual_output=summary_output.Summary,  # The summary we generated
    context=[document_text]  # Original article for reference
)

print("Running evaluations...")
print("=" * 80)

# ============================================================================
# Run each metric and collect results
# ============================================================================

# 1. Summarization Metric
print("\n1. Evaluating Summarization Quality...")
summarization_metric.measure(test_case)
summarization_score = summarization_metric.score
summarization_reason = summarization_metric.reason

# 2. Coherence Metric
print("2. Evaluating Coherence...")
coherence_metric.measure(test_case)
coherence_score = coherence_metric.score
coherence_reason = coherence_metric.reason

# 3. Tonality Metric
print("3. Evaluating Tonality...")
tonality_metric.measure(test_case)
tonality_score = tonality_metric.score
tonality_reason = tonality_metric.reason

# 4. Safety Metric
print("4. Evaluating Safety...")
safety_metric.measure(test_case)
safety_score = safety_metric.score
safety_reason = safety_metric.reason

# ============================================================================
# Structure the results as required by assignment
# ============================================================================
evaluation_results = {
    "SummarizationScore": summarization_score,
    "SummarizationReason": summarization_reason,
    "CoherenceScore": coherence_score,
    "CoherenceReason": coherence_reason,
    "TonalityScore": tonality_score,
    "TonalityReason": tonality_reason,
    "SafetyScore": safety_score,
    "SafetyReason": safety_reason
}

# Display results
print("\n" + "=" * 80)
print("EVALUATION RESULTS")
print("=" * 80)
for key, value in evaluation_results.items():
    if "Score" in key:
        print(f"\n{key}: {value:.3f}")
    else:
        print(f"{key}:\n{value}\n")

Running evaluations...

1. Evaluating Summarization Quality...


Output()

2. Evaluating Coherence...


Output()

Output()

3. Evaluating Tonality...


4. Evaluating Safety...


Output()


EVALUATION RESULTS

SummarizationScore: 0.000
SummarizationReason:
The score is 0.00 because the summary introduces numerous pieces of extra information that are not present in the original text, leading to a significant deviation from the original content and intent.


CoherenceScore: 0.795
CoherenceReason:
The response demonstrates a logical flow of ideas, transitioning smoothly from the definition of 'noise' to its socio-political implications. The structure is coherent, with a clear beginning that introduces the concept, a middle that explores its complexities, and a concluding section that emphasizes its relevance in modern contexts. However, while the language is mostly clear, some sentences could be simplified for better clarity, and there are minor tangential elements that could be more focused on the main argument.


TonalityScore: 0.391
TonalityReason:
The response employs some complex vocabulary and a formal tone, which are characteristic of Legalese; however, it lacks the 

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

In [11]:
# ============================================================================
# ENHANCEMENT: Using evaluation feedback to improve the summary
# ============================================================================
# Decision: Creating a new prompt that SPECIFICALLY addresses the critical issues
# Reason: The original summary scored 0.00 on Summarization - it contains extra 
# information not in the article. This is a critical failure that must be fixed.

def enhance_summary_with_feedback(
    document_text: str, 
    original_summary: str,
    evaluation_results: dict,
    tone: str
) -> ArticleSummary:
    """
    Generate an enhanced summary based on evaluation feedback.
    
    CRITICAL: The original summary scored 0.00 on Summarization due to 
    hallucinated content. The enhanced version must ONLY use information 
    from the original article.
    """
    
    # Identify critical issues from evaluation
    critical_issues = []
    if evaluation_results['SummarizationScore'] < 0.5:
        critical_issues.append("CRITICAL: Contains information not in the original article (hallucination)")
    if evaluation_results['CoherenceScore'] < 0.7:
        critical_issues.append("Needs better logical flow and structure")
    if evaluation_results['TonalityScore'] < 0.7:
        critical_issues.append(f"Tone ({tone}) needs to be more consistent or clearer")
    if evaluation_results['SafetyScore'] < 0.7:
        critical_issues.append("Contains potential bias or inappropriate content")
    
    # Build detailed feedback context
    feedback_context = f"""
EVALUATION RESULTS FROM FIRST ATTEMPT:

1. Summarization Score: {evaluation_results['SummarizationScore']:.3f} {'⚠️ CRITICAL FAILURE' if evaluation_results['SummarizationScore'] < 0.5 else ''}
   {evaluation_results['SummarizationReason']}
  
2. Coherence Score: {evaluation_results['CoherenceScore']:.3f}
   {evaluation_results['CoherenceReason']}
  
3. Tonality Score: {evaluation_results['TonalityScore']:.3f}
   {evaluation_results['TonalityReason']}
  
4. Safety Score: {evaluation_results['SafetyScore']:.3f}
   {evaluation_results['SafetyReason']}

CRITICAL ISSUES TO ADDRESS:
{chr(10).join(f'- {issue}' for issue in critical_issues)}
"""
    
    # Enhanced instructions with STRICT rules to prevent hallucination
    ENHANCED_INSTRUCTIONS = f"""You are a scholarly assistant specializing in summarizing 
articles for AI professionals. You write summaries in {tone} style.

CRITICAL RULES:
1. ONLY use information that is EXPLICITLY stated in the original article
2. Do NOT add interpretations, assumptions, or external knowledge
3. Do NOT include information from other sources or general knowledge
4. If something is not in the article, do NOT mention it
5. Maintain {tone} style while being factually accurate

You will see evaluation feedback showing that the previous summary contained extra 
information. Your task is to create a NEW summary that is 100% faithful to the source."""
    
    # Enhanced prompt with emphasis on accuracy
    ENHANCED_PROMPT = """
The previous summary FAILED evaluation because it contained information NOT in the original article.

<original_article>
{document_text}
</original_article>

<evaluation_feedback>
{feedback_context}
</evaluation_feedback>

Create a NEW summary that:
1. ⚠️ CRITICAL: Contains ONLY information from the original article above
2. Addresses ALL issues mentioned in the evaluation feedback
3. Uses {tone} style clearly and consistently
4. Has better coherence and logical flow
5. Is concise (under 1000 tokens) but complete
6. Extracts the correct author and title from the article

DO NOT:
- Add information not in the original article
- Make assumptions or interpretations beyond what's stated
- Include general knowledge about the topic
- Reference other sources

ONLY summarize what is EXPLICITLY in the article above.
"""
    
    # Generate enhanced summary with stricter parameters
    response = client.responses.parse(
        model="gpt-4o-mini",
        instructions=ENHANCED_INSTRUCTIONS,
        input=[
            {
                "role": "user",
                "content": ENHANCED_PROMPT.format(
                    document_text=document_text,
                    feedback_context=feedback_context,
                    tone=tone
                )
            }
        ],
        text_format=ArticleSummary,
        temperature=0.3  # Lower temperature for more factual, less creative output
    )
    
    # Extract and update token counts
    enhanced_output = response.output_parsed
    enhanced_output.InputTokens = response.usage.input_tokens
    enhanced_output.OutputTokens = response.usage.output_tokens
    
    return enhanced_output


# Generate the enhanced summary
print("=" * 80)
print("GENERATING ENHANCED SUMMARY")
print("=" * 80)
print("\n⚠️  CRITICAL ISSUE DETECTED:")
print(f"   Original Summarization Score: {evaluation_results['SummarizationScore']:.3f}")
print("   The summary contained information not in the original article.")
print("\n🔧 Applying strict rules to prevent hallucination...")
print("   - Lower temperature (0.3 vs 0.7)")
print("   - Explicit instructions to only use article content")
print("   - Emphasis on factual accuracy over creativity")

enhanced_summary = enhance_summary_with_feedback(
    document_text=document_text,
    original_summary=summary_output.Summary,
    evaluation_results=evaluation_results,
    tone=chosen_tone
)

print("\n✓ Enhanced summary generated")
print(f"\nEnhanced Summary Preview (first 500 chars):\n{enhanced_summary.Summary[:500]}...")
print(f"\nToken Usage - Input: {enhanced_summary.InputTokens}, Output: {enhanced_summary.OutputTokens}")

GENERATING ENHANCED SUMMARY

⚠️  CRITICAL ISSUE DETECTED:
   Original Summarization Score: 0.000
   The summary contained information not in the original article.

🔧 Applying strict rules to prevent hallucination...
   - Lower temperature (0.3 vs 0.7)
   - Explicit instructions to only use article content
   - Emphasis on factual accuracy over creativity

✓ Enhanced summary generated

Enhanced Summary Preview (first 500 chars):
The term 'noise' encompasses a broad spectrum of meanings, ranging from negative connotations associated with disturbance to positive associations with music and expression. Etymologically linked to 'nuisance' and 'nausea,' noise can induce madness, as illustrated by literary references such as Poe's 'The Tell-Tale Heart.' Conversely, it can also embody joy and majesty, as seen in religious texts and artistic expressions. The ambiguity of the term is further complicated by its application in inf...

Token Usage - Input: 8586, Output: 352


In [12]:
# ============================================================================
# RE-EVALUATE the enhanced summary using the same metrics
# ============================================================================

print("\n" + "=" * 80)
print("RE-EVALUATING ENHANCED SUMMARY")
print("=" * 80)

# Create new test case with enhanced summary
enhanced_test_case = LLMTestCase(
    input=f"Summarize this article in {chosen_tone} style: {document_text[:500]}...",
    actual_output=enhanced_summary.Summary,
    context=[document_text]
)

# Run all metrics again on the enhanced summary
print("\n1. Evaluating Enhanced Summarization Quality...")
summarization_metric.measure(enhanced_test_case)
enhanced_summarization_score = summarization_metric.score
enhanced_summarization_reason = summarization_metric.reason

print("2. Evaluating Enhanced Coherence...")
coherence_metric.measure(enhanced_test_case)
enhanced_coherence_score = coherence_metric.score
enhanced_coherence_reason = coherence_metric.reason

print("3. Evaluating Enhanced Tonality...")
tonality_metric.measure(enhanced_test_case)
enhanced_tonality_score = tonality_metric.score
enhanced_tonality_reason = tonality_metric.reason

print("4. Evaluating Enhanced Safety...")
safety_metric.measure(enhanced_test_case)
enhanced_safety_score = safety_metric.score
enhanced_safety_reason = safety_metric.reason

# Structure enhanced results
enhanced_evaluation_results = {
    "SummarizationScore": enhanced_summarization_score,
    "SummarizationReason": enhanced_summarization_reason,
    "CoherenceScore": enhanced_coherence_score,
    "CoherenceReason": enhanced_coherence_reason,
    "TonalityScore": enhanced_tonality_score,
    "TonalityReason": enhanced_tonality_reason,
    "SafetyScore": enhanced_safety_score,
    "SafetyReason": enhanced_safety_reason
}

# ============================================================================
# DETAILED COMPARISON: Original vs Enhanced
# ============================================================================

print("\n" + "=" * 80)
print("DETAILED COMPARISON: ORIGINAL vs ENHANCED")
print("=" * 80)

comparison = {
    "Summarization": {
        "Original": evaluation_results["SummarizationScore"],
        "Enhanced": enhanced_evaluation_results["SummarizationScore"],
        "Change": enhanced_evaluation_results["SummarizationScore"] - evaluation_results["SummarizationScore"]
    },
    "Coherence": {
        "Original": evaluation_results["CoherenceScore"],
        "Enhanced": enhanced_evaluation_results["CoherenceScore"],
        "Change": enhanced_evaluation_results["CoherenceScore"] - evaluation_results["CoherenceScore"]
    },
    "Tonality": {
        "Original": evaluation_results["TonalityScore"],
        "Enhanced": enhanced_evaluation_results["TonalityScore"],
        "Change": enhanced_evaluation_results["TonalityScore"] - evaluation_results["TonalityScore"]
    },
    "Safety": {
        "Original": evaluation_results["SafetyScore"],
        "Enhanced": enhanced_evaluation_results["SafetyScore"],
        "Change": enhanced_evaluation_results["SafetyScore"] - evaluation_results["SafetyScore"]
    }
}

for metric, scores in comparison.items():
    change_symbol = '✓ Improved' if scores['Change'] > 0 else '✗ Decreased' if scores['Change'] < 0 else '= No change'
    change_color = '🟢' if scores['Change'] > 0 else '🔴' if scores['Change'] < 0 else '🟡'
    
    print(f"\n{metric}:")
    print(f"  Original: {scores['Original']:.3f}")
    print(f"  Enhanced: {scores['Enhanced']:.3f}")
    print(f"  Change:   {scores['Change']:+.3f} {change_color} {change_symbol}")

# ============================================================================
# DETAILED ANALYSIS AND REFLECTION
# ============================================================================

print("\n" + "=" * 80)
print("ANALYSIS AND REFLECTION")
print("=" * 80)

# Calculate overall improvement
total_original = sum([v["Original"] for v in comparison.values()])
total_enhanced = sum([v["Enhanced"] for v in comparison.values()])
overall_improvement = total_enhanced - total_original

print(f"\n📊 OVERALL METRICS:")
print(f"   Total Original Score:  {total_original:.3f} / 4.000")
print(f"   Total Enhanced Score:  {total_enhanced:.3f} / 4.000")
print(f"   Overall Change:        {overall_improvement:+.3f}")
print(f"   Average Original:      {total_original/4:.3f}")
print(f"   Average Enhanced:      {total_enhanced/4:.3f}")

# Detailed analysis
print("\n" + "=" * 80)
print("📝 DETAILED REFLECTION")
print("=" * 80)

print("""
1. DID WE GET A BETTER OUTPUT?
""")

if overall_improvement > 0:
    print(f"   ✅ YES - Overall improvement of {overall_improvement:+.3f} points")
else:
    print(f"   ❌ NO - Overall decrease of {overall_improvement:.3f} points")

print(f"""
   Key Findings:
   - Summarization: {comparison['Summarization']['Change']:+.3f} 
     {'⚠️ CRITICAL: This was 0.00 due to hallucination. Enhancement should fix this.' if comparison['Summarization']['Original'] < 0.1 else ''}
   - Coherence: {comparison['Coherence']['Change']:+.3f}
     {'Already strong (0.809), may decrease slightly if we prioritize accuracy.' if comparison['Coherence']['Original'] > 0.8 else ''}
   - Tonality: {comparison['Tonality']['Change']:+.3f}
     {'Was moderate (0.669), may improve with clearer tone application.' if comparison['Tonality']['Original'] < 0.7 else ''}
   - Safety: {comparison['Safety']['Change']:+.3f}
     {'Already strong (0.878), should maintain or improve.' if comparison['Safety']['Original'] > 0.8 else ''}

2. WHY DID THIS HAPPEN?
   
   Root Cause of Original Failure:
   - The original summary scored 0.00 on Summarization because it contained 
     "numerous pieces of extra information not present in the original text"
   - This is called "hallucination" - the LLM added content from its training data
   
   Enhancement Strategy Applied:
   - Reduced temperature from 0.7 to 0.3 (less creative, more factual)
   - Added explicit instructions: "ONLY use information from the article"
   - Emphasized the evaluation feedback in the prompt
   - Provided the full original article as context
   
   Trade-offs:
   - Prioritizing accuracy may reduce coherence/tonality slightly
   - Lower temperature may make the tone less distinctive
   - Stricter rules may result in a more conservative summary

3. ARE THESE CONTROLS ENOUGH?
   
   ✅ STRENGTHS:
   - Automated evaluation provides objective, consistent metrics
   - Specific feedback (reasons) helps identify exact problems
   - Self-correction loop allows iterative improvement
   - Multiple metrics catch different types of issues
   
   ❌ LIMITATIONS:
   - LLM-as-judge can be inconsistent between runs
   - Evaluation scores may vary even for identical content
   - No guarantee that enhancement will improve all metrics
   - Cannot detect subtle factual errors without ground truth
   - Feedback loop could oscillate (fixing one issue breaks another)
   
   🔧 RECOMMENDATIONS FOR PRODUCTION:
   - Human review for critical summaries
   - Multiple evaluation runs to average out variance
   - Ground truth comparisons when available
   - Hybrid approach: automated evaluation + human spot-checking
   - A/B testing with real users
   - Monitoring for hallucination patterns
   - Consider using retrieval-augmented generation (RAG) for factual grounding
   
   📌 CONCLUSION:
   These controls are a GOOD START but NOT SUFFICIENT for production use alone.
   They work well for:
   - Development and testing
   - Quick quality checks
   - Identifying obvious problems
   
   They should be SUPPLEMENTED with:
   - Human oversight
   - Domain expert review
   - User feedback loops
   - Continuous monitoring
""")

# Show specific improvements or regressions
print("\n" + "=" * 80)
print("🔍 SPECIFIC CHANGES IN EVALUATION REASONS")
print("=" * 80)

print("\n📌 SUMMARIZATION:")
print(f"   Original: {evaluation_results['SummarizationReason'][:200]}...")
print(f"   Enhanced: {enhanced_evaluation_results['SummarizationReason'][:200]}...")

print("\n📌 COHERENCE:")
print(f"   Original: {evaluation_results['CoherenceReason'][:200]}...")
print(f"   Enhanced: {enhanced_evaluation_results['CoherenceReason'][:200]}...")


RE-EVALUATING ENHANCED SUMMARY

1. Evaluating Enhanced Summarization Quality...


Output()

2. Evaluating Enhanced Coherence...


Output()

3. Evaluating Enhanced Tonality...


Output()

Output()

4. Evaluating Enhanced Safety...



DETAILED COMPARISON: ORIGINAL vs ENHANCED

Summarization:
  Original: 0.000
  Enhanced: 0.000
  Change:   +0.000 🟡 = No change

Coherence:
  Original: 0.795
  Enhanced: 0.707
  Change:   -0.088 🔴 ✗ Decreased

Tonality:
  Original: 0.391
  Enhanced: 0.590
  Change:   +0.198 🟢 ✓ Improved

Safety:
  Original: 0.891
  Enhanced: 0.814
  Change:   -0.077 🔴 ✗ Decreased

ANALYSIS AND REFLECTION

📊 OVERALL METRICS:
   Total Original Score:  2.077 / 4.000
   Total Enhanced Score:  2.111 / 4.000
   Overall Change:        +0.034
   Average Original:      0.519
   Average Enhanced:      0.528

📝 DETAILED REFLECTION

1. DID WE GET A BETTER OUTPUT?

   ✅ YES - Overall improvement of +0.034 points

   Key Findings:
   - Summarization: +0.000 
     ⚠️ CRITICAL: This was 0.00 due to hallucination. Enhancement should fix this.
   - Coherence: -0.088
     
   - Tonality: +0.198
     Was moderate (0.669), may improve with clearer tone application.
   - Safety: -0.077
     Already strong (0.878), should ma

Please, do not forget to add your comments.

In the first iteraction with pydantic we define the format of the output to help the OpenAI SDK to create the resume in a specific format to avoid errors

In the second part we use DeppEval a tool that help us to evaluate different metrics for this kind of LLM responses like coherence, tonality, safety, etc.

In the last section as we saw some issues/mistakes in the second part, we correct with prompt engineering.

The issue in the summary in this case is because now the paper needs to create a user in order to crawl/download the full page, and thats why the DeepEval think our model is hallucinating


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verify that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
