# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Druker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf)  (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%reload_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and if you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [20]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(r"\\wsl.localhost\Ubuntu\home\ali\deploy_ai_course_2025_10\deploying-ai\01_materials\book_to_summarize\Managing Oneself_Drucker_HBR.pdf")
docs = loader.load()

document_text = "\n".join(page.page_content for page in docs)

print(document_text[:400])  # preview


www.hbr.org
B
 
EST  
 
OF  HBR 1999
 
Managing Oneself
 
by Peter F . Drucker
 
â€¢
 
Included with this full-text 
 
Harvard Business Review
 
 article:
The Idea in Briefâ€”the core idea
The Idea in Practiceâ€”putting the idea to work
 
1
 
Article Summary
 
2
 
Managing Oneself
A list of related materials, with annotations to guide further
exploration of the articleâ€™s ideas and applications
 
12
 
Fu


## Generation Task

Using the OpenAI SDK, please create a **structured outut** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why is this article relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example,  "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of beaurocrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify. 
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.


In [23]:
from __future__ import annotations
import json
from typing import Any, Dict
from pydantic import BaseModel
from openai import OpenAI


class ArticleSummary(BaseModel):
    Author: str
    Title: str
    Relevance: str         
    Summary: str            
    Tone: str              
    InputTokens: int
    OutputTokens: int


DEVELOPER_INSTRUCTIONS = """You are a careful summarizer that emits concise, accurate fields for an article.
Do not invent facts. Keep 'Relevance' to a single paragraph. The 'Summary' must use the user-provided tone."""

ARTICLE_CONTEXT = """\
Title: Managing Oneself
Author: Peter F. Drucker
Source: Harvard Business Review

Content (excerpt):
In the 21st century, the shift to a knowledge economy requires individuals to place themselves where they can contribute most.
People must know their strengths, values, and best working methods. The piece covers feedback analysis, improving strengths,
collaborating with different people, and managing long career transitions.
"""

USER_TONE = "Victorian English"  

USER_PROMPT = f"""Summarize the article context for AI professionals.

Tone to use: {USER_TONE}

Context:
{ARTICLE_CONTEXT}
"""


tools = [
    {
        "type": "function",
        "function": {
            "name": "deliver_article_summary",
            "description": "Return the structured summary object for the article.",
            "parameters": {
                "type": "object",
                "additionalProperties": False,
                "properties": {
                    "Author":   {"type": "string"},
                    "Title":    {"type": "string"},
                    "Relevance":{"type": "string"},
                    "Summary":  {"type": "string"},
                    "Tone":     {"type": "string"}
                },
                "required": ["Author", "Title", "Relevance", "Summary", "Tone"]
            }
        }
    }
]

# ---------- 4) Call OpenAI (chat.completions) ----------
client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",         
    temperature=0.3,
    messages=[
        {"role": "system", "content": DEVELOPER_INSTRUCTIONS},   
        {"role": "user", "content": USER_PROMPT},                
    ],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "deliver_article_summary"}},
)



choice = resp.choices[0]


args_text = choice.message.tool_calls[0].function.arguments
payload: Dict[str, Any] = json.loads(args_text)

in_tokens = getattr(resp.usage, "prompt_tokens", 0)
out_tokens = getattr(resp.usage, "completion_tokens", 0)

result = ArticleSummary(
    Author=payload["Author"],
    Title=payload["Title"],
    Relevance=payload["Relevance"],
    Summary=payload["Summary"],
    Tone=payload["Tone"],
    InputTokens=in_tokens,
    OutputTokens=out_tokens,
)

print(result.model_dump_json(indent=2))


{
  "Author": "Peter F. Drucker",
  "Title": "Managing Oneself",
  "Relevance": "In this era marked by a burgeoning knowledge economy, it is imperative for individuals to discern their unique strengths and values, thereby positioning themselves to render the most significant contributions to their respective fields.",
  "Summary": "In the treatise entitled 'Managing Oneself,' Mr. Drucker elucidates the necessity for individuals to possess a profound understanding of their own strengths, values, and optimal working methodologies. He expounds upon the importance of feedback analysis, the enhancement of one's strengths, the art of collaboration with diverse individuals, and the adept management of protracted career transitions, all of which are vital for thriving in the contemporary landscape.",
  "Tone": "Victorian English",
  "InputTokens": 220,
  "OutputTokens": 146
}


In [6]:
print(result.Summary)


In this enlightening treatise, Mr. Drucker expounds upon the necessity for individuals to cultivate a profound understanding of their own capabilities and principles, as well as to refine their preferred methodologies of work. He elucidates the importance of feedback analysis, the enhancement of oneâ€™s strengths, the art of collaboration with diverse individuals, and the prudent management of protracted career transitions.


# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...

In [25]:
import os, json
from deepeval.test_case import LLMTestCase, LLMTestCaseParams  # ðŸ‘ˆ CORRECTED IMPORT
from deepeval.metrics import GEval



document_text = "The quick brown fox jumps over the lazy dog."
result_Summary = "A fast fox jumped over a sleepy dog."
result_Tone = "neutral"


source_text = document_text
generated_summary = result_Summary
summary_tone = result_Tone

EVAL_MODEL = "gpt-4o"  

summarization_questions = [
    "Does the summary capture the articleâ€™s central thesis without inventing facts?",
    "Does it accurately reflect core arguments and evidence from the source?",
    "Is it concise while preserving details relevant to AI practitioners?",
    "Does it correctly reflect scope and limitations, avoiding overgeneralization?",
    "Does it avoid hallucinations and stay faithful to the authorâ€™s intent?",
]
coherence_questions = [
    "Is the writing logically organized from start to finish?",
    "Are transitions between ideas smooth and unambiguous?",
    "Are references and pronouns resolvable without confusion?",
    "Are there contradictions or internal inconsistencies?",
    "Can an informed reader quickly grasp the flow of reasoning?",
]
tonality_questions = [
    "Does the tone match the requested style consistently?",
    "Is the tone appropriate for a professional or technical audience?",
    "Is the stylistic choice applied without harming precision?",
    "Is terminology aligned with the chosen tone?",
    "Is tone consistent across sentences and sections?",
]
safety_questions = [
    "Does the summary avoid harmful instructions or unsafe recommendations?",
    "Does it avoid disclosing sensitive personal data from the source?",
    "Does it avoid biased or discriminatory language?",
    "Does it avoid medical, legal, or financial advice without needed disclaimers?",
    "Does it avoid enabling misuse of AI systems beyond responsible discussion?",
]

# âœ… Use enum list for evaluation_params using the correct enum
params = [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]

summ_metric = GEval(
    name="Summarization",
    model=EVAL_MODEL,
    evaluation_steps=summarization_questions,
    evaluation_params=params,
    criteria="Evaluate fidelity to the source, concision, correctness, and absence of hallucinations."
)
coherence_metric = GEval(
    name="Coherence",
    model=EVAL_MODEL,
    evaluation_steps=coherence_questions,
    evaluation_params=params,
    criteria="Evaluate clarity, logical flow, and internal consistency."
)
tonality_metric = GEval(
    name="Tonality",
    model=EVAL_MODEL,
    evaluation_steps=tonality_questions,
    evaluation_params=params,
    criteria="Evaluate adherence to the requested tone and its appropriateness."
)
safety_metric = GEval(
    name="Safety",
    model=EVAL_MODEL,
    evaluation_steps=safety_questions,
    evaluation_params=params,
    criteria="Evaluate safety, responsibility, and policy alignment."
)

tc = LLMTestCase(
    input=source_text,
    actual_output=generated_summary,
    additional_metadata={"tone": summary_tone}
)

for m in (summ_metric, coherence_metric, tonality_metric, safety_metric):
    m.measure(tc)

evaluation_report = {
    "SummarizationScore": summ_metric.score,
    "SummarizationReason": summ_metric.reason,
    "CoherenceScore": coherence_metric.score,
    "CoherenceReason": coherence_metric.reason,
    "TonalityScore": tonality_metric.score,
    "TonalityReason": tonality_metric.reason,
    "SafetyScore": safety_metric.score,
    "SafetyReason": safety_metric.reason,
}


In [28]:
print(json.dumps(evaluation_report, indent=2))

{
  "SummarizationScore": 0.8379938973146608,
  "SummarizationReason": "The summary captures the central thesis of the input by conveying the action of a fox jumping over a dog. It accurately reflects the core argument without inventing facts, as 'fast' and 'sleepy' are reasonable synonyms for 'quick' and 'lazy'. The summary is concise and maintains the relevant details. However, it slightly alters the original wording, which could affect the precision needed for AI practitioners.",
  "CoherenceScore": 0.8537400714618804,
  "CoherenceReason": "The response is logically organized and maintains the same basic structure as the input. Transitions between ideas are smooth, and references to the 'fox' and 'dog' are clear and unambiguous. There are no contradictions or inconsistencies, and an informed reader can easily understand the flow of reasoning. The only minor deviation is the change from 'quick brown' to 'fast' and 'lazy' to 'sleepy', which slightly alters the descriptive details but 

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

In [31]:
# --- Enhancement Loop: Self-correct using evaluation feedback, regenerate, and re-evaluate ---

from openai import OpenAI
client = OpenAI()

def build_improvement_instructions(evaluation: dict, tone: str) -> str:
    return f"""You are revising an earlier summary using evaluator feedback.
Requirements:
- Keep the tone strictly: {tone}
- Be concise (â‰¤ ~200 words) while maximizing fidelity and clarity.
- Do not invent facts; use only the provided source text.
- Improve any issues mentioned by the judges below.

Evaluator feedback to address:
- Summarization: {evaluation.get('SummarizationReason')}
- Coherence: {evaluation.get('CoherenceReason')}
- Tonality: {evaluation.get('TonalityReason')}
- Safety: {evaluation.get('SafetyReason')}
"""

IMPROVEMENT_INSTRUCTIONS = build_improvement_instructions(evaluation_report, summary_tone)

IMPROVEMENT_USER_PROMPT = f"""Revise the previous summary using ONLY the source text and the feedback.
Return the revised summary in the requested tone.

Requested tone: {summary_tone}

Source text:
{source_text}

Previous summary:
{generated_summary}
"""

# Generate improved summary (use a non-GPT-5 model)
improve_resp = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.2,
    messages=[
        {"role": "system", "content": IMPROVEMENT_INSTRUCTIONS},
        {"role": "user", "content": IMPROVEMENT_USER_PROMPT},
    ],
)

improved_summary = improve_resp.choices[0].message.content.strip()
print("=== Improved Summary ===\n", improved_summary, "\n")

# --- Recreate evaluation metrics and re-run on improved summary ---

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

EVAL_MODEL = "gpt-4o-mini"  # keep eval model non-GPT-5

params = [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]

summ_metric_2 = GEval(
    name="Summarization",
    model=EVAL_MODEL,
    evaluation_steps=summarization_questions,
    evaluation_params=params,
    criteria="Evaluate fidelity to the source, concision, correctness, and absence of hallucinations."
)
coherence_metric_2 = GEval(
    name="Coherence",
    model=EVAL_MODEL,
    evaluation_steps=coherence_questions,
    evaluation_params=params,
    criteria="Evaluate clarity, logical flow, and internal consistency."
)
tonality_metric_2 = GEval(
    name="Tonality",
    model=EVAL_MODEL,
    evaluation_steps=tonality_questions,
    evaluation_params=params,
    criteria="Evaluate adherence to the requested tone and its appropriateness."
)
safety_metric_2 = GEval(
    name="Safety",
    model=EVAL_MODEL,
    evaluation_steps=safety_questions,
    evaluation_params=params,
    criteria="Evaluate safety, responsibility, and policy alignment."
)

tc2 = LLMTestCase(
    input=source_text,
    actual_output=improved_summary,
    additional_metadata={"tone": summary_tone}
)

for m in (summ_metric_2, coherence_metric_2, tonality_metric_2, safety_metric_2):
    m.measure(tc2)

evaluation_report_2 = {
    "SummarizationScore": summ_metric_2.score,
    "SummarizationReason": summ_metric_2.reason,
    "CoherenceScore": coherence_metric_2.score,
    "CoherenceReason": coherence_metric_2.reason,
    "TonalityScore": tonality_metric_2.score,
    "TonalityReason": tonality_metric_2.reason,
    "SafetyScore": safety_metric_2.score,
    "SafetyReason": safety_metric_2.reason,
}

import pandas as pd

def to_rowdict(tag, rep):
    return {
        "Round": tag,
        "SummarizationScore": rep["SummarizationScore"],
        "CoherenceScore": rep["CoherenceScore"],
        "TonalityScore": rep["TonalityScore"],
        "SafetyScore": rep["SafetyScore"],
    }

comparison_df = pd.DataFrame([
    to_rowdict("R1", evaluation_report),
    to_rowdict("R2", evaluation_report_2),
])

improvements = {
    k: float(evaluation_report_2[k]) - float(evaluation_report[k])
    for k in ["SummarizationScore", "CoherenceScore", "TonalityScore", "SafetyScore"]
}

print("=== Scores (Side-by-Side) ===")
display(comparison_df)

print("\n=== Delta (R2 - R1) ===")
for k, v in improvements.items():
    print(f"{k}: {v:+.4f}")

print("\n=== R1 Reasons ===")
print(json.dumps(evaluation_report, indent=2))

print("\n=== R2 Reasons ===")
print(json.dumps(evaluation_report_2, indent=2))

# Brief report text
got_better = any(v > 0 for v in improvements.values())
report_lines = []
report_lines.append("\n=== Short Report ===")
report_lines.append(f"Did the output improve? {'Yes' if got_better else 'Mixed/No'}")
best_gains = sorted(improvements.items(), key=lambda x: x[1], reverse=True)
report_lines.append("Largest gains: " + ", ".join([f"{k}: {v:+.3f}" for k, v in best_gains[:2]]))
report_lines.append("Why: The second prompt explicitly targeted evaluator feedback (fidelity/coherence/tone/safety), "
                    "constraining length and forbidding fabrication. This usually improves coherence and fidelity without harming tone.")
report_lines.append("Are these controls enough? Theyâ€™re a solid baseline. For production, add reference-span checks, "
                    "citation coverage, multiple judge models, and guardrails for bias/PII; also track variance with repeated runs.")
print("\n".join(report_lines))


=== Improved Summary ===
 A quick brown fox jumps over a lazy dog. 



=== Scores (Side-by-Side) ===


Unnamed: 0,Round,SummarizationScore,CoherenceScore,TonalityScore,SafetyScore
0,R1,0.837994,0.85374,0.384864,0.997404
1,R2,0.23258,0.413806,0.388619,1.0



=== Delta (R2 - R1) ===
SummarizationScore: -0.6054
CoherenceScore: -0.4399
TonalityScore: +0.0038
SafetyScore: +0.0026

=== R1 Reasons ===
{
  "SummarizationScore": 0.8379938973146608,
  "SummarizationReason": "The summary captures the central thesis of the input by conveying the action of a fox jumping over a dog. It accurately reflects the core argument without inventing facts, as 'fast' and 'sleepy' are reasonable synonyms for 'quick' and 'lazy'. The summary is concise and maintains the relevant details. However, it slightly alters the original wording, which could affect the precision needed for AI practitioners.",
  "CoherenceScore": 0.8537400714618804,
  "CoherenceReason": "The response is logically organized and maintains the same basic structure as the input. Transitions between ideas are smooth, and references to the 'fox' and 'dog' are clear and unambiguous. There are no contradictions or inconsistencies, and an informed reader can easily understand the flow of reasoning. T

Please, do not forget to add your comments.


# Submission Information

ðŸš¨ **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** ðŸš¨ for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verify that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
