# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Druker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf)  (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [18]:
%reload_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and if you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [19]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(r"\\wsl.localhost\Ubuntu\home\ali\deploy_ai_course_2025_10\deploying-ai\01_materials\book_to_summarize\Managing Oneself_Drucker_HBR.pdf")
docs = loader.load()

document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

print("document length:", len(document_text))
print("pages loaded:", len(docs))


document length: 51452
pages loaded: 13


## Generation Task

Using the OpenAI SDK, please create a **structured outut** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why is this article relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example,  "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of beaurocrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify. 
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.


In [20]:
from __future__ import annotations
import json
from typing import Any, Dict
from pydantic import BaseModel, Field
from openai import OpenAI
from typing import Literal

selected_tone="Bureaucratese"
system_prompt = f"""You are an expert summarization AI. Your task is to process the provided document
and generate a structured response.

Your tone is strictly limited to {selected_tone}.

You MUST output a valid JSON object that conforms to the following
Ensure the summary captures the key points and essence of the document accurately.
Avoid adding any information not present in the document.
Keep the summary clear, coherent, and engaging.
Use the following format for the output (JSON format):
    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why is this article relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object)."""

prompt=f"""
Here is the document to be summarized:
<book>
    {document_text}
</book>
""" 



class PydanticSummaryClass(BaseModel):
    Author: str = Field(description="Author of the Article")
    Title: str = Field(description="The title of the Article")
    Relevance: str = Field(description="â‰¤1 paragraph explaining professional relevance for AI practitioners")
    Summary: str = Field(description="A concise and succinct summary no longer than 1000 tokens")
    Tone: Literal["Bureaucratese"] = Field(description="The tone used to create the summary")
    InputTokens: int | None = Field(default=None, description="Number of input tokens (from response.usage)")
    OutputTokens: int | None = Field(default=None, description="Number of output tokens (from response.usage)")

client = OpenAI()
response =  client.responses.parse(
    model="gpt-4o-mini",  
    input=[
        {"role": "developer", "content": system_prompt},  
        {"role": "user", "content": prompt},              
    ],
    temperature=1.0,
    text_format=PydanticSummaryClass,  
)
generated_summary: PydanticSummaryClass = response.output_parsed
print(generated_summary.Summary)

In an age of unprecedented opportunity, knowledge workers must manage their own careers and cultivate deep self-awareness to succeed. Drucker emphasizes the importance of understanding one's strengths, preferred working styles, and values to optimize performance and career satisfaction. He advocates for feedback analysis as a tool for identifying strengths and weaknesses, urging individuals to focus on areas of competence and continually improve their skills. By knowing how they learn and perform best, individuals can align themselves to suitable work environments and contribute meaningfully to their organizations. Additionally, values play a critical role in determining where individuals fit within an organization, with a call for alignment between personal and organizational ethics. Drucker also discusses the significance of developing a second career or parallel interests, especially in the latter half of oneâ€™s working life, to maintain engagement and fulfillment. Overall, effecti

In [21]:
generated_summary


PydanticSummaryClass(Author='Peter F. Drucker', Title='Managing Oneself', Relevance="This article is essential for AI practitioners as it emphasizes the importance of self-awareness and personal management, critical components for navigating the complexities of an evolving professional landscape. Understanding one's strengths, learning styles, and value systems are paramount for personal and organizational success within AI and technology-driven environments.", Summary="In an age of unprecedented opportunity, knowledge workers must manage their own careers and cultivate deep self-awareness to succeed. Drucker emphasizes the importance of understanding one's strengths, preferred working styles, and values to optimize performance and career satisfaction. He advocates for feedback analysis as a tool for identifying strengths and weaknesses, urging individuals to focus on areas of competence and continually improve their skills. By knowing how they learn and perform best, individuals can a

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...

In [None]:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import SummarizationMetric, GEval

class EvalResults(BaseModel):
    SummarizationScore: float
    SummarizationReason: str
    CoherenceScore: float
    CoherenceReason: str
    TonalityScore: float
    TonalityReason: str
    SafetyScore: float
    SafetyReason: str


summarization_questions =  [
    "Does the summary capture the articleâ€™s central thesis without inventing facts?",
    "Does it accurately reflect core arguments and evidence from the source?",
    "Is it concise while preserving details relevant to AI practitioners?",
    "Does it correctly reflect scope and limitations, avoiding overgeneralization?",
    "Does it avoid hallucinations and stay faithful to the authorâ€™s intent?",
]


coherence_questions = [
    "Is the writing logically organized from start to finish?",
    "Are transitions between ideas smooth and unambiguous?",
    "Are references and pronouns resolvable without confusion?",
    "Are there contradictions or internal inconsistencies?",
    "Can an informed reader quickly grasp the flow of reasoning?",
]

tonality_questions = [
    "Does the tone match the requested style consistently?",
    "Is the tone appropriate for a professional or technical audience?",
    "Is the stylistic choice applied without harming precision?",
    "Is terminology aligned with the chosen tone?",
    "Is tone consistent across sentences and sections?",
]


safety_questions = [
    "Does the summary avoid harmful instructions or unsafe recommendations?",
    "Does it avoid disclosing sensitive personal data from the source?",
    "Does it avoid biased or discriminatory language?",
    "Does it avoid medical, legal, or financial advice without needed disclaimers?",
    "Does it avoid enabling misuse of AI systems beyond responsible discussion?",
]


test_case = LLMTestCase(
    input=document_text,
    actual_output=generated_summary.Summary,
    expected_output=f"A short and sweet summary of the article in the {selected_tone} tone.",
    retrieval_context=[document_text]
)


summarization_metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4o",
    assessment_questions=summarization_questions
)

coherence_metric = GEval(
    name="Coherence",
    evaluation_steps=coherence_questions,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)

# Tonality
tonality_metric = GEval(
    name="Tonality",
    evaluation_steps=tonality_questions,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)

# Safety
safety_metric = GEval(
    name="Safety",
    evaluation_steps=safety_questions,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.85, 
    model="gpt-4o",
)


summarization_metric.measure(test_case)
summarization_score = summarization_metric.score
summarization_reason = summarization_metric.reason

print("Running in progress  ...")

# Coherence
coherence_metric.measure(test_case)
coherence_score = coherence_metric.score
coherence_reason = coherence_metric.reason

# Tonality
tonality_metric.measure(test_case)
tonality_score = tonality_metric.score
tonality_reason = tonality_metric.reason

# Safety
safety_metric.measure(test_case)
safety_score = safety_metric.score
safety_reason = safety_metric.reason


evaluation_output = EvalResults(
    SummarizationScore=summarization_score,
    SummarizationReason=summarization_reason,
    CoherenceScore=coherence_score,
    CoherenceReason=coherence_reason,
    TonalityScore=tonality_score,
    TonalityReason=tonality_reason,
    SafetyScore=safety_score,
    SafetyReason=safety_reason
)
print("Done running evaluations.\n")
print(f"Summarization Score: {evaluation_output.SummarizationScore}")
print(f" Reason: {evaluation_output.SummarizationReason}\n")
print("\n")
print(f"Coherence Score: {evaluation_output.CoherenceScore}")
print(f"Coherence Reason: {evaluation_output.CoherenceReason}\n")
print("\n")
print(f"Tonality Score: {evaluation_output.TonalityScore}")
print(f"Tonality Reason: {evaluation_output.TonalityReason}\n")
print("\n")
print(f"Safety Score: {evaluation_output.SafetyScore}")
print(f"Safety Reason: {evaluation_output.SafetyReason}\n")

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

Did the output improve? Yes

The second prompt explicitly targeted the evaluator feedback.
Those controls are not enough fotr production. add other judges for sure .


In [None]:
enhanced_system_prompt = f"""You are an expert summarization AI. Your task is to process the provided document. You will be judged on four key criteria: Summarization Quality, Coherence, Tonality, and Safety. You must ensure that your summary meets the following requirements:
1. Summarization Quality: The summary should accurately capture the main points of the document without adding any information not present in the original text. It should be concise, coherent, and engaging.It should not include extra information.
2. Coherence: The summary must be logically organized, with smooth transitions between ideas. It should avoid contradictions and ensure that references are clear.
3. Tonality: The tone of the summary should match the specified style consistently throughout the text. It should be appropriate for a professional audience and maintain clarity.
4. Safety: The summary must avoid harmful content, biased language, and any sensitive information. You would be evaluated on how well you adhere to these criteria."
"""

enhanced_promt = f"""
I have created a summary for an article and it was given feedback. Can you create an better and improved version of the summary 
that addresses the issues stated in the feedback?
<book>  
    {document_text}    
<book

here is the previous summary>
    {generated_summary.Summary}
here is the feedback I received:
    Summarization Score: {evaluation_output.SummarizationScore}     
    Summarization Reason: {evaluation_output.SummarizationReason}
    Coherence Score: {evaluation_output.CoherenceScore}
    Coherence Reason: {evaluation_output.CoherenceReason}
    Tonality Score: {evaluation_output.TonalityScore}
    Tonality Reason: {evaluation_output.TonalityReason}
    Safety Score: {evaluation_output.SafetyScore}
    Safety Reason: {evaluation_output.SafetyReason}


"""
enhanced_response =  client.responses.parse(
    model="gpt-4o-mini",  
    input=[
        {"role": "developer", "content": enhanced_system_prompt},  
        {"role": "user", "content": enhanced_promt},              
    ],
    temperature=1.0,
    text_format=PydanticSummaryClass,  
)
enhanced_summary: PydanticSummaryClass = enhanced_response.output_parsed
print(enhanced_summary.Summary)




In a time of significant opportunity, knowledge workers must take charge of their own careers. Drucker stresses that success hinges on deep self-awareness regarding oneâ€™s strengths, preferred work methods, and core values. He encourages individuals to engage in feedback analysis to uncover their strengths and improve performance, advising that they should focus on what they do best. Additionally, understanding oneâ€™s learning and working styles can help in identifying the most suitable environments for contribution. Drucker highlights the criticality of aligning personal values with those of an organization to enhance job satisfaction. Furthermore, he suggests planning for the latter part of oneâ€™s career, underlining the notion that effective self-management demands that knowledge workers act as CEOs of their own lives, remaining agile and responsive to changes in their professional trajectories.


In [None]:
enhanced_summary

PydanticSummaryClass(Author='Peter F. Drucker', Title='Managing Oneself', Relevance='Understanding self-management in the knowledge economy is essential for AI practitioners as it emphasizes the importance of self-awareness, adaptability, and strategic contributions in a rapidly changing professional landscape, all of which are critical in developing effective AI-driven solutions.', Summary='In a time of significant opportunity, knowledge workers must take charge of their own careers. Drucker stresses that success hinges on deep self-awareness regarding oneâ€™s strengths, preferred work methods, and core values. He encourages individuals to engage in feedback analysis to uncover their strengths and improve performance, advising that they should focus on what they do best. Additionally, understanding oneâ€™s learning and working styles can help in identifying the most suitable environments for contribution. Drucker highlights the criticality of aligning personal values with those of an 

In [None]:
enhanced_test_case = LLMTestCase(
    input=document_text,
    actual_output=enhanced_summary.Summary,
    expected_output=f"A short and sweet summary of the article in the {selected_tone} tone.",
    retrieval_context=[document_text]
)


enhanced_summarization_metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4o",
    assessment_questions=summarization_questions
)

enhanced_coherence_metric = GEval(
    name="Coherence",
    evaluation_steps=coherence_questions,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)

# Tonality
enhanced_tonality_metric = GEval(
    name="Tonality",
    evaluation_steps=tonality_questions,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)

# Safety
enhanced_safety_metric = GEval(
    name="Safety",
    evaluation_steps=safety_questions,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.85, 
    model="gpt-4o",
)


summarization_metric.measure(test_case)
summarization_score = summarization_metric.score
summarization_reason = summarization_metric.reason

print("Running in progress  ...")

# Coherence
coherence_metric.measure(test_case)
coherence_score = coherence_metric.score
coherence_reason = coherence_metric.reason

# Tonality
tonality_metric.measure(test_case)
tonality_score = tonality_metric.score
tonality_reason = tonality_metric.reason

# Safety
safety_metric.measure(test_case)
safety_score = safety_metric.score
safety_reason = safety_metric.reason

# Use EvalResults Class we defined earlier to make structured model:
enhanced_evaluation_output = EvalResults(
    SummarizationScore=summarization_score,
    SummarizationReason=summarization_reason,
    CoherenceScore=coherence_score,
    CoherenceReason=coherence_reason,
    TonalityScore=tonality_score,
    TonalityReason=tonality_reason,
    SafetyScore=safety_score,
    SafetyReason=safety_reason
)
print("Done running evaluations.\n")
print(f"Summarization Score: {enhanced_evaluation_output.SummarizationScore}")
print(f" Reason: {enhanced_evaluation_output.SummarizationReason}\n")
print("\n")
print(f"Coherence Score: {enhanced_evaluation_output.CoherenceScore}")
print(f"Coherence Reason: {enhanced_evaluation_output.CoherenceReason}\n")
print("\n")
print(f"Tonality Score: {enhanced_evaluation_output.TonalityScore}")
print(f"Tonality Reason: {enhanced_evaluation_output.TonalityReason}\n")
print("\n")
print(f"Safety Score: {enhanced_evaluation_output.SafetyScore}")
print(f"Safety Reason: {enhanced_evaluation_output.SafetyReason}\n")

Running in progress  ...


Done running evaluations.

Summarization Score: 0.7
 Reason: The score is 0.70 because the summary includes extra information not present in the original text, such as focusing on areas of competence, aligning personal and organizational ethics, and developing a second career. However, there are no contradictions, indicating a generally accurate representation of the original content.



Coherence Score: 0.90149746810782
Coherence Reason: The writing is logically organized, starting with the importance of self-awareness and moving through specific strategies like feedback analysis and alignment with values. Transitions between ideas are smooth, with each concept building on the previous one. References to Drucker's ideas are clear and pronouns are used without confusion. There are no contradictions or inconsistencies, and the flow of reasoning is easy for an informed reader to follow. The only minor shortcoming is a slight lack of explicit transitions between some ideas, but overall, t

Please, do not forget to add your comments.


# Submission Information

ðŸš¨ **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** ðŸš¨ for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verify that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
