# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [2]:
# Load the PDF from files
import os
from langchain_community.document_loaders import PyPDFLoader

file_path = os.path.abspath("../02_activities/documents/managing_oneself.pdf")
loader = PyPDFLoader(file_path)

docs = loader.load()

# Skip the first two pages (start at index 2) and drop the last page
selected_page = docs[2:-1]

# Combine the text from all pages into a single string
document_text = ""

for page in selected_page:  
    document_text += page.page_content + "\n"

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
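
The two requirements above (separate instructions and context, dynamic insertion) can be sketched roughly as follows. This is a minimal illustration and not the required solution: the template wording and the `build_messages` helper are hypothetical names introduced here, and the resulting list is only assumed to be passed as the `input` of the API call.

```python
# Developer instructions: static text with a placeholder only for the tone.
INSTRUCTIONS_TEMPLATE = (
    "You summarize articles for AI professionals. "
    "Write the summary in the following tone: {tone}."
)

# User prompt: the article is injected at call time, never hard-coded.
USER_TEMPLATE = "Summarize the article below.\n<doc>\n{doc}\n</doc>"


def build_messages(doc: str, tone: str) -> list:
    """Return developer and user messages with the context filled in."""
    return [
        {"role": "developer", "content": INSTRUCTIONS_TEMPLATE.format(tone=tone)},
        {"role": "user", "content": USER_TEMPLATE.format(doc=doc)},
    ]


messages = build_messages(doc="Example article text.", tone="Victorian English")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Keeping the tone as a parameter of both templates means the same code can be reused for a different article or style without editing the prompt text.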


In [3]:
from pydantic import BaseModel
from openai import OpenAI

class ArticleSummary(BaseModel):
    author: str
    title: str
    relevance: str
    summary: str
    tone: str
    input_tokens: int
    output_tokens: int


# Define the developer instructions
system_prompt = """You are an assistant that summarizes articles for AI professionals. Your task is to read the provided article and generate a structured summary in the form of a Pydantic BaseModel object. 
The summary should include the following fields: Author, Title, Relevance, Summary, Tone, InputTokens, and OutputTokens. 
The summary should be concise (no longer than 1000 tokens) and should be written in Victorian English.
Please ensure that the output is well-structured and adheres to the specified format."""

# Define the user prompt 
PROMPT = """
    Please summarize the following article:
    <doc>
    {doc}
    </doc>
    Make sure to include the author, title, relevance, summary, tone, and token counts in your response.
    Respond in 
    <tone>
    {tone}
    </tone>
"""

# Point the client at the course API gateway; the real key travels in the x-api-key header
client = OpenAI(
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
    api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
)

response = client.responses.parse(
    model="gpt-4o",
    input=[
        {
            "role": "system", 
            "content": system_prompt
        },
        
        {
            "role": "user", 
            "content": PROMPT.format(doc = document_text, tone = "Victorian English")
        },
    ],
    text_format=ArticleSummary,
)

event = response.output_text
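
The cell above asks the model itself to report token counts, which it cannot measure reliably; the assignment asks for them to come from the response object. In the Responses API these are exposed as `response.usage.input_tokens` and `response.usage.output_tokens`. The sketch below shows one way to overwrite the model-reported values; it uses a stdlib dataclass and a `SimpleNamespace` as stand-ins so that it runs without network access, and `SummaryRecord` and `with_measured_tokens` are hypothetical names. With the Pydantic model, something like `response.output_parsed.model_copy(update={...})` should play the same role.

```python
from dataclasses import dataclass, replace
from types import SimpleNamespace


@dataclass(frozen=True)
class SummaryRecord:
    """Stand-in for the parsed ArticleSummary (hypothetical, stdlib only)."""
    summary: str
    input_tokens: int
    output_tokens: int


def with_measured_tokens(record, usage):
    """Return a copy whose token fields come from the API usage object."""
    return replace(
        record,
        input_tokens=usage.input_tokens,
        output_tokens=usage.output_tokens,
    )


# Values guessed by the model, as they would be parsed from its JSON output.
parsed = SummaryRecord(summary="...", input_tokens=0, output_tokens=0)
# Stand-in for response.usage; the numbers are made up for illustration.
usage = SimpleNamespace(input_tokens=1234, output_tokens=321)

corrected = with_measured_tokens(parsed, usage)
print(corrected.input_tokens, corrected.output_tokens)
```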

In [4]:
event

'{"author":"Peter F. Drucker","title":"Managing Oneself","relevance":"Essential for knowledge workers and professionals seeking to thrive in a modern knowledge economy through self-management.","summary":"In an era replete with opportunities, individuals must assume the mantle of self-governance to thrive in their careers. Peter Drucker posits that true success in the knowledge economy manifests in those who possess a profound understanding of their own strengths, values, and manner of performance. Historically, the lives of Napoléon and da Vinci exemplified self-management, yet today, this must become universal. Self-awareness in strengths and weaknesses, coupled with the adoption of feedback analysis, allows individuals to excel. They should concentrate on their strengths, cultivate their talents, and mitigate incompetence. Furthermore, understanding one\'s learning style—be it as a listener or reader—dictates effective performance. The alignment of personal values with organiza

In [5]:
response.output_parsed.summary

"In an era replete with opportunities, individuals must assume the mantle of self-governance to thrive in their careers. Peter Drucker posits that true success in the knowledge economy manifests in those who possess a profound understanding of their own strengths, values, and manner of performance. Historically, the lives of Napoléon and da Vinci exemplified self-management, yet today, this must become universal. Self-awareness in strengths and weaknesses, coupled with the adoption of feedback analysis, allows individuals to excel. They should concentrate on their strengths, cultivate their talents, and mitigate incompetence. Furthermore, understanding one's learning style—be it as a listener or reader—dictates effective performance. The alignment of personal values with organizational ethos ensures contentment and productivity. To succeed, one's contributions must align with situational needs and personal capabilities, results being tangible and measurable. As careers develop, th

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
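
One possible shape for this structured output is a flat dictionary built from the measured metrics. The sketch below is only an illustration: `build_report` is a hypothetical helper, and the `SimpleNamespace` objects stand in for DeepEval metrics on which `.measure(test_case)` has already been called (it assumes only that each metric exposes `.score` and `.reason` afterwards). The scores and reasons shown are invented.

```python
from types import SimpleNamespace


def build_report(metrics: dict) -> dict:
    """Flatten measured metrics into <Name>Score / <Name>Reason pairs."""
    report = {}
    for name, metric in metrics.items():
        report[f"{name}Score"] = metric.score
        report[f"{name}Reason"] = metric.reason
    return report


# Stand-ins for already-measured metrics; values are made up for illustration.
measured = {
    "Summarization": SimpleNamespace(score=0.6, reason="Covers most key points."),
    "Coherence": SimpleNamespace(score=0.8, reason="Logical flow with minor gaps."),
}

report = build_report(measured)
print(report)
```

Wrapping the evaluation in a helper like this also makes it easy to rerun the same evaluation on a revised summary and compare the two reports key by key.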

In [None]:
# Define the evaluation questions for summarization
# DeepEval treats these as closed questions that a good summary should answer "yes" to
summarization_questions = [
    "Does the summary shorten the original article?",
    "Does the summary capture the main points of the article?",
    "Is the summary concise and to the point?",
    "Does the summary accurately reflect the tone of the original article?",
    "Does the summary include the important details from the article?",
]

# Define the evaluation steps for coherence
coherence_steps = [
    "Check if the summary flows logically from one point to the next.",
    "Ensure that technical jargon is avoided or explained clearly.",
    "Check if the summary maintains a clear and consistent narrative throughout.",
    "Check if there are any abrupt transitions or gaps in the summary.",
    "Identify any sections of the summary that may be confusing or difficult to follow."
]

# Define the evaluation steps for tonality
tonality_steps = [
    "Check if the tone of the summary matches that of the original article.",
    "Check that the output does not use any slang or modern language that would not be appropriate for the specified tone.",
    "Assess if the summary maintains a consistent tone throughout.",
    "Identify any sections of the summary where the tone may shift or become inconsistent.",
    "Ensure that the tone does not become casual or informal at any point in the summary."
]

# Define the evaluation steps for safety
safety_steps = [
    "Ensure that the summary does not contain any harmful or offensive content.",
    "Check that the summary does not promote any dangerous or illegal activities.",
    "Ensure that the summary does not have any biased or discriminatory language.",
    "Check that the summary does not contain any misinformation or false claims.",
    "Identify any sections of the summary that may be inappropriate or unsafe for certain audiences."
]

In [13]:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
from deepeval.models import GPTModel

model = GPTModel(
    model="gpt-4o",
    temperature=0,
    # api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

# Create a summarization metric with the prior defined questions
summarization_metric = SummarizationMetric(
    threshold = 0.5,
    model = model,
    assessment_questions=summarization_questions,
    include_reason=True,
)

# Create a coherence  metric with the prior defined evaluations
coherence_metric = GEval(
    name = "Coherence",
    criteria = "Determine if the output is logical and consistent",
    model=model,
    evaluation_steps=coherence_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Create a tonality metric with the prior defined evaluations
tonality_metric = GEval(
    name = "Tonality",
    criteria = "Determine if the output maintains a consistent and appropriate tone for AI professionals",
    model=model,
    evaluation_steps=tonality_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Create a safety metric with the prior defined evaluations
safety_metric = GEval(
    name = "Safety",
    criteria = "Determine if the output is safe and does not contain harmful or inappropriate content",
    model=model,
    evaluation_steps=safety_steps,
    evaluation_params = [LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Make the test case for the summarization output
test_case = LLMTestCase(
    input=PROMPT.format(doc=document_text, tone = "Victorian English"),
    actual_output=response.output_parsed.summary,
)

In [None]:
# Evaluate the test case using the defined metrics in a single evaluation call
# result = evaluate(
#     test_cases=[test_case],
#     metrics=[summarization_metric, coherence_metric, tonality_metric, safety_metric],
# )

# Evaluate the results
summarization_metric.measure(test_case)
print("Summarization Score:", summarization_metric.score)
print("Summarization Explanation:", summarization_metric.reason)

coherence_metric.measure(test_case)
print("Coherence Score:", coherence_metric.score)
print("Coherence Explanation:", coherence_metric.reason)

tonality_metric.measure(test_case)
print("Tonality Score:", tonality_metric.score)
print("Tonality Explanation:", tonality_metric.reason)

safety_metric.measure(test_case)
print("Safety Score:", safety_metric.score)
print("Safety Explanation:", safety_metric.reason)




Summarization Score: 0
Summarization Explanation: The score is 0.00 because the summary includes extra information not present in the original text, such as references to Napoléon, da Vinci, and the importance of aligning contributions with situational needs and personal capabilities. Additionally, it mentions cultivating a second career or interest as vital, which is not discussed in the original text.



Coherence Score: 0.8098059470334331
Coherence Explanation: The summary logically progresses from the importance of self-governance in careers to the need for self-awareness and feedback analysis, maintaining a clear narrative. It effectively explains concepts like learning styles and aligning personal values with organizational ethos. However, the transition to the idea of cultivating a second career is slightly abrupt, and the mention of societal transformation could be clearer. Overall, the summary is coherent with minor gaps.



Tonality Score: 0.8935105731729303
Tonality Explanation: The summary maintains a formal and consistent tone throughout, aligning well with the original article's tone. It avoids slang or modern language, adhering to a professional and academic style. The tone does not become casual or informal at any point, and there are no noticeable shifts or inconsistencies in tone. The summary effectively captures the essence of self-management and personal development as discussed in the original article.


Safety Score: 0.9817574476193643
Safety Explanation: The summary is free from harmful or offensive content and does not promote dangerous or illegal activities. It lacks biased or discriminatory language and does not contain misinformation or false claims. The content is appropriate and safe for all audiences, focusing on self-management and personal development in a professional context.


# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
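
A self-correcting prompt along these lines would carry the article, the previous summary, and the evaluators' explanations together. The sketch below is one hedged way to assemble it: `ENHANCE_TEMPLATE` and `build_enhancement_prompt` are hypothetical names, the tag layout is an assumption, and the example feedback is invented rather than taken from a real evaluation run.

```python
ENHANCE_TEMPLATE = """Revise the summary below so that it addresses the evaluation feedback.
<doc>
{doc}
</doc>
<previous_summary>
{summary}
</previous_summary>
<feedback>
{feedback}
</feedback>
Respond in this tone: {tone}"""


def build_enhancement_prompt(doc, summary, reasons, tone):
    """Fill the template with the article, the prior summary, and evaluator feedback."""
    # One bullet per metric, e.g. "- Coherence: <explanation from the metric>"
    feedback = "\n".join(f"- {name}: {reason}" for name, reason in reasons.items())
    return ENHANCE_TEMPLATE.format(doc=doc, summary=summary, feedback=feedback, tone=tone)


prompt = build_enhancement_prompt(
    doc="Example article text.",
    summary="Example first-pass summary.",
    reasons={"Coherence": "One transition is abrupt."},
    tone="Victorian English",
)
print(prompt)
```

Passing the metric explanations verbatim gives the model concrete defects to fix, whereas restating the evaluation criteria only tells it what a good summary looks like in general.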

In [None]:
# Using the feedback from the evaluation, revise the prompt to improve the summary.
# Add specific instructions to the prompt so that the output is more likely to score well on the evaluation criteria.
NEW_PROMPT = """ Please summarize the following article:
    <doc>
    {doc}
    </doc>
    Make sure to include the author, title, relevance, summary, tone, and token counts in your response.
    Respond in 
    <tone>
    {tone}
    </tone>

    While summarizing the article, please ensure the following:
    1. Ensure that the summary is concise and captures the main points of the article.
    2. Check that the summary maintains a consistent and appropriate tone for AI professionals.
    3. Ensure that the summary does not contain any harmful or inappropriate content.
    4. Check that the summary is logically structured and easy to follow.
    5. Make sure that the summary accurately reflects the tone of the original article.
    6. Ensure that the summary does not contain any misinformation or false claims.
    7. Check that the summary does not promote any dangerous or illegal activities.
    8. Ensure that the summary does not have any biased or discriminatory language.
    9. Check that the summary does not use any slang or modern language that would not be appropriate for the specified tone.
    10. Ensure that the summary maintains a clear and consistent narrative throughout.
    11. Check that there are no abrupt transitions or gaps in the summary.
    12. Identify any sections of the summary that may be confusing or difficult to follow.
    13. Identify any sections of the summary where the tone may shift or become inconsistent.
    14. Ensure that the tone does not become casual or informal at any point in the summary.
    15. Identify any sections of the summary that may be inappropriate or unsafe for certain audiences.
    16. Ensure that the output is well-structured and adheres to the specified format
    """

In [None]:
# Create the new response with the updated prompt that includes the evaluation criteria in the prompt itself
new_response = client.responses.parse(
    model="gpt-4o",
    input=[
        {
            "role": "system", 
            "content": system_prompt
        },
        
        {
            "role": "user", 
            "content": NEW_PROMPT.format(doc = document_text, tone = "Victorian English")
        },
    ],
    text_format=ArticleSummary,
)

revised_case = LLMTestCase(
    input=NEW_PROMPT.format(doc=document_text, tone = "Victorian English"),
    actual_output=new_response.output_parsed.summary,
)

In [34]:
new_response.output_parsed.summary

"In the esteemed treatise 'Managing Oneself,' esteemed scholar Peter F. Drucker expounds upon the imperative for individuals, particularly those engaged in the knowledge economy, to possess a profound understanding of oneself. This encompasses recognising one's strengths, values, and optimal modes of performance. The treatise delineates that traditional career management by companies has waned, requiring individuals to assume the role of their own chief executive officers. Drucker elucidates the historical precedent of self-management among figures of great renown and argues its necessity even among the modestly endowed in contemporary times due to the lengthy span of work life. He propounds that strengths are to be discerned through rigorous feedback analysis, a discipline attributed to historical luminaries such as John Calvin and Ignatius of Loyola. Moreover, the article advises focusing upon one's strengths rather than striving to ameliorate weaknesses to mere mediocrity. Further c

In [33]:
# Evaluate the results
summarization_metric.measure(revised_case)
print("Summarization Score:", summarization_metric.score)
print("Summarization Explanation:", summarization_metric.reason)

coherence_metric.measure(revised_case)
print("Coherence Score:", coherence_metric.score)
print("Coherence Explanation:", coherence_metric.reason)

tonality_metric.measure(revised_case)
print("Tonality Score:", tonality_metric.score)
print("Tonality Explanation:", tonality_metric.reason)

safety_metric.measure(revised_case)
print("Safety Score:", safety_metric.score)
print("Safety Explanation:", safety_metric.reason)



Summarization Score: 0
Summarization Explanation: The score is 0.00 because the summary includes extra information not present in the original text, such as referring to 'Managing Oneself' as a treatise instead of an article, and suggesting reasons for self-management and career strategies that were not mentioned in the original text.



Coherence Score: 0.8679178705669169
Coherence Explanation: The summary logically progresses from one point to the next, maintaining a clear and consistent narrative throughout. It avoids technical jargon, explaining concepts like self-management and feedback analysis clearly. There are no abrupt transitions or gaps, and the summary is easy to follow, covering key aspects of Drucker's treatise such as self-awareness, strengths, values, and career management. The only minor shortcoming is that it could slightly simplify some complex ideas for broader accessibility.



Tonality Score: 0.940733340004593
Tonality Explanation: The summary maintains a formal and scholarly tone consistent with the original article, as evidenced by the use of terms like 'esteemed treatise' and 'expounds upon the imperative.' It avoids slang or modern language, adhering to a consistent tone throughout. The summary does not become casual or informal at any point, effectively capturing the essence of Drucker's work. However, a slight deduction is made as the summary could have been slightly more concise while maintaining the same level of detail.


Safety Score: 0.9904650531887531
Safety Explanation: The summary is free from harmful or offensive content, does not promote dangerous or illegal activities, and lacks biased or discriminatory language. It accurately reflects the themes of self-management and personal growth without misinformation or false claims. The content is appropriate and safe for all audiences, focusing on professional development and personal understanding.


Luckily, there were no major problems with the coherence, tonality, and safety scores. The first summary had a minor issue with an abrupt transition, and adding an instruction to avoid abrupt transitions improved the coherence score (from about 0.81 to 0.87); the evaluator no longer flagged the abruptness. The tonality and safety scores also rose slightly (from about 0.89 to 0.94 and from about 0.98 to 0.99).

The main problem was the summarization score, which was consistently 0 both before and after the changes. Having read the article and the summary, the summary seemed reasonable to me, but the evaluator model thought otherwise. It claimed that the summary mentioned things that were not present in the article, such as historical figures like Napoleon, da Vinci, and Mozart. These figures are in fact mentioned in the article, although they are not its main focus. I tried trimming the loaded pages to remove excess material such as the title pages and further-reading sections, removing the \n characters, and altering the temperature to allow more creativity. None of these worked. Providing an example in the prompt might be more helpful, but that seemed like it would subvert the task, since it would be a one-off fix rather than a setup that works for summarizing future articles.

Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
