# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** Please complete the sections below, stating any relevant decisions that you have made and showing the code that substantiates your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [56]:
%load_ext dotenv
%dotenv ../05_src/.secrets

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [57]:
from openai import OpenAI
import os

In [58]:
import os
print("Key loaded:", os.getenv("OPENAI_API_KEY") is not None)

Key loaded: True


## Load Document

Depending on your choice, consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (such as joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
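Some pages may extract as empty or whitespace-only strings, so it can help to inspect the pages while joining them. The sketch below assumes only that each page exposes a `page_content` string, as in the loop above. `FakePage` and `join_pages` are hypothetical names used for illustration and stand in for the loader's real output.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for a loaded page; only page_content is assumed."""
    page_content: str


def join_pages(pages, separator="\n"):
    """Join non-empty page texts and report how many pages were skipped."""
    texts = [page.page_content.strip() for page in pages]
    kept = [text for text in texts if text]
    skipped = len(texts) - len(kept)
    return separator.join(kept), skipped


sample_pages = [FakePage("First page."), FakePage("   "), FakePage("Second page.")]
joined_text, skipped = join_pages(sample_pages)
print(f"{len(joined_text)} characters, {skipped} empty page(s) skipped")
```

Joining with `str.join` also avoids the trailing separator that repeated `+=` leaves at the end of the text.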

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [59]:
from langchain_community.document_loaders import PyPDFLoader

# downloaded the GenAI divide pdf to the document folder

file_path = "../05_src/documents/The_GenAI_Divide_2025.pdf" # select pdfs relative path
loader = PyPDFLoader(file_path)
docs = loader.load()

document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

print("Document length:", len(document_text))
# document length is 53851

Document length: 53851


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant to an AI professional's professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
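One way to keep instructions and context separate is to store both as templates and fill them in at call time. This is a minimal sketch rather than the required solution: `INSTRUCTIONS_TEMPLATE`, `USER_TEMPLATE`, and `build_messages` are hypothetical names, and the `developer` role used as the default is an assumption that should be checked against the chosen model and endpoint.

```python
# Templates are stored separately from the content that fills them.
INSTRUCTIONS_TEMPLATE = (
    "You are a careful summarizer. Write the summary in the following tone: {tone}."
)
USER_TEMPLATE = "Summarize the document below.\n\n<document>\n{document}\n</document>"


def build_messages(tone, document, instructions_role="developer"):
    """Return chat messages with the tone and document injected at call time."""
    return [
        {"role": instructions_role, "content": INSTRUCTIONS_TEMPLATE.format(tone=tone)},
        {"role": "user", "content": USER_TEMPLATE.format(document=document)},
    ]


# Braces inside the document are safe: only the template itself is parsed by str.format.
messages = build_messages("Legalese", "Example document text with {braces}.")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Because the tone appears in one place only, changing it cannot leave the instructions and the user prompt out of sync.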


In [60]:
# import packages
import os
from openai import OpenAI
from pydantic import BaseModel, Field

# set up structure for output
class SummaryOutput(BaseModel):
    author: str = Field(description="the author of the article")
    title: str = Field(description="the title of the article")
    relevance: str = Field(description="a short paragraph for why this is relevant to an AI professional")
    summary: str = Field(description="a concise summary of the article")
    tone: str = Field(description="the tone used for the summary")
    input_tokens: int = Field(description="the number of input tokens used")
    output_tokens: int = Field(description="the number of output tokens used")


client = OpenAI(
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
    api_key='any value', 
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')}
)

# set instructions and the user prompt template; the tone and document are filled in dynamically
chosen_tone = "Formal Academic Writing"
instructions = f"You are a researcher at a prestigious university. Summarize the text using {chosen_tone}. Return the response in JSON format."
user_prompt_template = "Summarize the text below in a '{tone}' tone. Include author, title, and relevance (why it's useful for AI professionals). Keep the summary under 1000 tokens.\n\n{document_content}"


completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": instructions},
        {"role": "user", "content": user_prompt_template.format(tone=chosen_tone, document_content=document_text)}
    ],
    response_format=SummaryOutput,
)

result = completion.choices[0].message.parsed

# overwrite the model-generated token counts with the actual usage reported in the response
result.input_tokens = completion.usage.prompt_tokens
result.output_tokens = completion.usage.completion_tokens

# structured Output
print(f"Author: {result.author}")
print(f"Title: {result.title}")
print(f"Tone: {result.tone}")
print(f"Relevance: {result.relevance}")
print(f"Summary:\n{result.summary}")
print(f"Input Tokens: {result.input_tokens}")
print(f"Output Tokens: {result.output_tokens}")

Author: MIT NANDA, Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari
Title: The GenAI Divide: State of AI in Business 2025
Tone: Formal Academic Writing
Relevance: This report is particularly relevant to AI professionals as it examines the disparity in AI implementation effectiveness across various businesses and sectors, highlighting critical factors influencing the successful integration and scaling of Generative AI solutions. It provides insights into learning-capable systems and their potential for driving tangible business outcomes, which are essential concepts for shaping AI strategies.
Summary:
The report details findings from Project NANDA, revealing a significant disparity in the effectiveness of Generative AI (GenAI) implementations, termed the 'GenAI Divide,' where 95% of organizations yield no return on substantial investments estimated at $30-40 billion. It identifies that while tools like ChatGPT see widespread adoption and high pilot rates, they often fail 

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
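The flat key-value structure requested above could be assembled along these lines. The sketch assumes that each evaluated metric yields a name, a score, and a reason. `MetricResult` and `build_report` are hypothetical stand-ins, and the scores and reasons are placeholders rather than real results.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical stand-in for an evaluated metric."""
    name: str
    score: float
    reason: str


def build_report(results):
    """Flatten metric results into <Name>Score / <Name>Reason pairs."""
    report = {}
    for result in results:
        key = result.name.replace(" ", "")
        report[f"{key}Score"] = round(result.score, 3)
        report[f"{key}Reason"] = result.reason
    return report


placeholder_results = [
    MetricResult("Summarization", 0.5, "placeholder reason"),
    MetricResult("Coherence", 0.75, "placeholder reason"),
]
report = build_report(placeholder_results)
for key, value in report.items():
    print(f"{key}: {value}")
```

A plain dictionary like this is easy to serialize or to pass back into a later prompt as feedback.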

In [61]:
import os
from deepeval import evaluate
from deepeval.models import GPTModel
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

custom_model = GPTModel(
    model="gpt-4o-mini",
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
    _openai_api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')}
)


test_case = LLMTestCase(
    input=document_text,
    actual_output=result.summary
)


sum_metric = SummarizationMetric(
    threshold=0.5,
    model=custom_model,
    assessment_questions=[
        "Does the summary accurately reflect the core findings of the report?",
        "Does the summary omit any major conclusions?",
        "Is the summary factually consistent with the source document?",
        "Does the summary avoid introducing new unsupported claims?",
        "Is the summary concise without losing essential meaning?"
    ]
)


coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence: Logical flow, Clear structure, Smooth transitions, Readable, Human-like",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=custom_model
)


tonality_metric = GEval(
    name="Tonality",
    criteria="Tonality: Formal tone, No slang, Professional, Consistent, Objective",
    evaluation_steps=[
        "Check for formal, academic vocabulary.",
        "Check if the structure would be appropriate for an academic style publication",
        "Penalize modern slang and casual tone"
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=custom_model
)


safety_metric = GEval(
    name="Safety",
    criteria="Safety: No bias, No misinformation, Neutral tone, Ethical, Safe for work",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=custom_model
)


evaluate(
    test_cases=[test_case], 
    metrics=[sum_metric, coherence_metric, tonality_metric, safety_metric]
)




Metrics Summary

  - ✅ Summarization (score: 0.6666666666666666, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.67 because the summary contains contradictions to the original text regarding the return on investment in GenAI and includes extra information that was not present in the original text, such as specific project names and details about the effectiveness of Generative AI tools., error: None)
  - ✅ Coherence [GEval] (score: 0.7947840067492762, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The output presents a logical flow of ideas, starting with the introduction of the 'GenAI Divide' and moving through specific findings and patterns. It has a clear structure, with a well-defined body that discusses the key barriers and implications. However, the conclusion could be more explicit in summarizing the main points, which would enhance the overall readability and engagement for the audience., error: None)
  - ✅ Ton

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Summarization', threshold=0.5, success=True, score=0.6666666666666666, reason='The score is 0.67 because the summary contains contradictions to the original text regarding the return on investment in GenAI and includes extra information that was not present in the original text, such as specific project names and details about the effectiveness of Generative AI tools.', strict_mode=False, evaluation_model='gpt-4o-mini', error=None, evaluation_cost=0.0046618499999999995, verbose_logs='Truths (limit=None):\n[\n    "The report is titled \'The GenAI Divide: State of AI in Business 2025\'.",\n    "The report was produced by MIT NANDA.",\n    "The research period for the report was from January to June 2025.",\n    "The report is based on a multi-method research design.",\n    "The research included a systematic review of over 300 publicly disclosed AI initiatives.",\n    "The research 

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
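To report whether the enhancement helped, the two evaluation runs can be compared metric by metric. This is a minimal sketch, assuming the scores have been collected into plain dictionaries keyed by metric name. `compare_scores` is a hypothetical helper, the tolerance value is an arbitrary choice, and the numbers are placeholders.

```python
def compare_scores(before, after, tolerance=0.01):
    """Label each metric as improved, worse, or unchanged within a tolerance."""
    comparison = {}
    for name in before:
        delta = after[name] - before[name]
        if delta > tolerance:
            verdict = "improved"
        elif delta < -tolerance:
            verdict = "worse"
        else:
            verdict = "unchanged"
        comparison[name] = (round(delta, 3), verdict)
    return comparison


# Placeholder scores for illustration only.
before = {"Summarization": 0.50, "Coherence": 0.80}
after = {"Summarization": 0.75, "Coherence": 0.795}
comparison = compare_scores(before, after)
for name, (delta, verdict) in comparison.items():
    print(f"{name}: {delta:+.3f} ({verdict})")
```

The tolerance is there because scores from an LLM judge can shift slightly between runs, so very small differences are better read as noise than as real change.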

In [62]:

feedback = f"""
The prior summary was sufficient, but please address these points and look for room for improvement:
- Tonality Score: {tonality_metric.score} (Reason: {tonality_metric.reason})
- Summarization Score: {sum_metric.score} (Reason: {sum_metric.reason})
"""


enhancement_instructions = f"""
You are an expert editor at a prestigious journal. You will be given a text, a draft summary, and feedback from an AI judge.
Rewrite the summary to address the feedback.
Ensure the tone is a perfect example of {chosen_tone}.
"""

# save tokens by sending only the first 2000 characters of the document
source_excerpt = document_text[:2000]

enhancement_user_prompt = f"""
Original Text: {source_excerpt}
Draft Summary: {result.summary}
Judge Feedback: {feedback}

Enhance the previous summary and fix any issues.
"""


enhanced_completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": enhancement_instructions},
        {"role": "user", "content": enhancement_user_prompt}
    ],
    response_format=SummaryOutput,
)

enhanced_result = enhanced_completion.choices[0].message.parsed
print(f"New Summary:\n{enhanced_result.summary}")

New Summary:
This report from Project NANDA investigates the significant disparity in Generative AI (GenAI) implementations, referred to as the 'GenAI Divide,' which finds that 95% of organizations experience no return on substantial investments, estimated at $30-40 billion. Although tools such as ChatGPT have seen widespread adoption and high pilot rates, they often fail to create discernible impacts on profit and loss. The findings uncover four noteworthy patterns: limited disruption across various industries; a paradox where larger enterprises initiate more pilots yet achieve less scalability; an investment bias that favors visible functionalities over those that yield higher return on investment; and a trend favoring successful external partnerships over internal project builds. The primary obstacle to scaling GenAI solutions is attributed to the learning capacity of the systems rather than the barriers of infrastructure or regulation, as many current solutions lack essential conte

In [63]:
enhanced_test_case = LLMTestCase(
    input=document_text,
    actual_output=enhanced_result.summary
)


evaluate(
    test_cases=[enhanced_test_case], 
    metrics=[sum_metric, coherence_metric, tonality_metric, safety_metric]
)




Metrics Summary

  - ✅ Summarization (score: 0.7692307692307693, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.77 because the summary contains a contradiction regarding the report's title and includes extra information not found in the original text, which affects its accuracy and completeness., error: None)
  - ✅ Coherence [GEval] (score: 0.7849198261364884, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The output presents a logical flow of ideas, starting with the introduction of the 'GenAI Divide' and moving through the findings in a coherent manner. It has a clear structure, with a well-defined body that discusses the patterns observed and concludes with the implications for organizations. However, the transitions between some sentences could be smoother to enhance readability, and while the content is engaging, it could benefit from a more explicit conclusion summarizing the key takeaways., error: None)
  - ✅ T

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Summarization', threshold=0.5, success=True, score=0.7692307692307693, reason="The score is 0.77 because the summary contains a contradiction regarding the report's title and includes extra information not found in the original text, which affects its accuracy and completeness.", strict_mode=False, evaluation_model='gpt-4o-mini', error=None, evaluation_cost=0.0045093, verbose_logs='Truths (limit=None):\n[\n    "The report is titled \'The GenAI Divide: State of AI in Business 2025\'.",\n    "The authors of the report include Aditya Challapally, Chris Pease, Ramesh Raskar, and Pradyumna Chari.",\n    "The research period for the report was from January to June 2025.",\n    "The report is based on a multi-method research design that includes a systematic review of over 300 publicly disclosed AI initiatives.",\n    "Structured interviews were conducted with representatives from 52 org

Please do not forget to add your comments.

The enhancement improved the summary in the area where the original was weakest. Tonality was already strong the first time (0.895 to 0.897), so it had little room for improvement, whereas Summarization rose from 0.667 to 0.769. Coherence was essentially unchanged (0.795 to 0.785). However, both evaluations still report that the summary contains contradictions and confabulations. I think the enhancement step allows the AI to reflect on the judge's feedback and incorporate it into the next attempt. These controls are not enough on their own: the output should go through further checks and human evaluation before being used for anything important, in case the AI is incorrect in its summary or its evaluation.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
