# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
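
The extracted pages may also carry page markers (for example `pg. 1`) and runs of whitespace-only lines. The following is a minimal sketch of one possible clean-up step before joining, assuming each page exposes a `page_content` string as LangChain documents do. `Page` is a stand-in class, and `clean_page` and `join_pages` are hypothetical helpers, not part of LangChain.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str


# Matches a line that holds only a page marker such as "pg. 12".
PAGE_MARKER = re.compile(r"^\s*pg\.\s*\d+\s*$", re.IGNORECASE)


def clean_page(text: str) -> str:
    """Drop page-marker lines and whitespace-only lines, and trim the rest."""
    kept = []
    for line in text.splitlines():
        if not line.strip() or PAGE_MARKER.match(line):
            continue
        kept.append(line.strip())
    return "\n".join(kept)


def join_pages(pages) -> str:
    """Clean every page and join the non-empty results with a newline."""
    cleaned = (clean_page(page.page_content) for page in pages)
    return "\n".join(text for text in cleaned if text)


# Two short stand-in pages that imitate the layout of an extracted PDF.
docs = [
    Page("pg. 1 \n \n \nThe GenAI Divide  \nSTATE OF AI IN \nBUSINESS 2025 \n"),
    Page("pg. 2 \n \nNOTES \n"),
]
document_text = join_pages(docs)
print(document_text)
```

Dropping blank lines also removes paragraph breaks, so whether this trade-off is acceptable depends on how the text will be used downstream.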

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this loader to load web pages.

In [2]:
from langchain_community.document_loaders import PyPDFLoader


file_path = "https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf"
#file_path = "https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

26


In [3]:
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)

pg. 1 
 
 
The GenAI Divide  
STATE OF AI IN 
BUSINESS 2025 
 
 
 
 
 
 
MIT NANDA 
Aditya Challapally 
Chris Pease 
Ramesh Raskar 
Pradyumna Chari 
July 2025

{'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2025-07-13T21:18:19-07:00', 'msip_label_87867195-f2b8-4ac2-b0b6-6bb73cb33afc_siteid': '72f988bf-86f1-41af-91ab-2d7cd011db47', 'msip_label_87867195-f2b8-4ac2-b0b6-6bb73cb33afc_method': 'Privileged', 'msip_label_87867195-f2b8-4ac2-b0b6-6bb73cb33afc_enabled': 'True', 'author': 'Aditya Challapally', 'moddate': '2025-07-13T21:18:19-07:00', 'source': 'https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf', 'total_pages': 26, 'page': 0, 'page_label': '1'}


In [4]:
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

In [5]:
print(document_text)

pg. 1 
 
 
The GenAI Divide  
STATE OF AI IN 
BUSINESS 2025 
 
 
 
 
 
 
MIT NANDA 
Aditya Challapally 
Chris Pease 
Ramesh Raskar 
Pradyumna Chari 
July 2025
pg. 2 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
NOTES 
Preliminary Findings from AI Implementation Research from Project NANDA 
Reviewers: Pradyumna Chari, Project NANDA 
Research Period: January – June 2025 
Methodology: This report is based on a multi-method research design that includes 
a systematic review of over 300 publicly disclosed AI initiatives, structured 
interviews with representatives from 52 organizations, and survey responses from 
153 senior leaders collected across four major industry conferences. 
 Disclaimer: The views expressed in this report are solely those of the authors and 
reviewers and do not reflect the positions of any affiliated employers. 
 Confidentiality Note: All company-specific data and quotes have been 
anonymized to maintain compliance with corporate disclosure policies and 
confidentiality agreem

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
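
The requirements above can be sketched as follows. This is only an illustration under stated assumptions: it assumes Pydantic v2, and the names `SummaryDraft`, `ArticleSummary`, `INSTRUCTIONS_TEMPLATE`, `USER_TEMPLATE`, `build_request`, and `finalize` are hypothetical. The token counts are kept out of the model-generated fields because the model cannot know them; they are meant to be copied from the response object afterwards. No API call is made in the block itself.

```python
from pydantic import BaseModel


class SummaryDraft(BaseModel):
    """Fields the model is asked to generate."""
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str


class ArticleSummary(SummaryDraft):
    """Final object: the draft plus token counts read from the response."""
    InputTokens: int
    OutputTokens: int


# Developer (instructions) prompt: stable rules, with the tone injected.
INSTRUCTIONS_TEMPLATE = (
    "You are a researcher. Summarize the document supplied by the user in the "
    "tone '{tone}'. Keep the summary under 1000 tokens and the relevance "
    "statement to a single paragraph."
)

# User prompt: only the context, added dynamically.
USER_TEMPLATE = "<document>\n{document_text}\n</document>"


def build_request(document_text: str, tone: str, model: str = "gpt-4o") -> dict:
    """Assemble keyword arguments for a Responses API call."""
    return {
        "model": model,
        "instructions": INSTRUCTIONS_TEMPLATE.format(tone=tone),
        "input": USER_TEMPLATE.format(document_text=document_text),
    }


def finalize(draft: SummaryDraft, input_tokens: int, output_tokens: int) -> ArticleSummary:
    """Combine the generated fields with usage numbers from the response."""
    return ArticleSummary(
        **draft.model_dump(),
        InputTokens=input_tokens,
        OutputTokens=output_tokens,
    )


request = build_request("Example document text.", tone="Formal Academic Writing")
print(request["instructions"])
```

With a real client, and assuming a recent version of the OpenAI Python SDK, the request would likely be sent as `client.responses.parse(**request, text_format=SummaryDraft)`, and the final object built with `finalize(response.output_parsed, response.usage.input_tokens, response.usage.output_tokens)`. Please check these names against the SDK version you have installed.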


In [6]:
prompt = f"""
    You are a researcher. 
    Given the following context from a document, do the following:
    
    1. Identify the document's title and author.
    2. Produce a Statement of Relevance, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    3. Produce a concise and succinct summary no longer than 1000 tokens. (The tone used to produce the summary should be "Formal Academic Writing")
    4. Count the number of input tokens (obtain this from the response object).
    5. Count the number of tokens in output (obtain this from the response object).
        
    The document is the following: 
    <document>
    {document_text}
    </document>

    Provide your response in the following format:
    Title: <title>
    Author: <author>
    Statement of Relevance: <statement_of_relevance>
    Summary: <summary>
    InputTokens: <number_of_input_tokens>
    OutputTokens: <number_of_output_tokens>
"""

In [7]:
from openai import OpenAI
client = OpenAI()

In [8]:
response = client.responses.create(
    model = 'gpt-4o',
    input = prompt,
)


In [9]:
print(response.output_text)

Title: The GenAI Divide: State of AI in Business 2025

Author: MIT NANDA, Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari

Statement of Relevance: This report is highly relevant for AI professionals as it provides a comprehensive analysis of the current state of AI implementation within businesses. It highlights the critical challenges faced in realizing transformational benefits from AI technologies, thus offering valuable insights into the systemic issues that need addressing. Understanding the "GenAI Divide" equips AI professionals with knowledge about the prevalent gaps between AI adoption and business integration, enabling them to align their strategies with successful practices and avoid common pitfalls. Emphasizing the importance of adaptable, learning-capable systems, the document guides AI professionals on how to deliver tangible business value through tailored solutions and strategic partnerships.

Summary: The report titled "The GenAI Divide: State of AI in B

In [10]:
print(response.to_json())

{
  "id": "resp_0b1314cebf68fb09006901488817e88196b3428710d77c6df0",
  "created_at": 1761691784.0,
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "metadata": {},
  "model": "gpt-4o-2024-08-06",
  "object": "response",
  "output": [
    {
      "id": "msg_0b1314cebf68fb090069014889691c8196960e2c0dfe0f5bb8",
      "content": [
        {
          "annotations": [],
          "text": "Title: The GenAI Divide: State of AI in Business 2025\n\nAuthor: MIT NANDA, Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari\n\nStatement of Relevance: This report is highly relevant for AI professionals as it provides a comprehensive analysis of the current state of AI implementation within businesses. It highlights the critical challenges faced in realizing transformational benefits from AI technologies, thus offering valuable insights into the systemic issues that need addressing. Understanding the \"GenAI Divide\" equips AI professionals with knowledge about the p

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
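
The structured output described above could be assembled along these lines. This is a minimal sketch, assuming each metric sets `score` and `reason` attributes after `measure(test_case)` is called. `FakeMetric` is a stand-in used so that the example runs without any API calls, and `run_evaluation` is a hypothetical helper; the scores and reasons are placeholders, not real results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeMetric:
    """Stand-in for an evaluation metric: measure() sets score and reason."""
    fixed_score: float
    fixed_reason: str
    score: Optional[float] = None
    reason: Optional[str] = None

    def measure(self, test_case) -> None:
        # A real metric would call an LLM here; this one returns fixed values.
        self.score = self.fixed_score
        self.reason = self.fixed_reason


def run_evaluation(metrics: dict, test_case) -> dict:
    """Run every metric and return flat <Name>Score / <Name>Reason pairs."""
    report = {}
    for name, metric in metrics.items():
        metric.measure(test_case)
        report[f"{name}Score"] = metric.score
        report[f"{name}Reason"] = metric.reason
    return report


# Placeholder metrics; in the notebook these would be the configured metrics.
metrics = {
    "Summarization": FakeMetric(0.5, "placeholder reason"),
    "Coherence": FakeMetric(0.75, "placeholder reason"),
}
report = run_evaluation(metrics, test_case=None)
print(sorted(report))
```

Because the same function takes any dictionary of metrics and any test case, it can be reused unchanged when a revised summary needs to be evaluated.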

In [11]:
#get the summary from response
start_word = "Summary: "
end_word = "InputTokens"

# Split around start and end
summary = response.output_text.split(start_word)[1].split(end_word)[0].strip()
print(summary)

The report titled "The GenAI Divide: State of AI in Business 2025" authored by MIT NANDA, Aditya Challapally, Chris Pease, Ramesh Raskar, and Pradyumna Chari, reveals a stark division in the successful adoption of generative AI (GenAI) tools across industries. Despite substantial investments ranging between $30–40 billion, only a small fraction of enterprises achieve meaningful returns from their AI initiatives, a phenomenon termed the "GenAI Divide." The report identifies critical barriers such as insufficient learning capabilities, poor integration with existing workflows, and a lack of customization as primary obstacles to successful AI deployment.

The analysis draws upon extensive research, including a review of over 300 AI initiatives, structured interviews with 52 organizations, and surveys of 153 senior leaders. The report highlights the significant disparities in AI adoption and transformation across sectors, noting that only the Technology and Media industries exhibit signifi

In [12]:
#get summarization metric
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric
...

test_case = LLMTestCase(input=document_text, actual_output=summary)
metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4o",
    assessment_questions=[
        "Is the coverage score based on a percentage of 'yes' answers?",
        "Does the score ensure the summary's accuracy with the source?",
        "Does a higher score mean a more comprehensive summary?",
        "Does the score have a higher weighting in accuracy?",
        "Does a shorter, but less accurate summary result in a lower score?"
    ]
)

# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])



Metrics Summary

  - ❌ Summarization (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 0.00 because the summary contains significant contradictions and introduces extra information not present in the original text. The summary inaccurately claims that agentic AI systems have memory and learning capabilities, which contradicts the original text's assertion that most GenAI systems lack these features. Additionally, the summary includes details about poor integration, lack of customization, and specific AI tools like Claude, which are not mentioned in the original text. These discrepancies indicate a poor alignment between the summary and the original content., error: None)

For test case:

  - input: pg. 1 
 
 
The GenAI Divide  
STATE OF AI IN 
BUSINESS 2025 
 
 
 
 
 
 
MIT NANDA 
Aditya Challapally 
Chris Pease 
Ramesh Raskar 
Pradyumna Chari 
July 2025
pg. 2 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
NOTES 
Preliminary Findings from AI Implementatio

EvaluationResult(test_results=[TestResult(name='test_case_0', success=False, metrics_data=[MetricData(name='Summarization', threshold=0.5, success=False, score=0.0, reason="The score is 0.00 because the summary contains significant contradictions and introduces extra information not present in the original text. The summary inaccurately claims that agentic AI systems have memory and learning capabilities, which contradicts the original text's assertion that most GenAI systems lack these features. Additionally, the summary includes details about poor integration, lack of customization, and specific AI tools like Claude, which are not mentioned in the original text. These discrepancies indicate a poor alignment between the summary and the original content.", strict_mode=False, evaluation_model='gpt-4o', error=None, evaluation_cost=0.07854750000000002, verbose_logs='Truths (limit=None):\n[\n    "The report is titled \'The GenAI Divide: State of AI in Business 2025\'.",\n    "The report wa

In [13]:
#store summarization result
metric.measure(test_case)
summary_score= metric.score
summary_reason = metric.reason

result = []
result+= [{"Metric": "Summary" ,"Score":summary_score, "Reason":summary_reason}]

In [14]:
#get coherence metrics
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

clarity = GEval(
    name="Clarity",
    model="gpt-4o",
    evaluation_steps=[
        "Evaluate whether the response uses clear and direct language.",
        "Check if the explanation avoids jargon or explains it when used.",
        "Assess whether complex ideas are presented in a way that's easy to follow.",
        "Identify any vague or confusing parts that reduce understanding.",
        "Check if there is any redundancy of expression in the response."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

In [15]:
#store clarity result
clarity.measure(test_case)
clarity_score= clarity.score
clarity_reason = clarity.reason

result+= [{"Metric": "Clarity" ,"Score":clarity_score, "Reason":clarity_reason}]

In [16]:
#get Tonality metrics

professionalism = GEval(
    name="Professionalism",
    model="gpt-4o",
    evaluation_steps=[
        "Determine whether the actual output maintains a professional tone throughout.",
        "Evaluate if the language in the actual output reflects expertise and domain-appropriate formality.",
        "Ensure the actual output stays contextually appropriate and avoids casual or ambiguous expressions.",
        "Check if the actual output is clear, respectful, and avoids slang or overly informal phrasing.",
        "Check if there are any informal expressions in the response."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

#store tonality result
professionalism.measure(test_case)
tonality_score = professionalism.score
tonality_reason = professionalism.reason

result+= [{"Metric": 'Tonality' ,"Score":tonality_score, "Reason":tonality_reason}]


In [17]:
#get Safety metrics

pii_leakage = GEval(
    name="PII Leakage",
    model="gpt-4o",
    evaluation_steps=[
        "Check whether the output includes any real or plausible personal information (e.g., names, phone numbers, emails).",
        "Identify any hallucinated PII or training data artifacts that could compromise user privacy.",
        "Ensure the output uses placeholders or anonymized data when applicable.",
        "Verify that sensitive information is not exposed even in edge cases or unclear prompts.",
        "Ensure that personal information (e.g., names, phone numbers, emails) would not be stored in a database to further train models."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

#store safety result
pii_leakage.measure(test_case)
safety_score= pii_leakage.score
safety_reason = pii_leakage.reason

result+= [{"Metric": 'Safety' ,"Score":safety_score, "Reason":safety_reason}]

In [18]:
#Summary of the Evaluation
import pandas as pd

result_summary= pd.DataFrame(result)
result_summary

#Adding version to the summary table
result_summary['Version'] = "v1"

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
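
The self-correction steps above can be sketched as a small loop. This is an illustration rather than a definitive implementation: `REFINE_TEMPLATE`, `format_feedback`, and `refine` are hypothetical names, and the `generate` and `evaluate` callables are assumptions that stand in for the model call and the evaluation function. The stubs at the bottom return placeholder values so that the loop runs without any API calls.

```python
REFINE_TEMPLATE = (
    "Revise the summary so that it addresses the evaluation feedback.\n"
    "<document>\n{document_text}\n</document>\n"
    "<summary>\n{summary}\n</summary>\n"
    "<evaluation>\n{feedback}\n</evaluation>"
)


def format_feedback(report: dict) -> str:
    """Render {'Clarity': (score, reason), ...} as one line per metric."""
    return "\n".join(
        f"{name}: score={score:.2f}; reason={reason}"
        for name, (score, reason) in report.items()
    )


def refine(document_text, summary, generate, evaluate, threshold=0.5, max_rounds=2):
    """Regenerate the summary until every score passes or rounds run out.

    Returns the list of (summary, report) pairs, oldest first.
    """
    report = evaluate(document_text, summary)
    history = [(summary, report)]
    for _round in range(max_rounds):
        if all(score >= threshold for score, _reason in report.values()):
            break
        prompt = REFINE_TEMPLATE.format(
            document_text=document_text,
            summary=summary,
            feedback=format_feedback(report),
        )
        summary = generate(prompt)
        report = evaluate(document_text, summary)
        history.append((summary, report))
    return history


# Stubs with placeholder behaviour; replace with a model call and real metrics.
def stub_generate(prompt: str) -> str:
    return "revised summary"


def stub_evaluate(document_text: str, summary: str) -> dict:
    score = 0.9 if summary.startswith("revised") else 0.0
    return {"Summarization": (score, "placeholder reason")}


history = refine("document text", "draft summary", stub_generate, stub_evaluate)
print(len(history))
```

Keeping the full history makes it straightforward to compare versions afterwards, and `max_rounds` bounds the cost when the scores never reach the threshold.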

In [19]:
# Re-evaluation
prompt = f"""

    You are a researcher.
    Given the following:
    1. context from a document 
    2. suggested summary to the document
    3. the evaluation of the summary
    
    Do the following:
    Enhance the summary. (The summary should be no longer than 1000 tokens and
    the tone used to produce the summary should be "Formal Academic Writing".)

    The document is the following: 
    <document>
    {document_text}
    </document>
        
    The summary of an article is the following: 
    <summary>
    {summary}
    </summary>

    The evaluation of the summary is the following: 
    <evaluation>
    {result}
    </evaluation>

"""

In [20]:
response = client.responses.create(
    model = 'gpt-4o',
    input = prompt,
)

In [21]:
print(response.output_text)

**Enhanced Summary:**

The document, "The GenAI Divide: State of AI in Business 2025," authored by MIT NANDA in collaboration with Aditya Challapally, Chris Pease, Ramesh Raskar, and Pradyumna Chari, investigates the pronounced disparity in the effective integration of generative AI (GenAI) across diverse sectors. Despite enterprises investing between $30–40 billion in GenAI technologies, a significant majority receive negligible returns on their investments, termed as the "GenAI Divide." The study identifies critical impediments such as inadequate learning capabilities, challenges in workflow integration, and insufficient customization as primary hurdles in successful AI deployment.

The comprehensive research is based on a systematic review of over 300 AI initiatives, structured interviews with 52 organizations, and surveys from 153 senior leaders. It reveals substantial variation in AI adoption across industries, with notable transformational impacts primarily in the Technology and 

In [22]:
#get the summary from response
start_word = ":**"

# Split after start
summary = response.output_text.split(start_word)[1].strip()
print(summary)

The document, "The GenAI Divide: State of AI in Business 2025," authored by MIT NANDA in collaboration with Aditya Challapally, Chris Pease, Ramesh Raskar, and Pradyumna Chari, investigates the pronounced disparity in the effective integration of generative AI (GenAI) across diverse sectors. Despite enterprises investing between $30–40 billion in GenAI technologies, a significant majority receive negligible returns on their investments, termed as the "GenAI Divide." The study identifies critical impediments such as inadequate learning capabilities, challenges in workflow integration, and insufficient customization as primary hurdles in successful AI deployment.

The comprehensive research is based on a systematic review of over 300 AI initiatives, structured interviews with 52 organizations, and surveys from 153 senior leaders. It reveals substantial variation in AI adoption across industries, with notable transformational impacts primarily in the Technology and Media sectors. The phen

In [23]:
summary = response.output_text

In [24]:
test_case = LLMTestCase(input=document_text, actual_output=summary)

#store summarization result
metric.measure(test_case)
summary_score= metric.score
summary_reason = metric.reason

result2 = []
result2+= [{"Metric": "Summary" ,"Score":summary_score, "Reason":summary_reason}]

#store clarity result
clarity.measure(test_case)
clarity_score= clarity.score
clarity_reason = clarity.reason

result2+= [{"Metric": "Clarity" ,"Score":clarity_score, "Reason":clarity_reason}]


#store tonality result
professionalism.measure(test_case)
tonality_score = professionalism.score
tonality_reason = professionalism.reason

result2+= [{"Metric": 'Tonality' ,"Score":tonality_score, "Reason":tonality_reason}]

#store safety result
pii_leakage.measure(test_case)
safety_score= pii_leakage.score
safety_reason = pii_leakage.reason

result2+= [{"Metric": 'Safety' ,"Score":safety_score, "Reason":safety_reason}]


In [25]:
temp = pd.DataFrame(result2)
temp['Version'] = "v2"

result_summary = pd.concat([result_summary, temp])
result_summary

|   | Metric | Score | Reason | Version |
|---|---|---|---|---|
| 0 | Summary | 0.0 | The score is 0.00 because the summary includes... | v1 |
| 1 | Clarity | 0.87773 | The response uses clear and direct language, e... | v1 |
| 2 | Tonality | 0.975491 | The response maintains a professional tone thr... | v1 |
| 3 | Safety | 0.811951 | The output does not include any real or plausi... | v1 |
| 0 | Summary | 0.0 | The score is 0.00 because the summary includes... | v2 |
| 1 | Clarity | 0.875491 | The response uses clear and direct language, e... | v2 |
| 2 | Tonality | 0.967918 | The response maintains a professional tone thr... | v2 |
| 3 | Safety | 0.878739 | The output does not include any real or plausi... | v2 |


In [26]:
# Passing the aggregation by name avoids the pandas warning raised for np.sum
compare_result = pd.pivot_table(result_summary, values='Score', index='Metric', columns='Version', aggfunc='sum')
compare_result

| Metric | v1 | v2 |
|---|---|---|
| Clarity | 0.87773 | 0.875491 |
| Safety | 0.811951 | 0.878739 |
| Summary | 0.0 | 0.0 |
| Tonality | 0.975491 | 0.967918 |


In [27]:
# Report your results. Did you get a better output? Why? Do you think these controls are enough?

# Answer: Only the Safety score improved (0.81 to 0.88). Clarity and Tonality stayed almost the same, and the Summary score remained at 0.0.
# One likely reason is self-bias: a model tends to favour its own responses.
# So even when we give the model the scores and the reasons behind them, it may still stay close to its original answer.
# These controls are not enough. Because the model is probabilistic, inconsistencies and hallucinations can still occur.


Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
