# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
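The extracted text often contains page markers and runs of whitespace-only lines. The following is a minimal sketch of an optional cleanup step, assuming each page object exposes a `page_content` attribute. The helper name `join_pages` and the `SimpleNamespace` stand-in pages are hypothetical and are used only so the example runs without a PDF.

```python
import re
from types import SimpleNamespace


def join_pages(pages) -> str:
    """Join page texts and collapse whitespace-only lines (illustrative helper)."""
    text = "\n".join(page.page_content for page in pages)
    # Drop trailing spaces/tabs before each newline.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Squeeze runs of three or more newlines down to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Stand-in pages shaped like loader output (hypothetical content).
pages = [
    SimpleNamespace(page_content="pg. 1 \n \n \nTitle  \n"),
    SimpleNamespace(page_content="pg. 2 \n \nBody text \n"),
]
cleaned = join_pages(pages)
print(cleaned)
```

Whether this cleanup is worthwhile depends on the document; it reduces whitespace noise in the prompt but does not change the wording of the text.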

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [20]:
#Load in GenAI Divide PDF
from langchain_community.document_loaders import PyPDFLoader

file_path = "documents/ai_report_2025.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

#Join pages
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

document_text

'pg. 1 \n \n \nThe GenAI Divide  \nSTATE OF AI IN \nBUSINESS 2025 \n \n \n \n \n \n \nMIT NANDA \nAditya Challapally \nChris Pease \nRamesh Raskar \nPradyumna Chari \nJuly 2025\npg. 2 \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nNOTES \nPreliminary Findings from AI Implementation Research from Project NANDA \nReviewers: Pradyumna Chari, Project NANDA \nResearch Period: January – June 2025 \nMethodology: This report is based on a multi-method research design that includes \na systematic review of over 300 publicly disclosed AI initiatives, structured \ninterviews with representatives from 52 organizations, and survey responses from \n153 senior leaders collected across four major industry conferences. \n Disclaimer: The views expressed in this report are solely those of the authors and \nreviewers and do not reflect the positions of any affiliated employers. \n Confidentiality Note: All company-specific data and quotes have been \nanonymized to maintain compliance with corporate

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
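
One way to keep instructions and context separate is to store each as a template and fill them in at call time. The sketch below is an assumption about how this could look, not a required implementation: the template wording, the `build_messages` helper, and the example tone are hypothetical, and it assumes the endpoint accepts the `developer` role (some models may expect `system` instead).

```python
# Instructions (developer prompt) and context (user prompt) are stored separately.
INSTRUCTIONS_TEMPLATE = (
    "You summarize documents for AI professionals. "
    "Write the summary in the following tone: {tone}. "
    "Keep the summary under {max_tokens} tokens."
)
CONTEXT_TEMPLATE = "<document>\n{document}\n</document>"


def build_messages(document: str, tone: str, max_tokens: int = 1000) -> list[dict]:
    """Fill both templates and return the message list (hypothetical helper)."""
    return [
        {
            "role": "developer",
            "content": INSTRUCTIONS_TEMPLATE.format(tone=tone, max_tokens=max_tokens),
        },
        {"role": "user", "content": CONTEXT_TEMPLATE.format(document=document)},
    ]


messages = build_messages("Example document text.", tone="Formal Academic Writing")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

The resulting list can be passed as the `input` argument of the parse call, so changing the tone or the document does not require editing the prompt text itself.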


In [35]:
from openai import OpenAI
from pydantic import BaseModel, Field
import os

#Set up API call
client = OpenAI(default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1')

#Define structured output with fields from assignment description
class Summary(BaseModel):
    author: str=Field(description="The author of the article")
    title: str=Field(description="The title of the article")
    relevance: str=Field(description="Why is this article relevant for an AI professional in their professional development (no longer than 1 paragraph)")
    summary: str=Field(description="Concise summary of the article (no longer than 1000 tokens)")
    tone: str=Field(description="The tone of the response")

#Get response with system prompt and article as input
response = client.responses.parse(
    model="gpt-4o-mini",
    input=[
        {"role": "system", "content": "You are an academic writing for a scientific audience summarizing the information in this article."},
        {"role": "user", "content": document_text},
        ],
    text_format=Summary,
)

AI_summary = response.output_parsed

AI_summary

Summary(author='MIT NANDA Team', title='The GenAI Divide: State of AI in Business 2025', relevance="This report is crucial for AI professionals aiming to understand the current landscape and challenges in AI implementation's effectiveness, especially concerning generative AI tools. It highlights the divide between organizations achieving success versus those that remain stagnant, providing insights into best practices and pitfalls, which is essential for driving impactful AI strategies.", summary="The report outlines findings from a multi-method research study on generative AI (GenAI) implementation across businesses. Despite significant investment (up to $40 billion), 95% of organizations see no return on their GenAI initiatives, revealing a stark 'GenAI Divide' where only 5% of AI pilots yield substantial value. Key patterns contributing to this divide include insufficient organizational disruption in most sectors, challenges in scaling AI tools from pilot to production, and ineffect

In [None]:
#Get input and output tokens
print("Input tokens", response.usage.input_tokens)
print("Output tokens", response.usage.output_tokens)

Input tokens 10862
Output tokens 419
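
The specification asks for the token counts to be part of the structured result rather than printed separately. A minimal sketch of one way to combine them is shown here, assuming the usage object exposes `input_tokens` and `output_tokens` as in the cell above. The helper `with_token_counts` is hypothetical, and `SimpleNamespace` stands in for the real response so the example runs without an API call.

```python
from types import SimpleNamespace


def with_token_counts(parsed_fields: dict, usage) -> dict:
    """Return the parsed fields plus token counts read from the usage object."""
    return {
        **parsed_fields,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
    }


# Stand-in for `response.usage`, using the counts printed above.
usage = SimpleNamespace(input_tokens=10862, output_tokens=419)
# Stand-in for `AI_summary.model_dump()`; only two fields are shown for brevity.
parsed_fields = {
    "author": "MIT NANDA Team",
    "title": "The GenAI Divide: State of AI in Business 2025",
}

report = with_token_counts(parsed_fields, usage)
print(report["input_tokens"], report["output_tokens"])
```

Reading the counts from the response avoids asking the model to report its own token usage, which it cannot know reliably.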


# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
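
The score and reason pairs can be collected with a small helper instead of repeating the lookup for every metric. This is a sketch under the assumption that the dumped evaluation result has the `test_results` → `metrics_data` → `score`/`reason` layout seen in this notebook's evaluation output. The helper name `score_and_reason` and the stand-in dictionary are hypothetical.

```python
def score_and_reason(result_dump: dict, prefix: str) -> dict:
    """Pull the first metric's score and reason out of an evaluation dump."""
    metric = result_dump["test_results"][0]["metrics_data"][0]
    return {f"{prefix}Score": metric["score"], f"{prefix}Reason": metric["reason"]}


# Stand-in shaped like a dumped evaluation result; the reason text is a placeholder.
example_dump = {
    "test_results": [
        {
            "metrics_data": [
                {"name": "Summarization", "score": 0.5, "reason": "placeholder reason"}
            ]
        }
    ]
}

report = {}
report.update(score_and_reason(example_dump, "Summarization"))
print(report)
```

Calling the helper once per metric with a different prefix yields the full set of key-value pairs in one dictionary.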

In [None]:
#Summarization
import os

from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.models import GPTModel  #Required for the evaluation model defined below
from deepeval.test_case import LLMTestCase

#Set model
model = GPTModel(
    model="gpt-4o-mini",
    temperature=0,
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

#Set test case
test_case = LLMTestCase(input=document_text, actual_output=AI_summary.summary)

#Define summarization metric
summarization_metric = SummarizationMetric(
    threshold=0.5,
    model=model,
    assessment_questions=[
        "Do the majority (95%) of companies not get a return on their Gen AI investments?",
        "Does the article believe that high adoption of gen AI has led to high disruption?",
        "Do the majority of custom enterprise AI tools reach production?",
        "Does back-office automation lead to better return on investment for AI tools?",
        "Do the majority of users believe gen AI tools to be trustworthy and reliable for enterprise systems?",
        "Do users prefer human or gen AI agents for mission-critical work?",
        "Is the term GenAI Divide defined?",
        "Is the lack of context an issue for the implementation of gen AI?",
        "Do general purpose models like chatGPT work for industries when compared to individualized models?"
    ]
)

summarization_eval = evaluate(test_cases=[test_case], metrics=[summarization_metric])

summarization_eval.model_dump()

Output()



Metrics Summary

  - ✅ Summarization (score: 0.5, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.50 because the summary contains contradictions regarding the effectiveness of tools like ChatGPT, which the original text clarifies are being used effectively by some organizations. Additionally, the summary introduces extra information about builders and partnerships that is not present in the original text, leading to a misrepresentation of the original content. Furthermore, there are unanswered questions in the summary that the original text could address, indicating a lack of completeness., error: None)


{'test_results': [{'name': 'test_case_0',
   'success': True,
   'metrics_data': [{'name': 'Summarization',
     'threshold': 0.5,
     'success': True,
     'score': 0.5,
     'reason': 'The score is 0.50 because the summary contains contradictions regarding the effectiveness of tools like ChatGPT, which the original text clarifies are being used effectively by some organizations. Additionally, the summary introduces extra information about builders and partnerships that is not present in the original text, leading to a misrepresentation of the original content. Furthermore, there are unanswered questions in the summary that the original text could address, indicating a lack of completeness.',
     'strict_mode': False,
     'evaluation_model': 'gpt-4o-mini',
     'error': None,
     'evaluation_cost': 0.004637250000000001,
     'verbose_logs': 'Truths (limit=None):\n[\n    "The report is titled \'The GenAI Divide: State of AI in Business 2025\'.",\n    "The authors of the report incl

In [None]:
#Coherence/clarity
#Import libraries for GEval
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
#State test case
test_case = LLMTestCase(input=document_text, actual_output=AI_summary.summary)

#Define clarity metric
clarity_metric = GEval(
    name="Coherence or clarity",
    criteria="Determine whether the actual output is coherent and clear for the reader.",
    evaluation_steps= [
        "Assess whether the response use clear and concise language",
        "Check if the response uses any acronyms or jargon without appropriately defining them",
        "Evaluate how easily the text can be understood by the average reader",
        "Determine whether any elements of the text are repetitive",
        "Identify any vague or abstract elements of the text that could be improved"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model,
)

clarity_eval = evaluate(test_cases=[test_case], metrics=[clarity_metric])




Metrics Summary

  - ✅ Coherence or clarity [GEval] (score: 0.7963509996837951, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The response effectively summarizes the key findings of the report using clear and concise language, making it accessible to the average reader. It avoids jargon and acronyms without definitions, ensuring clarity. However, it could improve by reducing some repetitive elements, particularly in discussing the 'GenAI Divide' and the barriers to scaling AI tools, which could be more succinctly articulated., error: None)


{'test_results': [{'name': 'test_case_0',
   'success': True,
   'metrics_data': [{'name': 'Coherence or clarity [GEval]',
     'threshold': 0.5,
     'success': True,
     'score': 0.7963509996837951,
     'reason': "The response effectively summarizes the key findings of the report using clear and concise language, making it accessible to the average reader. It avoids jargon and acronyms without definitions, ensuring clarity. However, it could improve by reducing some repetitive elements, particularly in discussing the 'GenAI Divide' and the barriers to scaling AI tools, which could be more succinctly articulated.",
     'strict_mode': False,
     'evaluation_model': 'gpt-4o-mini',
     'error': None,
     'evaluation_cost': 0.0017303999999999998,
     'verbose_logs': 'Criteria:\nDetermine whether the actual output is coherent and clear for the reader. \n \nEvaluation Steps:\n[\n    "Assess whether the response use clear and concise language",\n    "Check if the response uses any acr

In [None]:
#Tonality
#State test case
test_case = LLMTestCase(input=document_text, actual_output=AI_summary.summary)

#Define tonality metric
tonality_metric = GEval(
    name="Tonality",
    criteria="Determine whether the actual output matches the desired style and tone.",
    evaluation_steps= [
        "Determine if the response is written in language suitable for an academic context",
        "Assess whether the tone remains professional throughout the text",
        "Evaluate the consistency of the tone used throughout the text",
        "Check if the response is direct and comprehensive",
        "Identify any cases where unprofessional language or slang is used"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model,
)

tonality_eval = evaluate(test_cases=[test_case], metrics=[tonality_metric])

Output()



Metrics Summary

  - ✅ Tonality [GEval] (score: 0.8744711559922631, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The response is well-written in an academic context, maintaining a professional tone throughout. It effectively summarizes the key findings of the report, addressing the GenAI Divide and the factors contributing to it. The language is direct and comprehensive, clearly outlining the issues and solutions without using unprofessional language or slang. The response aligns closely with the evaluation steps, particularly in its clarity and depth of analysis., error: None)


In [None]:
#Safety
#State test case
test_case = LLMTestCase(input=document_text, actual_output=AI_summary.summary)

#Define safety metric
safety_metric = GEval(
    name="Safety",
    criteria="Determine whether the actual output meets ethical and personal security standards.",
    evaluation_steps= [
        "Determine if the response contains any identifying personal information",
        "Identify any hallucinated data or conclusions not present in the original text",
        "Assess whether the text appears to be biased against any ethnic or social group",
        "Check if the response contains any harmful language",
        "Evaluate the text's inclusion of diverse perspectives on the issues discussed"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model,
)

safety_eval = evaluate(test_cases=[test_case], metrics=[safety_metric])

Output()



Metrics Summary

  - ✅ Safety [GEval] (score: 0.8705267187271974, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The response effectively summarizes the key findings of the report, highlighting the GenAI Divide and the challenges organizations face in realizing value from AI investments. It accurately reflects the report's emphasis on the learning gap, the contrast between successful external partnerships and ineffective internal builds, and the need for adaptive systems. The response avoids personal information, does not present hallucinated data, and maintains neutrality without bias. It also captures diverse perspectives on the issues discussed, particularly the differences in organizational approaches to AI implementation., error: None)



In [70]:
#Return dictionary of results
summary_eval = {
    'SummaryScore': summarization_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('score'),
    'SummaryReason': summarization_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('reason'),
    'ClarityScore': clarity_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('score'),
    'ClarityReason': clarity_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('reason'),
    'TonalityScore': tonality_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('score'),
    'TonalityReason': tonality_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('reason'),
    'SafetyScore': safety_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('score'),
    'SafetyReason': safety_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('reason'),
}

summary_eval

{'SummaryScore': 0.4,
 'SummaryReason': 'The score is 0.40 because the summary contains significant contradictions to the original text regarding the core barriers to scaling and the effectiveness of AI tools, which misrepresents the original message. Additionally, it introduces extra information that is not present in the original text, leading to a lack of focus and clarity. Furthermore, the summary fails to address several questions that the original text could answer, indicating a lack of completeness.',
 'ClarityScore': 0.7963509996837951,
 'ClarityReason': "The response effectively summarizes the key findings of the report using clear and concise language, making it accessible to the average reader. It avoids jargon and acronyms without definitions, ensuring clarity. However, it could improve by reducing some repetitive elements, particularly in discussing the 'GenAI Divide' and the barriers to scaling AI tools, which could be more succinctly articulated.",
 'TonalityScore': 0.87

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
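
One possible way to feed the evaluation back into the prompt is to build the enhancement prompt from the three pieces dynamically. The sketch below is illustrative only: the template wording and the `build_enhancement_prompt` helper are hypothetical, and it assumes the evaluation is available as a dictionary whose explanation keys end in `Reason`.

```python
ENHANCEMENT_TEMPLATE = (
    "Revise the summary below so that it addresses the evaluation feedback.\n"
    "Use only information found in the document.\n\n"
    "<feedback>\n{feedback}\n</feedback>\n\n"
    "<previous_summary>\n{summary}\n</previous_summary>\n\n"
    "<document>\n{document}\n</document>"
)


def build_enhancement_prompt(document: str, summary: str, evaluation: dict) -> str:
    """Combine context, prior summary, and evaluation reasons (hypothetical helper)."""
    # Keep only the "...Reason" entries, on the assumption that the textual
    # explanations are more useful to the model than the numeric scores.
    feedback = "\n".join(
        f"- {key}: {value}" for key, value in evaluation.items() if key.endswith("Reason")
    )
    return ENHANCEMENT_TEMPLATE.format(feedback=feedback, summary=summary, document=document)


prompt = build_enhancement_prompt(
    document="Example document text.",
    summary="Example summary.",
    evaluation={"SummaryScore": 0.4, "SummaryReason": "placeholder reason"},
)
print(prompt)
```

Because the feedback is inserted at call time, the same function can be reused for further rounds of evaluation and revision.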

In [None]:
#Set up API call
client = OpenAI(default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1')

#Define structured output with fields from assignment description
#Add additional details to the summary field to account for the issues identified in the evaluation
class Summary(BaseModel):
    author: str=Field(description="The author of the article")
    title: str=Field(description="The title of the article")
    relevance: str=Field(description="Why is this article relevant for an AI professional in their professional development (no longer than 1 paragraph)")
    summary: str=Field(description="Concise summary of the article (no longer than 1000 tokens). Do not include any information not present in the article. Try to avoid repetition in the answer. Focus on core barriers to scaling and effectiveness of AI tools.")
    tone: str=Field(description="The tone of the response")

#Get response with system prompt and article as input
response = client.responses.parse(
    model="gpt-4o-mini",
    input=[
        {"role": "system", "content": "You are an academic writing for a scientific audience summarizing the information in this article."},
        {"role": "user", "content": document_text},
        ],
    text_format=Summary,
)

AI_summary_enhanced = response.output_parsed

In [None]:
#Restate the test case with the enhanced summary
enhanced_test_case = LLMTestCase(input=document_text, actual_output=AI_summary_enhanced.summary)

#Evaluate the enhanced summary
enhanced_summarization_eval = evaluate(test_cases=[enhanced_test_case], metrics=[summarization_metric])
enhanced_clarity_eval = evaluate(test_cases=[enhanced_test_case], metrics=[clarity_metric])
enhanced_tonality_eval = evaluate(test_cases=[enhanced_test_case], metrics=[tonality_metric])
enhanced_safety_eval = evaluate(test_cases=[enhanced_test_case], metrics=[safety_metric])

#Create a dictionary with the results of the evaluation. 
enhanced_summary_eval = {
    'SummaryScore': enhanced_summarization_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('score'),
    'SummaryReason': enhanced_summarization_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('reason'),
    'ClarityScore': enhanced_clarity_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('score'),
    'ClarityReason': enhanced_clarity_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('reason'),
    'TonalityScore': enhanced_tonality_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('score'),
    'TonalityReason': enhanced_tonality_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('reason'),
    'SafetyScore': enhanced_safety_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('score'),
    'SafetyReason': enhanced_safety_eval.model_dump().get('test_results')[0].get('metrics_data')[0].get('reason'),
}

enhanced_summary_eval

Output()



Metrics Summary

  - ❌ Summarization (score: 0.4, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.40 because the summary contradicts the original text by misidentifying the core barrier to scaling Generative AI and introduces several pieces of extra information that were not present in the original text, which diminishes its accuracy and relevance., error: None)


Output()



Metrics Summary

  - ✅ Coherence or clarity [GEval] (score: 0.8140259217076414, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The response uses clear and concise language, effectively summarizing the key findings of the report. It avoids jargon and acronyms, making it accessible to the average reader. However, it could improve by reducing some repetition in the explanation of barriers to scaling and providing more specific examples of successful organizations or practices, which would enhance clarity and depth., error: None)


Output()



Metrics Summary

  - ✅ Tonality [GEval] (score: 0.8849973197342125, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The response is well-written in an academic context, maintaining a professional tone throughout. It effectively summarizes the key findings of the report, addressing the significant issues surrounding the GenAI Divide, including the disparity in returns on investment and the barriers to scaling AI initiatives. The language is direct and comprehensive, clearly articulating the main points without the use of unprofessional language or slang. The response aligns closely with the evaluation steps, demonstrating a strong understanding of the report's content., error: None)


Output()



Metrics Summary

  - ✅ Safety [GEval] (score: 0.873799443610312, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The response effectively summarizes the key findings of the report, highlighting the GenAI Divide and the disparity in AI impact across organizations. It accurately reflects the lack of return on investment, the barriers to scaling AI initiatives, and the importance of customization and external partnerships. The response avoids personal information, does not contain hallucinated data, and does not exhibit bias or harmful language. It also includes diverse perspectives on the challenges and successes of AI implementation, aligning well with the evaluation steps., error: None)


{'SummaryScore': 0.4,
 'SummaryReason': 'The score is 0.40 because the summary contradicts the original text by misidentifying the core barrier to scaling Generative AI and introduces several pieces of extra information that were not present in the original text, which diminishes its accuracy and relevance.',
 'ClarityScore': 0.8140259217076414,
 'ClarityReason': 'The response uses clear and concise language, effectively summarizing the key findings of the report. It avoids jargon and acronyms, making it accessible to the average reader. However, it could improve by reducing some repetition in the explanation of barriers to scaling and providing more specific examples of successful organizations or practices, which would enhance clarity and depth.',
 'TonalityScore': 0.8849973197342125,
 'TonalityReason': "The response is well-written in an academic context, maintaining a professional tone throughout. It effectively summarizes the key findings of the report, addressing the significant 

Despite my efforts to improve the prompt by incorporating the "feedback" from DeepEval, many of the same issues persisted, including information not present in the original text and repetitiveness. In some cases, the DeepEval explanations do not provide enough context to act on, and I am curious how the evaluations themselves could be evaluated. When I reran the prompt and evaluation several times, I received a wide range of scores and reasons for the summarization metric, which makes me question its validity. In the end, I think there is no substitute for a human reading the article and drawing their own conclusions.
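
The run-to-run variation described above can be quantified by collecting the score from each rerun and reporting its mean and spread. This is a small sketch with a hypothetical helper, `summarize_scores`; the two input values are the summarization scores this notebook reports for the original summary on separate runs.

```python
from statistics import mean


def summarize_scores(scores: list[float]) -> dict:
    """Report the mean and spread of scores from repeated evaluation runs."""
    return {"runs": len(scores), "mean": mean(scores), "range": max(scores) - min(scores)}


# The two summarization scores reported above for the original summary.
stats = summarize_scores([0.5, 0.4])
print(stats)
```

With more reruns, a spread that is large relative to the difference between the original and enhanced summaries would suggest that the difference is within evaluation noise.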

Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
