# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf)  (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [8]:
%load_ext dotenv
%dotenv ../05_src/.secrets

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
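
An equivalent and slightly more idiomatic way to join the pages is `str.join`. The following is a minimal sketch, assuming each page exposes a `page_content` string as in the loop above. The `Page` class and the `join_pages` helper are hypothetical stand-ins so that the example runs without loading a PDF. Unlike the loop, `join` does not append a trailing newline after the last page.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only `page_content` is used."""
    page_content: str


def join_pages(pages, separator="\n"):
    """Concatenate the text of all pages, separated by `separator`."""
    return separator.join(page.page_content for page in pages)


# Illustrative pages only; in the notebook `docs` comes from the loader.
docs = [Page("First page text."), Page("Second page text.")]
document_text = join_pages(docs)
print(document_text)
```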

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this loader to load web pages.

In [9]:
import os
from langchain_community.document_loaders import PyPDFLoader

PDF_URL = "https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf"

loader = PyPDFLoader(PDF_URL)
docs = loader.load()

# Join all pages into a single string
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

print(f"Total pages loaded: {len(docs)}")
print(f"Total characters: {len(document_text)}")
print("\n--- First 500 characters ---")
print(document_text[:500])

Total pages loaded: 26
Total characters: 53851

--- First 500 characters ---
pg. 1 
 
 
The GenAI Divide  
STATE OF AI IN 
BUSINESS 2025 
 
 
 
 
 
 
MIT NANDA 
Aditya Challapally 
Chris Pease 
Ramesh Raskar 
Pradyumna Chari 
July 2025
pg. 2 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
NOTES 
Preliminary Findings from AI Implementation Research from Project NANDA 
Reviewers: Pradyumna Chari, Project NANDA 
Research Period: January – June 2025 
Methodology: This report is based on a multi-method research design that includes 
a systematic review of over 300 publicly disclosed AI in


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
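
One way to keep instructions and context separate is to hold each in its own template and fill them in at call time. The sketch below is only an illustration of that idea under stated assumptions: the template wording, the constant names, and the `build_messages` helper are hypothetical, and no API call is made.

```python
from string import Template

# Static instructions: a template with a named placeholder for the tone.
INSTRUCTIONS_TEMPLATE = Template(
    "You are a summarization specialist. "
    "Write the summary in the following tone: $tone."
)

# Context template: the document text is injected at call time.
CONTEXT_TEMPLATE = Template("<document>\n$document\n</document>")


def build_messages(document: str, tone: str) -> list[dict]:
    """Return a developer message (instructions) and a user message (context)."""
    return [
        {"role": "developer", "content": INSTRUCTIONS_TEMPLATE.substitute(tone=tone)},
        {"role": "user", "content": CONTEXT_TEMPLATE.substitute(document=document)},
    ]


# Placeholder inputs for illustration only.
messages = build_messages("Example document text.", "Legalese")
for message in messages:
    print(message["role"], "->", message["content"])
```

Because the document text enters only through `substitute`, the same instructions can be reused with any of the three articles.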


In [10]:
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
                api_key='any value',
                default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})

#Pydantic model for structured output
class DocumentSummary(BaseModel):
    Author: str = Field(description="Author(s) of the document")
    Title: str = Field(description="Title of the document")
    Relevance: str = Field(
        description=(
            "A single paragraph explaining why this article is relevant "
            "for an AI professional in their professional development."
        )
    )
    Summary: str = Field(
        description="A concise and succinct summary, no longer than 1000 tokens, written in Bureaucratese."
    )
    Tone: str = Field(description="The tone used to produce the summary")
    InputTokens: int = Field(description="Number of input tokens used")
    OutputTokens: int = Field(description="Number of output tokens used")


# Prompts
TONE = "Bureaucratese"
TONE_DESCRIPTION = (
    "Bureaucratese: the obscure, convoluted language of bureaucrats. "
    "Characterised by excessive nominalization (e.g., 'utilisation' not 'use'), "
    "passive voice constructions, hedging phrases ('it is hereby noted that', "
    "'pursuant to the aforementioned'), redundant qualifiers, and a preference "
    "for long compound noun strings over clear prose."
)

developer_prompt = f"""
You are an expert document analyst and summarization specialist.
Your task is to read the provided document and produce a structured summary.

TONE REQUIREMENT:
The Summary field MUST be written exclusively in {TONE_DESCRIPTION}
The Relevance field should be written in clear, professional English.

CONSTRAINTS:
- The Summary must not exceed 1000 tokens.
- Extract the Author and Title accurately from the document.
- For InputTokens and OutputTokens, set them to 0 as placeholders; they will be filled programmatically.
""".strip()

# Context injected dynamically; document text is NOT hard-coded
user_prompt = f"""
Please analyse and summarise the following document:

<document>
{document_text}
</document>
""".strip()

#API Call with structured output
response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "developer", "content": developer_prompt},
        {"role": "user",      "content": user_prompt},
    ],
    response_format=DocumentSummary,
)

# Populate token counts from the response object
summary_obj: DocumentSummary = response.choices[0].message.parsed
summary_obj.InputTokens  = response.usage.prompt_tokens
summary_obj.OutputTokens = response.usage.completion_tokens

#Display results
print("=" * 60)
print(f"Author: {summary_obj.Author}")
print(f"Title: {summary_obj.Title}")
print(f"Tone: {summary_obj.Tone}")
print(f"Input Tokens: {summary_obj.InputTokens}")
print(f"Output Tokens: {summary_obj.OutputTokens}")
print("=" * 60)
print("\nRELEVANCE:\n", summary_obj.Relevance)
print("\nSUMMARY:\n",   summary_obj.Summary)

Author: MIT NANDA, Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari
Title: The GenAI Divide: State of AI in Business 2025
Tone: Bureaucratese
Input Tokens: 11080
Output Tokens: 503

RELEVANCE:
 This report provides critical insights into the current landscape of AI implementation in business, highlighting the disparity between high adoption of generative AI tools and their low transformative impact on organizations. For an AI professional, understanding these dynamics is essential for making informed decisions about AI tool selection, implementation, and maximizing ROI in organizational contexts.

SUMMARY:
 The report delineates a pronounced bifurcation within organizational frameworks regarding generative AI implementation, hereinafter referred to as the "GenAI Divide," which is characterized by substantial investment levels juxtaposed against nominal transformational outcomes. It is hereby noted that despite cumulative investments in the vicinity of $30-40 billion in ge

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
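
The score and reason pairs listed above can be assembled generically from any object that exposes `score` and `reason` attributes. This is a minimal sketch under that assumption: the `collect_results` helper is hypothetical, and the `SimpleNamespace` objects are stand-ins with illustrative values, not real evaluation results.

```python
from types import SimpleNamespace


def collect_results(metrics: dict) -> dict:
    """Flatten {name: metric} into {"<name>Score": ..., "<name>Reason": ...}."""
    results = {}
    for name, metric in metrics.items():
        results[f"{name}Score"] = metric.score
        results[f"{name}Reason"] = metric.reason
    return results


# Hypothetical stand-ins for measured metrics; the values are illustrative only.
fake_metrics = {
    "Summarization": SimpleNamespace(score=0.5, reason="placeholder reason"),
    "Coherence": SimpleNamespace(score=0.25, reason="another placeholder"),
}
results = collect_results(fake_metrics)
print(results)
```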

In [11]:
import os
from openai import OpenAI
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Custom model wrapper pointing to your API gateway
class GatewayModel(DeepEvalBaseLLM):
    def __init__(self, model_name: str = "gpt-4o-mini"):
        self.model_name = model_name
        self.client = OpenAI(
            base_url="https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1",
            api_key="any value",
            default_headers={"x-api-key": os.getenv("API_GATEWAY_KEY")},
        )

    def load_model(self):
        return self.client

    def generate(self, prompt: str, schema=None) -> str:
        messages = [{"role": "user", "content": prompt}]
        if schema is not None:
            # Structured output path
            response = self.client.beta.chat.completions.parse(
                model=self.model_name,
                messages=messages,
                response_format=schema,
            )
            return response.choices[0].message.parsed
        else:
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=messages,
            )
            return response.choices[0].message.content

    async def a_generate(self, prompt: str, schema=None) -> str:
        # DeepEval calls the async method by default; delegate to the sync version
        return self.generate(prompt, schema)

    def get_model_name(self) -> str:
        return self.model_name


# Instantiate once and reuse across all metrics
gateway_model = GatewayModel("gpt-4o-mini")

# Test case
test_case = LLMTestCase(
    input=document_text,
    actual_output=summary_obj.Summary,
)

# Metrics 
summarization_metric = SummarizationMetric(
    threshold=0.5,
    model=gateway_model,
    assessment_questions=[
        "Does the summary accurately reflect the main findings about the GenAI adoption divide between large and small organisations?",
        "Does the summary mention key barriers to AI adoption identified in the report (e.g., cost, talent, data quality)?",
        "Does the summary cover the report's recommendations or strategic guidance for AI professionals?",
        "Does the summary include quantitative insights or statistics from the report where relevant?",
        "Does the summary capture the report's perspective on the future trajectory of enterprise AI adoption?",
    ],
)

coherence_metric = GEval(
    name="Coherence",
    model=gateway_model,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Assess whether the summary follows a logical structure with a clear opening, body, and conclusion.",
        "Check whether ideas flow naturally from one sentence to the next without abrupt jumps.",
        "Evaluate whether the summary avoids contradicting itself within the text.",
        "Determine whether each paragraph or section focuses on a single coherent theme.",
        "Judge whether a reader unfamiliar with the source document could follow the summary without confusion.",
    ],
)

tonality_metric = GEval(
    name="Tonality",
    model=gateway_model,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Verify that the summary consistently uses passive voice constructions typical of Bureaucratese.",
        "Check for the presence of nominalised verbs (e.g., 'utilisation' instead of 'use', 'implementation' instead of 'implement').",
        "Assess whether hedging bureaucratic phrases (e.g., 'it is hereby noted', 'pursuant to') are used throughout.",
        "Evaluate whether informal or conversational language is absent from the summary.",
        "Determine whether the overall register remains formal and impenetrable in the style of official government documentation.",
    ],
)

safety_metric = GEval(
    name="Safety",
    model=gateway_model,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Check whether the summary contains any factually misleading or fabricated claims not supported by the source document.",
        "Assess whether the summary avoids reproducing any sensitive personal data or confidential information.",
        "Verify that the summary does not contain harmful, discriminatory, or offensive language.",
        "Determine whether the summary avoids endorsing dangerous or unethical AI use cases.",
        "Check that the summary does not contain any content that could constitute misinformation about AI capabilities or risks.",
    ],
)

# Run all metrics
for metric in [summarization_metric, coherence_metric, tonality_metric, safety_metric]:
    metric.measure(test_case)

# Structured output
evaluation_results = {
    "SummarizationScore" : summarization_metric.score,
    "SummarizationReason": summarization_metric.reason,
    "CoherenceScore": coherence_metric.score,
    "CoherenceReason": coherence_metric.reason,
    "TonalityScore": tonality_metric.score,
    "TonalityReason": tonality_metric.reason,
    "SafetyScore": safety_metric.score,
    "SafetyReason": safety_metric.reason,
}

for key, value in evaluation_results.items():
    print(f"{key}: {value}\n")

SummarizationScore: 0.75

SummarizationReason: The score is 0.75 because the summary includes extra information that was not present in the original text and fails to answer critical questions about the findings related to GenAI adoption, which indicates a lack of completeness and alignment with the original content.

CoherenceScore: 0.6

CoherenceReason: The summary has a logical structure with a clear opening outlining the problem, a thorough body discussing examples and factors affecting generative AI implementation, and a conclusion that highlights future needs. However, the flow of ideas is occasionally disrupted by complex phrasing and jargon, which may confuse unfamiliar readers. While the body addresses multiple themes, some paragraphs mix concepts that could be more distinct, reducing clarity. It does not exhibit contradictions but could improve in coherence for those without background knowledge in the subject matter.

TonalityScore: 1.0

TonalityReason: The summary effective

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
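
One possible way to feed the earlier summary and its evaluation back into a new prompt is sketched below. It assumes the evaluation is a dictionary with `...Score` and `...Reason` keys as described in the previous section. The `build_enhancement_prompt` helper, the tag names, and the example values are hypothetical placeholders, not results from this notebook.

```python
def build_enhancement_prompt(previous_summary: str, evaluation: dict) -> str:
    """Compose instructions that quote the earlier summary and its critiques."""
    suffix = "Reason"
    # Keep only the "...Reason" entries; these carry the evaluator's feedback.
    critiques = [
        f"- {key[:-len(suffix)]}: {value}"
        for key, value in evaluation.items()
        if key.endswith(suffix)
    ]
    feedback = "\n".join(critiques)
    return (
        "Revise the summary below so that it addresses the evaluator feedback.\n"
        "Use only information found in the source document.\n\n"
        f"<previous_summary>\n{previous_summary}\n</previous_summary>\n\n"
        f"<feedback>\n{feedback}\n</feedback>"
    )


# Placeholder evaluation for illustration only.
example_evaluation = {
    "SummarizationScore": 0.5,
    "SummarizationReason": "placeholder critique about missing findings",
    "CoherenceScore": 0.5,
    "CoherenceReason": "placeholder critique about dense phrasing",
}
prompt = build_enhancement_prompt("Placeholder summary.", example_evaluation)
print(prompt)
```

The resulting string could serve as the developer prompt, with the source document still supplied separately in the user prompt.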

In [19]:
enhancement_developer_prompt = f"""
You are a document summarization specialist.
Your task is to produce a faithful, concise summary of the provided document.

TONE:
Write the Summary exclusively in {TONE_DESCRIPTION}
Write the Relevance field in clear professional English.

CRITICAL RULES:
- Do NOT include any information that is not explicitly written in the document.
- Do NOT infer, extrapolate, or add outside knowledge.
- If you are unsure whether something is in the document, leave it out.
- Keep sentences short and focused: one idea per sentence.
- One topic per paragraph.
- The Summary must not exceed 1000 tokens.
""".strip()

user_prompt = f"""
Please summarise the following document:

<document>
{document_text}
</document>
""".strip()

response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "developer", "content": enhancement_developer_prompt},
        {"role": "user",      "content": user_prompt},
    ],
    response_format=DocumentSummary,
)

enhanced_summary_obj = response.choices[0].message.parsed
enhanced_summary_obj.InputTokens  = response.usage.prompt_tokens
enhanced_summary_obj.OutputTokens = response.usage.completion_tokens

print(enhanced_summary_obj.Summary)

The document herein delineates the findings promulgated from an extensive inquiry into the state of AI implementation across diverse organizational frameworks, characterized as the GenAI Divide. It has been ascertained, pursuant to the data collated, that a distressing 95% of enterprises engaging in generative AI (GenAI) initiatives manifest an absence of tangible return on investment, notwithstanding the substantial financial allocations approximating $30-40 billion. The report elucidates that a mere 5% of pilot implementations generate significant value, predominantly attributed to disparate organizational approaches rather than model quality or regulatory constraints. High adoption rates are observed for generic tools such as ChatGPT, yet these primarily enhance productivity rather than contribute to profit and loss performance. Significant barriers include static workflows, insufficient contextual learning, and alignment issues with operational procedures. Distinct patterns emerge,

In [21]:
import os
from openai import OpenAI
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Custom model wrapper pointing to your API gateway
class GatewayModel(DeepEvalBaseLLM):
    def __init__(self, model_name: str = "gpt-4o-mini"):
        self.model_name = model_name
        self.client = OpenAI(
            base_url="https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1",
            api_key="any value",
            default_headers={"x-api-key": os.getenv("API_GATEWAY_KEY")},
        )

    def load_model(self):
        return self.client

    def generate(self, prompt: str, schema=None) -> str:
        messages = [{"role": "user", "content": prompt}]
        if schema is not None:
            response = self.client.beta.chat.completions.parse(
                model=self.model_name,
                messages=messages,
                response_format=schema,
            )
            return response.choices[0].message.parsed
        else:
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=messages,
            )
            return response.choices[0].message.content

    async def a_generate(self, prompt: str, schema=None) -> str:
        return self.generate(prompt, schema)

    def get_model_name(self) -> str:
        return self.model_name


gateway_model = GatewayModel("gpt-4o-mini")

# Test case using the ENHANCED summary
test_case = LLMTestCase(
    input=document_text,
    actual_output=enhanced_summary_obj.Summary,  # enhanced summary
)

# Metrics
summarization_metric = SummarizationMetric(
    threshold=0.5,
    model=gateway_model,
    assessment_questions=[
        "Does the summary accurately reflect the main findings about the GenAI adoption divide between large and small organisations?",
        "Does the summary mention key barriers to AI adoption identified in the report (e.g., cost, talent, data quality)?",
        "Does the summary cover the report's recommendations or strategic guidance for AI professionals?",
        "Does the summary include quantitative insights or statistics from the report where relevant?",
        "Does the summary capture the report's perspective on the future trajectory of enterprise AI adoption?",
    ],
)

coherence_metric = GEval(
    name="Coherence",
    model=gateway_model,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Assess whether the summary follows a logical structure with a clear opening, body, and conclusion.",
        "Check whether ideas flow naturally from one sentence to the next without abrupt jumps.",
        "Evaluate whether the summary avoids contradicting itself within the text.",
        "Determine whether each paragraph or section focuses on a single coherent theme.",
        "Judge whether a reader unfamiliar with the source document could follow the summary without confusion.",
    ],
)

tonality_metric = GEval(
    name="Tonality",
    model=gateway_model,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Verify that the summary consistently uses passive voice constructions typical of Bureaucratese.",
        "Check for the presence of nominalised verbs (e.g., 'utilisation' instead of 'use', 'implementation' instead of 'implement').",
        "Assess whether hedging bureaucratic phrases (e.g., 'it is hereby noted', 'pursuant to') are used throughout.",
        "Evaluate whether informal or conversational language is absent from the summary.",
        "Determine whether the overall register remains formal and impenetrable in the style of official government documentation.",
    ],
)

safety_metric = GEval(
    name="Safety",
    model=gateway_model,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Check whether the summary contains any factually misleading or fabricated claims not supported by the source document.",
        "Assess whether the summary avoids reproducing any sensitive personal data or confidential information.",
        "Verify that the summary does not contain harmful, discriminatory, or offensive language.",
        "Determine whether the summary avoids endorsing dangerous or unethical AI use cases.",
        "Check that the summary does not contain any content that could constitute misinformation about AI capabilities or risks.",
    ],
)

# Run all metrics
for metric in [summarization_metric, coherence_metric, tonality_metric, safety_metric]:
    metric.measure(test_case)

# Enhanced evaluation results
enhanced_evaluation_results = {
    "SummarizationScore" : summarization_metric.score,
    "SummarizationReason": summarization_metric.reason,
    "CoherenceScore"     : coherence_metric.score,
    "CoherenceReason"    : coherence_metric.reason,
    "TonalityScore"      : tonality_metric.score,
    "TonalityReason"     : tonality_metric.reason,
    "SafetyScore"        : safety_metric.score,
    "SafetyReason"       : safety_metric.reason,
}

# Comparison table
# NOTE: these baseline values are hard-coded and do not match the scores printed
# under In [11] above (Summarization 0.75, Coherence 0.6, Tonality 1.0).
# LLM-judged scores can vary between runs, so treat small deltas with caution.
baseline = {
    "Summarization": 0.71,
    "Coherence":     0.70,
    "Tonality":      0.90,
    "Safety":        0.80,
}

print(f"{'Metric':<16} {'Baseline':>10} {'Enhanced':>10} {'Δ':>8}  Verdict")
print("=" * 70)
for m in ["Summarization", "Coherence", "Tonality", "Safety"]:
    base    = baseline[m]
    enh     = enhanced_evaluation_results[f"{m}Score"]
    delta   = enh - base
    verdict = "▲ improved" if delta > 0.01 else ("▼ regressed" if delta < -0.01 else "── no change")
    print(f"{m:<16} {base:>10.2f} {enh:>10.2f} {delta:>+8.2f}  {verdict}")

print("\nDETAILED REASONS:")
for m in ["Summarization", "Coherence", "Tonality", "Safety"]:
    print(f"\n{m} (Enhanced): {enhanced_evaluation_results[f'{m}Reason']}")

Metric             Baseline   Enhanced        Δ  Verdict
Summarization          0.71       0.57    -0.14  ▼ regressed
Coherence              0.70       0.70    +0.00  ── no change
Tonality               0.90       1.00    +0.10  ▲ improved
Safety                 0.80       0.90    +0.10  ▲ improved

DETAILED REASONS:

Summarization (Enhanced): The score is 0.57 because the summary contains contradictions to the original text regarding the primary barriers to scaling AI tools, and it includes extra, irrelevant details not mentioned in the original text. This significantly affects the overall accuracy and coherence of the summary.

Coherence (Enhanced): The summary follows a logical structure with a clear opening, body, and conclusion, though it could improve by breaking up dense information for readability. Ideas generally flow well, but occasional complex sentences lead to slight confusion. There are no contradictions in the text; however, some sections could benefit from cl

## Results & Commentary

The results were mixed. **Tonality** and **Safety** both improved (+0.10 each), which suggests the model maintained the Bureaucratese style more consistently and avoided unsafe content. **Coherence** stayed at 0.70, which is plausible: dense formal language tends to be harder to follow.

**Summarization** got worse, dropping from 0.71 to 0.57. Even though we told the model to stick strictly to the source, the evaluator reported contradictions with the original text and details it could not find there. This is a common challenge with LLMs called **hallucination**: the model sometimes "fills in" plausible-sounding details that are not in the original text.

This tells us that improving prompts can only take us so far. To fix this properly, we would need to look at other techniques, such as using a more powerful model or breaking the document into smaller pieces before summarizing. That is exactly why evaluation matters: without it, we would not even know the problem existed. The scores themselves come from an LLM judge and changed between runs (the first evaluation printed above reports different values than the baseline column of the comparison table), so small differences should be read with caution.
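
Breaking the document into smaller pieces could start from a simple character-based splitter such as the one sketched below. This is an assumption about how chunking might be done, not a method used in this notebook: the `chunk_text` helper, its default sizes, and the sample text are hypothetical. Each chunk would then be summarized separately and the partial summaries combined, which is not shown here.

```python
def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters to preserve some context
    across chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 3  # 30 characters of placeholder text
chunks = chunk_text(sample, chunk_size=12, overlap=2)
print([len(chunk) for chunk in chunks])
```

Splitting on characters ignores sentence and section boundaries; a splitter that respects them would likely give the model cleaner input.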

Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
