# AI Summarisation Tool

Quickly summarize long articles, research papers, and documents using AI. Then evaluate how good your summaries are with objective metrics so you can improve them.

**What we'll do:**
1. Load an article or document
2. Generate a summary using an LLM
3. Check how good it is using evaluation metrics
4. Get insights on how to improve

Let's get started! 🚀

## Step 1: Install & Import Dependencies

In [None]:
# Uncomment to install (first time only)
# !pip install ragas langchain-openai python-dotenv pypdf python-docx

import os
import asyncio
from typing import Optional

# Ragas imports
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SummarizationScore, Faithfulness, DiscreteMetric
from ragas.llms import llm_factory

print("✅ Imports successful!")

## Step 2: Load Configuration from .env

First, let's load your API keys from the .env file (keep them out of code!)

To get started:
1. Copy `.env.example` to `.env`
2. Add your API key to `.env`
3. Run the cell below

In [None]:
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Get API key and model from .env
api_key = os.environ.get("OPENAI_API_KEY")
llm_model = os.environ.get("LLM_MODEL", "gpt-4o-mini")

# Verify API key is set
if not api_key:
    print("⚠️  OPENAI_API_KEY not found in .env")
    print("\nTo fix this:")
    print("1. Copy .env.example to .env")
    print("2. Add your OpenAI API key: OPENAI_API_KEY=sk-...")
    print("3. Re-run this cell")
else:
    # Initialize the LLM
    try:
        llm = llm_factory(llm_model)
        print("✅ LLM configured successfully")
        print(f"   Model: {llm_model}")
        print(f"   Provider: {llm.__class__.__name__}")
    except Exception as e:
        print(f"❌ Error initializing LLM: {e}")
        print(f"\nMake sure your API key is valid and set in .env")
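The cell above only prints a hint when the key is missing, so later cells would still fail because `llm` is never defined. A minimal sketch of a fail-fast alternative follows. `read_setting` and `mask_secret` are hypothetical helpers introduced here, not part of Ragas or python-dotenv, and the environment is passed as a plain mapping so the idea can be tried without real keys.

```python
import os
from typing import Mapping, Optional


def read_setting(name: str, default: Optional[str] = None,
                 env: Mapping[str, str] = os.environ) -> str:
    """Return a non-empty setting or raise with a hint about .env."""
    value = env.get(name, default)
    if value is None or not value.strip():
        raise RuntimeError(
            f"{name} is not set. Copy .env.example to .env and add {name}=..."
        )
    return value.strip()


def mask_secret(value: str, visible: int = 4) -> str:
    """Hide all but the last few characters so a key can be printed safely."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Example with a fake environment (the key below is a made-up placeholder)
fake_env = {"OPENAI_API_KEY": "sk-example-1234"}
print(mask_secret(read_setting("OPENAI_API_KEY", env=fake_env)))
print(read_setting("LLM_MODEL", default="gpt-4o-mini", env=fake_env))
```

Raising instead of printing stops the notebook at the cell that needs fixing, and masking avoids leaking the key into saved cell output.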

## Step 3: Load Example Article

We'll use built-in examples first. Later, modify this to load your own files.

In [None]:
# Example 1: Apple AI Investment
article_1 = """
Apple announced on Tuesday that it will invest $1 billion in new AI research centers
across the United States over the next five years. The company plans to hire 500 new
researchers and engineers specifically for AI development. CEO Tim Cook stated that
artificial intelligence is central to the company's future product strategy. The investment
will focus on areas like natural language processing, computer vision, and machine learning
efficiency. Apple will establish research hubs in San Francisco, Boston, and Seattle.
The company already employs over 10,000 AI researchers globally.
"""

# Example 2: Electric Vehicles & Environment
article_2 = """
A recent study by Stanford researchers found that electric vehicles (EVs) can reduce
carbon emissions by 70% compared to gasoline cars over their lifetime, when considering
manufacturing and electricity sources. However, this varies significantly by region based
on how electricity is generated. In regions with renewable energy sources, the reduction
can reach 90%, while in coal-dependent regions it drops to 40%. The study analyzed over
200 million vehicle registrations across 60 countries. Researchers emphasize that as power
grids become cleaner, the environmental benefit of EVs will only improve.
"""

# Example 3: Medical Treatment Approval
article_3 = """
The FDA approved a new diabetes treatment on Wednesday that requires injections only once
per week instead of daily. Clinical trials showed 85% of patients achieved their target
blood sugar levels within 3 months. The drug, developed by Novo Nordisk, is expected to
reduce patient burden and improve medication adherence. Side effects were minimal and similar
to existing treatments. The drug will be priced at $400 per month, which pharmaceutical
experts say is competitive with existing weekly injection alternatives.
"""

# Select which article to use
selected_article = article_1  # Change to article_2 or article_3 to try others

print("📰 Selected Article:")
print("="*70)
print(selected_article.strip())
print("="*70)

## Step 4: Generate Summary

Create a summary using an LLM (or provide your own).

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

# Initialize LLM for summarization
summarizer_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.3)

# Create summarization prompt
summarization_prompt = PromptTemplate(
    input_variables=["text"],
    template="""Summarize the following text in 2-3 sentences. Focus on the main points and key details.

Text:
{text}

Summary:"""
)

# Generate summary
chain = summarization_prompt | summarizer_llm
summary_response = chain.invoke({"text": selected_article})
generated_summary = summary_response.content.strip()

print("✍️  Generated Summary:")
print("="*70)
print(generated_summary)
print("="*70)
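To see exactly what the model receives, the template can be rendered by hand. The sketch below mirrors the prompt above using plain `str.format`. `build_prompt` and its `sentences` parameter are illustrative names introduced here, not part of LangChain.

```python
PROMPT_TEMPLATE = (
    "Summarize the following text in {sentences} sentences. "
    "Focus on the main points and key details.\n\n"
    "Text:\n{text}\n\nSummary:"
)


def build_prompt(text: str, sentences: str = "2-3") -> str:
    """Fill the template; whitespace around the article is trimmed."""
    return PROMPT_TEMPLATE.format(sentences=sentences, text=text.strip())


# Short stand-in text; in the notebook this would be selected_article
print(build_prompt("\nApple will invest $1 billion in AI research.\n"))
```

Printing the rendered prompt is a quick way to check that the article was inserted as expected before spending API calls.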

## Step 5: Prepare for Evaluation

Each Ragas metric reads specific fields from the sample. `SummarizationScore` expects the original text in `reference_contexts`, while `Faithfulness` checks the summary against `retrieved_contexts`, so we build one sample for each.

In [None]:
# Sample for SummarizationScore: the original text goes in reference_contexts
sample = SingleTurnSample(
    user_input="",  # Not used for summarization metrics
    response=generated_summary,
    reference_contexts=[selected_article.strip()],  # Original article as context
)

# Sample for Faithfulness: the summary is checked against retrieved_contexts
faithfulness_sample = SingleTurnSample(
    user_input=selected_article.strip(),
    response=generated_summary,
    retrieved_contexts=[selected_article.strip()],  # Source text the claims must be grounded in
)

print("✅ Evaluation samples prepared")
print(f"   Article length: {len(selected_article.split())} words")
print(f"   Summary length: {len(generated_summary.split())} words")
print(f"   Compression ratio: {len(selected_article.split()) / len(generated_summary.split()):.1f}x")
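The compression ratio above divides by the summary's word count, which raises `ZeroDivisionError` if the model returns an empty string. A small guarded version might look like the sketch below. The function names are hypothetical, and returning infinity for an empty summary is one possible convention rather than a standard.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def compression_ratio(article: str, summary: str) -> float:
    """Article words divided by summary words; inf when the summary is empty."""
    summary_words = word_count(summary)
    if summary_words == 0:
        return float("inf")
    return word_count(article) / summary_words


article = "one two three four five six seven eight nine ten"
print(f"{compression_ratio(article, 'one two'):.1f}x")  # 10 words / 2 words
print(compression_ratio(article, ""))
```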

## Step 6: Verify the Evaluation Samples

Reuse the samples from Step 5 and confirm they are populated before running the metrics.

In [None]:
# Sanity-check the Step 5 samples (rebuilding `sample` here would drop its reference_contexts)
assert sample.response, "Summary is empty; re-run Step 4"
assert sample.reference_contexts, "Source article is missing; re-run Step 5"
assert faithfulness_sample.user_input, "Source article is missing; re-run Step 5"

print("✅ Samples ready for evaluation")
print(f"   Article length: {len(selected_article.split())} words")
print(f"   Summary length: {len(generated_summary.split())} words")

## Step 7: Evaluate Summary Quality

Now let's evaluate how good the summary is using three metrics. These metrics will help us understand if we should trust the summary or improve it.

In [None]:
# Initialize the new metrics with the LLM
summarization_metric = SummarizationScore(llm=llm)
faithfulness_metric = Faithfulness(llm=llm)

# Create a discrete metric for overall quality assessment
quality_metric = DiscreteMetric(
    name="summary_quality",
    criteria="Evaluate if the summary is accurate, complete, and well-written",
    allowed_values=["poor", "fair", "good", "excellent"],
    llm=llm,
)

print("✅ Evaluation metrics initialized")
print("\n📊 We'll measure:")
print("   1. Summarization Score - QA-based + conciseness")
print("   2. Faithfulness - Is summary grounded in source?")
print("   3. Overall Quality - Categorical assessment (poor/fair/good/excellent)")

## Step 8: Run Quality Check 🔍

This will evaluate the summary. Depending on your LLM, this takes 30-60 seconds.

In [None]:
async def evaluate_summary():
    """Run all evaluation metrics"""
    print("🔍 Evaluating summary...\n")
    
    try:
        # 1. Summarization Score (QA-based + conciseness)
        print("⏳ Computing Summarization Score...")
        summ_score = await summarization_metric.single_turn_ascore(sample)
        print(f"   ✅ Summarization Score: {summ_score:.3f}\n")
        
        # 2. Faithfulness (is it grounded in source?)
        print("⏳ Computing Faithfulness...")
        faith_score = await faithfulness_metric.single_turn_ascore(faithfulness_sample)
        print(f"   ✅ Faithfulness: {faith_score:.3f}\n")
        
        # 3. Overall Quality (discrete: poor/fair/good/excellent)
        print("⏳ Computing Overall Quality...")
        quality_result = await quality_metric.ascore(faithfulness_sample)
        print(f"   ✅ Overall Quality: {quality_result}\n")
        
        return {
            "summarization_score": summ_score,
            "faithfulness": faith_score,
            "overall_quality": quality_result
        }
    
    except Exception as e:
        print(f"❌ Error during evaluation: {e}")
        print("\n💡 Tip: Make sure your LLM API key is set correctly!")
        import traceback
        traceback.print_exc()
        return None

# Run evaluation
results = await evaluate_summary()

if results:
    print("="*70)
    print("EVALUATION COMPLETE!")
    print("="*70)
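The three metrics above are awaited one after another, so the total time is roughly the sum of the three calls. Since they do not depend on each other, they could also run concurrently. The sketch below shows the pattern with `asyncio.gather`. `fake_metric` is a placeholder that sleeps instead of calling an LLM, and its return values are made up, so the pattern can be tried without API access.

```python
import asyncio
import time


async def fake_metric(name: str, delay: float, value: float) -> tuple:
    """Stand-in for a metric call: waits, then returns (name, value)."""
    await asyncio.sleep(delay)
    return name, value


async def evaluate_concurrently() -> dict:
    """Start all stand-in metrics at once and collect whatever succeeds."""
    outcomes = await asyncio.gather(
        fake_metric("summarization_score", 0.2, 0.71),
        fake_metric("faithfulness", 0.2, 0.93),
        fake_metric("overall_quality", 0.2, 0.5),
        return_exceptions=True,  # a failed metric is returned, not raised
    )
    results = {}
    for outcome in outcomes:
        if isinstance(outcome, Exception):
            print(f"metric failed: {outcome}")
            continue
        name, value = outcome
        results[name] = value
    return results


start = time.perf_counter()
# In a notebook cell, use `await evaluate_concurrently()` instead of asyncio.run
scores = asyncio.run(evaluate_concurrently())
elapsed = time.perf_counter() - start
print(scores, f"{elapsed:.2f}s")
```

With real metrics, the stand-ins would be replaced by the metric coroutines from the cell above. Provider rate limits may reduce the benefit.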

## Step 9: Review Results 📊

Let's see how good your summary is and what to improve.

In [None]:
if results:
    summ_score = results["summarization_score"]
    faith = results["faithfulness"]
    quality = results["overall_quality"]
    
    # Display scores
    print("\n📊 EVALUATION RESULTS:\n")
    
    print(f"1️⃣  SUMMARIZATION SCORE: {summ_score:.3f}/1.0")
    print("   Measures how well the summary captures key information")
    print("   (Based on QA + conciseness)")
    if summ_score >= 0.8:
        print("   ✅ Excellent - Summary captures content well")
    elif summ_score >= 0.6:
        print("   ✅ Good - Summary covers main points")
    elif summ_score >= 0.4:
        print("   ⚠️  Fair - Summary could be improved")
    else:
        print("   ❌ Poor - Summary is missing important content")
    
    print(f"\n2️⃣  FAITHFULNESS: {faith:.3f}/1.0")
    print("   Is the summary factually grounded in the source?")
    if faith >= 0.9:
        print("   ✅ Excellent - Summary sticks to source material")
    elif faith >= 0.7:
        print("   ✅ Good - Mostly accurate")
    elif faith >= 0.5:
        print("   ⚠️  Fair - Some ungrounded claims")
    else:
        print("   ❌ Poor - Contains hallucinations")
    
    print(f"\n3️⃣  OVERALL QUALITY: {quality}")
    print("   Expert assessment of summary quality")
    quality_color = {
        "excellent": "✅",
        "good": "✅",
        "fair": "⚠️",
        "poor": "❌"
    }
    print(f"   {quality_color.get(quality, '❓')} {quality.upper()}")
    
    # Overall verdict
    print("\n" + "="*70)
    print("VERDICT:")
    print("="*70)
    
    if summ_score >= 0.8 and faith >= 0.8 and quality in ["good", "excellent"]:
        print("✅ HIGH QUALITY SUMMARY")
        print("\n   This summary is accurate, complete, and trustworthy.")
        print("   Safe to publish or share with confidence.")
    elif summ_score >= 0.6 and faith >= 0.7 and quality in ["fair", "good", "excellent"]:
        print("✅ GOOD QUALITY SUMMARY")
        print("\n   Summary is generally accurate and captures key points.")
        print("   Minor review recommended before publishing.")
    elif summ_score >= 0.4 and faith >= 0.5:
        print("⚠️  ACCEPTABLE SUMMARY")
        print("\n   Summary has merit but needs improvement.")
        print("   Consider revising the summarization prompt or trying a different model.")
    else:
        print("❌ POOR QUALITY SUMMARY")
        print("\n   Summary has significant issues.")
        print("   Recommend regenerating with different approaches.")
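The thresholds in the cell above are spread across several `if`/`elif` chains. One possible refactor, sketched below, moves them into data and small functions so they can be tested and tuned in one place. The names `band`, `verdict`, and the two `*_BANDS` lists are introduced here for illustration; the cutoffs are copied from the cell above.

```python
def band(score: float, cutoffs: list) -> str:
    """Return the label of the first cutoff the score meets, else 'Poor'."""
    for minimum, label in cutoffs:
        if score >= minimum:
            return label
    return "Poor"


SUMMARIZATION_BANDS = [(0.8, "Excellent"), (0.6, "Good"), (0.4, "Fair")]
FAITHFULNESS_BANDS = [(0.9, "Excellent"), (0.7, "Good"), (0.5, "Fair")]


def verdict(summ_score: float, faith: float, quality: str) -> str:
    """Same rules as the if/elif chain above, returned as a label."""
    if summ_score >= 0.8 and faith >= 0.8 and quality in ("good", "excellent"):
        return "HIGH QUALITY"
    if summ_score >= 0.6 and faith >= 0.7 and quality in ("fair", "good", "excellent"):
        return "GOOD QUALITY"
    if summ_score >= 0.4 and faith >= 0.5:
        return "ACCEPTABLE"
    return "POOR QUALITY"


print(band(0.65, SUMMARIZATION_BANDS))
print(band(0.65, FAITHFULNESS_BANDS))
print(verdict(0.85, 0.95, "good"))
print(verdict(0.85, 0.95, "poor"))
```

Note that the third rule ignores the categorical rating, so a summary rated "poor" with high numeric scores still lands in the acceptable tier. Tighten that rule if this is not the intended behavior.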

## Step 10: Try With Your Own Content

Now you can load your own articles and evaluate them. Modify the code below to load your files.

In [None]:
# Option 1: Load from plain text file
# with open("/path/to/your/article.txt", "r", encoding="utf-8") as f:
#     your_article = f.read()

# Option 2: Load from PDF (requires pypdf)
# from pypdf import PdfReader
# reader = PdfReader("/path/to/your/document.pdf")
# your_article = "\n".join(page.extract_text() or "" for page in reader.pages)

# Option 3: Load from Word document (requires python-docx)
# from docx import Document
# doc = Document("/path/to/your/document.docx")
# your_article = "\n".join([p.text for p in doc.paragraphs])

# Option 4: Just paste your text directly
your_article = """
Paste your article here...
"""

print("📝 Ready to evaluate your own content!")
print(f"\nArticle loaded ({len(your_article.split())} words)")
print("\n💡 Next steps:")
print("   1. In Step 3, set selected_article = your_article and re-run that cell")
print("   2. (Optional) Adjust the summarization prompt in Step 4")
print("   3. Run Steps 4-9 again to evaluate")
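The loading options above can be folded into one helper that picks a reader from the file extension. This is a sketch under the assumption that `pypdf` and `python-docx` are installed when PDF or Word files are used. `load_article` is a hypothetical name, and only the plain-text branch is exercised in the example.

```python
import tempfile
from pathlib import Path


def load_article(path: str) -> str:
    """Read .txt/.md directly; .pdf and .docx need the optional packages."""
    file = Path(path)
    suffix = file.suffix.lower()
    if suffix in {".txt", ".md"}:
        return file.read_text(encoding="utf-8")
    if suffix == ".pdf":
        from pypdf import PdfReader  # optional dependency
        reader = PdfReader(str(file))
        # Fall back to "" for pages that yield no text (e.g. scanned images)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        from docx import Document  # provided by python-docx
        return "\n".join(p.text for p in Document(str(file)).paragraphs)
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")


# Try it on a throwaway text file
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "article.txt"
    sample_path.write_text("A short test article.", encoding="utf-8")
    print(load_article(str(sample_path)))
```

Importing the optional packages inside each branch means the helper still works for plain text when they are not installed.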

## 📚 Understanding the Evaluation Metrics

These metrics help you understand if your summary is good and how to improve it.

### What Faithfulness Tells You
- **Definition:** Is the summary based on the source text, or does it hallucinate?
- **Why care:** A summary that makes up facts is worse than useless
- **Low score means:** The LLM added information not in the original text
- **How to improve:** Use a different model or stricter summarization prompt
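Conceptually, a faithfulness score is the fraction of claims in the summary that the source supports. In Ragas an LLM extracts and checks the claims. The sketch below takes the per-claim verdicts as given (they are made up for illustration) and shows only the arithmetic. Returning 0.0 for an empty claim list is a choice made for this sketch.

```python
def faithfulness_ratio(verdicts: list) -> float:
    """Fraction of claims marked as supported; 0.0 when there are no claims."""
    if not verdicts:
        return 0.0
    return sum(1 for supported in verdicts if supported) / len(verdicts)


# Hypothetical claim verdicts for a summary of the Apple article in Step 3
claims = {
    "Apple will invest $1 billion in AI research centers": True,
    "Apple plans to hire 500 researchers and engineers": True,
    "The hubs will be in San Francisco, Boston, and Austin": False,  # source says Seattle
}
print(f"{faithfulness_ratio(list(claims.values())):.3f}")
```

Here two of three claims are supported, so a single wrong city name pulls the score down to about 0.667.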

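### What Summarization Score Tells You
- **Definition:** Does the summary capture the important points without running long? The score combines a question-answering check against the source with a conciseness component
- **Why care:** A summary that's accurate but missing key info defeats the purpose
- **Low score means:** Important information was left out, or the summary is nearly as long as the source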
- **How to improve:** Allow longer summaries or focus the prompt on key topics
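Step 7 describes the Summarization Score as "QA-based + conciseness". One way such a combination could work is sketched below. The equal weighting and the character-based length ratio are assumptions made for illustration, so check the Ragas documentation for the exact formula. All names are hypothetical.

```python
def conciseness(summary: str, source: str) -> float:
    """Near 1.0 for a very short summary, 0.0 when it is as long as the source."""
    if not source:
        return 0.0
    return 1.0 - min(len(summary), len(source)) / len(source)


def combined_score(qa_score: float, conciseness_score: float,
                   weight: float = 0.5) -> float:
    """Weighted average; `weight` is the share given to conciseness."""
    return (1.0 - weight) * qa_score + weight * conciseness_score


source = "x" * 400          # stand-in for a 400-character article
summary = "x" * 100         # stand-in for a 100-character summary
qa = 4 / 5                  # assume 4 of 5 generated questions were answered
c = conciseness(summary, source)
print(round(c, 3), round(combined_score(qa, c), 3))
```

The sketch shows the trade-off behind the score: a longer summary may answer more questions but loses conciseness, so the combined value does not always rise with length.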

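### What Overall Quality Tells You
- **Definition:** A categorical rating (poor, fair, good, or excellent) from the evaluator LLM on whether the summary is accurate, complete, and well-written
- **Why care:** A single label is easier to scan than two numeric scores
- **Low rating means:** The evaluator found problems with accuracy, coverage, or writing
- **How to improve:** Check the two numeric scores to see which aspect is weak, then adjust the prompt

### What ROUGE Score Tells You
- **Definition:** How similar is your summary to a reference summary? This notebook does not compute ROUGE because it has no reference summary; the notes below apply if you add one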
- **Why care:** Professional summaries tend to follow similar patterns
- **Low score means:** Your summary took a different approach (not necessarily bad)
- **How to improve:** Compare with reference and adjust your prompt
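For readers who do have a reference summary, the idea behind ROUGE can be sketched in a few lines. The version below is a simplified ROUGE-1 F1 (unigram overlap) with no stemming or punctuation handling, so its numbers will differ from dedicated ROUGE packages. The reference and candidate sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 on lowercased, whitespace-split tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped counts of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "apple will invest one billion dollars in ai research"
candidate = "apple plans to invest one billion in ai"
print(f"{rouge1_f1(candidate, reference):.3f}")
```

Six words are shared here, giving a precision of 6/8 and a recall of 6/9, which is an F1 of about 0.706 even though the two sentences say nearly the same thing.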

---

## 🚀 Tips for Better Summaries

1. **Experiment with prompts** - Try "Summarize in 3 bullet points" vs "Summarize in 1 paragraph"
2. **Try different models** - GPT-4 may score higher than GPT-3.5
3. **Adjust length** - Longer summaries capture more details (but less compression)
4. **Compare results** - Run the same article multiple times to see consistency
5. **Use as feedback loop** - Use the metrics to refine your summarization strategy
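For tip 4, a small helper can summarize how much a score moves between runs. The sketch below uses made-up scores. `consistency_report` is a hypothetical name, and it expects at least one score.

```python
from statistics import mean, pstdev


def consistency_report(scores: list) -> dict:
    """Summarize repeated scores for one article."""
    return {
        "runs": len(scores),
        "mean": mean(scores),
        "spread": pstdev(scores),          # population standard deviation
        "range": max(scores) - min(scores),
    }


# Made-up faithfulness scores from five hypothetical runs on the same article
runs = [0.90, 0.85, 0.95, 0.90, 0.90]
report = consistency_report(runs)
print({k: round(v, 3) for k, v in report.items()})
```

A large spread relative to the gap between two prompts or models suggests the difference between them may be noise rather than a real improvement.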

Happy summarizing! 🎉