<a href="https://colab.research.google.com/github/baker-jr-john/automated-summary-evaluation-llm/blob/main/04_Summary_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📝 Summary Evaluation System - Live Demo
**John Baker | EDUC 6192 Final Project | December 2025**

This notebook creates a live Gradio interface for evaluating student summaries of "The Challenge of Exploring Venus" using Llama 3.1 8B.

---

## Step 1: Install Dependencies
Run this cell first (takes ~2 minutes).

In [1]:
# Install required packages
!pip install -q transformers accelerate bitsandbytes gradio torch

print("✅ Dependencies installed successfully!")

✅ Dependencies installed successfully!
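
Because `pip install -q` suppresses most output, a quick check that each package can actually be located may catch a failed install before the long model load. The sketch below uses only the standard library; `REQUIRED` and `find_missing` are illustrative names, not part of the original notebook.

```python
import importlib.util

# Import names of the packages installed in the cell above.
REQUIRED = ["transformers", "accelerate", "bitsandbytes", "gradio", "torch"]

def find_missing(packages):
    """Return the packages that Python cannot locate (nothing is imported)."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

This only confirms that the packages are present, not that they are compatible with each other or with the GPU runtime.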


## Step 2: Load the Model
This takes 2-3 minutes. You'll see progress updates.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import json
import re
import gc
import gradio as gr

print("Loading Llama 3.1 8B with 4-bit quantization...")
print("This takes 2-3 minutes. Please wait...\n")

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# 4-bit quantization for Colab free tier
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load tokenizer
print("[1/2] Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load model
print("[2/2] Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.float16
)

print("\n✅ Model loaded successfully!")
print(f"   Device: {model.device}")
print(f"   Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
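
As a rough illustration of why 4-bit quantization matters here, the arithmetic below estimates the memory needed for the weights alone. It assumes a round figure of 8 billion parameters, and `weight_memory_gib` is a hypothetical helper. The `Memory:` value printed above will typically be higher, since not every layer is quantized and quantization adds some overhead of its own.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 8e9  # assumption: round figure for an "8B" model

fp16 = weight_memory_gib(NUM_PARAMS, 16)
nf4 = weight_memory_gib(NUM_PARAMS, 4)

print(f"fp16 weights:  ~{fp16:.1f} GiB")  # ~14.9 GiB
print(f"4-bit weights: ~{nf4:.1f} GiB")   # ~3.7 GiB
```

Under these assumptions, half-precision weights would fill most of a 16 GB card before any text is generated, while the 4-bit weights leave room for activations and the prompt.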

## Step 3: Define Source Text and Evaluation Functions

In [3]:
# Source text from "The Challenge of Exploring Venus"
VENUS_SOURCE_TEXT = """THE CHALLENGE OF EXPLORING VENUS

Venus, sometimes called the "Evening Star," is one of the brightest points of light in the night sky, making it simple for even an amateur stargazer to spot. However, this nickname is misleading since Venus is actually a planet. While Venus is simple to see from the distant but safe vantage point of Earth, it has proved a very challenging place to examine more closely.

Often referred to as Earth's "twin," Venus is the closest planet to Earth in terms of density and size, and occasionally the closest in distance too. Earth and Venus, along with Mars, our other planetary neighbor, orbit the sun at different speeds. These differences in speed mean that sometimes we are closer to Mars and other times to Venus. Because Venus is sometimes right around the corner—in space terms—humans have sent numerous spacecraft to land on this cloud-draped world. Each previous mission was unmanned, and for good reason, since no spacecraft survived the landing for more than a few hours. Maybe this issue explains why not a single spaceship has touched down on Venus in more than three decades. Numerous factors contribute to Venus's reputation as a challenging planet for humans to study, despite its proximity to us.

A thick atmosphere of almost 97 percent carbon dioxide blankets Venus. Even more challenging are the clouds of highly corrosive sulfuric acid in Venus's atmosphere. On the planet's surface, temperatures average over 800 degrees Fahrenheit, and the atmospheric pressure is 90 times greater than what we experience on our own planet. These conditions are far more extreme than anything humans encounter on Earth; such an environment would crush even a submarine accustomed to diving to the deepest parts of our oceans and would liquefy many metals. Also notable, Venus has the hottest surface temperature of any planet in our solar system, even though Mercury is closer to our sun. Beyond high pressure and heat, Venusian geology and weather present additional impediments like erupting volcanoes, powerful earthquakes, and frequent lightning strikes to probes seeking to land on its surface.

If our sister is so inhospitable, why are scientists even discussing further visits to its surface? Astronomers are fascinated by Venus because it may well once have been the most Earth-like planet in our solar system. Long ago, Venus was probably covered largely with oceans and could have supported various forms of life, just like Earth. Today, Venus still has some features that are analogous to those on Earth. The planet has a surface of rocky sediment and includes familiar features such as valleys, mountains, and craters. Furthermore, recall that Venus can sometimes be our nearest option for a planetary visit, a crucial consideration given the long time frames of space travel. The value of returning to Venus seems indisputable, but what are the options for making such a mission both safe and scientifically productive?

The National Aeronautics and Space Administration (NASA) has one particularly compelling idea for sending humans to study Venus. NASA's possible solution to the hostile conditions on the surface of Venus would allow scientists to float above the fray. Imagine a blimp-like vehicle hovering 30 or so miles above the roiling Venusian landscape. Just as our jet airplanes travel at a higher altitude to fly over many storms, a vehicle hovering over Venus would avoid the unfriendly ground conditions by staying up and out of the way. At thirty-plus miles above the surface, temperatures would still be toasty at around 170 degrees Fahrenheit, but the air pressure would be close to that of sea level on Earth. Solar power would be plentiful, and radiation would not exceed Earth's levels. Not easy conditions, but survivable for humans.

However, peering at Venus from a ship orbiting or hovering safely far above the planet can provide only limited insight into ground conditions, rendering standard forms of photography and videography ineffective. More importantly, researchers cannot take samples of rock, gas, or anything else from a distance. Therefore, scientists seeking to conduct a thorough mission to understand Venus would need to get up close and personal despite the risks. Or maybe we should think of them as challenges. Many researchers are working on innovations that would allow our machines to last long enough to contribute meaningfully to our knowledge of Venus.

NASA is working on other approaches to studying Venus. For example, some simplified electronics made of silicon carbide have been tested in a chamber simulating the chaos of Venus's surface and have lasted for three weeks in such conditions. Another project is looking back at an old technology called mechanical computers. These devices were first envisioned in the 1800s and played an important role in the 1940s during World War II. The thought of computers existing in those days may sound shocking, but these devices made calculations by using gears and levers and did not require electronics at all. Modern computers are enormously powerful, flexible, and quick, but tend to be more delicate when it comes to extreme physical conditions. Just imagine exposing a cell phone or tablet to acid or heat capable of melting tin. By comparison, systems that use mechanical parts can be made more resistant to pressure, heat, and other forces.

Striving to meet the challenge presented by Venus has value, not only because of the insight to be gained on the planet itself, but also because human curiosity will likely lead us into many equally intimidating endeavors. Our travels on Earth and beyond should not be limited by dangers and doubts but should be expanded to meet the very edges of imagination and innovation."""

print("✅ Source text loaded")

✅ Source text loaded


In [4]:
def create_evaluation_prompt(summary, source_text):
    """
    Final Prompt: 'Innocent until proven guilty' logic for Conciseness.
    """

    prompt = f"""You are a strict teacher grading a student summary.
Compare the summary CAREFULLY to the source text.

SOURCE TEXT:
{source_text}

RUBRIC:
ACCURACY (STRICT):
5=Perfectly accurate.
4=Minor details off.
3=Mix of true/false.
2=Contains "hallucinations" (facts not in text).
1=Completely wrong.

CONCISENESS (LENIENT):
5=Effective Standard English. (Includes intro, transitions, and conclusion).
4=Mostly efficient.
3=Visibly redundant (repeats the exact same sentence twice).
2=Excessively wordy.
1=Rambling.

---
INSTRUCTIONS:
1. ACCURACY: Be strict. If the summary claims things not in the source (like "fast technology", "humans living there", or "oceans"), the score MUST be 2 or 1.
2. CONCISENESS: Be generous. Start at Score 5.
   - Do NOT penalize for standard essay features (intro, conclusion, transitions).
   - Do NOT penalize the student for discussing the author's repetition.
   - Only give a low score if the student repeats the EXACT SAME thought multiple times in a row.

FORMAT:
Provide your response as a simple list. Do not use JSON. Use exactly this format:

Completeness Score: <number>
Completeness Feedback: <text>
Accuracy Score: <number>
Accuracy Feedback: <text>
Coherence Score: <number>
Coherence Feedback: <text>
Conciseness Score: <number>
Conciseness Feedback: <text>

---
STUDENT SUMMARY:
{summary}

YOUR EVALUATION:"""

    return prompt

def evaluate_summary(summary, model, tokenizer, source_text):
    """
    Evaluation Function - TOKEN SLICING + CLEANUP
    """

    if not summary.strip():
        return {"error": "Please enter a summary to evaluate"}

    # 1. Clear Memory
    gc.collect()
    torch.cuda.empty_cache()

    # 2. Create Prompt
    prompt = create_evaluation_prompt(summary, source_text)

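    # Tokenize. Truncation trims the END of the prompt by default, so a long
    # summary could lose its tail (and the "YOUR EVALUATION:" cue) at 2048 tokens.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}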

    # 3. Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=600,
            temperature=0.1,
            do_sample=True,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id
        )

    # Slice only new tokens
    input_length = inputs["input_ids"].shape[1]
    generated_tokens = outputs[0][input_length:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    # Sanitize: drop any characters that cannot be encoded as UTF-8
    try:
        response = response.encode('utf-8', 'ignore').decode('utf-8')
    except UnicodeError:
        pass

    # 4. Parse
    return parse_line_items(response)


def parse_line_items(text):
    """
    Universal Parser with 'Chatty Cleanup'
    Stops reading feedback if the model starts chatting (e.g., 'Note:', 'Let me know')
    """
    if "YOUR EVALUATION:" in text:
        text = text.split("YOUR EVALUATION:")[-1]

    keys = [
        "Completeness Score", "Completeness Feedback",
        "Accuracy Score", "Accuracy Feedback",
        "Coherence Score", "Coherence Feedback",
        "Conciseness Score", "Conciseness Feedback"
    ]

    # Find keys
    found_keys = []
    for key in keys:
        pattern = rf"{key}[:\s]"
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            found_keys.append({"key": key, "start": match.start(), "end": match.end()})

    found_keys.sort(key=lambda x: x["start"])

    results = {}
    for i, item in enumerate(found_keys):
        current_key = item["key"]
        start_content = item["end"]

        if i < len(found_keys) - 1:
            end_content = found_keys[i + 1]["start"]
        else:
            end_content = len(text)

        content = text[start_content:end_content].strip()
        content = re.sub(r"^[:\-\s]+", "", content).strip()

        # --- 🧹 CLEANUP LOGIC ---
        # If this is a feedback field, cut off common conversational markers
        if "Feedback" in current_key:
            # List of phrases that indicate the model has stopped grading and started chatting
            stop_phrases = [
                "Note:", "Let me know", "I hope", "I'd be happy", "Best regards",
                "My response:", "Completeness Score:", "---"
            ]
            for phrase in stop_phrases:
                if phrase in content:
                    content = content.split(phrase)[0].strip()

            # Also cut on double newlines if they look like paragraph breaks to a signature
            # (Optional: keep if you want multi-paragraph feedback, but usually safer to cut)
            if "\n\n" in content:
                content = content.split("\n\n")[0].strip()
        # ------------------------

        results[current_key] = content

    scores = {}
    dims = ["Completeness", "Accuracy", "Coherence", "Conciseness"]

    for dim in dims:
        score_key = f"{dim} Score"
        fb_key = f"{dim} Feedback"

        raw_score = results.get(score_key, "0")
        score_match = re.search(r"\d", raw_score)
        score = int(score_match.group(0)) if score_match else 0

        feedback = results.get(fb_key, "Could not parse feedback.")
        scores[dim.lower()] = {"score": score, "feedback": feedback}

    return scores


def format_results(evaluation):
    """Format results for display"""
    if "error" in evaluation:
        return f"❌ **Error:** {evaluation['error']}\n\n**Raw Output:**\n{evaluation.get('raw', '')}"

    results = "## 📊 Evaluation Results\n\n"

    display_map = [
        ("Completeness", "completeness", "📝"),
        ("Accuracy", "accuracy", "✅"),
        ("Coherence", "coherence", "🔗"),
        ("Conciseness", "conciseness", "✂️")
    ]

    for name, key, emoji in display_map:
        if key in evaluation:
            item = evaluation[key]
            score = item.get("score", 0)
            feedback = item.get("feedback", "No feedback")

            bar = "🟦" * score + "⬜" * (5 - score)

            results += f"### {emoji} {name}: {score}/5\n"
            results += f"{bar}\n\n"
            results += f"**Feedback:** {feedback}\n\n"
            results += "---\n\n"

    return results

print("✅ Final Cleanup Logic Applied")

✅ Final Cleanup Logic Applied
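
The score-parsing step in `parse_line_items` keeps the first digit it finds in the score field. The sketch below mirrors that logic and compares it with a stricter variant, to show which model outputs might be worth logging as edge cases. `extract_score_strict` is hypothetical and not part of the notebook, and the sample strings are invented for illustration.

```python
import re

def extract_score_first_digit(raw: str) -> int:
    """Mirror of the notebook's logic: first digit found, else 0."""
    match = re.search(r"\d", raw)
    return int(match.group(0)) if match else 0

def extract_score_strict(raw: str):
    """Hypothetical variant: accept only a standalone digit from 1 to 5."""
    match = re.search(r"(?<!\d)([1-5])(?!\d)", raw)
    return int(match.group(1)) if match else None

# Invented examples of what a model might write after "Accuracy Score:".
for raw in ["4", "4/5", "10", "Score of 3 out of 5", "N/A"]:
    first = extract_score_first_digit(raw)
    strict = extract_score_strict(raw)
    print(f"{raw!r:>22} -> first digit: {first}, strict: {strict}")
```

With the first-digit rule, an out-of-range reply such as "10" is read as 1 and an unparseable reply becomes 0, so a displayed score of 0 or 1 may reflect a parsing issue rather than a harsh grade.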


## Step 4: Launch Gradio Interface
This creates a shareable web interface for live evaluations.

In [5]:
def gradio_evaluate(summary):
    """Wrapper for Gradio"""
    evaluation = evaluate_summary(summary, model, tokenizer, VENUS_SOURCE_TEXT)
    return format_results(evaluation)

# Create interface
with gr.Blocks(title="Summary Evaluation System", theme=gr.themes.Soft()) as demo:
    gr.Markdown("""
    # 📝 Summary Evaluation System
    ### Automated Rubric-Based Feedback for "The Challenge of Exploring Venus"

    **Instructions:** Paste a student summary below and click "Evaluate" to receive instant feedback.
    """)

    with gr.Row():
        with gr.Column():
            summary_input = gr.Textbox(
                label="Student Summary",
                placeholder="Paste the summary here...",
                lines=10
            )

            evaluate_btn = gr.Button("🎯 Evaluate Summary", variant="primary", size="lg")

        with gr.Column():
            results_output = gr.Markdown(
                value="*Results will appear here*"
            )

    evaluate_btn.click(
        fn=gradio_evaluate,
        inputs=summary_input,
        outputs=results_output
    )

    gr.Markdown("""
    ---
    **Project:** Automated Summary Evaluation | **Developer:** John Baker | **Course:** EDUC 6192
    """)

# Launch with share=True for public URL
demo.launch(share=True, debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://44373e80ab84ae5c97.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://44373e80ab84ae5c97.gradio.live




---
## 🎯 Next Steps for Demo Day (Dec 10)

1. **Test with validation summaries** - Copy-paste from your scored set
2. **Check consistency** - Run same summary multiple times
3. **Document edge cases** - Note any parsing failures
4. **Polish UI** - Adjust colors, add more examples
5. **Prepare talking points** - Explain how rubric guides evaluation
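
One possible way to approach the consistency check in step 2 is to score the same summary several times and look at the spread per dimension. Since generation uses `do_sample=True`, some variation between runs is plausible. In the sketch below, `score_spread` and `fake_evaluate` are hypothetical names; the fake evaluator stands in for the model so the example runs without a GPU.

```python
import random
import statistics

DIMENSIONS = ["completeness", "accuracy", "coherence", "conciseness"]

def score_spread(evaluate_fn, summary, runs=5):
    """Call evaluate_fn(summary) `runs` times and summarise each dimension."""
    collected = {dim: [] for dim in DIMENSIONS}
    for _ in range(runs):
        evaluation = evaluate_fn(summary)
        for dim in DIMENSIONS:
            collected[dim].append(evaluation[dim]["score"])
    return {
        dim: {
            "scores": scores,
            "mean": statistics.mean(scores),
            "range": max(scores) - min(scores),
        }
        for dim, scores in collected.items()
    }

# Stand-in evaluator (assumption): returns dicts shaped like parse_line_items output.
def fake_evaluate(summary, _rng=random.Random(0)):
    return {dim: {"score": _rng.choice([3, 4]), "feedback": ""} for dim in DIMENSIONS}

report = score_spread(fake_evaluate, "Venus is hard to explore.", runs=5)
for dim, stats in report.items():
    print(f"{dim:<13} scores={stats['scores']} range={stats['range']}")
```

In the notebook, the stand-in could be swapped for `lambda s: evaluate_summary(s, model, tokenizer, VENUS_SOURCE_TEXT)`. A range of 0 on every dimension would suggest stable scoring for that summary.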

**Backup plan:** If Gradio has issues, you can always run `evaluate_summary()` directly and show results in notebook output.
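
For the backup plan, a plain-text rendering may be easier to read in cell output than raw Markdown. This is a minimal sketch: `render_plain` is a hypothetical helper, and `sample` is an invented evaluation shaped like the dictionary that `parse_line_items` returns.

```python
def render_plain(evaluation):
    """Render an evaluation dict as plain text for notebook output."""
    if "error" in evaluation:
        return f"Error: {evaluation['error']}"
    lines = []
    for dim in ("completeness", "accuracy", "coherence", "conciseness"):
        item = evaluation.get(dim, {})
        score = item.get("score", 0)
        feedback = item.get("feedback", "No feedback")
        lines.append(f"{dim.capitalize():<13} {score}/5  {feedback}")
    return "\n".join(lines)

# Invented example values, for illustration only.
sample = {
    "completeness": {"score": 4, "feedback": "Covers most main ideas."},
    "accuracy": {"score": 5, "feedback": "No unsupported claims."},
    "coherence": {"score": 4, "feedback": "Logical order."},
    "conciseness": {"score": 5, "feedback": "No repetition."},
}
print(render_plain(sample))
```

In the notebook, the same helper could be applied to a live result with `print(render_plain(evaluate_summary(text, model, tokenizer, VENUS_SOURCE_TEXT)))`, where `text` is the summary to score.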