<a href="https://colab.research.google.com/github/baker-jr-john/automated-summary-evaluation-llm/blob/main/05_Summary_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# üìù Summary Evaluation System - Live Demo
**John Baker | EDUC 6192 Final Project | December 2025**

This notebook creates a live Gradio interface for evaluating student summaries of "The Challenge of Exploring Venus" using Llama 3.1 8B.

---

## Step 1: Install Dependencies
Run this cell first (takes ~2 minutes)

In [1]:
# Install required packages
!pip install -q transformers accelerate bitsandbytes gradio torch

print("‚úÖ Dependencies installed successfully!")

[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m59.4/59.4 MB[0m [31m17.4 MB/s[0m eta [36m0:00:00[0m
[?25h‚úÖ Dependencies installed successfully!


## Step 2: Hugging Face Authentication

In [2]:
from google.colab import userdata
from huggingface_hub import login
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import json
import re
import gc
import gradio as gr

try:
    HF_TOKEN = userdata.get('HF_TOKEN')
    login(token=HF_TOKEN)
    print("‚úì Authenticated with Hugging Face (via Secrets)")
except Exception:
    print("Secret not found. Please enter your Hugging Face token manually:")
    login()
    print("‚úì Authenticated with Hugging Face (manual entry)")

‚úì Authenticated with Hugging Face (via Secrets)


## Step 3: Load the Model
This takes 2-3 minutes. You'll see progress updates.

In [None]:
print("Loading Llama 3.1 8B with 4-bit quantization...")
print("This takes 2-3 minutes. Please wait...\n")

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# 4-bit quantization for Colab free tier
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load tokenizer
print("[1/2] Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load model
print("[2/2] Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.float16
)

print("\n‚úÖ Model loaded successfully!")
print(f"   Device: {model.device}")
print(f"   Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

## Step 4: Define Source Text and Evaluation Functions

In [4]:
# Source text
VENUS_SOURCE_TEXT = """THE CHALLENGE OF EXPLORING VENUS

Venus, sometimes called the "Evening Star," is one of the brightest points of light in the night sky, making it simple for even an amateur stargazer to spot. However, this nickname is misleading since Venus is actually a planet. While Venus is simple to see from the distant but safe vantage point of Earth, it has proved a very challenging place to examine more closely.

Often referred to as Earth's "twin," Venus is the closest planet to Earth in terms of density and size, and occasionally the closest in distance too. Earth and Venus, along with Mars, our other planetary neighbor, orbit the sun at different speeds. These differences in speed mean that sometimes we are closer to Mars and other times to Venus. Because Venus is sometimes right around the corner‚Äîin space terms‚Äîhumans have sent numerous spacecraft to land on this cloud-draped world. Each previous mission was unmanned, and for good reason, since no spacecraft survived the landing for more than a few hours. Maybe this issue explains why not a single spaceship has touched down on Venus in more than three decades. Numerous factors contribute to Venus's reputation as a challenging planet for humans to study, despite its proximity to us.

A thick atmosphere of almost 97 percent carbon dioxide blankets Venus. Even more challenging are the clouds of highly corrosive sulfuric acid in Venus's atmosphere. On the planet's surface, temperatures average over 800 degrees Fahrenheit, and the atmospheric pressure is 90 times greater than what we experience on our own planet. These conditions are far more extreme than anything humans encounter on Earth; such an environment would crush even a submarine accustomed to diving to the deepest parts of our oceans and would liquefy many metals. Also notable, Venus has the hottest surface temperature of any planet in our solar system, even though Mercury is closer to our sun. Beyond high pressure and heat, Venusian geology and weather present additional impediments like erupting volcanoes, powerful earthquakes, and frequent lightning strikes to probes seeking to land on its surface.

If our sister is so inhospitable, why are scientists even discussing further visits to its surface? Astronomers are fascinated by Venus because it may well once have been the most Earth-like planet in our solar system. Long ago, Venus was probably covered largely with oceans and could have supported various forms of life, just like Earth. Today, Venus still has some features that are analogous to those on Earth. The planet has a surface of rocky sediment and includes familiar features such as valleys, mountains, and craters. Furthermore, recall that Venus can sometimes be our nearest option for a planetary visit, a crucial consideration given the long time frames of space travel. The value of returning to Venus seems indisputable, but what are the options for making such a mission both safe and scientifically productive?

The National Aeronautics and Space Administration (NASA) has one particularly compelling idea for sending humans to study Venus. NASA's possible solution to the hostile conditions on the surface of Venus would allow scientists to float above the fray. Imagine a blimp-like vehicle hovering 30 or so miles above the roiling Venusian landscape. Just as our jet airplanes travel at a higher altitude to fly over many storms, a vehicle hovering over Venus would avoid the unfriendly ground conditions by staying up and out of the way. At thirty-plus miles above the surface, temperatures would still be toasty at around 170 degrees Fahrenheit, but the air pressure would be close to that of sea level on Earth. Solar power would be plentiful, and radiation would not exceed Earth's levels. Not easy conditions, but survivable for humans.

However, peering at Venus from a ship orbiting or hovering safely far above the planet can provide only limited insight into ground conditions, rendering standard forms of photography and videography ineffective. More importantly, researchers cannot take samples of rock, gas, or anything else from a distance. Therefore, scientists seeking to conduct a thorough mission to understand Venus would need to get up close and personal despite the risks. Or maybe we should think of them as challenges. Many researchers are working on innovations that would allow our machines to last long enough to contribute meaningfully to our knowledge of Venus.

NASA is working on other approaches to studying Venus. For example, some simplified electronics made of silicon carbide have been tested in a chamber simulating the chaos of Venus's surface and have lasted for three weeks in such conditions. Another project is looking back at an old technology called mechanical computers. These devices were first envisioned in the 1800s and played an important role in the 1940s during World War II. The thought of computers existing in those days may sound shocking, but these devices made calculations by using gears and levers and did not require electronics at all. Modern computers are enormously powerful, flexible, and quick, but tend to be more delicate when it comes to extreme physical conditions. Just imagine exposing a cell phone or tablet to acid or heat capable of melting tin. By comparison, systems that use mechanical parts can be made more resistant to pressure, heat, and other forces.

Striving to meet the challenge presented by Venus has value, not only because of the insight to be gained on the planet itself, but also because human curiosity will likely lead us into many equally intimidating endeavors. Our travels on Earth and beyond should not be limited by dangers and doubts but should be expanded to meet the very edges of imagination and innovation."""

print("‚úÖ Source text loaded")

‚úÖ Source text loaded


In [5]:
# Evaluation funtions
def create_evaluation_messages(summary, source_text):
    """
    Constructs the chat messages.
    System role = The Teacher/Rubric (Static instructions)
    User role = The Student Summary (Variable input)
    """

    # 1. Everything that is "The Rules" goes into the SYSTEM message
    system_content = f"""You are a strict teacher grading a student summary.
Compare the summary CAREFULLY to the source text.

SOURCE TEXT:
{source_text}

RUBRIC:
COMPLETENESS:
5=Comprehensive. Covers ALL 3 main topics: (1) Venus's harsh conditions, (2) Scientific interest (Earth's twin), AND (3) NASA's proposed solutions (Blimps or Mechanical Computers).
4=Detailed but misses one minor aspect.
3=Partial. Covers the main idea but misses specific examples (e.g., mentions it's "hard" but doesn't explain the acid/pressure).
2=Incomplete/Vague. Focuses mostly on one aspect (e.g., just "why we study it") and misses the specific NASA solutions.
1=Barely touches the text.

ACCURACY:
5=Perfectly accurate.
4=Minor details off.
3=Mix of accurate quotes/facts and incorrect conclusions (e.g., swapping "Venus" for "Mars").
2=Contains major "hallucinations" (facts not in text).
1=Completely wrong.

---
INSTRUCTIONS:
1. COMPLETENESS CHECK: Does the student mention "Blimps", "Floating", or "Mechanical Computers"?
   - IF NO: The Completeness score MUST be 3 or lower. The summary is missing the "Solutions" section of the text.

2. ACCURACY CHECK:
   - The student summary mentions "Mars" in the conclusion. This is a LOGIC ERROR, not a hallucination, because they are misinterpreting the text's mention of planetary neighbors.
   - **CONSTRAINT:** If the student includes DIRECT QUOTES from the text, you CANNOT give an Accuracy score of 1 or 2. You must give at least a 3.

3. CONCISENESS: Be generous. Ignore spelling errors (like "lwad" or "intimdating"). Focus on flow.

4. FORMAT: Provide your response as a simple list. Do not use JSON. Use exactly this format:

Completeness Score: <number>
Completeness Feedback: <text>
Accuracy Score: <number>
Accuracy Feedback: <text>
Coherence Score: <number>
Coherence Feedback: <text>
Conciseness Score: <number>
Conciseness Feedback: <text>
"""

    # 2. The specific input to grade goes into the USER message
    user_content = f"""STUDENT SUMMARY:
{summary}

YOUR EVALUATION:"""

    messages = [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ]

    return messages

def evaluate_summary(summary, model, tokenizer, source_text):
    """
    Evaluation Function - Updated to use Chat Templates
    """

    if not summary.strip():
        return {"error": "Please enter a summary to evaluate"}

    # 1. Clear Memory
    gc.collect()
    torch.cuda.empty_cache()

    # 2. Create Messages & Apply Template
    messages = create_evaluation_messages(summary, source_text)

    # allow the tokenizer to format the prompt with Llama 3 special tokens
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    # 3. Generate
    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            max_new_tokens=600,
            temperature=0.1,
            do_sample=True,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id
        )

    # Slice only new tokens (remove the input prompt from the result)
    input_length = input_ids.shape[1]
    generated_tokens = outputs[0][input_length:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    # Sanitize
    try:
        response = response.encode('utf-8', 'ignore').decode('utf-8')
    except:
        pass

    # 4. Parse
    return parse_line_items(response)

def parse_line_items(text):
    """
    Universal Parser with 'Chatty Cleanup'
    Stops reading feedback if the model starts chatting (e.g., 'Note:', 'Let me know')
    """
    if "YOUR EVALUATION:" in text:
        text = text.split("YOUR EVALUATION:")[-1]

    keys = [
        "Completeness Score", "Completeness Feedback",
        "Accuracy Score", "Accuracy Feedback",
        "Coherence Score", "Coherence Feedback",
        "Conciseness Score", "Conciseness Feedback"
    ]

    # Find keys
    found_keys = []
    for key in keys:
        pattern = rf"{key}[:\s]"
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            found_keys.append({"key": key, "start": match.start(), "end": match.end()})

    found_keys.sort(key=lambda x: x["start"])

    results = {}
    for i, item in enumerate(found_keys):
        current_key = item["key"]
        start_content = item["end"]

        if i < len(found_keys) - 1:
            end_content = found_keys[i + 1]["start"]
        else:
            end_content = len(text)

        content = text[start_content:end_content].strip()
        content = re.sub(r"^[:\-\s]+", "", content).strip()

        # --- üßπ CLEANUP LOGIC ---
        # If this is a feedback field, cut off common conversational markers
        if "Feedback" in current_key:
            # List of phrases that indicate the model has stopped grading and started chatting
            stop_phrases = [
                "Note:", "Let me know", "I hope", "I'd be happy", "Best regards",
                "My response:", "Completeness Score:", "---"
            ]
            for phrase in stop_phrases:
                if phrase in content:
                    content = content.split(phrase)[0].strip()

            # Also cut on double newlines if they look like paragraph breaks to a signature
            # (Optional: keep if you want multi-paragraph feedback, but usually safer to cut)
            if "\n\n" in content:
                 content = content.split("\n\n")[0].strip()
        # ------------------------

        results[current_key] = content

    scores = {}
    dims = ["Completeness", "Accuracy", "Coherence", "Conciseness"]

    for dim in dims:
        score_key = f"{dim} Score"
        fb_key = f"{dim} Feedback"

        raw_score = results.get(score_key, "0")
        score_match = re.search(r"\d", raw_score)
        score = int(score_match.group(0)) if score_match else 0

        feedback = results.get(fb_key, "Could not parse feedback.")
        scores[dim.lower()] = {"score": score, "feedback": feedback}

    return scores


def format_results(evaluation):
    """Format results for display"""
    if "error" in evaluation:
        return f"‚ùå **Error:** {evaluation['error']}\n\n**Raw Output:**\n{evaluation.get('raw', '')}"

    results = "## üìä Evaluation Results\n\n"

    display_map = [
        ("Completeness", "completeness", "üìù"),
        ("Accuracy", "accuracy", "‚úÖ"),
        ("Coherence", "coherence", "üîó"),
        ("Conciseness", "conciseness", "‚úÇÔ∏è")
    ]

    for name, key, emoji in display_map:
        if key in evaluation:
            item = evaluation[key]
            score = item.get("score", 0)
            feedback = item.get("feedback", "No feedback")

            bar = "üü¶" * score + "‚¨ú" * (5 - score)

            results += f"### {emoji} {name}: {score}/5\n"
            results += f"{bar}\n\n"
            results += f"**Feedback:** {feedback}\n\n"
            results += "---\n\n"

    return results

print("‚úÖ Final Cleanup Logic Applied")

‚úÖ Final Cleanup Logic Applied


## Step 5: Launch Gradio Interface
This creates a shareable web interface for live evaluations

In [6]:
# --- GRADIO INTERFACE ---

def gradio_evaluate(summary):
    """Wrapper for Gradio"""
    if not summary or not summary.strip():
        return "‚ö†Ô∏è **Please enter a summary first.**"

    evaluation = evaluate_summary(summary, model, tokenizer, VENUS_SOURCE_TEXT)
    return format_results(evaluation)

# --- REAL EXAMPLES FROM DATASET ---

# Essay ID: SYNTH_V_17_S3
example_good = """In the article "The Challenge of Exploring Venus," the author argues that studying Venus is worth the dangers it presents. Overall, the author does a decent job of supporting this claim by explaining the extreme conditions of Venus and the potential knowledge we could gain from exploring it. However, there are some areas where the support could be stronger.

First, the author provides vivid details about how harsh Venus is. They mention that the atmosphere is made up of almost 97 percent carbon dioxide and has clouds of sulfuric acid. Plus, the average temperature is over 800 degrees Fahrenheit, which is super hot! These points show why exploring Venus is dangerous, but they also highlight that the challenges should not stop us from trying. The author explains that scientists are thinking of ways, like using a blimp-like vehicle, to overcome these dangers, which is a strong point. This suggests that finding solutions to challenges is important for exploration.

Also, the author talks about how Venus might have once been Earth-like, which makes studying it even more interesting. Learning about its history could help us understand more about our own planet. However, the author repeats some ideas about the heat and pressure that can make the writing feel choppy. These points are important, but mentioning them multiple times doesn't add new information.

In conclusion, the author supports the idea that studying Venus is a worthy pursuit, but the argument could be more effective with less repetition. They provide good evidence about Venus‚Äôs dangerous conditions and the potential discoveries, but tightening the writing would make it even stronger. Overall, the author‚Äôs points show that curiosity and innovation can help us tackle the challenges of exploring Venus."""

# Essay ID: AAAVUP14319000051516
example_hallucination = """In "the challenge of exploring venus ," the author suggests that studying venus is a worthy pursuit despite the dangers it presents . becauce in the text it says at paragraph eight "striving to meet challenge presented by venus has value , not only because of the insight to be gained on the planet itself , but also becauce human curiosity will likely lwad us into many equally intimdating endeavors ." this proves that we should try to get to mars .

there is even more evidence . In paragraph four it says " Astronomers are fascinated by venus because it may well once beeen the most earth like planet in are solar sytem . " this just further shows the imense reasearch value .

theres even more prove . in the artical at paragraph 2 it says " often referred to as Earths "twin,"Venus is the closest planet to earth in terms of denisty and sise , and occasionally the closest in distance too. " showing are planets similer history .

in conclusion all this eveidince points to even though it will be hard we show try to reasearch venus more ."""

# Essay ID: SYNTH_V_07_S2
example_repetitive = """In the article "The Challenge of Exploring Venus," the author argues that studying Venus is a worthy pursuit, even though Venus is extremely dangerous. However, I believe the author does not support this argument very well. While they mention some interesting facts about Venus, they fail to provide strong evidence that makes readers really understand why studying Venus is so important.

Venus is dangerous because of the heat. The temperatures on Venus are very hot. In fact, the average surface temperature is over 800 degrees Fahrenheit! That is super hot and way hotter than anything we experience on Earth. This heat makes it almost impossible for spacecraft to survive very long once they land. The author also points out that Venus has an atmosphere made up of almost 97% carbon dioxide, which is a harmful gas for humans and makes the planet inhospitable. This means that astronauts or machines sent to Venus could be in serious trouble if they land. So, when they talk about studying it, they should explain more about how that can be done in such extreme conditions.

Moreover, the author mentions that there are clouds of sulfuric acid in Venus's atmosphere. This is another reason why Venus is very dangerous. The clouds are corrosive, which means they can literally eat away at things. This proves that exploring Venus is risky, and the author doesn‚Äôt give enough convincing reasons why we should still try. Yes, they mention that scientists are interested in how Venus might have been like Earth a long time ago and could have supported life. But that point gets lost amidst all the details about the extreme conditions.

The author does talk about NASA's plans for studying Venus, like a blimp-like vehicle that could float above the surface. This is a creative solution, but the article does not go deep enough into how this could solve the dangers presented by the heat and acid. It‚Äôs stated that the conditions would be survivable at 30 miles above the surface, but that‚Äôs still not enough information for the reader to understand how studying Venus can be achieved safely. They should have included more details on how scientists can gather the information they need while avoiding the risks.

In conclusion, while the author tries to show that studying Venus is important, they don't do a great job of supporting this idea. They mention the dangers like extreme heat and corrosive acid, but they need to provide clearer reasons why these risks are worth taking. The evidence given isn't strong enough to really convince readers that exploring Venus will benefit us, so I think the argument is weak. Overall, more depth in explanation and more persuasive evidence would have made for a stronger case to support the idea of studying Venus."""

# --- BRANDING SETUP ---
# Defining the Penn Colors via Hex Codes (Web Standards)
# Penn Blue: #011F5B
# Penn Red:  #990000

penn_theme = gr.themes.Default().set(
    # Primary Button (Evaluate) -> Penn Blue
    button_primary_background_fill="#011F5B",
    button_primary_background_fill_hover="#990000", # Red on Hover
    button_primary_text_color="white",

    # FIXED: The correct variable name includes '_color'
    button_primary_border_color="#011F5B",

    # Loaders & Sliders -> Penn Blue
    loader_color="#011F5B",
    slider_color="#011F5B",

    # Block Labels (Focus) -> Penn Blue
    block_label_text_color="#011F5B",
    block_title_text_color="#011F5B",

    # Borders (Subtle Branding)
    block_border_width="2px",
    input_border_color_focus="#011F5B"
)

# JavaScript to Force Light Mode
js_light_mode = """
function() {
    document.body.classList.remove('dark');
    document.body.classList.add('light');
}
"""

# Setup Interface
with gr.Blocks(title="Summary Evaluation System", theme=penn_theme) as demo:

    gr.Markdown("""
    # üìù Summary Evaluation System
    ### Automated Rubric-Based Feedback | Llama 3.1 8B

    **Instructions:** Paste a student summary below or click one of the **Test Cases** to see the grading in action.
    """)

    with gr.Row():
        # LEFT COLUMN
        with gr.Column(scale=1):
            summary_input = gr.Textbox(
                label="Student Summary",
                placeholder="Paste the student's text here...",
                lines=10
            )

            with gr.Row():
                clear_btn = gr.ClearButton(components=[summary_input], value="üóëÔ∏è Clear")
                evaluate_btn = gr.Button("üéØ Evaluate Summary", variant="primary", size="lg")

            gr.Examples(
                examples=[[example_good], [example_hallucination], [example_repetitive]],
                inputs=summary_input,
                label="üß™ Test Cases",
                example_labels=["‚úÖ Good", "‚ùå Hallucination", "‚ö†Ô∏è Repetitive"]
            )

        # RIGHT COLUMN
        with gr.Column(scale=1):
            results_output = gr.Markdown(
                value="### üìä Evaluation Results\n*Detailed feedback will appear here after you click Evaluate.*"
            )

    # Event Logic
    evaluate_btn.click(fn=gradio_evaluate, inputs=summary_input, outputs=results_output)

    # Trigger Light Mode on Load
    demo.load(None, None, None, js=js_light_mode)

    # Footer
    gr.Markdown("""
    ---
    <center>
    EDUC 6192 Final Project &#x2022; Developer: <a href="https://www.johnbaker.io/" target="_blank">John Baker</a> &#x2022; December 10, 2025
    </center>
    """)

print("üöÄ Launching interface...")
demo.launch(share=True, debug=True)

  with gr.Blocks(title="Summary Evaluation System", theme=penn_theme) as demo:


üöÄ Launching interface...
Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://3887daf7c1de7f4738.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://3887daf7c1de7f4738.gradio.live


