<a href="https://colab.research.google.com/github/baker-jr-john/automated-summary-evaluation-llm/blob/main/03_Llama_Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
print("Installing required packages...")
!pip install -q transformers accelerate bitsandbytes huggingface_hub

Installing required packages...


In [2]:
from google.colab import userdata
from huggingface_hub import login
import os

try:
    HF_TOKEN = userdata.get('HF_TOKEN')
    login(token=HF_TOKEN)
    print("✓ Authenticated with Hugging Face (via Secrets)")
except Exception:
    print("Secret not found. Please enter your Hugging Face token manually:")
    login()
    print("✓ Authenticated with Hugging Face (manual entry)")

✓ Authenticated with Hugging Face (via Secrets)


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

print("Loading Llama 3.1 8B-Instruct...")
print("(This takes 2-5 minutes on first run)")

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Configure 4-bit quantization to fit in Colab GPU memory
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quantization_config,
    device_map="auto",
    dtype=torch.bfloat16,
)

In [4]:
SOURCE_TEXT = """THE CHALLENGE OF EXPLORING VENUS

Venus, sometimes called the "Evening Star," is one of the brightest points of light in the night sky, making it simple for even an amateur stargazer to spot. However, this nickname is misleading since Venus is actually a planet. While Venus is simple to see from the distant but safe vantage point of Earth, it has proved a very challenging place to examine more closely.

Often referred to as Earth's "twin," Venus is the closest planet to Earth in terms of density and size, and occasionally the closest in distance too. Earth and Venus, along with Mars, our other planetary neighbor, orbit the sun at different speeds. These differences in speed mean that sometimes we are closer to Mars and other times to Venus. Because Venus is sometimes right around the corner—in space terms—humans have sent numerous spacecraft to land on this cloud-draped world. Each previous mission was unmanned, and for good reason, since no spacecraft survived the landing for more than a few hours. Maybe this issue explains why not a single spaceship has touched down on Venus in more than three decades. Numerous factors contribute to Venus's reputation as a challenging planet for humans to study, despite its proximity to us.

A thick atmosphere of almost 97 percent carbon dioxide blankets Venus. Even more challenging are the clouds of highly corrosive sulfuric acid in Venus's atmosphere. On the planet's surface, temperatures average over 800 degrees Fahrenheit, and the atmospheric pressure is 90 times greater than what we experience on our own planet. These conditions are far more extreme than anything humans encounter on Earth; such an environment would crush even a submarine accustomed to diving to the deepest parts of our oceans and would liquefy many metals. Also notable, Venus has the hottest surface temperature of any planet in our solar system, even though Mercury is closer to our sun. Beyond high pressure and heat, Venusian geology and weather present additional impediments like erupting volcanoes, powerful earthquakes, and frequent lightning strikes to probes seeking to land on its surface.

If our sister is so inhospitable, why are scientists even discussing further visits to its surface? Astronomers are fascinated by Venus because it may well once have been the most Earth-like planet in our solar system. Long ago, Venus was probably covered largely with oceans and could have supported various forms of life, just like Earth. Today, Venus still has some features that are analogous to those on Earth. The planet has a surface of rocky sediment and includes familiar features such as valleys, mountains, and craters. Furthermore, recall that Venus can sometimes be our nearest option for a planetary visit, a crucial consideration given the long time frames of space travel. The value of returning to Venus seems indisputable, but what are the options for making such a mission both safe and scientifically productive?

The National Aeronautics and Space Administration (NASA) has one particularly compelling idea for sending humans to study Venus. NASA's possible solution to the hostile conditions on the surface of Venus would allow scientists to float above the fray. Imagine a blimp-like vehicle hovering 30 or so miles above the roiling Venusian landscape. Just as our jet airplanes travel at a higher altitude to fly over many storms, a vehicle hovering over Venus would avoid the unfriendly ground conditions by staying up and out of the way. At thirty-plus miles above the surface, temperatures would still be toasty at around 170 degrees Fahrenheit, but the air pressure would be close to that of sea level on Earth. Solar power would be plentiful, and radiation would not exceed Earth’s levels. Not easy conditions, but survivable for humans.

However, peering at Venus from a ship orbiting or hovering safely far above the planet can provide only limited insight into ground conditions, rendering standard forms of photography and videography ineffective. More importantly, researchers cannot take samples of rock, gas, or anything else from a distance. Therefore, scientists seeking to conduct a thorough mission to understand Venus would need to get up close and personal despite the risks. Or maybe we should think of them as challenges. Many researchers are working on innovations that would allow our machines to last long enough to contribute meaningfully to our knowledge of Venus.

NASA is working on other approaches to studying Venus. For example, some simplified electronics made of silicon carbide have been tested in a chamber simulating the chaos of Venus's surface and have lasted for three weeks in such conditions. Another project is looking back at an old technology called mechanical computers. These devices were first envisioned in the 1800s and played an important role in the 1940s during World War II. The thought of computers existing in those days may sound shocking, but these devices made calculations by using gears and levers and did not require electronics at all. Modern computers are enormously powerful, flexible, and quick, but tend to be more delicate when it comes to extreme physical conditions. Just imagine exposing a cell phone or tablet to acid or heat capable of melting tin. By comparison, systems that use mechanical parts can be made more resistant to pressure, heat, and other forces.

Striving to meet the challenge presented by Venus has value, not only because of the insight to be gained on the planet itself, but also because human curiosity will likely lead us into many equally intimidating endeavors. Our travels on Earth and beyond should not be limited by dangers and doubts but should be expanded to meet the very edges of imagination and innovation."""

print(f"Source text loaded: {len(SOURCE_TEXT)} characters")

Source text loaded: 5773 characters


In [5]:
RUBRIC = """
## SUMMARY EVALUATION RUBRIC (Grades 6-8)

### Task Context
Students read "The Challenge of Exploring Venus" and wrote a response to this prompt:
"Write an essay evaluating how well the author supports the claim that studying Venus is a worthy pursuit despite the dangers. Use evidence from the text to support your evaluation."

This is a HYBRID task requiring students to:
1. Identify the author's claim and supporting evidence
2. Evaluate how effectively the author builds the argument
3. Support their evaluation with specific textual evidence

### Scoring Dimensions

**COMPLETENESS (1-5)**: Coverage of the author's main supporting points
- 5: Identifies ALL major supporting points (extreme conditions, scientific value, NASA solutions, alternative technologies) with specific evidence
- 4: Identifies MOST major points with evidence; one minor omission
- 3: Identifies SEVERAL points but misses at least one crucial aspect
- 2: Identifies only a FEW points; missing multiple important concepts
- 1: Fails to identify main points or provides only vague statements

**ACCURACY (1-5)**: Factual correctness of claims about the text
- 5: All information factually correct; precise language; no distortions
- 4: Generally accurate with only minor imprecisions that don't alter meaning (awkward paraphrasing with correct meaning = 4, not 3)
- 3: Contains accurate points but also noticeable errors or oversimplifications
- 2: Multiple significant factual errors or misrepresentations (note: quoting or paraphrasing the source text is not a factual error)
- 1: Information contradicts source or includes fabricated details

**COHERENCE (1-5)**: Logical organization and flow
- 5: Ideas flow logically; effective transitions; each sentence builds on previous
- 4: Clearly organized; transitions mostly effective; minor rough spots
- 3: Basic organization but inconsistent flow; transitions missing in places
- 2: Organization unclear; ideas jump between topics; few transitions
- 1: No discernible organization; disconnected fragments

**CONCISENESS (1-5)**: Efficiency of expression
- 5: Every sentence essential; no repetition; focused on main ideas
- 4: Mostly efficient; only minor wordiness or brief repetition
- 3: Noticeable wordiness; some repetition; includes irrelevant information
- 2: Significant wordiness; frequent repetition; could be cut substantially
- 1: Excessively wordy; ideas repeated multiple times; essential content buried
"""

print("Rubric loaded successfully")

Rubric loaded successfully


In [6]:
def create_evaluation_prompt(student_summary):
    """Create the full CoT evaluation prompt with ONE-SHOT + NEGATIVE CONSTRAINTS."""

    # 1. Define the Exemplar (VAL_04 - High Quality Authentic Summary)
    exemplar_text = """The author excellently supports the idea that even though it is dangerous, Venus is worth exploring. You can tell the author supports the idea of further exploration of Venus because of their use of details. The author explains Venus, why it is so dangerous, and why we should continue exploring it to support the idea that Venus is a challenge that we should not give up on.

One of the reasons the authors point comes across so well is how in depth they explain Venus so that the reader can be more knowlegable about the topic before the author begins to explain why we should continue to explore it. The author gives as much detail as a book about planets so that the reader knows that the author is well versed in the topic and is not having an opinion without factual evidence to support it. In paragraph 2, it says "Often referred to as Earth's "twin", Venus is the closest planet to Earth in terms of density and size, and occasionally the closest in distance too. Earth, Venus, and Mars, our other planetary neighbor, orbit the sun at different speeds.". Throughout this paragraph, the author gives information about Venus so you can understand in depth how and why it is explored, and most importantly, why it is so dangerous.

The danger of Venus is why it is mostly unknown, and why humans want to study it more. Even unmanned missions do not survive Venus's burning temperatures and intense pressure for more than a couple hours, making it very challenging to study. The author uses data like in the quote " A thick atmosphere of amost 97 percent carbon dioxide blankets Venus. Even more challenging are the clouds of highly corrosive sulfuric acid in Venus's atmosphere."(paragraph 3), to show how dangerous it is and why Venus is mostly unexplored. The author shows that the danger is not keeping NASA away, but it is drawing them closer. The author states that "Astronomers are facinated by Venus because it may well once have been the most Earth-like planet in our solar system." (paragraph 4). To the author, natural human curiousity is another reason why we should continue pursuing Venus, and how we are going to continue to explore Venus, even if it is dangerous.

The author uses examples of ideas from real scientists to support the statement that we should not give up on the idea of knowing more about Venus. NASA is still trying to figure out a way to have people explore Venus deeper. The author uses NASA's solutions to the conditions of Venus to explain why we should never stop exploring space. NASA is coming up with solutions to Venus, but they might prove ineffective, so pursuing Veus is still a worthy idea. They are trying to come up with a way to float above the harms of Venus, so they can still study it close, but be unaffected by the harmful temperatures and pressures of the surface. The author offers a rebuttal to this idea, saying "peering at Venus from a ship orbiting or hovering safely far above the planet can provide only limited insight on ground conditions because most forms of light cannot penetrate the dense atmosphere, rendering standard forms of photography and videography ineffective. More importantly, researchers cannot take samples of rock, gas, or anything else, from a distance." (paragraph 6). This information the author presents makes the reader understand that Venus is insanely difficult to explore when even NASA can not present useful ideas for intense exploration. But even when every idea is shut down, the author makes it clear that we should not give up the fight for exploration.

Venus is still an unhabitable planet for even our smartest robots. We as humans have tried our hardest to make sure we understand the many planets in out solar system. Even though it seems impossible, the author explains very successfully that this does not mean exploring Venus is impossible, it just means Venus is a complicated puzzle, but when it is solved, everyone will be overwhelmed with satisfaction, so to the author, giving up is not an option. The author believes that one day, exploring Venus in depth will be possible, and they explain their reasoning behind it very clearly so the reader can understand that studying Venus is a worthy persuit despite the dangers it presents."""

    exemplar_response = """
COMPLETENESS: 4
Justification: Identifies most major supporting points (extreme conditions, scientific value, NASA solutions) with evidence. Misses only minor nuance about specific past missions or mechanical computers details.

ACCURACY: 4
Justification: Claims are generally accurate. "97 percent carbon dioxide" and "sulfuric acid" are correctly cited. Minor imprecision in paraphrasing NASA's specific timeline.

COHERENCE: 4
Justification: Clearly organized with paragraph breaks. Transitions like "One of the reasons" and "The danger of Venus is why" are effective.

CONCISENESS: 3
Justification: Mostly efficient but includes some conversational filler ("You can tell the author supports...") that could be cut.
"""

    # 2. Build the Prompt
    prompt = f"""You are an experienced middle school English Language Arts teacher evaluating a student's response.

## SOURCE TEXT
{SOURCE_TEXT}

## EVALUATION RUBRIC
{RUBRIC}

## EXEMPLAR (Reference Standard)
Use this as your baseline for a "Score 4". Note that even this strong essay is NOT perfect (it gets a 3 for Conciseness).
Student Response:
{exemplar_text}

Teacher Evaluation:
{exemplar_response}

## YOUR TASK
Evaluate the NEW student response below.

**SCORING TRAPS TO AVOID (CRITICAL):**
1. **The Halo Effect:** Do NOT give high scores in all categories just because the writing is pretty.
2. **The Horns Effect:** Do NOT give low scores in "Completeness" or "Accuracy" just because the grammar/spelling is bad. **Look for the facts hidden in the mess.**
3. **Score Inflation:** The Exemplar above is a **4**, not a 5. A score of 5 is reserved for truly exceptional, college-ready analysis. Do not give out 5s easily.

**Think step-by-step before providing scores.**

### Step 1: Analyze COMPLETENESS
Check for these 4 Key Concepts (KC):
1. **KC1 - Dangers:** (Heat, Acid, Pressure)
2. **KC2 - History/Value:** (Earth's Twin, Oceans)
3. **KC3 - NASA Solution:** (Blimps/Hovering)
4. **KC4 - Technology:** (Mechanical Computers/Silicon Carbide)

*Decision Guide:*
- 4 Key Concepts mentioned = Score 4 or 5
- 3 Key Concepts mentioned = Score 3 or 4
- 1-2 Key Concepts mentioned = Score 1 or 2
*(Note: If the student mentions the facts but phrasing is awkward, they still get the points for Completeness!)*

### Step 2: Analyze ACCURACY
Check for factual errors.
- Minor imprecise phrasing = Score 4
- Major error (e.g., wrong planet, invented facts) = Score 2

### Step 3: Analyze COHERENCE
- Clear paragraph structure + good transitions = Score 4-5
- "First, Second, Third" list structure = Score 3-4 (Acceptable but mechanical)
- No paragraphs / random order = Score 1-2

### Step 4: Analyze CONCISENESS
- Focused, no fluff = Score 4-5
- Repetitive or slight tangents = Score 3
- Excessive length or rambling = Score 1-2

## NEW STUDENT RESPONSE
{student_summary}

## PROVIDE YOUR EVALUATION
After your analysis, provide scores in this EXACT format:

COMPLETENESS: [score 1-5]
Justification: [1-2 sentences]

ACCURACY: [score 1-5]
Justification: [1-2 sentences]

COHERENCE: [score 1-5]
Justification: [1-2 sentences]

CONCISENESS: [score 1-5]
Justification: [1-2 sentences]

OVERALL FEEDBACK: [2-3 sentences of constructive feedback for the student]
"""
    return prompt
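
The Step 1 decision guide in the prompt maps the number of key concepts a student mentions to a Completeness score band. As a rough cross-check on the model's scores, that mapping can be sketched in plain Python. Everything named here is an assumption made for illustration and is not part of the notebook: the keyword lists in `KEY_CONCEPT_KEYWORDS`, the helpers `find_key_concepts` and `completeness_band`, and the choice to treat zero matches like the 1-2 band (the guide does not say what to do with zero). Substring matching is far cruder than a human or LLM reading, so treat the result as a sanity check only.

```python
# Hypothetical keyword lists for the four key concepts (KC1-KC4) named in the prompt.
KEY_CONCEPT_KEYWORDS = {
    "KC1 - Dangers": ["heat", "temperature", "acid", "pressure"],
    "KC2 - History/Value": ["twin", "ocean", "earth-like"],
    "KC3 - NASA Solution": ["blimp", "hover", "float"],
    "KC4 - Technology": ["mechanical", "silicon carbide"],
}


def find_key_concepts(text):
    """Return the key concepts whose keywords appear in the text (case-insensitive)."""
    lowered = text.lower()
    return [
        concept
        for concept, keywords in KEY_CONCEPT_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


def completeness_band(num_concepts):
    """Map a key-concept count to the (min, max) score band from the decision guide."""
    if num_concepts >= 4:
        return (4, 5)
    if num_concepts == 3:
        return (3, 4)
    # The guide covers 1-2 concepts; treating 0 the same way is an assumption.
    return (1, 2)


# Made-up example response, not taken from the validation set.
example = (
    "Venus is over 800 degrees and has crushing pressure. "
    "It may once have had oceans. NASA wants a blimp to float above it."
)
found = find_key_concepts(example)
print(found, completeness_band(len(found)))
```

A model score that falls outside the band suggested by this heuristic is not necessarily wrong, but it is a reasonable candidate for manual review.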

In [7]:
def create_simple_prompt(student_summary):
    """A simpler, more direct prompt without extensive CoT scaffolding."""

    prompt = f"""You are a middle school English teacher grading a student essay.

ARTICLE SUMMARY: The source article discusses why Venus is difficult to explore (extreme heat, pressure, sulfuric acid) but argues it's worth studying because Venus may have once been Earth-like, has similar features today, and is sometimes our closest neighbor. NASA proposes hovering vehicles at 30 miles altitude where conditions are survivable. Scientists are also developing heat-resistant electronics and mechanical computers.

STUDENT TASK: Evaluate how well the author supports the claim that studying Venus is worthwhile despite the dangers.

STUDENT RESPONSE:
{student_summary}

SCORING RUBRIC (1-5 scale):
- COMPLETENESS: Does it cover the main supporting points?
- ACCURACY: Are the facts correct?
- COHERENCE: Is it well-organized with good flow?
- CONCISENESS: Is it focused without unnecessary repetition?

Provide scores in this format:
COMPLETENESS: [1-5]
ACCURACY: [1-5]
COHERENCE: [1-5]
CONCISENESS: [1-5]
BRIEF FEEDBACK: [1-2 sentences]
"""
    return prompt

In [8]:
def generate_evaluation(prompt, max_new_tokens=800, temperature=0.1):
    """Generate model response for an evaluation prompt."""

    messages = [
        {"role": "system", "content": "You are an expert middle school English teacher who evaluates student writing using rubrics."},
        {"role": "user", "content": prompt}
    ]

    # Format for Llama 3.1 Instruct
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

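    # The chat template already prepends <|begin_of_text|>, so do not add a second BOS token here
    inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).to(model.device)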

    # Build generation kwargs conditionally
    generate_kwargs = {
        "max_new_tokens": max_new_tokens,
        "pad_token_id": tokenizer.eos_token_id,
    }

    if temperature > 0:
        generate_kwargs["do_sample"] = True
        generate_kwargs["temperature"] = temperature
        generate_kwargs["top_p"] = 0.9
    else:
        generate_kwargs["do_sample"] = False
        # Greedy decoding: unset the sampling parameters so generate() does not warn about them
        generate_kwargs["temperature"] = None
        generate_kwargs["top_p"] = None

    with torch.no_grad():
        outputs = model.generate(**inputs, **generate_kwargs)

    # Decode only the newly generated tokens. Decoding outputs[0] in full would echo the
    # prompt (including the exemplar's scores), and no "<|assistant|>" marker survives
    # skip_special_tokens=True to split on.
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

    return response

In [9]:
import re

def parse_scores(response_text):
    """Extract numerical scores from the model's response."""
    scores = {}

    # Pattern: DIMENSION: [score] or DIMENSION: score
    patterns = {
        'completeness': r'COMPLETENESS:\s*\[?(\d)\]?',
        'accuracy': r'ACCURACY:\s*\[?(\d)\]?',
        'coherence': r'COHERENCE:\s*\[?(\d)\]?',
        'conciseness': r'CONCISENESS:\s*\[?(\d)\]?'
    }

    for dim, pattern in patterns.items():
        match = re.search(pattern, response_text, re.IGNORECASE)
        if match:
            scores[dim] = int(match.group(1))
        else:
            scores[dim] = None

    return scores
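
A quick way to sanity-check the parsing logic is to run it on a hand-written response. The sketch below is self-contained and makes a few assumptions: `parse_scores_strict` is a hypothetical variant, not part of the notebook, that accepts only the digits 1-5 and tolerates Markdown bold around the label; `sample_response` is made-up text, not real model output.

```python
import re

DIMENSIONS = ("completeness", "accuracy", "coherence", "conciseness")


def parse_scores_strict(response_text):
    """Hypothetical stricter variant of parse_scores: only scores 1-5 are accepted."""
    scores = {}
    for dim in DIMENSIONS:
        # Optional ** around the label, optional "[" before the digit,
        # and a word boundary so "45" is not read as 4.
        pattern = rf"\*{{0,2}}{dim}\*{{0,2}}\s*:\s*\*{{0,2}}\s*\[?([1-5])\b"
        match = re.search(pattern, response_text, re.IGNORECASE)
        scores[dim] = int(match.group(1)) if match else None
    return scores


# Made-up response covering plain, bold, bracketed, and out-of-range scores.
sample_response = """COMPLETENESS: 4
Justification: Covers three of the four key concepts.

**ACCURACY:** 5
Justification: No factual errors.

COHERENCE: [3]
Justification: List-like structure.

CONCISENESS: 7
Justification: Out-of-range score that should be rejected.
"""

parsed = parse_scores_strict(sample_response)
print(parsed)
```

Returning `None` for an out-of-range or missing score, rather than a digit that happens to follow the label, makes malformed responses easy to spot when the results are tabulated.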

In [10]:
TEST_SUMMARIES = {
    "VAL_02": """Do you guys think that venus is dangers? Well for our part venus need to get up close and personal despite the risk or maybe they should think of them as challenges. Astronmers are fascinated by venus because it may once have been the most earth-like planet in our solar system. Even more challenging are the clouds of highly corrosive sulfuric acid in venus's atmosphere. Venus is simple to see from the distant but safe vantage point of earth, it had proved a very challenging place to examine more closely. Venus would need to get uo close and personal despite the risk and maybe should think of them as challenges.

Astronomers are fascinated by venus because it may well once have been the most earth-like planet in the solar system because people from long time had to covered the oceans to support carious forms of life to them just how earth is from us today. Also because the value of the returning of the venus seem to be a little difficult but there was other option to make a mission to be safe and productive to both of them for the astronomers to be by venus and know the planets in the solar system.

Even more challenging are the clouds of highly corrosvie sulfuric acid in venus's atmosphere because they think that the challenging for venus might not be working for them but now they do work because the conditions are now more extreme then any human encounter out there and all of the environment can crush of a submarine. But then venus has the hottest surface temperature to any of the planet in the solar system that their is even when mercury is close to the sun venus can still be hot for the system being beside another planet.

Venus is simple to see from the distant but safe vantage point of earth, it has proved a very challenging place to examine more closely. Venus can be the closest planet to earth even in terms of size and density. Also each of the previous mission can have an unmanned and for a reason no sacecraft has been survived for the time of landing for more than hours or even minutes. Venus had more than three decades, but venus reputation for a challenge planet is for humans to work on and study for and to despite the proximity to it.

This is what I think the author suggests to the study of venus because venus would need to get up close and personal despite the risk and maybe should think of them for a challenge, because astronomers are fascinated by venus because it may once have been the most earth-like planet in the solar system, even more challenging are the clouds of highly corrosive sulfuric acid in venus's atmosphere, and venus is simple to see from the distant but can be safe vantage point in earth and has proved a very challenging place to examine more closely to it.""",
    # Note: VAL_04 is the same essay used as the one-shot exemplar in create_evaluation_prompt,
    # so its scores under that prompt are not an independent test.

    "VAL_04": """The author excellently supports the idea that even though it is dangerous, Venus is worth exploring. You can tell the author supports the idea of further exploration of Venus because of their use of details. The author explains Venus, why it is so dangerous, and why we should continue exploring it to support the idea that Venus is a challenge that we should not give up on.

One of the reasons the authors point comes across so well is how in depth they explain Venus so that the reader can be more knowlegable about the topic before the author begins to explain why we should continue to explore it. The author gives as much detail as a book about planets so that the reader knows that the author is well versed in the topic and is not having an opinion without factual evidence to support it. In paragraph 2, it says "Often referred to as Earth's "twin", Venus is the closest planet to Earth in terms of density and size, and occasionally the closest in distance too. Earth, Venus, and Mars, our other planetary neighbor, orbit the sun at different speeds.". Throughout this paragraph, the author gives information about Venus so you can understand in depth how and why it is explored, and most importantly, why it is so dangerous.

The danger of Venus is why it is mostly unknown, and why humans want to study it more. Even unmanned missions do not survive Venus's burning temperatures and intense pressure for more than a couple hours, making it very challenging to study. The author uses data like in the quote " A thick atmosphere of amost 97 percent carbon dioxide blankets Venus. Even more challenging are the clouds of highly corrosive sulfuric acid in Venus's atmosphere."(paragraph 3), to show how dangerous it is and why Venus is mostly unexplored. The author shows that the danger is not keeping NASA away, but it is drawing them closer. The author states that "Astronomers are facinated by Venus because it may well once have been the most Earth-like planet in our solar system." (paragraph 4). To the author, natural human curiousity is another reason why we should continue pursuing Venus, and how we are going to continue to explore Venus, even if it is dangerous.

The author uses examples of ideas from real scientists to support the statement that we should not give up on the idea of knowing more about Venus. NASA is still trying to figure out a way to have people explore Venus deeper. The author uses NASA's solutions to the conditions of Venus to explain why we should never stop exploring space. NASA is coming up with solutions to Venus, but they might prove ineffective, so pursuing Veus is still a worthy idea. They are trying to come up with a way to float above the harms of Venus, so they can still study it close, but be unaffected by the harmful temperatures and pressures of the surface. The author offers a rebuttal to this idea, saying "peering at Venus from a ship orbiting or hovering safely far above the planet can provide only limited insight on ground conditions because most forms of light cannot penetrate the dense atmosphere, rendering standard forms of photography and videography ineffective. More importantly, researchers cannot take samples of rock, gas, or anything else, from a distance." (paragraph 6). This information the author presents makes the reader understand that Venus is insanely difficult to explore when even NASA can not present useful ideas for intense exploration. But even when every idea is shut down, the author makes it clear that we should not give up the fight for exploration.

Venus is still an unhabitable planet for even our smartest robots. We as humans have tried our hardest to make sure we understand the many planets in out solar system. Even though it seems impossible, the author explains very successfully that this does not mean exploring Venus is impossible, it just means Venus is a complicated puzzle, but when it is solved, everyone will be overwhelmed with satisfaction, so to the author, giving up is not an option. The author believes that one day, exploring Venus in depth will be possible, and they explain their reasoning behind it very clearly so the reader can understand that studying Venus is a worthy persuit despite the dangers it presents.""",

    "VAL_15": """In "The Challenge of Exploring Venus," the author talks about why studying Venus is important, even though it is dangerous.

First, Venus is really hot. The article says the temperature is over 800 degrees Fahrenheit. That's way hotter than most things on Earth. Second, there are clouds of sulfuric acid in the atmosphere. This makes it hard for machines to land there. Third, Venus has a lot of pressure that is 90 times stronger than on Earth. This can crush any spacecraft that tries to land. Fourth, scientists think Venus could have had oceans a long time ago and possibly life. This is interesting because it gives us an idea of what Earth might have been like too. Fifth, NASA has some ideas to study Venus. They want to send a blimp-like vehicle to float above it. This could help scientists avoid the extreme conditions on the ground. Lastly, even though Venus is challenging, the author suggests that exploring it can help us learn more about space and even ourselves.

Overall, the author mentions many facts about Venus being dangerous, but doesn't explain very well why studying it is so important.""",

    "VAL_20": """People are facinated with the Man on the Moon and the idea of Martians, but most people do not think about life on Venus. Venus is the second planet from the sun and shares many geographical features with Earth. However, studying this planet is made difficult by the dense and toxic atmosphere, high temperatures, and violent weather. Despite this, some people think that Venus should still be explored, and the author of "The Challenge of Exploring Venus" is of this opinion. The idea that studying Venus is a worthy pursuit despite the dangers is well supported by the author as seen through the rewards of studying Venus and the progress that has been made towards studying Venus.

First, the idea that studying Venus is a worthy pursuit despite the dangers is well supported by the author as seen through the many rewards of studying Venus. After laying out the dangers of studying Venus, the author explains why scientists continue to study the planet. "Astronomers are fascinated by Venus because it may well have been the most Earth-like planet in our solar system" (4). By studying Venus, astronomers and geologists can predict what might happen to Earth in the future. Gaining an understanding of Earth's future may well allow scientists to predict what happened in Earth's past. Scientists are eager to learn about the early years of Earth's past, as it is shrouded in mystery, and this thirst for knowledge motivates them to study Venus. In describing how similar Venus was to Earth, the author says, "Long ago, Venus was probably covered largely with oceans and could have supported various forms of life" (4). If there was once life on Venus, the similarity between it and Earth would grow. As with geology, if biologists can understand what caused life to cease on Venus, they might be able to predict how life on Venus and on Earth might have started. The author shows that scientists studying Venus reap the reward of being able to learn about Earth's geology and early life. By laying out the various rewards to be had from studying Venus, the author is strengthening his or her argument that Venus should be studied.

Secondly, the idea that studying Venus is a worthy pursuit despite the dangers is well supported by the author as seen through the large amount of progress that has been made towards studying Venus. Although the author describes how Venus could be studied from the air, scientists still desire to learn about Venus from the planet's surface. One of their solutions to the problem of getting equipment to last on the surface of Venus is to expirement with new materials. "Simplified electronics made of silicon carbide have been tested in a chamber simulating the choas of Venus's surface and have lasted of three weeks in such conditions" (7). Research and experimentation taking place on Earth is giving scientists and astronauts more options for studying Venus. Although conditions on Venus are not hospitable to life, these new scientific advances are making it possible for data-gathering equipment to be sent to the surface of Venus and last long enough to gather data. Other scientists are moving away from traditional electronics and looking into purely mechanical systems. "Systems that use mechanical parts can be made more resistant to pressure, heat, and other forces" (7). The alternative that has presented itself to would-be explorers of Venus is older technology, like that found in the earliest computers. Scientists have realized that modern technology is too fragile and that more durable technologies are needed. By turning to other forms of technology, scientists are widening their options for ways to study Venus. The author mentions three different ways that scientists are making progress towards being able to study Venus - from the air, using new materials, and using old technologies. The author's postion that Venus should continue to be studied is supported by the scientific advancements that are serving to make studying Venus a reality.

In conclusion, the author's opinon that Venus should continue to be studied despite the dangers is well supported by the rewards of studying another Earth-like planet and the advancements that have been made towards being able to effectively study Venus. Scientists have strong motivation for studying Venus, and new technologies are making it possible for them to overcome the challenges presented by Venus's harsh terrain. Although scientists studying Venus are unlikely to encounter any life forms, what they do discover will help them to understand Earth's past and shape our future.""",

    "VAL_25": """In "the challenge of exploring venus ," the author suggests that studying venus is a worthy pursuit

despite the dangers it presents . becauce in the text it says at paragraph eight

"striving to meet challenge presented by venus has value , not only because of the insight to be gained on the planet itself , but also becauce human curiosity will likely lwad us into many equally intimdating endeavors ." this proves that we should try to get to mars .

there is even more evidence . In paragraph four it says " Astronomers are fascinated by venus because it may well once beeen

the most earth like planet in are solar sytem . " this just further shows the imense reasearch value .

theres even more prove . in the artical at paragraph 2 it says " often referred to as Earths "twin,"Venus is the closest planet to earth in terms of denisty and sise , and occasionally the closest in distance too. " showing are planets similer history .

in conclusion all this eveidince points to even though it will be hard we show try to reasearch venus more ."""
}

# =============================================================================
# GROUND TRUTH SCORES - YOUR Day 1 Scores
# =============================================================================

GROUND_TRUTH = {
    "VAL_02": {"completeness": 4, "accuracy": 4, "coherence": 3, "conciseness": 3},  # Authentic - repetitive, errors
    "VAL_04": {"completeness": 4, "accuracy": 4, "coherence": 4, "conciseness": 3},  # Authentic - analytical, well-structured
    "VAL_15": {"completeness": 3, "accuracy": 4, "coherence": 4, "conciseness": 4},  # Synthetic - list-style, moderate
    "VAL_20": {"completeness": 4, "accuracy": 5, "coherence": 4, "conciseness": 2},  # Authentic - formal essay, lengthy
    "VAL_25": {"completeness": 2, "accuracy": 3, "coherence": 3, "conciseness": 3},  # Authentic - short, spelling errors
}

# Quick validation
print("=" * 60)
print("TEST SUMMARIES LOADED")
print("=" * 60)
for sid, text in TEST_SUMMARIES.items():
    word_count = len(text.split())
    print(f"{sid}: {word_count} words")
print("=" * 60)
print("\n✅ GROUND_TRUTH scores loaded from your Day 1 spreadsheet!")
print("   Source: Summary_Scoring_Template.xlsx - Main Scoring sheet")

TEST SUMMARIES LOADED
VAL_02: 499 words
VAL_04: 732 words
VAL_15: 191 words
VAL_20: 753 words
VAL_25: 193 words

✅ GROUND_TRUTH scores loaded from your Day 1 spreadsheet!
   Source: Summary_Scoring_Template.xlsx - Main Scoring sheet
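
The word-count printout above confirms that the summaries loaded, but it does not check the hand-entered scores against them. A minimal sketch of a stricter check follows. The helper name `validate_ground_truth` and the 1-5 score range are assumptions for illustration, not part of the notebook.

```python
DIMENSIONS = ("completeness", "accuracy", "coherence", "conciseness")

def validate_ground_truth(summaries, ground_truth, lo=1, hi=5):
    """Return a list of problems; an empty list means every check passed.

    `lo` and `hi` are an assumed rubric range (1-5); adjust to the real rubric.
    """
    problems = []
    # IDs that appear in only one of the two dictionaries
    for sid in sorted(set(summaries) ^ set(ground_truth)):
        where = "ground truth" if sid in summaries else "summaries"
        problems.append(f"{sid}: missing from {where}")
    # Every dimension must be present and hold an integer inside the range
    for sid, scores in ground_truth.items():
        for dim in DIMENSIONS:
            value = scores.get(dim)
            if not isinstance(value, int) or not lo <= value <= hi:
                problems.append(
                    f"{sid}.{dim}: expected an integer in [{lo}, {hi}], got {value!r}"
                )
    return problems

# Made-up data for illustration only: "B" has no scores and "A" lacks conciseness.
demo_summaries = {"A": "first summary", "B": "second summary"}
demo_truth = {"A": {"completeness": 4, "accuracy": 4, "coherence": 3}}
for problem in validate_ground_truth(demo_summaries, demo_truth):
    print(problem)
```

In the notebook itself the call would be `validate_ground_truth(TEST_SUMMARIES, GROUND_TRUTH)`, which should return an empty list if all five entries are complete.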


In [11]:
print("="*70)
print("RUNNING EVALUATIONS ON 5 TEST SUMMARIES")
print("="*70)

results = {}

for summary_id, summary_text in TEST_SUMMARIES.items():
    print(f"\n{'='*70}")
    print(f"Evaluating: {summary_id}")
    print(f"{'='*70}")
    print(f"\nSummary preview: {summary_text[:150]}...")

    # Use the full CoT prompt
    prompt = create_evaluation_prompt(summary_text)

    print("\nGenerating evaluation...")
    response = generate_evaluation(prompt)

    # Parse scores
    scores = parse_scores(response)
    results[summary_id] = {
        'llm_scores': scores,
        'ground_truth': GROUND_TRUTH.get(summary_id, {}),
        'response': response
    }

    print(f"\n--- LLM SCORES ---")
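    for dim, score in scores.items():
        gt = GROUND_TRUTH.get(summary_id, {}).get(dim)
        # ✓ exact match, ○ within ±1, ✗ otherwise (including a missing score on either side)
        diff = abs(score - gt) if score is not None and gt is not None else None
        match = "✓" if diff == 0 else "○" if diff == 1 else "✗"
        print(f"  {dim.capitalize()}: LLM={score} | Ground Truth={'N/A' if gt is None else gt} {match}")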

    print(f"\n--- FULL RESPONSE ---")
    print(response[:1500] + "..." if len(response) > 1500 else response)

RUNNING EVALUATIONS ON 5 TEST SUMMARIES

Evaluating: VAL_02

Summary preview: Do you guys think that venus is dangers? Well for our part venus need to get up close and personal despite the risk or maybe they should think of them...

Generating evaluation...

--- LLM SCORES ---
  Completeness: LLM=3 | Ground Truth=4 ○
  Accuracy: LLM=3 | Ground Truth=4 ○
  Coherence: LLM=2 | Ground Truth=3 ○
  Conciseness: LLM=2 | Ground Truth=3 ○

--- FULL RESPONSE ---
COMPLETENESS: 3
Justification: The student identifies most major points but misses one crucial aspect (KC4) and repeats some points, making it hard to determine their understanding.

### Step 2: Analyze ACCURACY
Check for factual errors.
- Minor imprecise phrasing = Score 4
- Major error (e.g., wrong planet, invented facts) = Score 2

The student repeats some phrases from the original text, which might indicate a lack of understanding. However, there are no major errors in the text. The student seems to be paraphrasing the original text,

In [12]:
print("\n" + "="*70)
print("AGREEMENT ANALYSIS")
print("="*70)

dimensions = ['completeness', 'accuracy', 'coherence', 'conciseness']

# Calculate agreements
exact_matches = {dim: 0 for dim in dimensions}
adjacent_matches = {dim: 0 for dim in dimensions}  # Within 1 point
total_valid = {dim: 0 for dim in dimensions}

for summary_id, data in results.items():
    llm = data['llm_scores']
    gt = data['ground_truth']

    for dim in dimensions:
        if llm.get(dim) is not None and gt.get(dim) is not None:
            total_valid[dim] += 1
            diff = abs(llm[dim] - gt[dim])
            if diff == 0:
                exact_matches[dim] += 1
                adjacent_matches[dim] += 1
            elif diff == 1:
                adjacent_matches[dim] += 1

print("\n### AGREEMENT BY DIMENSION ###\n")
print(f"{'Dimension':<15} {'Exact':<15} {'Adjacent (±1)':<15}")
print("-" * 45)

for dim in dimensions:
    n = total_valid[dim]
    if n > 0:
        exact_pct = (exact_matches[dim] / n) * 100
        adj_pct = (adjacent_matches[dim] / n) * 100
        print(f"{dim.capitalize():<15} {exact_pct:>5.1f}% ({exact_matches[dim]}/{n})   {adj_pct:>5.1f}% ({adjacent_matches[dim]}/{n})")
    else:
        print(f"{dim.capitalize():<15} No valid comparisons")

# Overall
total_exact = sum(exact_matches.values())
total_adjacent = sum(adjacent_matches.values())
total_n = sum(total_valid.values())

print("-" * 45)
if total_n > 0:
    print(f"{'OVERALL':<15} {(total_exact/total_n)*100:>5.1f}% ({total_exact}/{total_n})   {(total_adjacent/total_n)*100:>5.1f}% ({total_adjacent}/{total_n})")

print("\n### INTERPRETATION ###")
print("- Exact match: LLM score equals your ground truth score")
print("- Adjacent match: LLM score is within ±1 of ground truth")
print("- Target: Adjacent agreement ≥85% indicates good calibration")


AGREEMENT ANALYSIS

### AGREEMENT BY DIMENSION ###

Dimension       Exact           Adjacent (±1)  
---------------------------------------------
Completeness     80.0% (4/5)   100.0% (5/5)
Accuracy         40.0% (2/5)   100.0% (5/5)
Coherence        40.0% (2/5)    80.0% (4/5)
Conciseness      20.0% (1/5)    60.0% (3/5)
---------------------------------------------
OVERALL          45.0% (9/20)    85.0% (17/20)

### INTERPRETATION ###
- Exact match: LLM score equals your ground truth score
- Adjacent match: LLM score is within ±1 of ground truth
- Target: Adjacent agreement ≥85% indicates good calibration
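
Percent agreement does not correct for agreement that would occur by chance, which is why Cohen's Kappa is a natural follow-up. The block below is a minimal pure-Python sketch, not the notebook's own method. The function name `cohens_kappa` and the optional quadratic weighting (a common choice for ordinal rubric scores) are assumptions. The demo pools the VAL_02 and VAL_04 score pairs reported in this notebook's output across all four dimensions, which is far too little data for a meaningful estimate and is shown only to illustrate the call.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b, weights=None):
    """Cohen's kappa for two equal-length lists of integer scores.

    weights=None        -> unweighted kappa (any disagreement costs 1)
    weights="quadratic" -> disagreement costs (i - j) ** 2
    """
    if weights not in (None, "quadratic"):
        raise ValueError("weights must be None or 'quadratic'")
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")

    def penalty(i, j):
        if weights == "quadratic":
            return (i - j) ** 2
        return 0 if i == j else 1

    n = len(rater_a)
    # Observed disagreement, averaged over the paired scores
    observed = sum(penalty(a, b) for a, b in zip(rater_a, rater_b)) / n
    # Disagreement expected by chance, from each rater's marginal distribution
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        penalty(i, j) * ca * cb
        for i, ca in count_a.items()
        for j, cb in count_b.items()
    ) / (n * n)
    if expected == 0:
        # Both raters used one identical category throughout: kappa is undefined
        return float("nan")
    return 1 - observed / expected

# LLM vs. ground-truth pairs for VAL_02 and VAL_04, pooled across dimensions
llm_scores = [3, 3, 2, 2, 4, 4, 4, 3]
gt_scores = [4, 4, 3, 3, 4, 4, 4, 3]
print(f"unweighted kappa: {cohens_kappa(llm_scores, gt_scores):.3f}")
print(f"quadratic kappa:  {cohens_kappa(llm_scores, gt_scores, weights='quadratic'):.3f}")
```

On the full validation set it would be more informative to compute kappa per dimension rather than pooled, so that a weak dimension such as conciseness is not masked by stronger ones.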


In [13]:
print("\n" + "="*70)
print("TESTING SIMPLE PROMPT VARIATION")
print("="*70)

# Pick one summary to test both prompts
test_id = "VAL_04"
test_summary = TEST_SUMMARIES[test_id]

print(f"\nTesting on: {test_id}")

# Full CoT prompt (already done above)
full_scores = results[test_id]['llm_scores']

# Simple prompt
simple_prompt = create_simple_prompt(test_summary)
print("\nGenerating with SIMPLE prompt...")
simple_response = generate_evaluation(simple_prompt, max_new_tokens=400)
simple_scores = parse_scores(simple_response)

print("\n### PROMPT COMPARISON ###")
print(f"\n{'Dimension':<15} {'Full CoT':<12} {'Simple':<12} {'Ground Truth':<12}")
print("-" * 55)

for dim in dimensions:
    full = full_scores.get(dim, "N/A")
    simp = simple_scores.get(dim, "N/A")
    gt = GROUND_TRUTH[test_id].get(dim, "N/A")
    print(f"{dim.capitalize():<15} {str(full):<12} {str(simp):<12} {str(gt):<12}")

print("\n### SIMPLE PROMPT RESPONSE ###")
print(simple_response)


TESTING SIMPLE PROMPT VARIATION

Testing on: VAL_04

Generating with SIMPLE prompt...

### PROMPT COMPARISON ###

Dimension       Full CoT     Simple       Ground Truth
-------------------------------------------------------
Completeness    4            4            4           
Accuracy        4            5            4           
Coherence       4            4            4           
Conciseness     3            3            3           

### SIMPLE PROMPT RESPONSE ###
COMPLETENESS: 4
The student provides a good overview of the article's main points, but could have delved deeper into the supporting details.

ACCURACY: 5
The student accurately summarizes the article's content, including specific quotes and facts.

COHERENCE: 4
The essay is well-organized, but could benefit from a clearer introduction and conclusion to frame the discussion.

CONCISENESS: 3
The student provides some repetitive language and could have condensed the essay to focus on the most essential points.

BRIEF FE

In [14]:
from google.colab import drive
drive.mount('/content/drive')

import json
import pandas as pd
from datetime import datetime

# Create results DataFrame
rows = []
for summary_id, data in results.items():
    row = {'summary_id': summary_id}
    for dim in dimensions:
        row[f'llm_{dim}'] = data['llm_scores'].get(dim)
        row[f'gt_{dim}'] = data['ground_truth'].get(dim)
    rows.append(row)

df = pd.DataFrame(rows)

# Save scores (CSV) and full responses (JSON) under the project folder on Drive
timestamp = datetime.now().strftime('%Y%m%d_%H%M')
project_dir = '/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project'

output_path = f'{project_dir}/LLM_Evaluation_Results/llm_evaluation_results_{timestamp}.csv'
os.makedirs(os.path.dirname(output_path), exist_ok=True)  # `os` is imported in the authentication cell
df.to_csv(output_path, index=False)
print(f"Results saved to: {output_path}")

# Save full responses
responses_path = f'{project_dir}/LLM_Responses/llm_responses_{timestamp}.json'
os.makedirs(os.path.dirname(responses_path), exist_ok=True)
with open(responses_path, 'w', encoding='utf-8') as f:
    json.dump({k: v['response'] for k, v in results.items()}, f, indent=2)
print(f"Full responses saved to: {responses_path}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Results saved to: /content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/LLM_Evaluation_Results/llm_evaluation_results_20251203_0643.csv
Full responses saved to: /content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/LLM_Responses/llm_responses_20251203_0643.json


In [15]:
print("\n" + "="*70)
print("TEMPERATURE EXPERIMENT")
print("="*70)

test_id = "VAL_04"
test_summary = TEST_SUMMARIES[test_id]
prompt = create_evaluation_prompt(test_summary)

temperatures = [0.0, 0.1, 0.3]

for temp in temperatures:
    print(f"\n--- Temperature: {temp} ---")
    response = generate_evaluation(prompt, temperature=temp)
    scores = parse_scores(response)
    print(f"Scores: {scores}")

print("\nNOTE: Lower temperature = more deterministic/consistent")
print("      Higher temperature = more varied/creative responses")

# %% [markdown]
# ## Next Steps
#
# 1. **Review the results** - Check which summaries show good agreement
# 2. **Identify patterns** - Which dimensions are harder for the LLM?
# 3. **Refine the prompt** - Add examples, clarify instructions
# 4. **Run on full validation set** - Test all 25 summaries
# 5. **Calculate Cohen's Kappa** - Formal inter-rater reliability metric


TEMPERATURE EXPERIMENT

--- Temperature: 0.0 ---
Scores: {'completeness': 4, 'accuracy': 4, 'coherence': 4, 'conciseness': 3}

--- Temperature: 0.1 ---
Scores: {'completeness': 4, 'accuracy': 4, 'coherence': 4, 'conciseness': 3}

--- Temperature: 0.3 ---
Scores: {'completeness': 4, 'accuracy': 4, 'coherence': 4, 'conciseness': 3}

NOTE: Lower temperature = more deterministic/consistent
      Higher temperature = more varied/creative responses
