<a href="https://colab.research.google.com/github/baker-jr-john/automated-summary-evaluation-llm/blob/main/automated_summary_evaluation_llm_rubric_feedback.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Load the dataset - use comma separator (default for CSV)
df = pd.read_csv('/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/ASAP2_train_sourcetexts.csv',
                 encoding='ISO-8859-1')

# Explore the structure
print("Columns:", df.columns.tolist())
print("\nShape:", df.shape)
print("\nFirst few rows:")
print(df.head())

# Check what types of assignments/prompts exist
print("\nUnique prompts:")
print(df['prompt_name'].value_counts())

# Look at score distribution
print("\nScore distribution:")
print(df['score'].value_counts().sort_index())

Mounted at /content/drive
Columns: ['essay_id', 'score', 'full_text', 'assignment', 'prompt_name', 'economically_disadvantaged', 'student_disability_status', 'ell_status', 'race_ethnicity', 'gender', 'source_text_1', 'source_text_2', 'source_text_3', 'source_text_4']

Shape: (24728, 14)

First few rows:
               essay_id  score  \
0  AAAVUP14319000159574      4   
1  AAAVUP14319000159542      2   
2  AAAVUP14319000159461      3   
3  AAAVUP14319000159420      2   
4  AAAVUP14319000159419      2   

                                           full_text  \
0  The author suggests that studying Venus is wor...   
1  NASA is fighting to be alble to to go to Venus...   
2  "The Evening Star", is one of the brightest po...   
3  The author supports this idea because from rea...   
4  How the author supports this idea is that he s...   

                                          assignment      prompt_name  \
0  In "The Challenge of Exploring Venus," the aut...  Exploring Venus   
1  In "

In [2]:
# Look at the actual assignment prompts to understand task types
print("=" * 80)
for prompt in df['prompt_name'].unique():
    subset = df[df['prompt_name'] == prompt]
    print(f"\n{prompt} ({len(subset)} responses)")
    print(f"Score range: {subset['score'].min()}-{subset['score'].max()}")
    print("\nAssignment:")
    print(subset['assignment'].iloc[0][:300] + "...")  # First 300 chars
    print("-" * 80)


Exploring Venus (4480 responses)
Score range: 1-6

Assignment:
In "The Challenge of Exploring Venus," the author suggests studying Venus is a worthy pursuit despite the dangers it presents. Using details from the article, write an essay evaluating how well the author supports this idea. Be sure to include: a claim that evaluates how well the author supports the...
--------------------------------------------------------------------------------

Facial action coding system (4883 responses)
Score range: 1-6

Assignment:
In the article "Making Mona Lisa Smile," the author describes how a new technology called the Facial Action Coding System enables computers to identify human emotions. Using details from the article, write an essay arguing whether the use of this technology to read the emotional expressions of stude...
--------------------------------------------------------------------------------

The Face on Mars (3015 responses)
Score range: 1-6

Assignment:
You have read the article

In [3]:
# Filter for Exploring Venus responses
venus_df = df[df['prompt_name'] == 'Exploring Venus'].copy()

print(f"Total Venus responses: {len(venus_df)}")
print(f"\nScore distribution:")
print(venus_df['score'].value_counts().sort_index())

# Look at the source text
print("\n" + "="*80)
print("SOURCE TEXT:")
print("="*80)
print(venus_df['source_text_1'].iloc[0])

# Examine sample responses across score levels
print("\n" + "="*80)
print("SAMPLE RESPONSES BY SCORE LEVEL:")
print("="*80)

for score in sorted(venus_df['score'].unique()):
    print(f"\n--- SCORE {score} EXAMPLE ---")
    sample = venus_df[venus_df['score'] == score].iloc[0]
    print(sample['full_text'][:400] + "...")

Total Venus responses: 4480

Score distribution:
score
1     567
2    1419
3    1469
4     808
5     175
6      42
Name: count, dtype: int64

SOURCE TEXT:
The Challenge of Exploring Venus
Venus, sometimes called the √¢¬Ä¬úEvening Star,√¢¬Ä¬ù is one of the brightest points of light in the night sky, making it simple for even and amateur stargazer to spot. However, this nickname is misleading since Venus is actually a planet. While Venus is simple to see from the distant but safe vantage point of Earth, it has proved a very challenging place to examine more closely. 
Often referred to as Earth's √¢¬Ä¬útwin,√¢¬Ä¬ù Venus is the closest planet to Earth in terms of density and size, and occasionally the closest in distance too. Earth, Venus, and Mars, our other planetary neighbor, orbit the sun at different speeds. These differences in speed mean that sometimes we are closer to Mars and other times to Venus. Because Venus is sometimes right around the corner - in space terms - humans have sp

In [4]:
import random

# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)

# Calculate proportional samples (36 total based on original 6-point distribution)
# Proportions: 1=13%, 2=32%, 3=33%, 4=18%, 5=4%, 6=1%
samples_needed = {
    1: 5,   # ~13% (567/4480)
    2: 11,  # ~32% (1,419/4480)
    3: 12,  # ~33% (1,469/4480)
    4: 6,   # ~18% (808/4480)
    5: 2,   # ~4% (175/4480)
    6: 0    # ~1% (42/4480) - too few to sample reliably, we'll grab these separately
}

# For score 6, let's just include all available or sample very carefully
# Since there are only 42 total, we could include 1-2 in the validation set

sampled_rows = []
for score, n_samples in samples_needed.items():
    if n_samples > 0:
        score_subset = venus_df[venus_df['score'] == score]
        if len(score_subset) >= n_samples:
            sample = score_subset.sample(n=n_samples, random_state=42)
            sampled_rows.append(sample)

# For score 6, sample 1 if we want to include it
score_6_subset = venus_df[venus_df['score'] == 6]
if len(score_6_subset) > 0:
    score_6_sample = score_6_subset.sample(n=1, random_state=42)
    sampled_rows.append(score_6_sample)

venus_validation_sample = pd.concat(sampled_rows)

print(f"\nSampled {len(venus_validation_sample)} Venus responses for validation")
print("\n6-point score distribution in sample:")
print(venus_validation_sample['score'].value_counts().sort_index())
print("\nPercentages:")
print(venus_validation_sample['score'].value_counts(normalize=True).sort_index() * 100)

# Save validation sample
venus_validation_sample.to_csv('/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/validation_set_venus_36.csv', index=False)

print("\n‚úÖ Saved validation sample (6-point scale)!")


Sampled 37 Venus responses for validation

6-point score distribution in sample:
score
1     5
2    11
3    12
4     6
5     2
6     1
Name: count, dtype: int64

Percentages:
score
1    13.513514
2    29.729730
3    32.432432
4    16.216216
5     5.405405
6     2.702703
Name: proportion, dtype: float64

‚úÖ Saved validation sample (6-point scale)!


In [5]:
# ========================================
# QUICK TEST: Generate 3 Synthetic Examples
# ========================================
import pandas as pd
from datetime import datetime
import time
from google.colab import drive, userdata
from openai import OpenAI

drive.mount('/content/drive', force_remount=True)

api_key = userdata.get('OPENAI_API_KEY')
client = OpenAI(api_key=api_key)

# Load Venus source text
df = pd.read_csv('/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/ASAP2_train_sourcetexts.csv',
                 encoding='ISO-8859-1')
venus_df = df[df['prompt_name'] == 'Exploring Venus']
VENUS_SOURCE = venus_df['source_text_1'].iloc[0]
VENUS_ASSIGNMENT = venus_df['assignment'].iloc[0]

# Rubric
RUBRIC = """
COMPLETENESS (1-5): Coverage of main ideas and supporting details
ACCURACY (1-5): Factual correctness and faithful representation
COHERENCE (1-5): Organization, transitions, logical flow
CONCISENESS (1-5): Appropriate length without repetition
"""

def generate_example(essay_id, score, error_type, instructions, word_target):
    prompt = f"""You are simulating a grade 7-8 middle school student writing an evaluative essay.

ASSIGNMENT: {VENUS_ASSIGNMENT}

SOURCE TEXT: {VENUS_SOURCE}

RUBRIC: {RUBRIC}

TASK: Write a student response earning score {score}/6 with these characteristics:
{instructions}

Target length: {word_target} words
Use authentic middle school vocabulary and style.
Write only the student essay (no meta-commentary):"""

    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You simulate authentic middle school writing for research."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.9,
            max_tokens=800
        )

        text = response.choices[0].message.content.strip()

        return {
            'essay_id': essay_id,
            'score': score,
            'full_text': text,
            'synthetic_flag': True,
            'target_error_pattern': error_type,
            'word_count': len(text.split()),
            'generation_date': datetime.now().isoformat(),
            'assignment': VENUS_ASSIGNMENT,
            'prompt_name': 'Exploring Venus',
            'source_text_1': VENUS_SOURCE
        }
    except Exception as e:
        print(f"Error: {e}")
        return None

# TEST: Generate 3 examples
configs = [
    {'essay_id': 'SYNTH_V_01_S1', 'score': 1, 'error_type': 'Severe incompleteness + fabrication',
     'word_target': '100-150',
     'instructions': 'Cover only 1-2 superficial details. Include 2-3 fabricated facts. Show fundamental misunderstanding. Random organization.'},

    {'essay_id': 'SYNTH_V_04_S2', 'score': 2, 'error_type': 'Completeness gap',
     'word_target': '150-200',
     'instructions': 'Identify main claim correctly but provide only 1-2 vague examples. Omit specific evidence (temperatures, NASA solutions). Very superficial.'},

    {'essay_id': 'SYNTH_V_11_S3', 'score': 3, 'error_type': 'Good content, weak coherence',
     'word_target': '220-250',
     'instructions': 'Cover all main ideas with adequate detail. Use awkward transitions. Random ordering. Choppy flow. Good content, poor organization.'},
]

results = []
for i, cfg in enumerate(configs, 1):
    print(f"[{i}/3] Generating {cfg['essay_id']}...", end=" ")
    result = generate_example(**cfg)
    if result:
        results.append(result)
        print(f"‚úì ({result['word_count']} words)")
        print(f"Preview: {result['full_text'][:200]}...\n")
    time.sleep(1)

# Save test results
test_df = pd.DataFrame(results)
test_path = '/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/test_synthetic_3.csv'
test_df.to_csv(test_path, index=False)

print(f"‚úÖ Generated {len(results)} test examples")
print(f"üìÅ Saved to: {test_path}")
print("\nüëÄ Review these examples. If they look good, proceed to full generation!")

Mounted at /content/drive
[1/3] Generating SYNTH_V_01_S1... ‚úì (130 words)
Preview: In "The Challenge of Exploring Venus," the author talks about studying Venus, but I think they don't do a good job explaining why it's worth it. They mention that Venus has volcanoes and can be really...

[2/3] Generating SYNTH_V_04_S2... ‚úì (161 words)
Preview: In "The Challenge of Exploring Venus," the author argues that studying Venus is worth it even though it‚Äôs a really dangerous place. I think the author does an okay job of explaining this idea, but not...

[3/3] Generating SYNTH_V_11_S3... ‚úì (235 words)
Preview: In "The Challenge of Exploring Venus," the author argues that studying Venus is a valuable goal, even though it is dangerous. While the article shares great information about Venus, the author could h...

‚úÖ Generated 3 test examples
üìÅ Saved to: /content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/test_s

In [6]:
# ========================================
# REGENERATE 2 EXAMPLES WITH REFINED PROMPTS
# ========================================
from openai import OpenAI
import pandas as pd
from datetime import datetime
import time
from google.colab import drive, userdata

drive.mount('/content/drive', force_remount=True)

api_key = userdata.get('OPENAI_API_KEY')
client = OpenAI(api_key=api_key)

# Load Venus source text (if not already loaded)
df = pd.read_csv('/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/ASAP2_train_sourcetexts.csv',
                 encoding='ISO-8859-1')
venus_df = df[df['prompt_name'] == 'Exploring Venus']
VENUS_SOURCE = venus_df['source_text_1'].iloc[0]
VENUS_ASSIGNMENT = venus_df['assignment'].iloc[0]

# Rubric
RUBRIC = """
COMPLETENESS (1-5): Coverage of main ideas and supporting details
ACCURACY (1-5): Factual correctness and faithful representation
COHERENCE (1-5): Organization, transitions, logical flow
CONCISENESS (1-5): Appropriate length without repetition
"""

# Generation function
def generate_example(essay_id, score, error_type, instructions, word_target):
    prompt = f"""You are simulating a grade 7-8 middle school student writing an evaluative essay.

ASSIGNMENT: {VENUS_ASSIGNMENT}

SOURCE TEXT: {VENUS_SOURCE}

RUBRIC: {RUBRIC}

TASK: Write a student response earning score {score}/6 with these characteristics:
{instructions}

Target length: {word_target} words
Use authentic middle school vocabulary and style.
Write only the student essay (no meta-commentary):"""

    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You simulate authentic middle school writing for research."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.9,
            max_tokens=800
        )

        text = response.choices[0].message.content.strip()

        return {
            'essay_id': essay_id,
            'score': score,
            'full_text': text,
            'synthetic_flag': True,
            'target_error_pattern': error_type,
            'word_count': len(text.split()),
            'generation_date': datetime.now().isoformat(),
            'assignment': VENUS_ASSIGNMENT,
            'prompt_name': 'Exploring Venus',
            'source_text_1': VENUS_SOURCE
        }
    except Exception as e:
        print(f"Error: {e}")
        return None

# ========================================
# REFINED CONFIGURATIONS (THE KEY CHANGES)
# ========================================

refined_configs = [
    # SCORE 1 - REVISED for more chaos
    {
        'essay_id': 'SYNTH_V_01_S1_REVISED',
        'score': 1,
        'error_type': 'Severe incompleteness + fabrication',
        'word_target': '100-150',
        'instructions': """Cover only 1-2 superficial details like "Venus is bright" or "it's hot."
Include 2-3 fabricated facts (e.g., say "hotter than the sun" or "NASA already went there").

CRITICAL - Make it SCATTERED and CHAOTIC:
- Jump between totally unrelated ideas with NO logical connections
- NO clear introduction-body-conclusion structure
- Let ideas trail off or suddenly change direction mid-thought
- Use only simple transitions: "Also," "And," or just start new sentences randomly
- Make at least 2-3 sentences that don't connect to anything around them
- Reader should feel confused trying to follow your point

Example of scattered style you should use:
"Venus is really hot I think. Also there's blimps or something? The article talks about dangers but I can't remember. And they should explore Mars instead because it's better. Venus has acid maybe. I heard NASA already landed there but it broke. Also space is cool."

Show you fundamentally misunderstood the assignment. Make it feel random and unfocused."""
    },

    # SCORE 3 - REVISED for choppier flow
    {
        'essay_id': 'SYNTH_V_11_S3_REVISED',
        'score': 3,
        'error_type': 'Good content, weak coherence',
        'word_target': '220-250',
        'instructions': """Cover ALL main ideas with GOOD specific details:
- Dangers: 800¬∞F temperature, sulfuric acid, 97% CO2 atmosphere, 90x pressure
- Solutions: NASA's blimp at 30 miles altitude, mechanical computers, silicon carbide
- Value: Venus may have had oceans, Earth's twin, scientific curiosity

CRITICAL - Good content but CHOPPY execution:
- Use ONLY weak transitions: "Also," "And," "Another thing," "So," "Plus"
- NEVER use sophisticated ones: "Furthermore," "Additionally," "Moreover," "In conclusion"
- Present good ideas but in somewhat random order - jump between topics
- Make each paragraph feel disconnected from the previous one

Example of choppy style you should use:
"The author talks about Venus being super dangerous. It's like 800 degrees with sulfuric acid everywhere and 97% carbon dioxide. Also NASA has this idea about using blimps to float above Venus at like 30 miles up. And the temperature would still be hot but humans could survive. Plus Venus might have had oceans a long time ago so that's why scientists care. Another thing is they're making mechanical computers that can handle the extreme heat and pressure."

Reader should think: "This student clearly understood the article and has good details, but the organization and flow are rough. It feels choppy."

DO NOT write a polished conclusion that ties everything together - make it feel more abrupt."""
    }
]

# ========================================
# GENERATE THE 2 REVISED EXAMPLES
# ========================================

print("="*60)
print("GENERATING 2 REVISED EXAMPLES WITH REFINED PROMPTS")
print("="*60)
print()

revised_results = []
for i, cfg in enumerate(refined_configs, 1):
    print(f"[{i}/2] Generating {cfg['essay_id']}...")
    print(f"Target: {cfg['error_type']}")
    print()

    result = generate_example(**cfg)

    if result:
        revised_results.append(result)
        print(f"‚úì Generated! ({result['word_count']} words)\n")
        print(f"FULL TEXT:")
        print("-"*60)
        print(result['full_text'])
        print("-"*60)
        print()

    time.sleep(2)  # Rate limiting

# ========================================
# SAVE REVISED EXAMPLES
# ========================================

if revised_results:
    revised_df = pd.DataFrame(revised_results)
    revised_path = '/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/test_synthetic_REVISED.csv'
    revised_df.to_csv(revised_path, index=False)

    print("\n" + "="*60)
    print("‚úÖ REGENERATION COMPLETE!")
    print("="*60)
    print(f"üìÅ Saved to: {revised_path}")
    print()
    print("üìä COMPARISON:")
    print(f"  Original Score 1: 129 words, somewhat coherent")
    print(f"  Revised Score 1:  {revised_results[0]['word_count']} words")
    print()
    print(f"  Original Score 3: 241 words, too smooth")
    print(f"  Revised Score 3:  {revised_results[1]['word_count']} words")
    print()
    print("üîç REVIEW THE TEXT ABOVE")
    print("   Check if Score 1 now feels scattered/chaotic")
    print("   Check if Score 3 now feels choppy but has good content")
    print()
    print("‚úÖ If satisfied ‚Üí Proceed to STEP 2")
    print("‚ùå If not quite right ‚Üí Adjust instructions and regenerate")

Mounted at /content/drive
GENERATING 2 REVISED EXAMPLES WITH REFINED PROMPTS

[1/2] Generating SYNTH_V_01_S1_REVISED...
Target: Severe incompleteness + fabrication

‚úì Generated! (139 words)

FULL TEXT:
------------------------------------------------------------
Venus is really hot, like, hotter than the sun, I think. The article says it has clouds, which are not nice at all. NASA already went there, but the spaceship broke, so that's bad. Also, people say Venus is bright in the sky, but that doesn‚Äôt mean it‚Äôs safe. And they should explore Mars instead because it‚Äôs better for studying life maybe. Venus has acid and volcanoes, which seems dangerous, but it kind of sounds exciting too like a video game. There‚Äôs blimps or something that could fly over it to keep people safe, but I don‚Äôt get how that works. Also, the atmosphere is weird and has carbon, I think. The author talks about innovations, but why don‚Äôt they just forget it? Space is cool but Venus sounds too risky. So,

In [7]:
"""
COMPLETE SYNTHETIC EXAMPLES GENERATOR - REFINED VERSION
Generates 23 synthetic Venus summaries with improved authenticity
- Score 1: Enhanced chaotic, scattered organization
- Score 3: Enhanced choppy flow with good content
- Scores 2, 4, 5: Unchanged (already working well)
"""

from openai import OpenAI
import pandas as pd
from datetime import datetime
import time
from google.colab import drive, userdata

drive.mount('/content/drive', force_remount=True)

# ===========================================
# CONFIGURATION
# ===========================================

# Set your OpenAI API key
api_key = userdata.get('OPENAI_API_KEY')
client = OpenAI(api_key=api_key)

# File paths
AUTHENTIC_PATH = '/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/validation_set_venus_36.csv'
SYNTHETIC_PATH = '/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/validation_set_synthetic_23.csv'
COMBINED_PATH = '/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/validation_set_combined_60.csv'

# ===========================================
# LOAD VENUS SOURCE TEXT
# ===========================================

def load_venus_source():
    """Load the Venus article text from the ASAP dataset"""
    df = pd.read_csv('/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/ASAP2_train_sourcetexts.csv',
                     encoding='ISO-8859-1')
    venus_df = df[df['prompt_name'] == 'Exploring Venus']
    source_text = venus_df['source_text_1'].iloc[0]
    assignment = venus_df['assignment'].iloc[0]
    return source_text, assignment

VENUS_SOURCE, VENUS_ASSIGNMENT = load_venus_source()

# ===========================================
# RUBRIC
# ===========================================

RUBRIC = """
RUBRIC FOR VENUS SUMMARY EVALUATION (1-5 scale per dimension):

COMPLETENESS (1-5):
5: Comprehensive coverage of all main ideas with strong supporting details
4: Covers main ideas with good supporting details, minor gaps acceptable
3: Covers basic main ideas but missing some key supporting details
2: Partial coverage with significant gaps
1: Minimal coverage, major ideas missing

ACCURACY (1-5):
5: All information factually correct
4: Mostly accurate with only minor imprecisions
3: Generally accurate but with some notable errors
2: Multiple factual errors or significant misrepresentations
1: Major factual errors, invented details, fundamental misunderstandings

COHERENCE (1-5):
5: Excellent organization with smooth transitions and clear logical flow
4: Well-organized with generally good transitions
3: Basic organization but with awkward transitions or logical gaps
2: Poor organization, weak transitions, difficult to follow
1: Incoherent, random organization, no clear structure

CONCISENESS (1-5):
5: Appropriately concise (200-250 words), no unnecessary repetition
4: Reasonably concise (250-280 words), minimal redundancy
3: Somewhat verbose (280-320 words) or with noticeable repetition
2: Too long (320-400 words) with significant repetition
1: Extremely brief (<150 words) or excessively long (>400 words)
"""

# ===========================================
# GENERATION FUNCTIONS
# ===========================================

def create_generation_prompt(score, specific_instructions, word_count_guidance):
    """Create prompt for generating a synthetic example"""
    return f"""You are simulating an authentic middle school student (grades 7-8) writing an evaluative essay about whether the author of an article successfully supports their argument. This is for educational research to test an automated assessment system.

ASSIGNMENT:
{VENUS_ASSIGNMENT}

SOURCE TEXT:
{VENUS_SOURCE}

RUBRIC:
{RUBRIC}

YOUR TASK:
Write a student response that would realistically earn a score of {score} on the 6-point scale based on the rubric above.

SPECIFIC REQUIREMENTS FOR THIS SAMPLE:
{specific_instructions}

WRITING GUIDELINES:
- Use vocabulary and sentence structure typical of grades 7-8
- Target length: {word_count_guidance} words
- Include some natural middle school writing patterns (minor grammar quirks, occasional informal phrasing)
- Make it feel authentic - not overly polished or obviously AI-generated
- Focus on the CONTENT errors specified above (don't make it artificially bad with excessive spelling/grammar errors)
- Stay focused on the Venus exploration topic
- Remember: this is evaluating HOW WELL THE AUTHOR SUPPORTS THE IDEA, not just summarizing

Write only the student essay response (no meta-commentary):"""

def generate_synthetic_example(example_config, model="gpt-4o-mini"):
    """Generate a single synthetic example using GPT-4o-Mini"""

    prompt = create_generation_prompt(
        score=example_config['score'],
        specific_instructions=example_config['instructions'],
        word_count_guidance=example_config['word_count']
    )

    try:
        # ‚úÖ Use Chat Completions via the client
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are an expert at simulating authentic middle school "
                        "student writing for educational research purposes."
                    )
                },
                {
                    "role": "user",
                    "content": prompt
                },
            ],
            temperature=0.9,
            max_tokens=800,
        )

        generated_text = response.choices[0].message.content.strip()

        return {
            'essay_id': example_config['essay_id'],
            'score': example_config['score'],
            'full_text': generated_text,
            'assignment': VENUS_ASSIGNMENT,
            'prompt_name': 'Exploring Venus',
            'source_text_1': VENUS_SOURCE,
            'source_text_2': None,
            'source_text_3': None,
            'source_text_4': None,
            'economically_disadvantaged': 'Synthetic',
            'student_disability_status': 'Synthetic',
            'ell_status': 'Synthetic',
            'race_ethnicity': 'Synthetic',
            'gender': 'Synthetic',
            'synthetic_flag': True,
            'target_error_pattern': example_config['target_error'],
            'generation_date': datetime.now().isoformat(),
            'generation_model': model,
            'word_count': len(generated_text.split())
        }

    except Exception as e:
        print(f"Error generating {example_config['essay_id']}: {e}")
        return None

# ===========================================
# ALL 23 SYNTHETIC EXAMPLE CONFIGURATIONS
# ===========================================

SYNTHETIC_EXAMPLES = [

    # ========================================
    # SCORE 1 EXAMPLES (3 total) - REFINED
    # ========================================

    {
        'essay_id': 'SYNTH_V_01_S1',
        'score': 1,
        'target_error': 'Severe incompleteness + fabrication',
        'word_count': '100-150',
        'instructions': """Cover only 1-2 superficial details like "Venus is bright" or "it's hot."
Include 2-3 fabricated facts (e.g., say "hotter than the sun" or "NASA already went there").

CRITICAL - Make it SCATTERED and CHAOTIC:
- Jump between totally unrelated ideas with NO logical connections
- NO clear introduction-body-conclusion structure
- Let ideas trail off or suddenly change direction mid-thought
- Use only simple transitions: "Also," "And," or just start new sentences randomly
- Make at least 2-3 sentences that don't connect to anything around them
- Reader should feel confused trying to follow your point

Example scattered style: "Venus is really hot I think. Also there's blimps or something? The article talks about dangers but I can't remember. And they should explore Mars instead. Venus has acid maybe."

Show you fundamentally misunderstood the assignment. Make it feel random and unfocused."""
    },

    {
        'essay_id': 'SYNTH_V_02_S1',
        'score': 1,
        'target_error': 'Extreme brevity + major misunderstandings',
        'word_count': '80-120',
        'instructions': """Write only 3-5 sentences total (extremely brief).
Fundamentally misrepresent the article's main argument (say scientists have already successfully explored Venus when the article is about future plans and challenges).

CRITICAL - Make it SCATTERED and CHAOTIC:
- Jump between ideas with no connections
- NO structure at all
- Treat Venus exploration as if it's already accomplished rather than a future challenge
- Miss the evaluative component entirely - don't assess whether the author supported their claim well
- Use simple transitions only: "Also," "And," or none
- Let thoughts trail off incompletely

Show you didn't understand what the article was actually about. Make it feel very confused."""
    },

    {
        'essay_id': 'SYNTH_V_03_S1',
        'score': 1,
        'target_error': 'Off-topic rambling + factual confusion',
        'word_count': '150-200',
        'instructions': """Spend most of the essay on tangential topics (other planets, space exploration in general, why space is cool).
Confuse Venus with Mars or Mercury in several places.
Include information about planets that wasn't in the article at all.

CRITICAL - Make it SCATTERED and CHAOTIC:
- Jump wildly between unrelated topics: Venus ‚Üí other planets ‚Üí space careers ‚Üí back to Venus ‚Üí random facts
- NO clear focus on the assigned task
- Incoherent connections between ideas
- Simple or no transitions
- Several sentences that feel completely disconnected
- Never clearly address whether the author supported their argument

Show the student didn't focus on the assigned task and got distracted by tangents."""
    },

    # ========================================
    # SCORE 2 EXAMPLES (7 total) - UNCHANGED
    # ========================================

    {
        'essay_id': 'SYNTH_V_04_S2',
        'score': 2,
        'target_error': 'Completeness gap - missing critical supporting details',
        'word_count': '150-200',
        'instructions': """Correctly identify the main claim (studying Venus is worthy despite dangers).
Mention that the author discusses dangers and solutions.
BUT only provide 1-2 very vague examples (e.g., "there are dangers" without specifying what).
Omit key evidence like specific temperatures, NASA's blimp solution, mechanical computers, etc.
Show basic understanding but very superficial engagement with the text.
Address the evaluation aspect but without sufficient detail to be convincing."""
    },

    {
        'essay_id': 'SYNTH_V_05_S2',
        'score': 2,
        'target_error': 'Accuracy issues - multiple misrepresentations',
        'word_count': '180-220',
        'instructions': """Capture the basic structure (dangers ‚Üí solutions ‚Üí why it's worth it).
Include several factual errors: wrong temperature (say 600¬∞F instead of 800¬∞F), wrong atmospheric pressure, wrong altitude for NASA's blimp.
Misattribute information (e.g., say Mercury is Earth's twin, or confuse which planet has the hottest surface).
Mix up timeframes (say missions were recent when they were decades ago).
Show the student read the article but remembered details incorrectly."""
    },

    {
        'essay_id': 'SYNTH_V_06_S2',
        'score': 2,
        'target_error': 'Coherence problems - poor organization',
        'word_count': '180-220',
        'instructions': """Include relevant content from the article but present it in random order.
Jump from dangers to solutions back to dangers to why Venus is interesting with no logical flow.
Use very weak or missing transitions ("Also..." or "And another thing...").
Make it hard to follow the argument even though the information is present.
Repeat the same point in different places rather than grouping related ideas."""
    },

    {
        'essay_id': 'SYNTH_V_07_S2',
        'score': 2,
        'target_error': 'Conciseness problems - excessive length with repetition',
        'word_count': '400-450',
        'instructions': """Include accurate information about Venus.
Repeat the same points 3-4 times using slightly different wording each time.
Say things like "Venus is dangerous because of the heat. The heat on Venus is extreme. The temperatures on Venus are very hot."
Include unnecessary elaboration on minor details.
Make it feel like padding to meet a length requirement.
Could easily be cut to 200 words without losing content."""
    },

    {
        'essay_id': 'SYNTH_V_08_S2',
        'score': 2,
        'target_error': 'Quote-heavy with minimal synthesis',
        'word_count': '180-220',
        'instructions': """Rely heavily on direct phrases from the article (don't use actual quotation marks, but use near-quotes).
String together borrowed phrases with minimal original summarization.
Reads like a patchwork: "The article says [near quote]. It also mentions [near quote]. The author states [near quote]."
Very little original synthesis or paraphrasing.
Shows the student didn't process the information, just copied it.
Weak evaluation of whether the author's support is effective."""
    },

    {
        'essay_id': 'SYNTH_V_09_S2',
        'score': 2,
        'target_error': 'Shallow coverage - lists facts without connections',
        'word_count': '160-200',
        'instructions': """Write in a bullet-point style or list-like structure even without actual bullets.
"First, Venus is hot. Second, there is acid. Third, NASA has an idea."
List facts from the article without connecting them or showing relationships.
No clear evaluation of the author's argument - just recitation.
Miss the analytical component entirely.
Each sentence feels disconnected from the previous one."""
    },

    {
        'essay_id': 'SYNTH_V_10_S2',
        'score': 2,
        'target_error': 'Personal opinion intrusion',
        'word_count': '180-220',
        'instructions': """Start summarizing the article but then shift into personal opinions.
Use phrases like "I think we should explore Venus because..." or "I believe the author is right because I've always been interested in space..."
Include 2-3 paragraphs about the student's own views on space exploration.
Lose objectivity required for summary/evaluation.
Confuse personal response with evaluation of the author's support.
Mix "the author supports this" with "I agree because..." """
    },

    # ========================================
    # SCORE 3 EXAMPLES (8 total) - REFINED
    # ========================================

    {
        'essay_id': 'SYNTH_V_11_S3',
        'score': 3,
        'target_error': 'Good completeness, weak coherence',
        'word_count': '220-250',
        'instructions': """Cover ALL main ideas with GOOD specific details:
- Dangers: 800¬∞F temperature, sulfuric acid, 97% CO2 atmosphere, 90x pressure
- Solutions: NASA's blimp at 30 miles altitude, mechanical computers, silicon carbide
- Value: Venus may have had oceans, Earth's twin, scientific curiosity

CRITICAL - Good content but CHOPPY execution:
- Use ONLY weak transitions: "Also," "And," "Another thing," "So," "Plus"
- NEVER use: "Furthermore," "Additionally," "Moreover," "In conclusion"
- Present good ideas in somewhat random order - jump between topics
- Make each paragraph feel disconnected from the previous one

Example choppy style: "The author talks about Venus being super dangerous. It's like 800 degrees with sulfuric acid. Also NASA has this idea about blimps. And the temperature would be hot but survivable. Plus Venus might have had oceans."

Reader should think: "Good content but rough organization." DO NOT write polished conclusion."""
    },

    {
        'essay_id': 'SYNTH_V_12_S3',
        'score': 3,
        'target_error': 'Good accuracy/completeness, conciseness issues',
        'word_count': '320-350',
        'instructions': """Cover all main ideas with accurate, thorough detail.
Include all key facts with correct information.

CRITICAL - Good content but CHOPPY execution AND TOO LONG:
- Use ONLY weak transitions: "Also," "And," "Another thing," "So," "Plus"
- Be moderately too long (320-350 words)
- Include some unnecessary elaboration or minor tangential details
- Some repetition of ideas
- Could be tightened significantly without losing content
- Good substance but needs editing
- Somewhat random organization

Make it feel like the student knows the material well but wrote too much with choppy flow."""
    },

    {
        'essay_id': 'SYNTH_V_13_S3',
        'score': 3,
        'target_error': 'Good structure, minor accuracy lapses',
        'word_count': '220-250',
        'instructions': """Write with decent organization and appropriate length.
Include good coverage of main ideas.

CRITICAL - Good content but CHOPPY execution PLUS minor errors:
- Use ONLY weak transitions: "Also," "And," "So," "Plus"
- Include 2-3 minor factual errors (slightly wrong numbers)
- Example: say the blimp would be 20 miles up instead of 30, or say 80% carbon dioxide instead of 97%
- Errors are small enough that overall understanding is clear
- Somewhat choppy flow with weak transitions

Otherwise solid summary with adequate evaluation."""
    },

    {
        'essay_id': 'SYNTH_V_14_S3',
        'score': 3,
        'target_error': 'Adequate but mechanical',
        'word_count': '200-230',
        'instructions': """Hit all required elements in a formulaic way.
"The author supports this idea in three ways. First,... Second,... Third,..."

CRITICAL - Good content but CHOPPY execution AND mechanical:
- Use ONLY weak transitions: "Also," "And," "First," "Second," "Third"
- Very five-paragraph-essay structure that feels paint-by-numbers
- Overly simplistic sentence structure throughout (mostly simple sentences, few complex ones)
- Feels formulaic but technically complete
- Adequate but uninspired
- Choppy transitions between sections

Make it feel like following a template rather than natural writing."""
    },

    {
        'essay_id': 'SYNTH_V_15_S3',
        'score': 3,
        'target_error': 'Good content, weak introduction/conclusion',
        'word_count': '220-250',
        'instructions': """Write strong middle paragraphs with good detail about dangers, solutions, and value.
Include specific facts and evidence.

CRITICAL - Good content but CHOPPY execution PLUS weak framing:
- Use ONLY weak transitions in body: "Also," "And," "Plus," "So"
- Unclear or missing claim statement in introduction
- Introduction jumps straight into details without setting up the evaluation
- Abrupt ending or incomplete conclusion that doesn't tie ideas together
- The body is strong (score 4 content level) but framing is weak and choppy

Make the middle good but the beginning and end feel rough."""
    },

    {
        'essay_id': 'SYNTH_V_16_S3',
        'score': 3,
        'target_error': 'Imbalanced coverage',
        'word_count': '220-250',
        'instructions': """Write excellent, detailed coverage of the dangers (sulfuric acid, heat, pressure, etc.).
Then only 2-3 sentences total on NASA's solutions (very superficial).
Barely mention why Venus is scientifically valuable.

CRITICAL - Good content but CHOPPY execution AND imbalanced:
- Use ONLY weak transitions: "Also," "And," "Another thing"
- Show engagement with some sections but uneven attention
- Good depth in dangers, inadequate in solutions/value
- Choppy flow throughout
- Somewhat random organization

Make it obvious the student focused on one section and rushed through others."""
    },

    {
        'essay_id': 'SYNTH_V_17_S3',
        'score': 3,
        'target_error': 'Nearly good but with redundancy',
        'word_count': '260-290',
        'instructions': """Write with accurate information and decent structure.
Include good evaluation of the author's support.

CRITICAL - Good content but CHOPPY execution PLUS redundancy:
- Use ONLY weak transitions: "Also," "And," "Plus," "So"
- Repeat 2-3 points unnecessarily
- Example: mention the extreme heat in paragraph 2, then mention it again in paragraph 3 in similar words
- Some ideas stated twice without adding new information
- Choppy transitions throughout
- Could be excellent if tightened and smoothed

Make it feel like good understanding but needs editing for flow and conciseness."""
    },

    {
        'essay_id': 'SYNTH_V_18_S3',
        'score': 3,
        'target_error': 'Good summary with minor coherence gaps',
        'word_count': '220-250',
        'instructions': """Write comprehensive and accurate coverage.
Include good specific details.

CRITICAL - Good content but CHOPPY execution PLUS coherence hiccups:
- Use ONLY weak transitions: "Also," "And," "So," "Plus"
- Make one paragraph or section feel disconnected from the rest
- Include slightly confusing pronoun references (unclear antecedents)
- One transition that doesn't quite work
- Reader might need to reread one part to understand the connection
- Overall good but with noticeable choppiness

Make it feel like the content is there but organization could be smoother."""
    },

    # ========================================
    # SCORE 4 EXAMPLES (4 total) - UNCHANGED
    # ========================================

    {
        'essay_id': 'SYNTH_V_19_S4',
        'score': 4,
        'target_error': 'Excellent overall, slightly too lengthy',
        'word_count': '290-310',
        'instructions': """Write a clear, explicit evaluation of how well the author supports the argument.
Include comprehensive coverage of dangers (specific examples), solutions (blimp, mechanical computers), and scientific value.
Make it accurate throughout with good detail.
Organize well with smooth transitions.
BUT make it slightly longer than ideal (290-310 words).
Could be tightened by 40-60 words without losing substance.
Very strong work with only minor conciseness issue."""
    },

    {
        'essay_id': 'SYNTH_V_20_S4',
        'score': 4,
        'target_error': 'Very good but with minor conciseness issue',
        'word_count': '250-270',
        'instructions': """Write with excellent structure, accuracy, and completeness.
Include strong evaluation of the author's argument.
Make it clear and coherent throughout.
BUT include 1-2 sentences that could be tightened.
One slightly redundant point or phrase.
Example: might say both "very hot" and "extremely high temperatures" in close proximity.
Nearly perfect with just minor tightening needed."""
    },

    {
        'essay_id': 'SYNTH_V_21_S4',
        'score': 4,
        'target_error': 'Strong summary, minor accuracy detail',
        'word_count': '230-260',
        'instructions': """Write with excellent organization, completeness, and conciseness.
Include clear evaluation with strong supporting evidence.
Use smooth, sophisticated writing.
BUT include one small factual error that doesn't undermine the overall argument.
Example: say 85 times atmospheric pressure instead of 90, or 750¬∞F instead of 800¬∞F.
Error is minor enough that understanding remains strong.
Otherwise near-perfect."""
    },

    {
        'essay_id': 'SYNTH_V_22_S4',
        'score': 4,
        'target_error': 'Near-excellent but slightly mechanical',
        'word_count': '240-260',
        'instructions': """Hit all rubric criteria very well.
Make it accurate, complete, organized, and reasonably concise.
Include clear evaluation with good evidence.
BUT lack the sophisticated synthesis and insightful connections of score 5-6.
Be slightly formulaic in approach.
Very competent and thorough but doesn't have the "spark" of exceptional writing.
Very good but not quite excellent."""
    },

    # ========================================
    # SCORE 5 EXAMPLES (1 total) - UNCHANGED
    # ========================================

    {
        'essay_id': 'SYNTH_V_23_S5',
        'score': 5,
        'target_error': 'Excellent summary with very minor flaw',
        'word_count': '230-250',
        'instructions': """Write a clear, insightful evaluation of how the author builds their argument.
Include comprehensive coverage of all key evidence (dangers, NASA solutions, scientific value, mechanical computers, past missions).
Make all information accurate and well-synthesized.
Use excellent organization with smooth, sophisticated transitions.
Be appropriately concise (230-250 words) with no redundancy.
Show deep understanding and analytical thinking.
BUT include one VERY minor issue (e.g., two ideas that could be connected more explicitly, or one transition that's good but could be slightly smoother).
The flaw should be extremely subtle - this is nearly perfect work.
Should feel like strong high school or early college writing."""
    }
]

# ===========================================
# MAIN GENERATION FUNCTION
# ===========================================

def generate_all_synthetic_examples(save_path, delay=1.5):
    """Generate all 23 synthetic examples and save to CSV"""

    print("=" * 70)
    print(f"GENERATING {len(SYNTHETIC_EXAMPLES)} SYNTHETIC EXAMPLES")
    print("=" * 70)
    print(
        f"Estimated time: {len(SYNTHETIC_EXAMPLES) * 2:.0f} seconds "
        f"(~{len(SYNTHETIC_EXAMPLES) * 2 / 60:.0f} minutes)"
    )
    print()

    results = []

    for i, example_config in enumerate(SYNTHETIC_EXAMPLES, 1):
        print(
            f"[{i:2d}/{len(SYNTHETIC_EXAMPLES)}] "
            f"{example_config['essay_id']} (Score {example_config['score']})...",
            end=" ",
        )

        # ‚úÖ actually generate one example
        result = generate_synthetic_example(example_config)

        if result is not None:
            results.append(result)
            word_count = result["word_count"]
            print(f"‚úì ({word_count} words)")
        else:
            print("‚úó FAILED")

        # Rate limiting
        if i < len(SYNTHETIC_EXAMPLES):
            time.sleep(delay)

    # If all generations failed, bail out gracefully
    if not results:
        print(
            "\nNo synthetic examples were generated. "
            "Check the error messages above (likely an API or config issue)."
        )
        return pd.DataFrame()

    # Create DataFrame
    synthetic_df = pd.DataFrame(results)

    # Save to CSV
    synthetic_df.to_csv(save_path, index=False)

    print()
    print("=" * 70)
    print("‚úÖ GENERATION COMPLETE!")
    print("=" * 70)
    print(f"Generated: {len(results)}/{len(SYNTHETIC_EXAMPLES)} examples")
    print(f"üìÅ Saved to: {save_path}")
    print()

    print("Score distribution:")
    print(synthetic_df["score"].value_counts().sort_index())
    print()

    return synthetic_df

# ===========================================
# COMBINE WITH AUTHENTIC SAMPLES
# ===========================================

def combine_with_authentic(authentic_path, synthetic_df, output_path):
    """Combine authentic and synthetic samples into final validation set"""

    print("="*70)
    print("COMBINING WITH AUTHENTIC SAMPLES")
    print("="*70)

    # Load authentic samples
    authentic_df = pd.read_csv(authentic_path)

    # Add metadata columns to authentic samples
    authentic_df['synthetic_flag'] = False
    authentic_df['target_error_pattern'] = 'Authentic student work'
    authentic_df['generation_date'] = None
    authentic_df['generation_model'] = None
    authentic_df['word_count'] = authentic_df['full_text'].apply(lambda x: len(str(x).split()))

    # Combine
    combined_df = pd.concat([authentic_df, synthetic_df], ignore_index=True)

    # Shuffle to mix authentic and synthetic
    combined_df = combined_df.sample(frac=1, random_state=42).reset_index(drop=True)

    # Save
    combined_df.to_csv(output_path, index=False)

    print()
    print("‚úÖ COMBINED DATASET CREATED!")
    print(f"   Authentic samples: {len(authentic_df)} ({len(authentic_df)/len(combined_df)*100:.1f}%)")
    print(f"   Synthetic samples: {len(synthetic_df)} ({len(synthetic_df)/len(combined_df)*100:.1f}%)")
    print(f"   Total samples: {len(combined_df)}")
    print()
    print(f"üìÅ Saved to: {output_path}")
    print()
    print("Final score distribution:")
    print(combined_df['score'].value_counts().sort_index())
    print()

    return combined_df

Mounted at /content/drive


In [8]:
# ===========================================
# QUICK SANITY CHECK (RUNS WHEN YOU EXECUTE THIS CELL)
# ===========================================

print("Running quick sanity check with one synthetic example...")
test_config = SYNTHETIC_EXAMPLES[0]
test_sample = generate_synthetic_example(test_config)

print("Result is None?", test_sample is None)
if test_sample:
    print("Sample word_count:", test_sample["word_count"])
    print(test_sample["full_text"][:400], "...")
    # Optional: stop here while debugging
    # import sys
    # sys.exit("Stopping after sanity check.")

Running quick sanity check with one synthetic example...
Result is None? False
Sample word_count: 156
Venus is super bright and hot. Like, hotter than the sun I think. Also, it has a lot of clouds and acid which is really cool and dangerous. The author talks about NASA wanting to send a blimp to Venus, but I don‚Äôt get why they want to go when they already sent a spaceship there once. I mean, it was probably really hard but they should just explore Mars instead. And the article says it has volcanoe ...


In [9]:
# ===========================================
# RUN COMPLETE GENERATION
# ===========================================

if __name__ == "__main__":
    print("\n" + "="*70)
    print("SYNTHETIC EXAMPLES GENERATION - REFINED VERSION")
    print("="*70)
    print()

    # Step 1: Generate synthetic examples
    synthetic_df = generate_all_synthetic_examples(SYNTHETIC_PATH, delay=1.5)

    if synthetic_df.empty:
        print("‚ùå Skipping combination because no synthetic examples were generated.")
    else:
        # Step 2: Combine with authentic samples
        final_df = combine_with_authentic(AUTHENTIC_PATH, synthetic_df, COMBINED_PATH)
        print("="*70)
        print("üéâ ALL DONE!")
        print("="*70)
        print()
        print("Next steps:")
        print("1. ‚úÖ Review synthetic examples for quality")
        print("2. ‚úÖ Regenerate any that need adjustment")
        print("3. ‚úÖ Proceed to Phase 2: Expert Rating")
        print()
        print(f"Your validation set is ready: {COMBINED_PATH}")
        print()


SYNTHETIC EXAMPLES GENERATION - REFINED VERSION

GENERATING 23 SYNTHETIC EXAMPLES
Estimated time: 46 seconds (~1 minutes)

[ 1/23] SYNTH_V_01_S1 (Score 1)... ‚úì (144 words)
[ 2/23] SYNTH_V_02_S1 (Score 1)... ‚úì (99 words)
[ 3/23] SYNTH_V_03_S1 (Score 1)... ‚úì (248 words)
[ 4/23] SYNTH_V_04_S2 (Score 2)... ‚úì (196 words)
[ 5/23] SYNTH_V_05_S2 (Score 2)... ‚úì (222 words)
[ 6/23] SYNTH_V_06_S2 (Score 2)... ‚úì (242 words)
[ 7/23] SYNTH_V_07_S2 (Score 2)... ‚úì (488 words)
[ 8/23] SYNTH_V_08_S2 (Score 2)... ‚úì (216 words)
[ 9/23] SYNTH_V_09_S2 (Score 2)... ‚úì (197 words)
[10/23] SYNTH_V_10_S2 (Score 2)... ‚úì (219 words)
[11/23] SYNTH_V_11_S3 (Score 3)... ‚úì (248 words)
[12/23] SYNTH_V_12_S3 (Score 3)... ‚úì (341 words)
[13/23] SYNTH_V_13_S3 (Score 3)... ‚úì (249 words)
[14/23] SYNTH_V_14_S3 (Score 3)... ‚úì (242 words)
[15/23] SYNTH_V_15_S3 (Score 3)... ‚úì (238 words)
[16/23] SYNTH_V_16_S3 (Score 3)... ‚úì (272 words)
[17/23] SYNTH_V_17_S3 (Score 3)... ‚úì (283 words)
[18/23] SY

In [10]:
combined_df = pd.read_csv(COMBINED_PATH)

print(combined_df.shape)  # should be (60, ...)
print(combined_df["synthetic_flag"].value_counts())
print(combined_df["score"].value_counts().sort_index())

# Look at a few synthetic rows
combined_df[combined_df["synthetic_flag"]].head()[["essay_id", "score", "word_count"]]

(60, 19)
synthetic_flag
False    37
True     23
Name: count, dtype: int64
score
1     8
2    18
3    20
4    10
5     3
6     1
Name: count, dtype: int64


Unnamed: 0,essay_id,score,word_count
3,SYNTH_V_09_S2,2,197
5,SYNTH_V_18_S3,3,257
7,SYNTH_V_12_S3,3,341
9,SYNTH_V_21_S4,4,267
10,SYNTH_V_10_S2,2,219


In [11]:
synthetic = combined_df[combined_df["synthetic_flag"]]

# Check word-count ranges by score
synthetic.groupby("score")["word_count"].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
score,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,3.0,163.666667,76.422074,99.0,121.5,144.0,196.0,248.0
2,7.0,254.285714,104.247645,196.0,206.5,219.0,232.0,488.0
3,8.0,266.25,33.813564,238.0,246.5,253.0,274.75,341.0
4,4.0,256.25,19.362765,238.0,241.0,254.5,269.75,278.0
5,1.0,238.0,,238.0,238.0,238.0,238.0,238.0


In [12]:
# =============================================================================
# Calibration Subset Selection
# =============================================================================

# =============================================================================
# SETUP: Mount Google Drive
# =============================================================================
from google.colab import drive
drive.mount('/content/drive')

# =============================================================================
# CONFIGURATION: Set your file paths
# =============================================================================

# UPDATE THESE PATHS to match your Google Drive structure:
VALIDATION_DATASET_PATH = '/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/validation_set_combined_60.csv'
OUTPUT_DIR = '/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_2/Phase2_Calibration/'

# The script will create these files in OUTPUT_DIR:
# - calibration_subset.csv
# - calibration_practice_summaries.txt

print("="*80)
print("CALIBRATION SUBSET SELECTION")
print("="*80)
print(f"\nReading from: {VALIDATION_DATASET_PATH}")
print(f"Saving to: {OUTPUT_DIR}")

# =============================================================================
# STEP 1: Load and analyze dataset
# =============================================================================
import pandas as pd
import numpy as np
import os

print("\n[STEP 1] Loading Dataset...")
print("-" * 80)

df = pd.read_csv(VALIDATION_DATASET_PATH)

print(f"‚úì Loaded {len(df)} summaries")
print(f"\nColumns: {list(df.columns)}")

print("\n\nDataset Distribution:")
print(f"  Authentic (ASAP 2.0): {(df['synthetic_flag'] == False).sum()}")
print(f"  Synthetic (GPT-4o-Mini): {(df['synthetic_flag'] == True).sum()}")

print("\nScore distribution:")
score_dist = df['score'].value_counts().sort_index()
for score, count in score_dist.items():
    pct = (count / len(df)) * 100
    print(f"  Score {score}: {count:2d} summaries ({pct:4.1f}%)")

# =============================================================================
# STEP 2: Define selection strategy
# =============================================================================
print("\n\n[STEP 2] Selection Strategy")
print("-" * 80)

selection_plan = {
    1: {'target': 2, 'authentic': 1, 'synthetic': 1},
    2: {'target': 3, 'authentic': 2, 'synthetic': 1},
    3: {'target': 3, 'authentic': 2, 'synthetic': 1},
    4: {'target': 2, 'authentic': 1, 'synthetic': 1},
    5: {'target': 1, 'authentic': 1, 'synthetic': 0},
    6: {'target': 1, 'authentic': 1, 'synthetic': 0}
}

print("\nWill select:")
for score, plan in selection_plan.items():
    print(f"  Score {score}: {plan['target']} summaries ({plan['authentic']} auth, {plan['synthetic']} synth)")

total_target = sum(plan['target'] for plan in selection_plan.values())
print(f"\nTotal: {total_target} summaries for calibration")

# =============================================================================
# STEP 3: Execute selection
# =============================================================================
print("\n\n[STEP 3] Selecting Summaries")
print("-" * 80)

calibration_subset = []

# Score 1: 1 authentic + 1 synthetic
score_1_df = df[df['score'] == 1]
cal_1_auth = score_1_df[score_1_df['synthetic_flag'] == False].iloc[0]
cal_1_synth = score_1_df[score_1_df['synthetic_flag'] == True].iloc[0]
calibration_subset.extend([cal_1_auth, cal_1_synth])
print(f"Score 1: Selected {cal_1_auth['essay_id']} (auth), {cal_1_synth['essay_id']} (synth)")

# Score 2: 2 authentic + 1 synthetic
score_2_df = df[df['score'] == 2]
cal_2_auth = score_2_df[score_2_df['synthetic_flag'] == False].iloc[0:2]
cal_2_synth = score_2_df[score_2_df['synthetic_flag'] == True].iloc[0]
calibration_subset.extend([cal_2_auth.iloc[0], cal_2_auth.iloc[1], cal_2_synth])
print(f"Score 2: Selected {cal_2_auth.iloc[0]['essay_id']}, {cal_2_auth.iloc[1]['essay_id']} (auth), {cal_2_synth['essay_id']} (synth)")

# Score 3: 2 authentic + 1 synthetic
score_3_df = df[df['score'] == 3]
cal_3_auth = score_3_df[score_3_df['synthetic_flag'] == False].iloc[0:2]
cal_3_synth = score_3_df[score_3_df['synthetic_flag'] == True].iloc[0]
calibration_subset.extend([cal_3_auth.iloc[0], cal_3_auth.iloc[1], cal_3_synth])
print(f"Score 3: Selected {cal_3_auth.iloc[0]['essay_id']}, {cal_3_auth.iloc[1]['essay_id']} (auth), {cal_3_synth['essay_id']} (synth)")

# Score 4: 1 authentic + 1 synthetic
score_4_df = df[df['score'] == 4]
cal_4_auth = score_4_df[score_4_df['synthetic_flag'] == False].iloc[0]
cal_4_synth = score_4_df[score_4_df['synthetic_flag'] == True].iloc[0]
calibration_subset.extend([cal_4_auth, cal_4_synth])
print(f"Score 4: Selected {cal_4_auth['essay_id']} (auth), {cal_4_synth['essay_id']} (synth)")

# Score 5: 1 authentic
score_5_df = df[df['score'] == 5]
cal_5_auth = score_5_df[score_5_df['synthetic_flag'] == False].iloc[0]
calibration_subset.append(cal_5_auth)
print(f"Score 5: Selected {cal_5_auth['essay_id']} (auth)")

# Score 6: 1 authentic (only one available)
score_6_df = df[df['score'] == 6]
cal_6 = score_6_df.iloc[0]
calibration_subset.append(cal_6)
print(f"Score 6: Selected {cal_6['essay_id']} (auth)")

# =============================================================================
# STEP 4: Create calibration DataFrame
# =============================================================================
print("\n\n[STEP 4] Creating Calibration DataFrame")
print("-" * 80)

cal_df = pd.DataFrame(calibration_subset)
cal_df = cal_df.reset_index(drop=True)

print(f"\n‚úì Created DataFrame with {len(cal_df)} summaries")
print(f"\nScore distribution in calibration set:")
print(cal_df['score'].value_counts().sort_index())
print(f"\nAuthentic: {(cal_df['synthetic_flag'] == False).sum()}")
print(f"Synthetic: {(cal_df['synthetic_flag'] == True).sum()}")

# =============================================================================
# STEP 5: Display details
# =============================================================================
print("\n\n[STEP 5] Calibration Subset Details")
print("-" * 80)
print(f"\n{'#':<3} {'Essay ID':<25} {'Score':<6} {'Source':<8} {'Words':<6} {'Error Pattern':<45}")
print("-" * 80)

for idx, row in cal_df.iterrows():
    source = "Synthetic" if row['synthetic_flag'] else "Authentic"
    error = row['target_error_pattern'] if pd.notna(row['target_error_pattern']) else "N/A"
    print(f"{idx+1:<3} {row['essay_id']:<25} {row['score']:<6} {source:<8} {row['word_count']:<6} {error[:45]:<45}")

# =============================================================================
# STEP 6: Save CSV to Google Drive
# =============================================================================
print("\n\n[STEP 6] Saving Calibration Subset CSV")
print("-" * 80)

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

cal_csv_path = os.path.join(OUTPUT_DIR, 'calibration_subset.csv')
cal_df.to_csv(cal_csv_path, index=False)
print(f"‚úì Saved: {cal_csv_path}")

# =============================================================================
# STEP 7: Create practice summaries document
# =============================================================================
print("\n\n[STEP 7] Creating Practice Summaries Document")
print("-" * 80)

output_lines = []
output_lines.append("=" * 80)
output_lines.append("CALIBRATION PRACTICE SET - 12 SUMMARIES")
output_lines.append("=" * 80)
output_lines.append("\nInstructions:")
output_lines.append("1. Score each summary across all 4 dimensions WITHOUT looking at benchmark scores")
output_lines.append("2. Use your rubric and document your reasoning")
output_lines.append("3. After scoring all 12, compare with the benchmark scores")
output_lines.append("4. Analyze discrepancies to refine your rubric interpretation")
output_lines.append("\n" + "=" * 80 + "\n")

for idx, row in cal_df.iterrows():
    practice_num = idx + 1

    output_lines.append(f"\n{'='*80}")
    output_lines.append(f"PRACTICE_{practice_num:02d}: {row['essay_id']}")
    output_lines.append(f"{'='*80}")
    output_lines.append(f"Source: {'Synthetic' if row['synthetic_flag'] else 'Authentic'}")
    output_lines.append(f"Word Count: {row['word_count']}")
    if pd.notna(row['target_error_pattern']):
        output_lines.append(f"Error Pattern: {row['target_error_pattern']}")

    output_lines.append(f"\n{'-'*80}")
    output_lines.append("SUMMARY TEXT:")
    output_lines.append(f"{'-'*80}\n")
    output_lines.append(row['full_text'])
    output_lines.append("\n" + "="*80 + "\n")

# Save to text file
practice_txt_path = os.path.join(OUTPUT_DIR, 'calibration_practice_summaries.txt')
with open(practice_txt_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(output_lines))

print(f"‚úì Saved: {practice_txt_path}")

# =============================================================================
# STEP 8: Create Practice IDs reference
# =============================================================================
print("\n\n[STEP 8] Practice IDs for Calibration Tracker")
print("-" * 80)
print("\nCopy these IDs into your Calibration_Tracker.xlsx:")
print()
for i, essay_id in enumerate(cal_df['essay_id'], 1):
    print(f"PRACTICE_R1_{i:02d}: {essay_id}")

# Save to a separate reference file
practice_ids_path = os.path.join(OUTPUT_DIR, 'calibration_practice_ids.txt')
with open(practice_ids_path, 'w', encoding='utf-8') as f:
    f.write("Practice IDs for Calibration Tracker\n")
    f.write("="*80 + "\n\n")
    f.write("Use these in your Calibration_Tracker.xlsx:\n\n")
    for i, essay_id in enumerate(cal_df['essay_id'], 1):
        f.write(f"PRACTICE_R1_{i:02d}: {essay_id}\n")

print(f"\n‚úì Saved: {practice_ids_path}")

# =============================================================================
# COMPLETION
# =============================================================================
print("\n\n" + "="*80)
print("CALIBRATION SUBSET SELECTION COMPLETE")
print("="*80)

print(f"\n‚úì Files saved to: {OUTPUT_DIR}")
print(f"  ‚Ä¢ calibration_subset.csv (metadata)")
print(f"  ‚Ä¢ calibration_practice_summaries.txt (full texts)")
print(f"  ‚Ä¢ calibration_practice_ids.txt (IDs for tracker)")

print(f"\n‚úì Selected {len(cal_df)} summaries:")
print(f"  ‚Ä¢ All 6 score levels covered")
print(f"  ‚Ä¢ {(cal_df['synthetic_flag'] == False).sum()} authentic, {(cal_df['synthetic_flag'] == True).sum()} synthetic")
print(f"  ‚Ä¢ Word count range: {cal_df['word_count'].min()}-{cal_df['word_count'].max()}")

print("\nüìã Next steps:")
print("  1. Download the three files from your Google Drive")
print("  2. Review the practice summaries document")
print("  3. Begin Phase 1 of calibration (Rubric Study)")
print("  4. Use the practice IDs to update your Calibration_Tracker.xlsx")

print("\n" + "="*80)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
CALIBRATION SUBSET SELECTION

Reading from: /content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/validation_set_combined_60.csv
Saving to: /content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_2/Phase2_Calibration/

[STEP 1] Loading Dataset...
--------------------------------------------------------------------------------
‚úì Loaded 60 summaries

Columns: ['essay_id', 'score', 'full_text', 'assignment', 'prompt_name', 'economically_disadvantaged', 'student_disability_status', 'ell_status', 'race_ethnicity', 'gender', 'source_text_1', 'source_text_2', 'source_text_3', 'source_text_4', 'synthetic_flag', 'target_error_pattern', 'generation_date', 'generation_model', 'word_count']


Dataset Distribution:
  Authentic (ASAP 2.0): 

In [13]:
"""
Generate Calibration Benchmark Scores Answer Key
Reads calibration_subset.csv and creates formatted answer key file
"""

import pandas as pd
from pathlib import Path

# Configuration
CALIBRATION_SUBSET_PATH = "/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_2/Phase2_Calibration/calibration_subset.csv"  # Adjust path as needed
OUTPUT_PATH = "/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_2/Phase2_Calibration/CALIBRATION_BENCHMARK_SCORES.txt"

def create_benchmark_scores_file(input_csv, output_txt):
    """
    Create formatted benchmark scores file from calibration subset CSV.

    Parameters:
    -----------
    input_csv : str
        Path to calibration_subset.csv
    output_txt : str
        Path for output benchmark scores file
    """

    # Read calibration subset
    df = pd.read_csv(input_csv)

    # Define practice IDs and their roles
    exemplars = {
        'PRACTICE_01': 'Not used for blind practice - reference only',
        'PRACTICE_02': 'EXEMPLAR - analyzed in detail in EXEMPLAR_ANALYSIS_GUIDE.md',
        'PRACTICE_07': 'EXEMPLAR - analyzed in detail in EXEMPLAR_ANALYSIS_GUIDE.md',
        'PRACTICE_11': 'EXEMPLAR - analyzed in detail in EXEMPLAR_ANALYSIS_GUIDE.md'
    }

    round_1 = ['PRACTICE_03', 'PRACTICE_04', 'PRACTICE_05', 'PRACTICE_06']
    round_2 = ['PRACTICE_08', 'PRACTICE_09', 'PRACTICE_10', 'PRACTICE_12']

    # Create practice_id column if it doesn't exist
    if 'practice_id' not in df.columns:
        # Create practice IDs based on row order
        df['practice_id'] = [f'PRACTICE_{str(i+1).zfill(2)}' for i in range(len(df))]

    # Build the output content
    content = []

    # Header
    content.append("=" * 80)
    content.append("CALIBRATION PRACTICE SUMMARIES - BENCHMARK SCORES (ANSWER KEY)")
    content.append("=" * 80)
    content.append("")
    content.append("DO NOT LOOK AT THIS FILE UNTIL YOU HAVE SCORED ALL 9 PRACTICE SUMMARIES BLIND!")
    content.append("")
    content.append("Instructions for Use:")
    content.append("1. Score all 9 practice summaries (PRACTICE_03 through PRACTICE_06, and")
    content.append("   PRACTICE_08 through PRACTICE_12) WITHOUT looking at this file")
    content.append("2. Record your scores in Calibration_Tracker.xlsx")
    content.append("3. AFTER completing all 9, open this file to compare your scores")
    content.append("4. Calculate agreement metrics and analyze discrepancies")
    content.append("")
    content.append("=" * 80)
    content.append("")

    # Exemplar summaries section
    content.append("EXEMPLAR SUMMARIES (Study These First - Scores Already Known)")
    content.append("=" * 80)
    content.append("")

    for practice_id, note in exemplars.items():
        row = df[df['practice_id'] == practice_id].iloc[0]
        content.append(f"{practice_id}: {row['essay_id']}")
        content.append(f"Benchmark Score: {row['score']}")
        content.append(f"Source: {'Authentic' if row['synthetic_flag'] == 0 else 'Synthetic'}")
        content.append(f"Note: {note}")
        content.append("")

    content.append("=" * 80)
    content.append("")

    # Practice Round 1
    content.append("PRACTICE ROUND 1 - BLIND SCORING (Complete First)")
    content.append("=" * 80)
    content.append("")

    for practice_id in round_1:
        row = df[df['practice_id'] == practice_id].iloc[0]
        content.append(f"{practice_id}: {row['essay_id']}")
        content.append(f"Benchmark Score: {row['score']}")
        content.append(f"Source: {'Authentic' if row['synthetic_flag'] == 0 else 'Synthetic'}")
        content.append(f"Word Count: {row['word_count']}")
        if row['synthetic_flag'] == 1 and pd.notna(row['target_error_pattern']):
            content.append(f"Error Pattern: {row['target_error_pattern']}")
        content.append("")

    content.append("=" * 80)
    content.append("")

    # Practice Round 2
    content.append("PRACTICE ROUND 2 - BLIND SCORING (Complete Second)")
    content.append("=" * 80)
    content.append("")

    for practice_id in round_2:
        row = df[df['practice_id'] == practice_id].iloc[0]
        content.append(f"{practice_id}: {row['essay_id']}")
        content.append(f"Benchmark Score: {row['score']}")
        content.append(f"Source: {'Authentic' if row['synthetic_flag'] == 0 else 'Synthetic'}")
        content.append(f"Word Count: {row['word_count']}")
        if row['synthetic_flag'] == 1 and pd.notna(row['target_error_pattern']):
            content.append(f"Error Pattern: {row['target_error_pattern']}")
        content.append("")

    content.append("=" * 80)
    content.append("")

    # Score distribution
    content.append("SCORE DISTRIBUTION IN PRACTICE SET")
    content.append("=" * 80)
    content.append("")

    score_counts = df['score'].value_counts().sort_index()
    for score, count in score_counts.items():
        practice_ids = df[df['score'] == score]['practice_id'].tolist()
        ids_str = ', '.join(practice_ids)

        # Identify exemplars
        exemplar_ids = [pid for pid in practice_ids if pid in exemplars]
        if exemplar_ids:
            ids_str += f" ({', '.join([f'{pid} - exemplar' for pid in exemplar_ids])})"

        content.append(f"Score {score}: {count} {'summary' if count == 1 else 'summaries'} ({ids_str})")

    content.append("")
    content.append("=" * 80)
    content.append("")

    # Agreement metrics section
    content.append("AGREEMENT METRICS TO CALCULATE")
    content.append("=" * 80)
    content.append("")
    content.append("After comparing your scores to these benchmarks:")
    content.append("")
    content.append("1. EXACT AGREEMENT: How many summaries did you score exactly the same?")
    content.append("   Target: ‚â• 60% (at least 6 out of 9)")
    content.append("")
    content.append("2. ADJACENT AGREEMENT: How many were within ¬±1 point?")
    content.append("   Target: > 85% (at least 8 out of 9)")
    content.append("")
    content.append("3. MEAN ABSOLUTE ERROR (MAE): Average distance from benchmark")
    content.append("   Target: < 0.5 points per dimension")
    content.append("   ")
    content.append("   Formula: Sum of |your score - benchmark| √∑ number of summaries")
    content.append("")
    content.append("4. PATTERNS IN DISCREPANCIES:")
    content.append("   - Do you tend to score higher or lower than benchmarks?")
    content.append("   - Are discrepancies concentrated in specific dimensions?")
    content.append("   - Are errors larger for certain score levels?")
    content.append("")
    content.append("=" * 80)
    content.append("")

    # Next steps
    content.append("NEXT STEPS AFTER COMPARISON")
    content.append("=" * 80)
    content.append("")
    content.append("1. Calculate your agreement metrics")
    content.append("2. Identify patterns in discrepancies")
    content.append("3. Create decision rules for borderline cases")
    content.append("4. Review rubric areas where you struggled")
    content.append("5. Proceed to Practice Round 2 with refined approach")
    content.append("6. After Round 2, assess readiness for full validation scoring")
    content.append("")
    content.append("=" * 80)

    # Write to file
    with open(output_txt, 'w', encoding='utf-8') as f:
        f.write('\n'.join(content))

    print(f"‚úì Benchmark scores file created: {output_txt}")
    print(f"  Total summaries: {len(df)}")
    print(f"  Exemplars: {len(exemplars)}")
    print(f"  Practice Round 1: {len(round_1)}")
    print(f"  Practice Round 2: {len(round_2)}")
    print(f"\nScore distribution:")
    for score, count in score_counts.items():
        print(f"  Score {score}: {count}")


if __name__ == "__main__":
    # Create the benchmark scores file
    create_benchmark_scores_file(CALIBRATION_SUBSET_PATH, OUTPUT_PATH)

    print("\n‚úì Generation complete!")
    print(f"\nReminder: DO NOT open {OUTPUT_PATH} until after blind scoring!")

‚úì Benchmark scores file created: /content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_2/Phase2_Calibration/CALIBRATION_BENCHMARK_SCORES.txt
  Total summaries: 12
  Exemplars: 4
  Practice Round 1: 4
  Practice Round 2: 4

Score distribution:
  Score 1: 2
  Score 2: 3
  Score 3: 3
  Score 4: 2
  Score 5: 1
  Score 6: 1

‚úì Generation complete!

Reminder: DO NOT open /content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_2/Phase2_Calibration/CALIBRATION_BENCHMARK_SCORES.txt until after blind scoring!


In [14]:
"""
25-Summary Validation Subset Selector
Stratified sampling from 60-summary validation set for accelerated timeline
"""

import pandas as pd
import numpy as np
import os

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# ============================================================================
# CONFIGURATION - Adjust these paths for your Google Drive setup
# ============================================================================

# Path to your validation_set_combined_60.csv in Google Drive
VALIDATION_DATASET_PATH = "/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/validation_set_combined_60.csv"

# Output directory in Google Drive
OUTPUT_DIR = "/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_2/data"

# Random seed for reproducibility
RANDOM_SEED = 42

# Target distribution for 25 summaries (proportional to original 60)
TARGET_DISTRIBUTION = {
    1: 3,   # From 8 available
    2: 8,   # From 18 available
    3: 8,   # From 20 available
    4: 4,   # From 10 available
    5: 1,   # From 3 available
    6: 1    # From 1 available
}
# Total = 25 summaries

# ============================================================================

def select_validation_subset(input_csv, target_dist, random_seed=42):
    """
    Select stratified 25-summary subset from 60-summary validation set.

    Parameters:
    -----------
    input_csv : str
        Path to validation_set_combined_60.csv
    target_dist : dict
        Target number of summaries per score level
    random_seed : int
        Random seed for reproducibility

    Returns:
    --------
    pd.DataFrame
        Selected subset of 25 summaries
    """

    print("=" * 80)
    print("FAST-TRACK VALIDATION SUBSET SELECTION")
    print("=" * 80)
    print()

    # Read full validation dataset
    print(f"Reading validation dataset...")
    print(f"  Path: {input_csv}")
    df = pd.read_csv(input_csv)
    print(f"‚úì Loaded {len(df)} summaries")
    print()

    # Set random seed
    np.random.seed(random_seed)

    # Display current distribution
    print("Current score distribution (60 summaries):")
    score_dist = df['score'].value_counts().sort_index()
    for score, count in score_dist.items():
        auth_count = len(df[(df['score'] == score) & (df['synthetic_flag'] == 0)])
        synth_count = len(df[(df['score'] == score) & (df['synthetic_flag'] == 1)])
        print(f"  Score {score}: {count} total ({auth_count} authentic, {synth_count} synthetic)")
    print()

    # Stratified sampling by score
    print("Target distribution (25 summaries):")
    for score, target in target_dist.items():
        print(f"  Score {score}: {target} summaries")
    print()

    print("Selecting summaries...")
    selected_dfs = []

    for score, target_count in target_dist.items():
        # Get all summaries with this score
        score_df = df[df['score'] == score].copy()

        if len(score_df) < target_count:
            print(f"  ‚ö† Warning: Only {len(score_df)} summaries available for score {score} (need {target_count})")
            selected = score_df
        else:
            # Randomly sample target_count summaries
            selected = score_df.sample(n=target_count, random_state=random_seed)

        selected_dfs.append(selected)

        auth_selected = len(selected[selected['synthetic_flag'] == 0])
        synth_selected = len(selected[selected['synthetic_flag'] == 1])
        print(f"  ‚úì Score {score}: Selected {len(selected)} ({auth_selected} authentic, {synth_selected} synthetic)")

    # Combine all selected summaries
    subset_df = pd.concat(selected_dfs, ignore_index=True)

    # Shuffle the final subset
    subset_df = subset_df.sample(frac=1, random_state=random_seed).reset_index(drop=True)

    # Add validation_id for tracking
    subset_df['validation_id'] = [f'VAL_{str(i+1).zfill(2)}' for i in range(len(subset_df))]

    print()
    print("=" * 80)
    print("SELECTION COMPLETE")
    print("=" * 80)
    print(f"Total selected: {len(subset_df)} summaries")
    print()

    # Final distribution summary
    print("Final subset distribution:")
    for score in sorted(subset_df['score'].unique()):
        count = len(subset_df[subset_df['score'] == score])
        auth_count = len(subset_df[(subset_df['score'] == score) & (subset_df['synthetic_flag'] == 0)])
        synth_count = len(subset_df[(subset_df['score'] == score) & (subset_df['synthetic_flag'] == 1)])
        print(f"  Score {score}: {count} ({auth_count} authentic, {synth_count} synthetic)")

    total_auth = len(subset_df[subset_df['synthetic_flag'] == 0])
    total_synth = len(subset_df[subset_df['synthetic_flag'] == 1])
    print()
    print(f"Overall: {total_auth} authentic ({total_auth/len(subset_df)*100:.1f}%), "
          f"{total_synth} synthetic ({total_synth/len(subset_df)*100:.1f}%)")

    return subset_df


def create_scoring_text_file(subset_df, output_txt):
    """
    Create formatted text file for manual scoring.

    Parameters:
    -----------
    subset_df : pd.DataFrame
        Selected validation subset
    output_txt : str
        Path for output text file
    """

    content = []

    # Header
    content.append("=" * 80)
    content.append("FAST-TRACK VALIDATION SET - 25 SUMMARIES FOR SCORING")
    content.append("=" * 80)
    content.append("")
    content.append("Instructions:")
    content.append("1. Score each summary across all 4 dimensions using your calibrated approach")
    content.append("2. Record scores in your scoring template spreadsheet")
    content.append("3. Document brief rationale for borderline cases")
    content.append("4. These scores will be your ground truth for LLM validation")
    content.append("")
    content.append("Timeline: Complete all 25 by end of day Monday, December 2")
    content.append("Estimated time: 6-8 hours (15-20 min per summary)")
    content.append("")
    content.append("=" * 80)
    content.append("")

    # Each summary
    for idx, row in subset_df.iterrows():
        content.append("")
        content.append("=" * 80)
        content.append(f"{row['validation_id']}: {row['essay_id']}")
        content.append("=" * 80)
        content.append(f"Original Score: {row['score']}")
        content.append(f"Source: {'Authentic' if row['synthetic_flag'] == 0 else 'Synthetic'}")
        content.append(f"Word Count: {row['word_count']}")
        if row['synthetic_flag'] == 1 and pd.notna(row.get('target_error_pattern')):
            content.append(f"Error Pattern: {row['target_error_pattern']}")
        content.append("")
        content.append("-" * 80)
        content.append("SUMMARY TEXT:")
        content.append("-" * 80)
        content.append("")
        content.append(row['full_text'])
        content.append("")
        content.append("-" * 80)
        content.append("YOUR SCORES (Complete after reading):")
        content.append("-" * 80)
        content.append("Completeness (1-5): _____")
        content.append("Accuracy (1-5): _____")
        content.append("Coherence (1-5): _____")
        content.append("Conciseness (1-5): _____")
        content.append("")
        content.append("Brief rationale/notes:")
        content.append("")
        content.append("")
        content.append("=" * 80)
        content.append("")

    # Footer
    content.append("")
    content.append("=" * 80)
    content.append("END OF VALIDATION SET")
    content.append("=" * 80)
    content.append("")
    content.append("Next steps after scoring:")
    content.append("1. Transfer scores to spreadsheet")
    content.append("2. Begin LLM prompt design (Tuesday)")
    content.append("3. Test initial prompt on 5 summaries (Tuesday)")
    content.append("4. Prepare progress update presentation (Wednesday)")

    # Write to file
    with open(output_txt, 'w', encoding='utf-8') as f:
        f.write('\n'.join(content))

    print(f"‚úì Scoring text file created")


def main():
    """Main execution function."""

    # Create output directory if it doesn't exist
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    print(f"Output directory: {OUTPUT_DIR}")
    print()

    # Define output paths
    output_csv = os.path.join(OUTPUT_DIR, "validation_subset_25.csv")
    output_txt = os.path.join(OUTPUT_DIR, "validation_subset_25_for_scoring.txt")

    # Select subset
    subset_df = select_validation_subset(
        VALIDATION_DATASET_PATH,
        TARGET_DISTRIBUTION,
        RANDOM_SEED
    )

    # Save CSV
    print(f"\nSaving subset CSV...")
    subset_df.to_csv(output_csv, index=False)
    print(f"‚úì CSV saved: {output_csv}")
    print(f"  {len(subset_df)} summaries")

    # Create scoring text file
    print(f"\nCreating scoring text file...")
    create_scoring_text_file(subset_df, output_txt)
    print(f"‚úì Text file saved: {output_txt}")

    # Summary statistics
    print()
    print("=" * 80)
    print("FILES CREATED IN GOOGLE DRIVE")
    print("=" * 80)
    print(f"1. validation_subset_25.csv - Subset data for analysis")
    print(f"2. validation_subset_25_for_scoring.txt - Formatted for manual scoring")
    print()
    print(f"Location: {OUTPUT_DIR}")
    print()
    print("=" * 80)
    print("NEXT STEPS - FAST-TRACK SCHEDULE")
    print("=" * 80)
    print()
    print("üìÖ MONDAY DEC 1 (Tomorrow):")
    print("   ‚Ä¢ Score all 25 summaries (6-8 hours)")
    print("   ‚Ä¢ Use your calibrated approach from practice rounds")
    print("   ‚Ä¢ Document scores in spreadsheet as you go")
    print()
    print("üìÖ TUESDAY DEC 2:")
    print("   ‚Ä¢ Design base LLM evaluation prompt")
    print("   ‚Ä¢ Set up Llama 3.1 8B in Colab")
    print("   ‚Ä¢ Test on 5 summaries")
    print("   ‚Ä¢ Calculate initial agreement metrics")
    print()
    print("üìÖ WEDNESDAY DEC 3:")
    print("   ‚Ä¢ Prepare progress update presentation")
    print("   ‚Ä¢ Iterate on prompt based on results")
    print("   ‚Ä¢ DELIVERABLE: Progress update")
    print()
    print("üéØ You're on track for the December 10 demo!")
    print("=" * 80)


if __name__ == "__main__":
    main()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Output directory: /content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_2/data

FAST-TRACK VALIDATION SUBSET SELECTION

Reading validation dataset...
  Path: /content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/Phase_1/data/dataset/validation_set_combined_60.csv
‚úì Loaded 60 summaries

Current score distribution (60 summaries):
  Score 1: 8 total (5 authentic, 3 synthetic)
  Score 2: 18 total (11 authentic, 7 synthetic)
  Score 3: 20 total (12 authentic, 8 synthetic)
  Score 4: 10 total (6 authentic, 4 synthetic)
  Score 5: 3 total (2 authentic, 1 synthetic)
  Score 6: 1 total (1 authentic, 0 synthetic)

Target distribution (25 summaries):
  Score 1: 3 summaries
  Score 2: 8 summaries
  Score 3: 8 summaries
  Score 4: 4 summaries
  Score 5: 1 su

In [15]:
print("Installing required packages...")
!pip install -q transformers accelerate bitsandbytes huggingface_hub

Installing required packages...


In [16]:
from google.colab import userdata
from huggingface_hub import login
import os

try:
    HF_TOKEN = userdata.get('HF_TOKEN')
    login(token=HF_TOKEN)
    print("‚úì Authenticated with Hugging Face (via Secrets)")
except Exception:
    print("Secret not found. Please enter your Hugging Face token manually:")
    login()
    print("‚úì Authenticated with Hugging Face (manual entry)")

‚úì Authenticated with Hugging Face (via Secrets)


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

print("Loading Llama 3.1 8B-Instruct...")
print("(This takes 2-5 minutes on first run)")

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Configure 4-bit quantization to fit in Colab GPU memory
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quantization_config,
    device_map="auto",
    dtype=torch.bfloat16,
)

print("‚úì Model loaded successfully!")
print(f"  Model device: {model.device}")

In [18]:
SOURCE_TEXT = """THE CHALLENGE OF EXPLORING VENUS

Venus, sometimes called the "Evening Star," is one of the brightest points of light in the night sky, making it simple for even an amateur stargazer to spot. However, this nickname is misleading since Venus is actually a planet. While Venus is simple to see from the distant but safe vantage point of Earth, it has proved a very challenging place to examine more closely.

Often referred to as Earth's "twin," Venus is the closest planet to Earth in terms of density and size, and occasionally the closest in distance too. Earth and Venus, along with Mars, our other planetary neighbor, orbit the sun at different speeds. These differences in speed mean that sometimes we are closer to Mars and other times to Venus. Because Venus is sometimes right around the corner‚Äîin space terms‚Äîhumans have sent numerous spacecraft to land on this cloud-draped world. Each previous mission was unmanned, and for good reason, since no spacecraft survived the landing for more than a few hours. Maybe this issue explains why not a single spaceship has touched down on Venus in more than three decades. Numerous factors contribute to Venus's reputation as a challenging planet for humans to study, despite its proximity to us.

A thick atmosphere of almost 97 percent carbon dioxide blankets Venus. Even more challenging are the clouds of highly corrosive sulfuric acid in Venus's atmosphere. On the planet's surface, temperatures average over 800 degrees Fahrenheit, and the atmospheric pressure is 90 times greater than what we experience on our own planet. These conditions are far more extreme than anything humans encounter on Earth; such an environment would crush even a submarine accustomed to diving to the deepest parts of our oceans and would liquefy many metals. Also notable, Venus has the hottest surface temperature of any planet in our solar system, even though Mercury is closer to our sun. Beyond high pressure and heat, Venusian geology and weather present additional impediments like erupting volcanoes, powerful earthquakes, and frequent lightning strikes to probes seeking to land on its surface.

If our sister is so inhospitable, why are scientists even discussing further visits to its surface? Astronomers are fascinated by Venus because it may well once have been the most Earth-like planet in our solar system. Long ago, Venus was probably covered largely with oceans and could have supported various forms of life, just like Earth. Today, Venus still has some features that are analogous to those on Earth. The planet has a surface of rocky sediment and includes familiar features such as valleys, mountains, and craters. Furthermore, recall that Venus can sometimes be our nearest option for a planetary visit, a crucial consideration given the long time frames of space travel. The value of returning to Venus seems indisputable, but what are the options for making such a mission both safe and scientifically productive?

The National Aeronautics and Space Administration (NASA) has one particularly compelling idea for sending humans to study Venus. NASA's possible solution to the hostile conditions on the surface of Venus would allow scientists to float above the fray. Imagine a blimp-like vehicle hovering 30 or so miles above the roiling Venusian landscape. Just as our jet airplanes travel at a higher altitude to fly over many storms, a vehicle hovering over Venus would avoid the unfriendly ground conditions by staying up and out of the way. At thirty-plus miles above the surface, temperatures would still be toasty at around 170 degrees Fahrenheit, but the air pressure would be close to that of sea level on Earth. Solar power would be plentiful, and radiation would not exceed Earth‚Äôs levels. Not easy conditions, but survivable for humans.

However, peering at Venus from a ship orbiting or hovering safely far above the planet can provide only limited insight into ground conditions, rendering standard forms of photography and videography ineffective. More importantly, researchers cannot take samples of rock, gas, or anything else from a distance. Therefore, scientists seeking to conduct a thorough mission to understand Venus would need to get up close and personal despite the risks. Or maybe we should think of them as challenges. Many researchers are working on innovations that would allow our machines to last long enough to contribute meaningfully to our knowledge of Venus.

NASA is working on other approaches to studying Venus. For example, some simplified electronics made of silicon carbide have been tested in a chamber simulating the chaos of Venus's surface and have lasted for three weeks in such conditions. Another project is looking back at an old technology called mechanical computers. These devices were first envisioned in the 1800s and played an important role in the 1940s during World War II. The thought of computers existing in those days may sound shocking, but these devices made calculations by using gears and levers and did not require electronics at all. Modern computers are enormously powerful, flexible, and quick, but tend to be more delicate when it comes to extreme physical conditions. Just imagine exposing a cell phone or tablet to acid or heat capable of melting tin. By comparison, systems that use mechanical parts can be made more resistant to pressure, heat, and other forces.

Striving to meet the challenge presented by Venus has value, not only because of the insight to be gained on the planet itself, but also because human curiosity will likely lead us into many equally intimidating endeavors. Our travels on Earth and beyond should not be limited by dangers and doubts but should be expanded to meet the very edges of imagination and innovation."""

print(f"Source text loaded: {len(SOURCE_TEXT)} characters")

Source text loaded: 5778 characters


In [19]:
RUBRIC = """
## SUMMARY EVALUATION RUBRIC (Grades 6-8)

### Task Context
Students read "The Challenge of Exploring Venus" and wrote a response to this prompt:
"Write an essay evaluating how well the author supports the claim that studying Venus is a worthy pursuit despite the dangers. Use evidence from the text to support your evaluation."

This is a HYBRID task requiring students to:
1. Identify the author's claim and supporting evidence
2. Evaluate how effectively the author builds the argument
3. Support their evaluation with specific textual evidence

### Scoring Dimensions

**COMPLETENESS (1-5)**: Coverage of the author's main supporting points
- 5: Identifies ALL major supporting points (extreme conditions, scientific value, NASA solutions, alternative technologies) with specific evidence
- 4: Identifies MOST major points with evidence; one minor omission
- 3: Identifies SEVERAL points but misses at least one crucial aspect
- 2: Identifies only a FEW points; missing multiple important concepts
- 1: Fails to identify main points or provides only vague statements

**ACCURACY (1-5)**: Factual correctness of claims about the text
- 5: All information factually correct; precise language; no distortions
- 4: Generally accurate with only minor imprecisions that don't alter meaning (awkward paraphrasing with correct meaning = 4, not 3)
- 3: Contains accurate points but also noticeable errors or oversimplifications
- 2: Multiple significant factual errors or misrepresentations (note: quoting or paraphrasing the source text is not a factual error)
- 1: Information contradicts source or includes fabricated details

**COHERENCE (1-5)**: Logical organization and flow
- 5: Ideas flow logically; effective transitions; each sentence builds on previous
- 4: Clearly organized; transitions mostly effective; minor rough spots
- 3: Basic organization but inconsistent flow; transitions missing in places
- 2: Organization unclear; ideas jump between topics; few transitions
- 1: No discernible organization; disconnected fragments

**CONCISENESS (1-5)**: Efficiency of expression
- 5: Every sentence essential; no repetition; focused on main ideas
- 4: Mostly efficient; only minor wordiness or brief repetition
- 3: Noticeable wordiness; some repetition; includes irrelevant information
- 2: Significant wordiness; frequent repetition; could be cut substantially
- 1: Excessively wordy; ideas repeated multiple times; essential content buried
"""

print("Rubric loaded successfully")

Rubric loaded successfully


In [20]:
def create_evaluation_prompt(student_summary):
    """Create the full CoT evaluation prompt for a student summary."""

    prompt = f"""You are an experienced middle school English Language Arts teacher evaluating a student's response to a reading comprehension task. The student read an article about Venus exploration and wrote an evaluative essay.

## SOURCE TEXT
{SOURCE_TEXT}

## STUDENT TASK
The student was asked: "Write an essay evaluating how well the author supports the claim that studying Venus is a worthy pursuit despite the dangers. Use evidence from the text to support your evaluation."

## STUDENT RESPONSE
{student_summary}

## EVALUATION RUBRIC
{RUBRIC}

## YOUR TASK
Evaluate this student response using Chain-of-Thought reasoning. For each dimension:

1. First, identify specific evidence from the student's response
2. Then, compare against the rubric criteria
3. Finally, assign a score (1-5) with brief justification

**Think step-by-step before providing scores.**

### Step 1: Analyze COMPLETENESS
What main supporting points from the article does the student identify or discuss?
- Extreme conditions on Venus (heat, pressure, sulfuric acid, etc.)?
- Scientific value (Earth-like past, similar features, proximity)?
- NASA's blimp/hovering solution?
- Alternative technologies (silicon carbide, mechanical computers)?
Identify what's present and what's missing.

### Step 2: Analyze ACCURACY
Check each factual claim the student makes against the source text:
- Are temperatures, pressures, and other numbers correct?
- Are the solutions described accurately?
- Is the author's argument represented faithfully?

### Step 3: Analyze COHERENCE
Examine the organization and flow:
- Is there a clear introduction and conclusion?
- Do paragraphs/sentences connect logically?
- Are transitions used effectively?

### Step 4: Analyze CONCISENESS
Check for efficiency:
- Is there unnecessary repetition?
- Are there irrelevant tangents?
- Could the response be shortened without losing meaning?

## PROVIDE YOUR EVALUATION

After your analysis, provide scores in this EXACT format:

COMPLETENESS: [score 1-5]
Justification: [1-2 sentences]

ACCURACY: [score 1-5]
Justification: [1-2 sentences]

COHERENCE: [score 1-5]
Justification: [1-2 sentences]

CONCISENESS: [score 1-5]
Justification: [1-2 sentences]

OVERALL FEEDBACK: [2-3 sentences of constructive feedback for the student]
"""
    return prompt

In [21]:
def create_simple_prompt(student_summary):
    """A simpler, more direct prompt without extensive CoT scaffolding."""

    prompt = f"""You are a middle school English teacher grading a student essay.

ARTICLE SUMMARY: The source article discusses why Venus is difficult to explore (extreme heat, pressure, sulfuric acid) but argues it's worth studying because Venus may have once been Earth-like, has similar features today, and is sometimes our closest neighbor. NASA proposes hovering vehicles at 30 miles altitude where conditions are survivable. Scientists are also developing heat-resistant electronics and mechanical computers.

STUDENT TASK: Evaluate how well the author supports the claim that studying Venus is worthwhile despite the dangers.

STUDENT RESPONSE:
{student_summary}

SCORING RUBRIC (1-5 scale):
- COMPLETENESS: Does it cover the main supporting points?
- ACCURACY: Are the facts correct?
- COHERENCE: Is it well-organized with good flow?
- CONCISENESS: Is it focused without unnecessary repetition?

Provide scores in this format:
COMPLETENESS: [1-5]
ACCURACY: [1-5]
COHERENCE: [1-5]
CONCISENESS: [1-5]
BRIEF FEEDBACK: [1-2 sentences]
"""
    return prompt

In [30]:
def generate_evaluation(prompt, max_new_tokens=800, temperature=0.1):
    """Generate model response for an evaluation prompt."""

    messages = [
        {"role": "system", "content": "You are an expert middle school English teacher who evaluates student writing using rubrics."},
        {"role": "user", "content": prompt}
    ]

    # Format for Llama 3.1 Instruct
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    # Build generation kwargs conditionally
    generate_kwargs = {
        "max_new_tokens": max_new_tokens,
        "pad_token_id": tokenizer.eos_token_id,
    }

    if temperature > 0:
        generate_kwargs["do_sample"] = True
        generate_kwargs["temperature"] = temperature
        generate_kwargs["top_p"] = 0.9
    else:
        generate_kwargs["do_sample"] = False
        # --- FIX: Explicitly unset these to silence the warning ---
        generate_kwargs["temperature"] = None
        generate_kwargs["top_p"] = None

    with torch.no_grad():
        outputs = model.generate(**inputs, **generate_kwargs)

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract just the assistant's response
    if "<|assistant|>" in response:
        response = response.split("<|assistant|>")[-1].strip()
    elif "assistant" in response.lower():
        parts = response.split("COMPLETENESS:")
        if len(parts) > 1:
            response = "COMPLETENESS:" + parts[-1]

    return response

In [23]:
import re

def parse_scores(response_text):
    """Extract numerical scores from the model's response."""
    scores = {}

    # Pattern: DIMENSION: [score] or DIMENSION: score
    patterns = {
        'completeness': r'COMPLETENESS:\s*\[?(\d)\]?',
        'accuracy': r'ACCURACY:\s*\[?(\d)\]?',
        'coherence': r'COHERENCE:\s*\[?(\d)\]?',
        'conciseness': r'CONCISENESS:\s*\[?(\d)\]?'
    }

    for dim, pattern in patterns.items():
        match = re.search(pattern, response_text, re.IGNORECASE)
        if match:
            scores[dim] = int(match.group(1))
        else:
            scores[dim] = None

    return scores

In [24]:
TEST_SUMMARIES = {
    "VAL_02": """Do you guys think that venus is dangers? Well for our part venus need to get up close and personal despite the risk or maybe they should think of them as challenges. Astronmers are fascinated by venus because it may once have been the most earth-like planet in our solar system. Even more challenging are the clouds of highly corrosive sulfuric acid in venus's atmosphere. Venus is simple to see from the distant but safe vantage point of earth, it had proved a very challenging place to examine more closely. Venus would need to get uo close and personal despite the risk and maybe should think of them as challenges.

Astronomers are fascinated by venus because it may well once have been the most earth-like planet in the solar system because people from long time had to covered the oceans to support carious forms of life to them just how earth is from us today. Also because the value of the returning of the venus seem to be a little difficult but there was other option to make a mission to be safe and productive to both of them for the astronomers to be by venus and know the planets in the solar system.

Even more challenging are the clouds of highly corrosvie sulfuric acid in venus's atmosphere because they think that the challenging for venus might not be working for them but now they do work because the conditions are now more extreme then any human encounter out there and all of the environment can crush of a submarine. But then venus has the hottest surface temperature to any of the planet in the solar system that their is even when mercury is close to the sun venus can still be hot for the system being beside another planet.

Venus is simple to see from the distant but safe vantage point of earth, it has proved a very challenging place to examine more closely. Venus can be the closest planet to earth even in terms of size and density. Also each of the previous mission can have an unmanned and for a reason no sacecraft has been survived for the time of landing for more than hours or even minutes. Venus had more than three decades, but venus reputation for a challenge planet is for humans to work on and study for and to despite the proximity to it.

This is what I think the author suggests to the study of venus because venus would need to get up close and personal despite the risk and maybe should think of them for a challenge, because astronomers are fascinated by venus because it may once have been the most earth-like planet in the solar system, even more challenging are the clouds of highly corrosive sulfuric acid in venus's atmosphere, and venus is simple to see from the distant but can be safe vantage point in earth and has proved a very challenging place to examine more closely to it.""",

    "VAL_04": """The author excellently supports the idea that even though it is dangerous, Venus is worth exploring. You can tell the author supports the idea of further exploration of Venus because of their use of details. The author explains Venus, why it is so dangerous, and why we should continue exploring it to support the idea that Venus is a challenge that we should not give up on.

One of the reasons the authors point comes across so well is how in depth they explain Venus so that the reader can be more knowlegable about the topic before the author begins to explain why we should continue to explore it. The author gives as much detail as a book about planets so that the reader knows that the author is well versed in the topic and is not having an opinion without factual evidence to support it. In paragraph 2, it says "Often referred to as Earth's "twin", Venus is the closest planet to Earth in terms of density and size, and occasionally the closest in distance too. Earth, Venus, and Mars, our other planetary neighbor, orbit the sun at different speeds.". Throughout this paragraph, the author gives information about Venus so you can understand in depth how and why it is explored, and most importantly, why it is so dangerous.

The danger of Venus is why it is mostly unknown, and why humans want to study it more. Even unmanned missions do not survive Venus's burning temperatures and intense pressure for more than a couple hours, making it very challenging to study. The author uses data like in the quote " A thick atmosphere of amost 97 percent carbon dioxide blankets Venus. Even more challenging are the clouds of highly corrosive sulfuric acid in Venus's atmosphere."(paragraph 3), to show how dangerous it is and why Venus is mostly unexplored. The author shows that the danger is not keeping NASA away, but it is drawing them closer. The author states that "Astronomers are facinated by Venus because it may well once have been the most Earth-like planet in our solar system." (paragraph 4). To the author, natural human curiousity is another reason why we should continue pursuing Venus, and how we are going to continue to explore Venus, even if it is dangerous.

The author uses examples of ideas from real scientists to support the statement that we should not give up on the idea of knowing more about Venus. NASA is still trying to figure out a way to have people explore Venus deeper. The author uses NASA's solutions to the conditions of Venus to explain why we should never stop exploring space. NASA is coming up with solutions to Venus, but they might prove ineffective, so pursuing Veus is still a worthy idea. They are trying to come up with a way to float above the harms of Venus, so they can still study it close, but be unaffected by the harmful temperatures and pressures of the surface. The author offers a rebuttal to this idea, saying "peering at Venus from a ship orbiting or hovering safely far above the planet can provide only limited insight on ground conditions because most forms of light cannot penetrate the dense atmosphere, rendering standard forms of photography and videography ineffective. More importantly, researchers cannot take samples of rock, gas, or anything else, from a distance." (paragraph 6). This information the author presents makes the reader understand that Venus is insanely difficult to explore when even NASA can not present useful ideas for intense exploration. But even when every idea is shut down, the author makes it clear that we should not give up the fight for exploration.

Venus is still an unhabitable planet for even our smartest robots. We as humans have tried our hardest to make sure we understand the many planets in out solar system. Even though it seems impossible, the author explains very successfully that this does not mean exploring Venus is impossible, it just means Venus is a complicated puzzle, but when it is solved, everyone will be overwhelmed with satisfaction, so to the author, giving up is not an option. The author believes that one day, exploring Venus in depth will be possible, and they explain their reasoning behind it very clearly so the reader can understand that studying Venus is a worthy persuit despite the dangers it presents.""",

    "VAL_15": """In "The Challenge of Exploring Venus," the author talks about why studying Venus is important, even though it is dangerous.

First, Venus is really hot. The article says the temperature is over 800 degrees Fahrenheit. That's way hotter than most things on Earth. Second, there are clouds of sulfuric acid in the atmosphere. This makes it hard for machines to land there. Third, Venus has a lot of pressure that is 90 times stronger than on Earth. This can crush any spacecraft that tries to land. Fourth, scientists think Venus could have had oceans a long time ago and possibly life. This is interesting because it gives us an idea of what Earth might have been like too. Fifth, NASA has some ideas to study Venus. They want to send a blimp-like vehicle to float above it. This could help scientists avoid the extreme conditions on the ground. Lastly, even though Venus is challenging, the author suggests that exploring it can help us learn more about space and even ourselves.

Overall, the author mentions many facts about Venus being dangerous, but doesn't explain very well why studying it is so important.""",

    "VAL_20": """People are facinated with the Man on the Moon and the idea of Martians, but most people do not think about life on Venus. Venus is the second planet from the sun and shares many geographical features with Earth. However, studying this planet is made difficult by the dense and toxic atmosphere, high temperatures, and violent weather. Despite this, some people think that Venus should still be explored, and the author of "The Challenge of Exploring Venus" is of this opinion. The idea that studying Venus is a worthy pursuit despite the dangers is well supported by the author as seen through the rewards of studying Venus and the progress that has been made towards studying Venus.

First, the idea that studying Venus is a worthy pursuit despite the dangers is well supported by the author as seen through the many rewards of studying Venus. After laying out the dangers of studying Venus, the author explains why scientists continue to study the planet. "Astronomers are fascinated by Venus because it may well have been the most Earth-like planet in our solar system" (4). By studying Venus, astronomers and geologists can predict what might happen to Earth in the future. Gaining an understanding of Earth's future may well allow scientists to predict what happened in Earth's past. Scientists are eager to learn about the early years of Earth's past, as it is shrouded in mystery, and this thirst for knowledge motivates them to study Venus. In describing how similar Venus was to Earth, the author says, "Long ago, Venus was probably covered largely with oceans and could have supported various forms of life" (4). If there was once life on Venus, the similarity between it and Earth would grow. As with geology, if biologists can understand what caused life to cease on Venus, they might be able to predict how life on Venus and on Earth might have started. The author shows that scientists studying Venus reap the reward of being able to learn about Earth's geology and early life. By laying out the various rewards to be had from studying Venus, the author is strengthening his or her argument that Venus should be studied.

Secondly, the idea that studying Venus is a worthy pursuit despite the dangers is well supported by the author as seen through the large amount of progress that has been made towards studying Venus. Although the author describes how Venus could be studied from the air, scientists still desire to learn about Venus from the planet's surface. One of their solutions to the problem of getting equipment to last on the surface of Venus is to expirement with new materials. "Simplified electronics made of silicon carbide have been tested in a chamber simulating the choas of Venus's surface and have lasted of three weeks in such conditions" (7). Research and experimentation taking place on Earth is giving scientists and astronauts more options for studying Venus. Although conditions on Venus are not hospitable to life, these new scientific advances are making it possible for data-gathering equipment to be sent to the surface of Venus and last long enough to gather data. Other scientists are moving away from traditional electronics and looking into purely mechanical systems. "Systems that use mechanical parts can be made more resistant to pressure, heat, and other forces" (7). The alternative that has presented itself to would-be explorers of Venus is older technology, like that found in the earliest computers. Scientists have realized that modern technology is too fragile and that more durable technologies are needed. By turning to other forms of technology, scientists are widening their options for ways to study Venus. The author mentions three different ways that scientists are making progress towards being able to study Venus - from the air, using new materials, and using old technologies. The author's postion that Venus should continue to be studied is supported by the scientific advancements that are serving to make studying Venus a reality.

In conclusion, the author's opinon that Venus should continue to be studied despite the dangers is well supported by the rewards of studying another Earth-like planet and the advancements that have been made towards being able to effectively study Venus. Scientists have strong motivation for studying Venus, and new technologies are making it possible for them to overcome the challenges presented by Venus's harsh terrain. Although scientists studying Venus are unlikely to encounter any life forms, what they do discover will help them to understand Earth's past and shape our future.""",

    "VAL_25": """In "the challenge of exploring venus ," the author suggests that studying venus is a worthy pursuit

despite the dangers it presents . becauce in the text it says at paragraph eight

"striving to meet challenge presented by venus has value , not only because of the insight to be gained on the planet itself , but also becauce human curiosity will likely lwad us into many equally intimdating endeavors ." this proves that we should try to get to mars .

there is even more evidence . In paragraph four it says " Astronomers are fascinated by venus because it may well once beeen

the most earth like planet in are solar sytem . " this just further shows the imense reasearch value .

theres even more prove . in the artical at paragraph 2 it says " often referred to as Earths "twin,"Venus is the closest planet to earth in terms of denisty and sise , and occasionally the closest in distance too. " showing are planets similer history .

in conclusion all this eveidince points to even though it will be hard we show try to reasearch venus more ."""
}

# =============================================================================
# GROUND TRUTH SCORES - YOUR Day 1 Scores
# =============================================================================

GROUND_TRUTH = {
    "VAL_02": {"completeness": 4, "accuracy": 4, "coherence": 3, "conciseness": 3},  # Authentic - repetitive, errors
    "VAL_04": {"completeness": 4, "accuracy": 4, "coherence": 4, "conciseness": 3},  # Authentic - analytical, well-structured
    "VAL_15": {"completeness": 3, "accuracy": 4, "coherence": 4, "conciseness": 4},  # Synthetic - list-style, moderate
    "VAL_20": {"completeness": 4, "accuracy": 5, "coherence": 4, "conciseness": 2},  # Authentic - formal essay, lengthy
    "VAL_25": {"completeness": 2, "accuracy": 3, "coherence": 3, "conciseness": 3},  # Authentic - short, spelling errors
}

# Quick validation
print("=" * 60)
print("TEST SUMMARIES LOADED")
print("=" * 60)
for sid, text in TEST_SUMMARIES.items():
    word_count = len(text.split())
    print(f"{sid}: {word_count} words")
print("=" * 60)
print("\n‚úÖ GROUND_TRUTH scores loaded from your Day 1 spreadsheet!")
print("   Source: Summary_Scoring_Template.xlsx - Main Scoring sheet")

TEST SUMMARIES LOADED
VAL_02: 499 words
VAL_04: 732 words
VAL_15: 191 words
VAL_20: 753 words
VAL_25: 193 words

‚úÖ GROUND_TRUTH scores loaded from your Day 1 spreadsheet!
   Source: Summary_Scoring_Template.xlsx - Main Scoring sheet


In [25]:
print("="*70)
print("RUNNING EVALUATIONS ON 5 TEST SUMMARIES")
print("="*70)

results = {}

for summary_id, summary_text in TEST_SUMMARIES.items():
    print(f"\n{'='*70}")
    print(f"Evaluating: {summary_id}")
    print(f"{'='*70}")
    print(f"\nSummary preview: {summary_text[:150]}...")

    # Use the full CoT prompt
    prompt = create_evaluation_prompt(summary_text)

    print("\nGenerating evaluation...")
    response = generate_evaluation(prompt)

    # Parse scores
    scores = parse_scores(response)
    results[summary_id] = {
        'llm_scores': scores,
        'ground_truth': GROUND_TRUTH.get(summary_id, {}),
        'response': response
    }

    print(f"\n--- LLM SCORES ---")
    for dim, score in scores.items():
        gt = GROUND_TRUTH.get(summary_id, {}).get(dim, "N/A")
        match = "‚úì" if score == gt else "‚óã" if score and gt and abs(score - gt) == 1 else "‚úó"
        print(f"  {dim.capitalize()}: LLM={score} | Ground Truth={gt} {match}")

    print(f"\n--- FULL RESPONSE ---")
    print(response[:1500] + "..." if len(response) > 1500 else response)

RUNNING EVALUATIONS ON 5 TEST SUMMARIES

Evaluating: VAL_02

Summary preview: Do you guys think that venus is dangers? Well for our part venus need to get up close and personal despite the risk or maybe they should think of them...

Generating evaluation...

--- LLM SCORES ---
  Completeness: LLM=2 | Ground Truth=4 ‚úó
  Accuracy: LLM=2 | Ground Truth=4 ‚úó
  Coherence: LLM=2 | Ground Truth=3 ‚óã
  Conciseness: LLM=1 | Ground Truth=3 ‚úó

--- FULL RESPONSE ---
COMPLETENESS: 2
Justification: The student identifies some of the main supporting points, such as the extreme conditions on Venus and the scientific value of studying the planet. However, they miss crucial aspects like NASA's solutions and alternative technologies, and their discussion is disjointed and lacks a clear structure.

### ACCURACY: 2
Justification: The student makes several factual errors, such as stating that Venus is "simple to see from the distant but safe vantage point of earth" (the text actually says it's "simple

In [26]:
print("\n" + "="*70)
print("AGREEMENT ANALYSIS")
print("="*70)

dimensions = ['completeness', 'accuracy', 'coherence', 'conciseness']

# Calculate agreements
exact_matches = {dim: 0 for dim in dimensions}
adjacent_matches = {dim: 0 for dim in dimensions}  # Within 1 point
total_valid = {dim: 0 for dim in dimensions}

for summary_id, data in results.items():
    llm = data['llm_scores']
    gt = data['ground_truth']

    for dim in dimensions:
        if llm.get(dim) is not None and gt.get(dim) is not None:
            total_valid[dim] += 1
            diff = abs(llm[dim] - gt[dim])
            if diff == 0:
                exact_matches[dim] += 1
                adjacent_matches[dim] += 1
            elif diff == 1:
                adjacent_matches[dim] += 1

print("\n### AGREEMENT BY DIMENSION ###\n")
print(f"{'Dimension':<15} {'Exact':<15} {'Adjacent (¬±1)':<15}")
print("-" * 45)

for dim in dimensions:
    n = total_valid[dim]
    if n > 0:
        exact_pct = (exact_matches[dim] / n) * 100
        adj_pct = (adjacent_matches[dim] / n) * 100
        print(f"{dim.capitalize():<15} {exact_pct:>5.1f}% ({exact_matches[dim]}/{n})   {adj_pct:>5.1f}% ({adjacent_matches[dim]}/{n})")
    else:
        print(f"{dim.capitalize():<15} No valid comparisons")

# Overall
total_exact = sum(exact_matches.values())
total_adjacent = sum(adjacent_matches.values())
total_n = sum(total_valid.values())

print("-" * 45)
if total_n > 0:
    print(f"{'OVERALL':<15} {(total_exact/total_n)*100:>5.1f}% ({total_exact}/{total_n})   {(total_adjacent/total_n)*100:>5.1f}% ({total_adjacent}/{total_n})")

print("\n### INTERPRETATION ###")
print("- Exact match: LLM score equals your ground truth score")
print("- Adjacent match: LLM score is within ¬±1 of ground truth")
print("- Target: Adjacent agreement ‚â•85% indicates good calibration")


AGREEMENT ANALYSIS

### AGREEMENT BY DIMENSION ###

Dimension       Exact           Adjacent (¬±1)  
---------------------------------------------
Completeness     80.0% (4/5)    80.0% (4/5)
Accuracy         40.0% (2/5)    80.0% (4/5)
Coherence        40.0% (2/5)   100.0% (5/5)
Conciseness      20.0% (1/5)    80.0% (4/5)
---------------------------------------------
OVERALL          45.0% (9/20)    85.0% (17/20)

### INTERPRETATION ###
- Exact match: LLM score equals your ground truth score
- Adjacent match: LLM score is within ¬±1 of ground truth
- Target: Adjacent agreement ‚â•85% indicates good calibration


In [27]:
print("\n" + "="*70)
print("TESTING SIMPLE PROMPT VARIATION")
print("="*70)

# Pick one summary to test both prompts
test_id = "VAL_04"
test_summary = TEST_SUMMARIES[test_id]

print(f"\nTesting on: {test_id}")

# Full CoT prompt (already done above)
full_scores = results[test_id]['llm_scores']

# Simple prompt
simple_prompt = create_simple_prompt(test_summary)
print("\nGenerating with SIMPLE prompt...")
simple_response = generate_evaluation(simple_prompt, max_new_tokens=400)
simple_scores = parse_scores(simple_response)

print("\n### PROMPT COMPARISON ###")
print(f"\n{'Dimension':<15} {'Full CoT':<12} {'Simple':<12} {'Ground Truth':<12}")
print("-" * 55)

for dim in dimensions:
    full = full_scores.get(dim, "N/A")
    simp = simple_scores.get(dim, "N/A")
    gt = GROUND_TRUTH[test_id].get(dim, "N/A")
    print(f"{dim.capitalize():<15} {str(full):<12} {str(simp):<12} {str(gt):<12}")

print("\n### SIMPLE PROMPT RESPONSE ###")
print(simple_response)


TESTING SIMPLE PROMPT VARIATION

Testing on: VAL_04

Generating with SIMPLE prompt...

### PROMPT COMPARISON ###

Dimension       Full CoT     Simple       Ground Truth
-------------------------------------------------------
Completeness    4            4            4           
Accuracy        4            5            4           
Coherence       4            4            4           
Conciseness     3            3            3           

### SIMPLE PROMPT RESPONSE ###
COMPLETENESS: 4
The student provides a good overview of the article's main points, but could have delved deeper into the supporting details.

ACCURACY: 5
The student accurately summarizes the article's content, including specific quotes and facts.

COHERENCE: 4
The essay is well-organized, but could benefit from a clearer introduction and conclusion to frame the discussion.

CONCISENESS: 3
The student could have condensed the essay by eliminating some repetitive phrases and focusing on the most essential points.

BRI

In [28]:
from google.colab import drive
drive.mount('/content/drive')

import json
import pandas as pd
from datetime import datetime

# Create results DataFrame
rows = []
for summary_id, data in results.items():
    row = {'summary_id': summary_id}
    for dim in dimensions:
        row[f'llm_{dim}'] = data['llm_scores'].get(dim)
        row[f'gt_{dim}'] = data['ground_truth'].get(dim)
    rows.append(row)

df = pd.DataFrame(rows)

# Save
timestamp = datetime.now().strftime('%Y%m%d_%H%M')
output_path = f'/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/LLM_Evaluation_Results/llm_evaluation_results_{timestamp}.csv'
df.to_csv(output_path, index=False)
print(f"Results saved to: {output_path}")

# Save full responses
responses_path = f'/content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/LLM_Responses/llm_responses_{timestamp}.json'
with open(responses_path, 'w') as f:
    json.dump({k: v['response'] for k, v in results.items()}, f, indent=2)
print(f"Full responses saved to: {responses_path}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Results saved to: /content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/LLM_Evaluation_Results/llm_evaluation_results_20251203_0305.csv
Full responses saved to: /content/drive/MyDrive/Courses/2025/3_Fall/EDUC_6192_Large_Language_Model_Applications_in_Education/Project/LLM_Responses/llm_responses_20251203_0305.json


In [31]:
print("\n" + "="*70)
print("TEMPERATURE EXPERIMENT")
print("="*70)

test_id = "VAL_04"
test_summary = TEST_SUMMARIES[test_id]
prompt = create_evaluation_prompt(test_summary)

temperatures = [0.0, 0.1, 0.3]

for temp in temperatures:
    print(f"\n--- Temperature: {temp} ---")
    response = generate_evaluation(prompt, temperature=temp)
    scores = parse_scores(response)
    print(f"Scores: {scores}")

print("\nNOTE: Lower temperature = more deterministic/consistent")
print("      Higher temperature = more varied/creative responses")

# %% [markdown]
# ## Next Steps
#
# 1. **Review the results** - Check which summaries show good agreement
# 2. **Identify patterns** - Which dimensions are harder for the LLM?
# 3. **Refine the prompt** - Add examples, clarify instructions
# 4. **Run on full validation set** - Test all 25 summaries
# 5. **Calculate Cohen's Kappa** - Formal inter-rater reliability metric


TEMPERATURE EXPERIMENT

--- Temperature: 0.0 ---
Scores: {'completeness': 4, 'accuracy': 4, 'coherence': 4, 'conciseness': 3}

--- Temperature: 0.1 ---
Scores: {'completeness': 4, 'accuracy': 5, 'coherence': 4, 'conciseness': 3}

--- Temperature: 0.3 ---
Scores: {'completeness': 4, 'accuracy': 4, 'coherence': 4, 'conciseness': 3}

NOTE: Lower temperature = more deterministic/consistent
      Higher temperature = more varied/creative responses
