## Setup

In [61]:
# Initialize environment variables/constants (for Google Colab)
# import os
# from google.colab import userdata

# os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")

# Initialize environment variables/constants (for VS Code)
import os

# Set your Google Gemini API key here or in your environment variables
# You can get a free API key from: https://aistudio.google.com/app/apikey
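api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    # os.environ only accepts strings, so assigning None would raise a TypeError
    raise RuntimeError("GOOGLE_API_KEY is not set; export it before running this notebook.")
os.environ["GOOGLE_API_KEY"] = api_key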

In [62]:
# Install langchain google genai
from IPython.display import clear_output

# google colab command
# !pip install -U langchain-google-genai

# vs code command
%pip install langchain
%pip install -U langchain-google-genai
%pip install google-generativeai

clear_output()

In [63]:
# Instantiate an LLM
from langchain.chat_models import init_chat_model

# Select free tier Gemini model
model = init_chat_model(
    model="gemini-2.5-flash",
    model_provider="google_genai"
)

## Clues Generator

In [None]:
# Write the prompts
import json

system_prompt = """
You are the game master of a game called "Disinformer", which is similar to the "message relay" game. Below is the description of how the game works:
```
In this cooperative game, players use communication and teamwork to uncover the original prompt over multiple rounds of clues. Along the way, they must contend with a disruptive "Disinformer," varying player interpretations, and time limits.

There will be a minimum of 3 players and maximum of 10 players:
- Regular players (a.k.a. the netizens): The job is to solve clues and discover the original prompt
- at most 2 misinformed players: They have the same job as the regular players. However, these players are unknowingly given vague/ambiguous clues.
- at most 2 disinformer players: The job is to solve the clues, discover the prompt, and then steer the other players away from it.

There will be 2 rounds in each game.
- In the first round, the players will be given clues to guess a general category/term (e.g. "movie", "song", "novel", etc)
- In the second round, the players will be given clues to guess a more specific thing (e.g. "The Dark Knight (2008)", "The Hitchhiker's Guide to the Galaxy (Novel)", "Space Oddity - David Bowie (1969)", etc) which is related to the general category in the previous round.

In each round, there will be 3 types of clues for each player:
- Informed: Not-so-easy but unambiguous clues.
- Misinformed: Ambiguous/vague clues that may potentially make them think an entirely different guess (intended for the misinformed player).
- Fake: Clues that point to one of the wrong answers.

Additionally, in each round, the players will be given 10 minutes to discuss their guess. If they get stuck, they may ask the game master to reveal an additional clue to help them.
```

As a game master, given a category and a thing (e.g. Movie: The Dark Knight (2008)), for each round, generate:
- 9 informed clues for the regular players. Make the clues as distinct as possible.
- 1 extra informed clue as a backup.
- 2 misinformed clues.
- 2 fake clues
- Also, the answer choices for that round (3 choices)

For round 2, make sure it is subtle enough. For example, when generating clues for a movie:
- No direct names.
- No title references.
- Focus on plot nuances, secondary characters, or themes instead of iconic moments.


=== MANDATORY WORD COUNT REQUIREMENT ===
**CRITICAL: EVERY SINGLE CLUE MUST BE EXACTLY 15-20 WORDS. NO EXCEPTIONS.**

Before submitting your response, you MUST:
1. Count every word in every clue individually (use a word counter)
2. Verify ALL 28 clues fall within 15-20 words
3. If even ONE clue is outside range, STOP and rewrite ONLY that clue
4. Repeat until 100% pass validation

Example of VALID clues (count the words):
- "The protagonist discovers a hidden power while fleeing from mysterious pursuers in an ancient temple underground." (16 words) ✅
- "A clever scientist creates an invention that has unintended consequences for society and their personal relationships." (16 words) ✅
- "Betrayal and redemption intertwine as characters navigate political conflicts involving espionage international borders and moral dilemmas." (16 words) ✅

Example of INVALID clues (DO NOT USE):
- "A discovery." (2 words) ❌
- "The plot involves something significant." (5 words) ❌
- "This film explores themes that are complex and multifaceted in nature and shows characters." (14 words) ❌
- "The protagonist faces numerous challenges while trying to achieve their goal against overwhelming odds in a fantasy world with magic and danger." (22 words) ❌

Then, you also need to provide 3 instructions to help the disinformer.
Based on the set of informed and misinformed clues you have come up with, using the Polarisation strategy, generate 3 instructions to help the disinformer player.

However, there are some restrictions that you must follow:
- You must not mention the answer choices except for the true answer.
- The disinformer is not aware which clues are the misinformed ones. So, avoid giving advice that aims to leverage the misinformed clues

After this, we will provide you with a pair consisting of the general category and the more specific thing in the following JSON format:
{
  "round_1": "<general category>",
  "round_2": "<specific thing>"
}
"""

output_format = """
BEFORE SUBMITTING YOUR RESPONSE:
1. Count every single clue word-by-word
2. Verify EVERY clue is 15-20 words
3. If even ONE clue is outside this range, STOP and rewrite it
4. Only submit when ALL clues pass the word count check

VALIDATION CHECKLIST (COMPLETE BEFORE SUBMISSION):
✓ Round 1 informed (9 clues): ALL 15-20 words? YES/NO
✓ Round 1 misinformed (2 clues): ALL 15-20 words? YES/NO
✓ Round 1 fake (2 clues): ALL 15-20 words? YES/NO
✓ Round 1 extra (1 clue): 15-20 words? YES/NO
✓ Round 2 informed (9 clues): ALL 15-20 words? YES/NO
✓ Round 2 misinformed (2 clues): ALL 15-20 words? YES/NO
✓ Round 2 fake (2 clues): ALL 15-20 words? YES/NO
✓ Round 2 extra (1 clue): 15-20 words? YES/NO

DO NOT SUBMIT JSON IF ANY CHECKBOX IS NO.

Total clues to validate: 28 clues (14 per round)
If ANY clue fails validation, regenerate all clues in that round.
**RESPONSE FORMAT**: You MUST respond with valid JSON only. No markdown, no explanations outside JSON.

Write the output using the following JSON format:
[
  {
    "answer": "<Answer of round 1>",
    "informed_clues": [<9 clues - EACH MUST BE 15-20 WORDS>],
    "misinformed_clues": [<2 clues - EACH MUST BE 15-20 WORDS>],
    "extra_clues": [<1 clue - MUST BE 15-20 WORDS>],
    "fake_clues": [<2 clues - EACH MUST BE 15-20 WORDS>],
    "choices": [<3 answer choices including the true answer>],
    "disinformer_instructions": [<3 instructions for the disinformer>]
  },
  {
    "answer": "<Answer of round 2>",
    "informed_clues": [<9 clues - EACH MUST BE 15-20 WORDS>],
    "misinformed_clues": [<2 clues - EACH MUST BE 15-20 WORDS>],
    "extra_clues": [<1 clue - MUST BE 15-20 WORDS>],
    "fake_clues": [<2 clues - EACH MUST BE 15-20 WORDS>],
    "choices": [<3 answer choices including the true answer>],
    "disinformer_instructions": [<3 instructions for the disinformer>]
  }
]
"""

one_shot_example = """
Below is one example of a query with VALIDATED word counts:

Q: {
  "round_1": "Song",
  "round_2": "Love Story - Taylor Swift"
}
A: [
  {
    "answer": "song",
    "informed_clues": [
      "Used to mark an emotional high point of a movie or personal moment in time.",
      "It swiftly conveys snapshots you replay in your mind instead of reading them on pages.",
      ...
    ],
    "misinformed_clues": [
      "It's something you might carefully browse over your morning coffee while relaxing peacefully at home.",
      "Rhyming patterns and rhythmic structures create sounds that echo through spaces and touch hearts deeply.",
      ...
    ],
    "extra_clues": [
      "It moves you through peaks and valleys of emotion using only rhythm and tone together."
    ],
    "fake_clues": [
      "Words printed on pages bound together tell stories across centuries and inspire human imagination deeply.",
      "Visual scenes displayed on screens create narratives showing characters acting in dramatic situations throughout films.",
      ...
    ],
    "choices": [
      "book",
      "short film",
      "song"
    ],
    "disinformer_instructions": [
      "Focus on the emotional impact rather than technical elements",
      "Notice patterns in how the content engages the audience",
      "Consider what makes it memorable across different demographics"
    ]
  },
  {
    "answer": "Love Story by Taylor Swift",
    "informed_clues": [
      "Draws on imagery of timeless romance and references feuding families rather than actual warring houses.",
      "Uses a whisper soft bridge section to heighten mounting tension before the triumphant key change.",
      ...
    ],
    "misinformed_clues": [
      "It's about sneaking out at dawn to crash a royal wedding you were not invited to.",
      "The narrative involves unexpected plot twists regarding romance and rising conflicts between opposing social groups.",
      ...
    ],
    "extra_clues": [
      "Evokes nostalgic flashback of meeting someone young and then leaps into emotional narrative confession."
    ],
    "fake_clues": [
      "A contemporary love song exploring themes of eternal devotion and unwavering commitment between two souls.",
      "A powerful ballad celebrating the strength of love across time and overcoming obstacles together.",
      ...
    ],
    "choices": [
      "A Thousand Years – Christina Perri",
      "Love Story - Taylor Swift",
      "I Will Always Love You - Whitney Houston"
    ],
    "disinformer_instructions": [
      "Pay attention to how the story unfolds chronologically through the narrative structure",
      "Consider the specific cultural or historical references embedded in the composition",
      "Notice how the musical arrangement shifts to emphasize key emotional moments"
    ]
  }
]
"""

user_prompt = json.dumps(
    {
      "round_1": "Movie",
      "round_2": "Star Wars Episode I: The Phantom Menace"
    }
)

In [65]:
# Construct the prompt and invoke the model
from langchain_core.messages import HumanMessage, SystemMessage

messages = [
    SystemMessage(system_prompt + output_format + one_shot_example),
    HumanMessage(user_prompt),
]

In [66]:
# Invoke the model
response = model.invoke(messages)

# Print the response
print(response.content)

```json
[
  {
    "answer": "Movie",
    "informed_clues": [
      "It primarily tells stories through moving images and synchronized sound on a screen.",
      "Audiences gather in dark rooms or homes to collectively experience its narrative flow.",
      "It involves directors, actors, writers, and crews collaborating to create a visual spectacle.",
      "Often relies on a distinct three-act structure to build tension and resolve character arcs.",
      "Can transport viewers to different worlds and eras, evoking strong emotional responses effectively.",
      "A single sitting usually completes the entire narrative, offering a full, contained experience.",
      "Uses cinematography and editing techniques to shape perception and convey specific artistic visions.",
      "Features soundtracks and musical scores specifically composed to enhance the on-screen action dramatically.",
      "Premieres at festivals and cinemas worldwide, becoming part of global cultural conversations quic

In [67]:
# Print the usage metadata
print(response.usage_metadata)

None


# Game Clue Analysis Matrix

## 1. Length Compliance
| Status | Criteria |
|--------|----------|
| ✅ PASS | All clues 15-20 words |
| ❌ FAIL | Any clues outside range |

**Outliers:** ___/14 clues failed

---

## 2. Quality Scores (Rate 1-5)

### Informed Clues: ___/5
- [ ] Different angles (plot, characters, themes, technical, cultural)
- [ ] Reasonable connection to correct answer
- [ ] Nothing gives away too much

### Misinformed Clues: ___/5
- [ ] Could point to 2+ different answers
- [ ] Vague but not nonsensical
- [ ] Not obviously wrong

### Fake Clues: ___/5
- [ ] Clearly point to wrong answer choices
- [ ] Believable enough to fool players

---

## 3. Diversity Check
- [ ] **PASS** - Informed clues cover different aspects
- [ ] **FAIL** - Found duplicates: ________________

---

## 4. Difficulty Rating
| Score | Assessment |
|-------|------------|
| 1-2 | Too Easy |
| 3 | Just Right |
| 4-5 | Too Hard |

**Rating:** ___/5

---

## Overall Assessment
**Pass/Fail:** ______  
**Main Issues:** ______________________  
**Notes:** ____________________________
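
The length-compliance row of the matrix can be filled in mechanically. The snippet below is a minimal sketch, assuming the same whitespace-based word count (`str.split()`) that the notebook uses later; the helper names `word_count` and `length_outliers` are illustrative and do not appear elsewhere in the notebook.

```python
def word_count(clue: str) -> int:
    """Count whitespace-separated tokens (the same rule as len(clue.split()))."""
    return len(clue.split())


def length_outliers(clues, low=15, high=20):
    """Return (clue_number, word_count) pairs for clues outside the inclusive range."""
    return [
        (i, word_count(clue))
        for i, clue in enumerate(clues, start=1)
        if not low <= word_count(clue) <= high
    ]


# Two of the example clues quoted in the system prompt
sample_clues = [
    "A discovery.",
    "A clever scientist creates an invention that has unintended consequences "
    "for society and their personal relationships.",
]
print(length_outliers(sample_clues))  # [(1, 2)]
```

Punctuation attached to a word is counted as part of that word, so this rule may differ slightly from how a model tokenizes or "counts" words.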

In [68]:
# List of different topics to test
test_topics = [
    {"round_1": "Movie", "round_2": "Star Wars Episode I: The Phantom Menace"},
    {"round_1": "Song", "round_2": "Bohemian Rhapsody - Queen"},
    {"round_1": "Book", "round_2": "Harry Potter and the Sorcerer's Stone"},
    {"round_1": "TV Show", "round_2": "Breaking Bad"},
    {"round_1": "Video Game", "round_2": "The Legend of Zelda: Breath of the Wild"},
    {"round_1": "Food", "round_2": "Pizza Margherita"},
    {"round_1": "Animal", "round_2": "African Elephant"},
    {"round_1": "Sport", "round_2": "Tennis"},
    {"round_1": "Country", "round_2": "Japan"},
    {"round_1": "Historical Event", "round_2": "Moon Landing 1969"}
]

### Manual

In [69]:
import json
import re
import csv
import pandas as pd
from time import sleep
from datetime import datetime
from langchain_core.messages import HumanMessage, SystemMessage

In [70]:
def extract_json_from_response(content):
    """Extract JSON from a model response, trying progressively looser fallbacks"""
    # Try the raw text first, then a copy with escaped quotes cleaned.
    # Cleaning first would corrupt valid JSON whose strings contain \" sequences.
    for text in (content, content.replace('\\"', '"')):
        # Method 1: Direct parse
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            pass

        # Method 2: Extract from code blocks
        match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(1))
            except json.JSONDecodeError:
                pass

        # Method 3: Find an unterminated array and close it
        match = re.search(r"(\[.*)", text, re.DOTALL)
        if match:
            json_text = match.group(1).rstrip()
            if not json_text.endswith(']'):
                json_text = json_text.rstrip(',') + ']'
            try:
                return json.loads(json_text)
            except json.JSONDecodeError:
                pass

    return None

In [71]:
def process_game_data(game_data, topic, run_number):
    """Process valid game data into rows"""
    rows = []
    for i, round_data in enumerate(game_data, start=1):
        answer = round_data.get("answer", "")
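        # Join with " | " so the analysis step can split the choices back apart
        # (titles may themselves contain commas)
        choices = " | ".join(round_data.get("choices", []))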

        for clue_type in ["informed_clues", "misinformed_clues", "fake_clues", "extra_clues"]:
            for j, clue in enumerate(round_data.get(clue_type, []), start=1):
                word_count = len(clue.split())
                rows.append({
                    "test_run": run_number,
                    "topic_category": topic['round_1'],
                    "topic_specific": topic['round_2'],
                    "round": i,
                    "answer": answer,
                    "choices": choices,
                    "clue_type": clue_type.replace("_clues", ""),
                    "clue_number": j,
                    "clue_text": clue,
                    "word_count": word_count,
                    "length_ok": "YES" if 15 <= word_count <= 20 else "NO",
                    "manual_score / comment": ""
                })
    return rows
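
Before flattening a response into rows, it can help to confirm that each round has the shape the prompt asks for (9 informed, 2 misinformed, 1 extra, and 2 fake clues, plus 3 choices and 3 disinformer instructions). The following is a minimal sketch; `EXPECTED_COUNTS` and `validate_round` are hypothetical names, and the case-insensitive exact match between the answer and the choices is an assumption that is stricter than the one-shot example in the prompt.

```python
# Expected list lengths per round, taken from the system prompt
EXPECTED_COUNTS = {
    "informed_clues": 9,
    "misinformed_clues": 2,
    "extra_clues": 1,
    "fake_clues": 2,
    "choices": 3,
    "disinformer_instructions": 3,
}


def validate_round(round_data):
    """Return a list of problems; an empty list means the round looks well-formed."""
    problems = []
    for key, expected in EXPECTED_COUNTS.items():
        actual = len(round_data.get(key, []))
        if actual != expected:
            problems.append(f"{key}: expected {expected}, got {actual}")

    # Assumption: the answer should appear verbatim (ignoring case) among the choices
    answer = round_data.get("answer", "")
    choices = [choice.casefold() for choice in round_data.get("choices", [])]
    if answer.casefold() not in choices:
        problems.append("answer is not among the choices")
    return problems


example_round = {
    "answer": "song",
    "informed_clues": ["clue"] * 9,
    "misinformed_clues": ["clue"] * 2,
    "extra_clues": ["clue"],
    "fake_clues": ["clue"],  # one fake clue missing on purpose
    "choices": ["book", "short film", "song"],
    "disinformer_instructions": ["tip"] * 3,
}
print(validate_round(example_round))  # ['fake_clues: expected 2, got 1']
```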

In [72]:
# Main execution
all_rows = []

for run_number, topic in enumerate(test_topics, 1):
    print(f"Running test {run_number}/{len(test_topics)}: {topic['round_1']} - {topic['round_2']}")

    messages = [
        SystemMessage(system_prompt + output_format + one_shot_example),
        HumanMessage(json.dumps(topic)),
    ]

    response = model.invoke(messages)
    clean_content = re.sub(r"<think>.*?</think>", "", response.content, flags=re.DOTALL).strip()
    game_data = extract_json_from_response(clean_content)

    if game_data:
        try:
            all_rows.extend(process_game_data(game_data, topic, run_number))
            print(f"✅ Test {run_number} completed successfully")
        except Exception as e:
            print(f"❌ Error processing data for test {run_number}: {e}")
    else:
        print(f"❌ No valid JSON found for test {run_number}")
        print("RAW:", clean_content[:200])

    sleep(5)

Running test 1/10: Movie - Star Wars Episode I: The Phantom Menace
✅ Test 1 completed successfully
Running test 2/10: Song - Bohemian Rhapsody - Queen
✅ Test 2 completed successfully
Running test 3/10: Book - Harry Potter and the Sorcerer's Stone
✅ Test 3 completed successfully
Running test 4/10: TV Show - Breaking Bad
✅ Test 4 completed successfully
Running test 5/10: Video Game - The Legend of Zelda: Breath of the Wild
✅ Test 5 completed successfully
Running test 6/10: Food - Pizza Margherita
✅ Test 6 completed successfully
Running test 7/10: Animal - African Elephant
✅ Test 7 completed successfully
Running test 8/10: Sport - Tennis
✅ Test 8 completed successfully
Running test 9/10: Country - Japan
✅ Test 9 completed successfully
Running test 10/10: Historical Event - Moon Landing 1969
✅ Test 10 completed successfully


In [73]:
# Save to CSV
with open("10_rounds_clues_analysis(gemini).csv", "w", newline="", encoding="utf-8") as f:
    if all_rows:
        writer = csv.DictWriter(f, fieldnames=all_rows[0].keys())
        writer.writeheader()
        writer.writerows(all_rows)
        print("✅ CSV saved: 10_rounds_clues_analysis(gemini).csv")

print(f"Total rows generated: {len(all_rows)}")

✅ CSV saved: 10_rounds_clues_analysis(gemini).csv
Total rows generated: 280
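
The test loop pauses for a fixed `sleep(5)` between calls. If a call still fails (for example on a rate limit), a retry with exponential backoff is one possible alternative. This is a sketch under stated assumptions: `invoke_with_backoff` is a hypothetical helper, and it catches every `Exception` for simplicity, whereas real code would probably narrow that to the provider's rate-limit error.

```python
import time


def invoke_with_backoff(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated flaky call: fails twice, then succeeds
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = invoke_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In the notebook this would wrap the model call, for example `invoke_with_backoff(lambda: model.invoke(messages))`.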


## LLM analysis (Gemini)


In [74]:
analysis_model = init_chat_model(
    model="gemini-2.5-flash",
    model_provider="google_genai"
)

In [75]:
def analyze_round_with_llm(round_data, analysis_model):
    """Analyze a single round using LLM"""

    # Count words for each clue type
    word_counts = {}
    length_issues = []

    for clue_type in ["informed_clues", "misinformed_clues", "fake_clues", "extra_clues"]:
        clues = round_data.get(clue_type, [])
        word_counts[clue_type] = []

        for i, clue in enumerate(clues, 1):
            word_count = len(clue.split())
            word_counts[clue_type].append(word_count)

            if not (15 <= word_count <= 20):
                length_issues.append(f"{clue_type} #{i}: {word_count} words")

    # Create word count summary
    word_count_summary = f"""
WORD COUNT ANALYSIS:
- Informed clues: {word_counts.get('informed_clues', [])}
- Misinformed clues: {word_counts.get('misinformed_clues', [])}
- Fake clues: {word_counts.get('fake_clues', [])}
- Extra clues: {word_counts.get('extra_clues', [])}

LENGTH ISSUES (should be 15-20 words):
{'; '.join(length_issues) if length_issues else 'All clues meet length requirements'}
"""

    analysis_prompt = f"""
You are evaluating clues for a disinformer game. Analyze BOTH word count compliance AND content quality.

{word_count_summary}

ROUND DATA:
{json.dumps(round_data, indent=2)}

CRITICAL REQUIREMENTS:
1. LENGTH COMPLIANCE: Each clue should be 15-20 words (see analysis above)
2. ANSWER CONTAMINATION: Check if ANY clue contains the answer word
3. SPECIFICITY: Are clues specific enough to distinguish from similar items?
4. CLUE REFERENCES: Use specific clue numbers when noting issues
5. DETAILED REASONING: Explain WHY you gave each score

CLUE TYPE REQUIREMENTS:
- **INFORMED CLUES**: Must relate to actual answer and be distinct/specific (avoid generic descriptions)
- **MISINFORMED CLUES**: Must be related to actual answer BUT vague enough to apply to multiple choices (create productive doubt)
- **FAKE CLUES**: Must clearly point to the OTHER answer choices, NOT the correct answer (effective misdirection)

SCORING SCALE (MANDATORY):
Rate based on how well each clue type fulfills its specific purpose:
- informed_quality: Rate 1-5 (How well do they point to correct answer specifically?)
- misinformed_quality: Rate 1-5 (Do they create ambiguity while staying answer-related?)
- fake_quality: Rate 1-5 (Do they clearly misdirect to wrong answer choices?)
- difficulty: Rate 1-5 (1=too easy, 2=easy, 3=just right, 4=hard, 5=too hard)

Return ONLY this JSON format:
{{
  "length_compliance_score": number (1-5),
  "length_issues_found": ["list of specific length problems"],
  "informed_quality": number (1-5),
  "informed_notes": "detailed analysis with specific clue numbers",
  "misinformed_quality": number (1-5),
  "misinformed_notes": "detailed analysis of ambiguity effectiveness",
  "fake_quality": number (1-5),
  "fake_notes": "detailed analysis of misdirection effectiveness",
  "diversity_issues": ["list specific problems found"],
  "difficulty": number (1-5),
  "difficulty_reasoning": "detailed explanation",
  "overall_notes": "comprehensive summary"
}}"""

    try:
        response = analysis_model.invoke([HumanMessage(analysis_prompt)])
        return extract_json_from_response(response.content)
    except Exception as e:
        print(f"❌ LLM analysis failed: {e}")
        return None
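
The "answer contamination" requirement in the analysis prompt can also be checked deterministically before asking the LLM. The sketch below is a simple word-overlap test; `find_contaminated_clues` and the `min_word_length` cutoff are assumptions made for illustration, and the second sample clue is invented for the demo. It only matches exact words, so plurals or paraphrases of the answer are not detected.

```python
import re


def find_contaminated_clues(answer, clues, min_word_length=4):
    """Return (clue_number, matched_words) for clues sharing a significant word with the answer."""
    # Ignore short words such as "the" or "of" in the answer
    answer_words = {
        word
        for word in re.findall(r"[a-z0-9']+", answer.lower())
        if len(word) >= min_word_length
    }
    hits = []
    for i, clue in enumerate(clues, start=1):
        clue_words = set(re.findall(r"[a-z0-9']+", clue.lower()))
        matched = sorted(answer_words & clue_words)
        if matched:
            hits.append((i, matched))
    return hits


clues = [
    "It primarily tells stories through moving images and synchronized sound on a screen.",
    "Every movie night ends with popcorn.",  # invented clue that leaks the answer
]
print(find_contaminated_clues("Movie", clues))  # [(2, ['movie'])]
```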

In [76]:
# Load data from your manual analysis CSV
import pandas as pd

# Load the CSV file from your manual analysis
df = pd.read_csv("10_rounds_clues_analysis(gemini).csv")  # Change filename as needed

# Group data by test_run and round to reconstruct round_data
all_results = []

In [77]:
for (test_run, round_num), group in df.groupby(['test_run', 'round']):
    # Skip disinformer instructions
    clue_data = group[group['clue_type'] != 'disinformer_instruction']

    if len(clue_data) == 0:
        continue

    # Get basic info
    topic_category = clue_data['topic_category'].iloc[0]
    topic_specific = clue_data['topic_specific'].iloc[0]
    answer = clue_data['answer'].iloc[0]
    choices = clue_data['choices'].iloc[0]

    print(f"Analyzing Test {test_run}, Round {round_num}: {topic_category} - {answer}")

    # Reconstruct round_data from CSV
    round_data = {
        "answer": answer,
        "choices": choices.split(" | ") if choices else [],
        "informed_clues": clue_data[clue_data['clue_type'] == 'informed']['clue_text'].tolist(),
        "misinformed_clues": clue_data[clue_data['clue_type'] == 'misinformed']['clue_text'].tolist(),
        "fake_clues": clue_data[clue_data['clue_type'] == 'fake']['clue_text'].tolist(),
        "extra_clues": clue_data[clue_data['clue_type'] == 'extra']['clue_text'].tolist()
    }

    # Analyze with LLM
    analysis = analyze_round_with_llm(round_data, analysis_model)

    if analysis:
        result = {
            "test_run": test_run,
            "topic_category": topic_category,
            "topic_specific": topic_specific,
            "round": round_num,
            "answer": answer,
            "choices": choices,

            # LLM Analysis Results
            "informed_quality": analysis.get("informed_quality", ""),
            "informed_notes": analysis.get("informed_notes", ""),
            "misinformed_quality": analysis.get("misinformed_quality", ""),
            "misinformed_notes": analysis.get("misinformed_notes", ""),
            "fake_quality": analysis.get("fake_quality", ""),
            "fake_notes": analysis.get("fake_notes", ""),
            "diversity_issues": "; ".join(analysis.get("diversity_issues", [])),
            "difficulty": analysis.get("difficulty", ""),
            "difficulty_reasoning": analysis.get("difficulty_reasoning", ""),
            "overall_notes": analysis.get("overall_notes", ""),

            # Word count and length compliance data
            "total_clues": len(round_data["informed_clues"]) + len(round_data["misinformed_clues"]) + len(round_data["fake_clues"]) + len(round_data["extra_clues"]),
            "length_compliant_clues": sum(1 for clue_type in ["informed_clues", "misinformed_clues", "fake_clues", "extra_clues"]
                                        for clue in round_data[clue_type]
                                        if 15 <= len(clue.split()) <= 20),
            "length_compliance_rate": f"{(sum(1 for clue_type in ['informed_clues', 'misinformed_clues', 'fake_clues', 'extra_clues'] for clue in round_data[clue_type] if 15 <= len(clue.split()) <= 20) / max(1, sum(len(round_data[clue_type]) for clue_type in ['informed_clues', 'misinformed_clues', 'fake_clues', 'extra_clues'])) * 100):.0f}%",
            "avg_word_count": round(sum(len(clue.split()) for clue_type in ["informed_clues", "misinformed_clues", "fake_clues", "extra_clues"] for clue in round_data[clue_type]) / max(1, sum(len(round_data[clue_type]) for clue_type in ["informed_clues", "misinformed_clues", "fake_clues", "extra_clues"])), 1)
        }

        all_results.append(result)
        print("  ✅ Analyzed successfully")
    else:
        print("  ❌ Analysis failed")

    sleep(2)  # Rate limiting

Analyzing Test 1, Round 1: Movie - Movie
  ✅ Analyzed successfully
Analyzing Test 1, Round 2: Movie - Star Wars Episode I: The Phantom Menace
  ❌ Analysis failed
Analyzing Test 2, Round 1: Song - song
  ✅ Analyzed successfully
Analyzing Test 2, Round 2: Song - Bohemian Rhapsody - Queen
  ✅ Analyzed successfully
Analyzing Test 3, Round 1: Book - book
  ✅ Analyzed successfully
Analyzing Test 3, Round 2: Book - Harry Potter and the Sorcerer's Stone
  ✅ Analyzed successfully
Analyzing Test 4, Round 1: TV Show - TV Show
  ✅ Analyzed successfully
Analyzing Test 4, Round 2: TV Show - Breaking Bad
  ✅ Analyzed successfully
Analyzing Test 5, Round 1: Video Game - video game
  ✅ Analyzed successfully
Analyzing Test 5, Round 2: Video Game - The Legend of Zelda: Breath of the Wild
  ✅ Analyzed successfully
Analyzing Test 6, Round 1: Food - food
  ✅ Analyzed successfully
Analyzing Test 6, Round 2: Food - Pizza Margherita
  ✅ Analyzed successfully
Analyzing Test 7, Round 1: A
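
The length statistics in the `result` dictionary recompute the same word counts several times inline. One way to make that easier to read is to compute them once in a helper. This is a sketch, assuming the same 15-20 word range and `str.split()` counting; `length_stats` and `CLUE_TYPES` are hypothetical names, and the returned keys mirror the column names used above.

```python
CLUE_TYPES = ["informed_clues", "misinformed_clues", "fake_clues", "extra_clues"]


def length_stats(round_data, low=15, high=20):
    """Compute the word-count summary for one round in a single pass."""
    counts = [
        len(clue.split())
        for clue_type in CLUE_TYPES
        for clue in round_data.get(clue_type, [])
    ]
    total = len(counts)
    compliant = sum(1 for count in counts if low <= count <= high)
    return {
        "total_clues": total,
        "length_compliant_clues": compliant,
        # max(1, total) avoids division by zero for an empty round
        "length_compliance_rate": f"{compliant / max(1, total) * 100:.0f}%",
        "avg_word_count": round(sum(counts) / max(1, total), 1),
    }


demo_round = {
    "informed_clues": [" ".join(["w"] * 15), " ".join(["w"] * 10)],
    "fake_clues": [" ".join(["w"] * 20)],
}
stats = length_stats(demo_round)
print(stats)
```

The result could then be merged into the per-round record with `result.update(length_stats(round_data))`.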

In [78]:
# Save results
if all_results:
    with open("llm_analysis_results(gemini).csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=all_results[0].keys())
        writer.writeheader()
        writer.writerows(all_results)

    print(f"✅ LLM analysis complete! Saved {len(all_results)} results to: llm_analysis_results(gemini).csv")

✅ LLM analysis complete! Saved 19 results to: llm_analysis_results(gemini).csv


In [79]:
import pandas as pd
from pathlib import Path
# Used by to_markdown function
%pip install tabulate

# Ensure the utility functions below exist in your notebook cell.
def calculate_length_compliance(row):
    compliance_rate = int(row['length_compliance_rate'].rstrip('%'))
    total_clues = row['total_clues']
    compliant = row['length_compliant_clues']
    non_compliant = total_clues - compliant
    return compliance_rate, compliant, non_compliant, total_clues

def get_pass_fail_status(compliance_rate):
    return "✅ PASS" if compliance_rate >= 80 else "❌ FAIL"

def get_quality_assessment(score):
    assessments = {
        1: "Poor - Needs significant revision",
        2: "Fair - Below expectations",
        3: "Good - Meets requirements",
        4: "Very Good - Exceeds expectations",
        5: "Excellent - Outstanding"
    }
    return assessments.get(int(score), "Unknown")

def get_difficulty_assessment(difficulty):
    difficulty = int(difficulty)
    if difficulty <= 2:
        return "🟢 Too Easy"
    elif difficulty == 3:
        return "🟢 Just Right"
    else:
        return "🟠 Too Hard"

def extract_issues(notes_str):
    import pandas as pd
    if pd.isna(notes_str):
        return ["None identified"]
    notes_str = str(notes_str).lower()
    issues = []
    keywords = {
        "length": "Word count compliance issues",
        "generic": "Generic/vague clues",
        "diversity": "Lack of diversity in themes",
        "ambiguity": "Insufficient ambiguity in misinformed clues",
        "specificity": "Missing specificity in clues",
        "answer contamination": "Answer word revealed in clues"
    }
    for keyword, issue in keywords.items():
        if keyword in notes_str:
            issues.append(issue)
    return issues if issues else ["Minor issues noted"]

def generate_matrix_for_round(row):
    test_run = int(row['test_run'])
    topic_cat = row['topic_category']
    topic_spec = row['topic_specific']
    round_num = int(row['round'])
    compliance_rate, compliant, non_compliant, total = calculate_length_compliance(row)
    status = get_pass_fail_status(compliance_rate)

    # Handle NaN values with defaults
    informed_score = int(row['informed_quality']) if not pd.isna(row['informed_quality']) else 3
    misinformed_score = int(row['misinformed_quality']) if not pd.isna(row['misinformed_quality']) else 3
    fake_score = int(row['fake_quality']) if not pd.isna(row['fake_quality']) else 3
    difficulty = int(row['difficulty']) if not pd.isna(row['difficulty']) else 3

    issues = extract_issues(row['overall_notes'])
    diversity_issues = row['diversity_issues'] if not pd.isna(row['diversity_issues']) else "None identified"
    
    matrix = f"""# Game Clue Analysis Matrix
**Test Run {test_run} | Round {round_num}: {topic_cat} → {topic_spec}**

---

## 1. Length Compliance
| Status | Criteria |
|--------|----------|
| {status} | Clues within 15-20 words |

**Compliance Rate:** {compliance_rate}% ({compliant}/{total} clues)  
**Outliers:** {non_compliant}/{total} clues failed  
**Average Word Count:** {row['avg_word_count']} words

**Assessment:** {"✅ Acceptable - Most clues meet length requirements" if compliance_rate >= 80 else "❌ Critical - Significant length violations require revision"}

---

## 2. Quality Scores (Rate 1-5)

### Informed Clues: {informed_score}/5  
**{get_quality_assessment(informed_score)}**

{row['informed_notes']}

✅ Strengths:
- Generally specific and relate to correct answer
- Provide distinct perspectives where applicable

⚠️ Concerns:
- {row['diversity_issues'] if not pd.isna(row['diversity_issues']) else "Minor thematic overlap observed"}

### Misinformed Clues: {misinformed_score}/5  
**{get_quality_assessment(misinformed_score)}**

{row['misinformed_notes']}

✅ Strengths:
- Attempt to create ambiguity
- Generally related to the correct answer

⚠️ Concerns:
- May need more subtle misdirection
- Ambiguity effectiveness varies

### Fake Clues: {fake_score}/5  
**{get_quality_assessment(fake_score)}**

{row['fake_notes']}

✅ Strengths:
- Effectively misdirect to wrong answer choices
- Clear deception without being obvious

---

## 3. Diversity Check

| Aspect | Status |
|--------|--------|
| Theme Coverage | {"✅ PASS" if "diversity" not in diversity_issues.lower() else "❌ FAIL"} |
| Clue Variation | {"✅ PASS" if informed_score >= 3 else "❌ FAIL"} |
| Angle Coverage | {"✅ PASS" if non_compliant <= 2 else "❌ FAIL"} |

**Issues Found:** {diversity_issues}

---

## 4. Difficulty Rating

| Score | Assessment |
|-------|------------|
| Rating | {difficulty}/5 - {get_difficulty_assessment(difficulty)} |

**Reasoning:** {row['difficulty_reasoning']}

---

## Overall Assessment

**Overall Quality Score:** {(informed_score + misinformed_score + fake_score) / 3:.1f}/5

**Pass/Fail:** {"✅ PASS" if compliance_rate >= 70 and (informed_score + misinformed_score + fake_score) / 3 >= 3 else "⚠️ NEEDS REVISION"}

**Main Issues:**
{chr(10).join(f"- {issue}" for issue in issues)}

**Priority Actions:**
1. {"Address length compliance" if compliance_rate < 80 else "Minor length adjustments"}
2. {"Enhance misinformed clue ambiguity" if misinformed_score < 3 else "Maintain misinformed clue quality"}
3. {"Increase clue diversity" if "diversity" in diversity_issues.lower() else "Maintain current diversity"}

**Overall Notes:**  
{row['overall_notes']}

---
"""
    return matrix

# --- Matrices Generation per Test Run ---
csv_path = Path("llm_analysis_results(gemini).csv")
if not csv_path.exists():
    print(f"❌ Error: {csv_path} not found.")
else:
    df = pd.read_csv(csv_path)
    test_runs = df['test_run'].unique()
    # Named output_dir rather than dir, which would shadow the built-in dir()
    output_dir = Path("clue_analysis_matrices")
    output_dir.mkdir(exist_ok=True)

    for test in sorted(test_runs):
        group = df[df['test_run'] == test]
        text = f"# Analysis for Test {test}\n\n"

        # Append the matrix for each round in the test (first row per round)
        rounds = sorted(group['round'].unique())
        for r in rounds:
            row = group[group['round'] == r].iloc[0]
            matrix_text = generate_matrix_for_round(row)
            text += matrix_text + "\n\n"

        # Append a round-by-round performance summary table for this test
        text += "## Round-by-Round Performance Summary\n\n"
        text += "| Round | Length Compliance | Informed | Misinformed | Fake | Difficulty |\n"
        text += "|-------|-------------------|----------|-------------|------|------------|\n"
        for r in rounds:
            row = group[group['round'] == r].iloc[0]
            length_comp = row['length_compliance_rate']
            inf_score = row['informed_quality']
            mis_score = row['misinformed_quality']
            fake_score = row['fake_quality']
            difficulty = row['difficulty']
            text += f"| {r} | {length_comp} | {inf_score}/5 | {mis_score}/5 | {fake_score}/5 | {difficulty}/5 |\n"

        # Save the Markdown report for this test run
        test_file = output_dir / f"test{test}_clue_analysis(gemini).md"
        with open(test_file, 'w', encoding='utf-8') as f:
            f.write(text)
        print(f"✅ Generated analysis matrices for Test {test}: {test_file}")
    
    # --- Overall Performance Breakdown by Category ---
    overall_summary = "# Overall Performance Breakdown by Category\n\n"
    # Fill NaN values with default score of 3 before aggregation
    df_clean = df.copy()
    df_clean['informed_quality'] = pd.to_numeric(df_clean['informed_quality'], errors='coerce').fillna(3)
    df_clean['misinformed_quality'] = pd.to_numeric(df_clean['misinformed_quality'], errors='coerce').fillna(3)
    df_clean['fake_quality'] = pd.to_numeric(df_clean['fake_quality'], errors='coerce').fillna(3)
    df_clean['difficulty'] = pd.to_numeric(df_clean['difficulty'], errors='coerce').fillna(3)

    by_category = df_clean.groupby('topic_category').agg({
        # Compliance rates are stored as strings such as "85%": strip the sign, average, truncate
        'length_compliance_rate': lambda x: f"{int(x.str.rstrip('%').astype(int).mean())}%",
        'informed_quality': lambda x: f"{x.astype(int).mean():.1f}/5",
        'misinformed_quality': lambda x: f"{x.astype(int).mean():.1f}/5",
        'fake_quality': lambda x: f"{x.astype(int).mean():.1f}/5",
        'difficulty': lambda x: f"{x.astype(int).mean():.1f}/5"
    }).reset_index()
    # DataFrame.to_markdown requires the optional `tabulate` package
    overall_summary += by_category.to_markdown(index=False)
    
    # Save overall summary to a markdown file
    overall_file = Path("Disinformer_Game_Clues_Quality_Summary(gemini).MD")
    with open(overall_file, 'w', encoding='utf-8') as f:
        f.write(overall_summary)
    print(f"✅ Overall performance by category saved: {overall_file}")

✅ Generated analysis matrices for Test 1: clue_analysis_matrices/test1_clue_analysis(gemini).md
✅ Generated analysis matrices for Test 2: clue_analysis_matrices/test2_clue_analysis(gemini).md
✅ Generated analysis matrices for Test 3: clue_analysis_matrices/test3_clue_analysis(gemini).md
✅ Generated analysis matrices for Test 4: clue_analysis_matrices/test4_clue_analysis(gemini).md
✅ Generated analysis matrices for Test 5: clue_analysis_matrices/test5_clue_analysis(gemini).md
✅ Generated analysis matrices for Test 6: clue_analysis_matrices/test6_clue_analysis(gemini).md
✅ Generated analysis matrices for Test 7: clue_analysis_matrices/test7_clue_analysi