## 📚 Step 1: Import Everything

This imports all the functions we need to run this notebook.

In [10]:
from prompt_runner import (
    load_environment,
    load_configuration, 
    load_dataset,
    tasks_to_dataframe,
    run_all_tasks,
    results_summary_dataframe,
    save_results,
    create_results_table
)
import os
import pandas as pd
from litellm import supports_response_schema
pd.set_option('display.max_colwidth', None)


print("✅ Imports complete!")

✅ Imports complete!


## 🔍 Step 2: Check Which Models Support Structured Output

Before we start, let's see which models work with our Pydantic structured output. Structured output constrains a model's reply to the format we want, such as answering only "positive" or "negative".
This helps you choose the right models for your config.json file, which controls which models we use.

In [11]:
print("🔍 Checking model support for structured output (Pydantic)...")
# Test models for json_schema support (what we actually need)
test_models = [
    ("gpt-4o-mini", "openai"),
    ("gpt-4o", "openai"), 
    ("gpt-3.5-turbo", "openai"),
    ("claude-3-haiku-20240307", "anthropic"),
    ("claude-3-5-sonnet-20241022", "anthropic"),
]

supported = []
for model, provider in test_models:
    try:
        # Use the correct function for checking Pydantic/json_schema support
        has_support = supports_response_schema(model=model, custom_llm_provider=provider)
        if has_support:
            supported.append(f"{model} ({provider})")
            print(f"✅ {model} - Supports Pydantic structured output")
        else:
            print(f"❌ {model} - No Pydantic structured output support")
    except Exception as e:
        print(f"❓ {model} - Could not check: {str(e)}")
        
print(f"\n💡 Supported models for config.json: {supported}")

🔍 Checking model support for structured output (Pydantic)...
✅ gpt-4o-mini - Supports Pydantic structured output
✅ gpt-4o - Supports Pydantic structured output
❌ gpt-3.5-turbo - No Pydantic structured output support
✅ claude-3-haiku-20240307 - Supports Pydantic structured output
✅ claude-3-5-sonnet-20241022 - Supports Pydantic structured output

💡 Supported models for config.json: ['gpt-4o-mini (openai)', 'gpt-4o (openai)', 'claude-3-haiku-20240307 (anthropic)', 'claude-3-5-sonnet-20241022 (anthropic)']
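
To make the idea of structured output concrete, here is a minimal standard-library sketch: the raw reply is parsed and rejected unless it carries one of the allowed labels. The notebook itself relies on Pydantic for this; `parse_sentiment` and the `sentiment` field name are assumptions made for illustration and are not part of `prompt_runner` or LiteLLM.

```python
import json

# Hypothetical: the only answers we accept from a model.
ALLOWED_LABELS = {"positive", "negative"}


def parse_sentiment(raw_reply: str) -> str:
    """Parse a JSON reply like {"sentiment": "positive"} and validate it.

    The field name "sentiment" is an assumption made for this sketch.
    Raises ValueError if the reply is not valid JSON or the label is not allowed.
    """
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {raw_reply!r}") from exc

    label = payload.get("sentiment") if isinstance(payload, dict) else None
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected label: {label!r}")
    return label


print(parse_sentiment('{"sentiment": "positive"}'))
```

A schema library such as Pydantic expresses the same rule declaratively, so that free-form replies like "I think it's mostly good" are rejected rather than silently miscounted.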


## 🔑 Step 3: Load Environment Variables

This loads your API keys from the .env file.
Make sure you have the API keys you need for the providers you want to use.

In [12]:
load_environment()

# Warn about any expected API key that is still missing after loading
api_keys_to_check = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY"]

for key in api_keys_to_check:
    if not os.getenv(key):
        print(f"⚠️ {key} is not set")

✓ Found 3 API key(s) in .env:
   - OPENAI_API_KEY
   - ANTHROPIC_API_KEY
   - GEMINI_API_KEY
✓ Environment variables loaded from .env
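
A .env file is plain text with one `KEY=value` pair per line. The sketch below shows roughly how such a file can be read. It is an illustration only: `parse_env_text` is a hypothetical helper, the key values are placeholders, and the real `load_environment` may work differently.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


example = """
# Placeholder values, not real keys
OPENAI_API_KEY=sk-your-key-here
ANTHROPIC_API_KEY="your-key-here"
"""
print(sorted(parse_env_text(example)))
```

Keep the real .env file out of version control, since it holds secrets.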


## ⚙️ Step 4: Load Configuration

This reads your config.json file, which contains:
- Which prompts to test
- Which models to use
- What temperatures to try
- How to format the output

In [13]:
config = load_configuration("config.json")

print(f"\n💡 To modify your experiment:")
print(f"   📝 Edit config.json to change:")
print(f"      - prompts: Add new prompt templates")
print(f"      - models: Try different LLMs")
print(f"      - temperatures: Test creativity vs consistency")
print(f"      - sample_size: Test on more data (currently {config['data_settings']['sample_size']})")
print(f"\n🔄 After editing config.json, restart this notebook to see changes")

✓ 3 prompts, 2 models, 2 temperatures

📝 Prompts:
   1. sentiment_basic: Classify this movie review as either positive or negative: '{text}'

   2. sentiment_instruction: Read the following movie review and determine if it's positive or negative.

Review: '{text}'

   3. sentiment_bad: hey is this good or bad idk: '{text}'


💡 To modify your experiment:
   📝 Edit config.json to change:
      - prompts: Add new prompt templates
      - models: Try different LLMs
      - temperatures: Test creativity vs consistency
      - sample_size: Test on more data (currently 10)

🔄 After editing config.json, restart this notebook to see changes
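
For orientation, here is a sketch of what a config.json along these lines might look like and how a prompt template gets filled in. The key names `prompts`, `models`, `temperatures`, `data_settings`, `sample_size` and `label_column` appear in this notebook. The exact layout (for example, prompts stored as an id-to-template mapping) is an assumption, so check your own file.

```python
import json

# Hypothetical config.json contents; the real schema may differ.
example_config_text = """
{
  "prompts": {
    "sentiment_basic": "Classify this movie review as either positive or negative: '{text}'"
  },
  "models": ["gpt-4o-mini", "claude-3-haiku-20240307"],
  "temperatures": [0.0, 0.7],
  "data_settings": {"sample_size": 10, "label_column": "sentiment"}
}
"""

example_config = json.loads(example_config_text)

# Fill the {text} placeholder of a prompt template with a review.
template = example_config["prompts"]["sentiment_basic"]
print(template.format(text="A wonderful film."))
print(example_config["data_settings"]["sample_size"])
```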


## 📁 Step 5: Load Your Data

This loads your CSV file and samples data for testing.
It automatically balances the classes (equal positive/negative samples).

In [14]:
samples = load_dataset("IMDB Dataset.csv", config)

print(f"📊 Loaded {len(samples)} samples")
print(f"📈 Label distribution:")
label_counts = samples[config['data_settings']['label_column']].value_counts()
for label, count in label_counts.items():
    print(f"   - {label}: {count}")

# Show sample data
samples.head(3)

✓ 10 samples loaded
📊 Loaded 10 samples
📈 Label distribution:
   - positive: 5
   - negative: 5


Unnamed: 0,review,sentiment
0,"I don't know how or why this film has a meager rating on IMDb. This film, accompanied by ""I am Curious: Blue"" is a masterwork.<br /><br />The only thing that will let you down in this film is if you don't like the process of film, don't like psychology or if you were expecting hardcore pornographic ramming.<br /><br />This isn't a film that you will want to watch to unwind; it's a film that you want to see like any other masterpiece, with time, attention and care.<br /><br />******SUMMARIES, MAY CONTAIN A SPOILER OR TWO*******<br /><br />The main thing about this film is that it blends the whole film, within a film thing, but it does it in such a way that sometimes you forget that the fictions aren't real.<br /><br />The film is like many films in one:<br /><br />1. A political documentary, about the social system in Sweden at the time. Which in a lot of ways are still relevant to today. Interviews done by a young woman named Lena.<br /><br />2. A narrative about a filmmaker, Vilgot Sjoman, making a film... he deals with a relationship with his star in the film and how he should have never got involved with people he's supposed to work with.<br /><br />3. The film that Vilgot is making. It's about a young woman named Lena(IE. #2), who is young and very politically active, she is making a documentary (IE. #1.). She is also a coming of age and into her sexuality, and the freedom of that.<br /><br />The magnificence and sheer brilliance of ""I am Curious: Yellow/Blue"" is how these three elements are cut together. In one moment you are watching an interview about politics, and the next your watching what the interviewer is doing behind the scenes but does that so well that you sometimes forget that it is the narrative.<br /><br />Another thing is the dynamic between ""Yellow"" and ""Blue"", which if you see one, you must see the other. ""Blue"" is not a sequel at all. 
I'll try to explain it best i can because to my knowledge, no other films have done it though it is a great technique.<br /><br />Think of ""Yellow"" as a living thing, actual events in 14 scenes. A complete tale.<br /><br />Think of ""Blue"" as all the things IN BETWEEN the 14 scenes in ""Yellow"" that you didn't see, that is a complete tale on it's own.<br /><br />Essentially they are parallel films... the same story, told in two different ways.<br /><br />It wasn't until i saw the first 30 minutes of ""Blue"" that i fully understood ""Yellow""<br /><br />I hope this was helpful for people who are being discouraged by various influences, because this film changed the way i looked at film.<br /><br />thanks for your time.",positive
1,"For a long time it seemed like all the good Canadian actors had headed south of the border and (I guessed) all the second rank ones filled the top slots and that left the dregs for the sex comedies.<br /><br />This film was a real surprise: despite the outlandish plots that are typical of farces, the actors seemed to be trying to put something into their characters and what we, the viewer, got back was almost true suspension of belief. When the extras from the music video attacked the evicting police, you almost believed it was possible.<br /><br />If you are a fan of some of the better sex farces (Canadian or not) you should definitely seek this one out. And the big surprise, this sex farce is also loaded with some very good nudity.",positive
2,"Terry Gilliam's and David Peoples' teamed up to create one of the most intelligent and creative science fiction movies of the '90's. People's proved a screenplay with bizarre twists and fantastic ideas about the nature of time  I especially love the idea one can't change the past; it's a nice counterpoint to so many time-travelling movies which say otherwise  biological holocausts and the thin line between sanity and madness. Gilliam visualized his ideas with unique quirkiness, perfection and originality.<br /><br />The story itself is engaging: one man, James Cole (played by Bruce Willis in a heart-warming performance) travels several decades to the past to retrieve information about a virus that's wiped out mankind and left only a few survivors alive living underground: with the information he'll collect, scientists hope to find a cure so everyone in the future can return to the surface. But because their time-travelling technology isn't perfect, he ends up being sent towards different other pasts and complicating things. And from that a brilliant science fiction thriller with shades of film noir ensues as the multiple pieces of a huge jigsaw start fitting together to form a bizarre narrative involving animal right activists, end of the millennium paranoia, biological weapons, the perception of reality, and the definition of sanity. With such a complex movie, it was easy for Gilliam and Peoples to create a mess, but instead Twelve Monkeys is a thought-provoking narrative which will please those who like to be challenged and have patience to appreciate some crazy ideas.<br /><br />I watched this movie once around 10 years ago. It marked me a lot: I remember still thinking about many days after-wards; for my young mind this seemed quite mind-blowing and it was one of the first movies to make me appreciate cinema as something serious and important. I've re-watched this movie a few days ago on DVD and it's better than I remembered it. 
Brad Pitt still steals all the scenes he's in, playing Jeffrey Goines  almost a prelude to his Tyler Durden character in Fight Club  a rich kid with some anarchist/non-conformist ideas who's also crazy and, according to Cole, perhaps responsible for the virus. The scenes between Jeffrey and Cole in the madhouse are the best in the movie, Pitt's eyes, voice and quirky mannerisms convince you he's really a crazy guy locked in a warped logic only he understands. Pitt's Oscar nomination was well deserved! Surprising was also Bruce Willis' performance: his I didn't remember very well, but it's beautiful and full of sensibility; he plays a man who spent almost all his life underground, and when he comes to the past you'll share his childish fascination with something as simple as breathing the fresh air of the morning or watching the sun go up. Cole is a rather ambiguous character, Peoples' tried to imbue some darkness in him, and he does other disturbing things to other people and to himself: the scene where he removes his own teeth reveals how far his dementia has gone unchecked. Ironically Cole didn't start as a crazy character, but when he starts warning everyone about the end of the world, he's considered mad and convinced it's all in his mind, until he arrives at a point when he can't distinguish past from future, reality from fiction. Willis spends a lot of time looking confused and insecure, and it works perfectly. One of the fun twists in the narrative is when Cole's shrink, Dr. Kathryn Railly, finds undeniable proof he's really from the future and now has to convince him again of his mission to save the world. The screenplay is full with weird twists like this and it keeps the movie in a fast pace. Their relationship is also well-handed, although perhaps a bit compressed for time's sake. But I enjoyed watching Cole and Railly falling in love and trying to escape the authority of the future to live a peaceful life in the past. 
But then things end in a tragic/bittersweet climax at an airport, wrapping all the pieces together, which will blow many minds away.<br /><br />There are two great endings in this movie, a twist in the sense of Se7en or Fight Club, and a more intimate ending where Railly is crouching next to Cole who's just been shot and looking around for a younger James Cole who's witnessing his future self die; the two share a brief look, and she smiles at him. The twist is brilliant, but I prefer this ending for emotional impact. Madeleine Stowe is very good playing Dr. Railly, she drew many different emotions from me in her performance. The movie is filled with a sense of fatalism with the idea the past can't be changed: this movie shows that in a terrifying way. It reminds me of Chinatown in that sense, the way Jake Gittes messes everything up the more he tries to help. Railly's character shares that fatalism, the more she tries to help Cole  first dealing with his 'madness' then helping him in his mission  the more they're sucked into tragedy.<br /><br />The twist ends with a hopeful note, though, with the feeling Cole's mission hasn't been in vain. Twelve Monkeys is a great movie to watch if one wants to be entertained; it's not supposed to be art, although it's more artists than many artistic movies. It's an unpretentious movie where all elements, from music to editing to costume design, etc., came together beautifully to produce a modern cinema masterpiece.",positive
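
Class balancing can be sketched as follows: group the rows by label, then draw the same number from each group. This is a standard-library illustration on toy data. `balanced_sample` is a hypothetical helper, and `load_dataset` may implement balancing differently.

```python
import random
from collections import defaultdict


def balanced_sample(rows, label_key, sample_size, seed=0):
    """Draw sample_size rows with an equal number from each label.

    Raises ValueError if there are no rows, if sample_size does not divide
    evenly across the labels, or if a label has too few rows.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    if not by_label:
        raise ValueError("no rows to sample from")

    per_label, remainder = divmod(sample_size, len(by_label))
    if remainder:
        raise ValueError("sample_size must divide evenly across labels")

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    picked = []
    for label in sorted(by_label):
        picked.extend(rng.sample(by_label[label], per_label))
    return picked


# Toy data standing in for the reviews CSV (20 positive, 10 negative).
toy_rows = [
    {"review": f"review {i}", "sentiment": "positive" if i % 3 else "negative"}
    for i in range(30)
]
sample = balanced_sample(toy_rows, "sentiment", 10)
print(len(sample))
```

Note that `head(3)` only shows the first rows, so seeing three positive reviews in a row does not mean the sample is unbalanced.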


## 🎯 Step 6: Create All Task Combinations

This creates every possible combination of prompt × model × temperature × sample.
With 3 prompts, 2 models, 2 temperatures and 10 samples, that is 3 × 2 × 2 × 10 = 120 tasks.

This is where we define exactly what experiments to run!

In [15]:
tasks_df = tasks_to_dataframe(config, samples)

print(f"🎯 Total: {len(tasks_df)} tasks")
print(f"⏱️  Estimated time: {tasks_df.attrs['estimated_time_min']:.1f} minutes")

# Show sample tasks
tasks_df[['prompt_id', 'model', 'temperature', 'text_snippet']].head()

🎯 Total: 120 tasks
⏱️  Estimated time: 1.0 minutes


Unnamed: 0,prompt_id,model,temperature,text_snippet
0,sentiment_basic,gpt-4o-mini,0.0,"I don't know how or why this film has a meager rating on IMDb. This film, accompanied by ""I am Curio..."
1,sentiment_basic,gpt-4o-mini,0.0,For a long time it seemed like all the good Canadian actors had headed south of the border and (I gu...
2,sentiment_basic,gpt-4o-mini,0.0,Terry Gilliam's and David Peoples' teamed up to create one of the most intelligent and creative scie...
3,sentiment_basic,gpt-4o-mini,0.0,What is there to say about an anti-establishment film that was produced in a time of such colourless...
4,sentiment_basic,gpt-4o-mini,0.0,This movie was made only 48 years after the end of the Civil War--most likely in anticipation of the...
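
The task grid is a Cartesian product, which `itertools.product` builds directly. The sketch below uses the prompt ids, models and temperatures seen in this run, with integer placeholders for the samples. It illustrates the idea and is not the code inside `tasks_to_dataframe`.

```python
from itertools import product

# Values taken from this run; sample ids are placeholders for the 10 reviews.
prompt_ids = ["sentiment_basic", "sentiment_instruction", "sentiment_bad"]
models = ["gpt-4o-mini", "claude-3-haiku-20240307"]
temperatures = [0.0, 0.7]
sample_ids = list(range(10))

example_tasks = [
    {"prompt_id": p, "model": m, "temperature": t, "sample_id": s}
    for p, m, t, s in product(prompt_ids, models, temperatures, sample_ids)
]
print(len(example_tasks))
```

Because the count is a product, adding one more model or temperature multiplies the number of API calls, so it is worth checking the total before running.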


## 🚀 Step 7: Run All Prompt Tasks

This is where the magic happens! 
- Makes API calls to test each prompt
- Uses parallel processing to go faster
- Shows real-time progress
- Handles errors gracefully

⚠️ This will cost money (API calls) and take a few minutes!

In [16]:
print("🚀 Starting prompt execution...")
print("💰 Note: This will make API calls and cost money!")

# Convert DataFrame to tasks list for execution
tasks = tasks_df.to_dict('records')
results = run_all_tasks(tasks, config, max_workers=4)

print(f"\n🎉 Completed {len(results)} tasks!")

# Show sample results
results_df = results_summary_dataframe(results)
print(f"✅ Success rate: {results_df.attrs['success_rate']:.1f}%")

results_df[['prompt_id', 'model', 'temperature', 'true_label', 'prediction']].head()

🚀 Starting prompt execution...
💰 Note: This will make API calls and cost money!
Running 120 tasks with 4 parallel workers...
Progress: 120/120 (100%)
✓ Completed all 120 tasks

🎉 Completed 120 tasks!
✅ Success rate: 100.0%


Unnamed: 0,prompt_id,model,temperature,true_label,prediction
0,sentiment_basic,gpt-4o-mini,0.0,positive,positive
1,sentiment_basic,gpt-4o-mini,0.0,positive,positive
2,sentiment_basic,gpt-4o-mini,0.0,positive,positive
3,sentiment_basic,gpt-4o-mini,0.0,positive,positive
4,sentiment_basic,gpt-4o-mini,0.0,positive,positive
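
Parallel execution with per-task error handling can be sketched with a thread pool. The example below uses a fake model call, so it costs nothing to run. `run_tasks_in_parallel` and `fake_model_call` are hypothetical names, and the internals of `run_all_tasks` may differ.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_model_call(task):
    """Stand-in for a real API call; fails on purpose for one task."""
    if task["id"] == 3:
        raise RuntimeError("simulated API error")
    return "positive"


def run_tasks_in_parallel(tasks, call, max_workers=4):
    """Run call(task) for every task; record errors instead of raising."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(call, task): task for task in tasks}
        for done, future in enumerate(as_completed(futures), start=1):
            task = futures[future]
            try:
                results.append({**task, "prediction": future.result(), "error": None})
            except Exception as exc:  # keep going if one call fails
                results.append({**task, "prediction": None, "error": str(exc)})
            print(f"Progress: {done}/{len(tasks)}")
    return results


demo_results = run_tasks_in_parallel([{"id": i} for i in range(5)], fake_model_call)
failed = [r for r in demo_results if r["error"]]
print(f"{len(demo_results)} finished, {len(failed)} failed")
```

Threads suit this workload because each task spends most of its time waiting on the network. Results arrive in completion order, so sort them if the order matters.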


## 📈 Step 8: Quick Analysis

This gives you a fast overview of how well each combination performed.
Look for patterns:
- Which prompts work better?
- Does temperature matter?
- Are there differences between models?

In [17]:
create_results_table(results)

Prompt Testing Results
12 combinations tested
Prompt,Model,Temperature,Accuracy,Correct,Total
sentiment_bad,claude-3-haiku-20240307,0.0,100.0%,10,10
sentiment_bad,claude-3-haiku-20240307,0.7,100.0%,10,10
sentiment_bad,gpt-4o-mini,0.0,100.0%,10,10
sentiment_bad,gpt-4o-mini,0.7,100.0%,10,10
sentiment_basic,claude-3-haiku-20240307,0.0,100.0%,10,10
sentiment_basic,claude-3-haiku-20240307,0.7,100.0%,10,10
sentiment_basic,gpt-4o-mini,0.0,100.0%,10,10
sentiment_basic,gpt-4o-mini,0.7,100.0%,10,10
sentiment_instruction,claude-3-haiku-20240307,0.0,100.0%,10,10
sentiment_instruction,claude-3-haiku-20240307,0.7,100.0%,10,10
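
The accuracy figures can be reproduced by grouping results on prompt, model and temperature and counting matches between `prediction` and `true_label`. This is a sketch on made-up rows. `accuracy_by_combination` is a hypothetical helper, not the implementation of `create_results_table`.

```python
from collections import defaultdict


def accuracy_by_combination(results):
    """Map (prompt_id, model, temperature) to (correct, total, accuracy %)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for row in results:
        key = (row["prompt_id"], row["model"], row["temperature"])
        counts[key][0] += row["prediction"] == row["true_label"]
        counts[key][1] += 1
    return {key: (c, n, 100.0 * c / n) for key, (c, n) in counts.items()}


# Made-up rows, only to show the calculation.
toy_results = [
    {"prompt_id": "p1", "model": "m1", "temperature": 0.0, "true_label": "positive", "prediction": "positive"},
    {"prompt_id": "p1", "model": "m1", "temperature": 0.0, "true_label": "negative", "prediction": "positive"},
    {"prompt_id": "p1", "model": "m1", "temperature": 0.7, "true_label": "negative", "prediction": "negative"},
]
for key, (correct, total, pct) in sorted(accuracy_by_combination(toy_results).items()):
    print(key, f"{pct:.1f}%", correct, total)
```

With only 10 samples per combination, a single misclassified review shifts accuracy by 10 percentage points, so increase `sample_size` before drawing conclusions about prompts or temperatures.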


## 💾 Step 9 (Optional): Save Results to Files

This saves your results as a CSV file.
Files are timestamped so you won't overwrite previous experiments.

In [19]:
file_path = save_results(results, output_path="experiment_results")
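
Timestamped file names are what keep earlier runs from being overwritten. A minimal sketch of the idea follows. `timestamped_csv_path` and `write_rows` are hypothetical helpers, and the name pattern is an assumption, so `save_results` may name its files differently.

```python
import csv
from datetime import datetime
from pathlib import Path


def timestamped_csv_path(output_dir, prefix="results", now=None):
    """Build a path such as <output_dir>/results_20240131_154500.csv."""
    now = now or datetime.now()
    return Path(output_dir) / f"{prefix}_{now:%Y%m%d_%H%M%S}.csv"


def write_rows(rows, path):
    """Write a non-empty list of dicts to CSV, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


# A fixed time is passed here only to make the example reproducible.
example_path = timestamped_csv_path("experiment_results", now=datetime(2024, 1, 31, 15, 45, 0))
print(example_path.name)
```

A second-level timestamp is enough for manual runs. Two runs started within the same second would still collide.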