# TheProdBot TAM/SAM/SOM Evals Notebook for Product Managers

Welcome, product managers, to a notebook designed to help you 10x your product work! This notebook is not about building "TheProdBot" itself, but rather serves as a validation tool and a platform for creating better evaluations and prompts for our vision of an AI chatbot for product managers.

The core problem this notebook addresses is the need for product managers to quickly and effectively estimate market size (TAM, SAM, SOM) for new product ideas. We've created this demo to show proof-of-life for the concept of an AI assistant that can aid in this process. By running different language models against a defined set of prompts and product context, we can:

*   **Validate the concept:** See if the models can generate valuable and relevant market size estimations.
*   **Improve prompts:** Analyze the generated traces to understand what works and what doesn't, allowing us to refine our prompts for better results.
*   **Develop robust evaluations:** Use the output and trace data to build better evaluation criteria for future iterations of "TheProdBot".

This notebook is organized into five stages:

*   **Setup:** Get started with necessary dependencies and API key configuration.
*   **Configure:** Define your product idea, target markets, and the questions you want the AI to answer.
*   **Run Models:** Execute the market sizing flow with various language models.
*   **Analyze Results:** Examine the generated market estimations, logs, and summaries.
*   **Evaluate:** Utilize the interactive tool to label and assess the quality of the AI's responses, providing crucial feedback for improvement.

By using this notebook, you can gain insights into the potential of AI to assist with market sizing and contribute to building a more effective AI assistant for product managers.
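
For readers new to the terminology, the relationship between the three numbers can be sketched as a simple top-down funnel. This is only an illustration: the function name, the two share parameters, and the example figures are assumptions made for this sketch, not values or logic taken from TheProdBot's prompts.

```python
from dataclasses import dataclass


@dataclass
class MarketSize:
    tam: float  # Total Addressable Market (annual revenue)
    sam: float  # Serviceable Available Market
    som: float  # Serviceable Obtainable Market


def top_down_market_size(total_buyers, annual_price, serviceable_share, obtainable_share):
    """Hypothetical top-down funnel: TAM -> SAM -> SOM."""
    for name, share in (("serviceable_share", serviceable_share),
                        ("obtainable_share", obtainable_share)):
        if not 0.0 <= share <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {share}")
    tam = total_buyers * annual_price   # everyone who could buy, at full price
    sam = tam * serviceable_share       # the slice the product can actually serve
    som = sam * obtainable_share        # the slice realistically winnable
    return MarketSize(tam=tam, sam=sam, som=som)


# Made-up inputs, purely for illustration.
print(top_down_market_size(1_000_000, 120.0, 0.25, 0.04))
```

Each stage narrows the previous one, so SOM can never exceed SAM and SAM can never exceed TAM. That ordering is a quick sanity check to apply to any model-generated estimate.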

In [92]:
# 🧰 Environment Setup & Drive Mount
# ----------------------------------------------------------
# This cell installs the needed Python libraries, mounts your
# Google Drive so Colab can read/write project files, and then
# confirms that you’re in the correct working directory.
# ----------------------------------------------------------

# 1️⃣  Install dependencies (quiet mode for clean logs)
!pip install openai pyyaml --quiet

# 2️⃣  Mount Google Drive to access the shared notebook files
from google.colab import drive
drive.mount('/content/drive', force_remount=False)

# 3️⃣  Navigate to your working directory
%cd /content/drive/MyDrive/Colab\ Notebooks/TAM-SAM-SOM.Notebook

# 4️⃣  List the first 50 lines of the directory listing for a quick sanity check
!ls -la | sed -n '1,50p'

# ✅  You should see the outputs/ folder, config_session.yaml, and your .py files


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook
total 142
-rw------- 1 root root   3847 Nov  1 16:15 build_evals_dataset.py
-rw------- 1 root root    982 Nov  1 15:57 build_trace.py
-rw------- 1 root root    913 Nov  1 16:23 build_traces.py
-rw------- 1 root root    748 Nov  1 16:08 config_session.yaml
-rw------- 1 root root   5564 Nov  1 16:44 eval_labeler.py
-rw------- 1 root root   1616 Nov  1 16:52 export_traces_csv.py
-rw------- 1 root root   4134 Nov  1 17:03 notebook_setup_health_check.py
drwx------ 6 root root   4096 Nov  1 17:23 outputs
-rw------- 1 root root   2928 Nov  1 17:13 prompt_runner.py
-rw------- 1 root root   1657 Nov  1 14:54 prompts_pm.json
drwx------ 2 root root   4096 Nov  1 17:16 __pycache__
-rw------- 1 root root   3659 Nov  1 16:11 run_models_bakeoff.py
-rw------- 1 root root   2089 Nov  1 16:04 run_prompts.py
-rw-------

In [93]:
# 🩺 Notebook Health Check
# ----------------------------------------------------------
# This quick test verifies that everything’s wired up correctly:
#   • Google Drive is mounted
#   • Required files (config, prompts, scripts) are present
#   • /outputs directory exists and is writable
#   • Config and prompt JSON parse cleanly
#
# It’s basically our “pre-flight checklist” before running any models.
# ----------------------------------------------------------

!python notebook_setup_health_check.py

# ✅  If everything’s green, you’re ready to run prompts and bakeoffs.
#     If you see any ❌ or ⚠️ messages, fix those before continuing.


=== Productside Notebook Setup Health Check ===
Working dir: /content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook
✅ Google Drive mounted successfully.
✅ Dir present: outputs
✅ Found: config_session.yaml
✅ Found: prompts_pm.json
✅ Found: run_prompts.py
✅ Found: run_models_bakeoff.py
✅ Found: build_evals_dataset.py
✅ Found: build_traces.py
✅ Found: eval_labeler.py
✅ Optional: export_traces_csv.py
✅ Optional: notebook_setup_health_check.py
✅ Optional: TheProdBot_Evals_Demo.ipynb
✅ Models to test: ['gpt-3.5-turbo', 'gpt-4o-mini', 'gpt-4o', 'gpt-4.1']
✅ Product context looks valid.
✅ Prompts contain all TAM/SAM/SOM turns.
✅ OPENAI_API_KEY found in environment.

=== NEXT STEPS ===

1️⃣  Run prompts for a single model:
    !python run_prompts.py

2️⃣  Bake off multiple models:
    !python run_models_bakeoff.py

3️⃣  Build automatic eval dataset:
    !python build_evals_dataset.py

4️⃣  Export human-readable traces:
    !python export_trac
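
The source of `notebook_setup_health_check.py` is not shown here, so the following is only a rough sketch of what a pre-flight check of this kind might look like. The function name `preflight` and its return shape are assumptions, and it covers only the checks that need no third-party packages (file presence, JSON parsing, a writable `outputs/` folder, and the API key).

```python
import json
import os
import tempfile


def preflight(project_dir, required_files, json_files=()):
    """Return a list of (ok, message) tuples. Hypothetical sketch only."""
    results = []
    for name in required_files:
        ok = os.path.isfile(os.path.join(project_dir, name))
        results.append((ok, f"file present: {name}"))
    for name in json_files:
        try:
            with open(os.path.join(project_dir, name), encoding="utf-8") as fh:
                json.load(fh)
            results.append((True, f"parses as JSON: {name}"))
        except (OSError, ValueError) as exc:
            results.append((False, f"cannot parse {name}: {exc}"))
    out_dir = os.path.join(project_dir, "outputs")
    try:
        # Creating a throwaway file is the most direct writability test.
        with tempfile.NamedTemporaryFile(dir=out_dir):
            pass
        results.append((True, "outputs/ exists and is writable"))
    except OSError as exc:
        results.append((False, f"outputs/ not usable: {exc}"))
    has_key = bool(os.environ.get("OPENAI_API_KEY", "").strip())
    results.append((has_key, "OPENAI_API_KEY present in environment"))
    return results


if __name__ == "__main__":
    for ok, message in preflight(".", ["config_session.yaml", "prompts_pm.json"],
                                 json_files=["prompts_pm.json"]):
        print("OK  " if ok else "FAIL", message)
```

Returning the results instead of printing inside the function keeps the checks easy to test and lets the caller decide whether a failure should stop the notebook.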

In [94]:
# 🔑 Securely Capture or Confirm Your OpenAI API Key
# ----------------------------------------------------------
# This cell checks whether your OpenAI API key is already loaded
# into the Colab environment. If it’s missing, you’ll be prompted
# to paste it securely (the input is hidden and the key is not
# written to the notebook file).
#
# ✅ Why this matters:
#   • Securely loads the key from an operating system environment variable
#   • Keeps your key out of the notebook file (safe to share)
#   • Ensures all downstream scripts can access the key
#   • Lasts for the current Colab session only; you are prompted again in a new one
# ----------------------------------------------------------

import os, getpass

if "OPENAI_API_KEY" in os.environ and os.environ["OPENAI_API_KEY"].strip():
    print("🔐 OPENAI_API_KEY already set in environment. Using stored key.")
else:
    api_key = getpass.getpass("Paste your OPENAI_API_KEY (hidden): ").strip()
    if not api_key:
        raise ValueError("No API key entered. Aborting setup.")
    os.environ["OPENAI_API_KEY"] = api_key
    print("✅ API key securely captured for this session.")

# ✅ You’re good to go: the key is live for this notebook session only.


🔐 OPENAI_API_KEY already set in environment. Using stored key.
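
If you ever need to confirm which key is loaded, printing a masked form is a safer habit than printing the value itself, because cell output is saved with the notebook. A minimal sketch follows; the helper name `mask_secret` and its masking rule are assumptions for illustration and are not part of the project scripts.

```python
import os


def mask_secret(value, visible=4):
    """Return a display-safe form of a secret: a fixed mask plus the last few characters."""
    # Very short or missing values are fully masked so nothing useful leaks.
    if not value or len(value) <= visible * 2:
        return "***"
    return f"***{value[-visible:]}"


print("OPENAI_API_KEY:", mask_secret(os.environ.get("OPENAI_API_KEY")))
```

The mask has a fixed width on purpose, so the output does not reveal how long the key is.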


In [95]:
# 🎛️ Interactive Model Selector & Runner
# ----------------------------------------------------------
# This cell lets you choose which LLM runs your TAM→SAM→SOM
# scenario. (For a full bake-off, use the bakeoff cell further down.)
#
# ✅ What it does:
#   • Reads models from config_session.yaml
#   • Loads the prompt set from prompts_pm.json
#   • Confirms your API key is active
#   • Lets you select a model via dropdown, then runs the flow
#
# ⚠️ CAUTION
#   • This cell/step will call your API
#   • You will incur charges via your API vendor
#   • I/We are NOT responsible for the charges you incur
#
# The results will be stored in /outputs/<model>/ with
# trace files you’ll later analyze and label.
# ----------------------------------------------------------

import os, json, yaml
from IPython.display import display, Markdown
import ipywidgets as W
from run_prompts import load_context, run_flow

# 1️⃣  Confirm that the API key is available
if "OPENAI_API_KEY" not in os.environ or not os.environ["OPENAI_API_KEY"].strip():
    raise EnvironmentError("❌ Missing OPENAI_API_KEY. Run the key loader cell first.")
else:
    print("🔐 Using securely loaded OpenAI API key.")

# 2️⃣  Load configuration and prompt definitions
cfg_path = "config_session.yaml"
prompts_path = "prompts_pm.json"

if not os.path.exists(cfg_path):
    raise FileNotFoundError("Missing config_session.yaml")
if not os.path.exists(prompts_path):
    raise FileNotFoundError("Missing prompts_pm.json")

with open(cfg_path) as f:
    cfg = yaml.safe_load(f)
ctx = load_context(cfg_path)
# Use a context manager so the prompts file handle is closed promptly
with open(prompts_path) as f:
    prompts = json.load(f)

models = cfg.get("models_to_test", [])
if not models:
    raise ValueError("No 'models_to_test' found in config_session.yaml")

default_model = models[0]
display(Markdown(f"### 🧠 Available models: {', '.join(models)}"))
display(Markdown(f"**Default model:** `{default_model}`"))

# 3️⃣  Interactive selector + run button
selector = W.Dropdown(
    options=models,
    value=default_model,
    description="Choose model:",
    style={"description_width": "120px"},
    layout=W.Layout(width="60%"),
)
run_button = W.Button(
    description="Run TAM→SAM→SOM Flow",
    button_style="success",
    icon="play"
)
output = W.Output()

# 4️⃣  Define what happens when the button is clicked
def on_run_clicked(_):
    model = selector.value
    output.clear_output()
    with output:
        display(Markdown(f"### 🚀 Running flow using `{model}`"))
        run_flow(ctx, prompts, [model])
        display(Markdown("✅ **Flow complete! Check `/outputs` for results.**"))

run_button.on_click(on_run_clicked)

# 5️⃣  Display the controls
display(W.VBox([selector, run_button, output]))

# ✅  Once finished, you can continue to the eval-building cells below.


🔐 Using securely loaded OpenAI API key.


### 🧠 Available models: gpt-3.5-turbo, gpt-4o-mini, gpt-4o, gpt-4.1

**Default model:** `gpt-3.5-turbo`

VBox(children=(Dropdown(description='Choose model:', layout=Layout(width='60%'), options=('gpt-3.5-turbo', 'gp…

In [96]:
# 🧩 Load and Launch the Prompt Runner
# ----------------------------------------------------------
# This cell ensures that our custom Python modules are visible
# to Colab (by adding the working directory to sys.path),
# reloads any recent code edits, and launches the interactive
# prompt runner UI for TheProdBot TAM→SAM→SOM flow.
#
# ✅ What this does:
#   • Makes sure Colab can see files in our Drive folder
#   • Reloads prompt_runner so recent edits appear immediately
#   • Starts the model selector UI (dropdown + Run button)
#
# ⚠️ CAUTION
#   • This cell/step will call your API
#   • You will incur charges via your API vendor
#   • I/We are NOT responsible for the charges you incur
# ----------------------------------------------------------

import sys, importlib

# 1️⃣  Add the notebook folder to the Python path so modules load correctly
project_dir = '/content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook'
if project_dir not in sys.path:  # avoid duplicate entries when the cell is rerun
    sys.path.append(project_dir)
importlib.invalidate_caches()

# 2️⃣  Import the runner, then reload it. invalidate_caches() alone does not
#     refresh a module that was already imported earlier in the session.
import prompt_runner
importlib.reload(prompt_runner)

# 3️⃣  Launch the interactive runner UI
prompt_runner.launch_runner()  # renders dropdown + Run button

# ✅  Choose your model (or run all), then watch the flow execute in real time.


🔐 Using securely loaded OpenAI API key.


### 🧠 Available models: gpt-3.5-turbo, gpt-4o-mini, gpt-4o, gpt-4.1

**Default model:** `gpt-3.5-turbo`

VBox(children=(Dropdown(description='Choose model:', layout=Layout(width='60%'), options=('gpt-3.5-turbo', 'gp…

In [97]:
# 🧭 Run Prompts + Model Bakeoff (Live Streaming Logs)
# ----------------------------------------------------------
# This cell executes the two main Python scripts: one for a
# single-model TAM→SAM→SOM flow, and one for a multi-model bakeoff.
# It also captures and streams the logs live, so you can follow
# progress in real time (and review them later in /outputs).
#
# ✅ What it does:
#   • Ensures you’re in the correct working directory
#   • Confirms your OpenAI key is still loaded
#   • Relies on os.environ so subprocesses inherit the key without echoing it
#   • Runs both scripts with unbuffered logging (-u) for live output
#   • Saves full logs to /outputs/_run_prompts.log and /outputs/_bakeoff.log
#
# ⚠️ CAUTION
#   • This cell/step will call your API
#   • You will incur charges via your API vendor
#   • I/We are NOT responsible for the charges you incur
# ----------------------------------------------------------

# 1️⃣  Confirm working directory
%cd /content/drive/MyDrive/Colab\ Notebooks/TAM-SAM-SOM.Notebook/

# 2️⃣  Verify or capture API key (prompt only if missing)
import os, getpass
if "OPENAI_API_KEY" in os.environ and os.environ["OPENAI_API_KEY"].strip():
    print("🔐 OPENAI_API_KEY already loaded.")
else:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Paste your OpenAI API key (hidden): ").strip()
    print("✅ API key captured for this session.")

# 3️⃣  No explicit export is needed: commands started with `!` inherit os.environ.
#     Do not use `%env OPENAI_API_KEY=...` here, because that magic echoes the
#     key into the cell output, where it is saved with the notebook.

# 4️⃣  Execute both runs with streaming output and log capture
print("🚀 Running single-model flow...")
!python -u run_prompts.py | tee outputs/_run_prompts.log

print("\n🤖 Running multi-model bakeoff...")
!python -u run_models_bakeoff.py | tee outputs/_bakeoff.log

# 5️⃣  Wrap-up message
print("\n✅ All runs complete! Logs saved to:")
print("   • outputs/_run_prompts.log")
print("   • outputs/_bakeoff.log")
print("📂 You can open them directly in Colab or download for later review.")


/content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook
🔐 OPENAI_API_KEY already loaded.
🚀 Running single-model flow...

🤖 Running multi-model bakeoff...

=== Running bake-off for models: gpt-3.5-turbo, gpt-4o-mini, gpt-4o, gpt-4.1 ===

▶️  [gpt-3.5-turbo] T5_tam ...
▶️  [gpt-3.5-turbo] T6_sam ...
▶️  [gpt-3.5-turbo] T7_som ...
✅  [gpt-3.5-turbo] complete → outputs/gpt-3.5-turbo
📄 Wrote summary: outputs/summary_20251101_175123.json
▶️  [gpt-4o-mini] T5_tam ...
▶️  [gpt-4o-mini] T6_sam ...
▶️  [gpt-4o-mini] T7_som ...
✅  [gpt-4o-mini] complete → outputs/gpt-4o-mini
📄 Wrote summary: outputs/summary_20251101_175201.json
▶️  [gpt-4o] T5_tam ...
▶️  [gpt-4o] T6_sam ...
▶️  [gpt-4o] T7_som ...
✅  [gpt-4o] complete → outputs/gpt-4o
📄 Wrote su
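
The log above suggests a nested loop: every model is run through every turn, and each reply is saved under `outputs/<model>/`. The sketch below illustrates that shape only. It is not the code in `run_models_bakeoff.py`; the names `run_bakeoff`, `call_model`, and `fake_model` are hypothetical, and a stub stands in for the real API so the example runs offline and free of charge.

```python
import tempfile
from pathlib import Path


def run_bakeoff(models, turns, call_model, outputs_root="outputs"):
    """Run every turn against every model and save one text file per turn.

    `call_model(model, prompt) -> str` is a placeholder for a real API call.
    """
    written = []
    for model in models:
        model_dir = Path(outputs_root) / model
        model_dir.mkdir(parents=True, exist_ok=True)
        for turn_id, prompt in turns.items():
            reply = call_model(model, prompt)
            path = model_dir / f"{turn_id}.txt"
            path.write_text(reply, encoding="utf-8")
            written.append(path)
    return written


def fake_model(model, prompt):
    # Stand-in so the sketch runs without network access or API charges.
    return f"[{model}] would answer: {prompt}"


if __name__ == "__main__":
    demo_turns = {"T5_tam": "Estimate TAM", "T6_sam": "Estimate SAM", "T7_som": "Estimate SOM"}
    with tempfile.TemporaryDirectory() as tmp:
        files = run_bakeoff(["model-a", "model-b"], demo_turns, fake_model, outputs_root=tmp)
        print(len(files), "trace files written")
```

Passing the model call in as a function keeps the loop testable: the same code can be exercised with a stub first and pointed at a paid API only once the file layout looks right.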

In [98]:
# 🧾 Quick Model Scoreboard Summary
# ----------------------------------------------------------
# This cell provides a quick overview of how many trace files
# (T5–T7 = TAM, SAM, SOM) were successfully generated per model.
#
# ✅ What it does:
#   • Scans /outputs for trace text files
#   • Counts how many runs each model completed
#   • Displays a simple markdown table for quick comparison
#
# Use this to confirm that all models ran correctly before
# diving into deeper evals or human labeling.
# ----------------------------------------------------------

import json, glob, os
from collections import defaultdict
from IPython.display import Markdown, display

# 1️⃣  Collect trace counts per model
data = defaultdict(int)
for f in sorted(glob.glob("outputs/*/T[5-7]_*.txt")):
    parts = f.split("/")
    if len(parts) < 3:
        continue
    model = parts[1]
    data[model] += 1

# 2️⃣  Display results
if not data:
    display(Markdown("⚠️ **No eval output files found in `/outputs`.**"))
else:
    display(Markdown("### 🧮 Model Run Summary\nEach model’s completed trace files (T5–T7 = TAM, SAM, SOM):\n"))
    print("| Model | Files Found | Approx. Score (out of 24) |")
    print("|-------|--------------|---------------------------|")

    # NOTE: the denominator 24 is hard-coded. The glob above only matches
    # T5–T7 files, so a single run typically yields 3 files per model.
    for model, count in sorted(data.items(), key=lambda x: -x[1]):
        score = f"{count:>2d} / 24"
        print(f"| {model} | {count:>2} | {score} |")

    print("\n✅ Summary complete. Check `/outputs/<model_name>/` for detailed logs and responses.")

# 🧠 Tip:
# If you see missing traces, rerun that model using the prompt runner
# or bakeoff cell to regenerate the incomplete outputs.




### 🧮 Model Run Summary
Each model’s completed trace files (T5–T7 = TAM, SAM, SOM):


| Model | Files Found | Approx. Score (out of 24) |
|-------|--------------|---------------------------|
| gpt-3.5-turbo |  3 |  3 / 24 |
| gpt-4.1 |  3 |  3 / 24 |
| gpt-4o-mini |  3 |  3 / 24 |
| gpt-4o |  3 |  3 / 24 |

✅ Summary complete. Check `/outputs/<model_name>/` for detailed logs and responses.
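
A raw file count does not say which turn is missing when a model falls short. As a complementary sketch, the helper below lists absent trace files per model. It assumes the `outputs/<model>/<turn>.txt` layout and the three turn names seen in the bake-off log; the function name `missing_turns` is hypothetical.

```python
from pathlib import Path

# Turn names as they appear in the bake-off log.
EXPECTED_TURNS = ("T5_tam", "T6_sam", "T7_som")


def missing_turns(outputs_root, models, expected=EXPECTED_TURNS):
    """Map each model to the expected trace files that are absent."""
    report = {}
    for model in models:
        model_dir = Path(outputs_root) / model
        report[model] = [turn for turn in expected
                         if not (model_dir / f"{turn}.txt").is_file()]
    return report


if __name__ == "__main__":
    models = ["gpt-3.5-turbo", "gpt-4o-mini", "gpt-4o", "gpt-4.1"]
    for model, gaps in missing_turns("outputs", models).items():
        print(model, "complete" if not gaps else f"missing: {', '.join(gaps)}")
```

An empty list for a model means all three market sizing turns were saved, so only the models with gaps need a rerun.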


In [99]:
# 🧩 Build → Trace → Export Pipeline
# ----------------------------------------------------------
# This cell runs the three scripts that turn your model runs
# into structured evaluation data. It’s the “glue” between
# model output and human review.
#
# ✅ What it does:
#   • Builds the synthetic eval dataset (auto-generated scores)
#   • Extracts structured trace records for each conversation turn
#   • Exports a clean, human-readable CSV version for review
#
# All three steps stream live logs into the Colab output window,
# while also saving copies in /outputs for reference.
# ----------------------------------------------------------

import os

# 1️⃣  Confirm the API key is set. Commands started with `!` inherit os.environ,
#     so no `%env` export is needed (that magic would echo the key into the output).
if not os.environ.get("OPENAI_API_KEY", "").strip():
    raise EnvironmentError("❌ Missing OPENAI_API_KEY. Run the key loader cell first.")

# 2️⃣  Run each step sequentially with streaming logs
print("🏗️  Building eval dataset...")
!python -u build_evals_dataset.py | tee outputs/_build_evals.log

print("\n🧠 Generating trace records...")
!python -u build_traces.py | tee outputs/_build_traces.log

print("\n📤 Exporting human-readable traces to CSV...")
!python -u export_traces_csv.py | tee outputs/_export_csv.log

# 3️⃣  Wrap-up message
print("\n✅ All steps complete! Your outputs include:")
print("   • synthetic_evals.csv: auto-scored dataset")
print("   • traces_export.csv:   human-readable trace records")
print("   • logs (_build_*.log): process logs for each stage")
print("📂 Check the `/outputs` folder for all generated files.")



🏗️  Building eval dataset...

=== SUMMARY (avg score by model, T5–T7 emphasized) ===
model
gpt-4.1          6.0
gpt-4o           5.6
gpt-4o-mini      5.6
gpt-3.5-turbo    3.8
Name: score_total, dtype: float64

Turns missing reasoning:
        model                 turn                                       raw_path
gpt-3.5-turbo      T1_product_refs      outputs/gpt-3.5-turbo/T1_product_refs.txt
gpt-3.5-turbo T2_problem_synthesis outputs/gpt-3.5-turbo/T2_problem_synthesis.txt
gpt-3.5-turbo   T3_global_pop_econ   outputs/gpt-3.5-turbo/T3_global_pop_econ.txt
gpt-3.5-turbo  T4_regions_clusters  outputs/gpt-3.5-turbo/T4_regions_clusters.txt
gpt-3.5-turbo               T5_tam               outputs/gpt-3.5-turbo/T5_tam.txt
gpt-3.5-turbo               T6_sam               outputs/gpt-3.5-turbo/T6_sam.txt
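
The summary above reads like a group-by average over per-turn scores. The exact scoring logic lives in `build_evals_dataset.py` and is not shown here, so the following is only a minimal standard-library sketch of that aggregation. The row schema (`model` and `score_total` keys) and the function name are assumptions, and the sample scores are made up.

```python
from collections import defaultdict
from statistics import mean


def average_score_by_model(rows):
    """rows: iterable of dicts with 'model' and 'score_total' keys (assumed schema)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["model"]].append(float(row["score_total"]))
    # Highest average first, mirroring the ordering of the summary above.
    return sorted(((model, mean(scores)) for model, scores in buckets.items()),
                  key=lambda pair: pair[1], reverse=True)


# Made-up rows, purely for illustration.
sample_rows = [
    {"model": "model-a", "score_total": 4},
    {"model": "model-a", "score_total": 6},
    {"model": "model-b", "score_total": 3},
]
for model, avg in average_score_by_model(sample_rows):
    print(f"{model:<10} {avg:.1f}")
```

An average hides variation between turns, so it is worth reading alongside the per-turn list of missing reasoning rather than instead of it.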

In [100]:
# 💾 Mount Google Drive & Verify Output Artifacts
# ----------------------------------------------------------
# This step reconnects your Google Drive (if needed) and jumps
# straight to the /outputs folder where all eval artifacts live.
#
# ✅ What it does:
#   • Mounts Google Drive (skips remount if already active)
#   • Switches directory to /outputs for quick inspection
#   • Lists the first 50 files so you can confirm what’s been created
#
# You should see folders for each model (e.g., gpt-3.5-turbo),
# plus CSVs, JSONLs, and log files from the previous steps.
# ----------------------------------------------------------

from google.colab import drive
import os

# 1️⃣  Mount Google Drive (safe to rerun anytime)
drive.mount('/content/drive', force_remount=False)

# 2️⃣  Navigate to the /outputs directory
os.chdir("/content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook/outputs")
print(f"📁 Current working directory: {os.getcwd()}")

# 3️⃣  List the first 50 files to verify successful runs
!ls -la | sed -n '1,50p'

# ✅  You’re looking for: model folders, traces_export.csv, and build logs.
#     If they’re missing, re-run the Build → Trace → Export pipeline above.


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
📁 Current working directory: /content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook/outputs
total 270
-rw------- 1 root root  1669 Nov  1 17:52 _bakeoff.log
-rw------- 1 root root   158 Nov  1 17:52 bakeoff_summary.md
-rw------- 1 root root  1375 Nov  1 17:53 _build_evals.log
-rw------- 1 root root     0 Nov  1 17:53 _build_traces.log
-rw------- 1 root root    53 Nov  1 17:53 _export_csv.log
drwx------ 2 root root  4096 Nov  1 17:51 gpt-3.5-turbo
drwx------ 2 root root  4096 Nov  1 17:52 gpt-4.1
drwx------ 2 root root  4096 Nov  1 17:52 gpt-4o
drwx------ 2 root root  4096 Nov  1 17:52 gpt-4o-mini
-rw------- 1 root root     0 Nov  1 17:51 _run_prompts.log
-rw------- 1 root root 23184 Nov  1 15:13 session_20251101_151354.json
-rw------- 1 root root 20981 Nov  1 15:46 session_20251101_154604.json
-rw------- 1 root root   875 Nov  1 15:34 summary_20251101

In [101]:
# 🧠 Launch Interactive Human-in-the-Loop Evals Labeler
# ----------------------------------------------------------
# This is the final step of TheProdBot Evals Demo.
#
# Here, we mount Drive (if needed), load the generated traces,
# and open the interactive labeler UI, where *you*, the PM,
# review model reasoning quality turn-by-turn.
#
# ✅ What it does:
#   • Confirms Drive and directory access
#   • Loads all trace data into a pandas DataFrame
#   • Summarizes what was found (count + models)
#   • Opens an interactive labeler view inside the notebook
#
# Use the labeler to:
#   - Mark responses as good / weak / fail
#   - Flag unclear reasoning, bad citations, or weak math
#   - Save feedback automatically to outputs/human_labels.jsonl
# ----------------------------------------------------------

# 1️⃣  Mount Drive and navigate to project root
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=False)
os.chdir("/content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook")

# 2️⃣  Import and reload labeler components
from importlib import reload
import eval_labeler
reload(eval_labeler)  # reload first, so the name imported below comes from the latest code
from eval_labeler import launch_trace_labeler
from build_traces import build_trace_df

# 3️⃣  Verify that /outputs exists
if not os.path.exists("outputs"):
    raise FileNotFoundError("❌ No /outputs directory found. Run the Build → Trace → Export pipeline first.")

# 4️⃣  Load the traces into a DataFrame
df = build_trace_df(outputs_root="outputs", prompts_path="prompts_pm.json")

# 5️⃣  Display dataset summary
print(f"📊 Loaded {len(df)} trace records for review.")
print(f"🧩 Models found: {sorted(df['model'].unique()) if 'model' in df.columns else '-'}")

# 6️⃣  Launch the interactive labeler
layout = "stacked"  # or "side-by-side"
print(f"🧠 Launching labeler in '{layout}' layout mode...")
launch_trace_labeler(df, labels_path="outputs/human_labels.jsonl", layout_mode=layout)

# ✅  Review, label, and discuss! Each save updates your labels in JSONL and CSV format under /outputs.


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
📊 Loaded 33 trace records for review.
🧩 Models found: ['gpt-3.5-turbo', 'gpt-4.1', 'gpt-4o', 'gpt-4o-mini']
🧠 Launching labeler in 'stacked' layout mode...


HTML(value='<b>Loaded 33 traces</b>')

HBox(children=(HTML(value=''), HTML(value='')))

VBox(children=(HTML(value="<h4 style='margin:6px 0'>Prompt (asked)</h4>"), Textarea(value='', disabled=True, l…

HBox(children=(VBox(children=(Checkbox(value=False, description='Reasoning unclear'), Checkbox(value=False, de…

Textarea(value='', layout=Layout(height='120px', width='100%'), placeholder="Why it's weak / what 'good' looks…

HBox(children=(Button(description='◀ Prev', style=ButtonStyle()), Button(button_style='success', description='…

HTML(value='')
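
The labeler saves feedback to `outputs/human_labels.jsonl`. JSONL stores one JSON object per line, which makes appending a new label cheap and safe. The sketch below shows that append-and-read pattern only. The record fields (`model`, `turn`, `rating`, `flags`, `note`) and both function names are assumptions for illustration, not the schema used by `eval_labeler.py`.

```python
import json
from pathlib import Path

ALLOWED_RATINGS = {"good", "weak", "fail"}


def append_label(labels_path, model, turn, rating, flags=(), note=""):
    """Append one human label as a single JSON line. Hypothetical schema."""
    if rating not in ALLOWED_RATINGS:
        raise ValueError(f"rating must be one of {sorted(ALLOWED_RATINGS)}")
    record = {"model": model, "turn": turn, "rating": rating,
              "flags": list(flags), "note": note}
    path = Path(labels_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def load_labels(labels_path):
    """Read every label back, skipping blank lines."""
    path = Path(labels_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each save only appends a line, earlier labels are never rewritten, and a partially labeled file can still be loaded and analyzed.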

# Task
Create a GETTING_STARTED.md file with detailed instructions for product managers on how to set up and run the notebook, including cloning the repository, opening the notebook in Colab, and running each cell sequentially, explaining the purpose of each step. Save the file to "/content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook/GETTING_STARTED.md".

## Draft getting started content

### Subtask:
Generate the content for the GETTING_STARTED.md file, providing painfully detailed, step-by-step instructions for product managers on how to clone the repository, open the notebook, and run each section, explaining the purpose of each step in non-technical terms.


**Reasoning**:
Generate the detailed Markdown content for the GETTING_STARTED.md file as requested, covering all the required sections and explanations in a clear, non-technical manner.



In [111]:
getting_started_content = """
# Getting Started with TheProdBot TAM/SAM/SOM Evals Notebook

This guide provides step-by-step instructions for Product Managers on how to set up and run TheProdBot TAM/SAM/SOM Evals notebook in Google Colab. This notebook helps you understand how different AI models estimate market size (TAM, SAM, SOM) for new product ideas and provides a framework for evaluating their performance.

Follow these steps sequentially to get started:

## Step 1: Clone the GitHub Repository to Google Drive

First, you need to get a copy of the notebook files and associated scripts onto your Google Drive, which Google Colab can access.

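1.  Open Google Colab with any notebook and mount your Google Drive (run `from google.colab import drive` and then `drive.mount('/content/drive')` in a code cell) so that the `/content/drive/MyDrive/` path exists.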
2.  Open the Terminal in Colab. You can do this by going to the menu at the top: `Tools` -> `Terminal` -> `New terminal`.
3.  In the terminal window that appears at the bottom, navigate to your Google Drive directory. This is usually located at `/content/drive/MyDrive/`. You can use the `cd` command:
    ```bash
    cd "/content/drive/MyDrive/Colab Notebooks"
    ```
    *(Note: If the "Colab Notebooks" folder doesn't exist, you might need to create it first using `mkdir "Colab Notebooks"`)*
4.  Now, clone the GitHub repository containing the notebook files. Replace `[REPOSITORY_URL]` with the actual URL of the GitHub repository.
    ```bash
    git clone [REPOSITORY_URL] TAM-SAM-SOM.Notebook
    ```
    *(Note: You'll need to find the correct GitHub repository URL for TheProdBot Evals. This command will create a new folder named `TAM-SAM-SOM.Notebook` in your Google Drive.)*

## Step 2: Open the Notebook in Google Colab

Now that the files are on your Google Drive, you can open the main notebook file in Colab.

1.  In the Colab file browser (the folder icon on the left sidebar), navigate to the folder you just cloned: `drive` -> `MyDrive` -> `Colab Notebooks` -> `TAM-SAM-SOM.Notebook`.
2.  Click on the notebook file. It should be named something like `TheProdBot_Evals_Demo.ipynb`. This will open the notebook in your Colab environment.

## Step 3: Run the Notebook Cells Sequentially

Go through the notebook and run each code cell one by one. You can run a cell by clicking the "play" button to the left of the cell or by selecting the cell and pressing `Shift + Enter`.

Here's a breakdown of each important cell and what it does:

### Cell 1: Environment Setup & Drive Mount

*   **Purpose:** This cell sets up the necessary environment. It installs the required software libraries (like `openai` and `pyyaml`) and connects the notebook to your Google Drive so it can read and write files. It also navigates to the project directory on your Drive.
*   **What to expect:** You will see output indicating that packages are being installed and that your Google Drive is mounted. It will then list some files in the project directory to confirm you are in the right place.

### Cell 2: Notebook Health Check

*   **Purpose:** This cell runs a quick check to make sure everything is set up correctly before you start running the AI models. It verifies that Google Drive is connected, essential files are present, and your configuration is valid.
*   **What to expect:** You should see a series of "✅" symbols indicating successful checks. If you see any "❌" or "⚠️", stop and address the issue before continuing. The output will also suggest the next steps you can take.

### Cell 3: Securely Capture or Confirm Your OpenAI API Key

*   **Purpose:** The AI models used in this notebook require an API key from OpenAI (or a similar provider) to function. This cell makes sure your key is available to the notebook securely. It will prompt you to enter your key if it's not already set up in your Colab environment. **Your key is never displayed or stored in the notebook file itself.**
*   **What to expect:** If your key is already set, it will confirm this. If not, a box will appear asking you to paste your API key. Paste it carefully (it will be hidden) and press Enter.

    **⚠️ API Cost Warning:** Running models using this notebook will incur charges based on your usage with the API provider (e.g., OpenAI). Be mindful of your API key and usage.

### Cell 4: Interactive Model Selector & Runner

*   **Purpose:** This cell provides a way to run the market sizing flow with a single selected AI model. You can choose a model from a dropdown menu and then click a button to start the process.
*   **What to expect:** You will see a dropdown list of available models and a "Run TAM→SAM→SOM Flow" button. Selecting a model and clicking the button will start the AI generating market size estimates based on the prompts. The output will show the progress and confirm completion.

    **⚠️ API Cost Warning:** Running the flow in this cell will call the AI API and incur charges.

### Cell 5: Load and Launch the Prompt Runner

*   **Purpose:** This cell is similar to the previous one but focuses on ensuring the custom Python scripts used for running the prompts are correctly loaded and available. It then launches the same interactive model selector interface.
*   **What to expect:** You will see output confirming the API key is loaded and then the same interactive model selector (dropdown and button) as in the previous cell.

    **⚠️ API Cost Warning:** Running the flow via this interface will call the AI API and incur charges.

### Cell 6: Run Prompts + Model Bakeoff (Live Streaming Logs)

*   **Purpose:** This is a crucial step where you run the full set of prompts against one or multiple AI models (a "bakeoff"). This generates the raw output that you will later analyze and evaluate. It also streams the process logs live so you can see what's happening.
*   **What to expect:** The cell will confirm the API key is available and then start running the scripts. You will see detailed output showing which model is being run and the progress for each prompt (T5_tam, T6_sam, T7_som, etc.). This process can take some time depending on the number of models. Finally, it will report where each model's summary file and the run logs are saved. (The auto-scored comparison of models comes later, in Cell 8.)

    **⚠️ API Cost Warning:** This cell will make multiple calls to the AI API for each model and will incur charges.

### Cell 7: Quick Model Scoreboard Summary

*   **Purpose:** This cell provides a quick summary of how many of the core market sizing prompts (TAM, SAM, SOM - T5-T7) were successfully completed by each model during the previous run.
*   **What to expect:** You will see a small table showing each model that was run and the number of completed trace files found for it. This is a good way to quickly verify that the previous step ran successfully for all models.

### Cell 8: Build → Trace → Export Pipeline

*   **Purpose:** This cell processes the raw output generated by the AI models into structured data that is easier to analyze and evaluate. It runs scripts to build a synthetic evaluation dataset (auto-scoring), extract detailed records for each conversation turn (traces), and export a human-readable version of these traces into a CSV file.
*   **What to expect:** You will see output indicating that each step (building evals, generating traces, exporting CSV) is running. It will confirm that the process is complete and list the names of the generated files (like `synthetic_evals.csv` and `traces_export.csv`) and where to find them in the `/outputs` folder.

### Cell 9: Mount Google Drive & Verify Output Artifacts

*   **Purpose:** This cell ensures your Google Drive is still connected and then changes the current working directory to the `/outputs` folder. This makes it easy to see and access all the files generated in the previous steps.
*   **What to expect:** It will confirm that Google Drive is mounted and show the current directory as `/content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook/outputs`. It will then list the files in this directory. You should see folders for each model, the CSV and JSONL files from the previous step, and log files.

### Cell 10: Launch Interactive Human-in-the-Loop Evals Labeler

*   **Purpose:** This is where you, the Product Manager, provide valuable human feedback on the AI's performance. This cell loads the structured trace data and launches an interactive tool within the notebook that allows you to review each AI response turn by turn, assess its quality, and provide specific feedback (e.g., Is the reasoning clear? Is the math correct?). Your feedback is saved automatically.
*   **What to expect:** The output will confirm that the trace data is loaded and then display an interactive user interface. This interface will show the prompt given to the AI, the AI's response (with reasoning and any structured data), and options for you to label the response's quality and provide comments. Use the navigation buttons to move through the different responses.

## Analyzing Results and Providing Feedback

After running the cells, you can:

*   **Review Logs:** Examine the `.log` files in the `/outputs` folder to see the detailed process logs for each step.
*   **Inspect Data:** Open the `synthetic_evals.csv` and `traces_export.csv` files in the `/outputs` folder to view the structured evaluation data and trace records.
*   **Use the Labeler:** Spend time in the interactive labeler (Cell 10 output) to provide detailed, human feedback on the AI's responses. This feedback is crucial for improving the AI's performance in future iterations. Your labels are saved to `outputs/human_labels.jsonl`.

By following these steps, you will successfully run the TAM/SAM/SOM evaluation process and generate valuable data and feedback for improving TheProdBot.
"""

**Reasoning**:
Save the generated Markdown content to the specified file path in Google Drive.



In [112]:
import os

# Define the file path
file_path = "/content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook/GETTING_STARTED.md"

# Ensure the directory exists (optional, but good practice)
# os.makedirs(os.path.dirname(file_path), exist_ok=True)

# Write the content to the file (explicit UTF-8 so the emoji and arrows in the guide are preserved)
with open(file_path, "w", encoding="utf-8") as f:
    f.write(getting_started_content)

print(f"Successfully created GETTING_STARTED.md at: {file_path}")

Successfully created GETTING_STARTED.md at: /content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook/GETTING_STARTED.md
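
The write step above assumes the target folder already exists, since the `os.makedirs` call is commented out. A minimal sketch of a more defensive variant follows. The helper name `write_markdown` is hypothetical and not part of the notebook. It creates the parent folder and writes UTF-8 explicitly, which matters because the guide contains non-ASCII characters such as ⚠️ and →. The demo uses a temporary directory rather than Google Drive:

```python
import tempfile
from pathlib import Path


def write_markdown(path, content):
    """Hypothetical helper: write text as UTF-8, creating parent folders first."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return target


# Demo on a throwaway directory so nothing on Drive is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "nested" / "GETTING_STARTED.md"
    sample = "# Getting Started\n\n**⚠️ API Cost Warning:** Build → Trace → Export\n"
    written = write_markdown(demo_path, sample)
    # Read the file back to confirm the non-ASCII characters survived.
    round_trip = written.read_text(encoding="utf-8")
    print(f"Wrote {len(round_trip)} characters to {written.name}")
```

In Colab, the same helper could be called with the Drive path used in the cell above, provided Drive is already mounted.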


## Confirm file creation

### Subtask:
Verify that the `GETTING_STARTED.md` file was successfully created and is accessible.


**Reasoning**:
Change the current directory to the specified path in Google Drive to verify the file creation.



In [113]:
import os
os.chdir("/content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook")
print(f"Current working directory: {os.getcwd()}")

Current working directory: /content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook


**Reasoning**:
List the files in the current directory to verify if GETTING_STARTED.md is present.



In [114]:
!ls

build_evals_dataset.py		outputs
build_trace.py			prompt_runner.py
build_traces.py			prompts_pm.json
config_session.yaml		__pycache__
eval_labeler.py			README.md
export_traces_csv.py		run_models_bakeoff.py
GETTING_STARTED.md		run_prompts.py
notebook_setup_health_check.py	TheProdBot_Evals_Demo.ipynb
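
Rather than scanning the `!ls` listing by eye, the same check can be scripted. This is a minimal sketch, assuming a hypothetical helper `missing_files` that is not part of the notebook. The file names come from the listing above. The demo builds a temporary folder so it runs anywhere:

```python
import tempfile
from pathlib import Path


def missing_files(folder, expected):
    """Hypothetical helper: return the expected names that are absent or empty."""
    folder = Path(folder)
    missing = []
    for name in expected:
        candidate = folder / name
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(name)
    return missing


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a folder with one good file, one empty file, and one file absent.
    (Path(tmp) / "GETTING_STARTED.md").write_text("# Getting Started\n", encoding="utf-8")
    (Path(tmp) / "README.md").write_text("", encoding="utf-8")  # empty on purpose
    result = missing_files(tmp, ["GETTING_STARTED.md", "README.md", "prompts_pm.json"])
    print("Missing or empty:", result)
```

In Colab, you would pass the notebook folder on Drive as `folder`. An empty list means every expected file is present and non-empty.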


## Summary:

### Data Analysis Key Findings

*   A detailed `GETTING_STARTED.md` file was successfully generated, providing step-by-step instructions for product managers to set up and run the notebook.
*   The generated content includes explanations for cloning the repository, opening the notebook in Colab, running each cell sequentially with purpose descriptions, and using the interactive evaluation tool.
*   The `GETTING_STARTED.md` file was successfully created and saved to the specified Google Drive location: `/content/drive/MyDrive/Colab Notebooks/TAM-SAM-SOM.Notebook/GETTING_STARTED.md`.
*   Verification confirmed the file's existence and accessibility in the designated directory.

### Insights or Next Steps

*   Share the `GETTING_STARTED.md` file with the target audience (product managers) and gather feedback on its clarity and completeness.
*   Ensure the GitHub repository URL mentioned in the guide is accurate and accessible.
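
To support the URL check in the second next step, the links in the guide can be pulled out for manual review. This is a minimal sketch, assuming a hypothetical helper `extract_urls` and a simple regular expression that will not cover every URL form. The sample text and its URLs are placeholders, not the project's real repository:

```python
import re

# Matches http(s) URLs up to whitespace or common closing punctuation.
URL_PATTERN = re.compile(r"https?://[^\s)>\]\"']+")


def extract_urls(markdown_text):
    """Hypothetical helper: return unique URLs in order of first appearance."""
    seen = []
    for url in URL_PATTERN.findall(markdown_text):
        url = url.rstrip(".,;:")  # drop sentence punctuation stuck to the URL
        if url not in seen:
            seen.append(url)
    return seen


# Placeholder text standing in for the contents of GETTING_STARTED.md.
sample = (
    "Clone the repo from https://github.com/example-org/example-repo.\n"
    "Then open [Colab](https://colab.research.google.com/) and run the cells.\n"
    "See https://github.com/example-org/example-repo for details.\n"
)
urls = extract_urls(sample)
for url in urls:
    print(url)
```

In practice, `sample` would be replaced with the text read from `GETTING_STARTED.md`, and each printed URL opened by hand to confirm it resolves.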
