#**AI CONSULTING CHAPTER 4: INNOVATORS**

---

##0.REFERENCE

https://claude.ai/share/966ce083-0740-4c4f-80bd-ab640aa81204

##1.CONTEXT

**Introduction: Why Level 4 Matters for Strategic Professionals**

Welcome to Chapter Four of our journey through AI-assisted consulting. If you've worked through the previous chapters, you've learned how to use AI safely for one-off tasks, how to build reliable workflows with structured outputs, and how to create sophisticated multi-step analyses. Those capabilities are valuable, but they share a common limitation - they're disposable. You create something, use it once, and move on.

Level 4 represents a fundamental shift in thinking. We're no longer creating individual outputs. We're building reusable assets that can be deployed repeatedly across your organization. Think of the difference between cooking a single meal and developing a restaurant recipe that dozens of chefs will execute hundreds of times. The standards change completely.

**Understanding the Innovator Mindset**

We call Level 4 practitioners Innovators because they're not just using AI - they're creating infrastructure for others to use AI safely and effectively. You're building templates, playbooks, and evaluation systems that enable broader adoption while maintaining governance standards. This is the level where AI moves from being a personal productivity tool to becoming organizational capability.

Consider a typical scenario. Your firm has successfully used AI to draft market entry memos for three different client engagements. Each time, a senior consultant carefully crafted prompts, reviewed outputs, and integrated the results into final deliverables. The work was good, but it wasn't scalable. Each engagement required the same expert oversight.

At Level 4, you take that successful pattern and systematize it. You create a Market Entry Memo Shell - a complete asset bundle with standardized prompts, quality checks, and usage guidelines. Now junior team members can generate initial drafts without senior supervision. The asset enforces quality automatically through built-in validation. Experts focus their time on review and refinement rather than basic drafting.

**Why Reuse Increases Risk**

Here's the paradox that makes Level 4 challenging. Reusable assets are more valuable precisely because they get used more often. But broader usage also means bigger consequences when things go wrong. If one person uses a flawed prompt and gets a bad output, that's an isolated incident. If fifty people use the same flawed template over six months, you've systematically produced fifty bad outputs.

This is why Level 4 demands governance infrastructure that earlier levels could skip. You need evaluation harnesses that test assets against synthetic cases before anyone uses them. You need regression baselines that detect when quality degrades over time. You need change management processes that track modifications and require re-approval. You need comprehensive logging so you can audit what happened when questions arise.

Traditional consulting doesn't have good models for this. When senior partners develop methodologies, they spread through apprenticeship and judgment-based adaptation. There's no formal testing, no regression detection, no automated quality control. That approach worked when deployment was slow and expert oversight was constant. It fails when assets get deployed broadly and used independently.

**The Governance-First Philosophy**

Level 4 is not about making AI smarter or more capable. It's about making AI deployable at scale with appropriate controls. Every decision in this notebook reflects that priority. We don't ask Claude to generate more sophisticated analysis. We ask it to produce outputs that can be validated programmatically, that separate facts from assumptions clearly, that flag risks automatically, and that never generate recommendations disguised as neutral analysis.

This governance-first approach frustrates some users initially. Why can't the asset just tell me the best option? Why all these disclaimers and verification requirements? Why must everything be logged and tracked? The answer is accountability. When fifty people use an asset, you need mechanisms to ensure consistent quality and clear audit trails. Informal judgment doesn't scale.

Think about how pharmaceutical companies develop drugs. They don't just find something that works in one patient. They conduct systematic trials, document every step, track adverse events, and maintain rigorous quality control. Not because they don't trust their scientists, but because broader deployment demands higher standards. Level 4 applies similar thinking to AI-assisted work products.

**What This Notebook Teaches**

Over the next ten cells, you'll build a complete Level 4 system from scratch. You'll create four different Asset Bundles spanning common consulting scenarios - market entry analysis, cost transformation planning, investment committee preparation, and organizational design. More importantly, you'll implement the governance infrastructure that makes these assets trustworthy.

You'll learn how to automatically redact sensitive information so assets can be tested safely. You'll build robust JSON parsing that handles the messiness of AI outputs. You'll create evaluation harnesses that test assets against synthetic cases and detect regressions. You'll generate comprehensive audit trails capturing every decision and every risk. You'll package everything into professional deliverables ready for stakeholder review.

The notebook is deliberately designed for management consultants and strategy professionals without deep technical backgrounds. You don't need to understand machine learning, write complex code, or master statistical methods. The focus is on business logic and governance processes. The technical implementation is handled through clear, commented code that you can use as-is.

**Setting Realistic Expectations**

Let me be direct about what this notebook does not do. It does not create production-ready assets that you can immediately deploy across your organization. Everything generated here carries a draft label and requires human review. The evaluation uses only synthetic test cases, not real client scenarios. The quality thresholds are demonstrations, not validated standards.

What the notebook does provide is the complete framework and methodology for creating production assets. You'll understand what components an asset bundle needs, what testing looks like, what approval workflows should include, and what documentation stakeholders require. You can take these patterns and apply them to your specific context with appropriate rigor.

Think of this as learning restaurant operations by running a practice service. You'll go through all the motions - prep, cooking, plating, service - with practice ingredients and volunteer diners. It's not a real restaurant opening, but it teaches you the systems and processes you'll need when you do open for real.

**The Path Forward**

Level 4 is not the final destination. It's an intermediate stage between individual productivity and full organizational deployment. Some of you will stop here, using these techniques to create small libraries of internal assets for your immediate teams. Others will continue to Level 5, implementing sophisticated deployment infrastructure with monitoring, feedback loops, and continuous improvement.

But regardless of where you ultimately land, understanding Level 4 thinking is essential. It teaches you to see AI not as a magic answer machine but as a capability requiring systematic development, rigorous testing, and ongoing governance. That mindset separates sustainable AI adoption from the hype cycles that promise transformation but deliver disappointment.

Let's begin building your first Asset Bundle.

##2.LIBRARIES AND ENVIRONMENT

In [None]:
# Cell 2: Install + Imports + Run Directory

# Install Anthropic SDK
!pip install -q anthropic

# Standard library imports
import json
import os
import re
import hashlib
import uuid
from datetime import datetime, timezone
from pathlib import Path
import textwrap
import random

# Create base run directory with timestamp + short ID
timestamp = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')
short_id = str(uuid.uuid4())[:8]
run_name = f"run_{timestamp}_{short_id}"
base_dir = Path(f"/content/ai_consulting_ch4_runs/{run_name}")

# Create required folders
folders = [
    base_dir,
    base_dir / "deliverables",
    base_dir / "asset_bundles",
    base_dir / "stage_outputs",
    base_dir / "logs"
]

for folder in folders:
    folder.mkdir(parents=True, exist_ok=True)

print("✓ Run directory created:")
print(f"  Base: {base_dir}")
print(f"  Deliverables: {base_dir / 'deliverables'}")
print(f"  Asset Bundles: {base_dir / 'asset_bundles'}")
print(f"  Logs: {base_dir / 'logs'}")
print(f"\n✓ Run ID: {run_name}")

✓ Run directory created:
  Base: /content/ai_consulting_ch4_runs/run_20260120_132001_54ba806b
  Deliverables: /content/ai_consulting_ch4_runs/run_20260120_132001_54ba806b/deliverables
  Asset Bundles: /content/ai_consulting_ch4_runs/run_20260120_132001_54ba806b/asset_bundles
  Logs: /content/ai_consulting_ch4_runs/run_20260120_132001_54ba806b/logs

✓ Run ID: run_20260120_132001_54ba806b


##3.API AND CLIENT INITIALIZATION

###3.1.OVERVIEW

**Cell 3: Connecting to Claude's API**

Welcome everyone. In this cell, we're establishing the connection between your Google Colab notebook and Anthropic's Claude API. Think of this as setting up your phone to make calls - before you can use the service, you need to authenticate yourself.

**What's happening here**

First, we retrieve your API key from Colab's secure storage system. An API key is like a password that proves you have permission to use Claude's services. We store it in Colab Secrets rather than directly in the code for security reasons - this way, if you share your notebook with colleagues, your credentials remain private.

Once we have the key, we create what's called a client object. This is your dedicated communication channel to Claude. Every time you want Claude to analyze something or generate content, you'll send your request through this client.

**The model configuration**

We also lock in three critical parameters that control how Claude behaves. The model name specifies exactly which version of Claude you're using - in this case, Claude Haiku 4.5. The temperature setting of 0.2 keeps Claude's responses focused and consistent, which is essential for professional work. A higher temperature would make responses more creative but less predictable, which isn't what we want when creating reusable business assets.

The max tokens parameter sets a limit on response length. A token is a fragment of text, and as a rough rule of thumb one token corresponds to about three-quarters of an English word, so 4,128 tokens allows responses of roughly three thousand words. This is sufficient for most consulting deliverables while keeping costs manageable.
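To make the token budget concrete, here is a minimal sketch. The 0.75 words-per-token ratio is an assumed rule of thumb for English prose, not an exact figure, and both function names are illustrative rather than part of this notebook. The second helper relies on the Messages API reporting a stop reason of `max_tokens` when a response hits the limit.

```python
def approx_words(max_tokens: int, words_per_token: float = 0.75) -> int:
    """Estimate how many English words fit in a token budget (rule of thumb only)."""
    return round(max_tokens * words_per_token)


def was_truncated(stop_reason: str) -> bool:
    """True when the response stopped because it ran into the max_tokens limit."""
    return stop_reason == "max_tokens"


budget_in_words = approx_words(4128)
print(f"Approximate word budget: {budget_in_words}")
print(f"Truncated? {was_truncated('end_turn')}")
```

In the notebook itself you would pass `message.stop_reason` to a check like this, since a response cut off mid-JSON will fail parsing for reasons that have nothing to do with content quality.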

**Why this matters**

By fixing these parameters at the start, we ensure every asset created in this session behaves consistently. If you run this notebook next month with the same settings, you should get comparable results. This reproducibility is crucial for enterprise use - you can't have your templates generating wildly different outputs depending on when someone runs them.

If the API key fails to load, the cell will stop execution immediately and display clear instructions for adding your key to Colab Secrets. This fail-fast approach prevents you from wasting time running subsequent cells that would inevitably fail.

###3.2.CODE AND IMPLEMENTATION

In [None]:
# Cell 3: API Key + Client Initialization

import anthropic
from google.colab import userdata

# Retrieve API key from Colab Secrets
try:
    ANTHROPIC_API_KEY = userdata.get('ANTHROPIC_API_KEY')
    os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_API_KEY
    client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
    api_key_loaded = True
    print("✓ API key loaded: YES")
except Exception as e:
    api_key_loaded = False
    print("✗ API key loaded: NO")
    print(f"  Error: {e}")
    print("  → Set ANTHROPIC_API_KEY in Colab Secrets (left sidebar: 🔑)")
    raise

# Model configuration (non-negotiable)
MODEL = "claude-haiku-4-5-20251001"
TEMPERATURE = 0.2
MAX_TOKENS = 4128

print(f"\n✓ Model: {MODEL}")
print(f"✓ Temperature: {TEMPERATURE}")
print(f"✓ Max Tokens: {MAX_TOKENS}")

✓ API key loaded: YES

✓ Model: claude-haiku-4-5-20251001
✓ Temperature: 0.2
✓ Max Tokens: 4128


##4.GOVERNANCE ARTIFACTS AND REGISTRY INITIALIZATION

###4.1.OVERVIEW

**Cell 4: Setting Up Your Governance Foundation**

Welcome to what I consider the heart of Level 4 thinking. This cell doesn't generate any business content - instead, it builds the scaffolding that makes reusable assets trustworthy and auditable. Think of this as constructing the foundation before building a house.

**Understanding helper functions**

We start by defining several utility functions that will be used throughout the notebook. These are simple tools that perform routine tasks like generating timestamps, creating secure hashes of text, and reading or writing files. The reason we define them upfront is efficiency - rather than writing the same code repeatedly, we create these reusable building blocks once.

The timestamp function ensures every action is recorded with precise timing in UTC format. The hashing function creates unique fingerprints of text without storing the actual content, which protects confidentiality while maintaining traceability. These might seem like technical details, but they're essential for governance.

**Creating the audit trail**

The cell then creates what we call the run manifest. This is a comprehensive record of your session - which model you used, what settings you applied, when you ran it, and a unique identifier for this specific run. Six months from now, if someone questions an asset created today, you can trace it back to this exact configuration.

**Initializing governance logs**

Next, we create several empty log files that will track different aspects of the session. The risk log captures every potential issue Claude identifies. The verification register tracks claims that need human fact-checking. The change log records every modification to assets. The approvals log manages who needs to sign off before assets go live.

Think of these as your compliance documentation. In regulated industries or large organizations, you need proof that proper procedures were followed. These logs provide that evidence automatically.

**Level 4 specific tracking**

We also create the asset registry, which is your master inventory of everything created in this session, and the regression baseline, which stores quality metrics so future runs can be compared against current performance. If quality degrades over time, you'll know immediately.
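A regression check can be as simple as comparing today's scores with the stored baseline. The sketch below is one way to do it under stated assumptions: the metric names, the 0.05 tolerance, and the function name are all hypothetical, and the notebook's actual baseline schema may differ.

```python
def detect_regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> list:
    """Return a message for each metric that fell more than `tolerance` below baseline."""
    regressions = []
    for metric, base_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            # A metric that disappeared is treated as a regression too.
            regressions.append(f"{metric}: missing from current run")
        elif new_score < base_score - tolerance:
            regressions.append(f"{metric}: {base_score:.2f} -> {new_score:.2f}")
    return regressions


# Hypothetical metrics for illustration only.
baseline = {"schema_pass_rate": 1.00, "policy_pass_rate": 0.95}
current = {"schema_pass_rate": 0.90, "policy_pass_rate": 0.95}

issues = detect_regressions(baseline, current)
for issue in issues:
    print("REGRESSION:", issue)
```

The tolerance matters because small run-to-run variation is normal at a non-zero temperature. Without it, every minor fluctuation would raise an alarm.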

The configuration hash is particularly clever. It creates a unique identifier based on your model settings. If someone later claims they used the same setup but got different results, you can verify whether their configuration actually matches yours.
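Here is a small standalone sketch of that comparison. It mirrors the pipe-separated format used by this notebook's `stable_config_hash`, but the function name `config_hash` and the second colleague's temperature of 0.7 are invented for the example.

```python
import hashlib


def config_hash(model: str, temperature: float, max_tokens: int) -> str:
    """Hash the three model settings into a short, comparable identifier."""
    config_str = f"{model}|{temperature}|{max_tokens}"
    return hashlib.sha256(config_str.encode("utf-8")).hexdigest()[:16]


ours = config_hash("claude-haiku-4-5-20251001", 0.2, 4128)
same = config_hash("claude-haiku-4-5-20251001", 0.2, 4128)
theirs = config_hash("claude-haiku-4-5-20251001", 0.7, 4128)  # hypothetical mismatch

print("Identical settings match:", ours == same)
print("Different temperature matches:", ours == theirs)
```

Keep in mind what the hash does not cover. It fingerprints only the model name, temperature, and token limit, so two runs with matching hashes can still differ in prompts, SDK version, or input data.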

This infrastructure might feel like overhead, but it's what separates professional asset creation from casual experimentation.

###4.2.CODE AND IMPLEMENTATION

In [None]:
# Cell 4: Governance Artifacts + Asset Registry Initialization

# Helper functions
def now_iso():
    """Return current UTC timestamp in ISO format."""
    return datetime.now(timezone.utc).isoformat().replace('+00:00', 'Z')

def sha256_text(text):
    """Return SHA-256 hash of text."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

def write_json(filepath, data):
    """Write JSON to file with pretty formatting."""
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

def read_json(filepath):
    """Read JSON from file."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return json.load(f)

def append_jsonl(filepath, record):
    """Append a JSON record to a JSONL file."""
    with open(filepath, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

def get_env_fingerprint():
    """Get environment fingerprint for reproducibility."""
    import platform
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "timestamp": now_iso()
    }

def stable_config_hash():
    """Generate stable hash of model configuration."""
    config_str = f"{MODEL}|{TEMPERATURE}|{MAX_TOKENS}"
    return sha256_text(config_str)[:16]

# Initialize run_manifest.json
manifest = {
    "chapter": "4",
    "level": "Innovators",
    "purpose": "Reusable internal assets with evaluation and controlled release",
    "model": MODEL,
    "temperature": TEMPERATURE,
    "max_tokens": MAX_TOKENS,
    "run_id": run_name,
    "created_at": now_iso(),
    "config_hash": stable_config_hash(),
    "environment": get_env_fingerprint()
}
write_json(base_dir / "run_manifest.json", manifest)

# Initialize base governance artifacts
write_json(base_dir / "logs" / "risk_log.json", {"risks": []})
write_json(base_dir / "logs" / "verification_register.json", {"entries": []})
write_json(base_dir / "logs" / "change_log.json", {"changes": []})
write_json(base_dir / "logs" / "approvals_log.json", {"approvals": []})
write_json(base_dir / "logs" / "exception_log.json", {"exceptions": []})

# Initialize Level 4 specific artifacts
write_json(base_dir / "asset_registry.json", {"assets": []})
write_json(base_dir / "regression_baseline.json", {})

# Create empty JSONL files
Path(base_dir / "logs" / "prompts_log.jsonl").touch()
Path(base_dir / "logs" / "evaluation_harness_log.jsonl").touch()

print("✓ Governance artifacts initialized:")
print(f"  run_manifest.json")
print(f"  logs/prompts_log.jsonl")
print(f"  logs/risk_log.json")
print(f"  logs/verification_register.json")
print(f"  logs/change_log.json")
print(f"  logs/approvals_log.json")
print(f"  logs/exception_log.json")
print(f"\n✓ Level 4 artifacts initialized:")
print(f"  asset_registry.json")
print(f"  regression_baseline.json")
print(f"  logs/evaluation_harness_log.jsonl")
print(f"\n✓ Config hash: {stable_config_hash()}")

✓ Governance artifacts initialized:
  run_manifest.json
  logs/prompts_log.jsonl
  logs/risk_log.json
  logs/verification_register.json
  logs/change_log.json
  logs/approvals_log.json
  logs/exception_log.json

✓ Level 4 artifacts initialized:
  asset_registry.json
  regression_baseline.json
  logs/evaluation_harness_log.jsonl

✓ Config hash: d2ef82f08f0c34b9


##5.CONFIDENTIALITY AND MINIMUM NECESSARY UTILITIES

###5.1.OVERVIEW

**Cell 5: Protecting Confidential Information**

This cell addresses one of the most critical risks when using AI in professional services - accidentally exposing sensitive client or proprietary information. Every consulting firm has horror stories about confidential data ending up where it shouldn't. This cell builds your first line of defense.

**The redaction mechanism**

We create a function that automatically scans text for patterns that typically indicate sensitive information - email addresses, phone numbers, social security numbers, and similar identifiers. When it finds these patterns, it replaces them with a placeholder tag. This happens automatically before any text gets sent to Claude's API.

Now, this is a simplified demonstration. In a production environment, you would deploy more sophisticated tools - named entity recognition systems that identify people's names, companies, financial figures, or proprietary terminology specific to your organization. But the principle remains the same: sanitize inputs before they leave your control.
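One modest step toward that production standard is to record what kind of pattern was removed and how many times. The sketch below is still regex-based, not named entity recognition, and the labelled placeholders and the name `redact_with_audit` are illustrative assumptions rather than part of this notebook.

```python
import re

# Order matters: the SSN pattern runs before the looser phone pattern.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
}


def redact_with_audit(text: str):
    """Replace each pattern with a labelled placeholder and count the replacements."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[REDACTED_{label.upper()}]", text)
        counts[label] = n
    return text, counts


clean, counts = redact_with_audit(
    "Reach jane@example.com or 555-123-4567. SSN 123-45-6789."
)
print(clean)
print(counts)
```

The counts give you an audit record that says what categories were found without storing the sensitive values themselves.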

**Minimum necessary principle**

The second function implements what privacy professionals call the minimum necessary standard. Just because you have a fifty-page document doesn't mean you should send all fifty pages to the AI. The principle is to pass along only the information the task requires. In this demonstration the function applies it crudely: it redacts the patterns above and truncates any input longer than 2,000 characters, so treat it as a placeholder for real extraction logic.

It also tracks what was removed. This creates an audit trail showing you actively protected confidential information rather than carelessly exposing it. If you ever face questions about data handling, you have documentation proving you followed proper protocols.
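When your inputs are structured, an allow-list is a simple way to apply the same principle. The sketch below assumes a dictionary of engagement details. The field names, values, and function name are all made up for illustration.

```python
def keep_minimum_necessary(record: dict, allowed_fields: set):
    """Keep only allow-listed fields and report the names of everything dropped."""
    kept = {key: value for key, value in record.items() if key in allowed_fields}
    removed = sorted(key for key in record if key not in allowed_fields)
    return kept, removed


# Hypothetical engagement record.
engagement = {
    "market": "Southeast Asia",
    "objective": "Assess entry options",
    "client_name": "Example Holdings",
    "partner_email": "partner@example.com",
}

kept, removed = keep_minimum_necessary(engagement, {"market", "objective"})
print("Sent to the model:", kept)
print("Withheld fields:", removed)
```

Notice that the audit trail records field names only, never the withheld values. An allow-list also fails safe: a new sensitive field added next quarter is excluded by default.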

**Asset guardrails**

Finally, we define standard warnings that will be embedded in every asset created today. These guardrails remind users that outputs are drafts requiring verification, that recommendations are prohibited, and that fabricating facts is unacceptable. Think of these as the safety labels that appear on every piece of equipment.

**Why this matters for reusable assets**

When you create a one-time analysis, you control the entire process and can manually redact sensitive information. But when you create a reusable template that fifty people will use over the next year, you cannot manually review every instance. You need automated protections built into the asset itself.

The demonstration at the end shows you the redaction in action - original text containing contact information gets transformed into a sanitized version suitable for processing. This visible proof helps users trust that the system is actually protecting their data.

###5.2.CODE AND IMPLEMENTATION

In [None]:
# Cell 5: Confidentiality + Minimum-Necessary Utilities

def redact(text, placeholder="[REDACTED]"):
    """
    Redact sensitive patterns from text.

    This is a simple demonstration. In production:
    - Use named entity recognition for PII
    - Apply domain-specific redaction rules
    - Log what was removed for audit
    """
    # Redact email-like patterns
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                  placeholder, text)
    # Redact phone-like patterns
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', placeholder, text)
    # Redact SSN-like patterns
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', placeholder, text)
    return text

def build_minimum_necessary(raw_text):
    """
    Extract minimum necessary information and log removed fields.

    Returns:
        dict with 'sanitized_text' and 'removed_fields'
    """
    sanitized = redact(raw_text)

    # Track what was removed (simplified example)
    removed_fields = []
    if raw_text != sanitized:
        removed_fields.append("PII patterns detected and redacted")

    # Remove excessive detail (example: truncate very long inputs)
    if len(sanitized) > 2000:
        sanitized = sanitized[:2000] + "... [TRUNCATED]"
        removed_fields.append("Input truncated to 2000 chars")

    return {
        "sanitized_text": sanitized,
        "removed_fields": removed_fields
    }

def asset_guardrails():
    """
    Return standard guardrails text for all asset prompt packs.
    """
    return """
CRITICAL GUARDRAILS:
1. All outputs must include verification_status: "Not verified"
2. Never provide recommendations or rankings
3. No fabricated facts, benchmarks, or citations
4. Separate: facts_provided / assumptions / open_questions / draft_output
5. Flag confidentiality risks and missing information
6. This is a DRAFT asset requiring human review before release
"""

# Demo redaction
demo_text = """
Our client contact is john.doe@example.com and can be reached at 555-123-4567.
The project involves analyzing revenue data across 15 markets.
SSN on file: 123-45-6789.
"""

result = build_minimum_necessary(demo_text)

print("✓ Confidentiality utilities loaded")
print("\n--- DEMO: Redaction ---")
print("\nBefore redaction:")
print(demo_text)
print("\nAfter redaction:")
print(result['sanitized_text'])
print("\nRemoved fields:")
for field in result['removed_fields']:
    print(f"  - {field}")
print("\n✓ Asset guardrails ready for prompt packs")

✓ Confidentiality utilities loaded

--- DEMO: Redaction ---

Before redaction:

Our client contact is john.doe@example.com and can be reached at 555-123-4567.
The project involves analyzing revenue data across 15 markets.
SSN on file: 123-45-6789.


After redaction:

Our client contact is [REDACTED] and can be reached at [REDACTED].
The project involves analyzing revenue data across 15 markets.
SSN on file: [REDACTED].


Removed fields:
  - PII patterns detected and redacted

✓ Asset guardrails ready for prompt packs


##6.LLM WRAPPER

###6.1.OVERVIEW

**Cell 6: Building a Bulletproof Communication System**

This is where we build what I call the reliability engine. When you use AI in professional settings, you cannot afford responses that fail to parse, violate your policies, or produce unusable output. This cell creates a wrapper that catches and fixes common problems before they derail your work.

**The JSON challenge**

Claude communicates in natural language, but we need structured data we can validate and process programmatically. We ask Claude to return responses in JSON format - think of it as a standardized form with specific fields. However, AI models sometimes add extra formatting like markdown code fences, or make small syntax errors like trailing commas that break parsers.

The first set of functions handles these quirks. They strip away formatting artifacts, extract the actual JSON content from surrounding text, and repair common syntax issues. This is less exciting than the AI itself, but absolutely essential. A response that cannot be parsed is useless, regardless of how insightful the content might be.
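To see the repair steps on a concrete input, here is a compact standalone sketch. It follows the same approach as the functions in this cell but is not a copy of them. The name `extract_json` and the sample response are invented for the example.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable


def extract_json(text: str) -> dict:
    """Strip code fences, isolate the outermost {...}, drop trailing commas, parse."""
    text = re.sub(re.escape(FENCE) + r"(?:json)?\s*", "", text)
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found")
    # Remove a comma that directly precedes a closing brace or bracket.
    candidate = re.sub(r",\s*([}\]])", r"\1", match.group(0))
    return json.loads(candidate)


# A typical messy response: prose, a fence, and two trailing commas.
raw = (
    "Here is the result:\n"
    + FENCE
    + 'json\n{"task": "demo", "risks": ["a", "b",],}\n'
    + FENCE
)
parsed = extract_json(raw)
print(parsed)
```

One limitation to keep in mind: the trailing-comma pattern is applied to the whole string, so it would also alter a comma followed by a brace inside a quoted value. For draft consulting text that is rare, but it is a reason to log the raw output alongside the parsed result.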

**Enforcing exact specifications**

The validation function is where we get strict. It checks that the response contains exactly the keys we specified - no more, no less. Extra keys suggest Claude misunderstood instructions. Missing keys mean we lack critical information. We validate data types to ensure lists are actually lists and strings are actually strings.

Most importantly, we enforce policy requirements. Every response must have verification status set to not verified. We scan the draft output for prohibited language like recommend, best option, or ranked number one. If we find these phrases, we reject the response entirely. This might seem harsh, but it prevents the single biggest risk in AI-assisted work - decision laundering, where people treat AI suggestions as authoritative recommendations.
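The phrase scan is a plain case-insensitive substring match, which makes it deliberately conservative. The sketch below uses the same phrase list as this cell's validator. The helper name `find_prohibited` and the sample sentences are illustrative.

```python
PROHIBITED = [
    "recommend", "best option", "top choice", "should select",
    "ranked #1", "optimal choice", "advise choosing",
]


def find_prohibited(draft: str) -> list:
    """Return every prohibited phrase that appears anywhere in the draft."""
    lowered = draft.lower()
    return [phrase for phrase in PROHIBITED if phrase in lowered]


flagged = find_prohibited("Option A is the Best Option for 2026.")
neutral = find_prohibited("Three entry modes are described with their trade-offs.")
negated = find_prohibited("This memo makes no recommendation.")

print(flagged, neutral, negated)
```

The third example shows the trade-off. A sentence that disclaims any recommendation is still rejected, because the match cannot tell intent. That produces some false positives, which is the safer direction to err in when the alternative is letting a disguised recommendation through.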

**The retry mechanism**

When validation fails, we don't just give up. The wrapper automatically retries the API call, this time including specific feedback about what went wrong. Claude sees the validation errors and can correct them. We allow up to three attempts before finally failing.
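The retry pattern can be sketched without any API call by passing in the generator and validator as functions. Everything here is a stand-in: `call_with_retries`, `fake_generate`, and `check_status` are hypothetical names, and the fake generator simply pretends to fix its output once it sees feedback.

```python
def call_with_retries(generate, validate, prompt: str, max_retries: int = 3):
    """Call generate(prompt); on validation failure, append the issues and try again."""
    issues = []
    for attempt in range(1, max_retries + 1):
        output = generate(prompt)
        ok, issues = validate(output)
        if ok:
            return output, attempt
        # The feedback must be added to the text that is actually sent next time.
        prompt += "\n\nPREVIOUS ATTEMPT FAILED VALIDATION:\n" + "\n".join(issues)
    raise ValueError(f"validation failed after {max_retries} attempts: {issues}")


def fake_generate(prompt: str) -> dict:
    """Stand-in for the model: only complies after it has seen feedback."""
    status = "Not verified" if "FAILED VALIDATION" in prompt else "Verified"
    return {"verification_status": status}


def check_status(output: dict):
    """Stand-in validator enforcing a single policy rule."""
    if output.get("verification_status") == "Not verified":
        return True, []
    return False, ['verification_status must be exactly "Not verified"']


result, attempts = call_with_retries(fake_generate, check_status, "Task: demo")
print(result, "after", attempts, "attempts")
```

The detail worth checking in any real wrapper is the commented line: the feedback only helps if it ends up in the prompt that goes out on the next attempt.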

**Comprehensive logging**

Every API call gets logged with a timestamp, a hash of the prompt for traceability, and the parsing outcome. Risks get automatically detected and recorded - if there are too many open questions, that suggests insufficient input data. If there are no assumptions listed, that suggests poor traceability.
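As a rough picture of what one log entry might look like, the sketch below writes records in JSON Lines format, one JSON object per line, and then reads them back for an audit. The record fields and the function name `log_call` are assumptions for illustration, and an in-memory buffer stands in for the real log file.

```python
import hashlib
import io
import json
from datetime import datetime, timezone


def log_call(stream, prompt: str, parsed_ok: bool, issues: list) -> dict:
    """Write one JSON record per line, storing a hash of the prompt, not its text."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "parsed_ok": parsed_ok,
        "issues": issues,
    }
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


# An in-memory buffer stands in for logs/prompts_log.jsonl in this sketch.
log = io.StringIO()
log_call(log, "Task: draft a market entry memo", True, [])
log_call(log, "Task: draft a cost plan", False, ["Missing keys: {'risks'}"])

# Audit pass: every line parses on its own, so a partial file is still readable.
entries = [json.loads(line) for line in log.getvalue().splitlines()]
failed = [entry for entry in entries if not entry["parsed_ok"]]
print(f"{len(entries)} calls logged, {len(failed)} failed validation")
```

Append-only, line-per-record logs suit this job because a crash mid-run leaves every earlier entry intact.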

**The smoke test**

Finally, we run an actual API call to prove the system works end to end. This isn't just checking syntax - we verify that we can call Claude, receive a response, parse it successfully, and validate it against our schema. This fail-fast approach means if something is misconfigured, you discover it immediately rather than after running expensive evaluations.

###6.2.CODE AND IMPLEMENTATION

In [None]:
# Cell 6: LLM Wrapper (ROBUST JSON REPAIR + VALIDATION + ACTUAL API SMOKE TEST)

def fix_json_string(s):
    """Remove trailing commas that break JSON parsing."""
    s = re.sub(r',\s*}', '}', s)
    s = re.sub(r',\s*]', ']', s)
    return s

def extract_json_robust(text):
    """
    Extract JSON from text, handling common formatting issues.
    """
    # Remove markdown code fences
    text = re.sub(r'```json\s*', '', text)
    text = re.sub(r'```\s*', '', text)

    # Try to find JSON object
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match:
        json_str = match.group(0)
        json_str = fix_json_string(json_str)
        return json_str

    return text

def validate_output_json(data, asset_id=None):
    """
    Validate output JSON against exact schema requirements.

    Returns: (is_valid, issues_list)
    """
    issues = []

    # Check required keys (exact set)
    required_keys = {
        "task", "facts_provided", "assumptions", "open_questions",
        "risks", "draft_output", "verification_status", "questions_to_verify"
    }

    actual_keys = set(data.keys())

    if actual_keys != required_keys:
        missing = required_keys - actual_keys
        extra = actual_keys - required_keys
        if missing:
            issues.append(f"Missing keys: {missing}")
        if extra:
            issues.append(f"Extra keys not allowed: {extra}")

    # Check types
    if not isinstance(data.get("facts_provided", None), list):
        issues.append("facts_provided must be a list")
    if not isinstance(data.get("assumptions", None), list):
        issues.append("assumptions must be a list")
    if not isinstance(data.get("open_questions", None), list):
        issues.append("open_questions must be a list")
    if not isinstance(data.get("risks", None), list):
        issues.append("risks must be a list")
    if not isinstance(data.get("questions_to_verify", None), list):
        issues.append("questions_to_verify must be a list")

    # Check verification_status
    if data.get("verification_status") != "Not verified":
        issues.append('verification_status must be exactly "Not verified"')

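    # Check string fields (this also guards the .lower() call below)
    for key in ("task", "draft_output"):
        if not isinstance(data.get(key), str):
            issues.append(f"{key} must be a string")

    # Check for prohibited content (recommendations/rankings)
    draft_output = data.get("draft_output", "")
    if not isinstance(draft_output, str):
        draft_output = ""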
    prohibited_phrases = [
        "recommend", "best option", "top choice", "should select",
        "ranked #1", "optimal choice", "advise choosing"
    ]
    for phrase in prohibited_phrases:
        if phrase.lower() in draft_output.lower():
            issues.append(f'Prohibited language detected in draft_output: "{phrase}"')

    # Validate risk structure
    for i, risk in enumerate(data.get("risks", [])):
        if not isinstance(risk, dict):
            issues.append(f"Risk {i} must be a dict")
            continue

        valid_types = {
            "confidentiality", "hallucination", "missing_facts", "traceability",
            "false_rigor", "decision_laundering", "scope_creep", "change_mgmt",
            "evaluation", "other"
        }
        if risk.get("type") not in valid_types:
            issues.append(f"Risk {i} has invalid type: {risk.get('type')}")

        valid_severities = {"low", "medium", "high"}
        if risk.get("severity") not in valid_severities:
            issues.append(f"Risk {i} has invalid severity: {risk.get('severity')}")

    return (len(issues) == 0, issues)

def auto_detect_risks(data):
    """Auto-detect risks from output structure."""
    risks = []

    if len(data.get("open_questions", [])) > 5:
        risks.append({
            "type": "missing_facts",
            "severity": "medium",
            "note": "High number of open questions suggests insufficient input"
        })

    if len(data.get("assumptions", [])) == 0:
        risks.append({
            "type": "traceability",
            "severity": "low",
            "note": "No explicit assumptions logged"
        })

    return risks

def call_claude(task_description, context="", asset_id=None, max_retries=3):
    """
    Call Claude API with robust JSON parsing and validation.

    Returns: parsed dict or raises exception
    """
    system_prompt = f"""You are an AI assistant helping management consultants create draft internal assets.

{asset_guardrails()}

CRITICAL: Return ONLY valid JSON with EXACTLY these keys:
- task (string)
- facts_provided (array)
- assumptions (array)
- open_questions (array)
- risks (array of {{type, severity, note}})
- draft_output (string)
- verification_status (must be "Not verified")
- questions_to_verify (array)

Valid risk types: confidentiality, hallucination, missing_facts, traceability, false_rigor, decision_laundering, scope_creep, change_mgmt, evaluation, other
Valid severities: low, medium, high

NO extra keys. NO prose outside JSON. NO recommendations or rankings in draft_output."""

    user_prompt = f"""Task: {task_description}

{context}

Return structured JSON output following the exact schema."""

    prompt_hash = sha256_text(system_prompt + user_prompt)

    for attempt in range(max_retries):
        try:
            # Call API
            message = client.messages.create(
                model=MODEL,
                max_tokens=MAX_TOKENS,
                temperature=TEMPERATURE,
                system=system_prompt,
                messages=[{"role": "user", "content": user_prompt}]
            )

            raw_output = message.content[0].text

            # Extract and parse JSON
            json_str = extract_json_robust(raw_output)
            data = json.loads(json_str)

            # Validate schema
            is_valid, issues = validate_output_json(data, asset_id)

            if not is_valid:
                if attempt < max_retries - 1:
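                    # Retry with validation feedback. user_prompt was built before
                    # the loop, so append the feedback to it directly; updating
                    # `context` alone would never reach the model.
                    user_prompt += "\n\nPREVIOUS ATTEMPT FAILED VALIDATION:\n" + "\n".join(issues)
                    continue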
                else:
                    # Final attempt failed - log and raise
                    debug_path = base_dir / "logs" / "debug_malformed_json.txt"
                    with open(debug_path, 'w') as f:
                        f.write(f"Attempt {attempt + 1} - Validation issues:\n")
                        f.write("\n".join(issues))
                        f.write(f"\n\nRaw output:\n{raw_output}")

                    append_jsonl(base_dir / "logs" / "exception_log.json", {
                        "timestamp": now_iso(),
                        "asset_id": asset_id,
                        "exception_type": "validation_failure",
                        "issues": issues,
                        "prompt_hash": prompt_hash
                    })

                    raise ValueError(f"Validation failed after {max_retries} attempts: {issues}")

            # Add auto-detected risks
            auto_risks = auto_detect_risks(data)
            data["risks"].extend(auto_risks)

            # Log to prompts_log.jsonl
            append_jsonl(base_dir / "logs" / "prompts_log.jsonl", {
                "timestamp": now_iso(),
                "asset_id": asset_id,
                "prompt_hash": prompt_hash,
                "model": MODEL,
                "temperature": TEMPERATURE,
                "max_tokens": MAX_TOKENS,
                "parsing_status": "success"
            })

            # Log risks
            risk_log = read_json(base_dir / "logs" / "risk_log.json")
            for risk in data["risks"]:
                risk_log["risks"].append({
                    "timestamp": now_iso(),
                    "asset_id": asset_id,
                    **risk
                })
            write_json(base_dir / "logs" / "risk_log.json", risk_log)

            return data

        except json.JSONDecodeError as e:
            if attempt < max_retries - 1:
                continue
            else:
                debug_path = base_dir / "logs" / "debug_malformed_json.txt"
                with open(debug_path, 'w') as f:
                    f.write(f"JSON decode error: {e}\n\nRaw output:\n{raw_output}")
                raise

        except Exception as e:
            append_jsonl(base_dir / "logs" / "exception_log.json", {
                "timestamp": now_iso(),
                "asset_id": asset_id,
                "exception_type": type(e).__name__,
                "message": str(e),
                "prompt_hash": prompt_hash
            })
            raise

# SMOKE TEST (ACTUAL API CALL)
print("Running smoke test with ACTUAL API call...")
smoke_test_result = call_claude(
    task_description="Generate a simple test output to verify the wrapper works correctly",
    context="This is a smoke test. Include 2 open questions and 1 assumption.",
    asset_id="smoke_test"
)

print("\n✓ SMOKE TEST PASSED")
print(f"  Task: {smoke_test_result['task']}")
print(f"  Open questions: {len(smoke_test_result['open_questions'])}")
print(f"  Draft output length: {len(smoke_test_result['draft_output'])} chars")
print(f"  Verification status: {smoke_test_result['verification_status']}")
print("\n✓ LLM wrapper ready with robust JSON parsing and validation")

Running smoke test with ACTUAL API call...

✓ SMOKE TEST PASSED
  Task: Generate a simple test output to verify the wrapper works correctly
  Open questions: 2
  Draft output length: 415 chars
  Verification status: Not verified

✓ LLM wrapper ready with robust JSON parsing and validation


##7.ASSET BUNDLE BUILDER (THE ARTIFACT)

###7.1.OVERVIEW

**Cell 7: Manufacturing Reusable Assets**

Now we transition from infrastructure to actual asset creation. This cell defines the factory process for building what we call Asset Bundles - complete, self-contained packages that your team can deploy repeatedly with confidence. Think of this as creating a product, not just a document.

**What makes an asset bundle complete**

A traditional consulting approach might create a template document and call it done. Level 4 thinking demands much more. Each asset bundle contains seven distinct components, and this cell automates their creation. You provide four key inputs - the asset name, its purpose, what it's allowed to do, and what it's explicitly forbidden from doing - and the function generates everything else.

**The specification document**

First, we create a formal specification that defines the asset's scope and boundaries. This includes the intended users, the minimum information required as input, the exact structure of outputs, and prominent risk warnings. The specification also locks in the model version and parameters, so six months from now someone can verify whether they're using the asset as originally designed.
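
To make the "six months from now" check concrete, here is a minimal sketch of comparing the settings locked in a specification against whatever is configured today. The function name `find_spec_drift` and the sample model names are hypothetical and not part of the notebook; only the three keys (model, temperature, max_tokens) come from the specification described above.

```python
def find_spec_drift(asset_spec, current_settings):
    """Return {setting: (locked, current)} for each locked setting that differs."""
    locked_settings = ("model", "temperature", "max_tokens")
    drift = {}
    for name in locked_settings:
        locked = asset_spec.get(name)
        current = current_settings.get(name)
        if locked != current:
            drift[name] = (locked, current)
    return drift


# Hypothetical values for illustration only
spec = {"model": "example-model-v1", "temperature": 0.2, "max_tokens": 2000}
today = {"model": "example-model-v2", "temperature": 0.2, "max_tokens": 2000}

print(find_spec_drift(spec, today))
```

An empty result means the asset is being run as originally specified; anything else is a prompt to re-run the evaluation before trusting the outputs.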

Critically, we assign a version number starting at 0.1-draft. This signals clearly that the asset has not been approved for production use. Nothing leaves this notebook with a released version number - that requires human approval after thorough review.

**The prompt pack**

Next, we generate the prompt pack - the actual instructions that will be sent to Claude each time someone uses this asset. This isn't just a casual request. It includes the system instructions, a template for user input, explicit guardrails about prohibited behaviors, and prominent warnings about verification requirements.

Notice we create this file statically from known text rather than asking Claude to write it. We want complete control over these instructions. Allowing an AI to write its own operating procedures would be recursive and potentially unstable.
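
Because the prompt pack is static text, using it comes down to filling in two placeholders. The sketch below assumes the user prompt template carries `{task_description}` and `{context}` placeholders, as in the pack this cell writes; the helper name `render_user_prompt` is hypothetical.

```python
# Assumed to mirror the USER PROMPT TEMPLATE section of the prompt pack
USER_PROMPT_TEMPLATE = """Task: {task_description}

Context:
{context}

Return structured JSON following the exact schema."""


def render_user_prompt(task_description, context):
    """Fill the static template; refuse to run without a task description."""
    if not task_description.strip():
        raise ValueError("task_description is required")
    return USER_PROMPT_TEMPLATE.format(
        task_description=task_description.strip(),
        context=context.strip(),
    )


prompt = render_user_prompt(
    "Structure memo for entering consumer electronics market",
    "Company: mid-size retailer; Market: Southeast Asia; Timeline: 18 months",
)
print(prompt)
```

The instructions themselves never change between uses; only the pre-redacted task and context vary.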

**The deliverable template**

We also create a structured template showing exactly what the output should look like. It has labeled sections for facts provided, assumptions made, open questions, draft output, risks identified, verification status, and questions requiring fact-checking. This structure enforces transparency - users can immediately see what came from source data versus what the AI inferred.
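
One way to picture how a structured output maps onto the template is the sketch below. It is an illustration only: `render_deliverable` is a hypothetical helper, the sample values are made up, and the notebook itself only writes the empty shell.

```python
def render_deliverable(asset_name, output):
    """Fill the deliverable sections from one structured output dict."""
    def bullets(items):
        return "\n".join(f"- {item}" for item in items) or "- (none)"

    risk_lines = [
        f"{r['severity']} / {r['type']}: {r['note']}" for r in output.get("risks", [])
    ]
    parts = [
        f"# {asset_name} (DRAFT - NOT VERIFIED)",
        "## Facts Provided\n" + bullets(output.get("facts_provided", [])),
        "## Assumptions\n" + bullets(output.get("assumptions", [])),
        "## Open Questions\n" + bullets(output.get("open_questions", [])),
        "## Draft Output\n" + output.get("draft_output", ""),
        "## Risks Identified\n" + bullets(risk_lines),
        "## Verification Status\n" + output.get("verification_status", "Not verified"),
        "## Questions to Verify\n" + bullets(output.get("questions_to_verify", [])),
    ]
    return "\n\n".join(parts)


# Made-up sample output for illustration
sample_output = {
    "facts_provided": ["Company: mid-size retailer", "Market: Southeast Asia"],
    "assumptions": [],
    "open_questions": ["Which countries are in scope?"],
    "risks": [{"type": "missing_facts", "severity": "medium", "note": "No budget given"}],
    "draft_output": "Section outline goes here.",
    "verification_status": "Not verified",
    "questions_to_verify": ["Confirm the 18-month timeline"],
}
text = render_deliverable("Market Entry Memo Shell", sample_output)
print(text)
```

An empty section is rendered as "(none)" rather than dropped, so a reader can see at a glance that no assumptions were logged.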

**Registry and change management**

Every asset gets registered in a central inventory with its creation timestamp and current status. We also log the creation event in the change log, establishing a clear audit trail. If this asset ever gets modified, updated, or retired, those events will be logged in the same system.
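
As a rough sketch of what a later event might look like, the snippet below updates an asset's status in the registry and writes a matching change log entry in one step. It works on in-memory dicts shaped like the registry and change log described above; the function name and the `status_changed` change type are assumptions, not something the notebook defines.

```python
from datetime import datetime, timezone


def record_status_change(registry, change_log, asset_id, new_status, description):
    """Update the registry entry and append a matching change log event."""
    for entry in registry["assets"]:
        if entry["asset_id"] == asset_id:
            old_status = entry["status"]
            entry["status"] = new_status
            break
    else:
        raise KeyError(f"Unknown asset_id: {asset_id}")

    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "change_type": "status_changed",  # hypothetical change type
        "asset_id": asset_id,
        "description": f"{old_status} -> {new_status}: {description}",
    }
    change_log["changes"].append(event)
    return event


# Made-up registry and log for illustration
registry = {"assets": [{"asset_id": "asset_demo_0001", "status": "draft"}]}
change_log = {"changes": []}
event = record_status_change(
    registry, change_log, "asset_demo_0001", "retired", "Superseded by a newer shell"
)
print(event["description"])
```

Keeping the registry update and the log entry in a single function makes it harder for one to happen without the other.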

**Why this matters**

Creating all these components manually would be tedious and error-prone. By automating the process, we ensure consistency and completeness. Every asset bundle follows the same structure, includes the same safety features, and maintains the same documentation standards.

###7.2.CODE AND IMPLEMENTATION

In [None]:
# Cell 7: Asset Bundle Builder (Spec + Prompt Pack + Template)

def build_asset_bundle(asset_name, purpose, scope_boundary, prohibited_uses):
    """
    Create a complete Asset Bundle with all required components.

    Returns: asset_id
    """
    asset_id = f"asset_{asset_name.lower().replace(' ', '_')}_{uuid.uuid4().hex[:8]}"
    asset_dir = base_dir / "asset_bundles" / asset_id
    asset_dir.mkdir(parents=True, exist_ok=True)

    # 1. Create asset_spec.json
    asset_spec = {
        "asset_id": asset_id,
        "name": asset_name,
        "purpose": purpose,
        "intended_users": "Management consultants and in-house strategy professionals",
        "scope_boundary": scope_boundary,
        "prohibited_uses": prohibited_uses,
        "input_schema": {
            "minimum_necessary": True,
            "required_fields": ["task_description", "context"],
            "redaction_required": True
        },
        "output_schema": {
            "format": "strict_json",
            "required_keys": [
                "task", "facts_provided", "assumptions", "open_questions",
                "risks", "draft_output", "verification_status", "questions_to_verify"
            ]
        },
        "risk_notes": "All outputs are drafts requiring human review. No recommendations or rankings permitted.",
        "model": MODEL,
        "temperature": TEMPERATURE,
        "max_tokens": MAX_TOKENS,
        "version": "0.1-draft",
        "owner": "unassigned",
        "created_at": now_iso()
    }
    write_json(asset_dir / "asset_spec.json", asset_spec)

    # 2. Create prompt_pack.txt
    prompt_pack = f"""ASSET: {asset_name}
PURPOSE: {purpose}

SCOPE BOUNDARY:
{scope_boundary}

PROHIBITED USES:
{prohibited_uses}

{asset_guardrails()}

SYSTEM INSTRUCTIONS:
You are creating a draft {asset_name.lower()} for management consultants.
- Use ONLY information explicitly provided
- Clearly separate facts, assumptions, and open questions
- Flag all risks (confidentiality, missing information, etc.)
- NO recommendations, rankings, or "best option" language
- Return strict JSON with verification_status: "Not verified"

USER PROMPT TEMPLATE:
---
Task: {{task_description}}

Context:
{{context}}

Return structured JSON following the exact schema.
---

REDACTION WARNING:
Never include client-confidential information. All inputs should be pre-redacted.

NOT VERIFIED REMINDER:
All outputs from this asset are DRAFTS requiring human review before use.
"""
    with open(asset_dir / "prompt_pack.txt", 'w') as f:
        f.write(prompt_pack)

    # 3. Create template.txt (deliverable shell)
    template = f"""# {asset_name} (DRAFT - NOT VERIFIED)

## Facts Provided
[List explicit facts from input]

## Assumptions
[List assumptions made]

## Open Questions
[List information gaps and questions to resolve]

## Draft Output
[Generated content goes here]

## Risks Identified
[Auto-detected and flagged risks]

## Verification Status
Not verified - requires human review

## Questions to Verify
[Specific items requiring fact-checking]

---
Generated by: AI-Assisted Consulting Level 4 (Innovators)
Asset ID: {asset_id}
Version: 0.1-draft
Timestamp: {now_iso()}
"""
    with open(asset_dir / "template.txt", 'w') as f:
        f.write(template)

    # 4. Update asset_registry.json
    registry = read_json(base_dir / "asset_registry.json")
    registry["assets"].append({
        "asset_id": asset_id,
        "name": asset_name,
        "version": "0.1-draft",
        "created_at": now_iso(),
        "status": "draft"
    })
    write_json(base_dir / "asset_registry.json", registry)

    # 5. Add to change_log.json
    change_log = read_json(base_dir / "logs" / "change_log.json")
    change_log["changes"].append({
        "timestamp": now_iso(),
        "change_type": "asset_created",
        "asset_id": asset_id,
        "asset_name": asset_name,
        "version": "0.1-draft",
        "description": f"Created new asset bundle for {asset_name}"
    })
    write_json(base_dir / "logs" / "change_log.json", change_log)

    print(f"✓ Asset Bundle created: {asset_id}")
    print(f"  Location: {asset_dir}")
    print(f"  Files created:")
    for file in asset_dir.iterdir():
        print(f"    - {file.name}")

    return asset_id

# Demo: Create one sample asset
sample_asset_id = build_asset_bundle(
    asset_name="Market Entry Memo Shell",
    purpose="Reusable template for market entry analysis memos",
    scope_boundary="Structure and scaffolding only; no market-specific recommendations",
    prohibited_uses="Direct decision-making; client deliverables without review; external sharing"
)

print(f"\n✓ Asset bundle builder ready")
print(f"✓ Sample asset: {sample_asset_id}")

✓ Asset Bundle created: asset_market_entry_memo_shell_fe37dda7
  Location: /content/ai_consulting_ch4_runs/run_20260120_132001_54ba806b/asset_bundles/asset_market_entry_memo_shell_fe37dda7
  Files created:
    - prompt_pack.txt
    - template.txt
    - asset_spec.json

✓ Asset bundle builder ready
✓ Sample asset: asset_market_entry_memo_shell_fe37dda7


##8.EVALUATION HARNESS: SYNTHETIC CASES PLUS REGRESSION AND LOGGING

###8.1.OVERVIEW

**Cell 8: Building Quality Control Systems**

This cell creates what professional software developers call a test harness - an automated system for evaluating whether your assets actually work as intended. In traditional consulting, quality control happens through senior review after the fact. At Level 4, we build quality checks directly into the asset creation process.

**Understanding synthetic test cases**

The first function generates five test cases for each type of asset. These are deliberately synthetic - fictional companies, made-up scenarios, non-confidential data. We use synthetic cases because they can be safely stored and reused without privacy concerns. Anyone can examine them, share them, or run them again later.

Each test case includes a task description and relevant context, mimicking what a real user might provide. For a market entry memo, we might specify a mid-size retailer entering Southeast Asian consumer electronics. For a cost transformation workplan, perhaps a manufacturing company targeting fifteen percent SG&A reduction. The cases are realistic enough to exercise the asset's capabilities without revealing actual client information.

**The evaluation process**

The second function runs these test cases through the asset and measures what happens. For each case, we call Claude with the test input and capture the output. Then we validate it against multiple criteria: does it follow the required schema, does it include open questions showing intellectual honesty, does it document assumptions for traceability, and does it avoid prohibited recommendation language?

Each case gets scored pass or fail on these criteria. We then calculate an overall pass rate - the share of test cases whose output passed schema and policy validation. This single number becomes your headline quality metric. If only sixty percent of cases pass, you know the asset needs refinement before deployment.
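
To see how per-case scores roll up into a decision, here is a small sketch that turns the 0/1 metrics into rates and compares them with minimum thresholds. The threshold values mirror the evaluation plan written by this cell (with its schema_compliance test mapped to the pass_schema metric); the helper names and the sample scores are assumptions for illustration.

```python
METRICS = ("pass_schema", "has_open_questions", "has_assumptions", "no_reco_language")


def metric_rates(case_metrics):
    """Average each 0/1 metric across all evaluated cases."""
    total = len(case_metrics)
    if total == 0:
        return {name: 0.0 for name in METRICS}
    return {name: sum(m[name] for m in case_metrics) / total for name in METRICS}


def failed_thresholds(rates, thresholds):
    """Return {metric: (rate, minimum)} for every metric below its minimum."""
    return {
        name: (rates[name], minimum)
        for name, minimum in thresholds.items()
        if rates[name] < minimum
    }


thresholds = {
    "pass_schema": 0.8,
    "has_open_questions": 0.6,
    "has_assumptions": 0.4,
    "no_reco_language": 1.0,
}

# Made-up scores for five cases: three pass validation, two do not
case_metrics = [
    {"pass_schema": 1, "has_open_questions": 1, "has_assumptions": 1, "no_reco_language": 1},
    {"pass_schema": 1, "has_open_questions": 1, "has_assumptions": 1, "no_reco_language": 1},
    {"pass_schema": 1, "has_open_questions": 1, "has_assumptions": 0, "no_reco_language": 1},
    {"pass_schema": 0, "has_open_questions": 1, "has_assumptions": 0, "no_reco_language": 1},
    {"pass_schema": 0, "has_open_questions": 1, "has_assumptions": 0, "no_reco_language": 1},
]

rates = metric_rates(case_metrics)
failures = failed_thresholds(rates, thresholds)
print(rates)
print(failures)
```

With three of five cases passing, the sixty percent pass rate falls short of the 0.8 minimum, which is exactly the "needs refinement" situation described above.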

**Progress indicators matter**

Notice we include visual feedback showing dots for each completed case. When you're running evaluations that take several minutes, you want confirmation the system is actually working rather than frozen. These simple indicators prevent unnecessary interruptions or restarts.

**Regression detection**

The third function implements regression testing - comparing current performance against a baseline. The first time you run an asset, we store its pass rate as the baseline. On subsequent runs, we check whether quality has degraded. If the current pass rate falls more than ten percentage points below baseline, we flag this as a regression and log it as a high-severity risk.

This catches a common problem with AI systems - gradual quality drift over time as models get updated or prompts get modified. Without regression testing, you might not notice that an asset that used to work ninety percent of the time now only works seventy percent of the time.
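
The comparison itself is a single inequality. The sketch below isolates it from the logging so the arithmetic is easy to check; `is_regression` is a hypothetical helper name, and the tolerance of 0.1 is the ten-percentage-point margin described above.

```python
def is_regression(current_pass_rate, baseline_rate, tolerance=0.1):
    """True when the pass rate has dropped more than `tolerance` below baseline."""
    return current_pass_rate < baseline_rate - tolerance


# The example from the text: ninety percent baseline, seventy percent today
print(is_regression(0.70, 0.90))  # drop of 20 points, flagged
print(is_regression(0.85, 0.90))  # drop of 5 points, within tolerance
```

Note the tolerance is absolute, not relative: a baseline of 0.90 flags anything below 0.80, regardless of how the baseline was reached.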

**No demonstration in this cell**

Unlike previous cells, we deliberately skip the demonstration here. Running actual evaluations means making multiple API calls, which takes time and incurs costs. Instead, we simply load the functions and confirm they're ready. The real evaluation happens in Cell 9 where we process all four assets end to end.

###8.2.CODE AND IMPLEMENTATION

In [None]:
# Cell 8: Evaluation Harness (Synthetic Cases + Regression + Logging)

def generate_eval_cases(asset_name):
    """
    Generate 5 synthetic (non-confidential) evaluation cases.

    Returns: list of eval cases
    """
    # Synthetic cases tailored to asset type
    cases_by_type = {
        "Market Entry Memo Shell": [
            {"task": "Structure memo for entering consumer electronics market",
             "context": "Company: mid-size retailer; Market: Southeast Asia; Timeline: 18 months"},
            {"task": "Create memo outline for healthcare services expansion",
             "context": "Company: regional provider; Market: rural areas; Constraint: limited capital"},
            {"task": "Draft memo framework for B2B software entry",
             "context": "Company: enterprise SaaS; Market: financial services; Focus: compliance"},
            {"task": "Build memo structure for food & beverage launch",
             "context": "Company: CPG brand; Market: urban millennials; Channel: e-commerce"},
            {"task": "Outline memo for industrial equipment market entry",
             "context": "Company: manufacturer; Market: emerging economies; Risk: currency volatility"}
        ],
        "Cost Transformation Workplan": [
            {"task": "Create workplan for SG&A cost reduction",
             "context": "Industry: manufacturing; Target: 15% reduction; Timeline: 12 months"},
            {"task": "Outline transformation plan for IT cost optimization",
             "context": "Industry: financial services; Focus: cloud migration; Constraint: compliance"},
            {"task": "Draft workplan for supply chain cost savings",
             "context": "Industry: retail; Lever: supplier consolidation; Timeline: 9 months"},
            {"task": "Structure plan for overhead reduction",
             "context": "Industry: professional services; Target: real estate and travel; Budget: $2M"},
            {"task": "Build workplan for procurement transformation",
             "context": "Industry: healthcare; Scope: indirect spend; Timeline: 24 months"}
        ],
        "IC Pre-read Shell": [
            {"task": "Create pre-read for acquisition decision",
             "context": "Target: tech startup; Value: $50M; Industry: fintech"},
            {"task": "Draft pre-read for capital expansion",
             "context": "Project: new facility; Investment: $100M; Location: Mexico"},
            {"task": "Outline pre-read for product launch",
             "context": "Product: consumer app; Budget: $20M; Market: Gen Z"},
            {"task": "Structure pre-read for divestiture",
             "context": "Asset: legacy division; Value: $75M; Rationale: strategic focus"},
            {"task": "Build pre-read for partnership decision",
             "context": "Partner: regional distributor; Scope: 5-year contract; Risk: exclusivity"}
        ],
        "RACI + Cadence Shell": [
            {"task": "Create RACI for digital transformation",
             "context": "Scope: enterprise-wide; Duration: 18 months; Stakeholders: 5 departments"},
            {"task": "Draft RACI for merger integration",
             "context": "Companies: similar size; Timeline: 12 months; Focus: synergy capture"},
            {"task": "Outline RACI for product development",
             "context": "Team: cross-functional; Methodology: agile; Complexity: high"},
            {"task": "Structure RACI for cost program",
             "context": "Scope: global; Governance: steering committee; Workstreams: 8"},
            {"task": "Build RACI for compliance initiative",
             "context": "Regulation: new; Timeline: 6 months; Impact: all business units"}
        ]
    }

    return cases_by_type.get(asset_name, [
        {"task": f"Generic task for {asset_name}", "context": "Placeholder context"}
        for _ in range(5)
    ])

def run_eval(asset_id, eval_cases):
    """
    Run evaluation harness for an asset.

    Returns: summary dict
    """
    asset_dir = base_dir / "asset_bundles" / asset_id

    results = []
    total_cases = len(eval_cases)

    print(f"  Evaluating {total_cases} cases", end="", flush=True)

    for i, case in enumerate(eval_cases):
        try:
            # Progress indicator
            print(".", end="", flush=True)

            # Call model with eval case
            output = call_claude(
                task_description=case["task"],
                context=case["context"],
                asset_id=asset_id
            )

            # Validate schema and policy
            is_valid, issues = validate_output_json(output, asset_id)

            # Compute metrics
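            metrics = {
                "case_id": i,
                "pass_schema": 1 if is_valid else 0,
                "has_open_questions": 1 if len(output.get("open_questions", [])) > 0 else 0,
                "has_assumptions": 1 if len(output.get("assumptions", [])) > 0 else 0,
                # Score recommendation language on its own validation issue,
                # not on overall validity
                "no_reco_language": 0 if any("Prohibited language" in issue for issue in issues) else 1,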
                "validation_issues": issues
            }

            results.append({
                "case": case,
                "output": output,
                "metrics": metrics
            })

        except Exception as e:
            print("E", end="", flush=True)
            results.append({
                "case": case,
                "error": str(e),
                "metrics": {
                    "case_id": i,
                    "pass_schema": 0,
                    "has_open_questions": 0,
                    "has_assumptions": 0,
                    "no_reco_language": 0,
                    "validation_issues": [str(e)]
                }
            })

    print(" ✓")

    # Compute summary
    passed = sum(1 for r in results if r.get("metrics", {}).get("pass_schema", 0) == 1)
    pass_rate = passed / total_cases if total_cases > 0 else 0

    summary = {
        "asset_id": asset_id,
        "total_cases": total_cases,
        "passed": passed,
        "pass_rate": pass_rate,
        "timestamp": now_iso()
    }

    # Save evaluation artifacts
    write_json(asset_dir / "eval_cases.json", eval_cases)
    write_json(asset_dir / "eval_results.json", {
        "summary": summary,
        "results": results
    })

    evaluation_plan = {
        "tests": [
            {"name": "schema_compliance", "threshold": 0.8},
            {"name": "has_open_questions", "threshold": 0.6},
            {"name": "has_assumptions", "threshold": 0.4},
            {"name": "no_reco_language", "threshold": 1.0}
        ],
        "pass_criteria": "All thresholds must be met",
        "regression_tolerance": 0.1
    }
    write_json(asset_dir / "evaluation_plan.json", evaluation_plan)

    # Log to evaluation_harness_log.jsonl
    append_jsonl(base_dir / "logs" / "evaluation_harness_log.jsonl", {
        "timestamp": now_iso(),
        "asset_id": asset_id,
        "cases_hash": sha256_text(json.dumps(eval_cases)),
        "pass_rate": pass_rate,
        "passed": passed,
        "total": total_cases
    })

    return summary

def regression_check(asset_id, current_pass_rate):
    """
    Check if current evaluation represents a regression from baseline.

    Returns: (passed, message)
    """
    baseline_path = base_dir / "regression_baseline.json"
    baseline = read_json(baseline_path)

    # Initialize baseline if empty
    if not baseline:
        baseline[asset_id] = current_pass_rate
        write_json(baseline_path, baseline)
        return (True, "Baseline established")

    # Check for regression
    if asset_id not in baseline:
        baseline[asset_id] = current_pass_rate
        write_json(baseline_path, baseline)
        return (True, "Baseline established for new asset")

    baseline_rate = baseline[asset_id]
    tolerance = 0.1  # allowed drop of 10 percentage points (absolute, not relative)

    if current_pass_rate < baseline_rate - tolerance:
        # Log regression risk
        risk_log = read_json(base_dir / "logs" / "risk_log.json")
        risk_log["risks"].append({
            "timestamp": now_iso(),
            "asset_id": asset_id,
            "type": "change_mgmt",
            "severity": "high",
            "note": f"Regression detected: pass rate {current_pass_rate:.2f} vs baseline {baseline_rate:.2f}"
        })
        write_json(base_dir / "logs" / "risk_log.json", risk_log)

        return (False, f"REGRESSION: {current_pass_rate:.2f} vs baseline {baseline_rate:.2f}")

    return (True, f"No regression: {current_pass_rate:.2f} vs baseline {baseline_rate:.2f}")

# NO DEMO - Functions defined and ready
print("✓ Evaluation harness functions loaded:")
print("  - generate_eval_cases()")
print("  - run_eval()")
print("  - regression_check()")
print("\n✓ Ready to evaluate assets in Cell 9")
print("  (Note: Cell 9 will make 20 API calls = 4 assets × 5 cases)")
print("  (Estimated time: 3-5 minutes)")

✓ Evaluation harness functions loaded:
  - generate_eval_cases()
  - run_eval()
  - regression_check()

✓ Ready to evaluate assets in Cell 9
  (Note: Cell 9 will make 20 API calls = 4 assets × 5 cases)
  (Estimated time: 3-5 minutes)


##9.RUNNING END-TO-END MINICASES

###9.1.OVERVIEW

**Cell 9: Full Production Run**

This is where everything comes together. We're now going to create four complete Asset Bundles from start to finish - building the specifications, generating evaluation cases, running quality tests, checking for regressions, and staging everything for approval. This demonstrates the full Level 4 workflow in action.

**The four demonstration assets**

We've chosen four realistic consulting scenarios that illustrate different use cases. The market entry memo shell provides structure for analyzing new market opportunities. The cost transformation workplan templates the planning process for efficiency initiatives. The investment committee pre-read creates neutral fact summaries for capital decisions. The RACI and cadence shell documents roles and meeting rhythms for complex programs.

Notice each asset has clearly defined boundaries. The market entry memo provides structure only, not market-specific recommendations. The IC pre-read summarizes facts neutrally without making investment recommendations. These boundaries are not arbitrary - they reflect where AI can add value safely versus where human judgment remains essential.

**The assembly line process**

For each asset, we execute six distinct steps. First, we build the complete asset bundle with all its component files. Second, we generate the synthetic evaluation cases tailored to that asset type. Third, we run the evaluation harness, making actual API calls to test whether the asset performs as expected. Fourth, we check for regression against the quality baseline. Fifth, we create a release candidate document capturing approval requirements and known limitations. Sixth, we save a human-readable summary for stakeholders.

The progress tracking shows you exactly where you are in this process. When you see "Step three of six - running evaluation", you know API calls are happening and this will take time. The timing information helps you plan - if asset one took ninety seconds, you can estimate the full run will take roughly six minutes.

**Configurable evaluation depth**

Notice the NUM_EVAL_CASES parameter set to three. This controls how many test cases we run per asset. Three cases gives you quick feedback during development. Five cases provides more comprehensive coverage for final validation. In a production environment, you might run ten or twenty cases to thoroughly exercise edge cases and failure modes.

We make this configurable because evaluation has real costs - both API expenses and time. During iterative development, you want fast feedback cycles. Before final release, you accept longer runtimes for greater confidence. The same notebook supports both modes.
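
The cost trade-off is simple arithmetic, sketched below. The half-minute to one-minute per call range is the same rough figure Cell 9 uses for its printed estimate; the helper name `estimate_run` is hypothetical, and the count ignores any retries the wrapper makes after a failed validation.

```python
def estimate_run(num_assets, cases_per_asset, min_per_call=0.5, max_per_call=1.0):
    """Estimate API calls and a rough runtime range in minutes (retries excluded)."""
    calls = num_assets * cases_per_asset
    return {
        "api_calls": calls,
        "min_minutes": calls * min_per_call,
        "max_minutes": calls * max_per_call,
    }


print(estimate_run(4, 3))  # quick development pass
print(estimate_run(4, 5))  # fuller validation pass
```

Moving from three to five cases per asset takes the run from 12 to 20 API calls, which is why the faster setting is the default during development.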

**The approval staging**

Each asset gets a release candidate document that functions like a pre-flight checklist. It lists who must review and approve the asset - subject matter experts, quality assurance leads, governance officers. It documents known limitations based on the evaluation results. It shows the test summary including pass rates and regression status.

Critically, every asset remains in pending approval status. The notebook cannot promote assets to production on its own. This human-in-the-loop requirement prevents runaway automation. Someone with appropriate authority must explicitly decide this asset is ready for broad deployment.
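
The notebook stops at pending, but it can help to picture what a reviewer-side check might look like. The sketch below is an assumption about a process that sits outside this notebook: it reads the required review roles and regression flag from a release candidate and reports which sign-offs are still missing. The function names, the `signoffs` structure, and the reviewer names are all hypothetical.

```python
def missing_signoffs(release_candidate, signoffs):
    """List required review roles that have not signed off yet."""
    return [
        role for role in release_candidate["required_review_roles"]
        if role not in signoffs
    ]


def ready_for_promotion(release_candidate, signoffs):
    """True only when every required role has signed and no regression was found."""
    no_missing = not missing_signoffs(release_candidate, signoffs)
    tests_ok = release_candidate["test_summary"]["regression_ok"]
    return no_missing and tests_ok


# Made-up release candidate and sign-offs for illustration
candidate = {
    "approval_state": "pending",
    "required_review_roles": [
        "Subject matter expert",
        "Quality assurance lead",
        "Governance officer",
    ],
    "test_summary": {"regression_ok": True},
}
signoffs = {"Subject matter expert": "reviewer_a", "Quality assurance lead": "reviewer_b"}

print(missing_signoffs(candidate, signoffs))
print(ready_for_promotion(candidate, signoffs))
```

Even when a check like this returns True, the decision to change the approval state stays with a person who has the authority to make it.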

**The final summary**

At completion, you see a table showing all four assets with their pass rates and approval status. This gives leadership a quick overview: are these assets performing well enough to justify the approval process, or do they need refinement first?

###9.2.CODE AND IMPLEMENTATION

In [None]:
# Cell 9: Run 4 Mini-Case Assets End-to-End (Create + Evaluate + Release Candidate)

# Define 4 mini-cases
mini_cases = [
    {
        "name": "Market Entry Memo Shell",
        "purpose": "Reusable template for market entry analysis memos",
        "scope_boundary": "Structure and scaffolding only; no market-specific recommendations",
        "prohibited_uses": "Direct decision-making; client deliverables without review; external sharing"
    },
    {
        "name": "Cost Transformation Workplan",
        "purpose": "Reusable workplan template for cost reduction initiatives",
        "scope_boundary": "High-level structure and assumptions framework; no cost targets or specific initiatives",
        "prohibited_uses": "Final workplans without validation; unsupported cost estimates; external sharing"
    },
    {
        "name": "IC Pre-read Shell",
        "purpose": "Reusable pre-read template for investment committee decisions",
        "scope_boundary": "Neutral fact summary only; no investment recommendations or risk assessments",
        "prohibited_uses": "Investment decisions; client-facing materials; regulatory filings"
    },
    {
        "name": "RACI + Cadence Shell",
        "purpose": "Reusable RACI matrix and meeting cadence template",
        "scope_boundary": "Role definitions and cadence structure; no specific assignments or ownership",
        "prohibited_uses": "Final org charts; performance accountability; external communications"
    }
]

# Configuration: Set to 3 for faster testing, 5 for full evaluation
NUM_EVAL_CASES = 3  # Change to 5 for comprehensive evaluation

results_summary = []

print("="*80)
print("ASSET BUNDLE CREATION AND EVALUATION")
print("="*80)
print(f"Creating 4 assets with {NUM_EVAL_CASES} evaluation cases each")
print(f"Estimated time: {NUM_EVAL_CASES * 4 * 0.5:.1f}-{NUM_EVAL_CASES * 4 * 1:.1f} minutes")
print("="*80 + "\n")

import time
start_time = time.time()

for i, case in enumerate(mini_cases, 1):
    case_start = time.time()
    print(f"\n[{i}/4] {case['name']}")
    print("-" * 60)

    # 1. Build asset bundle
    print("  Step 1/6: Creating asset bundle...", end="", flush=True)
    asset_id = build_asset_bundle(
        asset_name=case["name"],
        purpose=case["purpose"],
        scope_boundary=case["scope_boundary"],
        prohibited_uses=case["prohibited_uses"]
    )
    print(" ✓")

    # 2. Generate eval cases (limit to NUM_EVAL_CASES)
    print(f"  Step 2/6: Generating {NUM_EVAL_CASES} evaluation cases...", end="", flush=True)
    all_eval_cases = generate_eval_cases(case["name"])
    eval_cases = all_eval_cases[:NUM_EVAL_CASES]
    print(" ✓")

    # 3. Run evaluation (this is the slow part - API calls)
    print(f"  Step 3/6: Running evaluation ({NUM_EVAL_CASES} API calls)...", end="", flush=True)
    eval_start = time.time()
    eval_summary = run_eval(asset_id, eval_cases)
    eval_duration = time.time() - eval_start
    print(f" ✓ ({eval_duration:.1f}s)")

    # 4. Regression check
    print("  Step 4/6: Checking for regressions...", end="", flush=True)
    regression_ok, regression_msg = regression_check(asset_id, eval_summary["pass_rate"])
    print(" ✓")

    # 5. Create release_candidate.json
    print("  Step 5/6: Creating release candidate...", end="", flush=True)
    asset_dir = base_dir / "asset_bundles" / asset_id
    release_candidate = {
        "asset_id": asset_id,
        "asset_name": case["name"],
        "version": "0.1-draft",
        "approval_state": "pending",
        "required_review_roles": [
            "Subject matter expert",
            "Quality assurance lead",
            "Governance officer"
        ],
        "known_limitations": [
            f"Evaluated with only {NUM_EVAL_CASES} synthetic cases (not comprehensive)",
            "No external validation or peer review",
            "Template may require customization for specific use cases",
            "Not verified against authoritative sources"
        ],
        "test_summary": {
            "total_cases": eval_summary["total_cases"],
            "passed": eval_summary["passed"],
            "pass_rate": eval_summary["pass_rate"],
            "regression_ok": regression_ok
        },
        "version_bump_suggestion": "0.1-draft → 0.1-rc1 (release candidate) after approval",
        "created_at": now_iso()
    }
    write_json(asset_dir / "release_candidate.json", release_candidate)
    print(" ✓")

    # 6. Save human-readable summary
    print("  Step 6/6: Saving deliverable summary...", end="", flush=True)
    summary_text = f"""# {case['name']} - Asset Summary

**Asset ID**: {asset_id}
**Version**: 0.1-draft
**Status**: Pending approval

## Purpose
{case['purpose']}

## Evaluation Results
- Total test cases: {eval_summary['total_cases']}
- Passed: {eval_summary['passed']}
- Pass rate: {eval_summary['pass_rate']:.2%}
- Regression check: {regression_msg}

## Approval Status
- State: Pending
- Required reviews: Subject matter expert, QA lead, Governance officer

## Known Limitations
- Evaluated with {NUM_EVAL_CASES} synthetic cases (not comprehensive)
- No external validation
- Requires customization for production use
- Not verified against authoritative sources

## Next Steps
1. Human review of asset bundle components
2. Subject matter expert validation
3. Quality assurance sign-off
4. Governance approval for release
5. Version bump to 0.1-rc1 if approved

---
Generated: {now_iso()}
"""
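    # Explicit UTF-8 keeps the summary readable regardless of the platform default encoding
    with open(base_dir / "deliverables" / f"{case['name'].replace(' ', '_').lower()}_summary.txt", 'w', encoding='utf-8') as f:
        f.write(summary_text)
    print(" ✓")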

    # Update approvals_log
    approvals_log = read_json(base_dir / "logs" / "approvals_log.json")
    approvals_log["approvals"].append({
        "timestamp": now_iso(),
        "asset_id": asset_id,
        "asset_name": case["name"],
        "approval_type": "release_candidate",
        "status": "pending",
        "required_approvers": release_candidate["required_review_roles"]
    })
    write_json(base_dir / "logs" / "approvals_log.json", approvals_log)

    # Store for summary table
    results_summary.append({
        "asset_name": case["name"],
        "asset_id": asset_id,
        "pass_rate": eval_summary["pass_rate"],
        "regression_ok": regression_ok,
        "approval_state": "pending"
    })

    case_duration = time.time() - case_start
    print(f"\n  Asset complete in {case_duration:.1f}s")

total_duration = time.time() - start_time

# Print summary table
print("\n" + "="*80)
print("ASSET CREATION SUMMARY")
print("="*80)
print(f"{'Asset Name':<35} {'Pass Rate':<12} {'Regression':<12} {'Approval'}")
print("-"*80)
for r in results_summary:
    regression_symbol = "✓" if r["regression_ok"] else "✗"
    print(f"{r['asset_name']:<35} {r['pass_rate']:>10.1%}  {regression_symbol:>10}  {r['approval_state']}")
print("="*80)

print(f"\n✓ All 4 Asset Bundles created and evaluated in {total_duration:.1f}s ({total_duration/60:.1f} min)")
print(f"✓ Total API calls made: {NUM_EVAL_CASES * 4}")
print(f"\n📁 Outputs:")
print(f"  - Asset bundles: {base_dir / 'asset_bundles'}")
print(f"  - Deliverables: {base_dir / 'deliverables'}")
print(f"  - Logs: {base_dir / 'logs'}")
print(f"\n⚠️  Note: Each asset evaluated with {NUM_EVAL_CASES} cases")
print(f"   Change NUM_EVAL_CASES to 5 for comprehensive evaluation")

ASSET BUNDLE CREATION AND EVALUATION
Creating 4 assets with 3 evaluation cases each
Estimated time: 6.0-12.0 minutes


[1/4] Market Entry Memo Shell
------------------------------------------------------------
  Step 1/6: Creating asset bundle...✓ Asset Bundle created: asset_market_entry_memo_shell_f9b3c41f
  Location: /content/ai_consulting_ch4_runs/run_20260120_132001_54ba806b/asset_bundles/asset_market_entry_memo_shell_f9b3c41f
  Files created:
    - prompt_pack.txt
    - template.txt
    - asset_spec.json
 ✓
  Step 2/6: Generating 3 evaluation cases... ✓
  Step 3/6: Running evaluation (3 API calls)...  Evaluating 3 cases... ✓
 ✓ (50.2s)
  Step 4/6: Checking for regressions... ✓
  Step 5/6: Creating release candidate... ✓
  Step 6/6: Saving deliverable summary... ✓

  Asset complete in 50.3s

[2/4] Cost Transformation Workplan
------------------------------------------------------------
  Step 1/6: Creating asset bundle...✓ Asset Bundle created: asset_cost_transfor

##10.AUDIT BUNDLE

###10.1.OVERVIEW

**Cell 10: Creating the Audit Package**

The final cell packages everything into a complete, auditable deliverable that you can archive, share with stakeholders, or present to governance committees. This isn't just zipping up files - we're creating comprehensive documentation that explains what was created, how to use it, and what happens next.

**The audit readme document**

We generate an extensive readme file that serves as the user manual for the entire package. It starts with run information - the unique identifier, timestamp, model configuration, and a hash of the configuration settings. If someone later questions whether results were produced with approved configurations, they can recompute the hash from the approved settings and compare it with the recorded value.
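
As a minimal sketch of how such a configuration hash might be computed, the snippet below serializes a settings dictionary to canonical JSON and hashes it with SHA-256. The notebook's own `stable_config_hash()` is defined in an earlier cell and may work differently; the function name `config_fingerprint` and the example settings are illustrative assumptions.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a SHA-256 hex digest of a config dict (hypothetical helper)."""
    # sort_keys plus fixed separators give byte-identical JSON for equal
    # settings, regardless of the order the keys were inserted in.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values for illustration only, not the notebook's real settings.
approved = {"model": "example-model", "temperature": 0.2, "max_tokens": 4096}
recorded = config_fingerprint(approved)

# The same settings in a different key order yield the same fingerprint.
reordered = {"max_tokens": 4096, "model": "example-model", "temperature": 0.2}
print(config_fingerprint(reordered) == recorded)

# Any changed setting yields a different fingerprint.
changed = config_fingerprint({**approved, "temperature": 0.7})
print(changed == recorded)
```

A matching hash shows that two sets of settings are identical. It does not, by itself, show that those settings were the ones used for every API call, which is why the call log matters as well.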

The readme then walks through every artifact in the package, explaining what each file contains and why it matters. For someone unfamiliar with Level 4 processes, this documentation is essential. They can understand that prompts_log contains hashed records of API calls, that regression_baseline stores quality metrics, that release_candidate documents approval requirements.
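
To make the idea of hashed records concrete, here is a small sketch of an append-only JSONL log that stores a SHA-256 digest of each prompt instead of the prompt text. The helper name `log_prompt_call` and the record fields are assumptions for illustration; the notebook's actual logging code lives in an earlier cell and may record different fields.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_prompt_call(log_path: Path, prompt: str, model: str) -> dict:
    """Append one hashed record to a JSONL log (hypothetical helper)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        # Store only the digest so the log never contains the prompt text.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }
    # Mode "a" appends; existing lines are never rewritten.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "prompts_log.jsonl"
    log_prompt_call(log_file, "Draft a market entry memo outline.", "example-model")
    log_prompt_call(log_file, "Summarize the assumptions section.", "example-model")
    lines = log_file.read_text(encoding="utf-8").splitlines()
    records = [json.loads(line) for line in lines]

print(len(records))               # one record per call
print("prompt" in records[0])     # the raw prompt is not stored
```

Hashing lets a reviewer confirm that a known prompt was sent, by hashing it again and looking for the digest, without exposing confidential text in the log itself.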

**The promotion workflow**

Most importantly, the readme provides step-by-step instructions for promoting assets from draft status to production release. It defines five phases - subject matter expert review, quality assurance testing, governance approval, version management, and deployment. Each phase has specific deliverables and decision criteria.

This is crucial because creating the asset is only the beginning. The real work happens in validation and approval. Without clear guidance, assets languish in draft status because nobody knows what's required to move them forward. The readme removes that ambiguity.
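
One way to keep the promotion rules from being skipped is to encode them as a small gate function. The sketch below is an assumption about how that could look, not code from the notebook: it uses the three review roles and the draft, release candidate, and released versions named in the readme, while the function `promote` and its parameters are hypothetical.

```python
REQUIRED_ROLES = {
    "Subject matter expert",
    "Quality assurance lead",
    "Governance officer",
}

# Allowed version transitions, following the draft -> rc -> released path.
NEXT_VERSION = {"0.1-draft": "0.1-rc1", "0.1-rc1": "0.1.0"}


def promote(version: str, sign_offs: set, pilot_complete: bool = False) -> str:
    """Return the next version, or raise if requirements are unmet (hypothetical)."""
    if version not in NEXT_VERSION:
        raise ValueError(f"No promotion defined from {version!r}")
    missing = REQUIRED_ROLES - sign_offs
    if missing:
        raise PermissionError(f"Missing sign-offs: {sorted(missing)}")
    # The released version additionally requires a completed production pilot.
    if version == "0.1-rc1" and not pilot_complete:
        raise PermissionError("Production pilot not completed")
    return NEXT_VERSION[version]


rc = promote("0.1-draft", REQUIRED_ROLES)
print(rc)  # 0.1-rc1

try:
    promote(rc, REQUIRED_ROLES)
except PermissionError as exc:
    print(exc)  # Production pilot not completed

released = promote(rc, REQUIRED_ROLES, pilot_complete=True)
print(released)  # 0.1.0
```

Raising an exception rather than returning a flag means a caller cannot silently ignore a missing sign-off.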

**Prominent limitations**

We're very explicit about what this run did not do. It used only synthetic test cases, not real scenarios. It evaluated just three to five cases per asset, not comprehensive coverage. No human experts reviewed the outputs. No external validation occurred. These limitations don't make the assets worthless, but they must inform how you use them.

Being honest about limitations builds trust. If you claim assets are production-ready when they've only been tested synthetically, you're setting up stakeholders for disappointment. If you clearly state current limitations and what additional validation is needed, you enable informed decisions.

**Package inventory and validation**

We create a formal inventory listing total files, directories, assets created, and API calls made. This provides accountability - you can check that the package is complete and that its counts match what the run produced. Counts alone cannot reveal edited file contents; detecting that would require per-file hashes. The validation checklist confirms that the key governance artifacts exist.

The directory tree visualization shows the package structure at a glance. Even non-technical stakeholders can see there are asset bundles, deliverables, and comprehensive logs. The file count and archive size help with storage planning and compliance documentation.
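
The inventory in this notebook records counts. A complementary approach, sketched here as an assumption rather than something the notebook does, is to record a SHA-256 digest for every file so that later edits can be detected. The helper names `build_manifest` and `changed_files` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict:
    """Map each file's relative path to its SHA-256 digest (hypothetical helper)."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def changed_files(root: Path, manifest: dict) -> list:
    """Return paths that were added, removed, or modified since the manifest."""
    current = build_manifest(root)
    return sorted(
        p for p in set(manifest) | set(current)
        if manifest.get(p) != current.get(p)
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "logs").mkdir()
    (root / "run_manifest.json").write_text("{}", encoding="utf-8")
    (root / "logs" / "risk_log.json").write_text("[]", encoding="utf-8")

    baseline = build_manifest(root)
    untouched = changed_files(root, baseline)

    # Simulate an edit made after packaging.
    (root / "logs" / "risk_log.json").write_text("[1]", encoding="utf-8")
    after_edit = changed_files(root, baseline)

print(untouched)   # []
print(after_edit)  # ['logs/risk_log.json']
```

For this to be useful as evidence, the manifest would need to be stored outside the directory it describes, or somewhere reviewers trust, since anyone who can edit the files could otherwise regenerate it.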

**Actionable next steps**

Rather than ending with a generic success message, we provide concrete guidance. Download the zip archive. Review the audit readme. Inspect individual asset bundles. Check the deliverables folder for summaries. If you want to promote assets to production, here's what you need to do differently - increase evaluation cases, run subject matter expert validation, obtain formal approvals.

This transforms the notebook from a technical exercise into a business process with clear handoffs and accountability.

###10.2.CODE AND IMPLEMENTATION

# Cell 10: Bundle + AUDIT_README + Zip

import shutil

print("="*80)
print("FINAL PACKAGING AND AUDIT TRAIL")
print("="*80 + "\n")

# Step 1: Create AUDIT_README.txt
print("Step 1/4: Creating AUDIT_README.txt...", end="", flush=True)

audit_readme = f"""# AI-Assisted Consulting - Level 4 (Innovators) - Audit Package

## Run Information
- Run ID: {run_name}
- Created: {now_iso()}
- Model: {MODEL}
- Temperature: {TEMPERATURE}
- Max Tokens: {MAX_TOKENS}
- Config Hash: {stable_config_hash()}

## Purpose
This package contains a complete audit trail for a Level 4 (Innovators) AI-assisted
consulting session focused on creating reusable internal assets with evaluation
and controlled release processes.

**Key principle**: Level 4 is NOT about "smarter AI advice" - it's about creating
governance-first, reusable assets that can be deployed at scale with appropriate
quality controls.

## What's in This Package

### Core Governance Artifacts
1. **run_manifest.json** - Session metadata, model config, environment fingerprint
2. **logs/prompts_log.jsonl** - Append-only log of all API calls (hashed prompts only)
3. **logs/risk_log.json** - All auto-detected and flagged risks
4. **logs/verification_register.json** - Items requiring human verification
5. **logs/change_log.json** - All asset creation and modification events
6. **logs/approvals_log.json** - Approval workflows and status
7. **logs/exception_log.json** - Parsing failures and policy violations

### Level 4 Specific Artifacts
8. **asset_registry.json** - Registry of all created assets in this run
9. **asset_bundles/<asset_id>/** - Complete Asset Bundles (see structure below)
10. **logs/evaluation_harness_log.jsonl** - Evaluation run records
11. **regression_baseline.json** - Quality baseline for regression testing

### Asset Bundle Structure
Each asset bundle folder contains:
- **asset_spec.json** - Metadata, scope, prohibited uses, version
- **prompt_pack.txt** - Reusable prompt template with guardrails
- **template.txt** - Deliverable shell with labeled sections
- **evaluation_plan.json** - Test definitions and thresholds
- **eval_cases.json** - {NUM_EVAL_CASES} synthetic test cases (non-confidential)
- **eval_results.json** - Pass/fail outcomes per test
- **release_candidate.json** - Approval checklist and known limitations

### Deliverables
12. **deliverables/** - Human-readable summaries for each asset

## Assets Created in This Run
1. **Market Entry Memo Shell** - Reusable market entry analysis template
2. **Cost Transformation Workplan** - Reusable cost reduction workplan template
3. **IC Pre-read Shell** - Reusable investment committee pre-read template
4. **RACI + Cadence Shell** - Reusable RACI matrix and meeting cadence template

## Evaluation Summary
- Evaluation cases per asset: {NUM_EVAL_CASES}
- Total API calls: {NUM_EVAL_CASES * 4}
- All assets: DRAFT status (version 0.1-draft)

## Reproducibility
To reproduce this run:
1. Use the same model and parameters from run_manifest.json
2. Use eval_cases.json for each asset (synthetic inputs)
3. Compare results against regression_baseline.json
4. All prompts are logged as SHA-256 hashes for verification

## How to Promote Assets from Draft to Released

### Approval Workflow (Required for Each Asset)

**Phase 1: Subject Matter Expert Review**
- Validate asset purpose and scope alignment
- Confirm prohibited uses are appropriate
- Test with non-synthetic (real but sanitized) cases
- Document additional limitations found

**Phase 2: Quality Assurance Review**
- Run evaluation harness with expanded test cases ({NUM_EVAL_CASES} → 10+ cases)
- Verify pass rates meet all thresholds in evaluation_plan.json
- Check for regressions vs. regression_baseline.json
- Validate edge cases and error handling

**Phase 3: Governance Review**
- Confirm confidentiality controls are adequate
- Validate "Not verified" disclaimers are prominent
- Approve change management plan
- Review risk log for any high-severity issues
- Verify traceability (all inputs/outputs logged)

**Phase 4: Version Management**
- 0.1-draft → 0.1-rc1 (release candidate) after all reviews
- 0.1-rc1 → 0.1.0 (released) after production pilot
- Update asset_registry.json with new version
- Log in change_log.json with reviewer sign-offs

**Phase 5: Deployment**
- Add to internal asset library/repository
- Create user documentation and training materials
- Define usage metrics and feedback mechanisms
- Establish monitoring for quality drift
- Plan periodic re-evaluation (quarterly recommended)

### Critical Reminders

⚠️ **Not Verified**: All outputs are drafts requiring human review
⚠️ **No Recommendations**: Assets must not generate rankings or "best option" language
⚠️ **Confidentiality**: Never use with client-confidential data without proper controls
⚠️ **Traceability**: All model calls are logged; all risks are flagged
⚠️ **Change Management**: Any asset modifications require re-evaluation
⚠️ **Regression Testing**: Compare new results vs. baseline before each release
⚠️ **Evaluation Scope**: Current evaluation used only {NUM_EVAL_CASES} synthetic cases

### Known Limitations of This Run
- Synthetic evaluation cases only (not tested with real client scenarios)
- Limited test coverage ({NUM_EVAL_CASES} cases per asset)
- No external validation or peer review
- No human subject matter expert review yet
- Templates may require significant customization
- Not validated against authoritative sources or industry standards

### Next Steps
1. Expand evaluation to 10+ cases per asset (including edge cases)
2. Conduct subject matter expert review for each asset
3. Test with sanitized real-world inputs (non-confidential)
4. Document additional limitations discovered
5. Obtain formal approvals before promoting to RC status
6. Establish monitoring and feedback loops post-deployment

## Questions or Issues?
Review exception_log.json for any parsing failures or policy violations encountered
during the run.

## File Integrity
- Run manifest hash: {stable_config_hash()}
- Total assets created: 4
- Total files in package: {sum(1 for _ in base_dir.rglob('*') if _.is_file())}
- Package created: {now_iso()}

---
Generated by: AI-Assisted Consulting Level 4 (Innovators)
Model: {MODEL}
Chapter: 4 - Level 4 (Innovators)
"""

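audit_readme_path = base_dir / "AUDIT_README.txt"
# Explicit UTF-8 so the non-ASCII symbols in the readme are written correctly on any platform
with open(audit_readme_path, 'w', encoding='utf-8') as f:
    f.write(audit_readme)

print(" ✓")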

# Step 2: Create package inventory
print("Step 2/4: Creating package inventory...", end="", flush=True)

inventory = {
    "run_id": run_name,
    "created_at": now_iso(),
    "total_files": sum(1 for _ in base_dir.rglob('*') if _.is_file()),
    "total_directories": sum(1 for _ in base_dir.rglob('*') if _.is_dir()),
    "assets_created": len(results_summary),
    "evaluation_cases_per_asset": NUM_EVAL_CASES,
    "total_api_calls": NUM_EVAL_CASES * 4,
    "config_hash": stable_config_hash()
}
write_json(base_dir / "package_inventory.json", inventory)
print(" ✓")

# Step 3: Create zip bundle
print("Step 3/4: Creating zip archive...", end="", flush=True)
zip_name = f"{run_name}_complete"
zip_path = Path(f"/content/{zip_name}")
shutil.make_archive(str(zip_path), 'zip', base_dir)
zip_size_mb = zip_path.with_suffix('.zip').stat().st_size / (1024 * 1024)
print(f" ✓ ({zip_size_mb:.2f} MB)")

# Step 4: Generate final checklist
print("Step 4/4: Validating package contents...", end="", flush=True)

checklist_items = [
    ("run_manifest.json", (base_dir / "run_manifest.json").exists()),
    ("asset_registry.json", (base_dir / "asset_registry.json").exists()),
    ("AUDIT_README.txt", audit_readme_path.exists()),
    ("package_inventory.json", (base_dir / "package_inventory.json").exists()),
    (f"{len(results_summary)} Asset Bundles", len(list((base_dir / "asset_bundles").iterdir())) >= len(results_summary)),
    ("Evaluation harness log", (base_dir / "logs" / "evaluation_harness_log.jsonl").exists()),
    ("Regression baseline", (base_dir / "regression_baseline.json").exists()),
    ("Change log", (base_dir / "logs" / "change_log.json").exists()),
    ("Approvals log", (base_dir / "logs" / "approvals_log.json").exists()),
    ("Risk log", (base_dir / "logs" / "risk_log.json").exists()),
    ("Zip bundle", zip_path.with_suffix('.zip').exists())
]

all_passed = all(status for _, status in checklist_items)
print(" ✓" if all_passed else " ⚠")

# Print final summary
print("\n" + "="*80)
print("PACKAGE CONTENTS")
print("="*80 + "\n")

# Show directory tree (simplified)
def count_files_in_dir(path):
    return sum(1 for _ in path.rglob('*') if _.is_file())

print(f"📦 {base_dir.name}/")
print(f"   ├── AUDIT_README.txt")
print(f"   ├── run_manifest.json")
print(f"   ├── package_inventory.json")
print(f"   ├── asset_registry.json ({len(results_summary)} assets)")
print(f"   ├── regression_baseline.json")
print(f"   ├── asset_bundles/ ({len(results_summary)} bundles)")
for i, result in enumerate(results_summary, 1):
    asset_id_short = result['asset_id'][:30] + "..."
    print(f"   │   ├── {asset_id_short}")
print(f"   ├── deliverables/ ({len(results_summary)} summaries)")
print(f"   └── logs/")
print(f"       ├── prompts_log.jsonl")
print(f"       ├── evaluation_harness_log.jsonl")
print(f"       ├── risk_log.json")
print(f"       ├── change_log.json")
print(f"       ├── approvals_log.json")
print(f"       ├── verification_register.json")
print(f"       └── exception_log.json")

print("\n" + "="*80)
print("VALIDATION CHECKLIST")
print("="*80 + "\n")

for item, status in checklist_items:
    symbol = "✓" if status else "✗"
    print(f"  {symbol} {item}")

print("\n" + "="*80)
print("PACKAGE SUMMARY")
print("="*80 + "\n")

print(f"📊 Statistics:")
print(f"   • Total files: {inventory['total_files']}")
print(f"   • Total directories: {inventory['total_directories']}")
print(f"   • Assets created: {inventory['assets_created']}")
print(f"   • API calls made: {inventory['total_api_calls']}")
print(f"   • Evaluation cases per asset: {NUM_EVAL_CASES}")
print(f"   • Archive size: {zip_size_mb:.2f} MB")

print(f"\n📁 Locations:")
print(f"   • Run directory: {base_dir}")
print(f"   • Zip archive: {zip_path}.zip")

print(f"\n⚠️  Critical Reminders:")
print(f"   • All assets are DRAFTS (version 0.1-draft)")
print(f"   • Human review and approval required before production use")
print(f"   • Each asset evaluated with only {NUM_EVAL_CASES} synthetic cases")
print(f"   • Expand to 10+ cases and real scenarios before release")
print(f"   • See AUDIT_README.txt for complete promotion workflow")

print("\n" + "="*80)
print("✓ NOTEBOOK COMPLETE")
print("="*80)
print(f"\n🎯 Next Actions:")
print(f"   1. Download: {zip_path}.zip")
print(f"   2. Review: AUDIT_README.txt for promotion process")
print(f"   3. Inspect: asset_bundles/ for each asset's components")
print(f"   4. Check: deliverables/ for human-readable summaries")
print(f"   5. Validate: logs/ for complete audit trail")
print(f"\n💡 To promote assets to production:")
print(f"   • Increase NUM_EVAL_CASES to 10+ in Cell 9")
print(f"   • Run subject matter expert validation")
print(f"   • Obtain formal approvals per AUDIT_README")
print(f"   • Update version to 0.1-rc1, then 0.1.0")

FINAL PACKAGING AND AUDIT TRAIL

Step 1/4: Creating AUDIT_README.txt... ✓
Step 2/4: Creating package inventory... ✓
Step 3/4: Creating zip archive... ✓ (0.06 MB)
Step 4/4: Validating package contents... ✓

PACKAGE CONTENTS

📦 run_20260120_132001_54ba806b/
   ├── AUDIT_README.txt
   ├── run_manifest.json
   ├── package_inventory.json
   ├── asset_registry.json (4 assets)
   ├── regression_baseline.json
   ├── asset_bundles/ (4 bundles)
   │   ├── asset_market_entry_memo_shell_...
   │   ├── asset_cost_transformation_work...
   │   ├── asset_ic_pre-read_shell_234bd4...
   │   ├── asset_raci_+_cadence_shell_3d7...
   ├── deliverables/ (4 summaries)
   └── logs/
       ├── prompts_log.jsonl
       ├── evaluation_harness_log.jsonl
       ├── risk_log.json
       ├── change_log.json
       ├── approvals_log.json
       ├── verification_register.json
       └── 

##11.CONCLUSIONS

**Conclusion: From Assets to Organizational Capability**

You've now completed the full Level 4 journey. You've built four Asset Bundles from scratch, implemented comprehensive evaluation systems, created regression baselines, generated audit trails, and packaged everything into professional deliverables. More importantly, you've internalized a fundamentally different way of thinking about AI in professional services.

**What You've Actually Built**

Let's be precise about what exists in your deliverable package. You have four asset bundles, each containing seven components - a formal specification, a reusable prompt pack, a structured template, an evaluation plan, synthetic test cases, test results, and a release candidate checklist. You have governance logs tracking every API call, every risk identified, every change made, and every approval required. You have regression baselines establishing quality standards. You have comprehensive documentation explaining how everything works and what happens next.

What you do not have is production-ready assets. I want to emphasize this clearly because the temptation to skip ahead is strong. These assets have been tested against only three to five synthetic cases each. No subject matter expert has validated them. No real user has attempted to apply them to actual work. No edge cases have been explored. The evaluation coverage is minimal by any professional standard.

This is intentional. The notebook teaches you the methodology for creating production assets, not the shortcut to avoiding proper validation. If you deployed these assets broadly tomorrow, you would likely encounter failures, edge cases, and limitations that synthetic testing didn't reveal. Those failures would damage trust in AI-assisted work more than never creating the assets at all.

**The Real Value Proposition**

The true value of Level 4 isn't the specific assets you created today. It's the repeatable process you now understand. You can see what comprehensive asset development looks like - the testing required, the documentation needed, the approval workflows essential, the governance infrastructure demanded. You have working code that implements these patterns, which you can adapt to your specific context.

Consider what happens next in your organization. Perhaps you identify a high-value use case - let's say standardizing how your firm analyzes competitive positioning for healthcare clients. You've done this analysis dozens of times. The pattern is consistent enough to systematize, but complex enough to benefit from AI assistance.

You can now approach this systematically. You'll define the asset's scope and boundaries - what it does, what it explicitly avoids, who can use it, what inputs are required. You'll create the prompt pack encoding your firm's methodology. You'll generate comprehensive test cases covering common scenarios and known edge cases. You'll run evaluation harnesses measuring quality across multiple dimensions. You'll document limitations honestly. You'll establish approval workflows involving the right stakeholders.

This is dramatically different from someone casually creating a ChatGPT prompt, getting a good result once, and encouraging everyone to use it. That approach fails at scale because it lacks testing, lacks governance, and lacks accountability. Your approach succeeds because it applies professional development standards to AI asset creation.

**The Organizational Conversation**

Level 4 also equips you for crucial conversations with leadership about AI adoption. Many executives oscillate between two extremes - either dismissing AI as unreliable hype or expecting it to magically solve complex problems. Both positions miss the nuance.

You can now explain the middle ground. AI can create significant value through reusable assets, but only with appropriate investment in development and governance. You can show concrete examples of what proper asset development looks like. You can demonstrate evaluation methodologies that measure quality objectively. You can present audit trails proving compliance and accountability. You can articulate realistic timelines and resource requirements.

When someone asks how long it takes to create a production asset, you have an informed answer. Initial development might take days - defining scope, creating prompts, building evaluation cases, running initial tests. Validation and refinement might take weeks - subject matter expert review, expanded testing, real-world piloting, iteration based on feedback. Deployment infrastructure might take months - training users, establishing monitoring, creating support processes, building feedback loops.

These timelines seem long compared to "I got a good ChatGPT response in five minutes." But they're short compared to traditional methodology development, which often takes years and relies entirely on expert judgment rather than systematic testing. Level 4 represents the realistic middle ground between casual experimentation and traditional approaches.

**Common Pitfalls to Avoid**

Having taught this material to hundreds of professionals, I can predict where you're likely to encounter challenges. The first pitfall is evaluation shortcuts. You'll be tempted to reduce test cases, accept lower pass rates, or skip regression testing. Resist this. Quality standards exist for a reason. If an asset only passes seventy percent of cases, it needs refinement before deployment, not deployment with fingers crossed.
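
To make that standard harder to bypass, the pass-rate decision can be written down as an explicit gate. The following is a minimal sketch under stated assumptions: the function `release_gate`, the 90% minimum, and the 5-point regression tolerance are illustrative choices, not values taken from the notebook's evaluation plan or its `regression_check` function.

```python
def release_gate(passed: int, total: int, baseline_pass_rate: float,
                 min_pass_rate: float = 0.9, tolerance: float = 0.05):
    """Return (ok, message) for an evaluation run (hypothetical helper).

    min_pass_rate and tolerance are illustrative defaults, not values
    taken from the notebook's evaluation_plan.json.
    """
    if total <= 0:
        return False, "No evaluation cases were run"
    pass_rate = passed / total
    # Absolute floor: the asset must clear the minimum regardless of history.
    if pass_rate < min_pass_rate:
        return False, f"Pass rate {pass_rate:.0%} is below threshold {min_pass_rate:.0%}"
    # Relative check: the asset must not fall meaningfully below its baseline.
    if pass_rate < baseline_pass_rate - tolerance:
        return False, f"Regression: {pass_rate:.0%} vs baseline {baseline_pass_rate:.0%}"
    return True, f"Pass rate {pass_rate:.0%} meets threshold and baseline"


# An asset passing 7 of 10 cases is held back for refinement.
print(release_gate(7, 10, baseline_pass_rate=0.7))

# An asset passing 19 of 20 cases with a 95% baseline clears the gate.
print(release_gate(19, 20, baseline_pass_rate=0.95))
```

Keeping the thresholds as named parameters means any decision to lower them is visible in code review rather than made quietly at release time.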

The second pitfall is scope creep. You'll want assets to do more - generate recommendations, make decisions, provide authoritative answers. But broader scope means higher risk and harder governance. Keep assets focused on specific, testable capabilities. Let humans handle judgment-intensive work. This constraint might feel limiting, but it's what makes assets deployable safely.

The third pitfall is bypassing approval workflows. When you've invested significant effort creating an asset, you want to see it used immediately. But releasing unapproved assets undermines the entire governance framework. The approval process isn't bureaucratic obstruction - it's validation that the asset actually works, serves a real need, and can be supported properly.

The fourth pitfall is treating this as a one-time exercise. Asset development requires ongoing maintenance. Models get updated. Use cases evolve. Edge cases emerge. Quality drifts. You need processes for monitoring asset performance, collecting user feedback, running periodic re-evaluation, and managing updates. Assets without maintenance become liabilities.

**The Path to Level 5**

Some of you will stop at Level 4, and that's entirely appropriate. You'll build small libraries of assets for your immediate teams, maintain them manually, and generate significant value without massive infrastructure investment. For mid-size teams working on consistent problem types, this is optimal.

Others will recognize the need for Level 5 capabilities - automated deployment systems, continuous monitoring, feedback loops, A/B testing, gradual rollouts, automated regression detection, and sophisticated version control. These capabilities enable enterprise-scale deployment but require corresponding investment in technical infrastructure and operational processes.

The transition from Level 4 to Level 5 isn't about AI sophistication. It's about operational maturity. You're moving from artisanal asset development to industrial-scale production. The same principles apply - governance, testing, documentation, approval - but automated and systematized for higher volume and velocity.

**Final Reflections**

Level 4 represents a maturity threshold in AI-assisted work. Below this level, you're experimenting with AI as a personal tool. At this level and above, you're building organizational capability. That shift requires different thinking, different standards, and different disciplines.

The governance infrastructure might feel heavy initially. All the logging, validation, regression testing, approval workflows, and documentation seem like overhead. But this infrastructure is what separates sustainable value creation from hype cycles that promise transformation but deliver chaos.

You now have both the conceptual framework and the practical tools to create AI assets responsibly. You understand what quality looks like, how to measure it, how to maintain it, and how to prove it. You can build assets that earn trust through demonstrated reliability rather than demanding trust through claimed sophistication.

The organizations that will genuinely transform their operations with AI won't be those that adopt the most advanced models or create the most impressive demonstrations. They'll be those that systematically build libraries of reliable, well-governed, properly tested assets that solve real problems consistently. You're now equipped to build those organizations.

The work continues, but you've crossed an essential threshold. Welcome to Level 4.