# 01_design_strategy_planning

Notebook UI for planning a multi-round enzyme design workflow using user constraints and optional prior literature-review context.

## Python Path Setup
Ensure project-root imports work whether Jupyter starts from repo root or `notebooks/`.

In [1]:
from pathlib import Path
import os
import sys

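# Resolve the repo root: one level up when Jupyter was launched from notebooks/.
cwd = Path.cwd().resolve()
repo_root = cwd.parent if cwd.name == "notebooks" else cwd
# Make the repo root importable.
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
# Also expose src/ (if present) so packages under it can be imported directly.
src_root = repo_root / "src"
if src_root.exists() and str(src_root) not in sys.path:
    sys.path.insert(0, str(src_root))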

## Imports
Load helper functions for setup, optional literature-context loading, planning prompt generation, and thread persistence.

In [2]:
import importlib
import agentic_protein_design.steps.design_strategy_planning as dsp

dsp = importlib.reload(dsp)
from project_config.local_api_keys import OPENAI_API_KEY

default_user_inputs = dsp.default_user_inputs
init_thread = dsp.init_thread
load_literature_context = dsp.load_literature_context
generate_design_strategy_plan = dsp.generate_design_strategy_plan
reflect_and_regenerate_design_strategy_plan = dsp.reflect_and_regenerate_design_strategy_plan
design_strategy_reflection_prompt = dsp.design_strategy_reflection_prompt
save_design_strategy_plan = dsp.save_design_strategy_plan
save_design_strategy_workflow_steps = dsp.save_design_strategy_workflow_steps
persist_thread_update = dsp.persist_thread_update
setup_data_root = dsp.setup_data_root
get_step_processed_dir = dsp.get_step_processed_dir

from agentic_protein_design.core import apply_notebook_markdown_style

# Slightly smaller rendered markdown text in outputs.
apply_notebook_markdown_style(font_size_px=14, line_height=1.4)


## API Key Setup
Load the OpenAI key from `project_config/local_api_keys.py` into environment variables for LLM calls.

In [3]:
if OPENAI_API_KEY and OPENAI_API_KEY != "REPLACE_WITH_YOUR_OPENAI_API_KEY":
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

"OPENAI_API_KEY" in os.environ

True

## User Inputs
Configure dataset root, thread key, optional literature context key, and planning requirements.

In [4]:
root_key = "examples"
existing_thread_key = 'design_strategy_planning_4b3ab33f0d634e4a85ce73f259eda102'

user_inputs = {
    "enzyme_family": "unspecific peroxygenases (UPOs)",
    "seed_sequences": ["CviUPO"],
    "reactions_of_interest": "peroxygenation of aromatics",
    "substrates_of_interest": ["Veratryl alcohol", "Naphthalene", "NBD", "ABTS", "S82"],
    "application_context": "biocatalysis and green chemistry",
    "constraints": ["H2O2 tolerance", "stability", "expression host compatibility"],
    "design_type_preference": "mutants_of_backbone",
    "backbone_protein": "CviUPO",
    "library_types": [
        "targeted_mutation_set",
        "site_saturation_mutagenesis",
        "combinatorial_library",
    ],
    "num_design_rounds": 3,
    "design_targets": [
        "increase peroxygenative mono-oxidation selectivity",
        "reduce over-oxidation",
        "maintain catalytic activity",
        "maintain or improve stability",
    ],
    "use_binding_pocket_analysis_step": True,
    "available_tools": [
        "sequence database search and alignment",
        "conservation analysis",
        "Boltz-2 docking/pose assessment",
        "OpenMM/YASARA ddG_bind simulations",
        "Pythia stability prediction",
        "protein language model zero-shot scoring",
        "BoltzGen or RFdiffusion2 de novo generation",
        "supervised surrogate models with OHE/PLM embeddings",
    ],
    # Optional: key format is {tag}_{thread_id}, e.g. literature_review_<thread_id>
    "literature_context_thread_key": "literature_review_d762a72ec7f04bec9b66ccd3aac21b91",
    "llm_model": "gpt-5.2",
    "llm_temperature": 0.2,
}

# Optional: reset all fields from helper defaults
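
Before the inputs go to the planner, a quick sanity check can catch empty or mistyped fields. The following is a minimal sketch, assuming the planner relies on the keys listed in `REQUIRED_KEYS`; both `REQUIRED_KEYS` and `check_user_inputs` are hypothetical names for illustration and are not part of `dsp`.

```python
from typing import Any

# Hypothetical list of required keys, taken from the fields set in the cell above.
REQUIRED_KEYS = (
    "enzyme_family",
    "backbone_protein",
    "library_types",
    "num_design_rounds",
    "design_targets",
)


def check_user_inputs(inputs: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the inputs look usable."""
    problems = [
        f"missing or empty field: {key}"
        for key in REQUIRED_KEYS
        if not inputs.get(key)
    ]
    rounds = inputs.get("num_design_rounds")
    # bool is a subclass of int, so reject it explicitly.
    if rounds is not None and (
        isinstance(rounds, bool) or not isinstance(rounds, int) or rounds < 1
    ):
        problems.append("num_design_rounds must be a positive integer")
    return problems


example_inputs = {
    "enzyme_family": "unspecific peroxygenases (UPOs)",
    "backbone_protein": "CviUPO",
    "library_types": ["targeted_mutation_set"],
    "num_design_rounds": 3,
    "design_targets": ["reduce over-oxidation"],
}
print(check_user_inputs(example_inputs))  # prints []
```

Returning a list of problems rather than raising on the first one lets a notebook user fix every field in a single pass.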


## Setup Runtime Context
Initialize data directories and active chat thread from the values above.

In [5]:
data_root, resolved_dirs = setup_data_root(root_key)
step_processed_dir = get_step_processed_dir(resolved_dirs)
thread, threads_preview = init_thread(root_key, existing_thread_key)
thread_id = thread["thread_id"]
data_root, step_processed_dir, thread_id


(PosixPath('/Users/charmainechia/Documents/projects/agentic-protein-design/examples'),
 PosixPath('/Users/charmainechia/Documents/projects/agentic-protein-design/examples/processed/01_design_strategy_planning'),
 '4b3ab33f0d634e4a85ce73f259eda102')
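
The thread key above follows the `{tag}_{thread_id}` format, and the returned `thread_id` is the part after the last underscore. A small sketch of that split, assuming thread IDs never contain underscores (true for the hex IDs shown here); `split_thread_key` is a hypothetical helper, not a function from `dsp`.

```python
def split_thread_key(thread_key: str) -> tuple[str, str]:
    """Split '{tag}_{thread_id}' at the last underscore.

    Assumes the thread ID itself contains no underscores.
    """
    tag, sep, thread_id = thread_key.rpartition("_")
    if not sep or not tag or not thread_id:
        raise ValueError(f"expected '<tag>_<thread_id>', got {thread_key!r}")
    return tag, thread_id


parsed_tag, parsed_id = split_thread_key(
    "design_strategy_planning_4b3ab33f0d634e4a85ce73f259eda102"
)
print(parsed_tag)  # design_strategy_planning
print(parsed_id)   # 4b3ab33f0d634e4a85ce73f259eda102
```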

## Optional Literature Context
Load prior literature-review context (thread history + referenced output files) when a thread key is provided.

In [6]:
# `or ""` guards against an explicit None, which str() would turn into the string "None".
literature_context_thread_key = str(user_inputs.get("literature_context_thread_key") or "").strip() or None
context_result = load_literature_context(literature_context_thread_key, max_chars_per_file=20000)
literature_context = str(context_result.get("context_text") or "")
literature_context_bundle = context_result.get("context_bundle")
len(literature_context), context_result.get("context_error", "")

(88579, '')
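
How `load_literature_context` applies `max_chars_per_file` is not shown in this notebook. One plausible reading is a per-file character cap applied before the files are joined into a single context string. The sketch below illustrates that idea only; `build_context_text`, the section header format, and the `[truncated]` marker are all assumptions.

```python
def build_context_text(file_texts: dict[str, str], max_chars_per_file: int = 20000) -> str:
    """Join file contents into one context string, capping each file's length.

    Hypothetical illustration; the real helper may format or truncate differently.
    """
    sections = []
    for name, text in file_texts.items():
        clipped = text[:max_chars_per_file]
        # Flag clipped files so a reader (or the LLM) knows content is missing.
        marker = "\n[truncated]" if len(text) > max_chars_per_file else ""
        sections.append(f"### {name}\n{clipped}{marker}")
    return "\n\n".join(sections)


demo = build_context_text({"a.md": "x" * 10, "b.md": "short"}, max_chars_per_file=4)
print(demo)
```

A per-file cap keeps one very long document from crowding out the others, at the cost of dropping whatever sits beyond the cutoff.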

## Generate Design Strategy Plan
Run the planning LLM call with the user requirements and optional literature context, then save both the markdown strategy writeup and the workflow-steps JSON.

In [7]:
plan_outputs = generate_design_strategy_plan(
    user_inputs,
    literature_context=literature_context,
)
design_plan = plan_outputs['strategy_writeup']
workflow_steps_json = plan_outputs['workflow_steps_json']

# Save/overwrite both artifacts from initial generation.
out_design_plan = save_design_strategy_plan(design_plan, step_processed_dir)
out_workflow_steps = save_design_strategy_workflow_steps(workflow_steps_json, step_processed_dir)

{
    "workflow_step_count": len(workflow_steps_json),
    "has_strategy_writeup": bool(design_plan.strip()),
    "design_plan_path": str(out_design_plan),
    "workflow_steps_path": str(out_workflow_steps),
}


### Design Strategy LLM Call (Prompt 2 Writeup)

<details><summary>Prompt</summary>

```text

You are an expert computational protein engineer and workflow architect.

Goal:
Convert project requirements into an executable multi-step protein-design workflow.

Inputs:
- user_inputs_json: project requirements and design preferences
- literature_context (optional): prior literature synthesis and linked full-text summaries

Requirements:
1) Build an end-to-end strategy with clear phases (data, hypothesis, design, evaluation, iteration).
2) Explicitly choose and justify design mode:
   - de novo design, backbone-focused mutant design, or hybrid
3) Propose library strategy aligned to objectives:
   - exact targeted sequences, random mutagenesis, site-saturation mutagenesis (SSM), combinatorial library
4) Plan across the specified number of rounds and approximate size of each round, with clear decision gates and progression criteria.
5) Include concrete implementation details:
   - tools/models to run per step
   - expected inputs/outputs per step
   - fallback or alternative path if a step fails
6) Integrate relevant existing process modules when useful (for example binding_pocket_analysis).
7) Prioritize feasibility, information gain per round, and tractable experimental burden.

Available tool categories (non-exhaustive):
- sequence database search and alignment -> conservation analysis
- receptor-ligand docking / structure prediction (for example Boltz-2)
- ddG_bind style simulations (for example OpenMM or YASARA)
- stability prediction (for example Pythia)
- zero-shot scoring from protein language models
- de novo generation (for example BoltzGen or RFdiffusion2)
- supervised surrogate models (for example OHE/PLM embeddings + constrained acquisition)


PROJECT REQUIREMENTS SNAPSHOT
- enzyme_family: unspecific peroxygenases (UPOs)
- seed_sequences: CviUPO
- reactions_of_interest: peroxygenation of aromatics
- design_type_preference: mutants_of_backbone
- backbone_protein: CviUPO
- library_types: targeted_mutation_set; site_saturation_mutagenesis; combinatorial_library
- num_design_rounds: 3
- design_targets: increase peroxygenative mono-oxidation selectivity; reduce over-oxidation; maintain catalytic activity; maintain or improve stability
- constraints: H2O2 tolerance; stability; expression host compatibility
- available_tools: sequence database search and alignment; conservation analysis; Boltz-2 docking/pose assessment; OpenMM/YASARA ddG_bind simulations; Pythia stability prediction; protein language model zero-shot scoring; BoltzGen or RFdiffusion2 de novo generation; supervised surrogate models with OHE/PLM embeddings



PART 2: Write a concise human-readable strategy summary using:
1) the original context, and
2) the PART 1 JSON workflow.

Output contract:
- Return ONLY markdown prose.
- Keep it succinct and action-oriented (target ~350-700 words).

Required sections:
1. Overall strategy (5-8 bullets)
2. Design choices and assumptions (short)
3. Step-by-step execution summary
   - one compact paragraph per step from PART 1
   - if a step proposes using a foundational Protein Language Model or Structure-prediction model is proposed, recommend which models to use.
   - if a step proposes using a supervised Machine Learning model trained on screening data, propose the model type and input features to use (ok to provide options)
4. Decision gates and immediate next actions (short)

```
</details>

#### Response

(Full strategy shown below in compact view.)

### Human-Readable Strategy (Prompt 2)

[Rendered strategy widget output; content not captured in this export.]

### Structured Workflow Plan (Prompt 1 JSON as table)

| step_index | step_name | tools | description | python_code_preview |
| --- | --- | --- | --- | --- |
| 1 | `ingest_seed_and_build_homolog_msa` | sequence database search and alignment, conser... | Loads/creates the CviUPO seed FASTA, performs ... | `import os, json from pathlib import Path # In...` |
| 2 | `build_or_fetch_3d_model_and_prepare_complexes` | Boltz-2 docking/pose assessment | Generates or fetches a CviUPO 3D model and doc... | `import os, json from pathlib import Path OUTD...` |
| 3 | `binding_pocket_analysis_and_mutation_site_sele...` | conservation analysis | Runs a binding-pocket/tunnel analysis (module)... | `import os, json, math from pathlib import Path...` |
| 4 | `round1_targeted_mutation_set_design_and_in_sil...` | protein language model zero-shot scoring, Pyth... | Designs a Round 1 targeted mutation set (singl... | `import os, json, itertools from pathlib import...` |
| 5 | `round2_ssm_design_at_best_sites_with_ddGbind_f...` | OpenMM/YASARA ddG_bind simulations, Pythia sta... | Designs Round 2 SSM at 1–2 best channel sites ... | `import os, json from pathlib import Path OUTD...` |
| 6 | `round3_combinatorial_library_and_surrogate_gui...` | supervised surrogate models with OHE/PLM embed... | Builds a combinatorial library from top mutati... | `import os, json, itertools, random from pathli...` |
| 7 | `experiment_results_ingestion_and_round_gating_...` | supervised surrogate models with OHE/PLM embed... | Ingests experimental measurements, ranks varia... | `import os, json from pathlib import Path impor...` |


{'workflow_step_count': 7,
 'has_strategy_writeup': True,
 'design_plan_path': '/Users/charmainechia/Documents/projects/agentic-protein-design/examples/processed/01_design_strategy_planning/design_strategy_planning.md',
 'workflow_steps_path': '/Users/charmainechia/Documents/projects/agentic-protein-design/examples/processed/01_design_strategy_planning/design_strategy_planning_workflow_steps.json'}
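
The saved workflow steps can be checked for basic consistency before they are reused downstream. The sketch below assumes the JSON is a list of dicts carrying the `step_index` and `step_name` fields shown in the table above; `check_workflow_steps` is a hypothetical helper, not part of `dsp`.

```python
def check_workflow_steps(steps: list[dict]) -> list[str]:
    """Return consistency problems found in a workflow-steps list (empty if none)."""
    problems = []
    # Step indices are expected to run 1..N in order, as in the table above.
    indices = [step.get("step_index") for step in steps]
    if indices != list(range(1, len(steps) + 1)):
        problems.append(f"step_index values are not 1..{len(steps)} in order: {indices}")
    names = [str(step.get("step_name") or "").strip() for step in steps]
    for position, name in enumerate(names, start=1):
        if not name:
            problems.append(f"step at position {position} has no step_name")
    non_empty = [name for name in names if name]
    if len(set(non_empty)) != len(non_empty):
        problems.append("duplicate step_name values")
    return problems


example_steps = [
    {"step_index": 1, "step_name": "ingest_seed_and_build_homolog_msa"},
    {"step_index": 2, "step_name": "build_or_fetch_3d_model_and_prepare_complexes"},
]
print(check_workflow_steps(example_steps))  # prints []
```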

## Reflect / Critique and Regenerate Plan

Use this step to critique the initial plan, incorporate optional user feedback, and rewrite a single improved final plan for saving/persistence.


In [8]:
user_feedback = {
    "plan_reflection_user_feedback": "",  # Optional: free-text critique/changes you want the LLM to apply
    "plan_reflection_prompt_override": "",  # Optional: replace default reflection prompt (part 2)
}

# `or ""` guards against an explicit None, which str() would turn into the string "None".
reflection_user_feedback = str(user_feedback.get("plan_reflection_user_feedback") or "").strip()
reflection_prompt = str(user_feedback.get("plan_reflection_prompt_override") or "").strip() or design_strategy_reflection_prompt

plan_outputs = reflect_and_regenerate_design_strategy_plan(
    user_inputs=user_inputs,
    original_prompt_1_output=workflow_steps_json,
    original_prompt_2_output=design_plan,
    literature_context=literature_context,
    user_feedback=reflection_user_feedback,
    critique_prompt=reflection_prompt,
)

# Update in-memory copies
design_plan = plan_outputs['strategy_writeup']
workflow_steps_json = plan_outputs['workflow_steps_json']

# Overwrite saved artifacts with improved outputs
out_design_plan = save_design_strategy_plan(design_plan, step_processed_dir)
out_workflow_steps = save_design_strategy_workflow_steps(workflow_steps_json, step_processed_dir)

{
    "refined_workflow_step_count": len(workflow_steps_json),
    "design_plan_path": str(out_design_plan),
    "workflow_steps_path": str(out_workflow_steps),
}


Critique and revisions summary:
- **Added explicit FASTA/config ingestion and blocking gates:** The improved plan introduces a dedicated Step 1 that validates a real CviUPO FASTA and writes a normalized project config. Structural and pocket steps now halt if a placeholder sequence is used, because the original silently accepted a placeholder, which made residue numbering and library design unreliable.
- **Re-sequenced and clarified dependencies:** Homolog search, MSA, and conservation analysis are retained but repositioned as non-blocking for early structure-based work, while structure/pose generation now depends explicitly on the validated FASTA, because the original mixed placeholder inputs with residue-indexed decisions.
- **Upgraded docking to a pose ensemble:** The improved workflow generates multiple poses per ligand (e.g., 5) rather than a single placeholder pose, because selectivity conclusions are brittle to docking uncertainty and ensemble consensus is more robust.
- **Formalized pocket analysis outputs and exclusions:** (output truncated)

### Design Strategy Reflection / Rewrite

<details><summary>Prompt</summary>

```text

You are an expert computational protein engineer and workflow architect.

Goal:
Convert project requirements into an executable multi-step protein-design workflow.

Inputs:
- user_inputs_json: project requirements and design preferences
- literature_context (optional): prior literature synthesis and linked full-text summaries

Requirements:
1) Build an end-to-end strategy with clear phases (data, hypothesis, design, evaluation, iteration).
2) Explicitly choose and justify design mode:
   - de novo design, backbone-focused mutant design, or hybrid
3) Propose library strategy aligned to objectives:
   - exact targeted sequences, random mutagenesis, site-saturation mutagenesis (SSM), combinatorial library
4) Plan across the specified number of rounds and approximate size of each round, with clear decision gates and progression criteria.
5) Include concrete implementation details:
   - tools/models to run per step
   - expected inputs/outputs per step
   - fallback or alternative path if a step fails
6) Integrate relevant existing process modules when useful (for example binding_pocket_analysis).
7) Prioritize feasibility, information gain per round, and tractable experimental burden.

Available tool categories (non-exhaustive):
- sequence database search and alignment -> conservation analysis
- receptor-ligand docking / structure prediction (for example Boltz-2)
- ddG_bind style simulations (for example OpenMM or YASARA)
- stability prediction (for example Pythia)
- zero-shot scoring from protein language models
- de novo generation (for example BoltzGen or RFdiffusion2)
- supervised surrogate models (for example OHE/PLM embeddings + constrained acquisition)


PROJECT REQUIREMENTS SNAPSHOT
- enzyme_family: unspecific peroxygenases (UPOs)
- seed_sequences: CviUPO
- reactions_of_interest: peroxygenation of aromatics
- design_type_preference: mutants_of_backbone
- backbone_protein: CviUPO
- library_types: targeted_mutation_set; site_saturation_mutagenesis; combinatorial_library
- num_design_rounds: 3
- design_targets: increase peroxygenative mono-oxidation selectivity; reduce over-oxidation; maintain catalytic activity; maintain or improve stability
- constraints: H2O2 tolerance; stability; expression host compatibility
- available_tools: sequence database search and alignment; conservation analysis; Boltz-2 docking/pose assessment; OpenMM/YASARA ddG_bind simulations; Pythia stability prediction; protein language model zero-shot scoring; BoltzGen or RFdiffusion2 de novo generation; supervised surrogate models with OHE/PLM embeddings



You are reviewing previously generated planning artifacts:
1) a human-readable strategy writeup (Prompt 2 output), and
2) a structured workflow JSON list (Prompt 1 output).

Task:
1) Critique consistency and quality across both artifacts:
   - gaps, weak assumptions, missing decision gates, feasibility risks
   - mismatches between workflow JSON steps and writeup narrative
   - missing or unclear implementation details in either artifact
2) Incorporate explicit user feedback if provided.
3) Regenerate an improved human-readable plan that is aligned with the workflow JSON.

Important output constraint:
- Return ONLY the rewritten final plan.
- Do NOT include a separate critique section.
- Do NOT include "old plan vs new plan" comparison.
- Keep the same overall output structure used for design strategy planning.

```
</details>

#### Response

(Refined strategy shown below in compact view.)

### Refined Strategy (After Reflection)

[Rendered refined-strategy widget output; content not captured in this export.]

### Structured Workflow Plan (Prompt 1 JSON as table)

| step_index | step_name | tools | description | python_code_preview |
| --- | --- | --- | --- | --- |
| 1 | `ingest_seed_sequence_and_project_config` |  | Create/validate the backbone FASTA and a norma... | `import os, json, re from pathlib import Path ...` |
| 2 | `homolog_search_msa_and_conservation_map` | sequence database search and alignment, conser... | Run homolog search + MSA and compute per-posit... | `import os, json from pathlib import Path impor...` |
| 3 | `structure_modeling_and_ligand_pose_ensemble` | Boltz-2 docking/pose assessment | Build/fetch a CviUPO structure model and gener... | `import os, json from pathlib import Path INPU...` |
| 4 | `binding_pocket_analysis_site_nomination_and_ex...` | conservation analysis | Nominate mutation sites using pocket/tunnel an... | `import os, json from pathlib import Path OUTD...` |
| 5 | `round1_targeted_single_mutant_library_design_a...` | protein language model zero-shot scoring, Pyth... | Design a targeted single-mutant library (~24–4... | `import os, json, itertools from pathlib import...` |
| 6 | `ingest_round1_experimental_results_and_select_...` |  | Load Round 1 experimental results (CSV) and se... | `import os, json from pathlib import Path impor...` |
| 7 | `round2_ssm_library_design_with_stability_and_b...` | OpenMM/YASARA ddG_bind simulations, Pythia sta... | Design Round 2 SSM at 1–2 sites (full 19×sites... | `import os, json from pathlib import Path OUTD...` |
| 8 | `ingest_round2_results_train_surrogate_and_defi...` | supervised surrogate models with OHE/PLM embed... | Ingest Round 1–2 results, define a 3-site (max... | `import os, json, itertools from pathlib import...` |
| 9 | `round3_surrogate_guided_combinatorial_library_...` | supervised surrogate models with OHE/PLM embed... | Enumerate the Round 3 combinatorial space and ... | `import os, json, itertools from pathlib import...` |


{'refined_workflow_step_count': 9,
 'design_plan_path': '/Users/charmainechia/Documents/projects/agentic-protein-design/examples/processed/01_design_strategy_planning/design_strategy_planning.md',
 'workflow_steps_path': '/Users/charmainechia/Documents/projects/agentic-protein-design/examples/processed/01_design_strategy_planning/design_strategy_planning_workflow_steps.json'}
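
The Round 2 step above sizes its site-saturation library as "19×sites": each chosen position is replaced by every canonical amino acid except the wild type. The sketch below enumerates such a library under that assumption. The toy sequence is a placeholder, not the CviUPO sequence, and `site_saturation_variants` is a hypothetical helper rather than code from the generated workflow.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def site_saturation_variants(sequence: str, positions: list[int]) -> list[str]:
    """List single-mutant labels (e.g. 'A4C') for every non-wild-type residue at each site.

    Positions are 1-based, matching the usual mutation notation.
    """
    variants = []
    for pos in positions:
        if not 1 <= pos <= len(sequence):
            raise ValueError(f"position {pos} is outside the sequence (length {len(sequence)})")
        wild_type = sequence[pos - 1]
        if wild_type not in AMINO_ACIDS:
            raise ValueError(f"non-canonical residue {wild_type!r} at position {pos}")
        variants.extend(f"{wild_type}{pos}{aa}" for aa in AMINO_ACIDS if aa != wild_type)
    return variants


toy_sequence = "MKTAYIAKQR"  # placeholder sequence for illustration only
ssm_library = site_saturation_variants(toy_sequence, [4, 7])
print(len(ssm_library))  # 38, i.e. 19 substitutions at each of 2 sites
print(ssm_library[:3])   # ['A4C', 'A4D', 'A4E']
```

With one or two sites the library stays at 19 or 38 variants, which is the scale the refined plan describes for Round 2.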

## Save Thread Update
Append planning prompt/metadata to `chats/<llm_process_tag>_<thread_id>.json`.

In [10]:
persist_thread_update(
    root_key=root_key,
    thread_id=thread_id,
    user_inputs=user_inputs,
    design_plan_path=out_design_plan,
    design_plan_text=design_plan,
    workflow_steps_path=out_workflow_steps,
    workflow_steps_json=workflow_steps_json,
    literature_context_thread_key=literature_context_thread_key,
)


'2026-02-20T16:08:09.150567+00:00'
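
The returned value is an ISO 8601 timestamp in UTC. The internals of `persist_thread_update` are not shown here, so the following is only a rough sketch of an append-style update to a JSON chat file, assuming the file holds a list of records. `append_thread_update` and the record layout are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_thread_update(chat_file: Path, update: dict) -> str:
    """Append one timestamped record to a JSON list file and return the timestamp.

    Hypothetical illustration; the real thread file schema may differ.
    """
    timestamp = datetime.now(timezone.utc).isoformat()
    # Start a new history if the thread file does not exist yet.
    history = json.loads(chat_file.read_text(encoding="utf-8")) if chat_file.exists() else []
    history.append({"timestamp": timestamp, **update})
    chat_file.parent.mkdir(parents=True, exist_ok=True)
    chat_file.write_text(json.dumps(history, indent=2), encoding="utf-8")
    return timestamp
```

Calling it with a path such as `Path("chats") / f"{tag}_{thread_id}.json"` would mirror the file naming described above. Appending rather than overwriting keeps earlier planning runs available for later inspection.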