# 06_binding_pocket_analysis

Notebook interface for LLM-assisted binding-pocket analysis of UPO (unspecific peroxygenase) homologs, using binding-pocket descriptors, a pocket alignment, and optional reaction data.

## Python Path Setup
Make project-root imports work whether Jupyter is started from the repo root or from `notebooks/`.

In [1]:
from pathlib import Path
import os
import sys

cwd = Path.cwd().resolve()
repo_root = cwd.parent if cwd.name == "notebooks" else cwd
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
src_root = repo_root / "src"
if src_root.exists() and str(src_root) not in sys.path:
    sys.path.insert(0, str(src_root))
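
The cell above only handles two starting directories: the repo root and `notebooks/`. As a hedged alternative, the sketch below walks up from the current directory until it finds a folder containing a marker directory. The function name `find_repo_root` and the choice of `notebooks` as the marker are assumptions made for illustration; they are not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = "notebooks") -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains a `marker` directory."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        # The first directory holding the marker is treated as the repo root.
        if (candidate / marker).is_dir():
            return candidate
    return None  # No ancestor contains the marker.
```

With this helper, `find_repo_root(Path.cwd())` would also work when Jupyter is started from a subfolder of `notebooks/`, at the cost of depending on the marker directory being present.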

## Imports
Load helper functions for table loading, LLM analysis, output export, and thread persistence.

In [2]:
import importlib

import agentic_protein_design.steps.binding_pocket as bp

# Reload so edits to the helper module take effect without restarting the kernel.
bp = importlib.reload(bp)

from agentic_protein_design.core import resolve_input_path
from agentic_protein_design.core.thread_context import build_thread_context_text
from project_config.local_api_keys import OPENAI_API_KEY

analyze_pocket_profiles = bp.analyze_pocket_profiles
default_user_inputs = bp.default_user_inputs
build_prompt_with_context = bp.build_prompt_with_context
generate_llm_pocket_analysis = bp.generate_llm_pocket_analysis
save_mutation_design_proposal = bp.save_mutation_design_proposal
generate_llm_mutation_design_proposal = bp.generate_llm_mutation_design_proposal
run_llm_pocket_analysis_stages = bp.run_llm_pocket_analysis_stages
init_thread = bp.init_thread
load_input_tables = bp.load_input_tables
persist_thread_update = bp.persist_thread_update
save_binding_outputs = bp.save_binding_outputs
save_llm_analysis = bp.save_llm_analysis
setup_data_root = bp.setup_data_root

## API Key Setup
Copy the OpenAI key from `project_config/local_api_keys.py` into the `OPENAI_API_KEY` environment variable for LLM calls. The final expression reports whether the variable is set.

In [3]:
if OPENAI_API_KEY and OPENAI_API_KEY != "REPLACE_WITH_YOUR_OPENAI_API_KEY":
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

"OPENAI_API_KEY" in os.environ

True
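
The same check can be wrapped in a small function so that it is testable without touching the real environment. This is a minimal sketch: `export_api_key` is a hypothetical helper, not a function from the project, and it takes the target mapping as a parameter so a plain `dict` can stand in for `os.environ`.

```python
import os
from typing import MutableMapping, Optional

PLACEHOLDER = "REPLACE_WITH_YOUR_OPENAI_API_KEY"


def export_api_key(
    key: Optional[str],
    env: MutableMapping[str, str] = os.environ,
    name: str = "OPENAI_API_KEY",
) -> bool:
    """Copy `key` into `env` unless it is empty or still the placeholder.

    Returns True when `name` is present in `env` afterwards, mirroring the
    notebook's final `"OPENAI_API_KEY" in os.environ` check.
    """
    cleaned = (key or "").strip()
    if cleaned and cleaned != PLACEHOLDER:
        env[name] = cleaned
    return name in env
```

As in the cell above, a key that is already present in the environment is left untouched when the configured value is empty or a placeholder.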

## User Inputs
Edit all run parameters in this one cell: dataset root, thread selection, analysis options, model, and input paths.

In [4]:
root_key = "PIPS2"
existing_thread_key = "binding_pocket_llm_analysis_59353c876ab140688b1c239a15aac24e"  # or None

user_inputs = {
    "selected_positions": None,  # or a list of positions, e.g. [100, 103, 104, 107, 141, 222]
    "pairwise_comparisons": [("CviUPO", "ET096")],  # or None
    "focus_question": (
        "Identify per-protein structural interpretations and cross-homolog patterns "
        "that could explain activity/property differences."
    ),
    "design_requirements": (
        "Backbone: ET096. Goal: improve peroxygenative mono-oxidation selectivity on S82 "
        "while retaining useful activity and limiting over-oxidation to Di-Ox. "
        "Prioritize conservative, mechanistically justified mutations and a first-round panel <= 12 variants."
    ),
    "literature_context_thread_key": "literature_review_74b148fa493e4105a47dd5a54ac85b65",  # Optional: literature-review thread key
    "reaction_data_description": (
        "- Veratryl alcohol: peroxygenative\n"
        "- Naphthalene: peroxygenative\n"
        "- NBD: peroxygenative\n"
        "- ABTS: peroxidative\n"
        "- S82: mixed; Mono-Ox ~ peroxygenation-biased, Di-Ox ~ peroxidation-biased\n"
        "Use ratios (e.g. Mono-Ox : Di-Ox) to infer peroxygenation vs peroxidation balance."
    ),
    "use_reaction_data": True,
    "llm_model": "gpt-5.2",
    "llm_temperature": 0.2,
    "llm_max_rows_per_table": 300,
}

input_paths = {
    # Paths are relative to the data root from project_config.variables.address_dict[root_key].
    "binding_csv": "pdb/UPOs_peroxygenation_analysis/docked/REPS/bindingpocket_analysis.csv",
    "alignment_csv": "pdb/UPOs_peroxygenation_analysis/docked/REPS/msa/reps_ali_withDist_FILT.csv",
    "reaction_data_csv": "pdb/UPOs_peroxygenation_analysis/docked/REPS/substrate_reaction_data.csv",
}

# Optional: reset analysis options from helper defaults
# user_inputs = default_user_inputs()

## Setup Runtime Context
Initialize the data directories and the active chat thread from the values above.

In [5]:
data_root, resolved_dirs = setup_data_root(root_key)
thread, threads_preview = init_thread(root_key, existing_thread_key)
thread_id = thread["thread_id"]
data_root, thread_id

(PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data'),
 '59353c876ab140688b1c239a15aac24e')
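
The output shows that the thread ID is the trailing part of `existing_thread_key`, which matches the `<llm_process_tag>_<thread_id>` naming used for thread files. How `init_thread` derives it is not shown here; the sketch below is one plausible way to split such a key, with `split_thread_key` being a hypothetical name.

```python
from typing import Tuple


def split_thread_key(thread_key: str) -> Tuple[str, str]:
    """Split '<llm_process_tag>_<thread_id>' at the last underscore."""
    tag, sep, thread_id = thread_key.rpartition("_")
    if not sep or not tag or not thread_id:
        raise ValueError(f"expected '<tag>_<thread_id>', got {thread_key!r}")
    return tag, thread_id


tag, thread_id = split_thread_key(
    "binding_pocket_llm_analysis_59353c876ab140688b1c239a15aac24e"
)
```

Splitting at the last underscore matters because the process tag itself contains underscores.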

## Load Input Tables
Load descriptor and alignment tables, and optional reaction data, from `input_paths`.

In [6]:
binding_csv = resolve_input_path(data_root, input_paths["binding_csv"])
alignment_csv = resolve_input_path(data_root, input_paths["alignment_csv"])
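reaction_data_csv = None
# `or ""` tolerates a missing or None entry for the optional reaction-data path.
reaction_rel_path = (input_paths.get("reaction_data_csv") or "").strip()
if user_inputs.get("use_reaction_data", False) and reaction_rel_path:
    reaction_data_csv = resolve_input_path(data_root, reaction_rel_path)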

pocket, ali, reaction_df = load_input_tables(binding_csv, alignment_csv, reaction_data_csv)
binding_csv, alignment_csv, reaction_data_csv, pocket.head(3), (None if reaction_df is None else reaction_df.head(3))


(PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/pdb/UPOs_peroxygenation_analysis/docked/REPS/bindingpocket_analysis.csv'),
 PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/pdb/UPOs_peroxygenation_analysis/docked/REPS/msa/reps_ali_withDist_FILT.csv'),
 PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/pdb/UPOs_peroxygenation_analysis/docked/REPS/substrate_reaction_data.csv'),
    Unnamed: 0                    struct_name                  struct_name.1  \
 0           0                ET096_S82_glide                ET096_S82_glide   
 1           1               CviUPO_S82_glide               CviUPO_S82_glide   
 2           2  CviUPO-F88L+T158A_S82_chai1_0  CviUPO-F88L+T158A_S82_chai1_0   
 
                    struct_name.2  num_pocket_res_ali  num_pocket_res<6  \
 0                ET096_S82_glide                  38                12   
 1               CviUPO_S82_glide                  39                13   
 2  
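
The reaction-data description in the user inputs suggests using ratios such as Mono-Ox : Di-Ox to gauge the peroxygenation versus peroxidation balance. The sketch below shows that arithmetic on plain dictionaries. The column names `mono_ox` and `di_ox` and the two example rows are hypothetical and for illustration only; the real layout of `substrate_reaction_data.csv` is not shown in this notebook.

```python
from typing import Optional


def mono_to_di_ratio(mono_ox: float, di_ox: float) -> Optional[float]:
    """Mono-Ox : Di-Ox ratio; None when Di-Ox is zero and the ratio is undefined."""
    if di_ox == 0:
        return None
    return mono_ox / di_ox


# Hypothetical measurements, not taken from the dataset.
example_rows = [
    {"protein": "enzyme_A", "mono_ox": 30.0, "di_ox": 10.0},
    {"protein": "enzyme_B", "mono_ox": 5.0, "di_ox": 0.0},
]
ratios = {
    row["protein"]: mono_to_di_ratio(row["mono_ox"], row["di_ox"])
    for row in example_rows
}
```

Returning `None` rather than infinity for a zero Di-Ox signal keeps "no detectable over-oxidation" distinct from a large but finite ratio.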

## Structured Exports
Generate heuristic comparative tables and export CSVs to `processed/`.

In [7]:
selected_positions = user_inputs["selected_positions"]
interp_df, pattern_summary = analyze_pocket_profiles(pocket, ali, selected_positions)
out_interp, out_patterns = save_binding_outputs(interp_df, pattern_summary, resolved_dirs["processed"])
out_interp, out_patterns

(PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/processed/binding_pocket_interpretations.csv'),
 PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/processed/binding_pocket_pattern_summary.csv'))
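
One kind of comparison a heuristic table like this can contain is a list of aligned pocket positions where two homologs differ. The internals of `analyze_pocket_profiles` are not shown, so the following is only an illustrative sketch: `differing_positions` is a hypothetical helper, and the residues in the usage note are made up.

```python
from typing import Dict, List, Tuple


def differing_positions(
    a: Dict[int, str], b: Dict[int, str]
) -> List[Tuple[int, str, str]]:
    """List (position, residue_in_a, residue_in_b) where aligned residues differ.

    Each mapping goes from alignment position to a one-letter residue code,
    with '-' marking a gap. Positions missing from either mapping are skipped.
    """
    shared = sorted(set(a) & set(b))
    return [(pos, a[pos], b[pos]) for pos in shared if a[pos] != b[pos]]
```

For example, `differing_positions({100: "F", 103: "A", 104: "T"}, {100: "L", 103: "A", 104: "-"})` returns `[(100, "F", "L"), (104, "T", "-")]`: position 103 is conserved and therefore omitted.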

## LLM Pocket Analysis
Run the two-stage LLM analysis on the input tables, then save the combined Markdown output to `processed/`.

Prerequisite: set `OPENAI_API_KEY` in `project_config/local_api_keys.py`.

In [8]:
# Run two-stage LLM analysis (the helper prints Prompt 1/2 and Output 1/2)
stage_outputs = run_llm_pocket_analysis_stages(pocket, ali, reaction_df, user_inputs)
prompt_2_output = stage_outputs["prompt_2_output"]
llm_analysis = stage_outputs["combined_analysis"]
out_llm = save_llm_analysis(llm_analysis, resolved_dirs["processed"])

# Prompt 3 defaults (overwritten in the next cell)
mutation_design_text = ""
out_mutation_design = None
literature_context_thread_key = None
out_llm


=== Prompt 1 Sent To LLM ===

Analyse the uploaded inputs for a set of proteins to interpret how binding-pocket structure relates to catalytic activity and selectivity. 
Consider how both the proximal (<6 Å from docked ligand) and distal (up to ~11 Å from binding pocket centroid) residues affect the binding pocket environment.

INPUTS
- binding_pocket_table: extracted binding-pocket properties (per protein), calculated separately over proximal and distal residue sets where available.
- pocket_alignment_table: filtered residue alignment of pocket-proximal positions.
- reaction_data (optional): enzyme activity data on substrates.

OBJECTIVE
For each protein, integrate structural descriptors with (optional) reaction data to infer mechanistic behavior and classify pocket phenotypes.

TASKS

1) For each protein:
   - Generate a punchy tagline.
   - Provide a concise 5-6 bullet summary addressing:
        (i) proximal electrostatics  
        (ii) proximal sterics  
        (iii) distal elec

PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/processed/binding_pocket_llm_analysis.md')
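
The two-stage flow can be pictured as simple prompt chaining: the first answer becomes part of the second prompt, and both answers are joined for saving. The sketch below assumes that structure; the real prompts and chaining live inside `run_llm_pocket_analysis_stages` and are only partly visible in the output above. `run_two_stage`, the `ask` callable, the prompt wording, and the `prompt_1_output` key are all hypothetical.

```python
from typing import Callable, Dict


def run_two_stage(
    ask: Callable[[str], str], tables_text: str, focus_question: str
) -> Dict[str, str]:
    """Chain two LLM calls; `ask` is any function mapping a prompt to a reply."""
    prompt_1 = f"{focus_question}\n\nINPUTS\n{tables_text}"
    output_1 = ask(prompt_1)

    # Stage 2 sees the stage-1 answer, so it can build on the per-protein summaries.
    prompt_2 = (
        "Using the per-protein analysis below, identify residue-level drivers.\n\n"
        f"{output_1}"
    )
    output_2 = ask(prompt_2)

    return {
        "prompt_1_output": output_1,
        "prompt_2_output": output_2,
        "combined_analysis": f"{output_1}\n\n{output_2}",
    }
```

Passing the client in as a plain callable makes the chain easy to exercise with a stub function before spending tokens on a real model.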

## LLM Backbone Engineering Proposal (Prompt 3)
Use Stage-2 residue-level drivers plus optional literature-thread context to propose mutation designs under user requirements.

In [9]:
# `or ""` keeps a None value from being stringified to the literal "None".
design_requirements = str(user_inputs.get("design_requirements") or "").strip()
literature_context_thread_key = str(user_inputs.get("literature_context_thread_key") or "").strip() or None

context_result = build_thread_context_text(
    literature_context_thread_key,
    include_referenced_files=True,
    max_chars_per_file=20000,
    on_missing="warn",
)
literature_context = str(context_result.get("context_text", ""))
literature_context_bundle = context_result.get("context_bundle")

mutation_design_text = generate_llm_mutation_design_proposal(
    prompt_2_output=prompt_2_output,
    design_requirements=design_requirements,
    user_inputs=user_inputs,
    literature_context=literature_context,
)
out_mutation_design = save_mutation_design_proposal(mutation_design_text, resolved_dirs["processed"])
out_mutation_design



=== Prompt 3 Sent To LLM ===

You are designing enzyme variants for rational engineering.

You are given:
1) prompt_2_output: residue-level mechanistic analysis of binding-pocket drivers.
2) literature_context (optional): prior external context (for example literature-review thread outputs).
3) design_requirements: user-provided requirements including:
   - target backbone protein to engineer
   - engineering aims (activity/selectivity/stability/pathway bias)
   - constraints (allowed positions, mutation budget, excluded residues/motifs, expression or assay limits)

TASK
Generate a concrete mutation design proposal grounded primarily in prompt_2_output and supported by literature_context when relevant.

OUTPUT FORMAT
1) Design Intent
   - State backbone protein and explicit engineering objective.

2) Proposed Mutations (ranked)
   - Provide 5-10 proposals total.
   - Include both:
     - specific substitutions (e.g., F88L), and
     - optional position-level exploration suggestions (e

PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/processed/binding_pocket_mutation_design.md')
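
The prompt asks for specific substitutions in the `F88L` style, and structure names in the binding table combine them with `+` (for example `F88L+T158A`). Before ordering a variant panel, it can help to validate such strings mechanically. The parser below is a minimal sketch under those assumptions; `Substitution` and `parse_variant` are hypothetical names, and the project may handle this differently.

```python
import re
from typing import List, NamedTuple


class Substitution(NamedTuple):
    wild_type: str
    position: int
    mutant: str


_AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter amino-acid codes
_PATTERN = re.compile(rf"^([{_AA}])([1-9]\d*)([{_AA}])$")


def parse_variant(variant: str) -> List[Substitution]:
    """Parse 'F88L' or 'F88L+T158A'; raise ValueError on malformed input."""
    substitutions = []
    for token in variant.split("+"):
        match = _PATTERN.match(token.strip())
        if match is None:
            raise ValueError(f"not a valid substitution: {token!r}")
        wild_type, position, mutant = match.groups()
        if wild_type == mutant:
            raise ValueError(f"substitution does not change the residue: {token!r}")
        substitutions.append(Substitution(wild_type, int(position), mutant))
    return substitutions
```

A follow-up check could compare each `wild_type` against the backbone sequence at that position, which would catch proposals numbered against the wrong homolog.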

## Save Thread Update
Run this final cell to append run metadata and prompt context to `chats/<llm_process_tag>_<thread_id>.json`.

In [11]:
persist_thread_update(
    root_key=root_key,
    thread_id=thread_id,
    user_inputs=user_inputs,
    input_paths=input_paths,
    selected_positions=selected_positions,
    reaction_df=reaction_df,
    out_interp=out_interp,
    out_patterns=out_patterns,
    llm_analysis_path=out_llm,
    llm_analysis_text=llm_analysis,
    mutation_design_path=out_mutation_design,
    mutation_design_text=mutation_design_text,
    literature_context_thread_key=literature_context_thread_key,
    llm_model=str(user_inputs.get("llm_model", "")),
)

'2026-02-12T17:43:59.483130+00:00'
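
The returned value is an ISO 8601 timestamp in UTC. As a rough picture of what appending to a thread file can look like, the sketch below loads a JSON file, appends a timestamped record, and writes it back. The schema (`updates`, `updated_at`) and the name `append_thread_update` are assumptions for illustration; the actual layout written by `persist_thread_update` is not shown in this notebook.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict


def append_thread_update(path: Path, update: Dict[str, Any]) -> str:
    """Append `update` to the JSON thread file at `path`; return the UTC timestamp."""
    if path.exists():
        thread = json.loads(path.read_text(encoding="utf-8"))
    else:
        thread = {"updates": []}

    timestamp = datetime.now(timezone.utc).isoformat()
    thread.setdefault("updates", []).append({"updated_at": timestamp, **update})

    path.parent.mkdir(parents=True, exist_ok=True)
    # default=str lets non-JSON values such as Path objects be stored as text.
    path.write_text(json.dumps(thread, indent=2, default=str), encoding="utf-8")
    return timestamp
```

Appending rather than overwriting preserves the history of runs within one thread, so later notebooks can read earlier outputs as context.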