# 02 - Binding Pocket Analysis

Notebook interface for LLM-assisted binding-pocket analysis of UPO (unspecific peroxygenase) homologs, using binding-pocket descriptors, a pocket alignment, and optional reaction data as inputs.

## Python Path Setup
Ensure project-root imports work whether Jupyter starts from repo root or `notebooks/`.

In [3]:
from pathlib import Path
import os
import sys

cwd = Path.cwd().resolve()
repo_root = cwd.parent if cwd.name == "notebooks" else cwd
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
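
The check above covers the two launch locations named in this section. As a minimal sketch for notebooks started from deeper subdirectories, the root can instead be found by walking up until a marker directory appears. The function name `find_repo_root` is hypothetical, and using `notebook_helpers` as the marker is an assumption based on the package imported later in this notebook.

```python
from pathlib import Path
import sys


def find_repo_root(start: Path, marker: str = "notebook_helpers") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    # No marker found: fall back to the starting directory.
    return start


repo_root = find_repo_root(Path.cwd())
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
```

If no ancestor contains the marker, the function returns the starting directory, so the path setup still runs and any problem shows up as an import error in the next cell.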

## Imports
Load helper functions for table loading, LLM analysis, output export, and thread persistence.

In [4]:
import importlib
import notebook_helpers.nb_02_binding_pocket_analysis as bp
# Reload so edits to the helper module take effect without restarting the kernel.
bp = importlib.reload(bp)
from local_api_keys import OPENAI_API_KEY

analyze_pocket_profiles = bp.analyze_pocket_profiles
default_user_inputs = bp.default_user_inputs
build_prompt_with_context = bp.build_prompt_with_context
generate_llm_pocket_analysis = bp.generate_llm_pocket_analysis
init_thread = bp.init_thread
load_input_tables = bp.load_input_tables
persist_thread_update = bp.persist_thread_update
save_binding_outputs = bp.save_binding_outputs
save_llm_analysis = bp.save_llm_analysis
setup_data_root = bp.setup_data_root

## API Key Setup
Load the OpenAI key from `local_api_keys.py` into the `OPENAI_API_KEY` environment variable for LLM calls.

In [5]:
if OPENAI_API_KEY and OPENAI_API_KEY != "REPLACE_WITH_YOUR_OPENAI_API_KEY":
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

"OPENAI_API_KEY" in os.environ

True
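
The final expression in the cell above is `True` whenever the variable is present in the environment, including when it was set outside the notebook. A minimal sketch of making that precedence explicit is shown below. The helper `resolve_api_key` and the constant `PLACEHOLDER` are hypothetical names and not part of the notebook helpers.

```python
import os

PLACEHOLDER = "REPLACE_WITH_YOUR_OPENAI_API_KEY"


def resolve_api_key(file_key, env=None):
    """Prefer a real key from the local file; otherwise fall back to the environment."""
    env = os.environ if env is None else env
    if file_key and file_key != PLACEHOLDER:
        return file_key
    # An unset or empty environment value is reported as None.
    return env.get("OPENAI_API_KEY") or None
```

A `None` return then signals that no usable key was found, which can be turned into an early error before any LLM call is made.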

## User Inputs
Edit all run parameters in this one cell: dataset root, thread selection, analysis options, model, and input paths.

In [6]:
root_key = "PIPS2"
existing_thread_id = "59353c876ab140688b1c239a15aac24e"  # or None

user_inputs = {
    "selected_positions": None,  # e.g. [100, 103, 104, 107, 141, 222]
    "focus_question": (
        "Identify per-protein structural interpretations and cross-homolog patterns "
        "that could explain activity/property differences."
    ),
    "reaction_data_description": (
        "- Veratryl alcohol: peroxygenative\n"
        "- Naphthalene: peroxygenative\n"
        "- NBD: peroxygenative\n"
        "- ABTS: peroxidative\n"
        "- S82: mixed; Mono-Ox ~ peroxygenation-biased, Di-Ox ~ peroxidation-biased\n"
        "Use ratios (e.g. Mono-Ox : Di-Ox) to infer peroxygenation vs peroxidation balance."
    ),
    "use_reaction_data": True,
    "llm_model": "gpt-5.2",
    "llm_temperature": 0.2,
    "llm_max_rows_per_table": 300,
}

input_paths = {
    # Paths are relative to the data root from variables.address_dict[root_key].
    "binding_csv": "pdb/UPOs_peroxygenation_analysis/docked/REPS/bindingpocket_analysis.csv",
    "alignment_csv": "pdb/UPOs_peroxygenation_analysis/docked/REPS/msa/reps_ali_withDist_FILT.csv",
    "reaction_data_csv": "pdb/UPOs_peroxygenation_analysis/docked/REPS/substrate_reaction_data.csv",
}

# Optional: reset analysis options from helper defaults
# user_inputs = default_user_inputs()
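
Because every later cell depends on these values, a quick sanity check can catch typos before any LLM call is made. The sketch below is illustrative only: `validate_user_inputs` is a hypothetical helper, and the limits it enforces (positive residue positions, temperature between 0 and 2) are assumptions rather than rules taken from the notebook helpers.

```python
def validate_user_inputs(inputs: dict) -> list:
    """Return a list of human-readable problems; an empty list means the inputs look sane."""
    problems = []

    positions = inputs.get("selected_positions")
    # Assumption: positions are 1-based residue numbers, or None for "use all".
    if positions is not None and not all(
        isinstance(p, int) and p > 0 for p in positions
    ):
        problems.append("selected_positions must be None or a list of positive integers")

    # Assumption: the model API accepts temperatures in the range [0, 2].
    temperature = inputs.get("llm_temperature", 0.0)
    if not 0.0 <= temperature <= 2.0:
        problems.append("llm_temperature must be between 0 and 2")

    max_rows = inputs.get("llm_max_rows_per_table")
    if not isinstance(max_rows, int) or max_rows <= 0:
        problems.append("llm_max_rows_per_table must be a positive integer")

    if not str(inputs.get("llm_model", "")).strip():
        problems.append("llm_model must be a non-empty string")

    return problems
```

Calling `validate_user_inputs(user_inputs)` right after this cell and stopping on a non-empty result keeps configuration errors separate from LLM or file-loading errors.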

## Setup Runtime Context
Initialize data directories and active chat thread from the values above.

In [7]:
data_root, resolved_dirs = setup_data_root(root_key)
thread, threads_preview = init_thread(root_key, existing_thread_id)
thread_id = thread["thread_id"]
data_root, thread_id

(PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data'),
 '59353c876ab140688b1c239a15aac24e')

## Load Input Tables
Load the binding-pocket descriptor table, the pocket alignment table, and (optionally) the reaction data from `input_paths`.

In [8]:
def resolve_input_path(path_value: str) -> Path:
    """Return absolute paths unchanged; resolve relative paths against data_root."""
    p = Path(path_value).expanduser()
    if p.is_absolute():
        return p.resolve()
    return (data_root / p).resolve()

binding_csv = resolve_input_path(input_paths["binding_csv"])
alignment_csv = resolve_input_path(input_paths["alignment_csv"])
reaction_data_csv = None
if user_inputs.get("use_reaction_data", False) and (input_paths.get("reaction_data_csv") or "").strip():
    reaction_data_csv = resolve_input_path(input_paths["reaction_data_csv"])

pocket, ali, reaction_df = load_input_tables(binding_csv, alignment_csv, reaction_data_csv)
binding_csv, alignment_csv, reaction_data_csv, pocket.head(3), (None if reaction_df is None else reaction_df.head(3))

(PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/pdb/UPOs_peroxygenation_analysis/docked/REPS/bindingpocket_analysis.csv'),
 PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/pdb/UPOs_peroxygenation_analysis/docked/REPS/msa/reps_ali_withDist_FILT.csv'),
 PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/pdb/UPOs_peroxygenation_analysis/docked/REPS/substrate_reaction_data.csv'),
    Unnamed: 0                    struct_name                  struct_name.1  \
 0           0                ET096_S82_glide                ET096_S82_glide   
 1           1               CviUPO_S82_glide               CviUPO_S82_glide   
 2           2  CviUPO-F88L+T158A_S82_chai1_0  CviUPO-F88L+T158A_S82_chai1_0   
 
                    struct_name.2  num_pocket_res_ali  num_pocket_res<6  \
 0                ET096_S82_glide                  38                12   
 1               CviUPO_S82_glide                  39                13   
 2  
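
Since the resolved paths depend on both `data_root` and `input_paths`, it can help to confirm that each file exists before handing it to the loader. The sketch below assumes nothing about `load_input_tables` itself; `missing_inputs` is a hypothetical helper.

```python
from pathlib import Path


def missing_inputs(paths: dict) -> dict:
    """Return the named paths that are set but do not point to an existing file."""
    return {
        name: path
        for name, path in paths.items()
        if path is not None and not Path(path).is_file()
    }


# Example use with the variables from the cell above (left commented so the
# sketch stays self-contained):
# missing = missing_inputs({"binding_csv": binding_csv,
#                           "alignment_csv": alignment_csv,
#                           "reaction_data_csv": reaction_data_csv})
# if missing:
#     raise FileNotFoundError(f"Missing input files: {missing}")
```

Entries set to `None` are skipped, which matches the optional reaction data input.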

## Structured Exports
Generate heuristic comparative tables and export CSVs to `processed/`.

In [8]:
selected_positions = user_inputs["selected_positions"]
interp_df, pattern_summary = analyze_pocket_profiles(pocket, ali, selected_positions)
out_interp, out_patterns = save_binding_outputs(interp_df, pattern_summary, resolved_dirs["processed"])
out_interp, out_patterns

(PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/processed/binding_pocket_interpretations.csv'),
 PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/processed/binding_pocket_pattern_summary.csv'))

## LLM Pocket Analysis
Query the LLM client with the full prompt and input tables, then save markdown output.

Prerequisite: set `OPENAI_API_KEY` in `local_api_keys.py`.

In [9]:
prompt_text = build_prompt_with_context(reaction_df, user_inputs)
print("=== Prompt Sent To LLM ===")
print(prompt_text)
print("\n=== LLM Output ===")
llm_analysis = generate_llm_pocket_analysis(pocket, ali, reaction_df, user_inputs)
print(llm_analysis)
out_llm = save_llm_analysis(llm_analysis, resolved_dirs["processed"])
out_llm

=== Prompt Sent To LLM ===

Analyse the uploaded inputs for a set of proteins to interpret how binding-pocket structure relates to catalytic activity and selectivity. 
Consider how both the proximal (<6 from docked ligand) and distal (up to 11 angstrom from binding pocket centroid) geometries and electrostatics binding pocket environment. 

INPUTS
- binding_pocket_table: table of extracted binding-pocket properties (per protein). Properties are calculated over both "distal" and "proximal" ligands, which represent the residues in the broader pocket region, and closer to the reaction coordinate, respectively.   
- pocket_alignment_table: filtered residue alignment of pocket-proximal positions
- reaction_data (optional): table or dict summarising enzyme activity on different substrates

TASKS
1. For each protein, generate a concise **3–4 bullet summary** of its binding-pocket geometry and chemistry, considering both distal and proximal effects.
2. Interpret how these properties are likely

PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/processed/binding_pocket_llm_analysis.md')
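
The `llm_max_rows_per_table` setting suggests that tables are capped before being placed in the prompt. How the helper module does this is not shown here, so the following is only a sketch of one way to cap and serialize a table, using plain dictionaries and the standard library. `table_to_prompt_text` is a hypothetical name.

```python
import csv
import io


def table_to_prompt_text(rows, max_rows):
    """Serialize at most `max_rows` dict rows as CSV text and note any truncation."""
    if not rows:
        return "(empty table)"
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows[:max_rows])
    text = buffer.getvalue()
    if len(rows) > max_rows:
        # Tell the model explicitly that rows were dropped.
        text += f"... ({len(rows) - max_rows} more rows omitted)\n"
    return text
```

Stating the omission in the text keeps the model from treating a truncated table as the complete dataset.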

## Save Thread Update
Run this final cell to append run metadata and prompt context to `chats/<thread_id>.json`.

In [10]:
persist_thread_update(
    root_key=root_key,
    thread_id=thread_id,
    user_inputs=user_inputs,
    input_paths=input_paths,
    selected_positions=selected_positions,
    reaction_df=reaction_df,
    out_interp=out_interp,
    out_patterns=out_patterns,
    llm_analysis_path=out_llm,
    llm_analysis_text=llm_analysis,
    llm_model=str(user_inputs.get("llm_model", "")),
)

'2026-02-11T09:08:42.160703+00:00'
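
The returned value is formatted like a UTC ISO 8601 timestamp. The internals of `persist_thread_update` are not shown in this notebook, so the sketch below only illustrates the general pattern of appending a timestamped record to a per-thread JSON file. The function `append_thread_update` and the `updates` key are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_thread_update(chats_dir: Path, thread_id: str, update: dict) -> str:
    """Append `update` to chats_dir/<thread_id>.json and return its UTC timestamp."""
    chats_dir.mkdir(parents=True, exist_ok=True)
    thread_path = chats_dir / f"{thread_id}.json"

    if thread_path.exists():
        thread = json.loads(thread_path.read_text(encoding="utf-8"))
    else:
        thread = {"thread_id": thread_id, "updates": []}

    timestamp = datetime.now(timezone.utc).isoformat()
    thread.setdefault("updates", []).append({"timestamp": timestamp, **update})

    # default=str lets Path objects and similar values serialize as plain strings.
    thread_path.write_text(
        json.dumps(thread, indent=2, default=str), encoding="utf-8"
    )
    return timestamp
```

Appending rather than overwriting keeps a history of runs per thread, so earlier prompts and output paths remain traceable.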