# 00_literature_review

Notebook interface for a retrieval-backed enzyme literature survey, followed by an LLM-generated synthesis focused on engineering decisions.

## Python Path Setup
Ensure project-root imports work whether Jupyter starts from repo root or `notebooks/`.

In [1]:
from pathlib import Path
import os
import sys

cwd = Path.cwd().resolve()
repo_root = cwd.parent if cwd.name == "notebooks" else cwd
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
src_root = repo_root / "src"
if src_root.exists() and str(src_root) not in sys.path:
    sys.path.insert(0, str(src_root))
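
The cell above assumes Jupyter starts either in the repo root or in `notebooks/`. If the notebook may also be launched from a deeper subdirectory, a marker-based search is one possible alternative. The sketch below is not part of the project: `find_repo_root` and its default marker names are assumptions chosen for illustration.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_repo_root(
    start: Path,
    markers: Iterable[str] = ("pyproject.toml", ".git", "src"),  # assumed markers
) -> Optional[Path]:
    """Walk upward from `start` and return the first directory holding any marker."""
    start = start.resolve()
    markers = tuple(markers)
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    # No marker found anywhere up to the filesystem root.
    return None
```

A caller could fall back to the two-case logic above when this returns `None`.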

## Imports
Load helper functions for setup, retrieval, prompt construction, LLM synthesis, and thread persistence.

In [2]:
import importlib
import agentic_protein_design.steps.literature_review as lr
# Reload so edits to the helper module are picked up without restarting the kernel.
lr = importlib.reload(lr)
from project_config.local_api_keys import OPENAI_API_KEY

build_literature_agent_prompt = lr.build_literature_agent_prompt
default_user_inputs = lr.default_user_inputs
generate_literature_llm_review = lr.generate_literature_llm_review
init_thread = lr.init_thread
persist_thread_update = lr.persist_thread_update
run_literature_pipeline = lr.run_literature_pipeline
save_literature_llm_review = lr.save_literature_llm_review
save_literature_outputs = lr.save_literature_outputs
setup_data_root = lr.setup_data_root

## API Key Setup
Load the OpenAI key from `project_config/local_api_keys.py` into environment variables for LLM calls.

In [8]:
if OPENAI_API_KEY and OPENAI_API_KEY != "REPLACE_WITH_YOUR_OPENAI_API_KEY":
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

"OPENAI_API_KEY" in os.environ

True
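
The check above only copies the key when it is set and is not the placeholder string. If a key may already be exported in the shell, the same decision can be written as a small function that prefers the local config and otherwise keeps whatever the environment already holds. This is a minimal sketch: `resolve_api_key` is a hypothetical helper, not something the project provides.

```python
import os
from typing import Mapping, Optional

PLACEHOLDER = "REPLACE_WITH_YOUR_OPENAI_API_KEY"


def resolve_api_key(
    local_key: Optional[str],
    env: Mapping[str, str] = os.environ,
    placeholder: str = PLACEHOLDER,
) -> Optional[str]:
    """Return the local key if it looks real, else the environment's key, else None."""
    if local_key and local_key != placeholder:
        return local_key
    # Fall back to a key that was exported before Jupyter started.
    return env.get("OPENAI_API_KEY") or None
```

As in the original cell, it is safer to display only whether a key was found than to print the key itself.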

## User Inputs
Edit all run parameters in this one cell: root and thread selection, literature prompt placeholders, retrieval settings, model choice, and root-relative input paths.

In [3]:
root_key = "PIPS2"
existing_thread_id = None

user_inputs = {
    "enzyme_family": "unspecific peroxygenases (UPOs)",
    "seed_sequences": ["CviUPO"],
    "reactions_of_interest": "peroxygenation of aromatics",
    "substrates_of_interest": ["Veratryl alcohol", "Naphthalene", "NBD", "ABTS"],
    "application_context": "biocatalysis, pharmaceutical synthesis, green chemistry",
    "constraints": ["stability", "solvent tolerance", "H2O2 tolerance", "expression host"],
    "keywords": ["peroxygenation"],
    "search_sources": [
        "UniProt",
        "InterPro",
        "PDB",
        "AlphaFold DB",
        "PubMed",
        "EuropePMC",
        "OpenAlex",
        "WebSearch",
    ],
    "literature_targets": ["bioRxiv", "Nature", "ScienceDirect", "PNAS", "ACS", "Wiley", "ChemRxiv"],
    "evidence_focus": [
        "protein_annotations",
        "structure",
        "reaction_mechanism",
        "historical_mutations",
    ],
    "max_results_per_source": 20,
    "fetch_open_access_fulltext": True,
    "fulltext_max_chars": 6000,
    "llm_model": "gpt-5.2",
    "llm_temperature": 0.2,
    "llm_max_rows_per_table": 250,
    "enable_relaxed_fallback": True,
    "min_quality_score_for_llm_context": 0.35,
}

input_paths = {
    # Paths are relative to the data root from project_config.variables.address_dict[root_key].
    "seed_sequences_file": "",
    "constraints_file": "",
}

# Optional: reset all fields from helper defaults
# user_inputs = default_user_inputs()
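
Because every later cell reads from `user_inputs`, a quick sanity check before retrieval can catch typos early. The sketch below is illustrative only: `validate_user_inputs` is a hypothetical helper, and the rules it applies (which keys are required, and a 0 to 1 range for the quality score) are assumptions rather than documented behavior of the pipeline.

```python
from typing import List


def validate_user_inputs(inputs: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems: List[str] = []

    # Assumed required: the prompt marks the other placeholders as optional.
    for key in ("enzyme_family", "reactions_of_interest"):
        if not str(inputs.get(key, "")).strip():
            problems.append(f"{key} must be a non-empty string")

    for key in ("seed_sequences", "substrates_of_interest", "constraints",
                "keywords", "search_sources"):
        if not isinstance(inputs.get(key, []), list):
            problems.append(f"{key} must be a list")

    max_results = inputs.get("max_results_per_source", 1)
    if not isinstance(max_results, int) or max_results < 1:
        problems.append("max_results_per_source must be a positive integer")

    # Assumption: quality scores are normalized to the range [0, 1].
    score = inputs.get("min_quality_score_for_llm_context", 0.0)
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        problems.append("min_quality_score_for_llm_context must be between 0 and 1")

    return problems
```

Calling `validate_user_inputs(user_inputs)` right after the cell above and printing the returned list is one way to use it.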

## Setup Runtime Context
Initialize data directories and active chat thread from the values above.

In [4]:
data_root, resolved_dirs = setup_data_root(root_key)
thread, threads_preview = init_thread(root_key, existing_thread_id)
thread_id = thread["thread_id"]
data_root, thread_id

(PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data'),
 '74b148fa493e4105a47dd5a54ac85b65')

## Optional Local Input Files
Load optional root-relative files for seed sequences and constraints to enrich the prompt placeholders.

In [5]:
def resolve_input_path(path_value: str) -> Path:
    p = Path(path_value).expanduser()
    if p.is_absolute():
        return p.resolve()
    return (data_root / p).resolve()

# Override the default seed sequences when a seed file is configured, exists, and is non-empty.
seed_file_val = input_paths.get("seed_sequences_file", "").strip()
if seed_file_val:
    seed_file = resolve_input_path(seed_file_val)
    if seed_file.exists():
        lines = [line.strip() for line in seed_file.read_text(encoding="utf-8").splitlines() if line.strip()]
        if lines:
            user_inputs["seed_sequences"] = lines

# Same pattern for constraints: one entry per non-blank line.
constraints_file_val = input_paths.get("constraints_file", "").strip()
if constraints_file_val:
    constraints_file = resolve_input_path(constraints_file_val)
    if constraints_file.exists():
        clines = [line.strip() for line in constraints_file.read_text(encoding="utf-8").splitlines() if line.strip()]
        if clines:
            user_inputs["constraints"] = clines
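
Both blocks above follow the same pattern and silently keep the defaults when a configured file is missing. One possible consolidation, shown here as a sketch, is a single reader that warns in that case. `read_nonempty_lines` is a hypothetical helper and is not part of the project's helper module.

```python
import warnings
from pathlib import Path
from typing import List


def read_nonempty_lines(path: Path) -> List[str]:
    """Return stripped, non-blank lines from `path`; warn and return [] if it is missing."""
    if not path.exists():
        warnings.warn(f"Optional input file not found, keeping defaults: {path}")
        return []
    text = path.read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

With such a helper, each override reduces to reading the file and assigning the result only when the returned list is non-empty.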

## Retrieve And Export
Run retrieval against selected protein/literature databases plus optional general web search, then export CSV outputs.

In [6]:
outputs = run_literature_pipeline(user_inputs)
out_paths = save_literature_outputs(outputs, resolved_dirs["processed"])
# Preview at most 10 literature hits when the result is DataFrame-like.
hits = outputs.get("literature_hits")
hits_preview = hits.head(10) if hasattr(hits, "head") else hits
outputs["source_report"], outputs.get("source_debug"), hits_preview, out_paths

(                  metric  value
 0  total_literature_hits      4
 1           biorxiv_hits      0
 2            nature_hits      0
 3     sciencedirect_hits      0
 4       open_access_hits      3
 5      high_quality_hits      4
 6    medium_quality_hits      0
 7       low_quality_hits      0,
           source query_mode  \
 0        UniProt    primary   
 1       InterPro    primary   
 2            PDB    primary   
 3   AlphaFold DB    primary   
 4         PubMed    primary   
 5      EuropePMC    primary   
 6       OpenAlex    primary   
 7      WebSearch    primary   
 8        UniProt    relaxed   
 9       InterPro    relaxed   
 10           PDB    relaxed   
 11  AlphaFold DB    relaxed   
 12        PubMed    relaxed   
 13     EuropePMC    relaxed   
 14      OpenAlex    relaxed   
 15     WebSearch    relaxed   
 
                                                 query  rows  \
 0   unspecific peroxygenases (UPOs) CviUPO peroxyg...     0   
 1   unspecific peroxygenase
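
The `source_debug` table lists every source twice, once with `query_mode` `primary` and once with `relaxed`, which is consistent with the `enable_relaxed_fallback` setting. The sketch below shows one way such a two-pass search could be organized. It is not the project's implementation: `run_two_pass_search`, the `search_fn` callback, and the rule that the relaxed pass runs only when the primary pass finds nothing are all assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

SearchFn = Callable[[str, str], List[dict]]  # (source, query) -> list of hit records


def run_two_pass_search(
    sources: Sequence[str],
    primary_query: str,
    relaxed_query: str,
    search_fn: SearchFn,
    enable_relaxed_fallback: bool = True,
) -> Tuple[List[dict], List[Dict[str, object]]]:
    """Query every source with the primary query, then optionally with a relaxed one."""
    hits: List[dict] = []
    debug: List[Dict[str, object]] = []

    passes = [("primary", primary_query)]
    if enable_relaxed_fallback:
        passes.append(("relaxed", relaxed_query))

    for mode, query in passes:
        # Assumed trigger: skip the relaxed pass once the primary pass found something.
        if mode == "relaxed" and hits:
            break
        for source in sources:
            rows = search_fn(source, query)
            # One debug row per (source, pass), mirroring the columns shown above.
            debug.append({"source": source, "query_mode": mode,
                          "query": query, "rows": len(rows)})
            hits.extend(rows)

    return hits, debug
```

Keeping one debug row per source and pass makes it easy to see which query wording actually produced hits.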

## LLM Literature Synthesis
Print the full prompt, call the LLM with retrieval context, and save the generated review.

In [9]:
prompt_text = build_literature_agent_prompt(user_inputs)
print("=== Prompt Sent To LLM ===")
print(prompt_text)
print("\n=== LLM Output ===")
llm_review = generate_literature_llm_review(outputs, user_inputs)
print(llm_review)
out_llm = save_literature_llm_review(llm_review, resolved_dirs["processed"])
out_llm

=== Prompt Sent To LLM ===

You are an AI research agent supporting an enzyme engineering project.
Your role is to conduct a structured, technically rigorous literature review and generate a concise but insight-dense summary to guide experimental design.

INPUTS (provided dynamically)
- enzyme_family: unspecific peroxygenases (UPOs)
- seed_sequences (optional): CviUPO
- reactions_of_interest: peroxygenation of aromatics
- substrates_of_interest (optional): Veratryl alcohol Naphthalene NBD ABTS
- application_context (optional): biocatalysis, pharmaceutical synthesis, green chemistry
- constraints (optional): stability solvent tolerance H2O2 tolerance expression host

OBJECTIVE
Gather and synthesize current knowledge on:
1. Enzyme class structure and fold
2. Reaction mechanism (including intermediates, catalytic residues, cofactors)
3. Substrate scope and known selectivity trends
4. Cofactor requirements and catalytic cycle
5. Known engineering efforts (mutagenesis, directed evolution, M

PosixPath('/Users/charmainechia/Documents/projects/PIPS/PIPS2-UPOs-data/processed/literature_review_llm_summary.md')
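
Two of the user inputs, `min_quality_score_for_llm_context` and `llm_max_rows_per_table`, suggest that the retrieval context is trimmed before it reaches the LLM. A minimal sketch of that kind of trimming follows. The function name `select_llm_context` and the per-row `quality_score` field are hypothetical; how the pipeline applies these settings internally is not shown in this notebook.

```python
from typing import List


def select_llm_context(rows: List[dict], min_quality_score: float, max_rows: int) -> List[dict]:
    """Keep rows at or above the score threshold, best first, capped at `max_rows`."""
    def score(row: dict) -> float:
        # Assumed field name; rows without a score are treated as 0.0.
        return float(row.get("quality_score", 0.0))

    kept = [row for row in rows if score(row) >= min_quality_score]
    kept.sort(key=score, reverse=True)
    return kept[:max_rows]
```

With the values from the inputs cell, the call would be `select_llm_context(rows, 0.35, 250)`.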

## Save Thread Update
Run this final cell to append the run metadata and a compact LLM summary to `chats/<llm_process_tag>_<thread_id>.json`.

In [10]:
persist_thread_update(
    root_key=root_key,
    thread_id=thread_id,
    inputs=user_inputs,
    out_paths=out_paths,
    llm_review_path=out_llm,
    llm_review_text=llm_review,
)

'2026-02-12T16:49:58.332639+00:00'
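
The returned value is a UTC timestamp in ISO 8601 format. For readers who want to see what an append-style thread update can look like, here is a minimal sketch. It is not the body of `persist_thread_update`: the function `append_thread_update` and the JSON layout (an `updates` list with an `updated_at` field) are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_thread_update(thread_file: Path, update: dict) -> str:
    """Append `update` to a JSON thread file and return the UTC timestamp that was stored."""
    timestamp = datetime.now(timezone.utc).isoformat()

    if thread_file.exists():
        record = json.loads(thread_file.read_text(encoding="utf-8"))
    else:
        record = {"updates": []}  # assumed layout

    record.setdefault("updates", []).append({"updated_at": timestamp, **update})

    thread_file.parent.mkdir(parents=True, exist_ok=True)
    # default=str lets Path objects (such as output paths) serialize as plain strings.
    thread_file.write_text(json.dumps(record, indent=2, default=str), encoding="utf-8")
    return timestamp
```

Appending rather than overwriting keeps a history of runs per thread, which matches the "append" wording in the cell description above.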