# 00_literature_review

Notebook UI for a retrieval-backed enzyme literature survey and an LLM-generated, engineering-focused synthesis.

## Python Path Setup
Ensure project-root imports work whether Jupyter starts from the repo root or from `notebooks/`.

In [1]:
from pathlib import Path
import os
import sys

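# Jupyter may start in the repo root or in notebooks/; normalize to the repo root.
cwd = Path.cwd().resolve()
repo_root = cwd.parent if cwd.name == "notebooks" else cwd
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
# Add src/ to sys.path as well, if it exists.
src_root = repo_root / "src"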
if src_root.exists() and str(src_root) not in sys.path:
    sys.path.insert(0, str(src_root))
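
The cell above infers the repo root from the name of the working directory, which only covers the two start locations named in the description. A minimal sketch of a marker-based alternative follows. It assumes the repo root is the nearest ancestor containing a `src/` directory. The helper name `find_repo_root` and the choice of marker are hypothetical and not part of this project.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Walk upward from `start` and return the first directory containing `marker`.

    Hypothetical helper: `marker` is an assumed landmark, not a project convention.
    """
    start = start.resolve()
    # Check the start directory itself, then each ancestor in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None
```

If used, something like `repo_root = find_repo_root(Path.cwd()) or Path.cwd()` would keep the current fallback behavior when no marker is found.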

## Imports
Load helper functions for setup, retrieval, prompt construction, LLM synthesis, and thread persistence.

In [2]:
import importlib
import agentic_protein_design.steps.literature_review as lr
# Reload so edits to the module are picked up without restarting the kernel.
lr = importlib.reload(lr)
from project_config.local_api_keys import OPENAI_API_KEY

build_literature_agent_prompt = lr.build_literature_agent_prompt
default_user_inputs = lr.default_user_inputs
generate_literature_llm_review = lr.generate_literature_llm_review
init_thread = lr.init_thread
persist_thread_update = lr.persist_thread_update
run_literature_pipeline = lr.run_literature_pipeline
save_literature_llm_review = lr.save_literature_llm_review
save_literature_outputs = lr.save_literature_outputs
setup_data_root = lr.setup_data_root
get_step_processed_dir = lr.get_step_processed_dir

from agentic_protein_design.core import apply_notebook_markdown_style

apply_notebook_markdown_style(font_size_px=14, line_height=1.4)


## API Key Setup
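Load the OpenAI key from `project_config/local_api_keys.py` into the `OPENAI_API_KEY` environment variable for LLM calls. The placeholder value is ignored, and the final expression reports whether a key is set.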

In [3]:
if OPENAI_API_KEY and OPENAI_API_KEY != "REPLACE_WITH_YOUR_OPENAI_API_KEY":
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

"OPENAI_API_KEY" in os.environ

True
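
The check above only reports whether the variable ends up set. The selection rule it implies (use the configured key unless it is empty or the placeholder, otherwise keep whatever the environment already holds) can be sketched as a small function. `resolve_api_key` is a hypothetical helper written for illustration; it is not part of `lr` or `project_config`.

```python
import os
from typing import Mapping, Optional

# Placeholder string checked in the cell above.
PLACEHOLDER = "REPLACE_WITH_YOUR_OPENAI_API_KEY"


def resolve_api_key(
    configured: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the configured key unless it is empty or the placeholder.

    Falls back to OPENAI_API_KEY from `env`; returns None if neither is usable.
    Hypothetical helper, not part of the project code.
    """
    if configured and configured != PLACEHOLDER:
        return configured
    return env.get("OPENAI_API_KEY") or None
```

A `None` result would be a reasonable point to raise an error before any LLM call is attempted.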

## User Inputs
Edit all run parameters in this one cell: root/thread selection, literature prompt placeholders, retrieval settings, model, and root-relative input paths.

In [4]:
root_key = "examples"
existing_thread_id = 'literature_review_d762a72ec7f04bec9b66ccd3aac21b91'

user_inputs = {
    "enzyme_family": "unspecific peroxygenases (UPOs)",
    "seed_sequences": ["CviUPO"],
    "reactions_of_interest": "peroxygenation of aromatics",
    "substrates_of_interest": ["Veratryl alcohol", "Naphthalene", "NBD", "ABTS"],
    "application_context": "biocatalysis, pharmaceutical synthesis, green chemistry",
    "constraints": ["stability", "solvent tolerance", "H2O2 tolerance", "expression host"],
    "keywords": ["peroxygenation"],
    "search_sources": [
        "UniProt",
        "InterPro",
        "PDB",
        "AlphaFold DB",
        "PubMed",
        "EuropePMC",
        "OpenAlex",
        "WebSearch",
    ],
    "literature_targets": ["bioRxiv", "Nature", "ScienceDirect", "PNAS", "ACS", "Wiley", "ChemRxiv"],
    "evidence_focus": [
        "protein_annotations",
        "structure",
        "reaction_mechanism",
        "historical_mutations",
    ],
    "max_results_per_source": 20,
    "fetch_open_access_fulltext": True,
    "fulltext_max_chars": 10000,
    "llm_model": "gpt-5.2",
    "llm_temperature": 0.2,
    "llm_max_rows_per_table": 250,
    "enable_relaxed_fallback": True,
    "min_quality_score_for_llm_context": 0.35,
    "data_fbase_key": "examples",  # key from address_dict in project_config/variables.py
    "data_fbase": "",  # optional explicit path fallback
    "data_subfolder": "",  # optional; looks in {data_fbase}/literature/{data_subfolder}/
    "enable_pdf_rag": True,
    "pdf_rag_max_files": 20,
    "pdf_rag_max_chars_per_file": 8000,
    "pdf_rag_max_total_chars": 80000,
}

input_paths = {
    # Paths are relative to the data root from project_config.variables.address_dict[root_key].
    "seed_sequences_file": "",
    "constraints_file": "",
}

# Optional: reset all fields from helper defaults
# user_inputs = default_user_inputs()
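
As a middle ground between editing the full dictionary and resetting to defaults, overrides could be merged onto the defaults while rejecting misspelled keys. The sketch below assumes `default_user_inputs()` returns a plain `dict`, which this notebook does not confirm. `merge_inputs` and `example_defaults` are hypothetical names, and the stand-in defaults hold only two of the keys shown above.

```python
from typing import Any, Dict


def merge_inputs(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults updated with overrides, rejecting keys the defaults do not define."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"Unknown input keys: {unknown}")
    return {**defaults, **overrides}


# Stand-in for default_user_inputs(); the real defaults may differ.
example_defaults = {"llm_temperature": 0.2, "max_results_per_source": 20}
merged = merge_inputs(example_defaults, {"max_results_per_source": 5})
```

Rejecting unknown keys catches typos such as `max_result_per_source` that would otherwise be silently ignored.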

## Setup Runtime Context
Initialize data directories and active chat thread from the values above.

In [5]:
data_root, resolved_dirs = setup_data_root(root_key)
step_processed_dir = get_step_processed_dir(resolved_dirs)
thread, threads_preview = init_thread(root_key, existing_thread_id)
thread_id = thread["thread_id"]
data_root, step_processed_dir, thread_id


(PosixPath('/Users/charmainechia/Documents/projects/agentic-protein-design/examples'),
 PosixPath('/Users/charmainechia/Documents/projects/agentic-protein-design/examples/processed/00_literature_review'),
 'd762a72ec7f04bec9b66ccd3aac21b91')

## Optional Local Input Files
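Load optional root-relative files for seed sequences and constraints to enrich the prompt placeholders. Each file is read as one entry per line; a missing or empty file leaves the inline values unchanged.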

In [6]:
def resolve_input_path(path_value: str) -> Path:
    """Resolve a path; relative paths are taken relative to data_root."""
    p = Path(path_value).expanduser()
    if p.is_absolute():
        return p.resolve()
    return (data_root / p).resolve()

# Each file holds one entry per line; blank lines are skipped.
seed_file_val = input_paths.get("seed_sequences_file", "").strip()
if seed_file_val:
    seed_file = resolve_input_path(seed_file_val)
    if seed_file.exists():
        lines = [line.strip() for line in seed_file.read_text(encoding="utf-8").splitlines() if line.strip()]
        if lines:
            user_inputs["seed_sequences"] = lines

constraints_file_val = input_paths.get("constraints_file", "").strip()
if constraints_file_val:
    constraints_file = resolve_input_path(constraints_file_val)
    if constraints_file.exists():
        clines = [line.strip() for line in constraints_file.read_text(encoding="utf-8").splitlines() if line.strip()]
        if clines:
            user_inputs["constraints"] = clines
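
To make the expected file layout concrete, the sketch below writes a small constraints file to a temporary directory and parses it with the same strip-and-skip-blank filter used in the cell above. The file name and its entries are made up for the example.

```python
import tempfile
from pathlib import Path

# Illustrative file contents; note the blank line and the padded entry.
example_text = "stability\n\n  solvent tolerance  \nH2O2 tolerance\n"

with tempfile.TemporaryDirectory() as tmp:
    example_file = Path(tmp) / "constraints.txt"
    example_file.write_text(example_text, encoding="utf-8")
    # Same filter as the cell above: strip each line, skip empty ones.
    parsed = [
        line.strip()
        for line in example_file.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]

print(parsed)  # ['stability', 'solvent tolerance', 'H2O2 tolerance']
```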

## Retrieve And Export
Run retrieval against selected protein/literature databases plus optional general web search, then export CSV outputs.

In [7]:
outputs = run_literature_pipeline(user_inputs)
out_paths = save_literature_outputs(outputs, step_processed_dir)


[LocalPDF] docs_dir=/Users/charmainechia/Documents/projects/agentic-protein-design/examples/literature discovered=9 max_files=20 max_chars_per_file=8000 max_total_chars=80000
[LocalPDF] OK file=2014_MolinaEspeja_Directed Evolution of UPO from Agrocybe aegerita.pdf extracted_chars=63567 used_chars=8000 truncated=True total_used_chars=8000
[LocalPDF] OK file=2021_ACSCatal_Accessing Chemo- and Regioselective Benzylic and Aromatic Oxidations by Protein Engineering of an UPO.pdf extracted_chars=65989 used_chars=8000 truncated=True total_used_chars=16000
[LocalPDF] OK file=2022_antioxidants_Engineering Collariella virescens Peroxygenase for Epoxides Production from Vegetable Oil.pdf extracted_chars=54873 used_chars=8000 truncated=True total_used_chars=24000
[LocalPDF] OK file=2023_ACSCat_Muench_Computational-Aided Engineering of a Selective Unspecific Peroxygenase toward Enantiodivergent β‑Ionone Hydroxylation.pdf extracted_chars=47861 used_chars=8000 truncated=True total_used_chars=32000


Ignoring wrong pointing object 25 0 (offset 0)
Ignoring wrong pointing object 103 0 (offset 0)


[LocalPDF] OK file=2024_ACSCat_Muench_Functionally Diverse Peroxygenases by AlphaFold2, Design, and Signal Peptide Shuffling.pdf extracted_chars=55776 used_chars=8000 truncated=True total_used_chars=40000
[LocalPDF] OK file=2024_ChemAEurJ_Barber_UPOs  can be Tuned for Oxygenation or Halogenation.pdf extracted_chars=27478 used_chars=8000 truncated=True total_used_chars=48000


Ignoring wrong pointing object 12 0 (offset 0)
Ignoring wrong pointing object 17 0 (offset 0)
Ignoring wrong pointing object 21 0 (offset 0)
Ignoring wrong pointing object 25 0 (offset 0)
Ignoring wrong pointing object 28 0 (offset 0)
Ignoring wrong pointing object 31 0 (offset 0)
Ignoring wrong pointing object 51 0 (offset 0)
Ignoring wrong pointing object 53 0 (offset 0)
Ignoring wrong pointing object 55 0 (offset 0)
Ignoring wrong pointing object 95 0 (offset 0)


[LocalPDF] OK file=2024_JACS_Yan_Engineering of Unspecific Peroxygenases Using a SuperfolderGreen-Fluorescent-Protein-Mediated Secretion System in E coli.pdf extracted_chars=46736 used_chars=8000 truncated=True total_used_chars=56000
[LocalPDF] OK file=Modification of the peroxygenative-peroxidative activity ratio in the unspecific peroxygenase from Agrocybe aegerita by structure-guided evolution.pdf extracted_chars=43434 used_chars=8000 truncated=True total_used_chars=64000
[LocalPDF] OK file=Unspecified peroxidases - the pot of gold at the end of the rainbow.pdf extracted_chars=52474 used_chars=8000 truncated=True total_used_chars=72000
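
The `used_chars` and `total_used_chars` values in the log are consistent with a per-file cap applied inside a running total cap. The sketch below reproduces the logged numbers under that assumption. The logic is inferred from the log lines only, the real helper may differ, and `apply_char_budget` is a hypothetical name.

```python
from typing import List, Tuple


def apply_char_budget(
    extracted: List[int], per_file_cap: int, total_cap: int
) -> List[Tuple[int, int]]:
    """Return (used_chars, total_used_chars) per file under both caps.

    Assumed logic inferred from the log output; not the project's implementation.
    """
    total_used = 0
    rows = []
    for n_chars in extracted:
        # A file contributes at most per_file_cap and never exceeds the remaining total budget.
        used = max(0, min(n_chars, per_file_cap, total_cap - total_used))
        total_used += used
        rows.append((used, total_used))
    return rows


# extracted_chars values copied from the log above, in order.
extracted_chars = [63567, 65989, 54873, 47861, 55776, 27478, 46736, 43434, 52474]
rows = apply_char_budget(extracted_chars, per_file_cap=8000, total_cap=80000)
print(rows[-1])  # (8000, 72000), matching the last log line
```

Under the same assumption, at most 80000 // 8000 = 10 fully truncated files fit in the total budget, so with these settings `pdf_rag_max_total_chars` would bind before `pdf_rag_max_files=20` whenever every PDF exceeds 8000 characters.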


## LLM Literature Synthesis
Print the full prompt, call the LLM with retrieval context, and save the generated review.

In [8]:
llm_review = generate_literature_llm_review(outputs, user_inputs)
out_llm = save_literature_llm_review(llm_review, step_processed_dir)
print(out_llm)


### Literature Review LLM Call

<details><summary>Prompt</summary>

```text

You are an AI research agent supporting an enzyme engineering project.
Your role is to conduct a structured, technically rigorous literature review and generate a concise but insight-dense summary to guide experimental design.

INPUTS (provided dynamically)
- enzyme_family: unspecific peroxygenases (UPOs)
- seed_sequences (optional): CviUPO
- reactions_of_interest: peroxygenation of aromatics
- substrates_of_interest (optional): Veratryl alcohol Naphthalene NBD ABTS
- application_context (optional): biocatalysis, pharmaceutical synthesis, green chemistry
- constraints (optional): stability solvent tolerance H2O2 tolerance expression host

OBJECTIVE
Gather and synthesize current knowledge on:
1. Enzyme class structure and fold
2. Reaction mechanism (including intermediates, catalytic residues, cofactors)
3. Substrate scope and known selectivity trends
4. Cofactor requirements and catalytic cycle
5. Known engineering efforts (mutagenesis, directed evolution, ML-guided design)
6. Stability, expression, and formulation constraints
7. Gaps in knowledge and opportunities for engineering

SOURCES TO CONSULT
- Protein databases: UniProt, PDB, AlphaFold DB, InterPro, Pfam
- Literature databases: PubMed, Nature, PNAS, Science, ScienceDirect, ACS, Wiley, bioRxiv, ChemRxiv
- Reviews and meta-analyses when available
- Structural studies (crystal structures, cryo-EM)
- Mechanistic enzymology papers
- Directed evolution and protein engineering studies

INSTRUCTIONS FOR INFORMATION GATHERING
- Prioritize peer-reviewed literature and high-impact reviews.
- Distinguish between experimentally validated findings and computational predictions.
- Extract quantitative data when possible (kcat, KM, TTN, enantioselectivity, stability metrics, mutation effects).
- Identify conserved catalytic residues and structural motifs.
- Map engineering-relevant residues (active site, access channel, gating residues, stability hotspots).
- Note contradictions or unresolved mechanistic questions.

OUTPUT FORMAT

1. Executive Summary (<=10 bullet points)
   - High-level takeaways most relevant to engineering strategy.

2. Structural Overview
   - Fold classification
   - Domain architecture
   - Active site organization
   - Access channels / substrate tunnels
   - Cofactor binding
   - Known motifs or epitopes

3. Reaction Mechanism
   - Catalytic cycle steps
   - Key intermediates
   - Rate-limiting steps (if known)
   - Competing pathways (e.g. peroxygenation vs peroxidation)
   - Determinants of chemoselectivity / regioselectivity

4. Substrate Scope & Selectivity Trends
   - Classes of substrates accepted
   - Structural features tolerated
   - Trends in polarity, size, electronics
   - Known limitations

5. Engineering Landscape
   - Mutations known to affect activity/selectivity
   - Stability-enhancing mutations
   - Expression improvements
   - Channel reshaping strategies
   - ML-guided or computational design efforts
   - Reported performance gains (quantitative if available)

6. Practical Constraints
   - Cofactor stability (e.g. H2O2 sensitivity)
   - Uncoupling / inactivation pathways
   - Expression hosts used
   - Solvent/temperature/pH tolerance

7. Comparative Analysis (if multiple homologs provided)
   - Structural or mechanistic differences
   - Performance trade-offs
   - Known phenotypic clusters

8. Engineering Opportunities
   - Hypothesis-driven mutation targets
   - Channel/gating positions
   - Electrostatic tuning opportunities
   - Stability engineering strategies
   - Assay design suggestions

9. References
   - Provide full citations (authors, year, journal)
   - Include DOI or PubMed ID where available
   - Clearly distinguish review vs primary research

STYLE REQUIREMENTS
- Write clearly and technically, suitable for a PhD-level enzyme engineering audience.
- Emphasize mechanistic reasoning and structure-function relationships.
- Avoid generic textbook explanations.
- Highlight actionable insights for experimental design.
- Be concise but information-dense.

If seed_sequences are provided:
- Identify closest homologs.
- Summarize known structures and variants for those sequences.
- Highlight residue positions frequently engineered.

If reactions_of_interest are specified:
- Focus mechanistic discussion on those reaction classes.
- Distinguish between productive and competing pathways.

The goal is not just to summarize literature, but to extract engineering-relevant insight that can guide rational or ML-assisted enzyme optimization.

```
</details>

#### Response

(Full literature review shown below in compact view.)

### Literature Review Summary

Output(layout=Layout(border_bottom='1px solid #e0e0e0', border_left='1px solid #e0e0e0', border_right='1px sol…

/Users/charmainechia/Documents/projects/agentic-protein-design/examples/processed/00_literature_review/literature_review_llm_summary.md


## Save Thread Update
Run this final cell to append the run metadata and a compact LLM summary to `chats/<llm_process_tag>_<thread_id>.json`.

In [9]:
persist_thread_update(
    root_key=root_key,
    thread_id=thread_id,
    inputs=user_inputs,
    out_paths=out_paths,
    llm_review_path=out_llm,
    llm_review_text=llm_review,
)
print(thread_id)

d762a72ec7f04bec9b66ccd3aac21b91
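
The thread file name follows the `chats/<llm_process_tag>_<thread_id>.json` pattern from the description above. A minimal sketch of building that path is shown below. It assumes the `chats/` directory sits under the data root and that the process tag is `literature_review` (inferred from `existing_thread_id`); neither is confirmed by this notebook, and `thread_file_path` is a hypothetical helper.

```python
from pathlib import Path


def thread_file_path(data_root: Path, llm_process_tag: str, thread_id: str) -> Path:
    """Build chats/<llm_process_tag>_<thread_id>.json under data_root.

    The location under data_root is an assumption, not taken from the project code.
    """
    return data_root / "chats" / f"{llm_process_tag}_{thread_id}.json"


example_path = thread_file_path(
    Path("examples"),  # stand-in for the resolved data root
    "literature_review",  # assumed tag, inferred from existing_thread_id
    "d762a72ec7f04bec9b66ccd3aac21b91",
)
print(example_path.name)  # literature_review_d762a72ec7f04bec9b66ccd3aac21b91.json
```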
