# SSTR2 Unified Binder Discovery System
### FastDesign + De Novo Peptide: Integrated Pipeline

---

**Target**: SSTR2 (Somatostatin Receptor Type 2)  
**Strategy**: Merge physics-based design (FastDesign) with AI-based de novo design (RFdiffusion + ProteinMPNN + ESMFold),  
rank all candidates together by a weighted score, and select the final candidates from that unified ranking.

| Phase | Description |
|------|------|
| **Phase 0** | Setup & environment check |
| **Phase 1** | Structure QC: FoldMason lDDT + binding pocket analysis |
| **Phase 2** | FastDesign pipeline (V1): peptide sequence optimization |
| **Phase 3** | De novo pipeline (Arm 3): RFdiffusion + ProteinMPNN + ESMFold |
| **Phase 4** | Unified ranking: merge by weighted score |
| **Phase 5** | Final dashboard & visualization |

---
## Phase 0: Setup & Environment

In [None]:
# ===== Phase 0: Common setup =====
import os
import sys
import json
import csv
import time
import warnings
from pathlib import Path
from collections import defaultdict, OrderedDict, Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.patches import Patch
from tqdm.notebook import tqdm
from IPython.display import display, HTML, Markdown

warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', 200)
pd.set_option('display.width', 140)
mpl.rc('axes', unicode_minus=False)
plt.style.use('seaborn-v0_8-whitegrid')
mpl.rcParams.update({'figure.dpi': 130, 'figure.figsize': (12, 5)})

REPO = Path('.').resolve().parent
sys.path.insert(0, str(REPO))
sys.path.insert(0, str(REPO / 'bionemo'))

# ── Path constants ──
DATA_DIR       = REPO / 'data' / 'fold_test1'
RESULTS_DIR    = REPO / 'results'
FOLDMASON_DIR  = RESULTS_DIR / 'foldmason'
DOCKING_DIR    = RESULTS_DIR / 'sstr2_docking'
DENOVO_DIR     = DOCKING_DIR / 'arm3_denovo'
OUTPUT_DIR     = Path('unified_results')
OUTPUT_DIR.mkdir(exist_ok=True)

INPUT_CIF = DATA_DIR / 'fold_test1_model_0.cif'

print(f'REPO:       {REPO}')
print(f'INPUT_CIF:  {INPUT_CIF} (exists={INPUT_CIF.exists()})')
print(f'OUTPUT_DIR: {OUTPUT_DIR.resolve()}')

In [None]:
# ── Check API key (for De Novo Arm 3, optional) ──
api_key = os.getenv('NGC_CLI_API_KEY') or os.getenv('NVIDIA_API_KEY')
if not api_key:
    for kf in [REPO / 'molmim.key', REPO / 'ngc.key']:
        if kf.exists():
            api_key = kf.read_text().strip()
            break
HAS_API = bool(api_key and api_key.startswith('nvapi-'))

# ── py3Dmol (optional) ──
try:
    import py3Dmol
    HAS_3D = True
except ImportError:
    py3Dmol = None
    HAS_3D = False

display(HTML(
    '<div style="padding:12px;background:#e3f2fd;border-radius:8px;font-size:14px">'
    f'API: <b>{"Connected" if HAS_API else "Offline (using existing results)"}</b> &nbsp;|&nbsp; '
    f'py3Dmol: <b>{"OK" if HAS_3D else "not installed"}</b>'
    '</div>'
))

---
## Phase 1: Structure QC (FoldMason + Binding Pocket)

Validate the quality of the AlphaFold3 model and analyze the SSTR2 binding pocket.  
These results serve as input validation for the downstream design phases.
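
The notebook loads a precomputed `binding_pocket.json` with a `cutoff_angstrom` field; how that file was produced is not shown here. As a rough illustration only, a distance-cutoff pocket definition could look like the sketch below. The function name `pocket_residue_ids`, the 5.0 Å default, and the toy coordinates are assumptions for this example, not values from the SSTR2 model.

```python
import math

def pocket_residue_ids(receptor_atoms, ligand_xyz, cutoff=5.0):
    """Return sorted residue IDs that have any atom within `cutoff` Å of any ligand atom.

    receptor_atoms: list of (residue_id, (x, y, z)) tuples (hypothetical input layout).
    ligand_xyz:     list of (x, y, z) tuples for the bound peptide atoms.
    """
    hits = set()
    for res_id, xyz in receptor_atoms:
        # A residue is in the pocket as soon as one of its atoms is close enough.
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_xyz):
            hits.add(res_id)
    return sorted(hits)

# Toy coordinates along the x axis (hypothetical, for illustration only)
receptor_atoms = [
    (101, (0.0, 0.0, 0.0)),
    (101, (1.0, 0.0, 0.0)),
    (102, (10.0, 0.0, 0.0)),
    (103, (20.0, 0.0, 0.0)),
]
ligand_xyz = [(3.0, 0.0, 0.0), (12.0, 0.0, 0.0)]

pocket_ids = pocket_residue_ids(receptor_atoms, ligand_xyz, cutoff=5.0)
print(pocket_ids)
```

Residues 101 and 102 each have an atom within 5 Å of a ligand atom, while residue 103 is 8 Å from the nearest one and is excluded.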

In [None]:
# ── 1.1 AlphaFold3 confidence ──
confidences = []
for i in range(5):
    fp = DATA_DIR / f'fold_test1_summary_confidences_{i}.json'
    if fp.exists():
        confidences.append(json.loads(fp.read_text()))

if confidences:
    scores = [c['ranking_score'] for c in confidences]
    best_idx = int(np.argmax(scores))
    best_conf = confidences[best_idx]
    display(HTML(
        f'<div style="background:#e8f5e9;padding:12px;border-radius:8px">'
        f'<b>Best Model: Model {best_idx}</b> &mdash; '
        f'Ranking={scores[best_idx]:.2f}, '
        f'ipTM={best_conf["iptm"]:.2f}, '
        f'pTM={best_conf["ptm"]:.2f}'
        f'</div>'
    ))
else:
    print('No AlphaFold3 confidence files found; skipping QC')

In [None]:
# ── 1.2 FoldMason structural alignment results ──
def parse_fasta(text):
    seqs = {}
    name = None
    for line in text.strip().split('\n'):
        if line.startswith('>'):
            name = line[1:].strip()
            seqs[name] = ''
        elif name:
            seqs[name] += line.strip()
    return seqs

fm_aa_path = FOLDMASON_DIR / 'result_foldmason_aa.fa'
fm_3di_path = FOLDMASON_DIR / 'result_foldmason_3di.fa'
fm_nw_path = FOLDMASON_DIR / 'result_foldmason.nw'

if fm_aa_path.exists():
    aa_seqs = parse_fasta(fm_aa_path.read_text())
    di_seqs = parse_fasta(fm_3di_path.read_text()) if fm_3di_path.exists() else {}
    newick = fm_nw_path.read_text().strip() if fm_nw_path.exists() else ''

    display(HTML(
        '<div style="background:#f3f4f6;padding:16px;border-radius:10px">'
        '<h3 style="margin-top:0">FoldMason Structure QC</h3>'
        '<table style="font-size:14px">'
        '<tr><td><b>Average MSA lDDT</b></td>'
        # NOTE: the lDDT value below is a hard-coded literal, not parsed from the result files
        '<td><span style="font-size:22px;color:#2e7d32;font-weight:bold">0.664</span>'
        ' (0.5-0.7: moderate consistency)</td></tr>'
        f'<tr><td><b>Aligned sequences</b></td><td>{len(aa_seqs)}</td></tr>'
        f'<tr><td><b>Guide Tree</b></td>'
        f'<td style="font-family:monospace;font-size:11px">{newick}</td></tr>'
        '</table></div>'
    ))
else:
    print('No FoldMason results found; skipping')

In [None]:
# ── 1.3 Binding pocket analysis ──
pocket_path = DOCKING_DIR / 'binding_pocket.json'
pocket = None

if pocket_path.exists():
    pocket = json.loads(pocket_path.read_text())
    resnames = [r['resname'] for r in pocket['pocket_residues']]

    aa_props = {
        'hydrophobic': ['ALA','VAL','ILE','LEU','MET','PHE','TRP','PRO'],
        'polar': ['SER','THR','ASN','GLN','TYR','CYS'],
        'positive': ['LYS','ARG','HIS'],
        'negative': ['ASP','GLU'],
    }
    def get_prop(rn):
        for p, aas in aa_props.items():
            if rn in aas: return p
        return 'other'

    prop_counts = Counter(get_prop(r) for r in resnames)
    display(HTML(
        f'<div style="background:#fff3e0;padding:12px;border-radius:8px">'
        f'<b>Binding Pocket</b>: {pocket["num_pocket_residues"]} residues '
        f'(cutoff {pocket["cutoff_angstrom"]}Å) &mdash; '
        + ', '.join(f'{k}: {v}' for k, v in prop_counts.items())
        + '</div>'
    ))
else:
    print('No binding pocket data found; skipping')

In [None]:
# ── 1.4 QC visualization ──
if confidences:
    fig, axes = plt.subplots(1, 3, figsize=(15, 4.5))

    # Ranking Score
    ax = axes[0]
    colors_r = ['#FF6347' if i == best_idx else '#B0C4DE' for i in range(len(scores))]
    bars = ax.bar(range(len(scores)), scores, color=colors_r, edgecolor='white', width=0.6)
    for b, s in zip(bars, scores):
        ax.text(b.get_x() + b.get_width()/2, s + 0.02, f'{s:.2f}',
                ha='center', fontsize=11, fontweight='bold')
    ax.set_xticks(range(len(scores)))
    ax.set_xticklabels([f'Model {i}' for i in range(len(scores))])
    ax.set_ylabel('Score'); ax.set_ylim(0, 1)
    ax.set_title('AlphaFold3 Ranking Score', fontweight='bold')

    # pTM vs ipTM
    ax = axes[1]
    x = np.arange(len(confidences)); w = 0.35
    ax.bar(x - w/2, [c['ptm'] for c in confidences], w, label='pTM', color='#6495ED', edgecolor='white')
    ax.bar(x + w/2, [c['iptm'] for c in confidences], w, label='ipTM', color='#FF6347', edgecolor='white')
    ax.set_xticks(x); ax.set_xticklabels([f'M{i}' for i in range(len(confidences))])
    ax.set_ylabel('Score'); ax.set_ylim(0, 1); ax.legend(fontsize=9)
    ax.set_title('pTM vs ipTM', fontweight='bold')

    # Pocket composition
    if pocket:
        ax = axes[2]
        prop_colors = {'hydrophobic':'#E07B54','polar':'#4C9A2A','positive':'#4169E1','negative':'#DC143C','other':'#888'}
        labels_p = list(prop_counts.keys())
        sizes_p = list(prop_counts.values())
        colors_p = [prop_colors.get(l, '#888') for l in labels_p]
        ax.pie(sizes_p, labels=[f'{l}\n({s})' for l, s in zip(labels_p, sizes_p)],
               colors=colors_p, autopct='%1.0f%%', startangle=90,
               textprops={'fontsize': 10}, wedgeprops={'edgecolor': 'white', 'linewidth': 2})
        ax.set_title(f'Pocket ({pocket["num_pocket_residues"]} residues)', fontweight='bold')
    else:
        axes[2].axis('off')

    plt.suptitle('Phase 1: Structure QC Summary', fontweight='bold', fontsize=14, y=1.02)
    plt.tight_layout()
    plt.show()

---
## Phase 2: FastDesign Pipeline (from V1)

Optimize the sequence of the SST14 peptide with PyRosetta-based physical simulation.

```
CIF → PDB → chain standardization → Relax → FastDesign x20 → filtering → FlexPepDock Refine
```

> **LOAD_FROM_CACHE**: if results from a previous run exist, load them from the cache (saves time)
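
The filtering step ranks FastDesign candidates by combining the interface energy with sequence-based stability/PK proxies. The sketch below mirrors the notebook's `stability_pk_proxy_scores` and `compute_rank_score` in compact form and applies them to the wild-type SST14 sequence. The `example_dG` value is a hypothetical placeholder, not a computed result, and the short function names are used only for this example.

```python
def proxy_scores(seq):
    """Sequence-only proxies: protease cleavage risk and a PK penalty."""
    seq = seq.strip().upper()
    kr = sum(aa in 'KR' for aa in seq)      # basic residues, weighted 2.0 in cleavage risk
    fyw = sum(aa in 'FYW' for aa in seq)    # aromatic residues, weighted 1.0 in cleavage risk
    hydrophobic_fraction = sum(aa in 'AILMFWVY' for aa in seq) / len(seq) if seq else 0.0
    net_charge = sum(aa in 'KRH' for aa in seq) - sum(aa in 'DE' for aa in seq)
    return {
        'cleavage_risk': 2.0 * kr + 1.0 * fyw,
        # Penalize hydrophobic fraction above 0.5 and any net charge
        'pk_penalty': 5.0 * max(0.0, hydrophobic_fraction - 0.5) + 0.5 * abs(net_charge),
    }

def rank_score(dG, cleavage_risk, pk_penalty):
    """Higher is better: favorable (negative) dG minus the risk penalties."""
    return (-dG) - 0.5 * cleavage_risk - 1.0 * pk_penalty

wt = proxy_scores('AGCKNFFWKTFTSC')   # wild-type SST14
example_dG = -30.0                    # hypothetical interface energy in REU
score = rank_score(example_dG, **wt)
print(wt, score)
```

For SST14 this gives a cleavage risk of 8.0 (two Lys, four aromatic residues) and a PK penalty of 1.0 (net charge +2, hydrophobic fraction below 0.5), so a dG of -30 REU would yield a rank score of 25.0.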

In [None]:
# ===== FastDesign cache / run mode =====
# True  → load results of a previous run from the cache (fast)
# False → run 20 new PyRosetta FastDesign trajectories (30 min to 1 hour)
LOAD_FROM_CACHE = True

CACHE_DIR = Path('candidates')  # existing V1 results directory
CACHE_CSV = CACHE_DIR / 'df_candidates.csv'
CACHE_META = CACHE_DIR / 'meta.json'

print(f'LOAD_FROM_CACHE = {LOAD_FROM_CACHE}')
if LOAD_FROM_CACHE and CACHE_CSV.exists():
    print(f'  Cache found: {CACHE_CSV}')
elif LOAD_FROM_CACHE:
    print('  No cache found; switching to a PyRosetta run')
    LOAD_FROM_CACHE = False

In [None]:
# ===== 2.1 CIF → PDB conversion =====
if not LOAD_FROM_CACHE:
    from Bio.PDB import MMCIFParser, PDBIO

    def cif_to_pdb(cif_path, pdb_path, structure_id='AF3_MODEL'):
        parser = MMCIFParser(QUIET=True)
        structure = parser.get_structure(structure_id, str(cif_path))
        io = PDBIO()
        io.set_structure(structure)
        io.save(str(pdb_path))
        return pdb_path

    OUTPUT_PDB = DATA_DIR / 'fold_test1_model_0_from_cif.pdb'
    if not OUTPUT_PDB.exists():
        cif_to_pdb(INPUT_CIF, OUTPUT_PDB)
    INPUT_PDB = str(OUTPUT_PDB)
    print(f'INPUT_PDB: {INPUT_PDB}')
else:
    print('[Cache Mode] Skipping CIF→PDB conversion')

In [None]:
# ===== 2.2 PyRosetta initialization + preprocessing (standardize, relax) =====
if not LOAD_FROM_CACHE:
    import pyrosetta
    from pyrosetta import rosetta
    pyrosetta.init('-mute all -relax:default_repeats 3')
    print('PyRosetta initialized')

    # ── 3-letter → 1-letter mapping ──
    AA3_TO_1 = {
        'ALA':'A','ARG':'R','ASN':'N','ASP':'D','CYS':'C','GLN':'Q','GLU':'E',
        'GLY':'G','HIS':'H','ILE':'I','LEU':'L','LYS':'K','MET':'M','PHE':'F',
        'PRO':'P','SER':'S','THR':'T','TRP':'W','TYR':'Y','VAL':'V','MSE':'M',
    }

    # ── Detect the peptide chain ──
    def find_peptide_chain_pose(pose, peptide_len=14):
        info = []
        for ch in range(1, pose.num_chains() + 1):
            seq = pose.chain_sequence(ch)
            info.append((ch, len(seq), seq))
        hits = [ch for ch, ln, seq in info if ln == peptide_len]
        if len(hits) != 1:
            raise RuntimeError(f'Expected exactly one chain of length {peptide_len}, found: {hits}')
        return hits[0]

    # ── Extract a single chain ──
    def extract_chain_pose_by_dump(original_pose, chain_id):
        tmp_full = '__tmp_full.pdb'
        tmp_chain = f'__tmp_chain_{chain_id}.pdb'
        original_pose.dump_pdb(tmp_full)
        first_res = original_pose.chain_begin(chain_id)
        pdbinfo = original_pose.pdb_info()
        chain_letter = pdbinfo.chain(first_res) if pdbinfo else ''
        if not chain_letter.strip():
            chain_letter = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'[chain_id - 1]
        with open(tmp_full, 'r') as f:
            lines = f.readlines()
        with open(tmp_chain, 'w') as out:
            for line in lines:
                if (line.startswith('ATOM') or line.startswith('HETATM')) and len(line) > 21:
                    if line[21] == chain_letter:
                        out.write(line)
                if line.startswith('TER'):
                    out.write(line)
            out.write('END\n')
        new_pose = pyrosetta.pose_from_pdb(tmp_chain)
        os.remove(tmp_full)
        os.remove(tmp_chain)
        return new_pose

    # ── Standardize chains (A = receptor, B = peptide) ──
    def standardize_to_AB(pose, peptide_chain_id, out_pdb='standardized_raw.pdb'):
        receptor_chains = [ch for ch in range(1, pose.num_chains() + 1) if ch != peptide_chain_id]
        rec_pose = extract_chain_pose_by_dump(pose, receptor_chains[0])
        for ch in receptor_chains[1:]:
            rec_pose.append_pose_by_jump(extract_chain_pose_by_dump(pose, ch), rec_pose.total_residue())
        pep_pose = extract_chain_pose_by_dump(pose, peptide_chain_id)
        rec_pose.append_pose_by_jump(pep_pose, rec_pose.total_residue())
        rec_pose.dump_pdb(out_pdb)
        print(f'[OK] standardized -> {out_pdb} (A=receptor, B=peptide)')
        return rec_pose

    # ── Relax ──
    def relax_peptide_only(in_pdb='standardized_raw.pdb', out_pdb='standardized_relaxed.pdb',
                           peptide_chain_number=2):
        pose = pyrosetta.pose_from_pdb(in_pdb)
        sfxn = pyrosetta.create_score_function('ref2015')
        mm = rosetta.core.kinematics.MoveMap()
        mm.set_bb(False); mm.set_chi(False); mm.set_jump(False)
        for i in range(pose.chain_begin(peptide_chain_number),
                       pose.chain_end(peptide_chain_number) + 1):
            mm.set_bb(i, True); mm.set_chi(i, True)
        fr = rosetta.protocols.relax.FastRelax(sfxn, 3)
        fr.set_movemap(mm)
        fr.apply(pose)
        pose.dump_pdb(out_pdb)
        print(f'[OK] Relaxed -> {out_pdb}')
        return pose

    # ── Run ──
    pose = pyrosetta.pose_from_pdb(INPUT_PDB)
    pep_chain = find_peptide_chain_pose(pose, peptide_len=14)
    standard_pose = standardize_to_AB(pose, pep_chain)
    relaxed_pose = relax_peptide_only()
else:
    print('[Cache Mode] Skipping PyRosetta preprocessing')

In [None]:
# ===== 2.3 Scoring function definitions =====
def stability_pk_proxy_scores(seq):
    """Compute stability/PK proxy scores from the sequence alone."""
    seq = seq.strip().upper()
    kr = sum(1 for x in seq if x in 'KR')
    fyw = sum(1 for x in seq if x in 'FYW')
    cleavage_risk = 2.0 * kr + 1.0 * fyw
    hydrophobic = sum(1 for x in seq if x in 'AILMFWVY')
    hydrophobic_fraction = hydrophobic / len(seq) if seq else 0
    pos_charge = sum(1 for x in seq if x in 'KRH')
    neg_charge = sum(1 for x in seq if x in 'DE')
    net_charge_proxy = pos_charge - neg_charge
    pk_penalty = 5.0 * max(0, hydrophobic_fraction - 0.5) + 0.5 * abs(net_charge_proxy)
    return {
        'cleavage_risk': cleavage_risk,
        'pk_penalty': pk_penalty,
        'hydrophobic_fraction': round(hydrophobic_fraction, 3),
        'net_charge_proxy': net_charge_proxy,
    }

def compute_rank_score(dG, cleavage_risk, pk_penalty):
    """V1 rank_score: higher is better."""
    return (-dG) - 0.5 * cleavage_risk - 1.0 * pk_penalty

print('Scoring functions defined')

In [None]:
# ===== 2.4 Run FastDesign or load from cache =====
if LOAD_FROM_CACHE:
    df_fastdesign = pd.read_csv(CACHE_CSV)
    print(f'[Cache] Loaded FastDesign results: {len(df_fastdesign)} candidates')
    display(df_fastdesign.head())
else:
    from pyrosetta.rosetta.protocols.analysis import InterfaceAnalyzerMover
    from pyrosetta.rosetta.core.pack.task import TaskFactory
    from pyrosetta.rosetta.core.pack.task.operation import (
        RestrictToRepacking, PreventRepacking, OperateOnResidueSubset,
    )
    from pyrosetta.rosetta.core.select.residue_selector import ChainSelector, ResidueIndexSelector
    from pyrosetta.rosetta.protocols.denovo_design.movers import FastDesign as FastDesignMover

    def analyze_interface(pose):
        iam = InterfaceAnalyzerMover(1)
        iam.set_pack_separated(True)
        iam.apply(pose)
        return {
            'dG_REU': pose.scores.get('dG_separated', pose.scores.get('dG_separated/dSASAx100', 0)),
            'dSASA': pose.scores.get('dSASA_int', 0),
        }

    def peptide_seq(pose, chain=2):
        return pose.chain_sequence(chain)

    def diff_positions(seq_wt, seq_mut):
        return [i for i, (a, b) in enumerate(zip(seq_wt, seq_mut)) if a != b]

    def build_task_factory(pose, peptide_chain_id=2, design_positions=None):
        tf = TaskFactory()
        receptor_selector = ChainSelector(1)
        tf.push_back(OperateOnResidueSubset(PreventRepacking(), receptor_selector))
        if design_positions:
            pep_start = pose.chain_begin(peptide_chain_id)
            cys_indices = []
            for i in range(pose.chain_begin(peptide_chain_id),
                           pose.chain_end(peptide_chain_id) + 1):
                if pose.residue(i).name3().strip() == 'CYS':
                    cys_indices.append(i)
            for ci in cys_indices:
                sel = ResidueIndexSelector(str(ci))
                tf.push_back(OperateOnResidueSubset(PreventRepacking(), sel))
            non_design = []
            for i in range(pose.chain_begin(peptide_chain_id),
                           pose.chain_end(peptide_chain_id) + 1):
                local_pos = i - pep_start + 1
                if local_pos not in design_positions and i not in cys_indices:
                    non_design.append(str(i))
            if non_design:
                sel = ResidueIndexSelector(','.join(non_design))
                tf.push_back(OperateOnResidueSubset(RestrictToRepacking(), sel))
        return tf

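    # design_pos lists 1-based peptide positions open to design. build_task_factory
    # locks Cys residues (Cys3 and Cys14 in SST14) regardless of this list, so the
    # entry 14 has no effect; unlisted non-Cys positions (13) are repack-only.
    def fastdesign_candidates(input_pdb='standardized_relaxed.pdb', n=20,
                              design_pos='1,2,4,5,6,7,8,9,10,11,12,14',
                              seed_base=1000, max_retries=3):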
        os.makedirs('candidates', exist_ok=True)
        sfxn = pyrosetta.create_score_function('ref2015')
        design_positions = [int(x) for x in design_pos.split(',')]
        rows = []
        for i in tqdm(range(n), desc='FastDesign'):
            seed = seed_base + i
            success = False
            for attempt in range(max_retries):
                try:
                    pose = pyrosetta.pose_from_pdb(input_pdb)
                    wt_seq = peptide_seq(pose)
                    tf = build_task_factory(pose, 2, design_positions)
                    fd = FastDesignMover(sfxn)
                    fd.set_task_factory(tf)
                    pyrosetta.rosetta.numeric.random.rg().set_seed(seed + attempt * 10000)
                    fd.apply(pose)
                    mut_seq = peptide_seq(pose)
                    diffs = diff_positions(wt_seq, mut_seq)
                    out_pdb = f'candidates/candidate_{i+1:03d}.pdb'
                    pose.dump_pdb(out_pdb)
                    iface = analyze_interface(pose)
                    proxy = stability_pk_proxy_scores(mut_seq)
                    rs = compute_rank_score(iface['dG_REU'], proxy['cleavage_risk'], proxy['pk_penalty'])
                    rows.append({
                        'candidate': f'candidate_{i+1:03d}.pdb',
                        'pdb_path': out_pdb,
                        'seq': mut_seq,
                        'dG_REU': round(iface['dG_REU'], 3),
                        'dSASA': round(iface['dSASA'], 1),
                        'rank_score': round(rs, 3),
                        **proxy,
                        'mut_positions': diffs,
                    })
                    success = True
                    break
                except Exception as e:
                    if attempt == max_retries - 1:
                        print(f'  candidate {i+1} failed: {e}')
        df = pd.DataFrame(rows).sort_values('rank_score', ascending=False).reset_index(drop=True)
        df.to_csv('candidates/df_candidates.csv', index=False)
        return df

    df_fastdesign = fastdesign_candidates()
    print(f'FastDesign complete: {len(df_fastdesign)} candidates')

# ── Filter out Cys violations (0-based indices of Cys in the wild-type sequence) ──
WT_SEQ = 'AGCKNFFWKTFTSC'
CYS_POSITIONS = [i for i, aa in enumerate(WT_SEQ) if aa == 'C']

def is_cys_violation(seq):
    for pos in CYS_POSITIONS:
        if pos < len(seq) and seq[pos] != 'C':
            return True
    return False

df_fastdesign['cys_violation'] = df_fastdesign['seq'].apply(is_cys_violation)
df_fd_filtered = df_fastdesign[~df_fastdesign['cys_violation']].copy()
print(f'Filtering: {len(df_fastdesign)} → {len(df_fd_filtered)} ({df_fastdesign["cys_violation"].sum()} Cys violations removed)')

In [None]:
# ===== 2.5 FastDesign results summary =====
df_show = df_fd_filtered if len(df_fd_filtered) > 0 else df_fastdesign
cols_show = ['candidate', 'seq', 'dG_REU', 'dSASA', 'rank_score', 'cleavage_risk', 'pk_penalty']
available_cols = [c for c in cols_show if c in df_show.columns]
display(df_show[available_cols].head(10))

if 'dG_REU' in df_show.columns and 'rank_score' in df_show.columns:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 4.5))
    ax1.barh(range(min(10, len(df_show))),
             df_show['dG_REU'].head(10).values,
             color='#42a5f5', edgecolor='white')
    ax1.set_yticks(range(min(10, len(df_show))))
    ax1.set_yticklabels(df_show['candidate'].head(10).values, fontsize=9)
    ax1.invert_yaxis()
    ax1.set_xlabel('dG (REU), lower is better')
    ax1.set_title('FastDesign: Binding Energy (Top 10 by Rank Score)', fontweight='bold')

    ax2.barh(range(min(10, len(df_show))),
             df_show['rank_score'].head(10).values,
             color='#66bb6a', edgecolor='white')
    ax2.set_yticks(range(min(10, len(df_show))))
    ax2.set_yticklabels(df_show['candidate'].head(10).values, fontsize=9)
    ax2.invert_yaxis()
    ax2.set_xlabel('Rank Score, higher is better')
    ax2.set_title('FastDesign: Combined Rank, Top 10', fontweight='bold')

    plt.tight_layout()
    plt.show()

---
## Phase 3: De Novo Pipeline (Arm 3)

Design **entirely new** peptides that bind SSTR2 using RFdiffusion + ProteinMPNN + ESMFold.

| Step | Tool | Role |
|------|------|------|
| Step 1 | RFdiffusion | Generate backbones that fit the binding pocket |
| Step 2 | ProteinMPNN | Backbone → optimal sequence |
| Step 3 | ESMFold | Sequence → structure prediction (pLDDT check) |

> **RUN_DENOVO**: if False, load existing results; if True, call the NIM API
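
The three steps form a fan-out: each backbone yields several sequences, and each sequence gets one fold-confidence score. The sketch below shows only that control flow, using stand-in functions that return random data. `generate_backbones`, `design_sequences`, and `fold_confidence` are hypothetical placeholders, not the RFdiffusion, ProteinMPNN, or ESMFold clients, and the pLDDT threshold of 70 is taken from the "high confidence" line drawn in this notebook's plot.

```python
import random

def generate_backbones(n_designs):
    """Stand-in for Step 1: return one placeholder record per backbone."""
    return [{'idx': i} for i in range(n_designs)]

def design_sequences(backbone, seqs_per_backbone, rng, length=12):
    """Stand-in for Step 2: return random sequences instead of designed ones."""
    alphabet = 'ACDEFGHIKLMNPQRSTVWY'
    return [''.join(rng.choice(alphabet) for _ in range(length))
            for _ in range(seqs_per_backbone)]

def fold_confidence(sequence, rng):
    """Stand-in for Step 3: return a random pLDDT-like value."""
    return rng.uniform(40.0, 95.0)

def run_denovo_sketch(n_designs=5, seqs_per_backbone=4, min_plddt=70.0, seed=0):
    rng = random.Random(seed)
    designs = []
    for bb in generate_backbones(n_designs):
        for j, seq in enumerate(design_sequences(bb, seqs_per_backbone, rng)):
            designs.append({
                'backbone_idx': bb['idx'],
                'seq_idx': j,
                'binder_sequence': seq,
                'plddt': fold_confidence(seq, rng),
            })
    # Rank by confidence, then keep only designs above the threshold
    designs.sort(key=lambda d: d['plddt'], reverse=True)
    confident = [d for d in designs if d['plddt'] >= min_plddt]
    return designs, confident

all_designs, confident = run_denovo_sketch()
print(len(all_designs), len(confident))
```

With the notebook's defaults of 5 backbones and 4 sequences per backbone, the fan-out produces 20 designs before filtering.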

In [None]:
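# ===== De Novo run/load mode =====
RUN_DENOVO = False  # True → call the NIM API, False → load existing results

DENOVO_RESULT_FILE = DENOVO_DIR / 'arm3_final_20260210_000106.json'

# Mirror the branch logic of cell 3.1: run only if requested AND an API key exists,
# otherwise load existing results if present, otherwise skip.
if RUN_DENOVO and HAS_API:
    print('[Run] Running the De Novo pipeline via the NIM API')
elif DENOVO_RESULT_FILE.exists():
    if RUN_DENOVO:
        print('No API key; falling back to existing De Novo results')
        RUN_DENOVO = False
    print(f'[Load] Existing De Novo results: {DENOVO_RESULT_FILE}')
else:
    print('No existing results, and the API run is disabled or unavailable; skipping De Novo')
    RUN_DENOVO = False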

In [None]:
# ===== 3.1 Run De Novo or load results =====
df_denovo = None

if RUN_DENOVO and HAS_API:
    # ── Run Arm 3 via the NIM API ──
    from rfdiffusion_client import get_client as get_rfdiffusion
    from proteinmpnn_client import get_client as get_proteinmpnn
    from esmfold_client import get_client as get_esmfold

    pocket_info = json.loads((DOCKING_DIR / 'binding_pocket.json').read_text())
    hotspot_res = pocket_info.get('hotspot_res', [])
    rfdiff_contigs = pocket_info.get('rfdiffusion', {}).get('contigs', 'B1-369/0 10-30')
    receptor_pdb = DOCKING_DIR / 'sstr2_receptor.pdb'

    NUM_DESIGNS = 5
    SEQS_PER_BB = 4

    rfdiff = get_rfdiffusion()
    mpnn = get_proteinmpnn()
    esmfold = get_esmfold()

    designs = []
    print('[Step 1] RFdiffusion backbone design')
    backbones = []
    for i in tqdm(range(NUM_DESIGNS), desc='RFdiffusion'):
        try:
            result = rfdiff.design_binder(
                pdb_path=receptor_pdb, contigs=rfdiff_contigs,
                hotspot_res=hotspot_res[:10], diffusion_steps=50, random_seed=i)
            if result.get('output_pdb'):
                backbones.append({'idx': i, 'pdb': result['output_pdb']})
        except Exception as e:
            print(f'  backbone {i} error: {e}')

    print(f'[Step 2] ProteinMPNN sequence design ({len(backbones)} backbones)')
    seq_designs = []
    for bb in backbones:
        try:
            result = mpnn.predict(input_pdb=bb['pdb'], num_seq_per_target=SEQS_PER_BB, sampling_temp=0.2)
            seqs = result.get('sequences', [])
            if isinstance(seqs, str):
                entries = mpnn.parse_fasta(seqs)
                seqs = [e['sequence'] for e in entries]
            for j, seq in enumerate(seqs):
                seq_designs.append({'backbone_idx': bb['idx'], 'seq_idx': j, 'binder_sequence': seq})
        except Exception as e:
            print(f'  MPNN error: {e}')

    print(f'[Step 3] ESMFold folding check ({len(seq_designs)} sequences)')
    for sd in tqdm(seq_designs, desc='ESMFold'):
        try:
            result = esmfold.predict(sd['binder_sequence'])
            sd['plddt'] = result.get('mean_plddt', result.get('plddt', None))
        except Exception as e:
            sd['plddt'] = None

    df_denovo = pd.DataFrame(seq_designs)
    if 'plddt' in df_denovo.columns:
        df_denovo = df_denovo.dropna(subset=['plddt'])
        df_denovo = df_denovo.sort_values('plddt', ascending=False).reset_index(drop=True)
    print(f'De Novo complete: {len(df_denovo)} designs')

elif DENOVO_RESULT_FILE.exists():
    # ── Load existing results ──
    arm3_data = json.loads(DENOVO_RESULT_FILE.read_text())
    df_denovo = pd.DataFrame(arm3_data['designs'])
    df_denovo = df_denovo.sort_values('plddt', ascending=False).reset_index(drop=True)
    print(f'[Load] De Novo results: {len(df_denovo)} designs')

if df_denovo is not None and len(df_denovo) > 0:
    # Add stability/PK proxy columns on both paths (fresh run and loaded results),
    # so the display below and Phase 4 always find cleavage_risk and pk_penalty.
    proxy_df = pd.DataFrame(
        [stability_pk_proxy_scores(s) for s in df_denovo['binder_sequence']],
        index=df_denovo.index,
    )
    for col in proxy_df.columns:
        df_denovo[col] = proxy_df[col]
    display(df_denovo[['backbone_idx', 'binder_sequence', 'plddt',
                       'cleavage_risk', 'pk_penalty']].head(10))

In [None]:
# ===== 3.2 De Novo visualization =====
if df_denovo is not None and len(df_denovo) > 0:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))

    # pLDDT distribution, colored by backbone
    plddts = df_denovo['plddt'].values
    bb_ids = df_denovo['backbone_idx'].values
    bb_colors = {0: '#e53935', 1: '#1565c0', 2: '#2e7d32', 3: '#ff9800', 4: '#9c27b0'}

    ax1.scatter(range(len(plddts)), np.sort(plddts)[::-1],
                c=[bb_colors.get(bb_ids[i], '#888') for i in np.argsort(plddts)[::-1]],
                s=100, edgecolor='white', linewidth=1.5, zorder=5)
    ax1.axhline(y=70, color='#2e7d32', linestyle='--', linewidth=2, alpha=0.7, label='High confidence (70)')
    ax1.axhline(y=50, color='#ff9800', linestyle='--', linewidth=1.5, alpha=0.5, label='Minimum (50)')
    ax1.fill_between(range(len(plddts)), 70, 100, alpha=0.06, color='#2e7d32')
    ax1.set_xlabel('Design rank'); ax1.set_ylabel('pLDDT')
    ax1.set_title('De Novo: ESMFold pLDDT', fontweight='bold')
    ax1.set_ylim(40, 95); ax1.legend(fontsize=9)

    # Sequence length distribution
    lengths = df_denovo['binder_sequence'].str.len()
    ax2.hist(lengths, bins=range(int(lengths.min()), int(lengths.max()) + 2),
             color='#42a5f5', edgecolor='white', linewidth=1.5)
    ax2.set_xlabel('Peptide length (aa)')
    ax2.set_ylabel('Count')
    ax2.set_title('De Novo: Sequence Length Distribution', fontweight='bold')

    plt.suptitle('Phase 3: De Novo Design Results', fontweight='bold', fontsize=14, y=1.02)
    plt.tight_layout()
    plt.show()

---
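## Phase 4: Unified Ranking (Weighted Score)

Merge the candidates from both pipelines into a single DataFrame and rank them by a normalized weighted-sum score.
Normalization is min-max over the merged candidate pool, so `-dG` and pLDDT values share one scale; scores are most meaningful within a source and only roughly comparable across sources.

### Unified score formula

```
unified_score = w_binding × norm(binding_metric)
              + w_structure × norm(structure_metric)
              - w_risk × norm(cleavage_risk + pk_penalty)
```

| Weight | Default | Meaning |
|--------|--------|------|
| `w_binding` | 0.50 | Binding strength (`-dG` for FastDesign, pLDDT as a proxy for De Novo) |
| `w_structure` | 0.30 | Structural confidence / design quality (dSASA for FastDesign, pLDDT for De Novo) |
| `w_risk` | 0.20 | Stability/PK risk penalty |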
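
As a worked example of the formula, the sketch below scores three toy candidates with the default weights. It uses plain Python lists rather than the notebook's pandas-based `minmax_norm`, and the candidate names and metric values are made up for illustration.

```python
def minmax_norm(values):
    """Scale values to [0, 1]; a constant list maps to 0.5 for every entry."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

W_BINDING, W_STRUCTURE, W_RISK = 0.50, 0.30, 0.20

# Toy candidates (hypothetical values, not pipeline output)
names = ['a', 'b', 'c']
binding = [10.0, 20.0, 30.0]     # higher is better
structure = [60.0, 90.0, 75.0]   # higher is better
risk = [2.0, 6.0, 4.0]           # higher is worse

nb, ns, nr = minmax_norm(binding), minmax_norm(structure), minmax_norm(risk)
unified = {
    name: W_BINDING * b + W_STRUCTURE * s - W_RISK * r
    for name, b, s, r in zip(names, nb, ns, nr)
}
ranking = sorted(unified, key=unified.get, reverse=True)
print(ranking, unified)
```

Candidate `c` ranks first at about 0.55 (0.5 × 1.0 + 0.3 × 0.5 − 0.2 × 0.5), `b` follows at about 0.35, and `a` scores 0 because it sits at the minimum of every metric. Since the risk term is subtracted, the score can range from −0.20 to 0.80 with these weights.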

In [None]:
# ===== 4.1 Build the unified DataFrame =====
W_BINDING   = 0.50
W_STRUCTURE = 0.30
W_RISK      = 0.20

rows_unified = []

# ── Add FastDesign candidates ──
df_fd_src = df_fd_filtered if len(df_fd_filtered) > 0 else df_fastdesign
for _, row in df_fd_src.iterrows():
    dG = row.get('dG_REU', 0)
    rows_unified.append({
        'source': 'FastDesign',
        'name': row.get('candidate', ''),
        'sequence': row.get('seq', ''),
        'length': len(row.get('seq', '')),
        'binding_metric': -dG,  # higher is better (sign flipped)
        'structure_metric': row.get('dSASA', 0),  # dSASA: larger means a broader contact surface
        'plddt': None,  # FastDesign has no pLDDT
        'dG_REU': dG,
        'dSASA': row.get('dSASA', 0),
        'cleavage_risk': row.get('cleavage_risk', 0),
        'pk_penalty': row.get('pk_penalty', 0),
        'risk_total': row.get('cleavage_risk', 0) + row.get('pk_penalty', 0),
        'rank_score_v1': row.get('rank_score', 0),
    })

# ── Add De Novo candidates ──
if df_denovo is not None:
    for _, row in df_denovo.iterrows():
        plddt = row.get('plddt', 0) or 0
        rows_unified.append({
            'source': 'DeNovo',
            # int() guards against backbone_idx arriving as a float (e.g. after a JSON/CSV round trip)
            'name': f"bb{int(row['backbone_idx']):02d}_seq{row['seq_idx']}",
            'sequence': row.get('binder_sequence', ''),
            'length': len(row.get('binder_sequence', '')),
            'binding_metric': plddt,  # De Novo uses pLDDT as a binding proxy
            'structure_metric': plddt,
            'plddt': plddt,
            'dG_REU': None,
            'dSASA': None,
            'cleavage_risk': row.get('cleavage_risk', 0),
            'pk_penalty': row.get('pk_penalty', 0),
            'risk_total': row.get('cleavage_risk', 0) + row.get('pk_penalty', 0),
            'rank_score_v1': None,
        })

df_unified = pd.DataFrame(rows_unified)
print(f'Unified candidates: {len(df_unified)} (FastDesign: {(df_unified["source"]=="FastDesign").sum()}, '
      f'DeNovo: {(df_unified["source"]=="DeNovo").sum()})')

In [None]:
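# ===== 4.2 Min-max normalization + weighted sum =====
def minmax_norm(series):
    """Min-max normalize to [0, 1]. NaN is filled with the column median
    (0 if every value is NaN); a constant column maps to 0.5."""
    s = series.fillna(series.median() if series.notna().any() else 0)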
    mn, mx = s.min(), s.max()
    if mx == mn:
        return pd.Series(0.5, index=series.index)
    return (s - mn) / (mx - mn)

df_unified['norm_binding']   = minmax_norm(df_unified['binding_metric'])
df_unified['norm_structure'] = minmax_norm(df_unified['structure_metric'])
df_unified['norm_risk']      = minmax_norm(df_unified['risk_total'])

df_unified['unified_score'] = (
    W_BINDING   * df_unified['norm_binding']
    + W_STRUCTURE * df_unified['norm_structure']
    - W_RISK      * df_unified['norm_risk']
)

df_unified = df_unified.sort_values('unified_score', ascending=False).reset_index(drop=True)
df_unified['unified_rank'] = range(1, len(df_unified) + 1)

# ── Save results ──
df_unified.to_csv(OUTPUT_DIR / 'unified_ranking.csv', index=False)
print(f'Unified ranking saved: {OUTPUT_DIR / "unified_ranking.csv"}')
print('\nTop 15:')
display(df_unified[['unified_rank', 'source', 'name', 'sequence', 'length',
                     'unified_score', 'binding_metric', 'plddt',
                     'dG_REU', 'cleavage_risk', 'pk_penalty']].head(15))

---
## Phase 5: Final Dashboard

In [None]:
# ===== 5.1 Unified dashboard visualization =====
fig = plt.figure(figsize=(16, 10))
gs = fig.add_gridspec(2, 3, hspace=0.35, wspace=0.35)

# ── (1) Pipeline overview ──
ax = fig.add_subplot(gs[0, 0])
ax.axis('off')
summary_items = [
    ('AlphaFold3', f'Ranking = {scores[best_idx]:.2f}' if confidences else 'N/A', '#e3f2fd'),
    ('FoldMason', 'lDDT = 0.664', '#fce4ec'),
    ('Pocket', f'{pocket["num_pocket_residues"]} residues' if pocket else 'N/A', '#e8f5e9'),
]
for i, (name, val, bg) in enumerate(summary_items):
    y = 0.85 - i * 0.32
    ax.add_patch(plt.Rectangle((0.05, y - 0.08), 0.9, 0.22, facecolor=bg,
                                edgecolor='#bdbdbd', linewidth=1, transform=ax.transAxes, zorder=2))
    ax.text(0.5, y + 0.04, name, transform=ax.transAxes, ha='center', fontsize=12, fontweight='bold', zorder=3)
    ax.text(0.5, y - 0.04, val, transform=ax.transAxes, ha='center', fontsize=10, color='#555', zorder=3)
ax.set_title('Structure QC', fontweight='bold', fontsize=12)

# ── (2) Candidates per source ──
ax = fig.add_subplot(gs[0, 1])
source_counts = df_unified['source'].value_counts()
src_colors = {'FastDesign': '#42a5f5', 'DeNovo': '#ef5350'}
ax.bar(source_counts.index, source_counts.values,
       color=[src_colors.get(s, '#888') for s in source_counts.index],
       edgecolor='white', linewidth=2, width=0.5)
for i, (s, c) in enumerate(zip(source_counts.index, source_counts.values)):
    ax.text(i, c + 0.5, str(c), ha='center', fontsize=14, fontweight='bold')
ax.set_ylabel('Candidates')
ax.set_title('Pipeline Candidates', fontweight='bold', fontsize=12)

# ── (3) Unified score distribution ──
ax = fig.add_subplot(gs[0, 2])
for src, color in src_colors.items():
    mask = df_unified['source'] == src
    if mask.any():
        ax.hist(df_unified.loc[mask, 'unified_score'], bins=10, alpha=0.7,
                color=color, edgecolor='white', label=src)
ax.set_xlabel('Unified Score')
ax.set_ylabel('Count')
ax.set_title('Score Distribution', fontweight='bold', fontsize=12)
ax.legend(fontsize=9)

# ── (4) Top 15 unified ranking ──
ax = fig.add_subplot(gs[1, :])
top_n = min(15, len(df_unified))
top_df = df_unified.head(top_n)
bar_colors = [src_colors.get(s, '#888') for s in top_df['source']]
bars = ax.barh(range(top_n), top_df['unified_score'], color=bar_colors,
               edgecolor='white', linewidth=1.5, height=0.7)
ax.set_yticks(range(top_n))
ylabels = [f"[{row['source'][:2]}] {row['name']}" for _, row in top_df.iterrows()]
ax.set_yticklabels(ylabels, fontsize=9)
ax.invert_yaxis()
ax.set_xlabel('Unified Score', fontsize=11)
ax.set_title('Top 15 Unified Ranking', fontweight='bold', fontsize=13)

# Score labels (missing values are NaN inside the DataFrame, so test with pd.notna)
for i, (_, row) in enumerate(top_df.iterrows()):
    label = f"{row['unified_score']:.3f}"
    if pd.notna(row.get('dG_REU')):
        label += f" (dG={row['dG_REU']:.1f})"
    if pd.notna(row.get('plddt')):
        label += f" (pLDDT={row['plddt']:.1f})"
    ax.text(row['unified_score'] + 0.005, i, label, va='center', fontsize=8)

# Legend (Patch is already imported in Phase 0)
legend_elements = [Patch(facecolor=c, edgecolor='white', label=s) for s, c in src_colors.items()]
ax.legend(handles=legend_elements, fontsize=9, loc='lower right')

fig.suptitle('SSTR2 Unified Binder Discovery — Final Dashboard',
             fontweight='bold', fontsize=16, y=1.01)
plt.show()

In [None]:
# ===== 5.2 Top 5 candidate detail cards =====
top5 = df_unified.head(5)

html_cards = ''
for i, (_, row) in enumerate(top5.iterrows()):
    src_badge_color = '#1565c0' if row['source'] == 'FastDesign' else '#e53935'
    metrics = []
    # Missing values are NaN inside the DataFrame, so test with pd.notna
    if pd.notna(row.get('dG_REU')):
        metrics.append(f'dG = {row["dG_REU"]:.1f} REU')
    if pd.notna(row.get('plddt')):
        metrics.append(f'pLDDT = {row["plddt"]:.1f}')
    if pd.notna(row.get('dSASA')) and row['dSASA']:
        metrics.append(f'dSASA = {row["dSASA"]:.0f} Å²')
    metrics.append(f'Risk = {row["risk_total"]:.1f}')

    html_cards += f'''
    <div style="background:#fff;border:2px solid {src_badge_color};border-radius:12px;
                padding:16px;margin:8px 0;box-shadow:0 2px 4px rgba(0,0,0,0.1)">
      <div style="display:flex;justify-content:space-between;align-items:center">
        <div>
          <span style="background:{src_badge_color};color:white;padding:3px 10px;
                       border-radius:20px;font-size:12px;font-weight:bold">
            #{i+1} {row['source']}
          </span>
          <span style="margin-left:10px;font-weight:bold;font-size:14px">{row['name']}</span>
        </div>
        <span style="font-size:20px;font-weight:bold;color:{src_badge_color}">
          {row['unified_score']:.3f}
        </span>
      </div>
      <div style="font-family:monospace;font-size:13px;margin:8px 0;padding:6px;
                  background:#f5f5f5;border-radius:6px">
        {row['sequence']}
      </div>
      <div style="font-size:12px;color:#666">
        Length: {row['length']}aa &nbsp;|&nbsp; {' &nbsp;|&nbsp; '.join(metrics)}
      </div>
    </div>'''

display(HTML(f'''
<div style="max-width:800px">
  <h3>Top 5 Unified Candidates</h3>
  {html_cards}
</div>
'''))

In [None]:
# ===== 5.3 Next Steps =====
display(HTML('''
<div style="background:linear-gradient(135deg,#1565c0,#42a5f5);color:white;
            padding:30px;border-radius:12px;margin:20px 0">
  <h2 style="margin-top:0">Next Steps</h2>
  <ol style="font-size:15px;line-height:2.0">
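    <li><b>FlexPepDock Refine</b>: refine the top De Novo candidates in the SSTR2 complex with FlexPepDock</li>
    <li><b>AlphaFold3 re-prediction</b>: re-predict the SSTR2 complex structure for the unified top 5 candidate sequences</li>
    <li><b>MD Simulation</b>: validate the binding stability of the final candidates with molecular dynamics</li>
    <li><b>Arm 2 integration</b>: add the MolMIM+DiffDock small-molecule candidates to the unified ranking</li>
    <li><b>Experimental validation</b>: in vitro binding assays (SPR, ITC)</li>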
  </ol>
</div>
'''))

print('\n=== Run complete ===')
print(f'Unified candidates: {len(df_unified)}')
print(f'Results saved: {OUTPUT_DIR / "unified_ranking.csv"}')
print(f'Top 1: [{df_unified.iloc[0]["source"]}] {df_unified.iloc[0]["name"]} '
      f'(score={df_unified.iloc[0]["unified_score"]:.3f})')