
# AntibodyML Glycosylation Scanner v3.0: Demo
## Scanning De Novo RFantibody Designs for N-Glycosylation Liabilities

**Author:** AntibodyML Consulting LLC  
**Date:** January 2026  
**Version:** 3.0  

---

## Purpose

This notebook demonstrates the Enhanced Fab Glycosylation Scanner v3.0 by analyzing 15 de novo antibody designs generated using **RFantibody** (Bennett et al., *Nature* 2025), the antibody-finetuned version of RFdiffusion.

The goal is to illustrate a gap in current AI-driven antibody design pipelines: **structure-based tools like RFdiffusion and sequence-design tools like ProteinMPNN have no awareness of post-translational modification liabilities**, including N-linked glycosylation sequons that can compromise manufacturing in CHO cells.

---

## ⚠️ Important Caveats and Limitations

### Study Scope

| Parameter | Value | Implication |
|-----------|-------|-------------|
| Sample size | N=15 designs | Small sample; patterns may not generalize |
| Target antigen | SARS-CoV-2 RBD (PDB: 6M0J) | Single target; other targets may yield different results |
| Framework scaffold | Adalimumab (PDB: 4NYL) | Framework-specific progenitors may not apply to other scaffolds |
| Design method | RFantibody + ProteinMPNN | Results specific to this pipeline |

### Scanner Limitations

1. **The scanner predicts LIABILITY (sequon presence), not OCCUPANCY (glycan presence).** Mass spectrometry is required to confirm actual glycosylation.

2. **The scanner predicts RISK, not OUTCOME.** Functional impact on binding, stability, or manufacturability requires experimental validation.

3. **Progenitor sites vs. active sequons:**
   - **Active sequons (N-X-S/T):** Direct manufacturing risk in CHO cells
   - **Progenitor sites (D-X-S/T, N-X-A, etc.):** One mutation away from becoming sequons. For fixed therapeutic sequences, these are *not* direct manufacturing risks. They represent evolutionary potential during affinity maturation (relevant for natural antibodies or further engineering campaigns).

4. **IMGT numbering depends on AntPack alignment quality.** De novo sequences may have unusual features that affect numbering accuracy.

5. **Vernier zone boundaries are approximate.** Exact functional residues vary by antibody structure.

---

## Data Provenance

### How These Sequences Were Generated

1. **RFantibody backbone generation:**
   - Target: SARS-CoV-2 spike RBD (PDB 6M0J, chain E)
   - Framework template: Adalimumab/Humira (PDB 4NYL)
   - Hotspot residues: E455, E456, E486, E489, E505 (ACE2 binding interface)
   - CDR loop length ranges: L1:10-12, L2:7, L3:8-10, H1:7, H2:6, H3:12-16
   - Diffusion timesteps: T=100
   - Output: 15 backbone structures

2. **ProteinMPNN sequence design:**
   - Input: RFantibody backbone PDBs
   - Framework residues: Fixed from adalimumab
   - CDR residues: Designed by ProteinMPNN
   - Output: 15 full sequences (Heavy + Light chains)

### Key Point

Neither RFdiffusion nor ProteinMPNN considers glycosylation risk. The tools optimize for:
- Structural stability
- Target complementarity
- Sequence-structure compatibility

They do **not** optimize for:
- N-X-S/T sequon avoidance
- Deamidation liability (N-G, N-S motifs)
- Oxidation liability (exposed Met/Trp)
- Isomerization liability (D-G motifs)
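
The motif-based liabilities listed above can be screened with simple pattern matching. The following is a minimal sketch, not part of the scanner: the `LIABILITY_PATTERNS` table and the `find_motif_liabilities` helper are hypothetical names introduced for illustration. Exposed Met/Trp oxidation is left out because it depends on solvent accessibility, which sequence alone does not provide.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical motif table for illustration; patterns follow the list above.
LIABILITY_PATTERNS: Dict[str, str] = {
    "N-glycosylation sequon": r"N[^P][ST]",
    "deamidation": r"N[GS]",
    "isomerization": r"DG",
}

def find_motif_liabilities(sequence: str) -> List[Tuple[str, int, str]]:
    """Return (liability, 1-indexed position, motif) for every match."""
    hits = []
    for label, pattern in LIABILITY_PATTERNS.items():
        # A zero-width lookahead reports overlapping motifs individually.
        for m in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((label, m.start() + 1, m.group(1)))
    # Sort by position; ties keep the order of the pattern table.
    return sorted(hits, key=lambda h: h[1])

# Heavy-chain CDR2 fragment taken from ab_des_0.
for hit in find_motif_liabilities("AVDANGKGKYYADSVKG"):
    print(hit)
```

A motif hit here carries the same caveat as a sequon hit: it flags a sequence-level liability, not an observed modification.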

---

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1. Installation and Imports

In [2]:
# Install AntPack for IMGT numbering
!pip install antpack==0.3.8.6 -q

import re
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass, field
from enum import Enum

# AntPack imports
try:
    from antpack import SingleChainAnnotator, VJGeneTool
    ANTPACK_AVAILABLE = True
    print("✓ AntPack imported successfully")
except ImportError as e:
    ANTPACK_AVAILABLE = False
    print(f"✗ AntPack import failed: {e}")
    print("  Some features will be unavailable.")

✓ AntPack imported successfully


## 2. Input Data: RFantibody Sequences

These 15 antibody sequences were generated by RFantibody (backbone) + ProteinMPNN (sequence design) targeting SARS-CoV-2 RBD using an adalimumab framework scaffold.

**Format:** Variable domain sequences only (Fv region)

In [3]:
# =============================================================================
# INPUT: 15 RFantibody de novo designs
# Target: SARS-CoV-2 RBD (PDB 6M0J)
# Framework: Adalimumab (PDB 4NYL)
# Generated: January 2026
# =============================================================================

RFANTIBODY_SEQUENCES = {
    "ab_des_0": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFGDYAMHWVRQAPGKGLEYVSAVDANGKGKYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCVSASTRSDIRGPLVGWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYNSSTRAGGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCLVTSGRHGFGQGTKVEIK"
    },
    "ab_des_1": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFGDYAMHWVRQAPGKGLEYVSAVSAGGRGHYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCVSASTRSDIRGPLVGWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYASSTRKGGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCAQFDSTPWGFGQGTKVEIK"
    },
    "ab_des_2": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFSNWVHWVRQAPGKGLEYVSAISRSGATGKYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCYYSQRNDSPRGFGWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYAASTRRGGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCAHWSGSPMGFGQGTKVEIK"
    },
    "ab_des_3": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFKDWVSAINASGSGIDVYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARSAASTLVTSAGIDYWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYTYYQQNFSSTPMGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCARFDTSPLGFGQGTKVEIK"
    },
    "ab_des_4": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFADWVHWVRQAPGKGLEYVSAVSAGGRGHYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCALSTNGSLLGTSAWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYNISTRAGGLPSRFSGSGSGTDFTLTISSLQPEDFATYYCARFETTPMGFGQGTKVEIK"
    },
    "ab_des_5": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFADWVHWVRQAPGKGLEYVSAISAGDKGHYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARSATSGFDRTSTIDYWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYASSTKRGGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCAQFELTPMGFGQGTKVEIK"
    },
    "ab_des_6": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFNLSNGAMHWVRQAPGKGLEYVSAIDAGDKADRYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARSADIDRSSTIDYWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYSSSKRKGGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCAQFGSTPWGFGQGTKVEIK"
    },
    "ab_des_7": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFANWVHWVRQAPGKGLEYVSAISAGGKGHFYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARGARNNTSRHTTYIDYWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYNNTSTRVGGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCAQFETTPMGFGQGTKVEIK"
    },
    "ab_des_8": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFADWVHWVRQAPGKGLEYVSAISAGDKGHYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARSASRDRGDSTIDYWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYAASTKRGGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCAQFETTPMGFGQGTKVEIK"
    },
    "ab_des_9": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFADWVHWVRQAPGKGLEYVSAIDADGKGHYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARGASRDRSTIDYWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYAASTRRGGLPSRFSGSGSGTDFTLTISSLQPEDFATYYCAQFETTPMGFGQGTKVEIK"
    },
    "ab_des_10": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFADWVHWVRQAPGKGLEYVSAISAGGKGRYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARSAASDIRRSTIDYWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYAASTKRGGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCAQFETTPMGFGQGTKVEIK"
    },
    "ab_des_11": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFADWVHWVRQAPGKGLEYVSAISAGDKGHYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARSAARDIRGSTIDYWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYNSSTRAGGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCAQFETTPMGFGQGTKVEIK"
    },
    "ab_des_12": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFADWVHWVRQAPGKGLEYVSAISRGDAGKYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARGASRDIRRATIDYWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYAASTKRGGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCAQFETTPMGFGQGTKVEIK"
    },
    "ab_des_13": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFADWVHWVRQAPGKGLEYVSAISAGGKGHYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARGAARDIRSTLVDYWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYAASTKRGGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCAQFETTPMGFGQGTKVEIK"
    },
    "ab_des_14": {
        "H": "EVQLVESGGGLVKPGGSLRLSCAASGFPFGDYAMHWVRQAPGKGLEYVSAVDANGKGHYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCLSETGHLGSALVGWGQGTLVTVSS",
        "L": "DIQMTQSPSFLSASVGDRVTITCRASQGISNYLAWYQQKPGKAPKLLIYRNSKRAGGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCAQFDTTPMGFGQGTKVEIK"
    },
}

print(f"Loaded {len(RFANTIBODY_SEQUENCES)} antibody designs")
print(f"Each design has Heavy (H) and Light (L) chain sequences")

Loaded 15 antibody designs
Each design has Heavy (H) and Light (L) chain sequences


## 3. Scanner Configuration

### 3.1 X-Position Glycosylation Efficiency

The amino acid at the X position in N-X-S/T sequons affects glycosylation efficiency. Values from Shakin-Eshleman et al. (1996) *JBC* 271:6363-6366.

| Category | Residues | Efficiency | Notes |
|----------|----------|------------|-------|
| Blocked | P | 0.00 | Proline prevents OST access |
| Low | W, D, E, L | 0.15-0.25 | Bulky or charged |
| Medium | K, R, H, A, M, N, Q, C | 0.40-0.55 | Variable |
| High | Y, F, V, I, G, S, T | 0.70-0.90 | Preferred by OST |

### 3.2 NXT vs NXS

NXT sequons glycosylate ~3x more efficiently than NXS (Kasturi et al. 1995).
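
As a worked example of how the two factors combine, the occupancy score used in this notebook is the X-position efficiency multiplied by the sequon-type multiplier. The sketch below is self-contained and uses only a subset of the table values. The name `occupancy_score` is introduced here for illustration, and the result is a relative heuristic, not a measured occupancy.

```python
# Subset of the X-position efficiencies from the table above.
X_EFF = {"P": 0.00, "D": 0.20, "G": 0.85, "S": 0.85, "T": 0.90}
# NXS is scored at roughly one third of NXT.
TYPE_MULT = {"T": 1.0, "S": 0.33}

def occupancy_score(motif: str) -> float:
    """Heuristic score for a three-residue N-X-S/T motif."""
    if len(motif) != 3 or motif[0] != "N" or motif[2] not in TYPE_MULT:
        raise ValueError(f"not an N-X-S/T motif: {motif!r}")
    return X_EFF[motif[1]] * TYPE_MULT[motif[2]]

for motif in ("NSS", "NDS", "NTT"):
    print(motif, round(occupancy_score(motif), 2))
```

For `NSS` this gives 0.85 × 0.33 ≈ 0.28, the score reported for the ab_des_0 light chain site in the scan output.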

In [4]:
# =============================================================================
# SCANNER CONFIGURATION
# =============================================================================

# X-position efficiency (Shakin-Eshleman 1996)
X_POSITION_EFFICIENCY = {
    'P': 0.00,  # BLOCKED
    'W': 0.15, 'D': 0.20, 'E': 0.20, 'L': 0.25,  # LOW
    'K': 0.40, 'R': 0.40, 'H': 0.45, 'A': 0.50, 'M': 0.50,  # MEDIUM
    'N': 0.55, 'Q': 0.55, 'C': 0.50,
    'Y': 0.70, 'F': 0.80, 'V': 0.80, 'I': 0.80,  # HIGH
    'G': 0.85, 'S': 0.85, 'T': 0.90,
}

# NXT vs NXS multiplier
SEQUON_TYPE_MULTIPLIER = {'NXT': 1.0, 'NXS': 0.33}

# IMGT region boundaries
IMGT_REGIONS = {
    'FR1':  (1, 26),
    'CDR1': (27, 38),
    'FR2':  (39, 55),
    'CDR2': (56, 65),
    'FR3':  (66, 104),
    'CDR3': (105, 117),
    'FR4':  (118, 128),
}

# Vernier zone (DE loop) - positions that influence CDR conformations
VERNIER_ZONE = (75, 88)

# Progenitor hotspots from van de Bovenkamp 2018 SI Table S1
PROGENITOR_HOTSPOTS = {77: 278, 81: 256, 59: 187, 84: 137, 82: 124}

class RiskLevel(Enum):
    CRITICAL = "CRITICAL"  # CDR3
    HIGH = "HIGH"          # CDR1/2, Vernier zone
    MEDIUM = "MEDIUM"      # FR2, other FR3
    LOW = "LOW"            # FR1, FR4

print("✓ Scanner configuration loaded")

✓ Scanner configuration loaded


## 4. Scanner Functions

In [5]:
# =============================================================================
# HELPER FUNCTIONS
# =============================================================================

def get_x_efficiency(x_residue: str) -> float:
    """Get glycosylation efficiency for X-position residue."""
    return X_POSITION_EFFICIENCY.get(x_residue.upper(), 0.50)

def get_sequon_type(motif: str) -> str:
    """Classify sequon as NXT or NXS."""
    if len(motif) >= 3:
        third = motif[2].upper()
        if third == 'T': return 'NXT'
        elif third == 'S': return 'NXS'
    return 'Unknown'

def get_imgt_region(imgt_pos: str) -> str:
    """Map IMGT position to antibody region."""
    if imgt_pos is None or imgt_pos == '-' or imgt_pos == '':
        return "Unknown"
    try:
        num = int(''.join(c for c in str(imgt_pos) if c.isdigit()))
    except ValueError:
        return "Unknown"
    for region, (start, end) in IMGT_REGIONS.items():
        if start <= num <= end:
            return region
    if num > 128:
        return "C-region"
    return "Unknown"

def is_vernier_zone(imgt_pos: str) -> bool:
    """Check if position is in the Vernier zone (IMGT 75-88)."""
    if imgt_pos is None or imgt_pos == '-':
        return False
    try:
        num = int(''.join(c for c in str(imgt_pos) if c.isdigit()))
        return VERNIER_ZONE[0] <= num <= VERNIER_ZONE[1]
    except ValueError:
        return False

def classify_position_risk(imgt_pos: str, region: str) -> RiskLevel:
    """Assign risk level based on position and region."""
    if region == "CDR3": return RiskLevel.CRITICAL
    if region in ["CDR1", "CDR2"]: return RiskLevel.HIGH
    if region == "FR3" and is_vernier_zone(imgt_pos): return RiskLevel.HIGH
    if region in ["FR2", "FR3"]: return RiskLevel.MEDIUM
    return RiskLevel.LOW

print("✓ Helper functions defined")

✓ Helper functions defined
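
To make the behavior of these helpers concrete, the sketch below restates the label parsing and risk mapping in one compact, self-contained function. `region_and_risk` is a hypothetical name, and plain strings stand in for the `RiskLevel` enum. One behavior worth noticing: a position that could not be numbered ends up as `Unknown` and falls through to `LOW`, so a `LOW` label may mean "low risk" or "not numbered".

```python
from typing import Optional, Tuple

REGIONS = {
    "FR1": (1, 26), "CDR1": (27, 38), "FR2": (39, 55), "CDR2": (56, 65),
    "FR3": (66, 104), "CDR3": (105, 117), "FR4": (118, 128),
}
VERNIER = (75, 88)

def imgt_number(label: Optional[str]) -> Optional[int]:
    """Keep only the digits of a label, so '111A' becomes 111."""
    digits = "".join(c for c in str(label or "") if c.isdigit())
    return int(digits) if digits else None

def region_and_risk(label: Optional[str]) -> Tuple[str, str]:
    """Map an IMGT label to (region, risk) using the notebook's rules."""
    num = imgt_number(label)
    region = "Unknown"
    if num is not None:
        region = next(
            (name for name, (lo, hi) in REGIONS.items() if lo <= num <= hi),
            "C-region" if num > 128 else "Unknown",
        )
    if region == "CDR3":
        risk = "CRITICAL"
    elif region in ("CDR1", "CDR2"):
        risk = "HIGH"
    elif region == "FR3" and VERNIER[0] <= num <= VERNIER[1]:
        risk = "HIGH"
    elif region in ("FR2", "FR3"):
        risk = "MEDIUM"
    else:
        risk = "LOW"  # FR1, FR4, C-region, and unnumbered positions
    return region, risk

for label in ("111A", "81", "56", "100", "-", None):
    print(label, region_and_risk(label))
```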


In [6]:
# =============================================================================
# DATA STRUCTURES
# =============================================================================

@dataclass
class GlycosylationSite:
    """An active N-X-S/T sequon (direct glycosylation liability)."""
    position: int
    motif: str
    context: str
    imgt_position: Optional[str] = None
    region: Optional[str] = None
    sequon_type: Optional[str] = None
    x_residue: Optional[str] = None
    x_efficiency: Optional[float] = None
    occupancy_score: Optional[float] = None
    position_risk: Optional[RiskLevel] = None
    is_vernier_zone: bool = False

@dataclass
class ProgenitorSite:
    """A latent site one mutation away from N-X-S/T."""
    position: int
    current_motif: str
    progenitor_type: str
    potential_sequon: str
    context: str
    imgt_position: Optional[str] = None
    region: Optional[str] = None
    position_risk: Optional[RiskLevel] = None
    is_vernier_zone: bool = False
    predicted_efficiency: Optional[float] = None

@dataclass
class ScanResult:
    """Complete scan result for one chain."""
    name: str
    sequence: str
    chain_type: str
    v_gene: Optional[str] = None
    glycosylation_sites: List[GlycosylationSite] = field(default_factory=list)
    progenitor_sites: List[ProgenitorSite] = field(default_factory=list)
    total_sites: int = 0
    total_progenitors: int = 0
    highest_risk: Optional[RiskLevel] = None

print("✓ Data structures defined")

✓ Data structures defined


In [7]:
# =============================================================================
# SEQUON DETECTION
# =============================================================================

def scan_glycosylation_sites(sequence: str, context_window: int = 5) -> List[GlycosylationSite]:
    """
    Scan for N-X-S/T motifs (X ≠ P).

    These are ACTIVE sequons that can be glycosylated in CHO cells.
    """
    sites = []
    pattern = re.compile(r'N[^P][ST]')

    for match in pattern.finditer(sequence):
        pos = match.start() + 1  # 1-indexed
        motif = match.group()
        x_residue = motif[1]

        start = max(0, match.start() - context_window)
        end = min(len(sequence), match.end() + context_window)
        context = sequence[start:end]

        sequon_type = get_sequon_type(motif)
        x_eff = get_x_efficiency(x_residue)
        seq_mult = SEQUON_TYPE_MULTIPLIER.get(sequon_type, 0.5)
        occupancy = x_eff * seq_mult

        sites.append(GlycosylationSite(
            position=pos, motif=motif, context=context,
            sequon_type=sequon_type, x_residue=x_residue,
            x_efficiency=x_eff, occupancy_score=occupancy,
        ))

    return sites

print("✓ Sequon detection function defined")

✓ Sequon detection function defined
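
One property of `re.finditer` matters here: it returns non-overlapping matches. In a stretch such as `NNTS`, the pattern `N[^P][ST]` reports `NNT` but not the overlapping `NTS` that begins one residue later. The sketch below contrasts the plain pattern with an overlap-aware variant built on a zero-width lookahead. It is an illustrative alternative, not the function that produced the results reported in this notebook, and the helper names are hypothetical.

```python
import re
from typing import List, Tuple

SEQUON = re.compile(r"N[^P][ST]")
# The lookahead consumes no characters, so every start position is tested.
SEQUON_OVERLAPPING = re.compile(r"(?=(N[^P][ST]))")

def find_sequons(sequence: str) -> List[Tuple[int, str]]:
    """Non-overlapping matches, as (1-indexed position, motif)."""
    return [(m.start() + 1, m.group()) for m in SEQUON.finditer(sequence)]

def find_sequons_overlapping(sequence: str) -> List[Tuple[int, str]]:
    """All matches, including ones that share residues."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON_OVERLAPPING.finditer(sequence)]

# Light-chain CDR2 fragment taken from ab_des_7.
fragment = "IYNNTSTRV"
print(find_sequons(fragment))              # [(3, 'NNT')]
print(find_sequons_overlapping(fragment))  # [(3, 'NNT'), (4, 'NTS')]
```

Whether two overlapping sequons should count as one liability or two is a reporting choice. Either way, the second asparagine is itself a candidate acceptor site.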


In [8]:
# =============================================================================
# PROGENITOR DETECTION
# =============================================================================

def scan_progenitor_sites(sequence: str, context_window: int = 5) -> List[ProgenitorSite]:
    """
    Scan for progenitor sites (one mutation from N-X-S/T).

    Patterns detected:
    - D-X-S/T → N-X-S/T (Asp→Asn: single nucleotide change)
    - N-X-A → N-X-S (Ala→Ser: single nucleotide change)
    - N-P-S/T → N-X-S/T (Pro removal unblocks)

    NOTE: For fixed therapeutic sequences, progenitors are NOT direct
    manufacturing risks. They represent evolutionary potential only.
    """
    progenitors = []

    # D-X-S/T → N-X-S/T
    for match in re.finditer(r'D[^P][ST]', sequence):
        pos = match.start() + 1
        motif = match.group()
        potential = 'N' + motif[1:]
        start = max(0, match.start() - context_window)
        end = min(len(sequence), match.end() + context_window)
        context = sequence[start:end]
        x_eff = get_x_efficiency(motif[1])
        seq_type = get_sequon_type(potential)
        predicted_eff = x_eff * SEQUON_TYPE_MULTIPLIER.get(seq_type, 0.5)
        progenitors.append(ProgenitorSite(
            position=pos, current_motif=motif, progenitor_type="D→N",
            potential_sequon=potential, context=context,
            predicted_efficiency=predicted_eff,
        ))

    # N-X-A → N-X-S
    for match in re.finditer(r'N[^P]A', sequence):
        pos = match.start() + 1
        motif = match.group()
        potential = motif[:2] + 'S'
        start = max(0, match.start() - context_window)
        end = min(len(sequence), match.end() + context_window)
        context = sequence[start:end]
        x_eff = get_x_efficiency(motif[1])
        predicted_eff = x_eff * SEQUON_TYPE_MULTIPLIER['NXS']
        progenitors.append(ProgenitorSite(
            position=pos, current_motif=motif, progenitor_type="A→S",
            potential_sequon=potential, context=context,
            predicted_efficiency=predicted_eff,
        ))

    # N-P-S/T → N-X-S/T (blocked, could unblock)
    for match in re.finditer(r'NP[ST]', sequence):
        pos = match.start() + 1
        motif = match.group()
        potential = f"N-X-{motif[2]}"
        start = max(0, match.start() - context_window)
        end = min(len(sequence), match.end() + context_window)
        context = sequence[start:end]
        seq_type = 'NXT' if motif[2] == 'T' else 'NXS'
        predicted_eff = 0.5 * SEQUON_TYPE_MULTIPLIER.get(seq_type, 0.5)
        progenitors.append(ProgenitorSite(
            position=pos, current_motif=motif, progenitor_type="P→X (unblock)",
            potential_sequon=potential, context=context,
            predicted_efficiency=predicted_eff,
        ))

    return progenitors

print("✓ Progenitor detection function defined")

✓ Progenitor detection function defined
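
The "single nucleotide change" statements in the docstring can be checked against the standard genetic code. The sketch below builds the codon table from the conventional TCAG ordering and computes the minimum number of nucleotide substitutions separating any codon of one amino acid from any codon of another. `min_nt_changes` is a hypothetical helper and is not part of the scanner.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def codons_for(aa: str) -> list:
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, encoded in CODON_TABLE.items() if encoded == aa]

def min_nt_changes(src: str, dst: str) -> int:
    """Fewest nucleotide substitutions that turn a src codon into a dst codon."""
    return min(
        sum(x != y for x, y in zip(a, b))
        for a in codons_for(src)
        for b in codons_for(dst)
    )

for src, dst in (("D", "N"), ("A", "S"), ("D", "S")):
    print(f"{src}->{dst}: {min_nt_changes(src, dst)} nucleotide change(s)")
```

Asp→Asn (GAT→AAT) and Ala→Ser (GCT→TCT) each need one substitution, while Asp→Ser needs two. That difference is why D-X-S/T counts as a progenitor of N-X-S/T, and why motifs needing two substitutions are not scanned.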


In [9]:
# =============================================================================
# ANTPACK INTEGRATION
# =============================================================================

def run_antpack_analysis(sequence: str, chain_type: str) -> Dict:
    """Run IMGT numbering and V-gene assignment via AntPack."""
    if not ANTPACK_AVAILABLE:
        return {"error": "AntPack not available", "numbering": None}

    results = {"numbering": None, "v_gene": None, "error": None}

    try:
        chains = ['H'] if chain_type == 'H' else ['K', 'L']
        annotator = SingleChainAnnotator(chains=chains, scheme="imgt")
        result_tuple = annotator.analyze_seq(sequence)
        numbering, pct_id, detected_chain, err = result_tuple
        results["numbering"] = numbering
        results["error"] = err if err else None

        if numbering is not None and not err:
            try:
                vj_tool = VJGeneTool()
                vj_result = vj_tool.assign_vj_genes(result_tuple, sequence, 'human')
                if vj_result:
                    results["v_gene"] = vj_result[0]
            except Exception:
                pass
    except Exception as e:
        results["error"] = str(e)

    return results

def map_position_to_imgt(linear_pos: int, numbering: List) -> Optional[str]:
    """Map linear position to IMGT numbering."""
    if numbering is None:
        return None
    idx = linear_pos - 1
    if idx < 0 or idx >= len(numbering):
        return None
    imgt_pos = numbering[idx]
    if imgt_pos == '-' or imgt_pos == '':
        return None
    return str(imgt_pos)

print("✓ AntPack integration functions defined")

✓ AntPack integration functions defined
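
The mapping step assumes the numbering list is aligned one-to-one with the input sequence, since the code above indexes it by linear position. The toy illustration below uses a made-up numbering list. The labels are invented for the example and are not AntPack output, and `map_position` simply mirrors the logic of `map_position_to_imgt`.

```python
from typing import List, Optional

def map_position(linear_pos: int, numbering: Optional[List[str]]) -> Optional[str]:
    """Translate a 1-indexed sequence position into its numbering label."""
    if numbering is None:
        return None
    idx = linear_pos - 1
    if idx < 0 or idx >= len(numbering):
        return None
    label = numbering[idx]
    # '-' or an empty string marks a residue that was not numbered.
    return None if label in ("-", "") else str(label)

toy_sequence = "QVNAS"
toy_numbering = ["-", "110", "111", "111A", "112"]  # hypothetical labels

for pos, residue in enumerate(toy_sequence, start=1):
    print(pos, residue, map_position(pos, toy_numbering))
```

If the list were shorter or longer than the sequence, every downstream region and risk assignment would shift, so a length check before annotation is a cheap safeguard.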


In [10]:
# =============================================================================
# SITE ANNOTATION
# =============================================================================

def annotate_sites(sites, numbering):
    """Add IMGT positions and risk levels to detected sites."""
    for site in sites:
        imgt_pos = map_position_to_imgt(site.position, numbering)
        site.imgt_position = imgt_pos
        site.region = get_imgt_region(imgt_pos)
        site.is_vernier_zone = is_vernier_zone(imgt_pos)
        site.position_risk = classify_position_risk(imgt_pos, site.region)
    return sites

print("✓ Site annotation function defined")

✓ Site annotation function defined


In [11]:
# =============================================================================
# MAIN SCANNER
# =============================================================================

def scan_sequence(name: str, sequence: str, chain_type: str) -> ScanResult:
    """
    Run complete glycosylation liability scan on an antibody sequence.

    Args:
        name: Sequence identifier
        sequence: Amino acid sequence (variable domain)
        chain_type: 'H' for heavy, 'L' for light

    Returns:
        ScanResult with detected sequons, progenitors, and risk assessment
    """
    result = ScanResult(name=name, sequence=sequence, chain_type=chain_type)

    # IMGT numbering
    antpack_result = run_antpack_analysis(sequence, chain_type)
    numbering = antpack_result.get("numbering")
    result.v_gene = antpack_result.get("v_gene")

    # Detect and annotate sequons
    glyc_sites = scan_glycosylation_sites(sequence)
    glyc_sites = annotate_sites(glyc_sites, numbering)
    result.glycosylation_sites = glyc_sites
    result.total_sites = len(glyc_sites)

    # Detect and annotate progenitors
    prog_sites = scan_progenitor_sites(sequence)
    prog_sites = annotate_sites(prog_sites, numbering)
    result.progenitor_sites = prog_sites
    result.total_progenitors = len(prog_sites)

    # Determine highest risk
    all_risks = [s.position_risk for s in glyc_sites if s.position_risk]
    all_risks += [p.position_risk for p in prog_sites if p.position_risk]
    if all_risks:
        for risk in [RiskLevel.CRITICAL, RiskLevel.HIGH, RiskLevel.MEDIUM, RiskLevel.LOW]:
            if risk in all_risks:
                result.highest_risk = risk
                break

    return result

print("✓ Main scanner function defined")
print("\nReady to scan sequences!")

✓ Main scanner function defined

Ready to scan sequences!
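
The highest-risk selection in `scan_sequence` walks the levels from most to least severe. An equivalent formulation, sketched below, uses `max` with an explicit rank table. `RISK_RANK` and `highest_risk` are illustrative names, and the enum is redeclared only to keep the snippet self-contained.

```python
from enum import Enum
from typing import Iterable, Optional

class RiskLevel(Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"

# Explicit ordering, because Enum members are not comparable by default.
RISK_RANK = {
    RiskLevel.LOW: 0,
    RiskLevel.MEDIUM: 1,
    RiskLevel.HIGH: 2,
    RiskLevel.CRITICAL: 3,
}

def highest_risk(risks: Iterable[Optional[RiskLevel]]) -> Optional[RiskLevel]:
    """Return the most severe level present, or None if there are none."""
    present = [r for r in risks if r is not None]
    return max(present, key=RISK_RANK.__getitem__) if present else None

print(highest_risk([RiskLevel.MEDIUM, None, RiskLevel.HIGH]))
print(highest_risk([]))
```

Note that `scan_sequence` pools the risks of active sequons and progenitor sites. A chain with no active sequon can therefore still report a `highest_risk` of HIGH from a framework progenitor alone.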


## 5. Run the Scanner

Now we scan all 15 RFantibody designs and report findings explicitly.

In [12]:
# =============================================================================
# RUN SCANNER ON ALL 15 DESIGNS
# =============================================================================

print("=" * 80)
print("  GLYCOSYLATION LIABILITY SCAN: 15 RFantibody De Novo Designs")
print("  Scanner: AntibodyML v3.0")
print("  Target: SARS-CoV-2 RBD | Framework: Adalimumab")
print("=" * 80)

all_results = {}

# Counters
designs_with_active_sequons = 0
designs_with_progenitors = 0
total_active_sequons = 0
total_progenitors = 0

# Risk counters (active sequons only)
critical_sequons = 0
high_sequons = 0
medium_sequons = 0

for design_name, chains in RFANTIBODY_SEQUENCES.items():
    print(f"\n{'─' * 80}")
    print(f"  {design_name}")
    print(f"{'─' * 80}")

    design_results = {"H": None, "L": None}
    design_has_sequon = False
    design_has_progenitor = False

    for chain_type, sequence in chains.items():
        result = scan_sequence(f"{design_name}_{chain_type}", sequence, chain_type)
        design_results[chain_type] = result

        chain_label = "Heavy" if chain_type == "H" else "Light"

        # Report active sequons
        if result.glycosylation_sites:
            design_has_sequon = True
            for site in result.glycosylation_sites:
                risk_str = site.position_risk.value if site.position_risk else "?"
                vernier_flag = " [VERNIER]" if site.is_vernier_zone else ""

                print(f"  ⚠️  ACTIVE SEQUON | {chain_label} chain")
                print(f"      Motif: {site.motif} ({site.sequon_type})")
                print(f"      Position: IMGT {site.imgt_position} ({site.region}){vernier_flag}")
                print(f"      Context: ...{site.context}...")
                print(f"      Risk: {risk_str} | Occupancy score: {site.occupancy_score:.2f}")
                print()

                total_active_sequons += 1
                if site.position_risk == RiskLevel.CRITICAL:
                    critical_sequons += 1
                elif site.position_risk == RiskLevel.HIGH:
                    high_sequons += 1
                elif site.position_risk == RiskLevel.MEDIUM:
                    medium_sequons += 1

        # Report progenitors
        if result.progenitor_sites:
            design_has_progenitor = True
            for prog in result.progenitor_sites:
                risk_str = prog.position_risk.value if prog.position_risk else "?"
                vernier_flag = " [VERNIER]" if prog.is_vernier_zone else ""

                print(f"  🧬 PROGENITOR | {chain_label} chain")
                print(f"      Current: {prog.current_motif} → Potential: {prog.potential_sequon}")
                print(f"      Mutation: {prog.progenitor_type}")
                print(f"      Position: IMGT {prog.imgt_position} ({prog.region}){vernier_flag}")
                print(f"      Context: ...{prog.context}...")
                print(f"      Risk if actualized: {risk_str}")
                print()

                total_progenitors += 1

    if design_has_sequon:
        designs_with_active_sequons += 1
    if design_has_progenitor:
        designs_with_progenitors += 1

    if not design_has_sequon and not design_has_progenitor:
        print(f"  ✅ No liabilities detected")

    all_results[design_name] = design_results

  GLYCOSYLATION LIABILITY SCAN: 15 RFantibody De Novo Designs
  Scanner: AntibodyML v3.0
  Target: SARS-CoV-2 RBD | Framework: Adalimumab

────────────────────────────────────────────────────────────────────────────────
  ab_des_0
────────────────────────────────────────────────────────────────────────────────
  🧬 PROGENITOR | Heavy chain
      Current: DNS → Potential: NNS
      Mutation: D→N
      Position: IMGT 81 (FR3) [VERNIER]
      Context: ...FTISRDNSKNTLY...
      Risk if actualized: HIGH

  ⚠️  ACTIVE SEQUON | Light chain
      Motif: NSS (NXS)
      Position: IMGT 56 (CDR2)
      Context: ...KLLIYNSSTRAGG...
      Risk: HIGH | Occupancy score: 0.28



## 6. Summary Statistics

In [13]:
# =============================================================================
# SUMMARY
# =============================================================================

print("\n" + "=" * 80)
print("  SUMMARY: Glycosylation Liability Assessment")
print("=" * 80)

print(f"\n  INPUT")
print(f"  ─────")
print(f"  Designs scanned: 15")
print(f"  Pipeline: RFantibody (backbone) + ProteinMPNN (sequence)")
print(f"  Target: SARS-CoV-2 RBD")
print(f"  Framework: Adalimumab")

print(f"\n  ACTIVE SEQUONS (N-X-S/T): Direct Manufacturing Risk")
print(f"  ────────────────────────────────────────────────────")
print(f"  Designs with ≥1 active sequon: {designs_with_active_sequons}/15 ({designs_with_active_sequons/15*100:.0f}%)")
print(f"  Total active sequons: {total_active_sequons}")
print(f"")
print(f"  Risk stratification:")
print(f"    CRITICAL (CDR3):        {critical_sequons}")
print(f"    HIGH (CDR1/2/Vernier):  {high_sequons}")
print(f"    MEDIUM (FR2/FR3):       {medium_sequons}")

print(f"\n  PROGENITOR SITES: Latent Liabilities (one mutation away)")
print(f"  ─────────────────────────────────────────────────────────")
print(f"  Designs with ≥1 progenitor: {designs_with_progenitors}/15 ({designs_with_progenitors/15*100:.0f}%)")
print(f"  Total progenitors: {total_progenitors}")
print(f"")
print(f"  NOTE: Progenitors are NOT direct manufacturing risks for fixed")
print(f"  therapeutic sequences. They indicate evolutionary potential.")

print(f"\n  COMBINED")
print(f"  ────────")
# Count designs with at least one active sequon or progenitor on either chain.
designs_with_any = 0
for design_name in RFANTIBODY_SEQUENCES:
    has_any = False
    for chain_type in ['H', 'L']:
        r = all_results[design_name][chain_type]
        if r.total_sites > 0 or r.total_progenitors > 0:
            has_any = True
    if has_any:
        designs_with_any += 1

print(f"  Designs with ANY liability: {designs_with_any}/15 ({designs_with_any/15*100:.0f}%)")
print(f"  Total liabilities (sequons + progenitors): {total_active_sequons + total_progenitors}")


  SUMMARY: Glycosylation Liability Assessment

  INPUT
  ─────
  Designs scanned: 15
  Pipeline: RFantibody (backbone) + ProteinMPNN (sequence)
  Target: SARS-CoV-2 RBD
  Framework: Adalimumab

  ACTIVE SEQUONS (N-X-S/T): Direct Manufacturing Risk
  ────────────────────────────────────────────────────
  Designs with ≥1 active sequon: 7/15 (47%)
  Total active sequons: 10

  Risk stratification:
    CRITICAL (CDR3):        3
    HIGH (CDR1/2/Vernier):  6
    MEDIUM (FR2/FR3):       1

  PROGENITOR SITES: Latent Liabilities (one mutation away)
  ─────────────────────────────────────────────────────────
  Designs with ≥1 progenitor: 15/15 (100%)
  Total progenitors: 38

  NOTE: Progenitors are NOT direct manufacturing risks for fixed
  therapeutic sequences. They indicate evolutionary potential.

## 7. Key Findings

### 7.1 Active Sequons: Direct Manufacturing Risk

Active N-X-S/T sequons in the designed antibodies represent **immediate** glycosylation liabilities. If these sequences are expressed in CHO cells, the oligosaccharyltransferase (OST) complex may attach N-glycans at these sites.

**Implications:**
- Product heterogeneity (glycoform mixtures)
- Potential reduction in titer
- Binding interference (if in CDR)
- Immunogenicity concerns (non-human glycoforms)

### 7.2 Progenitor Sites: Evolutionary Potential

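Progenitor sites (D-X-S/T, N-X-A, N-P-S/T) are **one mutation away** from becoming active sequons: D-X-S/T needs D→N at the first position, N-X-A needs A→S or A→T at the third, and N-P-S/T needs the proline at X replaced by any other residue.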

**For therapeutic manufacturing:** These are NOT direct risks. The sequence is fixed; no somatic hypermutation occurs in CHO cells.

**When progenitors matter:**
- If using the design as a starting point for directed evolution
- If the antibody will undergo further affinity maturation
- For understanding the evolutionary "neighborhood" of the sequence
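
A minimal sketch of how the three progenitor patterns could be detected, together with the substitution that would activate each one. The names `PROGENITOR_PATTERNS` and `find_progenitors` are hypothetical, and excluding proline at X for D-X-S/T and N-X-A is an assumption made here (those cases would need two mutations); the scanner itself may treat them differently.

```python
import re

# Pattern name -> (regex, single substitution that would yield an active N-X-S/T).
# Assumption: X != P for D-X-S/T and N-X-A, since D-P-S/T or N-P-A would need
# two mutations to become an active sequon.
PROGENITOR_PATTERNS = {
    "D-X-S/T": (re.compile(r"(?=(D[^P][ST]))"), "D->N at position 1"),
    "N-X-A": (re.compile(r"(?=(N[^P]A))"), "A->S/T at position 3"),
    "N-P-S/T": (re.compile(r"(?=(NP[ST]))"), "P->non-P at position 2"),
}


def find_progenitors(sequence: str) -> list[tuple[int, str, str, str]]:
    """Return (1-based position, tripeptide, pattern name, activating change)."""
    hits = []
    for name, (pattern, change) in PROGENITOR_PATTERNS.items():
        for match in pattern.finditer(sequence.upper()):
            hits.append((match.start() + 1, match.group(1), name, change))
    return sorted(hits)


# Hypothetical fragment for illustration; not one of the 15 designs.
for hit in find_progenitors("QDNSKNTAYNPSL"):
    print(hit)
```

Positions are linear sequence indices; mapping them to IMGT numbers would require a numbering tool such as the one the scanner uses.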

### 7.3 Recurrent Framework Patterns

We observe that **all 15 designs** share certain progenitor sites in the framework regions. This is expected because:

1. The adalimumab framework was held constant
2. ProteinMPNN redesigned only the CDR loops
3. Framework progenitors (e.g., DNS at IMGT 81, DFT at IMGT 86) are inherited from the scaffold

**Implication:** Framework selection matters. Different scaffold choices would have different progenitor profiles.

## 8. Interpretation Guidelines

### What This Analysis Shows

✅ RFantibody + ProteinMPNN can generate antibody sequences with glycosylation liabilities  
✅ These tools have no built-in awareness of N-X-S/T motifs  
✅ Post-design liability screening is necessary  

### What This Analysis Does NOT Show

❌ Whether the detected sites will actually be glycosylated (requires mass spec)  
❌ Whether glycosylation affects binding or stability (requires functional assays)  
❌ Whether these results generalize to other targets/frameworks (N=15, single system)  

### Recommended Workflow

1. **Generate designs** with RFantibody/ProteinMPNN (or similar tools)
2. **Screen for liabilities** using glycosylation scanner
3. **Prioritize clean designs** or flag sites for mutagenesis
4. **Validate experimentally** with mass spectrometry and binding assays
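
Step 3 of the workflow could be sketched as a simple triage over per-design scan results. The `DesignReport` class, the `triage` function, and the design names below are hypothetical; only the tier names (CRITICAL, HIGH, MEDIUM, LOW) come from the scanner's risk stratification. Flagged designs are ordered least severe first, on the assumption that those are the easiest to rescue by mutagenesis.

```python
from dataclasses import dataclass, field

# Severity ranks for the scanner's tiers (higher = worse).
TIER_RANK = {"CRITICAL": 3, "HIGH": 2, "MEDIUM": 1, "LOW": 0}


@dataclass
class DesignReport:
    """Hypothetical container: one entry in sequon_tiers per active sequon."""
    name: str
    sequon_tiers: list[str] = field(default_factory=list)

    @property
    def worst(self) -> int:
        # -1 means the design has no active sequons at all.
        return max((TIER_RANK[t] for t in self.sequon_tiers), default=-1)


def triage(reports: list[DesignReport]) -> tuple[list[str], list[str]]:
    """Split designs into clean ones and flagged ones (least severe first)."""
    clean = [r.name for r in reports if not r.sequon_tiers]
    flagged = sorted(
        (r for r in reports if r.sequon_tiers),
        key=lambda r: (r.worst, len(r.sequon_tiers)),
    )
    return clean, [r.name for r in flagged]


# Hypothetical scan results for illustration only.
reports = [
    DesignReport("design_a", ["CRITICAL"]),
    DesignReport("design_b"),
    DesignReport("design_c", ["MEDIUM"]),
    DesignReport("design_d", ["HIGH", "HIGH"]),
]
clean, flagged = triage(reports)
print("Clean:", clean)
print("Flag for mutagenesis:", flagged)
```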

## 9. References

### Antibody Design Tools

1. **Bennett NR, et al.** (2025) Atomically accurate de novo design of antibodies with RFdiffusion. *Nature*. https://doi.org/10.1038/s41586-025-08536-w

2. **Dauparas J, et al.** (2022) Robust deep learning–based protein sequence design using ProteinMPNN. *Science* 378:49-56.

### Glycosylation Biology

3. **van de Bovenkamp FS, et al.** (2018) Adaptive antibody diversification through N-linked glycosylation of the immunoglobulin variable region. *PNAS* 115:1901-1906. PMID: 29432145

4. **Shakin-Eshleman SH, et al.** (1996) The amino acid at the X position of an Asn-X-Ser sequon is an important determinant of N-linked core-glycosylation efficiency. *JBC* 271:6363-6366. PMID: 8626433

### Vernier Zone

5. **Tramontano A, et al.** (1990) Framework residue 71 is a major determinant of the position and conformation of the second hypervariable region in the VH domains of immunoglobulins. *J Mol Biol* 215:175-182.

---

**AntibodyML Consulting LLC**  
*Bridging computational design and manufacturing reality*

In [14]:
print("\n" + "=" * 80)
print("  TECHNICAL DETAILS")
print("=" * 80)
print(f"\n  Scanner version: 3.0")
print(f"  Numbering scheme: IMGT (via AntPack 0.3.8.6)")
print(f"  Species context: Human")
print(f"\n  Active sequon pattern: N-X-S/T where X ≠ P")
print(f"  Progenitor patterns: D-X-S/T, N-X-A, N-P-S/T")
print(f"\n  Risk stratification:")
print(f"    CRITICAL: CDR3 (primary specificity determinant)")
print(f"    HIGH: CDR1, CDR2, Vernier zone (IMGT 75-88)")
print(f"    MEDIUM: FR2, FR3 (non-Vernier)")
print(f"    LOW: FR1, FR4")
print(f"\n  Occupancy score = X_efficiency × Sequon_type_multiplier")
print(f"    NXT multiplier: 1.0")
print(f"    NXS multiplier: 0.33 (~3x less efficient)")


  TECHNICAL DETAILS

  Scanner version: 3.0
  Numbering scheme: IMGT (via AntPack 0.3.8.6)
  Species context: Human

  Active sequon pattern: N-X-S/T where X ≠ P
  Progenitor patterns: D-X-S/T, N-X-A, N-P-S/T

  Risk stratification:
    CRITICAL: CDR3 (primary specificity determinant)
    HIGH: CDR1, CDR2, Vernier zone (IMGT 75-88)
    MEDIUM: FR2, FR3 (non-Vernier)
    LOW: FR1, FR4

  Occupancy score = X_efficiency × Sequon_type_multiplier
    NXT multiplier: 1.0
    NXS multiplier: 0.33 (~3x less efficient)
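
The tier rules and the occupancy formula listed above can be sketched as two small functions. This is an illustration rather than the scanner's implementation: the function names are hypothetical, the region boundaries are the standard IMGT V-domain delimitations (assumed to match what the scanner uses), the Vernier range 75-88 is taken from the output above, and `x_efficiency` is passed in by the caller because the per-residue efficiency table is not shown here.

```python
# Multipliers as printed above: N-X-T = 1.0, N-X-S = 0.33.
SEQUON_MULTIPLIER = {"T": 1.0, "S": 0.33}


def occupancy_score(x_efficiency: float, sequon: str) -> float:
    """Occupancy score = X_efficiency * sequon-type multiplier."""
    sequon = sequon.upper()
    valid = (
        len(sequon) == 3
        and sequon[0] == "N"
        and sequon[1] != "P"
        and sequon[2] in SEQUON_MULTIPLIER
    )
    if not valid:
        raise ValueError(f"not an active N-X-S/T sequon: {sequon!r}")
    return x_efficiency * SEQUON_MULTIPLIER[sequon[2]]


def risk_tier(imgt_position: int) -> str:
    """Map an IMGT position to a risk tier (assumed standard IMGT regions)."""
    p = imgt_position
    if 105 <= p <= 117:                                    # CDR3
        return "CRITICAL"
    if 27 <= p <= 38 or 56 <= p <= 65 or 75 <= p <= 88:    # CDR1, CDR2, Vernier
        return "HIGH"
    if 39 <= p <= 55 or 66 <= p <= 104:                    # FR2, FR3 (non-Vernier)
        return "MEDIUM"
    if 1 <= p <= 26 or 118 <= p <= 128:                    # FR1, FR4
        return "LOW"
    raise ValueError(f"position {p} is outside IMGT V-domain numbering 1-128")


# 0.9 is a made-up X-position efficiency, used only to show the arithmetic.
print(occupancy_score(0.9, "NGT"), risk_tier(110))
print(round(occupancy_score(0.9, "NGS"), 3), risk_tier(81))
```

The Vernier check runs before the general FR3 check, so positions 75-88 are reported as HIGH rather than MEDIUM.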
