<a href="https://colab.research.google.com/github/christophergaughan/progenitor-glycosylation-scanner/blob/main/Enhanced_Fab_Glycosylation_Scanner_v3_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Enhanced Fab Glycosylation Scanner v3.0
## Progenitor Site Detection, Occupancy Risk Scoring, and Vernier Zone Analysis

**Version**: 3.0  
**Author**: AntibodyML Consulting LLC  
**Date**: December 2025


## The Problem

We are designing antibodies *in silico*. Tools like RFdiffusion and AlphaFold generate accurate 3D protein structures, but they are essentially **blind to post-translational modifications (PTMs)**.

One critical PTM is **N-linked glycosylation** in the Fab region. Unlike the conserved Fc glycan at Asn297, Fab glycosylation sites:
- Are **not predicted** by current ML pipelines
- Can **disrupt antigen binding** (5 of 7 sites reduced binding in adalimumab; van de Bovenkamp 2018)
- Can cause **severe immunogenicity** (cetuximab's Fab glycan carries the α-gal epitope responsible for anaphylaxis in tick-bite sensitized patients)

Worse: natural antibodies undergo **selection** in germinal centers, where dysfunctional glycoforms get culled. *In silico* designed antibodies skip this filter entirely.

## What This Tool Does

This scanner identifies **two classes of liability** that RFdiffusion cannot see:

| Liability Type | Description | Risk |
|----------------|-------------|------|
| **Active sites** | N-X-S/T sequons already present | Immediate glycosylation risk |
| **Progenitor sites** | Positions one mutation away from N-X-S/T | Latent risk during affinity maturation |

For each site, we provide:
- **IMGT position** and region (CDR vs Framework vs Vernier zone)
- **Occupancy probability** based on sequon type (NXT vs NXS) and X-residue efficiency
- **Conformational risk** flagging for Vernier zone sites with allosteric leverage

## Data Sources

Primary: **van de Bovenkamp FS, et al.** (2018) Adaptive antibody diversification through N-linked glycosylation of the immunoglobulin variable region. *PNAS* 115:1901-1906. [PMID: 29432145](https://pubmed.ncbi.nlm.nih.gov/29432145/)

Efficiency scoring: **Shakin-Eshleman SH, et al.** (1996) The amino acid at the X position of an Asn-X-Ser sequon is an important determinant of N-linked core-glycosylation efficiency. *JBC* 271:6363-6366. [PMID: 8626433](https://pubmed.ncbi.nlm.nih.gov/8626433/)


---

### What's New in v3.0

This version integrates findings from **van de Bovenkamp et al. (2018) PNAS** to move beyond simple sequon detection toward **risk-stratified liability assessment**:

| Feature | v2.0 | v3.0 |
|---------|------|------|
| N-X-S/T sequon detection | ✓ | ✓ |
| IMGT numbering | ✓ | ✓ |
| Region classification | ✓ | ✓ |
| **Progenitor site detection** | ✗ | ✓ |
| **X-position efficiency scoring** | ✗ | ✓ |
| **NXT vs NXS differentiation** | ✗ | ✓ |
| **Vernier zone flagging** | ✗ | ✓ |
| **IGHV family risk weighting** | ✗ | ✓ |

---

### Key Concepts from van de Bovenkamp et al. (2018)

#### 1. Progenitor Sites

A **progenitor glycosylation site** is a germline position that is **one nucleotide mutation away** from becoming an N-X-S/T sequon. During somatic hypermutation (SHM), these sites can be "actualized" into functional glycosylation sites.

**Progenitor patterns:**
- `D-X-S/T` → `N-X-S/T` (Asp→Asn: GAC→AAC or GAT→AAT)
- `N-X-A` → `N-X-S` (Ala→Ser: GCN→TCN)
- `N-P-S/T` → `N-X-S/T` (Pro→any: removes the blocking residue)
- `B-X-S/T` → `N-X-S/T` (where B = any residue whose codon is one nucleotide away from an Asn codon, e.g. Asp, Ser, Thr, Lys, His, Tyr, Ile)

**Key finding:** 79-89% of observed Fab glycosylation sites in human antibodies arose from progenitor sites (van de Bovenkamp Table 1).
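
The codon arithmetic behind these patterns can be checked directly. The following is a minimal sketch and not part of the scanner: it builds the standard genetic code and lists which amino acids sit one nucleotide substitution away from a target residue. The helper name `one_step_sources` is hypothetical.

```python
from itertools import product
from typing import Set

BASES = "TCAG"
# Standard genetic code in TCAG order (first base varies slowest); '*' marks stop codons
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def one_step_sources(target: str) -> Set[str]:
    """Amino acids with at least one codon a single substitution away from a `target` codon."""
    sources = set()
    for codon, aa in CODON_TABLE.items():
        if aa in (target, "*"):
            continue
        for i in range(3):
            for base in BASES:
                mutated = codon[:i] + base + codon[i + 1:]
                if CODON_TABLE[mutated] == target:
                    sources.add(aa)
    return sources


print(sorted(one_step_sources("N")))  # residues that can become Asn in one step
print("A" in one_step_sources("S"))   # Ala -> Ser (GCN -> TCN)
print("Q" in one_step_sources("N"))   # Gln -> Asn
```

Under the standard code this reports Asp, His, Ile, Lys, Ser, Thr, and Tyr as one-step sources of Asn. Gln is not among them, because Gln→Asn needs two substitutions.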

#### 2. Glycosylation Efficiency Cascade

Not all sequons are equally likely to be glycosylated:

```
Level 1: Progenitor site exists (germline potential)
    ↓ SHM
Level 2: Sequon actualized (N-X-S/T present in mature sequence)
    ↓ OST recognition
Level 3: Glycan occupancy (ER processing)
```

**Efficiency determinants (Shakin-Eshleman 1996):**
- NXT glycosylates more efficiently than NXS (~3x)
- X-position amino acid matters: Pro blocks completely; Trp/Asp/Glu/Leu are inefficient; Phe/Gly/Ile/Ser/Thr/Val are efficient

#### 3. The Vernier Zone and Conformational Leverage

The **Vernier zone** comprises framework residues that structurally underlie and support the CDR loops. The **DE loop** (IMGT positions ~75-88, also called H4/L4 or "CDR4" in nanobody literature) is particularly critical.

**Why Vernier zone glycosylation is high-risk:**
- Single residue changes in the Vernier zone can shift CDR conformational ensembles
- Glycosylation adds ~2.5 kDa of mass with significant steric bulk
- Effects propagate allosterically: a glycan at position 77 can affect CDR-H1, H2, AND H3 conformations

**Important caveat:** The exact Vernier residues vary by antibody structure. We flag the DE loop region (IMGT 75-88) as HIGH conformational risk, but structural analysis is needed to confirm specific residue involvement.

---

### Primary Data Source

**van de Bovenkamp FS, et al.** Adaptive antibody diversification through N-linked glycosylation of the immunoglobulin variable region. *Proc Natl Acad Sci USA*. 2018;115(8):1901-1906. doi:10.1073/pnas.1711720115. PMID: 29432145

**Supplementary data used:**
- SI Appendix Table S1: Progenitor site positions across human germline V-genes
- SI Appendix Fig S5: X-position amino acid frequency analysis

---

## 1. Installation

We use **AntPack** for IMGT numbering and V/J gene assignment. AntPack is Colab-friendly with minimal dependencies.

In [1]:
# Install AntPack - Colab friendly, numpy only dependency
!pip install antpack==0.3.8.6 -q

print("AntPack installed successfully!")

AntPack installed successfully!


## 2. Imports and Availability Check

In [2]:
import re
from typing import List, Dict, Tuple, Optional, Set
from dataclasses import dataclass, field
from enum import Enum

# AntPack imports
try:
    from antpack import SingleChainAnnotator, VJGeneTool
    ANTPACK_AVAILABLE = True
    print("✓ AntPack imported successfully")
except ImportError as e:
    ANTPACK_AVAILABLE = False
    print(f"✗ AntPack import failed: {e}")

✓ AntPack imported successfully


## 3. Constants and Lookup Tables

### 3.1 X-Position Glycosylation Efficiency

The amino acid at the X position in N-X-S/T sequons significantly affects glycosylation efficiency. This lookup table is derived from:

1. **Shakin-Eshleman et al. (1996) JBC 271:6363-6366** - Experimental mutagenesis of rabies virus glycoprotein
2. **PMC meta-analysis of glycosylated sequons** - Statistical enrichment across viral, archaeal, and eukaryotic glycoproteins

**Key findings:**
- **Pro (P)**: Completely blocks glycosylation (conformational constraint prevents OST access)
- **Trp, Asp, Glu, Leu**: Inefficient glycosylation
- **Phe, Gly, Ile, Ser, Thr, Val**: Consistently over-represented in glycosylated sequons

**van de Bovenkamp observation:** Leucine is the most frequent X-position residue in Fab sequons, but predisposes to LOW efficiency. The next four most frequent residues predispose to HIGH efficiency, creating a distribution architecture where most sequons have low penetrance (optionality) while some have high penetrance (commitment).

In [3]:
# X-position glycosylation efficiency lookup table
# Values represent relative probability of glycan occupancy (0.0 - 1.0)
# Source: Shakin-Eshleman 1996 JBC + PMC meta-analysis

X_POSITION_EFFICIENCY = {
    # BLOCKED - Pro prevents OST access
    'P': 0.00,

    # LOW EFFICIENCY - Experimentally confirmed inefficient
    'W': 0.15,  # Trp - bulky indole ring
    'D': 0.20,  # Asp - negative charge
    'E': 0.20,  # Glu - negative charge
    'L': 0.25,  # Leu - hydrophobic bulk, MOST COMMON in Fab sequons

    # MEDIUM EFFICIENCY - Charged or neutral
    'K': 0.40,  # Lys - positive charge
    'R': 0.40,  # Arg - positive charge
    'H': 0.45,  # His - titratable
    'A': 0.50,  # Ala - small, neutral
    'M': 0.50,  # Met - neutral
    'N': 0.55,  # Asn - polar
    'Q': 0.55,  # Gln - polar
    'C': 0.50,  # Cys - potential disulfide

    # HIGH EFFICIENCY - Consistently over-represented in glycosylated sequons
    'Y': 0.70,  # Tyr - aromatic but hydroxyl
    'F': 0.80,  # Phe - preferred despite bulk
    'V': 0.80,  # Val - small hydrophobic, preferred
    'I': 0.80,  # Ile - preferred
    'G': 0.85,  # Gly - minimal steric hindrance
    'S': 0.85,  # Ser - polar, compact
    'T': 0.90,  # Thr - highly favorable
}

def get_x_efficiency(x_residue: str) -> float:
    """Get glycosylation efficiency for X-position residue."""
    return X_POSITION_EFFICIENCY.get(x_residue.upper(), 0.50)  # Default to medium

print("✓ X-position efficiency table loaded")
print(f"  Blocked: P ({X_POSITION_EFFICIENCY['P']})")
print(f"  Low: W,D,E,L ({X_POSITION_EFFICIENCY['L']})")
print(f"  High: G,S,T,V,I,F ({X_POSITION_EFFICIENCY['T']})")

✓ X-position efficiency table loaded
  Blocked: P (0.0)
  Low: W,D,E,L (0.25)
  High: G,S,T,V,I,F (0.9)


### 3.2 NXT vs NXS Efficiency Multiplier

NXT sequons glycosylate approximately **3x more efficiently** than NXS sequons, despite NXS being 3x more common in Fab regions.

**van de Bovenkamp interpretation:** The system is enriched for low-efficiency sequons (NXS), creating optionality and heterogeneity rather than saturation.

In [4]:
# Sequon type efficiency multipliers
# NXT is ~3x more efficient than NXS (Kasturi et al. 1995, van de Bovenkamp 2018)

SEQUON_TYPE_MULTIPLIER = {
    'NXT': 1.0,   # Reference (higher efficiency)
    'NXS': 0.33,  # ~3x less efficient than NXT
}

def get_sequon_type(motif: str) -> str:
    """Classify sequon as NXT or NXS based on third residue."""
    if len(motif) >= 3:
        third = motif[2].upper()
        if third == 'T':
            return 'NXT'
        elif third == 'S':
            return 'NXS'
    return 'Unknown'

print("✓ Sequon type multipliers loaded")
print(f"  NXT: {SEQUON_TYPE_MULTIPLIER['NXT']} (reference)")
print(f"  NXS: {SEQUON_TYPE_MULTIPLIER['NXS']} (~3x less efficient)")

✓ Sequon type multipliers loaded
  NXT: 1.0 (reference)
  NXS: 0.33 (~3x less efficient)
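
As a worked example of how the two factors combine, the sketch below multiplies an X-position efficiency by the sequon-type multiplier, which is the same product the scanner later stores as its occupancy score. The values are copied from the tables above for three residues only. `demo_occupancy` is a hypothetical helper, and the result is a heuristic ranking rather than a measured occupancy.

```python
# Illustrative subset of the lookup tables defined above
DEMO_X_EFFICIENCY = {"L": 0.25, "G": 0.85, "T": 0.90}
DEMO_TYPE_MULTIPLIER = {"T": 1.0, "S": 0.33}


def demo_occupancy(motif: str) -> float:
    """Heuristic score for an N-X-S/T tripeptide: X efficiency times NXT/NXS multiplier."""
    _, x_residue, third = motif.upper()
    return DEMO_X_EFFICIENCY[x_residue] * DEMO_TYPE_MULTIPLIER[third]


for motif in ("NGT", "NGS", "NLT", "NLS"):
    print(f"{motif}: {demo_occupancy(motif):.2f}")
```

With these values `NGT` scores highest and `NLS` lowest, which matches the observation that a Leu at X combined with Ser at the third position gives the least efficient sequon.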


### 3.3 IMGT Region Definitions and Risk Classification

Different regions carry different functional risks for glycosylation:

| Region | IMGT Positions | Glycosylation Risk | Rationale |
|--------|----------------|--------------------|-----------|
| CDR1 | 27-38 | HIGH | Direct antigen contact |
| CDR2 | 56-65 | HIGH | Direct antigen contact |
| CDR3 | 105-117 | CRITICAL | Most variable, primary specificity determinant |
| FR1 | 1-26 | LOW | Structural scaffold, distal from paratope |
| FR2 | 39-55 | MEDIUM | VH/VL interface |
| FR3 (DE loop) | 66-104 (esp. 75-88) | HIGH | **Vernier zone** - conformational leverage |
| FR4 | 118-128 | LOW | Structural scaffold |

### The Vernier Zone: A Special Case

The **DE loop** (IMGT ~75-88) within FR3 is part of the Vernier zone: framework residues that structurally support and influence CDR conformations.

**van de Bovenkamp SI data shows progenitor site clustering at:**
- Position 77: 278 progenitor configurations across germline alleles
- Position 81: 256 progenitor configurations
- Position 82: 124 progenitor configurations
- Position 84: 137 progenitor configurations

**Why this matters:** A glycan in the Vernier zone doesn't just sterically block a local contact; it can shift the entire CDR conformational ensemble through allosteric effects. This is qualitatively different from CDR glycosylation, which primarily causes direct steric interference.

**Caveat:** The exact Vernier residues vary by antibody structure. Our scanner flags IMGT 75-88 as elevated conformational risk, but definitive assessment requires structural analysis or molecular dynamics.

In [5]:
# IMGT region boundaries
IMGT_REGIONS = {
    'FR1':  (1, 26),
    'CDR1': (27, 38),
    'FR2':  (39, 55),
    'CDR2': (56, 65),
    'FR3':  (66, 104),
    'CDR3': (105, 117),
    'FR4':  (118, 128),
}

# DE loop / Vernier zone approximate boundaries within FR3
# Source: van de Bovenkamp 2018 SI Table S1 clustering analysis
VERNIER_ZONE = (75, 88)

# Progenitor hot spots from van de Bovenkamp SI Table S1
# These positions have the highest number of progenitor configurations across germlines
PROGENITOR_HOTSPOTS = {
    77: 278,  # Highest clustering
    81: 256,
    59: 187,  # CDR2 region
    84: 137,
    82: 124,
}

class RiskLevel(Enum):
    """Risk levels for glycosylation site functional impact."""
    CRITICAL = "CRITICAL"  # CDR3, primary specificity
    HIGH = "HIGH"          # CDR1/2, Vernier zone
    MEDIUM = "MEDIUM"      # FR2, other FR3
    LOW = "LOW"            # FR1, FR4


def get_imgt_region(imgt_pos: str) -> str:
    """
    Classify IMGT position into antibody region.

    Args:
        imgt_pos: IMGT position (may include insertion letters like "111A")

    Returns:
        Region name (e.g., "CDR3", "FR3")
    """
    if imgt_pos is None or imgt_pos == '-' or imgt_pos == '':
        return "Unknown"

    try:
        num = int(''.join(c for c in str(imgt_pos) if c.isdigit()))
    except ValueError:
        return "Unknown"

    for region, (start, end) in IMGT_REGIONS.items():
        if start <= num <= end:
            return region

    if num > 128:
        return "C-region"
    return "Unknown"


def is_vernier_zone(imgt_pos: str) -> bool:
    """
    Check if position is within the Vernier zone (DE loop).

    Args:
        imgt_pos: IMGT position string

    Returns:
        True if position is in Vernier zone (IMGT 75-88)
    """
    if imgt_pos is None or imgt_pos == '-':
        return False

    try:
        num = int(''.join(c for c in str(imgt_pos) if c.isdigit()))
        return VERNIER_ZONE[0] <= num <= VERNIER_ZONE[1]
    except ValueError:
        return False


def is_progenitor_hotspot(imgt_pos: str) -> Tuple[bool, Optional[int]]:
    """
    Check if position is a known progenitor clustering hotspot.

    Args:
        imgt_pos: IMGT position string

    Returns:
        Tuple of (is_hotspot, count_of_progenitor_configurations)
    """
    if imgt_pos is None or imgt_pos == '-':
        return False, None

    try:
        num = int(''.join(c for c in str(imgt_pos) if c.isdigit()))
        if num in PROGENITOR_HOTSPOTS:
            return True, PROGENITOR_HOTSPOTS[num]
        return False, None
    except ValueError:
        return False, None


def classify_position_risk(imgt_pos: str, region: str) -> RiskLevel:
    """
    Classify functional risk level based on position and region.

    Args:
        imgt_pos: IMGT position string
        region: Antibody region name

    Returns:
        RiskLevel enum value
    """
    # CDR3 is always critical
    if region == "CDR3":
        return RiskLevel.CRITICAL

    # CDR1/2 are high risk
    if region in ["CDR1", "CDR2"]:
        return RiskLevel.HIGH

    # Vernier zone within FR3 is high risk
    if region == "FR3" and is_vernier_zone(imgt_pos):
        return RiskLevel.HIGH

    # Other FR3 and FR2 are medium
    if region in ["FR2", "FR3"]:
        return RiskLevel.MEDIUM

    # FR1, FR4 (and any unclassified region) fall through to low risk
    return RiskLevel.LOW


print("✓ IMGT region definitions and risk classification loaded")
print(f"  Vernier zone: IMGT {VERNIER_ZONE[0]}-{VERNIER_ZONE[1]}")
print(f"  Progenitor hotspots: {list(PROGENITOR_HOTSPOTS.keys())}")

✓ IMGT region definitions and risk classification loaded
  Vernier zone: IMGT 75-88
  Progenitor hotspots: [77, 81, 59, 84, 82]
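
IMGT labels can carry insertion letters (for example `111A`), so the numeric part has to be separated before a range lookup. The sketch below shows one way to do that with a regular expression and a region table mirroring `IMGT_REGIONS`. `split_imgt_label` and `demo_region` are hypothetical names, and this is an illustration rather than the notebook's implementation.

```python
import re
from typing import Optional, Tuple

DEMO_REGIONS = {
    "FR1": (1, 26), "CDR1": (27, 38), "FR2": (39, 55), "CDR2": (56, 65),
    "FR3": (66, 104), "CDR3": (105, 117), "FR4": (118, 128),
}
DEMO_VERNIER = (75, 88)  # DE-loop window used by this notebook


def split_imgt_label(label: Optional[str]) -> Optional[Tuple[int, str]]:
    """Split '111A' into (111, 'A'); return None for gaps or malformed labels."""
    match = re.fullmatch(r"(\d+)([A-Z]*)", str(label or "").strip())
    return (int(match.group(1)), match.group(2)) if match else None


def demo_region(label: Optional[str]) -> str:
    """Look up the region for a label, or 'Unknown' when it cannot be parsed."""
    parsed = split_imgt_label(label)
    if parsed is None:
        return "Unknown"
    number = parsed[0]
    for region, (start, end) in DEMO_REGIONS.items():
        if start <= number <= end:
            return region
    return "Unknown"


for label in ("77", "111A", "-", None):
    parsed = split_imgt_label(label)
    in_vernier = parsed is not None and DEMO_VERNIER[0] <= parsed[0] <= DEMO_VERNIER[1]
    print(label, demo_region(label), in_vernier)
```

Using `fullmatch` rejects labels with unexpected characters instead of silently concatenating whatever digits they contain.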


### 3.4 IGHV Family Risk Weighting

Different human germline V-gene families have different propensities for Fab glycosylation due to varying numbers of progenitor sites.

**From van de Bovenkamp Table 1:**

| IGHV Family | % Sequences with ≥1 Site | Relative Risk |
|-------------|--------------------------|---------------|
| IGHV4 | 19% | HIGH |
| IGHV1 | 9% | MEDIUM |
| IGHV3 | 5% | LOW |

**Practical implication:** If humanizing a mouse antibody onto an IGHV4 framework, be especially vigilant for progenitor sites that may be actualized during affinity maturation or that already exist in the acceptor framework.

In [6]:
# IGHV family glycosylation propensity
# Source: van de Bovenkamp 2018 PNAS Table 1

IGHV_FAMILY_RISK = {
    'IGHV4': {'prevalence': 0.19, 'risk': 'HIGH'},
    'IGHV1': {'prevalence': 0.09, 'risk': 'MEDIUM'},
    'IGHV3': {'prevalence': 0.05, 'risk': 'LOW'},
    'IGHV2': {'prevalence': 0.08, 'risk': 'MEDIUM'},  # Interpolated
    'IGHV5': {'prevalence': 0.07, 'risk': 'MEDIUM'},  # Interpolated
    'IGHV6': {'prevalence': 0.06, 'risk': 'LOW'},     # Interpolated
    'IGHV7': {'prevalence': 0.06, 'risk': 'LOW'},     # Interpolated
}

def get_ighv_family_risk(v_gene: Optional[str]) -> Optional[Dict]:
    """
    Get glycosylation risk information for IGHV family.

    Args:
        v_gene: V-gene assignment string (e.g., "IGHV4-59*01")

    Returns:
        Dict with prevalence and risk level, or None if not IGHV
    """
    if v_gene is None:
        return None

    # Extract family (e.g., "IGHV4" from "IGHV4-59*01")
    for family in IGHV_FAMILY_RISK.keys():
        if v_gene.startswith(family):
            return IGHV_FAMILY_RISK[family]

    return None

print("✓ IGHV family risk data loaded")
for family, data in list(IGHV_FAMILY_RISK.items())[:3]:
    print(f"  {family}: {data['prevalence']*100:.0f}% prevalence, {data['risk']} risk")

✓ IGHV family risk data loaded
  IGHV4: 19% prevalence, HIGH risk
  IGHV1: 9% prevalence, MEDIUM risk
  IGHV3: 5% prevalence, LOW risk
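
Gene names such as `IGHV4-59*01` encode the family before the first hyphen. A hedged sketch of extracting it with a regular expression follows. `extract_ighv_family` is a hypothetical helper, and it assumes the IMGT-style naming shown in the docstring above.

```python
import re
from typing import Optional


def extract_ighv_family(v_gene: Optional[str]) -> Optional[str]:
    """Return the family prefix (e.g. 'IGHV4'), or None for non-IGHV or missing input."""
    if not v_gene:
        return None
    match = re.match(r"IGHV\d+", v_gene.strip().upper())
    return match.group(0) if match else None


for gene in ("IGHV4-59*01", "IGHV1-69*01", "IGKV1-39*01", None):
    print(gene, "->", extract_ighv_family(gene))
```

The returned prefix can then be used as a dictionary key, so a light-chain gene or a missing assignment yields `None` rather than a partial match.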


## 4. Data Structures

We define dataclasses to hold detected sites with full annotation including the new risk scoring fields.

In [7]:
@dataclass
class GlycosylationSite:
    """Represents a detected N-glycosylation sequon with full annotation."""

    # Basic detection
    position: int                           # 1-indexed position in sequence
    motif: str                              # The N-X-S/T tripeptide
    context: str                            # Surrounding sequence context

    # IMGT annotation
    imgt_position: Optional[str] = None     # IMGT position (e.g., "77", "111A")
    region: Optional[str] = None            # FR1/CDR1/FR2/CDR2/FR3/CDR3/FR4

    # Risk scoring (new in v3.0)
    sequon_type: Optional[str] = None       # NXT or NXS
    x_residue: Optional[str] = None         # The X in N-X-S/T
    x_efficiency: Optional[float] = None    # Efficiency score for X residue
    sequon_multiplier: Optional[float] = None  # NXT vs NXS multiplier
    occupancy_score: Optional[float] = None # Combined occupancy probability

    # Position risk (new in v3.0)
    position_risk: Optional[RiskLevel] = None
    is_vernier_zone: bool = False
    is_progenitor_hotspot: bool = False
    hotspot_count: Optional[int] = None     # Number of germline progenitor configurations

    # Mechanism inference
    mechanism: Optional[str] = None         # Junctional/SHM/Germline


@dataclass
class ProgenitorSite:
    """Represents a latent progenitor site (one mutation from N-X-S/T)."""

    position: int                           # 1-indexed position in sequence
    current_motif: str                      # Current tripeptide (e.g., "DLS")
    progenitor_type: str                    # Type: D→N, A→S, P→X, etc.
    potential_sequon: str                   # What it would become (e.g., "NLS")
    context: str                            # Surrounding sequence

    # IMGT annotation
    imgt_position: Optional[str] = None
    region: Optional[str] = None

    # Risk assessment
    position_risk: Optional[RiskLevel] = None
    is_vernier_zone: bool = False
    predicted_efficiency: Optional[float] = None  # If actualized


@dataclass
class ScanResult:
    """Complete scan result for an antibody sequence."""

    # Input
    name: str
    sequence: str
    chain_type: str

    # AntPack results
    v_gene: Optional[str] = None
    j_gene: Optional[str] = None
    v_identity: Optional[float] = None
    ighv_family_risk: Optional[Dict] = None

    # Detected sites
    glycosylation_sites: List[GlycosylationSite] = field(default_factory=list)
    progenitor_sites: List[ProgenitorSite] = field(default_factory=list)

    # Summary
    total_sites: int = 0
    total_progenitors: int = 0
    highest_risk: Optional[RiskLevel] = None


print("✓ Data structures defined")

✓ Data structures defined
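
Because the results are plain dataclasses, they can be flattened for export with the standard library. The sketch below uses simplified stand-ins (`DemoSite` and `DemoRisk`, both hypothetical) to show `dataclasses.asdict` together with a conversion of Enum members to their string values, which `asdict` leaves untouched.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class DemoRisk(Enum):
    HIGH = "HIGH"
    LOW = "LOW"


@dataclass
class DemoSite:
    position: int
    motif: str
    position_risk: Optional[DemoRisk] = None


def site_to_record(site: DemoSite) -> dict:
    """Convert a dataclass instance to a JSON-friendly dict."""
    record = asdict(site)
    # Enum members are not JSON serializable, so replace them with their values
    return {k: (v.value if isinstance(v, Enum) else v) for k, v in record.items()}


record = site_to_record(DemoSite(position=13, motif="NLT", position_risk=DemoRisk.HIGH))
print(json.dumps(record))
```

The same approach should carry over to the full dataclasses, since their fields are strings, numbers, booleans, and `RiskLevel` members.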


## 5. Core Scanner Functions

### 5.1 N-X-S/T Sequon Detection

Scan for canonical N-linked glycosylation sequons where X ≠ Pro.

In [8]:
def scan_glycosylation_sites(sequence: str, context_window: int = 5) -> List[GlycosylationSite]:
    """
    Scan sequence for N-X-S/T motifs (N-linked glycosylation sequons).
    X cannot be proline (blocks glycosylation).

    Args:
        sequence: Amino acid sequence
        context_window: Residues on each side for context display

    Returns:
        List of GlycosylationSite objects with basic detection info
    """
    sites = []
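    # Zero-width lookahead so overlapping sequons (e.g. "NNTS" -> NNT and NTS) are all reported
    pattern = re.compile(r'(?=(N[^P][ST]))')

    for match in pattern.finditer(sequence):
        pos = match.start() + 1  # 1-indexed
        motif = match.group(1)
        x_residue = motif[1]

        # Extract context (the lookahead match is zero-width, so add the motif length explicitly)
        start = max(0, match.start() - context_window)
        end = min(len(sequence), match.start() + len(motif) + context_window)
        context = sequence[start:end]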

        # Calculate efficiency scores
        sequon_type = get_sequon_type(motif)
        x_eff = get_x_efficiency(x_residue)
        seq_mult = SEQUON_TYPE_MULTIPLIER.get(sequon_type, 0.5)
        occupancy = x_eff * seq_mult

        sites.append(GlycosylationSite(
            position=pos,
            motif=motif,
            context=context,
            sequon_type=sequon_type,
            x_residue=x_residue,
            x_efficiency=x_eff,
            sequon_multiplier=seq_mult,
            occupancy_score=occupancy,
        ))

    return sites


# Quick test
_test_sites = scan_glycosylation_sites("QVQLVQSGAEVKNLTPGSV")
print("✓ Sequon scanner defined")
if _test_sites:
    print(f"  Test: Found {_test_sites[0].motif} at position {_test_sites[0].position}")
    print(f"  Occupancy score: {_test_sites[0].occupancy_score:.2f}")

✓ Sequon scanner defined
  Test: Found NLT at position 13
  Occupancy score: 0.25
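
Adjacent sequons can overlap: `NNTS` contains both `NNT` and `NTS`. Because `re.finditer` returns non-overlapping matches, a consuming pattern reports only the first of the two, whereas a zero-width lookahead reports both. A minimal sketch on a toy sequence:

```python
import re

consuming = re.compile(r"N[^P][ST]")
lookahead = re.compile(r"(?=(N[^P][ST]))")

sequence = "AANNTSGG"  # contains NNT (1-indexed position 3) and NTS (position 4)

# (1-indexed position, motif) pairs from each strategy
consumed = [(m.start() + 1, m.group()) for m in consuming.finditer(sequence)]
overlapping = [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]

print(consumed)
print(overlapping)
```

With the lookahead form the motif has to be read from capture group 1, and the match itself has zero width, so any context slicing needs the motif length added explicitly.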


### 5.2 Progenitor Site Detection

Scan for latent glycosylation liabilities: positions that are **one mutation away** from becoming N-X-S/T sequons.

**Progenitor patterns detected:**

| Pattern | Mutation | Codons |
|---------|----------|--------|
| D-X-S/T → N-X-S/T | Asp→Asn | GAC→AAC, GAT→AAT |
| N-X-A → N-X-S | Ala→Ser | GCN→TCN |
| N-P-S/T → N-X-S/T | Pro→any | Removes block |
| Q-X-S/T → N-X-S/T | Gln→Asn | CAA/CAG→AAT/AAC (requires two nucleotide changes; listed for reference only, not scanned by `scan_progenitor_sites`) |

**Note:** We focus on D‚ÜíN as the primary progenitor because:
1. Single nucleotide change (GAC‚ÜíAAC)
2. Most common progenitor type in van de Bovenkamp data
3. Asp and Asn are chemically similar (deamidation also converts N‚ÜíD)

In [9]:
def scan_progenitor_sites(sequence: str, context_window: int = 5) -> List[ProgenitorSite]:
    """
    Scan for progenitor glycosylation sites (one mutation from N-X-S/T).

    Progenitor types:
    - D-X-S/T: Asp at position 1 (D→N single nucleotide change)
    - N-X-A: Ala at position 3 (A→S single nucleotide change)
    - N-P-S/T: Pro at position 2 (P→X removes block)

    Args:
        sequence: Amino acid sequence
        context_window: Residues on each side for context display

    Returns:
        List of ProgenitorSite objects
    """
    progenitors = []

    # Pattern 1: D-X-S/T (Asp→Asn at position 1)
    # X cannot be Pro (would still be blocked after D→N)
    # Lookahead keeps overlapping motifs (e.g. "DDSS" -> DDS and DSS)
    pattern_d = re.compile(r'(?=(D[^P][ST]))')
    for match in pattern_d.finditer(sequence):
        pos = match.start() + 1
        motif = match.group(1)
        potential = 'N' + motif[1:]

        start = max(0, match.start() - context_window)
        end = min(len(sequence), match.start() + len(motif) + context_window)
        context = sequence[start:end]

        # Calculate predicted efficiency if actualized
        x_residue = motif[1]
        x_eff = get_x_efficiency(x_residue)
        seq_type = get_sequon_type(potential)
        predicted_eff = x_eff * SEQUON_TYPE_MULTIPLIER.get(seq_type, 0.5)

        progenitors.append(ProgenitorSite(
            position=pos,
            current_motif=motif,
            progenitor_type="D→N",
            potential_sequon=potential,
            context=context,
            predicted_efficiency=predicted_eff,
        ))

    # Pattern 2: N-X-A (Ala→Ser at position 3)
    # Lookahead keeps overlapping motifs (e.g. "NNAA" -> NNA and NAA)
    pattern_a = re.compile(r'(?=(N[^P]A))')
    for match in pattern_a.finditer(sequence):
        pos = match.start() + 1
        motif = match.group(1)
        potential = motif[:2] + 'S'

        start = max(0, match.start() - context_window)
        end = min(len(sequence), match.start() + len(motif) + context_window)
        context = sequence[start:end]

        x_residue = motif[1]
        x_eff = get_x_efficiency(x_residue)
        predicted_eff = x_eff * SEQUON_TYPE_MULTIPLIER['NXS']

        progenitors.append(ProgenitorSite(
            position=pos,
            current_motif=motif,
            progenitor_type="A→S",
            potential_sequon=potential,
            context=context,
            predicted_efficiency=predicted_eff,
        ))

    # Pattern 3: N-P-S/T (Pro removal unblocks)
    pattern_p = re.compile(r'NP[ST]')
    for match in pattern_p.finditer(sequence):
        pos = match.start() + 1
        motif = match.group()
        # Could become N-X-S/T where X is any non-Pro
        potential = f"N-X-{motif[2]}"  # Generic representation

        start = max(0, match.start() - context_window)
        end = min(len(sequence), match.end() + context_window)
        context = sequence[start:end]

        # Efficiency depends on what replaces Pro - use medium estimate
        seq_type = 'NXT' if motif[2] == 'T' else 'NXS'
        predicted_eff = 0.5 * SEQUON_TYPE_MULTIPLIER.get(seq_type, 0.5)

        progenitors.append(ProgenitorSite(
            position=pos,
            current_motif=motif,
            progenitor_type="P→X (unblock)",
            potential_sequon=potential,
            context=context,
            predicted_efficiency=predicted_eff,
        ))

    return progenitors


# Quick test
_test_prog = scan_progenitor_sites("QVQLVQDGSTLNPSVKG")
print("✓ Progenitor scanner defined")
print(f"  Test sequence has {len(_test_prog)} progenitor site(s)")
for p in _test_prog:
    print(f"    {p.current_motif} at {p.position}: {p.progenitor_type} → {p.potential_sequon}")

✓ Progenitor scanner defined
  Test sequence has 2 progenitor site(s)
    DGS at 7: D→N → NGS
    NPS at 12: P→X (unblock) → N-X-S
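
For the N-P-S/T case the scanner uses a flat medium estimate because the replacement residue is unknown. One possible refinement, sketched here under the assumption that only single-nucleotide substitutions of a Pro codon (CCN) matter, is to enumerate the residues that such a substitution can produce. The names below are hypothetical and this is not part of the scanner.

```python
from itertools import product
from typing import Set

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def one_step_products(source: str) -> Set[str]:
    """Amino acids reachable from any `source` codon by one nucleotide substitution."""
    reachable = set()
    for codon, aa in CODON_TABLE.items():
        if aa != source:
            continue
        for i in range(3):
            for base in BASES:
                new_aa = CODON_TABLE[codon[:i] + base + codon[i + 1:]]
                if new_aa not in (source, "*"):
                    reachable.add(new_aa)
    return reachable


print(sorted(one_step_products("P")))  # candidate X residues after a single Pro substitution
```

Under the standard code the reachable set is Ala, His, Leu, Gln, Arg, Ser, and Thr, so a per-residue efficiency lookup over that set could replace the flat 0.5 if a finer estimate is wanted.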


## 6. AntPack Integration

Use AntPack for IMGT numbering and V/J gene assignment, then annotate detected sites with position-based risk assessment.

In [10]:
def run_antpack_analysis(sequence: str, chain_type: str) -> Dict:
    """
    Run full AntPack analysis: IMGT numbering + V/J gene assignment.

    Args:
        sequence: Amino acid sequence
        chain_type: 'H' for heavy, 'L' for light (kappa or lambda)

    Returns:
        Dict with numbering, V/J genes, identity, and any errors
    """
    if not ANTPACK_AVAILABLE:
        return {"error": "AntPack not available"}

    results = {
        "chain_type_detected": None,
        "percent_identity": None,
        "numbering": None,
        "v_gene": None,
        "j_gene": None,
        "v_identity": None,
        "j_identity": None,
        "error": None
    }

    try:
        chains = ['H'] if chain_type == 'H' else ['K', 'L']
        annotator = SingleChainAnnotator(chains=chains, scheme="imgt")

        result_tuple = annotator.analyze_seq(sequence)
        numbering, pct_id, detected_chain, err = result_tuple

        results["numbering"] = numbering
        results["percent_identity"] = pct_id * 100 if pct_id else None
        results["chain_type_detected"] = detected_chain
        results["error"] = err if err else None

        # V/J gene assignment
        if numbering is not None and not err:
            try:
                vj_tool = VJGeneTool()
                vj_result = vj_tool.assign_vj_genes(result_tuple, sequence, 'human')

                if vj_result:
                    results["v_gene"] = vj_result[0]
                    results["j_gene"] = vj_result[1]
                    results["v_identity"] = vj_result[2] * 100 if vj_result[2] else None
                    results["j_identity"] = vj_result[3] * 100 if vj_result[3] else None
            except Exception as e:
                results["vj_error"] = str(e)

    except Exception as e:
        results["error"] = str(e)

    return results


def map_position_to_imgt(linear_pos: int, numbering: List) -> Optional[str]:
    """
    Map a linear sequence position to IMGT numbering.

    Args:
        linear_pos: 1-indexed position in sequence
        numbering: IMGT numbering list from AntPack

    Returns:
        IMGT position string (e.g., "72", "111A") or None
    """
    if numbering is None:
        return None

    idx = linear_pos - 1  # Convert to 0-indexed

    if idx < 0 or idx >= len(numbering):
        return None

    imgt_pos = numbering[idx]

    if imgt_pos == '-' or imgt_pos == '':
        return None

    return str(imgt_pos)


print("✓ AntPack integration functions defined")

✓ AntPack integration functions defined
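
The position mapping can be exercised without AntPack by supplying a hand-written numbering list. The sketch below assumes, as `map_position_to_imgt` does, that the list has one entry per residue and uses `'-'` for residues that were not numbered. The toy sequence, toy labels, and the name `demo_map_to_imgt` are illustrative only.

```python
from typing import List, Optional


def demo_map_to_imgt(linear_pos: int, numbering: List[str]) -> Optional[str]:
    """Map a 1-indexed sequence position to its label; gaps ('-') map to None."""
    idx = linear_pos - 1
    if not 0 <= idx < len(numbering):
        return None
    label = numbering[idx]
    return None if label in ("-", "") else label


# Hand-written numbering for a 6-residue toy fragment; a real list comes from the annotator
toy_sequence = "GSNLTA"
toy_numbering = ["-", "-", "77", "78", "79", "80"]

for pos, residue in enumerate(toy_sequence, start=1):
    print(pos, residue, demo_map_to_imgt(pos, toy_numbering))
```

Here the Asn at linear position 3 maps to label `77`, while the two leading residues and any out-of-range position map to `None`.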


## 7. Site Annotation and Risk Scoring

Annotate detected sites with IMGT positions, regions, and comprehensive risk assessment.

In [11]:
def annotate_glycosylation_sites(
    sites: List[GlycosylationSite],
    numbering: List
) -> List[GlycosylationSite]:
    """
    Annotate glycosylation sites with IMGT positions and risk assessment.

    Args:
        sites: List of detected GlycosylationSite objects
        numbering: IMGT numbering from AntPack

    Returns:
        Annotated list of GlycosylationSite objects
    """
    for site in sites:
        # Map to IMGT
        imgt_pos = map_position_to_imgt(site.position, numbering)
        site.imgt_position = imgt_pos

        # Get region
        site.region = get_imgt_region(imgt_pos)

        # Check Vernier zone
        site.is_vernier_zone = is_vernier_zone(imgt_pos)

        # Check progenitor hotspot
        is_hot, hot_count = is_progenitor_hotspot(imgt_pos)
        site.is_progenitor_hotspot = is_hot
        site.hotspot_count = hot_count

        # Classify position risk
        site.position_risk = classify_position_risk(imgt_pos, site.region)

        # Infer mechanism
        if site.region == "CDR3":
            site.mechanism = "Junctional (V(D)J recombination)"
        elif site.region in ["CDR1", "CDR2"]:
            site.mechanism = "Likely SHM-acquired"
        else:
            site.mechanism = "SHM or germline (check V-gene)"

    return sites


def annotate_progenitor_sites(
    progenitors: List[ProgenitorSite],
    numbering: List
) -> List[ProgenitorSite]:
    """
    Annotate progenitor sites with IMGT positions and risk assessment.

    Args:
        progenitors: List of detected ProgenitorSite objects
        numbering: IMGT numbering from AntPack

    Returns:
        Annotated list of ProgenitorSite objects
    """
    for prog in progenitors:
        # Map to IMGT
        imgt_pos = map_position_to_imgt(prog.position, numbering)
        prog.imgt_position = imgt_pos

        # Get region
        prog.region = get_imgt_region(imgt_pos)

        # Check Vernier zone
        prog.is_vernier_zone = is_vernier_zone(imgt_pos)

        # Classify position risk
        prog.position_risk = classify_position_risk(imgt_pos, prog.region)

    return progenitors


print("✓ Site annotation functions defined")

✓ Site annotation functions defined
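The annotation step depends on mapping an IMGT position to a region and then inferring a likely mechanism. As a minimal sketch, assuming integer IMGT positions and the standard IMGT variable-domain boundaries, the lookup could look like the following. The names `sketch_imgt_region` and `sketch_mechanism` are hypothetical stand-ins, not the notebook's own `get_imgt_region`, which may handle insertion codes or chain-specific details differently.

```python
from typing import Optional

# Standard IMGT variable-domain boundaries (inclusive).
IMGT_REGIONS = [
    ("FR1", 1, 26),
    ("CDR1", 27, 38),
    ("FR2", 39, 55),
    ("CDR2", 56, 65),
    ("FR3", 66, 104),
    ("CDR3", 105, 117),
    ("FR4", 118, 128),
]


def sketch_imgt_region(imgt_pos: Optional[int]) -> str:
    """Return the IMGT region label for a position, or 'Unknown'."""
    if imgt_pos is None:
        return "Unknown"
    for label, start, end in IMGT_REGIONS:
        if start <= imgt_pos <= end:
            return label
    return "Unknown"


def sketch_mechanism(region: str) -> str:
    """Mirror the mechanism rule used in annotate_glycosylation_sites."""
    if region == "CDR3":
        return "Junctional (V-J recombination)"
    if region in ("CDR1", "CDR2"):
        return "Likely SHM-acquired"
    return "SHM or germline (check V-gene)"


region = sketch_imgt_region(115)
print(region, "|", sketch_mechanism(region))
```

Returning `"Unknown"` for an unmapped position keeps the mechanism rule on its framework fallback rather than raising, which may or may not match how the notebook treats numbering failures.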


## 8. Main Scanner Function

Orchestrate the complete scanning pipeline.

In [12]:
def scan_sequence(name: str, sequence: str, chain_type: str) -> ScanResult:
    """
    Run complete glycosylation liability scan on an antibody sequence.

    Args:
        name: Sequence identifier
        sequence: Amino acid sequence (variable domain)
        chain_type: 'H' for heavy, 'L' for light

    Returns:
        ScanResult with all detected sites and risk assessment
    """
    result = ScanResult(
        name=name,
        sequence=sequence,
        chain_type=chain_type
    )

    # Run AntPack analysis
    antpack_result = run_antpack_analysis(sequence, chain_type)
    numbering = antpack_result.get("numbering")

    result.v_gene = antpack_result.get("v_gene")
    result.j_gene = antpack_result.get("j_gene")
    result.v_identity = antpack_result.get("v_identity")

    # Get IGHV family risk if heavy chain
    if chain_type == 'H' and result.v_gene:
        result.ighv_family_risk = get_ighv_family_risk(result.v_gene)

    # Scan for glycosylation sites
    glyc_sites = scan_glycosylation_sites(sequence)
    glyc_sites = annotate_glycosylation_sites(glyc_sites, numbering)
    result.glycosylation_sites = glyc_sites
    result.total_sites = len(glyc_sites)

    # Scan for progenitor sites
    prog_sites = scan_progenitor_sites(sequence)
    prog_sites = annotate_progenitor_sites(prog_sites, numbering)
    result.progenitor_sites = prog_sites
    result.total_progenitors = len(prog_sites)

    # Determine highest risk
    all_risks = [s.position_risk for s in glyc_sites if s.position_risk]
    all_risks += [p.position_risk for p in prog_sites if p.position_risk]

    if all_risks:
        # Sort by severity
        risk_order = [RiskLevel.CRITICAL, RiskLevel.HIGH, RiskLevel.MEDIUM, RiskLevel.LOW]
        for risk in risk_order:
            if risk in all_risks:
                result.highest_risk = risk
                break

    return result


print("✓ Main scanner function defined")

✓ Main scanner function defined
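The "determine highest risk" step walks a severity-ordered list and stops at the first level present. An alternative sketch, assuming the risk levels can be given an explicit numeric order, is shown below. `SketchRisk` is a hypothetical stand-in for the notebook's `RiskLevel`, whose actual definition is not shown in this section.

```python
from enum import IntEnum
from typing import Iterable, Optional


class SketchRisk(IntEnum):
    """Hypothetical stand-in for RiskLevel; a larger value is more severe."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def highest_risk(risks: Iterable[Optional[SketchRisk]]) -> Optional[SketchRisk]:
    """Return the most severe risk, skipping missing values; None if empty."""
    present = [r for r in risks if r is not None]
    return max(present) if present else None


site_risks = [SketchRisk.LOW, None, SketchRisk.CRITICAL, SketchRisk.MEDIUM]
print(highest_risk(site_risks).name)
```

With an ordered enum, the severity ranking lives in one place instead of being restated as a list inside the scanner.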


## 9. Report Generation

Generate human-readable reports with clear risk stratification.

In [13]:
def format_risk_badge(risk: RiskLevel) -> str:
    """Format risk level as colored badge."""
    badges = {
        RiskLevel.CRITICAL: "🔴 CRITICAL",
        RiskLevel.HIGH: "🟠 HIGH",
        RiskLevel.MEDIUM: "🟡 MEDIUM",
        RiskLevel.LOW: "🟢 LOW",
    }
    return badges.get(risk, "⚪ UNKNOWN")


def generate_report(result: ScanResult) -> str:
    """
    Generate formatted report for scan result.

    Args:
        result: ScanResult object

    Returns:
        Formatted string report
    """
    lines = []
    lines.append("=" * 78)
    lines.append(f"  {result.name}")
    lines.append("=" * 78)
    lines.append("")

    # Summary
    lines.append(f"  Chain: {result.chain_type}")
    if result.v_gene:
        lines.append(f"  V-gene: {result.v_gene}")
    if result.j_gene:
        lines.append(f"  J-gene: {result.j_gene}")
    if result.v_identity:
        lines.append(f"  V-gene identity: {result.v_identity:.1f}%")

    # IGHV family risk
    if result.ighv_family_risk:
        fam_risk = result.ighv_family_risk
        lines.append(f"  IGHV family glycosylation prevalence: {fam_risk['prevalence']*100:.0f}% ({fam_risk['risk']} risk)")

    lines.append("")

    # Overall risk
    if result.highest_risk:
        lines.append(f"  OVERALL RISK: {format_risk_badge(result.highest_risk)}")
    else:
        lines.append("  OVERALL RISK: 🟢 NONE DETECTED")
    lines.append("")

    # Glycosylation sites
    lines.append("-" * 78)
    lines.append(f"  N-X-S/T SEQUONS: {result.total_sites} detected")
    lines.append("-" * 78)

    if result.glycosylation_sites:
        for site in result.glycosylation_sites:
            lines.append("")
            lines.append(f"  • {site.motif} at linear {site.position} → IMGT {site.imgt_position}")
            lines.append(f"    Region: {site.region} | {format_risk_badge(site.position_risk)}")
            lines.append(f"    Sequon type: {site.sequon_type} | X-residue: {site.x_residue}")
            lines.append(f"    Occupancy score: {site.occupancy_score:.2f} (X-eff: {site.x_efficiency:.2f} × type: {site.sequon_multiplier:.2f})")

            if site.is_vernier_zone:
                lines.append("    ⚠️  VERNIER ZONE - Conformational leverage risk")
            if site.is_progenitor_hotspot:
                lines.append(f"    ⚠️  PROGENITOR HOTSPOT - {site.hotspot_count} germline configurations")

            lines.append(f"    Mechanism: {site.mechanism}")
            lines.append(f"    Context: ...{site.context}...")
    else:
        lines.append("")
        lines.append("  None detected.")

    # Progenitor sites
    lines.append("")
    lines.append("-" * 78)
    lines.append(f"  PROGENITOR SITES: {result.total_progenitors} detected")
    lines.append("-" * 78)

    if result.progenitor_sites:
        for prog in result.progenitor_sites:
            lines.append("")
            lines.append(f"  • {prog.current_motif} at linear {prog.position} → IMGT {prog.imgt_position}")
            lines.append(f"    Progenitor type: {prog.progenitor_type} → {prog.potential_sequon}")
            lines.append(f"    Region: {prog.region} | {format_risk_badge(prog.position_risk)} (if actualized)")
            lines.append(f"    Predicted efficiency if actualized: {prog.predicted_efficiency:.2f}")

            if prog.is_vernier_zone:
                lines.append("    ⚠️  VERNIER ZONE - High conformational impact if actualized")

            lines.append(f"    Context: ...{prog.context}...")
    else:
        lines.append("")
        lines.append("  None detected.")

    lines.append("")
    return "\n".join(lines)


print("✓ Report generation functions defined")

✓ Report generation functions defined
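The report prints the occupancy score as the product of an X-residue efficiency and a sequon-type multiplier. A minimal sketch of that arithmetic follows, assuming the score is exactly this product, as the report line suggests. The helper names are hypothetical, and the example values (0.25 and 1.00) are the ones shown for the NLT site in the VRC01_light report; no other efficiency or multiplier values are assumed here.

```python
def sequon_type(motif: str) -> str:
    """Classify a three-residue N-X-S/T sequon as 'NXT' or 'NXS'."""
    is_sequon = (
        len(motif) == 3
        and motif[0] == "N"
        and motif[1] != "P"      # proline at X blocks glycosylation
        and motif[2] in ("S", "T")
    )
    if not is_sequon:
        raise ValueError(f"not an N-X-S/T sequon: {motif!r}")
    return "NXT" if motif[2] == "T" else "NXS"


def occupancy_score(x_efficiency: float, sequon_multiplier: float) -> float:
    """Combine the two factors the report prints into a single score."""
    return x_efficiency * sequon_multiplier


# Values taken from the VRC01_light report: X-eff 0.25, type multiplier 1.00.
score = occupancy_score(0.25, 1.00)
print(f"{sequon_type('NLT')}: occupancy score {score:.2f}")
```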


## 10. Validation Set

Test sequences with known glycosylation status from the literature.

In [14]:
# Validation sequences with ground truth
VALIDATION_SEQUENCES = {
    "VRC01_light": {
        "sequence": "EIVLTQSPGTLSLSPGERATLSCRASQSVSSNYLAWYQQKPGQAPRLLIYGASSRATGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQYGSSNLTFGGGTKVEIK",
        "chain_type": "L",
        "expected_glyc": True,
        "known_site": "NLT in CDR3",
        "notes": "VRC01-class bnAb. NLT from junctional diversity.",
    },
    "N6_light": {
        "sequence": "QSVLTQPPSVSAAPGQKVTISCSGSSSNIGNNYVSWYQQLPGTAPKLLIYDNNKRPSGIPDRFSGSKSGTSATLGITGLQTGDEADYYCGTWDSSLNLTFGGGTKLTVL",
        "chain_type": "L",
        "expected_glyc": True,
        "known_site": "NLT in CDR3",
        "notes": "VRC01-class bnAb. Convergent evolution - independent NLT.",
    },
    "Cetuximab_heavy": {
        "sequence": "QVQLKQSGPGLVQPSQSLSITCTVSGFSLTNYGVHWVRQSPGKGLEWLGVIWSGGNTDYNTPFTSRLSINKDNSKSQVFFKMNSLQSNDTAIYYCVKNGNPWLAYWGQGTLVTVSA",
        "chain_type": "H",
        "expected_glyc": True,
        "known_site": "NDT in FR3 (carries α-gal)",
        "notes": "Only FDA therapeutic with documented Fab glycosylation.",
    },
    "Cetuximab_light": {
        "sequence": "DILLTQSPVILSVSPGERVSFSCRASQSIGTNIHWYQQRTNGSPRLLIKYASESISGIPSRFSGSGSGTDFTLSINSVESEDIADYYCQQNNNWPTTFGAGTKLELK",
        "chain_type": "L",
        "expected_glyc": False,  # Has sequon but NOT occupied
        "known_site": "NGS in FR2 (UNOCCUPIED)",
        "notes": "Sequon present but not glycosylated in vivo.",
    },
    "Bevacizumab_heavy": {
        "sequence": "EVQLVESGGGLVQPGGSLRLSCAASGYTFTNYGMNWVRQAPGKGLEWVGWINTYTGEPTYAADFKRRFTFSLDTSKSTAYLQMNSLRAEDTAVYYCAKYPHYYGSSHWYFDVWGQGTLVTVSS",
        "chain_type": "H",
        "expected_glyc": False,
        "known_site": None,
        "notes": "Negative control - no Fab glycosylation.",
    },
    "12A12_light": {
        "sequence": "QSALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSKRPSGVSNRFSGSKSGNTASLTISGLQAEDEADYYCSSYTSSSTLYIFGGGTKVTVL",
        "chain_type": "L",
        "expected_glyc": False,
        "known_site": None,
        "notes": "VRC01-class bnAb, no Fab glycosylation.",
    },
}

print(f"Loaded {len(VALIDATION_SEQUENCES)} validation sequences")

Loaded 6 validation sequences
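Before running the full pipeline, the `known_site` entries can be sanity-checked with a plain regular expression. The sketch below assumes the usual N-X-S/T definition (X is any residue except proline) and reports 1-based linear positions. `find_sequons` is a hypothetical helper, not the notebook's `scan_glycosylation_sites`, and the fragments are short excerpts of the validation sequences above.

```python
import re

# Lookahead so that overlapping sequons (e.g. "NNSS") are all reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(sequence: str) -> list:
    """Return (1-based position of Asn, motif) for each N-X-S/T sequon."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(sequence.upper())]


fragments = {
    "VRC01_light (CDR3 context)": "QYGSSNLTFGGGT",
    "Cetuximab_heavy (FR3 excerpt)": "SLQSNDTAIYYC",
    "Cetuximab_light (FR2 excerpt)": "WYQQRTNGSPRLLIK",
    "Proline at X (not a sequon)": "ANPSA",
}

for label, fragment in fragments.items():
    print(f"{label}: {find_sequons(fragment)}")
```

Positions here are relative to each fragment, not to the full variable domain.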


## 11. Run Validation

Scan all validation sequences and display results.

In [15]:
print("\n" + "#" * 78)
print("#  ENHANCED FAB GLYCOSYLATION SCANNER v3.0 - VALIDATION RUN")
print("#" * 78)
print("\nData sources:")
print("  - van de Bovenkamp et al. (2018) PNAS 115:1901-1906")
print("  - Shakin-Eshleman et al. (1996) JBC 271:6363-6366")
print("  - Chuang et al. (2020) mAbs 12:1836719")
print("\n")

results = {}

for name, data in VALIDATION_SEQUENCES.items():
    result = scan_sequence(name, data["sequence"], data["chain_type"])
    results[name] = result

    print(generate_report(result))

    # Comparison with ground truth
    detected = result.total_sites > 0
    expected = data["expected_glyc"]
    match = "✓" if detected == expected else "✗"

    print(f"  Ground truth: {'Glycosylated' if expected else 'Not glycosylated'}")
    print(f"  Detection match: {match}")
    if data["known_site"]:
        print(f"  Known site: {data['known_site']}")
    print(f"  Notes: {data['notes']}")
    print("\n")


##############################################################################
#  ENHANCED FAB GLYCOSYLATION SCANNER v3.0 - VALIDATION RUN
##############################################################################

Data sources:
  - van de Bovenkamp et al. (2018) PNAS 115:1901-1906
  - Shakin-Eshleman et al. (1996) JBC 271:6363-6366
  - Chuang et al. (2020) mAbs 12:1836719


  VRC01_light

  Chain: L
  V-gene: IGKV3-20*01
  J-gene: IGKJ4*01
  V-gene identity: 98.9%

  OVERALL RISK: 🔴 CRITICAL

------------------------------------------------------------------------------
  N-X-S/T SEQUONS: 1 detected
------------------------------------------------------------------------------

  • NLT at linear 96 → IMGT 115
    Region: CDR3 | 🔴 CRITICAL
    Sequon type: NXT | X-residue: L
    Occupancy score: 0.25 (X-eff: 0.25 × type: 1.00)
    Mechanism: Junctional (V-J recombination)
    Context: ...QYGSSNLTFGGGT...

-----------------------------------------------------------------

## 12. Summary Statistics

In [16]:
print("\n" + "=" * 78)
print("  VALIDATION SUMMARY")
print("=" * 78)

# Calculate performance
true_positives = 0
true_negatives = 0
false_positives = 0
false_negatives = 0

for name, data in VALIDATION_SEQUENCES.items():
    result = results[name]
    detected = result.total_sites > 0
    expected = data["expected_glyc"]

    if detected and expected:
        true_positives += 1
    elif not detected and not expected:
        true_negatives += 1
    elif detected and not expected:
        false_positives += 1
    else:
        false_negatives += 1

total = len(VALIDATION_SEQUENCES)
sensitivity = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
specificity = true_negatives / (true_negatives + false_positives) if (true_negatives + false_positives) > 0 else 0

print(f"\n  True Positives:  {true_positives}")
print(f"  True Negatives:  {true_negatives}")
print(f"  False Positives: {false_positives}  (sequon detected but not occupied in vivo)")
print(f"  False Negatives: {false_negatives}")
print(f"\n  Sensitivity: {sensitivity*100:.1f}%")
print(f"  Specificity: {specificity*100:.1f}%")
print(f"\n  Note: 'False positives' are valid sequon detections. The scanner identifies")
print(f"  LIABILITY (sequon presence), not OCCUPANCY (glycan presence).")
print(f"  Cetuximab light chain NGS is correctly detected as a liability.")


  VALIDATION SUMMARY

  True Positives:  3
  True Negatives:  2
  False Positives: 1  (sequon detected but not occupied in vivo)
  False Negatives: 0

  Sensitivity: 100.0%
  Specificity: 66.7%

  Note: 'False positives' are valid sequon detections. The scanner identifies
  LIABILITY (sequon presence), not OCCUPANCY (glycan presence).
  Cetuximab light chain NGS is correctly detected as a liability.
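The summary figures can be reproduced directly from the confusion counts. The sketch below uses a hypothetical helper, `sensitivity_specificity`, and also shows how the numbers change if the ground truth is taken to be sequon presence (liability) rather than in vivo occupancy, in which case the cetuximab light chain counts as a true positive.

```python
def sensitivity_specificity(tp: int, tn: int, fp: int, fn: int) -> tuple:
    """Return (sensitivity, specificity), using 0.0 when a denominator is zero."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Scored against occupancy, as in the validation summary.
sens, spec = sensitivity_specificity(tp=3, tn=2, fp=1, fn=0)
print(f"vs occupancy: sensitivity {sens:.1%}, specificity {spec:.1%}")

# Scored against sequon presence: the unoccupied NGS becomes a true positive.
sens, spec = sensitivity_specificity(tp=4, tn=2, fp=0, fn=0)
print(f"vs liability: sensitivity {sens:.1%}, specificity {spec:.1%}")
```

With only six sequences, either pair of figures is illustrative rather than a performance estimate.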


## 13. Discussion: The Vernier Zone

### What We Know

The **Vernier zone** is a set of framework residues that structurally support and influence CDR conformations. The **DE loop** (IMGT positions ~75-88, located within FR3) is a key component.

**Evidence for conformational leverage:**

1. **Single residue mutations in the Vernier zone can shift CDR conformational ensembles** (Tramontano et al., Al-Lazikani et al., Chothia & Lesk). For example, VH residue 71 (Chothia numbering, ~IMGT 80) co-determines CDR-H2 canonical structure.

2. **Humanization failures often trace to Vernier mismatches.** CDR grafting from mouse to human frameworks sometimes fails because the human framework has different Vernier residues, shifting CDR conformations and killing binding.

3. **Antibodies exist as conformational ensembles** (Fernández-Quintero et al.). CDRs sample multiple discrete conformational states. Vernier perturbations can shift the population distribution across these states.

### What We Don't Know

**The exact functional impact of glycosylation at specific Vernier positions is largely unexplored.**

- van de Bovenkamp showed progenitor sites cluster in this region (positions 77, 81, 82, 84)
- They demonstrated that glycans can modulate binding in position-dependent ways
- But they did not systematically characterize Vernier-specific effects

### Scanner Approach

We flag IMGT 75-88 as **elevated conformational risk** based on:
1. Known structural importance of the DE loop
2. Progenitor site clustering in this region
3. Mechanistic plausibility (glycan mass + steric bulk → conformational perturbation)

**This is a hypothesis-generating flag, not a definitive prediction.** Structural analysis or molecular dynamics would be needed to confirm functional impact for any specific antibody.
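As a minimal sketch of the flag itself, assuming IMGT positions may arrive either as integers or as strings with insertion codes (for example `"111A"`), the range test could be written as follows. `in_flagged_de_loop` and `imgt_base_number` are hypothetical names; the notebook's own `is_vernier_zone` may use a different residue set or input format.

```python
import re
from typing import Optional, Union

# Flagged range from the text: IMGT 75-88 (DE loop, within FR3).
DE_LOOP_START, DE_LOOP_END = 75, 88


def imgt_base_number(imgt_pos: Union[int, str, None]) -> Optional[int]:
    """Extract the numeric part of an IMGT position such as 84 or '111A'."""
    if imgt_pos is None:
        return None
    match = re.match(r"\s*(\d+)", str(imgt_pos))
    return int(match.group(1)) if match else None


def in_flagged_de_loop(imgt_pos: Union[int, str, None]) -> bool:
    """True if the position falls inside the flagged IMGT 75-88 window."""
    number = imgt_base_number(imgt_pos)
    return number is not None and DE_LOOP_START <= number <= DE_LOOP_END


for pos in (77, "84", 88, 89, "111A", None):
    print(pos, in_flagged_de_loop(pos))
```

Treating unmapped positions as not flagged avoids raising on numbering failures, at the cost of silently skipping sites that could not be placed.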

---

### References

1. **van de Bovenkamp FS, et al.** (2018) Adaptive antibody diversification through N-linked glycosylation of the immunoglobulin variable region. *PNAS* 115:1901-1906.

2. **Shakin-Eshleman SH, et al.** (1996) The amino acid at the X position of an Asn-X-Ser sequon is an important determinant of N-linked core-glycosylation efficiency. *JBC* 271:6363-6366.

3. **Fernández-Quintero ML, et al.** (2020) Antibodies exhibit multiple paratope states influencing VH-VL domain orientations. *Commun Biol* 3:589.

4. **Chothia C, Lesk AM.** (1987) Canonical structures for the hypervariable regions of immunoglobulins. *J Mol Biol* 196:901-917.

5. **Tramontano A, et al.** (1990) Framework residue 71 is a major determinant of the position and conformation of the second hypervariable region in the VH domains of immunoglobulins. *J Mol Biol* 215:175-182.

## 14. Conclusions and Next Steps

### Key Capabilities of v3.0

| Feature | Purpose |
|---------|--------|
| Sequon detection | Find existing N-X-S/T sites |
| Progenitor detection | Find latent liabilities (one mutation away) |
| X-position scoring | Estimate occupancy probability |
| NXT/NXS differentiation | Refine occupancy estimate |
| Vernier zone flagging | Highlight conformational risk |
| IGHV family risk | Contextualize heavy chain risk |

### Limitations

1. **Scanner predicts LIABILITY, not OCCUPANCY.** Mass spectrometry is needed to confirm glycan presence.

2. **Scanner predicts RISK, not OUTCOME.** Functional impact (binding, stability) is antigen-specific and requires experimental testing.

3. **Vernier zone boundaries are approximate.** Exact functional residues vary by antibody structure.

4. **Progenitor detection is limited to common patterns.** Rare progenitor routes may be missed.

### Next Steps

1. Expand validation to N≥30 FDA-approved therapeutics
2. Integrate with RFdiffusion output scanning
3. Add structural accessibility filter using AlphaFold2/ESMFold predictions
4. Develop glycoform-specific binding prediction (longer-term)

---

**AntibodyML Consulting LLC**  
*Bridging computational design and manufacturing reality*

In [17]:
print("\n" + "=" * 78)
print("  TECHNICAL DETAILS")
print("=" * 78)
print(f"\n  Scanner version: 3.0")
print(f"  AntPack version: 0.3.8.6")
print(f"  Numbering scheme: IMGT")
print(f"  Species: Human")
print(f"\n  Primary data source:")
print(f"    van de Bovenkamp FS, et al. (2018) PNAS 115:1901-1906")
print(f"    PMID: 29432145 | DOI: 10.1073/pnas.1711720115")
print(f"\n  Efficiency scoring source:")
print(f"    Shakin-Eshleman SH, et al. (1996) JBC 271:6363-6366")
print(f"    PMID: 8626433")
print(f"\n  Validation sequences:")
print(f"    Chuang GY, et al. (2020) mAbs 12:1836719 | PMID: 33164673")
print(f"    Chung CH, et al. (2008) NEJM 358:1109-1117 | PMID: 18337601")


  TECHNICAL DETAILS

  Scanner version: 3.0
  AntPack version: 0.3.8.6
  Numbering scheme: IMGT
  Species: Human

  Primary data source:
    van de Bovenkamp FS, et al. (2018) PNAS 115:1901-1906
    PMID: 29432145 | DOI: 10.1073/pnas.1711720115

  Efficiency scoring source:
    Shakin-Eshleman SH, et al. (1996) JBC 271:6363-6366
    PMID: 8626433

  Validation sequences:
    Chuang GY, et al. (2020) mAbs 12:1836719 | PMID: 33164673
    Chung CH, et al. (2008) NEJM 358:1109-1117 | PMID: 18337601
