# 💊 Drug Discovery Pipeline

Welcome to the **Peptide Drug Discovery Pipeline**!

In this tutorial, you'll build an end-to-end computational workflow for discovering therapeutic peptides.

### 🎯 Goal
Design, screen, and optimize peptide drug candidates using computational methods.

### 📚 What You'll Learn
1. **Library Generation**: Create diverse peptide libraries
2. **Property Prediction**: Calculate physicochemical properties that influence ADME (Absorption, Distribution, Metabolism, Excretion)
3. **Binding Scoring**: Estimate target affinity
4. **Cyclization**: Improve stability with macrocycles
5. **Lead Selection**: Rank and filter candidates

### 💡 Real-World Context
This pipeline mirrors workflows used in biotech/pharma for:
- Peptide therapeutics (e.g., GLP-1 receptor agonists such as semaglutide, marketed as Ozempic)
- Cyclic peptide antibiotics
- Protein-protein interaction inhibitors

---

In [None]:
# 🛠️ SETUP: Install dependencies
import sys
import os

try:
    import google.colab
    IN_COLAB = True
    print("🌐 Running in Google Colab")
    !pip install -q synth-pdb matplotlib numpy biotite py3Dmol pandas seaborn
except ImportError:
    IN_COLAB = False
    print("💻 Running locally")

print("✅ Setup complete!")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import biotite.structure as struc
import biotite.structure.io.pdb as pdb
import io
from synth_pdb.generator import generate_pdb_content
import random

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100

print("📦 All imports successful!")

## Step 1: Peptide Library Generation

We'll create a focused library of peptides targeting a hypothetical receptor.

### 🧬 Design Strategy
- **Length**: 8-12 residues (typical for bioactive peptides)
- **Composition**: Biased toward drug-like amino acids
- **Diversity**: Random sequences with structural constraints

### 🎲 Amino Acid Preferences
We'll favor residues common in successful peptide drugs:
- **Hydrophobic**: L, V, I, F, W, A (binding pockets)
- **Polar**: S, T, N, Q (H-bonding)
- **Charged**: K, R, D, E (electrostatic interactions)
- **Special**: P (turns), G (flexibility), C (disulfides)

In [None]:
# Define amino acid library (drug-like bias)
AA_LIBRARY = {
    'hydrophobic': ['L', 'V', 'I', 'F', 'W', 'A'],
    'polar': ['S', 'T', 'N', 'Q'],
    'positive': ['K', 'R'],
    'negative': ['D', 'E'],
    'special': ['P', 'G', 'C']
}

def generate_peptide_sequence(length=10, seed=None):
    """Generate a random peptide sequence with drug-like composition."""
    # Use a local generator so seeding here does not reset the global RNG
    rng = random.Random(seed)

    # Target share of each residue class (sums to 1.0).
    # Sampling the class first keeps these shares independent of how many
    # residues each class contains.
    class_weights = {
        'hydrophobic': 0.40,
        'polar': 0.30,
        'positive': 0.10,
        'negative': 0.10,
        'special': 0.10,
    }

    # Pick a class per position, then a residue uniformly within that class
    classes = rng.choices(list(class_weights),
                          weights=list(class_weights.values()), k=length)
    return ''.join(rng.choice(AA_LIBRARY[c]) for c in classes)

# Generate library
print("🧬 Generating peptide library...\n")
LIBRARY_SIZE = 50
random.seed(0)  # Fixed seed so peptide lengths and later noise terms are reproducible
library = []

for i in range(LIBRARY_SIZE):
    length = random.randint(8, 12)
    seq = generate_peptide_sequence(length, seed=i)
    library.append({
        'id': f"PEP{i+1:03d}",
        'sequence': seq,
        'length': length
    })

# Convert to DataFrame
df = pd.DataFrame(library)

print(f"✅ Generated {len(df)} peptides")
print(f"   Length range: {df['length'].min()}-{df['length'].max()} residues")
print(f"\nFirst 5 sequences:")
for idx, row in df.head().iterrows():
    print(f"   {row['id']}: {row['sequence']}")
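
To check that a generated library reflects the intended class bias, it can help to tabulate the observed residue-class fractions. The sketch below is a minimal, standard-library illustration; the helper name `class_fractions` and the three example sequences are hypothetical and not part of the pipeline.

```python
from collections import Counter

# Residue classes as defined in AA_LIBRARY (repeated so this block runs alone)
RESIDUE_CLASSES = {
    'hydrophobic': 'LVIFWA',
    'polar': 'STNQ',
    'positive': 'KR',
    'negative': 'DE',
    'special': 'PGC',
}

def class_fractions(sequences):
    """Return the fraction of residues in each class across all sequences."""
    lookup = {aa: name for name, residues in RESIDUE_CLASSES.items() for aa in residues}
    counts = Counter(lookup.get(aa, 'other') for seq in sequences for aa in seq)
    total = sum(counts.values())
    return {name: counts[name] / total for name in counts}

# Hypothetical example sequences, for illustration only
example = ['LVSTKDPG', 'FWAANQRE', 'ILSTKDCG']
fractions = class_fractions(example)
for name, frac in sorted(fractions.items()):
    print(f"{name:<12} {frac:.1%}")
```

In the notebook, the same helper could be called as `class_fractions(df['sequence'])` and compared against the 40/30/10/10/10 target.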

## Step 2: Property Prediction (ADME)

We'll calculate key physicochemical properties that affect drug-likeness.

### 📊 Properties to Calculate
1. **Molecular Weight** - Affects bioavailability
2. **Net Charge** - Affects solubility and membrane permeability
3. **Hydrophobicity (GRAVY)** - Affects membrane binding
4. **Instability Index** - Predicts degradation
5. **Aromatic Content** - Affects binding affinity

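### ✅ Drug-Like Criteria
- MW: 800-1500 Da (peptide range)
- Net charge: -2 to +3 (soluble but membrane-permeable)
- GRAVY: -0.5 to +0.5 (balanced)
- Instability: <40 (stable). The cutoff of 40 comes from the full dipeptide-weighted instability index; the simplified pair count used in this tutorial is only a rough stand-in on a different scale.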

In [None]:
# Amino acid properties
AA_MW = {'A': 89, 'C': 121, 'D': 133, 'E': 147, 'F': 165, 'G': 75, 'H': 155,
         'I': 131, 'K': 146, 'L': 131, 'M': 149, 'N': 132, 'P': 115, 'Q': 146,
         'R': 174, 'S': 105, 'T': 119, 'V': 117, 'W': 204, 'Y': 181}

AA_HYDRO = {'A': 1.8, 'C': 2.5, 'D': -3.5, 'E': -3.5, 'F': 2.8, 'G': -0.4,
            'H': -3.2, 'I': 4.5, 'K': -3.9, 'L': 3.8, 'M': 1.9, 'N': -3.5,
            'P': -1.6, 'Q': -3.5, 'R': -4.5, 'S': -0.8, 'T': -0.7, 'V': 4.2,
            'W': -0.9, 'Y': -1.3}

AA_CHARGE = {'D': -1, 'E': -1, 'K': 1, 'R': 1}  # at pH 7

def calculate_properties(sequence):
    """Calculate physicochemical properties."""
    # Molecular weight
    mw = sum(AA_MW.get(aa, 110) for aa in sequence) - 18 * (len(sequence) - 1)
    
    # Net charge at pH 7
    charge = sum(AA_CHARGE.get(aa, 0) for aa in sequence)
    
    # GRAVY (Grand Average of Hydropathy)
    gravy = sum(AA_HYDRO.get(aa, 0) for aa in sequence) / len(sequence)
    
    # Aromatic content
    aromatic = sum(1 for aa in sequence if aa in 'FWY') / len(sequence)
    
    # Instability index (simplified)
    # Real calculation uses dipeptide instability weights
    unstable_pairs = sum(1 for i in range(len(sequence)-1) 
                        if sequence[i:i+2] in ['DP', 'PD', 'DD', 'EE'])
    instability = (unstable_pairs / len(sequence)) * 100
    
    return {
        'mw': mw,
        'charge': charge,
        'gravy': gravy,
        'aromatic_pct': aromatic * 100,
        'instability': instability
    }

# Calculate for all peptides
print("📊 Calculating ADME properties...\n")

for idx, row in df.iterrows():
    props = calculate_properties(row['sequence'])
    for key, value in props.items():
        df.at[idx, key] = value

print("✅ Properties calculated!")
print(f"\nProperty ranges:")
print(f"   MW: {df['mw'].min():.0f} - {df['mw'].max():.0f} Da")
print(f"   Charge: {df['charge'].min():.0f} to {df['charge'].max():.0f}")
print(f"   GRAVY: {df['gravy'].min():.2f} to {df['gravy'].max():.2f}")
print(f"   Instability: {df['instability'].min():.1f} - {df['instability'].max():.1f}")
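
The integer charge in this step treats D/E as fully deprotonated and K/R as fully protonated, and it ignores His, Cys, Tyr, and the termini. A finer estimate applies the Henderson-Hasselbalch equation to each ionizable group. The sketch below is illustrative only: the function name `net_charge` is hypothetical, and the pKa values are one commonly quoted approximate set (an assumption; published tables differ by several tenths of a unit).

```python
# Approximate pKa values (assumption: one commonly quoted set; tables vary)
PKA_POSITIVE = {'N_term': 8.6, 'K': 10.8, 'R': 12.5, 'H': 6.5}
PKA_NEGATIVE = {'C_term': 3.6, 'D': 3.9, 'E': 4.1, 'C': 8.5, 'Y': 10.1}

def net_charge(sequence, ph=7.0):
    """Estimate the net charge of a linear peptide with free termini."""
    groups_pos = ['N_term'] + [aa for aa in sequence if aa in PKA_POSITIVE]
    groups_neg = ['C_term'] + [aa for aa in sequence if aa in PKA_NEGATIVE]
    # Fraction protonated (charge +1) for each basic group
    positive = sum(1 / (1 + 10 ** (ph - PKA_POSITIVE[g])) for g in groups_pos)
    # Fraction deprotonated (charge -1) for each acidic group
    negative = sum(1 / (1 + 10 ** (PKA_NEGATIVE[g] - ph)) for g in groups_neg)
    return positive - negative

for seq in ['KRKR', 'DEDE', 'AHHA']:
    print(seq, round(net_charge(seq), 2))
```

For a head-to-tail cyclic peptide the two terminal groups no longer exist, so the `N_term` and `C_term` entries would be dropped from the sums.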

## Step 3: Binding Score Estimation

We'll use a simplified scoring function to estimate binding affinity.

### 🎯 Scoring Components
1. **Hydrophobic contacts** - Favorable binding
2. **Electrostatic interactions** - Charge complementarity
3. **Aromatic stacking** - π-π interactions
4. **Size penalty** - Too large = entropic cost

### 📝 Note
This is a toy model! Real projects use docking or structure-prediction tools such as:
- AutoDock Vina
- Rosetta
- Schrödinger Glide
- AlphaFold-Multimer

In [None]:
def estimate_binding_score(sequence, target_profile=None):
    """
    Simplified binding score estimation.
    Lower score = better binding (like docking scores).
    """
    if target_profile is None:
        # Default: hydrophobic pocket with some charged residues
        target_profile = {
            'prefers_hydrophobic': True,
            'prefers_aromatic': True,
            'charge_preference': +1  # Prefers positive ligands
        }
    
    score = 0.0
    
    # Hydrophobic contribution
    hydrophobic_count = sum(1 for aa in sequence if aa in 'LVIFWA')
    if target_profile['prefers_hydrophobic']:
        score -= hydrophobic_count * 0.5  # Favorable
    
    # Aromatic contribution
    aromatic_count = sum(1 for aa in sequence if aa in 'FWY')
    if target_profile['prefers_aromatic']:
        score -= aromatic_count * 0.8  # Very favorable
    
    # Charge complementarity
    net_charge = sum(AA_CHARGE.get(aa, 0) for aa in sequence)
    charge_match = abs(net_charge - target_profile['charge_preference'])
    score += charge_match * 0.3  # Penalty for mismatch
    
    # Size penalty (entropic cost)
    if len(sequence) > 10:
        score += (len(sequence) - 10) * 0.2
    
    # Add some noise (real binding is complex!)
    score += random.gauss(0, 0.5)
    
    return score

# Calculate binding scores
print("🎯 Estimating binding affinities...\n")

for idx, row in df.iterrows():
    score = estimate_binding_score(row['sequence'])
    df.at[idx, 'binding_score'] = score

print("✅ Binding scores calculated!")
print(f"   Best score: {df['binding_score'].min():.2f}")
print(f"   Worst score: {df['binding_score'].max():.2f}")
print(f"\nTop 5 binders:")
top5 = df.nsmallest(5, 'binding_score')[['id', 'sequence', 'binding_score']]
for idx, row in top5.iterrows():
    print(f"   {row['id']}: {row['sequence']} (score: {row['binding_score']:.2f})")
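
To see how the individual terms add up, it may help to break the score into its deterministic parts for one sequence. The sketch below reuses the weights from the toy scoring function and leaves out the Gaussian noise term; the helper name `score_breakdown` and the example sequence are hypothetical.

```python
# Weights mirror the toy scoring function; the random noise term is omitted
CHARGE = {'D': -1, 'E': -1, 'K': 1, 'R': 1}

def score_breakdown(sequence, charge_preference=1):
    """Return each deterministic term of the toy binding score."""
    net_charge = sum(CHARGE.get(aa, 0) for aa in sequence)
    terms = {
        'hydrophobic': -0.5 * sum(aa in 'LVIFWA' for aa in sequence),
        'aromatic': -0.8 * sum(aa in 'FWY' for aa in sequence),
        'charge_penalty': 0.3 * abs(net_charge - charge_preference),
        'size_penalty': 0.2 * max(0, len(sequence) - 10),
    }
    terms['total'] = sum(terms.values())
    return terms

# Hypothetical 5-residue example
for name, value in score_breakdown('FWLKA').items():
    print(f"{name:<15} {value:+.2f}")
```

Note that F and W appear in both the hydrophobic and the aromatic residue sets, so each contributes to two terms and is rewarded more heavily than any other residue.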

## Step 4: Drug-Likeness Filtering

Apply filters to identify candidates with good ADME properties.

In [None]:
# Define drug-like criteria
def is_druglike(row):
    """Check if peptide meets drug-like criteria."""
    checks = {
        'mw_ok': 800 <= row['mw'] <= 1500,
        'charge_ok': -2 <= row['charge'] <= 3,
        'gravy_ok': -0.5 <= row['gravy'] <= 0.5,
        'stable': row['instability'] < 40
    }
    return all(checks.values()), checks

# Apply filters
druglike_flags = []
for idx, row in df.iterrows():
    is_dl, checks = is_druglike(row)
    druglike_flags.append(is_dl)

df['druglike'] = druglike_flags

# Summary
n_druglike = df['druglike'].sum()
print(f"📊 Drug-Likeness Filtering Results")
print(f"=" * 50)
print(f"Total library: {len(df)} peptides")
print(f"Drug-like: {n_druglike} ({100*n_druglike/len(df):.1f}%)")
print(f"Filtered out: {len(df) - n_druglike}")

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# MW distribution
axes[0,0].hist(df['mw'], bins=20, alpha=0.7, edgecolor='black')
axes[0,0].axvline(800, color='red', linestyle='--', label='Min threshold')
axes[0,0].axvline(1500, color='red', linestyle='--', label='Max threshold')
axes[0,0].set_xlabel('Molecular Weight (Da)')
axes[0,0].set_ylabel('Count')
axes[0,0].set_title('Molecular Weight Distribution')
axes[0,0].legend()

# Charge distribution
axes[0,1].hist(df['charge'], bins=range(-3, 6), alpha=0.7, edgecolor='black')
axes[0,1].axvline(-2, color='red', linestyle='--')
axes[0,1].axvline(3, color='red', linestyle='--')
axes[0,1].set_xlabel('Net Charge')
axes[0,1].set_ylabel('Count')
axes[0,1].set_title('Charge Distribution')

# GRAVY distribution
axes[1,0].hist(df['gravy'], bins=20, alpha=0.7, edgecolor='black')
axes[1,0].axvline(-0.5, color='red', linestyle='--')
axes[1,0].axvline(0.5, color='red', linestyle='--')
axes[1,0].set_xlabel('GRAVY Score')
axes[1,0].set_ylabel('Count')
axes[1,0].set_title('Hydrophobicity Distribution')

# Binding score vs MW
colors = ['green' if dl else 'red' for dl in df['druglike']]
axes[1,1].scatter(df['mw'], df['binding_score'], c=colors, alpha=0.6, edgecolors='black')
axes[1,1].set_xlabel('Molecular Weight (Da)')
axes[1,1].set_ylabel('Binding Score')
axes[1,1].set_title('Binding vs Size (Green=Drug-like)')

plt.tight_layout()
plt.show()
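
The filter returns a per-criterion breakdown, but the loop only keeps the overall pass/fail flag. Tallying failures per criterion shows which threshold removes the most peptides. The sketch below is a standard-library illustration under the same thresholds; the helper name `druglike_checks` and the three property rows are hypothetical.

```python
from collections import Counter

def druglike_checks(row):
    """Evaluate each drug-likeness criterion separately."""
    return {
        'mw_ok': 800 <= row['mw'] <= 1500,
        'charge_ok': -2 <= row['charge'] <= 3,
        'gravy_ok': -0.5 <= row['gravy'] <= 0.5,
        'stable': row['instability'] < 40,
    }

# Hypothetical property rows, for illustration only
rows = [
    {'id': 'A', 'mw': 1100, 'charge': 1, 'gravy': 0.2, 'instability': 0},
    {'id': 'B', 'mw': 1650, 'charge': 0, 'gravy': 1.4, 'instability': 10},
    {'id': 'C', 'mw': 950, 'charge': -3, 'gravy': -0.1, 'instability': 0},
]

# Count how many peptides fail each criterion
failures = Counter()
for row in rows:
    for name, passed in druglike_checks(row).items():
        if not passed:
            failures[name] += 1

for name, count in failures.most_common():
    print(f"{name}: {count} of {len(rows)} peptides fail")
```

If one criterion accounts for most rejections, that is usually a sign to revisit either the library composition or the threshold itself.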

## Step 5: Lead Selection & Ranking

Combine all criteria to select top candidates.

In [None]:
# Filter to drug-like candidates
candidates = df[df['druglike']].copy()

# Rank by binding score (lower = better)
candidates = candidates.sort_values('binding_score')

print(f"🏆 TOP 10 LEAD CANDIDATES")
print(f"=" * 80)
print(f"{'Rank':<6} {'ID':<8} {'Sequence':<15} {'MW':<8} {'Charge':<8} {'Score':<8}")
print(f"=" * 80)

for rank, (idx, row) in enumerate(candidates.head(10).iterrows(), 1):
    print(f"{rank:<6} {row['id']:<8} {row['sequence']:<15} "
          f"{row['mw']:<8.0f} {row['charge']:<8.0f} {row['binding_score']:<8.2f}")

# Save top candidates
top_candidates = candidates.head(10)
print(f"\n✅ Selected {len(top_candidates)} leads for further optimization")
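
Sorting by binding score alone ignores trade-offs with other properties. One common alternative, which is not what the ranking in this step does, is to keep only Pareto-optimal candidates: those for which no other candidate is at least as good on every objective and strictly better on one. The sketch below minimizes binding score and molecular weight (used here as a rough, assumed proxy for synthesis cost); the function name `pareto_front` and the candidate values are hypothetical.

```python
def pareto_front(candidates, objectives):
    """Return candidates not dominated on the given objectives (all minimized)."""
    def dominates(a, b):
        no_worse = all(a[k] <= b[k] for k in objectives)
        strictly_better = any(a[k] < b[k] for k in objectives)
        return no_worse and strictly_better

    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Hypothetical candidates, for illustration only
pool = [
    {'id': 'P1', 'binding_score': -5.0, 'mw': 1400},
    {'id': 'P2', 'binding_score': -4.0, 'mw': 1000},
    {'id': 'P3', 'binding_score': -3.5, 'mw': 1200},
    {'id': 'P4', 'binding_score': -2.0, 'mw': 900},
]

front = pareto_front(pool, ['binding_score', 'mw'])
print([c['id'] for c in front])
```

Here P3 drops out because P2 binds better and is also lighter; the remaining candidates each represent a different trade-off.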

## Step 6: Cyclization Strategy

Convert top linear peptides to cyclic forms for improved stability.

### 🔗 Why Cyclize?
- **Protease resistance** - No free termini for exopeptidases to cleave
- **Conformational rigidity** - Better binding specificity
- **Improved pharmacokinetics** - Longer half-life

### 🧪 Cyclization Methods
1. **Head-to-tail** - Backbone cyclization (most common)
2. **Disulfide** - Cys-Cys bridge
3. **Side-chain** - Lys-Asp/Glu lactam bridge

In [None]:
# Generate 3D structure for top candidate
top_peptide = top_candidates.iloc[0]

print(f"🧬 Generating 3D structure for top candidate: {top_peptide['id']}")
print(f"   Sequence: {top_peptide['sequence']}")
print(f"   Binding score: {top_peptide['binding_score']:.2f}")
print(f"\n⏳ This may take a minute...\n")

# Generate linear structure
linear_pdb = generate_pdb_content(
    sequence_str=top_peptide['sequence'],
    structure="all:random",  # Random coil
    optimize_sidechains=True,
    minimize_energy=True
)

# Generate cyclic structure
cyclic_pdb = generate_pdb_content(
    sequence_str=top_peptide['sequence'],
    structure="all:random",
    cyclic=True,  # Head-to-tail cyclization
    optimize_sidechains=True,
    minimize_energy=True
)

print("✅ Structures generated!")

In [None]:
# Visualize linear vs cyclic
import py3Dmol

print("🔬 3D Structure Comparison\n")

# Linear structure
print("Linear Peptide:")
n_res = len(top_peptide['sequence'])
view1 = py3Dmol.view(width=400, height=300)
view1.addModel(linear_pdb, 'pdb')
view1.setStyle({'cartoon': {'color': 'spectrum'}})
# Select the termini by residue index ('resi'), not residue name ('resn'):
# 'resn' expects three-letter names and would match every residue of that type.
# Assumes the generated PDB numbers residues 1..n_res.
view1.addStyle({'resi': 1},
               {'sphere': {'color': 'red', 'radius': 0.5}})  # N-terminus
view1.addStyle({'resi': n_res},
               {'sphere': {'color': 'blue', 'radius': 0.5}})  # C-terminus
view1.zoomTo()
view1.show()

print("\nCyclic Peptide (head-to-tail):")
view2 = py3Dmol.view(width=400, height=300)
view2.addModel(cyclic_pdb, 'pdb')
view2.setStyle({'cartoon': {'color': 'spectrum'}})
view2.zoomTo()
view2.show()

print("\n💡 Notice: The cyclic form has no free termini (red/blue spheres in linear).")
print("   This makes it resistant to exopeptidases!")
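
The three cyclization methods listed in this step place different demands on the sequence. A quick composition check can flag which ones a candidate could support before any structure is built. The sketch below is a hypothetical helper (`cyclization_options` is not part of `synth_pdb`) and looks only at which residues are present, not at their spacing or geometry.

```python
def cyclization_options(sequence):
    """List cyclization strategies a sequence could support, by composition only."""
    options = ['head-to-tail']  # Backbone cyclization needs no specific residues
    if sequence.count('C') >= 2:
        options.append('disulfide')  # Needs at least two Cys residues
    if 'K' in sequence and any(aa in sequence for aa in 'DE'):
        options.append('lactam')  # Lys side chain paired with Asp or Glu
    return options

# Hypothetical sequences, for illustration only
for seq in ['ACLVKTCF', 'LKSTVDFW', 'LVSTNQFW']:
    print(seq, cyclization_options(seq))
```

A real design would also check that the bridging residues are far enough apart in sequence and close enough in space to close the ring without strain.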

## Step 7: Export Results

Save top candidates for experimental validation.

In [None]:
# Export summary
export_df = top_candidates[['id', 'sequence', 'length', 'mw', 'charge', 
                            'gravy', 'binding_score']].copy()

print("📄 Exportable Results\n")
print(export_df.to_string(index=False))

# Save to CSV (if running locally)
if not IN_COLAB:
    export_df.to_csv('lead_peptides.csv', index=False)
    print("\n✅ Saved to lead_peptides.csv")
else:
    print("\n💡 In Colab: Copy the table above for your records")

# Save top structure (the generated PDB content is a plain string)
pdb_filename = f"{top_peptide['id']}_cyclic.pdb"
with open(pdb_filename, 'w') as fh:
    fh.write(cyclic_pdb)

print(f"\n💾 Top candidate PDB (cyclic) saved to {pdb_filename}")
print(f"   ID: {top_peptide['id']}")
print(f"   Sequence: {top_peptide['sequence']}")
print(f"   Ready for molecular dynamics or docking studies!")

## 🎓 Key Takeaways

### Drug Discovery Pipeline Steps
1. **Library Design** - Diversity + drug-like bias
2. **Property Prediction** - Filter early (fail fast!)
3. **Binding Estimation** - Prioritize candidates
4. **Cyclization** - Improve stability
5. **Experimental Validation** - Synthesize and test top hits

### Real-World Considerations
- **Synthesis cost** - Longer peptides = more expensive
- **Solubility** - Must dissolve for assays
- **Cell permeability** - Some cyclic peptides (e.g., cyclosporin A) cross membranes, but most peptides do not
- **Off-target effects** - Test selectivity

### Next Steps in Real Projects
1. **Molecular Dynamics** - Simulate binding
2. **Docking** - Predict binding pose
3. **Synthesis** - Order top 5-10 candidates
4. **Biochemical assays** - IC50, EC50 measurements
5. **Cell-based assays** - Efficacy and toxicity
6. **Lead optimization** - Iterate based on SAR

---

## 🚀 Challenge Exercises

Try these extensions:
1. **Modify target profile** - Change binding preferences and re-rank
2. **Add disulfide bridges** - Include Cys pairs for additional cyclization
3. **Larger library** - Generate 500+ peptides and analyze trends
4. **Multi-objective optimization** - Balance binding, stability, and cost

**Remember**: Computational predictions guide experiments, but nothing beats real data! 🧪