[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/engelberger/tutorials-ai4pd-2025/blob/main/tutorial_alphafold2_i89_conformations_v3.ipynb)

# Tutorial: Understanding Conformational Selection in AlphaFold2 through Coevolution Analysis

**Duration:** 90 minutes  
**Instructor:** Felipe Engelberger  
**Date:** AI4PD Workshop 2025

---

## Learning Objectives

By the end of this tutorial, you will understand:

1. **Coevolution → Structure**: How AlphaFold2's Evoformer leverages evolutionary information to predict structure
2. **MSA → Conformation**: Why the presence or absence of an MSA can decide which conformation is predicted
3. **The i89 Case Study**: How removing coevolution signal at the calcium-binding site enables alternative conformation prediction
4. **Conformational Sampling**: Using MSA subsampling and dropout to explore conformational landscapes
5. **Recycling Dynamics**: How AlphaFold2 "changes its mind" about conformations during iterative refinement

## Scientific Background

AlphaFold2's Evoformer module processes Multiple Sequence Alignments (MSAs) to extract coevolution patterns: residues that mutate together over the course of evolution are often in contact in 3D space. For the i89 protein (Guo, Kortemme et al.), this coevolution signal strongly biases predictions toward the calcium-bound state. By manipulating the MSA input, we can therefore steer which conformation AlphaFold2 predicts.
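
As a toy illustration of what "mutating together" means (this is not AlphaFold2's internal computation), the sketch below scores pairs of alignment columns by mutual information. The eight-sequence alignment and the helper name `column_mutual_information` are invented for this example.

```python
import numpy as np
from collections import Counter

def column_mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

# Hypothetical alignment: columns 0 and 1 covary (A with K, D with R),
# column 2 is fully conserved, column 3 varies independently of column 0.
toy_msa = [
    "AKLE",
    "AKLQ",
    "DRLE",
    "DRLQ",
    "AKLE",
    "DRLQ",
    "AKLQ",
    "DRLE",
]

columns = list(zip(*toy_msa))
n_cols = len(columns)
mi_matrix = np.zeros((n_cols, n_cols))
for i in range(n_cols):
    for j in range(i + 1, n_cols):
        score = column_mutual_information(columns[i], columns[j])
        mi_matrix[i, j] = mi_matrix[j, i] = score

print(np.round(mi_matrix, 2))
```

In this toy case the covarying pair (columns 0 and 1) carries one bit of mutual information, while the conserved column and the independently varying column carry none with column 0. Real coevolution methods additionally correct for phylogeny and indirect couplings, which plain mutual information does not.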

## Tutorial Overview

1. **Setup and Introduction** - Prepare environment and introduce i89 protein
2. **Coevolution Analysis** - Understand the evolutionary signal in the MSA
3. **Structure Predictions** - Compare predictions with/without MSA
4. **Conformational Sampling** - Explore subsampling and dropout strategies
5. **Recycling Analysis** - Track conformational changes through iterations
6. **Results Summary** - Synthesize findings and implications


## Section 1: Environment Setup

First, we set up the environment with the AF2 Utils module (`af2_utils.py`), which provides a simple wrapper around ColabDesign.


In [None]:
#@title Install Dependencies and Import Libraries
import os
import sys
import warnings
warnings.filterwarnings('ignore')

# Check if running in Colab
IN_COLAB = 'google.colab' in sys.modules

# Download required helper modules if not present
BASE_URL = "https://raw.githubusercontent.com/engelberger/tutorials-ai4pd-2025/main"
for fname in ("af2_utils.py", "logmd_utils.py"):
    if not os.path.exists(fname):
        # os.system returns a non-zero exit status when wget fails
        if os.system(f"wget -q {BASE_URL}/{fname}") != 0:
            raise RuntimeError(f"Could not download {fname}")

# Import packages
import af2_utils as af2
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import json

# Setup environment
af2.setup_environment(verbose=False)

# Check and install dependencies if needed
status = af2.check_installation(verbose=False)
missing = [k for k, v in status.items() if not v and k != 'environment_setup']
if missing:
    af2.install_dependencies(
        install_colabdesign='colabdesign' in missing,
        install_hhsuite='hhsuite' in missing,
        download_params='alphafold_params' in missing,
        verbose=False
    )

print(f"✓ AF2 Utils v{af2.__version__} ready")
print(f"✓ LogMD: {'available' if af2.check_logmd() else 'not available (install with: pip install logmd)'}")


## Section 2: The i89 Protein System

The i89 protein is a 96-residue designed protein that exhibits two distinct conformational states:
- **State 1**: Calcium-bound conformation (typically predicted with MSA)
- **State 2**: Alternative conformation (accessible without MSA)

This conformational switching makes i89 ideal for understanding how AlphaFold2 uses evolutionary information.


In [None]:
#@title Define i89 Sequence and Load Reference Structures

# i89 protein sequence (96 residues)
I89_SEQUENCE = "GSHMASMEDLQAEARAFLSEEMIAEFKAAFDMFDADGGGDISYKAVGTVFRMLGINPSKEVLDYLKEKIDVDGSGTIDFEEFLVLMVYIMKQDA"

# Download reference structures if needed
if not os.path.exists("state1.pdb") or not os.path.exists("state2.pdb"):
    os.system("wget -q https://raw.githubusercontent.com/engelberger/tutorials-ai4pd-2025/main/state1.pdb")
    os.system("wget -q https://raw.githubusercontent.com/engelberger/tutorials-ai4pd-2025/main/state2.pdb")

# Load reference coordinates
state1_coords = af2.load_pdb_coords("state1.pdb")
state2_coords = af2.load_pdb_coords("state2.pdb")
ref_rmsd = af2.calculate_rmsd(state1_coords, state2_coords)

print(f"i89 protein: {len(I89_SEQUENCE)} residues")
print(f"RMSD between State 1 and State 2: {ref_rmsd:.2f} Å")
print(f"Calcium-binding loop: residues 85-95")
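
An RMSD value only separates the two states meaningfully if the structures are superimposed first. How `af2.calculate_rmsd` handles superposition is not shown in this section, so the following is an independent, minimal sketch of RMSD after optimal (Kabsch) superposition. The name `kabsch_rmsd` and the synthetic coordinates are assumptions made for this example.

```python
import numpy as np

def kabsch_rmsd(coords_a, coords_b):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("Both coordinate sets must have the same shape")

    # Remove translation by centring both sets on their centroids
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)

    # Optimal rotation from the SVD of the 3x3 covariance matrix
    u, _, vt = np.linalg.svd(a.T @ b)
    # Flip the last axis if needed so the result is a proper rotation
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    a_rot = a @ rotation.T
    return float(np.sqrt(np.mean(np.sum((a_rot - b) ** 2, axis=1))))

# Synthetic check: a rotated and translated copy should give an RMSD near zero
rng = np.random.default_rng(0)
ref = rng.normal(size=(30, 3))
theta = 0.7
rot_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
moved = ref @ rot_z.T + np.array([5.0, -2.0, 1.0])
print(f"RMSD after superposition: {kabsch_rmsd(ref, moved):.6f} Å")
```

If the reference PDB files and the predictions contain different numbers of residues, the coordinate arrays have to be trimmed to a common residue range before a comparison like this is possible.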


## Section 3: MSA Generation and Coevolution Analysis

### Understanding AlphaFold2's Inputs

AlphaFold2 takes two primary inputs for structure prediction:
1. **MSA (Multiple Sequence Alignment)**: Evolutionary information from homologous sequences
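2. **Deletion Matrix**: For each sequence and aligned position, the number of residues that were removed from the alignment because they are insertions relative to the query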

The Evoformer module processes these inputs to extract coevolution patterns, which guide structure prediction.
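
To make the deletion matrix concrete: in the A3M alignment format, lowercase letters mark residues inserted relative to the query, and they are dropped from the aligned columns. The sketch below parses a tiny hand-written A3M string into aligned sequences plus a deletion matrix. `parse_a3m_text` is a hypothetical helper written for this example, and it keeps sequences as strings rather than the integer encoding returned by `af2.get_msa`.

```python
import numpy as np

def parse_a3m_text(a3m_text):
    """Return (aligned sequences, deletion matrix) from A3M-formatted text."""
    # Collect raw sequences, joining lines that belong to the same entry
    sequences, current = [], []
    for line in a3m_text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))

    aligned, deletions = [], []
    for seq in sequences:
        row, counts, pending = [], [], 0
        for ch in seq:
            if ch.islower():
                # Insertion relative to the query: count it, do not align it
                pending += 1
            else:
                row.append(ch)
                counts.append(pending)
                pending = 0
        aligned.append("".join(row))
        deletions.append(counts)
    return aligned, np.array(deletions, dtype=int)

# Hypothetical three-sequence alignment
toy_a3m = """
>query
MKVLA
>hit1
MKaaVLA
>hit2
M-VLgA
"""

toy_aligned, toy_deletions = parse_a3m_text(toy_a3m)
print(toy_aligned)
print(toy_deletions)
```

Here `hit1` has two inserted residues before its third aligned position, so its row of the deletion matrix holds a 2 there, and `hit2` has one inserted residue before its last aligned position.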


In [None]:
#@title Generate MSA for i89 (Run once and reuse throughout)

# Check if MSA already exists
if os.path.exists("i89_msa.npy") and os.path.exists("i89_del_matrix.npy"):
    print("Loading existing MSA...")
    msa_full = np.load("i89_msa.npy")
    deletion_matrix = np.load("i89_del_matrix.npy")
    print(f"✓ Loaded MSA with {len(msa_full)} sequences")
else:
    print("Generating MSA using MMseqs2 (this may take 2-3 minutes)...")
    msa_full, deletion_matrix = af2.get_msa(
        sequences=[I89_SEQUENCE],
        jobname="i89_msa",
        mode="unpaired",
        cov=50,
        id=90,
        max_msa=512,
        verbose=False
    )
    # Save for reuse
    np.save("i89_msa.npy", msa_full)
    np.save("i89_del_matrix.npy", deletion_matrix)
    print(f"✓ Generated MSA with {len(msa_full)} sequences")

print(f"MSA shape: {msa_full.shape} (sequences × positions)")
print(f"Deletion matrix shape: {deletion_matrix.shape}")


### Coevolution Analysis: The Key to Understanding Conformational Selection

Coevolution reveals which residues have evolved together to maintain protein function. The Evoformer learns similar patterns to predict which residues interact in 3D space.
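
One common family of coevolution estimators inverts a regularized covariance matrix of the one-hot encoded MSA and then applies the average product correction (APC). Whether `af2.get_coevolution` works this way is not stated here, so treat the following as a generic sketch under that assumption. The function name, the shrinkage weight, and the random toy alignment are all illustrative.

```python
import numpy as np

def inverse_covariance_coupling(msa, n_states=21, shrink=4.5):
    """Coupling matrix from an integer-encoded MSA of shape (n_seq, n_pos)."""
    msa = np.asarray(msa)
    n_seq, n_pos = msa.shape

    # One-hot encode and flatten to (n_seq, n_pos * n_states)
    one_hot = np.eye(n_states)[msa]
    x = one_hot.reshape(n_seq, n_pos * n_states)
    x = x - x.mean(axis=0)

    # Covariance with diagonal shrinkage so that it can be inverted
    cov = x.T @ x / n_seq
    cov += np.eye(cov.shape[0]) * shrink / np.sqrt(n_seq)
    inv = np.linalg.inv(cov)

    # Collapse each (state x state) block to a single score per position pair
    blocks = inv.reshape(n_pos, n_states, n_pos, n_states)
    raw = np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    np.fill_diagonal(raw, 0.0)

    # Average product correction removes per-position background signal
    apc = raw.sum(axis=0, keepdims=True) * raw.sum(axis=1, keepdims=True) / raw.sum()
    corrected = raw - apc
    np.fill_diagonal(corrected, 0.0)
    return corrected

# Toy alignment: 200 sequences, 6 positions, 4 states.
# Column 4 is made fully dependent on column 1; all other columns are random.
rng = np.random.default_rng(0)
toy_msa = rng.integers(0, 4, size=(200, 6))
toy_msa[:, 4] = (toy_msa[:, 1] + 1) % 4

coupling = inverse_covariance_coupling(toy_msa, n_states=4)
top_pair = np.unravel_index(np.argmax(coupling), coupling.shape)
print("Strongest coupling between positions:", sorted(int(idx) for idx in top_pair))
```

The engineered pair (positions 1 and 4) should stand out clearly from the random background. On real alignments the signal is weaker and depends strongly on the number and diversity of sequences.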


In [None]:
#@title Compute and Visualize Coevolution Matrix

# Compute coevolution using Direct Coupling Analysis
print("Computing coevolution matrix...")
coev_matrix = af2.get_coevolution(msa_full)

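# Focus on the calcium-binding loop (residues 85-95, 1-indexed).
# The half-open slice 84:95 selects 0-indexed positions 84-94.
ca_start, ca_end = 84, 95
ca_region_coev = coev_matrix[ca_start:ca_end, ca_start:ca_end]

# Calculate statistics
# Overall mean: only pairs at least 6 residues apart (k=6)
upper_tri = np.triu_indices_from(coev_matrix, k=6)
overall_mean = np.mean(coev_matrix[upper_tri])
# Loop mean: all pairs within the loop (k=1), including near neighbours,
# so the two means are not computed over matched sequence separations
ca_mean = np.mean(ca_region_coev[np.triu_indices_from(ca_region_coev, k=1)])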

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Full coevolution matrix
im1 = axes[0].imshow(coev_matrix, cmap='viridis', aspect='auto')
axes[0].set_title('Full Coevolution Matrix')
axes[0].set_xlabel('Residue Position')
axes[0].set_ylabel('Residue Position')
plt.colorbar(im1, ax=axes[0], label='Coevolution Score')

# Zoom on calcium-binding region
im2 = axes[1].imshow(ca_region_coev, cmap='hot', aspect='auto')
axes[1].set_title('Ca-binding Loop (85-95)')
axes[1].set_xlabel('Position in Loop')
axes[1].set_ylabel('Position in Loop')
plt.colorbar(im2, ax=axes[1], label='Coevolution Score')

# Coevolution strength along sequence
coev_strength = np.mean(coev_matrix, axis=0)
axes[2].plot(range(1, len(coev_strength)+1), coev_strength, 'b-', linewidth=2)
axes[2].axvspan(85, 95, alpha=0.3, color='red', label='Ca-binding loop')
axes[2].set_xlabel('Residue Position')
axes[2].set_ylabel('Mean Coevolution')
axes[2].set_title('Coevolution Signal Along Sequence')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nCoevolution Statistics:")
print(f"Overall mean coevolution: {overall_mean:.4f}")
print(f"Ca-binding loop mean: {ca_mean:.4f}")
print(f"Signal enrichment in Ca-loop: {ca_mean/overall_mean:.2f}x")
print("\nHypothesis: Strong coevolution in Ca-binding loop → MSA biases toward State 1")
print("            Without MSA → No bias → Alternative State 2 accessible")
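
The enrichment ratio printed above compares loop pairs at any sequence separation with background pairs that are at least six residues apart. Because neighbouring residues tend to score higher, a more conservative variant restricts the background to the same range of separations as the region of interest. The sketch below shows one way to do that; `matched_enrichment` is a hypothetical helper and the toy matrix is synthetic.

```python
import numpy as np

def matched_enrichment(coev, start, end, min_sep=1):
    """Mean score inside [start, end) divided by the mean score of
    background pairs drawn from the same range of sequence separations."""
    coev = np.asarray(coev, dtype=float)
    i, j = np.triu_indices(coev.shape[0], k=min_sep)
    sep = j - i

    # Because i < j, both residues lie in the region when i >= start and j < end
    in_region = (i >= start) & (j < end)
    if not in_region.any():
        raise ValueError("Region contains no residue pairs")

    # Background: every other pair whose separation also occurs in the region
    background = (~in_region) & (sep <= sep[in_region].max())

    scores = coev[i, j]
    return scores[in_region].mean() / scores[background].mean()

# Synthetic matrix: uniform background of 1.0, region 5-9 (0-indexed) set to 3.0
toy_coev = np.ones((20, 20))
toy_coev[5:10, 5:10] = 3.0
print(f"Matched enrichment: {matched_enrichment(toy_coev, 5, 10):.2f}x")
```

With the tutorial's variables, the corresponding call would be `matched_enrichment(coev_matrix, ca_start, ca_end)`.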


## Section 4: Structure Predictions - Testing Our Hypothesis

Now we test this hypothesis by comparing AlphaFold2 predictions made with and without the MSA. The helper below is called with `save_pdbs=True`, so each run writes a PDB file for every recycle iteration.
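
How the "without MSA" input is built is not shown in this section. One straightforward reading is single-sequence mode, where only the query row of the alignment is kept. A gentler intervention is to blank out the homologs in selected columns only, which removes the coevolution signal for one region while keeping it elsewhere. The sketch below illustrates both on a toy array. The helper names are hypothetical, and `gap_id=21` assumes AlphaFold2's 22-symbol MSA alphabet with the gap as the last token, which may not match the encoding used by `af2_utils`.

```python
import numpy as np

def single_sequence_inputs(msa, deletion_matrix):
    """Keep only the first row, which by convention is the query sequence."""
    return np.array(msa)[:1].copy(), np.array(deletion_matrix)[:1].copy()

def mask_msa_columns(msa, deletion_matrix, start, end, gap_id=21):
    """Replace homolog residues in columns [start, end) with gaps.

    The query row (row 0) is left untouched so the target sequence is unchanged.
    gap_id=21 is an assumption about the integer encoding (see text above).
    """
    msa_masked = np.array(msa, copy=True)
    del_masked = np.array(deletion_matrix, copy=True)
    msa_masked[1:, start:end] = gap_id
    del_masked[1:, start:end] = 0
    return msa_masked, del_masked

# Toy inputs: 4 sequences, 10 positions, residue tokens 0-19
rng = np.random.default_rng(0)
toy_msa = rng.integers(0, 20, size=(4, 10))
toy_del = rng.integers(0, 3, size=(4, 10))

msa_single, del_single = single_sequence_inputs(toy_msa, toy_del)
msa_masked, del_masked = mask_msa_columns(toy_msa, toy_del, start=6, end=10)

print("Single-sequence MSA shape:", msa_single.shape)
print("Masked MSA:\n", msa_masked)
```

With the tutorial's arrays, the analogous calls would take `msa_full` and `deletion_matrix` as inputs, with `ca_start` and `ca_end` as the column range, provided the encoding assumption holds.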


In [None]:
#@title Helper Functions for Consistent Predictions and Analysis

def run_prediction_with_analysis(sequence, msa, deletion_matrix, job_name, num_seeds=3, num_recycles=3):
    """Run predictions with multiple seeds and save all PDBs."""
    
    job_folder = af2.create_job_folder(sequence, job_name)
    all_predictions = []
    
    for seed in range(num_seeds):
        print(f"  Seed {seed}...", end=" ")
        
        # Setup model
        model = af2.setup_model(sequence, verbose=False)
        
        # Run prediction with recycling
        result = af2.predict_with_recycling(
            model,
            msa=msa,
            deletion_matrix=deletion_matrix,
            max_recycles=num_recycles,
            seed=seed,
            save_pdbs=True,
            job_folder=job_folder,
            sequence=sequence,
            model_name=f"{job_name}_seed{seed}",
            verbose=False
        )
        
        # Store result with metadata
        result['seed'] = seed
        result['job_folder'] = job_folder
        result['job_name'] = job_name
        all_predictions.append(result)
        
        # pLDDT is a 0-100 confidence score, not a percentage
        print(f"pLDDT={result['metrics']['plddt']*100:.1f}")
    
    return all_predictions, job_folder

def analyze_conformational_landscape(predictions, state1_coords, state2_coords):
    """Analyze which conformations were sampled."""
    
    rmsd_data = []
    for pred in predictions:
        # Atom index 1 is Cα in AlphaFold's per-residue atom order (N, CA, C, ...)
        ca_coords = pred['structure'][:, 1, :]
        rmsd1 = af2.calculate_rmsd(ca_coords, state1_coords)
        rmsd2 = af2.calculate_rmsd(ca_coords, state2_coords)
        rmsd_data.append({
            'seed': pred['seed'],
            'rmsd_state1': rmsd1,
            'rmsd_state2': rmsd2,
            'plddt': pred['metrics']['plddt'] * 100,
            'closer_to': 'State 1' if rmsd1 < rmsd2 else 'State 2'
        })
    
    return rmsd_data

def plot_recycling_trajectory(predictions, state1_coords, state2_coords, title):
    """Plot how conformations change during recycling."""
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    for pred in predictions:
        seed = pred['seed']
        trajectory = pred['trajectory']
        
        # Extract RMSD values for each recycle
        recycles = []
        rmsd1_vals = []
        rmsd2_vals = []
        
        for step in trajectory:
            ca_coords = step['structure'][:, 1, :]
            recycles.append(step['recycle'])
            rmsd1_vals.append(af2.calculate_rmsd(ca_coords, state1_coords))
            rmsd2_vals.append(af2.calculate_rmsd(ca_coords, state2_coords))
        
        # Plot trajectories
        axes[0].plot(recycles, rmsd1_vals, 'o-', label=f'Seed {seed}', alpha=0.7)
        axes[1].plot(recycles, rmsd2_vals, 's-', label=f'Seed {seed}', alpha=0.7)
    
    axes[0].set_xlabel('Recycle')
    axes[0].set_ylabel('RMSD to State 1 (Å)')
    axes[0].set_title(f'{title} - Distance to State 1')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    axes[1].set_xlabel('Recycle')
    axes[1].set_ylabel('RMSD to State 2 (Å)')
    axes[1].set_title(f'{title} - Distance to State 2')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
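
The list returned by `analyze_conformational_landscape` and the per-recycle RMSD values collected in `plot_recycling_trajectory` can be condensed into two simple numbers: the fraction of seeds that end up closer to each state, and the recycle at which a trajectory first changes its closer state. The sketch below is one possible summary. Both helper names are hypothetical, and the example values are invented purely to exercise the code.

```python
from collections import Counter

def summarize_landscape(rmsd_data):
    """Fraction of predictions closer to each reference state."""
    if not rmsd_data:
        raise ValueError("rmsd_data is empty")
    counts = Counter(row["closer_to"] for row in rmsd_data)
    total = len(rmsd_data)
    return {state: counts.get(state, 0) / total for state in ("State 1", "State 2")}

def first_switch_recycle(rmsd1_vals, rmsd2_vals):
    """Index of the first recycle whose closer state differs from recycle 0.

    Returns None if the trajectory never switches.
    """
    closer_to_state1 = [r1 < r2 for r1, r2 in zip(rmsd1_vals, rmsd2_vals)]
    for idx, label in enumerate(closer_to_state1):
        if label != closer_to_state1[0]:
            return idx
    return None

# Invented example values in the same format as the tutorial's rmsd_data rows
example_rmsd_data = [
    {"seed": 0, "rmsd_state1": 1.2, "rmsd_state2": 6.8, "closer_to": "State 1"},
    {"seed": 1, "rmsd_state1": 7.1, "rmsd_state2": 2.0, "closer_to": "State 2"},
    {"seed": 2, "rmsd_state1": 1.5, "rmsd_state2": 6.5, "closer_to": "State 1"},
    {"seed": 3, "rmsd_state1": 1.1, "rmsd_state2": 7.0, "closer_to": "State 1"},
]
fractions = summarize_landscape(example_rmsd_data)
print(fractions)

# Invented trajectory: starts closer to State 2, switches to State 1 at recycle 2
switch_at = first_switch_recycle([6.0, 5.0, 2.0, 1.5], [2.5, 3.0, 5.5, 6.0])
print("First switch at recycle:", switch_at)
```

A switch index of `None` means the prediction stayed on the same side throughout recycling, which is itself informative when comparing runs with and without the MSA.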
