# Algorithm 1: MSA Block Deletion

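MSA Block Deletion is a data augmentation technique applied during training. It deletes randomly placed contiguous blocks of sequences (rows) from the MSA. Because MSA rows are typically ordered by similarity to the query, neighbouring rows tend to be related, so removing a block drops a whole group of similar sequences rather than a uniform random sample. This exposes the model to more varied alignments and improves robustness.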

## Algorithm Pseudocode

![MSA Block Deletion](../imgs/algorithms/MSABlockDeletion.png)
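
The pseudocode can be sketched in NumPy as follows. This is a minimal illustration, not the reference implementation: the function name `block_delete_rows` and the parameters `block_fraction`, `num_blocks`, and `rng` are hypothetical, the defaults (five blocks, each spanning 30% of the sequences) are assumptions that should be checked against the pseudocode, and always keeping the query row is likewise an assumption.

```python
import numpy as np

def block_delete_rows(msa, block_fraction=0.3, num_blocks=5, rng=None):
    """Delete contiguous blocks of rows; row 0 (the query) is always kept."""
    rng = np.random.default_rng() if rng is None else rng
    n_seq = msa.shape[0]
    block_size = int(np.floor(block_fraction * n_seq))

    # Draw one start index per block
    starts = rng.integers(0, n_seq, size=num_blocks)

    # Mark rows start .. start + block_size - 1 for deletion.
    # Slices running past the last row are truncated automatically.
    delete_mask = np.zeros(n_seq, dtype=bool)
    for start in starts:
        delete_mask[start:start + block_size] = True

    delete_mask[0] = False  # assumption: the query sequence is never deleted
    return msa[~delete_mask]

rng = np.random.default_rng(0)
msa = rng.integers(0, 21, size=(128, 64))
reduced = block_delete_rows(msa, rng=rng)
print(msa.shape, "->", reduced.shape)
```

Blocks may overlap, so the number of deleted rows varies from draw to draw even though the block size is fixed.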

## Source Code Location
- **File**: `AF2-source-code/model/tf/data_transforms.py`
- **Function**: `block_delete_msa` (its docstring cites Suppl. Alg. 1 "MSABlockDeletion"); `sample_msa` is a related but separate subsampling step
- **Line**: 214

## Purpose

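1. **Data Augmentation**: Each training example sees a different subset of the MSA, which helps prevent overfitting to one fixed alignment
2. **Robustness**: The model learns to work with shallower and less diverse MSAs
3. **Efficiency**: Fewer MSA rows mean less memory and computation per training step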
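
To make the "different MSA depths" point concrete, the sketch below simulates how many sequences survive block deletion over repeated draws. It is illustrative only: `simulate_kept_depth` is a hypothetical helper, and the defaults of five blocks at 30% of the sequences each are assumptions.

```python
import numpy as np

def simulate_kept_depth(n_seq, block_fraction=0.3, num_blocks=5, trials=1000, seed=0):
    """Return the number of surviving rows for each simulated draw."""
    rng = np.random.default_rng(seed)
    block_size = int(block_fraction * n_seq)
    depths = np.empty(trials, dtype=int)
    for t in range(trials):
        deleted = np.zeros(n_seq, dtype=bool)
        for start in rng.integers(0, n_seq, size=num_blocks):
            deleted[start:start + block_size] = True
        deleted[0] = False  # assumption: the query row always survives
        depths[t] = n_seq - deleted.sum()
    return depths

depths = simulate_kept_depth(128)
print(f"kept depth: min={depths.min()}, mean={depths.mean():.1f}, max={depths.max()}")
```

Because blocks overlap by varying amounts, the surviving depth changes from draw to draw, which is the variation the model is trained to tolerate.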

In [None]:
import numpy as np

np.random.seed(42)

In [None]:
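def msa_block_deletion(msa, deletion_prob=0.3, min_keep=1):
    """
    Simplified stand-in for MSA Block Deletion (Algorithm 1).
    
    Note: this version drops each non-query sequence independently with
    probability `deletion_prob`. It does not delete contiguous blocks,
    so it only approximates the effect of Algorithm 1 on MSA depth.
    
    Args:
        msa: MSA array [N_seq, N_res]; row 0 is the query sequence
        deletion_prob: Probability of dropping each non-query sequence
        min_keep: Minimum number of sequences to keep (including query)
    
    Returns:
        Subsampled MSA [N_kept, N_res], with the query still in row 0
    """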
    N_seq, N_res = msa.shape
    
    print(f"Original MSA: {N_seq} sequences")
    
    # Always keep the first sequence (query)
    keep_mask = np.zeros(N_seq, dtype=bool)
    keep_mask[0] = True
    
    # Randomly decide which sequences to keep
    random_keep = np.random.random(N_seq) > deletion_prob
    keep_mask = keep_mask | random_keep
    
    # Ensure minimum sequences
    if keep_mask.sum() < min_keep:
        # Force keep some sequences
        indices = np.where(~keep_mask)[0]
        np.random.shuffle(indices)
        for idx in indices[:min_keep - keep_mask.sum()]:
            keep_mask[idx] = True
    
    result = msa[keep_mask]
    print(f"After deletion: {len(result)} sequences")
    
    return result

In [None]:
# Test
N_seq, N_res = 128, 64
msa = np.random.randint(0, 21, size=(N_seq, N_res))

print("Test MSA Block Deletion")
print("="*40)

for prob in [0.1, 0.3, 0.5, 0.7]:
    result = msa_block_deletion(msa.copy(), deletion_prob=prob)
    print()
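
The demo above removes sequences independently, whereas Algorithm 1 removes contiguous blocks. The sketch below shows one way to see the difference by counting contiguous runs of deleted rows. `count_deleted_runs` is a hypothetical helper, and the block parameters (five blocks, 30% of the sequences each) are assumptions.

```python
import numpy as np

def count_deleted_runs(keep_mask):
    """Number of maximal contiguous runs of deleted (False) entries."""
    deleted = ~np.asarray(keep_mask, dtype=bool)
    # A run starts where a deleted entry follows a kept entry (or sits at index 0)
    previous = np.concatenate(([False], deleted[:-1]))
    return int((deleted & ~previous).sum())

rng = np.random.default_rng(0)
n_seq = 128

# Independent per-sequence deletion, as in msa_block_deletion above
independent_keep = rng.random(n_seq) > 0.3
independent_keep[0] = True

# Block deletion: 5 blocks of floor(0.3 * n_seq) rows each
block_keep = np.ones(n_seq, dtype=bool)
for start in rng.integers(0, n_seq, size=5):
    block_keep[start:start + int(0.3 * n_seq)] = False
block_keep[0] = True

print("independent:", count_deleted_runs(independent_keep), "runs")
print("block:      ", count_deleted_runs(block_keep), "runs")
```

Block deletion can produce at most as many deleted runs as there are blocks, while independent deletion scatters many short gaps across the alignment.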

## Source Code Reference

```python
# From AF2-source-code/model/tf/data_transforms.py

def block_delete_msa(protein, config):
  """Sample MSA by deleting contiguous blocks.
  
  Jumper et al. (2021) Suppl. Alg. 1 "MSABlockDeletion"
  """
  # Draw random block start indices and delete each block of rows
  # Always keep the query (first sequence)
  ...
```