# Antibody Developability Triaging Pipeline

Implementation based on the Sweet-Jones & Martin paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC11901365/

Using AntiBERTy embeddings and kernel PCA for developability assessment

This notebook implements the methods in 9 steps:
1. Installation & Imports
2. Functions for alignment of antibody sequences with ANARCI
3. Loading and preprocessing of the data
4. Embedding of the combined VH/VL regions with AntiBERTy
5. Unsupervised learning of the antibody latent space with Kernel PCA
6. Visualization of the Kernel PCA principal components
7. Definition of a Z-score-derived ellipse function for antibody selection
8. Selection of developable antibodies
9. Full pipeline
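
To make step 7 concrete before the real implementation, here is a minimal, dependency-free sketch of a Z-score-derived ellipse. It assumes an axis-aligned ellipse centred on the mean of a reference set, with each semi-axis equal to a Z-score multiple of the standard deviation along that component. The function names, the threshold of 2.0, and the toy coordinates are illustrative assumptions, not the manuscript's definition.

```python
from statistics import mean, pstdev

def fit_ellipse(points, z=2.0):
    """Return (centre_x, centre_y, semi_axis_x, semi_axis_y) from reference points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Semi-axes are z standard deviations along each component (assumed definition)
    return mean(xs), mean(ys), z * pstdev(xs), z * pstdev(ys)

def inside_ellipse(point, ellipse):
    """True if the point lies within or on the ellipse boundary."""
    cx, cy, ax, ay = ellipse
    return ((point[0] - cx) / ax) ** 2 + ((point[1] - cy) / ay) ** 2 <= 1.0

# Toy 2D coordinates standing in for the first two Kernel PCA components
reference = [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
ellipse = fit_ellipse(reference, z=2.0)

candidates = {"ab_near": (0.5, 0.5), "ab_far": (3.0, 3.0)}
selected = [name for name, xy in candidates.items() if inside_ellipse(xy, ellipse)]
print(selected)
```

Candidates falling inside the ellipse would be the ones kept for triage; a rotated ellipse based on the covariance matrix is a natural alternative if the two components are correlated.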

### Vibe Coding
Code for this pipeline was initially developed using Claude Sonnet 4 and a prompt that provided a link to the manuscript with a suggested workflow for the notebook (similar to the steps above). Several fixes had to be made along the way, including addition of antibody alignment, different concatenation of the embeddings, and batching as the embeddings are generated to control for GPU memory limitations.

Importantly, **when vibe coding, you need to check that the science is coded correctly**! Even though the pipeline runs, this notebook does not replicate the results of the manuscript with high fidelity. In particular, the Kernel PCA plot here is fundamentally different from the one in the manuscript. Significant effort has gone into identifying the source of the discrepancy, and there are currently two hypotheses:

1. **Numbering.** The manuscript uses the AbNum server to generate Chothia numbering, while this notebook uses ANARCI. The Chothia numbering produced by the [AbYsis server](http://www.abysis.org/) is different, which alters the relative embeddings of the antibodies. Alternative numbering schemes, including IMGT and Martin, have been implemented in an attempt to replicate the results, without success.
2. **Tokenization.** The manuscript documents the tokens used for gaps in the antibody alignment; these correspond to masking tokens rather than gap/padding tokens. Tokenization differences affect the calculated embeddings. It is unclear whether the interface to the pLMs has changed or whether the authors' tokenization strategy was chosen for a specific reason.
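
The tokenization point can be illustrated with a small sketch that converts a gapped, aligned sequence into a space-separated token string, once with a padding token and once with a masking token. `[PAD]` and `[MASK]` are the usual BERT-style special tokens and are assumed here; check the vocabulary of the tokenizer you actually use. The helper name `aligned_to_tokens` and the toy sequence are hypothetical.

```python
def aligned_to_tokens(aligned_seq, gap_token="[PAD]"):
    """Turn a gapped, aligned sequence into a space-separated token string."""
    return " ".join(gap_token if residue == "-" else residue for residue in aligned_seq)

aligned = "QV-LV--S"  # toy aligned fragment with three gap columns

# The same alignment, encoded with two different (assumed) special tokens
as_padding = aligned_to_tokens(aligned, gap_token="[PAD]")
as_masking = aligned_to_tokens(aligned, gap_token="[MASK]")

print(as_padding)
print(as_masking)
```

A language model typically treats these two inputs differently, so embeddings computed from them would not be expected to match.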

## Which csv files are used in this notebook?

**Supplementary data**
Supplementary data was processed and cleaned in an effort to replicate the work in the manuscript. (See code in the section, "Cleaning and segmenting the Supplementary Data").

**Updated database**
A second set of files was created from an up-to-date version of Thera-SAbDab. Subsets of the antibodies were initially created to work around GPU memory issues; batching fixed the problem, so the full sets are also available. Use a smaller set of antibodies for a faster notebook, since alignment and embedding can take some time on a large number of antibodies.
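
The batching idea mentioned above can be sketched as a simple generator that hands the embedding step a fixed number of sequences at a time. This is an illustrative sketch only: `iter_batches` is a hypothetical helper, the sequences are placeholders, and the embedding call itself is left out.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sequences = [f"SEQ{i}" for i in range(7)]  # placeholders for antibody sequences

batches = list(iter_batches(sequences, batch_size=3))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

In a real run, each batch would be embedded, moved off the GPU, and collected before the next batch is processed, which keeps peak GPU memory bounded by the batch size.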

##### Cleaning and segmenting the Supplementary Data

In [1]:
####################################
# It is not necessary to run the code in this section, as the files are made available.
# Code is provided for transparency in the event you might learn from it.
####################################

# import pandas as pd
# approved_df = pd.read_excel("AntibodyTriagingWithEmbeddings/kmab_a_2472009_sm7938.xlsx", sheet_name="1",skiprows=0)
# all_df = pd.read_csv("AntibodyTriagingWithEmbeddings/Sweet-Jones_clean.csv")

# therap_df = all_df.loc[all_df['Barcode/Name'].str.contains('umab')]
# therap_df.rename(columns={'Barcode/Name':'Therapeutic','VH':'HeavySequence','VL':'LightSequence'}, inplace=True)
# therap_df['HeavySequence'] = therap_df['HeavySequence'].str.replace(' ','')
# therap_df['LightSequence'] = therap_df['LightSequence'].str.replace(' ','')
# therap_df.loc[therap_df['Therapeutic'].isin(approved_df.loc[approved_df['Status']=='Approved','Name'].values),'Highest_Clin_Trial']='Approved'
# therap_df.loc[therap_df['Therapeutic'].isin(approved_df.loc[approved_df['Status']=='Discontinued','Name'].values),'Highest_Clin_Trial']='Discontinued'
# therap_df.loc[therap_df['Therapeutic'].isin(approved_df.loc[approved_df['Status']=='In Trials','Name'].values),'Highest_Clin_Trial']='In Trials'
# therap_df.loc[therap_df['Therapeutic'].isin(approved_df.loc[approved_df['Status']=='Studied','Name'].values),'Highest_Clin_Trial']='Studied'
# therap_df[['Therapeutic','HeavySequence','LightSequence','Highest_Clin_Trial']].to_csv('Sweet-Jones_Martin_therapeutic_mabs.csv',index=False)

# library_df = all_df.loc[~(all_df['Barcode/Name'].str.contains('umab'))]
# library_df.rename(columns={'Barcode/Name':'Therapeutic','VH':'HeavySequence','VL':'LightSequence'}, inplace=True)
# library_df['HeavySequence'] = library_df['HeavySequence'].str.replace(' ','')
# library_df['LightSequence'] = library_df['LightSequence'].str.replace(' ','')
# library_df['Highest_Clin_Trial'] = "Hit_Antibody"
# library_df[['Therapeutic','HeavySequence','LightSequence','Highest_Clin_Trial']].to_csv('Sweet-Jones_Martin_library_mabs.csv',index=False)

# 1. INSTALLATION & IMPORTS

In [1]:
if 'google.colab' in str(get_ipython()):
    print("Installing dependencies...")
    !pip install -q condacolab
    import condacolab
    condacolab.install()

Installing dependencies...
✨🍰✨ Everything looks OK!


In [2]:
import os
if 'google.colab' in str(get_ipython()):

  # setup anarci
  if not os.path.exists('ANARCI_READY'):
    !conda install -y anarci hmmer biopython -c bioconda --no-deps --solver=classic #2>&1 1>/dev/null
    !touch ANARCI_READY

Collecting package metadata (current_repodata.json):

In [3]:
pip install --no-dependencies antiberty

Collecting antiberty
  Downloading antiberty-0.1.3-py3-none-any.whl.metadata (4.7 kB)
Downloading antiberty-0.1.3-py3-none-any.whl (96.6 MB)
Installing collected packages: antiberty
Successfully installed antiberty-0.1.3


In [4]:
if 'google.colab' in str(get_ipython()):
    print("Running on Google Colab. Executing Colab-specific commands...")
    # Mount Google Drive to access files
    from google.colab import drive
    drive.mount('/content/drive')

    # Drive location for the fasta files
    data_loc = '/content/drive/MyDrive/AIDrivenDesignOfBiologics/AIDrivenDesignOfBiologics-PEGSEurope-2025/ProteinLanguageModels/AntibodyTriagingWithEmbeddings/'

else:
    print("Not running on Google Colab. Skipping Colab-specific commands.")
    print("Running in a local environment or Jupyter Notebook.")
    data_loc = '/home/davidnannemann/AIDD4B/ProteinLMs/'

Running on Google Colab. Executing Colab-specific commands...


MessageError: Error: credential propagation was unsuccessful

In [None]:
if 'google.colab' in str(get_ipython()):
    therapeutic_antibodies_csv = f'{data_loc}/Sweet-Jones_Martin_therapeutic_mabs.csv'
    random_paired_antibodies_csv = f'{data_loc}/Sweet-Jones_Martin_library_mabs.csv'
else:
    therapeutic_antibodies_csv = 'AntibodyTriagingWithEmbeddings/Sweet-Jones_Martin_therapeutic_mabs.csv'
    random_paired_antibodies_csv = 'AntibodyTriagingWithEmbeddings/Sweet-Jones_Martin_library_mabs.csv'

    #therapeutic_antibodies_csv = 'AntibodyTriagingWithEmbeddings/therapeutic_antibodies_approved.csv'
    #random_paired_antibodies_csv = 'AntibodyTriagingWithEmbeddings/paired_sequences_10k.csv'

In [None]:
if 'google.colab' not in str(get_ipython()):
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import pandas as pd
import numpy as np
import string
import gc
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import rbf_kernel
import torch
from transformers import AutoTokenizer, AutoModel
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.width', 250) # Display width in characters
pd.set_option('display.max_colwidth', 200) # Limit column content to 200 characters

# AntiBERTy for antibody embeddings
try:
    from antiberty import AntiBERTyRunner
    ANTIBERTY_AVAILABLE = True
    print("AntiBERTy imported successfully!")
except ImportError:
    ANTIBERTY_AVAILABLE = False
    print("WARNING: AntiBERTy not available. Please install with: pip install antiberty")

# ANARCI for antibody numbering and alignment
try:
    from anarci import anarci
    from anarci.anarci import run_anarci
    ANARCI_AVAILABLE = True
    print("ANARCI imported successfully!")
except ImportError:
    ANARCI_AVAILABLE = False
    print("WARNING: ANARCI not available. Please install with: pip install anarci")
    print("You may also need to install HMMER and configure ANARCI databases.")

# Set style for plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Imports completed successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"AntiBERTy available: {ANTIBERTY_AVAILABLE}")
print(f"ANARCI available: {ANARCI_AVAILABLE}")

In [None]:
# INSTALLATION CHECKS
def check_anarci_installation():
    """
    Check ANARCI installation and provide installation instructions.
    """

    print("="*60)
    print("ANARCI INSTALLATION CHECK")
    print("="*60)

    if ANARCI_AVAILABLE:
        print("✓ ANARCI is installed and available")
        try:
            # Test basic functionality
            test_seq = "QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYAMHWVRQAPGQGLEWMGWINAGNGNTKYSQKFQGRVTITRDTSASTAYMELSSLRSEDTAVYYCAR"
            result = anarci([('test', test_seq)], scheme='chothia', output=False)
            # result[0] is the per-sequence numbering list; its entry is None when numbering fails
            if result and result[0] and result[0][0]:
                print("✓ ANARCI test run successful")
                return True
            else:
                print("⚠ ANARCI installed but test run failed")
                return False
        except Exception as e:
            print(f"⚠ ANARCI installed but error during test: {e}")
            return False
    else:
        print("✗ ANARCI is not available")
        print("\nTo install ANARCI:")
        print("1. Install dependencies:")
        print("   conda install -c bioconda hmmer")
        print("   # OR")
        print("   sudo apt-get install hmmer  # Ubuntu/Debian")
        print("   # OR")
        print("   brew install hmmer  # MacOS")
        print("\n2. Install ANARCI:")
        print("   pip install anarci")
        print("\n3. Setup ANARCI database (run once):")
        print("   python -c 'import anarci; anarci.setup_database()'")
        print("\nAlternative installation via conda:")
        print("   conda install -c bioconda anarci")
        return False

def check_antiberty_installation():
    """
    Check AntiBERTy installation and provide installation instructions.
    """

    print("="*60)
    print("ANTIBERTY INSTALLATION CHECK")
    print("="*60)

    if ANTIBERTY_AVAILABLE:
        print("✓ AntiBERTy is installed and available")
        return True
    else:
        print("✗ AntiBERTy is not available")
        print("\nTo install AntiBERTy:")
        print("  pip install antiberty")
        return False

def check_all_dependencies():
    """
    Check all required dependencies and provide installation instructions.
    """

    print("="*70)
    print("DEPENDENCY CHECK FOR ANTIBODY DEVELOPABILITY PIPELINE")
    print("="*70)

    all_ready = True

    # Check AntiBERTy
    antiberty_ready = check_antiberty_installation()
    all_ready = all_ready and antiberty_ready

    print("\n" + "="*60)

    # Check ANARCI
    anarci_ready = check_anarci_installation()
    all_ready = all_ready and anarci_ready

    print("\n" + "="*60)
    print("OVERALL STATUS")
    print("="*60)

    if all_ready:
        print("✓ All dependencies are ready!")
        print("You can run the full pipeline with sequence alignment and AntiBERTy embeddings.")
    else:
        print("⚠ Some dependencies are missing.")
        if antiberty_ready and not anarci_ready:
            print("You can run the pipeline without sequence alignment.")
        elif anarci_ready and not antiberty_ready:
            print("You need AntiBERTy for embedding extraction.")
        else:
            print("Please install the missing dependencies before running the pipeline.")

    return all_ready, antiberty_ready, anarci_ready

all_ready, antiberty_ready, anarci_ready = check_all_dependencies()

print(f"AntiBERTy ready: {antiberty_ready}")
print(f"ANARCI ready: {anarci_ready}")
print("All ready: ", all_ready)

# 2. ANTIBODY SEQUENCE ALIGNMENT USING ANARCI

Human antibody variable regions have a consistent structure:
- One heavy chain domain and one light chain domain
- Three CDRs and four framework regions (FWRs) per domain

Downstream modeling can benefit when the embeddings have a consistent structure across samples. We achieve this by aligning the antibodies with ANARCI, an antibody-specific numbering and alignment tool.

Below is a script that uses ANARCI to number each antibody sequence and then creates a classical alignment based on that numbering. Position lists are defined for Chothia, Martin, and IMGT numbering (everyone knows IMGT is better ;-) ); the code as written currently defaults to Martin.
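
Before the full implementation, here is a toy sketch of the core idea: given per-sequence numbering in the `((position, insertion_code), residue)` form that ANARCI returns for a domain, collect every occupied position and emit one gapped string per antibody. The function name `build_alignment` and the toy numberings are hypothetical, and the plain tuple sort used here is a simplifying assumption.

```python
def build_alignment(numbered, gap="-"):
    """Align sequences given as {name: [((position, insertion_code), residue), ...]}."""
    # Union of occupied positions; tuples sort by number, then by insertion code
    columns = sorted({pos for residues in numbered.values()
                      for pos, aa in residues if aa != "-"})
    aligned = {}
    for name, residues in numbered.items():
        lookup = {pos: aa for pos, aa in residues if aa != "-"}
        # Emit the residue where the position is occupied, otherwise a gap
        aligned[name] = "".join(lookup.get(col, gap) for col in columns)
    return columns, aligned

toy = {
    "ab1": [((1, " "), "Q"), ((2, " "), "V"), ((3, " "), "Q")],
    "ab2": [((1, " "), "E"), ((2, " "), "V"), ((2, "A"), "K"), ((3, " "), "L")],
}

columns, aligned = build_alignment(toy)
for name, seq in aligned.items():
    print(name, seq)
```

Sorting tuples works for this toy case, but some schemes order insertion codes non-alphabetically (for example, the descending 112 insertions in IMGT), which is why the notebook orders columns with explicit scheme lists instead.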

In [None]:
heavy_chothia_scheme = """1     2     3     4     5     6     7     8     9
          10    11    12    13    14    15    16    17    18    19
          20    21    22    23    24    25    26    27    28    29
          30    31
          31A   31B
          32    33    34    35    36    37    38    39
          40    41    42    43    44    45    46    47    48    49
          50    51    52
          52A   52B   52C   53    54    55    56    57    58    59
          60    61    62    63    64    65    66    67    68    69
          70    71    72    73    74    75    76    77    78    79
          80    81    82
          82A   82B   82C   83    84    85    86    87    88    89
          90    91    92    93    94    95    96    97    98    99
          100
          100A  100B  100C  100D  100E  100F  100G  100H  100I  100J
          100K  101   102   103   104   105   106   107   108   109
          110   111   112   113"""
# 100L 100M 100N 100O 100P 100Q
light_chothia_scheme = """1     2     3     4     5     6     7     8     9
          10    11    12    13    14    15    16    17    18    19
          20    21    22    23    24    25    26    27    28    29
          30
          30A   30B   30C   30D   30E   30F
          31    32    33    34    35    36    37    38    39
          40    41    42    43    44    45    46    47    48    49
          50    51    52    53    54    55    56    57    58    59
          60    61    62    63    64    65    66    67    68    69
          70    71    72    73    74    75    76    77    78    79
          80    81    82    83    84    85    86    87    88    89
          90    91    92    93    94    95
          95A   95B   95C   95D   95E   95F   96    97    98    99
          100   101   102   103   104   105   106
          106A                                      107   108   109"""
# '52A', '52B', '52C', '52D'

heavy_martin_scheme = """1     2     3     4     5     6     7     8
          8A    8B    8C    8D                                   9
          10    11    12    13    14    15    16    17    18    19
          20    21    22    23    24    25    26    27    28    29
          30    31
          31A   31B   31C   31D   31E   31F   31G   31H   31I   31J   31K
          32    33    34    35    36    37    38    39
          40    41    42    43    44    45    46    47    48    49
          50    51    52
          52A   52B   52C   52D   52E   52F   52G   52H   52I   52J   52K   52L
          53    54    55    56    57    58    59
          60    61    62    63    64    65    66    67    68    69
          70    71    72
          72A   72B   72C   72D   72E   72F   72G   72H   72I
                            73    74    75    76    77    78    79
          80    81    82    83    84    85    86    87    88    89
          90    91    92    93    94    95    96    97    98    99
          100
          100A  100B  100C  100D  100E  100F  100G  100H  100I  100J
          100K  100L  100M  100N  100O  100P  100Q  100R  100S  100T 100U 100V 100W 100X 100Y 100Z
          101   102   103   104   105   106   107   108   109
          110   111   112   113"""

light_martin_scheme = """1     2     3     4     5     6     7     8     9
          10    11    12    13    14    15    16    17    18    19
          20    21    22    23    24    25    26    27    28    29
          30
          30A   30B   30C   30D   30E   30F   30G   30H   30I   30J   30K
          31    32    33    34    35    36    37    38    39
          40
          40A   41    42    43    44    45    46    47    48    49
          50    51    52
          52A   52B   52C   52D   52E
          53    54    55    56    57    58    59
          60    61    62    63    64    65    66    67    68
          68A   68B   68C   68D   68E   68F   68G   68H         69
          70    71    72    73    74    75    76    77    78    79
          80    81    82    83    84    85    86    87    88    89
          90    91    92    93    94    95
          95A   95B   95C   95D   95E   95F   95G   95H   95I
          96    97    98    99
          100   101   102   103   104   105   106   107
          107A                                            108   109
          110"""

expanded_imgt_scheme       = """1     2     3     4     5
                 5A    5B    5C    5D                6     7     8     9
                10    11    12    13    14    15    16    17    18    19
                20    21    22    23
                23A                     24    25    26    27    28    29
                30    31    32
                32A   32B
                33B   33A         33    34    35    36    37    38    39
                40    41    42    43    44    45
                45A                                 46
                46A   46B   46C   46D   46E   46F         47    48    49
                50    50A   50B   50C   50D   50E
                      51    52    52A   52B   52C   52D
                                  53    54    55    56    57    58    59
                60    60A   60B   60C   60D   60E   60F
                      61F   61E   61D   61C   61B   61A
                      61    62    63    64    65    66    67
                67A   67B   67C   67D   67E   67F
                68A                                             68    69
                69A
                70    70A   70B   70C
                      71    72    73    74    75    76    77    78    79
                80    80A   80B
                      81    82    82A   82B   82C   82D   82E
                                  83    84    84A   84B
                                              85    86    87    88    89
                89A   89B
                90    91    92    93    94    95    96    97    98    99
               100   101   102   103   104   105   106   107   108   109
               110   111
               111A  111B  111C  111D  111E  111F  111G  111H  111I  111J  111K  111L  111M
               112M  112L  112K  112J  112I  112H  112G  112F  112E  112D  112C  112B  112A
                           112   113   114   115   116   117   118   119   119A  119B  119C
               120   121   122   123   124   125   126   127   128
               """
basic_imgt_scheme = """1     2     3     4     5     6     7     8     9
                10    11    12    13    14    15    16    17    18    19
                20    21    22    23    24    25    26    27    28    29
                30    31    32
                32A   32B
                33B   33A         33    34    35    36    37    38    39
                40    41    42    43    44    45    46    47    48    49
                50    51    52    53    54    55    56    57    58    59
                60    60A   60B   60C   60D   60E   60F
                      61F   61E   61D   61C   61B   61A
                      61    62    63    64    65    66    67    68    69
                70    71    72    73    74    75    76    77    78    79
                80    81    82    83    84    85    86    87    88    89
                90    91    92    93    94    95    96    97    98    99
               100   101   102   103   104   105   106   107   108   109
               110   111
               111A  111B  111C  111D  111E  111F  111G  111H  111I  111J  111K  111L  111M
               112M  112L  112K  112J  112I  112H  112G  112F  112E  112D  112C  112B  112A
                           112   113   114   115   116   117   118   119
               120   121   122   123   124   125   126   127   128
               """

imgt_scheme = expanded_imgt_scheme.split()
heavy_chothia_scheme = heavy_chothia_scheme.split()
heavy_martin_scheme = heavy_martin_scheme.split()

heavy_numbering_scheme = heavy_martin_scheme
print(len(heavy_numbering_scheme), heavy_numbering_scheme)

light_chothia_scheme = light_chothia_scheme.split()
light_martin_scheme = light_martin_scheme.split()
light_numbering_scheme = light_martin_scheme
print(len(light_numbering_scheme), light_numbering_scheme)

In [None]:
class AntibodyAligner:
    """
    Class for aligning antibody sequences using ANARCI with specified numbering scheme.
    """

    def __init__(self, scheme='martin', allowed_species=None):
        """
        Initialize the antibody aligner.

        Args:
            scheme (str): Numbering scheme ('chothia', 'kabat', 'imgt', 'martin')
            allowed_species (list): Allowed species for recognition (defaults to ['human'])
        """
        if not ANARCI_AVAILABLE:
            raise ImportError("ANARCI is required for sequence alignment. Please install with: pip install anarci")

        self.scheme = scheme
        # Avoid a mutable default argument; fall back to human when nothing is passed
        self.allowed_species = ['human'] if allowed_species is None else list(allowed_species)

        # Define standard positions for alignment (with specified numbering scheme)
        # Heavy chain positions: 1-113 (approximate, varies by CDR insertions)
        # Light chain positions: 1-107 (approximate, varies by CDR insertions)
        self.heavy_positions = None
        self.light_positions = None

        self.failed_indices = None

    def number_sequence(self, sequence, chain_type='H'):
        """
        Number a single antibody sequence using ANARCI.

        Args:
            sequence (str): Antibody sequence
            chain_type (str): 'H' for heavy, 'L' for light

        Returns:
            tuple: (numbered_sequence, domain_type, species)
        """

        try:
            # Run ANARCI numbering
            results = anarci([('query', sequence)],
                           scheme=self.scheme,
                           output=False,
                           allowed_species=self.allowed_species)

            if not results or not results[0] or not results[0][0]:
                return None, None, None

            # Extract results
            numbering, alignment_details, hit_table = results

            if not numbering[0] or not numbering[0][0]:
                return None, None, None

            # Get the first (best) domain found for this sequence
            domain_numbering = numbering[0][0]
            # alignment_details[0] is a list with one dict per domain
            details = alignment_details[0][0]

            domain_type = details.get('chain_type', 'unknown')
            species = details.get('species', 'unknown')

            return domain_numbering, domain_type, species

        except Exception as e:
            print(f"Error numbering sequence: {e}")
            return None, None, None

    def create_aligned_sequence(self, numbering, standard_positions):
        """
        Create aligned sequence with gaps for missing positions.

        Args:
            numbering (list): ANARCI numbering results
            standard_positions (list): Standard positions for alignment

        Returns:
            str: Aligned sequence with gaps
        """

        if not numbering:
            return None

        # Create position to residue mapping
        pos_to_residue = {}
        for position, residue in numbering[0]:
            if residue != '-':  # Skip gaps in the original numbering
                pos_to_residue[position] = residue

        # Build aligned sequence
        aligned_seq = []
        for pos in standard_positions:
            if pos in pos_to_residue:
                aligned_seq.append(pos_to_residue[pos])
            else:
                aligned_seq.append('-')  # Gap for missing position

        return ''.join(aligned_seq)

    def get_standard_positions(self, all_numberings, chain_type='H'):
        """
        Determine standard positions from all numbered sequences.

        Args:
            all_numberings (list): List of numbering results
            chain_type (str): 'H' for heavy, 'L' for light

        Returns:
            list: Sorted list of standard positions
        """

        all_positions = set()

        for numbering in all_numberings:
            if numbering: # Check if numbering was successful
                for position, residue in numbering[0]:
                    if residue != '-': # Ignore gaps from ANARCI
                        all_positions.add(position)

        # ANARCI positions are tuples like (1, ' ') or (27, 'A'). Order them by the
        # explicit scheme list, since insertion codes do not always sort alphabetically.
        if chain_type == "H":
            numbering_scheme = heavy_numbering_scheme
        elif chain_type == "L":
            numbering_scheme = light_numbering_scheme
        else:
            raise ValueError(f"chain_type must be 'H' or 'L', got {chain_type!r}")

        used_positions = {str(pos[0]) + pos[1].strip() for pos in all_positions}
        sorted_positions = []
        for numstring in numbering_scheme:
            if numstring in used_positions:
                if numstring[-1] in string.ascii_uppercase:
                    sorted_positions.append((int(numstring[:-1]), numstring[-1]))
                else:
                    sorted_positions.append((int(numstring), ' '))

        # Observed positions missing from the scheme list would otherwise be dropped silently
        unaccounted_positions = sorted(used_positions - set(numbering_scheme))
        if unaccounted_positions:
            print(f"WARNING: The following positions are not accounted for {unaccounted_positions}")
        return sorted_positions

    def align_antibody_sequences(self, df, heavy_col='HeavySequence', light_col='LightSequence'):
        """
        Align all antibody sequences in the dataframe using the configured numbering scheme.

        Args:
            df (pd.DataFrame): DataFrame with antibody sequences
            heavy_col (str): Column name for heavy chain sequences
            light_col (str): Column name for light chain sequences

        Returns:
            pd.DataFrame: DataFrame with aligned sequences added
        """

        print(f"Aligning {len(df)} antibody sequences using ANARCI ({self.scheme} numbering)...")

        df_aligned = df.copy()

        # Initialize lists for results
        heavy_aligned = []
        light_aligned = []
        heavy_numberings = []
        light_numberings = []
        alignment_success = []

        # First pass: Number all sequences
        print("First pass: Numbering sequences...")
        # Use a running count for progress, since the DataFrame index may not be 0..n-1
        for count, (idx, row) in enumerate(df.iterrows()):
            if count % 500 == 0:
                print(f"Processing antibody {count+1}/{len(df)}")

            # Number heavy chain
            heavy_numbering, heavy_domain, heavy_species = self.number_sequence(
                row[heavy_col], chain_type='H'
            )

            # Number light chain
            light_numbering, light_domain, light_species = self.number_sequence(
                row[light_col], chain_type='L'
            )

            # Apply manual corrections (numbering_corrections is a module-level dict
            # keyed by heavy chain sequence, defined in the data loading section)
            if row[heavy_col] in numbering_corrections:
                heavy_numbering = numbering_corrections[row[heavy_col]]

            # Find positions outside the scheme lists; skip chains that ANARCI failed to number
            extra_heavy_num = []
            if heavy_numbering is not None:
                extra_heavy_num = [x for x in [str(y[0][0])+y[0][1].strip() for y in heavy_numbering[0]] if x not in heavy_numbering_scheme]
            extra_light_num = []
            if light_numbering is not None:
                extra_light_num = [x for x in [str(y[0][0])+y[0][1].strip() for y in light_numbering[0]] if x not in light_numbering_scheme]

            if len(extra_heavy_num)>0:
                print(row['Therapeutic'], row[heavy_col])
                print(extra_heavy_num)
                print(heavy_numbering)
                heavy_numbering = None

            if len(extra_light_num)>0:
                print(row['Therapeutic'], row[light_col])
                print(extra_light_num)
                print(light_numbering)
                light_numbering = None



            heavy_numberings.append(heavy_numbering)
            light_numberings.append(light_numbering)

            # Track success
            success = (heavy_numbering is not None) and (light_numbering is not None)
            alignment_success.append(success)

        # Determine standard positions from all successful numberings
        print("Determining standard positions...")
        successful_heavy = [n for n, s in zip(heavy_numberings, alignment_success) if s and n]
        successful_light = [n for n, s in zip(light_numberings, alignment_success) if s and n]

        self.heavy_positions = self.get_standard_positions(successful_heavy, 'H')
        self.light_positions = self.get_standard_positions(successful_light, 'L')

        print(f"Heavy chain alignment length: {len(self.heavy_positions)} positions")
        print(f"Light chain alignment length: {len(self.light_positions)} positions")

        # Second pass: Create aligned sequences
        print("Second pass: Creating aligned sequences...")
        for i, (heavy_num, light_num, success) in enumerate(zip(heavy_numberings, light_numberings, alignment_success)):
            if success:
                heavy_aligned_seq = self.create_aligned_sequence(heavy_num, self.heavy_positions)
                light_aligned_seq = self.create_aligned_sequence(light_num, self.light_positions)
            else:
                heavy_aligned_seq = None
                light_aligned_seq = None

            heavy_aligned.append(heavy_aligned_seq)
            light_aligned.append(light_aligned_seq)

        # Add aligned sequences to dataframe
        df_aligned['HeavyAligned'] = heavy_aligned
        df_aligned['LightAligned'] = light_aligned
        df_aligned['AlignmentSuccess'] = alignment_success

        # Summary statistics
        successful_alignments = sum(alignment_success)
        print(f"\nAlignment Results:")
        print(f"Successfully aligned: {successful_alignments}/{len(df)} ({100*successful_alignments/len(df):.1f}%)")

        if successful_alignments < len(df):
            self.failed_indices = [i for i, s in enumerate(alignment_success) if not s]
            print(f"Failed alignments at indices: {self.failed_indices[:10]}{'...' if len(self.failed_indices) > 10 else ''}")

        return df_aligned

    def get_alignment_info(self):
        """Get information about the current alignment."""
        return {
            'scheme': self.scheme,
            'heavy_positions': len(self.heavy_positions) if self.heavy_positions else 0,
            'light_positions': len(self.light_positions) if self.light_positions else 0,
            'heavy_pos_range': f"{self.heavy_positions[0]}-{self.heavy_positions[-1]}" if self.heavy_positions else "None",
            'light_pos_range': f"{self.light_positions[0]}-{self.light_positions[-1]}" if self.light_positions else "None"
        }

# 3. DATA LOADING AND PREPROCESSING

We read and concatenate the two CSV files containing antibody sequences and clinical progression tags, then use those tags to separate out the antibodies that are approved and in the clinic. A different separation strategy might reproduce the original manuscript more faithfully, which is worth exploring.
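
As one possible alternative separation, the sketch below collapses the clinical tags into a binary `Approved`/`Other` label. The toy table, the antibody names, and the exact tag values are illustrative assumptions; check the values actually present in `Highest_Clin_Trial` before adapting it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined table; names and tag values are illustrative only.
toy = pd.DataFrame({
    "Therapeutic": ["mab_a", "mab_b", "mab_c", "mab_d"],
    "Highest_Clin_Trial": ["Approved", "Phase-II", "Hit_Antibody", None],
})

# Hypothetical rule: only rows tagged exactly "Approved" form the reference set.
# str() keeps missing tags from raising; they simply fall into "Other".
is_approved = [str(tag).lower() == "approved" for tag in toy["Highest_Clin_Trial"]]
toy["Status"] = np.where(is_approved, "Approved", "Other")

print(toy["Status"].value_counts())
```

With a binary label like this, the plotting and ellipse-fitting code that compares against the string `'Approved'` has a well-defined reference group.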

In [None]:
numbering_corrections = {}

def load_antibody_data(csv_file_path, random_paired_file_path):
    """
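    Load and combine antibody data from two CSV files.

    Args:
        csv_file_path (str): Path to the therapeutic antibody CSV file
        random_paired_file_path (str): Path to the paired antibody CSV file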

    Expected columns:
    - Therapeutic: Antibody name
    - HeavySequence: Heavy chain sequence
    - LightSequence: Light chain sequence
    - Highest_Clin_Trial: Clinical trial status
    """

    # Load the data
    df = pd.read_csv(csv_file_path)

    # remove duplicates
    num_seqs_begin = df.shape[0]
    df = df.drop_duplicates(subset=['HeavySequence','LightSequence'])
    num_seqs_end = df.shape[0]

    # select only "human" sequences
    # umab_mask  = df['Therapeutic'].str.contains('umab', case=False, na=False)
    # zumab_mask = df['Therapeutic'].str.contains('zumab', case=False, na=False)
    # df = df.loc[umab_mask & ~zumab_mask]

    # Display basic information about the dataset
    print(f"Loaded {len(df)} antibodies from {csv_file_path}")
    print(f"Removed {num_seqs_begin - num_seqs_end} antibodies out of {num_seqs_begin} original antibodies")
    #print(f"Columns: {list(df.columns)}")
    print(f"\nClinical trial status distribution:")
    print(df['Highest_Clin_Trial'].value_counts())
    print("\n")

    paired_df = pd.read_csv(random_paired_file_path)
    # remove duplicates
    num_seqs_begin = paired_df.shape[0]
    paired_df = paired_df.drop_duplicates(subset=['HeavySequence','LightSequence'])
    num_seqs_end = paired_df.shape[0]
    # Display basic information about the dataset
    print(f"Loaded {len(paired_df)} antibodies from {random_paired_file_path}")
    print(f"Removed {num_seqs_begin - num_seqs_end} antibodies out of {num_seqs_begin} original antibodies")
    #print(f"Columns: {list(paired_df.columns)}")
    print(f"\nClinical trial status distribution:")
    print(paired_df['Highest_Clin_Trial'].value_counts())
    print("\n")

    df = pd.concat([df,paired_df], ignore_index=True)
    # Display basic information about the dataset
    print(f"Combined {len(df)} antibodies")
    #print(f"Columns: {list(df.columns)}")
    print(f"\nClinical trial status distribution:")
    print(df['Highest_Clin_Trial'].value_counts())

    # # Separate training antibodies from other antibodies
    # approved_mask = ~(df['Highest_Clin_Trial'].str.contains('Hit_Antibody', case=False, na=False))
    # #approved_mask = df['Sweet-Jones_Martin_dataset'].str.contains('Y', case=False, na=False)

    # df['Status'] = np.where(approved_mask, 'Approved', 'Other')
    df['Status'] = df['Highest_Clin_Trial']
    print(f"\nStatus distribution:")
    print(df['Status'].value_counts())

    # Remove any entries with missing sequences
    initial_count = len(df)
    df = df.dropna(subset=['HeavySequence', 'LightSequence'])
    final_count = len(df)

    if initial_count != final_count:
        print(f"Removed {initial_count - final_count} entries with missing sequences")

    return df

def load_and_align_antibodies(csv_file_path, random_paired_file_path, scheme='martin'):
    """
    Complete pipeline for loading and aligning antibody sequences.

    Args:
        csv_file_path (str): Path to the therapeutic antibody CSV file
        random_paired_file_path (str): Path to the paired antibody CSV file
        scheme (str): Numbering scheme for alignment

    Returns:
        tuple: (aligned_dataframe, aligner_object)
    """

    print("="*60)
    print("ANTIBODY DATA LOADING AND ALIGNMENT")
    print("="*60)

    # Load data
    df = load_antibody_data(csv_file_path, random_paired_file_path)

    df.reset_index(inplace=True, drop=True)

    # Align sequences
    aligner = AntibodyAligner(scheme=scheme)
    df_aligned = aligner.align_antibody_sequences(df)

    # Print alignment info
    print(f"\nAlignment Information:")
    info = aligner.get_alignment_info()
    for key, value in info.items():
        print(f"  {key}: {value}")

    return df_aligned, aligner

df_aligned, aligner = load_and_align_antibodies(therapeutic_antibodies_csv, random_paired_antibodies_csv, scheme="martin")

In [None]:
heavy_out_positions = [str(x[0])+x[1].strip() for x in aligner.heavy_positions]

print("The following heavy-chain positions were output by ANARCI but unavailable in the defined numbering scheme. Antibodies with these positions will go unused.")
print([x for x in heavy_out_positions if x not in heavy_numbering_scheme])
print("The following heavy-chain positions were available in the defined numbering scheme but are unused in the current antibody set:")
print([x for x in heavy_numbering_scheme if x not in heavy_out_positions])

light_out_positions = [str(x[0])+x[1].strip() for x in aligner.light_positions]

print("The following light-chain positions were output by ANARCI but unavailable in the defined numbering scheme. Antibodies with these positions will go unused.")
print([x for x in light_out_positions if x not in light_numbering_scheme])
print("The following light-chain positions were available in the defined numbering scheme but are unused in the current antibody set:")
print([x for x in light_numbering_scheme if x not in light_out_positions])

Let's look at some aligned sequences.

In [None]:
df_aligned['LightAligned'].head(5)  # Display first 5 aligned light sequences

In [None]:
df_aligned['HeavyAligned'].head(5)  # Display first 5 aligned heavy sequences


# 4. ANTIBERTY EMBEDDING EXTRACTION WITH ALIGNED SEQUENCES
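
Before the embedder itself, here is a minimal sketch of how an aligned sequence can be turned into tokenizer input, with one token per alignment column. The helper name `format_for_tokenizer` and the toy fragment are hypothetical; only the idea of mapping gaps to a special token comes from this notebook. Changing `gap_token` to `"[MASK]"` would be one way to try the masking-token convention discussed in the introduction.

```python
def format_for_tokenizer(aligned_seq, gap_char="-", gap_token="[PAD]"):
    """Return space-separated tokens, one per alignment column (hypothetical helper)."""
    return " ".join(gap_token if ch == gap_char else ch for ch in aligned_seq)

example = "EVQ--LVES"  # toy aligned fragment, not taken from the dataset
formatted = format_for_tokenizer(example)

print(formatted)               # E V Q [PAD] [PAD] L V E S
print(len(formatted.split()))  # 9, the same as len(example)
```

Because every gap becomes exactly one token, all aligned sequences keep the same token count, which is what lets the per-residue embeddings be stacked into a single array later.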

In [None]:
class AntiBERTyEmbedder:
    """
    Class for extracting embeddings using the official AntiBERTy implementation.
    Uses the jeffreyruffolo/AntiBERTy repository for optimized antibody embeddings.
    """

    def __init__(self, batch_size=32, clear_cache=True):
        """Initialize the AntiBERTy runner with batching support.
            Args:
                batch_size (int): Number of sequences to process at once
                clear_cache (bool): Whether to clear CUDA cache between batches
        """
        if not ANTIBERTY_AVAILABLE:
            raise ImportError("AntiBERTy is required. Please install with: pip install antiberty")

        self.batch_size = batch_size
        self.clear_cache = clear_cache

        print("Initializing AntiBERTy runner...")
        try:
            self.antiberty = AntiBERTyRunner()
            print("AntiBERTy runner initialized successfully!")

            # Print memory info
            if torch.cuda.is_available():
                self._print_gpu_memory("Initial GPU memory")

        except Exception as e:
            print(f"Error initializing AntiBERTy: {e}")
            raise

    def _print_gpu_memory(self, stage=""):
        """Print current GPU memory usage."""
        if torch.cuda.is_available():
            allocated = torch.cuda.memory_allocated() / 1024**3  # GB
            reserved = torch.cuda.memory_reserved() / 1024**3   # GB
            max_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3  # GB
            print(f"{stage}: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved, {max_memory:.2f}GB total")

    def _clear_gpu_cache(self):
        """Clear GPU cache and run garbage collection."""
        if torch.cuda.is_available() and self.clear_cache:
            torch.cuda.empty_cache()
        gc.collect()

    def get_sequence_embeddings(self, sequences, batch_size=None, show_progress=True):
        """
        Extract embeddings for a list of protein sequences with batching.

        Args:
            sequences (list): List of protein sequences (can contain gaps '-')
            batch_size (int, optional): Override default batch size
            show_progress (bool): Whether to show progress bar

        Returns:
            np.ndarray: Array of sequence embeddings
        """

        if batch_size is None:
            batch_size = self.batch_size

        # Clean sequences - replace gaps with '_' (padding character for AntiBERTy)
        cleaned_sequences = []
        for seq in sequences:
            cleaned_seq = seq.replace('-', '_')
            cleaned_sequences.append(cleaned_seq)

        num_sequences = len(cleaned_sequences)
        num_batches = (num_sequences + batch_size - 1) // batch_size

        print(f"Processing {num_sequences} sequences in {num_batches} batches of size {batch_size}")

        all_embeddings = []

        # Process in batches
        iterator = range(0, num_sequences, batch_size)
        if show_progress:
            iterator = tqdm(iterator, desc="Processing batches", unit="batch")

        for start_idx in iterator:
            end_idx = min(start_idx + batch_size, num_sequences)
            batch_sequences = cleaned_sequences[start_idx:end_idx]

            try:
                # Get embeddings for this batch
                #batch_embeddings = self.antiberty.embed(batch_sequences)

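                # Map gap placeholders to [PAD] and keep the chain separator as one token;
                # splitting the string per character would otherwise break "[SEP]" apart
                formatted_sequences = [
                    " ".join("[PAD]" if y == "_" else "[SEP]" if y == "|" else y
                             for y in x.replace("[SEP]", "|"))
                    for x in batch_sequences
                ]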
                inputs = self.antiberty.tokenizer(formatted_sequences, return_tensors="pt")
                device = next(self.antiberty.model.parameters()).device
                inputs = {k: v.to(device) for k, v in inputs.items()}
                with torch.no_grad():
                    outputs = self.antiberty.model(**inputs, output_hidden_states=True)

                batch_embeddings = outputs['hidden_states'][-1] # get just the last layer

                # Handle CUDA tensors properly
                if isinstance(batch_embeddings, torch.Tensor):
                    batch_embeddings = batch_embeddings.cpu().detach().numpy()
                elif isinstance(batch_embeddings, list):
                    processed_embeddings = []
                    for emb in batch_embeddings:
                        if isinstance(emb, torch.Tensor):
                            processed_embeddings.append(emb.cpu().detach().numpy())
                        else:
                            processed_embeddings.append(np.array(emb))
                    batch_embeddings = np.array(processed_embeddings)
                elif not isinstance(batch_embeddings, np.ndarray):
                    batch_embeddings = np.array(batch_embeddings)
                all_embeddings.append(batch_embeddings)

                # Clear cache after each batch to free memory
                self._clear_gpu_cache()

                if show_progress and torch.cuda.is_available():
                    # Update progress with memory info every 10 batches
                    if (start_idx // batch_size) % 10 == 0:
                        allocated = torch.cuda.memory_allocated() / 1024**3
                        iterator.set_postfix({"GPU_GB": f"{allocated:.1f}"})

            except Exception as e:
                print(f"Error processing batch {start_idx//batch_size + 1}/{num_batches}: {e}")

                # Try with smaller batch size if CUDA out of memory
                if "out of memory" in str(e).lower() and batch_size > 1:
                    print(f"Retrying batch with smaller size: {batch_size//2}")
                    smaller_batch_embeddings = self.get_sequence_embeddings(
                        batch_sequences,
                        batch_size=batch_size//2,
                        show_progress=False
                    )
                    all_embeddings.append(smaller_batch_embeddings)
                    self._clear_gpu_cache()
                else:
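                    # Return zero embeddings for failed batch. The array must be 3D,
                    # (batch, tokens, hidden), to concatenate with the successful batches.
                    # tokens = residues + 2 (start and end tokens); assumes equal-length sequences.
                    print("Using zero embeddings for failed batch")
                    n_tokens = len(batch_sequences[0]) + 2
                    failed_embeddings = np.zeros((len(batch_sequences), n_tokens, 512))  # Assuming 512-dim
                    all_embeddings.append(failed_embeddings)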
        # Concatenate all batch results
        if all_embeddings:
            final_embeddings = np.concatenate(all_embeddings, axis=0)
        else:
            final_embeddings = np.zeros((num_sequences, 512))  # Fallback

        print(f"Final embeddings shape: {final_embeddings.shape}")

        if torch.cuda.is_available():
            self._print_gpu_memory("Final GPU memory")

        return final_embeddings

    def get_antibody_embeddings(self, df, use_aligned=True, combine_chains=True,
                               gap_strategy='replace', batch_size=None):
        """
        Extract embeddings for all antibodies in the dataframe with batching.

        Args:
            df (pd.DataFrame): DataFrame with antibody sequences
            use_aligned (bool): Whether to use aligned sequences
            combine_chains (bool): Whether to combine heavy and light chains
            gap_strategy (str): How to handle gaps ('replace', 'remove')
            batch_size (int, optional): Override default batch size

        Returns:
            tuple: (embeddings_array, valid_indices)
        """

        if batch_size is None:
            batch_size = self.batch_size

        embeddings = []
        valid_indices = []

        # Choose sequence columns
        if use_aligned:
            heavy_col = 'HeavyAligned'
            light_col = 'LightAligned'
            print(f"Using aligned sequences for embedding extraction")

            # Filter to successfully aligned sequences
            valid_mask = df['AlignmentSuccess'] & df[heavy_col].notna() & df[light_col].notna()
            df_valid = df[valid_mask].reset_index(drop=True)
            valid_indices = df[valid_mask].index.tolist()

        else:
            heavy_col = 'HeavySequence'
            light_col = 'LightSequence'
            print(f"Using original sequences for embedding extraction")
            df_valid = df.copy()
            valid_indices = list(range(len(df)))

        print(f"Extracting embeddings for {len(df_valid)} antibodies...")

        if combine_chains:
            # Combine heavy and light chains
            combined_sequences = []
            for idx, row in df_valid.iterrows():
                # Combine with separator
                combined_seq = row[heavy_col] + '[SEP]' + row[light_col]
                combined_sequences.append(combined_seq)

            print("Getting combined chain embeddings with batching...")
            embeddings = self.get_sequence_embeddings(combined_sequences, batch_size=batch_size)
            embeddings = embeddings[:,1:-1] # remove the embedding for the start and end tokens

        else:
            # Get separate embeddings for heavy and light chains
            heavy_sequences = df_valid[heavy_col].tolist()
            light_sequences = df_valid[light_col].tolist()

            print("Getting heavy chain embeddings with batching...")
            heavy_embeddings = self.get_sequence_embeddings(heavy_sequences, batch_size=batch_size)

            print("Getting light chain embeddings with batching...")
            light_embeddings = self.get_sequence_embeddings(light_sequences, batch_size=batch_size)

            # Concatenate heavy and light embeddings
            embeddings = np.concatenate([heavy_embeddings[:,1:-1], light_embeddings[:,1:-1]], axis=1)

        print(f"Extracted embeddings with shape: {embeddings.shape}")
        print(f"Valid antibodies: {len(valid_indices)}/{len(df)}")

        return embeddings, valid_indices

    def get_embedding_statistics(self, embeddings):
        """Get basic statistics about the embeddings."""

        stats = {
            'shape': embeddings.shape,
            'mean_norm': np.mean(np.linalg.norm(embeddings, axis=1)),
            'std_norm': np.std(np.linalg.norm(embeddings, axis=1)),
            'mean_values': np.mean(embeddings, axis=0),
            'std_values': np.std(embeddings, axis=0),
            'embedding_dim': embeddings.shape[1] if len(embeddings.shape) > 1 else 0
        }

        print(f"Embedding Statistics:")
        print(f"  Shape: {stats['shape']}")
        print(f"  Embedding dimension: {stats['embedding_dim']}")
        print(f"  Mean L2 norm: {stats['mean_norm']:.3f} ± {stats['std_norm']:.3f}")
        print(f"  Feature mean range: [{np.min(stats['mean_values']):.3f}, {np.max(stats['mean_values']):.3f}]")

        return stats

embedder = AntiBERTyEmbedder(batch_size=32, clear_cache=True)

use_aligned_sequences = True  # Set to True to use aligned sequences, False for original sequences
if use_aligned_sequences:
    embeddings, valid_indices = embedder.get_antibody_embeddings(
        df_aligned, use_aligned=True, combine_chains=False
    )
    # Use only successfully aligned antibodies for analysis
    df_analysis = df_aligned.iloc[valid_indices].reset_index(drop=True)
    print(f"Using {len(df_analysis)} successfully aligned antibodies for analysis")
else:
    embeddings, valid_indices = embedder.get_antibody_embeddings(
        df_aligned, use_aligned=False, combine_chains=False
    )
    df_analysis = df_aligned.copy()

Look at the shape of the embeddings. What does each dimension represent?

In [None]:
print(embeddings.shape)

The first dimension of the embeddings is the number of samples. The second dimension should be the number of residues in the aligned antibodies. Let's check...

In [None]:
len(df_aligned['HeavyAligned'][0]) + len(df_aligned['LightAligned'][0])

Yes, the length of the aligned residues does match. The tokens signalling the beginning and end of each sequence are removed during concatenation in the function `get_antibody_embeddings`.

Finally, what is the last dimension, 512? This is the length of the vector that AntiBERTy produces for each token, which is a property of its architecture. Other language models will have different embedding dimensions.
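
The shape bookkeeping can be sketched with placeholder arrays. The sizes below are toy values chosen for illustration, not the real alignment lengths, and the zero arrays only stand in for model output.

```python
import numpy as np

n_antibodies, heavy_len, light_len, hidden = 4, 6, 5, 512  # toy sizes

# Stand-ins for per-chain model output: one vector per token,
# including one start and one end token per sequence.
heavy = np.zeros((n_antibodies, heavy_len + 2, hidden))
light = np.zeros((n_antibodies, light_len + 2, hidden))

# Drop the first and last token of each chain, then join along the residue axis.
combined = np.concatenate([heavy[:, 1:-1], light[:, 1:-1]], axis=1)
print(combined.shape)  # (4, 11, 512)

# Kernel PCA expects one row per antibody, so residue and hidden axes are flattened.
flat = combined.reshape(n_antibodies, -1)
print(flat.shape)  # (4, 5632)
```

The flattened width grows as (heavy positions + light positions) × 512, so the feature count passed to Kernel PCA depends directly on the alignment length.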

Question: what are the embeddings? Let's look...
They're just a numpy array of values extracted from the model.

In [None]:
print(type(embeddings[0]))
print(embeddings[0])

# 5. KERNEL PCA TRAINING

In [None]:
class AntibodyKernelPCA:
    """
    Kernel PCA implementation for antibody developability assessment.
    Uses RBF kernel as mentioned in the Sweet-Jones & Martin methodology.
    """

    def __init__(self, n_components=2, kernel='rbf', gamma=None, scale_embeddings=True):
        """
        Initialize Kernel PCA model.

        Args:
            n_components (int): Number of principal components
            kernel (str): Kernel type ('rbf', 'poly', 'sigmoid', 'cosine')
            gamma (float): Kernel coefficient for RBF kernel
        """
        self.n_components = n_components
        self.kernel = kernel
        self.gamma = gamma
        self.kpca = None
        self.scale_embeddings = scale_embeddings
        self.scaler = StandardScaler()

    def _reshape_embeddings(self, embeddings):
        """
        Reshape embeddings to 2D by concatenating the last dimensions.

        Args:
            embeddings (np.ndarray): Input embeddings

        Returns:
            np.ndarray: Reshaped embeddings (n_samples, concatenated_features)
        """
        print(f"Input embeddings shape: {embeddings.shape}")

        if len(embeddings.shape) == 2:
            # Already correct shape (n_samples, n_features)
            return embeddings

        elif len(embeddings.shape) == 3:
            # 3D array - concatenate embeddings from dimensions 1 and 2
            # Shape: (n_samples, dim1, dim2) -> (n_samples, dim1 * dim2)
            # Each sample will have embeddings from all dim1 positions concatenated

            n_samples, dim1, dim2 = embeddings.shape

            # Reshape to concatenate: flatten last two dimensions while preserving order
            embeddings_2d = embeddings.reshape(n_samples, dim1 * dim2)

            print(f"Concatenated 3D embeddings from {embeddings.shape} to {embeddings_2d.shape}")
            print(f"  Each sample now has {dim1} embedding vectors of size {dim2} concatenated")
            return embeddings_2d

        elif len(embeddings.shape) == 4:
            # 4D array - concatenate embeddings from dimensions 1, 2, and 3
            # Shape: (n_samples, dim1, dim2, dim3) -> (n_samples, dim1 * dim2 * dim3)

            n_samples, dim1, dim2, dim3 = embeddings.shape

            # Reshape to concatenate: flatten last three dimensions while preserving order
            embeddings_2d = embeddings.reshape(n_samples, dim1 * dim2 * dim3)

            print(f"Concatenated 4D embeddings from {embeddings.shape} to {embeddings_2d.shape}")
            print(f"  Each sample now has {dim1}×{dim2} embedding vectors of size {dim3} concatenated")
            return embeddings_2d

        elif len(embeddings.shape) == 1:
            # 1D array - assume single sample
            embeddings_2d = embeddings.reshape(1, -1)
            print(f"Reshaped 1D embeddings from {embeddings.shape} to {embeddings_2d.shape}")
            return embeddings_2d

        else:
            raise ValueError(f"Unsupported embeddings shape: {embeddings.shape}")

    def _compute_gamma(self, embeddings_scaled):
        """
        Compute gamma value based on the strategy.

        Args:
            embeddings_scaled (np.ndarray): Scaled embeddings

        Returns:
            float: Computed gamma value
        """
        n_features = embeddings_scaled.shape[1]

        if self.gamma == 'scale':
            # Recommended: 1 / (n_features * variance)
            variance = np.var(embeddings_scaled)
            gamma_val = 1.0 / (n_features * variance) if variance > 0 else 1.0 / n_features
            print(f"  Gamma strategy: 'scale'")
            print(f"    Computed gamma = 1/(n_features × variance) = {gamma_val:.6f}")

        elif self.gamma == 'auto' or self.gamma is None:
            # Default sklearn: 1 / n_features (often too small for high-dim embeddings)
            gamma_val = 1.0 / n_features
            print(f"  Gamma strategy: 'auto'")
            print(f"    Computed gamma = 1/n_features = {gamma_val:.6f}")
            print(f"  ⚠ WARNING: This may be too small for high-dimensional embeddings!")

        else:
            # User-specified value
            gamma_val = float(self.gamma)
            print(f"  Gamma strategy: user-specified")
            print(f"    Gamma value = {gamma_val:.6f}")

        return gamma_val

    def fit_transform(self, embeddings):
        """
        Fit kernel PCA and transform embeddings.

        Args:
            embeddings (np.ndarray): Antibody embeddings

        Returns:
            np.ndarray: Transformed embeddings in PC space
        """

        print("Fitting Kernel PCA...")

        # Reshape embeddings to 2D if needed
        embeddings_2d = self._reshape_embeddings(embeddings)

        # Check for any remaining issues
        if len(embeddings_2d.shape) != 2:
            raise ValueError(f"After reshaping, embeddings still not 2D: {embeddings_2d.shape}")

        # Check for NaN or infinite values
        if np.any(np.isnan(embeddings_2d)):
            print("Warning: NaN values found in embeddings, replacing with zeros")
            embeddings_2d = np.nan_to_num(embeddings_2d)

        if np.any(np.isinf(embeddings_2d)):
            print("Warning: Infinite values found in embeddings, replacing with finite values")
            embeddings_2d = np.nan_to_num(embeddings_2d)

        print(f"Final embeddings shape for PCA: {embeddings_2d.shape}")
        print(f"Embedding statistics: mean={np.mean(embeddings_2d):.3f}, std={np.std(embeddings_2d):.3f}")

        # Standardize embeddings
        if self.scale_embeddings:
            try:
                embeddings_scaled = self.scaler.fit_transform(embeddings_2d)
                print("✓ StandardScaler fit successful")
            except Exception as e:
                print(f"Error in StandardScaler: {e}")
                print(f"Embeddings shape: {embeddings_2d.shape}")
                print(f"Embeddings dtype: {embeddings_2d.dtype}")
                raise
        else:
            embeddings_scaled = embeddings_2d

        # Initialize Kernel PCA
        gamma = self._compute_gamma(embeddings_scaled)

        self.kpca = KernelPCA(
            n_components=self.n_components,
            kernel=self.kernel,
            gamma=gamma,
            random_state=42,
            n_jobs=-1
        )

        # Fit and transform
        try:
            embeddings_transformed = self.kpca.fit_transform(embeddings_scaled)
            print("✓ Kernel PCA fit_transform successful")
        except Exception as e:
            print(f"Error in Kernel PCA: {e}")
            print(f"Scaled embeddings shape: {embeddings_scaled.shape}")
            raise

        print(f"Kernel PCA completed. Output shape: {embeddings_transformed.shape}")

        # Try to get explained variance ratio
        explained_var = self.get_explained_variance_ratio()
        if explained_var is not None:
            print(f"Explained variance ratio (approximation): {explained_var}")

        return embeddings_transformed

    def transform(self, embeddings):
        """Transform new embeddings using fitted model."""
        if self.kpca is None:
            raise ValueError("Model must be fitted before transforming new data")

        # Reshape if needed
        embeddings_2d = self._reshape_embeddings(embeddings)

        # Scale using fitted scaler
        embeddings_scaled = self.scaler.transform(embeddings_2d)

        return self.kpca.transform(embeddings_scaled)

    def get_explained_variance_ratio(self):
        """
        Approximate explained variance ratio for kernel PCA.
        Note: eigenvalues_ holds only the retained components, so the ratios are
        relative to those components (they sum to 1), not to the total variance.
        """
        if self.kpca is None:
            return None

        try:
            # Get eigenvalues from the kernel matrix
            eigenvalues = self.kpca.eigenvalues_
            if eigenvalues is not None and len(eigenvalues) > 0:
                # Take only positive eigenvalues
                positive_eigenvalues = eigenvalues[eigenvalues > 0]
                if len(positive_eigenvalues) > 0:
                    total_variance = np.sum(positive_eigenvalues)
                    explained_variance_ratio = positive_eigenvalues / total_variance
                    return explained_variance_ratio[:self.n_components]
            return None
        except Exception as e:
            print(f"Could not compute explained variance ratio: {e}")
            return None

if 'kpca_gamma_dict' not in locals():
    kpca_gamma_dict = {}
for gamma in [100, 500]:  # add additional values to tune the PCA
    if gamma in kpca_gamma_dict:
        continue  # already fitted in an earlier run of this cell
    print(gamma)
    kpca_model = AntibodyKernelPCA(gamma=gamma, scale_embeddings=False)
    pc_embeddings = kpca_model.fit_transform(embeddings)
    kpca_gamma_dict[gamma] = (kpca_model, pc_embeddings)
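
To see why `gamma` is worth tuning, the sketch below computes an RBF kernel matrix by hand on random toy data. The function name `rbf_kernel_matrix` and the toy dimensions are illustrative assumptions; the formula is the standard RBF kernel, k(x, y) = exp(-gamma * ||x - y||²).

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2) for every pair of rows in X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clamp tiny negative values caused by floating-point round-off
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))  # 5 toy samples with 20 features

for gamma in (1e-3, 1e-1, 10.0):
    K = rbf_kernel_matrix(X, gamma)
    off_diag = K[~np.eye(len(X), dtype=bool)]
    print(f"gamma={gamma:g}: mean off-diagonal similarity = {off_diag.mean():.3f}")
```

As gamma grows, every sample looks similar only to itself and the kernel matrix approaches the identity; as it shrinks, all samples look alike. Both extremes leave little structure for the principal components to capture, and the useful range depends on the scale and dimensionality of the input features.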

# 6. VISUALIZATION OF KERNEL PCA RESULTS

In [None]:
def plot_kernel_pca_results(pc_embeddings, labels, status_col='Status',
                           title="Antibody Developability Landscape",
                           figsize=(12, 8)):
    """
    Plot kernel PCA results with approved vs other antibodies in different colors.

    Args:
        pc_embeddings (np.ndarray): PC coordinates
        labels (pd.Series or array): Status labels (Approved/Other)
        status_col (str): Column name for status
        title (str): Plot title
        figsize (tuple): Figure size
    """

    plt.figure(figsize=figsize)

    # Create color mapping
    unique_labels = np.unique(labels)
    colors = ['#2E86AB','#C73E1D', '#A23B72', '#F18F01', "#28C71D"]
    color_map = {label: colors[i % len(colors)] for i, label in enumerate(unique_labels)}

    # Plot each group
    for label in sorted(unique_labels):
        mask = labels == label
        if label=="Approved":
            z = 100
        else:
            z=1
        plt.scatter(
            pc_embeddings[mask, 0],
            pc_embeddings[mask, 1],
            c=color_map[label],
            label=f'{label} (n={np.sum(mask)})',
            alpha=0.7,
            s=10,
            edgecolors='white',
            linewidth=0.5,
            zorder=z
        )

    plt.xlabel('PC1', fontsize=12)
    plt.ylabel('PC2', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.xlim(-0.1,0.1)
    plt.ylim(-0.1,0.1)
    plt.grid(True, alpha=0.3)

    # Add statistics
    approved_mask = labels == 'Approved'
    other_mask = labels == 'Other'

    if np.any(approved_mask) and np.any(other_mask):
        # Calculate centroids
        approved_centroid = np.mean(pc_embeddings[approved_mask], axis=0)
        other_centroid = np.mean(pc_embeddings[other_mask], axis=0)

        # Plot centroids
        plt.scatter(*approved_centroid, c='red', s=200, marker='x', linewidth=3, label='Approved Centroid')
        plt.scatter(*other_centroid, c='blue', s=200, marker='x', linewidth=3, label='Other Centroid')

    # Draw the legend last so that the centroid markers are included
    plt.legend(fontsize=10)

    plt.tight_layout()
    plt.show()

    # Print summary statistics
    print("\nSummary Statistics:")
    print(f"Total antibodies: {len(pc_embeddings)}")
    for label in unique_labels:
        mask = labels == label
        count = np.sum(mask)
        pc1_mean = np.mean(pc_embeddings[mask, 0])
        pc2_mean = np.mean(pc_embeddings[mask, 1])
        pc1_std = np.std(pc_embeddings[mask, 0])
        pc2_std = np.std(pc_embeddings[mask, 1])

        print(f"\n{label} antibodies (n={count}):")
        print(f"  PC1: {pc1_mean:.3f} ± {pc1_std:.3f}")
        print(f"  PC2: {pc2_mean:.3f} ± {pc2_std:.3f}")

print(kpca_gamma_dict.keys())
# The key must be one of the gamma values fitted above
kpca_model, pc_embeddings = kpca_gamma_dict[100]
plot_kernel_pca_results(pc_embeddings, df_analysis['Status'])

In [None]:
kpca_gamma_dict.keys()

# 7. ELLIPSE SELECTION STRATEGY
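
As a small worked example of the Z-score ellipse idea, the sketch below fits a centroid and per-axis standard deviation to a toy reference set and tests two query points. The function name `within_z_ellipse` and all coordinates are made up for illustration.

```python
import numpy as np

def within_z_ellipse(points, reference, z_threshold=2.0):
    """True where (z1/t)^2 + (z2/t)^2 <= 1, with z-scores taken against `reference`."""
    centroid = reference.mean(axis=0)
    std = reference.std(axis=0)
    z = (points - centroid) / std
    return np.sum((z / z_threshold) ** 2, axis=1) <= 1.0

# Toy reference set: centroid (0, 0), standard deviation 1.0 on PC1 and 0.5 on PC2
reference = np.array([[1.0, 0.5], [-1.0, -0.5], [1.0, -0.5], [-1.0, 0.5]])

queries = np.array([
    [1.0, 0.5],  # z = (1.0, 1.0): 0.25 + 0.25 = 0.50, inside
    [1.6, 0.8],  # z = (1.6, 1.6): 0.64 + 0.64 = 1.28, outside
])
inside = within_z_ellipse(queries, reference)
print(inside)
```

The second point has a Z-score below 2 on each axis individually yet falls outside the ellipse, which shows that the elliptical criterion is stricter than thresholding PC1 and PC2 independently.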


In [None]:

class EllipseSelector:
    """
    Ellipse-based selection strategy for identifying developable antibodies.
    Based on Z-scores for PC1 and PC2 as described in Sweet-Jones & Martin.
    """

    def __init__(self, z_threshold=2.0):
        """
        Initialize ellipse selector.

        Args:
            z_threshold (float): Z-score threshold for ellipse boundary (default: 2.0)
        """
        self.z_threshold = z_threshold
        self.approved_centroid = None
        self.approved_std = None
        self.fitted = False

    def fit(self, pc_embeddings, labels):
        """
        Fit ellipse parameters based on approved antibodies.

        Args:
            pc_embeddings (np.ndarray): PC coordinates
            labels (array-like): Status labels
        """

        # Get approved antibodies
        approved_mask = labels == 'Approved'

        if not np.any(approved_mask):
            raise ValueError("No approved antibodies found in the dataset")

        approved_coords = pc_embeddings[approved_mask]

        # Calculate centroid and standard deviations
        self.approved_centroid = np.mean(approved_coords, axis=0)
        self.approved_std = np.std(approved_coords, axis=0)

        # Handle case where std is very small
        self.approved_std = np.maximum(self.approved_std, 1e-6)

        self.fitted = True

        print(f"Ellipse fitted to {np.sum(approved_mask)} approved antibodies")
        print(f"Centroid: PC1={self.approved_centroid[0]:.3f}, PC2={self.approved_centroid[1]:.3f}")
        print(f"Std Dev: PC1={self.approved_std[0]:.3f}, PC2={self.approved_std[1]:.3f}")
        print(f"Z-score threshold: {self.z_threshold}")

    def calculate_z_scores(self, pc_embeddings):
        """
        Calculate Z-scores for PC coordinates relative to approved centroid.

        Args:
            pc_embeddings (np.ndarray): PC coordinates

        Returns:
            np.ndarray: Z-scores for each coordinate
        """

        if not self.fitted:
            raise ValueError("Selector must be fitted before calculating Z-scores")

        z_scores = np.abs(pc_embeddings - self.approved_centroid) / self.approved_std
        return z_scores

    def is_within_ellipse(self, pc_embeddings):
        """
        Determine which antibodies are within the ellipse.

        Args:
            pc_embeddings (np.ndarray): PC coordinates

        Returns:
            np.ndarray: Boolean mask indicating which antibodies are within ellipse
        """

        z_scores = self.calculate_z_scores(pc_embeddings)

        # Ellipse equation: (z1/threshold)² + (z2/threshold)² <= 1
        ellipse_distance = np.sum((z_scores / self.z_threshold) ** 2, axis=1)
        within_ellipse = ellipse_distance <= 1.0

        return within_ellipse

    def plot_ellipse(self, pc_embeddings, labels, figsize=(12, 8)):
        """
        Plot the ellipse selection boundary along with the data points.
        """

        if not self.fitted:
            raise ValueError("Selector must be fitted before plotting")

        plt.figure(figsize=figsize)

        # Plot data points
        unique_labels = np.unique(labels)
        colors = ['#2E86AB', '#A23B72', '#F18F01', '#C73E1D']
        color_map = {label: colors[i % len(colors)] for i, label in enumerate(unique_labels)}

        for label in unique_labels:
            zorder=1
            if label=="Approved":
                zorder=8
            mask = labels == label
            plt.scatter(
                pc_embeddings[mask, 0],
                pc_embeddings[mask, 1],
                c=color_map[label],
                label=f'{label} (n={np.sum(mask)})',
                alpha=0.5,
                s=20,
                edgecolors='white',
                linewidth=0.5,
                zorder=zorder
            )

        # Plot ellipse
        theta = np.linspace(0, 2*np.pi, 100)
        ellipse_x = self.approved_centroid[0] + self.z_threshold * self.approved_std[0] * np.cos(theta)
        ellipse_y = self.approved_centroid[1] + self.z_threshold * self.approved_std[1] * np.sin(theta)

        plt.plot(ellipse_x, ellipse_y, 'r--', linewidth=2, zorder=1000,
                label=f'Selection Ellipse (Z={self.z_threshold})')

        # Plot centroid
        plt.scatter(*self.approved_centroid, c='red', s=200, marker='x',
                   linewidth=3, label='Approved Centroid')

        # Highlight selected antibodies
        # within_ellipse = self.is_within_ellipse(pc_embeddings)
        # selected_coords = pc_embeddings[within_ellipse]
        # plt.scatter(selected_coords[:, 0], selected_coords[:, 1],
        #            facecolors='none', edgecolors='red', linewidth=2, s=120,
        #            label=f'Selected (n={np.sum(within_ellipse)})')

        plt.xlim(-0.1, 0.1)
        plt.ylim(-0.1, 0.1)
        plt.xlabel('PC1', fontsize=12)
        plt.ylabel('PC2', fontsize=12)
        plt.title('Antibody Selection using Ellipse Strategy', fontsize=14, fontweight='bold')
        plt.legend(fontsize=10)
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()

# 8. ANTIBODY SELECTION FUNCTION

In [None]:
def select_developable_antibodies(df, pc_embeddings, z_threshold=2.0,
                                 return_details=True):
    """
    Complete pipeline for selecting developable antibodies using ellipse strategy.

    Args:
        df (pd.DataFrame): Antibody dataframe
        pc_embeddings (np.ndarray): PC coordinates
        z_threshold (float): Z-score threshold for ellipse
        return_details (bool): Whether to return detailed analysis

    Returns:
        dict: Selection results and analysis
    """

    print("="*60)
    print("ANTIBODY DEVELOPABILITY SELECTION PIPELINE")
    print("="*60)

    # Initialize selector
    selector = EllipseSelector(z_threshold=z_threshold)

    # Fit ellipse to approved antibodies
    selector.fit(pc_embeddings, df['Status'])

    # Select antibodies within ellipse
    within_ellipse = selector.is_within_ellipse(pc_embeddings)
    selected_antibodies = df[within_ellipse].copy()

    # Calculate Z-scores for all antibodies
    z_scores = selector.calculate_z_scores(pc_embeddings)
    df_with_scores = df.copy()
    df_with_scores['PC1_ZScore'] = z_scores[:, 0]
    df_with_scores['PC2_ZScore'] = z_scores[:, 1]
    df_with_scores['Within_Ellipse'] = within_ellipse
    df_with_scores['Ellipse_Distance'] = np.sum((z_scores / z_threshold) ** 2, axis=1)

    # Analysis
    total_antibodies = len(df)
    selected_count = np.sum(within_ellipse)
    approved_count = np.sum(df['Status'] == 'Approved')
    approved_selected = np.sum((df['Status'] == 'Approved') & within_ellipse)
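    # Count every non-approved antibody so the numerator matches the
    # (total - approved) denominator used in the summary below
    other_selected = np.sum((df['Status'] != 'Approved') & within_ellipse)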

    # Calculate enrichment
    if selected_count > 0:
        approved_enrichment = (approved_selected / selected_count) / (approved_count / total_antibodies)
    else:
        approved_enrichment = 0

    print(f"\nSELECTION RESULTS:")
    print(f"Total antibodies: {total_antibodies}")
    print(f"Selected antibodies: {selected_count} ({100*selected_count/total_antibodies:.1f}%)")
    print(f"Approved antibodies selected: {approved_selected}/{approved_count} ({100*approved_selected/approved_count:.1f}%)")
    print(f"Other antibodies selected: {other_selected}/{total_antibodies-approved_count} ({100*other_selected/(total_antibodies-approved_count):.1f}%)")
    print(f"Approved enrichment factor: {approved_enrichment:.2f}")

    # Plot results
    selector.plot_ellipse(pc_embeddings, df['Status'])

    results = {
        'selector': selector,
        'selected_antibodies': selected_antibodies,
        'df_with_scores': df_with_scores,
        'within_ellipse_mask': within_ellipse,
        'total_count': total_antibodies,
        'selected_count': selected_count,
        'approved_selected': approved_selected,
        'other_selected': other_selected,
        'enrichment_factor': approved_enrichment,
        'z_threshold': z_threshold
    }

    if return_details:
        return results
    else:
        return selected_antibodies

def rank_antibodies_by_developability(df, pc_embeddings, z_threshold=2.0):
    """
    Rank all antibodies by their developability score (inverse ellipse distance).

    Args:
        df (pd.DataFrame): Antibody dataframe
        pc_embeddings (np.ndarray): PC coordinates
        z_threshold (float): Z-score threshold

    Returns:
        pd.DataFrame: Ranked antibodies with developability scores
    """

    # Fit selector
    selector = EllipseSelector(z_threshold=z_threshold)
    selector.fit(pc_embeddings, df['Status'])

    # Calculate scores
    z_scores = selector.calculate_z_scores(pc_embeddings)
    ellipse_distances = np.sum((z_scores / z_threshold) ** 2, axis=1)

    # Developability score (higher is better)
    developability_scores = 1.0 / (1.0 + ellipse_distances)

    # Create ranked dataframe
    df_ranked = df.copy()
    df_ranked['PC1_ZScore'] = z_scores[:, 0]
    df_ranked['PC2_ZScore'] = z_scores[:, 1]
    df_ranked['Ellipse_Distance'] = ellipse_distances
    df_ranked['Developability_Score'] = developability_scores
    df_ranked['Within_Ellipse'] = ellipse_distances <= 1.0
    df_ranked['Rank'] = df_ranked['Developability_Score'].rank(method='dense', ascending=False)

    # Sort by developability score
    df_ranked = df_ranked.sort_values('Developability_Score', ascending=False)

    print(f"Antibodies ranked by developability score")
    print(f"Top 10 most developable antibodies:")
    print(df_ranked[['Therapeutic', 'Status', 'Developability_Score', 'Within_Ellipse']].head(10))

    return df_ranked

kpca_model, pc_embeddings = kpca_gamma_dict[500]
z_threshold = 2.0  # Z-score threshold for ellipse selection
results = select_developable_antibodies(df_analysis, pc_embeddings, z_threshold)
df_ranked = rank_antibodies_by_developability(df_analysis, pc_embeddings, z_threshold=z_threshold)

Compare this Kernel PCA plot with Figure 1 of the manuscript. Why might they differ?
- Different implementation details (for example, the numbering scheme or the gap tokenization)
- Different antibodies in the Approved, Hit, and Unapproved sets
- Fewer outliers in the random selection of Hit antibodies
- The original authors applied additional filters to the hit antibodies. One would expect filtering to remove outliers, but that is not what we see here.
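
One way to put numbers on such a comparison is the enrichment factor printed by `select_developable_antibodies`: the fraction of Approved antibodies among the selected set, divided by their fraction in the whole dataset. Here is a small worked sketch of that arithmetic. The counts are invented for illustration, and `enrichment_factor` is a hypothetical helper rather than a function from the notebook.

```python
def enrichment_factor(approved_selected, selected_count, approved_count, total_count):
    """Ratio of the approved fraction inside the ellipse to the approved fraction overall."""
    if selected_count == 0:
        return 0.0  # mirrors the notebook's fallback when nothing is selected
    selected_fraction = approved_selected / selected_count
    baseline_fraction = approved_count / total_count
    return selected_fraction / baseline_fraction


# Hypothetical counts: 1000 antibodies, 100 approved, 200 inside the ellipse,
# of which 80 are approved.
factor = enrichment_factor(approved_selected=80, selected_count=200,
                           approved_count=100, total_count=1000)
print(f"Approved enrichment factor: {factor:.2f}")
```

A value above 1 means the ellipse holds proportionally more Approved antibodies than the dataset as a whole, and a value near 1 means the selection is no better than random.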


Let's explore the ranked antibodies.

- How is the Developability Score calculated?
- If you were an antibody engineer, what other criteria would you like to plot?
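
As a hint for the first question, here is a minimal sketch of the arithmetic in `rank_antibodies_by_developability`, which maps the ellipse distance `d` to `1 / (1 + d)`. The Z-score values are made up for illustration, and `developability_score` is a hypothetical standalone helper.

```python
def developability_score(z1, z2, z_threshold=2.0):
    """Score in (0, 1]; higher means closer to the approved centroid."""
    ellipse_distance = (z1 / z_threshold) ** 2 + (z2 / z_threshold) ** 2
    return 1.0 / (1.0 + ellipse_distance)


# Illustrative Z-score pairs (not taken from real data)
examples = {
    "at centroid": (0.0, 0.0),
    "on ellipse boundary": (2.0, 0.0),
    "far outside": (4.0, 4.0),
}
for name, (z1, z2) in examples.items():
    print(f"{name}: score={developability_score(z1, z2):.3f}")
```

Because the ellipse boundary corresponds to a distance of exactly 1, a score of 0.5 or higher is equivalent to being inside the ellipse. The score is therefore a monotone re-expression of the distance and carries no information beyond PC1 and PC2.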

In [None]:
df_ranked.sort_values('Developability_Score', ascending=False)[['Therapeutic','Developability_Score','PC1_ZScore', 'PC2_ZScore','Rank']].head(10)

# 9. RUN FULL PIPELINE DEMONSTRATION

This section is optional, but it saves running each cell by hand. The functions defined in the earlier cells must still be defined first.

In [None]:
def run_complete_pipeline(therapeutic_csv_file_path, paired_csv_file_path, z_threshold=2.0, use_aligned_sequences=True,
                         numbering_scheme='chothia'):
    """
    Run the complete antibody developability assessment pipeline with sequence alignment.

    Args:
        therapeutic_csv_file_path (str): Path to CSV file with therapeutic antibody data
        paired_csv_file_path (str): Path to CSV file with paired antibody data
        z_threshold (float): Z-score threshold for ellipse selection
        use_aligned_sequences (bool): Whether to use ANARCI-aligned sequences
        numbering_scheme (str): Numbering scheme for alignment ('chothia', 'kabat', 'imgt')

    Returns:
        dict: Complete pipeline results
    """

    print("="*80)
    print("ANTIBODY DEVELOPABILITY TRIAGING PIPELINE WITH SEQUENCE ALIGNMENT")
    print(f"Based on Sweet-Jones & Martin methodology with {numbering_scheme} numbering")
    print("="*80)

    # Step 1: Load and align data
    print("\n1. Loading and aligning antibody sequences...")
    df_aligned, aligner = load_and_align_antibodies(therapeutic_csv_file_path, paired_csv_file_path, scheme=numbering_scheme)

    # Step 2: Extract embeddings
    print("\n2. Extracting AntiBERTy embeddings...")
    embedder = AntiBERTyEmbedder()

    if use_aligned_sequences:
        embeddings, valid_indices = embedder.get_antibody_embeddings(
            df_aligned, use_aligned=True, combine_chains=True
        )
        # Use only successfully aligned antibodies for analysis
        df_analysis = df_aligned.iloc[valid_indices].reset_index(drop=True)
        print(f"Using {len(df_analysis)} successfully aligned antibodies for analysis")
    else:
        embeddings, valid_indices = embedder.get_antibody_embeddings(
            df_aligned, use_aligned=False, combine_chains=True
        )
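        # Keep only rows that produced an embedding so rows stay aligned with `embeddings`
        df_analysis = df_aligned.iloc[valid_indices].reset_index(drop=True)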

    # Step 3: Apply Kernel PCA
    print("\n3. Applying Kernel PCA...")
    kpca_model = AntibodyKernelPCA(n_components=2)
    pc_embeddings = kpca_model.fit_transform(embeddings)

    # Step 4: Visualize results
    print("\n4. Visualizing Kernel PCA results...")
    plot_kernel_pca_results(pc_embeddings, df_analysis['Status'])

    # Step 5: Select developable antibodies
    print("\n5. Selecting developable antibodies...")
    results = select_developable_antibodies(df_analysis, pc_embeddings, z_threshold)

    # Step 6: Rank all antibodies
    print("\n6. Ranking antibodies by developability...")
    df_ranked = rank_antibodies_by_developability(df_analysis, pc_embeddings, z_threshold)

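    # Step 7: Alignment quality analysis (disabled; uncomment the two lines below to enable)
    # print("\n7. Analyzing alignment quality...")
    # alignment_stats = analyze_alignment_quality(df_aligned, aligner)
    alignment_stats = None  # placeholder so the results dict below does not raise NameError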

    # Compile final results
    pipeline_results = {
        'original_df': df_aligned,
        'analysis_df': df_analysis,
        'aligner': aligner,
        'alignment_stats': alignment_stats,
        'embeddings': embeddings,
        'valid_indices': valid_indices,
        'pc_embeddings': pc_embeddings,
        'kpca_model': kpca_model,
        'embedder': embedder,
        'selection_results': results,
        'ranked_antibodies': df_ranked,
        'use_aligned_sequences': use_aligned_sequences,
        'numbering_scheme': numbering_scheme
    }

    print("\n" + "="*80)
    print("PIPELINE COMPLETED SUCCESSFULLY!")
    print("="*80)

    return pipeline_results

def analyze_alignment_quality(df_aligned, aligner):
    """
    Analyze the quality of sequence alignments.

    Args:
        df_aligned (pd.DataFrame): DataFrame with alignment results
        aligner (AntibodyAligner): Aligner object with position information

    Returns:
        dict: Alignment quality statistics
    """

    successful_alignments = df_aligned['AlignmentSuccess'].sum()
    total_antibodies = len(df_aligned)

    # Calculate gap statistics for aligned sequences
    if successful_alignments > 0:
        aligned_df = df_aligned[df_aligned['AlignmentSuccess']]

        # Heavy chain gap analysis
        heavy_gap_counts = []
        heavy_lengths = []
        for seq in aligned_df['HeavyAligned']:
            if seq:
                gap_count = seq.count('-')
                length = len(seq)
                heavy_gap_counts.append(gap_count)
                heavy_lengths.append(length)

        # Light chain gap analysis
        light_gap_counts = []
        light_lengths = []
        for seq in aligned_df['LightAligned']:
            if seq:
                gap_count = seq.count('-')
                length = len(seq)
                light_gap_counts.append(gap_count)
                light_lengths.append(length)

        stats = {
            'total_antibodies': total_antibodies,
            'successful_alignments': successful_alignments,
            'alignment_success_rate': successful_alignments / total_antibodies,
            'heavy_chain_stats': {
                'avg_length': np.mean(heavy_lengths) if heavy_lengths else 0,
                'avg_gaps': np.mean(heavy_gap_counts) if heavy_gap_counts else 0,
                'max_gaps': max(heavy_gap_counts) if heavy_gap_counts else 0,
                'alignment_positions': len(aligner.heavy_positions) if aligner.heavy_positions else 0
            },
            'light_chain_stats': {
                'avg_length': np.mean(light_lengths) if light_lengths else 0,
                'avg_gaps': np.mean(light_gap_counts) if light_gap_counts else 0,
                'max_gaps': max(light_gap_counts) if light_gap_counts else 0,
                'alignment_positions': len(aligner.light_positions) if aligner.light_positions else 0
            }
        }

        print(f"Alignment Quality Analysis:")
        print(f"  Success rate: {stats['alignment_success_rate']:.1%}")
        print(f"  Heavy chain aligned length: {stats['heavy_chain_stats']['alignment_positions']} positions")
        print(f"  Light chain aligned length: {stats['light_chain_stats']['alignment_positions']} positions")
        print(f"  Average gaps - Heavy: {stats['heavy_chain_stats']['avg_gaps']:.1f}, Light: {stats['light_chain_stats']['avg_gaps']:.1f}")

    else:
        stats = {
            'total_antibodies': total_antibodies,
            'successful_alignments': 0,
            'alignment_success_rate': 0,
            'heavy_chain_stats': {},
            'light_chain_stats': {}
        }

    return stats

def plot_alignment_overview(df_aligned, aligner, sample_size=5):
    """
    Visualize alignment results for a sample of antibodies.

    Args:
        df_aligned (pd.DataFrame): DataFrame with alignment results
        aligner (AntibodyAligner): Aligner object
        sample_size (int): Number of antibodies to display
    """

    # Sample successful alignments
    successful_df = df_aligned[df_aligned['AlignmentSuccess']]
    if len(successful_df) == 0:
        print("No successful alignments to display")
        return

    sample_df = successful_df.sample(min(sample_size, len(successful_df)))

    fig, axes = plt.subplots(len(sample_df), 2, figsize=(16, 3*len(sample_df)))
    if len(sample_df) == 1:
        axes = axes.reshape(1, -1)

    for i, (idx, row) in enumerate(sample_df.iterrows()):
        # Heavy chain
        heavy_aligned = row['HeavyAligned']
        heavy_original = row['HeavySequence']

        # Light chain
        light_aligned = row['LightAligned']
        light_original = row['LightSequence']

        # Plot heavy chain
        ax_heavy = axes[i, 0]
        ax_heavy.text(0.02, 0.7, f"Original ({len(heavy_original)} aa):", fontsize=10, weight='bold')
        ax_heavy.text(0.02, 0.5, heavy_original[:80] + ('...' if len(heavy_original) > 80 else ''),
                     fontsize=8, family='monospace')
        ax_heavy.text(0.02, 0.3, f"Aligned ({len(heavy_aligned)} pos):", fontsize=10, weight='bold')
        ax_heavy.text(0.02, 0.1, heavy_aligned[:80] + ('...' if len(heavy_aligned) > 80 else ''),
                     fontsize=8, family='monospace')
        ax_heavy.set_title(f"{row['Therapeutic']} - Heavy Chain", fontsize=12)
        ax_heavy.set_xlim(0, 1)
        ax_heavy.set_ylim(0, 1)
        ax_heavy.axis('off')

        # Plot light chain
        ax_light = axes[i, 1]
        ax_light.text(0.02, 0.7, f"Original ({len(light_original)} aa):", fontsize=10, weight='bold')
        ax_light.text(0.02, 0.5, light_original[:80] + ('...' if len(light_original) > 80 else ''),
                     fontsize=8, family='monospace')
        ax_light.text(0.02, 0.3, f"Aligned ({len(light_aligned)} pos):", fontsize=10, weight='bold')
        ax_light.text(0.02, 0.1, light_aligned[:80] + ('...' if len(light_aligned) > 80 else ''),
                     fontsize=8, family='monospace')
        ax_light.set_title(f"{row['Therapeutic']} - Light Chain", fontsize=12)
        ax_light.set_xlim(0, 1)
        ax_light.set_ylim(0, 1)
        ax_light.axis('off')

    plt.tight_layout()
    plt.show()


# Vibe Coding Extras

In [None]:
# =============================================================================
# EXAMPLE EXECUTION
# =============================================================================

"""
To run the complete pipeline, copy the following lines into a code cell and provide your CSV files:

# Example usage:
results = run_complete_pipeline('your_antibody_data.csv','paired_antibodies.csv', z_threshold=2.0)

# Access specific results:
selected_antibodies = results['selection_results']['selected_antibodies']
ranked_antibodies = results['ranked_antibodies']

# Or run individual steps:
df = load_antibody_data('your_antibody_data.csv')
embedder = AntiBERTyEmbedder()
embeddings, valid_indices = embedder.get_antibody_embeddings(df)  # returns (embeddings, valid_indices)
kpca_model = AntibodyKernelPCA(n_components=2)
pc_embeddings = kpca_model.fit_transform(embeddings)
plot_kernel_pca_results(pc_embeddings, df['Status'])
selection_results = select_developable_antibodies(df, pc_embeddings, z_threshold=2.0)
"""

print("\nAntibody Developability Pipeline Setup Complete!")
print("Load your CSV files and run the pipeline using:")
print("results = run_complete_pipeline('your_antibody_data.csv', 'paired_antibodies.csv')")

In [None]:
# Memory management utilities
def suggest_batch_size_for_gpu():
    """Suggest appropriate batch size based on GPU memory."""

    if not torch.cuda.is_available():
        return 32  # Default for CPU

    total_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3  # GB

    # Conservative estimates based on GPU memory
    if total_memory >= 24:      # RTX 4090, A100, etc.
        return 64
    elif total_memory >= 16:    # RTX 4070Ti Super, etc.
        return 32
    elif total_memory >= 12:    # RTX 4070, etc.
        return 16
    elif total_memory >= 8:     # RTX 4060, etc.
        return 8
    else:                       # Lower memory GPUs
        return 4

print(f"Suggested batch size for your GPU: {suggest_batch_size_for_gpu()}")

In [None]:
# Additional utility function to debug embedding shapes
def debug_embedding_shapes(embeddings):
    """
    Debug function to understand embedding shapes and content.

    Args:
        embeddings: The embeddings array to debug
    """
    print("="*50)
    print("EMBEDDING SHAPE DEBUG")
    print("="*50)

    print(f"Type: {type(embeddings)}")
    print(f"Shape: {embeddings.shape}")
    print(f"Dtype: {embeddings.dtype}")
    print(f"Number of dimensions: {len(embeddings.shape)}")

    if len(embeddings.shape) >= 1:
        print(f"Dimension 0 (samples): {embeddings.shape[0]}")
    if len(embeddings.shape) >= 2:
        print(f"Dimension 1: {embeddings.shape[1]}")
    if len(embeddings.shape) >= 3:
        print(f"Dimension 2: {embeddings.shape[2]}")
    if len(embeddings.shape) >= 4:
        print(f"Dimension 3: {embeddings.shape[3]}")

    # Show some statistics
    print(f"\nStatistics:")
    print(f"  Min value: {np.min(embeddings):.6f}")
    print(f"  Max value: {np.max(embeddings):.6f}")
    print(f"  Mean: {np.mean(embeddings):.6f}")
    print(f"  Std: {np.std(embeddings):.6f}")

    # Check for problematic values
    nan_count = np.sum(np.isnan(embeddings))
    inf_count = np.sum(np.isinf(embeddings))

    if nan_count > 0:
        print(f"  ⚠ NaN values: {nan_count}")
    if inf_count > 0:
        print(f"  ⚠ Infinite values: {inf_count}")

    if nan_count == 0 and inf_count == 0:
        print(f"  ✓ No NaN or infinite values")

    # Suggest reshaping strategy
    if len(embeddings.shape) == 3:
        suggested_shape = (embeddings.shape[0], embeddings.shape[1] * embeddings.shape[2])
        print(f"\nSuggested reshape for 2D: {embeddings.shape} -> {suggested_shape}")
    elif len(embeddings.shape) == 4:
        suggested_shape = (embeddings.shape[0], np.prod(embeddings.shape[1:]))
        print(f"\nSuggested reshape for 2D: {embeddings.shape} -> {suggested_shape}")