# AlphaFold metrics
Created by Andreas 04.04.2025

This is the main notebook for adding the following information and metrics to a given AlphaFold run:

1. [RMSD](#3-rmsd-with-pymol)
2. [DockQ](#4-dockq)
3. [ipSAE](#5-ipsae-metric)
4. [Interface Interaction metrics](#6-interaction-metrics)

The sections (e.g. RMSD, DockQ) can be run individually to recalculate only some metrics. For this, set the *load_previous* setting to True.

The benchmark set of Lee et al. encodes information about a PPI in a string (e.g. *PF10208_PF00012_6H9U_B_resi130_resi169.A_resi30_resi406*). This notebook breaks it down into the PDB ID, Pfam ID, and ELM instance. Because the predictions were made for minimal interacting regions (i.e. only the motif or domain), the included chains (i.e. their IDs) and their boundaries are either extracted from the prediction name (DDI) or determined by comparison with the experimentally solved structure.
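
As an illustration of how such a string can be broken down, here is a minimal sketch. The field layout (`<Pfam1>_<Pfam2>_<PDB>_<chain>_resi<start>_resi<end>.<chain>_resi<start>_resi<end>`) is an assumption inferred from the single example above, and `parse_ddi_name` and `ChainRegion` are hypothetical names, not part of this notebook.

```python
import re
from typing import NamedTuple, Optional


class ChainRegion(NamedTuple):
    chain_id: str
    start: int
    end: int


# Assumed layout, inferred from the example name in the text above
DDI_NAME = re.compile(
    r"^(?P<pfam>PF\d+_PF\d+)_(?P<pdb>[0-9A-Za-z]{4})_"
    r"(?P<c1>[A-Za-z0-9])_resi(?P<s1>-?\d+)_resi(?P<e1>-?\d+)\."
    r"(?P<c2>[A-Za-z0-9])_resi(?P<s2>-?\d+)_resi(?P<e2>-?\d+)$"
)


def parse_ddi_name(name: str) -> Optional[tuple]:
    """Return (pfam_ids, pdb_id, first_region, second_region), or None if the name does not match."""
    m = DDI_NAME.match(name)
    if m is None:
        return None
    first = ChainRegion(m["c1"], int(m["s1"]), int(m["e1"]))
    second = ChainRegion(m["c2"], int(m["s2"]), int(m["e2"]))
    return m["pfam"], m["pdb"].upper(), first, second


print(parse_ddi_name("PF10208_PF00012_6H9U_B_resi130_resi169.A_resi30_resi406"))
```

Names that do not follow this layout (for example DMI prediction names) return `None` instead of raising, so they can be handled by a different parser.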

As input, the .pdb files need to be stored in the folder of the corresponding benchmark set:

```bash
├── DDI
│   ├── known_DDI
│   └── random_DDI
├── DMI
│   ├── known_DMI
│   ├── mutations_DMI
│   └── random_DMI
```

Every structure inside this folder should follow this layout:
```bash
├── DEG_APCC_KENBOX_2_4GGD
│   ├── ranked_0.pdb
│   ├── ranked_1.pdb
│   ├── ranked_2.pdb
│   ├── ranked_3.pdb
│   └── ranked_4.pdb
```
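
A minimal sketch of how this layout could be traversed with `pathlib`, assuming one sub-folder per prediction that contains `ranked_*.pdb` files as shown above. The helper name `list_ranked_models` is hypothetical and not used elsewhere in this notebook.

```python
from pathlib import Path


def list_ranked_models(benchmark_dir: Path) -> dict[str, list[Path]]:
    """Map each prediction folder name to its ranked_*.pdb files."""
    models: dict[str, list[Path]] = {}
    for prediction_dir in sorted(p for p in benchmark_dir.iterdir() if p.is_dir()):
        # Lexicographic sorting is sufficient for ranked_0 ... ranked_4
        ranked = sorted(prediction_dir.glob("ranked_*.pdb"))
        if ranked:
            models[prediction_dir.name] = ranked
    return models
```

Calling it on e.g. the *known_DMI* folder would return one entry per prediction; folders without any `ranked_*.pdb` file are skipped, which makes incomplete runs easy to spot by comparing against the metrics table.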

The experimentally solved structures should have the following folder structure:

```bash
├── DDI
├── DDI_hydrogens
├── DMI
├── DMI_hydrogens
```

with the hydrogen folders containing the structures with added hydrogen.

It is also necessary to provide the output file of the AlphaFold run. For AlphaFold 2, this file was created by Lee et al., while for AlphaFold 3 the <a href="AF3 raw output parsing.ipynb">AF3 raw output parsing.ipynb</a> notebook can generate one. This file should list all AlphaFold predictions together with AlphaFold-derived metrics such as the model confidence and the number of clashes.

By default, you specify a *resource* path that contains an *AF2*/*AF3* folder and a *solved* folder. An *AF3_hydrogens* folder can also be included.

### 0 Imports and Settings

In [43]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.figure import Figure
from pathlib import Path
from sklearn.metrics import roc_curve, roc_auc_score
import re
import tempfile
import shutil
import os
import subprocess
import sys
stdout, stderr = sys.stdout, sys.stderr
from typing import Literal

import pymol
from Bio.SeqUtils import seq1
from Bio.PDB import PDBParser
from Bio.PDB.Structure import Structure as BioPy_PDBStructure
from Bio.PDB.Model import Model as BioPy_PDBModel
from Bio.PDB.Chain import Chain
from Bio.PDB.PDBExceptions import PDBConstructionException
parser = PDBParser(QUIET=True)

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

In [44]:
# Settings

# Which AF output should be parsed
af_mode: Literal["AF2", "AF3"] = "AF3"
# Should the hydrogens folder be used?
af3_hydrogens: bool = True

# Path to resource folder with the structures and metadata tables
path_resources = Path("/Users/imb/Desktop")
# Path to the Luck Drive folder (used for ipSAE metric to get the json file)
path_AF_luck_drive = Path("/Volumes/imb-luckgr/imb-luckgr2/projects/AlphaFold")
if af_mode == "AF3":
    path_AF_luck_drive = path_AF_luck_drive / "AlphaFold3"

# Paths to the local folders
path_AF = path_resources / ((af_mode + "_hydrogens") if (af3_hydrogens and af_mode == "AF3") else af_mode)
path_solved = path_resources / "DMI_solved_structures"

# The path to the ipsae.py
path_ipsae_script = Path("/Users/imb/Desktop/ipsae.py")

# If set to true, load the previous dataframe
load_previous = False

# To parse structures, Pymol should be run headless. However, for debugging the code the GUI may be helpful
pymol_headless = True

In [45]:
if not pymol_headless:
    pymol.finish_launching()
    sys.stdout = stdout  # Needed when debugging to redirect the output back to the notebook
    sys.stderr = stderr

In [46]:
def enhance_dataframe():
    """ Reorder the columns in the dataAF dataframe and convert to an appropriate dtype """
    global dataAF
    # Some columns are integer, but contain None values. Default behaviour of pandas is to use dtype float. Therefore change the dtype to the pd.Int64Dtype allowing None
    for c in ["chainA_start", "chainA_end", "chainB_start", "chainB_end", "num_mutations", "num_align_atoms_domain", "num_align_resi_domain", "hbonds", "salt_bridges", "hydrophobic_interactions", "disulfide_bonds"]:
        if c not in dataAF.columns:
            print(f"Column {bcolors.FAIL}{c}{bcolors.ENDC} not (yet) in data frame")
            continue
        dataAF[c] = dataAF[c].astype(pd.Int64Dtype())

    def _reorder_column(c:list[str], column: str, prev_column: str = None, index:int = None):
        if column not in c:
            print(f"Column {bcolors.FAIL}{column}{bcolors.ENDC} not (yet) in data frame")
            return
        if index is None:
            if prev_column not in c:
                print(f"Column {bcolors.FAIL}{prev_column}{bcolors.ENDC} (used for sorting) not (yet) in data frame")
                return
            index = c.index(prev_column) + 1
        c.remove(column)
        c.insert(index, column)

    # Reordering of the columns
    c = list(dataAF.columns)
    if af_mode == "AF2":
        _reorder_column(c, "run_id", index=1)
        _reorder_column(c, "benchmark_set", index=2)
    elif af_mode == "AF3":
        _reorder_column(c, "benchmark_set", index=1)
    _reorder_column(c, "prediction_name", prev_column="benchmark_set")
    _reorder_column(c, "model_id", prev_column="prediction_name")
    _reorder_column(c, "model_path", prev_column="model_id")
    if af_mode == "AF3":
        _reorder_column(c, "ranking_score", prev_column="model_path")
        _reorder_column(c, "chainA_length", prev_column="ranking_score")
    else:
        _reorder_column(c, "chainA_length", prev_column="model_path")
    _reorder_column(c, "chainB_length", prev_column="chainA_length")
    _reorder_column(c, "chainA_id", prev_column="chainB_length")
    _reorder_column(c, "chainB_id", prev_column="chainA_id")
    _reorder_column(c, "chainA_start", prev_column="chainB_id")
    _reorder_column(c, "chainA_end", prev_column="chainA_start")
    _reorder_column(c, "chainB_start", prev_column="chainA_end")
    _reorder_column(c, "chainB_end", prev_column="chainB_start")
    _reorder_column(c, "PDB_id", prev_column="chainB_end")
    _reorder_column(c, "ELM_instance", prev_column="PDB_id")
    _reorder_column(c, "DDI_pfam_id", prev_column="ELM_instance")
    _reorder_column(c, "PDB_id_random_paired", prev_column="DDI_pfam_id")
    _reorder_column(c, "ELM_instance_random_paired", prev_column="PDB_id_random_paired")
    _reorder_column(c, "DDI_pfam_id_random_paired", prev_column="ELM_instance_random_paired")
    _reorder_column(c, "sequence_initial", prev_column="DDI_pfam_id_random_paired")
    _reorder_column(c, "sequence_mutated", prev_column="sequence_initial")
    _reorder_column(c, "num_mutations", prev_column="sequence_mutated")

    _reorder_column(c, "align_score_domain", prev_column="num_atom_atom_contact")
    _reorder_column(c, "num_align_atoms_domain", prev_column="align_score_domain")
    _reorder_column(c, "num_align_resi_domain", prev_column="num_align_atoms_domain")
    _reorder_column(c, "RMSD_domain", prev_column="num_align_resi_domain")
    _reorder_column(c, "RMSD_backbone_peptide", prev_column="RMSD_domain")
    _reorder_column(c, "RMSD_all_atom_peptide", prev_column="RMSD_backbone_peptide")
    _reorder_column(c, "RMSD_all_atom", prev_column="RMSD_all_atom_peptide")

    _reorder_column(c, "buried_area", prev_column="Fnonnat")
    _reorder_column(c, "min_distance", prev_column="buried_area")
    _reorder_column(c, "disulfide_bonds", prev_column="min_distance")
    _reorder_column(c, "salt_bridges", prev_column="disulfide_bonds")
    _reorder_column(c, "hbonds", prev_column="salt_bridges")
    _reorder_column(c, "hydrophobic_interactions", prev_column="hbonds")
    

    dataAF = dataAF[c]
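
The dtype conversion in `enhance_dataframe` relies on the nullable integer dtype of pandas. The following toy example (made-up values, not taken from the benchmark) sketches why the conversion is needed: a column of integers with a missing value is stored as `float64` by default, while `pd.Int64Dtype()` keeps integers and marks the gap as `<NA>`.

```python
import pandas as pd

# A made-up column with one missing value
counts = pd.Series([3, None, 7], name="hbonds")
print(counts.dtype)  # float64, because None is stored as NaN

# Convert to the nullable integer dtype, as done in enhance_dataframe
nullable = counts.astype(pd.Int64Dtype())
print(nullable.dtype)  # Int64
print(nullable.isna().tolist())  # [False, True, False]
```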
    

In [9]:
# Load data
if load_previous:
    dataAF = pd.read_csv(path_AF / (path_AF.name + "_metrics.tsv"), sep="\t")
else:
    # Read in the AF data
    if af_mode == "AF2":
        dataAF = pd.read_excel(path_AF / "AF2_extension_metrics.xlsx")

        print(dataAF.columns.tolist())

        # Drop columns to recalculate them
        dataAF.drop(columns=["label"], inplace=True)


        # Adding benchmark set column
        benchmark_set_replace_dict = {"1": "mutations_DMI", "2" : "mutations_DMI", "approved minimal DDI": "known_DDI", "known minimal": "known_DMI", "random minimal": "random_DMI", "random minimal DDI": "random_DDI", "known_extension" : "known_extension", "random_extension" : "random_extension"}
        dataAF["benchmark_set"] = None
        dataAF["num_mutations"] = None

        for i, row in dataAF.iterrows():
            if row["mutation_in_motif"] == "1":
                dataAF.at[i, "num_mutations"] = 1
            elif row["mutation_in_motif"] == "2":
                dataAF.at[i, "num_mutations"] = 2
            benchmark_set = benchmark_set_replace_dict[row["mutation_in_motif"]]
            dataAF.at[i, "benchmark_set"] = benchmark_set
        dataAF.drop(columns=["mutation_in_motif"], inplace=True)

    elif af_mode == "AF3":
        dataAF = pd.read_csv(path_resources / "AF3_output.tsv", sep="\t")

        benchmark_set_replace_dict = {"mutations": "mutations_DMI", "known_minimal": "known_DMI", "known_ddi": "known_DDI", "random_minimal": "random_DMI", "random_ddi": "random_DDI", "known_extension" : "known_extension", "random_extension" : "random_extension"}
            
        for i, row in dataAF.iterrows():
            benchmark_set = benchmark_set_replace_dict[row["benchmark_set"]]
            dataAF.at[i, "benchmark_set"] = benchmark_set
enhance_dataframe()
display(dataAF)

Column chainA_start not (yet) in data frame
Column chainA_end not (yet) in data frame
Column chainB_start not (yet) in data frame
Column chainB_end not (yet) in data frame
Column num_mutations not (yet) in data frame
Column num_align_atoms_domain not (yet) in data frame
Column num_align_resi_domain not (yet) in data frame
Column hbonds not (yet) in data frame
Column salt_bridges not (yet) in data frame
Column hydrophobic_interactions not (yet) in data frame
Column disulfide_bonds not (yet) in data frame
Column chainA_id not (yet) in data frame
Column chainB_id not (yet) in data frame
Column chainA_start not (yet) in data frame
Column chainA_end not (yet) in data frame
Column chainB_start not (yet) in data frame
Column chainB_end not (yet) in data frame
Column PDB_id not (yet) in data frame
Column ELM_instance not (yet

Unnamed: 0,model_preset,benchmark_set,prediction_name,model_id,model_path,ranking_score,chainA_length,chainB_length,fraction_disordered,has_clash,...,chainA_intf_avg_plddt,chainB_intf_avg_plddt,intf_avg_plddt,num_chainA_intf_res,num_chainB_intf_res,num_res_res_contact,num_atom_atom_contact,iPAE,pDockQ,chains_flipped
0,alphafold3,known_DMI,DEG_APCC_KENBOX_2_4GGD,ranked_0,AlphaFold_benchmark_DMI/known_minimal/sharp_sh...,0.97,312.0,5.0,0.02,0.0,...,96.21,88.25,94.54,15.0,4.0,25.0,252.0,1.85,0.20,True
1,alphafold3,known_DMI,DEG_APCC_KENBOX_2_4GGD,ranked_1,AlphaFold_benchmark_DMI/known_minimal/sharp_sh...,0.97,312.0,5.0,0.02,0.0,...,96.23,88.12,94.20,15.0,5.0,26.0,263.0,1.85,0.20,True
2,alphafold3,known_DMI,DEG_APCC_KENBOX_2_4GGD,ranked_2,AlphaFold_benchmark_DMI/known_minimal/sharp_sh...,0.96,312.0,5.0,0.02,0.0,...,96.14,86.06,93.49,14.0,5.0,27.0,280.0,2.15,0.20,True
3,alphafold3,known_DMI,DEG_APCC_KENBOX_2_4GGD,ranked_3,AlphaFold_benchmark_DMI/known_minimal/sharp_sh...,0.96,312.0,5.0,0.02,0.0,...,95.48,83.80,92.56,15.0,5.0,26.0,261.0,1.90,0.15,True
4,alphafold3,known_DMI,DEG_APCC_KENBOX_2_4GGD,ranked_4,AlphaFold_benchmark_DMI/known_minimal/sharp_sh...,0.96,312.0,5.0,0.02,0.0,...,95.73,84.93,93.03,15.0,5.0,27.0,271.0,1.95,0.19,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5995,alphafold3,random_DDI,D1PF18773_PF00071_2X19.D2PF00009_PF01873_2D74,ranked_0,AlphaFold_benchmark_DDI/random_ddi/angry_sange...,0.36,60.0,113.0,0.22,0.0,...,54.62,64.94,59.98,12.0,13.0,24.0,130.0,12.30,0.04,False
5996,alphafold3,random_DDI,D1PF18773_PF00071_2X19.D2PF00009_PF01873_2D74,ranked_1,AlphaFold_benchmark_DDI/random_ddi/angry_sange...,0.23,60.0,113.0,0.08,0.0,...,55.21,72.04,64.46,9.0,11.0,18.0,120.0,18.69,0.04,False
5997,alphafold3,random_DDI,D1PF18773_PF00071_2X19.D2PF00009_PF01873_2D74,ranked_2,AlphaFold_benchmark_DDI/random_ddi/angry_sange...,0.22,60.0,113.0,0.14,0.0,...,48.38,51.87,50.40,8.0,11.0,20.0,141.0,22.10,0.03,False
5998,alphafold3,random_DDI,D1PF18773_PF00071_2X19.D2PF00009_PF01873_2D74,ranked_3,AlphaFold_benchmark_DDI/random_ddi/angry_sange...,0.21,60.0,113.0,0.07,0.0,...,52.31,61.57,56.68,19.0,17.0,39.0,290.0,21.80,0.06,False


In [10]:
# Read in solved structure data

dataSolved = pd.DataFrame(columns=["set", "PDB_id", "DDI_pfam_id", "path", "chainA_id", "chainB_id"])

# DMI
for structure_file in [p for p in path_solved.iterdir() if p.is_file() and p.suffix == ".pdb"]:
    pdb_id = structure_file.name.split("_")[0]
    dataSolved.loc[len(dataSolved)] = {"set" : "DMI", "PDB_id": pdb_id, "path": structure_file.relative_to(path_solved), "chainA_id": "A", "chainB_id": "B"}

# DDI
"""
for structure_file in [p for p in path_solved.iterdir() if p.is_file() and p.suffix == ".pdb"]:
    ddi_pfam_id = "_".join(structure_file.name.split("_")[0:2])
    pdb_id = structure_file.name.split("_")[2]
    chainA_id = structure_file.name.split("_")[3][0]
    chainB_id = structure_file.name.split("_")[3][1]
    dataSolved.loc[len(dataSolved)] = {"set" : "DDI", "PDB_id": pdb_id, "DDI_pfam_id": ddi_pfam_id, "path": structure_file.relative_to(path_solved), "chainA_id": chainA_id, "chainB_id": chainB_id}
"""
display(dataSolved)

Unnamed: 0,set,PDB_id,DDI_pfam_id,path,chainA_id,chainB_id
0,DMI,2JK9,,2JK9_min_DMI.pdb,A,B
1,DMI,1B8Q,,1B8Q_min_DMI.pdb,A,B
2,DMI,4GGD,,4GGD_min_DMI.pdb,A,B
3,DMI,5F74,,5F74_min_DMI.pdb,A,B
4,DMI,1DDV,,1DDV_min_DMI.pdb,A,B
...,...,...,...,...,...,...
133,DMI,1UTC,,1UTC_min_DMI.pdb,A,B
134,DMI,3GM1,,3GM1_min_DMI.pdb,A,B
135,DMI,1NTV,,1NTV_min_DMI.pdb,A,B
136,DMI,1ZUB,,1ZUB_min_DMI.pdb,A,B


### 1 Parsing the file names
Much of the information (PDB ID, mutation sequence, ...) is included in the filename. This section parses it and adds it to the metrics data frame. The detected values include:
* **PDB_id** (all structures): RCSB Protein Data Bank ID
* **ELM_instance** (DMI): ID of the motif in the Eukaryotic Linear Motif database
* **DDI_pfam_id** (DDI): ID of the two domains in the Pfam Database separated by an underscore
* **PDB_id_random_paired**, **ELM_instance_random_paired** and **DDI_pfam_id_random_paired** (random_DMI, random_DDI): The ID of the second domain/motif for the randomly paired DMI and DDI. For DMI, the random fields always encode the motif.
* **sequence_initial** and **sequence_mutated** (mutations_DMI): For the mutations, these fields encode the initial and mutated sequence of the motif
* **chainA_id**, **chainA_start**, **chainA_end** and the same three for **chainB** (all structures where data is available): The start and end residue as well as the ID of the chains in the experimentally solved structure. For known_DDI, the information is encoded in the prediction name and carried over to the random_DDI set. For DMI structures, the information will be added in section 2 by aligning with the template file. Chain A and B refer to the IDs in the AlphaFold predictions. For DMI, chain B is always the motif chain.
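
As a small illustration of the fields listed above, the sketch below splits a DMI prediction name into **ELM_instance** and **PDB_id**. It assumes the name has the form `<ELM_instance>_<PDB_id>` with a four-character PDB ID after the last underscore, which matches names such as `DEG_APCC_KENBOX_2_4GGD` but is not guaranteed for every benchmark set. The helper `split_dmi_name` is hypothetical.

```python
def split_dmi_name(prediction_name: str) -> tuple[str, str]:
    """Split '<ELM_instance>_<PDB_id>' at the last underscore (assumed layout)."""
    elm_instance, _, pdb_id = prediction_name.rpartition("_")
    if not elm_instance or len(pdb_id) != 4:
        raise ValueError(f"Unexpected prediction name: {prediction_name!r}")
    return elm_instance, pdb_id.upper()


print(split_dmi_name("DEG_APCC_KENBOX_2_4GGD"))

# Randomly paired names contain two such parts separated by a dot
random_name = "MLIG_MYND_2_2ODD.DMOD_SUMO_for_1_1KPS"
print([split_dmi_name(part) for part in random_name.split(".")])
```

Which of the two parts of a randomly paired name fills the `*_random_paired` columns is decided by the parsing code of this notebook, not by this sketch.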

Note: known_extensions were excluded from the benchmark set. If you need to parse them, remove the comments in the code cell below. Be aware that this code still needs to be tested.

In [11]:
# Simplified regex pattern for extension samples
import re

# Pattern for known extensions - handles all the variations we've seen
known_extension_pattern = r"^(.+?)_[Mm](\d+|min|fl|FL|Min|Fl)(?:_[Mm](\d+|min|fl|FL|Min|Fl))?_[Dd](\d+|min|fl|FL|Min|Fl)(?:_[Dd](\d+|min|fl|FL|Min|Fl))?$"

def parse_known_extension_name(prediction_name):
    """Parse known extension prediction name and return extracted components"""
    match = re.search(known_extension_pattern, prediction_name)
    if not match:
        return None, None, None, None, None, None, None
    
    groups = match.groups()
    elm_instance = groups[0]  # Base name
    
    # Handle motif positions (M/m) - first is always present
    m_start = groups[1] if groups[1] else None
    m_end = groups[2] if groups[2] else None  # May be None for single M format
    
    # Handle domain positions (D/d) - first is always present  
    d_start = groups[3] if groups[3] else None
    d_end = groups[4] if groups[4] else None  # May be None for single D format
    
    # Convert numeric strings to integers
    def convert_to_int(value):
        if value and value.isdigit():
            return int(value)
        return None
    
    c1_start = convert_to_int(m_start)
    c1_end = convert_to_int(m_end)
    c2_start = convert_to_int(d_start)
    c2_end = convert_to_int(d_end)
    
    return elm_instance, c1_start, c1_end, c2_start, c2_end, "A", "B"

# Initialize columns for known extensions only
columns_to_add = [
    "ELM_instance", "chainA_id", "chainB_id",
    "chainA_start", "chainA_end", "chainB_start", "chainB_end"
]

for col in columns_to_add:
    if col not in dataAF.columns:
        dataAF[col] = None

print("Processing known extension samples only...")

successful_matches = 0
failed_matches = []

# Process only known extension samples
known_extension_mask = dataAF["benchmark_set"] == "known_extension"
known_extension_data = dataAF[known_extension_mask]

for i, row in known_extension_data.iterrows():
    prediction_name = row["prediction_name"]
    
    # Parse the prediction name
    elm_instance, c1_start, c1_end, c2_start, c2_end, chain1, chain2 = parse_known_extension_name(prediction_name)
    
    if elm_instance is not None:
        # Update the dataframe
        dataAF.at[i, "ELM_instance"] = elm_instance
        dataAF.at[i, "chainA_id"] = chain1
        dataAF.at[i, "chainB_id"] = chain2
        dataAF.at[i, "chainA_start"] = c1_start
        dataAF.at[i, "chainA_end"] = c1_end
        dataAF.at[i, "chainB_start"] = c2_start
        dataAF.at[i, "chainB_end"] = c2_end
        dataAF.at[i,"DDI_pfam_id"] = None
        dataAF.at[i, "PDB_id_random_paired"] = None
        dataAF.at[i, "ELM_instance_random_paired"] = None
        dataAF.at[i, "DDI_pfam_id_random_paired"] = None
        
        
        successful_matches += 1
    else:
        failed_matches.append(prediction_name)

# Filter to only keep known extension samples
dataAF = dataAF[known_extension_mask].copy()  # copy so that the column assignment below does not act on a view

af3_metrics_old = pd.read_csv(path_resources/'af3_metrics.tsv', sep='\t')
dataAF['PDB_id'] = dataAF['ELM_instance'].map(dict(zip(af3_metrics_old['ELM_instance'], af3_metrics_old['PDB_id'])))

print(f"\nMatching Results for Known Extensions:")
print(f"Successful matches: {successful_matches}")
print(f"Failed matches: {len(failed_matches)}")

if failed_matches:
    print(f"\nFailed matches:")
    for filename in failed_matches[:20]:  # Show first 20 failures
        print(f"  {filename}")
    if len(failed_matches) > 20:
        print(f"  ... and {len(failed_matches) - 20} more")

print(f"\nTotal known extension rows: {len(dataAF)}")
if len(dataAF) > 0:
    print("\nSample processed data:")
    display(dataAF.head())

display(dataAF)


Processing known extension samples only...

Matching Results for Known Extensions:
Successful matches: 2820
Failed matches: 0

Total known extension rows: 2820

Sample processed data:


Unnamed: 0,model_preset,benchmark_set,prediction_name,model_id,model_path,ranking_score,chainA_length,chainB_length,fraction_disordered,has_clash,...,chainB_id,chainA_start,chainA_end,chainB_start,chainB_end,DDI_pfam_id,PDB_id_random_paired,ELM_instance_random_paired,DDI_pfam_id_random_paired,PDB_id
2710,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_0,AlphaFold_benchmark_DMI/known_extension/nostal...,1.0,,,,,...,B,15,39,1,499,,,,,4GGD
2711,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_1,AlphaFold_benchmark_DMI/known_extension/nostal...,0.9,,,,,...,B,15,39,1,499,,,,,4GGD
2712,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_2,AlphaFold_benchmark_DMI/known_extension/nostal...,0.8,,,,,...,B,15,39,1,499,,,,,4GGD
2713,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_3,AlphaFold_benchmark_DMI/known_extension/nostal...,0.7,,,,,...,B,15,39,1,499,,,,,4GGD
2714,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_4,AlphaFold_benchmark_DMI/known_extension/nostal...,0.6,,,,,...,B,15,39,1,499,,,,,4GGD


Unnamed: 0,model_preset,benchmark_set,prediction_name,model_id,model_path,ranking_score,chainA_length,chainB_length,fraction_disordered,has_clash,...,chainB_id,chainA_start,chainA_end,chainB_start,chainB_end,DDI_pfam_id,PDB_id_random_paired,ELM_instance_random_paired,DDI_pfam_id_random_paired,PDB_id
2710,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_0,AlphaFold_benchmark_DMI/known_extension/nostal...,1.0,,,,,...,B,15,39,1,499,,,,,4GGD
2711,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_1,AlphaFold_benchmark_DMI/known_extension/nostal...,0.9,,,,,...,B,15,39,1,499,,,,,4GGD
2712,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_2,AlphaFold_benchmark_DMI/known_extension/nostal...,0.8,,,,,...,B,15,39,1,499,,,,,4GGD
2713,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_3,AlphaFold_benchmark_DMI/known_extension/nostal...,0.7,,,,,...,B,15,39,1,499,,,,,4GGD
2714,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_4,AlphaFold_benchmark_DMI/known_extension/nostal...,0.6,,,,,...,B,15,39,1,499,,,,,4GGD
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5525,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_0,AlphaFold_benchmark_DMI/known_extension/sponta...,1.0,,,,,...,B,,,,,,,,,2G30
5526,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_1,AlphaFold_benchmark_DMI/known_extension/sponta...,0.9,,,,,...,B,,,,,,,,,2G30
5527,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_2,AlphaFold_benchmark_DMI/known_extension/sponta...,0.8,,,,,...,B,,,,,,,,,2G30
5528,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_3,AlphaFold_benchmark_DMI/known_extension/sponta...,0.7,,,,,...,B,,,,,,,,,2G30


### 2 Adding domain and motif start / end from template file
For DDI structures, the selection start and end are included in the filename. For DMI structures, the filename carries no information about the start and end of the motif and domain. The DMI structures are cut to the minimal domain/motif, but the experimental structures may still contain mutations or missing residues. To restore the boundaries, the template is searched for runs of three consecutive residues that also occur in the AlphaFold chain, and the offset between the residue IDs is recorded for each match. The most common offset is then used to derive the start and end. The fraction of template triplets found in the AlphaFold chain is reported as a score, and scores below 0.5 are flagged in the output.
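
The offset search can be illustrated on plain one-letter sequences. This is a simplified sketch, not the function used by the notebook: it assumes that the AlphaFold chain is numbered from 1 and that the template is numbered contiguously starting at `tpl_first_resid`, whereas the notebook reads the residue IDs from the structures. The name `estimate_offset` and the toy sequences are made up for the example.

```python
from collections import Counter


def estimate_offset(seq_af: str, seq_tpl: str, tpl_first_resid: int):
    """Return (offset, score): template residue ID minus AlphaFold residue ID, and the matched triplet fraction."""
    offsets = []
    missed = 0
    n_triplets = len(seq_tpl) - 2
    for t in range(n_triplets):
        triplet = seq_tpl[t:t + 3]
        found = False
        for a in range(len(seq_af) - 2):
            if seq_af[a:a + 3] == triplet:
                # Compare the IDs of the middle residues of both triplets
                offsets.append((tpl_first_resid + t + 1) - (a + 2))
                found = True
        if not found:
            missed += 1
    if not offsets:
        return None, 0.0
    offset = Counter(offsets).most_common(1)[0][0]
    return offset, 1 - missed / n_triplets


offset, score = estimate_offset("KENLQ", "AKENLQG", tpl_first_resid=10)
print(offset, score)
print("start:", offset + 1, "end:", offset + len("KENLQ"))
```

In this toy case three of the five template triplets are found, all with the same offset of 10, so the AlphaFold residues 1 to 5 map to template residues 11 to 15.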

In [12]:
def align_sequences(chain_af:  Chain, chain_template: Chain) -> tuple[int, int, float, str, str]:
    """ Estimate the residue ID offset between two chains based on a local alignment of neighbouring residues (BioPython has no convenient alignment function).
    
        :returns tuple[int, int, float, str, str]: Start ID, end ID, score, sequence of the AlphaFold chain, sequence of the template chain
    """
    residues_af = [r for r in chain_af.get_residues()]
    residues_tpl = [r for r in chain_template.get_residues()]
    seq_af = seq1(''.join([r.resname for r in residues_af]))
    seq_tpl = seq1(''.join([r.resname for r in residues_tpl]))
    offset_list = []

    misscounts = 0
    for t0, t1, t2 in zip(residues_tpl[:-2], residues_tpl[1:-1], residues_tpl[2:]):
        _found = False
        for a0, a1, a2 in zip(residues_af[:-2], residues_af[1:-1], residues_af[2:]):
            if a0.resname == t0.resname and a1.resname == t1.resname and a2.resname == t2.resname:
                offset_list.append(t1.id[1] - a1.id[1])
                _found = True
        if not _found:
            misscounts += 1

    # For degenerated short chains (motif) use no neighbours for matching
    # if len(offset_list) == 0:
    #     for r1 in residues_af:
    #         for r2 in residues_tpl:
    #             if r1.resname == r2.resname:
    #                 offset_list.append(r2.id[1] - r1.id[1])

    if len(offset_list) == 0:
        return (None, None, 0, seq_af, seq_tpl)
    offsets, counts = np.unique(offset_list, return_counts=True)
    offset = offsets[np.argmax(counts)]
    score = 1 - misscounts/(len(residues_tpl) - 2)
    return  offset + 1, offset + len(residues_af), score, seq_af, seq_tpl

for i, row in dataAF[dataAF["benchmark_set"].isin(["known_extension"])].iterrows():
    pdb_id = str(row["PDB_id"])
    pdb_id_2 = None
    #if row["PDB_id_random_paired"] is not None:
        #pdb_id_2 = str(row["PDB_id_random_paired"])
    prediction_name = row["prediction_name"]
    benchmark_set = row["benchmark_set"]
    model_id = row["model_id"]

    if model_id == "ranked_0":
        print(bcolors.OKBLUE + f"{prediction_name} ({benchmark_set})" + bcolors.ENDC)

    #if not prediction_name == "MLIG_MYND_2_2ODD.DMOD_SUMO_for_1_1KPS": continue

    af_path = path_resources / "DMI" / benchmark_set / prediction_name / (model_id + ".pdb")
    af_biopy = parser.get_structure("structure", file=af_path)[0]
    chainA_af = af_biopy["A"]
    chainB_af = af_biopy["B"]    

    template1_path = path_solved / (pdb_id + "_min_DMI.pdb")
    if not template1_path.exists():
        print(f"\t", bcolors.WARNING + f"{prediction_name} has no template file for {pdb_id}" + bcolors.ENDC)
        continue
    template1_biopy = parser.get_structure("structure", file=template1_path)[0]
    chainA_tlp = template1_biopy["A"]
    if pdb_id_2 is not None:
        template2_path = path_solved / "DMI" / (pdb_id_2 + "_min_DMI.pdb")
        if not template2_path.exists():
            print(f"\t", f"{prediction_name} has no template file for {pdb_id_2}")
            continue
        template2_biopy = parser.get_structure("structure", file=template2_path)[0]
        chainB_tlp = template2_biopy["B"]
    else:
        chainB_tlp = template1_biopy["B"]

    chainA_start, chainA_end, chainA_score, seqA_af, seqA_tpl = align_sequences(chain_af=chainA_af, chain_template=chainA_tlp)
    if chainA_start is not None:
        if model_id == "ranked_0":
            print("\t", f"chainA: {chainA_start}-{chainA_end} ({bcolors.WARNING if chainA_score < 0.5 else ''}{chainA_score:0.3f}{bcolors.ENDC})")
        dataAF.at[i, "chainA_start"] =  chainA_start
        dataAF.at[i, "chainA_end"] =  chainA_end
    else:
        if model_id == "ranked_0":
            print(f"\t", bcolors.WARNING + "Chain A alignment failed" + bcolors.ENDC)
    if model_id == "ranked_0" and chainA_score < 0.5:
        print("\t\t", seqA_af)
        print("\t\t", seqA_tpl)

    chainB_start, chainB_end, chainB_score, seqB_af, seqB_tpl = align_sequences(chain_af=chainB_af, chain_template=chainB_tlp)
    if chainB_start is not None:
        if model_id == "ranked_0":
            print("\t", f"chainB: {chainB_start}-{chainB_end} ({bcolors.WARNING if chainB_score < 0.5 else ''}{chainB_score:0.3f}{bcolors.ENDC})")
        dataAF.at[i, "chainB_start"] =  chainB_start
        dataAF.at[i, "chainB_end"] =  chainB_end
    else:
        if model_id == "ranked_0":
            print(f"\t", bcolors.WARNING + "Chain B alignment failed" + bcolors.ENDC)
    if model_id == "ranked_0" and chainB_score < 0.5:
        print("\t\t", seqB_af)
        print("\t\t", seqB_tpl)
    
    

# For the mutations, the alignment mostly fails. For those restore the information using the known_DMI dataset
for i, row in dataAF[dataAF["benchmark_set"].isin(["mutations_DMI"])].iterrows():
    prediction_name = row["prediction_name"]
    benchmark_set = row["benchmark_set"]
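    pdb_id = row["PDB_id"]
    model_id = row["model_id"]  # Needed for the messages below; otherwise the stale value from the previous loop is used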
    pdb_id_2 = row["PDB_id_random_paired"] if row["PDB_id_random_paired"] is not None else pdb_id
    
    if len(list((_row1 := dataAF[np.logical_and(dataAF["benchmark_set"] == "known_DMI", dataAF["PDB_id"] == pdb_id)])["chainA_id"])) == 0:
        if model_id == "ranked_0":
            print(f"Can't find {pdb_id} from {prediction_name} ({benchmark_set}, chain A) in the known_DMI set")
        continue
    if len(list((_row2 := dataAF[np.logical_and(dataAF["benchmark_set"] == "known_DMI", dataAF["PDB_id"] == pdb_id_2)])["chainB_id"])) == 0:
        if model_id == "ranked_0":
            print(f"Can't find {pdb_id_2} from {prediction_name} ({benchmark_set}, chain B) in the known_DMI set")
        continue
    dataAF.at[i, "chainA_start"] = list(_row1["chainA_start"])[0]
    dataAF.at[i, "chainA_end"] = list(_row1["chainA_end"])[0]
    dataAF.at[i, "chainB_start"] = list(_row2["chainB_start"])[0]
    dataAF.at[i, "chainB_end"] = list(_row2["chainB_end"])[0]

DEG_APCC_KENBOX_2_M15_M39_D1_D499 (known_extension)
	 chainA: 1-499 (1.000)
	 chainB: -4-20 (1.000)
DEG_APCC_KENBOX_2_M15_M39_Dmin (known_extension)
	 chainA: 170-476 (0.984)
	 chainB: -4-20 (1.000)
DEG_APCC_KENBOX_2_M1_M189_D1_D499 (known_extension)
	 chainA: 1-499 (1.000)
	 chainB: -18-170 (1.000)
DEG_APCC_KENBOX_2_M1_M189_Dmin (known_extension)
	 chainA: 170-476 (0.984)
	 chainB: -18-170 (1.000)
DEG_APCC_KENBOX_2_M1_M57_D1_D499 (known_extension)
	 chainA: 1-499 (1.000)
	 chainB: -18-38 (1.000)
DEG_APCC_KENBOX_2_M1_M57_Dmin (known_extension)
	 chainA: 170-476 (0.984)
	 chainB: -18-38 (1.000)
DEG_APCC_KENBOX_2_M20_M34_D1_D499 (known_extension)
	 chainA: 1-499 (1.000)
	 chainB: 1-15 (1.000)
DEG_APCC_KENBOX_2_M20_M34_Dmin (known_extension)
	 chainA: 170-476 (0.984)
	 chainB: 1-15 (1.000)
DEG_APCC_KENBOX_2_M5_M49_D1_D499 (known_extension)
	 chai

In [13]:
dataAF[dataAF["model_id"] == "ranked_0"]


Unnamed: 0,model_preset,benchmark_set,prediction_name,model_id,model_path,ranking_score,chainA_length,chainB_length,fraction_disordered,has_clash,...,chainB_id,chainA_start,chainA_end,chainB_start,chainB_end,DDI_pfam_id,PDB_id_random_paired,ELM_instance_random_paired,DDI_pfam_id_random_paired,PDB_id
2710,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_0,AlphaFold_benchmark_DMI/known_extension/nostal...,1.0,,,,,...,B,1,499,-4,20,,,,,4GGD
2715,,known_extension,DEG_APCC_KENBOX_2_M15_M39_Dmin,ranked_0,AlphaFold_benchmark_DMI/known_extension/nostal...,1.0,,,,,...,B,170,476,-4,20,,,,,4GGD
2720,,known_extension,DEG_APCC_KENBOX_2_M1_M189_D1_D499,ranked_0,AlphaFold_benchmark_DMI/known_extension/nostal...,1.0,,,,,...,B,1,499,-18,170,,,,,4GGD
2725,,known_extension,DEG_APCC_KENBOX_2_M1_M189_Dmin,ranked_0,AlphaFold_benchmark_DMI/known_extension/nostal...,1.0,,,,,...,B,170,476,-18,170,,,,,4GGD
2730,,known_extension,DEG_APCC_KENBOX_2_M1_M57_D1_D499,ranked_0,AlphaFold_benchmark_DMI/known_extension/nostal...,1.0,,,,,...,B,1,499,-18,38,,,,,4GGD
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5505,,known_extension,TRG_AP2beta_CARGO_1_M38_M308_Dmin,ranked_0,AlphaFold_benchmark_DMI/known_extension/sponta...,1.0,,,,,...,B,706,937,-213,57,,,,,2G30
5510,,known_extension,TRG_AP2beta_CARGO_1_MFL_Dmin,ranked_0,AlphaFold_benchmark_DMI/known_extension/sponta...,1.0,,,,,...,B,706,937,-250,57,,,,,2G30
5515,,known_extension,TRG_AP2beta_CARGO_1_Mmin_D4_D937,ranked_0,AlphaFold_benchmark_DMI/known_extension/sponta...,1.0,,,,,...,B,4,937,5,15,,,,,2G30
5520,,known_extension,TRG_AP2beta_CARGO_1_Mmin_D542_D937,ranked_0,AlphaFold_benchmark_DMI/known_extension/sponta...,1.0,,,,,...,B,542,937,5,15,,,,,2G30


In [14]:
print(dataAF[dataAF["chainA_end"].isna()])



Empty DataFrame
Columns: [model_preset, benchmark_set, prediction_name, model_id, model_path, ranking_score, chainA_length, chainB_length, fraction_disordered, has_clash, iptm, ptm, chainA_intf_avg_plddt, chainB_intf_avg_plddt, intf_avg_plddt, num_chainA_intf_res, num_chainB_intf_res, num_res_res_contact, num_atom_atom_contact, iPAE, pDockQ, chains_flipped, ELM_instance, chainA_id, chainB_id, chainA_start, chainA_end, chainB_start, chainB_end, DDI_pfam_id, PDB_id_random_paired, ELM_instance_random_paired, DDI_pfam_id_random_paired, PDB_id]
Index: []

[0 rows x 34 columns]


### 3 RMSD with PyMOL

For all structures, calculate the overall RMSD:
- **RMSD_all_atom**: RMSD after aligning the whole structure

For DMI, align the domains (chain A) first:
- **align_score_domain**: Score of the domain alignment
- **num_align_atoms_domain** and **num_align_resi_domain**: Number of aligned atoms/residues of the domain
- **RMSD_domain**: RMSD of the domain (chain A) after aligning on the domain

For known_DMI and mutations_DMI (excluding the mutated residues):
- **RMSD_backbone_peptide** and **RMSD_all_atom_peptide**: RMSD of the motif chain (chain B) after aligning on the domain

For DDI, use the longer chain (or chain A if both chains have equal length) as the domain and treat the shorter one as the peptide. Then apply the same definitions as for DMI.
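The idea of "align on the domain, then measure the peptide without refitting" can be illustrated with a small NumPy sketch. It uses a plain Kabsch superposition on pre-matched coordinate arrays, which is an assumption made for illustration only: PyMOL's `align` additionally performs a sequence alignment to decide which atoms are paired. The function names are hypothetical and not part of this notebook.

```python
import numpy as np


def kabsch(mobile: np.ndarray, target: np.ndarray):
    """Return rotation R and translation t that best map mobile onto target."""
    mob_center = mobile.mean(axis=0)
    tgt_center = target.mean(axis=0)
    # Covariance matrix of the centred coordinate sets
    H = (mobile - mob_center).T @ (target - tgt_center)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (no reflection)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_center - R @ mob_center
    return R, t


def rmsd(a: np.ndarray, b: np.ndarray) -> float:
    """Root-mean-square deviation of two equally shaped (N, 3) arrays."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))


def domain_then_peptide_rmsd(af_domain, solved_domain, af_peptide, solved_peptide):
    """Fit on the domain only, then report domain and peptide RMSD in that frame."""
    R, t = kabsch(af_domain, solved_domain)
    fitted_domain = af_domain @ R.T + t
    fitted_peptide = af_peptide @ R.T + t  # same transform, no refitting
    return rmsd(fitted_domain, solved_domain), rmsd(fitted_peptide, solved_peptide)
```

Because the peptide is never fitted on its own, a motif that binds at the wrong site yields a large peptide RMSD even when the domain RMSD is small.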

In [15]:
# Calculate the RMSD-related values using PyMOL
import pymol  # binds the name "pymol", which the pymol.cmd.* calls below rely on
import pymol.cmd as cmd

dataAF["align_score_domain"] = None
dataAF["num_align_atoms_domain"] = None
dataAF["num_align_resi_domain"] = None
dataAF["RMSD_all_atom"] = None
dataAF["RMSD_domain"] = None
dataAF["RMSD_backbone_peptide"] = None
dataAF["RMSD_all_atom_peptide"] = None

for i,row in dataAF.iterrows():
    benchmark_set = str(row["benchmark_set"])
    _set = "DDI" if "DDI" in benchmark_set else "DMI"
    pdb_id = str(row["PDB_id"]) if row.notnull()["PDB_id"] else None
    pdb_id_2 = None
    ddi_pfam_id = None
    ddi_pfam_id_2 = None
    prediction_name = str(row["prediction_name"]) if row.notnull()["prediction_name"] else None
    model_id = str(row["model_id"]) if row.notnull()["model_id"] else None
    chainA_id = str(row["chainA_id"]) if row.notnull()["chainA_id"] else None
    chainB_id = str(row["chainB_id"]) if row.notnull()["chainB_id"] else None
    chainA_start = int(row["chainA_start"]) if row.notnull()["chainA_start"] else None
    chainB_start = int(row["chainB_start"]) if row.notnull()["chainB_start"] else None
    chainA_end = int(row["chainA_end"]) if row.notnull()["chainA_end"] else None
    chainB_end = int(row["chainB_end"]) if row.notnull()["chainB_end"] else None
    chainA_length = int(row["chainA_length"]) if row.notnull()["chainA_length"] else None
    chainB_length = int(row["chainB_length"]) if row.notnull()["chainB_length"] else None

    if model_id == "ranked_0":
        pymol.cmd.reinitialize() 
        print(f"{bcolors.OKBLUE}{prediction_name} ({benchmark_set}){bcolors.ENDC}")    

    structure_path = path_resources / "DMI" / benchmark_set / prediction_name / (model_id + ".pdb")
    
    if not structure_path.exists():
        print(f"\t{bcolors.FAIL}{prediction_name} ({benchmark_set}) does not exist.{bcolors.ENDC} Skip RMSD calculation")
        continue


   
    template_row = dataSolved.loc[np.logical_and(dataSolved["set"] == _set, np.logical_and(dataSolved["PDB_id"] == pdb_id, np.logical_or(dataSolved["DDI_pfam_id"] == ddi_pfam_id, dataSolved["DDI_pfam_id"].isna())))]
    if len(template_row) == 0:
        print(f"\t{bcolors.FAIL}Can't find template structure for {prediction_name} ({benchmark_set}) and PDB ID {pdb_id}.{bcolors.ENDC} Skip RMSD calculation")
        continue
    elif len(template_row) >= 2:
        print(f"\t{bcolors.FAIL}Multiple template structures found for {prediction_name} ({benchmark_set}) and PDB ID {pdb_id}.{bcolors.ENDC} Skip RMSD calculation")
        continue
        
    template_path = path_solved / str(template_row["path"].item())
    

    template2_path = None
    if pdb_id_2 is not None:
        template2_row = dataSolved.loc[np.logical_and(dataSolved["set"] == _set, np.logical_and(dataSolved["PDB_id"] == pdb_id_2, np.logical_or(dataSolved["DDI_pfam_id"] == ddi_pfam_id_2, dataSolved["DDI_pfam_id"].isna())))]
        if len(template2_row) == 0:
            print(f"\t{bcolors.FAIL}Can't find template structure for {prediction_name} ({benchmark_set}) and PDB ID {pdb_id_2}.{bcolors.ENDC} Skip RMSD calculation")
            continue
        elif len(template2_row) >= 2:
            print(f"\t{bcolors.FAIL}Multiple template structures found for {prediction_name} ({benchmark_set}) and PDB ID {pdb_id_2}.{bcolors.ENDC} Skip RMSD calculation")
            continue

        template2_path = path_resources / "solved" / str(template2_row["path"].item())

    
    #pymol.cmd.reinitialize() # Usually not needed, and it slows performance down significantly
    # Remove all objects left over from the previous iteration
    pymol.cmd.delete("all")
    pymol.cmd.sort()

    # First loading the structures. Use two temporary objects to allow renaming the chains even if the chains have the same name or have switched IDs

    pymol.cmd.load(structure_path, "af")


    if template2_path is not None:
        # Updating the object is possible, but turned out to be unstable
        pymol.cmd.load(template_path, "solvedA")
        pymol.cmd.load(template2_path, "solvedB")
        pymol.cmd.create("solved1", f"solvedA and chain {chainA_id}")
        pymol.cmd.create("solved2", f"solvedB and chain {chainB_id}")
        pymol.cmd.delete("solvedA")
        pymol.cmd.delete("solvedB")
    else:
        pymol.cmd.load(template_path, "solvedraw")
        pymol.cmd.create("solved1", f"solvedraw and chain {chainA_id}")
        pymol.cmd.sort()
        pymol.cmd.create("solved2", f"solvedraw and chain {chainB_id}")
        pymol.cmd.delete("solvedraw")


    
    pymol.cmd.sort()
    # Now rename the chains and create merged object
    pymol.cmd.alter(f"solved1 and chain {chainA_id}", "chain = 'A'")
    pymol.cmd.sort()
    pymol.cmd.alter(f"solved2 and chain {chainB_id}", "chain = 'B'")
    pymol.cmd.sort()
    pymol.cmd.create("solved", f"solved1 or solved2")
    pymol.cmd.delete("solved1")
    pymol.cmd.delete("solved2")
    pymol.cmd.sort()

    # Remove hydrogens and hetatm
    #pymol.cmd.remove(selection="elem 'H' or hetatm")
    pymol.cmd.remove(selection="not backbone and not sidechain or elem 'H'")
    pymol.cmd.sort()

    # Remove alternate location identifiers
    pymol.cmd.remove("not alt ''+A") # Using the +A syntax to only affect atoms with an alternate location identifier set
    pymol.cmd.sort()
    pymol.cmd.alter("all", "alt=''")
    pymol.cmd.sort()

    # Slice the chains to the known start/end residues. Both AF chains are renumbered with the template offsets, as PyMOL's rms_cur command requires matching residue numbers
    if chainA_start is not None and chainB_start is not None:
        pymol.cmd.create("solved", f"solved and ((chain A and resi {chainA_start}-{chainA_end}) or (chain B and resi {chainB_start}-{chainB_end}))", source_state=0, target_state=0)
        pymol.cmd.sort()
        offsetA = chainA_start - 1
        pymol.cmd.alter("af and chain A", f"resi = (int(resi) + {offsetA})")
        pymol.cmd.sort()

        offsetB = chainB_start - 1
        pymol.cmd.alter("af and chain B", f"resi = (int(resi) + {offsetB})")
        pymol.cmd.sort()
    else:
        print(f"\t{bcolors.FAIL}Can't find information about the chain start/end in the template.{bcolors.ENDC} This may lead to wrong RMSD peptide values, so skip")
        continue

    pymol.cmd.sort()

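    # DDI: use the longer chain as the domain (chain A if the lengths are equal or unknown)
    chain_align_1, chain_align_2 = "A", "B"
    if _set == "DDI" and None not in (chainA_length, chainB_length) and chainB_length > chainA_length:
        chain_align_1, chain_align_2 = "B", "A"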

    #For debugging
    #space = {'solved_resi': [], "af_resi": []}
    #pymol.cmd.iterate("solved and chain B", "solved_resi.append(int(resi))", space=space)
    #pymol.cmd.iterate("af and chain B", "af_resi.append(int(resi))", space=space)

    #    0: RMSD after refinement
    #    1: Number of aligned atoms after refinement
    #    2: Number of refinement cycles
    #    3: RMSD before refinement
    #    4: Number of aligned atoms before refinement
    #    5: Raw alignment score
    #    6: Number of residues aligned
    # Cycles = 0 to prevent rejection of outliers
    align_output_1 = pymol.cmd.align(mobile=f"af and chain {chain_align_1}", target=f"solved and chain {chain_align_1}", object="algn_domain", cycles=0)
    pymol.cmd.sort()
    RMSD_domain = align_output_1[0]
    num_align_atoms_domain = align_output_1[1]
    align_score_domain = align_output_1[5]
    num_align_resi_domain = align_output_1[6]

    RMSD_backbone_peptide = pymol.cmd.rms_cur(mobile=f"af and chain {chain_align_2} and bb.", target=f"solved and chain {chain_align_2} and bb.", object="peptide_super_bb", cycles=0)
    RMSD_all_atom_peptide = pymol.cmd.rms_cur(mobile=f"af and chain {chain_align_2}", target=f"solved and chain {chain_align_2}", object="peptide_super_all_atoms", cycles=0)
    
    align_output_all = pymol.cmd.align(mobile="af", target="solved", object="algn_all", cycles=0)
    RMSD_all_atoms = align_output_all[0]

    dataAF.at[i, "RMSD_domain"] =  RMSD_domain
    dataAF.at[i, "align_score_domain"] =  align_score_domain
    dataAF.at[i, "num_align_atoms_domain"] =  num_align_atoms_domain
    dataAF.at[i, "num_align_resi_domain"] =  num_align_resi_domain

    if "random" not in benchmark_set:
        dataAF.at[i, "RMSD_backbone_peptide"] =  RMSD_backbone_peptide
        dataAF.at[i, "RMSD_all_atom_peptide"] =  RMSD_all_atom_peptide

    dataAF.at[i, "RMSD_all_atom"] =  RMSD_all_atoms
        
display(dataAF)

DEG_APCC_KENBOX_2_M15_M39_D1_D499 (known_extension)
DEG_APCC_KENBOX_2_M15_M39_Dmin (known_extension)
DEG_APCC_KENBOX_2_M1_M189_D1_D499 (known_extension)
DEG_APCC_KENBOX_2_M1_M189_Dmin (known_extension)
DEG_APCC_KENBOX_2_M1_M57_D1_D499 (known_extension)
DEG_APCC_KENBOX_2_M1_M57_Dmin (known_extension)
DEG_APCC_KENBOX_2_M20_M34_D1_D499 (known_extension)
DEG_APCC_KENBOX_2_M20_M34_Dmin (known_extension)
DEG_APCC_KENBOX_2_M5_M49_D1_D499 (known_extension)
DEG_APCC_KENBOX_2_M5_M49_Dmin (known_extension)
DEG_APCC_KENBOX_2_MFL_Dmin (known_extension)
DEG_APCC_KENBOX_2_Mmin_D1_D499 (known_extension)
DEG_APCC_KENBOX_2_Mmin_DFL (known_extension)
DEG_Kelch_Keap1_1_M1_M568_D175_D624 (known_extension)
DEG_Kelch_Keap1_1_M1_M568_D295_D624 (known_extension)
DEG_Kelch_Keap1_1_M1_M568_Dmin (known_extension)
DEG_Kelch_Keap1_1_M41_M118_D175_D624 (known_extension)

Unnamed: 0,model_preset,benchmark_set,prediction_name,model_id,model_path,ranking_score,chainA_length,chainB_length,fraction_disordered,has_clash,...,ELM_instance_random_paired,DDI_pfam_id_random_paired,PDB_id,align_score_domain,num_align_atoms_domain,num_align_resi_domain,RMSD_all_atom,RMSD_domain,RMSD_backbone_peptide,RMSD_all_atom_peptide
2710,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_0,AlphaFold_benchmark_DMI/known_extension/nostal...,1.0,,,,,...,,,4GGD,1681.0,2414,312,0.779362,0.779362,32.487099,33.511379
2711,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_1,AlphaFold_benchmark_DMI/known_extension/nostal...,0.9,,,,,...,,,4GGD,1681.0,2414,312,0.850033,0.850033,32.519211,33.687027
2712,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_2,AlphaFold_benchmark_DMI/known_extension/nostal...,0.8,,,,,...,,,4GGD,1681.0,2414,312,0.802006,0.802006,0.850706,1.085235
2713,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_3,AlphaFold_benchmark_DMI/known_extension/nostal...,0.7,,,,,...,,,4GGD,1681.0,2414,312,0.846242,0.846242,33.459137,34.722588
2714,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_4,AlphaFold_benchmark_DMI/known_extension/nostal...,0.6,,,,,...,,,4GGD,1681.0,2414,312,0.806754,0.806754,32.800415,34.047897
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5525,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_0,AlphaFold_benchmark_DMI/known_extension/sponta...,1.0,,,,,...,,,2G30,1224.0,1848,233,0.854878,0.818453,0.820616,1.43463
5526,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_1,AlphaFold_benchmark_DMI/known_extension/sponta...,0.9,,,,,...,,,2G30,1224.0,1848,233,0.879837,0.854961,0.626592,1.300052
5527,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_2,AlphaFold_benchmark_DMI/known_extension/sponta...,0.8,,,,,...,,,2G30,1224.0,1848,233,0.883954,0.857443,0.783449,1.336963
5528,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_3,AlphaFold_benchmark_DMI/known_extension/sponta...,0.7,,,,,...,,,2G30,1224.0,1848,233,0.857197,0.815641,0.791843,1.497762


### 4 DockQ
Calculate DockQ metrics for the known_DMI and known_DDI sets using the official package. The following columns are added to the table:
- **DockQ**: DockQ metric
- **iRMSD**: RMSD of interfacial residues
- **LRMSD**: Ligand RMSD. For DMI the ligand is the motif; for DDI it is the smaller domain
- **Fnonnat**: Fraction of predicted contacts that are not native (equivalent to the false discovery rate, FP / (FP + TP))

For more details, see the official GitHub repo: [https://github.com/bjornwallner/DockQ](https://github.com/bjornwallner/DockQ)
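For orientation, the DockQ score combines the two RMSD values listed above with the fraction of native contacts (Fnat), which the DockQ package reports but this notebook does not store. The sketch below follows the published DockQ definition with the scaling constants 1.5 Å for iRMSD and 8.5 Å for LRMSD. The function name is hypothetical and the snippet is meant as an illustration, not as a replacement for the package.

```python
def dockq_score(fnat: float, irmsd: float, lrmsd: float,
                d_irmsd: float = 1.5, d_lrmsd: float = 8.5) -> float:
    """Combine Fnat, iRMSD and LRMSD into a DockQ value between 0 and 1."""

    def scaled(rms: float, d: float) -> float:
        # Maps an RMSD in [0, inf) to (0, 1]; equals 0.5 when rms == d
        return 1.0 / (1.0 + (rms / d) ** 2)

    return (fnat + scaled(irmsd, d_irmsd) + scaled(lrmsd, d_lrmsd)) / 3.0
```

A perfect model (Fnat of 1, both RMSD values 0) scores 1, and each term contributes at most one third, so a badly placed ligand cannot be compensated by a well-modelled receptor alone.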

In [16]:
from DockQ.DockQ import load_PDB, run_on_all_native_interfaces

dataAF["DockQ"] = np.nan
dataAF["iRMSD"] = np.nan
dataAF["LRMSD"] = np.nan
dataAF["Fnonnat"] = np.nan
for i, row in dataAF[dataAF["benchmark_set"].isin(["known_extension"])].iterrows():
    benchmark_set = str(row["benchmark_set"])
    _set = "DDI" if "DDI" in benchmark_set else "DMI"
    pdb_id = str(row["PDB_id"]) if row.notnull()["PDB_id"] else None
    pdb_id_2 = None
    ddi_pfam_id = None
    ddi_pfam_id_2 = None
    prediction_name = str(row["prediction_name"]) if row.notnull()["prediction_name"] else None
    model_id = str(row["model_id"]) if row.notnull()["model_id"] else None
    chainA_id = str(row["chainA_id"]) if row.notnull()["chainA_id"] else None
    chainB_id = str(row["chainB_id"]) if row.notnull()["chainB_id"] else None
    chainA_start = int(row["chainA_start"]) if row.notnull()["chainA_start"] else None
    chainB_start = int(row["chainB_start"]) if row.notnull()["chainB_start"] else None
    chainA_end = int(row["chainA_end"]) if row.notnull()["chainA_end"] else None
    chainB_end = int(row["chainB_end"]) if row.notnull()["chainB_end"] else None

    if model_id == "ranked_0":
        print(f"{bcolors.OKBLUE}{prediction_name} ({benchmark_set}){bcolors.ENDC}")

    structure_path = path_resources / "DMI" / benchmark_set / prediction_name / (model_id + ".pdb")
    if not structure_path.exists():
        print(f"\t{bcolors.FAIL}{prediction_name} ({benchmark_set}) does not exist.{bcolors.ENDC} Skip DockQ")
        continue

    template_row = dataSolved.loc[np.logical_and(dataSolved["set"] == _set, np.logical_and(dataSolved["PDB_id"] == pdb_id, np.logical_or(dataSolved["DDI_pfam_id"] == ddi_pfam_id, dataSolved["DDI_pfam_id"].isna())))]
    if len(template_row) == 0:
        print(f"\t{bcolors.FAIL}Can't find template structure for {prediction_name} ({benchmark_set}) and PDB ID {pdb_id}.{bcolors.ENDC} Skip")
        continue
    elif len(template_row) >= 2:
        print(f"\t{bcolors.FAIL}Multiple template structures found for {prediction_name} ({benchmark_set}) and PDB ID {pdb_id}.{bcolors.ENDC} Skip")
        continue
    template_path = path_solved / str(template_row["path"].item())
    dockq_structure_af = load_PDB(str(structure_path))
    dockq_structure_solved = load_PDB(str(template_path))

    chain_map = {chainA_id: "A", chainB_id:"B"}
    chain_key = chainA_id + chainB_id

    result = run_on_all_native_interfaces(dockq_structure_af, dockq_structure_solved, chain_map=chain_map)[0]
    dataAF.at[i, "DockQ"] = result[chain_key]["DockQ"]
    dataAF.at[i, "iRMSD"] = result[chain_key]["iRMSD"]
    dataAF.at[i, "LRMSD"] = result[chain_key]["LRMSD"]
    dataAF.at[i, "Fnonnat"] = np.float64(result[chain_key]["fnonnat"])

display(dataAF)


DEG_APCC_KENBOX_2_M15_M39_D1_D499 (known_extension)
DEG_APCC_KENBOX_2_M15_M39_Dmin (known_extension)
DEG_APCC_KENBOX_2_M1_M189_D1_D499 (known_extension)
DEG_APCC_KENBOX_2_M1_M189_Dmin (known_extension)
DEG_APCC_KENBOX_2_M1_M57_D1_D499 (known_extension)
DEG_APCC_KENBOX_2_M1_M57_Dmin (known_extension)
DEG_APCC_KENBOX_2_M20_M34_D1_D499 (known_extension)
DEG_APCC_KENBOX_2_M20_M34_Dmin (known_extension)
DEG_APCC_KENBOX_2_M5_M49_D1_D499 (known_extension)
DEG_APCC_KENBOX_2_M5_M49_Dmin (known_extension)
DEG_APCC_KENBOX_2_MFL_Dmin (known_extension)
DEG_APCC_KENBOX_2_Mmin_D1_D499 (known_extension)
DEG_APCC_KENBOX_2_Mmin_DFL (known_extension)
DEG_Kelch_Keap1_1_M1_M568_D175_D624 (known_extension)
DEG_Kelch_Keap1_1_M1_M568_D295_D624 (known_extension)
DEG_Kelch_Keap1_1_M1_M568_Dmin (known_extension)
DEG_Kelch_Keap1_1_M41_M118_D175_D624 (known_extension)

Unnamed: 0,model_preset,benchmark_set,prediction_name,model_id,model_path,ranking_score,chainA_length,chainB_length,fraction_disordered,has_clash,...,num_align_atoms_domain,num_align_resi_domain,RMSD_all_atom,RMSD_domain,RMSD_backbone_peptide,RMSD_all_atom_peptide,DockQ,iRMSD,LRMSD,Fnonnat
2710,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_0,AlphaFold_benchmark_DMI/known_extension/nostal...,1.0,,,,,...,2414,312,0.779362,0.779362,32.487099,33.511379,0.030091,9.154708,32.471571,1.000000
2711,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_1,AlphaFold_benchmark_DMI/known_extension/nostal...,0.9,,,,,...,2414,312,0.850033,0.850033,32.519211,33.687027,0.030029,9.165312,32.505639,1.000000
2712,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_2,AlphaFold_benchmark_DMI/known_extension/nostal...,0.8,,,,,...,2414,312,0.802006,0.802006,0.850706,1.085235,0.957491,0.311560,0.821135,0.000000
2713,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_3,AlphaFold_benchmark_DMI/known_extension/nostal...,0.7,,,,,...,2414,312,0.846242,0.846242,33.459137,34.722588,0.028471,9.429981,33.426638,1.000000
2714,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_4,AlphaFold_benchmark_DMI/known_extension/nostal...,0.6,,,,,...,2414,312,0.806754,0.806754,32.800415,34.047897,0.029560,9.242351,32.776340,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5525,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_0,AlphaFold_benchmark_DMI/known_extension/sponta...,1.0,,,,,...,1848,233,0.854878,0.818453,0.820616,1.43463,0.955996,0.370099,0.859152,0.093750
5526,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_1,AlphaFold_benchmark_DMI/known_extension/sponta...,0.9,,,,,...,1848,233,0.879837,0.854961,0.626592,1.300052,0.960844,0.333205,0.656404,0.093750
5527,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_2,AlphaFold_benchmark_DMI/known_extension/sponta...,0.8,,,,,...,1848,233,0.883954,0.857443,0.783449,1.336963,0.959846,0.331209,0.830493,0.093750
5528,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_3,AlphaFold_benchmark_DMI/known_extension/sponta...,0.7,,,,,...,1848,233,0.857197,0.815641,0.791843,1.497762,0.953670,0.396883,0.812161,0.121212


### 5 ipSAE metric
Calculate the new ipSAE metric using the script from the repo ([https://github.com/DunbrackLab/IPSAE/blob/main/ipsae.py](https://github.com/DunbrackLab/IPSAE/blob/main/ipsae.py)). Currently this is only possible for AlphaFold 3, because the files from John no longer include the confidence.json files. Adds the following column:
- **ipSAE** (all structures): The ipSAE metric for the AlphaFold prediction

In [49]:
import os
import shutil
import subprocess
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def calc_ipsae_metric(row: pd.Series):
    path_cif = path_AF_luck_drive / Path(row["model_path"])

    # The confidence files sit next to the model and are named <design>_<sample>_*.json
    sample_dir = path_cif.parent
    design_name = sample_dir.parent.name
    sample_name = sample_dir.name
    base_filename = f"{design_name}_{sample_name}"
    path_confidences = sample_dir / f"{base_filename}_confidences.json"
    path_summary_confidences = sample_dir / f"{base_filename}_summary_confidences.json"

    # ipsae.py writes its results next to the structure file, so work on copies in a temporary directory
    with tempfile.TemporaryDirectory() as tmpdir:
        shutil.copy(path_cif, tmp_path_cif := (Path(tmpdir) / "model.cif"))
        shutil.copy(path_confidences, tmp_path_confidences := (Path(tmpdir) / "confidences.json"))
        shutil.copy(path_summary_confidences, tmp_path_summary_confidences := (Path(tmpdir) / "summary_confidences.json"))
        # The two "10" arguments are the PAE cutoff and the distance cutoff
        subprocess.run(["python", path_ipsae_script, tmp_path_confidences, tmp_path_cif, "10", "10"], env=os.environ.copy())

        path_output = Path(tmpdir) / "model_10_10.txt"

        # Read the result inside the with-block, before the temporary directory is deleted
        df_ipsae = pd.read_csv(path_output, header=0, skiprows=[0], sep=" ", skipinitialspace=True)
    return df_ipsae

# For AF2 the json files do not exist anymore
if af_mode == "AF3":
    dataAF["ipSAE"] = np.nan
    for i, row in dataAF.iterrows():
        if row["model_id"] == "ranked_0":
            # i is the index label, so this percentage is only accurate for a default 0..n-1 index
            print(row["prediction_name"], f"({round(100*i/len(dataAF))} %)")
        df_ipsae = calc_ipsae_metric(row)
        # Take the third row (row index 2) of the ipsae.py result table
        dataAF.at[i, "ipSAE"] = np.float64(df_ipsae["ipSAE"][2])
    # Only display the column if it was created above
    display(dataAF["ipSAE"])

DEG_APCC_KENBOX_2_M15_M39_D1_D499 (96 %)
DEG_APCC_KENBOX_2_M15_M39_Dmin (96 %)
DEG_APCC_KENBOX_2_M1_M189_D1_D499 (96 %)
DEG_APCC_KENBOX_2_M1_M189_Dmin (97 %)
DEG_APCC_KENBOX_2_M1_M57_D1_D499 (97 %)
DEG_APCC_KENBOX_2_M1_M57_Dmin (97 %)
DEG_APCC_KENBOX_2_M20_M34_D1_D499 (97 %)
DEG_APCC_KENBOX_2_M20_M34_Dmin (97 %)
DEG_APCC_KENBOX_2_M5_M49_D1_D499 (98 %)
DEG_APCC_KENBOX_2_M5_M49_Dmin (98 %)
DEG_APCC_KENBOX_2_MFL_Dmin (98 %)
DEG_APCC_KENBOX_2_Mmin_D1_D499 (98 %)
DEG_APCC_KENBOX_2_Mmin_DFL (98 %)
DEG_Kelch_Keap1_1_M1_M568_D175_D624 (98 %)
DEG_Kelch_Keap1_1_M1_M568_D295_D624 (99 %)
DEG_Kelch_Keap1_1_M1_M568_Dmin (99 %)
DEG_Kelch_Keap1_1_M41_M118_D175_D624 (99 %)
DEG_Kelch_Keap1_1_M41_M118_D295_D624 (99 %)
DEG_Kelch_Keap1_1_M41_M118_Dmin (99 %)
DEG_Kelch_Keap1_1_M53_M106_D175_D624 (99 %)
DEG_Kelch_Keap1_1_M53_M106_D295_D624 (100 %)
DEG_Kelch_Keap1_1_M53_M106_Dmin (100 %)
DEG_Kelch_Keap1_1_M65_M94_D175_D624 (100 %)
DEG_Kelch_Keap1_1_M65_M94_D295_D624 (100 %)
DEG_Kelch_Keap1_1_M65_M94_Dmin (100

Traceback (most recent call last):
  File "/Users/imb/Desktop/ipsae.py", line 416, in <module>
    data = json.load(file)
  File "/opt/miniconda3/envs/pymol_env/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/opt/miniconda3/envs/pymol_env/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 25165832: invalid continuation byte


EmptyDataError: No columns to parse from file
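In the output above, the traceback is printed by the child process, while the notebook itself only fails later when `pd.read_csv` finds nothing to parse. One way to surface such failures at the point where they happen is to check the exit code of the subprocess. The helper below is a minimal sketch; `run_checked` is a hypothetical name and not part of this notebook or of ipsae.py.

```python
import subprocess
import sys


def run_checked(command: list) -> str:
    """Run a command and return its stdout; raise with stderr attached on failure."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode != 0:
        # Include the child's stderr so the original traceback is not lost
        raise RuntimeError(
            f"Command failed with exit code {completed.returncode}:\n{completed.stderr}"
        )
    return completed.stdout
```

Used in place of the bare `subprocess.run` call, a prediction with an unreadable confidence file would then raise immediately, and the loop could catch the `RuntimeError`, log the prediction name, and continue with the next model.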

In [51]:
print(tempfile.TemporaryDirectory())

<TemporaryDirectory '/var/folders/y7/rw09j59914xbqkvqvgq68_1c0000gn/T/tmpwe2zj6hd'>


### 6 Interaction metrics
Calculate the interaction interface metrics with the newly developed library. Adds the following columns to all structures with an experimentally solved counterpart (None otherwise):
- **min_distance** [Å]: Distance between the C_alpha atoms in the interface, evaluated for residue pairs in which at least one atom pair is closer than 15 Å
- **buried_area** [Å²]: Buried surface area of the interface
- **disulfide_bonds** [count]: Number of disulfide bonds in the interface
- **salt_bridges** [count]: Number of salt bridges in the interface
- **hbonds** [count]: Number of hydrogen bonds in the interface
- **hydrophobic_interactions** [count]: Number of hydrophobic interactions in the interface

For a detailed explanation, see either the library file (measure_PPI.py) or the bachelor thesis (available on the group drive).

In [17]:
libpath = Path("../Interface Metrics").resolve()
print(libpath)
sys.path.insert(0, str(libpath))
import measure_PPI

/Users/imb/Interface Metrics


ModuleNotFoundError: No module named 'measure_PPI'

In [8]:
pathObj = []
for i, row in dataAF.iterrows():
    benchmark_set = str(row["benchmark_set"])
    _set = "DDI" if "DDI" in benchmark_set else "DMI"
    prediction_name = str(row["prediction_name"]) if row.notnull()["prediction_name"] else None
    model_id = str(row["model_id"]) if row.notnull()["model_id"] else None

    structure_path = path_AF / _set / benchmark_set / prediction_name / (model_id + ".pdb")
    if not structure_path.exists():
        if row["model_id"] == "ranked_0":
            print(f"\t{bcolors.FAIL}{prediction_name} ({benchmark_set}) does not exist.{bcolors.ENDC} Skip interface metrics")
        continue

    pathObj.append((structure_path.resolve(), prediction_name))
df_intf_metrics = measure_PPI.Run(pathObj=pathObj, num_threads=12)

	PF07724_PF00227_1OFH_C_resi39_resi340.H_resi1_resi172 (known_DDI) does not exist. Skip interface metrics
	PF14978_PF00327_3J7Y_o_resi13_resi101.Z_resi57_resi127 (known_DDI) does not exist. Skip interface metrics
[2025-05-15 19:06:35,957 | measure_PPI | INFO] Started Taskpool of 12 processes for 3170 files
[2025-05-15 19:06:41,001 | measure_PPI | INFO] 2% - ETA 0:03:43 | current speed 14.077 s⁻¹ | average speed 13.878 s⁻¹
[2025-05-15 19:06:46,038 | measure_PPI | INFO] 5% - ETA 0:02:40 | current speed 23.227 s⁻¹ | average speed 18.55 s⁻¹
[2025-05-15 19:06:51,048 | measure_PPI | INFO] 11% - ETA 0:01:53 | current speed 36.922 s⁻¹ | average speed 24.649 s⁻¹
[2025-05-15 19:06:56,112 | measure_PPI | INFO] 17% - ETA 0:01:37 | current speed 34.174 s⁻¹ | average speed 27.042 s⁻¹
[2025-05-15 19:07:01,129 | measure_PPI | INFO] 22% - ETA 0:01:27 | current speed 33.074 s⁻¹ | average speed 28.245 s⁻¹
[2025-05-15 19:07:06,139 | measure_PPI | INFO] 27% - ETA 0:01:20 | current speed 3

In [9]:
display(df_intf_metrics)

Unnamed: 0,structure_name,file,hbonds,salt_bridges,buried_area,min_distance,hydrophobic_interactions,disulfide_bonds
2924,D1PF00009_PF01873_2D74.D2PF00026_PF06394_1F34,ranked_0.pdb,1,0,1692.652,4.191,86,0
2925,D1PF00009_PF01873_2D74.D2PF00026_PF06394_1F34,ranked_1.pdb,3,0,1708.317,3.670,96,0
2933,D1PF00009_PF01873_2D74.D2PF00026_PF06394_1F34,ranked_2.pdb,1,1,1907.898,4.643,79,0
2932,D1PF00009_PF01873_2D74.D2PF00026_PF06394_1F34,ranked_3.pdb,5,0,1577.972,3.718,107,0
2934,D1PF00009_PF01873_2D74.D2PF00026_PF06394_1F34,ranked_4.pdb,15,1,2139.539,3.955,101,0
...,...,...,...,...,...,...,...,...
2695,TRG_PTS1_2C0L_NAKL.NAKD,ranked_0.pdb,10,3,839.162,4.782,11,0
2698,TRG_PTS1_2C0L_NAKL.NAKD,ranked_1.pdb,8,3,862.629,4.900,10,0
2696,TRG_PTS1_2C0L_NAKL.NAKD,ranked_2.pdb,12,2,861.115,4.704,9,0
2699,TRG_PTS1_2C0L_NAKL.NAKD,ranked_3.pdb,11,2,854.038,4.943,9,0


In [11]:
dataAF["min_distance"] = None
dataAF["buried_area"] = None
dataAF["disulfide_bonds"] = None
dataAF["salt_bridges"] = None
dataAF["hbonds"] = None
dataAF["hydrophobic_interactions"] = None

for i, row in dataAF.iterrows():
    row_intf = df_intf_metrics[(df_intf_metrics["structure_name"] == row["prediction_name"]) & (df_intf_metrics["file"] == row["model_id"] + ".pdb")]
    if len(row_intf) != 1:
        print(f"\t{bcolors.FAIL}Failed to locate {row['prediction_name']} {row['model_id']}{bcolors.ENDC}")
        continue

    dataAF.at[i, "buried_area"] = row_intf["buried_area"].item()
    dataAF.at[i, "min_distance"] = row_intf["min_distance"].item()
    dataAF.at[i, "salt_bridges"] = row_intf["salt_bridges"].item()
    dataAF.at[i, "hbonds"] = row_intf["hbonds"].item()
    dataAF.at[i, "hydrophobic_interactions"] = row_intf["hydrophobic_interactions"].item()
    dataAF.at[i, "disulfide_bonds"] = row_intf["disulfide_bonds"].item()
display(dataAF)

	Failed to locate PF07724_PF00227_1OFH_C_resi39_resi340.H_resi1_resi172 ranked_0
	Failed to locate PF07724_PF00227_1OFH_C_resi39_resi340.H_resi1_resi172 ranked_1
	Failed to locate PF07724_PF00227_1OFH_C_resi39_resi340.H_resi1_resi172 ranked_2
	Failed to locate PF07724_PF00227_1OFH_C_resi39_resi340.H_resi1_resi172 ranked_3
	Failed to locate PF07724_PF00227_1OFH_C_resi39_resi340.H_resi1_resi172 ranked_4
	Failed to locate PF14978_PF00327_3J7Y_o_resi13_resi101.Z_resi57_resi127 ranked_0
	Failed to locate PF14978_PF00327_3J7Y_o_resi13_resi101.Z_resi57_resi127 ranked_1
	Failed to locate PF14978_PF00327_3J7Y_o_resi13_resi101.Z_resi57_resi127 ranked_2
	Failed to locate PF14978_PF00327_3J7Y_o_resi13_resi101.Z_resi57_resi127 ranked_3
	Failed to locate PF14978_PF00327_3J7Y_o_resi13_resi101.Z_resi57_resi127 ranked_4


Unnamed: 0,project_name,run_id,benchmark_set,prediction_name,model_id,chainA_length,chainB_length,chainA_id,chainB_id,chainA_start,...,DockQ,iRMSD,LRMSD,Fnonnat,buried_area,min_distance,disulfide_bonds,salt_bridges,hbonds,hydrophobic_interactions
0,AlphaFold_benchmark,run37,known_DMI,DEG_APCC_KENBOX_2_4GGD,ranked_0,312,5,A,B,165,...,0.878344,0.603831,1.575394,0.086957,613.651,6.063,0,0,9,0
1,AlphaFold_benchmark,run37,known_DMI,DEG_APCC_KENBOX_2_4GGD,ranked_1,312,5,A,B,165,...,0.880716,0.418230,1.100588,0.050000,580.31,6.083,0,0,9,0
2,AlphaFold_benchmark,run37,known_DMI,DEG_APCC_KENBOX_2_4GGD,ranked_2,312,5,A,B,165,...,0.883186,0.641834,1.776257,0.185185,662.104,6.072,0,0,10,3
3,AlphaFold_benchmark,run37,known_DMI,DEG_APCC_KENBOX_2_4GGD,ranked_3,312,5,A,B,165,...,0.475511,1.686332,5.358800,0.363636,398.498,5.417,0,0,2,0
4,AlphaFold_benchmark,run37,known_DMI,DEG_APCC_KENBOX_2_4GGD,ranked_4,312,5,A,B,165,...,0.223400,2.928606,9.908745,0.888889,323.304,5.092,0,0,2,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3175,AlphaFold_benchmark_DDI,run6,random_DDI,D1PF18773_PF00071_2X19.D2PF00009_PF01873_2D74,ranked_0,60,113,B,B,392,...,,,,,1617.382,5.591,0,3,7,56
3176,AlphaFold_benchmark_DDI,run6,random_DDI,D1PF18773_PF00071_2X19.D2PF00009_PF01873_2D74,ranked_1,60,113,B,B,392,...,,,,,791.256,6.373,0,0,3,7
3177,AlphaFold_benchmark_DDI,run6,random_DDI,D1PF18773_PF00071_2X19.D2PF00009_PF01873_2D74,ranked_2,60,113,B,B,392,...,,,,,882.547,7.906,0,1,2,11
3178,AlphaFold_benchmark_DDI,run6,random_DDI,D1PF18773_PF00071_2X19.D2PF00009_PF01873_2D74,ranked_3,60,113,B,B,392,...,,,,,1020.896,4.628,0,3,7,44
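The row-by-row lookup in cell In [11] can also be written as a single left merge on the prediction name and model ID. The following is a sketch under the assumption that the interface table has the columns shown in the In [9] output (`structure_name`, `file`, and the six metric columns); `attach_interface_metrics` is a hypothetical helper name. Unlike the loop, unmatched rows end up as NaN rather than None, and duplicated keys in the interface table raise an error instead of being skipped.

```python
import pandas as pd

METRIC_COLUMNS = ["buried_area", "min_distance", "disulfide_bonds",
                  "salt_bridges", "hbonds", "hydrophobic_interactions"]


def attach_interface_metrics(data_af: pd.DataFrame, df_metrics: pd.DataFrame) -> pd.DataFrame:
    """Left-join the interface metrics onto the AlphaFold table."""
    right = df_metrics.rename(columns={"structure_name": "prediction_name"}).copy()
    # "ranked_0.pdb" -> "ranked_0", so the key matches the model_id column
    right["model_id"] = right["file"].str.replace(r"\.pdb$", "", regex=True)
    right = right[["prediction_name", "model_id"] + METRIC_COLUMNS]

    # Drop metric columns from a previous run so the merge does not create _x/_y duplicates
    base = data_af.drop(columns=METRIC_COLUMNS, errors="ignore")
    merged = base.merge(right, on=["prediction_name", "model_id"],
                        how="left", validate="many_to_one")
    # A validated left merge keeps the row order and count, so the original index can be restored
    merged.index = base.index
    return merged
```

With this helper, the cell would reduce to `dataAF = attach_interface_metrics(dataAF, df_intf_metrics)`, and missing structures can be listed afterwards with `dataAF[dataAF["buried_area"].isna()]`.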


### Export table

In [52]:
# Inspect
enhance_dataframe()
dataAF

Column num_mutations not (yet) in data frame
Column hbonds not (yet) in data frame
Column salt_bridges not (yet) in data frame
Column hydrophobic_interactions not (yet) in data frame
Column disulfide_bonds not (yet) in data frame
Column sequence_initial not (yet) in data frame
Column sequence_mutated not (yet) in data frame
Column num_mutations not (yet) in data frame
Column buried_area not (yet) in data frame
Column min_distance not (yet) in data frame
Column disulfide_bonds not (yet) in data frame
Column salt_bridges not (yet) in data frame
Column hbonds not (yet) in data frame
Column hydrophobic_interactions not (yet) in data frame


Unnamed: 0,model_preset,benchmark_set,prediction_name,model_id,model_path,ranking_score,chainA_length,chainB_length,chainA_id,chainB_id,...,RMSD_all_atom_peptide,RMSD_all_atom,iPAE,pDockQ,chains_flipped,DockQ,iRMSD,LRMSD,Fnonnat,ipSAE
2710,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_0,AlphaFold_benchmark_DMI/known_extension/nostal...,1.0,,,A,B,...,33.511379,0.779362,,,False,0.030091,9.154708,32.471571,1.000000,0.680212
2711,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_1,AlphaFold_benchmark_DMI/known_extension/nostal...,0.9,,,A,B,...,33.687027,0.850033,,,False,0.030029,9.165312,32.505639,1.000000,0.573150
2712,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_2,AlphaFold_benchmark_DMI/known_extension/nostal...,0.8,,,A,B,...,1.085235,0.802006,,,False,0.957491,0.311560,0.821135,0.000000,0.794550
2713,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_3,AlphaFold_benchmark_DMI/known_extension/nostal...,0.7,,,A,B,...,34.722588,0.846242,,,False,0.028471,9.429981,33.426638,1.000000,0.638900
2714,,known_extension,DEG_APCC_KENBOX_2_M15_M39_D1_D499,ranked_4,AlphaFold_benchmark_DMI/known_extension/nostal...,0.6,,,A,B,...,34.047897,0.806754,,,False,0.029560,9.242351,32.776340,1.000000,0.656448
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5525,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_0,AlphaFold_benchmark_DMI/known_extension/sponta...,1.0,,,A,B,...,1.43463,0.854878,,,False,0.955996,0.370099,0.859152,0.093750,
5526,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_1,AlphaFold_benchmark_DMI/known_extension/sponta...,0.9,,,A,B,...,1.300052,0.879837,,,False,0.960844,0.333205,0.656404,0.093750,
5527,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_2,AlphaFold_benchmark_DMI/known_extension/sponta...,0.8,,,A,B,...,1.336963,0.883954,,,False,0.959846,0.331209,0.830493,0.093750,
5528,,known_extension,TRG_AP2beta_CARGO_1_Mmin_DFL,ranked_3,AlphaFold_benchmark_DMI/known_extension/sponta...,0.7,,,A,B,...,1.497762,0.857197,,,False,0.953670,0.396883,0.812161,0.121212,


In [54]:
# Save
enhance_dataframe()
# Write the table next to the predictions, without the pandas row index
dataAF.to_csv(path_AF / (path_AF.name + "_metrics.tsv"), sep="\t", index=False)
dataAF.to_excel(path_AF / (path_AF.name + "_metrics.xlsx"), sheet_name=f"{af_mode} metrics", index=False)

Column [91mnum_mutations[0m not (yet) in data frame
Column [91mhbonds[0m not (yet) in data frame
Column [91msalt_bridges[0m not (yet) in data frame
Column [91mhydrophobic_interactions[0m not (yet) in data frame
Column [91mdisulfide_bonds[0m not (yet) in data frame
Column [91msequence_initial[0m not (yet) in data frame
Column [91msequence_mutated[0m not (yet) in data frame
Column [91mnum_mutations[0m not (yet) in data frame
Column [91mburied_area[0m not (yet) in data frame
Column [91mmin_distance[0m not (yet) in data frame
Column [91mdisulfide_bonds[0m not (yet) in data frame
Column [91msalt_bridges[0m not (yet) in data frame
Column [91mhbonds[0m not (yet) in data frame
Column [91mhydrophobic_interactions[0m not (yet) in data frame
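
The messages above report metric columns that have not been computed yet. To run the same kind of check on an exported file, the header of the `_metrics.tsv` can be inspected directly. The following is only a minimal sketch: the helper name `missing_columns` is hypothetical and not part of this notebook, and it assumes the file is tab-separated with a single header row, as written by the `to_csv(..., sep="\t")` call above.

```python
import csv
from pathlib import Path


def missing_columns(tsv_path, expected):
    """Return the names in `expected` that are absent from the TSV header.

    Hypothetical helper for illustration; it is not defined in the notebook.
    """
    # Read only the first row, which holds the column names.
    with Path(tsv_path).open(newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    present = set(header)

    # Keep the order of `expected` and report each missing name only once.
    missing = []
    for name in expected:
        if name not in present and name not in missing:
            missing.append(name)
    return missing
```

In the notebook this could be called as `missing_columns(path_AF / (path_AF.name + "_metrics.tsv"), ["DockQ", "ipSAE", "hbonds", "buried_area"])`; an empty list would mean that every listed metric column was exported.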
